LLM Evaluation Checklist
Metrics, test sets, and guardrails for shipping reliable LLM features.
Scope & planning
Decide what to evaluate before choosing metrics.
- Identify whether the use case is single-turn (e.g. a RAG Q&A endpoint) or multi-turn (e.g. a conversational agent).
- Map system architecture: RAG, agentic, foundational model, or hybrid.
- Keep total metrics to ≤ 5: 1–2 custom (use-case) + 2–3 generic (system).
- Define a minimum passing threshold for each metric so you can gate releases.
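The gating step above can be sketched in a few lines. The metric names and threshold values here are illustrative, not tied to any particular framework:

```python
# Hypothetical minimum passing thresholds: 1-2 custom (use-case) metrics
# plus 2-3 generic (system) metrics, each with a release gate.
THRESHOLDS = {
    "answer_relevancy": 0.80,  # generic (system)
    "faithfulness": 0.90,      # generic (system)
    "brand_voice": 0.75,       # custom (use-case)
}

def gate_release(scores: dict) -> tuple[bool, list]:
    """Compare metric scores to minimum thresholds.

    Returns (passed, failures) where failures lists
    (metric, actual_score, required_minimum) for every metric below its bar.
    """
    failures = [
        (name, scores.get(name, 0.0), minimum)
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failures, failures)
```

A CI job can then fail the build whenever `gate_release` reports failures.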
Metric quality
Good metrics are quantitative, reliable, and accurate.
- Ensure every metric produces a numeric score (not just pass/fail).
- Prefer LLM-as-a-judge scorers over BLEU/ROUGE for semantic nuance.
- Validate that scores align with human judgment (e.g. correlation studies or spot checks).
- Avoid over-evaluating: when you evaluate everything, you evaluate nothing.
Choosing scorers
Match the scorer to the type of criteria.
- Use G-Eval for subjective criteria (coherence, helpfulness, tone, brand voice).
- Use DAG (decision tree + LLM judge) when success criteria are clear and discrete.
- Use QAG (claim extraction + closed questions) for faithfulness and factuality.
- Use exact-match or logic only where appropriate (e.g. tool correctness).
- Skip purely statistical scorers (BLEU, ROUGE, METEOR) for semantic evaluation.
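The QAG pattern above reduces to: extract claims from the output, ask a closed yes/no question about each against the context, and score the fraction supported. A minimal sketch, where `verify` stands in for the LLM or NLI call and the substring verifier is only a toy stand-in:

```python
def qag_faithfulness(claims: list, context: str, verify) -> float:
    """QAG-style faithfulness: fraction of extracted claims the verifier
    answers 'yes' to against the retrieval context.

    `verify(claim, context) -> bool` stands in for an LLM judge or NLI
    model answering a closed question.
    """
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if verify(claim, context))
    return supported / len(claims)

# Toy stand-in verifier (substring containment); a real system would call
# an LLM judge or NLI model here.
def naive_verify(claim: str, context: str) -> bool:
    return claim.lower() in context.lower()
```

The closed-question structure is what makes QAG scores reliable: the judge answers yes/no per claim instead of rating the whole output at once.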
Core metrics to consider
Essential dimensions before adding task-specific metrics.
- Answer relevancy: output addresses the input in an informative, concise way.
- Correctness: output is factually correct against ground truth where available.
- Hallucination / faithfulness: no fake or unsupported information.
- Contextual relevancy (RAG): retriever returns context relevant to the query.
- Responsible metrics: bias and toxicity where the application touches users.
RAG pipelines
Retriever + generator; evaluate both.
- Faithfulness: generator output aligns with retrieval context.
- Answer relevancy: generator output is concise and on-topic.
- Contextual relevancy: proportion of retrieved sentences relevant to the input.
- Contextual precision & recall (if you have expected output): ranking and coverage of needed info.
- For multi-turn RAG: use turn-level variants (turn faithfulness, turn relevancy, etc.).
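Contextual relevancy as described above is a simple proportion. In this sketch, `is_relevant` stands in for an LLM judge call; the keyword-overlap judge is only a toy stand-in for testing:

```python
def contextual_relevancy(retrieved_sentences: list, query: str, is_relevant) -> float:
    """Fraction of retrieved sentences judged relevant to the query.

    `is_relevant(sentence, query) -> bool` stands in for an LLM judge.
    """
    if not retrieved_sentences:
        return 0.0
    hits = sum(1 for s in retrieved_sentences if is_relevant(s, query))
    return hits / len(retrieved_sentences)

# Toy keyword-overlap judge, a stand-in for a real LLM judge.
def keyword_judge(sentence: str, query: str) -> bool:
    return bool(set(sentence.lower().split()) & set(query.lower().split()))
```

A low score here points at the retriever (chunking, embedding model, top-k), not the generator.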
AI agents
Task completion, tools, and plans.
- Task completion: agent accomplishes the given task (single-turn or multi-turn).
- Tool correctness: agent calls the right tools for the task.
- Argument correctness: tool arguments make sense for the input.
- Plan quality: plans are complete, logical, and efficient.
- Plan adherence: agent follows the plan it created.
- Step efficiency: no unnecessary steps in the execution trace.
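Tool correctness is one of the few agent metrics that can be scored with plain logic, no LLM judge needed. A minimal order-insensitive sketch (tool names are illustrative):

```python
def tool_correctness(expected_tools: list, actual_tools: list) -> float:
    """Fraction of expected tool calls the agent actually made.

    Order-insensitive; extra calls are ignored here (step efficiency
    would penalize those separately).
    """
    if not expected_tools:
        return 1.0
    made = set(actual_tools)
    return sum(1 for tool in expected_tools if tool in made) / len(expected_tools)
```

Pair this with a step-efficiency check (e.g. comparing trace length to an expected minimum) to catch agents that call the right tools but loop redundantly.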
Foundational models & fine-tuning
Evaluating the LLM itself, not the full system.
- Hallucination: use SelfCheckGPT-style sampling or NLI against context when ground truth is unavailable.
- Toxicity: off-the-shelf detectors or G-Eval with a clear rubric.
- Bias: G-Eval with explicit criteria; consider region/culture and clear rubrics.
- Prompt alignment: instructions in the prompt template are followed in the output.
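The SelfCheckGPT idea mentioned above is to resample the model several times and treat sentences that the samples fail to reproduce as likely hallucinations. A sketch, where `supports` stands in for the NLI or LLM consistency check:

```python
def selfcheck_hallucination(answer_sentences: list, sampled_responses: list, supports) -> dict:
    """SelfCheckGPT-style scoring: for each sentence in the answer,
    score = fraction of resampled responses that do NOT support it.

    Higher scores mean the sentence is more likely hallucinated.
    `supports(sentence, response) -> bool` stands in for an NLI model
    or LLM consistency check.
    """
    scores = {}
    for sentence in answer_sentences:
        contradictions = sum(1 for resp in sampled_responses if not supports(sentence, resp))
        scores[sentence] = contradictions / len(sampled_responses)
    return scores
```

The appeal for foundational-model evaluation is that no ground truth is required, only extra samples from the same model.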
Task-specific & custom metrics
At least one metric tailored to your use case.
- Summarization: factual alignment with source + inclusion of important information (e.g. QAG).
- Helpfulness, brand voice, or format: use G-Eval with clear criteria.
- Structured or formatted output: use DAG for deterministic checks (e.g. headings, order).
- Document custom criteria and rubrics so others can reproduce scores.
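A deterministic DAG node for formatted output can be plain string logic. This sketch checks that required headings appear in the required order; the heading names are illustrative:

```python
def check_heading_order(text: str, required_headings: list):
    """Deterministic format check: every required heading must appear,
    and in the given order. Returns (passed, first_problem).
    """
    pos = -1
    for heading in required_headings:
        i = text.find(heading)
        if i == -1:
            return False, f"missing heading: {heading}"
        if i < pos:
            return False, f"out of order: {heading}"
        pos = i
    return True, None
```

In a DAG scorer, a node like this gates cheaply before any LLM-judge nodes run, so malformed outputs fail fast and reproducibly.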
Implementation & ops
Make evaluation part of the pipeline.
- Create or curate a test set that reflects real inputs and edge cases.
- Establish a baseline and track metric scores over time.
- Integrate evals into CI where possible (e.g. regression on key metrics).
- Add production monitoring for the same dimensions you eval offline.
- Provide eval and guardrail documentation to stakeholders.
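The baseline-plus-CI items above can be wired together with a small regression gate. The 0.02 tolerance is an illustrative assumption; tune it per metric:

```python
def regression_check(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """CI regression gate: flag every metric whose current score drops
    more than `tolerance` below its stored baseline.

    Returns {metric: (baseline_score, current_score)} for regressions;
    an empty dict means the run passes.
    """
    return {
        name: (baseline[name], current.get(name, 0.0))
        for name in baseline
        if current.get(name, 0.0) < baseline[name] - tolerance
    }
```

Failing the pipeline on a non-empty result keeps metric drift visible long after the initial launch.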