LLM Evaluation Checklist
Metrics, test sets, and guardrails for shipping reliable LLM features.
Scope & planning
Decide what to evaluate before choosing metrics.
- Identify whether the use case is single-turn (e.g. a RAG Q&A endpoint) or multi-turn (e.g. a conversational agent).
- Map system architecture: RAG, agentic, foundational model, or hybrid.
- Keep total metrics to ≤ 5: 1–2 custom (use-case) + 2–3 generic (system).
- Define a minimum passing threshold for each metric so you can gate releases.
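The gating step above can be sketched in a few lines. The metric names and threshold values here are illustrative, not tied to any particular framework:

```python
# Hypothetical minimum passing thresholds: 1-2 custom (use-case) metrics
# plus 2-3 generic (system) metrics, each with a release gate.
THRESHOLDS = {
    "answer_relevancy": 0.80,  # generic (system)
    "faithfulness": 0.90,      # generic (system)
    "brand_voice": 0.75,       # custom (use-case)
}

def gate_release(scores: dict) -> tuple[bool, list]:
    """Compare metric scores to minimum thresholds.

    Returns (passed, failures) where failures lists
    (metric, actual_score, required_minimum) for every metric below its bar.
    """
    failures = [
        (name, scores.get(name, 0.0), minimum)
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failures, failures)
```

A CI job can then fail the build whenever `gate_release` reports failures.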
Metric quality
Good metrics are quantitative, reliable, and accurate.
- Ensure every metric produces a numeric score (not just pass/fail).
- Prefer LLM-as-a-judge scorers over BLEU/ROUGE for semantic nuance.
- Validate that scores align with human judgment (e.g. correlation studies or spot checks).
- Avoid over-evaluating: when you evaluate everything, you evaluate nothing.
Choosing scorers
Match the scorer to the type of criteria.
- Use G-Eval for subjective criteria (coherence, helpfulness, tone, brand voice).
- Use DAG (decision tree + LLM judge) when success criteria are clear and discrete.
- Use QAG (claim extraction + closed questions) for faithfulness and factuality.
- Use exact-match or logic only where appropriate (e.g. tool correctness).
- Skip purely statistical scorers (BLEU, ROUGE, METEOR) for semantic evaluation.
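The QAG pattern above reduces to: extract claims from the output, ask a closed yes/no question about each against the context, and score the fraction supported. A minimal sketch, where `verify` stands in for the LLM or NLI call and the substring verifier is only a toy stand-in:

```python
def qag_faithfulness(claims: list, context: str, verify) -> float:
    """QAG-style faithfulness: fraction of extracted claims the verifier
    answers 'yes' to against the retrieval context.

    `verify(claim, context) -> bool` stands in for an LLM judge or NLI
    model answering a closed question.
    """
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if verify(claim, context))
    return supported / len(claims)

# Toy stand-in verifier (substring containment); a real system would call
# an LLM judge or NLI model here.
def naive_verify(claim: str, context: str) -> bool:
    return claim.lower() in context.lower()
```

The closed-question structure is what makes QAG scores reliable: the judge answers yes/no per claim instead of rating the whole output at once.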
Core metrics to consider
Essential dimensions before adding task-specific metrics.
- Answer relevancy: output addresses the input in an informative, concise way.
- Correctness: output is factually correct against ground truth where available.
- Hallucination / faithfulness: no fake or unsupported information.
- Contextual relevancy (RAG): retriever returns context relevant to the query.
- Responsible metrics: bias and toxicity where the application touches users.
RAG pipelines
Retriever + generator; evaluate both.
- Faithfulness: generator output aligns with retrieval context.
- Answer relevancy: generator output is concise and on-topic.
- Contextual relevancy: proportion of retrieved sentences relevant to the input.
- Contextual precision & recall (if you have expected output): ranking and coverage of needed info.
- For multi-turn RAG: use turn-level variants (turn faithfulness, turn relevancy, etc.).
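Contextual relevancy as described above is a simple proportion. In this sketch, `is_relevant` stands in for an LLM judge call; the keyword-overlap judge is only a toy stand-in for testing:

```python
def contextual_relevancy(retrieved_sentences: list, query: str, is_relevant) -> float:
    """Fraction of retrieved sentences judged relevant to the query.

    `is_relevant(sentence, query) -> bool` stands in for an LLM judge.
    """
    if not retrieved_sentences:
        return 0.0
    hits = sum(1 for s in retrieved_sentences if is_relevant(s, query))
    return hits / len(retrieved_sentences)

# Toy keyword-overlap judge, a stand-in for a real LLM judge.
def keyword_judge(sentence: str, query: str) -> bool:
    return bool(set(sentence.lower().split()) & set(query.lower().split()))
```

A low score here points at the retriever (chunking, embedding model, top-k), not the generator.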
AI agents
Task completion, tools, and plans.
- Task completion: agent accomplishes the given task (single-turn or multi-turn).
- Tool correctness: agent calls the right tools for the task.
- Argument correctness: tool arguments make sense for the input.
- Plan quality: plans are complete, logical, and efficient.
- Plan adherence: agent follows the plan it created.
- Step efficiency: no unnecessary steps in the execution trace.
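Tool correctness is one of the few agent metrics that can be scored with plain logic, no LLM judge needed. A minimal order-insensitive sketch (tool names are illustrative):

```python
def tool_correctness(expected_tools: list, actual_tools: list) -> float:
    """Fraction of expected tool calls the agent actually made.

    Order-insensitive; extra calls are ignored here (step efficiency
    would penalize those separately).
    """
    if not expected_tools:
        return 1.0
    made = set(actual_tools)
    return sum(1 for tool in expected_tools if tool in made) / len(expected_tools)
```

Pair this with a step-efficiency check (e.g. comparing trace length to an expected minimum) to catch agents that call the right tools but loop redundantly.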
Foundational models & fine-tuning
Evaluating the LLM itself, not the full system.
- Hallucination: use SelfCheckGPT-style sampling or NLI against context when ground truth is unavailable.
- Toxicity: off-the-shelf detectors or G-Eval with a clear rubric.
- Bias: G-Eval with explicit criteria; consider region/culture and clear rubrics.
- Prompt alignment: instructions in the prompt template are followed in the output.
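The SelfCheckGPT idea mentioned above is to resample the model several times and treat sentences that the samples fail to reproduce as likely hallucinations. A sketch, where `supports` stands in for the NLI or LLM consistency check:

```python
def selfcheck_hallucination(answer_sentences: list, sampled_responses: list, supports) -> dict:
    """SelfCheckGPT-style scoring: for each sentence in the answer,
    score = fraction of resampled responses that do NOT support it.

    Higher scores mean the sentence is more likely hallucinated.
    `supports(sentence, response) -> bool` stands in for an NLI model
    or LLM consistency check.
    """
    scores = {}
    for sentence in answer_sentences:
        contradictions = sum(1 for resp in sampled_responses if not supports(sentence, resp))
        scores[sentence] = contradictions / len(sampled_responses)
    return scores
```

The appeal for foundational-model evaluation is that no ground truth is required, only extra samples from the same model.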
Task-specific & custom metrics
At least one metric tailored to your use case.
- Summarization: factual alignment with source + inclusion of important information (e.g. QAG).
- Helpfulness, brand voice, or format: use G-Eval with clear criteria.
- Structured or formatted output: use DAG for deterministic checks (e.g. headings, order).
- Document custom criteria and rubrics so others can reproduce scores.
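A deterministic DAG node for formatted output can be plain string logic. This sketch checks that required headings appear in the required order; the heading names are illustrative:

```python
def check_heading_order(text: str, required_headings: list):
    """Deterministic format check: every required heading must appear,
    and in the given order. Returns (passed, first_problem).
    """
    pos = -1
    for heading in required_headings:
        i = text.find(heading)
        if i == -1:
            return False, f"missing heading: {heading}"
        if i < pos:
            return False, f"out of order: {heading}"
        pos = i
    return True, None
```

In a DAG scorer, a node like this gates cheaply before any LLM-judge nodes run, so malformed outputs fail fast and reproducibly.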
Implementation & ops
Make evaluation part of the pipeline.
- Create or curate a test set that reflects real inputs and edge cases.
- Establish a baseline and track metric scores over time.
- Integrate evals into CI where possible (e.g. regression on key metrics).
- Add production monitoring for the same dimensions you eval offline.
- Provide eval and guardrail documentation to stakeholders.
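The baseline-plus-CI items above can be wired together with a small regression gate. The 0.02 tolerance is an illustrative assumption; tune it per metric:

```python
def regression_check(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """CI regression gate: flag every metric whose current score drops
    more than `tolerance` below its stored baseline.

    Returns {metric: (baseline_score, current_score)} for regressions;
    an empty dict means the run passes.
    """
    return {
        name: (baseline[name], current.get(name, 0.0))
        for name in baseline
        if current.get(name, 0.0) < baseline[name] - tolerance
    }
```

Failing the pipeline on a non-empty result keeps metric drift visible long after the initial launch.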