LLM Evaluation Checklist

Metrics, test sets, and guardrails for shipping reliable LLM features.


Scope & planning

Decide what to evaluate before choosing metrics.

  • Identify whether the use case is single-turn or multi-turn (e.g. a single-turn RAG chatbot vs a multi-turn conversational agent).
  • Map system architecture: RAG, agentic, foundational model, or hybrid.
  • Keep total metrics to ≤ 5: 1–2 custom (use-case) + 2–3 generic (system).
  • Define a minimum passing threshold for each metric so you can gate releases.
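The release-gating step above can be sketched as a per-metric threshold check. The metric names and threshold values here are hypothetical, not recommendations:

```python
# Gate a release on per-metric minimum passing thresholds.
# Metric names and values are illustrative only.
THRESHOLDS = {
    "answer_relevancy": 0.80,  # generic (system) metric
    "faithfulness": 0.90,      # generic (system) metric
    "brand_voice": 0.75,       # custom (use-case) metric
}

def gate_release(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failures) for a set of eval scores."""
    failures = [
        f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failures, failures)
```

A missing metric counts as a failure (score 0.0), which keeps the gate conservative when a scorer silently stops running.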

Metric quality

Good metrics are quantitative, reliable, and accurate.

  • Ensure every metric produces a numeric score (not just pass/fail).
  • Prefer LLM-as-a-judge scorers over BLEU/ROUGE for semantic nuance.
  • Validate that scores align with human judgment (e.g. correlation studies or spot checks).
  • Avoid over-evaluating: when you evaluate everything, you evaluate nothing.

Choosing scorers

Match the scorer to the type of criteria.

  • Use G-Eval for subjective criteria (coherence, helpfulness, tone, brand voice).
  • Use DAG (decision tree + LLM judge) when success criteria are clear and discrete.
  • Use QAG (claim extraction + closed questions) for faithfulness and factuality.
  • Use exact-match or logic only where appropriate (e.g. tool correctness).
  • Skip purely statistical scorers (BLEU, ROUGE, METEOR) for semantic evaluation.
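The QAG pattern above can be sketched as: extract claims from the output, ask a closed (yes/no) question per claim against the context, and report the fraction supported. Both LLM calls are stubbed here with naive heuristics for illustration:

```python
def extract_claims(output: str) -> list[str]:
    """Stub for an LLM claim-extraction call: one claim per sentence."""
    return [s.strip() for s in output.split(".") if s.strip()]

def claim_supported(claim: str, context: str) -> bool:
    """Stub for a closed LLM question: 'Does the context support this claim?'
    A real scorer would ask a judge model; this is naive word containment."""
    claim_words = {w.lower().strip(",") for w in claim.split()}
    context_words = {w.lower().strip(",.") for w in context.split()}
    return claim_words <= context_words

def qag_faithfulness(output: str, context: str) -> float:
    """Fraction of extracted claims supported by the retrieval context."""
    claims = extract_claims(output)
    if not claims:
        return 1.0  # nothing asserted, nothing unfaithful
    supported = sum(claim_supported(c, context) for c in claims)
    return supported / len(claims)
```

The closed-question step is what makes QAG reliable: each judge call is a narrow yes/no decision rather than a free-form grade.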

Core metrics to consider

Essential dimensions before adding task-specific metrics.

  • Answer relevancy: output addresses the input in an informative, concise way.
  • Correctness: output is factually correct against ground truth where available.
  • Hallucination / faithfulness: no fake or unsupported information.
  • Contextual relevancy (RAG): retriever returns context relevant to the query.
  • Responsible metrics: bias and toxicity where the application touches users.

RAG pipelines

Retriever + generator; evaluate both.

  • Faithfulness: generator output aligns with retrieval context.
  • Answer relevancy: generator output is concise and on-topic.
  • Contextual relevancy: proportion of retrieved sentences relevant to the input.
  • Contextual precision & recall (if you have expected output): ranking and coverage of needed info.
  • For multi-turn RAG: use turn-level variants (turn faithfulness, turn relevancy, etc.).
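Contextual precision and recall above can be sketched from binary relevance labels per retrieved chunk; in practice the labels come from a judge or ground truth, and the fact matching would be an LLM call rather than a substring test:

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks (True = relevant).
    Rewards rankings that place relevant chunks first (average precision)."""
    hits, score = 0, 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

def contextual_recall(expected_facts: list[str], context: str) -> float:
    """Fraction of facts needed for the expected output that the retrieved
    context contains. Naive substring matching stands in for a judge call."""
    if not expected_facts:
        return 1.0
    found = sum(fact.lower() in context.lower() for fact in expected_facts)
    return found / len(expected_facts)
```

Note the asymmetry: precision cares about ranking (a relevant chunk buried at rank 10 scores poorly), while recall only cares about coverage.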

AI agents

Task completion, tools, and plans.

  • Task completion: agent accomplishes the given task (single-turn or multi-turn).
  • Tool correctness: agent calls the right tools for the task.
  • Argument correctness: tool arguments make sense for the input.
  • Plan quality: plans are complete, logical, and efficient.
  • Plan adherence: agent follows the plan it created.
  • Step efficiency: no unnecessary steps in the execution trace.
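Tool correctness and step efficiency above are good candidates for exact-match logic rather than an LLM judge. A minimal sketch over hypothetical tool-call traces:

```python
def tool_correctness(expected: list[str], actual: list[str]) -> float:
    """Fraction of expected tools the agent actually called (order-insensitive)."""
    if not expected:
        return 1.0
    called = set(actual)
    return sum(tool in called for tool in expected) / len(expected)

def step_efficiency(expected: list[str], actual: list[str]) -> float:
    """Penalize unnecessary steps: ratio of expected to actual step count,
    capped at 1.0 so shorter-than-expected traces don't over-score."""
    if not actual:
        return 0.0
    return min(1.0, len(expected) / len(actual))
```

Argument correctness usually needs a judge (arguments can be semantically right without matching exactly), which is why it is listed separately above.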

Foundational models & fine-tuning

Evaluating the LLM itself, not the full system.

  • Hallucination: use SelfCheckGPT-style sampling or NLI against reference context when ground truth is unavailable.
  • Toxicity: off-the-shelf detectors or G-Eval with a clear rubric.
  • Bias: G-Eval with explicit criteria; consider region/culture and clear rubrics.
  • Prompt alignment: instructions in the prompt template are followed in the output.
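The SelfCheckGPT-style approach above samples several outputs for the same prompt and scores each sentence by how consistently the samples support it: inconsistent sentences are likely hallucinated. The entailment judge is stubbed with a token-overlap heuristic for illustration:

```python
def supports(sentence: str, sample: str) -> bool:
    """Stub for an NLI/judge call: does this sampled output entail the sentence?
    Naive token overlap stands in for a real entailment model."""
    tokens = set(sentence.lower().split())
    return len(tokens & set(sample.lower().split())) / len(tokens) >= 0.7

def selfcheck_score(sentences: list[str], samples: list[str]) -> float:
    """Mean cross-sample consistency; low scores flag likely hallucination."""
    if not sentences or not samples:
        return 0.0
    per_sentence = [
        sum(supports(sent, smp) for smp in samples) / len(samples)
        for sent in sentences
    ]
    return sum(per_sentence) / len(per_sentence)
```

The appeal for foundational-model evals is that no ground truth is needed: the model's own sampled outputs serve as the reference set.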

Task-specific & custom metrics

At least one metric tailored to your use case.

  • Summarization: factual alignment with source + inclusion of important information (e.g. QAG).
  • Helpfulness, brand voice, or format: use G-Eval with clear criteria.
  • Structured or formatted output: use DAG for deterministic checks (e.g. headings, order).
  • Document custom criteria and rubrics so others can reproduce scores.
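The deterministic checks above can be sketched as a chain of pass/fail nodes over the output's structure, where a failed node prunes the rest of the branch. The required headings and markdown conventions here are a hypothetical format spec:

```python
import re

REQUIRED_HEADINGS = ["Summary", "Findings", "Next steps"]  # hypothetical spec

def headings_in_order(output: str) -> bool:
    """Node 1: all required '## ' headings present, in the specified order."""
    positions = [output.find(f"## {h}") for h in REQUIRED_HEADINGS]
    return all(p >= 0 for p in positions) and positions == sorted(positions)

def format_score(output: str) -> float:
    """Walk the nodes in order; the first failure prunes the remaining checks."""
    checks = [
        headings_in_order,
        lambda o: bool(re.search(r"^- ", o, flags=re.MULTILINE)),  # has bullets
    ]
    passed = 0
    for check in checks:
        if not check(output):
            break
        passed += 1
    return passed / len(checks)
```

In a full DAG scorer, some nodes would be LLM judge calls; the value of the tree is that deterministic checks run first and cheaply.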

Implementation & ops

Make evaluation part of the pipeline.

  • Create or curate a test set that reflects real inputs and edge cases.
  • Establish a baseline and track metric scores over time.
  • Integrate evals into CI where possible (e.g. regression on key metrics).
  • Add production monitoring for the same dimensions you eval offline.
  • Provide eval and guardrail documentation to stakeholders.
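The CI regression check above can be as simple as comparing current scores against stored baseline scores with an allowed drop. Metric names and the tolerance are illustrative:

```python
def regression_failures(baseline: dict[str, float],
                        current: dict[str, float],
                        max_drop: float = 0.02) -> list[str]:
    """Return metrics whose score fell more than max_drop below the baseline.
    An empty list means the build passes the regression gate."""
    return [
        f"{name}: {current.get(name, 0.0):.2f} (baseline {base:.2f})"
        for name, base in baseline.items()
        if current.get(name, 0.0) < base - max_drop
    ]
```

The small tolerance absorbs judge-model noise between runs; pair it with the absolute thresholds from the planning section so a slowly eroding baseline still fails eventually.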

© 2026 Wilkins Labs. All rights reserved.