LLM outputs are unpredictable. You ship a feature, tweak a prompt, and suddenly your chatbot starts making up policy details or completely ignoring the context you retrieved. Without automated evaluation, you only find out about these failures when users complain, not before you merge.

This guide walks you through building a reliable LLM evaluation pipeline with DeepEval. You'll install DeepEval, write your first test case with all the required fields, pick the right metrics and thresholds for your use case, and run dataset evaluations in pytest and CI to catch problems before they hit production.

What Is DeepEval and Why Use It for LLM Testing

DeepEval is an open-source Python framework that automates LLM output evaluation using metrics like answer relevancy, faithfulness, and contextual relevancy. Instead of manually checking outputs, you define test cases with inputs, expected behavior, and context, then assert that outputs meet your quality thresholds. DeepEval runs these assertions in your test suite, CI pipeline, and nightly regression jobs.

Here's the thing. Traditional unit tests check deterministic logic. LLM outputs are probabilistic and context-dependent. DeepEval bridges this gap by scoring outputs against semantic criteria. Does the answer actually address the question? Does it stay grounded in the documents you retrieved? When scores drop below your threshold, tests fail. This lets you treat LLM quality like any other software requirement: measurable, repeatable, and enforceable in CI.

DeepEval fits into your development lifecycle at three key points. Locally, you run single test cases while tweaking prompts or retrieval logic. In pull request CI, you run a fast subset of critical cases to block regressions. In nightly or pre-release jobs, you run the full dataset to catch edge cases and drift across model versions or data updates.

The real value here is speed and confidence. You catch prompt regressions in seconds, not after user reports start rolling in. You compare model versions with actual numbers, not gut feelings. You scale evaluation from 10 test cases to 1,000 without any manual review.

Install DeepEval and Run Your First Test Case

DeepEval requires Python 3.9 or later. You'll also need an API key for the judge model provider, whether that's OpenAI, Azure OpenAI, or another supported backend. Set the key as an environment variable before running tests, something like export OPENAI_API_KEY=your_key. DeepEval uses these judge models to score outputs against metrics.

Install DeepEval from PyPI in a clean virtual environment:

# Create and activate a new virtual environment, then install DeepEval
python -m venv .venv
source .venv/bin/activate
pip install -U deepeval

To install the latest development version directly from GitHub:

pip install git+https://github.com/confident-ai/deepeval.git

Verify the installation by importing the package:

# Verify that DeepEval is installed and can be imported
python -c "import deepeval; print('deepeval ok')"

Now let's create your first test case. A test case has two required fields: input (the prompt or question) and actual_output (what your LLM returned). It also accepts optional fields, including expected_output (a reference answer or format constraint) and retrieval_context (the retrieved documents for RAG evaluation). Here's a minimal example that evaluates answer relevancy:

# Minimal DeepEval test: Evaluate a single LLM output for answer relevancy

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

# Example prompt and model output
prompt = "What are two reasons to write unit tests?"
actual_output = (
    "Unit tests help catch regressions early and make refactoring safer "
    "because they verify behavior automatically."
)

# Create a test case for the LLM output
test_case = LLMTestCase(
    input=prompt,
    actual_output=actual_output,
)

# Configure the answer relevancy metric with a threshold
metric = AnswerRelevancyMetric(
    threshold=0.7  # Set threshold based on baseline performance
)

# Assert that the output meets the relevancy threshold
assert_test(test_case, [metric])

When you run this code, DeepEval sends the input and output to a judge model, which returns a relevancy score between 0 and 1. If the score meets or exceeds the threshold (0.7), the test passes. If it falls below, assert_test raises an AssertionError with the score and failure reason. You'll see output like score=0.85, threshold=0.7, status=passed in your terminal. This pass/fail signal is what you'll use in CI to block bad changes.

The expected_output field is optional. Use it when you need exact format constraints (JSON schema, required entities, policy-mandated phrases, that sort of thing) or when you want to compare semantic similarity to a reference answer. For most cases though, metrics alone work fine. Include retrieval_context when evaluating RAG pipelines, which I'll show you in the next section.
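
If you do use a reference answer, one option is DeepEval's GEval metric, which scores custom criteria with a judge model. Here's a minimal sketch, assuming a simple correctness criterion; the test case values, criteria wording, and threshold are illustrative:

# Hedged sketch: score actual_output against expected_output with a custom GEval metric

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval import assert_test

# Example reference-based test case (values are illustrative)
case = LLMTestCase(
    input="What is the retention period for invoices?",
    actual_output="Invoices are kept for seven years.",
    expected_output="Invoices must be retained for 7 years.",
)

# Custom correctness criterion judged against the reference answer
correctness = GEval(
    name="Correctness",
    criteria="Does the actual output state the same facts as the expected output?",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,  # illustrative threshold
)

assert_test(case, [correctness])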

Choose the Right Metrics for Your Use Case

DeepEval provides metrics for different LLM patterns. Answer relevancy measures whether the output addresses the input question. Faithfulness checks if the output is grounded in provided context, which is critical for RAG. Contextual relevancy evaluates whether retrieved context chunks are actually useful for answering the question. Other metrics cover toxicity, bias, hallucination, and tool use.

If your app is a Q&A assistant, relevance and correctness matter most. If it's RAG based, groundedness and context alignment are key. If it's a workflow agent, tool use and instruction adherence become important. Use the smallest set of metrics that catches real failures. For a deeper dive into building reliable RAG pipelines and implementing vector store retrieval, check out our Ultimate Guide to Vector Store Retrieval for RAG Systems.

Start with one or two metrics per intent type. For a RAG assistant, combine answer relevancy and faithfulness:

# Example: Combine answer relevancy and faithfulness metrics in a single test

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval import assert_test

# Define the question, context, and model output
question = "What is the retention period for invoices?"
context = (
    "Finance policy. Invoices must be retained for 7 years. "
    "Receipts must be retained for 3 years."
)
actual_output = "Invoices must be retained for 7 years."

# Create a test case with retrieval context for RAG evaluation
test_case = LLMTestCase(
    input=question,
    actual_output=actual_output,
    retrieval_context=[context],
)

# Attach both answer relevancy and faithfulness metrics
metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.8),
]

# Assert that both metrics pass
assert_test(test_case, metrics)

Thresholds control when tests fail. Set them based on baseline performance, not arbitrary targets. Run your pipeline on a representative sample, maybe 20 to 50 cases. Collect scores and set thresholds at the 20th percentile for each metric. This way you catch real degradation without failing on normal variance. For high-stakes outputs like policy answers or financial data, tighten thresholds to the 50th percentile or higher.
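
Here's a small sketch of that baseline step, assuming you've already collected per-case scores for one metric. The scores are made up, and the percentiles come from Python's statistics module:

# Hedged sketch: derive a threshold from baseline metric scores (20th percentile)

import statistics

# Scores collected from a representative baseline run (illustrative values)
baseline_scores = [0.91, 0.88, 0.84, 0.79, 0.93, 0.86, 0.81, 0.90, 0.87, 0.76]

# quantiles(n=5) returns the 20th, 40th, 60th, and 80th percentile cut points
p20, p40, p60, p80 = statistics.quantiles(baseline_scores, n=5)

print(f"20th percentile threshold: {p20:.2f}")
print(f"Median threshold for high-stakes cases: {statistics.median(baseline_scores):.2f}")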

Metrics that use judge models, like faithfulness and relevancy, aren't perfectly deterministic. Control variance by setting temperature to 0 in your LLM pipeline where possible. Use fixed random seeds. Pin the judge model version in your DeepEval configuration. Log the judge model version and evaluator prompt with each test run so you can trace changes in scores back to evaluation drift, not just model drift.
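
One way to pin the judge is through the model argument that DeepEval metrics accept. The sketch below assumes an OpenAI model name; swap in whatever judge your team standardizes on:

# Hedged sketch: pin the judge model used to score a metric

from deepeval.metrics import FaithfulnessMetric

# The model name here is an example; pin the exact judge version your team agreed on
faithfulness = FaithfulnessMetric(
    threshold=0.8,
    model="gpt-4o-mini",
)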

For RAG pipelines, make sure the retrieval_context field in your test case matches what your model actually saw during generation. If your retrieval step returns three chunks, pass those exact chunks to LLMTestCase. Don't pass different or additional context only to the evaluator: you'll get false passes when the model hallucinates but the evaluator sees grounding docs, or false failures when the model uses correct context but the evaluator sees incomplete context. Capture the retrieved chunks during your pipeline run and attach them to the test case before evaluation.

Here's a practical RAG example that evaluates both faithfulness and contextual relevancy:

# Practical RAG example: Evaluate both faithfulness and contextual relevancy

from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric
from deepeval import assert_test

# Define the question and retrieved context chunks
question = "Does the API support idempotency keys for POST requests?"

retrieved_chunks = [
    "API Guide. All POST endpoints accept the Idempotency-Key header. "
    "Keys are valid for 24 hours and prevent duplicate charges.",
    "Authentication. Use OAuth2 bearer tokens for all requests."
]

actual_output = (
    "Yes. POST endpoints accept an Idempotency-Key header. "
    "The key is valid for 24 hours to prevent duplicate charges."
)

# Create a test case with multiple retrieved context chunks
test_case = LLMTestCase(
    input=question,
    actual_output=actual_output,
    retrieval_context=retrieved_chunks,
)

# Attach contextual relevancy and faithfulness metrics
metrics = [
    ContextualRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

# Assert that both metrics pass for the test case
assert_test(test_case, metrics)

When a test fails, DeepEval returns the score, threshold, and a reason string explaining why the output didn't meet the criteria. Check this output in your terminal or CI logs to iterate quickly. A faithfulness failure might show "claim not supported by context" with the specific unsupported sentence highlighted.

Many real applications also require structured output: JSON schemas, tool call arguments, specific templates. DeepEval's semantic metrics don't validate format, so add complementary checks. Parse the JSON, validate it against your schema, and assert required fields before running semantic metrics. This catches format breaks immediately and keeps semantic evaluation focused on content quality.
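
Here's a minimal sketch of that kind of pre-check, assuming your app returns a JSON payload with answer and sources fields (both names are illustrative):

# Hedged sketch: validate output structure before running semantic metrics

import json

REQUIRED_FIELDS = {"answer", "sources"}  # illustrative schema fields

def check_structure(raw_output: str) -> dict:
    """Parse the raw model output as JSON and assert required fields are present."""
    parsed = json.loads(raw_output)  # raises an error immediately on malformed JSON
    missing = REQUIRED_FIELDS - parsed.keys()
    assert not missing, f"Output missing required fields: {missing}"
    return parsed

# Run the structural check first, then hand parsed["answer"] to your semantic metrics
payload = check_structure(
    '{"answer": "Invoices must be retained for 7 years.", "sources": ["finance_policy"]}'
)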

Build a Dataset-Driven Evaluation Workflow

Single tests help you get started, but they won't protect you from regressions. You need a dataset that represents your product. Include happy paths, edge cases, policy and safety cases, and known failure modes. Then run the full set on every meaningful change. Prompt edits, model version bumps, retrieval changes, tool schema changes. If you're planning to fine-tune your own models, our step-by-step guide to fine-tuning large language models provides practical advice on dataset creation and evaluation.

Store test cases in JSON with fields for input, context, and any metadata. Intent type, expected entities, difficulty level. Here's an example dataset file:

[
  {
    "id": "billing_retention_01",
    "input": "What is the retention period for invoices?",
    "context": [
      "Finance policy. Invoices must be retained for 7 years. Receipts must be retained for 3 years."
    ]
  },
  {
    "id": "api_idempotency_01",
    "input": "Does the API support idempotency keys for POST requests?",
    "context": [
      "API Guide. All POST endpoints accept the Idempotency-Key header. Keys are valid for 24 hours."
    ]
  }
]

Load these cases into LLMTestCase objects. The loader sets actual_output to an empty string initially. You'll fill it by running your pipeline for each input:

# Loader: Convert JSON dataset rows into DeepEval LLMTestCase objects

import json
from deepeval.test_case import LLMTestCase

def load_cases(path: str):
    """
    Load test cases from a JSON file and convert them to LLMTestCase objects.

    Args:
        path (str): Path to the JSON dataset file.

    Returns:
        list: List of LLMTestCase objects with input and retrieval_context fields populated.
              The actual_output field should be filled after running your pipeline.
    """
    with open(path, "r") as f:
        data = json.load(f)
    cases = []
    for row in data:
        # actual_output should be set after running your LLM pipeline
        cases.append(
            LLMTestCase(
                input=row["input"],
                actual_output="",  # To be filled after pipeline execution
                retrieval_context=row.get("context", None)
            )
        )
    return cases

Run your pipeline for each case, attach the output, and evaluate with your chosen metrics. This function runs batch evaluation and aggregates pass/fail results:

# Batch evaluation: Run metrics on a dataset and aggregate results

from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric
from deepeval import assert_test

def evaluate_dataset(cases, run_pipeline):
    """
    Evaluate a list of LLMTestCase objects using specified metrics.

    Args:
        cases (list): List of LLMTestCase objects.
        run_pipeline (callable): Function that takes input text and returns model output.

    Returns:
        list: List of tuples (input, passed) indicating pass/fail for each test case.
    """
    # Define metrics to use for all cases
    metrics = [
        ContextualRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ]

    results = []
    for case in cases:
        # Run your LLM pipeline to get the actual output for each input
        case.actual_output = run_pipeline(case.input)
        try:
            # Assert that the case passes all metrics
            assert_test(case, metrics)
            results.append((case.input, True))
        except AssertionError:
            # If any metric fails, record as failed
            results.append((case.input, False))

    passed = sum(1 for _, ok in results if ok)
    print(f"Passed {passed}/{len(results)}")  # Lightweight logging of results
    return results

For RAG pipelines, make sure run_pipeline returns both the generated output and the retrieved context chunks. Attach the retrieved chunks to case.retrieval_context before calling assert_test. This guarantees that faithfulness and contextual relevancy metrics evaluate against the same context the model used, avoiding false results.
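
Here's a quick sketch of that wiring, assuming a hypothetical run_rag_pipeline function that returns the generated answer together with the exact chunks the retriever produced:

# Hedged sketch: attach the pipeline's retrieved chunks before evaluating

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

def evaluate_rag_case(case: LLMTestCase, run_rag_pipeline) -> bool:
    """run_rag_pipeline is assumed to return (output_text, retrieved_chunks)."""
    output, chunks = run_rag_pipeline(case.input)
    case.actual_output = output
    case.retrieval_context = chunks  # evaluate against what the model actually saw
    try:
        assert_test(case, [
            ContextualRelevancyMetric(threshold=0.7),
            FaithfulnessMetric(threshold=0.8),
        ])
        return True
    except AssertionError:
        return False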

LLM evaluation should run like any other test suite. The simplest path is writing pytest tests that call DeepEval assertions. This makes it easy to run pytest locally and in CI. Keep tests deterministic by controlling temperature and setting seeds where possible. Use stable retrieval snapshots for the CI run. For a step-by-step walkthrough on building robust LLM workflows with Python, see our guide on How to Build Reliable LangChain LLM Workflows in 15 Minutes Flat.

Here's an example pytest integration:

# Example pytest integration: Run DeepEval assertions as part of CI

import os
import pytest
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

def run_app(prompt: str) -> str:
    """
    Simulate running your LLM pipeline for a given prompt.

    Args:
        prompt (str): The input prompt/question.

    Returns:
        str: The model's output (replace with real pipeline call in production).
    """
    # For demonstration, return a canned response.
    # In production, call your actual LLM pipeline here.
    return "Unit tests catch regressions and make refactoring safer."

def test_unit_testing_answer_relevancy():
    """
    Test that the LLM output is relevant to the unit testing question.
    """
    case = LLMTestCase(
        input="What are two reasons to write unit tests?",
        actual_output=run_app("What are two reasons to write unit tests?")
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    # Assert that the output meets the relevancy threshold
    assert_test(case, [metric])

Run pytest in your CI pipeline on every pull request. Start with a small, fast subset of critical cases. Maybe 10 to 20 cases to keep CI under 2 minutes. Run the full dataset nightly or before releases to catch edge cases and model drift. Separate "fast unit eval" from "full regression eval" using pytest markers or separate test files.
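
One way to split those suites is with pytest markers. The sketch below assumes a smoke marker for the PR subset and a full_eval marker for the nightly run; register both in pytest.ini or pyproject.toml:

# Hedged sketch: separate fast PR checks from the full regression suite with markers

import pytest

@pytest.mark.smoke
def test_critical_billing_case():
    # Small, high-priority subset that runs on every pull request
    ...

@pytest.mark.full_eval
def test_full_dataset_regression():
    # Full dataset evaluation, run nightly or before releases
    ...

# PR CI:      pytest -m smoke
# Nightly CI: pytest -m full_eval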

Control costs and latency by sampling intelligently. For large datasets, run all cases nightly but only high-priority cases in PR CI. Mock tool calls and external API responses in CI to avoid flakiness and cost. Use fixed retrieval snapshots, pre-retrieved context stored in your test dataset, instead of live retrieval in CI to ensure repeatability. Set timeouts and retries for judge model calls to handle transient API failures.

Pin your judge model version and evaluator prompts in your DeepEval configuration. Log these versions with each test run so you can trace score changes back to evaluation drift, not just model or prompt changes. Store evaluation artifacts. Scores, failure reasons, test case IDs. Put them in CI logs or a dedicated results database for historical comparison and debugging.
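
Here's a lightweight sketch of that kind of artifact logging, appending one JSON line per metric result so runs can be compared later. The field names and file path are illustrative:

# Hedged sketch: append evaluation artifacts as JSON lines for later comparison

import json
from datetime import datetime, timezone

def log_result(case_id: str, metric: str, score: float, passed: bool,
               judge_model: str, path: str = "eval_results.jsonl") -> None:
    """Append one evaluation record per test case and metric."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "case_id": case_id,
        "metric": metric,
        "score": score,
        "passed": passed,
        "judge_model": judge_model,  # pin and log the judge version per run
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage
log_result("billing_retention_01", "faithfulness", 0.86, True, "gpt-4o-mini")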

When you change prompts, model versions, or retrieval logic, run your full evaluation dataset before merging. Compare aggregate scores (mean, median, and pass rate per metric) against your baseline. If scores drop significantly, inspect the failing cases to figure out whether the change introduced a real regression or whether thresholds need adjustment. Use this feedback loop to iterate quickly and ship confidently.
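
As a sketch of that comparison, assuming you've collected per-case scores for the current run and a stored baseline (the numbers below are illustrative):

# Hedged sketch: compare aggregate scores for a run against a stored baseline

import statistics

def summarize(scores, threshold):
    """Return mean, median, and pass rate for one metric's scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }

baseline = summarize([0.91, 0.85, 0.88, 0.79, 0.90], threshold=0.8)
current = summarize([0.84, 0.80, 0.86, 0.72, 0.83], threshold=0.8)

for key in baseline:
    drop = baseline[key] - current[key]
    print(f"{key}: baseline={baseline[key]:.2f} current={current[key]:.2f} drop={drop:.2f}")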

Conclusion

Building reliable LLM applications requires more than just prompt engineering and model selection. You need systematic evaluation that catches regressions before they reach production. DeepEval gives you the framework to make this happen.

We've covered the essential pieces: installing DeepEval and writing your first test case, selecting metrics that match your use case, building dataset-driven workflows that scale, and integrating everything into your CI pipeline. The key is starting simple. Pick one or two metrics, write ten test cases, and run them locally. Once that's working, expand to more cases and tighter integration.

The real payoff comes when you can ship prompt changes with confidence, knowing your test suite will catch any degradation. When you can compare model versions with actual numbers instead of vibes. When you can scale from prototype to production without losing sleep over quality.

As I've come to learn through building these systems, LLM evaluation isn't about perfection. It's about catching the failures that matter and shipping faster with confidence. Start with the basics we've covered here, then expand based on what breaks in your specific application. Your users, and your on-call rotation, will thank you.