Back to all posts

Treating LLM Evals Like Tests: RAGAS and Langfuse as a CI Gate

Savinay Kumar

The scariest thing about shipping LLM features is that they fail silently. Change a prompt, swap a model version, adjust a retrieval parameter, and nothing throws. The pipeline is green, the deploy succeeds, and answer quality quietly drops ten percent. No stack trace, no failing test, just a slow erosion you notice weeks later in user complaints.

Traditional tests do not catch this because the output is not deterministic. So we treat evaluation the way we treat tests: a golden dataset, automated scoring, and a threshold that blocks a merge.

What we measure

For a RAG system, "is the answer good" decomposes into a few measurable properties, and RAGAS scores most of them out of the box:

Splitting the score this way is the point. A single "quality" number tells you something regressed. These tell you whether to go fix the retriever or the prompt.

The golden dataset is the asset

Evals are only as good as the dataset behind them. Ours started small, a few dozen hand-written question and answer pairs, and grows from real traffic: when we find a query the system handled badly, we fix it and add it to the set so it can never silently regress again. Production failures become permanent regression tests.

We keep these datasets and every eval run in Langfuse, which also traces live requests. The same scores we compute in CI we sample on production traffic, so offline and online quality stay comparable instead of drifting apart.

Wiring it into CI

The mechanics are ordinary CI. On any pull request that touches a prompt, a model version, or retrieval, the eval suite runs against the golden dataset and fails the build if a metric drops below its floor:

def test_rag_quality():
    results = evaluate(
        dataset=load_golden_set(),
        metrics=[faithfulness, answer_relevancy, context_recall],
    )
    assert results["faithfulness"] >= 0.90
    assert results["answer_relevancy"] >= 0.85
    assert results["context_recall"] >= 0.80

A regression now shows up as a red check on the PR, next to the unit tests, before it reaches a user.

The honest caveats

Eval is not free and not perfect:

None of that is a reason to skip it. An imperfect, automated quality gate beats finding out from users. Once evals run in CI, changing a prompt stops being an act of faith and becomes a normal, reviewable engineering change.