Engineering · 12 min read

How we build eval suites that catch drift before customers do

A walkthrough of the three-layer eval harness we ship with every agent: unit tests for prompts, property tests on outputs, and a golden-trace drift detector that runs weekly against curated production inputs.

Maria Chen · Staff Engineer · April 8, 2026

When a customer reports that their agent "feels worse today," it's usually too late. The drift happened last week, compounded over the weekend, and the first person to notice was someone with a complaint, not an alert.

Every agent we ship goes to production with three layers of automated tests. Each one catches a different failure mode. Together, they turn "feels worse" into a numeric regression on a dashboard, visible hours before any user sees it.

Layer 1 — Prompt unit tests

The shallowest layer. For every prompt in the system, we freeze a small set of input/output pairs that are non-negotiable: the refund-intent classifier must return refund on "I want my money back," the extractor must pull total: 42.00 from the canonical invoice fixture, and so on.

These run on every pull request. A change to the prompt that breaks any of them fails CI. They're cheap — a few dozen fixtures, one model call each, under a dollar per run.
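In code, Layer 1 is little more than a fixture table and a loop. A minimal sketch, assuming a `call_model(prompt_name, text)` callable as a stand-in for however your stack invokes the model; the two fixtures mirror the examples above:

```python
# Layer 1 sketch: frozen input/output pairs checked on every PR.
# `call_model` is a hypothetical stand-in for your model-invocation layer.

FIXTURES = [
    # (prompt_name, input_text, required_substring_in_output)
    ("refund_intent", "I want my money back", "refund"),
    ("invoice_extractor", "CANONICAL_INVOICE_FIXTURE", "total: 42.00"),
]

def run_prompt_unit_tests(call_model):
    """Run every frozen fixture; return a list of (prompt, input, got) failures."""
    failures = []
    for prompt_name, text, expected in FIXTURES:
        got = call_model(prompt_name, text)
        # Substring match on purpose: these pairs are non-negotiable,
        # so any output missing the expected token fails CI.
        if expected not in got:
            failures.append((prompt_name, text, got))
    return failures
```

Wiring this into CI is just `assert run_prompt_unit_tests(call_model) == []` in a test file, so a prompt change that breaks a fixture fails the pull request.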

Layer 2 — Property tests on outputs

Prompt unit tests catch known regressions. Property tests catch unknown ones.

For every agent output, we assert structural properties that must always hold, independent of input: does the response parse as valid JSON, are the required fields present, are values inside their allowed ranges.

We run these on randomly sampled production traffic — 50 samples per agent per hour, logged to a Postgres table. A property violation is a page. In two years of running this, we've caught silent schema drift from a model version upgrade three separate times, all within the first hour.
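A property checker is a pure function over raw output, which keeps it cheap to run on sampled traffic. A sketch under an assumed schema (the `intent`/`confidence` keys are illustrative, not our actual schema):

```python
import json

# Hypothetical schema for illustration; substitute your agent's real fields.
REQUIRED_KEYS = {"intent", "confidence"}

def check_properties(raw_output: str) -> list[str]:
    """Return the list of violated properties; an empty list means it passes."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        # A non-JSON response violates everything else too; report and stop.
        return ["not valid JSON"]
    violations = []
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        violations.append(f"missing keys: {sorted(missing)}")
    conf = obj.get("confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        violations.append("confidence out of [0, 1]")
    return violations
```

Each violation row is what lands in the Postgres table; a non-empty result on sampled traffic is what triggers the page.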

Layer 3 — Golden trace drift detection

The deepest layer. Once a week, every agent re-runs against a curated set of "golden traces" — real production inputs with outputs we've manually reviewed and locked.

The trick is what we compare. String match doesn't work — LLMs paraphrase. So we use a secondary model (usually a smaller, cheaper one) to judge semantic equivalence against the locked output, and we track the equivalence rate over time.
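The comparison loop itself is small once the judge call is factored out. A sketch in which `run_agent` and `judge` are stand-ins for the agent under test and the secondary-model equivalence call:

```python
def equivalence_rate(golden_traces, run_agent, judge):
    """Replay golden traces and score them against locked outputs.

    golden_traces: list of (input, locked_output) pairs.
    run_agent(input) -> fresh output string.
    judge(fresh, locked) -> bool; in production this is a secondary-model
    call asking "are these semantically equivalent?", stubbed here.
    Returns the fraction of traces judged equivalent.
    """
    hits = sum(1 for inp, locked in golden_traces if judge(run_agent(inp), locked))
    return hits / len(golden_traces)
```

The returned rate is the single number we persist per agent per run; everything downstream (charts, alerts) is built on that time series.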

Drift shows up as the rate ticking down. 94% → 91% → 87% over three weeks is the pattern that used to generate a customer complaint in week four. Now it generates a Slack alert in week one.
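The alert condition can be as simple as a week-over-week delta check on that series. A sketch; the 2-point threshold is an illustrative choice, not a value from our production config:

```python
def should_alert(weekly_rates, max_drop=0.02):
    """Flag drift when the equivalence rate falls by more than `max_drop`
    week-over-week. `max_drop` here is illustrative; tune it to your
    series' normal noise so a 94% -> 91% step fires but jitter doesn't."""
    return (
        len(weekly_rates) >= 2
        and weekly_rates[-2] - weekly_rates[-1] > max_drop
    )
```

On the pattern above, the 94% to 91% drop trips the check in week one instead of surfacing as a complaint in week four.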

What we learned running this for two years

Most drift is not model drift. It's data drift — the inputs shifted, and the agent's implicit assumptions broke. The eval harness surfaces this faster than any model-versioning strategy.

The cheapest layer catches the most bugs. Unit tests are 80% of the value. Don't skip them chasing fancy eval frameworks.

Golden traces decay. The "truth" labels we locked a year ago are sometimes wrong now, because the business changed. We budget an engineer-day per quarter to re-curate them.

If you want us to bootstrap this harness on your own agents, fork the playbook from our engagement shape and wire in your fixtures — or get in touch.