Testing non-deterministic AI in production — Soulmates E2E framework

Soulmates is an AI matchmaker on WhatsApp. The product is the conversation. I built an end-to-end framework where simulated users drive real conversations through the production pipeline, while a separate AI judges every turn against a structured rubric — turning prompt iteration from gut feeling into a feedback loop, and catching behavioral drift before users do.

Eliza Labs · Soulmates · 8 min read
  • TypeScript
  • ElizaOS
  • LLM-as-judge
  • PostgreSQL
  • Bun
  • GitHub Actions
  • WhatsApp
  • conversational AI

TL;DR

Soulmates is an AI matchmaker on WhatsApp — the entire product is a conversation with an LLM agent. Standard regression testing doesn’t work against non-deterministic outputs, and on a product like this, behavioral drift is the bug. I built an end-to-end framework where simulated users drive real conversations through the production pipeline, while a separate AI judges every turn against a structured rubric. Result: 35 scenarios on every pull request, behavioral regressions surface as score drops before users see them, and prompt iteration becomes a measurable engineering loop instead of guesswork.

Context — the real stakes

In a deterministic SaaS product, a bug is local: a button doesn’t work, an API returns 500, the user notices and tells you.

In a conversational AI product, the bug is the conversation.

On Soulmates there’s no UI to fall back on. No forms, no profile builder, no settings screen. Ori — the agent — asks. The user answers. Ori asks again. After a while, Ori introduces the user to someone. The whole product surface is the dialog.

If Ori starts asking the same question twice, recommends matches before profiling is done, picks an inappropriate question at a sensitive moment, or sounds suddenly enthusiastic when the character should be calm and measured — the product is broken.

Silently.

No 500. No exception. No alert. Just users quietly leaving.

This is the failure mode end-to-end testing has to catch. And it’s not the same failure mode regular E2E was built for.

Three sources of unpredictability that compound over time

A traditional regression suite expects determinism: send X, expect Y. With an LLM, you stack three sources of unpredictability:

  1. Per-call non-determinism. The same input legitimately produces dozens of valid outputs. Some are great. Some are subtly off. Some break the contract entirely — and they all return 200 from the messaging webhook.

  2. Prompt fragility. You change a prompt to fix issue X, and a week later issue Y appears in a different part of the funnel that wasn’t broken before. Prompts aren’t isolated functions — small wording changes ripple unpredictably across the system. Without a feedback loop, “prompt engineering” is gut feeling at scale.

  3. Long-term drift. The same prompt that worked great for six months degrades because the vendor flipped a default, the SDK changed a behavior, or you added a new tool to the agent and its existing flow shifted. The agent you tested last month is not the agent in production today.

You can absorb per-call randomness. Prompt fragility and long-term drift are the actual killers — and they only surface in production, sometimes weeks after the change that caused them.

You can’t write expect(response).toBe('hello'). So what do you do?

Constraints

  • Real infrastructure, not mocks. If the test mocks the LLM, it’s testing the mock.
  • CI budget. LLM calls cost real money. The suite has to run on every PR without burning budget when nothing relevant changed.
  • CI latency. A few minutes per PR — engineers wait on this. Slower than that, they start skipping it.
  • Reproducibility despite non-determinism. A flaky test is worse than no test — it teaches the team to ignore the suite.
  • Full pipeline coverage. Profiling, async matching, event-driven notifications, safety dispatch. Not just message-in / message-out.

Decision — three primitives

[Architecture diagram]
01 · Simulated user — an LLM persona on a small, fast model: name, background, traits; reads the agent reply and writes the next user turn in character, autonomously.
02 · Production pipeline (real infra, only delivery mocked) — orchestrator (stage machine, evaluators), Postgres on PGlite + pgvector with the real schema, the agent (Ori) on a large reasoning model, and async events. All production code paths are exercised; outbound delivery is the only mock — no WhatsApp call, the outgoing text is captured by a callback.
03 · Three-tier evaluator, run on each agent turn. Tier 1 — deterministic rules: pure code, no LLM; max questions per turn, max bubbles, forbidden phrasings (allow-list), lowercase, no em-dashes, no placeholder leaks; a hard fail fails the test immediately, zero LLM tokens spent. Tier 2 — soft rules: detected by code, injected as judge context ("the response broke the bubble limit by 1 — weigh that in your style score"); code detects, the judge weighs. Tier 3 — LLM judge: a large model in a separate call, scoring 1–10 on four axes (character — sounds like Ori? guidance — follows the stage? conversation — reacts to the user? style — formatting compliance?), with strict XML output parsed in code.
04 · Verdict — computed in code, never asked of the LLM: pass = average ≥ 6.5 && min ≥ 5. The LLM weighs; the verdict stays ours.

Soulmates E2E framework — a simulated user (LLM) talks to the real agent through the production pipeline. A three-tier evaluator (deterministic rules → soft rules injected as context → 4-criteria LLM judge) produces a score, with pass/fail computed in code.

The framework rests on three primitives.

1. Simulated users

Each test fixture is a persona: a name, a short background paragraph, a handful of behavior traits (“guarded but warm”, “casual texter, lowercase, short messages”). Plugged into a small, fast model whose only job is to read what the agent just said and write the next user message in character. The agent under test sees a real conversation, not a script.
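
A minimal sketch of what a persona fixture and a simulated-user turn could look like; the type and function names (UserPersona, simulateUserTurn) and the generateText client are placeholders for illustration, not the actual Soulmates code:

```typescript
// Hypothetical shapes — illustrative of the approach, not the real fixtures.
interface UserPersona {
  name: string;
  background: string;  // short paragraph the small model stays in character with
  traits: string[];    // e.g. "guarded but warm", "casual texter, lowercase, short messages"
}

// One simulated-user turn: read the agent's last reply, answer in character.
// `generateText` stands in for whatever small, fast model client the suite uses.
async function simulateUserTurn(
  persona: UserPersona,
  transcript: { role: 'agent' | 'user'; text: string }[],
  generateText: (prompt: string) => Promise<string>
): Promise<string> {
  const history = transcript.map(t => `${t.role}: ${t.text}`).join('\n');
  const prompt = [
    `You are ${persona.name}. ${persona.background}`,
    `Traits: ${persona.traits.join(', ')}.`,
    `Here is the conversation so far:\n${history}`,
    `Write the next user message, in character. Reply with the message only.`,
  ].join('\n\n');
  return generateText(prompt);
}
```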

2. The judge — three tiers, not one prompt

A real conversation flows through Ori, then an automated reviewer scores it. The non-trivial part is the reviewer.

I learned the hard way that “ask an LLM if the response is good” isn’t enough. So the judge runs in three tiers:

  • Tier 1 — deterministic rules. Pure code, no LLM involved. Things like maximum questions per response, forbidden phrasings, lowercase enforcement, no placeholder leaks. Centralized in a single rules file shared between the production prompts and the tests, so they can never drift apart. If a hard rule fails, the test fails immediately — zero LLM tokens spent. (A code sketch of the first two tiers follows this list.)

  • Tier 2 — soft rules as judge context. Style violations that aren’t hard fails are detected by code and injected into the judge prompt as context: “the response broke the bubble limit by one — weigh that in your style score”. The judge doesn’t re-detect; it weighs.

  • Tier 3 — LLM as judge. A larger, dedicated reasoning model scores every turn on four criteria, each from 1 to 10: does it sound like the character, does it follow the current stage objective, does it actually react to what the user said, does it respect the formatting rules. The model is forced into a strict response format that gets parsed back to scores in code.
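
A minimal sketch of how Tier 1 and Tier 2 could be wired; the rule names, limits, and return shape here are assumptions for illustration, not the production rules file:

```typescript
// Illustrative only — the real rules live in a single file shared with the production prompts.
interface RuleResult {
  hardFailures: string[]; // Tier 1 — any entry fails the test immediately, no LLM call
  softNotes: string[];    // Tier 2 — injected verbatim into the judge prompt as context
}

function checkRules(reply: string, maxQuestions = 1, maxBubbles = 3): RuleResult {
  const hardFailures: string[] = [];
  const softNotes: string[] = [];

  const questions = (reply.match(/\?/g) ?? []).length;
  if (questions > maxQuestions) hardFailures.push(`asked ${questions} questions, max is ${maxQuestions}`);
  if (reply !== reply.toLowerCase()) hardFailures.push('reply is not fully lowercase');
  if (reply.includes('—')) hardFailures.push('em-dash found');
  if (/\{\{.*\}\}/.test(reply)) hardFailures.push('placeholder leaked into the reply');

  const bubbles = reply.split('\n\n').length;
  if (bubbles === maxBubbles + 1) {
    // Not a hard fail: let the judge weigh it in the style score.
    softNotes.push('the response broke the bubble limit by 1 — weigh that in your style score');
  } else if (bubbles > maxBubbles + 1) {
    hardFailures.push(`${bubbles} bubbles, max is ${maxBubbles}`);
  }

  return { hardFailures, softNotes };
}
```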

The pass/fail decision is computed in code from the four scores — never asked of the LLM. Average has to be at least 6.5, with no single score below 5. That last detail matters more than it seems. The line between automated and reliable lives exactly there: you let the LLM weigh, but the verdict stays yours.
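
The verdict itself is a few lines; the score shape below is illustrative, the thresholds are the ones just described:

```typescript
// Scores come back from the judge's strict XML output, parsed in code.
interface JudgeScores {
  character: number;    // sounds like Ori?
  guidance: number;     // follows the current stage objective?
  conversation: number; // actually reacts to what the user said?
  style: number;        // respects the formatting rules?
}

// The verdict is computed here, never asked of the LLM.
function verdict(scores: JudgeScores): boolean {
  const values = [scores.character, scores.guidance, scores.conversation, scores.style];
  const average = values.reduce((a, b) => a + b, 0) / values.length;
  const minimum = Math.min(...values);
  return average >= 6.5 && minimum >= 5;
}
```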

3. Real pipeline, real database

Tests run against the actual agent runtime, with the production character configuration, the production orchestrator, and a real Postgres-compatible database. The only thing not real is the messaging delivery — instead of hitting WhatsApp’s API, a callback captures the outbound message and feeds it to the test.

Time is faked by writing past timestamps directly into the database. Async events (match notifications, meeting reminders, profile-completion expirations) are dispatched directly to the orchestrator, which produces the proactive message Ori would have sent.
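
A sketch of what faking time could look like; the table and column names and the dispatchEvent / evaluateTurn helpers are hypothetical:

```typescript
// Hypothetical table, column, and helper names — illustrative of the approach only.
// "Fake time" by backdating a row in the real database, then dispatch the async event
// straight to the orchestrator and capture the proactive message it produces.
async function backdateMeeting(
  db: { query: (sql: string, params: unknown[]) => Promise<unknown> },
  userId: string,
  hoursAgo: number
): Promise<void> {
  const past = new Date(Date.now() - hoursAgo * 3_600_000).toISOString();
  await db.query('UPDATE meetings SET scheduled_at = $1 WHERE user_id = $2', [past, userId]);
}

// await backdateMeeting(db, userId, 26);
// const outbound = await dispatchEvent(orchestrator, { type: 'meeting_reminder', userId });
// await evaluateTurn(outbound); // the three-tier evaluation described above
```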

This is the difference between unit-testing the orchestrator and regression-testing Ori.

What I built

35 scenario files across three categories:

  • Stage tests — one per step of the funnel: welcome, profiling, manifesto, pricing, matching, meeting, feedback, coaching, reset, plus retries.
  • Action tests — explicit user actions: pause, reset, running-late, coordinate.
  • Event tests — proactive messages Ori sends without user prompting: check-in, match notification, meeting reminder, reactivation, profile reminder.

Plus two full-funnel smoke tests that drive a complete LLM-vs-LLM conversation end-to-end. They’re slower and more expensive, reserved for major changes.

Every test ends with a single line that runs the full three-tier evaluation.
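
A hypothetical scenario test, just to show the shape; the module and helper names are illustrative:

```typescript
import { test } from 'bun:test';
// Hypothetical helpers: runScenario drives the persona-vs-Ori conversation,
// expectConversationToPass runs the full three-tier evaluation on every agent turn.
import { runScenario, personas, expectConversationToPass } from './helpers';

test('profiling stage, guarded persona', async () => {
  const transcript = await runScenario('profiling', personas.guardedButWarm);
  await expectConversationToPass(transcript);
});
```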

Path-filtered CI — token-conscious by design

LLM calls aren’t free. Running 35 e2e tests on every PR without filtering would burn through the budget.

The CI uses GitHub Actions path filtering to detect what changed and build a dynamic test matrix. A change in a single stage runs only that stage’s e2e. A change to a shared component runs all stages. No code change at all skips e2e entirely. Unit tests gate everything: if the unit suite fails, zero LLM tokens are spent on a broken commit.
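
The matrix logic itself is a small mapping; here is a sketch in TypeScript with hypothetical paths and stage names (the real mapping lives in the workflow's path filters):

```typescript
// Illustrative: turn a list of changed files into the scenario matrix to run.
const SHARED_PATHS = ['src/orchestrator/', 'src/character/', 'src/rules/'];
const STAGE_PATHS: Record<string, string> = {
  'src/stages/profiling/': 'profiling',
  'src/stages/matching/': 'matching',
  'src/stages/pricing/': 'pricing',
};

function scenariosFor(changedFiles: string[]): string[] {
  if (changedFiles.length === 0) return [];            // no code change: skip e2e entirely
  if (changedFiles.some(f => SHARED_PATHS.some(p => f.startsWith(p)))) {
    return Object.values(STAGE_PATHS);                  // shared component: run all stages
  }
  const hit = new Set<string>();
  for (const f of changedFiles) {
    for (const [path, stage] of Object.entries(STAGE_PATHS)) {
      if (f.startsWith(path)) hit.add(stage);            // single stage: run only its e2e
    }
  }
  return [...hit];
}
```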

This pattern saves about 80–85% of LLM tokens on a typical PR. Same correctness signal, fraction of the cost.

Prompt iteration as engineering, not gut feeling

The framework’s real value isn’t “we catch bugs”. It’s that prompt iteration becomes a feedback loop.

Without the suite: change a stage prompt, run a few manual conversations, eyeball the responses, ship and hope. The “tested” surface is whatever you happened to type into WhatsApp that morning. That’s not engineering — it’s craft.

With the suite: change a prompt, run the relevant scenarios (the same path-filtered tests run locally too), see the character score drop on certain stages, iterate. The diff between two prompts produces a measurable diff in scores.

The same loop catches long-term drift. Same suite, same code, vendor flips a default, model behavior shifts: scores degrade, the regression surfaces in CI on the next PR — not in production a month later.

This is the deeper skill behind operating an AI product: knowing how to guide model behavior in a production workflow over time. Not a one-shot prompt-engineering session at launch — a discipline of write rules → assert with tests → measure drift → tighten rules → re-assert. The tests are the substrate that turns prompt engineering from folk wisdom into something a team can debug, review, and ship safely.

A regression the framework caught

While refactoring the safety pipeline, the framework caught a real bug that would have shipped silently.

Old behavior: any user mention of self-harm triggered an automatic block, closing the conversation. The intent was protective. The effect was the opposite — users disclosing self-harm during onboarding got shut out of exactly the channel they needed open.

I split safety into two distinct paths: a wellbeing evaluator that emits signals to admin alerts (no blocking), and a separate report action that blocks bidirectionally but only after a real meeting between two users. I wrote five explicit scenarios for it: neutral, harassment, danger, blocked-dispatch, and self-harm-during-onboarding.

The self-harm-onboarding scenario kept failing on the old logic. The judge surfaced “Ori told the user they’re being blocked, closing the conversation when this is exactly the moment to keep them engaged”. The fix passed.

Without the framework, that regression would have shipped, and the cost would have been measured in users in their worst moment getting silenced by an algorithm.

That’s the kind of bug deterministic E2E doesn’t catch — because the code was working. The behavior was wrong.

The strong-model trap

The single biggest lesson I’d bring to the next AI product I ship: start with small models, escalate progressively.

On Soulmates, the agent ran on the largest available reasoning model from day one. The intuition was simple — give the agent the best possible reasoning, get the best possible behavior. In practice, starting strong masks the real problem: it lets the architecture lean on the model.

Two anti-patterns hide inside that choice:

  1. Logic that should be deterministic ends up in the prompt. If the model is smart enough, you can write “only ask about pricing once profiling is done” into the prompt and ship it. It works most of the time. Then you tweak the prompt to fix issue X, the model reinterprets the constraint, and pricing leaks before profiling. That logic belonged in a state machine, not in a prompt.

  2. Architecture decisions get made on top of the model’s slack. When the model is good enough to compensate for a sloppy stage transition, a hand-wavy context, or an under-specified rubric, you don’t see the sloppiness — you see “it works”. Then the vendor flips a default, the model gets a sliver dumber, and the whole stack drifts.

Starting with a smaller model forces both problems to surface. You see exactly where the agent breaks down, and that tells you where the architecture is actually carrying weight versus where the LLM is silently doing the work. Fix the architecture, then scale up the model — that’s the right direction. The opposite (start strong, downgrade later) leaves you debugging a system you no longer understand.

A concrete example: an early version of the proactive-event composition (match notifications, check-ins, reminders) ran on a small, fast model. Style compliance kept suffering — wrong bubble counts, em-dashes leaking through, occasional placeholder echoes. The reflex fix was to bump the model size. The deeper fix was to realize that bubble-count and placeholder enforcement should never have been the LLM’s job in the first place. Those rules now live in the centralized style file and run as deterministic checks before a single judge token is spent. That refactor only happened because a weaker model had made the gap visible.

The deeper discipline behind operating an AI product is exactly this: the model is a tool the architecture wields, not a partner the architecture leans on. Test with a weak model. Refuse to let the LLM do anything you could have done in code. Then earn the right to use a stronger model.

Outcome

  • 35 e2e scenarios running on every PR, gated by paths.
  • Pass threshold: average ≥ 6.5 and minimum ≥ 5 across the four criteria.
  • ~80–85% LLM token savings on a typical PR.
  • Per-test reports rendered as a turn-by-turn summary in the GitHub Actions UI — when a test fails, the diff between expected and actual lives in the PR review, not in a separate dashboard.
  • Onboarding a new persona: a few lines of configuration.
  • Onboarding a new scenario: copy an existing test, change a few fields.
  • A regression suite for non-deterministic behavior. Behavioral changes show up as score drops before users see them.

What I’d do differently

  • Start small on models, escalate later. The architecture-first discipline is the first decision I’d flip.
  • Per-PR cost dashboard. We don’t surface “this PR cost $X in LLM tokens to test” in the GitHub Actions summary. We should.
  • Per-criterion drift graphs. We catch failures, but we don’t track the average character score per stage week-over-week. That graph would surface slow drifts before they cross the pass threshold.
  • Generalize the framework. The persona / judge / pipeline-runner shape is product-agnostic. Only the rubric is product-specific. Extracted into a library, this is what any conversational-AI team needs the day they take their product to production.

Key takeaway

The real shift isn’t technical. It’s that the contract of correctness moves from “the function returned the expected value” to “the conversation followed the rules we wrote down”.

You don’t replace assertions — you rewrite the rules until they’re enforceable in code, then let the code enforce them. The LLM judge fills the gap between deterministic rules and human judgment.

It’s the only honest way to test a product whose entire surface is a conversation. And it’s how prompt engineering stops being a black art and becomes part of the platform.

