Soulmates conversational AI E2E framework
An LLM-simulated user drives real conversations through the production pipeline; a three-tier judge scores every turn. Pass/fail is computed in code.
// 01 · SIMULATED USER (LLM PERSONA · SMALL FAST MODEL)
UserPersona — name · background · traits
Reads the agent's reply and autonomously writes the next user turn, in character.

// 02 · PRODUCTION PIPELINE (REAL INFRA, ONLY DELIVERY MOCKED)
Orchestrator — stage machine · evaluators
Postgres — PGlite + pgvector · real schema
Agent (Ori) — large reasoning model
Outbound delivery — captured by a callback (the only mock): no WhatsApp call, just the intercepted text.
Async events and all production code paths are exercised; only the outbound message is stubbed. The loop alternates user msg → agent reply.

// 03 · THREE-TIER EVALUATOR — RUNS ON EACH AGENT TURN
Tier 1 — Deterministic rules. Pure code, no LLM: max questions per turn · max bubbles · forbidden phrasings (allow-list) · lowercase · no em-dashes · no placeholder leaks. A hard fail ends the test immediately, with zero LLM tokens spent.
Tier 2 — Soft rules. Detected by code, injected as judge context: "the response broke the bubble limit by 1 — weigh that in your style score". Code detects · the judge weighs.
Tier 3 — LLM judge. Large model, separate call, once per turn. Scores 1–10 on 4 axes:
· CHARACTER — sounds like Ori?
· GUIDANCE — follows the stage?
· CONVERSATION — reacts to the user?
· STYLE — formatting compliance?
Strict XML output, parsed in code.

// 04 · VERDICT — COMPUTED IN CODE, NEVER ASKED OF THE LLM
pass = average ≥ 6.5 && min ≥ 5
The LLM weighs · the verdict stays ours.
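The simulated-user step (section 01) can be sketched as a prompt builder: persona fields plus the transcript and the agent's last reply become the context for the small fast model. The interface and wording below are illustrative assumptions, not the framework's actual prompt.

```typescript
// Hypothetical persona prompt builder for the simulated-user turn.
// Field names and prompt wording are assumptions for illustration.
interface UserPersona {
  name: string;
  background: string;
  traits: string[];
}

function nextUserTurnPrompt(
  persona: UserPersona,
  transcript: string[],
  agentReply: string,
): string {
  return [
    `you are ${persona.name}, ${persona.background}. traits: ${persona.traits.join(", ")}.`,
    `conversation so far:\n${transcript.join("\n")}`,
    `the agent just said: "${agentReply}"`,
    `write your next message, in character, as this user would.`,
  ].join("\n\n");
}
```

The output of this function would be sent to the small fast model; its completion becomes the next user message fed into the real pipeline.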
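The Tier 3 judge returns strict XML that code parses into numeric scores. A sketch of that parsing step, assuming one tag per axis named after the axis itself (the real tag names and schema may differ):

```typescript
// Hypothetical parser for the judge's strict XML output.
// Tag names mirror the four axes; the actual schema is an assumption.
const AXES = ["character", "guidance", "conversation", "style"] as const;
type Axis = (typeof AXES)[number];
type Scores = Record<Axis, number>;

function parseJudgeXml(xml: string): Scores {
  const scores = {} as Scores;
  for (const axis of AXES) {
    const m = xml.match(new RegExp(`<${axis}>\\s*(\\d+)\\s*</${axis}>`));
    if (!m) throw new Error(`judge output missing <${axis}> score`);
    const n = Number(m[1]);
    if (n < 1 || n > 10) throw new Error(`<${axis}> score ${n} outside 1-10`);
    scores[axis] = n;
  }
  return scores;
}
```

Throwing on a missing or out-of-range tag keeps malformed judge output from silently passing a turn.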
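The verdict rule above is plain arithmetic over the four axis scores, which a few lines of code capture exactly:

```typescript
// Verdict computed in code, never asked of the LLM:
// pass = average >= 6.5 AND minimum >= 5 (thresholds from the spec above).
function verdict(scores: number[]): boolean {
  const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
  const min = Math.min(...scores);
  return avg >= 6.5 && min >= 5;
}
```

The min-score floor means one very weak axis (say, a 4 on STYLE) fails the turn even when the other axes average high — the LLM weighs, the verdict stays in code.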