This framework builds on Eugene's D1–D5 adoption and 27 golden cases, with three upgrades:
A check like "was productIntelligence called before pricingIntelligence?" is a unit test for the web pipeline, not a conversation quality metric. Keep those in CI/CD, not in the eval rubric.

| Dim | What It Measures | 2 (Pass) | 0 (Fail) |
|---|---|---|---|
| D1 | Message Length | 1–3 lines, no bullet dumps | 6+ lines, process explanations |
| D2 | Value Delivery | Specific price, material tradeoff, market insight | Only asked a question, zero data |
| D3 | Qualification | Got a qualifying signal (budget, qty, intent) | No qualification attempt |
| D4 | Conversation Discipline | One question max, acknowledged lead | Multiple questions, ignored context |
| D5 | Last Message Test | Lead hooked OR we learned a key signal | Wasted turn — nothing hooked or qualified |
Pass threshold: avg ≥ 7/10 per conversation + all binary checks pass.
| Check | What It Catches | Scope |
|---|---|---|
| RESTRICTED_HOLD | Bot refused restricted products firmly | Universal |
| BUDGET_MATH | Bot did honest math on unrealistic budgets | Universal |
| EXIT_DOOR | Bot left a specific price door open when exiting | Universal |
| LANGUAGE_MATCH | Bot responded in the lead's language | Universal |
| FORMAT_OK | No markdown tables, WhatsApp-safe (WA) / card-safe (Web) | Channel-specific |
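The two layers above combine into one pass/fail gate per conversation. A minimal sketch, assuming the 0–2 dim scores sum to the /10 conversation score and that binary checks are hard gates (function and field names are illustrative, not from the eval tool):

```python
# Illustrative pass gate: D1-D5 score 0/1/2 each (max 10); binary
# checks are non-negotiable. All names here are hypothetical.
def conversation_passes(dim_scores: dict[str, int],
                        binary_checks: dict[str, bool]) -> bool:
    total = sum(dim_scores.values())  # D1-D5 summed -> 0-10 scale
    return total >= 7 and all(binary_checks.values())

# Example: a strong conversation that still fails one hard gate
dims = {"D1": 2, "D2": 2, "D3": 2, "D4": 1, "D5": 2}  # 9/10
checks = {"RESTRICTED_HOLD": True, "BUDGET_MATH": True,
          "EXIT_DOOR": False, "LANGUAGE_MATCH": True, "FORMAT_OK": True}
print(conversation_passes(dims, checks))  # False: binary checks are hard gates
```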
Checks like "did the pipeline call productIntelligence in the right order?" test a specific code architecture, not conversation quality. They are valid as integration tests for the web bot's pipeline, but they don't belong in the conversation eval rubric: if the pipeline ordering is fixed code, it's an if/then/else, not an eval.
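That ordering invariant can live in CI as a plain unit test. A sketch, assuming the web bot can expose a per-turn call log (the helper and the log format are hypothetical):

```python
# Hypothetical CI check: stage ordering is an architecture invariant,
# so it belongs in the test suite, not the conversation eval.
def assert_stage_order(call_log: list[str], before: str, after: str) -> None:
    assert before in call_log and after in call_log, "both stages must run"
    assert call_log.index(before) < call_log.index(after), (
        f"{before} must run before {after}")

# Example against a recorded call log from one web-bot turn
assert_stage_order(
    ["productIntelligence", "pricingIntelligence", "responseComposer"],
    before="productIntelligence", after="pricingIntelligence")
```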
Instead of one fixed correct endpoint, define ranges:
| Field | Purpose | Example (Syed, handwash, 25 PKR) |
|---|---|---|
| must_not_reach | Genuine failure modes | COMPLETE_SR at 25 PKR (dishonest math) |
| acceptable | Valid outcomes | EXIT_POLITE, QUALIFY_AND_ADVANCE at realistic price |
| optimal | Best case (bonus only) | QUALIFY_AND_ADVANCE (Syed accepts higher price) |
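Checking a reached endpoint against these ranges is mechanical. A minimal sketch (field names mirror the table; the endpoint strings and the `REVIEW` fallback are illustrative assumptions):

```python
# Illustrative endpoint-range check: ranges replace a single "correct" endpoint.
def score_endpoint(reached: str, must_not_reach: set[str],
                   acceptable: set[str], optimal: set[str]) -> str:
    if reached in must_not_reach:
        return "FAIL"          # genuine failure mode
    if reached in optimal:
        return "PASS+BONUS"    # best case, bonus only
    if reached in acceptable:
        return "PASS"
    return "REVIEW"            # unexpected endpoint -> human review

# Syed / handwash example from the table above
print(score_endpoint(
    "EXIT_POLITE",
    must_not_reach={"COMPLETE_SR_AT_25_PKR"},
    acceptable={"EXIT_POLITE", "QUALIFY_AND_ADVANCE"},
    optimal={"QUALIFY_AND_ADVANCE"}))  # PASS
```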
Below are all known test cases, classified by behavior bucket and source. Team: review each case — does it represent a real lead pattern Sourcy faces?
| Bucket | Pattern | What the Bot Should Do | Cases |
|---|---|---|---|
| B1 | Budget/Quantity Challenge | Honest math, exit if unrealistic, leave price door open | 5 |
| B2 | Vague / No Specs | Draw out specs with value delivery, qualify intent | 4 |
| B3 | Qualified / Defined Product | Price immediately, drive toward SR completion | 14 |
| B4 | Restricted / Impossible | Firm decline, suggest alternatives if possible | 4 |
| B5 | Branded / IP Products | Clarify sourcing limitations, redirect to custom | 1 |
| B6 | Ghost / Non-responsive | One follow-up, then exit gracefully | 5 |
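One way the catalog below could be encoded for automated runs — a hypothetical schema sketch, not the eval tool's actual format:

```python
from dataclasses import dataclass, field

# Hypothetical record for one golden case; field names follow the
# catalog tables below.
@dataclass
class GoldenCase:
    case_id: int
    lead: str
    bucket: str                      # B1-B6
    source_file: str
    case_type: str = "REAL"
    endpoints: set = field(default_factory=set)       # acceptable endpoints
    must_not_reach: set = field(default_factory=set)  # hard-failure endpoints

syed = GoldenCase(
    case_id=15, lead="Syed / VCare", bucket="B1",
    source_file="Good/handwash SR.txt",
    must_not_reach={"COMPLETE_SR at original price"})
print(syed.bucket)  # B1
```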
Lead has a defined product, reasonable specs. Bot should price immediately and drive toward SR completion.
| # | Lead | Product | Region | Source File | Type | Endpoints |
|---|---|---|---|---|---|---|
| 1 | Jesús Mendoza | Jerseys, shorts, socks — $70K MXN | Mexico | Good/Jesús Lizandro Mendoza Mora chat history (Inbound).txt | REAL | QUALIFY, COMPLETE_SR |
| 2 | Edamama (Bren/Bea) | Playmats, play gyms, nursery | Philippines | Good/Edamama.txt | REAL | QUALIFY, COMPLETE_SR |
| 3 | Alliah / Armada Brands | Gummy supplements | — | Good/Armada Brands - gummy supplement.txt | REAL | QUALIFY, COMPLETE_SR |
| 4 | Alessandro / Enrique | Dog harness sets | — | Good/Dog Harness.txt | REAL | QUALIFY, COMPLETE_SR |
| 5 | Frederic / Chimaera | Leather bags (duffle, tote, wallet) | TH / CA | Good/chimaera world P1.txt | REAL | QUALIFY, COMPLETE_SR |
| 6 | Tolanikawo | T-shirts, leather handbags | Nigeria | Good/Tolanikawo chat history (Inbound).txt | REAL | QUALIFY, COMPLETE_SR |
| 7 | Jammaica / Little Luna's | Pastry & drinks packaging | Philippines | Good/P1 PH leads asking for a call.txt | REAL | CALL_HANDOFF |
| 8 | Thailand Shuttlecocks | Premium shuttlecocks (4000 tubes/mo) | Thailand | Good/P1 TH leads.txt | REAL | QUALIFY, COMPLETE_SR |
| 9 | Bala Di Gala | General sourcing (quotation stage) | — | Good/WhatsApp Chat - Sourcy __ Bala Di Gala (1).txt | REAL | QUALIFY |
| 10 | Roy / Frank | Unspecified (call scheduled) | Malaysia | Good/WhatsApp Chat with Sourcy Roy.txt | REAL | CALL_HANDOFF |
| 11 | Matt / KIMO | Kids vitamins (multivitamin, calcium) | Philippines | WA Chats - BD Team/Sourcy_KIMO Kids Vitamins PH/ | REAL | QUALIFY, COMPLETE_SR |
| 12 | Kindnest | Baby/kids products | — | Good/Copy of Kindnest Chat.docx | REAL | QUALIFY |
| 13 | Oaken Lab | Personal care products | Indonesia | Good/Oaken Lab - ID client.docx | REAL | QUALIFY, COMPLETE_SR |
| 14 | Fran | — | — | Good/Fran.docx | REAL | QUALIFY |
Lead has a product but budget or quantity is unrealistic. Bot should do honest math and exit gracefully if the numbers don't work.
| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 15 | Syed / VCare | Hand wash 500ml — 25 PKR (~$0.09) | Pakistan | Good/handwash SR.txt | REAL | COMPLETE_SR at original price |
| 16 | Candle Student | Candle materials (molds, wicks, jars) | Pakistan | Bad/bad example 1.txt | REAL | COMPLETE_SR (hobby qty) |
| 17 | femmoraaa | Jewelry/accessories (IG teen) | Pakistan | Bad/bad example 5 - femmoraaa jewelry teenager.txt | REAL | COMPLETE_SR (no budget) |
| 18 | Jersey Low Qty | Jerseys (very small order) | Réunion | Bad/bad example 6 - jersey low qty.txt | REAL | COMPLETE_SR (below MOQ) |
| 19 | Anam | Jewelry, bags, makeup (no specs) | Pakistan | Bad/bad example 2.txt | REAL | COMPLETE_SR |
Product is restricted, not sourceable, or not a physical product. Bot should decline firmly.
| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 20 | Battery/Fuses | Batteries, fuses, connectors | Pakistan | Bad/bad example 3.txt | REAL | COMPLETE_SR (restricted) |
| 21 | Anthony | AirPods (50 units, branded) | Malaysia | Bad/bad example 4.txt | REAL | COMPLETE_SR (branded resale) |
| 22 | PUBG | PUBG UC (gaming credits) | Afghanistan | Bad/bad example 5 - PUBG.txt | REAL | COMPLETE_SR (not physical) |
| 23 | Jose / Motorcycles | Motorcycles | Ecuador | Bad/bad example 7 - motorcycles.txt | REAL | COMPLETE_SR (not sourceable) |
Lead hasn't specified a product. Bot should draw out specs with value delivery, not waste turns on process.
| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 24 | Copypaste | Unclear (spam-like) | India | Bad/bad example 8 - copypaste.txt | REAL | COMPLETE_SR |
| 25 | Jorge Vague | Unclear (no product) | US | Bad/bad example 9 - jorge vague.txt | REAL | COMPLETE_SR |
| 26 | Ghost Inquiry | No product specified, stopped responding | US | Bad/bad example ghost 1.txt | REAL | COMPLETE_SR |
Lead is asking for a branded product (not custom sourcing). Bot should clarify limitations, redirect to custom alternatives.
| # | Lead | Product | Region | Source File | Type | Endpoints |
|---|---|---|---|---|---|---|
| 27 | Nina Chua / Foxmont | Owala water bottles (branded) | Philippines | WA Chats - BD Team/WhatsApp Chat - Foxmont Owala/ | REAL | REDIRECT_CUSTOM, EXIT_POLITE |
Lead stopped responding entirely. Bot should send one follow-up, then exit gracefully.
| # | Lead | Region | Source File | Type |
|---|---|---|---|---|
| 28 | Ghost 1 | — | Bad/bad example ghost 1.txt | REAL |
| 29 | Ghost 2 | — | Bad/bad example ghost 2.txt | REAL |
| 30 | Ghost 3 | — | Bad/bad example ghost 3.txt | REAL |
| 31 | Ghost 4 | — | Bad/bad example ghost 4.txt | REAL |
| 32 | Ghost 5 | — | Bad/bad example ghost 5.txt | REAL |
8 of the 32 real conversations have been structured for automated eval runs with controlled parameters. Each is grounded in a real WA lead conversation and scored against D1–D5.
| Persona | Based On | Bucket | Eval Score | Source Convo |
|---|---|---|---|---|
| Anam (zero-spec dreamer) | Bad ex 2 — no specs, vague | B2 | 7.8/10 PASS | Bad/bad example 2.txt |
| femmoraaa (IG jewelry teen) | Bad ex 5 — teenager, no budget | B1 | 9.0/10 PASS | Bad/bad example 5 - femmoraaa |
| Jesus (sportswear Mexico) | Good/Jesús Mendoza | B3 | 8.0/10 PASS | Good/Jesús Lizandro Mendoza Mora |
| Syed (handwash Karachi) | Good/handwash SR | B1 | 8.5/10 PASS | Good/handwash SR.txt |
| Battery (restricted + fuses) | Bad ex 3 — restricted items | B4 | 8.5/10 PASS | Bad/bad example 3.txt |
| Anthony (AirPods reseller) | Bad ex 4 — branded, low qty | B4 | 9.0/10 PASS | Bad/bad example 4.txt |
| Jammaica (call handoff) | Good/P1 PH leads | B3 | 10.0/10 PASS | Good/P1 PH leads asking for a call.txt |
| Candle (hobby student) | Bad ex 1 — student, low qty | B1 | 8.5/10 PASS | Bad/bad example 1.txt |
Source files live in context/Good/, context/Bad/, and WA Chats - BD Team/ — the same files as in Eric's GitHub repo. Newer operational data should be pulled in as the dataset grows.

The activation bot operates on two channels with fundamentally different delivery modes. The eval should recognize both as valid, not penalize one for not being the other.
| Eval Element | WhatsApp | Web UI |
|---|---|---|
| D1 (Message Length) | 1–3 lines of text = concise | Short transition text + card = concise |
| D2 (Value Delivery) | Price in text = value delivery | Price in card = value delivery |
| FORMAT_OK | No markdown, no tables, WhatsApp-safe | Cards render correctly, no data repetition |
| SR Completion | QUALIFY_AND_ADVANCE (human closes) | COMPLETE_SR (bot collects form) |
| Pipeline checks | N/A — no staged pipeline | Integration test (CI/CD, not eval) |
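The FORMAT_OK row is the one check that must branch on channel. A sketch, assuming "WhatsApp-safe" operationally means no markdown tables or headings, and "concise transition text" means at most three lines (both heuristics are assumptions, not the eval tool's rules):

```python
import re

# Illustrative channel-aware FORMAT_OK check; heuristics are assumptions.
def format_ok(message: str, channel: str) -> bool:
    if channel == "whatsapp":
        # WhatsApp renders no markdown: pipe tables and # headings fail
        has_table = bool(re.search(r"^\s*\|.*\|", message, re.MULTILINE))
        has_heading = bool(re.search(r"^\s*#{1,6}\s", message, re.MULTILINE))
        return not (has_table or has_heading)
    if channel == "web":
        # Cards carry the data; only check the transition text stays short
        return len(message.splitlines()) <= 3
    raise ValueError(f"unknown channel: {channel}")

print(format_ok("Price range: $2.10-$2.60/unit at 1K MOQ.", "whatsapp"))  # True
print(format_ok("| Item | Price |\n|---|---|", "whatsapp"))              # False
```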
Eugene's automated GPT judge is the right direction for scale. To make it accurate, we calibrate it with a full-context agent judge.
| | Tier 1: Agent Judge (Calibration) | Tier 2: Prompt Judge (Scale) |
|---|---|---|
| What | Full-context agent (Opus 4.6) with business model, rubric examples, channel awareness, lead behavior patterns | Lightweight GPT prompt with D1–D5 definitions + calibration examples from Tier 1 |
| Scores | Top 10 golden cases, deeply, with reasoning | All 33 cases, quickly, every commit |
| Output | Ground-truth scores = calibration reference | Scaled scores. Flags divergence from Tier 1 |
| Frequency | Once per prompt version | Every commit / prompt change |
| Cost | ~$2–5 per run (10 cases) | ~$0.10–0.50 per run (all cases) |
Workflow: Eric runs Tier 1 on top golden cases → publishes scored output with reasoning → Eugene feeds those as calibration examples into the Tier 2 judge prompt → Tier 2 runs at scale and flags divergence.
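The divergence-flagging step reduces to comparing the two tiers' scores on shared cases. A minimal sketch (the 1.5-point threshold on the /10 scale is an assumption, to be tuned against real calibration runs):

```python
# Illustrative Tier 1 vs Tier 2 divergence check.
def flag_divergent(tier1: dict[str, float], tier2: dict[str, float],
                   threshold: float = 1.5) -> list[str]:
    shared = tier1.keys() & tier2.keys()   # only cases both tiers scored
    return sorted(c for c in shared if abs(tier1[c] - tier2[c]) > threshold)

# Hypothetical scores: the prompt judge under-scores one case
tier1 = {"syed": 8.5, "anam": 7.8, "anthony": 9.0}
tier2 = {"syed": 8.0, "anam": 5.5, "anthony": 9.5}
print(flag_divergent(tier1, tier2))  # ['anam'] -> route to human review
```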
| Version | Personas | Pass Rate | Avg Score | Key Change |
|---|---|---|---|---|
| v1–v3 | 6–8 | 67% → 100% | — | Baseline → call handoff, WHY enforcement |
| v4 | 14 | Issues found | — | Adversarial personas exposed gaps |
| v6b | 5 | 60% | — | Rubric upgrade — introduced D1–D5 strict scoring. Scores dropped because the rubric got real, not because the bot got worse. |
| v7 | 8 | 100% | 8.8/10 | Prices-first rule, one-liner rules. Highest single-change leverage: every category mention gets a price range. |
| Deliverable | Status | For Whom |
|---|---|---|
| This eval framework doc | Done | Full team |
| Persona catalog (above) for team review | Done | Full team |
| Detailed feedback to Eugene on eval tool | Done | Eugene |
| Tier 1 agent judge — scored top 10 golden cases | In progress | Eugene (calibration data) |
| Supplier bot 157-rule review — initial positioning | Due Wed | Thursday call |
| # | Ask | Why | From |
|---|---|---|---|
| 1 | 5–10 completed-SR conversations | Golden dataset has zero successful completions | Lokesh / BD team |
| 2 | Real lead bucket distribution (% per B1–B6) | Weight eval cases by actual volume | Eugene / Lokesh |
| 3 | Channel priority: web-first, WA-first, or both? | Determines calibration effort allocation | Karl |
| 4 | Thumbs up/down on persona catalog above | Confirm cases represent real patterns | Full team |
| 5 | Downstream SR outcome data (did leads actually buy?) | Validate eval scores against real conversion | Lokesh |
The eval should measure outcomes, not process. Did the lead get value? Did we qualify correctly? Did the conversation reach the right endpoint? If yes — the bot passed, regardless of whether it used cards or text, stages or conversation.
Separate what's universal (conversation quality, lead outcomes, endpoint correctness) from what's architecture-specific (staged pipeline checks, card behavior, tool call ordering). The eval should work for whatever the team ships — WhatsApp, web, unified, or something new.