Activation Bot Eval Framework v2

Outcome-based scoring, channel-aware calibration, persona catalog for team review
24 February 2026 · Sourcy Internal
  • Cases: 32 real conversations, 8 tested as structured eval personas
  • Behavior Buckets: 6 (B1–B6 coverage)
  • Scoring Dims: D1–D5 (per-turn quality)
  • Channels: 2 (WhatsApp + Web UI)

I. The Core Principle

Outcome-based eval, not process-based. The eval should check: Did the lead get value? Did we qualify correctly? Did the conversation reach the right endpoint? — NOT: Did the bot follow Stage 1 → 2 → 3 → 4?

This framework builds on Eugene's adoption of D1–D5 scoring and the 27 golden cases, with three upgrades:

  1. Separate conversation eval from integration tests. "Did the bot call productIntelligence before pricingIntelligence?" is a unit test for the web pipeline, not a conversation quality metric. Keep those in CI/CD, not in the eval rubric.
  2. Channel-aware calibration. WhatsApp has no cards — the text IS the delivery. Web UIs show rich cards where text is a short transition. Same quality standards, different calibration per channel.
  3. Endpoint ranges, not fixed points. A better bot might convert someone we expected to exit. Define failure modes to avoid, not exact correct outcomes.

II. Scoring Framework

Per-Turn Quality: D1–D5 (unchanged)

| Dim | What It Measures | 2 (Pass) | 0 (Fail) |
|---|---|---|---|
| D1 | Message Length | 1–3 lines, no bullet dumps | 6+ lines, process explanations |
| D2 | Value Delivery | Specific price, material tradeoff, market insight | Only asked a question, zero data |
| D3 | Qualification | Got a qualifying signal (budget, qty, intent) | No qualification attempt |
| D4 | Conversation Discipline | One question max, acknowledged lead | Multiple questions, ignored context |
| D5 | Last Message Test | Lead hooked OR we learned a key signal | Wasted turn — nothing hooked or qualified |

Pass threshold: avg ≥ 7/10 per conversation + all binary checks pass.
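
For concreteness, a minimal sketch of the pass decision, assuming per-turn D1–D5 scores of 0 or 2 as in the table above; the function names and data shapes are illustrative, not an existing harness API:

```python
from statistics import mean

DIMS = ["D1", "D2", "D3", "D4", "D5"]  # per-turn dimensions, each scored 0 (fail) or 2 (pass)

def turn_score(turn: dict[str, int]) -> int:
    """Total for one bot turn: 5 dims x 2 points = max 10."""
    return sum(turn[d] for d in DIMS)

def conversation_passes(turns: list[dict[str, int]], binary_checks: dict[str, bool]) -> bool:
    """Pass = average turn score >= 7/10 AND every binary check (next section) holds."""
    return mean(turn_score(t) for t in turns) >= 7 and all(binary_checks.values())
```

Note that the binary checks act as hard gates: a conversation averaging 9/10 that drops a markdown table into WhatsApp still fails.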

Binary Checks (per conversation)

| Check | What It Catches | Scope |
|---|---|---|
| RESTRICTED_HOLD | Bot refused restricted products firmly | Universal |
| BUDGET_MATH | Bot did honest math on unrealistic budgets | Universal |
| EXIT_DOOR | Bot left a specific price door open when exiting | Universal |
| LANGUAGE_MATCH | Bot responded in the lead's language | Universal |
| FORMAT_OK | No markdown tables; WhatsApp-safe (WA) / card-safe (Web) | Channel-specific |
What about staged pipeline checks (CK-001 through CK-004)? Checks like "did the bot call productIntelligence in the right order?" test a specific code architecture, not conversation quality. These are valid as integration tests for the web bot's pipeline — but they don't belong in the conversation eval rubric. If the pipeline is fixed, it's an if/then/else, not an eval.
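
To make the separation concrete, a CK-style ordering check can live in CI as a simple assertion over a tool-call trace; the tool names come from this doc, but the trace format is a hypothetical sketch, not the real pipeline API:

```python
# Integration test for the web pipeline: lives in CI/CD, NOT in the eval rubric.
# Assumes the harness records an ordered list of tool-call names per conversation
# (a hypothetical format for illustration).

def check_pipeline_order(tool_call_trace: list[str]) -> bool:
    """CK-style check: productIntelligence must run before pricingIntelligence."""
    if "pricingIntelligence" not in tool_call_trace:
        return True  # nothing to order-check if pricing never fired
    if "productIntelligence" not in tool_call_trace:
        return False
    return tool_call_trace.index("productIntelligence") < tool_call_trace.index("pricingIntelligence")

assert check_pipeline_order(["productIntelligence", "pricingIntelligence"])
assert not check_pipeline_order(["pricingIntelligence", "productIntelligence"])
```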

Endpoint Scoring (per conversation)

Instead of one fixed correct endpoint, define ranges:

| Field | Purpose | Example (Syed, handwash, 25 PKR) |
|---|---|---|
| must_not_reach | Genuine failure modes | COMPLETE_SR at 25 PKR (dishonest math) |
| acceptable | Valid outcomes | EXIT_POLITE; QUALIFY_AND_ADVANCE at realistic price |
| optimal | Best case (bonus only) | QUALIFY_AND_ADVANCE (Syed accepts higher price) |
The Jesus subtlety. v7-Jesus qualified the lead in 3 turns: budget math done, all 3 categories priced, logo question asked. But it didn't collect an email. In WhatsApp, that's a warm handoff to the growth team. In the web UI, the bot drives through a contact form. Both are correct — the eval should measure "did the lead advance and get value?" not "did the bot collect email?"
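
A sketch of how these ranges could be encoded per case, using the Syed example from the table; the field names mirror the table, the endpoint labels are the ones this doc already uses, and the REVIEW outcome for unexpected endpoints is an assumed convention, not a decided one:

```python
from dataclasses import dataclass, field

@dataclass
class EndpointSpec:
    """Per-case endpoint ranges: failure modes to avoid, not one fixed answer."""
    must_not_reach: set[str]                        # hard fail if hit
    acceptable: set[str]                            # pass
    optimal: set[str] = field(default_factory=set)  # bonus only, never required

def score_endpoint(reached: str, spec: EndpointSpec) -> str:
    if reached in spec.must_not_reach:
        return "FAIL"
    if reached in spec.optimal:
        return "PASS_BONUS"
    return "PASS" if reached in spec.acceptable else "REVIEW"  # unexpected -> human look

# Case 15 (Syed, handwash): completing an SR at the original 25 PKR is dishonest math.
syed = EndpointSpec(
    must_not_reach={"COMPLETE_SR"},  # at the original price; a realistic-price SR is fine
    acceptable={"EXIT_POLITE", "QUALIFY_AND_ADVANCE"},
    optimal={"QUALIFY_AND_ADVANCE"},
)
```

The REVIEW bucket is the point of the range design: an endpoint outside all three sets routes to a human instead of auto-failing, so a better bot that converts an expected-exit lead isn't punished.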

III. Persona Catalog — Team Review

Below are all known test cases, classified by behavior bucket and source. Team: review each case — does it represent a real lead pattern Sourcy faces?

Behavior Buckets

| Bucket | Pattern | What the Bot Should Do | Cases |
|---|---|---|---|
| B1 | Budget/Quantity Challenge | Honest math, exit if unrealistic, leave price door open | 5 |
| B2 | Vague / No Specs | Draw out specs with value delivery, qualify intent | 3 |
| B3 | Qualified / Defined Product | Price immediately, drive toward SR completion | 14 |
| B4 | Restricted / Impossible | Firm decline, suggest alternatives if possible | 4 |
| B5 | Branded / IP Products | Clarify sourcing limitations, redirect to custom | 1 |
| B6 | Ghost / Non-responsive | One follow-up, then exit gracefully | 5 |

Coverage Distribution

  • B3 Qualified: 14 cases
  • B1 Budget: 5 cases
  • B6 Ghost: 5 cases
  • B4 Restricted: 4 cases
  • B2 Vague: 3 cases
  • B5 Branded: 1 case
Question for the team: What % of Sourcy's real inbound leads fall into each bucket? If B3 is 60% of real volume but only ~44% of test cases (14 of 32), we're under-testing the most important path. We need the real distribution to weight correctly.
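
Once the real distribution arrives (Ask #2 below), weighting is one line of arithmetic. A sketch; the bucket shares are placeholders, not real Sourcy data:

```python
# PLACEHOLDER shares pending the team's answer to Ask #2; not real Sourcy data.
real_share = {"B1": 0.15, "B2": 0.10, "B3": 0.60, "B4": 0.05, "B5": 0.05, "B6": 0.05}

def weighted_pass_rate(pass_rate_by_bucket: dict[str, float]) -> float:
    """Weight per-bucket pass rates by real inbound share, not test-case count."""
    return sum(real_share[b] * pass_rate_by_bucket.get(b, 0.0) for b in real_share)
```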

B3 Qualified / Defined Product — 14 cases

Lead has a defined product, reasonable specs. Bot should price immediately and drive toward SR completion.

| # | Lead | Product | Region | Source File | Type | Endpoints |
|---|---|---|---|---|---|---|
| 1 | Jesús Mendoza | Jerseys, shorts, socks — $70K MXN | Mexico | Good/Jesús Lizandro Mendoza Mora chat history (Inbound).txt | REAL | QUALIFY, COMPLETE_SR |
| 2 | Edamama (Bren/Bea) | Playmats, play gyms, nursery | Philippines | Good/Edamama.txt | REAL | QUALIFY, COMPLETE_SR |
| 3 | Alliah / Armada Brands | Gummy supplements | | Good/Armada Brands - gummy supplement.txt | REAL | QUALIFY, COMPLETE_SR |
| 4 | Alessandro / Enrique | Dog harness sets | | Good/Dog Harness.txt | REAL | QUALIFY, COMPLETE_SR |
| 5 | Frederic / Chimaera | Leather bags (duffle, tote, wallet) | TH / CA | Good/chimaera world P1.txt | REAL | QUALIFY, COMPLETE_SR |
| 6 | Tolanikawo | T-shirts, leather handbags | Nigeria | Good/Tolanikawo chat history (Inbound).txt | REAL | QUALIFY, COMPLETE_SR |
| 7 | Jammaica / Little Luna's | Pastry & drinks packaging | Philippines | Good/P1 PH leads asking for a call.txt | REAL | CALL_HANDOFF |
| 8 | Thailand Shuttlecocks | Premium shuttlecocks (4000 tubes/mo) | Thailand | Good/P1 TH leads.txt | REAL | QUALIFY, COMPLETE_SR |
| 9 | Bala Di Gala | General sourcing (quotation stage) | | Good/WhatsApp Chat - Sourcy __ Bala Di Gala (1).txt | REAL | QUALIFY |
| 10 | Roy / Frank | Unspecified (call scheduled) | Malaysia | Good/WhatsApp Chat with Sourcy Roy.txt | REAL | CALL_HANDOFF |
| 11 | Matt / KIMO | Kids vitamins (multivitamin, calcium) | Philippines | WA Chats - BD Team/Sourcy_KIMO Kids Vitamins PH/ | REAL | QUALIFY, COMPLETE_SR |
| 12 | Kindnest | Baby/kids products | | Good/Copy of Kindnest Chat.docx | REAL | QUALIFY |
| 13 | Oaken Lab | Personal care products | Indonesia | Good/Oaken Lab - ID client.docx | REAL | QUALIFY, COMPLETE_SR |
| 14 | Fran | | | Good/Fran.docx | REAL | QUALIFY |

B1 Budget / Quantity Challenge — 5 cases

Lead has a product but budget or quantity is unrealistic. Bot should do honest math and exit gracefully if the numbers don't work.

| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 15 | Syed / VCare | Hand wash 500ml — 25 PKR (~$0.09) | Pakistan | Good/handwash SR.txt | REAL | COMPLETE_SR at original price |
| 16 | Candle Student | Candle materials (molds, wicks, jars) | Pakistan | Bad/bad example 1.txt | REAL | COMPLETE_SR (hobby qty) |
| 17 | femmoraaa | Jewelry/accessories (IG teen) | Pakistan | Bad/bad example 5 - femmoraaa jewelry teenager.txt | REAL | COMPLETE_SR (no budget) |
| 18 | Jersey Low Qty | Jerseys (very small order) | Réunion | Bad/bad example 6 - jersey low qty.txt | REAL | COMPLETE_SR (below MOQ) |
| 19 | Anam | Jewelry, bags, makeup (no specs) | Pakistan | Bad/bad example 2.txt | REAL | COMPLETE_SR |

B4 Restricted / Impossible Product — 4 cases

Product is restricted, not sourceable, or not a physical product. Bot should decline firmly.

| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 20 | Battery/Fuses | Batteries, fuses, connectors | Pakistan | Bad/bad example 3.txt | REAL | COMPLETE_SR (restricted) |
| 21 | Anthony | AirPods (50 units, branded) | Malaysia | Bad/bad example 4.txt | REAL | COMPLETE_SR (branded resale) |
| 22 | PUBG | PUBG UC (gaming credits) | Afghanistan | Bad/bad example 5 - PUBG.txt | REAL | COMPLETE_SR (not physical) |
| 23 | Jose / Motorcycles | Motorcycles | Ecuador | Bad/bad example 7 - motorcycles.txt | REAL | COMPLETE_SR (not sourceable) |

B2 Vague / No Specs — 3 cases

Lead hasn't specified a product. Bot should draw out specs with value delivery, not waste turns on process.

| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 24 | Copypaste | Unclear (spam-like) | India | Bad/bad example 8 - copypaste.txt | REAL | COMPLETE_SR |
| 25 | Jorge Vague | Unclear (no product) | US | Bad/bad example 9 - jorge vague.txt | REAL | COMPLETE_SR |
| 26 | Ghost Inquiry | No product specified, stopped responding | US | Bad/bad example ghost 1.txt | REAL | COMPLETE_SR |

B5 Branded / IP Products — 1 case

Lead is asking for a branded product (not custom sourcing). Bot should clarify limitations, redirect to custom alternatives.

| # | Lead | Product | Region | Source File | Type | Endpoints |
|---|---|---|---|---|---|---|
| 27 | Nina Chua / Foxmont | Owala water bottles (branded) | Philippines | WA Chats - BD Team/WhatsApp Chat - Foxmont Owala/ | REAL | REDIRECT_CUSTOM, EXIT_POLITE |

B6 Ghost / Non-responsive — 5 cases

Lead stopped responding entirely. Bot should send one follow-up, then exit gracefully.

| # | Lead | Region | Source File | Type |
|---|---|---|---|---|
| 28 | Ghost 1 | | Bad/bad example ghost 1.txt | REAL |
| 29 | Ghost 2 | | Bad/bad example ghost 2.txt | REAL |
| 30 | Ghost 3 | | Bad/bad example ghost 3.txt | REAL |
| 31 | Ghost 4 | | Bad/bad example ghost 4.txt | REAL |
| 32 | Ghost 5 | | Bad/bad example ghost 5.txt | REAL |

Structured Eval Personas — 8 cases

8 of the 32 real conversations have been structured for automated eval runs with controlled parameters. Each is grounded in a real WA lead conversation and scored against D1–D5.

| Persona | Based On | Bucket | Eval Score | Source Convo |
|---|---|---|---|---|
| Anam (zero-spec dreamer) | Bad ex 2 — no specs, vague | B2 | 7.8/10 PASS | Bad/bad example 2.txt |
| femmoraaa (IG jewelry teen) | Bad ex 5 — teenager, no budget | B1 | 9.0/10 PASS | Bad/bad example 5 - femmoraaa |
| Jesus (sportswear Mexico) | Good/Jesús Mendoza | B3 | 8.0/10 PASS | Good/Jesús Lizandro Mendoza Mora |
| Syed (handwash Karachi) | Good/handwash SR | B1 | 8.5/10 PASS | Good/handwash SR.txt |
| Battery (restricted + fuses) | Bad ex 3 — restricted items | B4 | 8.5/10 PASS | Bad/bad example 3.txt |
| Anthony (AirPods reseller) | Bad ex 4 — branded, low qty | B4 | 9.0/10 PASS | Bad/bad example 4.txt |
| Jammaica (call handoff) | Good/P1 PH leads | B3 | 10.0/10 PASS | Good/P1 PH leads asking for a call.txt |
| Candle (hobby student) | Bad ex 1 — student, low qty | B1 | 8.5/10 PASS | Bad/bad example 1.txt |

Data Gaps
  • No completed-SR conversations. All current cases are qualifying/advancing or failed. We need 5–10 conversations where the lead actually completed an SR — Sourcy is collecting this data daily.
  • B5 (Branded) is under-represented — only 1 case. Branded product requests may be a significant % of inbound.
  • All source files come from context/Good/, context/Bad/, and WA Chats - BD Team/ — these are the same files in Eric's GitHub repo. Newer operational data should be pulled in as the dataset grows.
Ask #1: Can we get 5–10 completed-SR conversations from recent operations? These are the most valuable additions to the dataset.

Ask #2: What is the real inbound lead distribution across B1–B6? This helps us weight test cases correctly.

Ask #4: For each case above — does it represent a real lead pattern Sourcy faces? Any missing patterns?

IV. Channel Trade-offs — WhatsApp vs Web UI

The activation bot operates on two channels with fundamentally different delivery modes. The eval should recognize both as valid, not penalize one for not being the other.

WhatsApp (Conversational)

  • Text IS the delivery — no cards, no forms
  • Prices-first in natural language
  • Warm handoff to growth team for SR completion
  • Lower friction → more leads stay engaged
  • Proven: 8.8/10 avg across 8 structured eval personas

Web UI (Staged Pipeline)

  • Rich cards carry data — text is a short transition
  • Structured forms collect SR fields
  • Bot drives through to SR completion in-conversation
  • Less human follow-up needed
  • Better for leads who want visual/structured experience
These are trade-offs, not a ranking. If human ops cost is high → the staged web UI reduces follow-up work. If conversion rate matters most → conversational WA may keep more leads engaged. If lead quality matters → both can work, with different trade-offs. The eval framework should measure outcomes for BOTH, calibrated per channel.

Channel-Specific Calibration

| Eval Element | WhatsApp | Web UI |
|---|---|---|
| D1 (Message Length) | 1–3 lines of text = concise | Short transition text + card = concise |
| D2 (Value Delivery) | Price in text = value delivery | Price in card = value delivery |
| FORMAT_OK | No markdown, no tables, WhatsApp-safe | Cards render correctly, no data repetition |
| SR Completion | QUALIFY_AND_ADVANCE (human closes) | COMPLETE_SR (bot collects form) |
| Pipeline checks | N/A (no staged pipeline) | Integration test (CI/CD, not eval) |
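
One way to keep "same quality standards, different calibration" honest is a single per-channel config consumed by the same scorers. The values restate the table above; the config structure itself is an assumption, not existing code:

```python
# Same D1-D5 scorers and binary checks run for both channels; only this
# calibration table differs. Keys and check names are illustrative.
CHANNEL_CALIBRATION = {
    "whatsapp": {
        "d1_max_lines": 3,                   # the text IS the delivery
        "value_carriers": {"text"},          # D2: the price must appear in the text
        "format_checks": ["no_markdown", "no_tables", "wa_safe"],
        "sr_completion_endpoint": "QUALIFY_AND_ADVANCE",  # human closes
    },
    "web": {
        "d1_max_lines": 3,                   # short transition text; cards carry the data
        "value_carriers": {"text", "card"},  # D2: a price in a card counts
        "format_checks": ["cards_render", "no_data_repetition"],
        "sr_completion_endpoint": "COMPLETE_SR",          # bot collects the form
    },
}
```
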
Ask #3: Channel priority — is the team going web-first, WA-first, or both simultaneously? This determines which channel gets more golden test cases and calibration effort.

V. Two-Tier Judge Architecture

Eugene's automated GPT judge is the right direction for scale. To make it accurate, we calibrate it with a full-context agent judge.

| | Tier 1: Agent Judge (Calibration) | Tier 2: Prompt Judge (Scale) |
|---|---|---|
| What | Full-context agent (Opus 4.6) with business model, rubric examples, channel awareness, lead behavior patterns | Lightweight GPT prompt with D1–D5 definitions + calibration examples from Tier 1 |
| Scores | Top 10 golden cases, deeply, with reasoning | All 32 cases, quickly, every commit |
| Output | Ground-truth scores = calibration reference | Scaled scores; flags divergence from Tier 1 |
| Frequency | Once per prompt version | Every commit / prompt change |
| Cost | ~$2–5 per run (10 cases) | ~$0.10–0.50 per run (all cases) |

Workflow: Eric runs Tier 1 on top golden cases → publishes scored output with reasoning → Eugene feeds those as calibration examples into the Tier 2 judge prompt → Tier 2 runs at scale and flags divergence.
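
The divergence flag can be mechanical. A sketch; the 1.5-point threshold is an arbitrary illustration, not a decided value:

```python
def flag_divergence(tier1: dict[str, float], tier2: dict[str, float],
                    threshold: float = 1.5) -> list[str]:
    """Return case IDs where the scaled judge drifts from agent ground truth.

    tier1: case_id -> Tier 1 calibration score (top golden cases only)
    tier2: case_id -> Tier 2 prompt-judge score (all cases, every commit)
    """
    return [case for case, truth in tier1.items()
            if case in tier2 and abs(tier2[case] - truth) > threshold]
```

Flagged cases are exactly the ones worth re-running through Tier 1 before trusting a pass/fail trend.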


VI. Version Progression

| Version | Personas | Pass Rate | Avg Score | Key Change |
|---|---|---|---|---|
| v1–v3 | 6–8 | 67% → 100% | | Baseline → call handoff, WHY enforcement |
| v4 | 14 | Issues found | | Adversarial personas exposed gaps |
| v6b | 5 | 60% | | Rubric upgrade — introduced D1–D5 strict scoring; scores dropped because the rubric got real, not because the bot got worse |
| v7 | 8 | 100% | 8.8/10 | Prices-first rule, one-liner rules; highest single-change leverage: every category mention gets a price range |
v6b was a rubric upgrade, not a regression. The v1–v5 rubric was lenient. v6b introduced per-turn D1–D5 scoring with hard message length caps. Pass rate dropped from 100% to 60% — because the eval was catching real gaps (process dumps, missed value delivery) that the old rubric couldn't detect. v7 fixed those gaps. This is the methodology working as designed.

VII. This Week — Deliverables & Asks

What Eric Delivers (by Thursday)

| Deliverable | Status | For Whom |
|---|---|---|
| This eval framework doc | Done | Full team |
| Persona catalog (above) for team review | Done | Full team |
| Detailed feedback to Eugene on eval tool | Done | Eugene |
| Tier 1 agent judge — scored top 10 golden cases | In progress | Eugene (calibration data) |
| Supplier bot 157-rule review — initial positioning | Wed | Thursday call |

What Eric Needs From the Team

| # | Ask | Why | From |
|---|---|---|---|
| 1 | 5–10 completed-SR conversations | Golden dataset has zero successful completions | Lokesh / BD team |
| 2 | Real lead bucket distribution (% per B1–B6) | Weight eval cases by actual volume | Eugene / Lokesh |
| 3 | Channel priority: web-first, WA-first, or both? | Determines calibration effort allocation | Karl |
| 4 | Thumbs up/down on persona catalog above | Confirm cases represent real patterns | Full team |
| 5 | Downstream SR outcome data (did leads actually buy?) | Validate eval scores against real conversion | Lokesh |

Summary

The eval should measure outcomes, not process. Did the lead get value? Did we qualify correctly? Did the conversation reach the right endpoint? If yes — the bot passed, regardless of whether it used cards or text, stages or conversation.

Separate what's universal (conversation quality, lead outcomes, endpoint correctness) from what's architecture-specific (staged pipeline checks, card behavior, tool call ordering). The eval should work for whatever the team ships — WhatsApp, web, unified, or something new.