Activation Bot Eval Framework v2

Outcome-based scoring, channel-aware calibration, persona catalog for team review
24 February 2026 · Sourcy Internal
  • Cases: 32 real conversations, 8 tested as structured eval personas
  • Behavior Buckets: 6 (B1–B6 coverage)
  • Scoring Dims: D1–D5 (per-turn quality)
  • Channels: 2 (WhatsApp + Web UI)

I. The Core Principle

Outcome-based eval, not process-based. The eval should check: Did the lead get value? Did we qualify correctly? Did the conversation reach the right endpoint? — NOT: Did the bot follow Stage 1 → 2 → 3 → 4?

This framework builds on Eugene's adoption of D1–D5 scoring and the 27 golden cases, with three upgrades:

  1. Separate conversation eval from integration tests. "Did the bot call productIntelligence before pricingIntelligence?" is a unit test for the web pipeline, not a conversation quality metric. Keep those in CI/CD, not in the eval rubric.
  2. Channel-aware calibration. WhatsApp has no cards — the text IS the delivery. Web UIs show rich cards where text is a short transition. Same quality standards, different calibration per channel.
  3. Endpoint ranges, not fixed points. A better bot might convert someone we expected to exit. Define failure modes to avoid, not exact correct outcomes.

II. Scoring Framework

Per-Turn Quality: D1–D5 (unchanged)

| Dim | What It Measures | 2 (Pass) | 0 (Fail) |
|---|---|---|---|
| D1 | Message Length | 1–3 lines, no bullet dumps | 6+ lines, process explanations |
| D2 | Value Delivery | Specific price, material tradeoff, market insight | Only asked a question, zero data |
| D3 | Qualification | Got a qualifying signal (budget, qty, intent) | No qualification attempt |
| D4 | Conversation Discipline | One question max, acknowledged lead | Multiple questions, ignored context |
| D5 | Last Message Test | Lead hooked OR we learned a key signal | Wasted turn — nothing hooked or qualified |

Pass threshold: avg ≥ 7/10 per conversation + all binary checks pass.
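
For concreteness, a minimal sketch of the pass decision, assuming per-turn D1–D5 scores of 0 or 2 as in the table above; the function names and data shapes are illustrative, not an existing harness API:

```python
from statistics import mean

DIMS = ["D1", "D2", "D3", "D4", "D5"]  # per-turn dimensions, each scored 0 (fail) or 2 (pass)

def turn_score(turn: dict[str, int]) -> int:
    """Total for one bot turn: 5 dims x 2 points = max 10."""
    return sum(turn[d] for d in DIMS)

def conversation_passes(turns: list[dict[str, int]], binary_checks: dict[str, bool]) -> bool:
    """Pass = average turn score >= 7/10 AND every binary check (next section) holds."""
    return mean(turn_score(t) for t in turns) >= 7 and all(binary_checks.values())
```

Note that the binary checks act as hard gates: a conversation averaging 9/10 that drops a markdown table into WhatsApp still fails.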

Binary Checks (per conversation)

| Check | What It Catches | Scope |
|---|---|---|
| RESTRICTED_HOLD | Bot refused restricted products firmly | Universal |
| BUDGET_MATH | Bot did honest math on unrealistic budgets | Universal |
| EXIT_DOOR | Bot left a specific price door open when exiting | Universal |
| LANGUAGE_MATCH | Bot responded in the lead's language | Universal |
| FORMAT_OK | No markdown tables; WhatsApp-safe (WA) / card-safe (Web) | Channel-specific |
What about staged pipeline checks (CK-001 through CK-004)? Checks like "did the bot call productIntelligence in the right order?" test a specific code architecture, not conversation quality. These are valid as integration tests for the web bot's pipeline — but they don't belong in the conversation eval rubric. If the pipeline is fixed, it's an if/then/else, not an eval.
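
To make the separation concrete, a CK-style ordering check can live in CI as a simple assertion over a tool-call trace; the tool names come from this doc, but the trace format is a hypothetical sketch, not the real pipeline API:

```python
# Integration test for the web pipeline: lives in CI/CD, NOT in the eval rubric.
# Assumes the harness records an ordered list of tool-call names per conversation
# (a hypothetical format for illustration).

def check_pipeline_order(tool_call_trace: list[str]) -> bool:
    """CK-style check: productIntelligence must run before pricingIntelligence."""
    if "pricingIntelligence" not in tool_call_trace:
        return True  # nothing to order-check if pricing never fired
    if "productIntelligence" not in tool_call_trace:
        return False
    return tool_call_trace.index("productIntelligence") < tool_call_trace.index("pricingIntelligence")

assert check_pipeline_order(["productIntelligence", "pricingIntelligence"])
assert not check_pipeline_order(["pricingIntelligence", "productIntelligence"])
```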

Endpoint Scoring (per conversation)

Instead of one fixed correct endpoint, define ranges:

| Field | Purpose | Example (Syed, handwash, 25 PKR) |
|---|---|---|
| must_not_reach | Genuine failure modes | COMPLETE_SR at 25 PKR (dishonest math) |
| acceptable | Valid outcomes | EXIT_POLITE; QUALIFY_AND_ADVANCE at realistic price |
| optimal | Best case (bonus only) | QUALIFY_AND_ADVANCE (Syed accepts higher price) |
The Jesus subtlety. v7-Jesus qualified the lead in 3 turns: budget math done, all 3 categories priced, logo question asked. But it didn't collect an email. In WhatsApp, that's a warm handoff to the growth team. In the web UI, the bot drives through a contact form. Both are correct — the eval should measure "did the lead advance and get value?" not "did the bot collect email?"
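
A sketch of how these ranges could be encoded per case, using the Syed example from the table; the field names mirror the table, the endpoint labels are the ones this doc already uses, and the REVIEW outcome for unexpected endpoints is an assumed convention, not a decided one:

```python
from dataclasses import dataclass, field

@dataclass
class EndpointSpec:
    """Per-case endpoint ranges: failure modes to avoid, not one fixed answer."""
    must_not_reach: set[str]                        # hard fail if hit
    acceptable: set[str]                            # pass
    optimal: set[str] = field(default_factory=set)  # bonus only, never required

def score_endpoint(reached: str, spec: EndpointSpec) -> str:
    if reached in spec.must_not_reach:
        return "FAIL"
    if reached in spec.optimal:
        return "PASS_BONUS"
    return "PASS" if reached in spec.acceptable else "REVIEW"  # unexpected -> human look

# Case 15 (Syed, handwash): completing an SR at the original 25 PKR is dishonest math.
syed = EndpointSpec(
    must_not_reach={"COMPLETE_SR"},  # at the original price; a realistic-price SR is fine
    acceptable={"EXIT_POLITE", "QUALIFY_AND_ADVANCE"},
    optimal={"QUALIFY_AND_ADVANCE"},
)
```

The REVIEW bucket is the point of the range design: an endpoint outside all three sets routes to a human instead of auto-failing, so a better bot that converts an expected-exit lead isn't punished.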

III. Persona Catalog — Team Review

Below are all known test cases, classified by behavior bucket and source. Team: review each case — does it represent a real lead pattern Sourcy faces?

Behavior Buckets

| Bucket | Pattern | What the Bot Should Do | Cases |
|---|---|---|---|
| B1 | Budget/Quantity Challenge | Honest math, exit if unrealistic, leave price door open | 5 |
| B2 | Vague / No Specs | Draw out specs with value delivery, qualify intent | 3 |
| B3 | Qualified / Defined Product | Price immediately, drive toward SR completion | 14 |
| B4 | Restricted / Impossible | Firm decline, suggest alternatives if possible | 4 |
| B5 | Branded / IP Products | Clarify sourcing limitations, redirect to custom | 1 |
| B6 | Ghost / Non-responsive | One follow-up, then exit gracefully | 5 |

Coverage Distribution

  • B3 Qualified: 14 cases
  • B1 Budget: 5 cases
  • B6 Ghost: 5 cases
  • B4 Restricted: 4 cases
  • B2 Vague: 3 cases
  • B5 Branded: 1 case
Question for the team: What % of Sourcy's real inbound leads fall into each bucket? If B3 is 60% of real volume but only ~44% of test cases (14 of 32), we're under-testing the most important path. We need the real distribution to weight correctly.
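
Once the real distribution arrives (Ask #2 below), weighting is one line of arithmetic. A sketch; the bucket shares are placeholders, not real Sourcy data:

```python
# PLACEHOLDER shares pending the team's answer to Ask #2; not real Sourcy data.
real_share = {"B1": 0.15, "B2": 0.10, "B3": 0.60, "B4": 0.05, "B5": 0.05, "B6": 0.05}

def weighted_pass_rate(pass_rate_by_bucket: dict[str, float]) -> float:
    """Weight per-bucket pass rates by real inbound share, not test-case count."""
    return sum(real_share[b] * pass_rate_by_bucket.get(b, 0.0) for b in real_share)
```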

B3 Qualified / Defined Product — 14 cases

Lead has a defined product, reasonable specs. Bot should price immediately and drive toward SR completion.

| # | Lead | Product | Region | Source File | Type | Endpoints |
|---|---|---|---|---|---|---|
| 1 | Jesús Mendoza | Jerseys, shorts, socks — $70K MXN | Mexico | Good/Jesús Lizandro Mendoza Mora chat history (Inbound).txt | REAL | QUALIFY, COMPLETE_SR |
| 2 | Edamama (Bren/Bea) | Playmats, play gyms, nursery | Philippines | Good/Edamama.txt | REAL | QUALIFY, COMPLETE_SR |
| 3 | Alliah / Armada Brands | Gummy supplements | | Good/Armada Brands - gummy supplement.txt | REAL | QUALIFY, COMPLETE_SR |
| 4 | Alessandro / Enrique | Dog harness sets | | Good/Dog Harness.txt | REAL | QUALIFY, COMPLETE_SR |
| 5 | Frederic / Chimaera | Leather bags (duffle, tote, wallet) | TH / CA | Good/chimaera world P1.txt | REAL | QUALIFY, COMPLETE_SR |
| 6 | Tolanikawo | T-shirts, leather handbags | Nigeria | Good/Tolanikawo chat history (Inbound).txt | REAL | QUALIFY, COMPLETE_SR |
| 7 | Jammaica / Little Luna's | Pastry & drinks packaging | Philippines | Good/P1 PH leads asking for a call.txt | REAL | CALL_HANDOFF |
| 8 | Thailand Shuttlecocks | Premium shuttlecocks (4000 tubes/mo) | Thailand | Good/P1 TH leads.txt | REAL | QUALIFY, COMPLETE_SR |
| 9 | Bala Di Gala | General sourcing (quotation stage) | | Good/WhatsApp Chat - Sourcy __ Bala Di Gala (1).txt | REAL | QUALIFY |
| 10 | Roy / Frank | Unspecified (call scheduled) | Malaysia | Good/WhatsApp Chat with Sourcy Roy.txt | REAL | CALL_HANDOFF |
| 11 | Matt / KIMO | Kids vitamins (multivitamin, calcium) | Philippines | WA Chats - BD Team/Sourcy_KIMO Kids Vitamins PH/ | REAL | QUALIFY, COMPLETE_SR |
| 12 | Kindnest | Baby/kids products | | Good/Copy of Kindnest Chat.docx | REAL | QUALIFY |
| 13 | Oaken Lab | Personal care products | Indonesia | Good/Oaken Lab - ID client.docx | REAL | QUALIFY, COMPLETE_SR |
| 14 | Fran | | | Good/Fran.docx | REAL | QUALIFY |

B1 Budget / Quantity Challenge — 5 cases

Lead has a product but budget or quantity is unrealistic. Bot should do honest math and exit gracefully if the numbers don't work.

| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 15 | Syed / VCare | Hand wash 500ml — 25 PKR (~$0.09) | Pakistan | Good/handwash SR.txt | REAL | COMPLETE_SR at original price |
| 16 | Candle Student | Candle materials (molds, wicks, jars) | Pakistan | Bad/bad example 1.txt | REAL | COMPLETE_SR (hobby qty) |
| 17 | femmoraaa | Jewelry/accessories (IG teen) | Pakistan | Bad/bad example 5 - femmoraaa jewelry teenager.txt | REAL | COMPLETE_SR (no budget) |
| 18 | Jersey Low Qty | Jerseys (very small order) | Réunion | Bad/bad example 6 - jersey low qty.txt | REAL | COMPLETE_SR (below MOQ) |
| 19 | Anam | Jewelry, bags, makeup (no specs) | Pakistan | Bad/bad example 2.txt | REAL | COMPLETE_SR |

B4 Restricted / Impossible Product — 4 cases

Product is restricted, not sourceable, or not a physical product. Bot should decline firmly.

| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 20 | Battery/Fuses | Batteries, fuses, connectors | Pakistan | Bad/bad example 3.txt | REAL | COMPLETE_SR (restricted) |
| 21 | Anthony | AirPods (50 units, branded) | Malaysia | Bad/bad example 4.txt | REAL | COMPLETE_SR (branded resale) |
| 22 | PUBG | PUBG UC (gaming credits) | Afghanistan | Bad/bad example 5 - PUBG.txt | REAL | COMPLETE_SR (not physical) |
| 23 | Jose / Motorcycles | Motorcycles | Ecuador | Bad/bad example 7 - motorcycles.txt | REAL | COMPLETE_SR (not sourceable) |

B2 Vague / No Specs — 3 cases

Lead hasn't specified a product. Bot should draw out specs with value delivery, not waste turns on process.

| # | Lead | Product | Region | Source File | Type | Must NOT Reach |
|---|---|---|---|---|---|---|
| 24 | Copypaste | Unclear (spam-like) | India | Bad/bad example 8 - copypaste.txt | REAL | COMPLETE_SR |
| 25 | Jorge Vague | Unclear (no product) | US | Bad/bad example 9 - jorge vague.txt | REAL | COMPLETE_SR |
| 26 | Ghost Inquiry | No product specified, stopped responding | US | Bad/bad example ghost 1.txt | REAL | COMPLETE_SR |

B5 Branded / IP Products — 1 case

Lead is asking for a branded product (not custom sourcing). Bot should clarify limitations, redirect to custom alternatives.

| # | Lead | Product | Region | Source File | Type | Endpoints |
|---|---|---|---|---|---|---|
| 27 | Nina Chua / Foxmont | Owala water bottles (branded) | Philippines | WA Chats - BD Team/WhatsApp Chat - Foxmont Owala/ | REAL | REDIRECT_CUSTOM, EXIT_POLITE |

B6 Ghost / Non-responsive — 5 cases

Lead stopped responding entirely. Bot should send one follow-up, then exit gracefully.

| # | Lead | Region | Source File | Type |
|---|---|---|---|---|
| 28 | Ghost 1 | | Bad/bad example ghost 1.txt | REAL |
| 29 | Ghost 2 | | Bad/bad example ghost 2.txt | REAL |
| 30 | Ghost 3 | | Bad/bad example ghost 3.txt | REAL |
| 31 | Ghost 4 | | Bad/bad example ghost 4.txt | REAL |
| 32 | Ghost 5 | | Bad/bad example ghost 5.txt | REAL |

Structured Eval Personas — 8 cases

8 of the 32 real conversations have been structured for automated eval runs with controlled parameters. Each is grounded in a real WA lead conversation and scored against D1–D5.

| Persona | Based On | Bucket | Eval Score | Source Convo |
|---|---|---|---|---|
| Anam (zero-spec dreamer) | Bad ex 2 — no specs, vague | B2 | 7.8/10 PASS | Bad/bad example 2.txt |
| femmoraaa (IG jewelry teen) | Bad ex 5 — teenager, no budget | B1 | 9.0/10 PASS | Bad/bad example 5 - femmoraaa |
| Jesus (sportswear Mexico) | Good/Jesús Mendoza | B3 | 8.0/10 PASS | Good/Jesús Lizandro Mendoza Mora |
| Syed (handwash Karachi) | Good/handwash SR | B1 | 8.5/10 PASS | Good/handwash SR.txt |
| Battery (restricted + fuses) | Bad ex 3 — restricted items | B4 | 8.5/10 PASS | Bad/bad example 3.txt |
| Anthony (AirPods reseller) | Bad ex 4 — branded, low qty | B4 | 9.0/10 PASS | Bad/bad example 4.txt |
| Jammaica (call handoff) | Good/P1 PH leads | B3 | 10.0/10 PASS | Good/P1 PH leads asking for a call.txt |
| Candle (hobby student) | Bad ex 1 — student, low qty | B1 | 8.5/10 PASS | Bad/bad example 1.txt |

Data Gaps
  • No completed-SR conversations. All current cases are qualifying/advancing or failed. We need 5–10 conversations where the lead actually completed an SR — Sourcy is collecting this data daily.
  • B5 (Branded) is under-represented — only 1 case. Branded product requests may be a significant % of inbound.
  • All source files come from context/Good/, context/Bad/, and WA Chats - BD Team/ — these are the same files in Eric's GitHub repo. Newer operational data should be pulled in as the dataset grows.
Ask #1: Can we get 5–10 completed-SR conversations from recent operations? These are the most valuable additions to the dataset.

Ask #2: What is the real inbound lead distribution across B1–B6? This helps us weight test cases correctly.

Ask #4: For each case above — does it represent a real lead pattern Sourcy faces? Any missing patterns?

IV. Channel Trade-offs — WhatsApp vs Web UI

The activation bot operates on two channels with fundamentally different delivery modes. The eval should recognize both as valid, not penalize one for not being the other.

WhatsApp (Conversational)

  • Text IS the delivery — no cards, no forms
  • Prices-first in natural language
  • Warm handoff to growth team for SR completion
  • Lower friction → more leads stay engaged
  • Proven: 8.8/10 avg across 8 structured eval personas

Web UI (Staged Pipeline)

  • Rich cards carry data — text is a short transition
  • Structured forms collect SR fields
  • Bot drives through to SR completion in-conversation
  • Less human follow-up needed
  • Better for leads who want visual/structured experience
These are trade-offs, not a ranking. If human ops cost is high → the staged web UI reduces follow-up work. If conversion rate matters most → conversational WA may keep more leads engaged. If lead quality matters → both can work, with different trade-offs. The eval framework should measure outcomes for BOTH, calibrated per channel.

Channel-Specific Calibration

| Eval Element | WhatsApp | Web UI |
|---|---|---|
| D1 (Message Length) | 1–3 lines of text = concise | Short transition text + card = concise |
| D2 (Value Delivery) | Price in text = value delivery | Price in card = value delivery |
| FORMAT_OK | No markdown, no tables, WhatsApp-safe | Cards render correctly, no data repetition |
| SR Completion | QUALIFY_AND_ADVANCE (human closes) | COMPLETE_SR (bot collects form) |
| Pipeline checks | N/A (no staged pipeline) | Integration test (CI/CD, not eval) |
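
One way to keep "same quality standards, different calibration" honest is a single per-channel config consumed by the same scorers. The values restate the table above; the config structure itself is an assumption, not existing code:

```python
# Same D1-D5 scorers and binary checks run for both channels; only this
# calibration table differs. Keys and check names are illustrative.
CHANNEL_CALIBRATION = {
    "whatsapp": {
        "d1_max_lines": 3,                   # the text IS the delivery
        "value_carriers": {"text"},          # D2: the price must appear in the text
        "format_checks": ["no_markdown", "no_tables", "wa_safe"],
        "sr_completion_endpoint": "QUALIFY_AND_ADVANCE",  # human closes
    },
    "web": {
        "d1_max_lines": 3,                   # short transition text; cards carry the data
        "value_carriers": {"text", "card"},  # D2: a price in a card counts
        "format_checks": ["cards_render", "no_data_repetition"],
        "sr_completion_endpoint": "COMPLETE_SR",          # bot collects the form
    },
}
```
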
Ask #3: Channel priority — is the team going web-first, WA-first, or both simultaneously? This determines which channel gets more golden test cases and calibration effort.

V. Two-Tier Judge Architecture

Eugene's automated GPT judge is the right direction for scale. To make it accurate, we calibrate it with a full-context agent judge.

| | Tier 1: Agent Judge (Calibration) | Tier 2: Prompt Judge (Scale) |
|---|---|---|
| What | Full-context agent (Opus 4.6) with business model, rubric examples, channel awareness, lead behavior patterns | Lightweight GPT prompt with D1–D5 definitions + calibration examples from Tier 1 |
| Scores | Top 10 golden cases, deeply, with reasoning | All 32 cases, quickly, every commit |
| Output | Ground-truth scores = calibration reference | Scaled scores; flags divergence from Tier 1 |
| Frequency | Once per prompt version | Every commit / prompt change |
| Cost | ~$2–5 per run (10 cases) | ~$0.10–0.50 per run (all cases) |

Workflow: Eric runs Tier 1 on top golden cases → publishes scored output with reasoning → Eugene feeds those as calibration examples into the Tier 2 judge prompt → Tier 2 runs at scale and flags divergence.
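
The divergence flag can be mechanical. A sketch; the 1.5-point threshold is an arbitrary illustration, not a decided value:

```python
def flag_divergence(tier1: dict[str, float], tier2: dict[str, float],
                    threshold: float = 1.5) -> list[str]:
    """Return case IDs where the scaled judge drifts from agent ground truth.

    tier1: case_id -> Tier 1 calibration score (top golden cases only)
    tier2: case_id -> Tier 2 prompt-judge score (all cases, every commit)
    """
    return [case for case, truth in tier1.items()
            if case in tier2 and abs(tier2[case] - truth) > threshold]
```

Flagged cases are exactly the ones worth re-running through Tier 1 before trusting a pass/fail trend.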


VI. Version Progression

| Version | Personas | Pass Rate | Avg Score | Key Change |
|---|---|---|---|---|
| v1–v3 | 6–8 | 67% → 100% | | Baseline → call handoff, WHY enforcement |
| v4 | 14 | Issues found | | Adversarial personas exposed gaps |
| v6b | 5 | 60% | | Rubric upgrade — introduced D1–D5 strict scoring; scores dropped because the rubric got real, not because the bot got worse |
| v7 | 8 | 100% | 8.8/10 | Prices-first rule, one-liner rules; highest single-change leverage: every category mention gets a price range |
v6b was a rubric upgrade, not a regression. The v1–v5 rubric was lenient. v6b introduced per-turn D1–D5 scoring with hard message length caps. Pass rate dropped from 100% to 60% — because the eval was catching real gaps (process dumps, missed value delivery) that the old rubric couldn't detect. v7 fixed those gaps. This is the methodology working as designed.

VII. This Week — Deliverables & Asks

What Eric Delivers (by Thursday)

| Deliverable | Status | For Whom |
|---|---|---|
| This eval framework doc | Done | Full team |
| Persona catalog (above) for team review | Done | Full team |
| Detailed feedback to Eugene on eval tool | Done | Eugene |
| Tier 1 agent judge — scored top 10 golden cases | In progress | Eugene (calibration data) |
| Supplier bot 157-rule review — initial positioning | Wed | Thursday call |

What Eric Needs From the Team

| # | Ask | Why | From |
|---|---|---|---|
| 1 | 5–10 completed-SR conversations | Golden dataset has zero successful completions | Lokesh / BD team |
| 2 | Real lead bucket distribution (% per B1–B6) | Weight eval cases by actual volume | Eugene / Lokesh |
| 3 | Channel priority: web-first, WA-first, or both? | Determines calibration effort allocation | Karl |
| 4 | Thumbs up/down on persona catalog above | Confirm cases represent real patterns | Full team |
| 5 | Downstream SR outcome data (did leads actually buy?) | Validate eval scores against real conversion | Lokesh |

Summary

The eval should measure outcomes, not process. Did the lead get value? Did we qualify correctly? Did the conversation reach the right endpoint? If yes — the bot passed, regardless of whether it used cards or text, stages or conversation.

Separate what's universal (conversation quality, lead outcomes, endpoint correctness) from what's architecture-specific (staged pipeline checks, card behavior, tool call ordering). The eval should work for whatever the team ships — WhatsApp, web, unified, or something new.