Sourcy Activation Bot — Superbowl Evaluation

Bot	Rubric A avg /10	Pass rate
v7 (Eric/Donna)	8.7	8/8 PASS
Ivan (Gemini)	3.5	0/8 PASS
Eugene (Next.js)	4.7	0/8 PASS

What We Tested

8 real lead personas from actual Sourcy WhatsApp conversations — ranging from zero-spec dreamers to B2B professionals with exact price targets. Each persona was run through all 3 bots via their live APIs.

Rubric A (ours): 5 dimensions, each scored 0-2 per turn: Message Length, Value Delivery, Qualification, Conversation Discipline, and Last Message Test. Pass threshold: 7.0/10 avg.

Rubric B (Eugene's 5 Success Conditions): Immediate Capability Proof, Visible Intelligence, Low Cognitive Load, Control/Safety, Clear Momentum. Scored 0-2 per conversation, max 10.

The Three-Line Summary

v7 = Sourcing expert + conversation designer. Prices first, 1-3 lines, one question. Passes both rubrics.

Ivan = Deep sourcing knowledge. Knows chipsets, IFRA standards, landed costs — needs tighter message formatting for WhatsApp (currently 6-12 lines per response).

Eugene = Strong conversational UX, but hasn't yet injected domain knowledge. Perfect message discipline — needs prices and category expertise to match.

Overall Scores

Metric	v7	Ivan	Eugene
Rubric A (D1–D5) avg	8.7	3.5	4.7
Rubric A pass rate	8/8	0/8	0/8
Rubric B (5 Conditions) avg	9.8	4.1	3.3
Binary checks	All pass	RESTRICTED, FORMAT fail	BUDGET, EXIT fail

Per-Persona Heatmap

Persona	Type	v7	Ivan	Eugene
anam	Zero-spec dreamer	7.8	2.2	5.2
femmoraaa	IG jewelry teen	9.0	4.0	5.3
battery	Restricted + Pakistan	8.5	4.5	6.5
jesus	Spanish, 70K MXN	8.0	3.0	5.3
anthony	AirPods reseller	9.0	4.5	3.5
candle	Student, tiny qty	9.0	4.0	4.5
jammaica	Zoom handoff	10.0	3.0	6.0
syed	B2B handwash	8.5	4.0	2.5
Overall		8.7	3.5	4.7

Rubric A: 5-Dimension Scoring (D1–D5)

Each bot turn scored 0-2 on 5 dimensions. Max 10/turn. Conversation pass = avg ≥ 7.0 across all turns + binary checks pass.
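The pass rule above can be sketched in Python. This is a minimal illustration, not the scorer used for this evaluation: the function names and the data layout (one dict of D1 to D5 scores per turn, one dict of binary check results per conversation) are assumptions, and the example scores are invented.

```python
from statistics import mean

DIMENSIONS = ("D1", "D2", "D3", "D4", "D5")
PASS_THRESHOLD = 7.0  # from the rubric: avg >= 7.0 across all turns


def turn_score(turn: dict[str, int]) -> int:
    """Sum the five 0-2 dimension scores for one bot turn (max 10)."""
    for dim in DIMENSIONS:
        if turn[dim] not in (0, 1, 2):
            raise ValueError(f"{dim} must be 0, 1, or 2, got {turn[dim]}")
    return sum(turn[dim] for dim in DIMENSIONS)


def conversation_passes(turns: list[dict[str, int]],
                        binary_checks: dict[str, bool]) -> bool:
    """Pass = average turn score >= 7.0 AND every binary check passes."""
    avg = mean(turn_score(t) for t in turns)
    return avg >= PASS_THRESHOLD and all(binary_checks.values())


# Hypothetical two-turn conversation (scores invented for illustration).
turns = [
    {"D1": 2, "D2": 2, "D3": 1, "D4": 2, "D5": 2},
    {"D1": 2, "D2": 1, "D3": 1, "D4": 2, "D5": 1},
]
checks = {"RESTRICTED_HOLD": True, "FORMAT_OK": True}
print(conversation_passes(turns, checks))
```

Keeping the binary checks separate from the average means a single hard failure (for example a restricted-item miss) fails the conversation even when the turn scores are high.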

Dimension Definitions

Dim	Name	2 (Pass)	0 (Fail)
D1	Message Length	1-3 lines. No bullets, no lists.	6+ lines, bullet dumps, walls.
D2	Value Delivery	Specific price, material tradeoff, or market insight.	Only asked a question. No value.
D3	Qualification	Got a qualifying signal (budget, qty, intent).	No qualification attempt.
D4	Conversation Discipline	One question max. Acknowledged lead.	Multiple questions, didn't acknowledge lead input.
D5	Last Message Test	Lead got hooked or we learned a key signal.	If lead left now, they didn't get enough to stay.

D1: Message Length — What Good and Bad Looks Like

This is the WhatsApp constraint. Users on mobile scroll past anything longer than 3 lines. This dimension separates bots built for chat from bots built for email.
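A rough automated version of the D1 check could look like the sketch below. It is a heuristic under stated assumptions, not the scoring used here: the 90-character line width is a guess at how a phone wraps a WhatsApp message, and the score of 1 for 4-5 lines is our assumption, since the rubric only defines the 2 and 0 anchors.

```python
import re

CHARS_PER_LINE = 90  # assumption: rough width of one rendered line on a phone

# Bullets or numbered items at the start of a line ("- x", "* x", "1. x", "2) x").
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def estimated_lines(message: str) -> int:
    """Estimate rendered lines: each non-empty paragraph wraps at CHARS_PER_LINE."""
    total = 0
    for paragraph in message.splitlines():
        if paragraph.strip():
            total += -(-len(paragraph) // CHARS_PER_LINE)  # ceiling division
    return total


def score_d1(message: str) -> int:
    """2 = 1-3 lines and no lists; 0 = 6+ lines or any list; 1 = in between (assumed)."""
    lines = estimated_lines(message)
    if LIST_ITEM.search(message) or lines >= 6:
        return 0
    if lines <= 3:
        return 2
    return 1


short = "I can't source batteries, explosives, or weapons, as those fall under restricted import categories."
listy = "The process typically follows these steps:\n1. Selection: Choosing styles\n2. Sampling: Always order 3-5 samples first"
print(score_d1(short), score_d1(listy))
```

A check like this is cheap enough to run on every bot turn before sending, which is one way to enforce a hard message cap.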

v7 — avg 1.9/2.0

Consistently 1-3 lines. Example (candle persona):

"Candle supplies from China run really well! Silicone molds $0.50-3/pc, wicks $0.01-0.05/pc, dye chips $0.02-0.10/pc, fragrance oils $3-8/kg, soy/paraffin wax $1-3/kg, glass jars $0.30-2/pc. Are you starting a candle brand or supplying a craft shop?"

Every product priced. One question. Three lines.

Ivan — avg 0.1/2.0

Consistently 6-12 lines with markdown formatting. Example (femmoraaa persona):

"The process typically follows these steps:
1. Selection: Choosing styles from a supplier's catalog...
2. Sampling: Always order 3-5 samples first...
3. Production/Bulk Order: Paying a 30% deposit...
4. QC & Shipping: Checking the goods for scratches..."

Valuable content, but numbered step-by-step lists are hard to read mid-chat on WhatsApp.

Eugene — avg 1.9/2.0

Consistently short and clean. Example (battery persona):

"I can't source batteries, explosives, or weapons, as those fall under restricted import categories."

Eugene matches v7 on length. One line, clear, direct. This is a genuine strength.

D2: Value Delivery — The Core Differentiator

Does the bot give the lead something they didn't have before? A price, a material tradeoff, a market insight. This is where most of the score gap lives.

v7 — avg 1.9/2.0

Prices on virtually every turn. Example (syed persona):

"500ml custom handwash from factories runs $0.30-0.80/unit (~80-220 PKR). Your target of 25-30 PKR (~$0.09) is below raw material cost — bottle + pump alone is $0.10-0.15."

Immediate price range + honest math showing the target is below cost. The lead now knows something they didn't before.
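The budget math in that reply can be reproduced with a few lines of Python. This is a sketch, not v7's implementation: the exchange rate of 278 PKR per USD is an assumption chosen to match the conversions quoted in this report (25 PKR is roughly $0.09, 80 PKR is roughly $0.30), and the function names are hypothetical.

```python
PKR_PER_USD = 278.0  # assumption: approximate rate implied by the quoted conversions


def pkr_to_usd(amount_pkr: float, rate: float = PKR_PER_USD) -> float:
    """Convert a PKR amount to USD at the assumed rate."""
    return amount_pkr / rate


def target_below_factory_range(target_high_pkr: float,
                               factory_low_usd: float,
                               rate: float = PKR_PER_USD) -> bool:
    """True when even the top of the lead's target is under the cheapest factory quote."""
    return pkr_to_usd(target_high_pkr, rate) < factory_low_usd


# syed persona: target 25-30 PKR per unit, factory range $0.30-0.80 per unit.
target_low_usd = pkr_to_usd(25)
target_high_usd = pkr_to_usd(30)
unviable = target_below_factory_range(30, factory_low_usd=0.30)
print(f"target ${target_low_usd:.2f}-${target_high_usd:.2f}, below factory range: {unviable}")
```

Comparing the top of the lead's range against the bottom of the factory range is the conservative test: if that fails, no combination of the two ranges overlaps.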

Ivan — avg 1.4/2.0

Ivan actually has strong knowledge — it just takes 3-4 turns to surface. Example (jesus persona, T3):

"$70,000 MXN (~$4,100 USD)... camisetas de fútbol de alta calidad $5-$12 USD, tenis $15-$35 USD... aranceles elevados (hasta el 25-30% + IVA)."

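Real prices and customs data: high-quality soccer jerseys at $5-12 USD, sneakers at $15-35 USD, and tariffs of up to 25-30% plus VAT. The value is present, but it is buried in a 12-line response, which tanks D1 and D4.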

Eugene — avg 0.2/2.0

No prices in any conversation. Example (femmoraaa, after she asked about investment cost):

"Are you interested in necklaces, bracelets, earrings, rings, or something else? Also, what materials are you thinking of — stainless steel, alloy, beads?"

She specifically asked "how much investment" — the response asked two more narrowing questions instead. This is the highest-leverage improvement area: inject price ranges when a category is mentioned.
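One way to picture "inject price ranges when a category is mentioned" is a lookup that runs before the narrowing question. The sketch below is hypothetical and is not how v7 is built. The price strings are the ranges v7 quoted in this report, but the keyword matching, the dict, and the function name are assumptions, and a production bot would load prices from a maintained source.

```python
# Price ranges quoted by v7 elsewhere in this report (hard-coded for illustration).
PRICE_RANGES = {
    "jewelry": "PVD gold stainless steel jewelry runs $1-4/piece.",
    "earbuds": "TWS earbuds from factories run $2-8/unit.",
    "handwash": "500ml custom handwash runs $0.30-0.80/unit.",
    "packaging": "Food packaging runs $0.15-1.50/unit.",
}


def reply_with_price(lead_message: str, question: str) -> str:
    """Lead with a price anchor when a known category appears, then ask one question."""
    text = lead_message.lower()
    for keyword, price_sentence in PRICE_RANGES.items():
        if keyword in text:
            return f"{price_sentence} {question}"
    # No known category: fall back to the question alone.
    return question


reply = reply_with_price(
    "How much investment do I need to start a jewelry business?",
    "Roughly how many pieces do you want to start with?",
)
print(reply)
```

The question still goes out, so message discipline is unchanged; the lead simply gets a number before being asked for more.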

D3: Qualification — Learning About the Lead

Did the bot extract a useful signal (budget, quantity, timeline, business type) from the exchange? All three bots attempt this, but via different strategies.

v7 — avg 1.3/2.0

Qualification embedded in value delivery — one question per turn. Example (anthony):

"We can't source branded AirPods (Apple IP), but we can find similar TWS earbuds from factories at $2-8/unit. What's your target quantity?"

Value first (price + IP flag), then one qualifying question. The lead doesn't feel interrogated.

Ivan — avg 0.7/2.0

Asks 2 questions per turn — more info extracted per turn, but the lead feels surveyed. Example (battery):

"1. Could you provide the exact amperage for the 8 fuses...?
2. What is your destination market (country) and your target budget?"

Both are good questions. But numbered Q&A format reads like a form, not a conversation.

Eugene — avg 0.7/2.0

Clean one-question approach, but the questions are often spec-narrowing rather than business-qualifying. Example (candle):

"What kind of wax are you thinking of — soy, beeswax, paraffin, or something else?"

Good product question. But it doesn't reveal budget, quantity, or whether the lead is a business or hobbyist — which determines if factory sourcing even makes sense.

D4: Conversation Discipline — Respecting What the Lead Said

Does the bot acknowledge what the lead just told them before asking the next question? Does it stay at one question per turn? This is the "feels human" dimension.

v7 — avg 2.0/2.0

Consistently acknowledges and keeps to one question per turn. Example (femmoraaa T2):

"PVD gold stainless steel jewelry, tarnish-free — great pick. Runs $1-4/piece..."

First 5 words mirror her exact ask. Then value. Then one question. She feels heard.

Ivan — avg 0.2/2.0

Frequently asks two questions per turn and doesn't always acknowledge the lead's input. Example (jammaica, who asked for Zoom):

"While I operate as an AI assistant to streamline your specifications first... are you looking for custom-branded designs or standard plain packaging, and what are your estimated monthly volumes?"

Lead asked for a Zoom call. Bot deflected to spec collection. Two questions at the end. The lead's request wasn't addressed.

Eugene — avg 1.5/2.0

Strong discipline — one question per turn, short format. Example (jesus T2):

"Got it! So you're interested in athletic shoes, sportswear for men and women, and soccer jerseys. ⚽ Is that right?"

Acknowledges with a summary, one confirmation question. Clean. Loses points when it repeats questions the lead already answered (syed: "Is this standard or custom?" asked twice).

D5: Last Message Test — Would You Reply to This?

If the lead closed WhatsApp right now, did this exchange give them a reason to come back? Did they learn something, or just get asked another question?

v7 — avg 1.7/2.0

Lead walks away knowing prices. Example (jesus T2):

"Tenis deportivos con tu logo van de $8-15/par, ropa deportiva $4-12/pieza, y camisas de futbol $5-10/unidad."

Even if he never replies, he now has price ranges for all 3 categories. That alone might bring him back.

Ivan — avg 0.9/2.0

The value is in the response, but it takes effort to extract. Example (anthony T2 ending):

"...which specific model are you targeting (e.g., the Pro style with silicone tips or the standard Gen 2/3 style) and what is your total budget for these 50 units?"

The last thing the lead reads is two questions — not a reason to stay. The useful price data ($2.50-4.50 landed) was earlier in the same message but may not register.

Eugene — avg 0.3/2.0

Most turns end with a narrowing question and nothing else. Example (anam T5 — lead said "no idea" twice already):

"No worries! Let's try this: are you thinking more of face makeup, eye makeup, or lip products?"

Friendly tone, but the lead still has no information to anchor to. If they close the chat, they haven't learned anything new about sourcing.

Dimension Averages (All 8 Personas, All Turns)

Dimension	v7	Ivan	Eugene
D1 Length	1.9	0.1	1.9
D2 Value	1.9	1.4	0.2
D3 Qualification	1.3	0.7	0.7
D4 Discipline	2.0	0.2	1.5
D5 Last Msg	1.7	0.9	0.3

Binary Checks

Check	v7	Ivan	Eugene
RESTRICTED_HOLD	PASS	FAIL — offered DG handling for batteries	PASS
BUDGET_MATH	PASS	PASS (buried in walls)	FAIL — didn't engage with 70K MXN or 25-30 PKR
EXIT_DOOR	PASS	Partial	FAIL — small qty not flagged
LANGUAGE_MATCH	PASS	PASS	PASS
FORMAT_OK	PASS	FAIL — markdown everywhere	PASS

Key Insight

Ivan and Eugene are complementary. Of the two demos, Ivan scores higher on D2 Value (1.4 vs 0.2) but lower on D1 Length and D4 Discipline (0.1 and 0.2 vs 1.9 and 1.5); Eugene is the mirror image. v7 combines the strengths of both, which took 7 iterations to achieve. The path forward is merging these strengths.

Rubric B: Eugene's 5 Success Conditions

Each condition scored 0-2 per full conversation (not per turn). This is the framework Eugene presented from the internal Superbowl analysis — so we're evaluating his own bot on his own rubric.

Condition Definitions

#	Condition	What it means
1	Immediate Capability Proof	Does the bot prove in the FIRST reply that it can actually help? (price, insight, category expertise — not just "what product?")
2	Visible Intelligence	Does the bot demonstrate it knows something the lead doesn't? (materials, MOQs, market dynamics, cost breakdowns)
3	Low Cognitive Load	Is the conversation effortless for the lead? (short messages, no jargon lectures)
4	Control/Safety	Does the lead feel the bot won't waste their time or money? (honest pricing, restricted holds, exit when unviable)
5	Clear Momentum	After each exchange, is it obvious what happens next? (not stuck in loops, progressing toward decision)

Condition 1: Immediate Capability Proof

The first message sets the frame. Does the lead think "this bot can actually help me" or "this is another generic chatbot"?

v7 — avg 1.9/2.0

Proves capability on the first product mention. Example (candle — lead lists 6 product types):

"Silicone molds $0.50-3/pc, wicks $0.01-0.05/pc, dye chips $0.02-0.10/pc, fragrance oils $3-8/kg, soy/paraffin wax $1-3/kg, glass jars $0.30-2/pc."

Every item the lead mentioned gets a price instantly. The lead's mental model shifts from "chatbot" to "this person knows my industry."

Ivan — avg 1.0/2.0

Shows expertise but defers prices. Example (anam — lead says "Hi, I want to source an item"):

"In global sourcing, the product category dictates everything: consumer electronics often require strict certifications (CE or FCC)... custom apparel focuses more on fabric weight (GSM)..."

This reads like a sourcing textbook — the lead knows the bot is knowledgeable, but doesn't yet know it can help them.

Eugene — avg 0.3/2.0

Friendly opener, but no proof of domain knowledge. Example (femmoraaa — lead asks "how much investment?"):

"Sourcing jewelry and accessories from China can be a great way to start a business. To give you the best advice, let's narrow down what you're looking for. 💍"

Warm and concise, but the lead still doesn't know if the bot can actually source jewelry or just ask questions about it. Adding one price range here changes the entire perception.

Condition 2: Visible Intelligence

Does the bot know something the lead doesn't? Materials, factory dynamics, landed cost math, regulatory traps — things that make a lead think "I need this bot."

v7 — avg 2.0/2.0

Shows intelligence through pricing and trade-offs, not lectures. Example (syed — target price of 25-30 PKR):

"Your target of 25-30 PKR (~$0.09) is below raw material cost — bottle + pump alone is $0.10-0.15."

The lead now knows their price target is impossible. That's visible intelligence — one sentence, changes the deal.

Ivan — avg 1.6/2.0

Strong domain knowledge — often the deepest of all three bots. Example (syed T2):

"Your target of 25-30 PKR translates to ~$0.09-$0.11 USD. A 500ml bottle with lotion pump typically costs $0.12-$0.20 before adding liquid, labeling, and freight. Consider importing only components and filling locally."

Ivan actually proposes a creative solution (component import). This is genuinely smart. The issue is it comes mid-paragraph in a long response — easy to miss.

Eugene — avg 0.0/2.0

No domain knowledge surfaced in any conversation. Example (syed — provided company, city, exact price target):

"To get started, is this a standard liquid handwash formula or something custom?"

The lead gave price, company, city, and product spec. The response doesn't engage with any of it. This is the single biggest scoring gap — and the most fixable. Adding a price/knowledge layer to the prompt would likely lift this from 0 to 1.5+.

Condition 3: Low Cognitive Load

How much effort does the lead have to put in to understand the bot's message? Short messages, simple language, no jargon walls.

v7 — avg 2.0/2.0

Short and scannable. Example (anthony):

"We can't source branded AirPods (Apple IP), but we can find similar TWS earbuds from factories at $2-8/unit. What's your target quantity?"

One constraint, one alternative with price, one question. Takes 3 seconds to read.

Ivan — avg 0.0/2.0

High cognitive load on every turn. Example (candle T1 — a beginner asks about supplies):

"Sourcing candle-making supplies is a multi-faceted process... Soy wax remains the industry favorite for its 'clean burn' marketing, though paraffin is more cost-effective for scent throw. Jars usually see MOQs around 1,000–3,000 units for custom shapes... ensuring they meet IFRA standards is crucial for B2B compliance..."

A student who just wants to try candle-making gets IFRA compliance info and custom MOQs. The content is accurate — it's just not calibrated to the audience.

Eugene — avg 2.0/2.0

Effortless to read. Example (jesus T2):

"Got it! So you're interested in athletic shoes, sportswear for men and women, and soccer jerseys. ⚽ Is that right?"

Eugene ties with v7 here. Summaries + confirmations + emojis = zero friction. This is a genuine strength and the foundation to build on.

Condition 4: Control / Safety

Does the lead feel the bot won't waste their time or money? This includes: honest pricing, restricted-item handling, flagging unviable orders, and graceful exits.

v7 — avg 1.9/2.0

Honest even when it means losing the deal. Example (candle — student with 25 wicks):

"Those quantities are really small for factory sourcing — at 25-35 wicks, shipping alone would cost more than the products. You'd get better prices from a local craft supplier or Amazon/Shopee. When you're ready to scale to 500+ wicks, factory pricing drops 40-60%."

Told the lead "don't buy from us yet" — and gave them a concrete reentry threshold. Builds enormous trust.

Ivan — avg 0.6/2.0

Engaged with battery sourcing (restricted item) instead of redirecting. Example (battery T2):

"Shipping to Karachi (SAPT/KICT) via sea is the most cost-effective route for batteries... looking at 'Forward Booking' from the factory or a Tier-1 distributor..."

Technically accurate, but it gave the lead hope of a battery order that probably shouldn't happen. The lead doesn't feel "safe" — they feel led on.

Eugene — avg 0.8/2.0

Good on restricted items, but didn't flag unviable small orders or unrealistic prices. Example (battery T1):

"I can't source batteries, explosives, or weapons, as those fall under restricted import categories."

Clean restriction — correct behavior. But for candle (25 wicks) and syed (25-30 PKR), the bot didn't flag that the order was unviable. Adding economic viability checks would lift this score.
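An economic viability check of the kind suggested above could be as simple as comparing order value with shipping cost. This is a sketch under assumptions: the $20 minimum shipping figure is a placeholder we made up for illustration, not a number from the evaluation, and the function name is hypothetical. The wick price is the top of the $0.01-0.05/pc range v7 quoted.

```python
MIN_SHIPPING_USD = 20.0  # assumption: placeholder minimum international shipping cost


def is_unviable(quantity: int,
                unit_price_high_usd: float,
                shipping_usd: float = MIN_SHIPPING_USD) -> bool:
    """True when shipping alone costs more than the goods at their highest quoted price."""
    order_value = quantity * unit_price_high_usd
    return shipping_usd > order_value


# candle persona: 25 wicks at up to $0.05 each.
print(is_unviable(25, 0.05))
# A much larger order of the same item.
print(is_unviable(5000, 0.05))
```

Using the high end of the unit price gives the order its best case, so a flag from this check means the order is unviable even on generous assumptions.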

Condition 5: Clear Momentum

After each exchange, is the conversation moving forward? Or is it stuck asking the same type of question?

v7 — avg 2.0/2.0

Each turn progresses: product → price → quantity → next step. Example (jammaica — lead asks for Zoom):

"Absolutely, we do Zoom calls! Let me connect you with our sourcing team. Quick — food packaging runs $0.15-1.50/unit. What's your email and preferred time?"

Zoom confirmed + price anchor + two data points requested = three things accomplished in one turn.

Ivan — avg 0.6/2.0

Conversations progress slowly because each response covers too much ground. Example (jammaica — asked for Zoom):

"While I operate as an AI assistant to streamline your specifications first... For F&B packaging, common materials include food-grade Kraft paper... PET or biodegradable PLA..."

Lead wanted a Zoom call. Bot pivoted to material specs. The conversation went sideways instead of forward.

Eugene — avg 0.4/2.0

Narrowing loops can stall momentum. Example (syed — gave full spec + price target, then asked "Can you source it?"):

"Yes, we can definitely source liquid handwash for you! To confirm, is this a standard liquid handwash formula or something custom?"

Same question as T1, repeated. The lead asked a direct yes-or-no question and, after the yes, got the same narrowing question again. Adding a progress tracker or branching logic ("if lead gave spec, skip narrowing") would fix this.
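The "if lead gave spec, skip narrowing" idea can be sketched as a small slot tracker. Everything here is illustrative: the slot names, the question wording, and the `LeadState` class are assumptions, not part of Eugene's state machine.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LeadState:
    """What the bot already knows and what it has already asked (slot names are illustrative)."""
    known: dict = field(default_factory=dict)
    asked: set = field(default_factory=set)


# Slots in the order the bot would like to fill them.
SLOT_QUESTIONS = [
    ("product", "What product are you looking for?"),
    ("formula", "Is this a standard formula or something custom?"),
    ("quantity", "What's your target quantity?"),
]


def next_question(state: LeadState) -> Optional[str]:
    """Return the first question for a slot that is neither known nor already asked."""
    for slot, question in SLOT_QUESTIONS:
        if slot in state.known or slot in state.asked:
            continue  # skip narrowing the lead has answered or already been asked
        state.asked.add(slot)
        return question
    return None  # nothing left to ask: move to quote or handoff


state = LeadState(known={"product": "500ml liquid handwash"})
first = next_question(state)
second = next_question(state)
print(first)
print(second)
```

Because asked slots are recorded even when the lead does not answer, the same question cannot come back a second time, which is the loop seen in the syed conversation.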

Condition Averages (All 8 Personas)

Condition	v7	Ivan	Eugene
1. Capability Proof	1.9	1.0	0.3
2. Intelligence	2.0	1.6	0.0
3. Cognitive Load	2.0	0.0	2.0
4. Control/Safety	1.9	0.6	0.8
5. Momentum	2.0	0.6	0.4

Per-Persona Rubric B Scores

Persona	v7	Ivan	Eugene
anam	9	1	3
femmoraaa	10	5	4
battery	10	4	6
jesus	9	3	2
anthony	10	6	3
candle	10	5	2
jammaica	10	3	4
syed	10	6	2
Average	9.8	4.1	3.3
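The Average row can be re-derived from the per-persona rows. The sketch below uses Python's `decimal` module with half-up rounding, because built-in `round()` on floats rounds ties to even and would turn 3.25 into 3.2. The dict layout and function name are our own.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-persona Rubric B scores from the table above: (v7, Ivan, Eugene).
RUBRIC_B = {
    "anam": (9, 1, 3),
    "femmoraaa": (10, 5, 4),
    "battery": (10, 4, 6),
    "jesus": (9, 3, 2),
    "anthony": (10, 6, 3),
    "candle": (10, 5, 2),
    "jammaica": (10, 3, 4),
    "syed": (10, 6, 2),
}


def average(scores: list) -> Decimal:
    """Mean of integer scores, rounded half-up to one decimal place."""
    mean = Decimal(sum(scores)) / Decimal(len(scores))
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


v7_avg, ivan_avg, eugene_avg = (
    average([row[i] for row in RUBRIC_B.values()]) for i in range(3)
)
print(v7_avg, ivan_avg, eugene_avg)
```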

Opportunity: Eugene's bot scores strongest on Low Cognitive Load (2.0) but hasn't yet been tuned for Visible Intelligence (0.0) or Capability Proof (0.3). The discipline foundation is solid; adding domain knowledge (prices, materials, category expertise) would significantly lift these scores. The same iteration pattern that took our bot from v1 (67% pass) to v7 (100% pass) would apply here.

Persona Deep-Dives

How each bot handled the 8 test scenarios.

Persona	Type	v7	Ivan	Eugene
anam	Zero-spec dreamer	7.8	2.2	5.2
femmoraaa	IG jewelry teen	9.0	4.0	5.3
battery	Restricted + Pakistan	8.5	4.5	6.5
jesus	Spanish, 70K MXN	8.0	3.0	5.3
anthony	AirPods reseller	9.0	4.5	3.5
candle	Student beginner	9.0	4.0	4.5
jammaica	Zoom handoff	10.0	3.0	6.0
syed	B2B handwash	8.5	4.0	2.5

anam — Zero-spec dreamer (5 turns)

Lead has no product, no reference, no budget. Says "I want to source an item" then "jewelry and bags" then "makeup" then "no idea." The hardest activation scenario — how do you sell to someone who doesn't know what they want?

v7: 7.8 PASS

Prices on every turn. When she said "jewelry and bags" → "$0.50-3 fashion, $3-15 fine." When she said "makeup" → "$0.50-4/unit." When she said "no idea" → listed top 3 products with prices. Finally asked budget when nothing else worked. Correct escalation.

Ivan: 2.2 FAIL

Manufacturing hub lectures. "ODM" jargon. No prices for 4 turns. 6-10 lines per message. Multiple questions per turn. Finally asked budget on T5 — too late. The knowledge exists but is buried under academic delivery.

Eugene: 5.2 FAIL

Perfect message length and discipline. The narrowing approach ("what product?", "which one?") works for leads who know what they want, but this persona had no reference. Adding prices or popular product suggestions earlier would break the narrowing loop.

syed — VCare Karachi, 25-30 PKR handwash (2 turns)

The most demanding lead: exact product spec, company name, city, target price — all in one message. Tests whether the bot can do honest math and deliver a real answer.

v7: 8.5 PASS

T1: "$0.30-0.80/unit (80-220 PKR). Your target of 25-30 PKR is below raw material cost — bottle + pump alone is $0.10-0.15." Honest math, immediate. T2: Confirmed can source, restated realistic price at 5,000+ units.

Ivan: 4.0 FAIL

Extraordinary detail — PET/HDPE, lotion pump cost breakdown, SLES/Glycerin formulation, pH/viscosity QA. Did the math ($0.09-0.11 gap). Offered component import strategy. But 7-8 lines per turn, two questions each.

Eugene: 2.5 FAIL

T1: "Is this standard or custom?" — didn't engage with the specific details Syed provided (price target, company name, city). T2 repeated the same question. The bot would benefit from acknowledging lead-provided specs before asking follow-ups.

battery — Restricted product + Pakistan pushback (2 turns)

Asks for lead-acid batteries (IATA dangerous goods). When rejected, pushes back: "Importing in Pakistan no problem." Tests whether the bot holds firm on restricted items.

v7: 8.5 PASS

Clean reject with empathy. "I hear you — Pakistan may allow, but carriers seize." Held firm both turns. Pivoted to fuses/connectors each time. Left the door open on sourceable products.

Ivan: 4.5 FAIL

Knew Kung Long brand + IATA rules (impressive knowledge). BUT didn't firmly restrict — offered to source batteries with DG handling. Failed the fundamental safety test.

Eugene: 6.5 FAIL

Clean reject + fuse redirect (T1 = 8, strong). T2 held firm but could've added more value by re-engaging on fuses/connectors. A near miss against the 7.0 pass threshold: the discipline is correct, it just needs a warmer redirect.

Remaining Personas — Summary

Persona	v7 Highlight	Ivan Highlight	Eugene Highlight
femmoraaa	$0.50-3 → PVD $1-4 → starter kit $500-2K. Perfect escalation.	Knows Yiwu vs Guangzhou, 316L. But 10 lines per turn.	She asked "how much investment" — bot focused on narrowing specs instead.
jesus	70K MXN = $4K → 150 tenis + 200 ropa + 150 camisas. Budget math in Spanish.	Did budget math (T3) but in 9-line lecture format.	"Felicitaciones!" — didn't engage with the 70K MXN budget.
anthony	IP reject → TWS $2-8 → 50 units = $150-250 → ready-stock only.	JL/Airoha/Qualcomm chipset knowledge. 10 lines.	IP reject then "phone cases, chargers?" No prices.
candle	Full price list → graceful exit → "come back at 500+ wicks"	IFRA standards, soy vs paraffin. 8 lines + numbered Qs.	Didn't flag small qty. "Assorted or custom?"
jammaica	"Yes, we do Zoom!" + $0.15-1.50/unit + email ask. Perfect.	"I operate as AI to streamline specs first." Deflected.	"I don't do Zoom directly." No prices.

Verdict

v7 leads on both rubrics and is the only bot that passes on either scoring framework. The gap to the other two bots ranges from 4.0 to 6.5 points on a 10-point scale.

Both Superbowl demos have clear strengths (Ivan's knowledge depth, Eugene's conversational discipline), but neither has been through an iterative eval cycle yet. Applying the same rubric-based iteration process would likely accelerate both significantly.

What Each Bot Teaches Us

v7 — The Baseline

Strengths: Prices-first, 1-3 line cap, one question per turn, honest math, restricted holds, graceful exits, multi-language.

Weaknesses: D3 Qualification avg 1.3 (could push harder). The jesus T1 catalog response had no prices. Double questions on syed.

Why it works: 7 prompt iterations, 8-persona test suite, scored rubric. The methodology, not any single trick.

Ivan — Steal the Knowledge

Strength: Deepest domain knowledge of the three. Chipsets, IFRA, SLES/glycerin, Kung Long, PVD/316L, landed cost calculations.

Fix: Hard message cap (3 lines max, enforced with examples). One question rule. Strip markdown. Move the Gemini call from client-side to server-side for prompt security.

Steal: Observability panel (health meter + component status) for internal monitoring dashboard.

Eugene — Steal the UX

Strength: Message discipline, chip buttons, concept cards, structured state machine, clean format.

Opportunity: Add prices-first rule, category expertise, and budget math. The discipline foundation is already strong — adding domain knowledge would be the highest-leverage improvement.

Steal: Chip buttons + concept cards for website channel. State progression for funnel tracking.

What v7 Still Needs (v8 Roadmap)

#	Gap	Source	Fix
1	Real Alibaba prices + images	Eugene's API access	Swap LLM price guesses for live API data
2	Visual proof (product images)	Ivan's image generation	Alibaba image API integration
3	Observability dashboard	Ivan's health meter	Internal-only monitoring panel
4	Chip buttons on web	Eugene's UX	Website channel only (not WhatsApp)
5	D3 Qualification depth	Own rubric	Push harder on budget/qty in early turns

The Meta-Point

The eval methodology is the deliverable, not the prompt. v7 exists because of 7 iterations scored against 8 personas on a 5-dimension rubric. The superbowl demos haven't been through an eval cycle yet. One round of this eval process applied to either demo would likely improve their scores by 2-3 points. The methodology itself is the accelerant.
