Eight real lead personas from actual Sourcy WhatsApp conversations, ranging from zero-spec dreamers to B2B professionals with exact price targets. Each persona was run through all three bots (v7, Ivan, Eugene) via their live APIs.
Rubric A (ours): 5 dimensions, each scored 0-2 per turn: Message Length, Value Delivery, Qualification, Conversation Discipline, Last Message Test. Pass threshold: 7.0/10 average.
Rubric B (Eugene's 5 Success Conditions): Immediate Capability Proof, Visible Intelligence, Low Cognitive Load, Control/Safety, Clear Momentum. Each scored 0-2 per conversation, max 10.
| Metric | v7 | Ivan | Eugene |
|---|---|---|---|
| Rubric A (D1–D5) avg | 8.7 | 3.5 | 4.7 |
| Rubric A pass rate | 8/8 | 0/8 | 0/8 |
| Rubric B (5 Conditions) avg | 9.8 | 4.1 | 3.3 |
| Binary checks | All pass | RESTRICTED, FORMAT fail | BUDGET, EXIT fail |

| Persona | Type | v7 | Ivan | Eugene |
|---|---|---|---|---|
| anam | Zero-spec dreamer | 7.8 | 2.2 | 5.2 |
| femmoraaa | IG jewelry teen | 9.0 | 4.0 | 5.3 |
| battery | Restricted + Pakistan | 8.5 | 4.5 | 6.5 |
| jesus | Spanish, 70K MXN | 8.0 | 3.0 | 5.3 |
| anthony | AirPods reseller | 9.0 | 4.5 | 3.5 |
| candle | Student, tiny qty | 9.0 | 4.0 | 4.5 |
| jammaica | Zoom handoff | 10.0 | 3.0 | 6.0 |
| syed | B2B handwash | 8.5 | 4.0 | 2.5 |
| Overall | | 8.7 | 3.5 | 4.7 |
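As a quick check, the Rubric A pass rate can be recomputed from the per-persona averages in the table above. This is a minimal sketch: `RUBRIC_A` and `pass_counts` are illustrative names, the scores are copied from the table, and the binary checks (which also gate a pass) are deliberately left out.

```python
# Per-persona Rubric A averages, copied from the table above.
RUBRIC_A = {
    "anam":      {"v7": 7.8,  "Ivan": 2.2, "Eugene": 5.2},
    "femmoraaa": {"v7": 9.0,  "Ivan": 4.0, "Eugene": 5.3},
    "battery":   {"v7": 8.5,  "Ivan": 4.5, "Eugene": 6.5},
    "jesus":     {"v7": 8.0,  "Ivan": 3.0, "Eugene": 5.3},
    "anthony":   {"v7": 9.0,  "Ivan": 4.5, "Eugene": 3.5},
    "candle":    {"v7": 9.0,  "Ivan": 4.0, "Eugene": 4.5},
    "jammaica":  {"v7": 10.0, "Ivan": 3.0, "Eugene": 6.0},
    "syed":      {"v7": 8.5,  "Ivan": 4.0, "Eugene": 2.5},
}

PASS_THRESHOLD = 7.0  # Rubric A pass threshold stated in the report


def pass_counts(scores, threshold=PASS_THRESHOLD):
    """Count, per bot, how many personas reach the pass threshold."""
    counts = {}
    for per_bot in scores.values():
        for bot, average in per_bot.items():
            counts[bot] = counts.get(bot, 0) + int(average >= threshold)
    return counts


print(pass_counts(RUBRIC_A))
```

Under these inputs the counts match the 8/8, 0/8, 0/8 pass rates reported in the summary table.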
Rubric A: each bot turn is scored 0-2 on 5 dimensions (max 10 per turn). A conversation passes when the average across all turns is ≥ 7.0 and every binary check passes.
| Dim | Name | 2 (Pass) | 0 (Fail) |
|---|---|---|---|
| D1 | Message Length | 1-3 lines. No bullets, no lists. | 6+ lines, bullet dumps, walls. |
| D2 | Value Delivery | Specific price, material tradeoff, or market insight. | Only asked a question. No value. |
| D3 | Qualification | Got a qualifying signal (budget, qty, intent). | No qualification attempt. |
| D4 | Conversation Discipline | One question max. Acknowledged lead. | Multiple questions, didn't acknowledge lead input. |
| D5 | Last Message Test | Lead got hooked or we learned a key signal. | If lead left now, they didn't get enough to stay. |
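The scoring rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual eval harness: the dictionary layout, `turn_total`, and `conversation_passes` are hypothetical names, and a score of 1 is assumed to be the middle ground between the 2 and 0 anchors in the table.

```python
from statistics import mean

DIMENSIONS = ("D1", "D2", "D3", "D4", "D5")
PASS_THRESHOLD = 7.0


def turn_total(turn):
    """Sum the five 0-2 dimension scores for one bot turn (max 10)."""
    for dim in DIMENSIONS:
        if turn[dim] not in (0, 1, 2):
            raise ValueError(f"{dim} must be 0, 1, or 2, got {turn[dim]!r}")
    return sum(turn[dim] for dim in DIMENSIONS)


def conversation_passes(turns, binary_checks):
    """Pass = average turn total >= 7.0 AND every binary check passed."""
    average = mean(turn_total(turn) for turn in turns)
    return average >= PASS_THRESHOLD and all(binary_checks.values())


# Hypothetical two-turn conversation, for illustration only.
turns = [
    {"D1": 2, "D2": 2, "D3": 1, "D4": 2, "D5": 2},  # total 9
    {"D1": 2, "D2": 1, "D3": 1, "D4": 2, "D5": 1},  # total 7
]
checks = {"RESTRICTED_HOLD": True, "BUDGET_MATH": True, "FORMAT_OK": True}
print(conversation_passes(turns, checks))
```

Keeping the binary checks as a separate gate means a single failed check fails the conversation even when the average is high.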
D1 (Message Length): This is the WhatsApp constraint. Users on mobile scroll past anything longer than 3 lines. This dimension separates bots built for chat from bots built for email.
v7: Consistently 1-3 lines. Example (candle persona):
"Candle supplies from China run really well! Silicone molds $0.50-3/pc, wicks $0.01-0.05/pc, dye chips $0.02-0.10/pc, fragrance oils $3-8/kg, soy/paraffin wax $1-3/kg, glass jars $0.30-2/pc. Are you starting a candle brand or supplying a craft shop?"
Every product priced. One question. Three lines.
Ivan: Consistently 6-12 lines with markdown formatting. Example (femmoraaa persona):
"The process typically follows these steps:
1. Selection: Choosing styles from a supplier's catalog...
2. Sampling: Always order 3-5 samples first...
3. Production/Bulk Order: Paying a 30% deposit...
4. QC & Shipping: Checking the goods for scratches..."
Valuable content, but few WhatsApp users will read a numbered step-by-step list mid-chat.
Eugene: Consistently short and clean. Example (battery persona):
"I can't source batteries, explosives, or weapons, as those fall under restricted import categories."
Eugene matches v7 on length. One line, clear, direct. This is a genuine strength.
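A rough automated version of the D1 check might look like the sketch below. Several things here are assumptions rather than part of the rubric: the 90-character line width (`CHARS_PER_LINE`) is a tunable guess at how text wraps on screen, the score of 1 for 4-5 lines is an assumed middle ground, and any bullet or numbered list is treated as an automatic 0.

```python
import re
import textwrap

CHARS_PER_LINE = 90  # assumption: tunable guess at the width of one chat line

# Lines that start with "-", "*", "•", "1." or "1)" count as list items.
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def estimate_lines(message, width=CHARS_PER_LINE):
    """Estimate displayed lines by wrapping each non-empty paragraph."""
    total = 0
    for paragraph in message.splitlines():
        if paragraph.strip():
            total += len(textwrap.wrap(paragraph, width=width))
    return total


def score_d1(message):
    """Return 2, 1, or 0 for Message Length under the assumptions above."""
    has_bullets = any(BULLET.match(line) for line in message.splitlines())
    lines = estimate_lines(message)
    if lines <= 3 and not has_bullets:
        return 2
    if lines >= 6 or has_bullets:
        return 0
    return 1


print(score_d1("I can't source batteries, explosives, or weapons, "
               "as those fall under restricted import categories."))
```

A heuristic like this is only a first pass; a human or LLM grader would still need to review borderline messages.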
D2 (Value Delivery): Does the bot give the lead something they didn't have before? A price, a material tradeoff, a market insight. This is where most of the score gap lives.
v7: Prices on virtually every turn. Example (syed persona):
"500ml custom handwash from factories runs $0.30-0.80/unit (~80-220 PKR). Your target of 25-30 PKR (~$0.09) is below raw material cost — bottle + pump alone is $0.10-0.15."
Immediate price range + honest math showing the target is below cost. The lead now knows something they didn't before.
Ivan actually has strong knowledge — it just takes 3-4 turns to surface. Example (jesus persona, T3):
"$70,000 MXN (~$4,100 USD)... camisetas de fútbol de alta calidad $5-$12 USD, tenis $15-$35 USD... aranceles elevados (hasta el 25-30% + IVA)."
Real prices and customs data. The value is present — it's buried in a 12-line response, which tanks D1 and D4.
Eugene: No prices in any conversation. Example (femmoraaa, after she asked about investment cost):
"Are you interested in necklaces, bracelets, earrings, rings, or something else? Also, what materials are you thinking of — stainless steel, alloy, beads?"
She specifically asked "how much investment" — the response asked two more narrowing questions instead. This is the highest-leverage improvement area: inject price ranges when a category is mentioned.
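The "inject a price range when a category is mentioned" idea can be sketched as a simple lookup that runs before the bot asks its next question. Everything named here is hypothetical (`CATEGORY_PRICES`, `price_anchor`); the ranges are copied from v7's replies quoted in this report and are illustrative, not verified market prices.

```python
# Illustrative ranges taken from v7 replies quoted in this report.
CATEGORY_PRICES = {
    "jewelry": "$0.50-3 for fashion pieces",
    "makeup": "$0.50-4/unit",
    "earbuds": "$2-8/unit",
    "food packaging": "$0.15-1.50/unit",
}


def price_anchor(lead_message):
    """Return a one-sentence price anchor if a known category is mentioned."""
    text = lead_message.lower()
    for category, price_range in CATEGORY_PRICES.items():
        if category in text:
            return f"{category.capitalize()} from factories typically runs {price_range}."
    return None  # no known category: fall back to a normal question


print(price_anchor("How much investment do I need to start selling jewelry?"))
```

The anchor would be prepended to the reply, so the lead gets a number first and a single narrowing question second.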
D3 (Qualification): Did the bot extract a useful signal (budget, quantity, timeline, business type) from the exchange? All three bots attempt this, but via different strategies.
v7: Qualification embedded in value delivery, with one question per turn. Example (anthony):
"We can't source branded AirPods (Apple IP), but we can find similar TWS earbuds from factories at $2-8/unit. What's your target quantity?"
Value first (price + IP flag), then one qualifying question. The lead doesn't feel interrogated.
Ivan: Asks 2 questions per turn, which extracts more information per turn but makes the lead feel surveyed. Example (battery):
"1. Could you provide the exact amperage for the 8 fuses...?
2. What is your destination market (country) and your target budget?"
Both are good questions. But numbered Q&A format reads like a form, not a conversation.
Eugene: Clean one-question approach, but the questions are often spec-narrowing rather than business-qualifying. Example (candle):
"What kind of wax are you thinking of — soy, beeswax, paraffin, or something else?"
Good product question. But it doesn't reveal budget, quantity, or whether the lead is a business or hobbyist — which determines if factory sourcing even makes sense.
D4 (Conversation Discipline): Does the bot acknowledge what the lead just told it before asking the next question? Does it stay at one question per turn? This is the "feels human" dimension.
v7: Acknowledges every time and rarely asks more than one question. Example (femmoraaa T2):
"PVD gold stainless steel jewelry, tarnish-free — great pick. Runs $1-4/piece..."
First 5 words mirror her exact ask. Then value. Then one question. She feels heard.
Ivan: Frequently asks 2 questions and doesn't always acknowledge lead input. Example (jammaica, who asked for a Zoom call):
"While I operate as an AI assistant to streamline your specifications first... are you looking for custom-branded designs or standard plain packaging, and what are your estimated monthly volumes?"
Lead asked for a Zoom call. Bot deflected to spec collection. Two questions at the end. The lead's request wasn't addressed.
Eugene: Strong discipline, with one question per turn and a short format. Example (jesus T2):
"Got it! So you're interested in athletic shoes, sportswear for men and women, and soccer jerseys. ⚽ Is that right?"
Acknowledges with a summary, one confirmation question. Clean. Loses points when it repeats questions the lead already answered (syed: "Is this standard or custom?" asked twice).
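The "one question max" half of D4 is easy to approximate mechanically. The sketch below simply counts question marks; the function names are hypothetical, and the approach is knowingly naive: a compound ask such as "destination market and target budget?" counts as one question even though it requests two things.

```python
def count_questions(message):
    """Naive proxy: count question marks in a bot message."""
    return message.count("?")


def within_one_question(message):
    """True when the message asks at most one question (by the proxy above)."""
    return count_questions(message) <= 1


v7_reply = ("We can't source branded AirPods (Apple IP), but we can find similar "
            "TWS earbuds from factories at $2-8/unit. What's your target quantity?")
double_ask = ("Are you interested in necklaces, bracelets, earrings, rings, or "
              "something else? Also, what materials are you thinking of?")

print(within_one_question(v7_reply), within_one_question(double_ask))
```

The acknowledgment half of D4 (did the bot mirror what the lead said?) is harder to automate and is not covered here.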
D5 (Last Message Test): If the lead closed WhatsApp right now, did this exchange give them a reason to come back? Did they learn something, or just get asked another question?
v7: Lead walks away knowing prices. Example (jesus T2):
"Tenis deportivos con tu logo van de $8-15/par, ropa deportiva $4-12/pieza, y camisas de futbol $5-10/unidad."
Even if he never replies, he now has price ranges for all 3 categories. That alone might bring him back.
Ivan: The value is in the response, but it takes effort to extract. Example (anthony T2 ending):
"...which specific model are you targeting (e.g., the Pro style with silicone tips or the standard Gen 2/3 style) and what is your total budget for these 50 units?"
The last thing the lead reads is two questions — not a reason to stay. The useful price data ($2.50-4.50 landed) was earlier in the same message but may not register.
Eugene: Most turns end with a narrowing question and nothing else. Example (anam T5, after the lead had already said "no idea" twice):
"No worries! Let's try this: are you thinking more of face makeup, eye makeup, or lip products?"
Friendly tone, but the lead still has no information to anchor to. If they close the chat, they haven't learned anything new about sourcing.
| Check | v7 | Ivan | Eugene |
|---|---|---|---|
| RESTRICTED_HOLD | PASS | FAIL — offered DG handling for batteries | PASS |
| BUDGET_MATH | PASS | PASS (buried in walls) | FAIL — didn't engage with 70K MXN or 25-30 PKR |
| EXIT_DOOR | PASS | Partial | FAIL — small qty not flagged |
| LANGUAGE_MATCH | PASS | PASS | PASS |
| FORMAT_OK | PASS | FAIL — markdown everywhere | PASS |
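The BUDGET_MATH check asks whether the bot engaged with the lead's stated numbers. For the syed case the arithmetic can be sketched as below. The exchange rate is an assumption (roughly what the report's own conversions imply), and `budget_math` is a hypothetical helper, not part of any bot.

```python
PKR_PER_USD = 278.0  # assumption: roughly the rate implied by the report's conversions


def budget_math(target_pkr, quote_usd, rate=PKR_PER_USD):
    """Compare a lead's target price range (PKR) with a quoted unit range (USD)."""
    target_low, target_high = (amount / rate for amount in target_pkr)
    quote_low, _quote_high = quote_usd
    return {
        "target_usd": (round(target_low, 2), round(target_high, 2)),
        # The target is unviable if even its top end is under the cheapest quote.
        "below_quote": target_high < quote_low,
        "shortfall_usd": round(max(0.0, quote_low - target_high), 2),
    }


# syed: target of 25-30 PKR per unit vs. v7's quoted $0.30-0.80 per unit.
print(budget_math(target_pkr=(25, 30), quote_usd=(0.30, 0.80)))
```

With the assumed rate, the target converts to about $0.09-0.11, which sits well under the low end of the quoted range and is consistent with the figures v7 and Ivan gave.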
Rubric B: each condition is scored 0-2 per full conversation (not per turn). This is the framework Eugene presented from the internal Superbowl analysis, so we're evaluating his own bot on his own rubric.
| # | Condition | What it means |
|---|---|---|
| 1 | Immediate Capability Proof | Does the bot prove in the FIRST reply that it can actually help? (price, insight, category expertise — not just "what product?") |
| 2 | Visible Intelligence | Does the bot demonstrate it knows something the lead doesn't? (materials, MOQs, market dynamics, cost breakdowns) |
| 3 | Low Cognitive Load | Is the conversation effortless for the lead? (short messages, no jargon lectures) |
| 4 | Control/Safety | Does the lead feel the bot won't waste their time or money? (honest pricing, restricted holds, exit when unviable) |
| 5 | Clear Momentum | After each exchange, is it obvious what happens next? (not stuck in loops, progressing toward decision) |
Condition 1 (Immediate Capability Proof): The first message sets the frame. Does the lead think "this bot can actually help me" or "this is another generic chatbot"?
v7: Proves capability on the first product mention. Example (candle, where the lead lists 6 product types):
"Silicone molds $0.50-3/pc, wicks $0.01-0.05/pc, dye chips $0.02-0.10/pc, fragrance oils $3-8/kg, soy/paraffin wax $1-3/kg, glass jars $0.30-2/pc."
Every item the lead mentioned gets a price instantly. The lead's mental model shifts from "chatbot" to "this person knows my industry."
Ivan: Shows expertise but defers prices. Example (anam, whose first message is "Hi, I want to source an item"):
"In global sourcing, the product category dictates everything: consumer electronics often require strict certifications (CE or FCC)... custom apparel focuses more on fabric weight (GSM)..."
This reads like a sourcing textbook — the lead knows the bot is knowledgeable, but doesn't yet know it can help them.
Eugene: Friendly opener, but no proof of domain knowledge. Example (femmoraaa, who asks "how much investment?"):
"Sourcing jewelry and accessories from China can be a great way to start a business. To give you the best advice, let's narrow down what you're looking for. 💍"
Warm and concise, but the lead still doesn't know if the bot can actually source jewelry or just ask questions about it. Adding one price range here changes the entire perception.
Condition 2 (Visible Intelligence): Does the bot know something the lead doesn't? Materials, factory dynamics, landed cost math, regulatory traps: the things that make a lead think "I need this bot."
v7: Shows intelligence through pricing and trade-offs, not lectures. Example (syed, target price of 25-30 PKR):
"Your target of 25-30 PKR (~$0.09) is below raw material cost — bottle + pump alone is $0.10-0.15."
The lead now knows their price target is impossible. That's visible intelligence — one sentence, changes the deal.
Ivan: Strong domain knowledge, often the deepest of all three bots. Example (syed T2):
"Your target of 25-30 PKR translates to ~$0.09-$0.11 USD. A 500ml bottle with lotion pump typically costs $0.12-$0.20 before adding liquid, labeling, and freight. Consider importing only components and filling locally."
Ivan actually proposes a creative solution (component import). This is genuinely smart. The issue is it comes mid-paragraph in a long response — easy to miss.
Eugene: No domain knowledge surfaced in any conversation. Example (syed, who provided company, city, and exact price target):
"To get started, is this a standard liquid handwash formula or something custom?"
The lead gave price, company, city, and product spec. The response doesn't engage with any of it. This is the single biggest scoring gap — and the most fixable. Adding a price/knowledge layer to the prompt would likely lift this from 0 to 1.5+.
Condition 3 (Low Cognitive Load): How much effort does the lead have to put in to understand the bot's message? Short messages, simple language, no jargon walls.
v7: Short and scannable. Example (anthony):
"We can't source branded AirPods (Apple IP), but we can find similar TWS earbuds from factories at $2-8/unit. What's your target quantity?"
One constraint, one alternative with price, one question. Takes 3 seconds to read.
Ivan: High cognitive load on every turn. Example (candle T1, where a beginner asks about supplies):
"Sourcing candle-making supplies is a multi-faceted process... Soy wax remains the industry favorite for its 'clean burn' marketing, though paraffin is more cost-effective for scent throw. Jars usually see MOQs around 1,000–3,000 units for custom shapes... ensuring they meet IFRA standards is crucial for B2B compliance..."
A student who just wants to try candle-making gets IFRA compliance info and custom MOQs. The content is accurate — it's just not calibrated to the audience.
Eugene: Effortless to read. Example (jesus T2):
"Got it! So you're interested in athletic shoes, sportswear for men and women, and soccer jerseys. ⚽ Is that right?"
Eugene ties with v7 here. Summaries + confirmations + emojis = zero friction. This is a genuine strength and the foundation to build on.
Condition 4 (Control/Safety): Does the lead feel the bot won't waste their time or money? This includes honest pricing, restricted-item handling, flagging unviable orders, and graceful exits.
v7: Honest even when it means losing the deal. Example (candle, a student with 25 wicks):
"Those quantities are really small for factory sourcing — at 25-35 wicks, shipping alone would cost more than the products. You'd get better prices from a local craft supplier or Amazon/Shopee. When you're ready to scale to 500+ wicks, factory pricing drops 40-60%."
Told the lead "don't buy from us yet" — and gave them a concrete reentry threshold. Builds enormous trust.
Ivan: Engaged with battery sourcing (a restricted item) instead of redirecting. Example (battery T2):
"Shipping to Karachi (SAPT/KICT) via sea is the most cost-effective route for batteries... looking at 'Forward Booking' from the factory or a Tier-1 distributor..."
Technically accurate, but it gave the lead hope of a battery order that probably shouldn't happen. The lead doesn't feel "safe" — they feel led on.
Eugene: Good on restricted items, but didn't flag unviable small orders or unrealistic prices. Example (battery T1):
"I can't source batteries, explosives, or weapons, as those fall under restricted import categories."
Clean restriction — correct behavior. But for candle (25 wicks) and syed (25-30 PKR), the bot didn't flag that the order was unviable. Adding economic viability checks would lift this score.
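An economic viability check of the kind suggested above could start as a simple quantity threshold. In this sketch the 500-unit threshold is borrowed from v7's candle reply ("500+ wicks") and is an assumption; a real threshold would vary by product, and `viability_note` is a hypothetical helper.

```python
REENTRY_THRESHOLD = 500  # assumption: taken from v7's "500+ wicks" example


def viability_note(quantity, min_viable=REENTRY_THRESHOLD):
    """Return an honest exit message for tiny orders, or None if viable."""
    if quantity >= min_viable:
        return None
    return (
        f"At {quantity} units, shipping would likely cost more than the goods. "
        f"A local supplier is probably cheaper for now; factory sourcing "
        f"starts to make sense at around {min_viable}+ units."
    )


print(viability_note(25))
```

Returning `None` for viable orders lets the caller fall through to the normal quoting flow without special-casing.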
Condition 5 (Clear Momentum): After each exchange, is the conversation moving forward? Or is it stuck asking the same type of question?
v7: Each turn progresses: product → price → quantity → next step. Example (jammaica, where the lead asks for Zoom):
"Absolutely, we do Zoom calls! Let me connect you with our sourcing team. Quick — food packaging runs $0.15-1.50/unit. What's your email and preferred time?"
Zoom confirmed + price anchor + two data points requested = three things accomplished in one turn.
Ivan: Conversations progress slowly because each response covers too much ground. Example (jammaica, who asked for Zoom):
"While I operate as an AI assistant to streamline your specifications first... For F&B packaging, common materials include food-grade Kraft paper... PET or biodegradable PLA..."
Lead wanted a Zoom call. Bot pivoted to material specs. The conversation went sideways instead of forward.
Eugene: Narrowing loops can stall momentum. Example (syed, who gave full spec + price target, then asked "Can you source it?"):
"Yes, we can definitely source liquid handwash for you! To confirm, is this a standard liquid handwash formula or something custom?"
Same question as T1, repeated. The lead asked a direct yes-or-no question and got a loop. Adding a progress tracker or branching logic ("if lead gave spec, skip narrowing") would fix this.
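The branching logic described above ("if lead gave spec, skip narrowing") can be sketched as a small slot tracker. The slot names and question wording are hypothetical; the only point illustrated is that a question is never asked for information the lead has already given.

```python
# Hypothetical qualifying slots, in the order the bot would ask about them.
SLOT_QUESTIONS = {
    "product": "What product are you looking to source?",
    "spec": "Is this a standard product or something custom?",
    "quantity": "What's your target quantity?",
    "budget": "What's your target price per unit?",
}


def next_question(known_slots):
    """Return the first question whose slot is still unknown, else None."""
    for slot, question in SLOT_QUESTIONS.items():
        if slot not in known_slots:
            return question
    return None  # everything known: move on to a quote or hand-off


# syed already gave product, spec, and a price target in his first message.
print(next_question({"product", "spec", "budget"}))
```

With syed's first message recorded, the tracker skips the "standard or custom?" question and asks about quantity instead.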
| Persona | v7 | Ivan | Eugene |
|---|---|---|---|
| anam | 9 | 1 | 3 |
| femmoraaa | 10 | 5 | 4 |
| battery | 10 | 4 | 6 |
| jesus | 9 | 3 | 2 |
| anthony | 10 | 6 | 3 |
| candle | 10 | 5 | 2 |
| jammaica | 10 | 3 | 4 |
| syed | 10 | 6 | 2 |
| Average | 9.8 | 4.1 | 3.3 |
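The Average row can be reproduced from the per-persona scores above. This sketch uses `decimal` with half-up rounding because Python's built-in `round` rounds ties to even, which would turn 3.25 into 3.2 rather than the 3.3 shown; `RUBRIC_B` and `averages` are illustrative names.

```python
from decimal import Decimal, ROUND_HALF_UP

BOTS = ("v7", "Ivan", "Eugene")

# Rubric B totals per persona, copied from the table above (v7, Ivan, Eugene).
RUBRIC_B = {
    "anam":      (9, 1, 3),
    "femmoraaa": (10, 5, 4),
    "battery":   (10, 4, 6),
    "jesus":     (9, 3, 2),
    "anthony":   (10, 6, 3),
    "candle":    (10, 5, 2),
    "jammaica":  (10, 3, 4),
    "syed":      (10, 6, 2),
}


def averages(table):
    """Mean score per bot, rounded half-up to one decimal place."""
    count = Decimal(len(table))
    result = {}
    for index, bot in enumerate(BOTS):
        total = Decimal(sum(row[index] for row in table.values()))
        result[bot] = str((total / count).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))
    return result


print(averages(RUBRIC_B))  # {'v7': '9.8', 'Ivan': '4.1', 'Eugene': '3.3'}
```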
How each bot handled the 8 test scenarios.
anam (zero-spec dreamer): Lead has no product, no reference, no budget. Says "I want to source an item," then "jewelry and bags," then "makeup," then "no idea." The hardest activation scenario: how do you sell to someone who doesn't know what they want?
v7: Prices on every turn. When she said "jewelry and bags" → "$0.50-3 fashion, $3-15 fine." When she said "makeup" → "$0.50-4/unit." When she said "no idea" → listed top 3 products with prices. Finally asked budget when nothing else worked. Correct escalation.
Ivan: Manufacturing hub lectures. "ODM" jargon. No prices for 4 turns. 6-10 lines per message. Multiple questions per turn. Finally asked budget on T5, which was too late. The knowledge exists but is buried under academic delivery.
Eugene: Perfect message length and discipline. The narrowing approach ("what product?", "which one?") works for leads who know what they want, but this persona had no reference. Adding prices or popular product suggestions earlier would break the narrowing loop.
syed (B2B handwash): The most demanding lead: exact product spec, company name, city, and target price, all in one message. Tests whether the bot can do honest math and deliver a real answer.
v7: T1: "$0.30-0.80/unit (80-220 PKR). Your target of 25-30 PKR is below raw material cost — bottle + pump alone is $0.10-0.15." Honest math, immediate. T2: Confirmed can source, restated realistic price at 5,000+ units.
Ivan: Extraordinary detail: PET/HDPE, lotion pump cost breakdown, SLES/Glycerin formulation, pH/viscosity QA. Did the math (25-30 PKR ≈ $0.09-0.11). Offered component import strategy. But 7-8 lines per turn, two questions each.
Eugene: T1 asked "Is this standard or custom?" without engaging with the specific details Syed provided (price target, company name, city). T2 repeated the same question. The bot would benefit from acknowledging lead-provided specs before asking follow-ups.
battery (restricted item, Pakistan): Asks for lead-acid batteries (IATA dangerous goods). When rejected, pushes back: "Importing in Pakistan no problem." Tests whether the bot holds firm on restricted items.
v7: Clean reject with empathy. "I hear you — Pakistan may allow, but carriers seize." Held firm both turns. Pivoted to fuses/connectors each time. Left the door open on sourceable products.
Ivan: Knew Kung Long brand + IATA rules (impressive knowledge), but didn't firmly restrict: it offered to source batteries with dangerous-goods (DG) handling. Failed the fundamental safety test.
Eugene: Clean reject + fuse redirect (T1 = 8, strong). T2 held firm but could've added more value by re-engaging on fuses/connectors. Borderline pass: the discipline is correct, it just needs a warmer redirect.
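Holding firm across turns, as tested in the battery scenario, can be modeled as a flag that is set once and never cleared. The sketch below is an assumption-heavy illustration: the keyword list is drawn from Eugene's quoted refusal ("batteries, explosives, or weapons"), simple substring matching stands in for real classification, and `RestrictedHold` is a hypothetical class.

```python
# Keywords drawn from the refusal quoted in this report; illustrative only.
RESTRICTED_KEYWORDS = ("battery", "batteries", "explosive", "weapon")


def restricted_terms(lead_message):
    """Return the restricted keywords found in a lead message."""
    text = lead_message.lower()
    return [keyword for keyword in RESTRICTED_KEYWORDS if keyword in text]


class RestrictedHold:
    """Sticky hold: once a restricted item is seen, pushback cannot clear it."""

    def __init__(self):
        self.held = set()

    def update(self, lead_message):
        self.held.update(restricted_terms(lead_message))
        return bool(self.held)


hold = RestrictedHold()
print(hold.update("I need lead-acid batteries and 8 fuses"))  # hold is set
print(hold.update("Importing in Pakistan no problem"))        # hold stays set
```

Because the hold never resets, the reply on every later turn can keep redirecting to sourceable items such as the fuses.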
| Persona | v7 Highlight | Ivan Highlight | Eugene Highlight |
|---|---|---|---|
| femmoraaa | $0.50-3 → PVD $1-4 → starter kit $500-2K. Perfect escalation. | Knows Yiwu vs Guangzhou, 316L. But 10 lines per turn. | She asked "how much investment" — bot focused on narrowing specs instead. |
| jesus | 70K MXN = $4K → 150 tenis + 200 ropa + 150 camisas. Budget math in Spanish. | Did budget math (T3) but in 9-line lecture format. | "Felicitaciones!" — didn't engage with the 70K MXN budget. |
| anthony | IP reject → TWS $2-8 → 50 units = $150-250 → ready-stock only. | JL/Airoha/Qualcomm chipset knowledge. 10 lines. | IP reject then "phone cases, chargers?" No prices. |
| candle | Full price list → graceful exit → "come back at 500+ wicks" | IFRA standards, soy vs paraffin. 8 lines + numbered Qs. | Didn't flag small qty. "Assorted or custom?" |
| jammaica | "Yes, we do Zoom!" + $0.15-1.50/unit + email ask. Perfect. | "I operate as AI to streamline specs first." Deflected. | "I don't do Zoom directly." No prices. |
v7 leads on both rubrics and is the only bot that passes on either scoring framework. The gap to the other two bots ranges from 4.0 to 6.5 points on a 10-point scale (Rubric A: 8.7 vs. 4.7 and 3.5; Rubric B: 9.8 vs. 4.1 and 3.3).
Both Superbowl demos have clear strengths (Ivan's knowledge depth, Eugene's conversational discipline) but haven't yet been through an iterative eval cycle. Applying the same rubric-based iteration process would likely improve both significantly.
v7 strengths: Prices-first, 1-3 line cap, one question per turn, honest math, restricted holds, graceful exits, multi-language.
v7 weaknesses: D3 Qualification avg 1.3 (could push harder). Jesus T1 catalog response = no prices. Double-questions on Syed.
Why v7 works: 7 prompt iterations, 8-persona test suite, scored rubric. The methodology, not any single trick.
Ivan strength: Deepest domain knowledge of the three. Chipsets, IFRA, SLES/glycerin, Kung Long, PVD/316L, landed cost calculations.
Ivan fix: Hard message cap (3 lines max, enforced with examples). One-question rule. Strip markdown. Move the Gemini call from client-side to server-side for prompt security.
Steal from Ivan: Observability panel (health meter + component status) for an internal monitoring dashboard.
Eugene strength: Message discipline, chip buttons, concept cards, structured state machine, clean format.
Eugene opportunity: Add a prices-first rule, category expertise, and budget math. The discipline foundation is already strong, so adding domain knowledge would be the highest-leverage improvement.
Steal from Eugene: Chip buttons + concept cards for the website channel. State progression for funnel tracking.
| # | Gap | Source | Fix |
|---|---|---|---|
| 1 | Real Alibaba prices + images | Eugene's API access | Swap LLM price guesses for live API data |
| 2 | Visual proof (product images) | Ivan's image generation | Alibaba image API integration |
| 3 | Observability dashboard | Ivan's health meter | Internal-only monitoring panel |
| 4 | Chip buttons on web | Eugene's UX | Website channel only (not WhatsApp) |
| 5 | D3 Qualification depth | Own rubric | Push harder on budget/qty in early turns |