Latest Bot Benchmark

Eval V1 — 6 Conversations (Mar 10–11 2026) vs Historical Baseline
11 March 2026
Conversations: 6 (real 1688 suppliers)
Average Score: 6.5/9 (72%)
Best: 8.5/9 (Conv 2053)
vs Historical: +3pp (69% → 72%)

Bot version: Mastra + Gemini 3.0 Flash. Data source: Nelson’s 6-conversation archive (Google Drive, shared Mar 11). This bot has a richer goal structure than the historical version — 5 goals per conversation including a “supplier capabilities assessment” goal.


Summary

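| Conv | Product | Msgs | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | Core | S1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1971 | Futsal shoes (logo) | 38 | Pa | F | F | P | Pa | P | Pa | Pa | Pa | 5.0 | No |
| 2046 | Blackout curtains (logo) | 50 | Pa | Pa | F | P | Pa | P | Pa | P | Pa | 6.0 | No |
| 2051 | Skateboard sneakers (logo) | 38 | Pa | Pa | F | P | F | P | Pa | Pa | Pa | 5.0 | No |
| 2053 | Car phone holder (logo) | 19 | P | Pa | P | P | P | P | P | P | Pa | 8.5 | No |
| 2054 | Car phone holder K11 (logo) | 21 | P | P | P | P | P | P | Pa | Pa | Pa | 8.0 | No |
| 2058 | Storage box (logo) | 37 | P | Pa | F | P | Pa | P | Pa | P | Pa | 6.5 | No |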

P = Pass (1), Pa = Partial (0.5), F = Fail (0). E6/E8 scored as Pass where N/A (no auto-responses / no rejections). S1 = stretch metric (negotiation attempt).
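
The legend above can be expressed as a small lookup. The following is a minimal sketch, not the evaluator's actual tooling: the names `POINTS` and `resolve_mark` are hypothetical, and representing an N/A dimension as `None` is an assumption made for illustration. It tallies the E2 column of the summary table.

```python
from collections import Counter
from typing import Optional

# Point values taken from the legend: Pass = 1, Partial = 0.5, Fail = 0.
POINTS = {"P": 1.0, "Pa": 0.5, "F": 0.0}

def resolve_mark(mark: Optional[str]) -> str:
    """Apply the N/A rule from the legend: a dimension that does not
    apply (represented here as None) is scored as Pass."""
    return "P" if mark is None else mark

# E2 (One-Question) marks copied from the summary table, keyed by conversation.
e2_marks = {1971: "F", 2046: "Pa", 2051: "Pa", 2053: "Pa", 2054: "P", 2058: "Pa"}

resolved = [resolve_mark(m) for m in e2_marks.values()]
tally = Counter(resolved)
mean = sum(POINTS[m] for m in resolved) / len(resolved)

print(tally["P"], tally["Pa"], tally["F"], round(mean, 2))  # 1 4 1 0.5
```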


Detailed Scoring

Conv 1971: Futsal Shoes with Custom Logo (38 msgs · 5.0/9)

Bot contacts a shoe supplier about S-8002 black futsal shoes. SR requires logo customization (tongue + heel).

| Dimension | Score | Notes |
|---|---|---|
| E1 Goal Completion | Partial | 6/8 goals achieved. Price ¥89/pair, MOQ 50–1000, lead time 15–20 days, dimensions and weight collected. ❌ Capabilities assessment PENDING. ❌ Custom logo pricing unconfirmed (supplier kept asking for the file first). |
| E2 One-Question | Fail | 3+ violations. Bundled file-sharing with price questions; asked 2 questions after statements. |
| E3 Turn Efficiency | Fail | ~18 bot messages. Repeatedly asked for logo pricing even though the supplier said “send me the file first” 3 times. Major loop. |
| E4 No Hallucination | Pass | No fabricated data about pricing or specs. |
| E5 Extractability | Partial | Good base data (price, MOQ, lead time, dimensions, weight, carton). Customization pricing unresolved. |
| E6 Auto-Response | Pass | N/A: no auto-responses in this conversation. |
| E7 Naturalness | Partial | Polite, natural Chinese. But the repeated failure to understand “send me the file first” is unnatural. |
| E8 Rejection Recovery | Partial | Handled heel-logo pushback (“鞋跟那个位置不太好弄”, roughly “that spot on the heel is hard to do”) by offering to send the file. But the repeated price loop is a failure to read the room. |
| E9 Customization | Partial | Asked about placement, format, pricing. Good start. Couldn’t complete: didn’t send the logo file despite saying it would. |

S1 Negotiation: No attempt.

Conv 2046: Blackout Curtains with Logo (50 msgs · 6.0/9)

Bot contacts a curtain factory. SR requires logo customization. Conversation spans two days.

| Dimension | Score | Notes |
|---|---|---|
| E1 Goal Completion | Partial | 6/7 goals. Price $6/piece (200 pcs), MOQ 200, lead time 10–12 days, weight ~1 kg, packing specs, logo confirmed at 10×3 cm. ❌ Capabilities PENDING (supplier frustrated: “你到底要什么”, “what exactly do you want?”). |
| E2 One-Question | Partial | 2 violations: a long statement followed by a tangent question, and a multi-part capability question. |
| E3 Turn Efficiency | Fail | ~20 bot messages. Stalled overnight waiting for the logo file. Capability tangent frustrated the supplier. |
| E4 No Hallucination | Pass | No fabricated data. |
| E5 Extractability | Partial | Core data good. Logo pricing not separately broken out: it is unclear whether $6 is with or without logo. |
| E6 Auto-Response | Pass | Supplier sent auto-greeting + FAQ card. Bot ignored them and waited for a real response. Correct. |
| E7 Naturalness | Partial | Handled “are you a bot?” well. But the supplier got annoyed at the capability questions (“你到底要什么”, “what exactly do you want?”). |
| E8 Rejection Recovery | Pass | Supplier pushed to switch to a voice call twice. Bot firmly but politely declined, citing platform policy. Good recovery. |
| E9 Customization | Partial | Logo confirmed, size adjusted based on supplier feedback (10×10 → 10×3). Good adaptation. Logo pricing not isolated. |

S1 Negotiation: No attempt.

Conv 2051: Skateboard Sneakers with Logo (38 msgs · 5.0/9)

Bot contacts a shoe supplier about skateboard sneakers. SR requires logo customization.

| Dimension | Score | Notes |
|---|---|---|
| E1 Goal Completion | Partial | 4/7 goals. Price ¥75 (300), ¥67 (500), ¥62 (1000). MOQ 100–500. Lead time 20–45 days. ❌ Packing specs PENDING. ❌ Capabilities PENDING. |
| E2 One-Question | Partial | 2 borderline violations: rapid-fire messages and answer + question combos. |
| E3 Turn Efficiency | Fail | ~16 bot messages. Pursued carton specs after the supplier said “I can’t answer without the order.” Confusing mid-conversation product link switch. |
| E4 No Hallucination | Pass | No fabricated data. |
| E5 Extractability | Fail | Price tiers and lead time good. No exact weight, no carton specs. Conversation scattered across topics. |
| E6 Auto-Response | Pass | N/A: no auto-responses. |
| E7 Naturalness | Partial | Adapted to the supplier switching to English. But the second product link mid-conversation was confusing. Robotic packing-specs pursuit. |
| E8 Rejection Recovery | Partial | Supplier said it can’t provide carton specs without an order. Bot kept asking variations instead of moving on. |
| E9 Customization | Partial | Got logo pricing tiers and lead time. Didn’t complete the full branch (no file format, no placement confirmation). |

S1 Negotiation: No attempt.

Conv 2053: Car Phone Holder, No Custom (19 msgs · 8.5/9)

Bot contacts a phone mount supplier. Quick, efficient conversation with a very responsive (possibly AI-assisted) supplier.

| Dimension | Score | Notes |
|---|---|---|
| E1 Goal Completion | Pass | All goals achieved. Price $6.30/piece (200 pcs, with or without logo). MOQ 200. Lead time 7–10 days. Dimensions, weight, packing, logo pricing all confirmed. |
| E2 One-Question | Partial | 1 violation: asked about two lead-time variants in one question. |
| E3 Turn Efficiency | Pass | 9 bot messages. Clean and efficient: all data collected in 9 turns with no waste. |
| E4 No Hallucination | Pass | Final summary accurately recaps all data. |
| E5 Extractability | Pass | Self-summary: “200个手机支架单价6.30美元,常规版交期7天,带Logo版10天,外箱尺寸60×36×45cm,重22.6kg.” (“200 phone holders at US$6.30 each; lead time 7 days for the standard version, 10 days with logo; outer carton 60×36×45 cm, weight 22.6 kg.”) Perfect. |
| E6 Auto-Response | Pass | Supplier responses looked AI-assisted. Bot treated them as real answers, which was correct since the data was accurate. |
| E7 Naturalness | Pass | Clean, professional, efficient. Good closing summary. |
| E8 Rejection Recovery | Pass | N/A: no rejections. |
| E9 Customization | Partial | Got logo pricing and lead time. Didn’t follow up on the supplier’s mention of “定制包装MOQ500件” (“custom packaging MOQ 500 pieces”). |

S1 Negotiation: No attempt.

Conv 2054: Car Phone Holder K11, Capabilities Assessment (21 msgs · 8.0/9)

Bot contacts another phone mount supplier. Same product category as 2053. Conversation includes a full capabilities assessment.

| Dimension | Score | Notes |
|---|---|---|
| E1 Goal Completion | Pass | All goals achieved. Price ¥7.5/piece (200 pcs). MOQ 1–1000. Lead time 1 week. Dimensions, weight, packing, capabilities all confirmed. |
| E2 One-Question | Pass | Every bot message asks exactly one question. Excellent discipline. |
| E3 Turn Efficiency | Pass | 10 bot messages. Efficient, especially given that the capabilities assessment adds extra questions. |
| E4 No Hallucination | Pass | No fabrication. |
| E5 Extractability | Pass | Supplier sent a complete spec sheet. Bot acknowledged it and moved on to capabilities. Clean data. |
| E6 Auto-Response | Pass | Supplier sent auto-greeting with FAQ buttons. Bot ignored the buttons and waited for a real price. Correct. |
| E7 Naturalness | Partial | Efficient, but the capabilities questions feel formulaic in sequence: factory products → OEM/ODM → certifications → general MOQ. |
| E8 Rejection Recovery | Partial | Accepted “不” (no OEM/ODM) and “没” (no certs) correctly. But a human would probe: “if not full ODM, can you at least change colors?” |
| E9 Customization | Partial | Got logo MOQ (1000) and lead time. Didn’t probe logo pricing separately: it is unclear whether ¥7.5 is with or without logo. |

S1 Negotiation: No attempt.

Conv 2058: Storage Box with Logo (37 msgs · 6.5/9)

Bot contacts a storage box supplier. Conversation involves logo customization and a capabilities assessment.

| Dimension | Score | Notes |
|---|---|---|
| E1 Goal Completion | Pass | All goals achieved. Price ¥5.2 (1000 pcs w/ logo), ¥9.5 (200 pcs). MOQ 200–1000. Lead time ~1 month (custom). Dimensions, weight, packing, capabilities confirmed. |
| E2 One-Question | Partial | 2 violations: a multi-part capability question and a multi-part packing question. |
| E3 Turn Efficiency | Fail | ~16 bot messages. Slow supplier needed follow-up. Bot asked about weight after the supplier said “it’s on the listing page.” |
| E4 No Hallucination | Pass | No fabrication. |
| E5 Extractability | Partial | Two price tiers collected. Carton specs received as an image (harder to extract). Weight incomplete. |
| E6 Auto-Response | Pass | Supplier sent auto-greeting (“客服离线中”, “customer service is offline”). Bot waited for a real response. Correct. |
| E7 Naturalness | Partial | Asked about weight after the supplier said it’s on the listing. Asked “35×20×15 is in cm?”, which is slightly naive. |
| E8 Rejection Recovery | Pass | N/A: no rejections. |
| E9 Customization | Partial | Got logo pricing at scale. Discovered that material customization is possible. Didn’t pursue logo pricing at the 200-piece level. |

S1 Negotiation: No attempt.


Aggregate Analysis

Score Distribution

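| Dimension | Pass | Partial | Fail | Avg Score |
|---|---|---|---|---|
| E1: Goal Completion | 3 | 3 | 0 | 0.75 |
| E2: One-Question Discipline | 1 | 4 | 1 | 0.50 |
| E3: Turn Efficiency | 2 | 0 | 4 | 0.33 |
| E4: No Hallucination | 6 | 0 | 0 | 1.00 |
| E5: Structured Extractability | 2 | 3 | 1 | 0.58 |
| E6: Auto-Response Handling | 6 | 0 | 0 | 1.00 |
| E7: Conversational Naturalness | 1 | 5 | 0 | 0.58 |
| E8: Rejection Recovery | 3 | 3 | 0 | 0.75 |
| E9: Customization Handling | 0 | 6 | 0 | 0.50 |

S1 Negotiation: 0/6 attempted.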
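
Each average in the score distribution follows from the Pass/Partial/Fail counts and the point values in the legend (1, 0.5, 0). A minimal sketch of that arithmetic, with a hypothetical helper name `dimension_average`:

```python
def dimension_average(passes: int, partials: int, fails: int) -> float:
    """Mean score for one dimension: Pass = 1, Partial = 0.5, Fail = 0."""
    total = passes + partials + fails
    return (passes * 1.0 + partials * 0.5) / total

# (Pass, Partial, Fail) counts copied from the score distribution.
counts = {
    "E1": (3, 3, 0), "E2": (1, 4, 1), "E3": (2, 0, 4),
    "E4": (6, 0, 0), "E5": (2, 3, 1), "E6": (6, 0, 0),
    "E7": (1, 5, 0), "E8": (3, 3, 0), "E9": (0, 6, 0),
}

averages = {dim: round(dimension_average(*c), 2) for dim, c in counts.items()}
for dim, avg in averages.items():
    print(f"{dim}: {avg:.2f}")
```

For example, E5 has 2 Pass, 3 Partial, and 1 Fail, giving (2 + 1.5) / 6 ≈ 0.58.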

Average Score by Dimension

Ranked from highest to lowest:

E4 No Hallucination: 1.00
E6 Auto-Response: 1.00
E1 Goal Completion: 0.75
E8 Rejection Recovery: 0.75
E5 Extractability: 0.58
E7 Naturalness: 0.58
E2 One-Question: 0.50
E9 Customization: 0.50
E3 Turn Efficiency: 0.33

Comparison: Historical Bot vs Latest Bot

| Dimension | Historical (Dec–Feb) | Latest (Mar 10–11) | Delta |
|---|---|---|---|
| E1: Goal Completion | 0.50 | 0.75 | +0.25 |
| E2: One-Question | 0.70 | 0.50 | −0.20 |
| E3: Turn Efficiency | 0.30 | 0.33 | +0.03 |
| E4: No Hallucination | 0.90 | 1.00 | +0.10 |
| E5: Extractability | 0.40 | 0.58 | +0.18 |
| E6: Auto-Response | 1.00 | 1.00 | = |
| E7: Naturalness | 0.60 | 0.58 | −0.02 |
| E8: Rejection Recovery | 0.70 | 0.75 | +0.05 |
| E9: Customization | 0.90 | 0.50 | −0.40 |
| Overall | 6.2/9 (69%) | 6.5/9 (72%) | +0.3 |
| S1: Negotiation | 0/5 | 0/6 | = |
Customization handling regressed (−0.40). The latest bot has logo customization in every SR but struggles to complete the branch, usually because it can’t or doesn’t send the actual logo file. Every conversation scored Partial. This is the biggest single regression.
Hallucination avoidance is now perfect (1.00 across all 6 conversations). Zero fabricated data, up from 0.90. The bot only states facts the supplier actually provided.
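
The Delta column in the comparison is simply latest minus historical for each dimension. The sketch below recomputes it from the two score columns and ranks the changes; the variable names are illustrative only.

```python
# Per-dimension averages copied from the comparison table.
historical = {"E1": 0.50, "E2": 0.70, "E3": 0.30, "E4": 0.90, "E5": 0.40,
              "E6": 1.00, "E7": 0.60, "E8": 0.70, "E9": 0.90}
latest = {"E1": 0.75, "E2": 0.50, "E3": 0.33, "E4": 1.00, "E5": 0.58,
          "E6": 1.00, "E7": 0.58, "E8": 0.75, "E9": 0.50}

# Delta = latest - historical, rounded to the two decimals used in the report.
deltas = {dim: round(latest[dim] - historical[dim], 2) for dim in historical}

ranked = sorted(deltas.items(), key=lambda item: item[1])
biggest_regression, biggest_gain = ranked[0], ranked[-1]
regressions = [dim for dim, delta in deltas.items() if delta < 0]

print(biggest_regression, biggest_gain)  # ('E9', -0.4) ('E1', 0.25)
print(regressions)                       # ['E2', 'E7', 'E9']
```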

Key Findings

  1. Goal completion improved significantly (+0.25, from 0.50 to 0.75). The latest bot has a richer goal structure (5 goals per conversation) and achieves more of them. Three of the six conversations scored Pass on E1.
  2. Turn efficiency remains the weakest dimension (0.33). Four of six conversations exceed 12 bot messages. The bot still doesn’t know when to stop or when to accept “I can’t answer that now.”
  3. Customization handling dropped (−0.40). Paradoxically, the latest bot has customization in every SR (logo printing) but struggles to complete the branch, usually because it can’t or doesn’t send the actual logo file. Every conversation scored Partial.
  4. One-question discipline regressed (−0.20). The latest bot bundles questions more often, especially when asking about capabilities or packing specs.
  5. Zero hallucination across all 6 conversations. This is an improvement from 0.90 to 1.00: no fabricated data at all.
  6. New feature: capabilities assessment. The latest bot asks about factory certifications, OEM/ODM, and main product categories. Useful, but currently formulaic and sometimes irritating to suppliers.
  7. Price negotiation is still completely absent. Neither bot version attempts to negotiate (0/5 historical, 0/6 latest).
  8. The best conversation (2053 at 8.5/9) shows the bot excels with responsive suppliers who give structured data. The worst cases (1971 and 2051, both at 5.0/9) involve suppliers who need the file first or can’t answer without order details.
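
Finding 2 can be checked against the bot message counts reported in the E3 rows of the detailed scoring. This is a minimal sketch: the counts for four conversations are approximate in the source (marked "~"), and treating 12 bot messages as a cut-off is taken from the wording of the finding, not from a documented rubric threshold.

```python
# Bot message counts from the E3 notes (values marked "~" in the report
# are approximate).
bot_messages = {1971: 18, 2046: 20, 2051: 16, 2053: 9, 2054: 10, 2058: 16}

# Assumption: 12 is the cut-off implied by "exceed 12 bot messages".
THRESHOLD = 12

over_threshold = [conv for conv, n in bot_messages.items() if n > THRESHOLD]
print(over_threshold)       # [1971, 2046, 2051, 2058]
print(len(over_threshold))  # 4
```

The four conversations over the cut-off are the same four that scored Fail on E3 in the summary table.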

Overall Assessment

The latest bot is a modest improvement over the historical version (+3pp, 69% → 72%). The largest gains are in goal completion (+0.25) and extractability (+0.18), and hallucination avoidance reached 1.00. The persistent weaknesses are turn efficiency, one-question discipline, and customization handling, which are the same areas where our bot (Eric’s version) can differentiate.

The 72% baseline is what we need to beat on Friday.
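
The headline figures can be reproduced from the Core column of the summary table. A minimal sketch, assuming the percentage is the average score divided by the 9-point maximum and rounded to a whole number; `to_percent` is a hypothetical helper name.

```python
MAX_SCORE = 9  # nine core dimensions, one point each

def to_percent(score: float) -> int:
    """Convert a score out of 9 to a whole-number percentage."""
    return round(score / MAX_SCORE * 100)

# Core scores copied from the summary table (Conv 1971 through 2058).
core_scores = [5.0, 6.0, 5.0, 8.5, 8.0, 6.5]
latest_avg = sum(core_scores) / len(core_scores)
historical_avg = 6.2  # overall historical score from the comparison table

latest_pct = to_percent(latest_avg)
historical_pct = to_percent(historical_avg)

print(latest_avg, latest_pct)           # 6.5 72
print(historical_pct)                   # 69
print(latest_pct - historical_pct)      # 3 (percentage points)
```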
