Latest Bot Benchmark — Eval V1 — by Eric San

Conversations

6

real 1688 suppliers

Average Score

6.5/9

72%

Best

8.5/9

Conv 2053

vs Historical

+3pp

69% → 72%

Bot version: Mastra + Gemini 3.0 Flash. Data source: Nelson’s 6-conversation archive (Google Drive, shared Mar 11). This bot has a richer goal structure than the historical version — 5 goals per conversation including a “supplier capabilities assessment” goal.

Summary

Conv	Product	Msgs	E1	E2	E3	E4	E5	E6	E7	E8	E9	Core	S1
1971	Futsal shoes (logo)	38	Pa	F	F	P	Pa	P	Pa	Pa	Pa	5.0	No
2046	Blackout curtains (logo)	50	Pa	Pa	F	P	Pa	P	Pa	P	Pa	6.0	No
2051	Skateboard sneakers (logo)	38	Pa	Pa	F	P	F	P	Pa	Pa	Pa	5.0	No
2053	Car phone holder (logo)	19	P	Pa	P	P	P	P	P	P	Pa	8.5	No
2054	Car phone holder K11 (logo)	21	P	P	P	P	P	P	Pa	Pa	Pa	8.0	No
2058	Storage box (logo)	37	P	Pa	F	P	Pa	P	Pa	P	Pa	6.5	No

P = Pass (1), Pa = Partial (0.5), F = Fail (0). E6/E8 scored as Pass where N/A (no auto-responses / no rejections). S1 = stretch metric (negotiation attempt).

Detailed Scoring

Conv 1971 — Futsal Shoes with Custom Logo 38 msgs · 5.0/9

Bot contacts a shoe supplier about S-8002 black futsal shoes. SR requires logo customization (tongue + heel).

Dim	Score	Notes
E1 Goal Completion	Partial	6/8 goals achieved. Price ¥89/pair, MOQ 50–1000, lead time 15–20 days, dimensions & weight collected. ❌ Capabilities assessment PENDING. ❌ Custom logo pricing unconfirmed (supplier kept asking for the file first).
E2 One-Question	Fail	3+ violations. Bundled file-sharing with price questions, asked 2 questions after statements.
E3 Turn Efficiency	Fail	~18 bot messages. Repeatedly asked for logo pricing when supplier said “send me the file first” 3 times. Major loop.
E4 No Hallucination	Pass	No fabricated data about pricing or specs.
E5 Extractability	Partial	Good base data (price, MOQ, lead time, dimensions, weight, carton). Customization pricing unresolved.
E6 Auto-Response	Pass	N/A — no auto-responses in this conversation.
E7 Naturalness	Partial	Polite, natural Chinese. But repeated failure to understand “send me the file first” is unnatural.
E8 Rejection Recovery	Partial	Handled heel-logo pushback (“鞋跟那个位置不太好弄”) by offering to send file. But the repeated price loop is a failure to read the room.
E9 Customization	Partial	Asked about placement, format, pricing. Good start. Couldn’t complete — didn’t send logo file despite saying it would.

S1 Negotiation: No attempt.

Conv 2046 — Blackout Curtains with Logo 50 msgs · 6.0/9

Bot contacts a curtain factory. SR requires logo customization. Conversation spans two days.

Dim	Score	Notes
E1 Goal Completion	Partial	6/7 goals. Price $6/piece (200 pcs), MOQ 200, lead time 10–12 days, weight ~1kg, packing specs, logo confirmed 10×3cm. ❌ Capabilities PENDING (supplier frustrated: “你到底要什么”).
E2 One-Question	Partial	2 violations: long statement + tangent question, and multi-part capability question.
E3 Turn Efficiency	Fail	~20 bot messages. Stalled overnight for logo file. Capability tangent frustrated supplier.
E4 No Hallucination	Pass	No fabricated data.
E5 Extractability	Partial	Core data good. Logo pricing not separately broken out — is $6 with or without logo? Unclear.
E6 Auto-Response	Pass	Supplier sent auto-greeting + FAQ card. Bot ignored and waited for real response. Correct.
E7 Naturalness	Partial	Handled “are you a bot?” well. But supplier got annoyed at capability questions (“你到底要什么”).
E8 Rejection Recovery	Pass	Supplier pushed to switch to voice call twice. Bot firmly but politely declined, citing platform policy. Good recovery.
E9 Customization	Partial	Logo confirmed, size adjusted based on supplier feedback (10×10 → 10×3). Good adaptation. Logo pricing not isolated.

S1 Negotiation: No attempt.

Conv 2051 — Skateboard Sneakers with Logo 38 msgs · 5.0/9

Bot contacts a shoe supplier about skateboard sneakers. SR requires logo customization.

Dim	Score	Notes
E1 Goal Completion	Partial	4/7 goals. Price ¥75 (300), ¥67 (500), ¥62 (1000). MOQ 100–500. Lead time 20–45 days. ❌ Packing specs PENDING. ❌ Capabilities PENDING.
E2 One-Question	Partial	2 borderline violations: rapid-fire messages and answer + question combos.
E3 Turn Efficiency	Fail	~16 bot messages. Pursued carton specs when supplier said “I can’t answer without the order.” Confusing mid-conversation product link switch.
E4 No Hallucination	Pass	No fabricated data.
E5 Extractability	Fail	Price tiers and lead time good. No exact weight, no carton specs. Conversation scattered across topics.
E6 Auto-Response	Pass	N/A — no auto-responses.
E7 Naturalness	Partial	Adapted to supplier switching to English. But second product link mid-conversation was confusing. Robotic packing-specs pursuit.
E8 Rejection Recovery	Partial	Supplier said can’t provide carton specs without order. Bot kept asking variations instead of moving on.
E9 Customization	Partial	Got logo pricing tiers and lead time. Didn’t complete full branch (no file format, no placement confirmation).

S1 Negotiation: No attempt.

Conv 2053 — Car Phone Holder, No Custom 19 msgs · 8.5/9

Bot contacts a phone mount supplier. Quick, efficient conversation with a very responsive (possibly AI-assisted) supplier.

Dim	Score	Notes
E1 Goal Completion	Pass	All goals achieved. Price $6.30/piece (200 pcs, with or without logo). MOQ 200. Lead time 7–10 days. Dimensions, weight, packing, logo pricing — all confirmed.
E2 One-Question	Partial	1 violation: asked about two lead-time variants in one question.
E3 Turn Efficiency	Pass	9 bot messages. Clean, efficient. All data in 9 turns with no waste.
E4 No Hallucination	Pass	Final summary accurately recaps all data.
E5 Extractability	Pass	Self-summary: “200个手机支架单价6.30美元，常规版交期7天，带Logo版10天，外箱尺寸60×36×45cm，重22.6kg.” Perfect.
E6 Auto-Response	Pass	Supplier responses looked AI-assisted. Bot treated them as real answers — correct, since data was accurate.
E7 Naturalness	Pass	Clean, professional, efficient. Good closing summary.
E8 Rejection Recovery	Pass	N/A — no rejections.
E9 Customization	Partial	Got logo pricing and lead time. Didn’t follow up on supplier’s mention of “定制包装MOQ500件”.

S1 Negotiation: No attempt.

Conv 2054 — Car Phone Holder K11, Capabilities Assessment 21 msgs · 8.0/9

Bot contacts another phone mount supplier. Same product category as 2053. Conversation includes a full capabilities assessment.

Dim	Score	Notes
E1 Goal Completion	Pass	All goals achieved. Price ¥7.5/piece (200 pcs). MOQ 1–1000. Lead time 1 week. Dimensions, weight, packing, capabilities — all confirmed.
E2 One-Question	Pass	Every bot message asks exactly one question. Excellent discipline.
E3 Turn Efficiency	Pass	10 bot messages. Efficient, especially given the capabilities assessment adds extra questions.
E4 No Hallucination	Pass	No fabrication.
E5 Extractability	Pass	Supplier sent complete spec sheet. Bot acknowledged and moved to capabilities. Clean data.
E6 Auto-Response	Pass	Supplier sent auto-greeting with FAQ buttons. Bot ignored buttons and waited for real price. Correct.
E7 Naturalness	Partial	Efficient but capabilities questions feel formulaic in sequence: factory products → OEM/ODM → certifications → general MOQ.
E8 Rejection Recovery	Partial	Accepted “不” (no OEM/ODM) and “没” (no certs) correctly. But a human would probe: “if not full ODM, can you at least change colors?”
E9 Customization	Partial	Got logo MOQ (1000) and lead time. Didn’t probe logo pricing separately — is ¥7.5 with or without logo?

S1 Negotiation: No attempt.

Conv 2058 — Storage Box with Logo 37 msgs · 6.5/9

Bot contacts a storage box supplier. Conversation involves logo customization and a capabilities assessment.

Dim	Score	Notes
E1 Goal Completion	Pass	All goals achieved. Price ¥5.2 (1000 pcs w/ logo), ¥9.5 (200 pcs). MOQ 200–1000. Lead time ~1 month (custom). Dimensions, weight, packing, capabilities confirmed.
E2 One-Question	Partial	2 violations: multi-part capability question and multi-part packing question.
E3 Turn Efficiency	Fail	~16 bot messages. Slow supplier needed follow-up. Bot asked about weight when supplier said “it’s on the listing page.”
E4 No Hallucination	Pass	No fabrication.
E5 Extractability	Partial	Two price tiers collected. Carton specs received as image (harder to extract). Weight incomplete.
E6 Auto-Response	Pass	Supplier sent auto-greeting (“客服离线中”). Bot waited for real response. Correct.
E7 Naturalness	Partial	Asked about weight when supplier said it’s on the listing. Asked “35×20×15 is in cm?” — slightly naive.
E8 Rejection Recovery	Pass	N/A — no rejections.
E9 Customization	Partial	Got logo pricing at scale. Discovered material customization possible. Didn’t pursue logo pricing at 200-piece level.

S1 Negotiation: No attempt.

Aggregate Analysis

Score Distribution

Dimension	Pass	Partial	Fail	Avg Score
E1: Goal Completion	3	3	0	0.75
E2: One-Question Discipline	1	4	1	0.50
E3: Turn Efficiency	2	0	4	0.33
E4: No Hallucination	6	0	0	1.00
E5: Structured Extractability	2	3	1	0.58
E6: Auto-Response Handling	6	0	0	1.00
E7: Conversational Naturalness	1	5	0	0.58
E8: Rejection Recovery	3	3	0	0.75
E9: Customization Handling	0	6	0	0.50
S1: Negotiation	0/6 attempted

Average Score by Dimension

E4 No Hallucination

1.00

E6 Auto-Response

1.00

E1 Goal Completion

0.75

E8 Rejection Recovery

0.75

E5 Extractability

0.58

E7 Naturalness

0.58

E2 One-Question

0.50

E9 Customization

0.50

E3 Turn Efficiency

0.33

Comparison: Historical Bot vs Latest Bot

Dimension	Historical (Dec–Feb)	Latest (Mar 10–11)	Delta
E1: Goal Completion	0.50	0.75	+0.25
E2: One-Question	0.70	0.50	−0.20
E3: Turn Efficiency	0.30	0.33	+0.03
E4: No Hallucination	0.90	1.00	+0.10
E5: Extractability	0.40	0.58	+0.18
E6: Auto-Response	1.00	1.00	=
E7: Naturalness	0.60	0.58	−0.02
E8: Rejection Recovery	0.70	0.75	+0.05
E9: Customization	0.90	0.50	−0.40
Overall	6.2/9 (69%)	6.5/9 (72%)	+0.3
S1: Negotiation	0/5	0/6	=

Customization handling regressed −0.40 The latest bot has logo customization in every SR but struggles to complete the branch — usually because it can’t/doesn’t send the actual logo file. Every conversation scored Partial. This is the biggest single regression.

Hallucination is now perfect: 1.00 across all 6 conversations Zero fabricated data. Strong improvement from 0.90. The bot only states facts the supplier actually provided.

Key Findings

Goal completion improved significantly (+0.25). The latest bot has a richer goal structure (5 goals per conversation) and achieves more of them. Three conversations got full marks.
Turn efficiency remains the weakest dimension (0.33). Four of six conversations exceed 12 bot messages. The bot still doesn’t know when to stop or when to accept “I can’t answer that now.”
Customization handling dropped (−0.40). Paradoxically, the latest bot has customization in every SR (logo printing) but struggles to complete the branch — usually because it can’t/doesn’t send the actual logo file. Every conversation scored Partial.
One-question discipline regressed (−0.20). The latest bot bundles questions more often, especially when asking about capabilities or packing specs.
Zero hallucination across all 6 conversations. Strong improvement — no fabricated data at all.
New feature: capabilities assessment. The latest bot asks about factory certifications, OEM/ODM, main product categories. Useful but currently formulaic and sometimes irritates suppliers.
Price negotiation still completely absent. Neither bot version attempts to negotiate.
Best conversation (2053 at 8.5/9) shows the bot excels with responsive suppliers who give structured data. Worst cases (1971, 2051 at 5.0/9) involve suppliers who need the file first or can’t answer without order details.

Overall Assessment

The latest bot is a modest improvement over the historical version (+3pp, 69% → 72%). The gains are in goal completion and hallucination avoidance. The persistent weaknesses are turn efficiency, one-question discipline, and customization handling — the same areas where our bot (Eric’s version) can differentiate.

The 72% baseline is what we need to beat on Friday.

Sources

Data source: Nelson’s 6-conversation archive (Google Drive, shared Mar 11)
Bot version: Current production supplier bot (Mastra + Gemini 3.0 Flash)
Scoring rubric: eval-rubric-v1.md
Scorer: LLM evaluation (Claude) against eval rubric. For official baseline, recommend Shen blind-score these same conversations.