Historical Bot Benchmark

Eval V1 — 5 Real Production Conversations Scored on 9 Dimensions
Date: 11 March 2026
Conversations: 5 (real production data)
Average score: 6.2/9 (69%)
Best: 8.5/9 (conv 331172)
Negotiation: 0/5 (no attempts)

Data source: bot-messages-prod-2025-12-2026-02.csv — real production bot conversations from Dec 2025–Feb 2026.


Summary

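| Conv ID | Product | Msgs | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | Core | S1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 319698 | 40oz insulated cup (sample) | 28 | Pa | Pa | F | P | Pa | P | Pa | F | N/A | 5.5 | No |
| 308471 | Lip gloss #09 (1000 pcs) | 28 | Pa | P | F | Pa | Pa | P | Pa | Pa | N/A | 6.0 | No |
| 266217 | Leather bag, brown (100 pcs) | 29 | Pa | Pa | F | P | F | P | Pa | N/A | N/A | 5.5 | No |
| 331172 | Double-wall paper cup, logo (1000 pcs) | 30 | P | Pa | Pa | P | P | P | P | N/A | P | 8.5 | No |
| 272292 | Heart-shaped mold (inquire only) | 3 | F | P | P | P | F | N/A | Pa | N/A | N/A | 5.5 | No |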

P = Pass (1), Pa = Partial (0.5), F = Fail (0), N/A = not applicable (scored as Pass).

Average core score: 6.2/9 (69%)
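
The scoring rule in the legend can be written out as a short calculation. The following is a minimal sketch, assuming the core score is simply the sum of the nine E-dimension point values with N/A counted as a full point; the names `POINTS`, `GRADES`, `core_score`, and `as_percent` are illustrative and not part of the original evaluation tooling.

```python
# Point values from the legend; N/A is scored as a Pass.
POINTS = {"P": 1.0, "Pa": 0.5, "F": 0.0, "N/A": 1.0}

# E1..E9 grades copied from the summary table, keyed by conversation ID.
GRADES = {
    "319698": ["Pa", "Pa", "F", "P", "Pa", "P", "Pa", "F", "N/A"],
    "308471": ["Pa", "P", "F", "Pa", "Pa", "P", "Pa", "Pa", "N/A"],
    "266217": ["Pa", "Pa", "F", "P", "F", "P", "Pa", "N/A", "N/A"],
    "331172": ["P", "Pa", "Pa", "P", "P", "P", "P", "N/A", "P"],
    "272292": ["F", "P", "P", "P", "F", "N/A", "Pa", "N/A", "N/A"],
}


def core_score(grades):
    """Sum the point values of the nine E-dimension grades."""
    return sum(POINTS[g] for g in grades)


def as_percent(score, max_score=9):
    """Express a core score as a rounded percentage of the maximum."""
    return round(100 * score / max_score)


# Assumption: a plain unweighted sum, which is the literal reading of the legend.
totals = {conv: core_score(g) for conv, g in GRADES.items()}
mean_total = sum(totals.values()) / len(totals)
```

Under this literal reading the five totals come to 5.0, 5.5, 5.5, 8.0, and 6.5 (mean 6.1). That differs from the Core column for four of the five rows, so the reported figures may reflect judgment calls that the legend does not capture.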


Detailed Scoring

Conversation 319698 — 40oz Insulated Cup (28 msgs)

Bot inquires about a 40oz insulated cup. Starts with a product link, then asks about MOQ, sample price, customization, bulk price, packing specs.

E1: Goal Completion → Partial (0.5)
  • ✓ MOQ: implied no MOQ (supplier said “直接下单就可以”, “you can just place the order”)
  • ✓ Price: got sample price (25 RMB incl. shipping) and bulk price (¥16.5/pc for 500)
  • ✗ Lead time: never explicitly asked or obtained
  • ✓ Packing specs: got single cup weight (700g), asked about bulk carton specs but supplier was vague
  • ✗ Sample terms: correctly treated quantity-1 as a sample and got the price, but didn’t ask about sample lead time
  • ✓ Customization: confirmed customization is possible (“可以定制”)

Achieved: 4/6 → Partial

E2: One-Question Discipline → Partial (0.5)

Bot mostly asks one question per message, but message 14 packs an example into the question (“请问这款产品支持定制吗(比如加印Logo)?”, “Does this product support customization, for example adding a printed logo?”). 1 borderline violation → Partial

E3: Turn Efficiency → Fail (0)

28 messages total, 14 bot messages. Conversation continued long after price was confirmed — bot kept asking incremental questions when these could have been more efficiently sequenced. Also continued after the handshake emoji. 14 bot messages + wasted turns → Fail

E4: No Hallucination → Pass (1)

Bot only stated information the supplier provided. No fabrication detected.

E5: Structured Extractability → Partial (0.5)

Lead time missing and packing specs are vague → Partial

E6: Auto-Response Handling → Pass (1)

Supplier sent the platform notification “商家长时间没有回复” (“the merchant has not replied for a long time”). Bot waited, then followed up cleanly.

E7: Conversational Naturalness → Partial (0.5)

Polite Chinese with an appropriate tone. However, the openers are repetitive (“好的”, “OK”) and the exchange feels formulaic. Reads like a competent but obvious bot.

E8: Rejection Recovery → Fail (0)

Supplier sent packing info as an image (not text). Bot said “好的,收到” (“OK, received”) but didn’t extract the details or follow up. When supplier was vague about carton specs, bot didn’t push and just moved on.

E9: Customization Handling → N/A (scored as Pass)

SR was a sample inquiry, no customization branch needed.

S1: Price Negotiation → No

Bot accepted ¥16.5 for 500 pcs without any negotiation attempt.


Conversation 308471 — Lip Gloss #09 (28 msgs)

Bot inquires about lip gloss, discovers #09 is discontinued, adapts to alternatives.

E1: Goal Completion → Partial (0.5)
  • ✓ MOQ: implied 1000 pcs (supplier can order from brand)
  • ✓ Price: page price with a 9–9.5折 discount (5–10% off, i.e. 90–95% of page price)
  • ✓ Lead time: one week (“正常周期也是一周”, “the normal lead time is also one week”)
  • ✗ Packing specs: never asked
  • ✗ Sample terms: never asked

Achieved: 3/5 → Partial

E2: One-Question Discipline → Pass (1)

Bot consistently asks one question per message throughout. Good discipline.

E3: Turn Efficiency → Fail (0)

28 messages, ~12 bot messages. Bot sent a redundant message asking for recommendations AFTER the supplier already recommended #08 and #10. Also kept asking about price in different ways. 12 bot messages with 2+ wasted turns → Fail

E4: No Hallucination → Partial (0.5)

Bot asked “请问最终的含税出厂单价是多少?” (“What is the final tax-inclusive ex-factory unit price?”), mentioning tax when the supplier never brought it up. Contradicts rule-no-tax-mention. Minor but noted.

E5: Structured Extractability → Partial (0.5)

Price is not a concrete number (page price × 0.9 or 0.95), packing missing → Partial
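
To make the extractability problem concrete, the sketch below shows the arithmetic behind a 折 quote. A figure of N折 means paying N/10 of the list price, so the supplier's answer pins down only a range until the page price is known. The function names and the 10.00 RMB page price are hypothetical, chosen purely for illustration; the report does not give the actual page price.

```python
def zhe_to_multiplier(zhe: float) -> float:
    """Convert a 折 figure to a price multiplier: 9折 means paying 90% of list price."""
    return zhe / 10.0


def discounted_range(page_price: float, low_zhe: float = 9.0, high_zhe: float = 9.5):
    """Return the (lowest, highest) unit price implied by a 折 range."""
    return (
        page_price * zhe_to_multiplier(low_zhe),
        page_price * zhe_to_multiplier(high_zhe),
    )


# Hypothetical page price of 10.00 RMB, used only to illustrate the arithmetic.
low, high = discounted_range(10.00)
```

Even with the page price filled in, the result is still a range rather than a single unit price, which is consistent with the Partial grade here.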

E6: Auto-Response Handling → Pass (1)

Supplier sent “1688官方客服” (“1688 official customer service”) satisfaction survey. Bot ignored it and waited for a real response. Correct.
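
One simple way to reproduce this behavior is to filter out platform-generated messages before deciding whether the supplier has actually replied. The sketch below is an assumption about how such a filter could look, not a description of the bot's real logic. The marker strings are the two notifications quoted in this report, and a production filter would likely need a longer list.

```python
# Marker strings quoted in this report; assumed to identify platform messages.
AUTO_MARKERS = ("商家长时间没有回复", "1688官方客服")


def is_auto_response(text: str) -> bool:
    """Return True if the message looks platform-generated rather than supplier-written."""
    return any(marker in text for marker in AUTO_MARKERS)


def supplier_replies(messages):
    """Keep only the messages that appear to come from a human supplier."""
    return [m for m in messages if not is_auto_response(m)]


# Example: only the first message counts as a real supplier reply.
real = supplier_replies(["可以定制", "商家长时间没有回复"])
```

Substring matching is deliberately crude; it is enough to show the idea of ignoring surveys and timeout notices while waiting for a genuine answer.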

E7: Conversational Naturalness → Partial (0.5)

Handled the discontinuation well and adapted to alternatives. But it asked redundant questions about price format, and “含税出厂单价” (“tax-inclusive ex-factory unit price”) is slightly robotic.

E8: Rejection Recovery → Partial (0.5)

Supplier rejected #09 (discontinued). Bot recovered by asking for alternatives — good. But when told #08 and #10 were ALSO discontinued with low stock, bot didn’t probe further (e.g., “what colors DO you have in stock?”). Accepted and moved on.

E9: Customization Handling → N/A (scored as Pass)

No customization in this SR.

S1: Price Negotiation → No

Bot asked about discount tiers but didn’t actively negotiate. Accepted the 9–9.5折 (5–10% off) without countering.


Conversation 266217 — Leather Bag, Brown (29 msgs)

Bot inquires about 100 leather bags. Long conversation about packing specs with supplier who doesn’t have standard carton sizes.

E1: Goal Completion → Partial (0.5)
  • ✗ MOQ: never explicitly asked (just said “100个”, “100 units”)
  • ✓ Price: ¥86/pc for 100
  • ✓ Lead time: 15 days after order confirmation
  • ✗ Packing specs: got single bag dimensions (32×12×15cm, 560g) but never got carton specs
  • ✗ Sample terms: never asked

Achieved: 2/5 → Partial (borderline Fail, but price and lead time are the two most important)

E2: One-Question Discipline → Partial (0.5)

2 violations: one message answered two supplier questions + asked one (loaded), another asked a similar question twice. 2 violations → Partial

E3: Turn Efficiency → Fail (0)

29 messages, ~14 bot messages. Bot spent 8+ messages trying to get carton packing specs from a supplier who clearly didn’t have standard specs. Should have accepted the single-unit dimensions and moved on. Major turn waste.

E4: No Hallucination → Pass (1)

No fabricated information.

E5: Structured Extractability → Fail (0)

Only 2 clean data points (price, lead time). No carton specs despite 8 messages trying. Cannot construct a usable supplier card → Fail

E6: Auto-Response Handling → Pass (1)

No auto-responses in this conversation.

E7: Conversational Naturalness → Partial (0.5)

Polite and adapted to responses. However, repeated packing-spec questions read as persistent and inflexible — a human would have pivoted after 2 attempts.

E8: Rejection Recovery → N/A (scored as Pass)

Supplier didn’t reject — they were vague about packing. The bot’s inability to pivot is more of an E3/E7 issue than E8.

E9: Customization Handling → N/A (scored as Pass)

No customization in this SR.

S1: Price Negotiation → No

Accepted ¥86 without any negotiation.


Conversation 331172 — Double-Wall Paper Cup with Logo (30 msgs)

Bot inquires about 12oz double-wall paper cups with custom logo printing. This is a customization SR.

Best conversation: 8.5/9. This conversation demonstrates the bot at its best: cooperative supplier, clear product, and competent customization handling. All goals achieved with specific, numeric data.

E1: Goal Completion → Pass (1)
  • ✓ MOQ: 1000 pcs for single-side coated with logo
  • ✓ Price: ¥0.42/pc for 1000, standard four-color printing
  • ✓ Lead time: 7–15 days (printing factory resumes after Lantern Festival, estimated mid-March)
  • ✓ Packing specs: 1000 pcs/carton, 62×50×48.5cm, 18kg
  • ✓ Sample terms: not explicitly asked, but the 1000-pc MOQ is the entry point
  • ✓ Customization: confirmed logo printing feasible, got spec and pricing

Achieved: 6/6 → Pass

E2: One-Question Discipline → Partial (0.5)

1–2 borderline violations: asked for cup height/diameter after goals complete, re-asked about 12oz price after supplier sent price table image.

E3: Turn Efficiency → Partial (0.5)

30 messages, ~12 bot messages. Mostly efficient but asked for cup height and bottom diameter AFTER all goals were achieved — unnecessary final turn. 12 bot messages with 1–2 wasted turns → Partial

E4: No Hallucination → Pass (1)

All stated facts trace to supplier messages.

E5: Structured Extractability → Pass (1)

All data is numeric, specific, and directly usable. MOQ: 1000 pcs. Price: ¥0.42/pc. Lead time: 7–15 days. Packing: 1000 pcs/carton, 62×50×48.5cm, 18kg. Customization: logo confirmed, four-color.
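
As an illustration of what "directly usable" means, the values above map cleanly onto a typed record. The `SupplierCard` class below is a hypothetical structure with illustrative field names; the report does not specify the actual supplier card schema.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass
class SupplierCard:
    """Hypothetical record for the data points E5 checks; field names are illustrative."""
    moq_pcs: Optional[int] = None
    unit_price_rmb: Optional[float] = None
    lead_time_days: Optional[Tuple[int, int]] = None  # (min, max)
    pcs_per_carton: Optional[int] = None
    carton_cm: Optional[Tuple[float, float, float]] = None  # (length, width, height)
    carton_kg: Optional[float] = None
    customization: Optional[str] = None

    def missing(self):
        """Return the names of fields that were not extracted."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Values taken from conversation 331172 as reported above.
card_331172 = SupplierCard(
    moq_pcs=1000,
    unit_price_rmb=0.42,
    lead_time_days=(7, 15),
    pcs_per_carton=1000,
    carton_cm=(62, 50, 48.5),
    carton_kg=18,
    customization="logo, four-color printing",
)
```

Every field is populated for this conversation, so `card_331172.missing()` returns an empty list. A conversation that yields only an MOQ would leave the other six fields empty.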

E6: Auto-Response Handling → Pass (1)

Supplier sent “1688官方客服” (“1688 official customer service”) satisfaction survey mid-conversation. Bot ignored it and continued. Correct.

E7: Conversational Naturalness → Pass (1)

Bot sounds natural: good Chinese, appropriate tone, smooth transitions. “都清楚了。非常感谢您的耐心解答!” (“Everything is clear now. Thank you very much for your patient answers!”) is a natural close. No robotic markers.

E8: Rejection Recovery → N/A (scored as Pass)

No rejections in this conversation.

E9: Customization Handling → Pass (1)

Bot asked about logo printing from the opening message. Got MOQ for custom (1000), price for custom (¥0.42), confirmed four-color printing, confirmed spec. Navigated customization branch competently.

S1: Price Negotiation → No

Accepted ¥0.42 without negotiation.


Conversation 272292 — Heart-Shaped Mold (3 msgs)

Ultra-short conversation — bot asks about MOQ, supplier says “no MOQ, in stock”, conversation ends.

E1: Goal Completion → Fail (0)
  • ✓ MOQ — no MOQ, in stock
  • ✗ Price — not asked
  • ✗ Lead time — not asked
  • ✗ Packing — not asked
  • ✗ Sample — not asked

Achieved: 1/5 → Fail

E2: One-Question Discipline → Pass (1)

Only one bot message with one question.

E3: Turn Efficiency → Pass (1)

2 bot messages. But efficiency is meaningless when nothing was collected.

E4: No Hallucination → Pass (1)

Nothing to hallucinate.

E5: Structured Extractability → Fail (0)

Only MOQ collected. No usable supplier card.

E6: Auto-Response Handling → N/A (scored as Pass)

No auto-responses.

E7: Conversational Naturalness → Partial (0.5)

Opening was fine but conversation just stopped. A competent buyer would follow up.

E8: Rejection Recovery → N/A (scored as Pass)

No rejection.

E9: Customization Handling → N/A (scored as Pass)

No customization needed.

S1: Price Negotiation → No

Aggregate Analysis

Score Distribution

| Dimension | Pass | Partial | Fail | Avg Score |
|---|---|---|---|---|
| E1: Goal Completion | 1 | 3 | 1 | 0.50 |
| E2: One-Question Discipline | 2 | 3 | 0 | 0.70 |
| E3: Turn Efficiency | 1 | 1 | 3 | 0.30 |
| E4: No Hallucination | 4 | 1 | 0 | 0.90 |
| E5: Structured Extractability | 1 | 2 | 2 | 0.40 |
| E6: Auto-Response Handling | 5 | 0 | 0 | 1.00 |
| E7: Conversational Naturalness | 1 | 4 | 0 | 0.60 |
| E8: Rejection Recovery | 3 | 1 | 1 | 0.70 |
| E9: Customization Handling | 4 | 0 | 1 | 0.90 |

S1: Negotiation: 0/5 attempted
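
The distribution can be re-derived from the per-conversation grades. The sketch below assumes N/A is folded into the Pass column, as the scoring legend implies, and that each average is the unweighted mean of the point values. The helper name `distribution` and the data layout are illustrative, not taken from the original tooling.

```python
from collections import Counter

# Point values from the scoring legend; N/A is scored as a Pass.
POINTS = {"P": 1.0, "Pa": 0.5, "F": 0.0, "N/A": 1.0}
DIMENSIONS = ["E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9"]

# E1..E9 grades per conversation, copied from the Summary section.
GRADES = {
    "319698": ["Pa", "Pa", "F", "P", "Pa", "P", "Pa", "F", "N/A"],
    "308471": ["Pa", "P", "F", "Pa", "Pa", "P", "Pa", "Pa", "N/A"],
    "266217": ["Pa", "Pa", "F", "P", "F", "P", "Pa", "N/A", "N/A"],
    "331172": ["P", "Pa", "Pa", "P", "P", "P", "P", "N/A", "P"],
    "272292": ["F", "P", "P", "P", "F", "N/A", "Pa", "N/A", "N/A"],
}


def distribution(dim_index):
    """Return (pass, partial, fail, average) for one dimension, counting N/A as Pass."""
    counts = Counter()
    total = 0.0
    for grades in GRADES.values():
        grade = grades[dim_index]
        counts["P" if grade == "N/A" else grade] += 1
        total += POINTS[grade]
    return counts["P"], counts["Pa"], counts["F"], total / len(GRADES)


table = {dim: distribution(i) for i, dim in enumerate(DIMENSIONS)}
```

Under this reading, E1 through E8 reproduce the counts and averages listed above. E9 comes out as 5 Pass, 0 Partial, 0 Fail (1.00), because every E9 grade in the Summary is either Pass or N/A, so the E9 row may have been tallied differently.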

Average Score per Dimension

E6 Auto-Response: 1.00
E4 No Hallucination: 0.90
E9 Customization: 0.90
E2 One-Question: 0.70
E8 Rejection Recovery: 0.70
E7 Naturalness: 0.60
E1 Goal Completion: 0.50
E5 Extractability: 0.40
E3 Turn Efficiency: 0.30
S1 Negotiation: 0/5

Key Findings

1. Turn Efficiency is the weakest dimension (0.30 avg). The bot consistently uses too many messages: the three conversations that failed E3 ran to roughly 12–14 bot messages each. The bot doesn’t know when to stop or how to batch related questions.
2. Goal Completion is inconsistent (0.50 avg). One conversation achieved all goals perfectly (331172), but others miss 2–3 goals. The bot often skips sample terms and struggles when packing specs aren’t standard.
3. No Hallucination is strong (0.90 avg). The bot rarely fabricates data. One minor violation (mentioning tax unprompted).
4. Auto-Response Handling is excellent (1.00 avg). The bot handles 数字客服回复 (automated customer-service replies) well: it ignores them or follows up appropriately.
5. Conversational Naturalness needs work (0.60 avg). Mostly passable but reads formulaic. Repetitive openers (“好的”), inflexible when supplier doesn’t give expected data format.
6. Price Negotiation is completely absent (0/5). Not a single attempt across all conversations. This is a capability gap, not a quality issue: the bot was never designed to negotiate.
7. One strong conversation (331172 at 8.5/9) shows the bot CAN perform well. When the supplier is cooperative, the product is clear, and customization is straightforward, the bot delivers. The challenge is consistency across varied supplier behaviors.

Comparison: Previous Rubric vs Eval V1

Our previous benchmark (Mar 4, S1–S5 rubric, simulated conversations) reported scores that were inflated for three reasons:
  1. Simulated (cooperative) suppliers, not real ones
  2. Only 5 dimensions, missing E6–E9
  3. Weighted scoring that favored goal completion

Eval V1 on real production data: 69% average. This is the honest baseline.

Sources

For an official baseline, we recommend that Shen or another human judge blind-score 5–10 conversations.