This document explains how we built the evaluation framework for supplier bot conversations. It answers three questions:
We worked from three independent evidence sources:
Historical bot-supplier conversations across two datasets:
What the logs told us:
We fully analyzed the rulebook — all 157 rules across 21 files.
| Category | Rules | Notes |
|---|---|---|
| Goal management (core + ordering) | 10 | Core logic |
| Customization (color, logo, material, size, packaging, photo, MOQ) | 70 | 46% of goal-bound rules |
| Non-custom flows (lead time, price, stock, sample, packing, spec) | 42 | Standard info collection |
| Conversation management | 13 | Flow control |
| Message structure | 4 | Formatting |
| Stock handling | 8 | Availability checks |
| Misc (negotiation, customization general, sample) | 5 | Edge cases |
| Other (food grade, info extraction, media) | 5 | Narrow scope |
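The category counts can be checked against the stated total of 157 rules. This is a minimal sketch with the counts copied from the table; the dictionary keys are shorthand labels introduced here for illustration and do not come from the rulebook.

```python
# Rule counts per category, copied from the table above.
# The key names are illustrative shorthand, not identifiers from the rulebook.
RULES_PER_CATEGORY = {
    "goal_management": 10,
    "customization": 70,
    "non_custom_flows": 42,
    "conversation_management": 13,
    "message_structure": 4,
    "stock_handling": 8,
    "misc": 5,
    "other": 5,
}


def total_rules(counts: dict) -> int:
    """Sum the per-category rule counts."""
    return sum(counts.values())


if __name__ == "__main__":
    print(total_rules(RULES_PER_CATEGORY))
```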
We analyzed ~17 conversations where Sourcy’s human growth team (东印度采购) manually sourced from 1688 suppliers. These revealed what a competent human buyer does that bots currently don’t:
Each dimension is scored Pass / Partial / Fail (1 / 0.5 / 0). All dimensions carry equal weight. Maximum core score: 9/9.
| # | Dimension | What It Checks | Primary Source |
|---|---|---|---|
| E1 | Goal Completion | Did the bot collect the data points it was supposed to? | Rules, Bot logs |
| E2 | One-Question Discipline | Does each bot message ask at most one question? | Rules, Bot logs |
| E3 | Turn Efficiency | Did the bot get there without wasting turns? | Bot logs, Rules |
| E4 | No Hallucination | Did the bot only state facts the supplier actually provided? | Rules, Bot logs |
| E5 | Structured Extractability | Could you pull a clean supplier quote from this conversation? | Rules, Human logs |
| E6 | Auto-Response Handling | When the supplier sent a 数字客服回复 (automated customer-service reply), did the bot handle it? | Rules, Bot logs |
| E7 | Conversational Naturalness | Does this read like a competent buyer, not a form bot? | Human logs, Shen |
| E8 | Rejection Recovery | When the supplier pushed back, did the bot explore alternatives? | Rules, Human logs |
| E9 | Customization Handling | When the SR involves customization, did the bot navigate it competently? | Rules, Human logs, Karl |
| S1 | Price Negotiation Attempt | Did the bot make any attempt to negotiate price? | Human logs, Rules |
S1 is reported separately — it’s a capability the current bot doesn’t have, so it’s a stretch target, not a core pass/fail.
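The scoring scheme above can be sketched in a few lines. The function and constant names are hypothetical, and one behavior is an assumption rather than something the rubric states: a dimension marked not applicable (represented as `None`) is left out of both the points and the maximum.

```python
from typing import Dict, Optional, Tuple

# Pass / Partial / Fail map to 1 / 0.5 / 0, as defined in the rubric.
GRADE_POINTS = {"Pass": 1.0, "Partial": 0.5, "Fail": 0.0}

# Core dimensions E1..E9; S1 is a stretch metric and is not scored here.
CORE_DIMENSIONS = [f"E{i}" for i in range(1, 10)]


def core_score(grades: Dict[str, Optional[str]]) -> Tuple[float, int]:
    """Return (points, max_points) over the core dimensions.

    Assumption: a grade of None means "not applicable" and is excluded
    from both the points and the maximum.
    """
    points = 0.0
    max_points = 0
    for dim in CORE_DIMENSIONS:
        grade = grades[dim]  # every core dimension must be present
        if grade is None:
            continue
        points += GRADE_POINTS[grade]  # equal weight per dimension
        max_points += 1
    return points, max_points


example = {
    "E1": "Pass", "E2": "Pass", "E3": "Partial",
    "E4": "Pass", "E5": "Pass", "E6": "Pass",
    "E7": "Pass", "E8": "Fail", "E9": "Pass",
}
points, max_points = core_score(example)
print(f"Core: {points:g}/{max_points}")
```

Any extra keys in the input, such as S1, are ignored because only E1 through E9 are iterated.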
The critical question: “If we evaluate on 9 dimensions instead of 157 rules, do we lose anything?” No. Here’s the full mapping:
Note: Some rules are shared across dimensions. E5 (Structured Extractability) evaluates data quality across all goal target fields rather than mapping to specific rules.
E1 (Goal Completion): all of goal_core.md (10 rules), plus every non-custom flow rule that ends in a goal status update:

- NC_LEAD_GROUP.md (6 rules) → lead time goal
- NC_PRICE_GROUP.md (8 rules) → price goal
- NC_PACK_GROUP.md (6 rules) → packaging goal
- NC_SAMPLE_GROUP.md (10 rules) → sample goal
- NC_STOCK_GROUP.md (7 rules) → MOQ / stock goal
- NC_SPEC_GROUP.md (5 rules) → specifications goal

These rules define how to reach each goal. E1 evaluates the outcome: did we get there?
E2 (One-Question Discipline): rule-one-question-per-message (priority 8). This is a single rule, but violating it is the most frequent quality failure in the bot logs. It gets its own dimension because it is both easy to measure and high-impact.
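Because this rule is easy to measure, a rough automated check is possible. The sketch below is a heuristic, not the evaluator the team uses: it counts ASCII and full-width question marks per bot message after stripping URLs (product links often contain `?` in the query string). The function names are hypothetical, and questions written without a question mark will be missed.

```python
import re
from typing import List

# Strip URLs first so "?" inside a query string is not counted as a question.
_URL = re.compile(r"https?://\S+")
# ASCII "?" and full-width "？"; a run such as "??" counts as one question.
_QUESTION_MARKS = re.compile(r"[?？]+")


def count_questions(message: str) -> int:
    """Approximate the number of questions in one message."""
    text = _URL.sub("", message)
    return len(_QUESTION_MARKS.findall(text))


def passes_one_question(bot_messages: List[str]) -> bool:
    """True if every bot message asks at most one question."""
    return all(count_questions(m) <= 1 for m in bot_messages)
```

A message that fails this check still deserves a human look, since a rhetorical or quoted question would be counted as well.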
E3 (Turn Efficiency):

- rule-conclude-when-complete: don’t keep going after goals are done
- rule-limit-confirmations: don’t re-confirm what’s already confirmed
- rule-no-repeat-questions: don’t ask what’s already answered

Three rules, one outcome: don’t waste turns.
E4 (No Hallucination):

- rule-provide-only-known-information (priority 10)
- rule-no-tax-mention

The bot should only state facts from context or supplier responses. It must never fabricate.
E5 (Structured Extractability): evaluates the quality of collected data. Are values specific, numeric, and unambiguous? It maps to the goal target_fields schema across all 157 rules rather than to specific rules.
E6 (Auto-Response Handling):

- rule-auto-response: detect messageType-based auto-responses
- rule-non-meaningful-reply-reminder: handle replies that don’t answer the question
- rule-unresponsive: handle suppliers who go silent

These 3 rules cover the ~31% of real conversations in which suppliers don’t meaningfully respond.
E7 (Conversational Naturalness):

- rule-first-product-link: open with the product link naturally
- rule-one-question-per-message: natural pacing (shared with E2)
- rule-export-market-inquiry: answer the supplier’s questions about our market
- rule-image-request: use media when appropriate

Plus the human-conversation standard: polite, direct, language-flexible, not robotic. This is where Shen’s quality standard lives.
E8 (Rejection Recovery):

- rule-goal-status-failed: when a goal can’t be achieved, handle it gracefully
- rule-contact-refuse-escalate: escalate when the supplier refuses
- rule-file-upload-not-received: handle file delivery failures
- rule-stock-ask-alternatives: find alternatives when out of stock
- rule-custom-moq-too-high-negotiate: negotiate when MOQ exceeds need

Human conversations show this clearly: “做不了” (“can’t do it”) is not the end. Humans explore alternatives.
E9 (Customization Handling): the largest dimension. Its 70 customization rules span 7 sub-groups:
| Sub-group | Rules | Covers |
|---|---|---|
| Color | 8 | Pantone codes, MOQ for custom colors, feasibility |
| Logo | 8 | Vector files, printing methods, placement, proofing |
| Material | 10 | Material options, thickness, price impact |
| Size | 12 | Custom dimensions, mold fees, drawings |
| Packaging | 9 | Custom box, labels, artwork requirements |
| Photo | 13 | Custom product from photo, design files, mold fees |
| Custom MOQ | 10 | Specs-first quoting, tiered pricing, sample availability |
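S1 (Price Negotiation Attempt): rule-price-negotiation is the only negotiation rule in the system, and it is thin. Human conversations show rich negotiation: “1元拿得到吗?” (“Can we get it for 1 yuan?”), “can you help us on this first order”, and “这已经是最低价了” (“This is already the lowest price”) → “如果数量多呢?” (“What if the quantity is larger?”). The current bot doesn’t negotiate. S1 measures whether we start to.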
| Rule File | Rules | Covered By |
|---|---|---|
| food_grade.md | 3 | E1 (certification is a goal) |
stock_handling.md | 8 | E1 + E6 (goal completion + response handling) |
information_extraction.md | 1 | E7 (answering supplier’s questions naturally) |
media_handling.md | 1 | E7 (using media appropriately) |
misc_rules.md | 5 | E1, E9, S1 |
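The claim that nothing is lost can be checked mechanically for these files: each one should map to at least one known dimension. This is a minimal sketch with the data copied from the table above; the names `REMAINING_FILES` and `uncovered_files` are hypothetical.

```python
# Known dimension labels: core E1..E9 plus the stretch metric S1.
VALID_DIMENSIONS = {f"E{i}" for i in range(1, 10)} | {"S1"}

# file name -> (rule count, covering dimensions), copied from the table above.
REMAINING_FILES = {
    "food_grade.md": (3, {"E1"}),
    "stock_handling.md": (8, {"E1", "E6"}),
    "information_extraction.md": (1, {"E7"}),
    "media_handling.md": (1, {"E7"}),
    "misc_rules.md": (5, {"E1", "E9", "S1"}),
}


def uncovered_files(mapping: dict) -> list:
    """Return files with no covering dimension or an unknown dimension label."""
    return sorted(
        name
        for name, (_, dims) in mapping.items()
        if not dims or not dims <= VALID_DIMENSIONS
    )


if __name__ == "__main__":
    print(uncovered_files(REMAINING_FILES))
```

The same check extends to the full rulebook if the per-rule mapping is available in a similar structure.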
Core score: E1 through E9 → max 9/9
Stretch metric: S1 reported separately (Yes/No + quality note)
Conversation ID: XXX
Core: 6/8 (E1:P E2:P E3:P E4:P E5:Pa E6:P E7:Pa E8:F E9:N/A)
Stretch: S1, no attempt

P = Pass, Pa = Partial, F = Fail, N/A = not applicable (no customization in SR). In this example E9 is N/A, so it is left out of both the score and the maximum.
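If reports in this compact format need to be aggregated later, the per-dimension grades can be parsed back out. The sketch below assumes the `E<n>:<abbreviation>` pair format shown in the example; the function name and the regular expression are illustrative, not part of any existing tooling.

```python
import re
from typing import Dict

# Abbreviations as defined in the legend above.
ABBREVIATIONS = {"P": "Pass", "Pa": "Partial", "F": "Fail", "N/A": "N/A"}

# Match pairs such as "E5:Pa". "Pa" is listed before "P" so the longer
# abbreviation wins; the lookahead requires whitespace, ")" or end of string.
_PAIR = re.compile(r"\b(E[1-9]):(N/A|Pa|P|F)(?=\s|\)|$)")


def parse_grades(report: str) -> Dict[str, str]:
    """Extract {dimension: grade} from a compact report string."""
    return {dim: ABBREVIATIONS[abbr] for dim, abbr in _PAIR.findall(report)}


sample = "(E1:P E2:P E3:P E4:P E5:Pa E6:P E7:Pa E8:F E9:N/A)"
grades = parse_grades(sample)
```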
eval-rubric-v1.md