Supplier Bot Eval Methodology

9 Dimensions + 1 Stretch Metric — Derived from Bot Logs, 157 Rules, and Human Conversations
11 March 2026
| Metric | Value |
|---|---|
| Dimensions | 9 + 1 (core + stretch) |
| Rules covered | 157/157 (full coverage) |
| Evidence sources | 3 (bot logs, rules, human conversations) |
| Scoring | P / Pa / F (pass / partial / fail) |

Purpose

This document explains how we built the evaluation framework for supplier bot conversations. It answers three questions:

  1. What are we measuring? — 9 dimensions + 1 stretch metric, each Pass / Partial / Fail
  2. How did we decide what matters? — Three input layers: bot logs, 157 rules, human conversations
  3. What about the 157 rules? — Every rule is covered. Here’s how.

How We Derived the Eval Dimensions

We worked from three independent evidence sources:

Source 1: Bot Conversation Logs (5,300+ messages)

Historical bot-supplier conversations across two datasets:

What the logs told us:

Source 2: The 157 Rules (144K tokens)

We fully analyzed the rulebook — all 157 rules across 21 files.

| Category | Rules | Notes |
|---|---|---|
| Goal management (core + ordering) | 10 | Core logic |
| Customization (color, logo, material, size, packaging, photo, MOQ) | 70 | 46% of goal-bound rules |
| Non-custom flows (lead time, price, stock, sample, packing, spec) | 42 | Standard info collection |
| Conversation management | 13 | Flow control |
| Message structure | 4 | Formatting |
| Stock handling | 8 | Availability checks |
| Misc (negotiation, customization general, sample) | 5 | Edge cases |
| Other (food grade, info extraction, media) | 5 | Narrow scope |
Key insight: 82% of rules (128/157) relate to goal management — either the core goal state machine or specific customization/non-custom flows. The rules are not 157 independent things to check; they are variations on a single theme: “When the supplier says X about goal Y, do Z.”
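The per-category counts above can be cross-checked with a quick tally (the numbers come directly from the table):

```python
# Rule counts per category, as listed in the rulebook analysis table above.
rule_counts = {
    "goal_management": 10,
    "customization": 70,
    "non_custom_flows": 42,
    "conversation_management": 13,
    "message_structure": 4,
    "stock_handling": 8,
    "misc": 5,
    "other": 5,
}

total = sum(rule_counts.values())
print(total)  # 157 — matches the full rulebook
```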

Source 3: Human Supplier Conversations (17 growth-team chats)

We analyzed ~17 conversations where Sourcy’s human growth team (东印度采购, “East India Procurement”) manually sourced from 1688 suppliers. These revealed what a competent human buyer does that bots currently don’t:


The 9 Eval Dimensions + 1 Stretch Metric

Each dimension is scored Pass / Partial / Fail (1 / 0.5 / 0). All dimensions carry equal weight. Maximum core score: 9/9.

| # | Dimension | What It Checks | Primary Source |
|---|---|---|---|
| E1 | Goal Completion | Did the bot collect the data points it was supposed to? | Rules, Bot logs |
| E2 | One-Question Discipline | Does each bot message ask at most one question? | Rules, Bot logs |
| E3 | Turn Efficiency | Did the bot get there without wasting turns? | Bot logs, Rules |
| E4 | No Hallucination | Did the bot only state facts the supplier actually provided? | Rules, Bot logs |
| E5 | Structured Extractability | Could you pull a clean supplier quote from this conversation? | Rules, Human logs |
| E6 | Auto-Response Handling | When the supplier sent an automated customer-service reply (数字客服回复), did the bot handle it? | Rules, Bot logs |
| E7 | Conversational Naturalness | Does this read like a competent buyer, not a form bot? | Human logs, Shen |
| E8 | Rejection Recovery | When the supplier pushed back, did the bot explore alternatives? | Rules, Human logs |
| E9 | Customization Handling | When the SR involves customization, did the bot navigate it competently? | Rules, Human logs, Karl |
| S1 | Price Negotiation Attempt | Did the bot make any attempt to negotiate price? | Human logs, Rules |

S1 is reported separately — it’s a capability the current bot doesn’t have, so it’s a stretch target, not a core pass/fail.


How the 157 Rules Map to 9 Dimensions

The critical question: “If we evaluate on 9 dimensions instead of 157 rules, do we lose anything?” No. Here’s the full mapping:

Rules per Dimension

| Dimension | Rules |
|---|---|
| E9 Customization | 70 |
| E1 Goal Completion | 52 |
| E8 Rejection Recovery | 5 |
| E7 Naturalness | 4 |
| E3 Turn Efficiency | 3 |
| E6 Auto-Response | 3 |
| E4 No Hallucination | 2 |
| E2 One-Question | 1 |
| S1 Price Negotiation | 1 |

Note: Some rules are shared across dimensions. E5 (Structured Extractability) evaluates data quality across all goal target fields rather than mapping to specific rules.
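One way to keep this mapping auditable is to store it as data: each dimension owns a set of rule IDs, and coverage is the union, so shared rules are neither double-counted nor lost. A minimal sketch — the rule IDs below are hypothetical placeholders, not actual names from the 21 rule files:

```python
# Hypothetical rule IDs for illustration only; real IDs come from the rulebook.
dimension_rules = {
    "E9": {"rule-color-pantone", "rule-logo-vector"},          # 70 rules in practice
    "E1": {"rule-goal-core-update", "rule-leadtime-collect"},  # 52 rules in practice
    "E2": {"rule-one-question-per-message"},
    # ... remaining dimensions omitted for brevity
}

# Coverage is the union of all per-dimension sets: a rule shared by two
# dimensions still counts once toward total coverage.
covered = set().union(*dimension_rules.values())
print(len(covered))  # 5 rule IDs in this toy example
```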


E1: Goal Completion

52 rules

All of goal_core.md (10 rules), plus every non-custom flow rule that ends in a goal status update:

These rules define how to reach each goal. E1 evaluates the outcome: did we get there?


E2: One-Question Discipline

1 rule

rule-one-question-per-message (priority 8). One rule, but the single most frequent quality failure in the bot logs. It gets its own dimension because it’s both easily measurable and high-impact.
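Because E2 is a single mechanical rule, it can be checked automatically. A minimal sketch, counting question marks per message (the full-width ？ matters for Chinese-language messages; the sample messages are illustrative):

```python
def question_count(message: str) -> int:
    """Count questions via question marks, including full-width ？ used in Chinese."""
    return message.count("?") + message.count("？")

def passes_one_question(message: str) -> bool:
    """E2: a bot message may ask at most one question."""
    return question_count(message) <= 1

print(passes_one_question("What is your MOQ?"))            # True
print(passes_one_question("MOQ? Lead time? Unit price?"))  # False
```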


E3: Turn Efficiency

3 rules

Three rules, one outcome: don’t waste turns.


E4: No Hallucination

2 rules

The bot should only state facts from context or supplier responses. Never fabricate.


E5: Structured Extractability

all goal target fields

Evaluates the quality of collected data — are values specific, numeric, and unambiguous? Maps to the goal target_fields schema across all 157 rules.
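A rough proxy for “specific, numeric, and unambiguous” is whether a collected field value contains a concrete number. A sketch, assuming field values arrive as free-text strings (the regex and examples are illustrative, not the production extractor):

```python
import re

# Matches a number optionally followed by a unit token, e.g. "500 pcs", "1.2元", "15 days".
NUMERIC_VALUE = re.compile(r"\d+(?:\.\d+)?\s*[a-zA-Z元天件%]*")

def is_extractable(value: str) -> bool:
    """Heuristic E5 check: the value contains at least one concrete number."""
    return bool(NUMERIC_VALUE.search(value))

print(is_extractable("MOQ 500 pcs"))       # True
print(is_extractable("depends on order"))  # False
```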


E6: Auto-Response Handling

3 rules

These 3 rules address ~31% of real conversations where suppliers don’t meaningfully respond.


E7: Conversational Naturalness

4 rules + human standard

Plus the human-conversation standard: polite, direct, language-flexible, not robotic. This is where Shen’s quality standard lives.


E8: Rejection Recovery

5 rules

Human conversations show this clearly: “做不了” (“we can’t do it”) is not the end — humans explore alternatives.


E9: Customization Handling

70 rules

The largest dimension. 70 customization rules span 7 sub-groups:

| Sub-group | Rules | Covers |
|---|---|---|
| Color | 8 | Pantone codes, MOQ for custom colors, feasibility |
| Logo | 8 | Vector files, printing methods, placement, proofing |
| Material | 10 | Material options, thickness, price impact |
| Size | 12 | Custom dimensions, mold fees, drawings |
| Packaging | 9 | Custom box, labels, artwork requirements |
| Photo | 13 | Custom product from photo, design files, mold fees |
| Custom MOQ | 10 | Specs-first quoting, tiered pricing, sample availability |
Strategic importance (Karl’s direction)

Karl has directed the company toward customization as the strategic focus — it is both the most top-of-funnel and the most converting use case. E9 cannot be deprioritized; our eval treats it as a core dimension, not a conditional one.

S1: Price Negotiation (stretch)

1 rule + human standard

rule-price-negotiation — the only negotiation rule in the system, and a thin one. Human conversations show rich negotiation: “1元拿得到吗?” (“Can we get it for 1 yuan?”), “can you help us on this first order”, and “这已经是最低价了” (“This is already the lowest price”) countered with “如果数量多呢?” (“What if the quantity is larger?”). The current bot doesn’t negotiate. S1 measures whether we start to.


Rules Not Explicitly in a Single Dimension

| Rule File | Rules | Covered By |
|---|---|---|
| food_grade.md | 3 | E1 (certification is a goal) |
| stock_handling.md | 8 | E1 + E6 (goal completion + response handling) |
| information_extraction.md | 1 | E7 (answering supplier’s questions naturally) |
| media_handling.md | 1 | E7 (using media appropriately) |
| misc_rules.md | 5 | E1, E9, S1 |

Total coverage: 157/157 rules are represented in the eval framework. Every rule maps to at least one dimension. No rules are lost. The eval is 9 things to check, not 157.

Scoring System

Per dimension: Pass (1) / Partial (0.5) / Fail (0). All dimensions are equally weighted, and we are targeting full marks across all of them. Differential weighting is premature — the data will tell us which dimensions have natural variance.

Core score: E1 through E9 → max 9/9

Stretch metric: S1 reported separately (Yes/No + quality note)

Example Output

Conversation ID: XXX
Core: 6/8 applicable (E1:P E2:P E3:P E4:P E5:Pa E6:P E7:Pa E8:F E9:N/A)
Stretch: S1 — No attempt

P = Pass, Pa = Partial, F = Fail, N/A = not applicable (no customization in SR).
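The compact score line can be parsed and recomputed mechanically. A sketch, assuming the `E1:P E2:Pa ...` token format and the convention that N/A dimensions drop out of both numerator and denominator:

```python
POINTS = {"P": 1.0, "Pa": 0.5, "F": 0.0}

def core_score(line: str) -> tuple[float, int]:
    """Parse 'E1:P E2:Pa ... E9:N/A' into (score, number of applicable dimensions)."""
    score, applicable = 0.0, 0
    for token in line.split():
        _, grade = token.split(":", 1)  # maxsplit=1 so "N/A" survives intact
        if grade == "N/A":
            continue  # excluded from numerator and denominator
        score += POINTS[grade]
        applicable += 1
    return score, applicable

print(core_score("E1:P E2:P E3:P E4:P E5:Pa E6:P E7:Pa E8:F E9:N/A"))
# (6.0, 8)
```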


What This Framework Enables

  1. Apples-to-apples comparison — Score the existing bot, Eric’s bot, and eventually human conversations on the same 9 dimensions
  2. Rule coverage without rule complexity — Every rule is represented; none are lost. But the eval is 9 things to check, not 157.
  3. Strategic alignment — Customization handling (E9) is a core dimension, reflecting Karl’s direction
  4. Extensibility — S1 (negotiation) shows how new capabilities get added: start as a stretch metric, promote to core when the bot can do it
  5. Shen’s blind test ready — Each dimension has clear criteria. A human judge can score without knowing which bot generated the conversation.

Next Steps

  1. Confirm eval dimensions — done (this document)
  2. Benchmark the historical bot — Score existing bot conversations (Aug–Nov 2024 + Dec 2025–Feb 2026) on these 9 dimensions
  3. Benchmark the latest bot — Score Nelson’s most recent transcript (confirmed by Awsaf as the latest baseline)
  4. Run Eric’s bot — Benchmark against the same SRs and supplier archetypes
  5. Shen’s blind test — A/B comparison with human judge on conversational quality

Sources