Supplier Bot Eval Methodology

9 Dimensions + 1 Stretch Metric — Derived from Bot Logs, 157 Rules, and Human Conversations
11 March 2026
| Metric | Value |
|---|---|
| Dimensions | 9 + 1 (core + stretch) |
| Rules covered | 157/157 (full coverage) |
| Evidence sources | 3 (bot logs, rules, human conversations) |
| Scoring | P / Pa / F (pass / partial / fail) |

Purpose

This document explains how we built the evaluation framework for supplier bot conversations. It answers three questions:

  1. What are we measuring? — 9 dimensions + 1 stretch metric, each Pass / Partial / Fail
  2. How did we decide what matters? — Three input layers: bot logs, 157 rules, human conversations
  3. What about the 157 rules? — Every rule is covered. Here’s how.

How We Derived the Eval Dimensions

We worked from three independent evidence sources:

Source 1: Bot Conversation Logs (5,300+ messages)

Historical bot-supplier conversations across two datasets:

What the logs told us:

Source 2: The 157 Rules (144K tokens)

We fully analyzed the rulebook — all 157 rules across 21 files.

| Category | Rules | Notes |
|---|---|---|
| Goal management (core + ordering) | 10 | Core logic |
| Customization (color, logo, material, size, packaging, photo, MOQ) | 70 | 46% of goal-bound rules |
| Non-custom flows (lead time, price, stock, sample, packing, spec) | 42 | Standard info collection |
| Conversation management | 13 | Flow control |
| Message structure | 4 | Formatting |
| Stock handling | 8 | Availability checks |
| Misc (negotiation, customization general, sample) | 5 | Edge cases |
| Other (food grade, info extraction, media) | 5 | Narrow scope |
Key insight: 82% of rules (128/157) relate to goal management — either the core goal state machine or specific customization/non-custom flows. The rules are not 157 independent things to check; they are variations on a single theme: “When the supplier says X about goal Y, do Z.”
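The per-category counts above can be cross-checked with a quick tally (the numbers come directly from the table):

```python
# Rule counts per category, as listed in the rulebook analysis table above.
rule_counts = {
    "goal_management": 10,
    "customization": 70,
    "non_custom_flows": 42,
    "conversation_management": 13,
    "message_structure": 4,
    "stock_handling": 8,
    "misc": 5,
    "other": 5,
}

total = sum(rule_counts.values())
print(total)  # 157 — matches the full rulebook
```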

Source 3: Human Supplier Conversations (17 growth-team chats)

We analyzed ~17 conversations where Sourcy’s human growth team (东印度采购, “East India Procurement”) manually sourced from 1688 suppliers. These revealed what a competent human buyer does that bots currently don’t:


The 9 Eval Dimensions + 1 Stretch Metric

Each dimension is scored Pass / Partial / Fail (1 / 0.5 / 0). All dimensions carry equal weight. Maximum core score: 9/9.

| # | Dimension | What It Checks | Primary Source |
|---|---|---|---|
| E1 | Goal Completion | Did the bot collect the data points it was supposed to? | Rules, Bot logs |
| E2 | One-Question Discipline | Does each bot message ask at most one question? | Rules, Bot logs |
| E3 | Turn Efficiency | Did the bot get there without wasting turns? | Bot logs, Rules |
| E4 | No Hallucination | Did the bot only state facts the supplier actually provided? | Rules, Bot logs |
| E5 | Structured Extractability | Could you pull a clean supplier quote from this conversation? | Rules, Human logs |
| E6 | Auto-Response Handling | When the supplier sent an automated customer-service reply (数字客服回复), did the bot handle it? | Rules, Bot logs |
| E7 | Conversational Naturalness | Does this read like a competent buyer, not a form bot? | Human logs, Shen |
| E8 | Rejection Recovery | When the supplier pushed back, did the bot explore alternatives? | Rules, Human logs |
| E9 | Customization Handling | When the SR involves customization, did the bot navigate it competently? | Rules, Human logs, Karl |
| S1 | Price Negotiation Attempt | Did the bot make any attempt to negotiate price? | Human logs, Rules |

S1 is reported separately — it’s a capability the current bot doesn’t have, so it’s a stretch target, not a core pass/fail.


How the 157 Rules Map to 9 Dimensions

The critical question: “If we evaluate on 9 dimensions instead of 157 rules, do we lose anything?” No. Here’s the full mapping:

Rules per Dimension

| Dimension | Rules |
|---|---|
| E9 Customization | 70 |
| E1 Goal Completion | 52 |
| E8 Rejection Recovery | 5 |
| E7 Naturalness | 4 |
| E3 Turn Efficiency | 3 |
| E6 Auto-Response | 3 |
| E4 No Hallucination | 2 |
| E2 One-Question | 1 |
| S1 Price Negotiation | 1 |

Note: Some rules are shared across dimensions. E5 (Structured Extractability) evaluates data quality across all goal target fields rather than mapping to specific rules.
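One way to keep this mapping auditable is to store it as data: each dimension owns a set of rule IDs, and coverage is the union, so shared rules are neither double-counted nor lost. A minimal sketch — the rule IDs below are hypothetical placeholders, not actual names from the 21 rule files:

```python
# Hypothetical rule IDs for illustration only; real IDs come from the rulebook.
dimension_rules = {
    "E9": {"rule-color-pantone", "rule-logo-vector"},          # 70 rules in practice
    "E1": {"rule-goal-core-update", "rule-leadtime-collect"},  # 52 rules in practice
    "E2": {"rule-one-question-per-message"},
    # ... remaining dimensions omitted for brevity
}

# Coverage is the union of all per-dimension sets: a rule shared by two
# dimensions still counts once toward total coverage.
covered = set().union(*dimension_rules.values())
print(len(covered))  # 5 rule IDs in this toy example
```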


E1: Goal Completion

52 rules

All of goal_core.md (10 rules), plus every non-custom flow rule that ends in a goal status update:

These rules define how to reach each goal. E1 evaluates the outcome: did we get there?


E2: One-Question Discipline

1 rule

rule-one-question-per-message (priority 8). One rule, but the single most frequent quality failure in the bot logs. It gets its own dimension because it’s both easily measurable and high-impact.
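Because E2 is a single mechanical rule, it can be checked automatically. A minimal sketch, counting question marks per message (the full-width ？ matters for Chinese-language messages; the sample messages are illustrative):

```python
def question_count(message: str) -> int:
    """Count questions via question marks, including full-width ？ used in Chinese."""
    return message.count("?") + message.count("？")

def passes_one_question(message: str) -> bool:
    """E2: a bot message may ask at most one question."""
    return question_count(message) <= 1

print(passes_one_question("What is your MOQ?"))            # True
print(passes_one_question("MOQ? Lead time? Unit price?"))  # False
```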


E3: Turn Efficiency

3 rules

Three rules, one outcome: don’t waste turns.


E4: No Hallucination

2 rules

The bot should only state facts from context or supplier responses. Never fabricate.


E5: Structured Extractability

all goal target fields

Evaluates the quality of collected data — are values specific, numeric, and unambiguous? Maps to the goal target_fields schema across all 157 rules.
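A rough proxy for “specific, numeric, and unambiguous” is whether a collected field value contains a concrete number. A sketch, assuming field values arrive as free-text strings (the regex and examples are illustrative, not the production extractor):

```python
import re

# Matches a number optionally followed by a unit token, e.g. "500 pcs", "1.2元", "15 days".
NUMERIC_VALUE = re.compile(r"\d+(?:\.\d+)?\s*[a-zA-Z元天件%]*")

def is_extractable(value: str) -> bool:
    """Heuristic E5 check: the value contains at least one concrete number."""
    return bool(NUMERIC_VALUE.search(value))

print(is_extractable("MOQ 500 pcs"))       # True
print(is_extractable("depends on order"))  # False
```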


E6: Auto-Response Handling

3 rules

These 3 rules address ~31% of real conversations where suppliers don’t meaningfully respond.


E7: Conversational Naturalness

4 rules + human standard

Plus the human-conversation standard: polite, direct, language-flexible, not robotic. This is where Shen’s quality standard lives.


E8: Rejection Recovery

5 rules

Human conversations show this clearly: “做不了” (“we can’t do it”) is not the end — humans explore alternatives.


E9: Customization Handling

70 rules

The largest dimension. 70 customization rules span 7 sub-groups:

| Sub-group | Rules | Covers |
|---|---|---|
| Color | 8 | Pantone codes, MOQ for custom colors, feasibility |
| Logo | 8 | Vector files, printing methods, placement, proofing |
| Material | 10 | Material options, thickness, price impact |
| Size | 12 | Custom dimensions, mold fees, drawings |
| Packaging | 9 | Custom box, labels, artwork requirements |
| Photo | 13 | Custom product from photo, design files, mold fees |
| Custom MOQ | 10 | Specs-first quoting, tiered pricing, sample availability |
Strategic importance (Karl’s direction)

Karl has directed the company toward customization as the strategic focus — it is both the most top-of-funnel and the most converting use case. E9 cannot be deprioritized; our eval treats it as a core dimension, not a conditional one.

S1: Price Negotiation (stretch)

1 rule + human standard

rule-price-negotiation — the only negotiation rule in the system, and a thin one. Human conversations show rich negotiation: “1元拿得到吗?” (“Can we get it for 1 yuan?”), “can you help us on this first order”, and “这已经是最低价了” (“This is already the lowest price”) countered with “如果数量多呢?” (“What if the quantity is larger?”). The current bot doesn’t negotiate. S1 measures whether we start to.


Rules Not Explicitly in a Single Dimension

| Rule File | Rules | Covered By |
|---|---|---|
| food_grade.md | 3 | E1 (certification is a goal) |
| stock_handling.md | 8 | E1 + E6 (goal completion + response handling) |
| information_extraction.md | 1 | E7 (answering supplier’s questions naturally) |
| media_handling.md | 1 | E7 (using media appropriately) |
| misc_rules.md | 5 | E1, E9, S1 |

Total coverage: 157/157 rules are represented in the eval framework. Every rule maps to at least one dimension. No rules are lost. The eval is 9 things to check, not 157.

Scoring System

Per dimension: Pass (1) / Partial (0.5) / Fail (0). All dimensions are equally weighted, and we are targeting full marks across all of them. Differential weighting is premature — the data will tell us which dimensions have natural variance.

Core score: E1 through E9 → max 9/9

Stretch metric: S1 reported separately (Yes/No + quality note)

Example Output

Conversation ID: XXX
Core: 6/8 applicable (E1:P E2:P E3:P E4:P E5:Pa E6:P E7:Pa E8:F E9:N/A)
Stretch: S1 — No attempt

P = Pass, Pa = Partial, F = Fail, N/A = not applicable (no customization in SR).
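The compact score line can be parsed and recomputed mechanically. A sketch, assuming the `E1:P E2:Pa ...` token format and the convention that N/A dimensions drop out of both numerator and denominator:

```python
POINTS = {"P": 1.0, "Pa": 0.5, "F": 0.0}

def core_score(line: str) -> tuple[float, int]:
    """Parse 'E1:P E2:Pa ... E9:N/A' into (score, number of applicable dimensions)."""
    score, applicable = 0.0, 0
    for token in line.split():
        _, grade = token.split(":", 1)  # maxsplit=1 so "N/A" survives intact
        if grade == "N/A":
            continue  # excluded from numerator and denominator
        score += POINTS[grade]
        applicable += 1
    return score, applicable

print(core_score("E1:P E2:P E3:P E4:P E5:Pa E6:P E7:Pa E8:F E9:N/A"))
# (6.0, 8)
```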


What This Framework Enables

  1. Apples-to-apples comparison — Score the existing bot, Eric’s bot, and eventually human conversations on the same 9 dimensions
  2. Rule coverage without rule complexity — Every rule is represented; none are lost. But the eval is 9 things to check, not 157.
  3. Strategic alignment — Customization handling (E9) is a core dimension, reflecting Karl’s direction
  4. Extensibility — S1 (negotiation) shows how new capabilities get added: start as a stretch metric, promote to core when the bot can do it
  5. Shen’s blind test ready — Each dimension has clear criteria. A human judge can score without knowing which bot generated the conversation.

Next Steps

  1. Confirm eval dimensions — done (this document)
  2. Benchmark the historical bot — Score existing bot conversations (Aug–Nov 2024 + Dec 2025–Feb 2026) on these 9 dimensions
  3. Benchmark the latest bot — Score Nelson’s most recent transcript (confirmed by Awsaf as the latest baseline)
  4. Run Eric’s bot — Benchmark against the same SRs and supplier archetypes
  5. Shen’s blind test — A/B comparison with human judge on conversational quality

Sources