Sourcy Supplier Bot — Feasibility & Architecture Report

Data Profiled

15,399

messages parsed

Rules Analyzed

157

144K tokens total

Cost Reduction

7×

proposed vs current

Benchmark Winner

zero-shot > 157 rules

I. Data Inventory

Complete ingestion and profiling of all supplier bot data produced to date.

Dataset	Volume	Period
Bot conversations (structured JSON)	399 convos, 5,799 turns	Aug–Nov 2024
Bot messages (CSV, dev + prod)	5,321 messages, 253 convos	Dec 2025–Feb 2026
Human supplier chats	4,279 messages, ~75 suppliers	Apr–Aug 2025
Rules (guidance JSON)	157 rules, 144K tokens	Current
Sample SRs	5 SRs with supplier lists	Feb 2026
Rules-agent memory files	63 test run snapshots	Mar 2026

Total: 11,120 bot messages + 4,279 human messages + 157 rules profiled.

II. Token Economics of the Rulebook

The 157 rules total 144K tokens in their JSON form. Breakdown:

Rules (core logic)

36,815 (26%)

Examples

51,168 (35%)

Metadata + structure

56,288 (39%)

74% of the rulebook is not rules. Strip examples and metadata → 37K tokens. This fits comfortably in a single LLM call at $0.006/turn on Gemini Flash or Kimi K2.5.

III. Rule Distribution & Issues

Distribution by Category

goal_management

128 (82%)

conversation_flow

16 (10%)

general_behavior

13 (8%)

Within goal-bound rules, customization drives 46% (53 of 116). The system is functionally a customization negotiation engine with basic trade-off collection attached.

Critical Issues Found

Rule Contradiction rule-one-question-per-message (priority 8) contains an example asking price before MOQ — directly contradicting rule-ask-moq-before-price (priority 9) and rule-goal-ordering-no-customization (priority 3).

Issue	Detail
Contradicting examples	Price-before-MOQ in one-question rule vs MOQ-before-price in ordering rule
Priority collision	50% of rules share identical priority (8), making conflict resolution unpredictable
Inconsistent negotiation counts	Three rules give different "max attempts" for the same scenario
Override gaps	Only 2 override relationships defined across 157 rules

IV. Automated Benchmark Results

We built an automated benchmarking harness that simulates 3 supplier archetypes × 3 prompt levels, with LLM-as-judge scoring across 5 dimensions.

Test Setup

Component	Details
Prompt levels	L0 zero-shot (~200 tokens) / L1 minimal guidance (~2K tokens) / L2 full stripped rulebook (~16K tokens)
Supplier archetypes	A: responsive factory / B: evasive & redirect-to-email / C: WeChat redirect
Engines tested	Kimi K2.5 (Claude drop-in) + Cursor Agent (Sonnet 3.5)
Judge dimensions	S1 Goal Completion / S2 One Question per Message / S3 Turn Efficiency / S4 No Hallucination / S5 Structured Extractability
Total conversations	18 (9 per engine, 3 suppliers × 3 levels)

Results: Kimi K2.5 Engine

L0 Zero-shot

93% (4.67/5.00)

L1 Minimal

91% (4.56/5.00)

L2 Full Rules

85% (4.24/5.00)

Results: Cursor Agent (Sonnet) Engine

L0 Zero-shot

90% (4.49/5.00)

L1 Minimal

89% (4.47/5.00)

L2 Full Rules

84% (4.20/5.00)

Key Finding: Zero-shot beats the full rulebook on both engines. Adding 157 rules to the context degrades performance. The LLM already knows how to conduct a B2B sourcing conversation in Chinese. The rulebook introduces constraint conflicts and cognitive load that reduce output quality.

Dimension Breakdown (Averaged Across Engines)

Dimension	L0	L1	L2
S1 Goal Completion	4.5	4.5	4.0
S2 One Question / Msg	4.8	4.7	4.3
S3 Turn Efficiency	4.3	4.2	3.8
S4 No Hallucination	5.0	5.0	4.8
S5 Extractability	4.3	4.1	4.2

L2's weakest point is turn efficiency — the bot gets bogged down trying to satisfy conflicting rules and takes more turns to achieve the same goals.

V. Architecture Comparison

Current: Two-Phase Retrieval

Phase 1 (Extraction Agent) → Code Bridge → Phase 2 (Goal Selector)
Double latency per turn
Extraction errors in Phase 1 corrupt Phase 2
No cross-rule conflict resolution
~$0.040 per turn
~$8.00 per SR (20 suppliers × 10 turns)

Proposed: Single-Pass + SR-Aware Goals

One LLM call per turn
Half the latency
LLM handles conflict resolution natively
37K token context fits easily
~$0.006 per turn
~$1.10 per SR (20 suppliers × 10 turns)

7× cost reduction while eliminating extraction errors and halving latency.

VI. Trade-Off Matrix — What the Bot Should Produce

Extracted manually from real human-supplier conversations (东印度采购, Jul 2025). Product: white kraft paper bag, 21×14×27cm, custom color logo.

Data Point	何继跃88 (Yiwu)	皓茁环保科技 (Zhejiang)	pengqizheng2016
MOQ	200 pcs	500 pcs	Declined
Unit Price (200)	¥6.50	N/A	—
Unit Price (500)	¥3.00	¥1.12	—
Unit Price (1000)	¥2.00	¥0.83	—
Lead Time	12 days	12–15 days	—
Customization	Full color, white kraft	Full color, 130g white kraft + rope	—
Packing	30×45×50cm, ~15kg	46×40×56cm, ~21kg	—
Shipping (SZ)	¥50 (200pc) / ¥120 (500pc)	¥20–28	—

This table is the business deliverable. A human sourcing agent took ~2 hours of chatting to produce this for 3 suppliers. With 20 suppliers across 5 SRs, that's 200+ hours of work. The bot should produce this table for each SR automatically in under 24 hours.

VII. Proposed Architecture

Three components that don't exist yet — and one that should be deprecated.

1. Goal Generator

Takes an SR (with maturity level and customization requirements) and produces structured goals. Replaces 128 goal_management rules with one LLM call per SR.

SR Maturity	Goals Generated	Example
High, no customization	4 goals	Price, MOQ, lead time, packing
Mid, logo only	6 goals	+ Logo feasibility, logo MOQ
Mid, L1–L2 custom	7–9 goals	+ Size/color/material, custom pricing
Low maturity	3 goals	Feasibility, rough price, MOQ range

2. Conversation Engine (Single-Pass)

One LLM call per turn. System prompt: Level 1 guidance (~2K tokens) + SR-specific goals. Based on benchmark results, L1 is the optimal production baseline — captures 1688 etiquette and goal ordering without the performance drag of the full rulebook.

3. Trade-Off Aggregator

Collects structured outputs from all parallel conversations and builds the comparison matrix shown above. This is the actual business deliverable nobody has built yet.

What about the 157 rules? They become training data and quality benchmarks, not runtime logic. Use them to write the system prompt, build eval cases, and document conversation flows. Don't load them into the LLM at runtime — the benchmark proves it hurts.

VIII. Cost Model

Approach	Per Turn	Per SR (20 suppliers × 10 turns)	Per Month (50 SRs)
Two-phase (current)	~$0.040	~$8.00	~$400
Single-pass, full rulebook (L2)	$0.022	$4.33	~$217
Single-pass, stripped (L1, proposed)	$0.006	$1.10	~$55
Single-pass, zero-shot (L0)	$0.002	$0.40	~$20

At 50 SRs/month, the proposed architecture saves $345/month vs the current approach. At scale (500 SRs/month), that's $3,450/month saved.

IX. Proposed Next Steps

#	Task	Timeline	Owner
1	Align on architecture direction	This sync	All
2	Deprecate rule system as runtime logic; retain as eval corpus	Immediate	Tek/Eric
3	Secure chatServer polling API docs from Awsaf	This week	Awsaf
4	Build goal generator + test against 5 sample SRs	3 days	Eric
5	Build single-pass conversation engine (L1 baseline)	5 days	Eric
6	Integrate with chatServer API (message send/receive)	Parallel	Eric + Awsaf
7	End-to-end test: 1 real SR, 5 live suppliers	End of next week	Eric

What I Need From the Team

Awsaf: chatServer API documentation or access to test endpoint
Lokesh: Confirmation on the 5 sample SRs + any additional eval cases
All: Alignment that L1 prompt is the production baseline (not the full rulebook)

Verdict

The conversation engine is solved. Zero-shot LLMs already know how to talk to Chinese suppliers.

The 157-rule system encodes valuable institutional knowledge, but loading it into the LLM at runtime hurts performance. The benchmark proves this across two independent engines.

The hard problems — SR-aware goal generation, parallel orchestration, and trade-off aggregation — have not been built yet. That's where the engineering effort should go.

Proposed baseline: Level 1 prompt (~2K tokens) + dynamic goals per SR. 7× cheaper, 2× faster, empirically better.

References

[1] Sourcy bot conversation data — 399 structured conversations (Aug–Nov 2024) + 5,321 CSV messages (Dec 2025–Feb 2026). Primary dataset for bot behavior analysis

[2] Human supplier conversations — 4,279 messages from 东印度采购 across ~75 suppliers (Apr–Aug 2025). Ground truth for supplier behavior patterns and trade-off extraction

[3] Rules guidance JSON — 157 rules, rules_guidance.json from Tek's rules-agent codebase. Full rule system analyzed for token counts, categories, contradictions

[4] Automated benchmark — 18 simulated conversations (3 supplier archetypes × 3 prompt levels × 2 engines), LLM-as-judge scoring, Mar 4, 2026. Empirical comparison of prompt strategies

[5] Daniel — Supplier Bot Onboarding, system design document for two-phase agentic retrieval architecture. Original architecture rationale

[6] Lokesh — Problem Statement, parallel outreach specification and SR samples. Business requirements and evaluation criteria