Sourcy Supplier Bot

Feasibility Report & Architecture Proposal
4 March 2026
Data Profiled
15,399
messages parsed
Rules Analyzed
157
144K tokens total
Cost Reduction
proposed vs current
Benchmark Winner
L0
zero-shot > 157 rules

I. Data Inventory

Complete ingestion and profiling of all supplier bot data produced to date.

DatasetVolumePeriod
Bot conversations (structured JSON)399 convos, 5,799 turnsAug–Nov 2024
Bot messages (CSV, dev + prod)5,321 messages, 253 convosDec 2025–Feb 2026
Human supplier chats4,279 messages, ~75 suppliersApr–Aug 2025
Rules (guidance JSON)157 rules, 144K tokensCurrent
Sample SRs5 SRs with supplier listsFeb 2026
Rules-agent memory files63 test run snapshotsMar 2026

Total: 11,120 bot messages + 4,279 human messages + 157 rules profiled.


II. Token Economics of the Rulebook

The 157 rules total 144K tokens in their JSON form. Breakdown:

Rules (core logic)
36,815 (26%)
Examples
51,168 (35%)
Metadata + structure
56,288 (39%)
74% of the rulebook is not rules. Strip examples and metadata → 37K tokens. This fits comfortably in a single LLM call at $0.006/turn on Gemini Flash or Kimi K2.5.

III. Rule Distribution & Issues

Distribution by Category

goal_management
128 (82%)
conversation_flow
16 (10%)
general_behavior
13 (8%)

Within goal-bound rules, customization drives 46% (53 of 116). The system is functionally a customization negotiation engine with basic trade-off collection attached.

Critical Issues Found

Rule Contradiction rule-one-question-per-message (priority 8) contains an example asking price before MOQ — directly contradicting rule-ask-moq-before-price (priority 9) and rule-goal-ordering-no-customization (priority 3).
IssueDetail
Contradicting examplesPrice-before-MOQ in one-question rule vs MOQ-before-price in ordering rule
Priority collision50% of rules share identical priority (8), making conflict resolution unpredictable
Inconsistent negotiation countsThree rules give different "max attempts" for the same scenario
Override gapsOnly 2 override relationships defined across 157 rules

IV. Automated Benchmark Results

We built an automated benchmarking harness that simulates 3 supplier archetypes × 3 prompt levels, with LLM-as-judge scoring across 5 dimensions.

Test Setup

ComponentDetails
Prompt levelsL0 zero-shot (~200 tokens) / L1 minimal guidance (~2K tokens) / L2 full stripped rulebook (~16K tokens)
Supplier archetypesA: responsive factory / B: evasive & redirect-to-email / C: WeChat redirect
Engines testedKimi K2.5 (Claude drop-in) + Cursor Agent (Sonnet 3.5)
Judge dimensionsS1 Goal Completion / S2 One Question per Message / S3 Turn Efficiency / S4 No Hallucination / S5 Structured Extractability
Total conversations18 (9 per engine, 3 suppliers × 3 levels)

Results: Kimi K2.5 Engine

L0 Zero-shot
93% (4.67/5.00)
L1 Minimal
91% (4.56/5.00)
L2 Full Rules
85% (4.24/5.00)

Results: Cursor Agent (Sonnet) Engine

L0 Zero-shot
90% (4.49/5.00)
L1 Minimal
89% (4.47/5.00)
L2 Full Rules
84% (4.20/5.00)
Key Finding: Zero-shot beats the full rulebook on both engines. Adding 157 rules to the context degrades performance. The LLM already knows how to conduct a B2B sourcing conversation in Chinese. The rulebook introduces constraint conflicts and cognitive load that reduce output quality.

Dimension Breakdown (Averaged Across Engines)

DimensionL0L1L2
S1 Goal Completion4.54.54.0
S2 One Question / Msg4.84.74.3
S3 Turn Efficiency4.34.23.8
S4 No Hallucination5.05.04.8
S5 Extractability4.34.14.2

L2's weakest point is turn efficiency — the bot gets bogged down trying to satisfy conflicting rules and takes more turns to achieve the same goals.


V. Architecture Comparison

Current: Two-Phase Retrieval

  • Phase 1 (Extraction Agent) → Code Bridge → Phase 2 (Goal Selector)
  • Double latency per turn
  • Extraction errors in Phase 1 corrupt Phase 2
  • No cross-rule conflict resolution
  • ~$0.040 per turn
  • ~$8.00 per SR (20 suppliers × 10 turns)

Proposed: Single-Pass + SR-Aware Goals

  • One LLM call per turn
  • Half the latency
  • LLM handles conflict resolution natively
  • 37K token context fits easily
  • ~$0.006 per turn
  • ~$1.10 per SR (20 suppliers × 10 turns)

7× cost reduction while eliminating extraction errors and halving latency.


VI. Trade-Off Matrix — What the Bot Should Produce

Extracted manually from real human-supplier conversations (东印度采购, Jul 2025). Product: white kraft paper bag, 21×14×27cm, custom color logo.

Data Point 何继跃88 (Yiwu) 皓茁环保科技 (Zhejiang) pengqizheng2016
MOQ200 pcs500 pcsDeclined
Unit Price (200)¥6.50N/A
Unit Price (500)¥3.00¥1.12
Unit Price (1000)¥2.00¥0.83
Lead Time12 days12–15 days
CustomizationFull color, white kraftFull color, 130g white kraft + rope
Packing30×45×50cm, ~15kg46×40×56cm, ~21kg
Shipping (SZ)¥50 (200pc) / ¥120 (500pc)¥20–28
This table is the business deliverable. A human sourcing agent took ~2 hours of chatting to produce this for 3 suppliers. With 20 suppliers across 5 SRs, that's 200+ hours of work. The bot should produce this table for each SR automatically in under 24 hours.

VII. Proposed Architecture

Three components that don't exist yet — and one that should be deprecated.

1. Goal Generator

Takes an SR (with maturity level and customization requirements) and produces structured goals. Replaces 128 goal_management rules with one LLM call per SR.

SR MaturityGoals GeneratedExample
High, no customization4 goalsPrice, MOQ, lead time, packing
Mid, logo only6 goals+ Logo feasibility, logo MOQ
Mid, L1–L2 custom7–9 goals+ Size/color/material, custom pricing
Low maturity3 goalsFeasibility, rough price, MOQ range

2. Conversation Engine (Single-Pass)

One LLM call per turn. System prompt: Level 1 guidance (~2K tokens) + SR-specific goals. Based on benchmark results, L1 is the optimal production baseline — captures 1688 etiquette and goal ordering without the performance drag of the full rulebook.

3. Trade-Off Aggregator

Collects structured outputs from all parallel conversations and builds the comparison matrix shown above. This is the actual business deliverable nobody has built yet.

What about the 157 rules? They become training data and quality benchmarks, not runtime logic. Use them to write the system prompt, build eval cases, and document conversation flows. Don't load them into the LLM at runtime — the benchmark proves it hurts.

VIII. Cost Model

ApproachPer TurnPer SR (20 suppliers × 10 turns)Per Month (50 SRs)
Two-phase (current)~$0.040~$8.00~$400
Single-pass, full rulebook (L2)$0.022$4.33~$217
Single-pass, stripped (L1, proposed)$0.006$1.10~$55
Single-pass, zero-shot (L0)$0.002$0.40~$20

At 50 SRs/month, the proposed architecture saves $345/month vs the current approach. At scale (500 SRs/month), that's $3,450/month saved.


IX. Proposed Next Steps

#TaskTimelineOwner
1Align on architecture directionThis syncAll
2Deprecate rule system as runtime logic; retain as eval corpusImmediateTek/Eric
3Secure chatServer polling API docs from AwsafThis weekAwsaf
4Build goal generator + test against 5 sample SRs3 daysEric
5Build single-pass conversation engine (L1 baseline)5 daysEric
6Integrate with chatServer API (message send/receive)ParallelEric + Awsaf
7End-to-end test: 1 real SR, 5 live suppliersEnd of next weekEric
What I Need From the Team
  • Awsaf: chatServer API documentation or access to test endpoint
  • Lokesh: Confirmation on the 5 sample SRs + any additional eval cases
  • All: Alignment that L1 prompt is the production baseline (not the full rulebook)

Verdict

The conversation engine is solved. Zero-shot LLMs already know how to talk to Chinese suppliers.

The 157-rule system encodes valuable institutional knowledge, but loading it into the LLM at runtime hurts performance. The benchmark proves this across two independent engines.

The hard problems — SR-aware goal generation, parallel orchestration, and trade-off aggregation — have not been built yet. That's where the engineering effort should go.

Proposed baseline: Level 1 prompt (~2K tokens) + dynamic goals per SR. 7× cheaper, 2× faster, empirically better.


References

[1] Sourcy bot conversation data — 399 structured conversations (Aug–Nov 2024) + 5,321 CSV messages (Dec 2025–Feb 2026). Primary dataset for bot behavior analysis
[2] Human supplier conversations — 4,279 messages from 东印度采购 across ~75 suppliers (Apr–Aug 2025). Ground truth for supplier behavior patterns and trade-off extraction
[3] Rules guidance JSON — 157 rules, rules_guidance.json from Tek's rules-agent codebase. Full rule system analyzed for token counts, categories, contradictions
[4] Automated benchmark — 18 simulated conversations (3 supplier archetypes × 3 prompt levels × 2 engines), LLM-as-judge scoring, Mar 4, 2026. Empirical comparison of prompt strategies
[5] Daniel — Supplier Bot Onboarding, system design document for two-phase agentic retrieval architecture. Original architecture rationale
[6] Lokesh — Problem Statement, parallel outreach specification and SR samples. Business requirements and evaluation criteria