Complete ingestion and profiling of all supplier bot data produced to date.
| Dataset | Volume | Period |
|---|---|---|
| Bot conversations (structured JSON) | 399 convos, 5,799 turns | Aug–Nov 2024 |
| Bot messages (CSV, dev + prod) | 5,321 messages, 253 convos | Dec 2025–Feb 2026 |
| Human supplier chats | 4,279 messages, ~75 suppliers | Apr–Aug 2025 |
| Rules (guidance JSON) | 157 rules, 144K tokens | Current |
| Sample SRs | 5 SRs with supplier lists | Feb 2026 |
| Rules-agent memory files | 63 test run snapshots | Mar 2026 |
Total: 11,120 bot messages + 4,279 human messages + 157 rules profiled.
The 157 rules total 144K tokens in their JSON form. Within the 116 goal-bound rules, customization alone accounts for 46% (53 of 116): functionally, the system is a customization negotiation engine with basic trade-off collection attached.
rule-one-question-per-message (priority 8) includes an example that asks for price before MOQ, directly contradicting both rule-ask-moq-before-price (priority 9) and rule-goal-ordering-no-customization (priority 3).
| Issue | Detail |
|---|---|
| Contradicting examples | Price-before-MOQ in one-question rule vs MOQ-before-price in ordering rule |
| Priority collision | 50% of rules share identical priority (8), making conflict resolution unpredictable |
| Inconsistent negotiation counts | Three rules give different "max attempts" for the same scenario |
| Override gaps | Only 2 override relationships defined across 157 rules |
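The audit findings above (priority collisions, sparse override links) are mechanically detectable. A minimal sketch of such a check, assuming each rule carries `id`, `priority`, and an optional `overrides` list; the real `rules_guidance.json` schema may differ:

```python
import json
from collections import Counter

def audit_rules(path: str) -> dict:
    """Flag shared priorities and count override links in a rules file.

    Assumes each rule is a dict with 'id', 'priority', and an optional
    'overrides' list; adjust to the actual rules_guidance.json schema.
    """
    with open(path) as f:
        rules = json.load(f)

    priorities = Counter(r["priority"] for r in rules)
    # Rules whose priority is shared with at least one other rule
    colliding = [r["id"] for r in rules if priorities[r["priority"]] > 1]
    override_links = sum(len(r.get("overrides", [])) for r in rules)

    return {
        "total_rules": len(rules),
        "colliding_rules": len(colliding),
        "collision_share": len(colliding) / len(rules),
        "override_links": override_links,
    }
```

Running this on every rules change would catch new priority collisions before they reach the runtime prompt.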
We built an automated benchmarking harness that simulates 3 supplier archetypes × 3 prompt levels, with LLM-as-judge scoring across 5 dimensions.
| Component | Details |
|---|---|
| Prompt levels | L0 zero-shot (~200 tokens) / L1 minimal guidance (~2K tokens) / L2 full stripped rulebook (~16K tokens) |
| Supplier archetypes | A: responsive factory / B: evasive & redirect-to-email / C: WeChat redirect |
| Engines tested | Kimi K2.5 (Claude drop-in) + Cursor Agent (Sonnet 3.5) |
| Judge dimensions | S1 Goal Completion / S2 One Question per Message / S3 Turn Efficiency / S4 No Hallucination / S5 Structured Extractability |
| Total conversations | 18 (9 per engine, 3 suppliers × 3 levels) |
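The harness loop itself is a small cross product. A sketch under stated assumptions: `run_conversation` and `judge_conversation` are stand-ins for the real engine call and LLM-as-judge call, and dimension keys are illustrative:

```python
from itertools import product

# Illustrative labels matching the harness design above.
ARCHETYPES = ["A_responsive", "B_evasive_email", "C_wechat_redirect"]
LEVELS = ["L0", "L1", "L2"]
DIMENSIONS = ["S1_goal", "S2_one_question", "S3_efficiency",
              "S4_no_hallucination", "S5_extractability"]

def run_benchmark(run_conversation, judge_conversation):
    """Run 3 archetypes x 3 prompt levels, average judge scores per level.

    run_conversation(archetype, level) -> transcript
    judge_conversation(transcript) -> {dimension: score 1..5}
    Both are hypothetical stand-ins for the real engine and judge calls.
    """
    scores = {lvl: {d: [] for d in DIMENSIONS} for lvl in LEVELS}
    for archetype, level in product(ARCHETYPES, LEVELS):
        transcript = run_conversation(archetype=archetype, level=level)
        judged = judge_conversation(transcript)
        for d in DIMENSIONS:
            scores[level][d].append(judged[d])
    # Average over the 3 archetypes for each level
    return {lvl: {d: sum(v) / len(v) for d, v in dims.items()}
            for lvl, dims in scores.items()}
```

One such sweep per engine gives the 9-conversation-per-engine total listed above.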
| Dimension (judge score, 1–5) | L0 | L1 | L2 |
|---|---|---|---|
| S1 Goal Completion | 4.5 | 4.5 | 4.0 |
| S2 One Question / Msg | 4.8 | 4.7 | 4.3 |
| S3 Turn Efficiency | 4.3 | 4.2 | 3.8 |
| S4 No Hallucination | 5.0 | 5.0 | 4.8 |
| S5 Extractability | 4.3 | 4.1 | 4.2 |
L2's weakest point is turn efficiency — the bot gets bogged down trying to satisfy conflicting rules and takes more turns to achieve the same goals.
The proposed single-pass architecture delivers a ~7× cost reduction while eliminating extraction errors and halving latency.
Extracted manually from real human-supplier conversations (东印度采购, Jul 2025). Product: white kraft paper bag, 21×14×27cm, custom color logo.
| Data Point | 何继跃88 (Yiwu) | 皓茁环保科技 (Zhejiang) | pengqizheng2016 |
|---|---|---|---|
| MOQ | 200 pcs | 500 pcs | Declined |
| Unit Price (200) | ¥6.50 | N/A | — |
| Unit Price (500) | ¥3.00 | ¥1.12 | — |
| Unit Price (1000) | ¥2.00 | ¥0.83 | — |
| Lead Time | 12 days | 12–15 days | — |
| Customization | Full color, white kraft | Full color, 130g white kraft + rope | — |
| Packing | 30×45×50cm, ~15kg | 46×40×56cm, ~21kg | — |
| Shipping (SZ) | ¥50 (200pc) / ¥120 (500pc) | ¥20–28 | — |
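The matrix above is exactly the structured output the bot should produce per supplier. A minimal sketch of a quote record, assuming hypothetical field names (not a production schema), with tiered-price lookup:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SupplierQuote:
    """Illustrative per-supplier record; field names are assumptions."""
    supplier: str
    moq: Optional[int] = None                  # pieces; None if declined
    unit_prices: dict = field(default_factory=dict)  # qty tier -> CNY/pc
    lead_time_days: Optional[str] = None
    customization: Optional[str] = None
    shipping_cny: Optional[str] = None

    def price_at(self, qty: int) -> Optional[float]:
        """Price from the highest tier the quantity qualifies for, if any."""
        tiers = [q for q in self.unit_prices if q <= qty]
        return self.unit_prices[max(tiers)] if tiers else None
```

With records like this, the "Declined" and missing cells in the table fall out naturally as `None` values rather than ad-hoc dashes.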
Three components that don't exist yet — and one that should be deprecated.
Takes an SR (with maturity level and customization requirements) and produces structured goals. Replaces 128 goal_management rules with one LLM call per SR.
| SR Maturity | Goals Generated | Example |
|---|---|---|
| High, no customization | 4 goals | Price, MOQ, lead time, packing |
| Mid, logo only | 6 goals | + Logo feasibility, logo MOQ |
| Mid, L1–L2 custom | 7–9 goals | + Size/color/material, custom pricing |
| Low maturity | 3 goals | Feasibility, rough price, MOQ range |
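The mapping in the table above can be sketched deterministically; the proposed component would produce these goal sets via one LLM call per SR, but a rule-of-thumb version (all names hypothetical) shows the intended shape:

```python
BASE_GOALS = ["price", "moq", "lead_time", "packing"]

def generate_goals(maturity: str, customization: str) -> list:
    """Sketch of the goal sets in the table above.

    maturity: 'high' | 'mid' | 'low'; customization: 'none' | 'logo' | 'L1' | 'L2'.
    Labels are assumptions, not the production SR schema.
    """
    if maturity == "low":
        # Low-maturity SRs only probe feasibility, not firm terms
        return ["feasibility", "rough_price", "moq_range"]
    goals = list(BASE_GOALS)
    if customization == "logo":
        goals += ["logo_feasibility", "logo_moq"]
    elif customization in ("L1", "L2"):
        goals += ["size_color_material", "custom_pricing", "custom_moq"]
    return goals
```

The LLM-based version earns its keep on SRs that do not fit clean buckets, which is why a single call per SR replaces 128 static rules.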
One LLM call per turn. System prompt: Level 1 guidance (~2K tokens) + SR-specific goals. Based on benchmark results, L1 is the optimal production baseline — captures 1688 etiquette and goal ordering without the performance drag of the full rulebook.
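The per-turn call can be sketched as follows; `build_system_prompt`, `next_turn`, and the `llm` client interface are hypothetical stand-ins, not an existing API:

```python
def build_system_prompt(l1_guidance: str, goals: list) -> str:
    """Assemble the ~2K-token L1 guidance plus SR-specific goals."""
    goal_lines = "\n".join(f"- {g}" for g in goals)
    return f"{l1_guidance}\n\nGoals for this supplier conversation:\n{goal_lines}"

def next_turn(llm, system_prompt: str, history: list) -> str:
    """Single-pass engine: one LLM call per turn, full history in,
    next outbound message out. `llm` stands in for whatever
    chat-completion client is used in production."""
    messages = [{"role": "system", "content": system_prompt}] + history
    return llm(messages=messages)
```

No second extraction pass is needed if the same call is also asked to emit structured fields, which is where the extraction-error elimination comes from.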
Collects structured outputs from all parallel conversations and builds the comparison matrix shown above. This is the actual business deliverable nobody has built yet.
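The aggregation step is a pivot from per-supplier records to a field-by-supplier matrix like the trade-off table above. A minimal sketch, assuming plain-dict outputs with illustrative field names:

```python
def build_comparison_matrix(quotes: list) -> dict:
    """Pivot per-supplier structured outputs into {field: {supplier: value}}.

    `quotes` is a list of dicts, one per parallel conversation; field
    names here are illustrative, missing values render as '-'.
    """
    fields = ["moq", "price_500", "lead_time", "customization"]
    matrix = {}
    for f in fields:
        matrix[f] = {q["supplier"]: str(q.get(f, "-")) for q in quotes}
    return matrix
```

Rendering this dict as a markdown table reproduces the deliverable directly from the parallel conversations' outputs.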
| Approach | Per Turn | Per SR (20 suppliers × 10 turns) | Per Month (50 SRs) |
|---|---|---|---|
| Two-phase (current) | ~$0.040 | ~$8.00 | ~$400 |
| Single-pass, full rulebook (L2) | ~$0.022 | ~$4.33 | ~$217 |
| Single-pass, stripped (L1, proposed) | ~$0.006 | ~$1.10 | ~$55 |
| Single-pass, zero-shot (L0) | ~$0.002 | ~$0.40 | ~$20 |
At 50 SRs/month, the proposed architecture saves $345/month vs the current approach. At scale (500 SRs/month), that's $3,450/month saved.
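The roll-up behind the table is straightforward arithmetic; a sketch for sanity-checking the figures (note the per-turn costs in the table are rounded, so products can differ slightly from the listed per-SR values):

```python
def sr_cost(per_turn: float, turns: int = 10, suppliers: int = 20) -> float:
    """Cost of one SR: each supplier conversation runs ~`turns` bot turns."""
    return per_turn * turns * suppliers

def monthly_cost(per_turn: float, srs: int = 50, **kw) -> float:
    """Monthly roll-up across SRs at the table's default volumes."""
    return sr_cost(per_turn, **kw) * srs
```

Example: `monthly_cost(0.040)` reproduces the ~$400/month two-phase figure, and the $345/month saving is just the delta against `monthly_cost(0.006)` after rounding.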
| # | Task | Timeline | Owner |
|---|---|---|---|
| 1 | Align on architecture direction | This sync | All |
| 2 | Deprecate rule system as runtime logic; retain as eval corpus | Immediate | Tek/Eric |
| 3 | Secure chatServer polling API docs from Awsaf | This week | Awsaf |
| 4 | Build goal generator + test against 5 sample SRs | 3 days | Eric |
| 5 | Build single-pass conversation engine (L1 baseline) | 5 days | Eric |
| 6 | Integrate with chatServer API (message send/receive) | Parallel | Eric + Awsaf |
| 7 | End-to-end test: 1 real SR, 5 live suppliers | End of next week | Eric |
The conversation engine is solved. Zero-shot LLMs already know how to talk to Chinese suppliers.
The 157-rule system encodes valuable institutional knowledge, but loading it into the LLM at runtime hurts performance. The benchmark proves this across two independent engines.
The hard problems — SR-aware goal generation, parallel orchestration, and trade-off aggregation — have not been built yet. That's where the engineering effort should go.
Proposed baseline: Level 1 prompt (~2K tokens) + dynamic goals per SR. 7× cheaper, 2× faster, empirically better.
rules_guidance.json from Tek's rules-agent codebase.
Full rule system analyzed for token counts, categories, and contradictions.