Golden Eval Framework + Unified Comparison

golden-eugene-v1-seed v0.2.0 | Rubric + judge prompt first, then shared Awsaf/Eric results

Summary cards: Golden Cases | Checks | B4 ICP Cases | 1 / 3 | B5/B6 Stress | Conversations Scored

Eval Prompt Logic (Used for Scoring)

Judge scores assistant turns on D1-D5 (CQ), tool execution on T1-T4 (TQ), and binary checks (CS).
Pass gate: composite >= 7, no failed critical required checks, and T1 >= 1.
Required checks are taken from each mapped GD case in cases.seed.json.

Prompt template file: tests/golden-eugene-v1/GPT_JUDGE_PROMPT.md

# GPT Judge Prompt (Golden Transcript Eval v2)

You are an evaluation judge for Sourcy activation chat transcripts.

## Scoring Tracks

### Track 1: Conversation Quality (D1-D5, per assistant turn)

Score each **assistant message turn** on 5 dimensions, each 0-2:

- **D1 Message Length**
  - 2: 1-3 lines (25 words or fewer)
  - 1: 4-5 lines
  - 0: 6+ lines

- **D2 Value Delivery** (TOOL-AWARE)
  - 2: Turn delivers specific, actionable data via text OR via attached tool result.
    If a tool returned pricing/specs/visuals in this turn, the VALUE was delivered
    by the card. Short text like "pricing above" or "pick options above" = 2 if
    the tool result contains relevant data.
  - 1: Some engagement but no concrete data in text or tool result.
  - 0: Neither text nor tool delivered value. Lead learns nothing.

- **D3 Qualification**
  - 2: Gathers qualification signal (product/quantity/budget) naturally.
    Tool-driven qualification counts: if productClarification generated
    chip questions, that IS qualification.
  - 1: Advances but doesn't extract new qualification data.
  - 0: Unrelated or premature question.

- **D4 Conversation Discipline**
  - 2: One question per turn, naturally embedded.
  - 1: Two questions, still reasonable.
  - 0: Three+ questions, or question dump.

- **D5 Last Message Test**
  - 2: Clear reason to reply - specific question, next step, CTA.
  - 1: Complete but weak hook.
  - 0: Dead end.

Per turn max = 10. Conversation Quality = average turn score.

### Track 2: Tool Quality (T1-T4, per conversation)

- **T1 Tool Sequence**: Does actual tool call order match expected?
  - 2: Exact match (or valid variation like skip-visuals)
  - 1: Right tools, minor order issue
  - 0: Missing critical tools or wrong tools

- **T2 Pricing Accuracy** (only when pricingIntelligence called):
  - 2: Search keyword matches product, FOB range reasonable, listings relevant
  - 1: Keyword mostly matches, range wide or listings tangential
  - 0: Wrong product, insane range, broken math
  - NA: No pricing call in this conversation

- **T3 Visual Quality** (only when visualConceptGeneration called):
  - 2: Concepts match product + customizations, are distinct
  - 1: Match product but too similar or miss customizations
  - 0: Don't match product, or failed with no fallback
  - NA: Visuals skipped by user

- **T4 Spec Quality** (only when productDataGeneration called):
  - 2: Product name matches, specs complete, MOQ reasonable
  - 1: Partial match, some specs generic
  - 0: Wrong product or critical specs missing
  - NA: No spec generation in this conversation

### Track 3: Binary Checks (conversation-level)
- RESTRICTED_HOLD (Y/N/NA)
- BUDGET_MATH (Y/N/NA)
- EXIT_DOOR (Y/N/NA)
- LANGUAGE_MATCH (Y/N)
- FORMAT_OK (Y/N)
- GHOST_TOOLS (Y/N) - Any tool calls after contact submission?
- ADAPTIVE_CLARIFICATION (Y/N/NA) - Rounds matched user's spec maturity?

### Endpoint Detection
Based on full conversation, determine actual endpoint:
- EXIT_POLITE
- EDUCATE_AND_NURTURE
- QUALIFY_AND_ADVANCE
- COMPLETE_SR
- HUMAN_HANDOFF

### Sourcy Knowledge Checks (when applicable)
- OBJECTION_ANSWERED (Y/N/NA) - If user raised trust/pricing/Alibaba objection, was it answered with specifics?
- REDIRECT_AFTER_OBJECTION (Y/N/NA) - After answering objection, did bot redirect back to product flow?
- SOURCY_KNOWLEDGE_ACCURATE (Y/N/NA) - Any Sourcy facts stated match reality? (No fabricated stats)

## Composite Score Formula
CQ = average turn score (0-10)
TQ = average of scored T dimensions × 5 (0-10)
CS = binary check pass rate × 10 (0-10)
Composite = CQ × 0.40 + TQ × 0.35 + CS × 0.25
Pass = Composite >= 7.0 AND no critical check failures AND T1 >= 1

## Instructions
1. Evaluate only the transcript provided.
2. IMPORTANT: Tool calls and their results are included in the transcript.
   Factor them into D2, D3, and all T-scores.
3. Score assistant *message* turns only for D1-D5.
4. Score T1-T4 once per conversation.
5. If tool result data is provided, use it for T2/T3/T4 validation.
6. Return strict JSON only.

## Required JSON Output
{
  "conversation_id": "...",
  "actual_endpoint": "...",
  "expected_tool_sequence": ["..."],
  "actual_tool_sequence": ["..."],
  "assistant_turns_scored": 0,
  "turns": [
    {
      "seq": 0,
      "score": 0,
      "dimensions": { "D1": 0, "D2": 0, "D3": 0, "D4": 0, "D5": 0 },
      "tool_in_turn": "toolName or null",
      "reason": "short reason"
    }
  ],
  "conversation_quality": 0,
  "tool_scores": {
    "T1": { "score": 0, "reason": "..." },
    "T2": { "score": 0, "reason": "..." },
    "T3": { "score": 0, "reason": "..." },
    "T4": { "score": 0, "reason": "..." }
  },
  "tool_quality": 0,
  "binary_checks": {
    "RESTRICTED_HOLD": "Y",
    "BUDGET_MATH": "NA",
    "EXIT_DOOR": "NA",
    "LANGUAGE_MATCH": "Y",
    "FORMAT_OK": "Y",
    "GHOST_TOOLS": "N",
    "ADAPTIVE_CLARIFICATION": "NA",
    "OBJECTION_ANSWERED": "NA",
    "REDIRECT_AFTER_OBJECTION": "NA",
    "SOURCY_KNOWLEDGE_ACCURATE": "NA"
  },
  "check_score": 0,
  "composite_score": 0,
  "pass": false,
  "summary": "1-3 sentence summary"
}

Input transcript:
{{TRANSCRIPT_JSON}}
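
The template above ends with a `{{TRANSCRIPT_JSON}}` placeholder and asks for strict JSON back. A minimal sketch of how a runner might fill the placeholder and sanity-check the reply follows. The function names are hypothetical and no model call is shown; the key list is copied from the "Required JSON Output" block.

```python
import json

PLACEHOLDER = "{{TRANSCRIPT_JSON}}"

# Top-level keys listed under "Required JSON Output" in the prompt.
REQUIRED_KEYS = {
    "conversation_id", "actual_endpoint", "expected_tool_sequence",
    "actual_tool_sequence", "assistant_turns_scored", "turns",
    "conversation_quality", "tool_scores", "tool_quality",
    "binary_checks", "check_score", "composite_score", "pass", "summary",
}


def render_prompt(template: str, transcript: dict) -> str:
    """Substitute the serialized transcript into the prompt template."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no transcript placeholder")
    return template.replace(PLACEHOLDER, json.dumps(transcript, ensure_ascii=False))


def parse_judge_reply(reply: str) -> dict:
    """Parse a strict-JSON reply and verify the required top-level keys."""
    data = json.loads(reply)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"judge reply is missing keys: {sorted(missing)}")
    return data
```

Rejecting replies with missing keys early keeps malformed judge output from silently entering the averages.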

Rubric Dimensions (Displayed Before Results)

Dimension	Definition	0-2 behavior
D1	Message Length (0-2)	2: 1-3 lines, <=25 words. 1: 4-5 lines. 0: 6+ lines, wall-of-text behavior.
D2	Value Delivery (0-2, TOOL-AWARE)	2: Specific value delivered by text OR attached tool output (pricing/specs/visual/feasibility cards). 1: Mild engagement, low specificity. 0: No useful value from text or tools.
D3	Qualification (0-2)	2: Captures qualification signal naturally (product, quantity, budget, destination, timeline). 1: Some progress but no new qualification signal. 0: Off-track or premature questioning.
D4	Conversation Discipline (0-2)	2: One clear question max, focused move. 1: Two questions or mild overreach. 0: Question dump / scattered turn.
D5	Last Message Test (0-2)	2: Clear reply hook or next action. 1: Weak continuation signal. 0: Dead-end response.
T1	Tool Sequence Correctness	2: Exact expected sequence or valid variation (e.g., skip visuals). 1: Right tools with minor ordering issue or one non-critical omission. 0: Missing critical tools or wrong sequence.
T2	Pricing Accuracy (if pricingIntelligence called)	2: Search keyword/product alignment, realistic FOB/DDP bounds, relevant listings, sane math. 1: Partially aligned but broad/noisy. 0: Wrong product/range/math.
T3	Visual Quality (if visualConceptGeneration called)	2: Concepts match product + customization and are distinct. 1: Partially aligned or too similar. 0: Mismatch/failure without fallback.
T4	Spec Quality (if productDataGeneration called)	2: Product match, complete specs, reasonable MOQ, relevant insight. 1: Partial quality. 0: Wrong/incomplete critical specs.
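
The rubric above can be turned into the two track scores as follows. This is a sketch under stated assumptions rather than the judge's actual code: function names are hypothetical, and `None` stands in for an NA tool dimension, which is left out of the TQ average.

```python
from statistics import mean
from typing import Optional

D_KEYS = ("D1", "D2", "D3", "D4", "D5")


def turn_score(dimensions: dict) -> int:
    """Sum the five 0-2 dimension scores for one assistant turn (max 10)."""
    for key in D_KEYS:
        if dimensions[key] not in (0, 1, 2):
            raise ValueError(f"{key} must be 0, 1, or 2")
    return sum(dimensions[key] for key in D_KEYS)


def conversation_quality(turns: list) -> float:
    """CQ: average of the per-turn totals, on a 0-10 scale."""
    return mean(turn_score(t) for t in turns)


def tool_quality(t_scores: dict) -> Optional[float]:
    """TQ: mean of the scored T dimensions times 5; None entries (NA) are skipped."""
    scored = [s for s in t_scores.values() if s is not None]
    return mean(scored) * 5 if scored else None
```

For example, T1/T2/T3/T4 = 2/2/1/2 gives a TQ of 8.75, and 2/1/NA/2 gives about 8.33.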

Unified Results (Same Definition)

Awsaf transcripts

Count: 65

CQ avg: 7.43/10

TQ avg: 9.48/10

CS avg: 9.82/10

Composite avg: 8.75/10

Pass rate: 87.69%

Endpoint match: 86.15%

T1 match rate: 100%

Eric chats (v7 logs)

Count: 6

CQ avg: 5.78/10

TQ avg: 5/10

CS avg: 9.72/10

Composite avg: 6.49/10

Pass rate: 33.33%

Endpoint match: 66.67%

T1 match rate: 50%

Delta (Awsaf - Eric): Composite +2.26, CQ +1.65

Judge mode: tool-aware-eval-v2 (heuristic implementation of GPT_JUDGE_PROMPT v2)
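
As one illustration of what a heuristic implementation of the rubric could look like, here is a sketch of T1 (tool sequence) scoring. It is not the actual tool-aware-eval-v2 code. The rule that only `visualConceptGeneration` may be skipped is an assumption based on the "skip-visuals" example in the rubric, and the tool order in the usage note below is hypothetical.

```python
from typing import List, Sequence

# Assumption: visuals are the only tool a user may skip without penalty.
OPTIONAL_TOOLS = frozenset({"visualConceptGeneration"})


def _core(seq: Sequence[str]) -> List[str]:
    """Drop optional tools, keeping the original order."""
    return [tool for tool in seq if tool not in OPTIONAL_TOOLS]


def score_t1(expected: Sequence[str], actual: Sequence[str]) -> int:
    """Return 2 for an exact or skip-visuals match, 1 for right tools in a different order, else 0."""
    exp, act = list(expected), list(actual)
    if act == exp or act == _core(exp):
        return 2
    if sorted(_core(act)) == sorted(_core(exp)):
        return 1
    return 0
```

With an expected sequence of `productClarification`, `pricingIntelligence`, `visualConceptGeneration`, `productDataGeneration`, dropping the visuals call still scores 2, swapping the first two calls scores 1, and calling only `productClarification` scores 0.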

Formula: Composite = CQ*0.40 + TQ*0.35 + CS*0.25

Threshold: composite >= 7, no critical check failures, and T1 >= 1.
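
The formula and threshold can be sketched in a few lines. The names are hypothetical, and two points are assumptions: NA checks are excluded from the CS denominator (consistent with CS values such as 7.5 and 8.33 in the results), and for GHOST_TOOLS the passing answer is "N", since a "Y" there means ghost calls were found.

```python
from typing import Mapping, Optional

# Assumption: every check passes on "Y" except GHOST_TOOLS, which passes on "N".
PASSING_ANSWER = {"GHOST_TOOLS": "N"}


def check_score(binary_checks: Mapping[str, str]) -> Optional[float]:
    """CS: share of applicable (non-NA) checks that passed, times 10."""
    applicable = {k: v for k, v in binary_checks.items() if v != "NA"}
    if not applicable:
        return None
    passed = sum(1 for k, v in applicable.items() if v == PASSING_ANSWER.get(k, "Y"))
    return 10 * passed / len(applicable)


def composite(cq: float, tq: float, cs: float) -> float:
    """Weighted composite on a 0-10 scale."""
    return cq * 0.40 + tq * 0.35 + cs * 0.25


def passes(composite_score: float, t1: int, critical_failures: int) -> bool:
    """Pass gate: composite >= 7, T1 >= 1, and no critical check failures."""
    return composite_score >= 7.0 and t1 >= 1 and critical_failures == 0
```

Because the formula is linear, applying it to the set averages reproduces the reported composites: `composite(7.43, 9.48, 9.82)` is about 8.745 for Awsaf and `composite(5.78, 5, 9.72)` is about 6.49 for Eric. A conversation with CQ 5, TQ 10, CS 10 scores exactly 8, so when such a row is marked FAIL with T1 = 2, the check condition of the gate is the one that failed.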

Conversation Results + Improvement Suggestions

Set	ID	Case	Expected Endpoint	Actual Endpoint	Endpoint Match	CQ	TQ	CS	Composite	T1/T2/T3/T4	Required Checks P/F/NA	Result	Improvements
Awsaf	TR-001	GD-001	EXIT_POLITE	EDUCATE_AND_NURTURE	mismatch	5	10	10	8	2/NA/NA/NA	0/2/0	FAIL	Fix CK-015: Endpoint correctness. Fix CK-020: Low-intent handling. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-002	GD-002	EDUCATE_AND_NURTURE	QUALIFY_AND_ADVANCE	mismatch	6.5	10	10	8.6	2/NA/NA/NA	0/1/1	FAIL	Fix CK-015: Endpoint correctness. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-003	GD-003	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.17	10	10	9.27	2/NA/NA/2	1/1/0	PASS	Fix CK-005: Value-first trust handling
Awsaf	TR-004	GD-004	EXIT_POLITE	EDUCATE_AND_NURTURE	mismatch	5	10	10	8	2/NA/NA/NA	0/1/1	FAIL	Fix CK-015: Endpoint correctness. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-005	GD-005	EXIT_POLITE	QUALIFY_AND_ADVANCE	match	7.31	10	10	8.92	2/2/2/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-006	GD-006	COMPLETE_SR	COMPLETE_SR	match	7.29	8.75	10	8.48	2/2/1/2	3/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-007	GD-007	HUMAN_HANDOFF	HUMAN_HANDOFF	match	5	10	10	8	2/NA/NA/NA	2/0/0	PASS	Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-008	GD-008	EDUCATE_AND_NURTURE	QUALIFY_AND_ADVANCE	mismatch	7.9	6.25	10	7.85	2/1/1/1	0/1/1	FAIL	Fix CK-015: Endpoint correctness
Awsaf	TR-009	GD-009	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8	10	10	9.2	2/2/2/2	3/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-010	GD-010	COMPLETE_SR	COMPLETE_SR	match	7.29	10	10	8.92	2/2/2/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-011	GD-011	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.27	8.75	10	8.87	2/2/1/2	2/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-012	GD-012	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8	8.75	10	8.76	2/2/1/2	3/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-013	GD-013	COMPLETE_SR	COMPLETE_SR	match	7.29	8.75	10	8.48	2/2/1/2	2/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-014	GD-014	COMPLETE_SR	COMPLETE_SR	match	7.29	10	10	8.92	2/2/2/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-015	GD-015	EDUCATE_AND_NURTURE	QUALIFY_AND_ADVANCE	mismatch	7.9	6.25	10	7.85	2/1/1/1	0/1/0	FAIL	Fix CK-015: Endpoint correctness
Awsaf	TR-016	GD-016	EXIT_POLITE	EXIT_POLITE	match	7.5	10	10	9	2/NA/NA/NA	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-017	GD-017	COMPLETE_SR	COMPLETE_SR	match	7.21	8.75	10	8.45	2/2/1/2	4/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-018	GD-018	HUMAN_HANDOFF	HUMAN_HANDOFF	match	7.71	7.5	10	8.21	2/NA/NA/1	2/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-019	GD-019	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.62	8.75	10	8.61	2/2/1/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-020	GD-020	COMPLETE_SR	COMPLETE_SR	match	7.76	10	10	9.1	2/2/2/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-021	GD-021	EDUCATE_AND_NURTURE	QUALIFY_AND_ADVANCE	mismatch	6.5	10	10	8.6	2/NA/NA/NA	1/1/0	FAIL	Fix CK-015: Endpoint correctness. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-022	GD-022	EDUCATE_AND_NURTURE	QUALIFY_AND_ADVANCE	mismatch	6.5	10	10	8.6	2/NA/NA/NA	1/0/1	PASS	Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-023	GD-023	EXIT_POLITE	EXIT_POLITE	match	4	10	10	7.6	2/NA/NA/NA	3/0/1	PASS	Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-024	GD-024	EXIT_POLITE	EDUCATE_AND_NURTURE	mismatch	5	10	10	8	2/NA/NA/NA	0/1/2	FAIL	Fix CK-015: Endpoint correctness. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-025	GD-025	EXIT_POLITE	EXIT_POLITE	match	6	10	7.5	7.78	2/NA/NA/NA	3/0/0	PASS	Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-026	GD-026	HUMAN_HANDOFF	HUMAN_HANDOFF	match	5	10	10	8	2/NA/NA/NA	1/0/2	PASS	Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-027	GD-027	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.5	10	10	9.4	2/2/NA/2	1/0/3	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-028	GD-028	COMPLETE_SR	COMPLETE_SR	match	7.33	8.33	10	8.35	2/1/NA/2	2/0/2	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-029	GD-029	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.17	10	10	9.27	2/NA/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-030	GD-030	COMPLETE_SR	COMPLETE_SR	match	7.33	8.33	10	8.35	2/1/NA/2	2/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-031	GD-031	EXIT_POLITE	EXIT_POLITE	match	4	10	10	7.6	2/NA/NA/NA	2/0/1	PASS	Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-032	GD-032	COMPLETE_SR	COMPLETE_SR	match	7.29	8.75	10	8.48	2/2/1/2	2/0/2	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-033	GD-033	COMPLETE_SR	COMPLETE_SR	match	7.33	8.33	10	8.35	2/1/NA/2	2/0/2	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-034	GD-034	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8	10	10	9.2	2/2/NA/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-035	GD-035	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.44	10	10	9.38	2/2/NA/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-036	GD-036	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.89	10	10	9.16	2/2/NA/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-037	GD-037	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.09	8.75	10	8.8	2/2/1/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-038	GD-038	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.44	10	10	9.38	2/2/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-039	GD-039	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.22	10	10	9.29	2/2/NA/2	1/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-040	GD-040	HUMAN_HANDOFF	HUMAN_HANDOFF	match	5	10	10	8	2/NA/NA/NA	2/0/0	PASS	Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf	TR-041	GD-041	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.11	10	10	9.24	2/2/NA/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-042	GD-042	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.67	10	10	9.47	2/2/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-043	GD-043	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.1	8.75	7.5	8.18	2/2/1/2	1/1/0	PASS	Fix CK-026: Adaptive clarification by maturity. Match clarification depth to SR maturity (high maturity = fewer rounds).
Awsaf	TR-044	GD-044	EDUCATE_AND_NURTURE	QUALIFY_AND_ADVANCE	mismatch	7.13	7.5	10	7.98	2/NA/NA/1	1/1/0	FAIL	Fix CK-015: Endpoint correctness
Awsaf	TR-045	GD-045	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.38	10	8	8.85	2/2/NA/2	1/1/0	PASS	Fix CK-026: Adaptive clarification by maturity. Match clarification depth to SR maturity (high maturity = fewer rounds).
Awsaf	TR-046	GD-046	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.73	10	8	8.59	2/2/NA/2	1/1/0	PASS	Fix CK-026: Adaptive clarification by maturity. Match clarification depth to SR maturity (high maturity = fewer rounds).
Awsaf	TR-047	GD-047	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.62	8.75	7.5	7.99	2/2/1/2	1/1/0	PASS	Fix CK-026: Adaptive clarification by maturity. Match clarification depth to SR maturity (high maturity = fewer rounds).
Awsaf	TR-048	GD-048	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.67	10	10	9.47	2/2/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-049	GD-049	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.89	10	10	9.16	2/2/NA/2	1/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-050	GD-050	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.45	8.75	10	8.94	2/2/1/2	1/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-051	GD-051	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.33	10	10	9.33	2/2/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-052	GD-052	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.33	10	10	9.33	2/2/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-053	GD-053	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.09	8.75	10	8.8	2/2/1/2	1/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-054	GD-054	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8	10	10	9.2	2/2/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-055	GD-055	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.13	10	10	9.25	2/2/2/2	2/1/0	PASS	Fix CK-019: Correction re-analysis
Awsaf	TR-056	GD-056	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.38	10	10	9.35	2/2/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-057	GD-057	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.77	10	10	9.11	2/2/2/2	1/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-058	GD-058	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.9	10	10	9.16	2/2/NA/2	3/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-059	GD-059	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.7	10	10	9.08	2/2/NA/2	1/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-060	GD-060	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.38	10	10	9.35	2/2/NA/2	2/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-061	GD-061	COMPLETE_SR	COMPLETE_SR	match	7.21	10	10	8.88	2/2/NA/2	2/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-062	GD-062	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.38	10	10	9.35	2/2/NA/2	3/1/0	PASS	Fix CK-019: Correction re-analysis
Awsaf	TR-063	GD-063	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	7.91	8.75	10	8.73	2/2/1/2	1/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-064	GD-064	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.38	10	10	9.35	2/2/NA/2	1/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf	TR-065	GD-065	QUALIFY_AND_ADVANCE	QUALIFY_AND_ADVANCE	match	8.29	10	10	9.32	2/2/NA/2	1/0/0	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Eric	v7-jesus	GD-017	COMPLETE_SR	QUALIFY_AND_ADVANCE	match	6.67	0	10	5.17	0/NA/NA/NA	4/0/1	FAIL	Align tool choreography to expected sequence; avoid missing or misordered calls. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Eric	v7-syed	GD-005	EXIT_POLITE	QUALIFY_AND_ADVANCE	match	5	0	10	4.5	0/NA/NA/NA	2/0/0	FAIL	Align tool choreography to expected sequence; avoid missing or misordered calls. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Eric	v7-battery	GD-016	EXIT_POLITE	QUALIFY_AND_ADVANCE	mismatch	4.5	0	10	4.3	0/NA/NA/NA	1/1/0	FAIL	Align tool choreography to expected sequence; avoid missing or misordered calls. Fix CK-015: Endpoint correctness. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Eric	v7-candle	GD-024	EXIT_POLITE	EDUCATE_AND_NURTURE	mismatch	5.5	10	10	8.2	2/NA/NA/NA	0/1/2	FAIL	Fix CK-015: Endpoint correctness. Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Eric	v7-anthony	GD-023	EXIT_POLITE	EXIT_POLITE	match	7	10	8.33	8.38	2/NA/NA/NA	3/0/1	PASS	Maintain current behavior; this conversation is aligned with the revised framework.
Eric	v7-jammaica	GD-007	HUMAN_HANDOFF	HUMAN_HANDOFF	match	6	10	10	8.4	2/NA/NA/NA	2/0/0	PASS	Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
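
The rows above can be audited mechanically. The sketch below uses hypothetical column keys and assumes tab-separated rows in the order of the header. It recomputes the composite from CQ/TQ/CS and flags rows whose match label disagrees with a plain comparison of the endpoint names. A few rows (TR-005, v7-jesus, v7-syed) are labelled "match" although the names differ. This may reflect matching logic not described here or a data issue, and the sketch only surfaces such rows.

```python
COLUMNS = [
    "set", "id", "case", "expected", "actual", "endpoint", "cq", "tq", "cs",
    "composite", "t_scores", "req", "result", "improvements",
]


def parse_row(line: str) -> dict:
    """Split one tab-separated results row into a dict with numeric scores."""
    row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    for key in ("cq", "tq", "cs", "composite"):
        row[key] = float(row[key])
    return row


def audit_row(row: dict, tol: float = 0.02) -> list:
    """Return human-readable issues for one parsed row (empty list if none)."""
    issues = []
    recomputed = row["cq"] * 0.40 + row["tq"] * 0.35 + row["cs"] * 0.25
    if abs(recomputed - row["composite"]) > tol:
        issues.append(f"composite {row['composite']} vs recomputed {recomputed:.2f}")
    names_equal = row["expected"] == row["actual"]
    if names_equal != (row["endpoint"] == "match"):
        issues.append("endpoint label differs from name comparison")
    return issues
```

The tolerance allows for the CQ, TQ and CS columns already being rounded to two decimals.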

Required Check Performance

Check	Definition	Awsaf (P/F/NA)	Awsaf Pass%	Eric (P/F/NA)	Eric Pass%
CK-005	Value-first trust handling	8/1/11	88.89%	0/0/1	-
CK-006	Restricted hold absolute	1/0/0	100%	1/0/0	100%
CK-007	Branded/IP hold	1/0/1	100%	1/0/0	100%
CK-008	Budget math honesty	4/0/13	100%	2/0/1	100%
CK-009	Currency disambiguation	1/0/0	100%	1/0/0	100%
CK-010	Call handoff behavior	2/0/2	100%	1/0/0	100%
CK-015	Endpoint correctness	56/8/0	87.5%	4/2/0	66.67%
CK-018	No fabricated capabilities	5/0/0	100%	1/0/0	100%
CK-019	Correction re-analysis	0/2/1	0%	0/0/0	-
CK-020	Low-intent handling	1/1/5	50%	0/0/2	-
CK-021	Low-intent dropoff handling	2/0/0	100%	0/0/0	-
CK-022	Scope-jump stabilization	0/0/1	-	0/0/0	-
CK-023	Early-stage business acceptance	0/0/1	-	0/0/0	-
CK-024	Hostile-safe exit	1/0/0	100%	0/0/0	-
CK-025	No ghost tool calls after contact submit	10/0/0	100%	1/0/0	100%
CK-026	Adaptive clarification by maturity	1/4/2	20%	0/0/0	-
CK-027	Spec correction recovery	3/0/0	100%	0/0/0	-
CK-028	Mid-flow objection without context loss	3/0/0	100%	0/0/0	-
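
The Pass% columns are consistent with P / (P + F), with NA left out of the denominator and "-" shown when no case was decided. A minimal sketch that reproduces this from a P/F/NA cell (the function name is hypothetical):

```python
def pass_percent(cell: str) -> str:
    """Convert a 'P/F/NA' cell to a pass percentage over decided cases."""
    passed, failed, _na = (int(part) for part in cell.split("/"))
    decided = passed + failed
    if decided == 0:
        return "-"  # nothing to score: every mapped case was NA or absent
    pct = round(100 * passed / decided, 2)
    return f"{pct:g}%"
```

For example, `pass_percent("8/1/11")` gives `88.89%` (CK-005) and `pass_percent("1/4/2")` gives `20%` (CK-026).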

Judge Notes

D2 and D3 are tool-aware: tool results contribute to value and qualification scoring.
Tool outputs are validated using summaries in transcript tool_result_summary.key_fields.
CK-025 (ghost tools) and CK-026 (adaptive clarification) are included.
Eric chats are scored with the same framework for direct comparability.

All case-level improvement suggestions above are generated from failed checks + low dimension averages per conversation.
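
A rough sketch of how such suggestions could be assembled is shown below. The generator itself is not shown in this report, so the rule is an assumption: CQ is used as a stand-in for "low dimension averages", and the floor of 7.0 is a hypothetical value chosen because it reproduces several rows of the results table. The message strings are copied from that table.

```python
ALIGN = "Align tool choreography to expected sequence; avoid missing or misordered calls."
TIGHTEN = ("Tighten turn quality: keep messages concise, one question max, "
           "and include clear next-step CTA.")
MAINTAIN = "Maintain current behavior; this conversation is aligned with the revised framework."


def suggestions(failed_checks: dict, cq: float, t1: int, cq_floor: float = 7.0) -> list:
    """Build the improvement list for one conversation.

    failed_checks maps a failed check ID to its definition,
    e.g. {"CK-015": "Endpoint correctness"}.
    """
    out = []
    if t1 == 0:  # missing or wrong tools
        out.append(ALIGN)
    out.extend(f"Fix {check_id}: {name}" for check_id, name in failed_checks.items())
    if cq < cq_floor:  # hypothetical threshold, see lead-in
        out.append(TIGHTEN)
    return out or [MAINTAIN]
```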

Eric-to-GD Mapping Used by Shared Judge

Eric File	Mapped GD Case
v7-jesus.jsonl	GD-017
v7-syed.jsonl	GD-005
v7-battery.jsonl	GD-016
v7-candle.jsonl	GD-024
v7-anthony.jsonl	GD-023
v7-jammaica.jsonl	GD-007

Eric Historical Pass/Fail (Reference)

Historical run headlines

v7: 8/8 PASS, overall avg 8.8/10
v6b: 3/5 PASS, 2/5 FAIL. Average across all 15 turns: 6.5/10.

Source files

tests/run-v7/run_summary_v7.md
tests/run-v6/run_summary_v6.md

Version	Personas	Pass Rate	Avg Lines	Key Change
v1	6	4/6 (67%)	12-18	Baseline
v2	8	7/8 (88%)	10-15	Call handoff, budget math
v3	8	8/8 (100%)	8-12	WHY enforcement, restricted products
v3-test	6 (new)	6/6 (100%)	8-12	Generalization confirmed
v4	14	Issues found	6-10	Strict persona testing exposed gaps
v5	4 (fixes)	4/4 fixes	6-10	Absolute restrictions, estimate budgets, exits
v6b	5	3/5 (60%)	2-5	Per-turn priority, hard cap (stricter rubric)
v7	8	8/8 (100%)	1-4	Prices-first, one-liner rules, examples

Proposed GD Run Stack

run-gd-v1-core

Core release gate set (locked cases only).

Rule: review_status = locked

Count: 11

run-gd-v1-generalization

Generalization check on reviewed but not locked cases.

Rule: review_status = reviewed

Count: 40

run-gd-v1-draft-probe

Draft-case probe to find schema/scenario gaps before locking.

Rule: review_status = draft

Count: 14

run-gd-v1-stress

Adversarial mismatch/spam/resistance stress test.

Rule: B5 + B6

Count: 4
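
The run stack could be derived from the case file along these lines. The schema of cases.seed.json is not shown here, so the `id` and `bucket` fields are hypothetical, and only `review_status` and the run names come from the rules above. The sketch also assumes stress cases remain in their status-based run, which the counts suggest (11 + 40 + 14 = 65) but the rules do not state.

```python
RUN_BY_STATUS = {
    "locked": "run-gd-v1-core",
    "reviewed": "run-gd-v1-generalization",
    "draft": "run-gd-v1-draft-probe",
}
STRESS_RUN = "run-gd-v1-stress"
STRESS_BUCKETS = {"B5", "B6"}  # hypothetical values of a "bucket" field


def assign_runs(cases: list) -> dict:
    """Group case IDs into runs by review_status, plus a B5/B6 stress run."""
    runs = {name: [] for name in (*RUN_BY_STATUS.values(), STRESS_RUN)}
    for case in cases:
        run = RUN_BY_STATUS.get(case.get("review_status"))
        if run is not None:
            runs[run].append(case["id"])
        if case.get("bucket") in STRESS_BUCKETS:
            runs[STRESS_RUN].append(case["id"])
    return runs
```

Comparing `len(runs[name])` against the counts listed above gives a quick check that the case file and the run plan agree.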