Golden Eval Framework + Unified Comparison

golden-eugene-v1-seed v0.2.0 | Rubric + judge prompt first, then shared Awsaf/Eric results

Golden Cases: 65
Checks: 28
B4 ICP Cases: 3
B5/B6 Stress: 1 / 3
Conversations Scored: 71

Eval Prompt Logic (Used for Scoring)

Prompt template file: tests/golden-eugene-v1/GPT_JUDGE_PROMPT.md

# GPT Judge Prompt (Golden Transcript Eval v2)

You are an evaluation judge for Sourcy activation chat transcripts.

## Scoring Tracks

### Track 1: Conversation Quality (D1-D5, per assistant turn)

Score each **assistant message turn** on 5 dimensions, each 0-2:

- **D1 Message Length**
  - 2: 1-3 lines (25 words or fewer)
  - 1: 4-5 lines
  - 0: 6+ lines

- **D2 Value Delivery** (TOOL-AWARE)
  - 2: Turn delivers specific, actionable data via text OR via attached tool result.
    If a tool returned pricing/specs/visuals in this turn, the VALUE was delivered
    by the card. Short text like "pricing above" or "pick options above" = 2 if
    the tool result contains relevant data.
  - 1: Some engagement but no concrete data in text or tool result.
  - 0: Neither text nor tool delivered value. Lead learns nothing.

- **D3 Qualification**
  - 2: Gathers qualification signal (product/quantity/budget) naturally.
    Tool-driven qualification counts: if productClarification generated
    chip questions, that IS qualification.
  - 1: Advances but doesn't extract new qualification data.
  - 0: Unrelated or premature question.

- **D4 Conversation Discipline**
  - 2: One question per turn, naturally embedded.
  - 1: Two questions, still reasonable.
  - 0: Three+ questions, or question dump.

- **D5 Last Message Test**
  - 2: Clear reason to reply - specific question, next step, CTA.
  - 1: Complete but weak hook.
  - 0: Dead end.

Per turn max = 10. Conversation Quality = average turn score.

### Track 2: Tool Quality (T1-T4, per conversation)

- **T1 Tool Sequence**: Does actual tool call order match expected?
  - 2: Exact match (or valid variation like skip-visuals)
  - 1: Right tools, minor order issue
  - 0: Missing critical tools or wrong tools

- **T2 Pricing Accuracy** (only when pricingIntelligence called):
  - 2: Search keyword matches product, FOB range reasonable, listings relevant
  - 1: Keyword mostly matches, range wide or listings tangential
  - 0: Wrong product, insane range, broken math
  - NA: No pricing call in this conversation

- **T3 Visual Quality** (only when visualConceptGeneration called):
  - 2: Concepts match product + customizations, are distinct
  - 1: Match product but too similar or miss customizations
  - 0: Don't match product, or failed with no fallback
  - NA: Visuals skipped by user

- **T4 Spec Quality** (only when productDataGeneration called):
  - 2: Product name matches, specs complete, MOQ reasonable
  - 1: Partial match, some specs generic
  - 0: Wrong product or critical specs missing
  - NA: No spec generation in this conversation

### Track 3: Binary Checks (conversation-level)
- RESTRICTED_HOLD (Y/N/NA)
- BUDGET_MATH (Y/N/NA)
- EXIT_DOOR (Y/N/NA)
- LANGUAGE_MATCH (Y/N)
- FORMAT_OK (Y/N)
- GHOST_TOOLS (Y/N) - Any tool calls after contact submission?
- ADAPTIVE_CLARIFICATION (Y/N/NA) - Rounds matched user's spec maturity?

### Endpoint Detection
Based on full conversation, determine actual endpoint:
- EXIT_POLITE
- EDUCATE_AND_NURTURE
- QUALIFY_AND_ADVANCE
- COMPLETE_SR
- HUMAN_HANDOFF

### Sourcy Knowledge Checks (when applicable)
- OBJECTION_ANSWERED (Y/N/NA) - If user raised trust/pricing/Alibaba objection, was it answered with specifics?
- REDIRECT_AFTER_OBJECTION (Y/N/NA) - After answering objection, did bot redirect back to product flow?
- SOURCY_KNOWLEDGE_ACCURATE (Y/N/NA) - Any Sourcy facts stated match reality? (No fabricated stats)

## Composite Score Formula
CQ = average turn score (0-10)
TQ = average of scored T dimensions × 5 (0-10)
CS = binary check pass rate × 10 (0-10)
Composite = CQ × 0.40 + TQ × 0.35 + CS × 0.25
Pass = Composite >= 7.0 AND no critical check failures AND T1 >= 1

## Instructions
1. Evaluate only the transcript provided.
2. IMPORTANT: Tool calls and their results are included in the transcript.
   Factor them into D2, D3, and all T-scores.
3. Score assistant *message* turns only for D1-D5.
4. Score T1-T4 once per conversation.
5. If tool result data is provided, use it for T2/T3/T4 validation.
6. Return strict JSON only.

## Required JSON Output
{
  "conversation_id": "...",
  "actual_endpoint": "...",
  "expected_tool_sequence": ["..."],
  "actual_tool_sequence": ["..."],
  "assistant_turns_scored": 0,
  "turns": [
    {
      "seq": 0,
      "score": 0,
      "dimensions": { "D1": 0, "D2": 0, "D3": 0, "D4": 0, "D5": 0 },
      "tool_in_turn": "toolName or null",
      "reason": "short reason"
    }
  ],
  "conversation_quality": 0,
  "tool_scores": {
    "T1": { "score": 0, "reason": "..." },
    "T2": { "score": 0, "reason": "..." },
    "T3": { "score": 0, "reason": "..." },
    "T4": { "score": 0, "reason": "..." }
  },
  "tool_quality": 0,
  "binary_checks": {
    "RESTRICTED_HOLD": "Y",
    "BUDGET_MATH": "NA",
    "EXIT_DOOR": "NA",
    "LANGUAGE_MATCH": "Y",
    "FORMAT_OK": "Y",
    "GHOST_TOOLS": "N",
    "ADAPTIVE_CLARIFICATION": "NA",
    "OBJECTION_ANSWERED": "NA",
    "REDIRECT_AFTER_OBJECTION": "NA",
    "SOURCY_KNOWLEDGE_ACCURATE": "NA"
  },
  "check_score": 0,
  "composite_score": 0,
  "pass": false,
  "summary": "1-3 sentence summary"
}

Input transcript:
{{TRANSCRIPT_JSON}}
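
The template above ends with a `{{TRANSCRIPT_JSON}}` placeholder and asks for strict JSON back. A minimal sketch of that round trip follows, assuming plain string substitution and a top-level key check. The helper names `render_prompt` and `parse_judge_output` are hypothetical and not taken from the repo; the key list is copied from the Required JSON Output section.

```python
import json

PLACEHOLDER = "{{TRANSCRIPT_JSON}}"

# Top-level keys copied from the "Required JSON Output" section of the prompt.
REQUIRED_KEYS = {
    "conversation_id", "actual_endpoint", "expected_tool_sequence",
    "actual_tool_sequence", "assistant_turns_scored", "turns",
    "conversation_quality", "tool_scores", "tool_quality",
    "binary_checks", "check_score", "composite_score", "pass", "summary",
}


def render_prompt(template: str, transcript: dict) -> str:
    """Substitute the transcript placeholder with the serialized transcript."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no transcript placeholder")
    return template.replace(PLACEHOLDER, json.dumps(transcript, ensure_ascii=False))


def parse_judge_output(raw: str) -> dict:
    """Parse the judge reply as strict JSON and check the top-level keys."""
    result = json.loads(raw)  # raises json.JSONDecodeError on non-JSON replies
    if not isinstance(result, dict):
        raise ValueError("judge output must be a JSON object")
    missing = REQUIRED_KEYS - set(result)
    if missing:
        raise ValueError(f"judge output missing keys: {sorted(missing)}")
    return result
```

Rejecting replies with missing keys up front keeps a malformed judgment from silently entering the averages; whether the actual runner does this is not stated in the report.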

Rubric Dimensions (Displayed Before Results)

Dimension | Definition | 0-2 behavior (bulleted under each row)
D1 | Message Length (0-2)
  • 2: 1-3 lines, <=25 words.
  • 1: 4-5 lines.
  • 0: 6+ lines, wall-of-text behavior.
D2 | Value Delivery (0-2, TOOL-AWARE)
  • 2: Specific value delivered by text OR attached tool output (pricing/specs/visual/feasibility cards).
  • 1: Mild engagement, low specificity.
  • 0: No useful value from text or tools.
D3 | Qualification (0-2)
  • 2: Captures qualification signal naturally (product, quantity, budget, destination, timeline).
  • 1: Some progress but no new qualification signal.
  • 0: Off-track or premature questioning.
D4 | Conversation Discipline (0-2)
  • 2: One clear question max, focused move.
  • 1: Two questions or mild overreach.
  • 0: Question dump / scattered turn.
D5 | Last Message Test (0-2)
  • 2: Clear reply hook or next action.
  • 1: Weak continuation signal.
  • 0: Dead-end response.
T1 | Tool Sequence Correctness
  • 2: Exact expected sequence or valid variation (e.g., skip visuals).
  • 1: Right tools with minor ordering issue or one non-critical omission.
  • 0: Missing critical tools or wrong sequence.
T2 | Pricing Accuracy (if pricingIntelligence called)
  • 2: Search keyword/product alignment, realistic FOB/DDP bounds, relevant listings, sane math.
  • 1: Partially aligned but broad/noisy.
  • 0: Wrong product/range/math.
T3 | Visual Quality (if visualConceptGeneration called)
  • 2: Concepts match product + customization and are distinct.
  • 1: Partially aligned or too similar.
  • 0: Mismatch/failure without fallback.
T4 | Spec Quality (if productDataGeneration called)
  • 2: Product match, complete specs, reasonable MOQ, relevant insight.
  • 1: Partial quality.
  • 0: Wrong/incomplete critical specs.
Unified Results (Same Definition)

Awsaf transcripts

Count: 65

CQ avg: 7.43/10

TQ avg: 9.48/10

CS avg: 9.82/10

Composite avg: 8.75/10

Pass rate: 87.69%

Endpoint match: 86.15%

T1 match rate: 100%

Eric chats (v7 logs)

Count: 6

CQ avg: 5.78/10

TQ avg: 5/10

CS avg: 9.72/10

Composite avg: 6.49/10

Pass rate: 33.33%

Endpoint match: 66.67%

T1 match rate: 50%

Delta (Awsaf - Eric): Composite +2.26, CQ +1.65
Judge mode: tool-aware-eval-v2 (heuristic implementation of GPT_JUDGE_PROMPT v2)
Formula: Composite = CQ*0.40 + TQ*0.35 + CS*0.25
Threshold: composite >= 7, no critical-check fail, and T1 >= 1.
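
The formula and threshold can be checked against individual rows with a few lines of Python. This is a sketch under stated assumptions: the report does not say which checks count as critical, so the caller supplies that as a boolean, and rounding to two decimals is inferred from the reported values.

```python
def composite_score(cq: float, tq: float, cs: float) -> float:
    """Composite = CQ*0.40 + TQ*0.35 + CS*0.25, rounded to two decimals."""
    return round(cq * 0.40 + tq * 0.35 + cs * 0.25, 2)


def passes(composite: float, t1: int, critical_check_failed: bool) -> bool:
    """Pass needs composite >= 7, no critical-check failure, and T1 >= 1."""
    return composite >= 7.0 and not critical_check_failed and t1 >= 1
```

With CQ 8, TQ 8.75, and CS 10 this yields 8.76, the composite shown for TR-012. A conversation such as TR-001 reaches a composite of 8 but is still marked FAIL, which is consistent with the rule that a failed critical check overrides the score.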

Conversation Results + Improvement Suggestions

Columns: Set, ID, Case, Expected Endpoint, Actual Endpoint, Endpoint (match/mismatch), CQ, TQ, CS, Composite, T1/T2/T3/T4, Req P/F/NA, Result. Improvements are listed as bullets under each row.
Awsaf TR-001 GD-001 EXIT_POLITE EDUCATE_AND_NURTURE mismatch 5 10 10 8 2/NA/NA/NA 0/2/0 FAIL
  • Fix CK-015: Endpoint correctness
  • Fix CK-020: Low-intent handling
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-002 GD-002 EDUCATE_AND_NURTURE QUALIFY_AND_ADVANCE mismatch 6.5 10 10 8.6 2/NA/NA/NA 0/1/1 FAIL
  • Fix CK-015: Endpoint correctness
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-003 GD-003 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.17 10 10 9.27 2/NA/NA/2 1/1/0 PASS
  • Fix CK-005: Value-first trust handling
Awsaf TR-004 GD-004 EXIT_POLITE EDUCATE_AND_NURTURE mismatch 5 10 10 8 2/NA/NA/NA 0/1/1 FAIL
  • Fix CK-015: Endpoint correctness
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-005 GD-005 EXIT_POLITE QUALIFY_AND_ADVANCE match 7.31 10 10 8.92 2/2/2/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-006 GD-006 COMPLETE_SR COMPLETE_SR match 7.29 8.75 10 8.48 2/2/1/2 3/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-007 GD-007 HUMAN_HANDOFF HUMAN_HANDOFF match 5 10 10 8 2/NA/NA/NA 2/0/0 PASS
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-008 GD-008 EDUCATE_AND_NURTURE QUALIFY_AND_ADVANCE mismatch 7.9 6.25 10 7.85 2/1/1/1 0/1/1 FAIL
  • Fix CK-015: Endpoint correctness
Awsaf TR-009 GD-009 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8 10 10 9.2 2/2/2/2 3/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-010 GD-010 COMPLETE_SR COMPLETE_SR match 7.29 10 10 8.92 2/2/2/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-011 GD-011 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.27 8.75 10 8.87 2/2/1/2 2/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-012 GD-012 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8 8.75 10 8.76 2/2/1/2 3/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-013 GD-013 COMPLETE_SR COMPLETE_SR match 7.29 8.75 10 8.48 2/2/1/2 2/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-014 GD-014 COMPLETE_SR COMPLETE_SR match 7.29 10 10 8.92 2/2/2/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-015 GD-015 EDUCATE_AND_NURTURE QUALIFY_AND_ADVANCE mismatch 7.9 6.25 10 7.85 2/1/1/1 0/1/0 FAIL
  • Fix CK-015: Endpoint correctness
Awsaf TR-016 GD-016 EXIT_POLITE EXIT_POLITE match 7.5 10 10 9 2/NA/NA/NA 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-017 GD-017 COMPLETE_SR COMPLETE_SR match 7.21 8.75 10 8.45 2/2/1/2 4/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-018 GD-018 HUMAN_HANDOFF HUMAN_HANDOFF match 7.71 7.5 10 8.21 2/NA/NA/1 2/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-019 GD-019 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.62 8.75 10 8.61 2/2/1/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-020 GD-020 COMPLETE_SR COMPLETE_SR match 7.76 10 10 9.1 2/2/2/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-021 GD-021 EDUCATE_AND_NURTURE QUALIFY_AND_ADVANCE mismatch 6.5 10 10 8.6 2/NA/NA/NA 1/1/0 FAIL
  • Fix CK-015: Endpoint correctness
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-022 GD-022 EDUCATE_AND_NURTURE QUALIFY_AND_ADVANCE mismatch 6.5 10 10 8.6 2/NA/NA/NA 1/0/1 PASS
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-023 GD-023 EXIT_POLITE EXIT_POLITE match 4 10 10 7.6 2/NA/NA/NA 3/0/1 PASS
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-024 GD-024 EXIT_POLITE EDUCATE_AND_NURTURE mismatch 5 10 10 8 2/NA/NA/NA 0/1/2 FAIL
  • Fix CK-015: Endpoint correctness
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-025 GD-025 EXIT_POLITE EXIT_POLITE match 6 10 7.5 7.78 2/NA/NA/NA 3/0/0 PASS
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-026 GD-026 HUMAN_HANDOFF HUMAN_HANDOFF match 5 10 10 8 2/NA/NA/NA 1/0/2 PASS
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-027 GD-027 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.5 10 10 9.4 2/2/NA/2 1/0/3 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-028 GD-028 COMPLETE_SR COMPLETE_SR match 7.33 8.33 10 8.35 2/1/NA/2 2/0/2 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-029 GD-029 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.17 10 10 9.27 2/NA/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-030 GD-030 COMPLETE_SR COMPLETE_SR match 7.33 8.33 10 8.35 2/1/NA/2 2/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-031 GD-031 EXIT_POLITE EXIT_POLITE match 4 10 10 7.6 2/NA/NA/NA 2/0/1 PASS
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-032 GD-032 COMPLETE_SR COMPLETE_SR match 7.29 8.75 10 8.48 2/2/1/2 2/0/2 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-033 GD-033 COMPLETE_SR COMPLETE_SR match 7.33 8.33 10 8.35 2/1/NA/2 2/0/2 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-034 GD-034 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8 10 10 9.2 2/2/NA/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-035 GD-035 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.44 10 10 9.38 2/2/NA/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-036 GD-036 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.89 10 10 9.16 2/2/NA/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-037 GD-037 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.09 8.75 10 8.8 2/2/1/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-038 GD-038 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.44 10 10 9.38 2/2/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-039 GD-039 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.22 10 10 9.29 2/2/NA/2 1/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-040 GD-040 HUMAN_HANDOFF HUMAN_HANDOFF match 5 10 10 8 2/NA/NA/NA 2/0/0 PASS
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Awsaf TR-041 GD-041 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.11 10 10 9.24 2/2/NA/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-042 GD-042 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.67 10 10 9.47 2/2/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-043 GD-043 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.1 8.75 7.5 8.18 2/2/1/2 1/1/0 PASS
  • Fix CK-026: Adaptive clarification by maturity
  • Match clarification depth to SR maturity (high maturity = fewer rounds).
Awsaf TR-044 GD-044 EDUCATE_AND_NURTURE QUALIFY_AND_ADVANCE mismatch 7.13 7.5 10 7.98 2/NA/NA/1 1/1/0 FAIL
  • Fix CK-015: Endpoint correctness
Awsaf TR-045 GD-045 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.38 10 8 8.85 2/2/NA/2 1/1/0 PASS
  • Fix CK-026: Adaptive clarification by maturity
  • Match clarification depth to SR maturity (high maturity = fewer rounds).
Awsaf TR-046 GD-046 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.73 10 8 8.59 2/2/NA/2 1/1/0 PASS
  • Fix CK-026: Adaptive clarification by maturity
  • Match clarification depth to SR maturity (high maturity = fewer rounds).
Awsaf TR-047 GD-047 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.62 8.75 7.5 7.99 2/2/1/2 1/1/0 PASS
  • Fix CK-026: Adaptive clarification by maturity
  • Match clarification depth to SR maturity (high maturity = fewer rounds).
Awsaf TR-048 GD-048 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.67 10 10 9.47 2/2/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-049 GD-049 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.89 10 10 9.16 2/2/NA/2 1/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-050 GD-050 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.45 8.75 10 8.94 2/2/1/2 1/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-051 GD-051 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.33 10 10 9.33 2/2/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-052 GD-052 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.33 10 10 9.33 2/2/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-053 GD-053 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.09 8.75 10 8.8 2/2/1/2 1/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-054 GD-054 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8 10 10 9.2 2/2/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-055 GD-055 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.13 10 10 9.25 2/2/2/2 2/1/0 PASS
  • Fix CK-019: Correction re-analysis
Awsaf TR-056 GD-056 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.38 10 10 9.35 2/2/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-057 GD-057 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.77 10 10 9.11 2/2/2/2 1/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-058 GD-058 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.9 10 10 9.16 2/2/NA/2 3/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-059 GD-059 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.7 10 10 9.08 2/2/NA/2 1/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-060 GD-060 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.38 10 10 9.35 2/2/NA/2 2/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-061 GD-061 COMPLETE_SR COMPLETE_SR match 7.21 10 10 8.88 2/2/NA/2 2/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-062 GD-062 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.38 10 10 9.35 2/2/NA/2 3/1/0 PASS
  • Fix CK-019: Correction re-analysis
Awsaf TR-063 GD-063 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 7.91 8.75 10 8.73 2/2/1/2 1/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-064 GD-064 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.38 10 10 9.35 2/2/NA/2 1/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Awsaf TR-065 GD-065 QUALIFY_AND_ADVANCE QUALIFY_AND_ADVANCE match 8.29 10 10 9.32 2/2/NA/2 1/0/0 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Eric v7-jesus GD-017 COMPLETE_SR QUALIFY_AND_ADVANCE match 6.67 0 10 5.17 0/NA/NA/NA 4/0/1 FAIL
  • Align tool choreography to expected sequence; avoid missing or misordered calls.
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Eric v7-syed GD-005 EXIT_POLITE QUALIFY_AND_ADVANCE match 5 0 10 4.5 0/NA/NA/NA 2/0/0 FAIL
  • Align tool choreography to expected sequence; avoid missing or misordered calls.
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Eric v7-battery GD-016 EXIT_POLITE QUALIFY_AND_ADVANCE mismatch 4.5 0 10 4.3 0/NA/NA/NA 1/1/0 FAIL
  • Align tool choreography to expected sequence; avoid missing or misordered calls.
  • Fix CK-015: Endpoint correctness
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Eric v7-candle GD-024 EXIT_POLITE EDUCATE_AND_NURTURE mismatch 5.5 10 10 8.2 2/NA/NA/NA 0/1/2 FAIL
  • Fix CK-015: Endpoint correctness
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.
Eric v7-anthony GD-023 EXIT_POLITE EXIT_POLITE match 7 10 8.33 8.38 2/NA/NA/NA 3/0/1 PASS
  • Maintain current behavior; this conversation is aligned with the revised framework.
Eric v7-jammaica GD-007 HUMAN_HANDOFF HUMAN_HANDOFF match 6 10 10 8.4 2/NA/NA/NA 2/0/0 PASS
  • Tighten turn quality: keep messages concise, one question max, and include clear next-step CTA.

Required Check Performance

Check | Definition | Awsaf (P/F/NA) | Awsaf Pass% | Eric (P/F/NA) | Eric Pass%
CK-005 | Value-first trust handling | 8/1/11 | 88.89% | 0/0/1 | -
CK-006 | Restricted hold absolute | 1/0/0 | 100% | 1/0/0 | 100%
CK-007 | Branded/IP hold | 1/0/1 | 100% | 1/0/0 | 100%
CK-008 | Budget math honesty | 4/0/13 | 100% | 2/0/1 | 100%
CK-009 | Currency disambiguation | 1/0/0 | 100% | 1/0/0 | 100%
CK-010 | Call handoff behavior | 2/0/2 | 100% | 1/0/0 | 100%
CK-015 | Endpoint correctness | 56/8/0 | 87.5% | 4/2/0 | 66.67%
CK-018 | No fabricated capabilities | 5/0/0 | 100% | 1/0/0 | 100%
CK-019 | Correction re-analysis | 0/2/1 | 0% | 0/0/0 | -
CK-020 | Low-intent handling | 1/1/5 | 50% | 0/0/2 | -
CK-021 | Low-intent dropoff handling | 2/0/0 | 100% | 0/0/0 | -
CK-022 | Scope-jump stabilization | 0/0/1 | - | 0/0/0 | -
CK-023 | Early-stage business acceptance | 0/0/1 | - | 0/0/0 | -
CK-024 | Hostile-safe exit | 1/0/0 | 100% | 0/0/0 | -
CK-025 | No ghost tool calls after contact submit | 10/0/0 | 100% | 1/0/0 | 100%
CK-026 | Adaptive clarification by maturity | 1/4/2 | 20% | 0/0/0 | -
CK-027 | Spec correction recovery | 3/0/0 | 100% | 0/0/0 | -
CK-028 | Mid-flow objection without context loss | 3/0/0 | 100% | 0/0/0 | -
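
The Pass% columns appear to be computed as P / (P + F), with NA results excluded and a dash shown when no conversation exercised the check. That reading is inferred from the numbers in the table, not from a stated definition. A small sketch, with hypothetical function names:

```python
from typing import Optional


def pass_rate(passed: int, failed: int) -> Optional[float]:
    """Percentage of applicable (non-NA) results that passed, or None if none apply."""
    applicable = passed + failed
    if applicable == 0:
        return None
    return round(100 * passed / applicable, 2)


def format_pass_rate(rate: Optional[float]) -> str:
    """Render a rate the way the table does: '88.89%', '100%', or '-'."""
    return "-" if rate is None else f"{rate:g}%"
```

Under this reading, CK-005 for Awsaf (8 passed, 1 failed, 11 NA) comes out at 88.89%, and a check with only NA results, such as CK-022, renders as a dash.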

Judge Notes

All case-level improvement suggestions above are generated per conversation from failed checks and low dimension averages.

Eric-to-GD Mapping Used by Shared Judge

Eric File | Mapped GD Case
v7-jesus.jsonl | GD-017
v7-syed.jsonl | GD-005
v7-battery.jsonl | GD-016
v7-candle.jsonl | GD-024
v7-anthony.jsonl | GD-023
v7-jammaica.jsonl | GD-007

Eric Historical Pass/Fail (Reference)

Historical run headlines
  • v7: 8/8 PASS. Overall avg: 8.8/10.
  • v6b: 3/5 PASS, 2/5 FAIL. Average across all 15 turns: 6.5/10.
Source files
  • tests/run-v7/run_summary_v7.md
  • tests/run-v6/run_summary_v6.md
Version | Personas | Pass Rate | Avg Lines | Key Change
v1 | 6 | 4/6 (67%) | 12-18 | Baseline
v2 | 8 | 7/8 (88%) | 10-15 | Call handoff, budget math
v3 | 8 | 8/8 (100%) | 8-12 | WHY enforcement, restricted products
v3-test | 6 (new) | 6/6 (100%) | 8-12 | Generalization confirmed
v4 | 14 | Issues found | 6-10 | Strict persona testing exposed gaps
v5 | 4 (fixes) | 4/4 fixes | 6-10 | Absolute restrictions, estimate budgets, exits
v6b | 5 | 3/5 (60%) | 2-5 | Per-turn priority, hard cap (stricter rubric)
v7 | 8 | 8/8 (100%) | 1-4 | Prices-first, one-liner rules, examples

Proposed GD Run Stack

run-gd-v1-core

Core release gate set (locked cases only).

Rule: review_status = locked

Count: 11

run-gd-v1-generalization

Generalization check on reviewed but not locked cases.

Rule: review_status = reviewed

Count: 40

run-gd-v1-draft-probe

Draft-case probe to find schema/scenario gaps before locking.

Rule: review_status = draft

Count: 14

run-gd-v1-stress

Adversarial mismatch/spam/resistance stress test.

Rule: B5 + B6

Count: 4
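
The four run rules above amount to simple filters over the golden case list. The sketch below shows one way to express them, assuming each case is a dict with an `id` and a `review_status`. The `batch` field holding the B5/B6 label is a hypothetical name, since the report does not say how that label is stored.

```python
from typing import Callable

Case = dict[str, str]

# Selection rules as stated in the run stack. "review_status" comes from the
# report; the "batch" field holding B5/B6 labels is a hypothetical name.
RUN_RULES: dict[str, Callable[[Case], bool]] = {
    "run-gd-v1-core": lambda c: c["review_status"] == "locked",
    "run-gd-v1-generalization": lambda c: c["review_status"] == "reviewed",
    "run-gd-v1-draft-probe": lambda c: c["review_status"] == "draft",
    "run-gd-v1-stress": lambda c: c.get("batch") in {"B5", "B6"},
}


def build_runs(cases: list[Case]) -> dict[str, list[str]]:
    """Map each run name to the IDs of the cases its rule selects."""
    return {
        name: [c["id"] for c in cases if rule(c)]
        for name, rule in RUN_RULES.items()
    }
```

The three status-based counts (11 + 40 + 14) already add up to the 65 golden cases, so the stress run presumably overlaps the other runs instead of adding new cases. The sketch allows that overlap because each rule is evaluated independently.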