Hopeman — Cantonese ASR & transcription landscape

1. Production pipeline

Default path when FIREWORKS_API_KEY is set — `tools/transcribe-segmented.js`

Input: MP3/M4A (5 min segments)
→ ASR: Fireworks whisper-v3, language=yue
→ QC: repeated n-gram hallucination detection → retry with alternate preprocessing
→ Output: stitched [MM:SS] markdown
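The QC step can be illustrated with a minimal sketch. This is not the code in `tools/transcribe-segmented.js`; the function names, the n-gram size (4 characters) and the repeat threshold (3) are assumptions chosen for illustration, and `transcribe` and `preprocessors` are hypothetical callbacks supplied by the caller.

```javascript
// Hypothetical helper: flag a transcript when any character n-gram repeats
// back-to-back more than maxRepeats times. Defaults are illustrative guesses.
function hasRepeatedNgram(text, n = 4, maxRepeats = 3) {
  // Split by code point so rare Cantonese characters outside the BMP stay whole.
  const chars = Array.from(text.replace(/\s+/g, ""));
  for (let i = 0; i + n <= chars.length; i++) {
    const gram = chars.slice(i, i + n).join("");
    let repeats = 1;
    let j = i + n;
    // Count immediately consecutive copies of the same n-gram.
    while (j + n <= chars.length && chars.slice(j, j + n).join("") === gram) {
      repeats++;
      j += n;
    }
    if (repeats > maxRepeats) return true;
  }
  return false;
}

// Hypothetical retry loop: try each preprocessing variant in order and keep
// the first transcript that passes QC. If none pass, return the last one flagged.
async function transcribeWithQc(segment, transcribe, preprocessors) {
  let last = "";
  for (const preprocess of preprocessors) {
    last = await transcribe(await preprocess(segment));
    if (!hasRepeatedNgram(last)) return { text: last, flagged: false };
  }
  return { text: last, flagged: true };
}
```

Checking only back-to-back repeats keeps the detector from flagging phrases that legitimately recur at different points in a meeting; a real threshold would need tuning on actual segments.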

Fallback path

If FIREWORKS_API_KEY is not set: AssemblyAI first pass, then flagged chunks are retried with local Whisper (compat). Env chain: workspace .env or mufu root.
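A rough sketch of how the default and fallback paths could be selected from the environment. `choosePipeline` and the shape of the object it returns are hypothetical; only the `FIREWORKS_API_KEY` variable, the model name and the language code come from this page.

```javascript
// Hypothetical selector mirroring the default and fallback paths described above.
function choosePipeline(env) {
  // Treat a missing or whitespace-only key as "not set".
  const key = (env.FIREWORKS_API_KEY || "").trim();
  if (key) {
    return { asr: "fireworks", model: "whisper-v3", language: "yue" };
  }
  // Fallback: AssemblyAI first pass, flagged chunks retried with local Whisper.
  return { asr: "assemblyai", retryFlaggedWith: "local-whisper" };
}
```

In a Node script this would typically be called as `choosePipeline(process.env)` after the .env file has been loaded.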

Meeting workflow

Transcript → manual meeting note under context/meetings/. Command reference: .cursor/commands/transcribe.md. WhatsApp voice files may be mislabeled .jpeg — rename to real media type first.
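For the mislabeled-file case, the real media type can often be guessed from the first bytes of the file before renaming. The sketch below is an assumption-laden illustration, not part of the documented command: `sniffMediaType` is a hypothetical helper, it uses Node's `Buffer`, and it only recognises a few common signatures.

```javascript
// Hypothetical helper: guess the real container from a file's leading bytes.
function sniffMediaType(buf) {
  if (buf.length < 12) return "unknown";
  const ascii = (start, end) => buf.toString("latin1", start, end);
  // Genuine JPEG files start with FF D8 FF.
  if (buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) return "jpeg";
  if (ascii(0, 4) === "OggS") return "ogg"; // Ogg container (e.g. Opus, Vorbis)
  if (ascii(4, 8) === "ftyp") return "mp4"; // MP4 family, including M4A
  if (ascii(0, 3) === "ID3") return "mp3"; // MP3 with an ID3v2 tag
  // MPEG audio frame sync; heuristic only, ADTS AAC streams also match.
  if (buf[0] === 0xff && (buf[1] & 0xe0) === 0xe0) return "mpeg-audio";
  return "unknown";
}
```

A typical use would be to read the first 12 bytes with `fs`, call `sniffMediaType`, and rename only when the result is not `"jpeg"`.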

2. Proxy bake-off

`workdir/asr-bakeoff/`: run on real Hopeman audio; results are proxy metrics, not true CER.

Proxy metrics: segmentation stability, silence-trim sensitivity, manual plausibility read.

1. Whisper large-v3 + silence trim: best segmentation stability; use as draft, not final truth.

2. Whisper large-v2 (zh) on dense speech

3. AssemblyAI raw full-clip baseline

4. AssemblyAI on silence-heavy clips (trimmed)

Dense full vs split: 3.8% (v3) · 12.4% (v2) · 24.0% (AssemblyAI)

Silence raw vs trim: 4.4% (v3) · 18.5% (AssemblyAI)

Audio fixture

Source: Amap meeting MP3 (Mar 2026). Clips: 60s dense speech, 30s halves, 60s silence-heavy + trimmed variant. See reports/proxy-eval.md.

3. Gaps & next steps

Staged pipeline eval — MEMORY.md

Not proven yet

No human-verified ground truth → no true CER on Hopeman recordings.
Ranking may not hold across more than one full recording.
Diarization not scored; post-ASR refinement not benchmarked as a first-class stage.

Hardening moves

Run same pass on Citibank meeting MP3 (Mar 30).

Verify one 2–3 min clip manually.

Score true CER against that verified clip.

Evaluate layers separately: preprocess → chunking → ASR → post cleanup.
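Once a clip has been verified by hand, true CER is the character-level edit distance between the verified reference and the ASR output, divided by the reference length. A minimal sketch, assuming both transcripts have already been normalised the same way (punctuation, whitespace, script variants); the function name `cer` is illustrative and no normalisation is done here.

```javascript
// Character error rate: Levenshtein distance over characters / reference length.
function cer(reference, hypothesis) {
  // Split by code point so characters outside the BMP count as one character.
  const ref = Array.from(reference);
  const hyp = Array.from(hypothesis);
  if (ref.length === 0) throw new Error("reference must not be empty");

  // prev[j] = edit distance between the first i-1 reference chars and first j hypothesis chars.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution or match
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Because insertions are counted, CER can exceed 1.0 when the hypothesis is much longer than the reference, which is exactly what a looping hallucination tends to produce.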

Related: workdir/cantonese-asr-study.md, workdir/asr-validation/, output/asr-validation-review.html