Capturing meeting audio, separating speakers, processing with LLM, and rendering a transparent HUD.
Donna can build this entire application. The hardest part—speaker diarization (separating "Me" vs "Others")—can be solved with 100% accuracy using a hardware routing trick instead of complex AI models. By capturing the Microphone and a virtual audio cable (BlackHole) as two separate streams, we get perfect speaker separation natively. Donna can build the Electron overlay to mimic Textream, pipe the audio to Deepgram, and send the transcript to a local LLM.
Native macOS apps can hide their own windows from screen sharing by setting NSWindow.sharingType = .none, which is how tools like Textream stay invisible to viewers. For a personal transcriber, though, the core magic is perfect speaker separation without expensive diarization models, plus a HUD (Heads-Up Display) that doesn't obstruct work.
macOS intentionally prevents applications from recording system audio directly (e.g., you can't natively record what Zoom is outputting) without third-party kernel extensions or virtual drivers. Apple introduced ScreenCaptureKit in macOS 13, which allows capturing per-app audio, but it requires native Swift/Objective-C bindings that are complex to maintain from a web/Electron app.
Instead of sending one merged audio file to an AI and asking "Who said what?" (which is slow, error-prone, and expensive), practitioners use hardware routing:

- "Me": capture the physical microphone directly.
- "Others": route system output (Zoom, Meet, etc.) through a Multi-Output Device so it plays on the speakers and simultaneously feeds the BlackHole virtual input.

By capturing these as two independent, parallel streams in the app, we get 100% accurate, zero-latency speaker identification ("Me" vs "Others").
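The routing above can be sketched in the Electron renderer. A minimal, hedged example: the device-matching helper and label patterns below are our own assumptions (BlackHole typically advertises itself as "BlackHole 2ch"), and each stream is opened with a separate `getUserMedia` call.

```javascript
// Pure helper: given the output of navigator.mediaDevices.enumerateDevices(),
// find the physical mic ("Me") and the BlackHole loopback ("Others").
// The label regex is an assumption based on typical macOS device naming.
function pickDevices(devices) {
  const inputs = devices.filter((d) => d.kind === 'audioinput');
  const others = inputs.find((d) => /blackhole/i.test(d.label)) || null;
  // "Me" is any audio input that is not the virtual cable.
  const me = inputs.find((d) => d !== others) || null;
  return { me, others };
}

// In the renderer, each deviceId feeds an independent getUserMedia call,
// giving two parallel PCM streams with speaker identity known up front.
async function openStreams({ me, others }) {
  const byId = (id) =>
    navigator.mediaDevices.getUserMedia({
      audio: { deviceId: { exact: id }, echoCancellation: false },
    });
  return {
    meStream: await byId(me.deviceId),         // physical microphone
    othersStream: await byId(others.deviceId), // BlackHole loopback
  };
}
```

Because each stream is physically separate, no diarization model ever runs; the speaker label is known before transcription starts.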
Textream is written in Swift. To reproduce it quickly and cross-platform, Electron is the standard choice. An Electron BrowserWindow can be made transparent, frameless (frame: false), always on top (alwaysOnTop: true), and set to ignore mouse clicks (setIgnoreMouseEvents(true)) so it floats like a ghost over your screen.
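Those window flags can be sketched as follows. The config is extracted into a plain function (our own convention, so it can be inspected outside Electron); the commented usage block shows how it would be wired into main.js.

```javascript
// Window options for a Textream-style ghost HUD (Electron main process).
// Kept as a pure function so the config is testable without Electron.
function ghostWindowOptions(width, height) {
  return {
    width,
    height,
    transparent: true,   // no background; only the rendered text is visible
    frame: false,        // no title bar or borders
    alwaysOnTop: true,   // floats above every other app
    hasShadow: false,
    skipTaskbar: true,
    webPreferences: { nodeIntegration: false, contextIsolation: true },
  };
}

// In main.js (requires Electron at runtime):
//   const { app, BrowserWindow } = require('electron');
//   app.whenReady().then(() => {
//     const win = new BrowserWindow(ghostWindowOptions(900, 160));
//     win.setIgnoreMouseEvents(true); // clicks pass through the overlay
//     win.loadFile('index.html');
//   });
```

With setIgnoreMouseEvents(true), the overlay never steals focus, so you can keep typing in the app underneath it.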
For real-time streaming, Deepgram is the industry standard (<300ms latency, $0.0043/min). Once text is captured, it can be pushed to a local Ollama instance (free, private) or the Claude/OpenAI API every few minutes to generate live meeting notes or suggested replies.
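Deepgram's live API is a WebSocket endpoint with query-string parameters. A minimal sketch: the parameter set below is what we expect to need (verify against Deepgram's current docs), and `transcriptOf` assumes the documented streaming response shape (`channel.alternatives[0].transcript`).

```javascript
// Build the Deepgram live-streaming WebSocket URL. Parameter names follow
// Deepgram's listen API; the chosen values are our assumptions.
function deepgramUrl({ model = 'nova-2', sampleRate = 16000 } = {}) {
  const qs = new URLSearchParams({
    model,
    encoding: 'linear16',
    sample_rate: String(sampleRate),
    interim_results: 'true',
  });
  return `wss://api.deepgram.com/v1/listen?${qs}`;
}

// Pull the text out of one streaming result message
// (assumed shape: channel.alternatives[0].transcript).
function transcriptOf(msg) {
  return msg?.channel?.alternatives?.[0]?.transcript ?? '';
}
```

In practice we would open one WebSocket per stream (Mic and BlackHole), authenticated with an `Authorization: Token <key>` header, so every transcript fragment arrives pre-labeled "Me" or "Others" with no post-hoc diarization.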
Deliverables: main.js and index.html (Electron) that successfully create a Textream-like ghost window at the bottom/top of the screen. The one gap: Electron cannot set NSWindow.sharingType = .none without a native C++ addon (`node-mac-window-sharing`). The UI is indistinguishable from a native HUD, and we successfully proved the dual-stream concept. On top of Textream's features, we added a live local LLM summary loop using Ollama (qwen3.5:35b) running locally with zero cloud costs.
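The summary loop is simple to sketch. The endpoint and body shape follow Ollama's `/api/generate` API; the prompt wording and the `summaryPrompt` helper are our own illustrative choices.

```javascript
// Build a rolling-summary prompt from accumulated transcript lines.
// Line shape ({ speaker, text }) is our own convention.
function summaryPrompt(lines) {
  return [
    'Summarize this meeting so far as terse bullet points.',
    'Transcript:',
    ...lines.map((l) => `${l.speaker}: ${l.text}`),
  ].join('\n');
}

// Every few minutes, POST the prompt to the local Ollama server.
// Uses Ollama's /api/generate endpoint with stream disabled.
async function summarize(lines, model = 'qwen3.5:35b') {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt: summaryPrompt(lines), stream: false }),
  });
  const data = await res.json();
  return data.response; // Ollama returns the completion in `response`
}
```

Because everything stays on localhost, the transcript never leaves the machine unless we explicitly swap in a cloud LLM.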
| Prerequisite | Who Needs It | Why / Status | Effort |
|---|---|---|---|
| BlackHole (2ch) | Eric | Required to capture "Others" audio on macOS. `brew install blackhole-2ch`. | 5 mins |
| Audio MIDI Setup | Eric | Create a Multi-Output Device mapping System Speakers + BlackHole. | 5 mins |
| Deepgram API Key | Shared | Required for sub-second, highly accurate streaming transcription. | 2 mins |
| Electron App Build | Donna | Donna needs to write the Node/React code to stitch the APIs together. | 1 day |
Next deliverable: the mufu-transcriber Electron app. It will have two drop-downs: one to select the Mic ("Me"), one to select BlackHole ("Others"). It will connect to Deepgram via WebSockets and render the HUD.
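The HUD has to merge the two labeled transcript feeds into one timeline. A minimal sketch, assuming each feed produces events with an arrival timestamp (the `{ t, text }` event shape and function name are our own convention):

```javascript
// Merge the "Me" and "Others" transcript events into one HUD timeline,
// ordered by arrival time. Events are tagged with their source stream,
// so speaker labels come from routing, not from a diarization model.
function mergeTimelines(meEvents, othersEvents) {
  const tagged = [
    ...meEvents.map((e) => ({ ...e, speaker: 'Me' })),
    ...othersEvents.map((e) => ({ ...e, speaker: 'Others' })),
  ];
  return tagged
    .sort((a, b) => a.t - b.t)
    .map((e) => `${e.speaker}: ${e.text}`);
}
```

The renderer can then append each line to the ghost window as it arrives, keeping only the last few lines visible, teleprompter-style.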
`brew install blackhole-2ch`

Is this worth building? Yes. Meeting transcription apps (Otter, Fireflies) cost $20-30/mo and join calls as awkward bots. Building an invisible, local HUD that leverages perfect dual-stream diarization and a $0.0043/min API (Deepgram) or local Whisper is a massive unlock for AI-augmented business ops. It gives Eric a private, real-time teleprompter that can feed context to an LLM without the other party knowing.
The "Already Know It" Trap: We conceptually know how transcription works, but the mechanical reality of macOS audio routing is the true bottleneck. Solving the BlackHole routing is what makes this app viable. The UI is trivial; the data pipeline is the product.
Reference: Electron `BrowserWindow` documentation on transparent and frameless windows.