Feature Complete

Consensus

A privacy-first macOS transcription app that runs two engines on the same audio and flags exactly the spots where they disagree.

The Use Case

Litigators record things constantly: unofficial copies of depositions, witness interviews, client calls, internal strategy sessions, motion-hearing audio. Most of those recordings are confidential, and some are privileged. Uploading them to a cloud transcription service to get a rough transcript before the certified version arrives is a privacy posture that does not survive contact with a thoughtful client.

So the audio either gets typed up by a paralegal or it doesn't get transcribed at all. Both are bad answers. Consensus is the third one — a fully local Mac app that runs everything on your own machine and exports a transcript in court-reporter PDF format on the way out.

The Problem With Single-Pass Transcription

Every consumer transcription app runs one engine, one pass, and gives you one result. If it misheard something, you have no way of knowing unless you listen to the entire recording yourself. That defeats the purpose.

Professional court reporting solves this by having two independent transcribers work the same recording and a third person reconcile the differences. The disagreements are exactly the spots where the audio is ambiguous — the places a human should actually listen to. No consumer software does this.

Consensus runs two different speech-to-text engines on the same audio and compares the results. Where they agree, you can trust the transcript. Where they disagree, the app flags those spans and lets you resolve them inline — select the better version, accept, move on. You end up with a verified transcript built from multiple independent readings of the same audio. Otter, Rev, Descript, Trint, MacWhisper, Sonix — all single-engine, single-pass.
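The comparison step can be illustrated with a minimal sketch. It is written in Python only because the original prototype was; it assumes each engine's output has already been reduced to plain text, and it uses the standard library's difflib for alignment rather than whatever Consensus actually does. The names `normalize` and `flag_disagreements` are hypothetical.

```python
import difflib
import re


def normalize(word: str) -> str:
    """Lowercase and strip punctuation so 'Okay,' and 'okay' compare equal."""
    return re.sub(r"[^\w']", "", word.lower())


def flag_disagreements(text_a: str, text_b: str) -> list[dict]:
    """Return the spans where two transcripts differ, as word-index ranges."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(
        a=[normalize(w) for w in words_a],
        b=[normalize(w) for w in words_b],
        autojunk=False,  # keep frequent words such as "the" in the alignment
    )
    flags = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            continue  # both engines agree on this run of words
        flags.append({
            "a_range": (a0, a1),
            "b_range": (b0, b1),
            "engine_a": " ".join(words_a[a0:a1]),
            "engine_b": " ".join(words_b[b0:b1]),
        })
    return flags


# Two hypothetical engine outputs for the same audio.
flags = flag_disagreements(
    "The witness arrived at nine fifteen.",
    "the witness arrived at nine fifty",
)
```

Here the only flagged span is the last word, where one reading is "fifteen." and the other is "fifty". Differences in case and punctuation are ignored, so a reviewer only sees the spots where the words themselves differ.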

What's Inside

Fully Local

Nothing leaves your machine. No cloud services, no API keys, no accounts. Audio stays on your Mac and never touches a server.

Speaker Identification

Automatic speaker diarization clusters voices so you know who said what. Rename speakers after transcription to match real names.

Quality Metrics

Word confidence, segment confidence, diarization quality, compression ratio. Automatic risk flagging highlights the spots most likely to contain errors.
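As a rough illustration of how two of these metrics could feed risk flagging, here is a small Python sketch. The `Segment` type, the `risk_flags` function, and the 0.6 confidence cutoff are assumptions made for the example. The 2.4 compression-ratio cutoff mirrors the default in OpenAI's reference Whisper implementation; the thresholds Consensus uses are not stated here.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    word_confidences: list[float]  # one value in [0, 1] per word (assumed scale)


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length.

    Highly repetitive text compresses well, so a high ratio often
    signals a transcription loop ("thank you thank you thank you ...").
    """
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def risk_flags(seg: Segment, min_conf: float = 0.6, max_ratio: float = 2.4) -> list[str]:
    """Return human-readable reasons a segment deserves a second listen."""
    reasons = []
    if seg.word_confidences:
        mean_conf = sum(seg.word_confidences) / len(seg.word_confidences)
        if mean_conf < min_conf:
            reasons.append("low confidence")
    if compression_ratio(seg.text) > max_ratio:
        reasons.append("repetitive text")
    return reasons


clean = Segment("The witness arrived at nine fifteen.", [0.95, 0.9, 0.92, 0.97, 0.9, 0.88])
looped = Segment("thank you " * 40, [0.9] * 80)
mumbled = Segment("I think it was the uh", [0.3, 0.4, 0.5, 0.4, 0.3, 0.2])
```

In this sketch `clean` produces no flags, `looped` is flagged as repetitive, and `mumbled` is flagged for low confidence.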

7 Export Formats

Plain text, Markdown, JSON, SRT subtitles, RTF, Word, and a legal PDF in court reporter format: 25 lines per page, Courier 12pt, line numbers.
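The page layout behind the legal format can be sketched in a few lines of Python. This only shows the 25-lines-per-page numbering and produces plain text, not a PDF; the `paginate` function and the 56-character wrap width are assumptions for illustration.

```python
import textwrap

LINES_PER_PAGE = 25  # from the format described above


def paginate(transcript: str, width: int = 56) -> list[list[str]]:
    """Wrap the transcript and return pages of numbered lines.

    Numbering restarts at 1 on every page, as in a court reporter's
    transcript. The wrap width is an assumed value, not a measured one.
    """
    wrapped = []
    for paragraph in transcript.splitlines():
        # textwrap.wrap returns [] for an empty line; keep it as a blank row.
        wrapped.extend(textwrap.wrap(paragraph, width=width) or [""])

    pages = []
    for start in range(0, len(wrapped), LINES_PER_PAGE):
        chunk = wrapped[start:start + LINES_PER_PAGE]
        pages.append([f"{n:>2}  {line}" for n, line in enumerate(chunk, start=1)])
    return pages


# Thirty short lines spill onto a second page.
sample = "\n".join(f"Q. Question number {i}?" for i in range(30))
pages = paginate(sample)
```

A real exporter would also draw margins, a page header, and speaker labels, but the numbering logic stays the same.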

The Process

Consensus started as a Python prototype with a Gradio web UI. It worked, but it had all the friction of a Python ML project: heavyweight dependencies, environment management, and a web interface that felt wrong on a Mac. The native Swift rebuild was about making it something I'd actually use day to day.

This project has been more challenging than I expected. Single-pass transcription works well, but the multi-engine pipeline (the right workflow, the comparison algorithm, the user interface for resolving disagreements) has required significant iteration. Speaker diarization has been the toughest piece: getting consistent, accurate speaker labels from local models is still an active area of research, and my experience reflects that.

The project has also involved more research than anything else I've built: evaluating transcription engines, studying confidence-weighted word alignment (ROVER), testing multiple diarization approaches at different thresholds, reading papers on error rates, and tracking new model releases. The local ML landscape is moving fast, and keeping up with it has been part of the work.
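For readers unfamiliar with ROVER, the voting idea can be sketched briefly in Python. This assumes the engines' outputs have already been aligned into word slots. It scores each candidate as alpha times its vote share plus (1 - alpha) times its confidence, which is the scoring rule as ROVER is commonly described. The `vote` function, the use of the maximum confidence per word, and the sample values are illustrative assumptions, not the app's implementation.

```python
from collections import defaultdict


def vote(slot: list[tuple[str, float]], alpha: float = 0.5) -> str:
    """Pick one word for an aligned slot.

    slot holds one (word, confidence) pair per engine, with confidence
    in [0, 1]. alpha balances how often a word was proposed against how
    confident the engines were in it.
    """
    n = len(slot)
    counts = defaultdict(int)
    best_conf = defaultdict(float)
    for word, conf in slot:
        counts[word] += 1
        best_conf[word] = max(best_conf[word], conf)

    def score(word: str) -> float:
        return alpha * counts[word] / n + (1 - alpha) * best_conf[word]

    return max(counts, key=score)


# With two engines that disagree, each word has one vote,
# so the more confident reading wins.
choice = vote([("fifteen", 0.62), ("fifty", 0.91)])
```

With only two engines the vote share is always a tie when they disagree, so the confidence term decides the outcome.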

It's improving with each iteration. The pipeline has been overhauled twice, the reconciliation interface has been redesigned from a 300-row grid into an inline flag system, and the diarization post-processing now runs three correction passes. Getting it right is taking time, but the bones are solid and the single-pass mode already does what I originally needed.

Built With

Swift 6 with strict concurrency. SwiftUI on macOS 15+. WhisperKit for primary transcription (5 model sizes from 75 MB to 3 GB). FluidAudio's Parakeet for the second engine. SpeakerKit and FluidAudio for diarization. ZIPFoundation for Word document generation. All dependencies via Swift Package Manager. Roughly 7,400 lines across 39 source files.