In Development

Consensus

A transcription app that checks its own work.

The Problem

This started with a real need. I had a meeting I'd recorded and wanted to transcribe. Two problems: I didn't want to upload the audio to a cloud service for data security reasons, and I didn't want to deal with creating accounts or paying for something I'd use occasionally. So I decided to build my own -- and use it as a chance to learn about running local machine learning models.

Once I started, I realized the deeper problem: every transcription app runs one engine, one pass, and gives you one result. If it misheard something, you have no way of knowing unless you listen to the entire recording yourself. That defeats the purpose.

Professional transcription services solve this by having two independent transcribers work the same recording, then a third person reconciles the differences. The disagreements are exactly the spots where the audio is ambiguous -- the places a human should actually listen to. No consumer software does this.

Deep Review

Consensus runs two different speech-to-text engines on the same audio and compares the results. Where they agree, you can trust the transcript. Where they disagree, the app flags those spans and lets you resolve them inline -- select the better version, accept, and move on.
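The comparison step above can be sketched as a word-level alignment: run an edit-distance alignment over the two engines' word lists and group consecutive mismatches into disagreement spans. This is an illustrative sketch, not the app's actual code; the `Disagreement` type and function names are invented for the example.

```swift
// Sketch: align two engine transcripts word by word and collect
// disagreement spans. Names here are illustrative, not the app's API.
struct Disagreement {
    let engineA: [String]
    let engineB: [String]
}

func disagreements(_ a: [String], _ b: [String]) -> [Disagreement] {
    // Standard Levenshtein DP table, computed over words rather than characters.
    let m = a.count, n = b.count
    var dp = Array(repeating: Array(repeating: 0, count: n + 1), count: m + 1)
    for i in 0...m { dp[i][0] = i }
    for j in 0...n { dp[0][j] = j }
    if m > 0 && n > 0 {
        for i in 1...m {
            for j in 1...n {
                let cost = a[i - 1] == b[j - 1] ? 0 : 1
                dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
            }
        }
    }
    // Backtrack, grouping consecutive non-matching steps into one span.
    var i = m, j = n
    var spans: [Disagreement] = []
    var curA: [String] = [], curB: [String] = []
    func flush() {
        if !curA.isEmpty || !curB.isEmpty {
            spans.append(Disagreement(engineA: Array(curA.reversed()),
                                      engineB: Array(curB.reversed())))
            curA = []; curB = []
        }
    }
    while i > 0 || j > 0 {
        if i > 0, j > 0, a[i - 1] == b[j - 1], dp[i][j] == dp[i - 1][j - 1] {
            flush(); i -= 1; j -= 1            // engines agree on this word
        } else if i > 0, j > 0, dp[i][j] == dp[i - 1][j - 1] + 1 {
            curA.append(a[i - 1]); curB.append(b[j - 1]); i -= 1; j -= 1
        } else if i > 0, dp[i][j] == dp[i - 1][j] + 1 {
            curA.append(a[i - 1]); i -= 1      // word only engine A heard
        } else {
            curB.append(b[j - 1]); j -= 1      // word only engine B heard
        }
    }
    flush()
    return Array(spans.reversed())
}
```

Each span pairs what engine A heard against what engine B heard, which is exactly the unit the reconciliation UI needs to present.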

You end up with a verified transcript built from multiple independent readings of the same audio -- something no other consumer transcription app offers. Otter, Rev, Descript, Trint, MacWhisper, Sonix: all single-engine, single-pass.

What's Inside

Fully Local

Nothing leaves your machine. No cloud services, no API keys, no accounts. Audio stays on your Mac and never touches a server.

Speaker Identification

Automatic speaker diarization clusters voices so you know who said what. Rename speakers after transcription to match real names.

Quality Metrics

Word confidence, segment confidence, diarization quality, compression ratio. Automatic risk flagging highlights the spots most likely to contain errors.
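One plausible shape for the risk flagging is a small heuristic that folds the per-segment metrics into a flag. The thresholds and type names below are assumptions for illustration, not the values Consensus actually uses.

```swift
// Illustrative only: one way to fold per-segment quality metrics
// into a risk flag. Thresholds and names are assumptions.
struct SegmentMetrics {
    let avgWordConfidence: Double   // 0...1, averaged over the segment's words
    let compressionRatio: Double    // Whisper-style repetition/hallucination signal
    let diarizationScore: Double    // 0...1, speaker-clustering quality
}

enum Risk { case low, medium, high }

func riskFlag(_ m: SegmentMetrics) -> Risk {
    var strikes = 0
    if m.avgWordConfidence < 0.6 { strikes += 1 }
    if m.compressionRatio > 2.4 { strikes += 1 }   // high ratio often means looping output
    if m.diarizationScore < 0.5 { strikes += 1 }
    switch strikes {
    case 0:  return .low
    case 1:  return .medium
    default: return .high
    }
}
```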

7 Export Formats

Plain text, Markdown, JSON, SRT subtitles, RTF, Word, and a legal PDF in court reporter format: 25 lines per page, Courier 12pt, line numbers.
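The court-reporter rule above (25 numbered lines per page) boils down to a pagination pass before any PDF drawing happens. A minimal sketch, with the function name invented and the Courier/PDF rendering omitted:

```swift
import Foundation

// Sketch of the court-reporter pagination rule: 25 numbered lines
// per page. Rendering (Courier 12pt, PDF drawing) is omitted; this
// only shows the line-numbering pass. Names are illustrative.
func paginate(_ lines: [String], linesPerPage: Int = 25) -> [[String]] {
    var pages: [[String]] = []
    var index = 0
    while index < lines.count {
        let end = min(index + linesPerPage, lines.count)
        // Line numbers restart at 1 on each page, right-aligned in two columns.
        let page = (index..<end).map { i in
            String(format: "%2d  %@", (i % linesPerPage) + 1, lines[i])
        }
        pages.append(page)
        index = end
    }
    return pages
}
```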

The Process

Consensus started as a Python prototype with a Gradio web UI. It worked, but it had all the friction of a Python ML project: heavyweight dependencies, environment management, and a web interface that felt wrong on a Mac. The native Swift rebuild was about making it something I'd actually use day to day.

This project has been more challenging than I expected. Single-pass transcription works well, but the multi-engine pipeline -- figuring out the right workflow, the comparison algorithm, the user interface for resolving disagreements -- has required significant iteration. Speaker diarization has been the toughest piece: getting consistent, accurate speaker labels from local models is an active area of research across the field, and my experience reflects that.

The project has also involved more research than anything else I've built. Evaluating transcription engines, studying confidence-weighted word alignment (ROVER), testing multiple diarization approaches at different thresholds, reading papers on error rates, tracking new model releases. The local ML landscape is moving fast, and keeping up with it has been part of the work.
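For context, the voting half of ROVER is simple to state: once the hypotheses are aligned into slots, each slot scores its candidate words by a blend of vote count and confidence and keeps the winner. The sketch below follows that idea; the `alpha` weighting and names are from the general ROVER formulation, not from Consensus itself.

```swift
// Rough sketch of the ROVER voting step. Each slot holds one candidate
// word per engine (with its confidence); the winner maximizes
// alpha * voteFraction + (1 - alpha) * meanConfidence.
typealias Candidate = (word: String, confidence: Double)

func vote(slot: [Candidate], alpha: Double = 0.5) -> String {
    var tally: [String: (count: Int, conf: Double)] = [:]
    for c in slot {
        let prev = tally[c.word] ?? (0, 0)
        tally[c.word] = (prev.count + 1, prev.conf + c.confidence)
    }
    let n = Double(slot.count)
    let best = tally.max { lhs, rhs in
        let scoreL = alpha * Double(lhs.value.count) / n
                   + (1 - alpha) * lhs.value.conf / Double(lhs.value.count)
        let scoreR = alpha * Double(rhs.value.count) / n
                   + (1 - alpha) * rhs.value.conf / Double(rhs.value.count)
        return scoreL < scoreR
    }
    return best?.key ?? ""
}
```

With two engines, vote counts tie on every disagreement, so the confidence term decides -- which is why confidence-weighted alignment matters more here than in the many-engine setting ROVER was designed for.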

It's improving with each iteration. The pipeline has been overhauled twice, the reconciliation interface redesigned from a 300-row grid into an inline flag system, and the diarization post-processing now runs three correction passes. Getting it right is taking time, but the bones are solid and the single-pass mode already does what I originally needed.

Built With

Swift 6 with strict concurrency. SwiftUI on macOS 15+. WhisperKit for primary transcription (5 model sizes from 75MB to 3GB). FluidAudio's Parakeet for the second engine. SpeakerKit and FluidAudio for diarization. ZIPFoundation for Word document generation. All dependencies via Swift Package Manager. Roughly 7,400 lines across 39 source files.