Guide · 14 min read · Updated May 2026

Local Whisper on Mac — the privacy-first dictation guide

Audio is the most privacy-sensitive content surface most of us produce. If you're going to dictate at volume, you probably want the model running on your machine, not on someone else's. Here's the honest map of running Whisper locally on Apple Silicon in 2026 — what works, what it costs in latency, and which tool fits which workflow.

Why local Whisper actually matters in 2026

The privacy argument for local transcription is the obvious one, but it's worth being precise about what's at stake. There are three real reasons people end up here, and only one of them is what most marketing pages talk about.

1. Audio is the most privacy-sensitive content surface you produce. A typed message contains the words you decided to send. A voice recording contains your voice — biometric data, with all the implications of that — plus background audio you didn't intend to capture (a colleague's call across the room, your kid in the next room, a TV in the background). It also captures the half-formed monologue you talk yourself through before deciding what to actually write. Sending all of that to a cloud transcription provider is a meaningfully bigger privacy footprint than sending text. Even with good vendor policies, the surface area is larger and the data is more revealing.

2. Zero ongoing cost. Cloud Whisper APIs charge per minute. For light use (a few minutes a day), that's small. For heavy daily dictation — call it 30 minutes a day across emails, notes, Slack, drafts — it adds up to real money over a year, and the meter never stops. Local Whisper has a fixed cost: the time to download the model. After that, marginal transcription is free forever, regardless of how much you talk.
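To put a number on it: at the $0.006 per minute OpenAI has charged for its hosted Whisper API (an illustrative rate; other providers differ), 30 minutes a day works out to 30 × $0.006 × 365 ≈ $66 a year, every year, with no cap. Local inference turns that recurring line item into zero.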

3. Latency control — but not in the way you'd think. Local has no network round-trip, which sounds like it should be faster. It usually isn't. Whisper is an encoder-decoder transformer that produces output in a single batch pass after the recording stops; on consumer hardware it can't compete with a cloud streaming endpoint that emits tokens during the recording itself. What local does give you is predictable latency — no spinning beachball when your wifi flakes, no rate-limit surprises, and no degradation when the provider's infra has a bad day. The floor is also the ceiling, which matters more than people realize.

So local is the right call when (a) the content is sensitive, (b) the volume is high enough that per-call cost matters, or (c) reliability under flaky connectivity matters more than peak speed. We'll come back to the "when isn't it right" question in a section below.

How Whisper works (just enough to make sense of the options)

Whisper is an encoder-decoder transformer trained by OpenAI on roughly 680,000 hours of multilingual audio. The encoder turns 30-second audio chunks into a representation; the decoder generates text tokens conditioned on that representation. Critically, the decoder runs after the audio is captured, not during it. This is why Whisper is fundamentally a batch model — it doesn't emit text as you talk, it emits text after you stop.
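If you want to see that batch contract concretely, here's a minimal sketch using the reference openai-whisper Python package (an illustrative choice; the Mac apps below run Whisper.cpp instead, but the shape is the same: nothing comes back until the whole pass finishes).

```python
# Minimal sketch of Whisper's batch contract, using the reference
# openai-whisper package (pip install openai-whisper). Illustrative only;
# the apps in this guide run Whisper.cpp, not this package.
import whisper

model = whisper.load_model("base")           # weights download on first use
result = model.transcribe("recording.wav")   # runs AFTER capture, in one pass
print(result["text"])                        # the full transcript, all at once
```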

"Local" on Mac means the model weights run on your Apple Silicon GPU using Metal acceleration. Two stacks dominate: Whisper.cpp (Georgi Gerganov's C/C++ port using ggml, the same quantization library that powers llama.cpp) and Apple's Core ML path (used by some apps, sometimes alongside NeMo for Parakeet). Whisper.cpp is what most of the apps in this guide are built on, including ThoughtMic.

The model size tiers, in order of accuracy and resource cost: Tiny (~75 MB) → Base (~150 MB) → Small (~500 MB) → Medium (~1.5 GB) → Large V3 (~3 GB) → Large V3 Turbo (~1.6 GB, a pruned and fine-tuned Large V3 with a much smaller decoder, giving roughly 2x speed at near-identical accuracy). Quantization shrinks these further — Q5 roughly halves memory use at minimal accuracy cost; Q4 cuts more but starts producing visible degradation, especially on proper nouns.

The reason all of this works on a MacBook is Apple Silicon's unified memory architecture. On a desktop PC, the GPU has its own VRAM and you're constantly shuffling data across the PCIe bus; on M-series Macs, the CPU and GPU share the same memory pool, so loading a 1.6 GB model into "GPU memory" is just allocating it once. That's why a base-tier 8 GB M1 can run Large V3 Turbo at usable speed, and why this comparison page only exists for Mac.

Practical recommendation for most people: Large V3 Turbo at Q5. Best accuracy-to-speed tradeoff currently available, fits comfortably in memory on every M-series Mac, handles technical vocabulary and proper nouns well. This is what ThoughtMic ships with by default; MacWhisper and VoiceInk both let you pick.

The four serious options

Whisper.cpp directly

Free · open source · CLI

The engine. Free, fast, MIT-licensed, no GUI. For people building their own tooling.

Model: Any ggml-format Whisper model — Tiny through Large V3 Turbo, any quantization tier
Output: Text or JSON to stdout; file output with -otxt, -osrt, -ovtt
Privacy: 100% local. No telemetry, no network calls.
Speed: Fastest of all the options — no GUI overhead, direct Metal acceleration
Structure: None. It's a transcription engine; what you do with the output is your problem.
Best for: Engineers, automation pipelines, building your own tools, batch-processing existing audio archives. Also the foundation almost every other tool here is built on.

MacWhisper

$5–$20 one-time · local

The most polished GUI wrapper. Built for transcribing audio files (podcasts, meetings, interviews), not for live capture.

Model: Whisper Tiny → Large V3, multiple quantization options
Output: Transcript displayed in-app; export to TXT, SRT, VTT, DOCX, PDF
Privacy: 100% local Whisper
Speed: Batch — feed it a file, get a transcript. Real-time recording also supported, but the workflow is file-first.
Structure: None at the vault level. You move the transcript wherever it needs to go.
Best for: People who have audio files (call recordings, podcast episodes, lecture captures) they want clean transcripts of. Polished UX, sensible defaults, frequent updates.

VoiceInk

Free (build) · ~$25–$49 paid · open source

System-wide dictation hotkey with an open-source backbone. Strong fit for people who want to inspect or modify the code.

Model: Whisper variants; some users add Parakeet V3 alongside
Output: Text at the cursor in any active app
Privacy: Local by default; some optional cloud features in paid tiers — check the current docs
Speed: Local batch, comparable to other Whisper.cpp wrappers
Structure: None. Plain text into whatever has focus.
Best for: Open-source users who want a dictation app they can audit or modify. Build-from-source path is real and supported. Paid tier funds development.

ThoughtMic

Free up to 2k words/wk · $8/mo · $99 lifetime

Voice-to-vault pipeline. Local Whisper Large V3 Turbo plus auto-titles, auto-tags, auto-backlinks, and a Friday-review surface so captures don't pile up.

Model: Whisper Large V3 Turbo at Q5, Metal-accelerated via Whisper.cpp
Output: Text at the cursor in any app, plus a titled/tagged/linked Markdown note in your vault folder
Privacy: Audio and vault stay local. Optional cloud rephrase is text-only under a Zero Data Retention agreement — never audio, never vault contents.
Speed: ~1.5s overhead + ~80ms per second of audio on M1/M2/M3
Structure: Auto-generated title, tags suggested from your existing vault taxonomy, backlinks resolved against existing notes; queryable from Claude Desktop / Cursor via a local MCP server
Best for: Obsidian / Logseq / Foam / VS Code / plain-Markdown users who want voice capture that integrates with their note system, not just transcribes around it. See thoughtmic.com.

Honest performance expectations on Apple Silicon

The single most common mistake people make about local Whisper is expecting it to feel like cloud streaming dictation. It doesn't, and it can't, given the architecture. Here's what to actually expect:

Whisper Large V3 Turbo, Q5, on M1/M2/M3: roughly 1.5 seconds of fixed overhead (model warm-up, audio normalization, encoder pass) plus about 80 milliseconds per second of audio. So a 6-second utterance lands in roughly 2 seconds total after you stop talking. A 30-second utterance: about 4 seconds. A 5-minute meeting recording: about 25 seconds. M3 Pro and M3 Max are noticeably faster on long content; the fixed overhead is similar, but the per-second cost drops.
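Those figures reduce to a simple linear model. A back-of-envelope sketch (the function name is mine; the constants are just the rough M1/M2/M3 numbers above, not a benchmark):

```python
# Rough latency model from the figures above:
# ~1.5 s fixed overhead + ~80 ms per second of audio.
def est_latency(audio_seconds: float) -> float:
    return 1.5 + 0.08 * audio_seconds

for secs in (6, 30, 300):
    print(f"{secs:>4}s of audio -> ~{est_latency(secs):.1f}s after you stop talking")
# 6s -> ~2.0s; 30s -> ~3.9s; 300s (a 5-minute recording) -> ~25.5s
```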

Smaller models are faster, with caveats. Whisper Small is roughly 3x faster than Large V3 Turbo, but its accuracy on proper nouns, brand names, technical vocabulary, and acronyms degrades visibly. For a stream-of-consciousness journal entry, Small is fine. For meeting notes where you'll later search for "Anthropic" or "Postgres" or your colleague's name, Large V3 Turbo earns its keep.

Neither tier is the streaming experience cloud apps deliver. Apps like Wispr Flow that produce text appearing as you speak are doing one of two things: running a streaming-capable architecture (Conformer, RNN-T, or Whisper modified for rolling chunks), or running Whisper on cloud GPUs that are an order of magnitude faster than even an M3 Max. Neither path is available locally on Apple Silicon today at the same fluency. If sub-300ms perceived latency is non-negotiable for your workflow, local Whisper isn't the answer; cloud streaming is.

Energy cost is real but manageable. A short transcription pass barely registers. Sustained heavy use (transcribing an hour-long meeting on Large V3) is comparable to a video call in battery hit. The Apple Silicon GPU is dramatically more efficient than a discrete card for this workload, but it's still real work.

When local is the right call (and when it isn't)

Local is the right call when:

  • The content is sensitive. Personal journaling, therapy notes, work that touches client/patient/customer data, anything covered by HIPAA / GDPR / SOC 2 contractual obligations. The simpler and safer story is "the audio never left this Mac."
  • You're in a regulated industry. Legal, healthcare, finance, government, defense — even if a cloud vendor would technically be compliant, local removes a category of compliance work entirely.
  • You dictate a lot. Per-minute API pricing is unobjectionable at low volume and adds up fast at high volume. Local has no marginal cost.
  • You don't want vendor lock-in or surprise pricing changes. The model on your disk doesn't get a pricing-page update.
  • You want it to work offline. Plane mode, train tunnel, spotty coffee shop wifi, the airline's "we have wifi" theater that returns 0.3 Mbps — local doesn't care.

Cloud streaming is the right call when:

  • You absolutely need sub-300ms perceived latency. If you write thousands of words a day by voice into messaging apps and feel each second of post-recording delay, the streaming experience is meaningfully better. Local Whisper isn't going to compete on this axis any time soon.
  • The content isn't sensitive and the per-call cost is small. Casual dictation into general-purpose apps, with a vendor whose privacy story you're comfortable with.
  • You're on a memory-constrained Mac. Local Whisper is memory-efficient but not free; if Chrome, Slack, and Xcode are already maxing out 8 GB of unified memory, offloading transcription to the cloud helps.

Both are fine when:

  • It's casual PKM capture, which is where most use sits. The 1–2 second latency penalty for local matters less than the privacy and cost advantages over a year of use. Pick on the privacy/cost axis, not the latency one.

Setup notes for each option

Whisper.cpp directly

  1. Install via Homebrew: brew install whisper-cpp — or clone and build from source if you want the latest: git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp && cmake -B build && cmake --build build --config Release (older checkouts built with a plain make)
  2. Download a model. For Large V3 Turbo Q5: bash ./models/download-ggml-model.sh large-v3-turbo-q5_0 (or grab any quantization from huggingface.co/ggerganov/whisper.cpp)
  3. Run: ./build/bin/whisper-cli -m models/ggml-large-v3-turbo-q5_0.bin -f your-audio.wav (older builds name the binary main)
  4. For real-time mic input, use the whisper-stream binary built alongside whisper-cli. Note this is genuinely DIY tooling — wire it into a hotkey, output target, and post-processing yourself.
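If your goal is batch-processing an audio archive, a small wrapper script is usually all you need. A hypothetical sketch (the folder name, model path, and binary location are assumptions; adjust for your checkout):

```python
# Hypothetical batch-transcription wrapper around the whisper.cpp CLI.
# Assumes a cmake build (build/bin/whisper-cli); older builds use ./main.
import pathlib
import subprocess

MODEL = "models/ggml-large-v3-turbo-q5_0.bin"

for wav in sorted(pathlib.Path("archive").glob("*.wav")):
    # -otxt writes the transcript next to the input as <name>.wav.txt
    subprocess.run(
        ["./build/bin/whisper-cli", "-m", MODEL, "-f", str(wav), "-otxt"],
        check=True,
    )
```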

MacWhisper

  1. Buy from goodsnooze.gumroad.com/l/macwhisper (one-time price; tiers unlock larger models and pro features)
  2. Open the app, drag in an audio file (or hit record)
  3. Pick a Whisper model size — start with Small for speed-testing the workflow, switch to Large V3 once you trust it
  4. Wait for transcription, then export TXT / SRT / DOCX or copy/paste into wherever it needs to go

VoiceInk

  1. Visit the project on GitHub: github.com/Beingpax/VoiceInk
  2. Either build from source (instructions in the repo) or grab a paid build from the project's website
  3. On first launch, grant microphone + accessibility permissions (the latter is what lets it inject text at the cursor)
  4. Set a hotkey, pick a model, and you're dictating system-wide

ThoughtMic

  1. Join the waitlist (launches summer 2026)
  2. At launch: download, point it at your vault folder, set the hotkey
  3. Press ⌥ Space anywhere on your Mac, talk — text appears at your cursor and a structured Markdown note lands in your vault with a title, tags, and backlinks
  4. Hit ⌥⇧ R on Fridays to walk through your #inbox notes with one-keystroke decisions (Discard / Keep / Promote / Archive)
  5. Optional: enable the local MCP server so Claude Desktop or Cursor can query your vault directly

One thing worth knowing regardless of which tool you pick

The first time you run Whisper Large V3 Turbo locally, the first transcription is slower than subsequent runs: the weights have to page into memory and the Metal kernels have to compile, one-time costs that later runs skip. Don't benchmark your first transcription; the second and third are the real number.
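A quick way to see this yourself, again using the openai-whisper package for brevity (an illustrative stand-in for whatever stack you actually run):

```python
# Time the same clip three times; discard the first (warm-up) run.
import time
import whisper

model = whisper.load_model("base")
for run in range(1, 4):
    start = time.perf_counter()
    model.transcribe("recording.wav")
    print(f"run {run}: {time.perf_counter() - start:.2f}s")
# Expect run 1 to be the slowest; runs 2 and 3 are the real number.
```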

Frequently asked

How fast is local Whisper on Apple Silicon, really?

On an M1/M2/M3, Whisper Large V3 Turbo at Q5 measures roughly 1.5 seconds of fixed overhead plus about 80ms per second of audio. So a 6-second utterance takes around 2 seconds total to transcribe after you stop talking. The Small model is roughly 3x faster but loses accuracy on proper nouns, brand names, and technical vocabulary. Neither is the streaming-fast experience you get from cloud apps like Wispr Flow — that requires architectural differences (rolling-chunk Whisper, or a non-Whisper architecture like Parakeet RNN-T).

Will running Whisper locally drain my battery?

For typical PKM-style use (a handful of short utterances per hour), no — the model only runs during a transcription pass, which is a few hundred milliseconds to a few seconds at a time. For sustained heavy use (an hour-long meeting transcription on Large), expect noticeable battery hit, similar to running a video call. The Apple Silicon GPU is far more efficient than a discrete card for this workload, but it's still real work.

Can I run Whisper Large on an M1 Mac?

Yes. Whisper Large V3 Turbo at Q5 needs roughly 1.5–2 GB of memory and runs comfortably on every M-series Mac including the base 8 GB M1. Unified memory is what makes this practical — there's no separate VRAM bottleneck. The 8 GB M1 will feel some pressure if you're running Whisper, Chrome with 30 tabs, and Xcode at once, but the model itself fits.

Is there a Linux version of any of these tools?

Whisper.cpp builds and runs on Linux out of the box and is the foundation most local Whisper tools sit on. MacWhisper, VoiceInk, and ThoughtMic are all Mac-only. For Linux specifically, you're looking at Whisper.cpp directly, plus open-source frontends like nerd-dictation or whisper-writer if you want hotkey-driven dictation. The vault-native tooling (auto-tagging, auto-backlinks, Friday review) doesn't have a comparable Linux equivalent today.

Parakeet vs Whisper — which is better for local dictation?

They optimize for different things. Whisper is an encoder-decoder transformer built for batch transcription — high accuracy, runs after you stop talking. Parakeet (NVIDIA's RNN-T model) is built for streaming — emits tokens during speech, lower latency, but historically worse on long-form audio and proper nouns. For PKM-style capture where you record 10–60 seconds and want clean output, Whisper Large V3 Turbo is the better tradeoff today. For real-time messaging-style dictation where you want words appearing as you talk, Parakeet V3 is competitive and getting closer every release.

Why is local Whisper sometimes slower than cloud transcription?

Two reasons. First, Whisper is architecturally a batch model — it can't emit text until you stop talking, regardless of where it runs. Cloud apps that feel instant (like Wispr Flow) typically use streaming architectures like Conformer or RNN-T, which emit tokens during the recording. Second, big cloud providers run inference on H100/H200 GPUs that are dramatically faster than even an M3 Max for this workload — Groq's Whisper Large V3 endpoint can transcribe 10 seconds of audio in under 200ms. Local trades that raw throughput for zero data egress, zero per-call cost, and full offline capability.

Which model size should I actually pick?

For most people, Large V3 Turbo at Q5. Best accuracy-to-speed tradeoff currently shipping, fits in memory on every M-series Mac, handles technical vocabulary well. Drop to Small only if you're on a very memory-constrained machine, content is purely conversational with no proper nouns or jargon, or you're transcribing extremely long files where the speed delta matters. Skip Tiny and Base unless you're on an Intel Mac (in which case you have bigger problems than model choice).

Want local Whisper that lands as structured notes?

ThoughtMic ships summer 2026. Whisper Large V3 Turbo on your Mac, vault-native output, Friday-review built in. Join the waitlist for a launch-day download link and the Founder's Deal ($49 lifetime, first 50 only).

We'll email you at launch. No newsletter. Unsubscribe any time.

Works with Obsidian · Logseq · Foam · VS Code · any .md folder