Self-hosted Whisper turns speech-to-text from a metered cloud expense into a fixed, one-time hardware cost — and in 2026 the open ecosystem around it is good enough that most content and marketing teams never need a transcription API again. OpenAI’s Whisper, plus community runtimes like whisper.cpp, faster-whisper and WhisperX, transcribes podcasts, webinars, video and meetings locally for roughly zero dollars per minute, at unlimited volume, with every recording staying on your own machine.

Cloud transcription looks cheap at a few tenths of a cent per minute. It stops looking cheap the moment you point it at a back-catalog: a 200-episode podcast archive, a year of webinars, or a sales team’s call recordings adds up to hundreds or even thousands of audio-hours, and every one of those hours is billed again each time you re-process. Local transcription flips the model: you pay once for hardware you may already own, then transcribe as much as you like. The catch is knowing where that crossover actually sits, which is the math most “Whisper is free” posts skip.

This guide is the practical version for teams that ship content, not just developers. We cover what Whisper is and the model sizes that matter, how to pick a runtime by your hardware, the real-world speed you can expect on a Mac versus an NVIDIA card, how Whisper’s accuracy now compares to NVIDIA’s Parakeet, and the local-versus-cloud cost crossover in audio-hours. Every model size, error rate and price below is traceable to a primary source and dated to June 29, 2026.

Key takeaways

01
Whisper is genuinely open: MIT licence, code and weights. OpenAI released Whisper as an automatic speech recognition system trained on 680,000 hours of multilingual, multitask audio, under the MIT licence for both code and weights. You can download it, run it offline, and ship it in a product with no per-minute fee.
02
Pick the runtime by your hardware, not the hype. whisper.cpp and MLX for Apple Silicon; faster-whisper (CTranslate2) for NVIDIA; WhisperX when you need word-level timestamps plus speaker labels; distil-whisper when you want maximum speed and can spend about 1% more word error rate. Same weights, four jobs.
03
faster-whisper is up to 4× faster at the same accuracy. Its README reports up to 4× the speed of openai/whisper at the same accuracy while using less memory, and distil-whisper is roughly 6× faster and about 50% smaller within 1% WER, so a used 8GB GPU comfortably runs the top open model.
04
Whisper is no longer the only open game. NVIDIA Parakeet-TDT-0.6B-v3 (600M params) posts a lower average WER than Whisper large-v3 (6.34% vs 6.43%) at roughly 49× the throughput, but it covers only 25 European languages (English included) and is licensed CC-BY-4.0. Honest answer: it depends on your languages.
05
Local wins past a payback line measured in audio-hours. Cloud batch APIs run $0.0025 to $0.006 per minute. A roughly $600 Mac mini or a $700–900 used RTX 3090 pays for itself versus OpenAI’s $0.36/hr after roughly 1,670–2,500 hours of audio, and after that transcription is effectively free.

01 — Why Local Now: The case for on-device transcription.

For a content or marketing team, transcription is not an edge case — it is infrastructure. Subtitles for video, searchable text for a podcast back-catalog, repurposing a webinar into a blog post and ten social clips, meeting notes, accessibility compliance: all of it starts with turning speech into accurate text. The question is whether you rent that capability by the minute or own it outright. Running it locally is the same decision as generating images locally with FLUX and ComfyUI — you trade a recurring API bill for fixed hardware and full control of your data.

Two forces make 2026 the year this tips for non-developers. First, the runtimes matured: you no longer need to be an ML engineer to run Whisper — a drag-and-drop Mac app or a one-line install on a gaming GPU gets you production-grade transcripts. Second, the privacy calculus hardened: customer calls, internal strategy sessions and unreleased product footage are exactly the recordings you do not want leaving your network. Local transcription keeps every byte on-device, which is the same logic behind running local LLMs with Ollama, LM Studio or vLLM.

Subtitling

Video and podcast back-catalog

burn-in captions · SRT/VTT export

Caption a year of YouTube uploads or a 200-episode podcast in one batch. At a few tenths of a cent per minute, a back-catalog is exactly where cloud fees compound — and where local pays back fastest.

Highest-volume use case

Repurposing

Webinars into content

transcript → blog, threads, clips

A 60-minute webinar becomes a transcript, then a blog post, a quote-card series and ten short clips. Accurate text is the raw material every downstream repurposing workflow depends on.

Content engine input

Meeting notes

Calls and stand-ups

diarized transcript + summary

Speaker-labelled transcripts of sales calls and internal meetings feed straight into summaries and CRM notes — without sending a single recording to a third-party vendor.

Privacy-sensitive

Accessibility

Captions and compliance

WCAG captions · searchable archives

Accurate captions are an accessibility requirement, not a nice-to-have. Owning the pipeline means you can re-caption an entire archive whenever standards or branding change, at no marginal cost.

Always-on requirement
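For the subtitling use case, the last step is usually turning timed segments into an SRT file. The sketch below shows only that formatting step, using the standard library. The `(start, end, text)` tuple shape and the sample segments are assumptions for illustration; real runtimes each expose their own segment objects.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start_seconds, end_seconds, text) tuples as an SRT document."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Cue blocks are separated by a blank line.
    return "\n".join(blocks)


# Hypothetical segments standing in for real transcriber output.
example_segments = [
    (0.0, 2.5, "Welcome back to the show."),
    (2.5, 6.04, "Today we are talking about captions."),
]

if __name__ == "__main__":
    print(to_srt(example_segments))
```

Keeping the formatter separate from the transcriber means the same function can serve whichever runtime you end up choosing.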

02 — Meet Whisper: What Whisper actually is.

Whisper is OpenAI’s open automatic speech recognition system, released in September 2022 under the MIT licence for both the code and the model weights. Architecturally it is a straightforward encoder-decoder Transformer: audio is split into 30-second chunks, converted to a log-Mel spectrogram and fed to the encoder; the decoder then predicts the text along with special tokens for language identification, timestamps, transcription, and translation to English. That single-model design is why one checkpoint can transcribe, timestamp and translate without extra plumbing.
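To make the 30-second chunking concrete, here is a small sketch of the arithmetic behind one encoder window. The constants (16 kHz audio, 30-second windows, a 160-sample hop) match the reference implementation as far as we know; the fixed-grid `chunk_windows` helper is a simplification for illustration, since the reference decoder advances its window using the timestamps it predicts rather than a rigid grid.

```python
import math

SAMPLE_RATE = 16_000   # Hz; input audio is resampled to 16 kHz
CHUNK_SECONDS = 30     # length of one encoder window
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)

SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_SECONDS      # samples in one window
FRAMES_PER_CHUNK = SAMPLES_PER_CHUNK // HOP_LENGTH   # spectrogram frames in one window


def chunk_windows(duration_seconds: float):
    """Return (start, end) offsets in seconds for fixed 30-second windows.

    Simplified: a rigid grid, not the timestamp-driven seek of the real decoder.
    """
    count = math.ceil(duration_seconds / CHUNK_SECONDS)
    return [
        (i * CHUNK_SECONDS, min((i + 1) * CHUNK_SECONDS, duration_seconds))
        for i in range(count)
    ]


if __name__ == "__main__":
    print(SAMPLES_PER_CHUNK, FRAMES_PER_CHUNK)
    print(chunk_windows(95))
```

A 95-second clip, for example, needs four windows, with the last one covering only the final five seconds.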

The robustness is the part that made Whisper a default. Because it was trained on a very large, messy, real-world corpus rather than a clean benchmark set, it generalises well to accents, background noise and domain jargon out of the box. OpenAI reports that across diverse datasets it makes about 50% fewer errors, zero-shot, than models specialised on a single benchmark — and roughly a third of its training audio is non-English, which is why it handles around 99 languages (a widely-cited figure from the model card, with quality varying widely by language).

The training-data headline

OpenAI describes Whisper as “an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web.” That scale — and the permissive MIT licence on the weights — is the entire reason a free, open model competes with paid cloud APIs on accuracy. Source: OpenAI, “Introducing Whisper” (2022-09-21).

03 — Model Sizes: Six sizes, one trade-off.

Whisper ships in a family of sizes that trade accuracy for speed and memory. Bigger models are more accurate and need more VRAM; smaller ones are faster and run on almost anything. The figures below are OpenAI’s own approximate numbers from the official README — treat them as a starting point, because real VRAM depends heavily on the runtime (int8 versus fp16), batch size and beam width.

Whisper model family — parameter count, approximate VRAM, relative speed and English-only variant
Model	Parameters	VRAM (approx)	Relative speed	English-only
tiny	39M	~1GB	~10×	tiny.en
base	74M	~1GB	~7×	base.en
small	244M	~2GB	~4×	small.en
medium	769M	~5GB	~2×	medium.en
turbo (large-v3-turbo)	809M	~6GB	~8×	— (multilingual only)
large (-v3)	1,550M	~10GB	1× (baseline)	— (multilingual only)

Source: openai/whisper README (retrieved 2026-06-29). VRAM and relative-speed figures are OpenAI’s own approximations; “relative speed” is measured against the large baseline.

English-only .en variants exist for tiny, base, small and medium and perform better on English audio (notably tiny.en and base.en); large and turbo are multilingual-only.
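One way to read the table is as a lookup: given the VRAM you have, which is the most capable model that fits? The sketch below encodes the approximate figures from the table. The accuracy ordering (turbo ranked above medium) and the `headroom_gb` parameter are assumptions for illustration, and quantised runtimes can need considerably less memory than these reference numbers.

```python
# (model, approx VRAM in GB, relative speed) from the table above,
# listed from most to least accurate. Ranking turbo above medium is an
# assumption based on this article's description of turbo.
MODELS = [
    ("large-v3", 10, 1),
    ("turbo", 6, 8),
    ("medium", 5, 2),
    ("small", 2, 4),
    ("base", 1, 7),
    ("tiny", 1, 10),
]


def largest_model_that_fits(vram_gb: float, headroom_gb: float = 0.0):
    """Return the first (most accurate) model whose approximate VRAM fits."""
    for name, needed_gb, _speed in MODELS:
        if needed_gb + headroom_gb <= vram_gb:
            return name
    return None  # nothing fits; consider a quantised runtime or CPU


if __name__ == "__main__":
    for vram in (4, 8, 12):
        print(vram, "GB ->", largest_model_that_fits(vram))
```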

The model worth knowing in 2026 is large-v3-turbo. Released in October 2024, it is a pruned, fine-tuned version of large-v3 in which OpenAI cut the decoder from 32 layers to 4, bringing parameters from 1,550M down to 809M. Community and vendor notes describe it as “as good as large-v2 but roughly 6× faster” (OpenAI’s own table above lists about 8×), at about 6GB of VRAM versus 10GB. For most teams, turbo is the sweet spot: near-flagship accuracy at a fraction of the compute. And there is no successor to chase: OpenAI had not announced a Whisper v4 as of June 29, 2026, so large-v3 and turbo remain the production-safe open checkpoints.

04 — Runtimes: The same weights, four runtimes.

The Whisper weights are one thing; the program that runs them is another, and the runtime you choose matters more than the model size for most teams. The original openai/whisper is the reference implementation. Around it, the community built faster, more specialised engines: a C/C++ port for Apple Silicon and embedded devices, a CTranslate2 build for NVIDIA, an alignment wrapper for word-level timestamps and diarization, and a distilled model for raw speed. Pick by hardware and by what your output needs.

Whisper runtimes and the leading non-Whisper alternative, by best hardware, timestamp granularity, diarization, language coverage and recommended use
Runtime	Best hardware	Word timestamps	Diarization	Languages	Use when
Whisper-based runtimes
openai/whisper	NVIDIA GPU or CPU	Segment-level	No	Multilingual (~99)	Reference setup, simplest path
whisper.cpp	Apple Silicon, CPU, embedded	Segment-level	No	Multilingual (~99)	Macs, real-time, on-device
faster-whisper	NVIDIA GPU	Segment-level	No	Multilingual (~99)	Fastest Whisper on NVIDIA
WhisperX	NVIDIA GPU	Word-level (wav2vec2)	Yes (pyannote)	Multilingual	Per-word timestamps + speaker labels
distil-whisper	NVIDIA GPU or CPU	Segment-level	No	English (distil-large-v3)	Maximum speed, ~1% WER trade
Non-Whisper alternative
NVIDIA Parakeet-TDT-0.6B-v3	NVIDIA GPU	Yes (TDT)	No (separate step)	25 European languages	Throughput-bound English/EU, lowest WER

Sources: project READMEs and model cards for openai/whisper, whisper.cpp, faster-whisper, m-bain/whisperX, distil-whisper and nvidia/parakeet-tdt-0.6b-v3 (retrieved 2026-06-29). Whisper, faster-whisper and whisper.cpp are MIT-licensed; Parakeet is CC-BY-4.0.

whisper.cpp is a dependency-free C/C++ port that uses GGML weights and runs on Apple Metal, CUDA, Vulkan and plain CPU, which makes it the strongest fit for a Mac, an embedded device, or anything real-time. faster-whisper is a CTranslate2 reimplementation and the fastest path on NVIDIA hardware. WhisperX wraps Whisper with wav2vec2 phoneme alignment for per-word timestamps and pyannote-audio for speaker diarization; it is the runtime to reach for when you need to know who said what, and when. And distil-whisper is a distilled model that drops in to whisper.cpp, faster-whisper and openai/whisper: roughly 6× faster and about 50% smaller, within 1% WER on long-form decoding.
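The selection logic in the table and paragraph above can be written down as a few rules. This is a sketch of one reasonable reading, not a definitive decision tree: the function name, the hardware labels and the two flags are all hypothetical, and Parakeet is left out because it is not a Whisper runtime.

```python
def pick_runtime(hardware: str,
                 need_speakers: bool = False,
                 english_only: bool = False,
                 speed_over_accuracy: bool = False) -> str:
    """Suggest a Whisper runtime from hardware and output needs."""
    hardware = hardware.lower()
    if hardware not in {"apple", "cpu", "embedded", "nvidia"}:
        raise ValueError(f"unknown hardware label: {hardware!r}")
    if need_speakers:
        # Word-level timestamps plus speaker labels; the table lists an
        # NVIDIA GPU as its best hardware.
        return "WhisperX"
    if hardware in {"apple", "cpu", "embedded"}:
        return "whisper.cpp"
    if english_only and speed_over_accuracy:
        # Trades roughly 1% WER for speed, per the figures above.
        return "distil-whisper"
    return "faster-whisper"


if __name__ == "__main__":
    print(pick_runtime("apple"))
    print(pick_runtime("nvidia", need_speakers=True))
```

Treat the result as a starting point and benchmark on your own audio before committing to it.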

“fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.”— WhisperX project README

Read the 70× claim carefully

WhisperX’s headline “70× realtime” figure is measured with large-v2 and batched inference on a capable NVIDIA GPU — it does not generalise to CPU, to a Mac, or to large-v3. It is also memory-light: WhisperX runs large-v2 in under 8GB of GPU memory at beam_size=5, which means the top open model fits comfortably on a used RTX 3060- or 3070-class card.

05 — Real-World Speed: How fast it really runs.

Speed is usually quoted as a multiple of realtime (“2× realtime” means one minute of audio transcribes in 30 seconds) or as RTFx, the throughput factor. The numbers below come from the faster-whisper README’s own benchmark: the same audio, transcribed with Whisper large-v2 on a single NVIDIA RTX 3070 Ti (8GB), across three runtimes and five configurations. Speeds are hardware-dependent, so read these as the shape of the gap between runtimes on one mid-range card, not as a promise for yours.

Transcribing the same audio · Whisper large-v2 on one 8GB NVIDIA card
Runtime	Precision	Peak VRAM	Time (lower is better)
openai/whisper	fp16	4,708 MB	2m23s
whisper.cpp (Flash Attn)	fp16	4,127 MB	1m05s
faster-whisper	fp16	4,525 MB	1m03s
faster-whisper (batch=8)	fp16	6,090 MB	17s
faster-whisper (int8)	int8	2,926 MB (fits 4GB cards)	59s

Source: SYSTRAN faster-whisper README · Whisper large-v2 · RTX 3070 Ti 8GB

Two things jump out. First, batching collapses wall-clock time: faster-whisper at batch=8 turns a 2m23s job into 17 seconds on the same card. Second, int8 quantization runs large-v2 in under 3GB of VRAM — the reason a humble 8GB GPU, or even a 4GB one, is enough for the flagship open model. If you are sizing a machine for this, our guide to the best hardware to run local AI by price bracket maps cards and Macs to budgets.
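For readers on an NVIDIA card, the fp16-versus-int8 choice comes down to one constructor argument in faster-whisper. The sketch below follows the usage shown in the project README (`WhisperModel`, `compute_type`, `transcribe`); the 6GB threshold, the helper names and the returned tuple shape are assumptions for illustration, loosely read off the VRAM figures in the benchmark.

```python
def choose_compute_type(vram_gb: float) -> str:
    """Pick a compute type from available VRAM.

    The 6 GB threshold is an assumption: in the benchmark above, large-v2
    peaked at roughly 4.5 GB in fp16 and roughly 2.9 GB in int8.
    """
    return "float16" if vram_gb >= 6 else "int8_float16"


def transcribe(path: str, vram_gb: float, model_size: str = "large-v2"):
    """Transcribe one file and return a list of (start, end, text) tuples."""
    # Imported lazily so the helper above can be used without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(
        model_size,
        device="cuda",
        compute_type=choose_compute_type(vram_gb),
    )
    segments, _info = model.transcribe(path, beam_size=5)
    # `segments` is a generator; iterating it is what runs the decoding.
    return [(seg.start, seg.end, seg.text) for seg in segments]


if __name__ == "__main__":
    print(choose_compute_type(8), choose_compute_type(4))
```

The model is downloaded on first use, so the first run on a new machine takes noticeably longer than later ones.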

On Apple Silicon the story is different but still strong. With Metal acceleration, community benchmarks suggest Whisper large-v3 runs at roughly 2–3× realtime on an M3 or M4 (an RTF of about 0.33–0.50, or 20–30 seconds per minute of audio), with Metal giving a 30–60% speedup over CPU-only that widens with model size. These are secondary benchmark figures, not a vendor spec — results vary by chip tier (base, Pro, Max, Ultra), by audio, and by thermals — so treat them as directional rather than guaranteed.
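Because "× realtime", RTF and "seconds per minute of audio" are three views of the same number, a short conversion helper can save some mental arithmetic. This is a sketch with hypothetical function names; the inputs in the example are the community estimates quoted above, not measurements.

```python
def rtf(speed_multiple: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return 1.0 / speed_multiple


def seconds_per_audio_minute(speed_multiple: float) -> float:
    """Wall-clock seconds needed to transcribe one minute of audio."""
    return 60.0 / speed_multiple


def processing_minutes(audio_minutes: float, speed_multiple: float) -> float:
    """Wall-clock minutes needed for a recording of the given length."""
    return audio_minutes / speed_multiple


if __name__ == "__main__":
    for multiple in (2, 3):
        print(
            f"{multiple}x realtime: RTF {rtf(multiple):.2f}, "
            f"{seconds_per_audio_minute(multiple):.0f}s per audio minute, "
            f"{processing_minutes(60, multiple):.0f} min for a 60-minute webinar"
        )
```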

faster-whisper

vs openai/whisper

4×

Up to 4× faster at the same accuracy while using less memory, via the CTranslate2 backend. The default NVIDIA runtime for batch transcription.

MIT licensed

distil-whisper

faster, ~50% smaller

6×

distil-large-v3 is about 6.3× faster than large-v3 and within 1% WER on long-form decoding. Drops in to whisper.cpp, faster-whisper and openai/whisper.

~1% WER cost

Apple Silicon

realtime on M3 / M4

2–3×

Community benchmarks suggest ~2–3× realtime with Metal acceleration (RTF ≈ 0.33–0.50). Varies by chip tier and thermals — directional, not a vendor figure.

Community estimate

06 — Accuracy & WER: Whisper versus Parakeet.

Accuracy is measured as word error rate (WER), where lower is better. The neutral reference is the Open ASR Leaderboard, which runs the same eight datasets across every model. The surprise of 2026 is that Whisper no longer holds the open accuracy crown: NVIDIA’s Parakeet-TDT-0.6B-v3, at well under half of Whisper large-v3’s parameter count (600M versus 1,550M), posts a lower average WER and does it at dramatically higher throughput.
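If you want to benchmark on your own audio, WER is simple to compute: it is the word-level edit distance between a trusted reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal standard-library version. Lowercasing is a stand-in for the fuller text normalisation that benchmarks typically apply, so numbers from this function will not match leaderboard figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER: {word_error_rate(ref, hyp):.1%}")
```

Hand-correct a few minutes of representative audio to serve as the reference, then score each candidate runtime against it.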

Average word error rate and throughput (RTFx) for Parakeet and Whisper on the Open ASR Leaderboard
Model	Params	Avg WER	RTFx (throughput)	Notes
Parakeet-TDT-0.6B-v3	600M	6.34% ¹	3,332.74	25 EU languages · CC-BY-4.0
Parakeet CTC 1.1B	1.1B	6.68%	2,793.75	TDT/CTC = very high throughput
Whisper large-v3	1,550M	6.43%	68.56	Multilingual (~99) · MIT

Source: Open ASR Leaderboard, data dated 2025-11-21 (retrieved 2026-06-29). Lower WER is more accurate; higher RTFx is faster.

¹ Parakeet’s model card lists 6.34% average WER; the leaderboard shows 6.32% — a minor discrepancy. We cite the model-card figure as primary.

The throughput gap is the real story. Parakeet posts an RTFx of more than 3,300 against Whisper large-v3’s 68.56, roughly 49× the throughput, because its TDT decoder is architecturally built for speed rather than the autoregressive decoding Whisper uses. That is the difference between transcribing a podcast archive over a weekend and doing it over a coffee break. The catch is coverage: Parakeet-v3 covers 25 European languages (English included) and nothing beyond them, under a CC-BY-4.0 licence that requires attribution. For a Spanish-and-English marketing team it is a gift; for anyone touching Asian or African languages, Whisper stays the answer.
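To see what the RTFx gap means for a back-catalog, divide the audio hours by the throughput factor. The sketch below uses the leaderboard figures from the table; the 1,000-hour archive is a hypothetical example, and leaderboard RTFx is measured on the leaderboard's own hardware, so your wall-clock times will differ.

```python
# RTFx values from the Open ASR Leaderboard table above.
RTFX = {
    "Parakeet-TDT-0.6B-v3": 3332.74,
    "Whisper large-v3": 68.56,
}


def wall_clock_hours(audio_hours: float, rtfx: float) -> float:
    """Hours of compute needed to transcribe the given hours of audio."""
    return audio_hours / rtfx


if __name__ == "__main__":
    archive_hours = 1_000  # hypothetical archive size
    for model, factor in RTFX.items():
        hours = wall_clock_hours(archive_hours, factor)
        print(f"{model}: {hours:.2f} h ({hours * 60:.0f} min)")
    ratio = RTFX["Parakeet-TDT-0.6B-v3"] / RTFX["Whisper large-v3"]
    print(f"throughput ratio: {ratio:.1f}x")
```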

Why the architectures diverge

The Open ASR Leaderboard team’s own framing: “Models combining Conformer encoders with large language model (LLM) decoders currently lead in English transcription accuracy … CTC and TDT decoders deliver 10–100× faster throughput.” In plain terms — the most accurate models are slow, the fastest models trade a sliver of accuracy for an order of magnitude more speed, and both now beat the paid cloud baseline on accuracy. Source: Open ASR Leaderboard, Hugging Face blog (2025-11-21).

07 — Local vs Cloud Cost: The crossover nobody publishes.

“Whisper is free” is true at the margin and misleading in the aggregate. Local transcription costs roughly zero dollars per minute once you own the hardware — but the hardware, power and a little ops time are real. The honest comparison is not free-versus-paid; it is a fixed cost amortised against a recurring one. So start from the cloud rates, then find the audio-hour count where the fixed cost wins. First, the going batch prices as of June 29, 2026.

Cloud speech-to-text API pricing as of June 2026, by provider and model, per minute and per hour
Provider / model	Per minute	Per hour	Mode
OpenAI gpt-4o-transcribe	$0.006	$0.36	Batch + streaming
OpenAI gpt-4o-mini-transcribe	$0.003	$0.18	Batch + streaming
Deepgram Nova-3 (pre-recorded)	$0.0043	$0.26	Batch
Deepgram Nova-3 (streaming)	$0.0077	$0.46	Streaming
AssemblyAI Universal	$0.0025	$0.15	Batch
AssemblyAI Slam-1	~$0.0045–0.0062	~$0.27–0.37 ¹	Batch (prompt-based)

Sources: OpenAI, Deepgram and AssemblyAI pricing pages (retrieved 2026-06-29). OpenAI’s legacy whisper-1 API is also around $0.006 per minute. Advanced features (diarization, sentiment, PII redaction) bill extra.

¹ Slam-1’s rate is unsettled across 2026 sources ($0.27–$0.37 per hour); confirm the live pricing page before budgeting.

Now the crossover. The table below holds the cloud rate constant and scales audio volume, so you can read your own monthly hours straight off it. Each cell is simply audio-hours × the per-hour rate (gpt-4o-transcribe at $0.36/hr, Deepgram Nova-3 batch at $0.258/hr, AssemblyAI Universal at $0.15/hr). The self-hosted column is the point: it does not move with volume.

Monthly transcription cost by audio volume — cloud APIs versus self-hosted Whisper
Monthly audio	gpt-4o-transcribe	Deepgram Nova-3	AssemblyAI Universal	Self-hosted Whisper
10 hrs	$3.60	$2.58	$1.50	≈ $0 (power)
100 hrs	$36.00	$25.80	$15.00	≈ $0 (power)
500 hrs	$180.00	$129.00	$75.00	≈ $0 (power)
1,000 hrs	$360.00	$258.00	$150.00	≈ $0 (power)
5,000 hrs	$1,800.00	$1,290.00	$750.00	≈ $0 (power)

Digital Applied calculation: cost = audio-hours × per-hour rate, at the June 2026 batch prices above. Self-hosted is the marginal cost after hardware — electricity only.
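The table is easy to regenerate for your own volumes. The sketch below applies the same formula (audio-hours × per-hour rate) to the per-minute prices listed earlier; the dictionary and function names are hypothetical, and the rates should be refreshed from the providers' pricing pages before budgeting.

```python
# Per-minute batch rates from the pricing table above (June 2026).
RATES_PER_MINUTE = {
    "gpt-4o-transcribe": 0.006,
    "Deepgram Nova-3": 0.0043,
    "AssemblyAI Universal": 0.0025,
}


def monthly_cost(audio_hours: float, rate_per_minute: float) -> float:
    """Cloud cost in dollars for a month's audio, rounded to cents."""
    return round(audio_hours * 60 * rate_per_minute, 2)


def cost_row(audio_hours: float) -> dict:
    """One table row: provider name mapped to monthly cost."""
    return {
        provider: monthly_cost(audio_hours, rate)
        for provider, rate in RATES_PER_MINUTE.items()
    }


if __name__ == "__main__":
    for hours in (10, 100, 500, 1_000, 5_000):
        print(hours, cost_row(hours))
```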

The hardware payback line

A roughly $600 Mac mini or a $700–900 used RTX 3090 pays for itself versus gpt-4o-transcribe ($0.36/hr) after roughly 1,670–2,500 hours of audio ($600 ÷ $0.36 ≈ 1,667 hrs; $900 ÷ $0.36 = 2,500 hrs), and versus the cheaper AssemblyAI Universal ($0.15/hr) after roughly 4,000–6,000 hours. In monthly terms: at 100 audio-hours a month a $600 machine clears its cost against gpt-4o-transcribe in under 17 months; at 1,000 hours a month, in under two. Hardware street prices are illustrative, so verify current pricing before you buy. The full buy-versus-rent math is in our local AI workstation economics breakdown.
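The payback arithmetic generalises to any hardware price and cloud rate. The sketch below reproduces the figures above; the optional `power_cost_per_audio_hour` parameter is an assumption added for illustration (the article treats electricity as approximately zero), and the function names are hypothetical.

```python
def payback_hours(hardware_cost: float,
                  cloud_rate_per_hour: float,
                  power_cost_per_audio_hour: float = 0.0) -> float:
    """Audio-hours after which owned hardware is cheaper than the cloud."""
    saving_per_hour = cloud_rate_per_hour - power_cost_per_audio_hour
    if saving_per_hour <= 0:
        raise ValueError("local running cost must be below the cloud rate")
    return hardware_cost / saving_per_hour


def payback_months(hardware_cost: float,
                   cloud_rate_per_hour: float,
                   audio_hours_per_month: float) -> float:
    """Months to break even at a steady monthly audio volume."""
    return payback_hours(hardware_cost, cloud_rate_per_hour) / audio_hours_per_month


if __name__ == "__main__":
    for cost in (600, 900):
        for rate in (0.36, 0.15):
            print(f"${cost} vs ${rate}/hr: {payback_hours(cost, rate):,.0f} audio-hours")
    print(f"{payback_months(600, 0.36, 100):.1f} months at 100 hrs/month")
```

Plugging in your own monthly volume is the quickest way to see which side of the crossover you are on.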

08 — Your Stack: Which setup for your team.

There is no single best answer — there is the best answer for your hardware, your languages and your output. Match your situation to one of the four paths below, then benchmark on your own audio before you commit a pipeline to it.

On a Mac

whisper.cpp or MLX

Apple Silicon with Metal handles large-v3 at roughly 2–3× realtime. whisper.cpp is the dependency-free workhorse; MLX builds squeeze more out of the GPU. No NVIDIA card required — your laptop is the rig.

Pick whisper.cpp / MLX

On an NVIDIA GPU

faster-whisper for throughput

A used 8GB card runs large-v2 in int8 under 3GB. Batched in fp16 (about 6GB), it finishes in 17 seconds the job that takes the reference implementation 2m23s. This is the default for processing a back-catalog at speed.

Pick faster-whisper

Need who-said-what

WhisperX

When the deliverable is a diarized, word-level transcript — interview, panel, sales call — WhisperX adds wav2vec2 alignment and pyannote speaker labels on top of Whisper, in under 8GB.

Pick WhisperX

English/EU at scale

distil-whisper or Parakeet

distil-whisper buys ~6× speed for ~1% WER as a drop-in. For the 25 European languages it covers (English included), Parakeet-TDT edges out Whisper large-v3 on average WER at ~49× the throughput. Mind the CC-BY attribution.

Pick distil-whisper / Parakeet

Whichever path you take, transcription is rarely the end of the workflow — it is the front door to repurposing, summarisation and search. The transcript feeds a content engine; the audio you keep local can also be paired with the inverse problem, text-to-speech, to close the loop on audio content. If you want a production pipeline rather than a weekend experiment, our content engine service builds transcription, repurposing and publishing into one system you own end to end.

09 — Conclusion: The only question left is your payback line.

The state of open speech-to-text, June 2026

Open speech-to-text is a solved problem — what's left is the hardware math.

Whisper made accurate, multilingual transcription free and open; the community made it fast and specialised; and Parakeet proved an open model can now beat the paid cloud baseline on accuracy. For a content or marketing team, the practical upshot is that transcription is no longer a metered service you rent — it is a capability you can own, on a machine you may already have.

The decision is no longer technical, it is financial. If your monthly audio is small, the cloud APIs are genuinely cheap and not worth replacing. If you are sitting on a back-catalog, re-processing an archive, or holding recordings you cannot send off-network, the crossover arrives fast — and past it, every hour of audio you transcribe is effectively free. Run your own volume against the crossover table, pick the runtime that matches your hardware, and benchmark on a real sample before you commit.

Looking forward, the trend only sharpens. Throughput-first architectures like Parakeet’s TDT decoder are pushing open accuracy past the cloud while running orders of magnitude faster, and distil models keep shrinking the hardware floor. The teams that win the next two years of content operations will be the ones that quietly moved their highest-volume, most privacy-sensitive workloads on-device — and stopped paying per minute for a problem the open ecosystem already solved.

Self-Hosted Whisper: local transcription for $0 per minute
