On-Device ASR Comparison — VoxRT vs Picovoice Cheetah, Whisper.cpp & Vosk

Why VoxRT

What you get with VoxRT ASR

Streaming partials

Partial transcripts every 80 ms — no batch step required.

CTC and RNN-T

Both decoders exposed for your latency / accuracy trade-off.

32M FastConformer

Research-grade accuracy running entirely on-device.

Lowest published WER

3.27% on LibriSpeech test-clean, within floating-point rounding noise of the upstream NeMo reference.

Fully on-device

No cloud, no audio leaving the device — and no per-use fees.

Side by side

The comparison table

Comparison of on-device streaming speech-to-text engines — VoxRT, Picovoice Cheetah, Whisper.cpp, Vosk, Sherpa-onnx and Moonshine — by word-error rate on LibriSpeech test-clean, model size, real-time factor on a Snapdragon 662 and an iPhone A15, and license.
Engine	WER (LS test-clean)	Model size	RTF (Snapdragon 662)	RTF (iPhone A15)	License
VoxRT (streaming-medium-pc, 32M)	3.27% RNN-T ^a; 4.90% CTC	~61 MB fp16 on disk	0.30 measured	0.08–0.10 measured	MIT (runtime) + CC-BY-4.0 (weights)
Picovoice Cheetah (commercial · streaming)	5.4% ^b	34 MB	~0.47 est. ^c	~0.077 est. ^c	Commercial; free plan limited
Whisper.cpp base.en (OpenAI · OSS runtime)	~4.4% ^d (tiny.en ~5.6%)	142 MiB	~1.82 est. ^c (slower than real-time)	~0.29 est. ^c	MIT + separate weights ^e
Vosk Small (en-us-0.15 · Kaldi)	9.85% ^f	40 MB	~0.68 est. ^c	~0.11 est. ^c	Apache-2.0
Sherpa-onnx (streaming Zipformer 20M)	3.88% ^g (HF card only)	~80 MB int8	Not published	Not published	Apache-2.0
Moonshine Medium (Useful Sensors · seq2seq)	5.9% ^h	Not published	~19 est. ^c (not real-time)	~3.1 est. ^c (not real-time)	MIT (runtime + weights)

WER = word-error rate on LibriSpeech test-clean; lower is better. RTF = real-time factor, the fraction of one CPU core needed to keep up with audio in real time; below 1.0 is real-time. Every competitor's mobile RTF here is estimated (only VoxRT's is measured), and WER figures are vendor-self-reported on different setups, so they are directionally indicative, not directly comparable (see the methodology note below).

Sources:
^a Internal VoxRT measurement on LibriSpeech test-clean: RNN-T head 3.27%, CTC head 4.90%. RTF measured single-thread on a Snapdragon 662 (A73) and an iPhone 13 Pro Max (A15).
^b Picovoice speech-to-text-benchmark, LibriSpeech test-clean, measured on an AMD Ryzen 9 5900X desktop.
^c Estimated: these vendors publish desktop (or no) numbers only; mobile RTF is scaled from Picovoice's published desktop Core-Hour data by Geekbench 6 single-core ratios (see methodology). It is a lower bound on their true mobile cost; we have not measured them on a Snapdragon 662 ourselves.
^d Upstream OpenAI Whisper paper WER for the .en variants; whisper.cpp publishes no first-party WER table.
^e whisper.cpp's runtime is MIT, but OpenAI's Whisper weights ship under a separate license; confirm before commercial mobile redistribution.
^f Vosk model registry (alphacephei.com/vosk/models). Vosk is a Kaldi HMM/DNN + language model, not end-to-end CTC/RNN-T, so its WER isn't a like-for-like architecture comparison.
^g Sherpa-onnx's 20M WER appears only on the model's Hugging Face card; the central docs list no WER and no RTF for the streaming Zipformer models.
^h Picovoice STT benchmark, Streaming-Engines WER table; corroborated by the Moonshine paper. Moonshine is seq2seq, so its "streaming" label is loose and its compute cost (~40× Cheetah) rules it out for mobile streaming.

For technical evaluators

The technical details

VoxRT's ASR is NVIDIA NeMo's 32M-parameter FastConformer hybrid, ported tensor-by-tensor onto a mobile-first Rust runtime. The runtime exposes a stateless C ABI and ships today as native Android (JitPack) and iOS (Swift Package) modules, with both CTC and RNN-T decoders and a measured mobile real-time factor. Below are the numbers and the footprint that lands in your app.

3.27%

WER (RNN-T) on LibriSpeech test-clean — ~40% fewer word errors than Cheetah's 5.4%

0.30

streaming real-time factor on a Snapdragon 662 (measured)

0.08–0.10

streaming RTF on iPhone 13 Pro Max (measured)

80 ms

cache-aware streaming chunk cadence — partials emitted as you speak
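To make the 80 ms cadence concrete, here is a minimal sketch of how audio might be sliced into fixed-size chunks before being fed to a streaming recognizer. The 16 kHz sample rate is an assumption (this page does not state the input format), and `chunk_samples` and `iter_chunks` are hypothetical helper names, not part of any VoxRT API.

```python
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono input, not stated on this page
CHUNK_MS = 80            # chunk cadence quoted above


def chunk_samples(sample_rate_hz: int = SAMPLE_RATE_HZ, chunk_ms: int = CHUNK_MS) -> int:
    """Number of audio samples in one streaming chunk."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of `size` samples; the last one may be shorter."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


one_second = [0.0] * SAMPLE_RATE_HZ          # one second of silence as stand-in audio
chunks = list(iter_chunks(one_second, chunk_samples()))
print(chunk_samples(), len(chunks), len(chunks[-1]))  # prints: 1280 13 640
```

Under that assumption each chunk carries 1,280 samples, so one second of audio yields twelve full chunks plus a half-length remainder.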

What it costs in your app

Swift wrapper source	~20 KB
Native xcframework, compressed (device slice)	~5 MB
Streaming model on disk (fp16)	~61 MB
Native heap at runtime (steady-state)	~150 MB
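As a rough sanity check on the ~61 MB fp16 figure, the sketch below multiplies the parameter count by two bytes per weight. It assumes the "32M" count is exactly 32,000,000 and that the file holds little besides the weights; neither is confirmed on this page, and `weights_size_bytes` is a hypothetical helper.

```python
PARAMS = 32_000_000   # assumption: "32M" taken as exactly 32e6 parameters
BYTES_PER_FP16 = 2    # a half-precision float is 16 bits


def weights_size_bytes(n_params: int, bytes_per_param: int) -> int:
    """Raw storage for the weights alone, ignoring any file metadata."""
    return n_params * bytes_per_param


size = weights_size_bytes(PARAMS, BYTES_PER_FP16)
print(f"{size / 1e6:.0f} MB (decimal), {size / 2**20:.1f} MiB (binary)")
# prints: 64 MB (decimal), 61.0 MiB (binary)
```

The binary-unit result lines up with the quoted on-disk size, which suggests (but does not prove) that the ~61 MB figure is reported in MiB.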

Engine by engine

How each one really compares

VoxRT

MIT runtime · CC-BY-4.0 weights

3.27% WER on the RNN-T head — the lowest published in this on-device field — at a measured 0.30 RTF on a Snapdragon 662 and 0.08–0.10 on an A15, with 80 ms streaming chunks. It's the only engine here publishing a measured cheap-Android RTF. Honest gaps: English-only at v1, ~150 MB resident memory, and Cheetah's model file is smaller.

Picovoice Cheetah

Commercial · free plan limited

Picovoice Cheetah is a closed commercial streaming ASR SDK with published desktop benchmarks, a smaller model file (34 MB), and broader platform and language coverage. VoxRT is more accurate on the published LibriSpeech test-clean numbers (3.27% vs 5.4% WER) and publishes measured mobile RTF where Cheetah publishes desktop only (its desktop figure scales to an estimated ~0.47 on a Snapdragon 662). Cheetah's published word-emission latency is 590 ms; VoxRT streams 80 ms cache-aware chunks. Cheetah is fully commercial, with an opaque paid tier.
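The "~40% fewer word errors" headline is a relative reduction, not a difference in percentage points. A short sketch of the arithmetic, using only the two WER figures quoted above (`relative_reduction` is a hypothetical helper name):

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's word errors that the new system avoids."""
    return (baseline_wer - new_wer) / baseline_wer


cheetah_wer = 5.4   # percent, as quoted above
voxrt_wer = 3.27    # percent, RNN-T head, as quoted above

print(f"absolute gap: {cheetah_wer - voxrt_wer:.2f} percentage points")  # prints 2.13
print(f"relative reduction: {relative_reduction(cheetah_wer, voxrt_wer):.1%}")  # prints 39.4%
```

Both numbers describe the same gap; the relative form is the one behind the ~40% claim, and it inherits every caveat attached to the underlying vendor-reported WERs.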

Whisper.cpp

Runtime MIT · weights separate

The OSS reference everyone cites — but it publishes no mobile WER or RTF at all. The realistic mobile variants are tiny.en (~5.6%) and base.en (~4.4%); base.en is ~1.82 estimated RTF on a Snapdragon 662, slower than real-time for streaming on cheap Android, and larger variants blow past mobile RAM budgets. The runtime is MIT, but OpenAI's model weights carry a separate license worth checking before you ship.

Vosk

Apache-2.0

Permissively licensed, with official mobile bindings and unusually transparent published WER — but the numbers are dated. Vosk Small is 9.85% WER, roughly 3× ours; the accurate Large model is 1.8 GB and about 1.94 estimated RTF on a Snapdragon 662, not real-time. It's a Kaldi HMM/DNN + language-model design rather than end-to-end.

Sherpa-onnx

Apache-2.0

Our closest architectural sibling: streaming Zipformer transducers from the k2 (next-gen Kaldi) project, mobile-targeted and Apache-2.0. The 20M variant's 3.88% WER is genuinely competitive. The catch is disclosure: its central docs publish no WER and no RTF, so you can't evaluate it without running the benchmark yourself. We publish what Sherpa-onnx makes you measure.

Moonshine

MIT · runtime + weights

The freshest entrant (late 2024) and the cleanest license story — MIT on both runtime and weights. But Medium's 5.9% WER comes at roughly 40× Cheetah's compute: our estimate is ~19 RTF on a Snapdragon 662 and ~3.1 on an A15 — unusable for mobile streaming. It's seq2seq, so the "streaming" label is loose.

Read the numbers carefully

Why the figures aren't apples-to-apples

We'd rather hand you the caveats than a tidy leaderboard. Every figure above is vendor-self-reported, and there's no independent academic benchmark covering all of these engines on a common test set. Even the Whisper numbers come from the upstream OpenAI paper, since whisper.cpp publishes none of its own.

Word-error rate isn't perfectly comparable across rows. Vosk is a Kaldi HMM/DNN plus language model rather than an end-to-end network; Sherpa-onnx's 3.88% lives only on a per-model Hugging Face card, not its central docs; ours is on LibriSpeech test-clean with the RNN-T head. Read the ordering as directional.

Mobile real-time factor is the axis nobody but us actually measures on cheap-tier Android. Every competitor's RTF here is estimated — scaled from Picovoice's published desktop Core-Hour data by Geekbench single-core ratios — and it's a lower bound: a vendor's NEON mobile path is usually less optimized than its desktop AVX2 path, so real mobile numbers tend to be worse, not better. We have not run Cheetah on a Snapdragon 662 ourselves; our own 0.30 (SD662) and 0.08–0.10 (A15) are measured.
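The scaling described above can be sketched as a simple proportionality: if a mobile core scores k times lower than the desktop core on a single-core benchmark, assume the engine takes about k times longer on it. The exact formula behind the table is not given on this page, so the function below is an assumed reading of it, and every input in the example is a made-up placeholder rather than a real benchmark score or vendor figure.

```python
def core_hours_to_rtf(cpu_hours: float, audio_hours: float) -> float:
    """Convert 'CPU hours spent on N hours of audio' into a real-time factor."""
    if audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    return cpu_hours / audio_hours


def estimate_mobile_rtf(desktop_rtf: float, desktop_score: float, mobile_score: float) -> float:
    """Scale a desktop RTF by the ratio of single-core benchmark scores.

    Assumes compute time is inversely proportional to the benchmark score,
    which ignores SIMD-path differences, memory bandwidth and thermal limits.
    """
    if desktop_score <= 0 or mobile_score <= 0:
        raise ValueError("benchmark scores must be positive")
    return desktop_rtf * (desktop_score / mobile_score)


# Hypothetical placeholder inputs, chosen only to make the arithmetic easy to follow.
desktop_rtf = core_hours_to_rtf(cpu_hours=5.0, audio_hours=100.0)   # 0.05
estimate = estimate_mobile_rtf(desktop_rtf, desktop_score=2000.0, mobile_score=400.0)
print(f"{estimate:.2f}")  # prints 0.25
```

Because the model ignores how well each engine's ARM code path is optimized, an estimate produced this way is best read as a floor on the real mobile cost, consistent with the lower-bound caveat above.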

What's solid: our accuracy lead on the published numbers, our 80 ms cache-aware streaming cadence, and the fact that we're the only engine here publishing a measured cheap-Android RTF at all.

FAQ

On-device speech-to-text, answered

What is the most accurate on-device speech-to-text?

On published LibriSpeech test-clean word-error rate, VoxRT's 3.27% (RNN-T head) leads the on-device English field we surveyed — versus Picovoice Cheetah 5.4%, Vosk Small 9.85%, Whisper base.en ~4.4% and Moonshine Medium 5.9%. The only sub-4% competitor is Sherpa-onnx's 20M streaming Zipformer at 3.88%, a number published only on a Hugging Face model card. There is no independent third-party benchmark covering every engine on a common test set, so these are vendor-self-reported figures.

How does VoxRT compare to Picovoice Cheetah?

Picovoice Cheetah is a closed commercial streaming ASR SDK that publishes desktop benchmarks. VoxRT is more accurate on the published LibriSpeech test-clean numbers — 3.27% versus 5.4% word-error rate — publishes measured mobile real-time factor rather than desktop-only, and emits partials every 80 ms.

Can I run Whisper on a phone in real time?

The realistic mobile Whisper variants are tiny.en (~5.6% WER) and base.en (~4.4%). By our estimate, base.en has a real-time factor of roughly 1.82 on a Snapdragon 662, meaning it needs about 1.82 seconds of single-core compute per second of audio. That is slower than real-time, so it is not viable for streaming on cheap Android, and the larger variants exceed typical mobile RAM budgets. Note also that whisper.cpp's runtime is MIT, but OpenAI's Whisper weights ship under a separate license that you should confirm before commercial mobile redistribution.

Is Vosk accurate enough for production?

Vosk is Apache-2.0 with official mobile bindings and unusually transparent published WER, but the numbers are dated: Vosk Small is 9.85% WER, roughly 3× VoxRT's, and the accurate Large model is 1.8 GB and not real-time on cheap Android. Vosk is a Kaldi HMM/DNN plus language-model design rather than an end-to-end CTC/RNN-T model, so its WER is not a like-for-like architecture comparison.

What is a usable real-time factor for streaming ASR on mobile?

Real-time factor is the fraction of one CPU core needed to keep up with audio; below 1.0 is real-time, and lower leaves headroom for the rest of your app. VoxRT measures 0.30 on a Snapdragon 662 and 0.08 to 0.10 on an iPhone 13 Pro Max. Most competitors publish only desktop numbers; scaled to a Snapdragon 662, several (Whisper base.en, Vosk Large, Moonshine Medium) have an estimated RTF above 1.0 and are not real-time on cheap Android.
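If you want to measure RTF on your own target device, the definition reduces to processing time divided by audio duration. The sketch below times an arbitrary callable; the workload is a stand-in rather than a real recognizer, the helper names are hypothetical, and wall-clock time only equals "fraction of one core" when the work is single-threaded and CPU-bound.

```python
import time
from typing import Callable


def real_time_factor(process: Callable[[], None], audio_seconds: float) -> float:
    """Wall-clock processing time divided by the duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    process()
    return (time.perf_counter() - start) / audio_seconds


def core_headroom(rtf: float) -> float:
    """Fraction of one core left for the rest of the app (0 if not real-time)."""
    return max(0.0, 1.0 - rtf)


# Stand-in workload; a real measurement would run the recognizer over a known clip.
rtf = real_time_factor(lambda: sum(range(100_000)), audio_seconds=10.0)
print(f"stand-in workload RTF: {rtf:.6f}")
print(f"headroom at an RTF of 0.30: {core_headroom(0.30):.0%}")  # prints 70%
```

In practice you would repeat the measurement several times on a warmed-up device and report the spread, since thermal throttling and background load can move the number.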

Which on-device speech-to-text should you ship?