Which on-device speech-to-text should you ship?
On-device ASR turns speech into text right on the phone: no cloud round-trip, no audio leaving the device. This page compares VoxRT with the other on-device options (Picovoice Cheetah, Whisper.cpp, Vosk, Sherpa-onnx and Moonshine) in plain terms: how accurate they are, how fast they run, and how they're licensed. VoxRT's edge is concrete: the lowest published word-error rate in the on-device field, a measured cheap-Android real-time factor, and 80 ms streaming chunks.
What you get with VoxRT ASR
Streaming partials
Partial transcripts every 80 ms — no batch step required.
CTC and RNN-T
Both decoders exposed for your latency / accuracy trade-off.
32M FastConformer
Research-grade accuracy running entirely on-device.
Lowest published WER
3.27% on LibriSpeech test-clean — within float-noise of the upstream NeMo reference.
Fully on-device
No cloud, no audio leaving the device — and no per-use fees.
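To make the 80 ms cadence concrete, here is a minimal sketch of how an app might slice captured audio into fixed-size chunks before handing them to a streaming recognizer. The 16 kHz mono sample rate and the helper names `samples_per_chunk` and `iter_chunks` are assumptions for illustration; they are not taken from the VoxRT API.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono PCM (not stated on this page)
CHUNK_MS = 80            # streaming chunk length quoted above


def samples_per_chunk(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      chunk_ms: int = CHUNK_MS) -> int:
    """Number of PCM samples in one streaming chunk."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(pcm: Sequence[int], chunk_len: int) -> Iterator[List[int]]:
    """Yield fixed-size chunks; the final chunk is zero-padded to full length."""
    for start in range(0, len(pcm), chunk_len):
        chunk = list(pcm[start:start + chunk_len])
        chunk.extend([0] * (chunk_len - len(chunk)))
        yield chunk


if __name__ == "__main__":
    chunk_len = samples_per_chunk()
    one_second = [0] * SAMPLE_RATE_HZ
    print(chunk_len, len(list(iter_chunks(one_second, chunk_len))))
```

Under that sample-rate assumption each chunk holds 1,280 samples, so one second of audio produces 13 chunks, the last of them half padding.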
The comparison table
| Engine | WER, LibriSpeech test-clean | Model size | RTF, Snapdragon 662 | RTF, iPhone A15 | License |
|---|---|---|---|---|---|
| VoxRT streaming-medium-pc (32M) | 3.27% RNN-T [a]; 4.90% CTC | ~61 MB fp16 on disk | 0.30 measured | 0.08–0.10 measured | MIT (runtime) + CC-BY-4.0 (weights) |
| Picovoice Cheetah (commercial · streaming) | 5.4% [b] | 34 MB | ~0.47 est. [c] | ~0.077 est. [c] | Commercial; free plan limited |
| Whisper.cpp base.en (OpenAI · OSS runtime) | ~4.4% [d]; tiny.en ~5.6% | 142 MiB | ~1.82 est. [c]; slower than real-time | ~0.29 est. [c] | MIT runtime + separate weights license [e] |
| Vosk Small en-us-0.15 (Kaldi) | 9.85% [f] | 40 MB | ~0.68 est. [c] | ~0.11 est. [c] | Apache-2.0 |
| Sherpa-onnx streaming Zipformer 20M | 3.88% [g]; HF card only | ~80 MB int8 | Not published | Not published | Apache-2.0 |
| Moonshine Medium (Useful Sensors · seq2seq) | 5.9% [h] | Not published | ~19 est. [c]; not real-time | ~3.1 est. [c]; not real-time | MIT (runtime + weights) |
WER = word-error rate on LibriSpeech test-clean; lower is better. RTF = real-time factor: the fraction of one CPU core needed to keep up with audio in real time; below 1.0 is real-time. Every competitor's mobile RTF here is estimated (only VoxRT's is measured), and WER figures are vendor-self-reported on different setups, so they are directionally indicative, not directly comparable (see the methodology note below).

Sources:
- [a] Internal VoxRT measurement on LibriSpeech test-clean: RNN-T head 3.27%, CTC head 4.90%. RTF measured single-thread on a Snapdragon 662 (A73) and an iPhone 13 Pro Max (A15).
- [b] Picovoice speech-to-text-benchmark, LibriSpeech test-clean, measured on an AMD Ryzen 9 5900X desktop.
- [c] Estimated: these vendors publish desktop (or no) numbers only; mobile RTF is scaled from Picovoice's published desktop Core-Hour data by Geekbench 6 single-core ratios (see methodology). It is a lower bound on their true mobile cost; we have not measured them on a Snapdragon 662 ourselves.
- [d] Upstream OpenAI Whisper paper WER for the .en variants; whisper.cpp publishes no first-party WER table.
- [e] whisper.cpp's runtime is MIT, but OpenAI's Whisper weights ship under a separate license; confirm before commercial mobile redistribution.
- [f] Vosk model registry (alphacephei.com/vosk/models). Vosk is a Kaldi HMM/DNN + language model, not end-to-end CTC/RNN-T, so its WER isn't a like-for-like architecture comparison.
- [g] Sherpa-onnx's 20M WER appears only on the model's Hugging Face card; the central docs list no WER and no RTF for the streaming Zipformer models.
- [h] Picovoice STT benchmark, Streaming-Engines WER table; corroborated by the Moonshine paper. Moonshine is seq2seq, so its "streaming" label is loose and its compute cost (~40× Cheetah) rules it out for mobile streaming.
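For readers who want to reproduce a WER figure on their own recordings, the sketch below shows the usual calculation: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are illustrative assumptions; vendors normalize text differently (punctuation, numerals, casing), which is one reason the published numbers are not directly comparable.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

One dropped word out of six reference words gives a WER of about 16.7% for that toy pair. A corpus-level figure sums errors and reference words over all utterances rather than averaging per-utterance rates.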
The technical details
VoxRT's ASR is NVIDIA NeMo's 32M-parameter FastConformer hybrid, ported tensor-by-tensor onto a mobile-first Rust runtime — a stateless C ABI, shipped today as native Android (JitPack) and iOS (Swift Package) modules, with both CTC and RNN-T decoders and measured mobile real-time factor. Below are the numbers and the footprint that lands in your app.
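Before the footprint numbers, a note on the decoder choice. The CTC head scores each audio frame independently, so its simplest decoding is a single pass: take the best token per frame, collapse repeats, and drop blanks. The RNN-T head also conditions on previously emitted tokens, which costs more compute. The sketch below is a generic greedy CTC decoder, not VoxRT's implementation; the function name, the list-of-scores input, and `blank_id = 0` are assumptions for illustration.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank_id: int = 0) -> List[int]:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, remove blanks.

    frame_scores[t][k] is the score of token k at frame t.
    blank_id is an assumption; real models define their own blank index.
    """
    decoded: List[int] = []
    previous = blank_id
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the token changes and is not the blank symbol.
        if best != blank_id and best != previous:
            decoded.append(best)
        previous = best
    return decoded


if __name__ == "__main__":
    def one_hot(index: int, size: int = 3) -> List[float]:
        return [1.0 if k == index else 0.0 for k in range(size)]

    frames = [one_hot(i) for i in [1, 1, 0, 1, 2, 2, 0]]
    print(ctc_greedy_decode(frames))
```

A blank between two identical tokens is what lets CTC emit the same token twice in a row, which is why the example keeps both occurrences of token 1.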
What it costs in your app
- Swift wrapper source: ~20 KB
- Native xcframework, compressed (device slice): ~5 MB
- Streaming model on disk (fp16): ~61 MB
- Native heap at runtime (steady-state): ~150 MB
How each one really compares
VoxRT
MIT runtime · CC-BY-4.0 weights
3.27% WER on the RNN-T head — the lowest published in this on-device field — at a measured 0.30 RTF on a Snapdragon 662 and 0.08–0.10 on an A15, with 80 ms streaming chunks. It's the only engine here publishing a measured cheap-Android RTF. Honest gaps: English-only at v1, ~150 MB resident memory, and Cheetah's model file is smaller.
Picovoice Cheetah
Commercial · free plan limited
Picovoice Cheetah is a closed commercial streaming ASR SDK with published desktop benchmarks, a smaller model file (34 MB), and broader platform and language coverage. VoxRT is more accurate on the published LibriSpeech test-clean numbers (3.27% vs 5.4% WER) and publishes measured mobile RTF where Cheetah publishes desktop only (its desktop figure scales to an estimated ~0.47 on a Snapdragon 662). Cheetah's published word-emission latency is 590 ms; VoxRT streams 80 ms cache-aware chunks, though chunk cadence and word-emission latency are different measurements. Cheetah is fully commercial, with an opaque paid tier.
Whisper.cpp
Runtime MIT · weights separate
The OSS reference everyone cites — but it publishes no mobile WER or RTF at all. The realistic mobile variants are tiny.en (~5.6%) and base.en (~4.4%); base.en is ~1.82 estimated RTF on a Snapdragon 662, slower than real-time for streaming on cheap Android, and larger variants blow past mobile RAM budgets. The runtime is MIT, but OpenAI's model weights carry a separate license worth checking before you ship.
Vosk
Apache-2.0
Permissively licensed, with official mobile bindings and unusually transparent published WER — but the numbers are dated. Vosk Small is 9.85% WER, roughly 3× ours; the accurate Large model is 1.8 GB and about 1.94 estimated RTF on a Snapdragon 662, not real-time. It's a Kaldi HMM/DNN + language-model design rather than end-to-end.
Sherpa-onnx
Apache-2.0
Our closest architectural sibling: streaming Zipformer transducers from the k2-fsa (Next-gen Kaldi) project, mobile-targeted and Apache-2.0. The 20M variant's 3.88% WER is genuinely competitive. The catch is disclosure: its central docs publish no WER and no RTF, so you can't evaluate it without running the benchmark yourself. We publish what Sherpa-onnx makes you measure.
Moonshine
MIT · runtime + weights
The freshest entrant (late 2024) and the cleanest license story — MIT on both runtime and weights. But Medium's 5.9% WER comes at roughly 40× Cheetah's compute: our estimate is ~19 RTF on a Snapdragon 662 and ~3.1 on an A15 — unusable for mobile streaming. It's seq2seq, so the "streaming" label is loose.
Why the figures aren't apples-to-apples
We'd rather hand you the caveats than a tidy leaderboard. Every figure above is vendor-self-reported, and there's no independent academic benchmark covering all of these engines on a common test set. The Whisper numbers even come from the upstream OpenAI paper, since whisper.cpp publishes none of its own.
Word-error rate isn't perfectly comparable across rows. Vosk is a Kaldi HMM/DNN plus language model rather than an end-to-end network; Sherpa-onnx's 3.88% lives only on a per-model Hugging Face card, not its central docs; ours is on LibriSpeech test-clean with the RNN-T head. Read the ordering as directional.
Mobile real-time factor is the axis nobody but us actually measures on cheap-tier Android. Every competitor's RTF here is estimated, scaled from Picovoice's published desktop Core-Hour data by Geekbench 6 single-core ratios, and it's a lower bound: a vendor's NEON mobile path is usually less optimized than its desktop AVX2 path, so real mobile numbers tend to be worse, not better. We have not run Cheetah on a Snapdragon 662 ourselves; our own 0.30 (SD662) and 0.08–0.10 (A15) are measured.
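The scaling described above can be sketched in a few lines. This is our reading of the method, not a published formula: it assumes a Core-Hour figure means CPU hours spent per batch of audio hours, and that compute time scales inversely with a single-core benchmark score. The function names and every number in the example are placeholders, not real benchmark data.

```python
def desktop_rtf(core_hours: float, audio_hours: float) -> float:
    """Desktop real-time factor: CPU hours spent per hour of audio processed."""
    if audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    return core_hours / audio_hours


def scale_rtf_to_mobile(rtf_desktop: float,
                        desktop_single_core_score: float,
                        mobile_single_core_score: float) -> float:
    """Estimate mobile RTF, assuming speed is proportional to the benchmark score."""
    if mobile_single_core_score <= 0:
        raise ValueError("mobile_single_core_score must be positive")
    return rtf_desktop * desktop_single_core_score / mobile_single_core_score


if __name__ == "__main__":
    # Placeholder inputs for illustration only; substitute published figures.
    rtf = desktop_rtf(core_hours=2.0, audio_hours=100.0)
    estimate = scale_rtf_to_mobile(rtf,
                                   desktop_single_core_score=2000.0,
                                   mobile_single_core_score=400.0)
    print(f"desktop RTF {rtf:.3f} -> estimated mobile RTF {estimate:.3f}")
```

Because the ratio only captures raw single-core speed, it ignores how well each engine's kernels are tuned for the target CPU, which is why the result is treated as a lower bound.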
What's solid: our accuracy lead on the published numbers, our 80 ms cache-aware streaming cadence, and the fact that we're the only engine here publishing a measured cheap-Android RTF at all.
On-device speech-to-text, answered
What is the most accurate on-device speech-to-text?
On published LibriSpeech test-clean word-error rate, VoxRT's 3.27% (RNN-T head) leads the on-device English field we surveyed — versus Picovoice Cheetah 5.4%, Vosk Small 9.85%, Whisper base.en ~4.4% and Moonshine Medium 5.9%. The only sub-4% competitor is Sherpa-onnx's 20M streaming Zipformer at 3.88%, a number published only on a Hugging Face model card. There is no independent third-party benchmark covering every engine on a common test set, so these are vendor-self-reported figures.
How does VoxRT compare to Picovoice Cheetah?
Picovoice Cheetah is a closed commercial streaming ASR SDK that publishes desktop benchmarks. VoxRT is more accurate on the published LibriSpeech test-clean numbers — 3.27% versus 5.4% word-error rate — publishes measured mobile real-time factor rather than desktop-only, and emits partials every 80 ms.
Can I run Whisper on a phone in real time?
The realistic mobile Whisper variants are tiny.en (~5.6% WER) and base.en (~4.4%). By our estimate, base.en has a real-time factor of roughly 1.82 on a Snapdragon 662, meaning about 1.82 seconds of compute per second of audio. That is slower than real-time, so it is not viable for streaming on cheap Android, and the larger variants exceed typical mobile RAM budgets. Note also that whisper.cpp's runtime is MIT, but OpenAI's Whisper weights ship under a separate license that you should confirm before commercial mobile redistribution.
Is Vosk accurate enough for production?
Vosk is Apache-2.0 with official mobile bindings and unusually transparent published WER, but the numbers are dated: Vosk Small is 9.85% WER, roughly 3× VoxRT's, and the accurate Large model is 1.8 GB and not real-time on cheap Android. Vosk is a Kaldi HMM/DNN plus language-model design rather than an end-to-end CTC/RNN-T model, so its WER is not a like-for-like architecture comparison.
What is a usable real-time factor for streaming ASR on mobile?
Real-time factor is the fraction of one CPU core needed to keep up with audio; below 1.0 is real-time, and lower leaves headroom for the rest of your app. VoxRT measures 0.30 on a Snapdragon 662 and 0.08 to 0.10 on an iPhone 13 Pro Max. Most competitors publish only desktop numbers; scaled to a Snapdragon 662, several (Whisper base.en, Vosk Large, Moonshine Medium) exceed 1.0 and are not real-time on cheap Android.
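If you want to check a real-time factor on your own device, the measurement itself is simple: time the recognizer on a clip of known length and divide. The sketch below assumes a hypothetical zero-argument `transcribe` callable that wraps whichever engine you are testing; none of these names come from a real SDK.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / audio duration; below 1.0 keeps up with live audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe: Callable[[], None], audio_seconds: float) -> float:
    """Time one call to a (hypothetical) transcribe function over a known clip."""
    start = time.perf_counter()
    transcribe()
    return real_time_factor(time.perf_counter() - start, audio_seconds)


def compute_ms_per_chunk(rtf: float, chunk_ms: float = 80.0) -> float:
    """Average compute time spent on each streaming chunk at a given RTF."""
    return rtf * chunk_ms


if __name__ == "__main__":
    # At the measured Snapdragon 662 figure, each 80 ms chunk costs about 24 ms.
    print(compute_ms_per_chunk(0.30))
```

Run the measurement single-threaded and on a warmed-up model, and repeat it a few times, since thermal throttling on phones can shift the result between runs.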