How we measure — and why vendor numbers aren't comparable
Every performance figure VoxRT publishes follows the rules on this page: measured numbers are measured on real, named hardware; estimated numbers are labeled as estimates; and cross-vendor comparisons come with the caveats the field usually leaves out. This is the methodology behind our wake-word, VAD, and ASR comparison pages.
The three numbers that matter
Real-time factor (RTF)
RTF = processing time ÷ audio duration
An RTF of 0.01 means one second of audio takes 10 ms to process, so keeping up with live speech uses ~1% of one CPU core. Below 1.0 is real-time; lower leaves more headroom for the rest of your app. We report RTF as a fraction (0.021) or the equivalent core percentage (2.1%).
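As a minimal sketch of the definition above, the conversion between the two reporting forms looks like this. The function names are illustrative and are not part of any VoxRT SDK.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def core_percent(rtf: float) -> float:
    """Express an RTF as the percentage of one CPU core needed for live audio."""
    return rtf * 100.0


# 10 ms of processing for 1 s of audio, as in the example above.
rtf = real_time_factor(0.010, 1.0)
print(f"RTF {rtf:.3f} = {core_percent(rtf):.1f}% of one core, real-time: {rtf < 1.0}")
```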
Word error rate (WER)
WER = (substitutions + insertions + deletions) ÷ reference words
The standard speech-to-text accuracy metric; lower is better. We report WER on LibriSpeech test-clean, the most widely published English test set, and name the decoder (RNN-T 3.267%, CTC 4.895%), because different decoders on the same model produce different WER.
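The formula can be sketched as a word-level edit distance divided by the reference length. This is an illustrative implementation, not the scoring script behind the published figures. It assumes whitespace tokenization and no text normalization, whereas real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the") and one substitution ("light" -> "lights") over 5 reference words.
print(word_error_rate("turn on the kitchen light", "turn on kitchen lights"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.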
Wake-word accuracy
precision / recall at a threshold + ROC AUC
A single "accuracy" number hides the trade-off, so we publish the full threshold sweep: at the 0.90 default, precision 0.993 / recall 0.982; ROC AUC 0.9966 across all thresholds. The test set matters as much as the metric; see the held-out test set described below.
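A short sketch shows how these metrics relate to raw detector scores. The scores and labels below are toy values, not VoxRT results, and the rule that a score at or above the threshold counts as a detection is an assumption.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores >= threshold count as detections."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # By convention here, a ratio with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: three wake-word utterances (1) and three hard negatives (0).
scores = [0.97, 0.93, 0.85, 0.95, 0.40, 0.10]
labels = [1, 1, 1, 0, 0, 0]

for threshold in (0.50, 0.90):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold {threshold:.2f}: precision {p:.3f}, recall {r:.3f}")
print(f"ROC AUC {roc_auc(scores, labels):.3f}")
```

Precision and recall move with the threshold, while ROC AUC summarizes the ranking across all thresholds, which is why both are reported.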
Measured on the devices users actually carry
Most on-device voice vendors publish desktop or Raspberry Pi numbers. We measure on phones — including a deliberately cheap one — because a voice feature that only performs on flagships fails in the field.
| Device | Silicon | Why it's in the set | Used for |
|---|---|---|---|
| Xiaomi Redmi 9C (2020) | Snapdragon 662, Cortex-A73 | The cheap-tier Android reference — the floor most real user bases include | Wake word, VAD, ASR (primary Android numbers) |
| Samsung Galaxy S9+ (2018) | Snapdragon 845 | An older flagship — different core generation than the Redmi | ASR secondary Android reference |
| iPhone 13 Pro Max (2021) | Apple A15 Bionic | The iOS reference point | Wake word, VAD, ASR (primary iOS numbers) |
| Raspberry Pi Zero 2 W | Cortex-A53 @ 1.0 GHz | The embedded floor — if it runs here, it runs on SBCs | Wake word (Linux SDK, measured sustained ≥60 s) |
| MacBook Pro (M4) | Apple M4 | Modern desktop reference for the WebAssembly build | Browser wake word |
How the measured numbers are produced
Single-thread, release builds, post-warmup. RTF is measured on one thread in release builds after warmup. Headline figures come from the device's big cores: for the wake word on the Snapdragon 662, the 0.021 figure holds in both scheduler-default and pinned high-performance modes. The efficiency-core low-power mode runs ~0.071, and we publish that figure alongside the big-core one rather than only the flattering one.
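The post-warmup, single-thread measurement can be sketched as a small timing harness. This is an illustration of the procedure, not the harness used for the published figures: `process`, the chunk length, and the warmup count are hypothetical, and the sketch does not pin the thread to a particular core, which an on-device measurement would handle in the native SDK.

```python
import time


def measure_rtf(process, audio_chunks, chunk_seconds, warmup_chunks=50):
    """Time `process` over audio chunks on the calling thread, excluding warmup."""
    # Warmup: run the engine without timing so caches and lazy
    # initialization do not count against the measurement.
    for chunk in audio_chunks[:warmup_chunks]:
        process(chunk)

    timed = audio_chunks[warmup_chunks:]
    if not timed:
        raise ValueError("no chunks left to time after warmup")

    start = time.perf_counter()
    for chunk in timed:
        process(chunk)
    elapsed = time.perf_counter() - start

    # RTF = processing time / duration of the audio that was timed.
    return elapsed / (len(timed) * chunk_seconds)


# Stand-in workload: 20 ms chunks of 16 kHz 16-bit silence and a dummy "engine".
chunks = [bytes(640)] * 200
rtf = measure_rtf(lambda chunk: sum(chunk), chunks, chunk_seconds=0.020)
print(f"RTF {rtf:.4f}")
```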
File replay and live microphone are reported separately. Live-mic capture adds measurable overhead: our ASR streaming RTF on the Snapdragon 662 is 0.302 on file replay and 0.353 on a live microphone, about 17% higher. Comparison tables quote the file-replay figure and footnote the live-mic one.
Accuracy is measured on a deliberately hard, held-out test set. The wake-word split is 5,240 positive utterances plus 6,416 hard negatives — isolated "Hey", isolated "Assistant", competitor wake words like "Hey Siri", phonetic neighbours, arbitrary speech, and non-speech audio — with all speakers disjoint from training and validation. Easy negatives inflate accuracy; phonetic neighbours are what break wake words in production.
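The speaker-disjointness requirement is easy to check mechanically. The sketch below assumes each split can be reduced to a set of speaker IDs; the split names and IDs are hypothetical, not taken from the actual dataset.

```python
def assert_speaker_disjoint(splits: dict) -> None:
    """Raise ValueError if any speaker ID appears in more than one split."""
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = splits[first] & splits[second]
            if overlap:
                raise ValueError(
                    f"speakers shared by {first} and {second}: {sorted(overlap)}"
                )


# Hypothetical speaker IDs per split.
splits = {
    "train": {"spk001", "spk002", "spk003"},
    "validation": {"spk010", "spk011"},
    "test": {"spk020", "spk021"},
}
assert_speaker_disjoint(splits)  # passes silently when no speaker is shared
```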
Ported models are verified against their upstream reference. Our ASR is NVIDIA NeMo's FastConformer rebuilt tensor-by-tensor on the VoxRT runtime; we verify WER stays within floating-point noise of the upstream Python reference (3.267% RNN-T / 4.895% CTC on LibriSpeech test-clean) so the port itself never silently costs accuracy.
Anyone can re-measure. The SDKs and models are public on GitHub, and the browser demo includes a benchmark that generates 60 seconds of audio in your browser and measures RTF on your own device.
How competitor numbers are handled
Most vendors publish no mobile numbers at all. When our comparison tables need one, we scale the vendor's own published desktop figure by Geekbench 6 single-core ratios between the published device and the target phone — roughly 3× between a desktop Ryzen and a Snapdragon 662. Every scaled figure is labeled estimated, never presented as a measurement.
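The scaling step amounts to one multiplication, sketched below. The benchmark scores and the desktop RTF are placeholder values chosen to give a 3× ratio; they are not real Geekbench results or any vendor's published figure, and the names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledFigure:
    rtf: float
    label: str = "estimated"  # a scaled figure is never presented as a measurement


def scale_desktop_rtf(desktop_rtf: float, desktop_score: float, phone_score: float) -> ScaledFigure:
    """Scale a published desktop RTF by the single-core score ratio."""
    if desktop_score <= 0 or phone_score <= 0:
        raise ValueError("benchmark scores must be positive")
    ratio = desktop_score / phone_score  # > 1 when the desktop core is faster
    return ScaledFigure(rtf=desktop_rtf * ratio)


# Placeholder inputs: a desktop RTF of 0.004 and single-core scores of 1500 vs 500.
estimate = scale_desktop_rtf(0.004, desktop_score=1500, phone_score=500)
print(f"{estimate.rtf:.3f} ({estimate.label})")
```

Carrying the label with the number makes it harder for a scaled figure to end up in a table without its caveat.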
These estimates are lower bounds on the true mobile cost: a vendor's mobile NEON path is usually less optimized than its desktop AVX2 path, so real mobile numbers tend to be worse, not better. When an estimate lands close to our measured figure — as with Picovoice Porcupine's scaled ~1.8% against our measured 2.1% on the Snapdragon 662 — we call it a tie, not an advantage.
Accuracy claims from competitors are quoted from their own published benchmarks, with the source named — we do not re-measure competitor accuracy ourselves.
Why cross-vendor numbers aren't apples-to-apples
There is no independent third-party benchmark covering on-device voice engines on a common test set — every number in this field, including ours, is vendor-self-reported. Four things break comparability:
Different data. VAD accuracy is reported on different datasets at different frame granularities. Wake-word accuracy depends on the specific phrase tested — a custom phrase is consistently harder than a vendor's tuned built-in keywords, so never read a custom number against a built-in one.
Different architectures. A Kaldi HMM/DNN with a language model and an end-to-end RNN-T produce WER figures that aren't like-for-like even on the same test set.
Different hardware and modes. A desktop Threadripper, a Raspberry Pi Zero, an A8 iPhone, and a Snapdragon 662 are not interchangeable reference points, and default-versus-pinned scheduling changes results on big.LITTLE chips.
Different disclosure. Some vendors publish full tables; some publish a chart image; some publish a claim with no number. A missing number is information too.
What is solid and comparable across vendors: license terms, footprint, whether there's a first-party native mobile SDK, and whether mobile numbers are published at all. Our comparison tables lean on those axes and treat every performance ordering as directional.
Where the numbers live
VoxRT's measured figures are published in the SDK repos — wake word, VAD, ASR — and mirrored in our docs. Competitor figures are cited to their sources on each comparison page: Picovoice's wake-word and speech-to-text benchmarks, the Silero VAD wiki, the TEN VAD README, the OpenAI Whisper paper, the Vosk model registry, and NVIDIA NeMo model cards.
Methodology, answered
What is real-time factor (RTF)?
Processing time divided by audio duration. An RTF of 0.01 means one second of audio takes 10 ms to process — keeping up with live speech uses about 1% of one CPU core. Below 1.0 is real-time; lower leaves more headroom for the rest of the app.
Why benchmark on a Snapdragon 662?
The Snapdragon 662 (Xiaomi Redmi 9C, 2020) represents the cheap-tier Android hardware real user bases actually carry. A voice feature that only works on flagships fails in the field — and since most vendors publish only desktop or Raspberry Pi numbers, cheap-Android performance is the axis buyers can't otherwise evaluate.
Are VoxRT's numbers independently verified?
No — like every vendor in this field, ours are self-reported, and no independent third-party benchmark covers these engines on a common test set. We mitigate that by publishing the setup, labeling every estimate as an estimate, shipping the SDKs publicly so anyone can re-measure, and offering an in-browser benchmark that measures RTF on your own device.
How are competitors' mobile numbers estimated?
By scaling their own published desktop figures by Geekbench 6 single-core ratios (~3× between a desktop Ryzen and a Snapdragon 662). Estimates are lower bounds — mobile NEON paths are usually less optimized than desktop AVX2 — and are always labeled as estimates.
Why aren't cross-vendor numbers directly comparable?
Different data, different architectures, different hardware, and different disclosure. Any ordering such numbers suggest is directional, not a like-for-like ranking. What is comparable: license terms, footprint, native SDK availability, and whether mobile numbers are published at all.