Benchmark methodology

How we measure — and why vendor numbers aren't comparable

Every performance figure VoxRT publishes follows the rules on this page: measured numbers are measured on real, named hardware; estimated numbers are labeled as estimates; and cross-vendor comparisons come with the caveats the field usually leaves out. This is the methodology behind our wake-word, VAD, and ASR comparison pages.

Definitions

The three numbers that matter

Real-time factor (RTF)

RTF = processing time ÷ audio duration

An RTF of 0.01 means one second of audio takes 10 ms to process — keeping up with live speech uses ~1% of one CPU core. Below 1.0 is real-time; lower leaves more headroom for the rest of your app. We report RTF as a fraction (0.021) or the equivalent core percentage (2.1%).
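
As a minimal sketch of the arithmetic above (the function names are illustrative and not part of any VoxRT SDK), RTF and its core-percentage form can be computed like this:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def core_percentage(rtf: float) -> float:
    """Express an RTF as the share of one CPU core needed to keep up with live audio."""
    return rtf * 100.0


# One second of audio processed in 10 ms, as in the example above.
rtf = real_time_factor(processing_seconds=0.010, audio_seconds=1.0)
print(f"RTF {rtf:.3f} = {core_percentage(rtf):.1f}% of one core")
```

Anything below an RTF of 1.0 keeps up with live audio; the gap between the RTF and 1.0 is the headroom left on that core.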

Word error rate (WER)

WER = (substitutions + insertions + deletions) ÷ words in the reference transcript

The standard speech-to-text accuracy metric; lower is better. We report WER on LibriSpeech test-clean, the most widely published English test set, and name the decoder (RNN-T 3.267%, CTC 4.895%), because different decoders on the same model give different WER.
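
The formula can be illustrated with a word-level edit distance. This is a generic sketch, not the scoring script behind the figures above: it assumes whitespace tokenization and no text normalization, whereas published evaluations typically normalize case and punctuation and sum errors over the whole test set before dividing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.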

Wake-word accuracy

precision / recall at a threshold + ROC AUC

A single "accuracy" number hides the trade-off, so we publish the full threshold sweep: at the 0.90 default, precision 0.993 / recall 0.982; ROC AUC 0.9966 across all thresholds. The test set matters as much as the metric — see below.
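
To make the metrics concrete, here is a small sketch of precision and recall at a threshold, plus ROC AUC computed as the probability that a random positive outscores a random negative. The scores and labels are toy values for illustration, not VoxRT data, and the pairwise AUC loop is chosen for clarity rather than speed.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold counts as a detection."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # no detections: no false alarms
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(scores, labels):
    """Threshold-free ranking quality; ties between a positive and a negative count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = wake phrase spoken, 0 = hard negative.
scores = [0.99, 0.95, 0.85, 0.97, 0.40, 0.10]
labels = [1, 1, 1, 0, 0, 0]

for threshold in (0.50, 0.90):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold {threshold:.2f}: precision {p:.3f}, recall {r:.3f}")
print(f"ROC AUC {roc_auc(scores, labels):.4f}")
```

Raising the threshold trades recall for precision, which is why a single operating point is reported alongside the threshold-independent AUC.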

Hardware

Measured on the devices users actually carry

Most on-device voice vendors publish desktop or Raspberry Pi numbers. We measure on phones — including a deliberately cheap one — because a voice feature that only performs on flagships fails in the field.

| Device | Silicon | Why it's in the set | Used for |
|---|---|---|---|
| Xiaomi Redmi 9C (2020) | Snapdragon 662, Cortex-A73 | The cheap-tier Android reference: the floor most real user bases include | Wake word, VAD, ASR (primary Android numbers) |
| Samsung Galaxy S9+ (2018) | Snapdragon 845 | An older flagship, with a different core generation than the Redmi | ASR secondary Android reference |
| iPhone 13 Pro Max (2021) | Apple A15 Bionic | The iOS reference point | Wake word, VAD, ASR (primary iOS numbers) |
| Raspberry Pi Zero 2 W | Cortex-A53 @ 1.0 GHz | The embedded floor: if it runs here, it runs on SBCs | Wake word (Linux SDK, measured sustained ≥60 s) |
| MacBook Pro (M4) | Apple M4 | Modern desktop reference for the WebAssembly build | Browser wake word |

Measurement rules

How the measured numbers are produced

Single-thread, release builds, post-warmup. RTF is measured on one thread in release builds after warmup, on the device's big cores. For the wake word on the Snapdragon 662, the 0.021 figure holds in both scheduler-default and pinned high-performance modes; the efficiency-core low-power mode runs ~0.071 — we publish both rather than only the flattering one.
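
The shape of a post-warmup, single-thread measurement can be sketched as follows. This is an illustration in Python, not the harness used for the published figures, which run as native release builds on the device. The `process` callback, the chunk size, and the warmup count are all assumptions made for the example, and core pinning is left out because it is platform-specific.

```python
import time
from typing import Callable, Sequence


def measure_rtf(
    process: Callable[[Sequence[float]], object],  # hypothetical engine entry point
    chunks: Sequence[Sequence[float]],
    chunk_seconds: float,
    warmup_chunks: int = 50,
) -> float:
    """Time `process` over audio chunks on the calling thread, excluding warmup."""
    # Warmup: let caches, allocators, and CPU frequency scaling settle, untimed.
    for chunk in chunks[:warmup_chunks]:
        process(chunk)

    timed = chunks[warmup_chunks:]
    if not timed:
        raise ValueError("not enough chunks left to time after warmup")

    start = time.perf_counter()
    for chunk in timed:
        process(chunk)
    elapsed = time.perf_counter() - start

    # Processing time divided by the duration of the audio that was timed.
    return elapsed / (len(timed) * chunk_seconds)


# Stand-in workload: 10 ms chunks of silence at 16 kHz and a dummy "engine".
chunks = [[0.0] * 160 for _ in range(500)]
print(f"RTF {measure_rtf(lambda c: sum(c), chunks, chunk_seconds=0.010):.5f}")
```

Running the same loop under each scheduling mode is what yields separate figures such as the 0.021 and ~0.071 quoted above.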

File replay and live microphone are reported separately. Live-mic capture adds real overhead: our ASR streaming RTF on the Snapdragon 662 is 0.302 on file replay and 0.353 on a live microphone, about 17% higher. Comparison tables quote the file-replay figure and footnote the live-mic one.

Accuracy is measured on a deliberately hard, held-out test set. The wake-word split is 5,240 positive utterances plus 6,416 hard negatives — isolated "Hey", isolated "Assistant", competitor wake words like "Hey Siri", phonetic neighbours, arbitrary speech, and non-speech audio — with all speakers disjoint from training and validation. Easy negatives inflate accuracy; phonetic neighbours are what break wake words in production.
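
A speaker-disjoint split is easy to check mechanically. The sketch below assumes each split is represented as a set of speaker IDs; the IDs and the helper name are hypothetical, and only the clip counts come from the description above.

```python
from itertools import combinations
from typing import Dict, Set


def shared_speakers(splits: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return speakers that appear in more than one split, keyed by 'a/b' split pair."""
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        common = ids_a & ids_b
        if common:
            overlaps[f"{name_a}/{name_b}"] = common
    return overlaps


# Hypothetical speaker IDs, for illustration only.
splits = {
    "train": {"spk001", "spk002", "spk003"},
    "validation": {"spk004"},
    "test": {"spk005", "spk006"},
}
assert not shared_speakers(splits), "speaker leakage between splits"

# Class balance of the held-out split described above.
positives, hard_negatives = 5240, 6416
total = positives + hard_negatives
print(f"{total} clips, {positives / total:.1%} positive")
```

A speaker who appears in both training and test lets the model score well by recognizing the voice rather than the phrase, which is the leakage this check guards against.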

Ported models are verified against their upstream reference. Our ASR is NVIDIA NeMo's FastConformer rebuilt tensor-by-tensor on the VoxRT runtime; we verify WER stays within floating-point noise of the upstream Python reference (3.267% RNN-T / 4.895% CTC on LibriSpeech test-clean) so the port itself never silently costs accuracy.

Anyone can re-measure. The SDKs and models are public on GitHub, and the browser demo includes a benchmark that generates 60 seconds of audio in your browser and measures RTF on your own device.

Estimates

How competitor numbers are handled

Most vendors publish no mobile numbers at all. When our comparison tables need one, we scale the vendor's own published desktop figure by Geekbench 6 single-core ratios between the published device and the target phone — roughly 3× between a desktop Ryzen and a Snapdragon 662. Every scaled figure is labeled estimated, never presented as a measurement.

These estimates are lower bounds on the true mobile cost: a vendor's mobile NEON path is usually less optimized than its desktop AVX2 path, so real mobile numbers tend to be worse, not better. When an estimate lands close to our measured figure — as with Picovoice Porcupine's scaled ~1.8% against our measured 2.1% on the Snapdragon 662 — we call it a tie, not an advantage.
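
The scaling step is simple arithmetic, sketched here under stated assumptions. The desktop RTF is a placeholder chosen so that the ~3× ratio reproduces the ~1.8% quoted above; it is not a figure taken from any vendor. The tie margin is likewise hypothetical, since this page does not define a numeric one.

```python
def scale_to_target(published_rtf: float, single_core_ratio: float) -> float:
    """Scale a published RTF to a slower device. The result is an estimate."""
    if single_core_ratio <= 0:
        raise ValueError("single_core_ratio must be positive")
    return published_rtf * single_core_ratio


def classify(estimated_rtf: float, measured_rtf: float, tie_margin: float) -> str:
    """Compare a scaled estimate against a measured figure, with a tie band."""
    if abs(estimated_rtf - measured_rtf) <= tie_margin:
        return "tie"
    if measured_rtf < estimated_rtf:
        return "measured engine lighter"
    return "estimated engine lighter"


desktop_rtf = 0.006   # placeholder, not a vendor-published number
ratio = 3.0           # roughly 3x, desktop Ryzen to Snapdragon 662 (from the text)
tie_margin = 0.005    # hypothetical band, in absolute RTF

estimated = scale_to_target(desktop_rtf, ratio)
print(f"estimated mobile RTF {estimated:.3f} (estimate, not a measurement)")
print(classify(estimated, measured_rtf=0.021, tie_margin=tie_margin))
```

Because the scaled value is treated as a lower bound on the true mobile cost, a result inside the tie band is reported as a tie rather than as a win for either side.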

Accuracy claims from competitors are quoted from their own published benchmarks, with the source named — we do not re-measure competitor accuracy ourselves.

The honest caveat

Why cross-vendor numbers aren't apples-to-apples

There is no independent third-party benchmark covering on-device voice engines on a common test set — every number in this field, including ours, is vendor-self-reported. Four things break comparability:

Different data. VAD accuracy is reported on different datasets at different frame granularities. Wake-word accuracy depends on the specific phrase tested — a custom phrase is consistently harder than a vendor's tuned built-in keywords, so never read a custom number against a built-in one.

Different architectures. A Kaldi HMM/DNN with a language model and an end-to-end RNN-T produce WER figures that aren't like-for-like even on the same test set.

Different hardware and modes. A desktop Threadripper, a Raspberry Pi Zero, an A8 iPhone, and a Snapdragon 662 are not interchangeable reference points, and default-versus-pinned scheduling changes results on big.LITTLE chips.

Different disclosure. Some vendors publish full tables; some publish a chart image; some publish a claim with no number. A missing number is information too.

What is solid and comparable across vendors: license terms, footprint, whether there's a first-party native mobile SDK, and whether mobile numbers are published at all. Our comparison tables lean on those axes and treat every performance ordering as directional.

Sources

Where the numbers live

VoxRT's measured figures are published in the SDK repos — wake word, VAD, ASR — and mirrored in our docs. Competitor figures are cited to their sources on each comparison page: Picovoice's wake-word and speech-to-text benchmarks, the Silero VAD wiki, the TEN VAD README, the OpenAI Whisper paper, the Vosk model registry, and NVIDIA NeMo model cards.

FAQ

Methodology, answered

What is real-time factor (RTF)?

Processing time divided by audio duration. An RTF of 0.01 means one second of audio takes 10 ms to process — keeping up with live speech uses about 1% of one CPU core. Below 1.0 is real-time; lower leaves more headroom for the rest of the app.

Why benchmark on a Snapdragon 662?

The Snapdragon 662 (Xiaomi Redmi 9C, 2020) represents the cheap-tier Android hardware real user bases actually carry. A voice feature that only works on flagships fails in the field — and since most vendors publish only desktop or Raspberry Pi numbers, cheap-Android performance is the axis buyers can't otherwise evaluate.

Are VoxRT's numbers independently verified?

No — like every vendor in this field, ours are self-reported, and no independent third-party benchmark covers these engines on a common test set. We mitigate that by publishing the setup, labeling every estimate as an estimate, shipping the SDKs publicly so anyone can re-measure, and offering an in-browser benchmark that measures RTF on your own device.

How are competitors' mobile numbers estimated?

By scaling their own published desktop figures by Geekbench 6 single-core ratios (~3× between a desktop Ryzen and a Snapdragon 662). Estimates are lower bounds, because mobile NEON paths are usually less optimized than desktop AVX2 paths, and they are always labeled as estimates.

Why aren't cross-vendor numbers directly comparable?

Different data, different architectures, different hardware, and different disclosure. The ordering such numbers suggest is directional, not a like-for-like ranking. What is comparable: license terms, footprint, native SDK availability, and whether mobile numbers are published at all.

Judge it on your own device

The browser demo includes a benchmark that measures real-time factor on your hardware.
