On-device VAD · compared

On-device VAD, honestly compared

How VoxRT stacks up against Silero VAD, Picovoice Cobra, WebRTC VAD, TEN VAD and MarbleNet — on accuracy, measured mobile speed, footprint and license. Every number below is the vendor's own published figure, with the caveats spelled out.

The short version

The model is shared. The runtime is the product.

VoxRT ships the proven, MIT-licensed Silero v5 weights — the same network many of these engines benchmark against — on its own from-scratch Rust inference runtime. So the differentiator isn't a magic accuracy number; it's published, measured mobile performance, a redistributable license with no per-user check, and a stateless C ABI packaged as real Android and iOS modules.

Side by side

The comparison table

Comparison of on-device Voice Activity Detection engines (VoxRT, Silero VAD, Picovoice Cobra, WebRTC VAD, TEN VAD and MarbleNet) by vendor-reported accuracy, measured mobile real-time factor on Android and iPhone, footprint and license.

| Engine | Accuracy (vendor-reported) | Mobile RTF, Android | Mobile RTF, iPhone | Footprint | License |
|---|---|---|---|---|---|
| VoxRT (Silero v5 on the VoxRT runtime) | ROC-AUC 0.94–0.96 [a] (inherits Silero v5) | 3.05% (Snapdragon 662, A73) | 1.85% (iPhone 13 Pro Max, A15) | ~1.7 MB (net app-size impact) | MIT (runtime + weights) |
| Silero VAD v5 (snakers4 · upstream) | ROC-AUC 0.94–0.96 [a] | Not published [b] | Not published [b] | ~2 MB (model file; runtime separate) | MIT (code + weights) |
| Picovoice Cobra (commercial) | "Largest AUC" claim [c]; no numeric F1/AUC published | Not published [d] | Not published [d] | Not published | Commercial (free plan + paid tier) |
| WebRTC VAD (Google · legacy) | 2010-era GMM baseline [e] | Not published | Not published | <100 KB | BSD-3 |
| TEN VAD (Agora) | PR-curve claims [f]; no numeric F1/AUC published | 4.9–5.7% (Snapdragon 425 / 450) [g] | 0.5–2.1% (iPhone 6 / 8, A8 / A11) [g] | 320–532 KB (full library: runtime + model) | Apache-2.0 + non-compete [h] |
| MarbleNet (NVIDIA NeMo) | Per-checkpoint [i]; no single headline metric | Not published | Not published | Varies by checkpoint | CC-BY-4.0 |

RTF = real-time factor: the fraction of one CPU core needed to keep up with audio in real time; lower is better. Accuracy and RTF figures are each vendor's own published numbers on different datasets and hardware, so they are directionally indicative, not directly comparable (see the methodology note below).

Sources:
[a] Silero VAD wiki, Quality-Metrics: ROC-AUC 0.96 Multi-Domain / 0.96 AliMeeting / 0.94 VoxConverse / 0.79 MSDWild, on 31.25 ms segments. The 0.94–0.96 range quoted in the table covers the first three sets and excludes MSDWild. VoxRT ships these same v5 weights.
[b] Silero publishes desktop CPU figures only (≈189 µs/chunk, Ryzen Threadripper 3960X), no mobile RTF.
[c] Picovoice Cobra blog.
[d] Picovoice publishes desktop (Ryzen 9 5900X) and Raspberry Pi Zero RTF, not Android/iOS.
[e] WebRTC VAD has no first-party modern benchmark.
[f] TEN VAD README: PR-curve plots only.
[g] TEN VAD README, vendor-measured RTF table. The iPhone numbers are on older A8/A11 silicon than VoxRT's A15 reference, so this is not a like-for-like comparison.
[h] TEN VAD LICENSE adds a condition barring deployment "in a way that competes with Agora's offerings" or that enables third parties to do so.
[i] NeMo model card reports metrics per checkpoint.
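To make the RTF definition concrete, here is a minimal sketch of how such a figure could be measured. It is not VoxRT's benchmark harness: `process_chunk` is a hypothetical stand-in for whatever per-chunk VAD call is being timed, and the function names are made up for illustration.

```python
import time
from typing import Callable, Iterable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time spent / duration of audio processed (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process_chunk: Callable[[bytes], object],
                chunks: Iterable[bytes],
                chunk_seconds: float) -> float:
    """Time a hypothetical per-chunk VAD call over a sequence of chunks."""
    count = 0
    start = time.perf_counter()
    for chunk in chunks:
        process_chunk(chunk)  # assumption: one call consumes one chunk
        count += 1
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, count * chunk_seconds)


# 18.5 ms of compute per 1 s of audio corresponds to the 1.85% figure.
print(f"{real_time_factor(0.0185, 1.0):.2%}")
```

A measurement like this depends heavily on the device, thermal state and core it runs on, which is one reason figures from different vendors do not line up.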

Why VoxRT

What VoxRT brings to VAD

Measured mobile speed

1.85% RTF on iPhone 13 Pro Max, 3.05% on a Snapdragon 662 — published phone-class numbers, not desktop extrapolations.

Redistributable license

MIT runtime and MIT weights — no per-user license check, no free-plan caps, no non-compete clause to read around.

Stateless C ABI

A clean stateless C interface, 32 ms chunks, shipped as native Android (JitPack) and iOS (Swift Package) modules.
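As a rough illustration of what 32 ms chunks mean for a caller, the sketch below slices a sample buffer into fixed-size chunks. It assumes 16 kHz mono input, which the text above does not state, and it does not show the VoxRT C ABI itself; `iter_chunks` is a hypothetical helper.

```python
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono input
CHUNK_MS = 32            # chunk length stated in the text
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 512 samples at 16 kHz


def iter_chunks(samples: Sequence[int],
                chunk_samples: int = CHUNK_SAMPLES) -> Iterator[Sequence[int]]:
    """Yield consecutive full chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(samples) - chunk_samples + 1, chunk_samples):
        yield samples[start:start + chunk_samples]


one_second = [0] * SAMPLE_RATE_HZ
chunks = list(iter_chunks(one_second))
print(CHUNK_SAMPLES, len(chunks))  # 512 samples per chunk, 31 full chunks in 1 s
```

A real integration would buffer the leftover samples and prepend them to the next capture callback rather than dropping them.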

Tiny footprint

~1.7 MB net app-size impact, with no ONNX Runtime, PyTorch Mobile or LiteRT dependency dragged in behind it.

Fully on-device

No network round-trips, no microphone audio leaving the device — the same runtime behind VoxRT wake word and ASR.

~54 streams per core

Cost per stream is low enough to run dozens of concurrent VAD streams on a single core: about 54 at the iPhone 13 Pro Max's 1.85% RTF (1 / 0.0185), before scheduling and memory overhead.
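The streams-per-core figure follows directly from the RTF. A small sketch of the arithmetic, using the two published VoxRT numbers (the helper name is made up for illustration):

```python
import math


def max_streams_per_core(rtf: float) -> int:
    """Upper bound on concurrent real-time streams one core can sustain.

    Ignores scheduling, memory bandwidth and thermal throttling, so real
    capacity will be lower.
    """
    if not 0 < rtf <= 1:
        raise ValueError("rtf must be in (0, 1]")
    return math.floor(1 / rtf)


for device, rtf in [("iPhone 13 Pro Max", 0.0185), ("Snapdragon 662", 0.0305)]:
    print(device, max_streams_per_core(rtf))
# iPhone 13 Pro Max 54
# Snapdragon 662 32
```

Treat the result as a ceiling for capacity planning, not a throughput guarantee.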

Engine by engine

How each one really compares

Silero VAD

MIT · the upstream model

The proven open model — and the one VoxRT ships. Upstream gives you an ONNX file and a Python wrapper; you build the mobile integration yourself, and there's no published Android/iOS RTF. VoxRT packages those same weights with measured mobile numbers and ready-made modules.

Picovoice Cobra

Commercial · free plan + paid tier

An established commercial engine that claims the largest AUC in its own three-way benchmark, though it publishes no concrete F1/AUC value and its only RTF figures are for desktop and Raspberry Pi Zero. It's a commercial, license-gated path, versus VoxRT's permissive MIT.

WebRTC VAD

BSD-3 · the free floor

The ubiquitous 2010-era GMM detector: smallest binary in the field and zero licensing friction, but a generation behind modern neural VAD on noisy speech. Useful as a baseline, rarely as a product choice.

TEN VAD

Apache-2.0 with added conditions

Technically strong — a small full-library binary and impressive vendor-measured mobile RTF. The catch is the license: an added clause bars deploying it in ways that compete with, or enable others to compete with, Agora — which is a real constraint for a redistributable SDK.

MarbleNet

CC-BY-4.0 · NVIDIA NeMo

A capable NeMo checkpoint family we evaluated and set aside in favor of Silero. There's no single headline metric to anchor a row on, and it isn't packaged as a mobile SDK surface.

VoxRT

MIT · runtime + weights

Proven Silero accuracy, published mobile RTF, a tiny footprint and a redistributable license — packaged as drop-in Android and iOS modules on a runtime tuned for phone-class hardware.

Read the numbers carefully

Why the figures aren't apples-to-apples

We'd rather hand you the caveats than a tidy leaderboard. Every accuracy and speed figure above is vendor-self-reported, and the field has no independent third-party VAD benchmark at the scale ASR enjoys.

Accuracy is measured on different datasets at different granularities — Silero on its own validation sets at 31.25 ms segments, Picovoice on LibriSpeech mixed with DEMAND noise at 0 dB SNR, TEN VAD on precision-recall curves with no published numbers, WebRTC with no first-party benchmark at all. The "who beats whom" ordering each vendor reports is directional, not commensurable.
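For readers unfamiliar with the metric, ROC-AUC over segments can be read as the probability that a randomly chosen speech segment gets a higher score than a randomly chosen non-speech segment. The sketch below computes it that way on made-up scores; it is an illustration of the metric, not any vendor's evaluation code, and the data is invented.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random speech segment outscores a random non-speech one.

    Ties count as half a win. Pairwise O(P*N), fine for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one segment of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up per-segment speech probabilities, purely illustrative.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

Because the result depends entirely on which segments are in the test set and how long each segment is, the same model can score very differently across datasets, as Silero's own 0.96 versus 0.79 spread shows.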

Real-time factor is even less comparable: Silero publishes desktop Threadripper figures, Picovoice desktop Ryzen plus Raspberry Pi Zero, TEN VAD mobile on older A8/A11 iPhones, and VoxRT on a Snapdragon 662 and an A15 iPhone. The only near-overlap, TEN VAD's iPhone 8 against VoxRT's iPhone 13 Pro Max, spans chips several generations apart, so it is not a head-to-head. Since the A11 is older than the A15, TEN VAD's iPhone 8 figure is best read as a likely upper bound on what it would cost on A15-class hardware, not as a number to set directly beside VoxRT's 1.85%.

What's solid and comparable is the rest of the table: license terms, footprint, and whether a vendor publishes mobile numbers at all.

FAQ

On-device VAD, answered

What is the best on-device VAD engine?

There is no independent third-party benchmark for on-device VAD, so any single "best" claim is vendor-self-reported. The practical choice comes down to license, published mobile performance and footprint. VoxRT ships the proven Silero v5 weights (MIT) on its own runtime, with measured real-time factors of 1.85% on an iPhone 13 Pro Max and 3.05% on a Snapdragon 662, and a redistributable license with no per-user check.

Is Silero VAD free for commercial use?

Yes. Silero VAD is MIT-licensed for both its code and its weights, so it can be used and redistributed commercially. VoxRT ships these same Silero v5 weights on its own runtime.

Does TEN VAD's license allow building a redistributable SDK?

TEN VAD is licensed under Apache-2.0 with an added condition that bars deploying it in a way that competes with — or enables others to compete with — Agora's offerings. For a redistributable VAD SDK that is a meaningful constraint, so check the license against your use case. By contrast, VoxRT's runtime and weights are MIT.

How does VoxRT compare to Picovoice Cobra?

Picovoice Cobra is a commercial engine with a free plan and a paid tier. It claims the largest AUC in its own three-way benchmark but publishes no concrete F1 or AUC number, and its only real-time-factor figures are on desktop and a Raspberry Pi Zero — not Android or iPhone. VoxRT is MIT-licensed and publishes measured mobile real-time factors.

What is a good real-time factor (RTF) for on-device VAD?

Real-time factor is the fraction of one CPU core needed to keep up with audio in real time, so lower is better. VoxRT measures 1.85% (RTF 0.0185) on an iPhone 13 Pro Max and 3.05% on a Snapdragon 662 — low enough to run dozens of concurrent VAD streams on a single core.

Is WebRTC VAD good enough?

WebRTC VAD is the lightweight, sub-100 KB, BSD-3-licensed baseline from 2010. It is fast and ubiquitous and fine as a free floor, but it is a generation behind modern neural VAD at separating speech from noise, which is why newer engines benchmark against it.

Put VoxRT VAD on your device

Tell us what you're shipping and which devices it has to run on.
