On-device voice activity detection
Streaming, on-device detection of when a person starts and stops speaking. Silero v5 weights on the VoxRT runtime — about 1.7 MB of app-size impact, and no audio ever leaves the device.
The foundational voice primitive
Voice activity detection tells you, frame by frame, whether someone is speaking. It's the building block the rest of a voice pipeline sits on: it gates wake-word and keyword-spotting models so they only run when there's speech, drives barge-in and interruption logic, and works on its own for trimming recordings and managing turn-taking.
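To make the turn-taking and trimming use cases concrete, here is a minimal sketch of how per-frame speech probabilities could be turned into start and stop events. It is not VoxRT's API: the `Segmenter` type, the `VadEvent` enum, the two thresholds, and the hangover count are all hypothetical names and values chosen for illustration. The technique is plain hysteresis: a higher threshold to enter speech, a lower one to leave it, and a short run of quiet frames required before the end is declared.

```rust
/// Events emitted when the speech state changes (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum VadEvent {
    SpeechStart { frame: usize },
    SpeechEnd { frame: usize },
}

/// Turns per-frame speech probabilities into start/stop events.
/// All names and parameters here are illustrative assumptions.
struct Segmenter {
    start_threshold: f32, // probability needed to enter speech
    end_threshold: f32,   // probability below which a frame counts as quiet
    hangover_frames: usize, // consecutive quiet frames needed to end speech
    in_speech: bool,
    quiet_run: usize,
    frame_index: usize,
}

impl Segmenter {
    fn new(start_threshold: f32, end_threshold: f32, hangover_frames: usize) -> Self {
        Segmenter {
            start_threshold,
            end_threshold,
            hangover_frames,
            in_speech: false,
            quiet_run: 0,
            frame_index: 0,
        }
    }

    /// Feed one frame's speech probability; returns an event on a state change.
    fn push(&mut self, prob: f32) -> Option<VadEvent> {
        let idx = self.frame_index;
        self.frame_index += 1;

        if !self.in_speech {
            if prob >= self.start_threshold {
                self.in_speech = true;
                self.quiet_run = 0;
                return Some(VadEvent::SpeechStart { frame: idx });
            }
            return None;
        }

        if prob < self.end_threshold {
            self.quiet_run += 1;
            if self.quiet_run >= self.hangover_frames {
                self.in_speech = false;
                self.quiet_run = 0;
                return Some(VadEvent::SpeechEnd { frame: idx });
            }
        } else {
            // Speech resumed before the hangover expired: keep the turn open.
            self.quiet_run = 0;
        }
        None
    }
}

fn main() {
    // Made-up probabilities standing in for a VAD's per-frame output.
    let probs = [0.1_f32, 0.2, 0.9, 0.8, 0.3, 0.7, 0.1, 0.1, 0.1, 0.05];
    let mut seg = Segmenter::new(0.5, 0.35, 3);
    for &p in probs.iter() {
        if let Some(ev) = seg.push(p) {
            // Prints SpeechStart { frame: 2 }, then SpeechEnd { frame: 8 }.
            println!("{:?}", ev);
        }
    }
}
```

The brief dip at frame 4 does not end the turn because the hangover has not expired. How long a hangover lasts in wall-clock time depends on the frame length your VAD build uses, so tune the count against that rather than treating 3 as a recommendation.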
VoxRT runs the well-known, MIT-licensed Silero v5 weights on its own from-scratch inference runtime, so you get the accuracy of a proven model with a tiny binary footprint and no heavyweight ML framework dependency.
The runtime is the product
VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.
VAD is the free, open-source showcase of that runtime. It runs Silero v5 with state-of-the-art per-frame latency and sits alongside wake word and streaming ASR. All three share the same Rust runtime crate and NEON kernel set.
What you get
Per-frame decisions
Streaming, low-latency speech / no-speech on every audio frame.
Proven model
Silero v5 weights, MIT-licensed — no ONNX Runtime or PyTorch Mobile.
Tiny footprint
~1.7 MB net app-size impact; runs comfortably on budget Android devices.
Power gate
Gates wake-word and keyword-spotting models so they only run on speech.
Fully on-device
No network connection needed; microphone audio never leaves the device.
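The power-gate idea above can be sketched as a small filter that sits between the VAD and a heavier downstream model. This is an illustration under stated assumptions, not VoxRT code: `SpeechGate`, its single threshold, and the pre-roll buffer are hypothetical. The pre-roll is an extra design choice added here so that the frames just before speech onset are not lost when the gate opens.

```rust
use std::collections::VecDeque;

/// Forwards audio frames downstream only while the VAD reports speech.
/// Hypothetical type; names and parameters are illustrative assumptions.
struct SpeechGate {
    threshold: f32,
    pre_roll_frames: usize,
    pre_roll: VecDeque<Vec<i16>>, // most recent non-speech frames
    open: bool,
}

impl SpeechGate {
    fn new(threshold: f32, pre_roll_frames: usize) -> Self {
        SpeechGate {
            threshold,
            pre_roll_frames,
            pre_roll: VecDeque::new(),
            open: false,
        }
    }

    /// Returns the frames the downstream model should process for this input.
    /// An empty result means the downstream model can stay idle.
    fn push(&mut self, frame: &[i16], speech_prob: f32) -> Vec<Vec<i16>> {
        if speech_prob >= self.threshold {
            let mut out: Vec<Vec<i16>> = Vec::new();
            if !self.open {
                // Gate just opened: hand over the buffered lead-in first.
                out.extend(self.pre_roll.drain(..));
                self.open = true;
            }
            out.push(frame.to_vec());
            out
        } else {
            self.open = false;
            if self.pre_roll_frames > 0 {
                if self.pre_roll.len() == self.pre_roll_frames {
                    self.pre_roll.pop_front();
                }
                self.pre_roll.push_back(frame.to_vec());
            }
            Vec::new()
        }
    }
}

fn main() {
    // Made-up probabilities; each frame is filled with its own index
    // so the forwarded frames are easy to identify.
    let probs = [0.1_f32, 0.1, 0.2, 0.9, 0.8, 0.1];
    let mut gate = SpeechGate::new(0.5, 2);
    let mut forwarded: Vec<i16> = Vec::new();
    for (i, &p) in probs.iter().enumerate() {
        let frame = vec![i as i16; 4];
        for f in gate.push(&frame, p) {
            forwarded.push(f[0]);
        }
    }
    // Prints [1, 2, 3, 4]: two pre-roll frames plus the two speech frames.
    println!("{:?}", forwarded);
}
```

In this run the downstream model sees four of the six frames and stays idle for the rest. A production gate would likely combine this with a hangover so that short pauses do not close it mid-utterance.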
Negligible cost per stream
About 1.7 MB in your app
- Swift wrapper source: ~17 KB
- Native xcframework, compressed (device slice): ~500 KB
- Silero VAD weights (fp16): 1.2 MB
- Net app-size impact: ~1.7 MB
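As a quick sanity check, the three components in the breakdown add up to the stated total. The sketch below assumes decimal units (1 MB = 1000 KB) and takes the approximate figures at face value; the constant and function names are made up for this example.

```rust
// Component sizes from the breakdown, in kilobytes.
// Assumes decimal units (1 MB = 1000 KB); the figures are approximate.
const SWIFT_WRAPPER_KB: f64 = 17.0;
const XCFRAMEWORK_KB: f64 = 500.0;
const WEIGHTS_KB: f64 = 1200.0;

/// Sum of the listed components, in megabytes.
fn total_mb() -> f64 {
    (SWIFT_WRAPPER_KB + XCFRAMEWORK_KB + WEIGHTS_KB) / 1000.0
}

fn main() {
    // Prints "1.7 MB" (1717 KB rounded to one decimal place).
    println!("{:.1} MB", total_mb());
}
```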