Silero v5 voice-activity detection, running on the VoxRT custom on-device inference runtime.
- Current version: v0.1.3
- Minimum iOS: 16.0
- Architectures shipped: arm64 (iPhone / iPad, NEON-accelerated)
- License: Apache-2.0 (Swift wrapper) · proprietary (compiled runtime, redistribution allowed via this Swift Package)
What is VoxRT?
VoxRT is a from-scratch inference runtime for on-device speech models. No ONNX Runtime, no PyTorch Mobile, no LiteRT — a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.
VoxrtSilero is the free, open-source showcase of that runtime: a Swift Package that runs the Silero v5 VAD with state-of-the-art per-frame latency. The runtime is the product; Silero is the demo subject.
Siblings on the same runtime:
- VoxrtAsr: streaming speech recognition (FastConformer 32M)
- VoxrtWakeWord: always-on wake-phrase detection (~48 K params)
Commercial custom-phrase wake-word / keyword-spotting / domain-specific ASR models built on the same runtime live at voxrt.com.
Performance
Measured at ship time, ARM64 release builds, post-warmup, RTF = wall-time-per-frame ÷ frame-duration (lower is better):
| Device | RTF | per-frame latency |
|---|---|---|
| iPhone 13 Pro Max | 1.85% | ~0.6 ms / 32 ms frame |
What this means: at 1.85% RTF, one VAD stream uses under 2% of a single core, so roughly 54 parallel streams would fit on one core before saturating it (1 ÷ 0.0185 ≈ 54). A single stream leaves nearly all of the CPU free for the rest of the audio pipeline (ASR, TTS, UI).
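The arithmetic behind those figures can be sketched as follows. This is only an illustration of the RTF definition given above; the function names are hypothetical and not part of the VoxrtSilero API. Note that the table's "~0.6 ms" is a rounded figure, so it yields an RTF slightly above the published 1.85%.

```swift
import Foundation

/// Real-time factor: processing time per frame divided by frame duration.
func realTimeFactor(processingMs: Double, frameMs: Double) -> Double {
    processingMs / frameMs
}

/// Upper bound on the streams one core can sustain, ignoring scheduling overhead.
func maxParallelStreams(rtf: Double) -> Int {
    Int((1.0 / rtf).rounded(.down))
}

// Rounded per-frame latency from the table: 0.6 ms per 32 ms frame.
let approxRtf = realTimeFactor(processingMs: 0.6, frameMs: 32.0)

// Published RTF of 1.85%.
let streams = maxParallelStreams(rtf: 0.0185)

print(approxRtf, streams)
```

The stream count is a theoretical ceiling for one core; real workloads share the core with the rest of the app.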
How it compares
VAD became a commodity by 2026; the question is which one you can actually ship in a paid mobile app with measured numbers and a clean license:
| | VoxrtSilero | Picovoice Cobra | WebRTC VAD | TEN VAD |
|---|---|---|---|---|
| Underlying model | Silero v5 (MIT upstream) | proprietary | GMM (2010) | proprietary |
| Model / binary footprint | 1.2 MB model (.vxrt) | not published | < 100 KB | 320–532 KB shared lib (runtime + model bundled) |
| Mobile RTF disclosed | ✅ measured on cheap Android + iPhone | ❌ desktop Ryzen + Raspberry Pi Zero only | ❌ | ✅ vendor-measured on Android + iPhone |
| License | Apache-2.0 wrapper + proprietary runtime + MIT weights (Silero Team) | Commercial freemium (paid tier opaque) | BSD-3 | Apache-2.0 with non-compete clause: redistribution blocked if it could enable Agora competitors |
| Ship in a paid app | ✅ no per-deployment terms | ⚠️ requires paid commercial tier | ✅ but accuracy is the 2010 floor | ❌ license clause 1 forbids it |
We don't innovate on the VAD model — Silero v5 is the upstream MIT architecture you'd already pick. What we add is a NEON-optimized Rust runtime, a stateless C ABI suitable for SDK packaging, and per-device measured RTF that other vendors don't publish.
Full sourced analysis: voxrt.com.
Binary footprint
- Swift wrapper source: ~17 KB total (.swift files included in your app's compile pass)
- VoxrtSileroNative.xcframework (compressed): ~500 KB device slice
- Silero VAD weights silero_vad.vxrt: 1.2 MB (downloaded separately, see below)
Net app-size impact: ~1.7 MB.
Install
In Xcode: File → Add Package Dependencies → paste:
https://github.com/VoxRT/voxrt-silero-ios
…and pin to v0.1.3.
Or in Package.swift:
dependencies: [
.package(url: "https://github.com/VoxRT/voxrt-silero-ios.git", from: "0.1.3"),
],
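For context, a complete manifest might look like the sketch below. The package and target name `MyVoiceApp` is hypothetical, and the product name `VoxrtSilero` is an assumption inferred from the `import VoxrtSilero` statement used in this README; check the package's own manifest if resolution fails.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyVoiceApp",                  // hypothetical package name
    platforms: [.iOS(.v16)],             // matches the stated minimum iOS
    dependencies: [
        .package(url: "https://github.com/VoxRT/voxrt-silero-ios.git", from: "0.1.3"),
    ],
    targets: [
        .target(
            name: "MyVoiceApp",
            dependencies: [
                // Assumed product name, inferred from `import VoxrtSilero`.
                .product(name: "VoxrtSilero", package: "voxrt-silero-ios"),
            ]
        ),
    ]
)
```

`from: "0.1.3"` accepts later compatible releases as well. To pin to exactly v0.1.3, as the Xcode instructions suggest, `exact: "0.1.3"` can be used instead.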
Get the VAD model
The model weights are NOT bundled — you fetch them once from voxrt-silero-models:
https://github.com/VoxRT/voxrt-silero-models/releases/download/v0.1.3/silero_vad.vxrt
SHA-256: 0fe8498c9bd1ae119bcb0c75c8481b3a8b8be0f95c14f334d469851c19054156
You decide where it lives. Three common patterns:
- Bundle in app resources: drag silero_vad.vxrt into your Xcode project. Works offline from first launch.
- Download on first run: URLSession fetch into FileManager.default.urls(for: .applicationSupportDirectory, ...). Smaller App Store binary; needs network at first launch.
- Download on demand: Apple's On-Demand Resources or the Background Assets framework if you want the App Store to host the file.
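Whichever pattern you choose, it may be worth checking the file against the published SHA-256 before handing it to the engine. The sketch below is one way to do that with CryptoKit; the helper names `sha256Hex`, `verifyModel`, and `ModelCheckError` are hypothetical and not part of VoxrtSilero.

```swift
import Foundation
import CryptoKit

/// Lowercase hex SHA-256 digest of `data`.
func sha256Hex(of data: Data) -> String {
    SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

enum ModelCheckError: Error {
    case checksumMismatch(actual: String)
}

/// Reads the file at `url` (memory-mapped when safe) and throws if its
/// digest does not match `expectedSHA256`.
func verifyModel(at url: URL, expectedSHA256: String) throws {
    let data = try Data(contentsOf: url, options: .mappedIfSafe)
    let actual = sha256Hex(of: data)
    guard actual == expectedSHA256.lowercased() else {
        throw ModelCheckError.checksumMismatch(actual: actual)
    }
}

// Digest published for the v0.1.3 model file above.
let publishedSHA256 = "0fe8498c9bd1ae119bcb0c75c8481b3a8b8be0f95c14f334d469851c19054156"
```

A typical call would be `try verifyModel(at: downloadedURL, expectedSHA256: publishedSHA256)` right after the download completes, where `downloadedURL` is wherever your app stored the file.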
Quick start
import Foundation
import VoxrtSilero

// 1. Resolve the bundled model URL.
guard let modelURL = Bundle.main.url(forResource: "silero_vad", withExtension: "vxrt") else {
    fatalError("silero_vad.vxrt not found in bundle")
}

// 2. Build the engine. `init(modelURL:)` memory-maps the file via
//    `Data(contentsOf:options: .mappedIfSafe)` under the hood, so there is
//    no eager copy into RAM. One instance per audio stream.
let vad = try VoxrtSileroVadEngine(modelURL: modelURL)

// (Convenience: same as above for the default bundle + name)
// let vad = try VoxrtSileroVadEngine.fromBundleResource()

// 3. Feed PCM (Int16, 16 kHz, mono).
let events = try vad.processPcm(samples)
for event in events {
    switch event {
    case .speechOnset(let timeMs):
        print("speech started at \(timeMs) ms")
    case .speechOffset(let timeMs):
        print("speech ended at \(timeMs) ms")
    }
}
The engine owns the LSTM state internally. Call vad.reset() between streams (e.g. when re-arming the mic). State snapshotting for replay / fork is also supported — see snapshotLstmState().
Live microphone example
The engine is synchronous and stateful — no internal queue, no delegate callbacks. You drive it from your AVAudioEngine tap callback and get events back as the return value of processPcm.
import AVFAudio
import VoxrtSilero

// NOTE: tap callbacks fire on a real-time audio thread. Don't
// allocate heavy buffers per callback in production; pre-size and
// reuse the conversion buffer as the snippet below does.
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playAndRecord, mode: .measurement, options: [])
try session.setActive(true)

let audioEngine = AVAudioEngine()
let input = audioEngine.inputNode
let hwFormat = input.outputFormat(forBus: 0)   // 44.1 / 48 kHz
let voxrtFormat = AVAudioFormat(               // engine target
    commonFormat: .pcmFormatInt16,
    sampleRate: 16_000,
    channels: 1,
    interleaved: true
)!
let converter = AVAudioConverter(from: hwFormat, to: voxrtFormat)!

guard let modelURL = Bundle.main.url(forResource: "silero_vad", withExtension: "vxrt") else {
    fatalError("silero_vad.vxrt missing from bundle")
}
let vad = try VoxrtSileroVadEngine(modelURL: modelURL)

// Scratch buffer for one resampled tap callback. The engine accepts any
// buffer size and segments it into 512-sample (32 ms) frames internally,
// so size this for a whole callback, not for a single frame: 4_096 hw
// frames at 44.1 / 48 kHz resample to under 1_500 frames at 16 kHz, and
// 4_096 leaves headroom if the system delivers larger buffers.
let scratchCapacity: AVAudioFrameCount = 4_096
let voxrtBuf = AVAudioPCMBuffer(pcmFormat: voxrtFormat,
                                frameCapacity: scratchCapacity)!

input.installTap(
    onBus: 0,
    bufferSize: 4_096,                         // requested hw frames per cb
    format: hwFormat
) { hwBuf, _ in
    voxrtBuf.frameLength = 0
    var error: NSError?
    var supplied = false
    let status = converter.convert(to: voxrtBuf, error: &error) { _, inputStatus in
        // Hand the tap buffer to the converter exactly once per callback;
        // returning it again would feed the same audio twice.
        if supplied {
            inputStatus.pointee = .noDataNow
            return nil
        }
        supplied = true
        inputStatus.pointee = .haveData
        return hwBuf
    }
    if status == .error || error != nil { return }

    guard let int16 = voxrtBuf.int16ChannelData?[0] else { return }
    let n = Int(voxrtBuf.frameLength)
    let samples = Array(UnsafeBufferPointer(start: int16, count: n))

    let events = try? vad.processPcm(samples)
    for event in events ?? [] {
        // Tap callbacks run off the main thread, so marshal UI work.
        switch event {
        case .speechOnset(let timeMs):
            DispatchQueue.main.async { print("speech started @ \(timeMs) ms") }
        case .speechOffset(let timeMs):
            DispatchQueue.main.async { print("speech ended @ \(timeMs) ms") }
        }
    }
}

try audioEngine.start()

// ... later, on stop:
audioEngine.stop()
input.removeTap(onBus: 0)
vad.close()
vad.processPcm returns immediately with whatever VAD events crossed the hysteresis thresholds during this buffer — often an empty list while inside a speech segment, an onset/offset event when the state machine transitions. UI marshalling is the caller's job; the engine takes no opinion about your concurrency model.
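To make the hysteresis behaviour concrete, here is a toy state machine that turns per-frame speech probabilities into onset/offset events. It is a sketch under stated assumptions, not VoxRT's implementation: the type names, the threshold values (0.5 and 0.35), and the choice to timestamp an offset at the start of the silent run are all illustrative.

```swift
enum VadEvent: Equatable {
    case speechOnset(timeMs: Int)
    case speechOffset(timeMs: Int)
}

struct HysteresisGate {
    let onsetThreshold: Float
    let offsetThreshold: Float
    let minSilenceMs: Int
    let frameMs = 32                 // one 512-sample frame at 16 kHz

    private var inSpeech = false
    private var silenceMs = 0
    private var clockMs = 0          // start time of the frame being pushed

    // Threshold defaults are assumed values for illustration only.
    init(onsetThreshold: Float = 0.5, offsetThreshold: Float = 0.35, minSilenceMs: Int = 250) {
        self.onsetThreshold = onsetThreshold
        self.offsetThreshold = offsetThreshold
        self.minSilenceMs = minSilenceMs
    }

    /// Consumes one frame's speech probability and returns any transitions.
    mutating func push(probability p: Float) -> [VadEvent] {
        defer { clockMs += frameMs }
        var events: [VadEvent] = []
        if !inSpeech {
            if p >= onsetThreshold {
                inSpeech = true
                silenceMs = 0
                events.append(.speechOnset(timeMs: clockMs))
            }
        } else if p < offsetThreshold {
            silenceMs += frameMs
            if silenceMs >= minSilenceMs {
                inSpeech = false
                // Report the moment the silent run began.
                events.append(.speechOffset(timeMs: clockMs + frameMs - silenceMs))
            }
        } else {
            silenceMs = 0            // speech resumed; restart the silence timer
        }
        return events
    }
}

var gate = HysteresisGate(minSilenceMs: 64)
let probabilities: [Float] = [0.1, 0.9, 0.9, 0.1, 0.1, 0.1]
let emitted = probabilities.flatMap { gate.push(probability: $0) }
print(emitted)
```

Most frames produce no event at all, which mirrors why processPcm often returns an empty list: events appear only on the frames where the state flips.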
Info.plist must declare the microphone privacy reason:
<key>NSMicrophoneUsageDescription</key>
<string>Used for on-device voice activity detection.</string>
Audio contract
- Sample rate: 16 000 Hz. No automatic resampling. Phone mic hardware delivers 44.1 / 48 kHz to AVAudioEngine; convert via AVAudioConverter to 16 kHz Int16 mono before feeding processPcm. Feeding the wrong rate is the #1 source of "VAD never fires" bugs.
- Sample format: Int16 PCM, mono, native endian.
- Buffer size: any. The engine internally segments into 32 ms frames (512 samples) with a 4 ms (64-sample) rolling context.
- Latency: one frame (32 ms) of inherent buffering. End-of-speech is reported with the configurable minSilenceMs (default 250 ms) hysteresis.
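The "any buffer size" contract implies some internal re-framing. The sketch below shows one plausible way to cut arbitrary-length input into 512-sample frames, each preceded by the last 64 samples of the previous frame. `FrameSegmenter` is a hypothetical illustration of the idea, not the runtime's actual code, and the zero-filled initial context is an assumption.

```swift
struct FrameSegmenter {
    static let frameSize = 512       // 32 ms at 16 kHz
    static let contextSize = 64      // 4 ms at 16 kHz

    private var pending: [Int16] = []
    // Assumed: the first frame sees silence as its context.
    private var context = [Int16](repeating: 0, count: FrameSegmenter.contextSize)

    /// Samples buffered while waiting for a full frame.
    var bufferedSampleCount: Int { pending.count }

    /// Appends samples and returns every complete window:
    /// 64 context samples followed by 512 new samples.
    mutating func push(_ samples: [Int16]) -> [[Int16]] {
        pending.append(contentsOf: samples)
        var windows: [[Int16]] = []
        var start = 0
        while pending.count - start >= Self.frameSize {
            let frame = Array(pending[start ..< start + Self.frameSize])
            windows.append(context + frame)
            context = Array(frame.suffix(Self.contextSize))
            start += Self.frameSize
        }
        pending.removeFirst(start)   // keep only the incomplete tail
        return windows
    }
}

var segmenter = FrameSegmenter()
let first = segmenter.push([Int16](repeating: 1, count: 700))
let leftoverAfterFirst = segmenter.bufferedSampleCount
let second = segmenter.push([Int16](repeating: 2, count: 400))
let leftoverAfterSecond = segmenter.bufferedSampleCount
print(first.count, leftoverAfterFirst, second.count, leftoverAfterSecond)
```

The leftover tail is where the one-frame (32 ms) latency comes from: samples wait in the buffer until a full frame is available.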
Threading
- The engine is a synchronous, stateful function. It does NOT own a queue. Each processPcm call blocks on the calling thread for the duration of inference, typically the AVAudioEngine tap thread for live mic. Marshal events back to UI via DispatchQueue.main.async (or your concurrency framework of choice).
- One instance is single-thread-at-a-time. Serialise processPcm / reset / close against each other on a given instance.
- Between unrelated streams (e.g. re-arming the mic for a new session), call vad.reset() to zero the LSTM state without paying weight-load cost again. Call vad.close() when done.
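If several threads can reach the same instance, one simple way to meet the single-thread-at-a-time rule is to funnel every call through a lock. The generic wrapper below is a sketch; `Serialized` and `CountingStub` are hypothetical names, and the stub stands in for the real engine so the example runs on its own.

```swift
import Foundation
import Dispatch

/// Wraps any engine-like object so that all access is mutually exclusive.
final class Serialized<Engine> {
    private let lock = NSLock()
    private let engine: Engine

    init(_ engine: Engine) { self.engine = engine }

    /// Runs `body` with exclusive access to the wrapped engine.
    func withEngine<T>(_ body: (Engine) throws -> T) rethrows -> T {
        lock.lock()
        defer { lock.unlock() }
        return try body(engine)
    }
}

/// Stand-in for the real engine: counts calls, and would lose updates
/// if two threads entered processPcm at once.
final class CountingStub {
    private(set) var calls = 0
    func processPcm(_ samples: [Int16]) -> [String] {
        calls += 1
        return []
    }
}

let shared = Serialized(CountingStub())
DispatchQueue.concurrentPerform(iterations: 1_000) { _ in
    shared.withEngine { _ = $0.processPcm([]) }
}
let totalCalls = shared.withEngine { $0.calls }
print(totalCalls)
```

For a live tap, keeping all engine calls on the tap thread avoids taking a lock on the real-time path at all; a wrapper like this is more useful when reset or close may be triggered from elsewhere, such as a UI action.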
Permissions
iOS requires a usage-description string for microphone access. Add to your app's Info.plist:
<key>NSMicrophoneUsageDescription</key>
<string>Used for on-device voice activity detection.</string>
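Calling AVAudioSession.sharedInstance().requestRecordPermission(...) (or AVAudioApplication.requestRecordPermission(...) on iOS 17 and later) shows the user prompt the first time it runs; the system also prompts automatically when mic capture first starts. Without the Info.plist key the app crashes with a privacy-violation exception on first request.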
Architectures roadmap
v0.1.3 ships only arm64 for physical devices, NEON-optimized. Simulator slices (arm64-sim + x86_64) are included for build convenience but are not part of the supported production target list.
| Target | Status |
|---|---|
| iOS arm64 (device) | ✅ Shipped |
| iOS arm64 simulator | ✅ Shipped (build-time only) |
| iOS x86_64 simulator | ✅ Shipped (build-time only) |
| macOS arm64 | 🟡 Coming soon |
| macOS x86_64 (AVX) | 🟡 Coming soon |
| visionOS / tvOS / watchOS | ☁️ On request |
Project layout
voxrt-silero-ios/
├── Package.swift # SPM manifest (binaryTarget URL + checksum)
├── Sources/VoxrtSilero/ # Idiomatic Swift wrapper (open, Apache-2.0)
│ └── VoxrtSilero.swift
└── README.md # this file
The compiled VoxrtSileroNative.xcframework is downloaded automatically by SPM from this version's GitHub Release — it is not in the repo.
License
- The Swift wrapper (Sources/VoxrtSilero/) is licensed under Apache-2.0. See LICENSE.
- The compiled VoxrtSileroNative.xcframework (fetched by SPM from the matching GitHub Release) is proprietary VoxRT runtime code owned by Elephant Enterprises LLC, redistributable as part of this unmodified Swift Package. See LICENSE-BINARY for the full terms.
- Silero VAD model weights are © Silero Team, originally MIT-licensed; the .vxrt encoded form retains the same license. See the models repository.
- Commercial integration / custom-model packaging questions: [email protected].
Links
- VoxRT runtime + commercial models: voxrt.com
- Android counterpart: voxrt-silero-android
- VAD model weights & versions: voxrt-silero-models
- Streaming ASR (iOS): voxrt-asr-ios · models
- Wake-word (iOS): voxrt-wake-word-ios · models
- Bugs / questions: open an issue on this repo