Always-on wake-phrase detection on the VoxRT custom on-device inference runtime. ~48K-parameter depthwise-separable convnet, 16 kHz mono in, sigmoid-score out + threshold-crossing events. Detects the phrase "Hey Assistant".
- Current version: v0.1.0
- Minimum iOS: 16.0
- Architectures shipped: arm64 (iPhone / iPad, NEON-accelerated) + simulator slices (arm64 + x86_64)
- License: Apache-2.0 (Swift wrapper) · proprietary (compiled runtime, redistribution allowed via this Swift Package)
- Wake-phrase weights: proprietary in-house (synthetic training data; no upstream license obligations)
What is VoxRT?
VoxRT is a from-scratch inference runtime for on-device speech models. No ONNX Runtime, no PyTorch Mobile, no LiteRT — a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.
VoxrtWakeWord is the wake-word product on that runtime, alongside VoxrtSilero (VAD) and VoxrtAsr (streaming ASR). All three share the same Rust runtime crate and the same NEON kernel set. The runtime is the product; the models are what it runs.
Custom-phrase wake-word models (your own brand name, multi-phrase detection, language extension) are part of the commercial VoxRT SDK tier. Contact [email protected].
Model quality
Test split: 5,240 positive utterances + 6,416 hard-negative utterances (isolated "Hey", isolated "Assistant", competitor wake-words like "Hey Siri", phonetic neighbours, arbitrary speech, non-speech audio). All speakers disjoint from train + val.
- ROC AUC: 0.9966
- Average precision (PR AUC): 0.9899
| Threshold | Precision | Recall | F1 | FPR | False positives on test |
|---|---|---|---|---|---|
| 0.5 | 0.864 | 0.995 | 0.925 | 12.8 % | 822 / 6,416 |
| 0.85 | 0.957 | 0.987 | 0.972 | 3.7 % | 234 / 6,416 |
| 0.9 (default) | 0.993 | 0.982 | 0.987 | 0.5 % | 34 / 6,416 |
| 0.95 | 0.997 | 0.769 | 0.868 | 0.2 % | 12 / 6,416 |
The package ships with threshold = 0.9 as the default operating point. Lower it via setThreshold if your application can tolerate more false positives in exchange for higher recall.
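The table columns all follow from four counts per threshold. The sketch below recomputes the default row from them; the false-positive count and split sizes are taken from the table, while the true-positive count is an assumption back-computed as round(0.982 × 5,240), and `ConfusionCounts` is a hypothetical helper, not part of the package.

```swift
import Foundation

/// Confusion counts at one decision threshold (hypothetical helper type).
struct ConfusionCounts {
    let truePositives: Int
    let falsePositives: Int
    let positives: Int   // total positive utterances in the split
    let negatives: Int   // total hard-negative utterances in the split

    var precision: Double { Double(truePositives) / Double(truePositives + falsePositives) }
    var recall: Double { Double(truePositives) / Double(positives) }
    var falsePositiveRate: Double { Double(falsePositives) / Double(negatives) }
    var f1: Double { 2 * precision * recall / (precision + recall) }
}

// Default operating point (threshold 0.9).
// Assumption: true positives are back-computed from the published recall.
let atDefault = ConfusionCounts(
    truePositives: Int((0.982 * 5240.0).rounded()),
    falsePositives: 34,
    positives: 5240,
    negatives: 6416
)

print(String(format: "precision=%.3f recall=%.3f f1=%.3f fpr=%.2f%%",
             atDefault.precision, atDefault.recall, atDefault.f1,
             atDefault.falsePositiveRate * 100))
```

The recomputed values should land within rounding distance of the published row; small differences in the last digit are expected because the table reports rounded figures.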
Performance
arm64 device build, post-warmup, RTF = wall-time-per-frame ÷ frame audio duration (lower is better):
| Device | RTF | per-frame latency |
|---|---|---|
| iPhone 13 Pro Max (A15 Bionic) | 0.015 | ~150 µs / 10 ms frame |
At RTF ≈ 0.015 the wake-word burns ~1.5 % of one core during continuous listening — well within an always-on power budget.
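For reference, the RTF arithmetic can be sketched as follows. `realTimeFactor` and `measureRTF` are hypothetical helpers written for this illustration, and the `processFrame` closure stands in for whatever per-frame work you want to time; this is not the harness that produced the numbers above.

```swift
import Foundation

/// Real-time factor: processing time divided by the duration of the audio processed.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audio duration must be positive")
    return processingSeconds / audioSeconds
}

/// Times `frames` calls of `processFrame` and returns the average RTF.
/// Run a few untimed calls first if you want a post-warmup figure.
func measureRTF(frames: Int, frameSeconds: Double, processFrame: () -> Void) -> Double {
    precondition(frames > 0, "need at least one frame")
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<frames { processFrame() }
    let elapsed = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
    return realTimeFactor(processingSeconds: elapsed,
                          audioSeconds: Double(frames) * frameSeconds)
}

// Table row: ~150 µs of work per 10 ms frame.
let rtf = realTimeFactor(processingSeconds: 150e-6, audioSeconds: 10e-3)
let coreSharePercent = rtf * 100   // share of one core while listening continuously
print("rtf=\(rtf) core share=\(coreSharePercent)%")
```

An RTF below 1.0 means the model keeps up with real time; the share of one core is simply RTF expressed as a percentage.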
How it compares
The on-device wake-word category is dominated by Picovoice Porcupine on the paid side and openWakeWord on the OSS side:
| | VoxrtWakeWord | Picovoice Porcupine | openWakeWord |
|---|---|---|---|
| Model file | ~100 KB (.vxrt) | not published | not published |
| Mobile RTF disclosed | ✅ measured on Snapdragon 662 + iPhone | ❌ Raspberry Pi 5 only (0.6 % CPU; ~1.8 % scaled to SD662) | ❌ Raspberry Pi 3 only |
| Accuracy headline | ROC AUC 0.9966 on "Hey Assistant"; precision 0.993 / recall 0.982 @ default threshold | 2.7 % miss rate averaged across 6 built-in keywords (alexa, computer, jarvis, smart mirror, snowboy, view glass) | varies per pretrained model |
| Native mobile SDK | ✅ Android JitPack + iOS SPM | ✅ Android + iOS + RN + Flutter | ❌ Python-only; community C++ port |
| License | Apache-2.0 wrapper + proprietary runtime + proprietary weights (redistribution allowed as an unmodified part of this SDK, no per-seat fees) | Commercial (Free Plan evaluation-only; production tier opaque, sales-gated) | Apache-2.0 code, CC-BY-NC-SA on pretrained weights (non-commercial) |
| Custom phrase / language | Tuned per customer on request (paid engagement) | Via Picovoice Console — paid tier required for commercial deployment | Self-train via Colab + TTS (~1 hour) |
On raw speed and accuracy we're in a near-tie with Porcupine (their 2.7 % miss rate is a real benchmark). The clear differentiators are license clarity (no per-seat fees and commercial redistribution allowed as part of this SDK, versus Picovoice's opaque pricing and openWakeWord's non-commercial weights), a measured mobile RTF (neither alternative above publishes one for low-cost Android hardware), and a genuinely tiny ~100 KB model file.
Full sourced analysis: voxrt.com.
Binary footprint
- Swift wrapper source: ~7 KB total (one file)
- VoxrtWakeWordNative.xcframework.zip (downloaded by SPM): ~19 MB compressed (device + simulator slices)
- After SPM extraction + linker dead-code elimination on the device-only path: ~2–3 MB delta in your app binary
- Wake-phrase model voxrt_wake_word.vxrt: ~100 KB fp16 (downloaded separately)
Net effect on a consuming iOS app's IPA: roughly 2–3 MB once xcframework device slice + .vxrt + Swift wrapper are linked and bundled.
Install
In Xcode: File → Add Package Dependencies → paste:
https://github.com/VoxRT/voxrt-wake-word-ios
…and pin to v0.1.0.
Or in Package.swift:
dependencies: [
    .package(url: "https://github.com/VoxRT/voxrt-wake-word-ios.git", from: "0.1.0"),
],
Get the wake-phrase model
The model weights are NOT bundled with the package — fetch them once from voxrt-wake-word-models:
https://github.com/VoxRT/voxrt-wake-word-models/releases/download/v0.1.0/voxrt_wake_word.vxrt
SHA-256: 9d40bdc132a2ad8e85bd8a28bb49b77c51a7c62f60567222a037e44418510e8f
Three common bundling patterns for a ~100 KB asset:
- Bundle in app resources: drag voxrt_wake_word.vxrt into your Xcode target and load with VoxrtWakeWordEngine(bundleResource: "voxrt_wake_word"). Works offline from first launch.
- Download on first run: URLSession fetch into FileManager.default.urls(for: .applicationSupportDirectory, ...), verify the SHA-256, then load with VoxrtWakeWordEngine(modelURL: cachedFile). Lets you swap models without an app update.
- App Thinning / On-Demand Resources: Apple's per-asset delivery if you want the App Store to host the file.
Download-on-first-run snippet
import CryptoKit
import Foundation
import VoxrtWakeWord

private let kModelURL = URL(string:
    "https://github.com/VoxRT/voxrt-wake-word-models/releases/download/v0.1.0/voxrt_wake_word.vxrt"
)!
private let kModelSHA256 = "9d40bdc132a2ad8e85bd8a28bb49b77c51a7c62f60567222a037e44418510e8f"

/// Returns a local URL for a checksum-verified model, downloading it on first use.
func ensureModel() async throws -> URL {
    let fm = FileManager.default
    let dir = try fm.url(
        for: .applicationSupportDirectory, in: .userDomainMask,
        appropriateFor: nil, create: true
    )
    let dest = dir.appendingPathComponent("voxrt_wake_word.vxrt")

    // Reuse the cached copy only if it is readable and still matches the pinned hash;
    // an unreadable or corrupt file falls through to a fresh download.
    if let cached = try? Data(contentsOf: dest), sha256Hex(cached) == kModelSHA256 {
        return dest
    }

    // Download to a temporary file and verify it before it replaces anything.
    let (tmpURL, _) = try await URLSession.shared.download(from: kModelURL)
    let bytes = try Data(contentsOf: tmpURL)
    guard sha256Hex(bytes) == kModelSHA256 else {
        throw NSError(domain: "voxrt", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "model SHA-256 mismatch"])
    }
    if fm.fileExists(atPath: dest.path) { try fm.removeItem(at: dest) }
    try fm.moveItem(at: tmpURL, to: dest)
    return dest
}

/// Lowercase hex encoding of the SHA-256 digest of `d`.
private func sha256Hex(_ d: Data) -> String {
    SHA256.hash(data: d).map { String(format: "%02x", $0) }.joined()
}

// Then, in your app (from an async context):
let modelURL = try await ensureModel()
let engine = try VoxrtWakeWordEngine(modelURL: modelURL)
Quick start
import VoxrtWakeWord

// 1. Resolve the bundled model URL.
guard let modelURL = Bundle.main.url(forResource: "voxrt_wake_word",
                                     withExtension: "vxrt") else {
    fatalError("voxrt_wake_word.vxrt not found in bundle")
}

// 2. Build the engine. `init(modelURL:)` reads the .vxrt bytes
//    into the runtime: ~100 KB total, no streaming I/O required.
let engine = try VoxrtWakeWordEngine(modelURL: modelURL)

// 3. Feed Int16 PCM (mono, 16 kHz) blocks of any size; 100 ms
//    blocks are the recommended pace for AVAudioEngine taps.
//    processPcm returns any threshold-crossings emitted during
//    this push; usually empty.
let detections = try engine.processPcm(pcmInt16Array)
for d in detections {
    print("frame=\(d.frameIndex) t=\(d.timestampSec) score=\(d.score)")
}
processPcm / reset / close are synchronous and stateful. The engine does NOT own a worker thread. Drive it from your own capture thread.
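The engine accepts blocks of any size, so re-chunking is optional. If your capture callback delivers irregular buffer lengths and you would like to keep the recommended 100 ms pace, one possible approach is a small accumulator like the sketch below. `PcmBlockAccumulator` is a hypothetical helper, not part of the package, and the 1,600-sample default assumes 16 kHz mono input.

```swift
/// Re-chunks capture buffers of arbitrary length into fixed-size blocks.
/// Hypothetical helper for illustration; not part of VoxrtWakeWord.
struct PcmBlockAccumulator {
    let blockSize: Int
    private var pending: [Int16] = []

    init(blockSize: Int = 1_600) {   // 100 ms at 16 kHz
        precondition(blockSize > 0, "block size must be positive")
        self.blockSize = blockSize
    }

    /// Number of samples waiting for the next complete block.
    var pendingCount: Int { pending.count }

    /// Appends `samples` and returns every complete block now available.
    mutating func push(_ samples: [Int16]) -> [[Int16]] {
        pending.append(contentsOf: samples)
        var blocks: [[Int16]] = []
        var offset = 0
        while pending.count - offset >= blockSize {
            blocks.append(Array(pending[offset..<(offset + blockSize)]))
            offset += blockSize
        }
        pending.removeFirst(offset)   // keep only the unfinished tail
        return blocks
    }
}

var accumulator = PcmBlockAccumulator()
let first = accumulator.push([Int16](repeating: 0, count: 1_000))    // not enough for a block yet
let second = accumulator.push([Int16](repeating: 0, count: 2_500))   // 3,500 buffered in total
print(first.count, second.count, accumulator.pendingCount)
```

In a capture callback you would pass each returned block to `processPcm` in order, on the same thread that owns the engine.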
Live microphone example
The canonical streaming pattern: the capture thread owns the AVAudioEngine tap, and the engine behaves as a stateful function.
import AVFoundation
import VoxrtWakeWord

let session = AVAudioSession.sharedInstance()
try session.setCategory(.record, mode: .measurement)
try session.setPreferredSampleRate(16_000)
try session.setActive(true)

let audioEngine = AVAudioEngine()
let input = audioEngine.inputNode
let hwFormat = input.outputFormat(forBus: 0)
let voxrtFormat = AVAudioFormat(
    commonFormat: .pcmFormatInt16,
    sampleRate: 16_000,
    channels: 1,
    interleaved: true
)!
let converter = AVAudioConverter(from: hwFormat, to: voxrtFormat)!

guard let modelURL = Bundle.main.url(forResource: "voxrt_wake_word",
                                     withExtension: "vxrt") else { fatalError() }
let wakeWord = try VoxrtWakeWordEngine(modelURL: modelURL)

input.installTap(onBus: 0, bufferSize: 4_096, format: hwFormat) { hwBuf, _ in
    // Output capacity: resampled frame count plus headroom for converter latency.
    let outCap = AVAudioFrameCount(
        Double(hwBuf.frameLength) * 16_000 / hwBuf.format.sampleRate + 256
    )
    guard let outBuf = AVAudioPCMBuffer(pcmFormat: voxrtFormat, frameCapacity: outCap) else {
        return
    }
    // Hand hwBuf to the converter exactly once. Answering .haveData on every
    // callback would feed the same audio again until outBuf is full.
    var consumed = false
    var err: NSError?
    converter.convert(to: outBuf, error: &err) { _, status in
        if consumed {
            status.pointee = .noDataNow
            return nil
        }
        consumed = true
        status.pointee = .haveData
        return hwBuf
    }
    if err != nil { return }
    guard let i16Ptr = outBuf.int16ChannelData?[0] else { return }
    let samples = Array(UnsafeBufferPointer(start: i16Ptr, count: Int(outBuf.frameLength)))
    do {
        for d in try wakeWord.processPcm(samples) {
            DispatchQueue.main.async {
                // update UI on wake detection
                print("wake! score=\(d.score)")
            }
        }
    } catch { /* surface to UI */ }
}
try audioEngine.start()
To stop cleanly (button tap, scene background, navigation away):
audioEngine.stop()
audioEngine.inputNode.removeTap(onBus: 0)
try? AVAudioSession.sharedInstance().setActive(
    false, options: [.notifyOthersOnDeactivation]
)
wakeWord.close()
Permission: add NSMicrophoneUsageDescription to your Info.plist (or INFOPLIST_KEY_NSMicrophoneUsageDescription in a generated-plist project) before calling AVAudioSession.setActive.
Tuning
Threshold
Default is 0.9 (the chosen operating point on test). Lower for higher recall, raise for stricter precision:
try engine.setThreshold(0.85) // a bit more recall, ~3.7 % false-positive rate on test
try engine.setThreshold(0.95) // stricter, but recall drops from 0.982 to 0.769 on test
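One way to choose among the published operating points is to state a false-positive budget and take the highest-recall row that fits it. The sketch below does that using the values from the test-split table in Model quality; `OperatingPoint` and `bestOperatingPoint` are hypothetical names introduced here, and the numbers describe the test split only, not your deployment audio.

```swift
/// One row of the published threshold table (hypothetical helper type).
struct OperatingPoint {
    let threshold: Float
    let precision: Double
    let recall: Double
    let falsePositiveRate: Double
}

// Values copied from the test-split table.
let operatingPoints = [
    OperatingPoint(threshold: 0.5, precision: 0.864, recall: 0.995, falsePositiveRate: 0.128),
    OperatingPoint(threshold: 0.85, precision: 0.957, recall: 0.987, falsePositiveRate: 0.037),
    OperatingPoint(threshold: 0.9, precision: 0.993, recall: 0.982, falsePositiveRate: 0.005),
    OperatingPoint(threshold: 0.95, precision: 0.997, recall: 0.769, falsePositiveRate: 0.002),
]

/// Returns the highest-recall point whose false-positive rate fits the budget,
/// or nil when no published point is strict enough.
func bestOperatingPoint(maxFalsePositiveRate: Double,
                        in points: [OperatingPoint]) -> OperatingPoint? {
    points
        .filter { $0.falsePositiveRate <= maxFalsePositiveRate }
        .max { $0.recall < $1.recall }
}

let strict = bestOperatingPoint(maxFalsePositiveRate: 0.01, in: operatingPoints)
let relaxed = bestOperatingPoint(maxFalsePositiveRate: 0.05, in: operatingPoints)
```

The selected point's `threshold` can then be passed to `setThreshold`. With a 1 % budget this picks the shipped default, which matches the reasoning behind that choice.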
Cooldown
After a detection, the engine suppresses further events for cooldownFrames × 10 ms. Default is 100 frames = 1 second.
try engine.setCooldownFrames(200) // 2 seconds
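To make the interaction between threshold and cooldown concrete, here is a simplified model of the behaviour described above. It is a sketch, not the runtime's implementation: it assumes a frame fires when its score is at or above the threshold, and that the next event is allowed `cooldownFrames` frames after the one that fired. `framesForCooldown` and `detectionFrames` are hypothetical helpers.

```swift
let frameSeconds = 0.010   // 1 frame = 10 ms

/// Converts a desired quiet period in seconds into a frame count.
func framesForCooldown(seconds: Double) -> Int {
    Int((seconds / frameSeconds).rounded())
}

/// Returns the frame indices that would emit an event under the simplified model.
/// Assumption: score >= threshold fires, then events are suppressed until
/// `cooldownFrames` frames have passed.
func detectionFrames(scores: [Float], threshold: Float, cooldownFrames: Int) -> [Int] {
    var fired: [Int] = []
    var nextAllowed = 0   // first frame index allowed to fire
    for (index, score) in scores.enumerated() {
        guard index >= nextAllowed, score >= threshold else { continue }
        fired.append(index)
        nextAllowed = index + cooldownFrames
    }
    return fired
}

// Three seconds of frames that all score above the default threshold.
let scores = [Float](repeating: 0.95, count: 300)
let fired = detectionFrames(scores: scores, threshold: 0.9,
                            cooldownFrames: framesForCooldown(seconds: 1.0))
let timestamps = fired.map { Double($0) * frameSeconds }
print(fired, timestamps)
```

Under these assumptions a sustained high score produces one event per cooldown period rather than one per frame, which is why a longer cooldown suits phrases that users tend to repeat.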
API
public final class VoxrtWakeWordEngine {
    public init(modelURL: URL) throws
    public init(bytes: Data) throws
    public convenience init(
        bundleResource name: String = "voxrt_wake_word",
        ext: String = "vxrt",
        bundle: Bundle = .main
    ) throws

    public func processPcm(_ pcm: [Int16]) throws -> [WakeWordDetection]
    public func processPcm(_ pcm: [Float]) throws -> [WakeWordDetection]
    public func currentScore() throws -> Float
    public func reset() throws
    public func setThreshold(_ threshold: Float) throws
    public func setCooldownFrames(_ frames: Int) throws
    public func close()
}

public struct WakeWordDetection: Equatable {
    public let frameIndex: UInt64   // 0-based frame index (1 frame = 10 ms)
    public let timestampSec: Float  // seconds since engine start (or last reset)
    public let score: Float         // sigmoid score in [0, 1]
}

public enum VoxrtWakeWord {
    public static var nativeVersion: String
    public static var abiVersion: (major: UInt16, minor: UInt16)
}
License
- Swift wrapper source (this Swift Package): Apache-2.0. See LICENSE.
- Compiled runtime (VoxrtWakeWordNative.xcframework): proprietary, redistributable under the terms in LICENSE-BINARY.
- Wake-phrase model (voxrt_wake_word.vxrt): proprietary, distributed separately under the voxrt-wake-word-models license terms.
For commercial integration, custom phrase models, or licensing terms beyond redistribution of the unmodified package, contact [email protected].