Always-on wake-phrase detection on the VoxRT custom on-device inference runtime. ~48K-parameter depthwise-separable convnet, 16 kHz mono in, sigmoid-score out + threshold-crossing events. Detects the phrase "Hey Assistant".
- Status: production-ready for aarch64 Linux. All five language channels verified end-to-end on a live microphone using the official registries (PyPI / npm /
go get/ GitHub Release tarball) — see the Examples section for the tested quickstarts. - Current version:
v0.1.5 - Target platform:
aarch64Linux (Raspberry Pi 3 / 4 / 5 / Zero 2, NVIDIA Jetson, AWS Graviton, Rock Pi / Orange Pi / other aarch64 SBCs) - glibc baseline: 2.17 (manylinux2014_aarch64 — Pi OS Bullseye/Bookworm, Jetson L4T 32.x/35.x/36.x, Ubuntu 18.04+, Debian 10+, RHEL 7+, AWS Graviton)
- Languages shipped: Python (PyPI wheel), Node.js (npm), Go (
go get), C / C++ (tarball + CMake + pkg-config), Rust (git dependency) - License: Apache-2.0 (wrapper sources) · proprietary (compiled runtime, redistribution allowed via this SDK)
- Wake-phrase weights: proprietary in-house (synthetic training data; no upstream license obligations)
What is VoxRT?
VoxRT is a from-scratch inference runtime for on-device speech models. No ONNX Runtime, no PyTorch Mobile, no LiteRT — a custom Rust core sized and tuned for streaming voice workloads on phone-class and edge-Linux hardware.
voxrt-wake-word is the wake-word product on that runtime, alongside voxrt-wake-word-android (JitPack) and voxrt-wake-word-ios (Swift Package Manager). Same runtime crate, same NEON kernel set, same .vxrt model file — pick the SDK that matches your target. Sister products voxrt-silero (VAD) and voxrt-asr-android (streaming ASR) share the same runtime.
Custom-phrase wake-word models (your own brand name, multi-phrase detection, language extension) are part of the commercial VoxRT SDK tier. Contact [email protected].
Model quality
Test split: 5,240 positive utterances + 6,416 hard-negative utterances (isolated "Hey", isolated "Assistant", competitor wake-words like "Hey Siri", phonetic neighbours, arbitrary speech, non-speech audio). All speakers disjoint from train + val.
- ROC AUC: 0.9966
- Average precision (PR AUC): 0.9899
| Threshold | Precision | Recall | F1 | FPR | False positives on test |
|---|---|---|---|---|---|
| 0.5 | 0.864 | 0.995 | 0.925 | 12.8 % | 822 / 6,416 |
| 0.85 | 0.957 | 0.987 | 0.972 | 3.7 % | 234 / 6,416 |
| 0.9 (default) | 0.993 | 0.982 | 0.987 | 0.5 % | 34 / 6,416 |
| 0.95 | 0.997 | 0.769 | 0.868 | 0.2 % | 12 / 6,416 |
The SDK ships with threshold = 0.9 as the default operating point. Lower it at runtime if your application can tolerate more false positives in exchange for higher recall.
Performance
aarch64 release builds, post-warmup, live microphone at 16 kHz mono. RTF = wall-time-per-audio-second (lower is better):
| Device | CPU | RTF | CPU budget |
|---|---|---|---|
| Raspberry Pi Zero 2 W | Cortex-A53 ×4 @ 1.0 GHz | 0.053 (measured, sustained ≥ 60 s) | 5.3 % of one core |
| Raspberry Pi 3 A+ / B / B+ | Cortex-A53 ×4 @ 1.2–1.4 GHz | ~0.038–0.044 (scaled) | ~4 % |
| Raspberry Pi 4 B / 400 | Cortex-A72 ×4 @ 1.5–1.8 GHz | ~0.018–0.024 (scaled) | ~2 % |
| Raspberry Pi 5 | Cortex-A76 ×4 @ 2.4 GHz | ~0.008–0.012 (scaled) | ~1 % |
| NVIDIA Jetson Nano / Orin Nano | A57 / A78AE | ≲ Pi 4–5 | ≲ 2 % |
| AWS Graviton / Ampere Altra | Neoverse N1 / V1 | ≪ 0.005 | server-class |
| Rock Pi / Orange Pi / Khadas (A53/A55) | various | ≲ Pi 3–4 | ≲ 4 % |
The Pi Zero 2 W row is a direct measurement (RTF 0.053 sustained across a 60-second continuous live-mic session, peak-chunk RTF 0.074). Other rows scale by clock + µarch and are conservative — same binary, no per-device tuning.
At RTF ≈ 0.05 on the smallest Pi, the wake-word is ~19× faster than realtime with 95 % of one core free for the rest of your app. Even on the LITTLE-cluster range, the ~5 % CPU budget survives thermal throttling gracefully.
NEON vs scalar
The runtime ships one binary with both scalar and NEON code paths for every hot op — a runtime feature-detect selects NEON on aarch64 and falls back to a bit-exact scalar path (kernels match each other within ULP budget as a CI gate; no "looks about right"). On the same reference device, with all thread affinity + build flags held constant, forcing the scalar path vs the default NEON path:
| Device | Scalar RTF | NEON RTF | Speedup |
|---|---|---|---|
| Snapdragon 662, Cortex-A73 pinned (HIGH_PERF affinity) | 0.182 | 0.021 | 8.7× |
Continuous 60-second live-mic push, 512-sample frames at 16 kHz mono. HIGH_PERF affinity keeps the perf-cluster core clock steady across the two runs so the ratio is the NEON win alone, not scheduler noise. Numbers are the reference — cheap-tier A53 gets a similar multiplier at proportionally slower absolute RTF.
How it compares
The on-device wake-word category is dominated by Picovoice Porcupine on the paid side and openWakeWord on the OSS side:
| voxrt-wake-word | Picovoice Porcupine | openWakeWord | |
|---|---|---|---|
| Model file | ~100 KB (.vxrt) | not published | not published |
| Edge-Linux RTF disclosed | ✅ 0.053 measured on Pi Zero 2 W (5.3 % CPU); scaled figures for Pi 3/4/5 + Jetson + Graviton | ✅ 0.6 % CPU on Pi 5 32-bit (published) | ❌ Pi 3 only, "15–20 models real-time" (no per-model RTF) |
| Accuracy headline | ROC AUC 0.9966 on "Hey Assistant"; precision 0.993 / recall 0.982 @ default threshold | 2.7 % miss rate averaged across 6 built-in keywords | varies per pretrained model |
| Native mobile SDK | ✅ Android JitPack + iOS SPM | ✅ Android + iOS + RN + Flutter | ❌ Python-only; community C++ port |
| License | Apache-2.0 wrapper + proprietary runtime + proprietary weights (redistribution allowed via this SDK) | Commercial (Free Plan evaluation-only; production tier opaque, sales-gated) | Apache-2.0 code, CC-BY-NC-SA on pretrained weights (non-commercial) |
| Custom phrase / language | Tuned per customer on request (paid engagement) | Via Picovoice Console — paid tier required for commercial deployment | Self-train via Colab + TTS (~1 hour) |
Differentiators: published edge-Linux RTF on cheap hardware (Porcupine only publishes RPi 5; we publish Pi Zero 2 W), a ~100 KB model file, and license clarity (openWakeWord's NC weights block commercial use; Picovoice's paid tier is sales-gated).
Full sourced analysis at voxrt.com.
Binary footprint
- C tarball (
voxrt-wake-word-0.1.5-aarch64-linux-gnu.tar.gz): ~264 KB compressed. Containslibvoxrt_wake_word.so(~480 KB stripped),voxrt_wake_word.h, pkg-config + CMake configs, examples/, LICENSE × 2, README. - PyPI wheel (
voxrt_wake_word-...-cp39-abi3-manylinux_2_17_aarch64.whl): ~280 KB compressed. One wheel covers Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 via the pyo3 stable-ABI bridge. - npm package (
voxrt-wake-word-0.1.5.tgz, imports as@voxrt/wake-word): ~264 KB compressed. Containsvoxrt-wake-word.linux-arm64-gnu.node(~480 KB) +index.js+index.d.ts. - Go module (in-repo at
go/): ~480 KB bundled.so+ ~5 KB Go source. Fetched bygo get github.com/VoxRT/voxrt-wake-word-linux/go. - Wake-phrase model
voxrt_wake_word.vxrt(downloaded separately from voxrt-wake-word-models): ~100 KB fp16.
Net effect on a shipped app: roughly 600 KB once the runtime .so + .vxrt + language wrapper are bundled.
Install
Python
pip install voxrt-wake-word
One abi3 wheel covers Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13. Or pin a specific version's wheel from a GitHub Release if your deploy pipeline prefers offline installs.
Node.js
npm install @voxrt/wake-word
TypeScript definitions ship with the package. Or pin a specific version's tarball from a GitHub Release — same .tgz, offline-installable via npm install ./….
Go
go get github.com/VoxRT/voxrt-wake-word-linux/[email protected]
The go/ subdirectory of this repository IS the Go module — no separate publish step. cgo picks up the bundled .so via ${SRCDIR}-relative rpath at build time; no LD_LIBRARY_PATH or system install required.
C / C++
curl -L https://github.com/VoxRT/voxrt-wake-word-linux/releases/download/v0.1.5/voxrt-wake-word-0.1.5-aarch64-linux-gnu.tar.gz \
| sudo tar -xz -C /usr/local --strip-components=1
sudo ldconfig
After extract:
gcc main.c $(pkg-config --cflags --libs voxrt-wake-word) -o my_app
Or in CMake:
find_package(VoxRTWakeWord 0.1.5 REQUIRED)
target_link_libraries(my_app PRIVATE VoxRT::WakeWord)
Rust
Until the crates.io decision is finalised (see docs/wake-word/12-linux-sdk-packaging.md in the source monorepo — private), pin the crate via git:
[dependencies]
voxrt-wake-word = { git = "https://github.com/VoxRT/voxrt-wake-word-linux", tag = "v0.1.5" }
Get the wake-phrase model
The voxrt_wake_word.vxrt model file (AES-256-GCM encrypted, ~100 KB) is distributed separately from voxrt-wake-word-models so the SDK ships without weights and the two version independently. All v0.1.x SDK releases use the same model v0.1.0.
- Download once at deploy time:
``sh curl -LO https://github.com/VoxRT/voxrt-wake-word-models/releases/download/v0.1.0/voxrt_wake_word.vxrt `` Bundle alongside your application binary.
- Download on first run + SHA-256 verify — recommended if you want to swap models without a redeploy. Compare against
9d40bdc132a2ad8e85bd8a28bb49b77c51a7c62f60567222a037e44418510e8f.
- Embed as a build-time resource — for Python:
importlib.resources. For Node.js:require.resolve+fs.readFileSync. For Go://go:embed. For C: any resource-embedder. Feed the bytes tofrom_bytes(...)on your language'sWakeWordEngineconstructor.
Quick start
Python — the smallest possible end-to-end pipeline:
from voxrt_wake_word import WakeWordEngine
engine = WakeWordEngine.from_path("voxrt_wake_word.vxrt")
engine.threshold = 0.9
engine.cooldown_frames = 100
for chunk in mic_iter(): # int16 mono @ 16 kHz, any chunk size
for d in engine.push_pcm_i16(chunk):
print(f"wake! t={d.timestamp_sec:.3f}s score={d.score:.4f}")
Every language ships an equivalent end-to-end example under examples/. All quickstarts use only stdlib WAV parsing — no numpy / sounddevice / go-audio dependency.
Examples
| Language | Example | What it does |
|---|---|---|
| C | examples/c/alsa-mic-quickstart | Live ALSA microphone → wake-word events. Build with make, run against plughw:0. |
| C++ | examples/c/alsa-mic-quickstart-cpp | Same live-mic demo in idiomatic C++17: RAII wrappers over voxrt_wake_word_t* + snd_pcm_t*, std::atomic Ctrl-C, exception-based errors. Same C ABI under the hood. |
| C++ / CMake | examples/c/cmake-consumer | Skeleton CMake project consuming the SDK via find_package(VoxRTWakeWord ...). Copy-paste into a real project. |
| Python | examples/python/quickstart | Walk a 16 kHz WAV file in 32 ms chunks, print detections + RTF. Pure stdlib (wave + struct). |
| Node.js | examples/nodejs/quickstart | Same shape, JavaScript. Manual WAV header parse via fs + Buffer, no npm deps. |
| Go | examples/go/quickstart | Same shape, Go. encoding/binary for the WAV header. cgo picks up the bundled .so via ${SRCDIR} rpath. |
The C examples additionally ship with a live-microphone version; language quickstarts document the microphone-streaming pattern in each README (via sounddevice for Python, naudiodon / portaudio for Node, portaudio / goalsa for Go).
Tuning
Threshold. Sigmoid-space [0, 1]. Default 0.9. Lower increases recall (more detections, more false positives); raise increases precision (fewer detections, fewer false positives). See the Model quality table above for concrete precision/recall/FPR trade-offs at 0.5, 0.85, 0.9, 0.95.
Cooldown. Frames of suppression after a trigger. Default 100 (= 1.0 s at 10 ms hop). Prevents stuttering on a long wake phrase that crosses threshold across many consecutive frames. Lower for very tight utterance-to-utterance testing; leave at 100 for production.
CPU affinity. Not exposed as a public knob — the SDK does not pin threads on Linux. If you're on a big.LITTLE Pi (Pi 4/5) and want the wake-word thread to stick to the perf cluster, wrap push_pcm_i16 in a thread you affine yourself via sched_setaffinity (POSIX C), os.sched_setaffinity (Python), or runtime.LockOSThread + unix.SchedSetaffinity (Go).
Both threshold and cooldown_frames are mutable at any time — change them between chunks safely.
License
- Wrapper sources (Python / Node / Go / C headers / examples / docs): Apache-2.0.
- Compiled binaries (
.so,.node,.whl, prebuilt tarball artefact): proprietary, see LICENSE-BINARY. - Wake-phrase model weights (
voxrt_wake_word.vxrt): proprietary in-house (synthetic training data), same terms as LICENSE-BINARY.
Commercial licensing for custom phrases, additional languages, or bulk-device deployments: [email protected].
Links
- Brand + docs: <https://voxrt.com>
- Android (JitPack): <https://github.com/VoxRT/voxrt-wake-word-android>
- iOS (Swift Package Manager): <https://github.com/VoxRT/voxrt-wake-word-ios>
- Wake-word model releases: <https://github.com/VoxRT/voxrt-wake-word-models>
- Issues: <https://github.com/VoxRT/voxrt-wake-word-linux/issues>