On-device streaming speech recognition
Real-time speech-to-text that runs on the device — partial transcripts every 80 ms, no offline batch step. A 32M-parameter FastConformer ported tensor-by-tensor onto the VoxRT runtime, with both CTC and RNN-T decoders exposed.
True streaming, fully on-device
Most "on-device" speech recognition is really batch transcription with a delay. VoxRT streams: it emits partial transcripts as the user speaks, so you can drive live captions, voice input, and barge-in without waiting for an utterance to finish — and without sending audio to a server.
The model is NVIDIA NeMo's stt_en_fastconformer_hybrid_medium_streaming_80ms (CC-BY-4.0), ported tensor-by-tensor onto the VoxRT runtime. Accuracy stays within floating-point noise of the upstream Python NeMo reference — research-grade quality with a tiny on-device footprint.
The runtime is the product
VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.
Streaming ASR is one product on that runtime, alongside VAD and wake word. All three share the same Rust runtime crate and NEON kernel set — the runtime is the product; the models are what it runs.
What you get
Streaming partials
Partial transcripts every 80 ms — no batch step required.
CTC and RNN-T
Both decoders exposed for your latency/accuracy trade-off.
32M FastConformer
Research-grade accuracy running on-device.
WER parity
Word error rate within float-noise of the upstream NeMo reference.
Fully on-device
No cloud, no audio leaving the device.
Real-time with headroom to spare
Pick your accuracy/latency trade-off
Word error rate on LibriSpeech-500, within floating-point noise of the upstream Python NeMo reference. Two decoders are exposed.
| Decoder | LibriSpeech-500 WER | Per-chunk cost | Notes |
|---|---|---|---|
| RNN-T (recommended) | 3.267% | ~50 ms | Higher accuracy; LSTM state survives chunk boundaries |
| CTC | 4.895% | ~5 ms | ~15% cheaper per chunk; marginally lower accuracy |
What it costs on disk and in memory
- Swift wrapper source~20 KB
- Native xcframework, compressed (device slice)~5 MB
- Streaming model on disk (fp16)~61 MB
- Native heap at runtime (steady-state)~150 MB