Streaming & batch ASR

On-device streaming speech recognition

Real-time speech-to-text that runs on the device — partial transcripts every 80 ms, no offline batch step. A 32M-parameter FastConformer ported tensor-by-tensor onto the VoxRT runtime, with both CTC and RNN-T decoders exposed.

Overview

True streaming, fully on-device

Most "on-device" speech recognition is really batch transcription with a delay. VoxRT streams: it emits partial transcripts as the user speaks, so you can drive live captions, voice input, and barge-in without waiting for an utterance to finish — and without sending audio to a server.

The model is NVIDIA NeMo's stt_en_fastconformer_hybrid_medium_streaming_80ms (CC-BY-4.0), ported tensor-by-tensor onto the VoxRT runtime. Accuracy stays within floating-point noise of the upstream Python NeMo reference — research-grade quality with a tiny on-device footprint.

The runtime

The runtime is the product

VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.

Streaming ASR is one product on that runtime, alongside VAD and wake word. All three share the same Rust runtime crate and NEON kernel set — the runtime is the product; the models are what it runs.

Capabilities

What you get

Streaming partials

Partial transcripts every 80 ms — no batch step required.

CTC and RNN-T

Both decoders exposed for your latency/accuracy trade-off.

32M FastConformer

Research-grade accuracy running on-device.

WER parity

Word error rate within float-noise of the upstream NeMo reference.

Fully on-device

No cloud, no audio leaving the device.

Performance

Real-time with headroom to spare

0.08–0.10
streaming RTF on iPhone 13 Pro Max — ~90 ms / 1.12 s chunk
0.30
streaming real-time factor on a Snapdragon 662
32M
parameters (FastConformer hybrid)
~90%
of one core free during live transcription (at RTF ≈ 0.10)
Model quality

Pick your accuracy/latency trade-off

Word error rate on LibriSpeech-500, within floating-point noise of the upstream Python NeMo reference. Two decoders are exposed.

DecoderLibriSpeech-500 WERPer-chunk costNotes
RNN-T (recommended)3.267%~50 msHigher accuracy; LSTM state survives chunk boundaries
CTC4.895%~5 ms~15% cheaper per chunk; marginally lower accuracy
Footprint

What it costs on disk and in memory

  • Swift wrapper source~20 KB
  • Native xcframework, compressed (device slice)~5 MB
  • Streaming model on disk (fp16)~61 MB
  • Native heap at runtime (steady-state)~150 MB

Build it on-device with VoxRT

Tell us what you're transcribing and which devices it has to run on.

Try now