Streaming & batch ASR

On-device streaming speech recognition

Real-time speech-to-text that runs on the device — partial transcripts every 80 ms, no offline batch step. A 32M-parameter FastConformer ported tensor-by-tensor onto the VoxRT runtime, with both CTC and RNN-T decoders exposed.

Overview

True streaming, fully on-device

Most "on-device" speech recognition is really batch transcription with a delay. VoxRT streams: it emits partial transcripts as the user speaks, so you can drive live captions, voice input, and barge-in without waiting for an utterance to finish — and without sending audio to a server.

The model is NVIDIA NeMo's stt_en_fastconformer_hybrid_medium_streaming_80ms (CC-BY-4.0), ported tensor-by-tensor onto the VoxRT runtime. Accuracy stays within floating-point noise of the upstream Python NeMo reference, so the port keeps the reference model's research-grade quality at an on-device footprint.

The runtime

The runtime is the product

VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.

Streaming ASR is one product on that runtime, alongside VAD and wake word. All three share the same Rust runtime crate and NEON kernel set — the runtime is the product; the models are what it runs.

Capabilities

What you get

Streaming partials

Partial transcripts every 80 ms — no batch step required.

CTC and RNN-T

Both decoders exposed for your latency/accuracy trade-off.

32M FastConformer

Research-grade accuracy running on-device.

WER parity

Word error rate within floating-point noise of the upstream NeMo reference.

Fully on-device

No cloud, no audio leaving the device.

Performance

Real-time with headroom to spare

0.08–0.10

streaming real-time factor (RTF) on iPhone 13 Pro Max: about 90 ms of compute per 1.12 s chunk

0.30

streaming real-time factor on a Snapdragon 662

32M

parameters (FastConformer hybrid)

~90%

of one core free during live transcription (at RTF ≈ 0.10)
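
These figures follow from the usual definition of real-time factor: processing time divided by the duration of audio processed. Below is a minimal sketch of that arithmetic using the iPhone numbers quoted above. The function names are illustrative and not part of the SDK, and the headroom figure assumes inference runs on a single core.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent computing / duration of audio covered."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def core_headroom(rtf: float) -> float:
    """Fraction of one core left idle, assuming single-core inference."""
    return max(0.0, 1.0 - rtf)


# iPhone 13 Pro Max figures from this page: ~90 ms of compute per 1.12 s chunk.
iphone_rtf = real_time_factor(0.090, 1.12)
print(f"RTF ≈ {iphone_rtf:.2f}")                            # RTF ≈ 0.08
print(f"headroom at RTF 0.10 ≈ {core_headroom(0.10):.0%}")  # 90%
```

An RTF below 1.0 means the recognizer keeps up with live audio; the lower the value, the more of the core is left for the rest of the app.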

Model quality

Pick your accuracy/latency trade-off

Word error rate (WER) is measured on LibriSpeech-500 and stays within floating-point noise of the upstream Python NeMo reference. Two decoders are exposed: RNN-T and CTC.

Decoder	LibriSpeech-500 WER	Per-chunk cost	Notes
RNN-T (recommended)	3.267%	~50 ms	Higher accuracy; LSTM state survives chunk boundaries
CTC	4.895%	~5 ms	~15% cheaper per chunk; marginally lower accuracy
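
For readers who want to sanity-check a WER figure on their own audio, here is a minimal sketch of the standard word-level edit-distance definition. It is not the evaluation script behind the table: real evaluations usually normalize casing and punctuation first, and this sketch splits on whitespace only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

Running the same function over transcripts from both decoders gives a like-for-like comparison on your own test set.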

Footprint

What it costs on disk and in memory

Swift wrapper source	~20 KB
Native xcframework, compressed (device slice)	~5 MB
Streaming model on disk (fp16)	~61 MB
Native heap at runtime (steady-state)	~150 MB
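
The on-disk model size is roughly what the parameter count predicts. As a rough cross-check, assume exactly 32 million parameters, all stored as 2-byte fp16 values with no other file overhead. Both are assumptions: the page gives only the rounded "32M" and does not say whether "MB" is decimal or binary.

```python
PARAMS = 32_000_000   # "32M" from this page; the exact count is an assumption
BYTES_PER_FP16 = 2    # one fp16 value occupies two bytes

weight_bytes = PARAMS * BYTES_PER_FP16
print(f"{weight_bytes / 1e6:.0f} MB (decimal)")    # 64 MB (decimal)
print(f"{weight_bytes / 2**20:.0f} MiB (binary)")  # 61 MiB (binary)
```

The binary figure lands on the quoted ~61 MB, which suggests the file is dominated by fp16 weights. The ~150 MB steady-state heap is larger than the weights alone and is not explained by this estimate.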
