On-device speech-to-intent
Skip the transcript. Define intents and slots in YAML, and we train a model that maps speech directly to structured intents in a single on-device inference — lower latency, lower memory, better domain accuracy than ASR plus a separate NLU step.
Audio to intent in one inference
The usual pipeline runs speech-to-text, then feeds the transcript into a natural-language-understanding model to extract structured meaning. That's two models, two sources of latency, and two places to lose accuracy. Speech-to-intent collapses both into a single model that reads audio and emits structured intents directly.
Because the model is trained on your own intents and slots, it's both faster and more accurate on your domain — and it runs entirely on-device, with no audio or transcript leaving the user's hardware.
The runtime is the product
VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.
Speech-to-intent runs on the same runtime as wake word, VAD, and streaming ASR: one Rust runtime crate and one NEON kernel set across the whole stack.
Define a context spec, get structured output
# your context
intents:
  set_temperature:
    slots: [value, unit]

# at runtime
"Set the temperature to seventy-two degrees"
→ { intent: "set_temperature", slots: { value: 72, unit: "degrees" } }
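On the application side, the structured output above can be consumed without any text parsing. The following is a minimal sketch of one way to do that, not the VoxRT API: the `IntentResult` type, the `CONTEXT_SPEC` dictionary (a hand-written mirror of the YAML spec), and the `validate`, `dispatch`, and `set_temperature` functions are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical in-memory mirror of the context spec shown above:
# intent name -> list of slot names. This is not a VoxRT data structure.
CONTEXT_SPEC: Dict[str, List[str]] = {
    "set_temperature": ["value", "unit"],
}


@dataclass
class IntentResult:
    """Assumed shape of one inference result: an intent name plus slot values."""
    intent: str
    slots: Dict[str, Any] = field(default_factory=dict)


def validate(result: IntentResult, spec: Dict[str, List[str]]) -> None:
    """Raise ValueError if the result names an intent or slot missing from the spec."""
    if result.intent not in spec:
        raise ValueError(f"unknown intent: {result.intent}")
    unknown = set(result.slots) - set(spec[result.intent])
    if unknown:
        raise ValueError(f"unknown slots for {result.intent}: {sorted(unknown)}")


def set_temperature(value: int, unit: str) -> str:
    # Stand-in for application logic, e.g. a call to a thermostat driver.
    return f"thermostat set to {value} {unit}"


# One handler per declared intent; slot names become keyword arguments.
HANDLERS: Dict[str, Callable[..., str]] = {
    "set_temperature": set_temperature,
}


def dispatch(result: IntentResult) -> str:
    """Check the result against the spec, then call the matching handler."""
    validate(result, CONTEXT_SPEC)
    return HANDLERS[result.intent](**result.slots)


# The structured output from the example above, written as a Python value.
result = IntentResult("set_temperature", {"value": 72, "unit": "degrees"})
message = dispatch(result)
```

Because slot names map directly to handler keyword arguments, adding an intent in this sketch means adding one entry to the spec and one handler function.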
What you get
One inference
Audio mapped straight to structured intents and slots — no transcript stage.
YAML context spec
Declare your intents and slots in a few lines — we train the model.
Lower latency & memory
One model instead of ASR plus a separate NLU pipeline.
Domain accuracy
Higher accuracy because the model is trained on your intents.
Fully on-device
No audio or transcript leaves the device.