Speech-to-Intent · end-to-end

On-device speech-to-intent

Skip the transcript. Define intents and slots in YAML, and we train a model that maps speech directly to structured intents in a single on-device inference, with lower latency, lower memory use, and better domain accuracy than ASR plus a separate NLU step.

Overview

Audio to intent in one inference

The usual pipeline runs speech-to-text, then feeds the transcript into a natural-language-understanding model to extract structured meaning. That's two models, two sources of latency, and two places to lose accuracy. Speech-to-intent collapses both into a single model that reads audio and emits structured intents directly.

Because the model is trained on your own intents and slots, it is both faster and more accurate on your domain than ASR plus a separate NLU step. It also runs entirely on-device, with no audio or transcript leaving the user's hardware.

The runtime

The runtime is the product

VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on phone-class hardware.

Speech-to-intent runs on the same runtime as wake word, VAD, and streaming ASR: one Rust runtime crate and one NEON kernel set across the whole stack.

How it works

Define a context spec, get structured output

# your context
intents:
  set_temperature:
    slots: [value, unit]

# at runtime
"Set the temperature to seventy-two degrees" → {
  intent: "set_temperature",
  slots: { value: 72, unit: "degrees" }
}
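
On the application side, the structured output can be checked against the context spec and routed to a handler. The following is a minimal sketch in Python, not the VoxRT API: the names CONTEXT, validate, dispatch, HANDLERS, and set_temperature are all hypothetical, and the spec is written as a plain dict where a real app would presumably load it from the YAML file.

```python
# Hypothetical sketch: none of these names come from the VoxRT API.

# The context spec from the example above, written as a plain dict
# (assumption: a real app would load this from the YAML file).
CONTEXT = {
    "intents": {
        "set_temperature": {"slots": ["value", "unit"]},
    }
}


def validate(result, context=CONTEXT):
    """Check that a result names a declared intent and only declared slots."""
    intents = context["intents"]
    name = result.get("intent")
    if name not in intents:
        raise ValueError(f"unknown intent: {name!r}")
    allowed = set(intents[name]["slots"])
    slots = result.get("slots", {})
    extra = set(slots) - allowed
    if extra:
        raise ValueError(f"undeclared slots for {name}: {sorted(extra)}")
    return name, slots


def set_temperature(value, unit="degrees"):
    # Stand-in for real application logic.
    return f"thermostat set to {value} {unit}"


# One handler per declared intent.
HANDLERS = {"set_temperature": set_temperature}


def dispatch(result):
    """Validate a result, then call the handler with its slots as arguments."""
    name, slots = validate(result)
    return HANDLERS[name](**slots)


result = {"intent": "set_temperature", "slots": {"value": 72, "unit": "degrees"}}
print(dispatch(result))  # prints "thermostat set to 72 degrees"
```

Keeping handlers in a table keyed by intent name means adding an intent to the spec only requires registering one more function. Validating before dispatch keeps an unexpected intent or slot from reaching application logic.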

Capabilities

What you get

One inference

Audio mapped straight to structured intents and slots — no transcript stage.

YAML context spec

Declare your intents and slots in a few lines — we train the model.

Lower latency & memory

One model instead of ASR plus a separate NLU pipeline.

Domain accuracy

Higher accuracy because the model is trained on your intents.

Fully on-device

No audio or transcript leaves the device.
