Glossary

Voice AI, in plain English

The terms you'll meet when evaluating on-device voice technology — what they mean, why they matter, and where to dig deeper. Each definition stands on its own; quote freely.

Voice features

The building blocks

Wake word · also: hotword, trigger word, wake phrase

A wake word is a specific word or phrase — like "Hey Assistant" — that an always-listening detector recognizes to bring an app or device to attention. The detector runs continuously on a tiny model so the full voice pipeline only activates after the phrase is heard.

Because it listens all the time, a wake-word detector is judged on three things at once: how rarely it fires by mistake, how rarely it misses the phrase, and how little power it draws while idle.

See: Custom Wake Word · try it in your browser

Keyword spotting · KWS

Keyword spotting recognizes a fixed vocabulary of spoken commands (play, pause, next) using a compact classifier instead of full speech recognition. Compared with a general transcriber, it typically responds faster, uses less battery and memory, and is more accurate on its own command set.

The trade-off is flexibility: keyword spotting only knows its vocabulary. If users need to say arbitrary things, you want speech recognition; if commands carry parameters ("set it to 72"), you want speech-to-intent.

See: Keyword Spotting

Voice activity detection · VAD

Voice activity detection determines, frame by frame, whether audio contains human speech. It's a foundational building block of a voice pipeline: it gates wake-word and speech-recognition models so they only run when someone is speaking, and it drives turn-taking, barge-in, and recording trimming.

A good VAD is judged on how well it separates speech from noise and on how little compute it uses, because in a VAD-gated pipeline it is the one model that never stops running.
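
To make "frame by frame" concrete, here is a minimal sketch of the simplest possible detector: an energy threshold. It is a toy stand-in, not how VoxRT or any production VAD works (those typically use trained models, which hold up far better in noise). The function names, the 20 ms frame size, and the -40 dBFS threshold are all assumptions chosen for illustration.

```python
import math

def frame_energy_db(frame):
    """Return the RMS level of one frame of float samples (-1..1) in dBFS."""
    if not frame:
        return -120.0
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    # Clamp so that digital silence maps to -120 dBFS instead of log10(0).
    return 20.0 * math.log10(max(rms, 1e-6))

def energy_vad(samples, sample_rate=16000, frame_ms=20, threshold_db=-40.0):
    """Yield one True/False speech decision per complete fixed-size frame."""
    frame_len = sample_rate * frame_ms // 1000
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        yield frame_energy_db(frame) > threshold_db
```

Feeding it one frame of silence followed by one frame of a loud tone yields one False and one True decision. An energy gate like this fires on any loud sound, which is exactly the speech-versus-noise weakness a real VAD is judged on.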

See: Voice Activity Detection · VAD engine comparison

Speech-to-intent · end-to-end SLU

Speech-to-intent maps audio directly to a structured intent with slots (one model, one inference) instead of transcribing speech to text and running a separate natural-language-understanding (NLU) model over the transcript. Collapsing the pipeline lowers latency and memory use, and it removes the intermediate transcript, where recognition errors could otherwise propagate into the NLU step.

It fits products with a known command domain ("set the temperature to 72") where you never actually need the words — just the meaning.
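
As a rough illustration of "just the meaning", the sketch below shows what an application might receive and act on. The `Intent` shape, the `set_temperature` intent name, the `degrees` slot, and the confidence cutoff are all hypothetical; they are not taken from the VoxRT SDK or any other engine.

```python
from dataclasses import dataclass, field

@dataclass
class Intent:
    """Hypothetical result shape: an intent name plus slots, with no transcript."""
    name: str
    slots: dict = field(default_factory=dict)
    confidence: float = 0.0

def handle(intent, min_confidence=0.6):
    """Dispatch on meaning only and return a description of the action taken."""
    if intent.confidence < min_confidence:
        return "ignored"
    if intent.name == "set_temperature":
        return f"thermostat -> {intent.slots['degrees']}"
    return "unsupported"

# "set the temperature to 72" arrives already parsed; the words are never needed.
result = Intent("set_temperature", {"degrees": 72}, confidence=0.93)
print(handle(result))  # prints "thermostat -> 72"
```

The application code never touches text, so there is no transcript to normalize or parse.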

See: Speech-to-Intent

Automatic speech recognition · ASR, speech-to-text, STT

Automatic speech recognition converts spoken audio into text. On-device ASR runs the model on the user's hardware, so no audio is uploaded and transcription works offline; cloud ASR streams the audio to a server and returns text.

See: On-Device Streaming ASR · ASR engine comparison

Streaming ASR · vs. batch transcription

Streaming ASR emits partial transcripts while the user is still speaking, refining them as more audio arrives, rather than waiting for the utterance to finish (batch transcription). Streaming is what makes live captions, dictation, and barge-in feel instant.

Some engines marketed as "real-time" are batch models run in a loop — the distinction to check is whether partials refine as you speak.

See: On-Device Streaming ASR

Architecture

Where the model runs

On-device AI · edge AI

On-device (edge) AI runs the model on the user's own hardware — phone, laptop, or embedded board — instead of sending data to a cloud server. For voice, that means microphone audio never leaves the device: no network dependency, no per-request cloud cost, and a stronger privacy posture.

The engineering challenge is fitting useful accuracy into the compute, memory, and battery budget of cheap hardware — which is why on-device engines are judged on real-time factor and footprint, not just accuracy.

See: why VoxRT is on-device-first

Inference runtime

An inference runtime is the software engine that executes a trained model's math on the target hardware. General-purpose runtimes (ONNX Runtime, PyTorch Mobile, LiteRT) support many model types at the cost of size; a specialized runtime built for one workload — streaming voice, in VoxRT's case — keeps the binary in the hundreds of kilobytes.

See: the VoxRT runtime

Barge-in

Barge-in is the ability for a user to interrupt a voice assistant while it is speaking or processing. It requires always-on voice activity detection and low-latency recognition, since the system must hear and react to the interruption in real time.

See: Voice Activity Detection

Endpointing

Endpointing is deciding when the user has finished speaking so the system can finalize the transcript or act on the command. It's usually driven by voice activity detection watching for trailing silence — end the turn too early and you cut users off; too late and the product feels sluggish.
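
The trailing-silence rule can be sketched in a few lines. This is a minimal illustration, assuming per-frame speech flags from a VAD, 20 ms frames, and a 500 ms silence timeout; the function name and both defaults are assumptions, and real endpointers often add smoothing or model-based cues.

```python
def find_endpoint(vad_flags, frame_ms=20, trailing_silence_ms=500):
    """Return the index of the frame where the turn ends, or None if it has not ended."""
    frames_needed = trailing_silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0          # any speech resets the silence timer
        elif heard_speech:
            silent_run += 1         # only count silence that follows speech
            if silent_run >= frames_needed:
                return i
    return None
```

Lowering `trailing_silence_ms` makes the product feel snappier but cuts off users who pause mid-sentence; raising it does the reverse, which is the trade-off described above.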

See: Voice Activity Detection

Metrics

How voice engines are measured

Real-time factor · RTF

Real-time factor is processing time divided by audio duration. An RTF of 0.01 means one second of audio takes 10 milliseconds to process, so keeping up with live speech uses about 1% of one CPU core. An RTF below 1.0 is faster than real time; lower values leave more headroom for the rest of the app.

RTF is only meaningful with the hardware named: the same model can be 0.08 on a modern iPhone and 0.30 on a 2020 budget Android.
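
The definition translates directly into a measurement. The sketch below is one simple way to time a workload, assuming a hypothetical `process_fn` callable that runs your engine over a clip of known length; it is not the VoxRT benchmark harness.

```python
import time

def rtf(processing_seconds, audio_seconds):
    """Real-time factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def measure_rtf(process_fn, audio_seconds, runs=5):
    """Time process_fn several times and return the RTF of the middle timing."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        process_fn()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return rtf(timings[len(timings) // 2], audio_seconds)

print(rtf(0.010, 1.0))  # 10 ms of work per second of audio -> 0.01
```

Taking a middle timing rather than a single run reduces the effect of one-off stalls, and the result still only means something alongside the name of the device it ran on.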

See: benchmark methodology · measure RTF on your own device

Word error rate · WER

Word error rate is the standard accuracy metric for speech recognition: substitutions plus insertions plus deletions, divided by the number of words in the reference transcript. Lower is better — 5% WER means roughly one error every twenty words.

WER figures are only comparable on the same test set (LibriSpeech test-clean is the most widely published for English), and architectures differ enough that even same-set numbers deserve care.
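
For readers who want to see the arithmetic, here is a minimal sketch of WER as a word-level edit distance. It splits on whitespace only; published evaluations usually normalize casing, punctuation, and number formats first, which this sketch does not attempt. The function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + insertions + deletions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One deleted word ("the") and one substitution ("two" -> "too") over 6 words.
wer = word_error_rate("set the temperature to seventy two",
                      "set temperature to seventy too")
print(round(wer, 3))  # 0.333
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.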

See: benchmark methodology · ASR comparison

False accept / false reject · FA / FR; related: precision / recall

A false accept is a wake-word detector triggering on something that wasn't the phrase; a false reject is missing the phrase when it was said. The detection threshold trades one against the other — which is why a single "accuracy" number for a wake word is incomplete, and why VoxRT publishes the full threshold sweep.

See: Custom Wake Word · benchmark methodology

Detection threshold

The detection threshold is the confidence score above which a detector fires. Raising the threshold makes detection stricter (fewer false accepts, more false rejects); lowering it does the opposite. Production systems tune the threshold to the product's tolerance for each error type.
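
The trade-off between the two error types is easiest to see as a sweep. The sketch below assumes you already have detector scores for clips that contain the phrase (positives) and clips that do not (negatives); the scores are invented for illustration, and treating a score at or above the threshold as a detection is a convention chosen here, not a statement about any particular engine.

```python
def threshold_sweep(positive_scores, negative_scores, thresholds):
    """Return (threshold, false_accept_rate, false_reject_rate) for each threshold."""
    rows = []
    for t in thresholds:
        # A negative clip scoring at or above t is a false accept.
        false_accepts = sum(1 for s in negative_scores if s >= t)
        # A positive clip scoring below t is a false reject.
        false_rejects = sum(1 for s in positive_scores if s < t)
        rows.append((t,
                     false_accepts / len(negative_scores),
                     false_rejects / len(positive_scores)))
    return rows

positives = [0.9, 0.8, 0.6, 0.4]   # made-up scores for clips with the phrase
negatives = [0.1, 0.3, 0.5, 0.7]   # made-up scores for clips without it
for row in threshold_sweep(positives, negatives, [0.2, 0.55, 0.85]):
    print(row)
# (0.2, 0.75, 0.0)
# (0.55, 0.25, 0.25)
# (0.85, 0.0, 0.75)
```

As the threshold rises, the false-accept rate falls and the false-reject rate climbs, so no single row summarizes the detector. Wake-word results often report false accepts per hour of audio rather than per clip; this sketch uses per-clip rates to stay short.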

See: try adjusting a live threshold

See the terms in action

The browser demo shows a wake word, a live threshold, and an RTF benchmark — no install.

Open the live demo · Read the SDK docs