Infinity Nexatech

Why voice AI lives or dies at 700ms

A practical guide to building voice agents that feel human — and the latency budget that gets you there.

#ai #voice #engineering

The 700ms rule

Humans expect a reply within roughly 700 milliseconds. Cross that line and the conversation feels wrong — your prospect senses they’re talking to a machine even if they can’t articulate why.

The latency budget

Here’s the budget we actually plan against on Indian telephony:

  • Telephony round-trip: 60–120ms
  • Speech-to-text streaming first token: 80–150ms
  • LLM first token (with caching): 180–280ms
  • Tool calls (CRM lookups, etc.): 0–250ms (concurrent)
  • Text-to-speech first chunk: 80–180ms

If you don’t ruthlessly compress each stage, you’ll land at 1.5–2 seconds, and your CSAT will reflect it.
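Summing the stages above shows just how tight the budget is. A minimal sketch (the ranges come from the bullet list; the helper function itself is hypothetical), treating tool calls as overlapping the LLM stage as the list notes:

```python
# Latency budget from the post, as (min_ms, max_ms) per stage.
BUDGET_MS = {
    "telephony_rtt": (60, 120),
    "stt_first_token": (80, 150),
    "llm_first_token": (180, 280),
    "tool_calls": (0, 250),       # runs concurrently with the LLM stage
    "tts_first_chunk": (80, 180),
}

def response_latency(stages, concurrent=("tool_calls", "llm_first_token")):
    """Best/worst-case time to first audible audio, in ms.

    Concurrent stages contribute max(...) rather than a straight sum.
    """
    lo = max(stages[s][0] for s in concurrent)
    hi = max(stages[s][1] for s in concurrent)
    for name, (a, b) in stages.items():
        if name not in concurrent:
            lo += a
            hi += b
    return lo, hi

print(response_latency(BUDGET_MS))  # → (400, 730)
```

Even with the LLM and tool calls fully overlapped, the worst case (730ms) already breaches 700ms, which is why every stage has to land near the bottom of its range.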

How we hit it

Streaming everything. Speculative tool use. Prompt caching at every layer. A regional LLM endpoint closer to the speech stack than to the cloud cost-optimiser.
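The speculative-tool-use idea can be sketched as firing the CRM lookup concurrently with the LLM call, so the caller waits for the slower of the two rather than their sum. All function names and timings below are illustrative assumptions, not our production code:

```python
import asyncio

async def crm_lookup(phone_number: str) -> dict:
    # Stands in for a real CRM API call (~200ms in the budget above).
    await asyncio.sleep(0.2)
    return {"name": "A. Customer", "plan": "prepaid"}

async def llm_first_token(prompt: str) -> str:
    # Stands in for a streaming LLM's time-to-first-token.
    await asyncio.sleep(0.25)
    return "Hi"

async def respond(phone_number: str, prompt: str) -> str:
    # Launch the lookup before the model asks for it; total wait is
    # roughly max(0.2, 0.25), not the sum.
    crm_task = asyncio.create_task(crm_lookup(phone_number))
    token = await llm_first_token(prompt)
    profile = await crm_task  # already finished (or nearly) by now
    return f"{token}, {profile['name']}"

print(asyncio.run(respond("+910000000000", "greet the caller")))
```

The same overlap pattern applies to streaming: start TTS on the first LLM tokens instead of waiting for the full reply.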

The cheapest voice agent in the world is the one nobody hangs up on.