Why voice AI lives or dies at 700ms
A practical guide to building voice agents that feel human — and the latency budget that gets you there.
#ai #voice #engineering
The 700ms rule
Humans expect a reply within roughly 700 milliseconds. Cross that line and the conversation feels wrong — your prospect senses they’re talking to a machine even if they can’t articulate why.
The latency budget
Here’s the budget we actually plan against on Indian telephony (a short sketch after the list adds it up):
- Telephony round-trip: 60–120ms
- Speech-to-text streaming first token: 80–150ms
- LLM first token (with caching): 180–280ms
- Tool calls (CRM lookups, etc.): 0–250ms (concurrent)
- Text-to-speech first chunk: 80–180ms
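As a sanity check, here is the arithmetic in a minimal sketch. The stage names and ranges mirror the list above; treating tool calls as pure overlap with the LLM stage is a simplifying assumption, since a slow lookup can spill past the LLM's first token.

```python
# A minimal sketch of the budget arithmetic above, in milliseconds.
SERIAL_STAGES = {              # (best, worst) first-token / first-chunk latency
    "telephony_rtt":   (60, 120),
    "stt_first_token": (80, 150),
    "llm_first_token": (180, 280),
    "tts_first_chunk": (80, 180),
}
TOOL_CALLS = (0, 250)          # runs concurrently with the LLM stage

def first_audio_ms() -> tuple[int, int]:
    best = sum(lo for lo, _ in SERIAL_STAGES.values())
    worst = sum(hi for _, hi in SERIAL_STAGES.values())
    # Concurrent tool calls only hurt when they outlast the LLM's first token.
    llm_lo, llm_hi = SERIAL_STAGES["llm_first_token"]
    best += max(0, TOOL_CALLS[0] - llm_lo)
    worst += max(0, TOOL_CALLS[1] - llm_hi)
    return best, worst

print("time-to-first-audio: %d-%dms" % first_audio_ms())  # 400-730ms
```

Even with every stage at its best you spend 400ms; at the worst end of each range you're at 730ms, past the line before a single retry, jitter spike, or cold cache.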
If you don’t ruthlessly compress each stage, you’ll land at 1.5–2 seconds, and your CSAT will reflect it.
How we hit it
Streaming everything. Speculative tool use. Prompt caching at every layer. A regional LLM endpoint closer to the speech stack than to the cloud cost-optimiser.
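Here is a sketch of the first two techniques under stated assumptions: stream the LLM reply token by token, and fire the CRM lookup speculatively at turn start instead of waiting for the model to request it. `crm_lookup`, `stt_finalize`, and `llm_stream` are hypothetical stand-ins for your own CRM, speech, and LLM clients.

```python
import asyncio

async def crm_lookup(phone: str) -> dict:
    await asyncio.sleep(0.2)              # stands in for a 0-250ms network call
    return {"name": "Asha", "plan": "Pro"}

async def stt_finalize() -> str:
    await asyncio.sleep(0.1)              # stands in for the STT stream settling
    return "What's my plan?"

async def llm_stream(prompt: str, context: dict):
    reply = f"Hi {context['name']}, about your {context['plan']} plan..."
    for token in reply.split():
        await asyncio.sleep(0.03)         # stands in for token cadence
        yield token + " "

async def handle_turn(caller_phone: str) -> None:
    # Speculative tool use: fire the CRM lookup the moment the turn starts,
    # so it runs underneath the STT tail instead of after it.
    lookup = asyncio.create_task(crm_lookup(caller_phone))
    transcript = await stt_finalize()
    context = await lookup                # often already resolved by now
    async for chunk in llm_stream(f"Caller said: {transcript}", context):
        print(chunk, end="", flush=True)  # in production: pipe straight into TTS
    print()

asyncio.run(handle_turn("+91-XXXXXXXXXX"))
```

The payoff is the "concurrent" in the budget line above: a 250ms lookup costs nothing extra as long as it resolves before the model needs its result.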
The cheapest voice agent in the world is the one nobody hangs up on.