Infinity Nexatech

Why voice AI lives or dies at 700ms

A practical guide to building voice agents that feel human — and the latency budget that gets you there.

#ai #voice #engineering

The 700ms rule

Humans expect a reply within roughly 700 milliseconds. Cross that line and the conversation feels wrong — your prospect senses they’re talking to a machine even if they can’t articulate why.

The latency budget

Here’s the budget we actually plan against on Indian telephony:

  • Telephony round-trip: 60–120ms
  • Speech-to-text streaming first token: 80–150ms
  • LLM first token (with caching): 180–280ms
  • Tool calls (CRM lookups, etc.): 0–250ms (concurrent)
  • Text-to-speech first chunk: 80–180ms

If you don’t ruthlessly compress each stage, you’ll land at 1.5–2 seconds, and your CSAT will reflect it.
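Summing the stages above shows just how tight the budget is. A minimal sketch (the ranges come from the bullet list; the helper function itself is hypothetical), treating tool calls as overlapping the LLM stage as the list notes:

```python
# Latency budget from the post, as (min_ms, max_ms) per stage.
BUDGET_MS = {
    "telephony_rtt": (60, 120),
    "stt_first_token": (80, 150),
    "llm_first_token": (180, 280),
    "tool_calls": (0, 250),       # runs concurrently with the LLM stage
    "tts_first_chunk": (80, 180),
}

def response_latency(stages, concurrent=("tool_calls", "llm_first_token")):
    """Best/worst-case time to first audible audio, in ms.

    Concurrent stages contribute max(...) rather than a straight sum.
    """
    lo = max(stages[s][0] for s in concurrent)
    hi = max(stages[s][1] for s in concurrent)
    for name, (a, b) in stages.items():
        if name not in concurrent:
            lo += a
            hi += b
    return lo, hi

print(response_latency(BUDGET_MS))  # → (400, 730)
```

Even with the LLM and tool calls fully overlapped, the worst case (730ms) already breaches 700ms, which is why every stage has to land near the bottom of its range.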

How we hit it

Streaming everything. Speculative tool use. Prompt caching at every layer. A regional LLM endpoint closer to the speech stack than to the cloud cost-optimiser.
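The speculative-tool-use idea can be sketched as firing the CRM lookup concurrently with the LLM call, so the caller waits for the slower of the two rather than their sum. All function names and timings below are illustrative assumptions, not our production code:

```python
import asyncio

async def crm_lookup(phone_number: str) -> dict:
    # Stands in for a real CRM API call (~200ms in the budget above).
    await asyncio.sleep(0.2)
    return {"name": "A. Customer", "plan": "prepaid"}

async def llm_first_token(prompt: str) -> str:
    # Stands in for a streaming LLM's time-to-first-token.
    await asyncio.sleep(0.25)
    return "Hi"

async def respond(phone_number: str, prompt: str) -> str:
    # Launch the lookup before the model asks for it; total wait is
    # roughly max(0.2, 0.25), not the sum.
    crm_task = asyncio.create_task(crm_lookup(phone_number))
    token = await llm_first_token(prompt)
    profile = await crm_task  # already finished (or nearly) by now
    return f"{token}, {profile['name']}"

print(asyncio.run(respond("+910000000000", "greet the caller")))
```

The same overlap pattern applies to streaming: start TTS on the first LLM tokens instead of waiting for the full reply.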

The cheapest voice agent in the world is the one nobody hangs up on.