
How We Keep API Overhead Under 100ms

April 5, 2026 · Platform Admin

The #1 concern we hear from developers evaluating an API gateway is latency. Fair enough — if you're adding a proxy layer between your app and the model provider, you'd expect some overhead. Here's how we keep that overhead under 100ms at p99.

The Architecture

TokenFast sits between your application and the upstream model provider (Anthropic, OpenAI, Google, etc.). Every request flows through us, which means we need to be fast or get out of the way.

Our stack:

  • LiteLLM as the proxy core — handles model routing, format translation, and provider-specific quirks
  • Persistent connection pools to every upstream provider — no TLS handshake per request
  • Regional deployment — your request hits the nearest edge node before being routed to the provider

Where the Milliseconds Go

A typical request through TokenFast:

Your app → TokenFast edge (5-15ms)
  → Auth + rate limit check (1-3ms)
  → Route to upstream provider (2-5ms)
  → Provider processing (500-30,000ms)
  → Stream back through TokenFast (<1ms per chunk)
Total overhead: 8-23ms typical, <100ms at p99

The provider's own processing time (the "thinking" part) dwarfs our overhead, typically by two or more orders of magnitude. A Claude Opus response that takes 8 seconds to generate adds ~15ms of proxy overhead. That's 0.19%.
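Those totals are easy to sanity-check: sum the per-hop budget from the breakdown above, and the percentage in the worked example follows directly. (The figures below are the ones quoted in this post, not live measurements.)

```python
# Per-hop overhead budget (min_ms, max_ms) from the breakdown above.
hops = {
    "edge ingress": (5, 15),
    "auth + rate limit": (1, 3),
    "routing": (2, 5),
}

low = sum(lo for lo, _ in hops.values())
high = sum(hi for _, hi in hops.values())
print(f"total proxy overhead: {low}-{high}ms")  # total proxy overhead: 8-23ms

# The worked example: ~15ms of overhead on an 8-second generation.
overhead_ms, upstream_ms = 15, 8_000
print(f"{overhead_ms / upstream_ms:.2%}")  # 0.19%
```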

Connection Pooling

The single biggest optimization. Without pooling, every request would need a fresh TLS 1.3 handshake to the upstream provider — that's 50-150ms right there. We maintain warm connection pools to every provider endpoint, so your requests reuse established connections.
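Our pooling layer isn't open source, but the same technique is easy to sketch client-side with `requests` (the pool sizes below are illustrative, not our production values):

```python
import requests
from requests.adapters import HTTPAdapter

# A Session reuses TCP/TLS connections across requests, so only the
# first request to each host pays the handshake cost.
session = requests.Session()

# Size the pool for the expected concurrency to each upstream host.
adapter = HTTPAdapter(pool_connections=4, pool_maxsize=32)
session.mount("https://", adapter)

# Both of these would reuse the same warm connection:
# session.post("https://api.anthropic.com/v1/messages", ...)
# session.post("https://api.anthropic.com/v1/messages", ...)
```

The same idea applies server-side: the proxy keeps its pools warm continuously, so even a cold client request lands on an already-established upstream connection.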

What About Streaming?

For streaming responses (which is most production usage), the overhead is even less noticeable. We forward each Server-Sent Event chunk as it arrives with sub-millisecond relay latency. The streaming experience is indistinguishable from calling the provider directly.
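The relay loop itself is conceptually simple. A minimal sketch (illustrative only; `relay_sse` is not our actual forwarder):

```python
from typing import Iterable, Iterator

def relay_sse(upstream: Iterable[bytes]) -> Iterator[bytes]:
    """Forward Server-Sent Event chunks verbatim, without buffering.

    The proxy does no per-chunk processing beyond passing bytes through,
    which is why relay latency stays sub-millisecond.
    """
    for chunk in upstream:
        yield chunk  # no buffering: each chunk goes out as it arrives

# Example: simulate an upstream SSE stream.
fake_upstream = [
    b'data: {"delta": "Hel"}\n\n',
    b'data: {"delta": "lo"}\n\n',
    b"data: [DONE]\n\n",
]
chunks = list(relay_sse(fake_upstream))
print(len(chunks))  # 3
```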

Measuring It Yourself

Every response from TokenFast includes timing headers:

x-tokenfast-overhead-ms: 12
x-tokenfast-upstream-ms: 4523
x-tokenfast-region: us-east-1
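If you want to compute the ratio yourself, a small helper over those headers does it. (The `overhead_fraction` function is a hypothetical example, using the header values shown above.)

```python
def overhead_fraction(headers: dict[str, str]) -> float:
    """Proxy overhead as a fraction of upstream time, from TokenFast headers."""
    overhead = float(headers["x-tokenfast-overhead-ms"])
    upstream = float(headers["x-tokenfast-upstream-ms"])
    return overhead / upstream

# Using the example headers above:
sample = {
    "x-tokenfast-overhead-ms": "12",
    "x-tokenfast-upstream-ms": "4523",
    "x-tokenfast-region": "us-east-1",
}
print(f"{overhead_fraction(sample):.2%}")  # 0.27%
```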

You can verify our overhead claims on every single request. We believe in transparency — if we're ever slow, you'll see it in the numbers.

When We're NOT the Right Choice

If you're building a latency-critical application where every millisecond matters (real-time voice, high-frequency trading decisions), and you only use one model from one provider, calling them directly will save you 10-20ms. For everyone else, the operational simplicity and cost savings more than justify the negligible overhead.