Devoured - May 01, 2026
KV Cache Locality: The Hidden Variable in Your LLM Serving Cost (11 minute read)

Your LLM load balancer is probably wasting 20-40% of GPU compute recomputing prefills that already exist in cache on a different GPU in your cluster.

What: KV cache locality refers to the fact that transformer key-value caches are stored per-GPU, so routing identical requests to different GPUs forces redundant prefill computation even though the cached work exists on another card in the cluster.
Why it matters: As context windows grow to 16K+ tokens and RAG applications share thousands of tokens across requests, the cost difference between cache hits (18ms TTFT) and misses (500ms+) becomes a major performance and cost multiplier that standard load balancers ignore.
Takeaway: Check your vLLM deployment's gpu_prefix_cache_hit_rate metric - if it's below 30% and you serve 13B+ models with shared prefixes across multiple GPUs, you're likely wasting significant compute.
Deep dive
  • Round-robin and least-connections load balancing waste GPU compute by routing requests to GPUs without cached KV pairs, forcing redundant prefill computation
  • Benchmarks on 8x A100s with CodeLlama 13B show prefix-aware routing improves cache hits from 12.5% to 97.5%, reduces P99 TTFT from 6.8s to 1.0s, and increases throughput 22%
  • Cache miss penalty on CodeLlama 13B is 500ms vs 18ms for cache hit, a 28x difference in time-to-first-token
  • Wasted prefill costs approximately $1,200-$1,800 monthly per 8-GPU node, or 22% of total GPU spend
  • Performance gains scale with model size (13B-70B sweet spot), prefix length (16K tokens show 43.6% improvement vs 29.7% at 8K), and sharing ratio
  • Even 50% prefix sharing achieves 91% cache hit rate with prefix-aware routing vs ~11% with round-robin
  • Tail latency improvements are dramatic because cache misses under load create queueing delays that compound across requests
  • Prefix-aware routing doesn't help models ≤8B (routing overhead ~10ms exceeds savings), short prefixes (<500 tokens), or unique conversations
  • Load imbalance is a risk when traffic concentrates on specific prefixes, requiring load-aware fallbacks to prevent GPU hot spots
  • Article introduces Ranvier, a prefix-aware load balancer using adaptive radix trees to route based on token locality
Decoder
  • KV cache: The key-value pairs computed during prefill that transformers cache in GPU memory to avoid recomputing when generating output tokens
  • Prefill: The initial phase where the model processes all input tokens (system prompt, context, history) and computes their key-value pairs; compute-intensive and scales with token count
  • Decode: The generation phase where the model produces output tokens one at a time, reusing the cached key-value pairs from prefill; much faster than prefill
  • TTFT (Time to First Token): The latency between receiving a request and returning the first generated token, heavily influenced by prefill time and cache hits
  • vLLM: A popular open-source LLM serving engine that implements KV caching and other optimizations for transformer inference
  • RAG (Retrieval-Augmented Generation): A pattern where LLMs are given retrieved context documents as part of the prompt, often resulting in long shared system prompts across requests
Original article

Every time your load balancer sends a request to the wrong GPU, that GPU recomputes a prefill it already computed somewhere else. The KV cache for that 4,000-token system prompt exists. It's just sitting on a different card. Your load balancer doesn't know. It can't know. It's counting connections, not tokens.

That recomputation takes real time and real money. On a Llama 3.1 70B at half precision, a 4,000-token prefill takes over a second. If eight GPUs each recompute the same system prompt independently because round-robin sent one request to each, you just paid for the same work eight times. Multiply by every request, every hour, every day.

This post is about the cost of that mistake, how to measure it, and what changes when your load balancer understands token locality.

What the KV Cache Actually Saves You

A transformer processes input tokens in two phases. Prefill computes the key-value pairs for every input token: the system prompt, the conversation history, the RAG context. This is the expensive part. It scales with token count and model size, and it's compute-bound on the GPU. Decode generates output tokens one at a time, each one reusing the key-value pairs from prefill. This is the cheap part.

vLLM and other serving engines cache the key-value pairs from prefill in GPU memory. When a new request arrives with the same token prefix, the engine skips prefill entirely and jumps straight to decode. This is the KV cache hit.
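To make the mechanics concrete, here is a minimal sketch of that prefix check using made-up token IDs. This is not vLLM's implementation (vLLM caches KV pairs in fixed-size blocks keyed by hash), but the principle is the same: only the tokens past the longest cached prefix still need prefill.

```python
def cached_prefix_length(request_tokens: list[int], cached_tokens: list[int]) -> int:
    """Count how many leading tokens of the request already have KV pairs cached."""
    n = 0
    for a, b in zip(request_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs: a cached system prompt and a new request sharing a 4-token prefix.
cached  = [101, 7, 42, 9, 9, 13]     # tokens whose KV pairs sit in this GPU's memory
request = [101, 7, 42, 9, 500, 600]  # incoming request
hit = cached_prefix_length(request, cached)
remaining_prefill = len(request) - hit  # 2 tokens to prefill instead of all 6
```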

On our benchmarks, a cache hit on CodeLlama 13B returns in 18ms at P50. A cache miss takes around 500ms. That's a 28x gap in time-to-first-token, decided entirely by whether the tokens were already on that GPU.

But here's the thing: the KV cache is per-GPU. GPU 0's cache doesn't help GPU 3. If your load balancer sends Request A to GPU 0 and the identical Request B to GPU 3, Request B pays full prefill cost even though the work was already done. The cache exists. It's just on the wrong card.

The Math on Wasted Prefill

Let's make this concrete. You're running a RAG application with a 4,000-token system prompt. You have 8 GPUs serving CodeLlama 13B. You're handling 30 concurrent users with a stress workload (heavy on large and extra-large prefixes). Here's what we measured on 8x A100s:

Round-robin routing:

  • Cache hit rate: 12.5%
  • P99 TTFT: 6,800ms
  • Throughput: 36.3 req/s

With 8 backends and random routing, you'd expect ~12.5% cache hits by chance. One in eight requests happens to land on the GPU that already has its prefix cached. The other 87.5% recompute from scratch.

Prefix-aware routing:

  • Cache hit rate: 97.5%
  • P99 TTFT: 1,000ms
  • Throughput: 44.4 req/s

Same GPUs. Same model. Same workload. The only change is which GPU receives which request.

That throughput difference, 36.3 vs 44.4 requests per second, is a 22.3% improvement. On hardware costing ~$10/hour, that's either 22% more throughput for free or the same throughput on fewer GPUs. Over a month of continuous operation, on a single 8-GPU node, the wasted prefill in round-robin comes to roughly $1,200–$1,800 in GPU-hours (22% of ~$7,300/month at $10/hr) that produce no useful work. Multiply by the number of nodes in your cluster.
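The arithmetic behind that estimate, using the article's own figures (an assumed ~$10/hour for the 8-GPU node, running continuously):

```python
# Back-of-envelope for the wasted-prefill estimate, using the benchmark's numbers.
hourly_rate = 10.0              # $/hour for the 8x A100 node (assumed)
hours_per_month = 24 * 30.4     # continuous operation
monthly_spend = hourly_rate * hours_per_month   # ~$7,300
wasted_fraction = 0.22                          # share of compute lost to redundant prefill
wasted = monthly_spend * wasted_fraction        # ~$1,600, inside the $1,200-$1,800 range
print(f"~${monthly_spend:,.0f}/month total, ~${wasted:,.0f} on redundant prefill")
```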

Where the Savings Compound

The benefit scales with three variables: model size, prefix length, and prefix sharing ratio.

Model size

Larger models have more expensive prefill, so cache misses cost more.

Model             XLarge Cache Hit Improvement   Aggregate Throughput Gain
Llama 3.1 8B      31.6%                          ~0% (inference too fast)
CodeLlama 13B     35.9%                          +13.7% to +22.3%
Llama 3.1 70B     43.8%                          ~0% (compute-bound)

The 8B numbers are the warning case. When prefill is already fast (~420ms total inference), the 7-10ms routing overhead eats into the savings. If prefill isn't your bottleneck, prefix-aware routing doesn't help.

The 70B numbers tell a different story. Aggregate throughput doesn't change because the GPUs are already compute-saturated. But individual requests are 44% faster on cache hit (P50: 1,498ms hit vs 2,665ms miss). Your users feel the difference even if your throughput dashboard doesn't.

The sweet spot is 13B-70B models where prefill is expensive enough to matter but the GPUs aren't so saturated that they can't benefit from skipping it.

Prefix length

Longer shared prefixes mean more wasted compute per cache miss.

Max Prefix Tokens   Cache Miss P50   Cache Hit P50   Improvement
8,192 (default)     638ms            448ms           29.7%
16,384              817ms            461ms           43.6%

At 16K tokens, a cache miss wastes nearly 400ms of GPU compute that a hit avoids entirely. As context windows keep growing, this gap widens.

Prefix sharing ratio

This is the percentage of tokens shared across requests. A RAG application where every request includes the same 4,000-token knowledge base has a high sharing ratio. A chat application where every conversation is unique has a low one.

Sharing Ratio   Round-Robin Hits   Prefix-Aware Hits   Improvement
50%             ~11%               91%                 +80pp
70%             ~13%               90%                 +77pp
90%             ~12%               97-98%              +85pp

Even at 50% sharing, where half the tokens are unique, prefix-aware routing still achieves 91% cache hits. A consistent hash fallback (deterministic routing based on prefix when no learned route exists yet) ensures that requests with the same prefix land on the same GPU even before the system has observed them.
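One way to get that behavior, sketched here as an assumption rather than a description of Ranvier's internals: keep a table of learned routes, and when a prefix hasn't been seen yet, hash it to pick a backend deterministically. A production version would use a proper consistent-hash ring so routes survive backends joining and leaving.

```python
import hashlib

def route(prefix_tokens: list[int], backends: list[str],
          learned: dict[tuple[int, ...], str]) -> str:
    """Prefer a learned route for this prefix; otherwise hash the prefix so
    identical prefixes deterministically land on the same backend."""
    key = tuple(prefix_tokens)
    if key not in learned:
        digest = hashlib.sha256(repr(key).encode()).digest()
        learned[key] = backends[int.from_bytes(digest[:8], "big") % len(backends)]
    return learned[key]

# Two requests sharing a prefix converge on one GPU even with no prior routing state.
routes: dict[tuple[int, ...], str] = {}
gpus = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]
print(route([101, 7, 42, 9], gpus, routes) == route([101, 7, 42, 9], gpus, routes))  # True
```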

The P99 Story

Cost isn't just GPU-hours. It's also the cost of slow responses.

At 30 concurrent users on CodeLlama 13B over 30 minutes of sustained load, round-robin routing produced a P99 TTFT of 6,800ms. That's 6.8 seconds before the first token appears. For an interactive application like code completion or chat, that's a broken experience. Users don't wait 6.8 seconds.

Prefix-aware routing brought that same P99 down to 1,000ms. Same hardware, same model, same concurrency. An 85.3% improvement on tail latency.

Why does the tail improve so much? Because tail latency in LLM serving is driven by cache misses under load. When the GPU is busy generating tokens for other requests, a new request that requires full prefill gets queued behind them. With round-robin, 87.5% of requests need full prefill, so the queue is always full of expensive work. With prefix-aware routing, 97.5% of requests skip prefill entirely, so the queue drains faster and the few remaining misses get processed sooner.

This is the strongest argument for KV cache locality. Throughput improvements look good on a dashboard. Tail latency is what users actually experience.

What Doesn't Work

Prefix-aware routing isn't free, and it doesn't help everywhere.

Small models (≤8B): Inference is already fast enough that the routing overhead (~10ms for tokenization + tree lookup) approaches the prefill savings. The net effect is roughly zero.

Short prefixes (<500 tokens): The prefill cost for short sequences is small enough that cache misses don't meaningfully hurt. The routing overhead (~3ms minimum) can exceed the savings.

Unique conversations: If every request has a completely different prefix (no shared system prompt, no shared context), there's nothing to cache. The routing tree learns routes that are never reused.

Load imbalance: Strict prefix affinity can create hot spots. If 80% of your traffic shares the same system prompt, prefix-aware routing sends 80% of traffic to one GPU. We handle this with a load-aware fallback that diverts requests when a backend's in-flight count exceeds twice the median. This trades a cache miss for a balanced GPU, reducing P95 by 36% and P99 by 45% compared to strict affinity. The cache hit rate drops about 5 points, which is the right trade.
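A sketch of that fallback logic, under the threshold described above (the function and backend names are hypothetical, not Ranvier's API):

```python
from statistics import median

def pick_backend(preferred: str, in_flight: dict[str, int]) -> str:
    """Honor prefix affinity unless the preferred backend is overloaded: if its
    in-flight count exceeds 2x the median, trade the cache hit for the
    least-loaded backend instead."""
    if in_flight[preferred] <= 2 * median(in_flight.values()):
        return preferred                       # affinity wins: likely cache hit
    return min(in_flight, key=in_flight.get)   # hot spot: rebalance, accept a miss

# gpu-0 holds the hot prefix but has 24 requests in flight (median is 5.5), so divert.
print(pick_backend("gpu-0", {"gpu-0": 24, "gpu-1": 6, "gpu-2": 5, "gpu-3": 4}))  # gpu-3
```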

Measuring Your Own Cache Locality

Before you change anything, measure your current cache hit rate. Most vLLM deployments expose this via Prometheus:

  • vllm:gpu_prefix_cache_hit_rate (or vllm:gpu_prefix_cache_queries_total and _hits_total on older versions; check your /metrics endpoint)
  • Compare TTFT distributions between requests with shared vs unique prefixes
  • Look at your P99/P50 ratio. A ratio above 5x suggests cache thrashing
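A small helper for the first check, assuming the counter names in the list above and a reachable /metrics endpoint (adjust the names and URL to whatever your vLLM version actually exposes):

```python
import urllib.request

def prefix_cache_hit_rate(metrics_url: str) -> float | None:
    """Compute the prefix-cache hit rate from a vLLM Prometheus /metrics page."""
    text = urllib.request.urlopen(metrics_url).read().decode()

    def total(metric: str) -> float:
        # Sum the metric's samples across label sets; # HELP / # TYPE lines don't match.
        return sum(float(line.rsplit(" ", 1)[-1])
                   for line in text.splitlines() if line.startswith(metric))

    queries = total("vllm:gpu_prefix_cache_queries_total")
    hits = total("vllm:gpu_prefix_cache_hits_total")
    return hits / queries if queries else None

# Usage (assumed local endpoint):
# print(prefix_cache_hit_rate("http://localhost:8000/metrics"))
```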

If your cache hit rate is already above 80%, you're either lucky or your traffic naturally clusters. If it's below 30%, you're leaving performance on the table.

The variables that matter most:

  1. How many GPUs are you routing across? More GPUs = lower chance of random cache hits. With 8 GPUs, random routing gives ~12.5% hits.
  2. How long are your shared prefixes? Longer = more wasted compute per miss.
  3. What's your prefix sharing ratio? Higher = more opportunity for reuse.
  4. What model size are you serving? Larger = more expensive prefill per miss.

If you have many GPUs, long shared prefixes, high sharing ratios, and large models, you're likely wasting 20-40% of your GPU compute on redundant prefill.

The Takeaway

KV cache locality is not a tuning knob. It's a multiplier on your existing hardware. The same GPUs, serving the same model, handling the same traffic, produce measurably different throughput and latency depending on one decision: which GPU gets which request.

Round-robin doesn't make that decision. Least-connections doesn't make that decision. They balance load without understanding what the load is. When every request carries thousands of tokens that might already be cached somewhere in your cluster, "balanced" and "efficient" are not the same thing.

We built Ranvier to make that decision. It routes requests to the GPU that already has their token prefix cached, using an adaptive radix tree that learns routes in real time. The first post in this series covered why your load balancer is wasting your GPUs. This post covered what that waste costs. The next one will cover how we tokenize 50,000 requests per second without blocking the event loop.