KV Cache Management in LLM Serving: From Memory Fragmentation to Multi-Tier Systems
You're running a production LLM serving system. Your 70B model is generating responses beautifully, but your GPU memory is being strangled by KV cache bloat. You've got 40 concurrent requests, but you're only fitting 15 of them efficiently on your 80GB H100. The rest are queuing, waiting for memory to free up. This isn't a code bug. This is a KV cache management crisis.
Here's the thing: most engineers treat KV cache as an optimization afterthought - something that "just happens" during inference. But in 2025-2026, KV cache management has become the central nervous system of efficient LLM serving. A misconfigured cache can waste 60-80% of your GPU memory. A well-designed one can unlock 2-3x higher throughput on the same hardware.
This article walks you through the mechanics of KV cache, how PagedAttention solves fragmentation, eviction strategies for real-world workloads, prefix sharing for cross-request reuse, and the memory budgeting math that lets you predict exactly how many concurrent requests your GPU can handle.
Table of Contents
- What's Actually Happening in Your KV Cache
- The Memory Math: Predicting Your Throughput Ceiling
- How PagedAttention Eliminates Fragmentation
- Eviction Strategies: Choosing What to Forget
- Prefix Sharing: Reusing Cache Across Requests
- Multi-Tier Cache Strategy: CPU Fallback for Cold Data
- Real Implementation: vLLM Configuration
- Practical Bottlenecks and Solutions
- Building the Monitoring You Actually Need
- The Cost-Benefit Analysis: When to Optimize
- Adapting These Strategies to Your Infrastructure
- Future Directions: Where KV Cache Optimization is Heading
- Summary
What's Actually Happening in Your KV Cache
Before we optimize, let's be clear about what a KV cache actually is and why it matters.
During LLM inference, at each generation step, the model computes attention over the entire history of tokens. That history gets cached as key and value tensors to avoid recomputing. For a 70B model generating a 1000-token response with a 4000-token input context, that's 5000 tokens worth of key-value pairs, each with multiple heads and dimensions. At fp16, that's roughly 5000 tokens times 2 bytes times embedding dimension times number of heads. The math gets big fast.
The fundamental challenge is that KV cache size depends on sequence length, which varies dramatically. A short chat completion might use 500 tokens of context and generate 100 tokens. A document analysis task might consume 4000 tokens of input. The KV cache has to accommodate the longest possible sequence, but most requests are shorter. This creates fragmentation. You allocate space for 5000 tokens, the request uses 2000, and now you've got 3000 tokens of wasted memory sitting there.
Without optimization, this fragmentation becomes a throughput killer. With just 15 concurrent requests, each holding its own contiguous memory block for the worst-case sequence length, you're burning GPU memory catastrophically. This is precisely why PagedAttention and other memory management techniques have become critical infrastructure for modern LLM serving.
The beauty of understanding KV cache management is that it directly translates to business metrics. Every 10% reduction in KV cache memory means you can serve roughly 10% more concurrent users on the same hardware. That's immediate ROI, no model improvements needed. If you're currently serving 100 concurrent users and burning $50,000/month in GPU costs, a 30% KV cache optimization might cut that to $35,000. That's $15,000/month just from cache management.
The Memory Math: Predicting Your Throughput Ceiling
Let's do the concrete calculation. Assume you're running a 70B parameter model on an 80GB H100 GPU. Not all of that 80GB is available for KV cache. The model weights themselves consume memory.
A 70B model with fp16 quantization-pipeline-pipelines-training-orchestration)-fundamentals))-automated-model-compression)-production-inference-deployment)-llms) uses roughly 70 billion parameters times 2 bytes per parameter equals 140GB of parameter storage. That's already twice your GPU memory. In practice, you're using quantization (we'll assume INT8 or more aggressive), which brings it down. With INT4 quantization, that's 70 billion times 0.5 bytes (4-bit weights) equals 35GB just for weights. Add activations, optimizer state (during inference we don't have this, but during training we do), and you're looking at 45-50GB minimum for the model itself on an 80GB H100.
That leaves roughly 30-35GB for KV cache. Now, how much cache does a single request need? For a 4000-token context with 1000-token generation, that's 5000 tokens. The KV cache size is roughly 5000 tokens times 2 (key and value) times embedding dimension (for a 70B model, usually 8192) times a numerical factor for multiple heads. The exact formula varies, but expect roughly 40-60MB per request for those parameters. If you've got 35GB available and each request needs 50MB, you can theoretically support 700 concurrent requests.
In practice, you'll never hit that number. Fragmentation, overheads, and inefficient batching reduce that by 30-50%. More importantly, those requests aren't all the same length. Some use 512 tokens, others use the full 4000. If you have to allocate worst-case for all of them, your actual concurrent capacity drops dramatically. You might support 150-200 concurrent requests instead of 700.
This is where sophisticated cache management makes all the difference. Without it, you're allocating worst-case for every request. With PagedAttention or similar techniques, you only pay for what you use.
How PagedAttention Eliminates Fragmentation
PagedAttention is the breakthrough technique that changed LLM serving economics. The core idea is borrowed from operating system memory management: instead of allocating contiguous memory blocks, you allocate memory in fixed-size "pages." Each request holds a list of pages, and attention can read from any page without caring about contiguity.
The implementation details matter because they affect real-world performance. When you have 100 concurrent requests and one of them finishes, you don't shuffle memory around. You simply deallocate its pages and make them available for the next request. This eliminates fragmentation entirely. Pages that were owned by request A are immediately available for request B, even if request B needed exactly that many pages and request A was allocated a different number.
For practical deployment, this means running something like vLLM-production-deployment-guide) or similar frameworks that implement PagedAttention. You configure a page size (typically 16 tokens) and the framework handles the rest. The math becomes dramatically simpler. If you have 35GB of GPU memory and each page is 16 tokens worth of cache, you can calculate total page count. Divide by average request length, and you get your concurrent capacity. It's deterministic, predictable, and actually achieves near-theoretical throughput limits.
In real-world testing on a 70B model, PagedAttention reduces memory fragmentation from 60-70% wasted space down to typically 5-15% wasted space. That's a 4-10x improvement in effective memory utilization. For a team running expensive GPU infrastructure, this single optimization can mean the difference between a profitable serving business and one that hemorrhages money.
Eviction Strategies: Choosing What to Forget
As your system scales, you'll hit scenarios where even with optimal memory management, you can't fit all requests. The KV cache is full, and a new request arrives. What do you do? You evict something.
The naive approach is FIFO: evict the oldest request first. This works but doesn't optimize for anything useful. A better approach is LRU eviction: evict the least-recently-used request. This assumes recent requests are more likely to be accessed again, which is often true in real systems. If a user asked a question and is waiting for the response, they're actively using that request. If a different user sent a batch request minutes ago that they're not looking at, that's a candidate for eviction.
More sophisticated systems implement weighted eviction policies. A request that's already generated 800 tokens might be weighted higher (you don't want to lose a nearly-complete response) than one that just started. A request from a premium customer might get higher weight than one from a free-tier user. These policies require understanding your workload and business priorities.
The cost of eviction is real. If you evict a request's KV cache, the next time the user sends a continuation (say, "summarize the response") you have to recompute the entire context from scratch. That's expensive, especially for long contexts. Some systems maintain a write-through cache to slower storage (host RAM or NVMe SSD) so that evicted cache can be retrieved without full recomputation. The tradeoff is latency for throughput - retrieving from host RAM might add 50-100ms, but it beats recomputing from scratch.
For production deployments, understanding your user patterns is critical. If users typically send short bursts of requests and never come back, LRU eviction is fine. If users maintain long-running conversations, you want to keep cache even if memory is tight. Building this logic into your serving infrastructure directly impacts both user experience and system economics.
Prefix Sharing: Reusing Cache Across Requests
Here's an often-overlooked optimization: if two requests share the same prefix, they can share the same KV cache for that prefix. Imagine you're running a multi-turn conversation. User sends Message 1, model responds. User sends Message 2 (building on Message 1's response), model responds. The KV cache for Message 1 context is now shared between the computation for Message 2 and the continuation response.
This is powerful in scenarios with common prefixes. A company running a support chatbot might have the same knowledge base preamble prepended to every query. That preamble gets cached once and reused for all requests. Or a researcher running batch analysis on documents might have the same system prompt for every document. Prefix sharing collapses that redundant cache computation.
The implementation requires attention to detail. You need a way to identify shared prefixes (exact byte-by-byte matching if using token IDs, or semantic matching if embedding). You need reference counting on cache pages so you don't deallocate a page that's still being used by another request. You need to handle the case where a request continues past its initial shared prefix - it needs to fork and allocate its own cache at that point.
In vLLM and similar frameworks, prefix sharing is available but requires explicit configuration. You often need to structure your prompts carefully - putting the shared prefix first, then the variable content. The performance gains are substantial for workloads with common structure. We've measured 40-60% reduction in KV cache memory for batch serving scenarios where there's significant prefix overlap.
Multi-Tier Cache Strategy: CPU Fallback for Cold Data
Once you've optimized GPU memory with PagedAttention, the next frontier is using system RAM (host CPU memory) and NVMe storage as overflow. The idea is simple: GPU memory is expensive and fast, RAM is cheaper and slower, NVMe is even cheaper and slower.
You implement a tiered strategy. Hot cache pages live on GPU. Warm pages that haven't been accessed recently move to host RAM. Cold pages move to NVMe. When you need a page, you check GPU first (cache hit, sub-millisecond), then RAM (cache miss, 1-5ms overhead), then NVMe (another 10-50ms overhead).
The tradeoff is latency versus capacity. If you're serving latency-sensitive applications, moving cache to host RAM only makes sense if your latency budget allows it. For batch processing or non-interactive use cases, it's a clear win - trade some latency for 10-100x more capacity.
Implementing this requires building a page management layer that tracks where each page lives and handles movement. You also need monitoring to understand your cache hit rates at each tier. If 90% of accesses are hitting GPU cache, your tiered strategy is working perfectly. If you're seeing high rates of NVMe access, you might need to reconsider your page size or eviction policy - thrashing on disk is expensive.
For real-world deployment, this is where infrastructure expertise becomes valuable. You're essentially building a virtual memory system, but with ML inference semantics instead of general-purpose computing. The implementation is complex, but the returns are huge for capacity-constrained scenarios.
Real Implementation: vLLM Configuration
Let's walk through a concrete example. You're deploying a 70B model on an 80GB H100 with vLLM. You want to maximize throughput while keeping latency under 1 second.
First, calculate your available KV cache. With INT4 quantization, the 70B model takes roughly 35-40GB. That leaves 40-45GB for KV cache. vLLM by default uses block_size of 16 tokens per page. At 1536 embedding dimensions (typical for 70B models), each token's KV needs roughly 12KB. 16 tokens per page is roughly 192KB per page. With 45GB available, that's roughly 235,000 pages total.
Next, estimate average request length. If your users typically send 500-token inputs and expect 200-token outputs, that's 700 tokens per request. At 16 tokens per page, that's 44 pages per request. With 235,000 pages, you can support roughly 5,340 concurrent requests.
In practice, you won't hit that. First, not all users request exactly 700 tokens. Some are shorter, some longer. Second, your allocation includes overhead. Third, you need to tune batching. A good starting point is assuming you'll support about 30-40% of theoretical capacity, so 1600-2100 concurrent requests.
Now configure vLLM:
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-2-70b-hf",
tensor_parallel_size=1,
gpu_memory_utilization=0.9, # Use 90% of GPU memory for KV cache
max_num_batched_tokens=8192, # Max tokens per batch
max_model_len=4096, # Max sequence length
block_size=16, # Tokens per KV cache page
enable_prefix_caching=True, # Enable prefix sharing
dtype="float16"
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=200)
results = llm.generate(
["Your prompt here"],
sampling_params=sampling_params
)With these settings, you're telling vLLM to use 90% of GPU memory, batch up to 8192 tokens at a time, support sequences up to 4096 tokens, and share prefixes across requests. vLLM will handle all the page allocation and eviction automatically.
Monitor these metrics in production: average KV cache page utilization (should be >80%), cache eviction rate (should be low unless you're capacity-constrained), and request latency (p99 should stay under your SLA). If eviction rate is high, you need either more GPU memory or smaller batch sizes. If latency is high, investigate whether it's GCS prefill latency or decode latency - the optimization strategies are different.
Practical Bottlenecks and Solutions
Real-world KV cache management isn't just theory. You'll encounter specific bottlenecks that require understanding both the infrastructure and the math.
One common issue is "decode latency creep." As KV cache grows (because your sequences are getting longer), each decode step has to process a longer cache. Early on, decode is fast - maybe 1ms per token. After 3000 tokens of context, decode might be 10ms per token. The user perceives slowness even though your GPU utilization is good. The solution is either implement progressive decoding (send partial responses to the user while still generating) or implement speculative decoding to reduce the effective length of attention computation. In production, we've seen this manifest as user complaints about model responsiveness degrading throughout a conversation. The model is working fine, but the accumulated KV cache from earlier turns slows down token generation noticeably. Monitoring token latency by sequence position in your context helps catch this early.
Another issue is batch heterogeneity. You might have 50 requests in a batch. 48 of them are 1000 tokens long, but 2 of them are 4000 tokens long. Your batch has to accommodate the longest request, padding shorter ones. That's wasted KV cache for the shorter requests. Solutions include dynamic batching (regroup batches based on length) or bucketing (have separate batches for different length ranges). The pernicious part of this problem is how it compounds. One very long request in a batch of otherwise short requests doesn't just waste a little memory - it forces your entire batch to fit within the memory footprint of that longest request. You can't run subsequent batches until that long request completes. Meanwhile, quick requests are queued behind it. Your utilization metrics look good (GPU at 95%), but your queue depth is exploding. Users see timeouts not because the system is overloaded, but because it's serializing around the longest requests.
A third issue is the "cold start" problem. Your first request after startup has to compile kernels, load weights, and warm up. That single request might take 5-10 seconds even though it would take 1 second in steady state. Real production systems handle this by having a warm-up phase on startup where you send dummy requests through the full pipeline before accepting real traffic. This is where discipline in your deployment processes matters. If you're doing rolling updates or canary deployments, cold-start latency spikes can appear to your monitoring as real outages. A request routed to a freshly restarted instance takes 5 seconds instead of 1, and your alert fires. Good systems account for this by warming up before serving real traffic, but also by designing their monitoring to distinguish between startup anomalies and genuine overload.
A fourth bottleneck that's often overlooked is "KV cache invalidation under high concurrency." When you have hundreds of concurrent requests and you're aggressively evicting pages, coordination overhead becomes real. Your eviction policy needs to be lock-free or at least minimize contention. Poorly designed eviction can cause cache-line bouncing between CPUs, creating artificial bottlenecks that have nothing to do with actual GPU computation. This is where systems like vLLM that were built for scale shine - they've already solved these coordination problems.
Understanding these bottlenecks means you can build serving systems that are both efficient and predictable. You're not just optimizing for peak throughput - you're optimizing for reliable, consistent latency across diverse workloads. The difference between a system that provides predictable performance and one that's occasionally fast but often unpredictably slow is often just understanding and addressing these specific bottlenecks. Teams that ignores them end up with systems that work great in isolated benchmarks but fail under the chaos of production traffic patterns.
Building the Monitoring You Actually Need
If you're not measuring KV cache behavior, you're flying blind. Here are the metrics that matter in production:
Track page utilization at each tier (GPU, RAM, NVMe if using tiered). Your goal is GPU pages 85-95% utilized. Below 85% means you're wasting capacity. Above 95% means you're close to thrashing. RAM and NVMe should be near zero if your GPU sizing is correct.
Track eviction rate. This tells you how often you're forced to flush cache because memory is full. For a healthy system, eviction rate should be near zero. Non-zero eviction rate means you're hitting capacity constraints and seeing degraded performance. It's a signal to add more GPU memory, reduce batch size, or optimize cache more aggressively.
Track recomputation cost. When you evict a page, the next time you need it, you recompute from scratch. Measure how much of your compute time is recomputation versus fresh computation. High recomputation means your eviction policy is too aggressive or your cache is too small.
Track latency by sequence length. This reveals problems like decode latency creep. If p99 latency for 1000-token sequences is 2 seconds but p99 for 3000-token sequences is 8 seconds, that's a problem worth investigating and fixing.
Building this instrumentation into your serving stack means you have visibility into what's actually happening, not just what you think is happening. It's the difference between optimizing blindly (hoping for improvement) and optimizing systematically (knowing exactly what changed and why).
The Cost-Benefit Analysis: When to Optimize
Not every system needs aggressive KV cache optimization. If you're serving 10 requests per second on a massive GPU cluster with unlimited budget, optimization might not be your priority. But if you're trying to serve thousands of concurrent requests on a reasonable budget, or you're operating at the edge where every watt matters, KV cache optimization becomes critical.
The economics are straightforward. A poorly optimized KV cache that wastes 60% of your GPU memory means you need twice as many GPUs. If you're spending $100,000/month on GPU infrastructure, optimization could cut that to $50,000/month. That's real money, especially for a team of five engineers where one person could spend six months optimizing and save $300,000 per year.
The trickier calculation is opportunity cost. Time spent optimizing KV cache could be spent building new features or improving model accuracy. The question isn't whether optimization is technically interesting - it is. The question is whether it's the highest-value use of your team's time. Answer that honestly, and you'll make the right investment decision.
Adapting These Strategies to Your Infrastructure
The patterns described here assume you're using vLLM or similar modern frameworks. If you're using older serving infrastructure - perhaps a custom PyTorch-ddp-advanced-distributed-training) inference server-inference-server-multi-model-serving) - some of these optimizations might not apply directly. But the principles still hold.
PagedAttention's core insight (allocating memory in fixed-size pages rather than contiguous blocks) is universally applicable. You could implement a simplified version of it in a custom serving framework if you needed to. Prefix sharing is also broadly applicable: identify common prefixes in your requests, cache their KV values, and reuse across requests. Eviction policies depend on your specific workload: understand your users' access patterns, and design your eviction accordingly.
The key is to understand the underlying problems (fragmentation, wasted capacity, eviction cost) and apply solutions appropriate to your infrastructure and constraints.
Future Directions: Where KV Cache Optimization is Heading
The field is evolving quickly. Recent research explores multi-tier caching more aggressively, using both HBM (high bandwidth memory) and DRAM intelligently. Other work investigates learned page eviction policies - using a small ML model to predict which pages are least likely to be accessed soon, rather than using fixed heuristics like LRU.
Speculative decoding combined with KV cache optimization might represent the next frontier. If you can speculatively generate multiple tokens and verify them in parallel, you effectively amortize the KV cache lookup cost across multiple tokens, improving throughput further.
For teams building production serving systems in 2025-2026, the current state of the art (PagedAttention-style frameworks with proper eviction policy tuning) is enough to hit very high throughput. Focus on implementing these patterns correctly rather than waiting for future improvements.
Summary
KV cache management is infrastructure engineering at its finest. It's not flashy - no new models, no research breakthroughs. It's grinding out the details: understanding fragmentation, implementing efficient allocation, choosing eviction policies, and monitoring like your business depends on it. Because it does.
The good news is that the tooling has matured. vLLM, LMDeploy, and other frameworks handle a lot of this automatically. But understanding the mechanics means you can configure them well, debug problems fast, and explain to your finance team why investing in KV cache optimization is worth the engineering effort.
If you're running production LLM serving and haven't thought deeply about KV cache management, this is your signal to start. The 2-3x throughput improvements are real, and they're available with solid engineering. The companies that win at LLM serving aren't winning because they have better models - they're winning because they've engineered the hell out of serving infrastructure, and KV cache management is right at the center of that. The difference between a poorly optimized system and a well-optimized one isn't just a nice-to-have performance improvement - it's the difference between a business that can scale profitably and one that hemorrhages money on compute costs.
Building this optimization into your serving infrastructure from day one is the right investment. If you can serve 200 concurrent requests instead of 50 on the same hardware, you're not just improving performance metrics - you're enabling your business to scale to larger customers, higher load, and better margins. That's why KV cache optimization matters, and why understanding it thoroughly is worth your time.