LLM Caching Mastery: From KV Cache to Production Optimization with vLLM

Caching is one of the most critical optimization techniques for large language model (LLM) inference, enabling dramatic performance improvements and cost reductions. By storing and reusing previously computed intermediate results—particularly Key-Value (KV) tensors—LLM systems can reduce inference latency by up to 4-5× for typical workloads. This guide explores the fundamental concepts of LLM caching, various caching strategies, and practical implementation examples using vLLM, the leading open-source LLM serving framework.

Understanding KV Caching: The Foundation of LLM Optimization

Why KV Caching Matters

Large language models generate tokens autoregressively, producing one token at a time. During this process, each new token must attend to all previous tokens in the sequence. This attention mechanism—the core of transformer architectures—involves computing attention scores between every pair of tokens. Without optimization, this becomes prohibitively expensive.

Without KV caching, every decoding step must recompute the Key and Value projections for the entire sequence processed so far. Generating 100 new tokens after a 100-token prompt therefore requires:

  • First new token: attention over 100 tokens
  • Second new token: attention over 101 tokens
  • Third new token: attention over 102 tokens
  • And so on…

This quadratic complexity (O(n²)) makes inference extremely slow. In empirical testing, enabling KV caching reduces generation time from 40 seconds to approximately 9 seconds for a typical sequence—a 4.5× speedup.
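
A quick count makes the gap concrete. The short sketch below tallies token-processing steps with and without a cache, ignoring per-step constants (the 100/100 prompt/output split is illustrative):

prompt_len, new_tokens = 100, 100

# Without KV caching: every step reprocesses the whole sequence so far
no_cache_steps = sum(prompt_len + i for i in range(new_tokens))  # O(n^2) total work

# With KV caching: the prompt is processed once, then one token per step
cache_steps = prompt_len + new_tokens  # O(n) total work

print(no_cache_steps, cache_steps)  # 14950 vs. 200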

How KV Caching Works

During the attention mechanism within transformer layers, the model computes three representations for each token:

  • Query (Q): What the current token is looking for
  • Key (K): What each token represents
  • Value (V): The information each token carries

For subsequent token generation, the model needs:

  • The Query vector of only the newest token
  • The Key and Value vectors of all previous tokens

The crucial insight is that KV vectors for already-processed tokens don’t change. Therefore, instead of recomputing them, we can retrieve them from cache.

KV Caching Process (see the code sketch after this list):

  1. Prefill phase: Process the entire prompt, compute and store KV pairs for all tokens
  2. Decode phase: For each newly generated token:
    • Compute Q vector for the new token
    • Retrieve cached K and V vectors for all previous tokens
    • Compute attention
    • Store new K and V vectors in cache
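
To make the two phases concrete, here is a minimal single-head attention sketch in plain NumPy. It is an illustrative toy (random weights, one head, no batching), not vLLM's implementation: the prefill step computes and stores K/V for the whole prompt, and each decode step projects and appends only the newest token:

import numpy as np

d = 64                                    # head dimension
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prefill phase: compute and cache K/V for every prompt token
prompt = np.random.randn(10, d)           # embeddings of a 10-token prompt
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode phase: project only the newest token, append to the cache, attend
x_new = np.random.randn(d)                # embedding of the newest token
q = x_new @ Wq
K_cache = np.vstack([K_cache, x_new @ Wk])
V_cache = np.vstack([V_cache, x_new @ Wv])
out = attend(q, K_cache, V_cache)         # attention over prompt + new token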

Memory Implications

While KV caching provides dramatic speedup, it comes with a memory cost. For example, with Llama 3-70B:

  • Model parameters: 140 GB (FP16)
  • KV cache per token: ~2.5 MB
  • 4,096 token context: ~10.5 GB

With multiple concurrent users, KV cache memory becomes a dominant factor in deployment. Effective memory management is essential for practical production systems.
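
As a rough sanity check on these figures, the per-token KV cache size can be estimated directly from the architecture: 2 (K and V) × layers × KV heads × head dimension × bytes per element. The numbers below are illustrative for a 70B-class model that stores K/V for all 64 attention heads in FP16; models using grouped-query attention keep far fewer KV heads and need proportionally less:

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # 2x for storing both Keys and Values, per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Illustrative 70B-class configuration: 80 layers, 64 KV heads of dim 128, FP16
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
print(f"KV cache per token: {per_token / 2**20:.2f} MiB")          # ~2.5 MiB
print(f"4,096-token context: {4096 * per_token / 2**30:.1f} GiB")  # ~10 GiB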

Caching Techniques in Modern LLM Systems

Basic KV Caching

Implementation: Store the complete Key and Value tensors from every attention layer

Pros:

  • Simple to implement
  • Provides substantial speedup (~4-5×)
  • Baseline optimization for virtually all LLM systems

Cons:

  • Memory grows linearly with context length
  • Doesn’t handle duplicate computations across requests

Automatic Prefix Caching (APC)

Automatic Prefix Caching addresses a critical inefficiency: when multiple requests share the same prefix (e.g., system prompt, shared context), each request unnecessarily recomputes the shared prefix’s KV cache.

Core Concept:

  • Partition KV cache into fixed-size blocks (e.g., 16-128 tokens per block)
  • Hash each block based on its prefix and content
  • Maintain a global hash table of cached blocks
  • When a new request arrives, check if its blocks are already cached
  • Reuse physical memory blocks across requests with shared prefixes

Example:

Request 1: [System Prompt (shared)] [Context A] [Query 1]
Request 2: [System Prompt (shared)] [Context B] [Query 2]

The System Prompt blocks are computed once and reused. Block hashes enable this identification:

hash(System Prompt tokens) → Physical Block Reference
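
The idea can be illustrated with a toy dictionary keyed by a chained block hash. This is a simplified sketch of the concept only (the block size matches vLLM's default of 16 tokens, but the data structures are not vLLM's):

BLOCK_SIZE = 16  # tokens per block (vLLM's default block size)

def block_hashes(token_ids):
    # Yield a chained hash for each *full* block; partial blocks are not cached
    prev_hash = None
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = tuple(token_ids[i:i + BLOCK_SIZE])
        prev_hash = hash((prev_hash, block))  # depends on the entire prefix
        yield prev_hash

cached_blocks = {}  # block hash -> physical KV block (toy stand-in)

def prefill_with_reuse(token_ids):
    hits = misses = 0
    for h in block_hashes(token_ids):
        if h in cached_blocks:
            hits += 1                         # reuse a previously computed block
        else:
            cached_blocks[h] = f"kv-block-{len(cached_blocks)}"
            misses += 1                       # compute and cache a new block
    return hits, misses

shared = list(range(64))                          # 4 blocks of shared system prompt
print(prefill_with_reuse(shared + [101, 102]))    # (0, 4): everything computed
print(prefill_with_reuse(shared + [201, 202]))    # (4, 0): shared prefix reused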

Performance Impact:

  • vLLM achieves ~13-34% throughput improvement with APC on shared-prefix workloads
  • Time-to-first-token (TTFT) improves by ~20% due to reduced prefill
  • Some overhead when requests have no shared prefixes

Semantic Caching with GPTCache

Semantic caching takes a different approach: instead of caching based on exact token sequences, it caches based on query meaning.

Mechanism:

  1. Convert incoming queries to embeddings using an embedding model
  2. Search cached queries for high-similarity embeddings
  3. If similarity exceeds threshold, return cached response without calling LLM
  4. Otherwise, generate response and cache it

Components:

  • Embedding generator: Converts text to vectors (e.g., BERT, OpenAI embeddings)
  • Vector store: Enables efficient similarity search (e.g., Milvus, Weaviate)
  • Cache manager: Handles storage and eviction policies
  • Similarity evaluator: Determines cache hits
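
A library-agnostic sketch of this flow is shown below. The embed() function is a hypothetical stand-in (a real deployment would use an actual embedding model and a vector store so that paraphrased queries also match; this toy embedder only matches repeated queries), and call_llm() stands in for the real model call:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a real embedding model (BERT, OpenAI, etc.)
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def call_llm(query: str) -> str:
    return "(model response)"  # stand-in for the actual LLM / API call

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []                         # list of (embedding, response)

    def lookup(self, query: str):
        q = embed(query)
        for emb, response in self.entries:
            if float(q @ emb) >= self.threshold:  # cosine similarity of unit vectors
                return response                   # cache hit: skip the LLM call
        return None

    def store(self, query: str, response: str):
        self.entries.append((embed(query), response))

cache = SemanticCache()
query = "What is the capital of France?"
answer = cache.lookup(query)
if answer is None:
    answer = call_llm(query)
    cache.store(query, answer)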

Real-World Benefits:

  • Can reduce LLM API calls by up to 10× in chat-style applications
  • Improves response time for recurring patterns
  • Reduces costs when using paid LLM APIs

LMCache: Distributed KV Cache Offloading

For very long contexts or multiple concurrent users, GPU memory becomes insufficient. LMCache offloads KV cache to CPU RAM and disk storage while maintaining performance through intelligent prefetching.

Architecture:

  • GPU (Hot cache): Most recently accessed KV blocks
  • CPU RAM (Warm cache): Secondary storage with fast GPU transfer
  • Disk/Remote (Cold cache): Long-term storage for infrequently accessed blocks

Configuration Example:

export LMCACHE_CHUNK_SIZE=256
export LMCACHE_LOCAL_CPU=True
export LMCACHE_MAX_LOCAL_CPU_SIZE=5.0
export LMCACHE_USE_EXPERIMENTAL=True

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 16384 \
  --kv-transfer-config '{
    "kv_connector": "LMCacheConnectorV1",
    "kv_role": "kv_both"
  }'

Implementing Caching in vLLM: Practical Examples

Basic KV Caching (Already Enabled by Default)

In vLLM, basic KV caching is enabled by default and requires no special configuration:

from vllm import LLM, SamplingParams

# Create LLM instance - KV caching enabled automatically
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    gpu_memory_utilization=0.9  # Use 90% of GPU memory for KV cache
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512
)

# Generate responses
prompts = [
    "Explain quantum computing in simple terms.",
    "What is machine learning?",
    "How does photosynthesis work?"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
    print("-" * 80)

Key Configuration Parameter:

  • gpu_memory_utilization: Controls percentage of GPU memory allocated for KV cache. Higher values (0.8-0.95) maximize cache space but risk out-of-memory errors.

Enabling Automatic Prefix Caching

Prefix caching shines in scenarios with repeated prefixes, such as:

  • Multi-round conversations (chat history is shared prefix)
  • Batch queries on the same document
  • System prompts shared across requests

Python API Implementation:

from vllm import LLM, SamplingParams
from vllm.distributed import cleanup_dist_env_and_memory

# Shared prefix - imagine this is a long system prompt or document
prefix = (
    "You are an expert school principal. Draft 10-15 questions "
    "for interviewing a potential Math teacher with 5 years experience "
    "in middle school education. Based on this context, answer the following: "
)

# Different queries using the same prefix
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is"
]

# Combine prefix with each query
full_prompts = [prefix + query for query in prompts]

sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

# Create LLM WITHOUT prefix caching (baseline)
regular_llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.4)

print("=" * 80)
print("WITHOUT Prefix Caching:")
print("=" * 80)

outputs = regular_llm.generate(full_prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt[:100]}...")
    print(f"Response: {output.outputs[0].text}\n")

# Clean up memory
del regular_llm
cleanup_dist_env_and_memory()

# Create LLM WITH prefix caching enabled
prefix_cached_llm = LLM(
    model="facebook/opt-125m",
    enable_prefix_caching=True,  # Enable prefix caching
    gpu_memory_utilization=0.4
)

# Warmup: compute and cache the shared prefix
prefix_cached_llm.generate(full_prompts[0], sampling_params)

print("=" * 80)
print("WITH Prefix Caching (after warmup):")
print("=" * 80)

# Now generate with cache hits
outputs = prefix_cached_llm.generate(full_prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt[:100]}...")
    print(f"Response: {output.outputs[0].text}\n")

print("Note: Subsequent requests should be noticeably faster due to cache reuse!")

Key Concepts:

  1. Warmup step: First request computes prefix KV cache
  2. Cache hits: Subsequent requests with same prefix skip recomputation
  3. Automatic management: vLLM handles block identification and reuse transparently

Starting vLLM Server with Prefix Caching

For serving multiple clients via OpenAI-compatible API:

# Start vLLM server with prefix caching enabled
vllm serve meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --port 8000

# In another terminal, send requests using OpenAI client

Client Code:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

# System prompt (shared prefix across requests)
system_prompt = """You are an expert in software engineering. 
You help developers with technical questions and code reviews."""

# Multiple queries will share the system prompt prefix
queries = [
    "How do I implement a binary search tree in Python?",
    "What's the difference between REST and GraphQL?",
    "Explain the decorator pattern in design patterns."
]

for query in queries:
    response = client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ],
        temperature=0.7,
        max_tokens=512
    )
    
    print(f"Query: {query}")
    print(f"Response: {response.choices[0].message.content}\n")
    print("-" * 80)

Monitoring and Diagnostics

To verify caching is working and monitor performance:

from vllm import LLM, SamplingParams
import time

llm = LLM(
    model="facebook/opt-125m",
    enable_prefix_caching=True,
    gpu_memory_utilization=0.4,
    disable_log_stats=False  # Enable logging for monitoring
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=50)

# Prefix caching works at block granularity (16 tokens per block by default),
# so only prefixes spanning at least one full block get reused. Use a shared
# prefix long enough to fill a block; this one is illustrative.
shared_prefix = (
    "You are a concise geography assistant. Answer factually, in one short "
    "sentence, without any extra commentary or follow-up questions. "
)

# Time the first request (prefix not yet cached)
print("First request (computing prefix cache):")
start = time.time()
output1 = llm.generate(shared_prefix + "The capital of France is", sampling_params)
time1 = time.time() - start
print(f"Time: {time1:.3f}s")

# Time a repeated request (prefix blocks already cached)
print("\nSecond request (should use cache):")
start = time.time()
output2 = llm.generate(shared_prefix + "The capital of France is", sampling_params)
time2 = time.time() - start
print(f"Time: {time2:.3f}s")

# Time with a different suffix (shared prefix still cached)
print("\nThird request (same prefix, different suffix - cache hit):")
start = time.time()
output3 = llm.generate(shared_prefix + "The capital of France is the city called", sampling_params)
time3 = time.time() - start
print(f"Time: {time3:.3f}s")

print(f"\nSpeedup: {time1/time2:.2f}x faster with cached prefix")

Advanced Caching Strategies and Configurations

Memory Tuning for Optimal Cache Performance

Problem: KV cache memory pressure causes request preemption (recomputation)

Solutions:

# Strategy 1: Increase GPU memory utilization
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    gpu_memory_utilization=0.95  # Maximize KV cache space
)

# Strategy 2: Reduce batch concurrency
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    max_num_seqs=32,  # Reduce concurrent requests
    max_num_batched_tokens=4096  # Limit prefill batch size
)

# Strategy 3: Tensor parallelism (distribute across GPUs)
# Each GPU holds a slice of the weights, leaving more memory for KV cache
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4  # Use 4 GPUs
)

# Strategy 4: Pipeline parallelism (distribute layers)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    pipeline_parallel_size=2  # Distribute layers across 2 GPU groups
)

# Strategy 5: CPU offloading of model weights
# Moves part of the weights to CPU RAM, freeing GPU memory for KV cache
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    cpu_offload_gb=10  # Offload up to 10 GB of weights per GPU
)

Monitoring Preemption:

# View preemption events in logs
# Look for: "Sequence group X is preempted"
# This indicates cache pressure - increase utilization or decrease batch size

Sliding Window Cache for Long Sequences

For very long contexts where full KV caching is impractical:

# Some models (e.g., Mistral) use sliding-window attention:
# each token attends only to the most recent N tokens (the window),
# which bounds the KV cache that attention actually needs
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    max_model_len=32768  # Support long contexts
    # The model's configured sliding window is handled internally
)

# This trades off attention to very distant tokens for memory savings

Multi-Adapter Caching with LoRA

When serving multiple LoRA-adapted versions of same base model:

# vLLM can jointly manage KV cache across different adapters
# Caching includes adapter identity in hash

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    enable_lora=True,
    max_lora_rank=64,
    enable_prefix_caching=True  # Works with LoRA
)

# Different adapters share base model weights and cache intelligently

Caching for Specific Workload Patterns

Document Query Workload

Scenario: User repeatedly queries same long document with different questions

# Load long document once, cache its KV representation
document = """
[Long technical document - thousands of tokens]
"""

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    enable_prefix_caching=True,
    gpu_memory_utilization=0.9
)

# Questions about the document
questions = [
    "Summarize the key findings.",
    "What methodology was used?",
    "What are the limitations?",
    "Suggest future research directions.",
]

for question in questions:
    prompt = f"Document:\n{document}\n\nQuestion: {question}"
    result = llm.generate(prompt)
    # First question: slow (compute document cache)
    # Subsequent questions: fast (reuse cache)

Multi-Turn Conversation Workload

Scenario: Chat application with conversation history

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    enable_prefix_caching=True
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

def format_messages(messages):
    # Simplified prompt formatting for illustration; in practice use the
    # model's chat template (e.g. tokenizer.apply_chat_template) instead
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages) + "\nassistant:"

# Build conversation with growing history
conversation_history = []
system_prompt = "You are a helpful assistant."

while True:
    user_input = input("User: ")

    # Build prompt with full history
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(conversation_history)
    messages.append({"role": "user", "content": user_input})

    # With prefix caching, the KV cache for the unchanged history is reused;
    # only the new turn needs a fresh prefill
    outputs = llm.generate(format_messages(messages), sampling_params)
    response = outputs[0].outputs[0].text

    conversation_history.append({"role": "user", "content": user_input})
    conversation_history.append({"role": "assistant", "content": response})

    print(f"Assistant: {response}")

Performance Benchmarks and Metrics

KV Caching Impact

Configuration   | Latency           | Throughput       | Memory
No Caching      | 10.0 s (baseline) | 100% (baseline)  | Low
KV Cache        | 2.2 s             | 450%             | +10 GB
Prefix Cache    | 1.8 s             | 550%             | +12 GB
Semantic Cache  | 0.15 s (hit)      | 6700% (hit)      | +2 GB (embeddings)

Notes:

  • Latency measured for 100-token generation on L40 GPU
  • Throughput normalized to no-caching baseline
  • Semantic cache times assume 90%+ hit rate

Cache Hit Rate Indicators

Monitor these metrics to assess caching effectiveness:

# Throughput improvement: (cached_throughput / baseline_throughput)
# Target: >3x improvement with prefix caching

# Time-to-first-token: (TTFT_cached / TTFT_uncached)  
# Target: <0.5 (half the time due to reduced prefill)

# Token-generation latency: (TPOT_cached / TPOT_uncached)
# Target: ~1.0 (minimal difference, as decoding is same)
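
If you are serving via the OpenAI-compatible server, its Prometheus endpoint is another place to look. The sketch below simply scrapes /metrics and filters for cache- and latency-related entries; exact metric names vary across vLLM versions, so treat the filter keywords as illustrative:

import requests

# Scrape the vLLM server's Prometheus endpoint (exposed by `vllm serve`)
metrics = requests.get("http://localhost:8000/metrics").text

# Print any metrics that look cache- or latency-related
for line in metrics.splitlines():
    if line.startswith("#"):
        continue
    if any(key in line for key in ("cache", "prefix", "time_to_first_token")):
        print(line)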

Best Practices and Recommendations

When to Use Each Caching Strategy

Workload Type                 | Recommended Strategy   | Reasoning
Chat applications             | Prefix Caching         | Conversation history is a shared prefix
Document Q&A                  | Prefix Caching         | Document is a shared prefix
API with varied queries       | Semantic Caching       | Catches similar queries despite phrasing differences
High-volume inference         | Basic KV Cache         | Sufficient for simple workloads
Long contexts (100k+ tokens)  | LMCache Offloading     | GPU memory insufficient for full context
Multiple LoRA adapters        | Multi-adapter caching  | Jointly manage adapter-specific caches

Configuration Checklist

  • Enable prefix caching for any workload with repeated prefixes
  • Set gpu_memory_utilization to 0.9+ unless stability is critical
  • Monitor preemption events in logs; if frequent, increase memory utilization
  • Measure baseline latency before deploying caching
  • Use tensor parallelism for models larger than 30B parameters
  • Consider KV cache offloading (e.g., LMCache) for contexts >8k tokens on a single GPU
  • Profile cache hit rates in production to validate caching effectiveness
  • Set up alerting for cache-related performance degradation

Avoiding Common Pitfalls

Pitfall 1: Over-aggressive memory utilization

  • Problem: gpu_memory_utilization=1.0 causes OOM errors
  • Solution: Start at 0.9 and reduce if errors occur

Pitfall 2: Expecting decode speedup from prefix caching

  • Problem: Prefix caching leaves per-token decode speed unchanged; it only reduces prefill work
  • Solution: Focus on TTFT metrics, not TPOT (time per output token)

Pitfall 3: Caching with no shared prefixes

  • Problem: Prefix caching adds overhead with no benefit
  • Solution: Profile workload; disable if cache hit rate less than 50%

Pitfall 4: Forgetting warmup with prefix caching

  • Problem: First request pays full cost; subsequent requests fast
  • Solution: Pre-load common prefixes during system initialization
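
A minimal warmup pass using the Python API might look like the following; the specific prefixes are illustrative (drawn from the examples above), and generating a single token is enough to prefill and cache each prefix's KV blocks:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_prefix_caching=True)

# Hypothetical prefixes that recur in production traffic (system prompts, documents)
common_prefixes = [
    "You are an expert in software engineering. You help developers with "
    "technical questions and code reviews.",
    "You are an expert school principal. Draft 10-15 questions for interviewing "
    "a potential Math teacher with 5 years experience in middle school education.",
]

# One-token generations prefill and cache each prefix before real traffic arrives
llm.generate(common_prefixes, SamplingParams(max_tokens=1))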

Future Directions in LLM Caching

Emerging techniques likely to shape LLM caching:

  1. Speculative Decoding with Cached Speculation: Cache predictions for common patterns
  2. Hierarchical Caching: Multi-level hierarchy matching CPU cache behavior
  3. Cross-Request Cache Sharing: Share cache across different users safely
  4. Adaptive Cache Eviction: ML-based policies instead of LRU
  5. Disaggregated Cache: Separate cache services for high concurrency scenarios

Conclusion

Caching is fundamental to practical LLM deployment. From basic KV caching providing 4-5× speedups to sophisticated prefix caching exploiting shared computation patterns, the right caching strategy can transform LLM economics.

Key Takeaways:

  • KV caching is essential: Provides substantial speedup with manageable memory overhead
  • Prefix caching multiplies benefits: Automatic reuse of shared prefixes drives further optimization
  • vLLM makes implementation easy: Simple API and sensible defaults enable quick deployment
  • Monitor metrics: Cache hit rates and latency improvements validate effectiveness
  • Workload determines strategy: Different patterns benefit from different caching approaches

Start with basic KV caching (enabled by default), profile your workload, then selectively enable more sophisticated techniques like prefix caching or semantic caching as needed. The investment in understanding and optimizing cache behavior pays dividends in reduced latency and infrastructure costs.