LLM Inference Optimization: The Engineering Behind Fast, Cheap AI
Master LLM inference optimization: speculative decoding, KV-cache compression, quantization, FlashAttention, and serving frameworks compared for fast, cost-effective AI.

Training a frontier model costs millions. Serving it costs more.
GPT-4's training run was estimated at around $100 million. But OpenAI serves billions of requests per day, and every millisecond of latency, every megabyte of memory, every watt of power compounds into a bill that dwarfs the original training investment. Major tech companies routinely spend more on inference infrastructure in a single quarter than the entire cost of training their largest models.
This is the uncomfortable economics of large language models: training is a one-time fixed cost; inference is a variable cost that scales with every user, every query, every token. For most practitioners and organizations deploying LLMs, inference optimization is not a nice-to-have - it is the difference between a viable product and a financial sinkhole.
This article is a deep technical tour of the techniques that make LLM inference fast and affordable. We will cover the full stack: from algorithmic innovations like speculative decoding and KV-cache compression, to systems-level engineering like continuous batching and paged attention, to the serving frameworks that package it all together. If you build, deploy, or operate LLM-based systems, this is the engineering you need to understand.
Why Inference Optimization Matters More Than Training
The ML community has historically obsessed over training efficiency: mixed precision training, data parallelism, ZeRO optimization. These matter, but they affect a small number of teams running a small number of jobs. Inference optimization affects everyone.
Consider the math. A 70B-parameter model in FP16 requires roughly 140 GB of GPU memory just for the weights, so a single A100 (80 GB) cannot even hold the model; you need at least two GPUs for tensor parallelism, which means inter-GPU communication overhead on every forward pass. Now multiply that by thousands of concurrent users, each with a different context length, each expecting sub-second time-to-first-token (TTFT) and smooth streaming at 30+ tokens per second.
The gap between "this model works in a notebook" and "this model serves 10,000 concurrent users at acceptable latency and cost" is enormous. Inference optimization bridges that gap.
The Inference Pipeline: Prefill vs. Decode
Before diving into specific techniques, you need to understand the two-phase structure of autoregressive LLM inference.
Prefill phase (prompt processing). The model processes the entire input prompt in parallel. This is a compute-bound operation: essentially a single forward pass over the full sequence. The KV (key-value) pairs for each attention layer are computed and cached. TTFT is primarily determined by prefill speed.
Decode phase (token generation). The model generates tokens one at a time, autoregressively. Each new token requires a forward pass, but only the new token's query attends to the cached KV pairs from all previous tokens. This is a memory-bandwidth-bound operation. The bottleneck is reading the model weights and KV cache from GPU memory, not the arithmetic.
This distinction matters because different optimizations target different phases. FlashAttention and tensor parallelism accelerate prefill. KV-cache compression and speculative decoding accelerate decode. Continuous batching optimizes how multiple requests share the GPU across both phases.
KV-Cache: The Memory Bottleneck
What the KV-Cache Is
In a transformer's attention mechanism, each layer computes key (K) and value (V) projections for every token in the sequence. During autoregressive generation, recomputing these projections for all previous tokens at every step would be prohibitively expensive. Instead, we cache the K and V tensors. This is the KV-cache.
For a model with L layers, H_kv KV heads, head dimension d_head, and sequence length S, the KV-cache size per request is:

KV-cache bytes = 2 × L × H_kv × d_head × S × bytes-per-element

where the leading factor of 2 accounts for storing both K and V.
To illustrate the scale of this problem, consider a hypothetical 70B dense model with standard MHA (80 layers, 64 attention heads, head dim 128) at 8K context in FP16. We use a dense model here for simplicity, though most frontier models at these scales now use MoE architectures with different memory profiles. Plugging in: 2 × 80 × 64 × 128 × 8,192 × 2 bytes ≈ 20 GB per request.
At 128K context, that becomes ~320 GB - more than the model weights themselves. The KV-cache is often the binding constraint on batch size and thus throughput. Every optimization dollar spent reducing KV-cache memory directly increases how many concurrent requests a GPU can serve.
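To make this arithmetic easy to rerun against your own model, here is a minimal back-of-the-envelope calculator for the formula above:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV-cache size; the factor of 2 covers both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# The hypothetical 70B MHA model above: 80 layers, 64 KV heads, head dim 128.
print(kv_cache_bytes(80, 64, 128, 8_192) / 2**30)    # 20.0  -> ~20 GB at 8K
print(kv_cache_bytes(80, 64, 128, 131_072) / 2**30)  # 320.0 -> ~320 GB at 128K
```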
Grouped-Query Attention (GQA)
Standard multi-head attention (MHA) uses separate K and V projections for each attention head. If a model has 64 heads, it needs 64 sets of K and V caches. Grouped-Query Attention, introduced by Ainslie et al. (2023) and adopted in most production-scale models from 2023 onward, groups multiple query heads to share a single set of K and V heads.
A typical 70B GQA model uses 64 query heads but only 8 KV heads, an 8:1 ratio. This reduces the KV-cache by 8x compared to MHA with negligible quality loss. GQA is now the default for nearly all production-scale models.
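A minimal sketch of the mechanics, with illustrative dimensions: only the 8 KV heads are cached, and the expansion to 64 query heads is a logical view at attention time.

```python
import torch

# GQA sketch (illustrative dimensions): 64 query heads share 8 cached KV heads.
B, S, n_q, n_kv, d = 1, 16, 64, 8, 128
q = torch.randn(B, n_q, S, d)
k = torch.randn(B, n_kv, S, d)  # cached: 8 heads instead of 64 -> 8x less memory
v = torch.randn(B, n_kv, S, d)

# Each group of 64 // 8 = 8 query heads attends to the same KV head. The
# expansion happens at attention time; the cache itself stays small.
k = k.repeat_interleave(n_q // n_kv, dim=1)
v = v.repeat_interleave(n_q // n_kv, dim=1)
attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1) @ v
print(attn.shape)  # (1, 64, 16, 128); causal masking omitted for brevity
```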
Multi-head Latent Attention (MLA)
DeepSeek's MLA is one of the most impactful KV-cache innovations (see Inside DeepSeek). Instead of caching separate K and V matrices, MLA compresses the KV representation into a low-rank latent space. DeepSeek-V2 and V3 use MLA to achieve KV-cache compression ratios far beyond what GQA provides, roughly 93% reduction compared to standard MHA, while maintaining or improving model quality.
MLA works by jointly compressing the KV pairs through a down-projection into a compact latent vector, then reconstructing K and V on the fly during attention computation via learned up-projections. The compute overhead of the reconstruction is small relative to the memory bandwidth savings, making it a net win for the decode phase.
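A conceptual sketch of that idea, with illustrative dimensions rather than DeepSeek's exact layout: only the small latent vector is cached per token, and K and V are rebuilt from it during attention.

```python
import torch

# MLA sketch (illustrative dimensions, not DeepSeek's exact architecture):
# cache one small latent per token; rebuild K and V at attention time.
d_model, d_latent, n_heads, d_head = 8192, 512, 64, 128
W_down = torch.randn(d_model, d_latent) * 0.02           # joint KV down-projection
W_up_k = torch.randn(d_latent, n_heads * d_head) * 0.02  # learned K up-projection
W_up_v = torch.randn(d_latent, n_heads * d_head) * 0.02  # learned V up-projection

h = torch.randn(1, d_model)                  # hidden state of the newest token
latent = h @ W_down                          # (1, 512): all we cache per token
k = (latent @ W_up_k).view(n_heads, d_head)  # reconstructed on the fly
v = (latent @ W_up_v).view(n_heads, d_head)
# Cached elements per token: 512, vs 2 * 64 * 128 = 16,384 for full MHA K+V.
```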
Paged Attention
Traditional KV-cache implementations pre-allocate contiguous memory blocks for each request based on the maximum possible sequence length. This leads to severe memory fragmentation and waste: a request that ultimately uses 500 tokens still reserves memory for 8,192 tokens.
Paged Attention, introduced by the vLLM project (Kwon et al., 2023), borrows the concept of virtual memory paging from operating systems. The KV-cache is divided into fixed-size "pages" (blocks), and pages are allocated on demand as the sequence grows. A block table maps logical token positions to physical memory locations.
The results are dramatic: Paged Attention achieves near-zero memory waste, enables much larger batch sizes, and supports advanced features like copy-on-write for parallel sampling (beam search, best-of-N) where multiple sequences can share KV-cache pages for their common prefix.
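A toy allocator makes the block-table idea concrete (illustrative only, not vLLM's implementation): pages are grabbed from a free pool only when a sequence actually grows into them.

```python
# Toy page allocator in the spirit of Paged Attention: fixed-size KV pages are
# allocated on demand, and a per-request block table maps logical token
# positions to physical pages.
PAGE_SIZE = 16  # tokens per page

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.block_tables: dict[int, list[int]] = {}  # request id -> pages

    def slot_for(self, req_id: int, pos: int) -> tuple[int, int]:
        table = self.block_tables.setdefault(req_id, [])
        if pos // PAGE_SIZE == len(table):        # last page full: grab a new one
            table.append(self.free_pages.pop())
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE

    def release(self, req_id: int) -> None:       # pages return to the pool
        self.free_pages.extend(self.block_tables.pop(req_id, []))

cache = PagedKVCache(num_pages=1024)
page, offset = cache.slot_for(req_id=0, pos=0)    # where token 0's K/V lives
```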
KV-Cache Quantization
The KV-cache is typically stored in FP16 or BF16 (2 bytes per element). Quantizing it to INT8 (1 byte) or even INT4 (0.5 bytes) can halve or quarter the memory footprint. Research has shown that KV-cache values are more robust to quantization than model weights, because the attention mechanism's softmax normalization provides some error resilience.
Frameworks like vLLM support FP8 KV-cache quantization out of the box. More aggressive approaches like KIVI (2024) apply per-channel quantization to keys (which have large outlier channels) and per-token quantization to values, achieving 2-bit KV-cache with minimal quality degradation.
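A sketch of KIVI's direction choice, using INT8 for readability rather than KIVI's 2-bit format: keys get one scale per channel (where the outliers live), values one scale per token.

```python
import torch

def quantize_int8(x: torch.Tensor, dim: int):
    """Symmetric INT8 quantization with one scale per slice along `dim`."""
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

keys = torch.randn(4096, 128)   # (tokens, head_dim); outliers live in channels
vals = torch.randn(4096, 128)

qk, k_scale = quantize_int8(keys, dim=0)  # per-channel: one scale per column
qv, v_scale = quantize_int8(vals, dim=1)  # per-token: one scale per row
keys_hat = qk.float() * k_scale           # dequantized at attention time
```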
TurboQuant (Zandieh & Mirrokni, ICLR 2026) takes a fundamentally different approach. Rather than relying on per-channel or per-token scaling factors that add their own memory overhead, TurboQuant uses a two-stage process. First, PolarQuant randomly rotates the data vectors and converts them to polar coordinates, where the angle distributions are naturally concentrated. This eliminates the need for expensive normalization entirely. Second, a QJL error correction stage compresses residual quantization errors into a single sign bit using the Johnson-Lindenstrauss transform. The result is 3-bit KV-cache quantization with zero memory overhead from quantization metadata, no calibration data required, and no measurable accuracy loss across benchmarks including LongBench, Needle In A Haystack, and RULER. On H100 GPUs, 4-bit TurboQuant achieves up to 8x throughput over unquantized 32-bit keys, with at least 6x reduction in KV-cache memory.
Speculative Decoding: Guessing Ahead
The Core Idea
Autoregressive decoding is inherently serial: you cannot generate token N+1 until you have token N. But what if you could guess multiple future tokens cheaply, then verify them in parallel with the full model?
This is speculative decoding. A small, fast "draft model" generates a sequence of K candidate tokens. The large "target model" then processes all K candidates in a single forward pass (which is essentially as fast as processing one token, since the forward pass is memory-bandwidth-bound). Any prefix of correct predictions is accepted; the first wrong token is resampled from the target model's distribution.
The key insight is that the verification step is nearly free - the target model was going to read its weights from memory regardless, and processing K tokens in parallel during the decode phase barely increases compute. If the draft model's acceptance rate is high (say 70-80% per token), you get a 2-3x speedup in tokens per second with zero quality loss. The output distribution is mathematically identical to the target model's distribution.
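Here is a greedy-decoding sketch of the draft-then-verify loop. The two model interfaces are hypothetical stand-ins: draft_next(seq) returns the draft's greedy next token, and target_all(seq) returns the target's greedy prediction at every position from one parallel forward pass. Sampling-based variants additionally need modified rejection sampling to match the target distribution exactly; under greedy decoding, "accept while the tokens agree" is already exact.

```python
# Greedy speculative decoding sketch (hypothetical model interfaces).
def speculative_step(target_all, draft_next, tokens: list, k: int = 4) -> list:
    proposal = list(tokens)
    for _ in range(k):                          # cheap serial drafting
        proposal.append(draft_next(proposal))
    preds = target_all(proposal)                # preds[j] follows proposal[:j+1]

    accepted = list(tokens)
    for i in range(k):                          # keep the agreeing prefix
        if proposal[len(tokens) + i] != preds[len(tokens) + i - 1]:
            break
        accepted.append(proposal[len(tokens) + i])
    accepted.append(preds[len(accepted) - 1])   # target's own next token is free
    return accepted

# Toy stand-in "models" so the sketch runs end to end.
target_all = lambda seq: [(t * 2 + 1) % 97 for t in seq]
draft_next = lambda seq: (seq[-1] * 2 + 1) % 97 if seq[-1] % 7 else 0
print(speculative_step(target_all, draft_next, [5, 11]))
```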
Draft Model Selection
The draft model must be much smaller and faster than the target model, but similar enough to have a high acceptance rate. Common choices:
- Smaller model from the same family: Qwen3.5-35B-A3B as draft for Qwen3.5-397B-A17B
- Quantized version of the target: A 4-bit quantized version of the model drafting for its FP16 self
- Fine-tuned small model: A small model specifically trained to mimic the target model's distribution
The acceptance rate depends heavily on the task and prompt. Factual recall and code completion tend to have high acceptance rates (the next tokens are relatively predictable); creative writing and reasoning have lower rates.
Medusa and Eagle: Self-Drafting
Instead of using a separate draft model, Medusa (Cai et al., 2024) adds multiple lightweight "heads" to the target model itself. Each head predicts a different future token position (head 1 predicts the next token, head 2 predicts two tokens ahead, etc.). These heads are small MLPs trained while the base model is frozen.
Medusa constructs a tree of candidate continuations and verifies them in a single forward pass using tree attention. This eliminates the need for a separate draft model and the associated memory overhead.
Eagle (Li et al., 2024) takes a different approach: it trains an autoregressive draft head that takes the target model's hidden states as context. Eagle-2 further improves on this with context-aware dynamic draft tree construction, achieving acceptance rates significantly higher than vanilla speculative decoding. Eagle-2 has demonstrated 3-4x speedup on comparable dense 70B models with no quality loss.
Quantization for Inference
Quantization reduces the numerical precision of model weights (and optionally activations) to decrease memory usage and increase throughput. The key insight is that LLM weights are over-parameterized: they contain enough redundancy that lower-precision approximations preserve nearly all of the model's behavior.
The Quantization Landscape
GPTQ (Frantar et al., 2022): A post-training quantization method based on approximate second-order information. GPTQ quantizes weights to INT4 or INT3 using a layer-wise approach with Hessian-based error compensation. It produces models that run on GPU with frameworks like AutoGPTQ and ExLlamaV2. GPTQ models are widely available on HuggingFace and remain popular for GPU inference.
AWQ (Lin et al., 2024): Activation-Aware Weight Quantization recognizes that a small fraction of weight channels (those corresponding to large activation magnitudes) disproportionately affect model quality. AWQ applies per-channel scaling to protect these salient channels before quantization. AWQ typically achieves better quality than GPTQ at the same bit width, especially at INT4. It has become the default quantization method for many production deployments.
GGUF (formerly GGML): The quantization format used by llama.cpp and its ecosystem. GGUF supports a wide range of quantization levels (Q2_K through Q8_0) with different strategies per layer. The key differentiator is CPU support: GGUF models can run on CPU and Apple Silicon using Metal, making them the go-to for local/edge deployment. The quality-to-compression tradeoff is well-optimized: Q4_K_M is often considered the sweet spot for quality vs. size.
FP8 (E4M3 / E5M2): Half the memory of FP16 with much simpler quantization (often just casting, with optional per-tensor scaling). FP8 is natively supported on NVIDIA Hopper (H100) and Ada Lovelace GPUs. Because it is a floating-point format, it avoids many of the pathologies of integer quantization (outlier channels, zero-point calibration). FP8 is becoming the default for production GPU serving. Both vLLM and TensorRT-LLM support FP8 weights and FP8 KV-cache.
When to Use What
| Scenario | Recommended Quantization | Why |
|---|---|---|
| Production GPU serving (H100/A100) | FP8 or AWQ INT4 | Best quality/throughput tradeoff on modern GPUs |
| GPU serving, maximum throughput | GPTQ/AWQ INT4 with ExLlamaV2 | 4-bit enables larger batches, ExLlamaV2 kernels are fast |
| Local deployment (Mac/CPU) | GGUF Q4_K_M or Q5_K_M | llama.cpp's Metal backend is excellent on Apple Silicon |
| Edge/mobile deployment | GGUF Q3_K_M or Q2_K | Aggressive quantization for memory-constrained devices |
| Quality-sensitive applications | FP8 or AWQ INT4 (group size 128) | Minimal quality loss from FP16 baseline |
| Latency-critical, single-user | GPTQ/AWQ INT4 | Lower memory = faster weight loading in decode phase |
A useful rule of thumb: weight memory scales linearly with bit width, so INT8 halves and INT4 quarters the footprint relative to FP16, which proportionally increases the maximum batch size (and thus throughput) for memory-bound decode.
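As a concrete starting point, loading an AWQ checkpoint with an FP8 KV-cache in vLLM looks roughly like this; the checkpoint name is just an example, and flag names can drift across vLLM versions.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",  # any AWQ checkpoint from the Hub
    quantization="awq",                # vLLM also accepts "gptq", "fp8", ...
    kv_cache_dtype="fp8",              # quantize the KV-cache as well
)
out = llm.generate(["Explain paged attention in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```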
FlashAttention: Taming the Attention Bottleneck
The Memory Hierarchy Problem
Standard attention computes the full S x S attention matrix, materializes it in GPU HBM (High Bandwidth Memory), applies softmax, then multiplies by V. For long sequences, this attention matrix is enormous and the repeated HBM reads/writes dominate runtime.
FlashAttention 1 (Dao et al., 2022)
FlashAttention rewrites the attention computation to be IO-aware. Instead of materializing the full attention matrix, it tiles the computation into blocks that fit in GPU SRAM (on-chip memory, ~20 MB on A100 vs. 80 GB HBM). By computing attention in tiles and using an online softmax algorithm, FlashAttention avoids writing the full attention matrix to HBM entirely.
The result: 2-4x wall-clock speedup and significant memory savings (attention memory goes from O(S^2) to O(S)). FlashAttention made long-context training and inference practical.
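The heart of the trick is the online softmax. This sketch processes the KV tensors in tiles for a single query vector, never materializing the full score row, and matches standard attention exactly; it shows the math, while the real speedup comes from running it as a fused kernel in SRAM.

```python
import torch

def tiled_attention(q, k, v, block: int = 128):
    """Attention for one query over KV tiles with an online softmax,
    in the spirit of FlashAttention: the full score row is never stored."""
    S, d = k.shape
    m = torch.tensor(float("-inf"))  # running max, for numerical stability
    l = torch.tensor(0.0)            # running softmax denominator
    o = torch.zeros(d)               # running output accumulator
    for start in range(0, S, block):
        s = k[start:start + block] @ q / d**0.5   # scores for this tile only
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)              # rescale old accumulators
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ v[start:start + block]
        m = m_new
    return o / l

q, k, v = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
assert torch.allclose(tiled_attention(q, k, v),
                      torch.softmax(k @ q / 8.0, dim=0) @ v, atol=1e-4)
```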
FlashAttention 2 (Dao, 2023)
FlashAttention 2 improved on the original by better parallelizing across the sequence length dimension (parallelizing over the Q dimension rather than KV), reducing non-matmul FLOPs, and improving warp-level partitioning. These changes yielded an additional 2x speedup over FlashAttention 1, reaching 50-73% of theoretical peak FLOPS on A100.
FlashAttention 3 (Dao et al., 2024)
Designed specifically for NVIDIA Hopper GPUs (H100/H200), FlashAttention 3 exploits hardware features unique to the Hopper architecture: asynchronous execution of WGMMA (Warp Group Matrix Multiply-Accumulate) tensor core operations overlapped with softmax computation in CUDA cores, and hardware-accelerated FP8 support using the TMA (Tensor Memory Accelerator).
FlashAttention 3 achieves up to 740 TFLOPS on H100 (FP16), approaching 75% of the theoretical maximum. With FP8, it reaches over 1.2 PFLOPS. This is a landmark: attention is no longer the dominant bottleneck for many workloads on Hopper GPUs.
Continuous Batching and Dynamic Scheduling
The Problem with Static Batching
Naive batching groups N requests and processes them together. But LLM requests have wildly different lengths: one user sends a 50-token prompt, another sends 4,000 tokens. With static batching, all requests in a batch must wait for the longest one to finish before new requests can be scheduled. GPU utilization plummets.
Continuous (In-Flight) Batching
Continuous batching, pioneered by Orca (Yu et al., 2022) and implemented in virtually every modern serving framework, schedules at the iteration level rather than the request level. When a request finishes generating (hits an EOS token or max length), a new request is immediately inserted into the batch without waiting for other requests to complete.
This dramatically improves GPU utilization and throughput. In practice, continuous batching can increase throughput by 10-20x compared to static batching for workloads with variable output lengths.
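A toy scheduler makes the iteration-level idea concrete; step_fn here is a stand-in for one batched decode step of the model.

```python
import collections
from dataclasses import dataclass

EOS = 0  # end-of-sequence token id (illustrative)

@dataclass
class Request:
    tokens: list
    max_tokens: int = 16

def serve(queue: collections.deque, step_fn, max_batch: int = 4) -> None:
    """Iteration-level scheduling: finished sequences leave after each step
    and queued requests join immediately, so the batch never drains while
    work is waiting."""
    active = []
    while queue or active:
        while queue and len(active) < max_batch:  # admit between iterations
            active.append(queue.popleft())
        for req in active:                        # one decode step for the batch
            req.tokens.append(step_fn(req.tokens))
        active = [r for r in active
                  if r.tokens[-1] != EOS and len(r.tokens) < r.max_tokens]

# Toy "model" counts down to EOS, so requests finish at different times.
serve(collections.deque(Request([n]) for n in (3, 9, 5)),
      step_fn=lambda toks: max(toks[-1] - 1, 0))
```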
Chunked Prefill
A further refinement: instead of processing an entire long prompt in one prefill step (which blocks decode steps for other requests), chunked prefill breaks the prompt into chunks and interleaves prefill chunks with decode steps. This keeps TTFT low for new requests while maintaining throughput for ongoing generations. vLLM and SGLang both implement chunked prefill.
Prefix Caching and RadixAttention
Many LLM applications share common prefixes: system prompts, few-shot examples, RAG context. Recomputing the KV-cache for these shared prefixes on every request is wasteful.
Prefix caching stores KV-cache entries for common prefixes and reuses them across requests. SGLang's RadixAttention implements this elegantly using a radix tree (a compressed trie) to store and look up cached KV segments. When a new request arrives, the framework traverses the radix tree to find the longest matching prefix, reuses its cached KV entries, and only computes the KV-cache for the new suffix.
This is particularly impactful for:
- Chatbots with system prompts (every request shares the same system prompt KV-cache)
- Multi-turn conversations (each turn reuses the KV-cache from prior turns)
- Agentic workflows where the same tool descriptions and instructions are prepended to every call
- Batch processing with shared few-shot examples
In production workloads with high prefix sharing, RadixAttention can reduce prefill latency by 5-10x.
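The lookup itself can be sketched with a plain token-level trie; RadixAttention proper uses a compressed radix tree tied to KV page management, so this is only the bare idea.

```python
class PrefixCache:
    def __init__(self):
        self.root: dict = {}

    def insert(self, tokens: list) -> None:
        node = self.root
        for t in tokens:                 # each edge conceptually owns KV pages
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens: list) -> int:
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n                         # tokens whose KV-cache can be reused

cache = PrefixCache()
cache.insert([7, 7, 3, 9])                 # e.g. a cached system prompt
print(cache.longest_prefix([7, 7, 3, 5]))  # 3 tokens reused, 1 left to prefill
```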
Serving Frameworks Compared
The techniques above are impressive in isolation, but the real magic is in the serving frameworks that integrate them into production-ready systems. These open-source serving stacks are what run the models covered in The Open-Source LLM Power Shift. Here is how the major frameworks compare:
| Feature | vLLM | SGLang | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|
| Primary use case | General-purpose GPU serving | High-throughput GPU serving, structured generation | Maximum performance GPU serving | Local/edge, CPU + GPU |
| Paged Attention | Yes (invented it) | Yes | Yes | No (uses contiguous cache) |
| Continuous Batching | Yes | Yes | Yes | Limited (llama-server) |
| Speculative Decoding | Yes | Yes | Yes | Yes (basic) |
| KV-Cache Quantization | FP8, INT8 | FP8, INT8 | FP8, INT4 | Q4, Q8 (implicit) |
| Prefix Caching | Yes (automatic) | Yes (RadixAttention) | Yes | Prompt caching |
| Structured Output | Outlines integration | Native (fast constrained decoding) | Limited | Grammar-based (GBNF) |
| FlashAttention | FlashAttention 2/3, FlashInfer | FlashInfer | Custom fused kernels | Custom CUDA/Metal kernels |
| Quantization Support | GPTQ, AWQ, FP8, BitsAndBytes | GPTQ, AWQ, FP8 | FP8, INT8, INT4 (own format) | GGUF (Q2-Q8) |
| Multi-GPU | Tensor + Pipeline parallelism | Tensor + Data parallelism | Tensor + Pipeline + Expert parallelism | Limited tensor parallelism |
| Ease of Setup | pip install vllm | pip install sglang | NVIDIA container, build from source | brew install llama.cpp |
| API Compatibility | OpenAI-compatible | OpenAI-compatible | NVIDIA Triton / OpenAI-compatible | OpenAI-compatible |
| Best For | Most production GPU deployments | Agentic/structured workloads, prefix-heavy | Maximum throughput, NVIDIA-only environments | Local dev, Mac/CPU, privacy-first |
vLLM
vLLM remains the most widely adopted open-source serving framework. Its PagedAttention implementation set the standard, and it has steadily added features: chunked prefill, automatic prefix caching, FP8 quantization, multi-LoRA serving, and speculative decoding. The project's strength is its breadth of model support (virtually every HuggingFace model works) and production stability. If you are starting a new LLM serving project and want a safe default, vLLM is it.
SGLang
SGLang has emerged as the performance leader in many benchmarks, particularly for workloads involving structured generation (JSON, function calling) and high prefix sharing. Its RadixAttention implementation for prefix caching is the most sophisticated available, and its constrained decoding engine is significantly faster than alternatives. SGLang also pioneered disaggregated prefill and decode, allowing different GPU pools to handle each phase.
For agentic applications where models are called repeatedly with similar prompts and must produce structured output, SGLang often delivers 2-3x higher throughput than vLLM.
TensorRT-LLM
NVIDIA's own serving framework squeezes maximum performance from NVIDIA GPUs through aggressive kernel fusion, custom CUDA kernels, and tight integration with the NVIDIA software stack. TensorRT-LLM often wins absolute throughput benchmarks but at the cost of flexibility: model support is narrower, setup is more complex, and you are locked into NVIDIA hardware. It is the right choice for large-scale NVIDIA-only deployments where engineering resources are available to manage the complexity.
MoE models have unique serving challenges (see Mixture of Experts Demystified). TensorRT-LLM has some of the best MoE support, with expert parallelism across GPUs.
llama.cpp
llama.cpp is in a category of its own: a C/C++ inference engine designed for efficiency on consumer hardware. It runs on CPU, Apple Silicon (Metal), NVIDIA GPUs (CUDA), AMD GPUs (ROCm, Vulkan), and even phones. Its GGUF quantization format enables running 70B models on a MacBook Pro with 64 GB of unified memory.
llama.cpp is not designed for high-concurrency serving (though llama-server has improved), but for local development, prototyping, privacy-sensitive deployments, and edge inference, nothing else comes close.
Practical Decision Framework: Which Optimization for Which Bottleneck
The single most important question in inference optimization is: what is your bottleneck?
If your bottleneck is TTFT (time to first token):
Your prefill is too slow. Solutions:
- Use FlashAttention 2/3
- Enable tensor parallelism to split the prefill across GPUs
- Enable prefix caching for repeated prompts
- Use chunked prefill if long prefills are blocking decode for other requests
- Consider FP8 quantization to speed up prefill compute
If your bottleneck is decode throughput (tokens/second):
You are memory-bandwidth bound. Solutions:
- Quantize weights (AWQ INT4, FP8) to reduce memory reads per token
- Enable GQA/MLA models to shrink the KV-cache
- Use KV-cache quantization (FP8 or INT8)
- Enable speculative decoding to generate multiple tokens per forward pass
- Increase batch size (if memory allows) to amortize weight reads
If your bottleneck is memory (cannot fit the model or enough concurrent requests):
Solutions:
- Quantize: INT4 cuts memory 4x from FP16
- Use Paged Attention to eliminate KV-cache fragmentation
- Use KV-cache quantization
- Choose a GQA/MLA model architecture
- Use pipeline parallelism to split the model across GPUs
If your bottleneck is cost:
Solutions:
- Quantize aggressively (INT4 or even INT3) to use fewer/smaller GPUs
- Maximize batch size through memory optimizations (all of the above)
- Use speculative decoding to increase effective throughput without adding GPUs
- Use prefix caching to avoid redundant computation
- Consider MoE models which activate fewer parameters per token
Taken together, these optimizations are what make token-hungry reasoning models practical at scale.
Putting It All Together: A Real-World Example
Consider serving a dense 70B model for a chatbot application with a system prompt of 2,000 tokens, average user messages of 200 tokens, and average responses of 500 tokens. You want to serve 100 concurrent users on a minimal GPU setup.
Without optimization: FP16 weights (140 GB) require 2x H100 GPUs. KV-cache for 100 users at 2,700 tokens each: ~6.75 GB per user = 675 GB total. This does not fit in any reasonable setup.
With optimization stack:
- AWQ INT4 quantization: Weights shrink to ~35 GB, fitting on a single H100
- GQA (built-in): KV-cache is already 8x smaller than MHA = ~0.84 GB per user
- FP8 KV-cache quantization: Further 2x reduction = ~0.42 GB per user
- Paged Attention: Near-zero waste; pages are allocated only as responses stream in, so the ~42 GB fully-allocated figure for 100 users (100 x 0.42 GB) is a worst case, not the steady state
- Prefix caching: The 2,000-token system prompt KV-cache (~0.3 GB at FP8 with GQA) is computed once and shared instead of stored 100 times, saving roughly 30 GB of that worst case
- Continuous batching: Requests are served as they complete, maintaining high GPU utilization
- Speculative decoding with a smaller draft model from the same family: ~2.5x decode speedup
Total memory: ~35 GB weights + ~42 GB worst-case KV-cache (considerably less in practice, thanks to paging and the shared prefix) + ~5 GB draft model + overhead ≈ 85 GB. This fits on 2x H100 with room to spare, and can likely serve 100 concurrent users with acceptable latency. Without these optimizations, you would need 10x the GPU count.
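Expressed as a single vLLM configuration, the stack might look like the sketch below. The model names are placeholders and some argument names (notably the speculative decoding config) have changed across vLLM releases, so treat this as a map of the knobs rather than exact syntax; paged attention and continuous batching are on by default.

```python
from vllm import LLM

llm = LLM(
    model="example-org/chat-70b-awq",      # placeholder: AWQ INT4 weights (~35 GB)
    quantization="awq",
    kv_cache_dtype="fp8",                  # FP8 KV-cache, 2x smaller
    enable_prefix_caching=True,            # reuse the system-prompt KV
    tensor_parallel_size=2,                # split across 2x H100
    speculative_config={                   # draft model from the same family
        "model": "example-org/chat-7b-awq",
        "num_speculative_tokens": 4,
    },
)
```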
What is Next
1-Bit and Sub-1-Bit Models
BitNet (Ma et al., 2024) demonstrated that transformer models trained from scratch with ternary weights ({-1, 0, 1}) can match full-precision models at scale, while requiring only addition operations (no multiplication) for matrix operations. This is not post-training quantization - it is a fundamentally different training paradigm.
The implications are staggering: models that need no floating-point multiply hardware and could run on custom ASICs orders of magnitude more efficient than GPUs. Microsoft's BitNet b1.58 showed competitive performance with full-precision models at similar scale. While still early, 1-bit models may represent the long-term future of efficient inference.
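The "no multiplication" claim is easy to verify for yourself: with ternary weights, a matrix-vector product reduces to additions and subtractions of activations.

```python
import numpy as np

# With weights in {-1, 0, +1}, W @ x is just adds and subtracts of x's entries.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))  # ternary weight matrix
x = rng.standard_normal(8)

y = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])
assert np.allclose(y, W @ x)          # identical result, multiply-free
```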
Hardware-Software Co-Design
The next generation of inference optimization is happening at the hardware level. NVIDIA's Blackwell architecture introduces a dedicated "transformer engine" with native support for FP4 computation. AMD and Intel are competing with their own inference-optimized hardware. Custom silicon from Groq (deterministic LPU), Cerebras (wafer-scale), and SambaNova (dataflow architecture) promises order-of-magnitude improvements for specific inference patterns.
The trend is clear: general-purpose GPUs are increasingly inefficient for the very specific computation patterns of LLM inference (mostly large matrix-vector multiplications during decode). Purpose-built hardware will gradually take over high-volume inference workloads.
Disaggregated Inference
The prefill and decode phases have fundamentally different hardware requirements: prefill is compute-bound and benefits from high FLOPS; decode is memory-bandwidth-bound and benefits from high bandwidth-to-compute ratio. Disaggregated inference runs these phases on different hardware pools optimized for each.
SGLang and Mooncake (the serving system behind Moonshot AI's Kimi) have pioneered this approach. Expect disaggregated inference to become the standard architecture for large-scale LLM serving.
Compiler-Driven Optimization
Frameworks like TorchInductor, Triton (the compiler, not NVIDIA Triton), and MLIR-based compilers are making it possible to automatically generate fused, optimized kernels for specific model architectures and hardware. This reduces the need for hand-written CUDA kernels and makes optimizations more portable across hardware.
Key Takeaways
- Inference cost dominates total LLM cost. For any model serving real users, inference optimization has a higher ROI than training optimization.
- Know your bottleneck. Prefill is compute-bound; decode is memory-bandwidth-bound. Different optimizations target different phases. Profile before optimizing.
- The KV-cache is the central challenge. GQA, MLA, Paged Attention, and KV-cache quantization are not optional. They are table stakes for serving at scale.
- Quantization is (nearly) free performance. FP8 quantization on Hopper GPUs and AWQ INT4 on older GPUs provide substantial memory and throughput improvements with minimal quality loss. For most applications, FP16 inference is wasteful.
- Speculative decoding is underused. It provides 2-3x decode speedup with mathematically zero quality loss. If you have a good draft model and latency matters, use it.
- FlashAttention is non-negotiable. There is no reason not to use FlashAttention (or FlashInfer) in 2026. It is free performance.
- Framework choice matters. vLLM for general-purpose serving, SGLang for structured/agentic workloads, TensorRT-LLM for maximum NVIDIA throughput, llama.cpp for local/edge. Do not fight the framework. Choose the one that matches your use case.
- Combine techniques multiplicatively. Quantization + KV-cache compression + speculative decoding + continuous batching + prefix caching can reduce serving costs by 10-20x compared to naive FP16 serving. These optimizations compose.
- The field is moving fast. 1-bit models, disaggregated inference, and purpose-built hardware are poised to deliver another order-of-magnitude improvement. Stay current.
The engineering of LLM inference optimization is where the abstract capability of large models meets the concrete constraints of latency, memory, cost, and power. Mastering these techniques is not optional for anyone serious about deploying LLMs in production. The models will keep getting bigger; the job of the inference engineer is to make that growth invisible to the end user.