Quantization Deep Dive: FP8 Training, FP4, and the Outlier Problem

A technical guide to LLM quantization: FP8 training, NVFP4 and MXFP4, W4A4 inference, the outlier problem, and where low-bit precision quietly breaks accuracy.

RayZ
Figure: a weight precision ladder from 16 bits down to 1.58 bits, and an activation distribution with one outlier spike, feeding a low-bit model that fits on one GPU.

When OpenAI released gpt-oss in August 2025, the 120-billion-parameter model fit on a single 80GB GPU. It was not squeezed there after the fact by the community with a lossy conversion; it shipped that way. The mixture-of-experts weights, which are more than 90% of the parameter count, were released natively in MXFP4, a 4-bit floating-point format, at roughly 4.25 bits per parameter. No higher-precision version of those weights was released. Months earlier, in December 2024, DeepSeek-V3 had already shown the other half of the story, training a frontier-scale model with most of its compute running in 8-bit floating point rather than the BF16 that every large model used a year earlier.

Those two facts mark a shift that most "quantization" discussion misses. Quantization stopped being a post-hoc compression trick you apply to a finished model and became a property of how models are trained and shipped. Getting it right, or recognizing when someone got it wrong, requires separating questions that usually get mashed together: which tensors you lower the precision of, when in the lifecycle you do it, and which specific failure mode you are about to hit. This piece is the mechanical version of that story, including the part the benchmarks hide, which is where low-bit precision quietly breaks. It pairs with the LLM inference optimization overview, which is the hub for the latency and throughput side of the same problem.

Two axes everyone conflates

"We quantized the model to 4-bit" is an underspecified sentence. It collapses two independent axes, and almost every confused argument about quantization comes from mixing them.

The first axis is what you quantize. A transformer has several distinct tensor populations, and they behave nothing alike under low precision:

  • Weights. Static, known ahead of time, and the easiest thing to quantize. Weight-only 4-bit is close to free for many models.
  • Activations. Dynamic, input-dependent, and the hard case, because they contain outliers (more on this below). This is the wall that "weights and activations in 4-bit," written W4A4, runs into.
  • KV cache. The activations you store for every past token. Quantizing it is what makes long context affordable, and it has its own error-accumulation behavior.
  • Gradients and optimizer states. Only relevant during training. Their dynamic range is wide and their sensitivity is high, which is why training in low precision is a different and harder problem than serving in it.

The second axis is when. Post-training quantization (PTQ) takes a finished BF16 model and lowers precision with a small calibration set, no gradient updates. Quantization-aware training (QAT) simulates the rounding during training so the weights adapt to it. And native low-precision training keeps the forward and backward passes themselves in a low-precision format from the start, which is what DeepSeek-V3 did in FP8 and what the frontier is now pushing to FP4.

Hold these two axes in mind and the landscape stops being a single dial labeled "bits" and becomes a grid. Weight-only PTQ at 4 bits is a solved, low-risk operation. W4A4 PTQ is a research-grade problem that needs rotations to work. FP8 training is a production reality. FP4 training is the current frontier. They share the word "4-bit" and almost nothing else.

The memory math, and why anyone tolerates the difficulty

Before the mechanisms, the motivation, because it is just arithmetic and it explains every decision that follows. The weight memory of a model is parameters times bits-per-parameter divided by eight. That single formula decides what fits on a GPU.

$$\text{Weight memory} = \frac{\text{parameters} \times \text{bits per parameter}}{8}$$

Run that for gpt-oss-120b (120B parameters) and the precision ladder falls out directly:

  • BF16 (16-bit): 223.5 GiB, three 80GB GPUs just for weights.
  • FP8 (8-bit): 111.8 GiB, two 80GB GPUs.
  • MXFP4 (4.25-bit): 59.4 GiB, fits on one 80GB GPU, which is exactly why gpt-oss ships in MXFP4.
  • Ternary (b1.58, ~1.58-bit): 22.1 GiB, fits on a 24GB consumer card.
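
The ladder above is the formula applied four times. A minimal sketch, assuming a flat 120e9 parameters all stored at one precision (real checkpoints mix precisions across tensors, so treat the results as estimates); the helper name `weight_gib` is ours, not from any library:

```python
GIB = 2**30  # bytes per GiB

def weight_gib(params: float, bits_per_param: float) -> float:
    """Weight memory in GiB: parameters x bits per parameter / 8, converted from bytes."""
    return params * bits_per_param / 8 / GIB

PARAMS = 120e9  # assumption: every parameter stored at the same precision

for name, bits in [("BF16", 16), ("FP8", 8), ("MXFP4", 4.25), ("Ternary", 1.58)]:
    print(f"{name:8s} {weight_gib(PARAMS, bits):6.1f} GiB")
```

The sizes are in GiB (2^30 bytes) while GPU capacities are usually quoted in GB, so leave headroom when reading "fits on one GPU" off numbers like these.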

The jump from BF16 to MXFP4 is the difference between a three-GPU deployment and a single-GPU one, which is the difference between a model most people cannot run and one they can. That is the entire economic argument for quantization, and it is why the technique is load-bearing for the open-weights ecosystem: low-bit precision is what lets an open model actually run on hardware a practitioner owns. The ecosystem treats it as first-class, not an afterthought: Qwen ships official AWQ checkpoints alongside the full-precision Qwen2.5 weights on release day, and gpt-oss shipped in MXFP4 with no higher-precision version of those expert weights at all.

Weights are only half the footprint. The KV cache, the per-token state attention keeps for context, grows linearly with sequence length and can dwarf the weights at long context. Its size is roughly two (keys and values) times layers times KV heads times head dimension times sequence length times bytes-per-element. Quantizing it to 4-bit cuts that by 4x relative to a 16-bit cache, which is why KV-cache quantization methods like KVQuant and KIVI exist as a separate discipline from weight quantization. The newest entries are data-free and rotation-based: Google's TurboQuant randomly rotates each vector so its coordinates concentrate into a predictable distribution, then applies an optimal per-coordinate quantizer, staying quality-neutral around 3.5 bits per channel with no calibration data and a provable near-optimality guarantee. Whichever method you pick, long-context serving lives or dies on getting the cache precision right.
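
The cache formula is easy to run for a concrete shape. A minimal sketch, assuming a hypothetical configuration picked for round numbers rather than taken from a real model, and ignoring the small overhead of per-group scales that real KV-cache quantizers store:

```python
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bits):
    """KV-cache size for one sequence, in GiB."""
    elements = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys and values
    return elements * bits / 8 / 2**30

# Hypothetical configuration, chosen for illustration only.
cfg = dict(layers=32, kv_heads=8, head_dim=128, seq_len=131_072)

for bits in (16, 8, 4):
    print(f"{bits:2d}-bit cache: {kv_cache_gib(**cfg, bits=bits):5.1f} GiB per sequence")
```

The result is per sequence, so multiply by the number of concurrent requests to get the serving footprint.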

The outlier problem, which is the whole problem

If you understand one mechanism in quantization, make it this one. Low-bit quantization fails on activations because of outlier features, and nearly every serious technique is a way of coping with them.

Quantizing a tensor to N bits means mapping its values onto a small grid of levels. The grid spans the tensor's dynamic range, so a single value far from the rest stretches the range and forces every other value into a handful of levels near zero. The information in the bulk of the distribution gets crushed. In transformers this is not hypothetical. Dettmers et al. (2022) documented emergent outlier features: starting around the 6.7B-parameter scale, a small number of feature dimensions develop activation magnitudes far larger than the rest, they are systematic rather than random, and zeroing them collapses model quality. Those outliers are exactly what makes naive activation quantization destructive.
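
The crushing effect is easy to reproduce on toy data. A minimal sketch, assuming plain symmetric absmax quantization to signed 4-bit integers and synthetic values; it is an illustration of the mechanism, not any particular library's quantizer:

```python
def absmax_quantize(values, bits):
    """Symmetric absmax quantization: scale so the largest magnitude hits the top level."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

# 64 synthetic "bulk" activations spread evenly over [-1, 1].
bulk = [-1.0 + 2.0 * i / 63 for i in range(64)]

codes_clean, _ = absmax_quantize(bulk, bits=4)
codes_outlier, _ = absmax_quantize(bulk + [60.0], bits=4)  # add one synthetic outlier

levels_clean = len(set(codes_clean))
levels_with_outlier = len(set(codes_outlier[:-1]))  # levels used by the bulk only

print("levels used by the bulk, no outlier:  ", levels_clean)
print("levels used by the bulk, with outlier:", levels_with_outlier)
```

Without the outlier the bulk uses all 15 signed levels; with it, every bulk value rounds to the same level and the information in the bulk is gone.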

Weights mostly do not have this problem, which is why weight-only quantization is easy and weight-plus-activation quantization is hard. The progression of techniques is a progression of outlier defenses:

  • GPTQ quantizes weights column by column, using second-order (Hessian) information to adjust the remaining unquantized weights so they compensate for the error introduced so far. Weight-only, near-lossless at 4-bit for many models.
  • AWQ (activation-aware weight quantization) notices that a small fraction of weight channels matter most, identified by activation statistics, and scales them to protect them. Still weight-only, but it uses activation information to decide what to protect.
  • SmoothQuant migrates the difficulty: it mathematically shifts the outlier scale from activations into weights, where it is easier to handle, by a per-channel rescaling that cancels out across the matmul. This is what first made 8-bit activations practical.
  • Rotation methods are the current answer for 4-bit activations. The insight is that an orthogonal rotation of the hidden state leaves the model's output unchanged but spreads any single outlier across all dimensions, flattening the distribution so it quantizes cleanly. QuaRot uses fixed randomized Hadamard matrices to do this and quantizes weights, activations, and KV cache all to 4-bit. SpinQuant replaces the fixed rotation with learned rotation matrices optimized for the specific model. The same rotation insight now has a data-free, provably near-optimal form in Google's TurboQuant (ICLR 2026), though it targets KV-cache and vector quantization rather than full W4A4 inference. When these methods were introduced on Llama-2 7B, the standard open baseline at the time, SpinQuant narrowed the W4A4KV4 gap to full precision to 2.9 points of average accuracy, beating SmoothQuant by 25 points and quantization-aware LLM-QAT by 19. The pattern holds on current models: a 2026 empirical study of Qwen3 quantization found rotation methods of the SpinQuant family to be the best W4A4 approach, while also confirming the catch, that even the best of them degrade sharply in the most aggressive regimes.
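
The rotation idea in the last bullet can be checked in a few lines. A minimal sketch, assuming a plain normalized Hadamard matrix built by the Sylvester construction (QuaRot's randomization and SpinQuant's learned rotations are omitted) and synthetic activations and weights:

```python
import math

def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction); n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def crest(v):
    """Peak magnitude divided by RMS: large when one coordinate dominates."""
    rms = math.sqrt(sum(a * a for a in v) / len(v))
    return max(abs(a) for a in v) / rms

x = [0.5, -0.3, 0.2, 40.0, -0.1, 0.4, -0.2, 0.3]   # synthetic activations, one outlier
w = [0.1, -0.2, 0.3, 0.05, -0.4, 0.2, 0.1, -0.3]   # synthetic weight column

H = hadamard(8)
x_rot, w_rot = matvec(H, x), matvec(H, w)           # rotate both sides of the product

dot = sum(a * b for a, b in zip(x, w))
dot_rot = sum(a * b for a, b in zip(x_rot, w_rot))

print(f"crest factor before: {crest(x):.2f}, after: {crest(x_rot):.2f}")
print(f"output before: {dot:.6f}, after: {dot_rot:.6f}")
```

Because the matrix is orthogonal, the dot product is unchanged up to floating-point error, while the outlier's energy is shared across all eight coordinates, so the rotated vector has a much smaller peak relative to its RMS and wastes far fewer quantization levels.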

That is the honest headline for 4-bit-everything inference today: a few points off full precision, achievable, not free, and only with a rotation step that a naive "cast to int4" pipeline skips. Anyone reporting lossless W4A4 without a rotation or its equivalent is reporting perplexity on a calibration set, not behavior on the tail.

FP8 training: the art is what you keep in high precision

Training in low precision is harder than serving in it, because gradients and optimizer states have wide dynamic range and the errors compound over hundreds of thousands of steps. DeepSeek-V3 is the reference example of doing it at scale, and the lesson from its recipe is the opposite of "quantize everything."

The framework is mixed precision with fine-grained scaling. Most compute-dense operations, the large matmuls in the feed-forward and attention projections, run in FP8 (the E4M3 variant, four exponent bits and three mantissa bits). But the precision is not applied bluntly:

  • Block-wise and tile-wise scaling. Weights are quantized in 128×128 blocks and activations in finer 1×128 tiles, each with its own scaling factor, so a local outlier only stretches the range of its own block rather than the whole tensor. The goal is the one SmoothQuant pursues, containing outliers, but here it is reached at training time through fine granularity rather than by migrating scale into the weights.
  • High-precision accumulation. FP8 matmuls accumulate into a higher-precision register rather than summing in FP8, because the accumulation is where 8-bit rounding error would otherwise pile up.
  • Selective high precision. The numerically sensitive modules stay in higher precision: the embedding, the output head, the MoE gating that decides expert routing, the normalization layers, and the attention operators. These are small in compute but large in their effect on stability, so quantizing them buys little and risks divergence.
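  • Optimizer states in BF16. The Adam moment estimates that drive every weight update are kept in BF16, not FP8, and the master weights and accumulated gradients stay in FP32. The FP8 tensors are the matmul inputs: weights, activations, and gradients cast down for the compute. The optimizer's memory of the trajectory is not among them.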
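
The block-wise scaling bullet is the easiest part of this recipe to see numerically. A minimal sketch, assuming a software simulation of E4M3 rounding (saturating at 448, with subnormals), 1-D blocks of 128 values standing in for the real tiles, and synthetic data; `round_e4m3` and `fake_quant` are illustrative helpers, not DeepSeek's kernels:

```python
import math
import random

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def round_e4m3(x):
    """Round a float to the nearest E4M3 value (saturating). A simulation, not a hardware cast."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)          # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # clamp to the minimum normal exponent (subnormals below it)
    step = 2.0 ** (exp - 3)       # three mantissa bits: 8 steps per binade
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)

def fake_quant(values, block):
    """Quantize-dequantize with one absmax scale per `block` consecutive values."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = (max(abs(v) for v in chunk) / E4M3_MAX) or 1.0  # avoid a zero scale
        out += [round_e4m3(v / scale) * scale for v in chunk]
    return out

rng = random.Random(0)
x = [rng.gauss(0.0, 0.005) for _ in range(256)]   # synthetic small-magnitude values
x[200] = 1000.0                                   # one outlier, in the second 128-block

def mean_abs_err(deq, lo, hi):
    return sum(abs(a - b) for a, b in zip(x[lo:hi], deq[lo:hi])) / (hi - lo)

per_tensor = fake_quant(x, block=len(x))   # one scale for everything
per_block = fake_quant(x, block=128)       # one scale per 128 values

print("error on the outlier-free block, per-tensor scale:", mean_abs_err(per_tensor, 0, 128))
print("error on the outlier-free block, per-block scale: ", mean_abs_err(per_block, 0, 128))
```

With a single scale, the outlier pushes the small values down into the subnormal range of the format, where they lose most of their precision. With per-block scales, the damage stays inside the block that contains the outlier.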

The takeaway generalizes past DeepSeek. Low-precision training is not a single switch. It is a careful partition of the model into "robust enough for FP8" and "leave it alone," and getting that partition wrong is how training runs diverge at step 200,000 with no warning. The payoff is real: FP8 roughly halves the memory of the tensors it covers and roughly doubles peak matmul throughput versus BF16, which is a large fraction of why frontier-scale training got cheaper in 2025.
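
The high-precision accumulation point from the recipe is worth seeing in isolation. A minimal sketch, using IEEE half precision as a stand-in for a narrow accumulator (the Python standard library has no FP8 type, so this is an analogy for the rounding behavior, not a model of any GPU's accumulator):

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back (stdlib only)."""
    return struct.unpack("e", struct.pack("e", x))[0]

N = 4096
term = to_fp16(0.01)            # each addend is exactly representable in half precision

low = 0.0
for _ in range(N):
    low = to_fp16(low + term)   # accumulator rounded to half precision after every add

high = 0.0
for _ in range(N):
    high += term                # accumulator kept in double precision

print("low-precision accumulator: ", low)
print("high-precision accumulator:", high)
```

The narrow accumulator stops growing once the gap between adjacent representable values exceeds twice the addend, so later terms are silently dropped. That is the failure the wide accumulator exists to prevent, and it gets worse as the reduction dimension of the matmul grows.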

FP4 training: NVFP4, MXFP4, and the format that wins

The frontier has already moved past FP8. NVIDIA's Blackwell generation (the GB200 and GB300 systems) added native hardware support for 4-bit floating point, and two formats are competing to own training: MXFP4 and NVFP4. The difference between them is a clean illustration of why the details matter.

Both are microscaling formats: instead of one scale per tensor, they attach a scale to small blocks of values, which is the fine-grained-scaling idea from the FP8 recipe pushed further. The formats differ in block size and scale type. MXFP4, the Open Compute Project standard that gpt-oss ships in, uses 32-element blocks with a power-of-two (E8M0) scale. NVFP4 uses a smaller 16-element block with a higher-resolution FP8 (E4M3) per-block scale, plus a per-tensor FP32 scale on top. The smaller block and richer scale let NVFP4 track local dynamic range more faithfully, which matters enormously at 4 bits where there is almost no mantissa to spare.
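
The storage cost of each format follows from the block size and the scale width, and the 4-bit element grid itself is small enough to enumerate. A minimal sketch, assuming the sign, exponent, mantissa bit order for the E2M1 code; the helper names are ours:

```python
def effective_bits(element_bits, block_size, scale_bits):
    """Storage per value once the per-block scale is amortized over the block."""
    return element_bits + scale_bits / block_size

def decode_e2m1(code):
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                      # subnormal: no implicit leading one
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

print("MXFP4:", effective_bits(4, block_size=32, scale_bits=8), "bits per value")
print("NVFP4:", effective_bits(4, block_size=16, scale_bits=8), "bits per value, before the per-tensor scale")
print("E2M1 magnitudes:", [decode_e2m1(c) for c in range(8)])
```

NVFP4 pays a quarter of a bit more per value than MXFP4 for its finer blocks. With only eight magnitudes available per element, how well the block scale fits the local range decides most of the accuracy.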

The consequence shows up directly in pretraining loss. In NVIDIA's NVFP4 pretraining study (2025), NVFP4 reached a final loss about 1.5% above the BF16 reference, while MXFP4 landed around 2.5% above. That gap is not cosmetic: to match NVFP4's final loss, the MXFP4 run needed 36% more training tokens. At frontier scale, 36% more tokens is tens of millions of dollars, so the format choice is a budget decision, not a numerical footnote. On the hardware side NVFP4 matmuls run roughly 4x faster than BF16 on GB200 and up to 6x on GB300, at about half the memory of FP8. The direction is set: FP4 training is where FP8 was two years ago, moving from research result to production recipe, with the format war decided largely by who captures local dynamic range with the least overhead.

Below four bits: the ternary edge

If 4-bit is the production frontier, the research edge is below it, and the most interesting point is ternary. Microsoft's BitNet b1.58 trains weights restricted to three values, -1, 0, and +1, which is about 1.58 bits each (log2 of 3). The radical consequence is not just the memory, it is the arithmetic: a weight that can only be -1, 0, or +1 turns the core matmul from multiply-accumulate into add-subtract-skip, removing the multiplications that dominate inference energy. The BitNet b1.58 2B4T release, a 2-billion-parameter ternary model trained on 4 trillion tokens, showed that a from-scratch ternary model can land competitively with full-precision 2B baselines on standard benchmarks, which a year earlier would have been assumed impossible.
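
The add-subtract-skip arithmetic is short enough to write out. A minimal sketch, assuming the absmean rounding rule as we understand it from the BitNet b1.58 description (scale by the mean absolute weight, round, clip to -1, 0, +1) and synthetic values; the real model applies this during training, which this snippet does not attempt:

```python
def absmean_ternary(weights, eps=1e-8):
    """Scale by the mean absolute weight, then round and clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    return [max(-1, min(1, round(w / (gamma + eps)))) for w in weights], gamma

def ternary_dot(ternary, activations):
    """Dot product with no multiplications: add, subtract, or skip."""
    acc = 0.0
    for t, a in zip(ternary, activations):
        if t == 1:
            acc += a
        elif t == -1:
            acc -= a
    return acc

w = [0.9, -0.05, -1.2, 0.4]      # synthetic full-precision weights
x = [2.0, 5.0, 1.0, -3.0]        # synthetic activations

q, gamma = absmean_ternary(w)
print("ternary weights:", q)
print("add/subtract result:", ternary_dot(q, x))
print("multiply-accumulate check:", sum(t * a for t, a in zip(q, x)))
```

The two results agree, and the first one never multiplies. In a full layer the output would be rescaled once by `gamma`, a single multiplication per output rather than one per weight.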

The catch, and the reason ternary is an edge rather than a default, is that it only works trained from scratch. You cannot take a finished BF16 model and round it to ternary without destroying it, because the model was never trained to tolerate that quantization. This is the deeper pattern across the whole low-bit story: precision is cheapest to remove when the model is built expecting its absence, and most expensive to remove after the fact. Post-training quantization is bounded by what the original training left robust; native low-precision training sets that robustness as a design parameter. The frontier is moving toward the second because it has a higher ceiling, and ternary is the clearest demonstration of how high that ceiling might go.

Where it breaks: the part perplexity hides

Now the reality gap, because everything above is the optimistic framing and the optimistic framing is measured on the wrong metric. Quantization papers lead with perplexity and average benchmark deltas, and those averages systematically understate the damage.

The tail degrades faster than the average. Quantization error is roughly uniform noise added to the model's computation, and uniform noise hurts the hardest examples most, because those are the ones decided by the thinnest margins. A model that loses half a point of average accuracy can lose several points on the specific slice that mattered: the long-tail facts, the rare languages, the multi-step reasoning chains where a small per-step error compounds. Perplexity, averaged over a calibration corpus, is nearly blind to this. The lesson from the evaluation-crisis piece applies directly here: a single aggregate number is not a result, and "lossless quantization" claimed on perplexity is exactly the kind of conflated metric that hides which part of the distribution broke.

Reasoning and long context are the fragile cases. Two regimes stress quantization hardest. Long chains of reasoning accumulate per-token error, so a quantized model that matches the original on short answers can drift on a 50-step derivation. And long context leans on the KV cache, which is the tensor you most want to quantize for memory reasons and the one where error accumulates across thousands of cached tokens. If your workload is short factual queries, 4-bit is far safer than if it is agentic multi-step reasoning over a 200K-token context, and a benchmark dominated by the former will tell you nothing about the latter.

Calibration sets overfit. PTQ methods choose their quantization parameters using a small calibration set. Choose it from a narrow distribution and the quantized model is tuned to that distribution, looking great on in-distribution evals and degrading on everything else. The calibration set is a tiny training set with all the contamination and representativeness problems that implies. Even its size is a knob you can get wrong: the Qwen3 quantization study found GPTQ on Qwen3-8B stayed stable with 128 or more calibration samples but degraded noticeably below 64.

The honest baseline is iso-memory, not iso-parameter. The comparison that actually informs a decision is not "the model at FP16 versus the same model at 4-bit." It is "given this memory budget, a large model quantized or a smaller model at full precision." The evidence that a large model holds up under 4-bit is strong: a January 2026 benchmark of weight-only 4-bit on Qwen2.5-32B-Instruct kept every common format within about 6% of full-precision perplexity, and AWQ lost only around four points on HumanEval. The arithmetic does the rest. A 4-bit Qwen2.5-32B occupies roughly the memory of an FP16 model a quarter its size, near an 8B, while retaining far more capability than an 8B has to begin with. The right question is never "does quantization hurt?" It always hurts a little. The question is "does a bigger quantized model beat a smaller precise one for the same memory?", and the answer is usually yes, which is the entire reason the technique matters.
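
The iso-memory comparison is the weight-memory formula solved for parameters. A minimal sketch, assuming a hypothetical 16 GB weight budget and ideal bit widths (real 4-bit formats carry scale overhead, so the 4-bit figure is somewhat optimistic):

```python
def params_that_fit(budget_bytes, bits_per_param):
    """Largest parameter count whose weights fit in the budget (weights only)."""
    return budget_bytes * 8 / bits_per_param

budget = 16e9  # hypothetical 16 GB budget for weights

print(f"16-bit: {params_that_fit(budget, 16) / 1e9:.0f}B parameters")
print(f" 4-bit: {params_that_fit(budget, 4) / 1e9:.0f}B parameters")
```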

How to decide

The grid from the start of this piece is also the decision procedure.

For inference on a model you did not train, start with weight-only 4-bit (GPTQ or AWQ, or a GGUF Q4_K_M build for llama.cpp). It is the near-free win and covers most local-deployment cases. Reach for W4A4 only when activation memory or compute is the actual bottleneck, and when you do, use a rotation-based method rather than naive casting, and validate on your real task distribution, not perplexity. Quantize the KV cache when context length is the constraint, and test it specifically on your longest contexts where its error accumulates.

For training, FP8 with fine-grained scaling and selective high-precision modules is now a reasonable default at scale, and FP4 with NVFP4 is the emerging frontier if you have Blackwell-class hardware and can absorb the integration cost. In both cases the discipline is the same: partition the model into low-precision and protected components deliberately, keep the optimizer state higher than the weights, and accumulate in high precision.

For everyone, measure the tail and the task, not the average and the perplexity. Quantization is the clearest case in the whole inference stack where the demo number and the production number diverge, and the gap lives entirely in the part of the distribution that aggregate metrics throw away. This is the same architectural pressure that pushes designs like DeepSeek's compressed attention toward heavy KV-cache compression: the memory wins are enormous and the cost is paid in a tail you have to measure deliberately or you will not see it at all.

Key Takeaways

  1. "4-bit" is two questions, not one. Separate what you quantize (weights, activations, KV cache, gradients, optimizer state) from when (post-training, quantization-aware, or native low-precision training). Weight-only 4-bit PTQ is near-free; W4A4 PTQ is a research-grade problem; FP8 training is production; FP4 training is the frontier.
  2. The outlier problem is the whole problem. Emergent outlier features (Dettmers et al., from ~6.7B scale) stretch the dynamic range and crush activation quantization. Nearly every serious technique, from SmoothQuant to QuaRot and SpinQuant, is an outlier defense.
  3. Rotations are how 4-bit activations work today. An orthogonal rotation leaves the output unchanged but spreads outliers across dimensions. Rotation methods of the SpinQuant family are the best W4A4 approach on current models like Qwen3, bringing the gap to full precision to a few points; without a rotation step, naive int4 activations fall apart.
  4. FP8 training is selective, not uniform. DeepSeek-V3 runs the big matmuls in FP8 with 128×128 block-wise scaling and high-precision accumulation, but keeps embeddings, the output head, MoE gating, norms, attention, and the optimizer state in higher precision. The skill is the partition.
  5. FP4 training has a format winner so far. NVFP4's smaller 16-element blocks and FP8 per-block scale track local range better than MXFP4's 32-element power-of-two blocks: ~1.5% versus ~2.5% loss above BF16, and MXFP4 needed 36% more tokens to match. At frontier scale that gap is a budget line.
  6. Models now ship natively low-bit. gpt-oss-120b released directly in MXFP4 (4.25 bits per MoE parameter, covering more than 90% of the parameters), fitting on a single 80GB GPU. Quantization is no longer only a post-hoc step.
  7. Perplexity hides the damage. Quantization noise hurts the tail, long reasoning chains, and long-context KV cache far more than the average. Validate on the real task distribution and the hardest slices, not a calibration-set perplexity.
  8. Compare iso-memory, not iso-parameter. The decision is "bigger quantized model versus smaller precise one for the same memory." Weight-only 4-bit on Qwen2.5-32B stays within ~6% of full-precision perplexity, so a 4-bit 32B in the memory of an FP16 8B is the better buy. That is the reason quantization is worth the trouble.
