DeepSeek V4 and the Hybrid Attention Bet
Inside DeepSeek V4: hybrid attention (CSA + HCA), 1.6T MoE, 1M context, and the lineage from MLA to NSA to DSA that made it possible.

On April 24, 2026, DeepSeek released V4-Pro and V4-Flash under MIT, and the headline number is not the parameter count. It is the cost. At a 1M-token context, V4-Pro runs single-token inference at 27% of the FLOPs of V3.2 and holds just 10% of its KV cache. The attention budget that long-context inference has been bleeding for two years just got cut by an order of magnitude, and the recipe is open.
The recipe is what this post is about. DeepSeek V4 hybrid attention is not a single trick. It is two attention variants, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), interleaved through a 1.6T-parameter Mixture-of-Experts backbone, sitting on a residual stack reinforced by manifold-constrained hyper-connections, trained with the Muon optimizer. Each piece individually is a measured step. Together, and only together, they reset the price of long context.
Why V4 matters more than the benchmark sheet
The benchmark line for V4-Pro is roughly what you would expect: 1.6T total parameters, 49B active per token, 1M context window, three reasoning modes (Non-think, Think High, Think Max), competitive on the usual reasoning and code suites. Strong, but not the story. The story is that the cost curve for long-context attention bent.
For two years, the long-context conversation has been dominated by tradeoffs. Dense attention scales quadratically with sequence length and explodes in KV memory. Sliding-window attention cheats on the long range. Pure sparse attention loses recall on the tail. The architectural research that mattered (MLA, NSA, DSA, latent attention variants from labs across the field) has been a steady walk down the cost curve, not a leap.
V4 is a leap, and it is a leap because DeepSeek stopped picking one strategy. CSA gives you precise sparse retrieval over a moderately compressed cache. HCA gives you a cheap, dense, global view of an aggressively compressed cache. The model gets both at every depth and learns to route between them. That is the bet, and the numbers say it paid out.
If you have not read DeepSeek's earlier architectural lineage, the V3-era piece is the right primer for the MLA and MoE foundations V4 inherits.
The hybrid attention design
What CSA actually does
Compressed Sparse Attention is the precise side of the hybrid. It runs in two phases.
Phase 1: compress along the sequence dimension. V4 applies a compression rate of 4. Every four token positions in the KV cache are consolidated into a single compressed entry. This is not a sliding-window summary or a projection. It is a learned compression that preserves the geometry needed for a content-based lookup.
Phase 2: sparsify with a lightning indexer. Over the compressed cache, V4 applies the DeepSeek Sparse Attention pattern: a small, fast indexer scores every compressed entry against the current query, and only the top-k entries are passed into the full attention computation. V4-Pro takes the top 1,024. V4-Flash takes the top 512. Plus a fixed sliding window of 128 tokens for unconditional local recall, so nothing nearby ever falls off the edge.
Two compounding wins, one principle. Compression cuts the cache that you must score against. Sparsity cuts the entries you actually attend to. The result is a layer that can do precise content-based retrieval over a million tokens of context with a working set in the low thousands.
A simplified pseudocode sketch:
def csa_layer(query, kv_cache):
    # Phase 1: compress KV along the sequence dimension (rate 4)
    compressed_kv = sequence_compress(kv_cache, rate=4)
    # Phase 2: lightning indexer scores every compressed entry
    scores = lightning_indexer(query, compressed_kv.keys)
    top_k_idx = topk(scores, k=1024)  # 512 for V4-Flash
    selected = compressed_kv[top_k_idx]
    # Fixed local sliding window for guaranteed recall of recent tokens
    local_window = kv_cache[-128:]
    # Full attention runs only over the selected sparse entries + local window
    return attention(query, concat(selected, local_window))

This is descriptive, not the released kernel. The released kernels exist, and DeepSeek has shipped them. V3.2's DSA kernels delivered up to 640 TFLOPS during prefill and 410 during decode on the same family of patterns.
What HCA actually does
Heavily Compressed Attention is the broad side. Where CSA is precise, HCA is global.
HCA applies a compression rate of 128. Sets of 128 tokens collapse into a single compressed KV entry. That is aggressive: most of the per-token detail is gone. But what remains is small enough that HCA can afford to run dense attention over it. Every query attends to every compressed entry. No top-k, no indexer, no sparsity gate.
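A matching sketch in the same style as the CSA pseudocode above; sequence_compress and attention are the same illustrative placeholders, and the compression rate is the only number taken from the published description:

def hca_layer(query, kv_cache):
    # Aggressive compression: every 128 token positions collapse into one entry
    compressed_kv = sequence_compress(kv_cache, rate=128)
    # No indexer, no top-k, no sparsity gate: dense attention over the
    # entire compressed cache, which is now small enough to afford it
    return attention(query, compressed_kv)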
The mental model: CSA is search over the long tail with high precision and low recall. HCA is a compressed map of the entire context with low precision and high recall. CSA can find the specific clause in the contract. HCA can sense that the contract has shifted topic three pages ago.
Neither alone is enough. CSA without HCA loses the global signal that distant tokens carry collectively. HCA without CSA loses the ability to retrieve a specific detail at distance. The architecture interleaves them through the depth of the network, so every layer has access to both views.
Why interleaving matters
It would have been easier to pick one. The reason V4 ships both, alternating through the stack, comes down to what attention layers actually do as you go deeper.
Early layers benefit from broad signal; they are still establishing what the input is about. Late layers benefit from precise lookup; they are committing to specific predictions. But this gradient is not clean enough to assign HCA to early layers and CSA to late ones. Real workloads mix retrieval and abstraction throughout. By interleaving, V4 lets the model route information through whichever attention variant fits the layer's job at that depth, learned end-to-end.
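As a sketch of what interleaving means structurally, assuming a simple even/odd alternation (the post says the two variants are interleaved through the depth of the stack, not in what exact pattern):

def hybrid_attention_stack(x, kv_caches):
    # Illustrative strict alternation: which variant a layer uses is fixed by
    # the architecture; what information flows through each is learned end-to-end.
    for i, kv in enumerate(kv_caches):
        attn = csa_layer if i % 2 == 0 else hca_layer
        x = x + attn(x, kv)  # plain residual add here; mHC, covered below, refines this step
    return x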
This is the bet that distinguishes V4 from the dozen prior papers that proposed one or the other. It is also a bet that only pays off if the optimizer can actually learn the routing. Which is where the rest of the architecture comes in.
Supporting architecture
Manifold-Constrained Hyper-Connections
DeepSeek V4 introduces Manifold-Constrained Hyper-Connections (mHC), a residual-connection variant designed to preserve signal fidelity across a deeper, hybrid stack. Conventional residual streams are simple addition: each block adds its delta to the running representation. With two attention variants competing for that stream, plus MoE blocks downstream, signals can drift or collapse in ways that hurt training stability.
mHC strengthens the residual path with constrained projections, preserving the geometric properties of the running representation while still letting blocks contribute meaningful updates. The published material is light on the exact constraint, but the empirical claim is consistent: better signal propagation, better training stability, no loss of expressivity.
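Because the exact constraint is not public, nothing concrete can be reproduced here. Purely as a toy illustration of what a "constrained projection" on the residual stream could mean, the sketch below rescales the updated stream back to its pre-update norm; this is an assumption for illustration, not DeepSeek's mHC:

import numpy as np

def constrained_residual_update(stream, block_output, alpha=0.1):
    # Toy stand-in for mHC, not the published method: mix the block's delta
    # into the residual stream, then project back onto the sphere of the
    # stream's original norm so the representation's scale cannot drift.
    target_norm = np.linalg.norm(stream)
    mixed = stream + alpha * block_output
    return mixed * (target_norm / (np.linalg.norm(mixed) + 1e-7))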
This is the kind of innovation that does not show up in benchmark numbers but quietly enables the rest of the architecture to work. Hybrid attention plus MoE plus 1M context is a lot of competing pressure on a residual stream. mHC is the piece that keeps it coherent.
The Muon optimizer
V4 is trained with Muon, not AdamW. Muon's core operation is to apply Newton-Schulz iterations to approximately orthogonalize the gradient update matrix before applying it as a weight update. The intuition: gradient updates that are closer to orthogonal are better-conditioned, and the optimizer spends less of its budget on updates that effectively cancel themselves out.
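A minimal sketch of the classic cubic Newton-Schulz iteration behind that intuition; production Muon implementations use a tuned higher-order polynomial and run on GPU tensors, but the shape of the computation is the same:

import numpy as np

def newton_schulz_orthogonalize(grad, steps=5, eps=1e-7):
    # Normalize so the singular values land inside the convergence region,
    # then iterate: each step pushes every singular value toward 1, driving
    # the matrix toward the nearest (semi-)orthogonal matrix.
    x = grad / (np.linalg.norm(grad) + eps)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

# The orthogonalized matrix, not the raw gradient, becomes the weight update.
update = newton_schulz_orthogonalize(np.random.randn(512, 256))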
Reported benefits in V4's setting are faster convergence and greater training stability versus AdamW at the same scale. Muon has been gathering momentum (no pun intended) across labs working at the trillion-parameter scale, where AdamW's per-parameter state becomes a meaningful infrastructure cost and its conditioning behavior on enormous matrices becomes less ideal.
This matters for the attention story because Muon's stability properties are part of what makes hybrid attention learnable end-to-end. A less stable optimizer pushes you toward simpler architectures.
The KV cache numbers
The headline efficiency claim is worth grounding in a concrete picture. At a 1M-token context:
- V3.2 baseline: full MLA-compressed KV cache. Already a sharp reduction over standard MHA, but linear in sequence length.
- V4-Pro: 10% of V3.2's KV cache. Single-token inference at 27% of V3.2's FLOPs.
The 10% number comes primarily from HCA's 128x compression on the layers where it runs, combined with CSA's working-set bound on the sparse layers. The 27% FLOPs figure reflects that even with two attention variants, the compute is dominated by what you actually attend to, and CSA's top-k sparsity caps that at a few thousand entries no matter how long the context grows.
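A back-of-envelope view of the per-layer working sets at 1M tokens, using only the figures quoted above; layer counts and head dimensions are deliberately left out, so this is about entry counts, not bytes:

# Per-layer working sets at a 1M-token context, from the published figures.
CONTEXT = 1_000_000

hca_entries = CONTEXT // 128     # dense attention over ~7,812 compressed entries
csa_stored = CONTEXT // 4        # 250,000 compressed entries held in cache
csa_attended = 1024 + 128        # top-k (V4-Pro) + fixed local window

print(f"HCA entries attended per layer: {hca_entries:,}")
print(f"CSA entries stored per layer:   {csa_stored:,}")
print(f"CSA entries attended per layer: {csa_attended:,}")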
For deployment math, this is the difference between long context being a premium feature and long context being the default. A serving pool that could hold N concurrent 200K-token sessions on V3.2 hardware can hold roughly 10N at 1M tokens on V4-Pro, modulo MoE expert sharding and other factors. The relevant primer for the surrounding inference stack is LLM inference optimization.
DeepSeek's pattern of attention contributions
It is worth pausing to name what DeepSeek has been doing in this area, because the V4 launch is not an isolated event. It is the latest checkpoint in a multi-year run of attention-mechanism research from one lab, shipped open, with kernels.
The trail goes back to Multi-Head Latent Attention (MLA) in DeepSeek V2, a low-rank compression of keys and values that pairs cleanly with KV caching and became the foundation that V3, V3.1, and R1 inherited. Then Native Sparse Attention (NSA), published in early 2025, won a Best Paper award at ACL 2025 for its hardware-aligned, natively trainable sparse attention design. Then DeepSeek Sparse Attention (DSA) in V3.2, which instantiated the NSA ideas on top of MLA and shipped the production kernels (up to 640 TFLOPS prefill, 410 TFLOPS decode). And now V4's hybrid CSA + HCA, which combines the compression line with the sparsity line in a single architecture.
What is uncommon here is not just the cadence. It is that every step has been published with weights, and most have shipped with the inference kernels. FlashMLA went up on GitHub. The DSA kernels shipped with V3.2. The V4 weights are MIT. This is the rare case of a frontier-tier lab whose release pattern actually lets the rest of the field build on the work, not just read about it. The attention literature in 2026 looks meaningfully different because of it, and a lot of the open-weights ecosystem inherits these designs directly. (See the open-source LLM power shift for the broader market context.)
If your mental model of "open contributions" is fine-tuning checkpoints and inference engines, DeepSeek's cadence is a reminder that the most consequential open contributions can be primary architectural research with the kernels attached.
How V4 fits the MoE story
The MoE backbone (1.6T total, 49B active per token) is by now standard DeepSeek shape, refined from V3's 671B/37B configuration. The post-training pipeline is two-stage: independent domain-expert cultivation via SFT plus GRPO on each domain, followed by unified consolidation through on-policy distillation back into a single model.
This is worth flagging because it is a different shape than the typical "SFT then RLHF on a unified model" pipeline. Domain-expert cultivation lets each domain develop independently with its own reward signal, then distillation consolidates the capabilities without the interference that joint training imposes. The reasoning modes (Non-think, Think High, Think Max) are surfaced through the same model: a single set of weights, three inference-time policies for how much reasoning compute to spend.
If you want the pure-MoE foundations, Mixture of Experts demystified is the right starting point. The relevant point for V4 is that MoE and hybrid attention compose well: sparse experts handle the FFN compute, sparse attention handles the attention compute, and the only dense components left are the residual stream and the routers themselves.
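A last sketch to make that composition concrete; the top-k of 8 and the router/expert stubs are illustrative placeholders in the style of the pseudocode above, not V4's actual configuration:

def hybrid_moe_block(layer_idx, x, kv, experts, router, k=8):
    # Sparse attention handles the attention compute at this depth.
    attn = csa_layer if layer_idx % 2 == 0 else hca_layer
    x = x + attn(x, kv)
    # Sparse experts handle the FFN compute: the small dense router scores
    # all experts, but only the top-k selected ones actually run.
    weights, idx = topk(router(x), k=k)
    x = x + sum(w * experts[i](x) for w, i in zip(weights, idx))
    return x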
Where the bet could break
Production reality: a few honest concerns to weigh against the published numbers.
Routing failure modes under hybrid attention. Two attention variants per layer means more learnable structure, which means more failure surface. If CSA's lightning indexer mis-scores during a domain shift, the layer's effective recall collapses. If HCA's compressed entries lose discriminative power on a long, repetitive context, the global view becomes noise. The published evals don't expose these failure modes, and they may take time to surface in production traffic.
Kernel availability outside the DeepSeek stack. V3.2's DSA kernels are public. V4's CSA + HCA kernels will follow the same pattern based on DeepSeek's prior cadence, but at the moment of writing not every inference stack has parity. Expect a window where running V4 outside DeepSeek's reference path leaves performance on the table.
Long-context evals are immature. A 1M-token context is easy to claim and hard to evaluate. Needle-in-a-haystack tests are saturated. Real long-context workloads (codebases, contracts, multi-document research) vary enormously, and per the broader argument in RAG 2026, conflated evaluation is how regressions hide. The honest read on V4's long-context quality will take a few months and a few independent benchmarks.
None of these defeat the architecture. They are the gap between "the paper number works" and "the deployment number works," which is the gap that matters for any model.
Key Takeaways
- V4-Pro cuts long-context inference cost by an order of magnitude at the 1M-token point: 10% of V3.2's KV cache, 27% of its single-token FLOPs.
- The hybrid attention design uses two variants per layer. CSA does precise sparse retrieval over a 4x-compressed cache (top-1024 for Pro, top-512 for Flash, plus 128-token local window). HCA does dense attention over a 128x-compressed cache for global signal.
- Interleaving CSA and HCA through depth is the central architectural bet. It lets the model route information through whichever attention variant fits each layer's job, learned end-to-end.
- Manifold-Constrained Hyper-Connections (mHC) are the residual-path variant that holds the hybrid stack together, preserving signal geometry across competing attention variants and MoE blocks.
- V4 is trained with the Muon optimizer, not AdamW. Newton-Schulz orthogonalization of gradient updates gives faster convergence and better stability at trillion-parameter scale.
- The MoE shape is 1.6T total / 49B active, with a two-stage post-training pipeline of domain-expert cultivation followed by on-policy distillation into a single model with three inference-time reasoning modes.
- DeepSeek's attention lineage (MLA → NSA → DSA → CSA + HCA) is one of the most consistent open contribution streams in frontier model research, with weights and kernels included, not just papers.
- The honest unknowns are routing failure modes under domain shift, kernel parity outside DeepSeek's stack, and the immaturity of long-context evals. None defeat the architecture, but they are the gap that production will close, not the paper.