
DeepSeek V4 and the Hybrid Attention Bet
Inside DeepSeek V4: hybrid attention (CSA + HCA), 1.6T MoE, 1M context, and the lineage from MLA to NSA to DSA that made it possible.
How transformers decide what to focus on: from the original scaled dot-product attention to multi-head attention, grouped-query attention, and multi-head latent attention, the mechanism at the heart of every modern AI system.

Master the transformer architecture from first principles: self-attention, multi-head attention, positional encodings, encoder-decoder design, and modern innovations like RoPE, GQA, and SwiGLU, with code.

Learn how Mixture of Experts (MoE) powers frontier AI models like DeepSeek-V3 and Mixtral: sparse routing, load balancing, and why MoE beat dense scaling.

Explore DeepSeek's architecture breakthroughs (Multi-Head Latent Attention, auxiliary-loss-free MoE, FP8 training, and GRPO) and how they add up to frontier AI for $5.5M.