
Mixture of Experts Demystified: Why Every Frontier Model Uses MoE Now

Learn how Mixture of Experts (MoE) powers frontier AI models like DeepSeek-V3 and Mixtral: sparse routing, load balancing, and why MoE beat dense scaling.

RayZ

The best AI models are mostly turned off

Here is a fact that surprises most people encountering modern AI architectures for the first time: when a frontier language model processes your prompt, roughly 90% of its parameters do nothing. They sit idle. Dormant. Switched off. And this is not a bug — it is the single most important architectural insight driving the current generation of large language models.

The technique is called Mixture of Experts (MoE), and it has become the dominant architecture behind virtually every frontier model: DeepSeek-V3, Mistral Large 3, Qwen3.5, Gemma 4, and the list keeps growing. MoE is now the industry default rather than the exception. Each of these models contains far more total parameters than any single forward pass will ever touch, because they route each token through only a small subset of specialized "expert" sub-networks.

The result is a model that has the knowledge capacity of a trillion-parameter behemoth but the inference cost of something much smaller. This is how open-source labs compete with rivals that have billion-dollar compute budgets. This is how inference stays affordable as capabilities scale. And understanding MoE is now table stakes for anyone working seriously with large language models.

Let us break it all down.


What Is Mixture of Experts?

The core idea: conditional computation

In a standard dense transformer, every token passes through every parameter in every layer. A 70-billion-parameter dense model activates all 70 billion parameters for every single token it processes. This is straightforward but wasteful, since not every parameter is relevant to every input.

MoE introduces conditional computation: the idea that different inputs should activate different parts of the network. Instead of one monolithic feed-forward network (FFN) in each transformer layer, an MoE layer contains multiple parallel FFN sub-networks called experts, plus a small router network (also called a gating network) that decides which experts to activate for each token.

Here is the key distinction:

  • Dense model: Token enters a transformer layer and passes through one large FFN. Every parameter is used.
  • MoE model: Token enters a transformer layer and hits a router. The router scores all available experts, selects the top-k (typically 1 or 2), and sends the token only through those selected experts. The outputs are combined via weighted sum. The other experts are never computed.

Conceptual diagram: dense vs. MoE forward pass

Imagine a single transformer layer processing one token:

Figure: Dense vs. MoE forward pass. In a dense model all FFN parameters are activated; in an MoE model each token is routed through only the selected experts.

The self-attention layers remain shared; MoE replaces only the FFN layers. Since FFN layers account for roughly two-thirds of parameters in a standard transformer, this is where the leverage comes from.

Key terminology

  • Total parameters: The full count of all parameters across all experts. This determines the model's knowledge capacity.
  • Active parameters: The parameters actually computed for a given token. This determines the inference cost.
  • Expert: A standard FFN sub-network (typically two linear layers with an activation function between them, identical in architecture to a dense FFN but smaller).
  • Router / Gating network: A learned linear layer that takes the token's hidden state and produces a probability distribution over experts.
  • Top-k routing: Selecting the k highest-scoring experts for each token (k is usually 1 or 2).
  • Expert capacity: The maximum number of tokens an expert can process in a given batch, used to enforce load balancing.
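
To make the first two terms concrete, here is a rough back-of-the-envelope calculation for a Mixtral-8x7B-style layout. The dimensions below are approximations for illustration, not the published config:

```python
# Back-of-the-envelope parameter accounting for a Mixtral-8x7B-style model.
d_model, d_ff = 4096, 14336          # hidden size and FFN inner size (approximate)
n_layers, n_experts, top_k = 32, 8, 2

ffn_params_per_expert = 3 * d_model * d_ff      # SwiGLU FFN: gate, up, down projections
total_ffn = n_layers * n_experts * ffn_params_per_expert
active_ffn = n_layers * top_k * ffn_params_per_expert

print(f"FFN parameters, total : {total_ffn / 1e9:.1f}B")    # ~45B
print(f"FFN parameters, active: {active_ffn / 1e9:.1f}B")   # ~11B
# Attention, embeddings, and routers add a few billion more parameters that
# every token uses, bringing the totals close to 46.7B total / 12.9B active.
```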

A Brief History of MoE

The Mixture of Experts concept is far older than transformers. The original idea dates back to Jacobs et al. (1991), who proposed a system of specialized networks with a learned gating mechanism. But for decades, MoE remained a niche technique: difficult to train, hard to stabilize, and poorly suited to the hardware available.

The modern era of MoE in deep learning began in earnest around 2017:

Shazeer et al. (2017): Outrageously Large Neural Networks

Noam Shazeer and collaborators at Google published what is arguably the foundational paper for modern MoE in deep learning. They demonstrated an MoE layer with up to 131,072 experts applied to LSTM-based language models and machine translation. The paper introduced critical concepts like the noisy top-k gating mechanism and the importance loss to encourage balanced expert utilization. The key finding: MoE could achieve better results than comparable dense models at significantly lower computational cost.

GShard (2020) and Switch Transformer (2021)

Google's GShard scaled MoE transformers to 600 billion parameters for machine translation. Then came Switch Transformer (Fedus, Zoph, and Shazeer, 2021), which simplified the routing to top-1 (each token goes to exactly one expert) and demonstrated that MoE could scale transformers to over a trillion parameters. Switch Transformer showed that top-1 routing, despite being simpler than top-2, worked remarkably well when combined with proper load balancing.

ST-MoE (2022)

Google's Stable and Transferable Mixture of Experts (ST-MoE) addressed training instability issues that had plagued earlier MoE models. It introduced router z-loss, a regularization term that prevents router logits from growing too large, which was a major source of training instability. This made MoE training significantly more reliable.

Mixtral 8x7B (December 2023): MoE goes mainstream

Mistral AI's Mixtral 8x7B was the inflection point. It was the first widely available, high-quality open-weight MoE model that practitioners could actually run. With 46.7 billion total parameters but only 12.9 billion active per token (using top-2 routing over 8 experts), Mixtral matched or exceeded the quality of dense 70B models while being dramatically cheaper to run. Suddenly MoE was not just a research curiosity; it was a practical advantage.

The 2024 MoE explosion

After Mixtral proved the concept commercially, MoE became the default architecture for frontier models:

  • DeepSeek-V2 (May 2024): Introduced fine-grained experts and DeepSeekMoE architecture
  • DeepSeek-V3 (December 2024): 671B total parameters, 37B active, 256 fine-grained experts with top-8 routing
  • Grok-1 (March 2024): xAI's 314B parameter MoE with 8 experts, top-2 routing
  • DBRX (March 2024): Databricks' 132B parameter MoE with 16 fine-grained experts, top-4 routing
  • Qwen1.5-MoE-A2.7B (2024): Alibaba's efficient MoE with only 2.7B active parameters from 14.3B total
  • Mixtral 8x22B (April 2024): Mistral's larger MoE model, 176B total with 39B active
  • Arctic (April 2024): Snowflake's 480B parameter MoE designed for enterprise workloads

2025 and beyond: MoE becomes universal

By 2025, MoE was no longer a differentiator; it was table stakes. Every major lab adopted it:

  • DeepSeek-R1 (January 2025): Built on the V3 MoE architecture (671B/37B active) but trained primarily through reinforcement learning for step-by-step reasoning
  • Mistral Large 3 (December 2025): 675B total parameters, ~41B active, fine-grained MoE with a 256K context window
  • Qwen3 (2025): Made MoE a first-class strategy across the Qwen family, including compact MoE variants like Qwen3-30B-A3B with only 3B active parameters
  • DeepSeek-V3.2 (January 2026): 685B total parameters, 37B active, introduced DeepSeek Sparse Attention and unified chat with reasoning in a single model
  • Qwen3.5-397B-A17B (February 2026): Alibaba's flagship MoE with 397B total parameters and 17B active, pushing multimodal reasoning and ultra-long context
  • Gemma 4 27B-A4B (2026): Google's open-weight MoE with 25.2B total parameters but only 3.8B active, delivering near-4B serving cost with significantly better quality

The convergence is now total. Dense architectures remain practical at small scale, but for frontier-class models the question is no longer whether to use MoE but how to configure the expert topology.

DeepSeek pushed MoE further than anyone (see Inside DeepSeek)


How Routing Works: The Heart of MoE

The router is the most critical component of an MoE model. It determines which experts process which tokens, and getting this right is the difference between a well-functioning MoE model and one that collapses into using only a handful of experts.

Basic routing mechanism

The router is typically a simple linear layer. Given a token's hidden representation h (the output of the attention layer), the router computes:

\text{router\_logits} = W_{\text{router}} \cdot h \quad \text{(shape: } n_{\text{experts}} \text{)}
\text{router\_probs} = \text{softmax}(\text{router\_logits})
\text{top\_k\_experts} = \arg\max_k(\text{router\_probs})

The token is then sent to the selected top-k experts. Each expert produces its own output, and the final output is a weighted combination:

\text{output} = \sum_{i \in \text{top\_k}} \text{router\_probs}[i] \cdot \text{expert}_i(h)

The weights are the softmax probabilities of the selected experts, renormalized to sum to 1.
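
Here is a minimal PyTorch sketch of that routing logic. It is illustrative rather than production code: the class and function names are invented for this example, and `experts` is assumed to be any list of FFN modules that map `d_model` to `d_model`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Minimal token-choice router implementing the equations above."""
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.w_router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, h):                       # h: (n_tokens, d_model)
        probs = F.softmax(self.w_router(h), dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        # Renormalize the selected probabilities so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_probs, topk_idx

def moe_forward(h, router, experts):
    """Weighted sum of each token's top-k expert outputs; unselected experts
    are never computed."""
    topk_probs, topk_idx = router(h)
    out = torch.zeros_like(h)
    for slot in range(router.top_k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e       # tokens sending this slot to expert e
            if mask.any():
                weight = topk_probs[mask, slot].unsqueeze(-1)
                out[mask] += weight * expert(h[mask])
    return out
```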

Top-1 vs. top-2 vs. top-k routing

  • Top-1 routing (Switch Transformer): Each token goes to exactly one expert. Maximally efficient but can be less stable and provides less expressivity.
  • Top-2 routing (Mixtral, Grok-1): Each token goes to two experts. Provides a smoother gradient signal for router training and allows experts to combine knowledge. This is the most common choice.
  • Top-k routing with k > 2 (DBRX uses top-4, DeepSeek-V3 uses top-8): Activates more experts per token, trading some efficiency for quality. This makes more sense when you have many small (fine-grained) experts.

Expert choice routing

An alternative paradigm flips the direction: instead of each token choosing its experts, each expert chooses its tokens. Proposed in the "Expert Choice" paper (Zhou et al., 2022), this approach has each expert select the top-k tokens it has the highest affinity for. This naturally balances load (each expert processes exactly the same number of tokens) but introduces challenges with ensuring every token gets processed by at least one expert.
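
A hedged sketch of the flipped selection, assuming a plain linear router and a fixed per-expert capacity:

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(h, w_router, capacity):
    """Each expert picks the `capacity` tokens it scores highest, instead of
    each token picking experts. Load is perfectly balanced by construction,
    but some tokens may be picked by no expert at all.
    h: (n_tokens, d_model), w_router: (d_model, n_experts)."""
    affinity = F.softmax(h @ w_router, dim=-1)          # (n_tokens, n_experts)
    # One row per expert: its affinity for every token in the batch.
    scores, token_idx = affinity.t().topk(capacity, dim=-1)
    return scores, token_idx                            # both (n_experts, capacity)
```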


The Load Balancing Problem

Left to its own devices, an MoE router will often collapse: it will learn to send most tokens to a small number of "popular" experts while the rest sit idle. This expert collapse or rich-get-richer problem is the central training challenge of MoE models.

Why does this happen? Early in training, some experts randomly perform slightly better than others. The router learns to favor them, so they receive more gradient updates, making them even better. Meanwhile, underutilized experts stagnate. Without intervention, you end up with an expensive model that functionally uses only 2-3 of its experts.

Traditional approach: auxiliary loss functions

The standard solution is an auxiliary load-balancing loss added to the training objective. This loss penalizes uneven expert utilization:

\mathcal{L}_{\text{balance}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i

Where f_i is the fraction of tokens routed to expert i, P_i is the average router probability for expert i, N is the number of experts, and alpha is a weighting coefficient (typically 0.01 to 0.1).

This loss encourages the router to distribute tokens more evenly. When expert i receives too many tokens (high f_i) and also has high router probability (high P_i), the product f_i * P_i is large and the loss pushes back.
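
In code the loss is only a few lines. This sketch assumes top-1 routing for simplicity; with top-k routing, f_i would count every expert a token selects:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, n_experts, alpha=0.01):
    """Auxiliary loss  alpha * N * sum_i f_i * P_i  for top-1 routing.
    router_logits: (n_tokens, n_experts); expert_indices: (n_tokens,)."""
    probs = F.softmax(router_logits, dim=-1)
    mean_prob = probs.mean(dim=0)                                          # P_i
    frac_tokens = F.one_hot(expert_indices, n_experts).float().mean(dim=0) # f_i
    return alpha * n_experts * torch.sum(frac_tokens * mean_prob)
```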

Additional regularization techniques include:

  • Router z-loss (ST-MoE): Penalizes large router logits to prevent numerical instability. Defined as \mathcal{L}_z = \frac{1}{B} \sum \left( \log \sum e^{x_i} \right)^2.
  • Expert capacity factors: Hard caps on how many tokens each expert can receive. Overflow tokens are either dropped (reducing quality) or sent to a shared "default" expert.
  • Random routing: Adding noise to router decisions during training to encourage exploration.

The auxiliary loss dilemma

Here is the problem: the auxiliary loss fights against the main language modeling objective. Setting alpha too high forces perfectly even load distribution but hurts model quality, because the router cannot learn genuine specialization. Setting it too low lets expert collapse happen. Finding the right alpha is notoriously finicky and can vary across training stages.

DeepSeek's loss-free load balancing

DeepSeek-V3 introduced an elegant alternative that avoids auxiliary losses entirely. Instead of adding a loss term, they use a bias-based approach:

Each expert maintains a bias term that is added to its router logits. These biases are not learned through gradient descent. Instead, they are adjusted through a simple heuristic: if an expert is receiving too many tokens, its bias is decreased slightly; if it is receiving too few, its bias is increased. This nudges the router toward balanced utilization without conflicting with the language modeling gradient.

The result is that DeepSeek-V3 achieves excellent load balance without any auxiliary loss penalty, meaning the router is free to optimize purely for language modeling quality. This was a meaningful contribution. The auxiliary loss has been the source of much training instability and quality degradation in prior MoE work.
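
A simplified sketch of the idea (not DeepSeek's exact update rule): after each batch, nudge per-expert biases up or down depending on whether the expert was under- or over-loaded.

```python
import torch

@torch.no_grad()
def update_expert_bias(bias, expert_counts, update_rate=1e-3):
    """Nudge per-expert routing biases toward balanced load after each batch.
    bias: (n_experts,) tensor added to router logits before top-k selection
    (it influences which experts are chosen, not the combination weights).
    expert_counts: (n_experts,) tokens routed to each expert in the batch."""
    target = expert_counts.float().mean()
    overloaded = expert_counts.float() > target
    bias[overloaded] -= update_rate      # busy experts become slightly less attractive
    bias[~overloaded] += update_rate     # idle experts become slightly more attractive
    return bias
```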


Fine-Grained vs. Coarse-Grained Experts

One of the most important architectural decisions in MoE design is expert granularity: how many experts to use and how large each one should be.

Coarse-grained experts (Mixtral, Grok-1)

The original approach uses a relatively small number of large experts. Mixtral 8x7B has 8 experts, each essentially a full 7B-class FFN. Grok-1 similarly uses 8 experts. With top-2 routing, each token activates 2 out of 8 experts, a 4x reduction in compute vs. activating all of them.

Advantages:

  • Simpler routing decisions (choosing among 8 options is easier than choosing among 256)
  • Each expert has large capacity and can specialize in broad domains
  • Straightforward parallelism across devices

Disadvantages:

  • Coarser routing granularity: the router must make a bigger commitment with each choice
  • Load balancing is harder because each expert represents a larger fraction of total compute
  • Less flexible combinatorial expressivity (8-choose-2 = 28 possible expert combinations)

Fine-grained experts (DeepSeek-V2/V3, DBRX)

The fine-grained approach uses many smaller experts with higher top-k. DeepSeek-V3 uses 256 experts with top-8 routing, where each expert is 1/32 the size of what a single coarse expert would be. DBRX uses 16 experts with top-4 routing.

The key insight is that fine-grained experts provide exponentially more combinatorial flexibility. With 256-choose-8, there are roughly 400 trillion possible expert combinations per token. This means the model can learn incredibly nuanced routing patterns, where different tokens can activate vastly different subsets of the network.
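
The combinatorics are easy to check:

```python
from math import comb

print(comb(8, 2))      # Mixtral-style:     28 combinations per token
print(comb(16, 4))     # DBRX-style:        1,820 combinations
print(comb(256, 8))    # DeepSeek-V3-style: ~4.1e14 combinations
```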

DeepSeek-V2 formalized this with the DeepSeekMoE architecture, which also introduced shared experts, a small number of experts that are always activated for every token, regardless of routing. These shared experts capture common knowledge that should apply universally, while the routed experts handle specialized knowledge.

The DeepSeek-V3 architecture uses:

  • 1 shared expert (always active)
  • 256 routed experts (8 selected per token)
  • Total: 671B parameters, ~37B active per forward pass

This is a far more efficient use of parameters than the coarse-grained approach and has become the leading paradigm for new MoE models.
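
Putting the pieces together, here is a hedged PyTorch sketch of a shared-plus-routed layer in the DeepSeekMoE spirit. The class and parameter names are invented, the expert count is scaled down, and a real implementation replaces the Python loops with batched kernels:

```python
import torch
import torch.nn as nn

def make_ffn(d_model, d_expert):
    return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                         nn.Linear(d_expert, d_model))

class SharedPlusRoutedMoE(nn.Module):
    """Sketch of a DeepSeekMoE-style layer: shared expert(s) process every
    token; the router adds a weighted sum of top-k routed experts."""
    def __init__(self, d_model, d_expert, n_routed=64, n_shared=1, top_k=8):
        super().__init__()
        self.shared = nn.ModuleList(make_ffn(d_model, d_expert) for _ in range(n_shared))
        self.routed = nn.ModuleList(make_ffn(d_model, d_expert) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, h):                                   # h: (n_tokens, d_model)
        out = sum(expert(h) for expert in self.shared)      # always-on shared path
        probs = self.router(h).softmax(dim=-1)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize gate weights
        for slot in range(self.top_k):
            for e in topk_i[:, slot].unique().tolist():
                mask = topk_i[:, slot] == e
                out[mask] += topk_p[mask, slot].unsqueeze(-1) * self.routed[e](h[mask])
        return out
```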


Why MoE Won: The Cost-Performance Tradeoff

The fundamental argument for MoE comes down to economics. Let us make it concrete:

Training cost

Training a dense model from scratch requires computing the forward and backward pass through all parameters for every token. A 70B dense model performs roughly 70 billion multiply-accumulate operations per token in the forward pass alone, summed across all of its layers.

An MoE model with 70B total parameters but only 14B active per token performs roughly 14 billion operations per token, a 5x reduction in per-token training compute. You get the knowledge capacity of a 70B model at the training cost of something closer to a 14B model.

The savings are not exactly this clean (router computation, communication overhead, and memory for all expert parameters add cost), but the directional advantage is enormous. DeepSeek-V3, with 671B total parameters, reportedly trained for approximately $5.6 million in compute cost, a fraction of what a 671B dense model would cost.

Inference cost

At inference time, you only load and compute the active experts per token. This means:

  • Latency: Comparable to a dense model the size of your active parameters, not your total parameters.
  • Throughput: Higher than the equivalent-quality dense model because you are doing less compute per token.
  • Memory: This is the catch. You still need to store all parameters in memory (or have an efficient loading strategy). A 671B MoE model needs 671B parameters worth of memory even though each forward pass only touches 37B of them.

The quality argument

MoE models consistently demonstrate that more total parameters improve quality even when active parameters remain fixed. Mixtral 8x7B (46.7B total, 12.9B active) outperforms comparable dense 70B models (70B total, 70B active) on most benchmarks. The model has access to more learned knowledge even though it does less computation per token.

This challenges the old scaling-law intuition that "bigger model = more compute per token." MoE shows that you can scale knowledge capacity independently of per-token compute, and both dimensions matter for quality.

MoE is what makes open-source models competitive (see The Open-Source LLM Power Shift)


Key MoE Models: A Closer Look

DeepSeek-V3

The model that proved MoE could compete at the frontier on a fraction of the budget (later succeeded by V3.1 and V3.2). Key specifications:

  • 671B total parameters, ~37B active
  • 256 routed experts + 1 shared expert per MoE layer
  • Top-8 routing with fine-grained experts
  • Multi-head latent attention (MLA): Compresses KV cache by projecting keys and values into a low-rank latent space
  • Loss-free load balancing: Bias-based approach replacing auxiliary losses
  • Multi-token prediction training objective
  • Trained on 14.8 trillion tokens
  • Competitive with GPT-4 and Claude 3.5 Sonnet on many benchmarks at a fraction of the training cost. Its successor, DeepSeek-V3.2 (685B total, January 2026), added DeepSeek Sparse Attention and unified reasoning into a single model.

Mixtral 8x7B and 8x22B

Mistral AI's models that brought MoE to the mainstream:

  • 8x7B: 46.7B total, 12.9B active, 8 experts, top-2 routing. Uses a sliding window attention mechanism. Released December 2023.
  • 8x22B: 176B total, 39B active, 8 experts, top-2 routing. Released April 2024. Strong multilingual and coding performance.

Grok-1

xAI's first public model:

  • 314B total parameters, 8 experts, top-2 routing
  • Open-weighted (released March 2024)
  • One of the largest MoE models with relatively coarse-grained experts

DBRX

Databricks' enterprise-focused model:

  • 132B total, 36B active
  • 16 fine-grained experts, top-4 routing
  • Demonstrated that fine-grained experts work well at moderate scale
  • Strong performance on coding and enterprise tasks

Qwen-MoE

Alibaba's efficiency-focused MoE:

  • Qwen1.5-MoE-A2.7B: 14.3B total, 2.7B active
  • Demonstrated that MoE can be effective even at small scale
  • Uses 60 routed experts with top-4 routing, plus 4 shared experts (64 total)
  • Competitive with dense 7B models at a fraction of the active compute

Expert Parallelism and Serving Challenges

MoE models introduce unique systems challenges that do not exist in dense models. Understanding these is essential for anyone deploying MoE in production.

The all-to-all communication problem

In a data-parallel or tensor-parallel setup for a dense model, each device processes its shard of the computation and communicates results via standard collective operations (all-reduce). MoE adds a complication: because different tokens route to different experts, and experts may live on different devices, you need all-to-all communication to dispatch tokens to the right expert and collect results.

The pattern looks like this:

  1. All devices compute the router for their local tokens
  2. Tokens are shuffled across devices via all-to-all so each device receives the tokens assigned to its local experts (this is called expert parallelism or EP)
  3. Each device computes its local experts on the received tokens
  4. Results are shuffled back via another all-to-all

This all-to-all communication can be the bottleneck, especially on hardware with limited inter-node bandwidth. It is fundamentally different from the all-reduce patterns that dense model parallelism uses, and optimizing it requires careful co-design of the routing algorithm and the communication topology.

Memory requirements

An MoE model with 671B parameters needs memory to store all 671B parameters, even though only 37B are active per token. At FP16, that is roughly 1.3 TB of memory just for weights. This means:

  • You need many GPUs just for weight storage, even if the active compute fits on fewer
  • KV cache memory is determined by the shared attention layers (number of layers, heads, and context length), not by the number of experts
  • Quantization is especially valuable for MoE models because it reduces the memory footprint of inactive experts
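
The arithmetic behind that 1.3 TB figure takes only a few lines to sanity-check:

```python
# Weight-memory footprint for a 671B-parameter MoE model at common precisions.
total_params, active_params = 671e9, 37e9
for name, bytes_per_param in [("FP16/BF16", 2), ("FP8", 1), ("4-bit", 0.5)]:
    full = total_params * bytes_per_param / 1e12       # TB for all experts
    active = active_params * bytes_per_param / 1e9     # GB actually touched per token
    print(f"{name:9s}: {full:.2f} TB of weights ({active:.0f} GB active per token)")
```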

Expert offloading

One technique for running large MoE models on limited hardware is expert offloading: keeping only the currently active experts in GPU memory and swapping others in from CPU memory or even disk as needed. This works because only a subset of experts are needed at any given time.

The challenge is latency: PCIe bandwidth between CPU and GPU is much lower than GPU memory bandwidth. Predictive loading (using the router's decisions to prefetch experts) can help, and several inference frameworks have implemented this. But for latency-sensitive applications, having all experts resident in GPU memory remains preferred.
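
As a toy illustration of the idea, and not any particular framework's implementation, an LRU-style cache over expert weights might look like this (assuming each expert's weights live as a CPU tensor):

```python
from collections import OrderedDict
import torch

class ExpertCache:
    """Toy LRU cache for expert offloading: keep recently used expert weights
    on the GPU and pull the rest from CPU memory on demand. Real systems
    overlap these transfers with compute and prefetch using the router's
    decisions."""
    def __init__(self, cpu_weights, max_gpu_experts=16):
        self.cpu_weights = cpu_weights          # expert_id -> weight tensor on CPU
        self.max_gpu = max_gpu_experts
        self.gpu = OrderedDict()                # expert_id -> weight tensor on GPU

    def get(self, expert_id):
        if expert_id in self.gpu:
            self.gpu.move_to_end(expert_id)     # mark as most recently used
        else:
            if len(self.gpu) >= self.max_gpu:
                self.gpu.popitem(last=False)    # evict the least recently used expert
            # This host-to-device copy goes over PCIe and dominates latency;
            # predictive prefetching tries to hide it.
            self.gpu[expert_id] = self.cpu_weights[expert_id].to("cuda", non_blocking=True)
        return self.gpu[expert_id]
```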

This area has seen rapid progress. KTransformers (Chen et al., SOSP 2025) introduced AMX-optimized CPU kernels and an expert deferral technique that processes a subset of experts immediately on the GPU while scheduling the rest concurrently on the CPU, achieving 2-4x speedups over prior offloading approaches. SGLang now integrates KTransformers as a backend, combining GPU tensor parallelism with CPU/GPU hybrid expert parallelism. These advances are making it increasingly practical to run 600B+ MoE models on consumer-grade hardware.

Inference frameworks

Efficiently serving MoE models requires specialized kernel implementations. Standard dense-model inference frameworks needed significant extensions:

  • vLLM: MoE-specific kernels with expert parallelism (EP), tensor parallelism (TP), and data parallelism (DP) strategies that can be combined for different hardware topologies
  • TensorRT-LLM: NVIDIA's framework with optimized MoE layers
  • SGLang: Routing-aware batching with KTransformers integration for CPU/GPU hybrid serving
  • llama.cpp: CPU/GPU hybrid inference with expert offloading for MoE

The key optimization is expert batching: grouping tokens destined for the same expert into batches so the expert FFN can process them efficiently as a matrix multiplication rather than a series of individual vector operations. DeepSeek's open-sourced EPLB (Expert-Parallel Load Balancer) and DualPipe (bidirectional pipeline parallelism for computation-communication overlap) have also become reference implementations for large-scale MoE deployment.
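
A minimal sketch of expert batching for top-1 assignments; production kernels fuse the sort, gather, and grouped matrix multiplies, but the structure is the same:

```python
import torch

def batched_expert_forward(h, expert_ids, experts):
    """Expert batching: sort tokens by assigned expert so each expert runs one
    matrix multiply over its whole group instead of many per-token products.
    h: (n_tokens, d_model); expert_ids: (n_tokens,) top-1 routing decisions."""
    out = torch.empty_like(h)
    order = torch.argsort(expert_ids)                 # group tokens by expert
    sorted_ids = expert_ids[order]
    for e in sorted_ids.unique_consecutive().tolist():
        token_idx = order[sorted_ids == e]            # contiguous block for expert e
        out[token_idx] = experts[e](h[token_idx])     # one batched GEMM per expert
    return out
```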

Serving MoE models efficiently requires specialized techniques (see LLM Inference Optimization)


Sparse Upcycling: Dense to MoE Conversion

Training an MoE model from scratch is not the only path. Sparse upcycling is the technique of converting a pre-trained dense model into an MoE model, then continuing training.

How it works

  1. Start with a pre-trained dense transformer
  2. For each FFN layer that will become an MoE layer, duplicate the FFN to create N identical copies. These become the initial experts
  3. Initialize the router randomly (or with a simple heuristic)
  4. Continue training on additional data

Because every expert starts as an identical copy of the original FFN, the model initially behaves exactly like the dense original (the router's choice does not matter when all experts are identical). As training progresses, experts diverge and specialize.
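
A hedged sketch of the upcycling step for a single layer, assuming a standard PyTorch FFN module as the starting point:

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn, d_model, n_experts=8):
    """Clone a pre-trained dense FFN into n identical experts and attach a
    freshly initialized router. Because every expert starts as the same copy,
    the upcycled layer initially behaves exactly like the dense original."""
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(n_experts))
    router = nn.Linear(d_model, n_experts, bias=False)
    nn.init.normal_(router.weight, std=0.02)    # small random init for the router
    return experts, router
```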

Benefits and trade-offs

Benefits:

  • Leverages existing pre-training investment; you do not need to train from scratch
  • Can significantly improve model quality with modest additional training compute
  • Provides a smooth transition from dense to sparse architectures
  • Useful for organizations that have already trained strong dense models

Trade-offs:

  • Experts share the same initialization, which can make differentiation slower than training from scratch
  • The final MoE model may not be as well-optimized as one designed as MoE from the start
  • Still requires significant continued training to realize the benefits

Google's research has shown that sparse upcycling can recover most of the quality gains of training an MoE from scratch, at a fraction of the compute cost. This makes it an attractive option for teams with existing dense model investments.


When MoE Is Overkill

MoE is not always the right choice. Understanding when to use it, and when not to, is as important as understanding how it works.

Small-scale use cases

If your deployment target is a single consumer GPU (8-24 GB VRAM), MoE may hurt more than it helps. A 14B-total MoE model with 2.7B active parameters needs memory for all 14B parameters but only computes 2.7B per token. A well-trained dense 3B model would use similar compute per token while requiring much less memory. At small scale, the memory overhead of inactive experts often outweighs the quality benefits.

Low-batch inference

MoE's efficiency advantage is most pronounced at high batch sizes, where the overhead of routing and communication is amortized across many tokens. For single-user, low-batch inference scenarios (like running a local chatbot), the routing overhead is proportionally larger and the gains are smaller.

Latency-critical applications

The all-to-all communication in expert parallelism adds latency that does not exist in dense model serving. For applications where milliseconds matter (real-time voice, high-frequency decision-making), a dense model may be preferred because its communication patterns are simpler and more predictable.

When you need maximal performance per parameter

Interestingly, dense models can be more efficient than MoE when your constraint is total parameter count rather than compute. A dense 7B model will generally outperform an MoE model with 7B total parameters, because the MoE model dedicates some of those parameters to inactive experts. MoE wins when you can afford the memory for more total parameters than you could afford to compute densely.

The rule of thumb

MoE makes sense when:

  • You need quality beyond what your compute budget can achieve with dense models
  • You have enough memory to store the full model
  • Your serving infrastructure can handle the routing and communication patterns
  • You are operating at sufficient scale to justify the systems complexity

The Future of MoE

Several trends are shaping where MoE goes next.

Even finer granularity

The trend from 8 coarse experts to 256 fine-grained experts has already materialized in production models. Gemma 4 and Qwen3.5 both use fine-grained expert topologies, and research continues to push toward even more experts with higher top-k to unlock better routing specialization.

MoE beyond FFN layers

Most MoE architectures only apply conditional computation to FFN layers. Attention layers remain dense and shared. There is active research on applying expert routing to attention heads themselves, creating Mixture of Attention architectures. This could unlock even greater efficiency gains, since attention is the other major compute consumer in transformers. A related approach is Native Sparse Attention (NSA) (DeepSeek, ACL 2025), which uses a dynamic hierarchical sparse strategy combining coarse-grained token compression with fine-grained token selection. NSA achieves up to 9x forward speedups at 64K context length while maintaining full-attention quality, and is already deployed in DeepSeek-V3.2 as DeepSeek Sparse Attention (DSA).

Hardware co-design

The all-to-all communication pattern of MoE does not map well to current GPU cluster topologies, which are optimized for all-reduce. Future hardware and networking designs may be tailored for MoE's communication patterns, with higher-bandwidth mesh interconnects that make expert parallelism more efficient.

Learned routing curricula

Rather than using a fixed top-k throughout training, future models may learn to adjust their routing dynamically, using more experts for harder tokens and fewer for easy ones. This idea connects to the broader concept of adaptive computation, where the model allocates compute based on input difficulty.

MoE + other efficiency techniques

MoE composes well with other efficiency methods: quantization (reduce memory of inactive experts), speculative decoding (use a smaller dense model as the drafter), distillation (compress an MoE model into a smaller model for edge deployment), and sparse attention (DeepSeek-V3.2's DeepSeek Sparse Attention reduces memory and accelerates long-context reasoning). The combination of MoE with multi-head latent attention, as in DeepSeek-V3, shows how multiple efficiency techniques can stack, and the trend is accelerating.


Key Takeaways

  1. MoE activates only a fraction of parameters per token, giving models the knowledge of a huge network at the compute cost of a small one. This is the core innovation that makes modern frontier models economically viable.
  2. Routing is the critical design decision. The choice of top-k, the number of experts, fine-grained vs. coarse-grained, and the load balancing strategy all have major impacts on model quality and training stability.
  3. Fine-grained experts with higher top-k have emerged as the dominant paradigm, pioneered by DeepSeek and adopted by others. More small experts provide exponentially more combinatorial routing flexibility than fewer large experts.
  4. Load balancing remains the hardest training challenge. Traditional auxiliary losses conflict with the modeling objective. DeepSeek-V3's loss-free bias-based approach represents a significant step forward.
  5. MoE models are harder to serve than dense models due to memory requirements (all parameters must be stored) and communication patterns (all-to-all for expert parallelism). Efficient serving requires specialized infrastructure and kernels.
  6. Sparse upcycling makes MoE accessible to teams with existing dense model investments. You do not need to train from scratch to benefit from MoE.
  7. MoE is not always the answer. For small-scale, memory-constrained, or latency-critical deployments, dense models may still be the better choice. MoE shines when you have the memory budget and need quality beyond what your compute budget can achieve with dense architectures.
  8. Every major frontier model has converged on MoE, including DeepSeek-V3/R1, Mistral Large 3, Qwen3.5, and Gemma 4. This is not a trend; it is the established default architecture for large-scale language models. Understanding MoE is no longer optional for practitioners in this space.