
Understanding Transformer Architectures from Scratch

Master the transformer architecture from first principles: self-attention, multi-head attention, positional encodings, encoder-decoder design, and modern innovations like RoPE, GQA, and SwiGLU, with code.


The architecture that changed everything

In June 2017, a team of eight researchers at Google published a paper with the unassuming title "Attention Is All You Need." The paper introduced the Transformer -- a sequence-to-sequence architecture that replaced the recurrent computations underpinning every state-of-the-art NLP model with a single, elegant mechanism: attention.

At the time, the dominant architectures were Recurrent Neural Networks (RNNs) and their gated variants, Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). These models processed sequences one token at a time, maintaining a hidden state that was updated step by step. This sequential nature created two fundamental problems:

  1. Parallelization was impossible. Processing token t required the output from token t-1. Training could not be distributed across the sequence dimension, which meant that longer sequences took proportionally longer to train.
  2. Long-range dependencies decayed. Despite gating mechanisms designed to preserve information, the hidden state acted as a bottleneck. Information from early tokens was progressively diluted as the sequence grew, making it difficult for the model to learn relationships spanning hundreds or thousands of tokens.

The Transformer solved both problems simultaneously. By computing attention over all positions in parallel, it eliminated the sequential bottleneck entirely. And by allowing every token to attend directly to every other token, regardless of distance, it made long-range dependencies as easy to learn as short-range ones.

The results were immediate and decisive. The original Transformer achieved state-of-the-art results on machine translation benchmarks while training in a fraction of the time required by recurrent models. Within two years, Transformer-based models (BERT, GPT-2, T5) had swept virtually every NLP benchmark. Within five years, the architecture had expanded beyond text into vision (ViT), audio (Whisper), protein structure (AlphaFold 2), and multimodal reasoning (GPT-4, Claude, Gemini).

Today, every frontier AI system is built on the Transformer or a direct descendant of it. Understanding this architecture is not optional for anyone working in AI; it is foundational. This article builds that understanding from first principles, covering every component of the original design and the modern innovations that have reshaped it.


The Original Encoder-Decoder Architecture

The 2017 Transformer was designed for sequence-to-sequence tasks, primarily machine translation. It follows an encoder-decoder structure:

  • The encoder reads the input sequence (e.g., a sentence in English) and produces a set of continuous representations.
  • The decoder takes those representations and generates the output sequence (e.g., the translation in French) one token at a time.

Both the encoder and decoder are stacks of identical layers. The original paper used 6 layers in each stack. Each encoder layer contains two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each decoder layer contains three sub-layers: masked multi-head self-attention, multi-head cross-attention over the encoder output, and a position-wise feed-forward network.

Here is the high-level structure:

Input Sequence
  |
  v
Embedding + Positional Encoding
  |
  v
ENCODER ×N
  Multi-Head Self-Attention + Add & LayerNorm
  Feed-Forward Network + Add & LayerNorm
  |
  v
Encoder Output (context)
  |  (provides K, V to decoder cross-attention)
  v
DECODER ×N
  Masked Self-Attention + Add & LayerNorm
  Cross-Attention (attends to encoder) + Add & LayerNorm
  Feed-Forward Network + Add & LayerNorm
  |
  v
Linear + Softmax
  |
  v
Output Tokens

High-level Transformer architecture (Vaswani et al., 2017)

Every sub-layer in both the encoder and decoder uses a residual connection followed by layer normalization. These stability mechanisms are critical for training deep networks and are discussed in detail later.

Let us now examine each component, starting with the mechanism at the heart of everything.


Self-Attention: The Core Mechanism

Intuition

Self-attention answers a deceptively simple question: for each token in a sequence, how relevant is every other token?

Consider the sentence: "The animal didn't cross the street because it was too tired." When processing the word "it", the model needs to understand that "it" refers to "the animal", not "the street." Self-attention gives the model a direct mechanism to compute this: each token produces a query ("what am I looking for?"), every token produces a key ("what do I contain?"), and the dot product between queries and keys determines the relevance scores. High-scoring tokens then contribute their values to the output representation.

The Math: Scaled Dot-Product Attention

Given an input sequence of n tokens, each represented as a d-dimensional vector, self-attention computes three matrices by projecting the input through learned weight matrices:

  • Queries: $Q = XW_Q$, where $W_Q$ is a $(d_{model} \times d_k)$ weight matrix
  • Keys: $K = XW_K$, where $W_K$ is a $(d_{model} \times d_k)$ weight matrix
  • Values: $V = XW_V$, where $W_V$ is a $(d_{model} \times d_v)$ weight matrix

The attention output is then:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Let us break this down step by step:

  1. $QK^T$ computes the dot product between every query and every key, producing an $(n \times n)$ matrix of raw relevance scores. Each entry $(i, j)$ represents how much token $i$ should attend to token $j$.
  2. Division by $\sqrt{d_k}$ is the scaling factor. Without it, when $d_k$ is large, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients. Scaling by the square root of the key dimension keeps the values in a range where softmax gradients remain healthy. This is why it is called scaled dot-product attention.
  3. Softmax converts the raw scores into a probability distribution across the sequence for each query position. Token $i$'s attention weights over all positions sum to 1.
  4. Multiplication by $V$ produces the output: a weighted sum of value vectors, where the weights are the attention probabilities.

Implementation in PyTorch

python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(
    query: torch.Tensor,   # (batch, seq_len, d_k)
    key: torch.Tensor,     # (batch, seq_len, d_k)
    value: torch.Tensor,   # (batch, seq_len, d_v)
    mask: torch.Tensor | None = None
) -> torch.Tensor:
    """
    Compute scaled dot-product attention.

    Args:
        query: Query tensor of shape (batch, seq_len, d_k)
        key: Key tensor of shape (batch, seq_len, d_k)
        value: Value tensor of shape (batch, seq_len, d_v)
        mask: Optional mask tensor (e.g., causal mask for decoders)

    Returns:
        Attention output of shape (batch, seq_len, d_v)
    """
    d_k = query.size(-1)

    # Step 1: Compute raw attention scores
    # (batch, seq_len, d_k) @ (batch, d_k, seq_len) -> (batch, seq_len, seq_len)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Step 2: Apply mask (if provided) -- set masked positions to -inf
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # Step 4: Weighted sum of values
    output = torch.matmul(attention_weights, value)

    return output

The mask parameter is crucial for decoder self-attention, where the model must not attend to future tokens during training. We will return to this when discussing the decoder.
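A quick shape check with random tensors (a hypothetical usage, not trained weights):

python
q = torch.randn(2, 10, 64)   # (batch, seq_len, d_k)
k = torch.randn(2, 10, 64)
v = torch.randn(2, 10, 64)

out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 10, 64])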


Multi-Head Attention: Parallel Attention Subspaces

A single attention function captures one type of relationship between tokens. But language encodes many simultaneous relationships (syntactic structure, semantic similarity, coreference, positional proximity) and a single set of Q, K, V projections cannot capture them all effectively.

Multi-head attention addresses this by running multiple attention operations in parallel, each with its own learned projections, and concatenating their outputs:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W_O

where each head is:

\text{head}_i = \text{Attention}(X W_Q^i, X W_K^i, X W_V^i)

The original Transformer uses $h = 8$ heads with $d_k = d_v = d_{model} / h = 64$. This means each head operates on a 64-dimensional subspace of the 512-dimensional model, and the total computation is roughly equivalent to a single head with full dimensionality.

What different heads learn

Research analyzing trained Transformers has revealed that different attention heads specialize in different linguistic phenomena:

  • Some heads learn positional patterns, attending to the previous token or the next token.
  • Some heads learn syntactic relationships: subject-verb agreement, dependency parsing structures.
  • Some heads learn semantic similarity, where tokens with related meanings attend to each other.
  • Some heads learn coreference, connecting pronouns to their antecedents.

This specialization is not hand-designed. It emerges naturally from training, because the independent projections give each head the freedom to learn a different attention pattern.

Implementation

python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Learned projection matrices for Q, K, V, and output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor | None = None
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # Project and reshape: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Compute attention for all heads in parallel
        attn_output = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads: (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # Final linear projection
        return self.W_o(attn_output)

Notice how the view and transpose operations split d_model into num_heads separate d_k-dimensional subspaces. The scaled_dot_product_attention function operates identically on this 4D tensor: each head computes attention independently and in parallel on the GPU.
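A quick smoke test of the module (hypothetical values; self-attention passes the same tensor as query, key, and value):

python
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)

out = mha(x, x, x)            # self-attention: Q, K, V all derive from x
print(out.shape)              # torch.Size([2, 10, 512])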


Positional Encodings: Injecting Sequence Order

Self-attention is permutation-equivariant: if you shuffle the input tokens, the attention computation treats them exactly the same way (the output tokens get shuffled accordingly, but no information about position is used). This is a problem. The sentence "the dog bit the man" means something very different from "the man bit the dog," and the model needs to know token positions.

Positional encodings inject positional information into the token representations. They are added to the input embeddings before the first encoder/decoder layer.

Sinusoidal Positional Encodings (Original Transformer)

The original paper uses fixed sinusoidal functions of different frequencies:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)

where $pos$ is the position in the sequence and $i$ is the dimension index.

Why sinusoids? The authors hypothesized that sinusoidal encodings would allow the model to learn to attend by relative positions, because for any fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos). In practice, this means the model can learn "three tokens ahead" as a general pattern rather than memorizing specific absolute position pairs.

python
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)

        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1), :]

Learned Positional Encodings

An alternative, used in BERT and early GPT models, is to simply learn a separate embedding vector for each position:

python
class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.position_embeddings = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        return x + self.position_embeddings(positions)

Learned encodings are more flexible, but they impose a hard maximum sequence length: the model cannot generalize to positions it has not seen during training. Sinusoidal encodings, being mathematical functions, can extrapolate to longer sequences, though in practice this extrapolation is imperfect.

Modern architectures have moved beyond both of these approaches. We discuss Rotary Position Embeddings (RoPE) and other innovations in the final section.


Feed-Forward Networks: The Thinking Layers

After the attention sub-layer, each encoder and decoder layer contains a position-wise feed-forward network (FFN). This is a simple two-layer MLP applied independently and identically to each token position:

\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2

The inner dimension $d_{ff}$ is typically 4 times the model dimension. In the original Transformer with $d_{model} = 512$, this means $d_{ff} = 2048$. The activation function is ReLU.
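A minimal sketch of this FFN in PyTorch (original ReLU form; nn.Linear acts on the last dimension, so every position is transformed independently):

python
class PositionwiseFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W_1, b_1
        self.linear2 = nn.Linear(d_ff, d_model)   # W_2, b_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
        return self.linear2(F.relu(self.linear1(x)))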

Why does this matter? The attention mechanism handles token-to-token interactions, deciding which information to gather. The feed-forward network processes what to do with it. Research has shown that FFN layers act as key-value memories that store factual knowledge. When a token representation enters the FFN, the first linear layer matches it against learned patterns (keys), and the second linear layer retrieves the associated information (values).

This is why the FFN layers contain the majority of a Transformer's parameters. In a standard design, the attention layers hold roughly one-third of the parameters, and the FFN layers hold the remaining two-thirds. When people talk about "model parameters" or "model size," they are primarily talking about FFN weights.

Modern transformers use Mixture of Experts layers to scale these FFN blocks efficiently (see MoE Demystified for a deep dive into how sparse expert routing works).


Layer Normalization and Residual Connections

Residual Connections

Every sub-layer in the Transformer is wrapped in a residual connection:

\text{output} = \text{SubLayer}(x) + x

This technique, borrowed from ResNet, addresses the vanishing gradient problem in deep networks. By adding the input directly to the output, gradients flow through the addition operation during backpropagation, maintaining a strong gradient signal even through many layers. Without residual connections, Transformers deeper than a few layers become essentially untrainable.

Layer Normalization

Layer normalization normalizes the activations across the feature dimension for each token independently:

\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

where $\mu$ and $\sigma^2$ are the mean and variance computed across the features of a single token, and $\gamma$ and $\beta$ are learned scale and shift parameters.

Unlike batch normalization (which normalizes across the batch dimension), layer normalization is independent of batch size and works identically during training and inference. This makes it well-suited for variable-length sequences and autoregressive generation.

Post-Norm vs Pre-Norm

The original Transformer applies layer normalization after the residual addition (post-norm):

output = LayerNorm(x + SubLayer(x))

Most modern Transformers use pre-norm, applying normalization before the sub-layer:

output = x + SubLayer(LayerNorm(x))

Pre-norm is easier to train because the residual path remains completely unmodified, so gradients flow through a clean identity path. Post-norm can produce slightly better final quality but is harder to optimize and more sensitive to learning rate. The shift to pre-norm was one of the first practical changes adopted after the original paper, and it remains the dominant choice in large-scale models.
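In code, the difference is a single line of wiring. A sketch, with sublayer standing in for either attention or the FFN:

python
# Post-norm (original 2017): normalize after the residual addition
def post_norm_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-norm (modern default): normalize before the sub-layer,
# leaving the residual path as a clean identity
def pre_norm_block(x, sublayer, norm):
    return x + sublayer(norm(x))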


The Decoder: Masked Attention, Cross-Attention, and Autoregressive Generation

The decoder is where output tokens are generated. It introduces two mechanisms not present in the encoder.

Masked Self-Attention

During training, the decoder processes the entire target sequence at once (for efficiency), but it must maintain the constraint that predicting token t can only use information from tokens 1 through t-1. This is the autoregressive property: each token is generated conditioned only on previous tokens.

Masked self-attention enforces this by applying a causal mask, an upper-triangular matrix of negative infinities, to the attention scores before the softmax. This ensures that attention weights for future positions are zero:

python
def create_causal_mask(seq_len: int) -> torch.Tensor:
    """
    Create a causal (look-ahead) mask for decoder self-attention.
    Returns a lower-triangular matrix of ones.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask  # 1 = attend, 0 = mask out

When applied in the attention computation, positions where the mask is 0 get their scores set to negative infinity, so after softmax, those positions receive zero weight.
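Combined with the attention function from earlier, the mask broadcasts across the batch dimension (a small illustrative check):

python
seq_len = 4
mask = create_causal_mask(seq_len)                 # (4, 4), ones on and below the diagonal
q = k = v = torch.randn(1, seq_len, 8)

out = scaled_dot_product_attention(q, k, v, mask)  # position 0 attends only to itself
print(out.shape)                                   # torch.Size([1, 4, 8])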

Cross-Attention

The second attention sub-layer in each decoder layer is cross-attention (also called encoder-decoder attention). Here, the queries come from the decoder, but the keys and values come from the encoder output:

  • Q = decoder hidden states projected through $W_Q$
  • K = encoder output projected through $W_K$
  • V = encoder output projected through $W_V$

This is the bridge between the encoder and decoder. It allows each decoder position to attend to all positions in the input sequence, enabling the decoder to "look at" the source when generating each output token. In machine translation, this is how the decoder consults the original sentence while producing the translation word by word.
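With the MultiHeadAttention module defined earlier, cross-attention is simply a different choice of inputs (a sketch with random tensors):

python
decoder_states = torch.randn(2, 7, 512)    # (batch, tgt_len, d_model)
encoder_out = torch.randn(2, 12, 512)      # (batch, src_len, d_model)

cross_attn = MultiHeadAttention(d_model=512, num_heads=8)
out = cross_attn(query=decoder_states, key=encoder_out, value=encoder_out)
print(out.shape)  # torch.Size([2, 7, 512]) -- one output per decoder position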

Autoregressive Generation at Inference Time

During inference, the decoder generates one token at a time:

  1. Feed the start-of-sequence token into the decoder.
  2. The decoder attends to all encoder outputs (via cross-attention) and to all previously generated tokens (via masked self-attention).
  3. The output hidden state passes through a linear layer and softmax to produce a probability distribution over the vocabulary.
  4. Sample or select the highest-probability token.
  5. Append this token to the decoder input and repeat from step 2.

This sequential generation is why autoregressive models can be slow at inference time, since each new token requires a full forward pass through the decoder. For how these models are served efficiently, see LLM Inference Optimization.
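A minimal greedy-decoding sketch; model here is a hypothetical stand-in that maps decoder token IDs (plus the encoder output) to next-token logits:

python
def greedy_decode(model, encoder_out, bos_id: int = 1, eos_id: int = 2, max_len: int = 50):
    tokens = [bos_id]                                 # step 1: start-of-sequence token
    for _ in range(max_len):
        inp = torch.tensor(tokens).unsqueeze(0)       # (1, cur_len)
        logits = model(inp, encoder_out)              # steps 2-3: (1, cur_len, vocab_size)
        next_id = logits[0, -1].argmax().item()       # step 4: highest-probability token
        tokens.append(next_id)                        # step 5: append and repeat
        if next_id == eos_id:
            break
    return tokens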


The Three Lineages: Encoder-Only, Decoder-Only, and Encoder-Decoder

The original Transformer's encoder-decoder design quickly branched into three distinct families, each dominating different categories of tasks.

Encoder-Only: BERT and its Descendants

Key models: BERT, RoBERTa, DeBERTa, ALBERT, ELECTRA

Encoder-only models use just the encoder stack. They process the full input sequence with bidirectional attention, where every token can attend to every other token, including those that come after it. This makes them powerful for understanding tasks where the full context is available.

To train without a decoder, BERT introduced masked language modeling (MLM): randomly mask 15% of input tokens and train the model to predict them. This creates a rich bidirectional representation.

Encoder-only models excel at:

  • Text classification and sentiment analysis
  • Named entity recognition
  • Extractive question answering
  • Semantic similarity and retrieval embeddings

They are not naturally suited for text generation, because they lack the autoregressive mechanism that produces tokens one at a time.

Decoder-Only: GPT and the Foundation of Modern LLMs

Key models: GPT-1/2/3/4/5, Claude, Gemini, Mistral, Qwen, DeepSeek

Decoder-only models use just the decoder stack, with one critical simplification: there is no encoder, so there is no cross-attention. Each decoder layer contains only masked self-attention and a feed-forward network. The input and output share the same sequence space: the model reads a prompt and continues generating from where the prompt ends.

Training uses causal language modeling (CLM): given a sequence of tokens, predict the next token at every position. The causal mask ensures that the prediction at position t only uses tokens 1 through t.
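In code, the objective is a one-position shift between inputs and targets. A sketch, assuming a hypothetical model that maps token IDs to logits:

python
# token_ids: (batch, seq_len) integer tensor
inputs = token_ids[:, :-1]     # the model sees tokens 1 .. t-1
targets = token_ids[:, 1:]     # and must predict tokens 2 .. t

logits = model(inputs)         # (batch, seq_len - 1, vocab_size)
loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),  # flatten batch and positions
    targets.reshape(-1)
)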

This architecture turned out to be remarkably versatile. With sufficient scale and data, decoder-only models learn to perform virtually any task when given appropriate prompting:

  • Translation: "Translate to French: The cat sat on the mat."
  • Summarization: "Summarize the following article: ..."
  • Code generation: "Write a Python function that ..."
  • Reasoning: "Let's think step by step about ..."

The simplicity of the decoder-only design, a single uniform stack with a single training objective, also makes it straightforward to scale. There is no encoder-decoder alignment to manage, no cross-attention to route. Just predict the next token, and scale up.

This is why virtually every frontier language model today -- GPT-5, Claude, Gemini, Mistral, Qwen, DeepSeek -- uses a decoder-only architecture.

Encoder-Decoder: T5, BART, and Sequence-to-Sequence Revival

Key models: T5, BART, mBART, FLAN-T5, UL2

Encoder-decoder models use the full original architecture. They excel at tasks with a clear input-to-output structure: translation, summarization, question answering where the answer is generated rather than extracted.

T5 (Text-to-Text Transfer Transformer) unified all NLP tasks into a text-to-text format: every task receives a text input and produces a text output, with a task-specific prefix (e.g., "translate English to German:", "summarize:"). This was a powerful demonstration that a single architecture and training format could handle diverse tasks.

Despite their versatility, encoder-decoder models have been largely overtaken by decoder-only models at the frontier. The simplicity of decoder-only scaling, combined with in-context learning capabilities that emerge at sufficient size, has made the decoder-only approach dominant.


Modern Innovations: How Today's Transformers Differ from 2017

The core ideas of the Transformer remain, but nearly every component has been refined. Here are the most important modern innovations.

Rotary Position Embeddings (RoPE)

RoPE, introduced by Su et al. in 2021 and now the standard positional encoding for virtually all production language models (including Qwen, Mistral, DeepSeek, and Gemma), encodes position information by rotating the query and key vectors in the embedding space. Rather than adding positional information to the token embeddings, RoPE applies a position-dependent rotation matrix to Q and K before the dot product.

The key insight is that after rotation, the dot product between a query at position m and a key at position n depends only on the relative distance (m - n), not on the absolute positions. This gives RoPE an inherent relative position encoding without any additional parameters.
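A minimal sketch of the "rotate-half" RoPE formulation used in many open implementations (the function name and details here are illustrative):

python
def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, num_heads, seq_len, head_dim), head_dim even; apply to Q and K only
    _, _, seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)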

RoPE also enables better length extrapolation than fixed sinusoidal or learned encodings. Techniques like YaRN (Yet another RoPE extensioN) and NTK-aware interpolation modify the RoPE frequencies to allow models trained on shorter contexts to generalize to much longer sequences.

Grouped Query Attention (GQA) and Multi-Query Attention (MQA)

Standard multi-head attention stores separate K and V projections for each head, which creates a large KV-cache during autoregressive inference. For a model with 32 heads and 128K context length, this cache can consume tens of gigabytes of GPU memory per request.

Multi-Query Attention (MQA), proposed by Shazeer (2019), shares a single set of K and V projections across all query heads. This reduces the KV-cache by a factor equal to the number of heads (e.g., 32x) with minimal quality loss. MQA was an important stepping stone but has been largely superseded in practice.

Grouped Query Attention (GQA) is the middle ground that won out: instead of one KV head for all queries, it uses a small number of KV groups (e.g., 8 KV heads shared among 32 query heads). This balances memory savings with representational capacity. GQA is now the default attention configuration for virtually all production-scale models, including Qwen, Mistral, Gemma, and DeepSeek.
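A sketch of GQA built on the earlier scaled_dot_product_attention function (an illustrative module, e.g., 32 query heads sharing 8 KV heads):

python
class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, num_q_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_q_heads % num_kv_heads == 0
        self.d_k = d_model // num_q_heads
        self.num_q_heads = num_q_heads
        self.num_kv_heads = num_kv_heads
        self.group_size = num_q_heads // num_kv_heads
        self.W_q = nn.Linear(d_model, num_q_heads * self.d_k)
        self.W_k = nn.Linear(d_model, num_kv_heads * self.d_k)   # smaller K projection
        self.W_v = nn.Linear(d_model, num_kv_heads * self.d_k)   # smaller V projection
        self.W_o = nn.Linear(num_q_heads * self.d_k, d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        b, n, _ = x.shape
        q = self.W_q(x).view(b, n, self.num_q_heads, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(b, n, self.num_kv_heads, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(b, n, self.num_kv_heads, self.d_k).transpose(1, 2)
        # Each KV head serves group_size query heads; only the small K/V are cached
        k = k.repeat_interleave(self.group_size, dim=1)
        v = v.repeat_interleave(self.group_size, dim=1)
        out = scaled_dot_product_attention(q, k, v, mask)
        return self.W_o(out.transpose(1, 2).reshape(b, n, -1))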

DeepSeek innovated on attention with Multi-Head Latent Attention, a fundamentally different approach (see DeepSeek Architecture Innovations for a detailed analysis).

Differential Attention

Standard softmax attention distributes weight across all tokens, including irrelevant ones. This "attention noise" dilutes the signal from the tokens that actually matter. Differential Attention (Ye et al., ICLR 2025 Oral) addresses this by computing attention as the difference between two separate softmax attention maps:

\text{DiffAttn}(X) = \left(\text{softmax}\left(\frac{Q_1 K_1^T}{\sqrt{d}}\right) - \lambda \cdot \text{softmax}\left(\frac{Q_2 K_2^T}{\sqrt{d}}\right)\right) V

The subtraction cancels out the noise that both maps share, producing sparse, focused attention patterns. The learnable scalar $\lambda$ controls the balance. The result is significant: differential attention models match the quality of standard transformers at roughly 65% of the model size or training data, while also reducing hallucinations and improving robustness in long-context and in-context learning settings.

SwiGLU Activation

The original Transformer used ReLU in its feed-forward layers. Modern models have largely replaced this with SwiGLU, a gated activation function introduced by Shazeer (2020):

\text{SwiGLU}(x) = \text{Swish}(x W_1) \odot (x W_2)

where $\text{Swish}(x) = x \cdot \sigma(\beta x)$ and $\odot$ denotes element-wise multiplication. The key difference from ReLU is the gating mechanism: one linear projection computes the values, and another linear projection computes the gates that control which values pass through.

SwiGLU consistently outperforms ReLU and GELU across model sizes. The tradeoff is that SwiGLU requires three weight matrices instead of two (the gate matrix is additional), so the inner dimension $d_{ff}$ is typically adjusted to $\tfrac{8}{3} d_{model}$ (rather than $4 d_{model}$ for ReLU) to keep the parameter count comparable.
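A minimal sketch of a SwiGLU feed-forward block (F.silu is Swish with beta = 1):

python
class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_ff, bias=False)   # computes the gates
        self.W_up = nn.Linear(d_model, d_ff, bias=False)     # computes the values
        self.W_down = nn.Linear(d_ff, d_model, bias=False)   # projects back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W_gate) element-wise gates the values x W_up
        return self.W_down(F.silu(self.W_gate(x)) * self.W_up(x))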

RMSNorm (Root Mean Square Layer Normalization)

Standard layer normalization computes both the mean and variance of activations, then re-centers and re-scales. RMSNorm simplifies this by only computing the root mean square, omitting the mean subtraction:

\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma

RMSNorm is computationally cheaper than LayerNorm (one less reduction operation) and performs comparably in practice. It has become the universal default for modern language models, replacing LayerNorm almost entirely at scale. A complementary technique, QK-Norm (normalizing query and key vectors before the attention dot product), is increasingly adopted alongside RMSNorm to prevent attention logit explosion in very deep models. QK-Norm is used in Gemma 2/3, OLMo 2, and Qwen 3, and is becoming a standard part of the production recipe.
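A minimal RMSNorm sketch:

python
class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))  # learned scale, no shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Root mean square over the feature dimension; no mean subtraction
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.gamma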

Pre-Norm with Post-Norm Quality

As discussed earlier, pre-norm is easier to train but post-norm can yield slightly better results. Some recent architectures explore hybrid approaches. DeepSeek-V3 and others have experimented with DeepNorm and other normalization strategies that combine the training stability of pre-norm with the quality benefits of post-norm.

Putting It All Together: A Modern Decoder-Only Layer

A single layer in a modern decoder-only Transformer (circa 2025-2026) typically looks like this:

Input
  |
  v
RMSNorm
  |
  v
Grouped Query Attention (with RoPE applied to Q and K)
  |
  +--- Residual connection
  |
  v
RMSNorm
  |
  v
SwiGLU Feed-Forward Network
  |
  +--- Residual connection
  |
  v
Output

Compare this with the original 2017 layer:

Input
  |
  v
Multi-Head Attention (with sinusoidal positional encoding added at input)
  |
  +--- Residual connection
  |
  v
LayerNorm (post-norm)
  |
  v
ReLU Feed-Forward Network
  |
  +--- Residual connection
  |
  v
LayerNorm (post-norm)
  |
  v
Output

The bones are the same. The refinements are everywhere.
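Wiring the sketches from this section together, a modern pre-norm decoder layer might look like the following (RoPE would be applied to Q and K inside the attention module and is omitted here for brevity):

python
class ModernDecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_q_heads: int, num_kv_heads: int, d_ff: int):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = GroupedQueryAttention(d_model, num_q_heads, num_kv_heads)
        self.ffn_norm = RMSNorm(d_model)
        self.ffn = SwiGLUFFN(d_model, d_ff)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x), mask)  # attention sub-layer, pre-norm residual
        x = x + self.ffn(self.ffn_norm(x))          # feed-forward sub-layer, pre-norm residual
        return x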


How This Connects to Frontier Models

Every frontier model in production today is a Transformer descendant. Here is how the architecture maps onto the landscape:

| Model Family | Architecture | Key Innovations |
| --- | --- | --- |
| GPT-5.x | Decoder-only (rumored MoE) | Proprietary; likely GQA, advanced training |
| Claude 4.5 / 4.6 | Decoder-only | Proprietary architecture details |
| Gemini 3.x | Decoder-only MoE | Long context, multimodal |
| Nemotron 3 | Hybrid Mamba-Transformer MoE | Up to 1M context, multimodal |
| Mistral Large 3 / Small 4 | Decoder-only MoE | Sliding window attention, GQA/MQA |
| DeepSeek-V3 | Decoder-only MoE | MLA, auxiliary-loss-free routing, FP8 training |
| Qwen 3.x | Decoder-only (dense/MoE) | GQA, SwiGLU, YaRN for long context |

The striking pattern: all decoder-only (or close to it). The encoder-decoder architecture that started it all has been almost entirely supplanted for generative tasks. Encoder-only models (BERT and descendants) remain important for embedding and classification tasks, but the frontier is decoder-only. Some models like Nemotron 3 incorporate hybrid Mamba-Transformer designs, blending state-space models with attention, but still follow the decoder-only generation paradigm.

The active areas of architectural research include:

  • Longer context windows through improved position encodings and efficient attention variants.
  • Sparse computation through Mixture of Experts and conditional routing.
  • Inference efficiency through quantization, speculative decoding, and KV-cache compression.
  • Hybrid and alternative architectures. State-space models like Mamba have moved from research into production through hybrid designs that interleave Mamba layers with transformer attention layers. NVIDIA's Nemotron 3 is a prominent example, combining Mamba-2 layers with standard attention to match pure-transformer quality while achieving significantly faster inference, particularly on long sequences where Mamba's linear-time complexity provides an advantage over quadratic attention.

Fine-tuning these architectures is covered in our LoRA Tutorial, which walks through parameter-efficient adaptation without modifying the full weight set.


Key Takeaways

  1. The Transformer replaced sequential computation with parallel attention, solving both the training speed and long-range dependency problems of RNNs and LSTMs.
  2. Self-attention computes relevance between all token pairs using queries, keys, and values. Scaling by $\sqrt{d_k}$ prevents gradient saturation in softmax.
  3. Multi-head attention runs multiple attention operations in parallel, allowing the model to capture different types of relationships simultaneously.
  4. Positional encodings inject sequence order information because attention is inherently permutation-equivariant. Modern models use RoPE, which encodes relative position through vector rotation.
  5. Feed-forward networks act as knowledge stores, processing the information gathered by attention. They contain the majority of model parameters.
  6. Residual connections and layer normalization enable training of deep networks by maintaining gradient flow and stabilizing activations.
  7. The decoder uses causal masking for autoregressive generation and cross-attention (in encoder-decoder models) to condition on input representations.
  8. Three architectural lineages emerged -- encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) -- with decoder-only becoming dominant for generative AI.
  9. Modern refinements include RoPE, GQA/MQA, SwiGLU, RMSNorm, and pre-norm, each improving efficiency, quality, or training stability over the original design.
  10. Every frontier model is a Transformer descendant. Understanding this architecture is not just useful -- it is the prerequisite for understanding everything else in modern AI.

The Transformer is only the beginning of the stack. From here, explore how these architectures are scaled with [Mixture of Experts](/articles/mixture-of-experts-demystified), how they are made efficient for deployment with [inference optimization](/articles/llm-inference-optimization), and how they are adapted to specific tasks with [parameter-efficient fine-tuning](/tutorials/fine-tuning-transformers-lora).