Reference

AI & ML Glossary.

Definitions, related concepts, and links to deeper reading for the terms that matter in artificial intelligence and machine learning.

A

AWQ (Activation-Aware Weight Quantization)

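A weight-quantization method that identifies the small fraction of weight channels that matter most (those multiplied by large activations) and protects them by rescaling them before quantization, rather than storing them in a separate higher-precision format, while the rest are quantized aggressively. AWQ achieves better quality than naive round-to-nearest INT4 quantization.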

Activation Outliers

A small number of activation values whose magnitudes are far larger than the rest, concentrated in particular channels. They are the main obstacle to low-bit activation quantization, because a numeric range wide enough to represent them wastes precision on everything else — which is why quantizing weights is far easier than quantizing activations.

Activation Steering

Modifying a model's behavior at inference time by adding a direction vector to its internal activations, rather than retraining. The steering vector is typically derived from interpretability tools such as sparse-autoencoder features, and can amplify or suppress a specific concept in the output.

Agentic AI

AI systems that autonomously plan, use tools, and take multi-step actions to accomplish goals, as opposed to single-turn question-answering. Agentic architectures combine LLMs with tool use, memory, and planning loops.

Alignment

The process of steering a model's behavior to be helpful, harmless, and honest — matching human values and intentions. Alignment techniques include RLHF, DPO, GRPO, and Constitutional AI, applied after pre-training and SFT.

B

Benchmark

A standardized evaluation dataset or task used to measure and compare model capabilities. Common LLM benchmarks include MMLU (knowledge), HumanEval (coding), GSM8K (math), and AIME (competition math).

Benchmark Contamination

When the data in an evaluation benchmark has leaked into a model's training set, inflating its score without reflecting real capability. Because benchmarks live on the public web, contamination is hard to rule out, and detecting it usually requires building a fresh held-out test in the same style.

Best-of-N Sampling

A test-time strategy that samples N independent candidate answers and selects one, either by majority vote or by a verifier or reward-model score. Gains are steep at first and saturate quickly (usually by 16 to 32 samples), because voting can only surface answers already in the model's high-probability modes.
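
The two selection rules named above can be sketched in a few lines. This is a minimal illustration, not a production sampler: `majority_vote`, `best_of_n`, and the `score` callable are hypothetical names, and the candidate answers are assumed to be already sampled.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among N sampled candidates."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Return the candidate with the highest verifier or reward-model score.

    `score` is a stand-in for any callable mapping a candidate to a number.
    """
    return max(candidates, key=score)

voted = majority_vote(["42", "41", "42", "17"])
picked = best_of_n(["a", "bbb", "cc"], score=len)  # toy scorer: length
```

Neither rule can return an answer that was never sampled, which is one way to see why gains flatten as N grows.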

C

Calibration

Passing a small, representative sample of data through a model to measure its activation ranges so post-training quantization can pick precision parameters that minimize error. How representative the calibration set is directly affects the accuracy of the quantized model.

Causal Language Modeling

The training objective used by decoder-only models: predict the next token given all preceding tokens. A causal mask prevents each position from attending to future tokens, enforcing the left-to-right generation order.
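
As a small sketch of the causal mask described above (the function name is illustrative), entry `[i][j]` is `True` when position `i` is allowed to attend to position `j`:

```python
def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(3)
# Row 0 sees only itself; row 2 sees positions 0, 1, and 2.
```

In practice the `False` entries are usually implemented by setting the corresponding attention scores to negative infinity before the softmax.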

Chain-of-Thought (CoT)

A prompting technique (and training objective) where the model generates intermediate reasoning steps before the final answer. CoT dramatically improves performance on multi-step tasks and is the foundation of reasoning model behavior.

Chunking

The step in a retrieval pipeline that splits source documents into smaller passages for embedding and retrieval. The strategy used (fixed-size, recursive, or semantic boundary detection) largely determines retrieval quality, since a poorly cut chunk separates a claim from the evidence that answers a query.
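
The simplest of these strategies, fixed-size chunking, might look like the sketch below. The overlap parameter is an assumption added for illustration (the entry does not mention it); it is a common way to reduce the chance that a claim and its evidence land in different chunks.

```python
def chunk_fixed(words, size, overlap=0):
    """Split a list of words into chunks of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not words:
        return []
    chunks, start, step = [], 0, size - overlap
    while True:
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # this chunk reached the end
            break
        start += step
    return chunks

chunks = chunk_fixed("a b c d e f g h i j".split(), size=4, overlap=1)
```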

Constitutional AI

An alignment approach (introduced by Anthropic) where the model critiques and revises its own outputs according to a set of written principles ("constitution"), reducing reliance on human feedback for identifying harmful outputs.

Context Engineering

The practice of deliberately assembling what goes into a model's context window (retrieved passages, tool outputs, instructions, and history) to maximize the signal the model can actually use. It treats the prompt as a managed budget rather than a dumping ground, and it is the discipline that replaced naive 'stuff everything in' prompting as windows grew.

Context Window

The maximum number of tokens a model can process in a single forward pass. Modern frontier models support 128K-1M+ tokens. Longer context windows enable processing entire documents but increase memory requirements quadratically (or linearly with optimized attention).

Continuous Batching

A serving optimization that dynamically adds and removes requests from a running batch as they arrive and complete, rather than waiting for an entire batch to finish. Continuous batching dramatically improves GPU utilization and throughput.

D

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)

A set of refinements to GRPO that counters entropy collapse with decoupled (higher) clipping bounds and skips zero-gradient prompt groups via dynamic sampling, while dropping the KL penalty for reasoning tasks. DAPO targets training stability and wasted compute, not added capability.

DPO (Direct Preference Optimization)

An alignment method that skips the reward model entirely by directly optimizing the language model on preference pairs. DPO is simpler and more stable than RLHF while achieving comparable results.

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

Where:

  • π_θ is the policy (model) being trained

  • π_ref is the frozen reference model (typically the SFT checkpoint)

  • x is the input prompt

  • y_w is the preferred (winning) response

  • y_l is the dispreferred (losing) response

  • β (beta) controls how much the model is penalized for deviating from the reference policy

  • σ is the sigmoid function
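
To make the formula concrete, here is a minimal sketch of the loss for a single preference pair. It assumes the four sequence log-probabilities have already been computed elsewhere; the function names and the default β = 0.1 are illustrative choices, not values from the source.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    Each argument is a summed sequence log-probability log pi(y | x).
    """
    chosen_logratio = logp_w - ref_logp_w      # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = logp_l - ref_logp_l    # log pi_theta(y_l|x) / pi_ref(y_l|x)
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)  # policy favors y_w more than ref
```

When the policy matches the reference the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response relative to the rejected one.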

Decoder

A transformer stack that generates tokens autoregressively using causal (masked) attention — each token can only attend to previous tokens. Virtually all modern generative LLMs (GPT, Claude, Gemini, Qwen, DeepSeek) use decoder-only architectures.

Distillation

Transferring knowledge from a larger "teacher" model to a smaller "student" model by training the student to replicate the teacher's output distribution. Distillation is the primary method for creating smaller models that retain reasoning capabilities.

DoRA (Weight-Decomposed Low-Rank Adaptation)

An evolution of LoRA that decomposes pre-trained weights into magnitude and direction components, then applies low-rank adaptation only to the direction. DoRA closes more of the gap between LoRA and full fine-tuning.

E

Effective Context Length

The longest input over which a model can actually find and combine the relevant tokens, as opposed to the advertised context window it will merely accept without erroring. Benchmarks like RULER and NoLiMa routinely measure it at a half or a quarter of the advertised size, with degradation starting well before the limit.

Embeddings

Dense vector representations of text (words, sentences, or documents) in a continuous vector space, where semantic similarity corresponds to vector proximity. Embeddings are the foundation of retrieval systems, search, and clustering.

Encoder

A transformer stack that processes the full input sequence bidirectionally (each token attends to all others). Encoders produce rich contextual representations and are used in models like BERT for classification and embedding tasks.

Expert Routing

The mechanism in MoE models that decides which experts process each token. Routing strategies include top-k selection, auxiliary load-balancing losses, and DeepSeek's auxiliary-loss-free approach. Router quality directly impacts model performance and training stability.
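
A minimal sketch of top-k selection for one token follows. Renormalizing the softmax over only the selected experts is one common convention and is an assumption here; load balancing is omitted, and the function name is illustrative.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# One token, four experts: experts 1 and 3 are selected.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
```

The token's output would then be the weighted sum of the selected experts' outputs, so the unselected experts cost no compute for this token.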

F

FP8 (8-bit Floating Point)

An 8-bit floating-point format used for both training and inference. DeepSeek-V3 pioneered FP8 training at scale, demonstrating that 8-bit precision is sufficient for pre-training frontier models, roughly doubling throughput versus FP16.

Fine-tuning

Adapting a pre-trained model to a specific task or domain by continuing training on a smaller, targeted dataset. Fine-tuning can be full-parameter or parameter-efficient (LoRA, QLoRA).

FlashAttention

An IO-aware exact attention algorithm that restructures the attention computation to minimize GPU memory reads/writes (HBM access). FlashAttention achieves 2-4x speedups over standard attention without any approximation.

G

GGUF

The model file format used by the llama.cpp ecosystem, storing weights at a range of quantization levels (Q2 through Q8) with per-layer strategies. GGUF models run on CPU and Apple Silicon, making them the standard for local and edge deployment.

GPTQ

A post-training weight quantization method that uses approximate second-order (Hessian) information to minimize quantization error layer by layer. GPTQ produces INT4/INT3 models optimized for GPU inference.

GRPO (Group Relative Policy Optimization)

A reinforcement learning alignment technique introduced by DeepSeek that scores each sampled response relative to the other responses in its group rather than against a separately trained critic (value) model. GRPO was key to training DeepSeek-R1's reasoning capabilities.

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x \sim D} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( \frac{\pi_\theta(o_{i,t} \mid x, o_{i,<t})}{\pi_{\text{ref}}(o_{i,t} \mid x, o_{i,<t})} \hat{A}_i, \; \text{clip}(\cdot) \hat{A}_i \right) - \beta \, D_{KL}(\pi_\theta \| \pi_{\text{ref}}) \right) \right]$$

Where:

  • G is the number of sampled responses per prompt (the group size)

  • o_i is the i-th sampled response, |o_i| is its length in tokens

  • π_θ is the policy being trained; π_ref is the reference policy

  • Â_i is the advantage for response i, computed relative to the group: Â_i = (r_i − mean(r)) / std(r), where r_i is the reward for response i

  • β is the KL divergence penalty weight

  • D_KL is the KL divergence regularizer preventing the policy from drifting too far from the reference

  • The key innovation: advantages are computed from the group’s own rewards, eliminating the need for a separate learned critic (value model)
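
The group-relative advantage defined above, Â_i = (r_i − mean(r)) / std(r), can be sketched directly. The small `eps` in the denominator and the use of the population standard deviation are assumptions added here to keep the toy example well-defined.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

advantages = group_advantages([1.0, 0.0, 0.0, 1.0])   # mixed group
flat = group_advantages([1.0, 1.0, 1.0])              # every response got the same reward
```

A group whose responses all receive the same reward yields all-zero advantages and therefore no gradient signal for that prompt.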

GSPO (Group Sequence Policy Optimization)

A refinement of GRPO from the Qwen team that defines the importance-sampling ratio and clipping at the sequence level rather than per token, reducing the variance that accumulates over long generations. GSPO stabilizes reinforcement learning for long reasoning chains and mixture-of-experts models.

Gradient Checkpointing

A memory optimization technique that trades compute for memory during training by recomputing intermediate activations during the backward pass instead of storing them. Essential for training large models on limited GPU memory.

GraphRAG

An evolution of RAG that structures retrieved knowledge as a graph (entities and relationships) rather than flat document chunks. GraphRAG improves answers to multi-hop questions that require synthesizing information across multiple sources.

Read more: RAG in 2026

Grouped Query Attention (GQA)

A memory-efficient variant of multi-head attention that shares key and value projections across groups of query heads. GQA reduces KV-cache size by the grouping factor (e.g., 8x) with minimal quality loss, and is used in most modern open models.

H

Hallucination

When a language model generates text that is fluent and confident but factually incorrect or fabricated. Hallucination is a fundamental challenge in LLMs arising from the model's tendency to produce plausible-sounding text regardless of factual grounding.

I

IPO (Identity Preference Optimization)

A DPO variant that replaces the log-sigmoid loss with a squared term so the reference-model regularization keeps binding even when preferences are near-deterministic, preventing the overfitting where the policy collapses onto the chosen response. The conservative choice for clean, strong preference data.

Inference-Time Compute

The paradigm of improving model performance by spending more computation during inference (generating longer reasoning chains, exploring multiple solution paths) rather than during training. This is the core insight behind reasoning models.

K

KTO (Kahneman-Tversky Optimization)

A preference optimization method that learns from unpaired binary labels (a single good-or-bad signal per example) instead of chosen/rejected pairs, using a loss derived from prospect theory. KTO fits the common case where thumbs-up/down feedback is easier to collect than curated pairs.

KV-Cache

The key and value tensors of previous tokens, cached during autoregressive generation so they don't need to be recomputed at each step. The KV-cache grows linearly with sequence length and is often the primary memory bottleneck during inference.
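
A back-of-the-envelope size estimate makes the linear growth concrete. The model shape below (32 layers, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration, not a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical shape with 32 KV heads (plain multi-head attention) ...
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
# ... versus the same model sharing K/V across groups of 4 query heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
```

Under these assumptions the cache is 4 GiB per 8K-token sequence with 32 KV heads and 1 GiB with 8, and doubling the sequence length doubles both.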

L

LLM-as-a-Judge

Using a language model to score or compare other models' outputs in place of human raters or a fixed answer key. It scales evaluation to open-ended tasks but carries measurable biases (toward longer answers, its own outputs, and answer position) that can quietly reshape rankings.

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method that freezes the original model weights and injects small trainable low-rank matrices into each layer. LoRA typically trains ~1% of total parameters while achieving results comparable to full fine-tuning.
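
The parameter saving follows from simple arithmetic: a rank-r update to a d_out × d_in weight matrix is stored as two factors with r·(d_in + d_out) entries in total. The layer size and rank below are illustrative assumptions.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full-matrix parameters, LoRA adapter parameters) for one layer."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # factor shapes: (rank x d_in) and (d_out x rank)
    return full, lora

full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
fraction = lora / full
```

For this hypothetical 4096 × 4096 projection at rank 16, the adapter holds 131,072 parameters against 16,777,216 in the frozen matrix, a little under 1%.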

Loss Function

A mathematical function that measures the difference between a model's predictions and the target values. For language models, the standard pre-training loss is cross-entropy over next-token predictions.

$$\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$

Where:

  • T is the sequence length (number of tokens)

  • x_t is the target token at position t

  • x_{<t} is all tokens before position t (the context)

  • P_θ(x_t | x_{<t}) is the model’s predicted probability for the correct token

  • The negative log means: higher probability for the correct token → lower loss
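
A tiny numeric sketch of the formula, assuming the model's probability for each correct token is already known (the probabilities are made up for illustration):

```python
import math

def cross_entropy(correct_token_probs):
    """Sum of -log P(x_t | x_<t) over the sequence."""
    return -sum(math.log(p) for p in correct_token_probs)

confident = cross_entropy([0.9, 0.8, 0.95])  # model mostly right
uncertain = cross_entropy([0.2, 0.1, 0.3])   # model mostly wrong
```

The confident sequence yields the smaller loss, and a model that assigned probability 1 to every correct token would reach a loss of zero.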

Lost-in-the-Middle

The tendency of long-context models to use information placed at the very start or end of the input far more reliably than information buried in the middle. It means where a fact sits in the prompt affects whether the model uses it, independent of the size of the context window.

M

MCP (Model Context Protocol)

An open protocol (originated at Anthropic, now Linux Foundation) that standardizes how AI models connect to external tools, data sources, and APIs. MCP provides a universal interface — analogous to USB-C for AI — replacing ad-hoc tool integrations.

MMLU (Massive Multitask Language Understanding)

A benchmark of ~15,000 multiple-choice questions across 57 academic subjects, widely used as a general knowledge metric for LLMs. Frontier models now score 90%+, leading to the creation of harder variants (MMLU-Pro).

MMMU (Massive Multi-discipline Multimodal Understanding)

A multimodal benchmark of college-level questions that pair text with images (diagrams, charts, tables) across many disciplines, the vision-language analog of MMLU. Frontier models now cluster in its harder MMMU-Pro variant within a few points, so the aggregate score no longer discriminates between them.

MTP (Multi-Token Prediction)

A training and inference scheme where a model predicts more than one future token per position using extra prediction heads, rather than a single next token. At inference the extra predictions seed a draft that the main model verifies, making MTP a built-in speculative decoder. DeepSeek-V3 and V4 ship MTP, and its depth-one setting (MTP-1) is the baseline DSpark is measured against.

MXFP4 (Microscaling 4-bit Float)

A 4-bit floating-point format that shares a microscaling exponent across small blocks of values, costing roughly 4.25 bits per parameter. OpenAI's gpt-oss shipped its mixture-of-experts weights natively in MXFP4, letting a 120B-parameter model fit on a single 80GB GPU.

Mechanistic Interpretability

A research discipline that reverse-engineers the internal algorithms a neural network has learned, down to individual features and circuits, rather than offering post-hoc explanations of its outputs. Its aim is to understand why a model produces a given output well enough to predict and intervene on its behavior.

Memory-Bandwidth Bound

The condition during autoregressive decoding where generation speed is limited by how fast weights can be read from memory, not by arithmetic throughput. Because producing one token streams the whole model, tokens-per-second tracks memory bandwidth — the insight behind speculative decoding and the platform gap in local inference.
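
The resulting ceiling on decode speed is a one-line division. The 7B-parameter model and the 1 TB/s memory bandwidth below are hypothetical round numbers, and the estimate ignores KV-cache reads and batching.

```python
def decode_tps_upper_bound(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on tokens/s when each token requires streaming all weights once."""
    return bandwidth_bytes_per_s / weight_bytes

n_params = 7e9            # hypothetical model size
bandwidth = 1e12          # hypothetical 1 TB/s memory bandwidth
fp16_tps = decode_tps_upper_bound(n_params * 2.0, bandwidth)   # 2 bytes per weight
int4_tps = decode_tps_upper_bound(n_params * 0.5, bandwidth)   # 0.5 bytes per weight
```

Under these assumptions the bound is roughly 71 tokens per second at 16-bit and four times that at 4-bit, which is why quantization speeds up single-stream decoding even when arithmetic is not the bottleneck.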

Mixed Precision Training

Training with a combination of floating-point precisions (e.g., FP16 or BF16 for forward/backward passes, FP32 for weight updates) to reduce memory usage and increase throughput without significant quality loss.

Mixture of Experts (MoE)

An architecture where each transformer layer contains multiple parallel feed-forward networks ("experts"), and a router selects a subset (typically 2-8) for each token. MoE scales total model knowledge without proportionally increasing per-token compute cost.

Model Collapse

The degradation that occurs when a generative model is trained recursively on its own (or another model's) synthetic outputs: errors compound and the tails of the distribution thin and then vanish, converging toward a low-variance, generic output. It is largely prevented by accumulating synthetic data on top of real data rather than replacing it.

Multi-Head Attention (MHA)

Running multiple self-attention operations in parallel, each with different learned projections (heads), then concatenating the results. MHA allows the model to attend to information from different representation subspaces simultaneously.

Multi-Head Latent Attention (MLA)

An attention variant introduced by DeepSeek that compresses key-value representations into a low-dimensional latent space rather than reducing the number of heads. MLA achieves greater KV-cache compression than GQA while maintaining full representational capacity.

Multi-Query Attention (MQA)

An extreme variant of GQA where all query heads share a single set of key and value projections. MQA offers maximum KV-cache reduction but may sacrifice some representational capacity compared to GQA.

Muon

A neural-network optimizer that updates weight matrices using an orthogonalized version of the gradient (via a Newton-Schulz iteration) rather than the per-coordinate scaling of Adam. It has gained traction for large-model pre-training — DeepSeek-V4 trained with it — by improving efficiency on matrix-shaped parameters.

N

Needle-in-a-Haystack

A long-context evaluation that hides a specific fact inside a large document and asks the model to retrieve it. Models pass it easily because the query shares vocabulary with the target, which is why it overstates real long-context ability; harder variants remove the lexical overlap and scores drop sharply.

O

Open Weights

Models whose trained parameters are publicly released for download and self-hosting, as distinct from fully open-source (which would also include training data and code) and from closed, API-only models. Open weights let practitioners fine-tune, quantize, and run a model on their own hardware.

P

PRM (Process Reward Model)

A reward model that scores the individual intermediate steps of a reasoning trace rather than only the final answer, used to guide search or rerank candidates during test-time scaling. Its value is capped by its own accuracy: an imperfect PRM causes search to optimize toward its mistakes.

PTQ (Post-Training Quantization)

Quantizing a finished, full-precision model to lower precision using a small calibration set and no gradient updates. Weight-only PTQ at 4-bit is close to lossless for many models; pushing activations to the same precision (W4A4) is far harder because of activation outliers.

Paged Attention

A memory management technique (introduced by vLLM) that stores KV-cache in non-contiguous pages, similar to virtual memory in operating systems. Paged attention eliminates memory fragmentation and enables efficient dynamic batching.

Pre-training

The initial phase of training a language model on a large corpus of text using self-supervised objectives (typically next-token prediction). Pre-training produces a base model with broad language understanding but no task-specific behavior.

Prompt Caching

An inference optimization that stores the computed KV-cache for a prompt prefix so later requests sharing that prefix skip re-processing it. It sharply cuts latency and cost for workloads with stable system prompts or long shared context, such as agent loops and multi-turn chat.

Prompt Engineering

The practice of designing and refining input prompts to elicit desired behavior from language models. Techniques include few-shot examples, system prompts, chain-of-thought instructions, and structured output formatting.

Q

QAT (Quantization-Aware Training)

A quantization approach that simulates low-precision rounding during training so the model's weights adapt to it, instead of lowering precision on a finished model after the fact. QAT recovers more accuracy than post-training methods at very low bit-widths, at the cost of a full training run.

QLoRA (Quantized LoRA)

Combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of large models on consumer hardware. The base model is loaded in NF4 precision while LoRA adapters train in higher precision.

Quantization

Reducing the numerical precision of model weights (and optionally activations) from higher-precision formats (FP32, FP16) to lower-precision formats (INT8, INT4, FP8). Quantization reduces memory usage and increases inference speed with a small quality tradeoff.
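
As a minimal sketch, symmetric per-tensor INT8 quantization with an absolute-maximum scale looks like this. It is one simple scheme chosen for illustration; the methods named elsewhere in this glossary (AWQ, GPTQ) are more elaborate, and the sample values are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return [0] * len(values), 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [qi * scale for qi in q]

weights = [0.02, -0.5, 0.31, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the scale is set by the largest magnitude, a single extreme value widens every rounding step, which is the problem activation outliers cause.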

R

RAG (Retrieval-Augmented Generation)

A pattern that grounds LLM responses in external knowledge by retrieving relevant documents and including them in the prompt context. RAG reduces hallucination, enables up-to-date knowledge, and is the dominant approach for enterprise LLM applications.

Read more: RAG in 2026

RLHF (Reinforcement Learning from Human Feedback)

An alignment technique where a reward model trained on human preference data guides the language model via reinforcement learning (typically PPO). RLHF steers model behavior toward helpful, harmless, and honest outputs.

$$L^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

Where:

  • r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the probability ratio between the new and old policy

  • Â_t is the estimated advantage (how much better the action was than expected)

  • ε (epsilon) is the clipping range (typically 0.1–0.2) that prevents large policy updates

  • π_θ is the policy (model) being trained

  • The min and clip together ensure the policy doesn’t change too drastically in a single update
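
The term inside the expectation can be sketched for a single action as follows (the function name is illustrative; ε = 0.2 is taken from the typical range given above):

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one action."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

gain_capped = ppo_clipped_term(ratio=1.5, advantage=1.0)    # capped at 1.2 * A
loss_kept = ppo_clipped_term(ratio=1.5, advantage=-1.0)     # not clipped: -1.5
unchanged = ppo_clipped_term(ratio=1.0, advantage=0.7)      # r = 1 returns A
```

The asymmetry is deliberate: the objective stops rewarding a move once the ratio leaves the clip range in the favorable direction, but it keeps the full penalty when the move makes things worse.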

RLVR (Reinforcement Learning with Verifiable Rewards)

Reinforcement learning post-training where the reward is a deterministic, programmatic check (a math-answer comparison, unit tests, a parser) rather than a learned reward model, so there is no learned reward model to exploit, although an imperfect checker can still be gamed. RLVR is the dominant paradigm for training reasoning models on math and code, where verification is cheap and exact.

RMSNorm

A simplified layer normalization that normalizes by root mean square only, skipping the mean-centering step of standard LayerNorm. RMSNorm is computationally cheaper and is used in most modern open-source LLMs.
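
A minimal sketch of the operation, assuming an optional learned per-dimension gain and a small `eps` for numerical safety (both are common in practice but are assumptions relative to the definition above):

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root mean square of x; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    g = gain if gain is not None else [1.0] * len(x)
    return [gi * v / rms for gi, v in zip(g, x)]

y = rms_norm([1.0, 2.0, 3.0, 4.0])
mean_square = sum(v * v for v in y) / len(y)
```

The output has a mean square of about 1 but keeps a nonzero mean, which is the difference from standard LayerNorm.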

ReAct (Reasoning and Acting)

An agent pattern that interleaves chain-of-thought reasoning with tool actions in a loop: the model reasons, takes an action, observes the result, and revises its plan. Introduced by Yao et al. (2022), it underpins most production agent architectures.

Reasoning Models

Language models trained to perform explicit step-by-step reasoning before producing a final answer, typically using inference-time compute scaling. Reasoning models (o1/o3, DeepSeek-R1, QwQ) excel at math, coding, and science tasks where deliberation improves accuracy.

Red-Teaming

Systematically probing an AI model to discover failure modes, safety vulnerabilities, and harmful outputs. Red-teaming involves crafting adversarial inputs designed to bypass safety measures and is a critical part of responsible AI deployment.

Reranking

A second retrieval stage that re-scores an initial set of candidate passages with a more expensive, higher-precision model before they reach the prompt. Rerankers (typically cross-encoders) trade latency for relevance, correcting cases where first-pass vector similarity surfaces the wrong passages.

Reward Hacking

When a model optimized against a reward maximizes that signal in ways that diverge from the intended goal, exploiting flaws in a learned reward model or an imperfect verifier instead of solving the task. It is the core reason RLHF needs a KL penalty and the central motivation for verifiable rewards.

RoPE (Rotary Position Embeddings)

A position encoding method that applies rotation matrices to query and key vectors, making the attention dot product naturally depend on relative token distance. RoPE enables better length generalization and is the standard position encoding in modern open models.

$$\text{RoPE}(x_m, m) = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \end{pmatrix} \odot \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \end{pmatrix} \odot \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \end{pmatrix}$$

Where:

  • x_m is the embedding vector at position m

  • m is the absolute token position

  • θ_i = 10000^{−2(i−1)/d}, for i = 1, …, d/2, are the rotation frequencies for each dimension pair (d is the embedding dimension)

  • ⊙ denotes element-wise multiplication

  • After rotation, the dot product q_m · k_n depends only on the relative distance (m − n)

  • This gives relative position encoding without any additional learnable parameters
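
The relative-distance property can be checked numerically with a short sketch. The pair index is counted from zero in the code, so the first pair rotates with frequency 1; the vectors and positions are arbitrary illustrative values.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by angle m * theta."""
    d = len(x)
    out = []
    for i in range(d // 2):               # zero-based pair index
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out += [x1 * c - x2 * s, x2 * c + x1 * s]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2
score_far = dot(rope(q, 103), rope(k, 100))    # positions 103 and 100
```

Both pairs of positions are three tokens apart, so the two scores agree up to floating-point error even though the absolute positions differ.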

S

SFT (Supervised Fine-Tuning)

A fine-tuning stage where a base model is trained on curated instruction-response pairs to follow instructions. SFT typically follows pre-training and precedes alignment (RLHF/DPO).

Scaling Laws

Empirical relationships (notably Chinchilla scaling laws) that predict model performance as a function of parameter count, dataset size, and compute budget. Scaling laws guide decisions about how to allocate training resources.

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_\infty$$

Where:

  • L is the test loss (lower is better)

  • N is the number of model parameters

  • D is the number of training tokens

  • α and β are empirically fitted exponents (typically ~0.34 and ~0.28)

  • A and B are fitted constants

  • L_∞ is the irreducible loss (entropy of natural language)

  • The Chinchilla finding: optimal training balances N and D such that tokens ≈ 20× parameters
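
The formula and the 20× rule of thumb can be evaluated directly. The constants A, B, and L_∞ below are the fitted values as commonly quoted from the Chinchilla paper; treat them as illustrative assumptions rather than numbers supplied by this glossary.

```python
def chinchilla_loss(n_params, n_tokens, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28, l_inf=1.69):
    """L(N, D) = A / N^alpha + B / D^beta + L_inf (constants are assumptions)."""
    return A / n_params**alpha + B / n_tokens**beta + l_inf

def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Rule of thumb: train on about 20 tokens per parameter."""
    return tokens_per_param * n_params

tokens = compute_optimal_tokens(70_000_000_000)          # 70B parameters
loss_at_optimum = chinchilla_loss(70e9, float(tokens))
loss_more_data = chinchilla_loss(70e9, 2.0 * tokens)
```

A 70B-parameter model lands at 1.4 trillion tokens under the rule, and adding data beyond that still lowers the predicted loss, only with diminishing returns toward L_∞.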

Self-Attention

The core mechanism in transformers where each token computes attention weights over all other tokens in the sequence, producing context-aware representations. Self-attention enables the model to capture long-range dependencies regardless of distance in the sequence.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Where:

  • Q (queries), K (keys), V (values) are linear projections of the input embeddings

  • d_k is the dimension of the key vectors (used for scaling to prevent large dot products)

  • QKᵀ computes the similarity between every pair of tokens

  • softmax normalizes the scores into attention weights that sum to 1

  • The result is a weighted sum of value vectors, where weights reflect token relevance
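
The bullets above translate into a short pure-Python sketch for a single head without masking. Matrices are plain lists of rows, and the two-token inputs are arbitrary illustrative values.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (outputs, attention weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(Q, K, V)
```

Each query matches its own key best, so each output row is pulled toward the corresponding value row while still mixing in the other.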

Self-Consistency

A test-time method that samples multiple chain-of-thought reasoning paths for the same question and returns the majority-vote answer. It improves accuracy on reasoning tasks over a single greedy decode but is subject to the same diminishing returns as other parallel sampling.

Semi-Autoregressive Decoding

A drafting strategy for speculative decoding that sits between fully sequential and fully parallel generation: a parallel backbone proposes all draft positions at once, then a lightweight head conditioned only on the immediately preceding token corrects each one. It recovers most of the acceptance length of sequential drafting (EAGLE-3 style) without paying the full per-token serial cost. DeepSeek's DSpark drafter is the 2026 example.

SimPO (Simple Preference Optimization)

A reference-free preference optimization method that replaces DPO's log-ratio against a frozen reference with the length-normalized average log-probability of the sequence plus a target reward margin. Removing the reference model halves memory and compute but also removes the anchor that limits drift.

Sliding Window Attention

An attention variant where each token attends only to a fixed-size window of recent tokens rather than the whole sequence. It bounds per-token cost and KV-cache growth at the expense of direct long-range recall, so it is often interleaved with full or sparse attention layers that restore it.

Softmax

A function that converts a vector of raw scores (logits) into a probability distribution, where each value is in (0,1) and all values sum to 1. Softmax is used in the attention mechanism to compute attention weights and in the output layer to produce token probabilities.

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Where:

  • z_i is the raw score (logit) for class/token i

  • K is the total number of classes/tokens

  • e^{z_i} exponentiates each score, making them positive

  • The denominator sums all exponentiated scores, ensuring the output is a valid probability distribution

  • Higher logits get exponentially more probability mass, making softmax a "soft" version of argmax
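
A minimal implementation is sketched below. Subtracting the maximum logit before exponentiating is a standard numerical-stability step that is not part of the formula above; it leaves the result unchanged because the shift cancels between numerator and denominator.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, shifting by the max to avoid overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 999.0, 998.1])  # would overflow without the shift
```

The two calls give the same distribution up to rounding, since the second set of logits is the first shifted by a constant.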

Sparse Attention

A family of attention mechanisms that restrict each query to a learned or fixed subset of key positions instead of attending densely over all tokens. Sparse attention (e.g., DeepSeek's DSA/NSA) cuts the cost of long-context inference by scoring far fewer entries per token while preserving most of the recall of full attention.

Sparse Autoencoder

A neural network trained to decompose a model's internal activations into a sparse set of interpretable features. Sparse autoencoders are the primary tool in mechanistic interpretability for understanding what individual neurons and circuits represent.

Speculative Decoding

An inference acceleration technique where a smaller "draft" model generates candidate token sequences that the larger "target" model verifies in parallel. Accepted tokens skip expensive sequential generation, yielding 2-3x speedups with no quality loss.
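
One draft-and-verify step can be sketched with toy stand-ins for the two models. This shows the greedy variant only, in which a draft token is accepted when it equals the target's own greedy choice; sampling-based acceptance is more involved. `draft_next` and `target_next` are hypothetical callables, and a real system would verify all draft positions in one batched forward pass instead of a loop.

```python
def speculative_step(prefix, draft_next, target_next, n_draft=4):
    """Return the tokens committed by one draft-and-verify step (greedy variant)."""
    # 1. The draft model proposes n_draft tokens sequentially.
    draft, ctx = [], list(prefix)
    for _ in range(n_draft):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)
    # 2. The target checks each proposal against its own greedy choice.
    accepted, ctx = [], list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))   # all accepted: one bonus token
    return accepted

# Toy models over digits: the target counts upward, the draft errs after a 3.
target_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
committed = speculative_step([1], draft_next, target_next)
```

The committed tokens are exactly what the target would have produced on its own, which is why the technique preserves output quality; the speedup comes from committing several tokens per target pass.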

Superposition

The phenomenon where a model represents more distinct features than it has neurons by encoding them as overlapping directions in activation space. Superposition is why individual neurons are polysemantic and why sparse autoencoders are needed to recover interpretable features.

SwiGLU

An activation function combining the Swish activation with a Gated Linear Unit. SwiGLU has become the standard FFN activation in modern LLMs, replacing ReLU and GELU, offering improved training dynamics at a modest parameter increase.

Synthetic Data

Training data generated by a model rather than collected from humans. It adds signal when it flows from a stronger teacher (distillation) or is filtered by an external verifier, and degrades the model when it merely recycles the model's own distribution, which causes model collapse.

T

TPS (Tokens Per Second)

The rate at which a model generates output tokens during the decode phase. TPS measures generation throughput and is the primary metric for streaming response quality.

TTFT (Time to First Token)

The latency from when a request is sent to when the first output token is generated. TTFT is dominated by the prefill phase (processing the input prompt) and is a critical metric for interactive applications.

Tensor Parallelism

Distributing a single model layer across multiple GPUs by splitting weight matrices along specific dimensions. Tensor parallelism enables serving models too large for a single GPU's memory, at the cost of inter-GPU communication overhead.

Tokenization

The process of converting raw text into a sequence of integer token IDs that a model can process. Modern LLMs use subword tokenizers (BPE, SentencePiece) that balance vocabulary size with the ability to represent any text.

Tool Use (Function Calling)

The capability that lets a model invoke external functions, APIs, or services by emitting a structured call that an application executes and feeds back into the context. It is the foundation of agentic systems and is increasingly standardized through interfaces like MCP.

Transformer

The dominant neural network architecture for language models, introduced in "Attention Is All You Need" (2017). Transformers process sequences in parallel using self-attention and feed-forward layers, replacing the sequential processing of RNNs.

V

VLM (Vision-Language Model)

A model that accepts both images and text by encoding image patches into tokens that share the language model's sequence, letting it reason over visual and textual content together. A high-resolution image can cost as many tokens as several pages of text, which makes image resolution a primary cost and latency lever.

Vector Database

A database optimized for storing and querying high-dimensional embedding vectors using approximate nearest neighbor (ANN) search. Vector databases (Pinecone, Weaviate, Qdrant, pgvector) are the retrieval backbone of RAG systems.