AI & ML Glossary
Definitions, related concepts, and links to deeper reading for the terms that matter in artificial intelligence and machine learning.
A
AWQ (Activation-Aware Weight Quantization)
A quantization method that identifies and preserves the most important weights (those corresponding to large activations) at higher precision while aggressively quantizing the rest. AWQ achieves better quality than naive INT4 quantization.
Agentic AI
AI systems that autonomously plan, use tools, and take multi-step actions to accomplish goals, as opposed to single-turn question-answering. Agentic architectures combine LLMs with tool use, memory, and planning loops.
Alignment
The process of steering a model's behavior to be helpful, harmless, and honest — matching human values and intentions. Alignment techniques include RLHF, DPO, GRPO, and Constitutional AI, applied after pre-training and SFT.
B
Benchmark
A standardized evaluation dataset or task used to measure and compare model capabilities. Common LLM benchmarks include MMLU (knowledge), HumanEval (coding), GSM8K (math), and AIME (competition math).
C
Causal Language Modeling
The training objective used by decoder-only models: predict the next token given all preceding tokens. A causal mask prevents each position from attending to future tokens, enforcing the left-to-right generation order.
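A minimal sketch of the causal mask in PyTorch (illustrative shapes, not any particular model's code):

    import torch

    T = 5                                    # sequence length
    scores = torch.randn(T, T)               # raw attention scores (query x key)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
    weights = torch.softmax(scores, dim=-1)  # row t only weights tokens <= t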
Chain-of-Thought (CoT)
A prompting technique (and training objective) where the model generates intermediate reasoning steps before the final answer. CoT dramatically improves performance on multi-step tasks and is the foundation of reasoning model behavior.
Constitutional AI
An alignment approach (introduced by Anthropic) where the model critiques and revises its own outputs according to a set of written principles ("constitution"), reducing reliance on human feedback for identifying harmful outputs.
Context Window
The maximum number of tokens a model can process in a single forward pass. Modern frontier models support 128K-1M+ tokens. Longer context windows enable processing entire documents, but attention compute grows quadratically with length, and memory grows quadratically in naive implementations (linearly with optimized attention such as FlashAttention).
Continuous Batching
A serving optimization that dynamically adds and removes requests from a running batch as they arrive and complete, rather than waiting for an entire batch to finish. Continuous batching dramatically improves GPU utilization and throughput.
D
DPO (Direct Preference Optimization)
An alignment method that skips the reward model entirely by directly optimizing the language model on preference pairs. DPO is simpler and more stable than RLHF while achieving comparable results.
L_DPO(θ) = −E_{(x, y_w, y_l)} [ log σ( β log(π_θ(y_w|x) / π_ref(y_w|x)) − β log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]
Where:
π_θ is the policy (model) being trained
π_ref is the frozen reference model (typically the SFT checkpoint)
x is the input prompt
y_w is the preferred (winning) response
y_l is the dispreferred (losing) response
β (beta) controls how much the model is penalized for deviating from the reference policy
σ is the sigmoid function
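A minimal sketch of this loss in PyTorch, assuming the per-response log-probabilities have already been summed over tokens (variable names are illustrative):

    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        # Log-ratios of the policy vs. the frozen reference model
        margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
        # -log σ(β · margin): push the winner's ratio above the loser's
        return -F.logsigmoid(beta * margin).mean()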
Decoder
A transformer stack that generates tokens autoregressively using causal (masked) attention — each token can only attend to previous tokens. Virtually all modern generative LLMs (GPT, Claude, Gemini, Qwen, DeepSeek) use decoder-only architectures.
Distillation
Transferring knowledge from a larger "teacher" model to a smaller "student" model by training the student to replicate the teacher's output distribution. Distillation is the primary method for creating smaller models that retain reasoning capabilities.
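A common form of the distillation objective (a sketch, not any specific paper's exact recipe) matches the student to the teacher's temperature-softened distribution:

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, T=2.0):
        # Soften both distributions with temperature T, then match with KL
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        # T^2 keeps gradient magnitudes comparable across temperatures
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T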
DoRA (Weight-Decomposed Low-Rank Adaptation)
An evolution of LoRA that decomposes pre-trained weights into magnitude and direction components, then applies low-rank adaptation only to the direction. DoRA closes more of the gap between LoRA and full fine-tuning.
E
Embeddings
Dense vector representations of text (words, sentences, or documents) in a continuous vector space, where semantic similarity corresponds to vector proximity. Embeddings are the foundation of retrieval systems, search, and clustering.
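Semantic similarity is typically measured with cosine similarity between vectors; a NumPy sketch with invented values:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    cat = np.array([0.8, 0.1, 0.3])        # embedding values invented
    kitten = np.array([0.75, 0.2, 0.35])   # for illustration
    print(cosine_similarity(cat, kitten))  # close to 1.0: semantically similar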
Encoder
A transformer stack that processes the full input sequence bidirectionally (each token attends to all others). Encoders produce rich contextual representations and are used in models like BERT for classification and embedding tasks.
Expert Routing
The mechanism in MoE models that decides which experts process each token. Routing strategies include top-k selection, auxiliary load-balancing losses, and DeepSeek's auxiliary-loss-free approach. Router quality directly impacts model performance and training stability.
F
FP8 (8-bit Floating Point)
An 8-bit floating-point format used for both training and inference. DeepSeek-V3 pioneered FP8 training at scale, demonstrating that 8-bit precision is sufficient for pre-training frontier models, roughly doubling throughput versus FP16.
Fine-tuning
Adapting a pre-trained model to a specific task or domain by continuing training on a smaller, targeted dataset. Fine-tuning can be full-parameter or parameter-efficient (LoRA, QLoRA).
FlashAttention
An IO-aware exact attention algorithm that restructures the attention computation to minimize GPU memory reads/writes (HBM access). FlashAttention achieves 2-4x speedups over standard attention without any approximation.
G
GGUF
The model file format used by the llama.cpp ecosystem, supporting a range of quantization levels (Q2 through Q8) with per-layer strategies. GGUF models run on CPU and Apple Silicon, making them the standard for local and edge deployment.
GPTQ
A post-training weight quantization method that uses approximate second-order (Hessian) information to minimize quantization error layer by layer. GPTQ produces INT4/INT3 models optimized for GPU inference.
GRPO (Group Relative Policy Optimization)
A reinforcement learning alignment technique introduced by DeepSeek that scores each sampled response relative to the rest of its group, eliminating the separate value (critic) model that PPO requires. GRPO was key to training DeepSeek-R1's reasoning capabilities.
J_GRPO(θ) = E[ (1/G) Σ_{i=1}^{G} (1/|o_i|) Σ_{t=1}^{|o_i|} min( ρ_{i,t}(θ) Â_i, clip(ρ_{i,t}(θ), 1−ε, 1+ε) Â_i ) ] − β D_KL(π_θ ‖ π_ref)
Where:
G is the number of sampled responses per prompt (the group size)
o_i is the i-th sampled response, |o_i| is its length in tokens
ρ_{i,t}(θ) = π_θ(o_{i,t} | q, o_{i,<t}) / π_θ_old(o_{i,t} | q, o_{i,<t}) is the token-level probability ratio between the current policy and the policy that sampled the group for prompt q
ε is the clipping range limiting how far a single update can move the policy
π_θ is the policy being trained; π_ref is the reference policy
Â_i is the advantage for response i, computed relative to the group: Â_i = (r_i − mean(r)) / std(r), where r_i is the reward for response i
β is the KL divergence penalty weight
D_KL is the KL divergence regularizer preventing the policy from drifting too far from the reference
The key innovation: advantages are computed from the group's own rewards, eliminating the need for a separate critic (value) model
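A sketch of the group-relative advantage computation, the distinctive piece, assuming one scalar reward per sampled response:

    import torch

    def group_advantages(rewards, eps=1e-8):
        # rewards: shape (G,), one reward per response in the group
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])  # e.g. pass/fail verdicts
    print(group_advantages(rewards))  # positive for above-average responses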
Gradient Checkpointing
A memory optimization technique that trades compute for memory during training by recomputing intermediate activations during the backward pass instead of storing them. Essential for training large models on limited GPU memory.
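In PyTorch this is available as torch.utils.checkpoint; a minimal sketch:

    import torch
    from torch.utils.checkpoint import checkpoint

    block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
    x = torch.randn(8, 512, requires_grad=True)

    # Activations inside `block` are not stored; they are recomputed on backward
    y = checkpoint(block, x, use_reentrant=False)
    y.sum().backward()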
GraphRAG
An evolution of RAG that structures retrieved knowledge as a graph (entities and relationships) rather than flat document chunks. GraphRAG improves answers to multi-hop questions that require synthesizing information across multiple sources.
Grouped Query Attention (GQA)
A memory-efficient variant of multi-head attention that shares key and value projections across groups of query heads. GQA reduces KV-cache size by the grouping factor (e.g., 8x) with minimal quality loss, and is used in most modern open models.
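A sketch of the key idea (shapes illustrative): K and V use fewer heads, and each is broadcast across its group of query heads:

    import torch

    n_q, n_kv, T, d = 8, 2, 16, 64   # 4 query heads per KV head
    q = torch.randn(n_q, T, d)
    k = torch.randn(n_kv, T, d)      # KV-cache is 4x smaller
    v = torch.randn(n_kv, T, d)

    # Share each KV head with its group of query heads
    k = k.repeat_interleave(n_q // n_kv, dim=0)
    v = v.repeat_interleave(n_q // n_kv, dim=0)
    out = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1) @ v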
H
Hallucination
When a language model generates text that is fluent and confident but factually incorrect or fabricated. Hallucination is a fundamental challenge in LLMs arising from the model's tendency to produce plausible-sounding text regardless of factual grounding.
I
Inference-Time Compute
The paradigm of improving model performance by spending more computation during inference (generating longer reasoning chains, exploring multiple solution paths) rather than during training. This is the core insight behind reasoning models.
K
KV-Cache
During autoregressive generation, the cached key and value tensors from previous tokens so they don't need to be recomputed at each step. The KV-cache grows linearly with sequence length and is often the primary memory bottleneck during inference.
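A schematic of the pattern (not a real model): each decode step computes K and V only for the newest token and appends them to the cache:

    import torch

    k_cache, v_cache = [], []

    def decode_step(h_new, W_k, W_v):
        # h_new: hidden state of the single newest token
        k_cache.append(h_new @ W_k)   # compute K/V once per token...
        v_cache.append(h_new @ W_v)
        K = torch.stack(k_cache)      # ...and reuse all earlier entries;
        V = torch.stack(v_cache)      # the cache grows linearly with length
        return K, V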
L
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that freezes the original model weights and injects small trainable low-rank matrices into each layer. LoRA typically trains ~1% of total parameters while achieving results comparable to full fine-tuning.
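A minimal sketch of a LoRA-adapted linear layer (rank r and scaling alpha are the usual hyperparameters; names illustrative):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)  # freeze original weights
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as no-op
            self.scale = alpha / r

        def forward(self, x):
            # Frozen path plus the trainable low-rank update B·A
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)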
Loss Function
A mathematical function that measures the difference between a model's predictions and the target values. For language models, the standard pre-training loss is cross-entropy over next-token predictions.
L(θ) = −(1/T) Σ_{t=1}^{T} log P_θ(x_t | x_{<t})
Where:
T is the sequence length (number of tokens)
x_t is the target token at position t
x_{<t} is all tokens before position t (the context)
P_θ(x_t | x_{<t}) is the model's predicted probability for the correct token
The negative log means: higher probability for the correct token → lower loss
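In PyTorch this is the standard shift-by-one cross-entropy (a sketch with dummy tensors):

    import torch
    import torch.nn.functional as F

    T, vocab = 16, 1000
    logits = torch.randn(T, vocab)          # model outputs, one row per position
    tokens = torch.randint(0, vocab, (T,))  # the actual token sequence

    # Position t predicts token t+1, hence the shift by one
    loss = F.cross_entropy(logits[:-1], tokens[1:])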
M
MCP (Model Context Protocol)
An open protocol (originated at Anthropic, now Linux Foundation) that standardizes how AI models connect to external tools, data sources, and APIs. MCP provides a universal interface — analogous to USB-C for AI — replacing ad-hoc tool integrations.
MMLU (Massive Multitask Language Understanding)
A benchmark of ~15,000 multiple-choice questions across 57 academic subjects, widely used as a general knowledge metric for LLMs. Frontier models now score 90%+, leading to the creation of harder variants (MMLU-Pro).
Mixed Precision Training
Training with a combination of floating-point precisions (e.g., FP16 or BF16 for forward/backward passes, FP32 for weight updates) to reduce memory usage and increase throughput without significant quality loss.
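The usual PyTorch pattern is autocast plus a gradient scaler for FP16 (BF16 generally needs no scaler); a sketch that assumes a CUDA GPU:

    import torch

    model = torch.nn.Linear(512, 512).cuda()
    opt = torch.optim.AdamW(model.parameters())
    scaler = torch.cuda.amp.GradScaler()
    x = torch.randn(8, 512, device="cuda")

    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()  # forward runs in FP16
    scaler.scale(loss).backward()      # loss scaled to avoid FP16 underflow
    scaler.step(opt)                   # weight update remains in FP32
    scaler.update()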
Mixture of Experts (MoE)
An architecture where each transformer layer contains multiple parallel feed-forward networks ("experts"), and a router selects a subset (typically 2-8) for each token. MoE scales total model knowledge without proportionally increasing per-token compute cost.
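A sketch of top-k routing for a single token (conventions for normalizing the gate weights vary between models):

    import torch

    n_experts, d, k = 8, 512, 2
    experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
    router = torch.nn.Linear(d, n_experts)

    x = torch.randn(d)                        # one token's hidden state
    gates = torch.softmax(router(x), dim=-1)  # router scores per expert
    top = torch.topk(gates, k)                # select the k best experts
    out = sum(w * experts[int(i)](x) for w, i in zip(top.values, top.indices))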
Multi-Head Attention (MHA)
Running multiple self-attention operations in parallel, each with different learned projections (heads), then concatenating the results. MHA allows the model to attend to information from different representation subspaces simultaneously.
Multi-Head Latent Attention (MLA)
An attention variant introduced by DeepSeek that compresses key-value representations into a low-dimensional latent space rather than reducing the number of heads. MLA achieves greater KV-cache compression than GQA while maintaining full representational capacity.
Multi-Query Attention (MQA)
An extreme variant of GQA where all query heads share a single set of key and value projections. MQA offers maximum KV-cache reduction but may sacrifice some representational capacity compared to GQA.
P
Paged Attention
A memory management technique (introduced by vLLM) that stores KV-cache in non-contiguous pages, similar to virtual memory in operating systems. Paged attention eliminates memory fragmentation and enables efficient dynamic batching.
Pre-training
The initial phase of training a language model on a large corpus of text using self-supervised objectives (typically next-token prediction). Pre-training produces a base model with broad language understanding but no task-specific behavior.
Prompt Engineering
The practice of designing and refining input prompts to elicit desired behavior from language models. Techniques include few-shot examples, system prompts, chain-of-thought instructions, and structured output formatting.
Q
QLoRA (Quantized LoRA)
Combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of large models on consumer hardware. The base model is loaded in NF4 precision while LoRA adapters train in higher precision.
Quantization
Reducing the numerical precision of model weights (and optionally activations) from higher-precision formats (FP32, FP16) to lower-precision formats (INT8, INT4, FP8). Quantization reduces memory usage and increases inference speed with a small quality tradeoff.
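A sketch of symmetric per-tensor INT8 quantization, the simplest scheme; production methods use per-channel or per-group scales:

    import torch

    w = torch.randn(4, 4)               # FP32 weights
    scale = w.abs().max() / 127         # symmetric per-tensor scale
    w_int8 = (w / scale).round().clamp(-127, 127).to(torch.int8)
    w_dequant = w_int8.float() * scale  # approximate reconstruction
    print((w - w_dequant).abs().max())  # worst-case quantization error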
R
RAG (Retrieval-Augmented Generation)
A pattern that grounds LLM responses in external knowledge by retrieving relevant documents and including them in the prompt context. RAG reduces hallucination, enables up-to-date knowledge, and is the dominant approach for enterprise LLM applications.
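The pattern at pseudocode level, where embed(), vector_index, and llm() are hypothetical helpers rather than any specific library's API:

    def answer(question, vector_index, k=5):
        # 1. Retrieve the k chunks most similar to the question
        docs = vector_index.search(embed(question), top_k=k)
        # 2. Ground the model by placing the chunks in the prompt
        context = "\n\n".join(d.text for d in docs)
        prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
        # 3. Generate the grounded answer
        return llm(prompt)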
RLHF (Reinforcement Learning from Human Feedback)
An alignment technique where a reward model trained on human preference data guides the language model via reinforcement learning (typically PPO). RLHF steers model behavior toward helpful, harmless, and honest outputs.
L_CLIP(θ) = E_t [ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]
Where:
r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the probability ratio between the new and old policy
Â_t is the estimated advantage (how much better the action was than expected)
ε (epsilon) is the clipping range (typically 0.1–0.2) that prevents large policy updates
π_θ is the policy (model) being trained
The min and clip together ensure the policy doesn't change too drastically in a single update
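A sketch of the clipped surrogate in PyTorch, given per-token log-probabilities and advantages:

    import torch

    def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
        ratio = torch.exp(logp_new - logp_old)  # r_t(θ)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        # Maximizing the objective = minimizing its negative
        return -torch.min(unclipped, clipped).mean()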
RMSNorm
A simplified layer normalization that normalizes by root mean square only, skipping the mean-centering step of standard LayerNorm. RMSNorm is computationally cheaper and is used in most modern open-source LLMs.
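The whole operation is a few lines (a sketch of the common formulation):

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        def __init__(self, d, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(d))  # learned gain
            self.eps = eps

        def forward(self, x):
            # Normalize by root mean square only (no mean subtraction)
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return x * rms * self.weight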
Reasoning Models
Language models trained to perform explicit step-by-step reasoning before producing a final answer, typically using inference-time compute scaling. Reasoning models (o1/o3, DeepSeek-R1, QwQ) excel at math, coding, and science tasks where deliberation improves accuracy.
Red-Teaming
Systematically probing an AI model to discover failure modes, safety vulnerabilities, and harmful outputs. Red-teaming involves crafting adversarial inputs designed to bypass safety measures and is a critical part of responsible AI deployment.
RoPE (Rotary Position Embeddings)
A position encoding method that applies rotation matrices to query and key vectors, making the attention dot product naturally depend on relative token distance. RoPE enables better length generalization and is the standard position encoding in modern open models.
RoPE(x_m, m) = x_m ⊙ cos(mθ) + rotate_half(x_m) ⊙ sin(mθ)
Where:
x_m is the embedding vector at position m
m is the absolute token position
θ_i = 10000^{−2i/d} are the rotation frequencies for each dimension pair; cos(mθ) and sin(mθ) apply cos(mθ_i) and sin(mθ_i) elementwise to pair i
rotate_half rotates each dimension pair a quarter turn: (x⁽¹⁾, x⁽²⁾) → (−x⁽²⁾, x⁽¹⁾)
⊙ denotes element-wise multiplication
After rotation, the dot product q_m · k_n depends only on the relative distance (m − n)
This gives relative position encoding without any additional learnable parameters
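A sketch of the element-wise form above; pairing conventions differ between implementations (this uses the split-halves layout popularized by GPT-NeoX):

    import torch

    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def apply_rope(x, m, base=10000):
        d = x.shape[-1]
        theta = base ** (-torch.arange(0, d, 2).float() / d)  # frequencies θ_i
        angles = m * torch.cat((theta, theta))                # one angle per dim
        return x * angles.cos() + rotate_half(x) * angles.sin()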
S
SFT (Supervised Fine-Tuning)
A fine-tuning stage where a base model is trained on curated instruction-response pairs to follow instructions. SFT typically follows pre-training and precedes alignment (RLHF/DPO).
Scaling Laws
Empirical relationships (notably Chinchilla scaling laws) that predict model performance as a function of parameter count, dataset size, and compute budget. Scaling laws guide decisions about how to allocate training resources.
L(N, D) = L_∞ + A / N^α + B / D^β
Where:
L is the test loss (lower is better)
N is the number of model parameters
D is the number of training tokens
α and β are empirically fitted exponents (typically ~0.34 and ~0.28)
A and B are fitted constants
L_∞ is the irreducible loss (entropy of natural language)
The Chinchilla finding: optimal training balances N and D such that tokens ≈ 20× parameters
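A worked example of the 20x rule of thumb, using the standard ~6·N·D estimate for training FLOPs:

    N = 7e9      # parameters (a 7B model)
    D = 20 * N   # Chinchilla-optimal tokens: ~20x parameters
    C = 6 * N * D  # rough training compute: ~6 FLOPs per param per token
    print(f"{D:.2e} tokens, {C:.2e} FLOPs")  # 1.40e+11 tokens, 5.88e+21 FLOPs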
Self-Attention
The core mechanism in transformers where each token computes attention weights over all other tokens in the sequence, producing context-aware representations. Self-attention enables the model to capture long-range dependencies regardless of distance in the sequence.
Attention(Q, K, V) = softmax( QKᵀ / √d_k ) V
Where:
Q (queries), K (keys), V (values) are linear projections of the input embeddings
d_k is the dimension of the key vectors (used for scaling to prevent large dot products)
QKᵀ computes the similarity between every pair of tokens
softmax normalizes the scores into attention weights that sum to 1
The result is a weighted sum of value vectors, where weights reflect token relevance
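The formula above in a few lines of PyTorch (single head, no masking, for clarity):

    import torch

    T, d_k = 8, 64
    Q, K, V = (torch.randn(T, d_k) for _ in range(3))

    scores = Q @ K.T / d_k ** 0.5            # similarity of every query-key pair
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    output = weights @ V                     # weighted sum of value vectors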
Softmax
A function that converts a vector of raw scores (logits) into a probability distribution, where each value is in (0,1) and all values sum to 1. Softmax is used in the attention mechanism to compute attention weights and in the output layer to produce token probabilities.
softmax(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}
Where:
z_i is the raw score (logit) for class/token i
K is the total number of classes/tokens
e^{z_i} exponentiates each score, making them positive
The denominator sums all exponentiated scores, ensuring the output is a valid probability distribution
Higher logits get exponentially more probability mass, making softmax a "soft" version of argmax
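In practice, implementations subtract the maximum logit first; this is mathematically identical but avoids overflow in the exponential:

    import numpy as np

    def softmax(z):
        z = z - z.max()  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.66, 0.24, 0.10]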
Sparse Autoencoder
A neural network trained to decompose a model's internal activations into a sparse set of interpretable features. Sparse autoencoders are a primary tool in mechanistic interpretability, recovering interpretable features from activations where individual neurons are often polysemantic.
Speculative Decoding
An inference acceleration technique where a smaller "draft" model generates candidate token sequences that the larger "target" model verifies in parallel. Accepted tokens skip expensive sequential generation, yielding 2-3x speedups with no quality loss.
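A schematic of the draft-and-verify loop, where draft_model and target_model are hypothetical helpers (real implementations add a rejection-sampling correction to preserve the target distribution exactly):

    def speculative_step(tokens, draft_model, target_model, k=4):
        # Small draft model cheaply proposes k tokens, one at a time
        proposed = draft_model.generate(tokens, max_new_tokens=k)
        # Large target model scores all k proposals in ONE parallel forward pass
        accepted = target_model.verify(tokens, proposed)  # longest agreeing prefix
        return tokens + accepted  # next iteration drafts from here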
SwiGLU
An activation function combining the Swish activation with a Gated Linear Unit. SwiGLU has become the standard FFN activation in modern LLMs, replacing ReLU and GELU, offering improved training dynamics at a modest parameter increase.
T
TPS (Tokens Per Second)
The rate at which a model generates output tokens during the decode phase. TPS measures generation throughput and is the primary metric for streaming response quality.
TTFT (Time to First Token)
The latency from when a request is sent to when the first output token is generated. TTFT is dominated by the prefill phase (processing the input prompt) and is a critical metric for interactive applications.
Tensor Parallelism
Distributing a single model layer across multiple GPUs by splitting weight matrices along specific dimensions. Tensor parallelism enables serving models too large for a single GPU's memory, at the cost of inter-GPU communication overhead.
Tokenization
The process of converting raw text into a sequence of integer token IDs that a model can process. Modern LLMs use subword tokenizers (BPE, SentencePiece) that balance vocabulary size with the ability to represent any text.
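For example, with a Hugging Face tokenizer (the model name is just an example):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")   # a BPE tokenizer
    ids = tok.encode("Tokenization splits text")  # list of integer token IDs
    print(tok.convert_ids_to_tokens(ids))         # the subword pieces
    print(tok.decode(ids))                        # round-trips to the original text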
Transformer
The dominant neural network architecture for language models, introduced in "Attention Is All You Need" (2017). Transformers process sequences in parallel using self-attention and feed-forward layers, replacing the sequential processing of RNNs.
V
Vector Database
A database optimized for storing and querying high-dimensional embedding vectors using approximate nearest neighbor (ANN) search. Vector databases (Pinecone, Weaviate, Qdrant, pgvector) are the retrieval backbone of RAG systems.