Reference

AI & ML Glossary.

Definitions, related concepts, and links to deeper reading for the terms that matter in artificial intelligence and machine learning.

A

AWQ (Activation-Aware Weight Quantization)

A quantization method that identifies the most important weight channels (those that see large activations) and protects them during quantization by rescaling them, rather than storing them at higher precision, while quantizing all weights to a low bit-width. AWQ achieves better quality than naive round-to-nearest INT4 quantization.

Related: Quantization, GPTQ

Agentic AI

AI systems that autonomously plan, use tools, and take multi-step actions to accomplish goals, as opposed to single-turn question-answering. Agentic architectures combine LLMs with tool use, memory, and planning loops.

Alignment

The process of steering a model's behavior to be helpful, harmless, and honest — matching human values and intentions. Alignment techniques include RLHF, DPO, GRPO, and Constitutional AI, applied after pre-training and SFT.

B

Benchmark

A standardized evaluation dataset or task used to measure and compare model capabilities. Common LLM benchmarks include MMLU (knowledge), HumanEval (coding), GSM8K (math), and AIME (competition math).

C

Causal Language Modeling

The training objective used by decoder-only models: predict the next token given all preceding tokens. A causal mask prevents each position from attending to future tokens, enforcing the left-to-right generation order.

Related: Decoder, Pre-training
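
As a minimal sketch of the causal mask and the next-token training pairs described in this entry (the function names are illustrative, not taken from any library):

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix that is True where attention is allowed."""
    # Row = query position, column = key position; a query may only see
    # keys at the same or an earlier position.
    return [[key_pos <= query_pos for key_pos in range(seq_len)]
            for query_pos in range(seq_len)]


def next_token_pairs(token_ids):
    """Pair every prefix with the single token the model must predict next."""
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]


mask = causal_mask(4)
pairs = next_token_pairs([10, 11, 12, 13])
```

The lower-triangular mask is what lets one forward pass train on every prefix of the sequence at once.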

Chain-of-Thought (CoT)

A prompting technique (and training objective) where the model generates intermediate reasoning steps before the final answer. CoT dramatically improves performance on multi-step tasks and is the foundation of reasoning model behavior.

Constitutional AI

An alignment approach (introduced by Anthropic) where the model critiques and revises its own outputs according to a set of written principles ("constitution"), reducing reliance on human feedback for identifying harmful outputs.

Context Window

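The maximum number of tokens a model can process in a single forward pass. Modern frontier models support 128K to 1M+ tokens. Longer context windows enable processing entire documents, but the cost grows with length: attention compute scales quadratically with sequence length, and attention memory scales quadratically in a naive implementation or linearly with optimized attention such as FlashAttention.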

Continuous Batching

A serving optimization that dynamically adds and removes requests from a running batch as they arrive and complete, rather than waiting for an entire batch to finish. Continuous batching dramatically improves GPU utilization and throughput.
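
A toy simulation can make the utilization gain concrete. The sketch below assumes every request is already queued, each request needs a known number of decode steps, and prefill cost is ignored; both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch must finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch waits for its longest request.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token per running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 8]
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))  # 16 12
```

In this toy case the short requests no longer hold a slot idle while the long ones finish, which is where the saved steps come from.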

D

DPO (Direct Preference Optimization)

An alignment method that skips the reward model entirely by directly optimizing the language model on preference pairs. DPO is simpler and more stable than RLHF while achieving comparable results.

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]
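
The loss above can be evaluated for a single preference pair from four sequence log-probabilities. This is a minimal sketch: the argument names and the default `beta=0.1` are assumptions for illustration, not values from the source.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss.

    *_w is the log-probability of the preferred response y_w,
    *_l is the log-probability of the dispreferred response y_l.
    """
    # Argument of the sigmoid: beta times the difference of log-ratios.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred response relative to the dispreferred one.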

Decoder

A transformer stack that generates tokens autoregressively using causal (masked) attention — each token can only attend to previous tokens. Virtually all modern generative LLMs (GPT, Claude, Gemini, Qwen, DeepSeek) use decoder-only architectures.

Distillation

Transferring knowledge from a larger "teacher" model to a smaller "student" model by training the student to replicate the teacher's output distribution. Distillation is the primary method for creating smaller models that retain reasoning capabilities.

DoRA (Weight-Decomposed Low-Rank Adaptation)

An evolution of LoRA that decomposes pre-trained weights into magnitude and direction components, then applies low-rank adaptation only to the direction. DoRA closes more of the gap between LoRA and full fine-tuning.

E

Embeddings

Dense vector representations of text (words, sentences, or documents) in a continuous vector space, where semantic similarity corresponds to vector proximity. Embeddings are the foundation of retrieval systems, search, and clustering.
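
Vector proximity is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not produced by any real model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

closer = cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice)
```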

Encoder

A transformer stack that processes the full input sequence bidirectionally (each token attends to all others). Encoders produce rich contextual representations and are used in models like BERT for classification and embedding tasks.

Related: Decoder, Transformer

Expert Routing

The mechanism in MoE models that decides which experts process each token. Routing strategies include top-k selection, auxiliary load-balancing losses, and DeepSeek's auxiliary-loss-free approach. Router quality directly impacts model performance and training stability.

Related: Mixture of Experts (MoE)

F

FP8 (8-bit Floating Point)

An 8-bit floating-point format used for both training and inference. DeepSeek-V3 was among the first openly documented frontier models trained with FP8 at scale, showing that 8-bit precision can suffice for most pre-training computation. On hardware with native FP8 support, peak matrix-multiply throughput is roughly double that of FP16.

Fine-tuning

Adapting a pre-trained model to a specific task or domain by continuing training on a smaller, targeted dataset. Fine-tuning can be full-parameter or parameter-efficient (LoRA, QLoRA).

FlashAttention

An IO-aware exact attention algorithm that restructures the attention computation to minimize GPU memory reads/writes (HBM access). FlashAttention achieves 2-4x speedups over standard attention without any approximation.

Related: Self-Attention, KV-Cache

G

GGUF

The model file format used by the llama.cpp ecosystem, supporting a range of quantization levels (Q2 through Q8) with per-layer strategies. GGUF models run on CPUs and Apple Silicon, with optional GPU offload, making them the standard for local and edge deployment.

Related: Quantization, GPTQ

GPTQ

A post-training weight quantization method that uses approximate second-order (Hessian) information to minimize quantization error layer by layer. GPTQ produces INT4/INT3 models optimized for GPU inference.

GRPO (Group Relative Policy Optimization)

A reinforcement learning alignment technique introduced by DeepSeek that scores a group of sampled responses to the same prompt relative to each other, rather than training a separate critic (value) model to estimate a baseline. GRPO was key to training DeepSeek-R1's reasoning capabilities.

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x \sim D} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( \frac{\pi_\theta(o_{i,t} | x, o_{i,<t})}{\pi_{\text{ref}}(o_{i,t} | x, o_{i,<t})} \hat{A}_i, \; \text{clip}(\cdot) \hat{A}_i \right) - \beta \, D_{KL}(\pi_\theta \| \pi_{\text{ref}}) \right) \right]

Where:

G is the number of sampled responses per prompt (the group size)
o_i is the i-th sampled response, |o_i| is its length in tokens
π_θ is the policy being trained; π_ref is the reference policy
Â_i is the advantage for response i, computed relative to the group: Â_i = (r_i − mean(r)) / std(r), where r_i is the reward for response i
clip(·) is the same probability ratio clipped to [1 − ε, 1 + ε], as in the PPO objective
β is the KL divergence penalty weight
D_KL is the KL divergence regularizer preventing the policy from drifting too far from the reference

The key innovation: advantages are computed from the group's own rewards, eliminating the need for a separate critic (value) model.
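
The group-relative advantage Â_i = (r_i − mean(r)) / std(r) can be sketched in a few lines. The small `eps` term and the use of the population standard deviation are assumptions made here to keep the example well-defined; the source does not specify either.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses to the same prompt."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)  # population std (an assumption)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean_r) / (std_r + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average get a positive advantage and the rest a negative one; if all rewards are equal, every advantage is zero and the prompt contributes no policy gradient.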

Gradient Checkpointing

A memory optimization technique that trades compute for memory during training by recomputing intermediate activations during the backward pass instead of storing them. Essential for training large models on limited GPU memory.

GraphRAG

An evolution of RAG that structures retrieved knowledge as a graph (entities and relationships) rather than flat document chunks. GraphRAG improves answers to multi-hop questions that require synthesizing information across multiple sources.

Grouped Query Attention (GQA)

A memory-efficient variant of multi-head attention that shares key and value projections across groups of query heads. GQA reduces KV-cache size by the grouping factor (e.g., 8x) with minimal quality loss, and is used in most modern open models.
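
The sharing pattern amounts to an integer division from query-head index to key/value-head index. The following sketch uses a hypothetical helper name and assumes heads are grouped contiguously, which is a common layout but not the only possible one.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# 8 query heads sharing 2 K/V heads: a grouping factor of 4.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Setting `num_kv_heads` equal to `num_query_heads` recovers standard multi-head attention, and setting it to 1 gives multi-query attention.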

H

Hallucination

When a language model generates text that is fluent and confident but factually incorrect or fabricated. Hallucination is a fundamental challenge in LLMs arising from the model's tendency to produce plausible-sounding text regardless of factual grounding.

I

Inference-Time Compute

The paradigm of improving model performance by spending more computation during inference (generating longer reasoning chains, exploring multiple solution paths) rather than during training. This is the core insight behind reasoning models.

K

KV-Cache

The key and value tensors of previous tokens, cached during autoregressive generation so they don't need to be recomputed at each step. The KV-cache grows linearly with sequence length and is often the primary memory bottleneck during inference.
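
A back-of-the-envelope size estimate follows from the tensor shapes. The model configuration below (32 layers, head dimension 128, 16-bit values) is a hypothetical example, not the specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2: one K tensor and one V tensor per layer, each of shape
    # [batch_size, num_kv_heads, seq_len, head_dim].
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
print(mha / GIB, gqa / GIB)  # 2.0 0.5
```

The size is linear in `seq_len` and `batch_size`, and reducing the number of key/value heads (as GQA does) shrinks it by the same factor.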

L

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method that freezes the original model weights and injects small trainable low-rank matrices into each layer. LoRA typically trains ~1% of total parameters while achieving results comparable to full fine-tuning.
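
A small sketch of the low-rank update on one weight matrix, using plain Python lists. The `alpha / rank` scaling follows a common convention and is an assumption here, as are the function names.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full fine-tuning vs. a LoRA adapter on one matrix."""
    return d_in * d_out, rank * (d_in + d_out)


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x), with W frozen and only A, B trained.

    W is d_out x d_in, A is rank x d_in, B is d_out x rank.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(full, lora)  # 16777216 65536
```

For this hypothetical 4096 x 4096 matrix at rank 8, the adapter holds about 0.4% of the matrix's parameters. Initializing B to zeros makes the adapted layer start out identical to the frozen one.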

Loss Function

A mathematical function that measures the difference between a model's predictions and the target values. For language models, the standard pre-training loss is cross-entropy over next-token predictions.

\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log P_\theta(x_t | x_{<t})
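
Given the probability the model assigned to each actual next token, the sum above is a one-liner. The input probabilities here are made up for illustration.

```python
import math


def next_token_cross_entropy(token_probs):
    """Sum of -log P(x_t | x_<t) over the sequence.

    token_probs[t] is the probability the model gave to the token that
    actually appeared at position t.
    """
    return -sum(math.log(p) for p in token_probs)


loss = next_token_cross_entropy([0.5, 0.25])  # equals log(8)
```

A perfectly confident, correct model (every probability 1.0) yields a loss of zero; lower probabilities on the true tokens raise it.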

M

MCP (Model Context Protocol)

An open protocol (originated at Anthropic, now Linux Foundation) that standardizes how AI models connect to external tools, data sources, and APIs. MCP provides a universal interface — analogous to USB-C for AI — replacing ad-hoc tool integrations.

Related: Agentic AI

MMLU (Massive Multitask Language Understanding)

A benchmark of ~15,000 multiple-choice questions across 57 academic subjects, widely used as a general knowledge metric for LLMs. Frontier models now score 90%+, leading to the creation of harder variants (MMLU-Pro).

Related: Benchmark

Mixed Precision Training

Training with a combination of floating-point precisions (e.g., FP16 or BF16 for forward/backward passes, FP32 for weight updates) to reduce memory usage and increase throughput without significant quality loss.

Mixture of Experts (MoE)

An architecture where each transformer layer contains multiple parallel feed-forward networks ("experts"), and a router selects a subset (typically 2-8) for each token. MoE scales total model knowledge without proportionally increasing per-token compute cost.

Related: Expert Routing, Transformer
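
Top-k selection for a single token can be sketched as follows. Renormalizing the selected experts' scores with a softmax is one common choice and is an assumption here; real routers differ in how they weight and balance experts.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and return (expert_index, weight) pairs."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over only the chosen experts so their weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# One token, four experts, two active: only experts 1 and 3 run for this token.
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
```

The token's output is then the weighted sum of the chosen experts' outputs, so per-token compute depends on k rather than on the total number of experts.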

Multi-Head Attention (MHA)

Running multiple self-attention operations in parallel, each with different learned projections (heads), then concatenating the results. MHA allows the model to attend to information from different representation subspaces simultaneously.

Multi-Head Latent Attention (MLA)

An attention variant introduced by DeepSeek that compresses key-value representations into a low-dimensional latent space rather than reducing the number of heads. MLA achieves greater KV-cache compression than GQA while maintaining full representational capacity.

Multi-Query Attention (MQA)

An extreme variant of GQA where all query heads share a single set of key and value projections. MQA offers maximum KV-cache reduction but may sacrifice some representational capacity compared to GQA.

P

Paged Attention

A memory management technique (introduced by vLLM) that stores KV-cache in non-contiguous pages, similar to virtual memory in operating systems. Paged attention eliminates memory fragmentation and enables efficient dynamic batching.

Related: KV-Cache, Continuous Batching
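
The bookkeeping behind paging can be illustrated with a block table per sequence. This is a simplified sketch of the idea, not vLLM's implementation; the class and method names are hypothetical, and it tracks only block ownership, not the tensors themselves.

```python
class PagedKVCache:
    """Maps each sequence's logical token positions to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:
            # The last block is full (or none exists): take any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")  # 3 tokens occupy 2 blocks
cache.append_token("b")      # 1 token occupies 1 block
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block, and freed blocks are immediately reusable by other requests.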

Pre-training

The initial phase of training a language model on a large corpus of text using self-supervised objectives (typically next-token prediction). Pre-training produces a base model with broad language understanding but no task-specific behavior.

Prompt Engineering

The practice of designing and refining input prompts to elicit desired behavior from language models. Techniques include few-shot examples, system prompts, chain-of-thought instructions, and structured output formatting.

Related: Chain-of-Thought (CoT)

Q

QLoRA (Quantized LoRA)

Combines 4-bit quantization of the base model with LoRA adapters, enabling fine-tuning of large models on consumer hardware. The base model is loaded in NF4 precision while LoRA adapters train in higher precision.

Quantization

Reducing the numerical precision of model weights (and optionally activations) from higher-precision formats (FP32, FP16) to lower-precision formats (INT8, INT4, FP8). Quantization reduces memory usage and increases inference speed with a small quality tradeoff.
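
One of the simplest schemes, symmetric absmax quantization to INT8 with a single scale per tensor, is sketched below. It is chosen for brevity; methods such as GPTQ or AWQ are considerably more involved.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the "small quality tradeoff" in concrete terms.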

R

RAG (Retrieval-Augmented Generation)

A pattern that grounds LLM responses in external knowledge by retrieving relevant documents and including them in the prompt context. RAG reduces hallucination, enables up-to-date knowledge, and is the dominant approach for enterprise LLM applications.

RLHF (Reinforcement Learning from Human Feedback)

An alignment technique where a reward model trained on human preference data guides the language model via reinforcement learning (typically PPO). RLHF steers model behavior toward helpful, harmless, and honest outputs.

L^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
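
The term inside the expectation can be computed directly for one time step. The default `epsilon=0.2` is a commonly used value and is an assumption here.

```python
def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single time step."""
    clipped_ratio = max(1 - epsilon, min(1 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops rewarding ratio increases beyond 1 + ε; with a negative advantage, it stops rewarding decreases below 1 − ε. In both cases the clip removes the incentive to move far from the old policy in a single update.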

RMSNorm

A simplified layer normalization that normalizes by root mean square only, skipping the mean-centering step of standard LayerNorm. RMSNorm is computationally cheaper and is used in most modern open-source LLMs.

Related: SwiGLU, Transformer
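
A minimal sketch of RMSNorm on a single vector. The learned per-dimension `gain` and the small `eps` for numerical stability are standard ingredients but are not spelled out in the definition above, so treat them as assumptions.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x so its root mean square is about 1, then apply a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    # No mean subtraction here, unlike standard LayerNorm.
    return [g * v * inv_rms for g, v in zip(gain, x)]


out = rms_norm([3.0, 4.0], gain=[1.0, 1.0], eps=0.0)
```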

Reasoning Models

Language models trained to perform explicit step-by-step reasoning before producing a final answer, typically using inference-time compute scaling. Reasoning models (o1/o3, DeepSeek-R1, QwQ) excel at math, coding, and science tasks where deliberation improves accuracy.

Red-Teaming

Systematically probing an AI model to discover failure modes, safety vulnerabilities, and harmful outputs. Red-teaming involves crafting adversarial inputs designed to bypass safety measures and is a critical part of responsible AI deployment.

RoPE (Rotary Position Embeddings)

A position encoding method that applies rotation matrices to query and key vectors, making the attention dot product naturally depend on relative token distance. RoPE enables better length generalization and is the standard position encoding in modern open models.

\text{RoPE}(x_m, m) = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \end{pmatrix} \odot \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \end{pmatrix} \odot \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \end{pmatrix}
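
The formula above rotates each consecutive pair of dimensions by an angle proportional to the position m. The sketch below assumes the commonly used frequencies θ_i = base^(−2(i−1)/d) with base 10000, which the formula itself leaves unspecified.

```python
import math


def rope(x, position, base=10000.0):
    """Rotate each pair (x[2k], x[2k+1]) by the angle position * theta_k."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # assumed frequency schedule
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = x[i], x[i + 1]
        out.append(x1 * c - x2 * s)
        out.append(x2 * c + x1 * s)
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2) at different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
```

The two dot products agree up to floating-point error, which is the relative-position property the definition refers to.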

S

SFT (Supervised Fine-Tuning)

A fine-tuning stage where a base model is trained on curated instruction-response pairs to follow instructions. SFT typically follows pre-training and precedes alignment (RLHF/DPO).

Scaling Laws

Empirical relationships (notably Chinchilla scaling laws) that predict model performance as a function of parameter count, dataset size, and compute budget. Scaling laws guide decisions about how to allocate training resources.

L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_\infty

Related: Pre-training

Self-Attention

The core mechanism in transformers where each token computes attention weights over all other tokens in the sequence, producing context-aware representations. Self-attention enables the model to capture long-range dependencies regardless of distance in the sequence.

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
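
The formula maps directly onto a few lines of list-based Python. This sketch handles a single head with no masking or batching, and favors readability over speed.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention([[0.0, 0.0]], K, V)   # query matches no key in particular
focused = attention([[10.0, 0.0]], K, V)  # query strongly matches the first key
```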

Softmax

A function that converts a vector of raw scores (logits) into a probability distribution, where each value is in (0,1) and all values sum to 1. Softmax is used in the attention mechanism to compute attention weights and in the output layer to produce token probabilities.

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Related: Self-Attention, Loss Function
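
A direct implementation of the formula overflows for large logits, so practical code usually subtracts the maximum first. That shift is an addition to the formula as written, but it leaves the result mathematically unchanged.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # exp(z - peak) <= 1, so no overflow even for very large logits.
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```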

Sparse Autoencoder

A neural network trained to decompose a model's internal activations into a sparse set of interpretable features. Sparse autoencoders are the primary tool in mechanistic interpretability for understanding what individual neurons and circuits represent.

Speculative Decoding

An inference acceleration technique where a smaller "draft" model generates candidate token sequences that the larger "target" model verifies in parallel. Accepted tokens skip expensive sequential generation, yielding 2-3x speedups with no quality loss.
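
The draft-then-verify loop can be sketched for greedy decoding. Here `draft_next` and `target_next` are hypothetical callables that return the next token for a given prefix. A real system scores all drafted positions in one batched forward pass of the target model; the sequential calls below only emulate that result. Sampling-based variants use a rejection-sampling rule instead of exact matching.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Propose k tokens with the draft model, keep the prefix the target agrees with."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First disagreement: use the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt, draft_next, target_next, k, max_new_tokens):
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prompt) + max_new_tokens]


# Toy models: the target counts upward mod 10; the draft errs after a 4.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


result = generate([0], draft_next, target_next, k=3, max_new_tokens=8)
print(result)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The output matches what the target model alone would produce under greedy decoding, which is why the technique costs no quality; the speedup depends on how often the draft model's guesses are accepted.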

SwiGLU

An activation function combining the Swish activation with a Gated Linear Unit. SwiGLU has become the standard FFN activation in modern LLMs, replacing ReLU and GELU, offering improved training dynamics at a modest parameter increase.

Related: Transformer, RMSNorm

T

TPS (Tokens Per Second)

The rate at which a model generates output tokens during the decode phase. TPS measures generation throughput and is the primary metric for streaming response quality.

Related: TTFT (Time to First Token)

TTFT (Time to First Token)

The latency from when a request is sent to when the first output token is generated. TTFT is dominated by the prefill phase (processing the input prompt) and is a critical metric for interactive applications.

Tensor Parallelism

Distributing a single model layer across multiple GPUs by splitting weight matrices along specific dimensions. Tensor parallelism enables serving models too large for a single GPU's memory, at the cost of inter-GPU communication overhead.

Related: Continuous Batching

Tokenization

The process of converting raw text into a sequence of integer token IDs that a model can process. Modern LLMs use subword tokenizers (BPE, SentencePiece) that balance vocabulary size with the ability to represent any text.

Related: Transformer, Context Window
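
To show how a subword vocabulary is built, here is a sketch of a single BPE training step: count adjacent symbol pairs across a corpus, then merge the most frequent pair into a new symbol. The tiny word-frequency table is made up for illustration, and production tokenizers add details (byte-level fallback, pre-tokenization rules) that are omitted here.

```python
from collections import Counter


def most_frequent_pair(words):
    """words maps a tuple of symbols to how often that word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return max(counts, key=counts.get)


def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)  # ("l", "o"), seen 7 times
corpus_after = merge_pair(corpus, pair)
```

Repeating this step until the vocabulary reaches a target size yields the merge table; encoding new text then applies the learned merges in order.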

Transformer

The dominant neural network architecture for language models, introduced in "Attention Is All You Need" (2017). Transformers process sequences in parallel using self-attention and feed-forward layers, replacing the sequential processing of RNNs.

V

Vector Database

A database optimized for storing and querying high-dimensional embedding vectors using approximate nearest neighbor (ANN) search. Vector databases (Pinecone, Weaviate, Qdrant, pgvector) are the retrieval backbone of RAG systems.