Reasoning Models: How LLMs Learned to Think Before They Speak
Explore how reasoning models like o1, o3, and DeepSeek-R1 use inference-time compute scaling and chain-of-thought to solve problems standard LLMs cannot.

Introduction: The Problem That Scaling Alone Could Not Solve
In September 2024, OpenAI quietly released a model called o1. It did not have more parameters than GPT-4. It was not trained on a larger dataset. Yet on the American Invitational Mathematics Examination (AIME), it scored 83.3% (using consensus across 64 samples), up from GPT-4o's roughly 12%. On competitive programming problems from Codeforces, it reached the 89th percentile among human competitors. The difference was not in what the model knew, but in how it used what it knew.
This was the moment reasoning models went from a research curiosity to a paradigm shift. For years, the scaling playbook for large language models was straightforward: more parameters, more data, more pre-training compute. That formula produced GPT-3, GPT-4, Claude 2, and Gemini. But it had begun to hit diminishing returns. The next frontier was not about making models bigger; it was about making them think longer.
By early 2026, the speed of progress proved even faster than the optimists expected. AIME 2025 was solved outright (GPT-5.2 scored 100%), GPQA Diamond reached 94.3% (Gemini 3.1 Pro), and reasoning capabilities were no longer confined to specialized model variants. The frontier had shifted from "can models reason?" to "how cheaply and efficiently can they reason?"
Reasoning models represent a fundamental change in how we extract intelligence from neural networks. Instead of generating an answer in a single forward pass, they allocate additional compute at inference time, producing extended internal chains of thought before arriving at a final response. The result is a class of models that can solve problems previously considered out of reach for AI: formal mathematical proofs, multi-step scientific reasoning, complex code generation, and strategic planning.
This article traces the arc from early chain-of-thought prompting to the modern reasoning model paradigm, examines the key architectures and training methods that make it work, and explores what it means for practitioners deciding when and how to deploy these systems.
From Prompting Trick to Training Paradigm
Chain-of-Thought Prompting: The Spark
The story begins in January 2022, when Jason Wei and colleagues at Google Brain published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." The core insight was deceptively simple: if you show a language model examples of step-by-step reasoning in the prompt, it will produce step-by-step reasoning in its outputs, and arrive at dramatically better answers.
On the GSM8K benchmark of grade-school math word problems, PaLM 540B went from roughly 18% accuracy with standard prompting to 56.9% with chain-of-thought (CoT) prompting, more than tripling performance. The model had the latent capability all along; it simply needed the right elicitation.
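To make this concrete, here is what a chain-of-thought few-shot prompt looks like in practice. The exemplar below follows the style of the original paper's worked examples; with standard prompting, the same model would be asked to jump straight to the answer.

```python
# A chain-of-thought few-shot prompt: the exemplar answer shows the reasoning
# steps, which nudges the model to reason step by step on the new question.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has
3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""

# Sent to a completion-style LLM, this prompt typically elicits an answer like:
# "The cafeteria started with 23 apples. They used 20, leaving 3. They bought 6
#  more, so 3 + 6 = 9. The answer is 9."
```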
This raised a profound question: if a model can reason better when prompted to show its work, what happens if you train it to always reason before answering?
The Verification Gap
Standard LLMs have a fundamental architectural limitation. A transformer generates each token by attending to all previous tokens and performing a fixed number of operations per layer, then emitting the next token. For a question like "What is 37 times 43?", the model gets roughly the same amount of computation whether the answer is trivially obvious or requires careful multi-step work.
Humans do not operate this way. We allocate more mental effort to harder problems. We check our work. We backtrack when we realize a line of reasoning is flawed. The key insight behind reasoning models is that language models can be trained to do the same, by generating intermediate "thinking" tokens that serve as a scratchpad for extended computation before producing a final answer.
This is what researchers call inference-time compute scaling (also called test-time compute scaling): rather than only investing compute during training, you invest additional compute during inference, letting the model "think" for longer on harder problems.
How Reasoning Models Work
Thinking Tokens and Extended Chain-of-Thought
At a mechanical level, a reasoning model works by generating a potentially long sequence of internal reasoning tokens before producing the user-visible response. These thinking tokens are where the actual problem-solving happens. The model might:
- Decompose the problem into sub-problems
- Explore multiple solution paths
- Verify intermediate results
- Backtrack when it detects errors in its reasoning
- Synthesize findings into a final answer
Consider a concrete example. Given the prompt "Prove that there are infinitely many prime numbers," a standard LLM might produce a passable summary of Euclid's proof in a single pass. A reasoning model, by contrast, might internally generate hundreds or thousands of thinking tokens, exploring whether to use Euclid's proof, Euler's proof via the zeta function, or Furstenberg's topological proof, evaluating each approach for rigor, checking logical steps, and only then presenting a clean, verified proof.
The critical difference is that the model can allocate compute proportional to problem difficulty. A simple factual question might generate a handful of thinking tokens. A hard competition math problem might generate tens of thousands.
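Open-weight reasoning models make this structure directly visible: DeepSeek-R1, for example, emits its reasoning between `<think>` and `</think>` tags before the user-facing answer. Below is a minimal sketch of separating the two; the sample completion string is invented for illustration.

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split a DeepSeek-R1-style completion into (thinking trace, final answer)."""
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    return thinking, answer

# Invented example of the output shape, not a real model transcript:
sample = "<think>37 * 43 = 37 * 40 + 37 * 3 = 1480 + 111 = 1591. Check: ok.</think>37 × 43 = 1591."
trace, final = split_reasoning(sample)
print(f"thinking: {trace}\nanswer:   {final}")
```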
The Scaling Curve: More Thinking, Better Answers
One of the most striking findings from OpenAI's o1 technical report was the relationship between thinking time and accuracy. On the AIME benchmark, performance improved log-linearly with the amount of inference-time compute. Doubling the number of thinking tokens consistently yielded measurable gains, up to a point.
This creates a new scaling axis. Traditional scaling laws (Chinchilla, Kaplan et al.) describe how performance improves with more pre-training compute. Inference-time scaling describes how performance improves with more test-time compute, and the two are complementary. A model can be scaled along either axis, or both.
The practical implication is significant: instead of training an even larger model (which costs hundreds of millions of dollars), you can achieve better performance on hard problems by letting a smaller model think longer (which costs a fraction per query, though more than a standard response).
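One way to observe this scaling axis yourself is to evaluate the same model on a fixed problem set at increasing thinking budgets. The sketch below assumes a hypothetical `solve(question, thinking_budget=...)` call and a list of (question, expected answer) pairs; the budgets are placeholders, not a benchmark protocol.

```python
# Hypothetical inference-time scaling sweep: same model and problems,
# progressively larger thinking budgets. `solve` and `problems` are placeholders.

def accuracy_at_budget(solve, problems, budget: int) -> float:
    correct = 0
    for question, expected in problems:
        answer = solve(question, thinking_budget=budget)  # hypothetical API
        correct += int(answer.strip() == expected)
    return correct / len(problems)

budgets = [512, 2_048, 8_192, 32_768]  # thinking tokens per query (illustrative)
# curve = {b: accuracy_at_budget(solve, problems, b) for b in budgets}
# Plotting accuracy against log(budget) is how the roughly log-linear gains
# reported for o1-style models show up, flattening at a task-dependent point.
```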
The Key Models
OpenAI o1 (September 2024)
OpenAI's o1 was the first commercially available reasoning model at scale. Released in September 2024, it introduced the concept of "thinking tokens" to a broad audience. The model was trained using large-scale reinforcement learning to produce extended chain-of-thought reasoning before answering.
Key results:
- AIME 2024: 83.3% (vs. GPT-4o at 13.4%)
- GPQA Diamond (PhD-level science): 78.0% (vs. GPT-4o at 53.6%)
- Codeforces: 89th percentile
- PhD-level Physics/Chemistry/Biology: Surpassed human PhD-level performance in several sub-domains
OpenAI deliberately chose not to show the full thinking trace to users, instead providing a summarized version. This was partly for competitive and safety reasons (keeping others from training models on the raw reasoning traces, and preserving the option to monitor the chain of thought) and partly because the raw thinking traces can be messy and non-linear: they contain backtracking, dead ends, and self-corrections that would confuse most users.
OpenAI o3, o4-mini, and GPT-5 (2024 - 2025)
The successor models pushed the paradigm further. o3 demonstrated performance on the ARC-AGI benchmark, a test of novel abstract reasoning, that many researchers had considered beyond the reach of current architectures: on ARC-AGI's semi-private evaluation set, o3 scored 87.5% at high compute, compared to GPT-4o's roughly 5%. o3-mini and later o4-mini offered compute-efficient variants with configurable "reasoning effort" levels (low, medium, high), allowing users to tune the trade-off between quality and cost per query. o4-mini became the strongest model on AIME 2025, scoring 92.7% without tools and 99.5% with a Python interpreter.
The most consequential release, however, was GPT-5 in August 2025. Rather than maintaining separate model lines for reasoning and general use, GPT-5 introduced a unified architecture with an intelligent router that dynamically decides whether to respond quickly or engage deep reasoning based on query complexity. This effectively retired the separate o-series/GPT-series distinction. GPT-5 scored 94.6% on AIME 2025, 88.4% on GPQA Diamond, and 74.9% on SWE-bench Verified. By December 2025, GPT-5.2 Thinking hit 100% on AIME 2025 without tools, marking the effective saturation of a benchmark that had seemed impossibly hard just 14 months earlier.
DeepSeek-R1 (January 2025)
DeepSeek-R1, released in January 2025 by the Chinese AI lab DeepSeek, was arguably the most significant development in the reasoning model space, not because it outperformed o1 on all benchmarks, but because of how it was built.
DeepSeek published a detailed technical report revealing their training methodology. The results were competitive with o1 across math, coding, and science benchmarks, but the training process was radically different. Two findings shook the research community:
1. Reasoning Emerged from Pure Reinforcement Learning
In a preliminary experiment called DeepSeek-R1-Zero, the team applied reinforcement learning directly to their base model (DeepSeek-V3) with no supervised fine-tuning on reasoning traces whatsoever. The reward signal was simple: did the model get the right answer?
Remarkably, the model spontaneously developed chain-of-thought reasoning, self-verification, and even "aha moment" behaviors where it would reconsider its approach mid-stream. The researchers observed the model's reasoning traces growing longer and more sophisticated over the course of RL training, without any human-written examples of how to reason.
This was a profound finding. It suggested that extended reasoning is not something that needs to be taught by imitation; it is a naturally emergent behavior when a language model is optimized to actually solve hard problems.
2. The Entire Pipeline Cost a Fraction of OpenAI's Approach
DeepSeek trained its models at a reported cost roughly an order of magnitude lower than comparable Western efforts. For how they achieved this, see our deep dive on DeepSeek's Architecture Innovations.
Claude with Extended Thinking (Anthropic, 2025)
Anthropic introduced extended thinking capabilities for Claude 3.7 Sonnet in early 2025, allowing the model to engage in longer internal reasoning before responding. Claude's implementation makes the thinking process partially visible to users (in a summarized form), enabling them to understand and verify the model's reasoning steps.
Claude's extended thinking is notable for its integration with the model's existing strengths in instruction following, safety, and nuanced communication. Rather than being a separate "reasoning mode," it functions as an adjustable budget that users can set based on task complexity, allowing the same model to handle both quick responses and deep analytical work.
With the Claude 4 family (Opus 4 and Sonnet 4, May 2025), extended thinking became a core capability across the product line. Both models offered hybrid instant-response and deep-thinking modes. Claude Sonnet 4 reached 72.7% on SWE-bench, establishing Anthropic as a leader in agentic coding tasks. Subsequent releases through Claude 4.5 and 4.6 continued to increase maximum thinking budgets, with Opus 4.5 designed specifically for sustained scientific research and high-stakes analysis requiring extended autonomous work.
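In API terms, the thinking budget is just a request parameter. Here is a minimal sketch using Anthropic's extended thinking option; the model ID and token numbers are illustrative, and exact fields may differ between SDK versions, so treat this as a shape rather than a reference.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # illustrative model ID
    max_tokens=16_000,                  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "Prove that there are infinitely many primes."}],
)

# The response interleaves thinking blocks (the summarized reasoning) with
# ordinary text blocks (the final answer).
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```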
Google Gemini: Thinking Models and Deep Think
Google's reasoning model trajectory began with Gemini 2.0 Flash Thinking in late 2024, emphasizing efficiency. The major leap came with Gemini 2.5 Pro (March 2025), a "thinking model" that debuted at the top of LMArena's leaderboard with a 1M token context window. At Google I/O 2025, Google introduced Gemini 2.5 Pro Deep Think, an experimental mode that uses parallel reasoning, considering multiple hypotheses simultaneously rather than sequentially. Deep Think scored 49.4% on USAMO 2025, a large jump over the standard model's result on that exam. By early 2026, Gemini 3.1 Pro reached 94.3% on GPQA Diamond, the highest score on that benchmark at the time.
The Broader Landscape (2025 - 2026)
The reasoning model space expanded from a handful of entrants to a universal capability across the industry. Every major lab now ships reasoning capabilities:
- Qwen3 (April 2025) introduced unified thinking/non-thinking modes in a single model, eliminating the need for separate chat and reasoning variants. Available in dense and MoE architectures from 0.6B to 235B parameters under Apache 2.0.
- Grok 3 (February 2025) was xAI's first dedicated reasoning model, trained on the Colossus supercluster. Its "Reasoning" and "mini Reasoning" variants with chain-of-thought outperformed GPT-4o on AIME and GPQA. Rapid iteration through Grok 4 (July 2025) and beyond followed.
- Mistral Magistral (June 2025) brought reasoning to Mistral's lineup, with Magistral Medium scoring 73.6% on AIME 2024 (90% with majority voting). Mistral Small 4 (March 2026) consolidated reasoning, vision, and coding into a single 119B-parameter MoE model with just 6B active parameters per token.
- Meta Llama 4 (April 2025) adopted MoE architecture with reasoning capabilities. Llama 5, expected in mid-2026, is designed with "System 2 thinking" for deliberate, multi-step reasoning.
- Open-source efforts like Open-R1 continued to replicate and extend the reasoning model recipe using publicly available components.
The field moved from a single lab's innovation to a universal industry standard in roughly 18 months.
GRPO: How DeepSeek Trained Reasoning Without Human Labels
One of the most influential technical contributions from DeepSeek-R1 was Group Relative Policy Optimization (GRPO), the reinforcement learning algorithm at the heart of their training process. Understanding GRPO is essential for anyone who wants to grasp how reasoning capabilities are trained into modern models.
The Problem with Standard RLHF for Reasoning
Traditional RLHF (Reinforcement Learning from Human Feedback) uses a learned reward model to score the quality of model outputs, then optimizes the policy (the language model) to produce higher-scoring outputs. This works well for subjective qualities like helpfulness or tone, but it has a critical limitation for reasoning: you need human annotators to label reasoning traces, and creating high-quality reasoning demonstrations is extremely expensive and slow.
Moreover, standard policy optimization algorithms like PPO (Proximal Policy Optimization) require a separate value model (critic) that estimates the expected reward from any given state. For language models, this critic needs to be roughly the same size as the policy model itself, effectively doubling memory requirements during training.
How GRPO Works
GRPO eliminates the need for a separate critic network entirely. Here is the core idea:
- Sample a group: For each problem in the training batch, generate multiple candidate solutions (say, 8 to 64 completions) from the current policy.
- Score the group: Evaluate each solution using a verifiable reward signal. For math, this is simply: did the model arrive at the correct answer? For code, it is: did the code pass the test cases?
- Compute relative advantage: Instead of using a learned value function, GRPO computes the advantage of each solution relative to the other solutions in the same group. Solutions that score above the group mean receive positive advantage; those below receive negative advantage. Formally, the advantage is normalized by the group's standard deviation.
- Update the policy: Use these relative advantages to update the model weights, encouraging it to produce more solutions like the high-scoring ones and fewer like the low-scoring ones. A KL divergence penalty prevents the policy from drifting too far from the reference model.
The mathematical formulation of GRPO's objective (in its simplified, sequence-level form) is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\Big(r_i(\theta)\,A_i,\ \operatorname{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_i\Big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big)\right)\right]
$$

where $r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}$ is the probability ratio between the new and old policy for completion $o_i$, $A_i = \frac{R_i - \operatorname{mean}(R_1,\dots,R_G)}{\operatorname{std}(R_1,\dots,R_G)}$ is the group-normalized advantage, and the clipping and KL terms provide stability.
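To make the group-relative machinery concrete, here is a schematic, sequence-level sketch of the advantage computation and clipped surrogate loss for a single group of sampled solutions. Real implementations operate at the token level and add the KL penalty against a frozen reference model; this sketch only illustrates the core idea.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Schematic GRPO loss for one group of G sampled solutions to one problem.

    logp_new, logp_old: summed log-probabilities of each completion under the
        current and sampling-time policies, shape (G,).
    rewards: verifiable outcome rewards (1.0 correct / 0.0 incorrect), shape (G,).
    """
    # Group-relative advantage: normalize rewards within the group, no critic needed.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between new and old policy for each completion.
    ratios = torch.exp(logp_new - logp_old)

    # PPO-style clipped surrogate objective (KL penalty term omitted for brevity).
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy group: 8 sampled solutions to one math problem, 3 of which were correct.
rewards = torch.tensor([1., 0., 0., 1., 0., 0., 1., 0.])
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
print(grpo_loss(logp_new, logp_old, rewards))
```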
Why GRPO Matters
GRPO's significance extends beyond a single training run:
- No human reasoning labels needed: The model learns to reason purely from outcome-based rewards. You need problems with verifiable answers, not demonstrations of how to solve them.
- Dramatically lower memory cost: Without a separate critic network, training requires roughly half the GPU memory of PPO-based approaches.
- Naturally suited to reasoning: Because the reward is binary (correct/incorrect), the relative advantage computation creates a strong learning signal: solutions that happen to include good reasoning steps are more likely to arrive at correct answers, so reasoning behavior is reinforced indirectly.
- Scales with problem difficulty: The group-based sampling naturally allocates more "learning signal" to problems where some but not all attempts succeed, exactly the frontier where the model has the most to learn.
GRPO has since been adopted and adapted by numerous research groups working on reasoning models, and it has become a standard tool in the post-training optimization toolkit. Several refinements have emerged: Dr. GRPO removes response length normalization to reduce unnecessarily verbose reasoning, while DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) adds techniques like dynamic sampling and token-level policy gradient loss for better long chain-of-thought performance.
An important caveat has also emerged from subsequent research. Studies on reinforcement learning from verifiable rewards (RLVR), the class of methods that includes GRPO, have shown that these techniques primarily amplify reasoning capabilities already present in the base model rather than teaching fundamentally new reasoning strategies. The gains come from improved sampling efficiency: the model learns to more reliably surface its best reasoning paths, not to reason in ways it could not before. This finding reinforces the complementary nature of pre-training and post-training: a strong base model remains essential, and RL-based reasoning training is most effective when there is latent capability to unlock.
The Two Scaling Axes: Pre-Training vs. Inference-Time
Understanding reasoning models requires understanding a fundamental shift in how the field thinks about scaling. For years, the dominant framework was pre-training compute scaling: performance is a predictable function of the compute (and data) used during training. The Chinchilla scaling laws (Hoffmann et al., 2022) formalized this, showing that for a fixed compute budget, there is an optimal balance between model size and data quantity.
Reasoning models introduce a second, complementary axis: inference-time compute scaling. Here, performance is a function of compute spent per query at inference time.
The Key Insight: Complementary, Not Competitive
These two axes are not substitutes; they are complementary. A well-trained base model provides the foundation of knowledge and capability. Inference-time compute allows that knowledge to be applied with greater depth and rigor to individual problems.
Think of it this way: pre-training scaling determines the ceiling of what a model can potentially do. Inference-time scaling determines how close the model gets to that ceiling on any given problem. A small model with extended thinking will still be limited by its knowledge base. A large model without extended thinking will leave performance on the table for hard problems.
The most capable reasoning models combine both: strong base models (pre-trained on trillions of tokens) augmented with inference-time reasoning (spending thousands of tokens thinking before answering).
Empirical Evidence
The research from OpenAI, DeepSeek, and others converges on several empirical findings:
| Scaling Axis | Cost Structure | Best For |
|---|---|---|
| Pre-training compute | High fixed cost, low marginal cost per query | Broad knowledge, language fluency, common tasks |
| Inference-time compute | Low fixed cost, high marginal cost per query | Hard reasoning, math, code, planning, novel problems |
A key result from DeepSeek's work was that a 7B-parameter model with distilled reasoning capabilities could match or exceed a 70B standard model on reasoning-heavy benchmarks. This suggests that inference-time compute can partially substitute for pre-training scale on specific task types, offering a more cost-efficient path for reasoning-focused applications.
Practical Implications: When to Use Reasoning Models
When reasoning models first appeared, they were separate, specialized systems. By mid-2025, the distinction began to dissolve: GPT-5's unified architecture, Qwen3's merged thinking/non-thinking modes, and Claude 4's hybrid approach all point toward reasoning as an adjustable capability within a single model rather than a separate model class. Still, understanding when deeper reasoning adds value remains critical for practitioners, whether they are choosing between models or configuring thinking budgets within one.
When Reasoning Models Excel
- Mathematical problem solving: Competition math, formal proofs, quantitative analysis where logical rigor matters
- Complex code generation: Multi-file implementations, algorithmic challenges, debugging that requires tracing execution paths
- Scientific reasoning: Multi-step derivations, experimental design, analyzing conflicting evidence
- Strategic planning: Tasks requiring consideration of multiple future states, trade-offs, and contingencies
- Agentic workflows: Multi-step tasks where the model must plan, execute, observe, and adapt. Reasoning models power the next generation of AI Agents (see AI Agents in Production).
When Standard Models Are Preferable
- Simple factual queries: "What is the capital of France?" does not benefit from extended reasoning
- Creative writing: Poetry, fiction, and marketing copy are rarely improved by deliberative reasoning
- Conversational interaction: Chatbots and customer support, where latency matters and problems are typically straightforward
- High-throughput, low-latency scenarios: When you need to process thousands of requests per second at minimal cost
- Summarization and extraction: Tasks where the answer is present in the input and needs to be identified, not derived
The Latency and Cost Trade-Off
Reasoning models are inherently slower and more expensive per query than standard models. A standard GPT-4o response might generate 200-500 tokens in 2-3 seconds. An o1 response to a hard math problem might generate 10,000+ thinking tokens (hidden) plus 500 visible tokens, taking 30-60 seconds and costing 5-20x more.
This creates a genuine engineering trade-off:
| | Cost per query | Latency | Accuracy on hard tasks |
|---|---|---|---|
| Standard model | ~0.01 USD | ~2 seconds | ~70% |
| Reasoning model | ~0.10 USD | ~30 seconds | ~92% |
For many applications, the 70% accuracy is sufficient. For others (medical diagnosis, financial analysis, safety-critical code), the 92% justifies the cost and latency. The right choice depends on the use case.
To run reasoning models efficiently in production, cost management becomes critical. For techniques on minimizing inference costs while preserving quality, see LLM Inference Optimization.
Configurable Reasoning Effort
Configurable thinking budgets have become a near-universal feature across providers. OpenAI's o3-mini and o4-mini, Claude's extended thinking, Gemini 2.5 models, and Qwen3 all allow users (or the model itself) to adjust how much thinking happens before answering. This enables a tiered approach:
- Route simple queries to standard models or low-effort reasoning
- Route medium-complexity queries to moderate reasoning effort
- Route the hardest problems to maximum reasoning effort
This routing can be automated with a lightweight classifier or even a smaller LLM that estimates query difficulty, creating a cost-efficient system that provides deep reasoning only when it adds value.
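A minimal sketch of such a tiered router is shown below. The difficulty heuristic, tier thresholds, and model names are illustrative placeholders; in practice the estimator would be a small trained classifier or an inexpensive LLM call.

```python
# Tiered routing sketch: estimate query difficulty, then choose a reasoning effort.
# The heuristic, thresholds, and model names are illustrative placeholders.

def estimate_difficulty(query: str) -> float:
    """Stand-in for a lightweight classifier that scores difficulty from 0 to 1."""
    hard_signals = ("prove", "derive", "optimize", "debug", "algorithm", "multi-step")
    score = 0.2 + 0.2 * sum(signal in query.lower() for signal in hard_signals)
    return min(score, 1.0)

def route(query: str) -> dict:
    difficulty = estimate_difficulty(query)
    if difficulty < 0.3:
        return {"model": "fast-standard-model", "reasoning_effort": "low"}
    if difficulty < 0.7:
        return {"model": "reasoning-model", "reasoning_effort": "medium"}
    return {"model": "reasoning-model", "reasoning_effort": "high"}

print(route("What is the capital of France?"))
print(route("Prove the algorithm terminates and derive its worst-case complexity."))
```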
The Distillation Question: Can You Get Reasoning for Free?
One of the most active research areas is reasoning distillation, training smaller, faster models to replicate the outputs of larger reasoning models. DeepSeek demonstrated this convincingly by distilling their R1 model's reasoning capabilities into models ranging from 1.5B to 70B parameters.
The distilled models showed remarkable capability:
- DeepSeek-R1-Distill-Qwen-32B outperformed OpenAI's o1-mini on several benchmarks
- DeepSeek-R1-Distill-Qwen-7B achieved reasoning performance competitive with much larger standard models
However, distillation has limits. Distilled models are learning to imitate reasoning patterns rather than developing them from first principles via RL. They can reproduce common reasoning strategies well but may struggle on truly novel problems that fall outside the distribution of the teacher model's training.
The practical takeaway: distilled reasoning models are excellent for deployment scenarios where you need reasoning capability at low cost and latency, but for frontier performance on the hardest problems, full reasoning models trained with RL remain superior.
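Mechanically, this kind of distillation is ordinary supervised fine-tuning on the teacher's outputs: collect (problem, reasoning trace, answer) triples from the large reasoning model, then train the small model to reproduce them with a standard next-token loss. Below is a schematic of the data-preparation step, where `teacher_generate` and the output path are placeholders.

```python
import json

def build_distillation_example(problem: str, teacher_generate) -> dict:
    """Turn one problem into an SFT example containing the teacher's full trace."""
    trace, answer = teacher_generate(problem)  # hypothetical call to the teacher model
    target = f"<think>\n{trace}\n</think>\n{answer}"
    return {"prompt": problem, "completion": target}

# examples = [build_distillation_example(p, teacher_generate) for p in problems]
# with open("distill_sft.jsonl", "w") as f:
#     for ex in examples:
#         f.write(json.dumps(ex) + "\n")
# The resulting JSONL can feed any standard supervised fine-tuning pipeline;
# the student learns to imitate the trace format as well as the final answers.
```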
What This Means for the Future
The Convergence of Reasoning and Agency
Reasoning models are not just better at answering questions; they are better at planning. This makes them natural candidates for agentic systems where a model must decompose a complex task, execute steps, handle errors, and adapt. The combination of strong reasoning with tool use, code execution, and environment interaction is already producing AI agents that handle tasks previously requiring substantial human oversight. The share of organizations with deployed AI agents nearly doubled from 7.2% to 13.2% between August and December 2025, and enterprise reasoning token consumption at OpenAI grew roughly 320x over the same period, signaling a shift from experimentation to production use.
Reasoning as a Universal Layer
The prediction that reasoning would become a tunable parameter rather than a separate model has largely come true. GPT-5's intelligent router, Qwen3's unified thinking modes, and Claude 4's hybrid architecture all allow a single model to fluidly adjust reasoning depth based on task requirements, thinking deeply about a hard theorem and quickly about a simple question within the same conversation. The remaining frontier is making this adaptation fully automatic and cost-efficient, so that users never need to manually configure thinking budgets.
The Efficiency Race
The DeepSeek-R1 paper demonstrated that reasoning capabilities can be trained at dramatically lower cost than previously assumed. This sparked an efficiency race that has only intensified: Mistral Small 4 packs reasoning into a 6B-active-parameter MoE model, Qwen3 offers reasoning variants down to 0.6B parameters, and distilled reasoning models routinely run on consumer hardware. The question is no longer whether small models can reason, but how close they can get to frontier performance.
Benchmark Saturation and What Comes Next
A striking development of 2025-2026 is the saturation of benchmarks that once seemed impossibly hard. AIME 2025 has been solved (100%, GPT-5.2). GPQA Diamond is above 94%. Codeforces ratings continue to climb. The community is now grappling with a measurement problem: the benchmarks that defined the reasoning model era are running out of headroom. Newer, harder evaluations like ARC-AGI-2, SWE-bench Pro, and USAMO are taking their place, but the pattern suggests that benchmark-driven progress may need to be supplemented with more open-ended evaluations of reasoning quality.
Open Questions
Several important questions remain unresolved:
- Faithfulness: Are the thinking traces actually reflective of the model's internal computation, or are they a learned post-hoc narrative? Research suggests the truth is somewhere in between, and the finding that RLVR amplifies existing capabilities rather than creating new ones adds nuance to this question.
- Ceiling: The saturation of existing benchmarks suggests that current reasoning methods may be approaching diminishing returns on well-defined tasks. Whether inference-time scaling continues to yield gains on more open-ended, real-world problems is an active area of investigation.
- Safety: Extended reasoning makes models more capable, but also potentially more capable at harmful tasks. How do we ensure reasoning models think responsibly as well as deeply?
- Verification: For domains without clear right/wrong answers (policy analysis, ethical reasoning, creative problem-solving), how do we train and evaluate reasoning quality?
- Parallel reasoning: Google's Deep Think experiments with simultaneous hypothesis exploration suggest that sequential chain-of-thought may not be the only viable architecture for inference-time reasoning. Whether parallel approaches yield fundamentally different capabilities remains to be seen.
Key Takeaways
- Reasoning models represent a new scaling paradigm. Instead of only scaling pre-training compute, we can now scale inference-time compute to extract dramatically better performance on hard problems from existing model architectures.
- The mechanism is extended chain-of-thought. Reasoning models generate potentially thousands of internal "thinking tokens" before producing a visible response, enabling decomposition, verification, backtracking, and synthesis.
- GRPO changed the training economics. DeepSeek's Group Relative Policy Optimization showed that reasoning can emerge from pure reinforcement learning with outcome-based rewards, eliminating the need for expensive human-labeled reasoning traces. Subsequent work (Dr. GRPO, DAPO) has refined the approach, though research also shows these methods amplify existing base model capabilities rather than creating fundamentally new reasoning skills.
- Reasoning capability can be distilled. Smaller models trained to imitate reasoning models retain much of the capability at a fraction of the cost, making reasoning accessible across the model size spectrum, from 0.6B-parameter open-source models to frontier systems.
- The right configuration depends on the task. Reasoning excels at math, code, science, and planning but is overkill for simple queries, creative writing, and high-throughput applications. With unified models now handling both modes, the question has shifted from "which model?" to "how much thinking budget?"
- Pre-training and inference-time scaling are complementary. The strongest systems combine a well-trained base model with inference-time reasoning. Neither axis alone is sufficient for frontier performance, and RL-based post-training works best when there is latent capability in the base model to unlock.
- Reasoning is converging into the base model. The 2024-era pattern of separate reasoning models (o1, R1) is giving way to unified architectures (GPT-5, Qwen3, Claude 4) that dynamically adjust reasoning depth. This simplifies deployment but raises new questions about cost control and routing.
- Benchmarks are saturating faster than expected. AIME, GPQA, and other defining benchmarks of the reasoning era are approaching or at their ceilings. The field needs harder evaluations to continue measuring progress meaningfully.
The shift from "make the model bigger" to "let the model think longer" is one of the most consequential developments in AI since the transformer architecture itself. What began as a single lab's experiment in September 2024 became a universal industry capability by mid-2025. For practitioners, the immediate action is clear: understand when reasoning adds value for your use case, configure thinking budgets accordingly, and recognize that the models themselves are increasingly making these routing decisions autonomously. The models that merely generate have given way to models that reason, and that transition happened faster than almost anyone predicted.