When AI Takes Time to Think: Implications of Test-Time Compute

Commentary

Mar 26, 2025

Photo by Nikada/Getty Images

By Lennart Heim and Ashley Lin

The rise of reasoning models like OpenAI's o1 and o3 and DeepSeek's R1 adds another tool to the AI development toolkit. These models use “test-time compute” to improve performance: rather than generating answers immediately, they engage in explicit step-by-step reasoning, essentially “thinking out loud” by generating intermediate calculations, exploring multiple approaches, and evaluating potential solutions before arriving at a final answer. At the extreme, we might allow a model to think for minutes—generating dozens of pages of text—before it condenses this extensive reasoning into a concise final output for the user.
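
To make the pattern concrete, here is a minimal sketch of a reasoning loop with a thinking budget. The `generate` and `condense` callables are hypothetical placeholders standing in for model calls, not any particular company's API, and the loop structure is illustrative rather than a description of how o1, o3, or R1 actually work.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    text: str          # one chunk of intermediate reasoning
    num_tokens: int    # tokens spent on this chunk
    is_final: bool     # the model signals it is confident in an answer

def answer_with_reasoning(
    generate: Callable[[str, List[str]], Step],  # hypothetical model call
    condense: Callable[[str, List[str]], str],   # hypothetical summarizer
    question: str,
    thinking_budget_tokens: int,
) -> str:
    """Spend a token budget on step-by-step reasoning, then return a short answer."""
    trace: List[str] = []
    used = 0
    while used < thinking_budget_tokens:
        step = generate(question, trace)   # next piece of "thinking out loud"
        trace.append(step.text)
        used += step.num_tokens
        if step.is_final:
            break
    # Condense potentially many pages of reasoning into a concise reply.
    return condense(question, trace)
```

The key point of the sketch is that the thinking budget is a dial set at inference time, independent of how the underlying model was trained.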

This development represents an evolution rather than a revolution in AI, but it has implications for policy, competition, and security. As with most technological advances, nothing has suddenly changed overnight, but the cumulative impact of this shift could be substantial over time.

How Compute Impacts the AI Lifecycle

Before exploring the implications, we should understand compute's role in AI development and deployment and how this equation has now been updated. Historically, increased training compute has been the primary driver of AI progress, enabling the training of increasingly capable models with more parameters and more data. Test-time compute is an additional variable that has now entered this equation. What seemed like a straightforward path to AI progress—just scale up pre-training compute—was always more nuanced than public discussion suggested. With inference-time reasoning, the equation becomes even more complex, with multiple variables to optimize rather than just one dial to turn.

Figure 1: How Compute Drives AI Capabilities Across Different AI Lifecycle Phases

Left: simplified AI lifecycle, from development (experimentation on model architecture and design; pre-training of the base model; post-training via RLHF, constitutional AI, instruction tuning, etc.) to deployment. Right, top to bottom: Chinchilla scaling laws showing the optimal model-data balance for experimentation; compute as a tracker of AI capabilities; DeepSeek-R1 performance improvements through reinforcement learning iterations; OpenAI o1's accuracy scaling with test-time compute; and user growth curves representing deployment scale.

Simplified, we can see that compute drives capabilities through four key phases:

  1. Experimentation: Testing architectures and hyperparameters (such as learning rates and other design variables) to find the optimal design that is later scaled up
  2. Pre-training: Building foundational capabilities through massive training runs on internet-scale data sets
  3. Post-training: Using reinforcement learning from human feedback, constitutional AI, and instruction tuning to make the model act more like a chatbot or excel at specific tasks; more recently, the application of reinforcement learning to teach step-by-step reasoning skills
  4. Deployment:
    1. Deployment Capabilities: How long the model thinks in response to queries—the test-time compute discussed here
    2. Deployment Scale: The number of users or AI agents being served; a critical factor when millions of users are making queries (see the rough comparison below)
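
To put very rough numbers on how these phases compare, the sketch below uses the common approximations that training compute is about 6 × parameters × training tokens and inference compute is about 2 × parameters per generated token. The model size, token counts, and query volume are illustrative assumptions, not figures from this commentary.

```python
# Back-of-envelope comparison of where compute is spent.
# Approximations: training FLOP ~ 6 * params * training_tokens,
# inference FLOP ~ 2 * params per generated token.
# All concrete numbers below are illustrative assumptions.

params = 100e9              # a notional 100-billion-parameter model
train_tokens = 10e12        # 10 trillion training tokens
training_flop = 6 * params * train_tokens                 # ~6e24 FLOP, one-time cost

tokens_per_short_answer = 500          # a quick reply
tokens_per_long_reasoning = 100_000    # an extended "thinking" trace
flop_short = 2 * params * tokens_per_short_answer          # ~1e14 FLOP per query
flop_reasoning = 2 * params * tokens_per_long_reasoning    # ~2e16 FLOP per query

daily_queries = 50e6        # assumed deployment scale: reasoning queries per day
daily_reasoning_flop = daily_queries * flop_reasoning      # ~1e24 FLOP per day

print(f"Training (one-time):      {training_flop:.1e} FLOP")
print(f"One short answer:         {flop_short:.1e} FLOP")
print(f"One long reasoning trace: {flop_reasoning:.1e} FLOP")
print(f"Reasoning fleet, per day: {daily_reasoning_flop:.1e} FLOP")
```

Under these assumed numbers, a single day of heavy reasoning traffic approaches the one-time pre-training budget, which is why deployment compute is a variable in its own right.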

What's new is not that compute matters—it's where and how it matters. It would be a mistake to conclude compute matters less now, just as it was when people downplayed its role after DeepSeek's efficiency claims. These advances still build on pre-trained foundation models that required thousands of chips and millions of dollars to develop.


Once a model has gone through its first broad training run, two new scaling levers emerge: first, how much reinforcement learning we apply to teach reasoning skills during post-training, and second, how much “thinking time” we allow during inference. Unlike previous advances driven primarily by training data volume and model size, reasoning models can become more capable simply by being allowed more computation time to solve problems—though they still fundamentally depend on the capabilities established during pre-training and the reasoning skills established through post-training reinforcement learning.
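
One concrete way to convert extra thinking time into accuracy, shown here purely as an illustration (self-consistency sampling; the text does not claim this is how o1, o3, or R1 work internally), is to sample several independent reasoning chains and keep the most common final answer. The `sample_chain` callable is a hypothetical model call.

```python
# Illustrative way to spend test-time compute: sample several independent
# reasoning chains and return the most common final answer ("self-consistency").
# More chains means more compute and, on verifiable tasks, typically higher accuracy.

from collections import Counter
from typing import Callable, Tuple

def self_consistent_answer(
    sample_chain: Callable[[str], Tuple[str, str]],  # hypothetical: question -> (reasoning, answer)
    question: str,
    num_chains: int,                                 # the test-time compute "dial"
) -> str:
    answers = []
    for _ in range(num_chains):
        _reasoning, answer = sample_chain(question)  # one full reasoning attempt
        answers.append(answer)
    # Majority vote across the independent attempts.
    return Counter(answers).most_common(1)[0][0]
```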

Consequently, this represents another step in computing's long history of efficiency and capability gains. Just as Moore's Law progressed through different substrates and form factors while maintaining the underlying principle of increasing transistor density, test-time compute introduces a complementary approach to drive AI capabilities forward. While this might appear incremental today, it could have profound implications for the AI ecosystem and policy in the long run.

Six Implications of Test-Time Compute

1. The Innovation Cycle Is Accelerating

Expect rapid performance gains from teams already working on reasoning (like OpenAI, DeepSeek, and others) and from new entrants, as iteration cycles are faster in a new research field. There is more low-hanging fruit because it is cheaper to iterate: developers can rely on reinforcement learning enhancements and increased test-time compute alone, without needing to pre-train a new model, which requires thousands of chips and millions of dollars.

Furthermore, these accessible improvements draw more developers into AI advancement, including academics who are more likely to share insights and accelerate collective progress. This increases capability diffusion—gaps between frontier and follower models will likely narrow more quickly than in the pre-training-dominant era. However, pre-training advances continue in parallel, potentially creating new capability gaps with each generational leap. For example, xAI's Grok 3 represents the largest publicly known trained model, reportedly using more than 10^26 FLOP on a cluster of 100,000 cutting-edge chips—resources beyond most companies' reach.

The fastest improvements will likely come in domains with clear feedback signals from easily verifiable results, particularly mathematics and software engineering. This is significant since many model developers are software engineers, creating a potential feedback loop: engineers use these models for their work, which drives more usage, new developments, and potentially better future models. But will this reasoning ability transfer effectively to other domains? That remains to be seen.

2. Accelerated Diffusion, Continued Advantage

Test-time compute serves both industry leaders and smaller players: advanced models gain enhanced reasoning capabilities, while more modest systems achieve capabilities that once required extensive pre-training. Rather than eliminating compute barriers, this represents another step in AI's long history of algorithmic improvements—capabilities become cheaper at any given performance level, yet advancing the frontier still requires increasingly substantial resources. Leading companies maintain their edge by applying reasoning techniques to their newest, largest models. Meanwhile, followers can achieve yesterday's frontier performance with more modest resources, narrowing but not closing the capability gap.

3. Tiered Access to Reasoning Capabilities

Test-time compute creates flexibility in AI capabilities. The same model can perform at different “intelligence levels” depending on the compute resources allocated to each query. We already face a version of this dilemma today: do you need the pro version to access cutting-edge capabilities? For simple queries, perhaps not, but many users purchase premium subscriptions because they improve performance on certain tasks. Now you may have to decide both which model to use and how long it should reason—or companies might make this decision for you, or the model itself might decide.

Which tasks truly require such extensive reasoning will determine where capabilities remain limited to the richest actors. To appreciate the cost of advanced capabilities: achieving the highest ARC-AGI benchmark performance of OpenAI's o3 (on tasks that often involve simply filling in a few pixels) required approximately 10,000 Nvidia H100s for a 10-minute response time on a single task. The model generates millions of tokens—equivalent to many books' worth of text—not as one coherent reasoning chain but as multiple parallel explorations of potential solutions. This extreme resource demand explains why Sam Altman recently noted that ChatGPT Pro operates at a loss—the background compute costs for advanced reasoning at scale are substantial.
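
As a rough illustration of what that means in dollars, the arithmetic below combines the two figures above with a notional rental price of about $2.50 per H100-hour; the price is our assumption for illustration, not a figure from the article or from OpenAI.

```python
# Rough, illustrative cost arithmetic for the ARC-AGI example above.
# The GPU rental price is an assumed, notional figure.

gpus = 10_000                 # approximate H100s engaged on one task
minutes = 10                  # approximate response time
price_per_gpu_hour = 2.50     # assumed rental price (USD per H100-hour)

gpu_hours = gpus * minutes / 60           # ~1,667 GPU-hours for a single answer
cost = gpu_hours * price_per_gpu_hour     # ~$4,200 at the assumed price

print(f"{gpu_hours:,.0f} GPU-hours -> ~${cost:,.0f} per task")
```

At that assumed price, a single answer costs on the order of a few thousand dollars.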

4. Deployment Capabilities: From “How Many” to “How Many and How Smart”

Deployment compute has always been a factor in AI's impact, determining how many users can be served and, consequently, the breadth and depth of AI's influence across sectors. With test-time compute, the relationship between compute and capability intensifies—the same model can deliver different levels of intelligence depending on allocated thinking time. This makes deployment compute even more important: it now determines both how many AI systems can operate simultaneously and how intelligent each instance can be.
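
A toy calculation illustrates the tradeoff: with a fixed inference fleet, every token spent thinking on one query is a token unavailable for serving another. The fleet throughput figure below is an arbitrary assumption.

```python
# Illustrative "how many vs. how smart" tradeoff: a fixed daily token
# throughput split across queries with different thinking budgets.
# The throughput number is an assumption, not a real fleet's capacity.

fleet_tokens_per_day = 1e12   # assumed total daily token throughput

for thinking_tokens in (1_000, 10_000, 100_000, 1_000_000):
    queries_per_day = fleet_tokens_per_day / thinking_tokens
    print(f"{thinking_tokens:>9,} thinking tokens per query "
          f"-> ~{queries_per_day:,.0f} queries per day")
```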

The relationship between compute and deployment capabilities has important geopolitical and economic implications. Geopolitically, countries with substantial compute resources can project influence by heavily subsidizing AI services internationally, similar to past technology-driven soft power strategies. Economically, access to inference compute determines which companies can profitably deploy advanced AI at scale, potentially creating feedback loops where deployment success funds further advancements. The result is that while small-scale AI capabilities may diffuse more widely, the largest-scale applications—allocating significant thinking time to many queries—will initially concentrate among actors with the greatest compute access.


5. Synthetic Data Drives Capability Flywheels

Advanced reasoning might depend on the generation of synthetic (that is, AI-produced) reasoning data, making this data an increasingly valuable strategic resource. A capability flywheel might emerge where each model generation relies on outputs from the previous one: today's models generate reasoning patterns that train tomorrow's models to reason even better, creating an accelerating cycle of capability enhancement.
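
As one illustration of how such a flywheel can be built (a rejection-sampling pattern on verifiable problems; the commentary does not claim this is any particular lab's recipe), the sketch below keeps only reasoning traces that end in verifiably correct answers and feeds them into the next training round. `generate_trace`, `is_correct`, and `finetune` are hypothetical placeholders.

```python
# Illustrative synthetic-data flywheel: the current model generates reasoning
# traces, only traces with verifiably correct answers are kept, and the next
# model is trained on them. All callables are hypothetical stand-ins.

from typing import Callable, List, Tuple

def flywheel_round(
    generate_trace: Callable[[str], Tuple[str, str]],   # problem -> (reasoning, answer)
    is_correct: Callable[[str, str], bool],             # verifier for the answer
    finetune: Callable[[List[Tuple[str, str]]], object],# training step on kept traces
    problems: List[str],
    samples_per_problem: int = 8,
) -> object:
    kept: List[Tuple[str, str]] = []
    for problem in problems:
        for _ in range(samples_per_problem):
            reasoning, answer = generate_trace(problem)
            if is_correct(problem, answer):   # keep only verified reasoning
                kept.append((problem, reasoning))
    return finetune(kept)                     # next-generation model
```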

While these capability flywheels incentivize data theft, securing synthetic reasoning data long-term appears challenging. Access will likely proliferate—for example, DeepSeek R1's shared chain-of-thought reasoning has already benefited many others. This data proliferation could further narrow gaps between frontier and follower models.

6. Policy Faces Information Asymmetry Challenges

Lastly, and importantly, we face a challenge in writing this piece: information asymmetries. Our remarks could be far more precise if we worked at a leading AI company. Industry insiders might even be amused by our speculations; yet here we are, lone AI policy researchers in DC, trying to make decisions about AI while learning about technical innovations through reverse engineering and the goodwill of AI companies.

As AI capabilities advance, making informed policy decisions becomes increasingly challenging from external perspectives. The technical details of reasoning processes and their development often remain proprietary, creating significant information asymmetries between developers and policymakers. Without addressing these asymmetries, policymakers risk falling further behind in understanding the rapidly changing AI landscape.

Conclusion

Test-time compute doesn't change everything—but it introduces important new dynamics that policymakers must consider. This isn't a paradigm shift that renders previous approaches obsolete but an evolution that adds new variables to the AI development equation and affects AI policy. It turns out AI development isn't just one line going straight up (pre-training); it might now be multiple lines (pre-training, reinforcement learning, and test-time compute)—but we (and the DC policy apparatus) only learn later what these lines are and how steep they might be.