Adjust Your Timelines. O3 Changes Everything

Dec 25, 2024

I was in the middle of writing about two entirely different topics when Day 12 of “Shipmas” arrived and OpenAI dropped its latest model, O3. The internet, and the AI world in particular, lit up in a frenzy, and what better way to summarize the reaction than a meme?

O3 isn’t just another AI release; it’s proof that a paradigm shift is continuing, one centered on a key innovation: reasoning, or what we fancily call test-time compute. Since ChatGPT launched two years ago, AI has excelled at recognizing patterns and generating outputs, but actual reasoning (the ability to think step by step and verify results) has always been the elusive prize. O3 changes that, marking a monumental leap in how models approach complex, novel problems, roughly three months after its predecessor, O1, was first previewed.

In the next 12 months, expect things to get even wilder. Two parallel scaling laws, pre-training advancements and test-time compute, are accelerating like never before. As base models improve, reasoning will help generate better synthetic data (new scientific discoveries, simulation data, error corrections…), creating a flywheel effect that compounds progress. If this pace holds, O4 and O5 could drop next year, each more transformative than the last.

So, brace yourself. The timeline for what’s possible in AI just shifted dramatically, and if O3 is any indication, 2025 might redefine everything we thought we knew about this technology. When these models' societal gains come to fruition, the world will go wild.

What is Reasoning?

Reasoning in AI is about much more than arriving at the correct answer; it’s about the steps and logic that lead there. With O3, this process has been supercharged. OpenAI has not published the full recipe, but here’s how it likely works:

You start with the base model (likely GPT-4o) and generate thousands of candidate solutions using a “chain of thought” approach. Then a verification model, likely the same base model, evaluates these answers, checking their logical consistency, accuracy, and calculations. This verifier is fine-tuned on an enormous dataset of corrections, enabling it to detect and refine mistakes. When the reasoning is correct, those steps are used to fine-tune the base model, creating a feedback loop that generates even better results. Think of it as synthetic data on steroids.
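The loop described above can be sketched in a few lines of Python. This is a toy illustration, not OpenAI's implementation: `generate_candidate` is a hypothetical stand-in for the base model (it multiplies two numbers and sometimes slips), `verify` is a hypothetical stand-in for the verification model (it simply re-checks each written step), and the traces that pass are what would be kept as fine-tuning data.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    steps: list[str]  # chain-of-thought steps, each written as "x op y = z"
    answer: int


def generate_candidate(a: int, b: int, rng: random.Random) -> Candidate:
    """Hypothetical base model: multiply a * b via partial products,
    occasionally making an arithmetic slip in the second step."""
    tens, ones = divmod(b, 10)
    part_tens = a * tens * 10
    part_ones = a * ones
    if rng.random() < 0.4:  # simulated reasoning error
        part_ones += rng.choice([-10, -1, 1, 10])
    answer = part_tens + part_ones
    steps = [
        f"{a} * {tens * 10} = {part_tens}",
        f"{a} * {ones} = {part_ones}",
        f"{part_tens} + {part_ones} = {answer}",
    ]
    return Candidate(steps, answer)


def verify(candidate: Candidate) -> bool:
    """Hypothetical verifier: recompute every step and reject any mismatch."""
    for step in candidate.steps:
        lhs, rhs = step.split(" = ")
        x, op, y = lhs.split()
        value = int(x) * int(y) if op == "*" else int(x) + int(y)
        if value != int(rhs):
            return False
    return True


def sample_and_verify(a: int, b: int, n_samples: int = 16, seed: int = 0) -> list[Candidate]:
    """Generate many candidates and keep only the traces the verifier accepts.
    The accepted traces are what would feed back into fine-tuning."""
    rng = random.Random(seed)
    candidates = [generate_candidate(a, b, rng) for _ in range(n_samples)]
    return [c for c in candidates if verify(c)]


verified = sample_and_verify(17, 24)
print(f"{len(verified)} of 16 traces verified; example: {verified[0].steps}")
```

The design point is that the verifier only has to check work, which is usually easier than producing it, so even a noisy generator can yield a clean training set.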

In more technical terms, OpenAI has trained models to search the space of potential solutions (think of classical tree search algorithms) and use automatic LLM-based evaluations to identify the best possible path. This technique breaks the implicit limitations on reasoning that come from the nature of the training data. If you think about it, not much of the published data available for training includes reasoning, only results. With this approach, OpenAI (and others exploring this space) have turned abstract computation into actionable logic that can adapt to complex, novel problems.
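To make the tree search analogy concrete, here is a minimal beam search sketch. Everything in it is an assumption for illustration: the "reasoning steps" are three toy arithmetic moves, and `score` is a hypothetical evaluator that stands in for the LLM-based judge (here it just measures distance to the target). The point is the shape of the loop: expand partial solutions, score them, keep the most promising few.

```python
from typing import Callable, Optional

# Toy "reasoning steps": each one transforms the current partial result.
STEPS: list[tuple[str, Callable[[int], int]]] = [
    ("add 3", lambda x: x + 3),
    ("double", lambda x: x * 2),
    ("subtract 1", lambda x: x - 1),
]


def score(value: int, target: int) -> float:
    """Hypothetical evaluator: higher is better. A real system would ask
    a learned judge how promising the partial solution looks."""
    return -abs(target - value)


def apply_path(start: int, path: list[str]) -> int:
    """Replay a list of step names from the starting value."""
    moves = dict(STEPS)
    value = start
    for name in path:
        value = moves[name](value)
    return value


def beam_search(start: int, target: int, beam_width: int = 3,
                max_depth: int = 6) -> Optional[list[str]]:
    """Keep the beam_width best partial paths at each depth."""
    beam: list[tuple[int, list[str]]] = [(start, [])]
    for _ in range(max_depth):
        for value, path in beam:
            if value == target:
                return path
        expanded = [
            (fn(value), path + [name])
            for value, path in beam
            for name, fn in STEPS
        ]
        expanded.sort(key=lambda node: score(node[0], target), reverse=True)
        beam = expanded[:beam_width]
    for value, path in beam:
        if value == target:
            return path
    return None  # no solution found within the search budget


path = beam_search(start=2, target=13)
print(path)
```

Raising `beam_width` or `max_depth` spends more compute at inference time in exchange for a better chance of finding a valid path, which is the trade-off the post calls test-time compute.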

The Two Scaling Curves

Reasoning is a new tool at our disposal to continue pushing frontier models' limits, effectively enabling a second scaling law. In this new reality, pre-training advancements (more data, more compute, better results) work alongside test-time compute (more time to think before returning a response). It’s a self-reinforcing loop: reasoning produces better synthetic data (the lack of which was a limiting factor in pre-training gains), which builds stronger base models, which in turn fuel even more refined reasoning, and so on. It’s a flywheel effect: one scaling law feeds the other, compounding progress.
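A back-of-the-envelope model shows why the two curves reinforce each other. The sketch below assumes, unrealistically, that each sampled answer is independently correct with probability `p_single` and that a perfect verifier can pick out a correct one; both are simplifying assumptions, not claims about how O3 behaves. Under them, the chance of solving a problem with `n_samples` attempts is 1 - (1 - p)^n.

```python
def success_probability(p_single: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples is correct,
    assuming a perfect verifier selects it (toy assumption)."""
    return 1.0 - (1.0 - p_single) ** n_samples


# A better base model raises p_single; more test-time compute raises n_samples.
for p_single in (0.05, 0.20):
    for n_samples in (1, 8, 64):
        prob = success_probability(p_single, n_samples)
        print(f"p_single={p_single:.2f}  n_samples={n_samples:>2}  success={prob:.3f}")
```

In this toy model, improving the base model and sampling more at inference time both raise the success rate, and a stronger base model needs far fewer samples to reach the same reliability.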

Achievements That Change the Game

Let’s talk numbers. What O3 has achieved across multiple benchmarks (a very common way to test the models) isn’t just incremental progress; it’s a fundamental leap that forces us to rethink what AI can do. While OpenAI has scheduled the full release for early 2025 (pending safety evaluations), the preliminary results are staggering. To put these achievements in perspective, imagine a rookie athlete simultaneously breaking world records in swimming, marathon running, weightlifting, and chess. That’s essentially what O3 has done in the AI world. Here’s what the data shows:

Math Mastery: Scored over 25% on FrontierMath, arguably the hardest math problem set in the world, a leap from the previous best of under 2%. 25% may not seem like much, but no more than a handful of humans in the world can get close to that. Don’t believe me? Read the sample question below.

A sample question of the FrontierMath benchmark

Science Supremacy: Outperformed human experts with 87.7% accuracy on GPQA Diamond, demonstrating a remarkable ability to handle graduate-level science questions. These aren’t easy, either.

In a parallel universe where a magnet can have an isolated North or South pole, Maxwell’s equations look different. But, specifically, which of those equations are different? — Yup, this is a real one

Software Engineering Excellence: Scored 71.7% on SWE-bench Verified and ranked at roughly the 99.95th percentile in competitive coding. These are real-world coding capabilities, including debugging and verification, way beyond the average software engineer. This is bananas.

General Adaptation: And here is the best one. Achieved roughly 88% on ARC-AGI (87.5% in the high-compute configuration), breaking barriers in general reasoning tasks previously considered untouchable. This benchmark was designed to be an “AGI-resistant” test, requiring adaptive reasoning across novel tasks. The model’s success here is groundbreaking but came with a steep price tag: the ARC-AGI inference alone reportedly cost $350,000. That cost speaks to the potential of scaling compute. Give machines more time ($$) to think, and good responses will come.

But Why Does This Matter to All of Us?

“I think that most people are underestimating just how radical the upside of AI could be.” — Dario Amodei (CEO, Anthropic)

The leap from benchmarks to real-world impact is happening faster than most people realize. We’re not just talking about AI scoring well on tests; we’re seeing machines rapidly closing the gap with human capabilities and, in some cases, surpassing them.

This chart likely needs an update. Math, General knowledge, and Code have (likely) fallen. Source: Our World in Data

In just a few years, we’ve witnessed AI surpass human-level performance in areas once thought to be exclusively human domains: general knowledge, mathematical problem-solving, and code generation. Even more striking, we’re seeing breakthroughs in complex reasoning — long considered the final frontier of human cognitive superiority.

Let me give you a concrete example that shows why this matters. Recently, social media erupted with panic about black plastic utensils potentially containing a toxic flame retardant called BDE-209. The story seemed credible until a researcher decided to fact-check the original paper. Here’s where it gets interesting: Ethan Mollick ran an experiment by feeding the paper to O1 (not even the latest version) through ChatGPT. Within seconds, the AI spotted a multiplication error on page 7 that had sparked the entire controversy. Think about that: near-instant detection of an error that had escaped the peer review process.
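For a sense of how small the slip was, here is the arithmetic as a quick check. The figures are not from this post; they are the numbers as I recall them from public coverage of the correction (a reference dose of 7,000 ng per kg of body weight per day, a 60 kg adult, a published limit of 42,000 ng/day, and an estimated intake of 34,700 ng/day), so treat them as assumptions to verify against the paper itself.

```python
# Figures recalled from public coverage of the correction (assumptions,
# not taken from this post): check them against the original paper.
REFERENCE_DOSE_NG_PER_KG_DAY = 7_000   # safe dose per kg of body weight
BODY_WEIGHT_KG = 60                    # adult body weight used in the paper
PUBLISHED_LIMIT_NG_DAY = 42_000        # safe daily limit as printed
ESTIMATED_INTAKE_NG_DAY = 34_700       # estimated exposure from utensils

# The multiplication the paper got wrong.
correct_limit_ng_day = REFERENCE_DOSE_NG_PER_KG_DAY * BODY_WEIGHT_KG

share_of_published = ESTIMATED_INTAKE_NG_DAY / PUBLISHED_LIMIT_NG_DAY
share_of_correct = ESTIMATED_INTAKE_NG_DAY / correct_limit_ng_day

print(f"Correct limit: {correct_limit_ng_day:,} ng/day")
print(f"Intake vs published limit: {share_of_published:.0%}")
print(f"Intake vs correct limit:   {share_of_correct:.0%}")
```

With these figures, a tenfold error in one multiplication is the difference between an exposure that looks close to the safety limit and one that sits well below it.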

This ability to quickly analyze complex documents and spot errors is just one facet of AI’s growing capabilities.

Let’s zoom in on another field that’s being revolutionized: software development. If O3 consistently outperforms the average software engineer and is deployed at scale in platforms like Devin or IDEs like Cursor, the very nature of software development will transform. English could soon become the dominant “coding language,” allowing people without traditional technical expertise to describe what they need and watch as AI builds, debugs, and optimizes complex systems in real time.

But it’s not just about efficiency or new tools. As Dario Amodei discusses in “Machines of Loving Grace,” we are on the brink of unlocking solutions to problems that have plagued humanity for decades. Think about the possibilities: curing diseases with precision medicine, modeling climate interventions with unprecedented accuracy, or even reshaping how we generate and distribute energy globally.

Conclusion

O3 isn’t just another milestone in AI’s journey; it’s the moment that forces us to tear up our roadmaps and redraw our horizons. Roughly three months after O1 was first previewed, OpenAI hasn’t simply moved the goalposts; it has changed the game entirely. When machines can outperform humans on graduate-level physics and solve mathematical problems that only a handful of people worldwide can tackle, we’re not just crossing thresholds; we’re shattering them.

Think about it: if this is O3, what territories will O4 and O5 explore? 2025 isn’t just going to be wild — it’s likely to be the year that rewrites the rules of what technology can achieve. The implications ripple far beyond benchmark scores and technical achievements. We’re watching the birth of systems that don’t just process information but genuinely reason through problems like a brilliant collaborator.

Here’s the kicker: even if we hit a plateau tomorrow, the capabilities we’ve unlocked with O3 will fuel a decade of innovation. But if this pace continues? We’re not just approaching a new chapter in technology; we’re opening an entirely new book. The next 12 months won’t just redefine what’s possible; they’ll redefine what we dare to imagine.

It’s here, and it’s thinking.

Alvaro Higes
