Why behind AI: Code Red 5.2
OpenAI’s leadership seems determined not to let any other frontier lab claim the top spot in the world of SOTA models
The last two weeks on the frontier of AI have been hectic, to say the least. After the release of Gemini 3.0 Pro, we saw a flurry of new models launching, each trying to outdo the rest.
On December 2nd, a story leaked via The Information regarding an internal memo in which Sam Altman declared a code red:
OpenAI CEO Sam Altman on Monday told employees he was declaring a “code red” to marshal more resources to improve ChatGPT as threats rise from Google and other artificial intelligence competitors, according to an internal memo.
As a result, OpenAI plans to delay other initiatives, such as advertising, Altman said.
OpenAI CEO Sam Altman declares ‘code red’ to improve ChatGPT.
The company is preparing to release a new reasoning model that scores well against Google’s Gemini 3
The code red also involves making improvements to OpenAI’s image-generating AI
“We are at a critical time for ChatGPT,” he said.
OpenAI hasn’t publicly acknowledged it is working on selling ads, but it is testing different types of ads, including those related to online shopping, according to a person with knowledge of its plans. Millions of people already use ChatGPT to search for products to buy.
Altman said the code red “surge” to improve ChatGPT meant OpenAI would also delay progress with other products such as AI agents, which aim to automate tasks related to shopping and health, and Pulse, which generates personalized reports for ChatGPT users to read each morning.
I’ve previously covered why Google remains the biggest competitor to OpenAI (the research lab was literally funded by Musk as an early move to prevent Google’s potential dominance in AI):
It’s clear that GPT-5 is an attempt to put efficiency and outcome-based pricing at the forefront of the product. This is needed both because it improves their hand in case things get more contentious with Microsoft, but also because it should give them a little more breathing room when 20% of that revenue is going to Microsoft, essentially killing any chance of a positive margin.
On the product side they need to achieve 3 outcomes:
Towards (pro)consumers: Improve average revenue per customer and reduce the cost of subsidizing free usage. The best way to achieve this is to minimize unproductive usage (essentially the companion aspect of 4o usage), try and push as many queries as possible to lower compute configurations without significant penalties on retention, and push paying users into the Pro tier. GPT-5 is clearly aimed in this direction, particularly due to significantly reducing the Plus subscription benefits. Both the amount of messages and context window on Plus have been reduced to the point where it’s no longer useful as a primary subscription for heavy users. When you also account that the best performance is with GPT-5 Pro, the need for the highest subscription is obvious.
Towards developers: Developers predominantly need to use the API and will do so through application layers with quality of life improvements. There is a reason why the Cursor team was positioned quite heavily in the presentation. The problem here is that Claude Code appears to be strongly preferred as a primary tool by developers for agentic workflows (the most token-consuming ones). Hence what looks like a joint play with Cursor to offer GPT-5 as the best model for the newly launched Cursor CLI agent (free usage the first days of the launch). Whether this will be a successful strategy is yet to be seen, but Anthropic is on pace to reach 40% of OpenAI’s revenue this year thanks to dominating with developers and this use case is too important to play catch-up.
Towards businesses/Enterprise: This is a highly awkward product currently. Revenue in the range of $1B is negligible at their size, and Microsoft offers essentially an equivalent product in terms of average outcomes through Copilot for Business. The most confusing part is that Enterprise usage doesn’t include the Pro model and the increased limits on Deep Research, both of which are the essential killer apps for pro users. Google pulled a similar confusing feat by launching an Ultra plan that’s not available for businesses and includes their Deep Thinking mode, which is highly competitive. If GPT-5 is meant to improve outcomes in this direction, it’s difficult to see how.
I think that GPT-5 has proven to be a success in terms of efficiency, while the Codex coding model has been able to capture attention (and inference) on the coding side.
Still, both Google and Anthropic have shown significant growth both in terms of ARR and mind share with the heavy early adopters that can no longer be ignored.
To put it simply, Sam has overcommitted and underdelivered relative to public perception. For OpenAI to continue scaling, they will likely need to offer the undisputed best model in the market at the same or lower cost envelope as their competition. They need to do this at a time when the quality and determination of their competition is at its highest.
The alternative is to win the consumer business in such a way that second and third place barely get a mention. The most likely strategy for this would be to offer consumers a super assistant:
He often communicates his vision to staffers in writing, including weekly “top of mind” Slack posts and strategy memos. In one such memo, presented as part of the Google antitrust trial, he shared a vision for a future version of ChatGPT that would act like a “super assistant,” with advanced search capabilities very different from those of today’s conventional search engines. Consumers could tell such an assistant they wanted to buy a particular pair of shoes, and the technology could find them online, propose other shoes and even complete a transaction, Turley said during the trial.
Turley has said this view of ChatGPT would make it more akin to an operating system. Users of this future product could access AI through a variety of apps and agents to book travel and order food deliveries, as well as carry out work-related actions like writing code.
We are seeing the foundation of this product today, with Shopping Assistant being introduced recently, alongside the curated social product Pulse. OpenAI has also had a usable voice mode for more than a year, which enables true multimodality.
Still, conversion rates relative to the user base have remained low. The current active subscriber base across consumer and business (which historically accounts for 75% of ARR) is around the 40M mark. While a 5% conversion rate is nothing to sniff at, it’s a low figure for what’s supposed to be the most useful application in the world.
OpenAI’s efforts to deliver additional applications outside of ChatGPT have also not progressed significantly, while Google has continued to expand Gemini integrations across its surfaces and appears to be entering the Apple consumer play as well. The benefit of Google’s monopolistic scale is that they can afford not only to keep making their product better, but also to deploy it in ways that continue to erode OpenAI’s conversion rates.
So if we come back to first principles, the only way to fight “good enough” is with “significantly better.”
Enter 5.2.
For coding purposes, 5.2 is a marked improvement over even the specialized Codex variation, while delivering outcomes with fewer output tokens on most tasks.
The coding benchmarks are less interesting, however, compared to knowledge-work tasks. On GDPval, 5.2 performs at expert level and delivers “the best” output in more than half of the cases. It’s worth reading the caveats on GDPval, since the benchmark is run by OpenAI themselves:
GDPval is an early step. While it covers 44 occupations and hundreds of tasks, we are continuing to refine our approach to expand the scope of our testing and make the results more meaningful. The current version of the evaluation is also one-shot, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts—for example, revising a legal brief after client feedback or iterating on a data analysis after spotting an anomaly. Additionally, in the real world, tasks aren’t always clearly defined with a prompt and reference files; for example, a lawyer might have to navigate ambiguity and talk to their client before deciding that creating a legal brief is the right approach to help them. We plan to expand GDPval to include more occupations, industries, and task types, with increased interactivity, and more tasks involving navigating ambiguity, with the long-term goal of better measuring progress on diverse knowledge work.
GPT-5.2 Pro specifically remains the best model in the world, even if it’s hidden behind the $200 subscription (although it will now finally appear in the API). Let’s take a look at the feedback from a power user:
A Concrete Example
I’m working on a new app (more on this soon, maybe?) that requires balancing a ton of different constraints... engineering time, maturity of AI tech available today, very strict user experience considerations, cost, and more. Getting all of these right simultaneously is extremely difficult.
Most models fell flat on their face when I described what I was trying to build and asked for ideas that fit those constraints. They’d give up, usually by optimizing for one constraint while ignoring others, or suggesting solutions that weren’t actually feasible.
I gave the problem to Pro. It thought for almost an hour. When it finished, it had come up with a fantastic idea that I’m actually using. The solution accounted for constraints I hadn’t even explicitly mentioned... it understood the shape of the problem well enough to fill in gaps I’d left.
There’s just nothing like GPT-5.2 Pro.
How Pro Thinks Differently (Well... Maybe)
One thing I’ve noticed from watching Pro’s reasoning summaries is that it uses code a lot more than I expected. Not just for coding tasks. For everything.
When I asked it to write a book, it used code to keep track of chapter names, chapter lengths, and the overall outline. It planned the entire structure programmatically before writing, then used code to build the final PDF.
For idea generation tasks, when it’s juggling a bunch of possibilities, it’ll put them into lists and data structures. It’s using code to organize its working memory... keeping track of what it’s considering, what constraints each option satisfies, what tradeoffs exist.
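To make that concrete, here’s a rough sketch of the kind of bookkeeping structure I’m describing. Every name, option, and constraint below is made up for illustration; none of it is pulled from Pro’s actual reasoning summaries.

```python
# Illustrative only: a scratchpad that tracks candidate ideas, the constraints
# each one satisfies, and the tradeoffs it carries. This is the *shape* of what
# the reasoning summaries show, not Pro's real internal code.
from dataclasses import dataclass, field

@dataclass
class Option:
    name: str
    constraints_met: set = field(default_factory=set)
    tradeoffs: list = field(default_factory=list)

constraints = {"engineering_time", "ai_maturity", "ux", "cost"}
options = [
    Option("on-device model", {"ux", "cost"}, ["limited capability"]),
    Option("hosted frontier model", {"ai_maturity", "ux"}, ["per-query cost"]),
    Option("hybrid routing", {"engineering_time", "cost", "ux"}, ["added complexity"]),
]

# Rank candidates by how many stated constraints they satisfy, and surface
# what each one is still missing.
for opt in sorted(options, key=lambda o: len(o.constraints_met), reverse=True):
    missing = sorted(constraints - opt.constraints_met)
    print(f"{opt.name}: meets {len(opt.constraints_met)}/{len(constraints)}, missing {missing}")
```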
I don’t know if previous models were doing this internally and we just couldn’t see it, but the reasoning summaries definitely show much more code than I’ve seen before. Maybe OpenAI is just increasing transparency a bit. But it’s definitely a noticeable difference, at least for me.
When Pro Fails
Pro isn’t perfect. When it fails after thinking for a long time, it’s usually because it made a wrong assumption somewhere or misunderstood part of the problem. The output looks reasonable but doesn’t actually solve what you asked for, or solves a slightly different problem than you intended.
This is annoying specifically because of the time investment. Every so often, Pro will think for 45 minutes and then fail, and it wastes a ton of time. But it fails less often than previous models, and when you’re working on hard problems, some failure rate is unavoidable. Even people make wrong assumptions sometimes.
Overall, Pro gets it right more often than not, and more often than anything else I’ve used.
…
Pro vs. Everything Else
I haven’t found a single task where standard GPT-5.2 Thinking beats Pro. That doesn’t mean Thinking is bad. It’s a good model, but if you have Pro access and time isn’t a constraint, Pro is just better.
Claude Opus 4.5 sometimes beats Pro, but it’s a matter of different strengths rather than one being universally better. I think Opus handles some creative writing tasks better. There’s a stylistic quality to its prose that I prefer. For quick, well-defined code changes where I know exactly what I want, I slightly prefer the code Opus 4.5 writes. It’s a small stylistic thing.
For quick research, obviously I’m not going to Pro, as I don’t want to wait 20 minutes for something I could get in 20 seconds. But for extensive research, where I need something researched and thought through deeply and carefully, Pro is where I go.
Pro is also definitely a better writer than GPT-5.2 Thinking. The thoughtfulness that goes into Pro’s reasoning translates into more nuanced, better-structured and more info-dense writing.
Pure prose quality still trails Claude Opus 4.5, but I often choose 5.2 Pro for writing anyway because it reasons more carefully; even if the wording is a touch less polished, the arguments are clearer and better supported.
Improvements Over 5.1 Pro
It’s not like GPT-5.2 Pro is a different breed of model from 5.1 Pro. I can’t point to any one thing that’s dramatically better. It just is overall somewhat better, and a bit more reliable across the board.
Part of this is that we’re getting to the edge of what we as humans can often evaluate outside of our own domains. In coding, I can see that it’s better. But if I’m asking a medical question, I’m not qualified to judge if 5.2 Pro is better than 5.1 Pro... they’re both so much smarter than me in that domain.
I’d say it’s probably 15% better across the board (which, for less than a month of progress, is pretty amazing). And it’s willing to think longer when it needs to, which is a huge win. It would be annoying if it thought longer on things where it didn’t need to, but I often find it’s around the same speed as 5.1 Pro on most tasks... it’s just on things that are extremely hard that it’s willing to go longer.
Is It Worth $200/Month?
The ChatGPT Pro plan costs $200/month and gives you essentially unlimited Pro queries. Whether that’s worth it depends entirely on how you work.
For me, it’s not even a question. I can’t live without GPT-5.2 Pro. I pay $200/month without thinking about it. I rely on this for my daily work in ways that would be hard to replicate with other tools.
But I’m not the average user. I’ve been using these models intensively for a long time. I know how to prompt them well. I’ve integrated AI into my workflows deeply enough that I constantly see opportunities to use it. I have friends who are dealing with something in their life that AI can help with, and it hasn’t even occurred to them to use it. For them, Pro probably isn’t worth $200/month. They wouldn’t get enough value out of it.
If you’re someone who uses AI seriously, who works on hard problems, who has learned to prompt effectively, and who would benefit from having access to the most capable reasoning available, Pro is worth it. If you’re still figuring out how to integrate AI into your work, you might want to get more comfortable with the standard (and much, much cheaper) tiers first.
The folks at Every are more mixed, particularly on the 5.2 regular and Thinking versions:
Bottom line: If you’re a ChatGPT Pro subscriber, GPT-5.2 is worth exploring for longer-running analytical tasks. For a leap in everyday chat, temper expectations—the real gains will likely come when this model powers agentic tools like Codex.
Overall, we’ll happily use this model for day-to-day ChatGPT use, but Opus 4.5 is still our workhorse for tasks that require the most creativity, intelligence, and autonomy.
Coming back to the benchmarks, the model turned in a strong showing on ARC-AGI:
Almost one year ago, an unreleased version of o3 benchmarked at 88%, while costing $4.5K per task.
Today, GPT-5.2 Pro with extended thinking, which is actually accessible via the API, was able to reach 90.5% for $11.64 per task: roughly a 390x efficiency improvement ($4,500 / $11.64 ≈ 390).
More interestingly, they also now have the best model on the second version of the benchmark, which was designed to remain “hopefully unbeatable” for a few years.
On paper, they are now dominating Gemini 3.0 Pro and Opus 4.5 in the most novel of challenges:
So if intelligence has improved, how are we doing on efficiency?
As the classic meme goes, THE PRICE OF THE BRICK GOES UP.
While in some scenarios the model will complete a task with fewer tokens, practically speaking there is a significant bump in pricing, probably following Google’s choice to raise Gemini’s rate card. Anthropic has always been much more expensive, so their recent price cut to $5 per million input tokens and $25 per million output tokens didn’t move the needle significantly.
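To see how both things can be true at once, here’s a minimal sketch of how a rate card and token counts combine into per-task cost. The $5/$25 figures are the Anthropic rates quoted above (per million tokens); every other number is a placeholder chosen for illustration, not published pricing.

```python
# Per-task cost from a per-million-token rate card. Rates and token counts
# other than Anthropic's $5/$25 are illustrative placeholders.
def task_cost(input_tokens: int, output_tokens: int,
              in_per_mtok: float, out_per_mtok: float) -> float:
    """Dollar cost of one task given input/output token counts and rates."""
    return (input_tokens / 1e6) * in_per_mtok + (output_tokens / 1e6) * out_per_mtok

# A model with a pricier rate card can still come out cheaper per task
# if it finishes the job with fewer output tokens.
verbose = task_cost(20_000, 60_000, in_per_mtok=5.0, out_per_mtok=25.0)
concise = task_cost(20_000, 35_000, in_per_mtok=7.0, out_per_mtok=30.0)
print(f"verbose model: ${verbose:.2f} per task, concise model: ${concise:.2f} per task")
```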
So, does this update move the needle? I’m a heavy user of ChatGPT and pay for the $200 subscription. I trust the Codex models most for my Infra Play Database. I’ve leveraged Deep Research with the Pro models on an almost daily basis for account research. Updates like these are welcome, because they squeeze performance higher, particularly when it comes to the model figuring things out over a longer period of time.
The challenge for OpenAI is that users like myself will not limit themselves to a single subscription and will experiment with other models repeatedly. The combined cost of Gemini, Anthropic, Grok, and Cursor subscriptions, together with ChatGPT Plus, is still only slightly more than half the cost of Pro. The only real reason for me to pay the $200 monthly cost is Deep Research with 5.2 Pro for knowledge work.
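As a quick back-of-the-envelope check on that comparison, here are what I assume to be the current monthly list prices; the figures are my assumptions rather than numbers quoted anywhere above.

```python
# Assumed monthly list prices (my assumptions, may be out of date).
bundle = {
    "ChatGPT Plus": 20,
    "Gemini (Google AI Pro)": 20,
    "Claude Pro": 20,
    "SuperGrok": 30,
    "Cursor Pro": 20,
}
total = sum(bundle.values())  # $110/month
print(f"Bundle: ${total}/month vs ChatGPT Pro: $200/month "
      f"({total / 200:.0%} of Pro)")  # ~55%, i.e. slightly more than half
```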
Several hours before the official launch of 5.2, Google announced a significant update to their Research Agent, claiming a big performance bump. While this is still to be tested (it’s not fully rolled out to regular applications), it should be obvious how this puts my potential renewal for the highest tier at risk.
It’s a war of attrition out there and code red remains very much in place.










