Should You Switch to Reasoning Models Like o3 for Critical Thinking with AI?

By Aleksei Zulin

You've heard the hype. You've probably tried o1 or o3 once or twice and thought - wait, is this thing actually thinking? And now you're wondering whether you should restructure how you use AI for anything that actually matters intellectually.

Here's the direct answer: reasoning models like o3 are genuinely better at structured logical problems, multi-step deduction, and tasks where the path to the answer requires explicit intermediate steps. For most everyday critical thinking tasks - analyzing an argument, weighing a decision, stress-testing a plan - the gap between reasoning models and frontier standard models is smaller than the marketing suggests, and switching wholesale may cost you more in latency and money than it returns in insight quality. The smarter move is task-specific selection, not a blanket switch.

That said, if you're doing graduate-level technical analysis, formal reasoning chains, or anything that lives in the territory between "hard logic" and "soft judgment," o3 genuinely changes what's possible. Knowing when you're in that territory is the actual skill.


What "Reasoning" Actually Means in This Context

The term "reasoning model" gets used loosely, so let's be precise about what's actually different.

Standard frontier models like GPT-4o or Claude 3.7 Sonnet (with extended thinking off) generate the answer directly, one token at a time. They're extraordinarily capable, but they get no separate space to work things out: any working-out has to happen inside the visible response, as the answer is being written. Reasoning models - OpenAI's o-series in particular - generate internal chain-of-thought tokens before producing the final answer. The model essentially argues with itself in a scratchpad before committing to a response.

François Chollet, the former Google AI researcher who created the ARC-AGI benchmark (designed specifically to test flexible reasoning rather than memorization), documented in December 2024 that o3 achieved 87.5% on the ARC-AGI semi-private evaluation set in its high-compute configuration, and 75.7% within the standard compute limit. The same benchmark had defeated every prior model convincingly. Chollet's benchmark was explicitly constructed to resist pattern-matching from training data - you need to generalize rules from a small number of examples. The fact that o3 cracked it suggests something structurally different is happening, not just better memorization at scale.

That benchmark matters for your decision because ARC-AGI tasks resemble a category of critical thinking that standard models genuinely struggle with: reasoning from sparse, unfamiliar premises to novel conclusions. If your work involves that kind of reasoning - legal analysis from novel case combinations, engineering failure modes in new system configurations, policy impact modeling - the benchmark gap is meaningful to you.

If your work involves synthesizing familiar information, generating arguments, evaluating rhetoric, or making strategic recommendations in well-mapped domains, the gap is much smaller.

A 2024 analysis published by researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) examining reasoning model performance across domains found that extended chain-of-thought processing produced the largest accuracy gains on problems requiring more than four sequential logical steps - and diminishing returns on problems that could be solved in two or fewer steps. This maps cleanly to a practical heuristic: count the steps your problem actually requires before choosing your model.
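The step-counting heuristic is simple enough to write down as a tiny helper. This is only a sketch under stated assumptions: the function name `suggest_model_class`, the labels it returns, and the decision to treat three or four steps as a judgment call are illustrative choices, not part of the cited analysis. Only the two thresholds (more than four steps, two or fewer) come from the paragraph above.

```python
def suggest_model_class(sequential_steps: int) -> str:
    """Map an estimated count of sequential logical steps to a model class.

    The thresholds follow the heuristic in the text; the labels and the
    middle band are assumptions made for illustration.
    """
    if sequential_steps < 1:
        raise ValueError("sequential_steps must be at least 1")
    if sequential_steps > 4:
        return "reasoning"  # where extended chain-of-thought reportedly helps most
    if sequential_steps <= 2:
        return "standard"   # diminishing returns from extended thinking
    return "either"         # 3-4 steps: the text gives no clear guidance


for steps in (1, 3, 6):
    print(steps, suggest_model_class(steps))
```

The hard part is not the function but the estimate you feed it: counting the steps honestly forces you to outline the problem before you pick a tool.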


Where Standard Models Still Hold Their Own

Ethan Mollick, a professor at the Wharton School, has run extensive practical experiments using AI across graduate business education since 2023, and his published observations (particularly in Co-Intelligence, 2024) show something counterintuitive: for judgment-heavy tasks, the quality difference between model classes often collapses when you give standard models enough context and iterative prompting space. The reasoning advantage of o-series models appears most strongly in zero-shot or low-prompt conditions - when the human hasn't done the scaffolding work.

Put differently: a skilled AI user with a well-structured prompt using Claude 3.7 Sonnet may outperform a novice user with o3 on the same analytical task. The model's internal reasoning compensates for some of the scaffolding the human didn't provide. But for someone who already knows how to construct layered prompts and iterate toward precision, the marginal gain from the model's extended thinking is smaller.

This matters for the switching decision. If you're still developing your prompting craft, reasoning models offer real insurance - the model catches reasoning failures you didn't know to look for. If you've built systematic habits around how you use AI for thinking, you're already capturing a chunk of that benefit manually.

There's another wrinkle. Standard models are often better at the social dimensions of critical thinking - detecting emotional subtext, identifying rhetorical moves, understanding what an argument is really doing in context versus what it claims to be doing. Reasoning models optimize for logical structure. Human critical thinking, most of the time, needs both.


The Cost You're Not Accounting For

Latency is a real cognitive variable, and I don't think people talk about this enough.

When you use a reasoning model on a complex task, you might wait 30 to 90 seconds for a response instead of 5. In a deep analytical session - the kind where you're genuinely using AI as a thinking partner - that wait interrupts the cognitive flow state that makes the session valuable in the first place. You lose the conversational momentum.

Subbarao Kambhampati, a professor at Arizona State University's School of Computing and Augmented Intelligence who has published extensively on LLM reasoning limitations, has made the point that the real bottleneck in human-AI reasoning collaboration is rarely the model's output quality - it's the human's ability to evaluate, redirect, and build on that output in real time. A slower model that returns a better first answer might actually produce worse collaborative outcomes than a faster model you can iterate with more rapidly.

The math here isn't obvious. Sometimes the right answer is: use the slower model once, go away and think, come back. Sometimes the right answer is: fire fast responses back and forth until you've triangulated something. Reasoning models fit the first pattern. Standard models fit the second. Neither pattern dominates critical thinking work.
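To make that arithmetic concrete, here is a minimal sketch that compares time spent waiting under the two patterns. The helper names and the example values (one 60-second response, eight 5-second round trips) are assumptions chosen for illustration; 60 seconds simply sits inside the 30 to 90 second range mentioned earlier. It deliberately ignores the time you spend reading and typing, which usually dominates in practice.

```python
def total_wait_seconds(rounds: int, latency_seconds: float) -> float:
    """Total time spent waiting on the model across a session."""
    if rounds < 0 or latency_seconds < 0:
        raise ValueError("rounds and latency_seconds must be non-negative")
    return rounds * latency_seconds


def fast_rounds_per_slow_response(slow_latency: float, fast_latency: float) -> int:
    """How many fast round trips fit inside one slow response."""
    if fast_latency <= 0:
        raise ValueError("fast_latency must be positive")
    return int(slow_latency // fast_latency)


# Illustrative values only, not measurements.
one_slow_pass = total_wait_seconds(rounds=1, latency_seconds=60)
eight_fast_rounds = total_wait_seconds(rounds=8, latency_seconds=5)
print(one_slow_pass, eight_fast_rounds)          # 60 40
print(fast_rounds_per_slow_response(60, 5))      # 12
```

Under these assumed numbers, a single slow pass costs as much waiting as twelve fast exchanges. Whether that trade is worth it depends on whether twelve rounds of iteration would get you further than one careful answer.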

Research from Stanford's Human-Centered AI Institute (HAI) on human-AI collaboration patterns has similarly noted that response latency above approximately 20 seconds causes measurable disruption to analytical conversation flow - a finding that directly implicates reasoning model trade-offs in real working conditions.


When Reasoning Models Are Worth the Switch

Two cases where the calculus shifts decisively.

Formal logic and verification. If you're using AI to verify an argument's validity - not its plausibility, its actual logical structure - reasoning models are significantly more reliable. OpenAI's o1 announcement (September 2024) showed that o1 substantially outperformed GPT-4o on GPQA Diamond, a benchmark of PhD-level science questions that require multi-step reasoning chains, scoring 78.0% versus GPT-4o's 56.1% under consensus sampling. For verification tasks, the model's internal deliberation matters: it's more likely to catch its own contradictions before surfacing an answer.

One-shot, high-stakes analysis. When you can't iterate - a single document needs to be analyzed correctly, or a decision memo needs to get the reasoning right on the first pass - reasoning models offer more robustness against the shallow-sounding-but-wrong output that plagues faster models under pressure. If you have time to iterate, you can compensate with prompting. If you don't, spend the latency.

What doesn't work: using reasoning models as a substitute for domain expertise you lack. A reasoning model will produce a structurally valid argument in a domain where neither you nor the model has genuine grounding. The argument will sound thorough. The premises may still be wrong. Reasoning improves inference given premises; it doesn't validate the premises themselves. People forget this constantly. (I've caught myself forgetting it too, which is - worth admitting.)
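The gap between valid inference and true premises can be shown with a brute-force truth-table check. This is a sketch for illustration only: `is_valid` and the way arguments are encoded as Python functions are assumptions of this example, not how any reasoning model works internally.

```python
from itertools import product


def is_valid(premises, conclusion, variables):
    """Return True if every truth assignment that satisfies all premises
    also satisfies the conclusion (i.e. the argument form is valid)."""
    for values in product([True, False], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample
    return True


def implies(e):
    return (not e["P"]) or e["Q"]  # "if P then Q"


# Modus ponens: if P then Q; P; therefore Q.
print(is_valid([implies, lambda e: e["P"]], lambda e: e["Q"], ["P", "Q"]))  # True

# Affirming the consequent: if P then Q; Q; therefore P.
print(is_valid([implies, lambda e: e["Q"]], lambda e: e["P"], ["P", "Q"]))  # False
```

The checker confirms that the first form is valid, yet it never asks whether P is actually true in the world. That is the part no amount of careful inference supplies, and the part that still depends on domain grounding.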


The Wrong Frame: Model Loyalty

Most people asking "should I switch to o3?" are asking the wrong question.

The mental model of "my AI" - a single model I use for everything - makes sense when models are undifferentiated. They're not anymore. The intelligent practice now looks more like a tasting menu of cognitive tools: standard models for ideation, fast iteration, and social/rhetorical analysis; reasoning models for formal verification, novel logical territory, and high-stakes single-pass analysis; specialized models for domain-specific depth.
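That division of labor can be sketched as a small routing table. Everything named here is hypothetical: the task labels, the `ModelClass` enum, and `pick_model_class` are illustrative stand-ins for the categories in the paragraph above, not an API from any vendor.

```python
from enum import Enum


class ModelClass(Enum):
    STANDARD = "standard"
    REASONING = "reasoning"
    SPECIALIZED = "specialized"


# Hypothetical task labels mirroring the categories listed in the text.
ROUTING = {
    "ideation": ModelClass.STANDARD,
    "fast_iteration": ModelClass.STANDARD,
    "rhetorical_analysis": ModelClass.STANDARD,
    "formal_verification": ModelClass.REASONING,
    "novel_logic": ModelClass.REASONING,
    "single_pass_high_stakes": ModelClass.REASONING,
    "domain_depth": ModelClass.SPECIALIZED,
}


def pick_model_class(task: str) -> ModelClass:
    """Look up the model class for a task label chosen before work begins."""
    try:
        return ROUTING[task]
    except KeyError:
        raise ValueError(f"unknown task type: {task!r}") from None


print(pick_model_class("formal_verification").value)  # reasoning
```

The lookup fails loudly on an unknown label by design: if a task does not fit any category, that is a prompt to think about what the task needs, not something to paper over with a default.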

This is genuinely new cognitive infrastructure. We don't have great intuitions for it yet because we've never had it before. The people who are developing sharp intuitions are the ones treating model selection as a metacognitive decision - not a preference, but a deliberate tool choice made before the task begins.

The switch isn't binary. The question isn't whether o3 is better than Claude or GPT-4o. The question is: for this specific task, in this specific context, what kind of model architecture does this problem need? That's a different question, and it requires you to get clearer about what you actually need from AI on any given task - which turns out to be a valuable practice regardless of what you conclude about the models.


Limitations

The evidence base here has real limits that deserve acknowledgment.

Most model comparison studies, including the benchmark evaluations cited above, measure performance under controlled conditions that don't replicate real working contexts. They test models on problems with known answers. Critical thinking in professional or intellectual life mostly operates on problems without clean ground truth - decisions, interpretations, strategic assessments. No benchmark currently measures how well a model helps you think through a genuinely ambiguous problem where even experts disagree.

The ARC-AGI result is genuinely impressive, but Chollet himself has been careful to note that it doesn't prove general reasoning capability - it proves a specific kind of pattern generalization. That distinction matters.

There's also the problem of model versions changing rapidly. Something observable about o3 today may not describe o3 in eight months. The gap between reasoning and standard models is narrowing as standard models incorporate more inference-time compute. More research is needed on whether extended thinking genuinely improves outcomes in judgment-heavy domains or only in logic-heavy ones - and on the specific cognitive costs of latency in collaborative human-AI reasoning sessions.


FAQ

If I'm not a technical user, should I bother with o3 at all?

For most everyday analytical work - writing, research synthesis, argument evaluation - a well-prompted standard model will serve you better because you can iterate faster and the latency won't interrupt your thinking. Reach for o3 only for formal logic, verification tasks, or problems where you genuinely need multi-step deduction and can afford to wait.

Is extended thinking in Claude the same as what o3 does?

Functionally similar in that both run an internal chain-of-thought phase before the final output, though how much of that reasoning you get to see differs between products. The implementations differ too. Both are worth testing on your specific problem type - some users find one more reliable for certain domains than the other, and the honest answer is that this varies enough that experimentation beats theory.

Does using a reasoning model reduce the need for good prompting?

Partially, and only in specific conditions. Reasoning models compensate most when you're working zero-shot or with minimal context - the internal chain-of-thought catches errors you didn't prompt against. But for complex, judgment-heavy tasks, prompting quality still determines whether you've framed the right question. A reasoning model will reason more carefully about a poorly framed question; it won't tell you the question was wrong.


The model selection question connects directly to a larger one: what does it mean to use AI as a genuine thinking partner rather than a sophisticated search tool? That question sits at the center of everything I write about in The Last Skill - how to preserve and sharpen human judgment in an environment where AI handles an increasing share of cognitive load. The decision about which model to use is, at bottom, a decision about what kind of thinking you want to do yourself and what you want to delegate. Knowing that distinction is the last skill that matters.



About the Author

Aleksei Zulin is the author of The Last Skill, a book on how to think with AI as a cognitive partner rather than use it as a tool. Systems engineer turned writer exploring the frontier of human-AI collaboration.
