What Is Chain-of-Thought Prompting and How Does It Work?

The model gets the wrong answer. You try again with the same question. Wrong again. Then you add five words - "let's think step by step" - and suddenly it's correct. Not kind of correct. Precisely, verifiably correct. Same model. Same question. Different prompt.

That's chain-of-thought prompting, and it's one of the most counterintuitive discoveries in AI research of the last decade.

Chain-of-thought (CoT) prompting is a technique where you instruct a language model to articulate its intermediate reasoning steps before producing a final answer. Instead of asking a model to jump directly to a conclusion, you encourage it to "think aloud" - generating a sequence of logical steps that trace the path from problem to solution. The result is that models solve problems they'd otherwise fail at, not because they've been retrained, but because the reasoning process itself is made visible and explicit inside the output.

It works because large language models are, at their core, next-token predictors. When forced to generate reasoning tokens before an answer token, the model conditions its answer on better intermediate context. The chain of thought becomes part of the input, not just the output.

Where Chain-of-Thought Came From

The landmark paper is "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" by Jason Wei, Xuezhi Wang, and colleagues at Google Brain, published in 2022. Their finding was stark: on a benchmark called GSM8K - a set of grade-school math word problems - a 540-billion parameter model (PaLM) jumped from roughly 18% accuracy to roughly 57% accuracy when given chain-of-thought examples in the prompt. No fine-tuning. No new training data. Just different prompting.

Wei's team also found something that surprised many in the field: the gains from chain-of-thought emerged reliably only in models above a certain scale threshold - approximately 100 billion parameters. Below that, encouraging step-by-step reasoning either had no effect or sometimes made performance worse. This "emergent ability" framing matters because it means CoT prompting isn't universally applicable. It's scale-dependent.

The idea didn't arrive fully formed. Earlier work on "scratchpads" by Nye et al. (2021) showed that allowing models to produce intermediate computation tokens could improve multi-step arithmetic. Wei's contribution was demonstrating that this extended to natural language reasoning tasks - commonsense, symbolic, and mathematical - and that it could be elicited through prompting alone, without modifying the model.

How the Mechanism Actually Works

Here's where it gets strange if you think too hard about it.

A language model doesn't "think" in the way humans do. There's no internal deliberation happening before the first token appears. The model generates token by token, each one conditioned on everything that came before it. So when you prompt a model to reason step by step, what you're really doing is forcing it to produce a longer, more structured context that every subsequent token can attend to.

The chain of thought isn't a window into pre-existing reasoning. It's reasoning being constructed in real time, token by token, where each generated step constrains and informs what comes next.
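One way to make this concrete is the standard autoregressive factorization. The notation is an assumption introduced here for illustration, not taken from the papers: q is the prompt, r_1 through r_k are the generated reasoning steps, and a is the final answer.

```latex
p(r_1, \dots, r_k, a \mid q)
  = \left[ \prod_{i=1}^{k} p(r_i \mid q, r_1, \dots, r_{i-1}) \right]
    \, p(a \mid q, r_1, \dots, r_k)
```

Direct prompting draws the answer from p(a | q). With a chain of thought, the answer is drawn from p(a | q, r_1, ..., r_k), so every step the model has already written is part of what the answer is conditioned on.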

A 2023 study from MIT's BCS and CSAIL - specifically work by researchers including Jacob Andreas - explored the degree to which intermediate tokens in CoT outputs actually causally influence final answers versus being post-hoc narratives. The evidence suggested that for factual and logical tasks, the intermediate steps do carry causal weight. Remove them mid-generation, and answer accuracy drops. They're doing real work.

Two main CoT variants have emerged since the original paper. Few-shot CoT means providing the model with examples of solved problems that include worked-out reasoning traces before asking your actual question. Zero-shot CoT - popularized by Kojima et al. (2022) at the University of Tokyo - showed that simply appending "Let's think step by step" to a prompt, with no examples at all, produced significant accuracy gains. Cheaper to write. Almost as effective for many tasks.
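As a rough sketch of how the two variants differ at the prompt level, the snippet below builds both kinds of prompt as plain strings. The helper names `zero_shot_cot` and `few_shot_cot`, the "Q:/A:" layout, and the example problems are assumptions made for illustration. No model is called.

```python
# Hypothetical helpers for illustration; not taken from either paper's code.


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase and supply no examples."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot(question: str, examples: list[tuple[str, str, str]]) -> str:
    """Few-shot CoT: prepend worked examples of (question, reasoning, answer)."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    # The real question goes last, with the answer slot left open.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Made-up example problems, used only to show the prompt layout.
examples = [
    (
        "Tom has 3 boxes with 4 pens each. How many pens does he have?",
        "Each box has 4 pens and there are 3 boxes, so 3 * 4 = 12.",
        "12",
    )
]
question = "A shelf holds 5 rows of 6 books. How many books are on it?"

print(zero_shot_cot(question))
print("---")
print(few_shot_cot(question, examples))
```

The few-shot version costs more tokens per call because the worked examples are resent every time, which is the trade-off behind "cheaper to write" above.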

When It Fails (And Who It Fails For)

Chain-of-thought prompting can introduce a specific failure mode worth naming: confident wrong reasoning. The model produces a fluent, plausible-sounding chain of steps that leads to an incorrect answer with full apparent confidence. The chain of thought reads well. The answer is wrong.

This is arguably more dangerous than a blunt wrong answer, because the coherence of the reasoning chain can make the error harder to catch. Wang et al.'s 2022 paper on "Self-Consistency" addressed this partially - by sampling multiple reasoning chains and taking a majority vote on the final answer, you can reduce the impact of any single faulty chain. But it doesn't eliminate the problem.
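The voting step in self-consistency is simple enough to sketch. The version below assumes the final answer is the last number in each chain, which is a simplification chosen for this example, and it uses hard-coded strings in place of chains actually sampled from a model.

```python
import re
from collections import Counter
from typing import Optional


def extract_answer(chain: str) -> Optional[str]:
    """Return the last number in a reasoning chain, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistent_answer(chains: list[str]) -> Optional[str]:
    """Majority vote over the final answers of several reasoning chains."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hard-coded stand-ins for chains sampled at nonzero temperature.
sampled_chains = [
    "3 boxes times 4 pens is 12. The answer is 12",
    "4 + 4 + 4 = 12, so the answer is 12",
    "3 plus 4 is 7. The answer is 7",
]
print(self_consistent_answer(sampled_chains))  # prints 12
```

Note what the vote cannot do: if most sampled chains share the same mistake, the majority answer is still wrong.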

There's also a class of tasks where CoT can actively hurt performance. Yao et al. (2023), in the work that introduced "Tree of Thoughts," argued that problems requiring exploration or lookahead don't map well onto a single linear reasoning chain. A similar concern applies to tasks that call for creative divergence rather than convergent logical steps. If you're asking a model to generate marketing copy, brainstorm unusual product names, or produce poetry - forcing step-by-step analytical reasoning can flatten the output. The model becomes more systematic and less generative.

Small models are the other exception. (I keep coming back to this because I think practitioners underestimate it.) If you're working with a fine-tuned 7B or 13B parameter model, adding "let's think step by step" to your prompt may accomplish nothing. Or it might produce a chain of plausible-sounding reasoning that ends in an answer even further from correct. The mechanism requires sufficient model capacity to function as intended.

Connecting CoT to Broader Prompting Strategies

Chain-of-thought doesn't live in isolation. It's one node in a growing ecosystem of reasoning-augmentation techniques.

ReAct (Reasoning + Acting), introduced by Yao et al. at Princeton and Google in 2022, interleaves chain-of-thought reasoning with external tool calls - allowing models to retrieve real information, execute code, or query APIs between reasoning steps. The chain becomes a scaffold for action, not just reflection.
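A minimal sketch of that interleaving, under loud assumptions: `scripted_model` is a stand-in that returns canned steps rather than calling a real model, `lookup` is a hypothetical tool backed by a one-entry dictionary, and the "Thought / Action / Observation" text format with `lookup[...]` is a convention chosen for this example.

```python
from typing import Callable

# Hypothetical tool: a tiny lookup table standing in for a search API.
FACTS = {"capital of France": "Paris"}


def lookup(query: str) -> str:
    return FACTS.get(query, "no result")


def scripted_model(transcript: str) -> str:
    """Stand-in for a real model call: picks the next step from the transcript."""
    if "Observation:" not in transcript:
        return "Thought: I need to look this up.\nAction: lookup[capital of France]"
    return "Thought: The observation answers the question.\nFinal Answer: Paris"


def react(question: str, model: Callable[[str], str], max_steps: int = 5) -> str:
    """Alternate model steps with tool calls until a final answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action: lookup[" in step:
            # Run the tool and feed its result back into the context.
            query = step.split("Action: lookup[", 1)[1].split("]", 1)[0]
            transcript += f"\nObservation: {lookup(query)}"
    return "no answer within step budget"


print(react("What is the capital of France?", scripted_model))  # prints Paris
```

The important line is the one that appends the observation to the transcript: the tool result becomes context for the next reasoning step, the same way earlier reasoning steps do in plain CoT.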

Tree of Thoughts (Yao et al., 2023) generalizes the linear chain into a branching structure, where multiple reasoning paths are explored and evaluated simultaneously - closer to how humans actually solve hard problems when they're being systematic about it. Linear CoT is a special case where only one branch is ever considered.
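The "special case" point can be shown with a toy search. This is not the paper's implementation: in Tree of Thoughts the model itself proposes and evaluates thoughts, whereas this sketch assumes a made-up arithmetic puzzle (reach a target from a start value using "+3" or "*2") and a numeric distance heuristic as the evaluator. The function name `tree_search` and its parameters are hypothetical.

```python
from typing import Optional


def tree_search(
    start: int, target: int, beam_width: int, max_depth: int
) -> Optional[list[str]]:
    """Breadth-first search over partial solutions, keeping the best few per level."""
    # Each candidate is (current value, operations taken so far).
    frontier = [(start, [])]
    for _ in range(max_depth):
        candidates = []
        for value, path in frontier:
            candidates.append((value + 3, path + ["+3"]))
            candidates.append((value * 2, path + ["*2"]))
        for value, path in candidates:
            if value == target:
                return path
        # Toy evaluator: prefer values closer to the target.
        candidates.sort(key=lambda c: abs(target - c[0]))
        frontier = candidates[:beam_width]
    return None


# beam_width=1 behaves like a single linear chain: one branch, no alternatives.
print(tree_search(1, 10, beam_width=1, max_depth=3))  # prints None
# Keeping two branches lets the search recover from a locally attractive step.
print(tree_search(1, 10, beam_width=2, max_depth=3))  # prints ['+3', '+3', '+3']
```

With one branch, the search commits to 8 at the second step because it is closest to 10, and it can never get back. With two branches, the less attractive 7 survives and leads to the solution.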

The connection to self-prompting and meta-cognition is also worth noting. When a model is asked to critique its own chain of thought before finalizing an answer - sometimes called "reflection" - accuracy on complex tasks can improve further. This is the direction recent work on "o1"-style models has taken: training models to produce extended internal reasoning traces before outputting visible responses.

What I find interesting, and haven't fully resolved in my own thinking, is whether these extended reasoning traces constitute something meaningfully different from pattern matching at scale, or whether that distinction even matters for practical use.

Honest Constraints

Chain-of-thought prompting is well-supported for specific task types: multi-step arithmetic, symbolic reasoning, commonsense inference, and structured problem-solving. The evidence is weaker for open-ended generation, aesthetic judgment, and highly domain-specific reasoning that wasn't well-represented in training data.

The scale dependency is a real constraint that much popular writing ignores. Most practitioners are working with models smaller than the threshold where CoT reliably emerges, yet they apply these techniques expecting the same gains shown in frontier model research. The results will disappoint.

CoT also doesn't address hallucination in factual recall. If the model doesn't know something, producing a confident-sounding chain of reasoning about it doesn't help - it may make things worse. The technique improves reasoning over known information. It cannot synthesize correct information from nothing.

Finally, almost all CoT research is conducted in English. Cross-lingual transfer of these gains is inconsistent and underexplored.

FAQ

Does chain-of-thought prompting work with all language models?

No. The effect is reliably demonstrated in models above approximately 100 billion parameters. Smaller models often show no improvement or degraded performance. This scale dependency is one of the most important and underreported constraints in how CoT is discussed outside research contexts.

Should I always use "let's think step by step" in my prompts?

For analytical tasks - math, logic, structured problem-solving - yes, it tends to help. For creative or divergent tasks, it can constrain output quality. Match the prompting strategy to the task type, and test empirically rather than assuming the gain.
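"Test empirically" can be as small as the harness sketched below, which scores two prompt templates against the same questions. Everything here is an assumption for illustration: the names `plain`, `step_by_step`, and `accuracy` are hypothetical, the one-item dataset is made up, and `stub_model` is not a real model. It is rigged to answer differently depending on the prompt purely so the harness has something to measure, so its scores say nothing about real CoT gains.

```python
from typing import Callable

Dataset = list[tuple[str, str]]  # (question, expected final answer)


def plain(question: str) -> str:
    return f"Q: {question}\nA:"


def step_by_step(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."


def accuracy(
    model: Callable[[str], str],
    build_prompt: Callable[[str], str],
    dataset: Dataset,
) -> float:
    """Fraction of questions whose output ends with the expected answer."""
    hits = sum(
        model(build_prompt(q)).strip().endswith(expected)
        for q, expected in dataset
    )
    return hits / len(dataset)


def stub_model(prompt: str) -> str:
    # NOT a real model: rigged so the two prompts score differently,
    # purely to exercise the harness. Replace with a real model call.
    if "step by step" in prompt:
        return "2 groups of 3 is 6. The answer is 6"
    return "The answer is 5"


dataset = [("What is 2 * 3?", "6")]
for name, builder in [("plain", plain), ("step_by_step", step_by_step)]:
    print(name, accuracy(stub_model, builder, dataset))
```

In practice you would swap `stub_model` for a call to the model you actually deploy and use a dataset drawn from your own task, since the gain reported on a frontier model may not transfer.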

What's the difference between chain-of-thought and regular prompting?

Standard prompting asks the model to produce an answer. Chain-of-thought prompting asks the model to produce reasoning that leads to an answer. The intermediate steps condition the final answer token, giving the model more structured context to work from.

Can chain-of-thought prompting be used in automated pipelines?

Yes, and it often should be. In agentic systems, CoT traces also provide interpretability benefits - you can inspect the reasoning to diagnose failures. The cost is increased token usage and therefore latency and expense, which matters at scale.

From chain-of-thought, the natural next topics are prompt engineering as a discipline (how CoT fits within broader prompt design strategies), agentic AI architectures (where CoT becomes a component of multi-step automated reasoning), and AI interpretability research (which asks whether generated reasoning traces actually reflect internal model computation or are better understood as plausible post-hoc narratives).
