---
title: "Chain-of-Thought Prompting: Give the AI Scratch Paper"
description: "Adding 'let's think step by step' to your prompts can dramatically improve reasoning tasks. Here's what the research shows, when it works, and when it doesn't."
date: February 5, 2026
author: Robert Soares
category: prompt-engineering
---

In January 2022, researchers at Google published a paper that changed how people talk to AI. They showed that prompting language models to work through problems in intermediate steps, rather than answering outright, dramatically improves their accuracy on reasoning tasks. A few months later, a follow-up found that a single appended phrase was enough to lift accuracy on math word problems from 17.7% to 78.7%. The phrase? "Let's think step by step."

This wasn't magic. It was [chain-of-thought prompting](https://arxiv.org/abs/2201.11903), a technique that gives AI something like scratch paper for working through problems.

## The Research Behind It

Jason Wei and colleagues at Google Brain ran experiments across three large language models. They tested arithmetic reasoning, commonsense questions, and symbolic manipulation. The pattern held across all categories: when models showed their work, they got more answers right.

On the [GSM8K math benchmark](https://research.google/blog/language-models-perform-reasoning-via-chain-of-thought/), their 540-billion parameter model hit 58% accuracy with chain-of-thought prompting. Standard prompting? Nowhere close. A follow-up using self-consistency pushed this to 74%.

The most striking result came from Sports Understanding. PaLM 540B reached 95% accuracy, beating unaided human experts who scored 84%.

A few months later, [researchers from the University of Tokyo and Google](https://arxiv.org/abs/2205.11916) published "Large Language Models are Zero-Shot Reasoners." They found you don't even need examples. Just append "Let's think step by step" and accuracy on MultiArith jumped from 17.7% to 78.7%. GSM8K went from 10.4% to 40.7%. That's a 61 percentage point improvement from one sentence.

## Why Does This Work?

A Hacker News user named leobg [explained the mechanics well](https://news.ycombinator.com/item?id=35503044):

> "I think the idea is that the LLM cannot think internally. It's output _is_ its thinking process. Especially with an auto regressive architecture like GPT, where each output token becomes part of the input. I imagine it like handing the LLM a piece of scratch paper."

This captures something important. Language models generate one token at a time. Each token becomes context for the next. When you ask for immediate answers, the model has to compress all of its reasoning into picking the right first word. But when you ask for steps, each intermediate conclusion becomes part of the input for what follows.

Consider this math problem: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls, with 3 balls per can. How many tennis balls does he have now?"

Solving it requires understanding the word problem, identifying the operations, and calculating correctly. Asking for the answer directly forces the model to do all of this in the jump from question to number. Asking for steps lets it establish each piece. Roger starts with 5. He buys 2 cans. Each can has 3 balls. So he buys 6 balls. 5 plus 6 is 11. Each sentence constrains what comes next. The model builds toward the answer instead of guessing it.

## The Catch Nobody Mentions First

Here's what the hype articles skip: chain-of-thought prompting only works with big models. The original research found this is an "emergent property of model scale." Below roughly 100 billion parameters, asking for step-by-step reasoning actually hurt performance.

Smaller models produced what looked like reasoning chains, but the chains contained logical errors. The confident-sounding steps led to wrong answers more often than just asking directly.

If you're using a smaller local model, this technique might backfire. Test it. Compare results with and without the step-by-step instruction. Don't assume the research applies to your specific setup.
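To make that comparison concrete, here's a minimal sketch of the kind of A/B test I mean. The `ask()` function is a hypothetical placeholder for however you call your model (an API client, a local runtime); swap in your own code and your own test questions.

```python
# A/B test sketch: does "Let's think step by step" actually help your model?
# ask() is a hypothetical placeholder; wire it up to your API or local runtime.

PROBLEMS = [
    # (question, string that should appear in a correct reply)
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls, "
     "with 3 balls per can. How many tennis balls does he have now?", "11"),
    ("There are 15 trees in the grove. After planting, there are 21 trees. "
     "How many trees were planted?", "6"),
]

def ask(prompt: str) -> str:
    raise NotImplementedError("replace with a real call to your model")

def accuracy(suffix: str = "") -> float:
    correct = 0
    for question, expected in PROBLEMS:
        reply = ask(question + suffix)
        if expected in reply:  # crude check; adjust to your answer format
            correct += 1
    return correct / len(PROBLEMS)

# Run both conditions on the same questions and compare:
# print("direct:       ", accuracy())
# print("step by step: ", accuracy(" Let's think step by step."))
```

Two toy questions won't tell you much on their own; the point is the shape of the test, run over whatever tasks your application actually handles.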
## Two Ways to Do It

**Zero-shot approach**: Just add the phrase. No examples needed.

> "A bat and ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost? Let's think step by step."

This works surprisingly well, and it adds almost nothing to prompt length.

**Few-shot approach**: Show the model what good reasoning looks like first.

> Here's a math problem and how to solve it step by step:
>
> Question: There are 15 trees in the grove. Grove workers will plant trees today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
>
> Reasoning: We start with 15 trees. We end with 21 trees. The difference is what was planted. 21 minus 15 equals 6.
>
> Answer: 6
>
> Now solve this one the same way:
> [your actual question]

Few-shot takes more tokens but often produces better results on complex tasks. The examples teach format and depth, not just the general idea of showing work.

## Tasks That Benefit

Chain-of-thought prompting shines for problems with multiple steps where mistakes compound. Math word problems. Logic puzzles. Multi-step planning. Anything where you'd use scratch paper yourself.

[IBM's analysis](https://www.ibm.com/think/topics/chain-of-thoughts) highlights practical applications: customer service bots breaking down problems, research tasks requiring hypothesis building, educational explanations in math and science. The technique works best when the task genuinely has intermediate steps that inform the final answer.

Another Hacker News commenter, travisjungroth, [made an observation that stuck with me](https://news.ycombinator.com/item?id=35503044):

> "Most writing about anything difficult is product, not process. Articles get drafts before being published. People think about answers before writing them down. How to Solve It does a great job explaining this about math problems. The steps to the proof are not the steps to creating the proof. So when you go to solve a problem by mimicking the solutions to problems, something is missing."

This matters. The published solution to a math problem looks nothing like the actual process of figuring it out. Language models trained on cleaned-up final answers never saw the messy working that led there. Prompting for steps recreates something that was absent from training.

## Tasks That Don't Benefit

Simple lookups don't gain anything. Asking "What's the capital of France?" with step-by-step instructions just produces longer output with no accuracy improvement. The model already has this answer readily available.

Tasks requiring creativity rather than reasoning see less improvement. Writing poetry, generating marketing copy, brainstorming names. These don't have logical steps to show. Forcing them into a reasoning framework feels awkward and may constrain the output unnecessarily.

[Recent research](https://bdtechtalks.com/2024/05/13/chain-of-thought-planning/) found the benefits don't generalize as broadly as early hype suggested. CoT prompts improve models on specific planning tasks but don't transfer well across domains. The improvements are real but narrower than sometimes claimed.

There's also no guarantee the reasoning is faithful. The model might produce plausible-sounding steps that don't actually reflect how it reached the answer. This creates a risk of false confidence. You see a logical chain and assume correctness, but the steps might be post-hoc rationalization rather than genuine reasoning.
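One cheap guard, and the same verification idea the practice section below recommends: make a second pass and ask the model to audit its own chain. This won't prove the visible reasoning is what actually produced the answer, but it catches chains that don't even support their own conclusion. A minimal sketch, again assuming a hypothetical `ask()` wrapper:

```python
# Verification pass sketch: have the model check a chain-of-thought answer.
# ask() is a hypothetical placeholder for your model call. This flags
# incoherent or arithmetically sloppy chains; it cannot prove faithfulness.

def ask(prompt: str) -> str:
    raise NotImplementedError("replace with a real call to your model")

def verify(question: str, proposed_solution: str) -> str:
    check_prompt = (
        f"Question: {question}\n\n"
        f"Proposed solution:\n{proposed_solution}\n\n"
        "Check the solution step by step. Does each step follow from the one "
        "before it, and does the final answer follow from the steps? "
        "Reply VALID or INVALID, then give a one-sentence reason."
    )
    return ask(check_prompt)

# Typical flow:
# solution = ask(question + " Let's think step by step.")
# verdict = verify(question, solution)
```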
## Computational Cost Matters

More tokens mean more time and money. If you're hitting an API, chain-of-thought responses cost more than direct answers. For high-volume applications, this adds up fast.

Consider whether the accuracy improvement justifies the cost. A customer service bot answering simple FAQs probably doesn't need reasoning chains. A system doing financial calculations probably does.

## Putting It Into Practice

For math and logic problems, start with zero-shot. Add "Let's think step by step" or "Work through this step by step" to your prompt. Compare results against your baseline.

For complex multi-step tasks, consider few-shot examples showing the reasoning depth you want. One or two good examples often outperform pure zero-shot on harder problems.

For anything involving planning, try breaking the task into explicit phases. Ask for the plan first, then execution. This is chain-of-thought applied at the task level rather than the sentence level.

If accuracy matters more than speed, add a verification step. After the model reaches an answer, ask it to check the work. This catches some errors that slip through initial reasoning.

Here's an example that shows the difference in practice:

**Without chain-of-thought:**

> "A company's revenue grew 20% in Q1 and then dropped 15% in Q2. What was the net change?"
>
> Model: "5% growth"

That answer is wrong. The model subtracted the percentages instead of compounding them.

**With chain-of-thought:**

> "A company's revenue grew 20% in Q1 and then dropped 15% in Q2. What was the net change? Let's work through this step by step."
>
> Model: "Starting with 100 as a baseline. After 20% growth: 100 x 1.20 = 120. After a 15% drop from that new amount: 120 x 0.85 = 102. Net change: 102 - 100 = 2, which is 2% growth."

The forced reasoning prevents the shortcut error. The model can't skip to "20 minus 15" because it has to work through the actual calculation.

## Variations Worth Knowing

The basic technique has spawned several extensions.

**Self-consistency** generates multiple reasoning paths and takes the majority answer. If you ask the model to solve a problem five times with chain-of-thought, and it gets the same answer four times, that answer is probably right. This approach pushed GSM8K accuracy from 58% to 74% in Google's follow-up research.

**Tree of Thoughts** explores multiple reasoning branches simultaneously rather than committing to a single path. Useful when there are genuinely different approaches to a problem and you want to explore several before picking one.

**Least-to-Most prompting** breaks complex problems into subproblems, solves the simpler ones first, and uses those solutions to tackle harder pieces. Good for problems with natural hierarchies or dependencies.

These variations add complexity. Master the basic version first. Most people get significant value from just adding "let's think step by step" and never need the more elaborate approaches.

## The Bigger Picture

Chain-of-thought prompting works because it exploits how these models actually function. They're next-token predictors. Each word constrains the probability of what follows. Asking for reasoning creates helpful constraints that accumulate toward correct answers.
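If it helps, here's that mechanism as a toy loop rather than a real inference engine. The only thing the sketch is meant to show is that every token the model emits, reasoning included, gets appended to the context it conditions on next; `sample_next_token` is a hypothetical stand-in for the model itself.

```python
# Toy sketch of autoregressive decoding, not a real inference engine.
# Every emitted token, including the reasoning, becomes part of the input
# that shapes the tokens that follow it.

def generate(sample_next_token, prompt: str, max_tokens: int = 256) -> str:
    context = prompt
    for _ in range(max_tokens):
        token = sample_next_token(context)  # hypothetical: returns one token
        if token == "<eos>":                # stop when the model signals done
            break
        context += token                    # scratch work feeds back in
    return context
```

Asking for the answer directly gives that loop nothing useful to accumulate; asking for steps fills the context with partial conclusions before the final answer is sampled.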
The technique itself might become obsolete. Models trained specifically for reasoning, like those with built-in "thinking" modes, may internalize these patterns. The explicit prompt might become unnecessary as the behavior gets baked into model weights.

But for now, with current models, the technique remains valuable. It costs one sentence and can multiply accuracy on the right tasks. The key is knowing which tasks those are.

How would you know if the reasoning a model shows you is the reasoning it actually used?