Self-Consistency
Definition
Self-consistency, introduced by Wang et al. (2022), improves upon chain-of-thought prompting by exploiting the observation that while any single reasoning chain may be flawed, the correct answer tends to appear more frequently across multiple independent samples. The technique generates k diverse reasoning chains (typically 10-40) by sampling with temperature > 0, then marginalizes over all reasoning paths by taking the most common final answer. Self-consistency provides particularly large gains on arithmetic, multi-step reasoning, and commonsense inference tasks where chain-of-thought alone still makes frequent errors.
Why It Matters
Self-consistency addresses a fundamental limitation of single-sample prompting: language models are stochastic, and a single response reflects one path through a complex probability space. For high-stakes reasoning tasks—financial calculations, medical triage logic, code correctness analysis—sampling once and trusting the result is insufficient. Self-consistency trades inference cost (k samples instead of 1) for substantial accuracy gains. It's also a practical tool for calibrating confidence: if 9 out of 10 samples agree, the answer is likely correct; if samples are split 5-5, the question is genuinely ambiguous and may need human review.
How It Works
Implementation: (1) formulate a chain-of-thought prompt for the task; (2) sample k completions with temperature 0.5-0.8 (high enough to yield diverse, roughly independent chains); (3) extract the final answer from each completion; (4) count answer frequencies and return the majority. The sampling temperature controls the diversity of reasoning paths—too low produces near-identical chains; too high produces incoherent reasoning. Aggregation can use a simple majority vote, or weighted voting in which answer candidates are scored by the quality of their reasoning chains. In Wang et al.'s experiments, self-consistency at k=40 improved CoT accuracy by roughly 10-20 percentage points on standard reasoning benchmarks.
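The four steps above can be sketched as a short function. This is a minimal illustration, not a production implementation: `sample_fn` is a hypothetical stand-in for any stochastic LLM call (e.g. an API call with temperature 0.5-0.8), and the `Answer:` extraction convention is an assumption you would adapt to your own prompt format.

```python
import re
from collections import Counter


def extract_answer(completion: str):
    """Pull the final answer from a completion ending in 'Answer: <value>'.
    (The 'Answer:' convention is an assumption; adapt to your prompt format.)"""
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None


def self_consistency(sample_fn, prompt: str, k: int = 10):
    """Sample k chain-of-thought completions and majority-vote the answers.

    sample_fn(prompt) -> str is any stochastic sampler (a stand-in here,
    not a real API). Returns (majority_answer, agreement_fraction).
    """
    answers = [extract_answer(sample_fn(prompt)) for _ in range(k)]
    answers = [a for a in answers if a is not None]  # drop unparseable samples
    if not answers:
        return None, 0.0
    # most_common(1) gives the modal answer and its vote count
    (best, count), = Counter(answers).most_common(1)
    return best, count / len(answers)
```

The returned agreement fraction doubles as the confidence signal discussed above: a low value can be used to route the question to human review rather than silently trusting the majority.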
[Diagram: Self-Consistency — 5 Reasoning Paths → Majority Vote. Three of five sampled paths agree on $1.5M, so $1.5M is selected as the final answer.]
Confidence signal: 3/5 agreement = moderate confidence. 9/10 agreement = high confidence. 5/10 split = flag for human review.
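Those agreement levels can be turned into a simple routing rule. A minimal sketch follows; the 0.8 and 0.6 thresholds are illustrative assumptions, not values from the source, and should be tuned per task.

```python
def route_by_agreement(agreement: float) -> str:
    """Map a self-consistency agreement fraction to a handling decision.
    Thresholds (0.8 / 0.6) are illustrative, not prescribed."""
    if agreement >= 0.8:
        return "auto-accept"      # e.g. 9/10 agreement: high confidence
    if agreement >= 0.6:
        return "accept-with-log"  # e.g. 3/5 agreement: moderate confidence
    return "human-review"         # e.g. 5/10 split: genuinely ambiguous
```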
Real-World Example
A legal AI tool evaluates contract risk by analyzing 15 clauses per contract. Using single-sample CoT, risk classifications had an 18% error rate on ambiguous clauses. Implementing self-consistency with k=10 samples per clause reduced the error rate to 7%—below the 10% threshold required for the tool to be used without mandatory human review. The accuracy gain justified a 10x inference cost increase because the alternative (human review of every contract) cost 50x more than the AI tool's inference budget.
Common Mistakes
- ✕ Using self-consistency for simple tasks—the accuracy gains don't justify the 10-40x cost increase for tasks that are already reliably answered
- ✕ Setting temperature too low—if all samples produce the same reasoning path, majority voting provides no benefit over single-sample prompting
- ✕ Not logging disagreements—high disagreement across samples is a signal of genuine ambiguity that should trigger human review or clarification
Related Terms
Chain-of-Thought Prompting
Chain-of-thought prompting instructs an LLM to show its reasoning step by step before giving a final answer, significantly improving accuracy on complex reasoning, math, and multi-step problems.
Prompt Engineering
Prompt engineering is the practice of designing and refining the text inputs given to AI language models to reliably produce accurate, useful, and well-formatted outputs for specific tasks.
Tree-of-Thought Prompting
Tree-of-thought prompting extends chain-of-thought by having the model explore multiple reasoning branches in parallel, evaluate each branch's promise, and backtrack from dead ends—enabling systematic problem-solving for complex tasks.
Few-Shot Prompting
Few-shot prompting provides an LLM with a small number of input-output examples within the prompt itself, demonstrating the desired task format and behavior so the model can generalize to new inputs without any fine-tuning.
Reasoning Model
A reasoning model is an LLM that explicitly 'thinks' through problems in an extended internal reasoning process before producing a final answer, trading inference speed for dramatically improved accuracy on complex tasks.