Prompt Engineering

Self-Consistency

Definition

Self-consistency, introduced by Wang et al. (2022), improves upon chain-of-thought prompting by exploiting the observation that while any single reasoning chain may be flawed, the correct answer tends to appear more frequently across multiple independent samples. The technique generates k diverse reasoning chains (typically 10-40) by sampling with temperature > 0, then marginalizes over all reasoning paths by taking the most common final answer. Self-consistency provides particularly large gains on arithmetic, multi-step reasoning, and commonsense inference tasks where chain-of-thought alone still makes frequent errors.

Why It Matters

Self-consistency addresses a fundamental limitation of single-sample prompting: language models are stochastic, and a single response reflects one path through a complex probability space. For high-stakes reasoning tasks—financial calculations, medical triage logic, code correctness analysis—sampling once and trusting the result is insufficient. Self-consistency trades inference cost (k samples instead of 1) for substantial accuracy gains. It's also a practical tool for calibrating confidence: if 9 out of 10 samples agree, the answer is likely correct; if samples are split 5-5, the question is genuinely ambiguous and may need human review.

How It Works

Implementation: (1) formulate a chain-of-thought prompt for the task; (2) sample k completions with temperature 0.5-0.8 (enough diversity to produce independent chains); (3) extract the final answer from each completion; (4) count answer frequencies and return the majority. The sampling temperature controls the diversity of reasoning paths—too low produces near-identical chains; too high produces incoherent reasoning. Aggregation can use a simple majority vote or weighted voting, where answer candidates are scored by the quality of their reasoning chains. In Wang et al.'s experiments, self-consistency with k=40 improved CoT accuracy by roughly 10-18 percentage points on arithmetic reasoning benchmarks such as GSM8K.

Self-Consistency — 5 Reasoning Paths → Majority Vote

Prompt (sampled ×5, temperature 0.7): "Revenue is $4.2M. Fixed costs $1.8M, variable costs $0.9M. What is the profit? Think step by step."

  • Path A: Start with revenue: $4.2M. Subtract fixed costs $1.8M and variable costs $0.9M = $1.5M profit. → $1.5M
  • Path B: Revenue $4.2M. Fixed $1.8M. Variable $0.9M. Net = $4.2 - $2.7 = $1.5M. → $1.5M
  • Path C: Add costs: 1.8 + 0.9 = 2.7. Revenue minus costs: 4.2 - 2.7 = 1.5. → $1.5M
  • Path D: Revenue $4.2M. Subtract $1.8M = $2.4M. (Forgot variable costs.) → $2.4M
  • Path E: Total costs 1.8 + 0.9 = 2.8 (arithmetic error). Profit = 4.2 - 2.8 = $1.4M. → $1.4M

Majority vote result: 3/5 paths agree on $1.5M, so $1.5M (3 votes) is selected as the final answer.

Confidence signal: 3/5 agreement = moderate confidence. 9/10 agreement = high confidence. 5/10 split = flag for human review.
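The confidence bands above translate directly into a routing rule. A minimal sketch follows; the specific cutoffs (0.8 for high confidence, 0.6 for moderate) are illustrative assumptions that match the examples in this guide, not values from the paper—tune them on your own data.

```python
from collections import Counter

def confidence_signal(answers: list[str],
                      high: float = 0.8,
                      moderate: float = 0.6) -> tuple[str, str]:
    """Classify majority-vote agreement into a confidence band.
    Thresholds are illustrative and should be tuned per task."""
    winner, votes = Counter(answers).most_common(1)[0]
    rate = votes / len(answers)
    if rate >= high:
        return winner, "high"
    if rate >= moderate:
        return winner, "moderate"
    return winner, "flag_for_human_review"

print(confidence_signal(["$1.5M"] * 9 + ["$2.4M"]))           # 9/10 -> high
print(confidence_signal(["$1.5M"] * 3 + ["$2.4M", "$1.4M"]))  # 3/5 -> moderate
print(confidence_signal(["A"] * 5 + ["B"] * 5))               # 5/5 split -> review
```

Logging the agreement rate alongside the answer also gives you the disagreement signal discussed under Common Mistakes below.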

Real-World Example

A legal AI tool evaluates contract risk by analyzing 15 clauses per contract. Using single-sample CoT, risk classifications had an 18% error rate on ambiguous clauses. Implementing self-consistency with k=10 samples per clause reduced the error rate to 7%—below the 10% threshold required for the tool to be used without mandatory human review. The accuracy gain justified a 10x inference cost increase because the alternative (human review of every contract) cost 50x more than the AI tool's inference budget.
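The cost trade-off in this example can be checked with back-of-envelope arithmetic. All figures below come from the scenario above; the cost unit is arbitrary.

```python
# Figures from the legal-AI example; units are arbitrary.
single_sample = 1.0                  # cost of one CoT inference per clause
sc_cost = 10 * single_sample         # k=10 self-consistency samples
human_review = 50 * single_sample    # estimated cost of human review

error_single, error_sc, threshold = 0.18, 0.07, 0.10

# Self-consistency clears the no-review threshold; single-sample does not.
print(error_sc < threshold < error_single)  # True
# And it remains 5x cheaper than routing every contract to human review.
print(human_review / sc_cost)               # 5.0
```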

Common Mistakes

  • Using self-consistency for simple tasks—the accuracy gains don't justify the 10-40x cost increase for tasks that are already reliably answered
  • Setting temperature too low—if all samples produce the same reasoning path, majority voting provides no benefit over single-sample
  • Not logging disagreements—high disagreement across samples is a signal of genuine ambiguity that should trigger human review or clarification
