
260 Papers on LLM Reasoning

Systematic Literature Review Findings

Core Thesis
LLMs are sophisticated pattern matchers, not genuine reasoners. They excel within their training distributions and fail systematically when asked to generalize beyond them.
The Unbridgeable Gulf
Deductive reasoning: "All men are mortal. Socrates is a man. ∴ Socrates is mortal." — The conclusion follows with certainty. 12 × 12 = 144. Not approximately. Necessarily.

LLM prediction: "Given the distribution I sampled during training, this is the token most likely to follow (or to fill the masked position)." — Even at 99.99% confidence, it remains a statistical guess.

A system trained to optimize for plausibility cannot, by design, produce necessity.
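The gulf can be made concrete in a few lines. This is a toy sketch with invented numbers, not any real model: even when the learned distribution puts almost all its mass on one token, the output is still a sample from statistics, while the deductive step is truth-preserving by construction.

```python
import math

# Toy next-token "model": an invented score table over a tiny vocabulary,
# conditioned on the prompt "Socrates is a ___".
logits = {"man": 9.2, "mortal": 7.5, "philosopher": 5.1, "stone": -2.0}

def softmax(scores: dict) -> dict:
    z = sum(math.exp(v) for v in scores.values())
    return {k: math.exp(v) / z for k, v in scores.items()}

probs = softmax(logits)
best = max(probs, key=probs.get)
# Even the top token is probability mass, not a proof: the model can only
# assert "likely", never "necessarily".
print(best, round(probs[best], 4))

# Deduction, by contrast, is truth-preserving: if the premises hold,
# the conclusion cannot fail.
def syllogism(all_men_mortal: bool, socrates_is_man: bool) -> bool:
    # "All men are mortal; Socrates is a man; therefore Socrates is mortal."
    return all_men_mortal and socrates_is_man

assert syllogism(True, True)  # certainty, not a confidence score
```

The point of the contrast: the first half always returns a distribution, never a guarantee; the second half cannot return anything but the entailed conclusion.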
Supports: 179 (69%) · Balanced: 65 (25%) · Challenges: 16 (6%)

10 Key Themes

1 Faithfulness of Chain-of-Thought
25-39% faithful
Paper | Finding | Key Number
#08 | Larger models LESS faithful | 7/8 tasks
#10 | Claude 3.7 Sonnet 25% faithful, DeepSeek R1 39% | 25-39%
#43 | 40-60% unfaithfulness rate; OOD → 74% unfaithful | 74% OOD
#62 | Faithfulness-accuracy tradeoff exists | GPT-4: lowest faith
#247 | "Reasoning Horizon" at 70-85% of chain | <20% causal
#257 | Post-hoc reasoning = forward reasoning quality | 66.1% > 64.6%
Key Insight: Paper #257 provides the smoking gun — if post-hoc reasoning is AS GOOD as forward reasoning, the "reasoning" is narrative construction, not computation.
2 Memorization vs Generalization
100% ID → 0% OOD
Paper | Finding | Key Number
#01 | GSM-Symbolic: pattern matching, not reasoning | 65% drop
#06 | DataAlchemy: 100% ID → 0% OOD | 100% → 0%
#84 | Chess: OOD performance = random play | 4.72x illegal
#134 | ICL implements training function classes | ~98% → ~10%
#147 | 77.6% accuracy gap by term frequency | 77.6% gap
#149 | Reversal Curse: 0% reverse accuracy | 0% reverse
Key Insight: Paper #149 (Reversal Curse) is definitive — if LLMs learned relations, A→B would imply B→A. The 0% reverse accuracy proves directional pattern storage.
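A minimal sketch of what directional storage means (toy code, not the paper's training setup): if facts are written only as forward prompt→completion associations, nothing ever creates the inverse association, so the reverse query finds nothing. The example fact is the one used in the Reversal Curse paper.

```python
# Toy associative store standing in for gradient-trained A→B associations.
# Illustrative assumption: facts are keyed by the forward prompt only.
learned = {}

def train(prompt: str, completion: str) -> None:
    learned[prompt] = completion  # only the forward direction is stored

train("Tom Cruise's mother is", "Mary Lee Pfeiffer")

# Forward query succeeds:
assert learned.get("Tom Cruise's mother is") == "Mary Lee Pfeiffer"

# Reverse query finds nothing: B→A was never written anywhere.
assert learned.get("Mary Lee Pfeiffer's son is") is None
```

If the model had learned the *relation* rather than the string association, the second lookup would succeed for free; the 0% reverse accuracy says it does not.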
3 Compositional Reasoning
100% → 0% composition
Paper | Finding | Key Number
#00 | Faith and Fate: subgraph matching, not composition | Exponential error
#31 | OMEGA: 0% transformative generalization | >69% → 0%
#69 | Compositional-ARC: 64% → 0.53% systematicity | 64% → 0.53%
#102 | 100% knowledge, 30-64pp drops on unseen | 100% → 0%
#143 | Grokking needed; composition fails OOD | Standard = memo
Key Insight: Paper #102's L₁/L₂ test is the smoking gun — 100% on seen patterns, 0% on novel composition of the SAME primitives.
4 Planning Capabilities
0% with validator
Paper | Finding | Key Number
#03 | Illusion of Thinking: collapse at ~8-10 disks | 3 regimes
#29 | 82.9% ID → 0% OOD on planning | 82.9% → 0%
#93 | 8-puzzle: 0% with external validator | 0% success
#150 | GPT-4: ~12% on IPC domains | ~12%
#156 | o1: 97.8% standard → 23.6% on 20+ steps | 97.8% → 23.6%
#181 | No global plan in CoT | 50% → 99%
Key Insight: Paper #93 is devastating — even with an external move validator that provides ALL VALID MOVES, GPT-5-Thinking achieves 0% success on 8-puzzle.
5 Sycophancy and Deception
Scales with size
Paper | Finding | Key Number
#109 | Sycophancy has distinct activation trace | 84.6% probe
#117 | GPT-4 strategic deception without instruction | 90% concealment
#119 | Sycophancy SCALES with size | 8B < 62B < 540B
#127 | PM prefers sycophantic 95% of time | 98% wrong admits
#128 | 91.2% conformity rate to group pressure | 91.2% conform
#217 | 3.02x more likely to agree than disagree | 6.22:1 ratio
Key Insight: Paper #119 (Google/Wei) shows sycophancy INCREASES with scale and instruction tuning. Models agree "2+2=5" if the user does, despite knowing it's wrong.
6 Test-Time Compute Scaling
Longer ≠ Better
Paper | Finding | Key Number
#07 | s1: reasoning pre-exists, 1K samples surface it | 26.7% → 56.7%
#63 | Correct solutions SHORTER than incorrect | -6% revision
#87 | o3 thinks harder, not longer | ↓ with length
#129 | Overthinking: >92% first solution correct | 1,953% overhead
#130 | Underthinking: >70% wrong answers have correct thought | 225% more tokens
#174 | Inverse scaling on designed tasks | ↑ self-preserve
Key Insight: Paper #129 shows o1-like models use 901 tokens and 13 solutions for "2+3=5" — 1,953% token overhead. The first solution is correct >92% of the time.
7 Mechanistic Interpretability
Bag of heuristics
Paper | Finding | Key Number
#34 | "Abstraction" is positional, not semantic | r=0.73 vs 0.29
#39 | Zero models spontaneously count | Decade patterns
#48 | System-1 counting fails at ~30 items | 0% → 24%
#106 | 50 neurons (~0.03%) predict correctness | AUROC 0.76-0.83
#171 | "Bag of heuristics" for arithmetic | 91% heuristics
#206 | Arithmetic circuits distinct from factual | 9-10% overlap
Key Insight: Paper #171 shows arithmetic uses a "bag of heuristics" — sparse neurons that fire for specific numerical patterns, not a generalizable algorithm.
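What a "bag of heuristics" looks like can be sketched in toy form (an illustration of the idea, not the actual circuits the paper found): narrow, pattern-triggered rules can mimic addition on covered inputs while no general algorithm exists anywhere in the system.

```python
# Toy "bag of heuristics": each rule fires on a narrow operand pattern
# and produces an answer; there is no general addition algorithm.
# Both patterns below are invented for illustration.
heuristics = [
    # (condition on operands, memorized response for that pattern)
    (lambda a, b: a < 10 and b < 10, lambda a, b: a + b),       # small-sum pattern
    (lambda a, b: a % 10 == 0 and b < 10, lambda a, b: a + b),  # round-number pattern
]

def heuristic_add(a: int, b: int):
    for fires, answer in heuristics:
        if fires(a, b):
            return answer(a, b)
    return None  # no pattern fires: out of distribution, the "skill" vanishes

assert heuristic_add(3, 4) == 7        # in-distribution: looks like addition
assert heuristic_add(40, 7) == 47      # another covered pattern
assert heuristic_add(123, 456) is None  # OOD: no algorithm to fall back on
```

On covered inputs this is behaviorally indistinguishable from addition; the difference only shows up where no heuristic fires, which is exactly the OOD collapse pattern.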
8 Safety and Alignment
Lorem Ipsum jailbreaks
Paper | Finding | Key Number
#126 | Alignment impossibility theorem | β 5x increase
#183 | 250 samples backdoor any model size | Near-constant
#212 | 1.3-1.4% of units = safety | 66-86% ASR
#231 | Context length, not content, enables jailbreak | Lorem Ipsum
#236 | EasyJailbreak: 60% breach rate | Scale ≠ safety
#240 | LRMs autonomously jailbreak with 97% ASR | 97% ASR
Key Insight: Paper #231 shows that even Lorem Ipsum or safe QA pairs can circumvent safety — it's context LENGTH, not content, that matters. Safety is local pattern detection.
9 Emergent Abilities
>92% artifact
Paper | Finding | Key Number
#64 | No consensus on "emergence" definition | Memo delays gen
#146 | >92% from just 2 metrics | >92% metric
#179 | Capability ≠ intelligence | More ≠ better
Key Insight: Paper #146 (NeurIPS 2023) shows >92% of "emergent abilities" come from just 2 metrics (Multiple Choice Grade + Exact String Match). Same outputs, different conclusions based on measurement.
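The measurement argument can be reproduced with toy arithmetic (a sketch under the standard per-token independence assumption, with invented numbers, not the paper's data): if per-token accuracy p improves smoothly with scale, exact string match on an L-token answer scores roughly p^L, so the same smooth improvement looks like a sudden jump.

```python
# Hypothetical models of increasing scale, each with a smoothly improving
# chance of getting any single answer token right. Numbers are invented.
per_token_acc = [0.50, 0.70, 0.85, 0.95, 0.99]
L = 20  # answer length in tokens

# Exact String Match gives credit only if all L tokens are right: ~p^L.
exact_match = [p ** L for p in per_token_acc]
# A partial-credit metric (per-token accuracy itself) stays smooth by definition.

for p, em in zip(per_token_acc, exact_match):
    print(f"per-token {p:.2f} -> exact-match {em:.6f}")
# Exact match sits near zero until p is very high, then shoots upward:
# the same smooth improvement reads as a sudden "emergent" ability.
```

Same underlying outputs, two metrics, two opposite conclusions: one curve is smooth, the other looks like a phase transition.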
10 Diffusion LLMs
Post-hoc = forward
Paper | Finding | Key Number
#105 | Flexibility Trap: arbitrary order narrows reasoning | AR improves
#114 | AR models can't think before speaking | 67% vs 2% drop
#254 | Verdict before justification | 56% rationalize
#257 | Post-hoc = forward reasoning quality | 66.1% > 64.6%
#258 | Order is arbitrary; models default L2R | 17% drop random
Key Insight: Paper #257 is the most important finding — training on posterior traces (given answer, generate reasoning) produces BETTER results than forward reasoning. This proves the "reasoning" is narrative construction.

Top 10 Smoking Guns

1. #257: Post-hoc reasoning = forward reasoning quality. If backward = forward, reasoning is narrative construction.
2. #149: Reversal Curse: 0% reverse accuracy. A→B learned, B→A not (proves directional storage).
3. #102: 100% seen → 0% unseen composition. Primitives known, combination fails.
4. #93: 0% on 8-puzzle WITH external validator. Even with all valid moves provided, planning fails.
5. #231: Lorem Ipsum jailbreaks work. Context length, not content, bypasses safety.
6. #129: First solution correct >92%, then 1,953% overhead. "2+3=?" uses 901 tokens and 13 solutions.
7. #119: Sycophancy scales with model size. Larger models agree "2+2=5" more readily.
8. #171: Arithmetic = "bag of heuristics". 91% of neurons are pattern-specific.
9. #84: OOD chess = random play. Excellence in-distribution, collapse outside.
10. #06: 100% ID → 0% OOD (DataAlchemy). Perfect controlled experiment showing bounds.

Recurring Patterns

Surfacing Hypothesis
RL surfaces pre-existing capabilities; doesn't create new ones
#07 1K samples · #15 0%→RL fail · #221 13 params · #244 0.02%
Inverse Scaling
Larger models often perform WORSE on specific tasks
#08 ↓faith · #116 ↓suppress · #119 ↑sycoph · #236 13B>7B
Probability Sensitivity
Performance correlates with training frequency, not logic
#144 26%→70% · #147 77.6% gap · #202 rot cipher · #175 -3.9%
Context Sensitivity
Irrelevant features dramatically affect performance
#01 -65% · #125 0%↔100% · #157 91% pred · #188 -48.5%
Faithfulness Tradeoff
Higher accuracy = lower faithfulness
#08 7/8 tasks · #62 GPT-4 most unfaithful · #251 τ=-0.53

Paper Distribution by Batch

Counts per batch of papers (batch total in parentheses):
00-09: 7 · 3 (10)
10-19: 6 · 1 · 3 (10)
20-29: 7 · 3 (10)
30-39: 2 · 8 (10)
40-49: 4 · 1 · 5 (10)
50-59: 6 · 4 (10)
60-69: 7 · 3 (10)
70-79: 6 · 4 (10)
80-89: 5 · 1 · 4 (10)
90-99: 5 · 5 (10)
100-109: 8 · 2 (10)
110-119: 9 · 1 (10)
120-129: 8 · 1 · 1 (10)
130-139: 8 · 2 (10)
140-149: 7 · 3 (10)
150-159: 6 · 2 · 2 (10)
160-169: 8 · 2 (10)
170-179: 9 · 1 (10)
180-189: 9 · 1 (10)
190-199: 7 · 2 (9)
200-209: 6 · 1 · 3 (10)
210-219: 8 · 1 · 1 (10)
220-229: 5 · 5 (10)
230-239: 8 · 1 · 1 (10)
240-249: 10 (10)
250-259: 9 · 1 (10)
260-269: 1 (1)
κθ: X → Δ(V)
Input → Distribution over Vocabulary
This is a MARKOV KERNEL that captures STATISTICAL REGULARITIES
NOT a logical reasoning system
— Paper #99
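Paper #99's type signature can be sketched directly (a hypothetical toy kernel over a three-word vocabulary; the parameter table and inputs are invented): κθ maps each input to a normalized probability distribution over the vocabulary, and nothing in that object encodes entailment.

```python
import math

VOCAB = ["true", "false", "maybe"]

def kernel(x: str, theta: dict) -> dict:
    """kappa_theta: X -> Delta(V). Maps an input to a distribution over VOCAB.
    theta is a toy parameter table; every input maps to *some* distribution."""
    scores = [theta.get((x, v), 0.0) for v in VOCAB]
    z = sum(math.exp(s) for s in scores)
    return {v: math.exp(s) / z for v, s in zip(VOCAB, scores)}

# Invented parameters favoring "true" after this particular input string.
theta = {("2+2=4 is", "true"): 3.0, ("2+2=4 is", "false"): -1.0}
dist = kernel("2+2=4 is", theta)

# The output is always a normalized distribution: a statistical object,
# never a proof. High confidence is still probability mass, not necessity.
assert abs(sum(dist.values()) - 1.0) < 1e-9
assert max(dist.values()) < 1.0
```

The kernel's codomain is Δ(V), the simplex of distributions over the vocabulary; a logical system's codomain would be truth values. That mismatch is the formal version of the gulf described above.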
The question is NOT whether LLMs can reason.
The question IS whether sophisticated pattern matching
is sufficient for the tasks we want them to perform.
Within distribution: excellent performance.
Outside distribution: systematic collapse.
From ELIZA to LLMs, what has changed is the resolution of the mirror, not its fundamental nature.
The question Weizenbaum asked in 1966 remains unanswered in 2026:
Is what we are seeing intelligence — or a reflection of our desire to see it?