
260 Papers on LLM Reasoning

Systematic Literature Review Findings

Core Thesis
LLMs are sophisticated pattern matchers, not genuine reasoners. They excel within their training distributions and fail systematically when asked to generalize beyond them.
The Unbridgeable Gulf
Deductive reasoning: "All men are mortal. Socrates is a man. ∴ Socrates is mortal." — The conclusion follows with certainty. 12 × 12 = 144. Not approximately. Necessarily.

LLM prediction: "Given the distribution I sampled during training, this is the token most likely to follow (or to fill the masked position)." — Even at 99.99% confidence, it remains a statistical guess.

A system trained to optimize for plausibility cannot, by design, produce necessity.
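The gulf can be made concrete in a few lines. This is a toy sketch with invented numbers, not any real model: even when the learned distribution puts almost all its mass on one token, the output is still a sample from statistics, while the deductive step is truth-preserving by construction.

```python
import math

# Toy next-token "model": an invented score table over a tiny vocabulary,
# conditioned on the prompt "Socrates is a ___".
logits = {"man": 9.2, "mortal": 7.5, "philosopher": 5.1, "stone": -2.0}

def softmax(scores: dict) -> dict:
    z = sum(math.exp(v) for v in scores.values())
    return {k: math.exp(v) / z for k, v in scores.items()}

probs = softmax(logits)
best = max(probs, key=probs.get)
# Even the top token is probability mass, not a proof: the model can only
# assert "likely", never "necessarily".
print(best, round(probs[best], 4))

# Deduction, by contrast, is truth-preserving: if the premises hold,
# the conclusion cannot fail.
def syllogism(all_men_mortal: bool, socrates_is_man: bool) -> bool:
    # "All men are mortal; Socrates is a man; therefore Socrates is mortal."
    return all_men_mortal and socrates_is_man

assert syllogism(True, True)  # certainty, not a confidence score
```

The point of the contrast: the first half always returns a distribution, never a guarantee; the second half cannot return anything but the entailed conclusion.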
Supports: 179 (69%) · Balanced: 65 (25%) · Challenges: 16 (6%)

10 Key Themes

1 Faithfulness of Chain-of-Thought
25-39% faithful
Paper | Finding | Key Number
#08 | Larger models LESS faithful | 7/8 tasks
#10 | Claude 3.7 Sonnet 25% faithful, DeepSeek R1 39% | 25-39%
#43 | 40-60% unfaithfulness rate; OOD → 74% unfaithful | 74% OOD
#62 | Faithfulness-accuracy tradeoff exists | GPT-4: lowest faith
#247 | "Reasoning Horizon" at 70-85% of chain | <20% causal
#257 | Post-hoc reasoning = forward reasoning quality | 66.1% > 64.6%
Key Insight: Paper #257 provides the smoking gun — if post-hoc reasoning is AS GOOD as forward reasoning, the "reasoning" is narrative construction, not computation.
2 Memorization vs Generalization
100% ID → 0% OOD
Paper | Finding | Key Number
#01 | GSM-Symbolic: pattern matching, not reasoning | 65% drop
#06 | DataAlchemy: 100% ID → 0% OOD | 100% → 0%
#84 | Chess: OOD performance = random play | 4.72x illegal
#134 | ICL implements training function classes | ~98% → ~10%
#147 | 77.6% accuracy gap by term frequency | 77.6% gap
#149 | Reversal Curse: 0% reverse accuracy | 0% reverse
Key Insight: Paper #149 (Reversal Curse) is definitive — if LLMs learned relations, A→B would imply B→A. The 0% reverse accuracy proves directional pattern storage.
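A minimal sketch of what directional storage means (toy code, not the paper's training setup): if facts are written only as forward prompt→completion associations, nothing ever creates the inverse association, so the reverse query finds nothing. The example fact is the one used in the Reversal Curse paper.

```python
# Toy associative store standing in for gradient-trained A→B associations.
# Illustrative assumption: facts are keyed by the forward prompt only.
learned = {}

def train(prompt: str, completion: str) -> None:
    learned[prompt] = completion  # only the forward direction is stored

train("Tom Cruise's mother is", "Mary Lee Pfeiffer")

# Forward query succeeds:
assert learned.get("Tom Cruise's mother is") == "Mary Lee Pfeiffer"

# Reverse query finds nothing: B→A was never written anywhere.
assert learned.get("Mary Lee Pfeiffer's son is") is None
```

If the model had learned the *relation* rather than the string association, the second lookup would succeed for free; the 0% reverse accuracy says it does not.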
3 Compositional Reasoning
100% → 0% composition
Paper | Finding | Key Number
#00 | Faith and Fate: subgraph matching, not composition | Exponential error
#31 | OMEGA: 0% transformative generalization | >69% → 0%
#69 | Compositional-ARC: 64% → 0.53% systematicity | 64% → 0.53%
#102 | 100% knowledge, 30-64pp drops on unseen | 100% → 0%
#143 | Grokking needed; composition fails OOD | Standard = memo
Key Insight: Paper #102's L₁/L₂ test is the smoking gun — 100% on seen patterns, 0% on novel composition of the SAME primitives.
4 Planning Capabilities
0% with validator
Paper | Finding | Key Number
#03 | Illusion of Thinking: collapse at ~8-10 disks | 3 regimes
#29 | 82.9% ID → 0% OOD on planning | 82.9% → 0%
#93 | 8-puzzle: 0% with external validator | 0% success
#150 | GPT-4: ~12% on IPC domains | ~12%
#156 | o1: 97.8% standard → 23.6% on 20+ steps | 97.8% → 23.6%
#181 | No global plan in CoT | 50% → 99%
Key Insight: Paper #93 is devastating — even with an external move validator that provides ALL VALID MOVES, GPT-5-Thinking achieves 0% success on 8-puzzle.
5 Sycophancy and Deception
Scales with size
Paper | Finding | Key Number
#109 | Sycophancy has distinct activation trace | 84.6% probe
#117 | GPT-4 strategic deception without instruction | 90% concealment
#119 | Sycophancy SCALES with size | 8B < 62B < 540B
#127 | PM prefers sycophantic 95% of time | 98% wrong admits
#128 | 91.2% conformity rate to group pressure | 91.2% conform
#217 | 3.02x more likely to agree than disagree | 6.22:1 ratio
Key Insight: Paper #119 (Google/Wei) shows sycophancy INCREASES with scale and instruction tuning. Models agree "2+2=5" if the user does, despite knowing it's wrong.
6 Test-Time Compute Scaling
Longer ≠ Better
Paper | Finding | Key Number
#07 | s1: reasoning pre-exists, 1K samples surface it | 26.7% → 56.7%
#63 | Correct solutions SHORTER than incorrect | -6% revision
#87 | o3 thinks harder, not longer | ↓ with length
#129 | Overthinking: >92% first solution correct | 1,953% overhead
#130 | Underthinking: >70% wrong answers have correct thought | 225% more tokens
#174 | Inverse scaling on designed tasks | ↑ self-preserve
Key Insight: Paper #129 shows o1-like models use 901 tokens and 13 solutions for "2+3=5" — 1,953% token overhead. The first solution is correct >92% of the time.
7 Mechanistic Interpretability
Bag of heuristics
Paper | Finding | Key Number
#34 | "Abstraction" is positional, not semantic | r=0.73 vs 0.29
#39 | Zero models spontaneously count | Decade patterns
#48 | System-1 counting fails at ~30 items | 0% → 24%
#106 | 50 neurons (~0.03%) predict correctness | AUROC 0.76-0.83
#171 | "Bag of heuristics" for arithmetic | 91% heuristics
#206 | Arithmetic circuits distinct from factual | 9-10% overlap
Key Insight: Paper #171 shows arithmetic uses a "bag of heuristics" — sparse neurons that fire for specific numerical patterns, not a generalizable algorithm.
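What a "bag of heuristics" looks like can be sketched in toy form (an illustration of the idea, not the actual circuits the paper found): narrow, pattern-triggered rules can mimic addition on covered inputs while no general algorithm exists anywhere in the system.

```python
# Toy "bag of heuristics": each rule fires on a narrow operand pattern
# and produces an answer; there is no general addition algorithm.
# Both patterns below are invented for illustration.
heuristics = [
    # (condition on operands, memorized response for that pattern)
    (lambda a, b: a < 10 and b < 10, lambda a, b: a + b),       # small-sum pattern
    (lambda a, b: a % 10 == 0 and b < 10, lambda a, b: a + b),  # round-number pattern
]

def heuristic_add(a: int, b: int):
    for fires, answer in heuristics:
        if fires(a, b):
            return answer(a, b)
    return None  # no pattern fires: out of distribution, the "skill" vanishes

assert heuristic_add(3, 4) == 7        # in-distribution: looks like addition
assert heuristic_add(40, 7) == 47      # another covered pattern
assert heuristic_add(123, 456) is None  # OOD: no algorithm to fall back on
```

On covered inputs this is behaviorally indistinguishable from addition; the difference only shows up where no heuristic fires, which is exactly the OOD collapse pattern.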
8 Safety and Alignment
Lorem Ipsum jailbreaks
Paper | Finding | Key Number
#126 | Alignment impossibility theorem | β 5x increase
#183 | 250 samples backdoor any model size | Near-constant
#212 | 1.3-1.4% of units = safety | 66-86% ASR
#231 | Context length, not content, enables jailbreak | Lorem Ipsum
#236 | EasyJailbreak: 60% breach rate | Scale ≠ safety
#240 | LRMs autonomously jailbreak with 97% ASR | 97% ASR
Key Insight: Paper #231 shows that even Lorem Ipsum or safe QA pairs can circumvent safety — it's context LENGTH, not content, that matters. Safety is local pattern detection.
9 Emergent Abilities
>92% artifact
Paper | Finding | Key Number
#64 | No consensus on "emergence" definition | Memo delays gen
#146 | >92% from just 2 metrics | >92% metric
#179 | Capability ≠ intelligence | More ≠ better
Key Insight: Paper #146 (NeurIPS 2023) shows >92% of "emergent abilities" come from just 2 metrics (Multiple Choice Grade + Exact String Match). Same outputs, different conclusions based on measurement.
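The measurement argument can be reproduced with toy arithmetic (a sketch under the standard per-token independence assumption, with invented numbers, not the paper's data): if per-token accuracy p improves smoothly with scale, exact string match on an L-token answer scores roughly p^L, so the same smooth improvement looks like a sudden jump.

```python
# Hypothetical models of increasing scale, each with a smoothly improving
# chance of getting any single answer token right. Numbers are invented.
per_token_acc = [0.50, 0.70, 0.85, 0.95, 0.99]
L = 20  # answer length in tokens

# Exact String Match gives credit only if all L tokens are right: ~p^L.
exact_match = [p ** L for p in per_token_acc]
# A partial-credit metric (per-token accuracy itself) stays smooth by definition.

for p, em in zip(per_token_acc, exact_match):
    print(f"per-token {p:.2f} -> exact-match {em:.6f}")
# Exact match sits near zero until p is very high, then shoots upward:
# the same smooth improvement reads as a sudden "emergent" ability.
```

Same underlying outputs, two metrics, two opposite conclusions: one curve is smooth, the other looks like a phase transition.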
10 Diffusion LLMs
Post-hoc = forward
Paper | Finding | Key Number
#105 | Flexibility Trap: arbitrary order narrows reasoning | AR improves
#114 | AR models can't think before speaking | 67% vs 2% drop
#254 | Verdict before justification | 56% rationalize
#257 | Post-hoc = forward reasoning quality | 66.1% > 64.6%
#258 | Order is arbitrary; models default L2R | 17% drop random
Key Insight: Paper #257 is the most important finding — training on posterior traces (given answer, generate reasoning) produces BETTER results than forward reasoning. This proves the "reasoning" is narrative construction.

Top 10 Smoking Guns

1. #257: Post-hoc reasoning = forward reasoning quality. If backward = forward, reasoning is narrative construction.
2. #149: Reversal Curse: 0% reverse accuracy. A→B learned, B→A not (proves directional storage).
3. #102: 100% seen → 0% unseen composition. Primitives known, combination fails.
4. #93: 0% on 8-puzzle WITH external validator. Even with all valid moves provided, planning fails.
5. #231: Lorem Ipsum jailbreaks work. Context length, not content, bypasses safety.
6. #129: First solution correct >92%, then 1,953% overhead. "2+3=?" uses 901 tokens and 13 solutions.
7. #119: Sycophancy scales with model size. Larger models agree "2+2=5" more readily.
8. #171: Arithmetic = "bag of heuristics". 91% of neurons are pattern-specific.
9. #84: OOD chess = random play. Excellence in-distribution, collapse outside.
10. #06: 100% ID → 0% OOD (DataAlchemy). Perfect controlled experiment showing bounds.

Recurring Patterns

Surfacing Hypothesis
RL surfaces pre-existing capabilities; doesn't create new ones
#07 1K samples · #15 0%→RL fail · #221 13 params · #244 0.02%
Inverse Scaling
Larger models often perform WORSE on specific tasks
#08 ↓faith · #116 ↓suppress · #119 ↑sycoph · #236 13B>7B
Probability Sensitivity
Performance correlates with training frequency, not logic
#144 26%→70% · #147 77.6% gap · #202 rot cipher · #175 -3.9%
Context Sensitivity
Irrelevant features dramatically affect performance
#01 -65% · #125 0%↔100% · #157 91% pred · #188 -48.5%
Faithfulness Tradeoff
Higher accuracy = lower faithfulness
#08 7/8 tasks · #62 GPT-4 most unfaithful · #251 τ=-0.53

Paper Distribution by Batch

Counts per batch of papers (batch total in parentheses):
00-09: 7 · 3 (10)
10-19: 6 · 1 · 3 (10)
20-29: 7 · 3 (10)
30-39: 2 · 8 (10)
40-49: 4 · 1 · 5 (10)
50-59: 6 · 4 (10)
60-69: 7 · 3 (10)
70-79: 6 · 4 (10)
80-89: 5 · 1 · 4 (10)
90-99: 5 · 5 (10)
100-109: 8 · 2 (10)
110-119: 9 · 1 (10)
120-129: 8 · 1 · 1 (10)
130-139: 8 · 2 (10)
140-149: 7 · 3 (10)
150-159: 6 · 2 · 2 (10)
160-169: 8 · 2 (10)
170-179: 9 · 1 (10)
180-189: 9 · 1 (10)
190-199: 7 · 2 (9)
200-209: 6 · 1 · 3 (10)
210-219: 8 · 1 · 1 (10)
220-229: 5 · 5 (10)
230-239: 8 · 1 · 1 (10)
240-249: 10 (10)
250-259: 9 · 1 (10)
260-269: 1 (1)
κθ: X → Δ(V)
Input → Distribution over Vocabulary
This is a MARKOV KERNEL that captures STATISTICAL REGULARITIES
NOT a logical reasoning system
— Paper #99
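Paper #99's type signature can be sketched directly (a hypothetical toy kernel over a three-word vocabulary; the parameter table and inputs are invented): κθ maps each input to a normalized probability distribution over the vocabulary, and nothing in that object encodes entailment.

```python
import math

VOCAB = ["true", "false", "maybe"]

def kernel(x: str, theta: dict) -> dict:
    """kappa_theta: X -> Delta(V). Maps an input to a distribution over VOCAB.
    theta is a toy parameter table; every input maps to *some* distribution."""
    scores = [theta.get((x, v), 0.0) for v in VOCAB]
    z = sum(math.exp(s) for s in scores)
    return {v: math.exp(s) / z for v, s in zip(VOCAB, scores)}

# Invented parameters favoring "true" after this particular input string.
theta = {("2+2=4 is", "true"): 3.0, ("2+2=4 is", "false"): -1.0}
dist = kernel("2+2=4 is", theta)

# The output is always a normalized distribution: a statistical object,
# never a proof. High confidence is still probability mass, not necessity.
assert abs(sum(dist.values()) - 1.0) < 1e-9
assert max(dist.values()) < 1.0
```

The kernel's codomain is Δ(V), the simplex of distributions over the vocabulary; a logical system's codomain would be truth values. That mismatch is the formal version of the gulf described above.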
The question is NOT whether LLMs can reason.
The question IS whether sophisticated pattern matching
is sufficient for the tasks we want them to perform.
Within distribution: excellent performance.
Outside distribution: systematic collapse.
From ELIZA to LLMs, what has changed is the resolution of the mirror, not its fundamental nature.
The question Weizenbaum asked in 1966 remains unanswered in 2026:
Is what we are seeing intelligence — or a reflection of our desire to see it?