
LLM Pre-training Data

Common Crawl (2007–) Non-profit web archive. ~100 crawls since 2008. Apache Nutch crawler on ~100 machines over 10–12 days per crawl. Hundreds of millions of seed URLs. Nearly every pre-training dataset builds on Common Crawl.
"Data does not fall from the sky" Companies openly publish architecture details but keep data pipelines secret. Data curation is the key differentiator—and the most labor-intensive, least scalable part of building frontier LLMs. Much of the pipeline is heuristic.
Pipeline: Source → Clean → Refine → Train
Stage 1
Web Crawling
Common Crawl harvests the open web. Hundreds of millions of seed URLs, politeness policies, robots.txt.
Two output formats:
WARC — raw HTML as captured by the crawler. Preferred: you control text extraction.
WET — pre-extracted plain text. Lossy conversion, lower downstream quality.

DCLM demonstrated that choosing WARC over WET and converting HTML yourself materially improves downstream model accuracy.
~100 crawls archived since 2008
10–12 days per crawl on ~100 machines
Hundreds of millions of seed URLs
Policies: selection, politeness (robots.txt), re-visit frequency
⚠ Challenge
Dynamic URLs, duplicate content, malicious injection. Wikipedia dump timing can be exploited for data poisoning.

Stage 2
Text Extraction (HTML → Text)
Convert raw HTML into clean text. Tool choice matters: trafilatura, jusText, resiliparse.
trafilatura — widely used (FineWeb, RefinedWeb). Good precision, some recall loss.
jusText — higher token yield. Chosen by Nemotron-CC for this reason.
resiliparse — fast C-based parser.
markdownify — preserves document structure as Markdown.

Key finding (DCLM, Pile-CC): starting from WARC and extracting text yourself consistently outperforms using pre-converted WET files.
⚠ Critical
This seemingly mundane step has outsized impact. The choice between trafilatura and jusText alone changes token counts by >20% and affects downstream benchmark scores.
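To make the extraction step concrete, here is a minimal tag-stripping extractor using only Python's standard library. This is a sketch, not what production pipelines do: trafilatura and jusText additionally remove boilerplate (navigation, ads, footers), while this version only skips script/style content and collects text nodes.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive HTML→text: keep text nodes, skip script/style contents."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    p = TextExtractor()
    p.feed(html)
    return "\n".join(p.parts)

html = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Body text.</p></body></html>")
print(extract_text(html))  # prints "Title" and "Body text." on two lines
```

The gap between this naive version and a boilerplate-aware extractor is exactly where the >20% token-count differences come from.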

Stage 3
Filtering
Remove low-quality, toxic, non-target-language content. Two schools: rule-based vs. model-based.
3a. Language Identification
fastText classifier (176 languages, trained on Wikipedia + Tatoeba). Thresholds vary widely: C4 uses p(en)≥0.99, FineWeb uses p(en)>0.65, Dolma uses p(en)≥0.5.

3b. Quality Filtering — Rule-based
Explicit heuristics (C4, Gopher, RefinedWeb, FineWeb, Dolma): each line must end in terminal punctuation and contain ≥5 words; documents need ≥3 sentences; drop pages containing "bad words", `{`, or "lorem ipsum"; ≥80% of words must contain an alphabetic character.
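These heuristics are easy to express directly. A sketch combining a few of the published rules; the thresholds are illustrative (each dataset uses its own exact values) and C4's full bad-word blocklist is omitted:

```python
import re

def passes_rules(doc: str) -> bool:
    """Illustrative C4/Gopher-style heuristics; real thresholds vary by dataset."""
    words = doc.split()
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if len(re.split(r"[.!?]\s+", doc)) < 3:           # need >= 3 sentences
        return False
    if "{" in doc or "lorem ipsum" in doc.lower():    # code / placeholder pages
        return False
    for line in lines:                                # C4: per-line rules
        if len(line.split()) < 5 or not line.endswith((".", "!", "?", '"')):
            return False
    # Gopher-style: >= 80% of words must contain an alphabetic character
    alpha = sum(any(c.isalpha() for c in w) for w in words)
    return alpha >= 0.8 * len(words)

good = ("This page explains web crawling. It covers politeness policies. "
        "It also covers robots.txt rules.")
print(passes_rules(good))                        # True
print(passes_rules("function f() { return 1; }"))  # False: contains "{"
```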

3c. Quality Filtering — Model-based (becoming the norm)
GPT-3: linear classifier, positives = {WebText, Wikipedia, Books}
LLaMA: positives = Wikipedia-referenced pages
DCLM: fastText, positives = {OpenHermes-2.5 (GPT-4 generated), ELI5 subreddit}
phi-1: GPT-4 labels 100K subset for "educational value" → 17.68% HumanEval vs 12.19% unfiltered

3d. Toxicity Filtering
Dolma: Jigsaw Toxic Comments dataset, fastText classifiers for hate + NSFW
Gopher: Google SafeSearch API
C4: removed pages with any word from LDNOOBW list (blunt)
⚠ Tension
Rule-based filtering avoids ML bias, but model-based filtering produces better downstream performance. FineWeb-Edu and DCLM remove ~90% of the data. Nemotron-CC addresses this with ensemble classifiers plus synthetic rephrasing of filtered content.

Stage 4
Deduplication
Remove exact and near-duplicate content. Reduces memorization, improves training efficiency.
Why deduplicate? C4 contained one product description repeated 61,036 times. Duplicates waste compute and increase memorization (copyright/privacy risk).

4a. Exact Deduplication
Hash each item (MurmurHash). Group by hash. Keep one per group. Simple, high precision, parallelizable (MapReduce). Misses near-duplicates.

4b. Bloom Filters
Memory-efficient probabilistic set membership. No false negatives; tunable false positive rate. Dolma: Bloom filter dedup on paragraphs (FP rate 1e-15).

4c. Fuzzy Dedup — MinHash + LSH
Jaccard similarity: J(A,B) = |A∩B| / |A∪B|. MinHash gives Pr[h(A)=h(B)] = J(A,B). Locality-Sensitive Hashing (LSH) sharpens the threshold via banding: n hashes split into b bands of r rows.
Threshold ≈ (1/b)^(1/r)  —  AND within bands, OR across bands → S-curve around threshold
RefinedWeb, FineWeb, SlimPajama all use MinHash 5-gram dedup with LSH.

Stage 5
Data Mixing & Staging
Weight different sources. Upsample high-quality domains. Stage from pre-training to mid-training to post-training.
Pre-training — large amounts of lower-quality, high-diversity web data. Token budgets: Llama 3 trained on 15T tokens, Qwen3 on 36T tokens.

Mid-training — high-quality subset + long-context data. Continued pre-training on curated mix. Shifts distribution toward quality.

Post-training — instruction data, chat, RLHF. Tiny volume (<1M examples), high curation effort.

Domain weighting: GPT-3 upsampled Wikipedia and Books 2–3× vs. raw proportion. Most labs keep exact mixing ratios secret.
⚠ Critical
The mixing recipe is arguably the most guarded secret in LLM development. Different mixes produce dramatically different capabilities.
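Domain weighting reduces to sampling each training example from a source chosen by target weight rather than by raw size. The source names and numbers below are invented for illustration (real ratios are unpublished); the Wikipedia row mimics GPT-3-style 2–3× upsampling:

```python
import random

# Hypothetical mix: raw share of the data pool vs. target training weight.
sources = {
    "web":       {"raw": 0.82, "weight": 0.60},
    "books":     {"raw": 0.15, "weight": 0.31},
    "wikipedia": {"raw": 0.03, "weight": 0.09},  # ~3x upsampled, GPT-3 style
}

def effective_epochs(name: str) -> float:
    """Passes over a source implied by one pass over the mixed stream."""
    s = sources[name]
    return s["weight"] / s["raw"]

def sample_source(rng: random.Random) -> str:
    names = list(sources)
    return rng.choices(names, weights=[sources[n]["weight"] for n in names])[0]

print(round(effective_epochs("wikipedia"), 2))  # 3.0: seen ~3x per mixed epoch
```

Note the tradeoff this exposes: upsampling a small source means repeating it, which trades diversity against quality and raises memorization risk.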

Output
Training-Ready Token Stream
Tokenized, shuffled, packed into sequences. Fed to the model as next-token prediction targets.
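The packing step is simple to sketch: concatenate tokenized documents with an end-of-sequence separator, then chop the stream into fixed-length training sequences. Token IDs and the EOS id here are arbitrary, and dropping the trailing remainder is just one common convention:

```python
def pack(tokenized_docs, seq_len, eos_id=0):
    """Concatenate docs with EOS separators, chop into fixed-length
    sequences, and drop the trailing remainder."""
    stream = []
    for toks in tokenized_docs:
        stream.extend(toks)
        stream.append(eos_id)
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack(docs, seq_len=4))  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```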
Landmark Datasets
C4 (2019)
156B tokens from 1 CC snapshot. Rule-based filtering only. The first large-scale cleaned CC dataset. Used by T5.
The Pile (2021)
275B tokens from 22 curated sources. Grassroots open-source effort. Included arXiv, PubMed, GitHub, Books3.
RefinedWeb (2023)
5T tokens (600B released). "Web data is all you need." WARC + trafilatura + Gopher rules + MinHash.
FineWeb (2024)
15T tokens from 95 CC dumps. HuggingFace. Rules + MinHash + PII anonymization. Fully open.
DCLM (2024)
240T raw → 3.8T filtered. DataComp-LM. fastText quality classifier. Open benchmark for data curation.
Nemotron-CC (2024)
6.3T tokens (1.1T HQ subset). NVIDIA. Ensemble classifier + synthetic rephrasing of low-quality data.
The Data Scaling Arc
2019: C4 = 156B tokens from 1 snapshot → 2020: GPT-3 = 400B tokens → 2023: RefinedWeb = 5T tokens → 2024: Llama 3 trains on 15T tokens, Qwen3 on 36T tokens, DCLM-pool = 240T raw tokens.
The trend: more data, better filtering, higher quality—but the pipeline remains fundamentally heuristic with "many opportunities to improve."
§1 Quality Filtering Algorithms

General framework

Given target data T (small, high quality) and raw data R (large, noisy), find subset T' of R similar to T. Must generalize from T and run extremely fast on huge R.

1a. KenLM (n-gram language model)

CCNet (Wenzek et al., 2019)
Kneser-Ney smoothed n-gram model. Extremely simple and fast. Generative approach: score(x) = p_T(x). Sort by perplexity, keep top fraction.
score(x) = perplexity_T(x) = exp(−1/n ⋅ ∑ log p(w_i | w_{i-k}...w_{i-1}))
CCNet: KenLM trained on Wikipedia, keep top 1/3 lowest perplexity paragraphs.
OpenWebMath: KenLM trained on ProofPile, perplexity threshold <15,000 → 14.7B math tokens. A 1.4B model trained on this beat models trained on 20× more data.
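The mechanic is easy to demonstrate with a toy stand-in. KenLM fits Kneser-Ney smoothed n-grams; the add-one smoothed unigram model below (over an invented mini-corpus) is only a sketch of the same idea, showing how perplexity against a target domain separates in-domain from off-domain text:

```python
import math
from collections import Counter

def train_unigram(target_corpus, alpha=1.0):
    """Toy stand-in for KenLM: add-alpha smoothed unigram model of the target."""
    counts = Counter(w for doc in target_corpus for w in doc.split())
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 bin for unseen
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab)

def perplexity(doc, p):
    words = doc.split()
    return math.exp(-sum(math.log(p(w)) for w in words) / len(words))

target = ["the proof follows by induction", "the lemma holds for all n"]
p = train_unigram(target)
# In-domain text scores lower perplexity than off-domain text; filtering
# keeps documents below a chosen perplexity threshold.
print(perplexity("the proof holds", p) < perplexity("buy cheap pills now", p))  # True
```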

1b. fastText classifier

Joulin et al. (2016) — Bag of Tricks for Efficient Text Classification
Bag of n-gram embeddings + linear head. Hashing trick (10M bins) for unbounded vocab. Asynchronous SGD.
score(x) = p(T | x)  —  discriminative approach, keep if score ≥ threshold
Orders of magnitude faster than BERT/LLM classifiers. DCLM showed fastText quality classifier outperforms all rule-based methods.
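A pure-Python sketch of the two ingredients named above (the hashing trick plus a linear head), trained on tiny invented positive/negative sets. Real fastText also learns n-gram embeddings, uses a softmax over labels, and trains with asynchronous SGD at far larger scale:

```python
import math

BINS = 1_000_000  # hashing trick: unbounded n-gram vocab → fixed bins

def features(text, n=2):
    """Hashed unigrams + word n-grams."""
    words = text.lower().split()
    grams = words + [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [hash(g) % BINS for g in grams]

def score(text, w):
    """p(high quality | text) from a linear model over hashed n-grams."""
    s = sum(w.get(f, 0.0) for f in features(text))
    return 1 / (1 + math.exp(-s))

def train(pos, neg, lr=0.5, epochs=20):
    """Logistic-regression SGD over a sparse weight dict."""
    w = {}
    for _ in range(epochs):
        for text, y in [(t, 1.0) for t in pos] + [(t, 0.0) for t in neg]:
            g = score(text, w) - y
            for f in features(text):
                w[f] = w.get(f, 0.0) - lr * g
    return w

w = train(pos=["explain how gradients flow", "a detailed proof sketch"],
          neg=["click here to win", "buy now limited offer"])
print(score("a proof of how gradients flow", w) > score("click to buy now", w))  # True
```

The speed comes from the fact that scoring is one dictionary lookup per hashed n-gram, which is why this family of classifiers can run over hundreds of trillions of raw tokens.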

1c. DSIR (importance resampling)

Xie et al. (2023) — Data Selection for Language Models via Importance Resampling
Fit bag-of-hashed-ngram distributions to both target and raw data. Resample proportionally to importance weights.
score(x) = p_T(x) / p_R(x)  —  importance weight, resample proportionally
More principled than heuristic classification (captures diversity). Slightly better than fastText on GLUE, similar compute.
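A sketch of the importance-weight computation under the bag-of-hashed-ngrams model, with invented mini-corpora; DSIR then resamples documents with probability proportional to these weights:

```python
from collections import Counter

BINS = 10_000  # hashed feature space for word bigrams

def bigram_bins(doc):
    ws = doc.lower().split()
    return [hash(" ".join(ws[i:i + 2])) % BINS for i in range(len(ws) - 1)]

def fit_dist(corpus):
    """Add-one smoothed distribution over hashed bigrams."""
    c = Counter(b for doc in corpus for b in bigram_bins(doc))
    total = sum(c.values()) + BINS
    return lambda b: (c[b] + 1) / total

target = ["the proof follows by induction", "the lemma holds by induction"]
raw = ["buy cheap pills now today", "click here to win prizes",
       "the proof follows by induction"]
p_T, p_R = fit_dist(target), fit_dist(raw)

def importance_weight(doc):
    """score(x) = p_T(x) / p_R(x), factored over hashed bigrams."""
    w = 1.0
    for b in bigram_bins(doc):
        w *= p_T(b) / p_R(b)
    return w

# Target-like text gets a larger weight than off-target text:
print(importance_weight("the proof follows by induction") >
      importance_weight("buy cheap pills now today"))  # True
```

Because the weight is a ratio of two distributions rather than a classifier score, rare-but-on-target documents are upweighted without needing explicit positive labels.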

Comparison

Method | Approach | Speed | Quality signal | Used by
KenLM | Generative: p(x) | Very fast | Proximity to reference | CCNet, OpenWebMath
fastText | Discriminative: p(T|x) | Very fast | Binary quality label | DCLM, GPT-3, LLaMA
DSIR | Importance: p_T/p_R | Fast | Distribution matching | Research datasets
LLM judge | Prompted scoring | Slow | Rich semantic signal | phi-1 (GPT-4 labels)
§2 Language Identification
fastText language identification — 176 languages, trained on Wikipedia + Tatoeba + SETimes

Threshold sensitivity

Dataset | Threshold | Effect
C4 | p(en) ≥ 0.99 | Very aggressive: removes multilingual content, code, LaTeX
FineWeb | p(en) > 0.65 | Moderate: retains code-heavy and mixed-language content
Dolma | p(en) ≥ 0.5 | Permissive: keeps dialect, code-switching

Known failure modes

• Short text (fewer features to classify)
• Low-resource languages (poor training data)
• Dialects and code-switching (mixed languages in one document)
• LaTeX and source code (not natural language)
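The threshold mechanic itself is a one-liner once the classifier score exists. In practice the score comes from fastText's lid.176 model; here the (language, probability) pair is supplied directly so the sketch stays self-contained, and all thresholds are applied as ≥ for simplicity:

```python
THRESHOLDS = {"C4": 0.99, "FineWeb": 0.65, "Dolma": 0.50}

def keep(lang, prob, dataset, target="en"):
    """Apply a dataset's published language-ID threshold."""
    return lang == target and prob >= THRESHOLDS[dataset]

# A code-heavy English page the classifier scores at p(en) = 0.72:
print([d for d in THRESHOLDS if keep("en", 0.72, d)])  # ['FineWeb', 'Dolma']
```

The same document survives FineWeb and Dolma but is dropped by C4, which is exactly the threshold-sensitivity issue in the table above.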
§3 Toxicity Filtering
Dataset | Approach | Training data | Tradeoff
C4 | Word blocklist (LDNOOBW) | N/A | Blunt: removes medical/sexual-health content
Dolma | fastText classifiers (2 models) | Jigsaw Toxic Comments (Wikipedia talk pages) | Separates hate from NSFW; more nuanced
Gopher | Google SafeSearch API | Google's proprietary data | Production-grade but non-reproducible

Dolma's two-classifier design

Hate classifier: positive = unlabeled + obscene (Jigsaw), negative = clean
NSFW classifier: positive = obscene subset, negative = rest
Separating hate from NSFW allows independent thresholds. Medical/health content is less likely to be caught.
§4 Exact Deduplication & Bloom Filters

Exact dedup

Hash each item (document, paragraph, or n-gram span) with a fast non-cryptographic hash (MurmurHash, CityHash). Group by hash. Keep one per group.

Design choices: What is an "item"? Document-level (coarse), paragraph-level (Dolma), 3-sentence spans (C4), n-gram spans. Finer granularity catches more but risks creating incoherent documents.
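The paragraph-level variant can be sketched in a few lines. MD5 stands in for MurmurHash below only because it ships with the standard library; production systems prefer fast non-cryptographic hashes:

```python
import hashlib

def exact_dedup(docs, granularity="paragraph"):
    """Keep the first occurrence of each hashed item, drop repeats."""
    seen, out = set(), []
    for doc in docs:
        items = doc.split("\n\n") if granularity == "paragraph" else [doc]
        kept = []
        for item in items:
            h = hashlib.md5(item.encode()).digest()
            if h not in seen:
                seen.add(h)
                kept.append(item)
        if kept:
            out.append("\n\n".join(kept))
    return out

docs = ["Intro.\n\nShared boilerplate.", "Shared boilerplate.\n\nFresh content."]
print(exact_dedup(docs))  # ['Intro.\n\nShared boilerplate.', 'Fresh content.']
```

Note the incoherence risk mentioned above: the second document loses a paragraph, so finer granularity catches more duplicates at the cost of document integrity.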

Hash functions

Type | Examples | Speed | Use case
Cryptographic | SHA-256, MD5 | Slow | Collision-resistant verification
Non-cryptographic | MurmurHash, CityHash, DJB2 | Fast | Dedup, Bloom filters, MinHash

Bloom filters

Probabilistic set membership. No false negatives; tunable false positive rate. Uses k hash functions mapping to an m-bit array.
Optimal k = ln(2) ⋅ m/n  —  False positive rate = 0.5^k
Dolma: Bloom filter dedup on paragraphs with false positive rate set to 1e-15. Memory-efficient: a set of billions of items can be represented in a few GB.
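A compact implementation of the formulas above. MD5 again stands in for the fast hashes used in production, and deriving the k hash functions by salting with an index is an assumption of this sketch:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, fp_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2, k = (m/n) ln 2, FP ≈ 0.5^k.
        self.m = max(1, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):  # k salted hash functions → k bit positions
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

bf = BloomFilter(n_items=1000, fp_rate=1e-6)
bf.add("a paragraph we have seen")
print("a paragraph we have seen" in bf)   # True: no false negatives
print("a brand new paragraph" in bf)      # False with high probability
```

At Dolma's 1e-15 false positive rate the same sizing formula gives ~50 hash functions and ~72 bits per item, which is how billions of paragraphs fit in a few GB.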
§5 Fuzzy Deduplication (Jaccard, MinHash, LSH)

Jaccard similarity

J(A, B) = |A ∩ B| / |A ∪ B|  —  near-duplicates defined as J ≥ threshold (e.g. 0.8)
Documents represented as sets of character n-grams (typically 5-grams). Computing exact Jaccard for all pairs is O(n²) — intractable at web scale.

MinHash

Broder (1997) — On the resemblance and containment of documents
Hash function where Pr[h(A) = h(B)] = J(A, B). Uses the minimum hash value over the set. With k independent hash functions, estimate J as the fraction of matching min-hashes. Empirically, 100 hash functions closely approximate the true Jaccard.

Locality-Sensitive Hashing (LSH)

Problem: MinHash alone is too stochastic for clean thresholding. Solution: n hash functions split into b bands of r rows.
Collision iff ∃ a band where all r hashes match
AND within bands (raises threshold); OR across bands (catches matches)
Creates S-curve around threshold ≈ (1/b)^(1/r)
Tuning: increasing r → sharper threshold, shifted right (harder to match). Increasing b → shifted left (easier to match).

Example: n = 9000 hashes, b = 20 bands, r = 450 rows → threshold ≈ (1/20)^(1/450) ≈ 0.993.
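Putting §5 together in a short sketch. Salting Python's built-in `hash` with per-function values stands in for proper min-wise independent permutations, an approximation of this sketch:

```python
import random

def shingles(text, n=5):
    """Document as a set of character n-grams (5-grams here)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(sh, n_hashes=200, seed=0):
    """Per salted hash function, take the min over the set: Pr[match] = J(A, B)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(n_hashes)]
    return [min(hash((s, g)) for g in sh) for s in salts]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_bands(sig, b=40):
    """b bands of r rows; two docs collide iff some band matches exactly."""
    r = len(sig) // b
    return [tuple(sig[i * r:(i + 1) * r]) for i in range(b)]

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
est = estimated_jaccard(a, b)  # true Jaccard of these shingle sets is ~0.76
near_dup = any(x == y for x, y in zip(lsh_bands(a), lsh_bands(b)))
```

With b = 40 and r = 5 the collision threshold is (1/40)^(1/5) ≈ 0.48, so this near-duplicate pair collides in some band with high probability while unrelated documents almost never do.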

Production usage

Dataset | Method | Granularity
RefinedWeb | MinHash 5-gram + LSH | Document-level
FineWeb | MinHash + LSH | Document-level
SlimPajama | MinHashLSH | Document-level (627B token subset)
GPT-3 | Fuzzy dedup | Document-level + benchmark contamination check
§6 HTML→Text Conversion Wars
Tool | Approach | Used by | Tradeoff
WET files | Pre-extracted by Common Crawl | C4, early CC datasets | Convenient but lossy
trafilatura | Python, content extraction heuristics | RefinedWeb, FineWeb | Good precision, some recall loss
jusText | Block-level classification | Nemotron-CC | Higher token yield than trafilatura
resiliparse | Fast C-based HTML parser | DCLM | Speed-optimized

Key finding

DCLM and Pile-CC independently demonstrated that WARC → custom text extraction consistently outperforms pre-converted WET files on downstream benchmarks. This "mundane" step has outsized impact on final model quality.

Nemotron-CC chose jusText over trafilatura specifically because it yields more tokens per document. Since Nemotron-CC's thesis is that over-filtering loses valuable data, maximizing raw token extraction is structurally aligned with their approach.
§7 Copyright, Fair Use & Data Secrecy

The copyright landscape

Most internet content is copyrighted. Fair use for ML training is legally unsettled. Major lawsuits ongoing (NYT v. OpenAI, Getty v. Stability AI, etc.).

Shadow libraries in the training mix:
LibGen — ~4M books. Meta confirmed training LLaMA on LibGen.
Sci-Hub — ~88M academic papers.
Books3 (The Pile) — 196K books from Bibliotik shadow library. Taken down.
BooksCorpus — 7K self-published books from Smashwords. Taken down for TOS violation.

Data secrecy dynamics

Competitive advantage + copyright liability = frontier labs disclose almost nothing about training data. Architecture and training algorithms are published; data pipelines are not.

Exceptions: Dolma, FineWeb, DCLM, Nemotron-CC are fully open with documented pipelines. The Pile pioneered grassroots open data curation.

Benchmark contamination

GPT-3 ran fuzzy dedup against benchmarks to check for leakage. Not universally practiced. A growing concern as datasets scale to hundreds of trillions of tokens.
§8 Open Questions & Tensions
Rule-based vs. model-based filtering
Rules avoid bias but classifiers produce better downstream performance. No consensus.
What is "quality"?
Wikipedia-like? Educational? Instruction-like? Each choice biases the model differently.
Over-filtering
FineWeb-Edu and DCLM remove ~90% of data. Nemotron-CC's synthetic rephrasing is one response.
Synthetic data in the pipeline
Nemotron-CC uses LM rephrasing. How far can synthetic augmentation of pre-training data go?
Data poisoning
Wikipedia dump timing exploitable. CC seed URL injection. No robust defenses at scale.
Copyright liability
Fair use for ML training legally unsettled. Major lawsuits in progress.
Benchmark contamination
Fuzzy dedup against benchmarks needed but not universal. Gets harder as dataset scale grows.
Pipeline is fundamentally heuristic
"Many opportunities to improve." Data curation scales with human effort, not compute.