
LLM Pre-training Data

Common Crawl (2007–) Non-profit web archive. ~100 crawls since 2008. Apache Nutch crawler on ~100 machines over 10–12 days per crawl. Hundreds of millions of seed URLs. Nearly every pre-training dataset builds on Common Crawl.
"Data does not fall from the sky" Companies openly publish architecture details but keep data pipelines secret. Data curation is the key differentiator—and the most labor-intensive, least scalable part of building frontier LLMs. Much of the pipeline is heuristic.
Pipeline: Source → Clean → Refine → Train
Stage 1
Web Crawling
Common Crawl harvests the open web. Hundreds of millions of seed URLs, politeness policies, robots.txt.
Two output formats:
WARC — raw HTML as captured by the crawler. Preferred: you control text extraction.
WET — pre-extracted plain text. Lossy conversion, lower downstream quality.

DCLM demonstrated that choosing WARC over WET and converting HTML yourself materially improves downstream model accuracy.
~100 crawls archived since 2008
10–12 days per crawl on ~100 machines
Hundreds of millions of seed URLs
Policies: selection, politeness (robots.txt), re-visit frequency
⚠ Challenge
Dynamic URLs, duplicate content, malicious injection. Wikipedia dump timing can be exploited for data poisoning.

Stage 2
Text Extraction (HTML → Text)
Convert raw HTML into clean text. Tool choice matters: trafilatura, jusText, resiliparse.
trafilatura — widely used (FineWeb, RefinedWeb). Good precision, some recall loss.
jusText — higher token yield. Chosen by Nemotron-CC for this reason.
resiliparse — fast C-based parser.
markdownify — preserves document structure as Markdown.

Key finding (DCLM, Pile-CC): starting from WARC and extracting text yourself consistently outperforms using pre-converted WET files.
⚠ Critical
This seemingly mundane step has outsized impact. The choice between trafilatura and jusText alone changes token counts by >20% and affects downstream benchmark scores.
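To make the extraction step concrete, here is a minimal tag-stripping extractor using only Python's standard library. This is a sketch, not what production pipelines do: trafilatura and jusText additionally remove boilerplate (navigation, ads, footers), while this version only skips script/style content and collects text nodes.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive HTML→text: keep text nodes, skip script/style contents."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    p = TextExtractor()
    p.feed(html)
    return "\n".join(p.parts)

html = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Body text.</p></body></html>")
print(extract_text(html))  # prints "Title" and "Body text." on two lines
```

The gap between this naive version and a boilerplate-aware extractor is exactly where the >20% token-count differences come from.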

Stage 3
Filtering
Remove low-quality, toxic, non-target-language content. Two schools: rule-based vs. model-based.
3a. Language Identification
fastText classifier (176 languages, trained on Wikipedia + Tatoeba). Thresholds vary widely: C4 uses p(en)≥0.99, FineWeb uses p(en)>0.65, Dolma uses p(en)≥0.5.

3b. Quality Filtering — Rule-based
Explicit heuristics (C4, Gopher, RefinedWeb, FineWeb, Dolma): each line must end in terminal punctuation and contain ≥5 words; documents need ≥3 sentences; drop pages containing "bad words", `{`, or "lorem ipsum"; ≥80% of words must contain an alphabetic character.
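These heuristics are easy to express directly. A sketch combining a few of the published rules; the thresholds are illustrative (each dataset uses its own exact values) and C4's full bad-word blocklist is omitted:

```python
import re

def passes_rules(doc: str) -> bool:
    """Illustrative C4/Gopher-style heuristics; real thresholds vary by dataset."""
    words = doc.split()
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if len(re.split(r"[.!?]\s+", doc)) < 3:           # need >= 3 sentences
        return False
    if "{" in doc or "lorem ipsum" in doc.lower():    # code / placeholder pages
        return False
    for line in lines:                                # C4: per-line rules
        if len(line.split()) < 5 or not line.endswith((".", "!", "?", '"')):
            return False
    # Gopher-style: >= 80% of words must contain an alphabetic character
    alpha = sum(any(c.isalpha() for c in w) for w in words)
    return alpha >= 0.8 * len(words)

good = ("This page explains web crawling. It covers politeness policies. "
        "It also covers robots.txt rules.")
print(passes_rules(good))                        # True
print(passes_rules("function f() { return 1; }"))  # False: contains "{"
```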

3c. Quality Filtering — Model-based (becoming the norm)
GPT-3: linear classifier, positives = {WebText, Wikipedia, Books}
LLaMA: positives = Wikipedia-referenced pages
DCLM: fastText, positives = {OpenHermes-2.5 (GPT-4 generated), ELI5 subreddit}
phi-1: GPT-4 labels 100K subset for "educational value" → 17.68% HumanEval vs 12.19% unfiltered

3d. Toxicity Filtering
Dolma: Jigsaw Toxic Comments dataset, fastText classifiers for hate + NSFW
Gopher: Google SafeSearch API
C4: removed pages with any word from LDNOOBW list (blunt)
⚠ Tension
Rule-based filtering avoids ML bias, but model-based filtering produces better downstream performance. FineWeb-Edu and DCLM remove ~90% of the data. Nemotron-CC addresses this with ensemble classifiers plus synthetic rephrasing of filtered content.

Stage 4
Deduplication
Remove exact and near-duplicate content. Reduces memorization, improves training efficiency.
Why deduplicate? C4 contained one product description repeated 61,036 times. Duplicates waste compute and increase memorization (copyright/privacy risk).

4a. Exact Deduplication
Hash each item (MurmurHash). Group by hash. Keep one per group. Simple, high precision, parallelizable (MapReduce). Misses near-duplicates.

4b. Bloom Filters
Memory-efficient probabilistic set membership. No false negatives; tunable false positive rate. Dolma: Bloom filter dedup on paragraphs (FP rate 1e-15).

4c. Fuzzy Dedup — MinHash + LSH
Jaccard similarity: J(A,B) = |A∩B| / |A∪B|. MinHash gives Pr[h(A)=h(B)] = J(A,B). Locality-Sensitive Hashing (LSH) sharpens the threshold via banding: n hashes split into b bands of r rows.
Threshold ≈ (1/b)^(1/r)  —  AND within bands, OR across bands → S-curve around threshold
RefinedWeb, FineWeb, SlimPajama all use MinHash 5-gram dedup with LSH.

Stage 5
Data Mixing & Staging
Weight different sources. Upsample high-quality domains. Stage from pre-training to mid-training to post-training.
Pre-training — large amounts of lower-quality, high-diversity web data. Token budgets: Llama 3 trained on 15T tokens, Qwen3 on 36T tokens.

Mid-training — high-quality subset + long-context data. Continued pre-training on curated mix. Shifts distribution toward quality.

Post-training — instruction data, chat, RLHF. Tiny volume (<1M examples), high curation effort.

Domain weighting: GPT-3 upsampled Wikipedia and Books 2–3× vs. raw proportion. Most labs keep exact mixing ratios secret.
⚠ Critical
The mixing recipe is arguably the most guarded secret in LLM development. Different mixes produce dramatically different capabilities.
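Domain weighting reduces to sampling each training example from a source chosen by target weight rather than by raw size. The source names and numbers below are invented for illustration (real ratios are unpublished); the Wikipedia row mimics GPT-3-style 2–3× upsampling:

```python
import random

# Hypothetical mix: raw share of the data pool vs. target training weight.
sources = {
    "web":       {"raw": 0.82, "weight": 0.60},
    "books":     {"raw": 0.15, "weight": 0.31},
    "wikipedia": {"raw": 0.03, "weight": 0.09},  # ~3x upsampled, GPT-3 style
}

def effective_epochs(name: str) -> float:
    """Passes over a source implied by one pass over the mixed stream."""
    s = sources[name]
    return s["weight"] / s["raw"]

def sample_source(rng: random.Random) -> str:
    names = list(sources)
    return rng.choices(names, weights=[sources[n]["weight"] for n in names])[0]

print(round(effective_epochs("wikipedia"), 2))  # 3.0: seen ~3x per mixed epoch
```

Note the tradeoff this exposes: upsampling a small source means repeating it, which trades diversity against quality and raises memorization risk.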

Output
Training-Ready Token Stream
Tokenized, shuffled, packed into sequences. Fed to the model as next-token prediction targets.
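The packing step is simple to sketch: concatenate tokenized documents with an end-of-sequence separator, then chop the stream into fixed-length training sequences. Token IDs and the EOS id here are arbitrary, and dropping the trailing remainder is just one common convention:

```python
def pack(tokenized_docs, seq_len, eos_id=0):
    """Concatenate docs with EOS separators, chop into fixed-length
    sequences, and drop the trailing remainder."""
    stream = []
    for toks in tokenized_docs:
        stream.extend(toks)
        stream.append(eos_id)
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack(docs, seq_len=4))  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```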
Landmark Datasets
C4 (2019)
156B tokens from 1 CC snapshot. Rule-based filtering only. The first large-scale cleaned CC dataset. Used by T5.
The Pile (2021)
275B tokens from 22 curated sources. Grassroots open-source effort. Included arXiv, PubMed, GitHub, Books3.
RefinedWeb (2023)
5T tokens (600B released). "Web data is all you need." WARC + trafilatura + Gopher rules + MinHash.
FineWeb (2024)
15T tokens from 95 CC dumps. HuggingFace. Rules + MinHash + PII anonymization. Fully open.
DCLM (2024)
240T raw → 3.8T filtered. DataComp-LM. fastText quality classifier. Open benchmark for data curation.
Nemotron-CC (2024)
6.3T tokens (1.1T HQ subset). NVIDIA. Ensemble classifier + synthetic rephrasing of low-quality data.
The Data Scaling Arc
2019: C4 = 156B tokens from 1 snapshot → 2020: GPT-3 = 400B tokens → 2023: RefinedWeb = 5T tokens → 2024: Llama 3 trains on 15T tokens, Qwen3 on 36T tokens, DCLM-pool = 240T raw tokens.
The trend: more data, better filtering, higher quality—but the pipeline remains fundamentally heuristic with "many opportunities to improve."
§1 Quality Filtering Algorithms

General framework

Given target data T (small, high quality) and raw data R (large, noisy), find subset T' of R similar to T. Must generalize from T and run extremely fast on huge R.

1a. KenLM (n-gram language model)

CCNet (Wenzek et al., 2019)
Kneser-Ney smoothed n-gram model. Extremely simple and fast. Generative approach: score(x) = p_T(x). Sort by perplexity, keep top fraction.
score(x) = perplexity_T(x) = exp(−1/n ⋅ ∑ log p(w_i | w_{i-k}...w_{i-1}))
CCNet: KenLM trained on Wikipedia, keep top 1/3 lowest perplexity paragraphs.
OpenWebMath: KenLM trained on ProofPile, perplexity threshold <15,000 → 14.7B math tokens. A 1.4B model trained on this beat models trained on 20× more data.
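The mechanic is easy to demonstrate with a toy stand-in. KenLM fits Kneser-Ney smoothed n-grams; the add-one smoothed unigram model below (over an invented mini-corpus) is only a sketch of the same idea, showing how perplexity against a target domain separates in-domain from off-domain text:

```python
import math
from collections import Counter

def train_unigram(target_corpus, alpha=1.0):
    """Toy stand-in for KenLM: add-alpha smoothed unigram model of the target."""
    counts = Counter(w for doc in target_corpus for w in doc.split())
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 bin for unseen
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab)

def perplexity(doc, p):
    words = doc.split()
    return math.exp(-sum(math.log(p(w)) for w in words) / len(words))

target = ["the proof follows by induction", "the lemma holds for all n"]
p = train_unigram(target)
# In-domain text scores lower perplexity than off-domain text; filtering
# keeps documents below a chosen perplexity threshold.
print(perplexity("the proof holds", p) < perplexity("buy cheap pills now", p))  # True
```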

1b. fastText classifier

Joulin et al. (2016) — Bag of Tricks for Efficient Text Classification
Bag of n-gram embeddings + linear head. Hashing trick (10M bins) for unbounded vocab. Asynchronous SGD.
score(x) = p(T | x)  —  discriminative approach, keep if score ≥ threshold
Orders of magnitude faster than BERT/LLM classifiers. DCLM showed fastText quality classifier outperforms all rule-based methods.
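A pure-Python sketch of the two ingredients named above (the hashing trick plus a linear head), trained on tiny invented positive/negative sets. Real fastText also learns n-gram embeddings, uses a softmax over labels, and trains with asynchronous SGD at far larger scale:

```python
import math

BINS = 1_000_000  # hashing trick: unbounded n-gram vocab → fixed bins

def features(text, n=2):
    """Hashed unigrams + word n-grams."""
    words = text.lower().split()
    grams = words + [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [hash(g) % BINS for g in grams]

def score(text, w):
    """p(high quality | text) from a linear model over hashed n-grams."""
    s = sum(w.get(f, 0.0) for f in features(text))
    return 1 / (1 + math.exp(-s))

def train(pos, neg, lr=0.5, epochs=20):
    """Logistic-regression SGD over a sparse weight dict."""
    w = {}
    for _ in range(epochs):
        for text, y in [(t, 1.0) for t in pos] + [(t, 0.0) for t in neg]:
            g = score(text, w) - y
            for f in features(text):
                w[f] = w.get(f, 0.0) - lr * g
    return w

w = train(pos=["explain how gradients flow", "a detailed proof sketch"],
          neg=["click here to win", "buy now limited offer"])
print(score("a proof of how gradients flow", w) > score("click to buy now", w))  # True
```

The speed comes from the fact that scoring is one dictionary lookup per hashed n-gram, which is why this family of classifiers can run over hundreds of trillions of raw tokens.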

1c. DSIR (importance resampling)

Xie et al. (2023) — Data Selection for Language Models via Importance Resampling
Fit bag-of-hashed-ngram distributions to both target and raw data. Resample proportionally to importance weights.
score(x) = p_T(x) / p_R(x)  —  importance weight, resample proportionally
More principled than heuristic classification (captures diversity). Slightly better than fastText on GLUE, similar compute.
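A sketch of the importance-weight computation under the bag-of-hashed-ngrams model, with invented mini-corpora; DSIR then resamples documents with probability proportional to these weights:

```python
from collections import Counter

BINS = 10_000  # hashed feature space for word bigrams

def bigram_bins(doc):
    ws = doc.lower().split()
    return [hash(" ".join(ws[i:i + 2])) % BINS for i in range(len(ws) - 1)]

def fit_dist(corpus):
    """Add-one smoothed distribution over hashed bigrams."""
    c = Counter(b for doc in corpus for b in bigram_bins(doc))
    total = sum(c.values()) + BINS
    return lambda b: (c[b] + 1) / total

target = ["the proof follows by induction", "the lemma holds by induction"]
raw = ["buy cheap pills now today", "click here to win prizes",
       "the proof follows by induction"]
p_T, p_R = fit_dist(target), fit_dist(raw)

def importance_weight(doc):
    """score(x) = p_T(x) / p_R(x), factored over hashed bigrams."""
    w = 1.0
    for b in bigram_bins(doc):
        w *= p_T(b) / p_R(b)
    return w

# Target-like text gets a larger weight than off-target text:
print(importance_weight("the proof follows by induction") >
      importance_weight("buy cheap pills now today"))  # True
```

Because the weight is a ratio of two distributions rather than a classifier score, rare-but-on-target documents are upweighted without needing explicit positive labels.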

Comparison

Method | Approach | Speed | Quality signal | Used by
KenLM | Generative: p(x) | Very fast | Proximity to reference | CCNet, OpenWebMath
fastText | Discriminative: p(T|x) | Very fast | Binary quality label | DCLM, GPT-3, LLaMA
DSIR | Importance: p_T/p_R | Fast | Distribution matching | Research datasets
LLM judge | Prompted scoring | Slow | Rich semantic signal | phi-1 (GPT-4 labels)
§2 Language Identification
fastText language identification — 176 languages, trained on Wikipedia + Tatoeba + SETimes

Threshold sensitivity

Dataset | Threshold | Effect
C4 | p(en) ≥ 0.99 | Very aggressive: removes multilingual content, code, LaTeX
FineWeb | p(en) > 0.65 | Moderate: retains code-heavy and mixed-language content
Dolma | p(en) ≥ 0.5 | Permissive: keeps dialect, code-switching

Known failure modes

• Short text (fewer features to classify)
• Low-resource languages (poor training data)
• Dialects and code-switching (mixed languages in one document)
• LaTeX and source code (not natural language)
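The threshold mechanic itself is a one-liner once the classifier score exists. In practice the score comes from fastText's lid.176 model; here the (language, probability) pair is supplied directly so the sketch stays self-contained, and all thresholds are applied as ≥ for simplicity:

```python
THRESHOLDS = {"C4": 0.99, "FineWeb": 0.65, "Dolma": 0.50}

def keep(lang, prob, dataset, target="en"):
    """Apply a dataset's published language-ID threshold."""
    return lang == target and prob >= THRESHOLDS[dataset]

# A code-heavy English page the classifier scores at p(en) = 0.72:
print([d for d in THRESHOLDS if keep("en", 0.72, d)])  # ['FineWeb', 'Dolma']
```

The same document survives FineWeb and Dolma but is dropped by C4, which is exactly the threshold-sensitivity issue in the table above.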
§3 Toxicity Filtering
Dataset | Approach | Training data | Tradeoff
C4 | Word blocklist (LDNOOBW) | N/A | Blunt: removes medical/sexual-health content
Dolma | fastText classifiers (2 models) | Jigsaw Toxic Comments (Wikipedia talk pages) | Separates hate from NSFW; more nuanced
Gopher | Google SafeSearch API | Google's proprietary data | Production-grade but non-reproducible

Dolma's two-classifier design

Hate classifier: positive = unlabeled + obscene (Jigsaw), negative = clean
NSFW classifier: positive = obscene subset, negative = rest
Separating hate from NSFW allows independent thresholds. Medical/health content is less likely to be caught.
§4 Exact Deduplication & Bloom Filters

Exact dedup

Hash each item (document, paragraph, or n-gram span) with a fast non-cryptographic hash (MurmurHash, CityHash). Group by hash. Keep one per group.

Design choices: What is an "item"? Document-level (coarse), paragraph-level (Dolma), 3-sentence spans (C4), n-gram spans. Finer granularity catches more but risks creating incoherent documents.
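The paragraph-level variant can be sketched in a few lines. MD5 stands in for MurmurHash below only because it ships with the standard library; production systems prefer fast non-cryptographic hashes:

```python
import hashlib

def exact_dedup(docs, granularity="paragraph"):
    """Keep the first occurrence of each hashed item, drop repeats."""
    seen, out = set(), []
    for doc in docs:
        items = doc.split("\n\n") if granularity == "paragraph" else [doc]
        kept = []
        for item in items:
            h = hashlib.md5(item.encode()).digest()
            if h not in seen:
                seen.add(h)
                kept.append(item)
        if kept:
            out.append("\n\n".join(kept))
    return out

docs = ["Intro.\n\nShared boilerplate.", "Shared boilerplate.\n\nFresh content."]
print(exact_dedup(docs))  # ['Intro.\n\nShared boilerplate.', 'Fresh content.']
```

Note the incoherence risk mentioned above: the second document loses a paragraph, so finer granularity catches more duplicates at the cost of document integrity.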

Hash functions

Type | Examples | Speed | Use case
Cryptographic | SHA-256, MD5 | Slow | Collision-resistant verification
Non-cryptographic | MurmurHash, CityHash, DJB2 | Fast | Dedup, Bloom filters, MinHash

Bloom filters

Probabilistic set membership. No false negatives; tunable false positive rate. Uses k hash functions mapping to an m-bit array.
Optimal k = ln(2) ⋅ m/n  —  False positive rate = 0.5^k
Dolma: Bloom filter dedup on paragraphs with false positive rate set to 1e-15. Memory-efficient: a set of billions of items can be represented in a few GB.
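A compact implementation of the formulas above. MD5 again stands in for the fast hashes used in production, and deriving the k hash functions by salting with an index is an assumption of this sketch:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, fp_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2, k = (m/n) ln 2, FP ≈ 0.5^k.
        self.m = max(1, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):  # k salted hash functions → k bit positions
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

bf = BloomFilter(n_items=1000, fp_rate=1e-6)
bf.add("a paragraph we have seen")
print("a paragraph we have seen" in bf)   # True: no false negatives
print("a brand new paragraph" in bf)      # False with high probability
```

At Dolma's 1e-15 false positive rate the same sizing formula gives ~50 hash functions and ~72 bits per item, which is how billions of paragraphs fit in a few GB.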
§5 Fuzzy Deduplication (Jaccard, MinHash, LSH)

Jaccard similarity

J(A, B) = |A ∩ B| / |A ∪ B|  —  near-duplicates defined as J ≥ threshold (e.g. 0.8)
Documents represented as sets of character n-grams (typically 5-grams). Computing exact Jaccard for all pairs is O(n²) — intractable at web scale.

MinHash

Broder (1997) — On the resemblance and containment of documents
Hash function where Pr[h(A) = h(B)] = J(A, B). Uses the minimum hash value over the set. With k independent hash functions, estimate J as the fraction of matching min-hashes. Empirically, 100 hash functions closely approximate the true Jaccard.

Locality-Sensitive Hashing (LSH)

Problem: MinHash alone is too stochastic for clean thresholding. Solution: n hash functions split into b bands of r rows.
Collision iff ∃ a band where all r hashes match
AND within bands (raises threshold); OR across bands (catches matches)
Creates S-curve around threshold ≈ (1/b)^(1/r)
Tuning: increasing r → sharper threshold, shifted right (harder to match). Increasing b → shifted left (easier to match).

Example: n = 9000 hashes, b = 20 bands, r = 450 rows → threshold ≈ (1/20)^(1/450) ≈ 0.993.
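Putting §5 together in a short sketch. Salting Python's built-in `hash` with per-function values stands in for proper min-wise independent permutations, an approximation of this sketch:

```python
import random

def shingles(text, n=5):
    """Document as a set of character n-grams (5-grams here)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(sh, n_hashes=200, seed=0):
    """Per salted hash function, take the min over the set: Pr[match] = J(A, B)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(n_hashes)]
    return [min(hash((s, g)) for g in sh) for s in salts]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_bands(sig, b=40):
    """b bands of r rows; two docs collide iff some band matches exactly."""
    r = len(sig) // b
    return [tuple(sig[i * r:(i + 1) * r]) for i in range(b)]

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
est = estimated_jaccard(a, b)  # true Jaccard of these shingle sets is ~0.76
near_dup = any(x == y for x, y in zip(lsh_bands(a), lsh_bands(b)))
```

With b = 40 and r = 5 the collision threshold is (1/40)^(1/5) ≈ 0.48, so this near-duplicate pair collides in some band with high probability while unrelated documents almost never do.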

Production usage

Dataset | Method | Granularity
RefinedWeb | MinHash 5-gram + LSH | Document-level
FineWeb | MinHash + LSH | Document-level
SlimPajama | MinHashLSH | Document-level (627B token subset)
GPT-3 | Fuzzy dedup | Document-level + benchmark contamination check
§6 HTML→Text Conversion Wars
Tool | Approach | Used by | Tradeoff
WET files | Pre-extracted by Common Crawl | C4, early CC datasets | Convenient but lossy
trafilatura | Python, content extraction heuristics | RefinedWeb, FineWeb | Good precision, some recall loss
jusText | Block-level classification | Nemotron-CC | Higher token yield than trafilatura
resiliparse | Fast C-based HTML parser | DCLM | Speed-optimized

Key finding

DCLM and Pile-CC independently demonstrated that WARC → custom text extraction consistently outperforms pre-converted WET files on downstream benchmarks. This "mundane" step has outsized impact on final model quality.

Nemotron-CC chose jusText over trafilatura specifically because it yields more tokens per document. Since Nemotron-CC's thesis is that over-filtering loses valuable data, maximizing raw token extraction is structurally aligned with their approach.
§7 Copyright, Fair Use & Data Secrecy

The copyright landscape

Most internet content is copyrighted. Fair use for ML training is legally unsettled. Major lawsuits ongoing (NYT v. OpenAI, Getty v. Stability AI, etc.).

Shadow libraries in the training mix:
LibGen — ~4M books. Meta confirmed training LLaMA on LibGen.
Sci-Hub — ~88M academic papers.
Books3 (The Pile) — 196K books from Bibliotik shadow library. Taken down.
BooksCorpus — 7K self-published books from Smashwords. Taken down for TOS violation.

Data secrecy dynamics

Competitive advantage + copyright liability = frontier labs disclose almost nothing about training data. Architecture and training algorithms are published; data pipelines are not.

Exceptions: Dolma, FineWeb, DCLM, Nemotron-CC are fully open with documented pipelines. The Pile pioneered grassroots open data curation.

Benchmark contamination

GPT-3 ran fuzzy dedup against benchmarks to check for leakage. Not universally practiced. A growing concern as datasets scale to hundreds of trillions of tokens.
§8 Open Questions & Tensions
Rule-based vs. model-based filtering
Rules avoid bias but classifiers produce better downstream performance. No consensus.
What is "quality"?
Wikipedia-like? Educational? Instruction-like? Each choice biases the model differently.
Over-filtering
FineWeb-Edu and DCLM remove ~90% of data. Nemotron-CC's synthetic rephrasing is one response.
Synthetic data in the pipeline
Nemotron-CC uses LM rephrasing. How far can synthetic augmentation of pre-training data go?
Data poisoning
Wikipedia dump timing exploitable. CC seed URL injection. No robust defenses at scale.
Copyright liability
Fair use for ML training legally unsettled. Major lawsuits in progress.
Benchmark contamination
Fuzzy dedup against benchmarks needed but not universal. Gets harder as dataset scale grows.
Pipeline is fundamentally heuristic
"Many opportunities to improve." Data curation scales with human effort, not compute.