General framework
Given target data T (small, high quality) and raw data R (large, noisy), find a subset T' ⊂ R that is similar to T. The method must generalize from T and run extremely fast over a huge R.
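All three methods below fit one selection interface: define a per-document score, then threshold it. A minimal sketch (the `select` helper and its signature are illustrative, not from any library):

```python
from typing import Callable, List

def select(raw: List[str], score: Callable[[str], float],
           threshold: float) -> List[str]:
    """Keep every x in R whose score meets the threshold.

    Each method differs only in how `score` is defined:
    KenLM uses (negated) perplexity under a target-trained LM,
    fastText a classifier probability p(T|x),
    DSIR an importance weight p_T(x)/p_R(x).
    """
    return [x for x in raw if score(x) >= threshold]
```

For a perplexity filter, pass `score = lambda x: -ppl(x)` so that "lower perplexity" becomes "higher score".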
1a. KenLM (n-gram language model)
CCNet (Wenzek et al., 2019)
Kneser-Ney smoothed n-gram model. Extremely simple and fast.
Generative approach: score(x) = p_T(x). Sort raw documents by perplexity under the target-trained LM and keep the lowest-perplexity fraction.
score(x) = perplexity_T(x) = exp(−(1/n) ⋅ ∑_i log p(w_i | w_{i−k} ... w_{i−1}))
CCNet: KenLM trained on Wikipedia; keep the third of paragraphs with the lowest perplexity.
OpenMathText: KenLM trained on ProofPile, perplexity threshold < 15,000 → 14.7B math tokens. A 1.4B model trained on this beat models trained on 20× more data.
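Real pipelines use the kenlm library with a Kneser-Ney model; the sketch below substitutes a toy add-alpha bigram LM (all function names are illustrative) just to show the generative score-then-threshold loop:

```python
import math
from collections import Counter

def train_bigram_lm(corpus, alpha=0.1):
    """Toy add-alpha bigram LM standing in for KenLM's Kneser-Ney model."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for doc in corpus:
        toks = ["<s>"] + doc.lower().split()
        vocab.update(toks)
        unigrams.update(toks[:-1])                 # context counts
        bigrams.update(zip(toks[:-1], toks[1:]))
    vsize = len(vocab) + 1                         # +1 bin for unseen words
    def logprob(prev, word):
        return math.log((bigrams[(prev, word)] + alpha)
                        / (unigrams[prev] + alpha * vsize))
    return logprob

def perplexity(logprob, doc):
    """exp(-(1/n) * sum of token log-probs), as in the formula above."""
    toks = ["<s>"] + doc.lower().split()
    lp = sum(logprob(p, w) for p, w in zip(toks[:-1], toks[1:]))
    return math.exp(-lp / max(1, len(toks) - 1))

def ccnet_filter(target, raw, keep_frac=1/3):
    """CCNet-style selection: train the LM on the target corpus,
    keep the lowest-perplexity fraction of the raw corpus."""
    lm = train_bigram_lm(target)
    ranked = sorted(raw, key=lambda d: perplexity(lm, d))
    return ranked[: max(1, int(len(ranked) * keep_frac))]
```

Documents resembling the target share its n-gram statistics, so they score low perplexity and survive the cut; off-distribution text hits only smoothed (near-uniform) probabilities and is discarded.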
1b. fastText classifier
Joulin et al. (2016) — Bag of Tricks for Efficient Text Classification
Bag of n-gram embeddings + linear head; the hashing trick (10M bins) handles the unbounded n-gram vocabulary. Trained with asynchronous SGD.
score(x) = p(T | x) — discriminative approach; keep x if score ≥ threshold.
Orders of magnitude faster than BERT- or LLM-based classifiers.
DCLM showed that a fastText quality classifier outperforms all rule-based filtering methods.
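In practice one would call the real library (`fasttext.train_supervised` on `__label__`-prefixed lines). The self-contained sketch below shows the core idea — hashed bag-of-n-grams features with a logistic head trained by SGD — using plain per-bin weights instead of the real model's learned embeddings; the class and function names are illustrative:

```python
import math
import random
from collections import defaultdict

def ngram_bins(text, num_bins):
    """Hash word unigrams and bigrams into fixed bins (the hashing trick).
    Note: Python salts str hashes per process; consistent within one run."""
    toks = text.lower().split()
    grams = toks + [a + " " + b for a, b in zip(toks, toks[1:])]
    return [hash(g) % num_bins for g in grams]

class TinyFastText:
    """fastText-flavored binary quality classifier (sketch, not the library)."""
    def __init__(self, num_bins=2**20, lr=0.5, epochs=50):
        self.num_bins, self.lr, self.epochs = num_bins, lr, epochs
        self.w, self.b = defaultdict(float), 0.0

    def _proba(self, bins):
        # Mean of bin weights + bias, squashed to a probability.
        z = self.b + sum(self.w[i] for i in bins) / max(1, len(bins))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, texts, labels):
        data = [(ngram_bins(t, self.num_bins), y)
                for t, y in zip(texts, labels)]
        for _ in range(self.epochs):
            random.shuffle(data)
            for bins, y in data:          # SGD on the logistic loss
                err = y - self._proba(bins)
                self.b += self.lr * err
                for i in bins:
                    self.w[i] += self.lr * err / max(1, len(bins))
        return self

    def predict_proba(self, text):        # score(x) = p(T | x)
        return self._proba(ngram_bins(text, self.num_bins))
```

The sparse `defaultdict` weight vector is what makes this cheap: scoring a document touches only its own n-gram bins, never the full 10M-bin table.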
1c. DSIR (importance resampling)
Xie et al. (2023) — Data Selection for Language Models via Importance Resampling
Fit bag-of-hashed-n-gram distributions to both the target and the raw data, then resample raw examples proportionally to their importance weights.
score(x) = p_T(x) / p_R(x) — the importance weight; resample proportionally to it.
More principled than heuristic classification (resampling matches the target distribution, so it captures its diversity rather than just its mode). Slightly better than fastText on GLUE at similar compute.
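A sketch of the resampling step, assuming add-alpha-smoothed hashed-n-gram models for p_T and p_R and using the Gumbel top-k trick to sample without replacement in proportion to the weights (function names and the smoothing constant are illustrative, not from the DSIR codebase):

```python
import math
import random
from collections import Counter

def hashed_ngram_counts(texts, num_bins):
    """Bag-of-hashed-n-gram counts (word uni- and bigrams)."""
    counts = Counter()
    for t in texts:
        toks = t.lower().split()
        grams = toks + [a + " " + b for a, b in zip(toks, toks[1:])]
        counts.update(hash(g) % num_bins for g in grams)
    return counts

def log_weight(text, c_t, c_r, num_bins, alpha=1.0):
    """log p_T(x)/p_R(x) under add-alpha-smoothed hashed-n-gram models."""
    n_t, n_r = sum(c_t.values()), sum(c_r.values())
    toks = text.lower().split()
    grams = toks + [a + " " + b for a, b in zip(toks, toks[1:])]
    lw = 0.0
    for g in grams:
        b = hash(g) % num_bins
        lw += math.log((c_t[b] + alpha) / (n_t + alpha * num_bins))
        lw -= math.log((c_r[b] + alpha) / (n_r + alpha * num_bins))
    return lw

def dsir_select(target, raw, k, num_bins=10_000):
    """Sample k raw docs without replacement, with probability proportional
    to their importance weights, via the Gumbel top-k trick."""
    c_t = hashed_ngram_counts(target, num_bins)
    c_r = hashed_ngram_counts(raw, num_bins)
    gumbel = lambda: -math.log(-math.log(random.random()))
    keyed = [(log_weight(d, c_t, c_r, num_bins) + gumbel(), d) for d in raw]
    return [d for _, d in sorted(keyed, reverse=True)[:k]]
```

Unlike hard thresholding, the Gumbel noise lets lower-weight documents occasionally win, which is what preserves diversity in the selected set.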
Comparison

| Method    | Approach                | Speed     | Quality signal         | Used by              |
|-----------|-------------------------|-----------|------------------------|----------------------|
| KenLM     | Generative: p(x)        | Very fast | Proximity to reference | CCNet, OpenMathText  |
| fastText  | Discriminative: p(T\|x) | Very fast | Binary quality label   | DCLM, GPT-3, LLaMA   |
| DSIR      | Importance: p_T/p_R     | Fast      | Distribution matching  | Research datasets    |
| LLM judge | Prompted scoring        | Slow      | Rich semantic signal   | phi-1 (GPT-4 labels) |