Data Selection for LLM Fine-Tuning LIMA · LESS
Two-paper synthesis · The hypothesis & the algorithm

Less, but which less?

LIMA argued that a thousand carefully chosen examples can rival a million sloppy ones. LESS gave us the math to find those examples without curating by hand. Taken together, they give us a complete recipe for data-efficient fine-tuning.

The Superficial Alignment Hypothesis

The starting point of LIMA is a precise empirical claim, which Zhou et al. call the Superficial Alignment Hypothesis:

A model's knowledge and capabilities are learned almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.

If this is right, then instruction tuning is not imparting knowledge — it is selecting a style over knowledge the base model already has. Style is a much lower-dimensional object than knowledge. We should therefore need orders of magnitude less data to learn it, provided the data we use is consistent and diverse in the right ways.

LIMA tests the hypothesis with the strongest possible counterfactual: fine-tune a 65B LLaMA on 1,000 examples with no RLHF and no preference modeling, using only a supervised loss. The result on a controlled human study:

1,000: total training examples (≈750k tokens)
43%: LIMA responses that match or beat GPT-4
58%: match or beat Bard
65%: match or beat DaVinci-003 (RLHF-trained)

The model has no exposure to reinforcement learning, preference modeling, or any feedback loop. The only signal it receives is what 1,000 well-chosen demonstrations look like. The question this raises — and the one we are really interested in — is: what makes those 1,000 examples good?

LIMA's curation, decoded

LIMA's actual selection procedure is more disciplined than "carefully curated" suggests. Reading the paper closely, the criteria fall into three buckets — and these buckets recur in every subsequent paper on data selection.

The three pillars of LIMA's curation

QUALITY (≈ proxy: high engagement)
Stack Exchange: score ≥ 10, top-voted answer only
wikiHow: curated articles, consistent style
Reddit: manually filtered (top-upvoted from r/AskReddit + r/WritingPrompts)

DIVERSITY (≈ stratified sampling)
Stack Exchange: 75 STEM + 99 other exchanges → 200 each
wikiHow: 19 categories, stratified sample
Manual (250 ex): authors deliberately cover task types not in the forum data

STYLE UNIFORMITY (≈ enforce target distribution)
Filter responses: remove HTML, code errors, links, images
Length window: 1200–4096 chars (typical AI-assistant response length)
Tone: first-person, helpful, no "I don't know"
Figure 1 LIMA's three implicit criteria. Each one approximates a measurable property: quality ≈ a community-upvote proxy, diversity ≈ stratified sampling across topic categories, style uniformity ≈ filtering to a target response distribution.

The diversity ablation is the punchline

The paper's most actionable result is a small table comparing fine-tunes on different mixes. The version trained on Stack Exchange with category-stratified sampling beats the version trained on the same number of unstratified Stack Exchange examples by a wide margin — even though both come from the same source. Conversely, doubling the data within a single source gives almost no gain.

The two LIMA findings that matter for our problem

(1) Diversity dominates raw count. Going from 2,000 stratified to 4,000 stratified Stack Exchange examples does not meaningfully improve LIMA. Adding 250 manually-written examples from different task types does.

(2) Style consistency is non-negotiable. Mixing high-quality but stylistically heterogeneous sources (e.g. raw Reddit) hurts. The model learns to imitate whatever target it sees most coherently, so the target must be coherent.

This is the empirical foundation. But LIMA does not give us a statistic we can compute on an arbitrary dataset — the curation was substantially manual. The work that followed asks: can we replace each of these three buckets with a number?

What does "good" mean, quantitatively?

Post-LIMA, a small zoo of statistics emerged for scoring instruction examples. Each one tries to capture one of LIMA's three pillars with something we can compute in a forward pass. Before getting to LESS — the most principled of these — it is worth laying out the landscape, because the right metric depends on the goal.

| Statistic | What it measures | Cost | Captures |
|---|---|---|---|
| Perplexity (PPL) | −log pθ(y ∣ x) | 1 forward pass | Fluency / in-distribution-ness |
| IFD (Cherry) | Conditional vs. unconditional ratio | 2 forward passes | How much the instruction helps predict the answer |
| SuperFiltering | IFD scored by a small proxy (GPT-2) | 2 forward passes, small model | Same as IFD, ~10× cheaper |
| AlpaGasus | LLM-as-judge rating per example (ChatGPT in the original paper) | 1 API call per example | Quality (human-aligned) |
| DEITA | complexity × quality, then diversity filter | 2 model calls + embeddings | All three LIMA pillars |
| LESS | Gradient cosine similarity to a target task | LoRA warmup + datastore | Influence on a specific downstream task |

Instruction-Following Difficulty (IFD)

IFD is the cleanest and cheapest statistic with a coherent theory. For an instruction–response pair (x, y), we define two perplexities of the response — one when the model sees the instruction, one when it does not.

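$$\mathrm{PPL}(y \mid x) = \exp\left( -\frac{1}{|y|} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x) \right)$$

$$\mathrm{PPL}(y) = \exp\left( -\frac{1}{|y|} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}) \right)$$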

The first quantity, PPL(y | x), is how surprised the model is by the response when conditioned on the instruction. The second, PPL(y), is how surprised it is when the response is dropped in cold. The IFD score is the ratio:

$$\mathrm{IFD}(x, y) = \frac{\mathrm{PPL}(y \mid x)}{\mathrm{PPL}(y)}$$
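As a minimal sketch of the two perplexities and their ratio, the snippet below assumes we already have per-token log-probabilities of the response from two forward passes (one with the instruction in context, one without). The function names and the example numbers are illustrative, not taken from any of the papers.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the response tokens."""
    if not token_logprobs:
        raise ValueError("need at least one response token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def ifd(cond_logprobs, uncond_logprobs):
    """IFD(x, y) = PPL(y | x) / PPL(y), both over the same response tokens."""
    if len(cond_logprobs) != len(uncond_logprobs):
        raise ValueError("both passes must score the same response tokens")
    return perplexity(cond_logprobs) / perplexity(uncond_logprobs)

# Hypothetical log-probs for a 4-token response.
cond = [-0.5, -0.4, -0.6, -0.5]    # log p(y_t | y_<t, x), instruction in context
uncond = [-2.0, -1.5, -2.5, -2.0]  # log p(y_t | y_<t), response scored alone
score = ifd(cond, uncond)          # exp(0.5) / exp(2.0) = exp(-1.5), about 0.22
```

Because both perplexities are exponentials of mean log-probabilities, the ratio is the exponential of the difference in mean negative log-likelihood, so it can be computed stably in log space for long responses.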

The ratio compresses three different failure modes into one number. The three regimes deserve to be unpacked separately, because the selection rule depends on understanding why each tail is bad.

Regime 1 — IFD ≈ 1: the instruction adds no information

When the ratio approaches 1, the conditional perplexity equals the unconditional perplexity: knowing x does not help the model predict y. There are two ways this happens in practice. Either the response is generic — "yes", "thanks for asking", "I'll get back to you", boilerplate that any model can produce regardless of context. Or the instruction is malformed, off-topic, or noisy — the response is unrelated to the prompt, so conditioning on the prompt is irrelevant. In both cases the pair teaches the model nothing useful: there is no instruction → response mapping to learn, because the mapping isn't a function of the instruction at all. These examples should be dropped.

Regime 2 — IFD ≪ 1: the instruction is informative but the pair is already easy

A very low ratio means PPL(y | x) is much smaller than PPL(y) — the instruction massively reduces the model's uncertainty about the response. That sounds desirable, and in a vacuum it is. The problem is that the model is already assigning high probability to the right answer. It does not need this example to learn the mapping; it has already learned it during pretraining. Training on these examples burns gradient steps reinforcing a behavior the model already exhibits. They are not harmful, but they are wasted slots in the training budget.

Regime 3 — IFD in the middle: the instruction helps but the answer is non-trivial

This is where learning happens. The unconditional perplexity is high (the response would be surprising in isolation), but conditioning on the instruction makes it noticeably more predictable. The ratio is below 1 — enough that the instruction is doing real work — but not so far below that the model already nails it. These are the examples where a gradient step actually shifts the model's behavior in a useful direction. The selection rule is therefore not "highest IFD" or "lowest IFD" but "highest IFD among examples below a threshold" — discard the top tail (IFD ≈ 1) entirely, then take top-k of the rest.
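The selection rule above can be sketched in a few lines. This assumes IFD scores have already been computed for every example; the cutoff `upper` and the budget `k` are free parameters chosen here for illustration.

```python
def select_by_ifd(scores, k, upper=1.0):
    """Drop examples with IFD >= upper, then keep the k highest that remain."""
    kept = [(s, i) for i, s in enumerate(scores) if s < upper]
    kept.sort(reverse=True)  # highest IFD first among the survivors
    return [i for _, i in kept[:k]]

# Hypothetical IFD scores for seven examples.
scores = [0.22, 0.97, 1.04, 0.81, 0.65, 0.10, 1.00]
chosen = select_by_ifd(scores, k=3)               # indices [1, 3, 4]
stricter = select_by_ifd(scores, k=3, upper=0.9)  # indices [3, 4, 0]
```

Lowering `upper` trades recall for safety: it removes more of the near-1 tail, at the cost of discarding some hard but legitimate examples.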

The remarkable empirical finding from SuperFiltering: IFD rankings are consistent across model scales. The Spearman rank correlation between GPT-2's IFD scores and LLaMA-2-7B's IFD scores on the Alpaca dataset is 0.846. This means we can use a tiny model — even GPT-2 — to score a million examples in an hour, and the ranking will hold up for selecting data to train a 70B model. The difficulty signal lives in the data, not in the scorer.

[Figure: IFD distribution, which examples to keep. x-axis: IFD score (low → high; ticks at 0.3, 0.7, 1.0+); y-axis: count of examples. Left region, DISCARD: IFD ≪ 1, "too easy", the model already solves these. Middle region, KEEP (top-k by IFD): informative instructions with non-trivial mappings; these are what the model actually learns from. Right region, DISCARD: IFD ≈ 1 or noise, "unlearnable", the instruction provides no useful signal.]
Figure 2 A typical IFD histogram on an instruction corpus, mapped to the three regimes above. The left tail (Regime 2, IFD ≪ 1) wastes training budget on examples the model already solves. The right tail (Regime 1, IFD ≈ 1) is dominated by generic or noisy pairs the model cannot learn from. The middle (Regime 3) is where useful learning lives.

DEITA: combining all three pillars

DEITA (Liu et al., ICLR 2024) was the first method to operationalize LIMA's three pillars as separable scores we can compute on every example:

$$s(x, y) = \underbrace{c(x, y)}_{\text{complexity}} \times \underbrace{q(x, y)}_{\text{quality}}, \quad \text{then apply the REPR FILTER for diversity}$$

Here c and q come from small LLaMA-based scorer models trained on GPT-4 ratings (open-sourced). The REPR FILTER is iterative: sort the pool by s descending, then greedily add the next example only if its embedding is not too close to anything already selected, that is, only if its cosine similarity to the nearest already-selected example stays below a threshold τ. With τ = 0.9 DEITA matches Vicuna-style baselines on MT-bench using 6K examples from a pool of 300K.
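A greedy diversity filter of this kind can be sketched with NumPy. The sketch treats τ as a cap on cosine similarity to the nearest already-selected example (an assumption about the convention, chosen so that near-duplicates are rejected), and it assumes all embeddings are non-zero vectors. The function name and toy data are illustrative.

```python
import numpy as np

def repr_filter(scores, embeddings, tau=0.9, budget=None):
    """Greedy pass in descending score order; skip near-duplicates of picks."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    selected = []
    for i in order:
        if budget is not None and len(selected) >= budget:
            break
        if selected:
            # Cosine similarity to the nearest already-selected example.
            nearest = float(np.max(emb[selected] @ emb[i]))
            if nearest >= tau:
                continue
        selected.append(int(i))
    return selected

# Toy pool: example 1 is a near-duplicate of example 0.
scores = [5.0, 4.0, 3.0, 2.0]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
picked = repr_filter(scores, embeddings, tau=0.9)  # [0, 2, 3]
```

The loop is quadratic in the number of selected examples, which is acceptable for budgets of a few thousand; larger budgets would call for an approximate nearest-neighbor index.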

The three scores map roughly onto LIMA's three pillars: q ≈ quality, REPR FILTER ≈ diversity, and c (instruction complexity) is the closest counterpart to the style/format pillar. The contribution is that these are now numbers, not editorial judgments.

LESS: data selection as influence estimation

IFD and DEITA are generic quality signals — they tell us which examples are good in the abstract. But the question we usually face is sharper: we want to improve our model on task X; which subset of our data should we train on? That is the question LESS (Xia et al., ICML 2024) solves directly.

The core idea is borrowed from classical influence functions (Koh & Liang 2017), adapted for LLM fine-tuning. For a training example z = (x, y) and a validation example z* = (x*, y*), the influence of training on z on the loss at z* is, to first order:

Influence (SGD)
$$\mathrm{Inf}_{\mathrm{SGD}}(z, z^*) \approx -\eta \, \langle \nabla\ell(z^*; \theta), \nabla\ell(z; \theta) \rangle$$

In words: if the gradient of the loss on z points in the same direction as the gradient of the loss on z*, training on z will reduce z*'s loss. The sign and magnitude are given by the inner product of gradients at the current parameters θ.
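The first-order claim can be checked numerically on a toy problem. The sketch below uses a linear model with squared-error loss (an illustrative stand-in, not an LLM): it takes one SGD step on z and compares the actual change in loss on z* with the inner-product prediction.

```python
import numpy as np

def loss(theta, x, y):
    """Squared error of a linear model on one example."""
    return 0.5 * (theta @ x - y) ** 2

def grad(theta, x, y):
    """Gradient of the squared error with respect to theta."""
    return (theta @ x - y) * x

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
z = (rng.normal(size=5), 1.0)        # training example (x, y)
z_star = (rng.normal(size=5), -0.5)  # validation example (x*, y*)
eta = 1e-3

# First-order prediction: -eta * <grad l(z*), grad l(z)>
predicted = -eta * (grad(theta, *z_star) @ grad(theta, *z))

# Actual change in validation loss after one SGD step on z.
theta_new = theta - eta * grad(theta, *z)
actual = loss(theta_new, *z_star) - loss(theta, *z_star)

# For this quadratic loss the gap is exactly the second-order term.
gap = actual - predicted
```

For this loss the gap equals 0.5 · η² · (x* · ∇ℓ(z))², so it shrinks quadratically as the learning rate decreases, which is what "to first order" means here.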

This formula is theoretically clean but practically broken for modern LLM fine-tuning. LESS identifies three specific problems and fixes each one.

Problem 1: Adam, not SGD

Fine-tuning uses Adam, which adaptively rescales each coordinate by an estimate of its squared gradient. The effective update direction is not ∇ℓ but Γ(·; θ) := m / (√v + ε), the Adam update. LESS substitutes this directly:

Influence (Adam)
$$\mathrm{Inf}_{\mathrm{Adam}}(z, z^*) \approx \sum_{i=1}^{N} \bar{\eta}_i \, \cos\big( \nabla\ell(z^*; \theta_i), \Gamma(z; \theta_i) \big)$$

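Three modifications relative to the SGD formula, each motivated by a specific failure mode: (1) the Adam update direction Γ(z; θi) replaces the raw training gradient, because that is the step the optimizer actually takes; (2) the score is summed over N checkpoints θ1, …, θN, weighted by the average learning rate η̄i at each one, because a single checkpoint does not represent the whole trajectory; (3) cosine similarity replaces the inner product, because raw inner products favor examples with large gradient norms, which in practice tend to be short sequences. The leading minus sign is also dropped, so a larger score means a larger expected reduction in the target loss.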
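A minimal sketch of the Adam-adjusted score follows. It assumes the standard Adam update with bias correction and default hyperparameters; how the optimizer moments are saved and restored in a real run is outside this sketch, and all names are illustrative.

```python
import numpy as np

def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Gamma(z; theta): the Adam step direction if z's gradient came next."""
    m_new = beta1 * m + (1.0 - beta1) * grad
    v_new = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m_new / (1.0 - beta1 ** step)  # bias-corrected first moment
    v_hat = v_new / (1.0 - beta2 ** step)  # bias-corrected second moment
    return m_hat / (np.sqrt(v_hat) + eps)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def influence_adam(val_grads, train_grads, opt_states, lrs):
    """Sum over checkpoints of lr_i * cos(grad l(z*; theta_i), Gamma(z; theta_i))."""
    total = 0.0
    for g_val, g_train, (m, v, step), lr in zip(val_grads, train_grads, opt_states, lrs):
        total += lr * cosine(g_val, adam_direction(g_train, m, v, step))
    return total

# One checkpoint with fresh optimizer state: Gamma is then close to sign(grad).
g_train = np.array([0.2, -0.1, 0.4])
g_val = np.array([1.0, -1.0, 1.0])
zeros = np.zeros(3)
gamma = adam_direction(g_train, zeros, zeros, step=1)
score = influence_adam([g_val], [g_train], [(zeros, zeros, 1)], [2e-5])
```

The toy case shows why the substitution matters: with fresh moments Adam discards gradient magnitude coordinate by coordinate, so two examples with very different raw gradients can produce the same update direction.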

Problem 2: gradient vectors are billion-dimensional

For a 7B model, ∇ℓ has 7B entries. Storing one per example is impossible. LESS does two things:

(a) LoRA the model. Train with rank-r LoRA adapters, so the trainable parameter count drops from ~7B to ~100M. All gradients are computed in this smaller space.

(b) Random projection. Apply the Johnson–Lindenstrauss lemma: project the 100M-dim gradient down to a d-dim feature (d = 8192 in the paper) via a fixed random Gaussian matrix P ∈ ℝ^(d × 10^8). JL guarantees that for any two vectors u, v:

$$\Pr\Big[ \big| \langle Pu, Pv \rangle - \langle u, v \rangle \big| > \varepsilon \, \|u\| \, \|v\| \Big] < 2 e^{-c \, d \, \varepsilon^2}$$

So inner products, which are exactly what the influence formula needs, are preserved up to a small additive error that scales with ‖u‖ ‖v‖. Now every training example has a manageable 8192-dim "gradient fingerprint" we can store and search.
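The inner-product preservation is easy to see empirically. The sketch below uses much smaller dimensions than the paper (4000 → 1024, chosen only so it runs quickly) and the usual 1/√d scaling of the Gaussian matrix, which is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 1024  # toy sizes; the text uses roughly 10^8 -> 8192

# Fixed random Gaussian projection, scaled so E[<Pu, Pv>] = <u, v>.
P = rng.normal(size=(d, n)) / np.sqrt(d)

u = rng.normal(size=n)
v = 0.6 * u + 0.8 * rng.normal(size=n)  # correlated with u

exact = float(u @ v)
approx = float((P @ u) @ (P @ v))

# Error measured relative to ||u|| ||v||, matching the bound above.
rel_err = abs(approx - exact) / (np.linalg.norm(u) * np.linalg.norm(v))
```

The typical error scales like 1/√d, so raising d from 1024 to 8192 would shrink it by roughly a factor of 2.8.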

Problem 3: we don't have θi from the full training run

The influence formula requires gradients at the actual training checkpoints, but those are what we are trying to avoid computing. LESS approximates this with a warmup: train on a small random subset (a few % of data) for N epochs with LoRA, checkpoint after each epoch, and compute gradients at those checkpoints. Empirically this captures most of the optimization trajectory's direction without needing the full run.

The four stages of LESS
1 · WARMUP: train LoRA on a random 5% of pool D for N epochs; save checkpoints θ₁, θ₂, …, θ_N and the optimizer states at each θᵢ (offline, one-time).
2 · DATASTORE: for every z ∈ D, compute Γ(z; θᵢ) at each checkpoint and project to ℝ^8192 via a JL random matrix; store |D| × 8192 floats.
3 · SCORE: take few-shot examples D_val of the target task; compute g* = mean ∇ℓ(z*; θᵢ) over D_val; score every z by cos(g*, Γ(z)).
4 · TRAIN: pick the top 5% by score and train M_T on that subset.
Steps 1–2 are offline and amortize across many target tasks. Step 3 takes seconds for a new target task; the datastore is the durable asset. Result: training on 5% of data ≥ training on 100%, on the targeted skill.
Figure 3 LESS pipeline. The expensive parts (warmup + datastore construction) happen once. For any new downstream task, we provide a handful of validation examples, score the datastore by cosine similarity to their mean gradient, and take the top-5%.
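The scoring stage can be sketched against an in-memory datastore. The array layout (checkpoints × examples × features), the learning-rate weights, and the random toy data are assumptions of this sketch; a real datastore would be loaded from disk.

```python
import numpy as np

def less_scores(datastore, val_features, lrs):
    """datastore: (n_ckpt, n_examples, d); val_features: (n_ckpt, d); lrs: (n_ckpt,)."""
    store = datastore / np.linalg.norm(datastore, axis=2, keepdims=True)
    val = val_features / np.linalg.norm(val_features, axis=1, keepdims=True)
    per_ckpt = np.einsum("cnd,cd->cn", store, val)  # cosine at each checkpoint
    return np.asarray(lrs) @ per_ckpt               # lr-weighted sum over checkpoints

def top_fraction(scores, fraction=0.05):
    """Indices of the highest-scoring fraction of examples, best first."""
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(-scores, kind="stable")[:k]

rng = np.random.default_rng(0)
n_ckpt, n_examples, d = 4, 200, 64
val_features = rng.normal(size=(n_ckpt, d))      # mean projected validation gradients
datastore = rng.normal(size=(n_ckpt, n_examples, d))
datastore[:, 7, :] = 3.0 * val_features          # plant one perfectly aligned example
lrs = np.array([2e-5, 1.5e-5, 1e-5, 5e-6])       # hypothetical per-checkpoint weights

scores = less_scores(datastore, val_features, lrs)
chosen = top_fraction(scores, 0.05)
```

Because the per-example features are precomputed, scoring a new target task only requires new `val_features`; the datastore array is reused unchanged.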

The transferability result

The paper's most useful empirical finding: gradient features built with a small model transfer to selecting data for a large model. Specifically, a gradient datastore built with LLaMA-2-7B selects data that, when used to fine-tune LLaMA-2-13B or Mistral-7B, gives downstream performance close to that of a datastore built natively with the target model. This means the per-example gradient cost only has to be paid once, at small scale.

On MMLU, TyDiQA, and BBH, training on the LESS-selected 5% often matches or beats training on the full dataset, while random 5% subsets are worse. So the selection signal is real, not just a regularization effect from having less data.

Choosing between methods: a decision tree

The methods above are not interchangeable. Each one optimizes for a different setting. Here is how we pick.

Which selection method should we use?

Do we have a specific downstream target task?
  NO (general chatbot): do we have compute for a scorer model + embeddings?
    No → SuperFiltering: GPT-2 IFD scoring, keep top-k (~1 hour, ~CPU)
    Yes → DEITA: c × q + REPR filter; all three pillars; SOTA for general
  YES (specific skill): do we have a few labeled examples of the target task?
    No → DEITA: general quality filter, with topic tag oversampling
    Yes → LESS: gradient similarity to the validation set; 5% beats 100%

The hybrid recipe most teams actually use
1. SuperFiltering (cheap) to drop the bottom 30% of noisy data
2. DEITA quality × complexity scores to rank the rest
3. Diversity filter (embedding cosine threshold) on the top-ranked examples
→ Final set: typically 5–10K examples from millions
Figure 4 A decision tree for picking a method. The split that matters most is whether we have a target task in mind (use LESS) or are training a general assistant (use SuperFiltering or DEITA). The hybrid pipeline at the bottom is what most production teams converge on.
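The same tree can be written as a small function, which is convenient when the choice has to live in a pipeline config. The function and argument names are hypothetical; the branches simply mirror the figure.

```python
def choose_method(has_target_task, has_scorer_compute=False, has_target_examples=False):
    """Mirror of the decision tree: returns the suggested selection method."""
    if has_target_task:
        if has_target_examples:
            return "LESS"  # gradient similarity to a few-shot validation set
        return "DEITA + topic-tag oversampling"  # no labeled target examples
    if has_scorer_compute:
        return "DEITA"  # c * q scores plus the diversity filter
    return "SuperFiltering"  # small-model IFD scoring, keep top-k

general_cheap = choose_method(has_target_task=False)
general_full = choose_method(has_target_task=False, has_scorer_compute=True)
targeted = choose_method(has_target_task=True, has_target_examples=True)
```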

A concrete playbook for our dataset

Suppose we have a pool of, say, 500K instruction–response pairs and want to fine-tune a 7B–13B model. Here is the end-to-end procedure synthesized from the three papers above.

Algorithm · End-to-end data selection
1. Coarse filter (cheap, mandatory). Drop examples where: response length < 50 tokens or > 4096; response is duplicated (exact or near-dup via MinHash); response contains code-tag mismatches, HTML errors, or non-target languages. This usually removes 20–40% of a public dataset.
2. Compute IFD with a small model. We use GPT-2-XL or LLaMA-3.2-1B. For each (x, y) we compute IFD(x, y) = PPL(y | x) / PPL(y). ~1 hour on a single GPU for 500K examples.
3. Keep the informative band. Discard the bottom decile (IFD too low → trivial) and the top decile (IFD ≈ 1 → uninformative or noisy). Keep the middle 80%.
4. Score quality with DEITA's open scorers. Run `hkust-nlp/deita-quality-scorer-llama` to get q ∈ [1, 6] per example. Drop the bottom 25%.
5. Tag for topics. Cluster sentence embeddings (e.g. `all-MiniLM-L6-v2` + k-means with k=50–200) or use InsTag. This gives us the stratification axis from LIMA.
6. If we have a target task → run LESS. Warmup-train LoRA on a random 5%, build the gradient datastore, and score by cosine similarity to a set of 10–50 few-shot validation examples. Take the top 5%.
7. If general assistant → stratified sampling. Within each topic cluster, sort by IFD × q and take the top-N per cluster. This recovers LIMA's per-category cap.
8. Diversity sweep. Iterate over the sorted candidate list; only add an example if its embedding's cosine similarity to the nearest already-selected example is below τ = 0.9 (DEITA's REPR FILTER), so near-duplicates are skipped.
9. Style audit (LIMA's lesson). Spot-check 100 examples by hand for response-style consistency. If sources differ in "voice" (formal vs. casual), the model will learn the mode it sees most. Standardize via a rewriter LLM or filter for one voice.
10. Train and ablate. Compare against random subsets of the same size. If the selected set does not beat random by ≥2 points on the eval, something in the pipeline is broken.
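Steps 3 and 7 of the procedure above are simple enough to sketch in plain Python. The record fields (`id`, `ifd`, `quality`, `cluster`) are hypothetical names for values produced by the earlier steps, and the toy pool is synthetic.

```python
from collections import defaultdict

def informative_band(examples, drop_low=0.1, drop_high=0.1):
    """Step 3: sort by IFD and cut off the bottom and top fractions by count."""
    ranked = sorted(examples, key=lambda ex: ex["ifd"])
    lo = int(drop_low * len(ranked))
    hi = len(ranked) - int(drop_high * len(ranked))
    return ranked[lo:hi]

def stratified_top_n(examples, n_per_cluster):
    """Step 7: within each topic cluster, keep the top-N by IFD * quality."""
    by_cluster = defaultdict(list)
    for ex in examples:
        by_cluster[ex["cluster"]].append(ex)
    selected = []
    for cluster in sorted(by_cluster):
        members = sorted(by_cluster[cluster],
                         key=lambda ex: ex["ifd"] * ex["quality"],
                         reverse=True)
        selected.extend(members[:n_per_cluster])
    return selected

# Synthetic pool: 20 examples, IFD from 0.05 to 1.00, two topic clusters.
pool = [{"id": i, "ifd": 0.05 * i, "quality": 4.0, "cluster": i % 2}
        for i in range(1, 21)]
band = informative_band(pool)                      # drops ids 1, 2, 19, 20
final = stratified_top_n(band, n_per_cluster=2)
final_ids = [ex["id"] for ex in final]             # [18, 16, 17, 15]
```

The per-cluster cap is what prevents a single large topic from filling the whole budget, at the cost of admitting lower-scoring examples from small clusters.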

Sanity checks worth running

The random-subset baseline. We always train on a random subset of the same size and report both numbers. A selection method that does not beat random is not selecting — it is just shrinking the dataset.

The diversity check. After selection, we plot the topic-cluster distribution of the selected set. If 80% of examples come from 5 clusters, the model will overfit to those topics. Re-balance.
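A numeric version of this check is a one-liner over cluster labels. The function name, the toy label list, and the use of 80% / 5 clusters as the alarm threshold (taken from the paragraph above) are illustrative choices.

```python
from collections import Counter

def top_cluster_share(cluster_ids, top_n=5):
    """Fraction of selected examples that fall in the top_n largest clusters."""
    counts = Counter(cluster_ids)
    top = sum(count for _, count in counts.most_common(top_n))
    return top / len(cluster_ids)

# Hypothetical cluster labels for 100 selected examples across 23 clusters.
selected_clusters = [0] * 50 + [1] * 20 + [2] * 10 + list(range(3, 23))
share = top_cluster_share(selected_clusters, top_n=5)  # 0.82
needs_rebalance = share >= 0.8
```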

The length check. Zhao et al. (ICML 2024) found that "select the longest response" is a surprisingly strong baseline — sometimes within 1–2 points of more sophisticated methods. If a selection method does not strongly correlate with this trivial baseline, that is interesting; if it does, the sophistication may not be earning its keep.
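One way to run the length check is a Spearman rank correlation between selection scores and response lengths. The sketch implements it directly with NumPy (average ranks for ties) rather than relying on a statistics library; the five data points are made up.

```python
import numpy as np

def rank(values):
    """Average ranks (1-based): tied values share the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for val in np.unique(values):
        tied = values == val
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

# Hypothetical selection scores and response lengths (in tokens).
selection_scores = [0.91, 0.40, 0.75, 0.62, 0.15]
response_lengths = [820, 210, 640, 700, 90]
rho = spearman(selection_scores, response_lengths)  # 0.9
```

A value this close to 1 would suggest the selection score is mostly tracking length on this sample, which is the situation the paragraph above warns about.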

The deepest finding in this line of work is that the marginal example contributes far less than the average example. Selection is not about finding gold; it is about discarding the dross that drowns it.

What none of these methods do yet

A few honest limitations to set expectations:

The arc from LIMA to LESS is, in retrospect, a story about replacing taste with statistics. LIMA showed that taste suffices; LESS showed that with the right inner product, taste is computable.

LESS in plain language

Pulling the previous sections together into a clean, narrative summary — useful when presenting to an audience that hasn't read the paper.

The LESS algorithm — how it works in practice

Calculating and comparing gradients for billions of parameters across millions of documents is incredibly computationally heavy. Here is how the LESS algorithm makes it efficient.

The four stages of LESS, start to finish
STAGE 1 · WARMUP: train briefly on a tiny random slice. Take a random 5% of pool D (~200K examples), LoRA fine-tune (~100M trainable params) for N epochs, and save each checkpoint (θ₁, θ₂, θ₃, θ₄). These θᵢ approximate the optimization trajectory.
STAGE 2 · GRADIENT DATASTORE: compute the "direction" of every training example and compress it to fit on disk. For each z in D (z₁, …, z₂₀₀ₖ) and each θᵢ, compute the Adam update direction Γ(z; θᵢ) ∈ ℝ^(100M), too large to store, then apply the random projection P to get an 8192-dim vector (Johnson–Lindenstrauss: dot products preserved). Datastore: 200,000 fingerprints of 8192 floats each, ~6 GB on disk, amortized across all tasks.
STAGE 3 · SIMILARITY SEARCH: encode the target task the same way. From ~50 validation examples z* (e.g. math QA), compute P · ∇ℓ(z*; θᵢ) and average over D_val to get one 8192-dim vector ĝ*ᵢ per checkpoint. Score each datastore entry with score(z) = Σᵢ cos(ĝ*ᵢ, Γ(z; θᵢ)); a high score means training on z should reduce loss on the target (first-order Taylor: Δℓ(z*) ≈ −η · ⟨∇ℓ(z*), ∇ℓ(z)⟩). Rank all 200,000 examples.
STAGE 4 · TARGETED TRAINING: keep the top-ranked 5% (~10,000 examples) and fine-tune the target model (e.g. 7B or 13B), starting again from the pretrained weights, on that subset only. Result: 5% beats 100% on the target eval, and the datastore is reusable for the next task.
Figure 5 End-to-end LESS pipeline. Stage 1 produces checkpoints we can take gradients at. Stage 2 turns every training example into a small "fingerprint" of how it would update the model, compressed via JL projection so the entire datastore fits comfortably on disk. Stage 3 encodes the target task the same way and ranks training examples by cosine similarity; by the first-order Taylor expansion, a high cosine indicates that training on the example should reduce the target loss. Stage 4 fine-tunes on just the top 5%. The datastore is the durable asset: build it once with a 7B model, then reuse it for many target tasks and for other downstream models.
The four practical stages
1. Warmup training. The method starts by briefly training the model on a small, random subset of data using LoRA (Low-Rank Adaptation). This gives the model a basic understanding of how to format answers and follow instructions, and produces a handful of checkpoints θ1, …, θN that approximate what the optimization trajectory looks like.
Why this is needed: the influence formula requires gradients at training checkpoints, and without warmup we have none.
2. Build a gradient datastore. The researchers pass all the available training data through the model and calculate their gradients. To save memory, they use a random projection to compress these massive gradient vectors into much smaller, low-dimensional representations.
The math: Johnson–Lindenstrauss says that for a random Gaussian matrix P ∈ ℝ^(8192 × 10^8), dot products between vectors u, v are preserved up to small error: ⟨Pu, Pv⟩ ≈ ⟨u, v⟩ with high probability. Storage drops by ~12000× and all the comparisons we need still work.
3. Similarity search. The researchers take a few examples of the target task (e.g., a few math problems) and calculate their gradients. They then search the compressed datastore for training examples that have the highest cosine similarity (a length-normalized dot product) with the target task.
The math: for training example z and target validation example z*, a one-step Taylor expansion gives the change in loss on z* from training on z as approximately −η · ⟨∇ℓ(z*), ∇ℓ(z)⟩. Aligned gradients → loss on target drops → example is helpful. The score becomes Σi cos(∇ℓ(z*; θi), Γ(z; θi)) summed across checkpoints, with Γ being the Adam-adjusted update direction rather than the raw gradient.
4. Targeted training. Finally, they fine-tune a fresh copy of the pretrained model (not the warmup checkpoint) using only the top-ranked data points (typically the top 5%).
No math here: just a standard fine-tuning run, but on the filtered subset rather than the full pool.
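The storage figures in the steps above follow from simple arithmetic. The sketch assumes float32 storage (4 bytes per value), a 200,000-example pool, and four warmup checkpoints; these specifics are assumptions for illustration.

```python
def datastore_bytes(n_examples, dim=8192, bytes_per_float=4, n_checkpoints=1):
    """Size of a projected-gradient datastore in bytes."""
    return n_examples * dim * bytes_per_float * n_checkpoints

lora_dim = 10 ** 8   # ~100M trainable LoRA parameters
proj_dim = 8192      # projected feature dimension

compression = lora_dim / proj_dim                  # about 12,207x per example
one_ckpt_gb = datastore_bytes(200_000) / 1e9       # about 6.55 GB
four_ckpt_gb = datastore_bytes(200_000, n_checkpoints=4) / 1e9  # about 26.2 GB
unprojected_tb = 200_000 * lora_dim * 4 / 1e12     # 80 TB without projection
```

The last line is the reason the projection is not optional: storing unprojected LoRA gradients for the same pool would take tens of terabytes per checkpoint.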

Why LESS is a breakthrough

Two findings worth highlighting

Less is more. The paper shows that training on just the top 5% of data selected by LESS often gives better target-task performance than training on 100% of the data. One reading is that much of the discarded 95% is not neutral for the target task: it pulls the model in directions that are irrelevant to it.

Transferability. A supercomputer is not required to run LESS. We can use a smaller, cheaper model (LLaMA-2-7B in the paper) to calculate the gradients and select the top 5% of data, then use that subset to fine-tune a larger model (LLaMA-2-13B) or a model from a different family (Mistral-7B), and the selected data remains effective. The selection signal lives largely in the data, not in the scorer model.

LESS picks training examples whose gradient points in the same direction as the gradient of the target task, because to first order those are the examples whose training step will reduce loss on the target.