LIMA argued that a thousand carefully chosen examples can rival a million sloppy ones. LESS gave us the math to find those examples without curating by hand. Taken together, they give us a complete recipe for data-efficient fine-tuning.
The starting point of LIMA is a precise empirical claim, which Zhou et al. call the Superficial Alignment Hypothesis:
If this is right, then instruction tuning is not imparting knowledge — it is selecting a style over knowledge the base model already has. Style is a much lower-dimensional object than knowledge. We should therefore need orders of magnitude less data to learn it, provided the data we use is consistent and diverse in the right ways.
LIMA tests the hypothesis with the strongest possible counterfactual: fine-tune a 65B LLaMA on 1,000 examples, no RLHF, no preference modeling, pure supervised loss. The result on a controlled human study:
The model has no exposure to reinforcement learning, preference modeling, or any feedback loop. The only signal it receives is what 1,000 well-chosen demonstrations look like. The question this raises — and the one we are really interested in — is: what makes those 1,000 examples good?
LIMA's actual selection procedure is more disciplined than "carefully curated" suggests. Reading the paper closely, the criteria fall into three buckets — and these buckets recur in every subsequent paper on data selection.
The paper's most actionable result is a small table comparing fine-tunes on different mixes. The version trained on Stack Exchange with category-stratified sampling beats the version trained on the same number of unstratified Stack Exchange examples by a wide margin — even though both come from the same source. Conversely, doubling the data within a single source gives almost no gain.
(1) Diversity dominates raw count. Going from 2,000 stratified to 4,000 stratified Stack Exchange examples does not meaningfully improve LIMA. Adding 250 manually-written examples from different task types does.
(2) Style consistency is non-negotiable. Mixing high-quality but stylistically heterogeneous sources (e.g. raw Reddit) hurts. The model learns to imitate whatever target it sees most coherently, so the target must be coherent.
This is the empirical foundation. But LIMA does not give us a statistic we can compute on an arbitrary dataset — the curation was substantially manual. The work that followed asks: can we replace each of these three buckets with a number?
Post-LIMA, a small zoo of statistics emerged for scoring instruction examples. Each one tries to capture one of LIMA's three pillars with something we can compute in a forward pass. Before getting to LESS — the most principled of these — it is worth laying out the landscape, because the right metric depends on the goal.
| Statistic | What it measures | Cost | Captures |
|---|---|---|---|
| Perplexity (PPL) | exp(−mean log pθ(y \| x)) over response tokens | 1 forward pass | Fluency / in-distribution-ness |
| IFD (Cherry) | Conditional vs. unconditional ratio | 2 forward passes | How much the instruction helps predict the answer |
| SuperFiltering | IFD scored by a small proxy (GPT-2) | 2 forward passes, small model | Same as IFD, ~10× cheaper |
| Alpagasus | GPT-4-as-judge rating 1–5 | 1 API call per ex. | Quality (human-aligned) |
| DEITA | complexity × quality, then diversity filter | 2 model calls + embeddings | All three LIMA pillars |
| LESS | Gradient cosine similarity to a target task | LoRA warmup + datastore | Influence on a specific downstream task |
IFD is the cleanest and cheapest statistic with a coherent theory. For an instruction–response pair (x, y), we define two perplexities of the response — one when the model sees the instruction, one when it does not.
The first quantity, PPL(y | x), is how surprised the model is by the response when conditioned on the instruction. The second, PPL(y), is how surprised it is when the response is dropped in cold. The IFD score is the ratio:
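Written out, and assuming the usual convention that perplexity is the exponentiated mean negative log-likelihood over the T response tokens (the notation below is an assumption chosen to match the prose, not copied from the paper):

```latex
\mathrm{PPL}(y \mid x) = \exp\!\Big(-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(y_t \mid x,\, y_{<t})\Big),
\qquad
\mathrm{PPL}(y) = \exp\!\Big(-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(y_t \mid y_{<t})\Big)

\mathrm{IFD}(x, y) = \frac{\mathrm{PPL}(y \mid x)}{\mathrm{PPL}(y)}
```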
The ratio compresses three different failure modes into one number. The three regimes deserve to be unpacked separately, because the selection rule depends on understanding why each tail is bad.
When the ratio approaches 1, the conditional perplexity equals the unconditional perplexity: knowing x does not help the model predict y. There are two ways this happens in practice. Either the response is generic — "yes", "thanks for asking", "I'll get back to you", boilerplate that any model can produce regardless of context. Or the instruction is malformed, off-topic, or noisy — the response is unrelated to the prompt, so conditioning on the prompt is irrelevant. In both cases the pair teaches the model nothing useful: there is no instruction → response mapping to learn, because the mapping isn't a function of the instruction at all. These examples should be dropped.
A very low ratio means PPL(y | x) is much smaller than PPL(y) — the instruction massively reduces the model's uncertainty about the response. That sounds desirable, and in a vacuum it is. The problem is that the model is already assigning high probability to the right answer. It does not need this example to learn the mapping; it has already learned it during pretraining. Training on these examples burns gradient steps reinforcing a behavior the model already exhibits. They are not harmful, but they are wasted slots in the training budget.
This is where learning happens. The unconditional perplexity is high (the response would be surprising in isolation), but conditioning on the instruction makes it noticeably more predictable. The ratio is below 1 — enough that the instruction is doing real work — but not so far below that the model already nails it. These are the examples where a gradient step actually shifts the model's behavior in a useful direction. The selection rule is therefore not "highest IFD" or "lowest IFD" but "highest IFD among examples below a threshold" — discard the top tail (IFD ≈ 1) entirely, then take top-k of the rest.
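A minimal sketch of that selection rule, assuming we already have per-token log-probabilities of each response with and without the instruction. The function names and the toy numbers are hypothetical, chosen only to show the three regimes.

```python
import math

def perplexity(token_logprobs):
    """Exponentiated mean negative log-likelihood of the response tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def ifd_score(cond_logprobs, uncond_logprobs):
    """IFD = PPL(y | x) / PPL(y) for the same response tokens."""
    return perplexity(cond_logprobs) / perplexity(uncond_logprobs)

def select_by_ifd(scores, k, threshold=1.0):
    """Drop examples with IFD >= threshold, then keep the k highest of the rest."""
    kept = [(s, i) for i, s in enumerate(scores) if s < threshold]
    kept.sort(reverse=True)
    return [i for _, i in kept[:k]]

# Toy log-probs (hypothetical, natural log): (conditional, unconditional).
examples = {
    "generic":   ([-2.0, -2.0], [-2.0, -2.0]),  # instruction does not help, IFD = 1
    "easy":      ([-0.1, -0.1], [-3.0, -3.0]),  # model already nails it, IFD very low
    "learnable": ([-1.5, -1.5], [-2.0, -2.0]),  # instruction helps, room left to learn
}
names = list(examples)
scores = [ifd_score(*examples[n]) for n in names]
chosen = [names[i] for i in select_by_ifd(scores, k=1)]
print(dict(zip(names, scores)), chosen)
```

The threshold of 1.0 is the natural cut-off for "the instruction does not help"; in practice it may be worth tuning it slightly below 1.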
The remarkable empirical finding from SuperFiltering: IFD rankings are consistent across model scales. The Spearman rank correlation between GPT-2's IFD scores and LLaMA-2-7B's IFD scores on the Alpaca dataset is 0.846. This means we can use a tiny model, even GPT-2, to score a large pool cheaply, and the ranking will largely hold up when selecting data for a much larger model. The difficulty signal lives mostly in the data, not in the scorer.
DEITA (Liu et al., ICLR 2024) was the first method to operationalize LIMA's three pillars as separable scores we can compute on every example:
Here c and q come from small LLaMA-based scorer models trained on GPT-4 ratings (open-sourced). The REPR FILTER is iterative: sort the pool by s descending, then greedily add the next example only if its embedding's cosine similarity to the nearest already-selected example is below a threshold τ (equivalently, its cosine distance exceeds 1 − τ). With τ = 0.9, DEITA matches Vicuna-style baselines on MT-bench using 6K examples from a pool of 300K.
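The greedy filter can be sketched in a few lines. This is an illustration rather than DEITA's released code: it expresses the rule in terms of cosine similarity (a candidate is kept only if its similarity to every selected example stays below τ), and the function names, scores, and 2-d embeddings are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def repr_filter(scores, embeddings, budget, tau=0.9):
    """Greedy, score-first selection that skips near-duplicates.

    An example is added only if its cosine similarity to every
    already-selected example is below tau.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) >= budget:
            break
        if all(cosine_similarity(embeddings[i], embeddings[j]) < tau
               for j in selected):
            selected.append(i)
    return selected

scores = [0.9, 0.8, 0.7, 0.6]                       # s = c * q per example
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
picked = repr_filter(scores, embeddings, budget=3)
print(picked)  # example 1 is skipped as a near-duplicate of example 0
```

The inner loop is quadratic in the selected set, which is fine for a few thousand examples; a nearest-neighbor index would be the natural replacement at larger budgets.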
The three scores map exactly onto LIMA's three pillars: q ≈ quality, c ≈ style/format complexity, REPR FILTER ≈ diversity. The contribution is that these are now numbers, not editorial judgments.
IFD and DEITA are generic quality signals — they tell us which examples are good in the abstract. But the question we usually face is sharper: we want to improve our model on task X; which subset of our data should we train on? That is the question LESS (Xia et al., ICML 2024) solves directly.
The core idea is borrowed from classical influence functions (Koh & Liang 2017), adapted for LLM fine-tuning. For a training example z = (x, y) and a validation example z* = (x*, y*), the influence of training on z on the loss at z* is, to first order:
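The first-order expression can be reconstructed from a Taylor expansion of the validation loss around θ_t, assuming a single SGD step on z with learning rate η_t (the step-index notation is an assumption made here for the derivation):

```latex
\ell(z^*; \theta_{t+1}) \approx \ell(z^*; \theta_t) + \big\langle \nabla \ell(z^*; \theta_t),\; \theta_{t+1} - \theta_t \big\rangle,
\qquad
\theta_{t+1} - \theta_t = -\eta_t \, \nabla \ell(z; \theta_t)

\ell(z^*; \theta_{t+1}) - \ell(z^*; \theta_t) \approx -\eta_t \, \big\langle \nabla \ell(z^*; \theta_t),\; \nabla \ell(z; \theta_t) \big\rangle
```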
In words: if the gradient of the loss on z points in the same direction as the gradient of the loss on z*, training on z will reduce z*'s loss. The sign and magnitude are given by the inner product of gradients at the current parameters θ.
This formula is theoretically clean but practically broken for modern LLM fine-tuning. LESS identifies three specific problems and fixes each one.
Fine-tuning uses Adam, which adaptively rescales each coordinate by the square root of a running estimate of its squared gradient. The effective update direction is therefore not ∇ℓ but the Adam update Γ(·; θ) := m / (√v + ε), where m and v are Adam's first- and second-moment estimates. LESS substitutes this directly:
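A reconstruction of the resulting score, pieced together from the surrounding description rather than quoted from the paper: the training-side gradient is replaced by the Adam update Γ, the inner product is normalized to a cosine similarity, and the score is summed over N warmup checkpoints θ_1, …, θ_N, each weighted by its average learning rate (written here as η̄_i, an assumed notation):

```latex
\mathrm{Inf}_{\mathrm{Adam}}(z, z^*) \;=\; \sum_{i=1}^{N} \bar{\eta}_i \,
\cos\!\big( \nabla \ell(z^*; \theta_i),\; \Gamma(z; \theta_i) \big),
\qquad
\cos(a, b) = \frac{\langle a, b \rangle}{\lVert a \rVert \, \lVert b \rVert}
```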
Three modifications relative to the SGD formula, each motivated by a specific failure mode:
For a 7B model, ∇ℓ has 7B entries, roughly 14 GB per example at 16-bit precision. Storing one per example is impractical. LESS does two things:
(a) LoRA the model. Train with rank-r LoRA adapters, so the trainable parameter count drops from ~7B to ~100M. All gradients are computed in this smaller space.
(b) Random projection. Apply the Johnson–Lindenstrauss lemma: project the 100M-dim gradient down to a d-dim feature (d = 8192 in the paper) via a fixed random Gaussian matrix P ∈ ℝ^(d × 10^8). JL guarantees that for any two vectors u, v:
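One standard way to state the guarantee, assuming P has i.i.d. N(0, 1/d) entries and writing ε for the distortion, δ for the failure probability, and C for an absolute constant (these symbols are introduced here for illustration):

```latex
\Pr\Big[\, \big| \langle Pu, Pv \rangle - \langle u, v \rangle \big| \;\le\; \varepsilon \, \lVert u \rVert \, \lVert v \rVert \,\Big] \;\ge\; 1 - \delta
\qquad \text{whenever} \qquad
d \;\ge\; C \, \varepsilon^{-2} \log(1/\delta)
```

Note that the required d depends on the tolerated error and failure probability, not on the original dimension.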
So inner products, which are exactly what the influence formula needs, are preserved up to a small error relative to the norms of the two vectors. Now every training example has a manageable 8192-dim "gradient fingerprint" we can store and search.
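A toy check of that property, using only the standard library. The sizes (1024 → 512) are far smaller than the real 10^8 → 8192, and the helper names are hypothetical; the point is only that the projected inner product lands close to the exact one.

```python
import math
import random

def make_projection(d, n, seed=0):
    """d x n matrix with i.i.d. N(0, 1/d) entries, fixed by the seed."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(d)]

def project(P, x):
    """Matrix-vector product P @ x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

n, d = 1024, 512                                   # toy sizes, not the paper's
rng = random.Random(1)
u = [rng.gauss(0.0, 1.0) for _ in range(n)]
v = [ui + 0.5 * rng.gauss(0.0, 1.0) for ui in u]   # correlated with u

P = make_projection(d, n)
exact = dot(u, v)
approx = dot(project(P, u), project(P, v))
# Error measured relative to the norms of the two vectors.
rel_err = abs(approx - exact) / math.sqrt(dot(u, u) * dot(v, v))
print(f"exact={exact:.1f} approx={approx:.1f} rel_err={rel_err:.3f}")
```

The same matrix P must be reused for every training and validation gradient; the guarantee is about a fixed projection, which is why the seed is pinned.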
The influence formula requires gradients at the actual training checkpoints, but those are what we are trying to avoid computing. LESS approximates this with a warmup: train on a small random subset (a few % of data) for N epochs with LoRA, checkpoint after each epoch, and compute gradients at those checkpoints. Empirically this captures most of the optimization trajectory's direction without needing the full run.
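Putting the Adam direction and the checkpoint sum together, a minimal sketch might look like the following. It assumes the optimizer moments m and v are read from each warmup checkpoint and that features have already been projected; all function names and toy vectors are hypothetical, and this is not the paper's released code.

```python
import math

def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update direction for one example's gradient, given the running
    moments m and v stored at a checkpoint (with bias correction)."""
    out = []
    for g, m_j, v_j in zip(grad, m, v):
        m_hat = (beta1 * m_j + (1 - beta1) * g) / (1 - beta1 ** step)
        v_hat = (beta2 * v_j + (1 - beta2) * g * g) / (1 - beta2 ** step)
        out.append(m_hat / (math.sqrt(v_hat) + eps))
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def influence(train_feats, val_feats, lrs):
    """Sum over checkpoints of lr_i * cos(validation gradient, training update)."""
    return sum(lr * cosine(g_val, g_tr)
               for lr, g_val, g_tr in zip(lrs, val_feats, train_feats))

# Two warmup checkpoints, 2-d features (toy numbers).
lrs = [2e-5, 1e-5]
val_feats = [[1.0, 0.0], [0.0, 1.0]]
helpful = [[2.0, 0.0], [0.0, 3.0]]     # aligned with the validation gradient
harmful = [[-1.0, 0.0], [0.0, -1.0]]   # points the opposite way
score_helpful = influence(helpful, val_feats, lrs)
score_harmful = influence(harmful, val_feats, lrs)

# With zero moments at step 1, the Adam direction reduces to the gradient's sign.
first_step = adam_direction([1.0, -2.0], m=[0.0, 0.0], v=[0.0, 0.0], step=1)
```

Ranking the pool by this score and keeping the top few percent is then an ordinary sort.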
The paper's most useful empirical finding: gradient features built with a small model transfer to selecting data for a larger or different model. Specifically, a gradient datastore built with LLaMA-2-7B selects data that, when used to fine-tune LLaMA-2-13B or Mistral-7B, gives downstream performance comparable to a datastore built natively with the target model. This means the per-example gradient cost can be paid once, at small scale.
On MMLU, TyDiQA, and BBH, training on the LESS-selected 5% often beats training on the full dataset. Random 5% subsets are substantially worse. So the selection signal is real, not just a regularization effect from having less data.
The methods above are not interchangeable. Each one optimizes for a different setting. Here is how we pick.
Suppose we have a pool of, say, 500K instruction–response pairs and want to fine-tune a 7B–13B model. Here is the end-to-end procedure synthesized from the three papers above.
The random-subset baseline. We always train on a random subset of the same size and report both numbers. A selection method that does not beat random is not selecting — it is just shrinking the dataset.
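A small sketch of how the matched baseline might be drawn, assuming examples are addressed by integer index. The helper name and the `selected` indices are hypothetical.

```python
import random

def random_subset(pool_size, k, seed=0):
    """Indices of a uniformly random size-k subset, reproducible via the seed."""
    return sorted(random.Random(seed).sample(range(pool_size), k))

selected = [3, 17, 42, 99, 250]          # hypothetical output of a selection method
baseline = random_subset(pool_size=500, k=len(selected))
print(baseline)
```

Pinning the seed keeps the comparison reproducible; averaging over a few seeds gives a sense of how much of any gap is noise.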
The diversity check. After selection, we plot the topic-cluster distribution of the selected set. If 80% of examples come from 5 clusters, the model will overfit to those topics. Re-balance.
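One way to run this check, assuming every example already carries a cluster id (for instance from k-means on embeddings, which is not shown here). The capping rule below is just one simple re-balancing option, and all names are hypothetical.

```python
from collections import Counter

def top_cluster_share(cluster_ids, top_n=5):
    """Fraction of selected examples that fall in the top_n largest clusters."""
    counts = Counter(cluster_ids)
    return sum(c for _, c in counts.most_common(top_n)) / len(cluster_ids)

def cap_per_cluster(ranked_indices, cluster_of, cap):
    """Walk examples in ranked order, keeping at most `cap` per cluster."""
    kept, seen = [], Counter()
    for i in ranked_indices:
        c = cluster_of[i]
        if seen[c] < cap:
            kept.append(i)
            seen[c] += 1
    return kept

cluster_of = [0, 0, 0, 0, 1, 2]          # cluster id per example (toy data)
ranked = [0, 1, 2, 3, 4, 5]              # selected examples, best first
share = top_cluster_share([cluster_of[i] for i in ranked], top_n=1)
balanced = cap_per_cluster(ranked, cluster_of, cap=2)
print(round(share, 2), balanced)
```

Capping shrinks the selected set, so the freed slots would need to be refilled from lower-ranked examples in under-represented clusters.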
The length check. Zhao et al. (ICML 2024) found that "select the longest response" is a surprisingly strong baseline, sometimes within 1–2 points of more sophisticated methods. We therefore check how strongly a method's selection scores correlate with response length. If the correlation is weak, the method is picking up something length does not, which is interesting; if it is strong, the sophistication may not be earning its keep.
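A dependency-free sketch of that correlation check, using Spearman's rank correlation between selection scores and response lengths. The helper names and toy numbers are hypothetical, and the Pearson step assumes neither input is constant.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))

selection_scores = [0.9, 0.4, 0.7, 0.1]   # hypothetical scores from a method
response_lengths = [820, 150, 400, 90]    # response length in tokens (toy data)
rho = spearman(selection_scores, response_lengths)
print(rho)  # a value near 1 means the method ranks examples almost like length does
```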
A few honest limitations to set expectations:
The arc from LIMA to LESS is, in retrospect, a story about replacing taste with statistics. LIMA proved that taste suffices; LESS proved that with the right inner product, taste is computable.
Pulling the previous sections together into a clean, narrative summary — useful when presenting to an audience that hasn't read the paper.
Calculating and comparing gradients for billions of parameters across millions of documents is incredibly computationally heavy. Here is how the LESS algorithm makes it efficient.
Less is more. The paper shows that training a model on just the top 5% of data selected by LESS often results in a better model on the target task than training on 100% of the data. Trimming the fat improves focus: this suggests the discarded 95% was not neutral, and that at least part of it was pulling the model in directions irrelevant to the target task.
Transferability. A supercomputer is not required to run LESS. We can use a smaller, cheaper model (like a 7-billion parameter model) to calculate the gradients and select the top 5% of data. We can then use that curated 5% to train a larger model (LLaMA-2-13B in the paper) or a model from a different family (Mistral-7B), and the selected data remains effective. The selection signal lives largely in the data, not in the scorer model.