Let's state the popular claim plainly, because it is a genuinely seductive one. Dataset distillation — DD, in the literature — promises to take an enormous training set and compress its information into a tiny set of synthetic samples, so that a model trained on a few dozen engineered images learns nearly as much as one trained on a million real ones. The implicit boast is that these crafted samples are better than reality: that a synthetic image, tuned by an optimization procedure, can pack in more useful signal than any real photograph could. It is the kind of idea that gets a lot of papers written. A study posted to arXiv on June 16, 2026, by Trisha Mittal, Akshay Mehra, and Joshua Kimball decided to check it against the simplest possible rival.

That rival is coreset selection — CS — which does not synthesize anything. It just picks a subset of the real data. The conventional wisdom the field has been telling itself is that this approach is inherently handicapped, and the authors name the assumption directly: that DD outperforms CS "based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness." In other words, the story goes, you can only do so well with real examples; to do better you must invent samples. That is the narrative the paper sets out to test.
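To make "just picks a subset" concrete, here is a minimal sketch of two selection strategies: a uniform random subset and greedy k-center. The article does not name the paper's three coreset strategies, so k-center greedy is an assumption chosen for illustration, and the toy 2-D `features` stand in for whatever embedding a real pipeline would use.

```python
import math
import random


def random_subset(n_items, budget, seed=0):
    """Baseline: pick `budget` distinct indices uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), budget))


def k_center_greedy(features, budget, seed=0):
    """Greedy k-center: repeatedly add the point farthest from the current picks."""
    rng = random.Random(seed)
    first = rng.randrange(len(features))
    selected = [first]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(f, features[first]) for f in features]
    while len(selected) < budget:
        idx = max(range(len(features)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [
            min(d, math.dist(f, features[idx])) for d, f in zip(nearest, features)
        ]
    return selected


# Toy 2-D "features" (hypothetical): two tight clusters plus one outlier.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
picked = k_center_greedy(features, budget=3)
subset = [features[i] for i in picked]  # every element is an untouched real sample
```

Nothing in `subset` is synthesized; both functions only return indices into the real data. On this toy input the greedy pass ends up with one sample from each of the three groups, which is the intuition behind selecting for spread rather than at random.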

"Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction."— arXiv:2606.18209, source

Read that quote slowly, because there are two separate punctures in it. The first is that some distillation methods cannot beat a random subset — not a clever coreset, just randomly grabbing real samples. The second, more damning for the field's premise, is that even the state-of-the-art methods are "comparable to or worse than" coresets on large-scale data, and they cost substantially more to produce. The expensive, sophisticated approach does not reliably beat the cheap, dumb one. I'd love to believe the synthetic-samples story, but the benchmark complicates it.

Why the methodology is the real news

The reason this paper lands harder than a typical "we got better numbers" result is that its contribution is mostly about how DD has been measured. The authors point out that distillation methods "are often evaluated under inconsistent evaluation protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of distilled data from evaluation." That is a careful way of saying the field's wins may be partly artifacts of generous test setups — that some of the apparent magic was in the grading, not the data.

So they standardize. They benchmark seven SOTA distillation methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols, against three coreset strategies. The point of holding the protocol fixed is to ask what the distilled data is actually worth once you stop letting each method bring its own favorable evaluation. This is the steelman-then-test move: give DD its best methods and real benchmarks, control the part of the experiment that flatters it, and see what survives.
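The idea of holding the protocol fixed can be sketched as a full cross-product: every condensed set is scored under every protocol, with no per-method exceptions. This is only an illustration of the design, not the authors' harness. The set names, the protocol labels (guessed from the ERM and single/multi-teacher protocols quoted earlier), and `dummy_train_and_eval` are all hypothetical placeholders.

```python
from itertools import product


def run_grid(condensed_sets, protocols, train_and_eval):
    """Evaluate every condensed set under every protocol in the same way."""
    results = {}
    for (set_name, data), protocol in product(condensed_sets.items(), protocols):
        results[(set_name, protocol)] = train_and_eval(data, protocol)
    return results


# Hypothetical labels only; not the paper's actual method list.
condensed_sets = {"dd_method_a": [], "dd_method_b": [], "coreset_random": []}
protocols = ["erm", "single_teacher", "multi_teacher"]


def dummy_train_and_eval(data, protocol):
    """Stand-in that reports no accuracy; a real run would train a model here."""
    return None


grid = run_grid(condensed_sets, protocols, dummy_train_and_eval)
```

The point of the structure is that a method cannot choose which cell it is judged on: three sets and three protocols always yield nine entries.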

Beyond accuracy, the part the headline numbers miss

Accuracy is not the only axis, and the paper's secondary analysis is arguably the more interesting half. The authors evaluate "the representativeness, diversity, and quality of condensed sets" and find that coresets "consistently achieve better coverage of the original data distribution." That is a quieter but more structural finding. A distilled set might hit a respectable accuracy number while quietly failing to represent the full spread of the data — a brittleness that an accuracy figure alone would hide. Real samples, by construction, are drawn from the real distribution; synthetic ones have to earn that coverage, and the study says they often don't.
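One simple way to picture "coverage" is the fraction of real points that have some condensed point nearby in feature space. The article does not say how the paper measures coverage, so the radius-based definition below is an assumption, and the points and `radius` are toy values.

```python
import math


def coverage(real_features, condensed_features, radius):
    """Fraction of real points with at least one condensed point within `radius`."""
    covered = sum(
        1
        for x in real_features
        if any(math.dist(x, c) <= radius for c in condensed_features)
    )
    return covered / len(real_features)


# Toy data (hypothetical): two clusters of real points.
real = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
spread_out = [(0.0, 0.0), (5.0, 5.0)]  # one condensed point per cluster
collapsed = [(0.0, 0.0), (0.2, 0.0)]   # both condensed points in one cluster

print(coverage(real, spread_out, radius=0.5))  # 1.0
print(coverage(real, collapsed, radius=0.5))   # 0.5
```

Two condensed sets of the same size can differ sharply on this measure, which is why an accuracy number alone can hide a set that ignores part of the distribution.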

There is also a cost argument that should not get lost in the accuracy debate. Constructing a distilled set is itself an optimization procedure, and it can be expensive, sometimes rivaling the cost of the full-dataset training run you were trying to avoid. Coreset selection is comparatively trivial: you score and pick. When the cheap method matches or beats the expensive one on accuracy and beats it on distribution coverage, the burden of proof shifts. The authors put it bluntly, concluding that the findings "highlight the limited practical advantages of current DD methods."

What this doesn't say

Fairness requires drawing the boundaries of the claim. This is a benchmark of current methods on image classification at ImageNet scale; it is not a proof that synthesizing training data can never help, and the authors are careful to scope it to "current DD methods." There may be regimes — smaller datasets, different modalities, privacy-driven settings where you specifically cannot retain real samples — where distillation's value proposition is different. The result is a check on a specific, oversold comparison, not a verdict on an entire research direction.

But within those boundaries, the deflation is healthy. The field had been repeating a comforting premise (that you can engineer samples more informative than reality, and that subsets of real data are fundamentally limited) and treating it as settled. A standardized benchmark that gives the favored approach its best methods and still finds the humble option "competitive and often a more computationally efficient alternative" is exactly the kind of receipt worth keeping. The lesson is not that distillation is worthless. It is that "better than just picking good real examples" is a claim that has to be earned under fixed rules, and on these datasets, under these rules, it wasn't.