Learnability-Guided Diffusion for Dataset Distillation

CVPR 2026

Center for Research in Computer Vision, University of Central Florida

Abstract

Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by synthesizing a small dataset on which models train to performance comparable with the full dataset. Recent methods use diffusion models to generate distilled datasets, but existing approaches produce redundant training signals: disjoint subsets capture 70–80% overlapping information.

We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small distilled dataset, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce learnability-guided diffusion that balances current-model informativeness with reference-model validity, automatically generating curriculum-aligned samples.

Our approach reduces redundancy by 39.1%, enables specialization across training phases, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%).



Motivation

Learnability-Guided Dataset Distillation teaser figure

Standard distillation generates redundant samples. We partition the distilled dataset into data increments (e.g., 10 images per class). With DiT, a model trained on one increment already achieves 98.0% accuracy on another, meaning each increment provides minimal new information.

Our approach generates complementary samples. By conditioning synthesis on the current model's knowledge, we guide generation toward samples that complement existing data. The resulting increments achieve only 17.0% cross-increment accuracy, confirming that they introduce substantial new learning signal.

Quantifying Redundancy

Cross-validation heatmaps showing redundancy reduction
Cross-validation across distilled data increments on ImageNette. Each heatmap shows accuracy when training on one increment and evaluating on another. DiT exhibits severe redundancy (avg. 94.7% off-diagonal accuracy), while our method yields only 57.65% on average—a 39.1% reduction in redundancy, confirming that our increments contain complementary information.
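The redundancy figure above follows directly from the off-diagonal entries of these heatmaps. A minimal sketch of the computation (the 2×2 matrix below is a hypothetical placeholder, not the paper's measurements; only the two reported averages, 94.7 and 57.65, come from the caption):

```python
def off_diag_mean(acc):
    """Average accuracy when training on increment i and testing on j != i."""
    n = len(acc)
    vals = [acc[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(vals) / len(vals)

def redundancy_reduction(base_avg, ours_avg):
    """Relative drop in cross-increment accuracy, in percent."""
    return 100.0 * (base_avg - ours_avg) / base_avg

# Toy heatmap: diagonal is same-increment accuracy, off-diagonal is transfer.
print(off_diag_mean([[99.0, 94.7], [94.7, 99.0]]))  # -> 94.7

# Using the averages reported above (94.7 for DiT, 57.65 for ours):
print(round(redundancy_reduction(94.7, 57.65), 1))  # -> 39.1
```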


Our Method

Overview of learnability-guided iterative generation framework

Overview of our learnability-guided iterative generation framework.

  • Incremental Distillation Loop (top): We iteratively train a model \(\theta_t\) on the cumulative dataset, generate samples via learnability-guided diffusion, select the most learnable samples, and augment the dataset for the next stage.
  • Effect on Sample Space (bottom): The current model \(\theta_t\) (green) expands over iterations while the fixed reference model \(\theta^*\) (purple) defines the overall learnable region. New samples target the learnable gap between the two.

Problem Formulation

We partition the distilled dataset into \(K\) disjoint increments. At each stage, the model trained so far guides synthesis of the next increment by maximizing:

\[\mathcal{I}_i^* = \arg\max_{\mathcal{I}}\;\big[\mathcal{L}(\theta_{i-1}, \mathcal{I}) - \mathcal{L}(\theta^*, \mathcal{I})\big]\]

The first term targets samples hard for the current model; the second penalizes samples also hard for the reference—filtering out degenerate examples and focusing each increment on meaningful knowledge gaps.
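As a toy numeric illustration of this objective (the class probabilities below are hypothetical, not outputs of the paper's models), the per-sample learnability gap is positive for samples the current model finds hard but the reference model handles, and negative for degenerate samples that are hard for both:

```python
import math

def ce_loss(probs, label):
    """Cross-entropy of one sample given predicted class probabilities."""
    return -math.log(probs[label])

def learnability_gap(p_current, p_ref, label):
    """L(theta_{i-1}, x, y) - L(theta*, x, y) for a single sample."""
    return ce_loss(p_current, label) - ce_loss(p_ref, label)

# The current model is wrong on both samples; the reference model is right
# on the first (informative gap) and wrong on the second (degenerate sample).
informative = learnability_gap([0.2, 0.7, 0.1], [0.9, 0.05, 0.05], label=0)
degenerate = learnability_gap([0.2, 0.7, 0.1], [0.1, 0.8, 0.1], label=0)
print(informative > 0 > degenerate)  # -> True
```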

Learnability Guidance

We integrate a per-sample learnability score directly into diffusion sampling:

\[\mathcal{S}(x, y) = \mathcal{L}(\theta_{i-1}, x, y) - \omega \cdot \mathcal{L}(\theta^*, x, y)\]

This score steers the denoising trajectory via gradient-based guidance:

\[\tilde{\epsilon}_\phi(x_t, t, y) = \epsilon_\phi(x_t, t, y) + \lambda \cdot \rho_t \cdot \nabla_{x_t} \mathcal{S}(x_t, y)\]

where \(\lambda\) controls guidance strength and \(\rho_t\) normalizes across timesteps. The result is an active learning mechanism that forms a curriculum of increasing difficulty.
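In a full implementation the score gradient comes from backpropagating \(\mathcal{S}\) through both classifiers; the update to the noise prediction itself is just a scaled vector addition. A minimal NumPy sketch, where the gradient is supplied by the caller and \(\rho_t = \lVert\epsilon\rVert / \lVert\nabla\mathcal{S}\rVert\) is a magnitude-matching normalization assumed here for illustration, not necessarily the paper's exact choice:

```python
import numpy as np

def learnability_guided_eps(eps, grad_s, lam=2.0):
    """tilde_eps = eps + lam * rho_t * grad_{x_t} S(x_t, y).

    eps:    epsilon_phi(x_t, t, y), the model's noise prediction
    grad_s: gradient of the learnability score w.r.t. x_t (same shape)
    lam:    guidance strength lambda
    """
    # rho_t rescales the guidance term to the noise magnitude at this timestep
    rho_t = np.linalg.norm(eps) / (np.linalg.norm(grad_s) + 1e-12)
    return eps + lam * rho_t * grad_s
```

With `eps = np.ones(4)` and `grad_s = [1, 0, 0, 0]`, the guided prediction is `[5, 1, 1, 1]`: the score gradient shifts the denoising trajectory while the normalization keeps the shift on the scale of the noise itself.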

Learnability Rank Selection

Guided sampling alone can still yield weakly learnable examples. We over-generate by a factor \(\kappa\), score each candidate with \(\mathcal{S}\), and retain the top-\(N_i\) per class:

\[\mathcal{I}_i = \operatorname{Top}_{N_i}\!\Big(\big\{(x_j^c,\; y_j^c,\; \mathcal{S}(x_j^c, y_j^c))\big\}\Big)\]

This two-stage approach—guidance during generation plus rank-based selection—yields high-learnability increments that complement rather than replicate existing data.
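The selection step itself is ordinary per-class top-k filtering. A self-contained sketch, where the candidate list stands in for the over-generated guided samples and `score` stands in for \(\mathcal{S}\):

```python
def select_increment(candidates, score, n_per_class):
    """Keep the top-n_per_class candidates per class by learnability score.

    candidates: iterable of (x, y) pairs, over-generated by a factor kappa
    score:      callable implementing S(x, y)
    """
    by_class = {}
    for x, y in candidates:
        by_class.setdefault(y, []).append(x)
    increment = []
    for y, xs in by_class.items():
        xs.sort(key=lambda x: score(x, y), reverse=True)
        increment.extend((x, y) for x in xs[:n_per_class])
    return increment

# Toy usage: 10 candidates over 2 classes, scored by x itself.
picked = select_increment([(i, i % 2) for i in range(10)],
                          score=lambda x, y: x, n_per_class=2)
print(sorted(picked))  # -> [(6, 0), (7, 1), (8, 0), (9, 1)]
```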



Results

ImageNette & ImageWoof

Dataset  Model      IPC   DiT    Minimax  DiT-IGD  DiT+MGD3  DiT+Ours  Full
Nette    ConvNet-6   50   74.1   76.9     80.9     80.9      82.6      94.3
                    100   78.2   81.1     84.5     86.5      87.2
         ResNet-18   50   75.2   78.1     81.0     81.5      85.0      95.3
                    100   77.8   81.3     84.4     85.6      86.9
Woof     ConvNet-6   50   48.5   50.7     54.2     53.4      53.9      85.9
                    100   54.2   57.1     61.1     59.2      61.9
         ResNet-18   50   57.4   60.5     62.0     63.9      65.1      89.0
                    100   62.3   67.4     70.6     71.3      72.9
Our method achieves consistent improvements over prior state-of-the-art methods across both datasets and all architectures. On ImageNette at 100 IPC, we reach 87.2%, surpassing DiT-IGD by 2.7 points. On ImageWoof at 100 IPC, we achieve 72.9% with ResNet-18.

ImageNet-1K

SRe2L  G-VBSM  RDED  DiT   Minimax  DiT-IGD  MGD3  Ours
46.8   51.8    56.5  52.9  58.6     59.8     60.2  60.1
ImageNet-1K at 50 IPC with ResNet-18. Our method achieves 60.1%, matching state-of-the-art performance while substantially outperforming the base diffusion approach (DiT: 52.9%, a 7.2-point improvement).

Incremental Training Dynamics

DiT incremental training loss

(a) DiT Incremental Training

Ours incremental training loss

(b) Ours Incremental Training

Validation accuracy progression

(c) Validation Accuracy Progression

Incremental training dynamics. (a–b) Our method produces stronger loss spikes after each new increment (avg. Δ=0.20 vs. 0.06 for DiT), confirming that added data is harder and complementary. (c) Our method achieves 85.2% vs. 79.9% (IGD) and 75.6% (DiT) at 50 IPC, with sustained accuracy gains across all stages.

Sample Difficulty Analysis

Learning dynamics visualization
Learning-dynamics visualization. DiT concentrates in easy regions (over 80% of its samples have mean GT probability μ > 0.8). Our method distributes broadly across easy, informative, and hard regions. Jensen–Shannon divergence confirms alignment: our method achieves JS = 0.41 vs. DiT (2.04) and IGD (0.92), a 5× and 2.2× reduction, respectively.

In-Distribution & Semantic Consistency

In-distribution analysis scatter plots
In-distribution and learning-dynamics analysis. Each point shows a sample's reference-model confidence (x-axis) and mean GT probability during training (y-axis). DiT clusters on easy, high-confidence samples. DiT+Loss (without our regularization) covers a broader range but introduces out-of-distribution samples. DiT+Ours achieves a balanced spread—capturing informative mid and hard examples while staying closely aligned with the in-distribution region.



Qualitative Results

Visual diversity comparison across methods on Church class

Visual diversity in incrementally distilled datasets (Church class, 50 IPC total across 5 increments). DiT generates repetitive samples with similar architectural styles and lighting. IGD improves slightly but clusters around Gothic exteriors. Ours produces diverse samples spanning multiple architectural styles (traditional, modern, ornate), perspectives (exterior and interior views), and lighting conditions (day, night, golden hour)—reflecting our curriculum-based synthesis where early increments capture simpler structures and later increments progressively introduce complexity.


BibTeX

@inproceedings{chansantiago2026learnability,
  title     = {Learnability-Guided Diffusion for Dataset Distillation},
  author    = {Chan-Santiago, Jeffrey A. and Shah, Mubarak},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
               and Pattern Recognition (CVPR)},
  year      = {2026},
}