MGD3: Mode-Guided Dataset Distillation using Diffusion Models

ICML 2025 Spotlight (top 2.6%)

1Center for Research in Computer Vision, University of Central Florida
2Mehta Family School of DS & AI, Indian Institute of Technology Roorkee, India
3Cisco Research

Abstract

Dataset distillation has emerged as an effective strategy for compressing large training datasets, significantly reducing training costs and enabling more efficient model deployment. Recent advances leverage generative models to distill datasets by capturing the underlying data distribution.

However, existing generative approaches require fine-tuning the model with distillation losses to encourage diversity and representativeness. Even then, they do not guarantee sample diversity, which limits their performance.

We propose a mode-guided diffusion model that leverages a pre-trained diffusion model without the need for fine-tuning using distillation losses. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples that affect performance.

We evaluate our approach on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K, achieving accuracy improvements of 4.4%, 2.9%, 1.6%, and 1.6%, respectively, over state-of-the-art methods. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs.



The Task


Dataset distillation aims to compress the knowledge of a large training dataset into a significantly smaller set of synthetic samples, such that models trained on this distilled dataset can achieve performance comparable to those trained on the full dataset.

  • Optimization-based Distillation: Learns a synthetic dataset by directly optimizing it to match the gradient or feature statistics of the original dataset (a minimal gradient-matching sketch follows this list).
  • Generation-based Distillation: First models the distribution of the original dataset, then generates samples that approximate this learned distribution.
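
To make the optimization-based flavor concrete, here is a minimal gradient-matching sketch in the spirit of DC/DSA-style methods. The cross-entropy task loss, the cosine gradient distance, and the function interface are illustrative assumptions; this is not the method proposed in this paper, which is generation-based.

import torch
import torch.nn.functional as F

def gradient_matching_step(model, real_x, real_y, syn_x, syn_y, syn_opt):
    """One sketch step of gradient matching.

    `syn_x` is a leaf tensor of synthetic images with requires_grad=True,
    and `syn_opt` is an optimizer over it (both assumptions of this sketch).
    """
    # Gradients of the task loss on a real batch (target signal, no graph kept).
    real_loss = F.cross_entropy(model(real_x), real_y)
    real_grads = torch.autograd.grad(real_loss, model.parameters())

    # Gradients of the same loss on the synthetic batch, kept differentiable
    # w.r.t. the synthetic pixels via create_graph=True.
    syn_loss = F.cross_entropy(model(syn_x), syn_y)
    syn_grads = torch.autograd.grad(syn_loss, model.parameters(), create_graph=True)

    # Distillation objective: make the two gradient fields agree
    # (cosine distance per parameter tensor is a common choice).
    loss = sum(1.0 - F.cosine_similarity(gs.flatten(), gr.flatten(), dim=0)
               for gs, gr in zip(syn_grads, real_grads))

    syn_opt.zero_grad()
    loss.backward()
    syn_opt.step()
    return loss.item()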



Motivation

Overview of gradient field in diffusion

Overview of the gradient field (score function) during the denoising process in latent diffusion for a specific class \( c \).
The original data distribution is shown as blue dots, with denser regions highlighted by an orange gradient field. To generate a sample \( \hat{X}_i \), noise \( x_T^i \sim \mathcal{N}(0, \mathbf{I}) \) is first sampled.

  • (a) DiT: A pre-trained diffusion model without fine-tuning leads to imbalanced mode likelihoods, resulting in limited sample diversity and frequent repetition of modes.
  • (b) MinMax Diffusion: Fine-tunes the model to balance mode likelihoods and improve diversity. However, it still suffers from sample redundancies tied to initial noise conditions.
  • (c) MGD3 (Ours): Introduces mode-guided denoising (colored traces), explicitly steering samples toward distinct modes (stars). After \( k \) guided steps, it transitions to unguided denoising (black trace), achieving both high diversity and consistency without requiring any fine-tuning (see the sketch after this list).
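
To make panel (c) concrete, the sketch below guides the early denoising steps toward a target mode and then continues unguided. The `eps_model` noise-predictor interface, the DDIM-style (eta = 0) update, the linear pull toward the mode, and treating `t_stop` as the timestep at which guidance ends are all assumptions made for exposition; this is not the paper's exact sampler.

import torch

@torch.no_grad()
def mode_guided_sample(eps_model, mode, alphas_cumprod, t_stop, scale=0.1,
                       shape=(1, 4, 32, 32), device="cpu"):
    """Sketch: guided denoising toward `mode` (a tensor broadcastable to the
    latent shape), switching to unguided denoising once t <= t_stop."""
    T = alphas_cumprod.shape[0]
    x = torch.randn(shape, device=device)                      # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(x, t)                                  # predicted noise
        a_t = alphas_cumprod[t]
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        if t > t_stop:
            # Mode Guidance: nudge the predicted clean latent toward mode m_k;
            # once t <= t_stop this branch is skipped (Stop Guidance).
            x0_hat = x0_hat + scale * (mode - x0_hat)
        a_prev = alphas_cumprod[t - 1] if t > 0 else x.new_tensor(1.0)
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # DDIM (eta=0) step
    return x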


Our Method

Overview of the proposed method for distilled dataset synthesis

Overview of the proposed method for distilled dataset synthesis using a diffusion model.
The approach consists of three stages: Mode Discovery, Mode Guidance, and Stop Guidance.

  • Mode Discovery: Estimates the \( N \) modes of the original dataset in the latent diffusion model’s generative space (a clustering-based sketch follows this list).
  • Mode Guidance: Given a mode \( m_k \) and class \( c \), the generation process is steered toward \( m_k \) for the first \( t_{\text{stop}} \) denoising steps using the pre-trained model.
  • Stop Guidance: After \( t_{\text{stop}} \) steps, the model switches to standard unguided denoising. Without mode guidance, generations may follow the fully unguided (grey) path, resulting in redundant or overlapping samples.
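
As a rough illustration of Mode Discovery, the sketch below clusters per-class latents and keeps the centroids as target modes. K-means and the `latents_per_class` input (a mapping from class to an [N, D] tensor of flattened VAE latents) are illustrative stand-ins for however the paper actually estimates modes.

import torch
from sklearn.cluster import KMeans

def discover_modes(latents_per_class, ipc):
    """Sketch: return `ipc` candidate modes per class via k-means centroids."""
    modes = {}
    for cls, z in latents_per_class.items():
        # Cluster this class's latents; each centroid serves as one mode m_k.
        km = KMeans(n_clusters=ipc, n_init="auto").fit(z.cpu().numpy())
        modes[cls] = torch.from_numpy(km.cluster_centers_).float()
    return modes

Each centroid \( m_k \) would then seed one mode-guided generation, as in the sampling sketch above, yielding one diverse synthetic sample per mode for each class.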


Results


ImageNet subsets

Table 1
Comparison of performance between pre-trained diffusion models and state-of-the-art methods on ImageNet subsets. Evaluated using the hard-label protocol with ResNet-10 and average pooling. Accuracy is used as the evaluation metric. The best-performing results are highlighted in bold.
Table 2
Comparison with generative prior methods. Evaluation across architectures (AlexNet, VGG11, ResNet18, ViT) and ImageNet subsets (A–E) using the hard-label protocol. Our method outperforms GLaD, H-GLaD, and LD3M in the cross-architecture setup, while offering better scalability for large datasets.
ImageNet1k
Comparison with state-of-the-art methods on ImageNet-1K using the soft-label protocol. Our method achieves state-of-the-art performance, outperforming prior approaches by 1.3% and 1.6% at IPC 10 and IPC 50 (images per class), respectively.
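
For readers unfamiliar with the soft-label protocol referenced above, the following is a minimal knowledge-distillation-style training step: a student is trained on the distilled images against a pre-trained teacher's softened predictions rather than hard class labels. The temperature, the pure-KL objective, and the function interface are assumptions of this sketch; exact soft-label recipes vary across papers.

import torch
import torch.nn.functional as F

def soft_label_step(student, teacher, x, opt, tau=2.0):
    """Sketch: one soft-label training step on a batch of distilled images x."""
    with torch.no_grad():
        # Teacher's softened class distribution serves as the soft label.
        soft_targets = F.softmax(teacher(x) / tau, dim=1)
    log_probs = F.log_softmax(student(x) / tau, dim=1)
    # KL divergence between student and teacher, with standard tau^2 scaling.
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * tau ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()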

Text-to-Image Diffusions

Text-to-Image
Performance of the Text-to-Image model across multiple datasets using the soft-label protocol. We evaluate our method using a general-purpose text-to-image diffusion model, with class names as text prompts. Mode guidance significantly improves performance over Stable Diffusion across all datasets, including gains of 3.4% and 2.3% on ImageNet-1K at IPC 10 and IPC 50, respectively.
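
Below is a minimal sketch of the plain text-to-image baseline using the Hugging Face `diffusers` library, with class names as prompts. The checkpoint id and prompt template are assumptions, and the paper's mode guidance, which would additionally steer each sample inside the denoising loop, is omitted here.

import torch
from diffusers import StableDiffusionPipeline

# Hypothetical setup: plain class-name prompting with Stable Diffusion.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_class_samples(class_name, ipc):
    """Generate `ipc` baseline samples for one class from its name alone."""
    prompt = f"a photo of a {class_name}"
    return [pipe(prompt).images[0] for _ in range(ipc)]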

Ablations

Ablations

Ablation study on the components of the proposed method. Evaluated on the ImageNette dataset with IPC 10 using ConvNet-6, ResNet-10, and ResNet-18, the study incrementally adds Mode Discovery, Mode Guidance, and Stop Guidance. Results show that each component contributes to performance gains, with Stop Guidance playing a key role in enhancing final accuracy.

BibTeX


@inproceedings{chan2025mgd3,
  title     = {{MGD}$^3$: Mode-Guided Dataset Distillation using Diffusion Models},
  author    = {Chan Santiago, Jeffrey A. and Tirupattur, Praveen and Nayak, Gaurav Kumar and Liu, Gaowen and Shah, Mubarak},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025},
}