Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

1Max Planck Institute for Informatics, 2South China University of Technology,
3Monash University, 4Shanghai Jiao Tong University

*Indicates Equal Contribution

Indicates Corresponding Author

Abstract

Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, which often comes with a significant performance gap between the two. Interestingly, we find that an excessively large performance gap can hamper the training process. To alleviate this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch along with the student to maintain a reasonable performance gap. To further strengthen distillation, we develop a hard strategy that enforces parameter sharing between the two models. In addition, we build soft bidirectional mappings between them through Inverse Reparameterization (IR) and Channel-Branch Reparameterization (CBR). IR initializes a larger dynamic teacher with approximately the same accuracy as the student to avoid an excessively large gap in the early stage of training. CBR enables direct extraction of an effective student model from the dynamic teacher without post-training. In experiments, GPD significantly outperforms existing distillation methods on top of both CNNs and transformers, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding large improvements of 1.80% and 0.89% on ResNet18, respectively.
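
For readers who prefer code, the following is a minimal PyTorch sketch of the joint training idea described above: the student, the dynamic teacher (built from the student and sharing parameters with it), and a frozen pre-trained static teacher are optimized together with supervised and distillation losses. The names (student, dynamic_teacher, static_teacher, kd_loss) and the equal loss weighting are illustrative assumptions, not the exact GPD objective; see the paper for the precise formulation.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=4.0):
    # Standard KL-based distillation loss (an assumed choice, not necessarily GPD's).
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

def training_step(student, dynamic_teacher, static_teacher, x, y, optimizer):
    # The optimizer is assumed to cover the union of student and dynamic-teacher
    # parameters; because of parameter sharing, updating the dynamic teacher
    # also updates the shared (student) weights.
    s_logits = student(x)
    dt_logits = dynamic_teacher(x)
    with torch.no_grad():
        st_logits = static_teacher(x)            # pre-trained, frozen teacher

    loss = (
        F.cross_entropy(s_logits, y)             # student supervised loss
        + F.cross_entropy(dt_logits, y)          # dynamic teacher supervised loss
        + kd_loss(s_logits, st_logits)           # static teacher -> student
        + kd_loss(s_logits, dt_logits.detach())  # dynamic teacher -> student
        + kd_loss(dt_logits, st_logits)          # static teacher -> dynamic teacher
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()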

Method

Gap Preserving Distillation (GPD) introduces a novel approach to knowledge distillation by training a dynamic teacher alongside the student model. The method employs Inverse Reparameterization (IR) to expand the student model along both the channel and branch dimensions, creating a larger dynamic teacher that initially attains approximately the same accuracy as the student. This prevents an excessively large performance gap in the early stages of training. To strengthen knowledge transfer, GPD shares parameters between the student and the dynamic teacher through Channel-Branch Reparameterization (CBR). CBR also enables direct extraction of an effective student model from the dynamic teacher without post-training, as sketched below.
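
Below is a minimal PyTorch sketch of the expansion/extraction idea for a single linear layer. BranchExpandedLinear is a hypothetical helper: the extra branches are zero-initialized so the expanded teacher block initially computes exactly the student's function, and because the student layer is shared inside the block, it can be read back out at any time without post-training. GPD's actual IR/CBR also expand the channel dimension and operate on full networks; those details are omitted here.

import torch
import torch.nn as nn

class BranchExpandedLinear(nn.Module):
    # Hypothetical helper illustrating branch expansion with parameter sharing.
    def __init__(self, student_layer: nn.Linear, num_extra_branches: int = 1):
        super().__init__()
        self.shared = student_layer                      # hard parameter sharing with the student
        self.extra = nn.ModuleList()
        for _ in range(num_extra_branches):
            branch = nn.Linear(student_layer.in_features,
                               student_layer.out_features, bias=False)
            nn.init.zeros_(branch.weight)                # zero-initialized: no effect at init
            self.extra.append(branch)

    def forward(self, x):
        out = self.shared(x)
        for branch in self.extra:
            out = out + branch(x)                        # extra capacity for the dynamic teacher
        return out

    def extract_student(self) -> nn.Linear:
        # The student layer is recovered directly from the shared parameters,
        # without any post-training.
        return self.shared

# Usage: the expanded teacher block initially matches the student exactly.
student_layer = nn.Linear(16, 8)
teacher_block = BranchExpandedLinear(student_layer, num_extra_branches=2)
x = torch.randn(4, 16)
assert torch.allclose(student_layer(x), teacher_block(x))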

Illustration of the Gap Preserving Distillation (GPD) method.

Results

1. Distillation with a Static Teacher

Comparison with existing distillation methods across different architectures:

Results1.png

2. Training from Scratch

Performance when training models from scratch without a static teacher:

Results2.png

3. Fine-tuning Scenario

Results when fine-tuning pre-trained models:

Results3.png

BibTeX


@inproceedings{guo2025gap,
  title     = {Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher},
  author    = {Yong Guo and Shulian Zhang and Haolin Pan and Jing Liu and Yulun Zhang and Jian Chen},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
  url       = {https://openreview.net/forum?id=PnfghHD4Pi}
}