Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

1Max Planck Institute for Informatics, 2South China University of Technology,
3Monash University, 4Shanghai Jiao Tong University

*Indicates Equal Contribution

Indicates Corresponding Author

Abstract

Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, which often comes with a significant performance gap between the two. Interestingly, we find that an excessively large performance gap can hamper the training process. To alleviate this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch along with the student to maintain a reasonable performance gap. To further strengthen distillation, we develop a hard strategy that enforces parameter sharing between the two models. In addition, we build soft bidirectional mappings between them through Inverse Reparameterization (IR) and Channel-Branch Reparameterization (CBR). IR initializes a larger dynamic teacher with approximately the same accuracy as the student to avoid an excessively large gap in the early stage of training. CBR enables direct extraction of an effective student model from the dynamic teacher without post-training. In experiments, GPD significantly outperforms existing distillation methods on top of both CNNs and transformers, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding large improvements of 1.80% and 0.89% on ResNet18, respectively.
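
For readers who prefer code, the following is a minimal PyTorch sketch of the joint training idea described above: the student, the dynamic teacher (built from the student and sharing parameters with it), and a frozen pre-trained static teacher are optimized together with supervised and distillation losses. The names (student, dynamic_teacher, static_teacher, kd_loss) and the equal loss weighting are illustrative assumptions, not the exact GPD objective; see the paper for the precise formulation.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, tau=4.0):
    # Standard KL-based distillation loss (an assumed choice, not necessarily GPD's).
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

def training_step(student, dynamic_teacher, static_teacher, x, y, optimizer):
    # The optimizer is assumed to cover the union of student and dynamic-teacher
    # parameters; because of parameter sharing, updating the dynamic teacher
    # also updates the shared (student) weights.
    s_logits = student(x)
    dt_logits = dynamic_teacher(x)
    with torch.no_grad():
        st_logits = static_teacher(x)            # pre-trained, frozen teacher

    loss = (
        F.cross_entropy(s_logits, y)             # student supervised loss
        + F.cross_entropy(dt_logits, y)          # dynamic teacher supervised loss
        + kd_loss(s_logits, st_logits)           # static teacher -> student
        + kd_loss(s_logits, dt_logits.detach())  # dynamic teacher -> student
        + kd_loss(dt_logits, st_logits)          # static teacher -> dynamic teacher
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()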

Method

Gap Preserving Distillation (GPD) introduces a novel approach to knowledge distillation by training a dynamic teacher alongside the student model. The method employs Inverse Reparameterization (IR) to expand the student model along both the channel and branch dimensions, creating a larger dynamic teacher that initially attains approximately the same accuracy as the student. This prevents an excessively large performance gap in the early stages of training. To strengthen knowledge transfer, GPD shares parameters between the student and the dynamic teacher through Channel-Branch Reparameterization (CBR). CBR also enables direct extraction of an effective student model from the dynamic teacher without post-training, as sketched below.
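
Below is a minimal PyTorch sketch of the expansion/extraction idea for a single linear layer. BranchExpandedLinear is a hypothetical helper: the extra branches are zero-initialized so the expanded teacher block initially computes exactly the student's function, and because the student layer is shared inside the block, it can be read back out at any time without post-training. GPD's actual IR/CBR also expand the channel dimension and operate on full networks; those details are omitted here.

import torch
import torch.nn as nn

class BranchExpandedLinear(nn.Module):
    # Hypothetical helper illustrating branch expansion with parameter sharing.
    def __init__(self, student_layer: nn.Linear, num_extra_branches: int = 1):
        super().__init__()
        self.shared = student_layer                      # hard parameter sharing with the student
        self.extra = nn.ModuleList()
        for _ in range(num_extra_branches):
            branch = nn.Linear(student_layer.in_features,
                               student_layer.out_features, bias=False)
            nn.init.zeros_(branch.weight)                # zero-initialized: no effect at init
            self.extra.append(branch)

    def forward(self, x):
        out = self.shared(x)
        for branch in self.extra:
            out = out + branch(x)                        # extra capacity for the dynamic teacher
        return out

    def extract_student(self) -> nn.Linear:
        # The student layer is recovered directly from the shared parameters,
        # without any post-training.
        return self.shared

# Usage: the expanded teacher block initially matches the student exactly.
student_layer = nn.Linear(16, 8)
teacher_block = BranchExpandedLinear(student_layer, num_extra_branches=2)
x = torch.randn(4, 16)
assert torch.allclose(student_layer(x), teacher_block(x))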

Illustration of the Gap Preserving Distillation (GPD) method.

Results

1. Distillation with a Static Teacher

Comparison with existing distillation methods across different architectures:

Results1.png

2. Training from Scratch

Performance when training models from scratch without a static teacher:

Results2.png

3. Fine-tuning Scenario

Results when fine-tuning pre-trained models:

Results3.png

BibTeX


@inproceedings{guo2025gap,
  title     = {Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher},
  author    = {Yong Guo and Shulian Zhang and Haolin Pan and Jing Liu and Yulun Zhang and Jian Chen},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
  url       = {https://openreview.net/forum?id=PnfghHD4Pi}
}