Variational Bayesian Group-Level Sparsification for Knowledge Distillation

Yue Ming, Hao Fu, Yibo Jiang, Hui Yu

Research output: Contribution to journal › Article › peer-review



Deep neural networks are capable of learning powerful representations, but they are often limited by heavy architectures and high computational cost. Knowledge distillation (KD) is an effective way to perform model compression and inference acceleration, but the resulting student models still contain redundant parameters. To tackle this issue, we propose a novel approach, called Variational Bayesian Group-level Sparsification for Knowledge Distillation (VBGS-KD), to distill a large teacher network into a small, sparse student network while preserving accuracy. We impose a sparsity-inducing prior on groups of parameters in the student model and introduce a variational Bayesian approximation to learn structured sparsity, which effectively prunes the majority of the weights. The pruning threshold is learned during training without extra fine-tuning. The proposed method learns robust student networks that achieve satisfactory accuracy and compact sizes compared with state-of-the-art methods. We have validated our method on the MNIST and CIFAR-10 datasets, observing 90.3% sparsity with a 0.19% accuracy improvement on MNIST. Extensive experiments on the CIFAR-10 dataset demonstrate the efficiency of the proposed approach.
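The core idea described in the abstract, placing a sparsity-inducing prior on groups of student-network parameters and pruning whole groups whose learned variational posterior indicates low importance, can be illustrated with a minimal NumPy sketch. This is a hedged, simplified example: the per-group Gaussian posterior, the signal-to-noise pruning criterion, and the fixed threshold below are illustrative assumptions, not the paper's exact parameterisation (in VBGS-KD the threshold is learned during training).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy student layer: 8 groups (e.g. output channels) of 16 weights each.
# Assume each weight in group g has variational posterior N(mu, sigma_g^2),
# with one noise level per group -- a common setup for group-level
# variational Bayesian pruning (illustrative, not the paper's exact form).
n_groups, group_size = 8, 16
mu = rng.normal(0.0, 1.0, size=(n_groups, group_size))
log_sigma2 = rng.normal(-2.0, 1.0, size=(n_groups,))  # per-group log-variance

# Make every other group "unimportant": tiny means, large noise.
mu[::2] *= 0.01
log_sigma2[::2] = 1.0

# Group-level signal-to-noise ratio: ||mu_g|| / sigma_g.
snr = np.linalg.norm(mu, axis=1) / np.exp(0.5 * log_sigma2)

# Prune entire groups whose SNR falls below a threshold (fixed at 1.0 here
# purely for illustration; the paper learns this threshold during training).
keep = snr >= 1.0
sparse_mu = mu * keep[:, None]

sparsity = 1.0 - keep.mean()
print(f"groups kept: {keep.sum()}/{n_groups}, weight sparsity: {sparsity:.0%}")
```

Because pruning acts on whole groups rather than individual weights, the zeros form a structured pattern (entire rows of `sparse_mu`), which is what makes this kind of sparsity exploitable for real inference speedups.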
Original language: English
Pages (from-to): 126628-126636
Journal: IEEE Access
Publication status: Published - 13 Jul 2020


