Figure 1: The framework of our proposed collaborative multi-teacher knowledge distillation for low bit-width DNNs. The collaborative learning among a set of quantized teachers forms useful shared knowledge at certain layers through importance parameters (π). \calL_{KL}^{T} and \calL_{KL}^{S} are for mutual learning between teachers and student via the ensemble logits z̄, where z̄ is calculated from the teachers' logits z^T and the student's logits z^S. CE means Cross-Entropy, and KL means Kullback-Leibler divergence. \calD(.) denotes the loss for intermediate feature-based distillation, which could be the attention loss or the hint loss (FitNet).
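The exact rule for forming the shared knowledge from the teacher features is defined earlier in the paper; as a minimal illustration of the role of π, the following PyTorch sketch combines per-teacher feature maps at one layer with a weighted sum. The function name and the softmax normalization of π are our assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def shared_knowledge(teacher_feats, pi):
    """Combine the teacher feature maps at one layer into shared knowledge.

    teacher_feats: list of n tensors of shape (B, C, H, W), one per teacher.
    pi: learnable tensor of shape (n,) holding this layer's importance
        parameters; it is updated by gradient descent together with the
        network weights.
    """
    # Softmax normalization keeps contributions positive and summing to
    # one -- an assumption; the paper only states that pi weights the
    # teachers' contributions to the shared knowledge.
    w = F.softmax(pi, dim=0)                               # (n,)
    stacked = torch.stack(teacher_feats, dim=0)            # (n, B, C, H, W)
    return (w.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)   # (B, C, H, W)
```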
The attention loss [28] is defined as follows:

\calD_{AT} = \sum_{k \in \calI} \left\| \frac{Q_{k}^{T}}{\|Q_{k}^{T}\|_{2}} - \frac{Q_{k}^{S}}{\|Q_{k}^{S}\|_{2}} \right\|_{p}, (5)

where Q_k^T and Q_k^S are the attention maps of the features F_k^T and F_k^S, respectively, and ∥.∥_p is the l_p norm, which could be the l1 or l2 norm. The attention map of a feature map F ∈ R^{c×h×w} that has c channels is defined as Q = \sum_{j=1}^{c} |A_j|^p, where A_j = F(j,:,:) is the j-th channel of F and |.| is the element-wise absolute value. In our implementation we use p = 2.
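A minimal PyTorch sketch of the attention map and Eq. (5), assuming teacher and student feature maps of matching spatial size; the function names and the batch-mean reduction are our own implementation choices.

```python
import torch
import torch.nn.functional as F

def attention_map(feat, p=2):
    # Q = sum_j |A_j|^p over the c channels of F in R^{c x h x w}
    return feat.abs().pow(p).sum(dim=1)          # (B, h, w)

def attention_loss(teacher_feats, student_feats, p=2):
    # D_AT (Eq. (5)): distance between l2-normalized, flattened attention
    # maps, summed over the selected layers k in I.
    loss = 0.0
    for f_t, f_s in zip(teacher_feats, student_feats):
        q_t = F.normalize(attention_map(f_t, p).flatten(1), p=2, dim=1)
        q_s = F.normalize(attention_map(f_s, p).flatten(1), p=2, dim=1)
        loss = loss + (q_t - q_s).norm(p=2, dim=1).mean()
    return loss
```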
The FitNet loss [26] is defined as follows:

\calD_{HT} = \sum_{k \in \calI} \left\| F_{k}^{T} - r(F_{k}^{S}) \right\|_{p}, (6)

where r(.) is a convolutional layer that adapts the student feature map F_k^S before comparing it to the shared-knowledge teacher feature map F_k^T. In our method, we follow existing works [10, 26] in which r(.) is a convolutional layer with a kernel size of 1 × 1.
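A corresponding sketch of the hint loss in Eq. (6) with the 1 × 1 convolutional adapter r(.); the class name, channel arguments, and batch-mean reduction are ours, and matching spatial dimensions are assumed.

```python
import torch.nn as nn

class HintLoss(nn.Module):
    """D_HT (Eq. (6)): match the shared-knowledge teacher feature map
    F^T_k with the student feature map F^S_k after the adapter r(.)."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # r(.): a 1x1 convolution mapping student channels to teacher channels
        self.r = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, f_t, f_s):
        # ||F^T - r(F^S)||_2 per sample, averaged over the batch
        diff = f_t - self.r(f_s)
        return diff.flatten(1).norm(p=2, dim=1).mean()
```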
3.3.3 Mutual learning between teachers and student

In addition to the intermediate feature-based distillation, we also leverage mutual learning [35], defined on the logits of the networks. Mutual learning allows the student to give feedback about its learning to the teachers. This learning mechanism encourages both teachers and student to simultaneously adjust their parameters to achieve an overall learning objective, e.g., minimizing a total loss function.

Mutual learning [35] applies KL losses to the softmax outputs of the networks. However, due to the diversity of the output logits from different networks, this method may hurt the performance of the models. To overcome this issue, we adopt KDCL-MinLogit [8], a simple and effective method to ensemble the logits of teachers and student. In particular, this method selects the minimum logit value for each category.

Let z^T and z^S be the logit outputs of the combined teacher model T and the student model S, let z^{T,c} and z^{S,c} be the elements of z^T and z^S corresponding to the target class c, and let 1 be the all-ones vector; we denote the target-shifted logit vectors z^{T,c} = z^T − z^{T,c}·1 and z^{S,c} = z^S − z^{S,c}·1. The element z̄_i of the ensemble logits z̄ is computed as follows:

\overline{z}_i = \min\{z_i^{T,c},\, z_i^{S,c}\}, \quad i = 1, 2, \dots, m, (7)

where z_i^{T,c} and z_i^{S,c} are the i-th elements of z^{T,c} and z^{S,c}, and m is the number of classes.
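Eq. (7) translates directly into a few lines of PyTorch; only the function name is ours.

```python
import torch

def kdcl_min_logit(z_t, z_s, target):
    """KDCL-MinLogit ensemble (Eq. (7)).

    z_t, z_s: (B, m) logits of the combined teacher T and the student S.
    target:   (B,) ground-truth class indices c for each sample.
    Returns the ensemble logits z_bar of shape (B, m).
    """
    idx = target.unsqueeze(1)             # (B, 1)
    z_t_c = z_t - z_t.gather(1, idx)      # z^{T,c} = z^T - z^{T,c} * 1
    z_s_c = z_s - z_s.gather(1, idx)      # z^{S,c} = z^S - z^{S,c} * 1
    return torch.minimum(z_t_c, z_s_c)    # z_bar_i = min(z_i^{T,c}, z_i^{S,c})
```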
The mutual learning is then defined as follows:

\calL_{KL}^{S} = \calT^{2} \times KL(\overline{\bp} \,||\, \bp^{S}), (8)

\calL_{KL}^{T} = \calT^{2} \times KL(\overline{\bp} \,||\, \bp^{T}), (9)
where KL denotes the Kullback-Leibler divergence and \calT is the temperature parameter; p̄, p^S, and p^T are the soft logits of z̄, z^S, and z^T, respectively. The soft logit p is defined as p = softmax(z/\calT).

Models                 Bit-width      Top 1   Top 5
Single model           FP             72.4    91.3
Single model           8 bits         70.9    90.9
Single model           6 bits         70.8    90.8
Single model           4 bits         70.7    90.8
Single model           2 bits         69.4    90.5
KD (from FP teacher)   2 bits         71.3    91.6
Average teacher        2 bits         71.0    91.6
CMT-KD (w/o Att)       2 bits         71.8    91.8
CMT-KD (w/o ML)        2 bits         70.9    91.3
CMT-KD                 2 bits         72.1    91.9
Combined teacher       4, 6, 8 bits   72.3    91.7

Table 1: Ablation studies on the CIFAR-100 dataset with AlexNet. The descriptions of the settings are presented in Section 4.2.

Finally, the overall loss of our proposed collaborative and mutual learning for the classification task is defined as

\calL = \alpha \times (\calL_{CE}^{S} + \calL_{CE}^{T}) + \beta \times (\calL_{KL}^{S} + \calL_{KL}^{T}) + \gamma \times \calL_{feat}, (10)

where α, β, and γ are the hyper-parameters of the total loss and \calL_{CE} is the standard cross-entropy loss calculated using the corresponding soft logits and the ground-truth labels. During training, both the model weights of the teachers and the student and the importance factors of the teachers, i.e., π_{ik} ∀i, k, are updated by minimizing \calL using gradient descent.
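Putting Eqs. (8)–(10) together, a minimal sketch of the overall loss follows. The temperature value and the use of raw logits for the cross-entropy terms are assumptions, since this excerpt fixes only α, β, and γ (Section 4.1).

```python
import torch.nn.functional as F

def total_loss(z_t, z_s, z_bar, feat_loss, labels,
               T=4.0, alpha=1.0, beta=0.5, gamma=100.0):
    # Soft logits: p = softmax(z / T). T is illustrative; alpha and beta
    # follow Section 4.1, gamma = 100 (attention loss) or 1 (FitNet loss).
    p_bar = F.softmax(z_bar / T, dim=1)
    log_p_s = F.log_softmax(z_s / T, dim=1)
    log_p_t = F.log_softmax(z_t / T, dim=1)

    # Eqs. (8)-(9): T^2 * KL(p_bar || p^S) and T^2 * KL(p_bar || p^T)
    kl_s = T * T * F.kl_div(log_p_s, p_bar, reduction="batchmean")
    kl_t = T * T * F.kl_div(log_p_t, p_bar, reduction="batchmean")

    # Cross-entropy with the ground-truth labels (computed here on the
    # raw logits; the paper states it uses the corresponding soft logits)
    ce_s = F.cross_entropy(z_s, labels)
    ce_t = F.cross_entropy(z_t, labels)

    # Eq. (10)
    return alpha * (ce_s + ce_t) + beta * (kl_s + kl_t) + gamma * feat_loss
```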
4. Experiments

4.1. Experimental setup

Datasets. We conduct experiments on the CIFAR-100 [16] and ImageNet (ILSVRC-2012) [27] datasets. The CIFAR-100 dataset consists of 100 classes with a total of 60,000 images, of which 50,000 and 10,000 images are used for the training and test sets, respectively. ImageNet is a large-scale dataset with 1,000 classes in total. It contains 1.2 million images for training and 50,000 images for validation; the latter is used as the test set in our experiments.

We set the weight decay to 25e-6 for 1- or 2-bit precision and to 1e-4 for higher precisions. Regarding the hyper-parameters of the total loss (10), we empirically set α = 1 and β = 0.5. In our experiments, the shared knowledge is formed at certain layers of the teachers and is distilled to the corresponding layers of the student. Specifically, the shared knowledge is formed at the last convolutional layer of each convolution block, i.e., layers 2, 5, and 7 of the AlexNet teachers and layers 5, 9, 13, and 17 of the ResNet18 teachers. Meanwhile, we set γ to 100 when the attention loss is used and to 1 when the FitNet loss is used in \calL_{feat}. We do not quantize the first and last layers.
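For reference, the setup above can be summarized in a small configuration sketch; the dictionary structure and key names are ours, not from the paper's code.

```python
# Hypothetical configuration mirroring the setup described in Section 4.1.
CONFIG = {
    "alexnet": {
        "distill_layers": [2, 5, 7],        # last conv layer of each block
        "teacher_bits": [4, 6, 8],
    },
    "resnet18": {
        "distill_layers": [5, 9, 13, 17],
        "teacher_bits": [4, 6, 8],
    },
    "loss": {"alpha": 1.0, "beta": 0.5,
             "gamma_attention": 100.0, "gamma_fitnet": 1.0},
    "weight_decay": {"1_or_2_bit": 25e-6, "higher_bits": 1e-4},
}
```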
[Figure 2 plots omitted: importance factors versus training epoch, with panels (a)–(d) corresponding to layers 5, 9, 13, and 17.]

Figure 2: The importance factors of the three ResNet18 teachers (4-bit, 6-bit, 8-bit) on the CIFAR-100 dataset during the training process.
With Kw = 2 and Ka = 2, CMT-KD (HWGQ) outperforms the HWGQ [4] method by 0.6% and 0.5% in top-1 accuracy for AlexNet and ResNet18, respectively.

As the proposed framework is flexible with respect to quantizers, we also report in Table 5 the results when the LSQ [7] quantizer is used to quantize the teacher and student networks in our framework, i.e., CMT-KD (LSQ). LSQ is a quantization method in which the step size is learned during training. When ResNet18 is used, given the same LSQ quantizer implementation¹, our method CMT-KD (LSQ) boosts the top-1 accuracy of LSQ* by 1.1%. However, we note that the results reported in LSQ [7] slightly outperform CMT-KD (LSQ). It is worth noting that in order to achieve the reported results (67.9% top-1 and 88.1% top-5), LSQ [7] also uses knowledge distillation to distill knowledge from the full precision model to the quantized model². Our best method, CMT-KD (LSQ), compares favorably to the recent method SPEQ [2] for both AlexNet and ResNet18 models.

Visualization of the importance factors (π) of teachers. Figure 2 visualizes the importance factors of the three teachers (4-bit, 6-bit, 8-bit) when the ResNet18 architecture is used for the teachers and the student. The experiment is conducted on the CIFAR-100 dataset. The visualized importance factors are at the last convolutional layer of each block of ResNet18 during training, i.e., layers 5, 9, 13, and 17. These are also the layers at which the shared knowledge is formed. Figure 2b shows that the highest-precision teacher (8-bit) does not always give the highest contribution to the shared knowledge at all layers. For example, at layer 9, the importance factor of the 6-bit teacher is higher than that of the 8-bit teacher. At the last convolutional layer (i.e., layer 17), the 8-bit teacher dominates the other teachers and gives most of the contribution to the shared knowledge. In addition, the importance factors are mainly updated at the early training stage of the framework.

¹ The official source code of LSQ is not available. We adopt the LSQ quantizer from an unofficial implementation at https://github.com/hustzxd/LSQuantization for our experiments.
² We are unsuccessful in reproducing the results reported in LSQ [7]. For

5. Conclusion

In this paper, we propose a novel approach for learning low bit-width DNN models by distilling knowledge from multiple quantized teachers. We introduce the idea of collaborative learning that allows teachers to form importance-aware shared knowledge, which is used to guide the student. The proposed framework also leverages the idea of mutual learning, which allows both teachers and student to adjust their parameters to achieve an overall objective function. The proposed framework allows end-to-end training in which not only the network parameters but also the importance factors indicating the contributions of the teachers to the shared knowledge are updated simultaneously. The experimental results on the CIFAR-100 and ImageNet datasets with the AlexNet and ResNet18 architectures demonstrate that the