IDENTITY-SENSITIVE KNOWLEDGE PROPAGATION FOR CLOTH-CHANGING PERSON RE-IDENTIFICATION

Jianbing Wu†  Hong Liu†∗  Wei Shi†  Hao Tang‡  Jingwen Guo†
†Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, China
‡Computer Vision Lab, ETH Zurich, Switzerland
{kimbing.ng,jingwenguo}@stu.pku.edu.cn, {hongliu,pkusw}@pku.edu.cn, hao.tang@vision.ee.ethz.ch
arXiv:2208.12023v1 [cs.CV] 25 Aug 2022
Fig. 2. The architecture of the global stream and the face stream. The global stream learns cloth-irrelevant knowledge under the guidance of the human parsing module, and the face stream extracts representative facial features from low-resolution images by acquiring knowledge from the teacher network. The notations are detailed in Section 2.
training a teacher network, which can restore missing facial details from degraded images and learn representative features. After that, a facial knowledge propagation loss is introduced to transfer the facial knowledge to a simpler student network. Note that, with our well-designed knowledge propagation strategies, the cumbersome human parsing and face restoration modules can be removed during inference to reduce redundant computations.

Our contributions are summarized as follows: (1) We introduce a cloth-irrelevant spatial attention module for the CC-ReID task to effectively discourage the misleading features of clothes regions while preserving the identity-sensitive ones. (2) To mitigate the resolution degradation issue of surveillance images, we further propose the idea of restoring the missing facial details and propagating the facial knowledge to a smaller network. (3) Extensive experiments show that the proposed framework outperforms existing methods by a large margin without relying on the human parsing or face restoration module during inference.

2. METHODOLOGY

The key to addressing the CC-ReID problem is to reduce the attention to the clothes regions and learn identity-sensitive features from biometric traits, such as human faces. We thus design a two-stream architecture, consisting of a global stream and a face stream, to learn both global cloth-irrelevant features and enhanced facial features. Fig. 2 illustrates the architecture of the two streams; more details are discussed in the following subsections.

2.1. Mask-Guided Cloth-irrelevant Feature Stream

The clothing appearance is not reliable in cloth-changing settings. A naive solution is to utilize spatial attention mechanisms to explicitly discourage clothing features. However, it is difficult to learn effective attention weights without auxiliary supervision. Instead, we design a new Cloth-irrelevant Spatial Attention (CSA) module to learn effective attention maps under the guidance of the human parsing network.

Cloth-irrelevant Spatial Attention. As illustrated in the global stream of Fig. 2, given a person image X^g, we first pass it through the truncated OSNet [18] backbone, up until the first layer of its third convolutional block, to derive the middle feature maps F of size h_f × w_f × c_f. Taking F as the input, our CSA module then produces a cloth-irrelevant attention map, which can be formulated as:

\hat{\Psi} = \sigma(W_{pw} * F + b),   (1)

where * denotes the convolution operation, W_{pw} denotes the point-wise convolution filters, and \sigma(x) = 1/(1 + \exp(-x)) is the sigmoid function. The attention map \hat{\Psi} is then applied to the feature F using the following formula:

F_{att} = F \otimes \hat{\Psi},   (2)

where \otimes denotes the Hadamard matrix product, and F_{att} is the refined feature maps. We then feed F_{att} into the Multi-Branch Network (MBN) introduced in [4], which combines global, part-based, and channel features in a multi-branch architecture. The resulting embeddings are grouped into two sets following [4], denoted by I^g and R^g, respectively.

Mask-Guided Attention Loss. The attention map \hat{\Psi} is expected to have lower scores in the clothes regions. However, without extra auxiliary supervision, it is hard to guarantee that the CSA module produces cloth-irrelevant attention scores. To this end, we introduce a mask-guided attention loss to enable the CSA module to learn effective attention maps. As depicted in Fig. 2, we first utilize the pre-trained human parsing module [19] to estimate a fine-grained mask \Psi_p of human parts, taking the image X^g as the input. Each grid in the mask \Psi_p represents the category of the corresponding pixel in X^g (e.g., arm, leg, dress, skirt). After that, a cloth-irrelevant mask can be obtained using the following formula:

\tilde{\Psi}_{(i,j)} = \begin{cases} \varepsilon, & \text{if } \Psi_{p(i,j)} \in C, \\ 1, & \text{otherwise}. \end{cases}   (3)
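For concreteness, the attention computation of Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the single point-wise filter, the helper names (csa_attention, cloth_irrelevant_mask), and the shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def csa_attention(F, w_pw, b):
    """Eqs. (1)-(2): a point-wise (1x1) convolution followed by a sigmoid
    yields the attention map; the Hadamard product re-weights F."""
    # F: (h_f, w_f, c_f); w_pw: (c_f,) one point-wise filter; b: scalar bias
    psi_hat = sigmoid(np.einsum("hwc,c->hw", F, w_pw) + b)  # (h_f, w_f)
    f_att = F * psi_hat[..., None]  # broadcast the map over all channels
    return psi_hat, f_att

def cloth_irrelevant_mask(parsing, cloth_categories, eps=0.1):
    """Eq. (3): eps on cloth-related pixels, 1 elsewhere."""
    return np.where(np.isin(parsing, list(cloth_categories)), eps, 1.0)
```

Here eps plays the role of ε in Eq. (3), and cloth_categories the role of the set C of cloth-related parsing labels.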
Here, C denotes the set of cloth-related categories, and \varepsilon is a hyper-parameter to avoid producing near-zero attention scores. The mask-guided attention loss is defined as:

L_{att} = \frac{1}{h_f \cdot w_f} \sum_{i=1}^{h_f} \sum_{j=1}^{w_f} \left( \hat{\Psi}_{(i,j)} - \tilde{\Psi}'_{(i,j)} \right)^2,   (4)

where \tilde{\Psi}' is obtained by resizing \tilde{\Psi} to the same size as \hat{\Psi}.

2.2. Facial Feature Enhancement Stream

The neural systems of human beings can perfectly identify persons by exploiting facial cues, even with resolution-degraded images, thanks to their capability of recovering missing facial details. Inspired by this, we design a teacher-student network in the face stream to propagate facial knowledge from the knowledgeable teacher network \phi_t, which is trained beforehand, to the separate student network \phi_s. Both the teacher and the student networks take the face image X^f as input, which is detected from the image X^g by the face detector [20]. As illustrated in Fig. 2, the teacher first restores the facial details using the pre-trained face restoration module GPEN [21]. The enhanced face is then fed to the truncated OSNet and the MBN to derive the feature vector set R^t and the logit output vector set I^t. By optimizing the cross-entropy loss using I^t and the triplet loss using R^t, a knowledgeable teacher model can be obtained.

After training the teacher network, we fix its parameters and propagate the informative facial knowledge to the simpler student network \phi_s, where the cumbersome face restoration module is removed. Similar to the teacher \phi_t, the outputs of the student \phi_s are also grouped into two sets, denoted by I^s and R^s, respectively. Inspired by the idea of [22], we design a facial knowledge propagation loss to transfer the knowledge to the student, which is defined as:

L_{fkp} = \tau^2 \sum_i \mathrm{KL}\left( S(I^t_{(i)}, \tau) \,\|\, S(I^s_{(i)}, \tau) \right),   (5)

where KL denotes the Kullback–Leibler divergence and \tau is the temperature parameter of the softmax function S. Here, the softmax function S, defined as S(z, \tau)_i = e^{z_i/\tau} / \sum_j e^{z_j/\tau}, is designed to produce knowledgeable soft labels for the student \phi_s. The term \tau^2 in Eq. (5) is adopted to guarantee that the magnitude of the computed loss remains unchanged for different values of \tau. By optimizing L_{fkp}, the student network \phi_s is endowed with the capability of extracting discriminative facial features from low-resolution images without relying on the face restoration module.

2.3. Training and Inference

Training. Apart from the aforementioned losses, we also optimize the batch hard triplet loss [17] using all features in R^g \cup R^s, and the cross-entropy loss using all logit outputs in I^g \cup I^s for basic discrimination learning, formulated as:

L_{gce} = \sum_{\hat{y} \in I^g} \mathrm{CE}(\hat{y}, y), \quad L_{sce} = \sum_{\hat{y} \in I^s} \mathrm{CE}(\hat{y}, y),
L_{trip} = \sum_{f \in R^g} \mathrm{Trip}(f, y) + \sum_{f \in R^s} \mathrm{Trip}(f, y),   (6)

where CE and Trip denote the cross-entropy loss and the batch hard triplet loss, respectively. The term y denotes the identity labels of person images. The overall framework is trained by optimizing the total loss L, which is defined as:

L = \lambda L_{att} + L_{trip} + (\alpha L_{fkp} + (1 - \alpha) L_{sce}) + L_{gce},   (7)

where \lambda and \alpha are adopted to balance the different losses.

Inference. The inference stage, during which the human parsing module in the global stream and the teacher network \phi_t in the face stream are removed, can be seen as a cosine-similarity-based retrieval process. Embeddings obtained before the last fully connected layer in the MBNs are concatenated for similarity computation, as in [4]. For query images with no face detected, only the global stream is enabled to compute feature vectors.

3. EXPERIMENTS AND DISCUSSIONS

3.1. Datasets and Experimental Setups

Our experiments are conducted on three widely used CC-ReID datasets: Celeb-reID [7], Celeb-reID-light [7], and PRCC [8]. The Celeb-reID dataset is made up of person images in which over 70% of samples show different clothes. The Celeb-reID-light dataset is a light but challenging version of Celeb-reID, since all images of each person are in different clothes. The PRCC dataset provides both cross-clothes and same-clothes settings to support in-depth evaluations. The values of \varepsilon, \lambda, \tau, and \alpha are determined by cross-validation. Specifically, for Celeb-reID and Celeb-reID-light, we set \varepsilon = 0.1, \lambda = 7, \tau = 5, and \alpha = 0.7. For the PRCC dataset, we set \varepsilon = 0.1, \lambda = 7, \tau = 1, and \alpha = 0.8. Other experimental settings follow our previous work [13]. The cumulative matching characteristics (CMC) and mean average precision (mAP) are reported in the following subsections.

3.2. Comparison with State-of-the-Art Methods

We compare our method with several common Re-ID models [1–4] and state-of-the-art CC-ReID approaches [7–14]. In the tables hereafter, "R-k" denotes rank-k accuracy, "-" denotes not reported, and the best and second-best results are shown in bold and underline, respectively.

Table 1 shows the experimental results on the Celeb-reID-light and Celeb-reID datasets. Compared with the state-of-the-art method IRANet [13], which relies on the off-the-shelf pose estimator [15], our method achieves higher performance on both datasets. For the PRCC dataset, we conduct experiments in both the cross-clothes and same-clothes settings. As shown in Table 2, the proposed DeSKPro outperforms the best existing method by a large margin in the cross-clothes setting. Notice that our method also surpasses FSAM [12], which likewise attempts to transfer knowledge from pre-trained networks to save extra computations. This is because FSAM only complements shape knowledge in the appearance stream while omitting the identity-sensitive cues from human faces. Moreover,
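For concreteness, the distillation objective of Eq. (5) and the loss combination of Eq. (7) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names and the scalar loss placeholders are hypothetical.

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax S(z, tau) from Eq. (5), numerically stabilized."""
    e = np.exp(z / tau - np.max(z / tau))
    return e / e.sum()

def kl_div(p, q):
    """Kullback-Leibler divergence KL(p || q) for probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def fkp_loss(teacher_logits, student_logits, tau):
    """Eq. (5): tau^2 * sum_i KL(S(I^t_i, tau) || S(I^s_i, tau)).
    The tau^2 factor keeps the loss magnitude comparable across tau values."""
    return tau ** 2 * sum(
        kl_div(softmax(t, tau), softmax(s, tau))
        for t, s in zip(teacher_logits, student_logits)
    )

def total_loss(l_att, l_trip, l_fkp, l_sce, l_gce, lam, alpha):
    """Eq. (7): weighted combination of all training losses."""
    return lam * l_att + l_trip + (alpha * l_fkp + (1 - alpha) * l_sce) + l_gce
```

Note that a higher temperature tau softens the teacher's distribution, exposing the relative similarities between identities that hard labels discard.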
Table 1. Comparison on Celeb-reID-light and Celeb-reID datasets (%).

Method          | Celeb-reID-light      | Celeb-reID
                | R-1   R-5   mAP       | R-1   R-5   mAP
HACNN [1]       | 16.2  -     11.5      | 47.6  -     9.5
MGN [3]         | 21.5  -     13.9      | 49.0  -     10.8
AFD-Net [10]    | 22.2  51.0  11.3      | 52.1  66.1  10.6
LightMBN [4]    | 32.9  67.0  18.8      | 57.3  71.6  14.3
ReIDCaps+ [7]   | 33.5  63.3  19.0      | 63.0  76.3  15.8
RCSANet [11]    | 46.6  -     24.4      | 65.3  -     17.5
CASE-Net [9]    | 35.1  66.7  20.4      | 66.4  78.1  18.2
IRANet [13]     | 46.2  72.7  25.4      | 64.1  78.7  19.0
DeSKPro (Ours)  | 52.0  81.6  29.8      | 68.6  82.3  22.7

Table 3. Ablation study on the Celeb-reID-light dataset. "G" denotes the global stream without the CSA module, and φs+/φs− denotes the student network trained with/without L_fkp.

Model         | Body Stream    | Face Stream    | Celeb-reID-light
              | G   CSA  L_att | φs−  φs+  φt   | R-1   mAP
1 (baseline)  | ✓              |                | 32.9  18.8
2             | ✓   ✓          |                | 33.5  20.1
3             | ✓   ✓    ✓     |                | 37.2  20.3
4             |                | ✓              | 41.6  20.7
5             |                |      ✓         | 47.0  25.6
6             |                |           ✓    | 47.4  25.8
7             | ✓   ✓    ✓     | ✓              | 50.1  27.4
DeSKPro       | ✓   ✓    ✓     |      ✓         | 52.0  29.8
DeSKPro∗      | ✓   ✓    ✓     |           ✓    | 53.7  30.9
[4] Fabian Herzog, Xunbo Ji, Torben Teepe, et al., "Lightweight Multi-Branch Network for Person Re-Identification," in IEEE International Conference on Image Processing, 2021, pp. 1129–1133.

[5] Shijie Yu, Shihua Li, Dapeng Chen, et al., "COCAS: A Large-Scale Clothes Changing Person Dataset for Re-Identification," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3397–3406.

[6] Fangbin Wan, Yang Wu, Xuelin Qian, et al., "When Person Re-identification Meets Changing Clothes," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 3620–3628.

[7] Yan Huang, Jingsong Xu, Qiang Wu, et al., "Beyond Scalar Neuron: Adopting Vector-Neuron Capsules for Long-Term Person Re-Identification," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, pp. 3459–3471, 2020.

[8] Qize Yang, Ancong Wu, and Wei-Shi Zheng, "Person Re-Identification by Contour Sketch Under Moderate Clothing Change," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 6, pp. 2029–2046, June 2021.

[9] Yu-Jhe Li, Xinshuo Weng, and Kris M. Kitani, "Learning Shape Representations for Person Re-Identification under Clothing Change," in IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2432–2441.

[10] Wanlu Xu, Hong Liu, Wei Shi, et al., "Adversarial Feature Disentanglement for Long-Term Person Re-identification," in International Joint Conference on Artificial Intelligence, 2021, pp. 1201–1207.

[11] Yan Huang, Qiang Wu, JingSong Xu, et al., "Clothing Status Awareness for Long-Term Person Re-Identification," in IEEE/CVF International Conference on Computer Vision, 2021, pp. 11895–11904.

[15] Riza Alp Guler, Natalia Neverova, and Iasonas Kokkinos, "DensePose: Dense Human Pose Estimation in the Wild," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7297–7306.

[16] Xuelin Qian, Wenxuan Wang, Li Zhang, et al., "Long-Term Cloth-Changing Person Re-identification," in Asian Conference on Computer Vision, 2021, pp. 71–88.

[17] Alexander Hermans, Lucas Beyer, and Bastian Leibe, "In Defense of the Triplet Loss for Person Re-Identification," arXiv preprint arXiv:1703.07737, 2017.

[18] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, et al., "Omni-Scale Feature Learning for Person Re-Identification," in IEEE/CVF International Conference on Computer Vision, 2019, pp. 3701–3711.

[19] Peike Li, Yunqiu Xu, Yunchao Wei, et al., "Self-Correction for Human Parsing," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

[20] Jiankang Deng, Jia Guo, Evangelos Ververas, et al., "RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5202–5211.

[21] Tao Yang, Peiran Ren, Xuansong Xie, et al., "GAN Prior Embedded Network for Blind Face Restoration in the Wild," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 672–681.

[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the Knowledge in a Neural Network," arXiv preprint arXiv:1503.02531, 2015.