
IDENTITY-SENSITIVE KNOWLEDGE PROPAGATION FOR CLOTH-CHANGING PERSON RE-IDENTIFICATION

Jianbing Wu†  Hong Liu†∗  Wei Shi†  Hao Tang‡  Jingwen Guo†

†Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, China
‡Computer Vision Lab, ETH Zurich, Switzerland
{kimbing.ng,jingwenguo}@stu.pku.edu.cn, {hongliu,pkusw}@pku.edu.cn, hao.tang@vision.ee.ethz.ch

∗Corresponding author. This work is supported by the National Natural Science Foundation of China (No. 62073004) and the Shenzhen Fundamental Research Program (No. JCYJ20200109140410340).

ABSTRACT

Cloth-changing person re-identification (CC-ReID), which aims to match person identities under clothing changes, is a rising research topic of recent years. However, typical biometrics-based CC-ReID methods often require cumbersome pose or body part estimators to learn cloth-irrelevant features from human biometric traits, which comes with high computational costs. Besides, their performance is significantly limited by the resolution degradation of surveillance images. To address these limitations, we propose an effective Identity-Sensitive Knowledge Propagation framework (DeSKPro) for CC-ReID. Specifically, a Cloth-irrelevant Spatial Attention module is introduced to eliminate the distraction of clothing appearance by acquiring knowledge from a human parsing module. To mitigate the resolution degradation issue and mine identity-sensitive cues from human faces, we propose to restore the missing facial details using prior facial knowledge, which is then propagated to a smaller network. After training, the extra computations for human parsing and face restoration are no longer required. Extensive experiments show that our framework outperforms state-of-the-art methods by a large margin. Our code is available at https://github.com/KimbingNg/DeskPro.

Index Terms— Cloth-Changing Person Re-Identification, Identity-Sensitive Features, Knowledge Propagation

Fig. 1. Overall architecture of the proposed framework. "CE" and "Triplet" denote two widely used Re-ID losses, namely the cross-entropy loss and the triplet loss [17].

1. INTRODUCTION

Person re-identification (Re-ID), which aims to match pedestrians across non-overlapping cameras, is a challenging task with significant research impact. Recent works [1–4] have achieved remarkable progress by learning powerful appearance representations, based on the assumption that images of the same identity in both the query set and the gallery set share the same clothing. However, clothing appearance tends to be unreliable in realistic scenes, since people may wear different clothes at different times.

To address this practical problem, Cloth-Changing Person Re-identification (CC-ReID) has drawn increasing attention in recent years [5–14]. Existing CC-ReID methods often focus on mining identity-relevant cues from pre-defined biometric traits. For instance, Yang et al. [8] introduced a spatial polar transformation on contour sketches to learn shape representations. Shi et al. [13] proposed to learn head-guided features with the help of a pose estimator [15]. Qian et al. [16] introduced a shape embedding module to extract biological features from body keypoints. A common issue behind these methods is that extra pose or contour estimation is required during inference, leading to higher computational costs. Other researchers have focused on mining identity-sensitive features from human faces [5, 6], since facial cues are more robust under clothing changes. However, due to the low resolution of surveillance images, these methods also fail to extract representative facial features.

To address these challenges, we propose an Identity-Sensitive Knowledge Propagation framework, termed DeSKPro. As illustrated in Fig. 1, the proposed framework consists of a global cloth-irrelevant feature stream (global stream) and a facial feature enhancement stream (face stream). Both streams acquire informative knowledge from their respective teacher networks during training. In the global stream, a Cloth-irrelevant Spatial Attention (CSA) module, trained under the guidance of a knowledgeable human parsing network, is designed to facilitate the learning of identity-sensitive features. In the face stream, we begin by training a teacher network, which can restore missing facial details from degraded images and learn representative features. After that, a facial knowledge propagation loss is introduced to transfer the facial knowledge to a simpler student network. Note that with our well-designed knowledge propagation strategies, the cumbersome human parsing and face restoration modules can be removed during inference to reduce redundant computations.

Our contributions are summarized as follows: (1) We introduce a cloth-irrelevant spatial attention module for the CC-ReID task to effectively discourage the misleading features of clothes regions while preserving the identity-sensitive ones. (2) To mitigate the resolution degradation issue of surveillance images, we further propose the idea of restoring the missing facial details and propagating the facial knowledge to a smaller network. (3) Extensive experiments show that the proposed framework outperforms existing methods by a large margin without relying on the human parsing or face restoration module during inference.
Fig. 2. The architecture of the global stream and the face stream. The global stream aims to learn cloth-irrelevant knowledge under the guidance of the human parsing module, and the face stream is designed to extract representative facial features from low-resolution images by acquiring knowledge from the teacher network. The notation is detailed in Section 2.

2. METHODOLOGY

The key to addressing the CC-ReID problem is to reduce the attention paid to clothes regions and to learn identity-sensitive features from biometric traits such as human faces. We thus design a two-stream architecture, consisting of a global stream and a face stream, to learn both global cloth-irrelevant features and enhanced facial features. Fig. 2 illustrates the architecture of the two streams; more details are discussed in the following subsections.

2.1. Mask-Guided Cloth-irrelevant Feature Stream

The clothing appearance is not reliable in cloth-changing settings. A naive solution is to utilize spatial attention mechanisms to explicitly discourage clothing features. However, it is difficult to learn effective attention weights without auxiliary supervision. Instead, we design a new Cloth-irrelevant Spatial Attention (CSA) module that learns effective attention maps under the guidance of the human parsing network.

Cloth-irrelevant Spatial Attention. As illustrated in the global stream of Fig. 2, given a person image X^g, we first pass it through the truncated OSNet [18] backbone, up until the first layer of its third convolutional block, to derive the middle feature maps F of size h_f × w_f × c_f. Taking F as the input, our CSA module then produces a cloth-irrelevant attention map:

$$\hat{\Psi} = \sigma(W_{pw} * F + b), \tag{1}$$

where $*$ denotes the convolution operation, $W_{pw}$ denotes the point-wise convolution filters, and $\sigma(x) = 1/(1+\exp(-x))$ is the sigmoid function. The attention map Ψ̂ is then applied to the feature F as:

$$F_{att} = F \otimes \hat{\Psi}, \tag{2}$$

where $\otimes$ denotes the Hadamard product and F_att is the refined feature maps. We then feed F_att into the Multi-Branch Network (MBN) introduced in [4], which combines global, part-based, and channel features in a multi-branch architecture. The resulting embeddings are grouped into two sets following [4], denoted by I^g and R^g, respectively.
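To make the CSA design concrete, the following is a minimal PyTorch sketch of Eqs. (1)–(2); the class and variable names are ours for illustration, not from the released code, which should be consulted for the exact configuration.

```python
import torch
import torch.nn as nn

class CSA(nn.Module):
    """Cloth-irrelevant Spatial Attention: a sketch of Eqs. (1)-(2).

    Assumes the input comes from the truncated OSNet backbone with
    `c_f` channels; names are illustrative, not from the official repo.
    """

    def __init__(self, c_f: int):
        super().__init__()
        # Point-wise (1x1) convolution W_pw with bias b, one output channel
        self.pw_conv = nn.Conv2d(c_f, 1, kernel_size=1, bias=True)

    def forward(self, feat: torch.Tensor):
        # feat: (B, c_f, h_f, w_f)
        psi_hat = torch.sigmoid(self.pw_conv(feat))  # Eq. (1): (B, 1, h_f, w_f)
        feat_att = feat * psi_hat                    # Eq. (2): Hadamard product,
                                                     # broadcast over channels
        return feat_att, psi_hat
```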
Mask-Guided Attention Loss. The attention map Ψ̂ is expected to have lower scores in the clothes regions. However, without extra auxiliary supervision, it is hard to guarantee that the CSA module produces cloth-irrelevant attention scores. To this end, we introduce a mask-guided attention loss to enable the CSA module to learn effective attention maps. As depicted in Fig. 2, we first utilize the pre-trained human parsing module [19] to estimate a fine-grained mask Ψ^p of human parts, taking the image X^g as the input. Each grid in the mask Ψ^p represents the category of the corresponding pixel in X^g (e.g., arm, leg, dress, skirt). After that, a cloth-irrelevant mask can be obtained as:

$$\tilde{\Psi}_{(i,j)} = \begin{cases} \varepsilon, & \text{if } \Psi^p_{(i,j)} \in C, \\ 1, & \text{otherwise.} \end{cases} \tag{3}$$
Here, C denotes the set of cloth-related categories, and ε is a hyper-parameter that avoids producing near-zero attention scores. The mask-guided attention loss is defined as:

$$\mathcal{L}_{att} = \frac{1}{h_f \cdot w_f} \sum_{i=1}^{h_f} \sum_{j=1}^{w_f} \left( \hat{\Psi}_{(i,j)} - \tilde{\Psi}'_{(i,j)} \right)^2, \tag{4}$$

where Ψ̃′ is obtained by resizing Ψ̃ to the same size as Ψ̂.
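The mask construction and loss of Eqs. (3)–(4) can be sketched as follows. Two assumptions are made here: `cloth_ids` stands for the cloth-related label set C, whose exact values depend on the taxonomy of the parsing model [19], and the mask is resized with nearest-neighbor interpolation, which is one reasonable choice where the paper only says "resizing".

```python
import torch
import torch.nn.functional as F

def mask_guided_attention_loss(psi_hat: torch.Tensor,
                               parsing: torch.Tensor,
                               cloth_ids: torch.Tensor,
                               eps: float = 0.1) -> torch.Tensor:
    """psi_hat: (B, 1, h_f, w_f) attention map; parsing: (B, H, W) int labels.

    cloth_ids holds the cloth-related category ids C (taxonomy-dependent);
    eps = 0.1 follows Sec. 3.1.
    """
    # Eq. (3): eps on cloth-related pixels, 1 elsewhere
    psi_tilde = torch.ones_like(parsing, dtype=torch.float32)
    psi_tilde[torch.isin(parsing, cloth_ids)] = eps
    # Resize to the attention map's resolution (the Psi_tilde' of Eq. (4));
    # nearest-neighbor is an assumption, the paper does not pin down the mode
    psi_tilde = F.interpolate(psi_tilde.unsqueeze(1),
                              size=psi_hat.shape[-2:], mode="nearest")
    # Eq. (4): mean squared error over all spatial positions
    return ((psi_hat - psi_tilde) ** 2).mean()
```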
2.2. Facial Feature Enhancement Stream

The neural systems of human beings can identify persons by exploiting facial cues, even in resolution-degraded images. This is thanks to their capability of recovering missing facial details. Inspired by this, we design a teacher-student network in the face stream to propagate facial knowledge from the knowledgeable teacher network φ_t, which is trained beforehand, to a separate student network φ_s. Both the teacher and the student networks take the face image X^f as input, which is detected from the image X^g by the face detector [20]. As illustrated in Fig. 2, the teacher first restores the facial details using the pre-trained face restoration module GPEN [21]. The enhanced face is then fed to the truncated OSNet and the MBN to derive the feature vector set R^t and the logit output vector set I^t. By optimizing the cross-entropy loss using I^t and the triplet loss using R^t, a knowledgeable teacher model can be obtained.

After training the teacher network, we fix its parameters and propagate the informative facial knowledge to the simpler student network φ_s, from which the cumbersome face restoration module is removed. Similar to the teacher φ_t, the outputs of the student φ_s are also grouped into two sets, denoted by I^s and R^s, respectively. Inspired by the idea of [22], we design a facial knowledge propagation loss to transfer the knowledge to the student, defined as:

$$\mathcal{L}_{fkp} = \tau^2 \sum_i \mathrm{KL}\!\left( S(I^t_{(i)}, \tau) \,\big\|\, S(I^s_{(i)}, \tau) \right), \tag{5}$$

where KL denotes the Kullback–Leibler divergence and τ is the temperature parameter of the softmax function S, defined as $S(z, \tau)_i = e^{z_i/\tau} / \sum_j e^{z_j/\tau}$, which is designed to produce knowledgeable soft labels for the student φ_s. The term τ² in Eq. (5) guarantees that the magnitude of the computed loss remains unchanged for different values of τ. By optimizing L_fkp, the student network φ_s is endowed with the capability of extracting discriminative facial features from low-resolution images without relying on the face restoration module.
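Eq. (5) is the standard temperature-scaled distillation objective of [22] applied per logit output. A sketch, assuming I^t and I^s are lists of per-branch logit tensors from the MBNs:

```python
import torch
import torch.nn.functional as F

def facial_knowledge_propagation_loss(teacher_logits: list,
                                      student_logits: list,
                                      tau: float = 5.0) -> torch.Tensor:
    """Sketch of Eq. (5), summed over the MBN logit outputs I^t / I^s.

    tau = 5 is the Celeb-reID setting from Sec. 3.1 (1 for PRCC).
    """
    loss = 0.0
    for l_t, l_s in zip(teacher_logits, student_logits):
        soft_targets = F.softmax(l_t.detach() / tau, dim=1)  # S(I^t, tau)
        log_student = F.log_softmax(l_s / tau, dim=1)        # log S(I^s, tau)
        # KL(teacher || student); tau^2 keeps the loss magnitude
        # comparable across temperatures
        loss = loss + tau ** 2 * F.kl_div(log_student, soft_targets,
                                          reduction="batchmean")
    return loss
```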
2.3. Training and Inference

Training. Apart from the aforementioned losses, we also optimize the batch hard triplet loss [17] using all features in R^g ∪ R^s, and the cross-entropy loss using all logit outputs in I^g ∪ I^s, for basic discrimination learning:

$$\mathcal{L}_{gce} = \sum_{\hat{y} \in I^g} \mathrm{CE}(\hat{y}, y), \qquad \mathcal{L}_{sce} = \sum_{\hat{y} \in I^s} \mathrm{CE}(\hat{y}, y),$$
$$\mathcal{L}_{trip} = \sum_{f \in R^g} \mathrm{Trip}(f, y) + \sum_{f \in R^s} \mathrm{Trip}(f, y), \tag{6}$$

where CE and Trip denote the cross-entropy loss and the batch hard triplet loss, respectively, and y denotes the identity labels of person images. The overall framework is trained by optimizing the total loss:

$$\mathcal{L} = \lambda \mathcal{L}_{att} + \mathcal{L}_{trip} + \left( \alpha \mathcal{L}_{fkp} + (1-\alpha) \mathcal{L}_{sce} \right) + \mathcal{L}_{gce}, \tag{7}$$

where λ and α balance the different losses.
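A sketch of how Eqs. (6)–(7) combine during training. Here `triplet_fn` stands in for any batch-hard triplet loss implementation [17] (passed in rather than reimplemented), and treating I^g, I^s, R^g, R^s as Python lists of per-branch tensors is our assumption about the MBN outputs:

```python
import torch
import torch.nn.functional as F

def training_losses(I_g, I_s, R_g, R_s, labels, l_att, l_fkp,
                    triplet_fn, lam: float = 7.0, alpha: float = 0.7):
    """I_g/I_s: lists of logit tensors; R_g/R_s: lists of feature tensors.

    lam and alpha follow the Celeb-reID setting in Sec. 3.1.
    """
    # Eq. (6): identity classification and metric-learning terms
    l_gce = sum(F.cross_entropy(y_hat, labels) for y_hat in I_g)
    l_sce = sum(F.cross_entropy(y_hat, labels) for y_hat in I_s)
    l_trip = sum(triplet_fn(f, labels) for f in list(R_g) + list(R_s))
    # Eq. (7): alpha trades soft teacher supervision against hard labels
    return lam * l_att + l_trip + alpha * l_fkp + (1 - alpha) * l_sce + l_gce
```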
Inference. At inference, the human parsing module in the global stream and the teacher network φ_t in the face stream are removed, and inference reduces to a cosine-similarity-based retrieval process. Embeddings obtained before the last fully connected layer in the MBNs are concatenated for similarity computation, as in [4]. For query images with no face detected, we only enable the global stream to compute feature vectors.
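The retrieval itself is a standard cosine-similarity ranking. A sketch, assuming the per-branch MBN embeddings have already been concatenated into a single vector per image:

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_emb: torch.Tensor,
                 gallery_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity retrieval over concatenated MBN embeddings.

    query_emb: (Nq, D), gallery_emb: (Ng, D); returns ranked gallery
    indices for each query, best match first.
    """
    q = F.normalize(query_emb, dim=1)
    g = F.normalize(gallery_emb, dim=1)
    sim = q @ g.t()                                # (Nq, Ng) similarities
    return sim.argsort(dim=1, descending=True)
```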
3. EXPERIMENTS AND DISCUSSIONS

3.1. Datasets and Experimental Setups

Our experiments are conducted on three widely used CC-ReID datasets: Celeb-reID [7], Celeb-reID-light [7], and PRCC [8]. The Celeb-reID dataset is made up of person images in which over 70% of the samples show different clothes. The Celeb-reID-light dataset is a light but more challenging version of Celeb-reID, since all images of each person are in different clothes. The PRCC dataset provides both cross-clothes and same-clothes settings to support in-depth evaluations. The values of ε, λ, τ, and α are determined by cross validation. Specifically, for Celeb-reID and Celeb-reID-light, we set ε = 0.1, λ = 7, τ = 5, and α = 0.7. For the PRCC dataset, we set ε = 0.1, λ = 7, τ = 1, and α = 0.8. Other experimental settings follow our previous work [13]. The cumulative matching characteristics (CMC) and mean Average Precision (mAP) are reported in the following subsections.

3.2. Comparison with State-of-the-Art Methods

We compare our method with several common Re-ID models [1–4] and state-of-the-art CC-ReID approaches [7–14]. In the tables hereafter, "R-k" denotes rank-k accuracy, "-" denotes not reported, and the best and second-best results are marked in bold and underline, respectively.

Table 1 shows the experimental results on the Celeb-reID-light and Celeb-reID datasets. Compared with the state-of-the-art method IRANet [13], which relies on an off-the-shelf pose estimator [15], our method achieves higher performance on both datasets. For the PRCC dataset, we conduct experiments in both the cross-clothes and same-clothes settings. As shown in Table 2, the proposed DeSKPro outperforms the best existing method by a large margin in the cross-clothes setting. Notice that our method surpasses FSAM [12] as well, which also attempts to transfer knowledge from pre-trained networks to save extra computations. This is because FSAM only complements shape knowledge in the appearance stream while omitting the identity-sensitive cues from human faces; moreover, the clothing appearance is not explicitly suppressed in their method, which damages the robustness of the features. Although DeSKPro does not outperform all existing methods in the same-clothes setting, it still attains competitive results.

Table 1. Comparison on the Celeb-reID-light and Celeb-reID datasets (%).

| Method          | Celeb-reID-light (R-1 / R-5 / mAP) | Celeb-reID (R-1 / R-5 / mAP) |
| HACNN [1]       | 16.2 / -    / 11.5 | 47.6 / -    / 9.5  |
| MGN [3]         | 21.5 / -    / 13.9 | 49.0 / -    / 10.8 |
| AFD-Net [10]    | 22.2 / 51.0 / 11.3 | 52.1 / 66.1 / 10.6 |
| LightMBN [4]    | 32.9 / 67.0 / 18.8 | 57.3 / 71.6 / 14.3 |
| ReIDCaps+ [7]   | 33.5 / 63.3 / 19.0 | 63.0 / 76.3 / 15.8 |
| RCSANet [11]    | 46.6 / -    / 24.4 | 65.3 / -    / 17.5 |
| CASE-Net [9]    | 35.1 / 66.7 / 20.4 | 66.4 / 78.1 / 18.2 |
| IRANet [13]     | 46.2 / 72.7 / 25.4 | 64.1 / 78.7 / 19.0 |
| DeSKPro (Ours)  | 52.0 / 81.6 / 29.8 | 68.6 / 82.3 / 22.7 |

Table 2. Comparison on the PRCC dataset (%).

| Method          | Cross clothes (R-1 / mAP) | Same clothes (R-1 / mAP) |
| HACNN [1]       | 21.8 / -    | 82.5 / -   |
| PCB [2]         | 22.9 / -    | 86.9 / -   |
| Yang et al. [8] | 34.4 / -    | 64.2 / -   |
| CASE-Net [9]    | 39.5 / -    | 71.2 / -   |
| AFD-Net [10]    | 42.8 / -    | 95.7 / -   |
| RCSANet [11]    | 50.2 / 48.6 | 100.0 / 97.2 |
| LightMBN [4]    | 50.5 / 51.2 | 100.0 / 98.9 |
| FSAM [12]       | 54.5 / -    | 98.8 / -   |
| IRANet [13]     | 54.9 / 53.0 | 99.7 / 97.8 |
| Shu et al. [14] | 65.8 / 61.2 | 99.5 / 96.7 |
| DeSKPro (Ours)  | 74.0 / 66.3 | 99.6 / 96.6 |
3.3. Effectiveness of Proposed Components

To verify the effectiveness of each component in DeSKPro, we conduct ablation experiments with different component settings on the more challenging Celeb-reID-light dataset. The results are summarized in Table 3.

Table 3. Ablation study on the Celeb-reID-light dataset. "G" denotes the global stream without the CSA module, and φ_s^+/φ_s^- denote the student network trained with/without L_fkp.

| Model        | G | CSA | L_att | φ_s^- | φ_s^+ | φ_t | R-1  | mAP  |
| 1 (baseline) | ✓ |     |       |       |       |     | 32.9 | 18.8 |
| 2            | ✓ | ✓   |       |       |       |     | 33.5 | 20.1 |
| 3            | ✓ | ✓   | ✓     |       |       |     | 37.2 | 20.3 |
| 4            |   |     |       | ✓     |       |     | 41.6 | 20.7 |
| 5            |   |     |       |       | ✓     |     | 47.0 | 25.6 |
| 6            |   |     |       |       |       | ✓   | 47.4 | 25.8 |
| 7            | ✓ | ✓   | ✓     | ✓     |       |     | 50.1 | 27.4 |
| DeSKPro      | ✓ | ✓   | ✓     |       | ✓     |     | 52.0 | 29.8 |
| DeSKPro∗     | ✓ | ✓   | ✓     |       |       | ✓   | 53.7 | 30.9 |

Cloth-irrelevant Spatial Attention. We first evaluate the importance of the attention module CSA and the mask-guided attention loss L_att. As shown in Table 3, the rank-1 accuracy and mAP are improved by the CSA module. After adding L_att to the total loss function, the performance is further boosted. The feature visualization in Fig. 3 also shows that the CSA module can effectively attend to the cloth-irrelevant regions after training with L_att.

Fig. 3. Visualization of the attention maps and the refined feature maps for model 2 and model 3.

Face Enhancement and Knowledge Propagation. As shown in Table 3, the teacher network (model 6) significantly outperforms the student network (model 4), which verifies that restoring face details helps improve performance. Comparing model 5 with model 4, it can be seen that the performance is boosted significantly by the knowledge propagation. Even though model 5 is still slightly inferior to model 6, it has lower computational costs since the face restoration module is removed. This also indicates that the knowledge propagation strategy in the face stream offers a trade-off between performance and computational efficiency.

Two-Stream Architecture of DeSKPro. As shown in Table 3, the performance can be further improved by assembling both the global and face streams. This indicates that the global and facial features complement each other. We also provide a variant of DeSKPro, termed DeSKPro∗ in Table 3, obtained by replacing the student network φ_s with the more complex teacher network φ_t in the face stream. Even though it achieves the best performance in our experiments, it brings extra computational costs for face restoration.

4. CONCLUSION

In this paper, we present an Identity-Sensitive Knowledge Propagation framework (DeSKPro), which achieves state-of-the-art results on three challenging CC-ReID datasets: Celeb-reID-light, Celeb-reID, and PRCC. We argue that explicitly suppressing clothes features and recovering facial details from resolution-degraded images can help boost performance on the CC-ReID task. The experiments have also shown that propagating body and facial knowledge can help avoid the extra computation costs of mask estimation and face restoration while preserving performance. We hope this work can spark further research on the CC-ReID problem.
5. REFERENCES

[1] Wei Li, Xiatian Zhu, and Shaogang Gong, "Harmonious Attention Network for Person Re-identification," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 2285–2294.

[2] Yifan Sun, Liang Zheng, Yi Yang, et al., "Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)," in European Conference on Computer Vision, 2018, pp. 501–518.

[3] Guanshuo Wang, Yufeng Yuan, Xiong Chen, et al., "Learning Discriminative Features with Multiple Granularities for Person Re-Identification," in ACM International Conference on Multimedia, 2018, pp. 274–282.

[4] Fabian Herzog, Xunbo Ji, Torben Teepe, et al., "Lightweight Multi-Branch Network For Person Re-Identification," in IEEE International Conference on Image Processing, 2021, pp. 1129–1133.

[5] Shijie Yu, Shihua Li, Dapeng Chen, et al., "COCAS: A Large-Scale Clothes Changing Person Dataset for Re-Identification," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3397–3406.

[6] Fangbin Wan, Yang Wu, Xuelin Qian, et al., "When Person Re-identification Meets Changing Clothes," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 3620–3628.

[7] Yan Huang, Jingsong Xu, Qiang Wu, et al., "Beyond Scalar Neuron: Adopting Vector-Neuron Capsules for Long-Term Person Re-Identification," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, pp. 3459–3471, 2020.

[8] Qize Yang, Ancong Wu, and Wei-Shi Zheng, "Person Re-Identification by Contour Sketch Under Moderate Clothing Change," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 6, pp. 2029–2046, June 2021.

[9] Yu-Jhe Li, Xinshuo Weng, and Kris M. Kitani, "Learning Shape Representations for Person Re-Identification under Clothing Change," in IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2432–2441.

[10] Wanlu Xu, Hong Liu, Wei Shi, et al., "Adversarial Feature Disentanglement for Long-Term Person Re-identification," in International Joint Conference on Artificial Intelligence, 2021, pp. 1201–1207.

[11] Yan Huang, Qiang Wu, JingSong Xu, et al., "Clothing Status Awareness for Long-Term Person Re-Identification," in IEEE/CVF International Conference on Computer Vision, 2021, pp. 11895–11904.

[12] Peixian Hong, Tao Wu, Ancong Wu, et al., "Fine-Grained Shape-Appearance Mutual Learning for Cloth-Changing Person Re-Identification," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10513–10522.

[13] Wei Shi, Hong Liu, and Mengyuan Liu, "IRANet: Identity-relevance Aware Representation for Cloth-Changing Person Re-Identification," Image and Vision Computing, vol. 117, pp. 104335, 2022.

[14] Xiujun Shu, Ge Li, Xiao Wang, et al., "Semantic-Guided Pixel Sampling for Cloth-Changing Person Re-Identification," IEEE Signal Processing Letters, vol. 28, pp. 1365–1369, 2021.

[15] Riza Alp Guler, Natalia Neverova, and Iasonas Kokkinos, "DensePose: Dense Human Pose Estimation in the Wild," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7297–7306.

[16] Xuelin Qian, Wenxuan Wang, Li Zhang, et al., "Long-Term Cloth-Changing Person Re-identification," in Asian Conference on Computer Vision, 2021, pp. 71–88.

[17] Alexander Hermans, Lucas Beyer, and Bastian Leibe, "In Defense of the Triplet Loss for Person Re-Identification," arXiv preprint arXiv:1703.07737, 2017.

[18] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, et al., "Omni-Scale Feature Learning for Person Re-Identification," in IEEE/CVF International Conference on Computer Vision, 2019, pp. 3701–3711.

[19] Peike Li, Yunqiu Xu, Yunchao Wei, et al., "Self-Correction for Human Parsing," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

[20] Jiankang Deng, Jia Guo, Evangelos Ververas, et al., "RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5202–5211.

[21] Tao Yang, Peiran Ren, Xuansong Xie, et al., "GAN Prior Embedded Network for Blind Face Restoration in the Wild," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 672–681.

[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the Knowledge in a Neural Network," arXiv preprint arXiv:1503.02531, 2015.
