Received: 28 November 2021 Revised: 14 May 2022 Accepted: 14 May 2022 IET Cyber‐Systems and Robotics
DOI: 10.1049/csy2.12052

ORIGINAL RESEARCH

Multi‐branch angle aware spatial temporal graph convolutional neural network for model‐based gait recognition

Liyang Zheng | Yuheng Zha | Da Kong | Hanqing Yang | Yu Zhang

College of Control Science and Engineering, Zhejiang University, Hangzhou, China

Correspondence
Yu Zhang, College of Control Science and Engineering, Zhejiang University, Hangzhou, 310027, China.
Email: zhangyu80@zju.edu.cn

Funding information
National Natural Science Foundation of China, Grant/Award Number: 62088101; National Key Research and Development Program of China, Grant/Award Numbers: 2021ZD0201400, 2022ZD0208800; Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China, Grant/Award Number: ICT2022B04

Abstract
Model‐based gait recognition with skeleton data input has attracted more attention in recent years. Model‐based gait recognition methods take skeletons constructed from body joints as input, which are invariant to changing carrying and clothing conditions. However, previous methods model the skeleton information in only the spatial or the temporal domain and ignore the pose variation under different view angles, which results in poor gait recognition performance. To solve these problems, we propose the Multi‐Branch Angle Aware Spatial Temporal Graph Convolutional Neural Network to better depict the spatial‐temporal relationship while minimising the interference from view angles. The model adopts the legacy Spatial Temporal Graph Neural Network (ST‐GCN) as its backbone and replicates it to create independent ST‐GCN branches. A novel Angle Estimator module is designed to predict the skeletons' view angles, which makes the network robust to changing views. To balance the weights of different body parts and sequence frames, we build a Part‐Frame‐Importance module to redistribute them. Our experiments on the challenging CASIA‐B dataset prove the efficacy of the proposed method, which achieves state‐of‐the‐art performance under different carrying and clothing conditions.

KEYWORDS
deep learning, machine intelligence, vision

1 | INTRODUCTION

Gait is a biometric characteristic that depicts the walking style of a person. Compared with other biometric features, such as the face, iris, fingerprint, and palm print, gait is suitable for human identification at a distance without active cooperation from the subject. Hence, gait recognition methods have become increasingly attractive in recent years, and their applications include video surveillance, identification, crowd density estimation, etc.

Gait recognition mainly contains two categories (Figure 1): model‐free gait recognition methods and model‐based gait recognition methods [1]. The model‐free methods use silhouettes [2–5] or the derivative Gait Energy Image [6, 7] as input, and can also be categorised as appearance‐based gait recognition methods. Appearance‐based models are susceptible to the carrying and wearing conditions of the participant. Model‐based methods need to estimate human skeleton parameters such as the positions of legs and arms, which are invariant to view angles and scales; however, they rely on the quality of the extracted skeleton data. Due to the rapid development of deep learning in computer vision, both approaches are trending and gaining significant progress in accuracy. Our proposed Multi‐Branch Angle Aware Spatial Temporal Graph Convolutional Neural Network (MAST‐GCN) is a model‐based gait recognition network that takes skeleton keypoints as input and outputs a unique gait feature for each person.

The widely used gait datasets generally provide RGB sequences or silhouettes.

Liyang Zheng and Yuheng Zha contributed equally to this study.

This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the
original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
© 2022 The Authors. IET Cyber‐Systems and Robotics published by John Wiley & Sons Ltd on behalf of Zhejiang University Press.


For example, CASIA‐B [8] provides both raw video sequences and silhouettes, while the OU‐ISIR Gait Database, Multi‐View Large Population Dataset (OU‐MVLP) [9], only provides silhouettes. The skeleton sequences in this work are extracted from the RGB video sequences by human pose estimation networks; we did not adopt the datasets without the original videos. Modern networks for human pose estimation can accurately obtain the attributes of each joint (e.g., joint types and positions). OpenPose [10], AlphaPose [11–13], HRNet [14] and PifPaf [15] are state‐of‐the‐art methods for extracting human keypoints that transform a picture containing persons into a skeleton representation. Because of the low resolution (320 × 240) and blurry videos in the CASIA‐B dataset, HRNet is the more appropriate choice to accurately calculate the positions of the joints, thus benefiting the subsequent gait recognition method.

Current appearance‐based gait recognition approaches [2–5] have similarities with person re‐identification methods [16, 17]: both tasks use the human body's appearance. The model‐based approach is able to directly process the data flow from human pose estimation methods that take an RGB video sequence (e.g. videos from a webcam) as input. Compared with traditional gait recognition approaches [18, 19], model‐based gait recognition [20–23] with a deep neural network extracts more accurate gait features and alleviates the influence of illumination changes and occlusion in the wild. Model‐based methods focus on the body's movement patterns as well as the static skeleton structure, and are therefore robust to variations such as changing clothes and appendages. Novel pose estimation methods and large pose datasets contribute to more precise human keypoint prediction. However, it is still challenging to build an accurate human identity recognition model due to the limited information from body joints.

Inspired by the previous work on spatial‐temporal graph convolution networks for skeleton‐based person re‐identification [17] and action recognition [24], we propose the novel MAST‐GCN (Figure 2). The MAST‐GCN contains a multi‐branch structure to extract multi‐faceted features, including static and dynamic information from joint and bone nodes. The skeleton structure is naturally suitable for graph expression [25], thus we use the spatial graph convolutional network (SGCN) [26] to extract latent information between body parts. Note that gait contains not only spatial data but also temporal information, so the Temporal Convolutional Network (TCN) is applied to extract features in the time domain. The SGCN and TCN (Figure 4) form a spatial‐temporal graph convolution layer (Figure 5). By stacking the spatial‐temporal layers, we construct a Spatial Temporal Graph Neural Network (ST‐GCN) block.

FIGURE 1 Two types of gait recognition methods: model‐free (appearance‐based) method (left) and model‐based method (right)

FIGURE 2 The overview of the multi‐branch angle aware spatial temporal graph convolutional neural network structure. Skeleton representations obtained from raw RGB images are first fed into the multi‐branch structure together with four ST‐GCN blocks to obtain static and dynamic information of joints and bones respectively. The skeleton frames are first divided into four subgraphs and then processed by the joint and bone branches. Then the joint and bone branches are concatenated (denoted by ⓒ). The output feature is then fed into the PFI module branch and the Angle Estimator branch, thus producing the Angle Vector and the Sequence Vector. At last, we apply an FC layer to the feature to obtain the final Gait Vector. FC, fully connected; PFI, part‐frame‐importance; ST‐GCN, spatial temporal graph neural network
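To make the data flow in Figure 2 concrete, the following is a minimal PyTorch sketch of the pipeline wiring only. The `BranchStub` modules, the 1 × 1 convolutions, the stride‑4 temporal pooling, the all‑ones PFI initialisation and the linear angle head are placeholders and assumptions; they stand in for the ST‑GCN blocks, PFI module and Angle Estimator detailed in Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchStub(nn.Module):
    """Stand-in for one multi-layer ST-GCN branch block (A-D in Figure 2).
    A 1x1 convolution over (batch, C, T, V) replaces the real spatial-temporal
    graph convolutions; the temporal stride-4 pooling mimics the two stride-2
    layers listed in Table 1 (60 frames -> 15 frames)."""
    def __init__(self, c_in, c_out=128):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):                               # x: (B, C_in, T, V)
        x = F.relu(self.proj(x))
        return F.avg_pool2d(x, kernel_size=(4, 1))      # (B, 128, T/4, V)

class MASTGCNSketch(nn.Module):
    """Structural sketch of the MAST-GCN data flow in Figure 2. All block
    internals are placeholders; only the wiring (branch concat on channels,
    joint/bone concat on nodes, PFI re-weighting, angle-vector splice before
    the FC head) follows the text."""
    def __init__(self, n_joints=17, n_bones=14, t_out=15, gait_dim=128):
        super().__init__()
        self.a, self.b = BranchStub(3), BranchStub(3)   # static / dynamic joints
        self.c, self.d = BranchStub(6), BranchStub(6)   # static / dynamic bones
        n_jb = n_joints + n_bones                       # 31 graph nodes after concat
        self.pfi = nn.Parameter(torch.ones(n_jb, t_out))         # Section 3.2.4 (assumed init)
        self.block_e = nn.Conv2d(256, 256, kernel_size=1)        # stub for Block E
        self.angle_head = nn.Linear(256, 2)                      # [sin, cos], Eq. (9)
        self.fc = nn.Linear(256 + 2, gait_dim)

    def forward(self, sj, dj, sb, db):
        joint = torch.cat([self.a(sj), self.b(dj)], dim=1)   # (B, 256, 15, 17)
        bone = torch.cat([self.c(sb), self.d(db)], dim=1)    # (B, 256, 15, 14)
        x = torch.cat([joint, bone], dim=3)                  # (B, 256, 15, 31)
        pooled = x.mean(dim=(2, 3))                          # sequence-level feature
        angle = F.normalize(self.angle_head(pooled), dim=1)  # Angle Vector
        x = x * self.pfi.t()                                 # PFI re-weighting
        x = F.relu(self.block_e(x)).mean(dim=(2, 3))         # AvgPool after Block E
        return self.fc(torch.cat([x, angle], dim=1)), angle  # Gait Vector, Angle Vector

# Example shapes: 60-frame clips, joints [x, y, prob], bones [xs, ys, xe, ye, l, theta].
model = MASTGCNSketch()
sj, dj = torch.randn(2, 3, 60, 17), torch.randn(2, 3, 60, 17)
sb, db = torch.randn(2, 6, 60, 14), torch.randn(2, 6, 60, 14)
gait_vec, angle_vec = model(sj, dj, sb, db)               # (2, 128), (2, 2)
```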

View angle change is a factor that reduces model performance because it changes the shape of the skeletons. To the best of our knowledge, no former approach directly considered this element to mitigate its interference. The Angle Estimator we propose is capable of calculating the angle of view while detaching the view factor from the final gait feature. Moreover, we balance the weights of body parts and temporal frames with the proposed Part‐Frame‐Importance (PFI) module. The experiments carried out on the challenging CASIA‐B dataset show that our MAST‐GCN is effective for the model‐based gait recognition task.

In summary, our contributions are as follows:

• We propose the multi‐branch structure to represent static and dynamic information of joints and bones respectively. The raw keypoints obtained with the human pose estimation method are imported directly into the static joint branch, from which we further compute the dynamic joint branch and the static bone branch. Similarly, the dynamic bone branch is generated from the static bone branch. The four‐branch architecture is able to gain multi‐scale information of the skeleton, thus improving the model performance.
• We propose a novel Angle Estimator to capture the view of a gait sequence, which enables the network to detach the angle information from the gait feature. With this detachment, our model is able to focus on the sole analysis of gait. Therefore, the model is good at recognising subjects from various perspectives.
• We propose a spatial‐temporal multi‐branch framework for gait recognition named MAST‐GCN. We conduct experiments on the widely used CASIA‐B [8] dataset to show that our model achieves state‐of‐the‐art performance. Further ablation studies prove that each part is essential for the functionality of the model.

2 | RELATED WORK

2.1 | Human pose estimation

Pose estimation is a prerequisite of model‐based gait recognition. A pose estimator can compute the positions of human joints from an image and output the type of each joint. This field can be categorised into 2‐dimensional [14, 27, 28] and 3‐dimensional [29–31] pose estimation approaches. We adopt the 2D pose estimation solution, and considering that HRNet [14] uses parallel connections to maintain a high‐resolution representation, which is crucial for extracting detailed information from the video sequences in the dataset, we choose this network as our skeleton estimator.

2.2 | Gait recognition

Gait recognition has two branches: the appearance‐based method and the model‐based method [1]. GaitSet [2], an appearance‐based method that takes the temporal silhouettes as a set, is a milestone for appearance‐based gait recognition because of its immunity to the order of the sequence. GaitPart [3] further proposes a partial representation of a silhouette to finely excavate details. The current state‐of‐the‐art model on CASIA‐B [8] and OU‐MVLP [9] is MT3D [4], which applies 3D CNNs to extract the spatial‐temporal feature and uses a multiple‐temporal‐scale framework to integrate both the small and large scale temporal information. However, current appearance‐based methods are not practical, since the datasets they were trained on were extracted by background subtraction methods, which are difficult to utilise in an end‐to‐end manner (i.e. directly processing an image from a camera).

Model‐based gait recognition takes the skeleton sequences extracted by human pose estimators as the data source. To make full use of the sequential skeleton data, some researchers combined convolutional neural networks and recurrent neural networks to learn both spatial and temporal body features [20, 22]. Although these works can extract the body motion as well as the appearance information, the long‐term spatial‐temporal relation between frames is difficult to construct with these kinds of methods. The Pose‐Based Temporal‐Spatial Network with 3D Pose Estimation (PTSN‐3D) [21] and PoseGait [22] are pioneering works that take the 3D pose to learn the gait feature, but they need additional calculations to estimate the 3D joint positions from the 2D skeleton data. Skeleton data are proven to be well modelled by graph structures; the recent work JointsGait [23] adapts the classical ST‐GCN [24] network and a Joints Relationship Pyramid Pooling module to extract the gait feature. Our network is a model‐based approach that depicts the interaction between different body parts and frames.

2.3 | Graph convolutional networks

Many real‐life data are constructed in the form of graphs, such as knowledge graphs [32], social networks [33], traffic [34], etc. However, how to build neural networks that work well on these structured graphs is a challenging problem. A graph typically has node features and connections between these nodes. The Graph Convolutional Network (GCN) is a type of convolutional neural network that can work directly on graphs and take advantage of their structural information. There are generally two ways to construct GCNs on a graph according to their principle: (1) spectral‐based approaches [26, 35, 36], where the locality of the graph convolution is considered in the form of spectral analysis; and (2) spatial‐based approaches [37–39], which directly perform convolution in the graph domain. Our work takes advantage of the latter by aggregating the neighbour nodes with fewer calculations.

GCN has already thrived in the area of skeleton‐based action recognition [24, 40, 41]. These approaches construct graphs on skeletons by setting the keypoints as nodes. The connections between the nodes are based on their physical connection, that is, the existing bone connection. Since a great similarity between action recognition and gait recognition is observed, it is natural to adopt GCN to handle gait recognition problems.

To our knowledge, JointsGait [23] and GaitGraph [42] are currently the state‐of‐the‐art methods. However, they ignore the interference from the view angles and lack the multi‐scale information from bones and dynamics needed to thoroughly construct gait features. Our proposed MAST‐GCN attempts to solve these issues.

3 | METHOD

The method we propose is briefly called MAST‐GCN. The whole pipeline of the method is shown in Figure 2. First, a set of skeleton data from a walking sequence is split into four branches: Static‐Joint‐Branch, Static‐Bone‐Branch, Dynamic‐Joint‐Branch, and Dynamic‐Bone‐Branch. Then, they are processed by the multi‐branch ST‐GCN blocks. The outputs of the above blocks are first grouped into joint and bone branches and then concatenated together to feed into the PFI module and the Angle Estimator. After the PFI module, the output feature is processed by ST‐GCN block E. Then, an Average Pooling (AvgPool) layer and a Fully Connected (FC) layer are applied to the feature to obtain the final gait vector. In addition to the above data stream, we splice the angle vector calculated by the Angle Estimator with the intermediate gait feature before the FC layer. The Mean Square Error (MSE) and Supervised Contrastive (SupCon) losses are fused together to construct the loss function.

FIGURE 3 The multi‐feature of skeletons. (a) Pose estimation results from HRNet, including 17 joints and their related connections. (b) Joints (17 blue points) and their movement between two neighbouring frames (dashed green lines). (c) Bones (14 lines with highlighted red edges) and their movement between two neighbouring frames (dashed dark green lines)

FIGURE 4 The structure of the spatial graph convolutional network (SGCN) and Temporal Convolutional Network (TCN). The colourful dots represent graph nodes in Section 3.1, and the colourful lines between the dots indicate the relationship between the body joints. ST‐GCN, spatial temporal graph neural network

FIGURE 5 Two types of residual connection in an ST‐GCN layer. The ⊕ denotes the adding operation. (a) Direct residual connection, if dim(x) = dim(F(x)); (b) a residual connection with a linear layer, if dim(x) ≠ dim(F(x)) (see channel dimensions in Table 1). ST‐GCN, spatial temporal graph neural network

3.1 | Multi‐branch skeleton graph

3.1.1 | Preliminaries

A body pose graph can be represented as G = (V, E), where V = {v_1, …, v_N} denotes the N nodes in the graph and E is the set of graph edges, leading to an adjacency matrix A ∈ R^{N×N} with A(i,j) = 1 if v_i and v_j are physically connected or are the same node, and A(i,j) = 0 otherwise. A set of skeleton frames can be denoted as X ∈ R^{C×T×N}, where X_{t,n} is the C‐dimensional feature vector of node v_n at the t‐th frame in a total of T frames. Let D(i,i) = Σ_j A(i,j) be the degree matrix of the adjacency matrix A. The normalised adjacency matrix Ã is an approximation of the graph Laplacian:

$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \quad (1)$$

Following [43], we multiply a learnable mask matrix M ∈ R^{N×N} with Ã to balance the weight between body parts. A tanh activation function is applied to the matrix M:

$$\hat{A} = \tilde{A} \odot (\tanh(M) + 1), \quad (2)$$

where ⊙ represents the Hadamard product, M is initialised as a zero matrix, and Â is the final adjacency matrix used in the graph computation.
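A minimal PyTorch sketch of Equations (1) and (2) is given below; the three‑node edge list in the usage example is purely illustrative.

```python
import torch
import torch.nn as nn

def normalised_adjacency(edges, n_nodes):
    """Build A with self-loops from an edge list and return the symmetric
    normalisation D^{-1/2} A D^{-1/2} of Equation (1)."""
    a = torch.eye(n_nodes)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = a.sum(dim=1).rsqrt()              # D^{-1/2} on the diagonal
    return d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]

class MaskedAdjacency(nn.Module):
    """Learnable edge re-weighting of Equation (2): A_hat = A_tilde ⊙ (tanh(M) + 1),
    with M initialised to zeros so that A_hat starts equal to A_tilde."""
    def __init__(self, a_tilde):
        super().__init__()
        self.register_buffer("a_tilde", a_tilde)
        self.mask = nn.Parameter(torch.zeros_like(a_tilde))

    def forward(self):
        return self.a_tilde * (torch.tanh(self.mask) + 1.0)

# Toy usage with a 3-node chain graph (0-1, 1-2).
a_tilde = normalised_adjacency([(0, 1), (1, 2)], n_nodes=3)
a_hat = MaskedAdjacency(a_tilde)()                 # equals a_tilde at initialisation
```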
3.1.2 | Joint branch and bone branch

As shown in Figure 3, we initially extract 17 joint nodes from an RGB image and compute 14 bone nodes from those joint nodes. Each node in the Static‐Joint‐Branch (S‐Joint‐Branch) contains three elements, [x, y, prob], where x and y denote the coordinates of the joint and prob denotes the probability of its type. The Dynamic‐Joint‐Branch (D‐Joint‐Branch) is obtained by the differential method expressed in Equation (3). Note that the velocity branches have one frame less than the static branches, so we duplicate the first frame so that the velocity branches have the same number of frames as the static branches.

$$D^{t}_{joint,i} = X^{t}_{joint,i} - X^{t-1}_{joint,i}, \quad t \in [1, T],\ i \in [1, N_j], \quad (3)$$

where X^t_{joint,i} denotes the i‑th node's elements of the S‐Joint‐Branch in the t‑th frame excluding the probability of the joint type, T denotes the total number of frames in a sequence, and N_j denotes the number of joint nodes, which in our method is N_j = 17.

Each node in the Static‐Bone‐Branch (S‐Bone‐Branch) contains six elements, [x_s, y_s, x_e, y_e, l, θ], where x_s, y_s, x_e, y_e denote the start and end coordinates of the bone, and l and θ denote the length of each bone and its angle relative to the ground, respectively. Similarly, we compute the Dynamic‐Bone‐Branch (D‐Bone‐Branch) by the differential method:

$$D^{t}_{bone,i} = X^{t}_{bone,i} - X^{t-1}_{bone,i}, \quad t \in [1, T],\ i \in [1, N_b], \quad (4)$$

where X^t_{bone,i} denotes the i‑th node's elements of the S‐Bone‐Branch in the t‑th frame, T denotes the total number of frames in a sequence, and N_b denotes the number of bone nodes, which in our method is N_b = 14.
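The following NumPy sketch illustrates the four‑branch construction of Section 3.1.2. The 14‑bone edge list over the COCO keypoints is a hypothetical example (the exact bone set is defined by Figure 3 and may differ), and duplicating the first velocity frame follows the text above.

```python
import numpy as np

# Hypothetical 14-bone list over the 17 COCO keypoints (indices 0-16).
BONES = [(5, 7), (7, 9), (6, 8), (8, 10),          # arms
         (11, 13), (13, 15), (12, 14), (14, 16),   # legs
         (5, 6), (11, 12), (5, 11), (6, 12),       # torso
         (0, 5), (0, 6)]                           # head-to-shoulder links

def build_branches(joints):
    """joints: (T, 17, 3) array of [x, y, prob] per frame.
    Returns the four branches of Section 3.1.2:
      S-Joint (T, 17, 3), D-Joint (T, 17, 2), S-Bone (T, 14, 6), D-Bone (T, 14, 6)."""
    s_joint = joints
    xy = joints[..., :2]
    # Equation (3): frame-to-frame differences, first frame duplicated to keep T frames.
    dj = np.diff(xy, axis=0)
    d_joint = np.concatenate([dj[:1], dj], axis=0)

    starts = xy[:, [b[0] for b in BONES]]              # (T, 14, 2)
    ends = xy[:, [b[1] for b in BONES]]                # (T, 14, 2)
    vec = ends - starts
    length = np.linalg.norm(vec, axis=-1, keepdims=True)
    theta = np.arctan2(vec[..., 1:2], vec[..., 0:1])   # bone angle w.r.t. the ground
    s_bone = np.concatenate([starts, ends, length, theta], axis=-1)   # (T, 14, 6)
    # Equation (4): same differencing for the bone branch.
    db = np.diff(s_bone, axis=0)
    d_bone = np.concatenate([db[:1], db], axis=0)
    return s_joint, d_joint, s_bone, d_bone

s_j, d_j, s_b, d_b = build_branches(np.random.rand(60, 17, 3))
```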

3.2 | The ST‐GCN layers

3.2.1 | Spatial graph convolution

To model the spatial relationship between nodes, we apply several layers of the Spatial Graph Convolutional Network (SGCN). Each layer of the SGCN is defined as:

$$X^{l}_{s} = \sigma\left(\hat{A}\, X^{l-1}_{t}\, W^{l}\right), \quad l \in [1, L], \quad (5)$$

where X^l_s and X^l_t denote the l‑th layer of spatial features and temporal features respectively, W^l ∈ R^{f_in × f_out} is an f_in × f_out learnable matrix of the l‑th layer, X^0_t is the input feature, L is the number of GCN layers, and σ is a ReLU [44] activation function. Each SGCN layer is followed by a TCN layer.

3.2.2 | Temporal convolution network

Following the SGCN, the TCN layers aim to capture the motion of the skeletons. Each node in a single frame is able to fuse the information from the neighbouring frames. The temporal module consists of a batch normalisation layer, a 2D convolutional layer and a ReLU activation layer. Each joint or bone in a certain frame is already aligned, so the basic convolution module has a kernel size of (3, 1) to move along the frame axis.
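A minimal PyTorch sketch of one SGCN layer (Equation (5)) followed by the temporal module of Section 3.2.2 is shown below; the channel widths and the identity adjacency used in the toy call are illustrative.

```python
import torch
import torch.nn as nn

class SGCNLayer(nn.Module):
    """One spatial graph convolution of Equation (5): X_s = ReLU(A_hat X W),
    applied frame by frame to a (batch, C, T, V) tensor."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.w = nn.Conv2d(c_in, c_out, kernel_size=1)   # per-node linear map W^l
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, a_hat):                         # x: (B, C, T, V), a_hat: (V, V)
        x = self.w(x)                                    # (B, C_out, T, V)
        x = torch.einsum("bctv,vw->bctw", x, a_hat)      # neighbour aggregation with A_hat
        return self.relu(x)

class TCNLayer(nn.Module):
    """Temporal module of Section 3.2.2: BatchNorm, a (3, 1) convolution along
    the frame axis, and ReLU; stride 2 halves the temporal length (see Table 1)."""
    def __init__(self, channels, stride=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, kernel_size=(3, 1),
                      stride=(stride, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                                # x: (B, C, T, V)
        return self.net(x)

# Toy usage: 17 joint nodes, 60 frames, 3 input channels.
a_hat = torch.eye(17)                                    # stands in for Equation (2)
x = torch.randn(8, 3, 60, 17)
x = SGCNLayer(3, 64)(x, a_hat)                           # (8, 64, 60, 17)
x = TCNLayer(64, stride=2)(x)                            # (8, 64, 30, 17)
```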
3.2.3 | Spatial temporal graph neural network

The SGCN and TCN modules (Figure 4) constitute an ST‐GCN layer (Figure 5), which processes the spatial and temporal information of a skeleton sequence. In practice, we add residual connections in the ST‐GCN block, which retain gait details. The final output of the ST‐GCN block is X^l_t, which can be calculated by:

$$X^{l}_{t} = \sigma\left(\mathrm{TCN}\left(X^{l}_{s}\right) + X^{l-1}_{t}\right), \quad l \in [1, L], \quad (6)$$

where X^l_t is the l‑th layer of the temporal feature, X^l_s is the l‑th layer of the spatial feature, σ denotes the ReLU [44] activation function, and L denotes the total number of ST‐GCN layers. In ST‐GCN Blocks A–D, we stack five ST‐GCN layers, which all have the same structure except for the input feature size. The output of ST‐GCN Blocks A and B is concatenated with the output of ST‐GCN Blocks C and D so that the Joint‐Branch and the Bone‐Branch both have a 256‐dimensional feature to depict the skeleton information. In ST‐GCN Block E, we stack three ST‐GCN layers with a 256‐dimensional output feature. The parameters are shown in Table 1.

TABLE 1 Details of the spatial temporal graph neural network blocks

Type        Layer ID   Channel in   Channel out   Residual   Stride
Block A–D   1          f_in         64            –          1
            2          64           64            ✓          1
            3          64           128           ✓          2
            4          128          256           ✓          2
            5          256          128           ✓          1
Block E     1          256          256           –          1
            2          256          256           ✓          1
            3          256          256           ✓          2

Note: For joint branches, the input channel size f_in is 3, while for bone branches, the input channel size f_in is 6. "Residual" indicates whether the residual connection is added in the specific layer.

3.2.4 | Part‐frame importance module

Prior to ST‐GCN Block E, we add a layer to adjust the weight of each body part and frame. The N_JB × N_T learnable matrix P included in the PFI module is able to improve the model performance while not increasing the computational complexity significantly. Here ⊙ represents element‐wise multiplication:

$$\mathrm{PFI}(x) = P \odot x, \quad (7)$$

$$X = \mathrm{PFI}\left(\mathrm{Cat}\left(X_J, X_B\right)\right), \quad (8)$$

where X_J and X_B are the features of the Joint‐Branch and the Bone‐Branch respectively, and Cat is a concatenation of joint nodes and bone nodes. N_JB denotes the number of joint and bone nodes, and N_T denotes the number of frames. In our model, N_JB = 31 and N_T = 15.
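For concreteness, the sketch below wires one residual ST‑GCN layer (Equation (6), Figure 5) together with the PFI re‑weighting of Equations (7) and (8); the placeholder identity adjacency and the all‑ones initialisation of P are assumptions.

```python
import torch
import torch.nn as nn

class STGCNLayer(nn.Module):
    """One ST-GCN layer of Equation (6): spatial graph convolution, temporal
    convolution, and a residual connection that uses a 1x1 linear map when the
    channel count or temporal stride changes (Figure 5b, Table 1)."""
    def __init__(self, c_in, c_out, n_nodes, stride=1, residual=True):
        super().__init__()
        self.a_hat = nn.Parameter(torch.eye(n_nodes))         # placeholder for Eq. (2)
        self.w = nn.Conv2d(c_in, c_out, kernel_size=1)         # SGCN weight W^l
        self.tcn = nn.Sequential(
            nn.BatchNorm2d(c_out),
            nn.Conv2d(c_out, c_out, (3, 1), stride=(stride, 1), padding=(1, 0)),
        )
        if not residual:
            self.res = None
        elif c_in == c_out and stride == 1:
            self.res = nn.Identity()                           # Figure 5a
        else:
            self.res = nn.Conv2d(c_in, c_out, 1, stride=(stride, 1))   # Figure 5b
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                      # x: (B, C, T, V)
        xs = torch.einsum("bctv,vw->bctw", self.w(x), self.a_hat)
        out = self.tcn(xs)
        if self.res is not None:
            out = out + self.res(x)
        return self.relu(out)

class PFI(nn.Module):
    """Part-Frame-Importance of Equations (7)-(8): element-wise re-weighting by a
    learnable (N_JB x N_T) matrix P; initialising P to ones is an assumption."""
    def __init__(self, n_nodes=31, n_frames=15):
        super().__init__()
        self.p = nn.Parameter(torch.ones(n_nodes, n_frames))

    def forward(self, x):                                      # x: (B, C, N_T, N_JB)
        return x * self.p.t()                                  # broadcast over batch/channels

# Joint and bone features concatenated along the node axis, then re-weighted.
xj, xb = torch.randn(4, 256, 15, 17), torch.randn(4, 256, 15, 14)
x = PFI()(torch.cat([xj, xb], dim=3))                          # (4, 256, 15, 31)
x = STGCNLayer(256, 256, n_nodes=31)(x)                        # first layer of Block E
```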

3.2.5 | Fully connected layer

The feature from the ST‐GCN is first concatenated with the angle information and then fed into the Fully Connected (FC) layer, which has a 128‐dimensional feature output.

3.3 | Angle estimator

We design the Angle Estimator to predict the view angle of the walking condition, because the changing views are influential to the learning of gait features. Since the confidence of the pose estimation changes when the wearing condition is changed, this module makes the model aware of the view of the gait sequence, thus alleviating the influence of bad pose information. Generally, the angle vector V ∈ R^2 of a walking sequence can be expressed as:

$$V = [\sin\theta, \cos\theta]^{T}, \quad (9)$$

where θ is the view angle in radians. In order to map pose features to angle vectors, we design a multi‐layer structure, which is summarised as the Angle Estimator.

The Angle Estimator has a plain structure that is composed of three layers of 1D convolution (Conv), two layers of the activation function (ReLU [44]), and one layer of 1D adaptive average pooling (AvgPool). Noting that sin²θ + cos²θ = 1, we apply a normalisation layer on the final 2D output feature. These layers are simple yet effective, and significantly reduce the pose representation to the low‐dimensional angle vector.
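A sketch of the Angle Estimator under the description above is given below; the hidden channel width and the input layout are assumptions, since the paper does not list them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngleEstimator(nn.Module):
    """Sketch of the Angle Estimator in Section 3.3: three 1D convolutions with
    two ReLUs, adaptive average pooling, and an L2 normalisation so the 2D output
    lies on the unit circle like [sin(theta), cos(theta)] in Equation (9)."""
    def __init__(self, c_in=256, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 2, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x):              # x: (B, C, L) pose feature flattened over nodes/frames
        v = self.net(x).squeeze(-1)    # (B, 2)
        return F.normalize(v, dim=1)   # enforce sin^2 + cos^2 = 1

# Example: a 256-channel feature over 15 * 31 node-frame positions.
angles = AngleEstimator()(torch.randn(4, 256, 15 * 31))   # (4, 2)
```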
3.4 | Loss function

We adopt a multi‐loss strategy to optimise our MAST‐GCN network. The supervised contrastive loss constrains the relationship between the final gait features, and the MSE loss is used after the Angle Estimator module to constrain the predicted angle of each gait sequence.

Supervised Contrastive Loss: The SupCon loss [45] pulls samples belonging to the same class closer in the embedding space, and pushes apart clusters of samples from different classes:

$$L_{S} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \quad (10)$$

Here, P(i) ≡ {p ∈ A(i) : ỹ_p = ỹ_i} is the set of indices of all positives in the multiviewed batch distinct from i, and |P(i)| is its cardinality. We set the temperature τ = 0.005 to avoid numerical instability while maintaining the accuracy of our model.

MSE Loss: The MSE is suitable to constrain the output of the Angle Estimator in Section 3.3 to be consistent with the real view condition. The loss function is defined as follows:

$$L_{M} = \frac{1}{N} \sum_{i=1}^{N} \left(V_i - \hat{V}_i\right)^2, \quad (11)$$

where N is the number of samples, V̂_i is the predicted angle vector, and V_i is the true angle vector of sample i.

Fusion of Loss Functions: The supervised contrastive loss and the MSE loss are fused as follows:

$$L = L_{S} + \gamma L_{M}, \quad (12)$$

where γ balances the weight of the two loss functions; in our experiments γ is set to 1.5.
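The loss of Equations (10)–(12) can be sketched as follows; this is a simplified SupCon implementation assuming one embedding per sequence, not the authors' code.

```python
import math
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.005):
    """Supervised contrastive loss of Equation (10) over an embedded batch
    z (B, D) with integer identity labels (B,)."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                                    # pairwise similarities
    n = z.size(0)
    not_self = ~torch.eye(n, dtype=torch.bool, device=z.device)
    positives = (labels[:, None] == labels[None, :]) & not_self
    sim = sim.masked_fill(~not_self, float("-inf"))          # A(i): all samples except i
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = positives.sum(dim=1)
    per_anchor = -(log_prob.masked_fill(~positives, 0.0)).sum(dim=1) / pos_count.clamp(min=1)
    return per_anchor[pos_count > 0].sum()                   # sum over anchors with positives

def total_loss(gait_vec, labels, angle_pred, angle_true, gamma=1.5):
    """Equation (12): L = L_S + gamma * L_M, with L_M the MSE on angle vectors (Eq. 11)."""
    return supcon_loss(gait_vec, labels) + gamma * F.mse_loss(angle_pred, angle_true)

# Toy batch: 8 sequences, 4 identities, 128-D gait vectors, 2-D angle vectors.
gait = torch.randn(8, 128)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
theta = torch.rand(8) * math.pi
angle_true = torch.stack([theta.sin(), theta.cos()], dim=1)
loss = total_loss(gait, ids, F.normalize(torch.randn(8, 2), dim=1), angle_true)
```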
4 | EXPERIMENTS AND ANALYSIS

4.1 | Dataset

To evaluate the effectiveness of our proposed method, several experiments have been conducted on the CASIA‐B [8] gait dataset. CASIA‐B is a widely applied gait dataset and one of the largest public gait databases, created in January 2005. 124 subjects are involved (93 males and 31 females), and each of them has 11 views (0°, 18°, ⋯, 180°). There are three walking conditions in the sequences: 6 normal walking (NM) sequences, 2 bag walking (BG) sequences and 2 coat walking (CL) sequences. The total number of video sequences in the CASIA‐B dataset is 13,640, with several gait cycles in each sequence. In the test period, the first 4 normal walking sequences (NM 1–4) are used for the gallery set, and the last two normal sequences (NM 5–6) as well as all the bag (BG 1–2) and coat sequences (CL 1–2) are put into the probe set. As the sequential and flipped‐order frames are processed by the network at the same time, the average of the two feature vectors is taken as the final gait result. The Euclidean distance between the corresponding feature vectors of the gallery and probe is used to rank the similarity. For fair evaluation, the first 74 subjects are used for training, and the remaining 50 subjects are put into the test set, following the experimental settings of PTSN [20], JointsGait [23] and GaitGraph [42].

4.2 | Data pre‐processing

Based on the original video data of the CASIA‐B dataset, the publicly available body keypoint extraction method HRNet [14] is applied to obtain 17 human joints in the form of the Common Objects in Context (COCO) Keypoints dataset [46] on every frame. Except for the connections of the facial keypoints, we obtained 14 bones physically or abstractly from those joints (shown in Figure 3).
is its cardinality. We set the temperature τ = 0.005 to avoid on every frame. Except for connections of the facial keypoints,

To enrich the sparse skeleton information, we adopt two data augmentation methods. First, Gaussian noise is added to the locations of the joints. Second, we reverse the order of the original sequence to obtain the gait in the opposite direction.

We trained our model on the CASIA‐B dataset with an NVIDIA 3060Ti GPU. The skeleton sequences were randomly sampled at a fixed size of 60 frames, and sequences shorter than 60 frames were discarded. We used the Adam [47] optimiser with a 1‐cycle learning rate schedule [48] and set the learning rate to a maximum of 0.01 for the first 300 epochs. Then, we set the maximum learning rate to 1e‐4 for the second phase of training for 100 epochs. The batch size was set to 32.
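Under the description above, the pre‑processing can be sketched as follows; the noise scale and the 50% reversal probability are assumptions not stated in the paper.

```python
import numpy as np

def augment_sequence(joints, rng, sigma=0.01, clip_len=60):
    """Sketch of the pre-processing in Section 4.2: randomly crop a fixed-length
    60-frame clip, add Gaussian noise to the joint locations, and (with assumed
    probability 0.5) reverse the temporal order.
    joints: (T, 17, 3) array of [x, y, prob]; returns None for short sequences."""
    t = joints.shape[0]
    if t < clip_len:
        return None                                    # sequences under 60 frames are discarded
    start = rng.integers(0, t - clip_len + 1)
    clip = joints[start:start + clip_len].copy()
    clip[..., :2] += rng.normal(0.0, sigma, size=clip[..., :2].shape)
    if rng.random() < 0.5:
        clip = clip[::-1].copy()                       # walk in the opposite direction
    return clip

rng = np.random.default_rng(0)
clip = augment_sequence(np.random.rand(90, 17, 3), rng)    # (60, 17, 3)
```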

4.3 | Comparison with other methods on the CASIA‐B dataset

In Table 2, our proposed method is compared with other state‐of‐the‐art works. The average recognition accuracy over all 11 angles is calculated. A higher average value suggests that the network is better at matching the gait features with the ground truth, and a lower standard deviation indicates that the model is more robust to the changing view angles. As presented, our method (MAST‐GCN) achieves the best average accuracy in all three walking conditions, namely 88.2% in NM, 75.2% in BG, and 73.0% in CL, compared with the other state‐of‐the‐art model‐based methods.

Based on the averaged Top‐1 accuracies in Table 2, we draw three line graphs for the NM, BG and CL conditions, as shown in Figures 6–8. It can be seen from the figures that the Top‐1 accuracies of MAST‐GCN are clearly higher than the other methods' results under the CL condition. Besides, the polylines of MAST‐GCN are more stable than those of the other methods under the NM and BG conditions. We conclude that skeleton features are less affected by the changing views and clothing or carrying conditions, while silhouettes carry more detailed information for gait recognition.

FIGURE 6 Averaged Top‐1 accuracies for the probe data being NM. NM, normal walking sequences
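For reference, a sketch of the rank‑1 matching protocol described in Section 4.1 (and used for the comparison in Table 2) is given below; it omits the exclusion of identical‑view gallery–probe pairs noted in Table 2, and assumes the features are already the average of the forward and flipped‑order sequence embeddings.

```python
import numpy as np

def rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids):
    """Each probe feature is matched to the gallery by Euclidean distance and
    counted as correct when the nearest gallery sequence shares its identity."""
    dists = np.linalg.norm(probe_feats[:, None, :] - gallery_feats[None, :, :], axis=-1)
    nearest = gallery_ids[np.argmin(dists, axis=1)]
    return float(np.mean(nearest == probe_ids))

# Toy example: 20 gallery and 10 probe sequences with 128-D gait vectors.
g_feats, p_feats = np.random.rand(20, 128), np.random.rand(10, 128)
g_ids = np.repeat(np.arange(10), 2)
p_ids = np.arange(10)
print(rank1_accuracy(g_feats, g_ids, p_feats, p_ids))
```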

TABLE 2 Averaged Top‐1 accuracies in percent compared with other model‐based methods on CASIA‐B dataset, excluding identical‐view cases

Gallery (NM No.1–4) 0°–180°


Probe 0° 18° 36° 54° 72° 90° 108° 126° 144° 162° 180° Mean
NM No.5–6 PTSN [20] 34.5 45.6 49.6 51.3 52.7 52.3 53 50.8 52.2 48.3 31.4 47.4
PTSN‐3D [21] 38.7 50.2 55.9 56.0 56.7 54.6 54.8 56.0 54.1 52.4 40.2 51.9
PoseGait [22] 48.5 62.7 66.6 66.2 61.9 59.8 63.6 65.7 66.0 58.0 46.5 60.5
JointsGait [23] 68.1 73.6 77.9 76.4 77.5 79.1 78.4 76.0 69.5 71.9 70.1 74.4
GaitGraph [42] 85.3 88.5 91.0 92.5 87.2 86.5 88.4 89.2 87.9 85.9 81.9 87.7
Ours 87.7 89.2 89.6 89.9 90.3 90.0 88.5 88.1 88.5 86.1 81.8 88.2

BG No.1–2 PTSN [20] 22.4 29.8 29.6 29.2 32.5 31.5 32.1 31.0 27.3 28.1 18.2 28.3
PTSN‐3D [21] 27.7 32.7 37.4 35.0 37.1 37.5 37.7 36.9 33.8 31.8 27.0 34.1
PoseGait [22] 29.1 39.8 46.5 46.8 42.7 42.2 42.7 42.2 42.3 35.2 26.7 39.6
JointsGait [23] 54.3 59.1 60.6 59.7 63.0 65.7 62.4 59.0 58.1 58.6 50.1 59.1
GaitGraph [42] 75.8 76.7 75.9 76.1 71.4 73.9 78.0 74.7 75.4 75.4 69.2 74.8
Ours 74.8 76.6 78.0 78.2 75.2 73.1 72.0 74.3 77.4 75.8 72.0 75.2

CL No.1–2 PTSN [20] 14.2 17.1 17.6 19.3 19.5 20.0 20.1 17.3 16.5 18.1 14.0 17.6
PTSN‐3D [21] 15.8 17.2 19.9 20.0 22.3 24.3 28.1 23.8 20.9 23.0 17.0 21.1
PoseGait [22] 21.3 28.2 34.7 33.8 33.8 34.9 31.0 31.0 32.7 26.3 19.7 29.8
JointsGait [23] 48.1 46.9 49.6 50.5 51.0 52.3 49.0 46.0 48.7 53.6 52.0 49.8
GaitGraph [42] 69.6 66.1 68.8 67.2 64.5 62.0 69.5 65.6 65.7 66.1 64.3 66.3
Ours 68.8 71.6 76.8 75.4 76.6 74.8 71.5 66.7 74.4 75.0 71.0 73.0

Note: Bold values denote the highest in each test condition.


Abbreviations: BG, bag walking sequences; CL, coat walking sequences; NM, normal walking sequences.

TABLE 3 Effect of branches, angle‐estimator, and part‐frame‐importance

Probe       Module             Mean
NM No.5–6   S‐joint‐branch     57.1
            +S‐bone‐branch     81.0
            +D‐joint‐branch    85.1
            +D‐bone‐branch     84.9
            +Angle‐estimator   85.9
            +PFI               88.2
BG No.1–2   S‐joint‐branch     51.3
            +S‐bone‐branch     59.4
            +D‐joint‐branch    68.9
            +D‐bone‐branch     72.5
            +Angle‐estimator   73.9
            +PFI               75.2
CL No.1–2   S‐joint‐branch     31.8
            +S‐bone‐branch     50.2
            +D‐joint‐branch    63.8
            +D‐bone‐branch     64.3
            +Angle‐estimator   71.9
            +PFI               73.0

Note: The "+" denotes which module is added compared with the previous line; bold values are the highest in each test condition.
Abbreviations: BG, bag walking sequence; CL, coat walking sequence; NM, normal walking sequence.

FIGURE 7 Averaged Top‐1 accuracies for the probe data being BG. BG, bag walking sequences

FIGURE 8 Averaged Top‐1 accuracies for the probe data being CL. CL, coat walking sequences

4.4 | Discussion

4.4.1 | Ablation study

To assess the effectiveness of the branches and the other modules, we compared model performance under different configurations on the CASIA‐B dataset. The structure of the models is the same, except for the number of branches and whether the Angle Estimator or the PFI module is applied. As shown in Table 3, the initial status of the model contains a joint‐only branch. Here, we compared models with only 1 branch (S‐Joint‐Branch), 2 branches (2 static branches), 3 branches (2 static branches + D‐Joint‐Branch), and 4 branches (2 static branches + 2 dynamic branches). Additionally, we evaluate the performance of the Angle Estimator module and the PFI module. The evaluation metrics are similar to those in Table 2, except that we only compare the mean value of the Top‐1 accuracies over the 11 probe angles. The results in Table 3 indicate that all the elements we add to this model are effective, although their contributions vary. It is obvious that an S‐Joint‐Branch alone cannot extract thorough gait features. As more branches are added, the network is able to learn multi‐scale features from the joints and bones. The Angle Estimator contributes the most in the CL condition, with a 7.6% margin over the four‐branch configuration. The PFI module further improves the model performance to a state‐of‐the‐art level, which proves its effectiveness in balancing the weights between the body parts and frames.

4.4.2 | Visualised distributions

To illustrate the impact of the proposed methods on alleviating the discrepancy between different walking conditions and on increasing discriminability, we randomly select 15 identities from the testing set to visualise the distributions of the learned features by t‐SNE in Figure 9.

FIGURE 9 The t‐SNE visualisation results of our proposed model with specific structures. It can be seen that the multi‐branch structure, angle estimator, and PFI module all contribute to the final performance. (a) Features extracted by the S‐Joint‐Branch structure; (b) Features extracted by our Multi‐Branch structure, not including the Angle‐Estimator and PFI module; (c) Features extracted by our MAST‐GCN without the PFI module; (d) Features extracted by our MAST‐GCN. MAST‐GCN, multi‐branch angle aware spatial temporal graph convolutional neural network; PFI, part‐frame‐importance

4.4.3 | Ability of the angle estimator

Following the majority of the previous work, three evaluation metrics, namely the Pearson correlation coefficient (PC) [49], the mean absolute error (MAE), and the root mean squared error (RMSE), are utilised to evaluate the performance of the Angle Estimator on the test set of the CASIA‐B dataset. The results are listed in Table 4, which can be regarded as a reflection of the ability to accurately predict the view angle of the walking condition.

TABLE 4 Performance of the angle estimator with label regression in terms of PC, MAE, and RMSE

Metrics                   PC              MAE      RMSE
Results on the test set   9.9999 × 10⁻¹   0.0040   0.0204

Abbreviations: MAE, mean absolute error; PC, Pearson correlation; RMSE, root mean squared error.
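The three metrics in Table 4 can be computed as in the sketch below; flattening the two angle‑vector components before computing PC is an assumption about the aggregation, which the paper does not specify.

```python
import numpy as np

def angle_metrics(pred, true):
    """PC, MAE and RMSE between predicted and ground-truth angle vectors,
    the three metrics reported in Table 4, computed over all vector components."""
    pred, true = np.ravel(pred), np.ravel(true)
    pc = np.corrcoef(pred, true)[0, 1]           # Pearson correlation coefficient
    mae = np.mean(np.abs(pred - true))           # mean absolute error
    rmse = np.sqrt(np.mean((pred - true) ** 2))  # root mean squared error
    return pc, mae, rmse

# Toy check with near-perfect predictions of [sin, cos] angle vectors.
theta = np.linspace(0, np.pi, 50)
true = np.stack([np.sin(theta), np.cos(theta)], axis=1)
pred = true + np.random.normal(0, 0.005, true.shape)
print(angle_metrics(pred, true))
```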

5 | CONCLUSIONS

In this paper, we propose a multi‐branch spatial‐temporal graph convolution network to solve the model‐based gait recognition problem. The gait frames are first processed into skeleton sequences by a pose estimation network, so only the 2D skeleton data are used in the model. To obtain robust motion information, the Angle Estimator module and the multi‐branch pose information aggregation structure are fused to get the final gait vector. The ST‐GCN networks, together with the PFI module, are able to capture subtle connections between body parts and skeleton frames. Carefully designed experiments on the CASIA‐B dataset show that our MAST‐GCN reaches state‐of‐the‐art results compared with recent well‐known related networks.

ACKNOWLEDGEMENTS
This work was supported by NSFC 62088101 Autonomous Intelligent Unmanned Systems, the National Key Research and Development Program of China under Grant 2021ZD0201400 and Grant 2022ZD0208800, and the Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China (No. ICT2022B04).

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in the CASIA Gait Database at http://www.cbsr.ia.ac.cn/english/Gait%20Databases.asp, reference number [8].

ORCID
Liyang Zheng https://orcid.org/0000-0002-4947-5977
Da Kong https://orcid.org/0000-0002-8788-1976
Yu Zhang https://orcid.org/0000-0002-0043-4904

REFERENCES
1. Isaac, E.R.H.P., et al.: Trait of gait: a survey on gait biometrics. arXiv preprint arXiv:1903.10744 (2019)
2. Chao, H., et al.: GaitSet: regarding gait as a set for cross‐view gait recognition. Proc. AAAI Conf. Artif. Intell. 33, 8126–8133 (2019). https://doi.org/10.1609/aaai.v33i01.33018126
3. Fan, C., et al.: GaitPart: temporal part‐based model for gait recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14225–14233 (2020)
4. Lin, B., Zhang, S., Bao, F.: Gait recognition with multiple‐temporal‐scale 3D convolutional neural network. In: Proceedings of the 28th ACM International Conference on Multimedia, vol. 1, pp. 3054–3062. ACM, New York (2020)
5. Zhang, Z., et al.: Gait recognition via disentangled representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4710–4719 (2019)
6. Shiraga, K., et al.: GEINet: view‐invariant gait recognition using a convolutional neural network. In: 2016 International Conference on Biometrics (ICB), 13–16 June 2016, Halmstad, Sweden, pp. 1–8. IEEE, Piscataway (2016)
7. Yu, S., et al.: GaitGAN: invariant gait feature extraction using generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 21–26 July 2017, Honolulu, HI, USA, pp. 30–37 (2017)
8. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: 18th International Conference on Pattern Recognition (ICPR'06), vol. 4, pp. 441–444. IEEE, Piscataway (2006)
9. Takemura, N., et al.: Multi‐view large population gait dataset and its performance evaluation for cross‐view gait recognition. IPSJ Trans. Comput. Vis. Appl. 10(4), 1–14 (2018). https://doi.org/10.1186/s41074-018-0039-6
10. Cao, Z., et al.: OpenPose: realtime multi‐person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 172–186 (2019). https://doi.org/10.1109/tpami.2019.2929257
11. Fang, H.‐S., et al.: RMPE: regional multi‐person pose estimation. In: 2017 IEEE International Conference on Computer Vision (ICCV), 22–29 October 2017, Venice, Italy. IEEE, Piscataway (2017)
12. Li, J., et al.: CrowdPose: efficient crowded scenes pose estimation and a new benchmark. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10855–10864 (2019)
13. Xiu, Y., et al.: Pose flow: efficient online pose tracking. In: BMVC (2018)
14. Sun, K., et al.: Deep high‐resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019)
15. Kreiss, S., Bertoni, L., Alexandre, A.: PifPaf: composite fields for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11977–11986 (2019)
16. Wu, Y., et al.: Adaptive graph representation learning for video person re‐identification. IEEE Trans. Image Process. 29, 8821–8830 (2020). https://doi.org/10.1109/tip.2020.3001693
17. Yang, J., et al.: Spatial‐temporal graph convolutional network for video‐based person re‐identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3289–3299 (2020)
18. Chai, Y., et al.: Automatic gait recognition using dynamic variance features. In: 7th International Conference on Automatic Face and Gesture Recognition (FGR06), pp. 475–480 (2006)
19. Gafurov, D., Snekkenes, E.: Towards understanding the uniqueness of gait biometric. In: 2008 8th IEEE International Conference on Automatic Face Gesture Recognition, pp. 1–8 (2008)
20. Liao, R., et al.: Pose‐based temporal‐spatial network (PTSN) for gait recognition with carrying and clothing variations. In: Chinese Conference on Biometric Recognition, pp. 474–483. Springer (2017)
21. An, W., et al.: Improving gait recognition with 3D pose estimation. In: Chinese Conference on Biometric Recognition, pp. 137–147. Springer, Cham (2018)
22. Liao, R., et al.: A model‐based gait recognition method with body pose and human prior knowledge. Pattern Recogn. 98, 107069 (2020). https://doi.org/10.1016/j.patcog.2019.107069
23. Li, N., Zhao, X., Ma, C.: A model‐based gait recognition method based on gait graph convolutional networks and joints relationship pyramid mapping. arXiv preprint arXiv:2005.08625 (2020)
24. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton‐based action recognition. Proc. AAAI Conf. Artif. Intell. 32 (2018)
25. Ren, B., et al.: A survey on 3D skeleton‐based action recognition using learning method. arXiv preprint arXiv:2002.05907 (2020)
26. Kipf, T.N., Welling, M.: Semi‐supervised classification with graph convolutional networks. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net (2017)
27. Wei, S.‐E., et al.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732 (2016)
28. Newell, A., Yang, K., Jia, D.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, pp. 483–499. Springer (2016)
29. Pavlakos, G., et al.: Coarse‐to‐fine volumetric prediction for single‐image 3D human pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7025–7034 (2017)
30. Chen, C.‐H., Ramanan, D.: 3D human pose estimation = 2D pose estimation + matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7035–7043 (2017)
31. Choi, H., Moon, G., Lee, K.M.: Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose. In: European Conference on Computer Vision, pp. 769–787. Springer, Cham (2020)
32. Wang, H., et al.: Knowledge graph convolutional networks for recommender systems. In: The World Wide Web Conference, pp. 3307–3313 (2019)
33. Wang, Z., et al.: Linkage based face clustering via graph convolution network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1117–1125 (2019)
34. Gao, J., et al.: VectorNet: encoding HD maps and agent dynamics from vectorized representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11525–11533 (2020)
35. Bruna, J., et al.: Spectral networks and locally connected networks on graphs. In: International Conference on Learning Representations (ICLR 2014), CBLS, April 2014 (2014)
36. Henaff, M., Bruna, J., LeCun, Y.: Deep convolutional networks on graph‐structured data. CoRR, abs/1506.05163 (2015)
37. Niepert, M., Ahmed, M., Kutzkov, K.: Learning convolutional neural networks for graphs. In: International Conference on Machine Learning, pp. 2014–2023. PMLR (2016)
38. Veličković, P., et al.: Graph attention networks. In: International Conference on Learning Representations (2018)
39. Chen, J., Ma, T., Xiao, C.: FastGCN: fast learning with graph convolutional networks via importance sampling. In: International Conference on Learning Representations (2018)
40. Li, M., et al.: Actional‐structural graph convolutional networks for skeleton‐based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3595–3603 (2019)
41. Zhang, X., Xu, C., Tao, D.: Context aware graph convolution for skeleton‐based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14333–14342 (2020)
42. Teepe, T., et al.: GaitGraph: graph convolutional network for skeleton‐based gait recognition. arXiv preprint arXiv:2101.11228 (2021)
43. Cheng, K., et al.: Skeleton‐based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 183–192 (2020)
44. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
45. Khosla, P., et al.: Supervised contrastive learning. In: Larochelle, H., et al. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 18661–18673. Curran Associates, Inc. (2020)
46. Lin, T.‐Y., et al.: Microsoft COCO: common objects in context. In: European Conference on Computer Vision, pp. 740–755. Springer (2014)
47. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, 7–9 May 2015, San Diego, CA, USA, Conference Track Proceedings (2015)
48. Smith, L.N., Topin, N.: Super‐convergence: very fast training of neural networks using large learning rates. In: Artificial Intelligence and Machine Learning for Multi‐Domain Operations Applications, vol. 11006, pp. 1100612. International Society for Optics and Photonics (2019)
49. Eisenthal, Y., Dror, G., Ruppin, E.: Facial attractiveness: beauty and the machine. Neural Comput. 18(1), 119–142 (2006). https://doi.org/10.1162/089976606774841602

How to cite this article: Zheng, L., et al.: Multi‐branch angle aware spatial temporal graph convolutional neural network for model‐based gait recognition. IET Cyber‐Syst. Robot. 4(2), 97–106 (2022). https://doi.org/10.1049/csy2.12052
