IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 9, SEPTEMBER 2023
Abstract—Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependency. However, ViT requires a large amount of computing resources to compute the global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires fewer computing resources (e.g., a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism is proposed to enlarge the receptive field in the ladder self-attention block by modelling diverse local self-attention for each branch and interacting among these branches. Second, the input feature of the ladder self-attention block is split equally along the channel dimension for each branch, which considerably reduces the computational cost of the ladder self-attention block (to nearly 1/3 of the parameters and FLOPs), and the outputs of these branches are then combined by a pixel-adaptive fusion. Therefore, the ladder self-attention block with a relatively small number of parameters and FLOPs is capable of modelling long-range interactions. Based on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2 M parameters and 1.9 G FLOPs, which is comparable to several existing models with more than 20 M parameters and 4 G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.

Index Terms—Ladder self-attention, light-weight vision transformer, multimedia information retrieval.

Manuscript received 8 July 2022; revised 16 February 2023; accepted 26 March 2023. Date of publication 7 April 2023; date of current version 4 August 2023. This work was supported partially by the NSFC under Grants U21A20471, U1911401, and U1811461, and in part by the Guangdong NSF Project under Grants 2023B1515040025 and 2020B1515120085. Recommended for acceptance by V. Koltun. (Corresponding author: Wei-Shi Zheng.)

Gaojie Wu and Yutong Lu are with the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510275, China (e-mail: wugj7@mail2.sysu.edu.cn; yutong.lu@nscc-gz.cn).

Wei-Shi Zheng is with the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510275, China, also with the Peng Cheng Laboratory, Shenzhen, Guangdong 518005, China, also with the Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Ministry of Education, Guangzhou, Guangdong 510275, China, and also with the Guangdong Province Key Laboratory of Information Security Technology, Guangdong 510006, China (e-mail: wszheng@ieee.org).

Qi Tian is with the Cloud and AI BU, Huawei, Shenzhen, Guangdong 518129, China (e-mail: tian.qi1@huawei.com).

Digital Object Identifier 10.1109/TPAMI.2023.3265499

I. INTRODUCTION

CONVOLUTIONAL neural networks (CNNs) have shown considerable promise as general-purpose backbones for computer vision tasks. Since AlexNet [1] was proposed for the ImageNet image classification challenge [2], CNN architectures have become increasingly powerful, with more careful designs [3], [4], [5], deeper connections [6], [7] and wider dimensions [8]. Thus, CNNs appear to be indispensable for various computer vision tasks. CNNs are successful due to the inductive bias implied in convolutional computations, and this inductive bias ensures that CNNs generalize well, as investigated in [9], [10].

Another prevalent architecture that has shown extraordinary achievements in computer vision tasks is the vision transformer. In the transformer, which was first proposed for sequence modelling and transduction tasks in natural language processing [11], [12], features can interact globally by modelling attention over long-range dependencies. Vision Transformer (ViT) models global interaction by computing attention over the feature map according to the projected query, key and value of each pixel. ViT owes these achievements to the considerable amount of computing resources in the model. However, models with a large number of parameters are difficult to deploy on edge-computing devices with limited memory storage and computing resources, such as FPGAs, and models with a large number of FLOPs usually require much inference time, because FLOPs measure the number of floating-point operations performed during inference.

In this work, we aim to develop a light-weight vision transformer backbone with a relatively small number of parameters and FLOPs. Existing methods [13], [14], [15], [16] have focused on reducing the number of floating-point operations by evolving the form of computing self-attention; however, the receptive field of the window-based self-attention block is restricted in most of these methods [15], [16], and only pixels divided into the same window can interact with each other in the window-based self-attention block. Therefore, the interactions between pixels in different windows cannot be modelled in one block.

Thus, we propose a light-weight ladder self-attention block with multiple branches, and a progressive shift mechanism is introduced to enlarge the receptive field explicitly. The expansion of the receptive field for the ladder self-attention block is accomplished according to the following strategy. First, diverse local self-attentions are modelled by steering pixels at the same spatial position to model interactions with pixels in diverse windows for different branches. Second, the progressive shift mechanism transmits the output features of the current branch to the subsequent branch. The self-attention in the subsequent branch is computed with the participation of the output features
0162-8828 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
WU et al.: PSLT: A LIGHT-WEIGHT VISION TRANSFORMER WITH LADDER SELF-ATTENTION AND PROGRESSIVE SHIFT 11121
Fig. 1. Illustration of prevalent transformer architectures for image classification. (a) ViT [31] with only one stage. (b) Swin Transformer [15], which uses window-based multi-head self-attention and multi-stage blocks. (c) Our proposed PSLT adopts light-weight ladder self-attention (SA) blocks in the final two stages to model long-range dependency among pixels divided into different windows within the same block, and light-weight convolutional blocks (the MobileNetV2 [21] block with a squeeze-and-excitation (SE) block [50]) are adopted in the first two stages. The details of PSLT are described in Section III, and the structures of the PSW-MHSA and pixel-adaptive fusion module are shown in Fig. 3. "Shift 1" and "Shift 2" denote shift operations with different directions. "d1" and "d2" denote the numbers of channels, and d1 = 3d2.
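The shift-then-partition idea sketched in Fig. 1(c) — circularly shifting the feature map before the window partition so that each branch groups pixels into different windows — can be illustrated with a minimal NumPy sketch (a hedged illustration of ours; the function names and the 8 × 8 map with 4 × 4 windows are our own choices, not from the paper, which uses 7 × 7 windows):

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W) map into non-overlapping M x M windows."""
    H, W = x.shape
    return (x.reshape(H // M, M, W // M, M)
             .transpose(0, 2, 1, 3)
             .reshape(-1, M, M))

def shifted_window_partition(x, shift, M):
    """Circularly shift the map before partitioning (Swin-style shift)."""
    return window_partition(np.roll(x, (-shift, -shift), axis=(0, 1)), M)

H = W = 8
M = 4                                    # illustrative window size
pixels = np.arange(H * W).reshape(H, W)  # unique id per pixel position

branch0 = window_partition(pixels, M)             # unshifted windows
branch1 = shifted_window_partition(pixels, 2, M)  # shifted windows

# Each branch groups the same pixels into different windows, so a pixel
# attends to a different neighbourhood in each branch.
groups0 = [frozenset(int(v) for v in w.ravel()) for w in branch0]
groups1 = [frozenset(int(v) for v in w.ravel()) for w in branch1]
```

For instance, the corner pixels 0 and 63 fall in different windows of `branch0` but share a window in `branch1`; chaining branches with different shifts is what enlarges the effective receptive field of the block.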
in the current branch, allowing interactions among features in different windows in the two branches. As a result, the ladder self-attention block with the progressive shift mechanism can model long-range interactions among pixels divided into different windows. In addition, in the ladder self-attention block, each branch takes only an equal proportion of the input channels of the block, which considerably reduces the number of parameters and FLOPs. All the channels in these branches are then aggregated by the proposed pixel-adaptive fusion module to generate the output of the ladder self-attention block. Based on the above designs, we developed a light-weight general-purpose backbone with a relatively small number of parameters.

The overall framework of our model is shown in Fig. 1. Following the conclusion of [9], [10] that convolution in the early stages helps the model learn better and faster, PSLT adopts light-weight convolutional blocks in the early stages and the ladder self-attention blocks in the latter stages. Our PSLT has several important characteristics:

1) PSLT uses light-weight ladder self-attention blocks, which greatly reduce the number of trainable parameters and FLOPs. The ladder self-attention block first divides the input feature map into several equal proportions along the channel axis. Then, each part of the feature map is sent to an individual branch to compute the self-attention similarity.

2) The ladder self-attention block is designed to acquire a large receptive field while requiring a relatively small amount of computing resources. PSLT models local attention on each branch for computing efficiency; more importantly, without introducing extra parameters, PSLT adopts the progressive shift mechanism to model interactions among pixels in different windows to enlarge the receptive field of each ladder self-attention block. The progressive shift mechanism forms a block shaped like a ladder.

Our proposed PSLT achieves excellent performance on visual tasks such as image classification, object detection and person re-identification with a relatively small number of trainable parameters. With less than 10 million parameters and 2 G FLOPs, PSLT achieves a top-1 accuracy of 79.9% on ImageNet-1k image classification at input resolution 224 × 224 without pretraining on extra large datasets.

We notice that there are recent light-weight transformers [17], [18], [19] that combine local interaction (convolution) and global self-attention in one block. In this work, we provide
another perspective, and our PSLT is capable of modelling long-range interaction by effectively incorporating diverse local self-attentions and evolving the self-attention of the latter branches in the ladder self-attention block with a relatively small number of parameters and FLOPs.

In summary, our contributions are as follows:
- We propose a light-weight ladder self-attention block with multiple branches and considerably fewer parameters. A pixel-adaptive fusion module is developed to aggregate features from multiple branches with adaptive weights along both the spatial and channel dimensions.
- We propose a progressive shift mechanism in the ladder self-attention block to enlarge the receptive field. The progressive mechanism not only allows modelling interaction among pixels in different windows for long-range dependencies, but also reduces the number of parameters necessary to obtain the value for computing self-attention in the following branch.
- We develop a general-purpose backbone (PSLT) with a relatively small number of parameters. PSLT is constructed with light-weight convolutional blocks and the ladder self-attention blocks, improving its generalization ability.

II. RELATED WORK

A comparison of various models is shown in Table I. Our PSLT was manually developed and has a relatively small number of parameters, and PSLT does not utilize a search phase before training. In the following section, we detail some related works.

TABLE I
COMPARISON OF VARIOUS GENERAL-PURPOSE BACKBONES. "T" INDICATES THAT THE MODEL ADOPTS THE TRANSFORMER ARCHITECTURE. "ADAPTIVE" INDICATES THAT THE MODEL CAN ADAPT TO THE INPUT IMAGE. "LIGHT" DENOTES THAT THE MODEL HAS A RELATIVELY SMALL NUMBER OF PARAMETERS AND FLOPS. "NAS" DENOTES THAT THE MODEL NEEDS ANOTHER SEARCH PROCESS BEFORE TRAINING FROM SCRATCH

A. Convolutional Neural Networks

Since the development of AlexNet [1], CNNs have become the most popular general-purpose backbones for various computer vision tasks. More effective and deeper convolutional neural networks have been proposed to further improve the model capacity for feature representation, such as VGG [6], ResNet [3], DenseNet [7] and ResNeXt [5]. However, large-scale neural networks are difficult to deploy on edge devices, which have limited memory resources. Thus, various works have focused on designing light-weight neural networks with fewer trainable parameters.

- Light-Weight CNN. To decrease the number of trainable parameters, MobileNets [20], [21], [22] substitute the standard convolution operation with a more efficient combination of depthwise and pointwise convolutions. ShuffleNet [23] uses group convolution and channel shuffle to further simplify the model. Because the manual design of neural networks is time-consuming, automatically designing convolutional neural architectures has shown great potential for developing high-performance neural networks [24], [25], [26]. EfficientNet [27] investigates the model width, depth and resolution with a neural architecture search [24], [25], [28].

Although CNNs can effectively model local interactions, adaptation to the input data is absent in standard convolution. Adaptive methods are usually capable of improving model capacity through dynamic weights or receptive fields that vary with the input image. RedNet [29] produces dynamic weights for distinct input data. Deformable convolution [30] produces adaptive locations for the convolution kernels. Our proposed backbone (PSLT) inherits the virtue of convolution, whose implied inductive bias helps when modelling local interactions, and takes advantage of self-attention [11] to model long-range dependency and adapt to the input data with dynamic weights.

B. Vision Transformer

Transformers [11], [12] are widely used in natural language processing, and the transformer architecture for computer vision tasks has evolved since ViT [31] was proposed for image classification. ViT has a one-stage structure that produces features at only one scale. PiT [39] adopts depthwise convolution [20] to decrease the spatial dimension. CrossViT [40] models multi-scale features via small and large patch embedding sizes. Recent transformer architectures [13], [14], [15], [37], [41] have adopted hierarchical structures like popular CNNs to yield multi-scale features through patch merging to progressively reduce the number of patches. To mitigate the issue that the tokenization in ViT [42] may destroy the object structure, PS-ViT [32] locates discriminative regions by iteratively sampling tokens.

Since self-attention requires a large amount of computing resources to model long-range interactions, recent works [15], [16], [43] have approximated global interaction by modelling local and long-range dependency. Swin Transformer [15] divides the feature map into multiple windows, and self-attention is only
Fig. 3. Example of detailed blocks in PSLT. (a) Window-based multi-head self-attention in the Swin Transformer [15]. (b) The ladder self-attention block with two branches as an example. The progressive shift mechanism transmits the output feature of the previous branch to the subsequent branch, which further enlarges the receptive field. Only the PSW-MHSA is illustrated in detail; LayerNorm and the Light FFN are shown in Fig. 1 and omitted here for simplicity. "WP" indicates the window partition. (c) The pixel-adaptive fusion module in the ladder self-attention block. "Weights" denotes the adaptive weights along the channel and spatial dimensions, which are the output of the second fully connected layer. The details of the PSW-MHSA are described in Section III-B1.
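The two-branch ladder block of Fig. 3(b) can be sketched in a few lines of NumPy (a simplified single-window illustration of ours; the window partition, LayerNorm, Light FFN and pixel-adaptive fusion are omitted, and all weight names and sizes are our own). The key point is that the second branch projects only a query and a key from its own input and reuses the first branch's output as the value:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over one window of tokens."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

N, d = 16, 8                        # tokens per window, channels per branch
I0 = rng.standard_normal((N, d))    # channel split sent to branch 0
I1 = rng.standard_normal((N, d))    # channel split sent to branch 1

Wq0, Wk0, Wv0, Wq1, Wk1 = (rng.standard_normal((d, d)) for _ in range(5))

# Branch 0: ordinary window self-attention with its own value projection.
O0 = attention(I0 @ Wq0, I0 @ Wk0, I0 @ Wv0)

# Branch 1: query/key come from I1, but the value is the previous branch's
# output O0, so this branch learns no value projection of its own.
O1 = attention(I1 @ Wq1, I1 @ Wk1, O0) + I1

fused = np.concatenate([O0, O1], axis=-1)  # stand-in for the fusion module
```

Because branch 1 consumes `O0` as its value, its output mixes information from both branches' windows, which is the mechanism the text credits for the enlarged receptive field.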
multiple branches, and the training cost (number of trainable parameters) is decreased.

Formally, the ladder self-attention blocks are computed as

    Ô_t = PSW-MHSA(I_t, O_{t-1})
    PSW-MHSA(I_t, O_{t-1}) = softmax(Q_{I_t} K_{I_t}^T / √d) O_{t-1} + I_t
    O_t = LFFN(LN(Ô_t))
    Ō = PAFM(O_t), t = 0, 1, 2, . . . (1)

The output (Ô_t) of the PSW-MHSA in the t-th branch is computed according to the input feature of the t-th branch (I_t) and the output feature of the (t-1)-th branch (O_{t-1}). In the PSW-MHSA, the query (Q_{I_t}) and the key (K_{I_t}) are projected from I_t, and O_{t-1} is directly applied as the value. The first branch is computed as O_0 = MHSA(I_0), as shown in Fig. 3(a). Then, the Light FFN (LFFN) and layer norm (LN) are applied to produce the output of the t-th branch (O_t). Finally, a pixel-adaptive fusion module (PAFM) is developed to generate the output of the ladder self-attention block (Ō) according to the outputs of all branches.

- Complexity Analysis. Our intention is to build a light-weight backbone, and the PSW-MHSA is proposed for modelling long-range dependency. Here, we roughly compare the number of trainable parameters and the computational complexity of the W-MHSA and PSW-MHSA. Taking an input of size h × w × d_1 as an example, each window has M × M non-overlapping patches, and B branches are contained in the PSW-MHSA. Then, the numbers of trainable parameters in the W-MHSA and PSW-MHSA are:

    φ(W-MHSA) = 4d_1^2
    φ(PSW-MHSA) = 3d_1^2/B + d_1^2/B^2. (2)

The computational complexities of the W-MHSA and PSW-MHSA are:

    Ω(W-MHSA) = 4hwd_1^2 + 2M^2 hwd_1
    Ω(PSW-MHSA) = 3hwd_1^2/B + hwd_1^2/B^2 + 2M^2 hwd_1. (3)

As can be seen, when B = 1, the PSW-MHSA has the same number of trainable parameters and the same complexity as the W-MHSA. When B > 1, the number of trainable parameters and the computational complexity are both considerably reduced (B is set to 3 by default).

2) Light FFN: As presented in Fig. 1(b), the output of the MHSA is processed by the FFN. To further decrease the number of trainable parameters and floating-point operations in PSLT, a Light FFN is proposed to replace the original FFN, as shown
in Fig. 1(c). In contrast to the original FFN, which uses a fully connected layer with d channels for both the input and output features, the Light FFN first projects the input with d_2 channels to a narrower feature with d_2/4 channels. Then, a depthwise convolution is applied to model local interactions, and a pointwise convolution is adopted to restore the channels.

- Complexity Analysis. Taking the input of an FFN layer of size h × w × d_2 as an example, the numbers of trainable parameters of the FFN and LFFN are:

    φ(FFN) = d_2^2
    φ(LFFN) = d_2^2/2 + 9d_2/4. (4)

The computational complexities of the FFN and LFFN are:

    Ω(FFN) = hwd_2^2
    Ω(LFFN) = hwd_2^2/2 + 9hwd_2/16. (5)

In general, the number of input feature channels d_2 is substantially larger than 9/2; thus, the LFFN is more computationally economical than the FFN, as it has fewer trainable parameters and floating-point operations.

3) Pixel-Adaptive Fusion Module: PSLT divides the input feature map equally along the channel dimension for the multiple branches to decrease the number of parameters and the computational complexity. To model long-range dependency, the progressive shift mechanism in the ladder self-attention block divides the same pixel into different windows in each branch. Because the output features of the branches are produced by integrating feature information in different windows, PSLT proposes a pixel-adaptive fusion module to efficiently aggregate multi-branch features with adaptive weights along the spatial and channel dimensions.

More specifically, the output features of all the branches are concatenated and sent to two fully connected layers to yield the weights for each pixel. The weights indicate the importance of the features from all the branches for each pixel in the feature map. To ensure that the number of trainable parameters is not increased, the first fully connected layer halves the number of channels. The features are then multiplied by the weights and sent to a fully connected layer for feature fusion along the channel dimension.

Different from the conventional squeeze-and-excitation block [50], which measures only the importance of the channels, we introduce the pixel-adaptive fusion module to integrate the output features of all branches with adaptive weights in both the spatial and channel dimensions for two reasons. First, because the pixels in each branch are divided into different windows, the pixels at various spatial locations contain distinct information. Second, along the channel dimension, the input feature map is split equally for each branch, so the pixel information at the same spatial location is different among the branches. The experimental results show that our proposed pixel-adaptive fusion module is more suitable for feature fusion than the squeeze-and-excitation block [50].

TABLE III
SPECIFICATION FOR PSLT. "CONV2D" AND "MBV2BLOCK" INDICATE THE STANDARD CONVOLUTIONAL OPERATION AND THE LIGHT-WEIGHT BLOCK IN MOBILENETV2 [21] WITH THE SQUEEZE-AND-EXCITATION LAYER [50]. "CONV2D↓" AND "MBV2BLOCK+SE↓" DENOTE THE OPERATION OR BLOCK FOR DOWNSAMPLING WITH STRIDE 2. "FC" DENOTES THE FULLY CONNECTED LAYER. PSLT HAS A TOTAL OF 9.223 M PARAMETERS ACCORDING TO THE TABLE

C. Network Specification

Table III shows the architecture of PSLT, which contains 9.2 M trainable parameters for image classification on the ImageNet dataset [2]. PSLT stacks the light-weight convolutional blocks (MBV2Block+SE) in the first two stages and the proposed ladder self-attention blocks in the final two stages. PSLT starts with three 3 × 3 convolutions as a stem to produce the input feature of the first stage. In both stage 1 and stage 2, the first convolutional block has a stride of 2 to downsample the input feature map. The final block in stage 2 is utilized to downsample the input feature map of stage 3. A 2 × 2 convolution is applied to downsample the feature map of stage 4. Each time the size of the feature map is halved, the number of feature channels doubles.

Finally, the classifier head applies global average pooling to the output of the last block. The globally pooled feature passes through a fully connected layer to yield an image classification score. Without the classifier head, PSLT with multi-scale features can easily be applied as a backbone in existing methods for various visual tasks.

- Detailed Configuration. Following Swin Transformer [15], the window size is set to 7 × 7, and the window partition and position embedding are produced with a mechanism similar to that of [15]. The number of heads in the PSW-MHSA is set to 4 and 8 in the last two stages, respectively. The stride of the shift operation (shown in Fig. 1(c)) in the PSW-MHSA is set to 3. The shift operation is similar to that in [15], but the shift directions differ from each other in the last two branches.

IV. EXPERIMENT

In this section, we evaluate the effectiveness of the proposed PSLT by conducting experiments on several tasks, including ImageNet-1k image classification [2], COCO object detection [51] and Market-1501 person re-identification [52].
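Before turning to the experiments, the counts from the complexity analysis of Section III (eqs. (2)-(5)) are easy to check numerically. The helper below is our own illustration (the channel width 192 is an arbitrary example); it confirms that with B = 3 branches the PSW-MHSA needs under 1/3 of the W-MHSA's projection parameters:

```python
def w_mhsa_params(d1):
    return 4 * d1 ** 2                         # eq. (2), W-MHSA

def psw_mhsa_params(d1, B):
    return 3 * d1 ** 2 / B + d1 ** 2 / B ** 2  # eq. (2), PSW-MHSA

def w_mhsa_flops(h, w, d1, M):
    return 4 * h * w * d1 ** 2 + 2 * M ** 2 * h * w * d1   # eq. (3)

def psw_mhsa_flops(h, w, d1, M, B):
    return (3 * h * w * d1 ** 2 / B + h * w * d1 ** 2 / B ** 2
            + 2 * M ** 2 * h * w * d1)                      # eq. (3)

def ffn_params(d2):
    return d2 ** 2                             # eq. (4), single d2 -> d2 layer

def lffn_params(d2):
    return d2 ** 2 / 2 + 9 * d2 / 4            # eq. (4), Light FFN

# Illustrative setting: d1 = 192 channels, 7 x 7 windows, B = 3 branches.
d1, M, B = 192, 7, 3
ratio = psw_mhsa_params(d1, B) / w_mhsa_params(d1)  # (3/B + 1/B**2) / 4
```

With B = 3 the ratio is 10/36 ≈ 0.28, consistent with the roughly one-third reduction claimed in the abstract, and setting B = 1 recovers the W-MHSA counts exactly.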
TABLE IV
IMAGE CLASSIFICATION PERFORMANCE ON THE IMAGENET WITHOUT PRETRAINING. MODELS WITH SIMILAR NUMBER OF PARAMETERS ARE PRESENTED FOR
COMPARISON. “INPUT” INDICATES THE SCALE OF THE INPUT IMAGES. “TOP-1” (“V2 TOP-1”) DENOTES THE TOP-1 ACCURACY ON IMAGENET (IMAGENET-V2).
“#PARAMS” REFERS TO THE NUMBER OF TRAINABLE PARAMETERS. “#FLOPS” IS CALCULATED ACCORDING TO THE CORRESPONDING INPUT SIZE. “+” INDICATES
THAT THE PERFORMANCE IS CITED FROM MOBILE-FORMER [38]. “*” INDICATES THAT THE MODEL WAS TRAINED WITH DISTILLING FROM AN EXTERNAL
TEACHER. “‡” DENOTES EXTRA TECHNIQUES ARE APPLIED, SUCH AS MULTI-SCALE SAMPLER [19] AND EMA [2]. “PACC” (“FACC”) IS THE RATIO OF TOP-1
ACCURACY TO THE NUMBER OF PARAMETERS (FLOPS)
TABLE VI
OBJECT DETECTION PERFORMANCE ON COCO. THE AP VALUE ON THE VALIDATION SET IS REPORTED. “MS” DENOTES THAT MULTI-SCALE TRAINING [14] WAS
USED. “+” INDICATES THAT THE RESULT IS TAKEN FROM [14]. FLOPS IS CALCULATED AT INPUT RESOLUTION 1280 × 800. “*” DENOTES THAT THE MODEL IS
TRAINED BASED ON SPARSE R-CNN [72] FOLLOWING [15]
TABLE VII
OBJECT DETECTION AND INSTANCE SEGMENTATION PERFORMANCE ON COCO. AP^b AND AP^m DENOTE THE BOUNDING BOX AP AND MASK AP ON THE VALIDATION SET, RESPECTIVELY. "MS" DENOTES THAT MULTI-SCALE TRAINING [14] WAS USED. "+" INDICATES THAT THE RESULT WAS TAKEN FROM [14]. FLOPS IS CALCULATED AT INPUT RESOLUTION 1280 × 800
commonly-used convolutional neural networks with a small amount of parameters. As shown in Table IV, compared with WRN [8], PSLT achieves comparable performance on CIFAR-10/100 with a similar amount of parameters but higher top-1 accuracy on ImageNet with far fewer parameters. Compared to the vision transformers, our PSLT shows higher model generalization ability with fewer training images, achieving relatively higher top-1 accuracy on both CIFAR-10 (95.43% versus 95.0%) and CIFAR-100 (79.1% versus 78.63%) than MOA-T with fewer trainable parameters (8.7 M versus 30 M).

B. Object Detection

- Experimental Setting. We conduct object detection experiments on COCO 2017 [51], which contains 118 k training images and 5 k validation images covering 80 classes. Following PVT [14], which uses standard settings to generate multi-scale feature maps, we evaluate our proposed PSLT as a backbone with two standard detectors: RetinaNet [73] (one-stage) and Mask R-CNN [74] (two-stage). During training, our backbone (PSLT) was first initialized with the weights pretrained on ImageNet [2], and the newly added layers were initialized with Xavier initialization [75]. Our PSLT was trained with a batch size of 16 on 8 GeForce 1080Ti GPUs. We adopted the AdamW [59] optimizer with an initial learning rate of 1 × 10^-4. Following the common PVT settings, we adopted a 1× or 3× training schedule (12 or 36 epochs) to train the detection models. The shorter side of the training images was resized to 800 pixels, while the longer side was capped at 1,333 pixels. The shorter side of the validation images was fixed to 800 pixels. With the 3× training schedule, the shorter side of the input image was randomly resized within the range [640, 800].

- Experimental Results. Table VI shows the performance of the backbones on COCO val2017; the backbones were initialized with weights pretrained on ImageNet, using RetinaNet for object detection. As shown in the table, with a similar number of trainable parameters, PSLT achieves comparable performance. With the 1× training schedule, PSLT outperforms Mobile-Former, achieving a higher AP (41.2 versus 38.0) with slightly more parameters (19.1 M versus 17.9 M) and FLOPs (192 G versus 181 G). The last block of Table VI presents models with more than 30 M parameters; PSLT achieves a higher AP (41.2 versus 40.4) than PVT-S with considerably fewer parameters (19.1 M versus 34.2 M) and FLOPs (192 G versus 226 G). Compared with the commonly used backbone ResNet-50, PSLT achieves a considerably higher AP (41.2 versus 35.3) with nearly
50% fewer trainable parameters (19.1 M versus 37.7 M) and fewer FLOPs (192 G versus 239 G). With the 3× training schedule, PSLT shows higher performance with a similar number of parameters, achieving a 1.7-point higher AP than PVT-S (43.9 versus 42.2) with considerably fewer parameters.

Similar results are shown in Table VII, including segmentation experiments with Mask R-CNN. With the 1× training schedule, PSLT outperforms ResNet-50, achieving a 2.8-point higher box AP (40.8 versus 38.0) and a 2.9-point higher mask AP (37.3 versus 34.4) with fewer parameters (28.9 M versus 44.2 M) and FLOPs (211 G versus 253 G). PSLT shows comparable performance to PVT-S (40.8 versus 40.4 box AP and 37.3 versus 37.8 mask AP) with considerably fewer parameters (28.9 M versus 44.1 M) and FLOPs (211 G versus 305 G). Similar results are observed with the 3× training schedule; PSLT outperforms PVT-T, achieving a 2.9-point higher AP^b and a 1.5-point higher AP^m with a smaller number of parameters (28.9 M versus 32.9 M) and FLOPs (211 G versus 240 G). Moreover, PSLT achieves comparable performance to ResNet-50, which has more than 40 M parameters.

For a fair comparison with Swin [15], we implement our PSLT-Large with the same experimental settings as Swin [15] with the Sparse R-CNN [72] framework (in Table VI) and with the Mask R-CNN [74] framework (in Table VII). Our PSLT-Large achieves comparable performance with fewer parameters and FLOPs.

C. Semantic Segmentation

- Experimental Setting. Semantic segmentation experiments are conducted on a challenging scene parsing dataset, ADE20K [76]. ADE20K contains 150 fine-grained semantic categories with 20,210, 2,000 and 3,352 images in the training, validation and test sets, respectively. Following PVT [14], we evaluate our proposed PSLT as a backbone on the basis of Semantic FPN [77], a simple segmentation method without dilated convolution [78]. In the training phase, the backbone was first initialized with the weights pretrained on ImageNet [2], and the newly added layers were initialized with Xavier initialization [75]. Our PSLT was trained for 80 k iterations with a batch size of 16 on 4 GeForce 1080Ti GPUs. We adopted the AdamW [59] optimizer with an initial learning rate of 2 × 10^-4. The learning rate follows a polynomial decay schedule with a power of 0.9. The training images were randomly resized and cropped to 512 × 512, and the images in the validation set were rescaled with the shorter side set to 512 pixels during testing.

- Experimental Results. Table VIII shows the semantic segmentation performance on the ADE20K validation set for backbones initialized with weights pretrained on ImageNet, using Semantic FPN [77]. As shown in the table, PSLT outperforms the ResNet-based models [3], achieving a mIoU 2.3 points higher than that of ResNet-50 (39.0% versus 36.7%) with nearly half

TABLE VIII
SEMANTIC SEGMENTATION PERFORMANCE OF DIFFERENT BACKBONES ON ADE20K. THE MEAN INTERSECTION OVER UNION (MIOU) IS REPORTED. "+" INDICATES THAT THE RESULT IS TAKEN FROM [14]. "*" DENOTES THAT THE MODEL IS TRAINED BASED ON UPERNET [79] FOLLOWING [15]

For a fair comparison with Swin [15], we implement our PSLT-Large with the same experimental setting as Swin [15] with the UperNet [79] framework. Our PSLT-Large achieves comparable performance with fewer parameters and FLOPs.

D. Person Re-Identification

- Experimental Setting. Person re-identification (ReID) experiments are conducted on Market-1501 [52], a dataset that contains 12,936 images of 751 people in the training set and 19,732 images of 750 different people in the validation set. All the images in the Market-1501 dataset were captured by six cameras. We trained our PSLT on the training set to classify each person, similar to an image classifier with 751 training classes. The output feature of the final stage of PSLT was utilized to compute the similarity between the query image and the gallery images for validation, and the rank-1 accuracy and mean average precision (mAP) are reported. For a fair comparison, we follow Bag of Tricks [80] in performing the ReID experiments. The data augmentation techniques applied to the training images include horizontal flipping, central padding, random cropping to 256 × 128, normalizing by subtracting the channel mean and dividing by the channel standard deviation, and random erasing [57]. The data augmentation techniques applied to the test images include resizing to 256 × 128 and the same channel-wise normalization.

The proposed PSLT was trained for 360 epochs under the guidance of the cross-entropy loss and the triplet loss [87] using the Adam optimizer. Similarly, PSLT adopts the BNNeck [80] structure to compute the triplet loss. The output feature after the global average pooling layer first passes through a BatchNorm layer before being sent to the classifier. The feature before the BatchNorm layer was used to compute the triplet loss, and the output score of the classifier was used to compute the cross-entropy loss for training. For validation, the feature after the BatchNorm layer was adopted to compute the similarity. A batch
the number of parameters (13.0 M versus 28.5 M) and 20% less size of 64 was applied during training. The initial learning rate is
FLOPs than ResNet-50 (32.0 G versus 45.6 G). With almost set to 1.5 × 10−4 , and the learning rate was decayed by a factor
the same number parameters and FLOPs, PSLT achieves a 2.3% of 10 at epochs 150, 225 and 300. The weight decay was set to
points higher than PVT-T (39.0% vs 36.7%). 5 × 10−4 .
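The BNNeck training described above (triplet loss on the feature before the BatchNorm layer, cross-entropy on the classifier output, and step learning-rate decay) can be sketched as follows. This is a minimal PyTorch illustration of the structure from Bag of Tricks [80], not the authors' code; the feature dimension, the module names, and the use of `nn.TripletMarginLoss` are our own assumptions.

```python
import torch
import torch.nn as nn

class BNNeckHead(nn.Module):
    """Sketch of the BNNeck head from Bag of Tricks [80]: the pooled
    backbone feature feeds the triplet loss directly, while a
    BatchNorm-ed copy feeds the classifier for the cross-entropy loss.
    Feature dimension and names here are illustrative, not from PSLT code."""

    def __init__(self, feat_dim: int, num_classes: int = 751):
        super().__init__()
        self.bnneck = nn.BatchNorm1d(feat_dim)
        self.bnneck.bias.requires_grad_(False)  # BNNeck conventionally freezes the BN bias
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)

    def forward(self, pooled_feat):
        # pooled_feat: (B, C) output of global average pooling over the final stage
        feat_bn = self.bnneck(pooled_feat)   # after-BN feature: CE loss (train), similarity (test)
        logits = self.classifier(feat_bn)
        return pooled_feat, feat_bn, logits  # before-BN feature is used for the triplet loss

head = BNNeckHead(feat_dim=512)
feats = torch.randn(64, 512)                 # batch size 64, as in the text
labels = torch.randint(0, 751, (64,))
pre_bn, post_bn, logits = head(feats)
ce_loss = nn.functional.cross_entropy(logits, labels)
# A margin-based triplet loss [87] (e.g., nn.TripletMarginLoss with sampled
# anchor/positive/negative triplets) would additionally be applied to `pre_bn`.

# Optimizer and schedule from the text: Adam, lr 1.5e-4, weight decay 5e-4,
# learning rate decayed by a factor of 10 at epochs 150, 225 and 300.
optimizer = torch.optim.Adam(head.parameters(), lr=1.5e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 225, 300], gamma=0.1)
```

At test time, only `post_bn` would be L2-normalized and compared against gallery features by cosine similarity to produce the rank-1 accuracy and mAP reported above.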
WU et al.: PSLT: A LIGHT-WEIGHT VISION TRANSFORMER WITH LADDER SELF-ATTENTION AND PROGRESSIVE SHIFT 11131
TABLE IX
PERFORMANCE OF THE MODELS ON MARKET-1501 WHEN THE MODELS ARE FIRST
PRETRAINED ON IMAGENET. "RANK-1" INDICATES THE RANK-1 ACCURACY RATE.
"MAP" DENOTES THE MEAN AVERAGE PRECISION. "FLOPS" IS NOT REPORTED FOR
THE COMPARED METHODS IN THE LITERATURE

TABLE X
COMPARISON OF MODEL INFERENCE. "MEM" DENOTES THE PEAK MEMORY FOR
EVALUATION. "FPS" IS THE NUMBER OF IMAGES PROCESSED PER SECOND

TABLE XI
ABLATION STUDY ON IMAGENET WITHOUT PRETRAINING. "#BRANCHES"
INDICATES THE NUMBER OF BRANCHES IN THE LADDER SELF-ATTENTION BLOCK
11132 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 9, SEPTEMBER 2023
TABLE XIII
ABLATION STUDY OF FFN LAYER ON IMAGENET WITHOUT PRETRAINING.
"LFFN" DENOTES OUR LIGHT FFN LAYER
TABLE XV
ABLATION STUDY OF THE PROPOSED MODULES FOR PSLT ON IMAGENET
WITHOUT PRETRAINING. "V1" AND "V2" DENOTE THE TOP-1 ACCURACY ON THE
IMAGENET AND IMAGENET-V2 VALIDATION SETS, RESPECTIVELY
The pixel-adaptive fusion module (PAFM) improves the model
performance by integrating the information of all branches in the
block. The top-1 accuracy of our PSLT and DeiT-S on both
the ImageNet and ImageNet-V2 validation sets conforms to the
linear fit in [53]. Our PSLT (like DeiT-S) achieves a smaller
improvement over ViT-B on the ImageNet-V2 validation set (2% and
1% top-1 accuracy improvement on the ImageNet and ImageNet-V2
validation sets, respectively). Such a phenomenon has been
similarly observed in [37], [69]. As described in [88], this shows
that the model generalization ability is challenged on a different
validation set.

VI. CONCLUSION

In this work, we propose a ladder self-attention block with
multiple branches requiring a relatively small number of
parameters and FLOPs. To improve the computation efficiency
and enlarge the receptive field of the ladder self-attention block,
the input feature map is split into several parts along the channel
dimension, and a progressive shift mechanism is proposed
for long-range dependency by modelling diverse local self-attention
on each branch and interacting among these branches.
A pixel-adaptive fusion module is finally designed to integrate
features from multiple branches with adaptive weights along
both the spatial and channel dimensions. Furthermore, the
ladder self-attention block adopts a Light FFN layer to decrease
the number of trainable parameters and floating-point operations.
Based on the above designs, we develop a general-purpose
backbone (PSLT) with a relatively small number of parameters
and FLOPs. Overall, PSLT applies convolutional blocks in the
early stages and ladder self-attention blocks in the latter stages.
PSLT achieves comparable performance with existing methods
on several computer vision tasks, including image classification,
object detection and person re-identification.
Our work explores a new perspective to model long-range
interaction by effectively utilizing diverse local self-attentions,
while recent works [17], [18], [38], [48] implement
effective methods to combine convolution and global self-attention.
We find that modelling local self-attention is more
environmentally friendly than modelling global self-attention.
For future work, we will investigate the direct computing
optimization of our PSLT for fast inference. We will also
investigate a new computation manner for self-attention, because
the similarity computation is time-consuming; for example, we
will investigate whether simple addition is powerful enough
to model interactions in the self-attention computation. PSLT
shows comparable performance without elaborately selecting
the hyperparameters, such as the number of blocks, the channel
dimension or the number of heads in the ladder self-attention
blocks. Furthermore, we can automatically select parameters
that significantly improve model performance with neural
architecture search techniques.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[2] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[4] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, "Res2Net: A new multi-scale backbone architecture," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 652–662, Feb. 2021.
[5] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1492–1500.
[6] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[7] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4700–4708.
[8] S. Zagoruyko and N. Komodakis, "Wide residual networks," 2016, arXiv:1605.07146.
[9] Z. Dai, H. Liu, Q. Le, and M. Tan, "CoAtNet: Marrying convolution and attention for all data sizes," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 3965–3977.
[10] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, "Early convolutions help transformers see better," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 30392–30400.
[11] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[13] J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," 2021, arXiv:2107.06263.
[14] W. Wang et al., "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 568–578.
[15] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10012–10022.
[16] W. Wang, L. Yao, L. Chen, D. Cai, X. He, and W. Liu, "CrossFormer: A versatile vision transformer based on cross-scale attention," 2021, arXiv:2108.00154.
[17] J. Pan et al., "EdgeViTs: Competing light-weight CNNs on mobile devices with vision transformers," 2022, arXiv:2205.03436.
[18] M. Maaz et al., "EdgeNeXt: Efficiently amalgamated CNN-transformer architecture for mobile vision applications," 2022, arXiv:2206.10589.
[19] S. Mehta and M. Rastegari, "MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer," 2021, arXiv:2110.02178.
[20] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
[22] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1314–1324.
[23] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6848–6856.
[24] B. Zoph and Q. Le, "Neural architecture search with reinforcement learning," in Proc. Int. Conf. Learn. Representations, 2017, pp. 1–46.
[25] Z. Guo et al., "Single path one-shot neural architecture search with uniform sampling," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 544–560.
[26] C. Li et al., "BossNAS: Exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 12281–12291.
[27] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
[28] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," in Proc. 7th Int. Conf. Learn. Representations, 2019, pp. 1–13.
[29] D. Li et al., "Involution: Inverting the inherence of convolution for visual recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12321–12330.
[30] J. Dai et al., "Deformable convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 764–773.
[31] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[32] X. Yue et al., "Vision transformer with progressive sampling," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 387–396.
[33] L. Yuan, Q. Hou, Z. Jiang, J. Feng, and S. Yan, "VOLO: Vision outlooker for visual recognition," 2021, arXiv:2106.13112.
[34] Y. Li et al., "EfficientFormer: Vision transformers at MobileNet speed," 2022, arXiv:2206.01191.
[35] H. Wu et al., "CvT: Introducing convolutions to vision transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 22–31.
[36] Z. Peng et al., "Conformer: Local features coupling global representations for visual recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 367–376.
[37] B. Graham et al., "LeViT: A vision transformer in ConvNet's clothing for faster inference," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 12259–12269.
[38] Y. Chen et al., "Mobile-Former: Bridging MobileNet and transformer," 2021, arXiv:2108.05895.
[39] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh, "Rethinking spatial dimensions of vision transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 11936–11945.
[40] C.-F. R. Chen, Q. Fan, and R. Panda, "CrossViT: Cross-attention multi-scale vision transformer for image classification," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 357–366.
[41] H. Fan et al., "Multiscale vision transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 6824–6835.
[42] L. Yuan et al., "Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 558–567.
[43] Q. Yu, Y. Xia, Y. Bai, Y. Lu, A. L. Yuille, and W. Shen, "Glance-and-gaze vision transformer," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 12992–13003.
[44] Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang, "Vision transformer with deformable attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 4794–4803.
[45] X. Chu et al., "Twins: Revisiting the design of spatial attention in vision transformers," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 9355–9366.
[46] Q. Zhang and Y.-B. Yang, "ResT: An efficient transformer for visual recognition," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 15475–15485.
[47] X. Dong et al., "CSWin transformer: A general vision transformer backbone with cross-shaped windows," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12124–12134.
[48] Y. Lee, J. Kim, J. Willette, and S. J. Hwang, "MPViT: Multi-path vision transformer for dense prediction," 2021, arXiv:2112.11010.
[49] C. Yang et al., "Lite vision transformer with enhanced self-attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11998–12008.
[50] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2011–2023.
[51] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[52] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1116–1124.
[53] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, "Do ImageNet classifiers generalize to ImageNet?," in Proc. Int. Conf. Mach. Learn., 2019, pp. 5389–5400.
[54] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[55] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," 2017, arXiv:1710.09412.
[56] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, "AutoAugment: Learning augmentation strategies from data," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 113–123.
[57] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, "Random erasing data augmentation," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 13001–13008.
[58] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
[59] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2017, arXiv:1711.05101.
[60] Y. Yuan et al., "HRFormer: High-resolution transformer for dense prediction," 2021, arXiv:2110.09408.
[61] N. Ma, X. Zhang, J. Huang, and J. Sun, "WeightNet: Revisiting the design space of weight networks," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 776–792.
[62] H. Yan, Z. Li, W. Li, C. Wang, M. Wu, and C. Zhang, "ConTNet: Why not use convolution and transformer at the same time?," 2021, arXiv:2104.13497.
[63] W. Xu, Y. Xu, T. Chang, and Z. Tu, "Co-scale conv-attentional image transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 9981–9990.
[64] S. Ren, D. Zhou, S. He, J. Feng, and X. Wang, "Shunted self-attention via multi-scale token aggregation," 2021, arXiv:2111.15193.
[65] A. Hatamizadeh, H. Yin, J. Kautz, and P. Molchanov, "Global context vision transformers," 2022, arXiv:2206.09959.
[66] W. Yu et al., "MetaFormer is actually what you need for vision," 2021, arXiv:2111.11418.
[67] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5693–5703.
[68] H. Yin, A. Vahdat, J. M. Alvarez, A. Mallya, J. Kautz, and P. Molchanov, "A-ViT: Adaptive tokens for efficient vision transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10809–10818.
[69] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357.
[70] A. Krizhevsky, "Learning multiple layers of features from tiny images," Univ. Toronto, pp. 1–60, May 2012.
[71] K. Patel, A. M. Bur, F. Li, and G. Wang, "Aggregating global features into local vision transformer," 2022, arXiv:2201.12903.
[72] P. Sun et al., "Sparse R-CNN: End-to-end object detection with learnable proposals," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14454–14463.
[73] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[74] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
[75] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
[76] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 633–641.
[77] A. Kirillov, R. Girshick, K. He, and P. Dollár, "Panoptic feature pyramid networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6399–6408.
[78] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," 2015, arXiv:1511.07122.
[79] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, "Unified perceptual parsing for scene understanding," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 418–434.
[80] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang, "Bag of tricks and a strong baseline for deep person re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2019.
[81] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, "Omni-scale feature learning for person re-identification," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 3702–3712.
[82] H. Li, G. Wu, and W.-S. Zheng, "Combined depth space based architecture search for person re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 6729–6738.
[83] R. Quan, X. Dong, Y. Wu, L. Zhu, and Y. Yang, "Auto-ReID: Searching for a part-aware ConvNet for person re-identification," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 3749–3758.
[84] S. He et al., "TransReID: Transformer-based object re-identification," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 15013–15022.
[85] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, "Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 480–496.
[86] Y. Sun et al., "Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 393–402.
[87] A. Hermans, L. Beyer, and B. Leibe, "In defense of the triplet loss for person re-identification," 2017, arXiv:1703.07737.
[88] V. Shankar, R. Roelofs, H. Mania, A. Fang, B. Recht, and L. Schmidt, "Evaluating machine accuracy on ImageNet," in Proc. Int. Conf. Mach. Learn., 2020, pp. 8634–8644.

Gaojie Wu is currently working toward the PhD degree in the School of Computer Science and Engineering, Sun Yat-sen University. His research interests mainly focus on neural architecture search and computer vision.

Yutong Lu is a professor with Sun Yat-sen University (SYSU) and director of the National Supercomputing Center in Guangzhou. Her extensive research and development experience has spanned several generations of domestic supercomputers in China. Her continuing research interests include HPC, cloud computing, storage and file systems, and advanced programming environments. At present, she is devoted to the research and implementation of systems and applications for the convergence of HPC, Big Data, and AI on supercomputers.