
COMP3340 Project Proposal of Group 17

FU Yunxiang, BASc(Applied AI), 3035719026
LAW Man Ting, BASc(Applied AI), 3035694862
LIU Yunhao, BEng(CS), 3035448079
LYU Zhiheng, BEng(CS), 3035772432
SU Meiwen, BASc(Applied AI), 3035709277

1. Background

Image classification aims at identifying which category an image represents. It is one of the upstream tasks for other computer vision (CV) problems. Traditionally, state-of-the-art (SOTA) performance has been achieved with convolutional neural network (CNN) models [6, 10]. Recent works explore alternatives to the CNN, among which the Transformer achieves SOTA results [4, 8, 14]. Derived from the Transformer, MLP backbones further attain competitive accuracy with simplified architectures [11, 12].
1.1. Vision Transformers

Transformers [13] are widely used in natural language processing owing to the efficiency and scalability of their multi-head self-attention (MHS) mechanism, which makes it feasible to train models of enormous size [1, 7]. The Vision Transformer (ViT) model extends the methodology to image classification [4]. Since Transformers are designed for sequential data, the ViT model divides each image into patches, maps each patch to a fixed-size vector, and combines it with a position embedding. Experiments show that ViT inherits the Transformer's efficiency and scalability [13].
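To make the patch pipeline concrete, the following is a minimal PyTorch sketch of ViT-style patch embedding (our illustration, not the ViT authors' code; the image size, patch size, and embedding width are hypothetical choices):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each patch to a fixed-size
    vector, and add learnable position embeddings plus a [CLS] token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting non-overlapping
        # patches and applying one shared linear projection to each.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768): a token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the [CLS] token
        return x + self.pos_embed            # inject position information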
1.2. MLP-based image classification

Despite the success of the ViT, the complexity of the attention mechanism arouses controversy regarding its necessity. MLP-Mixer [11] offers a feature-extraction backbone built from stacks of token-mixing and channel-mixing MLP blocks, as sketched below. ResMLP [12] shares this mechanism with the MLP-Mixer, with slightly different implementation details. The acceptable performance of both models offers appealing accuracy-computation trade-offs [11, 12].
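The token-mixing/channel-mixing alternation can likewise be sketched in a few lines (again an illustration under assumed dimensions, not the reference implementation of [11]):

import torch.nn as nn

class MlpBlock(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))
    def forward(self, x):
        return self.net(x)

class MixerLayer(nn.Module):
    def __init__(self, num_tokens=196, channels=768,
                 tokens_hidden=384, channels_hidden=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mix = MlpBlock(num_tokens, tokens_hidden)    # mixes across patches
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mix = MlpBlock(channels, channels_hidden)  # mixes across features

    def forward(self, x):                      # x: (B, tokens, channels)
        # Token mixing: transpose so the MLP acts along the token axis.
        y = self.norm1(x).transpose(1, 2)      # (B, channels, tokens)
        x = x + self.token_mix(y).transpose(1, 2)
        # Channel mixing: a per-token MLP with a residual connection.
        return x + self.channel_mix(self.norm2(x))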
2. Proposed Methods and Plans

The group aims to figure out how architectural distinctions among the aforementioned non-CNN models affect their outcomes under various training strategies.
2.1. Dataset

Training focuses on the tiny 17 Category Flower dataset, which provides 80 images for each of its 17 classes [9]. It will be processed in two ways: "weak" and "strong" augmentation. Specifically, "weak" means horizontally flipping and slightly translating images, whereas "strong" refers to reinforcement-learning-based auto-augmentation [3].
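As one concrete reading of these two regimes, the pipelines might look as follows in torchvision (the flip probability and translation range are illustrative guesses, not the group's final settings):

from torchvision import transforms

# "Weak": horizontal flips plus small random translations.
weak_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    # Translate by up to 10% of the image size in each direction.
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
])

# "Strong": torchvision ships the learned AutoAugment policies of [3]
# (available in torchvision >= 0.10); applied before tensor conversion.
strong_augment = transforms.Compose([
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])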
2.2. Models and Experiments

The group will implement the MLP-Mixer and the ResMLP in MMClassification [2], where the ViT model is already included. Their performance will be compared against the CNN-based SOTA ResNet-56 [6]. If time permits, the group plans to train all four models under "weak augmentation", "strong augmentation", and ImageNet pre-training, respectively. The published results of the MLP-Mixer were achieved via pre-training, while those of the ResMLP relied on data augmentation [11, 12]. Hence, the group deems it worthwhile to compare these strategies in a controlled, one-variable-at-a-time manner.
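Implementing a new model in MMClassification amounts to registering a backbone with the library's model registry; a hedged sketch follows (the decorator pattern is OpenMMLab's standard extension mechanism in mmcls 0.x, but the class name and body here are minimal placeholders, not a faithful ResMLP):

import torch.nn as nn
from mmcls.models.builder import BACKBONES

@BACKBONES.register_module()
class SimpleResMLP(nn.Module):  # hypothetical placeholder, not shipped code
    """Patch projection followed by per-patch MLP blocks and pooling."""
    def __init__(self, patch_size=16, channels=384, num_layers=4):
        super().__init__()
        # Cut the image into patches with a strided convolution.
        self.patchify = nn.Conv2d(3, channels, patch_size, stride=patch_size)
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.LayerNorm(channels),
                          nn.Linear(channels, channels), nn.GELU())
            for _ in range(num_layers)])

    def forward(self, x):                                # x: (B, 3, H, W)
        x = self.patchify(x).flatten(2).transpose(1, 2)  # (B, tokens, C)
        x = self.blocks(x).mean(dim=1)                   # pool over tokens
        return (x,)  # mmcls backbones return a tuple of feature tensors

A config entry such as backbone=dict(type='SimpleResMLP') would then select the class by name, following the registry convention.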
2.3. Evaluation

Metrics that summarize classification results, such as accuracy or the F1 score, may not be illustrative on such a small dataset. Hence, metrics reflecting a model's internal inductive mechanism can be more revealing. The group considers applying the Class Activation Map [5] to visualize the regions critical for classification, which can disclose what the models learn to capture. A better model is expected to concentrate on features similar to those a taxonomist would heed, such as corolla or stamen shapes.
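With the pytorch-grad-cam library cited above [5], such a visualization might look like the following (the API differs slightly across library versions, and the choice of target layer assumes a ResNet-style model):

from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

def cam_overlay(model, input_tensor, rgb_img):
    # For a ResNet, the last convolutional stage is a common target layer.
    cam = GradCAM(model=model, target_layers=[model.layer4[-1]])
    # Without explicit targets, the library explains the top-predicted class.
    grayscale_cam = cam(input_tensor=input_tensor)[0]  # (H, W) heat map
    # Overlay the heat map on the original float image (values in [0, 1]).
    return show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)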
Correlation among the output dimensions of the Softmax can serve as another evaluation tool, offering clues about which categories a model regards as indistinguishable. The group may explore the models more deeply by combining the discoveries of both tools, for instance by trying to answer why certain models treat two particular categories alike while others separate them properly.
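One way to realize the correlation analysis (our interpretation of the proposal, not a committed procedure) is to correlate per-class probabilities across a validation set and rank class pairs by correlation:

import numpy as np

def confusable_pairs(probs, top_k=5):
    """probs: (num_images, num_classes) array of Softmax outputs."""
    corr = np.corrcoef(probs.T)          # class-by-class correlation matrix
    # Keep the upper triangle so each unordered pair appears once.
    i, j = np.triu_indices_from(corr, k=1)
    order = np.argsort(corr[i, j])[::-1][:top_k]
    return [(int(i[n]), int(j[n]), float(corr[i[n], j[n]])) for n in order]

The class pairs with the highest correlation are candidates for the "indistinguishable" categories discussed above.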
3. Project Group

Mr. Lyu has gained practical machine learning experience from his contributions to the CV models of HKU RoboMaster and from his internship at Megvii, so his experience is expected to help in the experiment and evaluation phases. Mr. Fu, an Applied AI sophomore, will cooperate with Mr. Lyu on these two tasks. The AI-related experience of Mr. Liu, a final-year CS student, focuses on traditional CV algorithms and reinforcement learning. Miss Law and Miss Su, both majoring in Applied AI, will collaborate with Mr. Liu to implement the MLP-Mixer and the ResMLP.

References

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
[2] MMClassification Contributors. OpenMMLab's image classification toolbox and benchmark. https://github.com/open-mmlab/mmclassification, 2020.

[3] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.

[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929, 2020.

[5] Jacob Gildenblat and contributors. PyTorch library for CAM methods. https://github.com/jacobgil/pytorch-grad-cam, 2021.

[6] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[7] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding, 2020.

[8] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. ArXiv, abs/2103.14030, 2021.

[9] Maria-Elena Nilsback and Andrew Zisserman. 17 category flower dataset, 2006.

[10] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.

[11] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. CoRR, abs/2105.01601, 2021.

[12] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou. ResMLP: Feedforward networks for image classification with data-efficient training. CoRR, abs/2105.03404, 2021.

[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[14] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
