With all the insane hype around GPT-3, DALL-E, PaLM, and many more, now is the
perfect time to cover this paper.
Go through the Machine Learning news these days, and you will see Transformers
everywhere (watch this IBM Technology video for a quick overview of the idea). And
for good reason. Since their introduction, Transformers have taken the world of
Deep Learning by storm. While they were traditionally associated with Natural
Language Processing, Transformers are now being used in Computer Vision Pipelines
too. Just in the last few weeks, we have seen the use of Transformers in some
insane applications in Computer Vision. Thus, it seemed like Transformers would
replace Convolutional Neural Networks (CNNs) for generic Computer Vision tasks.
From the paper's abstract:
"In this work, we reexamine the design spaces and test the limits of what a pure
ConvNet can achieve. We gradually 'modernize' a standard ResNet toward the design
of a vision Transformer, and discover several key components that contribute to the
performance difference along the way."
The results are quite interesting, and they show that CNNs can even outperform
Transformers on certain tasks. This is more proof that your Deep Learning Pipelines
can be improved with better training and design choices, rather than by simply going
for bigger or trendier models.
In this article, I will cover some interesting findings from their paper. But first,
some context on Transformers and CNNs and the advantages of each kind of
architecture in Computer Vision tasks.
CNNs: The OG Computer Vision Networks
Convolutional Neural Networks have been the OG Computer Vision Architecture since
their inception. In fact, the foundations of CNNs are older than I am. CNNs were
literally built for vision.
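What "built for vision" means concretely is that convolutions bake in two inductive biases that suit images: each output value looks only at a small local neighborhood, and the same filter weights are reused at every position. Here is a minimal NumPy sketch of a 2D convolution (not code from the paper, just an illustration) that detects a vertical edge with a single shared two-weight kernel:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide one small kernel over the image.

    The same weights are reused at every spatial position (weight sharing),
    and each output pixel depends only on a local patch (locality) -- the
    two inductive biases that make convolutions a natural fit for images.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 image with a vertical step edge (dark left half, bright right half),
# and a simple horizontal-gradient kernel that fires on that edge.
image = np.zeros((5, 5))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]])

response = conv2d(image, kernel)
print(response)  # the response peaks in the column where the edge sits
```

A Transformer, by contrast, starts with none of these spatial assumptions and must learn them from data, which is part of why the CNN-vs-Transformer comparison is interesting in the first place.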