Model Compression Is The Big ML Flavour Of 2021
BY SHRADDHA GOLED (HTTPS://ANALYTICSINDIAMAG.COM/AUTHOR/SHRADDHA-GOLEDANALYTICSINDIAMAG-COM/)
14/04/2021

→ Model compression is a technique of deploying state-of-the-art deep networks in devices with
low power and resources, without compromising much on the accuracy of the model.
It's a myth universally acknowledged that a large, complex machine learning model must be better. However, a model's complexity and size do not necessarily translate to good performance. Moreover, such models pose challenges such as difficulty in training and environmental costs.

Interestingly, famous ImageNet models such as AlexNet and VGG-16 have been compressed by up to
50 times without losing accuracy. The compression has increased their inference speed and made
them easier to deploy across several devices.

What Is Model Compression?


Model compression is the technique of deploying state-of-the-art deep networks on devices with low
power and resources without compromising much on the model's accuracy. A compressed model, reduced in
size and/or latency, has fewer and smaller parameters and requires less RAM.

Since the late 1980s, researchers have been developing model compression techniques. Some of the
important papers from that time include — Pruning vs clipping in neural networks (1989
(https://journals.aps.org/pra/abstract/10.1103/PhysRevA.39.6600)), A technique for trimming the
fat from a network via relevance assessment (1989 (https://papers.nips.cc/paper/119-skeletonization-
a-technique-for-trimming-the-fat-from-a-network-via-relevance-assessment.pdf)), and A simple
procedure for pruning backpropagation trained neural networks (1990
(https://ieeexplore.ieee.org/document/80236)). 
Of late, model compression has been drawing interest from the research community, especially after
the 2012 ImageNet competition.

“This Imagenet 2012 event was definitely what triggered the big explosion of AI today. There were
definitely some very promising results in speech recognition shortly before this (again many of them
sparked by Toronto), but they didn’t take off publicly as much as that ImageNet win did in 2012 and
the following years,” said Matthew Zeiler, an NYU PhD and winner of the ImageNet competition in 2014
(https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-
world/).

Popular Model Compression Techniques


Pruning: This technique entails removing connections between neurons, and sometimes whole
neurons, channels or filters, from a trained network. Pruning works because networks tend to be
over-parameterised; multiple features convey almost the same information and are inconsequential in the
larger scheme of things.

Depending on the type of network component being removed, pruning can be classified into
unstructured and structured pruning. In unstructured pruning, individual weights or neurons are
removed; in structured pruning, entire channels or filters are taken out.
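
To make the distinction concrete, here is a minimal sketch of both flavours using PyTorch's built-in pruning utilities; the tiny two-layer network and the pruning ratios are placeholders chosen purely for illustration.

```python
# Minimal sketch of unstructured vs structured pruning with torch.nn.utils.prune.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),   # hypothetical small network
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3),
)
conv1, _, conv2 = model

# Unstructured pruning: zero out the 30% of individual weights
# with the smallest L1 magnitude.
prune.l1_unstructured(conv1, name="weight", amount=0.3)

# Structured pruning: remove 25% of entire output channels (filters)
# from the second conv layer, ranked by their L2 norm.
prune.ln_structured(conv2, name="weight", amount=0.25, n=2, dim=0)

# Fraction of weights that are now zero in each layer.
for name, layer in [("conv1", conv1), ("conv2", conv2)]:
    sparsity = (layer.weight == 0).float().mean().item()
    print(f"{name} sparsity: {sparsity:.2f}")
```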

Quantization: Unlike pruning, where the number of weights is reduced, quantization involves reducing
the size of each weight, typically by storing it with fewer bits. It is a process of mapping values from a
large set to values in a smaller set, meaning the output contains a smaller range of values than the input
without losing much information in the process.
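
A minimal illustration of this mapping, assuming simple post-training affine quantization of a weight matrix from 32-bit floats to 8-bit integers (the matrix here is random, standing in for trained weights):

```python
import numpy as np

def quantize_int8(w):
    """Affine (asymmetric) quantization of a float32 array to int8."""
    scale = (w.max() - w.min()) / 255.0            # step size of the smaller value set
    zero_point = np.round(-w.min() / scale) - 128  # maps w.min() to -128
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256, 256).astype(np.float32)   # dummy weight matrix
q, scale, zp = quantize_int8(w)

print("memory: %d -> %d bytes" % (w.nbytes, q.nbytes))   # 4x smaller storage
print("max reconstruction error:", np.abs(w - dequantize(q, scale, zp)).max())
```

In practice, framework tooling also quantizes activations and often uses per-channel scales, but the underlying idea is the same large-set-to-small-set mapping.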

Selective attention: Only the objects or elements of interest are focused on, while the background and
other elements are discarded. This technique requires the addition of a selective attention network
upstream of the existing AI system.
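
The article does not name a specific implementation, so the following is only a hypothetical sketch of the idea: a small gating network sits in front of an unchanged main model and learns a soft spatial mask that suppresses background pixels.

```python
import torch
import torch.nn as nn

class SelectiveAttentionFront(nn.Module):
    """Hypothetical gating module placed upstream of an existing model:
    it produces a soft spatial mask that keeps regions of interest and
    attenuates the background before the main network sees the input."""
    def __init__(self):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, x):
        mask = self.scorer(x)   # (B, 1, H, W) attention mask
        return x * mask         # background pixels are suppressed

# Usage: the existing model is left untouched; only its input changes.
main_model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                           nn.Flatten(), nn.Linear(16, 10))
pipeline = nn.Sequential(SelectiveAttentionFront(), main_model)
print(pipeline(torch.randn(2, 3, 64, 64)).shape)   # torch.Size([2, 10])
```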

Low-rank factorisation: This process uses matrix or tensor decomposition to estimate the useful
parameters. A large weight matrix of high dimension and rank can be replaced with smaller-dimension
matrices through factorisation.
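
A minimal sketch of the idea using a truncated SVD in NumPy; the matrix sizes and rank are illustrative, and a random matrix stands in for trained weights (real weight matrices usually have a much lower effective rank, so the approximation error would be smaller in practice).

```python
import numpy as np

# Replace one large dense weight matrix W (m x n) with two smaller factors
# U_r (m x r) and V_r (r x n) obtained from a truncated SVD.
m, n, rank = 512, 1024, 32
W = np.random.randn(m, n).astype(np.float32)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_r = U[:, :rank] * S[:rank]   # (m, rank)
V_r = Vt[:rank, :]             # (rank, n)

# The layer y = W x is approximated by two cheaper layers: y ~ U_r (V_r x).
print(f"parameters: {W.size} -> {U_r.size + V_r.size}")
print("relative error:", np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))
```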

Knowledge distillation: It is an indirect way of compressing a model where an existing larger model,
called the teacher, trains smaller models, called students. The goal is for the student model to reproduce
the output distribution of the teacher model. Here, the loss function is minimised during the transfer of
knowledge from teacher to student.
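
A common way to express this loss, sketched below in PyTorch, combines a temperature-softened KL-divergence term that pulls the student's output distribution toward the teacher's with an ordinary cross-entropy term on the true labels; the temperature and weighting values here are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soften both distributions with temperature T, match the student to the
    teacher (KL term), and keep a standard cross-entropy term on the labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits; in practice the teacher runs in eval mode with
# gradients disabled and only the student's parameters are updated.
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```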

Major Breakthroughs
Model compression continues to gather momentum. In 2019, MIT researchers introduced the Lottery
Ticket Hypothesis (https://analyticsindiamag.com/the-lottery-ticket-hypothesis-that-shocked-the-world/)
by improving on the traditional pruning technique. The hypothesis states that “a randomly-initialised,
dense neural network contains a subnetwork that is initialised such that—when trained in isolation—it
can match the test accuracy of the original network after training for at most the same number of
iterations”. Facebook AI found that the technique could be extended to reinforcement learning and
natural language processing.
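
A compact, illustrative sketch of the procedure behind the hypothesis (one-shot magnitude pruning with weight rewinding) on a toy model and random data; the sparsity level, model, and training loop are placeholders, not the paper's exact setup.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
init_state = copy.deepcopy(model.state_dict())   # the original initialisation

def train(model, masks=None, steps=200):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(X), y).backward()
        opt.step()
        if masks:   # keep pruned weights at zero throughout training
            with torch.no_grad():
                for name, p in model.named_parameters():
                    if name in masks:
                        p.mul_(masks[name])

# 1. Train the dense network.
train(model)

# 2. Prune: keep only the largest-magnitude 20% of weights per linear layer.
masks = {}
for name, p in model.named_parameters():
    if p.dim() == 2:
        k = int(0.8 * p.numel())                      # pruning threshold index
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()

# 3. Rewind surviving weights to their initial values and retrain the
#    sparse subnetwork in isolation.
model.load_state_dict(init_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p.mul_(masks[name])
train(model, masks)
```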

MIT assistant professor Song Han introduced AutoML for Model Compression (AMC
(https://arxiv.org/pdf/1802.03494.pdf)). It leverages reinforcement learning to produce a model
compression policy with a higher compression ratio and accuracy at a lower human effort. AutoML has
now become an industry standard.

The availability of smaller BERT (https://analyticsindiamag.com/top-ten-bert-alternatives-for-nlu-
projects/)-based models like ALBERT (https://analyticsindiamag.com/complete-guide-to-albert-a-
lite-bertwith-python-code/) (Google and Toyota), TinyBERT (Huawei), and DistilBERT
(https://analyticsindiamag.com/python-guide-to-huggingface-distilbert-smaller-faster-cheaper-
distilled-bert/) (HuggingFace) is a testament to model compression’s growing popularity.
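
For a quick sense of what distillation buys in practice, one can compare the parameter counts of BERT-base and DistilBERT with the Hugging Face transformers library (the pretrained weights are downloaded on first use):

```python
from transformers import AutoModel

for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")

# Expect roughly 110M parameters for BERT-base and about 40% fewer for
# DistilBERT, which still retains most of BERT's accuracy on NLU benchmarks.
```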

Further, companies such as Arm have taken a shine to TinyML
(https://analyticsindiamag.com/tinyml-and-its-great-application-in-iot-technology/), an embedded
software technology used to build low-power devices that run ML models. As per global tech market
advisory firm ABI Research, about 230 billion devices will be shipped with a TinyML chipset by 2030.
Model compression lies at the heart of TinyML.

Some of the major breakthroughs (https://analyticsindiamag.com/8-neural-network-compression-
techniques-for-machine-learning-developers/) in recent years in model compression include:
