
Artificial Intelligence (CS303)

Lab Courses
Lab 6: OLMP
OLMP
• LMP
• Steps to DNN compression
• Understand OLMP
  • DNN
  • ANN
ANN
• The study of artificial neural networks was inspired by attempts to simulate biological neural
systems (e.g. human brain).
• Basic structural and functional unit: nerve cells called neurons
• Work Mechanism
• Different neurons are linked together via axons (轴突) and dendrites (树突).
• When one neuron is excited after stimulation, it sends chemicals to the connected neurons, thereby changing the potential (电位) within these neurons.
• If the potential (电位) of a neuron exceeds a "threshold", it is activated and sends chemicals to other neurons.
ANN
• A neuron is connected to the axons (轴突) of other neurons via dendrites (树突), which are
extensions from the cell body of the neuron.
• The contact point between a dendrite (树突) and an axon (轴突) is called a synapse (突触).
• The human brain learns by changing the strength of the synaptic connection between
neurons upon repeated stimulation by the same impulse.
Artificial Neuron Mathematical Model

• Input: $x_i$ from the $i$-th neuron

• Weights: connection weights $w_{ij}$ (synapses)

• Output: $o_j = \varphi\left(\sum_{i=1}^{n} w_{ij} x_i - \theta_j\right)$

• A single neuron can be viewed as a logistic regression model


Artificial Neuron Model

• Output: $o_j = \varphi\left(\sum_{i=1}^{n} w_{ij} x_i - \theta_j\right)$

• Ideal activation function: the step function, but it is impractical for training (discontinuous and not differentiable)

• Common activation functions: sigmoid, tanh, ReLU
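
For reference, a minimal NumPy sketch of one artificial neuron with the common activation functions listed above; the example weights and inputs are made-up:

```python
import numpy as np

# Common activation functions
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return np.maximum(0.0, a)

def neuron_output(x, w, theta, phi=sigmoid):
    """One artificial neuron: o = phi(sum_i w_i * x_i - theta)."""
    return phi(np.dot(w, x) - theta)

# Illustrative usage: 3 inputs, sigmoid activation (made-up numbers)
x = np.array([1.0, 0.1, 0.3])
w = np.array([0.5, -0.2, 0.8])
print(neuron_output(x, w, theta=0.1))
```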
Artificial Neural Networks

• Consists of multiple artificial neurons

• Usually has the structure of an input layer, multiple hidden layers, and an output layer

• Designing an NN (by hand or via AutoML) aims to find appropriate hidden layers and connection weights

3-layer feedforward neural network (figure)

• Other NNs: RBF networks, CNN, RNN, etc.
One Inference Process

$x \rightarrow a^{(1)} \rightarrow z^{(1)} \rightarrow a^{(2)} \rightarrow z^{(2)} \rightarrow a^{(3)} \rightarrow y$

$a_j^{(1)} = \sum_{i=1}^{d} w_{ji}^{(1)} x_i - \theta_j, \qquad z_j^{(1)} = f(a_j^{(1)})$

$a_k^{(2)} = \sum_{j=1}^{q} w_{kj}^{(2)} z_j^{(1)} - \theta_k, \qquad z_k^{(2)} = f(a_k^{(2)})$

$a_l^{(3)} = \sum_{k=1}^{m} w_{lk}^{(3)} z_k^{(2)} - \theta_l, \qquad y_l = f(a_l^{(3)})$

Superscript of $w$: layer index. ($d$, $q$, $m$ denote the numbers of units in the input layer and the two hidden layers.)
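
To make this inference process concrete, here is a minimal NumPy sketch of the forward pass; the layer sizes, random weights, and activation choice are illustrative assumptions, not the lab's actual network:

```python
import numpy as np

def forward(x, params, f=np.tanh):
    """One inference pass x -> a(1) -> z(1) -> a(2) -> z(2) -> a(3) -> y.

    params is a list of (W, theta) pairs, one per layer;
    a(l) = W @ z(l-1) - theta and z(l) = f(a(l)), with y = f(a(3)).
    """
    z = x
    for W, theta in params:
        a = W @ z - theta   # weighted sum minus threshold
        z = f(a)            # activation
    return z

# Illustrative sizes: d = 3 inputs, q = 5 and m = 4 hidden units, 2 outputs
d, q, m, n_out = 3, 5, 4, 2
rng = np.random.default_rng(0)
params = [(rng.normal(size=(q, d)), rng.normal(size=q)),
          (rng.normal(size=(m, q)), rng.normal(size=m)),
          (rng.normal(size=(n_out, m)), rng.normal(size=n_out))]
print(forward(np.array([1.0, 0.1, 0.3]), params))
```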
Training of NN

• The weights $W$ and threshold values $\theta$ decide the output of the NN

• Training is to find appropriate values for $W$ and $\theta$
• The learning process tunes the weight matrix

Training examples (inputs $x_1, x_2, x_3$; labels $l_1, l_2$):

x1   x2   x3   l1  l2
1.0  0.1  0.3  1   0
0.1  1.5  1.2  1   0
1.1  1.1  2.0  0   1
0.2  0.2  0.3  0   1

$Error\,Function(W, \theta) = \frac{1}{2}\left[\left(o_1(W, \theta) - y_1\right)^2 + \left(o_2(W, \theta) - y_2\right)^2\right]$

Calculate the gradient with respect to $(W, \theta)$, then tune them.
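
As a concrete illustration of "calculate the gradient, then tune", here is a hedged NumPy sketch that fits a single-layer sigmoid network to the four examples above, using a finite-difference gradient instead of backpropagation; the architecture, learning rate, and step count are assumptions made for brevity:

```python
import numpy as np

# Training examples from the table above: inputs x1..x3, labels (l1, l2)
X = np.array([[1.0, 0.1, 0.3], [0.1, 1.5, 1.2], [1.1, 1.1, 2.0], [0.2, 0.2, 0.3]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(W, theta, x):
    """Single-layer network with 2 outputs: o = sigmoid(W @ x - theta)."""
    return sigmoid(W @ x - theta)

def error(W, theta):
    """Error function: 0.5 * [(o1 - y1)^2 + (o2 - y2)^2], summed over examples."""
    return sum(0.5 * np.sum((predict(W, theta, x) - y) ** 2) for x, y in zip(X, Y))

# Tune (W, theta) by gradient descent, using finite differences for the gradient
rng = np.random.default_rng(0)
W, theta, lr, eps = rng.normal(size=(2, 3)), np.zeros(2), 0.5, 1e-5
for step in range(200):
    gW = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        dW = np.zeros_like(W); dW[idx] = eps
        gW[idx] = (error(W + dW, theta) - error(W - dW, theta)) / (2 * eps)
    gt = np.zeros_like(theta)
    for i in range(theta.size):
        dt = np.zeros_like(theta); dt[i] = eps
        gt[i] = (error(W, theta + dt) - error(W, theta - dt)) / (2 * eps)
    W, theta = W - lr * gW, theta - lr * gt
print("final error:", error(W, theta))
```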
Deep Neural Networks

• Shallow NNs vs. Deep NNs

• No clear definition
• Deep NNs usually have:
  • Thousands of neurons in one layer
  • Layer number >= 8
  • New activation functions and training methods
DNN Application: Handwritten character recognition
Performance of DNN

• Anti-interference ability, e.g., robustness to different character sizes and digit distortion

Requirement
• A lot of successful applications, e.g., face recognition, NLP
• However, DNNs are not easy to run on low-end hardware
• One obstacle is their enormous size: memory footprint ↔ number of weights $W$
[Figure: model sizes vs. mobile memory. LSTMP RNN [Sak et al., 2014]: >300MB; AlexNet [Krizhevsky et al., 2012]: >200MB; Transformer [Vaswani et al., 2017]: >1.2GB. Even after ZIP compression (roughly 104MB to 333MB), these models remain large relative to the 2GB RAM of an iPhone 8.]
DNN Compression

• Pruning (e.g. LMP and OLMP)


• Quantization
• Other Compression Methods
Notations
Suppose a neural network with $L + 1$ layers is represented as the set of its connections:

$W = \left\{ W^l_{i,j} \;\middle|\; W^l_{i,j} \neq 0,\; 1 \le l \le L,\; 1 \le i \le n_l,\; 1 \le j \le n_{l+1} \right\}. \quad (1)$

Where:
• $n_l$ is the number of neurons in layer $l$; $n_{l+1}$ is defined similarly.
• $W^l_{i,j}$ denotes the connection weight between the $i$-th neuron in layer $l$ and the $j$-th neuron in layer $l + 1$.
• $W^l_{i,j} = 0$ indicates that the corresponding connection does not exist.

[Figure: the connection $W^l_{i,j}$ links the $i$-th neuron in layer $l$ ($n_l$ neurons) to the $j$-th neuron in layer $l + 1$ ($n_{l+1}$ neurons).]
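
To make the notation concrete, a small NumPy sketch that stores each layer's weights as a matrix and treats zero entries as non-existent connections; the toy layer sizes are made-up:

```python
import numpy as np

# Each layer's weights as a matrix W_l of shape (n_{l+1}, n_l);
# a zero entry means the corresponding connection does not exist.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]   # toy 3-5-2 network
weights[0][np.abs(weights[0]) < 0.5] = 0.0                     # drop a few connections

for l, W_l in enumerate(weights, start=1):
    existing = np.count_nonzero(W_l)     # |{W^l_{i,j} : W^l_{i,j} != 0}|
    print(f"layer {l}: {existing} of {W_l.size} possible connections exist")
```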
MP and LMP

Magnitude-based pruning (MP) [LeCun et al., 1989]: Given a network $W$ and a threshold $\varepsilon$, magnitude-based pruning is defined as:

$MP(W, \varepsilon) = \{\, w \mid |w| \ge \varepsilon,\; w \in W \,\}.$

This method prunes the connections whose absolute connection weights are lower than $\varepsilon$.

Layer-wise magnitude-based pruning (LMP) [Guo et al., 2016; Han et al., 2015]: Instead of applying MP to the whole network, LMP applies it to each layer separately:

$LMP(W, \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}) = \bigcup_{l=1}^{L} MP(W, l, \varepsilon_l), \quad (2)$

where $MP(W, l, \varepsilon) = \{\, W^l_{i,j} \mid |W^l_{i,j}| \ge \varepsilon,\; 1 \le i \le n_l,\; 1 \le j \le n_{l+1} \,\}$.

$\varepsilon_l$ is the threshold for layer $l$.
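
A hedged NumPy sketch of MP and LMP as masking operations: every weight whose magnitude falls below the layer's threshold is set to zero (removed); the function names and the toy network are my own:

```python
import numpy as np

def mp(W_l, eps):
    """Magnitude-based pruning of one layer: keep |w| >= eps, zero out the rest."""
    return np.where(np.abs(W_l) >= eps, W_l, 0.0)

def lmp(weights, thresholds):
    """Layer-wise MP: apply MP to each layer with its own threshold eps_l."""
    return [mp(W_l, eps_l) for W_l, eps_l in zip(weights, thresholds)]

# Toy usage: two layers, two layer-specific thresholds
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
pruned = lmp(weights, thresholds=[0.5, 1.0])
print([int(np.count_nonzero(W_l)) for W_l in pruned])
```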


Threshold Tuning

• The solution space for $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L$ can be very large

Suppose a DNN has $L$ layers and each layer contains $N$ connections; then the possible combinations of $\{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}$ number $(N + 1)^L$, which is huge even for a DNN of modest size. For example, $L = 4$ layers with $N = 10^5$ connections each already give about $10^{20}$ combinations.

• The evaluation of candidate thresholds is time consuming

Each candidate pruned model must be evaluated on the training set to measure the performance loss.
OLMP: Optimization-based LMP
Optimization-based LMP is formulated as:

$\boldsymbol{\varepsilon}^* = \arg\min_{\boldsymbol{\varepsilon} \in \mathbb{R}^L,\; W' = LMP(W, \boldsymbol{\varepsilon})} |W'| \quad \text{s.t.} \quad f(W) - f(W') \le \delta, \quad (3)$

where $\boldsymbol{\varepsilon} = \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}$, $W'$ is the model pruned by applying LMP with $\boldsymbol{\varepsilon}$ on $W$, $|W'|$ is its number of remaining connections, $f(\cdot)$ measures accuracy, and $\delta$ is the tolerable performance loss.
Derivative-free optimization methods

Derivative-free optimization methods [Goldberg, 1989; Brochu et al., 2010; Qian et al., 2015; Yu et al., 2016] : do not require
the problem to be either continuous or differentiable. In our paper, we use negatively
correlated search (NCS) [Tang et al., 2016] to solve Eq. (3).

Negatively Correlated Search (NCS) [Tang et al., 2016]: it uses negative correlation to increase diversity among solutions and to encourage them to search different areas of the solution space.

*NCS can be substituted by any other suitable optimization method!
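
Since the slide notes that NCS can be swapped for any suitable optimizer, here is a hedged stand-in: a toy random search over threshold vectors that only needs a black-box fitness function to minimize. It illustrates the derivative-free interface, not the NCS algorithm itself:

```python
import numpy as np

def random_search(fitness, n_layers, eps_max, n_iters=1000, seed=0):
    """Toy derivative-free optimizer: sample threshold vectors eps uniformly in
    [0, eps_max] per layer and keep the best one under the given fitness
    (lower is better). A real run would use NCS or another serious optimizer."""
    rng = np.random.default_rng(seed)
    best_eps, best_fit = None, float("inf")
    for _ in range(n_iters):
        eps = rng.uniform(0.0, eps_max, size=n_layers)
        fit = fitness(eps)
        if fit < best_fit:
            best_eps, best_fit = eps, fit
    return best_eps, best_fit
```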
How to apply NCS

• Fitness function definition
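
The slide lists the fitness function without spelling it out, so here is one plausible, hedged formulation based on Eq. (3): minimize the pruned model's size $|W'|$ and add a large penalty whenever the accuracy-loss budget $\delta$ is violated. The function name, the penalty scheme, and the `accuracy_fn` evaluator are assumptions for illustration:

```python
import numpy as np

def fitness(eps, weights, accuracy_fn, base_accuracy, delta, penalty=1e9):
    """Fitness of a threshold vector eps = (eps_1, ..., eps_L) for Eq. (3):
    the size |W'| of the pruned model, plus a large penalty if the accuracy
    loss f(W) - f(W') exceeds the budget delta (lower fitness is better)."""
    pruned = [np.where(np.abs(W_l) >= e, W_l, 0.0) for W_l, e in zip(weights, eps)]
    size = sum(np.count_nonzero(W_l) for W_l in pruned)     # |W'|
    loss = base_accuracy - accuracy_fn(pruned)              # f(W) - f(W')
    return size + (penalty if loss > delta else 0.0)
```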


Overall Pipeline for DNN Compression

Eq. (1): $W = \{\, W^l_{i,j} \mid W^l_{i,j} \neq 0,\; 1 \le l \le L,\; 1 \le i \le n_l,\; 1 \le j \le n_{l+1} \,\}$.

Eq. (2): $LMP(W, \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}) = \bigcup_{l=1}^{L} MP(W, l, \varepsilon_l)$, where $MP(W, l, \varepsilon) = \{\, W^l_{i,j} \mid |W^l_{i,j}| \ge \varepsilon,\; 1 \le i \le n_l,\; 1 \le j \le n_{l+1} \,\}$.

Eq. (3): $\boldsymbol{\varepsilon}^* = \arg\min_{\boldsymbol{\varepsilon} \in \mathbb{R}^L,\; W' = LMP(W, \boldsymbol{\varepsilon})} |W'|$ s.t. $f(W) - f(W') \le \delta$, with $\boldsymbol{\varepsilon} = \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}$.

Pipeline (iterative pruning and adjusting):
① The original model $W$ w.r.t. Eq. (1) is the model to be pruned.
② Solve Eq. (3) for $\boldsymbol{\varepsilon}^*$ (OLMP).
③ Prune $W$ with $\boldsymbol{\varepsilon}^*$ w.r.t. Eq. (2) to obtain the pruned model.
④ Use dynamic surgery [Guo et al., 2016] to retrain and recover the incorrectly pruned connections (retrain + connection recovery).
⑤ Once the stop criterion is fulfilled, retrain on the whole training set until convergence to obtain the final model.
Iterative pruning and adjusting
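
A high-level, hedged sketch of this iterative prune-and-adjust loop; the optimizer, dynamic-surgery step, retraining routine, and stop criterion are passed in as stand-ins for the components named above, and `lmp` reuses the pruning sketch from the LMP slide:

```python
def compress(model, delta, solve_eq3, dynamic_surgery, retrain, stop_criterion):
    """Hedged sketch of the pipeline above.

    solve_eq3(W, delta)  -> eps*  : step (2), e.g. NCS on Eq. (3)
    lmp(W, eps)          -> W'    : step (3), pruning w.r.t. Eq. (2)
    dynamic_surgery(W')  -> W     : step (4), retrain + connection recovery
    retrain(W)           -> model : step (5), retrain on the whole training set
    """
    W = model                                # (1) model to be pruned
    while not stop_criterion(W):
        eps_star = solve_eq3(W, delta)       # (2)
        W_pruned = lmp(W, eps_star)          # (3)
        W = dynamic_surgery(W_pruned)        # (4)
    return retrain(W)                        # (5)
```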
Experimental Settings
Experiment-Application to LeNet

OLMP achieves the best compression result with no accuracy loss on the test set.

Compared methods:
• Iterative pruning and retraining (ITR) [Han et al., 2015]
• Dynamic surgery (DS) [Guo et al., 2016]
• Soft-weight sharing (SWS) [Ullrich et al., 2017]
• Sparse VD [Molchanov et al., 2017]
Experiment-Application to AlexNet-Caltech

OLMP can effectively compress conventional DNNs.
Experiment – OLMP without iterative pruning
Conclusion
• Conventional layer-wise magnitude-based pruning requires manually tuning the layer-specific thresholds
  • Hard for end-users with limited expertise

• OLMP tunes the thresholds automatically
  • Formulated as an optimization problem
  • Solved with a derivative-free optimization algorithm

• New compression pipeline
  • Iterative OLMP and adjusting
  • Adjusting includes recovering incorrectly pruned connections

• Empirical results show the effectiveness of OLMP
