
Artificial Intelligence (CS303)

Lab Courses
Lab 6: OLMP
OLMP
• LMP
• Steps to DNN compression
• Understand OLMP
  • DNN
  • ANN
ANN
• The study of artificial neural networks was inspired by attempts to simulate biological neural
systems (e.g. human brain).
• Basic structural and functional unit: nerve cells called neurons
• Work Mechanism
• Different neurons are linked together via axons (轴突) and dendrites (树突).
• When one neuron is excited after stimulation, it sends chemicals to the connected neurons, thereby changing the potential (电位) within these neurons.
• If the potential (电位) of a neuron exceeds a "threshold", it is activated and sends chemicals to other neurons.
ANN
• A neuron is connected to the axons (轴突) of other neurons via dendrites (树突), which are
extensions from the cell body of the neuron.
• The contact point between a dendrite (树突) and an axon (轴突) is called a synapse (突触).
• The human brain learns by changing the strength of the synaptic connection between
neurons upon repeated stimulation by the same impulse.
Artificial Neuron Mathematical Model

• Input: $x_i$ from the $i$-th neuron

• Weights: connection weights $w_{ij}$ (synapses)

• Output: $o_j = \varphi\left(\sum_{i=1}^{n} w_{ij} x_i - \theta_j\right)$

• A single neuron can be viewed as a logistic regression model


Artificial Neuron Model

• Output: $o_j = \varphi\left(\sum_{i=1}^{n} w_{ij} x_i - \theta_j\right)$

• Ideal activation function: the step function, but it is impractical for training (discontinuous and not differentiable)

• Common activation functions: sigmoid, tanh, ReLU
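
For reference, a minimal NumPy sketch of one artificial neuron with the common activation functions listed above; the example weights and inputs are made-up:

```python
import numpy as np

# Common activation functions
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return np.maximum(0.0, a)

def neuron_output(x, w, theta, phi=sigmoid):
    """One artificial neuron: o = phi(sum_i w_i * x_i - theta)."""
    return phi(np.dot(w, x) - theta)

# Illustrative usage: 3 inputs, sigmoid activation (made-up numbers)
x = np.array([1.0, 0.1, 0.3])
w = np.array([0.5, -0.2, 0.8])
print(neuron_output(x, w, theta=0.1))
```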
Artificial Neural Networks

• Consists of multiple artificial neurons

• Usually has the structure of an input layer, multiple hidden layers, and an output layer

• Designing an NN (by hand or via AutoML) aims to find appropriate hidden layers and connection weights

3-layer feedforward neural network (figure)

• Other NNs: RBF networks, CNN, RNN, etc.
One Inference Process

$x \rightarrow a^{(1)} \rightarrow z^{(1)} \rightarrow a^{(2)} \rightarrow z^{(2)} \rightarrow a^{(3)} \rightarrow y$

$a_j^{(1)} = \sum_{i=1}^{d} w_{ji}^{(1)} x_i - \theta_j, \qquad z_j^{(1)} = f(a_j^{(1)})$

$a_k^{(2)} = \sum_{j=1}^{q} w_{kj}^{(2)} z_j^{(1)} - \theta_k, \qquad z_k^{(2)} = f(a_k^{(2)})$

$a_l^{(3)} = \sum_{k=1}^{m} w_{lk}^{(3)} z_k^{(2)} - \theta_l, \qquad y_l = f(a_l^{(3)})$

Superscript of $w$: layer index. ($d$, $q$, $m$ denote the numbers of units in the input layer and the two hidden layers.)
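
To make this inference process concrete, here is a minimal NumPy sketch of the forward pass; the layer sizes, random weights, and activation choice are illustrative assumptions, not the lab's actual network:

```python
import numpy as np

def forward(x, params, f=np.tanh):
    """One inference pass x -> a(1) -> z(1) -> a(2) -> z(2) -> a(3) -> y.

    params is a list of (W, theta) pairs, one per layer;
    a(l) = W @ z(l-1) - theta and z(l) = f(a(l)), with y = f(a(3)).
    """
    z = x
    for W, theta in params:
        a = W @ z - theta   # weighted sum minus threshold
        z = f(a)            # activation
    return z

# Illustrative sizes: d = 3 inputs, q = 5 and m = 4 hidden units, 2 outputs
d, q, m, n_out = 3, 5, 4, 2
rng = np.random.default_rng(0)
params = [(rng.normal(size=(q, d)), rng.normal(size=q)),
          (rng.normal(size=(m, q)), rng.normal(size=m)),
          (rng.normal(size=(n_out, m)), rng.normal(size=n_out))]
print(forward(np.array([1.0, 0.1, 0.3]), params))
```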
Training of NN

• The weights $W$ and threshold values $\theta$ decide the output of the NN

• Training is to find appropriate values for $W$ and $\theta$
• The learning process tunes the weight matrix

Training examples (inputs $x_1, x_2, x_3$; labels $l_1, l_2$):

x1   x2   x3   l1  l2
1.0  0.1  0.3  1   0
0.1  1.5  1.2  1   0
1.1  1.1  2.0  0   1
0.2  0.2  0.3  0   1

$Error\,Function(W, \theta) = \frac{1}{2}\left[\left(o_1(W, \theta) - y_1\right)^2 + \left(o_2(W, \theta) - y_2\right)^2\right]$

Calculate the gradient with respect to $(W, \theta)$, then tune them.
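
As a concrete illustration of "calculate the gradient, then tune", here is a hedged NumPy sketch that fits a single-layer sigmoid network to the four examples above, using a finite-difference gradient instead of backpropagation; the architecture, learning rate, and step count are assumptions made for brevity:

```python
import numpy as np

# Training examples from the table above: inputs x1..x3, labels (l1, l2)
X = np.array([[1.0, 0.1, 0.3], [0.1, 1.5, 1.2], [1.1, 1.1, 2.0], [0.2, 0.2, 0.3]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(W, theta, x):
    """Single-layer network with 2 outputs: o = sigmoid(W @ x - theta)."""
    return sigmoid(W @ x - theta)

def error(W, theta):
    """Error function: 0.5 * [(o1 - y1)^2 + (o2 - y2)^2], summed over examples."""
    return sum(0.5 * np.sum((predict(W, theta, x) - y) ** 2) for x, y in zip(X, Y))

# Tune (W, theta) by gradient descent, using finite differences for the gradient
rng = np.random.default_rng(0)
W, theta, lr, eps = rng.normal(size=(2, 3)), np.zeros(2), 0.5, 1e-5
for step in range(200):
    gW = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        dW = np.zeros_like(W); dW[idx] = eps
        gW[idx] = (error(W + dW, theta) - error(W - dW, theta)) / (2 * eps)
    gt = np.zeros_like(theta)
    for i in range(theta.size):
        dt = np.zeros_like(theta); dt[i] = eps
        gt[i] = (error(W, theta + dt) - error(W, theta - dt)) / (2 * eps)
    W, theta = W - lr * gW, theta - lr * gt
print("final error:", error(W, theta))
```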
Deep Neural Networks

• Shallow NNs vs. Deep NNs

• No clear definition
• Deep NNs usually have:
  • Thousands of neurons in one layer
  • Layer number >= 8
  • New activation functions and training methods
DNN Application: Handwritten character recognition
Performance of DNN

• Anti-interference ability, e.g., robustness to different character sizes and digit distortion

Requirement
• A lot of successful applications, e.g., face recognition, NLP
• However, DNNs are not easy to run on low-end hardware
• One obstacle is their enormous size: memory footprint ↔ number of weights $W$
[Figure: model sizes vs. mobile memory. LSTMP RNN [Sak et al., 2014]: >300MB; AlexNet [Krizhevsky et al., 2012]: >200MB; Transformer [Vaswani et al., 2017]: >1.2GB. Even after ZIP compression (roughly 104MB to 333MB), these models remain large relative to the 2GB RAM of an iPhone 8.]
DNN Compression

• Pruning (e.g. LMP and OLMP)


• Quantization
• Other Compression Methods
Notations
Suppose a neural network with $L + 1$ layers is represented as the set of its connections:

$W = \left\{ W^l_{i,j} \;\middle|\; W^l_{i,j} \neq 0,\; 1 \le l \le L,\; 1 \le i \le n_l,\; 1 \le j \le n_{l+1} \right\}. \quad (1)$

Where:
• $n_l$ is the number of neurons in layer $l$; $n_{l+1}$ is defined similarly.
• $W^l_{i,j}$ denotes the connection weight between the $i$-th neuron in layer $l$ and the $j$-th neuron in layer $l + 1$.
• $W^l_{i,j} = 0$ indicates that the corresponding connection does not exist.

[Figure: the connection $W^l_{i,j}$ links the $i$-th neuron in layer $l$ ($n_l$ neurons) to the $j$-th neuron in layer $l + 1$ ($n_{l+1}$ neurons).]
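
To make the notation concrete, a small NumPy sketch that stores each layer's weights as a matrix and treats zero entries as non-existent connections; the toy layer sizes are made-up:

```python
import numpy as np

# Each layer's weights as a matrix W_l of shape (n_{l+1}, n_l);
# a zero entry means the corresponding connection does not exist.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]   # toy 3-5-2 network
weights[0][np.abs(weights[0]) < 0.5] = 0.0                     # drop a few connections

for l, W_l in enumerate(weights, start=1):
    existing = np.count_nonzero(W_l)     # |{W^l_{i,j} : W^l_{i,j} != 0}|
    print(f"layer {l}: {existing} of {W_l.size} possible connections exist")
```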
MP and LMP

Magnitude-based pruning (MP) [LeCun et al., 1989]: Given a network $W$ and a threshold $\varepsilon$, magnitude-based pruning is defined as:

$MP(W, \varepsilon) = \{\, w \mid |w| \ge \varepsilon,\; w \in W \,\}.$

This method prunes the connections whose absolute connection weights are lower than $\varepsilon$.

Layer-wise magnitude-based pruning (LMP) [Guo et al., 2016; Han et al., 2015]: Instead of applying MP to the whole network, LMP applies it to each layer separately:

$LMP(W, \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}) = \bigcup_{l=1}^{L} MP(W, l, \varepsilon_l), \quad (2)$

where $MP(W, l, \varepsilon) = \{\, W^l_{i,j} \mid |W^l_{i,j}| \ge \varepsilon,\; 1 \le i \le n_l,\; 1 \le j \le n_{l+1} \,\}$.

$\varepsilon_l$ is the threshold for layer $l$.
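
A hedged NumPy sketch of MP and LMP as masking operations: every weight whose magnitude falls below the layer's threshold is set to zero (removed); the function names and the toy network are my own:

```python
import numpy as np

def mp(W_l, eps):
    """Magnitude-based pruning of one layer: keep |w| >= eps, zero out the rest."""
    return np.where(np.abs(W_l) >= eps, W_l, 0.0)

def lmp(weights, thresholds):
    """Layer-wise MP: apply MP to each layer with its own threshold eps_l."""
    return [mp(W_l, eps_l) for W_l, eps_l in zip(weights, thresholds)]

# Toy usage: two layers, two layer-specific thresholds
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
pruned = lmp(weights, thresholds=[0.5, 1.0])
print([int(np.count_nonzero(W_l)) for W_l in pruned])
```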


Threshold Tuning

• The solution space for $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L$ can be very large

Suppose a DNN has $L$ layers and each layer contains $N$ connections; then the possible combinations of $\{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}$ number $(N + 1)^L$, which is huge even for a DNN of modest size. For example, $L = 4$ layers with $N = 10^5$ connections each already give about $10^{20}$ combinations.

• The evaluation of candidate thresholds is time consuming

Each candidate pruned model must be evaluated on the training set to measure the performance loss.
OLMP: Optimization-based LMP
Optimization-based LMP is formulated as:

$\boldsymbol{\varepsilon}^* = \arg\min_{\boldsymbol{\varepsilon} \in \mathbb{R}^L,\; W' = LMP(W, \boldsymbol{\varepsilon})} |W'| \quad \text{s.t.} \quad f(W) - f(W') \le \delta, \quad (3)$

where $\boldsymbol{\varepsilon} = \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}$, $W'$ is the model pruned by applying LMP with $\boldsymbol{\varepsilon}$ on $W$, $|W'|$ is its number of remaining connections, $f(\cdot)$ measures accuracy, and $\delta$ is the tolerable performance loss.
Derivative-free optimization methods

Derivative-free optimization methods [Goldberg, 1989; Brochu et al., 2010; Qian et al., 2015; Yu et al., 2016] : do not require
the problem to be either continuous or differentiable. In our paper, we use negatively
correlated search (NCS) [Tang et al., 2016] to solve Eq. (3).

Negatively Correlated Search (NCS) [Tang et al., 2016]: it uses negative correlation to increase diversity among solutions and to encourage them to search different areas of the solution space.

*NCS can be substituted by any other suitable optimization method!
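
Since the slide notes that NCS can be swapped for any suitable optimizer, here is a hedged stand-in: a toy random search over threshold vectors that only needs a black-box fitness function to minimize. It illustrates the derivative-free interface, not the NCS algorithm itself:

```python
import numpy as np

def random_search(fitness, n_layers, eps_max, n_iters=1000, seed=0):
    """Toy derivative-free optimizer: sample threshold vectors eps uniformly in
    [0, eps_max] per layer and keep the best one under the given fitness
    (lower is better). A real run would use NCS or another serious optimizer."""
    rng = np.random.default_rng(seed)
    best_eps, best_fit = None, float("inf")
    for _ in range(n_iters):
        eps = rng.uniform(0.0, eps_max, size=n_layers)
        fit = fitness(eps)
        if fit < best_fit:
            best_eps, best_fit = eps, fit
    return best_eps, best_fit
```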
How to apply NCS

• Fitness function definition
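
The slide lists the fitness function without spelling it out, so here is one plausible, hedged formulation based on Eq. (3): minimize the pruned model's size $|W'|$ and add a large penalty whenever the accuracy-loss budget $\delta$ is violated. The function name, the penalty scheme, and the `accuracy_fn` evaluator are assumptions for illustration:

```python
import numpy as np

def fitness(eps, weights, accuracy_fn, base_accuracy, delta, penalty=1e9):
    """Fitness of a threshold vector eps = (eps_1, ..., eps_L) for Eq. (3):
    the size |W'| of the pruned model, plus a large penalty if the accuracy
    loss f(W) - f(W') exceeds the budget delta (lower fitness is better)."""
    pruned = [np.where(np.abs(W_l) >= e, W_l, 0.0) for W_l, e in zip(weights, eps)]
    size = sum(np.count_nonzero(W_l) for W_l in pruned)     # |W'|
    loss = base_accuracy - accuracy_fn(pruned)              # f(W) - f(W')
    return size + (penalty if loss > delta else 0.0)
```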


Overall Pipeline for DNN Compression

Eq. (1): $W = \{\, W^l_{i,j} \mid W^l_{i,j} \neq 0,\; 1 \le l \le L,\; 1 \le i \le n_l,\; 1 \le j \le n_{l+1} \,\}$.

Eq. (2): $LMP(W, \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}) = \bigcup_{l=1}^{L} MP(W, l, \varepsilon_l)$, where $MP(W, l, \varepsilon) = \{\, W^l_{i,j} \mid |W^l_{i,j}| \ge \varepsilon,\; 1 \le i \le n_l,\; 1 \le j \le n_{l+1} \,\}$.

Eq. (3): $\boldsymbol{\varepsilon}^* = \arg\min_{\boldsymbol{\varepsilon} \in \mathbb{R}^L,\; W' = LMP(W, \boldsymbol{\varepsilon})} |W'|$ s.t. $f(W) - f(W') \le \delta$, with $\boldsymbol{\varepsilon} = \{\varepsilon_1, \varepsilon_2, \dots, \varepsilon_L\}$.

Pipeline (iterative pruning and adjusting):
① The original model $W$ w.r.t. Eq. (1) is the model to be pruned.
② Solve Eq. (3) for $\boldsymbol{\varepsilon}^*$ (OLMP).
③ Prune $W$ with $\boldsymbol{\varepsilon}^*$ w.r.t. Eq. (2) to obtain the pruned model.
④ Use dynamic surgery [Guo et al., 2016] to retrain and recover the incorrectly pruned connections (retrain + connection recovery).
⑤ Once the stop criterion is fulfilled, retrain on the whole training set until convergence to obtain the final model.
Iterative pruning and adjusting
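
A high-level, hedged sketch of this iterative prune-and-adjust loop; the optimizer, dynamic-surgery step, retraining routine, and stop criterion are passed in as stand-ins for the components named above, and `lmp` reuses the pruning sketch from the LMP slide:

```python
def compress(model, delta, solve_eq3, dynamic_surgery, retrain, stop_criterion):
    """Hedged sketch of the pipeline above.

    solve_eq3(W, delta)  -> eps*  : step (2), e.g. NCS on Eq. (3)
    lmp(W, eps)          -> W'    : step (3), pruning w.r.t. Eq. (2)
    dynamic_surgery(W')  -> W     : step (4), retrain + connection recovery
    retrain(W)           -> model : step (5), retrain on the whole training set
    """
    W = model                                # (1) model to be pruned
    while not stop_criterion(W):
        eps_star = solve_eq3(W, delta)       # (2)
        W_pruned = lmp(W, eps_star)          # (3)
        W = dynamic_surgery(W_pruned)        # (4)
    return retrain(W)                        # (5)
```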
Experimental Settings
Experiment-Application to LeNet

OLMP achieves the best compression result with no accuracy loss on the test set.

Compared methods:
• Iterative pruning and retraining (ITR) [Han et al., 2015]
• Dynamic surgery (DS) [Guo et al., 2016]
• Soft-weight sharing (SWS) [Ullrich et al., 2017]
• Sparse VD [Molchanov et al., 2017]
Experiment-Application to AlexNet-Caltech

OLMP can effectively compress conventional DNNs.
Experiment – OLMP without iterative pruning
Conclusion
• Conventional layer-wise magnitude-based pruning requires manually tuning the layer-specific thresholds
  • Hard for end-users with limited expertise

• OLMP tunes the thresholds automatically
  • Formulated as an optimization problem
  • Solved with a derivative-free optimization algorithm

• New compression pipeline
  • Iterative OLMP and adjusting
  • Adjusting includes recovering incorrectly pruned connections

• Empirical results show the effectiveness of OLMP
