Number of operations (#OPs): Formulas
2. SeparableConv2D layers
– Separable convolutions not only reduce the number of
trainable parameters compared with Conv2D, but they also
reduce the number of operations, by splitting the convolution
into a sequence of a depthwise and a pointwise convolution
– Depthwise: for each of the Ci x Ho x Wo elements of the
output tensor:
» MULs: K x K, ADDs: (K x K) - 1 (no bias addition)
» Hence ~2 x Ci x Ho x Wo x K x K
– Pointwise: it’s like a Conv2D with 1x1 filters
» 2 x Co x Ho x Wo x 1 x 1 x Ci operations
– In total 2 x Ho x Wo x (K x K + Co) x Ci operations
» Ratio #OPs Conv2D / #OPs SeparableConv2D = Co / (1 + Co/(K x K)) (see the sketch below)
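A minimal Python sketch of the two counts above (the helper names are ours, not from the course material), checking the ratio for example values Ci = 96, Co = 32, K = 3:

def conv2d_ops(Ci, Co, Ho, Wo, K):
    # 2 x Co x Ho x Wo x K x K x Ci (MULs + ADDs, bias ignored)
    return 2 * Co * Ho * Wo * K * K * Ci

def separable_conv2d_ops(Ci, Co, Ho, Wo, K):
    depthwise = 2 * Ci * Ho * Wo * K * K   # one K x K filter per input channel
    pointwise = 2 * Co * Ho * Wo * Ci      # 1x1 Conv2D
    return depthwise + pointwise           # = 2 x Ho x Wo x (K x K + Co) x Ci

Ci, Co, Ho, Wo, K = 96, 32, 32, 32, 3
print(conv2d_ops(Ci, Co, Ho, Wo, K) / separable_conv2d_ops(Ci, Co, Ho, Wo, K))
print(Co / (1 + Co / (K * K)))   # same value, ~7.02: Conv2D does ~7x more OPs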
3. Dense
– Dense (Fully Connected) layers multiply a weight matrix by a
flattened input tensor (i.e., a vector) and add a bias vector
– Assume flattened input and output tensors with Xi and Yo
elements, respectively. For each of the Yo output elements:
» MULs: Xi
» ADDs: Xi
– In total 2 x Yo x Xi #OPs
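The same kind of sketch for Dense (Xi and Yo as defined above; the 32768 x 10 case matches the model in the sanity check below):

def dense_ops(Xi, Yo):
    return 2 * Yo * Xi   # Xi MULs + Xi ADDs per output element

print(dense_ops(32768, 10))   # 655360 OPs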
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
Total params: 328,586
Trainable params: 328,586
Non-trainable params: 0
24/02/22 - 15 EESAM - © 2020 MC- LL
Sanity Check
model.add(Conv2D(32,(3,3),padding="same", input_shape=i_s))
…
model.add(Dense(classes))
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
Total params: 328,586
Trainable params: 328,586
Non-trainable params: 0
24/02/22 - 16 EESAM - © 2020 MC- LL
Sanity Check
model.add(Conv2D(32,(3,3),padding="same", input_shape=i_s))
…
model.add(Dense(classes))
w = Ci x K x K x Co = 3 x 3 x 3 x 32 = 864
Model: "sequential" b = Co = 32
_________________________________________________________________
Layer (type) #params
Output= 864
Shape+ 32 = 896 Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
Total params: 328,586
Trainable params: 328,586
Non-trainable params: 0
24/02/22 - 17 EESAM - © 2020 MC- LL
Sanity Check
model.add(Conv2D(32,(3,3),padding="same", input_shape=i_s))
…
model.add(Dense(classes))
Conv2D: w = Ci x K x K x Co = 3 x 3 x 3 x 32 = 864; b = Co = 32; #params = 864 + 32 = 896
Dense: w = Xi x Yo = 32768 x 10 = 327680; b = Yo = 10; #params = 327680 + 10 = 327690
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
Total params: 328,586
Trainable params: 328,586
Non-trainable params: 0
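The sanity check as a short Python sketch (formulas as above; the expected values are the ones reported by model.summary()):

def conv2d_params(Ci, Co, K):
    return Ci * K * K * Co + Co   # weights + biases

def dense_params(Xi, Yo):
    return Xi * Yo + Yo           # weights + biases

assert conv2d_params(Ci=3, Co=32, K=3) == 896
assert dense_params(Xi=32 * 32 * 32, Yo=10) == 327690
assert conv2d_params(3, 32, 3) + dense_params(32768, 10) == 328586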
How much memory is needed?
[Figure: energy per operation, 45 nm technology]
What about the memory used by the activations?
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
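A rough sketch of the answer for this model, assuming FP32 activations and batch size 1 (shapes taken from the Output Shape column above):

import math

shapes = {"conv2d": (32, 32, 32), "activation": (32, 32, 32),
          "flatten": (32768,), "dense": (10,)}
for name, shape in shapes.items():
    n = math.prod(shape)   # elements in the activation tensor
    print(f"{name}: {n} elements = {n * 4 / 1024:.1f} KiB in FP32")
# conv2d, activation, and flatten: 32768 elements = 128.0 KiB each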
[Roofline plot: computing throughput Th (GOP/s) vs. OPs:DRAM byte. Peak throughput 0.4 GOP/s, memory slope 0.2 GB/s, ridge point at 2.0 OP/B. At an operational intensity of 1.84 OP/B, the max attainable computing throughput is 0.368 GOP/s (memory-bound); everything above the roofline is unattainable.]
[Roofline plot: same roofline; at an operational intensity of 0.7 OP/B, the attainable throughput drops to 0.14 GOP/s.]
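The plots follow the standard roofline model: attainable throughput is the minimum of the compute roof and the memory slope times the operational intensity. A one-function sketch:

def roofline(peak_gops, bw_gbs, op_per_byte):
    return min(peak_gops, bw_gbs * op_per_byte)

print(roofline(0.4, 0.2, 1.84))   # ~0.368 GOP/s (memory-bound)
print(roofline(0.4, 0.2, 0.7))    # ~0.14 GOP/s (memory-bound)
print(roofline(0.4, 0.2, 2.0))    # 0.4 GOP/s (ridge point)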
Ho = (Hi + 2P - K) / S + 1 = (32 + 2 x 0 - 3) / 2 + 1 = 14 + 1 = 15
Model Summary (3/6)
– The BatchNormalization moving statistics aren't trainable! Only gamma and beta are.
MiniGoogLeNet: #OPs
CONV K Ci Co Ho Wo #OPs
conv2d_1 3 3 96 32 32 5,308,416
conv2d_2 1 96 32 32 32 6,291,456
conv2d_3 3 96 32 32 32 56,623,104
conv2d_4 1 64 32 32 32 4,194,304
conv2d_5 3 64 48 32 32 56,623,104
conv2d_6 3 80 80 15 15 25,920,000
conv2d_7 1 160 112 15 15 8,064,000
conv2d_8 3 160 48 15 15 31,104,000
conv2d_9 1 160 96 15 15 6,912,000
conv2d_10 3 160 64 15 15 41,472,000
conv2d_11 1 160 80 15 15 5,760,000
conv2d_12 3 160 80 15 15 51,840,000
conv2d_13 1 160 48 15 15 3,456,000
conv2d_14 3 160 96 15 15 62,208,000
conv2d_15 3 144 96 7 7 12,192,768
conv2d_16 1 240 176 7 7 4,139,520
conv2d_17 3 240 160 7 7 33,868,800
conv2d_18 1 336 176 7 7 5,795,328
conv2d_19 3 336 160 7 7 47,416,320
DENSE Xi Yo #OPs
dense_1 336 10 6,720
TOTAL 469,195,840
– Note: 4.8x lower #OPs than AlexNet (2,262M vs. 469M), hence 4.8x lower compute energy and 4.8x lower latency if compute-bound
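A sketch recomputing a few rows of the table with the Conv2D formula (row parameters transcribed from above):

def conv2d_ops(K, Ci, Co, Ho, Wo):
    return 2 * Co * Ho * Wo * K * K * Ci

print(conv2d_ops(3, 3, 96, 32, 32))    # 5308416   (conv2d_1)
print(conv2d_ops(1, 96, 32, 32, 32))   # 6291456   (conv2d_2)
print(conv2d_ops(3, 80, 80, 15, 15))   # 25920000  (conv2d_6)
print(2 * 10 * 336)                    # 6720      (dense_1)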
Exercise 1
• Each vector unit can run 128/32 = 4 OP/cycle (FP32)
– Peak throughput: 16 x 4 OP/cycle x 0.7 GHz = 44.8 GOP/s
• On-chip memory big enough to store activations and fast enough to
sustain the peak throughput, but params need to be accessed from
DDR
– DDR Bandwidth: 2 x 1.6 GHz x 4B = 12.8 GB/s
• OP:DRAM Byte for MiniGoogLeNet: 469M/6.3M = 74.4 OP/B
• OP:DRAM Byte for ShallowNet: 2.3M/1.25M = 1.84 OP/B
[Roofline plot: peak throughput 44.8 GOP/s, DDR slope 12.8 GB/s, ridge point at 3.5 OP:DRAM byte. MiniGoogLeNet (74.4 OP/B) is compute-bound and attains the 44.8 GOP/s peak; ShallowNet (1.84 OP/B) is memory-bound and attains 23.6 GOP/s.]
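A sketch reproducing the Exercise 1 numbers and their roofline outcome:

peak = 16 * (128 // 32) * 0.7   # 44.8 GOP/s
bw = 2 * 1.6 * 4                # 12.8 GB/s (DDR, 4-byte interface)

for name, mops, mbytes in [("MiniGoogLeNet", 469, 6.3),
                           ("ShallowNet", 2.3, 1.25)]:
    intensity = mops / mbytes   # OP per DRAM byte
    print(name, round(intensity, 2), min(peak, bw * intensity))
# MiniGoogLeNet: 74.44 OP/B -> 44.8 GOP/s (compute-bound)
# ShallowNet:     1.84 OP/B -> ~23.6 GOP/s (memory-bound)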
Solution (2/2)
[AlexNet diagram with per-stage shapes (Ho,Wo,Co): input (227,227,3) -> Conv1 (55,55,96) -> pool (27,27,96) -> Conv2 (27,27,256) -> pool (13,13,256) -> Conv3/Conv4/Conv5 (13,13,...) -> pool (6,6,256)]
• Conv2D: 2 x Co x Ho x Wo x K x K x Ci operations
– Conv1: 2 x 96 x 55 x 55 x 11 x 11 x 3 = 210,830,400
– Conv2: 2 x 256 x 27 x 27 x 5 x 5 x 96 = 895,795,200
– Conv3: 2 x 384 x 13 x 13 x 3 x 3 x 256 = 299,040,768
– Conv4: 2 x 384 x 13 x 13 x 3 x 3 x 384 = 448,561,152
– Conv5: 2 x 256 x 13 x 13 x 3 x 3 x 384 = 299,040,768
– Total Conv operations: 2,153,268,288
• FC: 2 x Y x X operations
– FC1: 2 x 4096 x 9216 = 75,497,472
– FC2: 2 x 4096 x 4096 = 33,554,432
– FC3: 2 x 10 x 4096 = 81,920
– Total FC operations = 109,133,824
• Total Conv2D + FC operations: 2,262,402,112
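A sketch recomputing the AlexNet totals (factors copied from the bullets above):

def conv_ops(Co, Ho, Wo, K, Ci):
    return 2 * Co * Ho * Wo * K * K * Ci

def fc_ops(Y, X):
    return 2 * Y * X

convs = [conv_ops(96, 55, 55, 11, 3), conv_ops(256, 27, 27, 5, 96),
         conv_ops(384, 13, 13, 3, 256), conv_ops(384, 13, 13, 3, 384),
         conv_ops(256, 13, 13, 3, 384)]
fcs = [fc_ops(4096, 9216), fc_ops(4096, 4096), fc_ops(10, 4096)]
print(sum(convs))              # 2153268288
print(sum(fcs))                # 109133824
print(sum(convs) + sum(fcs))   # 2262402112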
Solution
AlexNet: Performance
[Roofline plot: peak throughput 89.6 GOP/s, DDR slope 12.8 GB/s, ridge point at 7 OP:DRAM byte. AlexNet (20.4 OP/B) is compute-bound and attains the 89.6 GOP/s peak.]
Homework 2: AlexNet in Keras
• Answer 1:
– MAC ops. = B x (MULs + ADDs) / 2 = 4 x 2,262,402,112 / 2 =
4,524,804,224
– Time for a MAC operation: T = N / Fck = 0.5 ns
» Total time = MAC ops. / M x T = 0.28 s
» Total energy = MAC ops. x E = 18 mJ
• Answer 2:
– Counting the MAC as 2 operations, the peak throughput is
2 OP/MAC x 8 MAC x 2 GHz = 32 GOP/s
– The AlexNet operation intensity is B x 20.4 = 81.6 OP/Byte
» The ridge point of the roofline model should be at x ≤ 81.6 OP/B: in the
worst case, the memory bandwidth Bw should be such that Bw (GB/s) x
81.6 OP/B = 32 GOP/s, i.e., Bw = 32/81.6 = 0.4 GB/s
» Ex: 1 LPDDR4 chip with one-byte width running at 200 MHz is OK
Note: Energy largely underestimated (memory access cannot be ignored)
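A sketch of Answer 1's arithmetic; M = 8 MAC units and E = 4 pJ/MAC are assumptions inferred from the matching numbers in Homework 4:

ops = 2_262_402_112                 # AlexNet MULs + ADDs
B, M, T, E = 4, 8, 0.5e-9, 4e-12    # batch, MAC units (assumed), s/MAC, J/MAC (assumed)
macs = B * ops // 2
print(macs)                         # 4524804224
print(macs / M * T)                 # ~0.28 s
print(macs * E * 1e3)               # ~18 mJ (compute only, per the note above)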
Homework 4: Exercise
• Answer 1:
– MAC ops. = B x (MULs + ADDs) / 2 = 4 x 469M / 2 = 938M MACs
– Time for a MAC operation: T = N / Fck = 0.5 ns
» Total time = MAC ops. / M x T = 938M / 8 x 0.5 ns = 58.6 ms
» Total energy = MAC ops. x E = 938M x 4 pJ = 3.8 mJ
– Counting the MAC as 2 operations, the peak throughput is
2 OP/MAC x 8 MAC x 2 GHz = 32 GOP/s
– The MiniGoogLeNet operation intensity is B x 74.4 = 298 OP/Byte
» The ridge point of the roofline model should be at x ≤ 298 OP/B: in the
worst case, the memory bandwidth Bw should be such that Bw (GB/s) x
298 OP/B = 32 GOP/s, i.e., Bw = 32/298 = 0.11 GB/s
» Ex: 1 LPDDR4 chip with one-byte width running at 55 MHz is OK
• Answer 2:
– If 8 MAC units do the job in 58.6 ms, to do it in 40 ms (assuming
perfect scaling of performance) we need ceil(8 x 58.6 / 40) = 12 MAC
units
– The computing energy does not change (total number of operations
is the same): MAC ops. x E = 938M x 4p = 3.8 mJ
– The size of the parameters in DRAM is 6.3 MB, therefore the DRAM access
energy is E = 640 pJ x (6.3M / 4) ≈ 1 mJ (one 640 pJ access per 4-byte word)
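The same arithmetic for Homework 4, as a sketch (constants as given in the text):

import math

macs = 4 * 469e6 / 2                # 938M MACs
t8 = macs / 8 * 0.5e-9              # ~58.6 ms with 8 MAC units
units = math.ceil(8 * t8 / 40e-3)   # 12 MAC units for a 40 ms budget
e_compute = macs * 4e-12            # ~3.8 mJ, independent of the unit count
e_dram = (6.3e6 / 4) * 640e-12      # ~1 mJ for 6.3 MB of params in DRAM
print(t8, units, e_compute, e_dram)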