
Embedded Electronic Systems

for Artificial Intelligence and


Machine Learning
Computing and Memory
Requirements
Mario R. Casu, Luciano Lavagno
Politecnico di Torino
Outline

• Computing the number of operations and the number of operands
  (parameters and activations) to estimate performance and energy
• Examples:
– ShallowNet
– MiniGoogLeNet
– AlexNet
• Homework assignments



Number of parameters (#params): Examples

• We need to count all trainable parameters


1. Conv2D example: 128 filters with (3,3) kernel

• Number of weights: for each (input, output) channel pair (3x128 in the example), we have one 3x3 kernel
– 3 x 3 x 3 x 128 = 3456 weights
• Bias: for each output channel, we have one bias
– 128 biases



Number of parameters (#params): Examples

2. SeparableConv2D example: same (input, output) channel pair (3x128) as the previous Conv2D example

• Number of weights and biases for depthwise
– 3 x 3 x 3 = 27 weights; 0 (bias not used)
• Number of weights and biases for pointwise
– 3 x 1 x 1 x 128 = 384; 128 biases
• In total 411/128 weights/biases (3456/128 in Conv2D)
– (3456+128)/(411+128) = 6.6x fewer #params
Number of parameters (#params): Formulas

1. Conv2D: (K,K) kernel, F filters, S stride, P padding


– Input shape (Bi,Hi,Wi,Ci), i.e. batch, height, width, channel
– Output shape (Bo,Ho,Wo,Co)
» Bo = Bi
» Co = F
» Ho = (Hi + 2 x P - K) / S + 1
» Wo = (Wi + 2 x P - K) / S + 1
– Number of Weights (w) and Biases (b)
» w = Ci x K x K x Co, b = Co
2. SeparableConv2D with same input/output shapes
– Number of Weights (w) and Biases (b)
» w = Ci x (K x K + Co), b = Co
» Weight ratio Conv2D/SeparableConv2D: Co / (1 + Co/(K x K))
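
A minimal Python sketch (ours, not part of the original deck) that cross-checks these formulas with the Ci=3, Co=128, K=3 example from the previous slides:

def conv2d_params(Ci, Co, K):
    # w = Ci x K x K x Co, plus b = Co
    return Ci * K * K * Co + Co

def separable_conv2d_params(Ci, Co, K):
    # depthwise: Ci x K x K weights (no bias); pointwise: Ci x Co weights + Co biases
    return Ci * (K * K + Co) + Co

print(conv2d_params(3, 128, 3))            # 3584 (3456 weights + 128 biases)
print(separable_conv2d_params(3, 128, 3))  # 539  (411 weights + 128 biases)
print(conv2d_params(3, 128, 3) / separable_conv2d_params(3, 128, 3))  # ~6.6x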
Number of parameters (#params): Formulas

3. Dense (used after Flatten to create a linear array):


– Input shape (Bi,Xi), i.e. batch, input array size
– Output shape (Bo,Yo), i.e. batch, output array size
» Bo = Bi
– Number of Weights (w) and Biases (b)
» w = Xi x Yo, b = Yo



Number of parameters (#params): Formulas

4. For MaxPooling2D and AveragePooling2D use the same formulas for the size of the output tensor as Conv2D and SeparableConv2D
– but no trainable params: no w, no b
5. Batch Normalization
– Input shape (Bi,Hi,Wi,Ci), i.e. batch, height, width, channel
– Output shape (Bo,Ho,Wo,Co) = (Bi,Hi,Wi,Ci)
– Parameters (p): mean, variance, gamma, beta for each Ci
» p = 4 x Ci = 4 x Co
– But only gamma and beta parameters are trainable (pt):
» pt = 2 x Ci = 2 x Co
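
The Dense and BatchNormalization counts can be sketched the same way (helper names are ours); the printed values match the Keras summaries shown later:

def dense_params(Xi, Yo):
    # w = Xi x Yo, plus b = Yo
    return Xi * Yo + Yo

def batchnorm_params(Ci):
    # gamma and beta are trainable; moving mean and variance are not
    return 2 * Ci, 2 * Ci   # (trainable, non-trainable)

print(dense_params(32768, 10))   # 327690 (ShallowNet dense layer)
print(batchnorm_params(96))      # (192, 192), i.e. 4 x 96 = 384 in total (MiniGoogLeNet)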



Number of activations (#activs)

• Simply the product of the sizes along the N dimensions of a tensor
– Ex: (B,H,W,C) = (1,32,32,64) => 1 x 32 x 32 x 64 = 65,536
• Normally #params ≫ #activs
– Params are stored in external DRAM while activations stay in
on-chip buffers (produced and immediately consumed)
– But we need to make room on-chip for parameters as well, as
we need to prefetch them from DRAM when needed
• Both for parameters and activations, to compute the
actual memory storage requirement we need to know
the actual datatype. Examples:
– FP32: 65,536 x 4B = 262144 B = 256 kB
– FP16 (BF16, INT16): 65,536 x 2B = 128 kB
– INT8: 65,536 x 1B = 64 kB
Number of operations (#OPs): Formulas

• Convolutional and Fully Connected layers dominate the total number of operations and hence the computing requirements
• Multiplications (MULs) and Additions (ADDs) dictate
computing requirements (ignore the other operations)
1. Conv2D layers
– For each of the Co x Ho x Wo elements of the output tensor:
» MULs: K x K x Ci
» ADDs: K x K x Ci (including the bias addition)
– In total 2 x Co x Ho x Wo x K x K x Ci #OPs



Number of operations (#OPs): Formulas

2. SeparableConv2D layers
– Not only do separable convolutions reduce the number of trainable parameters compared with Conv2D, they also reduce the number of operations by splitting the convolution into a sequence of depthwise and pointwise convolutions
– Depthwise: for each of the Ci x Ho x Wo elements of the
output tensor:
» MULs: K x K, ADDs: K x K-1 (no bias addition)
» Hence ~2 x Ci x Ho x Wo x K x K
– Pointwise: it’s like a Conv2D with 1x1 filters
» 2 x Co x Ho x Wo x 1 x 1 x Ci operations
– In total 2 x Ho x Wo x (K x K + Co) x Ci operations
» Ratio #OPs Conv2D/SeparableConv2D: Co / (1 + Co/(K x K))
Number of operations (#OPs): Formulas

3. Dense
– Dense (Fully Connected) layers are multiplications between a
weight matrix and a flattened tensor (i.e., a vector), plus the
addition of the bias vector
– Assume flattened input and output tensors with Xi and Yo
elements, respectively. For each y in {Yo}
» MULS: Xi
» ADDs: Xi
– In total 2 x Yo x Xi #OPs
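
The three formulas can be wrapped in small helpers (ours, not from the slides) for quick estimates; the example reproduces the conv2d_3 entry of the MiniGoogLeNet table shown later:

def conv2d_ops(Ci, Co, Ho, Wo, K):
    return 2 * Co * Ho * Wo * K * K * Ci

def separable_conv2d_ops(Ci, Co, Ho, Wo, K):
    # depthwise (~2 x Ci x Ho x Wo x K x K) + pointwise (2 x Co x Ho x Wo x Ci)
    return 2 * Ho * Wo * (K * K + Co) * Ci

def dense_ops(Xi, Yo):
    return 2 * Yo * Xi

print(conv2d_ops(Ci=96, Co=32, Ho=32, Wo=32, K=3))            # 56,623,104
print(separable_conv2d_ops(Ci=96, Co=32, Ho=32, Wo=32, K=3))  # 8,060,928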



Example: Shallownet Sequential

• To estimate computing requirements, we consider only MULs and ADDs (FLOPs) in Conv2D and Dense
• For memory requirements we need #params and #activations

# imports assumed (not shown on the slide)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, Flatten, Dense

def shallownet_sequential(width, height, depth, classes):
    # initialize the model along with the input shape to be
    # "channels last" ordering
    model = Sequential()
    i_s = (height, width, depth)
    # define the first (and only) CONV => RELU layer
    model.add(Conv2D(32, (3, 3), padding="same", input_shape=i_s))
    model.add(Activation("relu"))
    # softmax classifier
    model.add(Flatten())
    model.add(Dense(classes))
    model.add(Activation("softmax"))
    # return the constructed network architecture
    return model
Shallownet Sequential: computing
and memory requirements

• Counting only MULs and ADDs (FLOPs) in the Conv2D and Dense layers of the code above:
– Conv2D: #FLOPs = 2 x Co x Ho x Wo x K x K x Ci = 2 x 32 x 32 x 32 x 3 x 3 x 3 = 1,769,472
– Dense: #FLOPs = 2 x Yo x Xi = 2 x 10 x 32768 = 655,360
– Total #FLOPs = 2,424,832 ≈ 2.4M
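
As a quick cross-check (a sketch of ours, not in the original deck), the total follows from two lines of arithmetic:

conv_flops = 2 * 32 * 32 * 32 * 3 * 3 * 3   # Conv2D: 1,769,472
dense_flops = 2 * 10 * (32 * 32 * 32)       # Dense:    655,360
print(conv_flops + dense_flops)             # 2,424,832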
Model Summary

model = shallownet_sequential(width=32, height=32,


depth=3, classes=10)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
Total params: 328,586
Trainable params: 328,586
Non-trainable params: 0
Sanity Check

model.add(Conv2D(32,(3,3),padding="same", input_shape=i_s))

model.add(Dense(classes))
Conv2D: w = Ci x K x K x Co = 3 x 3 x 3 x 32 = 864
        b = Co = 32
        #params = 864 + 32 = 896

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
Dense: w = Xi x Yo = 32768 x 10 = 327680
       b = Yo = 10
       #params = 327690

Total params: 328,586
Trainable params: 328,586
Non-trainable params: 0
How much memory is needed?

• Fast access to memory very often means better inference throughput
• However, semiconductor memories are either fast
(and close to the processors) or dense, but not both
• Need to reduce the size of the parameters, possibly
using small data types
• Previous example: 328,586 parameters
– FP32 (4 Bytes): (328586 x4)/2^20 = 1.25 MB
» Can’t fit in the cache of a small processor for edge applications
– INT8 (1 Byte): (328586 x1)/2^10 = 320 kB
» Can fit in the cache of a small processor for edge applications



How much energy is needed?

• The less we access the DRAM the better

• Best-case scenario: we read the params once (enough on-chip buffering for temporary storage)
– Edram(FP32) = 1.25MB / 4B x 640pJ
= 0.2 mJ
– Edram(INT8) = 320 kB / 4B x 640pJ =
0.05 mJ (4x less)
• OPs: 50% MULs 50% ADDs
– Eops(FP32) = 2.4 M x 0.5 x (3.7 +
0.9) pJ = 0.0055 mJ
– Eops(INT8) = 2.4 M x 0.5 x (0.2 +
0.03) pJ = 0.00028 mJ (20x less)

[Figure: energy per operation and per DRAM access, 45 nm technology]
What about the memory used
by the activations?

• The size of the activation tensors can have an impact


on performance
• Differently from parameters, activations are produced
and consumed almost immediately
• Ideally, activations should be stored in a local (i.e., on
chip) memory to reduce data access time and energy
– Not always possible because they can easily exceed the on-chip memory capacity
• To evaluate memory requirements, we need to
evaluate the maximum among the sizes of the
intermediate tensors (i.e., the activations)
– Unless we have a pipeline: all layers executed concurrently



Tensor size: assume batch = 1

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================

Input tensor: 3 x 32 x 32 => 3072
Conv2D: 32 x 32 x 32 => 32768
Activation: 32 x 32 x 32 => 32768
Flatten: 32768 (just a change of memory arrangement)
Dense: 10
Activation_1: 10
(To be converted into bytes, depending on the datatype)
Shallownet Sequential: Recap

• #FLOPs: 2,424,832 = 2.3 MOP


• #PARAMS: 328,586 = 1.25 MB, if FP32
• MAX #ACTIVATIONS: 32,768 = 128 kB, if FP32
• Assume a hypothetical system with sufficient on-chip
memory to store the maximum number of activations
– DRAM accesses for #PARAMS
– FLOPS : DRAM bytes = 2.3MOP/1.25MB = 1.84 OP/B
• Ex. DDR bus @100MHz, 8 bits
– Peak bandwidth: 2x100M = 0.2 GB/s
» Computing throughput matches peak DDR bandwidth at
0.2 GB/s x 1.84 OP/B = 0.368 GOP/s
» More computing power would be underutilized (roofline model)



Roofline model

[Roofline plot: computing throughput Th (GOP/s) vs. operational intensity (OPs per DRAM byte). The sloped part (slope 0.2 GB/s) is the attainable, memory-bound throughput; the flat part at 0.4 GOP/s is the compute ceiling; anything above is unattainable. At 1.84 OP/B the max attainable throughput is 0.368 GOP/s.]

• Example of unattainable throughput
– DDR bus @100 MHz, 8 bits; processor with Fck = 100 MHz and 2 MAC units
» Peak bandwidth: 2 x 100M = 0.2 GB/s
» Peak computing throughput: 0.4 GOP/s (2 MACs = 4 OPs per cycle)
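
The attainable throughput is simply the minimum of the compute ceiling and bandwidth times operational intensity; a minimal sketch (ours) reproduces the 0.368 GOP/s figure:

def roofline_throughput(peak_gops, bw_gbs, op_per_byte):
    # attainable throughput = min(compute ceiling, bandwidth x intensity)
    return min(peak_gops, bw_gbs * op_per_byte)

print(roofline_throughput(0.4, 0.2, 1.84))   # 0.368 GOP/s: ShallowNet is memory-bound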



Shallownet Sequential: example
performance estimate

• Actual throughput: TH = 0.368 GOP/s


• Latency: #FLOPs/TH = 1000 x 2.3M / 368M = 6.25 ms
• Frame rate (fps) = 1000 / 6.25 = 160 fps
• What if we replace Conv2D with SeparableConv2D?
Layer (type) Output Shape Param #
=====================================================================
separable_conv2d (Separable Conv2D) (None, 32, 32, 32) 155
activation_2 (Activation) (None, 32, 32, 32) 0
flatten_1 (Flatten) (None, 32768) 0
dense_1 (Dense) (None, 10) 327690
activation_3 (Activation) (None, 10) 0
=====================================================================
Total params: 327,845
Trainable params: 327,845
Non-trainable params: 0

#PARAMS: 327,845 = 1.25 MB (FP32)
#FLOPs: 907,264 = 0.865 MOP
=> 0.7 OP/Byte
Roofline model

[Roofline plot: slope 0.2 GB/s, compute ceiling 0.4 GOP/s (unattainable here); at 0.7 OP/B the max attainable computing throughput is 0.14 GOP/s.]

• Actual throughput: TH = 0.14 GOP/s


• Latency: #OPs/TH = 1000 x 0.865M / 140M = 6.18 ms
• Frame rate (fps) = 1000 / 6.18 = 162 fps
– No significant performance improvement with separable
convolution compared to standard Conv2D: why?
MiniGoogLeNet
[Figure: Conv Module, Inception Module, and Downsample Module block diagrams]
Ref: https://arxiv.org/abs/1611.03530


Conv and Inception Modules

def minigooglenet_functional(width, height, depth, classes):

    def conv_module(x, K, kX, kY, stride, chanDim, padding="same"):
        # define a CONV => BN => RELU pattern
        x = Conv2D(K, (kX, kY), strides=stride, padding=padding)(x)
        x = BatchNormalization(axis=chanDim)(x)
        x = Activation("relu")(x)
        # return the block
        return x

    def inception_module(x, numK1x1, numK3x3, chanDim):
        # define two CONV modules, then concatenate across the
        # channel dimension
        conv_1x1 = conv_module(x, numK1x1, 1, 1, (1, 1), chanDim)
        conv_3x3 = conv_module(x, numK3x3, 3, 3, (1, 1), chanDim)
        x = concatenate([conv_1x1, conv_3x3], axis=chanDim)
        # return the block
        return x



Downsample Module

    def downsample_module(x, K, chanDim):
        # define the CONV module and POOL, then concatenate
        # across the channel dimensions
        conv_3x3 = conv_module(x, K, 3, 3, (2, 2), chanDim, padding="valid")
        pool = MaxPooling2D((3, 3), strides=(2, 2))(x)
        x = concatenate([conv_3x3, pool], axis=chanDim)
        # return the block
        return x



MiniGoogLeNet as a Keras functional API model

def minigooglenet_functional(width, height, depth, classes):
    . . . # previously defined functions
    # initialize the input shape to be "channels last"
    inputShape = (height, width, depth)
    chanDim = -1
    # define the model input and first CONV module
    inputs = Input(shape=inputShape)
    x = conv_module(inputs, 96, 3, 3, (1, 1), chanDim)
    # two Inception modules followed by a downsample module
    x = inception_module(x, 32, 32, chanDim)
    x = inception_module(x, 32, 48, chanDim)
    x = downsample_module(x, 80, chanDim)
    # four Inception modules followed by a downsample module
    x = inception_module(x, 112, 48, chanDim)
    x = inception_module(x, 96, 64, chanDim)
    x = inception_module(x, 80, 80, chanDim)
    x = inception_module(x, 48, 96, chanDim)
    x = downsample_module(x, 96, chanDim)
    # two Inception modules followed by global POOL and dropout
    x = inception_module(x, 176, 160, chanDim)
    x = inception_module(x, 176, 160, chanDim)
    x = AveragePooling2D((7, 7))(x)
    x = Dropout(0.5)(x)
    # softmax classifier
    x = Flatten()(x)
    x = Dense(classes)(x)
    x = Activation("softmax")(x)
    # create the model
    model = Model(inputs, x, name="minigooglenet")
    return model
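
A minimal usage sketch (ours), assuming the usual tensorflow.keras imports for the classes referenced above:

from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     Activation, MaxPooling2D, AveragePooling2D,
                                     Dropout, Flatten, Dense, concatenate)
from tensorflow.keras.models import Model

# build the CIFAR-10-sized model and print the per-layer output shapes and
# parameter counts reported in the next slides
model = minigooglenet_functional(width=32, height=32, depth=3, classes=10)
model.summary()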
MiniGoogleNet: Model Summary (1/6)
conv2d_1: w + b = (3 x 3 x 3 x 96) + 96 = 2592 + 96 = 2688
batch_normalization: mean, variance, gamma, beta: 4 x 96 = 384

Model: "minigooglenet"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 32, 32, 3)] 0
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 32, 32, 96) 2688 input_1[0][0] (output activations: 32x32x96 = 98,304)
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 32, 32, 96) 384 conv2d_1[0][0]
__________________________________________________________________________________________________
activation_2 (Activation) (None, 32, 32, 96) 0 batch_normalization[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 32, 32, 32) 3104 activation_2[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 32, 32, 32) 27680 activation_2[0][0]
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 32, 32, 32) 128 conv2d_2[0][0]
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 32, 32, 32) 128 conv2d_3[0][0]
__________________________________________________________________________________________________
activation_3 (Activation) (None, 32, 32, 32) 0 batch_normalization_1[0][0]
__________________________________________________________________________________________________
activation_4 (Activation) (None, 32, 32, 32) 0 batch_normalization_2[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate) (None, 32, 32, 64) 0 activation_3[0][0]
activation_4[0][0]
__________________________________________________________________________________________________

conv2d_2: w + b = (96 x 1 x 1 x 32) + 32 = 3072 + 32 = 3104
conv2d_3: w + b = (96 x 3 x 3 x 32) + 32 = 27648 + 32 = 27680
Model Summary (2/6)

conv2d_4 (Conv2D) (None, 32, 32, 32) 2080 concatenate_2[0][0]


__________________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 32, 32, 48) 27696 concatenate_2[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 32, 32, 32) 128 conv2d_4[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 32, 32, 48) 192 conv2d_5[0][0]
__________________________________________________________________________________________________
activation_5 (Activation) (None, 32, 32, 32) 0 batch_normalization_3[0][0]
__________________________________________________________________________________________________
activation_6 (Activation) (None, 32, 32, 48) 0 batch_normalization_4[0][0]
__________________________________________________________________________________________________
concatenate_3 (Concatenate) (None, 32, 32, 80) 0 activation_5[0][0]
activation_6[0][0]
__________________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 15, 15, 80) 57680 concatenate_3[0][0]
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 15, 15, 80) 320 conv2d_6[0][0]
__________________________________________________________________________________________________
activation_7 (Activation) (None, 15, 15, 80) 0 batch_normalization_5[0][0]
__________________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 15, 15, 80) 0 concatenate_3[0][0]
__________________________________________________________________________________________________
concatenate_4 (Concatenate) (None, 15, 15, 160) 0 activation_7[0][0]
max_pooling2d_3[0][0]
__________________________________________________________________________________________________

conv2d_6 and max_pooling2d_3: Ho = (Hi + 2 x P - K) / S + 1 = (32 + 2 x 0 - 3) / 2 + 1 = 14 + 1 = 15
Model Summary (3/6)

conv2d_7 (Conv2D) (None, 15, 15, 112) 18032 concatenate_4[0][0]


__________________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 15, 15, 48) 69168 concatenate_4[0][0]
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 15, 15, 112) 448 conv2d_7[0][0]
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 15, 15, 48) 192 conv2d_8[0][0]
__________________________________________________________________________________________________
activation_8 (Activation) (None, 15, 15, 112) 0 batch_normalization_6[0][0]
__________________________________________________________________________________________________
activation_9 (Activation) (None, 15, 15, 48) 0 batch_normalization_7[0][0]
__________________________________________________________________________________________________
concatenate_5 (Concatenate) (None, 15, 15, 160) 0 activation_8[0][0]
activation_9[0][0]
__________________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 15, 15, 96) 15456 concatenate_5[0][0]
__________________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 15, 15, 64) 92224 concatenate_5[0][0]
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 15, 15, 96) 384 conv2d_9[0][0]
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 15, 15, 64) 256 conv2d_10[0][0]
__________________________________________________________________________________________________
activation_10 (Activation) (None, 15, 15, 96) 0 batch_normalization_8[0][0]
__________________________________________________________________________________________________
activation_11 (Activation) (None, 15, 15, 64) 0 batch_normalization_9[0][0]
__________________________________________________________________________________________________
concatenate_6 (Concatenate) (None, 15, 15, 160) 0 activation_10[0][0]
activation_11[0][0]
__________________________________________________________________________________________________
“concatenate” adds the size along the output channel dimension
Model Summary (4/6)

conv2d_11 (Conv2D) (None, 15, 15, 80) 12880 concatenate_6[0][0]


__________________________________________________________________________________________________
conv2d_12 (Conv2D) (None, 15, 15, 80) 115280 concatenate_6[0][0]
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 15, 15, 80) 320 conv2d_11[0][0]
__________________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 15, 15, 80) 320 conv2d_12[0][0]
__________________________________________________________________________________________________
activation_12 (Activation) (None, 15, 15, 80) 0 batch_normalization_10[0][0]
__________________________________________________________________________________________________
activation_13 (Activation) (None, 15, 15, 80) 0 batch_normalization_11[0][0]
__________________________________________________________________________________________________
concatenate_7 (Concatenate) (None, 15, 15, 160) 0 activation_12[0][0]
activation_13[0][0]
__________________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 15, 15, 48) 7728 concatenate_7[0][0]
__________________________________________________________________________________________________
conv2d_14 (Conv2D) (None, 15, 15, 96) 138336 concatenate_7[0][0]
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 15, 15, 48) 192 conv2d_13[0][0]
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 15, 15, 96) 384 conv2d_14[0][0]
__________________________________________________________________________________________________
activation_14 (Activation) (None, 15, 15, 48) 0 batch_normalization_12[0][0]
__________________________________________________________________________________________________
activation_15 (Activation) (None, 15, 15, 96) 0 batch_normalization_13[0][0]
__________________________________________________________________________________________________
concatenate_8 (Concatenate) (None, 15, 15, 144) 0 activation_14[0][0]
activation_15[0][0]
__________________________________________________________________________________________________
Model Summary (5/6)

conv2d_15 (Conv2D) (None, 7, 7, 96) 124512 concatenate_8[0][0]


__________________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 7, 7, 96) 384 conv2d_15[0][0]
__________________________________________________________________________________________________
activation_16 (Activation) (None, 7, 7, 96) 0 batch_normalization_14[0][0]
__________________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 7, 7, 144) 0 concatenate_8[0][0]
__________________________________________________________________________________________________
concatenate_9 (Concatenate) (None, 7, 7, 240) 0 activation_16[0][0]
max_pooling2d_4[0][0]
__________________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 7, 7, 176) 42416 concatenate_9[0][0]
__________________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 7, 7, 160) 345760 concatenate_9[0][0]
__________________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 7, 7, 176) 704 conv2d_16[0][0]
__________________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 7, 7, 160) 640 conv2d_17[0][0]
__________________________________________________________________________________________________
activation_17 (Activation) (None, 7, 7, 176) 0 batch_normalization_15[0][0]
__________________________________________________________________________________________________
activation_18 (Activation) (None, 7, 7, 160) 0 batch_normalization_16[0][0]
__________________________________________________________________________________________________
concatenate_10 (Concatenate) (None, 7, 7, 336) 0 activation_17[0][0]
activation_18[0][0]
__________________________________________________________________________________________________



Model Summary (6/6)

dense_1: w + b = Xi x Yo + Yo = 336 x 10 + 10 = 3370

conv2d_18 (Conv2D) (None, 7, 7, 176) 59312 concatenate_10[0][0]


__________________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 7, 7, 160) 484000 concatenate_10[0][0]
__________________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 7, 7, 176) 704 conv2d_18[0][0]
__________________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 7, 7, 160) 640 conv2d_19[0][0]
__________________________________________________________________________________________________
activation_19 (Activation) (None, 7, 7, 176) 0 batch_normalization_17[0][0]
__________________________________________________________________________________________________
activation_20 (Activation) (None, 7, 7, 160) 0 batch_normalization_18[0][0]
__________________________________________________________________________________________________
concatenate_11 (Concatenate) (None, 7, 7, 336) 0 activation_19[0][0]
activation_20[0][0]
__________________________________________________________________________________________________
average_pooling2d (AveragePooli (None, 1, 1, 336) 0 concatenate_11[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 1, 1, 336) 0 average_pooling2d[0][0]
__________________________________________________________________________________________________
flatten_1 (Flatten) (None, 336) 0 dropout[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 10) 3370 flatten_1[0][0]
__________________________________________________________________________________________________
activation_21 (Activation) (None, 10) 0 dense_1[0][0]
==================================================================================================
Total params: 1,656,250
Trainable params: 1,652,826
Non-trainable params: 3,424
__________________________________________________________________________________________________

FP32 (4 Bytes): (1,652,826 x 4) / 2^20 = 6.3 MB
Mean and variance in batch normalization aren’t trainable! Only gamma and beta are.
MiniGoogleNet: #OPs
CONV K Ci Co Ho Wo #OPs
conv2d_1 3 3 96 32 32 5,308,416
conv2d_2 1 96 32 32 32 6,291,456
conv2d_3 3 96 32 32 32 56,623,104
conv2d_4 1 64 32 32 32 4,194,304
conv2d_5 3 64 48 32 32 56,623,104
conv2d_6 3 80 80 15 15 25,920,000
conv2d_7 1 160 112 15 15 8,064,000
conv2d_8 3 160 48 15 15 31,104,000
conv2d_9 1 160 96 15 15 6,912,000
conv2d_10 3 160 64 15 15 41,472,000
conv2d_11 1 160 80 15 15 5,760,000
conv2d_12 3 160 80 15 15 51,840,000
conv2d_13 1 160 48 15 15 3,456,000
conv2d_14 3 160 96 15 15 62,208,000
conv2d_15 3 144 96 7 7 12,192,768
conv2d_16 1 240 176 7 7 4,139,520
conv2d_17 3 240 160 7 7 33,868,800
conv2d_18 1 336 176 7 7 5,795,328
conv2d_19 3 336 160 7 7 47,416,320
DENSE Xi Yo #OPs
dense_1 336 10 6,720
TOTAL 469,195,840



Comments on MiniGoogLeNet

• Activation tensors can be stored in an on-chip memory of an


edge device with sufficient memory
– 98,304 x 4B = 393,216 B = 384 kB
• Params (weights and biases) can’t be stored
– 6.3 MB exceeds typical on-chip memory size
– Need to fetch params from the external DRAM at each batch
» DRAM energy: 6.3 MB / 4B x 640 pJ = 1 mJ
• Number of operations: 469 MOP (FP32)
– One CPU with one single FPU running at 1 GHz would take ~0.5 s
to run inference (2 fps)
– For 60 fps we would need 30 FPUs working in parallel…
» …as long as we manage to perfectly parallelize the execution and avoid
the curse of the memory wall…
» OPs energy: 469 M x 0.5 x (3.7 + 0.9) pJ ≈ 1.1 mJ
» Unlike ShallowNet, the OPs energy is comparable to the DRAM energy
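
The same estimate can be scripted (a sketch of ours, using the 45 nm figures quoted earlier: 3.7 pJ per FP32 MUL, 0.9 pJ per FP32 ADD, 640 pJ per 32-bit DRAM access):

def dram_energy_mj(param_bytes, pj_per_access=640):
    # one 4-byte DRAM access per FP32 parameter, read once per inference
    return param_bytes / 4 * pj_per_access * 1e-9

def ops_energy_mj(num_ops, pj_mul=3.7, pj_add=0.9):
    # half of the ops are MULs and half are ADDs
    return num_ops * 0.5 * (pj_mul + pj_add) * 1e-9

print(dram_energy_mj(6.3 * 2**20))  # ~1.0 mJ (MiniGoogLeNet params, FP32)
print(ops_energy_mj(469e6))         # ~1.1 mJ (MiniGoogLeNet FP32 MULs and ADDs)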



Comments on MiniGoogLeNet

• What if we replace all 3x3 Conv2D with SeparableConv2D?


• Before:
=======================================================
Total params: 1,656,250
Trainable params: 1,652,826
Non-trainable params: 3,424
_______________________________________________________
• After:
=======================================================
Total params: 353,786
Trainable params: 350,362
Non-trainable params: 3,424
_______________________________________________________
• 4.7x reduction; if using FP32, from 6.3 MB to 1.3 MB
• Possible on-chip storage, but even with off-chip it can be very
effective for performance and energy improvement
– 4.7x lower DRAM energy
– 4.7x lower latency if memory-bound
Comments on MiniGoogLeNet

CONV K Ci Co Ho Wo #OPs Conv2D #OPs Separable ratio


conv2d_1 3 3 96 32 32 5,308,416 645,120 8.2
conv2d_2 1 96 32 32 32 6,291,456 6,291,456 1.0
conv2d_3 3 96 32 32 32 56,623,104 8,060,928 7.0
conv2d_4 1 64 32 32 32 4,194,304 4,194,304 1.0
conv2d_5 3 64 48 32 32 56,623,104 7,471,104 7.6
conv2d_6 3 80 80 15 15 25,920,000 3,204,000 8.1
conv2d_7 1 160 112 15 15 8,064,000 8,064,000 1.0
conv2d_8 3 160 48 15 15 31,104,000 4,104,000 7.6
conv2d_9 1 160 96 15 15 6,912,000 6,912,000 1.0
conv2d_10 3 160 64 15 15 41,472,000 5,256,000 7.9
conv2d_11 1 160 80 15 15 5,760,000 5,760,000 1.0
conv2d_12 3 160 80 15 15 51,840,000 6,408,000 8.1
conv2d_13 1 160 48 15 15 3,456,000 3,456,000 1.0
conv2d_14 3 160 96 15 15 62,208,000 7,560,000 8.2
conv2d_15 3 144 96 7 7 12,192,768 1,481,760 8.2
conv2d_16 1 240 176 7 7 4,139,520 4,139,520 1.0
conv2d_17 3 240 160 7 7 33,868,800 3,974,880 8.5
conv2d_18 1 336 176 7 7 5,795,328 5,795,328 1.0
conv2d_19 3 336 160 7 7 47,416,320 5,564,832 8.5
DENSE Xi Yo
dense_1 336 10 6,720 6,720 1.0
TOTAL 469,195,840 98,349,952 4.8

4.8x lower OPs energy, 4.8x lower latency if compute-bound
Exercise 1

• Evaluate the performance of MiniGoogleNet and


Shallownet (latency, fps) on an Intel Movidius
“Myriad X” System-on-Chip (SoC) with the following
characteristics:
– 16 SHAVE processors, each with one 128-bit vector unit
– Clock frequency: 700 MHz
– On-die memory 2.5 MB, 450 GB/s access bandwidth
– DDR4 DRAM 4 Gbit @1600 MHz, 32 bit



Solution (1/2)

• Each vector unit can run 128/32 = 4 OP / cycle (FP32 floating point)
– Peak throughput: 16 x 4 OP/cycle x 0.7 GHz = 44.8 GOP/s
• On-chip memory big enough to store activations and fast enough to
sustain the peak throughput, but params need to be accessed from
DDR
– DDR Bandwidth: 2 x 1.6 GHz x 4B = 12.8 GB/s
• OP:DRAM Byte for MiniGoogLeNet: 469M/6.3M = 74.4 OP/B
• OP:DRAM Byte for ShallowNet: 2.3M/1.25M = 1.84 OP/B

[Roofline plot for the Myriad X: slope 12.8 GB/s, compute ceiling 44.8 GOP/s, ridge point at 3.5 OP/B. MiniGoogLeNet (74.4 OP/B) sits on the 44.8 GOP/s ceiling; ShallowNet (1.84 OP/B) is memory-bound at 23.6 GOP/s.]
Solution (2/2)

• The Movidius SoC can run MiniGoogleNet at full


speed (if the code is perfectly parallelizable)
– Latency: #OPs / Th = 469 MOP / 44.8 GOP/s = 10.5 ms
– Frames per second: 1 / 0.0105 = 95 fps
• The Movidius SoC cannot run ShallowNet at full
speed due to the memory wall, yet the speed is very
high:
– Latency: #OPs / Th = 2.3 MOP / 23.6 GOP/s = 0.1 ms
– Frames per second: 1 / 0.1ms = 10 kfps
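
A compact sketch of the whole estimate (ours, not in the original deck), combining the roofline bound with the #OPs and operational intensities computed above:

PEAK_GOPS = 16 * 4 * 0.7   # 44.8 GOP/s: 16 SHAVEs x 4 FP32 OP/cycle x 0.7 GHz
DDR_GBS = 12.8             # 2 x 1.6 GHz x 4 B

for name, mop, op_per_byte in [("MiniGoogLeNet", 469, 74.4),
                               ("ShallowNet", 2.3, 1.84)]:
    th = min(PEAK_GOPS, DDR_GBS * op_per_byte)   # attainable GOP/s
    latency_ms = mop / th                        # MOP / (GOP/s) = ms
    print(f"{name}: {th:.1f} GOP/s, {latency_ms:.2f} ms, {1000 / latency_ms:.0f} fps")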



Homework 1: Separable Conv2D

• Re-evaluate the performance of MiniGoogleNet and


Shallownet (latency, fps) on the same device of
Exercise 1 assuming that 2D standard convolutions
with K>1 are replaced with separable convolutions
• Evaluate the energy cost before and after the
replacement



Exercise 2

• Compute the number of weights and activations of the


famous AlexNet DNN
• Evaluate performance and energy on the Myriad X;
activations and parameters use the BF16 datatype
– For energy assume the same energy figures used before
(real energy details unknown for the Myriad X SoC)



Solution
AlexNet: 5 Conv2D layers

Input (Hi,Wi,Ci) -> Output (Ho,Wo,Co) for each Conv2D layer:
– Conv1: (227,227,3) -> (55,55,96)
– Conv2: (27,27,96) -> (27,27,256)
– Conv3: (13,13,256) -> (13,13,384)
– Conv4: (13,13,384) -> (13,13,384)
– Conv5: (13,13,384) -> (13,13,256)



Solution
AlexNet: MaxPooling2D layers

Input (Hi,Wi,Ci) -> Output (Ho,Wo,Co) for each MaxPooling2D layer:
– MaxPool1: (55,55,96) -> (27,27,96)
– MaxPool2: (27,27,256) -> (13,13,256)
– MaxPool3: (13,13,256) -> (6,6,256)



Solution
AlexNet: Activations and Parameters
Activations
• Input: 227x227x3 = 154,587
• Conv1: 55x55x96 = 290,400
• MaxPool1: 27x27x96 = 69,984
• Conv2: 27x27x256 = 186,624
• MaxPool2: 13x13x256 = 43,264
• Conv3: 13x13x384 = 64,896
• Conv4: 13x13x384 = 64,896
• Conv5: 13x13x256 = 43,264
• MaxPool3: 6x6x256 = 9,216
• FC1: 4,096
• FC2: 4,096
• FC3: 10
• TOTAL = 935,333; in MB (BF16): (935,333 x 2) / 2^20 = 1.78 MB
• MAX = 290,400; in MB (BF16): (290,400 x 2) / 2^20 = 0.55 MB

Weights and Biases
• Conv1: 3x11x11x96 + 96 = 34,944
• Conv2: 96x5x5x256 + 256 = 614,656
• Conv3: 256x3x3x384 + 384 = 885,120
• Conv4: 384x3x3x384 + 384 = 1,327,488
• Conv5: 384x3x3x256 + 256 = 884,992
• FC1: 9216x4096 + 4096 = 37,752,832
• FC2: 4096x4096 + 4096 = 16,781,312
• FC3: 4096x10 + 10 = 40,970
• TOTAL = 58,322,314; in MB (BF16): (58,322,314 x 2) / 2^20 = 111 MB
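
The parameter total can be recomputed from the layer shapes with a short script (ours, layer list derived from the slide above):

conv = [  # (Ci, K, Co) per Conv2D layer
    (3, 11, 96), (96, 5, 256), (256, 3, 384), (384, 3, 384), (384, 3, 256)]
fc = [(9216, 4096), (4096, 4096), (4096, 10)]  # (Xi, Yo) per Dense layer

params = sum(Ci * K * K * Co + Co for Ci, K, Co in conv) \
       + sum(Xi * Yo + Yo for Xi, Yo in fc)
print(params)              # 58,322,314
print(params * 2 / 2**20)  # ~111 MB in BF16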
Solution
AlexNet: Operations

• Conv2D: 2 x Co x Ho x Wo x K x K x Ci operations
– Conv1: 2 x 96 x 55 x 55 x 11 x 11 x 3 = 210,830,400
– Conv2: 2 x 256 x 27 x 27 x 5 x 5 x 96 = 895,795,200
– Conv3: 2 x 384 x 13 x 13 x 3 x 3 x 256 = 299,040,768
– Conv4: 2 x 384 x 13 x 13 x 3 x 3 x 384 = 448,561,152
– Conv5: 2 x 256 x 13 x 13 x 3 x 3 x 384 = 299,040,768
– Total Conv operations: 2,153,268,288
• FC: 2 x Y x X operations
– FC1: 2 x 4096 x 9216 = 75,497,472
– FC2: 2 x 4096 x 4096 = 33,554,432
– FC3: 2 x 10 x 4096 = 81,920
– Total FC operations = 109,133,824
• Total Conv2D + FC operations: 2,262,402,112
Solution
AlexNet: Performance

• Each vector unit can run 128/16 = 8 BF16 OP / cycle


– Peak throughput: 16 x 8 OP/cycle x 0.7 GHz = 89.6 GOP/s
• On-chip memory big enough to store activations (2.5 MB) and
fast enough to sustain the peak throughput (450 GB/s), but
params need to be accessed from DDR
– DDR Bandwidth: 2 x 1.6 GHz x 4B = 12.8 GB/s
• OP/ DRAM Byte for AlexNet: 2262M/111M = 20.4 OP/B
• Latency: 2262 MOP / 89.6 GOP/s = 25 ms (40 fps)
• DRAM energy: 111 MB / 4B x 640 pJ = 18 mJ
• OPs energy: 2262M x 0.5 x 4 pJ = 4.5 mJ

[Roofline plot: slope 12.8 GB/s, compute ceiling 89.6 GOP/s, ridge point at 7 OP/B; AlexNet at 20.4 OP/B is compute-bound.]
Homework 2: AlexNet in Keras

• Use Google Colab and Keras to build AlexNet using


the sequential API method
• Evaluate the number of activations and parameters
using model.summary() and check against paper and
pencil calculations
• Check if there can be any advantage in terms of
performance if standard convolutions were replaced
with separable convolutions (assume to use the same
Myriad X device used for Exercise 2)



Homework 3: Exercise

• Assume a processor with M = 8 MAC units working at


Fck = 2 GHz clock frequency
• Assume each MAC takes N = 1 clock cycle and E = 4 pJ
• Assume that the memory is not a bottleneck
1. Compute the latency and the energy (only MAC ops) to
process a mini batch of size B = 4 images using AlexNet
2. Determine the memory bandwidth needed to support the
maximum throughput



Solution

• Answer 1:
– MAC ops. = B x (MULs + ADDs) / 2 = 4 x 2,262,402,112 / 2 =
4,524,804,224
– Time for a MAC operation: T = N / Fck = 0.5 ns
» Total time = MAC ops. / M x T = 0.28 s
» Total energy = MAC ops. x E = 18 mJ
• Answer 2:
– Counting the MAC as 2 operations, the peak throughput is
2 OP/MAC x 8 MAC x 2 GHz = 32 GOP/s
– The AlexNet operation intensity is B x 20.4 = 81.6 OP/Byte
» The ridge point of the roofline model should be at x ≤ 81.6 OP/B: in the worst case, the memory bandwidth Bw must satisfy Bw (GB/s) x 81.6 OP/B = 32 GOP/s, i.e. Bw = 32/81.6 = 0.4 GB/s
» Ex: 1 LPDDR4 chip with one-byte width running at 200 MHz is OK
» Ex: 1 LPDDR4 chip with one-byte width running at 200 MHz is OK
Note: Energy largely underestimated (memory access cannot be ignored)
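
For reference, the arithmetic of both answers in a few lines of Python (ours, with the stated assumptions: M MAC units, N cycles and E pJ per MAC, batch B, memory not a bottleneck):

def mac_latency_energy(total_ops, B=4, M=8, N=1, Fck=2e9, E_pj=4):
    macs = B * total_ops / 2              # one MAC = one MUL + one ADD
    latency_s = macs / M * (N / Fck)      # perfect parallelization over M units
    energy_mj = macs * E_pj * 1e-9
    return latency_s, energy_mj

print(mac_latency_energy(2_262_402_112))  # AlexNet: ~0.28 s, ~18 mJ
print(2 * 8 * 2 / (4 * 20.4))             # required DRAM bandwidth: ~0.39 GB/s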
Homework 4: Exercise

1. Repeat the previous exercise for the case of


MiniGoogleNet
2. Evaluate the parallelism (M) needed to obtain a
latency of 40 ms and the consequent energy for both
MAC operations and DRAM access (640 pJ for each
32-bit DRAM access)



Solution (1/2)

• Answer 1:
– MAC ops. = B x (MULs + ADDs) / 2 = 4 x 469M / 2 = 938M MACs
– Time for a MAC operation: T = N / Fck = 0.5 ns
» Total time = MAC ops. / M x T = 938M / 8 x 0.5n = 58.6 ms
» Total energy = MAC ops. x E = 938M x 4p = 3.8 mJ
– Counting the MAC as 2 operations, the peak throughput is
2 OP/MAC x 8 MAC x 2 GHz = 32 GOP/s
– The MiniGoogLeNet operation intensity is B x 74.4 = 298 OP/Byte
» The ridge point of the roofline model should be at x ≤ 298 OP/B: in the worst case, the memory bandwidth Bw must satisfy Bw (GB/s) x 298 OP/B = 32 GOP/s, i.e. Bw = 32/298 = 0.11 GB/s
» Ex: 1 LPDDR4 chip with one-byte width running at 55 MHz is OK

Note: Energy largely underestimated (memory access cannot be ignored)


Solution (2/2)

• Answer 2:
– If 8 MAC units do the job in 58.6 ms, to do it in 40 ms (assuming perfect scaling of performance) we need ceil(8 x 58.6 / 40) = 12 MAC units
– The computing energy does not change (total number of operations
is the same): MAC ops. x E = 938M x 4p = 3.8 mJ
– The size of the parameters in DRAM is 6.3 MB, therefore the energy
is E = 640 p x (6.3M / 4) = 1 mJ

