
Embedded Electronic Systems

for Artificial Intelligence and


Machine Learning
Computing and Memory
Requirements
Mario R. Casu, Luciano Lavagno
Politecnico di Torino
Outline

• Computing the number of operations and the number of operands
  (parameters and activations) to estimate performance and energy
• Examples:
– ShallowNet
– MiniGoogLeNet
– AlexNet
• Homework assignments



Number of parameters (#params): Examples

• We need to count all trainable parameters


1. Conv2D example: 128 filters with (3,3) kernel

• Number of weights: for each (input, output) channel pair (3x128 in the example), we have one 3x3 kernel
– 3 x 3 x 3 x 128 = 3456 weights
• Bias: for each output channel, we have one bias
– 128 biases



Number of parameters (#params): Examples

2. SeparableConv2D example: same (input, output) channel pair (3x128) as the previous Conv2D example

• Number of weights and biases for depthwise
– 3 x 3 x 3 = 27 weights; 0 (bias not used)
• Number of weights and biases for pointwise
– 3 x 1 x 1 x 128 = 384; 128 biases
• In total 411/128 weights/biases (3456/128 in Conv2D)
– (3456+128)/(411+128) = 6.6x fewer #params
Number of parameters (#params): Formulas

1. Conv2D: (K,K) kernel, F filters, S stride, P padding


– Input shape (Bi,Hi,Wi,Ci), i.e. batch, height, width, channel
– Output shape (Bo,Ho,Wo,Co)
» Bo = Bi
» Co = F
» Ho = (Hi + 2 x P - K) / S + 1
» Wo = (Wi + 2 x P - K) / S + 1
– Number of Weights (w) and Biases (b)
» w = Ci x K x K x Co, b = Co
2. SeparableConv2D with same input/output shapes
– Number of Weights (w) and Biases (b)
» w = Ci x (K x K + Co), b = Co
» Weight ratio Conv2D/SeparableConv2D: Co / (1 + Co/(K x K))
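
A minimal Python sketch (ours, not part of the original deck) that cross-checks these formulas with the Ci=3, Co=128, K=3 example from the previous slides:

def conv2d_params(Ci, Co, K):
    # w = Ci x K x K x Co, plus b = Co
    return Ci * K * K * Co + Co

def separable_conv2d_params(Ci, Co, K):
    # depthwise: Ci x K x K weights (no bias); pointwise: Ci x Co weights + Co biases
    return Ci * (K * K + Co) + Co

print(conv2d_params(3, 128, 3))            # 3584 (3456 weights + 128 biases)
print(separable_conv2d_params(3, 128, 3))  # 539  (411 weights + 128 biases)
print(conv2d_params(3, 128, 3) / separable_conv2d_params(3, 128, 3))  # ~6.6x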
Number of parameters (#params): Formulas

3. Dense (used after Flatten to create a linear array):


– Input shape (Bi,Xi), i.e. batch, input array size
– Output shape (Bo,Yo), i.e. batch, output array size
» Bo = Bi
– Number of Weights (w) and Biases (b)
» w = Xi x Yo, b = Yo



Number of parameters (#params): Formulas

4. For MaxPooling2D and AveragePooling2D use the same formulas for the size of the output tensor as Conv2D and SeparableConv2D
– but no trainable params: no w, no b
5. Batch Normalization
– Input shape (Bi,Hi,Wi,Ci), i.e. batch, height, width, channel
– Output shape (Bo,Ho,Wo,Co) = (Bi,Hi,Wi,Ci)
– Parameters (p): mean, variance, gamma, beta for each Ci
» p = 4 x Ci = 4 x Co
– But only gamma and beta parameters are trainable (pt):
» pt = 2 x Ci = 2 x Co
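
The Dense and BatchNormalization counts can be sketched the same way (helper names are ours); the printed values match the Keras summaries shown later:

def dense_params(Xi, Yo):
    # w = Xi x Yo, plus b = Yo
    return Xi * Yo + Yo

def batchnorm_params(Ci):
    # gamma and beta are trainable; moving mean and variance are not
    return 2 * Ci, 2 * Ci   # (trainable, non-trainable)

print(dense_params(32768, 10))   # 327690 (ShallowNet dense layer)
print(batchnorm_params(96))      # (192, 192), i.e. 4 x 96 = 384 in total (MiniGoogLeNet)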



Number of activations (#activs)

• Simply the product of the sizes along the N dimensions of a tensor
– Ex: (B,H,W,C) = (1,32,32,64) => 1 x 32 x 32 x 64 = 65,536
• Normally #params ≫ #activs
– Params are stored in external DRAM while activations stay in
on-chip buffers (produced and immediately consumed)
– But we need to make room on-chip for parameters as well, as
we need to prefetch them from DRAM when needed
• Both for parameters and activations, to compute the
actual memory storage requirement we need to know
the actual datatype. Examples:
– FP32: 65,536 x 4B = 262144 B = 256 kB
– FP16 (BF16, INT16): 65,536 x 2B = 128 kB
– INT8: 65,536 x 1B = 64 kB
Number of operations (#OPs): Formulas

• Convolutional and Fully Connected layers dominate the total number of operations and hence the computing requirements
• Multiplications (MULs) and Additions (ADDs) dictate
computing requirements (ignore the other operations)
1. Conv2D layers
– For each of the Co x Ho x Wo elements of the output tensor:
» MULs: K x K x Ci
» ADDs: K x K x Ci (including the bias addition)
– In total 2 x Co x Ho x Wo x K x K x Ci #OPs



Number of operations (#OPs): Formulas

2. SeparableConv2D layers
– Not only do separable convolutions reduce the number of trainable parameters compared with Conv2D, they also reduce the number of operations by splitting the convolution into a sequence of depthwise and pointwise convolutions
– Depthwise: for each of the Ci x Ho x Wo elements of the
output tensor:
» MULs: K x K, ADDs: K x K-1 (no bias addition)
» Hence ~2 x Ci x Ho x Wo x K x K
– Pointwise: it’s like a Conv2D with 1x1 filters
» 2 x Co x Ho x Wo x 1 x 1 x Ci operations
– In total 2 x Ho x Wo x (K x K + Co) x Ci operations
» Ratio #OPs Conv2D/SeparableConv2D: Co / (1 + Co/(K x K))
Number of operations (#OPs): Formulas

3. Dense
– Dense (Fully Connected) layers are multiplications between a
weight matrix and a flattened tensor (i.e., a vector), plus the
addition of the bias vector
– Assume flattened input and output tensors with Xi and Yo
elements, respectively. For each y in {Yo}
» MULS: Xi
» ADDs: Xi
– In total 2 x Yo x Xi #OPs
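
The three formulas can be wrapped in small helpers (ours, not from the slides) for quick estimates; the example reproduces the conv2d_3 entry of the MiniGoogLeNet table shown later:

def conv2d_ops(Ci, Co, Ho, Wo, K):
    return 2 * Co * Ho * Wo * K * K * Ci

def separable_conv2d_ops(Ci, Co, Ho, Wo, K):
    # depthwise (~2 x Ci x Ho x Wo x K x K) + pointwise (2 x Co x Ho x Wo x Ci)
    return 2 * Ho * Wo * (K * K + Co) * Ci

def dense_ops(Xi, Yo):
    return 2 * Yo * Xi

print(conv2d_ops(Ci=96, Co=32, Ho=32, Wo=32, K=3))            # 56,623,104
print(separable_conv2d_ops(Ci=96, Co=32, Ho=32, Wo=32, K=3))  # 8,060,928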



Example: Shallownet Sequential

• To estimate computing requirements, we consider only MULs and ADDs (FLOPs) in Conv2D and Dense
• For memory requirements we need #params and #activations

# imports assumed (not shown on the slide)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, Flatten, Dense

def shallownet_sequential(width, height, depth, classes):
    # initialize the model along with the input shape to be
    # "channels last" ordering
    model = Sequential()
    i_s = (height, width, depth)
    # define the first (and only) CONV => RELU layer
    model.add(Conv2D(32, (3, 3), padding="same", input_shape=i_s))
    model.add(Activation("relu"))
    # softmax classifier
    model.add(Flatten())
    model.add(Dense(classes))
    model.add(Activation("softmax"))
    # return the constructed network architecture
    return model
Shallownet Sequential: computing
and memory requirements

• Counting only MULs and ADDs (FLOPs) in the Conv2D and Dense layers of the code above:
– Conv2D: #FLOPs = 2 x Co x Ho x Wo x K x K x Ci = 2 x 32 x 32 x 32 x 3 x 3 x 3 = 1,769,472
– Dense: #FLOPs = 2 x Yo x Xi = 2 x 10 x 32768 = 655,360
– Total #FLOPs = 2,424,832 ≈ 2.4M
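
As a quick cross-check (a sketch of ours, not in the original deck), the total follows from two lines of arithmetic:

conv_flops = 2 * 32 * 32 * 32 * 3 * 3 * 3   # Conv2D: 1,769,472
dense_flops = 2 * 10 * (32 * 32 * 32)       # Dense:    655,360
print(conv_flops + dense_flops)             # 2,424,832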
Model Summary

model = shallownet_sequential(width=32, height=32,


depth=3, classes=10)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
Total params: 328,586
Trainable params: 328,586
Non-trainable params: 0
Sanity Check

model.add(Conv2D(32,(3,3),padding="same", input_shape=i_s))

model.add(Dense(classes))
Conv2D: w = Ci x K x K x Co = 3 x 3 x 3 x 32 = 864
        b = Co = 32
        #params = 864 + 32 = 896

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================
Dense: w = Xi x Yo = 32768 x 10 = 327680
       b = Yo = 10
       #params = 327690

Total params: 328,586
Trainable params: 328,586
Non-trainable params: 0
How much memory is needed?

• Fast access to memory very often means better inference throughput
• However, semiconductor memories are either fast
(and close to the processors) or dense, but not both
• Need to reduce the size of the parameters, possibly
using small data types
• Previous example: 328,586 parameters
– FP32 (4 Bytes): (328586 x4)/2^20 = 1.25 MB
» Can’t fit in the cache of a small processor for edge applications
– INT8 (1 Byte): (328586 x1)/2^10 = 320 kB
» Can fit in the cache of a small processor for edge applications



How much energy is needed?

• The less we access the DRAM the better

• Best-case scenario: we read the params once (enough on-chip buffering for temporary storage)
– Edram(FP32) = 1.25MB / 4B x 640pJ
= 0.2 mJ
– Edram(INT8) = 320 kB / 4B x 640pJ =
0.05 mJ (4x less)
• OPs: 50% MULs 50% ADDs
– Eops(FP32) = 2.4 M x 0.5 x (3.7 +
0.9) pJ = 0.0055 mJ
– Eops(INT8) = 2.4 M x 0.5 x (0.2 +
0.03) pJ = 0.00028 mJ (20x less)

[Figure: energy per operation and per DRAM access, 45 nm technology]
What about the memory used
by the activations?

• The size of the activation tensors can have an impact


on performance
• Differently from parameters, activations are produced
and consumed almost immediately
• Ideally, activations should be stored in a local (i.e., on
chip) memory to reduce data access time and energy
– Not always possible because they can easily exceed the on-chip memory capacity
• To evaluate memory requirements, we need to
evaluate the maximum among the sizes of the
intermediate tensors (i.e., the activations)
– Unless we have a pipeline: all layers executed concurrently



Tensor size: assume batch = 1

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
activation (Activation) (None, 32, 32, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 32768) 0
_________________________________________________________________
dense (Dense) (None, 10) 327690
_________________________________________________________________
activation_1 (Activation) (None, 10) 0
=================================================================

Input tensor: 3 x 32 x 32 => 3072
Conv2D: 32 x 32 x 32 => 32768
Activation: 32 x 32 x 32 => 32768
Flatten: 32768 (just a change of memory arrangement)
Dense: 10
Activation_1: 10
(To be converted into bytes, depending on the datatype)
Shallownet Sequential: Recap

• #FLOPs: 2,424,832 = 2.3 MOP


• #PARAMS: 328,586 = 1.25 MB, if FP32
• MAX #ACTIVATIONS: 32,768 = 128 kB, if FP32
• Assume a hypothetical system with sufficient on-chip
memory to store the maximum number of activations
– DRAM accesses for #PARAMS
– FLOPS : DRAM bytes = 2.3MOP/1.25MB = 1.84 OP/B
• Ex. DDR bus @100MHz, 8 bits
– Peak bandwidth: 2x100M = 0.2 GB/s
» Computing throughput matches peak DDR bandwidth at
0.2 GB/s x 1.84 OP/B = 0.368 GOP/s
» More computing power would be underutilized (roofline model)



Roofline model

[Roofline plot: computing throughput Th (GOP/s) vs. operational intensity (OPs per DRAM byte). The sloped part (slope 0.2 GB/s) is the attainable, memory-bound throughput; the flat part at 0.4 GOP/s is the compute ceiling; anything above is unattainable. At 1.84 OP/B the max attainable throughput is 0.368 GOP/s.]

• Example of unattainable throughput
– DDR bus @100 MHz, 8 bits; processor with Fck = 100 MHz and 2 MAC units
» Peak bandwidth: 2 x 100M = 0.2 GB/s
» Peak computing throughput: 0.4 GOP/s (2 MACs = 4 OPs per cycle)
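
The attainable throughput is simply the minimum of the compute ceiling and bandwidth times operational intensity; a minimal sketch (ours) reproduces the 0.368 GOP/s figure:

def roofline_throughput(peak_gops, bw_gbs, op_per_byte):
    # attainable throughput = min(compute ceiling, bandwidth x intensity)
    return min(peak_gops, bw_gbs * op_per_byte)

print(roofline_throughput(0.4, 0.2, 1.84))   # 0.368 GOP/s: ShallowNet is memory-bound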



Shallownet Sequential: example
performance estimate

• Actual throughput: TH = 0.368 GOP/s


• Latency: #FLOPs/TH = 1000 x 2.3M / 368M = 6.25 ms
• Frame rate (fps) = 1000 / 6.25 = 160 fps
• What if we replace Conv2D with SeparableConv2D?
Layer (type) Output Shape Param #
=====================================================================
separable_conv2d (Separable Conv2D) (None, 32, 32, 32) 155
activation_2 (Activation) (None, 32, 32, 32) 0
flatten_1 (Flatten) (None, 32768) 0
dense_1 (Dense) (None, 10) 327690
activation_3 (Activation) (None, 10) 0
=====================================================================
Total params: 327,845
Trainable params: 327,845
Non-trainable params: 0

#PARAMS: 327,845 = 1.25 MB (FP32)
#FLOPs: 907,264 = 0.865 MOP
=> 0.7 OP/Byte
Roofline model

[Roofline plot: slope 0.2 GB/s, compute ceiling 0.4 GOP/s (unattainable here); at 0.7 OP/B the max attainable computing throughput is 0.14 GOP/s.]

• Actual throughput: TH = 0.14 GOP/s


• Latency: #OPs/TH = 1000 x 0.865M / 140M = 6.18 ms
• Frame rate (fps) = 1000 / 6.18 = 162 fps
– No significant performance improvement with separable
convolution compared to standard Conv2D: why?
MiniGoogLeNet
[Figure: Conv Module, Inception Module, and Downsample Module block diagrams]
Ref: https://arxiv.org/abs/1611.03530


Conv and Inception Modules

def minigooglenet_functional(width, height, depth, classes):

    def conv_module(x, K, kX, kY, stride, chanDim, padding="same"):
        # define a CONV => BN => RELU pattern
        x = Conv2D(K, (kX, kY), strides=stride, padding=padding)(x)
        x = BatchNormalization(axis=chanDim)(x)
        x = Activation("relu")(x)
        # return the block
        return x

    def inception_module(x, numK1x1, numK3x3, chanDim):
        # define two CONV modules, then concatenate across the
        # channel dimension
        conv_1x1 = conv_module(x, numK1x1, 1, 1, (1, 1), chanDim)
        conv_3x3 = conv_module(x, numK3x3, 3, 3, (1, 1), chanDim)
        x = concatenate([conv_1x1, conv_3x3], axis=chanDim)
        # return the block
        return x



Downsample Module

    def downsample_module(x, K, chanDim):
        # define the CONV module and POOL, then concatenate
        # across the channel dimensions
        conv_3x3 = conv_module(x, K, 3, 3, (2, 2), chanDim, padding="valid")
        pool = MaxPooling2D((3, 3), strides=(2, 2))(x)
        x = concatenate([conv_3x3, pool], axis=chanDim)
        # return the block
        return x



MiniGoogLeNet as a Keras functional API model

def minigooglenet_functional(width, height, depth, classes):
    . . . # previously defined functions
    # initialize the input shape to be "channels last"
    inputShape = (height, width, depth)
    chanDim = -1
    # define the model input and first CONV module
    inputs = Input(shape=inputShape)
    x = conv_module(inputs, 96, 3, 3, (1, 1), chanDim)
    # two Inception modules followed by a downsample module
    x = inception_module(x, 32, 32, chanDim)
    x = inception_module(x, 32, 48, chanDim)
    x = downsample_module(x, 80, chanDim)
    # four Inception modules followed by a downsample module
    x = inception_module(x, 112, 48, chanDim)
    x = inception_module(x, 96, 64, chanDim)
    x = inception_module(x, 80, 80, chanDim)
    x = inception_module(x, 48, 96, chanDim)
    x = downsample_module(x, 96, chanDim)
    # two Inception modules followed by global POOL and dropout
    x = inception_module(x, 176, 160, chanDim)
    x = inception_module(x, 176, 160, chanDim)
    x = AveragePooling2D((7, 7))(x)
    x = Dropout(0.5)(x)
    # softmax classifier
    x = Flatten()(x)
    x = Dense(classes)(x)
    x = Activation("softmax")(x)
    # create the model
    model = Model(inputs, x, name="minigooglenet")
    return model
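
A minimal usage sketch (ours), assuming the usual tensorflow.keras imports for the classes referenced above:

from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     Activation, MaxPooling2D, AveragePooling2D,
                                     Dropout, Flatten, Dense, concatenate)
from tensorflow.keras.models import Model

# build the CIFAR-10-sized model and print the per-layer output shapes and
# parameter counts reported in the next slides
model = minigooglenet_functional(width=32, height=32, depth=3, classes=10)
model.summary()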
MiniGoogleNet: Model Summary (1/6)
conv2d_1: w + b = (3 x 3 x 3 x 96) + 96 = 2592 + 96 = 2688
batch_normalization: mean, variance, gamma, beta: 4 x 96 = 384

Model: "minigooglenet"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 32, 32, 3)] 0
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 32, 32, 96) 2688 input_1[0][0] (output activations: 32x32x96 = 98,304)
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 32, 32, 96) 384 conv2d_1[0][0]
__________________________________________________________________________________________________
activation_2 (Activation) (None, 32, 32, 96) 0 batch_normalization[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 32, 32, 32) 3104 activation_2[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 32, 32, 32) 27680 activation_2[0][0]
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 32, 32, 32) 128 conv2d_2[0][0]
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 32, 32, 32) 128 conv2d_3[0][0]
__________________________________________________________________________________________________
activation_3 (Activation) (None, 32, 32, 32) 0 batch_normalization_1[0][0]
__________________________________________________________________________________________________
activation_4 (Activation) (None, 32, 32, 32) 0 batch_normalization_2[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate) (None, 32, 32, 64) 0 activation_3[0][0]
activation_4[0][0]
__________________________________________________________________________________________________

conv2d_2: w + b = (96 x 1 x 1 x 32) + 32 = 3072 + 32 = 3104
conv2d_3: w + b = (96 x 3 x 3 x 32) + 32 = 27648 + 32 = 27680
Model Summary (2/6)

conv2d_4 (Conv2D) (None, 32, 32, 32) 2080 concatenate_2[0][0]


__________________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 32, 32, 48) 27696 concatenate_2[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 32, 32, 32) 128 conv2d_4[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 32, 32, 48) 192 conv2d_5[0][0]
__________________________________________________________________________________________________
activation_5 (Activation) (None, 32, 32, 32) 0 batch_normalization_3[0][0]
__________________________________________________________________________________________________
activation_6 (Activation) (None, 32, 32, 48) 0 batch_normalization_4[0][0]
__________________________________________________________________________________________________
concatenate_3 (Concatenate) (None, 32, 32, 80) 0 activation_5[0][0]
activation_6[0][0]
__________________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 15, 15, 80) 57680 concatenate_3[0][0]
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 15, 15, 80) 320 conv2d_6[0][0]
__________________________________________________________________________________________________
activation_7 (Activation) (None, 15, 15, 80) 0 batch_normalization_5[0][0]
__________________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 15, 15, 80) 0 concatenate_3[0][0]
__________________________________________________________________________________________________
concatenate_4 (Concatenate) (None, 15, 15, 160) 0 activation_7[0][0]
max_pooling2d_3[0][0]
__________________________________________________________________________________________________

conv2d_6 and max_pooling2d_3: Ho = (Hi + 2 x P - K) / S + 1 = (32 + 2 x 0 - 3) / 2 + 1 = 14 + 1 = 15
Model Summary (3/6)

conv2d_7 (Conv2D) (None, 15, 15, 112) 18032 concatenate_4[0][0]


__________________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 15, 15, 48) 69168 concatenate_4[0][0]
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 15, 15, 112) 448 conv2d_7[0][0]
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 15, 15, 48) 192 conv2d_8[0][0]
__________________________________________________________________________________________________
activation_8 (Activation) (None, 15, 15, 112) 0 batch_normalization_6[0][0]
__________________________________________________________________________________________________
activation_9 (Activation) (None, 15, 15, 48) 0 batch_normalization_7[0][0]
__________________________________________________________________________________________________
concatenate_5 (Concatenate) (None, 15, 15, 160) 0 activation_8[0][0]
activation_9[0][0]
__________________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 15, 15, 96) 15456 concatenate_5[0][0]
__________________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 15, 15, 64) 92224 concatenate_5[0][0]
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 15, 15, 96) 384 conv2d_9[0][0]
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 15, 15, 64) 256 conv2d_10[0][0]
__________________________________________________________________________________________________
activation_10 (Activation) (None, 15, 15, 96) 0 batch_normalization_8[0][0]
__________________________________________________________________________________________________
activation_11 (Activation) (None, 15, 15, 64) 0 batch_normalization_9[0][0]
__________________________________________________________________________________________________
concatenate_6 (Concatenate) (None, 15, 15, 160) 0 activation_10[0][0]
activation_11[0][0]
__________________________________________________________________________________________________
“concatenate” adds the size along the output channel dimension
Model Summary (4/6)

conv2d_11 (Conv2D) (None, 15, 15, 80) 12880 concatenate_6[0][0]


__________________________________________________________________________________________________
conv2d_12 (Conv2D) (None, 15, 15, 80) 115280 concatenate_6[0][0]
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 15, 15, 80) 320 conv2d_11[0][0]
__________________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 15, 15, 80) 320 conv2d_12[0][0]
__________________________________________________________________________________________________
activation_12 (Activation) (None, 15, 15, 80) 0 batch_normalization_10[0][0]
__________________________________________________________________________________________________
activation_13 (Activation) (None, 15, 15, 80) 0 batch_normalization_11[0][0]
__________________________________________________________________________________________________
concatenate_7 (Concatenate) (None, 15, 15, 160) 0 activation_12[0][0]
activation_13[0][0]
__________________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 15, 15, 48) 7728 concatenate_7[0][0]
__________________________________________________________________________________________________
conv2d_14 (Conv2D) (None, 15, 15, 96) 138336 concatenate_7[0][0]
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 15, 15, 48) 192 conv2d_13[0][0]
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 15, 15, 96) 384 conv2d_14[0][0]
__________________________________________________________________________________________________
activation_14 (Activation) (None, 15, 15, 48) 0 batch_normalization_12[0][0]
__________________________________________________________________________________________________
activation_15 (Activation) (None, 15, 15, 96) 0 batch_normalization_13[0][0]
__________________________________________________________________________________________________
concatenate_8 (Concatenate) (None, 15, 15, 144) 0 activation_14[0][0]
activation_15[0][0]
__________________________________________________________________________________________________
Model Summary (5/6)

conv2d_15 (Conv2D) (None, 7, 7, 96) 124512 concatenate_8[0][0]


__________________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 7, 7, 96) 384 conv2d_15[0][0]
__________________________________________________________________________________________________
activation_16 (Activation) (None, 7, 7, 96) 0 batch_normalization_14[0][0]
__________________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 7, 7, 144) 0 concatenate_8[0][0]
__________________________________________________________________________________________________
concatenate_9 (Concatenate) (None, 7, 7, 240) 0 activation_16[0][0]
max_pooling2d_4[0][0]
__________________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 7, 7, 176) 42416 concatenate_9[0][0]
__________________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 7, 7, 160) 345760 concatenate_9[0][0]
__________________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 7, 7, 176) 704 conv2d_16[0][0]
__________________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 7, 7, 160) 640 conv2d_17[0][0]
__________________________________________________________________________________________________
activation_17 (Activation) (None, 7, 7, 176) 0 batch_normalization_15[0][0]
__________________________________________________________________________________________________
activation_18 (Activation) (None, 7, 7, 160) 0 batch_normalization_16[0][0]
__________________________________________________________________________________________________
concatenate_10 (Concatenate) (None, 7, 7, 336) 0 activation_17[0][0]
activation_18[0][0]
__________________________________________________________________________________________________



Model Summary (6/6)

dense_1: w + b = Xi x Yo + Yo = 336 x 10 + 10 = 3370

conv2d_18 (Conv2D) (None, 7, 7, 176) 59312 concatenate_10[0][0]


__________________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 7, 7, 160) 484000 concatenate_10[0][0]
__________________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 7, 7, 176) 704 conv2d_18[0][0]
__________________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 7, 7, 160) 640 conv2d_19[0][0]
__________________________________________________________________________________________________
activation_19 (Activation) (None, 7, 7, 176) 0 batch_normalization_17[0][0]
__________________________________________________________________________________________________
activation_20 (Activation) (None, 7, 7, 160) 0 batch_normalization_18[0][0]
__________________________________________________________________________________________________
concatenate_11 (Concatenate) (None, 7, 7, 336) 0 activation_19[0][0]
activation_20[0][0]
__________________________________________________________________________________________________
average_pooling2d (AveragePooli (None, 1, 1, 336) 0 concatenate_11[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 1, 1, 336) 0 average_pooling2d[0][0]
__________________________________________________________________________________________________
flatten_1 (Flatten) (None, 336) 0 dropout[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 10) 3370 flatten_1[0][0]
__________________________________________________________________________________________________
activation_21 (Activation) (None, 10) 0 dense_1[0][0]
==================================================================================================
Total params: 1,656,250
Trainable params: 1,652,826
Non-trainable params: 3,424
__________________________________________________________________________________________________

FP32 (4 Bytes): (1,652,826 x 4) / 2^20 = 6.3 MB
Mean and variance in batch normalization aren’t trainable! Only gamma and beta are.
MiniGoogleNet: #OPs
CONV K Ci Co Ho Wo #OPs
conv2d_1 3 3 96 32 32 5,308,416
conv2d_2 1 96 32 32 32 6,291,456
conv2d_3 3 96 32 32 32 56,623,104
conv2d_4 1 64 32 32 32 4,194,304
conv2d_5 3 64 48 32 32 56,623,104
conv2d_6 3 80 80 15 15 25,920,000
conv2d_7 1 160 112 15 15 8,064,000
conv2d_8 3 160 48 15 15 31,104,000
conv2d_9 1 160 96 15 15 6,912,000
conv2d_10 3 160 64 15 15 41,472,000
conv2d_11 1 160 80 15 15 5,760,000
conv2d_12 3 160 80 15 15 51,840,000
conv2d_13 1 160 48 15 15 3,456,000
conv2d_14 3 160 96 15 15 62,208,000
conv2d_15 3 144 96 7 7 12,192,768
conv2d_16 1 240 176 7 7 4,139,520
conv2d_17 3 240 160 7 7 33,868,800
conv2d_18 1 336 176 7 7 5,795,328
conv2d_19 3 336 160 7 7 47,416,320
DENSE Xi Yo #OPs
dense_1 336 10 6,720
TOTAL 469,195,840



Comments on MiniGoogLeNet

• Activation tensors can be stored in an on-chip memory of an


edge device with sufficient memory
– 98,304 x 4B = 393,216 B = 384 kB
• Params (weights and biases) can’t be stored
– 6.3 MB exceeds typical on-chip memory size
– Need to fetch params from the external DRAM at each batch
» DRAM energy: 6.3 MB / 4B x 640 pJ = 1 mJ
• Number of operations: 469 MOP (FP32)
– One CPU with one single FPU running at 1 GHz would take ~0.5 s
to run inference (2 fps)
– For 60 fps we would need 30 FPUs working in parallel…
» …as long as we manage to perfectly parallelize the execution and avoid
the curse of the memory wall…
» OPs energy: 469 M x 0.5 x (3.7 + 0.9) pJ ≈ 1.1 mJ
» Unlike ShallowNet, the OPs energy is comparable to the DRAM energy
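
The same estimate can be scripted (a sketch of ours, using the 45 nm figures quoted earlier: 3.7 pJ per FP32 MUL, 0.9 pJ per FP32 ADD, 640 pJ per 32-bit DRAM access):

def dram_energy_mj(param_bytes, pj_per_access=640):
    # one 4-byte DRAM access per FP32 parameter, read once per inference
    return param_bytes / 4 * pj_per_access * 1e-9

def ops_energy_mj(num_ops, pj_mul=3.7, pj_add=0.9):
    # half of the ops are MULs and half are ADDs
    return num_ops * 0.5 * (pj_mul + pj_add) * 1e-9

print(dram_energy_mj(6.3 * 2**20))  # ~1.0 mJ (MiniGoogLeNet params, FP32)
print(ops_energy_mj(469e6))         # ~1.1 mJ (MiniGoogLeNet FP32 MULs and ADDs)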



Comments on MiniGoogLeNet

• What if we replace all 3x3 Conv2D with SeparableConv2D?


• Before:
=======================================================
Total params: 1,656,250
Trainable params: 1,652,826
Non-trainable params: 3,424
_______________________________________________________
• After:
=======================================================
Total params: 353,786
Trainable params: 350,362
Non-trainable params: 3,424
_______________________________________________________
• 4.7x reduction; if using FP32, from 6.3 MB to 1.3 MB
• Possible on-chip storage, but even with off-chip it can be very
effective for performance and energy improvement
– 4.7x lower DRAM energy
– 4.7x lower latency if memory-bound
Comments on MiniGoogLeNet

CONV K Ci Co Ho Wo #OPs Conv2D #OPs Separable ratio


conv2d_1 3 3 96 32 32 5,308,416 645,120 8.2
conv2d_2 1 96 32 32 32 6,291,456 6,291,456 1.0
conv2d_3 3 96 32 32 32 56,623,104 8,060,928 7.0
conv2d_4 1 64 32 32 32 4,194,304 4,194,304 1.0
conv2d_5 3 64 48 32 32 56,623,104 7,471,104 7.6
conv2d_6 3 80 80 15 15 25,920,000 3,204,000 8.1
conv2d_7 1 160 112 15 15 8,064,000 8,064,000 1.0
conv2d_8 3 160 48 15 15 31,104,000 4,104,000 7.6
conv2d_9 1 160 96 15 15 6,912,000 6,912,000 1.0
conv2d_10 3 160 64 15 15 41,472,000 5,256,000 7.9
conv2d_11 1 160 80 15 15 5,760,000 5,760,000 1.0
conv2d_12 3 160 80 15 15 51,840,000 6,408,000 8.1
conv2d_13 1 160 48 15 15 3,456,000 3,456,000 1.0
conv2d_14 3 160 96 15 15 62,208,000 7,560,000 8.2
conv2d_15 3 144 96 7 7 12,192,768 1,481,760 8.2
conv2d_16 1 240 176 7 7 4,139,520 4,139,520 1.0
conv2d_17 3 240 160 7 7 33,868,800 3,974,880 8.5
conv2d_18 1 336 176 7 7 5,795,328 5,795,328 1.0
conv2d_19 3 336 160 7 7 47,416,320 5,564,832 8.5
DENSE Xi Yo
dense_1 336 10 6,720 6,720 1.0
TOTAL 469,195,840 98,349,952 4.8

4.8x lower OPs energy, 4.8x lower latency if compute-bound
Exercise 1

• Evaluate the performance of MiniGoogleNet and


Shallownet (latency, fps) on an Intel Movidius
“Myriad X” System-on-Chip (SoC) with the following
characteristics:
– 16 SHAVE processors, each with one 128-bit vector unit
– Clock frequency: 700 MHz
– On-die memory 2.5 MB, 450 GB/s access bandwidth
– DDR4 DRAM 4 Gbit @1600 MHz, 32 bit



Solution (1/2)

• Each vector unit can run 128/32 = 4 OP / cycle (FP32 floating point)
– Peak throughput: 16 x 4 OP/cycle x 0.7 GHz = 44.8 GOP/s
• On-chip memory big enough to store activations and fast enough to
sustain the peak throughput, but params need to be accessed from
DDR
– DDR Bandwidth: 2 x 1.6 GHz x 4B = 12.8 GB/s
• OP:DRAM Byte for MiniGoogLeNet: 469M/6.3M = 74.4 OP/B
• OP:DRAM Byte for ShallowNet: 2.3M/1.25M = 1.84 OP/B

[Roofline plot for the Myriad X: slope 12.8 GB/s, compute ceiling 44.8 GOP/s, ridge point at 3.5 OP/B. MiniGoogLeNet (74.4 OP/B) sits on the 44.8 GOP/s ceiling; ShallowNet (1.84 OP/B) is memory-bound at 23.6 GOP/s.]
Solution (2/2)

• The Movidius SoC can run MiniGoogleNet at full


speed (if the code is perfectly parallelizable)
– Latency: #OPs / Th = 469 MOP / 44.8 GOP/s = 10.5 ms
– Frames per second: 1 / 0.0105 = 95 fps
• The Movidius SoC cannot run ShallowNet at full
speed due to the memory wall, yet the speed is very
high:
– Latency: #OPs / Th = 2.3 MOP / 23.6 GOP/s = 0.1 ms
– Frames per second: 1 / 0.1ms = 10 kfps
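
A compact sketch of the whole estimate (ours, not in the original deck), combining the roofline bound with the #OPs and operational intensities computed above:

PEAK_GOPS = 16 * 4 * 0.7   # 44.8 GOP/s: 16 SHAVEs x 4 FP32 OP/cycle x 0.7 GHz
DDR_GBS = 12.8             # 2 x 1.6 GHz x 4 B

for name, mop, op_per_byte in [("MiniGoogLeNet", 469, 74.4),
                               ("ShallowNet", 2.3, 1.84)]:
    th = min(PEAK_GOPS, DDR_GBS * op_per_byte)   # attainable GOP/s
    latency_ms = mop / th                        # MOP / (GOP/s) = ms
    print(f"{name}: {th:.1f} GOP/s, {latency_ms:.2f} ms, {1000 / latency_ms:.0f} fps")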



Homework 1: Separable Conv2D

• Re-evaluate the performance of MiniGoogleNet and


Shallownet (latency, fps) on the same device of
Exercise 1 assuming that 2D standard convolutions
with K>1 are replaced with separable convolutions
• Evaluate the energy cost before and after the
replacement



Exercise 2

• Compute the number of weights and activations of the


famous AlexNet DNN
• Evaluate performance and energy on the Myriad X;
activations and parameters use the BF16 datatype
– For energy assume the same energy figures used before
(real energy details unknown for the Myriad X SoC)



Solution
AlexNet: 5 Conv2D layers

Input (Hi,Wi,Ci) -> Output (Ho,Wo,Co) for each Conv2D layer:
– Conv1: (227,227,3) -> (55,55,96)
– Conv2: (27,27,96) -> (27,27,256)
– Conv3: (13,13,256) -> (13,13,384)
– Conv4: (13,13,384) -> (13,13,384)
– Conv5: (13,13,384) -> (13,13,256)



Solution
AlexNet: MaxPooling2D layers

Input (Hi,Wi,Ci) -> Output (Ho,Wo,Co) for each MaxPooling2D layer:
– MaxPool1: (55,55,96) -> (27,27,96)
– MaxPool2: (27,27,256) -> (13,13,256)
– MaxPool3: (13,13,256) -> (6,6,256)



Solution
AlexNet: Activations and Parameters
Activations
• Input: 227x227x3 = 154,587
• Conv1: 55x55x96 = 290,400
• MaxPool1: 27x27x96 = 69,984
• Conv2: 27x27x256 = 186,624
• MaxPool2: 13x13x256 = 43,264
• Conv3: 13x13x384 = 64,896
• Conv4: 13x13x384 = 64,896
• Conv5: 13x13x256 = 43,264
• MaxPool3: 6x6x256 = 9,216
• FC1: 4,096
• FC2: 4,096
• FC3: 10
• TOTAL = 935,333; in MB (BF16): (935,333 x 2) / 2^20 = 1.78 MB
• MAX = 290,400; in MB (BF16): (290,400 x 2) / 2^20 = 0.55 MB

Weights and Biases
• Conv1: 3x11x11x96 + 96 = 34,944
• Conv2: 96x5x5x256 + 256 = 614,656
• Conv3: 256x3x3x384 + 384 = 885,120
• Conv4: 384x3x3x384 + 384 = 1,327,488
• Conv5: 384x3x3x256 + 256 = 884,992
• FC1: 9216x4096 + 4096 = 37,752,832
• FC2: 4096x4096 + 4096 = 16,781,312
• FC3: 4096x10 + 10 = 40,970
• TOTAL = 58,322,314; in MB (BF16): (58,322,314 x 2) / 2^20 = 111 MB
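
The parameter total can be recomputed from the layer shapes with a short script (ours, layer list derived from the slide above):

conv = [  # (Ci, K, Co) per Conv2D layer
    (3, 11, 96), (96, 5, 256), (256, 3, 384), (384, 3, 384), (384, 3, 256)]
fc = [(9216, 4096), (4096, 4096), (4096, 10)]  # (Xi, Yo) per Dense layer

params = sum(Ci * K * K * Co + Co for Ci, K, Co in conv) \
       + sum(Xi * Yo + Yo for Xi, Yo in fc)
print(params)              # 58,322,314
print(params * 2 / 2**20)  # ~111 MB in BF16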
Solution
AlexNet: Operations

• Conv2D: 2 x Co x Ho x Wo x K x K x Ci operations
– Conv1: 2 x 96 x 55 x 55 x 11 x 11 x 3 = 210,830,400
– Conv2: 2 x 256 x 27 x 27 x 5 x 5 x 96 = 895,795,200
– Conv3: 2 x 384 x 13 x 13 x 3 x 3 x 256 = 299,040,768
– Conv4: 2 x 384 x 13 x 13 x 3 x 3 x 384 = 448,561,152
– Conv5: 2 x 256 x 13 x 13 x 3 x 3 x 384 = 299,040,768
– Total Conv operations: 2,153,268,288
• FC: 2 x Y x X operations
– FC1: 2 x 4096 x 9216 = 75,497,472
– FC2: 2 x 4096 x 4096 = 33,554,432
– FC3: 2 x 10 x 4096 = 81,920
– Total FC operations = 109,133,824
• Total Conv2D + FC operations: 2,262,402,112
Solution
AlexNet: Performance

• Each vector unit can run 128/16 = 8 BF16 OP / cycle


– Peak throughput: 16 x 8 OP/cycle x 0.7 GHz = 89.6 GOP/s
• On-chip memory big enough to store activations (2.5 MB) and
fast enough to sustain the peak throughput (450 GB/s), but
params need to be accessed from DDR
– DDR Bandwidth: 2 x 1.6 GHz x 4B = 12.8 GB/s
• OP/ DRAM Byte for AlexNet: 2262M/111M = 20.4 OP/B
• Latency: 2262 MOP / 89.6 GOP/s = 25 ms (40 fps)
• DRAM energy: 111 MB / 4B x 640 pJ = 18 mJ
• OPs energy: 2262M x 0.5 x 4 pJ = 4.5 mJ

[Roofline plot: slope 12.8 GB/s, compute ceiling 89.6 GOP/s, ridge point at 7 OP/B; AlexNet at 20.4 OP/B is compute-bound.]
Homework 2: AlexNet in Keras

• Use Google Colab and Keras to build AlexNet using


the sequential API method
• Evaluate the number of activations and parameters
using model.summary() and check against paper and
pencil calculations
• Check if there can be any advantage in terms of
performance if standard convolutions were replaced
with separable convolutions (assume to use the same
Myriad X device used for Exercise 2)



Homework 3: Exercise

• Assume a processor with M = 8 MAC units working at


Fck = 2 GHz clock frequency
• Assume each MAC takes N = 1 clock cycle and E = 4 pJ
• Assume that the memory is not a bottleneck
1. Compute the latency and the energy (only MAC ops) to
process a mini batch of size B = 4 images using AlexNet
2. Determine the memory bandwidth needed to support the
maximum throughput



Solution

• Answer 1:
– MAC ops. = B x (MULs + ADDs) / 2 = 4 x 2,262,402,112 / 2 =
4,524,804,224
– Time for a MAC operation: T = N / Fck = 0.5 ns
» Total time = MAC ops. / M x T = 0.28 s
» Total energy = MAC ops. x E = 18 mJ
• Answer 2:
– Counting the MAC as 2 operations, the peak throughput is
2 OP/MAC x 8 MAC x 2 GHz = 32 GOP/s
– The AlexNet operation intensity is B x 20.4 = 81.6 OP/Byte
» The ridge point of the roofline model should be at x ≤ 81.6 OP/B: in the worst case, the memory bandwidth Bw must satisfy Bw (GB/s) x 81.6 OP/B = 32 GOP/s, i.e. Bw = 32/81.6 = 0.4 GB/s
» Ex: 1 LPDDR4 chip with one-byte width running at 200 MHz is OK
» Ex: 1 LPDDR4 chip with one-byte width running at 200 MHz is OK
Note: Energy largely underestimated (memory access cannot be ignored)
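
For reference, the arithmetic of both answers in a few lines of Python (ours, with the stated assumptions: M MAC units, N cycles and E pJ per MAC, batch B, memory not a bottleneck):

def mac_latency_energy(total_ops, B=4, M=8, N=1, Fck=2e9, E_pj=4):
    macs = B * total_ops / 2              # one MAC = one MUL + one ADD
    latency_s = macs / M * (N / Fck)      # perfect parallelization over M units
    energy_mj = macs * E_pj * 1e-9
    return latency_s, energy_mj

print(mac_latency_energy(2_262_402_112))  # AlexNet: ~0.28 s, ~18 mJ
print(2 * 8 * 2 / (4 * 20.4))             # required DRAM bandwidth: ~0.39 GB/s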
Homework 4: Exercise

1. Repeat the previous exercise for the case of


MiniGoogleNet
2. Evaluate the parallelism (M) needed to obtain a
latency of 40 ms and the consequent energy for both
MAC operations and DRAM access (640 pJ for each
32-bit DRAM access)



Solution (1/2)

• Answer 1:
– MAC ops. = B x (MULs + ADDs) / 2 = 4 x 469M / 2 = 938M MACs
– Time for a MAC operation: T = N / Fck = 0.5 ns
» Total time = MAC ops. / M x T = 938M / 8 x 0.5n = 58.6 ms
» Total energy = MAC ops. x E = 938M x 4p = 3.8 mJ
– Counting the MAC as 2 operations, the peak throughput is
2 OP/MAC x 8 MAC x 2 GHz = 32 GOP/s
– The MiniGoogLeNet operation intensity is B x 74.4 = 298 OP/Byte
» The ridge point of the roofline model should be at x ≤ 298 OP/B: in the worst case, the memory bandwidth Bw must satisfy Bw (GB/s) x 298 OP/B = 32 GOP/s, i.e. Bw = 32/298 = 0.11 GB/s
» Ex: 1 LPDDR4 chip with one-byte width running at 55 MHz is OK

Note: Energy largely underestimated (memory access cannot be ignored)


Solution (2/2)

• Answer 2:
– If 8 MAC units do the job in 58.6 ms, to do it in 40 ms (assuming perfect scaling of performance) we need ceil(8 x 58.6 / 40) = 12 MAC units
– The computing energy does not change (total number of operations
is the same): MAC ops. x E = 938M x 4p = 3.8 mJ
– The size of the parameters in DRAM is 6.3 MB, therefore the energy
is E = 640 p x (6.3M / 4) = 1 mJ

