
Summary of concepts in Deep Learning

Deep Learning is an emerging field which needs no introduction. The aim of this article is to collaboratively learn various concepts in Deep Learning in a concise
manner. If you feel something can be added or updated, please add a comment. I will keep adding new material to this article as well.

How to use this article?


Deep Learning has many parameters, hyperparameters and concepts. This article aims to give a quick refresher on some core topics, especially to those who are
new to this field. Some use cases I can think of,
Whenever you encounter a term in Deep Learning while surfing the web and cannot remember what it means, come here and do a quick find.
If you are preparing for an interview for an ML role and want to revise quickly, this is a good place, as being concise and complete was my primary motto.

Credits
A big thank you to the DeepLearning.ai team and their Deep Learning specialization on Coursera. All the material here, including notations, concepts and some diagrams,
is a heavily shortened form of their excellent 5-course series.

Notation used throughout


Refer to this section for any variables used in the article.

notation        description
m               number of training examples
n_x             number of features per training example
X               input matrix where each column is a training example
Y               output matrix where each column is the corresponding label of the training example in X, i.e. Y[0] is the label for X[0], the 1st training example
Ŷ               predicted labels for new test inputs
Z               linear transformation of X
A               non-linear transformation of Z, the result of an activation function
W               weights matrix for each feature in X
x               features of one training example
y               output label of one training example
ŷ               predicted output label of one training example
z               linear transformation of x
a               non-linear transformation of z, the result of an activation function
w               weights matrix for x
b               bias matrix
σ               sigmoid function, σ(z) = 1 / (1 + e^(−z)), whose output lies in (0, 1) for any value of z
x_j^(i)         value of feature j in the ith training example
w_j^(i)[k]      value of the weight for the jth hidden unit in the kth layer of the ith training example
L               total number of layers in a deep neural net (excluding the input layer)
n^[l]           number of units in hidden layer l
X^{t}, Y^{t}    X and Y values for the tth mini-batch in mini-batch gradient descent
J               cost function of the model considering all training examples
C               number of classes in a multi-class classifier

Logistic Regression, the building block of Deep Neural Nets


It is a linear model for classification. The goal of the model is to predict probabilities of output labels for a given input.

z = w^T x + b
a = σ(z)
ŷ = a

Cross entropy loss for finding out how good the prediction is for a single training example,

L(ŷ, y) = −( y log(ŷ) + (1 − y) log(1 − ŷ) )

Cost function for all examples,

J = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i))

This J is used by an optimization algorithm (like gradient descent) to find optimal values for w and b.
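Here is a minimal numpy sketch of the forward pass and cost described above. The variable names follow the notation table; the data, seed and zero initialization are placeholders for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: m = 4 training examples with n_x = 3 features each.
# X has one column per training example, as in the notation above.
np.random.seed(0)
m, n_x = 4, 3
X = np.random.randn(n_x, m)
Y = np.array([[0, 1, 1, 0]])        # labels, shape (1, m)

w = np.zeros((n_x, 1))              # weights, shape (n_x, 1)
b = 0.0                             # bias

Z = w.T @ X + b                     # z = w^T x + b for every column at once
A = sigmoid(Z)                      # y_hat = a = sigma(z)

# Cross entropy cost J over all m examples
J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(J)
```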
Shallow Neural Nets
In logistic regression, z and a are computed to obtain the prediction for each training example. In a shallow neural net, this process is repeated twice before predicting
the output label. In logistic regression,

whereas in a shallow net,

[1] and [2] are layers in the network. Layer [1] is a hidden layer as it is neither the input nor the output. Layer [1] has three (hidden) units / neurons and layer [2] has one
unit. The prediction for a training example x is computed as follows in a shallow neural net,

z^[1] = w^[1] x + b^[1]
a^[1] = σ(z^[1])
z^[2] = w^[2] a^[1] + b^[2]
ŷ = a^[2] = σ(z^[2])

This process is extended to all training examples to obtain Z^[1], Z^[2], A^[1], A^[2], Ŷ. If this process is extended to more than 2 hidden layers, it is called a deep
neural net!
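A minimal sketch of this two-layer forward pass for a single example x, assuming 2 input features, 3 hidden units and 1 output unit with random placeholder weights:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_1, n_2 = 2, 3, 1             # input features, hidden units, output units
np.random.seed(1)
x = np.random.randn(n_x, 1)         # one training example

W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

z1 = W1 @ x + b1                    # z[1] = w[1] x + b[1]
a1 = sigmoid(z1)                    # a[1] = sigma(z[1])
z2 = W2 @ a1 + b2                   # z[2] = w[2] a[1] + b[2]
y_hat = sigmoid(z2)                 # y_hat = a[2] = sigma(z[2])
```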

Activation functions
sigmoid, σ(z) = 1 / (1 + e^(−z))
    σ(z) lies in between (0, 1)
    generally used for binary classification tasks in the last layer
tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
    tanh(z) lies in between (−1, 1)
    the graph is centered at 0, unlike sigmoid
ReLU(z) = max(0, z)
    both sigmoid and tanh slow down learning when z is too small or too high
    the neural net learns much faster when compared to sigmoid or tanh
    generally used in the hidden layers
Leaky ReLU(z) = max(0.01z, z)
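The four activation functions above as numpy one-liners (a sketch; deep learning frameworks ship their own implementations):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))          # output in (0, 1)

def tanh(z):
    return np.tanh(z)                    # output in (-1, 1), centered at 0

def relu(z):
    return np.maximum(0, z)              # max(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)      # max(0.01z, z)
```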

Deep Neural Nets


Simply put, it is a neural network with multiple hidden layers. The number of layers L and number of units in each layer are hyperparameters decided before
training.

Figure 1: A 4 layer, fully connected deep neural network


The above network has L = 4, n^[1] = 3, n^[2] = 4, n^[3] = 3 and n^[4] = 1. Ŷ = A^[L] is the result for all training examples. X = A^[0]. A^[1] is computed as,

Z^[1] = W^[1] A^[0] + b^[1]
A^[1] = g^[1](Z^[1])

Similarly, the process is repeated for layers [2], [3] and [4],

Ŷ = A^[L=4] = g^[4](Z^[4])

Here g^[l] is the activation function used in layer l. When implemented with numpy vectors, all computations are parallelized across training examples; this is called
a vectorized implementation. Without vectorization, the neural net has to loop over the training examples one by one to complete one epoch of training, which slows
down learning.
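A minimal sketch of vectorized forward propagation for an L-layer net, assuming ReLU in the hidden layers, sigmoid in the last layer, and parameters stored in a dictionary keyed W1, b1, …; the layer sizes and random data are illustrative.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def init_params(layer_dims):
    # layer_dims = [n_x, n[1], ..., n[L]]
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

def forward(X, params, L):
    A = X                                           # A[0] = X
    for l in range(1, L + 1):
        Z = params[f"W{l}"] @ A + params[f"b{l}"]   # Z[l] = W[l] A[l-1] + b[l]
        A = sigmoid(Z) if l == L else relu(Z)       # A[l] = g[l](Z[l])
    return A                                        # A[L] = Y_hat

# Example: a 4-layer net matching Figure 1 (n[1]=3, n[2]=4, n[3]=3, n[4]=1)
np.random.seed(0)
X = np.random.randn(5, 10)                          # n_x = 5 features, m = 10 examples
params = init_params([5, 3, 4, 3, 1])
Y_hat = forward(X, params, L=4)                     # shape (1, 10)
```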
Each training example x^(i) is passed through the net to obtain the prediction ŷ^(i) from the last layer. This step is called forward propagation.
ŷ^(i) is compared with y^(i) using J to obtain the error in prediction. This error is passed back from layer [L] to [L − 1] to [L − 2] and so on to [1] to adjust W^[l],
b^[l] at each layer so that the next prediction causes a smaller error. This step of passing back the error is called back propagation. Every time
the error is passed back, the amount of change the system makes to the parameters W^[l], b^[l] is governed by a hyperparameter called the learning rate, α.

Dimensionality checks
These formulae can help debug the dimensions of various matrices while implementing deep neural nets,

W^[l].shape = (n^[l], n^[l−1])
b^[l].shape = (n^[l], 1)
A^[l].shape = Z^[l].shape = (n^[l], m)
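Those checks translated into a few asserts, assuming the same W1, b1, … dictionary layout as in the earlier forward propagation sketch:

```python
def check_shapes(params, layer_dims):
    # layer_dims = [n_x, n[1], ..., n[L]]
    L = len(layer_dims) - 1
    for l in range(1, L + 1):
        assert params[f"W{l}"].shape == (layer_dims[l], layer_dims[l - 1])
        assert params[f"b{l}"].shape == (layer_dims[l], 1)
    # A[l] and Z[l] produced during forward propagation should both be (n[l], m)
```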

Hyperparameters to choose
W^[l], b^[l] are parameters of the neural net and are learned during the training phase. Hyperparameters are manually set by the developer before training.
learning rate alpha, α – the rate at which parameters are updated to bring the predictions closer to the actual values
number of epochs – after training with the entire training data once, one epoch is completed. This parameter controls how many times this should be repeated.
hidden layers, L – how many hidden layers in the Deep Neural Net (DNN)
hidden units per layer – values for n^[1], n^[2], n^[3], …, n^[L]
activation functions – activation function to use in each layer, g^[1], g^[2], g^[3], …, g^[L]

Optimizing Deep Neural Networks

Data splitting – All data from same distribution


All the available labelled data is split into,

Train data – the majority of the data, used for training
Dev data – also called the validation set / data, used for validating the model and hyperparameter tuning
Test data – used for validating the final chosen model

Error Types
As shown in Figure 2, a DNN has a train error and a dev error besides the test error.
Avoidable bias – the difference between human error (often used as the benchmark) and the training error. Possible solutions to reduce this are:
Train a bigger network (increase L or n^[l])
Increase the number of epochs
Change the network architecture
Variance – the difference between the training error and the dev error. This happens due to overfitting to the training data. Possible solutions to reduce this are:
Train on more data
Regularization
Change the network architecture
Figure 2: Range of each error

Data Splitting – Data from different distributions


Ideally train, dev and test sets should be from the same data distribution for best results. But sometimes enough data might not be available for performing a
deep learning experiment. For example, when creating a DNN to classify 100 pictures of your 2 cats, training on cat pictures from the internet and testing on your 100 cat
pictures may not yield good results as the data distributions are different. In such situations,

split the available 100 cat pictures 50-50. Mix the Train (50) pictures with the internet pictures like so

As the train and dev data come from different distributions, comparing the training and dev errors does not clarify whether the gap is due to high variance or due to data mismatch.
Hence, after mixing in your 50 cat pictures, the train data is split into train and training-dev sets.

Now, as the train and training-dev sets are from the same distribution, the root cause of the problem can be identified as either bias, variance or data mismatch.
Figure 3: Range of errors when not all data is from the same distribution
As shown in Figure 3, since the training-dev set and dev set are from different data distributions, the difference between their errors is due to data mismatch.

Regularization
When the neural net overfits (high variance) to the training data, predictions on the unseen dev set can be poor. Regularization reduces the impact of (various)
neurons in the model so that it can generalize better to unseen inputs. lambda λ is the hyperparameter which controls the amount of regularization used in the L1 and
L2 algorithms. Here are some algorithms / ideas for regularization,
L1 – uses the L1-norm to penalize W's
L2 – uses the L2-norm to penalize W's
Dropout – randomly zeros (drops) some neurons from the network, thus making it simpler and able to generalize better. keep_prob is the hyperparameter which is
the probability of retaining a neuron. Different layers can have different values of keep_prob based on the density of connections (see the sketch after this list)
Data augmentation – transform, randomly crop and translate input training images
Early stopping – after every epoch compute the dev error and, once it starts increasing, stop the training even though the training error continues to decrease (a sign of
overfitting)
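A minimal sketch of (inverted) dropout applied to one layer's activations, with keep_prob as described above; the activations here are random placeholders.

```python
import numpy as np

def dropout_forward(A, keep_prob=0.7):
    # Randomly zero out neurons, then scale so the expected activation is unchanged
    mask = np.random.rand(*A.shape) < keep_prob
    return (A * mask) / keep_prob, mask

np.random.seed(0)
A1 = np.random.randn(3, 5)                       # activations of a hidden layer, shape (n[1], m)
A1_dropped, mask1 = dropout_forward(A1, keep_prob=0.7)
# At test time dropout is switched off: no mask, no scaling.
```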

Normalization
Normalize input features with varying ranges to learn faster. Normalizing sets μ = 0 and σ² = 1 for all training examples.
Batch normalization – the idea of normalizing inputs is extended to all layers. Z^[l] is normalized before applying the activation function (see the sketch after this block). The flow of
parameters would then be,

X --(W^[1], b^[1])--> Z^[1] --(β^[1], γ^[1])--> Z̃^[1] → a^[1] = g(Z̃^[1]) --(W^[2], b^[2])--> Z^[2] → …

Z̃^[1] is the normalized Z^[1], computed using the parameters β^[1], γ^[1]. Just like W^[l] and b^[l], the parameters β^[l] and γ^[l] are learned during training.
In the case of mini-batch gradient descent, exponentially weighted averages of μ and σ² across batches are saved during training. These are used to
compute Z̃^[l]{t} during inference.
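A minimal sketch of normalizing Z^[1] with learnable γ and β as in the flow above; ε is only added to avoid division by zero, and the batch values here are random placeholders.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)          # mean per unit across the batch
    var = Z.var(axis=1, keepdims=True)          # variance per unit across the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)      # mean 0, variance 1
    return gamma * Z_norm + beta                # Z_tilde, with learnable scale and shift

np.random.seed(0)
Z1 = np.random.randn(3, 8)                      # Z[1] for a mini-batch of 8 examples
gamma1 = np.ones((3, 1))
beta1 = np.zeros((3, 1))
Z1_tilde = batch_norm_forward(Z1, gamma1, beta1)
```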

Train faster and better


Mini-batch gradient descent – if the training set size is huge, the model learns better, but each epoch takes longer. In mini-batch gradient descent, the inputs
are sliced into batches and a step of gradient descent is taken after training on each mini-batch. The mini-batch size is generally chosen between 1 and m to take
advantage of both vectorization and quicker steps. Typical batch sizes are 64, 128, 256 or 512 training examples, such that each mini-batch fits in the memory
of the CPU / GPU
Gradient descent with Momentum – mini-batch gradient descent introduces oscillations which may slow down reaching the optimum. Momentum solves
this problem by adding a moving-average-like effect that dampens the oscillations to reach the optimum faster. The momentum term β controls the size of the
sliding window ≈ 1 / (1 − β) (see the sketch after this list)
RMS Prop – guides the gradient descent algorithm towards the minimum by taking longer steps in the dimensions farther away from the minimum and smaller
steps in the dimensions closer to the minimum. β2 and ϵ are hyperparameters for this optimization. ϵ is not so important; it is added only to avoid division-by-zero
errors and is generally set to 10^−8
Adam – combines ideas from gradient descent with momentum and RMS Prop and uses β, β2 and ϵ as hyperparameters
Learning rate decay – mini-batch gradient descent adds oscillations around the minimum. Adding a decay to the learning rate helps convergence. So α is no
longer a constant and becomes

α = α0 / (1 + decay_rate × epoch_number)
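A minimal sketch of the learning rate decay formula and a momentum-style update; the function and variable names are illustrative, not from any particular library.

```python
def decayed_lr(alpha0, decay_rate, epoch_number):
    # alpha = alpha0 / (1 + decay_rate * epoch_number)
    return alpha0 / (1 + decay_rate * epoch_number)

def momentum_update(w, dw, v_dw, alpha, beta=0.9):
    # Exponentially weighted average of the gradients dampens oscillations
    v_dw = beta * v_dw + (1 - beta) * dw
    w = w - alpha * v_dw
    return w, v_dw
```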

Hyperparameter Tuning
As there are many hyperparameters to set before training, it is important to realize that not all of them are equally important. For example, α is more important than λ,
so fine tuning α first is better. Some approaches for tuning a hyperparameter are,
Grid based search – create a table of combinations of hyperparameter 1 and 2 values. For each combination, evaluate on the dev set to find the best
combination
Random based search – randomly select combinations of values for hyperparameters 1 and 2. For each combination, evaluate on the dev set to find the best
combination. After performing a random search over a broad domain of values, a more fine-grained search in the area(s) of interest, using the results from the
coarse random search, can be performed. It is important to scale the hyperparameters before selecting values uniformly at random (see the sketch after this list)
Panda VS Caviar approach – if the model is so complex that multiple combinations cannot be tested, it is better to babysit a single model, watching how J varies with
time and changing hyperparameter values at runtime.
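A minimal sketch of random search with α and β sampled on a log scale, which is what scaling the hyperparameters refers to; the ranges and number of trials are illustrative.

```python
import numpy as np

np.random.seed(0)
n_trials = 20
for _ in range(n_trials):
    # Sample alpha uniformly on a log scale between 1e-4 and 1e-1
    alpha = 10 ** np.random.uniform(-4, -1)
    # Sample the momentum term via 1 - beta, also on a log scale
    beta = 1 - 10 ** np.random.uniform(-3, -1)
    # ... train with (alpha, beta) and evaluate on the dev set ...
```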

Multiclass classification


The Softmax layer is used as the final layer to classify into C classes. The activations from the final layer L are computed as,

a_i^[L] = t_i / Σ_{j=1}^{C} t_j, where t_i = e^( z_i^[L] )
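A minimal numpy sketch of the softmax activation above, with the usual subtraction of the column-wise maximum for numerical stability:

```python
import numpy as np

def softmax(Z):
    # Z has shape (C, m); subtracting the column-wise max avoids overflow
    t = np.exp(Z - Z.max(axis=0, keepdims=True))
    return t / t.sum(axis=0, keepdims=True)

Z_L = np.array([[2.0, 1.0], [1.0, 3.0], [0.1, 0.2]])   # C = 3 classes, m = 2 examples
A_L = softmax(Z_L)                                     # each column sums to 1
```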

Transfer Learning
Use the learned parameters from one model in another. This is done by replacing the last few layers of the original trained network. The new layers can then be
trained using the new dataset of interest. This is generally applicable when the features identified by the initial layers of an existing model can be re-used for another
task.

Convolutional Neural Nets


A class of deep neural nets for computer vision tasks. A DNN is expected to identify the features from X without the need for hand tuning them. Therefore, in
computer vision tasks, images and videos are generally used as-is as X. Without feature engineering, if the image is passed as-is to the network, the number of
parameters to learn can be quite high depending on the image's resolution. For example, if the input image is (width, height, RGB channels) = (1000, 1000, 3)
dimensional, fully connecting it (as shown in Figure 1) to a layer with n^[1] = 1000 would imply W.shape = (1000, 3 × 10^6), i.e. 3 billion parameters. Training
so many parameters demands a lot of training data, and hence the fully connected ideas above are not used directly for computer vision applications. Therefore, a new class of networks called
Convolutional Neural Nets (CNNs) is studied.
It is known that earlier layers in a DNN identify simple features like edges and the later ones detect more complex shapes in a given image. The convolution
operator ∗ from mathematics addresses both of the above problems: it identifies edges in earlier layers and shapes in the later ones, and it requires fewer parameters than a fully
connected DNN.

Working of a convolution operation

Figure 4: Convolution operator in action


Source: Coding exercise “Convolution model - Step by Step - v2” in the course https://www.coursera.org/learn/convolutional-neural-networks/

The number of channels (the 3rd dimension) in the input layer should match the number of channels in the convolution filter.
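A minimal sketch of a single-channel, stride-1, unpadded convolution (implemented, as in most deep learning libraries, as a cross-correlation); the 5 × 5 input and 3 × 3 filter mirror Figure 4:

```python
import numpy as np

def conv2d_single(x, f, stride=1):
    # x: (n, n) input with one channel, f: (fh, fw) filter; no padding
    n, _ = x.shape
    fh, fw = f.shape
    out = (n - fh) // stride + 1
    y = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = x[i * stride:i * stride + fh, j * stride:j * stride + fw]
            y[i, j] = np.sum(patch * f)        # elementwise product, then sum
    return y

np.random.seed(0)
x = np.random.randn(5, 5)                      # n = 5, one channel
f = np.random.randn(3, 3)                      # f = 3
print(conv2d_single(x, f).shape)               # (3, 3), as in Figure 4
```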

Padding
Due to the way ∗ works, cells on the edges contribute less than the inner cells to the output layer. Strip(s) of zeros are added to the input layer before the ∗
operation to solve this problem; this is called padding, p. There are two types of padding,
Valid ⟹ p = 0
Same ⟹ p = (f − 1) / 2, so that the output has the same height and width as the input
where f is the dimension of the convolution filter. More on f below.

Dimensionality involving a convolution operation


(n, n, #channels) ∗ (f, f, #channels) → ( ⌊(n + 2p − f)/s + 1⌋, ⌊(n + 2p − f)/s + 1⌋, #filters )

where n = dimension of the input layer / image

f = dimension of the convolution filter

p = amount of padding applied to the input layer
s = stride length of the convolution filter on the input layer
#filters = number of convolution filters used on the input layer

For Figure 4: n = 5, #channels = 1, f = 3, p = 0, s = 1, #filters = 1
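The output-size formula above as a small helper, handy for sanity-checking layer shapes:

```python
def conv_output_dim(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_dim(5, 3, p=0, s=1))      # 3, matching Figure 4
print(conv_output_dim(1000, 5, p=2, s=1))   # 1000, "same" padding keeps the size
```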

Pooling
Another type of operator, like ∗, which is mainly used to shrink the height and width of the input. Just like ∗, pooling layers are also filters which run across the
input. However, they do not have any parameters to learn.
Max Pooling – pick the max value at every position of the filter on the input
Average Pooling – pick the average value at every position of the filter on the input
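A minimal max pooling sketch with a square filter and matching stride; note that there are no parameters to learn:

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    n, _ = x.shape
    out = (n - f) // stride + 1
    y = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            y[i, j] = x[i * stride:i * stride + f, j * stride:j * stride + f].max()
    return y

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x))        # picks the max of each 2x2 block
```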

Deep Learning Hyperparameters


Hyperparameter        Symbol     Common Values / Notes
regularization        λ          also called "weight decay"
learning rate         α          0.01
keep_prob                        0.7, from Dropout regularization
momentum              β          0.9, also used in Adam
mini-batch size       t          64, 128, 256, 512
RMS Prop              β2         0.999, also used in Adam
learning rate decay              also called decay_rate
filter size           f^[l]      in CNN, size of a filter in layer l
stride                s^[l]      in CNN, stride length in layer l
padding               p^[l]      in CNN, padding in layer l
# filters             n_c^[l]    in CNN, number of filters used in layer l
