Deep Learning is an emerging field which needs no introduction. The aim of this article is to collaboratively learn various concepts in Deep Learning in a concise manner. If you feel something can be added or updated, please add a comment. I will keep adding new material to this article as well.
Credits
A big thank you to the DeepLearning.ai team and their Deep Learning Specialization on Coursera. All the material here, including notations, concepts and some diagrams, is a heavily shortened form of their excellent 5-course series.
notation – description
m – number of training examples
n_x – number of features per training example
X – input matrix where each column is a training example
Y – output matrix where each column is the corresponding label of the training example in X, i.e. Y[0] is the label for X[0], the 1st training example
Ŷ – predicted labels for new test inputs
Z – linear transformation of X
A – non-linear transformation of Z, the result of an activation function
W – weights matrix for each feature in X
x – features of one training example
y – output label of one training example
ŷ – predicted output label of one training example
z – linear transformation of x
a – non-linear transformation of z, the result of an activation function
w – weights matrix for x
b – bias matrix
σ – sigmoid function, σ(z) = 1 / (1 + e^(−z)); the output lies in (0, 1) for any value of z
w_j^(i)[k] – value of the weight for the jth hidden unit in the kth layer for the ith training example
Logistic Regression
For a single training example x, the prediction is computed as,
z = wᵀx + b
a = σ(z)
ŷ = a
Cross entropy loss for finding out how good the prediction is for a single training example,
L(ŷ, y) = −(y log(ŷ) + (1 − y) log(1 − ŷ))
The cost J averages this loss over all m training examples, J(w, b) = (1/m) Σᵢ L(ŷ^(i), y^(i)). This J is used by an optimization algorithm (like gradient descent) to find optimal values for w and b.
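A minimal numpy sketch of the computation above (variable names are illustrative, not from the course code): the vectorized forward pass over all m examples and the resulting cost J.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# X: (n_x, m) inputs, Y: (1, m) labels, w: (n_x, 1) weights, b: scalar bias.
def forward_and_cost(w, b, X, Y):
    Z = w.T @ X + b              # (1, m) linear transformation
    A = sigmoid(Z)               # (1, m) predictions, each in (0, 1)
    # Cross entropy loss averaged over all m examples gives the cost J.
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return A, J
```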
Shallow Neural Nets
In logistic regression, z and a are computed once to obtain the prediction for each training example. In a shallow neural net, this process is repeated twice before predicting the output label. [1] and [2] are layers in the network. Layer [1] is a hidden layer as it is neither the input nor the output. Layer [1] has three (hidden) units / neurons and layer [2] has one unit. The prediction for a training example x in a shallow neural net is as follows,
z^[1] = W^[1] x + b^[1]
a^[1] = σ(z^[1])
z^[2] = W^[2] a^[1] + b^[2]
ŷ = a^[2] = σ(z^[2])
This process is extended to all training examples to obtain Z^[1], Z^[2], A^[1], A^[2], Ŷ. If this process is extended to more than 2 hidden layers it is called a deep neural net!
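As a hedged sketch of the vectorized computation (assuming column-stacked examples as in the notation above; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# W1: (n1, n_x), b1: (n1, 1), W2: (1, n1), b2: (1, 1), X: (n_x, m).
def shallow_forward(W1, b1, W2, b2, X):
    Z1 = W1 @ X + b1       # (n1, m), all m examples at once
    A1 = sigmoid(Z1)       # (n1, m)
    Z2 = W2 @ A1 + b2      # (1, m)
    Y_hat = sigmoid(Z2)    # (1, m) predictions for all m examples
    return Y_hat
```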
Activation functions
sigmoid, σ (z) =
1
1+e−z
both sigmoid and tanh slow down learning when z is too small or high
neural net learns much faster when compared to sigmoid or tanh
generally used in the hidden layers
Leaky ReLU (z) = max(0.01z, z)
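The four activation functions above, written with numpy (a straightforward transcription; z can be a scalar or an array):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)
```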
In a deep neural net with L = 4 layers, the first layer computes Z^[1] = W^[1]X + b^[1] and A^[1] = g^[1](Z^[1]). Similarly, the process is repeated for layers [2], [3] and [4],
Ŷ = A^[L=4] = g^[4](Z^[4])
Here g^[l] is the activation function used in layer l. When implemented with numpy vectors, all computations are parallelized across training examples; this is called a vectorized implementation. Without vectorization, the neural net has to loop over training examples one by one to complete one epoch of training, which slows down learning.
Each training example x^(i) is passed through the net to obtain the prediction ŷ^(i) from the last layer. This step is called forward propagation. ŷ^(i) is compared with y^(i) using J to obtain the error in prediction. This error is passed back from layer [L] to [L − 1] to [L − 2] and so on to [1], to adjust W^[l] and b^[l] at each layer so that the next prediction causes a smaller error. This step of passing the error back is called back propagation. Every time the error is passed back, the amount of change the system makes to the parameters W^[l], b^[l] is governed by a hyperparameter called the learning rate, α.
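A minimal sketch of the update step that follows back propagation, assuming the gradients dW[l] and db[l] have already been computed (names are illustrative):

```python
# W, b, dW, db are lists of per-layer numpy arrays; alpha is the learning rate.
def update_parameters(W, b, dW, db, alpha):
    for l in range(len(W)):
        W[l] = W[l] - alpha * dW[l]   # alpha governs the size of each step
        b[l] = b[l] - alpha * db[l]
    return W, b
```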
Dimensionality checks
These formulae can help debug the dimensions of various matrices when implementing deep neural nets,
W^[l].shape = (n^[l], n^[l−1])
b^[l].shape = (n^[l], 1)
A^[l].shape = Z^[l].shape = (n^[l], m)
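These checks translate directly into asserts; a small debugging helper (assuming lists of per-layer arrays indexed by layer, which is an implementation choice, not the course's code):

```python
# n = [n0, n1, ..., nL] holds the layer sizes; m is the number of examples.
def check_dims(W, b, A, Z, n, m):
    for l in range(1, len(n)):
        assert W[l].shape == (n[l], n[l - 1])
        assert b[l].shape == (n[l], 1)
        assert A[l].shape == Z[l].shape == (n[l], m)
```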
Hyperparameters to choose
W^[l], b^[l] are parameters of the neural net and are learned during the training phase. Hyperparameters are manually set by the developer before training.
learning rate, α – the rate at which the parameters are updated to bring the predictions close to the actual values
number of epochs – After training with the entire training data once, one epoch is completed. This parameter controls how many times this should be
repeated.
hidden layers, L – how many hidden layers in the Deep Neural Net (DNN)
hidden units per layer – values for n^[1], n^[2], n^[3], …, n^[L]
activation functions – activation function to use in each layer, g^[1], g^[2], g^[3], …, g^[L]
Error Types
As shown in Figure 2, a DNN has a train error and a dev error besides the test error
Avoidable bias – difference between human error (often used as the benchmark) and training error. Possible solutions to reduce this are:
Train a bigger network (increase L or n^[l])
Increase number of epochs
Change network architecture
Variance – difference between training error and dev error. This happens due to overfitting to the training data. Possible solutions to reduce this are:
Train on more data
Regularization
Change network architecture
Figure 2: Range of each error
For example, split the available 100 cat pictures 50-50 and mix the Train (50) pictures with the internet pictures.
As the train and dev data come from different distributions, comparing the training and dev errors does not clarify whether the gap is due to high variance or due to data mismatch. Hence, the train data is split into train and training-dev sets after mixing in your 50 cat pictures.
Now, as the train and training-dev sets are from the same distribution, the root cause of the problem can be identified as either bias, variance or data mismatch.
Figure 3: Range of errors when not all data is from same distribution
As shown in Figure 3, since the training-dev set and the dev set are from different data distributions, the difference between their errors is due to data mismatch.
Regularization
When the neural net overfits (high variance) the model to the training data, predictions on the unseen dev set can be poor. Regularization reduces the impact of (various) neurons in the model so that it can generalize better to unseen inputs. lambda λ is the hyperparameter which controls the amount of regularization used in the L1 and L2 algorithms. Here are some algorithms / ideas for regularization,
L1 – uses the L1-norm to penalize W's
L2 – uses the L2-norm to penalize W's
Dropout – randomly zeros (drops) some neurons from the network, thus making it simpler so it generalizes better. keep_prob is the hyperparameter which is the probability of retaining a neuron. Different layers can have different values of keep_prob based on the density of connections (see the sketch after this list)
Data augmentation – transform, randomly crop and translate input training images
Early stopping – after every epoch compute the dev error and stop the training once it starts increasing, even though the training error continues to decrease (a sign of overfitting)
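A minimal sketch of (inverted) dropout for one layer's activations, as described in the Dropout item above; the scaling by keep_prob keeps the expected activation values unchanged:

```python
import numpy as np

# A: activations of one layer; keep_prob: probability of retaining a neuron.
def dropout_forward(A, keep_prob):
    mask = np.random.rand(*A.shape) < keep_prob  # randomly choose neurons to keep
    A = A * mask                                 # zero out the dropped neurons
    A = A / keep_prob                            # inverted dropout: rescale the rest
    return A, mask
```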
Normalization
Normalize input features with varying ranges to learn faster. Normalizing sets μ = 0 and σ² = 1 for each feature across all training examples.
Batch normalization – the idea of normalizing inputs is extended to all layers. z^[l] is normalized before applying the activation function. The flow of parameters would then be,
X → (W^[1], b^[1]) → Z^[1] → (β^[1], γ^[1]) → Z̃^[1] → a^[1] = g(Z̃^[1]) → (W^[2], b^[2]) → Z^[2] → …
Z̃^[1] is the normalized Z^[1] computed using the parameters β^[1], γ^[1]. Just like W^[l] and b^[l] are parameters that are learned during training, β^[l] and γ^[l] are too.
In case of mini-batch gradient descent, exponentially weighted averages of μ and σ² across mini-batches are saved during training. These are used to compute Z̃^[l]{t} during inference time.
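A simplified numpy sketch of the normalize-then-scale step (training-time statistics only; the exponentially weighted inference-time averages mentioned above are omitted for brevity):

```python
import numpy as np

# Z: (n, m) pre-activations; gamma, beta: (n, 1) learned scale and shift.
def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # mean 0, variance 1 per unit
    return gamma * Z_norm + beta            # Z-tilde, fed to the activation
```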
Optimization
Gradient descent with momentum – computes an exponentially weighted average of past gradients, v = βv + (1 − β)dW, and uses v instead of dW to update the parameters, which dampens oscillations. β is a hyperparameter, commonly set to 0.9
RMS Prop – guides the gradient descent algorithm towards the minimum by taking longer steps in the dimensions farther away from the minimum and smaller steps in the dimensions closer to it. β₂ and ϵ are hyperparameters for this optimization. ϵ is not so important; it is added only to avoid division-by-zero errors and is generally set to 10^(−8)
Adam – combines ideas from gradient descent with momentum and RMS prop, and uses β, β₂ and ϵ as hyperparameters
Learning rate decay – mini-batch gradient descent adds oscillations around the minimum. Adding a decay to the learning rate makes it converge better. So α is no longer a constant and becomes
α = α₀ / (1 + decay_rate × epoch_number)
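A hedged sketch of one Adam-style update combining the momentum and RMS prop terms above, plus the decay formula (bias correction is omitted here for brevity, though the full algorithm includes it):

```python
import numpy as np

# v, s: running averages for one parameter matrix W; dW: its gradient.
def adam_step(W, dW, v, s, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW          # momentum: average of gradients
    s = beta2 * s + (1 - beta2) * dW ** 2     # RMS prop: average of squares
    W = W - alpha * v / (np.sqrt(s) + eps)    # eps avoids division by zero
    return W, v, s

def decayed_alpha(alpha0, decay_rate, epoch_number):
    return alpha0 / (1 + decay_rate * epoch_number)
```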
Hyperparameter Tuning
As there are many hyperparameters to set before training, it is important to realize that not all of them are equally important. For example, α is more important than λ, so fine-tuning α first is better. Some approaches for tuning a hyperparameter are,
Grid based search – create a table of combinations of hyperparameter 1 and 2 values. For each combination, evaluate on the dev set to find the best combination
Random based search – randomly select combinations of values for hyperparameters 1 and 2. For each combination, evaluate on the dev set to find the best combination. After performing a random search in a broad domain of values, a more fine-grained search in the area(s) of interest, using the results from the coarse random search, can be performed. It is important to scale the hyperparameters before selecting values uniformly at random (see the sketch after this list)
Panda vs Caviar approach – if the model is so complex that multiple combinations cannot be tested, it is a better idea to babysit a single model, watching how J varies with time and changing hyperparameter values at runtime.
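For example, sampling α uniformly on a log scale (the bounds here are illustrative, not prescribed by the course):

```python
import numpy as np

# Sample alpha between 10^-4 and 10^0 uniformly in log space.
def sample_alpha(low_exp=-4, high_exp=0):
    r = np.random.uniform(low_exp, high_exp)
    return 10 ** r
```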
Softmax
For multi-class classification, the last layer uses the softmax activation so that the outputs form a probability distribution over the classes,
a_i^[L] = t_i / Σ_j t_j, where t_i = e^(z_i^[L])
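A numerically stable numpy sketch of softmax for one example (subtracting the max is a standard stability trick, not part of the formula above):

```python
import numpy as np

# z: (C,) pre-activations of the last layer for one example.
def softmax(z):
    t = np.exp(z - z.max())   # shift by max to avoid overflow
    return t / t.sum()        # outputs in (0, 1), summing to 1
```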
Transfer Learning
Use the learned parameters from one model in another. It is done by replacing the last few layers in the original trained network. The new layers can then be trained using the new dataset of interest. This is generally applicable when features identified by the initial layers of an existing model can be re-used for another task.
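A conceptual sketch of that idea with plain numpy lists of weights (purely illustrative; in practice a framework's layer-freezing API would be used):

```python
import numpy as np

# old_W[l] has shape (n_l, n_{l-1}); reuse all but the last layer.
def transfer(old_W, old_b, n_new_output):
    W, b = old_W[:-1], old_b[:-1]                 # keep the early layers
    n_prev = old_W[-1].shape[1]                   # size of the layer feeding the head
    W.append(np.random.randn(n_new_output, n_prev) * 0.01)  # fresh last layer
    b.append(np.zeros((n_new_output, 1)))
    return W, b
```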
Convolution
The convolution operator, ∗, slides a filter over the input, computing an element-wise product and sum at each position. The number of channels (the 3rd dimension) in the input layer should match the number of channels in the convolution filter.
Padding
Due to the way ∗ works, cells on the edges contribute less to the output layer compared to inner cells. Strip(s) of zeros are added to the input layer before the ∗ operation to solve this problem; this is called padding, p. There are two types of padding,
Valid ⟹ p = 0
Same ⟹ p = (f − 1) / 2
where f is the dimension of the convolution filter. More on f below.
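A small sketch of zero-padding with numpy (assuming an (n_H, n_W, n_C) input; channels are left unpadded):

```python
import numpy as np

def zero_pad(image, p):
    # Add p strips of zeros on each side of the height and width dimensions.
    return np.pad(image, ((p, p), (p, p), (0, 0)), mode="constant")
```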
Pooling
Another type of operator like ∗, which is mainly used to shrink the height and width of the input. Just like ∗, pooling layers are also filters which run across the input. However, they do not have any parameters to learn.
Max Pooling – pick the max value at every position of the filter on the input
Average Pooling – pick the average value at every position of the filter on the input
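A minimal sketch of max pooling over a single channel with an f × f filter and stride s (swap .max() for .mean() to get average pooling):

```python
import numpy as np

def max_pool(A, f, s):
    n_h = (A.shape[0] - f) // s + 1
    n_w = (A.shape[1] - f) // s + 1
    out = np.zeros((n_h, n_w))
    for i in range(n_h):
        for j in range(n_w):
            out[i, j] = A[i*s:i*s+f, j*s:j*s+f].max()  # max in each window
    return out
```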