
Deep neural networks come in different architectures, sometimes shallow, sometimes very deep, all trying to generalize on a given dataset. But in the pursuit of learning as many features as possible from the data, they sometimes learn the statistical noise in the dataset. This improves performance on the training dataset but causes the model to fail badly on new data points (the test dataset). This is the problem of overfitting. To tackle it we have various regularization techniques that penalize the weights of the network, but these alone are often not enough.

The best way to reduce overfitting, or to regularize a fixed-size model, would be to average the predictions from all possible settings of the parameters and aggregate the final output. But this is far too computationally expensive and isn't feasible for real-time inference/prediction.

Another approach is inspired by ensemble techniques (such as AdaBoost, XGBoost, and Random Forest), where we combine multiple neural networks of different architectures. But this requires multiple models to be trained and stored, which quickly becomes a huge challenge as the networks grow deeper.

So, we have a great solution known as Dropout Layers.


Figure 1: Dropout applied to a Standard Neural Network

What is Dropout?

The term “dropout” refers to dropping out nodes (in the input and hidden layers) of a neural network (as seen in Figure 1). All the forward and backward connections of a dropped node are temporarily removed, creating a new network architecture out of the parent network. Each node is dropped with a dropout probability of p.

Let's try to understand this with an input x = {1, 2, 3, 4, 5} fed to a fully connected layer, and a dropout layer with probability p = 0.2 (keep probability = 0.8). During forward propagation (training), 20% of the input nodes are dropped on average, i.e. x could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5}, and so on. The same applies to the hidden layers: if a hidden layer has 1000 neurons (nodes) and dropout is applied with drop probability = 0.5, then roughly 500 neurons are randomly dropped in every iteration (batch).
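As a rough sketch of what this masking looks like during training (plain NumPy, not tied to any framework; the function name is made up for illustration), a fresh Bernoulli mask is sampled on every forward pass:

import numpy as np

def apply_dropout(x, p_drop=0.2):
    # Sample a Bernoulli mask: each element is kept with probability 1 - p_drop
    mask = np.random.binomial(1, 1.0 - p_drop, size=x.shape)
    return x * mask

x = np.array([1, 2, 3, 4, 5], dtype=float)
print(apply_dropout(x))   # e.g. [1. 0. 3. 4. 5.] -- a different mask every call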

Generally, for the input layer, the keep probability (i.e. 1 - drop probability) is kept close to 1, with 0.8 suggested as a good default by the authors. For the hidden layers, the greater the drop probability, the more sparse the model; a keep probability of 0.5, i.e. dropping 50% of the nodes, is reported as close to optimal.
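In PyTorch terms, these suggested probabilities might look like the following minimal sketch (the layer sizes here are arbitrary and only for illustration); note that nn.Dropout takes the drop probability, not the keep probability:

import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(p=0.2),     # input dropout: keep probability 0.8
    nn.Linear(784, 1000),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # hidden dropout: keep probability 0.5
    nn.Linear(1000, 10),   # output layer: no dropout applied here
)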

How does it solve the Overfitting problem?

In the overfitting problem, the model learns the statistical noise. To be precise, the main objective of training is to decrease the loss function, given all the units (neurons). So in overfitting, a unit may change in a way that fixes up the mistakes of the other units. This leads to complex co-adaptations, which in turn lead to overfitting, because these complex co-adaptations fail to generalize to the unseen dataset.

Now, if we use dropout, it prevents these units from fixing up the mistakes of other units, thus preventing co-adaptation, because in every iteration the presence of any given unit is highly unreliable. By randomly dropping a few units (nodes), dropout forces the layers to take more or less responsibility for the input in a probabilistic way.

This makes the model generalize better and hence reduces overfitting.
Figure 2: (a) Hidden layer features without dropout; (b) Hidden layer features with
dropout

From Figure 2, we can see that the hidden layer trained with dropout learns more generalized features than the co-adapted features in the layer trained without dropout. It is quite apparent that dropout breaks such inter-unit relations and pushes the network towards generalization.

Dropout Implementation

Figure 3: (a) A unit (neuron) during training is present with a probability p and is connected to the next layer with weights ‘w’; (b) A unit during inference/prediction is always present and is connected to the next layer with weights ‘pw’

In the original implementation of the dropout layer, during training a unit (node/neuron) in a layer is retained with the keep probability (1 - drop probability). This creates a thinner architecture for the given training batch, and this architecture is different every time.

In a standard neural network, during forward propagation we have the following equations:

Figure 4: Forward propagation of a standard neural network

where:
z: the vector of pre-activation outputs of layer (l + 1)
y: the vector of outputs from layer l
w: the weights connecting layer l to layer (l + 1)
b: the bias of layer (l + 1)

Further, with the activation function, z is transformed into the output for layer (l+1).
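For reference, the equations shown in Figure 4, written in the notation of the original dropout paper, are:

z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}
y_i^{(l+1)} = f(z_i^{(l+1)})

where f is the activation function.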

Now, if we have a dropout layer, the forward propagation equations change in the following way:

Figure 5: Forward propagation of a layer with dropout

So before we calculate z, the input to the layer is sampled and multiplied element-wise with independent Bernoulli random variables. r denotes this vector of Bernoulli random variables, each of which is 1 with probability p. In effect, r acts as a mask on the input vector, ensuring that only some of the units are kept according to the keep probability of the dropout layer. This gives the thinned outputs ỹ, which are fed as input to the next layer during forward propagation.
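Written out explicitly, the dropout forward pass of Figure 5 is (again in the paper's notation, with ỹ denoting the thinned output):

r_j^{(l)} ~ Bernoulli(p)
ỹ^{(l)} = r^{(l)} * y^{(l)}          (element-wise multiplication by the mask)
z_i^{(l+1)} = w_i^{(l+1)} ỹ^{(l)} + b_i^{(l+1)}
y_i^{(l+1)} = f(z_i^{(l+1)})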

Figure 6: Comparison of the dropout network with the standard network for a given
layer during forward propagation

Dropout during Inference


Now we know how dropout works mathematically, but what happens during inference/prediction? Do we keep the dropout in the network, or do we remove it during inference?

According to the original implementation (Figure 3b), during inference we do not use a dropout layer. This means that all the units are considered during the prediction step. But because all the units/neurons of a layer are now active, the activations flowing into the next layer would be larger than those seen during training; to deal with this, the weights are first scaled by the chosen keep probability. With this scaling, the network is able to make accurate predictions.

To be more precise, if a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at prediction time.
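As a minimal sketch of this train/test asymmetry (plain NumPy, not from the original article), a single layer would behave as follows; note that modern frameworks such as PyTorch instead use "inverted" dropout, scaling the kept activations by 1/(keep probability) during training so that no rescaling is needed at inference:

import numpy as np

def layer_output(x, w, keep_prob=0.8, training=True):
    # x: outputs of the previous layer, w: outgoing weights of those units
    if training:
        # Training: randomly retain each unit with probability keep_prob
        mask = np.random.binomial(1, keep_prob, size=x.shape)
        return (x * mask) @ w
    # Inference: keep every unit, but scale the weights by keep_prob so the
    # expected input to the next layer matches what was seen during training
    return x @ (keep_prob * w)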

How Dropout was conceived


According to Geoffrey Hinton, one of the authors of “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, a series of observations inspired the idea behind dropout.

1. The analogy with Google Brain: the brain can afford to be big because it effectively learns a large ensemble of models. In an artificial neural network, however, training many separate models is not a very efficient use of hardware, since the same features would need to be invented separately by the different models. This is where the idea of having the models share the same neurons, each using a different subset, came from.

2. Bank tellers: in his bank, the tellers kept changing regularly, and he figured it must be because successfully defrauding the bank would require cooperation between employees. This planted the idea of randomly selecting different neurons, so that every iteration uses a different set of neurons. This prevents the neurons from learning co-adaptations and thus prevents overfitting, much like rotating tellers prevents conspiracies in the bank.

3. Sexual reproduction: it involves taking half of the genes of one parent and half of the other, adding a very small amount of random mutation, to produce an offspring. This mixing of genes makes offspring more robust. Dropout can be linked to this: it breaks co-adaptations by adding randomness, much like gene mixing and mutation do.
Introduction
When you go out to buy a shirt, you will not buy something that fits your body too tightly, because if you eat pizza or biryani and put on weight it will no longer be comfortable. You will also not buy something very loose, because then it looks like cloth hanging on a skeleton. You will try to buy the right fit for your body. The same problems of overfitting and underfitting occur in machine learning projects as well, and the techniques used to tackle them are called regularization techniques.

In the image below, dropout is applied to the second hidden layer of a neural network.

Source: mohcinemadkour.github.io
What’s Dropout?
In machine learning, “dropout” refers to the practice of disregarding certain nodes in a layer at random during training. Dropout is a regularization approach that prevents overfitting by ensuring that no units are co-dependent on one another.

Dropout Regularization
When you have training data, if you train your model too much it might overfit, and when it sees the actual test data for making predictions it will probably not perform well. Dropout regularization is one technique used to tackle this overfitting problem in deep learning.

Training with Drop-Out Layers


Dropout is a regularization method that approximates training many neural networks with different architectures concurrently. During training, some layer outputs are randomly ignored, or dropped. This makes the layer look like, and be treated as, a layer with a different number of nodes and a different connectivity to the preceding layer. In practice, each update during training is carried out with a different view of the configured layer. Dropout makes the training process noisy, requiring nodes within a layer to take on more or less responsibility for the inputs on a probabilistic basis.

According to this view, dropout may break apart situations in which network layers co-adapt to fix mistakes made by prior layers, making the model more robust. Dropout is implemented per layer in a neural network. It works with the vast majority of layer types, including dense fully connected, convolutional, and recurrent layers such as the long short-term memory (LSTM) layer. Dropout can be applied to any or all of the network's hidden layers as well as the visible or input layer; it is not used on the output layer.
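As a minimal sketch of this placement (assuming a small, made-up convolutional classifier; the layer sizes are only illustrative), dropout can be slotted in after convolutional and dense layers but left off the output layer. nn.Dropout2d zeroes whole feature maps, which is a common choice after convolutional layers:

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(p=0.25),        # dropout on conv feature maps (whole channels are zeroed)
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),           # dropout on the hidden dense layer
    nn.Linear(128, 10),          # output layer: no dropout here
)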

Dropout Implementation
Using torch.nn, you can easily add a dropout layer to your PyTorch models. The Dropout class accepts the dropout rate (the likelihood of a neuron being deactivated) as a parameter.
self.dropout = nn.Dropout(0.25)

Dropout can be used after any non-output layer.

To investigate the impact of dropout, we train an image classification model. I'll start with an unregularized network and then use dropout to train a regularized network. The models are trained on the CIFAR-10 dataset for 15 epochs.

A complete example of introducing dropout to a PyTorch model is provided below (the convolutional helper methods are filled in here so the snippet runs end to end).

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, input_shape=(3, 32, 32)):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.conv3 = nn.Conv2d(64, 128, 3)
        self.pool = nn.MaxPool2d(2, 2)
        n_size = self._get_conv_output(input_shape)
        self.fc1 = nn.Linear(n_size, 512)
        self.fc2 = nn.Linear(512, 10)
        self.dropout = nn.Dropout(0.25)   # each unit is dropped with probability 0.25

    def _forward_features(self, x):
        # Convolutional feature extractor: conv -> ReLU -> pool, three times
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        return x

    def _get_conv_output(self, shape):
        # Helper filled in here: infers the flattened feature size from a dummy input
        with torch.no_grad():
            return self._forward_features(torch.zeros(1, *shape)).view(1, -1).size(1)

    def forward(self, x):
        x = self._forward_features(x)
        x = x.view(x.size(0), -1)
        x = self.dropout(x)        # apply dropout to the flattened conv features
        x = F.relu(self.fc1(x))
        x = self.dropout(x)        # apply dropout after the first fully connected layer
        x = self.fc2(x)
        return x
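One usage note (this snippet is illustrative, not from the article): PyTorch's nn.Dropout is only active in training mode, so dropout is switched off for evaluation with model.eval(). Because nn.Dropout uses inverted dropout (it scales the kept activations by 1/(1 - p) during training), no additional weight scaling is needed at inference time:

import torch

model = Net()
model.train()                     # dropout active: units are zeroed at random
x = torch.randn(8, 3, 32, 32)     # a dummy batch of CIFAR-10-sized images
train_out = model(x)

model.eval()                      # dropout disabled: all units are used
with torch.no_grad():
    test_out = model(x)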

An unregularized network overfits almost instantly on the training dataset. Note how the validation loss for the no-dropout run diverges dramatically after only a few epochs: the generalization error has grown. Training with two dropout layers and a dropout probability of 25% avoids this overfitting. However, it also lowers the training accuracy, which means a regularized network may need to be trained for longer.

Dropout improves model generalization. Although the training accuracy is lower than that of the unregularized network, the overall validation accuracy has improved, which means the generalization error has decreased.

Why will dropout help with overfitting?

- A neuron can't rely on any one input, as it might be randomly dropped out.
- Neurons will not learn redundant details of the inputs.

Other Popular Regularization Techniques
When combating overfitting, dropout is far from the only choice. Commonly used regularization techniques include:

Early stopping: automatically terminate training when a performance measure (e.g., validation loss or accuracy) stops improving.

Weight decay: add a penalty to the loss function to encourage the network to use smaller weights (a minimal sketch of weight decay and early stopping follows this list).

Noise: allow some random variation in the data through augmentation, which makes the network robust to a larger distribution of inputs and hence improves generalization.

Model combination: average the outputs of separately trained neural networks (this requires a lot of computational power, data, and time).
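As a rough illustration of two of these alternatives (the train_one_epoch and validate helpers and the patience value are placeholders, not from the article), weight decay is just an argument to the optimizer in PyTorch, and early stopping can be implemented with a few lines around the training loop:

import torch

# Weight decay: an L2-style penalty applied through the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Early stopping: stop when validation loss hasn't improved for `patience` epochs
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    val_loss = validate(model)          # hypothetical validation helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                       # performance has stopped improving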

Dropout Regularization Hyperparameters


A large, decaying learning rate and a high momentum are two hyperparameter choices that have been found to work well with dropout regularization. Constraining the weight vectors when using dropout allows us to employ a high learning rate without fear of the weights blowing up. Dropout noise, together with the large decaying learning rate, helps the optimizer explore different regions of the loss function and, hopefully, reach a better minimum.
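As a sketch of such a configuration in PyTorch (the specific numbers, num_epochs, and the train_one_epoch helper are placeholders, not values from the paper):

import torch

# Large initial learning rate with high momentum, as suggested above
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.95)
# Decay the learning rate over the course of training
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    scheduler.step()                    # shrink the learning rate each epoch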

The Drawbacks of Dropout


Although dropout is a powerful tool, it has certain downsides. A dropout network may take two to three times longer to train than a standard network. One way to reap the benefits of dropout without slowing down training would be to find a regularizer that is virtually equivalent to a dropout layer. For linear regression, this regularizer turns out to be a modified form of L2 regularization; an analogous regularizer for more complex models has yet to be discovered, so until then dropout remains the standard choice.
