
Artificial Intelligence

The term Artificial Intelligence (AI) refers to techniques that enable computers to emulate
human intelligence. Machine Learning (ML) is the set of AI algorithms devised to make a computer
learn from past data in order to make predictions about future data. The Artificial Neural Network
(ANN) is one such algorithm, mimicking the structure and behavior of the human brain and central
nervous system. An ANN is built from several artificial neurons arranged in layers. A simple
ANN is composed of an input layer, hidden layers, and an output layer, which accept, process, and
predict the data respectively. As the number of hidden layers increases, the ANN becomes deeper,
and it is then called a deep neural network.

1.1 From Biological to Artificial Neurons


Biological Neuron. The human nervous system consists of numerous cells called biological neurons,
arranged in a layered manner. Each cell receives outside signals through its dendrites and passes
them to the soma, which decides whether the cell should fire in response. This decision is then
transmitted along the axon to the synapses, which forward it to the neighboring cells.
By the end of this transmission, the signal has activated a chain of fired neurons (usually across
different layers) that determines the brain's response to the input signal, such as moving or talking.
The structure of a biological neuron is shown in Fig 1.1 and that of the central nervous system in Fig 1.2.

Fig 1.1 Biological Neuron

Fig 1.2 Layered Structure of the Central Nervous System


Inspired by the structure and functionality of the biological neuron, the notion of Artificial Neural
Networks (ANNs) was conceived back in the 1940s. The earliest ANNs simulated biological
neurons using artificial neurons. Later, deeper discoveries about the biological nervous system made
these early ANNs look rudimentary. Nevertheless, their impressive results encouraged the evolution
of more powerful ANNs.
Artificial Neuron
McCulloch-Pitts Neuron The first artificial neuron (AN), the McCulloch-Pitts (MCP) neuron, was
developed by two scientists, McCulloch and Pitts, in 1943. They published a computational model of
the human nervous system using an artificial neuron as the fundamental unit, as shown in Fig 1.3.
They called this mathematical formulation of a biological neuron an artificial neuron (AN). The
AN's input vector is analogous to the excitation received by a biological neuron, and the
corresponding weight vector represents the synaptic strength. An input is said to be inhibitory if its
value alone can prevent the activation of the neuron, irrespective of the other excitatory inputs. A
positive weight represents an excitatory effect and a negative weight an inhibitory effect. The
nucleus of an AN aggregates the weighted inputs and passes the result to an activation function,
which decides whether the neuron should fire in response to the given input or not.

Fig 1.3 McCulloch-Pitts Neuron: mankind's first artificial neuron.
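The MCP unit can be made concrete with a short Python sketch (a minimal illustration; the threshold values and inputs are chosen here for demonstration and are not part of the original paper):

def mcp_neuron(inputs, threshold, inhibitory=()):
    # McCulloch-Pitts unit: binary inputs; any active inhibitory input
    # vetoes firing (absolute inhibition); otherwise the unit fires when
    # the count of active excitatory inputs reaches the threshold.
    if any(inputs[i] for i in inhibitory):
        return 0
    excitatory = [x for i, x in enumerate(inputs) if i not in inhibitory]
    return 1 if sum(excitatory) >= threshold else 0

# AND of two excitatory inputs: both must be active (threshold = 2)
print(mcp_neuron([1, 1], threshold=2))   # prints 1
print(mcp_neuron([1, 0], threshold=2))   # prints 0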


Perceptron
This fundamental model was straightforward and works well for Boolean inputs. In 1957, Frank
Rosenblatt proposed a slightly modified version of the MCP neuron that works for real-valued
inputs. This is accomplished by associating each input feature with a weight. The weighted-input
concept lets us set priorities among the features the neuron considers when making its decision.
Rosenblatt's neuron is called a Linear Threshold Unit (LTU) or a Perceptron, and its
functionality is depicted in Fig 1.4. The processing unit of a perceptron is built from two functions,
called pre-activation and activation. The pre-activation function receives all the weighted inputs and
aggregates them. The aggregated result then goes to the activation function, which decides the
response (fire or not) of the perceptron.
Fig 1.4. Perceptron: A Neuron with weighted inputs
Realizing Boolean functions using Perceptron
Logical functions are a great starting point since they will bring us to a natural
development of the theory behind the perceptron and, therefore, neural networks.
Realization of the following Boolean functions is explained below.

NOT logical function:

NOT(x) is a one-variable function, which means we have one input at a time: N = 1. It is also
a logical function, so both the input and the output have only two possible states, 0 and 1 (i.e.,
False and True); the Heaviside step function Θ fits this case since it produces a binary output.

Given two parameters, w and b, the perceptron performs the following computation:

ŷ = Θ(wx + b). If we pick w = −1 and b = 0.5, the result is
NOT(0) = Θ(0.5) = 1
NOT(1) = Θ(−0.5) = 0

AND logical function

The AND logical function is a two-variable function, AND(x1, x2), with binary inputs and output.
The corresponding perceptron performs the following computation:

ŷ = Θ(w1*x1 + w2*x2 + b)

This time we have three parameters: w1, w2, and b. For the values w1 = 1, w2 = 1, b = −1.5 we get

AND(1, 1) = 1
AND(1, 0) = 0
AND(0, 1) = 0
AND(0, 0) = 0

OR logical function

OR(x1, x2) is also a two-variable function; its output is one-dimensional (i.e., a single number)
with two possible states (0 or 1). Therefore, we use a perceptron with the same architecture as
the one before. For the values w1 = 1, w2 = 1, b = −0.5, the result is

OR(1, 1) = 1
OR(1, 0) = 1
OR(0, 1) = 1
OR(0, 0) = 0
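A small Python sketch, reusing the weights chosen above, confirms these truth tables (the helper names are illustrative):

def heaviside(z):
    # Heaviside step: 1 for z >= 0, else 0
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    # pre-activation: weighted sum plus bias; activation: Heaviside step
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b)

NOT = lambda x: perceptron([x], [-1], 0.5)
AND = lambda x1, x2: perceptron([x1, x2], [1, 1], -1.5)
OR = lambda x1, x2: perceptron([x1, x2], [1, 1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))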

Perceptron Learning Algorithm

Manual estimation or computation of the parameters is both time consuming and prone to
human error. Therefore, an algorithm that automates this process has been developed. It is
called the Perceptron learning algorithm; a sketch of it is given below.
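A minimal Python sketch of the standard perceptron learning rule (the toy dataset, learning rate, and epoch limit are illustrative):

def train_perceptron(X, y, epochs=100, lr=1.0):
    # On each misclassified sample, nudge the weights toward
    # the correct side of the decision boundary.
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        converged = True
        for x, target in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
            error = target - pred            # -1, 0, or +1
            if error != 0:
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
                converged = False
        if converged:                        # a full pass with no mistakes
            break
    return w, b

# learn AND from its truth table
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
print(train_perceptron(X, y))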


Proof of Convergence

The perceptron convergence theorem (Novikoff, 1962) states that if the training data are linearly
separable, the perceptron learning algorithm converges to a separating set of weights after a finite
number of updates.

XOR logical function


We conclude that a single perceptron with a Heaviside activation function can implement each
of the fundamental logical functions NOT, AND, and OR. They are called fundamental
because any logical function, no matter how complex, can be obtained by combining those three.

The XOR function, whose truth table is XOR(0, 0) = 0, XOR(0, 1) = 1, XOR(1, 0) = 1,
XOR(1, 1) = 0, can also be expressed using the same fundamental functions, for example
XOR(x1, x2) = AND(OR(x1, x2), NOT(AND(x1, x2))). However, there exists no single combination
of w1, w2, and b that realizes the XOR function directly. Minsky and Papert showed in their famous
book on the perceptron that a single perceptron cannot represent a simple non-linear function such
as XOR, which is where the Multi-Layer Perceptron (MLP) prevails. Although this result
discouraged people from applying AI to real-world applications for some time, the introduction of
the Back-Propagation Algorithm (BPA), first derived around 1970 and popularized in the 1980s,
reignited the field's growth.

1.2 Implementing MLP


Multi-Layer Perceptron
An MLP is a collection of artificial perceptrons connected across several layers, working together to
achieve a desired functionality. An MLP has been proven to represent many linearly inseparable
functions, given an appropriate set of weights for each of its neurons. The XOR realization using an
MLP follows the decomposition above: a hidden layer computes OR(x1, x2) and NOT(AND(x1, x2)),
and the output neuron combines the two with an AND, as sketched below.
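A minimal sketch, reusing the single-perceptron gates with the weights chosen earlier:

def heaviside(z):
    return 1 if z >= 0 else 0

def unit(x, w, b):
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)) + b)

def xor(x1, x2):
    # hidden layer: two perceptrons
    h_or = unit([x1, x2], [1, 1], -0.5)      # OR(x1, x2)
    h_nand = unit([x1, x2], [-1, -1], 1.5)   # NOT(AND(x1, x2))
    # output layer: AND of the two hidden outputs
    return unit([h_or, h_nand], [1, 1], -1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # prints the XOR truth table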

This optimal set of weights is obtained with the help of the back-propagation algorithm. MLP
training comprises two phases, called feedforward and back-propagation. The first phase feeds
the weighted features through the MLP neurons in the forward direction. The prediction obtained
from these input features is then compared, using a loss function, with the expected output, called
the ground truth. The error obtained at this stage is then propagated back toward the input layer
by the Back-Propagation Algorithm (BPA). While propagating back, the BPA updates the weights
of the neurons that are responsible for the error. The feedforward phase is then carried out again
with these updated weights. This cycle is repeated until training converges by arriving at an
acceptably small error value.
Gradient Descent Algorithm
The back-propagation algorithm is implemented with the gradient descent algorithm, which updates
each parameter by subtracting the gradient of the loss with respect to that parameter, scaled by the
learning rate.
The gradient descent algorithm consists of three modules, sketched in code after this list:
➢ Forward propagation
➢ Backward propagation
➢ Parameter update
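A minimal NumPy sketch of these three modules for a single sigmoid neuron trained with squared error (the toy data and learning rate are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.0], [1.5, -0.5]])   # toy inputs
y = np.array([1.0, 0.0])                  # toy targets
w, b, eta = np.zeros(2), 0.0, 0.1         # parameters and learning rate

for _ in range(100):
    # 1. forward propagation
    y_hat = sigmoid(X @ w + b)
    # 2. backward propagation: gradient of 0.5 * (y_hat - y)^2
    delta = (y_hat - y) * y_hat * (1 - y_hat)   # dLoss/dz via the chain rule
    grad_w = X.T @ delta
    grad_b = delta.sum()
    # 3. parameter update: step opposite the gradient
    w -= eta * grad_w
    b -= eta * grad_b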
The gradient ∇Wk is very large on steep slopes and very small on gentle ones, so training lingers
in the flat regions of the loss surface, leading to slower convergence.

To improve on this, several other optimization methods have evolved.

1.3 Fine-Tuning Neural Network Hyperparameters


The performance of an MLP depends on the number of layers, the number of neurons per layer,
the type of activation function used in each layer, the weight-initialization logic, the number of
input samples considered, and so on.
Number of Hidden Layers
For many problems, you can just begin with a single hidden layer and you will get
reasonable results. It has actually been shown that an MLP with just one hidden layer can model
even the most complex functions provided it has enough neurons. For a long time, these facts
convinced researchers that there was no need to investigate any deeper neural networks. But they
overlooked the fact that deep networks have a much higher parameter efficiency than shallow ones:
they can model complex functions using exponentially fewer neurons than shallow nets, allowing
them to reach much better performance with the same amount of training data.
Real-world data is often structured in a hierarchical way, and deep neural networks
automatically take advantage of this fact: lower hidden layers model low-level structures (e.g., line
segments of various shapes and orientations), intermediate hidden layers combine these low-level
structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden
layers and the output layer combine these intermediate structures to model high-level structures
(e.g., faces). Not only does this hierarchical architecture help DNNs converge faster to a good
solution, it also improves their ability to generalize to new datasets.
Using transfer learning, the network will not have to learn from scratch all the low-level structures
instead, it can adopt the weights from the lower layers of an already trained model which
accomplishes almost the similar task; The network will only have to learn the higher-level
structures. For example, if you have already trained a model to recognize faces in pictures, and you
now want to train a new neural network to recognize hairstyles, then you can kickstart training by
reusing the lower layers of the first network. Instead of randomly initializing the weights and biases
of the first few layers of the new neural network, you can initialize them to the value of the weights
and biases of the lower layers of the first network.
Number of Neurons per Hidden Layer
The number of neurons in the input and output layers is determined by the type of input and output
your task requires. As for the hidden layers, it used to be a common practice to size them to form
a pyramid, with fewer and fewer neurons at each layer—the rationale being that many low-level
features can coalesce into far fewer high-level features. However, it seems that simply using the
same number of neurons in all hidden layers performs just as well in most cases, or even better,
and there is just one hyperparameter to tune instead of one per layer. However, depending on the
dataset, it can sometimes help to make the first hidden layer bigger than the others.

Learning Rate
The learning rate is arguably the most important hyperparameter. In general, the optimal learning
rate is about half of the maximum learning rate (i.e., the learning rate above which the training
algorithm diverges).
Optimizers
Choosing a better optimizer than plain old Mini-batch Gradient Descent (and tuning its
hyperparameters) is also quite important.
Batch size
The batch size can also have a significant impact on your model's performance and training
time. In general, the optimal batch size will be lower than 32. A small batch size ensures that each
training iteration is very fast; and although a large batch size gives a more precise estimate of
the gradients, in practice this does not matter much, since the optimization landscape is quite
complex and the direction of the true gradient does not point precisely toward the optimum.
A few Python libraries that can be used for hyperparameter optimization:
• Hyperopt: a popular Python library for optimizing over all sorts of complex search spaces
(including real values, such as the learning rate, and discrete values, such as the number of layers).
• Hyperas, kopt, or Talos: optimize hyperparameters for Keras models (the first two are based
on Hyperopt).
• Scikit-Optimize (skopt): a general-purpose optimization library. Its BayesSearchCV class
performs Bayesian optimization using an interface similar to GridSearchCV.
• Spearmint: a Bayesian optimization library.
• Sklearn-Deap: a hyperparameter optimization library based on evolutionary algorithms, also
with a GridSearchCV-like interface.
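Even without these libraries, a plain random search is easy to write. Below is a minimal sketch with tf.keras; the dataset variables (X_train, y_train, X_valid, y_valid), the input shape, and the search ranges are assumptions for illustration:

import numpy as np
from tensorflow import keras

def build_model(n_hidden, n_neurons, learning_rate):
    model = keras.models.Sequential()
    model.add(keras.layers.Flatten(input_shape=[28, 28]))
    for _ in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(10, activation="softmax"))
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
                  metrics=["accuracy"])
    return model

rng = np.random.default_rng(42)
best_acc, best_params = 0.0, None
for _ in range(10):                                  # 10 random trials
    params = dict(n_hidden=int(rng.integers(1, 4)),
                  n_neurons=int(rng.integers(16, 257)),
                  learning_rate=float(10 ** rng.uniform(-4, -1)))
    model = build_model(**params)
    history = model.fit(X_train, y_train, epochs=5,
                        validation_data=(X_valid, y_valid), verbose=0)
    acc = history.history["val_accuracy"][-1]
    if acc > best_acc:
        best_acc, best_params = acc, params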

1.4 Training Deep Neural Networks


Training deep neural networks can run into several issues, such as the following:
➢ Vanishing and exploding gradient problems
➢ The available dataset may not be sufficient to train large (deeper) networks
➢ Extremely slow training
➢ Overfitting

1.5 Vanishing and Exploding Gradient Problems


The backpropagation algorithm works by going from the output layer to the input layer,
propagating the error gradient on the way. These gradients often get smaller and smaller as the
algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the
lower layer connection weights virtually unchanged, and training never converges to a good
solution. This is called the vanishing gradients problem.
In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers
get insanely large weight updates, and the algorithm diverges. This is the exploding gradients
problem, which is mostly encountered in recurrent neural networks.
More generally, deep neural networks suffer from unstable gradients; different layers may learn at
widely different speeds.
Reasons for the Vanishing/Exploding Gradients Problem
Xavier Glorot and Yoshua Bengio identified two main suspects in their paper titled "Understanding
the Difficulty of Training Deep Feedforward Neural Networks":
➢ The logistic sigmoid activation function
➢ Random weight initialization
Weight Initialization
Hidden-layer weights are generally initialized randomly, using a distribution with zero mean and
unit standard deviation. With this scheme, the variance of each layer's outputs keeps increasing from
layer to layer. The situation becomes even worse with the logistic sigmoid activation function,
whose outputs have a mean of 0.5 (not 0).

Logistic Sigmoid Activation

Looking at the logistic activation function, we can see that when the inputs become large (negative
or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when
backpropagation starts, there is virtually no gradient to propagate back through the network, and
what little gradient exists keeps getting diluted as backpropagation progresses down from the top
layers, so there is really nothing left for the lower layers.

Solutions:
The following techniques are used in practice to avoid the vanishing and exploding gradient
problems.

• Glorot and He initialization
• Nonsaturating activation functions
• Batch Normalization
• Gradient Clipping

Glorot and He Initialization

Glorot and Bengio argued that the signal must flow properly in both the forward and backward
directions, so that it neither dies out (during backpropagation) nor explodes and saturates (during
forward propagation). This requires:
➢ The variance of a layer's outputs to equal the variance of its inputs.
➢ The gradients to have equal variance before and after flowing through a layer in the reverse
direction.
They proposed a weight-initialization strategy satisfying these requirements; He et al. later proposed
a similar scheme (He initialization) suited to ReLU activations.
Xavier Initialization / Glorot Initialization

Normal distribution with mean 0 and variance σ² = 1 / fan_avg

or a uniform distribution between −r and +r, with r = √(3 / fan_avg)

where fan_avg = (fan_in + fan_out) / 2, fan_in is the number of inputs of the layer, and fan_out
is the number of neurons (outputs) of the layer.

Initialization parameters for different activation functions:

Initialization | Activation functions          | σ² (normal distribution)
Glorot         | none, tanh, logistic, softmax | 1 / fan_avg
He             | ReLU and variants             | 2 / fan_in
LeCun          | SELU                          | 1 / fan_in


By default, Keras uses Glorot initialization with a uniform distribution. You can change this to
He initialization by setting kernel_initializer="he_uniform" or kernel_initializer="he_normal"
when creating a layer, like this:
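# a Dense layer using He initialization (the layer size is illustrative)
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")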

Nonsaturating Activation Functions


According to Glorot and Bengio, the second significant cause of the vanishing and exploding
gradients problem is an improper choice of activation function.
It turns out that activation functions other than sigmoid behave much better in deep neural
networks, in particular the ReLU activation function, mostly because it does not saturate for
positive values. Unfortunately, the ReLU activation function is not perfect. It suffers from a
problem known as the dying ReLUs: during training, some neurons effectively die, meaning they
stop outputting anything other than 0. In some cases, you may find that half of your network’s
neurons are dead, especially if you used a large learning rate. To solve this problem, you may want
to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as
LeakyReLUα(z) = max(αz, z)
The hyperparameter α defines how much the function "leaks": it is the slope of the function for z
< 0 and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go
into a long coma, but they have a chance to eventually wake up.
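In Keras, the leaky ReLU is available as a separate layer placed right after the layer it activates. A small sketch (the layer sizes and input shape are illustrative):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),   # slope used for z < 0
    keras.layers.Dense(10, activation="softmax")
])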
There is also the randomized leaky ReLU (RReLU), where α is picked randomly within a given
range during training and fixed to an average value during testing. It performs fairly well and seems
to act as a regularizer (reducing the risk of overfitting the training set).
Finally, in the parametric leaky ReLU (PReLU), α is learned during training (instead of being a
hyperparameter, it becomes a parameter that can be modified by backpropagation like any other).
PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets
it runs the risk of overfitting the training set.

Djork-Arné Clevert et al. proposed a new activation function called the exponential linear unit
(ELU) that outperformed all the ReLU variants in their experiments: training time was reduced,
and the neural network performed better on the test set.

The main drawback of the ELU activation function is that it is slower to compute than the ReLU
and its variants (due to the use of the exponential function), but during training this is compensated
by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU
network.
Batch Normalization
Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce
the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that
they won't come back during training.
In 2015, Sergey Ioffe and Christian Szegedy proposed a technique called Batch Normalization (BN)
to address the vanishing/exploding gradients problems. The technique consists of adding an
operation in the model just before or after the activation function of each hidden layer. This
operation simply zero-centers and normalizes each input, then scales and shifts the result using two
new parameter vectors per layer: one for scaling, the other for shifting. It lets the model learn the
optimal scale and mean of each layer's inputs, by evaluating the mean and standard deviation of
each input over the current mini-batch.
Batch Normalization algorithm:

1. μB = (1/mB) Σᵢ x(i)
2. σB² = (1/mB) Σᵢ (x(i) − μB)²
3. x̂(i) = (x(i) − μB) / √(σB² + ε)
4. z(i) = γ ⊗ x̂(i) + β

where
• μB is the vector of input means, evaluated over the whole mini-batch B (it contains one mean
per input).
• σB is the vector of input standard deviations, also evaluated over the whole mini-batch (it
contains one standard deviation per input).
• mB is the number of instances in the mini-batch.
• x̂(i) is the vector of zero-centered and normalized inputs for instance i.
• γ is the output scale parameter vector for the layer (it contains one scale parameter per input).
• ⊗ represents element-wise multiplication (each input is multiplied by its corresponding output
scale parameter).
• β is the output shift (offset) parameter vector for the layer (it contains one offset parameter
per input). Each input is offset by its corresponding shift parameter.
• ε is a tiny number that avoids division by zero (typically 10⁻⁵), called a smoothing term.
• z(i) is the output of the BN operation: a rescaled and shifted version of the inputs.
Batch Normalization does, however, add some complexity to the model. Moreover, there
is a runtime penalty: the neural network makes slower predictions due to the extra computations
required at each layer. So if you need predictions to be lightning-fast, you may want to check
how well plain ELU + He initialization perform before playing with Batch Normalization.

Implementing Batch Normalization with Keras
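A minimal tf.keras sketch that adds a BN layer after each hidden layer (the layer sizes and input shape are illustrative):

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])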


The authors of the BN paper argued in favor of adding the BN layers before the activation
functions, rather than after (as we just did). There is some debate about this, as it seems to depend
on the task, so this is one more thing you can experiment with to see which option works best on
your dataset. To add the BN layers before the activation functions, we must remove the activation
functions from the hidden layers and add them as separate layers after the BN layers.
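A sketch of the same model with BN before the activations; the bias terms can be dropped from the Dense layers (use_bias=False) because BN's shift parameter β makes them redundant:

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])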

Gradient Clipping
Another popular technique to lessen the exploding gradients problem is to simply clip the gradients
during backpropagation so that they never exceed some threshold. This is called Gradient Clipping.
In Keras, implementing Gradient Clipping is just a matter of setting the clipvalue or clipnorm
argument when creating an optimizer. For example:
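# clip every gradient component to [-1.0, 1.0]; the loss choice is illustrative
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)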

This will clip every component of the gradient vector to a value between −1.0 and 1.0. Note that
clipping by value may change the orientation of the gradient vector: for example, if the original
gradient vector is [0.9, 100.0], it points mostly in the direction of the second axis; but once you clip
it by value, you get [0.9, 1.0], which points roughly along the diagonal between the two axes. In
practice, however, this approach works well. If you want to ensure that Gradient Clipping does not
change the direction of the gradient vector, clip by norm by setting clipnorm instead of clipvalue.

1.6 Reusing Pretrained Layers


It is generally not a good idea to train a very large DNN from scratch: instead, you should try
to find an existing neural network that accomplishes a task similar to the one you are trying to
tackle, and reuse the lower layers of this network. This is called transfer learning. It will not
only speed up training considerably but will also require much less training data.
For example, suppose you have access to a DNN that was trained to classify pictures into
100 different categories, including animals, plants, vehicles, and everyday objects, and you now
want to train a DNN to classify specific types of vehicles. These tasks are very similar, even partly
overlapping, so you should try to reuse parts of the first network. The output layer of the original
model should usually be replaced, since it is most likely not useful for the new task, and it
may not even have the right number of outputs.
In some cases freezing the first layers may not be required, and in other cases some of the middle
layers are not useful either; which layers to freeze or reuse is usually found by trial and error.

Transfer Learning with Keras


First, you need to load model A, and create a new model based on model A’s layers.
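A minimal sketch (the saved-model file name and the new output layer are assumptions for illustration):

model_A = keras.models.load_model("my_model_A.h5")
# reuse all of model A's layers except its output layer
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))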

Note that model_A and model_B_on_A now share some layers. When you train model_B_on_A,
it will also affect model_A. If you want to avoid that, you must clone model_A before reusing its
layers: clone its architecture, then copy its weights, as follows:
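model_A_clone = keras.models.clone_model(model_A)   # copies the architecture only
model_A_clone.set_weights(model_A.get_weights())    # the weights must be copied separately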

Now you can train model_B_on_A to perform task B. But since the new output layer was
initialized randomly, it will make large errors, at least during the first few epochs, and the resulting
large error gradients may wreck the reused weights.
To avoid this, one approach is to freeze the reused layers during the first few epochs, giving the
new layer some time to learn reasonable weights. To do this, simply set every reused layer's
trainable attribute to False and compile the model:
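for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False          # freeze every reused layer

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])   # loss/optimizer choices are illustrative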
Next, we can train the model for a few epochs, then unfreeze the reused layers (which requires
compiling the model again) and continue training to fine-tune them for task B. After unfreezing
the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid
damaging the reused weights.
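A sketch of the unfreeze-and-fine-tune step (the learning-rate value is illustrative):

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True           # unfreeze the reused layers

optimizer = keras.optimizers.SGD(learning_rate=1e-4)   # much lower learning rate
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])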

Unsupervised Pretraining
If you must handle a task for which little labeled data is available and no model trained on a
similar task exists, you can resort to unsupervised pretraining. If you can gather plenty of
unlabeled training data, you can try to train the layers one by one, starting with the lowest layer
and going up, using an unsupervised feature-detector algorithm such as Restricted Boltzmann
Machines (RBMs) or autoencoders.
Each layer is trained on the output of the previously trained layers (all layers except the one being
trained are frozen). Once all layers have been trained this way, you can add the output layer for
your task and fine-tune the final network using supervised learning (i.e., with the labeled training
examples). At this point, you can unfreeze all the pretrained layers, or just some of the upper ones.

Fig: Unsupervised pretraining

Pretraining on an Auxiliary Task

If you do not have much labeled training data, one last option is to train a first neural network on
an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the
lower layers of that network for your actual task. The first neural network's lower layers will
learn feature detectors that will likely be reusable by the second neural network.
For example, if you want to build a system to recognize faces, you may only have a few pictures
of each individual, clearly not enough to train a good classifier, and gathering hundreds of pictures
of each person would not be practical. However, you could gather a lot of pictures of random
people on the web and train a first neural network to detect whether or not two different pictures
feature the same person. Such a network would learn good feature detectors for faces, so
reusing its lower layers would allow you to train a good face classifier using little training data.

1.7 Faster Optimizers


A huge speed boost when training a deep neural network comes from using a faster optimizer than
the regular Gradient Descent optimizer.
(Note: refer to the PPT for the full content; the formulae and diagrams below, from the textbook,
are for exam purposes.)
Momentum Optimization

Momentum optimization accumulates past gradients of the cost function J(θ) with regard to the
weights in a momentum vector:

1. m ← βm − η∇θJ(θ)
2. θ ← θ + m

where
m is the momentum vector,
η is the learning rate,
β is the momentum (decay) factor, typically set to 0.9.

Nesterov Accelerated Gradient

Nesterov Accelerated Gradient (NAG) measures the gradient slightly ahead, in the direction of
the momentum:

1. m ← βm − η∇θJ(θ + βm)
2. θ ← θ + m

Fig: Regular versus Nesterov momentum optimization


AdaGrad

AdaGrad scales down the gradient along the steepest dimensions by accumulating the squares of
past gradients:

1. s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)
2. θ ← θ − η∇θJ(θ) ⊘ √(s + ε)

The ⊗ symbol represents element-wise multiplication, the ⊘ symbol represents element-wise
division, and ε is a smoothing term to avoid division by zero, typically set to 10⁻¹⁰.

Fig: AdaGrad versus Gradient Descent

RMSProp

RMSProp fixes AdaGrad's tendency to stop too early by accumulating only the most recent
gradients, using exponential decay:

1. s ← βs + (1 − β)∇θJ(θ) ⊗ ∇θJ(θ)
2. θ ← θ − η∇θJ(θ) ⊘ √(s + ε)

The decay rate β is typically set to 0.9.

Adam and Nadam Optimization

Adam (adaptive moment estimation) combines momentum optimization and RMSProp:

1. m ← β₁m + (1 − β₁)∇θJ(θ)
2. s ← β₂s + (1 − β₂)∇θJ(θ) ⊗ ∇θJ(θ)
3. m̂ ← m / (1 − β₁ᵗ)
4. ŝ ← s / (1 − β₂ᵗ)
5. θ ← θ − η m̂ ⊘ √(ŝ + ε)

where t is the iteration number (starting at 1); steps 3 and 4 correct the bias of m and s toward 0
at the beginning of training. The momentum decay β₁ is typically set to 0.9 and the scaling decay
β₂ to 0.999.
Nadam optimization is simply Adam optimization plus the Nesterov trick, so it will often converge
slightly faster than Adam.
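In tf.keras, each of these optimizers is a one-line choice; the hyperparameter values shown below are the commonly used defaults:

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)                  # Momentum
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)   # NAG
optimizer = keras.optimizers.Adagrad(learning_rate=0.001)                            # AdaGrad
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)                   # RMSProp
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)     # Adam
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)    # Nadam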

1.8 Avoiding Overfitting Through Regularization


With so many parameters, the network has an incredible amount of freedom and can fit a huge
variety of complex datasets. But this great flexibility also means that it is prone to overfitting the
training set. The following regularization techniques can be used during model training to avoid
overfitting:

• Early Stopping
• Batch Normalization
• ℓ1 and ℓ2 Regularization
• Dropout
• Max-norm regularization.

Early Stopping
A very different way to regularize iterative learning algorithms such as Gradient Descent
is to stop training as soon as the validation error reaches a minimum. This is called early
stopping.
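In tf.keras, early stopping is implemented with a callback; a minimal sketch (the patience value and the data variables are illustrative):

early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[early_stopping_cb])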

ℓ1 and ℓ2 Regularization

➢ L1 regularization (as in Lasso regression) adds the absolute value of the magnitude of each
coefficient as a penalty term to the loss function. Essentially, L1 regularization penalizes the
absolute values of the weights, which drives many of them to exactly zero (a sparse solution).

➢ L2 regularization (as in Ridge regression) adds the squared magnitude of each coefficient as
the penalty term to the loss function. L2 regularization returns a non-sparse solution, since the
weights remain non-zero (although some may be close to 0). A snag to consider when using L2
regularization is that it is not robust to outliers: the squared terms blow up the differences in the
error of the outliers, which the regularization then attempts to fix by penalizing the weights.
Here is how to apply ℓ2 regularization to a Keras layer’s connection weights, using a regularization
factor of 0.01:
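# the layer size and activation are illustrative
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))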

For both ℓ1 and ℓ2 regularization at once, use keras.regularizers.l1_l2(), specifying both
regularization factors.

Dropout
Dropout is one of the most popular regularization techniques for deep neural networks. It is a
fairly simple algorithm: at every training step, every neuron (including the input neurons, but
always excluding the output neurons) has a probability p of being temporarily "dropped out,"
meaning it will be entirely ignored during this training step, but it may be active during the next
step. The hyperparameter p is called the dropout rate, and it is typically set to
50%. After training, neurons are not dropped anymore.

To implement dropout with Keras, you can use the keras.layers.Dropout layer. During training, it
randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep
probability. After training, it does nothing at all; it just passes the inputs on to the next layer. For
example, the following code applies dropout regularization before every Dense layer, using a
dropout rate of 0.2:
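# a sketch; the input shape and layer sizes are illustrative
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])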

Monte Carlo (MC) Dropout

In this technique, dropout is kept active at prediction time: averaging over multiple stochastic
predictions made with dropout on gives a Monte Carlo estimate that is generally more reliable
than the result of a single prediction made with dropout off.
Keras Implementation
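A minimal sketch (the number of passes and the X_test variable are assumptions); passing training=True keeps the Dropout layers active at prediction time:

import numpy as np

# 100 stochastic forward passes with dropout still active
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)   # Monte Carlo estimate of the class probabilities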

Max-Norm Regularization
Another regularization technique that is quite popular for neural networks is called max-norm
regularization: for each neuron, it constrains the weights w of the incoming connections such that
∥w∥₂ ≤ r, where r is the max-norm hyperparameter and ∥·∥₂ is the ℓ2 norm. Max-norm
regularization does not add a regularization loss term to the overall loss function. Instead, it is
typically implemented by computing ∥w∥₂ after each training step and rescaling w if needed
(w ← w · r/∥w∥₂). Reducing r increases the amount of regularization and helps reduce overfitting.


Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if
you are not using Batch Normalization). To implement max-norm regularization in Keras, just set
every hidden layer's kernel_constraint argument to a max_norm() constraint with the appropriate
max value, for example:
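# the layer size and max-norm value are illustrative
keras.layers.Dense(100, activation="elu",
                   kernel_initializer="he_normal",
                   kernel_constraint=keras.constraints.max_norm(1.))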
