Continuous Internal Evaluation (CIE):
1. Theory Test-1: 1.5 hrs, 40 marks
2. Theory Test-2: 1.5 hrs, 40 marks
3. Alternate Assessment: 20 marks
Total Marks: 100
COURSE SYLLABUS
1. Artificial Intelligence: Introduction to AI, Applications of AI, History of
AI, Types of AI, Intelligent Systems, and Intelligent Agents: Agents and
Environments, rationality, structure of agents, Problem Solving,
Knowledge representation.
PART-A
❑ A series of hidden layers extracts increasingly abstract features.
❑ MLPs are commonly used to build image recognition, speech recognition, and
machine translation systems.
❑ An MLP works by feeding the data into the input layer. The neurons of one
layer connect to the neurons of the next, forming a graph through which
signals pass in one direction (feedforward).
❑ Weights are attached to the connections between the input layer and the
hidden layer. MLPs use activation functions to determine which nodes are
ready to fire.
❑ MLPs are mainly trained so that the model learns what correlations the
layers capture in order to achieve the desired output for the given data set.
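As a rough sketch of the feedforward pass described above (layer sizes, weight values, and the ReLU activation are all hypothetical choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)          # activation: decides which nodes "fire"

def forward(x, weights, biases):
    # Signals pass in one direction: layer by layer, affine then non-linear.
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

x = rng.normal(size=3)                              # input layer (3 features)
weights = [rng.normal(size=(4, 3)),                 # input -> hidden
           rng.normal(size=(2, 4))]                 # hidden -> output
biases = [np.zeros(4), np.zeros(2)]
print(forward(x, weights, biases).shape)            # (2,)
```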
BACKPROPAGATION
❑ Backpropagation is the learning mechanism that allows the Multilayer
Perceptron to iteratively adjust the weights in the network, with the goal
of minimizing the cost function.
Fig. Backpropagation
❑ In each iteration, after the weighted sums are forwarded through all
layers, the gradient of the Mean Squared Error is computed across all
input and output pairs.
❑ Then, to propagate it back, the weights of each layer, down to the first
hidden layer, are updated using the value of the gradient. That is how the
gradient is propagated back to the starting point of the neural network.
❑ This process continues until the gradient for each input-output pair has
converged, meaning the newly computed gradient has not changed by more
than a specified convergence threshold compared to the previous iteration.
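The loop described above can be sketched on the simplest possible network, a single linear unit trained with MSE; the data, learning rate, and convergence threshold are hypothetical values for illustration:

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])          # true relation: y = 2x + 1

w, b = 0.0, 0.0
lr, tol = 0.05, 1e-9
prev_grad = None
for _ in range(10000):
    pred = X[:, 0] * w + b                  # forward the weighted sums
    err = pred - y
    grad_w = 2 * np.mean(err * X[:, 0])     # MSE gradient over all pairs
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w                        # propagate the correction back
    b -= lr * grad_b
    g = np.array([grad_w, grad_b])
    if prev_grad is not None and np.linalg.norm(g - prev_grad) < tol:
        break                               # gradient has converged
    prev_grad = g
print(round(w, 3), round(b, 3))             # ≈ 2.0 and 1.0
```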
❖ IMPORTANCE OF BACKPROPAGATION
❑ Imagine you have a machine learning problem and want to train
your algorithm with gradient descent to minimize your cost-
function J(w, b) and reach its local minimum by tweaking its
parameters (w and b).
❑ In the illustration, the horizontal axes represent the parameters
(w and b), while the cost function J(w, b) is represented on the
vertical axis. In this picture the cost function is convex (bowl-shaped),
so gradient descent can reach its minimum.
❑ Gradient descent then starts at that point (somewhere around the top
of our illustration), and it takes one step after another in the direction
of steepest descent (i.e., from the top toward the bottom of the
illustration) until it reaches the point where the cost function is as
small as possible.
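A minimal sketch of this descent, assuming a toy convex quadratic J(w, b) with a known minimum at (3, -1); both the function and the step size are hypothetical:

```python
import numpy as np

def J(w, b):
    # Convex "bowl" cost with minimum at w=3, b=-1.
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def grad_J(w, b):
    return np.array([2 * (w - 3.0), 2 * (b + 1.0)])

theta = np.array([10.0, 8.0])             # start near the "top" of the bowl
lr = 0.1
for _ in range(200):
    theta = theta - lr * grad_J(*theta)   # one step in steepest descent
print(theta, J(*theta))                   # theta approaches [3, -1]
```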
HIDDEN UNITS
❑ A hidden unit takes in a vector/tensor, computes an affine
transformation z and then applies an element-wise non-linear
function g(z), where z = Wᵀx + b.
❑ Hidden units are differentiated from each other by their activation
function, g(z):
❖Sigmoid Function:
  S(x) = 1 / (1 + e^(−x))
❖Softmax Function:
❑ The softmax formula is as follows:
  softmax(z)_i = e^(z_i) / Σ_j e^(z_j)
where all the z_i values are the elements of the input vector and can
take any real value. The term on the bottom of the formula is the
normalization term, which ensures that all the output values of the
function sum to 1, thus constituting a valid probability distribution.
❖Maxout Function:
❑ A Maxout activation function is a neuron activation function
based on the mathematical function:
  g(z) = max_i z_i
i.e., the maximum over a group of affine pieces such as
w_1ᵀx + b_1 and w_2ᵀx + b_2.
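The three activation functions above, sketched in NumPy; the max-subtraction trick in softmax and the two-piece maxout are implementation choices, not from the slides:

```python
import numpy as np

def sigmoid(x):
    # S(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()             # normalization term: outputs sum to 1

def maxout(x, Ws, bs):
    # Maximum over the affine pieces w_i^T x + b_i.
    return max(w @ x + b for w, b in zip(Ws, bs))

print(sigmoid(0.0))                                   # 0.5
print(softmax(np.array([1.0, 2.0, 3.0])).sum())       # ≈ 1.0
```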
ARCHITECTURE DESIGN
❖Basic design of a neural network: Neural network for
supervised learning (with regularization)
❑ For training pairs (x_i, y_i), i = 1,..,N, the network computes
predictions f(x_i, W) with weights W = {W(1), W(2), ...}
❑ The data loss averages the per-example losses L_i:
  L = (1/N) Σ_{i=1}^{N} L_i(f(x_i, W), y_i)
❑ The full objective adds a regularizer (norm penalty) on the
weights: L + R(W)
❖DESIGN WITH TWO LAYERS
❑ Most networks are organized into groups of
units called layers
– Layers are arranged in a chain structure
❑ Each layer is a function of the layer that preceded it
– First layer is given by h(1) = g(1)(W(1)ᵀx + b(1))
– Second layer is h(2) = g(2)(W(2)ᵀh(1) + b(2)), etc.
• Example: x = [x1, x2, x3]ᵀ
  W_1(1) = [W_11 W_12 W_13]ᵀ, W_2(1) = [W_21 W_22 W_23]ᵀ, W_3(1) = [W_31 W_32 W_33]ᵀ
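A direct transcription of the two-layer chain above; the tanh non-linearity and the numeric weight values are placeholder choices for the sketch:

```python
import numpy as np

# h(1) = g(1)(W(1) x + b(1)),  h(2) = g(2)(W(2) h(1) + b(2)).
g = np.tanh                        # any element-wise non-linearity

x = np.array([1.0, 2.0, 3.0])      # x = [x1, x2, x3]^T
W1 = np.array([[0.1, 0.2, 0.3],    # row W_1(1) = [W11 W12 W13]
               [0.4, 0.5, 0.6],    # row W_2(1) = [W21 W22 W23]
               [0.7, 0.8, 0.9]])   # row W_3(1) = [W31 W32 W33]
b1 = np.zeros(3)
W2 = np.array([[1.0, -1.0, 0.5]])
b2 = np.zeros(1)

h1 = g(W1 @ x + b1)                # first layer
h2 = g(W2 @ h1 + b2)               # second layer takes h1, not x
print(h1.shape, h2.shape)          # (3,) (1,)
```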
PART-B
WHAT IS REGULARIZATION?
❑ Central problem of ML is to design algorithms that
will perform well not just on training data but on new
inputs as well
❑ Regularization is:
– “any modification made to a learning algorithm to
reduce generalization error but not training error”
▪ Reduce test error even at expense of higher
training error
PHILOSOPHICAL VIEW
❑ Regularization is a recurrent issue in ML
LIMITING MODEL CAPACITY
❑ Regularization has been used for decades prior to advent
of deep learning
Fig. Limiting capacity: a model with two variables (x and y) vs. a model
with three variables (x, y and z)
WEIGHT DECAY AND INVARIANCE
❑ Suppose we train two 2-layer networks
– First network: trained using original data: x={xi}, y={yk}
– Second network: input and/or target variables are
transformed by one of the linear transformations
  x_i → x̃_i = a·x_i + b    y_k → ỹ_k = c·y_k + d
– An equivalent network is obtained if the weights are transformed
accordingly:
  w_ji → (1/a)·w_ji    and/or    w_kj → c·w_kj
SIMPLE WEIGHT DECAY FAILS INVARIANCE
• Simple weight decay adds a quadratic penalty: Ẽ(w) = E(w) + (α/2)·wᵀw
• Equivalently, the regularized objective is
  J̃(w; X, y) = (α/2)·wᵀw + J(w; X, y)
Fig. Contours of the unregularized objective J(w; X, y) and of the
regularizer wᵀw.
❑ Along w1, the eigenvalue of the Hessian of J is small: J does not
increase much when moving horizontally away from w*. Because J does not
have a strong preference along this direction, the regularizer has a
strong effect on this axis and pulls w1 close to 0.
❑ Along w2, J is very sensitive to movements away from w*: the
corresponding eigenvalue is large, indicating high curvature. As a
result, weight decay affects the position of w2 relatively little.
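The asymmetric effect described above can be reproduced numerically, assuming a toy quadratic J whose Hessian has a small eigenvalue along w1 and a large one along w2 (all numeric values below are hypothetical):

```python
import numpy as np

alpha, lr = 0.5, 0.1
w_star = np.array([5.0, 5.0])      # unregularized optimum w*
H = np.diag([0.01, 10.0])          # low curvature along w1, high along w2

def grad_J(w):
    # Gradient of the quadratic J(w) = 0.5 (w - w*)^T H (w - w*).
    return H @ (w - w_star)

w = np.zeros(2)
for _ in range(5000):
    # Weight decay contributes alpha*w to the gradient each step.
    w -= lr * (grad_J(w) + alpha * w)
print(w)   # w1 is pulled near 0; w2 stays near 5
```

The fixed point is w = (H + αI)⁻¹ H w*: the low-curvature coordinate collapses toward zero while the high-curvature one barely moves, exactly the behavior in the contour figure.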
L1 REGULARIZATION
• L2 weight decay is the most common weight penalty
• There are other ways to penalize the size of the model parameters
• L1 regularization is defined as
  Ω(w) = ||w||_1 = Σ_i |w_i|
Fig. Contours of the unregularized error function and of the L1
constraint region.
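Why the L1 penalty induces sparsity can be seen with proximal gradient descent (ISTA), whose soft-threshold step sets small coordinates exactly to zero, whereas L2 only shrinks them; the toy objective and α below are hypothetical:

```python
import numpy as np

# Minimize 0.5*||w - w0||^2 + alpha*||w||_1 by ISTA.
w0 = np.array([3.0, 0.2, -0.05, 1.5])
alpha, lr = 0.5, 0.5

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: shrinks, and zeroes |v| <= t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros_like(w0)
for _ in range(200):
    w = soft_threshold(w - lr * (w - w0), lr * alpha)  # ISTA step
print(w)   # small coordinates of w0 become exactly zero
```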
REGULARIZATION WITH LINEAR MODELS
Norm Regularization: J̃(w; X, y) = J(w; X, y) + α·Ω(w), where Ω(w) is a
parameter norm penalty such as the L2 or L1 norm.
REGULARIZATION AND UNDER-CONSTRAINED PROBLEMS
Under-constrained closed-form linear regression
❑ Without regularization, the gradient of the sum-of-squares error is
  ∇E = − Σ_{n=1}^{N} ( t_n − wᵀφ(x_n) ) φ(x_n)ᵀ
❑ With regularization, a weight-decay term is added:
  ∇E = − Σ_{n=1}^{N} ( t_n − wᵀφ(x_n) ) φ(x_n)ᵀ + λw
❑ giving the closed-form solution
  w = (ΦᵀΦ + λI)⁻¹ Φᵀ t
❑ The added λI makes ΦᵀΦ + λI invertible even when the unregularized
problem is under-constrained
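The regularized closed-form solution w = (ΦᵀΦ + λI)⁻¹Φᵀt can be checked numerically; Φ, t, and λ below are arbitrary toy values chosen so that the unregularized problem is under-constrained:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(3, 5))      # N=3 examples, 5 basis functions:
                                   # more unknowns than equations
t = rng.normal(size=3)
lam = 0.1

# Phi^T Phi alone is singular (rank <= 3 < 5), but adding lam*I
# makes the system solvable.
A = Phi.T @ Phi + lam * np.eye(5)
w = np.linalg.solve(A, Phi.T @ t)
print(w.shape)                     # (5,)
```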
SEMI-SUPERVISED LEARNING
TASK OF SEMI-SUPERVISED LEARNING
❑ Both unlabeled examples from P(x) and labeled
examples from P(x,y) are used to estimate P(y|x)
or predict y from x.
❑ In the context of deep learning it refers to learning
a representation h=f(x).
❑ The goal is to learn a representation so that
examples from the same class have similar
representations.
HOW SEMI-SUPERVISED SUCCEEDS
❑ p(x): a mixture over three components, y ∈ {1, 2, 3}
– If components well-separated:
• modeling p(x) reveals where each component is
– A single labeled example per class enough to learn p(x|y)
– Which we can use to predict p(y|x)
– Matrix W describes the mapping from x to the hidden layer h
Fig. Two sparsity styles: in one, the weight matrix W is sparse; in the
other, W is not sparse but the representation vector h is sparse.
REPRESENTATIONAL REGULARIZATION
❑ Accomplished using same sort of mechanisms
used in parameter regularization
❑ Norm penalty regularization of representation
– Performed by adding to the loss function J, a norm
penalty on the representation.
• The regularized loss function is
  J̃(θ; X, y) = J(θ; X, y) + α·Ω(h)
where α ∈ [0, ∞)
• An L1 penalty term induces sparsity: Ω(h) = ||h||_1 = Σ_i |h_i|
PLACING CONSTRAINT ON ACTIVATION VALUES
• Another approach to representational sparsity:
– place a hard constraint on activation values
• Called Orthogonal matching pursuit (OMP)
– Encode x with the h that solves the constrained
optimization problem:
  arg min_h ||x − Wh||²  subject to  ||h||_0 < k
• where ||h||_0 is the number of non-zero entries of h
• The problem is solved efficiently when W is orthogonal
– Often called OMP-k, where k is the allowed number of non-zero entries
• OMP-1 is very effective for deep architectures
• Essentially, any model with hidden units can be
made sparse:
– sparsity regularization is used in many contexts
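A minimal greedy sketch of OMP-k (the dictionary, signal, and k are toy choices; a real implementation would add normalization and stopping criteria):

```python
import numpy as np

def omp(W, x, k):
    # Encode x with an h having at most k non-zero entries: repeatedly
    # pick the column of W most correlated with the residual, then
    # refit h on the chosen columns by least squares.
    n_atoms = W.shape[1]
    support, h = [], np.zeros(n_atoms)
    residual = x.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(W.T @ residual)))  # best matching atom
        support.append(j)
        coef, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)
        h[:] = 0.0
        h[support] = coef
        residual = x - W @ h
    return h

W = np.eye(4)                       # orthogonal dictionary (identity, for the sketch)
x = np.array([0.1, 2.0, -3.0, 0.2])
h = omp(W, x, k=2)
print(h)   # only entries for the two largest projections are non-zero: 2 and -3
```

With an orthogonal W, each greedy pick is exact, so OMP-k keeps the k largest projections of x onto the dictionary.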