
ARTIFICIAL INTELLIGENCE WITH DEEP LEARNING

(Course Code: A4460)


(Professional Elective-III)

Dr. Hathiram Nenavath


ME (IISc), PhD (NITW)
Associate Professor,
Dept. of Electronics and Communication Engg.,
Vardhaman College of Engineering (Autonomous),
Email: hathiram.iisc@gmail.com
Mobile: +91-9000747143
COURSE OUTCOMES (COS)
Upon successful completion of this course, the student will be able to:
❑ A4460.1: Understand the basic concepts of Artificial Intelligence and Machine Learning.
❑ A4460.2: Identify an efficient algorithm for Deep Models.
❑ A4460.3: Apply optimization strategies for large-scale applications.
❑ A4460.4: Develop the relations among Artificial Intelligence, Machine Learning and Deep Learning.
❑ A4460.5: Analyze state-of-the-art Deep Convolutional Neural Network structures.
COURSE ASSESSMENT

S No | Component                   | Duration in Hours | Marks | Total Marks | Weightage | Component-Wise Marks
1    | CIE - Theory: Test-1        | 1.5               | 40    |             |           |
2    | CIE - Theory: Test-2        | 1.5               | 40    | 100         | 0.3       | 30
3    | CIE - Alternate Assessment* | -                 | 20    |             |           |
5    | Semester End Exam (SEE)     | 3                 | 100   | 100         | 0.7       | 70
     | Total Marks                 |                   |       |             |           | 100

(CIE = Continuous Internal Evaluation; rows 1-3 together carry the CIE total of 100 marks, weighted 0.3 for a component-wise total of 30.)
COURSE SYLLABUS
1. Artificial Intelligence: Introduction to AI, Applications of AI, History of
AI, Types of AI, Intelligent Systems, and Intelligent Agents: Agents and
Environments, rationality, structure of agents, Problem Solving,
Knowledge representation.

2. Machine Learning Principles: Components of ML, Loss Function,


Learning Algorithms, Supervised Learning Algorithms, Unsupervised
Learning Algorithms, Stochastic Gradient Descent, Building a Machine
Learning Algorithm, Challenges Motivating Deep Learning. Feature
Learning: Dimensionality Reduction, PCA, LDA.
3. Deep Learning Review: Review of Deep Learning, Multi-layer
Perceptron, Back propagation, Deep Feed forward Networks: Gradient-
Based Learning, Hidden Units, Architecture Design, Back-Propagation
and Other Differentiation Algorithms.
Regularization for Deep Learning: Parameter Norm Penalties, Regularization and Under-Constrained Problems, Semi-Supervised Learning, Multi-Task Learning, Sparse Representations.
4. Optimization for Training Deep Models: Challenges in
Neural Network Optimization, Basic Algorithms, Parameter
Initialization Strategies, Algorithms with Adaptive Learning
Rates, Optimization Strategies and Meta-Algorithms.
Convolutional Networks: The Convolution Operation, Variants
of the Basic Convolution Function, Efficient Convolution
Algorithms, Unsupervised Features, Convolutional Networks
and the History of Deep Learning.
5. Recurrent and Recursive Nets: Recurrent Neural Networks,
Bidirectional RNNs, Encoder-Decoder Sequence-to-Sequence
Architectures, Deep Recurrent Networks, Recursive Neural
Networks.
Deep Convolutional Neural Network Structures: AlexNet, VGGNet, GoogLeNet, ResNet, DenseNet.
TEXT BOOKS
❑ Stuart J. Russell, Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd Edition, Pearson Education, 2010.
❑ Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, The MIT Press, London, 2016.
REFERENCE BOOKS
❑ Shivani Goel, Artificial Intelligence, Pearson Education.
❑ Patterson, Artificial Intelligence and Expert Systems, Pearson Education.
❑ Ethem Alpaydin, Introduction to Machine Learning, Eastern Economy Edition, Prentice Hall of India, 2005.
❑ Simon Haykin, Neural Networks and Learning Machines, 3rd Edition, Pearson Prentice Hall.
UNIT-III

PART-A

Deep Learning Review


REVIEW OF DEEP LEARNING
❖ Definition of Deep Learning:
❑ Deep learning is inspired by neural networks of
the brain to build learning machines which
discover rich and useful internal representations,
computed as a composition of learned features
and functions.

❑ This definition is a goal and does not say much about HOW we achieve that
– E.g., by adding priors to learn better high-level representations.
– The term deep learning is indeed aspirational, like 'AI' or 'machine learning'.
❖ CHARACTERISTICS OF DEEP LEARNING
1. A type of Machine Learning that improves with experience and data
2. Only viable approach to building AI systems in real-world environments
3. Obtains its power by a nested hierarchy of concepts
– each concept defined by its relationship to simpler concepts
▪ More abstract representations computed in terms of less abstract ones
IMAGE EXAMPLE
❑ Input is an array of pixel values
– First Layer: presence or absence of edges at
particular locations and orientations of image
– Second layer: detect motifs by spotting arrangements
of edges, regardless of small variations in edge
positions
– Third layer: assemble motifs into larger combinations that correspond to parts of familiar objects
– Subsequent layers: detect objects as combinations of
these parts
❑ Key aspect of deep learning:
– Layers not designed by human engineers
▪ Learned from data using a general purpose
learning procedure
EXAMPLE OF IMAGE DEEP LEARNING
❑ The function mapping pixels to object identity is complicated

❑ A series of hidden layers extracts increasingly abstract features

❑ The final decision is made by a simple classifier
❖ UNSUPERVISED REPRESENTATION LEARNING
❖ Autoencoder:
– Quintessential example of representation learning
– Encoder: converts the input into a representation with nice properties
– Decoder: converts the representation back to the input

[Fig. "Deep Style": new designs generated from learned representations]
❖ NATURAL LANGUAGE PROCESSING
❑ Training data
❑ Word2vec
– A one-hot vector is mapped to a dense vector of dimension 300
❑ Word embedding
– Similar words are close together
❖ DEEP LEARNING ROAD MAP:
❖ STUDY OF DEEP LEARNING
MULTILAYER PERCEPTRONS (MLPS)
❑ MLPs are the base of deep learning technology. It is also called Deep
Feed forward Networks or feed-forward neural networks. It belongs to a
class of feed-forward neural networks having various layers of
perceptrons. These perceptrons have various activation functions in them.


Fig. Multilayer Perceptron


❑ MLPs have fully connected input and output layers, with one or more hidden layers between these two layers.

❑ MLPs are mostly used to build image and speech recognition systems and other types of translation software.

❑ The working of MLPs starts by feeding the data in the input layer. The
neurons present in the layer form a graph to establish a connection that
passes in one direction.

❑ The weight of this input data is found to exist between the hidden layer
and the input layer. MLPs use activation functions to determine which
nodes are ready to fire.

❑ These activation functions include tanh function, sigmoid and ReLUs.

❑ MLPs are mainly used to train models to understand what kind of correlation the layers capture in order to achieve the desired output from the given data set.
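As an illustration of the forward flow described above, here is a minimal NumPy sketch of an MLP forward pass (layer sizes, weights and activations are hypothetical, chosen only for the example):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # input layer: 4 features
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)    # input -> hidden (5 units)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)    # hidden -> output

h = relu(W1 @ x + b1)       # hidden layer: affine transform + activation
y = sigmoid(W2 @ h + b2)    # output layer: e.g. probability for one class
print(y)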
BACKPROPAGATION
❑ Backpropagation is the learning mechanism that allows the Multilayer
Perceptron to iteratively adjust the weights in the network, with the goal
of minimizing the cost function.


Fig. Backpropagation
❑ In each iteration, after the weighted sums are forwarded through all
layers, the gradient of the Mean Squared Error is computed across all
input and output pairs.

❑ Then, to propagate it back, the weights of the first hidden layer are updated with the value of the gradient. That is how the gradient is propagated back to the starting point of the neural network.

❑ This process repeats until the gradient for each input-output pair has converged, meaning the newly computed gradient hasn't changed by more than a specified convergence threshold compared to the previous iteration.
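A minimal sketch of one backpropagation step for a one-hidden-layer network with squared error (sizes and learning rate are hypothetical; biases omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # one input example
t = np.array([1.0])                     # its target
W1 = rng.normal(scale=0.5, size=(5, 4))
W2 = rng.normal(scale=0.5, size=(1, 5))
lr = 0.1

# Forward pass: weighted sums are pushed through all layers.
h = np.tanh(W1 @ x)                     # hidden activations
y = W2 @ h                              # network output

# Backward pass: the error gradient is propagated layer by layer.
dy = y - t                              # dL/dy for L = 1/2 * (y - t)^2
gW2 = np.outer(dy, h)                   # gradient for output weights
dh = W2.T @ dy                          # error signal sent back to hidden layer
gW1 = np.outer(dh * (1 - h**2), x)      # tanh'(z) = 1 - tanh(z)^2

# Gradient-descent update of the weights.
W2 -= lr * gW2
W1 -= lr * gW1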
❖ IMPORTANCE OF BACKPROPAGATION

❑ Backpropagation is a technique for computing derivatives quickly:
– It is the key algorithm that makes training deep models computationally tractable
– For modern neural networks it can make gradient-descent training 10 million times faster relative to a naive implementation
▪ It is the difference between a model that takes a week to train instead of 200,000 years
DEEP FEED FORWARD NETWORKS: GRADIENT-BASED LEARNING
❑ Gradient: A gradient measures how much the output of a function
changes if you change the inputs a little bit.

❑ Gradient Descent: Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. In machine learning, gradient descent is used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.

❖ How Gradient Descent works:

❑ Imagine you have a machine learning problem and want to train
your algorithm with gradient descent to minimize your cost-
function J(w, b) and reach its local minimum by tweaking its
parameters (w and b).

❑ In the image above, the horizontal axes represent the parameters (w and b), while the cost function J(w, b) is represented on the vertical axis. The cost function shown is convex (bowl-shaped).

❑ We know we want to find the values of w and b that correspond to


the minimum of the cost function (marked with the red arrow). To
start finding the right values we initialize w and b with some random
numbers.

❑ Gradient descent then starts at that point (somewhere around the top
of our illustration), and it takes one step after another in the steepest
downside direction (i.e., from the top to the bottom of the
illustration) until it reaches the point where the cost function is as
small as possible.
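A minimal sketch of gradient descent on a simple convex cost (a hypothetical 1-D least-squares problem, not taken from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=50)   # data with true w=3, b=2

w, b = rng.normal(), rng.normal()    # initialize w and b with random numbers
lr = 0.1
for _ in range(200):
    err = (w * x + b) - y            # residuals of the current fit
    dw = (err * x).mean()            # dJ/dw for J = mean squared error / 2
    db = err.mean()                  # dJ/db
    w -= lr * dw                     # one step in the steepest-descent direction
    b -= lr * db
print(w, b)                          # approaches the minimum (w ~ 3, b ~ 2)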
HIDDEN UNITS
❑ A hidden unit takes in a vector/tensor, computes an affine transformation z and then applies an element-wise non-linear function g(z), where

z = Wᵀx + b

❑ Hidden units are differentiated from each other by their activation function g(z):

❖ReLU (Rectified Linear Unit):


❑ The ReLU is a piecewise linear function that will output the input
directly if it is positive, otherwise, it will output zero.

❑ It has become the default activation function for many types of


neural networks because a model that uses it is easier to train and often achieves better performance.
❖Sigmoid Function:
❑ The sigmoid function is a special form of the logistic function and
is usually denoted by S(x) or sig(x). It is given by:

S(x) = 1 / (1 + e^(−x))
❖ Softmax Function:
❑ The softmax formula is as follows:

softmax(z)_i = e^(z_i) / Σ_j e^(z_j)

where the z_i values are the elements of the input vector and can take any real value. The term on the bottom of the formula is the normalization term, which ensures that all the output values of the function sum to 1, thus constituting a valid probability distribution.
❖ Maxout Function:
❑ A Maxout activation function is a neuron activation function based on the mathematical function:

g(z) = max_i z_i,  the maximum over a group of k affine pieces z_i = xᵀW_i + b_i
❖ Hyperbolic Tangent Function: A hyperbolic tangent function is another type of activation function used in Deep Learning, which is a smoother, zero-centred function:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)),  where x is an input data point

❑ In terms of the sigmoid function:

tanh(x) = 2·S(2x) − 1
ARCHITECTURE DESIGN
❖ Basic design of a neural network: a neural network for supervised learning (with regularization)

Data set: {x_i, y_i}, i = 1,…,N
Prediction: f(x_i, W), with weights W = {W(1), W(2), …}
Per-example loss: L_i = L(f(x_i, W), y_i)
Total loss: L = (1/N) Σ_{i=1}^{N} L_i
Objective: L + R(W), where R(W) is a regularizer (norm penalty) on the weights
❖ DESIGN WITH TWO LAYERS
❑ Most networks are organized into groups of units called layers
– Layers are arranged in a chain structure
❑ Each layer is a function of the layer that preceded it
– First layer is given by h(1) = g(1)(W(1)ᵀx + b(1))
– Second layer is h(2) = g(2)(W(2)ᵀh(1) + b(2)), etc.
• Example: x = [x1, x2, x3]ᵀ, with the rows of the first-layer weight matrix written in matrix-multiplication notation:

W(1)_1 = [W11, W12, W13]ᵀ,  W(1)_2 = [W21, W22, W23]ᵀ,  W(1)_3 = [W31, W32, W33]ᵀ

❖ ARCHITECTURE TERMINOLOGY

❑ The word architecture refers to the overall structure of the network:
– How many units should it have?
– How should the units be connected to each other?

❑ Most neural networks are organized into groups of units called layers
– Most neural network architectures arrange these layers in a chain structure
– With each layer being a function of the layer that preceded it
❖ ADVANTAGE OF DEEPER NETWORKS

❑ Deeper networks have
– Far fewer units in each layer
– Far fewer parameters
– Often generalize well to the test set
– But are often more difficult to optimize

❑ The ideal network architecture must be found via experimentation guided by validation set error
❖ MAIN ARCHITECTURAL CONSIDERATIONS
1. Choice of depth of network
2. Choice of width of each layer
❖ GENERIC NEURAL ARCHITECTURES
UNIT-III

PART-B

Regularization for Deep Learning

WHAT IS REGULARIZATION?
❑ Central problem of ML is to design algorithms that
will perform well not just on training data but on new
inputs as well

❑ Regularization is:
– “any modification made to a learning algorithm to
reduce generalization error but not training error”
▪ Reduce test error even at expense of higher
training error

PHILOSOPHICAL VIEW
❑ Regularization is a recurrent issue in ML

❑ Hinton borrowed the concept into his neural network view:
– he used the shocking term "unlearning" to refer to it.

❑ To achieve greater effectiveness, one must not learn the idiosyncrasies of the data

❑ One must remain a little ignorant in order to discover the true behavior of the data
SOME GOALS OF REGULARIZATION
❑ Many forms of regularization available
➢ Major efforts are to develop better regularization
❑ Put extra constraints on objective function
➢ They are equivalent to a soft constraint on
parameter values
▪ Result in improved performance on test set
❑ Some goals of regularization
1. Encode prior knowledge
2. Express preference for simpler model
3. Needed to make underdetermined problem
determined
REGULARIZATION USING A SIMPLER MODEL
❖ REGULARIZING ESTIMATORS:
❑ In Deep Learning, regularization means regularizing estimators

❑ Involves trading increased bias for reduced variance
– A good regularizer reduces variance significantly while not overly increasing bias

❖ Model Types and Regularization:


❑ Three types of model families
1. Excludes the true data generating process
• Implies underfitting and inducing high bias
2. Matches the true data generating process
3. Overfits
• Includes true data generating process but also many other
processes
❑ Goal of regularization is to take model from third regime to second
❖ IMPORTANCE OF REGULARIZATION:
❑ Overly complex family does not necessarily include target function,
true data generating process, or even an approximation
❑ Most deep learning applications are ones where the true data generating process is outside the model family
– In complex domains such as images, audio sequences and text, the true generation process may involve the entire universe
• Fitting a square peg (the data generating process) into a round hole (the model family)
❖ WHAT IS THE BEST MODEL?
❑ Best fitting model obtained not by finding the right number of
parameters
❑ Instead, best fitting model is a large model that has been regularized
appropriately
❑ We review several strategies for how to create such a large, deep
regularized model
❖ REGULARIZATION STRATEGIES
1. Parameter Norm Penalties
– (L2- and L1- regularization)
2. Norm Penalties as Constrained Optimization
3. Regularization and Under-constrained Problems
4. Data Set Augmentation
5. Noise Robustness
6. Semi-supervised learning
7. Multi-task learning
8. Early Stopping
9. Parameter tying and parameter sharing
10. Sparse representations
11. Bagging and other ensemble methods
12. Dropout
13. Adversarial training
14. Tangent methods
PARAMETER NORM PENALTIES
❖ TOPICS IN PARAMETER NORM PENALTIES
1. Overview (limiting model capacity)
2. L2 parameter regularization
3. L1 regularization

LIMITING MODEL CAPACITY
❑ Regularization has been used for decades prior to advent
of deep learning

❑ Linear and logistic regression allow simple, straightforward and effective regularization strategies
– Adding a parameter norm penalty Ω(θ) to the objective function J:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

• where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω
– Setting α to 0 results in no regularization; larger values correspond to more regularization
NORM PENALTY
❑ When our training algorithm minimizes the regularized objective function

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

– it will decrease both the original objective J on the training data and some measure of the size of the parameters θ

❑ Different choices of the parameter norm Ω can result in different solutions being preferred
– We discuss the effects of various norms
NO PENALTY FOR BIASES
❑ The norm penalty Ω penalizes only the weights at each layer and leaves the biases unregularized
– Biases require less data to fit than weights
– Each weight specifies how two variables interact
• Fitting a weight requires observing both variables in a variety of conditions
❑ Each bias controls only a single variable
– We do not induce too much variance by leaving biases unregularized
❑ w indicates all weights affected by the norm penalty
❑ θ denotes both w and the biases
DIFFERENT OR SAME α FOR EACH LAYER?
❑ Sometimes it is desirable to use a separate penalty with a different α for each layer

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

❑ Invariance under a linear transformation T is one such case
– i.e., we want the neural net to perform the same when the inputs are transformed
• But this creates too many hyperparameters
• The search space is reduced by using the same hyperparameter for all layers
LINEAR TRANSFORMATION T
❑ Consider a simple linear transformation of the input

[Fig. Linear transformations of the input, shown for two variables (x, y) and for three variables (x, y, z)]
WEIGHT DECAY AND INVARIANCE
❑ Suppose we train two 2-layer networks
– First network: trained using the original data x = {x_i}, y = {y_k}
– Second network: input and/or target variables are transformed by one of the linear transformations

x_i → x̃_i = a·x_i + b        y_k → ỹ_k = c·y_k + d

❑ Consistency requires that we should obtain equivalent networks that differ only by a linear transformation of the weights
– For the first layer: w_ji → (1/a)·w_ji, and/or for the second layer: w_kj → c·w_kj
SIMPLE WEIGHT DECAY FAILS INVARIANCE
• Simple weight decay: Ẽ(w) = E(w) + (α/2)·wᵀw
• Treats all weights and biases on an equal footing
• While the resulting w_ji and w_kj should be treated differently
• Consequently the networks will have different weights and violate invariance
• We therefore look for a regularizer invariant under the linear transformations
– Such a regularizer is

(λ₁/2) Σ_{w∈W₁} w² + (λ₂/2) Σ_{w∈W₂} w²

• where W₁ is the set of weights in the first layer and
• W₂ is the set of weights in the second layer
– This regularizer remains unchanged under the weight transformations provided the parameters are rescaled using λ₁ → a^(1/2)·λ₁ and λ₂ → c^(−1/2)·λ₂
WEIGHT DECAY USED IN PRACTICE
❑ Because it can be expensive to search for the correct value of
multiple hyperparameters, it is still reasonable to use same weight
decay at all layers to reduce search space
L2 PARAMETER REGULARIZATION
• Simplest and most common kind
• Called weight decay
• Drives weights closer to the origin
– by adding a regularization term Ω(θ) = (1/2)·‖w‖₂² to the objective function
• In other communities also known as ridge regression or Tikhonov regularization
GRADIENT OF REGULARIZED OBJECTIVE
• Objective function (with no bias parameter):

J̃(w; X, y) = (α/2)·wᵀw + J(w; X, y)

• Corresponding parameter gradient:

∇_w J̃(w; X, y) = α·w + ∇_w J(w; X, y)

• To perform a single gradient step, apply the update:

w ← w − ε(α·w + ∇_w J(w; X, y))

• Written another way, the update is:

w ← (1 − εα)·w − ε·∇_w J(w; X, y)

– We have modified the learning rule to shrink w by a constant factor 1 − εα on each step
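A minimal sketch checking that the two forms of the weight-decay update above coincide (the quadratic J and all constants are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
alpha, eps = 0.1, 0.05           # weight decay coefficient, learning rate

def grad_J(w):                   # gradient of an unregularized cost J
    return w - np.ones(3)        # e.g. J(w) = 1/2 * ||w - 1||^2

step1 = w - eps * (alpha * w + grad_J(w))        # w <- w - eps*(alpha*w + grad J)
step2 = (1 - eps * alpha) * w - eps * grad_J(w)  # equivalent "shrink, then step" form
print(np.allclose(step1, step2))                 # True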
TO STUDY EFFECT ON ENTIRE TRAINING
❑ Make a quadratic approximation to the objective function in the neighborhood of the minimal unregularized cost w* = arg min_w J(w)

❑ The approximation is given by

Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀH(w − w*)

where H is the Hessian matrix of J with respect to w, evaluated at w*
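Completing the step with the standard closed-form result from the Goodfellow et al. text, stated here for reference:

❑ Adding the weight decay term (α/2)·wᵀw and setting the gradient of the regularized approximation to zero gives the regularized minimum

w̃ = (H + αI)⁻¹ H w*

– As α → 0, w̃ approaches w*; as α grows, components of w* along directions of small curvature (small Hessian eigenvalues) are shrunk most strongly toward zero.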
EFFECT OF L2 REGULARIZATION ON OPTIMAL W

[Fig. Contours of the unregularized objective J(w; X, y) (ellipses) and of the L2 regularizer (α/2)·wᵀw (circles); their sum J̃(w; X, y) is minimized at a point w̃ between w* and the origin]

Along w1, the eigenvalue of the Hessian of J is small: J does not increase much when moving horizontally away from w*. Because J does not have a strong preference along this direction, the regularizer has a strong effect on this axis and pulls w1 close to 0. Along w2, J is very sensitive to movements away from w*: the corresponding eigenvalue is large, indicating high curvature, so weight decay affects the position of w2 relatively little.
L1 REGULARIZATION
• L2 weight decay is the most common form of weight decay
• There are other ways to penalize model parameter size
• L1 regularization is defined as

Ω(θ) = ‖w‖₁ = Σ_i |w_i|

– which sums the absolute values of the parameters

Image Source: https://zhuanlan.zhihu.com/p/28023308
SPARSITY AND FEATURE SELECTION
❑ The sparsity property induced by L1 regularization has
been used extensively as a feature selection mechanism
– Feature selection simplifies an ML problem by choosing
subset of available features

❑ LASSO (Least Absolute Shrinkage and Selection Operator) integrates an L1 penalty with a linear model and a least-squares cost function
– The L1 penalty causes a subset of the weights to become zero, suggesting that those features can be discarded
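A minimal sketch of LASSO-based feature selection (using scikit-learn's Lasso on synthetic data in which only two of ten features matter; all values are hypothetical):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1)        # alpha is the L1 penalty strength
model.fit(X, y)
print(model.coef_)              # most coefficients are driven exactly to zero,
                                # suggesting those features can be discarded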
SPARSITY WITH LASSO CONSTRAINT
• With q = 1 and λ sufficiently large, some of the coefficients w_j are driven to zero
• Leads to a sparse model
– where the corresponding basis functions play no role
• The origin of sparsity is illustrated here:

[Fig. Contours of the unregularized error function together with the Lasso constraint region. The quadratic (unconstrained) solution has w0* and w1* nonzero; minimization with the Lasso regularizer gives a sparse solution with w1* = 0]
REGULARIZATION WITH LINEAR MODELS

[Fig. Norm regularization applied to linear models]
REGULARIZATION AND UNDER-CONSTRAINED PROBLEMS
Under-constrained closed-form linear regression:

y(x, w) = Σ_{j=0}^{M−1} w_j·φ_j(x) = wᵀφ(x)

w_ML = Φ⁺t,  with Φ⁺ = (ΦᵀΦ)⁻¹Φᵀ

where Φ is the N × M design matrix (N samples × M dimensions) whose n-th row is [φ₀(x_n), φ₁(x_n), …, φ_{M−1}(x_n)]
UNDER-CONSTRAINED LOGISTIC REGRESSION

• Logistic regression with linearly separable classes
– The task is under-determined

w^(τ+1) = w^(τ) − η·∇E_n,   ∇E_n = (y_n − t_n)·φ_n,   where y_n = σ(wᵀφ_n)

• If w separates the classes perfectly, so will 2w, and with higher likelihood
– SGD will continually increase the magnitude of w and will never halt
• The solution is left to the overflow handler
SOLUTION FOR UNDER-CONSTRAINED ITERATIVE METHODS
❑ Logistic regression with linearly separable classes is under-determined

w^(τ+1) = w^(τ) − η·∇E_n,   ∇E_n = (y_n − t_n)·φ_n,   where y_n = σ(wᵀφ_n)

❑ Solution: most forms of regularization guarantee convergence of iterative methods
– e.g., weight decay

Without regularization:  ∇E = − Σ_{n=1}^{N} {t_n − wᵀφ(x_n)}·φ(x_n)ᵀ

With regularization:     ∇E = − Σ_{n=1}^{N} {t_n − wᵀφ(x_n)}·φ(x_n)ᵀ + λw

– Training stops when the slope of the likelihood equals the weight decay coefficient
REGULARIZATION IN LINEAR ALGEBRA PROBLEMS

Φ⁺ = (ΦᵀΦ)⁻¹Φᵀ

The Moore–Penrose pseudoinverse can itself be viewed as the limit of regularized (weight-decayed) least squares: Φ⁺ = lim_{α→0} (ΦᵀΦ + αI)⁻¹Φᵀ.
SEMI-SUPERVISED LEARNING
TASK OF SEMI-SUPERVISED LEARNING
❑ Both unlabeled examples from P(x) and labeled
examples from P(x,y) are used to estimate P(y|x)
or predict y from x.
❑ In the context of deep learning it refers to learning
a representation h=f(x).
❑ The goal is to learn a representation so that
examples from the same class have similar
representations.
HOW SEMI-SUPERVISED SUCCEEDS
❑ p(x): a mixture over three components, one per class y ∈ {1, 2, 3}
– If the components are well-separated:
• modeling p(x) reveals where each component is
– A single labeled example per class is then enough to learn p(x|y)
– Which we can use to predict p(y|x)

[Fig. Example with the classes capital letters, small letters and digits, where x = number of black pixels: p(x) has three modes, and p(x|y) is a univariate Gaussian for each of y = 1, 2, 3]
HOW UNSUPERVISED LEARNING HELPS
❑ Unsupervised learning can provide useful clues for how to group examples in representation space

❑ Examples that cluster tightly in the input space should be mapped to similar representations

❑ A linear classifier in the new space may achieve better generalization

❑ A variant is the application of PCA as a preprocessing step before applying a classifier to the projected data
SHARING PARAMETERS
❑ Instead of having separate unsupervised and supervised components in the model, construct models in which a generative model of either P(x) or P(x,y) shares parameters with a discriminative model of P(y|x).

❑ One can then trade-off the supervised criterion


–log P(y|x) with the unsupervised or generative one
(such as –log P(x) or –log P(x,y))
– The generative criterion then expresses a prior belief
about the solution to the supervised problem
• viz., structure of P(x) is connected to structure of P(y|x) in a
way that is captured by shared parameterization
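A minimal sketch of this trade-off, written as a single combined loss (the probabilities and the weighting lam are hypothetical placeholders for values a shared-parameter model would produce):

import numpy as np

def total_loss(log_p_y_given_x, log_p_x, lam=0.5):
    # Supervised term: -log P(y|x); generative term: -log P(x).
    # lam controls how strongly the prior belief about the structure
    # of P(x) constrains the supervised problem.
    return -log_p_y_given_x + lam * (-log_p_x)

print(total_loss(np.log(0.9), np.log(0.2), lam=0.5))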
MULTI-TASK LEARNING
SHARING PARAMETERS OVER TASKS
❑ Multi-task learning is a way to improve generalization by pooling the examples arising out of several tasks
– Examples can be seen as providing soft constraints on the parameters

❑ In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well, sharing part of the model across tasks constrains that part towards good values
COMMON FORM OF MULTITASK LEARNING
❑ Different supervised tasks, predicting y(i) given x
❑ Share the same input x, as well as some
intermediate representation h(shared) capturing a
common pool of factors
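A minimal NumPy sketch of this sharing pattern: one shared trunk producing h(shared), with separate heads for each task (sizes are hypothetical; forward pass only):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                  # common input x
W_shared = rng.normal(size=(16, 8))     # shared lower-layer parameters
W_task1 = rng.normal(size=(3, 16))      # task-specific head for y(1)
W_task2 = rng.normal(size=(5, 16))      # task-specific head for y(2)

h_shared = np.tanh(W_shared @ x)        # intermediate representation h(shared)
y1 = W_task1 @ h_shared                 # prediction for task 1
y2 = W_task2 @ h_shared                 # prediction for task 2
# Gradients from both task losses would flow into W_shared, pooling
# the examples of all tasks to constrain the shared parameters.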
EX: AUTONOMOUS NAVIGATION

[Fig. A single input x feeding three task-specific outputs y(1), y(2), y(3)]
COMMON MULTI-TASK SITUATION
❑ Common input but different target random variables
• Lower layers (whether feedforward, or including a generative component with downward arrows) can be shared across such tasks
• Task-specific parameters h(1), h(2) can be learned on top of those, yielding a shared representation h(shared)
• A common pool of factors explains the variations of the input x, while each task is associated with a subset of these factors

MULTI-TASK IN UNSUPERVISED LEARNING
❑ In the unsupervised learning context:
• some of the top-level factors (h(3)) are associated with none of the output tasks
• These are factors that explain some of the input variations but are not relevant for predicting h(1), h(2)
MODEL CAN BE DIVIDED INTO TWO PARTS
1. Task specific parameters
– Which only benefit from the examples of their task to
achieve good generalization
• These are the upper layers of the neural network
2. Generic parameters
– Shared across all tasks
• Which benefit from the pooled data of all tasks
• These are the lower levels of the neural network
BENEFITS OF MULTI-TASKING
❑ Improved generalization and generalization error bounds
– achieved due to the shared parameters
• for which statistical strength can be greatly improved
– in proportion to the increased number of examples for the shared parameters, compared to the scenario of single-task models
❑ From the point of view of deep learning, the


underlying prior belief is the following:
– Among the factors that explain the variations observed in
the data associated with different tasks, some are shared
across two or more tasks
SPARSE REPRESENTATIONS
DIRECT AND INDIRECT PENALTIES
❑ Direct Penalty
– Weight decay penalizes parameters directly
– L1 penalization induces sparse parameterization
❑ Indirect Penalty
– Another strategy is to place a penalty on the activations of the units in the neural network
➢ Encouraging their activations to be sparse
➢ This imposes a complicated penalty on the model parameters
– Representational sparsity describes a representation in which many of the elements are close to zero
DEFINITION NEEDS MATRIX NOTATION
• Consider a network drawn in two different styles
– Matrix W describes the mapping from x to h (input layer to hidden layer)
– Vector w describes the mapping from h to y (hidden layer to output layer)
– Intercept parameters b are omitted
• Layer 1 (hidden layer): h is computed by the function f(1)(x; W, c) = g(Wᵀx + c)
– c are bias variables
• Layer 2 (output layer) computes f(2)(h; w, b) = hᵀw + b
– w are linear regression weights
– The output is linear regression applied to h rather than to x
• The complete model is f(x; W, c, w, b) = f(2)(f(1)(x))
DIRECT VERSUS REPRESENTATIONAL SPARSITY
– Parameter regularization (e.g., with W = A): the weight matrix W itself is sparse
– Representational regularization (e.g., with W = B): the representation vector h is sparse, while W is not sparse
REPRESENTATIONAL REGULARIZATION
❑ Accomplished using the same sorts of mechanisms used in parameter regularization
❑ Norm penalty regularization of the representation
– Performed by adding to the loss function J a norm penalty Ω(h) on the representation
• The regularized loss function is

J̃(θ; X, y) = J(θ; X, y) + αΩ(h),   where α ∈ [0, ∞)

• An L1 penalty term induces sparsity:

Ω(h) = ‖h‖₁ = Σ_i |h_i|
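A minimal sketch of this representational penalty: the data-fit loss J plus α·‖h‖₁ computed on the hidden activations (network sizes and α are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)
t = 1.0                                  # target for this example
W, c = rng.normal(size=(8, 6)), np.zeros(8)   # x -> h
w, b = rng.normal(size=8), 0.0                # h -> y
alpha = 0.01

h = np.tanh(W @ x + c)                   # representation h = f(1)(x)
y = h @ w + b                            # output f(2)(h)
J = 0.5 * (y - t) ** 2                   # data-fit loss
J_reg = J + alpha * np.abs(h).sum()      # add alpha * ||h||_1 penalty on h
print(J, J_reg)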
PLACING CONSTRAINT ON ACTIVATION VALUES
• Another approach to representational sparsity:
– place a hard constraint on the activation values
• Called orthogonal matching pursuit (OMP)
– Encode x with the h that solves the constrained optimization problem:

arg min_{h, ‖h‖₀ < k} ‖x − Wh‖²

• where ‖h‖₀ is the number of nonzero entries of h
• The problem can be solved efficiently when W is constrained to be orthogonal
– Often called OMP-k, where k specifies the number of nonzero entries allowed
• OMP-1 is very effective for deep architectures
• Essentially, any model with hidden units can be made sparse:
– sparsity regularization is used in many contexts
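A minimal sketch of OMP-k encoding using scikit-learn's OrthogonalMatchingPursuit (the dictionary and sparse code are synthetic; n_nonzero_coefs plays the role of k):

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))            # dictionary: 50 atoms in 20 dimensions
h_true = np.zeros(50)
h_true[[3, 17]] = [2.0, -1.5]            # a truly 2-sparse code
x = W @ h_true                           # observed signal

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2, fit_intercept=False)
omp.fit(W, x)                            # min ||x - W h||^2 with at most k nonzeros
h = omp.coef_
print(np.nonzero(h)[0])                  # typically recovers the support {3, 17}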
