▪ A basic unit of the form $y = f(\mathbf{w}^\top \mathbf{x})$ is known as the "Perceptron" (not to be confused with the Perceptron "algorithm", which learns a linear classification model)
▪ Such a unit cannot, however, learn nonlinear functions or nonlinear decision boundaries (although it can be kernelized to make it nonlinear); a minimal sketch of the unit follows
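A minimal sketch of such a unit (the sigmoid used for $f$ here is an assumed choice, not fixed by the slide):

```python
import numpy as np

def perceptron_unit(w, x, f=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """A single unit y = f(w^T x); sigmoid for f is an illustrative choice."""
    return f(w @ x)

y = perceptron_unit(np.array([0.5, -1.0]), np.array([2.0, 1.0]))  # f(0) = 0.5
```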
Neural Networks: Multi-layer Perceptron (MLP)
▪ An MLP is a network containing several Perceptron units across many layers
▪ An MLP consists of an input layer, an output layer, and one or more hidden layers
[Figure: an MLP with a 3-dimensional input, one hidden layer with K = 2 hidden units, and an output layer with a scalar-valued output $\hat{y}_n$. Hidden layer units/nodes (a.k.a. "neurons") act as new features; the edges carry learnable weights.]
▪ Note: the hidden layer pre-activation $z_{nk}$ and post-activation $h_{nk}$ will be shown together for brevity
▪ Denoting $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$, $\mathbf{w}_k \in \mathbb{R}^D$, we have $\mathbf{h}_n = g(\mathbf{W}^\top \mathbf{x}_n) \in \mathbb{R}^K$ ($K = 2$, $D = 3$ above). Note: $g$ is applied elementwise on the pre-activation vector $\mathbf{z}_n = \mathbf{W}^\top \mathbf{x}_n$; a small sketch of this computation follows
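A minimal NumPy sketch of this hidden-layer computation (the names and the tanh choice for $g$ are illustrative assumptions):

```python
import numpy as np

def hidden_layer(W, x, g=np.tanh):
    """Compute h = g(W^T x): K new features from a D-dimensional input."""
    z = W.T @ x   # pre-activations z_n, shape (K,)
    return g(z)   # post-activations h_n, shape (K,)

D, K = 3, 2                         # matches the slide's example
rng = np.random.default_rng(0)
W = rng.standard_normal((D, K))     # W = [w_1, ..., w_K], each w_k in R^D
x = rng.standard_normal(D)
h = hidden_layer(W, x)              # h_n in R^K
```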
Activation Functions: Some Common Choices
[Figure: plots of the sigmoid and tanh activations, output $h$ versus pre-activation $a$.]
▪ For sigmoid as well as tanh, gradients saturate (become close to zero) as the function tends to its extreme values
▪ tanh is preferred more than sigmoid: it helps keep the mean of the next layer's inputs close to zero (with sigmoid, it is close to 0.5). A small numeric check of the saturation follows
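A small numeric check of the saturation claim (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)        # sigmoid' in terms of its output

def tanh_grad(a):
    return 1.0 - np.tanh(a) ** 2  # tanh' in terms of its output

for a in [0.0, 2.0, 5.0, 10.0]:
    # both gradients shrink toward 0 as |a| grows (saturation)
    print(a, sigmoid_grad(a), tanh_grad(a))
```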
MLP Can Learn Any Nonlinear Function
▪ An MLP can be seen as a composition of multiple linear models combined nonlinearly
[Figure: a nonlinear classification problem, with the score functions computed by the two hidden units and by the output unit $\hat{y}_n$.]
▪ Each hidden unit's score increases monotonically: a one-sided increase (not ideal for learning nonlinear decision boundaries), just a linear model on its own
▪ Composing the two one-sided increasing score functions (using learnable weights $v_1$ and $v_2$ to "add" them) yields a high score in the middle and a low score on either of the two sides of it, exactly what we want for the given classification problem: the network can now learn the nonlinear decision boundary (see the sketch below)
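A tiny NumPy sketch of this composition (the weights and offsets are hand-picked for illustration, not learned):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-6, 6, 13)
h1 = sigmoid(x + 2)          # one-sided increasing score, rises around x = -2
h2 = sigmoid(x - 2)          # one-sided increasing score, rises around x = +2
score = 1.0 * h1 - 1.0 * h2  # v1 = +1, v2 = -1: high in the middle, low on both sides
print(np.round(score, 2))
```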
Single Hidden Layer and Single Output
▪ One hidden layer with $K$ nodes and a single output (e.g., scalar-valued regression or binary classification):
$$\hat{y}_n = o(\mathbf{v}^\top \mathbf{h}_n)$$
where $o$ is a suitable function (e.g., identity, sign, or sigmoid) to convert the score into the actual response, and $g$ is a nonlinear activation function; a small sketch follows
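A minimal sketch, assuming tanh for $g$ and sigmoid for $o$ (illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_single_output(W, v, x, g=np.tanh, o=sigmoid):
    """Single-output forward pass: y_hat = o(v^T h), with h = g(W^T x).
    Choosing o = sigmoid (an assumption) suits binary classification."""
    h = g(W.T @ x)   # hidden features h_n, shape (K,)
    return o(v @ h)  # scalar response y_hat_n
```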
Single Hidden Layer and Multiple Outputs
▪ One hidden layer with $K$ nodes and a vector of $C$ outputs (e.g., vector-valued regression, multi-class classification, or multi-label classification):
$$\hat{\mathbf{y}}_n = o(\mathbf{V}^\top \mathbf{h}_n)$$
where $o$ is a suitable function (e.g., identity, sign, or softmax) to convert the scores into the actual response vector for $\mathbf{x}_n$, and $g$ is a nonlinear activation function; a multi-class sketch follows
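A minimal sketch, assuming tanh for $g$ and softmax for $o$ (illustrative choices):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # shift by max for numerical stability
    return e / e.sum()

def forward_multi_output(W, V, x, g=np.tanh, o=softmax):
    """Multi-output forward pass: y_hat = o(V^T h), with h = g(W^T x).
    With o = softmax (an assumption) this suits multi-class classification."""
    h = g(W.T @ x)     # shape (K,)
    return o(V.T @ h)  # shape (C,); sums to 1 under softmax
```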
Multiple Hidden Layers (One/Multiple Outputs)
▪ Most general case: multiple hidden layers (with the same or different number of hidden nodes in each) and a scalar or vector-valued output
▪ Note: the nonlinearity $g$ is applied elementwise on its inputs, so $\mathbf{h}_n^{(\ell)}$ has the same size as the vector $\mathbf{W}^{(\ell)\top} \mathbf{h}_n^{(\ell-1)}$; a layer-by-layer sketch follows
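A layer-by-layer sketch of the general forward pass (tanh for $g$ is an assumed choice):

```python
import numpy as np

def forward(Ws, x, g=np.tanh):
    """Forward pass through the hidden layers; Ws[l] has shape (K_{l-1}, K_l),
    with K_0 = D (bias terms omitted here; they are added on the next slide)."""
    h = x
    for W in Ws:
        h = g(W.T @ h)  # g applied elementwise, so h gets K_l entries
    return h

h = forward([np.ones((3, 4)), np.ones((4, 2))], np.ones(3))  # D=3, K1=4, K2=2
```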
The Bias Term
▪ Each layer's pre-activations $\mathbf{z}_n^{(\ell)}$ also have an added bias term $\mathbf{b}^{(\ell)}$ (of the same size as $\mathbf{z}_n^{(\ell)}$ and $\mathbf{h}_n^{(\ell)}$):
$$\mathbf{z}_n^{(\ell)} = \mathbf{W}^{(\ell)\top} \mathbf{h}_n^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{h}_n^{(\ell)} = g(\mathbf{z}_n^{(\ell)})$$
[Figure: a network with input $\mathbf{x}_n$, hidden layers $\mathbf{h}_n^{(1)}, \mathbf{h}_n^{(2)}$ via weights $\mathbf{W}^{(1)}, \mathbf{W}^{(2)}$, and output $\hat{y}_n$ via weights $\mathbf{v}$.]
▪ The bias term increases the expressiveness of the network and ensures that we have nonzero activations/pre-activations even if the layer's input is a vector of all zeros
▪ Note that the bias term is the same for all inputs (it does not depend on $n$); the sketch below shows this as broadcasting one bias vector across a whole batch
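A minimal batched sketch: NumPy broadcasting adds the same bias vector to every input, mirroring the fact that $\mathbf{b}^{(\ell)}$ does not depend on $n$ (the helper name is illustrative):

```python
import numpy as np

def layer_forward(W, b, H, g=np.tanh):
    """One layer on a batch H of shape (N, K_prev): Z = H W + b, then g(Z).
    Broadcasting adds the same bias vector b (shape (K,)) to every input n."""
    Z = H @ W + b   # the bias does not depend on n
    return g(Z)
```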
Kernel Methods vs Neural Nets
▪ Recall the prediction rule for a kernel method (e.g., kernel SVM):
$$y_n = \mathbf{w}^\top \phi(\mathbf{x}_n) \qquad \text{OR} \qquad y_n = \sum_{i=1}^{N} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}_n)$$
▪ It's like a one-hidden-layer NN with
▪ the pre-defined $N$ features $\{k(\mathbf{x}_i, \mathbf{x}_n)\}_{i=1}^{N}$ acting as the feature vector $\mathbf{h}_n$
▪ $\{\alpha_i\}_{i=1}^{N}$ as the learnable output layer weights
▪ Also note that neural nets are faster than kernel methods at test time, since kernel methods need to store the training examples for use at test time whereas neural nets do not
▪ Both kernel methods and deep neural networks extract new features from the inputs
▪ For kernel methods, the features are pre-defined via the kernel function
▪ For deep NN, the features are learned by the network
▪ Note: kernels can also be learned from data ("kernel learning"), in which case kernel methods and deep neural nets become even more similar in spirit ☺ (a sketch of the kernel-as-NN view follows)
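A sketch of this view (the RBF kernel and the helper names are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(xi, xn, gamma=1.0):
    """RBF kernel, an illustrative choice of k."""
    return np.exp(-gamma * np.sum((xi - xn) ** 2))

def kernel_predict(X_train, alpha, x_new):
    """Prediction y = sum_i alpha_i k(x_i, x_new), viewed as a one-hidden-layer
    NN: h = [k(x_i, x_new)] are pre-defined features, alpha the output weights."""
    h = np.array([rbf_kernel(xi, x_new) for xi in X_train])  # shape (N,)
    return alpha @ h
```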
Features Learned by a Neural Network
▪ $\mathbf{W}^{(\ell)} = [\mathbf{w}_1^{(\ell)}, \mathbf{w}_2^{(\ell)}, \ldots, \mathbf{w}_{K_\ell}^{(\ell)}]$ is the $K_{\ell-1} \times K_\ell$ matrix of weights between layer $\ell - 1$ and layer $\ell$
▪ $\mathbf{w}_k^{(\ell)} \in \mathbb{R}^{K_{\ell-1}}$ denotes the "feature detector" for feature $k$ of layer $\ell$; $\mathbf{W}^{(\ell)}$ collectively represents the $K_\ell$ feature detectors for layer $\ell$
▪ For input $\mathbf{x}_n$, $h_{nk}^{(\ell)} = g(\mathbf{w}_k^{(\ell)\top} \mathbf{h}_n^{(\ell-1)})$ is the value of feature $k$ in layer $\ell$ (with $K_0 = D$)
Why Neural Networks Work Better: Another View
▪ Linear models tend to only learn the "average" pattern
▪ E.g., the weight vector of a linear classification model represents the average pattern of a class
▪ Deep models can learn multiple patterns (each hidden node can learn one pattern)
▪ Thus deep models can capture more subtle variations than a simpler linear model can
Neural Nets: Some Aspects
▪ Much of the magic lies in the hidden layers
Representational Power of Neural Nets
▪ Consider a single hidden layer neural net with 𝐾 hidden nodes
[Figure: functions learned with K = 3, K = 6, and K = 20 hidden units; larger K gives a more complex fit.]
▪ Recall that each hidden unit “adds” a simple function to the overall function
▪ Increasing 𝐾 (number of hidden units) will result in a more complex function
▪ Very large $K$ seems to overfit (see the figure above). Should we instead prefer small $K$?
▪ No! It is better to use large $K$ and regularize well. Reason/justification:
▪ A simple NN with small $K$ will have only a few local optima, some of which may be bad
▪ A complex NN with large $K$ will have many local optima, all equally good (there are theoretical results on this)
▪ We can also use multiple hidden layers (each sufficiently large) and regularize well; a small sketch of this comparison follows
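A rough sketch of this advice using scikit-learn's MLPRegressor; the 1-D dataset and hyperparameter values are illustrative assumptions (alpha is the L2 regularization strength):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)  # noisy 1-D data

for K in [3, 6, 20]:
    # Large K with a well-chosen alpha (L2 penalty) can fit well without overfitting
    net = MLPRegressor(hidden_layer_sizes=(K,), alpha=1e-2,
                       max_iter=5000, random_state=0).fit(X, y)
    print(K, round(net.score(X, y), 3))
```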
Wide or Deep?
▪ While a very wide single hidden layer can approximate any function, we often prefer many, less wide, hidden layers
▪ Higher layers help learn more directly useful/interpretable features (also useful for compressing data using a small number of features)
Background: Gradient and Jacobian
▪ Let $\mathbf{y} = f(\mathbf{x})$, where $f: \mathbb{R}^P \to \mathbb{R}^Q$, $\mathbf{x} \in \mathbb{R}^P$, $\mathbf{y} \in \mathbb{R}^Q$. Denote $\mathbf{y} = [f_1(\mathbf{x}), \ldots, f_Q(\mathbf{x})]$
▪ The gradient of each component $y_i = f_i(\mathbf{x}) \in \mathbb{R}$ ($i = 1, 2, \ldots, Q$) w.r.t. $\mathbf{x} \in \mathbb{R}^P$ is
$$\nabla f_i(\mathbf{x}) = \frac{\partial y_i}{\partial \mathbf{x}} = \left[\frac{\partial y_i}{\partial x_1} \;\; \cdots \;\; \frac{\partial y_i}{\partial x_P}\right] \in \mathbb{R}^{1 \times P}$$
▪ Note: the gradient is expressed here as a row vector (it has the same length as $\mathbf{x}$, which is a column vector) for notational convenience later. Stacking the $Q$ component gradients row-wise gives the $Q \times P$ Jacobian of $f$; a finite-difference sketch follows
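A finite-difference sketch of the Jacobian (the helper name and step size are illustrative):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^P -> R^Q at x. Row i is the gradient
    (row vector) of component f_i, so the result has shape Q x P."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_pert = x.copy()
        x_pert[j] += eps
        J[:, j] = (f(x_pert) - y) / eps
    return J
```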
Background: Multivariate Chain Rule of Calculus
▪ Let $\mathbf{x} \in \mathbb{R}^P$, $\mathbf{y} = g(\mathbf{x}) \in \mathbb{R}^Q$, $z = f(\mathbf{y}) \in \mathbb{R}$, where $g: \mathbb{R}^P \to \mathbb{R}^Q$ and $f: \mathbb{R}^Q \to \mathbb{R}$, i.e., $\mathbf{x} \xrightarrow{g} \mathbf{y} \xrightarrow{f} z$. Using the chain rule of total derivatives, the $1 \times P$ vector derivative is
$$\frac{\partial z}{\partial \mathbf{x}} = \sum_{i=1}^{Q} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial \mathbf{x}}$$
where each $\frac{\partial z}{\partial y_i}$ is a scalar derivative and each $\frac{\partial y_i}{\partial \mathbf{x}}$ is a $1 \times P$ vector derivative (the sum is needed since $z$ depends on $\mathbf{y}$, and $\mathbf{y}$ is a vector)
▪ The above can be written as a product of a vector and a matrix: it turns out to be a product of the Jacobians of $f$ and $g$, in that order ☺, with the $1 \times Q$ gradient of $f$ (the same as its Jacobian, since $f$ is scalar-valued) times the $Q \times P$ Jacobian of $g$
$$\frac{\partial z}{\partial \mathbf{x}} = \left[\frac{\partial z}{\partial y_1} \; \cdots \; \frac{\partial z}{\partial y_Q}\right] \times \begin{bmatrix} \nabla g_1(\mathbf{x}) \\ \vdots \\ \nabla g_Q(\mathbf{x}) \end{bmatrix} = \nabla f(\mathbf{y}) \times J_g \in \mathbb{R}^{1 \times P}$$
▪ More generally, let $\mathbf{w} \in \mathbb{R}^P$, $\mathbf{x} = h(\mathbf{w}) \in \mathbb{R}^Q$, $\mathbf{y} = g(\mathbf{x}) \in \mathbb{R}^R$, $\mathbf{z} = f(\mathbf{y}) \in \mathbb{R}^S$. Then
$$\frac{\partial \mathbf{z}}{\partial \mathbf{w}} = J_f \times J_g \times J_h \in \mathbb{R}^{S \times P}$$
a product of the 3 Jacobians in that order (simple! ☺). Note that the chain rule for scalar variables $w, x, y, z$ is defined in a similar way: $\frac{\partial z}{\partial w} = f'(y)\, g'(x)\, h'(w)$. A numerical check of the vector case follows
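A numerical check of the vector chain rule on example maps $g$ and $f$ (both chosen only for illustration):

```python
import numpy as np

def num_jac(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x (rows = output components)."""
    y = f(x)
    return np.column_stack([(f(x + eps * e) - y) / eps for e in np.eye(x.size)])

def g(x):  # g: R^2 -> R^3 (an arbitrary example map)
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):  # f: R^3 -> R^1 (scalar output kept as a length-1 vector)
    return np.array([y[0] + 2.0 * y[1] * y[2]])

x = np.array([0.5, -1.0])
lhs = num_jac(lambda t: f(g(t)), x)     # direct Jacobian of f∘g, shape (1, 2)
rhs = num_jac(f, g(x)) @ num_jac(g, x)  # grad_f x J_g, shapes (1,3) @ (3,2)
print(np.allclose(lhs, rhs, atol=1e-4))  # True, up to finite-difference error
```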
Backpropagation (Backprop)
▪ Backprop is gradient descent, with the multivariate chain rule used for the derivatives
▪ Consider a two-hidden-layer neural network (shown below) and the loss
$$\mathcal{L}(\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{v}) = \sum_{n=1}^{N} \ell(y_n, \hat{y}_n) = \sum_{n=1}^{N} \ell_n$$
[Figure: network with input $\mathbf{x}_n$, layers $\mathbf{h}_n^{(1)} = g(\mathbf{z}_n^{(1)})$ and $\mathbf{h}_n^{(2)} = g(\mathbf{z}_n^{(2)})$ via weights $\mathbf{W}^{(1)}, \mathbf{W}^{(2)}$, and output $\hat{y}_n$ via weights $\mathbf{v}$.]
▪ We wish to minimize this loss; the gradient-based updates will be
$$\mathbf{v} = \mathbf{v} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{v}}, \qquad \mathbf{W}^{(i)} = \mathbf{W}^{(i)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(i)}} \quad (i = 1, 2)$$
▪ Since $\mathcal{L} = \sum_{n=1}^{N} \ell_n$, we need to compute $\frac{\partial \ell_n}{\partial \mathbf{v}}$ and $\frac{\partial \ell_n}{\partial \mathbf{W}^{(i)}}$ ($i = 1, 2$)
▪ Assume the output activation $o$ to be identity, so $\hat{y}_n = \mathbf{v}^\top \mathbf{h}_n^{(2)}$. Writing $\ell'(y_n, \hat{y}_n)$ for the derivative of $\ell_n$ w.r.t. $\hat{y}_n$,
$$\frac{\partial \ell_n}{\partial \mathbf{v}} = \frac{\partial \ell_n}{\partial \hat{y}_n} \frac{\partial \hat{y}_n}{\partial \mathbf{v}} = \ell'(y_n, \hat{y}_n)\, \mathbf{h}_n^{(2)}$$
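For concreteness, assuming squared loss $\ell_n = (y_n - \hat{y}_n)^2$ (so $\ell'(y_n, \hat{y}_n) = -2(y_n - \hat{y}_n)$, an illustrative choice), this gradient is:

```python
import numpy as np

def grad_v(y_n, h2_n, v):
    """dl_n/dv for squared loss l_n = (y_n - y_hat_n)^2 with identity output."""
    y_hat = v @ h2_n                    # y_hat_n = v^T h_n^(2)
    return -2.0 * (y_n - y_hat) * h2_n  # l'(y_n, y_hat_n) * h_n^(2)
```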
Backpropagation in detail
▪ Let's now look at $\frac{\partial \ell_n}{\partial \mathbf{W}^{(2)}}$, where $\ell_n = \ell(y_n, \hat{y}_n)$ and $\hat{y}_n = \mathbf{v}^\top \mathbf{h}_n^{(2)}$
$$\frac{\partial \ell_n}{\partial \mathbf{W}^{(2)}} = \frac{\partial \ell_n}{\partial \hat{y}_n} \frac{\partial \hat{y}_n}{\partial \mathbf{W}^{(2)}} = \ell'(y_n, \hat{y}_n)\, \frac{\partial \hat{y}_n}{\partial \mathbf{W}^{(2)}}$$
$$\frac{\partial \hat{y}_n}{\partial \mathbf{W}^{(2)}} = \frac{\partial \hat{y}_n}{\partial \mathbf{h}_n^{(2)}} \frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{W}^{(2)}} + \frac{\partial \hat{y}_n}{\partial \mathbf{v}} \frac{\partial \mathbf{v}}{\partial \mathbf{W}^{(2)}}$$
▪ Since $\mathbf{v}$ doesn't depend on $\mathbf{W}^{(2)}$, $\frac{\partial \mathbf{v}}{\partial \mathbf{W}^{(2)}} = 0$, and therefore
$$\frac{\partial \ell_n}{\partial \mathbf{W}^{(2)}} = \ell'(y_n, \hat{y}_n)\, \frac{\partial \hat{y}_n}{\partial \mathbf{h}_n^{(2)}} \frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{W}^{(2)}} = \ell'(y_n, \hat{y}_n)\, \mathbf{v}^\top \frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{W}^{(2)}}$$
using the transpose since we assume gradients to be row vectors; the result is a Jacobian of size $1 \times (K_2 \times K_1)$
▪ We now need $\frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{W}^{(2)}}$. Using $\mathbf{h}_n^{(2)} = g(\mathbf{z}_n^{(2)})$, where $\mathbf{z}_n^{(2)} = \mathbf{W}^{(2)\top} \mathbf{h}_n^{(1)}$ and $g$ is an elementwise nonlinearity applied on the vector $\mathbf{z}_n^{(2)}$,
$$\frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{W}^{(2)}} = \frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{z}_n^{(2)}} \frac{\partial \mathbf{z}_n^{(2)}}{\partial \mathbf{W}^{(2)}} = \mathrm{diag}\!\left(g'(z_{n1}^{(2)}), \ldots, g'(z_{nK_2}^{(2)})\right) \frac{\partial \mathbf{z}_n^{(2)}}{\partial \mathbf{W}^{(2)}}$$
where $\frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{z}_n^{(2)}}$ is a diagonal matrix of size $K_2 \times K_2$ with the elementwise derivatives of $g$ along the diagonal, and $\frac{\partial \mathbf{z}_n^{(2)}}{\partial \mathbf{W}^{(2)}}$ (and hence $\frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{W}^{(2)}}$) is a Jacobian tensor of size $K_2 \times K_2 \times K_1$
Backpropagation: Computation Reuse
▪ Summarizing, the required gradients/Jacobians for this network are
$$\frac{\partial \ell_n}{\partial \mathbf{v}} = \ell'(y_n, \hat{y}_n)\, \mathbf{h}_n^{(2)}$$
$$\frac{\partial \ell_n}{\partial \mathbf{W}^{(2)}} = \ell'(y_n, \hat{y}_n)\, \frac{\partial \hat{y}_n}{\partial \mathbf{h}_n^{(2)}} \frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{z}_n^{(2)}} \frac{\partial \mathbf{z}_n^{(2)}}{\partial \mathbf{W}^{(2)}}$$
$$\frac{\partial \ell_n}{\partial \mathbf{W}^{(1)}} = \ell'(y_n, \hat{y}_n)\, \frac{\partial \hat{y}_n}{\partial \mathbf{h}_n^{(2)}} \frac{\partial \mathbf{h}_n^{(2)}}{\partial \mathbf{z}_n^{(2)}} \frac{\partial \mathbf{z}_n^{(2)}}{\partial \mathbf{h}_n^{(1)}} \frac{\partial \mathbf{h}_n^{(1)}}{\partial \mathbf{z}_n^{(1)}} \frac{\partial \mathbf{z}_n^{(1)}}{\partial \mathbf{W}^{(1)}}$$
▪ The quantities $\mathbf{z}_n^{(\ell)}, \mathbf{h}_n^{(\ell)}, \hat{y}_n$ are computed in a forward pass (input to output); the gradients are then computed in a backward pass (output to input), reusing the factors shared between the expressions above (a manual sketch follows)
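A manual-backprop sketch for this two-hidden-layer network, assuming $g = \tanh$ and squared loss (both choices are for illustration only):

```python
import numpy as np

def backprop_single_example(W1, W2, v, x, y, g=np.tanh):
    """Forward + backward pass for y_hat = v^T g(W2^T g(W1^T x)).
    Assumes squared loss l = (y - y_hat)^2 and g = tanh."""
    # Forward pass: compute and cache pre-/post-activations
    z1 = W1.T @ x;  h1 = g(z1)
    z2 = W2.T @ h1; h2 = g(z2)
    y_hat = v @ h2
    # Backward pass: go output -> input, reusing the shared factors
    dl_dyhat = -2.0 * (y - y_hat)           # l'(y, y_hat)
    grad_v = dl_dyhat * h2                  # dl/dv
    delta2 = dl_dyhat * v * (1 - h2 ** 2)   # dl/dz2 (tanh' = 1 - h^2)
    grad_W2 = np.outer(h1, delta2)          # dl/dW2, same shape as W2
    delta1 = (W2 @ delta2) * (1 - h1 ** 2)  # dl/dz1, reusing delta2
    grad_W1 = np.outer(x, delta1)           # dl/dW1, same shape as W1
    return grad_v, grad_W2, grad_W1
```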
▪ Software frameworks such as TensorFlow and PyTorch already support this, so you don't need to implement it by hand (no worries about computing derivatives yourself); a tiny PyTorch sketch follows
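A minimal PyTorch sketch of the same network; autograd applies the chain rule for us (the shapes and the tanh/squared-loss choices are illustrative assumptions):

```python
import torch

D, K1, K2 = 3, 4, 2
W1 = torch.randn(D, K1, requires_grad=True)
W2 = torch.randn(K1, K2, requires_grad=True)
v = torch.randn(K2, requires_grad=True)
x, y = torch.randn(D), torch.tensor(1.0)

y_hat = v @ torch.tanh(W2.T @ torch.tanh(W1.T @ x))
loss = (y - y_hat) ** 2
loss.backward()  # backprop via the multivariate chain rule
print(W1.grad.shape, W2.grad.shape, v.grad.shape)
```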
Backpropagation through an example