Anush Sankaran
Senior Research Scientist
Microsoft
WHAT IS ML? TOY EXAMPLE

[Figure-only slides: the toy example is developed step by step in pictures]
WHAT IS ML? TOY EXAMPLE

Given training examples with inputs (x1, x2) and output y: what is f(.)?

● Rule-based systems
● Decision Trees
● Random Forests
● Logistic Regression
● Neural Networks (Multi-Layer Perceptron)
● Support Vector Machines (SVM)
● Gradient Boosted Trees
● …
WHAT IS ML? TOY EXAMPLE

[Figure: a single neuron with inputs x1 … xn, weights w1 … wn, pre-activation h and activation g (forward pass)]

- Parameters: w1, w2, …, wn
- Neuron pre-activation: h(x) = Σi wixi
- Neuron activation: f = g(h(x)) = g(Σi wixi)
- g is a non-linear activation function
- Examples: sigmoid, tanh, RBF, ReLU, etc.
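A minimal NumPy sketch of this forward pass; the choice of sigmoid for g and the toy values for x and w are illustrative assumptions, not values from the slides:

```python
import numpy as np

def sigmoid(a):
    # One possible non-linear activation g
    return 1.0 / (1.0 + np.exp(-a))

# Toy input features x1 ... xn and parameters w1 ... wn
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])

h = np.dot(w, x)        # neuron pre-activation h(x) = sum_i wi * xi
y_pred = sigmoid(h)     # neuron activation f(x) = g(h(x))

print(h, y_pred)
```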
Generalized Form: Loss Function

[Figure: the single-neuron diagram (forward pass) with a "Compute Loss(y, y')" block on top]

Neuron activation: y' = f(x) = g(Σi wixi)

Loss function (or error function):
L(x, y) = e(y, y') = e(y, g(Σi wixi))

- If the loss/error is small, then the prediction is good enough
- If the loss/error is large, then the prediction is not good enough
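Continuing the sketch above with one possible choice of error function e (squared error here; the slides keep e generic), and an assumed target label:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def squared_error(y, y_pred):
    # One choice of error function e(y, y')
    return (y - y_pred) ** 2

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
y_true = 1.0                           # assumed target label

y_pred = sigmoid(np.dot(w, x))         # y' = g(sum_i wi * xi)
loss = squared_error(y_true, y_pred)   # L(x, y) = e(y, g(sum_i wi * xi))
print(loss)   # small loss -> good prediction, large loss -> poor prediction
```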
Generalized Form: Optimization Function

[Figure: the single-neuron diagram, now with both a forward pass and a backward pass]

Neuron activation: y' = f(x) = g(Σi wixi)

- Update `w` such that the loss/error reduces:
  wi = wi ± (value)

Optimization function:
- How to find the `value` for a given set of `w` and L
- Finds the direction (±) and also the magnitude
- Example: Stochastic Gradient Descent
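A sketch of the generic update wi = wi ± (value), written the way gradient-based optimizers realise it: the gradient of the loss supplies both the direction and the magnitude. The learning rate and the gradient values here are assumptions:

```python
import numpy as np

learning_rate = 0.01          # assumed step size

def update(w, grad_w, lr=learning_rate):
    # Gradient-based step: the sign of grad_w gives the direction (+/-),
    # and |lr * grad_w| gives the magnitude of the update.
    return w - lr * grad_w

w = np.array([0.1, 0.4, -0.2])
grad_w = np.array([0.8, -0.3, 1.5])   # dL/dw, however it was computed
w = update(w, grad_w)
print(w)
```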
Generalized Form: Objective Function

[Figure: the single-neuron diagram with forward and backward passes, annotated with the objective function]
Generalized Form: Summary

[Figure: the single-neuron diagram with forward and backward passes]

1. To predict output values:
   Activation function y' = g(Σi wixi)
2. To find out how good or bad the prediction is:
   Loss function L(x, y) = e(y, g(Σi wixi))
3. To update parameters such that the prediction improves:
   Optimization function

This classifier is called a Perceptron or Logistic Regression.
Activation Function: Linear
● No squashing of output
Activation Function: Sigmoid
● g(a) = 1 / (1 + exp(-a)), squashing the output to (0, 1)
Activation Function: TanH
● g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)), squashing the output to (-1, 1)
Activation Function: ReLU
● Rectified Linear activation (ReLU):
  g(a) = max(0, a)
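A short sketch of the activations named above; the ReLU matches the g(a) = max(0, a) definition on the slide, and the sigmoid/tanh forms are the standard ones:

```python
import numpy as np

def sigmoid(a):
    # Squashes the output to (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # Squashes the output to (-1, 1)
    return np.tanh(a)

def relu(a):
    # Rectified Linear Unit: g(a) = max(0, a)
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 2.0])
print(sigmoid(a), tanh(a), relu(a))
```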
Generalized Form: Summary (recap of the Perceptron / Logistic Regression summary above)
To be continued …
● Optimization Functions
  ○ Stochastic Gradient Descent
  ○ Challenges with Regression / Perceptron
● Multi-Layer Perceptron
  ○ Backpropagation
● Challenges with Neural Networks
● Use Cases of Neural Networks
Generalized Form: Summary (recap)
Optimization Function: How to Update Parameters
The Loss Curve

[Figure: the loss L plotted against the parameter θ1, with points at θ1 = 0.2, 0.4, 0.5]

- Start with w1 = 0.2, w2 = 0.8 → L = 31547.4
- Update to w1 = 0.5, w2 = 0.5 → L = 4600.5
Differential Calculus to the Rescue

Question: in which direction, and by how much, should each wi be changed so that the loss decreases?

Fermat's Theorem: at a local minimum (or maximum) of a differentiable function, the derivative is zero.
Gradient Descent Algorithm

[Figure: the single-neuron diagram with forward and backward passes]
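A compact sketch of the whole loop for a single sigmoid neuron, assuming a squared-error loss and a made-up toy dataset; the analytic gradient below is specific to these choices and is not taken from the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Assumed toy dataset: 4 examples, 2 features (logical AND targets)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])

w = np.array([0.2, 0.8])      # starting point, as in the loss-curve slide
lr = 0.5

for epoch in range(1000):
    y_pred = sigmoid(X @ w)                  # forward pass: y' = g(sum_i wi * xi)
    loss = np.mean((y_pred - y) ** 2)        # squared-error loss
    # Backward pass: dL/dw for squared error + sigmoid
    grad = (2.0 / len(X)) * X.T @ ((y_pred - y) * y_pred * (1.0 - y_pred))
    w = w - lr * grad                        # gradient descent update
    # Note: no bias term yet, so the decision boundary still passes through the origin.

print(w, loss)
```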
Gradient Descent in Real World

[Figure-only slides]
Generalized Form: Summary (recap)

With the sigmoid as the activation function, the prediction is y' = sigmoid(Σi wixi); this classifier is called a Perceptron or Logistic Regression.
Problem #1: Always Goes Through the Origin

[Figure: the single-neuron diagram with forward and backward passes]

f = g(h(x)) = g(Σi wixi)

With no bias term, the decision boundary is forced to pass through the origin.
Problem #1: Always Goes Through the Origin

[Figure: the neuron diagram extended with a bias term b alongside the weights w1 … wn]

Solution: add a learnable bias, f = g(h(x)) = g(Σi wixi + b)
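A minimal sketch of the fix: a learnable bias b carried alongside the weights and updated in exactly the same way (the toy numbers are assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.3                               # learnable bias, updated just like the weights

y_pred = sigmoid(np.dot(w, x) + b)    # f = g(sum_i wi * xi + b): no longer forced through the origin
print(y_pred)
```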
Problem #2: Getting Stuck in Local Minima

Adding momentum:
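The momentum formula itself is shown in the slide figure; the sketch below uses one common form of it, with the momentum coefficient and learning rate as assumed values:

```python
import numpy as np

lr = 0.01            # assumed learning rate
mu = 0.9             # assumed momentum coefficient

w = np.array([0.1, 0.4, -0.2])
velocity = np.zeros_like(w)

def momentum_step(w, velocity, grad_w):
    # Keep a running "velocity" of past gradients; this helps the update roll
    # through shallow local minima and flat regions instead of stopping in them.
    velocity = mu * velocity - lr * grad_w
    return w + velocity, velocity

grad_w = np.array([0.8, -0.3, 1.5])   # dL/dw from the backward pass
w, velocity = momentum_step(w, velocity, grad_w)
print(w)
```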
Problem #3: Strictly Linear Classification
More than One Neuron / Perceptron?

[Figure: perceptrons over the same inputs x1 … xn and weights w1 … wn, producing outputs y1' and y2']
MultiLayer Perceptron (MLP) / Neural Networks
Back Propagation: How to learn the weights?
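A hedged sketch of backpropagation for a one-hidden-layer MLP with sigmoid activations and squared-error loss; the layer sizes, random data, and chain-rule expressions are all tied to these assumed choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # assumed toy batch: 8 examples, 3 features
y = rng.integers(0, 2, size=(8, 1)).astype(float)

W1, b1 = rng.normal(scale=0.1, size=(3, 4)), np.zeros(4)   # input -> hidden (4 units)
W2, b2 = rng.normal(scale=0.1, size=(4, 1)), np.zeros(1)   # hidden -> output
lr = 0.1

for step in range(200):
    # Forward pass
    h = sigmoid(X @ W1 + b1)                 # hidden activations
    y_pred = sigmoid(h @ W2 + b2)            # output
    loss = np.mean((y_pred - y) ** 2)

    # Backward pass: propagate the error from the output layer back to the input layer
    d_out = 2 * (y_pred - y) * y_pred * (1 - y_pred) / len(X)   # dL/d(output pre-activation)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * h * (1 - h)                        # chain rule into the hidden layer
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)

    # Gradient descent updates
    W2, b2 = W2 - lr * dW2, b2 - lr * db2
    W1, b1 = W1 - lr * dW1, b1 - lr * db1

print(loss)
```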
How to Choose #Layers and #Neurons?
Problem #1: Possible Overfitting?
Problem #1: Bias-Variance Trade-Off
Solution #1: Regularization

Objective function: the loss L(x, y) to be minimized

Objective function with L1 regularization: add a penalty λ Σi |wi| to the loss

Objective function with L2 regularization: add a penalty λ Σi wi² to the loss
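A sketch of how the penalties enter the objective, assuming the standard L1/L2 forms and a user-chosen regularization factor λ:

```python
import numpy as np

lam = 0.01                                       # assumed regularization factor (lambda)

def objective(data_loss, w, kind=None):
    # Objective = data loss, optionally plus a penalty on the weights.
    if kind == "l1":
        return data_loss + lam * np.sum(np.abs(w))    # L1: encourages sparse weights
    if kind == "l2":
        return data_loss + lam * np.sum(w ** 2)       # L2: encourages small weights
    return data_loss

w = np.array([0.1, 0.4, -0.2])
data_loss = 0.37                                 # whatever the loss function returned
print(objective(data_loss, w, "l1"), objective(data_loss, w, "l2"))
```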
Problem #2: How to Choose the Learning Rate?
Problem #3: How to Initialize W & B?
● For biases: initialize all to 0
● For weights:
  ○ Cannot initialize the weights to 0 if the tanh activation function is used
    ■ All gradients will end up being zero
  ○ Cannot initialize all the weight values to be the same
    ■ Highly restricts the overall learning process
  ○ Random initialization works in most cases!
  ○ Glorot et al. showed that the following initialization is close to ideal:
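The formula itself is in the slide figure; below is a sketch of the Glorot (Xavier) uniform initialization as it is commonly implemented. The layer sizes in the example are assumptions:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random.default_rng()):
    # Glorot & Bengio (2010): sample weights uniformly in [-limit, limit]
    # with limit = sqrt(6 / (fan_in + fan_out)); biases start at 0.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return W, b

# e.g. a layer mapping 784 input features to 256 hidden units
W1, b1 = glorot_uniform(784, 256)
print(W1.shape, float(W1.min()), float(W1.max()))
```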
Problem #4: MLP is a Universal Approximator
Problem #5: Going Wide vs. Going Deep?

[Figure: two architectures compared side by side, labelled "Going Wide" and "Going Deep"]
Problem #6: Vanishing Gradient Problem

0 ≤ λ ≤ 1
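A tiny numerical illustration: if every layer contributes a factor λ with 0 ≤ λ ≤ 1 to the backpropagated gradient (the sigmoid derivative, for example, is at most 0.25), the product shrinks exponentially with depth. The specific factor and depth below are assumptions:

```python
# Each layer multiplies the backpropagated gradient by a factor 0 <= lam <= 1.
lam = 0.25                  # e.g. the maximum derivative of the sigmoid
grad = 1.0

for layer in range(20):     # a 20-layer-deep network
    grad *= lam

print(grad)                 # ~9e-13: the gradient at the early layers nearly vanishes
```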
Summary: All the Hyperparameters!

(Weights and biases are the only learnable parameters.)

● Architecture of the neural network model
  ○ Number of layers
  ○ Number of nodes in each layer
● Which activation function to use
  ○ Sigmoid, tanh, ReLU, softmax, etc.
● Which loss function to use
  ○ MSE, L1, CrossEntropy, Hinge, etc.
● Which optimization function to use
  ○ SGD, mini-batch Gradient Descent, Adam, RMSProp
● Which initialization method for weights and biases
● Which learning rate to use: lr = [0.1, 0.01, 0.001, 0.0001]. Dynamic learning rates are possible!
● Which regularization to use? And what is the regularization factor?
No Free Lunch Theorem!
How To Learn Features?
Unsupervised Feature Learning

[Figure: a plain neural network next to an AutoEncoder; the autoencoder maps the input data to features and back to the input data]
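A hedged sketch of a one-hidden-layer autoencoder trained to reconstruct its own input; the hidden activations are the learned features. The sizes, random data, and squared-error reconstruction loss are assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.uniform(size=(64, 10))                 # assumed unlabeled input data

W_enc = rng.normal(scale=0.1, size=(10, 4))    # encoder: input -> features (bottleneck of 4)
W_dec = rng.normal(scale=0.1, size=(4, 10))    # decoder: features -> reconstructed input
lr = 0.1

for step in range(500):
    features = sigmoid(X @ W_enc)              # learned features (no labels needed)
    X_rec = sigmoid(features @ W_dec)          # reconstruction of the input
    loss = np.sum((X_rec - X) ** 2) / len(X)   # reconstruction error per example

    # Backpropagate the reconstruction error
    d_rec = 2 * (X_rec - X) * X_rec * (1 - X_rec) / len(X)
    dW_dec = features.T @ d_rec
    d_feat = (d_rec @ W_dec.T) * features * (1 - features)
    dW_enc = X.T @ d_feat

    W_dec -= lr * dW_dec
    W_enc -= lr * dW_enc

print(loss)
```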
Going Deep: Deep Learning

Stacked AutoEncoder = Deep Learning

[Figure: encoding layers Hidden#1, Hidden#2, Hidden#3 stacked on top of the input data]
Convolutional Neural Networks: n-dimensional data
Recurrent Neural Networks: Sequential Data
Everything about NNs
Questions?