
NEURAL NETWORKS

Anush Sankaran
Senior Research Scientist
Microsoft
WHAT IS ML? TOY EXAMPLE

            Height (cm)   Weight (kg)   Gender?
Person #1   178           81            Male
Person #2   163           59            Female
Person #3   181           78            Male
Person #4   166           62            Female
Person #x   177           79            ???
WHAT IS ML? TOY EXAMPLE

Person #x   177   79   ??? → Male

How did you decide?
WHAT IS ML? TOY EXAMPLE

Treat Height as x1, Weight as x2, and Gender as y.

Mathematical form: f(x1, x2) = y. What is f(·)?
Basic ML Formulation

Mathematical form: f(x1, x2) = y. What is f(·)?

Various possible solutions:

● Rule-based systems
● Decision Trees
● Random Forests
● Logistic Regression
● Neural Networks (Multi-Layer Perceptron)
● Support Vector Machines (SVM)
● Gradient Boosted Trees
● …
WHAT IS ML? TOY EXAMPLE

Mathematical form: f(x1, x2) = y

One candidate: f(w1·x1 + w2·x2), where f is a non-linear function,
e.g. thresh(0.2·x1 + 0.8·x2):

- (0.2·x1 + 0.8·x2) > 90 → Male
- (0.2·x1 + 0.8·x2) < 90 → Female

            Height (cm)   Weight (kg)   Score = 0.2·x1 + 0.8·x2   Gender?
Person #1   178           81            100.4                     Male
Person #2   163           59            79.8                      Female
Person #3   181           78            98.6                      Male
Person #4   166           62            82.8                      Female
Person #x   177           79            98.6                      ???
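A minimal sketch of this hand-tuned rule in Python — the weights (0.2, 0.8) and the threshold 90 come from the slide above; the function name is ours:

```python
# Hand-crafted rule: score = 0.2*x1 + 0.8*x2, threshold at 90.
def predict_gender(height_cm, weight_kg, w1=0.2, w2=0.8, threshold=90.0):
    score = w1 * height_cm + w2 * weight_kg
    return "Male" if score > threshold else "Female"

people = [(178, 81), (163, 59), (181, 78), (166, 62), (177, 79)]
for height, weight in people:
    print(height, weight, predict_gender(height, weight))
# Person #x (177, 79) gets a score of 98.6, so the rule says "Male".
```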
Generalized Form: A Single Neuron

[Diagram: inputs x1 … xn feed the neuron through weights w1 … wn; the pre-activation h is followed by the activation g. The forward pass runs from the inputs to the output.]

- Input features: x1, x2, …, xn
- Parameters: w1, w2, …, wn

Neuron pre-activation: h(x) = Σi wi·xi

Neuron activation: f(x) = g(h(x)) = g(Σi wi·xi)
- g is a non-linear activation function
- Examples: sigmoid, tanh, RBF, ReLU, etc.
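A minimal sketch of the forward pass of a single neuron, assuming sigmoid as the non-linearity g (any of the activations listed above could be swapped in):

```python
import math

def sigmoid(a):
    # One possible non-linear activation g.
    return 1.0 / (1.0 + math.exp(-a))

def neuron_forward(x, w, g=sigmoid):
    # Pre-activation: h(x) = sum_i w_i * x_i
    h = sum(wi * xi for wi, xi in zip(w, x))
    # Activation: f(x) = g(h(x))
    return g(h)

print(neuron_forward([1.0, 2.0, 3.0], [0.5, -0.2, 0.1]))
```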
Generalized Form: Loss Function

Neuron activation (the prediction): y’ = f(x) = g(Σi wi·xi)

Loss function (or error function):
L(x, y) = e(y, y’) = e(y, g(Σi wi·xi))

- If the loss (error) is small, the prediction is good enough.
- If the loss (error) is large, the prediction is not good enough.
Generalized Form: Optimization Function

Neuron activation (the prediction): y’ = f(x) = g(Σi wi·xi)

- Update `w` such that the loss/error reduces:  wi = wi ± (value)

Optimization function:
- Finds the `value` for a given set of weights `w` and loss L
- Determines both the direction (±) and the magnitude of the update
- Example: Stochastic Gradient Descent (SGD)

[Diagram: the backward pass propagates the loss back through the neuron to update the weights w1 … wn.]
Generalized Form: Objective Function

Objective Function: find the parameters that minimize the total loss over the training data,

    w* = arg min over w of  Σ over training examples of  e(y, g(Σi wi·xi))
This classifier is called a Perceptron or Logistic Regression

Generalized Form: Summary

1. To predict output values:
   Activation function: y’ = g(Σi wi·xi)

2. To find out how good or bad the prediction is:
   Loss function: L(x, y) = e(y, g(Σi wi·xi))

3. To update the parameters so that the prediction improves:
   Optimization function: wi = wi ± (value)
Activation Function: Linear

● Linear activation function: g(a) = a
● No squashing of the output
● Not interesting, and does NOT help learning: a stack of purely linear neurons is still just a linear function
Activation Function: Sigmoid

● Sigmoid activation: g(a) = 1 / (1 + exp(−a))
● Squashes the output into the range (0, 1)
● Strictly increasing function
Activation Function: TanH

● Hyperbolic tangent activation: g(a) = tanh(a)
● Squashes the output into the range (−1, 1)
● Strictly increasing function
Activation Function: ReLU

● Rectified Linear activation (ReLU): g(a) = max(0, a)
● Squashes the output into the range [0, ∞)
● There is no upper bound
● Output is never negative (it is exactly 0 for all negative inputs)
● Monotonically non-decreasing (flat for negative inputs, increasing for positive ones)
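For reference, a sketch of the four activation functions discussed above in plain Python:

```python
import math

def linear(a):
    return a                           # no squashing; not useful for learning on its own

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))  # output in (0, 1)

def tanh(a):
    return math.tanh(a)                # output in (-1, 1)

def relu(a):
    return max(0.0, a)                 # output in [0, inf), no upper bound

for a in (-2.0, 0.0, 2.0):
    print(a, linear(a), sigmoid(a), tanh(a), relu(a))
```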
Loss Functions

Common choices: MSE (squared error), L1 (absolute error), cross-entropy, hinge, etc.
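A small sketch of two of these loss functions, mean squared error and binary cross-entropy; the example targets and predictions are made up:

```python
import math

def mse(y_true, y_pred):
    # Mean squared error over a batch of predictions.
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true in {0, 1}, y_pred in (0, 1); eps guards against log(0).
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_pred)) / len(y_true)

print(mse([1.0, 0.0], [0.9, 0.2]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```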
To be continued …

● Optimization Functions
  ○ Stochastic Gradient Descent
  ○ Challenges with Regression / Perceptron
● Multi-Layer Perceptron
  ○ Backpropagation
● Challenges with Neural Networks
● Use Cases of Neural Networks
Optimization Function: Gradient Descent

Mathematical form: f(x1, x2) = y, with f(·) = w1·x1 + w2·x2

Start with w1 = 0.2, w2 = 0.8:  f(·) = 0.2·x1 + 0.8·x2

            Height (cm)   Weight (kg)   Beauty Score   Predicted Score
            x1            x2            y              y’
Person #1   178           81            98             100.4
Person #2   163           59            97             79.8
Person #3   181           78            82             98.6
Person #4   166           62            80             82.8
Person #x   177           79            75             98.6

How good (or bad) is my prediction?
Loss function: MSE = Σ (y − y’)² = 31547.4
Optimization Function: How to Update the Parameters

Mathematical form: f(x1, x2) = y, with f(·) = w1·x1 + w2·x2

Start with w1 = 0.2, w2 = 0.8:  f(·) = 0.2·x1 + 0.8·x2
Loss: MSE = Σ (y − y’)² = 31547.4   → a very large loss; our prediction/model is bad.

Update to w1 = 0.5, w2 = 0.5:   f(·) = 0.5·x1 + 0.5·x2
Loss: MSE = Σ (y − y’)² = 4600.5    → better than before.

Update to w1 = 0.4, w2 = 0.6:   f(·) = 0.4·x1 + 0.6·x2
Loss: MSE = Σ (y − y’)² = 2429.0    → even better than before.
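A minimal sketch of this trial-and-error search: it evaluates the squared-error loss on the table's data for each candidate (w1, w2). It illustrates the procedure rather than reproducing the exact loss figures quoted above:

```python
# (height, weight, beauty score) rows from the table above.
data = [(178, 81, 98), (163, 59, 97), (181, 78, 82), (166, 62, 80), (177, 79, 75)]

def sse(w1, w2):
    # Sum of squared errors for the linear model y' = w1*x1 + w2*x2.
    return sum((y - (w1 * x1 + w2 * x2)) ** 2 for x1, x2, y in data)

for w1, w2 in [(0.2, 0.8), (0.5, 0.5), (0.4, 0.6)]:
    print(w1, w2, sse(w1, w2))
```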
The Loss Curve

Start with  w1 = 0.2, w2 = 0.8   →  L = 31547.4
Update to   w1 = 0.5, w2 = 0.5   →  L = 4600.5
Update to   w1 = 0.4, w2 = 0.6   →  L = 2429.0

[Plot: the loss curve — loss L against the parameter value θ1, with the three losses above marked at 0.2, 0.4 and 0.5.]

Update the parameters using differential calculus.
Differential Calculus to the Rescue

Question: For which values of w1 and w2 is the loss function at its minimum?

Fermat's Theorem: If f(x) has a local extremum at x = a and f is differentiable at a, then f′(a) = 0.
Differential Calculus to the Rescue

Question: How do we update the values of w1 and w2?

Move along the slope (the negative gradient of the loss) to update w1 and w2.
Gradient Descent Algorithm

Repeat until the loss stops decreasing:
    wi ← wi − η · ∂L/∂wi        (η is the learning rate)

[Diagram: the forward pass computes the prediction and the loss; the backward pass propagates the gradient of the loss back to the weights w1 … wn.]
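A minimal gradient-descent sketch for the two-weight linear model from the example; the learning rate and number of steps are arbitrary illustrative choices:

```python
# Gradient descent on L(w1, w2) = sum (y - (w1*x1 + w2*x2))^2.
data = [(178, 81, 98), (163, 59, 97), (181, 78, 82), (166, 62, 80), (177, 79, 75)]
w1, w2 = 0.2, 0.8
lr = 1e-6          # learning rate (eta), kept small because the raw features are large

for _ in range(1000):
    # Analytic gradients: dL/dw1 = sum of -2*(y - y')*x1, and similarly for w2.
    g1 = sum(-2 * (y - (w1 * x1 + w2 * x2)) * x1 for x1, x2, y in data)
    g2 = sum(-2 * (y - (w1 * x1 + w2 * x2)) * x2 for x1, x2, y in data)
    w1 -= lr * g1   # move against the gradient (downhill)
    w2 -= lr * g2

print(w1, w2)
```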
Gradient Descent in Real World
This classifier is called a Perceptron or Logistic Regression

Generalized Form: Summary

1. To predict output values:
   Activation function: y’ = sigmoid(Σi wi·xi)

2. To find out how good or bad the prediction is:
   Loss function: L(x, y) = (y − y’)²

3. To update the parameters so that the prediction improves:
   Optimization function (gradient descent): wi ← wi − η · ∂L/∂wi
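Putting the three pieces together, a sketch of training this sigmoid perceptron with gradient descent on the toy gender data; encoding Male as 1 / Female as 0, the feature scaling, the learning rate and the epoch count are our own assumptions:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Toy data: (height, weight) -> gender, encoded as Male = 1, Female = 0.
data = [((178, 81), 1), ((163, 59), 0), ((181, 78), 1), ((166, 62), 0)]
w = [0.0, 0.0]
lr = 0.1

for _ in range(5000):
    for (x1, x2), y in data:
        f1, f2 = x1 / 100.0, x2 / 100.0          # scale features so the sigmoid does not saturate
        y_pred = sigmoid(w[0] * f1 + w[1] * f2)  # 1. activation function
        # 2./3. gradient of the squared loss (y - y_pred)**2 w.r.t. the pre-activation,
        # then a gradient-descent step on each weight:
        grad = -2 * (y - y_pred) * y_pred * (1 - y_pred)
        w[0] -= lr * grad * f1
        w[1] -= lr * grad * f2

print(w, sigmoid(w[0] * 1.77 + w[1] * 0.79))     # prediction for Person #x
```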
Problem #1: Always Goes Through the Origin

f = g(h(x)) = g(Σi wi·xi)

With no bias term, the pre-activation is 0 at x = 0, so the decision boundary is forced through the origin.
Problem #1: Always Goes Through the Origin — Fix: Add a Bias Term

[Diagram: the same neuron with an additional bias input b alongside the weights w1 … wn.]

f = g(h(x)) = g(Σi wi·xi + b)
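A sketch of the corresponding change to the forward pass: one extra learnable parameter b added to the pre-activation:

```python
import math

def neuron_forward_with_bias(x, w, b):
    # Pre-activation now includes the bias: h(x) = sum_i w_i*x_i + b,
    # so the decision boundary no longer has to pass through the origin.
    h = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-h))   # sigmoid activation g, as before

print(neuron_forward_with_bias([1.0, 2.0], [0.5, -0.3], 0.1))
```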
Problem #2: Getting Stuck in Local Minima
Problem #2: Getting Stuck in Local Minima

Without momentum vs. with momentum: adding momentum lets each update carry along a fraction of the previous update, which helps the optimizer roll past small local minima.
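A sketch of a momentum update for a single weight; the momentum coefficient 0.9 is a typical but arbitrary choice, and the gradient values in the usage loop are made up for illustration:

```python
def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # v accumulates an exponentially decaying sum of past gradients;
    # the accumulated velocity can carry the update through small local minima.
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Usage: keep one velocity per parameter and update both every step.
w, v = 0.0, 0.0
for grad in [1.0, 0.8, -0.2, 0.1]:   # hypothetical gradient values
    w, v = sgd_momentum_step(w, v, grad)
print(w, v)
```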
Problem #3: Strictly Linear Classification

A single perceptron can only produce a linear decision boundary. How do we solve this problem?
More than One Neuron / Perceptron?

[Diagram: left — a single neuron y1’ over inputs x1 … xn with weights w1 … wn; right — two neurons y1’ and y2’ over the same inputs, each with its own weights w11, w12, …, wn1, wn2.]
Multi-Layer Perceptron (MLP) / Neural Networks
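A minimal sketch of the forward pass through a small MLP (one hidden layer plus an output layer); the layer sizes and weight values are made up for illustration:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(x, weights, biases):
    # Each row of `weights` holds the incoming weights of one neuron in the layer.
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(x, layers):
    # `layers` is a list of (weights, biases) pairs, applied in order.
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x

hidden = ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1])   # 2 inputs -> 3 hidden
output = ([[1.0, -1.0, 0.5]], [0.0])                                  # 3 hidden -> 1 output
print(mlp_forward([1.0, 2.0], [hidden, output]))
```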
Backpropagation: How to Learn the Weights?

Backpropagation applies the chain rule to compute the gradient of the loss with respect to every weight in the network, layer by layer, so that gradient descent can update all of them.
How to Choose #Layers and #Neurons?

Problem #1: Possible Overfitting?

Problem #1: Bias-Variance Trade-Off
Solution #1: Regularization

Objective function:                          arg min over w of  Σt L(x(t), y(t))

Objective function with L1 regularization:   arg min over w of  Σt L(x(t), y(t)) + λ·Σi |wi|

Objective function with L2 regularization:   arg min over w of  Σt L(x(t), y(t)) + λ·Σi wi²
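A sketch of adding an L1 or L2 penalty to the data loss; `data_loss`, the weight list and λ (`lam`) are placeholder values you would supply:

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

def regularized_objective(data_loss, weights, lam=0.01, kind="l2"):
    # Objective = data loss + regularization term; penalizing large weights
    # discourages the model from overfitting the training data.
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty

print(regularized_objective(2.5, [0.4, -1.2, 0.7], lam=0.1, kind="l1"))
```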
Problem #2: How to Choose the Learning Rate?
Problem #3: How to Initialize W & B?

● For biases: initialize all to 0
● For weights:
  ○ Cannot initialize the weights to 0 if the tanh activation function is used
    ■ All gradients will end up being zero
  ○ Cannot initialize all the weight values to the same number
    ■ It highly restricts the overall learning process
  ○ Random initialization works in most cases!
  ○ Glorot et al. showed that the following initialization is close to ideal:
    W ~ Uniform(−√(6 / (nin + nout)), +√(6 / (nin + nout)))
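A sketch of the Glorot (Xavier) uniform initialization for one weight matrix, with zero-initialized biases; fan_in and fan_out are the number of incoming and outgoing connections of the layer:

```python
import math
import random

def glorot_uniform(fan_in, fan_out):
    # Glorot & Bengio (2010): sample W uniformly from [-limit, +limit]
    # with limit = sqrt(6 / (fan_in + fan_out)).
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[random.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = glorot_uniform(fan_in=4, fan_out=3)   # 3 neurons, 4 inputs each
biases = [0.0] * 3                              # biases start at zero
print(weights, biases)
```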
Problem #4: MLP is a Universal Approximator

• A neural network with a single hidden layer is a universal approximator
• It can approximate essentially any continuous function to arbitrary accuracy, given enough hidden units
• But it is not easy to train!
Problem #5: Going Wide vs. Going Deep?

[Figure: two architectures compared — Going Wide (more neurons per layer) vs. Going Deep (more layers).]
Problem #6: Vanishing Gradient Problem

The gradient that reaches the early layers is a product of many per-layer factors λ, each with 0 ≤ λ ≤ 1; multiplying many such factors drives the gradient towards zero, so the early layers barely learn.
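A tiny numerical illustration: multiplying many per-layer factors in [0, 1] (here a constant 0.25, the maximum slope of the sigmoid, chosen purely for illustration) shrinks the gradient exponentially with depth:

```python
# Gradient reaching the first layer of a deep net is (roughly) a product of
# per-layer factors; if each factor is at most 0.25, it vanishes quickly.
grad = 1.0
for layer in range(1, 11):
    grad *= 0.25
    print(f"after {layer} layers: {grad:.2e}")
# After 10 layers the gradient has shrunk by a factor of about a million.
```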
Summary: All the Hyperparameters!

(Weights and biases are the only learnable parameters; everything below is a hyperparameter that you choose.)

● Architecture of the neural network model
  ○ Number of layers
  ○ Number of nodes in each layer
● Which activation function to use
  ○ Sigmoid, tanh, ReLU, softmax, etc.
● Which loss function to use
  ○ MSE, L1, cross-entropy, hinge, etc.
● Which optimization function to use
  ○ SGD, mini-batch gradient descent, Adam, RMSProp
● Which initialization method to use for the weights and biases
● Which learning rate to use: lr = [0.1, 0.01, 0.001, 0.0001]; dynamic learning rates are possible!
● Which regularization to use, and what regularization factor?
No Free Lunch Theorem!

● There are no universal solutions to ML problems.
● All ML approaches are equally good if we do not place strong assumptions on the input data.
● For every ML algorithm, there exists a sample or sample class where it outperforms some other method.
How To Learn Features?

How to learn these features?

Unsupervised Feature Learning

[Diagram: left — a neural network whose hidden layer turns the input data into features; right — an AutoEncoder, with an input layer, a hidden layer of features, and an output layer that reconstructs the input data.]
Going Deep: Deep Learning

Stacked AutoEncoder = Deep Learning

[Diagram: the input data is passed through encoding layers Hidden#1, Hidden#2 and Hidden#3, then through matching decoding layers that reconstruct the input data.]
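A minimal sketch of a single-hidden-layer autoencoder forward pass and its reconstruction loss; the layer sizes and weight values are made up, and training (by backpropagation, as before) is omitted:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dense(x, weights, biases):
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def autoencoder_forward(x, enc, dec):
    features = dense(x, *enc)                  # encode: input -> hidden features
    reconstruction = dense(features, *dec)     # decode: features -> input size
    return features, reconstruction

def reconstruction_loss(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

enc = ([[0.4, -0.2, 0.1], [0.3, 0.5, -0.4]], [0.0, 0.0])          # 3 inputs -> 2 features
dec = ([[0.7, -0.1], [0.2, 0.6], [-0.3, 0.5]], [0.0, 0.0, 0.0])   # 2 features -> 3 outputs
x = [0.9, 0.1, 0.4]
features, x_hat = autoencoder_forward(x, enc, dec)
print(features, x_hat, reconstruction_loss(x, x_hat))
```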
Going Deep: Deep Learning Classifier

[Diagram: the encoding layers Hidden#1, Hidden#2 and Hidden#3 are kept, and the deepest features feed a classifier.]
Convolutional Neural Networks: n-Dimensional Data
Recurrent Neural Networks: Sequential Data

Everything about NNs

Questions?
