
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Advanced Learning Algorithms

Welcome!
Advanced learning algorithms
Neural Networks
inference (prediction)
training
Practical advice for building machine learning systems
Decision Trees

Andrew Ng
Neural Networks Intuition

Neurons and the brain


Neural networks
Origins: Algorithms that try to mimic the brain.

Used in the 1980’s and early 1990’s.


Fell out of favor in the late 1990’s.

Resurgence from around 2005.

speech images text (NLP)

Andrew Ng
Neurons in the brain

[Credit: US National Institutes of Health, National Institute on Aging]


(diagram: a biological neuron and a simplified mathematical model of a neuron — both map inputs to outputs)

image source: https://biologydictionary.net/sensory-neuron/

Andrew Ng
Why Now?
(plot: performance vs. amount of data — large neural networks keep improving as data grows,
while medium and small neural networks and traditional AI plateau)

Andrew Ng
Neural Network Intuition

Demand Prediction
Demand Prediction
input:  x = price
output: a = f(x) = 1 / (1 + e^-(wx+b))
        = probability of being a top seller (yes/no)

price x  →  a

Andrew Ng
Demand Prediction
layer

price
layer
shipping cost
probability of
marketing being a top seller
material

Andrew Ng
Demand Prediction
layer

price
layer
shipping cost a a
probability of
marketing being a top seller
material

Andrew Ng
Demand Prediction
layer

price
layer
shipping cost a a
probability of
marketing being a top seller
material

Andrew Ng
Multiple hidden layers

x (input layer)  →  hidden layer  →  hidden layer  →  …  →  output layer  →  a

Choosing how many hidden layers and how many units per layer is part of deciding the
neural network architecture.


Andrew Ng
Neural Networks Intuition

Example:
Recognizing Images
Face recognition

x=

Andrew Ng
Face recognition

output layer
⋮ probability of being
⋮ ⋮
person ‘XYZ’

x
input

source: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations
by Honglak Lee, Roger Grosse, Ranganath Andrew Y. Ng

Andrew Ng
Car classification

output layer
⋮ probability of
⋮ ⋮
car detected

source: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations
by Honglak Lee, Roger Grosse, Ranganath Andrew Y. Ng

Andrew Ng
Neural network model

Neural network layer


Neural network layer

Andrew Ng
Neural network layer
g(z) = 1 / (1 + e^-z)

layer 0 (input x)  →  layer 1  →  layer 2

unit 1:  w1[1], b1[1]    a1[1] = g(w1[1] ∙ x + b1[1])
unit 2:  w2[1], b2[1]    a2[1] = g(w2[1] ∙ x + b2[1])
unit 3:  w3[1], b3[1]    a3[1] = g(w3[1] ∙ x + b3[1])

a[1] = vector of activation values from layer 1

notation: the superscript [1] gives the layer number
Andrew Ng
Neural network layer
g(z) = 1 / (1 + e^-z)

x  →  a[1]  →  a[2]

layer 2 (1 unit):  w1[2], b1[2]    a1[2] = g(w1[2] ∙ a[1] + b1[2])

a[2] is a scalar value
Andrew Ng
Neural network layer
predict category 1 or 0 (yes/no)

x  →  a[1]  →  a[2]

is a1[2] ≥ 0.5?
  yes: ŷ = 1
  no:  ŷ = 0
Andrew Ng
Neural Network Model

More complex neural networks


More complex neural network

x (input, layer 0)  →  a[1]  →  a[2]  →  a[3]  →  a[4]

layer 1    layer 2    layer 3    layer 4
      hidden layers              output layer

Andrew Ng
More complex neural network

x (input)  →  a[1]  →  a[2]  →  a[3]  →  a[4]
layer 1    layer 2    layer 3    layer 4

Layer 3 units (taking a[2] as input):
w1[3], b1[3]    a1[3] = g(w1[3] ∙ a[2] + b1[3])
w2[3], b2[3]    a2[3] = g(w2[3] ∙ a[2] + b2[3])
w3[3], b3[3]    a3[3] = g(w3[3] ∙ a[2] + b3[3])

Andrew Ng
Notation

x (input)  →  a[1]  →  a[2]  →  a[3]  →  a[4]
layer 1    layer 2    layer 3    layer 4

Layer 3 units:
w1[3], b1[3]    a1[3] = g(w1[3] ∙ a[2] + b1[3])
w2[3], b2[3]    a2[3] = g(w2[3] ∙ a[2] + b2[3])
w3[3], b3[3]    a3[3] = g(w3[3] ∙ a[2] + b3[3])

Question:
Can you fill in the superscripts and
subscripts for the second neuron?

Andrew Ng
Notation

x (input)  →  a[1]  →  a[2]  →  a[3]  →  a[4]
layer 1    layer 2    layer 3    layer 4

Layer 3 units:
w1[3], b1[3]    a1[3] = g(w1[3] ∙ a[2] + b1[3])
w2[3], b2[3]    a2[3] = g(w2[3] ∙ a[2] + b2[3])
w3[3], b3[3]    a3[3] = g(w3[3] ∙ a[2] + b3[3])

Answer:  a2[3] = g(w2[3] ∙ a[2] + b2[3])

General form:
aj[l] = g(wj[l] ∙ a[l-1] + bj[l])

aj[l]: activation value of layer l, unit (neuron) j
a[l-1]: output of layer l−1 (the previous layer)
wj[l], bj[l]: parameters w & b of layer l, unit j
g: the sigmoid (activation) function
Andrew Ng
Neural Network Model

Inference: making predictions


(forward propagation)
Handwritten digit recognition
Digit images, label 0 or 1

x  →  a[1]  →  a[2]  →  a[3] = probability of being a handwritten ‘1’

layer 1: 25 units    layer 2: 15 units    layer 3 (output layer): 1 unit

Andrew Ng
Handwritten digit recognition
Digit images

label 0 1

⋮ a[1] a[2] a[3] probability


x ⋮ of being a
handwritten ‘1’
output
layer

25 units 15 units 1 unit a[2]


layer 1 layer 2 layer 3

Andrew Ng
Handwritten digit recognition
forward propagation

x  →  a[1]  →  a[2]  →  a[3] = f(x) = probability of being a handwritten ‘1’

layer 1: 25 units    layer 2: 15 units    layer 3 (output layer): 1 unit

a[3] = g(w1[3] ∙ a[2] + b1[3])

is a1[3] ≥ 0.5?
  yes: ŷ = 1 (image is digit 1)
  no:  ŷ = 0 (image isn’t digit 1)

Andrew Ng
TensorFlow implementation

Inference in Code
Coffee roasting
inputs x: Temperature (Celsius), Duration (minutes)

x  →  a[1]  →  a[2]

is a1[2] ≥ 0.5?
  yes: ŷ = 1
  no:  ŷ = 0
Andrew Ng
x a[1] a[2]

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)

Andrew Ng
Build the model using TensorFlow
x a[1] a[2]

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)

layer_2 = Dense(units=1, activation='sigmoid')
a2 = layer_2(a1)

Andrew Ng
Build the model using TensorFlow
x  →  a[1]  →  a[2]

is a1[2] ≥ 0.5?   ŷ = 1 or ŷ = 0

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)

layer_2 = Dense(units=1, activation='sigmoid')
a2 = layer_2(a1)

if a2 >= 0.5:
    yhat = 1
else:
    yhat = 0

Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=15, activation='sigmoid')
a2 = layer_2(a1)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=15, activation='sigmoid')
a2 = layer_2(a1)
layer_3 = Dense(units=1, activation='sigmoid')
a3 = layer_3(a2)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=15, activation='sigmoid')
a2 = layer_2(a1)
layer_3 = Dense(units=1, activation='sigmoid')
a3 = layer_3(a2)

if a3 >= 0.5:
    yhat = 1
else:
    yhat = 0

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)
is a1[3] ≥ 0.5?   ŷ = 1 or ŷ = 0

Andrew Ng
TensorFlow implementation

Data in TensorFlow
Feature vectors
temperature (Celsius)   duration (minutes)   Good coffee? (1/0)
200.0                   17.0                 1
425.0                   18.5                 0
…                       …                    …

x = np.array([[200.0, 17.0]])     # [[200.0, 17.0]]

Andrew Ng
Note about numpy arrays
x = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2x3 matrix: [[1, 2, 3], [4, 5, 6]]

x = np.array([[0.1, 0.2],
              [-3.0, -4.0],
              [-0.5, -0.6],
              [7.0, 8.0]])       # 4x2 matrix

Andrew Ng
Note about numpy arrays
x = np.array([[200, 17]])     # 1x2 row matrix: [[200, 17]]

x = np.array([[200],
              [17]])           # 2x1 column matrix

x = np.array([200, 17])        # 1D array (no rows or columns)

Andrew Ng
Feature vectors
temperature (Celsius)   duration (minutes)   Good coffee? (1/0)
200.0                   17.0                 1
425.0                   18.5                 0
…                       …                    …

x = np.array([[200.0, 17.0]])     # 1x2 matrix: [[200.0, 17.0]]

Andrew Ng
Activation vector
x a[1] a[2]

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)
1 x 3 matrix
tf.Tensor([[0.2 0.7 0.3]], shape=(1, 3), dtype=float32)

a1.numpy()
array([[1.4661001, 1.125196 , 3.2159438]], dtype=float32)

Andrew Ng
Activation vector
x a[1] a[2]

layer_2 = Dense(units=1, activation='sigmoid')


a2 = layer_2(a1)
1x1
tf.Tensor([[0.8]], shape=(1, 1), dtype=float32)

a2.numpy()
array([[0.8]], dtype=float32)

Andrew Ng
TensorFlow implementation

Building a neural network


What you saw earlier
x = np.array([[200.0, 17.0]])
x a[1] a[2] layer_1 = Dense(units=3, activation="sigmoid")
a1 = layer_1(x)

layer_2 = Dense(units=1, activation="sigmoid")


a2 = layer_2(a1)

Andrew Ng
Building a neural network architecture
x a[1] a[2] layer_1 = Dense(units=3, activation="sigmoid")
layer_2 = Dense(units=1, activation="sigmoid")
model = Sequential([layer_1, layer_2])

x = np.array([[200.0, 17.0],
              [120.0,  5.0],
              [425.0, 20.0],
              [212.0, 18.0]])   # 4x2
y = np.array([1, 0, 0, 1])

model.compile(...)
model.fit(x, y)
model.predict(x_new)

data (temperature, duration → y):  200 17 → 1,  120 5 → 0,  425 20 → 0,  212 18 → 1

Andrew Ng
Digit classification model
layer_1 = Dense(units=25, activation="sigmoid")
layer_2 = Dense(units=15, activation="sigmoid")
layer_3 = Dense(units=1, activation="sigmoid")

model = Sequential([layer_1, layer_2, layer_3])
model.compile(...)

x = np.array([[0..., 245, ..., 17],
              [0..., 200, ..., 184]])
y = np.array([1, 0])

model.fit(x, y)
model.predict(x_new)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Digit classification model
model = Sequential([
    Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=1, activation="sigmoid")])

model.compile(...)

x = np.array([[0..., 245, ..., 17],
              [0..., 200, ..., 184]])
y = np.array([1, 0])

model.fit(x, y)
model.predict(x_new)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Neural network implementation
in Python

Forward prop in a single layer


forward prop (coffee roasting model)

x = np.array([200, 17])

x  →  a[1] (3 units)  →  a[2] (1 unit)

Layer 1:  a1_j = g(w1_j ∙ x + b1_j)

w1_1 = np.array([1, 2])           w1_2 = np.array([-3, 4])          w1_3 = np.array([5, -6])
b1_1 = np.array([-1])             b1_2 = np.array([1])              b1_3 = np.array([2])
z1_1 = np.dot(w1_1, x) + b1_1     z1_2 = np.dot(w1_2, x) + b1_2     z1_3 = np.dot(w1_3, x) + b1_3
a1_1 = sigmoid(z1_1)              a1_2 = sigmoid(z1_2)              a1_3 = sigmoid(z1_3)

a1 = np.array([a1_1, a1_2, a1_3])

Layer 2:  a2_1 = g(w2_1 ∙ a[1] + b2_1)

w2_1 = np.array([-7, 8])
b2_1 = np.array([3])
z2_1 = np.dot(w2_1, a1) + b2_1
a2_1 = sigmoid(z2_1)

Andrew Ng
Neural network implementation
in Python

General implementation of
forward propagation
Forward prop in NumPy

x  →  a[1]  →  …

w1[1] = [1, 2],  w2[1] = [-3, 4],  w3[1] = [5, -6]
b1[1] = -1,  b2[1] = 1,  b3[1] = 2

W = np.array([
    [1, -3, 5],
    [2, 4, -6]])
b = np.array([-1, 1, 2])
a_in = np.array([-2, 4])      # a[0] = x

def dense(a_in, W, b, g):
    units = W.shape[1]
    a_out = np.zeros(units)
    for j in range(units):
        w = W[:,j]
        z = np.dot(w, a_in) + b[j]
        a_out[j] = g(z)
    return a_out

def sequential(x):
    a1 = dense(x, W1, b1, g)
    a2 = dense(a1, W2, b2, g)
    a3 = dense(a2, W3, b3, g)
    a4 = dense(a3, W4, b4, g)
    f_x = a4
    return f_x

Andrew Ng
Speculations on artificial general
intelligence (AGI)

Is there a path to AGI?


AI

ANI AGI
(artificial narrow intelligence) (artificial general intelligence)
E.g., smart speaker, Do anything a human can do
self-driving car, web search,
AI in farming and factories

Andrew Ng
Biological neuron Simplified mathematical
model of a neuron
inputs outputs inputs outputs

image source: https://biologydictionary.net/sensory-neuron/

Andrew Ng
Neural network and the brain
Can we mimic the human brain?

vs x 𝑓(x)

We have (almost) no idea how the brain works

Andrew Ng
The ”one learning algorithm” hypothesis

Auditory cortex

Auditory cortex learns to see


[Roe et al., 1992]

Andrew Ng
The ”one learning algorithm” hypothesis

Somatosensory cortex

Somatosensory cortex learns to see


[Metin & Frost, 1989]

Andrew Ng
Sensor representations in the brain

Seeing with your tongue Human echolocation (sonar)

Haptic belt: Direction sense Implanting a 3rd eye


[BrainPort; Welsh & Blasch, 1997; Nagel et al., 2005; Constantine-Paton & Law, 2009]

Andrew Ng
Vectorization
(optional)

How neural networks are


implemented efficiently
For loops vs. vectorization

Loop version:                              Vectorized version:
x = np.array([200, 17])                    X = np.array([[200, 17]])
W = np.array([[1, -3, 5],                  W = np.array([[1, -3, 5],
              [-2, 4, -6]])                              [-2, 4, -6]])
b = np.array([-1, 1, 2])                   B = np.array([[-1, 1, 2]])

def dense(a_in, W, b):                     def dense(A_in, W, B):
    units = W.shape[1]                         Z = np.matmul(A_in, W) + B
    a_out = np.zeros(units)                    A_out = g(Z)
    for j in range(units):                     return A_out
        w = W[:,j]
        z = np.dot(w, a_in) + b[j]
        a_out[j] = g(z)
    return a_out

output: [1, 0, 1]                          output: [[1, 0, 1]]

Andrew Ng
Vectorization
(optional)

Matrix multiplication
Dot products

example:  a = [1, 2],  w = [3, 4]
          z = a ∙ w = (1)(3) + (2)(4) = 11

in general:  z = a ∙ w

transpose:  aᵀ = [1  2]   (a written as a row vector)
vector-vector multiplication:  z = aᵀ w

Andrew Ng
Vector matrix multiplication

a = [1]        aᵀ = [1  2]        W = [w1  w2] = [3  5]
    [2]                                          [4  6]

Z = aᵀ W = [aᵀ w1   aᵀ w2]
         = [(1·3) + (2·4)   (1·5) + (2·6)]
         = [11  17]

Andrew Ng
Matrix matrix multiplication

A = [1  -1]        Aᵀ = [← a1ᵀ →] = [ 1   2]        W = [w1  w2] = [3  5]
    [2  -2]             [← a2ᵀ →]   [-1  -2]                       [4  6]

Z = Aᵀ W = [a1ᵀ w1   a1ᵀ w2]  =  [ 11   17]
           [a2ᵀ w1   a2ᵀ w2]     [-11  -17]

Andrew Ng
Vectorization
(optional)

Matrix multiplication rules


Matrix multiplication rules

A = [1  -1  0.1]     Aᵀ = [ 1    2 ]     W = [3  5  7  9]     Z = Aᵀ W = ?
    [2  -2  0.2]          [-1   -2 ]         [4  6  8  0]
                          [0.1  0.2]

Andrew Ng
Matrix multiplication rules

A = [1  -1  0.1]     Aᵀ = [ 1    2 ]     W = [3  5  7  9]     Z = Aᵀ W = ?
    [2  -2  0.2]          [-1   -2 ]         [4  6  8  0]
                          [0.1  0.2]

Andrew Ng
Matrix multiplication rules

A = [1  -1  0.1]     Aᵀ = [ 1    2 ]     W = [3  5  7  9]     Z = Aᵀ W = [ 11   17   23    9 ]
    [2  -2  0.2]          [-1   -2 ]         [4  6  8  0]                [-11  -17  -23   -9 ]
                          [0.1  0.2]                                     [ 1.1  1.7  2.3  0.9]

Andrew Ng
Matrix multiplication rules

A = [1  -1  0.1]     Aᵀ = [ 1    2 ]     W = [3  5  7  9]     Z = Aᵀ W = [ 11   17   23    9 ]
    [2  -2  0.2]          [-1   -2 ]         [4  6  8  0]                [-11  -17  -23   -9 ]
                          [0.1  0.2]                                     [ 1.1  1.7  2.3  0.9]

Andrew Ng
Vectorization
(optional)

Matrix multiplication code


Matrix multiplication in NumPy

Aᵀ W = [ 11   17   23    9 ]
       [-11  -17  -23   -9 ]
       [ 1.1  1.7  2.3  0.9]

A = np.array([[1, -1, 0.1],
              [2, -2, 0.2]])
W = np.array([[3, 5, 7, 9],
              [4, 6, 8, 0]])
AT = np.array([[1, 2],
               [-1, -2],
               [0.1, 0.2]])
AT = A.T                       # equivalent

Z = np.matmul(AT, W)
Z = AT @ W                     # equivalent

# [[ 11,   17,   23,   9 ],
#  [-11,  -17,  -23,  -9 ],
#  [ 1.1,  1.7,  2.3,  0.9]]

Andrew Ng
Dense layer vectorized

AT = np.array([[200, 17]])
W = np.array([[1, -3, 5],
              [-2, 4, -6]])
b = np.array([[-1, 1, 2]])

def dense(AT, W, b, g):
    z = np.matmul(AT, W) + b
    a_out = g(z)
    return a_out

Aᵀ = [200  17]      W = [ 1  -3   5]      b = [-1  1  2]
                        [-2   4  -6]

Z = Aᵀ W + B = [165  -531  900]       (z1[1], z2[1], z3[1])
A = g(Z) = [1  0  1]                  → [[1, 0, 1]]

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Neural Network
Training

TensorFlow
implementation
Train a Neural Network in TensorFlow
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy())
model.fit(X, Y, epochs=100)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Given a set of (x,y) examples, how to build and train this in code?

Andrew Ng
Neural Network
Training

Training Details
Model Training Steps

1. Specify how to compute output given input x and parameters w,b (define the model):  f_w,b(x) = ?
   logistic regression:                        neural network:
   z = np.dot(w, x) + b                        model = Sequential([
   f_x = 1/(1 + np.exp(-z))                        Dense(...), Dense(...), Dense(...)])

2. Specify loss and cost:
   L(f_w,b(x), y)  — loss on 1 example
   J(w,b) = (1/m) Σ_{i=1}^{m} L(f_w,b(x^(i)), y^(i))  — cost over the training set
   logistic loss:                               binary cross entropy:
   loss = -y * np.log(f_x) - (1-y) * np.log(1-f_x)
                                                model.compile(loss=BinaryCrossentropy())

3. Train on data to minimize J(w,b):
   w = w - alpha * dj_dw                        model.fit(X, y, epochs=100)
   b = b - alpha * dj_db
Andrew Ng
1. Create the model
define the model: f(x) = ?

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

parameters:  W[1], b[1]   W[2], b[2]   W[3], b[3]

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
2. Loss and cost functions
MNIST digit binary classification problem

L(f(x), y) = -y log(f(x)) - (1 - y) log(1 - f(x))

J(W,B) = (1/m) Σ_{i=1}^{m} L(f(x^(i)), y^(i))        parameters: W[1], W[2], W[3], b[1], b[2], b[3]

from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy())

For regression (predicting numbers and not categories):
from tensorflow.keras.losses import MeanSquaredError
model.compile(loss=MeanSquaredError())

Andrew Ng
3. Gradient descent
repeat {
    wj[l] = wj[l] - α ∂/∂wj J(w,b)
    bj[l] = bj[l] - α ∂/∂bj J(w,b)
}

model.fit(X, y, epochs=100)

Andrew Ng
Neural network libraries
Use code libraries instead of coding ”from scratch”

Good to understand the implementation


(for tuning and debugging).

Andrew Ng
Activation Functions

Alternatives to the
sigmoid activation
Demand Prediction Example
a2[1] = g(w2[1] ∙ x + b2[1])

price, shipping cost, marketing, material  →  affordability, awareness, perceived quality

Sigmoid:  g(z) = 1 / (1 + e^-z),   0 < g(z) < 1
ReLU:     g(z) = max(0, z)

Andrew Ng
Examples of Activation Functions
a2[1] = g(w2[1] ∙ x + b2[1])

Linear activation function:  g(z) = z
Sigmoid:                     g(z) = 1 / (1 + e^-z),   0 < g(z) < 1
ReLU:                        g(z) = max(0, z)

Andrew Ng
Activation Functions

Choosing activation
functions
Output Layer
x  →  a[1]  →  a[2]  →  a[3] = f(x)       Choosing g(z) for the output layer:

Binary classification (y = 0/1):      Sigmoid
Regression (y can be +/−):            Linear activation function
Regression (y ≥ 0):                   ReLU

Andrew Ng
Hidden Layer
x  →  a[1]  →  a[2]  →  a[3] = f(x)       Choosing g(z) for the hidden layers:

Sigmoid:  g(z) = 1 / (1 + e^-z)
ReLU:     g(z) = max(0, z)   — the common default (gradient descent on J(W,B) tends to be faster)

Andrew Ng
Choosing Activation Summary
x  →  a[1]  →  a[2]  →  a[3] = f(x)

Output layer:   activation='sigmoid'   (binary classification)
                activation='linear'    (regression, y can be +/−)
                activation='relu'      (regression, y ≥ 0)
Hidden layers:  ReLU

from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

Andrew Ng
Activation Functions

Why do we need
activation functions?
Why do we need activation functions?
price, shipping cost, marketing, material  →  affordability, awareness, perceived quality  →  top seller?

x  →  a[1]  →  a[2]

If all g(z) are linear (g(z) = z), the network is no different than linear regression:
f(x) = w ∙ x + b

Andrew Ng
Linear Example

With linear activation g(z) = z:

a[1] = w1[1] x + b1[1]
a[2] = w1[2] a[1] + b1[2]
     = w1[2] (w1[1] x + b1[1]) + b1[2]
     = (w1[2] w1[1]) x + (w1[2] b1[1] + b1[2])
     = w x + b        (just a linear function of x)

Andrew Ng
Example
x  →  a[1]  →  a[2]  →  a[3]  →  a[4]

If the hidden layers use g(z) = z and the output layer uses sigmoid:
a[4] = 1 / (1 + e^-(w1[4] ∙ a[3] + b1[4]))   — equivalent to logistic regression

Don’t use linear activations in hidden layers.

Andrew Ng
Multiclass
Classification

Multiclass
MNIST example

𝑦=0 1 2 3 4 5 6 7 8 9

𝑦=7

Andrew Ng
Multiclass classification example
(plots over x1, x2: binary classification estimates P(y = 1 | x);
multiclass classification estimates P(y = 1 | x), P(y = 2 | x), P(y = 3 | x), P(y = 4 | x))

Andrew Ng
Multiclass
Classification

Softmax
Logistic regression (2 possible output values):
z = w ∙ x + b
a = g(z) = 1 / (1 + e^-z) = P(y = 1 | x)

Softmax regression (4 possible outputs):
z1 = w1 ∙ x + b1    a1 = e^z1 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 1 | x)
z2 = w2 ∙ x + b2    a2 = e^z2 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 2 | x)
z3 = w3 ∙ x + b3    a3 = e^z3 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 3 | x)
z4 = w4 ∙ x + b4    a4 = e^z4 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 4 | x)

Softmax regression (N possible outputs):
zj = wj ∙ x + bj,   j = 1, …, N
aj = e^zj / Σ_{k=1}^{N} e^zk = P(y = j | x)
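For intuition, here is a minimal NumPy sketch of the softmax formula above; the z values are made-up numbers for illustration.

import numpy as np

def softmax(z):
    # aj = e^zj / sum_k e^zk
    ez = np.exp(z)
    return ez / ez.sum()

z = np.array([2.0, 1.0, 0.1, -1.0])   # hypothetical z1..z4
a = softmax(z)
print(a, a.sum())                      # probabilities for the 4 classes; they sum to 1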

Andrew Ng
Cost
Logistic regression:                          Softmax regression:
z = w ∙ x + b                                 a1 = e^z1 / (e^z1 + e^z2 + ⋯ + e^zN) = P(y = 1 | x)
a1 = g(z) = 1 / (1 + e^-z) = P(y = 1 | x)     ⋮
a2 = 1 - a1 = P(y = 0 | x)                    aN = e^zN / (e^z1 + e^z2 + ⋯ + e^zN) = P(y = N | x)

loss = -y log(a1) - (1 - y) log(1 - a1)       Crossentropy loss:
J(w,b) = average loss                         loss(a1, …, aN, y) = -log(aj)  if y = j

Andrew Ng
Multiclass
Classification

Neural Network with


Softmax output
Neural Network with Softmax output

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (10 units, softmax output layer)

z1[3] = w1[3] ∙ a[2] + b1[3]        a1[3] = e^z1[3] / (e^z1[3] + ⋯ + e^z10[3]) = P(y = 1 | x)
⋮
z10[3] = w10[3] ∙ a[2] + b10[3]     a10[3] = e^z10[3] / (e^z1[3] + ⋯ + e^z10[3]) = P(y = 10 | x)

Unlike logistic regression units, where aj[3] = g(zj[3]) depends on zj[3] alone,
the softmax output a[3] = (a1[3], …, a10[3]) = g(z1[3], …, z10[3]) — each activation
depends on all of the z values.

Andrew Ng
specify the model  f_w,b(x) = ?          MNIST with softmax
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

specify loss and cost  L(f_w,b(x), y)
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy())

train on data to minimize J(w,b)
model.fit(X, Y, epochs=100)
Note: better (recommended) version later.

Andrew Ng
Multiclass
Classification

Improved implementation
of softmax
Numerical Roundoff Errors
option 1:  x = 2 / 10,000
option 2:  x = (1 + 1/10,000) - (1 - 1/10,000)

Andrew Ng
Numerical Roundoff Errors
More numerically accurate implementation of logistic loss:

Logistic regression:  a = g(z) = 1 / (1 + e^-z)

Original loss:
loss = -y log(a) - (1 - y) log(1 - a)

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])
model.compile(loss=BinaryCrossentropy())

More accurate loss (in code) — let TensorFlow keep a = 1/(1 + e^-z) expanded inside the loss:
loss = -y log(1/(1 + e^-z)) - (1 - y) log(1 - 1/(1 + e^-z))

model.compile(loss=BinaryCrossentropy(from_logits=True))

Andrew Ng
More numerically accurate implementation of softmax

Softmax regression:  (a1, …, a10) = g(z1, …, z10)

Loss = L(a, y) = -log(a1)   if y = 1
                 ⋮
                 -log(a10)  if y = 10

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])
model.compile(loss=SparseCategoricalCrossentropy())

More accurate:
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))

Andrew Ng
MNIST (more numerically accurate)
model
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear')
])

loss
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(..., loss=SparseCategoricalCrossentropy(from_logits=True))

fit
model.fit(X, Y, epochs=100)

predict
logits = model(X)
f_x = tf.nn.softmax(logits)

Andrew Ng
logistic regression
(more numerically accurate)
model
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='linear')
])

loss
from tensorflow.keras.losses import BinaryCrossentropy
model.compile(..., loss=BinaryCrossentropy(from_logits=True))

fit
model.fit(X, Y, epochs=100)

predict
logit = model(X)
f_x = tf.nn.sigmoid(logit)

Andrew Ng
Multi-label
Classification

Classification with
multiple outputs
(Optional)
Multi-label Classification

Is there a car?
Is there a bus?
Is there a pedestrian?

Andrew Ng
Multiple classes
car bus pedestrian

Alternatively, train one neural network with three outputs

x 𝑎Ԧ [1] 𝑎Ԧ [2] 𝑎Ԧ [3]
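One way to realize "one neural network with three outputs" is an output layer of three sigmoid units, one per label. This is a hedged sketch, not the course's exact code; the hidden-layer sizes are assumptions.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Multi-label classifier: each output is an independent yes/no
# (car? bus? pedestrian?), so the output layer uses 3 sigmoid units.
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=3, activation='sigmoid'),   # a[3] = [P(car), P(bus), P(pedestrian)]
])
model.compile(loss='binary_crossentropy')   # binary cross entropy applied per output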

Andrew Ng
Additional Neural Network
Concepts

Advanced Optimization
Gradient Descent
wj = wj - α ∂/∂wj J(w,b)

(contour plots of J(w,b) over w1, w2)
small, steady steps toward the minimum:  go faster – increase α
steps that oscillate back and forth:     go slower – decrease α

Andrew Ng
Adam Algorithm Intuition
Adam: Adaptive Moment estimation — one learning rate per parameter

w1 = w1 - α1 ∂/∂w1 J(w,b)
⋮
w10 = w10 - α10 ∂/∂w10 J(w,b)
b = b - α11 ∂/∂b J(w,b)

Andrew Ng
Adam Algorithm Intuition

𝑤2 𝐽 w, 𝑏 𝑤2 𝐽 w, 𝑏

𝑤1 𝑤1
If 𝑤𝑗 (or 𝑏) keeps moving If 𝑤𝑗 (or 𝑏) keeps oscillating,
in same direction, reduce 𝛼𝑗 .
increase 𝛼𝑗 .

Andrew Ng
model
MNIST Adam
model = Sequential([
    tf.keras.layers.Dense(units=25, activation='sigmoid'),
    tf.keras.layers.Dense(units=15, activation='sigmoid'),
    tf.keras.layers.Dense(units=10, activation='linear')
])

compile
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

fit
model.fit(X,Y,epochs=100)

Andrew Ng
Additional Neural Network
Concepts

Additional Layer Types


Dense Layer

x 𝑎Ԧ [1] 𝑎Ԧ [2] a[3]

Each neuron output is a function of


all the activation outputs of the previous layer.

a1[2] = g(w1[2] ∙ a[1] + b1[2])

Andrew Ng
Convolutional Layer

Each Neuron only looks at


part of the previous layer’s inputs.

Why?
• Faster computation
• Need less training data
(less prone to
overfitting)

Andrew Ng
Convolutional Neural Network
EKG signal: inputs x1, x2, x3, …, x100

Layer 1 (9 units), each unit looks at a window of the input:
x1–x20,  x11–x30,  x21–x40,  …,  x81–x100

Layer 2 (3 units), each unit looks at a window of layer 1’s activations:
a1[1]–a5[1],  a3[1]–a7[1],  a5[1]–a9[1]

x  →  a[1]  →  a[2]  →  a[3]

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Advice for applying
machine learning

Deciding what to try next


Debugging a learning algorithm
You’ve implemented regularized linear regression on housing prices
J(w,b) = (1/2m) Σ_{i=1}^{m} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1}^{n} wj^2
But it makes unacceptably large errors in predictions. What do you
try next?
Get more training examples
Try smaller sets of features
Try getting additional features
Try adding polynomial features (𝑥12, 𝑥22, 𝑥1𝑥2, 𝑒𝑡𝑐)
Try decreasing 𝜆
Try increasing 𝜆

Andrew Ng
Machine learning diagnostic
Diagnostic: A test that you run to gain insight into what
is/isn’t working with a learning algorithm, to gain guidance
into improving its performance.

Diagnostics can take time to implement but doing so can be


a very good use of your time.

Andrew Ng
Evaluating and
choosing models

Evaluating a model
Evaluating your model
A model that fits the training data well may still fail to generalize to new
examples not in the training set.

(plot: price vs. size, high-order polynomial fit through the training points)

x1 = size in feet²
x2 = no. of bedrooms
x3 = no. of floors
x4 = age of home in years

f_w,b(x) = w1 x + w2 x^2 + ⋯ + wn x^n + b      (here with just x = x1)
Andrew Ng
Evaluating your model
Dataset:
size    price
2104    400
1600    330
2400    369
1416    232     training set (70%):  (x^(1), y^(1)), …, (x^(m_train), y^(m_train));  m_train = 7 training examples
3000    540
1985    300
1534    315
1427    199
1380    212     test set (30%):  (x_test^(1), y_test^(1)), …, (x_test^(m_test), y_test^(m_test));  m_test = 3 test examples
1494    243

Andrew Ng
Train/test procedure for linear regression (with squared error cost)

Fit parameters by minimizing the cost function J(w,b):
J(w,b) = (1/2m_train) Σ_{i=1}^{m_train} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m_train) Σ_{j=1}^{n} wj^2

Compute test error:
J_test(w,b) = (1/2m_test) Σ_{i=1}^{m_test} (f_w,b(x_test^(i)) - y_test^(i))^2

Compute training error:
J_train(w,b) = (1/2m_train) Σ_{i=1}^{m_train} (f_w,b(x_train^(i)) - y_train^(i))^2

Andrew Ng
Train/test procedure for linear regression (with squared error cost)

Fit parameters by minimizing cost function J(w,b)

Compute test error:
J_test(w,b) = (1/2m_test) Σ_{i=1}^{m_test} (f_w,b(x_test^(i)) - y_test^(i))^2

Compute training error:
J_train(w,b) = (1/2m_train) Σ_{i=1}^{m_train} (f_w,b(x_train^(i)) - y_train^(i))^2

(plot: price vs. size, overfit curve passing through the training points but missing the test points)
If the model overfits:  J_train(w,b) will be low,  J_test(w,b) will be high.

Andrew Ng
Train/test procedure for classification problem

Fit parameters by minimizing J(w,b) to find w, b. E.g.:
J(w,b) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 - y^(i)) log(1 - f_w,b(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} wj^2

Compute test error:
J_test(w,b) = -(1/m_test) Σ_{i=1}^{m_test} [ y_test^(i) log(f_w,b(x_test^(i))) + (1 - y_test^(i)) log(1 - f_w,b(x_test^(i))) ]

Compute train error:
J_train(w,b) = -(1/m_train) Σ_{i=1}^{m_train} [ y_train^(i) log(f_w,b(x_train^(i))) + (1 - y_train^(i)) log(1 - f_w,b(x_train^(i))) ]

Andrew Ng
Train/test procedure for classification problem

Fit parameters by minimizing J(w,b) to find w, b. E.g.:
J(w,b) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 - y^(i)) log(1 - f_w,b(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} wj^2

Alternatively, measure the fraction of the test set and the fraction of the
train set that the algorithm has misclassified:

ŷ = 1  if f_w,b(x^(i)) ≥ 0.5
    0  if f_w,b(x^(i)) < 0.5
count ŷ ≠ y

J_test(w,b) is the fraction of the test set that has been misclassified.
J_train(w,b) is the fraction of the train set that has been misclassified.
Andrew Ng
Evaluating and
choosing models

Model selection and


training/cross validation/test
sets
Model selection (choosing a model)
Once parameters w, b are fit to the training set, the training error J_train(w,b)
is likely lower than the actual generalization error.

J_test(w,b) is a better estimate of how well the model will generalize to new
data than J_train(w,b).

(plot: price vs. size)
f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + w4 x^4 + b

Andrew Ng
Model selection (choosing a model)
d=1    1.  f_w,b(x) = w1 x + b                          →  w<1>, b<1>   →  J_test(w<1>, b<1>)
d=2    2.  f_w,b(x) = w1 x + w2 x^2 + b                 →  w<2>, b<2>   →  J_test(w<2>, b<2>)
d=3    3.  f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + b        →  w<3>, b<3>   →  J_test(w<3>, b<3>)
⋮
d=10   10. f_w,b(x) = w1 x + w2 x^2 + ⋯ + w10 x^10 + b  →  w<10>, b<10> →  J_test(w<10>, b<10>)

Choose the lowest, say d=5:  w1 x + ⋯ + w5 x^5 + b,  J_test(w<5>, b<5>)
How well does the model perform? Report test set error J_test(w<5>, b<5>)?
The problem is that J_test(w<5>, b<5>) is likely to be an optimistic estimate of the
generalization error, i.e. an extra parameter d (degree of polynomial) was chosen
using the test set.

Andrew Ng
Training/cross validation/test set
(also called validation set, development set, or dev set)

size    price
2104    400
1600    330
2400    369     training set (60%):  (x^(1), y^(1)), …, (x^(m_train), y^(m_train));  m_train = 6
1416    232
3000    540
1985    300
1534    315     cross validation set (20%):  (x_cv^(1), y_cv^(1)), …, (x_cv^(m_cv), y_cv^(m_cv));  m_cv = 2
1427    199
1380    212     test set (20%):  (x_test^(1), y_test^(1)), …, (x_test^(m_test), y_test^(m_test));  m_test = 2
1494    243

Andrew Ng
Training/cross validation/test set

Training error:           J_train(w,b) = (1/2m_train) Σ_{i=1}^{m_train} (f_w,b(x^(i)) - y^(i))^2

Cross validation error:   J_cv(w,b) = (1/2m_cv) Σ_{i=1}^{m_cv} (f_w,b(x_cv^(i)) - y_cv^(i))^2
                          (also called validation error or dev error)

Test error:               J_test(w,b) = (1/2m_test) Σ_{i=1}^{m_test} (f_w,b(x_test^(i)) - y_test^(i))^2
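A minimal sketch of carving out these three sets with scikit-learn; the 60/20/20 proportions follow the example above, and the X, y arrays here are random placeholders standing in for a real dataset.

import numpy as np
from sklearn.model_selection import train_test_split

# X: (m, n) feature matrix, y: (m,) targets -- placeholders for your data
X = np.random.rand(10, 2)
y = np.random.rand(10)

# First split off 60% for training, then split the remaining 40% in half
# to get 20% cross validation and 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.40, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=1)

print(len(X_train), len(X_cv), len(X_test))   # 6, 2, 2 for m = 10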

Andrew Ng
Model selection
d=1    1.  f_w,b(x) = w1 x + b                          →  w<1>, b<1>   →  J_cv(w<1>, b<1>)
d=2    2.  f_w,b(x) = w1 x + w2 x^2 + b                 →  w<2>, b<2>   →  J_cv(w<2>, b<2>)
d=3    3.  f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + b        →  ⋮
⋮
d=10   10. f_w,b(x) = w1 x + w2 x^2 + ⋯ + w10 x^10 + b  →  w<10>, b<10> →  J_cv(w<10>, b<10>)

Pick the d with the lowest cross validation error, say d=4:  w1 x + ⋯ + w4 x^4 + b  (J_cv(w<4>, b<4>))

Estimate generalization error using the test set: J_test(w<4>, b<4>)

Andrew Ng
Model selection – choosing a neural network architecture

1. w(1),b(1) 1 1
𝐽𝑐𝑣 𝐖 ,𝐁

2. w(2),b(2) 𝐽𝑐𝑣 𝐖 2 ,𝐁 2

3. w(3),b(3) 𝐽𝑐𝑣 𝐖 3 ,𝐁 3

Train, CV
Pick 𝐖 2 ,𝐁 2

Estimate generalization error using the test set: 𝐽𝑡𝑒𝑠𝑡 𝐖 2 ,𝐁 2

Andrew Ng
Bias and variance

Diagnosing bias and variance


Bias/variance
(plots: price vs. size with three fits)

f_w,b(x) = w1 x + b        f_w,b(x) = w1 x + w2 x^2 + b    f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + w4 x^4 + b
High bias (underfit)       “Just right”                     High variance (overfit)
d = 1                      d = 2                            d = 4
J_train is high            J_train is low                   J_train is low
J_cv is high               J_cv is low                      J_cv is high

Andrew Ng
Understanding bias and variance

𝐽𝑐𝑣 w, 𝑏

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏
degree of polynomial
𝑑

Andrew Ng
Diagnosing bias and variance
How do you tell if your algorithm has a bias or variance problem?

High bias (underfit):           J_train will be high   (J_train ≈ J_cv)
High variance (overfit):        J_cv ≫ J_train   (J_train may be low)
High bias and high variance:    J_train will be high  and  J_cv ≫ J_train

(plot: J_train and J_cv vs. degree of polynomial — low degree: high bias/underfit;
high degree: high variance/overfit)

Andrew Ng
Bias and variance

Regularization and
bias/variance
Linear regression with regularization
Model:  f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + w4 x^4 + b
J(w,b) = (1/2m) Σ_{i=1}^{m} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1}^{n} wj^2

(plots: price vs. size for three values of λ)

Large λ (e.g. λ = 10,000):  w1 ≈ 0, w2 ≈ 0, so f_w,b(x) ≈ b
  High bias (underfit):  J_train(w,b) is large
Intermediate λ:  “just right”:  J_train(w,b) is small, J_cv(w,b) is small
Small λ (e.g. λ = 0):  High variance (overfit):  J_train(w,b) is small, J_cv(w,b) is large
Andrew Ng
Choosing the regularization parameter 𝜆
Model:  f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + w4 x^4 + b

1.  Try λ = 0     →  min_{w,b} J(w,b)  →  w<1>, b<1>    →  J_cv(w<1>, b<1>)
2.  Try λ = 0.01  →                       w<2>, b<2>    →  J_cv(w<2>, b<2>)
3.  Try λ = 0.02  →                       w<3>, b<3>    →  J_cv(w<3>, b<3>)
4.  Try λ = 0.04
5.  Try λ = 0.08  →                       w<5>, b<5>    →  J_cv(w<5>, b<5>)
⋮
12. Try λ ≈ 10    →                       w<12>, b<12>  →  J_cv(w<12>, b<12>)

Pick the λ with the lowest J_cv, e.g. w<5>, b<5>.

Report test error: J_test(w<5>, b<5>)
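A hedged sketch of this λ sweep using scikit-learn's Ridge (L2-regularized linear regression); the candidate λ values mirror the list above, and the data arrays are placeholders for a real training/CV split.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Placeholders standing in for your actual training and cross validation sets
X_train, y_train = np.random.rand(20, 4), np.random.rand(20)
X_cv, y_cv = np.random.rand(8, 4), np.random.rand(8)

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10]
results = []
for lam in lambdas:
    model = Ridge(alpha=lam)            # alpha plays the role of lambda
    model.fit(X_train, y_train)         # fit w, b on the training set
    j_cv = mean_squared_error(y_cv, model.predict(X_cv)) / 2
    results.append((lam, j_cv))

best_lambda, best_jcv = min(results, key=lambda t: t[1])   # pick lambda with lowest J_cv
print(best_lambda, best_jcv)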

Andrew Ng
Bias and variance as a function of regularization parameter 𝜆
J(w,b) = (1/2m) Σ_{i=1}^{m} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1}^{n} wj^2

(plot: J_train(w,b) and J_cv(w,b) vs. λ)
small λ:  high variance (J_train low, J_cv high)
large λ:  high bias (J_train high, J_cv high)

Andrew Ng
Bias and variance

Establishing a baseline level of


performance
Speech recognition example

Human level performance : 10.6% 0.2%


Training error 𝐽𝑡𝑟𝑎𝑖𝑛 : 10.8%
4.0%
Cross validation error 𝐽𝑐𝑣 : 14.8%

Andrew Ng
Establishing a baseline level of performance

What is the level of error you can reasonably


hope to get to?

• Human level performance

• Competing algorithms performance

• Guess based on experience

Andrew Ng
Bias/variance examples
                                   case 1          case 2       case 3
Baseline performance:              10.6%           10.6%        10.6%
Training error (J_train):          10.8%           15.0%        15.0%
  gap (baseline → train):           0.2%            4.4%         4.4%
Cross validation error (J_cv):     14.8%           15.5%        19.7%
  gap (train → cv):                 4.0%            0.5%         4.7%
Diagnosis:                     high variance     high bias    high bias and high variance

Andrew Ng
Bias and variance

Learning curves
Learning curves 𝑓w,𝑏 𝑥 = 𝑤1 𝑥 + 𝑤2𝑥 2 + 𝑏

𝐽𝑡𝑟𝑎𝑖𝑛= training error


𝐽𝑐𝑣 = cross validation error

𝐽𝑐𝑣 w, 𝑏
error

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏

𝑚𝑡𝑟𝑎𝑖𝑛 (training set size)

Andrew Ng
High bias
𝑓w,𝑏 𝑥 = 𝑤1𝑥 + 𝑏

𝐽𝑐𝑣 w, 𝑏
error

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏
human level
performance

𝑚 (training set size)

if a learning algorithm suffers


from high bias, getting more
training data will not (by
itself) help much.

Andrew Ng
High variance 𝑓w,𝑏 𝑥 = 𝑤1 𝑥 + 𝑤2𝑥 2 + 𝑤3𝑥 3
+ 𝑤4 𝑥 4 + 𝑏
(with small 𝜆)
𝐽𝑐𝑣 w, 𝑏
human level
performance
error

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏

𝑚 (training set size)

if a learning algorithm suffers


from high variance, getting
more training data is likely to
help.

Andrew Ng
Bias and variance

Deciding what to try next


revisited
Debugging a learning algorithm
You’ve implemented regularized linear regression on housing prices
J(w,b) = (1/2m) Σ_{i=1}^{m} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1}^{n} wj^2
But it makes unacceptably large errors in predictions. What do you
try next?
Get more training examples fixes high variance
Try smaller sets of features x, x2, x3, x4, x5… fixes high variance
Try getting additional features fixes high bias
Try adding polynomial features (𝑥12, 𝑥22, 𝑥1𝑥2, 𝑒𝑡𝑐) fixes high bias
Try decreasing 𝜆 fixes high bias
Try increasing 𝜆 fixes high variance

Andrew Ng
Bias and variance

Bias/variance and
neural networks
The bias variance tradeoff
𝑓w,𝑏 𝑥 = 𝑤1𝑥 + 𝑏 𝑓w,𝑏 𝑥 = 𝑤1𝑥 + 𝑤2𝑥 2 𝑓w,𝑏 𝑥 = 𝑤1𝑥 + 𝑤2𝑥 2 + 𝑤3𝑥 3
+𝑏 +𝑤4𝑥 4 + 𝑏

Simple model Complex model


High bias High variance
tradeoff
𝐽𝑐𝑣 w, 𝑏

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏

degree of polynomial

Andrew Ng
Neural networks and bias variance
Large neural networks are low bias machines

Does it do well on the training set?  (J_train(w,b))
  No  →  bigger network (train longer; a GPU helps), then check again
  Yes →  Does it do well on the cross validation set?  (J_cv(w,b))
            No  →  more data, then check again
            Yes →  Done!

Andrew Ng
Neural networks and regularization

3 layers 4 layers

A large neural network will usually do as well or better than a


smaller one so long as regularization is chosen appropriately.

Andrew Ng
Neural network regularization
J(W,B) = (1/m) Σ_{i=1}^{m} L(f(x^(i)), y^(i)) + (λ/2m) Σ_{all weights w in W} w^2

Unregularized MNIST model:
layer_1 = Dense(units=25, activation="relu")
layer_2 = Dense(units=15, activation="relu")
layer_3 = Dense(units=1, activation="sigmoid")
model = Sequential([layer_1, layer_2, layer_3])

Regularized MNIST model (λ = 0.01):
layer_1 = Dense(units=25, activation="relu", kernel_regularizer=L2(0.01))
layer_2 = Dense(units=15, activation="relu", kernel_regularizer=L2(0.01))
layer_3 = Dense(units=1, activation="sigmoid", kernel_regularizer=L2(0.01))
model = Sequential([layer_1, layer_2, layer_3])

Andrew Ng
Machine learning
development process

Iterative loop of
ML development
Iterative loop of ML development
Choose architecture
(model, data, etc.)

Diagnostics
Train
(bias, variance and
model
error analysis)

Andrew Ng
Spam classification example

From: cheapsales@buystufffromme.com From: Alfred Ng


To: Andrew Ng To: Andrew Ng
Subject: Buy now! Subject: Christmas dates?

Deal of the week! Buy now! Hey Andrew,


Rolex w4tchs - $100 Was talking to Mom about plans
Med1cine (any kind) - £50 for Xmas. When do you get off
Also low cost M0rgages work. Meet Dec 22?
available. Alf

Andrew Ng
Building a spam classifier
Supervised learning: x = features of email
y = spam (1) or not spam (0)
Features: list the top 10,000 words to compute 𝑥1, 𝑥2, ⋯ , 𝑥10,000

From: cheapsales@buystufffromme.com
0 a To: Andrew Ng
1 andrew Subject: Buy now!

2 11 buy Deal of the week! Buy now!


1 deal Rolex w4tchs - $100
0 discount
Med1cine (any kind) - £50
Also low cost M0rgages
⋮ ⋮ available.

Andrew Ng
Building a spam classifier
How to try to reduce your spam classifier’s error?

• Collect more data. E.g., “Honeypot” project.


• Develop sophisticated features based on email
routing (from email header).
• Define sophisticated features from email body.
E.g., should “discounting” and ”discount” be
treated as the same word.
• Design algorithms to detect misspellings.
E.g., w4tches, med1cine, m0rtgage.

Andrew Ng
Iterative loop of ML development
Choose architecture
(model, data, etc.)

Diagnostics
Train
(bias, variance and
model
error analysis)

Andrew Ng
Machine learning
development process

Error analysis
Error analysis
m_cv = 500 examples in the cross validation set.
Algorithm misclassifies 100 of them.
(If m_cv were 5000 with 1000 misclassified, you would manually examine a random sample of ~100.)
Manually examine the 100 examples and categorize them based on common traits:
  Pharma: 21                                          → more data / new features
  Deliberate misspellings (w4tches, med1cine): 3
  Unusual email routing: 7
  Steal passwords (phishing): 18                      → more data / new features
  Spam message in embedded image: 5

Andrew Ng
Building a spam classifier
How to try to reduce your spam classifier’s error?

• Collect more data. E.g., “Honeypot” project.


• Develop sophisticated features based on email
routing (from email header).
• Define sophisticated features from email body.
E.g., should “discounting” and ”discount” be
treated as the same word.
• Design algorithms to detect misspellings.
E.g., w4tches, med1cine, m0rtgage.

Andrew Ng
Iterative loop of ML development
Choose architecture
(model, data, etc.)

Diagnostics
Train
(bias, variance and
model
error analysis)

Andrew Ng
Machine learning
development process

Adding data
Adding data
Add more data of everything. E.g., “Honeypot” project.

Add more data of the types where error analysis has


indicated it might help.

Pharma spam

E.g., Go to unlabeled data and find more examples of


Pharma related spam.
( X,Y )

Andrew Ng
Data augmentation
Augmentation: modifying an existing training example to
create a new training example.

( X,Y )
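As a hedged sketch, here is how an existing training image might be modified to create new examples with TensorFlow; the specific distortions (flip, brightness, contrast) are illustrative assumptions, not the course's exact recipe.

import tensorflow as tf

def augment(image, label):
    # Each call returns a slightly modified copy of an existing training example
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image, label

# Example: expand a (hypothetical) labeled image dataset with augmented copies
images = tf.random.uniform((8, 28, 28, 1))          # placeholder images
labels = tf.constant([0, 1, 1, 0, 1, 0, 0, 1])      # placeholder labels
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).map(augment)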

Andrew Ng
Data augmentation by introducing distortions

Andrew Ng
Data augmentation for speech
Speech recognition example

Original audio (voice search: ”What is today’s weather?”)

+ Noisy background: Crowd

+ Noisy background: Car

+ Audio on bad cellphone connection

Andrew Ng
Data augmentation by introducing distortions
The distortion introduced should be representative of the type of
noise/distortions in the test set.
Audio:
Background noise,
bad cellphone connection

Usually does not help to add purely random/meaningless


noise to your data.

𝑥𝑖 =intensity (brightness) of pixel 𝑖


𝑥𝑖 ← 𝑥𝑖 + random noise
[Adam Coates and Tao Wang]

Andrew Ng
Data synthesis
Synthesis: using artificial data inputs to create a new
training example.

Andrew Ng
Artificial data synthesis for photo OCR

[http://www.publicdomainpictures.net/view-image.php?image=5745&picture=times-square]

Andrew Ng
Artificial data synthesis for photo OCR

Abcdefg
Abcdefg
Abcdefg
Abcdefg
Abcdefg
Real data
[Adam Coates and Tao Wang]

Andrew Ng
Artificial data synthesis for photo OCR

Real data Synthetic data


[Adam Coates and Tao Wang]

Andrew Ng
Engineering the data used by your system

Conventional
model-centric AI= Code + Data
approach: (algorithm/model)

Work on this

Data-centric
approach: AI= Code + Data
(algorithm/model)

Work on this

Andrew Ng
Machine learning
development process

Transfer learning: using data


from a different task
Transfer learning
Supervised pretraining: train on a large dataset (e.g., 1 million images of cats,
dogs, cars, people, … with 1,000 classes):
x  →  W[1],b[1]  →  W[2],b[2]  →  W[3],b[3]  →  W[4],b[4]  →  W[5],b[5]  (1,000 output units)

Fine tuning: copy W[1],b[1] … W[4],b[4], replace the output layer with a new
W[5],b[5] that has 10 output units (digits 0–9), then train on your own data.

Option 1: only train the output layer’s parameters.
Option 2: train all parameters.

Andrew Ng
Why does transfer learning work?

x use the same input type

detects detects detects


edges corners curves/basic shapes

Edges Corners Curves / basic shapes

Andrew Ng
Transfer learning summary
1. Download neural network parameters pretrained on a large dataset (e.g. 1M examples)
   with the same input type (e.g., images, audio, text) as your application
   (or train your own).
2. Further train (fine tune) the network on your own, much smaller dataset
   (e.g. 1,000 or even 50 examples).
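As a hedged illustration of these two steps in Keras — the pretrained base network, image size, and class count are placeholder assumptions, not the course's prescribed choices:

import tensorflow as tf

# Step 1: download a network pretrained on a large image dataset (ImageNet here).
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,   # drop the original output layer
                                         weights='imagenet')
base.trainable = False        # Option 1: only train the new output layer's parameters
# base.trainable = True       # Option 2: fine tune all parameters

# Step 2: add a new output layer for your own task (say, 10 classes) and train on your data.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(units=10, activation='linear'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(X, Y, epochs=...)   # X, Y: your (smaller) labeled dataset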

Andrew Ng
Machine learning
development process

Full cycle of a
machine learning project
Full cycle of a machine learning project

Scope Collect Train Deploy in


project data model production

Define project Define and Training, error Deploy,


collect data analysis & monitor and
iterative maintain
improvement system

Andrew Ng
Deployment
API call (x)
Inference server Mobile app

(audio clip)
ML model

Inference 𝑦ො

(text transcript)
Software engineering may be needed for:
Ensure reliable and efficient predictions
MLOps
Scaling
Logging machine learning
System monitoring operations
Model updates
Andrew Ng
Machine learning
development process

Fairness, bias, and ethics


Bias
Hiring tool that discriminates against women.

Facial recognition system matching dark skinned


individuals to criminal mugshots.

Biased bank loan approvals.

Toxic effect of reinforcing negative stereotypes.

Andrew Ng
Adverse use cases
Deepfakes

Spreading toxic/incendiary speech through optimizing for


engagement.
Generating fake content for commercial or political purposes.
Using ML to build harmful products, commit fraud etc.
Spam vs anti-spam : fraud vs anti-fraud.

Andrew Ng
Guidelines
Get a diverse team to brainstorm things that might go
wrong, with emphasis on possible harm to vulnerable groups.
Carry out literature search on standards/guidelines for your
industry.
Audit systems against possible harm prior to deployment.
Audit

Develop mitigation plan (if applicable), and after deployment,


monitor for possible harm.

Andrew Ng
Skewed datasets
(optional)

Error metrics for


skewed datasets
Rare disease classification example
Train classifier f_w,b(x)  (y = 1 if disease present, y = 0 otherwise)
Find that you’ve got 1% error on the test set (99% correct diagnoses).

But only 0.5% of patients have the disease, so the trivial program

print("y=0")

gets 99.5% accuracy / 0.5% error — numerically “better” than a learned classifier
with 1% or 1.2% error, even though it never detects the disease.

Andrew Ng
Precision/recall
y = 1 in presence of the rare class we want to detect.

Confusion matrix (predicted class vs. actual class):
                   actual 1              actual 0
predicted 1        true positive: 15     false positive: 5
predicted 0        false negative: 10    true negative: 70
                   (25 actual positive)  (75 actual negative)

Precision: of all patients where we predicted y = 1, what fraction actually have the rare disease?
  precision = true positives / #predicted positive = true pos / (true pos + false pos) = 15 / (15 + 5) = 0.75

Recall: of all patients that actually have the rare disease, what fraction did we correctly detect as having it?
  recall = true positives / #actual positive = true pos / (true pos + false neg) = 15 / (15 + 10) = 0.6

(Note: print("y=0") would have recall = 0.)
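A small sketch of these two formulas in Python, using the confusion-matrix counts from the slide:

# Counts from the example confusion matrix above
true_pos, false_pos = 15, 5
false_neg, true_neg = 10, 70

precision = true_pos / (true_pos + false_pos)   # 15 / 20 = 0.75
recall    = true_pos / (true_pos + false_neg)   # 15 / 25 = 0.6
print(precision, recall)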

Andrew Ng
Skewed datasets
(optional)

Trading off precision


and recall
Trading off precision and recall
precision = true positives / total predicted positive
recall    = true positives / total actual positive

Logistic regression: 0 < f_w,b(x) < 1
  Predict 1 if f_w,b(x) ≥ threshold (0.5, or 0.7, 0.9, 0.3, …)
  Predict 0 if f_w,b(x) < threshold

Suppose we want to predict y = 1 (rare disease) only if very confident
(e.g. threshold = 0.99):  higher precision, lower recall.

Suppose we want to avoid missing too many cases of rare disease
(when in doubt predict y = 1, e.g. threshold = 0.01):  lower precision, higher recall.

More generally, predict 1 if f_w,b(x) ≥ threshold.
(plot: precision vs. recall traced out as the threshold varies from 1 down to 0)

Andrew Ng
F1 score
How to compare precision/recall numbers?

              Precision (P)   Recall (R)   Average   F1 score
Algorithm 1   0.5             0.4          0.45      0.444
Algorithm 2   0.7             0.1          0.4       0.175
Algorithm 3   0.02            1.0          0.501     0.0392    (≈ print("y=1"))

Average = (P + R) / 2
F1 score = 1 / ( (1/2)(1/P + 1/R) ) = 2PR / (P + R)
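And the F1 formula in code, checked against the three algorithms in the table (values as on the slide):

def f1_score(p, r):
    # Harmonic mean of precision and recall: 2PR / (P + R)
    return 2 * p * r / (p + r)

for p, r in [(0.5, 0.4), (0.7, 0.1), (0.02, 1.0)]:
    print(round(f1_score(p, r), 4))    # 0.4444, 0.175, 0.0392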

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Decision Trees

Decision Tree Model


Cat classification example
Ear shape (x 1 ) Face shape (x 2 ) Whiskers (x 3 ) Cat
Pointy Round Present 1
Floppy Not round Present 1
Floppy Round Absent 0
Pointy Not round Present 0
Pointy Round Present 1
Pointy Round Absent 1
Floppy Not round Absent 0
Pointy Round Absent 1
Floppy Round Absent 0
Floppy Round Absent 0

Categorical (discrete values)

Andrew Ng
Decision Tree New test example

root node
Ear shape

Pointy Floppy
decision nodes

Ear shape: Pointy


Face Face shape: Round
Whiskers
shape Whiskers: Present

Round Not round Present Absent

Cat Not cat Cat Not cat

leaf nodes

Andrew Ng
Decision Tree
Fac e Fac e
E ar E ar s hape s hape
s hape s hape

Round N ot round Round N ot round


P ointy Floppy P ointy Floppy

C at N ot c at Ear
Not Cat
shape
Whiskers N ot c at Not cat Fac e
s hape

P ointy Floppy
P res ent A bs ent N ot round
Round

C at N ot c at Whis kers Not Cat


Cat Not Cat

P res ent A bs ent

Cat Not Cat

Andrew Ng
Decision Trees

Learning Process
Decision Tree Learning
?

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
shape

Round Not round

4/4 cats

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
shape

Round Not round

Cat

0/1 cats

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
shape

Round Not round

Cat Not cat

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
?
shape

Round Not round

Cat Not cat

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
Whiskers
shape

Round Not round

Cat Not cat

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
Whiskers
shape

Round Not round Present Absent

Cat Not cat

1/1 cats 0/4 cats

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
Whiskers
shape

Round Not round Present Absent

Cat Not cat Cat Not Cat

Andrew Ng
Decision Tree Learning
Decision 1: How to choose what feature to split on at each
node?
Maximize purity (or minimize impurity)

Andrew Ng
Decision Tree Learning
C at

Decision 1: How to choose what feature to split on at each DNA

node? Y es No

Maximize purity (or minimize impurity)

E ar Fac e Whis kers


s hape Shape

Round N ot round P res ent 5 /5 c ats 0 /5 c ats


P ointy Floppy A bs ent

4 /5 c ats 1 /5 c ats 4 /7 c ats 1 /3 c ats 3 /4 c ats 2 /6 c ats

Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting? Depth 0

• When a node is 100% one class


• When splitting a node will result in the tree exceeding
a maximum depth Depth 1

Depth 2

Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting? Depth 0

• When a node is 100% one class


• When splitting a node will result in the tree exceeding
a maximum depth Depth 1

Depth 2

Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting?

• When a node is 100% one class


• When splitting a node will result in the tree exceeding
a maximum depth Fac e
Shape
• When improvements in purity score are below a
threshold Round N ot round

• When number of examples in a node is below a


threshold

4 /7 c ats 1 /3 c ats

Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting?

• When a node is 100% one class


• When splitting a node will result in the tree exceeding
a maximum depth Fac e
Shape
• When improvements in purity score are below a
threshold Round N ot round

• When number of examples in a node is below a


Not cat
threshold

Andrew Ng
Decision Tree Learning

Measuring purity
Entropy as a measure of impurity
p1 = fraction of examples that are cats

0 cats, 6 dogs:    p1 = 0      H(p1) = 0
2 cats, 4 dogs:    p1 = 2/6    H(p1) = 0.92
3 cats, 3 dogs:    p1 = 3/6    H(p1) = 1
5 cats, 1 dog:     p1 = 5/6    H(p1) = 0.65
6 cats, 0 dogs:    p1 = 6/6    H(p1) = 0

(plot: H(p1) vs. p1 — zero at p1 = 0 and p1 = 1, maximum of 1 at p1 = 0.5)

Andrew Ng
Entropy as a measure of impurity
p1 = fraction of examples that are cats
p0 = 1 - p1

H(p1) = -p1 log2(p1) - p0 log2(p0)
      = -p1 log2(p1) - (1 - p1) log2(1 - p1)

Note: “0 log(0)” = 0

(plot: H(p1) vs. p1)
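A direct translation of this entropy formula into Python, with the "0 log(0) = 0" convention handled explicitly:

import numpy as np

def entropy(p1):
    # H(p1) = -p1 log2(p1) - (1 - p1) log2(1 - p1), with 0 log(0) treated as 0
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(round(entropy(3/6), 2))   # 1.0  (maximally impure node)
print(round(entropy(5/6), 2))   # 0.65
print(round(entropy(6/6), 2))   # 0.0  (pure node)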

Andrew Ng
Decision Tree Learning

Choosing a split: Information


Gain
Choosing a split
Root node: p1 = 5/10 = 0.5,  H(0.5) = 1

Split on Ear shape:               Split on Face shape:               Split on Whiskers:
  Pointy: p1 = 4/5 = 0.8            Round: p1 = 4/7 = 0.57             Present: p1 = 3/4 = 0.75
          H(0.8) = 0.72                    H(0.57) = 0.99                       H(0.75) = 0.81
  Floppy: p1 = 1/5 = 0.2            Not round: p1 = 1/3 = 0.33         Absent: p1 = 2/6 = 0.33
          H(0.2) = 0.72                    H(0.33) = 0.92                       H(0.33) = 0.92

Information gain:
  H(0.5) - ( 5/10 H(0.8) + 5/10 H(0.2) )   = 0.28
  H(0.5) - ( 7/10 H(0.57) + 3/10 H(0.33) ) = 0.03
  H(0.5) - ( 4/10 H(0.75) + 6/10 H(0.33) ) = 0.12

Andrew Ng
Information Gain
p1_root = 5/10 = 0.5

Split on Ear shape:
  left (Pointy):   p1_left = 4/5,    w_left = 5/10
  right (Floppy):  p1_right = 1/5,   w_right = 5/10

Information gain
  = H(p1_root) - ( w_left · H(p1_left) + w_right · H(p1_right) )
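The information-gain formula written out in Python (reusing the entropy function sketched on the previous slide; the numbers are the ear-shape split above):

import numpy as np

def entropy(p1):
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(p1_root, p1_left, w_left, p1_right, w_right):
    # H(p1_root) - (w_left * H(p1_left) + w_right * H(p1_right))
    return entropy(p1_root) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

# Ear shape split from the slide: gain ≈ 0.28
print(round(information_gain(5/10, 4/5, 5/10, 1/5, 5/10), 2))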

Andrew Ng
Decision Tree Learning

Putting it together
Decision Tree Learning
• Start with all examples at the root node
• Calculate information gain for all possible features, and pick the one with
the highest information gain
• Split dataset according to selected feature, and create left and right
branches of the tree
• Keep repeating splitting process until stopping criteria is met:
• When a node is 100% one class
• When splitting a node will result in the tree exceeding a maximum
depth
• Information gain from additional splits is less than threshold
• When number of examples in a node is below a threshold

Andrew Ng
Recursive splitting
?

Andrew Ng
Recursive splitting
E ar
s hape

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e ?
s hape

Round N ot Round

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat Cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat Cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape
Whis kers
Recursive algorithm

A bs ent
Round N ot Round P res ent

Cat Not cat Cat Not cat

Andrew Ng
Decision Tree Learning

Using one-hot encoding of


categorical features
Features with three possible values
Ear shape (𝑥1) Face shape (𝑥 2 ) Whiskers (𝑥3 ) Cat (𝑦)

Pointy Round Present 1

Oval Not round Present 1


Oval Round Absent 0
Ear
Pointy Not round Present 0 shape
Oval Round Present 1 Pointy

Pointy Round Absent 1 Floppy Oval

Floppy Not round Absent 0


Oval Round Absent 1

Floppy Round Absent 0


Floppy Round Absent 0

3 possible values

Andrew Ng
One hot encoding
Ear shape Pointy ears Floppy ears Oval ears Face shape Whiskers Cat

Pointy Round Present 1

Oval Not round Present 1


Oval 0 0 1 Round Absent 0

Pointy 1 0 0 Not round Present 0


Oval 0 0 1 Round Present 1
Pointy 1 0 0 Round Absent 1
Floppy 0 1 0 Not round Absent 0
Oval 0 0 1 Round Absent 1

Floppy 0 1 0 Round Absent 0


Floppy 0 1 0 Round Absent 0

Andrew Ng
One hot encoding

If a categorical feature can take on 𝑘 values,


create 𝑘 binary features (0 or 1 valued).

Andrew Ng
One hot encoding
Ear shape Pointy ears Floppy ears Oval ears Face shape Whiskers Cat

Pointy 1 0 0 Round Present 1

Oval 0 0 1 Not round Present 1


Oval 0 0 1 Round Absent 0

Pointy 1 0 0 Not round Present 0


Oval 0 0 1 Round Present 1
Pointy 1 0 0 Round Absent 1
Floppy 0 1 0 Not round Absent 0
Oval 0 0 1 Round Absent 1

Floppy 0 1 0 Round Absent 0


Floppy 0 1 0 Round Absent 0

Andrew Ng
One hot encoding and neural networks
Pointy ears Floppy ears Round ears Face shape Whiskers Cat

1 0 0 Round Present 1

0 0 1 Not round Present 1


0 0 1 Round Absent 0

1 0 0 Not round Present 1 0


0 0 1 Round 1 Present 1 1
1 0 0 Round 1 Absent 0 1
0 1 0 Not round 0 Absent 0 1
0 0 1 Round 1 Absent 0 1

0 1 0 Round 1 Absent 0 1
0 1 0 Round 1 Absent 0 1

Andrew Ng
Decision Tree Learning

Continuous valued features


Continuous features
Ear shape Face shape Whiskers Weight (lbs.) Cat
Pointy Round Present 7.2 1
Floppy Not round Present 8.8 1
Floppy Round Absent 15 0

Pointy Not round Present 9.2 0

Pointy Round Present 8.4 1


Pointy Round Absent 7.6 1
Floppy Not round Absent 11 0
Pointy Round Absent 10.2 1
Floppy Round Absent 18 0
Floppy Round Absent 20 0

Andrew Ng
Splitting on a continuous variable
(plot: Weight (lbs.) on the x-axis, Cat / Not Cat on the y-axis; candidate
thresholds such as Weight ≤ 8 lbs. and Weight ≤ 9 lbs.)

Information gain for candidate thresholds (root: H(0.5), 5/10 cats):
  Weight ≤ 8:         H(0.5) - ( 2/10 · H(2/2) + 8/10 · H(3/8) ) = 0.24
  Weight ≤ 9:         H(0.5) - ( 4/10 · H(4/4) + 6/10 · H(1/6) ) = 0.61
  another threshold:  H(0.5) - ( 7/10 · H(5/7) + 3/10 · H(0/3) ) = 0.40
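A hedged sketch of how one might search over candidate thresholds for a continuous feature; the weights and labels are the ten examples from the table, and trying midpoints between sorted values is one common choice, not necessarily the course's exact procedure.

import numpy as np

def entropy(p1):
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

weights = np.array([7.2, 8.8, 15, 9.2, 8.4, 7.6, 11, 10.2, 18, 20])
is_cat  = np.array([1,   1,   0,  0,   1,   1,   0,  1,    0,  0])

def split_gain(threshold):
    left, right = is_cat[weights <= threshold], is_cat[weights > threshold]
    w_left, w_right = len(left) / len(is_cat), len(right) / len(is_cat)
    return entropy(is_cat.mean()) - (w_left * entropy(left.mean()) + w_right * entropy(right.mean()))

# Try midpoints between consecutive sorted values and keep the best threshold
candidates = (np.sort(weights)[:-1] + np.sort(weights)[1:]) / 2
best = max(candidates, key=split_gain)
print(best, round(split_gain(best), 2))   # a threshold near 9 lbs gives gain ≈ 0.61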

Andrew Ng
Decision Tree Learning

Regression Trees (optional)


Regression with Decision Trees: Predicting a number
Ear shape Face shape Whiskers Weight (lbs.)
Pointy Round Present 7.2

Floppy Not round Present 8.8


Floppy Round Absent 15

Pointy Not round Present 9.2


Pointy Round Present 8.4
Pointy Round Absent 7.6
Floppy Not round Absent 11
Pointy Round Absent 10.2

Floppy Round Absent 18


Floppy Round Absent 20

Andrew Ng
Regression with Decision Trees

E ar
s hape

P ointy
Floppy

Fac e Fac e
s hape Shape

Round N ot round
Round N ot Round

Weights(lbs.): Weights(lbs.): Weights (lbs.): Weights(lbs.):


7.2, 8.4, 7.6, 10.2 9.2 15, 18, 20 8.8,11

Andrew Ng
Choosing a split
Variance at root node: 20.51  (weights: 7.2, 8.8, 15, 9.2, 8.4, 7.6, 11, 10.2, 18, 20)

Split on Ear shape:
  Pointy: weights 7.2, 9.2, 8.4, 7.6, 10.2    variance 1.47     w_left = 5/10
  Floppy: weights 8.8, 15, 11, 18, 20         variance 21.87    w_right = 5/10
  reduction: 20.51 - ( 5/10 · 1.47 + 5/10 · 21.87 ) = 8.84

Split on Face shape:
  Round: weights 7.2, 15, 8.4, 7.6, 10.2, 18, 20    variance 27.80    w_left = 7/10
  Not round: weights 8.8, 9.2, 11                   variance 1.37     w_right = 3/10
  reduction: 20.51 - ( 7/10 · 27.80 + 3/10 · 1.37 ) = 0.64

Split on Whiskers:
  Present: weights 7.2, 8.8, 9.2, 8.4            variance 0.75     w_left = 4/10
  Absent: weights 15, 7.6, 11, 10.2, 18, 20      variance 23.32    w_right = 6/10
  reduction: 20.51 - ( 4/10 · 0.75 + 6/10 · 23.32 ) = 6.22
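A short sketch of the variance-reduction computation for the Ear shape split above; the slide's numbers correspond to the sample variance, i.e. ddof=1 in NumPy.

import numpy as np

root   = np.array([7.2, 8.8, 15, 9.2, 8.4, 7.6, 11, 10.2, 18, 20])
pointy = np.array([7.2, 9.2, 8.4, 7.6, 10.2])    # left branch
floppy = np.array([8.8, 15, 11, 18, 20])         # right branch

w_left, w_right = len(pointy) / len(root), len(floppy) / len(root)
reduction = root.var(ddof=1) - (w_left * pointy.var(ddof=1) + w_right * floppy.var(ddof=1))
print(round(reduction, 2))   # ≈ 8.84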

Andrew Ng
Tree ensembles

Using multiple decision trees


Trees are highly sensitive to small
changes of the data

E ar
s hape Whis kers

P ointy Floppy P res ent A bs ent

Andrew Ng
Tree ensemble New test example

Whis kers E ar Fac e


s hape s hape
Ear shape: Pointy
P res ent A bs ent Floppy Face shape: Not Round
Round N ot Round
P ointy
Whiskers: Present

Ear
shape N ot c at Cat
Face Whis kers
shape Whiskers

P ointy Floppy
A bs ent P res ent A bs ent
Round N ot round P res ent

C at N ot c at
Not cat Not Cat Not Cat Cat Not Cat
Cat

Prediction: Cat Prediction: Not cat Prediction: Cat

Final prediction: Cat

Andrew Ng
Tree ensembles

Sampling with replacement


Sampling with replacement
Tokens

Sampling with replacement:

Andrew Ng
Sampling with replacement
Ear shape Face shape Whiskers Cat
Pointy Round Present 1

Floppy Not round Absent 0


Pointy Round Absent 1

Pointy Not round Present 0


Floppy Not round Absent 0
Pointy Round Absent 1
Pointy Round Present 1
Floppy Not round Present 1

Floppy Round Absent 0


Pointy Round Absent 1

Andrew Ng
Tree ensembles

Random forest algorithm


Generating a tree sample
Given training set of size 𝑚

For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
Train a decision tree on the new dataset

Whiskers Ear Face Ear shape


Ear Face shape shape Whiskers Cat
shape shape Whiskers Cat Absent Floppy
Present Pointy
Pointy Round Present Yes
Pointy Round Present Yes Pointy Round Absent Yes
Floppy
Floppy
Round
Round
Absent
Absent
No
No Ear shape Not cat
Floppy
Floppy
Not Round
Not Round
Absent
Absent
No
No
Face
shape Not cat

Pointy Round Present Yes Pointy Round Absent Yes
Pointy Not Round Present Yes Floppy Round Absent No
Floppy Round Absent No Round Not round Not round
Floppy Round Absent No Round
Floppy Round Present Yes Floppy Round Absent No
Pointy Not Round Absent No Pointy Not Round Absent No
Cat Not cat Cat Not cat
Pointy Not Round Absent No Pointy Round Present Yes
Pointy Not Round Present Yes

Bagged decision tree

Andrew Ng
Randomizing the feature choice

At each node, when choosing a feature to use to split,


if 𝑛 features are available, pick a random subset of
𝑘 < 𝑛 features and allow the algorithm to only choose
from that subset of features.

Random forest algorithm
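A hedged sketch using scikit-learn's random forest, which implements both ideas (bootstrap sampling with replacement, and a random subset of features considered at each split); the feature matrix and parameter values here are placeholders, not the course's reference implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical binary features (ear shape, face shape, whiskers) and cat/not-cat labels
X = np.random.randint(0, 2, size=(10, 3))
y = np.random.randint(0, 2, size=10)

model = RandomForestClassifier(
    n_estimators=100,      # B: number of trees, each trained on a bootstrap sample
    max_features='sqrt',   # k ≈ sqrt(n): random subset of features per split
    bootstrap=True)        # sampling with replacement
model.fit(X, y)
print(model.predict(X[:2]))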

Andrew Ng
Tree ensembles

XGBoost
Boosted trees intuition
Given training set of size 𝑚

For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
But instead of picking from all examples with equal (1/m) probability, make it
more likely to pick examples that the previously trained trees misclassify
Train a decision tree on the new dataset
Whiskers Ear Face
Ear Face shape shape Whiskers Prediction
shape shape Whiskers Cat Absent
Present
Pointy Round Present Cat ✅
Pointy Round Present Yes
Floppy Not Round Present Not cat ❌
Floppy Round Absent No
Ear shape Floppy Round Absent Not cat ✅
Floppy Round Absent No Not cat
Pointy Not Round Present Not cat ✅
Pointy Round Present Yes
Pointy Round Present Cat ✅
Pointy Not Round Present Yes
Not round Pointy Round Absent Not cat
Floppy Round Absent No Round ❌
Floppy Not Round Absent Not cat
Floppy Round Present Yes ✅
Pointy Round Absent Not cat ❌
Pointy Not Round Absent No
Cat Not cat Floppy Round Absent Not cat ✅
Pointy Not Round Absent No
Floppy Round Absent Not cat ✅
Pointy Not Round Present Yes

Andrew Ng
XGBoost (eXtreme Gradient Boosting)
• Open source implementation of boosted trees
• Fast efficient implementation
• Good choice of default splitting criteria and criteria for when
to stop splitting
• Built in regularization to prevent overfitting
• Highly competitive algorithm for machine learning
competitions (eg: Kaggle competitions)

Andrew Ng
Using XGBoost
Classification Regression

from xgboost import XGBClassifier from xgboost import XGBRegressor

model = XGBClassifier() model = XGBRegressor()

model.fit(X_train, y_train) model.fit(X_train, y_train)


y_pred = model.predict(X_test) y_pred = model.predict(X_test)

Andrew Ng
Conclusion

When to use decision trees


Decision Trees vs Neural Networks
Decision Trees and Tree ensembles
• Works well on tabular (structured) data
• Not recommended for unstructured data (images, audio, text)
• Fast
Choose architecture
(model, data, etc.)

Diagnostics
(bias, variance and error
analysis)
Train
model

Andrew Ng
Decision Trees vs Neural Networks
Decision Trees and Tree ensembles
• Works well on tabular (structured) data
• Not recommended for unstructured data (images, audio, text)
• Fast
• Small decision trees may be human interpretable
Neural Networks
• Works well on all types of data, including tabular (structured) and
unstructured data
• May be slower than a decision tree
• Works with transfer learning
• When building a system of multiple models working together, it
might be easier to string together multiple neural networks

Andrew Ng
