
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Advanced Learning Algorithms

Welcome!
Advanced learning algorithms
Neural Networks
inference (prediction)
training
Practical advice for building machine learning systems
Decision Trees

Andrew Ng
Neural Networks Intuition

Neurons and the brain


Neural networks
Origins: Algorithms that try to mimic the brain.

Used in the 1980’s and early 1990’s.


Fell out of favor in the late 1990’s.

Resurgence from around 2005.

speech images text (NLP)

Andrew Ng
Neurons in the brain

[Credit: US National Institutes of Health, National Institute on Aging]


(diagram: a biological neuron and a simplified mathematical model of a neuron — both map inputs to outputs)

image source: https://biologydictionary.net/sensory-neuron/

Andrew Ng
Why Now?
(plot: performance vs. amount of data — large neural networks keep improving as data grows,
while medium and small neural networks and traditional AI plateau)

Andrew Ng
Neural Network Intuition

Demand Prediction
Demand Prediction
input:  x = price
output: a = f(x) = 1 / (1 + e^-(wx+b))
        = probability of being a top seller (yes/no)

price x  →  a

Andrew Ng
Demand Prediction
layer

price
layer
shipping cost
probability of
marketing being a top seller
material

Andrew Ng
Demand Prediction
layer

price
layer
shipping cost a a
probability of
marketing being a top seller
material

Andrew Ng
Demand Prediction
layer

price
layer
shipping cost a a
probability of
marketing being a top seller
material

Andrew Ng
Multiple hidden layers

x (input layer)  →  hidden layer  →  hidden layer  →  …  →  output layer  →  a

Choosing how many hidden layers and how many units per layer is part of deciding the
neural network architecture.


Andrew Ng
Neural Networks Intuition

Example:
Recognizing Images
Face recognition

x=

Andrew Ng
Face recognition

output layer
⋮ probability of being
⋮ ⋮
person ‘XYZ’

x
input

source: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations
by Honglak Lee, Roger Grosse, Ranganath Andrew Y. Ng

Andrew Ng
Car classification

output layer
⋮ probability of
⋮ ⋮
car detected

source: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations
by Honglak Lee, Roger Grosse, Ranganath Andrew Y. Ng

Andrew Ng
Neural network model

Neural network layer


Neural network layer

Andrew Ng
Neural network layer
g(z) = 1 / (1 + e^-z)

layer 0 (input x)  →  layer 1  →  layer 2

unit 1:  w1[1], b1[1]    a1[1] = g(w1[1] ∙ x + b1[1])
unit 2:  w2[1], b2[1]    a2[1] = g(w2[1] ∙ x + b2[1])
unit 3:  w3[1], b3[1]    a3[1] = g(w3[1] ∙ x + b3[1])

a[1] = vector of activation values from layer 1

notation: the superscript [1] gives the layer number
Andrew Ng
Neural network layer
g(z) = 1 / (1 + e^-z)

x  →  a[1]  →  a[2]

layer 2 (1 unit):  w1[2], b1[2]    a1[2] = g(w1[2] ∙ a[1] + b1[2])

a[2] is a scalar value
Andrew Ng
Neural network layer
predict category 1 or 0 (yes/no)

x  →  a[1]  →  a[2]

is a1[2] ≥ 0.5?
  yes: ŷ = 1
  no:  ŷ = 0
Andrew Ng
Neural Network Model

More complex neural networks


More complex neural network

x (input, layer 0)  →  a[1]  →  a[2]  →  a[3]  →  a[4]

layer 1    layer 2    layer 3    layer 4
      hidden layers              output layer

Andrew Ng
More complex neural network

x (input)  →  a[1]  →  a[2]  →  a[3]  →  a[4]
layer 1    layer 2    layer 3    layer 4

Layer 3 units (taking a[2] as input):
w1[3], b1[3]    a1[3] = g(w1[3] ∙ a[2] + b1[3])
w2[3], b2[3]    a2[3] = g(w2[3] ∙ a[2] + b2[3])
w3[3], b3[3]    a3[3] = g(w3[3] ∙ a[2] + b3[3])

Andrew Ng
Notation

x (input)  →  a[1]  →  a[2]  →  a[3]  →  a[4]
layer 1    layer 2    layer 3    layer 4

Layer 3 units:
w1[3], b1[3]    a1[3] = g(w1[3] ∙ a[2] + b1[3])
w2[3], b2[3]    a2[3] = g(w2[3] ∙ a[2] + b2[3])
w3[3], b3[3]    a3[3] = g(w3[3] ∙ a[2] + b3[3])

Question:
Can you fill in the superscripts and
subscripts for the second neuron?

Andrew Ng
Notation

x (input)  →  a[1]  →  a[2]  →  a[3]  →  a[4]
layer 1    layer 2    layer 3    layer 4

Layer 3 units:
w1[3], b1[3]    a1[3] = g(w1[3] ∙ a[2] + b1[3])
w2[3], b2[3]    a2[3] = g(w2[3] ∙ a[2] + b2[3])
w3[3], b3[3]    a3[3] = g(w3[3] ∙ a[2] + b3[3])

Answer:  a2[3] = g(w2[3] ∙ a[2] + b2[3])

General form:
aj[l] = g(wj[l] ∙ a[l-1] + bj[l])

aj[l]: activation value of layer l, unit (neuron) j
a[l-1]: output of layer l−1 (the previous layer)
wj[l], bj[l]: parameters w & b of layer l, unit j
g: the sigmoid (activation) function
Andrew Ng
Neural Network Model

Inference: making predictions


(forward propagation)
Handwritten digit recognition
Digit images, label 0 or 1

x  →  a[1]  →  a[2]  →  a[3] = probability of being a handwritten ‘1’

layer 1: 25 units    layer 2: 15 units    layer 3 (output layer): 1 unit

Andrew Ng
Handwritten digit recognition
Digit images

label 0 1

⋮ a[1] a[2] a[3] probability


x ⋮ of being a
handwritten ‘1’
output
layer

25 units 15 units 1 unit a[2]


layer 1 layer 2 layer 3

Andrew Ng
Handwritten digit recognition
forward propagation

x  →  a[1]  →  a[2]  →  a[3] = f(x) = probability of being a handwritten ‘1’

layer 1: 25 units    layer 2: 15 units    layer 3 (output layer): 1 unit

a[3] = g(w1[3] ∙ a[2] + b1[3])

is a1[3] ≥ 0.5?
  yes: ŷ = 1 (image is digit 1)
  no:  ŷ = 0 (image isn’t digit 1)

Andrew Ng
TensorFlow implementation

Inference in Code
Coffee roasting
inputs x: Temperature (Celsius), Duration (minutes)

x  →  a[1]  →  a[2]

is a1[2] ≥ 0.5?
  yes: ŷ = 1
  no:  ŷ = 0
Andrew Ng
x a[1] a[2]

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)

Andrew Ng
Build the model using TensorFlow
x a[1] a[2]

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)

layer_2 = Dense(units=1, activation='sigmoid')
a2 = layer_2(a1)

Andrew Ng
Build the model using TensorFlow
x  →  a[1]  →  a[2]

is a1[2] ≥ 0.5?   ŷ = 1 or ŷ = 0

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)

layer_2 = Dense(units=1, activation='sigmoid')
a2 = layer_2(a1)

if a2 >= 0.5:
    yhat = 1
else:
    yhat = 0

Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=15, activation='sigmoid')
a2 = layer_2(a1)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=15, activation='sigmoid')
a2 = layer_2(a1)
layer_3 = Dense(units=1, activation='sigmoid')
a3 = layer_3(a2)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=15, activation='sigmoid')
a2 = layer_2(a1)
layer_3 = Dense(units=1, activation='sigmoid')
a3 = layer_3(a2)

if a3 >= 0.5:
    yhat = 1
else:
    yhat = 0

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)
is a1[3] ≥ 0.5?   ŷ = 1 or ŷ = 0

Andrew Ng
TensorFlow implementation

Data in TensorFlow
Feature vectors
temperature (Celsius)   duration (minutes)   Good coffee? (1/0)
200.0                   17.0                 1
425.0                   18.5                 0
…                       …                    …

x = np.array([[200.0, 17.0]])     # [[200.0, 17.0]]

Andrew Ng
Note about numpy arrays
x = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2x3 matrix: [[1, 2, 3], [4, 5, 6]]

x = np.array([[0.1, 0.2],
              [-3.0, -4.0],
              [-0.5, -0.6],
              [7.0, 8.0]])       # 4x2 matrix

Andrew Ng
Note about numpy arrays
x = np.array([[200, 17]])     # 1x2 row matrix: [[200, 17]]

x = np.array([[200],
              [17]])           # 2x1 column matrix

x = np.array([200, 17])        # 1D array (no rows or columns)

Andrew Ng
Feature vectors
temperature (Celsius)   duration (minutes)   Good coffee? (1/0)
200.0                   17.0                 1
425.0                   18.5                 0
…                       …                    …

x = np.array([[200.0, 17.0]])     # 1x2 matrix: [[200.0, 17.0]]

Andrew Ng
Activation vector
x a[1] a[2]

x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)
1 x 3 matrix
tf.Tensor([[0.2 0.7 0.3]], shape=(1, 3), dtype=float32)

a1.numpy()
array([[1.4661001, 1.125196 , 3.2159438]], dtype=float32)

Andrew Ng
Activation vector
x a[1] a[2]

layer_2 = Dense(units=1, activation='sigmoid')


a2 = layer_2(a1)
1x1
tf.Tensor([[0.8]], shape=(1, 1), dtype=float32)

a2.numpy()
array([[0.8]], dtype=float32)

Andrew Ng
TensorFlow implementation

Building a neural network


What you saw earlier
x = np.array([[200.0, 17.0]])
x a[1] a[2] layer_1 = Dense(units=3, activation="sigmoid")
a1 = layer_1(x)

layer_2 = Dense(units=1, activation="sigmoid")


a2 = layer_2(a1)

Andrew Ng
Building a neural network architecture
x a[1] a[2] layer_1 = Dense(units=3, activation="sigmoid")
layer_2 = Dense(units=1, activation="sigmoid")
model = Sequential([layer_1, layer_2])

x = np.array([[200.0, 17.0],
              [120.0,  5.0],
              [425.0, 20.0],
              [212.0, 18.0]])   # 4x2
y = np.array([1, 0, 0, 1])

model.compile(...)
model.fit(x, y)
model.predict(x_new)

data (temperature, duration → y):  200 17 → 1,  120 5 → 0,  425 20 → 0,  212 18 → 1

Andrew Ng
Digit classification model
layer_1 = Dense(units=25, activation="sigmoid")
layer_2 = Dense(units=15, activation="sigmoid")
layer_3 = Dense(units=1, activation="sigmoid")

model = Sequential([layer_1, layer_2, layer_3])
model.compile(...)

x = np.array([[0..., 245, ..., 17],
              [0..., 200, ..., 184]])
y = np.array([1, 0])

model.fit(x, y)
model.predict(x_new)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Digit classification model
model = Sequential([
    Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=1, activation="sigmoid")])

model.compile(...)

x = np.array([[0..., 245, ..., 17],
              [0..., 200, ..., 184]])
y = np.array([1, 0])

model.fit(x, y)
model.predict(x_new)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
Neural network implementation
in Python

Forward prop in a single layer


forward prop (coffee roasting model)

x = np.array([200, 17])

x  →  a[1] (3 units)  →  a[2] (1 unit)

Layer 1:  a1_j = g(w1_j ∙ x + b1_j)

w1_1 = np.array([1, 2])           w1_2 = np.array([-3, 4])          w1_3 = np.array([5, -6])
b1_1 = np.array([-1])             b1_2 = np.array([1])              b1_3 = np.array([2])
z1_1 = np.dot(w1_1, x) + b1_1     z1_2 = np.dot(w1_2, x) + b1_2     z1_3 = np.dot(w1_3, x) + b1_3
a1_1 = sigmoid(z1_1)              a1_2 = sigmoid(z1_2)              a1_3 = sigmoid(z1_3)

a1 = np.array([a1_1, a1_2, a1_3])

Layer 2:  a2_1 = g(w2_1 ∙ a[1] + b2_1)

w2_1 = np.array([-7, 8])
b2_1 = np.array([3])
z2_1 = np.dot(w2_1, a1) + b2_1
a2_1 = sigmoid(z2_1)

Andrew Ng
Neural network implementation
in Python

General implementation of
forward propagation
Forward prop in NumPy

x  →  a[1]  →  …

w1[1] = [1, 2],  w2[1] = [-3, 4],  w3[1] = [5, -6]
b1[1] = -1,  b2[1] = 1,  b3[1] = 2

W = np.array([
    [1, -3, 5],
    [2, 4, -6]])
b = np.array([-1, 1, 2])
a_in = np.array([-2, 4])      # a[0] = x

def dense(a_in, W, b, g):
    units = W.shape[1]
    a_out = np.zeros(units)
    for j in range(units):
        w = W[:,j]
        z = np.dot(w, a_in) + b[j]
        a_out[j] = g(z)
    return a_out

def sequential(x):
    a1 = dense(x, W1, b1, g)
    a2 = dense(a1, W2, b2, g)
    a3 = dense(a2, W3, b3, g)
    a4 = dense(a3, W4, b4, g)
    f_x = a4
    return f_x

Andrew Ng
Speculations on artificial general
intelligence (AGI)

Is there a path to AGI?


AI

ANI AGI
(artificial narrow intelligence) (artificial general intelligence)
E.g., smart speaker, Do anything a human can do
self-driving car, web search,
AI in farming and factories

Andrew Ng
Biological neuron Simplified mathematical
model of a neuron
inputs outputs inputs outputs

image source: https://biologydictionary.net/sensory-neuron/

Andrew Ng
Neural network and the brain
Can we mimic the human brain?

vs x 𝑓(x)

We have (almost) no idea how the brain works

Andrew Ng
The ”one learning algorithm” hypothesis

Auditory cortex

Auditory cortex learns to see


[Roe et al., 1992]

Andrew Ng
The ”one learning algorithm” hypothesis

Somatosensory cortex

Somatosensory cortex learns to see


[Metin & Frost, 1989]

Andrew Ng
Sensor representations in the brain

Seeing with your tongue Human echolocation (sonar)

Haptic belt: Direction sense Implanting a 3rd eye


[BrainPort; Welsh & Blasch, 1997; Nagel et al., 2005; Constantine-Paton & Law, 2009]

Andrew Ng
Vectorization
(optional)

How neural networks are


implemented efficiently
For loops vs. vectorization

Loop version:                              Vectorized version:
x = np.array([200, 17])                    X = np.array([[200, 17]])
W = np.array([[1, -3, 5],                  W = np.array([[1, -3, 5],
              [-2, 4, -6]])                              [-2, 4, -6]])
b = np.array([-1, 1, 2])                   B = np.array([[-1, 1, 2]])

def dense(a_in, W, b):                     def dense(A_in, W, B):
    units = W.shape[1]                         Z = np.matmul(A_in, W) + B
    a_out = np.zeros(units)                    A_out = g(Z)
    for j in range(units):                     return A_out
        w = W[:,j]
        z = np.dot(w, a_in) + b[j]
        a_out[j] = g(z)
    return a_out

output: [1, 0, 1]                          output: [[1, 0, 1]]

Andrew Ng
Vectorization
(optional)

Matrix multiplication
Dot products

example:  a = [1, 2],  w = [3, 4]
          z = a ∙ w = (1)(3) + (2)(4) = 11

in general:  z = a ∙ w

transpose:  aᵀ = [1  2]   (a written as a row vector)
vector-vector multiplication:  z = aᵀ w

Andrew Ng
Vector matrix multiplication

a = [1]        aᵀ = [1  2]        W = [w1  w2] = [3  5]
    [2]                                          [4  6]

Z = aᵀ W = [aᵀ w1   aᵀ w2]
         = [(1·3) + (2·4)   (1·5) + (2·6)]
         = [11  17]

Andrew Ng
Matrix matrix multiplication

A = [1  -1]        Aᵀ = [← a1ᵀ →] = [ 1   2]        W = [w1  w2] = [3  5]
    [2  -2]             [← a2ᵀ →]   [-1  -2]                       [4  6]

Z = Aᵀ W = [a1ᵀ w1   a1ᵀ w2]  =  [ 11   17]
           [a2ᵀ w1   a2ᵀ w2]     [-11  -17]

Andrew Ng
Vectorization
(optional)

Matrix multiplication rules


Matrix multiplication rules

A = [1  -1  0.1]     Aᵀ = [ 1    2 ]     W = [3  5  7  9]     Z = Aᵀ W = ?
    [2  -2  0.2]          [-1   -2 ]         [4  6  8  0]
                          [0.1  0.2]

Andrew Ng
Matrix multiplication rules

A = [1  -1  0.1]     Aᵀ = [ 1    2 ]     W = [3  5  7  9]     Z = Aᵀ W = ?
    [2  -2  0.2]          [-1   -2 ]         [4  6  8  0]
                          [0.1  0.2]

Andrew Ng
Matrix multiplication rules

A = [1  -1  0.1]     Aᵀ = [ 1    2 ]     W = [3  5  7  9]     Z = Aᵀ W = [ 11   17   23    9 ]
    [2  -2  0.2]          [-1   -2 ]         [4  6  8  0]                [-11  -17  -23   -9 ]
                          [0.1  0.2]                                     [ 1.1  1.7  2.3  0.9]

Andrew Ng
Matrix multiplication rules

A = [1  -1  0.1]     Aᵀ = [ 1    2 ]     W = [3  5  7  9]     Z = Aᵀ W = [ 11   17   23    9 ]
    [2  -2  0.2]          [-1   -2 ]         [4  6  8  0]                [-11  -17  -23   -9 ]
                          [0.1  0.2]                                     [ 1.1  1.7  2.3  0.9]

Andrew Ng
Vectorization
(optional)

Matrix multiplication code


Matrix multiplication in NumPy

Aᵀ W = [ 11   17   23    9 ]
       [-11  -17  -23   -9 ]
       [ 1.1  1.7  2.3  0.9]

A = np.array([[1, -1, 0.1],
              [2, -2, 0.2]])
W = np.array([[3, 5, 7, 9],
              [4, 6, 8, 0]])
AT = np.array([[1, 2],
               [-1, -2],
               [0.1, 0.2]])
AT = A.T                       # equivalent

Z = np.matmul(AT, W)
Z = AT @ W                     # equivalent

# [[ 11,   17,   23,   9 ],
#  [-11,  -17,  -23,  -9 ],
#  [ 1.1,  1.7,  2.3,  0.9]]

Andrew Ng
Dense layer vectorized

AT = np.array([[200, 17]])
W = np.array([[1, -3, 5],
              [-2, 4, -6]])
b = np.array([[-1, 1, 2]])

def dense(AT, W, b, g):
    z = np.matmul(AT, W) + b
    a_out = g(z)
    return a_out

Aᵀ = [200  17]      W = [ 1  -3   5]      b = [-1  1  2]
                        [-2   4  -6]

Z = Aᵀ W + B = [165  -531  900]       (z1[1], z2[1], z3[1])
A = g(Z) = [1  0  1]                  → [[1, 0, 1]]

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Neural Network
Training

TensorFlow
implementation
Train a Neural Network in TensorFlow
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy())
model.fit(X, Y, epochs=100)

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Given a set of (x,y) examples, how to build and train this in code?

Andrew Ng
Neural Network
Training

Training Details
Model Training Steps

1. Specify how to compute output given input x and parameters w,b (define the model):  f_w,b(x) = ?
   logistic regression:                        neural network:
   z = np.dot(w, x) + b                        model = Sequential([
   f_x = 1/(1 + np.exp(-z))                        Dense(...), Dense(...), Dense(...)])

2. Specify loss and cost:
   L(f_w,b(x), y)  — loss on 1 example
   J(w,b) = (1/m) Σ_{i=1}^{m} L(f_w,b(x^(i)), y^(i))  — cost over the training set
   logistic loss:                               binary cross entropy:
   loss = -y * np.log(f_x) - (1-y) * np.log(1-f_x)
                                                model.compile(loss=BinaryCrossentropy())

3. Train on data to minimize J(w,b):
   w = w - alpha * dj_dw                        model.fit(X, y, epochs=100)
   b = b - alpha * dj_db
Andrew Ng
1. Create the model
define the model: f(x) = ?

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

parameters:  W[1], b[1]   W[2], b[2]   W[3], b[3]

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (1 unit)

Andrew Ng
2. Loss and cost functions
MNIST digit binary classification problem

L(f(x), y) = -y log(f(x)) - (1 - y) log(1 - f(x))

J(W,B) = (1/m) Σ_{i=1}^{m} L(f(x^(i)), y^(i))        parameters: W[1], W[2], W[3], b[1], b[2], b[3]

from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy())

For regression (predicting numbers and not categories):
from tensorflow.keras.losses import MeanSquaredError
model.compile(loss=MeanSquaredError())

Andrew Ng
3. Gradient descent
repeat {
    wj[l] = wj[l] - α ∂/∂wj J(w,b)
    bj[l] = bj[l] - α ∂/∂bj J(w,b)
}

model.fit(X, y, epochs=100)

Andrew Ng
Neural network libraries
Use code libraries instead of coding ”from scratch”

Good to understand the implementation


(for tuning and debugging).

Andrew Ng
Activation Functions

Alternatives to the
sigmoid activation
Demand Prediction Example
a2[1] = g(w2[1] ∙ x + b2[1])

price, shipping cost, marketing, material  →  affordability, awareness, perceived quality

Sigmoid:  g(z) = 1 / (1 + e^-z),   0 < g(z) < 1
ReLU:     g(z) = max(0, z)

Andrew Ng
Examples of Activation Functions
a2[1] = g(w2[1] ∙ x + b2[1])

Linear activation function:  g(z) = z
Sigmoid:                     g(z) = 1 / (1 + e^-z),   0 < g(z) < 1
ReLU:                        g(z) = max(0, z)

Andrew Ng
Activation Functions

Choosing activation
functions
Output Layer
x  →  a[1]  →  a[2]  →  a[3] = f(x)       Choosing g(z) for the output layer:

Binary classification (y = 0/1):      Sigmoid
Regression (y can be +/−):            Linear activation function
Regression (y ≥ 0):                   ReLU

Andrew Ng
Hidden Layer
x  →  a[1]  →  a[2]  →  a[3] = f(x)       Choosing g(z) for the hidden layers:

Sigmoid:  g(z) = 1 / (1 + e^-z)
ReLU:     g(z) = max(0, z)   — the common default (gradient descent on J(W,B) tends to be faster)

Andrew Ng
Choosing Activation Summary
x  →  a[1]  →  a[2]  →  a[3] = f(x)

Output layer:   activation='sigmoid'   (binary classification)
                activation='linear'    (regression, y can be +/−)
                activation='relu'      (regression, y ≥ 0)
Hidden layers:  ReLU

from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

Andrew Ng
Activation Functions

Why do we need
activation functions?
Why do we need activation functions?
price, shipping cost, marketing, material  →  affordability, awareness, perceived quality  →  top seller?

x  →  a[1]  →  a[2]

If all g(z) are linear (g(z) = z), the network is no different than linear regression:
f(x) = w ∙ x + b

Andrew Ng
Linear Example

With linear activation g(z) = z:

a[1] = w1[1] x + b1[1]
a[2] = w1[2] a[1] + b1[2]
     = w1[2] (w1[1] x + b1[1]) + b1[2]
     = (w1[2] w1[1]) x + (w1[2] b1[1] + b1[2])
     = w x + b        (just a linear function of x)

Andrew Ng
Example
x  →  a[1]  →  a[2]  →  a[3]  →  a[4]

If the hidden layers use g(z) = z and the output layer uses sigmoid:
a[4] = 1 / (1 + e^-(w1[4] ∙ a[3] + b1[4]))   — equivalent to logistic regression

Don’t use linear activations in hidden layers.

Andrew Ng
Multiclass
Classification

Multiclass
MNIST example

𝑦=0 1 2 3 4 5 6 7 8 9

𝑦=7

Andrew Ng
Multiclass classification example
(plots over x1, x2: binary classification estimates P(y = 1 | x);
multiclass classification estimates P(y = 1 | x), P(y = 2 | x), P(y = 3 | x), P(y = 4 | x))

Andrew Ng
Multiclass
Classification

Softmax
Logistic regression (2 possible output values):
z = w ∙ x + b
a = g(z) = 1 / (1 + e^-z) = P(y = 1 | x)

Softmax regression (4 possible outputs):
z1 = w1 ∙ x + b1    a1 = e^z1 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 1 | x)
z2 = w2 ∙ x + b2    a2 = e^z2 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 2 | x)
z3 = w3 ∙ x + b3    a3 = e^z3 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 3 | x)
z4 = w4 ∙ x + b4    a4 = e^z4 / (e^z1 + e^z2 + e^z3 + e^z4) = P(y = 4 | x)

Softmax regression (N possible outputs):
zj = wj ∙ x + bj,   j = 1, …, N
aj = e^zj / Σ_{k=1}^{N} e^zk = P(y = j | x)
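For intuition, here is a minimal NumPy sketch of the softmax formula above; the z values are made-up numbers for illustration.

import numpy as np

def softmax(z):
    # aj = e^zj / sum_k e^zk
    ez = np.exp(z)
    return ez / ez.sum()

z = np.array([2.0, 1.0, 0.1, -1.0])   # hypothetical z1..z4
a = softmax(z)
print(a, a.sum())                      # probabilities for the 4 classes; they sum to 1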

Andrew Ng
Cost
Logistic regression:                          Softmax regression:
z = w ∙ x + b                                 a1 = e^z1 / (e^z1 + e^z2 + ⋯ + e^zN) = P(y = 1 | x)
a1 = g(z) = 1 / (1 + e^-z) = P(y = 1 | x)     ⋮
a2 = 1 - a1 = P(y = 0 | x)                    aN = e^zN / (e^z1 + e^z2 + ⋯ + e^zN) = P(y = N | x)

loss = -y log(a1) - (1 - y) log(1 - a1)       Crossentropy loss:
J(w,b) = average loss                         loss(a1, …, aN, y) = -log(aj)  if y = j

Andrew Ng
Multiclass
Classification

Neural Network with


Softmax output
Neural Network with Softmax output

x  →  a[1] (25 units)  →  a[2] (15 units)  →  a[3] (10 units, softmax output layer)

z1[3] = w1[3] ∙ a[2] + b1[3]        a1[3] = e^z1[3] / (e^z1[3] + ⋯ + e^z10[3]) = P(y = 1 | x)
⋮
z10[3] = w10[3] ∙ a[2] + b10[3]     a10[3] = e^z10[3] / (e^z1[3] + ⋯ + e^z10[3]) = P(y = 10 | x)

Unlike logistic regression units, where aj[3] = g(zj[3]) depends on zj[3] alone,
the softmax output a[3] = (a1[3], …, a10[3]) = g(z1[3], …, z10[3]) — each activation
depends on all of the z values.

Andrew Ng
specify the model  f_w,b(x) = ?          MNIST with softmax
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

specify loss and cost  L(f_w,b(x), y)
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy())

train on data to minimize J(w,b)
model.fit(X, Y, epochs=100)
Note: better (recommended) version later.

Andrew Ng
Multiclass
Classification

Improved implementation
of softmax
Numerical Roundoff Errors
option 1:  x = 2 / 10,000
option 2:  x = (1 + 1/10,000) - (1 - 1/10,000)

Andrew Ng
Numerical Roundoff Errors
More numerically accurate implementation of logistic loss:

Logistic regression:  a = g(z) = 1 / (1 + e^-z)

Original loss:
loss = -y log(a) - (1 - y) log(1 - a)

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])
model.compile(loss=BinaryCrossentropy())

More accurate loss (in code) — let TensorFlow keep a = 1/(1 + e^-z) expanded inside the loss:
loss = -y log(1/(1 + e^-z)) - (1 - y) log(1 - 1/(1 + e^-z))

model.compile(loss=BinaryCrossentropy(from_logits=True))

Andrew Ng
More numerically accurate implementation of softmax

Softmax regression:  (a1, …, a10) = g(z1, …, z10)

Loss = L(a, y) = -log(a1)   if y = 1
                 ⋮
                 -log(a10)  if y = 10

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])
model.compile(loss=SparseCategoricalCrossentropy())

More accurate:
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))

Andrew Ng
MNIST (more numerically accurate)
model
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear')
])

loss
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(..., loss=SparseCategoricalCrossentropy(from_logits=True))

fit
model.fit(X, Y, epochs=100)

predict
logits = model(X)
f_x = tf.nn.softmax(logits)

Andrew Ng
logistic regression
(more numerically accurate)
model
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='linear')
])

loss
from tensorflow.keras.losses import BinaryCrossentropy
model.compile(..., loss=BinaryCrossentropy(from_logits=True))

fit
model.fit(X, Y, epochs=100)

predict
logit = model(X)
f_x = tf.nn.sigmoid(logit)

Andrew Ng
Multi-label
Classification

Classification with
multiple outputs
(Optional)
Multi-label Classification

Is there a car?
Is there a bus?
Is there a pedestrian?

Andrew Ng
Multiple classes
car bus pedestrian

Alternatively, train one neural network with three outputs

x 𝑎Ԧ [1] 𝑎Ԧ [2] 𝑎Ԧ [3]
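One way to realize "one neural network with three outputs" is an output layer of three sigmoid units, one per label. This is a hedged sketch, not the course's exact code; the hidden-layer sizes are assumptions.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Multi-label classifier: each output is an independent yes/no
# (car? bus? pedestrian?), so the output layer uses 3 sigmoid units.
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=3, activation='sigmoid'),   # a[3] = [P(car), P(bus), P(pedestrian)]
])
model.compile(loss='binary_crossentropy')   # binary cross entropy applied per output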

Andrew Ng
Additional Neural Network
Concepts

Advanced Optimization
Gradient Descent
wj = wj - α ∂/∂wj J(w,b)

(contour plots of J(w,b) over w1, w2)
small, steady steps toward the minimum:  go faster – increase α
steps that oscillate back and forth:     go slower – decrease α

Andrew Ng
Adam Algorithm Intuition
Adam: Adaptive Moment estimation — one learning rate per parameter

w1 = w1 - α1 ∂/∂w1 J(w,b)
⋮
w10 = w10 - α10 ∂/∂w10 J(w,b)
b = b - α11 ∂/∂b J(w,b)

Andrew Ng
Adam Algorithm Intuition

𝑤2 𝐽 w, 𝑏 𝑤2 𝐽 w, 𝑏

𝑤1 𝑤1
If 𝑤𝑗 (or 𝑏) keeps moving If 𝑤𝑗 (or 𝑏) keeps oscillating,
in same direction, reduce 𝛼𝑗 .
increase 𝛼𝑗 .

Andrew Ng
model
MNIST Adam
model = Sequential([
    tf.keras.layers.Dense(units=25, activation='sigmoid'),
    tf.keras.layers.Dense(units=15, activation='sigmoid'),
    tf.keras.layers.Dense(units=10, activation='linear')
])

compile
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

fit
model.fit(X,Y,epochs=100)

Andrew Ng
Additional Neural Network
Concepts

Additional Layer Types


Dense Layer

x 𝑎Ԧ [1] 𝑎Ԧ [2] a[3]

Each neuron output is a function of


all the activation outputs of the previous layer.

a1[2] = g(w1[2] ∙ a[1] + b1[2])

Andrew Ng
Convolutional Layer

Each Neuron only looks at


part of the previous layer’s inputs.

Why?
• Faster computation
• Need less training data
(less prone to
overfitting)

Andrew Ng
Convolutional Neural Network
EKG signal: inputs x1, x2, x3, …, x100

Layer 1 (9 units), each unit looks at a window of the input:
x1–x20,  x11–x30,  x21–x40,  …,  x81–x100

Layer 2 (3 units), each unit looks at a window of layer 1’s activations:
a1[1]–a5[1],  a3[1]–a7[1],  a5[1]–a9[1]

x  →  a[1]  →  a[2]  →  a[3]

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Advice for applying
machine learning

Deciding what to try next


Debugging a learning algorithm
You’ve implemented regularized linear regression on housing prices
J(w,b) = (1/2m) Σ_{i=1}^{m} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1}^{n} wj^2
But it makes unacceptably large errors in predictions. What do you
try next?
Get more training examples
Try smaller sets of features
Try getting additional features
Try adding polynomial features (𝑥12, 𝑥22, 𝑥1𝑥2, 𝑒𝑡𝑐)
Try decreasing 𝜆
Try increasing 𝜆

Andrew Ng
Machine learning diagnostic
Diagnostic: A test that you run to gain insight into what
is/isn’t working with a learning algorithm, to gain guidance
into improving its performance.

Diagnostics can take time to implement but doing so can be


a very good use of your time.

Andrew Ng
Evaluating and
choosing models

Evaluating a model
Evaluating your model
A model that fits the training data well may still fail to generalize to new
examples not in the training set.

(plot: price vs. size, high-order polynomial fit through the training points)

x1 = size in feet²
x2 = no. of bedrooms
x3 = no. of floors
x4 = age of home in years

f_w,b(x) = w1 x + w2 x^2 + ⋯ + wn x^n + b      (here with just x = x1)
Andrew Ng
Evaluating your model
Dataset:
size    price
2104    400
1600    330
2400    369
1416    232     training set (70%):  (x^(1), y^(1)), …, (x^(m_train), y^(m_train));  m_train = 7 training examples
3000    540
1985    300
1534    315
1427    199
1380    212     test set (30%):  (x_test^(1), y_test^(1)), …, (x_test^(m_test), y_test^(m_test));  m_test = 3 test examples
1494    243

Andrew Ng
Train/test procedure for linear regression (with squared error cost)

Fit parameters by minimizing the cost function J(w,b):
J(w,b) = (1/2m_train) Σ_{i=1}^{m_train} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m_train) Σ_{j=1}^{n} wj^2

Compute test error:
J_test(w,b) = (1/2m_test) Σ_{i=1}^{m_test} (f_w,b(x_test^(i)) - y_test^(i))^2

Compute training error:
J_train(w,b) = (1/2m_train) Σ_{i=1}^{m_train} (f_w,b(x_train^(i)) - y_train^(i))^2

Andrew Ng
Train/test procedure for linear regression (with squared error cost)

Fit parameters by minimizing cost function J(w,b)

Compute test error:
J_test(w,b) = (1/2m_test) Σ_{i=1}^{m_test} (f_w,b(x_test^(i)) - y_test^(i))^2

Compute training error:
J_train(w,b) = (1/2m_train) Σ_{i=1}^{m_train} (f_w,b(x_train^(i)) - y_train^(i))^2

(plot: price vs. size, overfit curve passing through the training points but missing the test points)
If the model overfits:  J_train(w,b) will be low,  J_test(w,b) will be high.

Andrew Ng
Train/test procedure for classification problem

Fit parameters by minimizing J(w,b) to find w, b. E.g.:
J(w,b) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 - y^(i)) log(1 - f_w,b(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} wj^2

Compute test error:
J_test(w,b) = -(1/m_test) Σ_{i=1}^{m_test} [ y_test^(i) log(f_w,b(x_test^(i))) + (1 - y_test^(i)) log(1 - f_w,b(x_test^(i))) ]

Compute train error:
J_train(w,b) = -(1/m_train) Σ_{i=1}^{m_train} [ y_train^(i) log(f_w,b(x_train^(i))) + (1 - y_train^(i)) log(1 - f_w,b(x_train^(i))) ]

Andrew Ng
Train/test procedure for classification problem

Fit parameters by minimizing J(w,b) to find w, b. E.g.:
J(w,b) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(f_w,b(x^(i))) + (1 - y^(i)) log(1 - f_w,b(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} wj^2

Alternatively, measure the fraction of the test set and the fraction of the
train set that the algorithm has misclassified:

ŷ = 1  if f_w,b(x^(i)) ≥ 0.5
    0  if f_w,b(x^(i)) < 0.5
count ŷ ≠ y

J_test(w,b) is the fraction of the test set that has been misclassified.
J_train(w,b) is the fraction of the train set that has been misclassified.
Andrew Ng
Evaluating and
choosing models

Model selection and


training/cross validation/test
sets
Model selection (choosing a model)
Once parameters w, b are fit to the training set, the training error J_train(w,b)
is likely lower than the actual generalization error.

J_test(w,b) is a better estimate of how well the model will generalize to new
data than J_train(w,b).

(plot: price vs. size)
f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + w4 x^4 + b

Andrew Ng
Model selection (choosing a model)
d=1    1.  f_w,b(x) = w1 x + b                          →  w<1>, b<1>   →  J_test(w<1>, b<1>)
d=2    2.  f_w,b(x) = w1 x + w2 x^2 + b                 →  w<2>, b<2>   →  J_test(w<2>, b<2>)
d=3    3.  f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + b        →  w<3>, b<3>   →  J_test(w<3>, b<3>)
⋮
d=10   10. f_w,b(x) = w1 x + w2 x^2 + ⋯ + w10 x^10 + b  →  w<10>, b<10> →  J_test(w<10>, b<10>)

Choose the lowest, say d=5:  w1 x + ⋯ + w5 x^5 + b,  J_test(w<5>, b<5>)
How well does the model perform? Report test set error J_test(w<5>, b<5>)?
The problem is that J_test(w<5>, b<5>) is likely to be an optimistic estimate of the
generalization error, i.e. an extra parameter d (degree of polynomial) was chosen
using the test set.

Andrew Ng
Training/cross validation/test set
(also called validation set, development set, or dev set)

size    price
2104    400
1600    330
2400    369     training set (60%):  (x^(1), y^(1)), …, (x^(m_train), y^(m_train));  m_train = 6
1416    232
3000    540
1985    300
1534    315     cross validation set (20%):  (x_cv^(1), y_cv^(1)), …, (x_cv^(m_cv), y_cv^(m_cv));  m_cv = 2
1427    199
1380    212     test set (20%):  (x_test^(1), y_test^(1)), …, (x_test^(m_test), y_test^(m_test));  m_test = 2
1494    243

Andrew Ng
Training/cross validation/test set

Training error:           J_train(w,b) = (1/2m_train) Σ_{i=1}^{m_train} (f_w,b(x^(i)) - y^(i))^2

Cross validation error:   J_cv(w,b) = (1/2m_cv) Σ_{i=1}^{m_cv} (f_w,b(x_cv^(i)) - y_cv^(i))^2
                          (also called validation error or dev error)

Test error:               J_test(w,b) = (1/2m_test) Σ_{i=1}^{m_test} (f_w,b(x_test^(i)) - y_test^(i))^2
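A minimal sketch of carving out these three sets with scikit-learn; the 60/20/20 proportions follow the example above, and the X, y arrays here are random placeholders standing in for a real dataset.

import numpy as np
from sklearn.model_selection import train_test_split

# X: (m, n) feature matrix, y: (m,) targets -- placeholders for your data
X = np.random.rand(10, 2)
y = np.random.rand(10)

# First split off 60% for training, then split the remaining 40% in half
# to get 20% cross validation and 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.40, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=1)

print(len(X_train), len(X_cv), len(X_test))   # 6, 2, 2 for m = 10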

Andrew Ng
Model selection
d=1    1.  f_w,b(x) = w1 x + b                          →  w<1>, b<1>   →  J_cv(w<1>, b<1>)
d=2    2.  f_w,b(x) = w1 x + w2 x^2 + b                 →  w<2>, b<2>   →  J_cv(w<2>, b<2>)
d=3    3.  f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + b        →  ⋮
⋮
d=10   10. f_w,b(x) = w1 x + w2 x^2 + ⋯ + w10 x^10 + b  →  w<10>, b<10> →  J_cv(w<10>, b<10>)

Pick the d with the lowest cross validation error, say d=4:  w1 x + ⋯ + w4 x^4 + b  (J_cv(w<4>, b<4>))

Estimate generalization error using the test set: J_test(w<4>, b<4>)

Andrew Ng
Model selection – choosing a neural network architecture

1. w(1),b(1) 1 1
𝐽𝑐𝑣 𝐖 ,𝐁

2. w(2),b(2) 𝐽𝑐𝑣 𝐖 2 ,𝐁 2

3. w(3),b(3) 𝐽𝑐𝑣 𝐖 3 ,𝐁 3

Train, CV
Pick 𝐖 2 ,𝐁 2

Estimate generalization error using the test set: 𝐽𝑡𝑒𝑠𝑡 𝐖 2 ,𝐁 2

Andrew Ng
Bias and variance

Diagnosing bias and variance


Bias/variance
(plots: price vs. size with three fits)

f_w,b(x) = w1 x + b        f_w,b(x) = w1 x + w2 x^2 + b    f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + w4 x^4 + b
High bias (underfit)       “Just right”                     High variance (overfit)
d = 1                      d = 2                            d = 4
J_train is high            J_train is low                   J_train is low
J_cv is high               J_cv is low                      J_cv is high

Andrew Ng
Understanding bias and variance

𝐽𝑐𝑣 w, 𝑏

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏
degree of polynomial
𝑑

Andrew Ng
Diagnosing bias and variance
How do you tell if your algorithm has a bias or variance problem?

High bias (underfit):           J_train will be high   (J_train ≈ J_cv)
High variance (overfit):        J_cv ≫ J_train   (J_train may be low)
High bias and high variance:    J_train will be high  and  J_cv ≫ J_train

(plot: J_train and J_cv vs. degree of polynomial — low degree: high bias/underfit;
high degree: high variance/overfit)

Andrew Ng
Bias and variance

Regularization and
bias/variance
Linear regression with regularization
Model:  f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + w4 x^4 + b
J(w,b) = (1/2m) Σ_{i=1}^{m} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1}^{n} wj^2

(plots: price vs. size for three values of λ)

Large λ (e.g. λ = 10,000):  w1 ≈ 0, w2 ≈ 0, so f_w,b(x) ≈ b
  High bias (underfit):  J_train(w,b) is large
Intermediate λ:  “just right”:  J_train(w,b) is small, J_cv(w,b) is small
Small λ (e.g. λ = 0):  High variance (overfit):  J_train(w,b) is small, J_cv(w,b) is large
Andrew Ng
Choosing the regularization parameter 𝜆
Model:  f_w,b(x) = w1 x + w2 x^2 + w3 x^3 + w4 x^4 + b

1.  Try λ = 0     →  min_{w,b} J(w,b)  →  w<1>, b<1>    →  J_cv(w<1>, b<1>)
2.  Try λ = 0.01  →                       w<2>, b<2>    →  J_cv(w<2>, b<2>)
3.  Try λ = 0.02  →                       w<3>, b<3>    →  J_cv(w<3>, b<3>)
4.  Try λ = 0.04
5.  Try λ = 0.08  →                       w<5>, b<5>    →  J_cv(w<5>, b<5>)
⋮
12. Try λ ≈ 10    →                       w<12>, b<12>  →  J_cv(w<12>, b<12>)

Pick the λ with the lowest J_cv, e.g. w<5>, b<5>.

Report test error: J_test(w<5>, b<5>)
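A hedged sketch of this λ sweep using scikit-learn's Ridge (L2-regularized linear regression); the candidate λ values mirror the list above, and the data arrays are placeholders for a real training/CV split.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Placeholders standing in for your actual training and cross validation sets
X_train, y_train = np.random.rand(20, 4), np.random.rand(20)
X_cv, y_cv = np.random.rand(8, 4), np.random.rand(8)

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10]
results = []
for lam in lambdas:
    model = Ridge(alpha=lam)            # alpha plays the role of lambda
    model.fit(X_train, y_train)         # fit w, b on the training set
    j_cv = mean_squared_error(y_cv, model.predict(X_cv)) / 2
    results.append((lam, j_cv))

best_lambda, best_jcv = min(results, key=lambda t: t[1])   # pick lambda with lowest J_cv
print(best_lambda, best_jcv)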

Andrew Ng
Bias and variance as a function of regularization parameter 𝜆
J(w,b) = (1/2m) Σ_{i=1}^{m} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1}^{n} wj^2

(plot: J_train(w,b) and J_cv(w,b) vs. λ)
small λ:  high variance (J_train low, J_cv high)
large λ:  high bias (J_train high, J_cv high)

Andrew Ng
Bias and variance

Establishing a baseline level of


performance
Speech recognition example

Human level performance : 10.6% 0.2%


Training error 𝐽𝑡𝑟𝑎𝑖𝑛 : 10.8%
4.0%
Cross validation error 𝐽𝑐𝑣 : 14.8%

Andrew Ng
Establishing a baseline level of performance

What is the level of error you can reasonably


hope to get to?

• Human level performance

• Competing algorithms performance

• Guess based on experience

Andrew Ng
Bias/variance examples
                                   case 1          case 2       case 3
Baseline performance:              10.6%           10.6%        10.6%
Training error (J_train):          10.8%           15.0%        15.0%
  gap (baseline → train):           0.2%            4.4%         4.4%
Cross validation error (J_cv):     14.8%           15.5%        19.7%
  gap (train → cv):                 4.0%            0.5%         4.7%
Diagnosis:                     high variance     high bias    high bias and high variance

Andrew Ng
Bias and variance

Learning curves
Learning curves 𝑓w,𝑏 𝑥 = 𝑤1 𝑥 + 𝑤2𝑥 2 + 𝑏

𝐽𝑡𝑟𝑎𝑖𝑛= training error


𝐽𝑐𝑣 = cross validation error

𝐽𝑐𝑣 w, 𝑏
error

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏

𝑚𝑡𝑟𝑎𝑖𝑛 (training set size)

Andrew Ng
High bias
𝑓w,𝑏 𝑥 = 𝑤1𝑥 + 𝑏

𝐽𝑐𝑣 w, 𝑏
error

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏
human level
performance

𝑚 (training set size)

if a learning algorithm suffers


from high bias, getting more
training data will not (by
itself) help much.

Andrew Ng
High variance 𝑓w,𝑏 𝑥 = 𝑤1 𝑥 + 𝑤2𝑥 2 + 𝑤3𝑥 3
+ 𝑤4 𝑥 4 + 𝑏
(with small 𝜆)
𝐽𝑐𝑣 w, 𝑏
human level
performance
error

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏

𝑚 (training set size)

if a learning algorithm suffers


from high variance, getting
more training data is likely to
help.

Andrew Ng
Bias and variance

Deciding what to try next


revisited
Debugging a learning algorithm
You’ve implemented regularized linear regression on housing prices
J(w,b) = (1/2m) Σ_{i=1}^{m} (f_w,b(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1}^{n} wj^2
But it makes unacceptably large errors in predictions. What do you
try next?
Get more training examples fixes high variance
Try smaller sets of features x, x2, x3, x4, x5… fixes high variance
Try getting additional features fixes high bias
Try adding polynomial features (𝑥12, 𝑥22, 𝑥1𝑥2, 𝑒𝑡𝑐) fixes high bias
Try decreasing 𝜆 fixes high bias
Try increasing 𝜆 fixes high variance

Andrew Ng
Bias and variance

Bias/variance and
neural networks
The bias variance tradeoff
𝑓w,𝑏 𝑥 = 𝑤1𝑥 + 𝑏 𝑓w,𝑏 𝑥 = 𝑤1𝑥 + 𝑤2𝑥 2 𝑓w,𝑏 𝑥 = 𝑤1𝑥 + 𝑤2𝑥 2 + 𝑤3𝑥 3
+𝑏 +𝑤4𝑥 4 + 𝑏

Simple model Complex model


High bias High variance
tradeoff
𝐽𝑐𝑣 w, 𝑏

𝐽𝑡𝑟𝑎𝑖𝑛 w, 𝑏

degree of polynomial

Andrew Ng
Neural networks and bias variance
Large neural networks are low bias machines

Does it do well on the training set?  (J_train(w,b))
  No  →  bigger network (train longer; a GPU helps), then check again
  Yes →  Does it do well on the cross validation set?  (J_cv(w,b))
            No  →  more data, then check again
            Yes →  Done!

Andrew Ng
Neural networks and regularization

3 layers 4 layers

A large neural network will usually do as well or better than a


smaller one so long as regularization is chosen appropriately.

Andrew Ng
Neural network regularization
J(W,B) = (1/m) Σ_{i=1}^{m} L(f(x^(i)), y^(i)) + (λ/2m) Σ_{all weights w in W} w^2

Unregularized MNIST model:
layer_1 = Dense(units=25, activation="relu")
layer_2 = Dense(units=15, activation="relu")
layer_3 = Dense(units=1, activation="sigmoid")
model = Sequential([layer_1, layer_2, layer_3])

Regularized MNIST model (λ = 0.01):
layer_1 = Dense(units=25, activation="relu", kernel_regularizer=L2(0.01))
layer_2 = Dense(units=15, activation="relu", kernel_regularizer=L2(0.01))
layer_3 = Dense(units=1, activation="sigmoid", kernel_regularizer=L2(0.01))
model = Sequential([layer_1, layer_2, layer_3])

Andrew Ng
Machine learning
development process

Iterative loop of
ML development
Iterative loop of ML development
Choose architecture
(model, data, etc.)

Diagnostics
Train
(bias, variance and
model
error analysis)

Andrew Ng
Spam classification example

From: cheapsales@buystufffromme.com From: Alfred Ng


To: Andrew Ng To: Andrew Ng
Subject: Buy now! Subject: Christmas dates?

Deal of the week! Buy now! Hey Andrew,


Rolex w4tchs - $100 Was talking to Mom about plans
Med1cine (any kind) - £50 for Xmas. When do you get off
Also low cost M0rgages work. Meet Dec 22?
available. Alf

Andrew Ng
Building a spam classifier
Supervised learning: x = features of email
y = spam (1) or not spam (0)
Features: list the top 10,000 words to compute 𝑥1, 𝑥2, ⋯ , 𝑥10,000

From: cheapsales@buystufffromme.com
0 a To: Andrew Ng
1 andrew Subject: Buy now!

2 11 buy Deal of the week! Buy now!


1 deal Rolex w4tchs - $100
0 discount
Med1cine (any kind) - £50
Also low cost M0rgages
⋮ ⋮ available.

Andrew Ng
Building a spam classifier
How to try to reduce your spam classifier’s error?

• Collect more data. E.g., “Honeypot” project.


• Develop sophisticated features based on email
routing (from email header).
• Define sophisticated features from email body.
E.g., should “discounting” and ”discount” be
treated as the same word.
• Design algorithms to detect misspellings.
E.g., w4tches, med1cine, m0rtgage.

Andrew Ng
Iterative loop of ML development
Choose architecture
(model, data, etc.)

Diagnostics
Train
(bias, variance and
model
error analysis)

Andrew Ng
Machine learning
development process

Error analysis
Error analysis
m_cv = 500 examples in the cross validation set.
Algorithm misclassifies 100 of them.
(If m_cv were 5000 with 1000 misclassified, you would manually examine a random sample of ~100.)
Manually examine the 100 examples and categorize them based on common traits:
  Pharma: 21                                          → more data / new features
  Deliberate misspellings (w4tches, med1cine): 3
  Unusual email routing: 7
  Steal passwords (phishing): 18                      → more data / new features
  Spam message in embedded image: 5

Andrew Ng
Building a spam classifier
How to try to reduce your spam classifier’s error?

• Collect more data. E.g., “Honeypot” project.


• Develop sophisticated features based on email
routing (from email header).
• Define sophisticated features from email body.
E.g., should “discounting” and ”discount” be
treated as the same word.
• Design algorithms to detect misspellings.
E.g., w4tches, med1cine, m0rtgage.

Andrew Ng
Iterative loop of ML development
Choose architecture
(model, data, etc.)

Diagnostics
Train
(bias, variance and
model
error analysis)

Andrew Ng
Machine learning
development process

Adding data
Adding data
Add more data of everything. E.g., “Honeypot” project.

Add more data of the types where error analysis has


indicated it might help.

Pharma spam

E.g., Go to unlabeled data and find more examples of


Pharma related spam.
( X,Y )

Andrew Ng
Data augmentation
Augmentation: modifying an existing training example to
create a new training example.

( X,Y )
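As a hedged sketch, here is how an existing training image might be modified to create new examples with TensorFlow; the specific distortions (flip, brightness, contrast) are illustrative assumptions, not the course's exact recipe.

import tensorflow as tf

def augment(image, label):
    # Each call returns a slightly modified copy of an existing training example
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image, label

# Example: expand a (hypothetical) labeled image dataset with augmented copies
images = tf.random.uniform((8, 28, 28, 1))          # placeholder images
labels = tf.constant([0, 1, 1, 0, 1, 0, 0, 1])      # placeholder labels
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).map(augment)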

Andrew Ng
Data augmentation by introducing distortions

Andrew Ng
Data augmentation for speech
Speech recognition example

Original audio (voice search: ”What is today’s weather?”)

+ Noisy background: Crowd

+ Noisy background: Car

+ Audio on bad cellphone connection

Andrew Ng
Data augmentation by introducing distortions
The distortion introduced should be representative of the type of
noise/distortions in the test set.
Audio:
Background noise,
bad cellphone connection

Usually does not help to add purely random/meaningless


noise to your data.

𝑥𝑖 =intensity (brightness) of pixel 𝑖


𝑥𝑖 ← 𝑥𝑖 + random noise
[Adam Coates and Tao Wang]

Andrew Ng
Data synthesis
Synthesis: using artificial data inputs to create a new
training example.

Andrew Ng
Artificial data synthesis for photo OCR

[http://www.publicdomainpictures.net/view-image.php?image=5745&picture=times-square]

Andrew Ng
Artificial data synthesis for photo OCR

Abcdefg
Abcdefg
Abcdefg
Abcdefg
Abcdefg
Real data
[Adam Coates and Tao Wang]

Andrew Ng
Artificial data synthesis for photo OCR

Real data Synthetic data


[Adam Coates and Tao Wang]

Andrew Ng
Engineering the data used by your system

Conventional
model-centric AI= Code + Data
approach: (algorithm/model)

Work on this

Data-centric
approach: AI= Code + Data
(algorithm/model)

Work on this

Andrew Ng
Machine learning
development process

Transfer learning: using data


from a different task
Transfer learning
Supervised pretraining: train on a large dataset (e.g., 1 million images of cats,
dogs, cars, people, … with 1,000 classes):
x  →  W[1],b[1]  →  W[2],b[2]  →  W[3],b[3]  →  W[4],b[4]  →  W[5],b[5]  (1,000 output units)

Fine tuning: copy W[1],b[1] … W[4],b[4], replace the output layer with a new
W[5],b[5] that has 10 output units (digits 0–9), then train on your own data.

Option 1: only train the output layer’s parameters.
Option 2: train all parameters.

Andrew Ng
Why does transfer learning work?

x use the same input type

detects detects detects


edges corners curves/basic shapes

Edges Corners Curves / basic shapes

Andrew Ng
Transfer learning summary
1. Download neural network parameters pretrained on a large dataset (e.g. 1M examples)
   with the same input type (e.g., images, audio, text) as your application
   (or train your own).
2. Further train (fine tune) the network on your own, much smaller dataset
   (e.g. 1,000 or even 50 examples).
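As a hedged illustration of these two steps in Keras — the pretrained base network, image size, and class count are placeholder assumptions, not the course's prescribed choices:

import tensorflow as tf

# Step 1: download a network pretrained on a large image dataset (ImageNet here).
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,   # drop the original output layer
                                         weights='imagenet')
base.trainable = False        # Option 1: only train the new output layer's parameters
# base.trainable = True       # Option 2: fine tune all parameters

# Step 2: add a new output layer for your own task (say, 10 classes) and train on your data.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(units=10, activation='linear'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(X, Y, epochs=...)   # X, Y: your (smaller) labeled dataset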

Andrew Ng
Machine learning
development process

Full cycle of a
machine learning project
Full cycle of a machine learning project

Scope Collect Train Deploy in


project data model production

Define project Define and Training, error Deploy,


collect data analysis & monitor and
iterative maintain
improvement system

Andrew Ng
Deployment
API call (x)
Inference server Mobile app

(audio clip)
ML model

Inference 𝑦ො

(text transcript)
Software engineering may be needed for:
Ensure reliable and efficient predictions
MLOps
Scaling
Logging machine learning
System monitoring operations
Model updates
Andrew Ng
Machine learning
development process

Fairness, bias, and ethics


Bias
Hiring tool that discriminates against women.

Facial recognition system matching dark skinned


individuals to criminal mugshots.

Biased bank loan approvals.

Toxic effect of reinforcing negative stereotypes.

Andrew Ng
Adverse use cases
Deepfakes

Spreading toxic/incendiary speech through optimizing for


engagement.
Generating fake content for commercial or political purposes.
Using ML to build harmful products, commit fraud etc.
Spam vs anti-spam : fraud vs anti-fraud.

Andrew Ng
Guidelines
Get a diverse team to brainstorm things that might go
wrong, with emphasis on possible harm to vulnerable groups.
Carry out literature search on standards/guidelines for your
industry.
Audit systems against possible harm prior to deployment.
Audit

Develop mitigation plan (if applicable), and after deployment,


monitor for possible harm.

Andrew Ng
Skewed datasets
(optional)

Error metrics for


skewed datasets
Rare disease classification example
Train classifier f_w,b(x)  (y = 1 if disease present, y = 0 otherwise)
Find that you’ve got 1% error on the test set (99% correct diagnoses).

But only 0.5% of patients have the disease, so the trivial program

print("y=0")

gets 99.5% accuracy / 0.5% error — numerically “better” than a learned classifier
with 1% or 1.2% error, even though it never detects the disease.

Andrew Ng
Precision/recall
y = 1 in presence of the rare class we want to detect.

Confusion matrix (predicted class vs. actual class):
                   actual 1              actual 0
predicted 1        true positive: 15     false positive: 5
predicted 0        false negative: 10    true negative: 70
                   (25 actual positive)  (75 actual negative)

Precision: of all patients where we predicted y = 1, what fraction actually have the rare disease?
  precision = true positives / #predicted positive = true pos / (true pos + false pos) = 15 / (15 + 5) = 0.75

Recall: of all patients that actually have the rare disease, what fraction did we correctly detect as having it?
  recall = true positives / #actual positive = true pos / (true pos + false neg) = 15 / (15 + 10) = 0.6

(Note: print("y=0") would have recall = 0.)
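A small sketch of these two formulas in Python, using the confusion-matrix counts from the slide:

# Counts from the example confusion matrix above
true_pos, false_pos = 15, 5
false_neg, true_neg = 10, 70

precision = true_pos / (true_pos + false_pos)   # 15 / 20 = 0.75
recall    = true_pos / (true_pos + false_neg)   # 15 / 25 = 0.6
print(precision, recall)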

Andrew Ng
Skewed datasets
(optional)

Trading off precision


and recall
Trading off precision and recall
precision = true positives / total predicted positive
recall    = true positives / total actual positive

Logistic regression: 0 < f_w,b(x) < 1
  Predict 1 if f_w,b(x) ≥ threshold (0.5, or 0.7, 0.9, 0.3, …)
  Predict 0 if f_w,b(x) < threshold

Suppose we want to predict y = 1 (rare disease) only if very confident
(e.g. threshold = 0.99):  higher precision, lower recall.

Suppose we want to avoid missing too many cases of rare disease
(when in doubt predict y = 1, e.g. threshold = 0.01):  lower precision, higher recall.

More generally, predict 1 if f_w,b(x) ≥ threshold.
(plot: precision vs. recall traced out as the threshold varies from 1 down to 0)

Andrew Ng
F1 score
How to compare precision/recall numbers?

              Precision (P)   Recall (R)   Average   F1 score
Algorithm 1   0.5             0.4          0.45      0.444
Algorithm 2   0.7             0.1          0.4       0.175
Algorithm 3   0.02            1.0          0.501     0.0392    (≈ print("y=1"))

Average = (P + R) / 2
F1 score = 1 / ( (1/2)(1/P + 1/R) ) = 2PR / (P + R)
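And the F1 formula in code, checked against the three algorithms in the table (values as on the slide):

def f1_score(p, r):
    # Harmonic mean of precision and recall: 2PR / (P + R)
    return 2 * p * r / (p + r)

for p, r in [(0.5, 0.4), (0.7, 0.1), (0.02, 1.0)]:
    print(round(f1_score(p, r), 4))    # 0.4444, 0.175, 0.0392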

Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Decision Trees

Decision Tree Model


Cat classification example
Ear shape (x 1 ) Face shape (x 2 ) Whiskers (x 3 ) Cat
Pointy Round Present 1
Floppy Not round Present 1
Floppy Round Absent 0
Pointy Not round Present 0
Pointy Round Present 1
Pointy Round Absent 1
Floppy Not round Absent 0
Pointy Round Absent 1
Floppy Round Absent 0
Floppy Round Absent 0

Categorical (discrete values)

Andrew Ng
Decision Tree New test example

root node
Ear shape

Pointy Floppy
decision nodes

Ear shape: Pointy


Face Face shape: Round
Whiskers
shape Whiskers: Present

Round Not round Present Absent

Cat Not cat Cat Not cat

leaf nodes

Andrew Ng
Decision Tree
Fac e Fac e
E ar E ar s hape s hape
s hape s hape

Round N ot round Round N ot round


P ointy Floppy P ointy Floppy

C at N ot c at Ear
Not Cat
shape
Whiskers N ot c at Not cat Fac e
s hape

P ointy Floppy
P res ent A bs ent N ot round
Round

C at N ot c at Whis kers Not Cat


Cat Not Cat

P res ent A bs ent

Cat Not Cat

Andrew Ng
Decision Trees

Learning Process
Decision Tree Learning
?

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
shape

Round Not round

4/4 cats

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
shape

Round Not round

Cat

0/1 cats

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
shape

Round Not round

Cat Not cat

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
?
shape

Round Not round

Cat Not cat

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
Whiskers
shape

Round Not round

Cat Not cat

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
Whiskers
shape

Round Not round Present Absent

Cat Not cat

1/1 cats 0/4 cats

Andrew Ng
Decision Tree Learning
Ear shape

Pointy Floppy

Face
Whiskers
shape

Round Not round Present Absent

Cat Not cat Cat Not Cat

Andrew Ng
Decision Tree Learning
Decision 1: How to choose what feature to split on at each
node?
Maximize purity (or minimize impurity)

Andrew Ng
Decision Tree Learning
C at

Decision 1: How to choose what feature to split on at each DNA

node? Y es No

Maximize purity (or minimize impurity)

E ar Fac e Whis kers


s hape Shape

Round N ot round P res ent 5 /5 c ats 0 /5 c ats


P ointy Floppy A bs ent

4 /5 c ats 1 /5 c ats 4 /7 c ats 1 /3 c ats 3 /4 c ats 2 /6 c ats

Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting? Depth 0

• When a node is 100% one class


• When splitting a node will result in the tree exceeding
a maximum depth Depth 1

Depth 2

Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting? Depth 0

• When a node is 100% one class


• When splitting a node will result in the tree exceeding
a maximum depth Depth 1

Depth 2

Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting?

• When a node is 100% one class


• When splitting a node will result in the tree exceeding
a maximum depth Fac e
Shape
• When improvements in purity score are below a
threshold Round N ot round

• When number of examples in a node is below a


threshold

4 /7 c ats 1 /3 c ats

Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting?

• When a node is 100% one class


• When splitting a node will result in the tree exceeding
a maximum depth Fac e
Shape
• When improvements in purity score are below a
threshold Round N ot round

• When number of examples in a node is below a


Not cat
threshold

Andrew Ng
Decision Tree Learning

Measuring purity
Entropy as a measure of impurity
p1 = fraction of examples that are cats

0 cats, 6 dogs:    p1 = 0      H(p1) = 0
2 cats, 4 dogs:    p1 = 2/6    H(p1) = 0.92
3 cats, 3 dogs:    p1 = 3/6    H(p1) = 1
5 cats, 1 dog:     p1 = 5/6    H(p1) = 0.65
6 cats, 0 dogs:    p1 = 6/6    H(p1) = 0

(plot: H(p1) vs. p1 — zero at p1 = 0 and p1 = 1, maximum of 1 at p1 = 0.5)

Andrew Ng
Entropy as a measure of impurity
p1 = fraction of examples that are cats
p0 = 1 - p1

H(p1) = -p1 log2(p1) - p0 log2(p0)
      = -p1 log2(p1) - (1 - p1) log2(1 - p1)

Note: “0 log(0)” = 0

(plot: H(p1) vs. p1)
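A direct translation of this entropy formula into Python, with the "0 log(0) = 0" convention handled explicitly:

import numpy as np

def entropy(p1):
    # H(p1) = -p1 log2(p1) - (1 - p1) log2(1 - p1), with 0 log(0) treated as 0
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(round(entropy(3/6), 2))   # 1.0  (maximally impure node)
print(round(entropy(5/6), 2))   # 0.65
print(round(entropy(6/6), 2))   # 0.0  (pure node)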

Andrew Ng
Decision Tree Learning

Choosing a split: Information


Gain
Choosing a split
Root node: p1 = 5/10 = 0.5,  H(0.5) = 1

Split on Ear shape:               Split on Face shape:               Split on Whiskers:
  Pointy: p1 = 4/5 = 0.8            Round: p1 = 4/7 = 0.57             Present: p1 = 3/4 = 0.75
          H(0.8) = 0.72                    H(0.57) = 0.99                       H(0.75) = 0.81
  Floppy: p1 = 1/5 = 0.2            Not round: p1 = 1/3 = 0.33         Absent: p1 = 2/6 = 0.33
          H(0.2) = 0.72                    H(0.33) = 0.92                       H(0.33) = 0.92

Information gain:
  H(0.5) - ( 5/10 H(0.8) + 5/10 H(0.2) )   = 0.28
  H(0.5) - ( 7/10 H(0.57) + 3/10 H(0.33) ) = 0.03
  H(0.5) - ( 4/10 H(0.75) + 6/10 H(0.33) ) = 0.12

Andrew Ng
Information Gain
p1_root = 5/10 = 0.5

Split on Ear shape:
  left (Pointy):   p1_left = 4/5,    w_left = 5/10
  right (Floppy):  p1_right = 1/5,   w_right = 5/10

Information gain
  = H(p1_root) - ( w_left · H(p1_left) + w_right · H(p1_right) )
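The information-gain formula written out in Python (reusing the entropy function sketched on the previous slide; the numbers are the ear-shape split above):

import numpy as np

def entropy(p1):
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(p1_root, p1_left, w_left, p1_right, w_right):
    # H(p1_root) - (w_left * H(p1_left) + w_right * H(p1_right))
    return entropy(p1_root) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

# Ear shape split from the slide: gain ≈ 0.28
print(round(information_gain(5/10, 4/5, 5/10, 1/5, 5/10), 2))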

Andrew Ng
Decision Tree Learning

Putting it together
Decision Tree Learning
• Start with all examples at the root node
• Calculate information gain for all possible features, and pick the one with
the highest information gain
• Split dataset according to selected feature, and create left and right
branches of the tree
• Keep repeating splitting process until stopping criteria is met:
• When a node is 100% one class
• When splitting a node will result in the tree exceeding a maximum
depth
• Information gain from additional splits is less than threshold
• When number of examples in a node is below a threshold

Andrew Ng
Recursive splitting
?

Andrew Ng
Recursive splitting
E ar
s hape

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape

Round N ot Round

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e ?
s hape

Round N ot Round

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat Cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat Cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e Whis kers


s hape

A bs ent
Round N ot Round P res ent

Cat Not cat Cat Not cat

Andrew Ng
Recursive splitting
E ar
s hape

P ointy Floppy

Fac e
s hape
Whis kers
Recursive algorithm

A bs ent
Round N ot Round P res ent

Cat Not cat Cat Not cat

Andrew Ng
Decision Tree Learning

Using one-hot encoding of


categorical features
Features with three possible values
Ear shape (𝑥1) Face shape (𝑥 2 ) Whiskers (𝑥3 ) Cat (𝑦)

Pointy Round Present 1

Oval Not round Present 1


Oval Round Absent 0
Ear
Pointy Not round Present 0 shape
Oval Round Present 1 Pointy

Pointy Round Absent 1 Floppy Oval

Floppy Not round Absent 0


Oval Round Absent 1

Floppy Round Absent 0


Floppy Round Absent 0

3 possible values

Andrew Ng
One hot encoding
Ear shape Pointy ears Floppy ears Oval ears Face shape Whiskers Cat

Pointy Round Present 1

Oval Not round Present 1


Oval 0 0 1 Round Absent 0

Pointy 1 0 0 Not round Present 0


Oval 0 0 1 Round Present 1
Pointy 1 0 0 Round Absent 1
Floppy 0 1 0 Not round Absent 0
Oval 0 0 1 Round Absent 1

Floppy 0 1 0 Round Absent 0


Floppy 0 1 0 Round Absent 0

Andrew Ng
One hot encoding

If a categorical feature can take on 𝑘 values,


create 𝑘 binary features (0 or 1 valued).

Andrew Ng
One hot encoding
Ear shape Pointy ears Floppy ears Oval ears Face shape Whiskers Cat

Pointy 1 0 0 Round Present 1

Oval 0 0 1 Not round Present 1


Oval 0 0 1 Round Absent 0

Pointy 1 0 0 Not round Present 0


Oval 0 0 1 Round Present 1
Pointy 1 0 0 Round Absent 1
Floppy 0 1 0 Not round Absent 0
Oval 0 0 1 Round Absent 1

Floppy 0 1 0 Round Absent 0


Floppy 0 1 0 Round Absent 0

Andrew Ng
One hot encoding and neural networks
Pointy ears Floppy ears Round ears Face shape Whiskers Cat

1 0 0 Round Present 1

0 0 1 Not round Present 1


0 0 1 Round Absent 0

1 0 0 Not round Present 1 0


0 0 1 Round 1 Present 1 1
1 0 0 Round 1 Absent 0 1
0 1 0 Not round 0 Absent 0 1
0 0 1 Round 1 Absent 0 1

0 1 0 Round 1 Absent 0 1
0 1 0 Round 1 Absent 0 1

Andrew Ng
Decision Tree Learning

Continuous valued features


Continuous features
Ear shape Face shape Whiskers Weight (lbs.) Cat
Pointy Round Present 7.2 1
Floppy Not round Present 8.8 1
Floppy Round Absent 15 0

Pointy Not round Present 9.2 0

Pointy Round Present 8.4 1


Pointy Round Absent 7.6 1
Floppy Not round Absent 11 0
Pointy Round Absent 10.2 1
Floppy Round Absent 18 0
Floppy Round Absent 20 0

Andrew Ng
Splitting on a continuous variable
(plot: Weight (lbs.) on the x-axis, Cat / Not Cat on the y-axis; candidate
thresholds such as Weight ≤ 8 lbs. and Weight ≤ 9 lbs.)

Information gain for candidate thresholds (root: H(0.5), 5/10 cats):
  Weight ≤ 8:         H(0.5) - ( 2/10 · H(2/2) + 8/10 · H(3/8) ) = 0.24
  Weight ≤ 9:         H(0.5) - ( 4/10 · H(4/4) + 6/10 · H(1/6) ) = 0.61
  another threshold:  H(0.5) - ( 7/10 · H(5/7) + 3/10 · H(0/3) ) = 0.40
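A hedged sketch of how one might search over candidate thresholds for a continuous feature; the weights and labels are the ten examples from the table, and trying midpoints between sorted values is one common choice, not necessarily the course's exact procedure.

import numpy as np

def entropy(p1):
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

weights = np.array([7.2, 8.8, 15, 9.2, 8.4, 7.6, 11, 10.2, 18, 20])
is_cat  = np.array([1,   1,   0,  0,   1,   1,   0,  1,    0,  0])

def split_gain(threshold):
    left, right = is_cat[weights <= threshold], is_cat[weights > threshold]
    w_left, w_right = len(left) / len(is_cat), len(right) / len(is_cat)
    return entropy(is_cat.mean()) - (w_left * entropy(left.mean()) + w_right * entropy(right.mean()))

# Try midpoints between consecutive sorted values and keep the best threshold
candidates = (np.sort(weights)[:-1] + np.sort(weights)[1:]) / 2
best = max(candidates, key=split_gain)
print(best, round(split_gain(best), 2))   # a threshold near 9 lbs gives gain ≈ 0.61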

Andrew Ng
Decision Tree Learning

Regression Trees (optional)


Regression with Decision Trees: Predicting a number
Ear shape Face shape Whiskers Weight (lbs.)
Pointy Round Present 7.2

Floppy Not round Present 8.8


Floppy Round Absent 15

Pointy Not round Present 9.2


Pointy Round Present 8.4
Pointy Round Absent 7.6
Floppy Not round Absent 11
Pointy Round Absent 10.2

Floppy Round Absent 18


Floppy Round Absent 20

Andrew Ng
Regression with Decision Trees

E ar
s hape

P ointy
Floppy

Fac e Fac e
s hape Shape

Round N ot round
Round N ot Round

Weights(lbs.): Weights(lbs.): Weights (lbs.): Weights(lbs.):


7.2, 8.4, 7.6, 10.2 9.2 15, 18, 20 8.8,11

Andrew Ng
Choosing a split
Variance at root node: 20.51  (weights: 7.2, 8.8, 15, 9.2, 8.4, 7.6, 11, 10.2, 18, 20)

Split on Ear shape:
  Pointy: weights 7.2, 9.2, 8.4, 7.6, 10.2    variance 1.47     w_left = 5/10
  Floppy: weights 8.8, 15, 11, 18, 20         variance 21.87    w_right = 5/10
  reduction: 20.51 - ( 5/10 · 1.47 + 5/10 · 21.87 ) = 8.84

Split on Face shape:
  Round: weights 7.2, 15, 8.4, 7.6, 10.2, 18, 20    variance 27.80    w_left = 7/10
  Not round: weights 8.8, 9.2, 11                   variance 1.37     w_right = 3/10
  reduction: 20.51 - ( 7/10 · 27.80 + 3/10 · 1.37 ) = 0.64

Split on Whiskers:
  Present: weights 7.2, 8.8, 9.2, 8.4            variance 0.75     w_left = 4/10
  Absent: weights 15, 7.6, 11, 10.2, 18, 20      variance 23.32    w_right = 6/10
  reduction: 20.51 - ( 4/10 · 0.75 + 6/10 · 23.32 ) = 6.22
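A short sketch of the variance-reduction computation for the Ear shape split above; the slide's numbers correspond to the sample variance, i.e. ddof=1 in NumPy.

import numpy as np

root   = np.array([7.2, 8.8, 15, 9.2, 8.4, 7.6, 11, 10.2, 18, 20])
pointy = np.array([7.2, 9.2, 8.4, 7.6, 10.2])    # left branch
floppy = np.array([8.8, 15, 11, 18, 20])         # right branch

w_left, w_right = len(pointy) / len(root), len(floppy) / len(root)
reduction = root.var(ddof=1) - (w_left * pointy.var(ddof=1) + w_right * floppy.var(ddof=1))
print(round(reduction, 2))   # ≈ 8.84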

Andrew Ng
Tree ensembles

Using multiple decision trees


Trees are highly sensitive to small
changes of the data

E ar
s hape Whis kers

P ointy Floppy P res ent A bs ent

Andrew Ng
Tree ensemble New test example

Whis kers E ar Fac e


s hape s hape
Ear shape: Pointy
P res ent A bs ent Floppy Face shape: Not Round
Round N ot Round
P ointy
Whiskers: Present

Ear
shape N ot c at Cat
Face Whis kers
shape Whiskers

P ointy Floppy
A bs ent P res ent A bs ent
Round N ot round P res ent

C at N ot c at
Not cat Not Cat Not Cat Cat Not Cat
Cat

Prediction: Cat Prediction: Not cat Prediction: Cat

Final prediction: Cat

Andrew Ng
Tree ensembles

Sampling with replacement


Sampling with replacement
Tokens

Sampling with replacement:

Andrew Ng
Sampling with replacement
Ear shape Face shape Whiskers Cat
Pointy Round Present 1

Floppy Not round Absent 0


Pointy Round Absent 1

Pointy Not round Present 0


Floppy Not round Absent 0
Pointy Round Absent 1
Pointy Round Present 1
Floppy Not round Present 1

Floppy Round Absent 0


Pointy Round Absent 1

Andrew Ng
Tree ensembles

Random forest algorithm


Generating a tree sample
Given training set of size 𝑚

For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
Train a decision tree on the new dataset

Whiskers Ear Face Ear shape


Ear Face shape shape Whiskers Cat
shape shape Whiskers Cat Absent Floppy
Present Pointy
Pointy Round Present Yes
Pointy Round Present Yes Pointy Round Absent Yes
Floppy
Floppy
Round
Round
Absent
Absent
No
No Ear shape Not cat
Floppy
Floppy
Not Round
Not Round
Absent
Absent
No
No
Face
shape Not cat

Pointy Round Present Yes Pointy Round Absent Yes
Pointy Not Round Present Yes Floppy Round Absent No
Floppy Round Absent No Round Not round Not round
Floppy Round Absent No Round
Floppy Round Present Yes Floppy Round Absent No
Pointy Not Round Absent No Pointy Not Round Absent No
Cat Not cat Cat Not cat
Pointy Not Round Absent No Pointy Round Present Yes
Pointy Not Round Present Yes

Bagged decision tree

Andrew Ng
Randomizing the feature choice

At each node, when choosing a feature to use to split,


if 𝑛 features are available, pick a random subset of
𝑘 < 𝑛 features and allow the algorithm to only choose
from that subset of features.

Random forest algorithm
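A hedged sketch using scikit-learn's random forest, which implements both ideas (bootstrap sampling with replacement, and a random subset of features considered at each split); the feature matrix and parameter values here are placeholders, not the course's reference implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical binary features (ear shape, face shape, whiskers) and cat/not-cat labels
X = np.random.randint(0, 2, size=(10, 3))
y = np.random.randint(0, 2, size=10)

model = RandomForestClassifier(
    n_estimators=100,      # B: number of trees, each trained on a bootstrap sample
    max_features='sqrt',   # k ≈ sqrt(n): random subset of features per split
    bootstrap=True)        # sampling with replacement
model.fit(X, y)
print(model.predict(X[:2]))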

Andrew Ng
Tree ensembles

XGBoost
Boosted trees intuition
Given training set of size 𝑚

For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
But instead of picking from all examples with equal (1/m) probability, make it
more likely to pick examples that the previously trained trees misclassify
Train a decision tree on the new dataset
Whiskers Ear Face
Ear Face shape shape Whiskers Prediction
shape shape Whiskers Cat Absent
Present
Pointy Round Present Cat ✅
Pointy Round Present Yes
Floppy Not Round Present Not cat ❌
Floppy Round Absent No
Ear shape Floppy Round Absent Not cat ✅
Floppy Round Absent No Not cat
Pointy Not Round Present Not cat ✅
Pointy Round Present Yes
Pointy Round Present Cat ✅
Pointy Not Round Present Yes
Not round Pointy Round Absent Not cat
Floppy Round Absent No Round ❌
Floppy Not Round Absent Not cat
Floppy Round Present Yes ✅
Pointy Round Absent Not cat ❌
Pointy Not Round Absent No
Cat Not cat Floppy Round Absent Not cat ✅
Pointy Not Round Absent No
Floppy Round Absent Not cat ✅
Pointy Not Round Present Yes

Andrew Ng
XGBoost (eXtreme Gradient Boosting)
• Open source implementation of boosted trees
• Fast efficient implementation
• Good choice of default splitting criteria and criteria for when
to stop splitting
• Built in regularization to prevent overfitting
• Highly competitive algorithm for machine learning
competitions (eg: Kaggle competitions)

Andrew Ng
Using XGBoost
Classification Regression

from xgboost import XGBClassifier from xgboost import XGBRegressor

model = XGBClassifier() model = XGBRegressor()

model.fit(X_train, y_train) model.fit(X_train, y_train)


y_pred = model.predict(X_test) y_pred = model.predict(X_test)

Andrew Ng
Conclusion

When to use decision trees


Decision Trees vs Neural Networks
Decision Trees and Tree ensembles
• Works well on tabular (structured) data
• Not recommended for unstructured data (images, audio, text)
• Fast
Choose architecture
(model, data, etc.)

Diagnostics
(bias, variance and error
analysis)
Train
model

Andrew Ng
Decision Trees vs Neural Networks
Decision Trees and Tree ensembles
• Works well on tabular (structured) data
• Not recommended for unstructured data (images, audio, text)
• Fast
• Small decision trees may be human interpretable
Neural Networks
• Works well on all types of data, including tabular (structured) and
unstructured data
• May be slower than a decision tree
• Works with transfer learning
• When building a system of multiple models working together, it
might be easier to string together multiple neural networks

Andrew Ng
