DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Welcome!
Advanced learning algorithms
Neural Networks
inference (prediction)
training
Practical advice for building machine learning systems
Decision Trees
Andrew Ng
Neural Networks Intuition
Andrew Ng
Neurons in the brain
Andrew Ng
Why Now?
large neural network
traditional AI
amount of data
Andrew Ng
Neural Network Intuition
Demand Prediction
Demand Prediction
input: x = price
output: a = f(x) = 1 / (1 + e^(-(wx + b))) = probability of being a top seller (yes/no)
x → [neuron] → a
Andrew Ng
Demand Prediction
inputs (price, shipping cost, marketing, material) → hidden layer → output layer → probability of being a top seller
Andrew Ng
Multiple hidden layers
x (input layer) → hidden layer → hidden layer → ... → output layer → a
Example:
Recognizing Images
Face recognition
x=
Andrew Ng
Face recognition
x (input image) → hidden layers → output layer → probability of being person 'XYZ'
source: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations
by Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng
Andrew Ng
Car classification
x (input image) → hidden layers → output layer → probability of a car being detected
source: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations
by Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng
Andrew Ng
Neural network model
Andrew Ng
Neural network layer
g(z) = 1 / (1 + e^(-z))
Layer 1 has 3 units.  Input x (layer 0) → layer 1 → a^[1], the vector of activation values from layer 1:
a_1^[1] = g(w_1^[1] · x + b_1^[1])
a_2^[1] = g(w_2^[1] · x + b_2^[1])
a_3^[1] = g(w_3^[1] · x + b_3^[1])
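A minimal NumPy sketch of this layer-1 computation (the parameter values here are made-up, illustrative numbers, not from the slides):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([200.0, 17.0])           # input features
w1, b1 = np.array([0.1, 0.2]), -1.0   # assumed parameters of unit 1
w2, b2 = np.array([-0.3, 0.4]), 1.0   # assumed parameters of unit 2
w3, b3 = np.array([0.5, -0.6]), 2.0   # assumed parameters of unit 3

a1 = np.array([
    sigmoid(np.dot(w1, x) + b1),
    sigmoid(np.dot(w2, x) + b2),
    sigmoid(np.dot(w3, x) + b3),
])                                    # a^[1], the vector of layer-1 activations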
Andrew Ng
Neural network layer
g(z) = 1 / (1 + e^(-z))
Layer 2 (here the output layer) has 1 unit; it takes a^[1] as input and produces
a^[2] = g(w_1^[2] · a^[1] + b_1^[2])
Andrew Ng
Neural network layer
x → layer 1 → a^[1] → layer 2 → a^[2]
To predict category 1 or 0 (yes/no): is a^[2] ≥ 0.5?  If yes, ŷ = 1; otherwise ŷ = 0.
Andrew Ng
Neural Network Model
layer 0
Andrew Ng
More complex neural network
x (input) → a^[1] → a^[2] → a^[3] → a^[4]
Layer 3 takes a^[2] as input and computes
a_1^[3] = g(w_1^[3] · a^[2] + b_1^[3])
a_2^[3] = g(w_2^[3] · a^[2] + b_2^[3])
a_3^[3] = g(w_3^[3] · a^[2] + b_3^[3])
Andrew Ng
Notation
x (input) → layer 1 → layer 2 → layer 3 → layer 4, producing a^[1], a^[2], a^[3], a^[4]
Layer 3 example:
a_1^[3] = g(w_1^[3] · a^[2] + b_1^[3]),  a_2^[3] = g(w_2^[3] · a^[2] + b_2^[3]),  a_3^[3] = g(w_3^[3] · a^[2] + b_3^[3])
General notation — activation value of layer l, unit (neuron) j:
a_j^[l] = g( w_j^[l] · a^[l-1] + b_j^[l] )
where a^[l-1] is the output of layer l-1 (the previous layer), w_j^[l] and b_j^[l] are the parameters w and b of layer l, unit j, and g is the sigmoid (the activation function).
Andrew Ng
Neural Network Model
label 0 1
Andrew Ng
Handwritten digit recognition
Digit images
label 0 1
Andrew Ng
Handwritten digit recognition
forward propagation
Andrew Ng
TensorFlow implementation
Inference in Code
Coffee roasting
x = [Temperature (Celsius), Duration (minutes)] → a^[1] → a^[2]
Is a_1^[2] ≥ 0.5?  If yes, ŷ = 1; otherwise ŷ = 0.
Andrew Ng
Build the model using TensorFlow
x a[1] a[2]
x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)
Andrew Ng
Build the model using TensorFlow
x → a^[1] → a^[2];  is a_1^[2] ≥ 0.5? → ŷ = 1 or ŷ = 0
x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=1, activation='sigmoid')
a2 = layer_2(a1)
if a2 >= 0.5:
    yhat = 1
else:
    yhat = 0
Andrew Ng
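For reference, a self-contained sketch of this inference code with the imports it needs (the layer parameters are randomly initialized here, so the prediction is only illustrative until the layers are trained or their weights are set):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense

x = np.array([[200.0, 17.0]])                   # one example: temperature, duration
layer_1 = Dense(units=3, activation='sigmoid')  # hidden layer, 3 units
layer_2 = Dense(units=1, activation='sigmoid')  # output layer, 1 unit
a1 = layer_1(x)                                 # a^[1], shape (1, 3)
a2 = layer_2(a1)                                # a^[2], shape (1, 1)
yhat = 1 if a2.numpy()[0, 0] >= 0.5 else 0      # threshold the probability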
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
x → a^[1] (25 units) → a^[2] (15 units) → a^[3] (1 unit)
Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=15, activation='sigmoid')
a2 = layer_2(a1)
x → a^[1] (25 units) → a^[2] (15 units) → a^[3] (1 unit)
Andrew Ng
Model for digit classification
x = np.array([[0.0,...245,...240...0]])
layer_1 = Dense(units=25, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=15, activation='sigmoid')
a2 = layer_2(a1)
layer_3 = Dense(units=1, activation='sigmoid')
a3 = layer_3(a2)
x → a^[1] (25 units) → a^[2] (15 units) → a^[3] (1 unit)
Andrew Ng
TensorFlow implementation
Data in TensorFlow
Feature vectors
temperature (Celsius) | duration (minutes) | Good coffee? (1/0)
200.0                 | 17.0               | 1
425.0                 | 18.5               | 0
...                   | ...                | ...
x = np.array([[200.0, 17.0]])    # [[200.0, 17.0]] — one example as a 1x2 row
Andrew Ng
Note about numpy arrays
x = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2x3 matrix

x = np.array([[0.1, 0.2],
              [-3.0, -4.0],
              [-0.5, -0.6],
              [7.0, 8.0]])     # 4x2 matrix
Andrew Ng
Note about numpy arrays
x = np.array([[200, 17]])    # 1x2 matrix (2D array with one row)
x = np.array([200, 17])      # 1D array (just a vector, no rows or columns)
Andrew Ng
Activation vector
x a[1] a[2]
x = np.array([[200.0, 17.0]])
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)
# a1 is a 1x3 matrix (a TensorFlow tensor), e.g.:
# tf.Tensor([[0.2 0.7 0.3]], shape=(1, 3), dtype=float32)
a1.numpy()
# array([[0.2, 0.7, 0.3]], dtype=float32)   — converts the tensor back to a NumPy array
Andrew Ng
Activation vector
x a[1] a[2]
a2.numpy()
array([[0.8]], dtype=float32)
Andrew Ng
TensorFlow implementation
Andrew Ng
Building a neural network architecture
x a[1] a[2] layer_1 = Dense(units=3, activation="sigmoid")
layer_2 = Dense(units=1, activation="sigmoid")
model = Sequential([layer_1, layer_2])
x = np.array([[200.0, 17.0],
              [120.0, 5.0],
              [425.0, 20.0],
              [212.0, 18.0]])    # 4x2: four examples with labels 1, 0, 0, 1
y = np.array([1, 0, 0, 1])
model.compile(...)
model.fit(x, y)
model.predict(x_new)
Andrew Ng
Digit classification model
layer_1 = Dense(units=25, activation="sigmoid")
layer_2 = Dense(units=15, activation="sigmoid")
layer_3 = Dense(units=1, activation="sigmoid")
x → a^[1] (25 units) → a^[2] (15 units) → a^[3] (1 unit)
model = Sequential([layer_1, layer_2, layer_3])
model.compile(...)
x = np.array([[0..., 245, ..., 17],
              [0..., 200, ..., 184]])
y = np.array([1, 0])
model.fit(x,y)
model.predict(x_new)
Andrew Ng
Digit classification model
model = Sequential([
Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=1, activation="sigmoid")])
model.compile(...)
x = np.array([[0..., 245, ..., 17],
              [0..., 200, ..., 184]])
y = np.array([1, 0])
x → a^[1] (25 units) → a^[2] (15 units) → a^[3] (1 unit)
model.fit(x,y)
model.predict(x_new)
Andrew Ng
Neural network implementation
in Python
Andrew Ng
Neural network implementation
in Python
General implementation of
forward propagation
Forward prop in NumPy
w_1^[1] = [1, 2],  w_2^[1] = [-3, 4],  w_3^[1] = [5, -6]
b_1^[1] = -1,  b_2^[1] = 1,  b_3^[1] = 2,  a^[0] = x

W = np.array([[1, -3, 5],
              [2, 4, -6]])
b = np.array([-1, 1, 2])
a_in = np.array([-2, 4])

def dense(a_in, W, b, g):
    units = W.shape[1]          # one unit per column of W
    a_out = np.zeros(units)
    for j in range(units):
        w = W[:, j]
        z = np.dot(w, a_in) + b[j]
        a_out[j] = g(z)
    return a_out

def sequential(x):
    a1 = dense(x, W1, b1, g)
    a2 = dense(a1, W2, b2, g)
    a3 = dense(a2, W3, b3, g)
    a4 = dense(a3, W4, b4, g)
    f_x = a4
    return f_x
Andrew Ng
Speculations on artificial general
intelligence (AGI)
ANI AGI
(artificial narrow intelligence) (artificial general intelligence)
E.g., smart speaker, Do anything a human can do
self-driving car, web search,
AI in farming and factories
Andrew Ng
Biological neuron Simplified mathematical
model of a neuron
inputs outputs inputs outputs
Andrew Ng
Neural network and the brain
Can we mimic the human brain?
vs x 𝑓(x)
Andrew Ng
The "one learning algorithm" hypothesis
Auditory cortex
Andrew Ng
The "one learning algorithm" hypothesis
Somatosensory cortex
Andrew Ng
Sensor representations in the brain
Andrew Ng
Vectorization
(optional)
[1,0,1]
Andrew Ng
Vectorization
(optional)
Matrix multiplication
Dot products
Andrew Ng
Vector matrix multiplication
a = [1, 2]^T,  a^T = [1  2]
W = [w1  w2] = [[3, 5],
                [4, 6]]
Z = a^T W = [a^T · w1   a^T · w2]
          = [(1·3 + 2·4)   (1·5 + 2·6)]
          = [11  17]
Andrew Ng
Matrix matrix multiplication
A = [[1, -1],
     [2, -2]]
A^T = [[1, 2],
       [-1, -2]]      (rows: a1^T = [1  2], a2^T = [-1  -2])
W = [[3, 5],
     [4, 6]]          (columns: w1 = [3, 4], w2 = [5, 6])
Z = A^T W = [[a1^T · w1,  a1^T · w2],
             [a2^T · w1,  a2^T · w2]]
          = [[ 11,  17],
             [-11, -17]]
Andrew Ng
Vectorization
(optional)
Andrew Ng
Matrix multiplication rules
A = [[1, -1, 0.1],
     [2, -2, 0.2]]
A^T = [[1, 2],
       [-1, -2],
       [0.1, 0.2]]
W = [[3, 5, 7, 9],
     [4, 6, 8, 0]]
Z = A^T W = [[ 11,  17,  23,  9],
             [-11, -17, -23, -9],
             [1.1, 1.7, 2.3, 0.9]]
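A quick NumPy check of this multiplication, matching the numbers above:

import numpy as np

A = np.array([[1, -1, 0.1],
              [2, -2, 0.2]])
W = np.array([[3, 5, 7, 9],
              [4, 6, 8, 0]])

AT = A.T                  # transpose: 3x2
Z = np.matmul(AT, W)      # (3x2) @ (2x4) -> 3x4
print(Z)
# [[ 11.   17.   23.    9. ]
#  [-11.  -17.  -23.   -9. ]
#  [  1.1   1.7   2.3   0.9]]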
Andrew Ng
Vectorization
(optional)
Andrew Ng
Dense layer vectorized
w_1^[1] = [1, -2],  w_2^[1] = [-3, 4],  w_3^[1] = [5, -6],  b^[1] = [-1, 1, 2]
A^T = [200  17]
Z = A^T W + B = [z_1^[1]  z_2^[1]  z_3^[1]] = [165  -531  900]
A^[1] = g(Z) = [[1, 0, 1]]

AT = np.array([[200, 17]])
W = np.array([[1, -3, 5],
              [-2, 4, -6]])
b = np.array([[-1, 1, 2]])

def dense(AT, W, b, g):
    z = np.matmul(AT, W) + b    # Z = A^T W + B
    a_out = g(z)                # apply the activation g elementwise
    return a_out
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
TensorFlow
implementation
Train a Neural Network in TensorFlow
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
Andrew Ng
Neural Network
Training
Training Details
Model Training Steps
Step 1: specify how to compute the output given input x and parameters w, b (define the model): f_{w,b}(x) = ?
logistic regression:
    z = np.dot(w, x) + b
    f_x = 1 / (1 + np.exp(-z))
neural network:
    model = Sequential([
        Dense(...),
        Dense(...),
        Dense(...)])
Andrew Ng
1. Create the model
define the model: f(x) = ?
Parameters: W^[1], b^[1], W^[2], b^[2], W^[3], b^[3]
x → a^[1] (25 units) → a^[2] (15 units) → a^[3] (1 unit)

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])
Andrew Ng
2. Loss and cost functions
MNIST digit binary classification problem — binary cross-entropy (logistic) loss:
L( f(x), y ) = -y log( f(x) ) - (1 - y) log( 1 - f(x) )
Cost over the training set, as a function of all the parameters W^[1], W^[2], W^[3], b^[1], b^[2], b^[3]:
J(W, B) = (1/m) Σ_{i=1}^{m} L( f(x^(i)), y^(i) )
Andrew Ng
3. Gradient descent
repeat {
    w_j^[l] = w_j^[l] - α ∂/∂w_j^[l] J(w, b)
    b_j^[l] = b_j^[l] - α ∂/∂b_j^[l] J(w, b)
}
(plot: J(w) versus w, descending toward the minimum)
model.fit(X,y,epochs=100)
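Putting the three steps together in TensorFlow — a minimal sketch, with BinaryCrossentropy as the loss matching the binary classification setup above (X and y are assumed to be already-loaded training data):

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

# Step 1: define the model
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])

# Step 2: specify the loss (and hence the cost J)
model.compile(loss=BinaryCrossentropy())

# Step 3: minimize the cost with (a variant of) gradient descent
model.fit(X, y, epochs=100)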
Andrew Ng
Neural network libraries
Use code libraries instead of coding ”from scratch”
Andrew Ng
Activation Functions
Alternatives to the
sigmoid activation
Demand Prediction Example
inputs (price, shipping cost, marketing, material) → hidden units (affordability, awareness, perceived quality)
a_2^[1] = g(w_2^[1] · x + b_2^[1])
Sigmoid: g(z) = 1 / (1 + e^(-z)),  0 < g(z) < 1
ReLU: g(z) = max(0, z)
Andrew Ng
Examples of Activation Functions
a_2^[1] = g(w_2^[1] · x + b_2^[1])
Linear activation function: g(z) = z
Sigmoid: g(z) = 1 / (1 + e^(-z)),  0 < g(z) < 1
ReLU: g(z) = max(0, z)
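A minimal NumPy sketch of these three activation functions:

import numpy as np

def linear(z):
    return z                      # "no activation function": g(z) = z

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # squashes z into (0, 1)

def relu(z):
    return np.maximum(0, z)       # ReLU: g(z) = max(0, z)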
Andrew Ng
Activation Functions
Choosing activation
functions
Output Layer
x → a^[1] → a^[2] → a^[3] = f(x).  Choosing g(z) for the output layer:
sigmoid (binary classification, y = 0/1), linear (regression, y can be negative or positive), ReLU (regression, y ≥ 0).
Andrew Ng
Hidden Layer
x → a^[1] → a^[2] → a^[3] = f(x).  Choosing g(z) for the hidden layers: ReLU is the most common choice today.
Andrew Ng
Choosing Activation Summary
x → a^[1] → a^[2] → a^[3] = f(x)
Output layer: activation='sigmoid' (binary classification), activation='linear' (regression), activation='relu' (regression with y ≥ 0)
Hidden layers: ReLU, activation='relu'
Andrew Ng
Activation Functions
Why do we need
activation functions?
Why do we need activation functions?
inputs (price, shipping cost, marketing, material) → hidden units (affordability, awareness, perceived quality) → top seller?
With a linear activation g(z) = z in every layer, the whole network x → a^[1] → a^[2] can compute nothing more than a linear function f(x) = w · x + b — no better than plain linear regression.
Andrew Ng
Linear Example
With g(z) = z (linear activation everywhere):
a^[1] = w_1^[1] x + b_1^[1]
a^[2] = w_1^[2] a^[1] + b_1^[2]
     = w_1^[2] (w_1^[1] x + b_1^[1]) + b_1^[2]
     = (w_1^[2] w_1^[1]) x + (w_1^[2] b_1^[1] + b_1^[2])
     = w x + b          (just a linear function of x)
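A tiny numeric check of this collapse (the parameter values are made up for illustration):

w1, b1 = 2.0, -1.0     # layer 1 parameters (assumed values)
w2, b2 = 3.0, 0.5      # layer 2 parameters (assumed values)

x = 4.0
a1 = w1 * x + b1                 # layer 1 output
a2 = w2 * a1 + b2                # layer 2 output

w, b = w2 * w1, w2 * b1 + b2     # equivalent single linear model
assert abs(a2 - (w * x + b)) < 1e-12   # same result: 21.5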
Andrew Ng
Example
x → a^[1] → a^[2] → a^[3] → a^[4]
If all hidden layers use g(z) = z and the output layer uses a sigmoid, then
a^[4] = 1 / (1 + e^(-(w_1^[4] · a^[3] + b_1^[4])))
and the network is equivalent to logistic regression on x.
Andrew Ng
Multiclass
Classification
Multiclass
MNIST example
𝑦=0 1 2 3 4 5 6 7 8 9
𝑦=7
Andrew Ng
Multiclass classification example
(plots: decision regions over features x1, x2 — the model estimates P(y=1|x), P(y=2|x), P(y=3|x), P(y=4|x))
Andrew Ng
Multiclass
Classification
Softmax
Logistic regression (2 possible output values):
z = w · x + b
a = g(z) = 1 / (1 + e^(-z)) = P(y = 1 | x)

Softmax regression (4 possible outputs):
z_1 = w_1 · x + b_1,   a_1 = e^(z_1) / (e^(z_1) + e^(z_2) + e^(z_3) + e^(z_4)) = P(y = 1 | x)
z_2 = w_2 · x + b_2,   a_2 = e^(z_2) / (e^(z_1) + e^(z_2) + e^(z_3) + e^(z_4)) = P(y = 2 | x)
z_3 = w_3 · x + b_3,   a_3 = e^(z_3) / (e^(z_1) + e^(z_2) + e^(z_3) + e^(z_4)) = P(y = 3 | x)
z_4 = w_4 · x + b_4,   a_4 = e^(z_4) / (e^(z_1) + e^(z_2) + e^(z_3) + e^(z_4)) = P(y = 4 | x)

Softmax regression (N possible outputs):
z_j = w_j · x + b_j,   j = 1, ..., N
a_j = e^(z_j) / Σ_{k=1}^{N} e^(z_k) = P(y = j | x)
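A minimal NumPy sketch of the softmax formula (with the usual max-subtraction trick added for numerical stability):

import numpy as np

def softmax(z):
    # z: vector of raw scores z_1 ... z_N
    ez = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow and doesn't change the result
    return ez / ez.sum()         # a_j = e^(z_j) / sum_k e^(z_k)

a = softmax(np.array([2.0, 1.0, 0.1]))
# a sums to 1 and each a_j can be read as P(y = j | x)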
Andrew Ng
Cost
Logistic regression:
z = w · x + b
a_1 = g(z) = 1 / (1 + e^(-z)) = P(y = 1 | x)
a_2 = 1 - a_1 = P(y = 0 | x)
loss = -y log(a_1) - (1 - y) log(1 - a_1)
J(w, b) = average loss

Softmax regression:
a_1 = e^(z_1) / (e^(z_1) + e^(z_2) + ... + e^(z_N)) = P(y = 1 | x)
⋮
a_N = e^(z_N) / (e^(z_1) + e^(z_2) + ... + e^(z_N)) = P(y = N | x)
Crossentropy loss:
loss(a_1, ..., a_N, y) = -log(a_j)  if y = j    (plot of -log(a_j) vs. a_j: the loss is small when a_j is near 1)
Andrew Ng
Multiclass
Classification
Neural network with a softmax output layer
x → a^[1] (25 units) → a^[2] (15 units) → a^[3] (10 units, softmax output)
a^[3] = (a_1^[3], ..., a_10^[3]) = g(z_1^[3], ..., z_10^[3])
Unlike sigmoid (logistic regression) units, each a_j^[3] depends on all of z_1^[3], ..., z_10^[3].
Andrew Ng
specify the model
MNIST with softmax
import tensorflow as tf
f_{w,b}(x) = ?
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

specify loss and cost: L(f_{w,b}(x), y)
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy())

Train on data to minimize J(w, b):
model.fit(X, Y, epochs=100)
Note: better (recommended) version later.
Andrew Ng
Multiclass
Classification
Improved implementation
of softmax
Numerical Roundoff Errors
x = 2 / 10,000
option 1: x = 2.0 / 10000
option 2: x = (1 + 1/10000) - (1 - 1/10000)
Both are mathematically equal, but option 2 accumulates more floating-point roundoff error.
Andrew Ng
Numerical Roundoff Errors
More numerically accurate implementation of logistic loss:
Andrew Ng
More numerically accurate implementation of softmax
Softmax regression:
(a_1, ..., a_10) = g(z_1, ..., z_10)
Loss = L(a, y) = -log(a_1) if y = 1, ..., -log(a_10) if y = 10

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])
model.compile(loss=SparseCategoricalCrossentropy())

More accurate (pass the logits straight to the loss):
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
Andrew Ng
MNIST (more numerically accurate)
model import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear')
])
loss from tensorflow.keras.losses import
SparseCategoricalCrossentropy
model.compile(...,loss=SparseCategoricalCrossentropy(from_logits=True) )
fit model.fit(X,Y,epochs=100)
predict logits = model(X)
f_x = tf.nn.softmax(logits)
Andrew Ng
logistic regression
(more numerically accurate)
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='linear')
])
from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy(from_logits=True))
model.fit(X, Y, epochs=100)
Andrew Ng
Multi-label
Classification
Classification with
multiple outputs
(Optional)
Multi-label Classification
Is there a car?
Is there a bus?
Is there a pedestrian?
Andrew Ng
Multiple classes
car bus pedestrian
Andrew Ng
Additional Neural Network
Concepts
Advanced Optimization
Gradient Descent
w_j = w_j - α ∂/∂w_j J(w, b)
(contour plots of J(w, b) over w_1, w_2, showing gradient descent steps)
Go faster: increase α.   Go slower: decrease α.
Andrew Ng
Adam Algorithm Intuition
Adam: Adaptive Moment estimation
w_1 = w_1 - α_1 ∂/∂w_1 J(w, b)
⋮
w_10 = w_10 - α_10 ∂/∂w_10 J(w, b)
b = b - α_11 ∂/∂b J(w, b)
(Adam uses a separate learning rate α_j for every parameter)
Andrew Ng
Adam Algorithm Intuition
𝑤2 𝐽 w, 𝑏 𝑤2 𝐽 w, 𝑏
𝑤1 𝑤1
If 𝑤𝑗 (or 𝑏) keeps moving If 𝑤𝑗 (or 𝑏) keeps oscillating,
in same direction, reduce 𝛼𝑗 .
increase 𝛼𝑗 .
Andrew Ng
model
MNIST Adam
model = Sequential([
    tf.keras.layers.Dense(units=25, activation='sigmoid'),
    tf.keras.layers.Dense(units=15, activation='sigmoid'),
    tf.keras.layers.Dense(units=10, activation='linear')
])
compile
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
fit
model.fit(X,Y,epochs=100)
Andrew Ng
Additional Neural Network
Concepts
Dense layer: each unit's output is a function of all the activations of the previous layer, e.g. a_1^[2] = g(w_1^[2] · a^[1] + b_1^[2])
Andrew Ng
Convolutional Layer
Why?
• Faster computation
• Need less training data
(less prone to
overfitting)
Andrew Ng
Convolutional Neural Network
EKG signal: x_1, x_2, x_3, ..., x_100
First (convolutional) hidden layer, 9 units — each unit looks only at a window of the input:
  unit 1: x_1 – x_20, unit 2: x_11 – x_30, unit 3: x_21 – x_40, ..., unit 9: x_81 – x_100
Second (convolutional) hidden layer, 3 units — each unit looks only at a window of a^[1]:
  unit 1: a_1^[1] – a_5^[1], unit 2: a_3^[1] – a_7^[1], unit 3: a_5^[1] – a_9^[1]
x → a^[1] → a^[2] → a^[3] (output)
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Andrew Ng
Machine learning diagnostic
Diagnostic: A test that you run to gain insight into what
is/isn’t working with a learning algorithm, to gain guidance
into improving its performance.
Andrew Ng
Evaluating and
choosing models
Evaluating a model
Evaluating your model
Model fits the training data well but may fail to generalize to new examples.
(plot: price vs. size)
Andrew Ng
Train/test procedure for linear regression (with squared error cost)
Fit parameters by minimizing the regularized cost:
J(w, b) = 1/(2·m_train) Σ_{i=1}^{m_train} ( f_{w,b}(x^(i)) - y^(i) )^2 + λ/(2·m_train) Σ_{j=1}^{n} w_j^2   →  min over w, b
Compute test error (no regularization term):
J_test(w, b) = 1/(2·m_test) Σ_{i=1}^{m_test} ( f_{w,b}(x_test^(i)) - y_test^(i) )^2
Compute training error (no regularization term):
J_train(w, b) = 1/(2·m_train) Σ_{i=1}^{m_train} ( f_{w,b}(x_train^(i)) - y_train^(i) )^2
Andrew Ng
Train/test procedure for linear regression (with squared error cost)
Fit parameters by minimizing the cost function J(w, b).
(plot: price vs. size — the fitted curve tracks the train points but misses the test points)
Compute test error:
J_test(w, b) = 1/(2·m_test) Σ_{i=1}^{m_test} ( f_{w,b}(x_test^(i)) - y_test^(i) )^2
Compute training error:
J_train(w, b) = 1/(2·m_train) Σ_{i=1}^{m_train} ( f_{w,b}(x_train^(i)) - y_train^(i) )^2
For an overfit model like this one, J_train(w, b) will be low and J_test(w, b) will be high.
Andrew Ng
Train/test procedure for classification problem
Fit parameters by minimizing J(w, b), e.g. the regularized logistic cost:
J(w, b) = -(1/m) Σ_{i=1}^{m} [ y^(i) log( f_{w,b}(x^(i)) ) + (1 - y^(i)) log( 1 - f_{w,b}(x^(i)) ) ] + λ/(2m) Σ_{j=1}^{n} w_j^2
J_test(w, b) = -(1/m_test) Σ_{i=1}^{m_test} [ y_test^(i) log( f_{w,b}(x_test^(i)) ) + (1 - y_test^(i)) log( 1 - f_{w,b}(x_test^(i)) ) ]
J_train(w, b) = -(1/m_train) Σ_{i=1}^{m_train} [ y_train^(i) log( f_{w,b}(x_train^(i)) ) + (1 - y_train^(i)) log( 1 - f_{w,b}(x_train^(i)) ) ]
Andrew Ng
Train/test procedure for classification problem
Fit parameters by minimizing J(w, b) to find w, b, e.g.
J(w, b) = -(1/m) Σ_{i=1}^{m} [ y^(i) log( f_{w,b}(x^(i)) ) + (1 - y^(i)) log( 1 - f_{w,b}(x^(i)) ) ] + λ/(2m) Σ_{j=1}^{n} w_j^2
An often more useful measure: the fraction of the test set and the fraction of the train set that the algorithm has misclassified.
ŷ = 1 if f_{w,b}(x^(i)) ≥ 0.5, and ŷ = 0 if f_{w,b}(x^(i)) < 0.5;  count where ŷ ≠ y^(i).
Compute test error: J_test(w, b) is the fraction of the test set that has been misclassified.
Compute train error: J_train(w, b) is the fraction of the train set that has been misclassified.
Andrew Ng
Evaluating and
choosing models
generalization error.
Andrew Ng
Model selection (choosing a model)
d=1   1. f_{w,b}(x) = w_1 x + b                               → w<1>, b<1> → J_test(w<1>, b<1>)
d=2   2. f_{w,b}(x) = w_1 x + w_2 x^2 + b                     → w<2>, b<2> → J_test(w<2>, b<2>)
d=3   3. f_{w,b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + b           → w<3>, b<3> → J_test(w<3>, b<3>)
⋮
d=10  10. f_{w,b}(x) = w_1 x + w_2 x^2 + ... + w_10 x^10 + b  → w<10>, b<10> → J_test(w<10>, b<10>)

Suppose J_test(w<5>, b<5>) is lowest, so we choose d=5: w_1 x + ... + w_5 x^5 + b.
How well does the model perform? Report the test set error J_test(w<5>, b<5>)?
The problem: J_test(w<5>, b<5>) is likely to be an optimistic estimate of generalization error, i.e. an extra parameter d (degree of polynomial) was chosen using the test set.
Andrew Ng
Training/cross validation/test set
Other names for the cross validation set: validation set, development set, dev set.
Example data: size = 2104, price = 400 → (x^(1), y^(1))
Andrew Ng
Training/cross validation/test set
Training error:
J_train(w, b) = 1/(2·m_train) Σ_{i=1}^{m_train} ( f_{w,b}(x^(i)) - y^(i) )^2
Cross validation error (also called validation error or dev error):
J_cv(w, b) = 1/(2·m_cv) Σ_{i=1}^{m_cv} ( f_{w,b}(x_cv^(i)) - y_cv^(i) )^2
Test error:
J_test(w, b) = 1/(2·m_test) Σ_{i=1}^{m_test} ( f_{w,b}(x_test^(i)) - y_test^(i) )^2
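A minimal NumPy sketch of these three error computations for a regression model (the model.predict interface is assumed to behave like a Keras or scikit-learn predictor, and the 60/20/20 split arrays are assumed to exist already):

import numpy as np

def half_mse(model, X, y):
    # J = 1/(2m) * sum (f(x^(i)) - y^(i))^2
    pred = model.predict(X)
    return np.mean((pred.reshape(-1) - y.reshape(-1)) ** 2) / 2

# J_train = half_mse(model, X_train, y_train)
# J_cv    = half_mse(model, X_cv, y_cv)
# J_test  = half_mse(model, X_test, y_test)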
Andrew Ng
Model selection
d=1 1. 𝑓w,𝑏 x = 𝑤1 𝑥1 + 𝑏 w<1>,b<1> Jcv(w<1>,b<1>)
d=2 2. 𝑓w,𝑏 x = 𝑤1 𝑥1 + 𝑤2 𝑥 2 + 𝑏 Jcv(w<2>,b<2>)
d=3 3. 𝑓w,𝑏 x = 𝑤1 𝑥1 + 𝑤2 𝑥 2 + 𝑤3 𝑥 3 + 𝑏
⋮
⋮ ⋮
d=10 10. 𝑓w,𝑏 x = 𝑤1 𝑥1 + 𝑤2 𝑥 2 + ⋯ + 𝑤10 𝑥10 + 𝑏 Jcv(w<10>,b<10>)
Andrew Ng
Model selection – choosing a neural network architecture
1. architecture 1 → parameters W^(1), B^(1) → J_cv(W^(1), B^(1))
2. architecture 2 → parameters W^(2), B^(2) → J_cv(W^(2), B^(2))
3. architecture 3 → parameters W^(3), B^(3) → J_cv(W^(3), B^(3))
Train each on the training set, compare on the cross validation set, and pick, e.g., W^(2), B^(2).
Andrew Ng
Bias and variance
(plots: price vs. size for three fits)
f_{w,b}(x) = w_1 x + b        f_{w,b}(x) = w_1 x + w_2 x^2 + b        f_{w,b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + b
d = 1                         d = 2                                   d = 4
High bias (underfit)          "Just right"                            High variance (overfit)
J_train is high               J_train is low                          J_train is low
J_cv is high                  J_cv is low                             J_cv is high
Andrew Ng
Understanding bias and variance
(plot: J_train(w, b) and J_cv(w, b) versus degree of polynomial d)
Andrew Ng
Diagnosing bias and variance
How do you tell if your algorithm has a bias or variance problem?
Andrew Ng
Bias and variance
Regularization and
bias/variance
Linear regression with regularization
Model: f_{w,b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + b
J(w, b) = 1/(2m) Σ_{i=1}^{m} ( f_{w,b}(x^(i)) - y^(i) )^2 + λ/(2m) Σ_{j=1}^{n} w_j^2
Large λ: J_train(w, b) is large (underfit / high bias)
Intermediate λ: J_train(w, b) is small and J_cv(w, b) is small ("just right")
Small λ (≈ no regularization): J_train(w, b) is small but J_cv(w, b) is large (overfit / high variance)
Andrew Ng
Bias and variance as a function of regularization parameter 𝜆
J(w, b) = 1/(2m) Σ_{i=1}^{m} ( f_{w,b}(x^(i)) - y^(i) )^2 + λ/(2m) Σ_{j=1}^{n} w_j^2
(plot: J_train(w, b) and J_cv(w, b) versus the regularization parameter λ)
Small λ → high variance: J_train low, J_cv high.  Large λ → high bias: both J_train and J_cv high.
Andrew Ng
Bias and variance
Andrew Ng
Establishing a baseline level of performance
Andrew Ng
Bias/variance examples
                                Case 1          Case 2      Case 3
Baseline performance:           10.6%           10.6%       10.6%
Training error (J_train):       10.8%           15.0%       15.0%
  gap to baseline:              0.2%            4.4%        4.4%
Cross validation error (J_cv):  14.8%           15.5%       19.7%
  gap to training error:        4.0%            0.5%        4.7%
Diagnosis:                      high variance   high bias   high bias and high variance
Andrew Ng
Bias and variance
Learning curves
Learning curves for f_{w,b}(x) = w_1 x + w_2 x^2 + b
(plot: error vs. training set size m_train — J_cv(w, b) falls and J_train(w, b) rises as the training set grows)
Andrew Ng
High bias
f_{w,b}(x) = w_1 x + b
(plot: J_train(w, b) and J_cv(w, b) both flatten out well above human level performance — with high bias, adding more training data by itself does not help much)
Andrew Ng
High variance
f_{w,b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + b   (with small λ)
(plot: a large gap between J_cv(w, b) and J_train(w, b), with J_train below human level performance — with high variance, adding more training data is likely to help)
Andrew Ng
Bias and variance
Andrew Ng
Bias and variance
Bias/variance and
neural networks
The bias variance tradeoff
f_{w,b}(x) = w_1 x + b        f_{w,b}(x) = w_1 x + w_2 x^2 + b        f_{w,b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + b
simple model (high bias)                                              complex model (high variance)
(plot: J_train(w, b) and J_cv(w, b) versus degree of polynomial — the tradeoff is picking a complexity in between)
Andrew Ng
Neural networks and bias variance
Large neural networks are low bias machines.
Does it do well on the training set?  If no → use a bigger network (this is where fast hardware such as GPUs helps), then ask again.
If yes → does it do well on the cross validation set?  If no → get more data and start over.
If yes → done.
Andrew Ng
Neural networks and regularization
3 layers 4 layers
Andrew Ng
Neural network regularization
J(W, B) = (1/m) Σ_{i=1}^{m} L( f(x^(i)), y^(i) ) + (λ/2m) Σ_{all weights w in W} w^2
Unregularized MNIST model
layer_1 = Dense(units=25, activation="relu")
layer_2 = Dense(units=15, activation="relu")
layer_3 = Dense(units=1, activation="sigmoid")
model = Sequential([layer_1, layer_2, layer_3])
𝜆
Regularized MNIST model
layer_1 = Dense(units=25, activation="relu", kernel_regularizer=L2(0.01))
layer_2 = Dense(units=15, activation="relu", kernel_regularizer=L2(0.01))
layer_3 = Dense(units=1, activation="sigmoid", kernel_regularizer=L2(0.01))
model = Sequential([layer_1, layer_2, layer_3])
Andrew Ng
Machine learning
development process
Iterative loop of
ML development
Iterative loop of ML development
Choose architecture
(model, data, etc.)
Diagnostics
Train
(bias, variance and
model
error analysis)
Andrew Ng
Spam classification example
Andrew Ng
Building a spam classifier
Supervised learning: x = features of email
y = spam (1) or not spam (0)
Features: list the top 10,000 words to compute 𝑥1, 𝑥2, ⋯ , 𝑥10,000
From: cheapsales@buystufffromme.com
To: Andrew Ng
Subject: Buy now!
(each feature x_j is 0 or 1 depending on whether word j — a, andrew, buy, ... — appears in the email)
Andrew Ng
Building a spam classifier
How to try to reduce your spam classifier’s error?
Andrew Ng
Iterative loop of ML development
Choose architecture
(model, data, etc.)
Diagnostics
Train
(bias, variance and
model
error analysis)
Andrew Ng
Machine learning
development process
Error analysis
Error analysis
m_cv = 500 examples in the cross validation set.
The algorithm misclassifies 100 of them.
Manually examine the 100 misclassified examples and categorize them based on common traits:
  Pharma: 21
  Deliberate misspellings (w4tches, med1cine): 3
  Unusual email routing: 7
  Steal passwords (phishing): 18
  Spam message in embedded image: 5
(For the larger categories, consider targeted fixes such as collecting more data of that type or adding features. If the cv set were larger, e.g. 5,000 examples with 1,000 misclassified, examine a random sample of ~100 instead.)
Andrew Ng
Building a spam classifier
How to try to reduce your spam classifier’s error?
Andrew Ng
Iterative loop of ML development
Choose architecture
(model, data, etc.)
Diagnostics
Train
(bias, variance and
model
error analysis)
Andrew Ng
Machine learning
development process
Adding data
Adding data
Add more data of everything. E.g., “Honeypot” project.
Pharma spam
Andrew Ng
Data augmentation
Augmentation: modifying an existing training example to
create a new training example.
( X,Y )
Andrew Ng
Data augmentation by introducing distortions
Andrew Ng
Data augmentation for speech
Speech recognition example
Andrew Ng
Data augmentation by introducing distortions
Distortion introduced should be representation of the type of
noise/distortions in the test set.
Audio:
Background noise,
bad cellphone connection
Andrew Ng
Data synthesis
Synthesis: using artificial data inputs to create a new
training example.
Andrew Ng
Artificial data synthesis for photo OCR
[http://www.publicdomainpictures.net/view-image.php?image=5745&picture=times-square]
Andrew Ng
Artificial data synthesis for photo OCR
Abcdefg
Abcdefg
Abcdefg
Abcdefg
Abcdefg
Real data
[Adam Coates and Tao Wang]
Andrew Ng
Artificial data synthesis for photo OCR
Andrew Ng
Engineering the data used by your system
Conventional
model-centric AI= Code + Data
approach: (algorithm/model)
Work on this
Data-centric
approach: AI= Code + Data
(algorithm/model)
Work on this
Andrew Ng
Machine learning
development process
Fine tuning
x → hidden layers with parameters W^[1], b^[1], ..., W^[4], b^[4] → output layer W^[5], b^[5] (10 output units)
Option 1: only train the output layer's parameters.
Option 2: train all the parameters.
Andrew Ng
Why does transfer learning work?
Andrew Ng
Transfer learning summary
1. Download neural network parameters pretrained on a large dataset (e.g. 1M examples) with the same input type (e.g., images, audio, text) as your application (or train your own).
2. Further train (fine tune) the network on your own data, which may be much smaller (e.g. 1,000 or even 50 examples).
Andrew Ng
Machine learning
development process
Full cycle of a
machine learning project
Full cycle of a machine learning project
Andrew Ng
Deployment
API call (x)
Inference server Mobile app
(audio clip)
ML model
Inference 𝑦ො
(text transcript)
Software engineering may be needed for:
Ensure reliable and efficient predictions
MLOps
Scaling
Logging machine learning
System monitoring operations
Model updates
Andrew Ng
Machine learning
development process
Andrew Ng
Adverse use cases
Deepfakes
Andrew Ng
Guidelines
Get a diverse team to brainstorm things that might go
wrong, with emphasis on possible harm to vulnerable groups.
Carry out literature search on standards/guidelines for your
industry.
Audit systems against possible harm prior to deployment.
Audit
Andrew Ng
Skewed datasets
(optional)
Andrew Ng
Precision/recall
y = 1 in the presence of the rare class we want to detect.

Confusion matrix (Predicted class vs. Actual class):
                  Actual 1              Actual 0
Predicted 1       True positive: 15     False positive: 5
Predicted 0       False negative: 10    True negative: 70
                  (25 actual positive)  (75 actual negative)

Precision — of all patients where we predicted y = 1, what fraction actually have the rare disease?
  precision = true positives / #predicted positive = true positives / (true pos + false pos) = 15 / (15 + 5) = 0.75

Recall — of all patients that actually have the rare disease, what fraction did we correctly detect as having it?
  recall = true positives / #actual positive = true positives / (true pos + false neg) = 15 / (15 + 10) = 0.6

(Contrast with a degenerate classifier that just runs print("y=0") for every input: it never detects the rare class, and recall exposes that.)
Andrew Ng
Skewed datasets
(optional)
Trading off precision and recall
Raising the threshold (predict y = 1 only when very confident) gives higher precision but lower recall.
Lowering the threshold (when in doubt, predict y = 1, e.g. threshold = 0.01) gives lower precision but higher recall.
More generally, predict 1 if f_{w,b}(x) ≥ threshold.
(plot: precision vs. recall as the threshold is varied from 0 to 1)
Andrew Ng
F1 score
How to compare precision/recall numbers?
Average = (P + R) / 2   — a poor metric: a classifier that just runs print("y=1") can score well on it.
F1 score = 1 / ( (1/2)(1/P + 1/R) ) = 2PR / (P + R)   — the harmonic mean, which stays low if either P or R is low.
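A minimal sketch computing these metrics from the confusion-matrix counts used above:

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)            # of predicted positives, fraction truly positive
    recall = tp / (tp + fn)               # of actual positives, fraction detected
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
    return precision, recall, f1

print(precision_recall_f1(tp=15, fp=5, fn=10))   # (0.75, 0.6, ~0.667)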
Andrew Ng
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Andrew Ng
Decision Tree New test example
root node
Ear shape
Pointy Floppy
decision nodes
leaf nodes
Andrew Ng
Decision Tree
(examples of different possible decision trees built from the Ear shape, Face shape, and Whiskers features — splitting on Pointy/Floppy, Round/Not round, Present/Absent, with Cat / Not cat leaf nodes)
Andrew Ng
Decision Trees
Learning Process
Decision Tree Learning
?
Andrew Ng
Decision Tree Learning
Ear shape
Pointy               Floppy

Split the Pointy branch further on Face shape:
Round: 4/4 cats → Cat        Not round: 0/1 cats → Not cat

Then split the Floppy branch, e.g. on Whiskers.
Andrew Ng
Decision Tree Learning
Decision 1: How to choose what feature to split on at each
node?
Maximize purity (or minimize impurity)
Andrew Ng
Decision Tree Learning
C at
node? Y es No
Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting? Depth 0
Depth 2
Andrew Ng
Decision Tree Learning
Decision 2: When do you stop splitting?
4/7 cats          1/3 cats
Andrew Ng
Decision Tree Learning
Measuring purity
Entropy as a measure of impurity
p1 = fraction of examples that are cats
Dog Dog Dog Dog Dog Dog:   p1 = 0/6 = 0   H(p1) = 0
Cat Cat Cat Dog Dog Dog:   p1 = 3/6       H(p1) = 1
Cat Cat Cat Cat Cat Cat:   p1 = 6/6       H(p1) = 0
(plot: H(p1) versus p1)
Andrew Ng
Entropy as a measure of impurity
p1 = fraction of examples that are cats,  p0 = 1 - p1
H(p1) = -p1 log2(p1) - p0 log2(p0)
      = -p1 log2(p1) - (1 - p1) log2(1 - p1)
Note: "0 log(0)" is taken to be 0
(plot: H(p1) versus p1, equal to 1 at p1 = 0.5 and 0 at p1 = 0 or 1)
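A minimal NumPy sketch of this entropy function:

import numpy as np

def entropy(p1):
    # H(p1) = -p1*log2(p1) - (1 - p1)*log2(1 - p1), with "0 log(0)" treated as 0
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(0.5))   # 1.0
print(entropy(0.8))   # ~0.72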
Andrew Ng
Decision Tree Learning
Split on Ear shape:  p1 = 4/5 = 0.8 (left), 1/5 = 0.2 (right);   H(0.8) = 0.72, H(0.2) = 0.72
  H(0.5) - ( 5/10·H(0.8) + 5/10·H(0.2) ) = 0.28
Split on Face shape: p1 = 4/7 = 0.57 (left), 1/3 = 0.33 (right);  H(0.57) = 0.99, H(0.33) = 0.92
  H(0.5) - ( 7/10·H(0.57) + 3/10·H(0.33) ) = 0.03
Split on Whiskers:   p1 = 3/4 = 0.75 (left), 2/6 = 0.33 (right);  H(0.75) = 0.81, H(0.33) = 0.92
  H(0.5) - ( 4/10·H(0.75) + 6/10·H(0.33) ) = 0.12
Andrew Ng
Information Gain
p1^root = 5/10 = 0.5
Split on Ear shape:  Pointy (left) / Floppy (right)
p1^left = 4/5,  w^left = 5/10;   p1^right = 1/5,  w^right = 5/10
Information gain = H(p1^root) - ( w^left · H(p1^left) + w^right · H(p1^right) )
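A minimal sketch of this computation, reusing the entropy function defined above:

def information_gain(p1_root, p1_left, w_left, p1_right, w_right):
    # gain = H(p1_root) - (w_left * H(p1_left) + w_right * H(p1_right))
    return entropy(p1_root) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

# Split on Ear shape from the example above:
print(information_gain(0.5, 4/5, 5/10, 1/5, 5/10))   # ~0.28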
Andrew Ng
Decision Tree Learning
Putting it together
Decision Tree Learning
• Start with all examples at the root node
• Calculate information gain for all possible features, and pick the one with
the highest information gain
• Split dataset according to selected feature, and create left and right
branches of the tree
• Keep repeating the splitting process until a stopping criterion is met (see the sketch after this list):
• When a node is 100% one class
• When splitting a node will result in the tree exceeding a maximum
depth
• Information gain from additional splits is less than threshold
• When number of examples in a node is below a threshold
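As a rough sketch of how these steps fit together — for binary 0/1 features only, reusing the entropy and information_gain helpers sketched earlier; a full implementation would also handle one-hot and continuous features:

import numpy as np

def build_tree(X, y, depth=0, max_depth=2, min_examples=1):
    # X: (m, n) array of 0/1 features; y: (m,) array of 0/1 labels
    p1 = y.mean()
    # Stopping criteria: node is 100% one class, max depth reached, or too few examples
    if p1 in (0.0, 1.0) or depth == max_depth or len(y) <= min_examples:
        return {"leaf": True, "prediction": int(p1 >= 0.5)}

    # Pick the feature with the highest information gain
    gains = []
    for j in range(X.shape[1]):
        left, right = y[X[:, j] == 1], y[X[:, j] == 0]
        if len(left) == 0 or len(right) == 0:
            gains.append(0.0)
            continue
        w_left = len(left) / len(y)
        gains.append(information_gain(p1, left.mean(), w_left, right.mean(), 1 - w_left))

    if max(gains) <= 0.0:          # information gain below threshold: stop splitting
        return {"leaf": True, "prediction": int(p1 >= 0.5)}
    best = int(np.argmax(gains))
    mask = X[:, best] == 1
    return {"leaf": False, "feature": best,
            "left":  build_tree(X[mask], y[mask], depth + 1, max_depth, min_examples),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_examples)}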
Andrew Ng
Recursive splitting
?
Andrew Ng
Recursive splitting
Ear shape
Pointy               Floppy

Recursive splitting
Ear shape
Pointy               Floppy
Face shape
Round    Not Round
Cat

Recursive splitting
Ear shape
Pointy               Floppy
Face shape           Whiskers
Round    Not Round   Present    Absent
Recursive algorithm
Andrew Ng
Decision Tree Learning
3 possible values
Andrew Ng
One hot encoding
Ear shape | Pointy ears | Floppy ears | Oval ears | Face shape | Whiskers | Cat
Andrew Ng
One hot encoding and neural networks
Pointy ears | Floppy ears | Round ears | Face shape | Whiskers    | Cat
1           | 0           | 0          | Round (1)  | Present (1) | 1
0           | 1           | 0          | Round (1)  | Absent (0)  | 1
0           | 1           | 0          | Round (1)  | Absent (0)  | 1
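One way to produce such columns in code — a sketch using pandas, which is an assumption (any one-hot encoder works), with made-up example rows:

import pandas as pd

df = pd.DataFrame({"Ear shape": ["Pointy", "Floppy", "Oval"],
                   "Face shape": ["Round", "Round", "Not round"],
                   "Whiskers": ["Present", "Absent", "Absent"]})

# Turn the 3-valued "Ear shape" column into three 0/1 columns
encoded = pd.get_dummies(df, columns=["Ear shape"])
print(encoded.columns.tolist())
# ['Face shape', 'Whiskers', 'Ear shape_Floppy', 'Ear shape_Oval', 'Ear shape_Pointy']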
Andrew Ng
Decision Tree Learning
Andrew Ng
Splitting on a continuous variable
Split the examples on Weight ≤ (threshold): Yes / No, and compute the information gain for several candidate thresholds:
Weight ≤ 8 lbs:   H(0.5) - ( 2/10·H(2/2) + 8/10·H(3/8) ) = 0.24
Weight ≤ 9 lbs:   H(0.5) - ( 4/10·H(4/4) + 6/10·H(1/6) ) = 0.61
(another threshold): H(0.5) - ( 7/10·H(5/7) + 3/10·H(0/3) ) = 0.40
Choose the threshold with the highest information gain — here, Weight ≤ 9 lbs.
Andrew Ng
Decision Tree Learning
Andrew Ng
Regression with Decision Trees
Ear shape
Pointy               Floppy
Face shape           Face shape
Round   Not round    Round   Not Round
(each leaf predicts a number — the average of the target values, e.g. weights, of its training examples)
Andrew Ng
Choosing a split
Variance at the root node: 20.51 (all 10 weights)

Split on Ear shape (Pointy / Floppy):
  left weights: 7.2, 9.2, 8.4, 7.6, 10.2 (variance 1.47, w^left = 5/10)
  right weights: 8.8, 15, 11, 18, 20 (variance 21.87, w^right = 5/10)
  reduction = 20.51 - ( 5/10 · 1.47 + 5/10 · 21.87 ) = 8.84

Split on Face shape (Round / Not round):
  left weights: 7.2, 15, 8.4, 7.6, 10.2, 18, 20 (variance 27.80, w^left = 7/10)
  right weights: 8.8, 9.2, 11 (variance 1.37, w^right = 3/10)
  reduction = 20.51 - ( 7/10 · 27.80 + 3/10 · 1.37 ) = 0.64

Split on Whiskers (Present / Absent):
  left weights: 7.2, 8.8, 9.2, 8.4 (variance 0.75, w^left = 4/10)
  right weights: 15, 7.6, 11, 10.2, 18, 20 (variance 23.32, w^right = 6/10)
  reduction = 20.51 - ( 4/10 · 0.75 + 6/10 · 23.32 ) = 6.22

Choose the split with the largest reduction in variance — here, Ear shape.
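A minimal NumPy sketch of this weighted variance reduction, checked against the Ear shape split above (ddof=1, i.e. sample variance, matches the slide's numbers):

import numpy as np

def variance_reduction(y_node, y_left, y_right):
    # reduction = var(node) - (w_left * var(left) + w_right * var(right))
    w_left = len(y_left) / len(y_node)
    w_right = len(y_right) / len(y_node)
    var = lambda y: np.var(y, ddof=1)
    return var(y_node) - (w_left * var(y_left) + w_right * var(y_right))

left = np.array([7.2, 9.2, 8.4, 7.6, 10.2])    # Pointy ears
right = np.array([8.8, 15, 11, 18, 20])        # Floppy ears
print(variance_reduction(np.concatenate([left, right]), left, right))   # ~8.84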
Andrew Ng
Tree ensembles
E ar
s hape Whis kers
Andrew Ng
Tree ensemble New test example
Run the new test example through each of the three trees (one splits first on Ear shape then Face shape, one on Whiskers, one on Ear shape then Whiskers); each tree predicts Cat or Not cat, and the ensemble's final prediction is the majority vote of the three.
Andrew Ng
Tree ensembles
Andrew Ng
Sampling with replacement
Ear shape | Face shape | Whiskers | Cat
Pointy    | Round      | Present  | 1
(examples are drawn at random, with replacement, from the original training set — the same example can appear more than once)
Andrew Ng
Tree ensembles
For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
Train a decision tree on the new dataset
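A minimal sketch of this bagged-ensemble loop. Using scikit-learn's DecisionTreeClassifier as the base tree is an assumption (the course builds trees from scratch); X and y are NumPy arrays of features and 0/1 labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, rng=np.random.default_rng(0)):
    m = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, m, size=m)      # sample m examples with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    votes = np.array([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote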
Andrew Ng
Randomizing the feature choice
At each node, when choosing a feature to split on, pick from a random subset of k < n features rather than all n (a common choice is k ≈ √n). This further de-correlates the trees and gives the random forest algorithm.
Andrew Ng
Tree ensembles
XGBoost
Boosted trees intuition
Given training set of size 𝑚
For 𝑏 = 1 to 𝐵:
Use sampling with replacement to create a new training set of size 𝑚
But instead of picking from all examples with equal (1/m) probability, make it
more likely to pick examples that the previously trained trees misclassify
Train a decision tree on the new dataset
(table: the ten training examples — Ear shape, Face shape, Whiskers — together with each one's prediction, marked ✅ correct or ❌ misclassified; when sampling the training set for the next tree, the misclassified examples are given a higher probability of being picked)
Andrew Ng
XGBoost (eXtreme Gradient Boosting)
• Open source implementation of boosted trees
• Fast, efficient implementation
• Good choice of default splitting criteria and criteria for when to stop splitting
• Built-in regularization to prevent overfitting
• Highly competitive algorithm for machine learning competitions (e.g., Kaggle)
Andrew Ng
Using XGBoost
Classification Regression
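A sketch of the two usages with the standard xgboost Python API (XGBClassifier / XGBRegressor with fit and predict); X_train, y_train, X_test are assumed to be already loaded:

from xgboost import XGBClassifier, XGBRegressor

# Classification
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Regression
model = XGBRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)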
Andrew Ng
Conclusion
Diagnostics
(bias, variance and error
analysis)
Train
model
Andrew Ng
Decision Trees vs Neural Networks
Decision Trees and Tree ensembles
• Works well on tabular (structured) data
• Not recommended for unstructured data (images, audio, text)
• Fast
• Small decision trees may be human interpretable
Neural Networks
• Works well on all types of data, including tabular (structured) and
unstructured data
• May be slower than a decision tree
• Works with transfer learning
• When building a system of multiple models working together, it
might be easier to string together multiple neural networks
Andrew Ng