
Nhân bản – Phụng sự – Khai phóng (Humanity – Service – Liberation)

Machine Learning
Artificial Intelligence
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

2
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

3
…Introduction to ML

• Is this a benign or malignant cell?

4
Introduction to ML

• Use ML to make prediction

5
…Introduction to ML

• Use ML to make prediction

6
…Introduction to ML

• Use ML to make prediction… such as predicting the right drug

7
…Introduction to ML

• Use ML to make decisions

8
…Introduction to ML

• Use ML for customer segmentation

9
…Introduction to ML

• Use ML for recommendation systems

10
…Introduction to ML

• Build ML models

11
…Introduction to ML

• Build ML models

12
…Introduction to ML

• Build ML models

13
…Introduction to ML

• Build ML models

14
…Introduction to ML

• Major ML techniques (1)

15
…Introduction to ML

• Major ML techniques (2)

16
…Introduction to ML
• What is ML?

17
…Introduction to ML
• How does ML relate to AI?

Three components of ML

ML vs AI
https://vas3k.com/blog/machine_learning/
18
…ML

• Overview

The map of
ML world

19
…Introduction to ML

• Overview

(Ensemble Learning)

https://vas3k.com/blog/machine_learning/
20
…Introduction to ML

• Overview – (1).Classical ML

21
…Introduction to ML

• Overview – (2).Reinforcement Learning


"Throw a robot into a maze and let it find an exit"

Nowadays it is used for:

• Self-driving cars
• Robot vacuums
• Games
• Automated trading
• Enterprise resource management

Popular algorithms:
Q-Learning, SARSA, DQN, A3C,
Genetic algorithm

22
…Introduction to ML

• Overview – (3). Ensemble Methods


(Ensemble Learning)
"Bunch of stupid trees learning to correct errors of each other"

Nowadays it is used for:


• Everything that fits classical
algorithm approaches (but works better)
• Search systems
• Computer vision
• Object detection

Popular algorithms:
• Random Forest
• Gradient Boosting
(an ensemble learning method)

23
…Introduction to ML

• Overview – (4). Neural Networks and Deep Learning


"We have a thousand-layer network, dozens of video cards,
but still no idea where to use it. Let's generate cat pics!"

Used today for:


• Replacement of all algorithms above
• Object identification on photos and videos
• Speech recognition and synthesis
• Image processing, style transfer
• Machine translation

Popular architectures:
• Perceptron
• Convolutional Network (CNN)
• Recurrent Networks (RNN)
• Autoencoders
24
…Introduction to ML

• Supervised vs Unsupervised
(classical ML)

• Supervised learning:
• Learning a model from labeled data.
• Unsupervised learning:
• Learning a model from unlabeled data
25
…Introduction to ML

• Teaching the model with labeled data

26
…Introduction to ML

• Supervised Learning (classical ML)

There are two types


• Regression
• prediction of a specific point on
a numeric axis.

• Classification
• an object's category prediction

27
…Introduction to ML

• Type of Supervised learning

28
…Introduction to ML

• Supervised Learning - Regression (prediction)

29
…Introduction to ML

• Supervised learning
• Regression is the process of predicting continuous values

30
…Introduction to ML

• Supervised learning
• Classification is the process of predicting discrete class labels/categories

31
…Introduction to ML

• Supervised Learning - Classification


“The machine has a "supervisor/teacher" who gives the machine all the answers, like whether it's a cat or a dog.
The teacher has divided (labeled) the data into cats and dogs, and the machine is using these examples to learn.”

32
…Introduction to ML

• Supervised Learning - Classification

33
…Introduction to ML

• Unsupervised learning

34
…Introduction to ML

• Unsupervised learning
• Clustering is grouping of data points or objects

35
…Introduction to ML

• Supervised vs Unsupervised

36
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

37
Supervised learning

• Given a training set of example input–output pairs:


(x_1, y_1), …, (x_n, y_n), where x_i ∈ R^d

38
Supervised learning
• Given a training set of example input–output pairs:
  (x_1, y_1), …, (x_n, y_n), where x_i ∈ R^d
• Regression:
  - y is a real value, y ∈ R
  - f: R^d → R
  - f is called a regressor.
• Classification:
  - y is discrete. To simplify, y ∈ {−1, +1}
  - f: R^d → {−1, +1}
  - f is called a binary classifier.

39
Supervised learning

• Regression:

• Example: income as a function of age; weight of a fruit as a function of its length.

40
Supervised learning

• Regression:

• Example: income as a function of age; weight of a fruit as a function of its length.

41
Supervised learning

• Regression:

• Example: income as a function of age; weight of a fruit as a function of its length.

42
Supervised learning

• Classification:

43
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network
• Recommender Systems

44
Linear Regression: History

• A very popular technique.


• Rooted in Statistics.
• The method of least squares was used as early as 1795 by Gauss.
• Re-invented in 1805 by Legendre.
• Frequently applied in astronomy to study the large-scale structure of the universe. [Portrait: Carl Friedrich Gauss]

• Still a very useful tool today.

45
Linear Regression

• Given a training set of example input–output pairs:


(x_1, y_1), …, (x_n, y_n), where x_i ∈ R^d and y_i ∈ R

• Task: Learn a regression function:


  - h: R^d → R
  - h(x) = y
• A regression model is said to be linear if it is represented by a linear function.
46
Linear Regression

• Univariate linear regression model: h(x) = ω_0 + ω_1 x
• Multivariable linear regression model: h(x) = ω_0 + ω_1 x_1 + … + ω_d x_d
  - the ω are the weights
  - w is the weight vector, e.g. (ω_0, ω_1) in the univariate case
  - Learning the linear model → learning the ω (a small fitting sketch follows after this slide)
47
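To make the weight-learning step concrete, here is a minimal sketch in Python/NumPy that fits the univariate model h(x) = ω_0 + ω_1 x by ordinary least squares; the data and variable names are illustrative, not taken from the slides.

```python
import numpy as np

# Illustrative data: x could be age, y could be income (values are made up).
x = np.array([22, 25, 30, 35, 40, 45, 50], dtype=float)
y = np.array([18, 24, 31, 38, 45, 49, 55], dtype=float)

# Closed-form least-squares estimates for h(x) = w0 + w1 * x.
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(f"h(x) = {w0:.2f} + {w1:.2f} * x")
print("prediction for x = 28:", w0 + w1 * 28)
```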
Linear Regression

a line in R^2, a hyperplane in R^3

48
Introduction to Regression

49
…Introduction to Regression

• Regression model

50
…Introduction to Regression

• Applications of regression

• Regression algorithms

51
…Introduction to Regression

• Type of regression models

52
Simple Linear Regression

• Linear regression topology

53
Simple Linear Regression

• Using linear regression to predict continuous values

54
Simple Linear Regression

• Using linear regression to predict continuous values

55
Simple Linear Regression

• Linear regression model representation

56
Simple Linear Regression

• How to find the best fit?

57
Simple Linear Regression

• How to find the best fit?

58
Simple Linear Regression

• Estimating the parameters

59
Simple Linear Regression

• Estimating the parameters

60
Simple Linear Regression

• Estimating the parameters

61
Simple Linear Regression

• Predictions with linear regression

62
Simple Linear Regression

• Pros of linear regression

63
Multiple Linear Regression

• Types of regression models

64
Multiple Linear Regression

• Predicting continuous values with Multiple Linear Regression

65
Multiple Linear Regression

• Predicting continuous values with Multiple Linear Regression

66
Multiple Linear Regression

• Predicting continuous values with Multiple Linear Regression

67
Multiple Linear Regression

• Estimating multiple linear regression parameters

68
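The parameter-estimation slide above is shown as a figure in the original deck; as a hedged sketch, the multiple-regression weights can be obtained as the least-squares solution of the bias-augmented linear system, for example with NumPy (the toy data below is invented for illustration).

```python
import numpy as np

# Toy design matrix: two features per example (invented values).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.2, 9.1, 10.0])

# Add a column of ones so that w[0] acts as the intercept w0.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution of Xb @ w ≈ y (equivalent to the normal equations).
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("weights (w0, w1, w2):", w)
print("prediction for x = [2.5, 1.0]:", np.array([1.0, 2.5, 1.0]) @ w)
```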
Multiple Linear Regression

• Making predictions with multiple linear regression

69
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network
• Recommender Systems

70
Bayes Rule

• Writing p(A ∧ B) in two different ways:
  p(A ∧ B) = p(A|B) p(B) = p(B|A) p(A)   ⇒   p(A|B) = p(B|A) p(A) / p(B)
• p(A|B) is called the posterior (the posterior distribution of A given B).
• p(A) is called the prior.
• p(B) is called the evidence.
• p(B|A) is called the likelihood.
71
Bayes Rule

• This table divides the sample space into 4 mutually exclusive events.
• The probabilities in the margins are called marginals and are calculated by summing across the rows and the columns.
• Another form:
  p(A|B) = p(B|A) p(A) / ( p(B|A) p(A) + p(B|¬A) p(¬A) )

72
Example of Using Bayes Rule

• A: patient has cancer.


• B: patient has a positive lab test.

73
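The numbers for this example are not visible in the deck, so here is a worked calculation with assumed values (illustrative only): suppose p(cancer) = 0.008, p(+|cancer) = 0.98, and p(+|¬cancer) = 0.03. Then p(+) = 0.98 × 0.008 + 0.03 × 0.992 ≈ 0.0376, and p(cancer|+) = (0.98 × 0.008) / 0.0376 ≈ 0.21. Even after a positive test, the posterior probability of cancer is only about 21%, because the prior p(cancer) is so small.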
Why probabilities?

• Why are we bringing here a Bayesian framework?

• Recall Classification framework:

  - Given: training data (x_1, y_1), …, (x_n, y_n), where x_i ∈ R^d and y_i ∈ Y

  - Task: learn a classification function f: R^d → Y

• Learn a mapping from 𝑥 to 𝑦.

• We want to find this mapping 𝑓(𝑥) = 𝑦 through 𝑝(𝑦|𝑥)!

74
Discriminative Algorithms

• Idea: model 𝑝 𝑦|𝑥 , conditional distribution of 𝑦 given 𝑥.


• In Discriminative Algorithms: find a decision boundary that separates positive from negative examples.
• To predict a new example, check on which side of the decision
boundary it falls.
• Model 𝑝 𝑦|𝑥 directly.

75
Generative Algorithms

• Idea: Build a model for what positive examples look like. Build a
different model for what negative example look like.
• To predict a new example, match it with each of the models and
see which match is best.
• Model p(x|y) and p(y)!
• Use Bayes rule to obtain: p(y|x) = p(x|y) p(y) / p(x)
• To make a prediction: ŷ = argmax_y p(y|x) = argmax_y p(x|y) p(y)

76
Naive Bayes Classifier

• Probabilistic model.
• Highly practical method.
• Application domains include natural language text documents.
• Naive because of the strong independence assumption it makes (not
realistic).
• Simple model.
• A strong method that can be comparable to decision trees and neural networks in some cases.

77
Setting

• A training example is a pair (x_i, y_i), where x_i is a feature vector and y_i is a discrete label.


• 𝑑 features, and 𝑛 examples.
• E.g., consider document classification: each example is a document, and each feature represents the presence or absence of a particular word in the document.
• We have a training set.
• A new example arrives with feature values x_new = (a_1, a_2, …, a_d).
• We want to predict the label 𝑦𝑛𝑒𝑤 of the new example.

78
Setting

• Use Bayes rule to obtain:
  p(y | a_1, a_2, …, a_d) = p(a_1, a_2, …, a_d | y) p(y) / p(a_1, a_2, …, a_d)

• Can we estimate these two terms from the training data?


1. p(y) can be easy to estimate: count the frequency with which each label y occurs.
2. p(a_1, a_2, …, a_d | y) is not easy to estimate unless we have a very large sample. (We need to see every example many times to get reliable estimates.)
79
Naive Bayes Classifier

• Makes a simplifying assumption that the feature values are conditionally independent given the label.
• Given the label of the example, the probability of observing the conjunction a_1, a_2, …, a_d is the product of the probabilities for the individual features:
  p(a_1, a_2, …, a_d | y) = Π_j p(a_j | y)
• Naive Bayes Classifier:
  y_new = argmax_{y ∈ Y} p(y) Π_j p(a_j | y)

• Can we estimate these two terms from the training data?


Yes!

80
Algorithm

• Learning: Based on the frequency counts in the dataset:


1. Estimate p(y) for all y ∈ Y.
2. Estimate p(a_j | y) for all y ∈ Y and all feature values a_j.
• Classification: For a new example, use:
  y_new = argmax_{y ∈ Y} p(y) Π_j p(a_j | y)

• Note: No model per se or hyperplane, just count the frequencies of


various data combinations within the training examples.

81
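As a sketch of the frequency-counting algorithm above, the following Python snippet estimates p(y) and p(a_j | y) from a small categorical dataset and classifies a new example with the argmax rule; the feature names and data are invented for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

# Invented training data: (features, label); all features are categorical.
data = [
    ({"outlook": "sunny", "windy": "no"},  "yes"),
    ({"outlook": "sunny", "windy": "yes"}, "no"),
    ({"outlook": "rain",  "windy": "no"},  "yes"),
    ({"outlook": "rain",  "windy": "yes"}, "no"),
    ({"outlook": "sunny", "windy": "no"},  "yes"),
]

# Learning: estimate p(y) and p(a_j | y) by counting frequencies.
label_counts = Counter(y for _, y in data)
feat_counts = defaultdict(Counter)                  # (feature, label) -> value counts
for x, y in data:
    for feat, val in x.items():
        feat_counts[(feat, y)][val] += 1

def p_label(y):
    return label_counts[y] / len(data)

def p_feature(feat, val, y):
    return feat_counts[(feat, y)][val] / label_counts[y]   # no smoothing, for simplicity

# Classification: y_new = argmax_y p(y) * prod_j p(a_j | y)
def predict(x_new):
    def score(y):
        p = p_label(y)
        for feat, val in x_new.items():
            p *= p_feature(feat, val, y)
        return p
    return max(label_counts, key=score)

print(predict({"outlook": "sunny", "windy": "no"}))   # expected: "yes"
```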
Example

• Can we predict the class of the new example?

82
Solution

• Conditional probabilities:

y_new = yes
83
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

84
Tree Classifiers

• Popular classification methods.


• Easy to understand, simple algorithmic approach.
• No assumption about linearity.
• History:
  - CART (Classification And Regression Trees): Friedman 1977.
  - ID3 and C4.5 family: Quinlan 1979-1983.
  - Refinements in the mid-1990s (e.g., pruning, numerical features, etc.).
• Applications:
  - Botany (e.g., New Flora of the British Isles, Stace 1991).
  - Medical research (e.g., Pima Indian diabetes diagnosis, early diagnosis of acute myocardial infarction).
  - Computational biology (e.g., interactions between genes).
85
Tree Classifiers

• The terminology "tree" is graphic; however, a decision tree is grown from the root downward.
• The idea is to send the examples down the tree, using the concept of information entropy.

• General Steps to build a tree:


1. Start with the root node, which holds all the examples.
2. Greedily select the next best feature to build the branches. The splitting criterion is node purity.
3. The majority class is assigned to the leaves.
86
Classification

• Given a training set of example input–output pairs:


(x_1, y_1), …, (x_n, y_n)
where x_i ∈ R^d and y_i is discrete (categorical/qualitative), y_i ∈ Y.
E.g., Y = {−1, +1} or Y = {0, 1}.
• Task: Learn a classification function:
  f: R^d → Y

• In the case of Tree Classifiers:


1. No need for x_i ∈ R^d, so no need to turn categorical features into numerical features.
2. The model is a tree.

87
Example

88
Example

89
Splitting criteria

1. The central choice is selecting the next attribute to split on.
2. We want a criterion that measures the homogeneity or impurity of the examples in a node:
   a. It quantifies the mix of classes at each node.
   b. It is maximal if there is an equal number of examples from each class.
   c. It is minimal if the node is pure.
3. A measure from Information Theory with exactly these properties is entropy. For a node with a proportion p_+ of positive and p_- of negative examples:
   Entropy = −p_+ log2 p_+ − p_- log2 p_-
90
Splitting criteria

• In general, for c classes: Entropy(S) = −Σ_{i=1}^{c} p_i log2 p_i, where p_i is the proportion of examples of class i in S.

91
Splitting criteria

• Now each node has some entropy that measures the homogeneity of the node.
• How do we decide which attribute is best to split on, based on entropy?
• We use the Information Gain, which measures the expected reduction in entropy caused by partitioning the examples according to an attribute A (a code sketch follows after this slide):
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
92
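A minimal sketch of the two quantities just defined, for a dataset stored as Python lists of attribute dictionaries and labels; the attribute and label names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the classes present in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for value in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Illustrative toy data (not from the slides).
examples = [{"wind": "weak"}, {"wind": "strong"}, {"wind": "weak"}, {"wind": "weak"}]
labels = ["yes", "no", "yes", "no"]
print(entropy(labels))                              # 1.0: perfectly mixed classes
print(information_gain(examples, labels, "wind"))   # about 0.31
```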
93
94
95
96
Decision trees

• At the first split starting from the root, we choose the attribute that
has the max gain.
• Then, we restart the same process at each of the child nodes (if the node is not pure).

97
Decision trees

Pros & Cons


+ Intuitive, interpretable.
+ Can be turned into rules.
+ Well-suited for categorical data.
+ Simple to build.
BUT …
- Unstable (change in an example may lead to a different tree).
- Univariate (split one attribute at a time, does not combine features).
- A choice at some node depends on the previous choices.
- Need to balance the data.

98
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

99
…Clustering

• Introduction to Clustering

100
…Clustering

• Introduction to Clustering

101
…Clustering

• Introduction to Clustering

102
…Clustering

• Introduction to Clustering

103
…Clustering

• Clustering applications

104
…Clustering

• Clustering applications

105
k-Means clustering

• Introduction to Clustering

106
…k-Means clustering

• k-Means algorithms

107
…k-Means clustering

• Determine the similarity/distance

108
…k-Means clustering

• 1-dimensional similarity/distance

109
…k-Means clustering

• Multi-dimensional similarity/distance

110
…k-Means clustering

• How does k-Means clustering work?

111
…k-Means clustering

• k-Means clustering – initialize k

112
…k-Means clustering

• k-Means clustering – distance calculation

113
…k-Means clustering

• k-Means clustering – assign to centroid

114
…k-Means clustering

• k-Means clustering – assign to centroid

115
…k-Means clustering

• k-Means clustering – compute new centroids

116
…k-Means clustering

• k-Means clustering – repeat

117
…k-Means clustering

• k-Means clustering – repeat

118
…k-Means clustering

• k-Means clustering – repeat

119
…k-Means clustering

• k-Means clustering – repeat

120
…k-Means clustering

• k-Means clustering – repeat

121
…k-Means clustering

• k-Means clustering – repeat

122
…k-Means clustering

• k-Means clustering – repeat

123
K-means Clustering

1. Select initial centroids at random.
2. Assign each object to the cluster with the nearest centroid.
3. Compute each centroid as the mean of the objects assigned to it (go to 2).
Repeat the previous two steps until no change.

124
Distance metrics

Euclidean: d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )

Cosine: sim(x, y) = (x · y) / (‖x‖ ‖y‖)

125
…k-Means clustering

• k-Means clustering – algorithm

126
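The loop on the previous slides (initialize k centroids, assign, recompute, repeat until nothing changes) can be sketched in a few lines of NumPy; this is an illustrative implementation, not the library routine used in practice, and it does not handle the rare case of an empty cluster.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select k initial centroids at random from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each object to the cluster with the nearest (Euclidean) centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the objects assigned to it.
        new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

# Two illustrative blobs in 2-D; the centroids should land near (0, 0) and (5, 5).
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
centroids, assign = kmeans(X, k=2)
print(centroids)
```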
…k-Means clustering

• k-Means clustering – choosing k

127
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means

• Model Evaluation
• Neural Network

128
Training and Testing

• Question: How can we be confident about f?

129
Training and Testing

• We calculate E_train(f), the in-sample error (training error or empirical error/risk):
  E_train(f) = (1/n) Σ_{i=1}^{n} loss(f(x_i), y_i)

• Examples of loss functions:
  - Classification error: loss(f(x_i), y_i) = 1 if f(x_i) ≠ y_i, and 0 otherwise.
  - The squared error: loss(f(x_i), y_i) = (f(x_i) − y_i)^2

130
Training and Testing

• We calculate E_train(f), the in-sample error (training error or empirical error/risk).

• We aim to have E_train(f) small, i.e., we minimize E_train(f).

• We hope that E_test(f), the out-of-sample error (test/true error), will be small too.

131
Model Evaluation in Regression Models

• Model evaluation approaches

132
Model Evaluation in Regression Models

• Best approach for most accurate result?

133
Model Evaluation in Regression Models

• Calculating the accuracy of a model

134
Model Evaluation in Regression Models

• Calculating the accuracy of a model

135
Model Evaluation in Regression Models

• Training and test on the same dataset

136
Model Evaluation in Regression Models

• Training and test on the same dataset

137
Model Evaluation in Regression Models

• Train/Test split evaluation approach

138
Model Evaluation in Regression Models

• Train/Test split evaluation approach

139
Evaluation Metrics in Regression Models

• Regression accuracy

140
Evaluation Metrics in Regression Models

• Regression accuracy

141
Evaluation Metrics in Regression Models

• Regression accuracy

142
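The regression-accuracy slides above are figures in the original deck; the snippet below computes commonly used regression metrics (MAE, MSE, RMSE, R²) with scikit-learn on a train/test split of synthetic data, as a hedged illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data: y = 3x + noise (for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=200)

# Train/test split: fit on one part, evaluate out-of-sample on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R^2 :", r2_score(y_test, y_pred))
```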
Evaluation Metrics in Classification

• Classification accuracy

143
Evaluation Metrics in Classification

• Jaccard index

144
Confusion matrix

145
Evaluation Metrics in Classification

• F1-score

146
Evaluation Metrics in Classification

• Log loss

147
Evaluation Metrics in Classification

• Log loss

148
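As a sketch of the classification metrics named on these slides (accuracy, Jaccard index, confusion matrix, F1-score, log loss), the following scikit-learn snippet computes them on a synthetic binary problem; the dataset and classifier are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, jaccard_score, confusion_matrix,
                             f1_score, log_loss)

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)            # class probabilities, needed for log loss

print("Accuracy        :", accuracy_score(y_test, y_pred))
print("Jaccard index   :", jaccard_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("F1-score        :", f1_score(y_test, y_pred))
print("Log loss        :", log_loss(y_test, y_prob))
```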
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

149
CONTENTS

• Neural Networks
• Perceptron
• From perceptron to MLP
• Backpropagation algorithm
• Multiclass case
• MNIST database

150
Neural Networks

• Algorithms that try to mimic how the brain functions.


• Worked extremely well to recognize:
  1. handwritten characters (LeCun et al. 1989),
  2. spoken words (Lang et al. 1990),
  3. faces (Cottrell 1990)
• Extensively studied in the 1990s with moderate success.
• Now back with lots of success in deep learning, thanks to algorithmic and computational progress.
• The first algorithm used was the Perceptron (Rosenblatt, 1958).

151
Perceptron

• Given n examples and d features, the perceptron computes, for an example x_i:

  f(x_i) = sign( Σ_{j=0}^{d} ω_j x_ij )

152
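A minimal sketch of the perceptron above: prediction is sign(Σ_{j=0}^{d} ω_j x_ij) with x_i0 = 1 acting as the bias input, and the weights are adjusted with the standard perceptron update rule on mistakes; the toy data is invented.

```python
import numpy as np

def predict(w, x):
    # f(x) = sign(sum_j w_j * x_j), with x[0] = 1 acting as the bias input.
    return 1 if np.dot(w, x) >= 0 else -1

def train_perceptron(X, y, epochs=20, lr=1.0):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the constant input x_0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if predict(w, xi) != yi:                # update only on misclassified examples
                w += lr * yi * xi
    return w

# Linearly separable toy data: label +1 if x1 + x2 > 1, else -1.
X = np.array([[0.1, 0.2], [0.4, 0.3], [0.9, 0.8], [0.7, 0.9]])
y = np.array([-1, -1, 1, 1])
w = train_perceptron(X, y)
print(w, [predict(w, np.r_[1.0, x]) for x in X])
```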
Perceptron expressiveness

• Consider the perceptron with the step function.


• Idea: an iterative method that starts with a random hyperplane and adjusts it using your training data.
• It can represent Boolean functions such as AND, OR, NOT, but not the XOR function.
• It produces a linear separator in the input space.

153
From perceptron to MLP

• The perceptron works perfectly if data is linearly separable. If not, it


will not converge.
• Neural networks use the ability of perceptrons to represent elementary functions and combine them in a network of layers of elementary units.
• However, a cascade of linear functions is still linear, and we want
networks that represent highly non-linear functions.

154
From perceptron to MLP

• Also, the perceptron uses a threshold function, which is not differentiable and hence not suitable for gradient descent when the data is not linearly separable.
• We want a smooth, differentiable function of the inputs.
• One possibility is the sigmoid function:

  g(z) = e^z / (1 + e^z) = 1 / (1 + e^{−z})

  g(z) → 1 when z → +∞,   g(z) → 0 when z → −∞

155
Perceptron with Sigmoid

• Given 𝑛 examples and 𝑑 features.


• For an example x_i (the i-th row of the matrix of examples):

  f(x_i) = 1 / (1 + e^{−Σ_{j=0}^{d} ω_j x_ij})

156
The XOR example

• Let’s try to create a NN for the XOR function using elementary


perceptrons.
• First observe: x1 XOR x2 = (x1 OR x2) AND (x1 NAND x2).

• Let's consider that for z ≥ 10, g(z) → 1, and for z ≤ −10, g(z) → 0.

157
The OR example
• First, what is the perceptron for OR?

158
The AND and NAND example
• Similarly, we obtain the perceptrons for the AND and NAND:

• Note how the weights in the NAND are the inverse of the weights in the AND.

159
The XOR example
• Let’s try to create a NN for the XOR function using elementary
perceptrons.

160
The XOR example
• Let’s put them together...

• XOR as a combination of 3 basic perceptrons.

161
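One possible set of weights that realizes this construction under the convention "z ≥ 10 gives g(z) → 1, z ≤ −10 gives g(z) → 0"; the particular weight values below are an illustrative choice, not taken from the slides.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))                # sigmoid

def unit(w, inputs):
    # One perceptron: w[0] is the bias weight (constant input x_0 = 1).
    return g(w[0] + np.dot(w[1:], inputs))

W_OR   = np.array([-10.0,  20.0,  20.0])
W_AND  = np.array([-30.0,  20.0,  20.0])
W_NAND = np.array([ 30.0, -20.0, -20.0])

def xor(x1, x2):
    h1 = unit(W_OR,   [x1, x2])                    # x1 OR x2
    h2 = unit(W_NAND, [x1, x2])                    # x1 NAND x2
    return unit(W_AND, [h1, h2])                   # (x1 OR x2) AND (x1 NAND x2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor(a, b)))              # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```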
Backpropagation algorithm

• Note: Feedforward NN (as opposed to recurrent networks) have no


connections that loop.
• Learn the weights for a multilayer network.
• Backpropagation stands for “backward propagation of errors”.
• Given a network with a fixed architecture (neurons and
interconnections).
• Use Gradient descent to minimize the squared error between the
network output value 𝑜 and the ground truth 𝑦.
• We suppose there are multiple outputs, indexed by k.
• Challenge: Search in all possible weight values for all neurons in the
network.

162
Feedforward-Backpropagation
• e = (x, y): a training example, where x is the feature vector and y is the label.

[Figure: the example is fed forward through the network; weight ω_ij connects neuron i to neuron j, and the network output for example e is o_e = f(x).]

163
Backpropagation rules

• We consider 𝑘 outputs
• For an example e defined by (𝑥, 𝑦), the error on training example 𝑒,
summed over all output neurons in the network is:
  E_e(ω) = (1/2) Σ_k (y_k − o_k)^2

• Remember, gradient descent iterates through the training examples one at a time, descending the gradient of the error w.r.t. this example:

  Δω_ij = −α ∂E_e(ω) / ∂ω_ij

164
Backpropagation rules

Notations:
• x_ij: the i-th input to neuron j.
• ω_ij: the weight associated with the i-th input to neuron j.
• z_j = Σ_i ω_ij x_ij, the weighted sum of inputs for neuron j.
• o_j: the output computed by neuron j.
• g: the sigmoid function.
• outputs: the set of neurons in the output layer.
• Succ(j): the set of neurons whose immediate inputs include the output of neuron j.

165
Backpropagation notations

[Figure: neuron i feeds neuron j through weight ω_ij; neuron j computes z_j = Σ_i ω_ij x_ij and outputs o_j = g(z_j).]
Backpropagation rules

  ∂E_e/∂ω_ij = (∂E_e/∂z_j) (∂z_j/∂ω_ij) = (∂E_e/∂z_j) x_ij

  Δω_ij = −α (∂E_e/∂z_j) x_ij

• We consider two cases in calculating ∂E_e/∂z_j (let's drop the index e):

  - Case 1: Neuron j is an output neuron

  - Case 2: Neuron j is a hidden neuron

167
Backpropagation rules

• Case 1: Neuron 𝒋 is an output neuron

168
Backpropagation rules
  ∂E/∂z_j = −(y_j − o_j) o_j (1 − o_j)

  Δω_ij = α (y_j − o_j) o_j (1 − o_j) x_ij

• We will denote
  δ_j = −∂E/∂z_j
  so that Δω_ij = α δ_j x_ij

169
Backpropagation rules
• Case 2: Neuron 𝒋 is a hidden neuron

170
Backpropagation algorithm
• Input: training examples (x, y), learning rate α (e.g., α = 0.1), and n_i, n_h, n_o.
• Output: a neural network with one input layer, one hidden layer and one output layer, with n_i, n_h and n_o neurons respectively, and all its weights.
1. Create a feedforward network (n_i, n_h, n_o).
2. Initialize all weights to a small random number (e.g., in [-0.2, 0.2]).
3. Repeat until convergence:
   (a) For each training example (x, y):
       i. Feed forward: propagate example x through the network and compute the output o_j of every neuron.
       ii. Propagate backward: propagate the errors backward.
           Case 1: For each output neuron k, calculate its error
             δ_k = o_k (1 − o_k) (y_k − o_k)
           Case 2: For each hidden neuron h, calculate its error
             δ_h = o_h (1 − o_h) Σ_{k ∈ Succ(h)} ω_hk δ_k
       iii. Update each weight: ω_ij ← ω_ij + α δ_j x_ij
171
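A compact sketch of the algorithm above for a network with one hidden layer, trained on XOR; the layer sizes, learning-rate range and weight initialization follow the slide, while the data, epoch count and seed are illustrative choices (with so few hidden units a run may occasionally settle in a poor local minimum; another seed usually fixes it).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR data; a constant bias input of 1 is prepended to x and to the hidden outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

n_i, n_h, n_o, alpha = 2, 2, 1, 0.5
W_h = rng.uniform(-0.2, 0.2, (n_h, n_i + 1))       # hidden-layer weights (incl. bias)
W_o = rng.uniform(-0.2, 0.2, (n_o, n_h + 1))       # output-layer weights (incl. bias)

for epoch in range(20000):
    for x, y in zip(X, Y):
        # i. Feed forward
        xb = np.r_[1.0, x]
        o_h = sigmoid(W_h @ xb)
        hb = np.r_[1.0, o_h]
        o_k = sigmoid(W_o @ hb)
        # ii. Propagate errors backward
        delta_k = o_k * (1 - o_k) * (y - o_k)                  # output neurons
        delta_h = o_h * (1 - o_h) * (W_o[:, 1:].T @ delta_k)   # hidden neurons
        # iii. Update each weight: w_ij <- w_ij + alpha * delta_j * x_ij
        W_o += alpha * np.outer(delta_k, hb)
        W_h += alpha * np.outer(delta_h, xb)

preds = [sigmoid(W_o @ np.r_[1.0, sigmoid(W_h @ np.r_[1.0, x])])[0] for x in X]
print(np.round(preds, 2))                          # should approach [0, 1, 1, 0]
```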
Observations

• Convergence: small changes in the weights.


• There are other activation functions. The hyperbolic tangent is often better in practice for neural networks, as its outputs range from −1 to 1.

  g(x) = sigmoid(x) = e^{kx} / (1 + e^{kx}),  for k = 1, k = 2, etc.

  g(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})   (it is a rescaling of the logistic sigmoid!)
Multiclass case

• Nowadays, networks with more than two layers, a.k.a. deep


networks, have proven to be very effective in many domains.
• Examples of deep networks: convolutional NN, recurrent NN, etc.

173
MNIST database
Experiments with the MNIST database
• The MNIST database of handwritten digits
• Training set of 60,000 examples, test set of 10,000 examples
• Vectors in R^784 (28×28 images)
• Labels are the digits they represent
• Various methods have been tested with this training set and test set

• Linear models: 7% - 12% error


• KNN: 0.5%- 5% error
• Neural networks: 0.35% - 4.7% error
• Convolutional NN: 0.23% - 1.7% error
174
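For readers who want to reproduce a result in the neural-network range above, this hedged sketch trains a small scikit-learn MLP on MNIST fetched from OpenML; the architecture and iteration count are illustrative, and the exact error rate will depend on them.

```python
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

# MNIST: 70,000 images of 28x28 = 784 pixels, labels '0'..'9'.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0                                      # scale pixel values to [0, 1]

# Conventional split: first 60,000 for training, last 10,000 for testing.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=20, random_state=0)
mlp.fit(X_train, y_train)
print("test error:", 1 - mlp.score(X_test, y_test))
```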
SUMMARY

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

175
Nhân bản – Phụng sự – Khai phóng (Humanity – Service – Liberation)

Enjoy the Course…!

176
