
Nhân bản – Phụng sự – Khai phóng (Humanity – Service – Liberation)

Machine Learning
Artificial Intelligence
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

2
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

3
…Introduction to ML

• Is this a benign or malignant cell?

4
Introduction to ML

• Use ML to make prediction

5
…Introduction to ML

• Use ML to make prediction

6
…Introduction to ML

• Use ML to make prediction… such as predicting the right drug

7
…Introduction to ML

• Use ML to make decisions

8
…Introduction to ML

• Use ML for customer segmentation

9
…Introduction to ML

• Use ML for recommendation systems

10
…Introduction to ML

• Build ML models

11
…Introduction to ML

• Build ML models

12
…Introduction to ML

• Build ML models

13
…Introduction to ML

• Build ML models

14
…Introduction to ML

• Major ML techniques (1)

15
…Introduction to ML

• Major ML techniques (2)

16
…Introduction to ML
• What is ML?

17
…Introduction to ML
• How does ML relate to AI?

Three components of ML

ML vs AI
https://vas3k.com/blog/machine_learning/
18
…ML

• Overview

The map of
ML world

19
…Introduction to ML

• Overview

(Ensemble Learning)

https://vas3k.com/blog/machine_learning/
20
…Introduction to ML

• Overview – (1).Classical ML

21
…Introduction to ML

• Overview – (2).Reinforcement Learning


"Throw a robot into a maze and let it find an exit"

Nowadays it is used for:

• Self-driving cars
• Robot vacuums
• Games
• Automated trading
• Enterprise resource management

Popular algorithms:
Q-Learning, SARSA, DQN, A3C,
Genetic algorithm

22
…Introduction to ML

• Overview – (3). Ensemble Methods


(Ensemble Learning)
"Bunch of stupid trees learning to correct errors of each other"

Nowadays it is used for:


• Everything that fits classical
algorithm approaches (but works better)
• Search systems
• Computer vision
• Object detection

Popular algorithms:
• Random Forest
• Gradient Boosting
(an ensemble learning method)

23
…Introduction to ML

• Overview – (4). Neural Networks and Deep Learning


"We have a thousand-layer network, dozens of video cards,
but still no idea where to use it. Let's generate cat pics!"

Used today for:


• Replacement of all algorithms above
• Object identification on photos and videos
• Speech recognition and synthesis
• Image processing, style transfer
• Machine translation

Popular architectures:
• Perceptron
• Convolutional Network (CNN)
• Recurrent Networks (RNN)
• Autoencoders
24
…Introduction to ML

• Supervised vs Unsupervised
(classical ML)

• Supervised learning:
• Learning a model from labeled data.
• Unsupervised learning:
• Learning a model from unlabeled data
25
…Introduction to ML

• Teaching the model with labeled data

26
…Introduction to ML

• Supervised Learning (classical ML)

There are two types


• Regression
• prediction of a specific point on
a numeric axis.

• Classification
• an object's category prediction

27
…Introduction to ML

• Type of Supervised learning

28
…Introduction to ML

• Supervised Learning - Regression (prediction)

29
…Introduction to ML

• Supervised learning
• Regression is the process of predicting continuous values

30
…Introduction to ML

• Supervised learning
• Classification is the process of predicting discrete class labels/categories

31
…Introduction to ML

• Supervised Learning - Classification


“The machine has a "supervisor/teacher" who gives the machine all the answers, like whether it's a cat or a dog.
The teacher has divided (labeled) the data into cats and dogs, and the machine is using these examples to learn.”

32
…Introduction to ML

• Supervised Learning - Classification

33
…Introduction to ML

• Unsupervised learning

34
…Introduction to ML

• Unsupervised learning
• Clustering is grouping of data points or objects

35
…Introduction to ML

• Supervised vs Unsupervised

36
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

37
Supervised learning

• Given a training set of example input–output pairs:


(x_1, y_1), …, (x_n, y_n), where x_i ∈ R^d

38
Supervised learning
• Given a training set of example input–output pairs:
  (x_1, y_1), …, (x_n, y_n), where x_i ∈ R^d
• Regression:
  - y is a real value, y ∈ R
  - f: R^d → R
  - f is called a regressor.
• Classification:
  - y is discrete. To simplify, y ∈ {−1, +1}
  - f: R^d → {−1, +1}
  - f is called a binary classifier.

39
Supervised learning

• Regression:

• Example: income as a function of age; weight of a fruit as a function of its length.

40
Supervised learning

• Regression:

• Example: income as a function of age; weight of a fruit as a function of its length.

41
Supervised learning

• Regression:

• Example: income as a function of age; weight of a fruit as a function of its length.

42
Supervised learning

• Classification:

43
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network
• Recommender Systems

44
Linear Regression: History

• A very popular technique.


• Rooted in Statistics.
• The method of least squares was used as early as 1795 by Gauss.
• Re-invented in 1805 by Legendre.
• Frequently applied in astronomy to study the large-scale structure of the universe. [Portrait: Carl Friedrich Gauss]

• Still a very useful tool today.

45
Linear Regression

• Given a training set of example input–output pairs:


(x_1, y_1), …, (x_n, y_n), where x_i ∈ R^d and y_i ∈ R

• Task: Learn a regression function:


  - h: R^d → R
  - h(x) = y
• A regression model is said to be linear if it is represented by a linear function.
46
Linear Regression

• Univariate linear regression model: h(x) = ω_0 + ω_1 x
• Multivariable linear regression model: h(x) = ω_0 + ω_1 x_1 + … + ω_d x_d
  - the ω are the weights
  - w is the weight vector, e.g. (ω_0, ω_1) in the univariate case
  - Learning the linear model → learning the ω (a small fitting sketch follows after this slide)
47
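To make the weight-learning step concrete, here is a minimal sketch in Python/NumPy that fits the univariate model h(x) = ω_0 + ω_1 x by ordinary least squares; the data and variable names are illustrative, not taken from the slides.

```python
import numpy as np

# Illustrative data: x could be age, y could be income (values are made up).
x = np.array([22, 25, 30, 35, 40, 45, 50], dtype=float)
y = np.array([18, 24, 31, 38, 45, 49, 55], dtype=float)

# Closed-form least-squares estimates for h(x) = w0 + w1 * x.
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(f"h(x) = {w0:.2f} + {w1:.2f} * x")
print("prediction for x = 28:", w0 + w1 * 28)
```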
Linear Regression

a line in R^2, a hyperplane in R^3

48
Introduction to Regression

49
…Introduction to Regression

• Regression model

50
…Introduction to Regression

• Applications of regression

• Regression algorithms

51
…Introduction to Regression

• Type of regression models

52
Simple Linear Regression

• Linear regression topology

53
Simple Linear Regression

• Using linear regression to predict continuous values

54
Simple Linear Regression

• Using linear regression to predict continuous values

55
Simple Linear Regression

• Linear regression model representation

56
Simple Linear Regression

• How to find the best fit?

57
Simple Linear Regression

• How to find the best fit?

58
Simple Linear Regression

• Estimating the parameters

59
Simple Linear Regression

• Estimating the parameters

60
Simple Linear Regression

• Estimating the parameters

61
Simple Linear Regression

• Predictions with linear regression

62
Simple Linear Regression

• Pros of linear regression

63
Multiple Linear Regression

• Types of regression models

64
Multiple Linear Regression

• Predicting continuous values with Multiple Linear Regression

65
Multiple Linear Regression

• Predicting continuous values with Multiple Linear Regression

66
Multiple Linear Regression

• Predicting continuous values with Multiple Linear Regression

67
Multiple Linear Regression

• Estimating multiple linear regression parameters

68
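The parameter-estimation slide above is shown as a figure in the original deck; as a hedged sketch, the multiple-regression weights can be obtained as the least-squares solution of the bias-augmented linear system, for example with NumPy (the toy data below is invented for illustration).

```python
import numpy as np

# Toy design matrix: two features per example (invented values).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.2, 9.1, 10.0])

# Add a column of ones so that w[0] acts as the intercept w0.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution of Xb @ w ≈ y (equivalent to the normal equations).
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("weights (w0, w1, w2):", w)
print("prediction for x = [2.5, 1.0]:", np.array([1.0, 2.5, 1.0]) @ w)
```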
Multiple Linear Regression

• Making predictions with multiple linear regression

69
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network
• Recommender Systems

70
Bayes Rule

• Writing p(A ∧ B) in two different ways:
  p(A ∧ B) = p(A|B) p(B) = p(B|A) p(A)   ⇒   p(A|B) = p(B|A) p(A) / p(B)
• p(A|B) is called the posterior (the posterior distribution of A given B).
• p(A) is called the prior.
• p(B) is called the evidence.
• p(B|A) is called the likelihood.
71
Bayes Rule

• This table divides the sample space into 4 mutually exclusive events.
• The probabilities in the margins are called marginals and are calculated by summing across the rows and the columns.
• Another form:
  p(A|B) = p(B|A) p(A) / ( p(B|A) p(A) + p(B|¬A) p(¬A) )

72
Example of Using Bayes Rule

• A: patient has cancer.


• B: patient has a positive lab test.

73
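The numbers for this example are not visible in the deck, so here is a worked calculation with assumed values (illustrative only): suppose p(cancer) = 0.008, p(+|cancer) = 0.98, and p(+|¬cancer) = 0.03. Then p(+) = 0.98 × 0.008 + 0.03 × 0.992 ≈ 0.0376, and p(cancer|+) = (0.98 × 0.008) / 0.0376 ≈ 0.21. Even after a positive test, the posterior probability of cancer is only about 21%, because the prior p(cancer) is so small.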
Why probabilities?

• Why are we bringing here a Bayesian framework?

• Recall Classification framework:

  - Given: training data (x_1, y_1), …, (x_n, y_n), where x_i ∈ R^d and y_i ∈ Y

  - Task: learn a classification function f: R^d → Y

• Learn a mapping from 𝑥 to 𝑦.

• We want to find this mapping 𝑓(𝑥) = 𝑦 through 𝑝(𝑦|𝑥)!

74
Discriminative Algorithms

• Idea: model 𝑝 𝑦|𝑥 , conditional distribution of 𝑦 given 𝑥.


• In Discriminative Algorithms: find a decision boundary that separates positive from negative examples.
• To predict a new example, check on which side of the decision
boundary it falls.
• Model 𝑝 𝑦|𝑥 directly.

75
Generative Algorithms

• Idea: Build a model for what positive examples look like. Build a
different model for what negative example look like.
• To predict a new example, match it with each of the models and
see which match is best.
• Model p(x|y) and p(y)!
• Use Bayes rule to obtain: p(y|x) = p(x|y) p(y) / p(x)
• To make a prediction: ŷ = argmax_y p(y|x) = argmax_y p(x|y) p(y)

76
Naive Bayes Classifier

• Probabilistic model.
• Highly practical method.
• Application domains include natural language text documents.
• Naive because of the strong independence assumption it makes (not
realistic).
• Simple model.
• A strong method that can be comparable to decision trees and neural networks in some cases.

77
Setting

• A training example is a pair (x_i, y_i), where x_i is a feature vector and y_i is a discrete label.


• 𝑑 features, and 𝑛 examples.
• E.g., consider document classification: each example is a document, and each feature represents the presence or absence of a particular word in the document.
• We have a training set.
• A new example arrives with feature values x_new = (a_1, a_2, …, a_d).
• We want to predict the label 𝑦𝑛𝑒𝑤 of the new example.

78
Setting

• Use Bayes rule to obtain:
  p(y | a_1, a_2, …, a_d) = p(a_1, a_2, …, a_d | y) p(y) / p(a_1, a_2, …, a_d)

• Can we estimate these two terms from the training data?


1. p(y) can be easy to estimate: count the frequency with which each label y occurs.
2. p(a_1, a_2, …, a_d | y) is not easy to estimate unless we have a very large sample. (We need to see every example many times to get reliable estimates.)
79
Naive Bayes Classifier

• Makes a simplifying assumption that the feature values are conditionally independent given the label.
• Given the label of the example, the probability of observing the conjunction a_1, a_2, …, a_d is the product of the probabilities for the individual features:
  p(a_1, a_2, …, a_d | y) = Π_j p(a_j | y)
• Naive Bayes Classifier:
  y_new = argmax_{y ∈ Y} p(y) Π_j p(a_j | y)

• Can we estimate these two terms from the training data?


Yes!

80
Algorithm

• Learning: Based on the frequency counts in the dataset:


1. Estimate p(y) for all y ∈ Y.
2. Estimate p(a_j | y) for all y ∈ Y and all feature values a_j.
• Classification: For a new example, use:
  y_new = argmax_{y ∈ Y} p(y) Π_j p(a_j | y)

• Note: No model per se or hyperplane, just count the frequencies of


various data combinations within the training examples.

81
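As a sketch of the frequency-counting algorithm above, the following Python snippet estimates p(y) and p(a_j | y) from a small categorical dataset and classifies a new example with the argmax rule; the feature names and data are invented for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

# Invented training data: (features, label); all features are categorical.
data = [
    ({"outlook": "sunny", "windy": "no"},  "yes"),
    ({"outlook": "sunny", "windy": "yes"}, "no"),
    ({"outlook": "rain",  "windy": "no"},  "yes"),
    ({"outlook": "rain",  "windy": "yes"}, "no"),
    ({"outlook": "sunny", "windy": "no"},  "yes"),
]

# Learning: estimate p(y) and p(a_j | y) by counting frequencies.
label_counts = Counter(y for _, y in data)
feat_counts = defaultdict(Counter)                  # (feature, label) -> value counts
for x, y in data:
    for feat, val in x.items():
        feat_counts[(feat, y)][val] += 1

def p_label(y):
    return label_counts[y] / len(data)

def p_feature(feat, val, y):
    return feat_counts[(feat, y)][val] / label_counts[y]   # no smoothing, for simplicity

# Classification: y_new = argmax_y p(y) * prod_j p(a_j | y)
def predict(x_new):
    def score(y):
        p = p_label(y)
        for feat, val in x_new.items():
            p *= p_feature(feat, val, y)
        return p
    return max(label_counts, key=score)

print(predict({"outlook": "sunny", "windy": "no"}))   # expected: "yes"
```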
Example

• Can we predict the class of the new example?

82
Solution

• Conditional probabilities:

y_new = yes
83
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

84
Tree Classifiers

• Popular classification methods.


• Easy to understand, simple algorithmic approach.
• No assumption about linearity.
• History:
  - CART (Classification And Regression Trees): Friedman 1977.
  - ID3 and C4.5 family: Quinlan 1979-1983.
  - Refinements in the mid-1990s (e.g., pruning, numerical features, etc.).
• Applications:
  - Botany (e.g., New Flora of the British Isles, Stace 1991).
  - Medical research (e.g., Pima Indian diabetes diagnosis, early diagnosis of acute myocardial infarction).
  - Computational biology (e.g., interactions between genes).
85
Tree Classifiers

• The terminology "tree" is graphic; however, a decision tree is grown from the root downward.
• The idea is to send the examples down the tree, using the concept of information entropy.

• General Steps to build a tree:


1. Start with the root node, which holds all the examples.
2. Greedily select the next best feature to build the branches. The splitting criterion is node purity.
3. The majority class is assigned to the leaves.
86
Classification

• Given a training set of example input–output pairs:


(x_1, y_1), …, (x_n, y_n)
where x_i ∈ R^d and y_i is discrete (categorical/qualitative), y_i ∈ Y.
E.g., Y = {−1, +1} or Y = {0, 1}.
• Task: Learn a classification function:
  f: R^d → Y

• In the case of Tree Classifiers:


1. No need for x_i ∈ R^d, so no need to turn categorical features into numerical features.
2. The model is a tree.

87
Example

88
Example

89
Splitting criteria

1. The central choice is selecting the next attribute to split on.
2. We want a criterion that measures the homogeneity or impurity of the examples in a node:
   a. It quantifies the mix of classes at each node.
   b. It is maximal if there is an equal number of examples from each class.
   c. It is minimal if the node is pure.
3. A measure from Information Theory with exactly these properties is entropy. For a node with a proportion p_+ of positive and p_- of negative examples:
   Entropy = −p_+ log2 p_+ − p_- log2 p_-
90
Splitting criteria

• In general, for c classes: Entropy(S) = −Σ_{i=1}^{c} p_i log2 p_i, where p_i is the proportion of examples of class i in S.

91
Splitting criteria

• Now each node has some entropy that measures the homogeneity of the node.
• How do we decide which attribute is best to split on, based on entropy?
• We use the Information Gain, which measures the expected reduction in entropy caused by partitioning the examples according to an attribute A (a code sketch follows after this slide):
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
92
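A minimal sketch of the two quantities just defined, for a dataset stored as Python lists of attribute dictionaries and labels; the attribute and label names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the classes present in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for value in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Illustrative toy data (not from the slides).
examples = [{"wind": "weak"}, {"wind": "strong"}, {"wind": "weak"}, {"wind": "weak"}]
labels = ["yes", "no", "yes", "no"]
print(entropy(labels))                              # 1.0: perfectly mixed classes
print(information_gain(examples, labels, "wind"))   # about 0.31
```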
93
94
95
96
Decision trees

• At the first split starting from the root, we choose the attribute that
has the max gain.
• Then, we restart the same process at each of the child nodes (if the node is not pure).

97
Decision trees

Pros & Cons


+ Intuitive, interpretable.
+ Can be turned into rules.
+ Well-suited for categorical data.
+ Simple to build.
BUT …
- Unstable (change in an example may lead to a different tree).
- Univariate (split one attribute at a time, does not combine features).
- A choice at some node depends on the previous choices.
- Need to balance the data.

98
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

99
…Clustering

• Introduction to Clustering

100
…Clustering

• Introduction to Clustering

101
…Clustering

• Introduction to Clustering

102
…Clustering

• Introduction to Clustering

103
…Clustering

• Clustering applications

104
…Clustering

• Clustering applications

105
k-Means clustering

• Introduction to Clustering

106
…k-Means clustering

• k-Means algorithms

107
…k-Means clustering

• Determine the similarity/distance

108
…k-Means clustering

• 1-dimensional similarity/distance

109
…k-Means clustering

• Multi-dimensional similarity/distance

110
…k-Means clustering

• How does k-Means clustering work?

111
…k-Means clustering

• k-Means clustering – initialize k

112
…k-Means clustering

• k-Means clustering – distance calculation

113
…k-Means clustering

• k-Means clustering – assign to centroid

114
…k-Means clustering

• k-Means clustering – assign to centroid

115
…k-Means clustering

• k-Means clustering – compute new centroids

116
…k-Means clustering

• k-Means clustering – repeat

117
…k-Means clustering

• k-Means clustering – repeat

118
…k-Means clustering

• k-Means clustering – repeat

119
…k-Means clustering

• k-Means clustering – repeat

120
…k-Means clustering

• k-Means clustering – repeat

121
…k-Means clustering

• k-Means clustering – repeat

122
…k-Means clustering

• k-Means clustering – repeat

123
K-means Clustering

1. Select initial centroids at random.
2. Assign each object to the cluster with the nearest centroid.
3. Compute each centroid as the mean of the objects assigned to it (go to 2).
Repeat the previous two steps until no change.

124
Distance metrics

Euclidean: d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )

Cosine: sim(x, y) = (x · y) / (‖x‖ ‖y‖)

125
…k-Means clustering

• k-Means clustering – algorithm

126
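The loop on the previous slides (initialize k centroids, assign, recompute, repeat until nothing changes) can be sketched in a few lines of NumPy; this is an illustrative implementation, not the library routine used in practice, and it does not handle the rare case of an empty cluster.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select k initial centroids at random from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each object to the cluster with the nearest (Euclidean) centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the objects assigned to it.
        new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

# Two illustrative blobs in 2-D; the centroids should land near (0, 0) and (5, 5).
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
centroids, assign = kmeans(X, k=2)
print(centroids)
```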
…k-Means clustering

• k-Means clustering – choosing k

127
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means

• Model Evaluation
• Neural Network

128
Training and Testing

• Question: How can we be confident about f?

129
Training and Testing

• We calculate E_train(f), the in-sample error (training error or empirical error/risk):
  E_train(f) = (1/n) Σ_{i=1}^{n} loss(f(x_i), y_i)

• Examples of loss functions:
  - Classification error: loss(f(x_i), y_i) = 1 if f(x_i) ≠ y_i, and 0 otherwise.
  - The squared error: loss(f(x_i), y_i) = (f(x_i) − y_i)^2

130
Training and Testing

• We calculate E_train(f), the in-sample error (training error or empirical error/risk).

• We aim to have E_train(f) small, i.e., we minimize E_train(f).

• We hope that E_test(f), the out-of-sample error (test/true error), will be small too.

131
Model Evaluation in Regression Models

• Model evaluation approaches

132
Model Evaluation in Regression Models

• Best approach for most accurate result?

133
Model Evaluation in Regression Models

• Calculating the accuracy of a model

134
Model Evaluation in Regression Models

• Calculating the accuracy of a model

135
Model Evaluation in Regression Models

• Training and test on the same dataset

136
Model Evaluation in Regression Models

• Training and test on the same dataset

137
Model Evaluation in Regression Models

• Train/Test split evaluation approach

138
Model Evaluation in Regression Models

• Train/Test split evaluation approach

139
Evaluation Metrics in Regression Models

• Regression accuracy

140
Evaluation Metrics in Regression Models

• Regression accuracy

141
Evaluation Metrics in Regression Models

• Regression accuracy

142
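The regression-accuracy slides above are figures in the original deck; the snippet below computes commonly used regression metrics (MAE, MSE, RMSE, R²) with scikit-learn on a train/test split of synthetic data, as a hedged illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data: y = 3x + noise (for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=200)

# Train/test split: fit on one part, evaluate out-of-sample on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R^2 :", r2_score(y_test, y_pred))
```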
Evaluation Metrics in Classification

• Classification accuracy

143
Evaluation Metrics in Classification

• Jaccard index

144
Confusion matrix

145
Evaluation Metrics in Classification

• F1-score

146
Evaluation Metrics in Classification

• Log loss

147
Evaluation Metrics in Classification

• Log loss

148
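As a sketch of the classification metrics named on these slides (accuracy, Jaccard index, confusion matrix, F1-score, log loss), the following scikit-learn snippet computes them on a synthetic binary problem; the dataset and classifier are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, jaccard_score, confusion_matrix,
                             f1_score, log_loss)

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)            # class probabilities, needed for log loss

print("Accuracy        :", accuracy_score(y_test, y_pred))
print("Jaccard index   :", jaccard_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("F1-score        :", f1_score(y_test, y_pred))
print("Log loss        :", log_loss(y_test, y_prob))
```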
CONTENTs

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

149
CONTENTS

• Neural Networks
• Perceptron
• From perceptron to MLP
• Backpropagation algorithm
• Multiclass case
• MNIST database

150
Neural Networks

• Algorithms that try to mimic how the brain functions.


• Worked extremely well to recognize:
  1. handwritten characters (LeCun et al. 1989),
  2. spoken words (Lang et al. 1990),
  3. faces (Cottrell 1990)
• Extensively studied in the 1990s with moderate success.
• Now back with lots of success in deep learning, thanks to algorithmic and computational progress.
• The first algorithm used was the Perceptron (Rosenblatt, 1958).

151
Perceptron

• Given n examples and d features, the perceptron computes, for an example x_i:

  f(x_i) = sign( Σ_{j=0}^{d} ω_j x_ij )

152
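A minimal sketch of the perceptron above: prediction is sign(Σ_{j=0}^{d} ω_j x_ij) with x_i0 = 1 acting as the bias input, and the weights are adjusted with the standard perceptron update rule on mistakes; the toy data is invented.

```python
import numpy as np

def predict(w, x):
    # f(x) = sign(sum_j w_j * x_j), with x[0] = 1 acting as the bias input.
    return 1 if np.dot(w, x) >= 0 else -1

def train_perceptron(X, y, epochs=20, lr=1.0):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the constant input x_0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if predict(w, xi) != yi:                # update only on misclassified examples
                w += lr * yi * xi
    return w

# Linearly separable toy data: label +1 if x1 + x2 > 1, else -1.
X = np.array([[0.1, 0.2], [0.4, 0.3], [0.9, 0.8], [0.7, 0.9]])
y = np.array([-1, -1, 1, 1])
w = train_perceptron(X, y)
print(w, [predict(w, np.r_[1.0, x]) for x in X])
```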
Perceptron expressiveness

• Consider the perceptron with the step function.


• Idea: an iterative method that starts with a random hyperplane and adjusts it using your training data.
• It can represent Boolean functions such as AND, OR, NOT, but not the XOR function.
• It produces a linear separator in the input space.

153
From perceptron to MLP

• The perceptron works perfectly if data is linearly separable. If not, it


will not converge.
• Neural networks use the ability of perceptrons to represent elementary functions and combine them in a network of layers of elementary units.
• However, a cascade of linear functions is still linear, and we want
networks that represent highly non-linear functions.

154
From perceptron to MLP

• Also, the perceptron uses a threshold function, which is not differentiable and hence not suitable for gradient descent when the data is not linearly separable.
• We want a smooth, differentiable function of the inputs.
• One possibility is the sigmoid function:

  g(z) = e^z / (1 + e^z) = 1 / (1 + e^{−z})

  g(z) → 1 when z → +∞,   g(z) → 0 when z → −∞

155
Perceptron with Sigmoid

• Given 𝑛 examples and 𝑑 features.


• For an example x_i (the i-th row of the matrix of examples):

  f(x_i) = 1 / (1 + e^{−Σ_{j=0}^{d} ω_j x_ij})

156
The XOR example

• Let’s try to create a NN for the XOR function using elementary


perceptrons.
• First observe: x1 XOR x2 = (x1 OR x2) AND (x1 NAND x2).

• Let's consider that for z ≥ 10, g(z) → 1, and for z ≤ −10, g(z) → 0.

157
The OR example
• First, what is the perceptron for OR?

158
The AND and NAND example
• Similarly, we obtain the perceptrons for the AND and NAND:

• Note how the weights in the NAND are the inverse of the weights in the AND.

159
The XOR example
• Let’s try to create a NN for the XOR function using elementary
perceptrons.

160
The XOR example
• Let’s put them together...

• XOR as a combination of 3 basic perceptrons.

161
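One possible set of weights that realizes this construction under the convention "z ≥ 10 gives g(z) → 1, z ≤ −10 gives g(z) → 0"; the particular weight values below are an illustrative choice, not taken from the slides.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))                # sigmoid

def unit(w, inputs):
    # One perceptron: w[0] is the bias weight (constant input x_0 = 1).
    return g(w[0] + np.dot(w[1:], inputs))

W_OR   = np.array([-10.0,  20.0,  20.0])
W_AND  = np.array([-30.0,  20.0,  20.0])
W_NAND = np.array([ 30.0, -20.0, -20.0])

def xor(x1, x2):
    h1 = unit(W_OR,   [x1, x2])                    # x1 OR x2
    h2 = unit(W_NAND, [x1, x2])                    # x1 NAND x2
    return unit(W_AND, [h1, h2])                   # (x1 OR x2) AND (x1 NAND x2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor(a, b)))              # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```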
Backpropagation algorithm

• Note: Feedforward NN (as opposed to recurrent networks) have no


connections that loop.
• Learn the weights for a multilayer network.
• Backpropagation stands for “backward propagation of errors”.
• Given a network with a fixed architecture (neurons and
interconnections).
• Use Gradient descent to minimize the squared error between the
network output value 𝑜 and the ground truth 𝑦.
• We suppose there are multiple outputs, indexed by k.
• Challenge: Search in all possible weight values for all neurons in the
network.

162
Feedforward-Backpropagation
• e = (x, y): a training example, where x is the feature vector and y is the label.

[Figure: the example is fed forward through the network; weight ω_ij connects neuron i to neuron j, and the network output for example e is o_e = f(x).]

163
Backpropagation rules

• We consider 𝑘 outputs
• For an example e defined by (𝑥, 𝑦), the error on training example 𝑒,
summed over all output neurons in the network is:
  E_e(ω) = (1/2) Σ_k (y_k − o_k)^2

• Remember, gradient descent iterates through the training examples one at a time, descending the gradient of the error w.r.t. this example:

  Δω_ij = −α ∂E_e(ω) / ∂ω_ij

164
Backpropagation rules

Notations:
• x_ij: the i-th input to neuron j.
• ω_ij: the weight associated with the i-th input to neuron j.
• z_j = Σ_i ω_ij x_ij, the weighted sum of inputs for neuron j.
• o_j: the output computed by neuron j.
• g: the sigmoid function.
• outputs: the set of neurons in the output layer.
• Succ(j): the set of neurons whose immediate inputs include the output of neuron j.

165
Backpropagation notations

[Figure: neuron i feeds neuron j through weight ω_ij; neuron j computes z_j = Σ_i ω_ij x_ij and outputs o_j = g(z_j).]
Backpropagation rules

  ∂E_e/∂ω_ij = (∂E_e/∂z_j) (∂z_j/∂ω_ij) = (∂E_e/∂z_j) x_ij

  Δω_ij = −α (∂E_e/∂z_j) x_ij

• We consider two cases in calculating ∂E_e/∂z_j (let's drop the index e):

  - Case 1: Neuron j is an output neuron

  - Case 2: Neuron j is a hidden neuron

167
Backpropagation rules

• Case 1: Neuron 𝒋 is an output neuron

168
Backpropagation rules
  ∂E/∂z_j = −(y_j − o_j) o_j (1 − o_j)

  Δω_ij = α (y_j − o_j) o_j (1 − o_j) x_ij

• We will denote
  δ_j = −∂E/∂z_j
  so that Δω_ij = α δ_j x_ij

169
Backpropagation rules
• Case 2: Neuron 𝒋 is a hidden neuron

170
Backpropagation algorithm
• Input: training examples (x, y), learning rate α (e.g., α = 0.1), and n_i, n_h, n_o.
• Output: a neural network with one input layer, one hidden layer and one output layer, with n_i, n_h and n_o neurons respectively, and all its weights.
1. Create a feedforward network (n_i, n_h, n_o).
2. Initialize all weights to a small random number (e.g., in [-0.2, 0.2]).
3. Repeat until convergence:
   (a) For each training example (x, y):
       i. Feed forward: propagate example x through the network and compute the output o_j of every neuron.
       ii. Propagate backward: propagate the errors backward.
           Case 1: For each output neuron k, calculate its error
             δ_k = o_k (1 − o_k) (y_k − o_k)
           Case 2: For each hidden neuron h, calculate its error
             δ_h = o_h (1 − o_h) Σ_{k ∈ Succ(h)} ω_hk δ_k
       iii. Update each weight: ω_ij ← ω_ij + α δ_j x_ij
171
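A compact sketch of the algorithm above for a network with one hidden layer, trained on XOR; the layer sizes, learning-rate range and weight initialization follow the slide, while the data, epoch count and seed are illustrative choices (with so few hidden units a run may occasionally settle in a poor local minimum; another seed usually fixes it).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR data; a constant bias input of 1 is prepended to x and to the hidden outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

n_i, n_h, n_o, alpha = 2, 2, 1, 0.5
W_h = rng.uniform(-0.2, 0.2, (n_h, n_i + 1))       # hidden-layer weights (incl. bias)
W_o = rng.uniform(-0.2, 0.2, (n_o, n_h + 1))       # output-layer weights (incl. bias)

for epoch in range(20000):
    for x, y in zip(X, Y):
        # i. Feed forward
        xb = np.r_[1.0, x]
        o_h = sigmoid(W_h @ xb)
        hb = np.r_[1.0, o_h]
        o_k = sigmoid(W_o @ hb)
        # ii. Propagate errors backward
        delta_k = o_k * (1 - o_k) * (y - o_k)                  # output neurons
        delta_h = o_h * (1 - o_h) * (W_o[:, 1:].T @ delta_k)   # hidden neurons
        # iii. Update each weight: w_ij <- w_ij + alpha * delta_j * x_ij
        W_o += alpha * np.outer(delta_k, hb)
        W_h += alpha * np.outer(delta_h, xb)

preds = [sigmoid(W_o @ np.r_[1.0, sigmoid(W_h @ np.r_[1.0, x])])[0] for x in X]
print(np.round(preds, 2))                          # should approach [0, 1, 1, 0]
```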
Observations

• Convergence: small changes in the weights.


• There are other activation functions. The hyperbolic tangent is often better in practice for neural networks, as its outputs range from −1 to 1.

  g(x) = sigmoid(x) = e^{kx} / (1 + e^{kx}),  for k = 1, k = 2, etc.

  g(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})   (it is a rescaling of the logistic sigmoid!)
Multiclass case

• Nowadays, networks with more than two layers, a.k.a. deep


networks, have proven to be very effective in many domains.
• Examples of deep networks: convolutional NN, recurrent NN, etc.

173
MNIST database
Experiments with the MNIST database
• The MNIST database of handwritten digits
• Training set of 60,000 examples, test set of 10,000 examples
• Vectors in R^784 (28×28 images)
• Labels are the digits they represent
• Various methods have been tested with this training set and test set

• Linear models: 7% - 12% error


• KNN: 0.5%- 5% error
• Neural networks: 0.35% - 4.7% error
• Convolutional NN: 0.23% - 1.7% error
174
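For readers who want to reproduce a result in the neural-network range above, this hedged sketch trains a small scikit-learn MLP on MNIST fetched from OpenML; the architecture and iteration count are illustrative, and the exact error rate will depend on them.

```python
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

# MNIST: 70,000 images of 28x28 = 784 pixels, labels '0'..'9'.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0                                      # scale pixel values to [0, 1]

# Conventional split: first 60,000 for training, last 10,000 for testing.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=20, random_state=0)
mlp.fit(X_train, y_train)
print("test error:", 1 - mlp.score(X_test, y_test))
```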
SUMMARY

• Introduction to ML
• Supervised learning
• Regression: Linear Regression
• Classification: Naïve Bayes, Decision trees
• Unsupervised learning
• Clustering: K-Means
• Model Evaluation
• Neural Network

175
Nhân bản – Phụng sự – Khai phóng (Humanity – Service – Liberation)

Enjoy the Course…!

176
