CS115 - Math for CompSci
Ngoc Hoang Luong (UIT)
September 8, 2022
What is machine learning?
Supervised Learning - Classification
Figure: an image classifier maps an input image to the label "cat".
Supervised Learning - Regression
Learning a mapping $f_\theta : \mathbb{R}^N \to \mathbb{R}$.
Figure: a regression model predicting a real-valued output, e.g., $50,312.
Supervised Learning - Structured Prediction
"CS115
dễ như
ăn kẹo."
Ngoc Hoang Luong (UIT) CS115 - Math for CompSci September 8, 2022 6 / 38
Supervised learning
Figure: Classifying Iris flowers into 3 subspecies: setosa, versicolor and virginica.
Classification
In image classification:
The input space X is the set of images, which is very high-dimensional.
For a color image with C = 3 channels (RGB) and D1 × D2 pixels, we have $\mathcal{X} = \mathbb{R}^D$, where $D = C \times D_1 \times D_2$. For example, a 224 × 224 RGB image has D = 3 × 224 × 224 = 150,528 dimensions.
Learning a mapping f : X → Y from images to labels is therefore challenging.
Classification
Figure source: https://blog.ml.cmu.edu/2019/03/29/building-machine-learning-models-via-comparisons/
The Iris dataset is a collection of 150 labeled examples of Iris flowers, 50 of each type, described by 4 features:
sepal length, sepal width, petal length, and petal width.
The input space is much lower-dimensional: $\mathcal{X} = \mathbb{R}^4$.
These numeric features are highly informative.
Figure: A decision tree on Iris dataset, using just petal length and width features.
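As a small illustration (not the exact tree in the figure, since the learned thresholds may differ), the sketch below fits a depth-2 decision tree on just the two petal features using scikit-learn:

    # A minimal sketch, assuming scikit-learn is available.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    X = iris.data[:, 2:4]          # petal length, petal width (cm)
    y = iris.target                # 0=setosa, 1=versicolor, 2=virginica

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["petal length", "petal width"]))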
Learning a classifier
Empirical risk minimization
The empirical risk is the average loss of the predictor on the training set:
$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(x_n; \theta))$
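A minimal sketch of this average (the names loss, f, and theta are hypothetical placeholders for a loss function and a parametric predictor):

    import numpy as np

    def empirical_risk(loss, f, theta, X, y):
        """Average loss of the predictor f(.; theta) over the training set."""
        return np.mean([loss(y_n, f(x_n, theta)) for x_n, y_n in zip(X, y)])

    # Example with the squared-error loss on a linear model.
    sq_loss = lambda y, y_hat: (y - y_hat) ** 2
    linear = lambda x, theta: theta @ x

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    y = np.array([5.0, 6.0])
    print(empirical_risk(sq_loss, linear, np.array([1.0, 1.0]), X, y))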
Uncertainty
We can’t perfectly predict the exact output given the input, due to
lack of knowledge of the input-output mapping (model uncertainty)
and/or intrinsic (irreducible) stochasticity in the mapping (data uncertainty).
We capture our uncertainty using a conditional probability distribution: $p(y \mid x; \theta)$.
Uncertainty
Since $f_c(x; \theta)$ returns the probability of class label c, we require
$0 \le f_c \le 1$ for each $c$, and $\sum_{c=1}^{C} f_c = 1$.
These constraints are commonly enforced by applying the softmax function to a vector of C real-valued scores $a = (a_1, \dots, a_C)$:
$S(a) = \left[ \frac{e^{a_1}}{\sum_{c=1}^{C} e^{a_c}}, \dots, \frac{e^{a_C}}{\sum_{c=1}^{C} e^{a_c}} \right]$
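A minimal NumPy sketch of this definition, with the standard max-subtraction trick for numerical stability (an implementation detail not in the formula above):

    import numpy as np

    def softmax(a):
        """Map C real scores to a probability vector that sums to 1."""
        z = np.exp(a - np.max(a))   # subtracting max(a) avoids overflow
        return z / z.sum()

    probs = softmax(np.array([2.0, 1.0, 0.1]))
    assert np.isclose(probs.sum(), 1.0) and np.all((0 <= probs) & (probs <= 1))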
Uncertainty
The overall model is defined as
$f(x; \theta) = b + w^T x = b + w_1 x_1 + w_2 x_2 + \dots + w_D x_D$
where $\theta = (b, w)$ are the parameters of the model; $w$ are called the
weights and $b$ is called the bias. Passing this score through the sigmoid (or the softmax above) to produce class probabilities yields the model called logistic regression.
We can define $\tilde{w} = [b, w_1, \dots, w_D]$ and $\tilde{x} = [1, x_1, \dots, x_D]$ so that
$\tilde{w}^T \tilde{x} = b + w^T x$
This converts the affine function into a linear function. For simplicity, we
can then write the prediction function as
$f(x; \theta) = w^T x$
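A minimal sketch of the bias-absorption trick followed by a sigmoid to get a class probability (all names hypothetical; assumes binary labels):

    import numpy as np

    def predict_proba(w_tilde, x):
        """P(y=1 | x) for binary logistic regression with the bias absorbed."""
        x_tilde = np.concatenate(([1.0], x))   # prepend the constant feature 1
        a = w_tilde @ x_tilde                  # = b + w^T x
        return 1.0 / (1.0 + np.exp(-a))        # sigmoid

    w_tilde = np.array([0.5, 2.0, -1.0])       # [b, w1, w2]
    print(predict_proba(w_tilde, np.array([1.0, 3.0])))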
Maximum likelihood estimation
When fitting probabilistic models, we can use the negative log
probability as our loss function:
$\ell(y, f(x; \theta)) = -\log p(y \mid f(x; \theta))$
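A small sketch (hypothetical helper name) of this loss for a classifier whose output probabilities come from the softmax above:

    import numpy as np

    def nll(probs, y):
        """Negative log-likelihood of the true class y under predicted probs."""
        return -np.log(probs[y])

    probs = np.array([0.7, 0.2, 0.1])   # model's predicted class distribution
    print(nll(probs, 0))                # confident and correct -> small loss
    print(nll(probs, 2))                # wrong class -> large loss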
Regression
Regression is very similar to classification, but since the output
ŷ = f (x; θ) is real-valued, we need to use a different loss function.
Why don’t we use the misclassification rate with the zero-one loss?
$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell_{01}(y_n, \hat{y}_n) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}(y_n \neq f(x_n; \theta))$
For real-valued outputs, $\hat{y}_n$ almost never exactly equals $y_n$, so this loss is nearly always 1 and says nothing about how close the prediction is.
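A tiny sketch of the contrast (hypothetical numbers): the zero-one loss saturates, while squared error reflects distance:

    import numpy as np

    y     = np.array([3.00, 5.00, 7.00])
    y_hat = np.array([3.01, 4.90, 7.20])   # close, but never exactly equal

    zero_one = np.mean(y != y_hat)          # -> 1.0, uninformative
    mse      = np.mean((y - y_hat) ** 2)    # -> small, reflects closeness
    print(zero_one, mse)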
Linear Regression
We can fit the 1d data above using a simple linear regression model:
$f(x; \theta) = b + wx$
where $w$ is the slope and $b$ is the offset; $\theta = (w, b)$ are the parameters of
the model. By adjusting $\theta$, we can minimize the mean squared error
until we find the least squares solution:
$\hat{\theta} = \operatorname*{argmin}_{\theta} \mathrm{MSE}(\theta)$, where $\mathrm{MSE}(\theta) = \frac{1}{N} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2$
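A minimal NumPy sketch (synthetic data, not the dataset from the figure) of the least squares solution for the 1d model, via the closed-form slope/offset formulas:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)   # true w=2, b=1 plus noise

    # Closed-form least squares for f(x) = b + w*x.
    w = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b = y.mean() - w * x.mean()
    print(w, b)   # should be close to 2 and 1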
Linear Regression
If we have multiple input features, we can write
$f(x; \theta) = b + w_1 x_1 + \dots + w_D x_D = b + w^T x$
where $\theta = (w, b)$. This is called multiple linear regression.
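A short sketch (synthetic data) fitting multiple linear regression with np.linalg.lstsq, reusing the earlier trick of absorbing b as a constant feature:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))                # N=100 examples, D=3 features
    y = X @ np.array([1.5, -2.0, 0.5]) + 4.0     # true w, with b = 4

    X_tilde = np.hstack([np.ones((100, 1)), X])  # prepend the constant 1
    theta, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
    print(theta)                                 # ~[4.0, 1.5, -2.0, 0.5]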
Polynomial regression
$f(x; w) = w^T \phi(x)$
$\phi(x) = [1, x, x^2, \dots, x^D]$
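A compact sketch of polynomial regression as linear regression on the expanded features $\phi(x)$ (synthetic 1d data, degree D = 3):

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(-1, 1, 30)
    y = 1 - 2 * x + 0.5 * x**3 + rng.normal(0, 0.05, size=30)

    D = 3
    Phi = np.vander(x, D + 1, increasing=True)   # columns [1, x, x^2, x^3]
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # linear regression on phi(x)
    print(w)                                     # ~[1, -2, 0, 0.5]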
Polynomial regression
$f(x; w, V) = w^T \phi(x; V)$
Deep neural networks
$f(x; w, V) = w^T \phi(x; V)$
Typically, the final layer L is linear and has the form
$f_L(x) = w^T f_{1:L-1}(x)$
where $f_{1:L-1}(x) = f_{L-1}(\dots f_1(x) \dots)$ is the learned feature extractor $\phi(x; V)$.
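A toy NumPy sketch of this structure (random, untrained weights; all shapes and the tanh nonlinearity are illustrative assumptions): the hidden layers play the role of $\phi(x; V)$, and the final layer is a linear readout:

    import numpy as np

    rng = np.random.default_rng(3)
    V1, V2 = rng.normal(size=(16, 4)), rng.normal(size=(8, 16))  # feature layers
    w = rng.normal(size=8)                                       # final linear layer

    def feature_extractor(x):
        """f_{1:L-1}: the learned phi(x; V), here two tanh layers."""
        h1 = np.tanh(V1 @ x)
        return np.tanh(V2 @ h1)

    def f(x):
        """f_L: a linear readout on top of the learned features."""
        return w @ feature_extractor(x)

    print(f(rng.normal(size=4)))   # scalar output for a 4-d input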
Deep neural networks
Common DNN variants are:
Convolutional neural networks (CNNs) for images.
Overfitting and Generalization
We can rewrite the empirical risk equation as
$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(x_n; \theta))$
$\mathcal{L}(\theta; \mathcal{D}_{\mathrm{train}}) = \frac{1}{|\mathcal{D}_{\mathrm{train}}|} \sum_{(x,y) \in \mathcal{D}_{\mathrm{train}}} \ell(y, f(x; \theta))$
Overfitting and Generalization
In practice, we partition the data into 3 sets: a training set, a validation
set, and a test set; a minimal split sketch follows the list below.
The training set is used for model fitting.
The validation set is used for model selection.
The test set is used to estimate future performance (i.e., the
population risk).
The test set is NOT used for model fitting or model selection.
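A minimal sketch of such a three-way split (the 60/20/20 fractions are a hypothetical choice; assumes NumPy arrays X, y):

    import numpy as np

    def train_val_test_split(X, y, frac=(0.6, 0.2, 0.2), seed=0):
        """Shuffle, then cut the data into train/validation/test partitions."""
        idx = np.random.default_rng(seed).permutation(len(X))
        n_tr = int(frac[0] * len(X))
        n_va = int(frac[1] * len(X))
        tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
        return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

    # Fit on train, pick hyperparameters on validation, report risk on test only.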
No free lunch theorem
There is no single model that works best for all kinds of problems (Wolpert, 1996); we have to experiment with multiple models and pick the best one for each task.