
INT3405 Machine Learning

Lecture 1 - Introduction

Ta Viet Cuong, Le Duc Trong, Tran Quoc Long

VNU-UET

2022

Table of contents

What is Machine Learning

Probability & Random variable

Probability Distributions & Maximum Likelihood Estimation

Machine learning

Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience (Tom Mitchell).
▶ T: a task with clearly defined input and output
▶ P: a performance measure assessing how good an algorithm is on the task
▶ E: a set of experience (i.e. data) provided to the algorithm

Example
(Single) Face detection
▶ T: input = a 224x224 RGB image; output = (x1, y1, x2, y2), the top-left & bottom-right corners of the face in the input
▶ P: IoU (intersection over union)
▶ E: a set of (millions of) (image, (x1, y1, x2, y2)) pairs
Exercises: Specify T, P, E for
▶ Predicting tomorrow's weather given geographic information, satellite images, and a trailing window of past weather.
▶ Answering questions expressed in free-form text.
▶ Identifying all people depicted in an image and drawing outlines around each.
▶ Recommending products that users are likely to enjoy while browsing.

Types of Machine Learning

▶ Supervised learning: learn the input-output relationship (E = {(x_i, y_i)}, where the x_i are inputs and the y_i are desired targets)
▶ Unsupervised / self-supervised learning: learn data features, clusters, or the distribution (E = {x_i}, inputs only, no targets)
▶ Reinforcement learning: learn a good action policy for an agent in an environment (E = {(s, a) → (s′, r)}, where s, s′ are states, a is an action, and r is a reward)

Key phases in Machine Learning

Phase               Programming aspect
Data preparation    storing, retrieving, transforming data
Data modelling      model libraries, machine learning algorithms
Model training      optimization, fine-tuning, validation
Inference           deploying, logging, testing, mobile, web, API

Prerequisites for Machine Learning

Math
▶ Linear Algebra
▶ Calculus
▶ Probability and Statistics
▶ Optimization
Programming
▶ Data structures and algorithms
▶ Python/C++
▶ Libraries: numpy, pandas, scikit-learn, pytorch
▶ Frameworks: jupyter, django, fastapi, Android, iOS

Probability

Definitions:
▶ Sample space: Ω is the set of all possible outcomes or results
(of a random experiment).
▶ Event space: the set F ⊂ 2^Ω is a σ-algebra of subsets of Ω. Each element of F is an event (a subset of Ω).
▶ A σ-algebra must satisfy: (i) F ≠ ∅, (ii) A ∈ F ⇒ Ω \ A ∈ F, (iii) A_i ∈ F ∀i ⇒ ∪_{i=1}^∞ A_i ∈ F.
▶ Probability measure: a function P : F → R+ satisfying the following properties:
▶ P(Ω) = 1, P(∅) = 0
▶ A_i ∈ F, A_i ∩ A_j = ∅ ∀i ≠ j ⇒ P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)
As a result, a random experiment is fully specified by a probability triple (Ω, F, P).

Probability
Example
Consider a random experiment: A closed box contains 100
marbles, of which 40 are red and 60 are blue. Take out one marble
randomly.
▶ Sample space: Ω is the set of 100 marbles in the box.
▶ Event space: F = {∅, Ω, {red marbles}, {blue marbles}}, i.e. F contains 4 subsets of Ω. Notice that F is a σ-algebra over Ω.
▶ Probability measure: if every marble is equally likely to be taken, then
▶ P(∅) = 0, P(Ω) = 1, P(red) = 0.4, P(blue) = 0.6
▶ Event ∅: no marble is taken (happens with probability 0).
▶ Event Ω: a red or blue marble is taken (happens with probability 1).
▶ Event {red marbles}: the marble taken is red (probability 0.4).
▶ Event {blue marbles}: the marble taken is blue (probability 0.6).
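A minimal simulation sketch of this probability triple with numpy, assuming a uniform draw over the 100 marbles:

```python
import numpy as np

rng = np.random.default_rng(0)

# The box from the example: 40 red and 60 blue marbles.
box = np.array(["red"] * 40 + ["blue"] * 60)

# Repeat the random experiment many times: draw one marble uniformly.
draws = rng.choice(box, size=100_000)

print("P(red)  ~", np.mean(draws == "red"))   # close to 0.4
print("P(blue) ~", np.mean(draws == "blue"))  # close to 0.6
```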

Probability

Bayes’ theorem
Consider two events A, B with P(A) ≠ 0. Then

P(B|A) = P(A ∩ B) / P(A) = P(A|B) P(B) / P(A)

where
▶ P(B|A): the probability of event B occurring given that A is true (the posterior).
▶ P(A|B): the likelihood of A given a fixed B.
▶ P(B): the marginal or prior probability.
Independence
Two events A and B are independent iff P(A ∩ B) = P(A)P(B)

Probability
Example: COVID-19
▶ The test is correct on a sick person 90% of the time (true positive rate).
▶ The test is correct on a healthy person 99% of the time (true negative rate).
▶ 3% of the population have COVID-19.
Question: what is the probability that a random person who tests positive really is sick?
▶ Event A: positive test result.
▶ Event B: has the disease.

P(A|B) × P(B) = 0.9 × 0.03 = 0.027

P(A) = P(A|B) × P(B) + P(A|¬B) × P(¬B)
     = 0.9 × 0.03 + 0.01 × 0.97 = 0.0367

⇒ P(B|A) = 0.027 / 0.0367 ≈ 73.57%
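The same computation written out in code, using the numbers from the example above:

```python
# Bayes' theorem for the COVID-19 example.
p_B = 0.03             # prior P(B): fraction of the population with the disease
p_A_given_B = 0.90     # P(A|B): true positive rate
p_A_given_notB = 0.01  # P(A|¬B): 1 - true negative rate

# Law of total probability: P(A) = P(A|B)P(B) + P(A|¬B)P(¬B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Posterior: P(B|A) = P(A|B)P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(p_B_given_A)  # ~0.7357
```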
Random variable

A random variable X is a measurable function on the sample space:

X : Ω → R

Examples:
▶ Randomly take 10 marbles (with replacement). The number of blue marbles among the 10 taken is a random variable.
▶ Pick 1 person at random among 100 people; the height of that person is a random variable.
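A quick simulation sketch of the first example, reusing the 40-red/60-blue box from before (the second example would require assuming a height distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# X = number of blue marbles among 10 draws with replacement,
# where each draw is blue with probability 0.6.
samples = rng.binomial(n=10, p=0.6, size=10_000)

print("a few realizations of X:", samples[:10])
print("E[X] ~", samples.mean())  # close to 10 * 0.6 = 6
```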

Types of random variables

▶ Discrete: X ∈ {1, 2, . . . , C}, with parameters θ_c = P(X = c), c = 1, 2, . . . , C
▶ Continuous: X ∈ R
▶ Cumulative distribution function (CDF): F(x) = P(X ≤ x)
▶ Probability density function (PDF): p(x) = F′(x)
▶ Bayes' formula for PDFs:

p(x, y) = p(y|x) p(x) = p(x|y) p(y)

Properties of random distributions

▶ Expectation

E[X] = Σ_c c P(X = c)  (discrete),    E[X] = ∫_R x p(x) dx  (continuous)

E[f(X)] = ∫_R f(x) p(x) dx

▶ Variance

V[X] = E[(X − E[X])²]

Properties of expectation

E[aX + bY + c] = aE[X] + bE[Y] + c
V[aX] = a²V[X]
V[X] = E[X²] − (E[X])²
V[X] = V[E[X|Y]] + E[V[X|Y]]

If X, Y are independent:

E[X · Y] = E[X] · E[Y]
V[X + Y] = V[X] + V[Y]
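These identities can be checked empirically. A minimal sketch with two independent samples whose distributions are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Independent X ~ N(1, 2^2) and Y ~ Uniform(0, 1).
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = rng.uniform(0.0, 1.0, size=n)

# E[aX + bY + c] = aE[X] + bE[Y] + c
a, b, c = 3.0, -2.0, 5.0
print((a * X + b * Y + c).mean(), "~", a * X.mean() + b * Y.mean() + c)

# V[X] = E[X^2] - (E[X])^2
print(X.var(), "~", (X**2).mean() - X.mean() ** 2)

# Independence: E[XY] = E[X]E[Y] and V[X + Y] = V[X] + V[Y]
print((X * Y).mean(), "~", X.mean() * Y.mean())
print((X + Y).var(), "~", X.var() + Y.var())
```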

Properties of expectation

E_Y[E_X[X|Y]] = ∫_R (∫_R x p(x|y) dx) p(y) dy = ∫_R x p(x) dx = E[X]

Proof:

∫_R (∫_R x p(x|y) dx) p(y) dy = ∫_R ∫_R x p(x|y) p(y) dx dy
                              = ∫_R ∫_R x p(x, y) dx dy
                              = ∫_R ∫_R x p(y|x) p(x) dx dy
                              = ∫_R (∫_R p(y|x) dy) x p(x) dx    (inner integral = 1)
                              = ∫_R x p(x) dx

Bernoulli distribution

X ∈ {0, 1} with probability P(X = 1) = θ, written as X ∼ Ber(θ). We also have P(X = 0) = 1 − θ.
▶ A biased coin: θ = probability of heads
▶ Binary classification: y|x ∼ Ber(θ(x)), i.e. P(y = 1|x) = θ(x)
→ the probability of class 1 is a function of the input
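Sampling from a Bernoulli distribution is straightforward; a small sketch with an arbitrarily chosen θ:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.7  # P(X = 1), e.g. the probability of heads for a biased coin
X = rng.binomial(n=1, p=theta, size=10_000)  # Ber(theta) = Bin(1, theta)

print("fraction of 1s:", X.mean())  # close to theta = 0.7
```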

Parameter estimation

Toss a coin (sampling) N times; heads (the value 1) comes up s times. What is the parameter θ of the coin (Bernoulli distribution)?
An intuitive guess: θ = s/N. Why does this number make sense?
Let x_i ∈ {0, 1} be the value of the i-th toss.
The probability of the data D = {x_1, x_2, . . . , x_N} under the model X ∼ Ber(θ) is

L(θ) = P(D) = P(x_1, x_2, . . . , x_N) = ∏_{i=1}^N P(x_i) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i}

Maximum likelihood Estimation - MLE
L(θ) is the likelihood of θ with respect to the dataset D.
MLE: find the θ for which L(θ) is maximized.

ℓ(θ) = log L(θ) = Σ_{i=1}^N [x_i log θ + (1 − x_i) log(1 − θ)]

ℓ′(θ) = Σ_{i=1}^N [x_i/θ − (1 − x_i)/(1 − θ)] = 0

(1/θ) Σ_{i=1}^N x_i = (1/(1 − θ)) Σ_{i=1}^N (1 − x_i),   where Σ_i x_i = s and Σ_i (1 − x_i) = N − s

s(1 − θ) = (N − s)θ   ⇒   θ_MLE = s/N
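Numerically, θ_MLE = s/N is just the empirical frequency of heads. A sketch with a simulated coin whose true θ is chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true = 0.3
N = 1_000
tosses = rng.binomial(n=1, p=theta_true, size=N)  # x_i in {0, 1}

s = tosses.sum()   # number of heads
theta_mle = s / N  # the closed-form MLE derived above
print(theta_mle)   # close to 0.3
```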

How good is the MLE?

▶ Unbiased: E[θ_MLE] = θ
▶ Variance goes to 0: V[θ_MLE] = θ(1 − θ)/N → 0 as N → ∞
▶ Consistent: P{|θ_MLE − θ| ≥ ϵ} → 0 as N → ∞
▶ Asymptotic normality: √N (θ_MLE − θ) →d N(0, θ(1 − θ))
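These properties can be illustrated by repeating the N-toss experiment many times and inspecting the sampling distribution of θ_MLE; a sketch with arbitrarily chosen θ and N:

```python
import numpy as np

rng = np.random.default_rng(0)

theta, N, repeats = 0.3, 500, 20_000

# Each binomial draw is the head count s of one N-toss experiment,
# so each entry of `estimates` is one realization of theta_MLE = s/N.
estimates = rng.binomial(n=N, p=theta, size=repeats) / N

print("mean of theta_MLE:", estimates.mean())  # ~theta (unbiased)
print("empirical variance:", estimates.var())  # ~theta(1-theta)/N
print("theta(1-theta)/N  =", theta * (1 - theta) / N)
```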

Binomial distribution

The number of heads s in N independent Bernoulli trials of tossing a coin follows a Binomial distribution:

X ∼ Bin(N, θ) ⇒ P(X = s) = C(N, s) θ^s (1 − θ)^{N−s}

where C(N, s) is the binomial coefficient. If, after repeating the experiment n times, we get the data D = {s_1, s_2, . . . , s_n}, then what is a sensible value of θ? (Hint: use MLE.)

L(θ) = P(D) = P(s_1, s_2, . . . , s_n) = ∏_{i=1}^n P(s_i) = ∏_{i=1}^n C(N, s_i) θ^{s_i} (1 − θ)^{N−s_i}

Binomial distribution (cont)

ℓ(θ) = log L(θ) = const + Σ_{i=1}^n [s_i log θ + (N − s_i) log(1 − θ)]

ℓ′(θ) = Σ_{i=1}^n [s_i/θ − (N − s_i)/(1 − θ)] = 0

(1/θ) Σ_{i=1}^n s_i = (1/(1 − θ)) Σ_{i=1}^n (N − s_i)

θ_MLE = (1/n) Σ_{i=1}^n s_i/N = (Σ_{i=1}^n s_i)/(nN)
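The same estimator in code: each experiment is N tosses, we observe n head counts, and θ_MLE averages them and divides by N (parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true, N, n = 0.45, 20, 1_000
s = rng.binomial(n=N, p=theta_true, size=n)  # data D = {s_1, ..., s_n}

theta_mle = s.sum() / (n * N)  # = (1/n) * sum_i (s_i / N), as derived above
print(theta_mle)               # close to 0.45
```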

Gaussian distribution

A random variable X ∼ N(µ, σ²) follows a Gaussian distribution with density function

p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

▶ Regression: p(y|x) = N(y|µ(x), σ²), i.e. y = µ(x) + ϵ with ϵ ∼ N(0, σ²)
Exercise: Given the data D = {x_1, x_2, . . . , x_n}, what are reasonable values of the parameters µ, σ²?
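A numerical sketch for the exercise: the MLE here is the standard result, the sample mean and the (biased, divide-by-n) sample variance, checked below on simulated data with arbitrarily chosen true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true, sigma_true, n = 2.0, 1.5, 10_000
x = rng.normal(loc=mu_true, scale=sigma_true, size=n)

mu_mle = x.mean()                        # maximizes the log-likelihood in mu
sigma2_mle = ((x - mu_mle) ** 2).mean()  # note: divides by n, not n - 1

print(mu_mle, sigma2_mle)  # close to mu = 2.0 and sigma^2 = 2.25
```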
