
INT3405 Machine Learning

Lecture 1 - Introduction

Ta Viet Cuong, Le Duc Trong, Tran Quoc Long

VNU-UET

2022

Table of contents

What is Machine Learning

Probability & Random variable

Probability Distributions & Maximum Likelihood Estimation

Machine learning

Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience (Tom Mitchell).
▶ T: a task with clearly defined input and output
▶ P: a performance measure assessing how good an algorithm is on the task
▶ E: a set of experience (i.e. data) provided to the algorithm

Example
(Single) Face detection
▶ T: input = a 224x224 RGB image; output = (x1, y1, x2, y2), the top-left & bottom-right corners of the face in the input
▶ P: IoU (intersection over union)
▶ E: a set of (millions of) (image, (x1, y1, x2, y2)) pairs
Exercises: Specify T, P, E for
▶ Predicting tomorrow's weather given geographic information, satellite images, and a trailing window of past weather.
▶ Answering questions expressed in free-form text.
▶ Identifying all people depicted in an image and drawing outlines around each.
▶ Recommending products that users are likely to enjoy while browsing.

Types of Machine Learning

▶ Supervised learning: learn the input-output relationship (E = {(x_i, y_i)}, where the x_i are inputs and the y_i are desired targets)
▶ Unsupervised / self-supervised learning: learn data features, clusters, or the distribution (E = {x_i}, inputs only, no targets)
▶ Reinforcement learning: learn a good action policy for an agent in an environment (E = {(s, a) → (s′, r)}, where s, s′ are states, a is an action, and r is a reward)

Key phases in Machine Learning

Phase               Programming aspect
Data preparation    storing, retrieving, transforming data
Data modelling      model libraries, machine learning algorithms
Model training      optimization, fine-tuning, validation
Inference           deploying, logging, testing, mobile, web, API

Prerequisites for Machine Learning

Math
▶ Linear Algebra
▶ Calculus
▶ Probability and Statistics
▶ Optimization
Programming
▶ Data structures and algorithms
▶ Python/C++
▶ Libraries: numpy, pandas, scikit-learn, pytorch
▶ Frameworks: jupyter, django, fastapi, Android, iOS

Probability

Definitions:
▶ Sample space: Ω is the set of all possible outcomes or results
(of a random experiment).
▶ Event space: the set F ⊂ 2^Ω is a σ-algebra of subsets of Ω. Each element of F is an event (a subset of Ω).
▶ A σ-algebra must satisfy: (i) F ≠ ∅, (ii) A ∈ F ⇒ Ω \ A ∈ F, (iii) A_i ∈ F ∀i ⇒ ∪_{i=1}^∞ A_i ∈ F.
▶ Probability measure: a function P : F → R+ satisfying the following properties:
▶ P(Ω) = 1, P(∅) = 0
▶ A_i ∈ F, A_i ∩ A_j = ∅ ∀i ≠ j ⇒ P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)
As a result, a random experiment is fully specified by a probability triple (Ω, F, P).

Probability
Example
Consider a random experiment: A closed box contains 100
marbles, of which 40 are red and 60 are blue. Take out one marble
randomly.
▶ Sample space: Ω is the set of 100 marbles in the box.
▶ Event space: F = {∅, Ω, {red marbles}, {blue marbles}}, i.e. F contains 4 subsets of Ω. Notice that F is a σ-algebra over Ω.
▶ Probability measure: if every marble is equally likely to be taken, then
▶ P(∅) = 0, P(Ω) = 1, P(red) = 0.4, P(blue) = 0.6
▶ Event ∅: no marble is taken (happens with probability 0).
▶ Event Ω: a red or blue marble is taken (happens with probability 1).
▶ Event {red marbles}: the marble taken is red (probability 0.4).
▶ Event {blue marbles}: the marble taken is blue (probability 0.6).
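A minimal simulation sketch of this probability triple with numpy, assuming a uniform draw over the 100 marbles:

```python
import numpy as np

rng = np.random.default_rng(0)

# The box from the example: 40 red and 60 blue marbles.
box = np.array(["red"] * 40 + ["blue"] * 60)

# Repeat the random experiment many times: draw one marble uniformly.
draws = rng.choice(box, size=100_000)

print("P(red)  ~", np.mean(draws == "red"))   # close to 0.4
print("P(blue) ~", np.mean(draws == "blue"))  # close to 0.6
```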

Probability

Bayes’ theorem
Consider two events A, B with P(A) ≠ 0. Then

P(B|A) = P(A ∩ B) / P(A) = P(A|B) P(B) / P(A)

where
▶ P(B|A): the probability of event B occurring given that A is true (the posterior).
▶ P(A|B): the likelihood of A given a fixed B.
▶ P(B): the marginal or prior probability.
Independence
Two events A and B are independent iff P(A ∩ B) = P(A)P(B)

Probability
Example: COVID-19
▶ The test is correct on a sick person 90% of the time (true positive rate).
▶ The test is correct on a healthy person 99% of the time (true negative rate).
▶ 3% of the population have COVID-19.
Question: what is the probability that a random person who tests positive really is sick?
▶ Event A: positive test result.
▶ Event B: has the disease.

P(A|B) × P(B) = 0.9 × 0.03 = 0.027

P(A) = P(A|B) × P(B) + P(A|¬B) × P(¬B)
     = 0.9 × 0.03 + 0.01 × 0.97 = 0.0367

⇒ P(B|A) = 0.027 / 0.0367 ≈ 73.57%
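The same computation written out in code, using the numbers from the example above:

```python
# Bayes' theorem for the COVID-19 example.
p_B = 0.03             # prior P(B): fraction of the population with the disease
p_A_given_B = 0.90     # P(A|B): true positive rate
p_A_given_notB = 0.01  # P(A|¬B): 1 - true negative rate

# Law of total probability: P(A) = P(A|B)P(B) + P(A|¬B)P(¬B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Posterior: P(B|A) = P(A|B)P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(p_B_given_A)  # ~0.7357
```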
Random variable

A random variable X is a measurable function on the sample space:

X : Ω → R

Examples:
▶ Randomly take 10 marbles (with replacement). The number of blue marbles among the 10 taken is a random variable.
▶ Pick 1 person at random among 100 people; the height of that person is a random variable.
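A quick simulation sketch of the first example, reusing the 40-red/60-blue box from before (the second example would require assuming a height distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# X = number of blue marbles among 10 draws with replacement,
# where each draw is blue with probability 0.6.
samples = rng.binomial(n=10, p=0.6, size=10_000)

print("a few realizations of X:", samples[:10])
print("E[X] ~", samples.mean())  # close to 10 * 0.6 = 6
```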

Types of random variables

▶ Discrete: X ∈ {1, 2, . . . , C}, with parameters θ_c = P(X = c), c = 1, 2, . . . , C
▶ Continuous: X ∈ R
▶ Cumulative distribution function (CDF): F(x) = P(X ≤ x)
▶ Probability density function (PDF): p(x) = F′(x)
▶ Bayes' formula for PDFs:

p(x, y) = p(y|x) p(x) = p(x|y) p(y)

Properties of random distributions

▶ Expectation

E[X] = Σ_c c P(X = c)  (discrete),    E[X] = ∫_R x p(x) dx  (continuous)

E[f(X)] = ∫_R f(x) p(x) dx

▶ Variance

V[X] = E[(X − E[X])²]

Properties of expectation

E[aX + bY + c] = aE[X] + bE[Y] + c
V[aX] = a²V[X]
V[X] = E[X²] − (E[X])²
V[X] = V[E[X|Y]] + E[V[X|Y]]

If X, Y are independent:

E[X · Y] = E[X] · E[Y]
V[X + Y] = V[X] + V[Y]
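These identities can be checked empirically. A minimal sketch with two independent samples whose distributions are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Independent X ~ N(1, 2^2) and Y ~ Uniform(0, 1).
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = rng.uniform(0.0, 1.0, size=n)

# E[aX + bY + c] = aE[X] + bE[Y] + c
a, b, c = 3.0, -2.0, 5.0
print((a * X + b * Y + c).mean(), "~", a * X.mean() + b * Y.mean() + c)

# V[X] = E[X^2] - (E[X])^2
print(X.var(), "~", (X**2).mean() - X.mean() ** 2)

# Independence: E[XY] = E[X]E[Y] and V[X + Y] = V[X] + V[Y]
print((X * Y).mean(), "~", X.mean() * Y.mean())
print((X + Y).var(), "~", X.var() + Y.var())
```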

Properties of expectation

E_Y[E_X[X|Y]] = ∫_R (∫_R x p(x|y) dx) p(y) dy = ∫_R x p(x) dx = E[X]

Proof:

∫_R (∫_R x p(x|y) dx) p(y) dy = ∫_R ∫_R x p(x|y) p(y) dx dy
                              = ∫_R ∫_R x p(x, y) dx dy
                              = ∫_R ∫_R x p(y|x) p(x) dx dy
                              = ∫_R (∫_R p(y|x) dy) x p(x) dx    (inner integral = 1)
                              = ∫_R x p(x) dx

Bernoulli distribution

X ∈ {0, 1} with probability P(X = 1) = θ, written as X ∼ Ber(θ). We also have P(X = 0) = 1 − θ.
▶ A biased coin: θ = probability of heads
▶ Binary classification: y|x ∼ Ber(θ(x)), i.e. P(y = 1|x) = θ(x)
→ the probability of class 1 is a function of the input
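Sampling from a Bernoulli distribution is straightforward; a small sketch with an arbitrarily chosen θ:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.7  # P(X = 1), e.g. the probability of heads for a biased coin
X = rng.binomial(n=1, p=theta, size=10_000)  # Ber(theta) = Bin(1, theta)

print("fraction of 1s:", X.mean())  # close to theta = 0.7
```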

Parameter estimation

Toss a coin (sampling) N times; heads (the value 1) comes up s times. What is the parameter θ of the coin (Bernoulli distribution)?
An intuitive guess: θ = s/N. Why does this number make sense?
Let x_i ∈ {0, 1} be the value of the i-th toss.
The probability of the data D = {x_1, x_2, . . . , x_N} under the model X ∼ Ber(θ) is

L(θ) = P(D) = P(x_1, x_2, . . . , x_N) = ∏_{i=1}^N P(x_i) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i}

Maximum likelihood Estimation - MLE
L(θ) is the likelihood of θ with respect to the dataset D.
MLE: find the θ for which L(θ) is maximized.

ℓ(θ) = log L(θ) = Σ_{i=1}^N [x_i log θ + (1 − x_i) log(1 − θ)]

ℓ′(θ) = Σ_{i=1}^N [x_i/θ − (1 − x_i)/(1 − θ)] = 0

(1/θ) Σ_{i=1}^N x_i = (1/(1 − θ)) Σ_{i=1}^N (1 − x_i),   where Σ_i x_i = s and Σ_i (1 − x_i) = N − s

s(1 − θ) = (N − s)θ   ⇒   θ_MLE = s/N
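Numerically, θ_MLE = s/N is just the empirical frequency of heads. A sketch with a simulated coin whose true θ is chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true = 0.3
N = 1_000
tosses = rng.binomial(n=1, p=theta_true, size=N)  # x_i in {0, 1}

s = tosses.sum()   # number of heads
theta_mle = s / N  # the closed-form MLE derived above
print(theta_mle)   # close to 0.3
```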

How good is the MLE?

▶ Unbiased: E[θ_MLE] = θ
▶ Variance goes to 0: V[θ_MLE] = θ(1 − θ)/N → 0 as N → ∞
▶ Consistent: P{|θ_MLE − θ| ≥ ϵ} → 0 as N → ∞
▶ Asymptotic normality: √N (θ_MLE − θ) →d N(0, θ(1 − θ))
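These properties can be illustrated by repeating the N-toss experiment many times and inspecting the sampling distribution of θ_MLE; a sketch with arbitrarily chosen θ and N:

```python
import numpy as np

rng = np.random.default_rng(0)

theta, N, repeats = 0.3, 500, 20_000

# Each binomial draw is the head count s of one N-toss experiment,
# so each entry of `estimates` is one realization of theta_MLE = s/N.
estimates = rng.binomial(n=N, p=theta, size=repeats) / N

print("mean of theta_MLE:", estimates.mean())  # ~theta (unbiased)
print("empirical variance:", estimates.var())  # ~theta(1-theta)/N
print("theta(1-theta)/N  =", theta * (1 - theta) / N)
```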

Binomial distribution

The number of heads s in N independent Bernoulli trials of tossing a coin follows a Binomial distribution:

X ∼ Bin(N, θ) ⇒ P(X = s) = C(N, s) θ^s (1 − θ)^{N−s}

where C(N, s) is the binomial coefficient. If, after repeating the experiment n times, we get the data D = {s_1, s_2, . . . , s_n}, then what is a sensible value of θ? (Hint: use MLE.)

L(θ) = P(D) = P(s_1, s_2, . . . , s_n) = ∏_{i=1}^n P(s_i) = ∏_{i=1}^n C(N, s_i) θ^{s_i} (1 − θ)^{N−s_i}

Binomial distribution (cont)

ℓ(θ) = log L(θ) = const + Σ_{i=1}^n [s_i log θ + (N − s_i) log(1 − θ)]

ℓ′(θ) = Σ_{i=1}^n [s_i/θ − (N − s_i)/(1 − θ)] = 0

(1/θ) Σ_{i=1}^n s_i = (1/(1 − θ)) Σ_{i=1}^n (N − s_i)

θ_MLE = (1/n) Σ_{i=1}^n s_i/N = (Σ_{i=1}^n s_i)/(nN)
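The same estimator in code: each experiment is N tosses, we observe n head counts, and θ_MLE averages them and divides by N (parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true, N, n = 0.45, 20, 1_000
s = rng.binomial(n=N, p=theta_true, size=n)  # data D = {s_1, ..., s_n}

theta_mle = s.sum() / (n * N)  # = (1/n) * sum_i (s_i / N), as derived above
print(theta_mle)               # close to 0.45
```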

Gaussian distribution

A random variable X ∼ N(µ, σ²) follows a Gaussian distribution with density function

p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

▶ Regression: p(y|x) = N(y|µ(x), σ²), i.e. y = µ(x) + ϵ with ϵ ∼ N(0, σ²)
Exercise: Given the data D = {x_1, x_2, . . . , x_n}, what are reasonable values of the parameters µ, σ²?
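A numerical sketch for the exercise: the MLE here is the standard result, the sample mean and the (biased, divide-by-n) sample variance, checked below on simulated data with arbitrarily chosen true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true, sigma_true, n = 2.0, 1.5, 10_000
x = rng.normal(loc=mu_true, scale=sigma_true, size=n)

mu_mle = x.mean()                        # maximizes the log-likelihood in mu
sigma2_mle = ((x - mu_mle) ** 2).mean()  # note: divides by n, not n - 1

print(mu_mle, sigma2_mle)  # close to mu = 2.0 and sigma^2 = 2.25
```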
