
Lecture 2: Basics in Machine Learning

IBA6102 Machine Learning for Business

Qiyuan DENG

The Chinese University of Hong Kong, Shenzhen

September 12, 2023


Review: Introduction to Machine Learning

What is Machine Learning?


Why do we study Machine Learning?
Machine Learning Applications
Machine Learning Overview
History of Machine Learning

Test Your Understanding
Join from your laptop: Visit bodoudou.com and enter the Game Pin
Join from your phone: Scan the QR Code
Enter your nickname: Name (+ the last 2 digits of your student ID)



Test Your Understanding

Read the question


Choose your color!
Your score depends on whether you select the right answer and your
reaction time!



Maths for Machine Learning

Statistics
Probability
Estimation

Linear Algebra
Useful for compact representation of data
Dimension reduction techniques

Optimization theory



Agenda

Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Notations

a ∈ A (a is an element of the set A)
|B| (cardinality of the set B)
||v|| (norm of the vector v)
Σ (summation)
ℝ (the set of real numbers)

x, y, z (scalar variables)
A, B (sets)
y = f(x) (scalar-valued function)
y = f(x) (vector-valued function)


Probability Spaces

A random process or experiment with three components:


Ω, the set of possible outcomes O (|Ω| can be discrete or continuous)
F, the set of possible events E (|F| possible events)
P, the probability distribution


Discrete Probability Space

Three consecutive flips of a coin

8 possible outcomes:
O ∈ {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

Possible events:
Two flips are heads up: E = (O ∈ {HHT, HTH, THH})
First and third flips are tails up: E = (O ∈ {THT, TTT})

If the coin is fair:
P(HHH) = P(HHT) = P(HTH) = P(HTT) = P(THH) = P(THT) = P(TTH) = P(TTT) = 1/8
Probability of event E, P(exactly two heads up) = ?
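The answer can be sanity-checked by brute-force enumeration of the sample space; a minimal Python sketch (my own illustration, not from the slides):

```python
from itertools import product

# Enumerate all 2^3 = 8 equally likely outcomes of three fair coin flips
omega = [''.join(flips) for flips in product('HT', repeat=3)]

# Event E: exactly two heads up
E = [o for o in omega if o.count('H') == 2]

print(E)                    # ['HHT', 'HTH', 'THH']
print(len(E) / len(omega))  # P(E) = 3/8 = 0.375
```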



Continuous Probability Space

Height of a randomly chosen person

Infinite number of possible outcomes/events

Probabilities of outcomes are not equal; they are described by a continuous function



Probability Distributions

Discrete: probability mass function (pmf)

Continuous: probability density function (pdf)
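To make the distinction concrete, a small sketch using scipy.stats (assumed available; the numbers echo the coin and height examples):

```python
from scipy.stats import binom, norm

# pmf: P(X = k) for a discrete variable, e.g., k heads in 3 fair flips
print(binom.pmf(2, n=3, p=0.5))          # 0.375

# pdf: a density, not a probability, for a continuous variable
print(norm.pdf(170, loc=170, scale=10))  # ~0.0399; a density can exceed 1
```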



Axioms of Probability

Non-negative:
∀E ∈ F, P(E) ≥ 0

All possible outcomes:
P(Ω) = 1

Addition of disjoint events:
∀E, E′ ∈ F where E ∩ E′ = ∅, P(E ∪ E′) = P(E) + P(E′)



Random Variables

A random variable X is a function that associates a number x with each outcome of the random process: the event X = x

A way to redefine the original probability space as a new one
The axioms of probability still apply
Can be discrete or continuous

Discrete example: X = number of heads in three flips of a coin
X ∈ {0, 1, 2, 3}
P(X = 0) = P(X = 3) = 1/8, P(X = 1) = P(X = 2) = 3/8

Continuous example: X = average height of five randomly chosen people
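The discrete example can be verified by mapping each outcome to its value of X; a minimal sketch (my own illustration):

```python
from itertools import product

# X = number of heads in three flips of a fair coin
omega = list(product('HT', repeat=3))
pmf = {k: sum(o.count('H') == k for o in omega) / len(omega) for k in range(4)}
print(pmf)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```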



Multivariate Probability Distribution

Scenario
Several random processes occur
Probabilities for each possible combination

Joint probability of several random variables:
P(X = x, Y = y)



Multivariate Probability Distribution

Marginal probability
Probability distribution of a single variable in a joint distribution:
P(X = x) = Σ_{b = all values of Y} P(X = x, Y = b)



Multivariate Probability Distribution

Conditional probability
Probability distribution of one variable given that another variable takes a certain value:
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
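Both the marginal and the conditional are easy to compute from a joint table; a sketch with a hypothetical joint distribution (my own numbers):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) with X in {0, 1}, Y in {0, 1, 2}
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

p_x = joint.sum(axis=1)           # marginal P(X = x): sum over all values of Y
p_y_given_x0 = joint[0] / p_x[0]  # conditional P(Y = y | X = 0)

print(p_x)           # [0.4 0.6]
print(p_y_given_x0)  # [0.25 0.5 0.25]
```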



Complement Rule

Given event A, P(not A) = 1 − P(A)



Product Rule

Given events A and B, P(A, B) = P(A|B) · P(B)



Total Probability Rule

Given events A and B,


P(A) = P(A, B) + P(A, not B) = P(A|B)·P(B) + P(A|not B)·P(not B)



Independence

Given events A and B, P(A|B) = P(A) or P(A, B) = P(A) · P(B)



Independence

Independent random variables


P(X, Y) = P(X) P(Y)
P(X|Y) = P(X)

Y and X don’t contain information about each other:
Observing Y doesn’t help predict X
Observing X doesn’t help predict Y

Examples:
Independent: winning on roulette this week and next week
Dependent: Russian roulette (successive rounds)



Dependent vs. Independent

[Figure: samples of independent X, Y (left) vs. dependent X, Y (right)]



Conditionally Independent
Conditionally independent variables
P(X, Y | Z) = P(X | Z) P(Y | Z)

Knowing Z makes X and Y independent

Storks deliver babies? A highly statistically significant correlation exists between stork populations and human birth rates across Europe.

Common environmental factors

Examples:
Dependent: shoe size and reading skills
Conditionally independent: shoe size and reading skills given age
Conditionally Independent

London taxi drivers


A survey pointed out a positive and significant correlation between the number of accidents and wearing coats. It concluded that coats could hinder drivers’ movements and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.

Finally, another study pointed out that people wear coats when it rains...

P(Accidents, Coat | Rain) = P(Accidents | Rain) · P(Coat | Rain)



Correlation ≠ Causation



Conditional Independence

X is conditionally independent of Y given Z

P(X, Y | Z) = P(X | Z) P(Y | Z)
Equivalent to: ∀x, y, z, P(X = x | Y = y, Z = z) = P(X = x | Z = z)

P(Accidents, Coat | Rain) = P(Accidents | Rain) · P(Coat | Rain)

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
It does NOT mean Thunder is independent of Rain
But given Lightning, knowing Rain doesn’t give more information about Thunder.
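The coats-and-accidents story can also be simulated; a sketch under an assumed generative model in which Rain is the common cause (all parameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed model: Rain drives both Coat and Accident
rain = rng.random(n) < 0.3
coat = np.where(rain, rng.random(n) < 0.8, rng.random(n) < 0.1)
accident = np.where(rain, rng.random(n) < 0.2, rng.random(n) < 0.05)

# Unconditionally, Coat and Accident look dependent...
print((coat & accident).mean(), coat.mean() * accident.mean())

# ...but conditioned on Rain they are (approximately) independent
r = rain
print((coat[r] & accident[r]).mean(), coat[r].mean() * accident[r].mean())
```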



Agenda

Fundamentals of Machine Learning


Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Parameter Estimation

Suppose we have random variables X_1, …, X_n and corresponding observations x_1, …, x_n.

How to fit a parametric model to the data?

How to choose the values of parameters?



Parameter Estimation

Our first ML problem: Estimating the probability of flipping a coin

What’s the probability that the coin will fall with head up?
Let’s flip it a few times to estimate the probability:

The estimated probability is 3/5: the “frequency of heads”
Why frequency of heads?
How good is this estimation?
Why is this a machine learning problem?



Why Frequency of Heads?

Frequency of heads is exactly the Maximum Likelihood Estimator (MLE) for this problem.

Maximize the probability of the data we have seen:

θ̂_MLE = arg max_θ L(θ)

L(·) is the likelihood function: L(θ) = P(x_1, …, x_n; θ)
Assume X_1, …, X_n are i.i.d.: L(θ) = ∏_{i=1}^n P(x_i; θ)
(Optional) log-likelihood: log L(θ) = Σ_{i=1}^n log P(x_i; θ)



MLE for Bernoulli Distribution

Data: D = {x_i}_{i=1}^n, x_i ∈ {H, T}

P(H) = θ, P(T) = 1 − θ

Flips are i.i.d.:
Independent events
Identically distributed according to the Bernoulli distribution

MLE: Choose θ that maximizes the probability of observed data



MLE for Bernoulli Distribution

MLE: Choose θ that maximizes the probability of observed data

θ̂_MLE = arg max_θ L(θ)
      = arg max_θ P(x_1, x_2, …, x_n; θ)
      = arg max_θ ∏_{i=1}^n P(x_i; θ)   ⇐ independent draws
      = arg max_θ ∏_{i: x_i = H} θ · ∏_{i: x_i = T} (1 − θ)   ⇐ identically distributed
      = arg max_θ θ^{α_H} (1 − θ)^{α_T} =: J(θ)

α_H: number of heads
α_T: number of tails



MLE for Bernoulli Distribution

MLE: Choose θ that maximizes the probability of observed data

θ̂_MLE = arg max_θ L(θ) = arg max_θ θ^{α_H} (1 − θ)^{α_T} = arg max_θ J(θ)

Solve for θ̂_MLE by the first-order condition (FOC):

∂J(θ)/∂θ = α_H θ^{α_H − 1} (1 − θ)^{α_T} − α_T θ^{α_H} (1 − θ)^{α_T − 1} = 0 at θ = θ̂_MLE
⇒ α_H (1 − θ̂_MLE) − α_T θ̂_MLE = 0
⇒ θ̂_MLE = α_H / (α_H + α_T)

This is exactly the “frequency of heads”!
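A quick numerical confirmation of the closed form, via grid search over the log-likelihood rather than the FOC (a sketch, with hypothetical counts):

```python
import numpy as np

alpha_h, alpha_t = 3, 2  # hypothetical: 3 heads, 2 tails

theta_mle = alpha_h / (alpha_h + alpha_t)  # closed-form MLE

# Maximize the log-likelihood over a fine grid of theta values
theta = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_lik = alpha_h * np.log(theta) + alpha_t * np.log(1 - theta)

print(theta_mle, theta[np.argmax(log_lik)])  # both ~0.6
```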



How Good is MLE?
θ̂_MLE = α_H / (α_H + α_T)
Number of flips: n = α_H + α_T
Let θ* be the true parameter. How many flips do I need to get a good estimation?
I want to know θ̂ within ε = 0.1 error, with probability at least 1 − δ = 0.95

∀ε > 0, P(|θ̂ − θ*| ≥ ε) ≤ 2e^{−2nε²}   (Hoeffding’s inequality)
⇒ P(|θ̂ − θ*| ≥ ε) ≤ 2e^{−2nε²} ≤ δ
⇒ n ≥ ln(2/δ) / (2ε²) ≈ 185!

Great for large samples, but can be heavily biased for small samples
Interpretation
Computational efficiency
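The sample-size bound is a one-liner to evaluate; a minimal sketch:

```python
import math

def hoeffding_n(eps: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_n(eps=0.1, delta=0.05))  # 185
```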



Why is This a ML Problem?

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
– Tom M. Mitchell

Improves its performance (accuracy of the predicted probability)
at some task (predicting the probability of heads)
with experience (the more coins we flip, the better we are)



MLE for Continuous Features
Gaussian distribution: f(x; µ, σ) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))

MLE: Choose θ = (µ, σ²) that maximizes the probability of observed data

θ̂_MLE = arg max_θ L(θ)
      = arg max_θ ∏_{i=1}^n f(x_i; θ)   ⇐ independent draws
      = arg max_θ ∏_{i=1}^n (1/√(2πσ²)) e^{−(x_i − µ)²/(2σ²)}   ⇐ identically distributed
      = arg max_{θ=(µ,σ²)} (2πσ²)^{−n/2} e^{−Σ_{i=1}^n (x_i − µ)²/(2σ²)} =: J(θ)



MLE for Gaussian Mean and Variance

µ̂_MLE = (1/n) Σ_{i=1}^n x_i

σ̂²_MLE = (1/n) Σ_{i=1}^n (x_i − µ̂)²

σ̂²_MLE is biased: E[σ̂²_MLE] ≠ σ²

Unbiased estimator: σ̂²_unbiased = (1/(n−1)) Σ_{i=1}^n (x_i − µ̂)²
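In NumPy the two estimators differ only in the ddof argument; a sketch with simulated data (distribution parameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=3.0, size=1_000)  # hypothetical Gaussian sample

print(x.mean())       # mu_hat_MLE
print(x.var(ddof=0))  # divides by n     -> biased MLE of sigma^2
print(x.var(ddof=1))  # divides by n - 1 -> unbiased estimator
```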



Agenda

Fundamentals of Machine Learning


Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Prior Knowledge

If you have some ideas about your parameters
E.g., we know the coin is “close” to 50–50. What can we do?

The Bayesian way
Rather than estimating a single θ, we obtain a distribution over possible values of θ

max_θ P(θ | x_1, …, x_n)   ⇐ posterior



Bayesian Learning

Bayes Theorem:

P(θ|D) = P(D|θ) P(θ) / P(D)

Equivalently:

P(θ|D) ∝ P(D|θ) · P(θ)
(posterior ∝ likelihood · prior)

Thomas Bayes (1702–1761)



Example: Cancer Diagnosis
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have this cancer.
Would the patient likely have cancer or not?

P(Cancer) = 0.008, P(¬Cancer) = 0.992
P(+|Cancer) = 0.98, P(−|Cancer) = 0.02
P(+|¬Cancer) = 0.03, P(−|¬Cancer) = 0.97

P(+) = P(+|Cancer) P(Cancer) + P(+|¬Cancer) P(¬Cancer)
P(Cancer|+) = P(+|Cancer) P(Cancer) / P(+)
P(¬Cancer|+) = P(+|¬Cancer) P(¬Cancer) / P(+)
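Plugging the numbers into the total probability rule and Bayes theorem settles the question; a short sketch:

```python
p_cancer = 0.008
p_pos_given_cancer = 0.98   # correct positive rate
p_pos_given_healthy = 0.03  # false positive rate

p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(p_cancer_given_pos)  # ~0.209: the patient most likely does NOT have cancer
```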



Maximum A Posteriori Learning (MAP)

Choose the most probable hypothesis given the observed data and
prior belief by maximizing the posterior probability

θ̂_MAP = arg max_θ P(θ | x_1, …, x_n)
      = arg max_θ P(x_1, …, x_n | θ) P(θ)

P(θ): the prior encodes our knowledge/preference

Computationally intensive
Choosing P(θ) reflects our prior knowledge about the learning task


MAP vs. MLE
MLE: Choose θ that maximizes the likelihood of the observed data

θ̂_MLE = arg max_θ P(x_1, …, x_n | θ)

MAP: Choose θ that maximizes the posterior probability given the observed data

θ̂_MAP = arg max_θ P(θ | x_1, …, x_n)
      = arg max_θ P(x_1, …, x_n | θ) P(θ)

When is MAP the same as MLE?

Uninformative priors, e.g., the uniform distribution: ∀θ_i, θ_j, P(θ_i) = P(θ_j)



MAP for Binomial Distribution

Coin flip problem

Likelihood is Binomial:

P(x_1, …, x_n | θ) = C(n, α_H) θ^{α_H} (1 − θ)^{α_T}

If the prior is a Beta distribution:

P(θ) = θ^{β_H − 1} (1 − θ)^{β_T − 1} / B(β_H, β_T) ∼ Beta(β_H, β_T)

Beta function: B(x, y) = ∫_0^1 t^{x−1} (1 − t)^{y−1} dt

Posterior is also a Beta distribution:

P(θ | x_1, …, x_n) ∼ Beta(β_H + α_H, β_T + α_T)

P(θ) and P(θ | x_1, …, x_n) have the same form! ⇐ Conjugate prior



MAP for Binomial Distribution
Conjugate priors:
Closed-form representation of the posterior
P(θ) and P(θ | x_1, …, x_n) have the same form
P(θ) ∼ Beta(β_H, β_T), P(θ | x_1, …, x_n) ∼ Beta(β_H + α_H, β_T + α_T)

θ̂_MAP = arg max_θ P(θ | x_1, …, x_n)
      = arg max_θ P(x_1, …, x_n | θ) P(θ)
      = (α_H + β_H − 1) / (α_H + β_H + α_T + β_T − 2)

As we get more samples (n = α_H + α_T), the effect of the prior is “washed out”
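The wash-out effect is easy to see by comparing the two estimators as n grows; a sketch with an assumed prior and true parameter (my own choices):

```python
import numpy as np

beta_h, beta_t = 50, 50  # hypothetical Beta prior: coin is "close" to 50-50
theta_true = 0.7
rng = np.random.default_rng(1)

for n in (10, 100, 10_000):
    a_h = rng.binomial(n, theta_true)  # number of heads in n flips
    a_t = n - a_h
    mle = a_h / n
    map_est = (a_h + beta_h - 1) / (a_h + beta_h + a_t + beta_t - 2)
    print(n, round(mle, 3), round(map_est, 3))  # MAP approaches MLE as n grows
```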
Frequentists vs. Bayesians

Frequentist (MLE): not good when the sample is small
Bayesian (MAP): different answers for different priors



Agenda

Fundamentals of Machine Learning


Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Python

Free & Open-source


Simple and powerful
Easy to learn
Object-oriented
Libraries & Packages
Portable

Good for machine learning



Python

Installation
Download: https://www.python.org/downloads/

IDE
Jupyter notebook
Pycharm

Conda



Agenda

Fundamentals of Machine Learning


Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Assignment 1

Maths
Python

Will be posted tonight


Due: October 10, at 9am (before Lecture 5)

