
Lecture 2: Basics in Machine Learning

IBA6102 Machine Learning for Business

Qiyuan DENG

The Chinese University of Hong Kong, Shenzhen

September 12, 2023


Review: Introduction to Machine Learning

What is Machine Learning?


Why do we study Machine Learning?
Machine Learning Applications
Machine Learning Overview
History of Machine Learning

Test Your Understanding
Join from your laptop: Visit bodoudou.com and enter the Game Pin
Join from your phone: Scan the QR Code
Enter your nickname: Name (+ the last 2 digits of your student ID)



Test Your Understanding

Read the question


Choose your color!
Your score depends on whether you select the right answer and your
reaction time!



Maths for Machine Learning

Statistics
Probability
Estimation

Linear Algebra
Useful for compact representation of data
Dimension reduction techniques

Optimization theory



Agenda

Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Notations

a ∈ A (a is an element of the set A)
|B| (cardinality of the set B)
||v|| (norm of the vector v)
Σ (summation)
ℝ (the set of real numbers)

x, y, z (scalar variables)
A, B (sets)
y = f(x) (scalar-valued function)
y = f(x) (vector-valued function)


Probability Spaces

A random process or experiment with three components:


Ω, the set of possible outcomes O (|Ω| can be discrete or continuous)
F, the set of possible events E (|F| possible events)
P, the probability distribution


Discrete Probability Space

Three consecutive flips of a coin

8 possible outcomes:
O ∈ {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

Possible events:
Two flips are heads up: E = (O ∈ {HHT, HTH, THH})
First and third flips are tails up: E = (O ∈ {THT, TTT})

If the coin is fair:
P(HHH) = P(HHT) = P(HTH) = P(HTT) = P(THH) = P(THT) = P(TTH) = P(TTT) = 1/8
Probability of event E, P(exactly two heads up) = ?
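The answer can be sanity-checked by brute-force enumeration of the sample space; a minimal Python sketch (my own illustration, not from the slides):

```python
from itertools import product

# Enumerate all 2^3 = 8 equally likely outcomes of three fair coin flips
omega = [''.join(flips) for flips in product('HT', repeat=3)]

# Event E: exactly two heads up
E = [o for o in omega if o.count('H') == 2]

print(E)                    # ['HHT', 'HTH', 'THH']
print(len(E) / len(omega))  # P(E) = 3/8 = 0.375
```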



Continuous Probability Space

Height of a randomly chosen person

Infinite number of possible outcomes/events

Probabilities of outcomes are not equal; they are described by a continuous function



Probability Distributions

Discrete: probability mass function (pmf)

Continuous: probability density function (pdf)
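To make the distinction concrete, a small sketch using scipy.stats (assumed available; the numbers echo the coin and height examples):

```python
from scipy.stats import binom, norm

# pmf: P(X = k) for a discrete variable, e.g., k heads in 3 fair flips
print(binom.pmf(2, n=3, p=0.5))          # 0.375

# pdf: a density, not a probability, for a continuous variable
print(norm.pdf(170, loc=170, scale=10))  # ~0.0399; a density can exceed 1
```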



Axioms of Probability

Non-negative:
∀E ∈ F, P(E) ≥ 0

All possible outcomes:
P(Ω) = 1

Addition of disjoint events:
∀E, E′ ∈ F where E ∩ E′ = ∅, P(E ∪ E′) = P(E) + P(E′)



Random Variables

A random variable X is a function that associates a number x with each outcome of the random process: the event X = x

A way to redefine the original probability space as a new one
The axioms of probability still apply
Can be discrete or continuous

Discrete example: X = number of heads in three flips of a coin
X ∈ {0, 1, 2, 3}
P(X = 0) = P(X = 3) = 1/8, P(X = 1) = P(X = 2) = 3/8

Continuous example: X = average height of five randomly chosen people
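The discrete example can be verified by mapping each outcome to its value of X; a minimal sketch (my own illustration):

```python
from itertools import product

# X = number of heads in three flips of a fair coin
omega = list(product('HT', repeat=3))
pmf = {k: sum(o.count('H') == k for o in omega) / len(omega) for k in range(4)}
print(pmf)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```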



Multivariate Probability Distribution

Scenario
Several random processes occur
Probabilities for each possible combination

Joint probability of several random variables:
P(X = x, Y = y)



Multivariate Probability Distribution

Marginal probability
Probability distribution of a single variable in a joint distribution:
P(X = x) = Σ_{b = all values of Y} P(X = x, Y = b)



Multivariate Probability Distribution

Conditional probability
Probability distribution of one variable given that another variable takes a certain value:
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
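Both the marginal and the conditional are easy to compute from a joint table; a sketch with a hypothetical joint distribution (my own numbers):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) with X in {0, 1}, Y in {0, 1, 2}
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

p_x = joint.sum(axis=1)           # marginal P(X = x): sum over all values of Y
p_y_given_x0 = joint[0] / p_x[0]  # conditional P(Y = y | X = 0)

print(p_x)           # [0.4 0.6]
print(p_y_given_x0)  # [0.25 0.5 0.25]
```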



Complement Rule

Given event A, P(not A) = 1 − P(A)



Product Rule

Given events A and B, P(A, B) = P(A|B) · P(B)



Total Probability Rule

Given events A and B,


P(A) = P(A, B) + P(A, not B) = P(A|B)·P(B) + P(A|not B)·P(not B)



Independence

Given events A and B, P(A|B) = P(A) or P(A, B) = P(A) · P(B)



Independence

Independent random variables


P(X, Y) = P(X) P(Y)
P(X|Y) = P(X)

Y and X don’t contain information about each other:
Observing Y doesn’t help predict X
Observing X doesn’t help predict Y

Examples:
Independent: winning on roulette this week and next week
Dependent: Russian roulette (successive rounds)



Dependent vs. Independent

[Figure: samples of independent X, Y (left) vs. dependent X, Y (right)]



Conditionally Independent
Conditionally independent variables
P(X, Y | Z) = P(X | Z) P(Y | Z)

Knowing Z makes X and Y independent

Storks deliver babies? A highly statistically significant correlation exists between stork populations and human birth rates across Europe.

Common environmental factors

Examples:
Dependent: shoe size and reading skills
Conditionally independent: shoe size and reading skills given age
Conditionally Independent

London taxi drivers


A survey pointed out a positive and significant correlation between the number of accidents and wearing coats. It concluded that coats could hinder drivers’ movements and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.

Finally, another study pointed out that people wear coats when it rains...

P(Accidents, Coat | Rain) = P(Accidents | Rain) · P(Coat | Rain)



Correlation ≠ Causation



Conditional Independence

X is conditionally independent of Y given Z

P(X, Y | Z) = P(X | Z) P(Y | Z)
Equivalent to: ∀x, y, z, P(X = x | Y = y, Z = z) = P(X = x | Z = z)

P(Accidents, Coat | Rain) = P(Accidents | Rain) · P(Coat | Rain)

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
It does NOT mean Thunder is independent of Rain
But given Lightning, knowing Rain doesn’t give more information about Thunder.
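The coats-and-accidents story can also be simulated; a sketch under an assumed generative model in which Rain is the common cause (all parameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed model: Rain drives both Coat and Accident
rain = rng.random(n) < 0.3
coat = np.where(rain, rng.random(n) < 0.8, rng.random(n) < 0.1)
accident = np.where(rain, rng.random(n) < 0.2, rng.random(n) < 0.05)

# Unconditionally, Coat and Accident look dependent...
print((coat & accident).mean(), coat.mean() * accident.mean())

# ...but conditioned on Rain they are (approximately) independent
r = rain
print((coat[r] & accident[r]).mean(), coat[r].mean() * accident[r].mean())
```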



Agenda

Fundamentals of Machine Learning


Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Parameter Estimation

Suppose we have random variables X_1, …, X_n and corresponding observations x_1, …, x_n.

How to fit a parametric model to the data?

How to choose the values of parameters?



Parameter Estimation

Our first ML problem: Estimating the probability of flipping a coin

What’s the probability that the coin will fall with head up?
Let’s flip it a few times to estimate the probability:

The estimated probability is 3/5: the “frequency of heads”
Why frequency of heads?
How good is this estimation?
Why is this a machine learning problem?



Why Frequency of Heads?

Frequency of heads is exactly the Maximum Likelihood Estimator (MLE) for this problem.

Maximize the probability of the data we have seen:

θ̂_MLE = arg max_θ L(θ)

L(·) is the likelihood function: L(θ) = P(x_1, …, x_n; θ)
Assume X_1, …, X_n are i.i.d.: L(θ) = ∏_{i=1}^n P(x_i; θ)
(Optional) log-likelihood: log L(θ) = Σ_{i=1}^n log P(x_i; θ)



MLE for Bernoulli Distribution

Data: D = {x_i}_{i=1}^n, x_i ∈ {H, T}

P(H) = θ, P(T) = 1 − θ

Flips are i.i.d.:
Independent events
Identically distributed according to the Bernoulli distribution

MLE: Choose θ that maximizes the probability of observed data



MLE for Bernoulli Distribution

MLE: Choose θ that maximizes the probability of observed data

θ̂_MLE = arg max_θ L(θ)
      = arg max_θ P(x_1, x_2, …, x_n; θ)
      = arg max_θ ∏_{i=1}^n P(x_i; θ)   ⇐ independent draws
      = arg max_θ ∏_{i: x_i = H} θ · ∏_{i: x_i = T} (1 − θ)   ⇐ identically distributed
      = arg max_θ θ^{α_H} (1 − θ)^{α_T} =: J(θ)

α_H: number of heads
α_T: number of tails



MLE for Bernoulli Distribution

MLE: Choose θ that maximizes the probability of observed data

θ̂_MLE = arg max_θ L(θ) = arg max_θ θ^{α_H} (1 − θ)^{α_T} = arg max_θ J(θ)

Solve for θ̂_MLE by the first-order condition (FOC):

∂J(θ)/∂θ = α_H θ^{α_H − 1} (1 − θ)^{α_T} − α_T θ^{α_H} (1 − θ)^{α_T − 1} = 0 at θ = θ̂_MLE
⇒ α_H (1 − θ̂_MLE) − α_T θ̂_MLE = 0
⇒ θ̂_MLE = α_H / (α_H + α_T)

This is exactly the “frequency of heads”!
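A quick numerical confirmation of the closed form, via grid search over the log-likelihood rather than the FOC (a sketch, with hypothetical counts):

```python
import numpy as np

alpha_h, alpha_t = 3, 2  # hypothetical: 3 heads, 2 tails

theta_mle = alpha_h / (alpha_h + alpha_t)  # closed-form MLE

# Maximize the log-likelihood over a fine grid of theta values
theta = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_lik = alpha_h * np.log(theta) + alpha_t * np.log(1 - theta)

print(theta_mle, theta[np.argmax(log_lik)])  # both ~0.6
```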



How Good is MLE?
θ̂_MLE = α_H / (α_H + α_T)
Number of flips: n = α_H + α_T
Let θ* be the true parameter. How many flips do I need to get a good estimation?
I want to know θ̂ within ε = 0.1 error, with probability at least 1 − δ = 0.95

∀ε > 0, P(|θ̂ − θ*| ≥ ε) ≤ 2e^{−2nε²}   (Hoeffding’s inequality)
⇒ P(|θ̂ − θ*| ≥ ε) ≤ 2e^{−2nε²} ≤ δ
⇒ n ≥ ln(2/δ) / (2ε²) ≈ 185!

Great for large samples, but can be heavily biased for small samples
Interpretation
Computational efficiency
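The sample-size bound is a one-liner to evaluate; a minimal sketch:

```python
import math

def hoeffding_n(eps: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_n(eps=0.1, delta=0.05))  # 185
```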



Why is This a ML Problem?

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
– Tom M. Mitchell

Improves its performance (accuracy of the predicted probability)
at some task (predicting the probability of heads)
with experience (the more coins we flip, the better we are)



MLE for Continuous Features
Gaussian distribution: f(x; µ, σ) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))

MLE: Choose θ = (µ, σ²) that maximizes the probability of observed data

θ̂_MLE = arg max_θ L(θ)
      = arg max_θ ∏_{i=1}^n f(x_i; θ)   ⇐ independent draws
      = arg max_θ ∏_{i=1}^n (1/√(2πσ²)) e^{−(x_i − µ)²/(2σ²)}   ⇐ identically distributed
      = arg max_{θ=(µ,σ²)} (2πσ²)^{−n/2} e^{−Σ_{i=1}^n (x_i − µ)²/(2σ²)} =: J(θ)



MLE for Gaussian Mean and Variance

µ̂_MLE = (1/n) Σ_{i=1}^n x_i

σ̂²_MLE = (1/n) Σ_{i=1}^n (x_i − µ̂)²

σ̂²_MLE is biased: E[σ̂²_MLE] ≠ σ²

Unbiased estimator: σ̂²_unbiased = (1/(n−1)) Σ_{i=1}^n (x_i − µ̂)²
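In NumPy the two estimators differ only in the ddof argument; a sketch with simulated data (distribution parameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=3.0, size=1_000)  # hypothetical Gaussian sample

print(x.mean())       # mu_hat_MLE
print(x.var(ddof=0))  # divides by n     -> biased MLE of sigma^2
print(x.var(ddof=1))  # divides by n - 1 -> unbiased estimator
```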



Agenda

Fundamentals of Machine Learning


Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Prior Knowledge

If you have some ideas about your parameters
E.g., we know the coin is “close” to 50–50. What can we do?

The Bayesian way
Rather than estimating a single θ, we obtain a distribution over possible values of θ

max_θ P(θ | x_1, …, x_n)   ⇐ posterior



Bayesian Learning

Bayes Theorem:

P(θ|D) = P(D|θ) P(θ) / P(D)

Equivalently:

P(θ|D) ∝ P(D|θ) · P(θ)
(posterior ∝ likelihood · prior)

Thomas Bayes (1702–1761)



Example: Cancer Diagnosis
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have this cancer.
Would the patient likely have cancer or not?

P(Cancer) = 0.008, P(¬Cancer) = 0.992
P(+|Cancer) = 0.98, P(−|Cancer) = 0.02
P(+|¬Cancer) = 0.03, P(−|¬Cancer) = 0.97

P(+) = P(+|Cancer) P(Cancer) + P(+|¬Cancer) P(¬Cancer)
P(Cancer|+) = P(+|Cancer) P(Cancer) / P(+)
P(¬Cancer|+) = P(+|¬Cancer) P(¬Cancer) / P(+)
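Plugging the numbers into the total probability rule and Bayes theorem settles the question; a short sketch:

```python
p_cancer = 0.008
p_pos_given_cancer = 0.98   # correct positive rate
p_pos_given_healthy = 0.03  # false positive rate

p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(p_cancer_given_pos)  # ~0.209: the patient most likely does NOT have cancer
```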



Maximum A Posteriori Learning (MAP)

Choose the most probable hypothesis given the observed data and
prior belief by maximizing the posterior probability

θ̂_MAP = arg max_θ P(θ | x_1, …, x_n)
      = arg max_θ P(x_1, …, x_n | θ) P(θ)

P(θ): the prior encodes our knowledge/preference

Computationally intensive
Choosing P(θ) reflects our prior knowledge about the learning task


MAP vs. MLE
MLE: Choose θ that maximizes the likelihood of the observed data

θ̂_MLE = arg max_θ P(x_1, …, x_n | θ)

MAP: Choose θ that maximizes the posterior probability given the observed data

θ̂_MAP = arg max_θ P(θ | x_1, …, x_n)
      = arg max_θ P(x_1, …, x_n | θ) P(θ)

When is MAP the same as MLE?

Uninformative priors, e.g., the uniform distribution: ∀θ_i, θ_j, P(θ_i) = P(θ_j)



MAP for Binomial Distribution

Coin flip problem

Likelihood is Binomial:

P(x_1, …, x_n | θ) = C(n, α_H) θ^{α_H} (1 − θ)^{α_T}

If the prior is a Beta distribution:

P(θ) = θ^{β_H − 1} (1 − θ)^{β_T − 1} / B(β_H, β_T) ∼ Beta(β_H, β_T)

Beta function: B(x, y) = ∫_0^1 t^{x−1} (1 − t)^{y−1} dt

Posterior is also a Beta distribution:

P(θ | x_1, …, x_n) ∼ Beta(β_H + α_H, β_T + α_T)

P(θ) and P(θ | x_1, …, x_n) have the same form! ⇐ Conjugate prior



MAP for Binomial Distribution
Conjugate priors:
Closed-form representation of the posterior
P(θ) and P(θ | x_1, …, x_n) have the same form
P(θ) ∼ Beta(β_H, β_T), P(θ | x_1, …, x_n) ∼ Beta(β_H + α_H, β_T + α_T)

θ̂_MAP = arg max_θ P(θ | x_1, …, x_n)
      = arg max_θ P(x_1, …, x_n | θ) P(θ)
      = (α_H + β_H − 1) / (α_H + β_H + α_T + β_T − 2)

As we get more samples (n = α_H + α_T), the effect of the prior is “washed out”
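The wash-out effect is easy to see by comparing the two estimators as n grows; a sketch with an assumed prior and true parameter (my own choices):

```python
import numpy as np

beta_h, beta_t = 50, 50  # hypothetical Beta prior: coin is "close" to 50-50
theta_true = 0.7
rng = np.random.default_rng(1)

for n in (10, 100, 10_000):
    a_h = rng.binomial(n, theta_true)  # number of heads in n flips
    a_t = n - a_h
    mle = a_h / n
    map_est = (a_h + beta_h - 1) / (a_h + beta_h + a_t + beta_t - 2)
    print(n, round(mle, 3), round(map_est, 3))  # MAP approaches MLE as n grows
```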
Frequentists vs. Bayesians

Frequentist (MLE): not good when the sample is small
Bayesian (MAP): different answers for different priors



Agenda

Fundamentals of Machine Learning


Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Python

Free & Open-source


Simple and powerful
Easy to learn
Object-oriented
Libraries & Packages
Portable

Good for machine learning



Python

Installation
Download: https://www.python.org/downloads/

IDE
Jupyter notebook
Pycharm

Conda



Agenda

Fundamentals of Machine Learning


Probabilities
Dependence, Independence, and Conditional Independence

Parameter Estimation
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori Estimation (MAP)

Python Review
Hands-on



Assignment 1

Maths
Python

Will be posted tonight


Due: October 10, at 9am (before Lecture 5)

