
EE2211 Introduction to Machine Learning
Lecture 3
Semester 1, 2020/2021

Li Haizhou (haizhou.li@nus.edu.sg)

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Kar-Ann / Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

Introduction to Probability and Statistics
Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
• Induction versus Deduction Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parametric vs Non-Parametric machine learning

Causality
What is statistical causality or causation?
• In statistics, causation means that one thing will cause
the other, which is why it is also referred to as cause and
effect.
• The gold standard for causal data analysis is to combine
specific experimental designs such as randomized
studies with standard statistical analysis techniques.

• Causality creep is the idea that causal language is often used to describe inferential or predictive analyses.

Correlation
• In statistics, correlation is any statistical relationship,
whether causal or not, between two random variables.
• Correlations are useful because they can indicate a
predictive relationship that can be exploited in practice.

https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html

Correlation does not imply causation
• Most data analyses involve inference or prediction.
• Unless a randomized study is performed, it is difficult to infer why
there is a relationship between two variables.
• Some great examples of correlations that can be calculated but are clearly not causally related appear at http://tylervigen.com/ (see the figure below).

[Figure: an example of a spurious correlation from tylervigen.com]

Example
– Decades of data show a clear causal relationship between smoking
and cancer.
– If you smoke, it is a sure thing that your risk of cancer will increase.
– But it is not a sure thing that you will get cancer.
– The causal effect is real, but it is an effect on your average risk.

[Figure: https://larspsyll.files.wordpress.com/2013/07/causation.jpg?w=538&h=416]

Caution
• Particular caution should be used when applying words
such as “cause” and “effect” when performing inferential
analysis.
• Causal language applied to even clearly labeled
inferential analyses may lead to misinterpretation - a
phenomenon called causation creep.

Simpson’s paradox
Simpson's paradox is a phenomenon in probability and statistics, in
which a trend appears in several different groups of data but
disappears or reverses when these groups are combined.

https://en.wikipedia.org/wiki/Simpson%27s_paradox
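As a quick illustration, the sketch below recomputes the kidney-stone treatment comparison given in the Wikipedia article cited above: treatment A looks better within each group, yet treatment B looks better when the groups are pooled.

```python
# Kidney-stone treatment counts as given in the Wikipedia article cited above:
# (successes, trials) for treatments A and B, split by stone size.
groups = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

for name, d in groups.items():
    print(name, {t: round(s / n, 3) for t, (s, n) in d.items()})
# small stones {'A': 0.931, 'B': 0.867}   -> A wins
# large stones {'A': 0.73,  'B': 0.688}   -> A wins

for t in ("A", "B"):
    s = sum(groups[g][t][0] for g in groups)
    n = sum(groups[g][t][1] for g in groups)
    print("overall", t, round(s / n, 3))
# overall A 0.78, overall B 0.826         -> the trend reverses when the groups are pooled
```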

Example

Ref: Gardner, Martin (March 1976). "Mathematical Games: On the fabric of inductive logic, and some probability paradoxes" (PDF). Scientific American, 234.

Probability

• We describe a random experiment by describing its procedure and observations of its outcomes.

• Outcomes are mutually exclusive in the sense that only one outcome occurs in a specific trial of the random experiment. This also means an outcome is not decomposable. All unique outcomes form a sample space.

• A subset of the sample space S, denoted A (A ⊆ S), is an event of the random experiment that is meaningful in the context of the experiment.

Axioms of Probability

Assuming events A ⊆ S and B ⊆ S, the probabilities of events related with A and B must satisfy:

1. Pr(A) ≥ 0
2. Pr(S) = 1
3. If A ∩ B = ∅, then Pr(A ∪ B) = Pr(A) + Pr(B)
   *otherwise, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
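A minimal Python check of these axioms on a hypothetical fair six-sided die (the events A, B and C below are illustrative choices, not from the slides):

```python
from fractions import Fraction

# Hypothetical illustration: a fair six-sided die, Pr(outcome) = 1/6 each.
S = {1, 2, 3, 4, 5, 6}
Pr = lambda E: Fraction(len(E), len(S))   # uniform probability of an event E ⊆ S

A = {2, 4, 6}   # "even outcome"
B = {1, 2}      # "outcome is at most 2"

assert Pr(A) >= 0 and Pr(S) == 1                      # axioms 1 and 2
# A and B overlap (they share {2}), so use the inclusion-exclusion form:
assert Pr(A | B) == Pr(A) + Pr(B) - Pr(A & B)         # 2/3 = 1/2 + 1/3 - 1/6
# For disjoint events, axiom 3 applies directly:
C = {1, 3}
assert A & C == set() and Pr(A | C) == Pr(A) + Pr(C)  # 5/6 = 1/2 + 1/3
```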

Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random phenomenon.
– Examples of random phenomena with a numerical outcome include
a toss of a coin (0 for heads and 1 for tails), a roll of a die, or the
height of the first stranger you meet outside.
• There are two types of random variables:
– discrete and continuous.

[Figure: a random variable X maps each outcome s of the sample space to a real number X(s) ∈ R.]
Notations
• Some books use P(·) and p(·) to distinguish between the probability of a discrete random variable and the probability of a continuous random variable, respectively.

• We shall use Pr(·) for both of the above cases.

Discrete random variable
• A discrete random variable (DRV) takes on only a
countable number of distinct values such as red, yellow,
blue, or 1, 2, 3, …
• The probability distribution of a discrete random variable is
described by a list of probabilities associated with each of its possible
values.
- This list of probabilities is called a probability mass function (pmf). (Like a histogram, except that here the probabilities sum to 1.)

[Figure: a probability mass function]


• Let a discrete random variable X have k possible values {x_i}_{i=1}^{k}.
• The expectation of X, denoted E[X], is given by
  E[X] ≝ ∑_{i=1}^{k} x_i · Pr(X = x_i)
       = x_1 · Pr(X = x_1) + x_2 · Pr(X = x_2) + ··· + x_k · Pr(X = x_k)
  where Pr(X = x_i) is the probability that X has the value x_i according to the pmf.
• The expectation of a random variable is also called the
mean, average or expected value and is frequently
denoted with the letter μ.
• The expectation is one of the most important statistics of a
random variable.

• Another important statistic is the standard deviation, defined as
  σ ≝ √( E[(X − μ)²] ).
• Variance, denoted as σ² or var(X), is defined as
  σ² = E[(X − μ)²].
• For a discrete random variable, the standard deviation is given by
  σ = √( Pr(X = x_1)(x_1 − μ)² + ⋯ + Pr(X = x_k)(x_k − μ)² )
  where μ = E[X].
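A minimal sketch of these formulas in Python, using a small hypothetical pmf (the values and probabilities below are illustrative only):

```python
import math

# Hypothetical pmf for a discrete random variable X with k = 3 possible values.
pmf = {1: 0.2, 2: 0.5, 4: 0.3}                  # Pr(X = x_i); the probabilities sum to 1
assert abs(sum(pmf.values()) - 1.0) < 1e-12

mu = sum(x * p for x, p in pmf.items())               # E[X] = sum_i x_i * Pr(X = x_i)
var = sum(p * (x - mu) ** 2 for x, p in pmf.items())  # sigma^2 = E[(X - mu)^2]
sigma = math.sqrt(var)                                # standard deviation

print(mu, var, sigma)   # 2.4, 1.24, ~1.11
```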

Continuous random variable
• A continuous random variable (CRV) takes an infinite
number of possible values in some interval.
– Examples include height, weight, and time.
– Because the number of values of a continuous random variable X
is infinite, the probability Pr(X = c) for any c is 0.
– Therefore, instead of the list of probabilities, the probability
distribution of a CRV (a continuous probability distribution) is
described by a probability density function (pdf).
– The pdf is a function whose codomain is nonnegative and whose area under the curve is equal to 1.

[Figure: a probability density function]

• The expectation of a continuous random variable X is given by
  E[X] ≝ ∫_R x f_X(x) dx
  where f_X is the pdf of the variable X and ∫_R denotes integration of the function x f_X(x) over the real line R.
• The variance of a continuous random variable X is given by
  σ² ≝ ∫_R (x − μ)² f_X(x) dx.

• The integral is the equivalent of a summation over all values of the function when the function has a continuous domain.
• It equals the area under the curve of the function.
• The property of the pdf that the area under its curve is 1 mathematically means that ∫_R f_X(x) dx = 1.
Mean and Standard Deviation of a Gaussian distribution

[Figure: a Gaussian pdf centred at the mean μ, with intervals around μ (between points such as x_1 and x_2) covering 90% and 95% of the probability mass.]
Example 1
Independent random variables
• Consider tossing a fair coin twice; what is the probability of getting (H, H)?
  Pr(x=H, y=H) = Pr(x=H) Pr(y=H) = (1/2)(1/2) = 1/4

Slides courtesy: Professor Robby Tan

Example 2
Dependent random variables
• Given 2 balls with different colors (Red and Black), what is the probability of having B and then R?
  The space of outcomes of taking two balls sequentially without replacement:
  B – R
  R – B
  Thus the probability of having B – R is 1/2.
  Mathematically:
  Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B) = 1 × (1/2) = 1/2
  Since we are given that the first pick was B, the remaining ball must be R, so Pr(y=R | x=B) = 1.

Slides courtesy: Professor Robby Tan

Example 3
Dependent random variables
• Given 3 balls with different colors (R, G, B), what is the probability of having B and then G?
  The space of outcomes of taking two balls sequentially without replacement:
  R – G | G – B | B – R
  R – B | G – R | B – G
  Thus Pr(y=G, x=B) = 1/6.
  Mathematically:
  Pr(y=G, x=B) = Pr(y=G | x=B) Pr(x=B) = (1/2) × (1/3) = 1/6
  Given that the first pick is B, the remaining balls are G and R, and thus the chance of picking G is 1/2.

Slides courtesy: Professor Robby Tan
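A quick Monte-Carlo check of Example 3 (a hypothetical simulation, assuming every ordered draw without replacement is equally likely):

```python
import random

# Monte-Carlo check of Example 3: draw two of {R, G, B} without replacement.
random.seed(0)
trials = 100_000
hits = sum(
    random.sample(["R", "G", "B"], 2) == ["B", "G"]   # ordered draw: first B, then G
    for _ in range(trials)
)
print(hits / trials)   # ~ 1/6 ~ 0.167
```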

Bayes’ Rule
Thomas Bayes (1701 – 1761)

• The conditional probability Pr(Y = y | X = x) is the probability of the random variable Y taking a specific value y given that another random variable X has a specific value x.
• Bayes’ Rule (also known as Bayes’ Theorem) stipulates that:

  Pr(Y = y | X = x) = Pr(X = x | Y = y) Pr(Y = y) / Pr(X = x)        (1)

  posterior = (likelihood × prior) / evidence
Bayes’ Rule

  Pr(y | x) = Pr(y) Pr(x | y) / Pr(x) = Pr(y) Pr(x | y) / ∑_y Pr(y) Pr(x | y)

• Prior Pr(y) – what we know about y BEFORE seeing x
• Likelihood Pr(x | y) – propensity for observing a certain value of x given a certain value of y
• Posterior Pr(y | x) – what we know about y AFTER seeing x
• Evidence Pr(x) – a constant to ensure that the left-hand side is a valid distribution

Adapted from S. Prince
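A minimal numeric sketch of this decomposition, using a hypothetical two-value y (spam vs. ham) and a single observation x; the prior and likelihood numbers are made up for illustration:

```python
# Hypothetical two-value example: y in {"spam", "ham"}, x = "the message contains the word 'offer'".
prior      = {"spam": 0.3, "ham": 0.7}   # Pr(y): what we believe before seeing x (made-up numbers)
likelihood = {"spam": 0.8, "ham": 0.1}   # Pr(x | y): propensity of observing x for each y (made-up)

evidence  = sum(prior[y] * likelihood[y] for y in prior)              # Pr(x) = sum_y Pr(y) Pr(x | y)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}   # Pr(y | x)

print(posterior)   # {'spam': ~0.774, 'ham': ~0.226}; a valid distribution, the values sum to 1
```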


Parameter Estimation
• Bayes’ rule comes in handy when we have a model of X’s distribution, e.g., to model the term Pr(X = x | θ = θ̂) with a normal distribution N(μ, σ²):

  f_θ̂(x) = Pr(X = x | θ = θ̂) = (1/√(2πσ²)) e^{−(x − μ)²/(2σ²)}        (2)

  where θ̂ ≝ [μ, σ] is the parameter vector and π is the constant (3.14159…).

Parameter Estimation
• With Pr(X = x | θ = θ̂) ≝ f_θ̂(x) as in equation (2), we can update the values of the parameters in the vector θ from the data using Bayes’ rule:

  Pr(θ = θ̂ | X = x) ← Pr(X = x | θ = θ̂) Pr(θ = θ̂) / Pr(X = x)        (3)
  (posterior probability)

                     = Pr(X = x | θ = θ̂) Pr(θ = θ̂) / ∑_θ̃ Pr(X = x | θ = θ̃) Pr(θ = θ̃)
Parameter Estimation
• The best value of the parameters, θ*, given some examples is obtained using the principle of maximum a posteriori (MAP) estimation:

  θ* = argmax_θ̂ Pr(θ = θ̂ | X = x)
     = argmax_θ̂ log Pr(θ = θ̂ | X = x)

Parameter Estimation
• Suppose Pr(X = x | θ = θ̂) ≝ f_θ̂(x). The update

  Pr(θ = θ̂ | X = x) ← Pr(X = x | θ = θ̂) Pr(θ = θ̂) / Pr(X = x)

  is called maximum likelihood estimation when only the likelihood term Pr(X = x | θ = θ̂) is used for estimating the non-random parameter θ while assuming X random.

• In this way, maximum a posteriori (MAP) estimation becomes maximum likelihood estimation.

Maximum Likelihood Estimation
• Suppose that the underlying distribution is a normal N(μ, σ²) and we want to estimate the mean μ and variance σ² from sample data X = x_1, x_2, …, x_m using the maximum likelihood estimator.

• Log-likelihood function: L(X|θ) = ∑_{i=1}^{m} log f_θ(x_i), so

  L(X|θ) = ∑_{i=1}^{m} log [ (1/√(2πσ²)) e^{−(x_i − μ)²/(2σ²)} ]
         = −m log σ − (m/2) log 2π − (1/(2σ²)) ∑_{i=1}^{m} (x_i − μ)²

Maximum Likelihood Estimation
Setting
  ∂L(X|θ)/∂μ = (1/σ²) ∑_{i=1}^{m} (x_i − μ) = 0
we have
  μ̂ = (1/m) ∑_{i=1}^{m} x_i ≔ x̄

and setting
  ∂L(X|θ)/∂σ = −m/σ + (1/σ³) ∑_{i=1}^{m} (x_i − μ)² = 0
we have
  σ̂ = √( (1/m) ∑_{i=1}^{m} (x_i − x̄)² )
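A minimal NumPy check of these closed-form estimators on simulated data (the true μ = 2.0 and σ = 1.5 below are hypothetical choices):

```python
import numpy as np

# Simulated data from a normal distribution; the true mu = 2.0 and sigma = 1.5 are hypothetical.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # samples x_1, ..., x_m

mu_hat    = x.mean()                              # mu_hat = (1/m) sum_i x_i
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # sigma_hat = sqrt((1/m) sum_i (x_i - x_bar)^2)

# np.std with ddof=0 uses the same 1/m normalisation as the MLE.
assert np.isclose(sigma_hat, x.std(ddof=0))
print(mu_hat, sigma_hat)   # ~ 2.0 and ~ 1.5
```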

Parametric Machine Learning
• A learning model that summarizes data with a set of parameters of fixed size is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs. (Artificial Intelligence: A Modern Approach, p. 737)
• The algorithms involve two steps:
  1. Select a form for the function, e.g. a normal distribution.
  2. Learn the coefficients for the function from the training data.

Pros: simplicity, speed, less data needed
Cons: strong assumptions, limited complexity, potentially poor fit

Non-parametric Machine Learning
• Algorithms that do not make strong assumptions about the form of the mapping function are called non-parametric machine learning algorithms. They are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features. (Artificial Intelligence: A Modern Approach, p. 757)

Pros: flexibility, performance
Cons: more data needed, slower, risk of overfitting
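A minimal scikit-learn sketch contrasting the two families on the same data; the sine-shaped target and the specific model choices below are illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression    # parametric: a fixed number of coefficients
from sklearn.neighbors import KNeighborsRegressor    # non-parametric: keeps the training data around

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)     # an illustrative non-linear target with noise

linear = LinearRegression().fit(X, y)                # summarises the data with a slope and an intercept
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)   # predicts with the average of the 5 nearest samples

X_test = np.linspace(0, 6, 7).reshape(-1, 1)
print(linear.predict(X_test))   # a straight line: cannot follow the sine shape (strong assumption)
print(knn.predict(X_test))      # tracks the sine shape, but needs all 200 training points at test time
```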

The End

