
EE2211 Introduction to Machine Learning
Lecture 3
Semester 1, 2020/2021

Li Haizhou (haizhou.li@nus.edu.sg)

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

© Copyright EE, NUS. All Rights Reserved.


Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Kar-Ann / Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

Introduction to Probability and Statistics
Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
• Induction versus Deduction Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parametric vs Non-Parametric machine learning

Causality
What is statistical causality or causation?
• In statistics, causation means that one thing will cause
the other, which is why it is also referred to as cause and
effect.
• The gold standard for causal data analysis is to combine
specific experimental designs such as randomized
studies with standard statistical analysis techniques.

• Causality creep is the idea that causal language is often used to describe inferential or predictive analyses.

Correlation
• In statistics, correlation is any statistical relationship,
whether causal or not, between two random variables.
• Correlations are useful because they can indicate a
predictive relationship that can be exploited in practice.

https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html

Correlation does not imply causation
• Most data analyses involve inference or prediction.
• Unless a randomized study is performed, it is difficult to infer why
there is a relationship between two variables.
• Some great examples of correlations that can be calculated but are clearly not causally related appear at http://tylervigen.com/ (see the figure below).

[Figure: an example of a spurious correlation from tylervigen.com]

Example
– Decades of data show a clear causal relationship between smoking
and cancer.
– If you smoke, it is a sure thing that your risk of cancer will increase.
– But it is not a sure thing that you will get cancer.
– The causal effect is real, but it is an effect on your average risk.

[Figure: https://larspsyll.files.wordpress.com/2013/07/causation.jpg?w=538&h=416]

Caution
• Particular caution should be used when applying words
such as “cause” and “effect” when performing inferential
analysis.
• Causal language applied to even clearly labeled
inferential analyses may lead to misinterpretation - a
phenomenon called causation creep.

Simpson’s paradox
Simpson's paradox is a phenomenon in probability and statistics, in
which a trend appears in several different groups of data but
disappears or reverses when these groups are combined.

https://en.wikipedia.org/wiki/Simpson%27s_paradox
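As a quick illustration, the sketch below recomputes the kidney-stone treatment comparison given in the Wikipedia article cited above: treatment A looks better within each group, yet treatment B looks better when the groups are pooled.

```python
# Kidney-stone treatment counts as given in the Wikipedia article cited above:
# (successes, trials) for treatments A and B, split by stone size.
groups = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

for name, d in groups.items():
    print(name, {t: round(s / n, 3) for t, (s, n) in d.items()})
# small stones {'A': 0.931, 'B': 0.867}   -> A wins
# large stones {'A': 0.73,  'B': 0.688}   -> A wins

for t in ("A", "B"):
    s = sum(groups[g][t][0] for g in groups)
    n = sum(groups[g][t][1] for g in groups)
    print("overall", t, round(s / n, 3))
# overall A 0.78, overall B 0.826         -> the trend reverses when the groups are pooled
```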

Example

Ref: Gardner, Martin (March 1976). "Mathematical Games: On the fabric of inductive logic, and some probability paradoxes" (PDF). Scientific American, 234.

Probability

• We describe a random experiment by describing its procedure and observations of its outcomes.

• Outcomes are mutually exclusive in the sense that only one outcome occurs in a specific trial of the random experiment. This also means an outcome is not decomposable. All unique outcomes form a sample space.

• A subset of the sample space S, denoted A (A ⊆ S), is an event of the random experiment that is meaningful in the context of the experiment.

Axioms of Probability

Assuming events A ⊆ S and B ⊆ S, the probabilities of events related with A and B must satisfy:

1. Pr(A) ≥ 0
2. Pr(S) = 1
3. If A ∩ B = ∅, then Pr(A ∪ B) = Pr(A) + Pr(B)
   *otherwise, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
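A minimal Python check of these axioms on a hypothetical fair six-sided die (the events A, B and C below are illustrative choices, not from the slides):

```python
from fractions import Fraction

# Hypothetical illustration: a fair six-sided die, Pr(outcome) = 1/6 each.
S = {1, 2, 3, 4, 5, 6}
Pr = lambda E: Fraction(len(E), len(S))   # uniform probability of an event E ⊆ S

A = {2, 4, 6}   # "even outcome"
B = {1, 2}      # "outcome is at most 2"

assert Pr(A) >= 0 and Pr(S) == 1                      # axioms 1 and 2
# A and B overlap (they share {2}), so use the inclusion-exclusion form:
assert Pr(A | B) == Pr(A) + Pr(B) - Pr(A & B)         # 2/3 = 1/2 + 1/3 - 1/6
# For disjoint events, axiom 3 applies directly:
C = {1, 3}
assert A & C == set() and Pr(A | C) == Pr(A) + Pr(C)  # 5/6 = 1/2 + 1/3
```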

Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random phenomenon.
– Examples of random phenomena with a numerical outcome include
a toss of a coin (0 for heads and 1 for tails), a roll of a die, or the
height of the first stranger you meet outside.
• There are two types of random variables:
– discrete and continuous.

[Figure: a random variable X maps each outcome s of the sample space to a real number X(s) ∈ R.]
Notations
• Some books use P(·) and p(·) to distinguish between the probability of a discrete random variable and the probability of a continuous random variable, respectively.

• We shall use Pr(·) for both of the above cases.

Discrete random variable
• A discrete random variable (DRV) takes on only a
countable number of distinct values such as red, yellow,
blue, or 1, 2, 3, …
• The probability distribution of a discrete random variable is
described by a list of probabilities associated with each of its possible
values.
- This list of probabilities is called a probability mass function (pmf). (Like a histogram, except that here the probabilities sum to 1.)

[Figure: a probability mass function]


• Let a discrete random variable X have k possible values {x_i}_{i=1}^{k}.
• The expectation of X, denoted E[X], is given by
  E[X] ≝ ∑_{i=1}^{k} x_i · Pr(X = x_i)
       = x_1 · Pr(X = x_1) + x_2 · Pr(X = x_2) + ··· + x_k · Pr(X = x_k)
  where Pr(X = x_i) is the probability that X has the value x_i according to the pmf.
• The expectation of a random variable is also called the
mean, average or expected value and is frequently
denoted with the letter μ.
• The expectation is one of the most important statistics of a
random variable.

• Another important statistic is the standard deviation, defined as
  σ ≝ √( E[(X − μ)²] ).
• Variance, denoted as σ² or var(X), is defined as
  σ² = E[(X − μ)²].
• For a discrete random variable, the standard deviation is given by
  σ = √( Pr(X = x_1)(x_1 − μ)² + ⋯ + Pr(X = x_k)(x_k − μ)² )
  where μ = E[X].
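A minimal sketch of these formulas in Python, using a small hypothetical pmf (the values and probabilities below are illustrative only):

```python
import math

# Hypothetical pmf for a discrete random variable X with k = 3 possible values.
pmf = {1: 0.2, 2: 0.5, 4: 0.3}                  # Pr(X = x_i); the probabilities sum to 1
assert abs(sum(pmf.values()) - 1.0) < 1e-12

mu = sum(x * p for x, p in pmf.items())               # E[X] = sum_i x_i * Pr(X = x_i)
var = sum(p * (x - mu) ** 2 for x, p in pmf.items())  # sigma^2 = E[(X - mu)^2]
sigma = math.sqrt(var)                                # standard deviation

print(mu, var, sigma)   # 2.4, 1.24, ~1.11
```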

Continuous random variable
• A continuous random variable (CRV) takes an infinite
number of possible values in some interval.
– Examples include height, weight, and time.
– Because the number of values of a continuous random variable X
is infinite, the probability Pr(X = c) for any c is 0.
– Therefore, instead of the list of probabilities, the probability
distribution of a CRV (a continuous probability distribution) is
described by a probability density function (pdf).
– The pdf is a function whose codomain is nonnegative and whose area under the curve is equal to 1.

[Figure: a probability density function]

• The expectation of a continuous random variable X is given by
  E[X] ≝ ∫_R x f_X(x) dx
  where f_X is the pdf of the variable X and ∫_R denotes integration of the function x f_X(x) over the real line R.
• The variance of a continuous random variable X is given by
  σ² ≝ ∫_R (x − μ)² f_X(x) dx.

• The integral is the equivalent of a summation over all values of the function when the function has a continuous domain.
• It equals the area under the curve of the function.
• The property of the pdf that the area under its curve is 1 mathematically means that ∫_R f_X(x) dx = 1.
Mean and Standard Deviation of a Gaussian distribution

[Figure: a Gaussian pdf centred at the mean μ, with intervals around μ (between points such as x_1 and x_2) covering 90% and 95% of the probability mass.]
Example 1
Independent random variables
• Consider tossing a fair coin twice; what is the probability of getting (H, H)?
  Pr(x=H, y=H) = Pr(x=H) Pr(y=H) = (1/2)(1/2) = 1/4

Slides courtesy: Professor Robby Tan

Example 2
Dependent random variables
• Given 2 balls with different colors (Red and Black), what is the probability of having B and then R?
  The space of outcomes of taking two balls sequentially without replacement:
  B – R
  R – B
  Thus the probability of having B – R is 1/2.
  Mathematically:
  Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B) = 1 × (1/2) = 1/2
  Since we are given that the first pick was B, the remaining ball must be R, so Pr(y=R | x=B) = 1.

Slides courtesy: Professor Robby Tan

Example 3
Dependent random variables
• Given 3 balls with different colors (R, G, B), what is the probability of having B and then G?
  The space of outcomes of taking two balls sequentially without replacement:
  R – G | G – B | B – R
  R – B | G – R | B – G
  Thus Pr(y=G, x=B) = 1/6.
  Mathematically:
  Pr(y=G, x=B) = Pr(y=G | x=B) Pr(x=B) = (1/2) × (1/3) = 1/6
  Given that the first pick is B, the remaining balls are G and R, and thus the chance of picking G is 1/2.

Slides courtesy: Professor Robby Tan
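A quick Monte-Carlo check of Example 3 (a hypothetical simulation, assuming every ordered draw without replacement is equally likely):

```python
import random

# Monte-Carlo check of Example 3: draw two of {R, G, B} without replacement.
random.seed(0)
trials = 100_000
hits = sum(
    random.sample(["R", "G", "B"], 2) == ["B", "G"]   # ordered draw: first B, then G
    for _ in range(trials)
)
print(hits / trials)   # ~ 1/6 ~ 0.167
```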

Bayes’ Rule
Thomas Bayes (1701 – 1761)

• The conditional probability Pr(Y = y | X = x) is the probability of the random variable Y taking a specific value y given that another random variable X has a specific value x.
• Bayes’ Rule (also known as Bayes’ Theorem) stipulates that:

  Pr(Y = y | X = x) = Pr(X = x | Y = y) Pr(Y = y) / Pr(X = x)        (1)

  posterior = (likelihood × prior) / evidence
Bayes’ Rule

  Pr(y | x) = Pr(y) Pr(x | y) / Pr(x) = Pr(y) Pr(x | y) / ∑_y Pr(y) Pr(x | y)

• Prior Pr(y) – what we know about y BEFORE seeing x
• Likelihood Pr(x | y) – propensity for observing a certain value of x given a certain value of y
• Posterior Pr(y | x) – what we know about y AFTER seeing x
• Evidence Pr(x) – a constant to ensure that the left-hand side is a valid distribution

Adapted from S. Prince
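A minimal numeric sketch of this decomposition, using a hypothetical two-value y (spam vs. ham) and a single observation x; the prior and likelihood numbers are made up for illustration:

```python
# Hypothetical two-value example: y in {"spam", "ham"}, x = "the message contains the word 'offer'".
prior      = {"spam": 0.3, "ham": 0.7}   # Pr(y): what we believe before seeing x (made-up numbers)
likelihood = {"spam": 0.8, "ham": 0.1}   # Pr(x | y): propensity of observing x for each y (made-up)

evidence  = sum(prior[y] * likelihood[y] for y in prior)              # Pr(x) = sum_y Pr(y) Pr(x | y)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}   # Pr(y | x)

print(posterior)   # {'spam': ~0.774, 'ham': ~0.226}; a valid distribution, the values sum to 1
```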


Parameter Estimation
• Bayes’ rule comes in handy when we have a model of X’s distribution, e.g., to model the term Pr(X = x | θ = θ̂) with a normal distribution N(μ, σ²):

  f_θ̂(x) = Pr(X = x | θ = θ̂) = (1/√(2πσ²)) e^{−(x − μ)²/(2σ²)}        (2)

  where θ̂ ≝ [μ, σ] is the parameter vector and π is the constant (3.14159…).

Parameter Estimation
• With Pr(X = x | θ = θ̂) ≝ f_θ̂(x) as in equation (2), we can update the values of the parameters in the vector θ from the data using Bayes’ rule:

  Pr(θ = θ̂ | X = x) ← Pr(X = x | θ = θ̂) Pr(θ = θ̂) / Pr(X = x)        (3)
  (posterior probability)

                     = Pr(X = x | θ = θ̂) Pr(θ = θ̂) / ∑_θ̃ Pr(X = x | θ = θ̃) Pr(θ = θ̃)
Parameter Estimation
• The best value of the parameters, θ*, given some examples is obtained using the principle of maximum a posteriori (MAP) estimation:

  θ* = argmax_θ̂ Pr(θ = θ̂ | X = x)
     = argmax_θ̂ log Pr(θ = θ̂ | X = x)

Parameter Estimation
• Suppose Pr(X = x | θ = θ̂) ≝ f_θ̂(x). The update

  Pr(θ = θ̂ | X = x) ← Pr(X = x | θ = θ̂) Pr(θ = θ̂) / Pr(X = x)

  is called maximum likelihood estimation when only the likelihood term Pr(X = x | θ = θ̂) is used for estimating the non-random parameter θ while assuming X random.

• In this way, maximum a posteriori (MAP) estimation becomes maximum likelihood estimation.

Maximum Likelihood Estimation
• Suppose that the underlying distribution is a normal N(μ, σ²) and we want to estimate the mean μ and variance σ² from sample data X = x_1, x_2, …, x_m using the maximum likelihood estimator.

• Log-likelihood function: L(X|θ) = ∑_{i=1}^{m} log f_θ(x_i), so

  L(X|θ) = ∑_{i=1}^{m} log [ (1/√(2πσ²)) e^{−(x_i − μ)²/(2σ²)} ]
         = −m log σ − (m/2) log 2π − (1/(2σ²)) ∑_{i=1}^{m} (x_i − μ)²

Maximum Likelihood Estimation
Setting
  ∂L(X|θ)/∂μ = (1/σ²) ∑_{i=1}^{m} (x_i − μ) = 0
we have
  μ̂ = (1/m) ∑_{i=1}^{m} x_i ≔ x̄

and setting
  ∂L(X|θ)/∂σ = −m/σ + (1/σ³) ∑_{i=1}^{m} (x_i − μ)² = 0
we have
  σ̂ = √( (1/m) ∑_{i=1}^{m} (x_i − x̄)² )
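A minimal NumPy check of these closed-form estimators on simulated data (the true μ = 2.0 and σ = 1.5 below are hypothetical choices):

```python
import numpy as np

# Simulated data from a normal distribution; the true mu = 2.0 and sigma = 1.5 are hypothetical.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # samples x_1, ..., x_m

mu_hat    = x.mean()                              # mu_hat = (1/m) sum_i x_i
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # sigma_hat = sqrt((1/m) sum_i (x_i - x_bar)^2)

# np.std with ddof=0 uses the same 1/m normalisation as the MLE.
assert np.isclose(sigma_hat, x.std(ddof=0))
print(mu_hat, sigma_hat)   # ~ 2.0 and ~ 1.5
```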

Parametric Machine Learning
• A learning model that summarizes data with a set of parameters of fixed size is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs. (Artificial Intelligence: A Modern Approach, p. 737)
• The algorithms involve two steps:
  1. Select a form for the function, e.g. a normal distribution.
  2. Learn the coefficients for the function from the training data.

Pros: simplicity, speed, less data needed
Cons: strong assumptions, limited complexity, potentially poor fit

Non-parametric Machine Learning
• Algorithms that do not make strong assumptions about the form of the mapping function are called non-parametric machine learning algorithms. They are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features. (Artificial Intelligence: A Modern Approach, p. 757)

Pros: flexibility, performance
Cons: more data needed, slower, risk of overfitting
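A minimal scikit-learn sketch contrasting the two families on the same data; the sine-shaped target and the specific model choices below are illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression    # parametric: a fixed number of coefficients
from sklearn.neighbors import KNeighborsRegressor    # non-parametric: keeps the training data around

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)     # an illustrative non-linear target with noise

linear = LinearRegression().fit(X, y)                # summarises the data with a slope and an intercept
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)   # predicts with the average of the 5 nearest samples

X_test = np.linspace(0, 6, 7).reshape(-1, 1)
print(linear.predict(X_test))   # a straight line: cannot follow the sine shape (strong assumption)
print(knn.predict(X_test))      # tracks the sine shape, but needs all 200 training points at test time
```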

The End

