Professional Documents
Culture Documents
Machine Learning
Lecture 3
Wang Xinchao
xinchao@nus.edu.sg
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Linear Algebra, Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
– Prof. Helen’s part will follow up
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule
4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
(Very Gentle) Introduction to Linear Algebra
5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations,
• A vector Vectors,
is an ordered Matrices
list of scalar values, called
attributes
• A vector is an ordered list of scalar values
– Denoted byby
– Denoted a bold
a boldcharacter,
character, e.g. or aa
e.g. x or
• In many books, vectors are written column-wise:
• In many books, vectors are written column-wise:
2 2 1
= , = , =
3 5 0
• The three vectors above are two-dimensional (or h
• The three vectors above are two-dimensional, or have
elements)
two elements
6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors,
Notations, Matrices
Vectors, Matrices
• We denote an attribute of a vector as an italic value
• We denote an entry or attribute of a vector as an italic
with an index, e.g.
value with an index, e.g. !(") or " (") .
– The index j denotes a specific dimension of the vector, the
– Theposition
index j denotes a specific
of an attribute dimension
in the list of the vector, the position
of an attribute in the list
2 2
= = or more commonly = =
3 3
Note:
• Note:
• The notation should not be confused with the power operator, such
asisthe
– $ (") not2 to
in be confused with the power
(cubed)
operation, e.g., $ $ (squared)
– •Square
Square ofofanan indexed
indexed attribute
attribute ofof a vector
a vector is is denoted
denoted asby
($(" )$ ).
(") (')
– •A A variable
variable canhave
may havetwo
twoorormore
moreindices,
indices,e.g.,
like $this:
% or $%," or ,
7
© Copyright EE, NUS. All Rights Reserved. 7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
Notations, Vectors, Matrices
• Vectors can be visualized as, in a multi-dimensional space,
– arrows
Vectors can that point to as
be visualized some directions,
arrows that pointorto some directions as well
as –points
pointsin a multi-dimensional space
2 2 1
Illustrations of three two-dimensional vectors, = , = , and =
3 5 0
Note:
Note:
• Capital Sigma and Pi can be applied to the attributes of a vector x
• Capital Sigma and Pi can be applied to the attributes of a vector x
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
10
11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681" Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear
Systems Equations
of Linear Equations
Lineardependence
Linear dependence and
and independence
independence
Note:
Note: If all
If all rows
rows or columns
or columns of a square
of a square matrixmatrix are linearly
X are linearly
independent,
independent, then X is invertible.
then is invertible.
Ref:
Ref: [Book4]
[Book4] Stephen
Stephen Boyd
Boyd and
and Lieven
Lieven Vandenberghe,
Vandenberghe, “Introduction
“Introduction to
to
Applied
Applied Linear
Linear Algebra”,
Algebra”, Cambridge
Cambridge University
University Press,
Press, 2018
2018 (Chp5
(Chp5 && 11.1)
11.1)
12
12
© 11
© Copyright
Copyright EE,
EE, NUS.
NUS. All
All Rights
Rights Reserved.
Reserved.
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems
Systems of Linear
of Linear Equations
Equations
2
2
1
1
+ =0 +
13 12
© Copyright EE, NUS. All Rights Reserved.
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems
Systems
Systems of of Linear
of Linear
Linear Equations
Equations
Equations
These
Theseequations
equationscan canbe
bewritten
writtencompactly
compactlyininmatrix-vector
matrix-vector
notation:
notation:
==
Where
Where targetvector
datamatrix
,, ,, … … ,,
== ,, == ,, == ..
,, ,, …
… ,,
Note:
Note:
××
•• The
Thedata
datamatrix
matrix and
andthe
thetarget
targetvector
vector are
aregiven
given
•• The
Theunknown
unknownvector
vectorofofparameters
parameters isisto
tobe
belearnt
learnt
•• The
Therank(
rank( ))corresponds
correspondsto tothe
themaximal
maximalnumber
numberofoflinearly
linearly
independent
independentcolumns/rows
columns/rowsofof . .
13
Ref:[Book4]
Ref: [Book4]Stephen
StephenBoyd
Boydand
andLieven
LievenVandenberghe,
Vandenberghe,“Introduction
“Introductiontoto 1414
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Exercises
• The principled way for computing rank is to do Echelon Form
– https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA
• In many cases, however, there are shortcuts, especially for small
size-matrix
Computerankofmatrices
• What is the rank of
1 2
2
2 1
1 1
1
100 100
1 −2 3
0 −3 3 2
1 1 0
14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
– Prof. Helen’s part will follow up
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule
15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlationdoesn'timplycausation buta casualrelationshipbetweentwoeventsmean
Theyarecorrelated
Causality
• Causality, or causation is: cause s effect
– The influence by which one event or process (i.e., cause)
contributes to another (i.e. effect),
– The cause is partly responsible for the effect, and the effect is
partly dependent on the cause
16
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• Examples of (probable) causal relations:
– New web design implemented >> Web page traffic increased
– Uploaded new app store images >> Downloads increased by 2X
– One works hard and attends lectures/tutorials >> Gets A in EE2211
17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• One popular way to causal data analysis is Randomized
Controlled Trial (RCT)
– A study design that randomly assigns participants into an experimental
group or a control group.
– As the study is conducted, the only expected difference between two
groups is the outcome variable being studied.
• Example:
– To decide whether smoking and lung cancer has a causal relation, we put
participants into experimental group (people who smoke) and control group
(people who don’t smoke), and check whether they develop lung cancer
eventually. find
therateoflungcancerpatient conclusion Mostprobablysmokingwillcause
grouphashigherrate
ifexperimental cancer
oflung cancerp atient
• RCT is sometimes in feasible to conduct, and also has moral
issues. infeasible
18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
Correlation isisaa statistical
Correlation relationship
statistical relationship
probability
• Decades of data show a clear causal relationship
– Decades
between of smoking
data show a clear
and cancer.causal relationship between smok
and cancer.
• If one smokes, it is a sure thing that his/her risk of cancer
– If you
will smoke,
increase.it is a sure thing that your risk of cancer will increas
– But it isitnot
• But a sure
is not a surething thatthat
thing youonewillwill
getgetcancer.
cancer.
– The
• Thecausal effect isisreal,
relationship not but it is an effect
deterministic. on your average risk.
forindividualaspect
butcan see atrendfor a population
19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
linearity doesnotensureaccuracy
I Correlation
Correlation
• •InInstatistics,
statistics,correlation
correlationisisany
anystatistical
statisticalrelationship,
relationship,
whether
whethercausal
causalorornot,
not,between
betweentwo tworandom
random variables.
variables.
• •Correlations
Correlationsare areuseful
usefulbecause
becausethey theycan
canindicate
indicateaa
predictive
predictiverelationship
relationshipthat
thatcan
canbe beexploited
exploitedininpractice.
practice.
• •Linear
Linearcorrelation
correlationcoefficient,
coefficient,r,r,which
whichisisalso
alsoknown
knownasas
the
thePearson
PearsonCoefficient.
Coefficient.
i ra l
11
11
https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html
1.https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html
21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
counterintuitive
Simpson’s paradox
• Simpson's paradox is a phenomenon in probability and
statistics, in which a trend appears in several different
groups of data but disappears or reverses when these
groups are combined.
22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Simpson's Paradox
Example
• Batting Average in professional baseball game
• Two well-known players, Derek Jeter and David Justice
23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
– Prof. Helen’s part will follow up
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule
24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Probability
• We describe a random experiment by describing its
procedure and observations of its outcomes.
• Outcomes are mutual exclusive in the sense that only one
outcome occurs in a specific trial of the random
experiment.
– This also means an outcome is not decomposable.
– All unique outcomes form a sample space.
• A subset of sample space (, denoted as ), is an event in
a random experiment ) ⊂ (, that is meaningful to an
application.
– Example of an event: faces with numbers smaller than 3
25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Axioms of Probability
Axioms of Probability
1. 0
2. =1
= , then = +
*otherwise, = + ( )
26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681" EE2211 Introduction to Machine Learning 24
Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random event.
Countable limit
ofvalues
jumber
• There are two types of random variables: discrete and
continuous.
I unlimitednumberofvalues
27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations
• Some books used P(·) and p(·) to distinguish between the
probability of discrete random variable and the probability
of continuous random variables respectively.
28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Discrete
Discrete randomrandom variable
variable DRV
• A• discrete
A discrete random variable (DRV) takes on only a countable number
random variable (DRV) takes on only a
of distinct values such as red, orange, blue or 1, 2, 3.
countable number of distinct values such as red, yellow,
blue
• The probability distribution of a discrete random variable is described
• Thebyprobability distribution
a list of probabilities of a discrete
associated random
with each variable
of its is values.
possible
described by a list of probabilities associated with each of its possible
values.
• This list of probabilities is called a
- This list of probabilities is called a
probability mass function (pmf).
probability mass function (pmf).
– Like a histogram, except that here
(Like a histogram, except that here
the probabilities sum to 1
the probabilities sum to 1)
31
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Continuous random
Continuous variable
random variable
• A continuous random variable (CRV) takes an infinite
• A continuous
number random
of possible values variable
in some (CRV) takes an infinite
interval.
numberinclude
– Examples of possible
height, values in some
weight, and time. interval.
– Examples
– Because include height,
the number weight,
of values of aand time.
continuous random variable X
is–infinite,
The number of values ofPr(X
the probability a continuous random
= c) for any c is 0.variable X is infinite, the
probability Pr(X = c) for any c is 0
– Therefore, instead of the list of probabilities, the probability
– Therefore, instead of the list of probabilities, the probability distribution of a
distribution of a CRV (a continuous probability distribution) is
CRV (a continuous probability distribution) is described by a probability
described
densityby a probability
function (pdf). density function (pdf).
– The pdf pdf
– The is aisfunction
a functionwhose
whosecodomain
range is
is nonnegative
nonnegative and and the
thearea
area under the fix
curve is equal
under the curveto is
1.equal to 1.
Pre fix dy c notveryuseful in programming
useCumulativeDensityFunction CPF instead rangeis no
GFxlx Pr fix dx negative
Fyi
11 É
at
givenmeant
wiaim. 1p2
fromscipyimportstats
A probability density function
at statsnorm.cdfixa.mean.gg 1 y
A probability density function
mean.f ans
stats.norm.cdflx2
pzP't EE2211 Introduction to Machine Learning 30
yright EE, NUS. All Rights Reserved. 32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Continuous random variable
• The expectation of a continuous random variable is given
• The expectation of a
by d continuous random variable is given
• The expectation of d
by a continuous random variable is given
• The
where
by expectation
is the pdfofofa continuous
d the variablerandomand variable
is the is given
where
integral ofis
by the pdf ofd the . variable and is the
where isfunction
the pdf of the variable and is the
• integral
where
The of
variance function
is theofpdf .
of the variable
a continuous random and is the
variable is given
integral of function .
• The
by variance
integral of of a) continuous
(function . d random variable is given
• The variance
by ( of )a continuous
d random variable is given
• The
by variance ( of a ) continuous
d random variable is given
• by
Integral is (an equivalent
) dof the summation over all values of
• Integral is anwhen
the function equivalent of the summation
the function over all
has a continuous values of
domain.
• Integral
the is anwhen
function equivalent of the has
the function summation over all
a continuous values of
domain.
• It equals
Integral
the the
is anwhen
function area under
equivalent the curve
of the has
the function of the
summation function.
over all
a continuous values of
domain.
••• It
the
The
It
equals the
function
property
equals the
area
when under
the
of theunder
area
the
pdffunction
that
the
curve
the has of
area
curve of
the function.
a continuous
under
the function.
domain.
its curve is 1
•• The property
mathematically
It equals the areaof means
theunder
pdf that
that
the the
curvearea under
ofd the 1 its curve is 1
= function.
• The property of means
mathematically the pdf that
that the aread under= 1 its curve is 1
• The property of means
mathematically the pdf that
that the aread under= 1 its curve is 1
mathematically means that d =1
33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Mean and Standard Deviation of a Gaussian Distribution
Mean and Standard DeviationGaussian
of aformula
Gaussian
Mean and Standard Deviation of a Gaussian
distribution
Gaussian Distribution
distribution fix
f exp Y
95%
90%95%
90%
standarddeviation
g
mean
34
EE2211 Introduction to Machine Learning 32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 1
• Independent random variables
• Consider tossing a fair coin twice, what is the probability
of having (H,H)? Assuming a coin has two sides, H=head
and T=Tail
– Pr(x=H, y=H) = Pr(x=H)Pr(y=H) = (1/2)(1/2) = 1/4
0
a P X x Yy Pr X x x Pr
Independent Yay Bayes Rule
O SumRule Pr X x Pr X x Y y Pr Y y
35
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 2
• Dependent random variables
• Given 2 balls with different colors (Red and Black), what
is the probability of first drawing B and then R? Assuming
we are drawing the balls without replacement.
whatistheprobabilityofdrawingaredball
• Mathematically: afterdrawingablankball
1
– Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B) = 1×(1/2) = 1/2
Conditional Probability
36
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 3
• Dependent random variables 4 Pr X 13Y 4 5 16
37
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Two Basic Rules
• Sum Rule
Pr - = " = / Pr(- = ", 2 = 3) )
(
t
D
• Product Rule
Pr X Y yi P Y yi
38
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Bayes’ Rule
• The conditional probability Pr(2 = 3|- = ") is the
probability of the random variable 2 to have a specific
value 3, given that another random variable - has a
specific value of ".
39
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
xample Example
• Drawing a sample of fruit from a box
efine:– First pick a box, and then draw a sample of fruit from it
• B random variable for box picked
– B: variable for Box, can be r (red) or b (blue)
• B = {blue(b), red(r)}
– F: variable for Fruit, can be o (orange) or a (apple)
• F identity of fruit
Probabilityofdrawingtheredbox
Flor fly
• •F = {apple(a), orange(o)} r 0
Pr(B=r)=0.4 prior Flo
• P(B=r)=0.4 and P(B=b)=0.6 can likelihood 0.75 0.4
• Pr(F=o | B=r)= 0.75 easily
• Events are mutually exclusive and include all possible outcomes
obtainedevidence
• Pr(F=o)= 0.45
• Their probabilities must sum to 1
40
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary
• (Very Gentle) Introduction to Linear Algebra
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule
41
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
42
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"