Lecture 3 - Introduction To Linear Algebra, Probability and Statistics (DONE!!)

EE2211 Introduction to
Machine Learning
Lecture 3
Wang Xinchao
xinchao@nus.edu.sg
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Course Contents
• Introduction and Preliminaries (Xinchao)
– Introduction
– Data Engineering
– Introduction to Linear Algebra, Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Xinchao)
– Performance Issues
– K-means Clustering
– Neural Networks
2
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
• (Very Gentle) Introduction to Linear Algebra
– Prof. Helen’s part will follow up
• Causality and Simpson’s paradox
• Random Variable, Bayes’ Rule
4
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
(Very Gentle) Introduction to Linear Algebra
• A scalar is a simple numerical value, like 15 or −3.25

– Focus on real numbers
• Variables or constants that take scalar values are
denoted by an italic letter, like x or a
5
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations,
• A vector Vectors,
is an ordered Matrices
list of scalar values, called
attributes
• A vector is an ordered list of scalar values
– Denoted byby
– Denoted a bold
a boldcharacter,
character, e.g. or aa
e.g. x or
• In many books, vectors are written column-wise:
• In many books, vectors are written column-wise:
2 2 1
= , = , =
3 5 0
• The three vectors above are two-dimensional (or h
• The three vectors above are two-dimensional, or have
elements)
two elements
6
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors,
Notations, Matrices
Vectors, Matrices
• We denote an attribute of a vector as an italic value
• We denote an entry or attribute of a vector as an italic
with an index, e.g.
value with an index, e.g. !(") or " (") .
– The index j denotes a specific dimension of the vector, the
– Theposition
index j denotes a specific
of an attribute dimension
in the list of the vector, the position
of an attribute in the list
2 2
= = or more commonly = =
3 3
Note:
• Note:
• The notation should not be confused with the power operator, such
asisthe
– $ (") not2 to
in be confused with the power
(cubed)
operation, e.g., $ $ (squared)
– •Square
Square ofofanan indexed
indexed attribute
attribute ofof a vector
a vector is is denoted
denoted asby
($(" )$ ).
(") (')
– •A A variable
variable canhave
may havetwo
twoorormore
moreindices,
indices,e.g.,
like $this:
% or $%," or ,
• For example, in neural networks, we denote as , the input feature j of

unit u in layer l
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp
7
© Copyright EE, NUS. All Rights Reserved. 7
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations, Vectors, Matrices
• Vectors can be visualized as, in a multi-dimensional space,
– arrows
Vectors can that point to as
be visualized some directions,
arrows that pointorto some directions as well
as –points
pointsin a multi-dimensional space
2 2 1
Illustrations of three two-dimensional vectors, = , = , and =
3 5 0
Figure 1: Three vectors visualized as directions and as points.

8
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681" Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
• A matrix is a rectangular array of numbers arranged in r
• A matrix is a rectangularcolumns
array of numbers arranged in rows and
columns – Denoted with bold capital letters, such as X or W
– Denoted with bold capital– letters,
An example of a
such as X matrix
or W with two rows and three colum
– An example of a matrix with two rows 2 4 3 Sample I
=
and three columns: 21 6 1 Sample 2
• A set is an unordered collection of unique t elements
1st
feature 2ndfeature 3rdfeature
– Denoted
• A set is an unordered collection ofas a calligraphic
unique elements capital character e.g.,
– When
– When an element $ belongs to a an
setelement
%, we writebelongs
$ ∈ %. to a set , we write
– A special set denoted R – A special
includes all set
realdenoted
numbersRfrom includes
minusallinfinity
real numbers
to from
plus infinity infinity to plus infinity
Note:
• For the elements in matrix , we shall use the indexing , where the
• Note: row and the column positions
second indices respectively indicate the row
column
– For elements in matrix• X,Usually,
we shall usedata,
for input the indexing $(,( samples
rows represent where the
and first andrepresen
columns
Ref: [Book1] Andriy (()
Burkov, “The Hundred-Page Machi
second indices indicate the row and the column position, or $( .
– Usually, for input data, rows
© Copyright EE, represent
NUS. All Rights Reserved. samples and columns represent
features mm
9
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations,
Notations, Vectors,
Vectors, Matrices
Matrices
• Capital Sigma: the summation over a collection X = { ,
, , Sigma:
• Capital , . . . , the} is denoted by:
summation over a collection X = { ,
, , ,..., } is denoted by:
= + +…+ +
= + +…+ +
• Capital Pi: the product over a collection X = { , , ,
, . . . , Pi: }the
• Capital is denoted by: a collection X = { , ,
product over ,
,..., } is denoted by:
= · ·…· ·
= · ·…· ·
Note:
Note:
• Capital Sigma and Pi can be applied to the attributes of a vector x
• Capital Sigma and Pi can be applied to the attributes of a vector x
Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
10
11
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681" Ref: [Book1] Andriy Burkov, “The Hundred-Page Machine Learning Book”, 2019, Chp2.
© Copyright EE, NUS. All Rights Reserved.
Systems of Linear
Systems Equations
of Linear Equations
Lineardependence
Linear dependence and
and independence
independence
• A collection of d-vectors ,…, (with 1) is called linearly

dependent if
+ + =0
holds for some ,…, that are not all zero.
• A collection of d-vectors , … , (with 1) is called linearly

independent if it is not linearly dependent, which means that
+ + =0
only holds for = = = 0.
Note:
Note: If all
If all rows
rows or columns
or columns of a square
of a square matrixmatrix are linearly
X are linearly
independent,
independent, then X is invertible.
then is invertible.
Ref:
Ref: [Book4]
[Book4] Stephen
Stephen Boyd
Boyd and
and Lieven
Lieven Vandenberghe,
Vandenberghe, “Introduction
“Introduction to
to
Applied
Applied Linear
Linear Algebra”,
Algebra”, Cambridge
Cambridge University
University Press,
Press, 2018
2018 (Chp5
(Chp5 && 11.1)
11.1)
12
12
© 11
© Copyright
Copyright EE,
EE, NUS.
NUS. All
All Rights
Rights Reserved.
Reserved.
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems
Systems of Linear
of Linear Equations
Equations
Geometry of dependency and independency
2
2
1
1
+ =0 +
13 12
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Systems
Systems
Systems of of Linear
of Linear
Linear Equations
Equations
Equations
These
Theseequations
equationscan canbe
bewritten
writtencompactly
compactlyininmatrix-vector
matrix-vector
notation:
notation:
==
Where
Where targetvector
datamatrix
,, ,, … … ,,
== ,, == ,, == ..
,, ,, …
… ,,
Note:
Note:
××
•• The
Thedata
datamatrix
matrix and
andthe
thetarget
targetvector
vector are
aregiven
given
•• The
Theunknown
unknownvector
vectorofofparameters
parameters isisto
tobe
belearnt
learnt
•• The
Therank(
rank( ))corresponds
correspondsto tothe
themaximal
maximalnumber
numberofoflinearly
linearly
independent
independentcolumns/rows
columns/rowsofof . .
13
Ref:[Book4]
Ref: [Book4]Stephen
StephenBoyd
Boydand
andLieven
LievenVandenberghe,
Vandenberghe,“Introduction
“Introductiontoto 1414
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Exercises
• The principled way for computing rank is to do Echelon Form
– https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA
• In many cases, however, there are shortcuts, especially for small
size-matrix
Computerankofmatrices
• What is the rank of
1 2
2
2 1
1 1
1
100 100
1 −2 3
0 −3 3 2
1 1 0
14
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
15
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlationdoesn'timplycausation buta casualrelationshipbetweentwoeventsmean
Theyarecorrelated
Causality
• Causality, or causation is: cause s effect
– The influence by which one event or process (i.e., cause)
contributes to another (i.e. effect),
– The cause is partly responsible for the effect, and the effect is
partly dependent on the cause
• Causality relates to an extremely very wide domain of subjects:

philosophy, science, management, humanity.
• Causality research is extremely complex

– Researcher can never be completely certain that there are no other
factors influencing the causal relationship,
– In most cases, we can only say “probably” causal.
16
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• Examples of (probable) causal relations:
– New web design implemented >> Web page traffic increased
– Uploaded new app store images >> Downloads increased by 2X
– One works hard and attends lectures/tutorials >> Gets A in EE2211
• Examples of (probable) non-causal relations:

– Your height and weight ? Gets A in EE2211
– Your favorite color ? Your GPA in NUS
17
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
• One popular way to causal data analysis is Randomized
Controlled Trial (RCT)
– A study design that randomly assigns participants into an experimental
group or a control group.
– As the study is conducted, the only expected difference between two
groups is the outcome variable being studied.
• Example:
– To decide whether smoking and lung cancer has a causal relation, we put
participants into experimental group (people who smoke) and control group
(people who don’t smoke), and check whether they develop lung cancer
eventually. find
therateoflungcancerpatient conclusion Mostprobablysmokingwillcause
grouphashigherrate
ifexperimental cancer
oflung cancerp atient
• RCT is sometimes in feasible to conduct, and also has moral
issues. infeasible
18
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Causality
Correlation isisaa statistical
Correlation relationship
statistical relationship
probability
• Decades of data show a clear causal relationship
– Decades
between of smoking
data show a clear
and cancer.causal relationship between smok
and cancer.
• If one smokes, it is a sure thing that his/her risk of cancer
– If you
will smoke,
increase.it is a sure thing that your risk of cancer will increas
– But it isitnot
• But a sure
is not a surething thatthat
thing youonewillwill
getgetcancer.
cancer.
– The
• Thecausal effect isisreal,
relationship not but it is an effect
deterministic. on your average risk.
forindividualaspect
butcan see atrendfor a population
19
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
linearity doesnotensureaccuracy
I Correlation
Correlation
• •InInstatistics,
statistics,correlation
correlationisisany
anystatistical
statisticalrelationship,
relationship,
whether
whethercausal
causalorornot,
not,between
betweentwo tworandom
random variables.
variables.
• •Correlations
Correlationsare areuseful
usefulbecause
becausethey theycan
canindicate
indicateaa
predictive
predictiverelationship
relationshipthat
thatcan
canbe beexploited
exploitedininpractice.
practice.
• •Linear
Linearcorrelation
correlationcoefficient,
coefficient,r,r,which
whichisisalso
alsoknown
knownasas
the
thePearson
PearsonCoefficient.
Coefficient.
i ra l
11
11
https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html
1.https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Descriptive-Statistics/Measures-of-Relation-Between-Variables/Correlation/index.html
EE2211 Introduction to Machine Learning 20 17

!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Correlation does not imply causation
Correlation does not imply causation!
• Most data analyses involve inference or prediction.
• Unless a randomized study is performed, it is difficult to infer why
• Some
there is great examples
a relationship of correlations
between two variables.that can be
• calculated but are clearly
Some great examples not causally
of correlations that can related appear
be calculated at
but are
http://tylervigen.com/
clearly not causally related appear at http://tylervigen.com/
(See figure below).
EE2211 Introduction to Machine Learning 18

21
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
counterintuitive
Simpson’s paradox
• Simpson's paradox is a phenomenon in probability and
statistics, in which a trend appears in several different
groups of data but disappears or reverses when these
groups are combined.
22
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Simpson's Paradox
Example
• Batting Average in professional baseball game
• Two well-known players, Derek Jeter and David Justice
23
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Outline
24
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Probability
• We describe a random experiment by describing its
procedure and observations of its outcomes.
• Outcomes are mutual exclusive in the sense that only one
outcome occurs in a specific trial of the random
experiment.
– This also means an outcome is not decomposable.
– All unique outcomes form a sample space.
• A subset of sample space (, denoted as ), is an event in
a random experiment ) ⊂ (, that is meaningful to an
application.
– Example of an event: faces with numbers smaller than 3
25
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Axioms of Probability
Axioms of Probability
Assuming events and , the probabilities of

events related with and must satisfy,
1. 0
2. =1
= , then = +
*otherwise, = + ( )
26
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681" EE2211 Introduction to Machine Learning 24
Random Variable
• A random variable, usually written as an italic capital
letter, like X, is a variable whose possible values are
numerical outcomes of a random event.
Countable limit
ofvalues
jumber
• There are two types of random variables: discrete and
continuous.
I unlimitednumberofvalues
27
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Notations
• Some books used P(·) and p(·) to distinguish between the
probability of discrete random variable and the probability
of continuous random variables respectively.
• We shall use Pr(·) for both the above cases
28
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Discrete
Discrete randomrandom variable
variable DRV
• A• discrete
A discrete random variable (DRV) takes on only a countable number
random variable (DRV) takes on only a
of distinct values such as red, orange, blue or 1, 2, 3.
countable number of distinct values such as red, yellow,
blue
• The probability distribution of a discrete random variable is described
• Thebyprobability distribution
a list of probabilities of a discrete
associated random
with each variable
of its is values.
possible
described by a list of probabilities associated with each of its possible
values.
• This list of probabilities is called a
- This list of probabilities is called a
probability mass function (pmf).
probability mass function (pmf).
– Like a histogram, except that here
(Like a histogram, except that here
the probabilities sum to 1
the probabilities sum to 1)
Ref: Book 1, Chapter 2.2.

A probability mass function punt
EE2211 Introduction to Machine Learning 27 29
opyright EE, NUS. All Rights Reserved.
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Discrete random variable
countable
• Let a discrete random variable X have k possible values
•• Let
Let a
a discrete
discrete
. random
random variable X have
variable X have kk possible
possible values
values
..
• The expectation of X denoted as is given by,
•• The
The expectation
expectation ·of
of X denoted as is
is given
given by,
Pr(X denoted
= ) as by,
· Pr( = ))
pi = · Pr( =· Pr() + = · Pr( = ) + ··· + · Pr( = )
= · Pr( = )+ ·· Pr( = )) +
Pr( = ···
··· + ·· Pr(
Pr(the= ))
where =Pr( · Pr( +
= =) is) the probability +that +X has =value
where
where Pr(
Pr( to= is
is the
=the ))pmf.the probability
probability that X has
that X has the
the value
value
according
according
according to
to the
the pmf.
pmf.
• The expectation of a random variable is also called the
•• The
The expectation
expectation
mean, average or of a
a random
random variable
of expected variable
value andis
is also
also called
called the
is frequentlythe
mean,
denotedaverage
mean, with theor
average expected
orletter
expected
. value
value and
and is
is frequently
frequently
denoted
denoted with
with the
the letter
letter ..
• The expectation is one of the most important statistics of a
•• The
The expectation
variable.is
expectation
random is one
one of
of the
the most
most important
important statistics
statistics of
of a
a
random
random variable.
variable.
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
30
Discrete random variable
• Another important statistic is the standard deviation,
• defined
Anotheras,
important statistic is the standard deviation,
defined as,
• Another important
( )statistic
. is the standard deviation,
defined as,denoted
• Variance, ( ) as. or var(X), is defined as,
• Variance,
= (denoted
( ) ) as. or var(X), is defined as,
• For a=
Variance, (denoted
discrete ) as variable,
random or var(X),the standard
is defined deviation is
as,
• given
For a=discrete
by ( random
) variable, the standard deviation is
given
• For by Pr( =random
a=discrete )( variable,
) + the+ Pr(standard
= )(deviation) is
given
where=by =Pr( ( =). )( ) + + Pr( = )( )
where= = Pr( ( =). )( ) + + Pr( = )( )
where = ( ).
31
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Continuous random
Continuous variable
random variable
• A continuous random variable (CRV) takes an infinite
• A continuous
number random
of possible values variable
in some (CRV) takes an infinite
interval.
numberinclude
– Examples of possible
height, values in some
weight, and time. interval.
– Examples
– Because include height,
the number weight,
of values of aand time.
continuous random variable X
is–infinite,
The number of values ofPr(X
the probability a continuous random
= c) for any c is 0.variable X is infinite, the
probability Pr(X = c) for any c is 0
– Therefore, instead of the list of probabilities, the probability
– Therefore, instead of the list of probabilities, the probability distribution of a
distribution of a CRV (a continuous probability distribution) is
CRV (a continuous probability distribution) is described by a probability
described
densityby a probability
function (pdf). density function (pdf).
– The pdf pdf
– The is aisfunction
a functionwhose
whosecodomain
range is
is nonnegative
nonnegative and and the
thearea
area under the fix
curve is equal
under the curveto is
1.equal to 1.
Pre fix dy c notveryuseful in programming
useCumulativeDensityFunction CPF instead rangeis no
GFxlx Pr fix dx negative
Fyi
11 É
at
givenmeant
wiaim. 1p2
fromscipyimportstats
A probability density function
at statsnorm.cdfixa.mean.gg 1 y
A probability density function
mean.f ans
stats.norm.cdflx2
pzP't EE2211 Introduction to Machine Learning 30
yright EE, NUS. All Rights Reserved. 32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Continuous random variable
• The expectation of a continuous random variable is given
• The expectation of a
by d continuous random variable is given
• The expectation of d
by a continuous random variable is given
• The
where
by expectation
is the pdfofofa continuous
d the variablerandomand variable
is the is given
where
integral ofis
by the pdf ofd the . variable and is the
where isfunction
the pdf of the variable and is the
• integral
where
The of
variance function
is theofpdf .
of the variable
a continuous random and is the
variable is given
integral of function .
• The
by variance
integral of of a) continuous
(function . d random variable is given
• The variance
by ( of )a continuous
d random variable is given
• The
by variance ( of a ) continuous
d random variable is given
• by
Integral is (an equivalent
) dof the summation over all values of
• Integral is anwhen
the function equivalent of the summation
the function over all
has a continuous values of
domain.
• Integral
the is anwhen
function equivalent of the has
the function summation over all
a continuous values of
domain.
• It equals
Integral
the the
is anwhen
function area under
equivalent the curve
of the has
the function of the
summation function.
over all
a continuous values of
domain.
••• It
the
The
It
equals the
function
property
equals the
area
when under
the
of theunder
area
the
pdffunction
that
the
curve
the has of
area
curve of
the function.
a continuous
under
the function.
domain.
its curve is 1
•• The property
mathematically
It equals the areaof means
theunder
pdf that
that
the the
curvearea under
ofd the 1 its curve is 1
= function.
• The property of means
mathematically the pdf that
that the aread under= 1 its curve is 1
• The property of means
mathematically the pdf that
that the aread under= 1 its curve is 1
mathematically means that d =1
33
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Mean and Standard Deviation of a Gaussian Distribution
Mean and Standard DeviationGaussian
of aformula
Gaussian
Mean and Standard Deviation of a Gaussian
distribution
Gaussian Distribution
distribution fix
f exp Y
95%
90%95%
90%
standarddeviation
g
mean
34
EE2211 Introduction to Machine Learning 32
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 1
• Independent random variables
• Consider tossing a fair coin twice, what is the probability
of having (H,H)? Assuming a coin has two sides, H=head
and T=Tail
– Pr(x=H, y=H) = Pr(x=H)Pr(y=H) = (1/2)(1/2) = 1/4
0
a P X x Yy Pr X x x Pr
Independent Yay Bayes Rule
0 Dependent Pr X Yy Pr X X Y y Pr Yay Pr Yg Xx Pretax

x
Pr XxlY y
PrlY y IX x Pr xx PrYay
O SumRule Pr X x Pr X x Y y Pr Y y
35
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 2
• Dependent random variables
• Given 2 balls with different colors (Red and Black), what
is the probability of first drawing B and then R? Assuming
we are drawing the balls without replacement.
• The space of outcomes of taking two balls sequentially

without replacement:
B–R, R–B
– Thus having B-R is 1⁄2 .
whatistheprobabilityofdrawingaredball
• Mathematically: afterdrawingablankball
1
– Pr(x=B, y=R) = Pr(y=R | x=B) Pr(x=B) = 1×(1/2) = 1/2
Conditional Probability
36
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Example 3
• Dependent random variables 4 Pr X 13Y 4 5 16
• Given 3 balls with different colors (R,G,B), what is the

probability of first having B and then G?
• The space of outcomes of taking two balls sequentially
without replacement:
R–G | G–B | B–R
R–B | G–R | B–G Thus, Pr(y=G, x=B) = 1/6
• Mathematically:
Pr(y=G, x=B) = Pr(y=G | x=B) Pr(x=B)
= (1/2) × (1/3)
= 1/6
37
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Two Basic Rules
• Sum Rule
Pr - = " = / Pr(- = ", 2 = 3) )
(
t
D
• Product Rule
Pr X Y yi P Y yi
Pr - = ", 2 = 3 = Pr 2 = 3 - = " 5(- = ")
P.lt x1Y y PlY y
38
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Bayes’ Rule
• The conditional probability Pr(2 = 3|- = ") is the
probability of the random variable 2 to have a specific
value 3, given that another random variable - has a
specific value of ".
• The Bayes’ Rule (also known as the Bayes’ Theorem):

likelihood prior
Pr - = " 2 = 3 Pr(2 = 3)
Pr 2 = 3 - = " =
Pr(- = ")
posterior evidence
39
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
xample Example
• Drawing a sample of fruit from a box
efine:– First pick a box, and then draw a sample of fruit from it
• B random variable for box picked
– B: variable for Box, can be r (red) or b (blue)
• B = {blue(b), red(r)}
– F: variable for Fruit, can be o (orange) or a (apple)
• F identity of fruit
Probabilityofdrawingtheredbox
Flor fly
• •F = {apple(a), orange(o)} r 0
Pr(B=r)=0.4 prior Flo
• P(B=r)=0.4 and P(B=b)=0.6 can likelihood 0.75 0.4
• Pr(F=o | B=r)= 0.75 easily
• Events are mutually exclusive and include all possible outcomes
obtainedevidence
• Pr(F=o)= 0.45
• Their probabilities must sum to 1
• Pr(B=r | F=o) = 0.75*0.4/0.45 = 2/3

1 posterior
Theprobabilitythatdrawing an orange
is fromtheredbox
40
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
Summary
41
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"
42
!"#$%&'()*+",,-"./01"233"4()*+5"4656'7681"

Lecture 3 - Introduction To Linear Algebra, Probability and Statistics (DONE!!)

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture 3 - Introduction To Linear Algebra, Probability and Statistics (DONE!!)

Uploaded by

Copyright:

Available Formats

EE2211 Introduction to

• A scalar is a simple numerical value, like 15 or −3.25

• For example, in neural networks, we denote as , the input feature j of

Figure 1: Three vectors visualized as directions and as points.

• A collection of d-vectors ,…, (with 1) is called linearly

• A collection of d-vectors , … , (with 1) is called linearly

Geometry of dependency and independency

• Causality relates to an extremely very wide domain of subjects:

• Causality research is extremely complex

• Examples of (probable) non-causal relations:

EE2211 Introduction to Machine Learning 20 17

EE2211 Introduction to Machine Learning 18

Assuming events and , the probabilities of

• We shall use Pr(·) for both the above cases

Ref: Book 1, Chapter 2.2.

0 Dependent Pr X Yy Pr X X Y y Pr Yay Pr Yg Xx Pretax

• The space of outcomes of taking two balls sequentially

• Given 3 balls with different colors (R,G,B), what is the

Pr - = ", 2 = 3 = Pr 2 = 3 - = " 5(- = ")

P.lt x1Y y PlY y

• The Bayes’ Rule (also known as the Bayes’ Theorem):

• Pr(B=r | F=o) = 0.75*0.4/0.45 = 2/3

You might also like