Math Essentials
HOWEVER
• To get really useful results, you need good mathematical intuitions about certain general machine learning principles, as well as the inner workings of the individual algorithms.
Intuition:
• In some process, several outcomes are possible. When the process is repeated a large number of times, each outcome occurs with a characteristic relative frequency, or probability. If a particular outcome happens more often than another outcome, we say it is more probable.
1. Non-negativity:
   for any event E ∈ F, p( E ) ≥ 0
2. All possible outcomes:
   p( Ω ) = 1
• Continuous space: | Ω | is infinite
  Analysis involves integrals ( ∫ )
[Figure: example probability distribution — the sum of two fair dice]
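The dice example can be enumerated directly; a minimal standard-library sketch (illustrative only):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two fair dice
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Relative frequency of each sum = its probability
dist = {s: Fraction(c, 36) for s, c in counts.items()}

print(dist[7])             # 7 is the most probable sum: 1/6
print(sum(dist.values()))  # probabilities over all outcomes sum to 1
```

Using exact fractions makes the "all possible outcomes sum to 1" axiom easy to verify without floating-point noise.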
• Scenario:
  Several random processes occur (doesn't matter whether in parallel or in sequence).
  Want to know probabilities for each possible combination of outcomes.
• Can describe as joint probability of several random variables.
  Example: two processes whose outcomes are represented by random variables X and Y. Probability that process X has outcome x and process Y has outcome y is denoted as:
  p( X = x, Y = y )
Jeff Howbert Introduction to Machine Learning Winter 2012 18
Example of multivariate distribution
• Marginal probability
  Probability distribution of a single variable in a joint distribution.
  Example: two random variables X and Y:
  p( X = x ) = Σb p( X = x, Y = b ), summing over all values b of Y
• Conditional probability
  Probability distribution of one variable given that another variable takes a certain value.
  Example: two random variables X and Y:
  p( X = x | Y = y ) = p( X = x, Y = y ) / p( Y = y )
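Both formulas can be sketched on a small joint table. The model-type/manufacturer labels echo the marginal-probability figure, but the numbers here are invented for illustration:

```python
# Made-up joint distribution p(X = model type, Y = manufacturer); sums to 1
joint = {
    ("sport",   "American"): 0.05, ("sport",   "Asian"): 0.05, ("sport",   "European"): 0.05,
    ("SUV",     "American"): 0.15, ("SUV",     "Asian"): 0.10, ("SUV",     "European"): 0.05,
    ("minivan", "American"): 0.10, ("minivan", "Asian"): 0.05, ("minivan", "European"): 0.05,
    ("sedan",   "American"): 0.10, ("sedan",   "Asian"): 0.15, ("sedan",   "European"): 0.10,
}

def marginal_y(y):
    # p(Y = y) = sum over all x of p(X = x, Y = y)
    return sum(p for (x, yi), p in joint.items() if yi == y)

def conditional_x_given_y(x, y):
    # p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
    return joint[(x, y)] / marginal_y(y)

print(marginal_y("Asian"))                    # 0.35
print(conditional_x_given_y("sedan", "Asian"))
```

Note that conditioning simply renormalizes one slice of the joint table so it sums to 1.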
Example of marginal probability
[Figure: 3-D bar chart of a joint distribution; X = model type (sport, SUV, minivan, sedan), Y = manufacturer (American, Asian, European); probability axis from 0 to 0.2]
Given:
• A discrete random variable X, with possible values x = x1, x2, ..., xn
• Probabilities p( X = xi ) that X takes on the various values of xi
• A function yi = f( xi ) defined on X
  E( f ) = Σi p( xi ) f( xi )
• Mean ( μ )
  f( xi ) = xi  ⇒  μ = E( f ) = Σi p( xi ) xi
  Average value of X = xi, taking into account probability of the various xi.
  Most common measure of center of a distribution.
• Variance ( σ² )
  f( xi ) = ( xi − μ )²  ⇒  σ² = Σi p( xi ) ( xi − μ )²
  Average value of squared deviation of X = xi from mean μ, taking into account probability of the various xi.
  Most common measure of spread of a distribution.
  σ is the standard deviation.
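The mean and variance formulas above can be checked on a single fair die (the distribution here is the standard one; nothing beyond the slide's formulas is assumed):

```python
# Discrete random variable: one fair die, p(x) = 1/6 for x = 1..6
xs = range(1, 7)
p = {x: 1 / 6 for x in xs}

mu = sum(p[x] * x for x in xs)               # mean: sum_i p(xi) * xi
var = sum(p[x] * (x - mu) ** 2 for x in xs)  # variance: sum_i p(xi) * (xi - mu)^2
sigma = var ** 0.5                           # standard deviation

print(mu)   # 3.5
print(var)  # 35/12 ~ 2.9167
```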
[Figure: scatter plots illustrating high (positive) covariance and no covariance]
• Compare to formula for covariance of actual samples:
  cov( x, y ) = ( 1 / ( N − 1 ) ) Σi=1..N ( xi − x̄ )( yi − ȳ )
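The sample-covariance formula translates directly to code; the data points below are invented, chosen so that y depends linearly on x and the covariance comes out positive:

```python
# Sample covariance with the 1/(N-1) normalization from the formula above
def cov(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # perfectly linear in x -> large positive covariance

print(cov(x, y))  # 10/3
```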
[Figures: correlation examples — linear dependence without noise; various nonlinear dependencies]
p( not A ) = 1 − p( A )
[Venn diagram: event A and its complement not A]
p( A, B ) = p( A | B ) p( B )
(same expression given previously to define conditional probability)
[Venn diagram: regions ( A, B ), ( A, not B ), ( not A, B ), ( not A, not B )]
p( A ) = p( A, B ) + p( A, not B )
(same expression given previously to define marginal probability)
p( A | B ) = p( A )  or  p( A, B ) = p( A ) p( B )
• Independence:
  Outcomes on multiple rolls of a die
  Outcomes on multiple flips of a coin
  Height of two unrelated individuals
  Probability of getting a king on successive draws from a deck, if card from each draw is replaced
• Dependence:
  Height of two related individuals
  Duration of successive eruptions of Old Faithful
  Probability of getting a king on successive draws from a deck, if card from each draw is not replaced
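The card-drawing example can be worked out exactly with fractions (a sketch of the standard 52-card calculation; only the deck composition is assumed):

```python
from fractions import Fraction

p_king = Fraction(4, 52)  # 4 kings in a 52-card deck

# With replacement: the second draw is unaffected by the first -> independent
p_second_given_first_repl = Fraction(4, 52)

# Without replacement: 3 kings remain among 51 cards -> dependent
p_second_given_first_norepl = Fraction(3, 51)

# Marginal probability of a king on the second draw without replacement,
# via p(A) = p(A, B) + p(A, not B): still 4/52 by symmetry
p_second = p_king * Fraction(3, 51) + (1 - p_king) * Fraction(4, 51)

print(p_second_given_first_repl == p_king)    # True: independent
print(p_second_given_first_norepl == p_king)  # False: dependent
print(p_second == Fraction(1, 13))            # True
```

Note the distinction: without replacement the conditional probability changes, but the marginal probability of a king on the second draw is still 4/52.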
• Bayes' rule:
  p( B | A ) = p( A | B ) p( B ) / p( A )
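A quick numerical sketch of the rule, using invented numbers (a diagnostic-test style example; the prevalence and test rates below are made up for illustration):

```python
# p(B): prior probability of the condition (invented)
p_B = 0.01
# p(A | B): probability of a positive test given the condition (invented)
p_A_given_B = 0.9
# p(A | not B): false-positive rate (invented)
p_A_given_notB = 0.05

# Marginalize: p(A) = p(A, B) + p(A, not B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' rule: p(B | A) = p(A | B) p(B) / p(A)
p_B_given_A = p_A_given_B * p_B / p_A

print(p_B_given_A)  # about 0.154 despite the 0.9 sensitivity
```

The low prior drags the posterior well below the test's accuracy, which is the classic takeaway from this kind of calculation.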
• Rows represent samples (records, items, datapoints)
• Columns represent attributes (features, variables)

Example table (each row is one sample, each column an attribute):

  No   Married   100K  No
  No   Single     70K  No
  Yes  Married   120K  No
  No   Divorced   95K  Yes
  No   Married    60K  No
  Yes  Divorced  220K  No
  No   Single     85K  Yes
  No   Married    75K  No
  No   Single     90K  Yes

• Natural to think of each sample as a vector of attribute values
[Figures: vector operations — the first two results are vectors, the dot product's result is a scalar]
• Dot product alternative form:
  a = x · y = |x| |y| cos( θ )
  [Figure: vectors x and y with angle θ between them]
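Both forms of the dot product can be checked on two small vectors (the vectors are invented; their angle works out to 45 degrees):

```python
import math

x = [3.0, 0.0]
y = [1.0, 1.0]

# Componentwise form: x . y = sum_i x_i y_i
dot = sum(a * b for a, b in zip(x, y))

def norm(v):
    # |v| = sqrt(sum of squared components)
    return math.sqrt(sum(c * c for c in v))

# Geometric form: x . y = |x| |y| cos(theta)  =>  solve for theta
cos_theta = dot / (norm(x) * norm(y))
angle = math.degrees(math.acos(cos_theta))

print(dot)    # 3.0
print(angle)  # 45 degrees for these two vectors
```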
• Matrix-matrix multiplication
  Vector-matrix multiplication is just a special case.
  TO THE BOARD!!
• Multiplication is associative:
  A ( B C ) = ( A B ) C
• Multiplication is not commutative:
  A B ≠ B A (generally)
• Transposition rule:
  ( A B )^T = B^T A^T
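All three properties can be verified on tiny integer matrices; the matrices below are invented, and matmul is written out by hand to keep the sketch dependency-free:

```python
def matmul(A, B):
    # Standard row-by-column matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

print(matmul(A, matmul(B, C)) == matmul(matmul(A, B), C))             # True:  A(BC) = (AB)C
print(matmul(A, B) == matmul(B, A))                                   # False: AB != BA in general
print(transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)))  # True:  (AB)^T = B^T A^T
```

One counterexample is enough to show non-commutativity, but a single check can only illustrate (not prove) the two identities.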
• Maximum likelihood
• Expectation maximization
• Gradient descent
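The deck only names these methods; as a minimal taste of the last one, here is a gradient-descent sketch on a toy quadratic (the objective f(x) = (x − 3)² and the step size are invented for illustration):

```python
# Gradient of f(x) = (x - 3)^2
def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0    # starting point (arbitrary)
lr = 0.1   # learning rate (assumed; too large diverges, too small is slow)

for _ in range(100):
    x -= lr * grad(x)  # step opposite the gradient

print(x)  # converges toward the minimizer x = 3
```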