Lecture 2: Linear algebra, probability theory, and statistical inference
Let $X$ be the $n \times 1$ matrix (really, just a vector) with all entries equal to 1,
$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.$$
And consider the span of the columns (really, just one column) of $X$, the set of all vectors of the form
$$X\beta_0$$
(where $\beta_0$ here is any real number).
Now let $Y$ be the $n \times 1$ vector
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe $Y$ is in the subspace. But if it is not, we can ask: what is the vector in the subspace closest to $Y$?
That is, what value of $\beta_0$ minimizes the (squared) distance between $X\beta_0$ and $Y$,
$$\|Y - X\beta_0\|^2 = (Y - X\beta_0)^T (Y - X\beta_0) = \sum_{i=1}^{n} (Y_i - \beta_0)^2.$$
To minimize $\sum_{i=1}^{n} (Y_i - \beta_0)^2$, differentiate with respect to $\beta_0$ and set the derivative to zero:
$$\sum_{i=1}^{n} (Y_i - \beta_0) = 0.$$
Solving gives
$$\hat{\beta}_0 = \frac{\sum_{i=1}^{n} Y_i}{n},$$
or equivalently,
$$X\hat{\beta}_0 = X(X^T X)^{-1} X^T Y.$$
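As a quick numerical check (a minimal numpy sketch with made-up data, not from the lecture), the projection $X(X^T X)^{-1} X^T Y$ reproduces the vector whose entries all equal $\bar{Y}$:

```python
import numpy as np

# Made-up data for illustration.
Y = np.array([2.0, 3.0, 5.0, 10.0])
n = len(Y)
X = np.ones((n, 1))  # the n-by-1 column of ones

# beta0_hat solves the 1-by-1 normal equations (X^T X) b = X^T Y.
beta0_hat = np.linalg.solve(X.T @ X, X.T @ Y)
proj = X @ beta0_hat  # the projection of Y onto the span of X

print(beta0_hat)  # [5.] -- exactly the sample mean of Y
print(np.allclose(proj, Y.mean() * np.ones(n)))  # True
```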
Now instead let $X$ be the $n \times 1$ matrix
$$X = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix}.$$
And consider the span of the columns of $X$, the set of all vectors of the form
$$X\beta_1$$
(where again $\beta_1$ here is any real number).
Again let
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe $Y$ is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to $Y$?
That is, what value of $\beta_1$ minimizes the (squared) distance between $X\beta_1$ and $Y$,
$$\|Y - X\beta_1\|^2 = (Y - X\beta_1)^T (Y - X\beta_1) = \sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})^2.$$
Differentiating $\sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})^2$ with respect to $\beta_1$ and setting the derivative to zero gives
$$\sum_{i=1}^{n} X_{i1} (Y_i - \beta_1 X_{i1}) = 0.$$
The closest vector is
$$X\hat{\beta}_1 = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \hat{\beta}_1,$$
where
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} Y_i X_{i1}}{\sum_{i=1}^{n} X_{i1}^2},$$
or equivalently,
$$X\hat{\beta}_1 = X(X^T X)^{-1} X^T Y.$$
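The same kind of check works here (again a sketch with synthetic data): the closed-form ratio agrees with the projection $X(X^T X)^{-1} X^T Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)            # the single column X_{i1}
Y = 2.0 * x + rng.normal(size=10)  # synthetic responses
X = x.reshape(-1, 1)

beta1_hat = (Y * x).sum() / (x**2).sum()      # closed-form solution
proj = X @ np.linalg.solve(X.T @ X, X.T @ Y)  # X (X^T X)^{-1} X^T Y

print(np.allclose(proj, x * beta1_hat))  # True
```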
Now let $X$ be the $n \times 2$ matrix
$$X = \begin{pmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}.$$
And consider the span of the columns of $X$, the set of all vectors of the form
$$X\beta,$$
where $\beta$ here is the two-dimensional column vector $(\beta_0, \beta_1)^T$.
Again let
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe $Y$ is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to $Y$?
That is, what value of the two-dimensional $\beta$ minimizes the (squared) distance between $X\beta$ and $Y$,
$$\|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta) = \sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}])^2.$$
Differentiating with respect to $\beta_0$ and $\beta_1$ and setting both derivatives to zero gives the normal equations
$$\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}]) = 0$$
and
$$\sum_{i=1}^{n} X_{i1} (Y_i - [\beta_0 + \beta_1 X_{i1}]) = 0.$$
The closest vector is
$$X\hat{\beta} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \hat{\beta}_0 + \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \hat{\beta}_1,$$
where
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}_1, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1) Y_i}{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)^2},$$
or equivalently,
$$X\hat{\beta} = X(X^T X)^{-1} X^T Y.$$
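For the two-column case, a sketch (synthetic data again) comparing the closed-form $\hat{\beta}_0$ and $\hat{\beta}_1$ with numpy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
Y = 1.0 + 2.0 * x + rng.normal(size=50)  # synthetic responses

# Closed-form slope and intercept.
beta1_hat = ((x - x.mean()) * Y).sum() / ((x - x.mean())**2).sum()
beta0_hat = Y.mean() - beta1_hat * x.mean()

# The same fit from the matrix formula, via numpy's least squares.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

print(np.allclose([beta0_hat, beta1_hat], beta_hat))  # True
```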
Let $\Sigma = D$, a diagonal matrix with all positive entries. Note that $\Sigma$ is a simple example of a symmetric, positive definite matrix.
Note that we could write $\Sigma = IDI^T$, where $I$ is the identity matrix.
Note that the columns of $I$ are orthogonal, of unit length, and they are eigenvectors, with eigenvalues equal to the corresponding elements of $D$.
For any vector $c = (c_1, \ldots, c_p)^T$,
$$\Sigma c = d_1 c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$
And so
$$c^T \Sigma c = d_1 c_1 \, c^T \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 \, c^T \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p \, c^T \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} = c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$
In particular, $c^T \Sigma c > 0$ for every $c \neq 0$, which is what makes $\Sigma$ positive definite.
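A small numerical confirmation of the diagonal-case identity (the numbers are arbitrary):

```python
import numpy as np

d = np.array([1.0, 2.0, 3.0])   # positive diagonal entries of D
Sigma = np.diag(d)              # Sigma = D
c = np.array([0.5, -1.0, 2.0])

print(c @ Sigma @ c)     # 14.25
print((c**2 * d).sum())  # 14.25 -- c_1^2 d_1 + c_2^2 d_2 + c_3^2 d_3
```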
Let's start all over again. But this time, we'll not take $\Sigma = D$, a diagonal matrix with all positive entries. Instead, take $\Sigma = PDP^T$, where $D$ is again a diagonal matrix with all positive entries, and $P$ is a matrix whose columns are orthonormal (and span $p$-dimensional space).
Note that $\Sigma$ is a complicated example of a symmetric, positive definite matrix.
Note that we write $\Sigma = PDP^T$, where $P$ is not the identity matrix, but its columns, like those of $I$, are orthogonal and of unit length.
And just like the columns of $I$ were eigenvectors, so are the columns of $P$, again with eigenvalues equal to the corresponding elements of $D$.
For a vector with coordinates $c = (c_1, \ldots, c_p)^T$ in the basis given by the columns of $P$,
$$\Sigma P c = P D P^T P c = P D \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix} = P \begin{pmatrix} d_1 c_1 \\ d_2 c_2 \\ \vdots \\ d_p c_p \end{pmatrix} = c_1 d_1 P_1 + c_2 d_2 P_2 + \cdots + c_p d_p P_p,$$
where $P_j$ denotes the $j$-th column of $P$.
One last fact that will be relevant when we use these results: every symmetric positive definite matrix $\Sigma$ can be written in the form $PDP^T$, where the columns of $P$ are an orthonormal basis for $p$-dimensional space, the columns are the eigenvectors of $\Sigma$, and the diagonal matrix $D$ has as components the corresponding (positive) eigenvalues.
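This decomposition is exactly what numpy's `eigh` computes for a symmetric matrix. A sketch, with an arbitrary positive definite $\Sigma$ built just for the example:

```python
import numpy as np

# Build an arbitrary symmetric positive definite matrix.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)

# eigh returns the eigenvalues (entries of D) and orthonormal
# eigenvectors (columns of P).
eigvals, P = np.linalg.eigh(Sigma)

print(np.all(eigvals > 0))                             # True: eigenvalues positive
print(np.allclose(P @ np.diag(eigvals) @ P.T, Sigma))  # True: Sigma = P D P^T
print(np.allclose(P.T @ P, np.eye(4)))                 # True: columns orthonormal
```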
We can find $p$ linearly independent, orthogonal, unit-length $p$-dimensional eigenvectors $P_j$.
Let $P$ be the $p \times p$ matrix whose columns are the $P_j$, and let $D$ be the diagonal matrix whose entries are the corresponding eigenvalues. Then $\Sigma = PDP^T$.
The local maxima of $c^T \Sigma c$ over unit-length vectors $c$ are attained at the eigenvectors, and the corresponding local maxima are the eigenvalues.
Geometrically, $\Sigma$ corresponds to an ellipsoid with axes the $P_j$, and with the lengths of the axes equal to twice the eigenvalues.
For a vector $c$, $P^T c$ has as its components the $a_j$ for which
$$\sum_{j=1}^{p} a_j P_j = c.$$
Multiplying by $D$ scales each component by the corresponding eigenvalue $\lambda_j$, and so $PDP^T c$ is
$$\sum_{j=1}^{p} a_j \lambda_j P_j.$$
In short, $\Sigma$ maps
$$\sum_{j=1}^{p} a_j P_j \quad \longmapsto \quad \sum_{j=1}^{p} a_j \lambda_j P_j.$$
Let $\theta$ be a random variable with (prior) density $\pi(\theta)$.
Let $Y$ be a (vector of) random variable(s) with (joint) conditional density $f_\theta(y)$ given $\theta$.
The conditional density of $\theta$ given $Y = y$ is
$$\pi(\theta \mid y) = \frac{\pi(\theta) f_\theta(y)}{\int \pi(\theta) f_\theta(y) \, d\theta}.$$
The conditional expectation of $\theta$ given $Y = y$ is
$$E\{\theta \mid y\} = \int \theta \, \pi(\theta \mid y) \, d\theta.$$
And the posterior mode satisfies
$$\frac{d}{d\theta} \ln \pi(\theta) + \frac{d}{d\theta} \ln f_\theta(y) = 0.$$
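These integrals can be approximated on a grid. A minimal sketch, where the Beta prior and binomial likelihood are illustrative choices, not from the lecture:

```python
import numpy as np

# Grid over theta in (0, 1).
theta = np.linspace(1e-4, 1 - 1e-4, 1000)

# Illustrative prior pi(theta): Beta(2, 2), up to normalization.
prior = theta * (1 - theta)

# Illustrative likelihood f_theta(y): 7 successes in 10 Bernoulli trials.
lik = theta**7 * (1 - theta)**3

# Posterior: proportional to prior * likelihood, normalized on the grid.
post = prior * lik
post /= post.sum()

post_mean = (theta * post).sum()    # E{theta | y}
post_mode = theta[np.argmax(post)]  # the posterior mode

print(post_mean)  # about 9/14 = 0.643 (the Beta(9, 5) posterior mean)
print(post_mode)  # about 8/12 = 0.667
```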
If $X$ and $Y$ have joint density $f_{X,Y}(x, y)$, then the density of $Y$ is
$$f_Y(y) = \int f_{X,Y}(x, y) \, dx,$$
and the density of $X$ is
$$f_X(x) = \int f_{X,Y}(x, y) \, dy.$$
The expectation of $Y$ is
$$E\{Y\} = \int f_Y(y) \, y \, dy,$$
and the conditional expectation of $Y$ given $X$ is
$$E\{Y \mid X\} = \int f_{Y|X}(y \mid X) \, y \, dy.$$
And we have
$$E\{Y\} = E\{E\{Y \mid X\}\}.$$
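A Monte Carlo sketch of $E\{Y\} = E\{E\{Y \mid X\}\}$; the joint distribution below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative joint: X ~ N(0, 1), and given X, Y ~ N(2X + 1, 1).
X = rng.normal(size=1_000_000)
Y = 2 * X + 1 + rng.normal(size=1_000_000)

# Here E{Y | X} = 2X + 1, so both averages should be close to 1.
print(Y.mean())            # approx 1.0
print((2 * X + 1).mean())  # approx 1.0
```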
Independence
Random variables are independent if their joint density is equal to the product of their marginal densities.
Independent random variables (with finite variances) have covariance equal to zero.
Chebyshev's inequality: for $\mu = E\{X\}$ and any $\epsilon > 0$,
$$P\{|X - \mu| \geq \epsilon\} \leq \mathrm{Var}(X)/\epsilon^2.$$
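A quick empirical check of the bound (the exponential distribution is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=1_000_000)  # mean 1, variance 1
mu, var, eps = 1.0, 1.0, 2.0

freq = np.mean(np.abs(X - mu) >= eps)
print(freq)          # approx 0.05 (the exact value is e^{-3})
print(var / eps**2)  # 0.25 -- the Chebyshev bound, indeed >= freq
```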
If $X_1, \ldots, X_n$ are independent with variances $\sigma_1^2, \ldots, \sigma_n^2$, the variance of the sample mean
$$\frac{1}{n} \sum_{i=1}^{n} X_i$$
is equal to
$$\frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2.$$
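A sketch with i.i.d. draws (standard normals, chosen arbitrarily), estimating the variance of the sample mean by simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 100_000

# reps independent sample means, each an average of n standard normals.
means = rng.normal(size=(reps, n)).mean(axis=1)

print(means.var())  # approx sigma^2 / n = 1 / 100 = 0.01
```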
The central limit theorem: as $n \to \infty$,
$$P\left\{ \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n} \sum_{i=1}^{n} E\{X_i\} \leq x \sqrt{\frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2} \right\} \to \int_{-\infty}^{x} \frac{e^{-t^2/2}}{\sqrt{2\pi}} \, dt.$$
Not only can we learn, we can know how well we've learned!
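A Monte Carlo sketch of the limit, using (arbitrarily chosen) uniform draws: the standardized sample mean lands below $x$ with roughly the standard normal probability:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
n, reps, x = 200, 50_000, 1.0

# Uniform(0, 1) draws: mean 1/2, variance 1/12.
samples = rng.uniform(size=(reps, n))
z = (samples.mean(axis=1) - 0.5) / sqrt((1 / 12) / n)  # standardized sample means

print(np.mean(z <= x))               # approx 0.841
print(0.5 * (1 + erf(x / sqrt(2))))  # Phi(1) = 0.8413..., the normal integral
```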
The expected loss (risk) of a decision rule $d(Y)$ is
$$E_\theta \{ L(\theta, d(Y)) \}.$$
Researchers have data on, say, mortgages, or patients, or weather patterns, with associated foreclosure outcomes, or cancer outcomes, or rainfall.
Researchers want to help others do prediction with new data.
Suppose you really believe that the parameter $\theta$ has a distribution $\pi(\theta)$.
And that nature or god or . . . chose $\theta$ from that distribution.
And suppose you wanted to estimate $\theta$.
Suppose we have some loss function, say
$$L(\hat{\theta}, \theta),$$
so that we need to find a function $\hat{\theta}$ to minimize
$$E\{L(\hat{\theta}, \theta)\}.$$
What expectation are we talking about? The expectation over $\theta$!
Find $\hat{\theta}(Y)$ to minimize
$$E\{L(\hat{\theta}(Y), \theta) \mid Y\}.$$
Or maybe just approximate that optimal choice with the posterior expectation or mode or . . .
Bayes' theorem: with prior $\theta \sim \pi(\theta)$ and likelihood $Y \mid \theta \sim f_\theta(y)$,
$$\pi(\theta \mid y) = \frac{\pi(\theta) f_\theta(y)}{\int \pi(\theta) f_\theta(y) \, d\theta}.$$
What expectation are we talking about? The expectation over $\theta$ and $Y$!
$$E\{L(\hat{\theta}(Y), \theta)\} = E\{E\{L(\hat{\theta}(Y), \theta) \mid \theta\}\} = E\{E\{L(\hat{\theta}(Y), \theta) \mid Y\}\}$$
Find $\hat{\theta}(Y)$ to minimize
$$E\{L(\hat{\theta}(Y), \theta) \mid Y\}.$$
Or maybe just approximate that optimal choice with the posterior expectation, or mode, or . . .
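Under squared-error loss this optimal choice is exactly the posterior expectation. A grid sketch, reusing the illustrative Beta(9, 5) posterior from the earlier example:

```python
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 1000)
post = theta**8 * (1 - theta)**4  # Beta(9, 5) posterior, up to normalization
post /= post.sum()

# Expected posterior loss E{(d - theta)^2 | Y} for each candidate estimate d.
d = theta  # candidate estimates on the same grid
exp_loss = ((d[:, None] - theta[None, :])**2 * post[None, :]).sum(axis=1)

print(d[np.argmin(exp_loss)])  # approx 9/14 -- the minimizer
print((theta * post).sum())    # the posterior mean, the same value
```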