Lecture 2: Linear algebra, probability theory, and statistical inference
Let $X$ be the $n \times 1$ matrix (really, just a vector) with all entries equal to 1,
$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.$$
And consider the span of the columns (really, just one column) of $X$, the set of all vectors of the form
$$X\beta_0$$
(where $\beta_0$ here is any real number).
Now let $Y$ be the $n \times 1$ vector
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe $Y$ is in the subspace. But if it is not, we can ask: what is the vector in the subspace closest to $Y$?
That is, what value of $\beta_0$ minimizes the (squared) distance between $X\beta_0$ and $Y$,
$$\|Y - X\beta_0\|^2 = (Y - X\beta_0)^T (Y - X\beta_0) = \sum_{i=1}^{n} (Y_i - \beta_0)^2.$$
To minimize $\sum_{i=1}^{n} (Y_i - \beta_0)^2$, differentiate with respect to $\beta_0$ and set the derivative to zero:
$$\sum_{i=1}^{n} (Y_i - \beta_0) = 0.$$
Solving gives
$$\hat{\beta}_0 = \frac{\sum_{i=1}^{n} Y_i}{n},$$
or equivalently,
$$X\hat{\beta}_0 = X(X^T X)^{-1} X^T Y.$$
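As a quick numerical check (a minimal numpy sketch with made-up data, not from the lecture), the projection $X(X^T X)^{-1} X^T Y$ reproduces the vector whose entries all equal $\bar{Y}$:

```python
import numpy as np

# Made-up data for illustration.
Y = np.array([2.0, 3.0, 5.0, 10.0])
n = len(Y)
X = np.ones((n, 1))  # the n-by-1 column of ones

# beta0_hat solves the 1-by-1 normal equations (X^T X) b = X^T Y.
beta0_hat = np.linalg.solve(X.T @ X, X.T @ Y)
proj = X @ beta0_hat  # the projection of Y onto the span of X

print(beta0_hat)  # [5.] -- exactly the sample mean of Y
print(np.allclose(proj, Y.mean() * np.ones(n)))  # True
```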
Now instead let $X$ be the $n \times 1$ matrix
$$X = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix}.$$
And consider the span of the columns of $X$, the set of all vectors of the form
$$X\beta_1$$
(where again $\beta_1$ here is any real number).
Again let
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe $Y$ is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to $Y$?
That is, what value of $\beta_1$ minimizes the (squared) distance between $X\beta_1$ and $Y$,
$$\|Y - X\beta_1\|^2 = (Y - X\beta_1)^T (Y - X\beta_1) = \sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})^2.$$
Differentiating $\sum_{i=1}^{n} (Y_i - \beta_1 X_{i1})^2$ with respect to $\beta_1$ and setting the derivative to zero gives
$$\sum_{i=1}^{n} X_{i1} (Y_i - \beta_1 X_{i1}) = 0.$$
The closest vector is
$$X\hat{\beta}_1 = \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \hat{\beta}_1,$$
where
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} Y_i X_{i1}}{\sum_{i=1}^{n} X_{i1}^2},$$
or equivalently,
$$X\hat{\beta}_1 = X(X^T X)^{-1} X^T Y.$$
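The same kind of check works here (again a sketch with synthetic data): the closed-form ratio agrees with the projection $X(X^T X)^{-1} X^T Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)            # the single column X_{i1}
Y = 2.0 * x + rng.normal(size=10)  # synthetic responses
X = x.reshape(-1, 1)

beta1_hat = (Y * x).sum() / (x**2).sum()      # closed-form solution
proj = X @ np.linalg.solve(X.T @ X, X.T @ Y)  # X (X^T X)^{-1} X^T Y

print(np.allclose(proj, x * beta1_hat))  # True
```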
Now let $X$ be the $n \times 2$ matrix
$$X = \begin{pmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}.$$
And consider the span of the columns of $X$, the set of all vectors of the form
$$X\beta,$$
where $\beta$ here is the two-dimensional column vector $(\beta_0, \beta_1)^T$.
Again let
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}.$$
Maybe $Y$ is in the subspace. But if it is not, we can ask, as before: what is the vector in the subspace closest to $Y$?
That is, what value of the two-dimensional $\beta$ minimizes the (squared) distance between $X\beta$ and $Y$,
$$\|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta) = \sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}])^2.$$
Differentiating with respect to $\beta_0$ and $\beta_1$ and setting both derivatives to zero gives the normal equations
$$\sum_{i=1}^{n} (Y_i - [\beta_0 + \beta_1 X_{i1}]) = 0$$
and
$$\sum_{i=1}^{n} X_{i1} (Y_i - [\beta_0 + \beta_1 X_{i1}]) = 0.$$
The closest vector is
$$X\hat{\beta} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \hat{\beta}_0 + \begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{n1} \end{pmatrix} \hat{\beta}_1,$$
where
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}_1, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1) Y_i}{\sum_{i=1}^{n} (X_{i1} - \bar{X}_1)^2},$$
or equivalently,
$$X\hat{\beta} = X(X^T X)^{-1} X^T Y.$$
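For the two-column case, a sketch (synthetic data again) comparing the closed-form $\hat{\beta}_0$ and $\hat{\beta}_1$ with numpy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
Y = 1.0 + 2.0 * x + rng.normal(size=50)  # synthetic responses

# Closed-form slope and intercept.
beta1_hat = ((x - x.mean()) * Y).sum() / ((x - x.mean())**2).sum()
beta0_hat = Y.mean() - beta1_hat * x.mean()

# The same fit from the matrix formula, via numpy's least squares.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

print(np.allclose([beta0_hat, beta1_hat], beta_hat))  # True
```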
Let $\Sigma = D$, a diagonal matrix with all positive entries. Note that $\Sigma$ is a simple example of a symmetric, positive definite matrix.
Note that we could write $\Sigma = IDI^T$, where $I$ is the identity matrix.
Note that the columns of $I$ are orthogonal, of unit length, and they are eigenvectors, with eigenvalues equal to the corresponding elements of $D$.
For any vector $c = (c_1, \ldots, c_p)^T$,
$$\Sigma c = d_1 c_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}.$$
And so
$$c^T \Sigma c = d_1 c_1 \, c^T \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + d_2 c_2 \, c^T \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} + \cdots + d_p c_p \, c^T \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} = c_1^2 d_1 + c_2^2 d_2 + \cdots + c_p^2 d_p.$$
In particular, $c^T \Sigma c > 0$ for every $c \neq 0$, which is what makes $\Sigma$ positive definite.
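A small numerical confirmation of the diagonal-case identity (the numbers are arbitrary):

```python
import numpy as np

d = np.array([1.0, 2.0, 3.0])   # positive diagonal entries of D
Sigma = np.diag(d)              # Sigma = D
c = np.array([0.5, -1.0, 2.0])

print(c @ Sigma @ c)     # 14.25
print((c**2 * d).sum())  # 14.25 -- c_1^2 d_1 + c_2^2 d_2 + c_3^2 d_3
```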
Let's start all over again. But this time, we'll not take $\Sigma = D$, a diagonal matrix with all positive entries. Instead, take $\Sigma = PDP^T$, where $D$ is again a diagonal matrix with all positive entries, and $P$ is a matrix whose columns are orthonormal (and span $p$-dimensional space).
Note that $\Sigma$ is a complicated example of a symmetric, positive definite matrix.
Note that we write $\Sigma = PDP^T$, where $P$ is not the identity matrix, but its columns, like those of $I$, are orthogonal and of unit length.
And just like the columns of $I$ were eigenvectors, so are the columns of $P$, again with eigenvalues equal to the corresponding elements of $D$.
For a vector with coordinates $c = (c_1, \ldots, c_p)^T$ in the basis given by the columns of $P$,
$$\Sigma P c = P D P^T P c = P D \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{pmatrix} = P \begin{pmatrix} d_1 c_1 \\ d_2 c_2 \\ \vdots \\ d_p c_p \end{pmatrix} = c_1 d_1 P_1 + c_2 d_2 P_2 + \cdots + c_p d_p P_p,$$
where $P_j$ denotes the $j$-th column of $P$.
One last fact that will be relevant when we use these results: every symmetric positive definite matrix $\Sigma$ can be written in the form $PDP^T$, where the columns of $P$ are an orthonormal basis for $p$-dimensional space, the columns are the eigenvectors of $\Sigma$, and the diagonal matrix $D$ has as components the corresponding (positive) eigenvalues.
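This decomposition is exactly what numpy's `eigh` computes for a symmetric matrix. A sketch, with an arbitrary positive definite $\Sigma$ built just for the example:

```python
import numpy as np

# Build an arbitrary symmetric positive definite matrix.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)

# eigh returns the eigenvalues (entries of D) and orthonormal
# eigenvectors (columns of P).
eigvals, P = np.linalg.eigh(Sigma)

print(np.all(eigvals > 0))                             # True: eigenvalues positive
print(np.allclose(P @ np.diag(eigvals) @ P.T, Sigma))  # True: Sigma = P D P^T
print(np.allclose(P.T @ P, np.eye(4)))                 # True: columns orthonormal
```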
We can find $p$ linearly independent, orthogonal, unit-length $p$-dimensional eigenvectors $P_j$.
Let $P$ be the $p \times p$ matrix whose columns are the $P_j$, and let $D$ be the diagonal matrix whose entries are the corresponding eigenvalues. Then $\Sigma = PDP^T$.
The local maxima of $c^T \Sigma c$ over unit-length vectors $c$ are attained at the eigenvectors, and the corresponding local maxima are the eigenvalues.
Geometrically, $\Sigma$ corresponds to an ellipsoid with axes the $P_j$, and with the lengths of the axes equal to twice the eigenvalues.
For a vector $c$, $P^T c$ has as its components the $a_j$ for which
$$\sum_{j=1}^{p} a_j P_j = c.$$
Multiplying by $D$ scales each component by the corresponding eigenvalue $\lambda_j$, and so $PDP^T c$ is
$$\sum_{j=1}^{p} a_j \lambda_j P_j.$$
In short, $\Sigma$ maps
$$\sum_{j=1}^{p} a_j P_j \quad \longmapsto \quad \sum_{j=1}^{p} a_j \lambda_j P_j.$$
Let $\theta$ be a random variable with (prior) density $\pi(\theta)$.
Let $Y$ be a (vector of) random variable(s) with (joint) conditional density $f_\theta(y)$ given $\theta$.
The conditional density of $\theta$ given $Y = y$ is
$$\pi(\theta \mid y) = \frac{\pi(\theta) f_\theta(y)}{\int \pi(\theta) f_\theta(y) \, d\theta}.$$
The conditional expectation of $\theta$ given $Y = y$ is
$$E\{\theta \mid y\} = \int \theta \, \pi(\theta \mid y) \, d\theta.$$
And the posterior mode satisfies
$$\frac{d}{d\theta} \ln \pi(\theta) + \frac{d}{d\theta} \ln f_\theta(y) = 0.$$
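These integrals can be approximated on a grid. A minimal sketch, where the Beta prior and binomial likelihood are illustrative choices, not from the lecture:

```python
import numpy as np

# Grid over theta in (0, 1).
theta = np.linspace(1e-4, 1 - 1e-4, 1000)

# Illustrative prior pi(theta): Beta(2, 2), up to normalization.
prior = theta * (1 - theta)

# Illustrative likelihood f_theta(y): 7 successes in 10 Bernoulli trials.
lik = theta**7 * (1 - theta)**3

# Posterior: proportional to prior * likelihood, normalized on the grid.
post = prior * lik
post /= post.sum()

post_mean = (theta * post).sum()    # E{theta | y}
post_mode = theta[np.argmax(post)]  # the posterior mode

print(post_mean)  # about 9/14 = 0.643 (the Beta(9, 5) posterior mean)
print(post_mode)  # about 8/12 = 0.667
```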
If $X$ and $Y$ have joint density $f_{X,Y}(x, y)$, then the density of $Y$ is
$$f_Y(y) = \int f_{X,Y}(x, y) \, dx,$$
and the density of $X$ is
$$f_X(x) = \int f_{X,Y}(x, y) \, dy.$$
The expectation of $Y$ is
$$E\{Y\} = \int f_Y(y) \, y \, dy,$$
and the conditional expectation of $Y$ given $X$ is
$$E\{Y \mid X\} = \int f_{Y|X}(y \mid X) \, y \, dy.$$
And we have
$$E\{Y\} = E\{E\{Y \mid X\}\}.$$
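A Monte Carlo sketch of $E\{Y\} = E\{E\{Y \mid X\}\}$; the joint distribution below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative joint: X ~ N(0, 1), and given X, Y ~ N(2X + 1, 1).
X = rng.normal(size=1_000_000)
Y = 2 * X + 1 + rng.normal(size=1_000_000)

# Here E{Y | X} = 2X + 1, so both averages should be close to 1.
print(Y.mean())            # approx 1.0
print((2 * X + 1).mean())  # approx 1.0
```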
Independence
Random variables are independent if their joint density is equal to the product of their marginal densities.
Independent random variables (with finite variances) have covariance equal to zero.
Chebyshev's inequality: for $\mu = E\{X\}$ and any $\epsilon > 0$,
$$P\{|X - \mu| \geq \epsilon\} \leq \mathrm{Var}(X)/\epsilon^2.$$
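A quick empirical check of the bound (the exponential distribution is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=1_000_000)  # mean 1, variance 1
mu, var, eps = 1.0, 1.0, 2.0

freq = np.mean(np.abs(X - mu) >= eps)
print(freq)          # approx 0.05 (the exact value is e^{-3})
print(var / eps**2)  # 0.25 -- the Chebyshev bound, indeed >= freq
```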
If $X_1, \ldots, X_n$ are independent with variances $\sigma_1^2, \ldots, \sigma_n^2$, the variance of the sample mean
$$\frac{1}{n} \sum_{i=1}^{n} X_i$$
is equal to
$$\frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2.$$
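A sketch with i.i.d. draws (standard normals, chosen arbitrarily), estimating the variance of the sample mean by simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 100_000

# reps independent sample means, each an average of n standard normals.
means = rng.normal(size=(reps, n)).mean(axis=1)

print(means.var())  # approx sigma^2 / n = 1 / 100 = 0.01
```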
The central limit theorem: as $n \to \infty$,
$$P\left\{ \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n} \sum_{i=1}^{n} E\{X_i\} \leq x \sqrt{\frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2} \right\} \to \int_{-\infty}^{x} \frac{e^{-t^2/2}}{\sqrt{2\pi}} \, dt.$$
Not only can we learn, we can know how well we've learned!
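A Monte Carlo sketch of the limit, using (arbitrarily chosen) uniform draws: the standardized sample mean lands below $x$ with roughly the standard normal probability:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
n, reps, x = 200, 50_000, 1.0

# Uniform(0, 1) draws: mean 1/2, variance 1/12.
samples = rng.uniform(size=(reps, n))
z = (samples.mean(axis=1) - 0.5) / sqrt((1 / 12) / n)  # standardized sample means

print(np.mean(z <= x))               # approx 0.841
print(0.5 * (1 + erf(x / sqrt(2))))  # Phi(1) = 0.8413..., the normal integral
```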
The expected loss (risk) of a decision rule $d(Y)$ is
$$E_\theta \{ L(\theta, d(Y)) \}.$$
Researchers have data on, say, mortgages, or patients, or weather patterns, with associated foreclosure outcomes, or cancer outcomes, or rainfall.
Researchers want to help others do prediction with new data.
Suppose you really believe that the parameter $\theta$ has a distribution $\pi(\theta)$.
And that nature or god or . . . chose $\theta$ from that distribution.
And suppose you wanted to estimate $\theta$.
Suppose we have some loss function, say
$$L(\hat{\theta}, \theta),$$
so that we need to find a function $\hat{\theta}$ to minimize
$$E\{L(\hat{\theta}, \theta)\}.$$
What expectation are we talking about? The expectation over $\theta$!
Find $\hat{\theta}(Y)$ to minimize
$$E\{L(\hat{\theta}(Y), \theta) \mid Y\}.$$
Or maybe just approximate that optimal choice with the posterior expectation or mode or . . .
Bayes' theorem: with prior $\theta \sim \pi(\theta)$ and likelihood $Y \mid \theta \sim f_\theta(y)$,
$$\pi(\theta \mid y) = \frac{\pi(\theta) f_\theta(y)}{\int \pi(\theta) f_\theta(y) \, d\theta}.$$
What expectation are we talking about? The expectation over $\theta$ and $Y$!
$$E\{L(\hat{\theta}(Y), \theta)\} = E\{E\{L(\hat{\theta}(Y), \theta) \mid \theta\}\} = E\{E\{L(\hat{\theta}(Y), \theta) \mid Y\}\}$$
Find $\hat{\theta}(Y)$ to minimize
$$E\{L(\hat{\theta}(Y), \theta) \mid Y\}.$$
Or maybe just approximate that optimal choice with the posterior expectation, or mode, or . . .
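Under squared-error loss this optimal choice is exactly the posterior expectation. A grid sketch, reusing the illustrative Beta(9, 5) posterior from the earlier example:

```python
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 1000)
post = theta**8 * (1 - theta)**4  # Beta(9, 5) posterior, up to normalization
post /= post.sum()

# Expected posterior loss E{(d - theta)^2 | Y} for each candidate estimate d.
d = theta  # candidate estimates on the same grid
exp_loss = ((d[:, None] - theta[None, :])**2 * post[None, :]).sum(axis=1)

print(d[np.argmin(exp_loss)])  # approx 9/14 -- the minimizer
print((theta * post).sum())    # the posterior mean, the same value
```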