
Lecture Notes

on

Multivariate Analysis
Preface

We would like to give an idea of some of the special features of this monograph on Multivariate Analysis. Before we go further, let us tell you that it has been prepared according to the new syllabus prescribed by the UGC. We have written these notes in such a simple style that even a weak student will be able to understand them very easily.

We are sure you will agree with us that the facts and formulae of multivariate analysis are just the same in all the books; the difference lies in the method of presenting these facts to the students in such a simple way that, while going through these notes, a student will feel as if a teacher is sitting by his side and explaining various things to him. We are sure that after reading these lecture notes, the student will develop a special interest in this field and would like to help analyze such data in other disciplines as well.

We think that the real judges of this monograph are the teachers concerned and the students for whom it is meant. So, we request our teacher friends as well as students to point out our mistakes, if any, and send their comments and suggestions for the further improvement of this monograph. Wishing you great success.

Yours sincerely,

Dr. Rafiqullah Khan
aruke@rediffmail.com
Department of Statistics and Operations Research
Aligarh Muslim University, Aligarh.

CONTENTS

S.no.  Topics                                                         Page no.
1      Basic concept of multivariate analysis                         1
2      Multivariate normal distribution                               7
3      Estimation of parameters in multivariate normal distribution   51
4      Wishart distribution                                           63
5      Multiple and partial correlations                              95
6      Hotelling's T²                                                 111
7      Discriminant analysis                                          133
8      Principal components                                           141
9      Canonical correlation and canonical variate                    147
10     Wilks' Λ criterion                                             155

BASIC CONCEPT OF MULTIVARIATE ANALYSIS

Multivariate analysis is the analysis of observations on several correlated random variables for a number of individuals in one or more samples simultaneously; this kind of analysis has been used in almost all scientific studies.

For example, the data may be nutritional anthropometric measurements like height, weight, arm circumference, chest circumference, etc., taken from randomly selected students to assess their nutritional status. Since here we are considering more than one variable, this is called multivariate analysis.

The sample data may be a collection of measurements, such as the lengths and breadths of petals and sepals of flowers from two different species, used to identify their group characteristics.

The data may be information such as annual income, savings, assets, number of children and so on, collected from randomly selected families.

As in the univariate case, we shall assume that a random sample of multivariate observations has been collected from an infinite or finite population. Also, the multivariate techniques, like their univariate counterparts, allow one to evaluate hypotheses or results regarding the population by means of sample observations.

For this purpose, we consider the multiple measurements together as a system of measurements. Thus x_{iα} denotes the i-th measurement on the α-th individual. Normally, we denote the number of variables by p and the number of individuals by n. The n measurements on the p variables can be arranged as follows:

Variable /                        Individuals
Characteristic    1       2      ...     α      ...     n
X_1             x_11    x_12     ...    x_1α    ...    x_1n
X_2             x_21    x_22     ...    x_2α    ...    x_2n
...
X_i             x_i1    x_i2     ...    x_iα    ...    x_in
...
X_p             x_p1    x_p2     ...    x_pα    ...    x_pn

This can be written in matrix form as

$$\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1\alpha} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2\alpha} & \cdots & x_{2n} \\
\vdots &        &        &             &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{i\alpha} & \cdots & x_{in} \\
\vdots &        &        &             &        & \vdots \\
x_{p1} & x_{p2} & \cdots & x_{p\alpha} & \cdots & x_{pn}
\end{pmatrix} = (x_{i\alpha}), \qquad i = 1, 2, \ldots, p, \ \ \alpha = 1, 2, \ldots, n.$$

This is also known as the data matrix.
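To see how such a data matrix is handled in practice, the following minimal Python sketch builds a small p × n data matrix (the three variables and five individuals below are invented purely for illustration) and computes the sample mean vector and the sample variance-covariance matrix:

```python
import numpy as np

# Hypothetical data matrix X (p = 3 variables, n = 5 individuals):
# each row is one variable, each column is one individual, so X[i, a] = x_{i,alpha}.
X = np.array([
    [152.0, 160.0, 148.0, 171.0, 158.0],   # height (cm)
    [ 41.0,  49.0,  38.0,  62.0,  50.0],   # weight (kg)
    [ 20.5,  22.0,  19.8,  24.6,  21.9],   # arm circumference (cm)
])

p, n = X.shape

# Sample mean vector: one mean per variable (per row).
mean_vector = X.mean(axis=1)

# Sample variance-covariance matrix (p x p); rowvar=True because variables are rows.
cov_matrix = np.cov(X, rowvar=True)

print("mean vector:", mean_vector)
print("covariance matrix:\n", cov_matrix)
```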
The main aim of most of the multivariate techniques is to summarize a large body of data by means of relatively few parameters, or to reduce the number of variables by employing suitable linear transformations and choosing a very limited number of the resulting linear combinations in some optimum manner, disregarding the remaining linear combinations in the hope that they do not contain much information. Among the multivariate techniques there are two main groups. Some methods are concerned with the relationships between variables and are called variable-oriented; the important variable-oriented techniques are total, partial, multiple and canonical correlations, principal component analysis, and so on. The other group is concerned with the relationships between individuals and is called individual-oriented. In such cases a problem of classification arises: to assign an individual of unknown origin to one of two populations we make use of discriminant analysis.

Basic theory of matrices

• Given a p × p square matrix A = (a_ij), the elements a_11, ..., a_pp form the principal diagonal (or simply the diagonal) of A and are called its diagonal elements, whereas the elements a_1p, a_{2,p−1}, ..., a_p1 constitute the secondary diagonal of A. For i ≠ j, the a_ij are the off-diagonal elements of A.

• A matrix of order p, say A = (a_ij), is said to be symmetric if a_ij = a_ji for all i, j = 1, 2, ..., p, i.e. A = A′, and skew-symmetric if a_ij = −a_ji for all i, j = 1, 2, ..., p, i.e. A = −A′.

• A matrix A = (a_ij) is called a diagonal matrix if a_ij = 0 for all i ≠ j, and we write A = diag(a_11, ..., a_pp).

• A nonzero diagonal matrix whose diagonal elements are all equal is called a scalar matrix, and a scalar matrix whose diagonal elements are all unities is called a unit matrix or an identity matrix. A unit matrix is generally denoted by I, in which case a scalar matrix is of the form cI, where c is some nonzero scalar. It may easily be seen that AI = A, IB = B, A(cI) = cA and (aI)B = aB, provided the products are all defined.

• A square matrix A = (a_ij) is said to be upper triangular if a_ij = 0 for all i > j, lower triangular if a_ij = 0 for all i < j, and triangular if A is either upper or lower triangular.

• A square matrix A is said to be idempotent if A² = A.

• A square matrix A is said to be orthogonal if AA′ = A′A = I.

• Let a_i′ denote the i-th row of a matrix A = (a_ij) of order p and b_j its j-th column; then the (i, j)-th elements of AA′ and A′A are a_i′ a_j and b_i′ b_j respectively. Now if A is orthogonal, we have AA′ = A′A = I, and hence a_i′ a_j = b_i′ b_j = 1 whenever i = j, and = 0 whenever i ≠ j. Thus both (a_1, ..., a_p) and (b_1, ..., b_p) are orthonormal sets.

• The trace of a matrix A, written tr A, is the sum of the diagonal elements of A.

• tr(A + B) = tr A + tr B, whenever A and B are square matrices of the same order.

• tr AB = tr BA, whenever both AB and BA are defined.

• tr ABC = tr BCA = tr CAB, provided all three products are defined.

• A square matrix A is said to be non-singular if there exists a square matrix B of the same order, called the inverse of A, such that AB = BA = I.

• The inverse of a non-singular matrix A is unique, and it is because of this uniqueness that the inverse of A, if it exists, is generally denoted by A⁻¹.

• (A⁻¹)⁻¹ = A and (A⁻¹)′ = (A′)⁻¹, whenever A is non-singular.

• (AB)⁻¹ = B⁻¹A⁻¹, whenever A and B are non-singular matrices of the same order.

• The determinant of a diagonal or triangular matrix equals the product of its diagonal elements.

• |A′| = |A|, and |AB| = |BA| = |A| |B| = |B| |A|.

• |A| ≠ 0 if A is non-singular, since 1 = |AA⁻¹| = |A| |A⁻¹|.

• |A| = 1 or −1 if A is orthogonal.

• |aA| = a^p |A|, where p is the order of A.

• |A| = 0 whenever A has a zero row, a zero column, or two identical rows (columns).

• The determinant changes sign if any two rows (columns) are interchanged.

• The determinant remains unaltered when a row (column) is replaced by the sum of that row (column) and a scalar multiple of another row (column).

• If A = diag(A_11, ..., A_qq), where A_11, ..., A_qq are all square matrices, then |A| = |A_11| ⋯ |A_qq|.

• If A is of the form

$$\begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1q} \\
0      & A_{22} & \cdots & A_{2q} \\
\vdots & \vdots &        & \vdots \\
0      & 0      & \cdots & A_{qq}
\end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix}
A_{11} & 0      & \cdots & 0 \\
A_{21} & A_{22} & \cdots & 0 \\
\vdots & \vdots &        & \vdots \\
A_{q1} & A_{q2} & \cdots & A_{qq}
\end{pmatrix},$$

then |A| = |A_11| ⋯ |A_qq|.

• If A = $\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, where both A_11 and A_22 are non-singular, then

$$A^{-1} = \begin{pmatrix}
A_{11.2}^{-1} & -A_{11}^{-1} A_{12} A_{22.1}^{-1} \\
-A_{22}^{-1} A_{21} A_{11.2}^{-1} & A_{22.1}^{-1}
\end{pmatrix},$$

where A_{11.2} = A_11 − A_12 A_22⁻¹ A_21 and A_{22.1} = A_22 − A_21 A_11⁻¹ A_12.

• If A = diag(A_11, ..., A_qq), then A⁻¹ = diag(A_11⁻¹, ..., A_qq⁻¹).

• Given a square matrix A of order p, the matrix (A − λI), where λ is a scalar variable, is called the latent (or characteristic) matrix of A; the polynomial |A − λI|, which is of degree p, is called the latent (or characteristic) polynomial of A, and the equation |A − λI| = 0 its characteristic equation.

• If we write |A − λI| = a_0 + a_1 λ + ⋯ + a_p λ^p, it can easily be seen that a_0 = |A| and a_p = (−1)^p.

• Every square matrix satisfies its own characteristic equation.

• Given a square matrix A, the roots of its characteristic equation |A − λI| = 0 are called its latent (or characteristic or secular) roots or eigen (or proper) values. Clearly, if A is p × p, then A has p latent roots.

• The latent roots of a diagonal or a triangular matrix are its diagonal elements.

• The determinant of a square matrix is the product of its latent roots.

• A square matrix is non-singular if and only if none of its latent roots is zero.

• Given a latent root λ of a square matrix A, x is called a latent vector of A corresponding to the root λ if (A − λI)x = 0, i.e. Ax = λx.

• If λ is a latent root of a non-singular matrix A, then λ⁻¹ is a latent root of A⁻¹.

• If λ is a latent root of an orthogonal matrix A, then λ⁻¹ is also a latent root of A.

• The latent roots of an idempotent matrix are zeros and unities only.

• The latent roots of a real symmetric matrix are all real.

• Two square matrices A and B of the same order are said to be similar if there exists a non-singular matrix C such that A = C⁻¹BC.

• Similar matrices have the same characteristic polynomials.

• Similar matrices have the same latent roots.

• If the latent roots of A are all distinct, then A is similar to a diagonal matrix whose diagonal elements are the latent roots of A.

• If A is a real symmetric matrix, then there exists an orthogonal matrix C such that C′AC is a diagonal matrix.

• Let A = (a_ij) be a real symmetric matrix of order p and x′ = (x_1, ..., x_p) a p-tuple of real variables. The polynomial ∑_i ∑_j a_ij x_i x_j = x′Ax is called a p-ary quadratic form.

• The quadratic form x′Ax (or equivalently the real symmetric matrix A) is said to be

  i) positive definite, if x′Ax > 0 for all x ≠ 0;

  ii) negative definite, if x′Ax < 0 for all x ≠ 0;

  iii) positive semi-definite or nonnegative definite, if x′Ax ≥ 0 for all x ≠ 0;

  iv) negative semi-definite or nonpositive definite, if x′Ax ≤ 0 for all x ≠ 0;

  v) indefinite, if it takes on both positive and negative values.

• The diagonal elements of a positive definite matrix are all positive.

• The diagonal elements of a negative definite matrix are all negative.

• A p-ary quadratic form x′Ax can be equivalently reduced by an orthogonal transformation to the diagonal form λ_1 y_1² + ⋯ + λ_p y_p², where λ_1, ..., λ_p are the latent roots of A.

• A p-ary quadratic form x′Ax can be equivalently reduced by a non-singular transformation to the canonical form d_1 y_1² + ⋯ + d_p y_p².

• A p-ary quadratic form x′Ax is positive definite (negative definite) if and only if the latent roots of A are all positive (negative).

• If A is positive definite (negative definite), then −A is negative definite (positive definite).

• If A is positive definite, |A| > 0.

• If Σ is p × p and positive definite, and if A is a q × p matrix of rank q, then there exists a (p − q) × p matrix B of rank p − q such that AΣB′ = 0, in which case $D = \begin{pmatrix} A \\ B \end{pmatrix}$ is non-singular.

• Gamma integral:

  i) Γ(m) = ∫_0^∞ x^{m−1} e^{−x} dx,  ii) α^{−m} Γ(m) = ∫_0^∞ x^{m−1} e^{−αx} dx.

• Gamma duplication formula:

  2^{2m−1} Γ(m) Γ(m + 1/2) = √π Γ(2m).
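Many of the facts listed above are easy to verify numerically. The following Python sketch (the symmetric matrix and the value of m are arbitrary choices, made only for illustration) checks that the trace equals the sum of the latent roots, the determinant equals their product, positive definiteness corresponds to all latent roots being positive, and that the gamma duplication formula holds:

```python
import numpy as np
from scipy.special import gamma

# An arbitrary real symmetric (in fact positive definite) matrix.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])

roots = np.linalg.eigvalsh(A)          # latent (eigen) roots of a symmetric matrix

print(np.isclose(np.trace(A), roots.sum()))        # tr A = sum of the latent roots
print(np.isclose(np.linalg.det(A), roots.prod()))  # |A| = product of the latent roots
print(np.all(roots > 0))                           # all roots > 0  <=>  A positive definite

# Gamma duplication: 2^(2m-1) * Gamma(m) * Gamma(m + 1/2) = sqrt(pi) * Gamma(2m)
m = 3.7
lhs = 2 ** (2 * m - 1) * gamma(m) * gamma(m + 0.5)
rhs = np.sqrt(np.pi) * gamma(2 * m)
print(np.isclose(lhs, rhs))
```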



THE MULTIVARIATE NORMAL DISTRIBUTION

Notations of multivariate distribution

If X is a random variable (rv), then the cumulative distribution function (abbreviated as cdf) of X is given by

F(x) = Pr(X ≤ x), where x is a real number.

If F(x) is absolutely continuous, then dF(x)/dx = f(x) is called the density function of X, and

F(x) = ∫_{−∞}^{x} f(u) du.

Joint distribution

First consider the case of two (real) random variables, say X and Y.

Let X and Y be two rv's. The joint cdf of X and Y is given by

F(x, y) = Pr(X ≤ x, Y ≤ y), defined for every pair of real numbers (x, y).

We are interested in cases where F(x, y) is absolutely continuous. If F(x, y) is absolutely continuous, then

∂²F(x, y)/∂x ∂y = f(x, y) is called the joint density function of X and Y, and

F(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(u, v) du dv.

Now we consider the case of p random variables X_1, X_2, ..., X_p. The cdf is

F(x_1, x_2, ..., x_p) = Pr(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_p ≤ x_p),

defined for every set of real numbers x_1, x_2, ..., x_p, and the density function, if F(x_1, x_2, ..., x_p) is absolutely continuous, is

∂^p F(x_1, x_2, ..., x_p) / ∂x_1 ∂x_2 ⋯ ∂x_p = f(x_1, x_2, ..., x_p), and

F(x_1, x_2, ..., x_p) = ∫_{−∞}^{x_p} ⋯ ∫_{−∞}^{x_1} f(u_1, u_2, ..., u_p) du_1 ⋯ du_p.

Marginal distribution

Let F(x, y) be the cdf of two rv's X and Y; the marginal cdf of X is

Pr(X ≤ x) = Pr{X ≤ x, Y ≤ ∞} = F(x, ∞).

Let this be F(x). Clearly

F(x) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f(u, v) dv du.   (2.1)

Now

∫_{−∞}^{∞} f(u, v) dv = f(u),

where f(u) is called the marginal density function of X, and from (2.1)

F(x) = ∫_{−∞}^{x} f(u) du.

In a similar fashion we define G(y), the marginal cdf of Y, and g(y), the marginal density function of Y.

Now we turn to the general case.

Let F(x_1, x_2, ..., x_p) be the cdf of the rv's X_1, X_2, ..., X_p. The marginal cdf of some of X_1, X_2, ..., X_p, say X_1, X_2, ..., X_r (r < p), is

Pr(X_1 ≤ x_1, ..., X_r ≤ x_r) = Pr(X_1 ≤ x_1, ..., X_r ≤ x_r, X_{r+1} ≤ ∞, ..., X_p ≤ ∞)

= F(x_1, x_2, ..., x_r, ∞, ..., ∞)

= ∫_{−∞}^{x_1} ⋯ ∫_{−∞}^{x_r} ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f(u_1, ..., u_p) du_1 ⋯ du_p,

and the marginal density function of X_1, X_2, ..., X_r is

∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f(u_1, ..., u_p) du_{r+1} ⋯ du_p.

The marginal distribution and density of any other subset of X_1, X_2, ..., X_p are obtained in the obvious similar fashion.

Statistical Independence

Two random variables X, Y with cdf F(x, y) are said to be independent if

F(x, y) = F(x) G(y),

where F(x) is the marginal cdf of X and G(y) is the marginal cdf of Y. Then the density function of X and Y is

f(x, y) = ∂²F(x, y)/∂x ∂y = ∂²F(x)G(y)/∂x ∂y = (dF(x)/dx)(dG(y)/dy) = f(x) g(y).

Conversely, if f(x, y) = f(x) g(y), then

F(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(u, v) du dv = ∫_{−∞}^{y} ∫_{−∞}^{x} f(u) g(v) du dv

= ∫_{−∞}^{x} f(u) du ∫_{−∞}^{y} g(v) dv = F(x) G(y).

General case:

Given F(x_1, x_2, ..., x_p), the cdf of X_1, X_2, ..., X_p, the variables X_1, X_2, ..., X_p are called mutually independent if

F(x_1, x_2, ..., x_p) = F_1(x_1) ⋯ F_p(x_p),

where F_i(x_i) is the marginal cdf of X_i, i = 1, 2, ..., p.
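The following Python sketch illustrates these definitions numerically for one convenient joint density (two independent standard normal variables, chosen only for illustration): integrating the joint density over v recovers the marginal density of X, and the joint density factorizes into the product of the marginals.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# A joint density for (X, Y): two independent standard normal variables,
# f(x, y) = (1 / (2*pi)) * exp(-(x^2 + y^2) / 2).  (Chosen only as an illustration.)
def f(x, y):
    return np.exp(-(x**2 + y**2) / 2.0) / (2.0 * np.pi)

x0 = 0.7  # an arbitrary point at which to evaluate the marginal density of X

# Marginal density of X at x0: integrate the joint density over v.
marginal_x, _ = integrate.quad(lambda v: f(x0, v), -np.inf, np.inf)
print(np.isclose(marginal_x, norm.pdf(x0)))           # matches the N(0, 1) density

# Independence: the joint density factorizes into the product of the marginals.
y0 = -1.3
print(np.isclose(f(x0, y0), norm.pdf(x0) * norm.pdf(y0)))
```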

The set X_1, X_2, ..., X_r is independent of X_{r+1}, X_{r+2}, ..., X_p if

F(x_1, x_2, ..., x_p) = F(x_1, x_2, ..., x_r, ∞, ..., ∞) F(∞, ..., ∞, x_{r+1}, x_{r+2}, ..., x_p).

Conditional distribution

The conditional density function of Y, given X, for two rv's X and Y is defined as

f(y | x) = f(x, y) / f(x), provided that f(x) > 0.

In the general case of X_1, X_2, ..., X_p with cdf F(x_1, x_2, ..., x_p), the conditional density function of X_1, X_2, ..., X_r, given X_{r+1} = x_{r+1}, X_{r+2} = x_{r+2}, ..., X_p = x_p, is

f(x_1, x_2, ..., x_p) / ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f(u_1, ..., u_r, x_{r+1}, ..., x_p) du_1 ⋯ du_r.

Transformation of variables

Let X_1, X_2, ..., X_p have the joint density function f(x_1, x_2, ..., x_p). Consider p real-valued functions y_i = y_i(x_1, x_2, ..., x_p), i = 1, 2, ..., p. We assume that the transformation from X to Y is one-to-one, so the inverse transformation is x_i = x_i(y_1, y_2, ..., y_p), i = 1, 2, ..., p. Let the random variables Y_1, Y_2, ..., Y_p be defined by

Y_i = y_i(X_1, X_2, ..., X_p).

Then the joint density function of Y_1, Y_2, ..., Y_p is

g(y_1, y_2, ..., y_p) = f[x_1(y_1, y_2, ..., y_p), ..., x_p(y_1, y_2, ..., y_p)] J, where

$$J = \operatorname{mod}\begin{vmatrix}
\dfrac{\partial x_1}{\partial y_1} & \cdots & \dfrac{\partial x_1}{\partial y_p} \\
\vdots & & \vdots \\
\dfrac{\partial x_p}{\partial y_1} & \cdots & \dfrac{\partial x_p}{\partial y_p}
\end{vmatrix}$$

is the Jacobian of the transformation (mod denotes the absolute value of the determinant).

Form of normal density function

1) The univariate normal density function is given by

f(x) = (1 / (σ√(2π))) e^{−(1/2)((x−μ)/σ)²} = (1 / ((2π)^{1/2} |Σ|^{1/2})) e^{−(1/2)(x−μ)′ Σ⁻¹ (x−μ)},  −∞ < x < ∞,

where −∞ < μ < ∞ and Σ = σ² > 0.

To show that ∫_{−∞}^{∞} f(x) dx = 1, consider

∫_{−∞}^{∞} f(x) dx = (1 / (σ√(2π))) ∫_{−∞}^{∞} e^{−(1/2)((x−μ)/σ)²} dx.  Put (x − μ)/σ = y, then dx = σ dy, so

= (1 / (σ√(2π))) ∫_{−∞}^{∞} e^{−y²/2} σ dy = (2 / √(2π)) ∫_0^∞ e^{−y²/2} dy.

Again put (1/2) y² = t, i.e. y² = 2t, so y dy = dt and dy = dt/y = dt/√(2t) = (1/√2) t^{−1/2} dt; then

∫_{−∞}^{∞} f(x) dx = (2 / √(2π)) ∫_0^∞ e^{−t} (1/√2) t^{−1/2} dt = (1/√π) ∫_0^∞ e^{−t} t^{1/2 − 1} dt = (1/√π) Γ(1/2) = 1,

since the gamma function is ∫_0^∞ e^{−α} α^{p−1} dα = Γ(p), and Γ(1/2) = √π.

Moment generating function of univariate normal distribution

If X ~ N(μ, σ²), then Y = (X − μ)/σ ~ N(0, 1).

By definition,

M_Y(t) = E[e^{tY}] = ∫_{−∞}^{∞} e^{ty} f(y) dy = (1/√(2π)) ∫_{−∞}^{∞} e^{ty} e^{−y²/2} dy = (1/√(2π)) ∫_{−∞}^{∞} e^{−(1/2)(y² − 2ty)} dy

= (1/√(2π)) ∫_{−∞}^{∞} e^{−(1/2)(y² − 2ty + t² − t²)} dy = e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(1/2)(y − t)²} dy.

Let y − t = z, so that dy = dz; then

M_Y(t) = e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−z²/2} dz = e^{t²/2}.

Now,

Y = (X − μ)/σ  ⇒  X = μ + σY.

Thus,

M_X(t) = E[e^{tX}] = E[e^{t(μ + σY)}] = e^{tμ} E[e^{(tσ)Y}] = e^{tμ} M_Y(tσ) = e^{tμ + t²σ²/2}.
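The closed form M_X(t) = e^{tμ + t²σ²/2} can be checked numerically. In the Python sketch below, the values of μ, σ and t are arbitrary; the moment generating function is computed by numerical integration against the N(μ, σ²) density and compared with the formula derived above:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

mu, sigma, t = 2.0, 1.5, 0.4   # arbitrary illustrative values

# M_X(t) = E[e^{tX}], computed by integrating against the N(mu, sigma^2) density.
mgf_numeric, _ = integrate.quad(lambda x: np.exp(t * x) * norm.pdf(x, mu, sigma),
                                -np.inf, np.inf)

# Closed form derived above: exp(t*mu + t^2 * sigma^2 / 2).
mgf_formula = np.exp(t * mu + 0.5 * t**2 * sigma**2)

print(np.isclose(mgf_numeric, mgf_formula))
```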

2) The bivariate normal density function This is the pdf of bivariate normal distribution with mean µ and variance covariance
 x − µ 2      
2 matrix Σ , thus X ~ N 2 ( µ , Σ ) .

1  1 1  − 2 ρ  x1 − µ1   x2 − µ2 + x2 − µ 2  
2  σ   σ  σ   σ   1
2 (1− ρ )  1   1  2   2  
f ( x1 , x 2 ) =
1
e  Exercise: Find the marginal of X 1 and X 2 , if f ( x1 , x 2 ) = e −Q / 2 ,
2π σ 1σ 2 1 − ρ 2 (2π ) σ 1σ 2 1 − ρ 2
where
1
− ( x − µ ) ′ Σ −1 ( x − µ )
or f ( x ) =
1
e 2 . 1  x − µ  2  x1 − µ1   x2 − µ 2   x2 − µ2 
2
 1 1 .
Q=  − 2 ρ  σ    +  
1/ 2
2π Σ 2  σ 
(1 − ρ )  1   1   σ2   σ2  
X  µ 
Proof: Let X =  1  , corresponding µ =  1  , and Solution: For, marginal of X 1 , we have
X
 2  µ2 
1  x − µ  2  x1 − µ1   x2 − µ 2   x2 − µ2 
2
 1 1 
 σ 11 σ 12   σ 1 ρ σ 1σ 2   − 2 ρ  σ    +  
2 σ ij Q=
2  σ 
∑ =   = , where ρ = . (1 − ρ )  1   1   σ2   σ2  
 σ 21 σ 22   ρ σ 1σ 2 σ 22 

σ iσ j
  x − µ 2 2
Now 1 ρ 2  1 1 2  x1 − µ1 
=   + (1 − ρ )  σ 
(1 − ρ 2 )   σ 1   1 
∑ = σ 12σ 22 − ρ 2σ 12σ 22 = σ 12σ 22 (1 − ρ 2 ) .
2
Thus,  x − µ1   x 2 − µ 2   x2 − µ 2 
− 2 ρ  1    +   
 1 ρ   σ1   σ 2   σ2  
 −  
1  σ2 
− ρ σ 1σ 2  1  σ 12 σ 1σ 2 
∑ −1 =  2 = 2 2
2 − ρ σ σ 2   ρ 1  1  x 2 − µ 2   x − µ1    x − µ1 
(1 − ρ ) σ 12σ 22  1 2 σ1 2
 (1 − ρ )  −  =   − ρ  1   +  1 
 σ 1σ 2 σ 22  (1 − ρ 2 )  σ 2   σ 1   σ1 
 
2 2
Consider 1  σ   x − µ1 
= x − µ 2 − ρ 2 ( x1 − µ1 ) +  1 
2  2 σ
 1 ρ  2
σ 2 (1 − ρ )  1   σ1 
 − 
−1  σ 12 1 σ 1σ 2   x1 − µ1  2
( x − µ ) ∑ ( x − µ ) = ( x1 − µ1 , x 2 − µ 2 )
′    x − µ 2*  2

(1 − ρ 2 )  − ρ 1   x 2 − µ 2  = 2  +  x1 − µ1  ,
  σ*   
 σ 1σ 2 2 
σ2   2   σ1 

σ2
x −µ ρ ( x2 − µ 2 ) ρ ( x1 − µ1 ) x 2 − µ 2  x − µ  where µ 2* = µ 2 + ρ ( x1 − µ1 ) , and σ 2*2 = σ 22 (1 − ρ 2 ) .
=
1  1 1
− ,− +  1 1
 σ1
2
1 − ρ  σ 12 σ 1σ 2 σ 1σ 2 σ 22   x 2 − µ 2 
 Therefore,
 (x − µ )2 ρ ( x − µ ) (x − µ ) ρ (x − µ ) (x − µ ) (x − µ )2    2 2 
1
 1 1 1 1 2 2 1 1 2 2
+ 2 2  1  x − µ 2*   x − µ1  
exp  −  2
= − − 1
f ( x1 , x 2 ) = +  1  
1 − ρ  σ 12
2 σ 1σ 2 σ 1σ 2 σ 22 
 (2π ) σ 1σ 2*
 2  *   σ 1  

  σ 2 
 
 x − µ  2 ( x1 − µ1 ) ( x 2 − µ 2 )  x 2 − µ 2 
2
1  1 1
+   .   1  x − µ  2    1  x − µ *  2 
=   − 2ρ =
1 
exp −  1 1    1 
exp −  2 2  
1 − ρ  σ 1 
2 σ 1σ 2  σ2     
  2π σ 1  2  σ 1    2π σ *  2  σ 2 *  
    2  
Therefore,
1
− ( x − µ )′ ∑ −1 ( x − µ )
= f ( x1 ) f ( x 2 x1 ) , since ∫x2 f ( x1, x 2 ) dx2 = f ( x1 ) .
1
f ( x) = e 2 .
(2π ) 2 2 ∑
1/ 2 Hence, the marginal distribution of X 1 is N ( µ1 , σ 12 ) , i.e. X 1 ~ N (µ1 , σ 12 ) .
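As a quick numerical check of this marginal result, the Python sketch below (with arbitrarily chosen parameter values) integrates the bivariate normal density over x2 at a fixed x1 and compares the result with the N(μ1, σ1²) density:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm, multivariate_normal

mu1, mu2 = 1.0, -2.0
sigma1, sigma2, rho = 2.0, 1.5, 0.6          # arbitrary illustrative parameters
cov = np.array([[sigma1**2, rho * sigma1 * sigma2],
                [rho * sigma1 * sigma2, sigma2**2]])
joint = multivariate_normal(mean=[mu1, mu2], cov=cov)

x1 = 0.3                                     # fixed value of x1
marg, _ = integrate.quad(lambda x2: joint.pdf([x1, x2]), -np.inf, np.inf)

print(np.isclose(marg, norm.pdf(x1, mu1, sigma1)))   # marginal of X1 is N(mu1, sigma1^2)
```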

For, marginal of X 2 . We have ∞ ∞ tx t x


=∫ ∫e 1 1 e 2 2 f ( x1 ) f ( x 2 x1 ) dx1 dx 2
−∞ −∞
 x − µ  2 2
e 1 1 f ( x1 )  e 2 2 f ( x 2 x1 ) dx 2  dx1 .
 x1 − µ1   x2 − µ 2   x2 − µ2  ∞ tx ∞ t x
Q=
1
 1
2  σ
1
 − 2 ρ  σ    +    = ∫
−∞  ∞
− ∫ 
(1 − ρ )  1   1   σ2   σ2  

The integral within the brackets is the mgf of the conditional pdf f ( x 2 | x1 ) . Since
 x − µ  2  x1 − µ1   x 2 − µ 2   x − µ2 
2 σ
1  1 1  + ρ 2  2 f ( x 2 | x1 ) is pdf of a normal distribution with mean µ 2* = µ 2 + ρ 2 ( x1 − µ1 ) , and
=   − 2 ρ  σ    σ1
( 1 − ρ 2 )  σ 1   1   σ2   σ2 
variance σ 2*2 = σ 2 2 (1 − ρ 2 ) .
2
 x − µ2  Thus,
+ (1 − ρ 2 ) 2  
 σ2   1
t 2 µ 2* + σ 2*2 t 22
1
tµ + t 2σ 2

∫− ∞ e 2 2 f ( x2 | x1 ) dx2 = e
t x 2 , as M X (t ) = e 2 for N (µ , σ 2 ) .
2 2
1  x1 − µ1   x − µ2   x − µ2 
=   − ρ  2   +  2  Therefore,
(1 − ρ )  σ 1 
2
 σ2   σ2   σ  1
t2  µ2 + ρ 2 ( x1 − µ1 ) + t22σ 22 (1− ρ 2 )
∞  σ1  2
1  σ1   x − µ2
2

2
M X (t ) = ∫ e t1x1
e f ( x1 ) dx1
=  x1 − µ1 − ρ ( x 2 − µ 2 ) +  2  −∞
σ 12 (1 − 2
ρ ) σ 2   σ2   σ 
 t +t2 ρ 2  x1
 σ 1  ∞ 1 σ1 
 x − µ1* 
2 2 = exp t 2 µ 2 − t 2 ρ 2 µ1 + t 22 σ 22 (1 − ρ 2 ) ∫ e  f ( x1 ) dx1
= 1  +  x 2 − µ 2  ,  σ1 2  −∞
 σ*   σ 
   2 
1  σ 1 
= exp t 2 µ 2 − t 2 ρ 2 µ1 + t 22 σ 22 (1 − ρ 2 )
σ  σ 2 
where, µ1* = µ1 + ρ 1 ( x 2 − µ 2 ) , and σ 1*2 = σ 12 (1 − ρ 2 ) . Therefore, the pdf of bi-variate 1
σ2
 σ  1 σ 
2 
normal distribution is × exp  t1 + t 2 ρ 2  µ1 +  t1 + t 2 ρ 2  σ 12 
 σ1  2 σ1  
  2 2   
1  x − µ1*   x − µ2  
exp −  1
1
f ( x1 , x 2 ) = +  2    σ σ
 2  *   1
(2π ) σ 1*σ 2   σ 1   σ2   = exp t 2 µ 2 − t 2 ρ 2 µ1 + t 22 σ 22 (1 − ρ 2 ) + t1 µ1 + t 2 µ1 ρ 2
   σ1 2 σ1

  1x −µ 2    * 2  σ2  σ2 
1     1  1  x − µ1   σ
+ 1  t12 + t 22 ρ 2 2 + 2t1t 2 ρ 2 
= exp −  2 2   exp −  1
 2π σ 2   *   2  2 σ1 
2 σ2  σ1
*
 
   2π σ 1  2  σ 1    
 
 1 1 1 1 
= exp t1 µ1 + t 2 µ 2 + t 22 σ 22 − t 22 ρ 2 σ 22 + t12 σ 12 + t 22 ρ 2σ 22 + t1t 2 ρ σ 1σ 2 
= f ( x 2 ) f ( x1 x 2 ) , where, ∫x1 f ( x1 , x 2 ) dx1 = f ( x2 ) .  2 2 2 2 

Hence, the marginal distribution of X 2 is N ( µ 2 , σ 22 ) , i.e. X 2 ~ N (µ 2 , σ 22 ) .  1 


= exp t1 µ1 + t 2 µ 2 + (t12 σ 12 + 2 t1t 2 ρσ 1σ 2 + t 22 σ 22 ) .
 2 
Moment generating function of bi-variate normal distribution  1 
M X1, X 2 (t1 , 0) = E [e t1x1+0 ] = E [e t1x1 ] = M X 1 (t1 ) = exp t1 µ1 + t12 σ 12 
It is defined as  2 
′ t  x   1 
M X (t ) = E[e t x ] , where t =  1  , and x =  1  . M X1, X 2 (0, t 2 ) = exp t 2 µ 2 + t 22 σ 22  .
t
 2  x2   2 

∞ ∞ Note: If ρ = 0,
M X1 X 2 (t1 , t 2 ) = M X1 X 2 (t1 , 0) M X1X 2 (0, t 2 ) , i.e. X1 and X 2 are
t x +t 2 x2
=∫
−∞ ∫−∞
e1 1 f ( x1, x2 ) dx1 dx2
independently distributed, if and only if M (t1 , t 2 ) = M (t1 , 0) M (0, t 2 ) .

Define, iv) Since ( X | Y = −4) ~ N ( µ1* , σ 1* ) ,


K (t1 , t 2 ) = log M (t1 , t 2 )
σ 3 4
where µ1* = µ1 + ρ 1 ( y − µ 2 ) = 3 × × (−4 − 1) = 0.6 ,
∂ K (t1 , t 2 ) σ2 5 5
i) t1 =0,t 2 =0 = E ( X 2 ) = µ 2 .
∂ t2
and
∂K (t1 , t 2 )
ii) t1 =0,t 2 =0 = E ( X 1 ) = µ1 .  9  16 16
∂ t1 σ 1*2 = σ 12 (1 − ρ 2 ) = 16 1 − = × 16 , ⇒ σ 1* = .
 25  25 5
∂ 2 K (t1 , t 2 ) 2 Therefore,
iii) t1 ,t2 =0 = V ( X 1 ) = σ 1 .
∂ t12  3 3
 −3− 3− 
P (−3 < X < 3 Y = −4) = P  5 <Z < 5  = P  − 18 < Z < 12  = 0.642 .
∂ 2 K (t1 , t 2 ) 2  16 16   16 16 
iv) t1 ,t2 =0 = V ( X 2 ) = σ 2 .  
∂ t 22  5 5 

∂ 2 K (t1 , t 2 ) Exercise: Given ( X , Y ) ~ N 2 (5 , 10, 1, 25 , ρ ) , and P (4 < Y < 16 X = 5) = 0.954 . Find ρ .


v) t1 ,t2 =0 = Cov ( X 1 , X 2 ) = ρ σ 1σ 2 .
∂ t1 ∂t 2 Solution: Since (Y X = 5) ~ N ( µ 2* , σ 2*2 ) , where
Exercise: Given ( X , Y ) ~ N 2 (3 ,1 ,16 , 25 , 3 / 5) . Find i) P (3 < Y < 8) ,
σ2 5
µ 2* = µ 2 + ρ ( x − µ1 ) = 10 + ρ (5 − 5) = 10 , and σ 2*2 = σ 22 (1 − ρ 2 ) = 25 (1 − ρ 2 ) .
ii) P (3 < Y < 8 X = 7) , iii) P (−3 < X < 3) , and iv) P (−3 < X < 3 Y = −4) . σ1 1
3 Thus,
Solution: We are given µ1 = 3, µ 2 = 1 , σ 12 = 16 , σ 22 = 25 , and ρ = , i.e. X ~ N (3,16) ,
5    
and Y ~ N (1, 25) .  4 − 10 16 − 10   6 6 
P (4 < Y < 16 X = 5) = P  <Z <  = P  − <Z< 
 5 1− ρ 2 5 1− ρ 2 5 1 − ρ 2
5 1 − ρ2
 3 − 1 Y − µ2 8 − 1    
i) P (3 < Y < 8) = P  < <  = P (0.4 < Z < 1.4) = 0.4192 − 0.1554 ≅ 0.264 .
 5 σ2 5   
 − 1.2 1.2 
= P <Z<  = 0.954 .
ii) Since (Y X = x) ~ N ( µ 2* , σ 2*2 ) , where  1− ρ2 1− ρ 2
 
σ2 3 5    
µ 2* = µ 2 + ρ ( x − µ1 ) = 1 + × (7 − 3) = 4 , and  1.2   1.2 
σ1 5 4 or 2 P  0 < Z <  = 0.954 or P  0 < Z <  = 0.477 .
 1− ρ2  1− ρ2
   
 9
σ 2* 2 = σ 22 (1 − ρ 2 ) = 25 1 −  = 16 . 1.2
 25  ⇒ =2, or 2 1 − ρ 2 = 1.4 or 4 (1 − ρ 2 ) = 1.2
Therefore, 1− ρ2

 3 − 4 Y − µ* 8 − 4  1.44
 = P (− 1 < Z < 1) or 1 − ρ 2 = = 0.36 ⇒ ρ 2 = 1 − 0.36 = 0.64 or ρ = ±0.8 .
P (3 < Y < 8 X = 7 ) = P  < 2
< 4
 4 σ * 4  4
 2 
Exercise: Find µ mean vector and Σ variance covariance matrix or dispersion matrix of the
= P(−0.25 < Z < 1) = P (−0.25 < Z < 0) + P (0 < Z < 1) = 0.440 . following density functions:
iii) Given X ~ N (3 ,16 ) 1 2 2
1 − 2 [( x−1) +( y−2) ]
 − 3 − 3 X − µ1 3 − 3  i) f ( x, y ) = e ,
P (−3 < X < 3) = P  < <  2π
 4 σ1 4 
1  x2 xy 2 
−  −1.6 + y 
 6  1 0.72  4 2 
= P  − < Z < 0  = P (−1.5 < Z < 0) = 0.4332 . ii) f ( x, y ) = e ,
 4  2.4π

1
2 2 Alternative Solution
1 − 2 ( x + y +4 x −6 y +13)
iii) f ( x, y ) = e , and ∂Q
2π Since Q = x 2 + y 2 + 4 x − 6 y + 13, = 2x + 4 = 0 ⇒ x = −2 , i.e. µ1 = −2 .
∂x
1
1 − (2 x 2 + y 2 + 2 xy −22 x −14 y +65)
iv) f ( x, y ) = e 2 . ∂Q  − 2
2π = 2y − 6 = 0 ⇒ y = 3 , i.e. µ 2 = 3 or µ =   , and
∂y  3 
Solution: If ( X , Y ) ~ N 2 ( µ1 , µ 2 , σ 12 , σ 22 , ρ ) , then
 1 
coeff . of x 2 coeff . of x y 
1 −1 1 0 −1  2
f ( x, y ) = e −Q / 2
, Σ 
= 
 = Σ , where Σ =  1 
(2π ) σ 1σ 2 1 − ρ 2 0 1  coeff . of x y coeff . of y 2 
2 
where
iv) We have
1  x − µ1 
2 2
 x − µ1   y − µ 2   y − µ2  1 2 2
Q=   − 2 ρ     +    . 1 − 2 ( 2 x + y + 2 xy −22 x−14 y+ 65)
1 − ρ 2  σ 1   σ1   σ 2   σ2   f ( x, y ) = e ,

1 2 2
1 − 2[( x−1) +( y −2) ] where Q = 2 x 2 + y 2 + 2 xy − 22 x − 14 y + 65 .
i) We have f ( x, y ) = e ,
2π ∂Q ∂Q
= 4 x + 2 y − 22 = 0 , and = 2 y + 2 x − 14 = 0 .
⇒ µ1 = 1, µ 2 = 2, σ 1 = 1, σ 2 = 1 , and ρ = 0 . ∂x ∂y

 σ2 Solving these two equations, we get


 µ  1 ρσ 1σ 2   1 0 
i.e. µ =  1  =   , and Σ =  1 =   .
 4
 µ2   2  ρσ σ σ 22   0 1 
 1 2 x = 4 , i.e. µ1 = 4 , and y = 3 , i.e. µ 2 = 3 , or µ =   .
 3
ii) We have
 2 1  1 − 1
1  1  x2 xy  Now, Σ −1 =   , then, Σ =   .
f ( x, y ) = exp −  − 1.6 + y 2   1 1 −1 2 
2.4π 
 0.72  4 2 
 Therefore,
  x  2  y  y
2 
 −1
1 1  x σ 12 = 1, σ 22 = 2, ρ σ 1σ 2 = −1 ⇒ ρ =
= exp −   − 2 × 0.8   +   .
(2π ) (2) (1) (0.6)  2 × 0.36  2  2 1 1  2

Exercise: Let f ( x ) = C e −Q / 2 , where
⇒ 1 − ρ 2 = 0.36 or ρ 2 = 0.64 ⇒ ρ = ± 0.8 .
Hence, i) Q = 3 x 2 + 2 y 2 − 2 xy − 32 x + 4 y + 92 , and
 0 ii) Q = 2 x12 + 3 x22 + 4 x32 + 2 x1 x2 − 2 x1 x3 − 4 x2 x3 − 6 x1 − 2 x 2 + 10 x3 + 8 .
µ1 = 0, µ 2 = 0, σ 1 = 2, σ 2 = 1 , and ρ = 0.8 , i.e. µ =   , and
0  
Find µ and Σ .
 4 (0.8) (2) (1)   4 1.6 
Σ =   =   . 3) By the same analogy we have that the multivariate normal density function may be
 ( 0.8) ( 2) (1) 1  1.6 1  written in the form
iii) We have 1
− ( x −b)′ A ( x −b)
1 2 2 1 2 2 f ( x) = k e 2 .
1 − 2 ( x + y + 4 x −6 y +13) 1 − 2 [( x +2) + ( y −3) ]
f ( x, y ) = e = e . where, k > 0 is a constant to be determined such that the integral over the entire
2π 2π
p − dimensional Euclidean space of x1 , x 2 , L , x p is unity.
Here,
 − 2 1 0 We assume that the matrix A is positive definite (if the quadratic form x ′ A x ≥ 0 ) and
µ1 = − 2 , µ 2 = 3 , σ 1 = 1 , σ 2 = 1 , and ρ = 0 . µ =   , and Σ =   .
 3  0 1 symmetric.

Since A is positive definite, we have ( x − b) ′ A ( x − b) ≥ 0 , so that f ( x ) is bounded and


above also clearly f ( x ) ≥ 0 . b = E ( x) = µ (the mean of x , denoted by µ ) (2.6)
x
Since A is positive definite matrix, then by matrix algebra, there exist a nonsingular
matrix C (a matrix whose determinant is not zero), such that Let Σ y , variance covariance matrix of y . Since y1 , y 2 , K , y p are independent N (0,1) ,
C′ AC = I (2.2) we have

where, I denotes the identity matrix and C′ the transpose of C . 1 0 L 0


 
0 1 L 0
Make the nonsingular transformation Σ y = E [ y − 0][ y − 0]′ = E y y ′ =  =I
M M M M
X −b =CY (2.3)  
0 0 L 1 

d ( x − b) d x
The Jacobian of the transformation is J = = = C , where, C indicates Hence,
dy dy

the absolute value of the determinant of C . I = E y y ′ = E [C −1 ( x − b) ( x − b) ′C −1 ] , from equation (2.3)
herefore, the density function of y will be
′ ′
1 1
= C −1 [ E ( x − µ ) ( x − µ ) ′] C −1 = C −1 Σ x C −1 , from equation (2.6)
− y′ C ′ A C y − y′ I y
g ( y) = k e 2 C =ke 2 C , since C ′ A C = I . Multiply by C on the left and by C′ on the right, we get
1
− y′ y C I C′ = Σ x or C C ′ = Σ x (2.7)
=k C e 2 (2.4)
Also, we have
From equation (2.3) we know that the range of the y ' s are also from − ∞ to ∞ .
Integrating over the entire range C ′ A C = I or A = C ′ −1 I C −1 = C ′ −1C −1

∞ ∞
1
− y′ y or A −1 = C C ′ , as ( AC ) −1 = C −1 A −1 (2.8)
∫−∞ K ∫−∞ k C e 2 dy1 K dy p = 1
From equation (2.7) and (2.8), we get, A = Σ −x 1 .
1
∞ ∞ − ( y12 + y22 + K + y 2p )
or ∫−∞
K
−∞ ∫
k C e 2 dy1 K dy p = 1 Also
2
A −1 = C C ′ = C C ′ = C , as C = C ′ , so that
 1 2   1 2 
 ∞ − y1   ∞ − yp 
or k C  ∫ e 2 dy1  K  ∫ e 2 dy p  = 1
 −∞   − ∞  C = Σx
1/ 2
.
   
1 Hence,
1 ∞ − t2 1
or k C ( 2π ) K ( 2π ) = 1 , as ∫
2 π −∞
e 2 dt = 1 or k =
C (2π ) p / 2
(2.5)
f ( x) =
1  1 
exp − ( x − µ x )′ Σ −x1 ( x − µ x ) .
p/2 1/ 2  2 
(2π ) Σx
Substituting from (2.5) in (2.4), we get
1 Definition: A random vector X = ( X 1 , X 2 , ..., X p ) ′ taking values x = ( x1 , x 2 , ..., x p ) ′ in
− y′ y
e 2  1 2   1 2  n
C  1 − 2 y1   1 −2 yp  E (Euclidean space of dimension p ) is said to have a p − variate normal distribution if its
g ( y) = = e L
  e .
C (2π ) p / 2  2π   2π  probability density function can be written as
   
1  1 
Thus, y1 , y 2 , K , y p are independently normally distributed each with mean zero and f ( x) = exp − ( x − µ x )′ Σ −x1 ( x − µ x ) ,
1/ 2  2 
variance unity. (2π ) p / 2 Σ x

Therefore, where
E ( x − b) = E (C y ) = C E ( y ) = 0 µ = ( µ1 , L , µ p )′ , and Σ is a positive definite symmetric matrix of order p .
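To make the p-variate normal density concrete, the Python sketch below (the mean vector, covariance matrix and evaluation point are arbitrary illustrative choices) evaluates the density formula directly and compares it with the value returned by scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary illustrative mean vector and positive definite covariance matrix (p = 3).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
x = np.array([0.5, 0.7, -0.2])

p = len(mu)
diff = x - mu
quad_form = diff @ np.linalg.inv(Sigma) @ diff
density_formula = (np.exp(-0.5 * quad_form)
                   / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5))

print(np.isclose(density_formula, multivariate_normal(mean=mu, cov=Sigma).pdf(x)))
```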

Properties of multivariate normal distribution Proof: We have


Theorem: If the variance covariance matrix of p − variate normal random vector 1  1 
f ( x) = exp  − ( x − µ )′ Σ −1 ( x − µ 
X = ( X 1 , X 2 , ..., X p ) ′ is diagonal matrix, then the components of X are independently 1/ 2  2 
(2π ) p / 2 Σ
normally distributed random variables.
Consider the transformation
Proof: The pdf of p − variate normal random vector is
Y = C X or X = C −1 Y . The Jacobian of the transformation is C −1 , therefore,
1  1 
f ( x) = exp − ( x − µ x )′ Σ −x 1 ( x − µ x ) .
1/ 2  2 
(2π ) p / 2 Σ x 1  1 
g ( y) = exp  − (C −1 y − µ ) ′ Σ −1 (C −1 y − µ ) C −1
p/2 1/ 2  2 
Given (2π ) Σ

σ 2 L 0  The quadratic form in the exponent of g ( y ) is


 1 
Σ =  M L M  , then the quadratic form ( x − µ )′Σ −1 ( x − µ ) will be (C −1 y − µ )′ Σ −1 (C −1 y − µ ) = (C −1 y − C −1 C µ )′ Σ −1 (C −1 y − C −1 C µ )
 
 0 L σ 2p 
 
= [C −1 ( y − C µ )]′ Σ −1 [C −1 ( y − C µ )]
1 / σ 2 L 0   ( x1 − µ1 ) 
 1   p
 x − µi 
2 ′
[( x1 − µ1 ),L , ( x p − µ p )]1× p  M L M   M  = ∑  i  = ( y − C µ ) ′ C −1 Σ −1C −1 ( y − C µ ) , since ( AB)′ = B ′ A′
 2    i =1
σi 
 0 L 1/ σ p (x − µ p )
  p× p  p p ×1 = ( y − C µ )′ C ′ −1Σ −1C −1 ( y − C µ ) , since (C −1 )′ = (C ′) −1
and
= ( y − C µ )′ (C Σ C ′) −1 ( y − C µ ) , since ( A B) −1 = B −1 A −1 .
p
Σ = ∏ σ i2 (a square matrix A is said to be diagonal, if aij = 0, i ≠ j , then And the Jacobian of the transformation, which is
i =1 1/ 2
p 1 1 1 Σ Σ
C −1 = = = = = , since A B = A B
A = ∏ aii ). C 2 C C′ C Σ C′ 1/ 2
C C Σ C′
i =1

Thus, Thus, the density function of Y is


p 1  1 
Σ
1/ 2
= ∏σ i . g ( y) = exp  − ( y − C µ )′ (C Σ C ′) −1 ( y − C µ )
p/2 1/ 2  2 
i =1
(2π ) C ΣC′

Hence, Therefore,

 1 p x −µ 2 p  1x −µ 2 Y ~ N (C µ , C Σ C ′) .
1  1 
f ( x) = exp − ∑  i i   =∏ exp  −  i i  
p  2 i =1  σ i   i =1 (2π )1/ 2 σ i  2  σi   Exercise: Given ( X 1 , X 2 ) ~ N 2 (µ1 , µ 2 , σ 12 , σ 22 , ρ ) . Find the joint density function of
(2π ) p/2
∏σ i    
Y1 = X 1 + X 2 , and Y2 = X 1 − X 2 .
i =1

= f ( x1 ) f ( x 2 ) L f ( x p ) .  Y  1 1   X 1 
Solution: The transformation is  1  =     , then, the matrix of transformation
 Y2  1 − 1  X 2 
Therefore, X 1 , X 2 , L , X p are independently normally distributed random variable with
1 1 
C =   is nonsingular, therefore,
mean µi , and variance σ i2 . 1 − 1
Theorem: If X (with p components) be distributed according to N ( µ , Σ) . Then Y = C X 1 1   µ1   µ1 + µ 2 
Y ~ N 2 (C µ , C Σ C ′) , where, C µ =     =   = µ
(nonsingular transformation) is distributed according to N (C µ , C Σ C ′) for C nonsingular. 1 − 1  µ 2   µ1 − µ 2 
y

and Necessary part: Let X (1) and X ( 2) be independent


1 1   σ 1 ρ σ 1 σ 2  1 1 
2
C Σ C ′ =     Σ12 = E ( X (1) − µ (1) ) ( X (2) − µ ( 2) )′ = 0 , since X (1) and X (2) are independent.
1 − 1  ρ σ 1 σ 2 σ 22  1 − 1
Thus if X (1) and X (2) be independent, then the components are uncorrelated i.e.
σ 2 + ρ σ σ ρ σ 1 σ 2 + σ 22  1
1
=  12 1 2
 
2  1 − 1
covariances are zero.
 σ − ρσ σ ρ σ1 σ 2 − σ 2   
 1 1 2 Sufficient part: From the necessary part, we have
σ 2 + 2 ρ σ σ + σ 2 σ 12 − σ 22   Σ11 0 
= 1 1 2 2 =Σ . Σ =   , we have to prove that X (1) and X (2) are independent.
 σ 2 2 2 2 y
 0 Σ 22 
 1 σ2
− σ1 − 2 ρ σ 1 σ 2 + σ 2 
Hence, Y ~ N 2 ( µ y , Σ y ) . The joint density function of X (1) and X (2) is

Exercise: Given X be distributed according to N 3 (0, I 3 ) and if Y1 = X 2 + X 3 , 1  1 


f ( x) = exp − ( x − µ )′ Σ −1 ( x − µ )
p/2 1/ 2  2 
Y2 = 2 X 3 − X 1 , and Y3 = X 2 − X 1 , then find the covariance matrix of Y . (2π ) Σ
Solution: The transformation is 1  1 
= exp  − Q  ,
 Y1   0 1 1   X 1   0 1 1 q/2 ( p −q) / 2 1/ 2  2 
       (2π ) (2π ) Σ
 Y2  =  − 1 0 2   X 2  . Here the matrix of transformation C =  − 1 0 2  is
Y   − 1 1 0  X   − 1 1 0 where,
 3   3   
0   x − µ 
 Σ −1 (1) (1)
nonsingular. Therefore,
Q = ( x − µ )′ Σ −1 ( x − µ ) = [( x (1) − µ (1) )′ , ( x ( 2) − µ (2) )′ ]  11
 0 Σ −1   x (2) − µ (2) 
2 2 1  22   
 
′ ′ ′
Y ~ N 3 (0, C I C ) or Y ~ N 3 (0, CC ) , where, C C =  2 5 1  .
 x (1) − µ (1) 
 1 1 2 = [( x (1) − µ (1) ) ′ Σ11
−1
, ( x (2) − µ (2) )′ Σ −221 ]  ( 2) 
   x − µ ( 2) 
 
Theorem: Let X ′ = ( X 1 , X 2 , ..., X p ) have a joint normal distribution, a necessary and
= ( x (1) − µ (1) )′ Σ11 ( x − µ (1) ) + ( x ( 2) − µ ( 2) )′ Σ −221 ( x ( 2) − µ (2) )
−1 (1)
sufficient condition that a subset X (1) of the components of X be independent of the subset
= Q1 + Q2 .
X (2) consisting of the remaining components of X is that the covariance between each
component of X (1) with a component of X (2) is zero. Also we note that

Proof: Let us partition A11 0 A11 0 I 0


Σ = Σ11 Σ 22 , because = ×
0 A22 0 I 0 A22
 X1   X1   X q +1 
     
 
(2)  X q +2 
(1) The density of X can be written
 X2  X  (1)  X 2 
X = = , where, X = , and X =  ,
M   X (2)   M 
 M  1  1 
      f ( x) = exp − (Q1 + Q2 )
Xp  Xq  X  1/ 2 1/ 2
     p  (2π ) q/2
(2π ) ( p −q ) / 2
Σ11 Σ 22  2 

the corresponding partition of µ and Σ will be 1  1  1  1 


= exp  − Q1  × exp  − Q2 
q/2 1/ 2  2  (2π ) ( p − q ) / 2 1 / 2  2 
 µ (1)   Σ11 Σ12 
(2π ) Σ11 Σ 22
µ =   , and
Σ =   , where, Σ11 = E ( X (1) − µ (1) ) ( X (1) − µ (1) )′
 µ ( 2)  Σ
 21 Σ 22  = f ( x (1) ) f ( x ( 2) ) .
 

Σ 22 = E ( X ( 2) − µ ( 2) ) ( X (2) − µ (2) )′ , Σ12 = E ( X (1) − µ (1) ) ( X ( 2) − µ ( 2) )′ , and Therefore, X (1) and X (2) are independent.

Σ 21 = Σ12′ . Further, the marginal density of X (1) is given by the integral



∞ ∞ ∞ (1) Since the transformation is nonsingular, therefore, the distribution Y is also p − variate
∫−∞ ∫−∞ L ∫−∞ f ( x ) f ( x ( 2) ) dx q +1 dx q + 2 L dx p
normal with
∞ ∞ ∞
= f ( x (1) ) ∫ f ( x ( 2) ) dx q+1 dx q + 2 L dx p = f ( x (1) )  (1) −1 (2) 
− ∞ −∞ ∫ L∫
−∞
 Y (1)   I − Σ12 Σ −1   X (1) 
E Y = E  ( 2)  = E  22    = E X − Σ12 Σ 22 X 
Y  0   ( 2)   ( 2 ) 
   I  X   X 
Hence, the marginal distribution of X (1) is N q ( µ (1) , Σ11 ) , similarly the marginal
 µ (1) − Σ Σ −1 µ (2) 
distribution of X (2) is N p −q ( µ (2) , Σ 22 ) . = 12 22 
 µ ( 2) 
 
Theorem: If X is distributed according to N ( µ , Σ) , the marginal distribution of any set of
and
components of X is multivariate normal with means, variances and covariances obtained by
taking corresponding components of µ and Σ , respectively.  I − Σ12 Σ −1   Σ11 Σ12   I 0
ΣY =  22     , since Y = C X .
0   Σ 21 Σ 22   − Σ12 Σ −1 I 
Proof: We partition X , µ and Σ as  I   22

 Σ − Σ12 Σ −1 Σ 21 Σ12 − Σ12 Σ −1 Σ 22   I 0


 X1   X1   X q +1  =  11 22 22 


      
 Σ 21 Σ 22  −1
  12 Σ 22
− Σ I 
 
( 2)  X q + 2 
(1)
 X2   X  (1)  X 2 
X = = , where X = , and X =  
M   X (2)   M 
 M   Σ11.2 0   Σ11.2 0 
      =  = .
X p  Xq  X  Σ − Σ −1
Σ 22   0 Σ 22 
     p   21 12 Σ 22 Σ 22

and the corresponding partition of µ and Σ will be The joint density function of Y (1) and Y (2) is
 µ (1)  1 1
 Σ11 Σ12  exp[ − { y (1) − ( µ (1) − Σ12 Σ −221 µ (2) )}′
µ =   , and f ( y) =
Σ =   . q/2 1/ 2 2
 (2)   Σ 21 Σ 22 
(2π ) Σ11.2
µ 
−1 (1) −1 ( 2 )
We shall make a nonsingular linear transformation to sub vectors Σ11 .2 { y − ( µ (1) − Σ12 Σ 22 µ )}]

Y (1) = X (1) + B X (2) , and Y ( 2) = X ( 2 ) . ×


1 1
exp[ − ( y ( 2) − µ ( 2) )′ Σ −221 ( y ( 2) − µ (2) )]
( p −q ) / 2 1/ 2 2
(2π ) Σ 22
where B is the matrix chosen such that Y (1) and Y (2) are uncorrelated. i.e. B must satisfy
the equation Since Y ( 2) = X ( 2) and Y (1) are independently distributed, so
E (Y (1) − E Y (1) ) (Y (2) − E Y (2) ) ′ = 0 f ( y ) = f ( y (1) ) f ( x (2) ) . Therefore, the marginal density function of X (2) = Y ( 2) is given by
or E ( X (1) + B X (2) − µ (1) − B µ (2) ) ( X ( 2) − µ ( 2) )′ = 0 the integral
∞ ∞ ∞ (1) ∞ ∞ ∞
∫−∞ ∫−∞ L ∫−∞ f ( y ) f ( x (2) ) d y (1) = f ( x ( 2) ) ∫ ∫ ... ∫ f ( y (1) ) d y (1) = f ( x ( 2) )
or E ( X (1) − µ (1) ) ( X (2) − µ ( 2) )′ + B E ( X (2) − µ ( 2) ) ( X ( 2) − µ (2) )′ = 0 − ∞ −∞ −∞

or Σ12 + B Σ 22 = 0 . Hence, the marginal distribution of X (2) is N p −q ( µ (2) , Σ 22 ) .


Thus,
Conditional distribution
B = −Σ12 Σ −221 , and Y (1) = X (1) − Σ12 Σ −221 X ( 2) .
The joint density function of X (1) and X (2) can be obtained by substituting
Therefore, the transformation is x (1)
− Σ12 Σ −221 x ( 2) for y (1)
, x (2)
for y ( 2)
and multiplying by the Jacobian of the
 Y (1)   I − Σ12 Σ −1   X (1)  transformation. The Jacobian of the transformation is J = 1 , since y = C x or x = C −1 y
 = 22   
 Y ( 2)   0   X ( 2) 
   I   dx 1 −1
I − Σ12 Σ 22
and = C −1 = , where C = = 1.
i.e. Y = C X . dy C 0 I

The resulting density of X (1) and X ( 2) is  1.0 0. 8 − 0.4 


 
Exercise: If X ~ N 3 (0, Σ) , where Σ =  0.8 1. 0 − 0.56  . Find the conditional
1 1  − 0.4 − 0.56
f ( x (1) , x (2) ) = exp[ − ( x (1) − Σ12 Σ −221 x (2) − µ (1) + Σ12 Σ −221 µ (2) )′  1.0 
1/ 2
(2π ) q / 2 Σ11.2 2
distribution of i) X 1 , X 2 given X 3 , ii) X 1 , X 3 given X 2 , and iii) X 2 , X 3 given X 1 .
−1 (1) −1 ( 2 ) −1 ( 2 )
Σ11.2 ( x − Σ12 Σ 22 x − µ (1) + Σ12 Σ 22 µ )] Solution:
i) We partition X for X 1 , X 2 given X 3 as
1 1
× exp[ − ( x (2) − µ (2) ) ′ Σ −221 ( x ( 2) − µ (2) )]
1/ 2  X1 
(2π ) ( p −q ) / 2 Σ 22 2
 
 X 2   X 
(1)
X 
1 1 X =  = , where X (1) =  1  , and X ( 2) = X 3
= −1 ( 2)
exp[ − {x (1) − µ (1) − Σ12 Σ 22 ( x − µ ( 2) )}′ L  X (2)  X2 
1/ 2    
(2π ) q / 2 Σ11.2 2 X 
 3
−1
Σ11 .2 { x
(1)
− µ (1) − Σ12 Σ −221 ( x (2) − µ (2) )}]  1.0

0.8 M − 0.4 

 0. 8 1 .0 M − 0.56   Σ11 Σ12 
Σ= = .
L   Σ 21 Σ 22 
1 1 −1 ( 2)
× exp[ − ( x (2) − µ (2) ) ′ Σ 22 ( x − µ (2) )] . 
L L L

( p −q ) / 2 1/ 2
(2π ) Σ 22 2  − 0.4 − 0.56 M 1.0 

Now The conditional distribution of X 1 , X 2 given X 3 is N 2 ( µ (1) * , Σ11.2 ) , where
(1) ( 2)
f (x ,x ) 1 1
= exp[− {x (1) − µ (1) − Σ12 Σ −221 ( x (2) − µ (2) )}′ −1 ( 2 )  − 0. 4   − 0.4 x3 
( 2) q/2 1/ 2 2 µ (1) * = µ (1) + Σ12 Σ 22 ( x − µ ( 2) ) = 0 +   (1) ( x3 − 0) =  − 0.56 x  , and
f (x ) (2π ) Σ11.2  − 0 .56   3

−1
Σ11 (1) −1 ( 2 )
− µ (1) − Σ12 Σ 22 ( x − µ (2) )}]  1.0 0.8   − 0.4 
.2 { x Σ11.2 = Σ11 − Σ12 Σ −221 Σ 21 =   −   (1) (− 0.4 − 0.56)
 0.8 1.0   − 0.56 
= f ( x (1) | x ( 2) ) .
 0.8400 0.5760 
=   .
Therefore, the density function f (x (1)
|x ( 2)
) is a q − variate normal with mean  0.5760 0.6864 
−1 ( 2)
µ (1) + Σ12 Σ 22 ( x − µ (2) ) = µ (1) * , and covariance matrix Σ11.2 = Σ11 − Σ12 Σ −221 Σ 21 , i.e. ii) For X 1 , X 3 given X 2 , we partition X as

( X (1) | X (2) = x ( 2) ) ~ N q ( µ (1) * , Σ11.2 ) .  X1 


 
 X 3   X 
(1)
X 
Note: The mean vector in the distribution of ( X (1) | X (2) = x (2) ) is linear function of X ( 2) X =  = , where X (1) =  1  , and X ( 2) = X 2
L  X (2)   X3 
   
while the covariance matrix is independent of X ( 2) . X 
 2
Exercise: In a similar fashion we can prove the following results  1.0 − 0.4 M 0.8 
 
−1 (1)  − 0.4 1.0 M − 0.56   Σ11 Σ12 
a) X (1) , X (2) − Σ 21 Σ11 X are independently normally distributed. Σ= = .
L L L L   Σ 21 Σ 22 
 
 0.8 − 0.56 M 1.0 
Hint: non singular linear transformation Y (1) = X (1) , Y (2) = X ( 2) + B X (1) 

b) The marginal distribution of X (1) ~ N q (µ (1) , Σ11 ) . The conditional distribution of X 1 , X 3 given X 2 is N 2 ( µ (1) * , Σ11.2 ) ,

where
c) The conditional distribution of X (2) given X (1) = x (1) is normal with mean
 0. 8   0. 8 x 2 
−1 (1)
µ ( 2) + Σ 21Σ11 −1
( x − µ (1) ) = µ ( 2) * and covariance matrix Σ 22.1 = Σ 22 − Σ 21Σ11 Σ12 . µ (1) * = µ (1) + Σ12 Σ −221 ( x (2) − µ (2) ) = 0 +   (1) ( x 2 − 0) =  
 − 0.56   − 0.56 x 2 

and Note: This result has been proved when D is p × p and nonsingular.
−1  1.0 − 0.4   0.8  Proof: The transformation is Y = D X , where Y has q components and D is q × p real
Σ11.2 = Σ11 − Σ12 Σ 22 Σ 21 =   −   (1) (0.8 − 0.56)
 − 0.4 1.0   − 0.56  matrix. The expected value of Y is
 1.0 − 0.4   0.6400 − 0.4480   0.36 0.048  E (Y ) = D E ( X ) = D µ , and the covariance matrix is
=   −   =   .
 − 0.4 1.0   − 0.4480 0.3136   0.048 0.6864 
Σ y = E [Y − E (Y )] [Y − E (Y )]′ = E [ D X − D µ ] [ D X − D µ ]′ = D Σ D ′ .
iii) For X 2 , X 3 given X 1 we partition X as
Since R( D ) = q the q rows of D are independent. We know that a set of q independent
 X1 
  vectors can be extended to form a basis of the p − dimensional vector space by adding to it
 L   X 
(1)
X  p − q vectors.
X =  = , where X (2) =  2  , and X (1) = X 1
X2  X ( 2)   X3 
     Dq× p 
X 
 3 Let C =   , then C is nonsingular. Make the transformation

 E ( p −q )× p 
 1.0 M 0.8 − 0.4 
  Z = C X . Since C nonsingular, therefore, Z ~ N p (C µ , C Σ C ′) , i.e.
 L L L L   Σ11 Σ12 
Σ= = .
0.8 M 1.0 − 0.56   Σ 21 Σ 22   D  DX 
 
 − 0.4 M − 0.56 1.0  Z =   X =   .
  E  EX 
The conditional distribution of X 2 , X 3 given X 1 is N 2 (µ ( 2) * , Σ 22.1 ) , where But, D X being the partition vector of Z has a marginal q − variate normal distribution,
therefore,
−1 (1)  0. 8   0.8 x1 
µ ( 2) * = µ (2) + Σ 21Σ11 ( x − µ (1) ) = 0 +   (1) ( x1 − 0) =   , and Y = D X ~ N q ( D µ , D Σ D ′) .
 − 0 . 4   − 0.4 x1 
 1.0 − 0.56   0.8  Note: This theorem tells us if X ~ N p ( µ , Σ) , then every linear combination of the
−1
Σ 22.1 = Σ 22 − Σ 21Σ11 Σ12 =  −  (1) (0.8 0.4)
 − 0.56 1.0   − 0.4  components of X has a univariate normal distribution.

 0.36 − 0.24   X1 
=   .  
 − 0.24 0.84  X2  ′
Proof: Let Y = D1× p X = (l1 , l 2 , ..., l p )  = l X , then E (Y ) = l ′ µ , and Σ y = l ′ Σ l ,
M 
Exercise: Let X 1 and X 2 are jointly distributed as normal with E ( X i ) = 0 and  
X p 
Var. ( X i ) = 1 , for i = 1, 2 . If the distribution of X 2 given X 1 is normal N1 (0, 1 − ρ 2 ) with  
therefore,
ρ < 1 , find the covariance matrix of X 1 and X 2 .
Y = D1× p X = l ′ X ~ N1 (l ′ µ , l ′ Σ l ) .
X   µ  0 Σ Σ12   1 Σ12 
Solution: Given X =  1  , µ =  1  =   , and Σ =  11 = 
X2  µ
 2   0  Σ 21 Σ 22   Σ 21 1  Exercise: Let X ~ N ( µ , Σ) , where X ′ = ( X 1 , X 2 , X 3 ) , µ ′ = (0, 0, 0) , and

The conditional distribution of X 2 given X 1 is N1 ( µ 2* , Σ 22.1 ) 7 3 2


 
−1 Σ =  3 4 1  . Obtain the distribution of X 1 + 2 X 2 − 3 X 3 .
Σ 22.1 = Σ 22 − Σ 21 Σ11 Σ12 = 1 − ρ 2 or 1 − Σ 21 Σ12 = 1 − ρ 2  2 1 2
 
or Σ 21 Σ12 = ρ 2 or Σ 21 = Σ12 = ρ . Solution: We know that
1 ρ  X1 
Therefore, the variance covariance matrix of X 1 and X 2 will be  .  
ρ 1  X2  ′
Y = D1× p X = (l1 , l 2 , K , l p )  = l X , then Y ~ N1 (l ′ µ , l ′ Σ l ) .
Theorem: If X is distributed according to N p ( µ , Σ) and Y = D X , where D is q × p M 
 
(q < p) matrix of rank q , then Y is distributed according to N q ( D µ , D Σ D ′) . X p 
 

Therefore, in our case where,


 X1  0  a′   a′ µ   a′   a ′Σ   a ′ Σ a a′ Σ b 
    D µ =  ′  µ =  ′  , and DΣD ′ =  ′  Σ (a, b) =  ′  (a, b) =  ′ .
Y = D1×3 X = (1, 2, − 3)  X 2  , then l ′ µ = (1, 2, − 3)  0  = 0 b  b µ  b  b Σ   b Σ a b′ Σ b 
X  0          
 3  
and The off-diagonal term a ′ Σ b is the covariance matrix for Y1 and Y2 . Consequently when

 7 3 2  1   1  a ′ b = 0 , so that a ′ Σ b = 0 , Y and Y are independent.


1 2
    
l ′ Σl = (1, 2, − 3)  3 4 1   2  = (7, 8, − 1)  2  = 29 .
 2 1 2   − 3  − 3  1 − 2 0
      
Exercise: Let X be N 3 (µ , Σ ) with µ ′ = (−3,1, 4) and Σ =  − 2 5 0  .
Hence Y ~ N (0, 29) .  0 0 2 

Exercise: Let X α ~ N p ( µ , Σα ) , α = 1, 2, L , n . Let Y = a1 X 1 + L + a n X n , then show
α Which of the following random variables are independent? Explain
 
that Y ~ N p  ∑ aα µ α , ∑ aα2 Σα  , where aα ' s are scalar. i) X 1 and X 2
 
α α  ii) X 2 and X 3
Proof: Consider a linear combination of Y
iii) ( X 1 , X 2 ) and X 3
 
l ′ Y = a1 l ′ X 1 + L + a n l ′ X n , then E (l ′ Y ) = a1 l ′ µ 1 + L + a n l ′ µ n = l ′  ∑ aα µ α  , and X1 + X 2
  iv) and X 3
α  2
  5
Σ ′ = Cov (a1 l ′ X 1 + L + a n l ′ X n ) = a12 l ′ Σ1 l + L + a n2 l ′ Σ n l = l ′  ∑ aα2 Σα  l . v) X 2 and X 2 − X1 − X 3
lY   2
α 
vi) Conditional distribution of X 2 given X 1 = x1 and X 3 = x3 .
Thus,
    Solution:
l ′ Y ~ N1  l ′ ∑ aα µ α , l ′ ∑ aα2 Σα l ⇒ Y ~ N p  ∑ aα µ α , ∑ aα2 Σα .
    i) Since X 1 and X 2 have covariance σ 12 = −2 , they are not independent.
 α α  α α 
ii) Since X 2 and X 3 have covariance σ 23 = 0 , they are independent.
Exercise: Let X ~ N p (µ , Σ ) , let Y1 = a ′ X and Y2 = b ′ X , where a and b are given
vectors of scalars. Obtain the joint distribution of Y1 and Y2 , and derive the condition for the iii) Partitioning X and Σ as
independence of Y1 and Y2 .  X1   1 −2 M 0
   
 2 X 
X  (1) 
− 2 5 M 0
 Y1   a ′ 
Solution: The transformation is   =  ′  X i.e. Y = D X , where Y has two components X = =
L   X ( 2) 
, and Σ =  L L L L .
 Y2   b    
 0

X 
 3  0 M 2 
and D is 2 × p real matrix. The expected value of Y is
E (Y ) = D E ( X ) = D µ , and the covariance matrix is X  0
We see that X (1) =  1  and X 3 have covariance matrix Σ12 =   , therefore,
X
 2 0
Σ y = E [Y − E (Y )] [Y − E (Y )]′ = E [ D X − D µ ] [ D X − D µ ]′ = D Σ D ′ ( X 1 , X 2 ) and X 3 are independent.
We know that, if iv) The transformation is
X ~ N p (µ , Σ ) , and Y = D X , where, D is a matrix of rank q (q < p ) , then
 X1   X1 + X 2 
1 / 2 1 / 2 0   
Y ~ N q ( D µ , DΣD ′) . Y =    X 2  =   , i.e. Y = D X .
 2

 0 0 1    X3
Therefore, in our case Y ~ N 2 ( D µ , DΣD ′) ,  X3   

Therefore, The variance covariance of this transformation is


 − 3 1 1 1   0 − l1 
1 / 2 1 / 2 0     − 1  0 1 0   
Y ~ N 2 ( D µ , DΣD ′) , where D µ =    1  =   and D Σ D ′ =   1 3 2   1 1 
 0 0 1    4   − l1 1 − l 2  1 2 2   0 − l 
 4    2

 1 − 2 0  1 / 2 0   3 − l1 + 3 − 2 l2 
1 / 2 1 / 2 0     1 / 2 0  =  .
DΣD ′ =    − 2 5 0  1 / 2 0  =  . −
 1 l + 3 − 2 l2 l12 + 2 l1l2 − 2 l1 − 4 l2 + 3 + 2 l22 
 0 0 1  0 2 
 0 0 2   0 1  
X 
The off-diagonal term − l1 + 3 − 2 l 2 is the covariance for X 2 and X 2 − l ′  1  .
X + X2  X3 
We see that 1 and X 3 have covariance σ 12 = 0 , they are independent.
2
 l1 
v) The transformation is For independent − l1 + 3 − 2 l 2 = 0 , so that l =  l1 + 3  .

 X1   2 
 0 1 0    X2 
Y =    X 2  =   , i.e. Y = D X
 − 5 / 2 1 − 1    X 2 − (5 / 2) X 1 − X 3  Exercise: X 1 and X 2 are two rv' s with respective expectations µ1 and µ 2 and variances
X
 3 σ 11 and σ 22 and correlation ρ such that
Therefore,
X 1 = µ1 + σ 11 Y1 , and X 2 = µ 2 + σ 22 (1 − ρ 2 ) Y1 + σ 22 Y2 ,
 − 3
 0 1 0    1  where Y1 and Y2 are two independent normal N (0,1) variates. Find the joint distribution of
Y ~ N 2 ( D µ , DΣD ′) , where D µ =    1  = 
 
 − 5 / 2 1 − 1    9 / 2  X 1 and X 2 .
 4 
Solution: The transformation is
and
 1 − 2 0 0 − 5 / 2  X 1 − µ1   σ 11 0   Y1 
   i.e. X − µ = CY or X = CY + µ , so that
 0 1 0    5 10    =  
 X 2 − µ 2   σ 22 (1 − ρ ) σ 22   Y2 
2
′ 
DΣ D =  
 − 2 5 0 1 1  =  .
 − 5 / 2 1 − 1     10 93 / 4 
 0 0 2 0 −1 
 µ1 
E X = C E (Y ) + µ = µ =  
We see that X 2
5
and X 2 − X 1 − X 3 have covariance σ 12 = 10 , they are not  µ2 
2
independent. and

1 1 1  Σ X = E [ X − µ ][ X − µ ]′ = E [CY + µ − µ ][CY + µ − µ ]′ = E [CY Y ′ C ′] = CC ′


 
Exercise: Let X be N 3 (µ , Σ ) with µ ′ = (2, − 3,1) and Σ = 1 3 2 
 σ 11 0  σ σ 22 (1 − ρ 2 ) 
1 2 2  =   11
   σ (1 − ρ 2 )
 22 σ 22   0 σ 22 

i) Obtain the distribution of 3 X 1 − 2 X 2 + X 3 .
 σ 11 σ 11σ 22 (1 − ρ 2 ) 
X 
ii) Find a vector l such that X 2 and X 2 − l ′  1  are independent. =  .
 σ σ (1 − ρ 2 ) σ (1 − ρ 2 ) + σ 
 X3   11 22 22 22 

iii) Conditional distribution of X 3 given X 1 = x1 and X 2 = x 2 . Exercise: Let A = Σ −1 and Σ be partitioned similarly as
Solution: Consider the transformation  A11 A12  Σ Σ12 
A =   Σ =  11  , by solving A Σ = I according to partitioning prove
 X1   A21 A22   Σ 21 Σ 22 
 0 1 0    X2  ( 2)  X 1 
Y =    X 2  =  ′ X (2)  , where X =   , −1 −1

 1 l 1 − l 2    2 X − l   X3  i) Σ12 Σ 22 = − A11 A12
 X3 
i.e. Y 2×1 = D2×3 X 3×1 . ii) Σ11 − Σ12 Σ −221 Σ 21 = A11
−1
.

Solution: Consider the matrix equation A Σ = I Solution:


i) We know that
 A11 A12   Σ11 Σ12   I 0 
  = 
 A21 A22   Σ 21 Σ 22   0 I  Σ = E ( X − µ ) ( X − µ )′ = E ( X X ′ − X µ ′ − µ X ′ + µ µ ′ )

 A Σ + A12 Σ 21
or  11 11
A11 Σ12 + A12 Σ 22   I 0 
= . = E (X X ′ ) − µ µ′ − µ µ′ + µ µ′
 A21 Σ11 + A22 Σ 21 A21 Σ12 + A22 Σ 22   0 I 
⇒ E (X X ′) = Σ + µ µ′ .
Consider
A11 Σ12 + A12 Σ 22 = 0 , then A11 Σ12 = − A12 Σ 22 ii) E ( X ′ A X ) = E [tr X ′ A X ] , since tr a = a , a is a scalar.

−1
or Σ12 = − A11 A12 Σ 22 or Σ12 Σ −221 = − A11
−1
A12 . = E [tr A ( X X ′ )] , since tr (C D) = tr ( D C )

Also = tr A E ( X X ′ ) = tr A (Σ + µ µ ′ )
A11 Σ11 + A12 Σ 21 = I , then
= tr AΣ + tr A µ µ ′ = tr AΣ + µ ′ A µ .
−1 −1
A11 Σ11 = I − A12 Σ 21 or Σ11 = A11 − A11 A12 Σ 21 Exercise: Let X ~ N p (0, I ) and if A and B are real symmetric matrices of order p , then
−1 −1
or A11 = Σ11 + A11 A12 Σ 21 i) E ( X ′ A X ) = tr A

ii) V ( X ′ A X ) = 2 tr A 2
−1
or A11 = Σ11 − Σ12 Σ −221 Σ 21 .

Exercise: Prove explicitly that Σ is positive definite, then iii) Cov ( X ′ A X , X ′ B X ) = 2 tr AB


Σ = Σ 22 Σ11 − Σ12 Σ −221 Σ 21 = Σ 22 Σ11.2 . Solution: Since A is a real symmetric matrix of order p , there exists an orthogonal matrix
C such that
Solution: Let Σ be partitioned as
C ′AC = diag (λ1 , λ 2 , L , λ p ) = ∆ , where λ1 , λ 2 , L , λ p are latent roots of A .
 Σ11 Σ12   I − Σ12 Σ −1 
Σ =   , and define a matrix C p × p as C =  22  . Consider an orthogonal transformation
 Σ 21 Σ 22 
0 
 I 
Y = C X , then E (Y ) = C µ = 0 , and ΣY = E (Y Y ′ ) = C E ( X X ′ ) C ′ = CC ′ = I
Obviously C is a non-singular matrix because determinant of C is 1 .
Now consider the matrix This shows that Y ~ N p (0, I ) , i.e. Yi ~ N (0, 1) .

 Σ − Σ12 Σ −1 Σ 21 Now
0   Σ11.2 0 
C Σ C ′ =  11 22 =  .

 0 Σ 22   0 Σ 22  i)

X ′ A X = Y ′ C −1 AC −1Y = Y ′ ∆ Y , since C being orthogonal, C −1 = C ′ .
Therefore, = λ1Y12 + λ 2Y22 + L λ p Y p2 .
C Σ C ′ = Σ11.2 Σ 22 ⇒ C Σ C ′ = Σ11.2 Σ 22 , because C = 1 .
Therefore,
Similarly, we can prove
E ( X ′ A X ) = λ1 E (Y12 ) + λ 2 E (Y22 ) + L λ p E (Y p2 )
−1
Σ = Σ11 Σ 22 − Σ 21 Σ11 Σ12 = Σ11 Σ 22.1 .
= λ1 + λ 2 + L λ p , since Yi2 ~ χ12 for each i
Exercise: Let X ~ N p ( µ , Σ) , let A be a symmetric matrix of order p , show that
= tr C ′AC = tr ACC ′ = tr A .
i) E ( X X ′ ) = Σ + µ µ ′ . ii) V ( X ′ A X ) = λ12 V (Y12 ) + λ 22 V (Y22 ) + L λ 2p V (Y p2 )

ii) E ( X ′ A X ) = tr AΣ + µ ′ A µ . = 2 (λ12 + λ 22 + L λ 2p ) = 2 tr A 2 . Since λ12 , L , λ2p are the latent roots of A 2 .



iii) Since A and B are real symmetric matrices of order p , so that, A + B is real symmetric Now the characteristic function of Y is
matrix of order p . Hence, ′ i (u1Y1 + L+u pY p ) i u pY p
φY (u ) = E [e iu Y ] = E e = E e i u1Y1 L E e (2.9)
2 tr ( A + B) = V [ X ′ ( A + B) X ] = V ( X ′ A X + X ′ B X )
2
Since Y1 , Y2 , L , Y p are independent and distributed according to N (0, 1) ,
= V ( X ′ A X ) + V ( X ′ B X ) + 2 Cov ( X ′ A X , X ′ B X )
1
 1   1  itµ − t 2σ 2
1 φY (u ) =  exp − u12  L  exp − u 2p  , since X ~ N (µ , σ 2 ) , and φ X (t ) = e
⇒ Cov ( X ′ A X , X ′ B X ) = [2 tr ( A + B ) 2 − V ( X ′ A X ) − V ( X ′ B X )] 2 .
2  2   2 
1 1
= tr ( A + B) 2 − tr A 2 − tr B 2 − (u12 + L +u 2p ) − u′ u
=e 2 =e 2 (2.10)
= tr ( A 2 + B 2 + AB + BA) − tr A 2 − tr B 2 Thus
= tr AB + tr BA = 2 tr AB , since tr AB = tr BA . ′X i t ′ (C Y + µ ) it ′ µ ′
φ X (t ) = E [e i t ] = E [e ]=e E [e i t ( C Y ) ]
Characteristic function 1
it ′ µ it ′ µ − (C ′ t ) ′ ( C ′ t )
The characteristic function of a random vector X is defined as =e E e i (C ′ t ) ′ Y = e e 2 from equation (2.9) and (2.10)
′X 1 1
φ X (t ) = E [ e i t ] , where t is a vector of reals, i = − 1 . it ′ µ − t ′CC ′ t i t ′ µ − t ′Σ t
=e e 2 =e 2 .
Theorem: Let X = ( X 1 , X 2 , L , X p )′ be normally distributed random vector with mean µ Corollary: Let X ~ N p (µ , Σ ) , if Z = D X , then the characteristic function of vector Z is
and positive definite covariance matrix Σ , then the characteristic function of X is given by 1
i t ′ ( D µ )− t ′ ( DΣD′) t
1 e 2 .
i t ′ µ − t ′Σ t
φX (t ) = e 2 , where t = (t1 , t 2 , L , t p )′ is a real vector of order p × 1 . Proof: The characteristic function of vector Z is defined as
Proof: We have 1
′Z ′ i ( D′ t )′ µ − ( D′t )′ Σ D′ t
1  1  φ Z (t ) = E e it = E e it D X = E e i ( D′t )′ X = e 2
f ( x) = exp − ( x − µ ) ′ Σ −1 ( x − µ )
1/ 2  2 
(2π ) p / 2 Σ 1
i t ′ ( D µ )− t ′ ( D Σ D′) t
=e 2 ,
Since Σ is symmetric and positive definite, there exists a non-singular matrix C such that
which is the characteristic function of N p ( D µ , DΣD ) .
C ′ Σ −1 C = I , and Σ = C C ′ .
Theorem: If every linear combination of the components of a vector X is normally
Let X − µ = C Y , so that Y = C −1 ( X − µ ) a nonsingular transformation and the Jacobian of
distributed, then X has normal distribution.
the transformation is J = C , therefore, the density function of Y is
Proof: Consider a vector X of p − components with density function f (x ) and
1  1  iu′ X
f ( y) = exp − (C y + µ − µ ) ′ Σ −1 (C y + µ − µ ) C characteristic function φ X (u ) = E [e ] and suppose the mean of X is µ and the
p/2 1/ 2  2 
(2π ) Σ
covariance matrix is Σ .
1  1 
exp − ( y ′ C ′ Σ −1C y ) , since C = Σ Since u ′ X is normally distributed for every u . Then the characteristic function of u ′ X is
1/ 2
=
(2π ) p/2  2 
1
′ it u′ µ − t 2 u′Σ u
1  1   1 1   1 1  E e it (u X ) = e 2 , taking t = 1 , this reduces to
= exp − y ′ y  =  exp − y12  L  exp − y 2p  .
(2π ) p/2  2 
  (2π )1 / 2 2   1 / 2 2 
  (2π )  i u′ µ − u ′Σ u
1

It shows that Y1 , Y2 , L , Y p are independently normally distributed each with mean zero and E e iu X = e 2 .
variance one. Therefore, X ~ N p ( µ , Σ) .
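As an illustrative addition (not part of the original proof), the closed form of the characteristic function can be compared with a Monte Carlo estimate of E[exp(i t′X)]; the mean vector, covariance matrix and argument t below are arbitrary, and Python with NumPy is assumed.

    import numpy as np

    rng = np.random.default_rng(1)
    mu = np.array([0.5, -1.0])
    Sigma = np.array([[1.0, 0.4],
                      [0.4, 2.0]])
    t = np.array([0.3, -0.2])

    # closed form: phi_X(t) = exp(i t'mu - t' Sigma t / 2)
    phi_exact = np.exp(1j * t @ mu - 0.5 * t @ Sigma @ t)

    # Monte Carlo estimate of E[exp(i t'X)]
    X = rng.multivariate_normal(mu, Sigma, size=500_000)
    phi_mc = np.mean(np.exp(1j * X @ t))

    print(phi_exact, phi_mc)   # the two complex numbers should be close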

Theorem: The moment generating function of a vector X , which is distributed according Let X − µ = C Y a nonsingular transformation. Under this transformation the quadratic form
1
t ′ µ + t ′Σ t expanded as
to N p ( µ , Σ) is M X (t ) = e 2 .
Q = ( x − µ )′ Σ −1 ( x − µ ) = (C y )′ Σ −1 (C y ) = y ′ C ′ Σ −1C y = y ′ y
Proof: Since Σ is symmetric and positive definite, there exists a non-singular matrix
C such that and the Jacobian of the transformation is J = C .
−1
C′Σ C = I and Σ = C C ′ . Therefore, the density function of Y is
Make the nonsingular transformation
1  1 
exp  − y ′ y  C , since Σ = C C ′ = C , and C = Σ
2 1/ 2
f ( y) = .
X − µ = C Y , then Y = C −1 ( X − µ ) and J = C . (2π ) p/2
Σ
1/ 2  2 
Therefore, the density function of Y is
1  1 p   1 1   1 1 
1  1  = exp  − ∑ yi2  =  exp − y12  L  exp − y 2p  .
f ( y) = exp − (C y + µ − µ ) ′ Σ −1 (C y + µ − µ ) C (2π ) p/2  2   1 / 2 2   1 / 2 2 
1/ 2  i =1   (2π )   (2π ) 
(2π ) p/2
Σ  2 
This shows that Y1 , Y2 , L , Y p are independent standard normal variates, i.e.
1  1 
exp − ( y ′ C ′ Σ −1C y ) , since C = Σ
1/ 2
= p
(2π ) p / 2  2  Y ~ N p (0, I ) , then Yi ~ N (0, 1) . Therefore, Q = ∑ y i2 ~ χ 2p .
i =1
1  1 
= exp  − y ′ y  .
(2π ) p/2  2  For r − th mean
p Q p Q
It shows that Y1 , Y2 , L , Y p are independently normally distributed each with mean zero and 1 ∞ −1 − 1 ∞ + r −1 −
variance one.
E (Q r ) = ∫
2 p / 2 Γ ( p / 2) 0
Qr Q 2 e 2 dQ = ∫ Q2
2 p / 2 Γ ( p / 2) 0
e 2 dQ

Now the moment generating function of Y is


p
+r
u′ Y (u1Y1 + L+u pY p ) (u1Y1 ) ( u pY p ) 22 p ∞ Γα
M Y (u ) = E e = Ee = Ee LE e = Γ( + r ) , as ∫ x α −1e − x / β dx =
2 p / 2 Γ( p / 2) 2 0 (1 / β )α
p 1 ′
uu
= ∏ E e uiYi = e 2 , since Yi ~ N (0, 1) . p rp  p   p
2 r Γ( + r ) 2  + r − 1  + r − 2  L   Γ( p / 2) r
i =1 2  2   2  2 p 
= = = 2r ∏  + r − i .
i =1  
Thus we can say Γ( p / 2) Γ( p / 2) 2
′X t ′ (C Y + µ ) t′ µ ′ t′ µ
φ X (t ) = E e t = Ee =e E [ e t (C Y ) ] = e E [e (C ′ t )′ Y ] In particular r = 1, 2

 Y (1)   A 
 =   X , and, E (Q 2 ) = 2 2  p + 1  p  = p 2 + 2 p .
1 1 1
t′ µ (C′ t )′ (C′ t ) t ′ µ + t ′CC ′ t t ′ µ + t ′Σ t 
=e e2 =e 2 =e 2 .  Y ( 2)   B  2  2 
 
Exercise: If X ~ N p ( µ , Σ) , then obtain the distribution of quadratic form
Therefore,
Q = ( x − µ ) ′ Σ −1 ( x − µ ) . Also find its r − th mean.
V (Q ) = E (Q 2 ) − [ E (Q )]2 = p 2 + 2 p − p 2 = 2 p .
Solution: Given X ~ N p ( µ , Σ) , then the probability density function of X is
Alternative method
1  1 
f ( x) = exp − ( x − µ ) ′ Σ −1 ( x − µ ) Given X ~ N p ( µ , Σ) , then the probability density function of X is
p/2 1/ 2  2 
(2π ) Σ
1  1 
Since Σ is a symmetric and positive definite there exits a non-singular matrix C such that f ( x) = exp − ( x − µ ) ′ Σ −1 ( x − µ ) .
1/ 2  2 
C ′ Σ −1 C = I and Σ = C C ′ . (2π ) p / 2 Σ

Let ′
= (C −1Y − C −1C µ ) ′Σ −1 (C −1Y − C −1C µ ) = (Y − C µ )′ C −1 Σ −1C −1 (Y − C µ )
Y = (X − µ )′Σ −1 ( X − µ ) , and its moment generating function

 Y (1)   A    Y (1)   A  
1 −1 =  ( 2) −   µ  (CΣC ′) −1  (2)  −   µ 
   
1 t ( x − µ )′ Σ −1 ( x − µ ) − 2 ( x −µ )′ Σ ( x −µ )  Y   B    
 Y   B  
M Y (t ) = E (e t Y ) =
(2π ) p/2
Σ
1/ 2 ∫ e e dx
−1 
 AΣA′ 0  (Y (1) − A µ ) 
= [(Y (1) − Aµ )′ (Y ( 2) − B µ )′]    (2) .
1
1
− ( x − µ )′ (1−2t ) Σ −1 ( x −µ )  0 BΣB ′  (Y − B µ )
=
(2π ) p/2
Σ
1/ 2 ∫ e 2 dx
Under the condition A X = 0 , we have Y (1) = 0 , and hence E Y (1) = 0 , we get
1/ 2 −1
(1 − 2t ) Σ −1 1
− ( x − µ )′ (1−2t ) Σ −1 ( x − µ )  AΣA′ 0   0 
Q = [0 (Y (2) − B µ ) ′]   (Y ( 2) − B µ )
=
1/ 2 1/ 2 ∫e 2 dx
 0 BΣ B ′   
(2π ) p / 2 Σ (1 − 2t ) Σ −1
= (Y ( 2) − B µ )′ ( BΣB ′) −1 (Y (2) − B µ ) and hence,
1 1
= =
1/ 2 1/ 2 1/ 2 1/ 2 Q = ( X − µ ) ′ Σ −1 ( X − µ ) ~ χ 2p− q , since Y (2) ~ N p −q ( B µ , BΣB ′) .
Σ (1 − 2t ) Σ −1 (1 − 2t ) p / 2 Σ Σ −1
Exercise: If X ~ N p ( µ , Σ) , find the distribution of quadratic form Q = X ′ Σ −1 X .
1
= .
(1 − 2t ) p / 2 Solution: Since Σ is a symmetric and positive definite there exits a non-singular matrix C
such that
But this is the moment generating function of χ 2p , therefore, Y ~ χ 2p .
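The result Q = (x − µ)′ Σ⁻¹ (x − µ) ~ χ²_p, together with E(Q) = p and V(Q) = 2p, can be checked by simulation. The sketch below is an illustrative addition to the notes, assuming Python with NumPy and SciPy; the dimension and covariance matrix are arbitrary choices.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    p = 4
    mu = np.zeros(p)
    Sigma = 0.5 * np.eye(p) + 0.5          # illustrative equicorrelated covariance
    Sinv = np.linalg.inv(Sigma)

    X = rng.multivariate_normal(mu, Sigma, size=200_000)
    Q = np.einsum('ni,ij,nj->n', X - mu, Sinv, X - mu)   # (x - mu)' Sigma^{-1} (x - mu)

    print(Q.mean(), p)                      # E(Q) = p
    print(Q.var(), 2 * p)                   # V(Q) = 2p
    # Kolmogorov-Smirnov comparison with the chi-square(p) distribution
    print(stats.kstest(Q[:5000], 'chi2', args=(p,)).pvalue)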
C ′ Σ −1 C = I and Σ = C C ′ .
Exercise: Suppose X ~ N p ( µ , Σ) and let A be a q × p matrix of rank q . If
Let X = C Y a nonsingular transformation, then Y = C −1 X .
A X = 0 , then the quadratic form Q = ( X − µ )′ Σ −1 ( X − µ ) follows a chi-square distribution
Since the transformation is nonsingular, the distribution of Y is also p − variate normal with
on p − q degrees of freedom.
E (Y ) = C −1 E ( X ) = C −1 µ = υ
Solution: Since Σ is positive definite and A is of full row rank, there exists a ( p − q ) × p
 A and
matrix B of rank p − q such that AΣB ′ = 0 and that C =   is nonsingular.
 B ′ ′
ΣY = E [Y − E (Y )] [Y − E (Y )]′ = C −1 E ( X − µ ) ( X − µ ) ′ C −1 = C −1 Σ C −1 = I
Consider the nonsingular linear transformation
This shows that Y ~ N p (υ , I ) , and Yi ~ N1 (υ i , 1) .
Y = C X , therefore, Y ~ N p (C µ , CΣC ′) .
Therefore,
Now
Q = X ′ Σ −1 X = (C Y ) ′ Σ −1 (C Y ) = Y ′ (C ′ Σ −1C ) Y = Y ′ Y
 A  A X   Y (1) 
Y =   X =   = ( 2) , say
p p
 B  B X   Y  = ∑ Yi2 ~ χ 2 , where, ∑υi2 = υ ′ υ = µ ′ Σ −1 µ = non-centrality parameter.
p, ∑ υ i2
i =1 i
i =1
 A  A  AΣA′ AΣB ′   AΣA′ 0 
C µ =   µ , and CΣC ′ =   Σ ( A′ B ′ ) =   =  .
 B  B  BΣA′ BΣB ′   0 BΣB ′  Alternative method

This means that Y (1) ~ N q ( Aµ , AΣA′) and Y (2) ~ N p −q ( B µ , BΣB ′) independently. Given X ~ N p ( µ , Σ) , then the probability density function of X is

Therefore, 1  1 
f ( x) = exp − ( x − µ ) ′ Σ −1 ( x − µ ) .
p/2 1/ 2  2 
Q = ( X − µ )′ Σ −1 ( X − µ ) = (C −1Y − µ )′Σ −1 (C −1Y − µ ) (2π ) Σ

Let 1 
Exercise: If X ~ N p ( µ , Σ) . Show that M X − µ (t ) = exp  t ′ Σ t  , hence find the moment
Y = X ′ Σ −1 X , and its moment generating function 2 
generating function of X .
1
− ( x − µ ) ′ Σ −1 ( x − µ )
1 t x ′ Σ −1 x
M Y (t ) = E (e t Y ) =
(2π ) p / 2 Σ
1/ 2 ∫ e e 2 dx Solution: By definition
1
−1
 t′ ( X − µ )  1 t ′ ( x − µ ) − 2 ( x −µ )′ Σ ( x −µ )
= 1/ 2 ∫
1
t x′Σ −1 x − ( x − µ )′ Σ −1 ( x − µ ) M ( X − µ ) (t ) = E e e e dx
1   (2π ) p / 2 Σ
=
(2π ) p/2
Σ
1/ 2 ∫ e 2 dx.
1
1 − [ ( x − µ )′ Σ −1 ( x − µ )−2 t ′ ( x − µ ) ]
Consider =
(2π ) p / 2 Σ
1/ 2 ∫e 2 dx
1
Q = x ′ Σ −1 x − ( x − µ )′Σ −1 ( x − µ )
2 We know that for A symmetric matrix
1
= − [ x ′ Σ −1 x − µ ′ Σ −1 x − x ′ Σ −1 µ + µ ′ Σ −1 µ − 2 t x ′ Σ −1 x] a ′ A −1 a − 2 a ′ t = (a − At )′ A−1 (a − At ) − t ′ A t
2
1
= − [ x ′ (1 − 2 t )Σ −1 x − 2 x ′ Σ −1 µ + θ ] , where θ = µ ′ Σ −1 µ . Comparing a ′ A −1 a − 2 a ′ u with ( x − µ )′ Σ −1 ( x − µ ) − 2 t ′ ( x − µ ) , we get
2
We know that for A symmetric matrix a = ( x − µ ) , A = Σ , u = t , since a ′ u = u ′ a , so that
a ′ A −1 a − 2 a ′ u = (a − A u )′ A −1 (a − A u ) − u ′ A u 1
1 − [( x − µ −Σ t )′ Σ −1 ( x −µ −Σ t )−t ′Σ t ]
= a ′ A −1 a − u ′ AA−1 a − a ′ A−1 A u + u ′ AA−1 A u − u ′ A u M ( X −µ ) (t ) =
(2π ) p / 2 Σ
1/ 2 ∫e 2 dx

= a ′ A −1 a − 2 a ′ u .
1 ′
t Σt 1 1 ′
− ( x − µ * )′ Σ −1 ( x −µ * )
Comparing a ′ A −1 a − 2 a ′ u = (a − A u )′ A −1 (a − A u ) − u ′ A u with Q , we get e2 t Σt
1/ 2 ∫
= e 2 dx = e2 .
1 (2π ) p / 2 Σ
a = x , A −1 = (1 − 2 t ) Σ −1 , ⇒ A = Σ , u = Σ −1 µ
1− 2t
Also,
Thus,
 t′ (X −µ)  −t ′ µ ′ −t ′ µ
M ( X − µ ) (t ) = E  e =e E ( et X ) = e M X (t )
 
1  

1   1  Σ
Q = −  x − Σ Σ −1 µ  (1 − 2 t )Σ −1  x − Σ Σ −1 µ  − µ ′ Σ −1 Σ −1 µ + θ  1
2  (1 − 2 t )   (1 − 2 t )  (1 − 2 t )  t′ µ t ′ µ + t ′Σ t
  ⇒ M X (t ) = e M ( X − µ ) (t ) = e 2 .
 ′ 
1  µ   µ  θ
= −  x −  (1 − 2 t )Σ −1  x −
 
−
 +θ  . Exercise: For any vector t and any positive definite symmetric matrix W , show that
2  (1 − 2 t )   (1 − 2 t )  (1 − 2 t )  1 ′
   1 ′ −1 1/ 2 t W t
∞ ′ 
∫−∞ exp  − 2 x W x + t x  d x = (2π ) W e 2 . Hence find the moment generating
p/2
Therefore,
′ function of p − variate normal distribution with positive definite covariance matrix Σ .
1/ 2 1 µ   µ 
e tθ /(1− 2 t ) (1 − 2t ) Σ −1 −  x −  (1− 2t ) Σ −1  x −

2  1− 2 t 

 1− 2 t 
 
M Y (t ) =
p/2 1/ 2 −1 1 / 2
∫ e dx Solution: Consider the quadratic form
(2π ) Σ (1 − 2t ) Σ 1 ′ −1
− ( x W x − 2 t ′ x ) = Q (say). We know that for A symmetric matrix
2
1
= e (tθ ) /(1− 2 t ) .
(1 − 2t ) p / 2 a ′ A −1 a − 2 a ′ u = (a − A u )′ A −1 (a − A u ) − u ′ A u .

But this is the mgf of χ (2p, θ ) , ⇒ Y ~ χ (2p, θ ) . Comparing a ′ A −1 a − 2 a ′ u with Q , we get a = x , A = W , u = t , since u ′ t = t ′ u .
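Similarly, the non-central result for Y = X′ Σ⁻¹ X can be checked against a non-central chi-square with p degrees of freedom and non-centrality parameter µ′ Σ⁻¹ µ. The following Python (NumPy/SciPy) sketch is an illustrative addition with arbitrary parameter values.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    p = 3
    mu = np.array([1.0, 0.5, -0.5])
    Sigma = np.array([[1.0, 0.3, 0.0],
                      [0.3, 1.0, 0.2],
                      [0.0, 0.2, 1.0]])
    Sinv = np.linalg.inv(Sigma)
    lam = mu @ Sinv @ mu                      # non-centrality parameter mu' Sigma^{-1} mu

    X = rng.multivariate_normal(mu, Sigma, size=200_000)
    Y = np.einsum('ni,ij,nj->n', X, Sinv, X)  # X' Sigma^{-1} X

    print(Y.mean(), p + lam)                  # mean of a non-central chi-square(p, lam)
    print(stats.kstest(Y[:5000], 'ncx2', args=(p, lam)).pvalue)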

Thus,  µ1 
1 1 1 1   
1 1 1  I − I I − I  µ 2  0
Q = − [ ( x − W t ) ′ W −1 ( x − W t ) − t ′W t ] = − ( x − W t )′ W −1 ( x − W t ) + t ′W t EU = D E X =  4 4 4 4 
 µ  =  0 
2 2 2  1 I 1
I
1
− I
1 
− I  3  
4 4 4 4  4 p×4 p  µ 
Therefore,  4
1 ∞  1  1  and
∫ exp − ( x − W t )′ W −1 ( x − W t ) exp  t ′W t  d x
1/ 2 −∞   2 
(2π ) p / 2 W 2
 1 1 
 I I 
1  1   Σ 0 0 0   41 4 
exp  t ′W t  1 1 1  1 
  ∞ exp − 1 ( x − W t ) ′ W −1 ( x − W t ) d x  I − I I − I 0− 4 I I
2 4  0 Σ 0 4 
1/ 2 ∫−∞
=  2  ΣU = D Σ Z D ′ =  4 4 4
   1 I 1 1 1  0 1 I 1 
(2π ) p / 2 W I − I − I   0 0 Σ  − I
4 4 4 4   0 0 0 Σ   4 4 
1 ′
tWt
1 ′
t Σt  − 1 I 1 
− I
=e 2 =e2 , since Σ is positive definite symmetric matrix  4 4 
1
−t ′ µ t′ µ t ′ µ + t ′Σ t 1  1 
= M ( X − µ ) (t ) = e M X (t ) or M X (t ) = e M ( X − µ ) (t ) = e 2 .  ( 4 Σ) 0   Σ 0 
= 16 =4 .
 0 1 1 
Exercise: Let X ~ N p ( µ , Σ) and Y ~ N p (δ , Ω) , if U 1 = X + Y , and U 2 = X − Y , find the  (4 Σ)   0 Σ
 16   4 
joint distribution of U 1 , and U 2 .
This shows that U 1 and U 2 are independent and U 1 ~ N p (0 , Σ / 4) , U 2 ~ N p (0 , Σ / 4) .
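A quick simulation (an illustrative addition, with arbitrary µ and Σ, assuming Python/NumPy) confirms that U1 and U2 each have mean 0 and covariance Σ/4 and are uncorrelated, hence independent under joint normality:

    import numpy as np

    rng = np.random.default_rng(4)
    mu = np.array([1.0, -1.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    n = 300_000

    X1, X2, X3, X4 = (rng.multivariate_normal(mu, Sigma, size=n) for _ in range(4))
    U1 = (X1 - X2 + X3 - X4) / 4
    U2 = (X1 + X2 - X3 - X4) / 4

    print(U1.mean(axis=0))                          # close to the zero vector
    print(np.cov(U1, rowvar=False), Sigma / 4)      # covariance of U1 is Sigma/4
    # cross-covariance block between U1 and U2 should vanish
    print(np.cov(np.hstack([U1, U2]), rowvar=False)[:2, 2:])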
Solution: Given
Note:
X µ  Σ 0 
Z 2 p×1 =   , E Z =   , and Σ Z 2 p×2 p =   .  
Y  δ   0 Ω t2X 2 tr X r
M X (t ) = E e tX = ∫ e tx f ( x) dx = E 1 + t X + +L+ + L
 2 ! r ! 
Consider the transformation
U   I I  X t2 tr
U =  1  =     , i.e. U = C 2 p×2 p Z , then = 1 + t E( X ) + E( X 2 ) + L + E ( X r ) + L
U 2   I − 1 I  2 p×2 p  Y  2! r!
2 r
µ I I µ  µ + δ  t t
E U = C E Z = C   =    =   = 1 + t µ1′ + µ 2′ + L + µ r ′ + L (2.11)
δ   I − I   δ   µ − δ  2! r!

and Differentiating r times equation (2.11) with respect to t and then putting t = 0 , we get
I I Σ 0  I I  Σ + Ω Σ − Ω  ∂r   t2  ∂r
ΣU = C Σ Z C ′ =    = .  M X (t )  =  µ r′ + µ r′ +1 t + µ r′ + 2 + L ⇒ µr′ =
I − I   0 Ω   I − I   Σ − Ω Σ + Ω  r  
M X (t ) .
 ∂ t  t =0  2!  t =0 ∂tr t =0
Exercise: Let X 1 , X 2 , X 3 and X 4 be independent N p ( µ , Σ) random vectors, if
1 1 1 1 1 1 1 1
U 1 = X 1 − X 2 + X 3 − X 4 , and U 2 = X 1 + X 2 − X 3 − X 4 , find the joint
4 4 4 4 4 4 4 4
distribution of U 1 , and U 2 , and their marginals.
Solution: Consider the transformation
 X1 
1 1 1 1   
U   I − I I − I
X2
U =  1  =  4 4 4 4 
 X  , i.e. U = C 2 p×4 p X 4 p×1 , then
U 2   1 I 1
I
1
− I
1
− I   3
4 4 4 4  2 p×4 p  X 
 4

Raw moments of multivariate normal distribution   p  p  


∂     
 t′ µ + 1 t′Σ t 
= M  µ l + ∑ t j σ lj   µ n + ∑ t j σ nj  + M σ nl 
∂M ∂   ∂ tr   
2 j =1  j =1  t =0
E[X n ] = =  e 
∂ t n t =0 ∂ t n  
 t =0   p  p  p 
=  M  µ r + ∑ t j σ rj   µ l + ∑ t j σ lj   µ n + ∑ t j σ nj 
  p 1 p p      
∂   j =1  j =1  j =1 
= exp ∑ t k µ k + ∑ ∑ t k t j σ kj 
∂ tn   k =1 2 k =1 j =1  t =0  p   p   p  
+ M  µ n + ∑ t j σ nj  σ lr + M  µ l + ∑ t j σ lj σ nr + M  µ r + ∑ t j σ rj σ nl 
        
1 ∂ p p  j =1   j =1   j =1   t =0
= M µ n + ∑ t k ∑ t j σ kj 
 2 ∂ t n k =1 j =1 
t =0 = µ r µ l µ n + µ nσ lr + µ l σ nr + µ r σ nl .

 1 ∂ Fourth moment
= M µ n + ( t1 (t1σ 11 + t 2 σ 12 + L + t nσ 1n + L + t p σ 1 p )
 2 ∂ tn
∂4M ∂  ∂ 3 M 
+ L + t n (t1σ n1 + L + t nσ nn + L + t p σ np ) E[X n Xl X r X m] = =  
∂ t n ∂ tl ∂ t r ∂ t m
t =0
∂ tm  ∂ t n ∂ t l ∂ t r 
t =0
+ L + t p (t1σ p1 + L + t nσ pn + L + t p σ pp ))}t =0
∂   p  p  p 
 p  = M µ r + ∑ t j σ rj   µ l + ∑ t j σ lj   µ n + ∑ t j σ nj 
1
= M µ n + (t1σ 1n + L + ∑ t jσ nj + t n σ nn + L + t p σ pn ) ∂ tm       
2   j =1  j =1  j =1 
 j =1 t =0
 p   p   p  

 1
p p   p  + M  µ n + ∑ t j σ nj  σ lr + M  µ l + ∑ t j σ lj  σ nr + M  µ r + ∑ t j σ rj σ nl 
= M µ n +  ∑ t jσ jn + ∑ t j σ nj  = M µ n + ∑ t j σ nj  = µ n , as e 0 = 1       

2 j =1   j =1   j =1   j =1   t =0
  j =1 
 t = 0 
 j =1 
t = 0
  p  p  p  p 
=  M  µ m + ∑ t j σ mj   µ r + ∑ t j σ rj   µ l + ∑ t j σ lj   µ n + ∑ t j σ nj 
Second moment
     
∂         
p j =1 j =1 j =1 j =1
∂2M ∂ ∂ M 
E[X n Xl ] = =   = M µ n + ∑ t j σ nj 
∂ tl ∂ t n ∂ tl  ∂ tn  ∂ tl      p   p   p 
t =0  j =1 
+ M σ rm  µ l + ∑ t j σ lj  µ + t σ  +σ µ + t σ 
t =0
 n ∑ j nj  lm  r ∑ j rj 
t =0
 
     j =1   j =1   j =1 

p  p  
= M  µ l + ∑ t j σ lj   µ n + ∑ t j σ nj  + M σ nl  = µ l µ n + σ nl .
    p   p  p  
  j =1  j =1    µ + t σ  +  µ + t σ   µ + t σ σ 
 n ∑ j nj   r ∑ j rj   l ∑ j lj  nm 
t =0
 j =1   j =1  j =1  
Therefore,

E ( X n2 ) = µ n2 + σ nn  p  p 
+ M  µ m + ∑ t j σ mj   µ n + ∑ t j σ nj  σ lr + M σ nm σ lr
and
  
V ( X n ) = E ( X n2 ) − [ E ( X n )]2 = µ n2 + σ nn − µ n2 = σ nn .  j =1  j =1 

Cov ( X n , X l ) = E ( X n X l ) − E ( X n ) E ( X l ) = µ n µ l + σ nl − µ n µ l = σ nl .  p  p 
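These raw-moment formulas can also be checked by differentiating the moment generating function numerically at t = 0. The sketch below is an illustrative addition (Python/NumPy assumed); the parameters and the finite-difference step are arbitrary choices.

    import numpy as np

    mu = np.array([0.7, -1.2, 0.4])                       # illustrative parameters
    Sigma = np.array([[1.5, 0.3, 0.1],
                      [0.3, 1.0, 0.2],
                      [0.1, 0.2, 0.8]])

    def M(t):
        # moment generating function of N_p(mu, Sigma)
        return np.exp(t @ mu + 0.5 * t @ Sigma @ t)

    n, l = 0, 2                                           # which mixed moment E(X_n X_l)
    h = 1e-3
    e_n = np.eye(3)[n] * h
    e_l = np.eye(3)[l] * h

    # central finite-difference approximation of d^2 M / (dt_n dt_l) at t = 0
    d2M = (M(e_n + e_l) - M(e_n - e_l) - M(-e_n + e_l) + M(-e_n - e_l)) / (4 * h * h)
    print(d2M, mu[n] * mu[l] + Sigma[n, l])               # should agree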
+ M  µ m + ∑ t j σ mj   µ l + ∑ t j σ lj  σ nr + M σ lm σ nr
  
Third moment  j =1  j =1 

 ∂ 2 M   p  p  
∂ 3M ∂ + M  µ m + ∑ t j σ mj   µ r + ∑ t j σ rj  σ nl + M σ rm σ nl 
E[X n Xl X r ] = =     
∂ t n ∂ tl ∂ t r ∂ tr  ∂ t n ∂ t l  
t =0 t =0  j =1  j =1   t =0

= µ m µ r µ l µ n + µ l µ nσ rm + µ r µ nσ lm + µ r µ l σ nm + µ m µ nσ lr + σ nmσ lr   p  p  p  p 
=  M  ∑ t j σ mj   ∑ t j σ rj   ∑ t j σ lj   ∑ t j σ nj 
+ µ m µ l σ nr + σ lmσ nr + µ m µ r σ nl + σ rmσ nl .   j =1    
    j =1   j =1   j =1 

Moments about the mean   p   p   p  p 



+ M σ rm  ∑ t j σ lj   t σ  +σ  t σ  t σ 
∑ ∑
 1 t′Σ t  1 ′    j nj  lm  j rj   ∑ j nj 
∂M ∂  2  t′ ( X − µ ) t Σt   j =1   j =1   j =1   j =1 
E [ X n − µn ] = = e  , Ee = e2
∂ t n t =0 ∂ t n    p  p    p  p 
+ σ nm  ∑ t j σ rj   ∑ t j σ lj   + M  ∑ t j σ mj   ∑ t j σ nj  σ lr + M σ nm σ lr
t =0
     
 p p   p   j =1   j =1    j =1   j =1 
∂  exp 1 ∑ ∑ t k t j σ kj 
= = M  ∑ t j σ nj  = 0.
∂ tn  2   j =1   p  p 
 k =1 j =1  t =0   t =0
+ M  ∑ t j σ mj   ∑ t j σ lj  σ nr + M σ lm σ nr
  
 j =1   j =1 
∂2M ∂ ∂ M  ∂   p 
E [ X n − µ n ][ X l − µ l ] = =   = M  ∑ t j σ nj   p  p  
∂ tl   + M  ∑ t j σ mj   ∑ t j σ rj  σ nl + M σ rm σ nl 
∂ t n ∂ tl ∂ t n  ∂ tl   j =1
t =0 t =0  t =0    
 j =1   j =1   t =0
  p  p  
 
= M  ∑ t j σ lj   ∑ t j σ nj  + M σ nl  = σ nl . = σ nmσ lr + σ lmσ nr + σ rmσ nl .
  
  j =1  j =1  
t =0

∂3M ∂  ∂ 2 M 
E [ X n − µ n ] [ X l − µl ] [ X r − µ r ] = =  
∂ t n ∂ tl ∂ t r
t =0
∂ tr  ∂ t n ∂ t l 
t =0

  p  p  
∂     
= M  ∑ t j σ lj   ∑ t j σ nj  + M σ nl 
∂ tr   j =1 
 j =1  t =0

  p  p  p 
=  M  ∑ t j σ rj   ∑ t j σ lj   ∑ t j σ nj 
      
  j =1   j =1   j =1 

  p   p   p  

+ M σ lr  ∑ t j σ nj  + σ nr  ∑ t j σ lj + M  ∑ t j σ rj σ nl  = 0.
      
  j =1   j =1   j =1   t =0

∂4M
E [ X n − µ n ] [ X l − µl ] [ X r − µ r ] [ X m − µ m ] =
∂ t n ∂ tl ∂ t r ∂ t m
t =0

∂   p  p  p   p 
= M  ∑ t j σ rj   ∑ t j σ lj   ∑ t j σ nj  + M  ∑ t j σ nj  σ lr
∂ tm   j =1     
    j =1   j =1   j =1 

 p   p  
+ M  ∑ t j σ lj  σ nr + M  ∑ t j σ rj σ nl 
    
 j =1   j =1   t =0
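The central-moment formulas above (odd central moments vanish; the fourth central moment is the three-term sum of products of covariances) can be verified by simulation. The following Python (NumPy) sketch is an illustrative addition with arbitrary parameter values.

    import numpy as np

    rng = np.random.default_rng(5)
    mu = np.array([1.0, 0.0, -2.0, 0.5])
    Sigma = np.array([[2.0, 0.5, 0.2, 0.0],
                      [0.5, 1.0, 0.3, 0.1],
                      [0.2, 0.3, 1.5, 0.4],
                      [0.0, 0.1, 0.4, 1.0]])

    X = rng.multivariate_normal(mu, Sigma, size=1_000_000)
    Z = X - mu                                   # centred observations
    n_, l_, r_, m_ = 0, 1, 2, 3

    third = (Z[:, n_] * Z[:, l_] * Z[:, r_]).mean()
    fourth = (Z[:, n_] * Z[:, l_] * Z[:, r_] * Z[:, m_]).mean()
    formula = (Sigma[n_, m_] * Sigma[l_, r_]
               + Sigma[l_, m_] * Sigma[n_, r_]
               + Sigma[r_, m_] * Sigma[n_, l_])

    print(third)             # close to 0: odd central moments vanish
    print(fourth, formula)   # fourth central moment matches the three-term formula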

ESTIMATION OF PARAMETERS IN MULTIVARIATE NORMAL Also


DISTRIBUTION  a11 a12 L a1 p 
 
1  a 21 a 22 L a 2 p  A
The normal distribution is completely specified if its mean vector µ and covariance matrix S= = .
Σ are known. In case of unknown parameters, the problems of their estimation arise. We n − 1  M M M M  n −1

 a p1 a p 2 L a pp 
estimate these parameters by the method of maximum likelihood estimation. For using this 
method we require a random sample of size n from the given p − variate normal population.
The matrix A is called sum of square and cross products of deviations about the mean or
Let the random sample of size n from N p ( µ , Σ ) be x1 , x 2 , L , x α , L , x n , where n > p and ( SS & CP ) .
x α is p × 1 vector, α = 1, 2, L , n . In extended vector notation the data are as follows: where
Suppose we have n individuals 1, 2, L , n and p characteristics X 1 , X 2 , L , X p and each n

individuals studied.
aij = ∑ ( xiα − xi ) ( x jα − x j ) , for all i and j
α =1
Characteristic Individual Mean n n n n
1 2 L α L n = ∑ [ xiα ( x jα − x j ) − xi ( x jα − x j )] = ∑ x iα x jα −xj ∑ x iα − x i ∑ x jα + n xi x j
α =1 α =1 α =1 α =1
X1 x11 x12 L x1α L x1n x1
n
X2 x 21 x 22 L x 2α L x 2n x2 = ∑ x iα x jα − n x i x j .
M M M M M M M M α =1
Xi xi1 xi 2 L x iα L xin xi
Some results
M M M M M M M M
1) Consider a quadratic form
Xp x p1 x p 2 L x pα L x pn xp
p
x1 x2 L xα L xn Q = x ′ A x = ∑ aij xi x j , where A is symmetric, then
i, j

Corresponding to α − th individual there is a vector x α which is representing a point in  ∂Q 


 
p − dimensional space. So all these n point in E n . Therefore,  ∂ x1 
 ∂Q 
1 n  ∂Q   = 2 Ax .
 ∑ x1α  = ∂ x2
∂x  
  1 
x
 n α =1  M 
 M   x2   ∂Q 
n     ∂ xp 
1  1 n  =  M  , is the sample mean vector, and sample variance  
x = ∑ xα =
 n ∑ iα
x
n α =1   xi 
 α =1    In particular, for p = 2
 nM   M 
1  x   a11 a12   x1   x1 
 n ∑ x pα x 2 ) 
Q = ( x1    = (a11 x1 + a 21 x 2 a12 x1 + a 22 x 2 )  
  p a 22   x 2 
 α =1   a 21  x2 
covariance matrix is
= (a11 x12 + a 21 x1 x 2 + a12 x1 x 2 + a 22 x 22 ) , then
 s11 s12 L s1 p 
   ∂Q 
 s 21 s 22 L s 2 p   
S =
M M M M 
, ∂ Q  ∂ x1
=  =  2a11 x1 + a 21 x 2 + a12 x 2  = 2  a11 a12   x1 
   = 2 Ax .
  ∂ x  ∂Q   a 21 x1 + a12 x1 + 2a 22 x 2  a
 21 a 22   x 2 
 s p1 s p 2 L s pp  ∂x 
  2 
1 n ∂B ∂B
where s ij = ∑ ( xiα − xi ) ( x jα − x j ) , ∀ i and j . 2) If B = y ′ A x = x ′ A′ y , then = A x , and = A′ y
n − 1 α =1 ∂y ∂x
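From an n × p data matrix the sample mean vector x̄, the SS & CP matrix A and the sample covariance matrix S = A/(n − 1) defined above can be computed directly. The sketch below is an illustrative addition to the notes (Python/NumPy assumed); the data are simulated only for the example.

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 50, 3
    mu = np.array([1.0, 2.0, -1.0])
    Sigma = np.array([[1.0, 0.4, 0.2],
                      [0.4, 2.0, 0.3],
                      [0.2, 0.3, 1.5]])
    X = rng.multivariate_normal(mu, Sigma, size=n)   # rows are the n individuals

    xbar = X.mean(axis=0)                            # sample mean vector
    D = X - xbar                                     # deviations about the mean
    A = D.T @ D                                      # SS & CP matrix: sum (x_a - xbar)(x_a - xbar)'
    S = A / (n - 1)                                  # sample covariance matrix

    print(xbar)
    print(np.allclose(S, np.cov(X, rowvar=False)))   # agrees with numpy's unbiased estimator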

3) If Q = ( x − b) ′ A ( x − b) = (b − x )′ A (b − x ) , then Example: Let orthogonal matrix of order 2


 1 1   1 1   1 1 
∂Q ∂Q  −     − 
= 2 A ( x − b) , and = 2 A (b − x ) .
A= 2 2  , then A′A =  2 2 × 2 2  =  1 0  = I
∂x ∂b 1   0 1 
 1 1   1 1   1
  −   
4) A sub matrix of A is a rectangular array obtained from A by deleting rows and  2 2   2 2  2 2 
columns. A minor is the determinant of square sub matrix of A .
Here n = 2
p p
A = ∑ aij Aij = ∑ a jk A jk , 2
∑ aik = a11 + a12 =
1

1
=0,
j =1 j =1 2 2
k =1
where Aij , is (−1) i+ j times the minor of aij , and the minor of an element aij is the and
determinant of the sub matrix of a square matrix A obtained by deleting the i − th row 2
1 1
and j − th column. ∑ aik × a jk = a11 a21 + a12 a 22 = 2 − 2 = 0
k =1
If A ≠ 0, there exist a unique matrix B such that AB = I , B is called the inverse of A
2
1 1
and is denoted by A −1 . ∑ aik × a jk = a11 a11 + a12 a12 = 2 + 2 = 1
k =1
Let a ij be the element of A −1 in the i − th row and j − th column, then 2
1 1
ij A ji A ji ∑ aik × a jk = a 21 a 21 + a 22 a 22 = 2 + 2 = 1
a = and aij = . k =1
A A −1 Another example of orthogonal matrix of order 3

 2 4  7 / 2 − 2 7 6 1  1 1 1× 0 
Example: Let A =   , A = 14 − 12 = 2 , and A −1 =   , A −1 = − = .  − 
 3 7   − 3 / 2 1  2 2 2  2 ×1 2 ×1 2 ×1 
 1 1 −2 
A= 
A 7 A −4 A − 3 22 A22 2
a11 = 11 = , a12 = 21 = = −2 , a 21 = 12 = , a = = = 1.  3× 2 3× 2 3× 2 
A 2 A 2 A 2 A 2  1 1 1 
 
Also  3 3 3 

 1 1 1× 0 
A11 1 A 21 2 A12 3/ 2  − 0 L 0 
a11 = = = 2, a12 = = = 4, a 21 = = =3,  2 ×1 2 ×1 2 ×1 
A −1 1/ 2 A −1 1/ 2 A −1 1/ 2
 1 1 −2 
 0 L 0 
A 22
7/2  3 ×2 3× 2 3× 2 
a 22 = = =7. A= M M M M M M .
−1 1/ 2 1 1 1 − (n − 1) 
A  L
 n (n − 1) n (n − 1) n (n − 1) n (n − 1) 
5) A square matrix An × n is said to be orthogonal if A′A = AA′ = I n , and if the  1 1 1 1 
 L 
transformation Y = A X transform X ′ X to Y ′ Y . Also  n n n n 

n n Maximum likelihood estimate of the mean vector


1
∑ aik × n
=0, ⇒ ∑ aik = 0 , i = 1, 2, L, n − 1 Let x1 , x 2 , L , x α , L , x n be a random sample of size n(> p ) from N p ( µ , Σ) .
k =1 k =1

and The likelihood function is


n
0, i ≠ j  1 n 
∑ aik × a jk = 1, for i, j = 1, 2, L , n . φ = f ( x1 ) f ( x 2 ) L f ( x n ) =
1
exp  − ∑ ( x α − µ )′ Σ −1 ( x α − µ )
k =1  i= j (2π ) np / 2
Σ
n/2
 2 α =1 

np n 1 n ′
log φ = − log (2π ) − log Σ − ∑ ( x α − µ )′ Σ −1 ( x α − µ ) . 1  1 
Σ x = E ( x − µ ) ( x − µ )′ = E  ( x1 + L + x n ) − µ   ( x1 + L + x n ) − µ 
2 2 2 α =1 n  n 
Differentiating with respect to µ and equating to zero 1
= E ( x1 + L + x n − n µ ) ( x1 + L + x n − n µ )′
n2
n n
∂ log φ 1
= 0 = −0 − 0 − ∑ 2 Σ −1 ( µ − x α ) or Σ −1 ∑ (µ − x α ) = 0 1
∂µ 2 α =1 = [ E ( x1 − µ ) ( x1 − µ )′ + L + E ( x n − µ ) ( x n − µ )′ + 0] , as xα ' s are independent.
α =1
n2
1 n 1
or µˆ = ∑x =x .
n α =1 α
= ( n Σ) = Σ / n .
n2
Thus,
Maximum likelihood estimate of variance covariance matrix
x ~ N ( µ , Σ / n) .
Let Σ −1 = (σ ij ) and Σ = (σ ij ) then the Σ −1 = σ i1Σ i1 + L + σ ip Σ ip , where Σ ij is the
A
Theorem: is an unbiased estimate of Σ .
ij −1 Σ ij th −1 −1 th n −1
cofactor of σ in Σ , therefore, = (i j ) element of (Σ ) = (i j ) element of
Σ −1 Proof: We have
Σ = σ ij . A = ∑ ( x α − x ) ( x α − x )′ = ∑ [( x α − µ ) − ( x − µ )][ ( xα − µ ) − ( x − µ )]′
α α
Now, the logarithm of the likelihood function is
= ∑ [( x α − µ ) ( x α − µ )′ − ( x − µ ) ( x α − µ )′ − ( x α − µ ) ( x − µ ) ′ + ( x − µ ) ( x − µ )′]
np n 1 n α
log φ = − log (2π ) + log Σ −1 − ∑ ( x α − µ )′ Σ −1 ( x α − µ )
2 2 2 α =1 = ∑ ( x α − µ ) ( x α − µ )′ − n ( x − µ ) ( x − µ ) ′ .
α
np n
=− log (2π ) + log (σ i1Σ i1 + L + σ ij Σ ij + L + σ ip Σ ip ) Taking expectation on both the sides
2 2
1 E ( A) = ∑ E ( x α − µ ) ( x α − µ )′ − n E ( x − µ ) ( x − µ ) ′ = n Σ − n Σ / n = (n − 1) Σ
− ∑ ∑ σ ij ( xiα − µi ) ( x jα − µ j ) , since x′ A x = ∑ aij xi x j
2 α i, j
α
i, j
 A 
or E   = Σ , this shows that the maximum likelihood estimate of Σ is biased, i.e.
Differentiating with respect to σ ij and equating to zero, we get  n −1 
 A  n −1
∂ log φ n Σ ij 1 n ∂ f ′( x) E = Σ.
=0= − ∑ ( xiα − µ i ) ( x jα − µ j ) , because log f ( x ) = n n
ij 2 Σ −1 2 α =1 ∂x f ( x)
∂σ
1
Theorem: Given a random sample x1 , x 2 , L , x α , L , x n from N p ( µ , Σ) , x = ∑ x α and
n n n α
n 1 1
or σ ij = ∑ ( xiα − µ i ) ( x jα − µ j ) or σˆ ij = ∑ ( xiα − µˆ i ) ( x jα − µˆ j ) A = ∑ ( x α − x ) ( x α − x )′ . Thus x and A are independently distributed.
2 2 α =1 n α =1
α
A Proof: Make an orthogonal transformation
Hence, Σ̂ = .
n
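The maximum likelihood estimates µ̂ = x̄ and Σ̂ = A/n can be computed as in the following sketch, an illustrative addition (data simulated for the example, Python/NumPy assumed); the unbiased estimator S = A/(n − 1) is shown for comparison.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 200
    mu = np.array([0.5, -1.5])
    Sigma = np.array([[1.0, 0.6],
                      [0.6, 2.0]])
    X = rng.multivariate_normal(mu, Sigma, size=n)

    mu_hat = X.mean(axis=0)                    # MLE of mu
    D = X - mu_hat
    A = D.T @ D
    Sigma_hat = A / n                          # MLE of Sigma (biased)
    S = A / (n - 1)                            # unbiased estimator, E(S) = Sigma

    print(mu_hat)
    print(Sigma_hat)
    print(S)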
y = c11 x1 + L + c1n x n
1
Theorem: Given x1 , x 2 , L , x α , L , x n be an independent random sample of size n(> p )
from N p ( µ , Σ) , then x ~ N ( µ , Σ / n) . M M M
y = c n −11 x1 + L + c n−1n x n
n −1
1  1
Proof: E ( x ) = E  ∑ x α  = E ( x + L + x ) = 1 (n µ ) = µ and
n  n 1 n
n 1 1
 α  y = x1 + L + xn
n n n

y   x1 
n
 1   = x1 x1′ + L + x n x n ′ = ∑ x i x i ′ .
 
or Y = C X , where Y = M , X =  M  , and i =1
  x 
y   n
 n Therefore,
 c11 L c1n 
c12 n n

 M MM M 
 ∑ y i y i′ − y n y n′ = ∑ x i x i ′ − y n y n′
i =1 i =1
C =  c n −11 c n−12 L c n−1n  is orthogonal.
  n−1 n
 1

1
L
1 

or ∑ y i y i′ = ∑ xi xi′ − nx x ′ = A (3.2)
 n n n  i =1 i =1
Since C is orthogonal In view of equations (3.1) and (3.2), x and A depends on two mutually exclusive sets,
n
1 n which are independently distributed, therefore, x and A are independently distributed.
∑ cik n
= 0, i = 1, 2, L , n − 1 ⇒ ∑ cik = 0
k =1 k =1 Test for µ , when Σ is known
Also
Given a random sample x1 , x 2 , L , x α , L , x n from N p ( µ , Σ) . The hypothesis of interest is
n
0, i ≠ j
∑ cik × c jk = 1, i = j for i, j = 1, 2, L, n . H 0 : µ = µ , where µ is a specified vector, then, under H 0 , the test statistic is
0 0
k =1 
n ( x − µ ) ′ Σ −1 ( x − µ ) ~ χ 2p .
Now consider, 0 0

1 Proof: Let C be a non-singular matrix such that


y = ( x1 + L + x n ) = n x (3.1)
n n Σ
C ′ Σ *−1C = I , and CC ′ = Σ* = . Make the transformation
So that n
E y = E n x = n µ , and (x − µ ) = C y ⇒ y = C −1 ( x − µ 0 ) , and
n 0
n n
E y = E ∑ cik x k = µ ∑ cik = 0 , for i = 1, 2, L , n − 1 E y = C −1 E ( x − µ ) = C −1 ( µ − µ ) = 0 , under H 0 .
0 0 0
i
k =1 k =1

Cov ( y , y ) = E ( y − E y ) ( y − E y ) ′
i j i i j j
Σ y = E ( y − E y ) ( y − E y )′ = C −1 E ( x − µ 0 ) ( x − µ 0 )′ C −1

= E [ci1 ( x1 − µ ) + L + cin ( x n − µ )][c j1 ( x1 − µ ) + L + c jn ( x n − µ )]′ ′ −1


= C −1Σ*C −1 = (C ′ Σ* C ) −1 = I .
= ci1 c j1 E ( x1 − µ ) ( x1 − µ )′ + L + cin c jn E ( x n − µ ) ( x n − µ )′ + 0
Therefore,
since x1 ,L , x n are independent y ~ N p (0, I ) , i.e. y i ~ N (0,1) , for all i = 1, 2,L , p .
n
0 , if i≠ j
= (ci1 c j1 + L + cin c jn ) Σ = Σ ∑ cik c jk =  Now
k =1 Σ , if i= j
n ( x − µ )′ Σ −1 ( x − µ ) = ( x − µ )′ (CC ′) −1 ( x − µ )
0 0 0 0
This implies y ,L , y are independent.
1 n
p
Now = [C −1 ( x − µ 0 )]′ [C −1 ( x − µ 0 )] = y ′ y = ∑ y i2 ~ χ 2p .
n n i =1
∑ y i y i ′ = ∑ (ci1 x1 + L + cin x n ) (ci1 x1 + L + cin x n )′ Let χ 2p (α ) be the number such that Pr [ χ 2p ≥ χ 2p (α )] = α , then for testing H 0 , we use the
i =1 i =1
n n n n critical region
= x1 x1′ ∑ ci21 + L + x1 x n ′ ∑ ci1 cin + L + x n x1′ ∑ cin ci1 + L + x n x n ′ ∑ cin
2
i =1 i =1 i =1 i =1 n ( x − µ )′ Σ −1 ( x − µ ) ≥ χ 2p (α ) .
0 0
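A minimal sketch of this test (an illustrative addition, with simulated data and arbitrary µ0 and Σ, assuming Python with NumPy and SciPy): compute n(x̄ − µ0)′ Σ⁻¹ (x̄ − µ0) and compare it with the upper α point of χ²_p.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    n, p, alpha = 40, 3, 0.05
    Sigma = np.array([[1.0, 0.3, 0.1],
                      [0.3, 1.0, 0.2],
                      [0.1, 0.2, 1.0]])           # assumed known
    mu_true = np.array([0.2, 0.0, -0.1])
    mu0 = np.zeros(p)                             # hypothesised mean under H0

    X = rng.multivariate_normal(mu_true, Sigma, size=n)
    xbar = X.mean(axis=0)
    d = xbar - mu0
    stat = n * d @ np.linalg.inv(Sigma) @ d       # n (xbar - mu0)' Sigma^{-1} (xbar - mu0)
    crit = stats.chi2.ppf(1 - alpha, df=p)        # chi-square_p(alpha) critical value

    print(stat, crit, stat >= crit)               # reject H0 when stat >= crit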

For computational purpose use ′


Σ y = E ( y − E y ) ( y − E y )′ = C −1 E ( x (1) − x ( 2) ) ( x (1) − x ( 2) ) ′ C −1
( x − µ ) = d , then solve
0
′ −1
Σ λ = d (by Doolittle method) = C −1 Σ* C −1 = (C ′ Σ * C ) −1 = I
Therefore,
⇒ λ = Σ −1 d and
y ~ N p (0, I ) , i.e. y i ~ N (0,1) , for all i = 1, 2,L , p .
n d ′ λ = n ( x − µ ) ′ Σ −1 ( x − µ ) .
0 0
Now
Note: If H 0 is not true, then
n1n2
( x (1) − x (2) )′ Σ −1 ( x (1) − x ( 2) ) = ( x (1) − x ( 2) ) ′ (CC ′) −1 ( x (1) − x ( 2) )
E y =C −1
E (x − µ ) = C −1
( µ − µ ) = δ (say), Σ y = I , and n1 + n2
0 0
p
p
= [C −1 ( x (1) − x (2) )]′ [C −1 ( x (1) − x (2) )] = y ′ y = ∑ y i2 ~ χ 2p .
n ( x − µ )′ Σ ( x − µ ) = ∑
−1
y i2 ~ χ2 , where
0 0 p, ∑ δ i2 i =1
i =1 i

p
Let χ 2p (α ) be the number such that Pr [ χ 2p ≥ χ 2p (α )] = α
∑ δ i2 = δ δ ′ = n (µ − µ 0 )′ Σ −1 ( µ − µ 0 ) = non-centrality parameter. Then for testing H 0 , we use the critical region
i =1
n1n2
The confidence region for µ is the set of possible values of µ satisfying ( x (1) − x ( 2) )′ Σ −1 ( x (1) − x (2) ) ≥ χ 2p (α ) .
n1 + n 2
n ( x − µ ) ′ Σ −1 ( x − µ ) ≤ χ 2p (α ) , this has confidence coefficient 1 − α .
Note
Two sample problem i) If H 0 is not true, then

Given x1(1) , x (21) , L , x α(1) , L , x (n1) be a random sample from N p ( µ (1) , Σ) and E y = C −1 E ( x (1) − x (2) ) = C −1 ( µ (1) − µ ( 2) ) = δ (say), and Σ y = I
1

x1( 2) , x (22) , L , x α(2) , L , x (n2) from N p ( µ (2) , Σ) . H 0 : µ (1) = µ ( 2) , then, in this case, the So each y i ~ N (δ i , 1) , then
2
statistic is p
n1n 2 ( x (1) − x ( 2) )′ Σ *−1 ( x (1) − x (2) ) = ∑ yi2 ~ χ 2 ,
( x (1) − x (2) ) ′ Σ −1 ( x (1) − x (2) ) ~ χ 2p . i =1
p, ∑ δ i2
n1 + n 2 i

Proof: We know that where


p ′
x (1) ~ N p ( µ (1) , Σ / n1 ) , and x ( 2) ~ N p ( µ (2) , Σ / n2 ) . Further, x (1) and x ( 2) are
∑ δ i2 = δ ′ δ = (µ (1) − µ (2) )′C −1 C −1 ( µ (1) − µ (2) )
independent and i =1

 1 1   n1n 2
x (1) − x (2) ~ N p  µ (1) − µ ( 2) ,  +  Σ  . = ( µ (1) − µ ( 2) ) ′ Σ −1 ( µ (1) − µ (2) )
 n
 1 n 2   n1 + n 2

Make the transformation (nonsingular) n1n2


= ∆ = non-centrality parameter.
n1 + n2
( x (1) − x ( 2) ) = C y ⇒ y = C −1 ( x (1) − x (2) )
where ∆ is the measure of distance between two populations defined by Mahalanobis.
Since C is a non-singular matrix such that
The confidence region for υ = µ (1) − µ ( 2) is the set of possible values of υ satisfying the
*−1 n + n2 *
C′Σ C = I , where Σ = 1 Σ = CC ′ , and inequality
n1n2
n1n 2
( x (1) − x ( 2) − υ )′ Σ −1 ( x (1) − x ( 2) − υ ) ≤ χ 2p (α ) .
E y = C −1 E ( x (1) − x (2) ) = 0 , under H 0 . n1 + n 2

ii) If the two populations have known covariance matrix Σ1 and Σ 2 , then 1  1 
= exp  − n ( x − µ ) ′ Σ −1 ( x − µ )
n/2  2 
* −1 Σ Σ (2π ) np / 2 Σ
( x (1) − x ( 2) )′ Σ ( x (1) − x ( 2) ) ~ χ 2p , where Σ* = 1 + 2 .
n1 n2
1  1 
× exp − tr Σ −1 A .
n/2  2 
Result: (2π ) np / 2 Σ
n
1
If t is a sufficient statistic for θ if ∏ f ( xα ; θ ) = g (t ; θ ) h( x1 ,L , x n ) , where f ( x α ; θ ) is Thus, x and
n
A form a sufficient set of statistics for µ and Σ . If Σ is known, x is a
α =1
the density of the α − th observation, g (t ; θ ) is the density of t and h( x1 , L , x n ) does not 1
sufficient statistic for µ . However, if µ is known A is not a sufficient statistic for Σ , but
depend on θ . n
1
Sufficient Statistics for µ and Σ
∑ ( x − µ ) ( xα − µ ) is a sufficient statistic for Σ .
n α α

The joint density function of x1 , L , x n with x α ~ N p ( µ , Σ) is

1  1 n 
exp − ∑ ( x α − µ ) ′ Σ −1 ( x α − µ ) .
np / 2 n/2
(2π ) Σ  2 α =1 
Consider,
n n n
∑ ( xα − µ )′ Σ −1 ( x α − µ ) = tr ∑ ( xα − µ )′ Σ −1 ( x α − µ ) = tr ∑ Σ −1 ( xα − µ ) ′ ( xα − µ)
α =1 α =1 α =1
n
= trΣ −1 ∑ ( xα − µ ) ′ ( xα − µ ) .
α =1
We can write
n n
∑ ( xα − µ )′ ( x α − µ ) = ∑ [( xα − x ) + ( x − µ )] [( x α − x ) + ( x − µ )]′
α =1 α =1
n
= ∑ [( xα − x ) ( x α − x )′ + ( x α − x ) ( x − µ )′ +( x − µ ) ( x α − x )′ + ( x − µ ) ( x − µ )′]
α =1
n   ′
= ∑ ( xα − x ) ( xα − x )′ + ∑ ( xα − x ) ( x − µ )′ +( x − µ ) ∑ ( xα − x )
α =1 α  α
+ n ( x − µ ) ( x − µ )′

= A + n ( x − µ ) ( x − µ )′ , because ∑ ( xα − x ) = ∑ xα − n x = 0 .
α α
Thus the density of x1 , L , x n can be written as
1  1 
exp  − tr Σ −1 [ A + n ( x − µ )( x − µ )′]
np / 2 n/2  2 
(2π ) Σ

1  1 
= exp − [n ( x − µ ) ′ Σ −1 ( x − µ ) + tr Σ −1 A]
np / 2 n/2  2 
(2π ) Σ

WISHART DISTRIBUTION M M M M M M M …
b p1 0 0 0 0 0 0 L b11
If x1 , x 2 , L , x n are independent observations from N ( µ , σ 2 ) , it is well known
b p2 0 0 0 0 0 0 L 0 b22
(n − 1) s^2 = Σ_{i=1}^{n} (x_i − x̄)^2 ~ σ^2 χ^2_{n−1}. The multivariate analogue of (n − 1) s^2 is the matrix A
b pp 0 0 0 0 0 0 L 0 0 L 2b pp
and is called Wishart matrix. In other words the Wishart matrix is defined as the p × p
symmetric matrix of sums of squares and cross products (of deviations about the mean) of the
p
sample observations, from a p − variate nonsingular normal distribution. The distribution of ∂A

p p
= 2 p b11 p −( p −1)
b22 L b pp = 2 p ∏ biip −(i−1) .
A when the multivariate distribution is assumed normal is called Wishart distribution and is ∂B i =1
a generalization of χ 2 distribution in the univariate case.
Theorem: Let x1 , x 2 , K , x n be a random sample from N p (µ , I ) ,
p ( p + 1)
By definition of A , we mean the joint distribution of the distinct elements aij , n n −1
2
(i, j = 1, 2, L , p ; i ≤ j ) of the symmetric matrix A .
A= ∑ ( xα − x ) ( xα − x ) ′ = ∑ Z α Z ′α , where Z α are independent, each distributed
α =1 α =1
n−1
Results:
1) Given a positive definite symmetric matrix A , there exists a nonsingular triangular
according to N p (0, I ) . Then the density of A = ∑ Z α Z ′α is
α =1
matrix B such that A = BB ′ .
(υ − p −1) / 2
p A exp (− 12 tr A)
∂A , where υ = n − 1 .
2) The Jacobian of transformation for the distinct element of B is = 2 p ∏ (bii ) p −(i −1) . p
∂B 2υ p / 2 π p ( p −1) / 4
i =1 ∏ Γ(υ − i + 1) / 2
Proof: The equation A = BB ′ can be written as i =1
Proof: Consider a nonsingular triangular matrix B such that
 a11 a12 L a1 p   b11 0 L 0   b11 b12 L b1 p 
     A = BB ′ (4.1)
 a 21 a 22 L a 2 p   b21 b22 L 0   0 b22 L b2 p 
 M =
M M M   M M M M  M M M M  Let
    
 a p1 a p 2 L a pp   b p1 b p 2   0 L b pp 
0 L 0   B 1 
 L b pp   0 ′
 b11  b11 0 L 0 
    
 b2 L   b21 b22 L 0  B ′ 2   b21 b22 L 0 
 11 b11b21 b11b p1  B = =   , and Brr =  , so that
b b M M M M   M  M M M M 
2 2
b21 + b22 L b21b p1 + b22 b p 2       
= 21 11   b p1 b p 2 L b pp   ′  b 
 M M M M    B p   r1 br 2 L brr 
2 
 b p1b11 b p1b21 + b p 2 b22 L b 2p1 + b 2p 2 + L + b pp
  B pp = B .
∂A ∂ (a11 , a 21 a 22 , a31 a32 a33 , L , a p1 a p 2 L a pp ) Let f (B) = joint density function of nonzero elements of B , then we can write
=
∂B ∂ (b11 , b21 b22 , L , b p1 b p 2 L b pp )
p −1
f ( B ) = f ( B ′1 ) f ( B ′ 2 B11 ) L f ( B ′ p B p−1 p −1 ) = ∏ f ( B ′ i +1 Bii ) (4.2)
∂ a11 a 21 a 22 a31 a 32 a33 … a p1 a p 2 L a pp
i =0
b11 2b11 … Let
b21 0 b11 … B ′. r = (br1 br 2 L br r −1 ) , so that
b22 0 0 2b22 …
b31 0 0 0 b11 … B ′ r = ( B ′. r brr 0 L 0)
b32 0 0 0 0 b22 … The equation (4.1) can be explained as
b33 0 0 0 0 0 2b33 …

 a11 L a1i +1 L a1 p   b11 L 0 L 0   b11 L bi +11 L b p1  = (bi2+11 + bi2+12 + L + bi2+1i+1 ) − (bi2+11 + bi2+12 + L + bi2+1i ) ,since A = BB ′
    
 M M M M M   M M M M M  M M M M M 
a L a L a = b L b L 0   0 L bi+1i L b pi  = bi2+1i +1 ~ χυ2−i (4.8)
 i1 ii +1 ip   i1 ii  
 M M M M M   M M M M M  M M M M M 
a   L b pp   0 L L b pp  As reg SS and error SS are independent, so that bi +1i+1 is independent of B . i+1 .
 p1 L a pi +1 L a pp   b p1 L b pi 0

Let Therefore,

 a1i +1  f ( B i+1 Bii ) = f ( B . i +1 Bii ) f (bi +1 i +1 Bii ) , because B ′ r = ( B ′. r brr 0 L 0)


 
 a 2 i +1 
a . i+1 =  , then a . i+1 = Bii B . i +1 1  1 
M  = exp − (b12i+1 + b22i +1 + L + bi2i+1 )
  (2π ) i/2  2 
 a i i +1 
 
1  1 
exp − (bi2+1i +1 ) (bi +1i+1 )υ −i−1
= Bii Bii′ ( Bii′ ) −1 B .i +1 (4.3) 2 (υ −i −2) / 2 Γ(υ − i) / 2  2 
= Aii C (4.4) 1  1 
because, f ( χ ) = exp − χ 2  ( χ )υ −1
where 2 (υ −2) / 2 Γυ / 2  2 

C = ( Bii′ ) −1 B . i +1 , and Aii = Bii Bii′ (4.5)  1 i 



exp −  bi2+1i +1 + ∑ b 2j i+1  (bi +1i +1 )υ −i −1
1
=
π i / 2 2 (υ −2) / 2 Γ(υ − i) / 2  
Equation (4.4) can be identified to be the normal equation for estimating the regression  2  j =1 
coefficient of Z i +1 on Z1 , Z 2 , L , Z i , thus C is the vector of regression coefficient.
and
∴ C ~ Ni ( E C , Σ C )
p −1
Since Z1 , Z 2 ,L , Z i +1 are independent, it follows that p −1 ∏ (bi+1i +1 )υ −i −1  1 p −1 i+1 
f ( B) = ∏ f ( B i +1 Bii ) = i =0
exp − ∑ ∑ b 2j i +1 
EC =0, ΣC = Aii−1 , because Q = S β , then E βˆ = β and Σ βˆ = σ S 2 −1
i =0
p−1
 2 i =0 j =1 
2 p (υ −2) / 2 π p ( p−1) / 4 ∏ Γ(υ − i) / 2
Now i =0

B .i +1 = Bii′ C , from equation (4.5), since B′ii is fixed with respect to Z i +1 , then p −1
because ∏ π i / 2 = π [1+2+L+( p−1)] / 2 = π p( p−1) / 4 .
E B . i +1 = Bii′ E C = 0 , Σ B.i +1 = Bii′ Aii−1 Bii = Bii′ ( Bii Bii′ ) −1
Bii = I . i =0

p
Further, B . i +1 being a linear combination of regression coefficients which are themselves
∏ (bii )υ −i  1 p i 
normally distributed, we conclude that f ( B) = i =1
exp − ∑ ∑ b 2j i 
p
 2 i =1 j =1 
B . i +1 ~ N i (0, I ) (4.6) 2 p (υi −2) / 2 π p ( p−1) / 4 ∏ Γ(υ − i + 1) / 2
i =1
Re g SS = C ′ a .i +1 , because if the normal equation Q = S β , then Re g SS is β ′ Q
Now
= [( Bii′ ) −1
B .i +1 ]′ Bii B . i+1 = B ′. i +1 ( Bii ) −1 Bii B .i +1 = B ′. i +1 B . i+1 ∂B 1 1
= = .
∂A ∂A p
p p −i +1
= bi2+11 + bi2+12 + L + bi2+1i ~ χ i2 (4.7) ∂B 2 ∏ (bii )
i =1
Error SS = Total SS − Re g SS
p p  i 
= ai +1i +1 − (bi2+11 + bi2+12 + L + bi2+1i ) Also, tr A = the sum of the diagonal elements of A = ∑ Aii = ∑  ∑ b 2j i  .
 
i =1 i =1  j =1 

Therefore, and
p (υ − p −1) / 2
∏ (bii )υ − p−1 A*
*  1 
f ( B) i =1  1  f (A ) = exp − tr A* 
f ( A) = = exp − tr A p  2 
p p  2  2υ p / 2 π p( p −1) / 4 ∏ Γ(υ − i + 1) / 2
2 p ∏ (bii ) p −i +1 2υ p / 2 π p ( p −1) / 4 ∏ Γ(υ − i + 1) / 2 i =1
i =1 i =1
*
∂A
Consider, f ( A) = f [ A* ( A)] .
∂A
p
υ − p −1
∏ (bii )υ − p−1 = (b11 b22 L b pp )υ − p−1 = B Now
i =1
∂A* ∂ BAB p +1
p = = B , and tr A* = tr BAB ′ = tr AB ′B = tr AΣ −1
2 1/ 2 (υ − p −1) / 2 ∂A ∂A
A = B B′ = B ⇒ B = A ⇒ ∏ (bii )υ − p−1 = A
2
i =1 From BΣB ′ = I , we have 1 = BΣB ′ = B Σ B ′ = Σ BB ′ = Σ B
(υ − p −1) / 2
A  1  and
f ( A) = exp − tr A .
p  2  1 1 A
2υp / 2 π p ( p −1) / 4 ∏ Γ(υ − i + 1) / 2 BB ′ = , B = and A* = BAB ′ = .
Σ 1/ 2 Σ
i =1 Σ
This is the form of the Wishart distribution when Σ = I and is denoted by W p (υ , I ) . Therefore,
(υ − p −1) / 2
Theorem: Suppose the p − component vectors x1 , x 2 ,L , x n (n > p) are independent, A exp (− 12 tr A Σ −1 ) 1
each distributed according to N p ( µ , Σ) , then the density of f ( A) =
p ( p +1) / 2
(υ − p −1) / 2 Σ
n n −1 2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ(υ − i + 1) / 2
A= ∑ ( xα − x ) ( x α − x ) ′ = ∑ Z α Z ′α , where, Z α ~ N p (0, Σ) is i =1
α =1 α =1 (υ − p−1) / 2
A exp (− 12 tr A Σ −1 )
A
(υ − p −1) / 2
exp (− 1 tr A Σ −1 ) = , ⇒ A ~ W p (υ , Σ) .
2 p
, where υ = n − 1 . υ/2
p 2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ(υ − i + 1) / 2
υ/2
2υp / 2 π p ( p −1) / 4 Σ ∏ Γ(υ − i + 1) / 2 i =1
i =1 Theorem: Let x1 , x 2 ,L , x n (n ≥ p + 1) are distributed independently, each according to
Proof: Since Σ is symmetric and positive definite there exist a nonsingular triangular
1 n
matrix B such that N p ( µ , Σ) . Then the distribution of S = ∑ ( x − x ) ( xα − x )′ is W p (υ , Σ / υ ) , where
n − 1 α =1 α
BΣB ′ = I ⇒ B ′B = Σ −1 . υ = n −1.
Make the transformation Proof: Clearly,
Z *α = B Z α , α = 1, 2,L , n − 1 . A n
 1  1 

1
S= = ∑  Z   Zα  , where Z α ~ N p (0, Σ / n − 1) .
n − 1 α =1 n − 1 α
*
E (Z α ) = B E (Z α ) = 0 , Σ Z * = B E ( Z α Z ′α ) B ′ = B Σ B ′ = I .  n −1  n −1
α
Since Σ is symmetric and positive definite there exist a nonsingular triangular matrix B such
Thus, that
n−1 ′  n −1  −1
= B  ∑ Z α Z ′α  B ′ = B A B ′ = A* BΣ * B ′ = I , where Σ* = Σ / n − 1 and B ′B = Σ* .
Z *α ~ N p (0, I ) , and ∑ Z α* Z α*  
α =1  α =1  Make the transformation

⇒ A* ~ W p (υ , I )  1 
Z *α = B Zα  , α = 1, 2,L , n − 1 .
 n −1 

*  1  1 
′ We know that the characteristic function of a vector x ′ = ( x1 , x 2 , L , x p ) is defined as
E (Z α )=0, Σ = B E  Z α   Zα  B ′ = B Σ* B ′ = I .
Z α*
 n −1  n −1  ′
Ee it x , where t ′ x = t1 x1 + t 2 x 2 + L + t p x p (4.10)
Thus,
n−1 ′  n−1 1  1 
′ In view of (4.10), we can write (4.9) as E (e i tr AΘ ) .
Z *α ~ N p (0, I ) , and ∑ Z α* Z α* = B  ∑  n − 1 Z α   Zα   B′ = B S B′ = S *

α =1 α =1  n −1   Theorem: If Z 1 , Z 2 , L , Z n−1 are independent, each with distribution N p (0, Σ) , the
n−1
Hence, characteristic function of A11 , A22 , L , A pp , 2 A12 , L , 2 A p −1 p , where A = ( Aij ) = ∑Zα Zα′
(υ − p −1) / 2 α =1
S*
 1  is given by
S * ~ W p (υ , I ) , and f (S * ) = exp − tr S * 
p  2  −υ / 2
2υ p / 2 π p ( p −1) / 4 ∏ Γ (υ − i + 1) / 2 φ A (Θ) = E e i tr AΘ = I − 2 i ΘΣ , where n − 1 = υ .
i =1
Proof: We have
* ∂S * υ υ
f ( S ) = f [S ( S )] . i tr ∑ Z α Z α ′Θ i tr ∑ Z α ′Θ Z α
∂S
φ A (Θ) = E e α =1 = Ee α =1 , by virtue of the fact that
Now
−1 tr EFG = ∑ eij f jk g ki = tr FGE
∂S * ∂ BSB ′ p +1  Σ 
= = B , tr S * = tr BSB ′ = tr S B ′B = tr S  
∂S ∂S  n − 1 υ

i ∑ Z α Θ Zα υ
i Z α ′Θ Z α
*
From BΣ B ′ = I , we have 1 = BΣ B ′ = B Σ * *
B′ = Σ *
BB ′ = Σ *
B
2
, thus, = Ee α =1 = ∏Ee , as Z α ' s are identical distributed random
α =1
S S vector.
1 1
BB ′ = , B = , and S * = BSB ′ = = .
1/ 2 Σ i Z ′Θ Z υ
Σ* Σ* Σ* = E (e ) , where Z ~ N p (0, Σ) (4.11)
n −1
Since Θ is real and Σ is positive definite, there exist a real nonsingular matrix C such that
Therefore,
 C ′ Σ −1C = I , and C ′ΘC = D , where D is a real diagonal matrix, diagonal elements of D is
 Σ  
−1
(υ − p −1) / 2
S exp  − 1 tr S    d jj .
2  n − 1  

f (S ) = . Consider the transformation
υ/2 p
Σ
2υ p / 2 π p ( p −1) / 4 ∏ Γ(υ − i + 1) / 2
n −1 Z = CY , then EY = 0 , and ΣY = C −1 E Z Z ′ (C −1 )′ = (C ′ Σ −1C ) −1 = I
i =1
This is the density of W p (υ , Σ / υ ) . ⇒ Y ~ N p (0, I ) i.e. Y j , j = 1, 2,L , p are independently distributed as N (0, 1) , therefore,
p
Characteristic function i ∑ d jj y 2j
i Z ′ ΘZ i Y ′C ′ Θ C Y i Y ′ DY j =1
Consider Ee = Ee = Ee = Ee
 A11 A12  p
i d jj y 2j
p p
A =   , the elements of the matrix are A11 , A22 , 2 A12 , because A12 = A21 . = ∏ ( ch. function of χ12 ) = ∏ (1 − 2 i d jj ) −1/ 2
 A21 A22  = ∏Ee
j =1 j =1 j =1
Introduce a real matrix Θ = (θ ij ) , with θ ij = θ ji
−1 / 2
= I − 2i D , as ( I − 2 i D ) is a diagonal matrix. (4.12)
 θ11 θ12  A θ + A12θ 21 A11θ12 + A12θ 22 
Θ =   then, AΘ =  11 11 
θ
 21 θ 22   A21θ11 + A22θ 21 A21θ12 + A22θ 22  Moreover,
2
tr AΘ = A11θ11 + A12θ 21 + A21θ12 + A22θ 22 = A11θ11 + A22θ 22 + 2 A12θ12 (4.9) I − 2 i D = C ′ Σ −1 C − 2 i C ′ Θ C = C ′ (Σ −1 − 2 i Θ) C = C Σ −1 − 2 i Θ .

But Similarly, the characteristic function of A2 will be

C ′ Σ −1 C = I or C ′ Σ −1C = 1 or C′ Σ −1
C = 1 or C′ C Σ −1
=1 φ A2 (Θ) = I − 2 i ΘΣ
−υ2 / 2

2 2 1 Since A1 and A2 are independently distributed, so


C Σ −1 = 1 or C = = Σ
Σ −1 −(υ1 +υ2 ) / 2
φ A1+ A2 (Θ) = φ A1 (Θ) φ A2 (Θ) = I − 2 i ΘΣ
Hence,
But this is the characteristic function of W p (υ1 + υ 2 , Σ) , therefore,
I − 2 i D = Σ −1 − 2 i Θ Σ = I − 2 i Θ Σ (2.13)
A1 + A2 ~ W p (υ1 + υ 2 , Σ) .
Using equations (4.12) and (4.13) in (4.11) gives, the characteristic function of
−υ / 2 Alternative Solution
A11 , A22 , L , A pp , 2 A12 , L , 2 A p −1 p , as φ A (Θ) = I − 2 i ΘΣ .
n1 −1
Alternative proof A1 = ∑ Z α Z α ′ = Z 1 Z 1′ + L + Z υ1 Z υ1 ′ , where Z α ~ N p (0, Σ) , υ1 = n1 − 1 , and
α =1
By definition,
n2 −1
e i tr ΘA A
(υ − p −1) / 2
exp (− 1 tr A Σ −1 ) A2 = ∑ Y α Y α ′ = Y 1Y 1′ + L + Y υ2 Y ′υ2 , where Y α ~ N p (0, Σ) , υ 2 = n2 − 1
φ A (Θ) = Ee i tr ΘA = ∫ 2
dA α =1
p
υ /2
2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ(υ − i + 1) / 2 Y α and Z α are independent, since A1 and A2 are independent, therefore,
i =1
1 −1
A1 + A2 = Z 1 Z 1′ + L + Z υ Z υ ′ + Y 1Y 1′ + L + Y υ Y ′υ2
1 1 2
K (υ − p −1) / 2 − 2 tr (Σ −2 iΘ) A
=
υ/2 ∫ A e dA υ1 +υ2
Σ * *′
= ∑ Zα *
Z α , each Z α ~ N p (0, Σ) .
υ /2 α =1
K Σ −1 − 2 iΘ 1 −1
(υ − p −1) / 2 − 2 tr (Σ −2 iΘ) A Hence,
=
υ/2 −1
υ/2 ∫ A e dA
Σ Σ − 2 iΘ A1 + A2 ~ W p (υ1 + υ 2 , Σ) .

1 −1 Theorem: If A j , j = 1, 2,L , q are independently distributed according to W p (υ j , Σ) , then


1 υ /2 (υ − p −1) / 2 − 2 tr AΣ
= , since K Σ −1 ∫ A e dA = 1 q  q 
υ/2
Σ
υ/2
Σ −1 − 2 iΘ A = ∑ A j ~ W p  ∑υ j , Σ  .
 
j =1  j =1 
1 −υ / 2
= = I − 2 iΘΣ . Proof: We know that the characteristic function of A j is
υ/2
I − 2 iΘΣ
q
−υ j / 2
This shows that, if A ~ W p (υ , Σ) , and then the characteristic function of A is φ A j (Θ) = I − 2 i ΘΣ . Now consider the characteristic function of A = ∑ A j
−υ / 2 j =1
I − 2 iΘΣ .
φ A (Θ) = φ ∑ A j (Θ) = ∏ φ A j (Θ) , since A j ' s are independently distributed
Properties of Wishart distribution j j

Theorem: Suppose Ai (i = 1, 2) are distributed independently according to W p (υ i , Σ) q


= I − 2 i ΘΣ ∑j j .
−υ j / 2 − υ /2
= ∏ I − 2 i ΘΣ
respectively, then A1 + A2 ~ W p (υ1 + υ 2 , Σ) .
j =1
Proof: We know that the characteristic function of A1 , if A1 ~ W p (υ1 , Σ) , is
 
Which is the characteristic function of W p  ∑ υ j , Σ  .
φ A1 (Θ) = I − 2 i ΘΣ
−υ1 / 2
.  
 j 

Therefore, Then Z α(1) are independent, each with distribution N q (0, Σ11 ) , because Z α(1) and Z α( 2) are
q  q  independent, so that Σ12 = Σ 21 = 0 , and
A = ∑ A j ~ W p  ∑υ j , Σ  .
  n−1
j =1  j =1  ′
A11 = ∑ Z α(1) Z α(1) , ⇒ A11 ~ Wq (υ , Σ11 ) .
Theorem: If A ~ W p (n − 1, Σ) , then the distribution of l ′ A l ~ (l ′ Σ l ) χ n2−1 , where l is a α =1

known vector. Similarly,


n−1 A22 ~ W p −q (υ , Σ 22 ) .
Proof: Given A ~ W p (n − 1, Σ) , then A = ∑ Z α Z α ′ , where Z α ~ N p (0, Σ) , and
α =1 Theorem: Let A and Σ be partitioned into p1 , p 2 , L p q rows and columns
n−1 n−1 n −1 ( p1 + p 2 + L + p q = p )
l ′ Al = ∑ l ′ Z α Z α ′ l = ∑ (l ′ Z α ) (l ′ Z α ) ′ = ∑ Vα2 , where Vα = l ′ Z α is N (0, l ′ Σ l ) .
α =1 α =1 α =1  A11 L A1q   Σ11 L Σ1q 
   
Therefore, A= M M M , Σ= M M M 
 Aq1 L Aqq   Σ q1 L Σ qq 
l ′ A l ~ (l ′ Σ l ) χ n2−1 .    
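This characterisation of l′Al can be verified by simulation: building A from n − 1 independent N_p(0, Σ) vectors, the ratio l′Al / (l′Σl) should behave like a χ²_{n−1} variate. The sketch below is an illustrative addition (Python with NumPy/SciPy assumed) with arbitrary parameter values.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    p, n, reps = 3, 12, 20_000
    Sigma = np.array([[1.0, 0.5, 0.2],
                      [0.5, 2.0, 0.3],
                      [0.2, 0.3, 1.0]])
    l = np.array([1.0, -1.0, 2.0])                  # arbitrary known vector
    scale = l @ Sigma @ l

    vals = np.empty(reps)
    for r in range(reps):
        Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n - 1)
        A = Z.T @ Z                                 # A ~ W_p(n-1, Sigma)
        vals[r] = l @ A @ l / scale                 # should be chi-square with n-1 d.f.

    print(vals.mean(), n - 1)                       # mean of chi-square_{n-1}
    print(stats.kstest(vals, 'chi2', args=(n - 1,)).pvalue)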

Theorem: If A ~ W p (n − 1, Σ) , and if a is a positive constant, then aA ~ W p (n − 1, aΣ) . If Σ ij = 0 for i ≠ j and if A is distributed according to W p (υ , Σ) , then A11 , A22 , L , Aqq
are independently distributed and A jj is distributed according to W (υ , Σ jj ) .
Proof: Since A ~ W p (n − 1, Σ) , there are n − 1 independent p − component random vectors
n−1
Z 1 , L , Z n−1 each distributed as N p (0, Σ) , such that
Proof: Let A = ∑ Z α Z α ′ , where Z α are independent, each with distribution N p (0, Σ) .
n−1 α =1
A= ∑ Z α Z α ′ . Evidently, we have Let us partition Z α as
α =1
 Z (1) 
n −1  α 
aA = ∑( a Zα )( a Zα′) and a Z 1 , L , a Z n−1 are independently identically Z α =  M  . Since Σ ij = 0 for i ≠ j , so Z α(i) and Z α( j ) are independent, then
α =1
 
 Z (q ) 
distributed as normal N p (0, aΣ) . Thus  α 
n−1 ′ n−1 ′
aA ~ W p (n − 1, aΣ) . Aii = ∑ Z α(i ) Z α(i ) is independent of A jj = ∑ Z α( j ) Z α( j ) , where
α =1 α =1
Theorem: Let A and Σ be partitioned into q and p − q rows and columns as
Z α(i ) ~ N p (0, Σ ii ) , and Z α( j ) ~ N p (0, Σ jj ) , thus, Z α(1) , Z α( 2) ,L , Z α(q ) are independent.
 A11 A12  Σ Σ12 
A =   , Σ =  11 
 A21 A22   Σ 21 Σ 22  Hence,
If A is distributed according to W p (υ , Σ) , then A11 is distributed according to Wq (υ , Σ11 ) n−1 ′
(the marginal distribution of some sets of elements of A ).
Aii = ∑ Z α(i ) Z α(i ) , i = 1, 2, L, q are independently distributed.
α =1
Proof: Given A ~ W p (υ , Σ) , where Theorem: Let A and Σ be partitioned into q and p − q rows and columns
n−1
 A11 A12  Σ Σ12 
A= ∑ Z α Z α ′ , and Z α are independent, each with distribution N p (0, Σ) . A =   , Σ =  11
A22 

α =1  A21  Σ 21 Σ 22 
Partition Z α into sub vectors of q and p − q components as −1
If A is distributed according to W p (υ , Σ) , then A11 − A12 A22 A21 is distributed according to
 Z (1)  −1
Wq (υ − ( p − q), Σ11 − Σ12 Σ 22 Σ 21 ) (the conditional distribution of some sets of elements of
Zα = 
α 
.
 (2)  A ).
Z
 α 

Theorem: Suppose A is distributed according to W p (υ , Σ) and let A* = BAB ′ , where B ∂ B* ∂ B* p +1


f ( B) = f ( B * ( B)) , where = D , tr B* = tr DBD ′ = tr BD ′D = tr BΦ −1 .
∂B ∂B
is a matrix of order q × p , then A* ~ Wq (υ , Φ ) , where Φ = B ΣB ′ .
1 1 B
Proof: We have From DΦD ′ = I ⇒ DD ′ = ⇒ D = , and B * = DBD ′ = .
Φ 1/ 2 Φ
n−1
Φ
A= ∑ Z α Z α ′ , where Z α ~ N p (0, Σ) Therefore,
α =1
(υ − p−1) / 2  1 
n−1 n−1 B exp  − trBΦ −1 
BAB ′ = ∑ B Z α Z α ′ B′ = ∑ ( B Z α ) ( B Z α )′ f ( B) =  2 
p
1
( p +1) / 2
α =1 α =1 (υ − p−1) / 2 Φ
2υ p / 2 π p ( p−1) / 4 Φ ∏ Γ(υ − i + 1) / 2
n −1
i =1
= ∑ V α V α ′ , where V α = B Z α ~ N q (0, BΣB ′) .
α =1 (υ − p−1) / 2  1 
B exp  − tr B Φ −1 
Therefore, =  2  .
p
υ/2
n−1
2υ p / 2 π p ( p −1) / 4 Φ ∏ Γ(υ − i + 1) / 2
A* = BAB ′ = ∑V α V α ′ ~ Wq (υ , BΣB ′) .
i =1
α =1
This is the density of W p (υ , Φ) , i.e. B ~ W p (υ , Φ) .
Theorem: Suppose A ~ W p (υ , Σ) and let A = CBC ′ , where C is a nonsingular matrix of
Exercise: Let x1 , x 2 , L , x n be a random sample from N p ( µ , Σ) , then
order p , then B ~ W p (υ , Φ ) , where Φ = (C ′ Σ −1C ) −1 .
n ( x − µ ) ( x − µ )′ ~ W p (1, Σ) .
n−1
Proof: We have A = ∑ Z α Z α ′ , where Z α ~ N p (0, Σ) . Solution: Let Y = n ( x − µ ) , then E Y = n E ( x − µ ) = 0 , and ΣY = E Y Y ′ = n Σ / n = Σ .
α =1
n−1
Let
′ ′
We know that, if Y α ~ N p (0, Σ) , and A = ∑ Y α Y α ′ , then A ~ W p (n − 1, Σ) , therefore,
α =1
Yα = C −1
Z α , then E Y α = 0 , ΣY = C −1 E Z α Z α ′C −1 = C −1 Σ C −1 = Φ , and
α
Y Y ′ = n ( x − µ ) ( x − µ ) ′ ~ W p (1, Σ) .
n−1  n −1  ′ ′
∑ Y α Y α ′ = C −1  ∑ Z α Z α ′  C −1 = C −1 A C −1 = B Exercise: Show that wishart distribution is a generalization of the chi-square distribution.
α =1  α =1 
Solution: We have A ~ W p (υ , Σ) , the characteristic function of A is
Let D be nonsingular triangular matrix such that DΦD ′ = I , and D ′D = Φ −1 .
−υ / 2
φ A (Θ ) = I − 2 i Θ Σ .
Make the transformation
*
Yα = DY α , then E Y *α = 0 , ΣY * = D EY α Y α ′ D ′ = DΦD ′ = I *
⇒ Yα ~ N p (0, I ) , For p = 1 , A = a11 , and Σ = σ 11 = σ 2 , Θ = θ11 = θ .
α
−υ / 2
and φ a (θ ) = 1 − 2 i θ σ 2 , which is the characteristic function of σ 2 χυ2 , therefore,
′  
∑ Y α* Y α* = D  ∑ Y α Y α ′  D ′ = DBD ′ = B * ⇒ B * ~ W p (υ , I ) . a11
  a11 ~ σ 2 χυ2 or ~ χυ2 .
α α  σ2
Thus,
Moments of the elements of the matrix A
(υ − p −1) / 2  1 
B* exp  − trB *  Let Z 1 , Z 2 ,L, Z υ are independent, each with distribution N p (0, Σ) , then
f (B* ) =  2  .
p n−1
2υ p / 2 π p ( p −1) / 4 ∏ Γ(υ − i + 1) / 2 Aij = ∑ Z iα Z jα .
i =1 α =1

υ Make the transformation


E ( Aij ) = ∑ E ( Z iα Z jα ) , where n − 1 = υ
α =1 Z *α = C Z α , then, Z α* ~ N p (0, I ) , are independent, and
υ n−1 ′  
= ∑ σ ij = υ σ ij (we have infact prove that E ( A / n − 1) = Σ )
∑ Z α* Z α* = C  ∑ Z α Z α ′  C ′ = C A C ′ = B (say)
 
α =1 α =1 α 
 υ   υ 
 We have
E ( Aij Akl ) = E  ∑ Z iα Z jα  ∑ Z kβ Z lβ
   1
 α =1   β =1  C AC′ = B or A = B . But CΣC ′ = I , then
2
υ υ C
= E ∑ ( Z iα Z jα Z kα Z lα ) + E ∑ ( Z iα Z jα Z kβ Z lβ ) , α ≠ β 1
2
α =1 α , β =1 C Σ C ′ = 1 , or C = ⇒ A = B Σ .
Σ
υ υ
= ∑ (Fourth moment of a vector) + ∑ E ( Z iα Z jα ) E ( Z kβ Z lβ ) , α ≠ β Let
α =1 α , β =1
 b11 b12 L b1 p 
 
= υ (σ ij σ kl + σ ik σ jl + σ il σ jk ) + υ (υ − 1)σ ij σ kl .  b21 b22 L b2 p 
B =
M M M M 
Therefore,  
 b p1 b p 2 L b pp 
Cov ( Aij , Akl ) = E[ Aij − E ( Aij )][ Akl − E ( Akl )] = E ( Aij − υ σ ij ) ( Akl − υ σ kl )  
b b ′ (i) 
= E ( Aij Akl ) − υ σ kl E ( Aij ) − υ σ ij E ( Akl ) + υ 2 σ ij σ kl and Bii =  ii , where b ′ (i ) = (bi i +1 , bi i + 2 , L , bi p ) .
b Bi +1i +1 
 (i ) 
= υ 2σ ij σ kl + υ σ ik σ jl + υ σ il σ jk − υ 2σ ij σ kl − υ 2σ ij σ kl + υ 2 σ ij σ kl
So that
= υ ( σ ik σ jl + σ il σ jk ) B11 = B and B pp = b pp , and
If i = k and j = l , we obtain the variance of Aij
bii .i +1,L, p = bii − b ′ (i ) Bi−+11i +1 b (i )
Var ( Aij ) = E [ Aij − E ( Aij )]2 = υ ( σ ii σ jj + σ ij2 ) .
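The first two moments of the elements of A can be checked by repeatedly generating Wishart matrices as sums of υ outer products Z_α Z_α′. The sketch below is an illustrative addition (Python/NumPy assumed) with arbitrary Σ and υ.

    import numpy as np

    rng = np.random.default_rng(10)
    p, nu, reps = 2, 10, 50_000
    Sigma = np.array([[1.0, 0.7],
                      [0.7, 2.0]])
    i, j = 0, 1

    a_ij = np.empty(reps)
    for r in range(reps):
        Z = rng.multivariate_normal(np.zeros(p), Sigma, size=nu)
        A = Z.T @ Z                                  # A ~ W_p(nu, Sigma)
        a_ij[r] = A[i, j]

    print(a_ij.mean(), nu * Sigma[i, j])                                   # E(A_ij) = nu sigma_ij
    print(a_ij.var(), nu * (Sigma[i, i] * Sigma[j, j] + Sigma[i, j]**2))   # Var(A_ij)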
Bii A A12 
= , as A =  11  and assume that A22 is square and nonsingular, then
Bi+1 i +1  A21 A22 
Generalized variance
−1
The multivariate analogue of the variance σ 2 of a univariate distribution is the covariance A = A11 − A12 A22 A21 A22 = A11.2 A22 .
matrix Σ , and the determinant of covariance matrix is termed as generalized variance of the
multivariate distribution. Similarly, the generalized variance of a sample x1 , x 2 ,L , x n is Also
defined as B p −1 p−1
B11 B22
n B = B11 = L B pp
1 B22 B33 B pp
S = ∑ ( x − x ) ( xα − x )′ .
n − 1 α =1 α
= (b11. 2,L, p ) (b22. 3,L, p ) L (b p−1 p −1. p ) b pp .
Distribution of the generalized variance Since
1 n −1 ′
By definition, the distribution of S is the same as the distribution of A , where
(n − 1) p B= ∑ Z*α Z*α , with Z α* ~ N p (0, I )
α =1
n−1
A= ∑ Z α Z α ′ , with Z α ~ N p (0, Σ) . Since Σ is symmetric and positive definite there n −1−( p −1)
α =1 ⇒ b11. 2,L, p = ∑ Vα2 , where Vα ~ N (0, 1) .
exist a nonsingular matrix C such that CΣC ′ = I . α =1

Thus, n−i 
h 2h Γ p + h
Σ
∏ Γ( n − i) / 2  .

2
b11. 2,L, p has χ − distribution with (n − p) degree of freedom. 2
=
(n − 1) ph i =1
*
Since Z α ' s are independently distributed For h = 1
⇒ (b11. 2,L, p ) (b22.3,L, p ) L (b p −1 p −1. p ) b pp are independently distributed n−i  n−i n−i
2Γ + 1
p p 2 Γ 
Σ  2 = Σ  2   2 
⇒ bii . i+1,L, p has χ 2 − distribution with n − 1 − ( p − i ) degrees of freedom. E S = ∏ ∏
(n − 1) p i =1 Γ(n − i ) / 2 (n − 1) p i =1 Γ( n − i ) / 2
Therefore, p
Σ
B is distributed as χ n2−1− ( p −1) . χ n2−1−( p − 2 ) .L . χ n2− 2 . χ n2−1 . = ∏ (n − i ) .
(n − 1) p i =1
1 1 For h = 2
Since S = A = Σ B , then,
(n − 1) p (n − 1) p
n−i   n − i  n − i   n − i 
24Γ + 2
p 2 p 4 + 1  Γ 
1 2 Σ  2  Σ  2  2   2 
S is distributed as Σ χ n2−1. χ n2− 2 L χ n2−1−( p −2) . χ n2− p . E S = ∏ = ∏
(n − 1) p (n − 1) 2 p i=1 Γ (n − i ) / 2 (n − 1) 2 p i =1 Γ( n − i) / 2
For p = 1 Σ
2 p

Σ
= ∏ ( n − i + 2) ( n − i ) .
S = χ n2−1 ⇒
(n − 1) s11
~ χ n2−1 or
a11
~ χ n2−1 . (n − 1) 2 p i =1
n −1 σ 11 2
σ 2 p 2 p
Σ Σ
V ( | S | ) = E | S | 2 −( E | S | ) 2 = ∏ ( n − i + 2) ( n − i ) − ∏ (n − i ) 2
Moments of sample generalized variance (n − 1) 2p
(n − 1) 2p
i =1 i =1
Σ
Since S is distributed as ( y1 y 2 L y p ) , where yi ~ χ n2−i , i = 1, 2,L , p , are Σ
2 p
(n − 1) p =
2p ∏ 2 (n − i) .
independent, thus (n − 1) i =1

h Note: For p = 2
h Σ
E S = E ( y1 ) h E ( y 2 ) h L E ( y p ) h . n−i   n + 2h − 1   n + 2h − 2 
(n − 1) ph h Γ + h h Γ  Γ 
h Σ 2h
2 2
 Σ 22h  2   2 
Consider, E S = ∏ =
(n − 1) 2h i =1 Γ(n − i ) / 2 (n − 1) 2h Γ(n − 1) / 2 Γ(n − 2) / 2
y ~ χυ2 , where, υ = n − 1 , then
 n + 2h − 2 1   n + 2h − 2 
hΓ + Γ 
1 ∞ (υ / 2)+h −1 − y / 2 Σ  2 2h2 2  2 .
E ( y ) = ∫ y f ( y) dy =
h h
∫ y e dy =
2υ / 2 Γ(υ / 2) 0 (n − 1) 2h
Γ
n − 2 1 n− 2
+ Γ 
 2 2  2 
 Γ [(υ / 2) + h] 
1   , because ∞ x m−1 e −λ x dx = Γm
=
2υ / 2 Γ(υ / 2)  (1 / 2) (υ / 2)+h 
∫0 λm
Using Legendre’s duplication formula
 
 1 π Γ(2α )
Γ  α +  Γ (α ) =
2 h Γ [(υ / 2) + h]  2  2 2α
= .
Γ(υ / 2) h 2h
Σ 2  (n + 2h − 2) 
Therefore, E S
h
=  π Γ ( n + 2h − 2) / 2 
(n − 1) 2h  π Γ ( n − 2) / 2 2 ( n − 2 ) 

 n −1  n− p 
2h Γ 
h + h  2h Γ  + h h
h Σ  2 L  2  Σ  Γ ( n + 2 h − 2) 
E S = =   (4.14)
(n − 1) ph Γ(n − 1) / 2 Γ ( n − p) / 2 (n − 1) 2h  Γ (n − 2) 
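Numerical check (sketch): the first-moment result obtained above, E|S| = |Σ| ∏_{i=1}^{p} (n − i) / (n − 1)^p , can be verified roughly by averaging |S| over simulated samples. The sample size, Σ and number of replications below are arbitrary choices for illustration, assuming numpy is available.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, reps = 15, 3, 20000
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.5, 0.2],
                      [0.3, 0.2, 1.0]])

    dets = np.empty(reps)
    for r in range(reps):
        x = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        S = np.cov(x, rowvar=False)          # divisor n-1, as in the notes
        dets[r] = np.linalg.det(S)

    theory = np.linalg.det(Sigma) * np.prod([n - i for i in range(1, p + 1)]) / (n - 1) ** p
    print("Monte Carlo  E|S| ~", dets.mean())
    print("Theoretical  E|S| =", theory)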

Let Make the transformation

X ~ χ 22m , where m is positive integer aij = aii a jj rij , i ≠ j, i< j (4.15)

aii = aii (4.16)


1 ∞ 2 2h
E ( X 2h ) = ∫ X m+ 2h −1 e − X / 2 dX = Γ(m + 2h)
2 m Γ (m) 0 Γ(m) The Jacobian is the product of the Jacobian of (4.15) and (4.16) for aii fixed, the Jacobian of
p ( p − 1)
2h (4.15) is the determinant of a − order diagonal matrix with diagonal elements
X Γ( m + 2 h) 2
⇒ E  =
2 Γ (m) aii a jj
In view of equation (4.14) and m = n − 2 , we get
a11 a 22
2h
Σ
h 2h  2 (n − 1) S 1 / 2  a11 a33
X
⇒ E  
h
E S = E  = E ( X 2h ) O
(n − 1) 2h  2   Σ
1/ 2 
  a11 a pp
J=
2 (n − 1) S
1/ 2  A 1/ 2  a 22 a33
⇒ ~ χ (22n−4) or 2   ~ χ2
 ( 2 n − 4)
, because X ~ χ 22( n−2) . O
Σ
1/ 2  Σ 1/ 2
  a 22 a pp
Exercise: Let X ~ N 2 (µ1 , µ 2 , σ 12 , σ 22 , ρ ) and let s12
and s 22
denote the sample variances O
and r the sample correlation coefficient. Show that the statistic p
1/ 2 = ∏ aii( p −1) / 2 , because each particulars subscript appears i < j , ( p − 1) times.
2 (n − 1) s1 s 2  1 − r 2 

u= ~ χ 22n −4 . i =1
σ1 σ 2 1 − ρ 2 
  Thus,
Theorem: Let x1 ,L , x n (n ≥ p + 1) are independent, each with distribution (υ − p −1) / 2
aii a jj rij  1 a  p ( p −1) / 2
N p (µ , δ ij σ ii ) , then the density of the sample correlation coefficients is f ( A) = exp − ∑ ii ∏ aii .
υ /2 p  2 i σ ii  i =1
 p 
p (υ − p −1) / 2 2υ p / 2 π p ( p −1) / 4  ∏ σ ii  ∏ Γ(υ − i + 1) / 2
[ Γ (υ / 2)] rij 1, i = j  
, where υ = n − 1, i, j = 1, 2,L , p and δ ij =   i=1  i =1
p
p ( p −1) / 4 0, i ≠ j But
π ∏ Γ (υ − i + 1) / 2
i =1
aii a jj rij
n n−1
Proof: Let A = ∑ ( xα − x ) ( xα − x ) ′ = ∑ Z α Z α ′ , where Z α are independent, each with  a11 0 L 0  1 r12 L r1 p   a11 L 0 
α =1 α =1    
 0 a 22 L 0   r21 1 L r2 p   0 L 0 
distribution N p (0, δ ij σ ii ) , so A ~ W p (υ , δ ij σ ii ) , and =    
 M M M M  M M M M  M M M 

(υ − p −1) / 2
 1
 0 0 L a pp   r p1 r p 2 L 1   0 L a pp 
aij a     
f ( A) = exp − ∑ ii  , because
υ/2 p  2 i σ ii  p
p( p −1) / 4 
 p 
υ p/2
σ  = rij ∏ aii , then
 ∏ ii  ∏ Γ(υ − i + 1) / 2
2 π
 i =1  i =1 i =1

σ 11 0 L 0  (υ −1) / 2  1 a  
a
0 σ 22 L 0 p rij
(υ − p −1) / 2
p  ii
exp −
 2
∑ σ ii  
 i ii 
Σ =
M M M M
= ∏ σ ii . f ( A) =
p ∏  υ/2 υ/2 

2 (σ ii )
0 0 L σ pp
i =1
π p ( p −1) / 4
∏ Γ(υ − i + 1) / 2 i =1  

i =1  

(υ −1) / 2  1 a ii  (υ + 2 h) / 2
p
rij
(υ − p −1) / 2
p
aii exp −  = 2 (υ + 2h) p / 2 π p ( p −1) / 4 Σ ∏ Γ [(υ + 2h) − i + 1] / 2 .
∞  2 σ ii 
= ∏ ∫0 daii . i =1
p ( p −1) / 4
p
i =1 2υ / 2 (σ ii )υ / 2 Therefore,
π ∏ Γ(υ − i + 1) / 2
i =1 p
(υ + 2h) / 2
Consider 2 (υ + 2h) p / 2 π p ( p −1) / 4 Σ ∏ Γ [(υ + 2h) − i + 1] / 2
h i =1
E A =
 1 aii  p
aii(υ −1) / 2 exp −  2υp / 2 π p ( p −1) / 4 Σ
υ/2
∞  2 σ ii  da . Put aii = u , or d a = 2 σ du , then ∏ Γ(υ − i + 1) / 2
B=∫ ii i ii ii i i =1
0 2υ / 2 (σ ii )υ / 2 2 σ ii
p
h
∞ 2 ph / 2 Σ ∏ Γ (υ + 2h − i + 1) / 2
B = ∫ (ui ) (υ / 2) −1 exp(−ui ) du i = Γ(υ / 2), ∀ i. i =1
0 =
p
Hence, the density function of rij is ∏ Γ(υ − i + 1) / 2
i =1
p
(υ − p −1) / 2
rij ∏ Γ (υ / 2) (υ − p −1) / 2
[ Γ (υ / 2)] p 2 ph / 2 Σ
h
p
n−i 
f (rij ) = i =1
=
rij
.
∏ Γ  2
+ h

p p i =1
p ( p −1) / 4 p ( p −1) / 4
= .
p
π ∏ Γ (υ − i + 1) / 2 π ∏ Γ (υ − i + 1) / 2
i =1 i =1 ∏ Γ (n − i ) / 2
i =1
h
Exercise: Find the E A directly from W (υ , Σ) .
Result:
Solution: We know that, if A ~ W (υ , Σ) , then
Let us suppose that x1 ,L, xn constitute a random sample of size n from a population whose
(υ − p −1) / 2 density at x is f ( x ; θ ) and Ω is the set of values, which can be taken on by the parameter θ
A  1 
f ( A) = exp − tr AΣ −1  . (Ω is the parameter space for θ ) and the parameter space for θ is partitioned into the
p  2 
υ p/2 p ( p −1) / 4 υ/2
2 π Σ ∏ Γ{(υ − i + 1) / 2} disjoint sets ω and ω ′ , according to the null hypothesis
i =1 H 0 : θ ∈ ω , against H A : θ ∈ ω ′ , if
Now
max L0
λ= is referred to as a value of the likelihood ratio statistic λ ,
∫A f ( A) dA = 1 max L

p where max L0 and max L are the maximum values of the likelihood function for all values
(υ − p −1) / 2  1  υ /2
⇒ ∫A A exp − tr AΣ −1 dA = 2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ{(υ − i + 1) / 2} of θ in ω and Ω respectively. Since max L0 and max L are both values of a likelihood
 2  i =1 function, and therefore, never negative, it follows that λ ≥ 0 , also, since ω is a subset of the
(4.17) Ω , it follows that λ ≤ 1 , then the critical region of the form λ ≤ k , (0 < k < 1) , defines a
h h likelihood ratio test of the null hypothesis H 0 : θ ∈ ω against the alternative hypothesis
E A =∫ A f ( A) dA
A H A : θ ∈ ω ′ . Such that
(υ − p −1+ 2h ) / 2 Pr [λ ≤ k H0] = α .
A  1 
= exp − tr AΣ −1  dA
p  2  Example: Let H 0 : µ = µ 0 , against H A : µ ≠ µ0
υ/2
2υ p / 2 π p ( p −1) / 4 Σ ∏ Γ{(υ − i + 1) / 2}
i =1 On the basis of a random sample of size n from normal population with the known variance
Now using equation (4.17), we have σ 2 , the critical region of the likelihood ratio test is obtained as follows:
Since ω contains only µ 0 , it follows that µ = µ 0 , and Ω is the set of all real numbers, it
[(υ + 2h)− p −1] / 2  1 
∫A A exp − tr AΣ −1 dA follows that µ̂ = x . Thus
 2 

n   distribution of the likelihood ratio statistic λ , itself. Since the distribution of λ is generally
 1  1 n
2 ∑ i
max L0 =   exp  − ( x − µ0 )2  quite complicated, which makes it difficult to evaluate k , it is often preferable to use the
 σ 2π   2 σ i=1  following approximation. For large n , the distribution of − 2 ln λ approaches, under very

 1 
n  1 n  general conditions the chi-square distribution with 1 degree of freedom, i.e. − 2 ln λ ~ χ12 . In
2 ∑ i
and max L =   exp  − ( x − x ) 2  , then the value of the likelihood ratio above example, we find that
 σ 2π   2 σ i =1 
statistic becomes, 2
n  x − µ0 
− 2 ln λ = ( x − µ 0 ) 2 =   .
 1 n  σ 2
σ / n 
2 ∑ i
exp − (x − µ0 ) 2 
 2 σ i =1   n 
λ= = exp − ( x − µ0 ) 2  Testing independence of sets of variates
 1 n   2 σ
2

exp − ∑ ( xi − x ) 2  Let the p − component vector X be distributed according to N p ( µ , Σ) . We partition X into
 2 σ 2 i=1 
q sub vectors with p1 , p 2 , L , p q components respectively, that is
Hence, the critical region of the likelihood ratio test is
 X (1) 
 n   
exp − ( x − µ 0 ) 2  ≤ k , since λ ≤ k X =  M  , p1 + p2 + L + p q = p . The vector of mean µ and the covariance matrix
2
 2 σ   (q) 
X 
 
n
and after taking logarithms and dividing by , it becomes Σ are partitioned similarly,
2σ 2
 µ (1)   Σ11 L Σ1q 
2 2σ 2    
(x − µ0 ) ≥ − ln k , ln k is negative in view of 0 < k < 1 µ = M , Σ =  M M M .
n  (q )   
µ   Σ q1 L Σ qq 
or x − µ0 ≥ K ,  

where K will have to be determined so that the size of the critical region is α i.e. H 0 : X (1) , X ( 2) ,L , X ( q ) are mutually independently distributed or equivalently Σ ij = 0 ,
Pr [ x − µ 0 ≥ K ] = α . (a) for i ≠ j , where Σ ij = E ( X (i) − µ (i) ) ( X ( j ) − µ ( j ) )′ .
Since x has a normal distribution with mean µ 0 and variance σ 2 / n , i.e. Under H 0 , Σ is of the form
x − µ0
x ~ N ( µ 0 , σ 2 / n) , then Z = ~ N (0, 1) .  Σ11 L 0 
σ/ n  
Σ= M M M  = Σ 0 (say)
For the given value of α , we find a value Z α (say) of standard normal variate by the  0 L Σ qq 
 
following equation
 σ  Let x1 , x 2 ,L, x n be n independent observations on X , the likelihood ratio criterion is
Pr [ Z ≥ Z α / 2 ] = α or Pr  x − µ 0 ≥ Zα / 2  = α (b)
 n  max L( µ , Σ 0 ) max L0
λ= = , (4.18)
From equation (a) and (b), we get max L( µ , Σ) max L
σ where
K= Zα / 2 .
n
1  1 
Therefore, the critical region of the likelihood ratio test is L ( µ , Σ) = exp − ∑ ( x α − µ )′ Σ −1 ( x α − µ ) ,
np / 2 n/2 2
 
(2π ) Σ  α 
σ
Pr  x − µ 0 ≥ Zα / 2  = α .
 n  and the numerator of (4.18) is the maximum of L for µ , Σ ∈ ω restricted by H 0 and the
Note: It was easy to find the constant that made the size of the critical region equal to α , denominator is the maximum of L over the entire parametric space Ω . Now consider the
because we were able to refer to the known distribution of x , and did not have to derive the likelihood function over the entire parametric space
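Numerical illustration (sketch): the univariate likelihood-ratio example worked above (H_0: μ = μ_0 with σ² known) is easy to carry out in code, and shows the equivalence of λ, −2 ln λ and the z statistic. The data, μ_0, σ and α below are assumed values chosen only for illustration.

    import numpy as np
    from scipy import stats

    x = np.array([4.8, 5.6, 5.1, 4.9, 5.4, 5.7, 5.0, 5.3])   # illustrative sample
    mu0, sigma, alpha = 5.0, 0.4, 0.05
    n, xbar = len(x), x.mean()

    # likelihood ratio for H0: mu = mu0 against mu unrestricted (sigma known)
    lam = np.exp(-n * (xbar - mu0) ** 2 / (2 * sigma ** 2))
    minus2loglam = -2 * np.log(lam)            # equals ((xbar-mu0)/(sigma/sqrt(n)))**2

    z = (xbar - mu0) / (sigma / np.sqrt(n))
    print("lambda =", lam, "  -2 ln lambda =", minus2loglam, "  z^2 =", z ** 2)

    # reject H0 when -2 ln lambda exceeds the chi-square(1) critical value,
    # equivalently when |z| exceeds the standard normal critical value
    print("reject H0:", minus2loglam > stats.chi2.ppf(1 - alpha, df=1))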

 1 −1  A better approximation to the distribution of λ is obtained by looking at the moments of λ


1  A
max LΩ = exp  − ∑ ( xα − x )′   ( x α − x ) (or of V ), one infact has
n/2  2 α n 
A
(2π ) np / 2 − m ln V ~ χ 2f approximately, where
n

n np / 2  1  A  −1  q
= exp − tr   ∑ ( x α − x ) ( x α − x )′ p 3 − ∑ pi3
(2π ) np / 2 A
n/2  2  n  α  p ( p + 1) p ( p + 1) 1  2
q  3 q
f = −∑ i i = p − ∑ pi2  , and m = n − − i =1
.
2 2 2   2  2 q 2
 
3 p − ∑ pi 
i =1 i =1
n np / 2  1 
= exp  − tr n I   
(2π ) np / 2
A
n/2  2   i =1 
Exercise: Show that the likelihood ratio criterion λ for testing independence of sets of
n np / 2  1  vectors can be written as
= exp  − np  , where trace nI p× p = np
(2π ) np / 2 A
n/2  2 
n/2  R11 L R1q 
R  
Similarly, under H 0 , the likelihood function is λ= , where R = (r jk ) =  M M M .
q
n/2  Rq1 L Rqq 

1  1  ∏ Rii  
max Lω = exp − ∑ ( x α − µˆ )′ Σˆ 0−1 ( x α − µˆ ) i =1
n/2 2
(2π ) np / 2 Σˆ 0  α  n/2
A
Solution: We have, λ = =V n/2 .
q
1  1  q
n/2
=∏ exp − ∑ ( x α(i) − µˆ (i ) ) ′ Σˆ ii−1 ( x α(i) − µˆ (i ) ) ∏ Aii
n / 2
i =1 ( 2π ) npi / 2 Σ
ˆ ii  2 α  i =1
Define,
q
1  1 
=∏ exp  − npi  a jk
n/ 2  2  r jk = , ⇒ a jk = r jk a jj a kk
i =1 A
(2π ) npi / 2 ii a jj a kk
n
p p1 + L+ pi −1 + pi
1  1  ⇒
= q
exp  − np  . A = R ∏ a jj , where p = p1 + p 2 + L + p q , and Aii = Rii ∏ a kk
1  2  j =1 k = p1 + L+ pi −1 +1
(2π ) np / 2 ∏ np / 2 Aii n / 2
n i
i =1 p p
Thus, the likelihood criterion becomes R ∏ a jj R ∏ a jj
A j =1 j =1 R
1 n/2 ⇒ V= = = = .
A n/2
q q  p1 +L+ pi −1 + pi  q p q

λ= n np / 2
q
=
q
A
=V n/2 ∏
i =1
Aii ∏  Rii ∏ akk  ∏
i =1
Rii ∏ a kk ∏
k =1 i =1
Rii
1 n/2 n/2 i =1  k = p1 + L+ pi −1 +1
∏ Aii ∏ Aii
n np / 2 i =1 i =1 n/2
R
Thus, λ = .
and H 0 : Σ ij = 0 is rejected if λ < λ0 , where λ0 is so chosen so as to have level α . q
n/2
Limiting results
∏ Rii
i =1
By the large sample general result about likelihood ratio criterion is Exercise: Let Ci be an arbitrary nonsingular matrix of order pi and let
− 2 ln λ ~ χ 2f approximately, where,
 C1 L 0 
  *
p ( p + 1) 1  2 q 2 
q C = M M M  and x α = C xα + d .
p ( p + 1)
f = −∑ i i = p − ∑ pi . 
0 L C

2 i =1
2 2  i =1


 q
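Numerical illustration (sketch): for the test of independence of sets of variates, V = |A| / ∏ |A_ii| (equivalently |R| / ∏ |R_ii|, as in the exercise above) is computed from the diagonal blocks of the partitioned sums of squares and cross-products matrix, and −m ln V is referred to a χ² with f degrees of freedom, using the f and m of the limiting results above. The data and the partition sizes below are illustrative assumptions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, blocks = 60, [2, 2, 1]          # p = 5 split into q = 3 sub-vectors
    p, q = sum(blocks), len(blocks)
    x = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)   # H0 true here

    xbar = x.mean(axis=0)
    A = (x - xbar).T @ (x - xbar)      # sums of squares and cross-products matrix

    # V = |A| / prod |A_ii| over the diagonal blocks
    idx, V_den = 0, 1.0
    for b in blocks:
        Aii = A[idx:idx + b, idx:idx + b]
        V_den *= np.linalg.det(Aii)
        idx += b
    V = np.linalg.det(A) / V_den

    f = 0.5 * (p ** 2 - sum(b ** 2 for b in blocks))
    m = n - 1.5 - (p ** 3 - sum(b ** 3 for b in blocks)) / (3.0 * (p ** 2 - sum(b ** 2 for b in blocks)))
    stat = -m * np.log(V)
    print("V =", V, "  -m ln V =", stat, "  p-value =", 1 - stats.chi2.cdf(stat, df=f))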

*
Show that the likelihood ratio criterion for independence interms of x α is identical to the and the likelihood ratio criterion is
criterion interms of x α . max L( µ , Σ 0 ) max L0
λ= =
max L( µ , Σ) max L
A
Solution: The likelihood ratio criterion for independence interms of x α is V = , the numerator is the maximum of L for µ , Σ ∈ ω restricted by H 0 and the denominator is
q
∏ Aii the maximum of L over the entire parametric space Ω . Now
i =1
define,  
 
n q   n
A 
−1 
1 n 1  1 i
exp − ∑ ( x α(i ) − x (i) )′  i (i ) (i ) 
A* = ∑ ( xα* − x * ) ( x*α − x * )′ , *
where, x α = ∑ (C x α + d ) = C x + d max LΩ = ∏ 
ni / 2
 2  ni


( x α − x ) 

α =1 n α =1 i =1  ni p / 2 Ai α =1
 (2π ) 
n  ni 
A* = ∑ C ( xα − x ) ( xα − x ) ′ C ′ = C A C ′ .
 
α =1  
Similarly, q  
1  1  1  1 
= ∏ exp − ni p   = exp − np  .
n  A
ni / 2  2   q
A
ni / 2  2 
Aij* = ∑ ( xα*(i ) − x *(i) ) ( x*α( j ) − x *( j ) )′ i =1
 (2π ) ni p / 2 i

 (2π ) np / 2 ∏ i

α =1  ni  n
i =1 i
 n  Similarly, under H 0 , the likelihood function is
= C i  ∑ ( x α(i) − x (i) ) ( x α( j ) − x ( j ) ) ′  C j ′ = C i Aij C j ′ .
 
 α =1  q   1 ni 

 1
Thus, max Lω = ∏  exp − ∑ ( x α(i ) − µˆ (i) )′ Σˆ −1 ( x α(i ) − µˆ (i ) )
ˆ ni / 2
i =1  ( 2π ) ni p / 2 Σ  2 α =1 
A*  
C AC′
V* = =  1 q ni −1 
1  A
exp  − ∑ ∑ ( x α(i) − x (i ) )′   ( x α(i ) − x (i ) ) 
q q
Aii* C i Aii C i ′ =
∏ ∏ A
n/2  2
 i =1 α =1 n 

i =1 i =1 (2π ) np / 2
n
C A C′ A q
= = = V , because ∏ Ci = C . 1  1 
q q = exp − np  .
Aii C i ′  2 
i =1 n/2
∏ Ci ∏ Aii
(2π ) np / 2
A
i =1 i =1 n
Therefore, the test is invariant with respect to linear transformation within each set. Thus, the likelihood criterion becomes
ni / 2 q
Testing equality of covariance matrices q
A 1 ni / 2  q 
∏  n /2 
∏ ni ni p / 2
Ai
np / 2 ∏ i
A i

Let x α(i ) α = 1, 2, L , ni ; i = 1, 2, L , q be an observation from N p ( µ (i ) , Σ i ) i i =1 ni n
λ = i =1 = =  i =1 .
n/2 1 q n/2
A A
n/2 ni p / 2  A 
H 0 : Σ1 = Σ 2 = L = Σ q = Σ (say), versus H A : Σ i ≠ Σ j , for i ≠ j .
n n np / 2 ∏ ni  
i =1  
Let
Bartlett has suggested that the likelihood ratio criterion be modified by replacing the sample
ni q
size ni by the degree of freedom υ i = ni − 1 , the modified criterion is
Ai = ∑ ( xα (i )
−x (i ) (i )
) ( xα − x (i)
)′ , and A = ∑ Ai .
α =1 i =1  q  q
 υ /2  υ /2
υp / 2  ∏ i ∏
The likelihood function is A i Si i
υ  q
q 
λ* =  i =1  = i =1 , where, υ = ∑ υ i = n − q and
1  1 ni  q
υi p / 2  A
υ/2  S
υ / 2
L ( µ , Σ) = ∏  exp − ∑ ( x α(i ) − µ (i ) ) ′ Σ i−1 ( x α(i) − µ (i) ) ∏υi  
i =1
i =1  ( 2π ) i 
n p/2 n / 2
Σi i  2 α =1 i =1  

Ai A 1 1 Therefore, the likelihood criterion becomes


Si = , an unbiased estimate of Σi , and S = = ∑ Ai = ∑υi S i , be the pooled
ni − 1 υ υ i υ i  1 
exp  − tr A  np / 2
estimate of the common covariance matrix.  2  e n/2  1 
λ= =  A exp  − tr A 
 1  n  2 
* np / 2
The test for H 0 : Σ1 = Σ 2 = L = Σ q is to reject H 0 if λ < λ*0 , where λ*0 is so chosen so as n
exp  − np 
A
n/2  2 
to have Pr[λ* < λ*0 H 0 ] = α .
the null hypothesis H 0 : Σ = I is rejected if λ < λ0 , where λ0 is so chosen so as to have
Box has shown that
level α .
− 2 ρ ln λ* ~ χ 2f approx., An approximation to the distribution of λ is obtained by looking at the moment of λ , one
in fact has
 q 1 1  2 p2 + 3p −1 
where, ρ = 1 −  ∑ −    , and f = p ( p + 1) (q − 1) / 2 .
− 2 ln λ ~ χ 2p( p +1) / 2 .
 υ i υ   6( p + 1) (q − 1) 
 i=1  
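Numerical illustration (sketch): the modified (Bartlett-corrected) criterion λ* and Box's χ² approximation stated above are straightforward to evaluate once the S_i and the pooled S are in hand. The groups of data below are illustrative assumptions; the formulas for ln λ*, ρ and f are the ones given in the notes.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    p = 3
    samples = [rng.multivariate_normal(np.zeros(p), np.eye(p), size=n_i)
               for n_i in (25, 30, 20)]                     # q = 3 groups, H0 true here
    q = len(samples)
    nus = [len(x) - 1 for x in samples]                     # degrees of freedom nu_i = n_i - 1
    nu = sum(nus)                                           # nu = n - q

    S_list = [np.cov(x, rowvar=False) for x in samples]     # unbiased S_i
    S_pool = sum(v * S for v, S in zip(nus, S_list)) / nu   # pooled estimate

    # ln lambda* = sum (nu_i/2) ln|S_i|  -  (nu/2) ln|S_pool|
    log_lambda_star = 0.5 * sum(v * np.log(np.linalg.det(S)) for v, S in zip(nus, S_list)) \
                      - 0.5 * nu * np.log(np.linalg.det(S_pool))

    rho = 1 - (sum(1.0 / v for v in nus) - 1.0 / nu) * (2 * p ** 2 + 3 * p - 1) / (6.0 * (p + 1) * (q - 1))
    f = p * (p + 1) * (q - 1) / 2.0

    stat = -2 * rho * log_lambda_star
    print("-2 rho ln lambda* =", stat, "  p-value =", 1 - stats.chi2.cdf(stat, df=f))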
Theorem: Given p − component observation vector y , y ,L , y from N p (υ , Φ ) , the
1 2 n
Testing the hypothesis that a covariance matrix is equal to a given
likelihood ratio criterion for testing the hypothesis H0 : υ =υ0, Φ = Φ0 is
matrix
np / 2
e n/2  1 
Let x α be a random sample of size n from N p ( µ , Σ) and H 0 : Σ = I . λ =  B Φ 0−1 exp − [tr B Φ 0−1 + n ( y − υ 0 )′ Φ 0−1 ( y − υ 0 )] , where,
 n  2 
The likelihood ratio criterion is n

max L( µ , Σ 0 ) max L0
B= ∑ ( y α − y ) ( yα − y )′ , when the null hypothesis is true, then, − 2 ln λ ~ χ 2p( p +1)+ p .
λ= = , α =1
max L( µ , Σ) max L
Proof: Given y ~ N p (υ , Φ ) , and H 0 : υ = υ 0 , Φ = Φ 0 .
α
the numerator is the maximum of L for µ , Σ ∈ ω restricted by H 0 and the denominator is
Let
the maximum of L over the entire parametric space Ω ,
X = C (Y − υ 0 ) , with C is a nonsingular matrix
where,
 1  E X = 0 , and Σ X = C Φ 0−1 C ′ = I , under H 0 .
1
L ( µ , Σ) = exp − ∑ ( x α − µ ) ′ Σ −1 ( x α − µ )
n/2
(2π ) np / 2
Σ  2 α  ⇒ C ′C = Φ 0−1 ,
The likelihood function over the entire parametric space is then x1 , x 2 ,L , x n constitutes a sample from N p ( µ , Σ) and the hypothesis is

1  1  A
−1  H0 : µ = 0 , Σ = I
max LΩ = exp − ∑ ( x α − x )′   ( x α − x )
A
n/2  2 α n 
(2π ) np / 2 n np / 2  1 
n max LΩ = exp  − np  , and
(2π ) np / 2
A
n/2  2 
n np / 2  1 
= exp  − np  .  1 
1
 2  exp  − ∑ xα ′ x α  .
np / 2 n/2
(2π ) A max Lω =
(2π ) np / 2  2 
 α 
The likelihood function over ω , the parametric space as restricted by H 0 : Σ = I is
Thus, the likelihood ratio criterion is
1  1 
max Lω = exp − ∑ ( x α − x )′ I ( x α − x )  1 
np / 2 n/2
 2 α  exp  − ∑ x α ′ x α 
(2π ) I   np / 2  1 
max L( µ , Σ 0 )  2 α  =e n/2
exp  − ∑ x α ′ x α .
λ= =   A
np / 2  2 
1  1  1  1 
max L( µ , Σ) n  1    n  α 
= exp − tr I ∑ ( x α − x ) ( x α − x )′ = exp  − tr A  . exp  − np 
np / 2 2 np / 2  2 
n/2  2 
(2π )  α  (2π ) A

Now let us return to the observations y1 , y2 , … , yn , then



∑ xα ′ xα = ∑ ( y α − υ 0 )′ C ′ C ( yα − υ 0 ) = ∑ ( yα − υ 0 )′ Φ 0−1 ( yα − υ 0 )
α α α

= tr ∑ ( y − υ 0 )′ Φ 0−1 ( y
α α
− υ 0 ) = tr Φ 0−1 ∑ ( y
α
−υ0)(y
α
− υ 0 )′
α α

 
= tr Φ 0−1 ∑ ( y − y ) ( y − y )′ + n ( y − υ 0 ) ( y − υ 0 )′
α α
 α 

= tr Φ −0 1 [ B + n ( y − υ 0 ) ( y − υ 0 ) ′] = tr (Φ −0 1 B) + tr Φ 0−1 n ( y − υ 0 ) ( y − υ 0 ) ′

= tr (Φ −0 1 B) + n ( y − υ 0 ) Φ 0−1 ( y − υ 0 )′
and
A = ∑ ( x α − x ) ( x α − x )′ = ∑ [{C ( y − υ 0 ) − C ( y − υ 0 )}{C ( y − υ 0 ) − C ( y − υ 0 )}′ ]
α α
α α

= ∑ [C ( y − y )] [C ( y − y )]′ = C ∑ ( y − y) ( y − y )′C ′ = C B C ′ .
α α α α
α α
Thus,
|A| = |C B C′| = |B| / |Φ0| = |B Φ0−1| , because C Φ0 C′ = I , then |C Φ0 C′| = 1 , and |C| |C′| |Φ0| = 1 ⇒ |C C′| = 1 / |Φ0| .
Therefore, the likelihood ratio criterion becomes
np / 2
e n/2  1 
λ =  B Φ 0−1 exp − [tr B Φ 0−1 + n ( y − υ 0 )′ Φ 0−1 ( y − υ 0 )] .
 n  2 
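Numerical illustration (sketch): for the theorem above, −2 ln λ for H_0: υ = υ_0, Φ = Φ_0 can be computed directly from a sample and referred to a χ² whose degrees of freedom equal the number of parameters fixed by H_0, namely p + p(p+1)/2. The data, υ_0 and Φ_0 below are illustrative assumptions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    p, n = 2, 40
    nu0 = np.zeros(p)                       # hypothesised mean vector
    Phi0 = np.array([[1.0, 0.3],
                     [0.3, 1.0]])           # hypothesised covariance matrix
    y = rng.multivariate_normal(nu0, Phi0, size=n)   # H0 true here

    ybar = y.mean(axis=0)
    B = (y - ybar).T @ (y - ybar)           # B = sum (y_a - ybar)(y_a - ybar)'
    Phi0_inv = np.linalg.inv(Phi0)

    # lambda = (e/n)^{np/2} |B Phi0^{-1}|^{n/2}
    #          * exp(-1/2 [tr B Phi0^{-1} + n (ybar-nu0)' Phi0^{-1} (ybar-nu0)])
    quad = n * (ybar - nu0) @ Phi0_inv @ (ybar - nu0)
    log_lam = 0.5 * n * p * (1 - np.log(n)) + 0.5 * n * np.log(np.linalg.det(B @ Phi0_inv)) \
              - 0.5 * (np.trace(B @ Phi0_inv) + quad)

    df = p + p * (p + 1) // 2               # parameters fixed under H0
    stat = -2 * log_lam
    print("-2 ln lambda =", stat, "  p-value =", 1 - stats.chi2.cdf(stat, df=df))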

MULTIPLE AND PARTIAL CORRELATIONS Differentiating with respect to β and equating to zero

Result: − 2 E ( X (2) − µ ( 2) ) [ ( X 1 − µ1 ) − β ′ ( X ( 2) − µ ( 2) )] = 0

X 1* = E ( X 1 X 2 = x 2 ) = a + bx 2 is the regression line of X 1 on X 2 , b is the regression or E ( X (2) − µ (2) ) ( X 1 − µ1 ) − E ( X ( 2) − µ (2) ) ( X ( 2) − µ ( 2) )′β = 0


coefficient. It follows that
or σ 12 = Σ 22 β and
∫ x1 f (x1 x 2 ) dx1 = a + bx 2 (5.1)

On multiplying both the sides of equation (5.1) by f ( x 2 ) , and integrating with respect to x 2 , βˆ = σ 12′ Σ −221 .
we get
Therefore, the best linear function of X 1 in terms of X (2) is
∫ ∫ x1 f ( x1 | x2 ) f ( x2 ) dx1 dx2 = a ∫ f ( x 2 ) dx2 + b ∫ x2 f ( x2 ) dx 2
X 1* = µ1 + σ 12′ Σ 22
−1
( X ( 2) − µ ( 2) ) = Xˆ 1 .
or ∫ ∫ x1 [ f ( x1 , x2 ) dx 2 ] dx1 = a + b µ 2 or ∫ x1 f ( x1 ) dx1 = a + b µ 2
or µ1 = a + b µ 2 (5.2) The correlation coefficient between X 1 and its best linear estimate in terms of X (2) is called
Multiple correlation between X 1 and X 2 , L , X p . This is denoted by
On multiplying both the sides of equation (5.1) by x 2 f ( x 2 ) , and integrating with respect to
x 2 , we get Cov ( X 1 , Xˆ 1 )
ρ1(2, 3,L, p ) = , where V ( X 1 ) = E [ X 1 − E ( X 1 )]2 = σ 11 .
x 22 V ( X ) V ( Xˆ )
∫ ∫ x1 x2 f ( x1 , x 2 ) dx 2 dx1 = a ∫ x 2 f ( x 2 ) dx 2 + b ∫ f ( x 2 ) dx 2 1 1

or E ( X 1 , X 2 ) = a µ 2 + b E ( X 22 ) V ( Xˆ 1 ) = E [ Xˆ 1 − E ( Xˆ 1 )][ Xˆ 1 − E ( Xˆ 1 )]′

or σ 12 + µ1µ 2 = a µ 2 + b (σ 22 + µ 22 ) (5.3) = E [ µ1 + σ 12 ′ Σ −221 ( X ( 2) − µ ( 2) ) − µ1 ][ µ1 + σ 12′ Σ 22


−1
( X ( 2) − µ ( 2) ) − µ1 ]′

By solving equations (5.2) and (5.3), we get since E ( Xˆ 1 ) = µ1 + σ 12′ Σ −221 E ( X ( 2) − µ ( 2) ) = µ1


µ1µ 2 = a µ 2 + b µ 22
= σ 12′ Σ 22
−1 −1
E ( X ( 2) − µ ( 2) ) ( X (2) − µ (2) ) ′ Σ 22 σ 12
σ 12 + µ1µ 2 = a µ 2 + b (σ 22 + µ 22 )
= σ 12′ Σ 22
−1
σ 12 .
σ 12 = b [(σ 22 + µ 22 ) − µ 22 ]
Cov ( X 1 , Xˆ 1 ) = E [ X 1 − E ( X 1 )][ Xˆ 1 − E ( Xˆ 1 )]′
σ σ
b = 12 = ρ 1 . After substituting the value of b in equation (5.1), we get
σ 22 σ2 = E ( X 1 − µ1 ) [ µ1 + σ 12 ′ Σ −221 ( X (2) − µ ( 2) ) − µ1 ]′
σ
a = µ1 − ρ 1 µ 2 , therefore, −1
= E ( X 1 − µ1 ) ( X ( 2) − µ ( 2) ) ′ Σ 22 σ 12 = σ 12′ Σ 22
−1
σ 12 .
σ2
Hence,
σ σ σ
X 1* = E ( X 1 X 2 = x 2 ) = µ1 − ρ 1 µ 2 + ρ 1 x 2 = µ1 + ρ 1 ( x 2 − µ 2 ) .
σ2 σ2 σ2 σ 12 ′ Σ −221 σ 12 σ 12′ Σ −221 σ 12 β ′ Σ 22 β
ρ1(2, 3,L, p ) = = = ,
σ 11 (σ 12 ′ Σ −221 σ 12 ) σ 11 σ 11
Multiple correlation

If X 1 is the first component of X and X (2) the vector of remaining ( p − 1) components. σ σ 12′ 
where σ 12 and Σ 22 are defined as  11 .
We first express X 1 as a linear combination of X (2) defined by the relation σ 
 21 Σ 22 
X * = µ + β ′ ( X ( 2) − µ (2) ) , the coefficient vector β is determined by minimizing
1 1 Note: Since the numerator is V ( Xˆ 1 ) , therefore, ρ1( 2, 3,L, p) ≥ 0 i.e. 0 ≤ ρ1(2, 3,L, p ) ≤ 1 .
U = E [ X1 − X 1* ] 2 = E [ X 1 − µ1 − β ′ ( X (2) − µ (2) )]2 . This is so because X̂ 1 is an estimate of X 1 .

Estimation of multiple correlation coefficient Thus, in our case


The multiple correlation in the population is a11
a11 ~ W1 (n − 1, σ 11 ) , ⇒ ~ χ n2−1 .
′ Σ −1 σ σ 11
σ 12 22 12 β′ Σ 22 β
ρ1(2,L, p) = = .
σ 11 σ 11 In null case ρ1(2, 3,L, p ) = 0

A n −1
Given x α (α = 1,L , n) , n > p . We estimate Σ by Σˆ = = S, Σ11.2 = σ 11 − σ 12 ′ Σ −221 σ 21 = σ 11 , since σ 12′ = 0 , so that
n n
where, A = ∑ ( x α − x ) ( x α − x )′ . a11 − a12′ A22
−1
a 21 ~ W1 (n − 1 − ( p − 1), σ 11 )
α
a11 − a12′ A22
−1
a 21
Now A is partitioned as follows ⇒ ~ χ n2− p .
σ 11
a a12′ 
 11
a ′ A −1 Consider
n  , and the estimate of β is βˆ = σ 12′ Σ −1 = 12  22 
A  n ′
=  = a12 ′ A22
−1
.
n  a12 A22  22 n  n 
  a11 a11 − a12′ A22
−1
a 21 a ′ A −1 a
 n n  = + 12 22 21
σ 11 σ 11 σ 11
Using the above estimate, the sample multiple correlation coefficient of X 1 on X 2 , L , X p is
or Q = Q1 + Q2 , (say),
σˆ 12′ Σˆ −221 σ 12 a ′ A −1 a
where Q ~ χ n2−1 , and Q1 ~ χ n2− p .
R1( 2,L, p ) = = 12 22 12
σˆ11 a11
and From Fisher Cochran theorem Q2 is independently distributed as χ n2−1−( n− p) , i.e.

Q2 ~ χ 2p −1 and is independent of Q1 , hence,


a11 − a12′ A22
−1
a12 a11 − a12′ A22
−1
a12 A22 A
1 − R12(2,L, p ) = = = .
a11 a11 A22 a11 A22 ′ −1 2
R2 n − p a12 A22 a12 / σ 11 n − p χ p −1 / p − 1
F= × = × = ~ F p −1, n − p .
Distribution of sample multiple correlation coefficient in null case 1− R2 p −1 a11.2 / σ 11 p − 1 χ n2− p / n − p

The sample multiple correlation coefficient between X 1 and X (2) is defined by relation The distribution of the statistic F is
ν1 / 2 ν1
a ′ A −1 a a11 − a12′ A22
−1
a12  ν1


 F2
−1
R 2 = 12 22 12 and 1 − R 2 = ,
a11 a11 df ( F ) = ν 2  dF ,
(ν1 +ν 2 ) / 2
ν ν   ν1 
a a12′  B 1 , 2  1 + F 
where R 2 = R12( 2, 3,L, p ) and A =  11 .  2 2  ν 2 
a A22 
 12
where ν 1 = p − 1 , ν 2 = n − p .
Therefore,
R2 ν 2 dR 2 ν 2
R2 a12′ A22
−1
a12 In this put F = , then dF =
= . 1 − R 2 ν1 (1 − R 2 ) 2 ν 1
1− R2 a11.2
ν1
We know that, if A is partitioned as ν1 / 2  −1
 ν1  2
ν2 2
 R 
 A11 A12  q  ν  1 − R 2 ν1 
A =   and A ~ W p (n − 1, Σ) , then, A11 ~ Wq (n − 1, Σ11 ) and  2   ν 2 dR 2
A22  p − q df ( R 2 ) =
 A21 (ν 1 +ν 2 ) / 2 ν 1 (1 − R 2 ) 2
ν ν   R2 

B 1 , 2  1+
  1 − R 2 
−1
A11 − A12 A22 A21 ~ Wq (n − 1 − ( p − q), Σ11.2 ) .  2 2 

ν1 ν 1 ν1 Similarly,
− +1−1 −1
 ν1  2 2  R2 2
 
 ν  1 − R 2  max Lω =
n np / 2  1 
exp  − Q  , where
 2   dR 2 np / 2
= (2π ) (a11 A22 ) n / 2  2 
(ν 1 +ν 2 ) / 2
ν ν  1  (1 − R 2 ) 2
B 1 , 2     n a −1
 2 2  1 − R 2  0   a11 0 
Q = trace  11 
−1   0
 = tr nI = np , thus,
 0
 n A22  A22 
ν1 ν1 +ν 2 ν1
1 −1 − +1−2
= (R 2 ) 2 (1 − R 2 ) 2 2 dR 2 n np / 2
ν ν   1 
B 1 , 2  max Lω = exp  − np  .
 2 2  (2π ) np / 2
(a11 A22 ) n / 2  2 
ν1 ν2 Hence, the ratio
1 −1 −1
= (R 2 ) 2 (1 − R 2 ) 2 dR 2 . Put dR 2 = 2 R dR , thus the distribution of R . n/2
ν ν  A A
B 1 , 2  λ= ⇒ λ2 / n = =1− R2
 2 2  (a11 A22 ) n / 2 a11 A22
ν2 n− p
−1 −1 The likelihood ratio test is defined by the critical region λ ≤ λ0 , where
2 R (ν1−1) (1 − R 2 ) 2 2 R p −2 (1 − R 2 ) 2
df ( R) = dR = dR , 0 < R < 1.
ν ν   p −1 n − p  Pr[λ ≤ λ0 H 0 ] = α
B 1 , 2  B , 
 2 2   2 2 
⇒ λ2 / n ≤ λo2 / n ⇒ 1 − λ2 / n = R 2 > 1 − λo2 / n = R02 , (say)
Likelihood ratio criteria for testing H 0 : ρ1(2, 3,K, p ) = 0 . 2
R2 n− p R0 n − p
The likelihood function of the sample x α (α = 1, 2, K , n > p ) from N ( µ , Σ ) is ⇒ >
2 p −1
1− R 1 − R02 p − 1

1  1  ⇒ F > F p −1, n − p (α ) .
L( µ , Σ) = exp − ∑ ( x α − µ ) ′ Σ −1 ( x α − µ )
np / 2 n/2
(2π ) Σ  2 α 
Theorem: Multiple correlation is invariant under the non-singular linear transformation.
and the likelihood ratio criterion Proof: We know that

λ=
max L0
, where the numerator is the maximum of L for µ , Σ ∈ ω restricted by σ 12′ Σ 22
−1
σ 12  X1  σ σ 12′ 
max L ρ12(2,L, p ) = , where X =  ( 2)  , and Σ =  11
σ 11  X  σ Σ 22 
H 0 : ρ1(2, 3,K, p ) = 0 (i.e. β = 0 , ⇒ σ 12 = 0) and the denominator is the maximum of L  12
over the entire parametric space Ω . Let
Now Y1 = a11 X 1 , and Y (2) = A22 X ( 2)

1  1  A
−1  a 0 
max LΩ = exp − ∑ ( x α − x ) ′   ( x α − x ) or Y =  11  X , where a11 ≠ 0 , A22 ≠ 0 .
A
n/2  2 α n   0 A22 
(2π ) np / 2
n Assume that E X = 0 , and
np / 2  1  A  −1 
n
exp  − tr   ∑ ( x α − x ) ( x α − x ) ′ a 0   σ 11 σ 12′   a11 0   a 2 σ 11 a11 A22 σ 12 ′ 
= ΣY =  11   = 11
(2π ) np / 2 A
n / 2  2  n  α   0 A22   σ 12 Σ 22   0 A22   a11 A22 σ 12 A22 Σ 22 A22 

n np / 2  1  Therefore,
= exp  − np  ,
 2  a11 A22 σ 12 ′ ( A22 Σ 22 A22 ) −1 a11 A22 σ 12
np / 2 n/2
(2π ) A
ρ12( 2, 3,K, p) (Y ) =
2
where trace nI p× p = np . a11 σ 11

σ ′ Σ −1 σ  X1  σ σ 12′ 
= 12 22 12 = ρ12( 2, 3,K, p ) ( X ) . Exercise: Let X ~ N (0, Σ) , X =  (2)  , and Σ =  11 . The difference
σ X  σ 
11  12 Σ 22 
X 1 − σ 12′ Σ −221 X ( 2) (called the residual of X 1 from its mean square regression line on
 X  σ σ 12′ 
Theorem: Let X ~ N (0, Σ) , X =  (12)  , and Σ =  11 , then of all linear
  X (2) ) is uncorrelated with any of the independent variable X 2 , X 3 , K , X p .
X   σ 12 Σ 22 
combinations of X 2 , X 3 , K , X p has the maximum correlation with X 1 . Solution:
Proof: We know that, for any c nonzero scalar and γ Cov [( X 1 − σ 12′ Σ −221 X ( 2) ), X (2) ] = E[( X 1 − σ 12 ′ Σ −221 X (2) ) ( X ( 2) )′]

E ( X 1 − β ′ X ( 2 ) ) 2 ≤ E ( X 1 − c γ ′ X ( 2) ) 2 , = E[ X 1 ( X (2) )′] − σ 12′ Σ 22


−1
E[ X (2) ( X ( 2) )′]

because β ′ = σ 12′ Σ −221 is determined by minimizing E ( X 1 − β ′ X (2) ) 2 . = σ 12′ − σ 12 ′ Σ 22


−1
Σ 22 = 0 .
Hence, Exercise: Derive the sampling distribution of sample correlation coefficient r on the basis
E [ X 12 + ( β ′ X ( 2) ) 2 − 2 X 1 ( β ′ X ( 2) )] ≤ E [ X 12 + (c γ ′ X (2) ) 2 − 2 c X 1 (γ ′ X (2) )] of a random sample of size n from a bivariate normal distribution N 2 (µ1 , µ 2 , σ 12 , σ 22 , ρ ) , if
ρ = 0 . Also find E (r k ) .
or E X 12 + E [ β ′ X ( 2) ]2 − 2 E X 1 ( β ′ X ( 2) ) ≤ E X 12 + c 2 E (γ ′ X ( 2) ) 2
Solution: The sample correlation coefficient between X 1 and X 2 is defined by the
− 2 c E X 1 (γ ′ X ( 2) ) . relation

Let a a −1 a a − a a −1 a  a11 a12 


r 2 = 12 22 12 and 1 − r 2 = 11 12 22 12 , where r 2 = r12
2
, A =  .
a11 a11  a 21 a 22 
E ( β ′ X ( 2) ) 2
c2 = , then Therefore,
E (γ ′ X (2) ) 2
−1
r2 a12 a 22 a12
=
E ( β ′ X (2) ) 2
.
−1
E [ β ′ X ( 2) ]2 − 2 EX 1 ( β ′ X ( 2) ) ≤ E (γ ′ X (2) ) 2 1− r2 a11 − a12 a 22 a12
E (γ ′ X (2) ) 2
We know that, if

E ( β ′ X ( 2) ) 2  A11 A12  q
EX1(γ ′ X ( 2) ) A =   , then, A11 ~ Wq (n − 1, Σ11 ) , and
−2
 A21 A22  p − q
E (γ ′ X ( 2) ) 2
−1
A11 − A12 A22 A21 ~ Wq (n − 1 − ( p − q), Σ11.2 ) .
E ( β ′ X ( 2) ) 2
or − 2 EX1( β ′ X ( 2) ) ≤ −2 EX1(γ ′ X ( 2) ) At p = 2 , and q = 1
E (γ ′ X ( 2) )2
a11
a11 ~ W1 (n − 1, σ 11 ) , ⇒ ~ χ n2−1 .
E ( β ′ X ( 2) ) 2 σ 11
EX 1 ( β ′ X ( 2) ) ≥ EX 1 (γ ′ X ( 2) )
E (γ ′ X ( 2) ) 2 In null case ρ = 0
−1
Dividing both the sides by EX 12 E ( β ′ X ( 2) ) 2 , we get σ 11.2 = σ 11 − σ 12σ 22 σ 21 = σ 11 , since σ 12 = 0 .
So that
EX 1 ( β ′ X (2) ) EX 1 (γ ′ X ( 2) )
≥ −1
a11 − a12 a 22 a 21 ~ W1 (n − 1 − (2 − 1), σ 11 )
EX 12 E ( β ′ X (2) ) 2 EX 12 E (γ ′ X ( 2) ) 2
−1
a11 − a12 a 22 a 21
Corr. ( X 1 , β ′ X (2) ) ≥ Corr. ( X 1 , γ ′ X ( 2) ) . ⇒ ~ χ n2− 2 .
σ 11

Consider Similarly, we have

a11
−1
a11 − a12 a 22 a 21 −1
a12 a 22 a 21 V ( X 2.3K p ) = σ 22 − σ 23′ Σ 33
−1
σ 23 , and
= +
σ 11 σ 11 σ 11
Cov ( X 1.3K p , X 2.3K p ) = E[ X 1 − σ 13′ Σ 33 X ][ X 2 − σ 23′ Σ 33
−1 (3) −1 (3) ′
X ]
or Q = Q1 + Q2 , (say),
= σ 12 − σ 13′ Σ 33
−1
σ 23 .
where Q ~ χ n2−1 and Q1 ~ χ n2−2 , therefore, Q2 ~ χ12 , by the additive property of χ 2 . Thus,
Therefore,
−1
r2 n − 2 a12 a 22 a 21 / σ 11 n − 2 χ12
× = × = ~ F1, n − 2 . σ 12 − σ 13′ Σ 33
−1
σ 23
1− r2 1 a11.2 / σ 11 1 2
χn−2 / n − 2 ρ12 .3K p = .
(σ 11 − σ 13′ Σ 33
−1
σ 13 ) (σ 22 − σ 23′ Σ 33
−1
σ 23 )
Hence,
r Alternative proof
( n − 2) ~ t n−2 .
2
1− r Consider the var. cov. matrix of conditional distribution of X (1) given X (2) , which is

Partial correlation −1
Σ11.2 = Σ11 − Σ12 Σ 22 Σ 21 , in our case
The correlation between two variables X 1 and X 2 is measured by the correlation coefficient
 σ 11 σ 12   σ 13′  −1
which sometimes called the total correlation coefficient between X 1 and X 2 . If X 1 and X 2 =   − Σ (σ σ 23 )
′ 33 13
 σ 21 σ 22   σ 23 
are considered in the conjunction with p − 2 other variables X 3 , X 4 , K , X p , we may regard
the variation of X 1 and X 2 as to certain extents due to the variation of the other variables.  σ 11 σ 12   σ 13′ Σ 33
−1
σ 13 σ 13′ Σ 33
−1
σ 23 
=   −
 σ 21 σ 22   σ 23 Σ 33 σ 13 σ 23 Σ 33 σ 23 

Let X 1.3K p and X 2.3K p represent these parts of variation of X 1 and X 2 respectively, ′ − 1 ′ − 1

which remains after subs traction of the best linear estimate in terms of X 3 , X 4 , K, X p .
 σ − σ ′ Σ −1 σ ′ −1  σ σ 12 . 3 K p 
=  11 13 33 13 σ 12 − σ 13 Σ 33 σ 23  =  11. 3K p 
Thus we may regards the correlation coefficient between X 1.3K p and X 2.3K p as a measure σ − σ Σ σ
′ − 1 ′ − 1  σ σ 22 . 3K p 
 12 13 33 23 σ 22 − σ 23 Σ 33 σ 23   12 . 3K p
of correlation between X 1 and X 2 after removal of any part of the variation due to the
influence of X 3 , X 4 , K , X p , this correlation is called partial correlation of X 1 and X 2 with We know that simple correlation coefficient is
σ 12 σ σ 12 
respect to X 3 , X 4 , K , X p . Let ρ= , where σ 11 , σ 12 , and σ 22 are defined as Σ =  11  .
σ 11 σ 22  21 σ 22 
σ
σ ′
 X1   11 σ 12 M σ 13  Therefore, the partial correlation coefficient is obtained like simple correlation coefficient
  σ σ 22 M σ 23′  −1
X =  X 2  , µ = 0 , and Σ =  12  from Σ11.2 = Σ11 − Σ 21Σ 22 Σ12
 X (3)  L L L L 
  σ  σ 12. 3K p σ 12 − σ 13′ Σ 33
−1
σ 23
 13 σ 23 M Σ 33  ρ12.3K p = = .
σ 11.3K p σ 22. 3K p (σ 11 − σ 13′ Σ 33
−1
σ 13 ) (σ 22 − σ 23′ Σ 33
−1
σ 23 )
and the best linear estimates of X 1 and X 2 interms of X (3)
are Xˆ 1 = σ 13′ Σ 33
−1 (3)
X , and
Xˆ = σ ′ Σ −1 X (3) respectively.
2 23 33
In general,
Let the partition of X and Σ as follows
Define,
 X1 
X 1.3K p = X 1 − X̂ 1 , and X 2.3K p = X 2 − X̂ 2 , then  
 M 
V ( X 1.3K p ) = E [ X 1 − σ 13′ Σ 33 X ][ X 1 − σ 13′ Σ 33
−1 (3 ) −1 (3) ′ X 
q  X
X ] (1) 
 Σ11 Σ12 
X = =  , and Σ =   .
 X q +1   ( 2) 
−1 (3) (3) ′ −1    X   Σ 21 Σ 22 
= E[ X 12 − 2 σ 13′ Σ 33
−1
X 1 X (3) + σ 13′ Σ 33 X X Σ 33 σ 13 ]
 M 
 
= σ 11 − 2 σ 13′ Σ 33
−1
σ 13 + σ 13′ Σ 33
−1 −1
Σ 33 Σ 33 σ 13 = σ 11 − σ 13′ Σ 33
−1
σ 13 . Xp 
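Numerical illustration (sketch): the partial correlation described above is conveniently obtained from the conditional covariance matrix Σ_{11.2} = Σ_11 − Σ_12 Σ_22^{−1} Σ_21, exactly as in the alternative proof. The 4 × 4 covariance matrix below is an assumed value used only for illustration; the same code applies to a sample matrix A_{11.2}.

    import numpy as np

    # assumed covariance matrix of (X1, X2, X3, X4), for illustration only
    Sigma = np.array([[4.0, 1.2, 1.0, 0.8],
                      [1.2, 3.0, 0.9, 0.7],
                      [1.0, 0.9, 2.0, 0.5],
                      [0.8, 0.7, 0.5, 1.5]])

    S11 = Sigma[:2, :2]                       # block for (X1, X2)
    S12 = Sigma[:2, 2:]                       # cross block with (X3, X4)
    S22 = Sigma[2:, 2:]

    S11_2 = S11 - S12 @ np.linalg.solve(S22, S12.T)    # Sigma_{11.2}
    rho_12_34 = S11_2[0, 1] / np.sqrt(S11_2[0, 0] * S11_2[1, 1])
    print("partial correlation rho_{12.34} =", rho_12_34)

    # total (simple) correlation for comparison
    print("simple correlation rho_12 =", Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1]))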

The partial correlation coefficient between the variable X i and X j ( X i , X j ∈ X (1) ) holding The distribution of the sample partial correlation rij . q +1K p based on a sample of size n from
a distribution with population correlation ρ ij . q+1K p is same as the distribution of ordinary
the components of X (2) fixed (there will be total of q C2 partial correlation coefficients), is
often denoted by correlation coefficient rij based on a sample of size n − ( p − q ) from a distribution with the
σ ij . q +1K p corresponding population partial correlation ρ ij . q +1K p = ρ . Thus,
ρ ij . q +1K p = , i, j = 1, 2,K , q .
(σ ii . q +1K p ) (σ jj . q +1K p ) r
n − ( p − q) − 2 ~ t n −( p − q ) − 2 .
1− r2
Estimation of partial correlation coefficient
We know that the population partial correlation coefficient between X i and X j holding the Test of hypothesis and confidence region for partial correlation coefficient
Case I: H 0 : ρ ij.q +1L p = ρ 0 , (a specified value), against H A : ρ ij.q +1L p ≠ ρ 0 . For
components of X (2) fixed, is given by
testing H 0 on the basis of sample of size n − ( p − q ) , we use the following test statistic
σ ij . q +1K p
ρ ij . q +1K p = . Z − ξ0
(σ ii . q +1K p ) (σ jj . q +1K p ) U= ~ N (0, 1) , where
1/ n − 3
Given a sample xα (α = 1, 2, K , n > p ) from N ( µ , Σ) , the maximum likelihood estimate of 1 1 + rij.q +1L p 1 1 + ρ0
Z = ln = tan h −1 rij.q+1L p , and ξ 0 = ln = tan h −1 ρ 0 .
ρ ij . q+1K p is 2 1 − rij.q +1L p 2 1 − ρ0
σˆ ij . q +1K p If the absolute value of U is greater than 1.96 , we reject H 0 at α = 0.05 level of
ρˆ ij . q +1K p = ,
(σˆ ii . q +1K p ) (σˆ jj . q +1K p ) significance otherwise accept H 0 . For confidence region when ρ ij.q +1L p is unknown,
then we can write
aij . q +1K p
i.e. rij . q +1K p = is called the sample partial correlation coefficient Pr [−U α / 2 ≤ (n − 3) (Z − ξ ) ≤ U α / 2 ] = 1 − α , where
(a ii . q +1K p ) (a jj . q +1K p )
1 1 + ρ ij.q +1L p
between X i and X j holding X q +1 , L , X p fixed, ξ = ln = tan h −1 ρ ij.q +1L p , then
2 1 − ρ ij.q +1L p
−1
where, (aij . q +1K p ) = A11 − A12 A22 A21 = A11.2 , i.e. aij . q +1K p is the i − th and j − th
 Uα / 2 Uα / 2 
A A12  Pr  Z − ≤ tan h −1 ρ ij.q +1L p ≤ Z +  =1− α
element of A11.2 , and A =  11 .  (n − 3) (n − 3) 
 A21 A22 
  U α / 2   U α / 2 
Distribution of sample partial correlation coefficient or Pr  tan h  Z − ≤ ρ ij.q +1L p ≤ tan h  Z +  =1− α .
  (n − 3)   (n − 3) 
 
The sample partial correlation coefficient between X i and X j holding X q +1 , L , X p fixed
Case II: H 0 : ρ ij.q +1L p = 0 , against H A : ρ ij.q +1L p ≠ 0 . For testing H 0 on the basis of
is defined as
sample of size n − ( p − q ) , we use the following test statistic
aij . q +1K p
rij . q +1K p = , rij.q +1L p
(a ii . q +1K p ) (a jj . q +1K p ) t= n − ( p − q ) − 2 ~ t n −( p − q ) − 2 .
1 − rij2.q +1L p
−1
where, (aij . q +1K p ) = A11 − A12 A22 A21 = A11.2 , i.e. aij . q +1K p is the i − th and j − th
If t > t n−( p −q )−2 (α ) , we reject H 0 at α level of significance otherwise accept H 0 .
A A12 
element of A11.2 , and A =  11 .
 A21 A22  Theorem: Partial correlation coefficient is invariant under the nonsingular linear
transformation.
Let
Proof: We know that
A = ∑ ( x α − x ) ( x α − x ) ′ ~ W p (n − 1, Σ) , then,
α σ 12 − σ 13′ Σ 33
−1
σ 23
ρ12 .3K p ( X ) = ,
−1
A11 − A12 A22 A21 ~ Wq (n − 1 − ( p − q), Σ11.2 ) . (σ 11 − σ 13′ Σ 33
−1
σ 13 ) (σ 22 − σ 23′ Σ 33
−1
σ 23 )
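Numerical illustration (sketch): the two tests described above (Case I through Fisher's z transform, Case II through the t statistic) need only the sample partial correlation and the effective sample size n − (p − q); consistently with the distributional result stated above, n − (p − q) replaces n in the 1/√(n − 3) standard error. The numerical values of r, n − (p − q), ρ_0 and α below are assumptions made only for illustration.

    import numpy as np
    from scipy import stats

    r = 0.42          # sample partial correlation r_{ij.q+1,...,p}  (assumed value)
    n_eff = 30        # effective sample size n - (p - q)            (assumed value)
    rho0, alpha = 0.20, 0.05

    # Case I: H0: rho = rho0, using Fisher's z (tanh^{-1}) transform
    Z, xi0 = np.arctanh(r), np.arctanh(rho0)
    U = (Z - xi0) * np.sqrt(n_eff - 3)
    print("U =", U, "  reject H0:", abs(U) > stats.norm.ppf(1 - alpha / 2))

    # (1 - alpha) confidence interval for rho
    half = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n_eff - 3)
    print("CI:", np.tanh(Z - half), "to", np.tanh(Z + half))

    # Case II: H0: rho = 0, using t = r sqrt(n_eff - 2) / sqrt(1 - r^2)
    t = r * np.sqrt(n_eff - 2) / np.sqrt(1 - r ** 2)
    print("t =", t, "  reject H0:", abs(t) > stats.t.ppf(1 - alpha / 2, df=n_eff - 2))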

σ ′ Solution:
 11 σ 12 σ 13 
where Σ X = σ 12 σ 22 σ 23′  .
 i) We are given that
 
 σ 13 σ 23 Σ 33  ρ ij = ρ , ij = 1, 2, L , p ; i ≠ j , we have
 
−1
Let σ ij − σ ik σ kk σ jk
ρ ij . k =
(3) (3) −1 −1
Y1 = a11 X 1 , a11 ≠ 0 , Y2 = a 22 X 2 , a 22 ≠ 0 , and Y = A33 X , A33 ≠ 0 σ ii − σ ik σ kk σ ik σ jj − σ jk σ kk σ jk

 a11 0 0  1
  σ iσ j ρ ij − σ iσ k ρ ik σ jσ k ρ jk
or Y =  0 a 22 0 X . σ kσ k
=
 0 0 A33  1 1
 σ ii − σ iσ k ρ ik σ iσ k ρ jk σ jj − σ j σ k ρ jk σ jσ k ρ jk
σ kσ k σ kσ k
Assume that E X = 0 , and
σ iσ j ( ρ ij − ρ ik ρ jk ) ρ − ρ2 ρ

 a11 0 0   σ 11 σ 12 σ 13   a11 0 0  = = =
1+ ρ
.
    σ iσ j 1 − ρ ik2 1 − ρ 2jk 1− ρ 2
1− ρ 2
ΣY =  0 a 22 0  σ 12 σ 22 σ 23′   0 a 22 0 
 0  
 0 A33   σ 13 σ 23 Σ 33   0 0 A33  ρ
  Thus every partial correlation coefficient of order 1 is . Similarly,
1+ ρ
 a2 σ a11a 22σ 12 a11σ 13′ A33 
 11 11 2
 ρ   ρ  ρ  ρ 
=  a11a 22σ 12 2
σ 22 a 22 σ 23′ A33    −   1 − 
1 + ρ  1 + ρ 
a 22
  ρ ij.l − ρ ik.l ρ jk.l 1+ ρ  1+ ρ  ρ
 a11σ 13 A33 A33 σ 23 a 22 A33 Σ 33 A33  ρ ij.kl = = = =
  1− ρ ik2 .l 1− ρ 2jk.l  ρ 
2  ρ  ρ  1 + 2ρ
1 −   1 +  1 − 
by the analogy ρ12 .3K p ( X ) 1 + ρ   1+ ρ  1+ ρ 

ρ
a11a 22σ 12 − a11σ 13′ A33 ( A33Σ 33 A33 ) −1 A33 σ 23 a 22 Thus every partial correlation coefficient of order 1 is .
ρ12.3K p (Y ) = 1 + 2ρ
a 2 σ − a σ ′ A ( A Σ A ) −1 A σ a
11 11 11 13 33 33 33 33 33 13 11 The partial correlation coefficient of the highest order in p − variate distribution is p − 2 ,
1 by the method of induction, the every partial correlation coefficient of order p − 2 is
× . ρ
2
a 22 σ 22 − a 22 σ 23′ A33 ( A33Σ 33 A33 ) −1 A33 σ 23a 22 1 + ( p − 2) ρ
. Since ρ ij.( p−1) components ≤ 1 , so that

Taking common a11 and a 22 , then we get − 1 ≤ ρ ij.( p −1) components ≤ 1

σ 12 − σ 13′ Σ 33−1
σ 23 We have on considering the lower limit
ρ12.3K p (Y ) = = ρ12. 3K p ( X ) .
(σ 11 − σ 13′ Σ 33
−1
σ 13 ) (σ 22 − σ 23′ Σ 33
−1
σ 23 ) −1≤
ρ
, or − (1 + pρ − 2 ρ ) ≤ ρ or − 1 ≤ ρ + pρ − 2
1 + ( p − 2) ρ
Note:
1
or − 1 ≤ ρ ( p − 1) or ρ≥− .
1 − ρ12( 2, 3, L, p) = 2
(1 − ρ12 2
) (1 − ρ13 2 2
.2 ) (1 − ρ14.23 ) L (1 − ρ1 p.23L p −1 ) . p −1
ii) We know that
Exercise: If all the total correlation coefficient in a p − variate normal distribution are
equal to ρ ≠ 0 , show that 1 − ρ12(2, 3,L, p) = (1 − ρ12
2 2
) (1 − ρ13 2 2
.2 ) (1 − ρ14.23 ) L (1 − ρ1 p.23L p −1 ) ,

1 ( p − 1) ρ 2 ρ
i) ρ≥− , and ii) ρ12(2, 3,L, p ) = . since ρ12 = ρ , and ρ13.2 = .
p −1 1 + ( p − 2) ρ 1+ ρ

  ρ  
2  (1 + ρ ) 2 − ρ 2  (1 − ρ ) (1 + 2 ρ )
ρ12(2, 3) = (1 − ρ 2 ) 1 −    = (1 − ρ 2 )  = .
  1 + ρ    (1 + ρ ) 2  (1 + ρ )

Similarly

(1 − ρ ) (1 + 2 ρ )   ρ 
2
1 − ρ12( 2, 3, 4) = (1 − ρ12
2 2
) (1 − ρ13 2
.2 ) (1 − ρ14.23 ) =
1 −   
(1 + ρ )   1 + 2 ρ  
 

(1 − ρ ) (1 + 2 ρ )  1 + 3ρ 2 + 4 ρ  (1 − ρ ) (1 + 3ρ )
= = .
(1 + ρ )  (1 + 2 ρ ) 2  (1 + 2 ρ )
 
By the method of induction
 1 + ( p − 1) ρ 
1 − ρ12( 2, 3,L, p ) = (1 − ρ )  
 1 + ( p − 2) ρ 

1 + pρ − 2 ρ − [1 + pρ − 2 ρ − ( p − 1) ρ 2 ] ( p − 1) ρ 2
ρ12(2, 3, L, p) = = .
1 + ( p − 2) ρ 1 + ( p − 2) ρ

HOTELLING'S − T 2 and
ΣY = E [Y − E (Y )][Y − E (Y )]′ = Σ .
If X is univariate normal with mean µ and standard deviation σ , then
Therefore,
n (x − µ) 1 (n − 1) s 2
U=
σ
~ N (0,1) , and V =
2 ∑ ( xi − x ) 2 = 2
~ χ n2−1 , where s 2 is the
Y ~ N p (υ , Σ) , then
σ i σ
sample variance from a sample of size n . If U and V are independently distributed, then
Student's − t is defined as T 2 = Y ′ S −1Y .

n (x − µ) / σ n (x − µ)
Since Σ is positive definite matrix, there exits a nonsingular matrix C , such that
U
t= = = ~ t n−1 .
V / n −1 (n − 1) s 2 /(n − 1) σ 2 s C Σ C′ = I ⇒ C ′C = Σ −1 .
Define,
The multivariate analogue of Student's − t is Hotelling's T 2 .
If x α (α = 1, 2, L , n) is an independent sample of size n from N p ( µ , Σ ) and, if x is the Y * = CY , S * = C S C ′ , and υ * = Cυ , then

sample mean vector, S the matrix of variance covariance, then the Hotelling's − T 2 is E (Y * ) = C n E ( x − µ ) = C n ( µ − µ ) = Cυ = υ *
0 0
defined by the relation
and
T 2 = n ( x − µ ) ′ S −1 ( x − µ ) .
Σ = E [Y * − E (Y * )][Y * − E (Y * )]′ = C Σ C ′ = I .
Y*
Result:
Therefore,
Let the square matrix A be partitioned as
Y * ~ N p (υ * , I ) , then
 A11 A12 
A =   , so that A22 is square. If A22 is nonsingular, let
 A21 A22  ′ −1
T 2 = Y * S* Y *
 I − A12 A −1   −1
0   A11.2 0 
C = 22  , then CAC ′ =  A11 − A12 A22 A21 =  
and
0
 I 


 0 A22   0 A22  n−1 ′
(n − 1) S * = ∑ Z α* Z *α , where Z α
*
= C Z α ~ N (0, I ) .
A 0  −1 α =1
⇒ A = C −1  11.2  C ′ , then
 0 A22  Now consider a random orthogonal matrix Q of order p × p
−1  −1 −1 −1 
A 0  A11 − A11 .2 A12 A22
 Y* Y2* Y p* 
A −1 = C ′  11.2  C = .2 .  1 
 0 A22   − A A A −1
−1 −1 −1 −1
A22 A21 A11.2 A12 A22 + −1 
A22  ′ ′
L

 22 21 11.2   Y Y*
* *′ * 
Y* Y* Y Y
Q= q q 22 L q2 p  .
Distribution of Hotelling's − T 2  21 
 M M L M 
Let x1 , L , x n be an independent sample drawn from N p ( µ , Σ) , then  
 q p1 q p2 L q pp 
n −1
A = (n − 1) S = ∑ Z α Z α ′ with Z α independent, each with distribution N p (0, Σ) . Let
α =1
U = Q Y * , be an orthogonal transformation, also B = (bij ) = Q (n − 1) S *Q ′ .
By definition,
So that
T 2 = n ( x − µ ) ′ S −1 ( x − µ ) .
2 2 2 ′
Let Y1* + Y2* + K + Y p* Y* Y* ′
U1 = = = Y* Y*
Y = n ( x − µ ) , then E (Y ) = n E ( x − µ ) = n ( µ − µ ) = υ (say) *′ * *′ *
0 0 0 Y Y Y Y

and ⇒ b11.2 is a χ 2 with (n − p) degree of freedom, but the conditional distribution of b11.2
p ′ p
Yi* does not depend upon Q . Therefore b11.2 is unconditionally distributed as a χ 2 with
U j = ∑ q j i Yi* = Y * Y * ∑ q ji , j = 2, 3,K , p
′ (n − p) degree of freedom.
i =1 i =1 Y* Y*
=0 ∀ j = 2, 3,K , p , by using the property of an orthogonal matrix. Since Y * is distributed according to N p (υ * , I )

Thus, ′ ′ ′
⇒ Y * Y * ~ χ 2p (υ * υ * ) , where υ * υ * = υ ′ Σ −1 υ = λ 2 (say).
′ −1
T 2 = Y * S * Y * = (Q −1U )′ S *−1 (Q −1U ) = U ′ (Q S *Q ′) −1U
T2
Thus the is the ratio of a noncentral χ 2 with p degree of freedom to an independent
= (n − 1)U ′ [Q (n − 1) S *Q ′]−1U = (n − 1) U ′ B −1U . n −1
χ 2 with (n − p) degree of freedom, i.e.
 b11 b12 L b1 p   U1 
  
T2  b 21 b 22 L b 2 p   0  T2 n− p χ 2p (λ2 ) / p
⇒ = U ′ B −1U = (U 1 0 L 0)    = ~ F p , n − p (λ 2 ) .
n −1  M M M M  M  n −1 p χ n2− p /(n − p )
 b p1 b p 2 L b pp   0 
 
If µ = µ , then the F − distribution is central.
0
= U12 b11 , where (b ij ) = B −1 .
Alternative proof
We know that if the square matrix say B is partition as
By definition,
b b12′ 
B =  11 , and if B22 is nonsingular, then T2
b B22  T 2 = n ( x − µ )′ S −1 ( x − µ ) or = n ( x − µ 0 )′ A −1 ( x − µ 0 ) .
 12
n −1
 −1
b11 −1
− b11 −1
.2 b12 B22

B −1 =  .2 . Let
 − B −1 b b −1 −1 −1 ′ −1
B22 b 21 b11.2 b12 B22 + −1 
B22
 22 21 11.2  d = n ( x − µ ) , ⇒ d ~ N p (0 , Σ)
0
Thus,
′ ⇒ d d ′ = Q ~ W p (1, Σ) (6.1)
Y* Y*
U 12 b11 = , and A ~ W p (υ , Σ) (6.2)
b11.2
1 1 Now
where b11 = = .
b11.2 b − b ′ B −1 b
11 12 22 12 1 d′
= A 1 + d ′ A −1 d or A + d d′
Therefore, −d A

T2 Y* Y* A 1 1 1
= . ⇒ = = = (6.3)
A + d d′ ′ −1
n − 1 b11.2 −1
1+ d A d ′
1 + n (x − µ 0 ) A (x − µ 0 ) T2
1+
But n −1
n −1 n−1 1 A
B = Q (n − 1) S *Q ′ = ∑ (Q Z *α ) (Q Z α* ) ′ = ∑ V α V α ′ , We determine the distribution Φ =
T 2
=
A+Q
.
α =1 α =1 1+
n −1
*
where, V α = Q Z α ~ N p (0, I ) , for given Q .
A and Q are independent as Q is based on x . Thus from (6.1), (6.2) and the additive
n−1−( p −1) property of wishart distribution, we have
⇒ b11.2 is distributed as ∑ wα2 , where wα are independent N (0, 1) .
A + Q ~ W p ( n, Σ )
α =1

We find the r − th moment of Φ  υ + 2r   υ + 2r − 1   υ + 2 r − p + 2   υ + 2r − p + 1 


Γ Γ LΓ  Γ 
 2   2   2   2 
r r
E (Φ ) = ∫ Φ f ( A, Q) dA dQ =∫ Φ f ( A) f ( Q ) dA dQ r =
υ  υ − 1 υ − p + 2  υ − p + 1
Γ Γ LΓ  Γ 
 C ( p,υ )
r  2  2   2   2 
A
 (υ − p −1) / 2  1 −1  
=∫ A exp  − tr A Σ  f (Q) dA dQ ,
r   υ + 1 υ  υ − p + 3  υ − p + 2 
A + Q  Σ
υ/2  2
 Γ  Γ  L Γ  Γ 
×  2  2  2   2 
where,  υ + 2r + 1   υ + 2r   υ + 2r − p + 2   υ + 2r − p + 2 
Γ Γ LΓ  Γ 
1  2   2   2   2 
C ( p ,υ ) = .
p
υ − i + 1  υ + 2r − p + 1   υ + 1 
2υ p / 2 (π ) p ( p −1) / 4 ∏ Γ   Γ Γ 
i =1  =    2 
2  2
(6.5)
 υ − p + 1   υ + 2r + 1 
(υ + 2r − p−1) / 2 Γ Γ 
−υ / 2 A  1   2   2 
E (Φ r ) = C ( p , υ ) Σ ∫ exp  − tr A Σ −1  f (Q ) dA dQ
r  2 
A+Q We compare this r − th moment with the r − th moment of Beta one, the density function of
β1 (l , m) is given by
C ( p, υ ) Σ
−υ / 2  C ( p,υ + 2r )
1
−(υ + 2r ) / 2 ∫
1
=
r  f ( x : l , m) = x l −1 (1 − x) m−1 , 0 < x < 1 .
A + Q  Σ
(υ +2 r ) / 2
C ( p,υ + 2r ) Σ β (l , m)

1 −1  1 1 r l −1 β (r + l , m) Γ (r + l ) Γ (m) Γ(l + m)
(1 − x) m −1 dx =
β (l , m) ∫0
(υ + 2 r − p −1) / 2 − 2 tr A Σ  E (x r ) = x x =
A e  f (Q ) dA dQ (6.4) β (l , m) Γ (r + l + m) Γ(l ) Γ(m)

 Γ(r + l ) Γ (l + m)
=
Note that expression inside the bracket is W p (υ + 2r , Σ) . Γ(r + l + m) Γ(l )

Let A + Q = U , we will now integrate over the constant surface of A + Q = U , from the So if we write
additive property of wishart distribution υ − p +1 p υ +1
l= , and m = , ⇒ l+m=
1 −1
2 2 2
r C ( p, υ ) C ( p,υ + 2r + 1) (υ + 2r +1− p −1) / 2 − 2 tr U Σ
E (Φ ) =
C ( p, υ + 2r ) Σ
−r u ∫ U
r
Σ
(υ + 2r +1) / 2
U e du υ − p + 1 p 
⇒ Φ ~ β1  , .
 2 2
1
C ( p,υ ) C ( p,υ + 2r + 1) C ( p,υ + 1) (υ +1− p −1) / 2 − 2 tr U Σ
−1 υ * υ*  1 − X υ1*
= ∫ U e du We know that if X ~ β1  1 , 2  , then × ~ Fυ * , υ * .
C ( p,υ + 2r ) C ( p,υ + 1) U r (υ +1) / 2  2 2  X υ 2* 2 1
Σ  

p So that
 υ + 2r − i + 1  (υ +1) p / 2
2 (υ + 2r ) p / 2 (π ) p ( p −1) / 4 ∏ Γ  2 1− Φ υ − p +1
i =1  
2 × ~ F p,υ − p +1
= Φ p
p
 υ − i + 1  (υ + 2 r +1) p / 2
2υp / 2 (π ) p ( p −1) / 4 ∏ Γ  2 1
i =1 
2  1−
1 + T 2 / (n − 1) n − p
p ⇒ × ~ F p, n− p
υ + 1− i + 1 1 p
(π ) p ( p −1) / 4 ∏ Γ  
i =1 
2  1 + T 2 / (n − 1)
×
p
 υ + 2r + 1 − i + 1  T2 n− p
(π ) p ( p −1) / 4 ∏ Γ   ⇒ × ~ F p, n− p .
i =1  
2 n −1 p

Properties of Hotelling’s − T 2 Hence,

n np / 2  1 
T 2 − Statistic as a function of likelihood ratio criterion max Lω = exp  − np 
(2π ) np / 2
A + n ( x − µ ) ( x − µ )′
n/2  2 
Let x α (α = 1, 2, L , n > p ) be a random sample of size n from N p ( µ , Σ) . The likelihood 0 0

function is Thus, the likelihood ratio criterion is


 1  n/2
1 A
L( µ , Σ) = exp − ∑ ( x α − µ ) ′ Σ −1 ( x α − µ ) λ=
np / 2 n/2 2 n/2
(2π ) Σ  α  A + n ( x − µ ) ( x − µ )′
0 0
and the likelihood ratio criterion
A A
max Lω ( µ , Σ) max L0 or λ 2 / n = = ,
λ= = . 1 − n ( x − µ 0 )′ A 1 + n ( x − µ ) ′ A −1 ( x − µ )
max LΩ (µ , Σ) max L 0 0
n (x − µ ) A
0
In the parameter space Ω , the maximum of L occurs when the parameters µ and Σ are
estimated by their maximum likelihood estimators i.e. µ̂ = x , and Σˆ = A / n . In the space ω , since Σ = Σ 22 Σ11 − Σ12 Σ −221 Σ 21 .
1
we have µ = µ , and Σˆ = ∑ ( x α − µ 0 ) ( x α − µ 0 )′ , therefore, =
1
0 n α
1 + n ( x − µ 0 )′ A −1 ( x − µ 0 )
1  1  A
−1 
max LΩ = exp − ∑ ( x α − x ) ′   ( x α − x ) =
1
=
1
,
A
n/2  2 α n  n −1 2
(2π ) np / 2 1+ ( x − µ 0 ) ′S ( x − µ 0 ) 1 + T
n n −1 n −1

n np / 2  1  where, T 2 = n ( x − µ ) ′ S −1 ( x − µ ) = n (n − 1) ( x − µ ) ′ A −1 ( x − µ ) .
= exp  − np  . 0 0 0 0
(2π ) np / 2 A
n/2  2 
The likelihood ratio test is defined by the critical region λ ≤ λ0 , where, λ0 is so chosen so as
Similarly, to have level α , i.e. Pr [λ ≤ λ 0 H 0 ] = α .
1 Thus
max Lω =
n/2
1 T2
(2π ) np / 2 ∑ ( x − µ 0 ) ( x α − µ 0 )′
n α α λ2 / n ≤ λ02 / n , or
1
≤ λ02 / n , or 1 + ≥ λ−0 n / n ,
2 n −1
1 + T /(n − 1)
 −1

1    or T 2 ≥ (n − 1) (λ−0 2 / n − 1) = T02 , (say)
exp − tr n ∑ ( xα − µ 0 ) ( x α − µ 0 )′ ∑ ( x α − µ 0 ) ( x α − µ 0 ) ′
 2  α   α 
  ⇒ T 2 ≥ T02 .

n np / 2  1  Therefore, Pr [T 2 ≥ T02 H 0 ] = α .
= exp  − np  .
n/2  2 
(2π ) np / 2 ∑ ( xα − µ ) ( x α − µ )′
0 0 Invariance property of T 2
α

Consider Let X ~ N p ( µ , Σ) , then T x2 = n ( x − µ 0 x )′ S x−1 ( x − µ 0 x ) ,

∑ ( xα − µ 0 ) ( xα − µ 0 ) ′ = ∑ [( x α − x ) + ( x − µ 0 )][( x α − x ) + ( x − µ 0 )]′ where


α α
1 1
= A + n ( x − µ ) ( x − µ )′ .
0 0
Sx = ∑ ( x − x ) ( xα − x )′ = n − 1 ( x1 x1′ + L + x n x n′ − n x x ′ )
n −1 α α

Make a non-singular transformation Let ( x − µ ) = d , then solve


0
Y = C X , ⇒ y = C xα
α Sλ = d (by Doolittle method) ⇒ λ = S −1 d , and T 2 = n d ′ λ .
Now
A d 3
Let A be a matrix of order 3 , then augmented matrix, say B =  ′ 
1 1 0  1
Sy = ∑ ( y − y ) ( y α − y )′ = n − 1 ( y1 y1′ + L + y n y n′ − n y y ′ )
n −1 α α
d
 −1
1 ′ −1 A B  A D − CA B
= (C x1 x1′ C ′ + L + C x n x n ′ C ′ − n C x x ′ C ′) B = A 0 − d A d , since =
n −1 C D  D A − BD −1C

 1 
=C ( x1 x1′ + L + x n x n ′ − n x x ′ ) C ′ B
n −1  ⇒ d ′ A −1 d = − ,
A
= C S x C′.
where
By definition
B = a11 b11 c11 d11 (Doolittle method), A = a11 b11 c11 , then d ′ A −1 d = − d11 .
T y2 = n ( y − µ 0 y )′ S −y 1 ( y − µ 0 y ) = n (C x − C µ 0 x ) ′ (C S x C ′) −1 (C x − C µ 0 x )
Note:
= n ( x − µ 0 x )′ C ′C ′ −1S x−1C −1C ( x − µ 0 x ) Sometimes we are given the S S and C P matrix A , then we solve

= n ( x − µ 0 x )′ S x−1 ( x − µ 0 x ) = Tx2 . Aλ * = d (by Doolittle method) ⇒ λ * = A −1 d and T 2 = n (n − 1) d ′ λ * .

Two sample problem


Uses of T 2 − statistic
Let x1(i) , L , x (ni) be a random sample from N p ( µ (i ) , Σ) , i = 1, 2 , where the variance
One sample problem covariance matrices are assumed equal but unknown. The hypothesis of interest is
Let x1 , L , x n be a random sample from N p ( µ , Σ) , when the variance covariance matrix Σ H 0 : µ (1) = µ ( 2) .
is unknown. Suppose we are required to test
Let
H 0 : µ = µ (specific mean vector). n
0 1 i (i )
Let
x (i ) = ∑ x be the sample mean vector, and
ni α =1 α
Y = n ( x − µ ) , then, EY = 0 , under H 0 , and Σ Y = Σ . Thus, Y ~ N p (0 , Σ) and 
0  1 1  
x (i ) ~ N p ( µ (i ) , Σ / ni ) , then, x (1) − x (2) ~ N p 0,  +  Σ , under H 0 .
n n −1   n1 n 2  
(n − 1) S = ∑ ( xα − x ) ( xα − x ) ′ = A = ∑ Z α Z α ′ , with Z α ~ N p (0, Σ) .
α =1 α =1 n1n2
⇒ ( x (1) − x (2) ) ~ N p (0, Σ) .
Therefore, by definition n1 + n 2

T2 n− p Let
T 2 = Y ′ S −1Y , and ~ F p, n − p .
n −1 p n1n2
Y= ( x (1) − x (2) ) , then,
Thus adopting a significance level of size α , the null hypothesis is rejected if n1 + n 2

(n − 1) p n1n2
T 2 ≥ T02 , where, T02 = F p, n− p (α ) . EY = 0 , under H 0 , and ΣY = E ( x (1) − x (2) ) ( x (1) − x ( 2) )′ = Σ .
n− p n1 + n 2

From the sample T 2 is computed easily as follows: Therefore, Y ~ N p (0 , Σ) .


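Numerical illustration (sketch): rather than Doolittle pivoting, the one-sample T² above can be computed with a linear solve, and converted to an F statistic through T² (n − p) / ((n − 1) p) ~ F_{p, n−p}. The data and μ_0 below are illustrative assumptions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n, p = 25, 3
    mu0 = np.array([0.0, 0.0, 0.0])
    x = rng.multivariate_normal(mu0 + 0.4, np.eye(p), size=n)   # illustrative data

    xbar = x.mean(axis=0)
    S = np.cov(x, rowvar=False)                  # divisor n-1
    d = xbar - mu0

    T2 = n * d @ np.linalg.solve(S, d)           # T^2 = n (xbar-mu0)' S^{-1} (xbar-mu0)
    F = T2 * (n - p) / ((n - 1) * p)             # ~ F_{p, n-p} under H0
    pval = 1 - stats.f.cdf(F, p, n - p)
    print("T^2 =", T2, "  F =", F, "  p-value =", pval)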

Let and the variance covariance matrix is


ni    
1 1 1
S (i ) = ∑ ( x (i) − x (i ) ) ( xα(i ) − x (i ) )′
ni − 1 α =1 α
and Var Cov  ∑ β i x (i )  = ∑ β i2 Var Cov ( x (i ) ) =  ∑ β i2  Σ = Σ .
   ni  C
 i  i  i 
2 i n Let
1
S= ∑ ∑ ( x (i) − x (i ) ) ( xα(i ) − x (i ) )′
n1 + n2 − 2 i =1α =1 α  
Y = C  ∑ β i x (i ) − µ  , then, Y ~ N p (0, Σ) .
 
(n1 − 1) S (1) + (n2 − 1) S ( 2)  i 
= be the pooled sample variance covariance matrix.
n1 + n 2 − 2 Consider
(1) ( 2)
or (n1 + n2 − 2) S = (n1 − 1) S + (n2 − 1) S 1 i n

n1 + n 2 − 2
S (i ) = ∑ ( x (i ) − x (i ) ) ( x α(i ) − x (i) )′
ni − 1 α =1 α
= A(1) + A (2) = ∑ Z α Z α ′ , with Z α ~ N p (0, Σ) .
α =1 and
n1 + n2 − 2 q ni
1
Hence, (n1 + n2 − 2) S is distributed as ∑ Zα Zα′ . S=
q ∑ ∑ ( xα(i) − x (i ) ) ( xα(i ) − x (i ) )′
α =1
∑ ni − q i =1 α =1

Therefore, by definition i =1

nn  q  q ni
T 2 = Y ′ S −1 Y = 1 2 ( x (1) − x (2) ) ′S −1 ( x (1) − x ( 2) )  n − qS =
n1 + n 2  ∑ i  ∑ ∑ ( xα(i) − x (i ) ) ( xα(i) − x (i ) ) ′
 i =1  i =1 α =1
and
∑ ni −q
T2 n1 + n2 − 2 − ( p − 1) i
~ F p, n1 + n 2 − p −1 . = ∑ Z α Z α ′ , where, Z α ~ N p (0, Σ)
n1 + n2 − 2 p α =1

Thus adopting a significance level of size α , the null hypothesis is rejected if T 2 ≥ T02 , Therefore, by definition
(n1 + n 2 − 2) p ′
where, T02 = F p , n1 + n2 − p −1 (α ) .    
n1 + n2 − p − 1 T 2 = Y ′ S −1 Y = C  ∑ β i x (i ) − µ  S −1  ∑ β i x (i ) − µ  is distributed as T 2 with
   
 i   i 
q − sample problem ∑ in − q degree of freedom, i.e.
i
Let x α(i ) (α = 1, 2, L , ni ; i = 1, 2, L , q) be a random sample from N p ( µ (i ) , Σ) respectively.
Suppose we are required to test ∑ ni − q − ( p − 1)
T2 i
~ F p, ∑ ni −q − p +1 .
q
H 0 : ∑ βi µ (i )
= µ , where β1 , L , β q are given scalars and µ is given vector.
∑ ni − q p i
i
i =1

Let Thus adopting a significance level of size α , the null hypothesis is rejected if T 2 ≥ T02 ,
n where
1 i (i )
x (i ) = ∑ x be the sample mean vector, and x (i ) ~ N p ( µ (i ) , Σ / ni ) , then,
ni α =1 α  
 ∑ ni − q  p
 
 
T02 =  
q
 1  i
∑ β i x (i) ~ N p  µ , Σ  , under H 0 , where E  ∑ β i x (i )  = µ F p, n −q − p+1 (α ) .
i =1  C  
 i

 ∑ ni − q − p + 1 ∑i i
i
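Numerical illustration (sketch): the two-sample statistic above, built on the pooled covariance matrix, with the corresponding F transformation T² (n1 + n2 − p − 1) / ((n1 + n2 − 2) p) ~ F_{p, n1+n2−p−1}. The two samples below are illustrative assumptions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    p, n1, n2 = 3, 20, 25
    x1 = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n1)
    x2 = rng.multivariate_normal(np.zeros(p) + 0.5, np.eye(p), size=n2)

    d = x1.mean(axis=0) - x2.mean(axis=0)
    S_pool = ((n1 - 1) * np.cov(x1, rowvar=False) +
              (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)

    T2 = (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S_pool, d)
    F = T2 * (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p)      # ~ F_{p, n1+n2-p-1} under H0
    pval = 1 - stats.f.cdf(F, p, n1 + n2 - p - 1)
    print("T^2 =", T2, "  F =", F, "  p-value =", pval)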

A problem of symmetry Behrens-Fisher problem

Given a random sample x1 , L , x n from N p ( µ , Σ) , where µ ′ = ( µ1 , µ 2 ,L , µ p ) . The Let x α(i ) (α = 1, 2, L , ni ; i = 1, 2) be random sample from N p ( µ (i ) , Σ i ) . Hypothesis of
hypothesis of interest is H 0 : µ1 = µ 2 = L = µ p . interest is H 0 : µ (1) = µ ( 2) . The mean x (1) of the first sample is normally distributed with
expected value
Let C be a matrix of order ( p − 1) × p such that C η = 0 , where η ′ = (1,1, L ,1) .
E ( x (1) ) = µ (1) , and covariance matrix
A matrix satisfying this condition is
1 0 L 0 − 1 1
  E ( x (1) − µ (1) ) ( x (1) − µ (1) )′ = Σ1 , i.e. x (1) ~ N p ( µ (1) , Σ1 / n1 ) .
1 L 0 − 1 n1
0
C = ,
M M M M M  Similarly,
 
0 0 L 1 − 1 ( p −1)× p
 x ( 2) ~ N p ( µ ( 2 ) , Σ 2 / n 2 ) .
where, C is called a contrast matrix. Thus,
Let
  1 1 
y = C x α , then, ( x (1) − x (2) ) ~ N p  µ (1) − µ (2) ,  Σ1 + Σ 2  .
α  n
 1 n 2 
Ey = C µ , and Σ y = E [ y − E ( y )][ y − E ( y )]′ = CΣC ′ , If n1 = n 2 = n
α α α α α α

with this transformation we can write Let y = x α(1) − x α(2) , (assuming the numbering of the observations in the two samples is
α
1 0 L 0 − 1  µ1  µ1 − µ p = 0 independent of the observations themselves), then y ~ N p (0, Σ1 + Σ 2 ) under H 0 .
    α
0 1 L 0 − 1  µ2  µ2 − µ p = 0
H 0 : C µ = 0 , i.e.   M =0, or 1 n  Σ +Σ 
M M M M M   
M ⇒ y= ∑ y = ( x (1) − x (2) ) ~ N p  0, 1 n 2  ⇒ n y ~ N p (0, Σ1 + Σ 2 ) .
  n α =1 α
 0 0 L 1 − 1 ( p−1)× p  µ p  µ p −1 − µ p = 0
Let
Therefore,
1 n
y ~ N p −1 (0 , C Σ C ′) under H 0 , then
α Sy = ∑ ( y − y ) ( y α − y )′
n − 1 α =1 α
 C ΣC′ 
y ~ N p −1  0,  ⇒ n y ~ N p −1 (0, C Σ C ′) . n−1
 n  or (n − 1) S y = ∑ Z α Z α ′ , where, Z α ~ N p−1 (0, Σ1 + Σ 2 ) .
Let α =1

1 n
1 n Thus, by definition, T 2 = n y ′ S y−1 y has T 2 − distribution with (n − 1) degrees of freedom.
Sy = ∑ ( y − y ) ( y α − y )′ = n − 1 ∑ C ( xα − x ) ( xα − x )′C ′ = CSC ′
n − 1 α =1 α α =1 (n − 1) p
The critical region is T 2 ≥ F p, n− p (α ) .
n−1 ′ n− p
(n − 1) S y = ∑ Z α* Z α* *
, with Z α ~ N p −1 (0, C ΣC ′) .
If n1 ≠ n 2 , and n1 < n 2 .
α =1
Thus, by definition, Define,
n1 n2
T 2 = n y ′ S y−1 y is distributed as T 2 with (n − 1) degree of freedom, and the critical region of n1 (2) 1 1
y
α
= x α(1) − x +
n2 α
∑ x (β2) − n ∑ x γ(2) , α = 1, 2, L, n1 , then
n1n 2 β =1 2 γ =1
size α for testing H 0 : C µ = 0 is
n1 n
n1 ( 2) 1 1 2 (2)
T2 ≥
(n − 1) ( p − 1)
F p −1, n − p +1 (α ) . Ey
α
= µ (1) −
n2
µ + ∑
n1n 2 β =1
µ ( 2) − ∑
n 2 γ =1
µ
n − 1 − ( p − 2)
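Numerical illustration (sketch): the symmetry test above (H_0: μ_1 = ⋯ = μ_p through a contrast matrix C with Cη = 0) translates directly into code, with T² = n ȳ′ S_y^{−1} ȳ and T² (n − p + 1) / ((n − 1)(p − 1)) ~ F_{p−1, n−p+1}. The data below are an illustrative assumption.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    n, p = 30, 4
    x = rng.multivariate_normal([5.0, 5.2, 4.9, 5.1], np.eye(p) * 0.5, size=n)

    # contrast matrix C of order (p-1) x p with C 1 = 0  (each row: e_i - e_p)
    C = np.hstack([np.eye(p - 1), -np.ones((p - 1, 1))])

    y = x @ C.T                                   # y_a = C x_a
    ybar = y.mean(axis=0)
    Sy = np.cov(y, rowvar=False)

    T2 = n * ybar @ np.linalg.solve(Sy, ybar)
    F = T2 * (n - p + 1) / ((n - 1) * (p - 1))    # ~ F_{p-1, n-p+1} under H0
    pval = 1 - stats.f.cdf(F, p - 1, n - p + 1)
    print("T^2 =", T2, "  F =", F, "  p-value =", pval)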

The covariance matrix of $y_\alpha$ and $y_\beta$ is

$$E[y_\alpha - E(y_\alpha)][y_\beta - E(y_\beta)]' = E\left[(x_\alpha^{(1)} - \mu^{(1)}) - \sqrt{\tfrac{n_1}{n_2}}(x_\alpha^{(2)} - \mu^{(2)}) + \tfrac{1}{\sqrt{n_1 n_2}}\sum_{\gamma=1}^{n_1}(x_\gamma^{(2)} - \mu^{(2)}) - \tfrac{1}{n_2}\sum_{\gamma=1}^{n_2}(x_\gamma^{(2)} - \mu^{(2)})\right]\left[(x_\beta^{(1)} - \mu^{(1)}) - \sqrt{\tfrac{n_1}{n_2}}(x_\beta^{(2)} - \mu^{(2)}) + \tfrac{1}{\sqrt{n_1 n_2}}\sum_{\gamma=1}^{n_1}(x_\gamma^{(2)} - \mu^{(2)}) - \tfrac{1}{n_2}\sum_{\gamma=1}^{n_2}(x_\gamma^{(2)} - \mu^{(2)})\right]'.$$

For $\alpha = \beta$ this gives

$$= \Sigma_1 + \Sigma_2\left(\frac{n_1}{n_2} + \frac{1}{n_2} + \frac{1}{n_2} - \frac{2}{n_2} + \frac{2}{n_2}\sqrt{\frac{n_1}{n_2}} - \frac{2}{n_2}\sqrt{\frac{n_1}{n_2}}\right) = \Sigma_1 + \frac{n_1}{n_2}\Sigma_2,$$

while for $\alpha \ne \beta$ the terms cancel, so the $y_\alpha$ are independently distributed. Hence,

$$y_\alpha \sim N_p\left(0,\ \Sigma_1 + \frac{n_1}{n_2}\Sigma_2\right) \text{ under } H_0, \quad \text{and} \quad \bar{y} = \frac{1}{n_1}\sum_{\alpha=1}^{n_1} y_\alpha \sim N_p\left(0,\ \frac{1}{n_1}\Big(\Sigma_1 + \frac{n_1}{n_2}\Sigma_2\Big)\right) \;\Rightarrow\; \sqrt{n_1}\,\bar{y} \sim N_p\left(0,\ \Sigma_1 + \frac{n_1}{n_2}\Sigma_2\right).$$

Consider

$$(n_1-1)S = \sum_{\alpha=1}^{n_1}(y_\alpha - \bar{y})(y_\alpha - \bar{y})' = \sum_{\alpha=1}^{n_1-1} Z_\alpha Z_\alpha', \quad \text{with } Z_\alpha \sim N_p\left(0,\ \Sigma_1 + \frac{n_1}{n_2}\Sigma_2\right).$$

Therefore, by definition, $T^2 = n_1\,\bar{y}' S^{-1}\bar{y}$. This statistic has the $T^2$-distribution with $(n_1-1)$ degrees of freedom, and the critical region of size $\alpha$ is

$$T^2 \ge \frac{(n_1-1)p}{n_1-p}\, F_{p,\, n_1-p}(\alpha).$$

Behrens-Fisher problem for $q$ samples

Suppose $x_\alpha^{(i)}$ $(\alpha = 1, 2, \ldots, n_i;\ i = 1, 2, \ldots, q)$ are random samples from $N_p(\mu^{(i)}, \Sigma_i)$ respectively. Consider testing the hypothesis

$$H_0: \sum_{i=1}^{q} \beta_i \mu^{(i)} = \mu,$$

where $\beta_1, \beta_2, \ldots, \beta_q$ are given scalars and $\mu$ is a given vector. If the $n_i$ are unequal, take $n_1$ to be the smallest. Define

$$y_\alpha = \beta_1 x_\alpha^{(1)} + \sum_{i=2}^{q} \beta_i\left[\sqrt{\frac{n_1}{n_i}}\, x_\alpha^{(i)} - \frac{1}{\sqrt{n_1 n_i}}\sum_{\beta=1}^{n_1} x_\beta^{(i)} + \frac{1}{n_i}\sum_{\gamma=1}^{n_i} x_\gamma^{(i)}\right], \quad \alpha = 1, 2, \ldots, n_1.$$

The expected value of $y_\alpha$ is

$$E(y_\alpha) = \beta_1\mu^{(1)} + \sum_{i=2}^{q}\beta_i\left[\sqrt{\frac{n_1}{n_i}}\,\mu^{(i)} - \frac{n_1}{\sqrt{n_1 n_i}}\,\mu^{(i)} + \mu^{(i)}\right] = \beta_1\mu^{(1)} + \sum_{i=2}^{q}\beta_i\mu^{(i)} = \sum_{i=1}^{q}\beta_i\mu^{(i)},$$

and the variance covariance matrix of each $y_\alpha$ is

$$E[y_\alpha - E(y_\alpha)][y_\alpha - E(y_\alpha)]' = \sum_{i=1}^{q}\frac{n_1\beta_i^2}{n_i}\,\Sigma_i.$$

So that

$$y_\alpha \sim N_p\left(\mu,\ \sum_{i=1}^{q}\frac{n_1\beta_i^2}{n_i}\Sigma_i\right) \text{ under } H_0, \quad \text{and} \quad \bar{y} = \frac{1}{n_1}\sum_{\alpha=1}^{n_1}y_\alpha \sim N_p\left(\mu,\ \frac{1}{n_1}\sum_{i=1}^{q}\frac{n_1\beta_i^2}{n_i}\Sigma_i\right) \;\Rightarrow\; \sqrt{n_1}(\bar{y} - \mu) \sim N_p\left(0,\ \sum_{i=1}^{q}\frac{n_1\beta_i^2}{n_i}\Sigma_i\right).$$

Consider

$$(n_1-1)S = \sum_{\alpha=1}^{n_1}(y_\alpha - \bar{y})(y_\alpha - \bar{y})'.$$

Therefore, $T^2 = n_1(\bar{y} - \mu)'S^{-1}(\bar{y} - \mu)$ has the $T^2$-distribution with $(n_1-1)$ degrees of freedom. The critical region is $T^2 \ge \dfrac{(n_1-1)p}{n_1-p}\,F_{p,\,n_1-p}(\alpha)$, with $\alpha$ level of significance.
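For the two-sample case with $n_1 < n_2$, the scaled-difference construction above translates directly into code. The following is a minimal sketch, assuming NumPy and SciPy; the function name and the assumption that the first $n_1$ observations of the second sample are used in the pairing are illustrative choices only.

```python
import numpy as np
from scipy import stats

def behrens_fisher_T2(X1, X2):
    """Two-sample test of H0: mu1 = mu2 when Sigma1 != Sigma2, via the variables
    y_alpha = x_alpha^(1) - sqrt(n1/n2) x_alpha^(2)
              + (1/sqrt(n1 n2)) sum_{beta<=n1} x_beta^(2) - (1/n2) sum_{gamma<=n2} x_gamma^(2)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    assert n1 <= n2
    c = np.sqrt(n1 / n2)
    Y = (X1 - c * X2[:n1]
         + X2[:n1].sum(axis=0) / np.sqrt(n1 * n2)
         - X2.mean(axis=0))
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)
    T2 = n1 * ybar @ np.linalg.solve(S, ybar)
    F = T2 * (n1 - p) / ((n1 - 1) * p)          # F_{p, n1-p} under H0
    return T2, F, stats.f.sf(F, p, n1 - p)
```

Note that when $n_1 = n_2$ the construction reduces to the ordinary paired differences $x_\alpha^{(1)} - x_\alpha^{(2)}$, so the same function covers both cases.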

Test for equality of the population means of two sub-vectors

Let $x = \begin{pmatrix} x^{(1)} \\ x^{(2)} \end{pmatrix}$ be distributed normally with mean $\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, where both $x^{(1)}$ and $x^{(2)}$ are $q$-dimensional vectors (each of $q$ components). The hypothesis of interest is $H_0: \mu^{(1)} = \mu^{(2)}$. Let $y = x^{(1)} - x^{(2)}$; then the expectation of $y$ is $E(y) = E(x^{(1)} - x^{(2)}) = \mu^{(1)} - \mu^{(2)}$ and the variance covariance matrix of $y$ is

$$E[y - E(y)][y - E(y)]' = E[(x^{(1)} - x^{(2)}) - (\mu^{(1)} - \mu^{(2)})][(x^{(1)} - x^{(2)}) - (\mu^{(1)} - \mu^{(2)})]'$$
$$= E[(x^{(1)} - \mu^{(1)})(x^{(1)} - \mu^{(1)})' - (x^{(1)} - \mu^{(1)})(x^{(2)} - \mu^{(2)})' - (x^{(2)} - \mu^{(2)})(x^{(1)} - \mu^{(1)})' + (x^{(2)} - \mu^{(2)})(x^{(2)} - \mu^{(2)})']$$
$$= \Sigma_{11} - \Sigma_{12} - \Sigma_{21} + \Sigma_{22}.$$

Thus $y \sim N_q(\mu^{(1)} - \mu^{(2)},\ \Sigma_{11} - \Sigma_{12} - \Sigma_{21} + \Sigma_{22})$. For a random sample $x_1, \ldots, x_n$ of such vectors, with $y_\alpha = x_\alpha^{(1)} - x_\alpha^{(2)}$ and $\bar{y} = \bar{x}^{(1)} - \bar{x}^{(2)}$,

$$\bar{y} \sim N_q\left(\mu^{(1)} - \mu^{(2)},\ \frac{\Sigma_{11} - \Sigma_{12} - \Sigma_{21} + \Sigma_{22}}{n}\right) \;\Rightarrow\; \sqrt{n}\,\bar{y} \sim N_q(0,\ \Sigma_{11} - \Sigma_{12} - \Sigma_{21} + \Sigma_{22}) \text{ under } H_0.$$

Consider

$$(n-1)S_y = \sum_{\alpha=1}^{n}(y_\alpha - \bar{y})(y_\alpha - \bar{y})', \quad \text{so that } S_y = S_{11} - S_{12} - S_{21} + S_{22},$$

where the $S_{ij}$ are the corresponding blocks of the sample covariance matrix of $x$. Therefore, by definition, $T^2 = n\,\bar{y}'(S_{11} - S_{12} - S_{21} + S_{22})^{-1}\bar{y}$, which has the $T^2$-distribution with $(n-1)$ degrees of freedom. The critical region is

$$T^2 \ge \frac{(n-1)q}{n-1-(q-1)}\,F_{q,\,n-q}(\alpha) \quad \text{with } \alpha \text{ level of significance}.$$

Result: If $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, then $|\Sigma| = |\Sigma_{22}|\,|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|$, or

$$\frac{|\Sigma|}{|\Sigma_{22}|} = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|;$$

in particular, when $\Sigma_{11}$ is the single element $\sigma_{11}$, this equals $\sigma_{11} - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{21} = 1/\sigma^{11}$, where $\sigma^{11}$ is the leading term of $\Sigma^{-1}$.
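The sub-vector test reduces to an ordinary one-sample $T^2$ on the differences, so it is short to code. The sketch below assumes NumPy and SciPy, with the convention that the first $q$ columns of the data matrix are $x^{(1)}$ and the last $q$ are $x^{(2)}$; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def subvector_equality_T2(X, q):
    """Test H0: mu^(1) = mu^(2) for the two q-component sub-vectors of a 2q-variate normal,
    using y_alpha = x_alpha^(1) - x_alpha^(2)."""
    n, p = X.shape
    assert p == 2 * q
    Y = X[:, :q] - X[:, q:]            # differences of the two sub-vectors
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)        # equals S11 - S12 - S21 + S22
    T2 = n * ybar @ np.linalg.solve(S, ybar)
    F = T2 * (n - q) / ((n - 1) * q)   # F_{q, n-q} under H0
    return T2, F, stats.f.sf(F, q, n - q)
```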
Example: Let $\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}$; then $|\Sigma| = 9$ and $\Sigma^{-1} = \begin{pmatrix} 5/9 & -1/9 \\ -1/9 & 2/9 \end{pmatrix}$, so that $\dfrac{|\Sigma|}{\sigma_{22}} = \dfrac{9}{5}$, and also $\dfrac{1}{\sigma^{11}} = \dfrac{1}{5/9} = \dfrac{9}{5}$.

Exercise: Let $A \sim W_p(n-1, \Sigma)$ with $n > p$ and $\Sigma > 0$. Let $u$ be distributed independently of $A$. Prove that $u'\Sigma^{-1}u / u'A^{-1}u$ has the $\chi^2_{n-p}$ distribution, independent of $u$. Hence deduce the distribution of Hotelling's $T^2$.

Solution: Consider the variance of the conditional distribution of $X_1$ given $X^{(2)} = (X_2, \ldots, X_p)'$, which is $\sigma_{11} - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{21}$, where

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12}' \\ \sigma_{21} & \Sigma_{22} \end{pmatrix}\begin{matrix} 1 \\ p-1 \end{matrix}, \quad \text{and} \quad |\Sigma| = |\Sigma_{22}|\,(\sigma_{11} - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{21}).$$

Thus,

$$\frac{|\Sigma|}{|\Sigma_{22}|} = \sigma_{11} - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{21} = \frac{1}{\sigma^{11}}, \quad \text{where } \sigma^{11} \text{ is the leading term of } \Sigma^{-1}.$$

Let $A = \sum_\alpha (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$ be partitioned in the same way as

$$A = \begin{pmatrix} a_{11} & a_{12}' \\ a_{21} & A_{22} \end{pmatrix}\begin{matrix} 1 \\ p-1 \end{matrix}, \quad \text{and} \quad |A| = |A_{22}|\,(a_{11} - a_{12}'A_{22}^{-1}a_{21}),$$

or

$$\frac{|A|}{|A_{22}|} = a_{11} - a_{12}'A_{22}^{-1}a_{21} = \frac{1}{a^{11}}, \qquad (6.6)$$

where $a^{11}$ is the leading term of $A^{-1}$.

Since $A \sim W_p(n-1, \Sigma)$ and is partitioned as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\begin{matrix} q \\ p-q \end{matrix}, \quad \text{then} \quad A_{11} - A_{12}A_{22}^{-1}A_{21} \sim W_q\big(n-1-(p-q),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)$$

$$\Rightarrow\; a_{11} - a_{12}'A_{22}^{-1}a_{21} \sim W_1\big(n-1-(p-1),\ \sigma_{11} - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{21}\big). \qquad (6.7)$$

In view of equation (6.6), equation (6.7) reduces to

$$\frac{1}{a^{11}} \sim W_1\left(n-p,\ \frac{1}{\sigma^{11}}\right) \;\Rightarrow\; \frac{\sigma^{11}}{a^{11}} \sim \chi^2_{n-p}.$$

Of course we could obtain the variance of the conditional distribution of any component $X_j$ $(j = 1, 2, \ldots, p)$ of $X$ given the other components. Hence $\dfrac{\sigma^{jj}}{a^{jj}} \sim \chi^2_{n-p}$, where $\sigma^{jj}$ and $a^{jj}$ are the corresponding diagonal elements of $\Sigma^{-1}$ and $A^{-1}$ respectively.

Let $C$ be a nonsingular matrix and let $B = C^{-1}$, so that $B^{-1} = C$ and $B'^{-1} = C'$. Since $A \sim W_p(n-1, \Sigma)$, we have $BAB' \sim W_p(n-1, B\Sigma B')$. Now

$$(BAB')^{-1} = B'^{-1}A^{-1}B^{-1} = C'A^{-1}C, \quad \text{and} \quad (B\Sigma B')^{-1} = B'^{-1}\Sigma^{-1}B^{-1} = C'\Sigma^{-1}C.$$

Write $C = (C_1\ C_2\ \cdots\ C_p)$, where $C_i$ is a $p$-component vector. Then the leading term of $(BAB')^{-1}$ is $C_1'A^{-1}C_1$ and the leading term of $(B\Sigma B')^{-1}$ is $C_1'\Sigma^{-1}C_1$; thus

$$\frac{C_1'\Sigma^{-1}C_1}{C_1'A^{-1}C_1} \sim \chi^2_{n-p},$$

as both quantities are of order $1 \times 1$. The result is true for any $C_1$, even when $C_1$ is obtained by observing a random vector which is distributed independently of $A$. Hence, if $u$ and $A$ are independently distributed, then

$$\frac{u'\Sigma^{-1}u}{u'A^{-1}u} \sim \chi^2_{n-p}.$$

We know that

$$V_1 = n(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0) \sim \chi^2_p.$$

Let $u = (\bar{x} - \mu_0)$; then, with $A = (n-1)S$,

$$V_2 = \frac{(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0)}{(\bar{x} - \mu_0)'A^{-1}(\bar{x} - \mu_0)} = \frac{(n-1)(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0)}{(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0)} \sim \chi^2_{n-p},$$

and $V_1$ and $V_2$ are independently distributed as $\chi^2$. Now

$$T^2 = n(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0) = \frac{V_1/n}{V_2/[n(n-1)]} = (n-1)\frac{V_1}{V_2}$$

$$\Rightarrow\; \frac{T^2}{n-1} = \frac{V_1}{V_2}, \quad \text{or} \quad \frac{T^2}{n-1}\times\frac{n-p}{p} = \frac{V_1/p}{V_2/(n-p)} \sim F_{p,\,n-p}.$$

Distribution of Mahalanobis's $D^2$

The quantity $(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$ is denoted by $\Delta^2$ and was proposed by Mahalanobis as a measure of the distance between the two populations $N_p(\mu^{(1)}, \Sigma)$ and $N_p(\mu^{(2)}, \Sigma)$. If the parameters are replaced by their unbiased estimates, it is denoted by $D^2$, which is given by

$$D^2 = (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$$

and is known as Mahalanobis's $D^2$, where

$$S = \frac{(n_1-1)S^{(1)} + (n_2-1)S^{(2)}}{n_1+n_2-2} = \frac{1}{n_1+n_2-2}\sum_{i=1}^{2}\sum_{\alpha=1}^{n_i}(x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})', \qquad \bar{x}^{(i)} = \frac{1}{n_i}\sum_{\alpha=1}^{n_i}x_\alpha^{(i)},\ i = 1, 2.$$

It is obvious that

$$T^2 = \frac{n_1 n_2}{n_1+n_2}\,D^2,$$

i.e. the two-sample $T^2$ and $D^2$ are almost the same, except for the constant $k^2 = \dfrac{n_1 n_2}{n_1+n_2}$.

Let $Y = k(\bar{x}^{(1)} - \bar{x}^{(2)})$; then the expected value of $Y$ is $E(Y) = k(\mu^{(1)} - \mu^{(2)}) = \delta$ and the variance covariance matrix of $Y$ is

$$\Sigma_Y = k^2 E[(\bar{x}^{(1)} - \bar{x}^{(2)}) - (\mu^{(1)} - \mu^{(2)})][(\bar{x}^{(1)} - \bar{x}^{(2)}) - (\mu^{(1)} - \mu^{(2)})]' = k^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\Sigma = \Sigma, \quad \text{because } \frac{n_1+n_2}{n_1 n_2} = \frac{1}{k^2}.$$

Therefore, $Y \sim N_p(\delta, \Sigma)$, and $k^2 D^2 = Y'S^{-1}Y$.

Since $\Sigma$ is a positive definite matrix, there exists a nonsingular matrix $C$ such that $C\Sigma C' = I$, i.e. $C'C = \Sigma^{-1}$. Define

$$Y^* = CY, \quad S^* = CSC', \quad \delta^* = C\delta;$$

then $k^2 D^2 = Y^{*\prime}S^{*-1}Y^*$, the expected value of $Y^*$ is $E(Y^*) = CE(Y) = C\delta = \delta^*$, and the variance covariance matrix of $Y^*$ is

$$\Sigma_{Y^*} = CE[Y - E(Y)][Y - E(Y)]'C' = C\Sigma C' = I.$$

Thus,

$$Y^* \sim N_p(\delta^*, I) \;\Rightarrow\; Y^{*\prime}Y^* \sim \chi^2_p(\delta^{*\prime}\delta^*), \quad \text{where } \delta^{*\prime}\delta^* = \delta'C'C\delta = \delta'\Sigma^{-1}\delta = \lambda^2.$$

Let

$$(n_1+n_2-2)S = \sum_{\alpha=1}^{n_1+n_2-2} Z_\alpha Z_\alpha', \quad \text{where } Z_\alpha \sim N_p(0, \Sigma)$$

$$\Rightarrow\; (n_1+n_2-2)S^* = \sum_{\alpha=1}^{n_1+n_2-2} (CZ_\alpha)(CZ_\alpha)', \quad \text{where } CZ_\alpha \sim N_p(0, I).$$

Therefore,

$$k^2 D^2 = Y^{*\prime}S^{*-1}Y^* = (n_1+n_2-2)\,\frac{\chi^2_p(\lambda^2)}{\chi^2_{n_1+n_2-2-(p-1)}}$$

$$\Rightarrow\; \frac{n_1 n_2}{n_1+n_2}\cdot\frac{n_1+n_2-p-1}{(n_1+n_2-2)\,p}\,D^2 = \frac{\chi^2_p(\lambda^2)/p}{\chi^2_{n_1+n_2-2-(p-1)}/(n_1+n_2-p-1)} \sim F_{p,\,n_1+n_2-p-1}(\lambda^2).$$

If $\mu^{(1)} = \mu^{(2)}$, then the $F$-distribution is central.
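A compact numerical sketch of the two-sample $T^2$ and Mahalanobis $D^2$ with a pooled covariance matrix follows, assuming NumPy and SciPy and equal population covariance matrices; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def two_sample_T2_D2(X1, X2):
    """Mahalanobis D^2 and Hotelling T^2 = (n1 n2 / (n1+n2)) D^2 with pooled S."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    D2 = d @ np.linalg.solve(S, d)                   # Mahalanobis D^2
    T2 = (n1 * n2) / (n1 + n2) * D2                  # T^2 = k^2 D^2
    # central F_{p, n1+n2-p-1} under H0: mu1 = mu2
    F = T2 * (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p)
    pval = stats.f.sf(F, p, n1 + n2 - p - 1)
    return D2, T2, F, pval
```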



DISCRIMINANT ANALYSIS

The problem of discriminant analysis deals with assigning an individual to one of several categories on the basis of measurements on a $p$-component vector of variables $x$ on that individual. For example, we may take certain measurements on the skull of an animal and want to know whether it was male or female; a patient is to be classified as diabetic or non-diabetic on the basis of certain tests such as blood, urine, blood pressure, etc.; a salesman is to be classified as successful or unsuccessful on the basis of different psychological tests.

Procedure of classification into one of two populations with known probability distributions

Let $R$ denote the entire $p$-dimensional space in which the point of observation $x$ falls. We then have to divide $R$ into two regions, say $R_1$ and $R_2$, so that

if $x$ falls in $R_1$, we assign the individual to population $\pi_1$;
if $x$ falls in $R_2$, we assign the individual to population $\pi_2$.

Obviously, with any such procedure an error of misclassification is unavoidable, i.e. the rule may assign an individual to $\pi_2$ when he really belongs to $\pi_1$, and vice versa. A rule should control this error of discrimination.

Let $f_1(x)$ and $f_2(x)$ be the probability density functions of $x$ in the two populations $\pi_1$ and $\pi_2$. Let

$q_1$ = a priori probability that an individual comes from $\pi_1$,
$q_2$ = a priori probability that an individual comes from $\pi_2$,
$\Pr(1|2)$ = Pr(an individual belonging to $\pi_2$ is misclassified to $\pi_1$),
$\Pr(2|1)$ = Pr(an individual belonging to $\pi_1$ is misclassified to $\pi_2$).

Obviously,

$$\Pr(1|2) = \int_{R_1} f_2(x)\,dx, \quad \text{and} \quad \Pr(2|1) = \int_{R_2} f_1(x)\,dx.$$

Since the probability of drawing an observation from $\pi_1$ is $q_1$ and from $\pi_2$ is $q_2$, we have

Pr(drawing an observation from $\pi_1$ and misclassifying it as from $\pi_2$) $= q_1 \Pr(2|1)$,
Pr(drawing an observation from $\pi_2$ and misclassifying it as from $\pi_1$) $= q_2 \Pr(1|2)$.

Then the total chance of misclassification, say $\phi$, is

$$\phi = q_1\Pr(2|1) + q_2\Pr(1|2). \qquad (7.1)$$

We choose regions $R_1$ and $R_2$ such that equation (7.1) is minimized. The procedure that minimizes (7.1) for given $q_1$ and $q_2$ is called a Bayes procedure. Consider

$$\phi = q_1\Pr(2|1) + q_2\Pr(1|2) = q_1\int_{R_2} f_1(x)\,dx + q_2\int_{R_1} f_2(x)\,dx$$
$$= q_1\int_{R_2} f_1(x)\,dx + q_1\int_{R_1} f_1(x)\,dx - q_1\int_{R_1} f_1(x)\,dx + q_2\int_{R_1} f_2(x)\,dx$$
$$= q_1\int_{R} f_1(x)\,dx + \int_{R_1}[q_2 f_2(x) - q_1 f_1(x)]\,dx.$$

$\Rightarrow\ \phi$ is minimized when $R_1$ contains exactly the points where $q_2 f_2(x) \le q_1 f_1(x)$, so we divide $R$ as

$$R_1 = \{x \mid q_2 f_2(x) \le q_1 f_1(x)\} = \left\{x \,\Big|\, \frac{q_1 f_1(x)}{q_2 f_2(x)} \ge 1\right\} = \left\{x \,\Big|\, \frac{f_1(x)}{f_2(x)} \ge \frac{q_2}{q_1}\right\}.$$

Similarly,

$$R_2 = \left\{x \,\Big|\, \frac{q_1 f_1(x)}{q_2 f_2(x)} < 1\right\} = \left\{x \,\Big|\, \frac{f_1(x)}{f_2(x)} < \frac{q_2}{q_1}\right\}.$$

Further, suppose the costs of misclassification are given: $C(2|1)$ is the cost of misclassification to $\pi_2$ when the observation actually comes from $\pi_1$, and $C(1|2)$ the cost of misclassification to $\pi_1$ when it actually comes from $\pi_2$. For example, if a potentially good candidate for admission to a medical school is rejected, the nation will suffer a shortage of medical persons; but, on the contrary, if a bad candidate is admitted, he may not be able to complete the course successfully, and the money, resources and equipment used by him will be wasted. The total expected cost of misclassification is

$$E(C) = C(2|1)\Pr(2|1)\,q_1 + C(1|2)\Pr(1|2)\,q_2,$$

and the classification rule will be

$$R_1 = \left\{x \,\Big|\, \frac{f_1(x)}{f_2(x)} \ge \frac{q_2\,C(1|2)}{q_1\,C(2|1)}\right\}, \quad \text{and} \quad R_2 = \left\{x \,\Big|\, \frac{f_1(x)}{f_2(x)} < \frac{q_2\,C(1|2)}{q_1\,C(2|1)}\right\}.$$

Classification into one of two known multivariate normal populations

Let $f_1(x)$ be the density function of $N_p(\mu^{(1)}, \Sigma)$ and $f_2(x)$ the density function of $N_p(\mu^{(2)}, \Sigma)$. The region $R_1$ of classification into the first population is given by

$$R_1 = \left\{x \,\Big|\, \frac{f_1(x)}{f_2(x)} \ge k\right\}, \quad \text{where } k = \frac{q_2\,C(1|2)}{q_1\,C(2|1)}.$$

Consider

$$\ln\frac{f_1(x)}{f_2(x)} \ge \ln k. \qquad (7.2)$$

The left hand side of (7.2) can be expanded as

$$\ln f_1(x) - \ln f_2(x) = -\tfrac{1}{2}\{(x - \mu^{(1)})'\Sigma^{-1}(x - \mu^{(1)}) - (x - \mu^{(2)})'\Sigma^{-1}(x - \mu^{(2)})\}$$
$$= -\tfrac{1}{2}\{(x'\Sigma^{-1}x - x'\Sigma^{-1}\mu^{(1)} - \mu^{(1)\prime}\Sigma^{-1}x + \mu^{(1)\prime}\Sigma^{-1}\mu^{(1)}) - (x'\Sigma^{-1}x - x'\Sigma^{-1}\mu^{(2)} - \mu^{(2)\prime}\Sigma^{-1}x + \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)})\}$$
$$= -\tfrac{1}{2}\{-2x'\Sigma^{-1}\mu^{(1)} + 2x'\Sigma^{-1}\mu^{(2)} + \mu^{(1)\prime}\Sigma^{-1}\mu^{(1)} - \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)}\}$$
$$= x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)\prime}\Sigma^{-1}\mu^{(1)} - \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)}).$$

Thus,

$$R_1 = \{x \mid x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)\prime}\Sigma^{-1}\mu^{(1)} - \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)}) \ge \ln k\}.$$
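This linear rule is immediate to apply when the parameters are known. The following sketch assumes NumPy; the function name, argument names and default priors/costs are illustrative, and the constant term is written in the equivalent form $\tfrac{1}{2}(\mu^{(1)}+\mu^{(2)})'\Sigma^{-1}(\mu^{(1)}-\mu^{(2)})$ used in the next paragraph.

```python
import numpy as np

def classify_two_normals(x, mu1, mu2, Sigma, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    """Assign x to pi_1 iff x'Sigma^{-1}(mu1-mu2) - 0.5(mu1+mu2)'Sigma^{-1}(mu1-mu2) >= ln k,
    with k = q2*C(1|2) / (q1*C(2|1)); otherwise assign to pi_2."""
    delta = np.linalg.solve(Sigma, mu1 - mu2)        # Sigma^{-1}(mu1 - mu2)
    u = x @ delta - 0.5 * (mu1 + mu2) @ delta        # linear discriminant score
    k = (q2 * c12) / (q1 * c21)
    return 1 if u >= np.log(k) else 2
```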

In particular, when $C(1|2) = C(2|1)$ and $q_1 = q_2 = 1/2$, we have $k = 1$ and $\ln k = 0$. Then the region of classification into the first population is

$$R_1 = \{x \mid x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)\prime}\Sigma^{-1}\mu^{(1)} - \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)}) \ge 0\}$$
$$= \{x \mid x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge \tfrac{1}{2}(\mu^{(1)\prime}\Sigma^{-1}\mu^{(1)} - \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)})\}$$
$$= \{x \mid x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})\}.$$

The term $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$ is known as Fisher's discriminant function. Similarly, the region of classification into the second population is

$$R_2 = \{x \mid x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) < \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})\}.$$

The regions are computed easily as follows: to obtain $\delta = \Sigma^{-1}d$, solve $\Sigma\delta = d$ (for example by the Doolittle method), where $d = \mu^{(1)} - \mu^{(2)}$.

Probability of misclassification (two known $p$-variate normal populations)

If the observation is from $\pi_1$, then

$$\Pr(2|1) = \int_{R_2} f_1(x)\,dx = \int_{R_2} f(x: \mu^{(1)}, \Sigma)\,dx, \quad \text{where } R_2 = \{x \mid x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) < \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})\}.$$

Put

$$U = x'\delta, \quad \text{where } \delta = \Sigma^{-1}(\mu^{(1)} - \mu^{(2)}), \quad \text{and} \quad h = \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\delta,$$

so that $R_2 = \{x \mid U < h\}$. Since $x \sim N_p(\mu^{(1)}, \Sigma)$, $U = x'\delta$ is univariate normal with parameters

$$E(U) = \mu^{(1)\prime}\delta, \quad \text{and} \quad \operatorname{Var}(U) = \delta'\Sigma\delta = (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = \Delta^2.$$

Therefore,

$$\Pr(2|1) = \int_{-\infty}^{h} \frac{1}{\Delta\sqrt{2\pi}}\exp\left\{-\frac{1}{2\Delta^2}(u - \mu^{(1)\prime}\delta)^2\right\}du.$$

Make the transformation $(u - \mu^{(1)\prime}\delta)/\Delta = y$, so that $du = \Delta\,dy$. Thus,

$$\Pr(2|1) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(h - \mu^{(1)\prime}\delta)/\Delta} e^{-y^2/2}\,dy = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(\mu^{(2)} - \mu^{(1)})'\delta/2\Delta} e^{-y^2/2}\,dy.$$

Similarly,

$$\Pr(1|2) = \int_{R_1} f_2(x)\,dx = \int_{R_1} f(x: \mu^{(2)}, \Sigma)\,dx, \quad \text{where } R_1 = \{x \mid U \ge h\},\ E(U) = \mu^{(2)\prime}\delta,\ \operatorname{Var}(U) = \Delta^2.$$

Putting $(u - \mu^{(2)\prime}\delta)/\Delta = y$, so that $du = \Delta\,dy$, we obtain

$$\Pr(1|2) = \frac{1}{\sqrt{2\pi}}\int_{(h - \mu^{(2)\prime}\delta)/\Delta}^{\infty} e^{-y^2/2}\,dy = \frac{1}{\sqrt{2\pi}}\int_{(\mu^{(1)} - \mu^{(2)})'\delta/2\Delta}^{\infty} e^{-y^2/2}\,dy.$$

Sample discriminant function

Suppose that we have a sample $x_1^{(1)}, \ldots, x_{n_1}^{(1)}$ from $N_p(\mu^{(1)}, \Sigma)$ and a sample $x_1^{(2)}, \ldots, x_{n_2}^{(2)}$ from $N_p(\mu^{(2)}, \Sigma)$. The unbiased estimate of $\mu^{(1)}$ is $\bar{x}^{(1)} = \frac{1}{n_1}\sum_{\alpha=1}^{n_1}x_\alpha^{(1)}$, that of $\mu^{(2)}$ is $\bar{x}^{(2)} = \frac{1}{n_2}\sum_{\alpha=1}^{n_2}x_\alpha^{(2)}$, and that of $\Sigma$ is $S$, defined by

$$S = \frac{1}{n_1+n_2-2}\left[\sum_{\alpha=1}^{n_1}(x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + \sum_{\alpha=1}^{n_2}(x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})'\right].$$

Substituting these estimates for the parameters in the function $x'\delta$, Fisher's discriminant function becomes

$$x'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}).$$

This is known as the sample discriminant function. The classification procedure now becomes:

i) compute $x'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}) = x'\delta$, where now $\delta = S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$;

ii) compute $\tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}) = \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})'\delta$;

iii) assign the individual with measurements $x$ to the first or the second population according as $x'\delta - \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})'\delta$ is $\ge 0$ or $< 0$.
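Steps (i) to (iii) can be packaged as a small routine built from two training samples. The sketch below assumes NumPy; the function name and the returned closure are illustrative design choices, not part of the notes.

```python
import numpy as np

def sample_discriminant(X1, X2):
    """Fisher's sample discriminant rule from two training samples with common covariance."""
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    A1 = (n1 - 1) * np.cov(X1, rowvar=False)
    A2 = (n2 - 1) * np.cov(X2, rowvar=False)
    S = (A1 + A2) / (n1 + n2 - 2)                    # pooled estimate of Sigma
    delta = np.linalg.solve(S, xbar1 - xbar2)        # S^{-1}(xbar1 - xbar2)
    h = 0.5 * (xbar1 + xbar2) @ delta                # step (ii)
    def classify(x):
        # step (iii): population 1 if x'delta - h >= 0, else population 2
        return 1 if x @ delta - h >= 0 else 2
    return classify
```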

Classification into one of several populations

Consider the $m$ populations, say $\pi_1, \ldots, \pi_m$, with prior probabilities $q_1, \ldots, q_m$ and density functions $f_1(x), \ldots, f_m(x)$. We wish to divide the $p$-dimensional space $R$, in which the point of observation $x$ falls, into $m$ mutually exclusive and exhaustive regions $R_1, \ldots, R_m$: if $x$ falls in $R_i$, we assign the individual to population $\pi_i$.

Let

$$\Pr(j|i) = \Pr(\text{an individual belonging to } \pi_i \text{ is misclassified to } \pi_j) = \int_{R_j} f_i(x)\,dx,$$

and let $C(j|i)$ be the cost of misclassifying an observation from $\pi_i$ as coming from $\pi_j$. Since the probability that an observation comes from $\pi_i$ is $q_i$,

Pr(an observation belongs to $\pi_i$ and is classified as coming from $\pi_j$)
= Pr(an observation comes from $\pi_i$) $\times$ Pr(it is misclassified as coming from $\pi_j$) $= q_i\Pr(j|i)$.

The total expected cost of misclassification is

$$\sum_{i=1}^{m}\Big[\sum_{j=1,\, j\ne i}^{m} C(j|i)\, q_i\, \Pr(j|i)\Big].$$

We would like to choose regions $R_1, \ldots, R_m$ to make this expected cost a minimum. It can be seen that the classification rule comes out to be: assign $x$ to $\pi_k$ if

$$\sum_{\substack{i=1 \\ i\ne k}}^{m} q_i f_i(x)\, C(k|i) < \sum_{\substack{i=1 \\ i\ne j}}^{m} q_i f_i(x)\, C(j|i), \qquad j = 1, 2, \ldots, m;\ j \ne k.$$

Let $X \sim N_p(\mu^{(i)}, \Sigma)$, $i = 1, 2, \ldots, m$, and let the costs of misclassification be equal. Then the rule is: assign $x$ to $\pi_k$ if

$$\sum_{\substack{i=1 \\ i\ne k}}^{m} q_i f_i(x) < \sum_{\substack{i=1 \\ i\ne j}}^{m} q_i f_i(x), \qquad j = 1, 2, \ldots, m;\ j \ne k.$$

Adding and subtracting $q_k f_k(x)$ on the left and $q_j f_j(x)$ on the right,

$$\sum_{i\ne k} q_i f_i(x) + q_k f_k(x) = \sum_{i=1}^{m} q_i f_i(x) = \sum_{i\ne j} q_i f_i(x) + q_j f_j(x),$$

so the inequality becomes

$$q_k f_k(x) > q_j f_j(x) \;\Rightarrow\; \frac{f_k(x)}{f_j(x)} > \frac{q_j}{q_k}.$$

Equivalently, assign $x$ to $\pi_j$ if $\dfrac{f_j(x)}{f_k(x)} > \dfrac{q_k}{q_j}$ for every $k \ne j$, i.e. if

$$U_{jk}(x) = \ln\frac{f_j(x)}{f_k(x)} > \ln\frac{q_k}{q_j},$$

where

$$\ln\frac{f_j(x)}{f_k(x)} = \ln\frac{\exp[-\tfrac{1}{2}(x - \mu^{(j)})'\Sigma^{-1}(x - \mu^{(j)})]}{\exp[-\tfrac{1}{2}(x - \mu^{(k)})'\Sigma^{-1}(x - \mu^{(k)})]} = \left[x - \tfrac{1}{2}(\mu^{(j)} + \mu^{(k)})\right]'\Sigma^{-1}(\mu^{(j)} - \mu^{(k)}).$$

Note that

$$U_{kj}(x) = \left[x - \tfrac{1}{2}(\mu^{(k)} + \mu^{(j)})\right]'\Sigma^{-1}(\mu^{(k)} - \mu^{(j)}),$$

therefore $U_{jk}(x) = -U_{kj}(x)$. There are $\binom{m}{2}$ combinations of $U_{jk}(x)$ for $m$ categories. For $m = 3$, the combinations are $U_{12}$, $U_{13}$, $U_{23}$:

$$\ln\frac{f_1(x)}{f_2(x)} = U_{12} > \ln\frac{q_2}{q_1} \ \text{and}\ \ln\frac{f_1(x)}{f_3(x)} = U_{13} > \ln\frac{q_3}{q_1}: \quad \text{assign } x \text{ to } \pi_1;$$

$$\ln\frac{f_2(x)}{f_1(x)} = U_{21} > \ln\frac{q_1}{q_2} \ \text{and}\ \ln\frac{f_2(x)}{f_3(x)} = U_{23} > \ln\frac{q_3}{q_2}: \quad \text{assign } x \text{ to } \pi_2;$$

$$\ln\frac{f_3(x)}{f_1(x)} = U_{31} > \ln\frac{q_1}{q_3} \ \text{and}\ \ln\frac{f_3(x)}{f_2(x)} = U_{32} > \ln\frac{q_2}{q_3}: \quad \text{assign } x \text{ to } \pi_3.$$
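In code, checking all the pairwise conditions $U_{kj}(x) > \ln(q_j/q_k)$ is equivalent to picking the population with the largest value of $\ln q_k + \ln f_k(x)$. The sketch below assumes NumPy and known parameters; the function name is illustrative.

```python
import numpy as np

def classify_several_normals(x, mus, Sigma, q):
    """Assign x to the population k maximizing ln q_k + ln f_k(x) for N_p(mu^(k), Sigma)
    densities with a common Sigma; equivalent to U_kj(x) > ln(q_j/q_k) for every j != k."""
    Sinv = np.linalg.inv(Sigma)
    # the common normalizing constant of the densities cancels between populations
    scores = [np.log(qk) - 0.5 * (x - mu) @ Sinv @ (x - mu)
              for qk, mu in zip(q, mus)]
    return int(np.argmax(scores)) + 1     # populations numbered 1, ..., m
```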

Sample discriminant function

If $X \sim N_p(\mu^{(i)}, \Sigma)$ with $\mu^{(i)}$ and $\Sigma$ unknown, replace $\mu^{(i)}$ by $\bar{x}^{(i)}$ and $\Sigma$ by $S$, where

$$S = \frac{1}{\sum_{i=1}^{m} n_i - m}\sum_{i=1}^{m}\sum_{\alpha=1}^{n_i}\left(x_{i\alpha}x_{i\alpha}' - n_i\,\bar{x}^{(i)}\bar{x}^{(i)\prime}\right), \quad \text{or} \quad \left(\sum_{i=1}^{m} n_i - m\right)S = A_1 + A_2 + \cdots + A_m.$$

Substituting these estimates for the parameters, the function becomes

$$V_{jk}(x) = \left[x - \tfrac{1}{2}(\bar{x}^{(j)} + \bar{x}^{(k)})\right]'S^{-1}(\bar{x}^{(j)} - \bar{x}^{(k)}).$$

Tests associated with discriminant functions

1) Goodness of fit of a hypothetical discriminant function

$H_0$: a given function $x'\delta$ is good enough for discriminating between the two populations. We use the test statistic

$$\frac{n_1+n_2-p-1}{p-1}\cdot\frac{k^2(D_p^2 - D_1^2)}{(n_1+n_2-2) + k^2 D_1^2} \sim F_{p-1,\, n_1+n_2-p-1},$$

where $k^2 = \dfrac{n_1 n_2}{n_1+n_2}$, $D_p^2 = (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$ is the studentized $D^2$ of Mahalanobis based on all $p$ characters of $x$, and

$$D_1^2 = \frac{(\delta'd)^2}{\delta'S\delta}, \quad \text{where } d = (\bar{x}^{(1)} - \bar{x}^{(2)})$$

and $\delta$ is the coefficient vector of the hypothesized function $x'\delta$, so that $D_1^2$ is the studentized $D^2$ based on the single compound variable $\delta'x$.

2) Test for additional information

$H_0: \Delta_p^2 = \Delta_q^2$ (i.e. the remaining $p - q$ components do not provide additional information). We use the test statistic

$$\frac{n_1+n_2-p-1}{p-q}\cdot\frac{k^2(D_p^2 - D_q^2)}{(n_1+n_2-2) + k^2 D_q^2} \sim F_{p-q,\, n_1+n_2-p-1},$$

where $D_q^2 = d_1'S_{11}^{-1}d_1$, $d = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix}$, and $S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$.

Exercise: If $f_i(x)$ is a $p$-variate normal density with parameters $\mu^{(i)}$ and $\Sigma$ for $i = 1, 2$, with $\Sigma > 0$, and the region $R_1$ of classification is

$$U = x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge 0,$$

where $q_1 = q_2$ and $C(1|2) = C(2|1)$, find the distribution of $U$ if $x \sim N_p(\mu^{(1)}, \Sigma)$.

Solution: Given $x \sim N_p(\mu^{(1)}, \Sigma)$ and $U = x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$,

$$E(U) = \mu^{(1)\prime}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = \tfrac{1}{2}(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}),$$

and

$$V(U) = E(U - EU)(U - EU)' = E\left[\left\{x' - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' - \tfrac{1}{2}(\mu^{(1)} - \mu^{(2)})'\right\}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})\right]\left[\left\{x' - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' - \tfrac{1}{2}(\mu^{(1)} - \mu^{(2)})'\right\}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})\right]'$$
$$= E[(x - \mu^{(1)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})][(x - \mu^{(1)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})]'$$
$$= (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}E(x - \mu^{(1)})(x - \mu^{(1)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$$
$$= (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = \alpha \text{ (say)},$$

using $\tfrac{1}{2}(\mu^{(1)} + \mu^{(2)}) + \tfrac{1}{2}(\mu^{(1)} - \mu^{(2)}) = \mu^{(1)}$. Therefore, since $U$ is a scalar,

$$U \sim N(\alpha/2, \alpha).$$

Similarly, if $x \sim N_p(\mu^{(2)}, \Sigma)$, then

$$E(U) = \mu^{(2)\prime}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = \tfrac{1}{2}(\mu^{(2)} - \mu^{(1)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = -\tfrac{1}{2}\alpha,$$

and $V(U) = (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = \alpha$. Thus,

$$U \sim N(-\alpha/2, \alpha).$$

PRINCIPAL COMPONENTS

In multivariate analysis the dimension of $X$ causes problems in obtaining a suitable statistical analysis of a set of observations (data) on $X$. It is natural to look for a method of rearranging the data so that, with as little loss of information as possible, the dimension of the problem is considerably reduced. This reduction is achieved by transforming the original variables to a new set of variables which are uncorrelated. These new variables are known as principal components.

Principal components are normalized linear combinations of the original variables which have specified properties in terms of variance. They are uncorrelated and are ordered, so that the first component displays the largest amount of variation, the second component displays the second largest amount of variation, and so on.

If there are $p$ variables, then $p$ components are required to reproduce the total variability present in the data; often, however, most of this variability can be accounted for by a small number $k$ $(k < p)$ of the components. If this is so, there is almost as much information in the $k$ components as there is in the original $p$ variables, and the $k$ components can then replace the original $p$ variables. That is why this is considered a linear reduction technique. The technique produces best results when the original variables are highly correlated, positively or negatively.

For example, suppose we are interested in finding the level of performance in mathematics of the tenth grade students of a certain school. We may then record their scores in mathematics: i.e., we consider just one characteristic of each student. Now suppose we are instead interested in overall performance and select some $p$ characteristics, such as mathematics, English, history, science, etc. These characteristics, although related to each other, may not all contain the same amount of information, and in fact some characteristics may be completely redundant. Obviously, this will result in a loss of information and a waste of resources in analyzing the data. Thus, we should select only those characteristics that will truly discriminate one student from another, while those least discriminatory should be discarded.

Determination

Let $X$ be a $p$-component vector with covariance matrix $\Sigma$ positive definite (all the characteristic roots are positive and distinct). Since we are interested only in the variances and covariances, we shall assume that $E(X) = 0$. Let $\beta$ be a $p$-component column vector such that $\beta'\beta = 1$ (in order to obtain a unique solution). Define $U = \beta'X$; then $E(U) = E(\beta'X) = 0$, and

$$V(\beta'X) = \beta'E(XX')\beta = \beta'\Sigma\beta. \qquad (8.1)$$

The normalized linear combination $\beta'X$ with maximum variance is therefore obtained by maximizing equation (8.1) subject to the condition $\beta'\beta = 1$. We shall use the technique of Lagrange multipliers and maximize

$$\phi = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) = \sum_{i,j}\beta_i\sigma_{ij}\beta_j - \lambda\Big(\sum_i\beta_i^2 - 1\Big),$$

where $\lambda$ is a Lagrange multiplier. The vector of partial derivatives $(\partial\phi/\partial\beta_i)$ is

$$\frac{\partial\phi}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta;$$

equating this to zero gives

$$\Sigma\beta - \lambda\beta = 0 \;\Rightarrow\; (\Sigma - \lambda I)\beta = 0. \qquad (8.2)$$

Equation (8.2) admits a non-zero solution in $\beta$ only if

$$|\Sigma - \lambda I| = 0, \qquad (8.3)$$

i.e. $(\Sigma - \lambda I)$ is a singular matrix. The function $|\Sigma - \lambda I|$ is a polynomial in $\lambda$ of degree $p$ (i.e. of the form $\lambda^p + a_1\lambda^{p-1} + \cdots + a_{p-1}\lambda + a_p$). Therefore, equation (8.3) will give $p$ solutions in $\lambda$; let these be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$.

Pre-multiplying (8.2) by $\beta'$ gives

$$\beta'\Sigma\beta - \lambda\beta'\beta = 0 \;\Rightarrow\; \beta'\Sigma\beta = \lambda. \qquad (8.4)$$

This shows that if $\beta$ satisfies equation (8.2) and $\beta'\beta = 1$, then the variance of $\beta'X$ is $\lambda$. Thus for the maximum variance we should use in (8.2) the largest root $\lambda_1$. Let $\beta^{(1)}$ be a normalized solution of

$$(\Sigma - \lambda_1 I)\beta = 0, \quad \text{i.e. } \Sigma\beta^{(1)} = \lambda_1\beta^{(1)};$$

then $U_1 = \beta^{(1)\prime}X$ is a normalized linear combination (the first principal component) with maximum variance, i.e.

$$V(U_1) = \beta^{(1)\prime}\Sigma\beta^{(1)} = \lambda_1 \quad [\text{from equation (8.4)}].$$

We have next to find a normalized linear combination $\beta'X$ that has maximum variance among all linear combinations uncorrelated with $U_1$. Non-correlation with $U_1$ is the same as

$$0 = \operatorname{Cov}(\beta'X, U_1) = E(\beta'X - E\beta'X)(\beta^{(1)\prime}X - E\beta^{(1)\prime}X)' = E(\beta'XX'\beta^{(1)}) = \beta'\Sigma\beta^{(1)}.$$

Since $\Sigma\beta^{(1)} = \lambda_1\beta^{(1)}$, then

$$\beta'\Sigma\beta^{(1)} = \lambda_1\beta'\beta^{(1)} = 0. \qquad (8.5)$$

Thus we have to determine $\beta$ so as to maximize the function

$$\phi_2 = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) - 2\upsilon_1(\beta'\Sigma\beta^{(1)}),$$

where $\lambda$ and $\upsilon_1$ are Lagrange multipliers. The vector of partial derivatives is

$$\frac{\partial\phi_2}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta - 2\upsilon_1\Sigma\beta^{(1)},$$

and we set this to $0$:

$$\Sigma\beta - \lambda\beta - \upsilon_1\Sigma\beta^{(1)} = 0.$$

Premultiplying by $\beta^{(1)\prime}$,

$$\beta^{(1)\prime}\Sigma\beta - \lambda\beta^{(1)\prime}\beta - \upsilon_1\beta^{(1)\prime}\Sigma\beta^{(1)} = 0.$$

Since $\beta'\Sigma\beta^{(1)} = \lambda_1\beta'\beta^{(1)} = 0$, the first two terms vanish, so that

$$-\upsilon_1\beta^{(1)\prime}\Sigma\beta^{(1)} = 0, \quad \text{or} \quad \upsilon_1\lambda_1 = 0 \ (\text{because } \beta^{(1)\prime}\Sigma\beta^{(1)} = \lambda_1) \;\Rightarrow\; \upsilon_1 = 0, \ \text{because } \lambda_1 \ne 0.$$

This means that the condition of uncorrelatedness is itself satisfied when $\beta$ satisfies (8.2) and $\lambda$ satisfies (8.3). So the procedure to find the second principal component is to solve (8.2) for $\lambda = \lambda_2$, call the solution $\beta^{(2)}$, and form the corresponding linear combination

$$U_2 = \beta^{(2)\prime}X; \quad \text{clearly, } V(U_2) = \beta^{(2)\prime}\Sigma\beta^{(2)} = \lambda_2.$$

Similarly, $\beta^{(3)}, \beta^{(4)}, \ldots$, with their linear combinations $\beta^{(3)\prime}X, \beta^{(4)\prime}X, \ldots$ and variances $\lambda_3, \lambda_4, \ldots$

As a general case, we want to find a vector $\beta$ such that $\beta'X$ has maximum variance among all the normalized linear combinations which are uncorrelated with $U_1, U_2, \ldots, U_r$, that is,

$$0 = \operatorname{Cov}(\beta'X, U_i) = E(\beta'X\,U_i) = E(\beta'XX'\beta^{(i)}) = \beta'\Sigma\beta^{(i)} = \lambda_i\beta'\beta^{(i)}, \quad i = 1, 2, \ldots, r. \qquad (8.6)$$

We want to maximize

$$\phi_{r+1} = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) - 2\sum_{i=1}^{r}\upsilon_i\beta'\Sigma\beta^{(i)},$$

where $\lambda$ and $\upsilon_i$ are Lagrange multipliers. The vector of partial derivatives is

$$\frac{\partial\phi_{r+1}}{\partial\beta} = 0 = 2\Sigma\beta - 2\lambda\beta - 2\sum_{i=1}^{r}\upsilon_i\Sigma\beta^{(i)}. \qquad (8.7)$$

Premultiplying by $\beta^{(j)\prime}$ and using equation (8.6), we obtain

$$\beta^{(j)\prime}\Sigma\beta - \lambda\beta^{(j)\prime}\beta - \upsilon_j\beta^{(j)\prime}\Sigma\beta^{(j)} = 0 \;\Rightarrow\; -\upsilon_j\beta^{(j)\prime}\Sigma\beta^{(j)} = 0 \;\Rightarrow\; \upsilon_j\lambda_j = 0, \ \text{because } \beta^{(j)\prime}\Sigma\beta^{(j)} = \lambda_j, \;\Rightarrow\; \upsilon_j = 0, \ \text{because } \lambda_j \ne 0.$$

Thus equation (8.7) becomes

$$\Sigma\beta - \lambda\beta = 0 \;\Rightarrow\; (\Sigma - \lambda I)\beta = 0,$$

and therefore $\beta$ satisfies (8.2) and $\lambda$ must satisfy (8.3).

Let $\lambda_{r+1}$ be the maximum of $\lambda_1, \lambda_2, \ldots, \lambda_p$ such that there exists a vector $\beta$ satisfying $(\Sigma - \lambda_{r+1}I)\beta = 0$, $\beta'\beta = 1$ and equation (8.6); call this vector $\beta^{(r+1)}$ and the corresponding linear combination $U_{r+1} = \beta^{(r+1)\prime}X$.

We shall write

$$\Beta_{p\times p} = (\beta^{(1)}\ \beta^{(2)}\ \cdots\ \beta^{(p)}), \qquad \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix}, \qquad U = \Beta'X.$$

Since $\Sigma\beta^{(r)} = \lambda_r\beta^{(r)}$ for $r = 1, 2, \ldots, p$, this can be written in matrix form as

$$\Sigma\Beta = \Beta\Lambda, \qquad (8.8)$$

and the equations $\beta^{(r)\prime}\beta^{(r)} = 1$ and $\beta^{(r)\prime}\beta^{(s)} = 0$, $r \ne s$, can be written as

$$\Beta'\Beta = I, \qquad (8.9)$$

i.e. $\Beta$ is an orthogonal matrix. From equations (8.8) and (8.9) we obtain

$$\Beta'\Sigma\Beta = \Beta'\Beta\Lambda = \Lambda.$$

Thus, given a positive definite matrix $\Sigma$, there exists an orthogonal matrix $\Beta$ such that $\Beta'\Sigma\Beta$ is in diagonal form, and the diagonal elements of $\Beta'\Sigma\Beta$ are the characteristic roots of $\Sigma$.
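A quick numerical check of this diagonalization, and of the fact that the component variances are the characteristic roots, can be done with NumPy. The covariance matrix and simulated data below are illustrative only.

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])          # an illustrative positive definite Sigma
lam, B = np.linalg.eigh(Sigma)               # eigenvalues (ascending) and orthonormal eigenvectors
order = np.argsort(lam)[::-1]                # reorder so that lambda_1 >= lambda_2 >= ...
lam, B = lam[order], B[:, order]

print(np.allclose(B.T @ Sigma @ B, np.diag(lam)))   # True: B' Sigma B = Lambda
print(np.allclose(B.T @ B, np.eye(3)))              # True: B is orthogonal

# principal components of centred observations X (rows): U = X B
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
U = (X - X.mean(axis=0)) @ B
print(np.var(U, axis=0, ddof=1))             # approximately equal to lam
```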

Therefore, we proceed as follows. Solve $|\Sigma - \lambda I| = 0$ and let the roots be $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. Then solve

$$(\Sigma - \lambda_1 I)\beta = 0 \text{ and get a solution } \beta^{(1)}, \text{ i.e. } \Sigma\beta^{(1)} = \lambda_1\beta^{(1)}.$$

First principal component: $U_1 = \beta^{(1)\prime}X$.

Again solve

$$(\Sigma - \lambda_2 I)\beta = 0 \text{ and get a solution } \beta^{(2)}, \text{ i.e. } \Sigma\beta^{(2)} = \lambda_2\beta^{(2)}.$$

Second principal component: $U_2 = \beta^{(2)\prime}X$, and so on; the $r$-th principal component is $U_r = \beta^{(r)\prime}X$.

Stop at $m$ $(\le p)$ for which $\lambda_m$ is negligible compared with $\lambda_1, \ldots, \lambda_{m-1}$. Thus we obtain fewer linear combinations, or at most as many linear combinations as the original variables $(p)$.

Contribution of the first principal component $= \dfrac{\lambda_1}{\sum_{i=1}^{p}\lambda_i}$.

Contribution of the first two (I + II) principal components $= \dfrac{\lambda_1 + \lambda_2}{\sum_{i=1}^{p}\lambda_i}$.

Contribution of the first three (I + II + III) principal components $= \dfrac{\lambda_1 + \lambda_2 + \lambda_3}{\sum_{i=1}^{p}\lambda_i}$.

Theorem: An orthogonal transformation $V = CX$ of a random vector $X$ leaves invariant the generalized variance and the sum of the variances of the components.

Proof: Let $E(X) = 0$ and $E(XX') = \Sigma$; then $E(V) = 0$ and

$$\Sigma_V = E(VV') = E(CXX'C') = C\Sigma C'.$$

The generalized variance of $V$ is

$$|C\Sigma C'| = |C|\,|\Sigma|\,|C'| = |\Sigma|\,|CC'| = |\Sigma|, \quad \text{because } C \text{ is an orthogonal matrix } (CC' = I),$$

which is the generalized variance of $X$. The sum of the variances of the components of $V$ is

$$\sum_{i=1}^{p}E(V_i)^2 = \operatorname{tr}(C\Sigma C') = \operatorname{tr}(\Sigma C'C) = \operatorname{tr}(\Sigma I) = \operatorname{tr}\Sigma = \sum_{i=1}^{p}E(X_i)^2.$$

Corollary: The generalized variance of the vector of principal components is the generalized variance of the original vector.

Proof: We have $U = \Beta'X$ and $\Sigma_U = \Beta'\Sigma\Beta$; then

$$|\Sigma_U| = |\Beta'\Sigma\Beta| = |\Sigma|\,|\Beta'\Beta| = |\Sigma|\,|I| = |\Sigma|.$$

Corollary: The sum of the variances of the principal components is equal to the sum of the variances of the original variates.

Proof: We have $U = \Beta'X$ and $\Sigma_U = \Beta'\Sigma\Beta$; then

$$\operatorname{tr}\Sigma_U = \operatorname{tr}(\Beta'\Sigma\Beta) = \operatorname{tr}(\Sigma\Beta\Beta') = \operatorname{tr}(\Sigma I) = \operatorname{tr}\Sigma,$$

but $\operatorname{tr}\Sigma_U = V(u_1) + \cdots + V(u_p)$ and $\operatorname{tr}\Sigma_X = V(x_1) + \cdots + V(x_p)$.

Exercise: Let $R = \begin{pmatrix} 1 & r_{12} \\ r_{12} & 1 \end{pmatrix}$. Solve

$$|R - \lambda I| = 0, \quad \text{or} \quad \begin{vmatrix} 1-\lambda & r_{12} \\ r_{12} & 1-\lambda \end{vmatrix} = 0, \quad \text{or} \quad (1-\lambda)^2 - r_{12}^2 = 0,$$

or, writing $r_{12} = r$, $\lambda^2 - 2\lambda + 1 - r^2 = 0$, so that

$$\lambda = \frac{2 \pm \sqrt{4 - 4(1 - r^2)}}{2} = 1 \pm r.$$

If $r > 0$: $\lambda_1 = 1 + r$, $\lambda_2 = 1 - r$. If $r < 0$: $\lambda_1 = 1 - r$, $\lambda_2 = 1 + r$. If $r = 0$: $\lambda_1 = 1 = \lambda_2$. That is, in the case of perfect correlation a single principal component explains the variation fully, but in the case of zero correlation no reduction by principal components is possible.

Exercise: Find the variance of the first principal component of the covariance matrix $\Sigma$ defined by $\Sigma = (1-\rho)I + \rho ee'$, where $e' = (1\ 1\ \cdots\ 1)$.

Exercise: Find the characteristic vectors of $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ corresponding to the characteristic roots $1 + \rho$ and $1 - \rho$.
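The first of these exercises is easy to verify numerically; the following sketch assumes NumPy, and the chosen values of $r$ are illustrative.

```python
import numpy as np

# Check that the characteristic roots of R = [[1, r], [r, 1]] are 1 + r and 1 - r,
# and report the proportion of total variance explained by each component.
for r in (0.9, -0.6, 0.0):
    R = np.array([[1.0, r], [r, 1.0]])
    lam = np.sort(np.linalg.eigvalsh(R))[::-1]   # roots in decreasing order
    print(r, lam, lam / lam.sum())
```

For $r = 0.9$ the roots are $1.9$ and $0.1$, so the first component already accounts for 95% of the total variance, while for $r = 0$ each component accounts for 50%. The same routine, via `np.linalg.eigh`, also returns the characteristic vectors asked for in the last exercise, which for this matrix are proportional to $(1, 1)'$ and $(1, -1)'$.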
