
Random Matrix Theory and Its Applications to High-dimensional

Statistical Inference

Danning Li
dl496@statslab.cam.ac.uk

Contents

1 Introduction
  1.1 Definition
  1.2 Large dimensional data analysis
  1.3 Types of random matrices
  1.4 Limit theorems of eigenvalues of popular random matrices

2 Wigner Semicircle Law by Moment Method
  2.1 Semicircle law for iid case
  2.2 Moment Convergence Theorem
  2.3 Proof of Theorem 2

3 Using the Stieltjes Transform to Prove the Marcenko-Pastur Law for the Sample Covariance Matrix
  3.1 MP Law
  3.2 Stieltjes Transform
  3.3 Stieltjes Transform of the MP Law
  3.4 Proof of Theorem 3

4 High-dimensional Sphericity Test
  4.1 Likelihood Ratio Test for Sphericity
  4.2 Different Tests for Sphericity

1 Introduction

1.1 Definition

From linear algebra, we have learnt that there are many different types of matrices, such as
symmetric matrices, orthogonal matrices, etc. The elements of a matrix can be real or
complex. Formally, an $n \times n$ square matrix can be defined as follows:
$$
A := \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nn}
\end{pmatrix}.
$$

The matrix $A$ is a symmetric matrix if $A = A^T$, where $A^T$ is the transpose of $A$ in the
real case or the conjugate transpose of $A$ in the complex case. If $A^T A = I_n$, then $A$ is an
orthogonal matrix. For a non-zero vector $x$, if $Ax = \lambda x$, then we say that the number $\lambda$ is an
eigenvalue of $A$ corresponding to the eigenvector $x$. The matrix $A$ has $n$ eigenvalues, denoted by $\lambda_1, \dots, \lambda_n$.
A random matrix is defined by replacing all the elements of the matrix $A$ with random
variables. Then the corresponding eigenvalues $\lambda_1, \dots, \lambda_n$ are random variables too. The
main research area about random matrices focuses on investigating the limiting properties of
the eigenvalues of random matrices (known as the spectral properties of random matrices)
when their dimension tends to infinity.
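As a quick illustration (a Python/NumPy sketch of ours, not code from the notes), one can sample a symmetric random matrix and compute its real, random eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.standard_normal((n, n))   # iid N(0,1) entries
A = (X + X.T) / 2                 # symmetrize: A = A^T, so eigenvalues are real
eigvals = np.linalg.eigvalsh(A)   # n real eigenvalues, sorted ascending
```

Each run with a different seed gives a different spectrum; the behaviour of that spectrum as $n \to \infty$ is exactly what RMT studies.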

1.2 Large dimensional data analysis

Most of the classical limit theorems in statistics assume that the dimension of the data is
fixed. In recent years, with the development of modern technology, people are collecting
data that characterize many different features of the subjects under study, so the dimension of a
dataset can be huge. Sometimes the dimension of the data can even be larger than the sample
size; gene expression data is a good example. It is then natural to ask whether the
dimension needs to be considered large, and whether there are any differences between the
results for fixed dimension and those for large dimension. The answer is yes!
Why does random matrix theory (RMT) focus on the spectral analysis of random matrices?
One reason is that, in multivariate analysis, many statistics can be expressed as
functions of eigenvalues. For example, consider the hypothesis test $H_0: \Sigma = \sigma^2 I_p$ vs.
$H_1: \Sigma \neq \sigma^2 I_p$ for $n$ independent observations $X_1, \dots, X_n$ from a $N_p(\mu, \Sigma)$ distribution.
The sample covariance matrix is defined as $S := \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^T$ and the likelihood
function of $(\mu, \Sigma)$ is
$$
L(\mu, \Sigma) = \det(\Sigma)^{-n/2} \exp\Big\{-\frac{n-1}{2}\operatorname{tr}\big(\Sigma^{-1} S\big)\Big\} \exp\Big\{-\frac{1}{2}\, n (\bar X - \mu)^T \Sigma^{-1} (\bar X - \mu)\Big\}. \tag{1.1}
$$
Then the likelihood ratio test statistic is
$$
\Lambda := \frac{\sup_{\mu\in\mathbb{R}^p,\,\sigma^2>0} L(\mu, \sigma^2 I_p)}{\sup_{\mu\in\mathbb{R}^p,\,\Sigma>0} L(\mu, \Sigma)}
= \frac{\det\big(\frac{n-1}{n}S\big)^{n/2}}{\big\{\frac{1}{p}\operatorname{tr}\big(\frac{n-1}{n}S\big)\big\}^{np/2}}. \tag{1.2}
$$
Let $\ell_1, \dots, \ell_p$ be the eigenvalues of $\frac{n-1}{n}S$. Then we can write
$$
\log \Lambda = \frac{n}{2}\Big\{\sum_{i=1}^p \log \ell_i - p \log\Big(\frac{1}{p}\sum_{i=1}^p \ell_i\Big)\Big\}.
$$
Research has shown that its limiting distribution is different according to whether $p$ is
fixed or diverges along with $n$. As another example, the largest eigenvalue of $S$ is a very
important statistic in principal component analysis (PCA), and its limiting distribution
also depends on the dimension setting. So the spectral analysis of random matrices is very
important.
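To make the eigenvalue representation of the statistic concrete, here is a small Python/NumPy sketch (our illustration, not code from the notes) that computes $\log \Lambda$ from the eigenvalues of $\frac{n-1}{n}S$ under $H_0$; by the AM-GM inequality, $\log \Lambda \le 0$ always:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.standard_normal((n, p))            # n observations from N_p(0, I_p)
S = np.cov(X, rowvar=False)                # sample covariance, divisor n - 1
ell = np.linalg.eigvalsh((n - 1) / n * S)  # eigenvalues of (n-1)S/n
# log Lambda = (n/2) [ sum_i log(ell_i) - p * log(mean of ell_i) ]
log_lambda = (n / 2) * (np.log(ell).sum() - p * np.log(ell.mean()))
```

Here $n \gg p$, so all eigenvalues are positive almost surely and the statistic is well defined.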

1.3 Types of random matrices

A lot of research in the RMT focus on investigating symmetric random matrices. There are
three classical types of such random matrices. They are named differently in physics and
statistics. The following table gives a comparison.

Real = 1 Complex = 2 Quaternion = 4


Hermite GOE GUE GSE
Laguerre Real Wishart Complex Wishart Quaternion Wishart
Jacobi Real MANOVA Complex MANOVA Quaternion MANOVA
In the table, GOE denotes the Gaussian orthogonal ensemble, GUE is the short notation
for Gaussian unitary ensemble, GSE is the acronym for Gaussian symplectic ensemble and
MANOVA is the short notation for multivariate analysis of variance. Each row corresponds
to one kind of classical random matrices. The parameter = 1, 2 and 4 correspond to
the real, complex and quaternion cases respectively. We are all very familiar with real
and complex numbers. The quaternion number can be defined in the following way: the
quaternion number q := a1 +a2 i+a3 j +a4 k, where a1 , a2 , a3 , a4 R, i2 = j 2 = k 2 = 1 and
ij = k, jk = i and ki = j. Notice that two quaternion numbers q1 and q2 are generally not
commutative for q1 q2 6= q2 q1 . Let x1 , x2 , x3 and x4 follow a standard normal distribution
independently. Then we can define three standard normal distributions correspond to the
real, complex and quaternion cases repectively:

(1) The standard real normal distribution is simply a $N(0, 1)$ distribution.

(2) The standard complex normal distribution, denoted by $CN(0, 1)$, is the probability
distribution of $(x_1 + i x_2)/\sqrt{2}$.

(3) The standard quaternion normal distribution, denoted by $QN(0, 1)$, is the probability
distribution of $(x_1 + i x_2 + j x_3 + k x_4)/2$.

Then the GOE, GUE and GSE, appearing in the first row of the table, can be defined as
$G := \frac{1}{\sqrt{2}}(X + X^T)$, where $X = (x_{ij})$ is an $n \times n$ matrix with $x_{ij} \overset{iid}{\sim} N(0,1)/CN(0,1)/QN(0,1)$
correspondingly (with $X^T$ understood as the conjugate transpose in the complex and quaternion cases). "Hermite ensemble" is another name for this type of matrix in physics. It
is easy to check that each diagonal entry of the GOE is normally distributed with mean 0 and
variance 2, and that each off-diagonal entry follows the standard normal distribution.
The density function of the eigenvalues $(\lambda_1, \dots, \lambda_n)$ of $G$ is
$$
f_G(\lambda_1, \dots, \lambda_n) = C_\beta \prod_{i<j} |\lambda_i - \lambda_j|^{\beta} \exp\Big(-\frac{\beta}{4} \sum_{i=1}^n \lambda_i^2\Big),
$$
where $\beta = 1, 2, 4$ correspond to the real, complex and quaternion cases respectively, $C_\beta$ is
the normalization constant and $(\lambda_1, \dots, \lambda_n) \in \mathbb{R}^n$.
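The variance claims for the GOE entries are easy to confirm by simulation; the Python/NumPy sketch below (ours, an illustration only) draws many copies of $G = (X + X^T)/\sqrt{2}$ and checks that diagonal entries have variance about 2 and off-diagonal entries variance about 1:

```python
import numpy as np

rng = np.random.default_rng(2)
reps, n = 100_000, 3
X = rng.standard_normal((reps, n, n))
G = (X + np.transpose(X, (0, 2, 1))) / np.sqrt(2)  # a batch of GOE matrices
diag_var = G[:, 0, 0].var()      # should be close to 2
offdiag_var = G[:, 0, 1].var()   # should be close to 1
```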
From now on, let C be a generic constant which can vary from line to line.
The real, complex and quaternion Wishart matrices, which appear in the second row of
the table, can be defined as $W := X^T X$ (conjugate transpose in the complex and quaternion cases), where $X = (x_{ij})$ is an $m \times n$ random matrix with
$x_{ij} \overset{iid}{\sim} N(0,1)/CN(0,1)/QN(0,1)$ correspondingly and $m > n$. The physics terminology for this type of
random matrix is the Laguerre ensemble. The density function of the eigenvalues $(\lambda_1, \dots, \lambda_n)$
of $W$ is
$$
f_W(\lambda_1, \dots, \lambda_n) = C_\beta \prod_{i<j} |\lambda_i - \lambda_j|^{\beta} \prod_{i=1}^n \lambda_i^{\frac{\beta}{2}(m-n+1)-1} \exp\Big(-\frac{\beta}{2} \sum_{i=1}^n \lambda_i\Big),
$$
where $\beta = 1, 2, 4$ correspond to the real, complex and quaternion cases respectively, $C_\beta$ is
the normalization constant and $(\lambda_1, \dots, \lambda_n) \in (0, \infty)^n$.
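Since $W = X^T X$ is positive semi-definite, and in fact positive definite almost surely when $m > n$, its eigenvalues indeed lie in $(0, \infty)$, as the density requires. A minimal Python/NumPy check (our illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 8, 5                       # m > n, as in the definition
X = rng.standard_normal((m, n))
W = X.T @ X                       # real Wishart matrix, n x n
lam = np.linalg.eigvalsh(W)       # all eigenvalues strictly positive a.s.
```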
The real, complex and quaternion MANOVA matrices, appearing in the last row of the table,
are defined as $M := X^T X (X^T X + Y^T Y)^{-1}$, where $X = (x_{ij})_{m_1 \times n}$ and $Y = (y_{ij})_{m_2 \times n}$,
with $n < \min(m_1, m_2)$ and $x_{ij}, y_{ij} \overset{iid}{\sim} N(0,1)/CN(0,1)/QN(0,1)$. In physics, the MANOVA matrix
is also called the Jacobi ensemble. The density function of the eigenvalues is
$$
f_M(\lambda_1, \dots, \lambda_n) = C_\beta \prod_{i<j} |\lambda_i - \lambda_j|^{\beta} \prod_{i=1}^n \lambda_i^{\frac{\beta}{2}(m_1-n+1)-1} (1 - \lambda_i)^{\frac{\beta}{2}(m_2-n+1)-1},
$$
where $\beta = 1, 2, 4$ correspond to the real, complex and quaternion cases, respectively. Again
$C_\beta$ is the normalization constant and $(\lambda_1, \dots, \lambda_n) \in (0, 1)^n$.

REMARK 1.1 Once we have the joint distribution of the eigenvalues, we can derive a lot
of probabilistic properties, such as the limiting spectral distribution and the limits of the
largest eigenvalues, etc. In all of the cases above, $\lambda_1, \dots, \lambda_n$ are not independent, but the
dependence is not too strong.

Now let us see how to derive the joint density function of the eigenvalues of a special random matrix
$A$ in the real case. Define $A := (A_{ij})_{n \times n} = \sqrt{1/2}\, G$, where $G$ is the GOE defined above.
So $\{A_{ij}, i \le j \le n\}$ are independent random variables, $A_{ii} \sim N(0, 1)$, $A_{ij} \sim N(0, \frac{1}{2})$ for $i < j$, and
$A_{ij} = A_{ji}$.

THEOREM 1 Let $(\lambda_1, \lambda_2, \dots, \lambda_n)$ be the eigenvalues of $A$. Then the joint density of
$(\lambda_1, \lambda_2, \dots, \lambda_n) \in \mathbb{R}^n$ is
$$
f_A(\lambda_1, \dots, \lambda_n) = C \prod_{i<j} |\lambda_i - \lambda_j| \exp\Big(-\frac{1}{2}\sum_{i=1}^n \lambda_i^2\Big), \tag{1.3}
$$
where $C$ is the normalization constant.

Before we prove Theorem 1, let us review a very useful lemma (Corollary 2.5.4 from Anderson, Guionnet and Zeitouni (2010)).

LEMMA 1.1 Let $(U_1, U_2, \dots, U_n)$ denote the eigenvectors corresponding to the eigenvalues
$(\lambda_1, \dots, \lambda_n)$ of the random matrix $A$. Then the collection $(U_1, U_2, \dots, U_n)$ is independent
of the eigenvalues $(\lambda_1, \dots, \lambda_n)$.

Sketch Proof of Lemma 1.1. For any matrix $M \in \mathbb{R}^{n \times n}$, we say $M$ is admissible if the
first non-zero element of each column of $M$ is positive. There exists a function $a: \mathbb{R}^{n\times n} \to \mathbb{R}^{n\times n}$ such that $a(M)$ is the admissible form of $M$. Define $D(M) = \operatorname{diag}(h_1, \dots, h_n)$, the $n$-dimensional diagonal matrix whose entries $h_1 \ge h_2 \ge \dots \ge h_n$ are the eigenvalues of $M$, and let $U(M)$ be
the admissible matrix whose $j$th column is the eigenvector corresponding to the $j$th
largest eigenvalue of $M$. By the eigenvalue decomposition, we have $A = U(A)D(A)U(A)^T$.
It is known that, with probability 1, all the eigenvalues of $A$ are distinct. So we have
$D(A) = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ with $\lambda_1 > \lambda_2 > \dots > \lambda_n$. Now the transformation $A =
U(A)D(A)U(A)^T$ is a bijective map. In order to show that $D(A)$ and $U(A)$ are independent,
we only need to show that the conditional distribution of $D(A)$ given $U(A)$ does not
depend on $U(A)$. In other words, we need to show that the two conditional random
matrices $D(A)|(U(A) = U_0)$ and $D(A)|(U(A) = U_1)$ have the same distribution, where $U_0$
and $U_1$ are any two orthogonal admissible matrices with $U_0 \neq U_1$. Let $H = U_1 U_0^T$. Since
$D(A) = D(HAH^T)$, we have that
$$
D(A)|(U(A) = U_1) = D(A)|(U(A) = HU_0) = D(HAH^T)\big|\big(U(H(H^T A H)H^T) = HU_0\big).
$$

Let $B = H^T A H$, $\Lambda = D(B)$ and $P = U(B)$. Since $HBH^T = HP\Lambda (HP)^T = a(HP)\Lambda [a(HP)]^T$,
we have that $U(H(H^T A H)H^T) = a(HU(H^T A H))$. So
$$
D(A)|(U(A) = U_1) = D(HAH^T)\big|\big(a(HU(H^T A H)) = HU_0\big)
= D(HAH^T)\big|\big(U(H^T A H) = U_0\big) \overset{d}{=} D(A)|(U(A) = U_0).
$$
To show the second equality above, we only need to show that the event $\{a(HU(H^T A H)) =
HU_0\}$ is the same as the event $\{U(H^T A H) = U_0\}$. We know that there exists a diagonal matrix $S = \operatorname{diag}(\pm 1)$ such that $a(HU(H^T A H)) = HU(H^T A H)S$; then $HU(H^T A H)S =
HU_0$. Since $H$ is orthogonal and $U(H^T A H)S = U_0$ with both $U(H^T A H)$ and $U_0$ admissible (flipping the sign of a column would break admissibility), we get that
$S = I_n$. Therefore $U(H^T A H) = U_0$ if $a(HU(H^T A H)) = HU_0$. It is easy to check the converse, that
$a(HU(H^T A H)) = HU_0$ if $U(H^T A H) = U_0$. The last equality in distribution is due to $A \overset{d}{=} HAH^T$.
So we conclude that the eigenvalues of $A$ are independent of its eigenvectors.

Proof of Theorem 1. It is easy to see that the joint density of $\{A_{ij}, 1 \le i \le j \le n\}$ is
$$
L(A_{ij}, 1 \le i \le j \le n) = C \prod_{i=1}^n \exp\Big(-\frac{1}{2}A_{ii}^2\Big) \prod_{1\le i<j\le n} \exp(-A_{ij}^2)
= C \exp\Big(-\frac{1}{2}\sum_{i=1}^n A_{ii}^2 - \sum_{1\le i<j\le n} A_{ij}^2\Big) = C \exp\Big\{-\frac{1}{2}\operatorname{tr}(A^2)\Big\}.
$$

By the eigendecomposition, we have $A = UDU^T$, where $U$ and $D$ satisfy the conditions
required in the proof of Lemma 1.1, which make the map from $A$ to $(D, U)$ one to
one. We can re-write the matrix $A$ as an $n^2 \times 1$ vector $\operatorname{vec}(A)$ by stacking up the
columns of $A$. In the same way we can also vectorize the matrices $D$ and $U$ into
$\operatorname{vec}(D)$ and $\operatorname{vec}(U)$. Recall that the number of degrees of freedom of a random vector is the number
of free components in it (i.e., how many components need to be known before
the vector is fully determined). Since $A$ is symmetric, we know that the elements in the
lower triangular part are determined by those in the upper triangular part. This means
that, given the entries on or above the diagonal of $A$, the rest of the elements of $A$ are fully
determined. So there are only $\frac{n(n+1)}{2}$ free components in the random vector
$\operatorname{vec}(A)$. It is easy to see that $\operatorname{vec}(D)$ has $n$ free components. The vector $\operatorname{vec}(U)$ has $\frac{n(n-1)}{2}$
free components, because there are $n + \frac{1}{2}n(n-1)$ constraints on the $n^2$ elements of $\operatorname{vec}(U)$
implied by $U^T U = I_n$ (writing $U = (U_1, \dots, U_n)$, the constraints are the $\ell_2$ norms $\|U_i\| = 1$
and the inner products $(U_i, U_j) = 0$). Now define the following three vectors:
$$
\alpha := \begin{pmatrix} A_{11} \\ A_{22} \\ \vdots \\ A_{nn} \\ A_{12} \\ A_{13} \\ \vdots \\ A_{(n-1)n} \end{pmatrix} \in \mathbb{R}^{\frac{n(n+1)}{2}}, \qquad
\delta := \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \end{pmatrix}, \qquad
\gamma := \begin{pmatrix} U_{12} \\ U_{13} \\ \vdots \\ U_{1n} \\ U_{23} \\ \vdots \\ U_{(n-1)n} \end{pmatrix} \in \mathbb{R}^{\frac{n(n-1)}{2}}. \tag{1.4}
$$

We know that all the elements of $A$, $D$ and $U$ can be expressed as functions of $\alpha$, $\delta$
and $\gamma$ respectively. For ease of notation, let $\theta := (\theta_1, \dots, \theta_{n(n-1)/2}) = \gamma$. From
Lemma 1.1, we know that $\delta$ and $\gamma$ are independent. Since $\operatorname{tr}(A^2) = \operatorname{tr}(UD^2U^T) = \operatorname{tr}(D^2) =
\sum_{i=1}^n \lambda_i^2$, together with (1.4), we know that the joint density of $(\delta, \gamma)$ is
$$
g(\delta, \gamma) = C \exp\Big(-\frac{1}{2}\sum_i \lambda_i^2\Big) \Big|\det\Big(\frac{\partial \alpha}{\partial(\delta, \gamma)}\Big)\Big|. \tag{1.5}
$$

Now we only need to calculate the determinant of the Jacobian of the map from $\alpha$ to $(\delta, \gamma)$.
Define $J := \frac{\partial \alpha}{\partial(\delta,\gamma)}$; we have
$$
J = \begin{pmatrix}
\frac{\partial A_{11}}{\partial \lambda_1} & \dots & \frac{\partial A_{nn}}{\partial \lambda_1} & \frac{\partial A_{12}}{\partial \lambda_1} & \dots & \frac{\partial A_{(n-1)n}}{\partial \lambda_1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial A_{11}}{\partial \lambda_n} & \dots & \frac{\partial A_{nn}}{\partial \lambda_n} & \frac{\partial A_{12}}{\partial \lambda_n} & \dots & \frac{\partial A_{(n-1)n}}{\partial \lambda_n} \\
\frac{\partial A_{11}}{\partial \theta_1} & \dots & \frac{\partial A_{nn}}{\partial \theta_1} & \frac{\partial A_{12}}{\partial \theta_1} & \dots & \frac{\partial A_{(n-1)n}}{\partial \theta_1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial A_{11}}{\partial \theta_{n(n-1)/2}} & \dots & \frac{\partial A_{nn}}{\partial \theta_{n(n-1)/2}} & \frac{\partial A_{12}}{\partial \theta_{n(n-1)/2}} & \dots & \frac{\partial A_{(n-1)n}}{\partial \theta_{n(n-1)/2}}
\end{pmatrix}.
$$
In order to calculate the determinant of $J$, we need to review some definitions and useful
facts from linear algebra.

(1) For any $n \times n$ matrix $B$, let $\frac{\partial B}{\partial x}$ denote the derivative of the matrix $B$ with respect to
the scalar variable $x$, which is the $n \times n$ matrix of element-by-element derivatives:
$$
\frac{\partial B}{\partial x} = \begin{pmatrix}
\frac{\partial B_{11}}{\partial x} & \dots & \frac{\partial B_{1n}}{\partial x} \\
\vdots & \ddots & \vdots \\
\frac{\partial B_{n1}}{\partial x} & \dots & \frac{\partial B_{nn}}{\partial x}
\end{pmatrix}.
$$

Since $U^T U = I_n$, by the product rule, we have
$$
\frac{\partial U^T}{\partial \lambda_1} U + U^T \frac{\partial U}{\partial \lambda_1} = 0. \tag{1.6}
$$
Let $S^{(1)} := U^T \frac{\partial U}{\partial \lambda_1} = -\frac{\partial U^T}{\partial \lambda_1} U$.

(2) Since $A = UDU^T$, we conclude that
$$
\frac{\partial A}{\partial \lambda_1} = \frac{\partial U}{\partial \lambda_1} D U^T + U D \frac{\partial U^T}{\partial \lambda_1} + U \frac{\partial D}{\partial \lambda_1} U^T.
$$
So
$$
U^T \frac{\partial A}{\partial \lambda_1} U = U^T \frac{\partial U}{\partial \lambda_1} D + D \frac{\partial U^T}{\partial \lambda_1} U + \frac{\partial D}{\partial \lambda_1}
= S^{(1)} D - D S^{(1)} + \frac{\partial D}{\partial \lambda_1}.
$$
It follows that the element in the $(i,j)$th position of the $n \times n$ matrix $U^T \frac{\partial A}{\partial \lambda_1} U$ is
$$
\sum_{1\le k,l\le n} \frac{\partial A_{kl}}{\partial \lambda_1} U_{ki} U_{lj} = S^{(1)}_{ij}(\lambda_j - \lambda_i) + \delta_{ij}\delta_{i1}, \tag{1.7}
$$
where $S^{(1)}_{ij}$ is the $(i,j)$th element of the matrix $S^{(1)}$, $\delta$ is the Kronecker delta, and we used $\frac{\partial D}{\partial \lambda_1} = \operatorname{diag}(1, 0, \dots, 0)$. Since $A_{ij} = A_{ji}$, the
above equation can be rewritten as
$$
\sum_k \frac{\partial A_{kk}}{\partial \lambda_1} U_{ki}U_{kj} + 2\sum_{k<l} \frac{\partial A_{kl}}{\partial \lambda_1} U_{ki}U_{lj} = S^{(1)}_{ij}(\lambda_j - \lambda_i) + \delta_{ij}\delta_{i1}. \tag{1.8}
$$

(3) Since $D = U^T A U$ is diagonal, taking $i = j$ in (1.8) gives the $(i,i)$th element of the
matrix $\frac{\partial D}{\partial \lambda_1}$:
$$
\frac{\partial D_{ii}}{\partial \lambda_1} = \sum_k \frac{\partial A_{kk}}{\partial \lambda_1} U_{ki}U_{ki} + 2\sum_{k<l} \frac{\partial A_{kl}}{\partial \lambda_1} U_{ki}U_{li} = \delta_{i1}. \tag{1.9}
$$
The same computations with a component $\theta_m$ of $\gamma$ in place of $\lambda_1$ hold with $S^{(m)} := U^T \frac{\partial U}{\partial \theta_m}$ and with the $\delta$ terms replaced by 0, since $D$ does not depend on $\gamma$; likewise, since $U$ does not depend on $\delta$, the right-hand side of (1.8) reduces to $\delta_{ij}\delta_{i1}$.

Now define $\eta_{ij} \in \mathbb{R}^{(n+1)n/2}$ and a matrix $V \in \mathbb{R}^{(n+1)n/2 \times (n+1)n/2}$ as follows:
$$
\eta_{ij} = \begin{pmatrix} U_{1i}U_{1j} \\ \vdots \\ U_{ni}U_{nj} \\ 2U_{1i}U_{2j} \\ \vdots \\ 2U_{(n-1)i}U_{nj} \end{pmatrix} \tag{1.10}
$$
and $V = (\eta_{11}, \eta_{22}, \dots, \eta_{nn}, \eta_{12}, \eta_{13}, \dots, \eta_{1n}, \eta_{23}, \dots, \eta_{(n-1)n})$. Then by (1.8) and (1.9), we get
that
$$
JV = \begin{pmatrix}
I_{n\times n} & 0 & \dots & 0 \\
* & S^{(1)}_{12}(\lambda_2 - \lambda_1) & \dots & S^{(1)}_{(n-1)n}(\lambda_n - \lambda_{n-1}) \\
\vdots & \vdots & \ddots & \vdots \\
* & S^{(\frac{1}{2}n(n-1))}_{12}(\lambda_2 - \lambda_1) & \dots & S^{(\frac{1}{2}n(n-1))}_{(n-1)n}(\lambda_n - \lambda_{n-1})
\end{pmatrix},
$$
where $*$ denotes some unknown values. So $|\det(JV)| = \prod_{i<j}|\lambda_i - \lambda_j|\, h(\theta_1, \dots, \theta_{\frac{1}{2}n(n-1)})$ for some function $h$, since each column of the lower-right block carries the factor $(\lambda_j - \lambda_i)$ and what remains depends only on $\theta$.
Since $\det(V)$ is a function of $\operatorname{vec}(U)$, it can be written as a function of $\theta$ too. Further, since
$|\det(JV)| = |\det(J)||\det(V)|$, we know $|\det(J)| = \prod_{1\le i<j\le n}|\lambda_i - \lambda_j|\,\tilde h(\theta_1, \dots, \theta_{\frac{1}{2}n(n-1)})$. Now
we have that
$$
g(\delta, \gamma) = C \exp\Big(-\frac{1}{2}\sum_i \lambda_i^2\Big) \prod_{i<j}|\lambda_i - \lambda_j|\,\tilde h(\theta_1, \dots, \theta_{\frac{1}{2}n(n-1)}).
$$
Since $\delta$ and $\gamma$ are independent, it is easy to obtain the joint density of $(\lambda_1, \dots, \lambda_n)$ from
$g(\delta, \gamma)$.
Besides the random matrices in the table, there are some other types of random matrices
of interest, such as Toeplitz ($a_{ij} = a_{|i-j|}$), Hankel ($a_{ij} = a_{i+j-1}$) and Markov matrices
(symmetric matrices whose $i$th diagonal element equals $-\sum_{j\neq i} a_{ij}$, so that each row sums to zero). All the examples above
are symmetric matrices. We may also investigate properties of non-symmetric
matrices, such as $G = (g_{ij})$, where the $g_{ij}$ are iid $N(0,1)/CN(0,1)$, and the Haar measure on the classical
compact groups. In these cases, the eigenvalues of the random matrices may not be real any
more.
From a statistical point of view, people care more about matrices whose entries have
unknown distributions and may depend on each other. However, due to technical limitations, the current RMT does not handle such random matrices very well. Some
results are known for special cases such as the Haar matrices. Nevertheless, the RMT does
provide some insights about random matrices whose entries follow unknown distributions,
when the necessary independence assumptions hold.

Some extra notes about the Haar measure:

(1) The orthogonal group is defined as $O(n) := \{\text{all } n \times n \text{ real orthogonal matrices}\}$.

(2) There exists a unique probability measure $\mu$ on $(O(n), \cdot)$, where $\cdot$ indicates the usual matrix
product, such that
$$
\int_{O(n)} f(ax)\,\mu(dx) = \int_{O(n)} f(xb)\,\mu(dx) = \int_{O(n)} f(x)\,\mu(dx)
$$
for any bounded measurable function $f: O(n) \to \mathbb{R}$ and any $a, b \in O(n)$. Then $\mu$ is the
Haar measure on $O(n)$.

(3) There are two ways to generate the Haar measure on $O(n)$.
a) Let $Y = (y_{ij})_{n\times n}$, where the $y_{ij}$ are iid $N(0,1)$ random variables. Set $X = Y(Y^T Y)^{-\frac{1}{2}}$. Then $X$
generates the Haar measure on $O(n)$, and $X$ is also called a Haar orthogonal ensemble.
b) Let $Y$ be the same as above and write $Y = (y_1, \dots, y_n)$. Use the Gram-Schmidt
process (QR decomposition) on $Y$: $e_1 = \frac{y_1}{\|y_1\|}$, $w_2 = y_2 - (y_2, e_1)e_1$, then $e_2 = \frac{w_2}{\|w_2\|}$, and
so on: $w_i = y_i - \sum_{j=1}^{i-1}(y_i, e_j)e_j$ and $e_i = \frac{w_i}{\|w_i\|}$. Then $X = (e_1, \dots, e_n)$ has the Haar
measure on $O(n)$ and it is a Haar orthogonal ensemble.
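Method b) can be carried out in a few lines with a QR decomposition; the sketch below (Python/NumPy, our illustration) also fixes the column signs so that the orthogonal factor is the Gram-Schmidt one, which is what makes the output Haar distributed:

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Draw an n x n Haar-distributed orthogonal matrix (QR of a Gaussian matrix)."""
    Y = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Y)
    # force R to have a positive diagonal so that Q matches Gram-Schmidt
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(4)
U = haar_orthogonal(5, rng)
```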

1.4 Limit theorems of eigenvalues of popular random matrices

Suppose $A$ is an $n \times n$ matrix with eigenvalues $\lambda_i$, $i = 1, 2, \dots, n$. If all these eigenvalues
are real (e.g., if $A$ is Hermitian), we can define an empirical distribution of the eigenvalues
as follows:
$$
F_n^A(x) = \frac{1}{n}\sum_{i=1}^n I(\lambda_i \le x),
$$
which is called the empirical spectral distribution (ESD) of the matrix $A$. Investigating the
convergence of $F_n^{A_n}$ as the dimension $n$ goes to infinity is one of the main problems in RMT.
The limit $F$ of $F_n^A$ is called the limiting spectral distribution (LSD) of $A$.
The importance of the ESD is due to the fact that many important statistics can be expressed
as functions of the ESD; for example, $\det(A) = \prod_{i=1}^n \lambda_i = \exp\big(n\int_0^\infty \log x\, F^A(dx)\big)$ when all eigenvalues are positive. We have
the following famous results regarding the LSD for some well-known random matrices:

(1) Semicircle law: the ESD of the normalized GOE (divided by $\sqrt{n}$) converges to the
semicircle law with probability 1. The density function of the semicircle law is
$$
f(x) = \begin{cases} \frac{1}{2\pi}\sqrt{4 - x^2} & \text{if } |x| \le 2; \\ 0 & \text{otherwise.} \end{cases}
$$
[Figure: density of the semicircle law on [-2, 2].]

The GOE is also called the Wigner matrix, named after the mathematical
physicist Eugene Wigner, who first proved that the LSD of the GOE follows the semicircle law. The above result also holds for the generalized Wigner matrix, which
only requires the matrix to be symmetric, with independent on-or-above-diagonal entries
satisfying a Lindeberg-type condition.
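The semicircle law is easy to see numerically; the Python/NumPy sketch below (our illustration) builds one GOE matrix, normalizes it by $\sqrt{n}$, and checks the first even moments of the ESD against the semicircle values $\int x^2 f = 1$ and $\int x^4 f = 2$:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
X = rng.standard_normal((n, n))
G = (X + X.T) / np.sqrt(2)                # GOE
lam = np.linalg.eigvalsh(G / np.sqrt(n))  # eigenvalues of the normalized matrix
m2 = np.mean(lam ** 2)   # second moment of the semicircle law is 1
m4 = np.mean(lam ** 4)   # fourth moment is 2 (a Catalan number)
```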

(2) The largest and smallest eigenvalues of the Wigner matrix have limit
results as follows: $\frac{\lambda_{\max}}{\sqrt{n}} \to 2$ and $\frac{\lambda_{\min}}{\sqrt{n}} \to -2$, provided the entries of the Wigner matrix have
mean 0, variance 1 and finite fourth moment.
(3) Marcenko-Pastur law: the ESD of the sample covariance matrix $S_n := \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^T(X_i - \bar X)$,
where $X_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and the $x_{ij}$ are i.i.d. with mean 0 and variance 1, converges to the MP law, which has density function
$$
f(x) = \begin{cases} \frac{1}{2\pi x y}\sqrt{(b-x)(x-a)} & \text{if } a \le x \le b; \\ 0 & \text{otherwise,} \end{cases}
$$
and has a point mass $1 - 1/y$ at the origin if $y > 1$, where $a = (1-\sqrt{y})^2$, $b = (1+\sqrt{y})^2$
and $p/n \to y \in (0, \infty)$. The asymptotic theory of the spectral analysis of the sample
covariance matrix was developed by Marcenko and Pastur (1967) and Pastur (1972,
1973). This result was later generalized to non-Gaussian ensembles.

    library(RMTstat)
    n <- 500
    p <- 100
    x <- seq((1 - sqrt(p/n))^2, (1 + sqrt(p/n))^2, 0.01)
    y <- dmp(x, n, p, 1)
    plot(x, y, type = "l")

[Figure: density of the MP law for y = p/n = 0.2.]

It is easy to see that the MP distribution is not symmetric in general. When $p > n$, there
are $p - n$ eigenvalues equal to 0, so the LSD has a point mass at zero in
this case. This LSD also indicates a very important fact: the sample covariance
matrix is no longer a good estimator of the population covariance matrix when the dimension is
very high.
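The point mass at zero when $p > n$ is just a rank statement: $X^T X$ has rank at most $n$, so at least $p - n$ eigenvalues vanish. A quick Python/NumPy check (our illustration; no centering here, which makes the count exactly $p - n$, while centering would cost one more):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 80                         # dimension larger than sample size
X = rng.standard_normal((n, p))
S = X.T @ X / n                       # p x p matrix of rank at most n
lam = np.linalg.eigvalsh(S)
num_zero = int(np.sum(lam < 1e-10))   # exactly p - n zero eigenvalues
```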

(4) Limit for the spectral norm of sample covariance matrices: $\lambda_{\max}(S_n) \to (1+\sqrt{y})^2$ if and
only if the fourth moment of the $x_{ij}$ exists.

(5) Tracy-Widom law: Assume that $X$ is an $n \times p$ random matrix with $x_{ij} \overset{i.i.d.}{\sim} N(0, 1)$,
where $n/p \to y \in (0, \infty)$. Denote $\mu_n = (\sqrt{n} + \sqrt{p})^2$ and $\sigma_n = (\sqrt{n} + \sqrt{p})\big(\frac{1}{\sqrt{n}} +
\frac{1}{\sqrt{p}}\big)^{1/3}$. Then $\frac{\lambda_{\max}(X^T X) - \mu_n}{\sigma_n} \Rightarrow F_1$, where $F_1$ is the Tracy-Widom law, with
$$
F_1(s) = \exp\Big(-\frac{1}{2}\int_s^\infty \big(q(x) + (x-s)q^2(x)\big)\,dx\Big)
$$
and $q(x)$ satisfying the Painlevé II equation
$$
\begin{cases} q''(x) = x\,q(x) + 2q^3(x); \\ q(x) \sim \mathrm{Ai}(x), \quad x \to \infty. \end{cases}
$$
This result was derived by Iain M. Johnstone (2001). In fact, the first work was done
for the limiting distribution of the largest eigenvalue of the Gaussian symmetric matrix by
Tracy and Widom (1996). Recently, universality for the sample covariance matrix has
been established by two different methods, so the Tracy-Widom law has been generalized
to non-Gaussian ensembles under some moment assumptions. The following picture
shows the density of the Tracy-Widom law:

[Figure: density of the Tracy-Widom law F1.]
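The centering and scaling constants are simple closed forms; here is a Python/NumPy sketch (ours, an illustration) that computes them and normalizes one simulated largest eigenvalue, which should come out of order one:

```python
import numpy as np

def tw_center_scale(n, p):
    """Centering mu_n and scaling sigma_n from the Tracy-Widom limit above."""
    mu = (np.sqrt(n) + np.sqrt(p)) ** 2
    sigma = (np.sqrt(n) + np.sqrt(p)) * (1 / np.sqrt(n) + 1 / np.sqrt(p)) ** (1 / 3)
    return mu, sigma

rng = np.random.default_rng(7)
n, p = 400, 100
X = rng.standard_normal((n, p))
lam_max = np.linalg.eigvalsh(X.T @ X).max()
mu, sigma = tw_center_scale(n, p)
t = (lam_max - mu) / sigma        # approximately Tracy-Widom F1 distributed
```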

(6) As mentioned above, linear spectral statistics are important in multivariate analysis; the LRT statistic is one example. We define $\int f(x)\,F_n(dx)$, where $F_n(x)$ is the ESD of a random
matrix, and call it a linear spectral statistic (LSS). We can use an LSS to estimate a parameter $\theta = \int f(x)\,F(dx)$. Bai and
Silverstein (2004) showed that the limiting distribution of
$$
X_n(f) := a_n \int f(t)\, d\big(F_n(t) - F(t)\big)
$$
is Gaussian under certain assumptions.

(7) Girko's circular law: let $G = (g_{ij})_{n\times n}$ with $g_{ij}$ i.i.d. $N(0,1)/CN(0,1)$; then the corresponding
eigenvalues $\lambda_1, \dots, \lambda_n$ of $\frac{1}{\sqrt{n}}G$ are complex random variables, and the empirical spectral
distribution $F_n^G(x, y) = \frac{1}{n}\#\{i \le n : \mathrm{Re}(\lambda_i) \le x, \mathrm{Im}(\lambda_i) \le y\}$ almost surely converges
to the uniform distribution on the unit disk in the complex plane, with density $f(z) =
\frac{1}{\pi} I(|z| \le 1)$. The best result was given by Tao and Vu (2010), assuming only that the $g_{ij}$ are
i.i.d. with mean 0 and variance 1.
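The circular law can be visualized directly; the Python/NumPy sketch below (our illustration) computes the complex eigenvalues of $G/\sqrt{n}$ and checks that essentially all of them lie in the unit disk:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
G = rng.standard_normal((n, n))             # non-symmetric, so complex spectrum
lam = np.linalg.eigvals(G / np.sqrt(n))
frac_inside = np.mean(np.abs(lam) <= 1.05)  # spectral radius tends to 1
```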

2 Wigner Semicircle Law by Moment Method

Under a generalized definition, a Wigner matrix is a symmetric matrix whose entries on
or above the diagonal are independent. The Wigner matrix class includes the GOE, GUE and GSE. Research on the Wigner matrix arose in nuclear physics in the 1950s. Wigner (1955,
1958) showed that the expected ESD of a Gaussian ensemble converges to the semicircle
law. This work was generalized by many researchers in various aspects.

2.1 Semicircle law for iid case

Suppose that Xn is an n n Wigner matrix whose diagonal entries are i.i.d. random
variables with mean 0 and variance 2 and those above the diagonal entries are i.i.d. random
variables with mean 0 and variance 1. Let Wn = 1 Xn denote the normalized Wigner
n
matrix. Suppose 1 , . . . , n are the eigenvalues of Wn , we can write the empirical spectral
distribution (ESD) of Wn as
n
1X
FWn (x) = I(i x).
n
i=1

Then we have the following theorem.

THEOREM 2 With probability 1, the ESD $F^{W_n}(x)$ of $W_n = \frac{1}{\sqrt{n}} X_n$ converges to the
semicircle distribution, whose density is
$$
f(x) = \begin{cases} \frac{1}{2\pi}\sqrt{4 - x^2} & \text{if } |x| \le 2; \\ 0 & \text{otherwise.} \end{cases}
$$

2.2 Moment Convergence Theorem

In order to apply the moment method to prove Theorem 2, we need to review some useful
facts about the moment convergence theorem (MCT). The MCT gives conditions under
which the convergence of moments of all fixed orders implies convergence in distribution.
Let $\{Y_n\}$ denote a sequence of random variables with finite moments of all orders. Since
$$
\sup_n P(|Y_n| \ge K) \le \frac{1}{K^2}\sup_n E|Y_n|^2 \le \frac{C}{K^2},
$$
the sequence $\{Y_n\}$ is tight. Let $Y_{n_l}$ and $Y_{n_k}$ be any two subsequences of $Y_n$. Then we can find
two convergent sub-subsequences of them. To simplify the notation, call them $Y_{n_1}$ and
$Y_{n_2}$, and suppose $Y_{n_1} \Rightarrow Y_1$ and $Y_{n_2} \Rightarrow Y_2$. From the moment convergence condition, we know
that $EY_1^k = EY_2^k$ for all $k \in \mathbb{N}$. Let $M_k = EY_1^k$. Since $\{Y_n\}$ is tight, we know that if every
convergent subsequence of $Y_n$ converges to the same limit $Y$ in distribution, then $Y_n \Rightarrow Y$.
Now we need some conditions to guarantee that $Y_1 \overset{d}{=} Y_2$; in other words, we need
conditions under which the distribution of a random variable is uniquely determined by its
moments. In fact, either the Carleman condition or the Riesz condition guarantees the uniqueness of
the distribution given its moments. They are:

Riesz condition: $\liminf_{k} \frac{1}{k} M_{2k}^{\frac{1}{2k}} < \infty$;

Carleman condition: $\sum_{k=1}^\infty M_{2k}^{-\frac{1}{2k}} = \infty$.

The proof is given in Bai and Silverstein (2009).

2.3 Proof of Theorem 2

In this section, we follow three steps to prove Theorem 2:

(1) Calculate $M_k$, the $k$th moment of the semicircle distribution;

(2) Show that the Carleman condition $\sum_k M_{2k}^{-\frac{1}{2k}} = \infty$ holds;

(3) Prove that $\int x^k\, dF^{W_n}(x) = \frac{1}{n}\sum_i \lambda_i^k = \frac{1}{n^{k/2+1}}\operatorname{tr}(X_n^k) \to M_k$.

Then, applying the MCT, we obtain the theorem. It is easy to verify (1) and (2).


Proof of (1): Obviously, $M_{2k+1} = 0$, because the semicircle density $f$ is a symmetric
function. Now
$$
M_{2k} = \frac{1}{2\pi}\int_{-2}^{2} x^{2k}\sqrt{4 - x^2}\,dx = \frac{1}{\pi}\int_0^2 x^{2k}\sqrt{4 - x^2}\,dx
= \frac{2}{\pi}\int_0^1 (4y)^k (1-y)^{\frac{1}{2}} y^{-\frac{1}{2}}\,dy = \frac{2^{2k+1}\,\Gamma(k + \frac{1}{2})\Gamma(\frac{3}{2})}{\pi\,\Gamma(k+2)}.
$$
By $\Gamma(x+1) = x\Gamma(x)$ and $\Gamma(\frac{1}{2}) = \sqrt{\pi}$, we have $M_{2k} = \frac{(2k)!}{(k+1)!\,k!} = \frac{1}{k+1}\binom{2k}{k}$ for all $k \in \mathbb{N}$. In
fact, $\frac{1}{k+1}\binom{2k}{k}$ is the $k$th Catalan number from combinatorial mathematics.

Proof of (2): Since $M_{2k} \le (2k)! \le (2k)^{2k}$, we have $M_{2k}^{\frac{1}{2k}} \le 2k$. Therefore $\sum_k M_{2k}^{-\frac{1}{2k}} \ge \sum_k \frac{1}{2k} = \infty$.
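The Catalan-number formula can be double-checked numerically; the Python/NumPy sketch below (ours) integrates $x^{2k}$ against the semicircle density with a midpoint rule and compares the result with $\frac{1}{k+1}\binom{2k}{k}$:

```python
import numpy as np
from math import comb

def semicircle_moment(k, m=400_000):
    """(2k)-th moment of the semicircle density, midpoint rule on [-2, 2]."""
    dx = 4.0 / m
    x = -2.0 + dx * (np.arange(m) + 0.5)
    f = np.sqrt(4.0 - x * x) / (2.0 * np.pi)
    return float(np.sum(x ** (2 * k) * f) * dx)

catalan = [comb(2 * k, k) // (k + 1) for k in range(5)]   # 1, 1, 2, 5, 14
numeric = [semicircle_moment(k) for k in range(5)]
```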
P

Sketch proof of (3): In this sketch, let us further assume that all the entries of $X_n$
are bounded. In order to show $\frac{1}{n^{k/2+1}}\operatorname{tr}(X_n^k) \to M_k$, we only need to show

(a) $\frac{1}{n^{k/2+1}} E\operatorname{tr}(X_n^k) \to M_k$;

(b) $\sum_n \operatorname{Var}\big(\frac{1}{n^{k/2+1}}\operatorname{tr}(X_n^k)\big) < \infty$.

First, we review some concepts from graph theory. A graph is an ordered pair $\Gamma = (V, E)$,
where $V$ is the vertex set and $E$ is the edge set. In a directed graph, an edge is an ordered
pair $(v_1, v_2)$, which starts from the vertex $v_1$ and ends at the vertex $v_2$. If $v_1 = v_2$, we
call this type of edge a loop. If two edges have the same set of vertices, we say they are
coincident. A cycle is defined as a sequence of vertices that starts and ends at the same
vertex, where any two consecutive vertices in the cycle are connected by an edge of the graph.
Now, we can expand $\operatorname{tr}(X_n^k)$ as
$$
\operatorname{tr}(X_n^k) = \sum x_{i_1 i_2} x_{i_2 i_3} \cdots x_{i_{k-1} i_k} x_{i_k i_1}, \tag{2.11}
$$
where the sum runs over $(i_1, \dots, i_k) \in \{1, 2, \dots, n\}^k$. Let $G_{i_1 \dots i_k} = x_{i_1 i_2} x_{i_2 i_3} \cdots x_{i_{k-1} i_k} x_{i_k i_1}$.
With the index set $(i_1, \dots, i_k)$ of $G_{i_1 \dots i_k}$, we can define a graph $\Gamma(k, t)$ as follows:
plot a horizontal line and mark the numbers $i_1, \dots, i_k$ on it. Suppose there are $t$ distinct
numbers in this set. The $t$ distinct numbers become the vertices of the graph, and we draw
$k$ edges, from $i_1$ to $i_2$, $i_2$ to $i_3$, ..., $i_{k-1}$ to $i_k$ and $i_k$ to $i_1$. For example, for $k = 6$, $t = 4$ and
$G_{i_1 \dots i_6} = x_{12} x_{23} x_{33} x_{34} x_{42} x_{21}$, we can draw the graph in Figure 1.

[Figure 1: the graph Γ(6, 4); here i3 = i4, i2 = i6 and i1 = i7.]

Obviously, any such graph $\Gamma(k, t)$ defined above forms a cycle. Each term $G_{i_1, \dots, i_k}$ in equation (2.11) can
also be written as $G_{i_1, \dots, i_k} = x_{s_1 h_1}^{k_1} \cdots x_{s_l h_l}^{k_l}$, where $\sum_{j=1}^l k_j = k$ and $l$ is the number of non-coincident edges in the graph $\Gamma(k, t)$. All these graphs can be classified into three categories.

Category 1: There exists at least one edge in the graph that is not coincident with any of the
other edges. Therefore $k_{j_0} = 1$ for some $j_0$. Then $EG_{i_1 \dots i_k} = 0$.

Category 2: There is no loop and each edge in the graph is coincident with exactly one
other edge, but in the opposite direction. Therefore $k_j = 2$ for $1 \le j \le l$, with $s_j \neq h_j$.
Then $k$ must be an even number, and $EG_{i_1 \dots i_k} = 1$.

Category 3: All the other types of graphs except those in categories 1 and 2. So there
are two possibilities: either at least 3 edges are coincident ($k_{j_0} \ge 3$ for some $j_0$) or
there exists a cycle of non-coincident edges. In the first situation, there are at most
$(k-1)/2$ consecutively connected non-coincident edges in the graph, so we have
$t \le (k+1)/2$. In the second situation, the cycle of non-coincident edges implies that
there exists one edge in the cycle whose vertices are determined by the rest of the edges
in the cycle. So we have $t \le (k+1)/2$ again. Since all the entries $x_{ij}$ are bounded,
$\sum_{\text{Category 3}} EG_{i_1 \dots i_k} \le C n^{\frac{k+1}{2}}$.

From the above, we know that the contributions from categories 1 and 3 to (a) converge to 0 as
$n \to \infty$. Therefore we only need to count the terms in Category 2. Define
$H(e_s) = 1$ if $x_{i_s i_{s+1}}$ (corresponding to the edge $e_s$) appears for the first time in the graph, and
$H(e_s) = -1$ if the edge $e_s$ coincides with an earlier edge, in the
opposite direction. Since $k$ is an even number, we can write $k = 2K$. Then there exists a
sequence $(a_1 = H(e_1), \dots, a_{2K} = H(e_{2K}))$ that corresponds to the graph $\Gamma$; we
call it the characteristic sequence of the graph $\Gamma$. It is easy to see that all the characteristic
sequences of graphs in Category 2 satisfy
$$
a_1 + a_2 + \dots + a_l \ge 0 \tag{2.12}
$$
for all $1 \le l \le 2K$. There are $n(n-1)\cdots(n-t+1)$ graphs in Category 2 corresponding
to one characteristic sequence. Now we only need to count the number of characteristic
sequences.
First, there are $\binom{2K}{K}$ different ways to arrange $K$ plus-ones and $K$ minus-ones. Now
we use the reflection (symmetrization) principle to count the number of non-characteristic sequences.
Given a sequence $(a_1, \dots, a_{2K})$, where $a_i \in \{1, -1\}$, let $S_0 = 0$ and $S_i = S_{i-1} + a_i$ for
$i = 1, 2, \dots, 2K$. We can represent such a sequence as a two-dimensional path $(i, S_i)$ in the
plane. Notice that each plus-one indicates an upward step and each minus-one indicates
a downward step (see the figure below). The path starts from $(0, 0)$ and returns to $(2K, 0)$ after $2K$ steps.

[Figure: a ±1 path reflected across the line S = -1 after its first passage below zero; the reflected path ends at (2K, -2).]
If $(a_1, \dots, a_{2K})$ is a characteristic sequence, then $S_i \ge 0$ for all $i$, according to (2.12). In
other words, the path always stays on or above the horizontal axis (it never touches the
negative part). Therefore, for a non-characteristic sequence, there must exist an $i_0$ such that
$S_{i_0} = -1$; visually, the path touches $-1$ after $i_0$ steps. For such a non-characteristic
sequence, we can reflect its path after step $i_0$ across the horizontal line $S = -1$ (see the
figure above). The reflected path still starts at 0, but ends at $-2$ after $2K$ steps. Now we
define $b_j = a_j$ for $j \le i_0$ and $b_j = -a_j$ for all $j > i_0$; then the reflected path corresponds to the
new sequence $(b_1, \dots, b_{2K})$, which contains $K - 1$ plus-ones and
$K + 1$ minus-ones. Therefore the number of $b$-sequences is $\binom{2K}{K-1}$, so the number of
characteristic sequences is
$$
\binom{2K}{K} - \binom{2K}{K-1} = \frac{1}{K+1}\binom{2K}{K} = M_{2K}.
$$
So $\sum_{\text{Category 2}} EG_{i_1 \dots i_k} = n(n-1)\cdots(n-K)\,M_{2K}$. Then we can conclude that (a) holds.
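The count of characteristic sequences can be verified by brute force for small K; the Python sketch below (ours) enumerates all ±1 sequences of length 2K, keeps those with total sum zero and all partial sums nonnegative, and compares with the Catalan number $\frac{1}{K+1}\binom{2K}{K}$:

```python
from itertools import product
from math import comb

def count_characteristic(K):
    """Count +-1 sequences of length 2K with total sum 0 and partial sums >= 0."""
    count = 0
    for seq in product((1, -1), repeat=2 * K):
        s, ok = 0, True
        for a in seq:
            s += a
            if s < 0:         # path dipped below the axis: not characteristic
                ok = False
                break
        if ok and s == 0:     # must also return to 0 after 2K steps
            count += 1
    return count

counts = [count_characteristic(K) for K in range(1, 6)]   # 1, 2, 5, 14, 42
```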
To prove (b), we need to show that V ar(tr(Xnk )) is summable for any fixed k. We have
that
1 1 X
tr(Xnk )) =

V ar( k E(Gi1 ...ik Gj1 ...jk ) E(Gi1 ...ik )E(Gj1 ...jk ) . (2.13)
n 2 +1 nk+2
We can construct two graphs 1 and 2 for Gi1 ...ik and Gj1 ...jk . If there are no coincident
edge between 1 and 2 , then Gi1 ...ik and Gj1 ...jk are independent and thus the correspond-
ing term in the sum is 0. If the combined graph = 1 2 has a single edge then
EGi1 ...ik Gj1 ...jk = EGi1 ...ik EGj1 ...jk = 0, hence the corresponding term in (2.15) is 0 too.
Now, suppose that contains no single edge and the graph of non-coincident edges has
a cycle. Then the distinct number of vertices will be no more than k. If contains no single
edge and the graph of non-coincident edges has no cycle, then in each graph, all the cycles
should be contracted the coincident edges. In addition there is also no single edges in each
graph. Hence there is at least one edge in with at least 4 coincident edges and thus the
number of distinct vertices is no larger than k. Consequently,
1
V ar( k tr(Xnk )) Cn2 . (2.14)
+1
n 2

 All of the results above are based on the assumption that all the entries xij are bounded.
But in Theorem 2 we only assume i.i.d. entries with mean 0 and finite variance. So we need some
extra steps: truncate, centralize and rescale the entries of Xn. We can prove that
the ESD of the new matrix is asymptotically the same as that of the original one almost surely.

First let us define the Levy metric.

DEFINITION 2.1 The Levy distance L between two distribution functions F and G is
defined by

    L(F, G) = inf{ε > 0 : F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε for all x}.

In fact, L(Fn, F) → 0 implies that Fn converges weakly to F as n → ∞.
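As a small numerical aside (not part of the original notes): the Levy distance of Definition 2.1 can be approximated on a grid. The helper names below (`levy_distance`, `ecdf`) are our own, and the grid-based scan is only a sketch, not an exact computation.

```python
import numpy as np

def levy_distance(F, G, grid, eps_grid):
    """Approximate the Levy distance between two CDFs F, G (callables)
    by scanning candidate epsilons in increasing order."""
    for eps in eps_grid:
        # check F(x - eps) - eps <= G(x) <= F(x + eps) + eps on the grid
        if np.all(F(grid - eps) - eps <= G(grid)) and \
           np.all(G(grid) <= F(grid + eps) + eps):
            return eps
    return eps_grid[-1]

def ecdf(sample):
    """Empirical CDF of a sample, returned as a callable."""
    s = np.sort(sample)
    return lambda x: np.searchsorted(s, x, side="right") / len(s)

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
y = rng.standard_normal(2000)
grid = np.linspace(-4, 4, 801)
d = levy_distance(ecdf(x), ecdf(y), grid, np.linspace(0, 1, 1001))
print(d)  # small: two samples from the same law are Levy-close
```

Since the Levy metric metrizes weak convergence, `d` shrinks as the sample sizes grow.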


Step 1. Truncation
For any fixed positive constant C, we can truncate the random variables xij at C by
defining xij(C) = xij I(|xij| ≤ C). Now we define a truncated Wigner matrix Xn(C) whose
elements are the xij(C). Then we have the following result.

PROPOSITION 2.1 Write Wn(C) = (1/√n) Xn(C). Then, almost surely,

    lim_{C→∞} limsup_{n→∞} L³(F^{Wn}, F^{Wn(C)}) = 0.   (2.15)

In order to show the above proposition, we need the following lemma.

LEMMA 2.1 Let A and B be two n × n symmetric matrices with their ESDs denoted by
F^A and F^B, respectively. Then,

    L³(F^A, F^B) ≤ (1/n) tr[(A − B)(A − B)^T].   (2.16)
Sketch Proof of Lemma 2.1. First, we show

    L³(F^A, F^B) ≤ (1/n) Σ_{i=1}^n |λi(A) − λi(B)|².   (2.17)

The above inequality is trivial if h = (1/n) Σ_{i=1}^n |λi(A) − λi(B)|² ≥ 1. When h < 1, let ε = h^{1/3}
and m = #(M1 \ M2), where M1 = {i ≤ n : λi(A) ≤ x} and M2 = {i ≤ n : λi(B) ≤ x + ε}.
Then we have that

    F^A(x) − F^B(x + ε) ≤ m/n ≤ (1/(nε²)) Σ_{i=1}^n |λi(A) − λi(B)|² = ε.

Similarly, we have F^B(x − ε) − F^A(x) ≤ ε. Therefore, L(F^A, F^B) ≤ ε and (2.17) holds.


Let π = (π(1), . . . , π(n)) denote a permutation of 1, 2, . . . , n. Then

    L³(F^A, F^B) ≤ min_π (1/n) Σ_{i=1}^n |λi(A) − λ_{π(i)}(B)|².   (2.18)

By the perturbation inequality, we know min_π (1/n) Σ_{i=1}^n |λi(A) − λ_{π(i)}(B)|² ≤ (1/n) tr[(A − B)(A − B)^T]. Therefore, the conclusion holds.
Proof of Proposition 2.1. By Lemma 2.1 and the law of large numbers, we have

    L³(F^{Wn}, F^{Wn(C)}) ≤ (1/n²) Σ_{ij} xij² I(|xij| > C) → E{x12² I(|x12| > C)}  almost surely.   (2.19)

Note that the right-hand side of (2.19) can be made arbitrarily small by increasing C. Then
the proposition holds. 
Therefore, in the proof of Theorem 2, we can assume that the entries of the matrix Xn are
uniformly bounded.
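As a numerical illustration of Step 1 (our own simulation, not part of the proof): truncating the entries at a moderate level C barely moves the spectrum, and the trace bound of Lemma 2.1 is easy to evaluate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, C = 400, 3.0
X = rng.standard_normal((n, n))
X = np.triu(X) + np.triu(X, 1).T           # symmetric Wigner matrix
W = X / np.sqrt(n)
Xc = np.where(np.abs(X) <= C, X, 0.0)      # entries truncated at C
Wc = Xc / np.sqrt(n)

ev = np.linalg.eigvalsh(W)                 # sorted ascending
ev_c = np.linalg.eigvalsh(Wc)
# Lemma 2.1 bound: L^3(F^W, F^Wc) <= tr[(W - Wc)(W - Wc)^T]/n
bound = np.trace((W - Wc) @ (W - Wc)) / n
print(bound)   # roughly E[x^2 I(|x| > 3)], which is small
```

Increasing `C` makes `bound` arbitrarily small, which is exactly the content of (2.19).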
Step 2. Removing the diagonal
Notice that in Theorem 2, we assume that the diagonal entries and the off-diagonal
entries may have different distributions. We would like to set the diagonal entries of Wn(C)
to be 0. By arguments similar to those in Step 1, we can show that this modification won't
influence the asymptotic result of the ESD. We still use Wn(C) to denote this new random
matrix.
Step 3. Centralization
In this step, we first reset the mean values of all off-diagonal entries of Wn(C) to zero.
Let a = E x12 I(|x12| > C). We will show that

    ||F^{Wn(C)} − F^{Wn(C) − a 1 1^T}|| ≤ 1/n,   (2.20)

where ||f(x)|| = sup_x |f(x)| and 1 denotes the vector of all ones. In fact, we have:

LEMMA 2.2 Let A and B be two n × n symmetric matrices with their ESDs denoted by
F^A and F^B, respectively. Then,

    ||F^A − F^B|| ≤ (1/n) rank(A − B).   (2.21)

Proof of Lemma 2.2. Since both sides of (2.21) are invariant under a common orthogonal
transformation of A and B, we may transform A and B so that A − B has the form

    [ C  0 ]
    [ 0  0 ],

where C is a full-rank matrix. To prove (2.21), we assume that

    A = [ A11  A12 ]        B = [ B11  A12 ]
        [ A21  A22 ]  and       [ A21  A22 ],

where A22 is an (n − k) × (n − k) matrix and rank(A − B) = rank(A11 − B11) = k. Denote
the eigenvalues of A, B and A22 by λ1 ≤ . . . ≤ λn, μ1 ≤ . . . ≤ μn and η1 ≤ . . . ≤ η_{n−k},
respectively. By the interlacing theorem, we have the relation max(λj, μj) ≤ ηj ≤
min(λ_{j+k}, μ_{j+k}). Then we can conclude that for any x ∈ (η_{j−1}, ηj),

    (j − 1)/n ≤ F^A(x), F^B(x) < (j + k)/n,

which implies (2.21). 
Notice that the rank of a 1 1^T is 1; applying Lemma 2.2, we have (2.20). Although the new
matrix Wn(C) − a 1 1^T has the mean of its off-diagonal elements equal to 0, its diagonal
entries are all equal to −a. Now we can use the following lemma to remove all diagonal
elements of Wn(C) − a 1 1^T without influencing the final result.

LEMMA 2.3 Let A and B be two n × n Hermitian matrices with their ESDs denoted by
F^A and F^B, respectively. Then,

    L(F^A, F^B) ≤ ||A − B||,   (2.22)

where ||·|| is the operator norm.

Proof of Lemma 2.3. It is easy to see that L(F^A, F^B) ≤ max_k |λk(A) − λk(B)|, so we only
need to show that max_k |λk(A) − λk(B)| ≤ ||A − B||, where λk(·) denotes the k-th largest
eigenvalue. By the min-max characterization of λk(A), we have

    λk(A) = min_{y1,...,y_{k−1}} max_{x ⊥ y1,...,y_{k−1}, ||x||=1} x*Ax
          ≤ min_{y1,...,y_{k−1}} max_{x ⊥ y1,...,y_{k−1}, ||x||=1} x*Bx + ||A − B||,

and similarly λk(A) ≥ λk(B) − ||A − B||. So Lemma 2.3 is proved.


Hence, we have L(F^{Wn(C) − a 1 1^T}, F^{Wn(C) − E Wn(C)}) ≤ a → 0 as C → ∞.
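Weyl's perturbation bound used in the proof of Lemma 2.3, max_k |λk(A) − λk(B)| ≤ ||A − B||, is easy to check numerically; the example below is our own sketch with made-up matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
E = rng.standard_normal((n, n)); E = (E + E.T) / (2 * n)   # small perturbation
B = A + E

lam_A = np.linalg.eigvalsh(A)      # ascending order
lam_B = np.linalg.eigvalsh(B)
op_norm = np.linalg.norm(E, 2)     # operator (spectral) norm of A - B
print(np.max(np.abs(lam_A - lam_B)), op_norm)
```

The first printed value never exceeds the second, for any choice of symmetric A and B.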
Step 4. Rescaling
Now we can normalize the elements of the new matrix Wn(C) − E Wn(C). Write σ²(C) =
Var(x12 I(|x12| ≤ C)) and define W̃n = σ^{−1}(C)(Wn(C) − E Wn(C)). Note that all the off-
diagonal entries of W̃n are of the form σ^{−1}(C)(xij(C) − E xij(C))/√n. Applying Lemma 2.1, we obtain

    L³(F^{W̃n}, F^{Wn(C) − E Wn(C)}) ≤ ((1 − σ(C))²/(n² σ(C)²)) Σ_{i≠j} {xij(C) − E xij(C)}²
                                     → (1 − σ(C))²  almost surely.

Note that (1 − σ(C))² can be made arbitrarily small by increasing C.

Finally, combining all the results above, we can assume that all the entries of Xn are bounded
by C, with mean zero and variance 1. Then, applying the moment argument for the bounded
case above, we finish the proof of Theorem 2.

3 Use the Stieltjes Transform to Prove the Marcenko–Pastur Law
for the Sample Covariance Matrix
The sample covariance matrix is very popular in multivariate statistical analysis, because
many test statistics can be written as functions of its eigenvalues. The formal definition
of a sample covariance matrix goes as follows. Suppose the xij are i.i.d. random variables with
mean 0 and variance σ². Write Xi = (xi1, . . . , xip)^T and let X = (X1, . . . , Xn) be the p × n
matrix with these columns. Then the sample covariance matrix is defined as

    S = (1/n) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)^T,  where X̄ = (1/n) Σ_i Xi.

There is another version of the sample covariance matrix, which is more convenient for the
spectral analysis of random matrices. Notice that X̄ X̄^T is a rank-one matrix, and hence its
removal does not affect the LSD, by the rank inequality. Therefore the sample covariance
matrix can be simply defined as

    S = (1/n) Σ_{k=1}^n Xk Xk^T = (1/n) X X^T.

In addition, we may assume p/n → y ∈ (0, ∞).
P

The first success in finding the limiting spectral distribution of the large sample covariance
matrix Sn is due to Marcenko and Pastur (1967). The Marcenko–Pastur (MP) law Fy(x)
has density

    f(x) = (1/(2π x y σ²)) √((b − x)(x − a))  if a ≤ x ≤ b;  f(x) = 0 otherwise,

and has a point mass 1 − 1/y at 0 if y > 1, where a = σ²(1 − √y)² and b = σ²(1 + √y)².
Here, the constant y is the limit of p/n.
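The MP density is easy to check by simulation; the sketch below (our own, with σ² = 1 and the helper name `mp_density` chosen by us) compares the eigenvalues of S = XX^T/n with the support [a, b].

```python
import numpy as np

def mp_density(x, y, sigma2=1.0):
    """Marcenko--Pastur density for index y <= 1 and variance sigma2."""
    a = sigma2 * (1 - np.sqrt(y)) ** 2
    b = sigma2 * (1 + np.sqrt(y)) ** 2
    out = np.zeros_like(x)
    m = (x > a) & (x < b)
    out[m] = np.sqrt((b - x[m]) * (x[m] - a)) / (2 * np.pi * sigma2 * y * x[m])
    return out

rng = np.random.default_rng(3)
p, n = 200, 400                       # y = p/n = 0.5
X = rng.standard_normal((p, n))
S = X @ X.T / n
ev = np.linalg.eigvalsh(S)

y = p / n
a, b = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
frac = np.mean((ev > a - 0.1) & (ev < b + 0.1))
print(frac)   # essentially all eigenvalues fall in [a, b]
```

A histogram of `ev` against `mp_density` makes the agreement visible; here we only check that the spectrum concentrates on the predicted support.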

3.1 MP Law

In this chapter, we only consider the LSD of the sample covariance matrix for the case where the
underlying variables are i.i.d. and bounded, with mean 0 and variance 1. We can relax those
assumptions by applying the truncation, centralization and rescaling techniques discussed
in the previous chapter.

THEOREM 3 Suppose the xij are i.i.d. random variables which are bounded, with mean 0
and variance σ². Assume that p/n → y ∈ (0, ∞). Then, with probability one, the ESD of
S,

    F^S(x) = (1/p) Σ_{i=1}^p I(λi ≤ x),

converges to the Marcenko–Pastur law.

We will use the Stieltjes transform to prove Theorem 3. In recent years, the Stieltjes
transform has become more popular in random matrix research than the moment method. Let
us start with the basic theory and concepts related to the Stieltjes transform.

3.2 Stieltjes Transform

DEFINITION 3.1 Let F(x) be a CDF. Define

    S_F(z) = ∫ 1/(x − z) dF(x),

where z ∈ D := {u + iv : v > 0}.

The Stieltjes transform S_F(z) is well defined because, for z = u + iv, the integrand satisfies
|1/(x − z)| = ((x − u)² + v²)^{−1/2} ≤ 1/v.
First, we would like to show that we can recover a distribution function F from its Stieltjes
transform. Here is the inversion formula.

LEMMA 3.1 Let a < b be continuity points of a CDF F(x). Then

    F(b) − F(a) = lim_{v→0+} (1/π) ∫_a^b Im S_F(x + iv) dx,

where Im f(z) is the imaginary part of f(z).


Proof of Lemma 3.1. By the definition,

    S_F(x + iv) = ∫_R 1/(y − x − iv) dF(y) = ∫_R (y − x + iv)/((y − x)² + v²) dF(y).

Thus

    (1/π) ∫_a^b Im S_F(x + iv) dx = (1/π) ∫_R { ∫_a^b v/((y − x)² + v²) dx } dF(y)
                                  = (1/π) ∫_R {arctan((b − y)/v) − arctan((a − y)/v)} dF(y),

in which, as v → 0,

    arctan((b − y)/v) − arctan((a − y)/v) → (π/2)(I_{{a}}(y) + I_{{b}}(y)) + π I_{(a,b)}(y).

Since a and b are continuity points of F, by the dominated convergence theorem we have
the conclusion. 
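The inversion formula can be illustrated numerically. Below we take F = Uniform[0, 1], whose Stieltjes transform has the closed form log((1 − z)/(−z)); this worked example is our own, not from the notes.

```python
import numpy as np

def stieltjes_uniform(z):
    """Stieltjes transform of Uniform[0,1]: the integral of 1/(x - z) dx
    over [0, 1], valid for z in the upper half-plane."""
    return np.log((1.0 - z) / (-z))

v = 1e-4                      # small imaginary part
a, b = 0.2, 0.7
xs = np.linspace(a, b, 20001)
vals = stieltjes_uniform(xs + 1j * v).imag
# trapezoid rule for (1/pi) * integral of Im S_F(x + iv) over [a, b]
approx = np.sum((vals[1:] + vals[:-1]) / 2) * (xs[1] - xs[0]) / np.pi
print(approx)                 # close to F(b) - F(a) = 0.5
```

Sending `v` to 0 sharpens the approximation, exactly as in Lemma 3.1.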
From the above lemma, we immediately have the following result.

LEMMA 3.2 For any two CDFs F(x) and G(x), F = G if and only if S_F(z) = S_G(z) for
all z ∈ D.

This lemma indicates that the distribution function can be uniquely determined by its
Stieltjes transform. Here is another result from Lemma 3.1.

LEMMA 3.3 Let F(x) be a CDF and x0 ∈ R. Suppose lim_{z∈D→x0} Im S_F(z) exists, and
denote it by Im S_F(x0). Then F(x) is differentiable at x0, and F′(x0) = (1/π) Im S_F(x0).

This lemma says that we can derive the density function of a distribution from its Stieltjes
transform.
Proof of Lemma 3.3. Recall that the set of discontinuity points of F(x) is countable.
By Lemma 3.1, for any sequences of continuity points {xn}, {xm} with xn, xm → x0, we have

    (F(xm) − F(xn))/(xm − xn) → (1/π) Im S_F(x0)   (3.23)

as n, m → ∞. In particular, take continuity points x1 < x3 < . . . → x0 and x2 > x4 > . . . → x0.
Then (3.23) indicates that F(x_{2k}) − F(x_{2k−1}) → 0, and hence F(x_{2k}) → F(x0) and
F(x_{2k−1}) → F(x0). Thus F(x) is continuous at x0. Therefore, by choosing the sequence
{x1, x0, x2, x0, . . .}, we have

    (F(xn) − F(x0))/(xn − x0) → (1/π) Im S_F(x0)

for any sequence of continuity points {xn} of F(x) with xn → x0.
Now we need to show that F′₊(x0) = F′₋(x0) = (1/π) Im S_F(x0). For any sequence xn ↓ x0,
take a continuity point x′n such that xn > x′n > x0 and |xn − x′n| ≤ (1/n)|xn − x0|. Then we
have x′n → x0 and (x′n − x0)/(xn − x0) → 1, and hence

    liminf_{n→∞} (F(xn) − F(x0))/(xn − x0) ≥ (1/π) Im S_F(x0);

the matching upper bound is obtained in the same way, so F′₊(x0) = (1/π) Im S_F(x0). By the
similar idea applied to the left derivative, we get the conclusion. 

Recall that the characteristic function is a powerful tool for studying weak convergence,
because the continuity theorem provides sufficient and necessary conditions for weak
convergence. We have a similar continuity theorem for the Stieltjes transform.

LEMMA 3.4 A sequence of distribution functions {Fn} converges weakly to a CDF F(x) if and
only if lim_{n→∞} S_{Fn}(z) = S_F(z) for all z ∈ D.

Proof of Lemma 3.4. The necessity part follows directly from the dominated convergence
theorem. Now let us deal with the sufficiency part. Write

    f_v(x) = v/(π(x² + v²)).

The function f_v(x) is the density function of a Cauchy random variable C_v with scale
parameter v. By the definition of the Stieltjes transform, we have

    (1/π) Im S_{Fn}(u + iv) = (1/π) ∫_R v/((x − u)² + v²) dFn(x).

This implies that (1/π) Im S_{Fn}(u + iv), viewed as a function of u, is the density function of the
convolution of Fn(x) and the distribution of C_v. Now let φn(t), φ_{Fn}(t) and φ_{C_v}(t)
be the characteristic functions of (1/π) Im S_{Fn}(· + iv), Fn(x) and C_v, respectively. Further
assume that φ(t) and φ_F(t) are the characteristic functions of (1/π) Im S_F(· + iv) and F(x).
Then φn(t) = φ_{Fn}(t) φ_{C_v}(t) and φ(t) = φ_F(t) φ_{C_v}(t). Since lim S_{Fn}(z) = S_F(z), we have
lim (1/π) Im S_{Fn}(z) = (1/π) Im S_F(z), and hence φn(t) → φ(t). Because φ_{C_v}(t) = exp(−v|t|)
never vanishes, we obtain φ_{Fn}(t) → φ_F(t). By the continuity theorem for characteristic
functions, we conclude that Fn ⇒ F as n → ∞. 

Even if, for a sequence of distribution functions Fn, we know that S_{Fn}(z) converges to some
limit s(z) for every z ∈ D, there is still no guarantee that s(z) is the Stieltjes transform of
some probability distribution function F. This problem can be solved by assuming that {Fn}
is a tight sequence.

LEMMA 3.5 Let {Fn : n ≥ 1} be tight. If lim S_{Fn}(z) = S(z) for all z ∈ D, then there
is a CDF F(x) such that Fn converges to F weakly, and S_F(z) = S(z).

Proof of Lemma 3.5. Since lim S_{Fn}(z) = S(z), all weakly convergent subsequences F_{nk}
have the same limiting distribution by Lemma 3.2. By Helly's theorem, the tightness of {Fn}
then implies that Fn ⇒ F and S_F(z) = S(z). 
If we cannot check the tightness condition, then the following lemma is very useful.

LEMMA 3.6 Suppose lim S_{Fn}(z) = S(z) for all z ∈ D. Then there exists a probability
distribution function F with Stieltjes transform S_F(z) = S(z), and Fn converges weakly
to F, if and only if

    lim_{v→∞} (−iv S(iv)) = 1.   (3.24)

Now we can conclude a criterion for Stieltjes transforms. Let S(z) be an analytic function
on D. Then there exists a distribution function F with Stieltjes transform S(z) if and only
if S(z) satisfies Im S(z) > 0 for each z ∈ D and (3.24) holds.

3.3 Stieltjes Transform of the MP Law

First let us review some basic definitions and theorems from complex analysis.

DEFINITION 3.2 A complex function f is a map from C to C. The derivative of f at a
point z0 in its domain is defined by the limit

    f′(z0) = lim_{z→z0} (f(z) − f(z0))/(z − z0).

If the limit exists, we say that f is complex differentiable at the point z0.

This is the same as the definition of the derivative for real functions, except that all of
the quantities are complex.

DEFINITION 3.3 If f is complex differentiable at every point in an open set U, we say
that f is holomorphic on U.

The existence of a complex derivative in a neighborhood is a very strong condition, because
it implies that any holomorphic function is actually infinitely differentiable.

DEFINITION 3.4 (Pole) Let a be an element of an open set U and let f : U \ {a} → C be
a holomorphic function on its domain. If there exist a holomorphic function g : U → C
and a positive integer n such that f(z) = g(z)/(z − a)^n for all z in U \ {a}, then a is called a
pole of f. The smallest such n is called the order of the pole. A pole of order 1 is called a simple
pole.

LEMMA 3.7 (Residue theorem) Suppose a1, . . . , am are finitely many poles of f inside U and f
is holomorphic on U \ {a1, . . . , am}. If γ is a simple closed curve with counter-clockwise
orientation in U which does not meet any ai, then

    ∮_γ f(z) dz = 2πi Σ_k Res(f, ak),

with the sum over those k for which ak is in the interior of γ. Here, Res(f, ak) denotes the
residue of f at ak,

    Res(f, ak) = (1/2πi) ∮_{γ′} f(z) dz,

where γ′ is a counter-clockwise oriented closed curve in U enclosing ak and no other pole.

LEMMA 3.8 (Cauchy's integral formula) Suppose U is an open subset of the complex plane C,
f : U → C is a holomorphic function, and the closed disk D′ = {z : |z − z0| ≤ r} is completely
contained in U. Let γ be the circle forming the boundary of D′. Then for every a in the
interior of D′,

    f(a) = (1/2πi) ∮_γ f(z)/(z − a) dz,

where the contour integral is taken counter-clockwise. In particular, f is actually infinitely
differentiable, with

    f^{(n)}(a) = (n!/2πi) ∮_γ f(z)/(z − a)^{n+1} dz.

In the following calculation, we specify the square root of a complex number z = u + iv as
the one with positive imaginary part. With this convention,

    Re(√z) = sign(v) √((|z| + u)/2)  and  Im(√z) = √((|z| − u)/2).

In particular, the real part of √z has the same sign as the imaginary part of z.
Now let us calculate the Stieltjes transform S(z) of the MP law, where z ∈ D.
When y < 1, we have

    S(z) = ∫_a^b (1/(x − z)) (1/(2π x y σ²)) √((b − x)(x − a)) dx,

where a = σ²(1 − √y)² and b = σ²(1 + √y)². Let x = σ²(1 + y + 2√y cos θ) with θ ∈ [0, π],
and write ζ = e^{iθ}. Then √((b − x)(x − a)) = 2σ²√y sin θ and dx = −2σ²√y sin θ dθ, so

    S(z) = (2σ²/π) ∫_0^π sin²θ / [(1 + y + 2√y cos θ)(σ²(1 + y + 2√y cos θ) − z)] dθ
         = (σ²/π) ∫_0^{2π} ((e^{iθ} − e^{−iθ})/2i)² / [(1 + y + √y(e^{iθ} + e^{−iθ}))(σ²(1 + y + √y(e^{iθ} + e^{−iθ})) − z)] dθ
         = −(σ²/4πi) ∮_{|ζ|=1} (ζ − ζ^{−1})² / [(1 + y + √y(ζ + ζ^{−1}))(σ²(1 + y + √y(ζ + ζ^{−1})) − z)] · dζ/ζ
         = −(σ²/4πi) ∮_{|ζ|=1} (ζ² − 1)² / [ζ((1 + y)ζ + √y(ζ² + 1))(σ²((1 + y)ζ + √y(ζ² + 1)) − zζ)] dζ.

The integrand has five simple poles, at ζ0 = 0,

    ζ_{1,2} = (−(1 + y) ± (1 − y))/(2√y)

and

    ζ_{3,4} = (z − σ²(1 + y) ± √(σ⁴(1 − y)² − 2σ²(1 + y)z + z²))/(2σ²√y).

By the Cauchy integral formula, we find that the residues at ζ0, ζ1 and ζ4 are

    1/(σ²y),  (1 − y)/(yz)  and  (1/(yz)) √(σ⁴(1 − y)² − 2σ²(1 + y)z + z²),

respectively. Notice that |ζ1 ζ2| = 1 and |ζ3 ζ4| = 1, so only one pole in each pair is inside
the contour |ζ| = 1. It is easy to check that |ζ1| = √y < 1 and |ζ2| > 1 when y < 1. Since we
take the square root of a complex number to be the one with positive imaginary part (so that
the real part of the root has the same sign as the imaginary part of its argument), one
checks that √(σ⁴(1 − y)² − 2σ²(1 + y)z + z²) and z − σ²(1 + y) have real and imaginary parts
of the same signs. Hence |ζ3| > 1 and |ζ4| < 1. Then, by the residue theorem, we obtain

    S(z) = (σ²(1 − y) − z + √((z − σ²(1 + y))² − 4yσ⁴)) / (2yzσ²).   (3.25)
When y > 1, S(z) equals the above integral plus the contribution (1 − 1/y) · 1/(0 − z) =
−(y − 1)/(yz) of the point mass at 0; with some basic calculation, equation (3.25) still holds
true. When y = 1, equation (3.25) holds by continuity. Without loss of generality, we can
assume σ² = 1 and y < 1 from now on. It is not very hard to see that S(z) is the solution of

    S(z) = 1/(1 − z − y − yzS(z))   (3.26)

whose imaginary part is positive.
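Formula (3.25) and the fixed-point equation (3.26) can be checked against the empirical resolvent trace (1/p) tr(S − zI)^{−1}; the helper `mp_stieltjes` below is our own sketch (it selects the root with positive imaginary part numerically rather than via the branch convention above).

```python
import numpy as np

def mp_stieltjes(z, y, sigma2=1.0):
    """Stieltjes transform of the MP law, formula (3.25); the root is
    chosen so that Im S(z) > 0 for z in the upper half-plane."""
    r = np.sqrt((z - sigma2 * (1 + y)) ** 2 - 4 * y * sigma2 ** 2 + 0j)
    s1 = (sigma2 * (1 - y) - z + r) / (2 * y * z * sigma2)
    s2 = (sigma2 * (1 - y) - z - r) / (2 * y * z * sigma2)
    return s1 if s1.imag > 0 else s2

rng = np.random.default_rng(4)
p, n = 300, 600
X = rng.standard_normal((p, n))
S = X @ X.T / n
z = 1.0 + 0.5j
emp = np.mean(1.0 / (np.linalg.eigvalsh(S) - z))   # (1/p) tr (S - zI)^{-1}
thy = mp_stieltjes(z, p / n)
print(emp, thy)   # the two values nearly agree
```

The theoretical value also satisfies the self-consistent equation (3.26) exactly.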

3.4 The Proof of Theorem 3

Let the Stieltjes transform of F^{Sn} be STn(z) := (1/p) tr(Sn − zIp)^{−1}, z ∈ D. We need the
following three steps to prove Theorem 3:

(1) For any fixed z ∈ D, STn(z) − E STn(z) → 0 almost surely.

(2) For any fixed z ∈ D, E STn(z) → S(z).

(3) With probability one, STn(z) → S(z) for all z ∈ D.

Step 1. Define Ek as the conditional expectation given {X_{k+1}, X_{k+2}, . . . , Xn}. Then we
have

    STn(z) − E STn(z) = (1/p) Σ_{k=1}^n {E_{k−1} tr((Sn − zIp)^{−1}) − E_k tr((Sn − zIp)^{−1})}
                      = (1/p) Σ_{k=1}^n (E_{k−1} − E_k){tr((Sn − zIp)^{−1}) − tr((S_{nk} − zIp)^{−1})},

where S_{nk} = Sn − (1/n) Xk Xk^T. The last equality is due to the fact that E_k tr((S_{nk} − zIp)^{−1}) =
E_{k−1} tr((S_{nk} − zIp)^{−1}). For any invertible matrix A and two vectors α and β, we have

    (A + αβ^T)^{−1} = A^{−1} − (A^{−1} α β^T A^{−1}) / (1 + β^T A^{−1} α).

By the above formula, we obtain

    tr((Sn − zIp)^{−1}) − tr((S_{nk} − zIp)^{−1})
      = − ((1/n) Xk^T (S_{nk} − zIp)^{−2} Xk) / (1 + (1/n) Xk^T (S_{nk} − zIp)^{−1} Xk) =: Hn.   (3.27)

By the eigenvalue decomposition, we can write S_{nk} = D Λ D^T, where Λ = diag(λ1, λ2, . . . , λp)
and D D^T = D^T D = Ip. So, writing z = u + iv,

    |Xk^T (S_{nk} − zIp)^{−2} Xk| = |Xk^T D (Λ − zI)^{−2} D^T Xk|
      = |Σ_{i=1}^p (D^T Xk)_i² (λi − u − iv)^{−2}|
      ≤ Σ_{i=1}^p (D^T Xk)_i² {(λi − u)² + v²}^{−1}
      = Xk^T ((S_{nk} − uIp)² + v² Ip)^{−1} Xk.

Hence, because |w| ≥ Im(w) for any complex number w,

    |Hn| ≤ (1/n) Xk^T ((S_{nk} − uIp)² + v² Ip)^{−1} Xk / Im((1/n) Xk^T (S_{nk} − zIp)^{−1} Xk) = 1/v,

since Im((1/n) Xk^T (S_{nk} − zIp)^{−1} Xk) = (v/n) Xk^T ((S_{nk} − uIp)² + v² Ip)^{−1} Xk.

Let Rk = (E_{k−1} − E_k) Hn. If we define Fk = σ(Xn, X_{n−1}, . . . , X_{n−k+1}) and Yk = R_{n−k+1},
then (Yk, Fk) is a bounded martingale difference sequence. By the moment inequality for
martingale differences, we have

    E|STn(z) − E STn(z)|⁴ ≤ (C₄/p⁴) E(Σ_{k=1}^n |Rk|²)² = O(n^{−2}).

Then, by the Borel–Cantelli lemma, STn(z) − E STn(z) → 0 a.s.
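The rank-one update formula for the inverse, used to obtain (3.27), can be verified directly; a minimal sketch with made-up inputs (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned matrix
alpha = rng.standard_normal((n, 1))
beta = rng.standard_normal((n, 1))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + alpha @ beta.T)
denom = 1.0 + (beta.T @ Ainv @ alpha).item()
rhs = Ainv - (Ainv @ alpha @ beta.T @ Ainv) / denom
err = np.max(np.abs(lhs - rhs))
print(err)   # numerically zero
```

Taking traces of both sides with A = S_{nk} − zI_p and α = β = X_k/√n yields exactly (3.27).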


Step 2. Now we will show E STn(z) → S(z), where S(z) is defined in (3.25) with σ² = 1.
Recall X = (X1, . . . , Xn), and write the rows of X as α1^T, . . . , αp^T. Let X(l) denote the
(p − 1) × n matrix obtained by removing the l-th row of X. Then

    Sn = (1/n) X X^T = [ (1/n) α1^T α1     (1/n) α1^T X(1)^T ]
                       [ (1/n) X(1) α1    (1/n) X(1) X(1)^T ]

(for l = 1; the general case is analogous). For any block matrix D = [ d11, d1 ; d1^T, D22 ],
if we write D^{−1} = (b_{ij}), then b11 = 1/(d11 − d1 D22^{−1} d1^T). This formula can be
generalized to the other elements on the diagonal of D^{−1}. So we have

    STn(z) = (1/p) tr(Sn − zIp)^{−1}
           = (1/p) Σ_{l=1}^p 1 / [ (1/n) αl^T αl − z − (1/n²) αl^T X(l)^T ((1/n) X(l) X(l)^T − zI_{p−1})^{−1} X(l) αl ].

Define yn = p/n and

    εl = (1/n) αl^T αl − 1 − (1/n²) αl^T X(l)^T ((1/n) X(l) X(l)^T − zI_{p−1})^{−1} X(l) αl + yn + yn z E STn(z).

Then, we have

    E STn(z) = 1/(1 − z − yn − yn z E STn(z)) + δn,   (3.28)

where

    δn = −(1/p) Σ_{l=1}^p E[ εl / ((1 − z − yn − yn z E STn(z))(1 − z − yn − yn z E STn(z) + εl)) ].

There are two solutions for E STn(z) from (3.28),

    s_{1,2}(z) = (1/(2 yn z)) [ (1 − z − yn + yn z δn) ± √((1 − z − yn − yn z δn)² − 4 yn z) ].

Let v = Im z → ∞; then E STn(z) → 0, while the imaginary part of (1 − z − yn + yn z δn) is
dominated by −v and is negative, so s1(z) → 0 as v → ∞. This shows that E STn(z) = s1(z)
for all z with a large imaginary part. If E STn(z) = s1(z) were not true for all z ∈ D, then
there would exist z0 = u0 + iv0 ∈ D such that E STn(z0) = s1(z0) = s2(z0), that is, E STn(z0) =
(1 − z0 − yn + yn z0 δn)/(2 yn z0). Combining this with (3.28), we have

    E STn(z0) = (1 − z0 − yn)/(yn z0) + 1/(z0 − 1 + yn + yn z0 E STn(z0)).

It is easy to see that Im((1 − z0 − yn)/(yn z0)) = −(yn − yn²) v0/(yn²(u0² + v0²)) − v0/(yn(u0² + v0²)) < 0
when yn < 1. Furthermore,

    Im(z0 − 1 + yn + yn z0 E STn(z0)) = v0 (1 + yn ∫_0^∞ x/((x − u0)² + v0²) dE Fn(x)) > 0,

where Fn(x) is the ESD of Sn, so the second term above also has negative imaginary part. We
conclude that Im(E STn(z0)) < 0, which is a contradiction, since the Stieltjes transform of a
distribution has positive imaginary part on D. So E STn(z) = s1(z) for all z ∈ D.
Consequently, if δn → 0 as n → ∞, we conclude that E STn(z) → S(z) for all z ∈ D.
Now let us write

    δn = (1/p) Σ_{l=1}^p E[ εl² / ((1 − z − yn − yn z E STn(z))² (1 − z − yn − yn z E STn(z) + εl)) ]
       − (1/p) Σ_{l=1}^p E εl / (1 − z − yn − yn z E STn(z))²
       =: J1 + J2.
First let us consider J2. Since αl is independent of X(l) and E αl αl^T = In, we note that

    E[ (1/n²) αl^T X(l)^T ((1/n) X(l) X(l)^T − zI_{p−1})^{−1} X(l) αl ]
      = (1/n²) E tr( ((1/n) X(l) X(l)^T − zI_{p−1})^{−1} X(l) X(l)^T )
      = (p − 1)/n + (z/n) E tr((1/n) X(l) X(l)^T − zI_{p−1})^{−1},

while E (1/n) αl^T αl = 1 and yn z E STn(z) = (z/n) E tr((1/n) X X^T − zIp)^{−1}. Hence

    |E εl| ≤ 1/n + (|z|/n) | E tr((1/n) X(l) X(l)^T − zI_{p−1})^{−1} − E tr((1/n) X X^T − zIp)^{−1} |
           ≤ (1 + 2|z|/v)/n → 0,

where the last step uses that removing one row of X changes the trace of the resolvent by at
most 2/v (by an interlacing argument as in (3.27)). Moreover,

    |1 − z − yn − yn z E STn(z)| ≥ Im(z + yn z E STn(z)) = v (1 + yn ∫_0^∞ x/((x − u)² + v²) dE Fn(x)) ≥ v,

where Fn(x) is the ESD of Sn. These two facts imply J2 → 0. Turning to J1, note that

    Im(1 − z − yn − yn z E STn(z) + εl)
      = Im( (1/n) αl^T αl − z − (1/n²) αl^T X(l)^T ((1/n) X(l) X(l)^T − zI_{p−1})^{−1} X(l) αl )
      = −v (1 + (1/n²) αl^T X(l)^T (((1/n) X(l) X(l)^T − uI_{p−1})² + v² I_{p−1})^{−1} X(l) αl) ≤ −v,

so this factor of the denominator of J1 is at least v in absolute value, and the squared factor is
at least v². Then |J1| ≤ (1/(p v³)) Σ_{l=1}^p E|εl|², and we only need to show E|εl|² → 0 as
n → ∞. Let E_l denote the conditional expectation given {αj : j ≠ l}. By the orthogonal
decomposition, E|εl|² = E|εl − E_l εl|² + E|E_l εl − E εl|² + |E εl|². Define

    A = (a_{ij}) = In − (1/n) X(l)^T ((1/n) X(l) X(l)^T − zI_{p−1})^{−1} X(l).

Then

    εl − E_l εl = (1/n) ( Σ_{i=1}^n a_{ii} (x_{li}² − 1) + Σ_{i≠j} a_{ij} x_{li} x_{lj} ).

In addition, we have

    E_l |εl − E_l εl|² = (1/n²) ( Σ_{i=1}^n |a_{ii}|² E(x_{li}² − 1)² + Σ_{i≠j} 2|a_{ij}|² ).

Since A = In − (1/n) X(l)^T ((1/n) X(l) X(l)^T − zI_{p−1})^{−1} X(l) = −z ((1/n) X(l)^T X(l) − zIn)^{−1},
the diagonal entries satisfy |a_{ii}| ≤ |z|/v, and

    Σ_{i,j} |a_{ij}|² = tr(A Ā^T) = |z|² tr[ ((1/n) X(l)^T X(l) − zIn)^{−1} ((1/n) X(l)^T X(l) − z̄In)^{−1} ] ≤ n|z|²/v².

Since the entries are bounded, E(x_{li}² − 1)² ≤ C, and therefore

    E|εl − E_l εl|² ≤ (1/n²) ( n (|z|/v)² C + 2n|z|²/v² ) ≤ C′/n.
By an argument similar to the one used for (3.27) in Step 1, we have

    E|E_l εl − E εl|² = (|z|² yn²/p²) E| tr((1/n) X(l) X(l)^T − zI_{p−1})^{−1} − E tr((1/n) X(l) X(l)^T − zI_{p−1})^{−1} |²
                     ≤ 4|z|²/(n v²).

So E|εl|² ≤ |E εl|² + E|εl − E_l εl|² + E|E_l εl − E εl|² → 0 as n → ∞, hence J1 → 0 and
therefore δn → 0, which completes Step 2.


Step 3. First let us recall Vitali's convergence theorem.

LEMMA 3.9 Let {fn} be a sequence of analytic functions on a connected open set U ⊂ C,
with |fn(z)| ≤ M for all n and all z ∈ U. If fn(z) converges for each z in a dense subset of
U, then there exists an analytic function f(z) on U such that fn(z) → f(z) for all z ∈ U.

By Steps 1 and 2, for any fixed z ∈ D we have STn(z) → S(z) almost surely. This means
that for any z ∈ D there exists a null set Nz such that STn(z, ω) → S(z) for all ω ∈ Nz^c.
Let D0 = {zm} be a dense subset of D (e.g. all zm with rational real and imaginary parts).
Then STn(zm, ω) → S(zm) for all m and all ω ∈ ∩_m N_{zm}^c, and P(∩_m N_{zm}^c) = 1. Now
apply Lemma 3.9 on each region Dm = {z ∈ D : Im z > 1/m, |z| < m}, on which |STn(z)| ≤ m;
we obtain

    STn(z, ω) → S(z)

for all z ∈ D and ω ∈ ∩_m N_{zm}^c.

By the three steps above, we conclude that F^{Sn}(x) → F^{MP}(x) a.s.

4 High-dimensional Sphericity Test

Let x1, . . . , xn be i.i.d. R^p-valued random vectors from a normal distribution Np(μ, Σ),
where μ ∈ R^p is the mean vector and Σ is the p × p covariance matrix. Consider the
hypothesis test

    H0 : Σ = σ² Ip   vs   H1 : Σ ≠ σ² Ip,   (4.29)

where σ² > 0 is unknown.

4.1 Likelihood Ratio Test for Sphericity

Denote

    x̄ = (1/n) Σ_{i=1}^n xi,  A = Σ_{i=1}^n (xi − x̄)(xi − x̄)^T,  and  S = A/(n − 1).   (4.30)

Ignoring the constant, the likelihood function of {x1, . . . , xn} is

    L(μ, Σ) = det(Σ)^{−n/2} exp(−(1/2) tr(Σ^{−1} A)) exp{−(n/2)(x̄ − μ)^T Σ^{−1} (x̄ − μ)}.

Then the likelihood ratio statistic is

    Λn = sup_{μ, σ²} L(μ, σ² Ip) / sup_{μ, Σ} L(μ, Σ).

First we derive the denominator. Since L(μ, Σ) ≤ det(Σ)^{−n/2} exp(−(1/2) tr(Σ^{−1} A)) with
equality if and only if μ = x̄, we know μ = x̄ maximizes L(μ, Σ) for every Σ. Now we only need
to maximize det(Σ)^{−n/2} exp(−(1/2) tr(Σ^{−1} A)). Note that

    g(Σ) = log L(x̄, Σ) = (n/2) log det(Σ^{−1}) − (1/2) tr(Σ^{−1} A)
         = (n/2) log det(Σ^{−1} A) − (1/2) tr(Σ^{−1} A) − (n/2) log det(A)
         = (1/2) Σ_{i=1}^p (n log λi − λi) − (n/2) log det(A),

where λ1, . . . , λp are the eigenvalues of Σ^{−1} A. We know that g(Σ) is maximized at λ1 =
. . . = λp = n, i.e. at Σ = A/n. So

    sup_{μ, Σ} L(μ, Σ) = n^{np/2} exp(−np/2) det(A)^{−n/2}.

The numerator is

    sup_{μ, σ²} L(μ, σ² Ip) = sup_{σ²} σ^{−np} exp(−(1/(2σ²)) tr(A)) = (tr A/(np))^{−np/2} exp(−np/2).

So we define

    Vn = Λn^{2/n} = det(A)/(tr A/p)^p = det(S)/(tr S/p)^p.   (4.31)

Notice that the matrices A and S are singular when p ≥ n, and consequently their determinants
are equal to zero in this case. This indicates that the likelihood ratio test of (4.29) only
exists when p ≤ n − 1.
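For concreteness, Vn from (4.31) can be computed in a few lines (the helper `ellipticity_V` is our own sketch; it returns 0 when S is numerically singular, e.g. when p ≥ n):

```python
import numpy as np

def ellipticity_V(x):
    """V_n = det(S) / (tr(S)/p)^p for an n-by-p data matrix x (rows = obs)."""
    n, p = x.shape
    xc = x - x.mean(axis=0)
    S = xc.T @ xc / (n - 1)
    sign, logdet = np.linalg.slogdet(S)
    if sign <= 0:
        return 0.0                     # S singular (e.g. p >= n)
    return float(np.exp(logdet - p * np.log(np.trace(S) / p)))

rng = np.random.default_rng(6)
V = ellipticity_V(rng.standard_normal((100, 5)))
print(V)   # in (0, 1]; equals 1 iff all eigenvalues of S are equal
```

By the arithmetic–geometric mean inequality, det S ≤ (tr S/p)^p, which is why Vn always lies in (0, 1].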
Bai et al. (2010) made a correction to the traditional likelihood ratio test statistic Vn
to make it suitable for testing a high-dimensional normal distribution, by using the CLT for
linear spectral statistics of S. They defined

    Ln = tr(S) − log(det S) − p = Σ_i {λi − log(λi) − 1},

where λ1, . . . , λp are the eigenvalues of S. It is easy to see that Ln is a linear spectral
statistic, so the central limit theorem for linear spectral statistics plays an essential role
here. Let us review a specialization of the theorem from Bai and Silverstein (1996).

LEMMA 4.1 Let f be an analytic function and let the xij be i.i.d. random variables with mean
0, variance 1 and E xij⁴ < ∞. Assume p/n → y ∈ (0, 1) as n → ∞. Then

    Gn(f) = p ∫ f(x) d(F^{Sn}(x) − F_{yn}(x))

converges to a normal distribution with mean

    m(f) = (f(a(y)) + f(b(y)))/4 − (1/2π) ∫_{a(y)}^{b(y)} f(x)/√(4y − (x − 1 − y)²) dx

and variance

    v(f) = −(1/2π²) ∮∮ f(z1) f(z2)/(m(z1) − m(z2))² dm(z1) dm(z2),

where m(z) is the Stieltjes transform of F̲y = (1 − y) I_{[0,∞)} + y Fy, and the two contours
are non-overlapping and both contain the support of Fy. The functions F_{yn} and Fy are the
MP laws with indices yn = p/n and y, respectively.

It is not hard to write Ln = p ∫ (x − log x − 1) dF^{Sn}(x) in this form, and x − log x − 1 is
analytic on a neighborhood of the support. Then, according to the above lemma, Ln converges
weakly to a normal distribution. For more details, please refer to Bai et al. (2010). There is
another way to derive the same central limit result for Ln. Recall that the density of the
β-Laguerre ensemble has the form
    f_{β,a}(λ1, . . . , λp) = c_L^{β,a} Π_{1≤i<j≤p} |λi − λj|^β Π_{i=1}^p λi^{a−q} · e^{−(1/2) Σ_{i=1}^p λi}   (4.32)

for all λ1 > 0, λ2 > 0, . . . , λp > 0, where

    c_L^{β,a} = 2^{−pa} Π_{j=1}^p Γ(1 + β/2) / ( Γ(1 + (β/2) j) Γ(a − (β/2)(p − j)) ),   (4.33)

and

    Γ(α) := ∫_0^∞ exp(−x) x^{α−1} dx  for α ∈ C with Re(α) > 0,   (4.34)

β > 0, p ≥ 2, a > (β/2)(p − 1) and q = 1 + (β/2)(p − 1). The joint density function of the
eigenvalues λ1, . . . , λp of nS in our case corresponds to the β-Laguerre ensemble in (4.32)
with β = 1, a = (1/2)(n − 1) and q = 1 + (1/2)(p − 1).

THEOREM 4 Let x1, . . . , xn be i.i.d. random vectors with normal distribution Np(μ, Σ).
Define Ln = Σ_i {λi/n − log(λi) + log n − 1}, where λ1, . . . , λp are the eigenvalues of nS.
Assume Σ = I. If n > p = pn and lim_{n→∞} p/n = y ∈ (0, 1], then (Ln − μn)/σn converges
in distribution to N(0, 1) as n → ∞, where

    μn = (n − p − 3/2) log(1 − p/n) + p − y  and  σn² = −2 [ p/n + log(1 − p/n) ].
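The centering μn and scaling σn of Theorem 4 can be evaluated and sanity-checked by simulation. The sketch below assumes a known zero mean (so S = X^T X/n) and replaces y by p/n; the helper names `clt_params` and `corrected_lrt_stat` are our own.

```python
import numpy as np

def clt_params(n, p):
    """mu_n and sigma_n from Theorem 4 (with y replaced by p/n)."""
    r2 = -np.log1p(-p / n)                 # r_n^2 = -log(1 - p/n)
    mu = p + r2 * (p - n + 1.5) - p / n
    sigma2 = -2.0 * (p / n + np.log1p(-p / n))
    return mu, np.sqrt(sigma2)

def corrected_lrt_stat(x):
    """L_n = sum_i (lam_i/n - log lam_i + log n - 1), lam_i eig of n*S."""
    n, p = x.shape
    lam = np.linalg.eigvalsh(x.T @ x)      # eigenvalues of n*S, S = X^T X / n
    return float(np.sum(lam / n - np.log(lam) + np.log(n) - 1.0))

n, p = 500, 100
mu, sig = clt_params(n, p)
rng = np.random.default_rng(8)
z = (corrected_lrt_stat(rng.standard_normal((n, p))) - mu) / sig
print(mu, sig, z)   # z behaves like a N(0,1) draw under H0
```

One rejects H0 at level α when |z| exceeds the corresponding normal quantile.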

LEMMA 4.2 Let n > p and let Ln be defined as above. Assume λ1, . . . , λp have density
function f_{β,a}(λ1, . . . , λp) as in (4.32) with a = (β/2)(n − 1) and q = 1 + (β/2)(p − 1). Then

    E e^{tLn} = e^{(log n − 1)pt} (1 − 2t/n)^{p(t − (β/2)(n−1))} 2^{−pt} Π_{j=0}^{p−1} Γ(a − t − (β/2)j)/Γ(a − (β/2)j)

for any t ∈ (−1/2, (β/2)(n − p)).


Proof. Recall

    Ln = (1/n) Σ_{j=1}^p λj − Σ_{j=1}^p log λj + p log n − p.

We then have

    E e^{tLn} = e^{(log n − 1)pt} ∫_{[0,∞)^p} e^{(t/n) Σ_j λj} Π_{j=1}^p λj^{−t} f_{β,a}(λ1, . . . , λp) dλ1 · · · dλp
              = e^{(log n − 1)pt} c_L^{β,a} ∫_{[0,∞)^p} e^{−(1/2 − t/n) Σ_j λj} Π_{j=1}^p λj^{(a−t)−q} Π_{1≤k<l≤p} |λk − λl|^β dλ1 · · · dλp.

For t ∈ (−1/2, (β/2)(n − p)) we know 1/2 − t/n > 0. Make the change of variables μj = (1 − 2t/n) λj
for 1 ≤ j ≤ p. It follows that the above is identical to

    e^{(log n − 1)pt} c_L^{β,a} (1 − 2t/n)^{−p(a−t−q) − (β/2)p(p−1) − p}
        ∫_{[0,∞)^p} e^{−(1/2) Σ_j μj} Π_{j=1}^p μj^{(a−t)−q} Π_{1≤k<l≤p} |μk − μl|^β dμ1 · · · dμp.   (4.35)

Since t ∈ (−1/2, (β/2)(n − p)) and n − p ≥ 1, we know

    t < (β/2)(n − p) = (β/2)(n − 1) − (β/2)(p − 1) = a − (β/2)(p − 1).

That is, a − t > (β/2)(p − 1). Therefore the integral in (4.35) is equal to 1/c_L^{β,a−t} by (4.32) and
(4.33). It then follows from (4.35) that

    E e^{tLn} = e^{(log n − 1)pt} (1 − 2t/n)^{−p(a−t−q) − (β/2)p(p−1) − p} c_L^{β,a}/c_L^{β,a−t}
              = e^{(log n − 1)pt} (1 − 2t/n)^{−p(a−t−q) − (β/2)p(p−1) − p} 2^{−pt} Π_{j=1}^p Γ(a − t − (β/2)(p − j))/Γ(a − (β/2)(p − j)).

Now, use a = (β/2)(n − 1) and q = 1 + (β/2)(p − 1) to simplify the exponent:
−p(a − t − q) − (β/2)p(p − 1) − p = p(t − (β/2)(n − 1)). Re-indexing the product (j ↦ p − j), we
obtain

    E e^{tLn} = e^{(log n − 1)pt} (1 − 2t/n)^{p(t − (β/2)(n−1))} 2^{−pt} Π_{j=0}^{p−1} Γ(a − t − (β/2)j)/Γ(a − (β/2)j).

The proof is complete. 


The asymptotic expansion of the univariate Gamma function is best known as the Stirling
formula:

    log Γ(z) = (z − 1/2) log z − z + (1/2) log(2π) + 1/(12z) + O(1/|z|³).   (4.36)

Since our applications of these results are mainly in statistics, we limit our discussion
to the real line R for the sake of simplicity in derivation. We begin with the following
lemma:

LEMMA 4.3 Let b := b(x) be a real-valued function defined on (0, ∞). As x → ∞,

    log( Γ(x + b)/Γ(x) ) = b log x + (b² − b)/(2x) + c(x),   (4.37)

where Γ(·) is the univariate Gamma function as defined in (4.34) and

    c(x) = O(x^{−1/2})  if b(x) = O(x^{1/2}) as x → +∞;
    c(x) = O(x^{−2})    if b(x) = O(1) as x → +∞.

PROPOSITION 4.1 Let n > p = pn and rn = (−log(1 − p/n))^{1/2}. Assume that p/n → y ∈
(0, 1] and t = tn = O(1/rn) as n → ∞. Then, as n → ∞,

    Σ_{i=n−p}^{n−1} log( Γ(i/2 − t)/Γ(i/2) ) = pt(1 + log 2) − pt log n + rn² [ t² + (p − n + 1.5)t ] + o(1).

Sketch Proof of Theorem 4. First, since log(1 − x) < −x for all 0 < x < 1, we know σn² > 0
for all n > p ≥ 1. Now, by assumption, it is easy to see that

    lim_{n→∞} σn² = −2[y + log(1 − y)]  if y ∈ (0, 1);   lim_{n→∞} σn² = +∞  if y = 1.   (4.38)

Trivially, the limit is always positive. Consequently,

    σ0 := inf{σn : n > p ≥ 1} > 0.
To finish the proof, it is enough to show that

    E exp{ s (Ln − μn)/σn } → e^{s²/2} = E e^{sN(0,1)}   (4.39)

as n → ∞ for all s such that |s| < σ0/2.
Fix s such that |s| < σ0/2. Set t = tn = s/σn. Then |tn| < 1/2 for all n > p ≥ 1. In Lemma
4.2, take β = 1 and a = (n − 1)/2:

    E e^{tLn} = e^{(log n − 1)pt} (1 − 2t/n)^{pt − np/2 + p/2} 2^{−pt} Π_{j=0}^{p−1} Γ((n − j − 1)/2 − t)/Γ((n − j − 1)/2).

Letting i = n − j − 1, we get

    E e^{tLn} = 2^{−pt} e^{(log n − 1)pt} (1 − 2t/n)^{pt − np/2 + p/2} Π_{i=n−p}^{n−1} Γ(i/2 − t)/Γ(i/2)   (4.40)

for n > p. Then

    log E e^{tLn} = pt(log n − 1 − log 2) + p( t + (1 − n)/2 ) log(1 − 2t/n) + Σ_{i=n−p}^{n−1} log( Γ(i/2 − t)/Γ(i/2) ).
Now, use the identity log(1 − x) = −x − x²/2 + O(x³) as x → 0 to get

    p( t + (1 − n)/2 ) log(1 − 2t/n) = p( t + (1 − n)/2 )( −2t/n − 2t²/n² + O(t³/n³) )
      = −(2pt/n)( t + (1 − n)/2 )( 1 + t/n ) + o(1)
      = −(p/n)t² + pt − (p/n)t + o(1)
      = −(p/n)t² + pt − yt + o(1)
as n → ∞. Recall rn = (−log(1 − p/n))^{1/2}. We know t = tn = s/σn = O(1/rn) as n → ∞. By
Proposition 4.1,

    Σ_{i=n−p}^{n−1} log( Γ(i/2 − t)/Γ(i/2) ) = pt(1 + log 2) − pt log n + rn² [ t² + (p − n + 1.5)t ] + o(1)

as n → ∞. Joining all the assertions from (4.40) onward, we obtain

    log E e^{tLn} = pt(log n − 1 − log 2) − (p/n)t² + pt − yt
                  + pt(1 + log 2) − pt log n + rn² [ t² + (p − n + 1.5)t ] + o(1)
                = ( −p/n + rn² ) t² + [ p + rn²(p − n + 1.5) − y ] t + o(1)   (4.41)
as n → ∞. Noticing that

    p + rn²(p − n + 1.5) − y = (n − p − 3/2) log(1 − p/n) + p − y = μn

and, from the definition of σn and the notation t = s/σn, that (−p/n + rn²) t² = s²/2, it
follows from (4.41) that

    log E exp{ s (Ln − μn)/σn } = log E e^{tLn} − μn t → s²/2

as n → ∞. This implies (4.39). The proof is completed. 

Now let us go back to the LRT statistic Vn without any correction. The statistic Vn
is commonly known as the ellipticity statistic. The distribution of the test statistic Vn
can be studied through its moments. When the null hypothesis H0 : Σ = σ² Ip is true, the
following result is referenced from page 341 of Muirhead (1982):

    E(Vn^h) = p^{ph} · ( Γ((n−1)p/2) / Γ((n−1)p/2 + ph) ) · ( Γp((n−1)/2 + h) / Γp((n−1)/2) )
      for h > (p − n)/2,   (4.42)

where

    Γp(α) := π^{p(p−1)/4} Π_{i=1}^p Γ( α − (i − 1)/2 )  for Re(α) > (1/2)(p − 1).   (4.43)

When p is assumed to be a fixed integer, the following result gives an explicit expansion of
the distribution function of −ρ(n − 1) log Vn, where ρ = 1 − (2p² + p + 2)/(6p(n − 1)), as
M := ρ(n − 1) → ∞:

    Pr( −ρ(n − 1) log Vn ≤ x )
      = Pr(χ²_f ≤ x) + (ω2/M²){ Pr(χ²_{f+4} ≤ x) − Pr(χ²_f ≤ x) } + O(M^{−3}),   (4.44)

where f = (p + 2)(p − 1)/2, M² = (n − 1)² ρ², and ω2 is given by

    ω2 = (p − 1)(p − 2)(p + 2)(2p³ + 6p² + 3p + 2) / ( 288 p² (n − 1)² ρ² ).   (4.45)

Now we use (4.42) to derive a central limit theorem for the likelihood ratio test statistic
log Vn as given in (4.31) directly.

THEOREM 5 Assume that p := pn is a sequence of positive integers depending on n such
that n > 1 + p for all n ≥ 3 and p/n → y ∈ (0, 1] as n → ∞. Let Vn be defined as in
(4.31). Then, under H0 : Σ = σ² Ip (σ² unknown), (log Vn − μn)/σn converges in distribution
to N(0, 1) as n → ∞, where

    μn = −p − (n − p − 1.5) log( 1 − p/(n − 1) ),
    σn² = −2 [ p/(n − 1) + log( 1 − p/(n − 1) ) ] > 0.
Sketch Proof of Theorem 5. Recall that a sufficient condition for a sequence of random
variables {Zn : n ≥ 1} to converge to Z in distribution as n → ∞ is that

    lim_{n→∞} E e^{tZn} = E e^{tZ} < ∞   (4.46)

for all t ∈ (−t0, t0), where t0 > 0 is a constant. Thus, to prove the theorem, it suffices to
show that there exists δ0 > 0 such that

    E exp{ s (log Vn − μn)/σn } → e^{s²/2}   (4.47)

as n → ∞ for all |s| < δ0. Starting from the moment formula (4.42) with h = s/σn, the rest
is the same as in the proof of Theorem 4.
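Putting Theorem 5 to work, here is a minimal implementation of the standardized statistic (our own sketch, assuming n > p + 1; `sphericity_lrt` is not a library function):

```python
import numpy as np

def sphericity_lrt(x):
    """Return (log V_n, z-score) from Theorem 5 for an n-by-p data matrix."""
    n, p = x.shape
    xc = x - x.mean(axis=0)
    S = xc.T @ xc / (n - 1)
    lam = np.linalg.eigvalsh(S)
    log_V = float(np.sum(np.log(lam)) - p * np.log(np.mean(lam)))
    c = p / (n - 1)
    mu = -p - (n - p - 1.5) * np.log1p(-c)
    sigma = np.sqrt(-2.0 * (c + np.log1p(-c)))
    return log_V, (log_V - mu) / sigma

rng = np.random.default_rng(9)
# H0 true with sigma^2 = 4; V_n is scale-invariant, so sigma^2 drops out
log_V, z = sphericity_lrt(2.0 * rng.standard_normal((500, 100)))
print(log_V, z)   # z approximately N(0,1) under H0
```

Note that log Vn < 0 always, again by the arithmetic–geometric mean inequality.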

4.2 Different Tests for Sphericity

A different test for sphericity, other than the likelihood ratio test, is based on the statistic
$$U=\frac{1}{p}\operatorname{tr}\Big[\Big(\frac{S}{(1/p)\operatorname{tr}S}-I_p\Big)^2\Big]=\frac{(1/p)\operatorname{tr}(S^2)}{[(1/p)\operatorname{tr}S]^2}-1.\qquad (4.48)$$
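The two expressions for $U$ in (4.48) are algebraically identical (expand the square and use $\operatorname{tr}I_p=p$); a quick numerical check (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 10
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)

m1 = np.trace(S) / p                       # (1/p) tr S
A = S / m1 - np.eye(p)
u_left = np.trace(A @ A) / p               # (1/p) tr[(S/((1/p)tr S) - I_p)^2]
u_right = (np.trace(S @ S) / p) / m1**2 - 1
print(abs(u_left - u_right) < 1e-10)       # True: the two forms in (4.48) agree
```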
Under the null hypothesis of (4.29), the limiting distribution of the test statistic U , as the
sample size n goes to infinity while the dimension p remains fixed, is given by
$$\frac{np}{2}\,U\ \xrightarrow{d}\ \chi^2_{p(p+1)/2-1}.\qquad (4.49)$$
Ledoit and Wolf (2002) re-examined the limiting distribution of the test statistic $U$ in the high-dimensional situation where $p/n\to c\in(0,\infty)$. They proved that under the null hypothesis of (4.29),
$$nU_n-p\ \xrightarrow{d}\ N(1,4).\qquad (4.50)$$

Ledoit and Wolf further argued that since
$$\frac{2}{p}\,\chi^2_{p(p+1)/2-1}-p\ \xrightarrow{d}\ N(1,4)\quad\text{as }p\to\infty,\qquad (4.51)$$
the fixed-dimension asymptotic result (4.49) for the test statistic $U$ remains valid in practice in the high-dimensional case (i.e., when both $p$ and $n$ are large).
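The claim (4.51) is elementary to check at the level of moments: $\chi^2_f$ has mean $f$ and variance $2f$, so with $f=p(p+1)/2-1$ the statistic $(2/p)\chi^2_f-p$ has mean $1-2/p\to 1$ and variance $4(p+1)/p-8/p^2\to 4$. Numerically (our own sketch):

```python
# chi^2_f has mean f and variance 2f; with f = p(p+1)/2 - 1 this gives
# mean 1 - 2/p -> 1 and variance 4(p+1)/p - 8/p^2 -> 4 for (2/p)*chi2_f - p.
for p in [10, 100, 1000]:
    f = p * (p + 1) // 2 - 1
    mean = (2 / p) * f - p
    var = (2 / p) ** 2 * (2 * f)
    print(p, mean, var)
```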

Now we introduce another type of test statistic, which deals with the case $p\gg n$. Let $X_n=(x_{ij})$ be an $n\times p$ random matrix whose entries $x_{ij}$ are i.i.d. real random variables with mean $\mu$ and variance $\sigma^2>0$. Let $x_1,x_2,\dots,x_p$ be the $p$ columns of $X_n$. The sample correlation matrix $\Gamma_n$ is defined by $\Gamma_n:=(\rho_{ij})$ with
$$\rho_{ij}=\frac{(x_i-\bar x_i)^T(x_j-\bar x_j)}{\|x_i-\bar x_i\|\,\|x_j-\bar x_j\|},\qquad 1\le i,j\le p,\qquad (4.52)$$
where $\bar x_k=(1/n)\sum_{i=1}^n x_{ik}$ and $\|\cdot\|$ is the usual Euclidean norm in $\mathbb{R}^n$. Here we write $x_i-\bar x_i$ for $x_i-\bar x_i e$, where $e=(1,1,\dots,1)^T\in\mathbb{R}^n$. The largest magnitude of the off-diagonal entries of the sample correlation matrix is
$$L_n=\max_{1\le i<j\le p}|\rho_{ij}|.\qquad (4.53)$$
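Computing $\Gamma_n$ and $L_n$ is straightforward; a numpy-based sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 200                       # p much larger than n is allowed here
X = rng.standard_normal((n, p))

R = np.corrcoef(X, rowvar=False)     # p x p sample correlation matrix Gamma_n
L_n = np.abs(R[np.triu_indices(p, k=1)]).max()   # coherence L_n in (4.53)

# sanity check of the formula (4.52) for the pair (i, j) = (1, 2)
x1 = X[:, 0] - X[:, 0].mean()
x2 = X[:, 1] - X[:, 1].mean()
rho_12 = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(0.0 < L_n <= 1.0, abs(rho_12 - R[0, 1]) < 1e-12)
```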

THEOREM 6 Suppose $Ee^{t_0|x_{11}|^{\alpha}}<\infty$ for some $0<\alpha\le 2$ and $t_0>0$. Set $\beta=\alpha/(4+\alpha)$. Assume $p=p(n)\to\infty$ and $\log p=o(n^{\beta})$ as $n\to\infty$. Then $nL_n^2-4\log p+\log\log p$ converges weakly to an extreme-value distribution of type I with distribution function
$$F(y)=\exp\Big\{-\frac{1}{\sqrt{8\pi}}\,e^{-y/2}\Big\},\qquad y\in\mathbb{R}.$$
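As a practical aside (our own illustration, not from the notes), the limit law in Theorem 6 gives approximate critical values for testing complete independence: reject at level $\alpha$ when $nL_n^2-4\log p+\log\log p$ exceeds the $(1-\alpha)$-quantile of $F$, which has a closed form.

```python
import math

K = 1 / math.sqrt(8 * math.pi)

def F(y):
    """Limiting cdf of n*L_n^2 - 4*log(p) + log(log(p)) from Theorem 6."""
    return math.exp(-K * math.exp(-y / 2))

def critical_value(alpha):
    """y_alpha with F(y_alpha) = 1 - alpha (level-alpha rejection threshold)."""
    return -2 * math.log(-math.log(1 - alpha) / K)

y05 = critical_value(0.05)
print(abs(F(y05) - 0.95) < 1e-12)   # True
```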

Without loss of generality, we assume $Ex_{ij}=0$ and $Ex_{ij}^2=1$ (the correlations $\rho_{ij}$ are invariant under location and scale changes). Define $W_n=\max_{1\le i<j\le p}|x_i^Tx_j|$. It is easy to show that
$$\frac{n^2L_n^2-W_n^2}{n}\ \to\ 0\ \text{ in probability}.\qquad (4.54)$$
So we only need to show the following result for $W_n$.

PROPOSITION 4.2 Let $\{x_{ij};\ i\ge 1,\ j\ge 1\}$ be i.i.d. random variables with $Ex_{11}=0$, $E(x_{11}^2)=1$ and $Ee^{t_0|x_{11}|^{\alpha}}<\infty$ for some $0<\alpha\le 2$ and $t_0>0$. Let $W_n=\max_{1\le i<j\le p}|x_i^Tx_j|$. Set $\beta=\alpha/(4+\alpha)$. Assume $p=p(n)\to\infty$ and $\log p=o(n^{\beta})$ as $n\to\infty$. Then
$$P\Big(\frac{W_n^2-\alpha_n}{n}\le z\Big)\ \to\ e^{-Ke^{-z/2}}$$
as $n\to\infty$ for any $z\in\mathbb{R}$, where $\alpha_n=4n\log p-n\log(\log p)$ and $K=(\sqrt{8\pi})^{-1}$.

Before the proof, let us first record some useful results. The following Poisson approximation result is essentially a special case of Theorem 1 of Arratia et al. (1989).

LEMMA 4.4 Let $I$ be an index set and $\{B_\alpha,\ \alpha\in I\}$ a collection of subsets of $I$, that is, $B_\alpha\subset I$ for each $\alpha\in I$. Let $\{\eta_\alpha,\ \alpha\in I\}$ be random variables. For a given $t\in\mathbb{R}$, set $\lambda=\sum_{\alpha\in I}P(\eta_\alpha>t)$. Then
$$\Big|P\Big(\max_{\alpha\in I}\eta_\alpha\le t\Big)-e^{-\lambda}\Big|\le(1\wedge\lambda^{-1})(b_1+b_2+b_3),$$
where
$$b_1=\sum_{\alpha\in I}\sum_{\beta\in B_\alpha}P(\eta_\alpha>t)P(\eta_\beta>t),$$
$$b_2=\sum_{\alpha\in I}\sum_{\alpha\ne\beta\in B_\alpha}P(\eta_\alpha>t,\ \eta_\beta>t),$$
$$b_3=\sum_{\alpha\in I}E\big|P\big(\eta_\alpha>t\,\big|\,\sigma(\eta_\beta,\ \beta\notin B_\alpha)\big)-P(\eta_\alpha>t)\big|,$$
and $\sigma(\eta_\beta,\ \beta\notin B_\alpha)$ is the $\sigma$-algebra generated by $\{\eta_\beta;\ \beta\notin B_\alpha\}$. In particular, if $\eta_\alpha$ is independent of $\{\eta_\beta;\ \beta\notin B_\alpha\}$ for each $\alpha$, then $b_3=0$.
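To see what Lemma 4.4 says in the simplest setting, take the $\eta_\alpha$ i.i.d. and $B_\alpha=\{\alpha\}$, so $b_2=b_3=0$ and $b_1=mq^2$ with $q=P(\eta>t)$; the lemma then bounds the gap between the exact probability $(1-q)^m$ and the Poisson approximation $e^{-mq}$. A numerical illustration (the values of $m$ and $q$ are made up):

```python
import math

# Independent special case of Lemma 4.4: eta_1..eta_m i.i.d., B_alpha = {alpha},
# so b2 = 0, b3 = 0 and b1 = m*q^2, where q = P(eta > t)
m, q = 1000, 0.002
lam = m * q
exact = (1 - q) ** m                   # P(max_alpha eta_alpha <= t) by independence
poisson = math.exp(-lam)
bound = min(1, 1 / lam) * m * q ** 2   # (1 ^ 1/lambda) * (b1 + b2 + b3)
print(abs(exact - poisson) <= bound)   # True
```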

The following conclusion is Example 1 of Sakhanenko (1991); see also Lemma 6.2 of Liu et al. (2008).

LEMMA 4.5 Let $\xi_i$, $1\le i\le n$, be independent random variables with $E\xi_i=0$. Put
$$s_n^2=\sum_{i=1}^n E\xi_i^2,\qquad \varrho_n=\sum_{i=1}^n E|\xi_i|^3,\qquad S_n=\sum_{i=1}^n\xi_i.$$
Assume $\max_{1\le i\le n}|\xi_i|\le c_ns_n$ for some $0<c_n\le 1$. Then
$$P(S_n\ge xs_n)=e^{\gamma(x/s_n)}\big(1-\Phi(x)\big)\big(1+\theta_{n,x}(1+x)s_n^{-3}\varrho_n\big)$$
for $0<x\le 1/(18c_n)$, where $|\gamma(x)|\le 2x^3\varrho_n$ and $|\theta_{n,x}|\le 36$.

The following are moderate deviation results from Chen (1990); see also Chen (1991), Dembo and Zeitouni (1998) and Ledoux (1992). They are a special type of large deviations.

LEMMA 4.6 Suppose $\xi_1,\xi_2,\dots$ are i.i.d. random variables with $E\xi_1=0$ and $E\xi_1^2=1$. Put $S_n=\sum_{i=1}^n\xi_i$. Let $0<\alpha\le 1$ and let $\{a_n;\ n\ge 1\}$ satisfy $a_n\to+\infty$ and $a_n=o\big(n^{\alpha/(2(2-\alpha))}\big)$. If $Ee^{t_0|\xi_1|^{\alpha}}<\infty$ for some $t_0>0$, then
$$\lim_{n\to\infty}\frac{1}{a_n^2}\log P\Big(\frac{S_n}{\sqrt n}\ge ua_n\Big)=-\frac{u^2}{2}\qquad (4.55)$$
for any $u>0$.

LEMMA 4.7 Let $\xi_1,\dots,\xi_n$ be i.i.d. random variables with $E\xi_1=0$, $E\xi_1^2=1$ and $Ee^{t_0|\xi_1|^{\alpha}}<\infty$ for some $t_0>0$ and $0<\alpha\le 1$. Put $S_n=\sum_{i=1}^n\xi_i$ and $\beta=\alpha/(2+\alpha)$. Then, for any $\{p_n;\ n\ge 1\}$ with $p_n\to\infty$ and $\log p_n=o(n^{\beta})$, and any $\{y_n;\ n\ge 1\}$ with $y_n\to y>0$,
$$P\Big(\frac{S_n}{\sqrt{n\log p_n}}\ge y_n\Big)\ \sim\ \frac{p_n^{-y_n^2/2}}{(\log p_n)^{1/2}}\cdot\frac{1}{\sqrt{2\pi}\,y}$$
as $n\to\infty$.
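For standard normal $\xi_i$ the left-hand side in Lemma 4.7 can be computed exactly, since $S_n/\sqrt n$ is exactly $N(0,1)$; this makes the stated tail asymptotic easy to check numerically (our own illustration):

```python
import math

# For xi_i ~ N(0,1), the probability in Lemma 4.7 equals 1 - Phi(y_n*sqrt(log p)).
# Compare with the stated asymptotic
#   p^{-y_n^2/2} / ((log p)^{1/2} * sqrt(2*pi) * y)  with  y_n = y = 2.
y = 2.0
for p in [10**3, 10**6, 10**9]:
    x = y * math.sqrt(math.log(p))
    tail = 0.5 * math.erfc(x / math.sqrt(2))   # 1 - Phi(x)
    approx = p ** (-y * y / 2) / (math.sqrt(math.log(p)) * math.sqrt(2 * math.pi) * y)
    print(p, tail / approx)                    # ratio tends to 1 as p grows
```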

Proof of Proposition 4.2. It suffices to show that
$$P\Big(\max_{1\le i<j\le p}|y_{ij}|\le\sqrt{\alpha_n+nz}\Big)\ \to\ e^{-Ke^{-z/2}},\qquad (4.56)$$
where $y_{ij}=\sum_{k=1}^n x_{ki}x_{kj}$. We now apply Lemma 4.4 to prove (4.56). Take $I=\{(i,j);\ 1\le i<j\le p\}$. For $u=(i,j)\in I$, set $X_u=|y_{ij}|$ and $B_u=\{(k,l)\in I;\ \text{one of }k,l\in\{i,j\},\ \text{but }(k,l)\ne u\}$. Let $a_n=\sqrt{\alpha_n+nz}$ and $A_{ij}=\{|y_{ij}|>a_n\}$. Since $\{y_{ij};\ (i,j)\in I\}$ are identically distributed, by Lemma 4.4,
$$|P(W_n\le a_n)-e^{-\lambda_n}|\le b_{1,n}+b_{2,n},\qquad (4.57)$$

where
$$\lambda_n=\frac{p(p-1)}{2}P(A_{12}),\qquad b_{1,n}\le 2p^3P(A_{12})^2\qquad\text{and}\qquad b_{2,n}\le 2p^3P(A_{12}A_{13}).\qquad (4.58)$$
We first calculate $\lambda_n$. Write
$$\lambda_n=\frac{p^2-p}{2}\,P\Big(\frac{|y_{12}|}{\sqrt n}>\sqrt{\frac{\alpha_n}{n}+z}\Big)\qquad (4.59)$$
and $y_{12}=\sum_{i=1}^n\xi_i$, where $\{\xi_i;\ 1\le i\le n\}$ are i.i.d. random variables with the same distribution as $x_{11}x_{12}$. In particular, $E\xi_1=0$ and $E\xi_1^2=1$. Note $\alpha_1:=\alpha/2\le 1$. We then have
$$|x_{11}x_{12}|^{\alpha_1}\le\Big(\frac{x_{11}^2+x_{12}^2}{2}\Big)^{\alpha_1}\le\frac{1}{2^{\alpha_1}}\big(|x_{11}|^{\alpha}+|x_{12}|^{\alpha}\big).$$
Hence, by independence,
$$Ee^{t_0|\xi_1|^{\alpha_1}}=Ee^{t_0|x_{11}x_{12}|^{\alpha_1}}\le Ee^{t_02^{-\alpha_1}|x_{11}|^{\alpha}}\,Ee^{t_02^{-\alpha_1}|x_{12}|^{\alpha}}<\infty.$$

Let $y_n=\sqrt{(\alpha_n/n+z)/\log p}$. Then $y_n\to 2$ as $n\to\infty$. By Lemma 4.7,
$$P\Big(\frac{y_{12}}{\sqrt n}>\sqrt{\frac{\alpha_n}{n}+z}\Big)=P\Big(\frac{\sum_{i=1}^n\xi_i}{\sqrt{n\log p}}>y_n\Big)\ \sim\ \frac{p^{-y_n^2/2}}{(\log p)^{1/2}}\cdot\frac{1}{2\sqrt{2\pi}}\ \sim\ \frac{e^{-z/2}}{2\sqrt{2\pi}}\cdot\frac{1}{p^2}$$
as $n\to\infty$. Considering $Ex_{ij}=0$, it is easy to see that the above also holds if $y_{12}$ is replaced by $-y_{12}$. These and (4.59) imply that
$$\lambda_n\ \sim\ \frac{p^2-p}{2}\cdot\frac{e^{-z/2}}{\sqrt{2\pi}}\cdot\frac{1}{p^2}\ \to\ \frac{e^{-z/2}}{\sqrt{8\pi}}\qquad (4.60)$$
as $n\to\infty$.

Recalling (4.57) and (4.58), to complete the proof we have to verify that $b_{1,n}\to 0$ and $b_{2,n}\to 0$ as $n\to\infty$. By (4.58), (4.59) and (4.60),
$$b_{1,n}\le 2p^3P(A_{12})^2=\frac{8p^3\lambda_n^2}{(p^2-p)^2}=O\Big(\frac{1}{p}\Big)$$
as $n\to\infty$. Also, by (4.58),
$$b_{2,n}\le 2p^3\,P\Big(\frac{\sum_{k=1}^n x_{k1}x_{k2}}{\sqrt n}>\sqrt{\alpha_n/n+z},\ \frac{\sum_{k=1}^n x_{k1}x_{k3}}{\sqrt n}>\sqrt{\alpha_n/n+z}\Big)$$
$$\le 2p^3\Big\{P\Big(\frac{\sum_{k=1}^n x_{k1}(x_{k2}+x_{k3})}{\sqrt n}>2\sqrt{\alpha_n/n+z}\Big)+P\Big(\frac{\sum_{k=1}^n x_{k1}(x_{k2}-x_{k3})}{\sqrt n}>2\sqrt{\alpha_n/n+z}\Big)\Big\}.$$
Since $Ex_{k1}(x_{k2}+x_{k3})=0$ and $E\{x_{k1}(x_{k2}+x_{k3})\}^2=2$, applying (4.55), we get
$$P\Big(\frac{\sum_{k=1}^n x_{k1}(x_{k2}+x_{k3})}{\sqrt n}>2\sqrt{\alpha_n/n+z}\Big)\le 2\exp\big(-(1-\epsilon)(\alpha_n/n+z)\big)\le 2p^{-(4-\epsilon)}$$
for any fixed $\epsilon\in(0,1)$ and all large $n$. The same bound holds for $P\big(\frac{\sum_{k=1}^n x_{k1}(x_{k2}-x_{k3})}{\sqrt n}>2\sqrt{\alpha_n/n+z}\big)$. Therefore $b_{2,n}=o(1)$. $\square$


Acknowledgements
This set of lecture notes is mainly based on the book by Bai and Silverstein (2009). It also includes material from Professor Tiefeng Jiang's lecture notes and the book by Anderson, Guionnet and Zeitouni (2010).

References
[1] Anderson, G. W., Guionnet, A. and Zeitouni, O. (2010). An introduction to
random matrices. Cambridge University Press.

[2] Bai, Z. and Silverstein, J. W. (2009). Spectral analysis of large dimensional random matrices. Springer.

[3] Bai, Z. (1999). Methodologies in spectral analysis of large-dimensional random matrices, a review. Statist. Sinica 9(3), 611-677.

[4] Bai, Z., Jiang, D., Yao, J. and Zheng, S. (2009). Corrections to LRT on large-dimensional covariance matrix by RMT. Ann. Statist. 37(6B), 3822-3840.

[5] Cai, T. and Jiang, T. (2011). Limiting laws of coherence of random matrices with applications to testing covariance structure and construction of compressed sensing matrices. Ann. Statist. 39(3), 1496-1525.

[6] Jiang, D., Jiang, T. and Yang, F. (2012). Likelihood ratio tests for covariance matrices of high-dimensional normal distributions. Journal of Statistical Planning and Inference 142(8), 2241-2256.

[7] Jiang, T. and Yang, F. (2013). Central limit theorems for classical likelihood ratio tests for high-dimensional normal distributions. Ann. Statist. 41(4), 2029-2074.

43