
Computational Statistics & Data Analysis 5 (1987) 185-192

North-Holland

Sample size determination in estimating a covariance matrix

Pushpa L. GUPTA *
Department of Mathematics, University of Maine, Orono, ME 04469, USA

R.D. GUPTA **
Division of Mathematics, Engineering, Computer Science, University of New Brunswick,
Saint John, N.B., Canada E2L 4L5

Received 22 July 1986


Revised 17 January 1987

Abstract: The sample size required to estimate a covariance matrix with a desired precision
in a multivariate normal population is investigated. Explicit formulas for the sample size are
provided in the univariate case and in the multivariate case when the covariance matrix is diagonal.
In these cases tables are also provided for specific values of the relative error $\varepsilon$ and
the joint confidence coefficient $1 - \alpha$. For the general case, a method to compute the sample
size is developed, resulting in an integral equation involving the covariance matrix. If a prior
estimate of the covariance matrix is available, the integral equation can be solved by using the
algorithm given by Russell et al. (1985). Examples are used to illustrate the effects of the
dimension and of the quality of the prior estimate of the covariance matrix on the sample size.

Keywords: Sample size, Covariance matrix, Multivariate normal distribution.

1. Introduction

This paper deals with determining the sample size for estimating a covariance
matrix in a multivariate normal population with a specified joint confidence level and
precision. The problem originated when the first author was involved in a project
at the USAF School of Aerospace Medicine (USAFSAM). The USAFSAM at
Brooks AFB has been interested for several years in the use of statistical methods
to develop a computerized system to assist cardiologists, who must examine a
large number of EKGs in a single day, in the screening, diagnosis and serial
comparison of vectorcardiograms. Past efforts at USAFSAM in the diagnosis of
vectorcardiograms have relied on a Karhunen-Loève approximation of the signal

* Supported by a Faculty Summer Research Grant from the University of Maine.


** Supported by NSERC Research Grant #A-4850.

0167-9473/87/$3.50 © 1987, Elsevier Science Publishers B.V. (North-Holland)


186 P.L. Gupta, R.D. Gupta / Estimating a covariance matrix

(750 dimensional in a 3-lead system) together with linear and quadratic discrimination
in the transformed space, which is 60 dimensional. The crux of this approach
is, therefore, the estimate of the 60 × 60 covariance matrix of the Karhunen-Loève
coefficients. Its quality can, therefore, be a source of concern for the efficacy of
the entire procedure. The quality or accuracy of the covariance matrix estimate is
a function of the sample size and the unknown entries. It was suggested that a
sample of 750 is sufficient to estimate a 60 × 60 covariance matrix with reasonable
accuracy. This figure is apparently not based on any theoretical considerations
and seems to be low, as is evident from the sample size requirement for the
sixty dimensional independent case (see Table 2).
The problem of estimating the variance ($\sigma^2$) of a normal density arises in
many experimental situations. As an example (Greenwood and Sandomire [4]), a
series of radar pulses is to be sent out to a target and the strength of the return
signal measured. How many readings under identical conditions shall be taken so
that the standard deviation of the return signal strengths shall, with 80% confi-
dence, be within 10% of the true value?
Greenwood and Sandomire [4] presented a graphical approach for obtaining
the sample size required to estimate variance of a normal density within a given
percent of its true value. Graybill and Connell [2], instead, have given a two-step
sampling procedure to estimate the variance within a given number of units. The
number of units and the confidence level are specified in advance. Thompson and
Endriss [10] have also given a method for estimating the sample size in the
univariate case. Their method depends on the large-sample distribution of
the estimator. Other work dealing with estimating the variance includes Graybill and
Morrison [3], Leone, Rutenberg and Topp [5], Tate and Klett [9] and Graybill [1].
For the sake of completeness, Section 2 gives a brief discussion of how to find
the sample size $n$ in the univariate case for a given $\varepsilon$ (the relative error) and a
given $\alpha$ (where $1 - \alpha$ is the confidence coefficient).
In Section 3, we develop the procedures for determining the sample size in the
multivariate situation, where two cases are studied. In case 1, the covariance
matrix $\Sigma$ is taken to be diagonal, while in case 2 it is any general matrix. Table 2
is prepared for case 1 when $p$ = 2, 5, 10, 20, 40, 60. For case 2 tables cannot
be prepared, as the result is in the form of an integral equation involving $\Sigma$.
However, if a prior estimate of $\Sigma$ is available, one can use the algorithm given by
Russell et al. [7] to solve the integral equation. The quality of the prior estimate has
a strong effect on the sample size, which is illustrated by some examples.
Throughout the paper $p$ denotes the dimension, $\varepsilon$ the relative error and $1 - \alpha$
the joint confidence coefficient when $p \ge 2$.

2. Univariate case

Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$. Let

$$S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1).$$

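Then $(n - 1)S^2/\sigma^2$ has a chi-square distribution with $n - 1$ degrees of freedom.
It is well known that large samples are necessary if $\sigma$ is to be estimated
accurately.
The problem is to find the sample size $n$ such that

$$P\left[\left|\frac{S}{\sigma} - 1\right| < \varepsilon\right] = 1 - \alpha \tag{2.1}$$

for a given value of $(\varepsilon, \alpha)$. That is,

$$
\begin{aligned}
1 - \alpha &= P\left[1 - \varepsilon < \frac{S}{\sigma} < 1 + \varepsilon\right] \\
&= P\left[\sqrt{2(n-1)}\,(1 - \varepsilon) < \sqrt{\frac{2(n-1)S^2}{\sigma^2}} < \sqrt{2(n-1)}\,(1 + \varepsilon)\right] \\
&= P\left[\sqrt{2(n-1)}\,(1 - \varepsilon) - \sqrt{2(n-1) - 1} < Z < \sqrt{2(n-1)}\,(1 + \varepsilon) - \sqrt{2(n-1) - 1}\right] \\
&\simeq \Phi\left[\sqrt{2(n-1)}\,(1 + \varepsilon) - \sqrt{2(n-1) - 1}\right] - \Phi\left[\sqrt{2(n-1)}\,(1 - \varepsilon) - \sqrt{2(n-1) - 1}\right]
\end{aligned}
\tag{2.2}
$$

where

$$Z = \sqrt{\frac{2(n-1)S^2}{\sigma^2}} - \sqrt{2(n-1) - 1} \simeq N(0, 1)$$

(see Snedecor and Cochran [8]) and $\Phi$ is the cumulative distribution function of
$N(0, 1)$.
Since $n$ is large, $\sqrt{2(n-1) - 1} \simeq \sqrt{2(n-1)}$ and equation (2.2) can be written as

$$1 - \alpha \simeq \Phi\left[\sqrt{2(n-1)}\,\varepsilon\right] - \Phi\left[-\sqrt{2(n-1)}\,\varepsilon\right]$$

or

$$\sqrt{2(n-1)}\,\varepsilon \simeq Z_{\alpha/2}$$

or

$$n \simeq 1 + \frac{1}{2}\left(\frac{Z_{\alpha/2}}{\varepsilon}\right)^2 \tag{2.3}$$

where $P[Z > Z_{\alpha/2}] = \alpha/2$.

Table 1 gives such values of $n$ for some selected values of $\alpha$ and $\varepsilon$.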
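
As a minimal sketch of how (2.3) can be evaluated, the following Python function computes the approximate sample size from $\varepsilon$ and $\alpha$. The function name `univariate_sample_size` and the use of the standard library's `statistics.NormalDist` for the normal quantile are choices made for this illustration, not part of the paper. The usage lines apply it to the radar example from the Introduction (standard deviation within 10% of its true value with 80% confidence).

```python
import math
from statistics import NormalDist


def univariate_sample_size(eps: float, alpha: float) -> float:
    """Approximate n from (2.3): n = 1 + 0.5 * (z_{alpha/2} / eps)**2."""
    if eps <= 0 or not 0 < alpha < 1:
        raise ValueError("need eps > 0 and 0 < alpha < 1")
    # Upper alpha/2 point of N(0, 1), i.e. P[Z > z] = alpha / 2.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 + 0.5 * (z / eps) ** 2


# Radar example: eps = 0.10 (within 10%), alpha = 0.20 (80% confidence).
# Rounding up to the next integer is an assumption of this sketch.
n_radar = math.ceil(univariate_sample_size(0.10, 0.20))
print(n_radar)
```

Because (2.3) relies on the large-sample normal approximation, values obtained this way for small $n$ should be read as rough guides only.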

3. Multivariate case

Let $X_1, X_2, \ldots, X_p$ be $p$ random variables measured for each object or subject.
Let us also assume that $X = (X_1, X_2, \ldots, X_p)' \sim N_p(\mu, \Sigma)$.

Table 1
Sample size for the univariate case

            $\varepsilon$
$\alpha$     0.01    0.02    0.03    0.04    0.05    0.06    0.08    0.09    0.10

0.01        33180    8295    3688    2075    1329     933     679     411     333
0.05        19210    4804    2136    1202     770     535     394     239     194
0.10        13530    3384    1505     847     543     377     278     169     137

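Case 1. Suppose $X_1, X_2, \ldots, X_p$ are independently distributed, i.e.
$\Sigma = \mathrm{diag}(\sigma_{11}, \sigma_{22}, \ldots, \sigma_{pp})$. Then the problem is to find $n$ such that

$$P\left[\left|\frac{S_{ii}}{\sigma_{ii}} - 1\right| < \varepsilon,\ i = 1, 2, \ldots, p\right] = 1 - \alpha \tag{3.1}$$

for given $\varepsilon > 0$ and $\alpha > 0$, where

$$S_{ii} = \frac{1}{n - 1} \sum_{j=1}^{n} (X_{ij} - \bar{X}_i)^2$$

and $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from $X$. Define

$$\mathrm{vec}\,S = (S_{11}, S_{22}, \ldots, S_{pp})', \qquad \mathrm{vec}\,\Sigma = (\sigma_{11}, \sigma_{22}, \ldots, \sigma_{pp})'$$

and

$$Y = \sqrt{n - 1}\,(\mathrm{vec}\,S - \mathrm{vec}\,\Sigma).$$

Then [6, p. 43], $Y$ is asymptotically $N_p(0, V)$, where all elements of $V$ are given
by

$$\mathrm{cov}(Y_{ij}, Y_{kl}) = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}. \tag{3.2}$$

In this case $V = \mathrm{diag}(2\sigma_{11}^2, 2\sigma_{22}^2, \ldots, 2\sigma_{pp}^2)$. Now (3.1) can be written as

$$P\left[\frac{|Y_i|}{\sqrt{2}\,\sigma_{ii}} < \varepsilon\sqrt{\frac{n - 1}{2}},\ i = 1, 2, \ldots, p\right] \simeq 1 - \alpha. \tag{3.3}$$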

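Since the $Y_i$ are independently normally distributed with mean zero and variance
$2\sigma_{ii}^2$, $i = 1, 2, \ldots, p$, we can write (3.3) as

$$\left\{P\left[|Z| < \varepsilon\sqrt{\frac{n - 1}{2}}\right]\right\}^p \simeq 1 - \alpha$$

or

$$P\left[|Z| < \varepsilon\sqrt{\frac{n - 1}{2}}\right] \simeq (1 - \alpha)^{1/p} \tag{3.4}$$

where $Z \sim N(0, 1)$. Let $\beta = \frac{1}{2}\left[1 - (1 - \alpha)^{1/p}\right]$. Then

$$n \simeq 1 + 2\left(\frac{Z_\beta}{\varepsilon}\right)^2. \tag{3.5}$$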
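
The computation in (3.5) can be sketched in a few lines of Python. The function name `diagonal_sample_size` is hypothetical, and `statistics.NormalDist` from the standard library is assumed as the source of the normal quantile $Z_\beta$; neither comes from the paper.

```python
from statistics import NormalDist


def diagonal_sample_size(p: int, eps: float, alpha: float) -> float:
    """Approximate n from (3.5) when the covariance matrix is diagonal."""
    if p < 1 or eps <= 0 or not 0 < alpha < 1:
        raise ValueError("need p >= 1, eps > 0 and 0 < alpha < 1")
    # beta = (1/2) * [1 - (1 - alpha)**(1/p)]
    beta = 0.5 * (1 - (1 - alpha) ** (1 / p))
    # Z_beta is the upper beta point of N(0, 1): P[Z > Z_beta] = beta.
    z_beta = -NormalDist().inv_cdf(beta)
    return 1 + 2 * (z_beta / eps) ** 2


# Growth of n with the dimension p for eps = 0.10 and alpha = 0.05.
for p in (2, 5, 10, 20, 40, 60):
    print(p, round(diagonal_sample_size(p, 0.10, 0.05)))
```

The loop prints unrounded-up values, so individual entries may differ by a unit or so from a table prepared with a different rounding convention.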

Table 2
Sample size for the multivariate independent case

                   $\varepsilon$
$\alpha$   $p$      0.05    0.06    0.07    0.08    0.09    0.10    0.11    0.12    0.13    0.14

0.01         2      6301    4376    3216    2462    1946    1576    1303    1095     933     805
             5      7635    5303    3896    2983    2358    1910    1579    1327    1131     975
            10      8657    6012    4417    3382    2673    2165    1790    1504    1282    1105
            20      9687    6728    4943    3785    2991    2423    2003    1683    1434    1237
            40     10724    7448    5472    4190    3311    2682    2217    1863    1588    1369
            60     11059    7680    5643    4321    3414    2766    2286    1921    1637    1412

0.05         2      4003    2780    2043    1565    1237    1002     828     696     593     512
             5      5280    3667    2695    2064    1631    1321    1092     918     782     675
            10      6272    4356    3201    2451    1937    1569    1297    1090     929     801
            20      7278    5055    3714    2844    2247    1821    1505    1265    1078     930
            40      8297    5762    4234    3242    2562    2075    1715    1442    1229    1060
            60      8897    6179    4540    3476    2747    2225    1839    1546    1317    1136

0.10         2      3040    2111    1552    1188     939     761     629     529     451     389
             5      4273    2968    2181    1670    1320    1069     884     743     633     546
            10      5243    3641    2675    2049    1619    1312    1084     911     777     670
            20      6233    4329    3181    2436    1925    1559    1284    1083     923     796
            40      7239    5028    3694    2829    2235    1811    1497    1258    1072     925
            60      7834    5441    3998    3061    2419    1960    1620    1361    1160    1001

It should be noticed that the sample size formula given by (3.5) is independent of
$\Sigma$. When $p = 1$, (3.5) reduces to (2.3) with a different value of $\varepsilon$. One can regard
this as an application of the Bonferroni method to several independent variables.
Table 2 gives the values of $n$ for some selected values of $\alpha$, $\varepsilon$ and $p$. One may
notice that there is a sharp increase in $n$ as $p$ increases.
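
The reduction for $p = 1$ can be written out in one line. The interpretation given after the display is an added gloss, not a statement from the paper: (3.1) bounds the relative error of the variance estimate, while (2.1) bounds that of the standard deviation.

```latex
p = 1:\qquad
\beta = \tfrac{1}{2}\bigl[1 - (1 - \alpha)\bigr] = \frac{\alpha}{2},
\qquad
n \simeq 1 + 2\left(\frac{Z_{\alpha/2}}{\varepsilon}\right)^{2}
  = 1 + \frac{1}{2}\left(\frac{Z_{\alpha/2}}{\varepsilon/2}\right)^{2}.
```

This is (2.3) with $\varepsilon$ replaced by $\varepsilon/2$, which agrees with the first-order fact that a relative error of $\varepsilon$ in $S^2$ corresponds to a relative error of about $\varepsilon/2$ in $S$.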

Case 2. Suppose none of the $\sigma_{ij}$ are zero, i.e. all variables are correlated. Then we
define $\mathrm{vec}\,S$ and $\mathrm{vec}\,\Sigma$ as follows:

$$\mathrm{vec}\,S = (S_{11}, S_{12}, \ldots, S_{1p}, S_{22}, \ldots, S_{2p}, \ldots, S_{ii}, S_{i,i+1}, \ldots, S_{ip}, \ldots, S_{pp})',$$
$$\mathrm{vec}\,\Sigma = (\sigma_{11}, \sigma_{12}, \ldots, \sigma_{1p}, \sigma_{22}, \ldots, \sigma_{2p}, \ldots, \sigma_{ii}, \sigma_{i,i+1}, \ldots, \sigma_{ip}, \ldots, \sigma_{pp})'.$$

As before, let

$$Y = \sqrt{n - 1}\,(\mathrm{vec}\,S - \mathrm{vec}\,\Sigma) = (Y_1, Y_2, \ldots, Y_{p(p+1)/2})'.$$

By [6, p. 43], $Y$ is asymptotically $N_{p(p+1)/2}(0, V)$, where elements of $V$ are given
by (3.2). $V$ thus formed is a positive definite symmetric matrix.
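
To make the construction of $V$ concrete, here is a small sketch that fills in $V$ element by element from (3.2), using the row-wise ordering of vec S given above. The function name `asymptotic_cov` and the nested-list representation of the matrices are assumptions of this illustration.

```python
from itertools import combinations_with_replacement


def asymptotic_cov(sigma):
    """Build V from (3.2) for vec S = (S_11, S_12, ..., S_1p, S_22, ..., S_pp)'."""
    p = len(sigma)
    # Index pairs (i, j) with i <= j, in the row-wise order used for vec S.
    pairs = list(combinations_with_replacement(range(p), 2))
    return [
        [sigma[i][k] * sigma[j][l] + sigma[i][l] * sigma[j][k] for (k, l) in pairs]
        for (i, j) in pairs
    ]


# Bivariate matrix with rows (4, 5) and (5, 9), as used in the Examples.
for row in asymptotic_cov([[4, 5], [5, 9]]):
    print(row)
```

For a $p \times p$ matrix the result has $p(p+1)/2$ rows and columns, which is why the dimension of $V$ grows quickly with $p$.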
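We want to find $n$ such that

$$P\left[\left|\frac{S_{ij}}{\sigma_{ij}} - 1\right| < \varepsilon,\ i = 1, 2, \ldots, p,\ j = 1, 2, \ldots, p\right] = 1 - \alpha,$$

or

$$P[-\mathbf{e} < Y < \mathbf{e}] \simeq 1 - \alpha \tag{3.6}$$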

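where

$$\mathbf{e} = \varepsilon\sqrt{n - 1}\,(|\sigma_{11}|, |\sigma_{12}|, \ldots, |\sigma_{pp}|)'.$$

Rewriting in integral form, we have

$$\int_{-\mathbf{e} < y < \mathbf{e}} \frac{|V|^{-1/2}}{(2\pi)^{p(p+1)/4}}\, e^{-y'V^{-1}y/2}\, dy \simeq 1 - \alpha. \tag{3.7}$$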

If a prior estimate of $\Sigma$ is available, the evaluation of the integral in (3.7) can be
achieved by an algorithm recently given by Russell, Farrier and Howell [7].
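
For readers without access to that algorithm, the probability on the left of (3.7) can also be approximated by plain Monte Carlo simulation. The sketch below is a swapped-in technique, not the Fourier series method of Russell et al. [7]; the function names `cholesky` and `box_probability`, the number of draws and the seed are all assumptions of this illustration.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive definite a."""
    m = len(a)
    low = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def box_probability(v, sigma_vec, eps, n, draws=20000, seed=0):
    """Monte Carlo estimate of P[-e < Y < e] in (3.6) for Y ~ N(0, v)."""
    rng = random.Random(seed)
    low = cholesky(v)
    m = len(v)
    # Half-widths e = eps * sqrt(n - 1) * |sigma|, one per element of vec Sigma.
    e = [eps * math.sqrt(n - 1) * abs(s) for s in sigma_vec]
    hits = 0
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # y = L z has covariance matrix v.
        y = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
        hits += all(abs(y[i]) < e[i] for i in range(m))
    return hits / draws


# V and vec Sigma for the bivariate matrix with rows (4, 5) and (5, 9).
v = [[32, 40, 50], [40, 61, 90], [50, 90, 162]]
print(round(box_probability(v, [4, 5, 9], eps=0.10, n=1053), 3))
```

To obtain a sample size, one would search over $n$ (for example by bisection) until the estimate reaches $1 - \alpha$. The simulation noise, roughly $\sqrt{\alpha(1 - \alpha)/\text{draws}}$, limits how precisely $n$ can be located this way.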

Remark. In case some of the $\sigma_{ij}$'s are zero, we will remove those $\sigma_{ij}$'s from $\mathrm{vec}\,\Sigma$
and the corresponding $S_{ij}$'s from $\mathrm{vec}\,S$ and carry out the calculation as before.

Since (3.7) depends on $\Sigma$, a table for the $n$ values cannot be prepared. The
situation here is quite similar to the sample size determination in estimating the
proportion of a binomial population. The quality of the prior estimate and the dimension
of $\Sigma$ have a profound effect on the sample size. The effect of the dimension of $\Sigma$ can
be seen from the fact that the dimension of $V$ increases sharply, resulting in a sharp
increase in the sample size. The effect of the quality of the estimate of $\Sigma$ can be
seen from the following examples.

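Examples. Let us suppose a prior estimate of the variance-covariance matrix $\Sigma$
of a bivariate normal distribution is given as

$$\begin{pmatrix} 4 & 5 \\ 5 & 9 \end{pmatrix}.$$

Then

$$V = \begin{pmatrix} 32 & 40 & 50 \\ 40 & 61 & 90 \\ 50 & 90 & 162 \end{pmatrix}, \qquad \mathbf{e} = \varepsilon\sqrt{n - 1}\,(4, 5, 9)'.$$

Equation (3.7) can be written as

$$\int_{-9\varepsilon\sqrt{n-1}}^{9\varepsilon\sqrt{n-1}} \int_{-5\varepsilon\sqrt{n-1}}^{5\varepsilon\sqrt{n-1}} \int_{-4\varepsilon\sqrt{n-1}}^{4\varepsilon\sqrt{n-1}} \frac{|V|^{-1/2}}{(2\pi)^{3/2}}\, e^{-y'V^{-1}y/2}\, dy_1\, dy_2\, dy_3 = 1 - \alpha \tag{3.8}$$

or

$$\int_{-h_3}^{h_3} \int_{-h_2}^{h_2} \int_{-h_1}^{h_1} \frac{|R|^{-1/2}}{(2\pi)^{3/2}}\, e^{-y'R^{-1}y/2}\, dy_1\, dy_2\, dy_3 = 1 - \alpha \tag{3.9}$$

where

$$R = \begin{pmatrix} 1 & 0.905357 & 0.694444 \\ 0.905357 & 1 & 0.905357 \\ 0.694444 & 0.905357 & 1 \end{pmatrix},$$

$$h_1 = \varepsilon\sqrt{\frac{n - 1}{2}}, \qquad h_2 = 5\varepsilon\sqrt{\frac{n - 1}{61}}, \qquad h_3 = \varepsilon\sqrt{\frac{n - 1}{2}}.$$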

Now by using the algorithm given in Russell et al. [7], we find that for
$(\varepsilon, \alpha) = (0.05, 0.05)$, $n \simeq 4209$,
$(\varepsilon, \alpha) = (0.05, 0.10)$, $n \simeq 3107$,
$(\varepsilon, \alpha) = (0.10, 0.05)$, $n \simeq 1053$,
$(\varepsilon, \alpha) = (0.10, 0.10)$, $n \simeq 780$.
The correlation between the variables plays a very important role, as can be seen
from (3.9). If one were to take any other prior estimate of $\Sigma$ with the same
correlation as in the above prior, then the resulting $R$ matrix and the integration
limits in (3.9) will be the same. In essence, the sample size remains unchanged
for all prior estimates of $\Sigma$ with the same correlation.
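
This invariance can be checked numerically. The sketch below standardizes $V$ to the correlation matrix $R$ and the limits $h$ of (3.9), first for the prior with rows (4, 5) and (5, 9), then for a hypothetical rescaled prior with rows (16, 10) and (10, 9), which has the same correlation 5/6. The function name `standardize` and the rescaled prior are assumptions of this illustration; both $V$ matrices were filled in from (3.2).

```python
import math


def standardize(v, sigma_vec, eps, n):
    """Return the correlation matrix R of V and the limits h used in (3.9)."""
    m = len(v)
    sd = [math.sqrt(v[k][k]) for k in range(m)]
    r = [[v[a][b] / (sd[a] * sd[b]) for b in range(m)] for a in range(m)]
    # h_k = eps * sqrt(n - 1) * |sigma_k| / sqrt(V_kk)
    h = [eps * math.sqrt(n - 1) * abs(s) / d for s, d in zip(sigma_vec, sd)]
    return r, h


# Prior with rows (4, 5) and (5, 9); vec Sigma = (4, 5, 9)'.
v1 = [[32, 40, 50], [40, 61, 90], [50, 90, 162]]
r1, h1 = standardize(v1, [4, 5, 9], eps=0.10, n=1053)

# Hypothetical rescaled prior with rows (16, 10) and (10, 9); correlation 5/6.
v2 = [[512, 320, 200], [320, 244, 180], [200, 180, 162]]
r2, h2 = standardize(v2, [16, 10, 9], eps=0.10, n=1053)

same_r = all(abs(r1[a][b] - r2[a][b]) < 1e-12 for a in range(3) for b in range(3))
same_h = all(abs(x - y) < 1e-12 for x, y in zip(h1, h2))
print(same_r, same_h)
```

Since both $R$ and $h$ coincide, the integral in (3.9), and hence the required $n$, is the same for the two priors.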
In the above example the correlation between the two variables is 5/6. Suppose
the investigator decided that this correlation is too high when in fact it should be
very low and takes a prior estimate of $\Sigma$ to be

$$\begin{pmatrix} 4 & 1 \\ 1 & 9 \end{pmatrix}.$$

Then

$$V = \begin{pmatrix} 32 & 8 & 2 \\ 8 & 37 & 18 \\ 2 & 18 & 162 \end{pmatrix}, \qquad R = \begin{pmatrix} 1 & 0.232495 & 0.027777 \\ 0.232495 & 1 & 0.232495 \\ 0.027777 & 0.232495 & 1 \end{pmatrix}$$

resulting in
$n \simeq 14223$ for $(\varepsilon, \alpha) = (0.10, 0.05)$
and
$n \simeq 10020$ for $(\varepsilon, \alpha) = (0.10, 0.10)$,
a rather sharp increase in the sample size. One can observe that this increase is
due to the decrease in the correlation between the variables. In order to confirm
this observation further, let us take another prior estimate of $\Sigma$,

$$\begin{pmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{pmatrix},$$

which results in the correlation matrix

$$R = \begin{pmatrix} 1 & 0.63245 & 0.25 \\ 0.63245 & 1 & 0.63245 \\ 0.25 & 0.63245 & 1 \end{pmatrix}$$

and
$n \simeq 1946$ for $(\varepsilon, \alpha) = (0.10, 0.05)$,
$n \simeq 1400$ for $(\varepsilon, \alpha) = (0.10, 0.10)$.
From the examples given above it is clear that as the correlation between the
variables increases, the sample size decreases.

Now let us take a case where two variables have small correlation and the third
is independent of the first two. Suppose a prior estimate of $\Sigma$ is

$$\begin{pmatrix} 4 & 1 & 0 \\ 1 & 9 & 0 \\ 0 & 0 & 25 \end{pmatrix}.$$

In this case we obtain the following results:
$n \simeq 14300$ for $(\varepsilon, \alpha) = (0.10, 0.05)$,
$n \simeq 10055$ for $(\varepsilon, \alpha) = (0.10, 0.10)$.
One can note that there is a very small increase in the sample size when an
independent variable is added to the list of variables. If all three variables were correlated,
the $V$ matrix would have been 6 × 6, which would result in a very large sample
size. Therefore, one can safely assume that the sample size obtained under a
wrong assumption of independence of variables will be too low.

Acknowledgement
The authors are thankful to the referee for some useful suggestions which
improved the manuscript considerably, and also to Mrs. Judith Leonard for
assistance in numerical work.

References
[1] F.A. Graybill, Determining sample size for a specified width confidence interval, Ann. Math.
Statist. 29 (1958) 282-287.
[2] F.A. Graybill and T.L. Connell, Sample size required for estimating the variance within d units
of the true value, Ann. Math. Statist. 35 (1964) 438-440.
[3] F.A. Graybill and R.D. Morrison, Sample size for a specified width confidence interval on the
variance of a normal distribution, Biometrics 16 (1960) 636-641.
[4] J.A. Greenwood and M.M. Sandomire, Sample size required for estimating the standard
deviation as a percent of its true value, J. Amer. Statist. Assoc. 45 (1950) 257-260.
[5] F.C. Leone, Y.M. Rutenberg and C.W. Topp, The use of sample quasi-ranges in setting
confidence intervals for the population standard deviation, J. Amer. Statist. Assoc. 56 (1961)
260-272.
[6] R.J. Muirhead, Aspects of Multivariate Statistical Theory (John Wiley & Sons, New York,
1982).
[7] N.S. Russell, D.R. Farrier and J. Howell, Evaluation of multinormal probabilities using
Fourier series expansions, Appl. Statist. 34 (1) (1985) 49-53.
[8] G.W. Snedecor and W.G. Cochran, Statistical Methods (Iowa State University Press, Ames,
IA, 1967).
[9] R.F. Tate and G.W. Klett, Optimal confidence intervals for the variance of a normal
distribution, J. Amer. Statist. Assoc. 54 (1959) 674-682.
[10] W.A. Thompson and J. Endriss, The required sample size when estimating variances, Amer.
Statist. 15(3) (1961) 22-23.