
North-Holland

in estimating a covariance matrix

Pushpa L. GUPTA *

Department of Mathematics, University of Maine, Orono, ME 04469, USA

R.D. GUPTA **

Division of Mathematics, Engineering, Computer Science, University of New Brunswick,

Saint John, N.B., Canada E2L 4L5

Revised 17 January 1987

Abstract: The sample size requirements, for estimating a covariance matrix with a desired precision

in a multivariate normal population, are investigated. Explicit formulas for the sample size are

provided in the univariate case and in the multivariate case when the covariance matrix is diagonal.

In these cases tables are also provided for specific values of $e$ and the joint confidence coefficient

$1 - \alpha$. For the general case, a method to compute the sample size is developed resulting in an

integral equation involving the covariance matrix. In case a prior estimate of the covariance matrix

is available, the integral equation can be solved by using the algorithm given by Russell et al.

(1985). Examples are used to illustrate the effects of the dimension and of the quality of prior estimates of

the covariance matrix on the sample size.

1. Introduction

This paper deals with determining the sample size for estimating a covariance

matrix in a multivariate normal population with a specified joint confidence level and

precision. The problem originated when the first author was involved in a project

at the USAF School of Aerospace Medicine (USAFSAM). The USAFSAM at

Brooks AFB has been interested for several years in the use of statistical methods

to develop a computerized system to assist the cardiologists, who must examine a

large number of EKG's in a single day, in the screening, diagnosis and serial

comparison of vectorcardiograms. Past efforts at USAFSAM in the diagnosis of

vectorcardiograms have relied on a Karhunen-Loève approximation of the signal

** Supported by NSERC Research Grant #A-4850.

186 P.L. Gupta, R.D. Gupta / Estimating a covariance matrix

(750 dimensional in 3-lead system) together with linear and quadratic discrimination

in the transformed space which is 60 dimensional. The crux of this approach

is, therefore, the estimate of the 60 × 60 covariance matrix of the Karhunen-Loève

coefficients. Its quality can, therefore, be a source of concern for the efficacy of

the entire procedure. The quality or accuracy of the covariance matrix estimate is

a function of the sample size and the unknown entries. It was suggested that a

sample of 750 is sufficient to estimate a 60 × 60 covariance matrix with reasonable

accuracy. This figure is apparently not based on any theoretical considerations

and seems to be low as is evident by the sample size requirement for the

sixty dimensional independent case (see Table 2).

The problem of estimating the variance ($\sigma^2$) of a normal density arises in

many experimental situations. As an example (Greenwood and Sandomire [4]), a

series of radar pulses is to be sent out to a target and the strength of the return

signal measured. How many readings under identical conditions shall be taken so

that the standard deviation of the return signal strengths shall, with 80% confidence,

be within 10% of the true value?

Greenwood and Sandomire [4] presented a graphical approach for obtaining

the sample size required to estimate variance of a normal density within a given

percent of its true value. Graybill and Connell [2], instead, have given a two-step

sampling procedure to estimate the variance within a given number of units. The

number of units and the confidence level are specified in advance. Thompson and

Endriss [10] have also given a method for estimating the sample size in the

univariate case. Their method depends on the large-sample distribution of the

estimator. Other work, dealing with estimating variance, includes Graybill and

Morrison [3], Leone, Rutenberg and Topp [5], Tate and Klett [9] and Graybill [1].

For the sake of completeness, in Section 2, a brief discussion is given of how to find

the sample size $n$ for the univariate case for a given $e$ (the relative error) and a

given $\alpha$ (where $1 - \alpha$ is the confidence coefficient).

In Section 3, we develop the procedures for determining the sample size in the

multivariate situation where two cases are studied. In case 1, the covariance

matrix $\Sigma$ is taken to be diagonal while in case 2 it is any general matrix. Table 2

is prepared for case 1 when $p$ = 2, 5, 10, 20, 40, 60. For case 2 tables cannot

be prepared as the result is in the form of an integral equation involving $\Sigma$.

However, if a prior estimate of $\Sigma$ is available, one can use the algorithm given by

Russell et al. [7] to solve the integral equation. The quality of prior estimate has

an intimate effect on the sample size which is illustrated by some examples.

Throughout the paper $p$ denotes the dimension, $e$ the relative error and $1 - \alpha$

the joint confidence coefficient when $p \ge 2$.

2. Univariate case

$$S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1).$$


It is well known that large samples are necessary if $\sigma$ is to be estimated

accurately.

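The problem is to find the sample size $n$ such that

$$1 - \alpha = P\left[\frac{|S^2 - \sigma^2|}{\sigma^2} < e\right]$$

$$= P\left[\sqrt{2(n-1)(1-e)} < \sqrt{\frac{2(n-1)S^2}{\sigma^2}} < \sqrt{2(n-1)(1+e)}\right]$$

$$= P\left[\sqrt{2(n-1)(1-e)} - \sqrt{2(n-1)-1} < Z < \sqrt{2(n-1)(1+e)} - \sqrt{2(n-1)-1}\right], \quad (2.2)$$

where

$$Z = \sqrt{\frac{2(n-1)S^2}{\sigma^2}} - \sqrt{2(n-1)-1}$$

is approximately distributed as $N(0, 1)$.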

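Since $n$ is large, equation (2.2) can be written as

$$1 - \alpha = \Phi\left[e\sqrt{(n-1)/2}\right] - \Phi\left[-e\sqrt{(n-1)/2}\right]$$

or

$$e\sqrt{(n-1)/2} = z_{\alpha/2}$$

or

$$n = 1 + 2\,(z_{\alpha/2}/e)^2, \quad (2.3)$$

where $\Phi$ denotes the standard normal distribution function and $z_{\alpha/2}$ its upper $\alpha/2$ point. The first step uses $\sqrt{1 \pm e} \approx 1 \pm e/2$, so that for large $n$ the limits in (2.2) are approximately $\pm\sqrt{2(n-1)}\,e/2 = \pm e\sqrt{(n-1)/2}$.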

Table 1 gives such values of $n$ for some selected values of $\alpha$ and $e$.
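As a rough numerical check of the large-sample formula (2.3), the following sketch evaluates $n$ and then plugs it back into the probability statement (2.2). It assumes that the formula reads $n = 1 + 2(z_{\alpha/2}/e)^2$; the function names are illustrative and are not taken from the paper, and the results are left unrounded, so they may differ from the tabulated values by a few units.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def univariate_sample_size(e, alpha):
    """Unrounded n from (2.3), assumed to be n = 1 + 2 * (z_{alpha/2} / e)**2."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point
    return 1.0 + 2.0 * (z / e) ** 2


def coverage_from_2_2(n, e):
    """Probability in (2.2), using the normal approximation for Z."""
    centre = sqrt(2.0 * (n - 1) - 1.0)
    upper = sqrt(2.0 * (n - 1) * (1.0 + e)) - centre
    lower = sqrt(2.0 * (n - 1) * (1.0 - e)) - centre
    return STD_NORMAL.cdf(upper) - STD_NORMAL.cdf(lower)


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.10):
        n = ceil(univariate_sample_size(0.10, alpha))
        # The coverage should be close to 1 - alpha if (2.3) is adequate.
        print(alpha, n, round(coverage_from_2_2(n, 0.10), 4))
```

The second function does not use the simplification behind (2.3), so the printed coverage indicates how much is lost by replacing the limits in (2.2) with their large-$n$ approximations.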

3. Multivariate case

Let us also assume that $X = (X_1, X_2, \ldots, X_p)' \sim N_p(\mu, \Sigma)$.


Table 1

Sample size for the univariate case

α       n (columns correspond to increasing values of e)
0.01    33180   8295   3688   2075   1329    933    679    411    333
0.05    19210   4804   2136   1202    770    535    394    239    194
0.10    13530   3384   1505    847    543    377    278    169    137

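Case 1. Suppose $\Sigma = \mathrm{diag}(\sigma_{11}, \sigma_{22}, \ldots, \sigma_{pp})$. Then the problem is to find $n$ such that

$$P\left[\frac{|\sigma_{ii} - S_{ii}|}{\sigma_{ii}} < e,\ i = 1, 2, \ldots, p\right] = 1 - \alpha, \quad (3.1)$$

where $S = (S_{ij})$ denotes the sample covariance matrix. Let

$$\mathrm{vec}\,S = (S_{11}, S_{22}, \ldots, S_{pp})', \qquad \mathrm{vec}\,\Sigma = (\sigma_{11}, \sigma_{22}, \ldots, \sigma_{pp})'$$

and

$$Y = \sqrt{n-1}\,(\mathrm{vec}\,S - \mathrm{vec}\,\Sigma).$$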

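Then [6, p. 43], $Y$ is asymptotically $N_p(0, V)$, where all elements of $V$ are given

by

$$\mathrm{cov}(Y_{ij}, Y_{kl}) = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}. \quad (3.2)$$

In this case $V = \mathrm{diag}(2\sigma_{11}^2, 2\sigma_{22}^2, \ldots, 2\sigma_{pp}^2)$. Now (3.1) can be written as

$$P\left[\frac{|Y_i|}{\sqrt{2}\,\sigma_{ii}} < e\sqrt{\frac{n-1}{2}},\ i = 1, 2, \ldots, p\right] = 1 - \alpha. \quad (3.3)$$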

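Since $Y_i$ are independently normally distributed with mean zero and variance

$2\sigma_{ii}^2$, $i = 1, 2, \ldots, p$, we can write (3.3) as

$$\left\{\Phi\left[e\sqrt{(n-1)/2}\right] - \Phi\left[-e\sqrt{(n-1)/2}\right]\right\}^p = 1 - \alpha$$

or

$$\Phi\left[e\sqrt{(n-1)/2}\right] = \frac{1 + (1-\alpha)^{1/p}}{2}, \quad (3.4)$$

so that

$$n = 1 + 2\,(z_\beta/e)^2, \quad (3.5)$$

where $z_\beta$ is the upper $\beta$ point of $N(0, 1)$ with $\beta = [1 - (1-\alpha)^{1/p}]/2$.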


Table 2

Sample size for the multivariate independent case

                                        e
α      p     0.05   0.06   0.07   0.08   0.09   0.10   0.11   0.12   0.13   0.14

0.01   2     6301   4376   3216   2462   1946   1576   1303   1095    933    805
       5     7635   5303   3896   2983   2358   1910   1579   1327   1131    975
       10    8657   6012   4417   3382   2673   2165   1790   1504   1282   1105
       20    9687   6728   4943   3785   2991   2423   2003   1683   1434   1237
       40   10724   7448   5472   4190   3311   2682   2217   1863   1588   1369
       60   11059   7680   5643   4321   3414   2766   2286   1921   1637   1412

0.05   2     4003   2780   2043   1565   1237   1002    828    696    593    512
       5     5280   3667   2695   2064   1631   1321   1092    918    782    675
       10    6272   4356   3201   2451   1937   1569   1297   1090    929    801
       20    7278   5055   3714   2844   2247   1821   1505   1265   1078    930
       40    8297   5762   4234   3242   2562   2075   1715   1442   1229   1060
       60    8897   6179   4540   3476   2747   2225   1839   1546   1317   1136

0.10   2     3040   2111   1552   1188    939    761    629    529    451    389
       5     4273   2968   2181   1670   1320   1069    884    743    633    546
       10    5243   3641   2675   2049   1619   1312   1084    911    777    670
       20    6233   4329   3181   2436   1925   1559   1284   1083    923    796
       40    7239   5028   3694   2829   2235   1811   1497   1258   1072    925
       60    7834   5441   3998   3061   2419   1960   1620   1361   1160   1001

It should be noticed that the sample size formula given by (3.5) is independent of

$\Sigma$. When $p = 1$, (3.5) reduces to (2.3) with a different value of $e$. One can regard

this as an application of the Bonferroni method to several independent variables.

Table 2 gives the values of $n$ for some selected values of $\alpha$, $e$ and $p$. One may

notice that there is a sharp increase in n as p increases.
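The growth of $n$ with $p$ in the independent case can be reproduced with a few lines of code. The sketch below assumes that (3.5) is evaluated at the upper $\beta$ point of $N(0, 1)$ with $\beta = [1 - (1-\alpha)^{1/p}]/2$, which follows from requiring each of the $p$ independent two-sided probabilities to equal $(1-\alpha)^{1/p}$; the function name is illustrative, and the values are left unrounded.

```python
from statistics import NormalDist


def independent_case_sample_size(p, e, alpha):
    """Unrounded n from (3.5) for a diagonal covariance matrix.

    Assumes each coordinate must be covered with probability
    (1 - alpha) ** (1 / p), so that the joint coverage is 1 - alpha.
    """
    per_coordinate = (1.0 - alpha) ** (1.0 / p)
    beta = (1.0 - per_coordinate) / 2.0       # one-sided tail area
    z_beta = -NormalDist().inv_cdf(beta)      # upper beta point of N(0, 1)
    return 1.0 + 2.0 * (z_beta / e) ** 2


if __name__ == "__main__":
    for p in (1, 2, 5, 10, 20, 40):
        print(p, round(independent_case_sample_size(p, 0.10, 0.05), 1))
```

With `p = 1` the function returns the univariate value from (2.3), and for the other dimensions the output can be compared with the corresponding column of Table 2.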

Case 2. Suppose none of the $\sigma_{ij}$ are zero, i.e. all variables are correlated. Then we

define vec $S$ and vec $\Sigma$ as follows:

$$\mathrm{vec}\,S = (S_{11}, S_{12}, \ldots, S_{1p}, S_{22}, \ldots, S_{2p}, \ldots, S_{ii}, S_{i,i+1}, \ldots, S_{ip}, \ldots, S_{pp})',$$

$$\mathrm{vec}\,\Sigma = (\sigma_{11}, \sigma_{12}, \ldots, \sigma_{1p}, \sigma_{22}, \ldots, \sigma_{2p}, \ldots, \sigma_{ii}, \sigma_{i,i+1}, \ldots, \sigma_{ip}, \ldots, \sigma_{pp})'.$$

As before, let

$$Y = \sqrt{n-1}\,(\mathrm{vec}\,S - \mathrm{vec}\,\Sigma) = (Y_1, Y_2, \ldots, Y_{p(p+1)/2})'.$$

By [6, p. 43], $Y$ is asymptotically $N_{p(p+1)/2}(0, V)$, where elements of $V$ are given

by (3.2). $V$ thus formed is a positive definite symmetric matrix.

We want to find $n$ such that

$$P[-\mathbf{e} < Y < \mathbf{e}] \approx 1 - \alpha, \quad (3.6)$$

where

$$\mathbf{e} = e\sqrt{n-1}\,(|\sigma_{11}|, |\sigma_{12}|, \ldots, |\sigma_{pp}|)'.$$

Rewriting in integral form, we have

$$\int_{-\mathbf{e} < y < \mathbf{e}} \frac{|V|^{-1/2}}{(2\pi)^{p(p+1)/4}} \exp(-y'V^{-1}y/2)\,dy \approx 1 - \alpha. \quad (3.7)$$

Solving (3.7) for $n$ can be achieved by an algorithm recently given by Russell, Farrier and Howell [7].

Remark. In case some of the $\sigma_{ij}$'s are zero, we will remove those $\sigma_{ij}$'s from vec $\Sigma$

and the corresponding $S_{ij}$'s from vec $S$ and carry out the calculation as before.

Since (3.7) depends on $\Sigma$, a table for the $n$ values cannot be prepared. The

situation here is quite similar to the sample size determination in estimating the

proportion of a binomial population. The quality of the prior estimate and the dimension

of $\Sigma$ have a profound effect on the sample size. The effect of the dimension of $\Sigma$ can

be seen by the fact that the dimension of $V$ increases sharply, resulting in a sharp

increase in the sample size. The effect of the quality of the estimate of $\Sigma$ can be

seen by the following examples.

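Suppose a prior estimate of the covariance matrix $\Sigma$ of a bivariate normal distribution is given as

$$\begin{pmatrix} 4 & 5 \\ 5 & 9 \end{pmatrix}.$$

Then

$$V = \begin{pmatrix} 32 & 40 & 50 \\ 40 & 61 & 90 \\ 50 & 90 & 162 \end{pmatrix}, \qquad \mathbf{e} = e\sqrt{n-1}\,(4, 5, 9)'.$$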
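The entries of $V$ follow mechanically from (3.2) once the index pairs of vec $S$ are listed in row-wise order. The sketch below builds $V$ for an arbitrary $p \times p$ prior estimate; the function names are illustrative and the zero-based indexing is a convenience of the code, not of the paper.

```python
def index_pairs(p):
    """Pairs (i, j) with i <= j in the row-wise order used for vec S."""
    return [(i, j) for i in range(p) for j in range(i, p)]


def asymptotic_covariance(sigma):
    """Matrix V whose entries are given by (3.2):
    cov(Y_ij, Y_kl) = sigma_ik * sigma_jl + sigma_il * sigma_jk.
    """
    pairs = index_pairs(len(sigma))
    return [
        [sigma[i][k] * sigma[j][l] + sigma[i][l] * sigma[j][k]
         for (k, l) in pairs]
        for (i, j) in pairs
    ]


if __name__ == "__main__":
    for row in asymptotic_covariance([[4, 5], [5, 9]]):
        print(row)
```

For a $p \times p$ matrix with no zero entries the result has $p(p+1)/2$ rows, so a trivariate prior already gives a 6 × 6 matrix $V$.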

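Equation (3.7) can be written as

$$\int_{-9e\sqrt{n-1}}^{9e\sqrt{n-1}} \int_{-5e\sqrt{n-1}}^{5e\sqrt{n-1}} \int_{-4e\sqrt{n-1}}^{4e\sqrt{n-1}} \frac{|V|^{-1/2}}{(2\pi)^{3/2}} \exp(-y'V^{-1}y/2)\,dy_1\,dy_2\,dy_3 = 1 - \alpha \quad (3.8)$$

or, after dividing each $y_i$ by its standard deviation,

$$\int_{-a_3}^{a_3} \int_{-a_2}^{a_2} \int_{-a_1}^{a_1} \frac{|R|^{-1/2}}{(2\pi)^{3/2}} \exp(-y'R^{-1}y/2)\,dy_1\,dy_2\,dy_3 = 1 - \alpha \quad (3.9)$$

where

$$R = \begin{pmatrix} 1 & 0.905357 & 0.694444 \\ 0.905357 & 1 & 0.905357 \\ 0.694444 & 0.905357 & 1 \end{pmatrix}$$

is the correlation matrix corresponding to $V$ and

$$a_1 = a_3 = e\sqrt{\frac{n-1}{2}}, \qquad a_2 = 5e\sqrt{\frac{n-1}{61}}.$$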


Now by using the algorithm given in Russell et al. [7], we find that for

$(e, \alpha) = (0.05, 0.05)$, $n \approx 4209$,

$(e, \alpha) = (0.05, 0.10)$, $n \approx 3107$,

$(e, \alpha) = (0.10, 0.05)$, $n \approx 1053$,

$(e, \alpha) = (0.10, 0.10)$, $n \approx 780$.
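The reported sample sizes can be checked approximately without the Fourier series algorithm of Russell et al. [7]. The sketch below swaps in a plain Monte Carlo estimate of the rectangle probability in (3.6) for the prior estimate with entries 4, 5, 9; it is not the method used in the paper, the function names and the number of draws are illustrative assumptions, and the estimate carries simulation error of a few units in the third decimal place.

```python
import random
from math import sqrt


def cholesky(a):
    """Lower-triangular factor L with L L' = a, for a positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def box_probability(v, half_widths, draws=50_000, seed=0):
    """Monte Carlo estimate of P[-h < Y < h] for Y ~ N(0, v)."""
    rng = random.Random(seed)  # fixed seed, so repeated calls reuse the draws
    low = cholesky(v)
    dim = len(v)
    hits = 0
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        inside = True
        for i in range(dim):
            y_i = sum(low[i][k] * z[k] for k in range(i + 1))
            if abs(y_i) >= half_widths[i]:
                inside = False
                break
        hits += inside
    return hits / draws


# V and the entries of vec(Sigma) for the prior estimate (4 5; 5 9).
V = [[32, 40, 50], [40, 61, 90], [50, 90, 162]]
SIGMA_VEC = (4, 5, 9)


def coverage(n, e):
    """Estimated joint confidence coefficient attained with sample size n."""
    scale = e * sqrt(n - 1)
    return box_probability(V, [scale * abs(s) for s in SIGMA_VEC])


if __name__ == "__main__":
    print(coverage(1053, 0.10))  # to be compared with 1 - alpha = 0.95
    print(coverage(780, 0.10))   # to be compared with 1 - alpha = 0.90
```

Because the same seed is reused, the estimated coverage is nondecreasing in $n$, so a simple bisection over $n$ could be wrapped around `coverage` to search for the sample size that attains a target $1 - \alpha$.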

The correlation between the variables plays a very important role as can be seen

from (3.9). If one were to take any other prior estimate of $\Sigma$ with the same

correlation as in the above prior, then the resulting $R$ matrix and the integration

limits in (3.9) will be the same. In essence, the sample size remains unchanged

corresponding to all prior estimates of $\Sigma$ with the same correlation.
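This invariance can be verified numerically for the bivariate case. The sketch below writes out the entries of $V$ from (3.2) for $p = 2$, standardizes them, and returns the correlation matrix together with the integration limits divided by $e\sqrt{n-1}$; the function name and the second prior estimate (entries 100, 25, 9, which also has correlation 5/6) are illustrative choices, not taken from the paper.

```python
from math import sqrt


def standardized_problem(s11, s12, s22):
    """Return (R, c) for a bivariate prior estimate.

    R is the correlation matrix of V, and c holds the integration
    limits of the standardized integral divided by e * sqrt(n - 1).
    """
    # Entries of V from (3.2) for vec(Sigma) = (s11, s12, s22)'.
    v = [
        [2 * s11 * s11, 2 * s11 * s12, 2 * s12 * s12],
        [2 * s11 * s12, s11 * s22 + s12 * s12, 2 * s12 * s22],
        [2 * s12 * s12, 2 * s12 * s22, 2 * s22 * s22],
    ]
    sd = [sqrt(v[i][i]) for i in range(3)]
    r = [[v[i][j] / (sd[i] * sd[j]) for j in range(3)] for i in range(3)]
    c = [abs(s) / d for s, d in zip((s11, s12, s22), sd)]
    return r, c


if __name__ == "__main__":
    for prior in ((4, 5, 9), (100, 25, 9)):  # both have correlation 5/6
        r, c = standardized_problem(*prior)
        print(prior, [round(x, 6) for x in r[0]], [round(x, 6) for x in c])
```

Both priors give the same first row of $R$ and the same scaled limits, which is why only the correlation, and not the individual variances, enters the sample size.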

In the above example the correlation between the two variables is 5/6. Suppose

the investigator decided that this correlation is too high when in fact it should be

very low and takes a prior estimate of $\Sigma$ to be

$$\begin{pmatrix} 4 & 1 \\ 1 & 9 \end{pmatrix}.$$

Then

$$V = \begin{pmatrix} 32 & 8 & 2 \\ 8 & 37 & 18 \\ 2 & 18 & 162 \end{pmatrix}, \qquad R = \begin{pmatrix} 1 & 0.232495 & 0.027777 \\ 0.232495 & 1 & 0.232495 \\ 0.027777 & 0.232495 & 1 \end{pmatrix}$$

resulting in

$n \approx 14223$ for $(e, \alpha) = (0.10, 0.05)$

and

$n \approx 10020$ for $(e, \alpha) = (0.10, 0.10)$,

a rather sharp increase in the sample size. One can observe that this increase is

due to the decrease in the correlation between the variables. In order to confirm

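this observation further, let us take another prior estimate of $\Sigma$,

$$\begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.$$

Then

$$R = \begin{pmatrix} 1 & 0.63245 & 0.25 \\ 0.63245 & 1 & 0.63245 \\ 0.25 & 0.63245 & 1 \end{pmatrix}$$

and

$n \approx 1946$ for $(e, \alpha) = (0.10, 0.05)$,

$n \approx 1400$ for $(e, \alpha) = (0.10, 0.10)$.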

From the examples given above it is clear that as the correlation between the

variables increases, the sample size decreases.
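One way to see why a small correlation inflates the sample size is through a simple lower bound, which is not part of the paper: the joint probability in (3.6) cannot exceed the marginal probability for the covariance component alone, whose asymptotic variance is $\sigma_{11}\sigma_{22} + \sigma_{12}^2$ by (3.2). Requiring that marginal probability to be at least $1 - \alpha$ gives $n \ge 1 + (z_{\alpha/2}/e)^2 (1 + \rho^2)/\rho^2$, where $\rho$ is the correlation between the two variables. The sketch below evaluates this bound; the function name is illustrative.

```python
from statistics import NormalDist


def covariance_term_lower_bound(rho, e, alpha):
    """Lower bound on n implied by the covariance component alone.

    Uses n >= 1 + (z_{alpha/2} / e)**2 * (1 + rho**2) / rho**2,
    a bound derived here for illustration, not a formula from the paper.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 1.0 + (z / e) ** 2 * (1.0 + rho * rho) / (rho * rho)


if __name__ == "__main__":
    # Correlations of the three bivariate priors used in the examples.
    for rho in (5 / 6, 1 / 2, 1 / 6):
        print(round(rho, 4), round(covariance_term_lower_bound(rho, 0.10, 0.05), 1))
```

For the three priors above, with $(e, \alpha) = (0.10, 0.05)$, the bound stays below the reported values 1053, 1946 and 14223 and moves in the same direction: it grows roughly like $1/\rho^2$ as the correlation decreases.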


Now let us take a case where two variables have small correlation and the third

is independent of the first two. Suppose a prior estimate of $\Sigma$ is

$$\begin{pmatrix} 4 & 1 & 0 \\ 1 & 9 & 0 \\ 0 & 0 & 25 \end{pmatrix}.$$

In this case we obtain the following results:

$n \approx 14300$ for $(e, \alpha) = (0.10, 0.05)$,

$n \approx 10055$ for $(e, \alpha) = (0.10, 0.10)$.

One can note that there is a very small increase in the sample size by adding an

independent variable to the list of variables. If all three variables were correlated,

the $V$ matrix would have been 6 × 6 which would result in a very large sample

size. Therefore, one can safely assume that the sample size obtained, under a

wrong assumption of independence of variables, will be too low.

Acknowledgement

The authors are thankful to the referee for some useful suggestions which

improved the manuscript considerably, and also to Mrs. Judith Leonard for

assistance in numerical work.

References

[1] F.A. Graybill, Determining sample size for a specified width confidence interval, Ann. Math.

Statist. 29 (1958) 282-287.

[2] F.A. Graybill and T.L. Connell, Sample size required for estimating the variance within d units

of the true value, Ann. Math. Statist. 35 (1964) 438-440.

[3] F.A. Graybill and R.D. Morrison, Sample size for a specified width confidence interval on the

variance of a normal distribution, Biometrics 16 (1960) 636-641.

[4] J.A. Greenwood and M.M. Sandomire, Sample size required for estimating the standard

deviation as a percent of its true value, J. Amer. Statist. Assoc. 45 (1950) 257-260.

[5] F.C. Leone, Y.M. Rutenberg and C.W. Topp, The use of sample quasi-ranges in setting

confidence intervals for the population standard deviation, J. Amer. Statist. Assoc. 56 (1961)

260-272.

[6] R.J. Muirhead, Aspects of Multivariate Statistical Theory (John Wiley & Sons, New York,

1982).

[7] N.S. Russell, D.R. Farrier and J. Howell, Evaluation of multinormal probabilities using

Fourier series expansions, Appl. Statist. 34 (1) (1985) 49-53.

[8] G.W. Snedecor and W.G. Cochran, Statistical Methods (Iowa State University Press, Ames,

IA, 1967).

[9] R.F. Tate and G.W. Klett, Optimal confidence intervals for the variance of a normal

distribution, J. Amer. Statist. Assoc. 54 (1959) 674-682.

[10] W.A. Thompson and J. Endriss, The required sample size when estimating variances, Amer.

Statist. 15(3) (1961) 22-23.
