APPLIED MULTIVARIATE STATISTICAL ANALYSIS

SIXTH EDITION

RICHARD A. JOHNSON
University of Wisconsin-Madison

DEAN W. WICHERN
Texas A&M University
Johnson, Richard A.
Statistical analysis / Richard A. Johnson, Dean W. Wichern. -- 6th ed.
p. cm.
Includes index.
ISBN 0-13-187715-1
1. Statistical Analysis
CIP Data Available
Executive Acquisitions Editor: Petra Recter
Vice President and Editorial Director, Mathematics: Christine Haag
Project Manager: Michael Bell
Production Editor: Debbie Ryan
Senior Managing Editor: Linda Mihatov Behrens
Manufacturing Buyer: Maura Zaldivar
Associate Director of Operations: Alexis Heydt-Long
Marketing Manager: Wayne Parkins
Marketing Assistant: Jennifer de Leeuwerk
Editorial Assistant/Print Supplements Editor: Joanne Wendelken
Art Director: Jayne Conte
Director of Creative Services: Paul Belfanti
Cover Designer: Bruce Kenselaar
Art Studio: Laserwords
© 2007 Pearson Education, Inc.
Pearson Prentice Hall
Pearson Education, Inc.
Upper Saddle River, NJ 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher. Pearson Prentice Hall is a trademark of Pearson Education, Inc. Printed in the United States of America.
10 9 8 7 6 5 4 3 2
Contents
PREFACE

1 ASPECTS OF MULTIVARIATE ANALYSIS

2 MATRIX ALGEBRA AND RANDOM VECTORS
   Positive Definite Matrices
   A Square-Root Matrix
   Random Vectors and Matrices
   Mean Vectors and Covariance Matrices
      Partitioning the Covariance Matrix
      The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables
      Partitioning the Sample Mean Vector and Covariance Matrix

3 SAMPLE GEOMETRY AND RANDOM SAMPLING
   Introduction
   The Geometry of the Sample
   Random Samples and the Expected Values of the Sample Mean and Covariance Matrix
   Generalized Variance
      Situations in which the Generalized Sample Variance Is Zero
      Generalized Variance Determined by |R| and Its Geometrical Interpretation
      Another Generalization of Variance
   Sample Mean, Covariance, and Correlation as Matrix Operations
   Sample Values of Linear Combinations of Variables
   Exercises
   References

4 THE MULTIVARIATE NORMAL DISTRIBUTION
   Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation
      The Multivariate Normal Likelihood
      Maximum Likelihood Estimation of μ and Σ
      Sufficient Statistics

5 INFERENCES ABOUT A MEAN VECTOR
   Introduction
   The Plausibility of μ0 as a Value for a Normal Population Mean
   Hotelling's T² and Likelihood Ratio Tests
      General Likelihood Ratio Method
   Large Sample Inferences about a Population Mean Vector
   Multivariate Quality Control Charts
      Charts for Monitoring a Sample of Individual Multivariate Observations for Stability
      Control Regions for Future Individual Observations
      Control Ellipse for Future Observations
      T² Chart for Future Observations
      Control Charts Based on Subsample Means
      Control Regions for Future Subsample Observations
   Inferences about Mean Vectors when Some Observations Are Missing
   Difficulties Due to Time Dependence in Multivariate Observations
   Supplement 5A: Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids
   Exercises
   References

6 COMPARISONS OF SEVERAL MULTIVARIATE MEANS
   A Summary of Univariate ANOVA
   Multivariate Analysis of Variance (MANOVA)
   Simultaneous Confidence Intervals for Treatment Effects
   Testing for Equality of Covariance Matrices
   Two-Way Multivariate Analysis of Variance
      Univariate Two-Way Fixed-Effects Model with Interaction
      Multivariate Two-Way Fixed-Effects Model with Interaction
   Profile Analysis
   Repeated Measures Designs and Growth Curves
   Perspectives and a Strategy for Analyzing Multivariate Models
   Exercises
   References

7 MULTIVARIATE LINEAR REGRESSION MODELS
   Introduction
   The Classical Linear Regression Model
   Least Squares Estimation
      Sum-of-Squares Decomposition
      Geometry of Least Squares
      Sampling Properties of Classical Least Squares Estimators
   Multiple Regression Models with Time Dependent Errors
   Supplement 7A: The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model
   Exercises
   References

8 PRINCIPAL COMPONENTS
   Principal Components Obtained from Standardized Variables
   Principal Components for Covariance Matrices with Special Structures
   Supplement 8A: The Geometry of the Sample Principal Component Approximation
      The p-Dimensional Geometrical Interpretation
      The n-Dimensional Geometrical Interpretation

9 FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES
   The Principal Component (and Principal Factor) Method
   A Modified Approach: The Principal Factor Solution
   The Maximum Likelihood Method
   A Large Sample Test for the Number of Common Factors
   Oblique Rotations
   The Weighted Least Squares Method
   The Regression Method
   Perspectives and a Strategy for Factor Analysis
   Supplement 9A: Some Computational Details for Maximum Likelihood Estimation
      Recommended Computational Scheme
      Maximum Likelihood Estimators of ρ = LzLz′ + ψz

10 CANONICAL CORRELATION ANALYSIS
   Introduction
   Canonical Variates and Canonical Correlations
   Interpreting the Population Canonical Variables
      Identifying the Canonical Variables
      Canonical Correlations as Generalizations of Other Correlation Coefficients
      The First r Canonical Variables as a Summary of Variability
      A Geometrical Interpretation of the Population Canonical Correlation Analysis
   The Sample Canonical Variates and Sample Canonical Correlations
   Additional Sample Descriptive Measures
      Matrices of Errors of Approximations
      Proportions of Explained Sample Variance

11 DISCRIMINATION AND CLASSIFICATION
   Introduction
   Separation and Classification for Two Populations
   Classification with Two Multivariate Normal Populations
      Classification of Normal Populations When Σ1 = Σ2 = Σ
      Scaling
      Fisher's Approach to Classification with Two Populations
      Is Classification a Good Idea?
      Classification of Normal Populations When Σ1 ≠ Σ2
      Testing for Group Differences
      Graphics
      Practical Considerations Regarding Multivariate Normality

12 CLUSTERING, DISTANCE METHODS, AND ORDINATION
   Distances and Similarity Coefficients for Pairs of Items
   Similarities and Association Measures for Pairs of Variables
   Concluding Comments on Similarity
   Single Linkage
   Complete Linkage
   Average Linkage
   Ward's Hierarchical Clustering Method
   Final Comments: Hierarchical Procedures
Preface
INTENDED AUDIENCE
This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Sixth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables.

Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.
LEVEL
Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques.

The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis. The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience.

In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice a consistency of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.
The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, "Sample Geometry," and Chapter 4, "The Multivariate Normal Distribution."

Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.

The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two-quarter) course are indicated schematically. Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.
Getting Started: Chapters 1-4
For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1; Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6; and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis, and clustering. The discussions could feature the many "worked out" examples included in these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.

We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of multivariate analysis of variance (MANOVA), regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.
CHANGES TO THE SIXTH EDITION
New material. Users of the previous editions will notice several major changes in the sixth edition.
- Twelve new data sets, including national track records for men and women, psychological profile scores, car body assembly measurements, cell phone tower breakdowns, pulp and paper properties measurements, Mali family farm data, stock price rates of return, and Concho water snake data.
- Thirty-seven new exercises and twenty revised exercises, with many of these exercises based on the new data sets.
- Four new data-based examples and fifteen revised examples.
- Six new or expanded sections:
  1. Section 6.6 Testing for Equality of Covariance Matrices
  2. Section 11.7 Logistic Regression and Classification
  3. Section 12.5 Clustering Based on Statistical Models
  4. Expanded Section 6.3 to include "An Approximation to the Distribution of T² for Normal Populations When Sample Sizes Are Not Large"
  5. Expanded Sections 7.6 and 7.7 to include Akaike's Information Criterion
  6. Consolidated previous Sections 11.3 and 11.5 on two-group discriminant analysis into single Section 11.3
Web Site. To make the methods of multivariate analysis more prominent in the text, we have removed the long proofs of Results 7.2, 7.4, 7.10 and 10.1 and placed them on a web site accessible through www.prenhall.com/statistics. Click on "Multivariate Statistics" and then click on our book. In addition, all full data sets saved as ASCII files that are used in the book are available on the web site.

Instructor's Solutions Manual. An Instructor's Solutions Manual is available on the author's website accessible through www.prenhall.com/statistics. For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall web site at www.prenhall.com.
""iii
Preface
ACKNOWLEDGMENTS
We thank many of our colleagues who helped improve the applied aspect of the book by contributing their own data sets for examples and exercises. A number of individuals helped guide various revisions of this book, and we are grateful for their suggestions: Christopher Bingham, University of Minnesota; Steve Coad, University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George Mason University; Hira Koul, Michigan State University; Bruce McCullough, Drexel University; Shyamal Peddada, University of Virginia; K. Sivakumar, University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman, University of Illinois at Urbana-Champaign. We also acknowledge the feedback of the students we have taught these past 35 years in our applied multivariate analysis courses. Their comments and suggestions are largely responsible for the present iteration of this work.

We would also like to give special thanks to Wai Kwong Cheang, Shanhong Guan, Jialiang Li and Zhiguo Xiao for their help with the calculations for many of the examples. We must thank Dianne Hall for her valuable help with the Solutions Manual, Steve Verrill for computing assistance throughout, and Alison Pollack for implementing a Chernoff faces program. We are indebted to Cliff Gilman for his assistance with the multidimensional scaling examples discussed in Chapter 12. Jacquelyn Forer did most of the typing of the original draft manuscript, and we appreciate her expertise and willingness to endure the cajoling of authors faced with publication deadlines. Finally, we would like to thank Petra Recter, Debbie Ryan, Michael Bell, Linda Behrens, Joanne Wendelken and the rest of the Prentice Hall staff for their help with this project.
R. A. Johnson
rich@stat.wisc.edu

D. W. Wichern
dwichern@tamu.edu
Chapter 1
Aspects of Multivariate Analysis
generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [6] and [7] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.

It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or commonsense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.

Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that is both widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. In Section 1.2, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation of the applicability of multivariate techniques across different fields.
The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:

1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to
maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:
If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.
Data reduction or simplification

- Using data on several variables related to cancer patient responses to radiotherapy, a simple measure of patient response to radiotherapy was constructed. (See Exercise 1.15.)
- Track records from many nations were used to develop an index of performance for both male and female athletes. (See [8] and [22].)
- Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions. (See [23].)
- Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants. (See [13].)
- A matrix of tactic similarities was developed from aggregate data derived from professional mediators. From this matrix the number of dimensions by which professional mediators judge the tactics they use in resolving disputes was determined. (See [21].)
Sorting and grouping
- Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
- Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics. (See [26].)
- Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.)
- The U.S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [31].)
Investigation of the dependence among variables
- Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [12].)
- Measurements of variables related to innovation, on the one hand, and variables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [3].)
- Measurements of pulp fiber characteristics and subsequent measurements of characteristics of the paper made from them are used to examine the relations between pulp fiber properties and the resulting paper properties. The goal is to determine those fibers that lead to higher quality paper. (See [17].)
- The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)
Prediction
- The associations between test scores, and several high school performance variables, and several college performance variables were used to develop predictors of success in college. (See [10].)
- Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [7] and [20].)
- Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers. (See [28].)
- cDNA microarray experiments (gene expression data) are increasingly used to study the molecular variations among cancer tumors. A reliable classification of tumors is essential for successful diagnosis and treatment of cancer. (See [9].)
Hypotheses testing
- Several pollution-related variables were measured to determine whether levels for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and weekends. (See Exercise 1.6.)
- Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [27].)
- Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing sociological theories. (See [16] and [25].)
- Data on several variables were used to determine whether different types of firms in newly industrialized countries exhibited different patterns of innovation. (See [15].)
The preceding descriptions offer glimpses into the use of multivariate methods in widely diverse fields.
Arrays
Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number p ≥ 1 of variables or characters to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit. We will use the notation x_{jk} to indicate the particular value of the kth variable that is observed on the jth item, or trial. That is,

x_{jk} = measurement of the kth variable on the jth item
Consequently, n measurements on p variables can be displayed as follows:

          Variable 1   Variable 2   ...   Variable k   ...   Variable p
Item 1:     x_{11}       x_{12}     ...     x_{1k}     ...     x_{1p}
Item 2:     x_{21}       x_{22}     ...     x_{2k}     ...     x_{2p}
   ...
Item j:     x_{j1}       x_{j2}     ...     x_{jk}     ...     x_{jp}
   ...
Item n:     x_{n1}       x_{n2}     ...     x_{nk}     ...     x_{np}

Or we can display these data as a rectangular array, called X, of n rows and p columns:

\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}

The array X, then, contains the data consisting of all of the observations on all of the variables.
Example 1.1 (A data array) A selection of four receipts from a university bookstore was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be number of books sold. Then we can regard the corresponding numbers on the receipts as four measurements on two variables. Suppose the data, in tabular form, are

Variable 1 (dollar sales):     42  52  48  58
Variable 2 (number of books):   4   5   4   3

Using the notation just introduced, we have

x_{11} = 42, x_{21} = 52, x_{31} = 48, x_{41} = 58
x_{12} = 4,  x_{22} = 5,  x_{32} = 4,  x_{42} = 3

and the data array X is

\mathbf{X} = \begin{bmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{bmatrix}

with four rows and two columns.
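As a quick illustration (ours, not the text's), an array like the one in Example 1.1 can be entered in Python with plain nested lists; the variable names are our own:

```python
# Bookstore data from Example 1.1: n = 4 receipts (items), p = 2 variables.
# Row j holds (x_j1, x_j2) = (dollar sales, number of books sold).
X = [
    [42, 4],
    [52, 5],
    [48, 4],
    [58, 3],
]

n = len(X)     # number of items (rows)
p = len(X[0])  # number of variables (columns)
print(n, p)    # 4 2
```

Row-major storage mirrors the convention above: the first index selects the item, the second the variable.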
Considering data in the form of arrays facilitates the exposition of the subject matter and allows numerical calculations to be performed in an orderly and efficient manner. The efficiency is twofold, as gains are attained in both (1) describing numerical calculations as operations on arrays and (2) the implementation of the calculations on computers, which now use many languages and statistical packages to perform array operations. We consider the manipulation of arrays of numbers in Chapter 2. At this point, we are concerned only with their value as devices for displaying data.
Descriptive Statistics
A large data set is bulky, and its very mass poses a serious obstacle to any attempt to visually extract pertinent information. Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic that provides a measure of location, that is, a "central value" for a set of numbers. And the average of the squares of the distances of all of the numbers from the mean provides a measure of the spread, or variation, in the numbers. We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. The formal definitions of these quantities follow.

Let x_{11}, x_{21}, ..., x_{n1} be n measurements on the first variable. Then the arithmetic average of these measurements is

\bar{x}_1 = \frac{1}{n} \sum_{j=1}^{n} x_{j1}

If the n measurements represent a subset of the full set of measurements that might have been observed, then \bar{x}_1 is also called the sample mean for the first variable. We adopt this terminology because the bulk of this book is devoted to procedures designed to analyze samples of measurements from larger collections. The sample mean can be computed from the n measurements on each of the p variables, so that, in general, there will be p sample means:
\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}, \qquad k = 1, 2, \ldots, p \tag{1-1}
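A minimal sketch (ours) of equation (1-1) applied to the bookstore data of Example 1.1; the variable names are our own:

```python
X = [[42, 4], [52, 5], [48, 4], [58, 3]]
n, p = len(X), len(X[0])

# Sample mean of variable k, equation (1-1): xbar_k = (1/n) * sum_j x_jk
xbar = [sum(row[k] for row in X) / n for k in range(p)]
print(xbar)  # [50.0, 4.0]
```

So the average sale is $50 and the average number of books per sale is 4.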
A measure of spread is provided by the sample variance, defined for n measurements on the first variable as

s_1^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2

where \bar{x}_1 is the sample mean of the x_{j1}'s. In general, for p variables, we have

s_k^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-2}
Two comments are in order. First, many authors define the sample variance with a divisor of n - 1 rather than n. Later we shall see that there are theoretical reasons for doing this, and it is particularly appropriate if the number of measurements, n, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression. Second, although the s² notation is traditionally used to indicate the sample variance, we shall eventually consider an array of quantities in which the sample variances lie along the main diagonal. In this situation, it is convenient to use double subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation s_{kk} to denote the same variance computed from measurements on the kth variable, and we have the notational identities
s_k^2 = s_{kk} = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-3}
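Continuing our own sketch on the Example 1.1 data, the divisor-n variance of equation (1-3); note that many software packages default to the divisor n - 1 instead:

```python
import math

X = [[42, 4], [52, 5], [48, 4], [58, 3]]
n, p = len(X), len(X[0])
xbar = [sum(row[k] for row in X) / n for k in range(p)]

# Sample variances s_kk with divisor n, equation (1-3).
s = [sum((row[k] - xbar[k]) ** 2 for row in X) / n for k in range(p)]
print(s)  # [34.0, 0.5]

# Sample standard deviations, sqrt(s_kk), in the units of the observations.
std = [math.sqrt(v) for v in s]
```

With the divisor n - 1, the same data would give 136/3 and 2/3 instead.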
The square root of the sample variance, \sqrt{s_{kk}}, is known as the sample standard deviation. This measure of variation uses the same units as the observations.

Consider n pairs of measurements on each of variables 1 and 2:

(x_{11}, x_{12}), (x_{21}, x_{22}), \ldots, (x_{n1}, x_{n2})
That is, x_{j1} and x_{j2} are observed on the jth experimental item (j = 1, 2, ..., n). A measure of linear association between the measurements of variables 1 and 2 is provided by the sample covariance

s_{12} = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)
or the average product of the deviations from their respective means. If large values for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, s_{12} will be positive. If large values from one variable occur with small values for the other variable, s_{12} will be negative. If there is no particular association between the values for the two variables, s_{12} will be approximately zero.

The sample covariance

s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \qquad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \tag{1-4}
measures the association between the ith and kth variables. We note that the covariance reduces to the sample variance when i = k. Moreover, s_{ik} = s_{ki} for all i and k.

The final descriptive statistic considered here is the sample correlation coefficient (or Pearson's product-moment correlation coefficient; see [14]). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the ith and kth variables is defined as

r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \frac{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2} \, \sqrt{\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}} \tag{1-5}

for i = 1, 2, ..., p and k = 1, 2, ..., p. Note r_{ik} = r_{ki} for all i and k.

The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that r_{ik} has the same value whether n or n - 1 is chosen as the common divisor for s_{ii}, s_{kk}, and s_{ik}.

The sample correlation coefficient r_{ik} can also be viewed as a sample covariance. Suppose the original values x_{ji} and x_{jk} are replaced by standardized values (x_{ji} - \bar{x}_i)/\sqrt{s_{ii}} and (x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}. The standardized values are commensurable because both sets are centered at zero and expressed in standard deviation units. The sample correlation coefficient is just the sample covariance of the standardized observations.

Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation r has the following properties:
1. The value of r must be between -1 and +1 inclusive.
2. Here r measures the strength of the linear association. If r = 0, this implies a lack of linear association between the components. Otherwise, the sign of r indicates the direction of the association: r < 0 implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average; and r > 0 implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of r_{ik} remains unchanged if the measurements of the ith variable are changed to y_{ji} = a x_{ji} + b, j = 1, 2, ..., n, and the values of the kth variable are changed to y_{jk} = c x_{jk} + d, j = 1, 2, ..., n, provided that the constants a and c have the same sign.
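Equations (1-4) and (1-5) applied to the bookstore data; this is our own check, not a computation from the text:

```python
import math

X = [[42, 4], [52, 5], [48, 4], [58, 3]]
n = len(X)
xbar = [sum(row[k] for row in X) / n for k in range(2)]

# Sample covariance (1-4) and variances (divisor n throughout).
s12 = sum((row[0] - xbar[0]) * (row[1] - xbar[1]) for row in X) / n
s11 = sum((row[0] - xbar[0]) ** 2 for row in X) / n
s22 = sum((row[1] - xbar[1]) ** 2 for row in X) / n

# Sample correlation coefficient, equation (1-5).
r12 = s12 / (math.sqrt(s11) * math.sqrt(s22))

print(s12)            # -1.5
print(round(r12, 2))  # -0.36
```

The negative sign says larger dollar sales tended to go with fewer books on these four receipts; per property 3, rescaling either variable by a positive constant would leave r12 unchanged.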
The quantities s_ik and r_ik do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists. In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present. Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of s_ik and r_ik should be quoted both with and without these observations.
The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are
    w_kk = Σ_{j=1}^{n} (x_jk − x̄_k)²,    k = 1, 2, ..., p    (1-6)

and

    w_ik = Σ_{j=1}^{n} (x_ji − x̄_i)(x_jk − x̄_k),    i = 1, 2, ..., p,  k = 1, 2, ..., p    (1-7)
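A small Python sketch of (1-6) and (1-7), using the seven illustrative pairs from this section, shows that dividing each entry of the sums-of-squares-and-cross-products array by n recovers the sample variances and covariances:

```python
# seven observations on p = 2 variables
data = [[3, 5], [4, 5.5], [2, 4], [6, 7], [8, 10], [2, 5], [5, 7.5]]
n, p = len(data), len(data[0])

xbar = [sum(row[k] for row in data) / n for k in range(p)]

# w_ik = sum over j of (x_ji - xbar_i)(x_jk - xbar_k), equations (1-6) and (1-7)
W = [[sum((row[i] - xbar[i]) * (row[k] - xbar[k]) for row in data)
      for k in range(p)] for i in range(p)]

# dividing each entry by n recovers the sample covariance array
S = [[W[i][k] / n for k in range(p)] for i in range(p)]
```

Note that W, like S, is symmetric: w_ik = w_ki for all i and k.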
The descriptive statistics computed from n measurements on p variables can also be organized into arrays.
Sample means

    x̄ = [ x̄_1 ]
        [ x̄_2 ]
        [  ⋮  ]
        [ x̄_p ]

Sample variances and covariances

    S_n = [ s_11  s_12  ⋯  s_1p ]
          [ s_21  s_22  ⋯  s_2p ]
          [  ⋮     ⋮          ⋮ ]
          [ s_p1  s_p2  ⋯  s_pp ]

Sample correlations

    R = [ 1     r_12  ⋯  r_1p ]
        [ r_21  1     ⋯  r_2p ]
        [  ⋮     ⋮         ⋮  ]
        [ r_p1  r_p2  ⋯  1    ]

    (1-8)
Chapter 1 Aspects of Multivariate Analysis

The sample mean array is denoted by x̄, the sample variance and covariance array by the capital letter S_n, and the sample correlation array by R. The subscript n on the array S_n is a mnemonic device used to remind you that n is employed as a divisor for the elements s_ik. The size of all of the arrays is determined by the number of variables, p.
The arrays S_n and R consist of p rows and p columns. The array x̄ is a single column with p rows. The first subscript on an entry in arrays S_n and R indicates the row; the second subscript indicates the column. Since s_ik = s_ki and r_ik = r_ki for all i and k, the entries in symmetric positions about the main northwest-southeast diagonals in arrays S_n and R are the same, and the arrays are said to be symmetric.
Example 1.2 (The arrays x̄, S_n, and R for bivariate data) Consider the data introduced in Example 1.1. Each receipt yields a pair of measurements: total dollar sales and number of books sold. Find the arrays x̄, S_n, and R.
Since there are four receipts, we have a total of four measurements (observations) on each variable. The sample means are

    x̄_1 = (1/4) Σ_{j=1}^{4} x_j1 = (1/4)(42 + 52 + 48 + 58) = 50
    x̄_2 = (1/4) Σ_{j=1}^{4} x_j2 = (1/4)(4 + 5 + 4 + 3) = 4

The sample variances and covariances are

    s_11 = (1/4) Σ_{j=1}^{4} (x_j1 − x̄_1)² = (1/4)((42 − 50)² + (52 − 50)² + (48 − 50)² + (58 − 50)²) = 34
    s_22 = (1/4) Σ_{j=1}^{4} (x_j2 − x̄_2)² = (1/4)((4 − 4)² + (5 − 4)² + (4 − 4)² + (3 − 4)²) = .5
    s_12 = (1/4) Σ_{j=1}^{4} (x_j1 − x̄_1)(x_j2 − x̄_2)
         = (1/4)((42 − 50)(4 − 4) + (52 − 50)(5 − 4) + (48 − 50)(4 − 4) + (58 − 50)(3 − 4)) = −1.5
    s_21 = s_12

so

    S_n = [ 34    −1.5 ]
          [ −1.5    .5 ]

The sample correlation is

    r_12 = s_12 / (√s_11 √s_22) = −1.5 / (√34 √.5) = −.36,    r_21 = r_12

so

    R = [ 1     −.36 ]
        [ −.36    1  ]
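The arithmetic can be verified in a few lines of Python. The four receipts are taken here to be (42, 4), (52, 5), (48, 4), (58, 3), values consistent with the summary statistics displayed above:

```python
import math

# the four (dollar sales, books sold) receipts, assumed from Example 1.1
receipts = [(42, 4), (52, 5), (48, 4), (58, 3)]
n = len(receipts)
x1 = [r[0] for r in receipts]
x2 = [r[1] for r in receipts]

xbar = [sum(x1) / n, sum(x2) / n]            # sample mean array

def s(a, b):
    # sample covariance with divisor n
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

Sn = [[s(x1, x1), s(x1, x2)],
      [s(x2, x1), s(x2, x2)]]                # variance-covariance array

r12 = Sn[0][1] / math.sqrt(Sn[0][0] * Sn[1][1])
R = [[1.0, r12], [r12, 1.0]]                 # correlation array

print(xbar)          # [50.0, 4.0]
print(Sn)            # [[34.0, -1.5], [-1.5, 0.5]]
print(round(r12, 2)) # -0.36
```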
Graphical Techniques
Plots are important, but frequently neglected, aids in data analysis. Although it is impossible to simultaneously plot all the measurements made on several variables and study the configurations, plots of individual variables and plots of pairs of variables can still be very informative. Sophisticated computer programs and display equipment allow one the luxury of visually examining data in one, two, or three dimensions with relative ease. On the other hand, many valuable insights can be obtained from the data by constructing plots with paper and pencil. Simple, yet elegant and effective, methods for displaying data are available in [29].
It is good statistical practice to plot pairs of variables and visually inspect the pattern of association. Consider, then, the following seven pairs of measurements on two variables:

    Variable 1 (x_1):   3    4    2    6    8    2    5
    Variable 2 (x_2):   5   5.5   4    7   10    5   7.5
These data are plotted as seven points in two dimensions (each axis representing a variable) in Figure 1.1. The coordinates of the points are determined by the paired measurements: (3, 5), (4, 5.5), ..., (5, 7.5). The resulting two-dimensional plot is known as a scatter diagram or scatter plot.
Also shown in Figure 1.1 are separate plots of the observed values of variable 1 and the observed values of variable 2, respectively. These plots are called (marginal) dot diagrams. They can be obtained from the original observations or by projecting the points in the scatter diagram onto each coordinate axis.
The information contained in the single-variable dot diagrams can be used to calculate the sample means x̄_1 and x̄_2 and the sample variances s_11 and s_22. (See Exercise 1.1.) The scatter diagram indicates the orientation of the points, and their coordinates can be used to calculate the sample covariance s_12. In the scatter diagram of Figure 1.1, large values of x_1 occur with large values of x_2 and small values of x_1 with small values of x_2. Hence, s_12 will be positive.
Dot diagrams and scatter plots contain different kinds of information. The information in the marginal dot diagrams is not sufficient for constructing the scatter plot. As an illustration, suppose the data preceding Figure 1.1 had been paired differently, so that the measurements on the variables x_1 and x_2 were as follows:
    Variable 1 (x_1):   5    4    6    2    2    8    3
    Variable 2 (x_2):   5   5.5   4    7   10    5   7.5
(We have simply rearranged the values of variable 1.) The scatter and dot diagrams for the "new" data are shown in Figure 1.2. Comparing Figures 1.1 and 1.2, we find that the marginal dot diagrams are the same, but that the scatter diagrams are decidedly different. In Figure 1.2, large values of x_1 are paired with small values of x_2 and small values of x_1 with large values of x_2. Consequently, the descriptive statistics for the individual variables x̄_1, x̄_2, s_11, and s_22 remain unchanged, but the sample covariance s_12, which measures the association between pairs of variables, will now be negative.
The different orientations of the data in Figures 1.1 and 1.2 are not discernible from the marginal dot diagrams alone. At the same time, the fact that the marginal dot diagrams are the same in the two cases is not immediately apparent from the scatter plots. The two types of graphical procedures complement one another; they are not competitors.
The next two examples further illustrate the information that can be conveyed by a graphic display.
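A short Python check, using the seven pairs listed above (with the rearranged pairing as reconstructed in the text), confirms that the marginal summaries are identical while the sample covariance changes sign:

```python
x1_original   = [3, 4, 2, 6, 8, 2, 5]
x1_rearranged = [5, 4, 6, 2, 2, 8, 3]   # same values, different pairing
x2            = [5, 5.5, 4, 7, 10, 5, 7.5]

n = len(x2)

def mean(v):
    return sum(v) / n

def s12(a, b):
    # sample covariance with divisor n
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

# the marginal dot diagrams (and hence means and variances) are identical
assert sorted(x1_original) == sorted(x1_rearranged)

print(s12(x1_original, x2) > 0)    # True: original pairing, positive covariance
print(s12(x1_rearranged, x2) < 0)  # True: rearranged pairing, negative covariance
```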
Figure 1.2 Scatter plot and dot diagrams for rearranged data.
Example 1.3 (The effect of unusual observations on sample correlations) Some financial data representing jobs and productivity for the 16 largest publishing firms appeared in an article in Forbes magazine on April 30, 1990. The data for the pair of variables x_1 = employees (jobs) and x_2 = profits per employee (productivity) are graphed in Figure 1.3. We have labeled two "unusual" observations. Dun & Bradstreet is the largest firm in terms of number of employees, but is "typical" in terms of profits per employee. Time Warner has a "typical" number of employees, but comparatively small (negative) profits per employee.
Figure 1.3 Profits per employee and number of employees for 16 publishing firms.
The sample correlation coefficient computed from the values of x_1 and x_2 is

    r_12 = −.39   for all 16 firms
         =  .56   for all firms but Dun & Bradstreet
         = −.39   for all firms but Time Warner
         =  .50   for all firms but Dun & Bradstreet and Time Warner
It is clear that atypical observations can have a considerable effect on the sample correlation coefficient.
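The Forbes figures themselves are not reproduced here, but the phenomenon is easy to demonstrate with a small synthetic data set: one firm with many employees and typical profits can collapse an otherwise strong correlation. The numbers below are hypothetical, chosen only to mimic the pattern in Figure 1.3:

```python
import math

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

# hypothetical (employees, profits) values with a clear positive association
jobs    = [10, 12, 15, 18, 20, 22, 25]
profits = [30, 33, 31, 36, 35, 40, 38]

# add one atypical firm: many employees, typical profits per employee
r_without = corr(jobs, profits)
r_with    = corr(jobs + [90], profits + [34])

print(round(r_without, 2), round(r_with, 2))   # 0.88 0.09
```

A single atypical point, lying far out along one axis, dominates the variance in that coordinate and drags the correlation toward zero.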
Example 1.4 (A scatter plot for baseball data) In a July 17, 1978, article on money in sports, Sports Illustrated magazine provided data on x_1 = player payroll for National League East baseball teams. We have added data on x_2 = won-lost percentage for 1977. The results are given in Table 1.1.
The scatter plot in Figure 1.4 supports the claim that a championship team can be bought. Of course, this cause-effect relationship cannot be substantiated, because the experiment did not include a random assignment of payrolls. Thus, statistics cannot answer the question: Could the Mets have won with $4 million to spend on player salaries?
Table 1.1 1977 Salary and Final Record for the National League East

    Team                     x_1 = player payroll    x_2 = won-lost percentage
    Philadelphia Phillies         3,497,900                  .623
    Pittsburgh Pirates            2,485,475                  .593
    St. Louis Cardinals           1,782,875                  .512
    Chicago Cubs                  1,725,450                  .500
    Montreal Expos                1,645,575                  .463
    New York Mets                 1,469,800                  .395
Figure 1.4 Salaries and won-lost percentage from Table 1.1 (player payroll in millions of dollars).
To construct the scatter plot in Figure 1.4, we have regarded the six paired observations in Table 1.1 as the coordinates of six points in two-dimensional space. The figure allows us to examine visually the grouping of teams with respect to the variables total payroll and won-lost percentage.
Example 1.5 (Multiple scatter plots for paper strength measurements) Paper is manufactured in continuous sheets several feet wide. Because of the orientation of fibers within the paper, it has a different strength when measured in the direction produced by the machine than when measured across, or at right angles to, the machine direction. Table 1.2 shows the measured values of

    x_1 = density (grams/cubic centimeter)
    x_2 = strength (pounds) in the machine direction
    x_3 = strength (pounds) in the cross direction
Table 1.2 Paper-Quality Measurements

    Specimen   Density   Machine direction   Cross direction
    1          .801      121.41              70.42
    2          .824      127.70              72.47
    3          .841      129.20              78.20
    4          .816      131.80              74.89
    5          .840      135.10              71.21
    6          .842      131.50              78.39
    7          .820      126.70              69.02
    8          .802      115.10              73.10
    9          .828      130.80              79.28
    10         .819      124.60              76.48
    11         .826      118.31              70.25
    12         .802      114.20              72.88
    13         .810      120.30              68.23
    14         .802      115.70              68.12
    15         .832      117.51              71.62
    16         .796      109.81              53.10
    17         .759      109.10              50.85
    18         .770      115.10              51.68
    19         .759      118.31              50.60
    20         .772      112.60              53.51
    21         .806      116.20              56.53
    22         .803      118.00              70.70
    23         .845      131.00              74.35
    24         .822      125.70              68.29
    25         .971      126.10              72.10
    26         .816      125.80              70.64
    27         .836      125.50              76.33
    28         .815      127.80              76.75
    29         .822      130.50              80.33
    30         .822      127.90              75.68
    31         .843      123.90              78.54
    32         .824      124.10              71.91
    33         .788      120.80              68.22
    34         .782      107.40              54.42
    35         .795      120.70              70.41
    36         .805      121.91              73.68
    37         .836      122.31              74.93
    38         .788      110.60              53.52
    39         .772      103.51              48.93
    40         .776      110.71              53.67
    41         .758      113.80              52.42

A novel graphic presentation of these data appears in Figure 1.5. The scatter plots are arranged as the off-diagonal elements of a covariance array and box plots as the diagonal elements. The latter are on a different scale with this software, so we use only the overall shape to provide information on symmetry and possible outliers for each individual characteristic. The scatter plots can be inspected for patterns and unusual observations. In Figure 1.5, there is one unusual observation: the density of specimen 25. Some of the scatter plots have patterns suggesting that there are two separate clumps of observations.

Figure 1.5 Scatter plots and boxplots of the paper-quality data from Table 1.2.

These scatter plot arrays are further pursued in our discussion of new software graphics in the next section.
In the general multiresponse situation, p variables are simultaneously recorded on n items. Scatter plots should be made for pairs of important variables and, if the task is not too great to warrant the effort, for all pairs.
Limited as we are to a three-dimensional world, we cannot always picture an entire set of data. However, two further geometric representations of the data provide an important conceptual framework for viewing multivariable statistical methods. In cases where it is possible to capture the essence of the data in three dimensions, these representations can actually be graphed.
n Points in p Dimensions (p-Dimensional Scatter Plot). Consider the natural extension of the scatter plot to p dimensions, where the p measurements

    (x_j1, x_j2, ..., x_jp)

on the jth item represent the coordinates of a point in p-dimensional space. The coordinate axes are taken to correspond to the variables, so that the jth point is x_j1 units along the first axis, x_j2 units along the second, ..., x_jp units along the pth axis. The resulting plot with n points not only will exhibit the overall pattern of variability, but also will show similarities (and differences) among the n items. Groupings of items will manifest themselves in this representation.
The next example illustrates a three-dimensional scatter plot.
Example 1.6 (Looking for lower-dimensional structure) A zoologist obtained measurements on n = 25 lizards known scientifically as Cophosaurus texanus. The weight, or mass, is given in grams while the snout-vent length (SVL) and hind limb span (HLS) are given in millimeters. The data are displayed in Table 1.3.
Although there are three size measurements, we can ask whether or not most of the variation is primarily restricted to two dimensions or even to one dimension. To help answer questions regarding reduced dimensionality, we construct the three-dimensional scatter plot in Figure 1.6. Clearly most of the variation is scatter about a one-dimensional straight line. Knowing the position on a line along the major axes of the cloud of points would be almost as good as knowing the three measurements Mass, SVL, and HLS.
However, this kind of analysis can be misleading if one variable has a much larger variance than the others. Consequently, we first calculate the standardized values, z_jk = (x_jk − x̄_k)/√s_kk, so the variables contribute equally to the variation
Table 1.3 Lizard Size Data

    Lizard   Mass     SVL    HLS       Lizard   Mass     SVL    HLS
    1        5.526    59.0   113.5     14       10.067   73.0   136.5
    2        10.401   75.0   142.0     15       10.091   73.0   135.5
    3        9.213    69.0   124.0     16       10.888   77.0   139.0
    4        8.953    67.5   125.0     17       7.610    61.5   118.0
    5        7.063    62.0   129.5     18       7.733    66.5   133.5
    6        6.610    62.0   123.0     19       12.015   79.5   150.0
    7        11.273   74.0   140.0     20       10.049   74.0   137.0
    8        2.447    47.0   97.0      21       5.149    59.5   116.0
    9        15.493   86.5   162.0     22       9.158    68.0   123.0
    10       9.004    69.0   126.5     23       12.132   75.0   141.0
    11       8.199    70.5   136.0     24       6.978    66.5   117.0
    12       6.601    64.5   116.0     25       6.890    63.0   117.0
    13       7.622    67.5   135.0
in the scatter plot. Figure 1.7 gives the three-dimensional scatter plot for the standardized variables. Most of the variation can be explained by a single variable determined by a line through the cloud of points.

Example 1.7 (Looking for group structure in three dimensions) Referring to Example 1.6, it is interesting to see if male and female lizards occupy different parts of the three-dimensional space containing the size data. The genders, by row, for the lizard data in Table 1.3 are

    f m f f m f m f m f m f m
    m m m f m m m f f m f f
Figure 1.8 repeats the scatter plot for the original variables but with males marked by solid circles and females by open circles. Clearly, males are typically larger than females.
p Points in n Dimensions. The n observations of the p variables can also be regarded as p points in n-dimensional space. Each column of X determines one of the points. The ith column,

    [ x_1i ]
    [ x_2i ]
    [  ⋮   ]
    [ x_ni ]

consisting of all n measurements on the ith variable, determines the ith point. In Chapter 3, we show how the closeness of points in n dimensions can be related to measures of association between the corresponding variables.
Figure 1.9 Scatter plots for the paper-quality data of Table 1.2.
Figure 1.10 Modified scatter plots for the paper-quality data with outlier (25) (a) selected and (b) deleted.
Figure 1.11 Modified scatter plots with (a) group of points selected and (b) points, including specimen 25, deleted and the scatter plots rescaled.
Data Displays and Pictorial Representations

from an older roll of paper that was included in order to have enough plies in the cardboard being manufactured. Deleting the outlier and the cases corresponding to the older paper and adjusting the ranges of the remaining observations leads to the scatter plots in Figure 1.11(b).
The operation of highlighting points corresponding to a selected range of one of the variables is called brushing. Brushing could begin with a rectangle, as in Figure 1.11(a), but then the brush could be moved to provide a sequence of highlighted points. The process can be stopped at any time to provide a snapshot of the current situation.
Scatter plots like those in Example 1.8 are extremely useful aids in data analysis. Another important new graphical technique uses software that allows the data analyst to view high-dimensional data as slices of various three-dimensional perspectives. This can be done dynamically and continuously until informative views are obtained. A comprehensive discussion of dynamic graphical methods is available in [1]. A strategy for on-line multivariate exploratory graphical analysis, motivated by the need for a routine procedure for searching for structure in multivariate data, is given in [32].
Example 1.9 (Rotated plots in three dimensions) Four different measurements of lumber stiffness are given in Table 4.3, page 186. In Example 4.14, specimen (board) 16 and possibly specimen (board) 9 are identified as unusual observations. Figures 1.12(a), (b), and (c) contain perspectives of the stiffness data in the x_1, x_2, x_3 space. These views were obtained by continually rotating and turning the three-dimensional coordinate axes. Spinning the coordinate axes allows one to get a better understanding of the three-dimensional aspects of the data. Figure 1.12(d) gives one picture of the stiffness data in x_2, x_3, x_4 space. Notice that Figures 1.12(a) and (d) visually confirm specimens 9 and 16 as outliers. Specimen 9 is very large in all three coordinates. A counterclockwise-like rotation of the axes in Figure 1.12(a) produces Figure 1.12(b), and the two unusual observations are masked in this view. A further spinning of the x_2, x_3 axes gives Figure 1.12(c); one of the outliers (16) is now hidden.

Figure 1.12 Three-dimensional perspectives of the lumber stiffness data: (a) outliers clear; (b) outliers masked; (c) specimen 9 large; (d) good view of x_2, x_3, x_4 space.

Additional insights can sometimes be gleaned from visual inspection of the slowly spinning data. It is this dynamic aspect that statisticians are just beginning to understand and exploit. Plots like those in Figure 1.12 allow one to identify readily observations that do not conform to the rest of the data and that may heavily influence inferences based on standard data-generating models.
Table 1.4 Female Bear Data

    Bear   Wt2   Wt3   Wt4   Wt5   Lngth2   Lngth3
    1      48    59    95    82    141      157
    2      59    68    102   102   140      168
    3      61    77    93    107   145      162
    4      54    43    104   104   146      159
    5      100   145   185   247   150      158
    6      68    82    95    118   142      140
    7      68    95    109   111   139      171
Figure 1.13 Combined growth curves for weight for seven female grizzly bears.
reads pounds. Further inspection revealed that, in this case, an assistant later failed to convert the field readings to kilograms when creating the electronic database. The correct weights are (45, 66, 84, 112) kilograms.
Because it can be difficult to inspect visually the individual growth curves in a combined plot, the individual curves should be replotted in an array where similarities and differences are easily observed. Figure 1.14 gives the array of seven curves for weight. Some growth curves look linear and others quadratic.
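The pounds-to-kilograms explanation is easy to check: dividing bear 5's recorded readings, as recovered in Table 1.4, by the pounds-per-kilogram factor reproduces the corrected weights quoted above.

```python
LB_PER_KG = 2.2046226218  # pounds per kilogram

# bear 5's recorded weight readings from Table 1.4 (actually in pounds)
field_readings_lb = [100, 145, 185, 247]

corrected_kg = [round(w / LB_PER_KG) for w in field_readings_lb]
print(corrected_kg)  # [45, 66, 84, 112]
```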
Figure 1.14 Individual growth curves for weight for female grizzly bears.
Figure 1.15 gives a growth curve array for length. One bear seemed to get shorter from 2 to 3 years old, but the researcher knows that the steel tape measurement of length can be thrown off by the bear's posture when sedated.
Figure 1.15 Individual growth curves for length for female grizzly bears.
We now turn to two popular pictorial representations of multivariate data in two dimensions: stars and Chernoff faces.
Stars
Suppose each data unit consists of nonnegative observations on p ≥ 2 variables. In two dimensions, we can construct circles of a fixed (reference) radius with p equally spaced rays emanating from the center of the circle. The lengths of the rays represent the values of the variables. The ends of the rays can be connected with straight lines to form a star. Each star represents a multivariate observation, and the stars can be grouped according to their (subjective) similarities.
It is often helpful, when constructing the stars, to standardize the observations. In this case some of the observations will be negative. The observations can then be re-expressed so that the center of the circle represents the smallest standardized observation within the entire data set.
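The standardize-and-shift construction described above can be sketched in a few lines of Python (the data values are hypothetical):

```python
import math

# hypothetical observations: n = 4 units measured on p = 3 variables
data = [[2.0, 10.0, 5.0],
        [4.0, 20.0, 3.0],
        [6.0, 30.0, 7.0],
        [8.0, 40.0, 9.0]]
n, p = len(data), len(data[0])

means = [sum(row[k] for row in data) / n for k in range(p)]
sds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in data) / n)
       for k in range(p)]

# standardize, then shift so the smallest standardized value in the entire
# data set sits at the circle's center (ray length zero)
z = [[(row[k] - means[k]) / sds[k] for k in range(p)] for row in data]
zmin = min(min(row) for row in z)
ray_lengths = [[v - zmin for v in row] for row in z]

# each row of ray_lengths gives one star's p nonnegative ray lengths
assert min(min(row) for row in ray_lengths) == 0.0
```

Each row would then be drawn along p equally spaced angles and the ray ends connected to form a star.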
Example 1.11 (Utility data as stars) Stars representing the first 5 of the 22 public utility firms in Table 12.4, page 688, are shown in Figure 1.16. There are eight variables; consequently, the stars are distorted octagons.
Figure 1.16 Stars for the first five public utilities, including Central Louisiana Electric Co. (3), Commonwealth Edison Co. (4), and Consolidated Edison Co. (NY) (5).
The observations on all variables were standardized. Among the first five utilities, the smallest standardized observation for any variable was −1.6. Treating this value as zero, the variables are plotted on identical scales along eight equiangular rays originating from the center of the circle. The variables are ordered in a clockwise direction, beginning in the 12 o'clock position.
At first glance, none of these utilities appears to be similar to any other. However, because of the way the stars are constructed, each variable gets equal weight in the visual impression. If we concentrate on the variables 6 (sales in kilowatt-hour [kWh] use per year) and 8 (total fuel costs in cents per kWh), then Boston Edison and Consolidated Edison are similar (small variable 6, large variable 8), and Arizona Public Service, Central Louisiana Electric, and Commonwealth Edison are similar (moderate variable 6, moderate variable 8).
Chernoff Faces

People react to faces. Chernoff [4] suggested representing p-dimensional observations as a two-dimensional face whose characteristics (face shape, mouth curvature, nose length, eye size, pupil position, and so forth) are determined by the measurements on the p variables.
As originally designed, Chernoff faces can handle up to 18 variables. The assignment of variables to facial features is done by the experimenter, and different choices produce different results. Some iteration is usually necessary before satisfactory representations are achieved. Chernoff faces appear to be most useful for verifying (1) an initial grouping suggested by subjectmatter knowledge and intuition or (2) final groupings produced by clustering algorithms.
Example 1.12 (Utility data as Chernoff faces) From the data in Table 12.4, the 22 public utility companies were represented as Chernoff faces. We have the following correspondences:

    Variable                                   Facial characteristic
    X1: Fixed-charge coverage                  Half-height of face
    X2: Rate of return on capital              Face width
    X3: Cost per kW capacity in place          Position of center of mouth
    X4: Annual load factor                     Slant of eyes
    X5: Peak kWh demand growth                 Eccentricity (height/width) of eyes
    X6: Sales (kWh use per year)               Half-length of eye
    X7: Percent nuclear                        Curvature of mouth
    X8: Total fuel costs (cents per kWh)       Length of nose
The Chernoff faces are shown in Figure 1.17. We have subjectively grouped "similar" faces into seven clusters. If a smaller number of clusters is desired, we might combine clusters 5, 6, and 7 and, perhaps, clusters 2 and 3 to obtain four or five clusters. For our assignment of variables to facial features, the firms group largely according to geographical location.
Constructing Chernoff faces is a task that must be done with the aid of a computer. The data are ordinarily standardized within the computer program as part of the process for determining the locations, sizes, and orientations of the facial characteristics. With some training, we can use Chernoff faces to communicate similarities or dissimilarities, as the next example indicates.
Example 1.13 (Using Chernoff faces to show changes over time) Figure 1.18 illustrates an additional use of Chernoff faces. (See [24].) In the figure, the faces are used to track the financial well-being of a company over time. As indicated, each facial feature represents a single financial indicator, and the longitudinal changes in these indicators are thus evident at a glance.
Figure 1.17 Chernoff faces for 22 public utilities, grouped into seven clusters.
Chernoff faces have also been used to display differences in multivariate observations in two dimensions. For example, the twodimensional coordinate axes might represent latitude and longitude (geographical location), and the faces might represent multivariate measurements on several U.S. cities. Additional examples of this kind are discussed in [30]. There are several ingenious ways to picture multivariate data in two dimensions. We have described some of them. Further advances are possible and will almost certainly take advantage of improved computer graphics.
1.5 Distance
Although they may at first appear formidable, most multivariate techniques are based upon the simple concept of distance. Straight-line, or Euclidean, distance should be familiar. If we consider the point P = (x_1, x_2) in the plane, the straight-line distance, d(O, P), from P to the origin O = (0, 0) is, according to the Pythagorean theorem,
    d(O, P) = √(x_1² + x_2²)    (1-9)
The situation is illustrated in Figure 1.19. In general, if the point P has p coordinates so that P = (x_1, x_2, ..., x_p), the straight-line distance from P to the origin O = (0, 0, ..., 0) is
    d(O, P) = √(x_1² + x_2² + ⋯ + x_p²)    (1-10)

(See Chapter 2.) All points (x_1, x_2, ..., x_p) that lie a constant squared distance, such as c², from the origin satisfy the equation

    d²(O, P) = x_1² + x_2² + ⋯ + x_p² = c²    (1-11)
Because this is the equation of a hypersphere (a circle if p = 2), points equidistant from the origin lie on a hypersphere.
The straight-line distance between two arbitrary points P and Q with coordinates P = (x_1, x_2, ..., x_p) and Q = (y_1, y_2, ..., y_p) is given by
    d(P, Q) = √((x_1 − y_1)² + (x_2 − y_2)² + ⋯ + (x_p − y_p)²)    (1-12)
Straight-line, or Euclidean, distance is unsatisfactory for most statistical purposes. This is because each coordinate contributes equally to the calculation of Euclidean distance. When the coordinates represent measurements that are subject to random fluctuations of differing magnitudes, it is often desirable to weight coordinates subject to a great deal of variability less heavily than those that are not highly variable. This suggests a different measure of distance.
Our purpose now is to develop a "statistical" distance that accounts for differences in variation and, in due course, the presence of correlation. Because our choice will depend upon the sample variances and covariances, at this point we use the term statistical distance to distinguish it from ordinary Euclidean distance.

Figure 1.19 Distance given by the Pythagorean theorem.

It is statistical distance that is fundamental to multivariate analysis. To begin, we take as fixed the set of observations graphed as the p-dimensional scatter plot. From these, we shall construct a measure of distance from the origin to a point P = (x_1, x_2, ..., x_p). In our arguments, the coordinates (x_1, x_2, ..., x_p) of P can vary to produce different locations for the point. The data that determine distance will, however, remain fixed.
To illustrate, suppose we have n pairs of measurements on two variables each having mean zero. Call the variables x_1 and x_2, and assume that the x_1 measurements vary independently of the x_2 measurements.¹ In addition, assume that the variability in the x_1 measurements is larger than the variability in the x_2 measurements. A scatter plot of the data would look something like the one pictured in Figure 1.20.
Figure 1.20 A scatter plot with greater variability in the x_1 direction than in the x_2 direction.
Glancing at Figure 1.20, we see that values which are a given deviation from the origin in the x_1 direction are not as "surprising" or "unusual" as like values equidistant from the origin in the x_2 direction. This is because the inherent variability in the x_1 direction is greater than the variability in the x_2 direction. Consequently, large x_1 coordinates (in absolute value) are not as unexpected as large x_2 coordinates. It seems reasonable, then, to weight an x_2 coordinate more heavily than an x_1 coordinate of the same value when computing the "distance" to the origin.
One way to proceed is to divide each coordinate by the sample standard deviation. Therefore, upon division by the standard deviations, we have the "standardized" coordinates x_1* = x_1/√s_11 and x_2* = x_2/√s_22. The standardized coordinates are now on an equal footing with one another. After taking the differences in variability into account, we determine distance using the standard Euclidean formula.
Thus, a statistical distance of the point P = (x_1, x_2) from the origin O = (0, 0) can be computed from its standardized coordinates x_1* = x_1/√s_11 and x_2* = x_2/√s_22 as

    d(O, P) = √((x_1*)² + (x_2*)²) = √(x_1²/s_11 + x_2²/s_22)    (1-13)
¹At this point, "independently" means that the x_2 measurements cannot be predicted with any accuracy from the x_1 measurements, and vice versa.
Comparing (1-13) with (1-9), we see that the difference between the two expressions is due to the weights k_1 = 1/s_11 and k_2 = 1/s_22 attached to x_1² and x_2² in (1-13). Note that if the sample variances are the same, k_1 = k_2, then x_1² and x_2² will receive the same weight. In cases where the weights are the same, it is convenient to ignore the common divisor and use the usual Euclidean distance formula. In other words, if the variability in the x_1 direction is the same as the variability in the x_2 direction, and the x_1 values vary independently of the x_2 values, Euclidean distance is appropriate.
Using (1-13), we see that all points which have coordinates (x_1, x_2) and are a constant squared distance c² from the origin must satisfy
    x_1²/s_11 + x_2²/s_22 = c²    (1-14)
Equation (1-14) is the equation of an ellipse centered at the origin whose major and minor axes coincide with the coordinate axes. That is, the statistical distance in (1-13) has an ellipse as the locus of all points a constant distance from the origin. This general case is shown in Figure 1.21.
Figure 1.21 The ellipse of constant statistical distance d²(O, P) = x_1²/s_11 + x_2²/s_22 = c².
Example 1.14 (Calculating a statistical distance) A set of paired measurements (x_1, x_2) on two variables yields x̄_1 = x̄_2 = 0, s_11 = 4, and s_22 = 1. Suppose the x_1 measurements are unrelated to the x_2 measurements; that is, measurements within a pair vary independently of one another. Since the sample variances are unequal, we measure the square of the distance of an arbitrary point P = (x_1, x_2) to the origin O = (0, 0) by

    d²(O, P) = x_1²/4 + x_2²/1
All points (x_1, x_2) that are a constant distance 1 from the origin satisfy the equation

    x_1²/4 + x_2²/1 = 1
The coordinates of some points a unit distance from the origin are presented in the following table:
    Coordinates (x_1, x_2)     Distance: x_1²/4 + x_2²/1 = 1
    (0, 1)                     0²/4 + 1²/1 = 1
    (0, −1)                    0²/4 + (−1)²/1 = 1
A plot of the equation x1²/4 + x2²/1 = 1 is an ellipse centered at (0, 0) whose major axis lies along the x1 coordinate axis and whose minor axis lies along the x2 coordinate axis. The half-lengths of these major and minor axes are √4 = 2 and √1 = 1, respectively. The ellipse of unit distance is plotted in Figure 1.22. All points on the ellipse are regarded as being the same statistical distance from the origin; in this case, a distance of 1.
Figure 1.22 Ellipse of unit distance, x1²/4 + x2²/1 = 1.
The expression in (1-13) can be generalized to accommodate the calculation of statistical distance from an arbitrary point P = (x1, x2) to any fixed point Q = (y1, y2). If we assume that the coordinate variables vary independently of one another, the distance from P to Q is given by

d(P, Q) = √( (x1 − y1)²/s11 + (x2 − y2)²/s22 )    (1-15)
The extension of this statistical distance to more than two dimensions is straightforward. Let the points P and Q have p coordinates such that P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp). Suppose Q is a fixed point [it may be the origin O = (0, 0, ..., 0)] and the coordinate variables vary independently of one another. Let s11, s22, ..., spp be sample variances constructed from n measurements on x1, x2, ..., xp, respectively. Then the statistical distance from P to Q is
d(P, Q) = √( (x1 − y1)²/s11 + (x2 − y2)²/s22 + ⋯ + (xp − yp)²/spp )    (1-16)
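The distance in (1-16) is simple to compute directly. The sketch below (Python; the function name and the illustrative points are ours, not the text's) reproduces the unit-distance calculations of Example 1.14, where s11 = 4 and s22 = 1:

```python
import math

def statistical_distance(p, q, s):
    """Statistical distance (1-16): sqrt of the sum of (x_i - y_i)^2 / s_ii,
    assuming independently varying coordinates with sample variances s_ii."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(p, q, s)))

# Example 1.14 setup: s11 = 4, s22 = 1.  The point (0, 1) is unit
# distance from the origin, and so is (2, 0).
d1 = statistical_distance((0, 1), (0, 0), (4, 1))
d2 = statistical_distance((2, 0), (0, 0), (4, 1))
print(d1, d2)  # both 1.0
```

With equal sample variances the same routine reduces to a rescaled Euclidean distance, which is the point made in the comparison of (1-13) with (1-9).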
All points P that are a constant squared distance from Q lie on a hyperellipsoid centered at Q whose major and minor axes are parallel to the coordinate axes. We note the following:

1. The distance of P to the origin O is obtained by setting y1 = y2 = ⋯ = yp = 0 in (1-16).
2. If s11 = s22 = ⋯ = spp, the Euclidean distance formula in (1-12) is appropriate.

The distance in (1-16) still does not include most of the important cases we shall encounter, because of the assumption of independent coordinates. The scatter plot in Figure 1.23 depicts a two-dimensional situation in which the x1 measurements do not vary independently of the x2 measurements. In fact, the coordinates of the pairs (x1, x2) exhibit a tendency to be large or small together, and the sample correlation coefficient is positive. Moreover, the variability in the x2 direction is larger than the variability in the x1 direction.

What is a meaningful measure of distance when the variability in the x1 direction is different from the variability in the x2 direction and the variables x1 and x2 are correlated? Actually, we can use what we have already introduced, provided that we look at things in the right way. From Figure 1.23, we see that if we rotate the original coordinate system through the angle θ while keeping the scatter fixed and label the rotated axes x̃1 and x̃2, the scatter in terms of the new axes looks very much like that in Figure 1.20. (You may wish to turn the book to place the x̃1 and x̃2 axes in their customary positions.) This suggests that we calculate the sample variances using the x̃1 and x̃2 coordinates and measure distance as in Equation (1-13). That is, with reference to the x̃1 and x̃2 axes, we define the distance from the point P = (x̃1, x̃2) to the origin O = (0, 0) as
d(O, P) = √( x̃1²/s̃11 + x̃2²/s̃22 )    (1-17)

where s̃11 and s̃22 denote the sample variances computed with the x̃1 and x̃2 measurements.
Figure 1.23 A scatter plot for positively correlated measurements and a rotated coordinate system.
The relation between the original coordinates (x1, x2) and the rotated coordinates (x̃1, x̃2) is provided by
x̃1 = x1 cos(θ) + x2 sin(θ)
x̃2 = −x1 sin(θ) + x2 cos(θ)    (1-18)
Given the relations in (1-18), we can formally substitute for x̃1 and x̃2 in (1-17) and express the distance in terms of the original coordinates. After some straightforward algebraic manipulations, the distance from P = (x1, x2) to the origin O = (0, 0) can be written in terms of the original coordinates x1 and x2 of P as
d(O, P) = √( a11x1² + 2a12x1x2 + a22x2² )    (1-19)
where the a's are numbers such that the distance is nonnegative for all possible values of x1 and x2. Here a11, a12, and a22 are determined by the angle θ, and s11, s12, and s22 are calculated from the original data.² The particular forms for a11, a12, and a22 are not important at this point. What is important is the appearance of the cross-product term 2a12x1x2 necessitated by the nonzero correlation r12.

Equation (1-19) can be compared with (1-13). The expression in (1-13) can be regarded as a special case of (1-19) with a11 = 1/s11, a22 = 1/s22, and a12 = 0.

In general, the statistical distance of the point P = (x1, x2) from the fixed point Q = (y1, y2) for situations in which the variables are correlated has the general form
d(P, Q) = √( a11(x1 − y1)² + 2a12(x1 − y1)(x2 − y2) + a22(x2 − y2)² )    (1-20)
and can always be computed once a11, a12, and a22 are known. In addition, the coordinates of all points P = (x1, x2) that are a constant squared distance c² from Q satisfy

a11(x1 − y1)² + 2a12(x1 − y1)(x2 − y2) + a22(x2 − y2)² = c²    (1-21)
By definition, this is the equation of an ellipse centered at Q. The graph of such an equation is displayed in Figure 1.24. The major (long) and minor (short) axes are indicated. They are parallel to the x̃1 and x̃2 axes. For the choice of a11, a12, and a22 in footnote 2, the x̃1 and x̃2 axes are at an angle θ with respect to the x1 and x2 axes.

The generalization of the distance formulas of (1-19) and (1-20) to p dimensions is straightforward. Let P = (x1, x2, ..., xp) be a point whose coordinates represent variables that are correlated and subject to inherent variability.
²Specifically,

a11 = cos²(θ)/[cos²(θ)s11 + 2 sin(θ)cos(θ)s12 + sin²(θ)s22] + sin²(θ)/[cos²(θ)s22 − 2 sin(θ)cos(θ)s12 + sin²(θ)s11]

a22 = sin²(θ)/[cos²(θ)s11 + 2 sin(θ)cos(θ)s12 + sin²(θ)s22] + cos²(θ)/[cos²(θ)s22 − 2 sin(θ)cos(θ)s12 + sin²(θ)s11]

and

a12 = cos(θ)sin(θ)/[cos²(θ)s11 + 2 sin(θ)cos(θ)s12 + sin²(θ)s22] − sin(θ)cos(θ)/[cos²(θ)s22 − 2 sin(θ)cos(θ)s12 + sin²(θ)s11]
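The footnote formulas can be checked numerically. In the sketch below (Python; the sample variances and the test point are illustrative, and the identities s̃11 = cos²(θ)s11 + 2 sin(θ)cos(θ)s12 + sin²(θ)s22 and s̃22 = sin²(θ)s11 − 2 sin(θ)cos(θ)s12 + cos²(θ)s22 for the rotated-axes variances are an assumption consistent with the footnote's denominators), the distance computed from (1-17) agrees with the quadratic form (1-19):

```python
import math

def rotated_distance(x1, x2, s11, s12, s22, theta):
    """d(O, P) via (1-17): rotate the point by (1-18) and use the
    variances along the rotated axes."""
    c, s = math.cos(theta), math.sin(theta)
    s11t = c * c * s11 + 2 * s * c * s12 + s * s * s22   # variance along x~1
    s22t = s * s * s11 - 2 * s * c * s12 + c * c * s22   # variance along x~2
    x1t = c * x1 + s * x2          # (1-18)
    x2t = -s * x1 + c * x2
    return math.sqrt(x1t ** 2 / s11t + x2t ** 2 / s22t)

def quadratic_form_distance(x1, x2, s11, s12, s22, theta):
    """d(O, P) via (1-19) with a11, a12, a22 from footnote 2."""
    c, s = math.cos(theta), math.sin(theta)
    s11t = c * c * s11 + 2 * s * c * s12 + s * s * s22
    s22t = s * s * s11 - 2 * s * c * s12 + c * c * s22
    a11 = c * c / s11t + s * s / s22t
    a22 = s * s / s11t + c * c / s22t
    a12 = c * s / s11t - s * c / s22t
    return math.sqrt(a11 * x1 ** 2 + 2 * a12 * x1 * x2 + a22 * x2 ** 2)

# The two routes agree (Exercise 1.9(e) makes the same point):
d17 = rotated_distance(4.0, -2.0, 3.0, 1.1, 2.0, math.radians(26))
d19 = quadratic_form_distance(4.0, -2.0, 3.0, 1.1, 2.0, math.radians(26))
print(d17, d19)  # identical up to rounding
```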
Figure 1.24 Ellipse of points a constant distance c from the point Q.
Let O = (0, 0, ..., 0) denote the origin, and let Q = (y1, y2, ..., yp) be a specified fixed point. Then the distances from P to O and from P to Q have the general forms
d(O, P) = √( a11x1² + a22x2² + ⋯ + appxp² + 2a12x1x2 + 2a13x1x3 + ⋯ + 2ap−1,p xp−1xp )    (1-22)
and
d(P, Q) = √[ a11(x1 − y1)² + a22(x2 − y2)² + ⋯ + app(xp − yp)² + 2a12(x1 − y1)(x2 − y2) + 2a13(x1 − y1)(x3 − y3) + ⋯ + 2ap−1,p(xp−1 − yp−1)(xp − yp) ]    (1-23)

where the a's are numbers such that the distances are always nonnegative. We note that the distances in (1-22) and (1-23) are completely determined by the coefficients (weights) aik, i = 1, 2, ..., p, k = 1, 2, ..., p. These coefficients can be set out in the rectangular array³
a11  a12  ⋯  a1p
a12  a22  ⋯  a2p
⋮    ⋮        ⋮
a1p  a2p  ⋯  app    (1-24)

where the aik's with i ≠ k are displayed twice, since they are multiplied by 2 in the distance formulas. Consequently, the entries in this array specify the distance functions. The aik's cannot be arbitrary numbers; they must be such that the computed distance is nonnegative for every pair of points. (See Exercise 1.10.)

Contours of constant distances computed from (1-22) and (1-23) are hyperellipsoids. A hyperellipsoid resembles a football when p = 3; it is impossible to visualize in more than three dimensions.

³The algebraic expressions for the squares of the distances in (1-22) and (1-23) are known as quadratic forms and, in particular, positive definite quadratic forms. It is possible to display these quadratic forms in a simpler manner using matrix algebra; we shall do so in Section 2.3 of Chapter 2.
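One numerical way to confirm that a candidate array of coefficients yields nonnegative squared distances is Sylvester's criterion: a symmetric array defines a positive definite quadratic form exactly when all of its leading principal minors are positive. A pure-Python sketch (the determinant routine and the two test arrays are ours; the arrays correspond to the quadratic forms of Exercise 1.10):

```python
def is_positive_definite(a):
    """Sylvester's criterion: every leading principal minor of the
    symmetric coefficient array must be positive.  Determinants are
    computed by Gaussian elimination with partial pivoting; a sketch."""
    def det(m):
        m = [row[:] for row in m]          # work on a copy
        n = len(m)
        d = 1.0
        for i in range(n):
            p = max(range(i, n), key=lambda r: abs(m[r][i]))
            if abs(m[p][i]) < 1e-12:
                return 0.0                 # singular minor
            if p != i:
                m[i], m[p] = m[p], m[i]
                d = -d
            d *= m[i][i]
            for r in range(i + 1, n):
                f = m[r][i] / m[i][i]
                for c in range(i, n):
                    m[r][c] -= f * m[i][c]
        return d
    n = len(a)
    return all(det([row[:k] for row in a[:k]]) > 0 for k in range(1, n + 1))

# x1^2 + 4 x2^2 + x1 x2  ->  a11 = 1, a22 = 4, a12 = 1/2 (cross term halved)
print(is_positive_definite([[1.0, 0.5], [0.5, 4.0]]))   # True: a valid distance
# x1^2 - 2 x2^2  ->  a22 < 0
print(is_positive_definite([[1.0, 0.0], [0.0, -2.0]]))  # False: not a distance
```

Note the convention: the off-diagonal entry of the array is half the coefficient of the cross-product term, since each aik with i ≠ k appears twice.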
Figure 1.25 A cluster of points relative to a point P and the origin.
The need to consider statistical rather than Euclidean distance is illustrated heuristically in Figure 1.25. Figure 1.25 depicts a cluster of points whose center of gravity (sample mean) is indicated by the point Q. Consider the Euclidean distances from the point Q to the point P and the origin O. The Euclidean distance from Q to P is larger than the Euclidean distance from Q to O. However, P appears to be more like the points in the cluster than does the origin. If we take into account the variability of the points in the cluster and measure distance by the statistical distance in (1-20), then Q will be closer to P than to O. This result seems reasonable, given the nature of the scatter.

Other measures of distance can be advanced. (See Exercise 1.12.) At times, it is useful to consider distances that are not related to circles or ellipses. Any distance measure d(P, Q) between two points P and Q is valid provided that it satisfies the following properties, where R is any other intermediate point:
d(P, Q) = d(Q, P)
d(P, Q) > 0 if P ≠ Q
d(P, Q) = 0 if P = Q
d(P, Q) ≤ d(P, R) + d(R, Q)    (triangle inequality)    (1-25)
Exercises
1.1. Consider the seven pairs of measurements (x1, x2):

x1:  3    4    2    6    8    2    5
x2:  5    5.5  4    7    10   5    7.5

Calculate the sample means x̄1 and x̄2, the sample variances s11 and s22, and the sample covariance s12.
1.2. A morning newspaper lists the following used-car prices for a foreign compact with age x1 measured in years and selling price x2 measured in thousands of dollars:

x1:  1      2      3      3      4      5      6     8     9     11
x2:  18.95  19.00  17.95  15.54  14.00  12.95  8.94  7.49  6.00  3.99

(a) Construct a scatter plot of the data and marginal dot diagrams.
(b) Infer the sign of the sample covariance s12 from the scatter plot.
(c) Compute the sample means x̄1 and x̄2 and the sample variances s11 and s22. Compute the sample covariance s12 and the sample correlation coefficient r12. Interpret these quantities.
(d) Display the sample mean array x̄, the sample variance-covariance array Sn, and the sample correlation array R using (1-8).
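The x̄, Sn, and R arrays requested in these exercises can be produced by a short routine. A sketch (Python; the data below are illustrative, not the exercise data, and Sn uses divisor n as in (1-8)):

```python
def summary_arrays(data):
    """Sample mean array, variance-covariance array Sn (divisor n, as
    in (1-8)), and correlation array R for rows of observations; a sketch."""
    n, p = len(data), len(data[0])
    xbar = [sum(col) / n for col in zip(*data)]
    Sn = [[sum((row[i] - xbar[i]) * (row[k] - xbar[k]) for row in data) / n
           for k in range(p)] for i in range(p)]
    R = [[Sn[i][k] / (Sn[i][i] ** 0.5 * Sn[k][k] ** 0.5)
          for k in range(p)] for i in range(p)]
    return xbar, Sn, R

# Illustrative data only (four observations on two variables):
xbar, Sn, R = summary_arrays([[1, 2], [2, 1], [3, 4], [4, 3]])
print(xbar)     # [2.5, 2.5]
print(R[0][1])  # r12 = 0.6
```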
1.3. The following are five measurements on the variables x1, x2, and x3:

x1:  9   2   6   5   8
x2:  12  8   6   4   10
x3:  3   4   0   2   1

Find the arrays x̄, Sn, and R.
1.4. The world's 10 largest companies yield the following data:¹

Company               x1 = sales    x2 = profits   x3 = assets
                      (billions)    (billions)     (billions)
Citigroup             108.28        17.05          1,484.10
General Electric      152.36        16.59          750.33
American Intl Group   95.04         10.91          766.42
Bank of America       65.45         14.14          1,110.46
HSBC Group            62.97         9.52           1,031.29
ExxonMobil            263.99        25.33          195.26
Royal Dutch/Shell     265.19        18.54          193.83
BP                    285.06        15.73          191.11
ING Group             92.01         8.10           1,175.16
Toyota Motor          165.68        11.13          211.15

¹Data from April 18, 2005.

(a) Plot the scatter diagram and marginal dot diagrams for variables x1 and x2. Comment on the appearance of the diagrams.
(b) Compute x̄1, x̄2, s11, s22, s12, and r12. Interpret r12.
1.6. The data in Table 1.5 are 42 measurements on air-pollution variables recorded at 12:00 noon in the Los Angeles area on different days. (See also the air-pollution data on the web at www.prenhall.com/statistics.)
(a) Plot the marginal dot diagrams for all the variables.
(b) Construct the x̄, Sn, and R arrays, and interpret the entries in R.
Table 1.5 Air-Pollution Data
Wind (x1), Solar radiation (x2), CO (x3), NO (x4), NO2 (x5), O3 (x6), HC (x7); the 42 rows of measurements are given in the air-pollution data set on the web at www.prenhall.com/statistics.
1.7. You are given the following n = 3 observations on p = 2 variables:

Variable 1:  x11 = 2,  x21 = 3,  x31 = 4
Variable 2:  x12 = 1,  x22 = 2,  x32 = 4

(a) Plot the pairs of observations in the two-dimensional "variable space." That is, construct a two-dimensional scatter plot of the data.
(b) Plot the data as two points in the three-dimensional "item space."
1.8. Evaluate the distance of the point P = (−1, −1) to the point Q = (1, 0) using the Euclidean distance formula in (1-12) with p = 2 and using the statistical distance in (1-20) with a11 = 1/3, a22 = 4/27, and a12 = 1/9. Sketch the locus of points that are a constant squared statistical distance 1 from the point Q.
1.9. Consider the following eight pairs of measurements on two variables x1 and x2:

x1:  −6  −3  −2   1   2   5   6   8
x2:  −2  −3   1  −1   2   1   5   3

(a) Plot the data as a scatter diagram, and compute s11, s22, and s12.
(b) Using (1-18), calculate the corresponding measurements on variables x̃1 and x̃2, assuming that the original coordinate axes are rotated through an angle of θ = 26° [given cos(26°) = .899 and sin(26°) = .438].
(c) Using the x̃1 and x̃2 measurements from (b), compute the sample variances s̃11 and s̃22.
(d) Consider the new pair of measurements (x1, x2) = (4, −2). Transform these to measurements on x̃1 and x̃2 using (1-18), and calculate the distance d(O, P) of the new point P = (x̃1, x̃2) from the origin O = (0, 0) using (1-17). Note: You will need s̃11 and s̃22 from (c).
(e) Calculate the distance from P = (4, −2) to the origin O = (0, 0) using (1-19) and the expressions for a11, a22, and a12 in footnote 2. Note: You will need s11, s22, and s12 from (a). Compare the distance calculated here with the distance calculated using the x̃1 and x̃2 values in (d). (Within rounding error, the numbers should be the same.)
1.10. Are the following distance functions valid for distance from the origin? Explain.
(a) x1² + 4x2² + x1x2 = (distance)²
(b) x1² − 2x2² = (distance)²
1.11. Verify that distance defined by (1-20) with a11 = 4, a22 = 1, and a12 = −1 satisfies the first three conditions in (1-25). (The triangle inequality is more difficult to verify.)

1.12. Define the distance from the point P = (x1, x2) to the origin O = (0, 0) as

d(O, P) = max(|x1|, |x2|)

(a) Compute the distance from P = (−3, 4) to the origin.
(b) Plot the locus of points whose squared distance from the origin is 1.
(c) Generalize the foregoing distance expression to points in p dimensions.
1.13. A large city has major roads laid out in a grid pattern, as indicated in the following diagram. Streets 1 through 5 run north-south (NS), and streets A through E run east-west (EW). Suppose there are retail stores located at intersections (A, 2), (E, 3), and (C, 5). Assume the distance along a street between two intersections in either the NS or EW direction is 1 unit. Define the distance between any two intersections (points) on the grid to be the "city block" distance. [For example, the distance between intersections (D, 1) and (C, 2), which we might call d((D, 1), (C, 2)), is given by d((D, 1), (C, 2)) = d((D, 1), (D, 2)) + d((D, 2), (C, 2)) = 1 + 1 = 2. Also, d((D, 1), (C, 2)) = d((D, 1), (C, 1)) + d((C, 1), (C, 2)) = 1 + 1 = 2.]
(Diagram: a grid of intersections formed by streets A-E running EW and streets 1-5 running NS.)
Locate a supply facility (warehouse) at an intersection such that the sum of the distances from the warehouse to the three retail stores is minimized.
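The city-block distance described in Exercise 1.13 reduces to a sum of absolute row and column differences. A sketch (Python; the encoding of lettered streets as grid coordinates is our choice) that reproduces the worked example d((D, 1), (C, 2)) = 2:

```python
def city_block(a, b):
    """'City block' distance between grid intersections, assuming streets
    A-E are encoded as rows 0-4 and streets 1-5 as columns 1-5."""
    rows = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
    (r1, c1), (r2, c2) = (rows[a[0]], a[1]), (rows[b[0]], b[1])
    return abs(r1 - r2) + abs(c1 - c2)

# The worked example in the exercise text:
print(city_block(("D", 1), ("C", 2)))  # 2
```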
The following exercises contain fairly extensive data sets. A computer may be necessary for the required calculations.
1.14. Table 1.6 contains some of the raw data discussed in Section 1.2. (See also the multiple-sclerosis data on the web at www.prenhall.com/statistics.) Two different visual stimuli (S1 and S2) produced responses in both the left eye (L) and the right eye (R) of subjects in the study groups. The values recorded in the table include x1 (subject's age); x2 (total response of both eyes to stimulus S1, that is, S1L + S1R); x3 (difference between responses of eyes to stimulus S1, |S1L − S1R|); and so forth.
(a) Plot the two-dimensional scatter diagram for the variables x2 and x4 for the multiple-sclerosis group. Comment on the appearance of the diagram.
(b) Compute the x̄, Sn, and R arrays for the non-multiple-sclerosis and multiple-sclerosis groups separately.

1.15. Some of the 98 measurements described in Section 1.2 are listed in Table 1.7. (See also the radiotherapy data on the web at www.prenhall.com/statistics.) The data consist of average ratings over the course of treatment for patients undergoing radiotherapy. Variables measured include x1 (number of symptoms, such as sore throat or nausea); x2 (amount of activity, on a 1-5 scale); x3 (amount of sleep, on a 1-5 scale); x4 (amount of food consumed, on a 1-3 scale); x5 (appetite, on a 1-5 scale); and x6 (skin reaction, on a 0-3 scale).
(a) Construct the two-dimensional scatter plot for variables x2 and x3 and the marginal dot diagrams (or histograms). Do there appear to be any errors in the x3 data?
(b) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations.
1.16. At the start of a study to determine whether exercise or dietary supplements would slow
bone loss in older women, an investigator measured the mineral content of bones by photon absorptiometry. Measurements were recorded for three bones on the dominant and nondominant sides and are shown in Table 1.8. (See also the mineral-content data on the web at www.prenhall.com/statistics.) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations.
Table 1.6 Multiple-Sclerosis Data
Non-multiple-sclerosis group and multiple-sclerosis group: x1 (age), x2 (S1L + S1R), x3 (|S1L − S1R|), x4 (S2L + S2R), x5 (|S2L − S2R|); the complete listing is in the multiple-sclerosis data set on the web at www.prenhall.com/statistics.
Table 1.7 Radiotherapy Data

x1        x2        x3     x4     x5        x6
Symptoms  Activity  Sleep  Eat    Appetite  Skin reaction
.889      1.389     1.555  2.222  1.945     1.000
2.813     1.437     .999   2.312  2.312     2.000
1.454     1.091     2.364  2.455  2.909     3.000
.294      .941      1.059  2.000  1.000     1.000
2.727     2.545     2.819  2.727  4.091     .000
4.100     1.900     2.800  2.000  2.600     2.000
.125      1.062     1.437  1.875  1.563     .000
6.231     2.769     1.462  2.385  4.000     2.000
3.000     1.455     2.090  2.273  3.272     2.000
.889      1.000     1.000  2.000  1.000     2.000
⋮

Source: Data courtesy of Mrs. Annette Tealey, R.N. Values of x2 and x3 less than 1.0 are due to errors in the data-collection process. Rows containing values of x2 and x3 less than 1.0 may be omitted.
Table 1.8 Mineral Content in Bones

Subject  Dominant         Dominant           Dominant
number   radius   Radius  humerus   Humerus  ulna     Ulna
1        1.103    1.052   2.139     2.238    .873     .872
2        .842     .859    1.873     1.741    .590     .744
3        .925     .873    1.887     1.809    .767     .713
4        .857     .744    1.739     1.547    .706     .674
5        .795     .809    1.734     1.715    .549     .654
6        .787     .779    1.509     1.474    .782     .571
7        .933     .880    1.695     1.656    .737     .803
8        .799     .851    1.740     1.777    .618     .682
9        .945     .876    1.811     1.759    .853     .777
10       .921     .906    1.954     2.009    .823     .765
11       .792     .825    1.624     1.657    .686     .668
12       .815     .751    2.204     1.846    .678     .546
13       .755     .724    1.508     1.458    .662     .595
14       .880     .866    1.786     1.811    .810     .819
15       .900     .838    1.902     1.606    .723     .677
16       .764     .757    1.743     1.794    .586     .541
17       .733     .748    1.863     1.869    .672     .752
18       .932     .898    2.028     2.032    .836     .805
19       .856     .786    1.390     1.324    .578     .610
20       .890     .950    2.187     2.087    .758     .718
21       .688     .532    1.650     1.378    .533     .482
22       .940     .850    2.334     2.225    .757     .731
23       .493     .616    1.037     1.268    .546     .615
24       .835     .752    1.509     1.422    .618     .664
25       .915     .936    1.971     1.869    .869     .868
1.17. Some of the data described in Section 1.2 are listed in Table 1.9. (See also the national-track-records data on the web at www.prenhall.com/statistics.) The national track records for women in 54 countries can be examined for the relationships among the running events. Compute the x̄, Sn, and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100-meter) to the longer (marathon) running distances. Interpret these pairwise correlations.

1.18. Convert the national track records for women in Table 1.9 to speeds measured in meters per second. For example, the record speed for the 100-m dash for Argentinian women is 100 m/11.57 sec = 8.643 m/sec. Notice that the records for the 800-m, 1500-m, 3000-m, and marathon runs are measured in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Compute the x̄, Sn, and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100 m) to the longer (marathon) running distances. Interpret these pairwise correlations. Compare your results with the results you obtained in Exercise 1.17.

1.19. Create the scatter plot and boxplot displays of Figure 1.5 for (a) the mineral-content data in Table 1.8 and (b) the national-track-records data in Table 1.9.
Table 1.9 National Track Records for Women
Country, 100 m (s), 200 m (s), 400 m (s), 800 m (min), 1500 m (min), 3000 m (min), Marathon (min) for 54 countries; the complete listing is in the national-track-records data set on the web at www.prenhall.com/statistics.
Source: IAAF/ATFS Track and Field Handbook for Helsinki 2005 (courtesy of Ottavio Castellini).
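The conversion asked for in Exercise 1.18 is a one-line computation once the time units are handled. A sketch (Python; the function name is ours) that reproduces the stated example 100 m / 11.57 s = 8.643 m/s:

```python
def record_speed(distance_m, time_value, unit):
    """Convert a track record to meters per second; records for the
    800 m and longer are quoted in minutes (Exercise 1.18's convention)."""
    seconds = time_value * 60.0 if unit == "min" else time_value
    return distance_m / seconds

# Argentina's 100 m record of 11.57 s:
print(round(record_speed(100, 11.57, "s"), 3))       # 8.643
# A marathon time of 154.41 min over 42,195 m:
print(round(record_speed(42195, 154.41, "min"), 3))
```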
1.20. Refer to the bankruptcy data in Table 11.4, page 657, and on the following website: www.prenhall.com/statistics. Using appropriate computer software,
(a) View the entire data set in x1, x2, x3 space. Rotate the coordinate axes in various directions. Check for unusual observations.
(b) Highlight the set of points corresponding to the bankrupt firms. Examine various three-dimensional perspectives. Are there some orientations of three-dimensional space for which the bankrupt firms can be distinguished from the nonbankrupt firms? Are there observations in each of the two groups that are likely to have a significant impact on any rule developed to classify firms based on the sample means, variances, and covariances calculated from these data? (See Exercise 11.24.)

1.21. Refer to the milk transportation-cost data in Table 6.10, page 345, and on the web at www.prenhall.com/statistics. Using appropriate computer software,
(a) View the entire data set in three dimensions. Rotate the coordinate axes in various directions. Check for unusual observations.
(b) Highlight the set of points corresponding to gasoline trucks. Do any of the gasoline-truck points appear to be multivariate outliers? (See Exercise 6.17.) Are there some orientations of x1, x2, x3 space for which the set of points representing gasoline trucks can be readily distinguished from the set of points representing diesel trucks?
1.22. Refer to the oxygen-consumption data in Table 6.12, page 348, and on the web at www.prenhall.com/statistics. Using appropriate computer software,
(a) View the entire data set in three dimensions, employing various combinations of three variables to represent the coordinate axes. Begin with the x1, x2, x3 space.
(b) Check this data set for outliers.

1.23. Using the data in Table 11.9, page 666, and on the web at www.prenhall.com/statistics, represent the cereals in each of the following ways.
(a) Stars.
(b) Chernoff faces. (Experiment with the assignment of variables to facial characteristics.)

1.24. Using the utility data in Table 12.4, page 688, and on the web at www.prenhall.com/statistics, represent the public utility companies as Chernoff faces with assignments of variables to facial characteristics different from those considered in Example 1.12. Compare your faces with the faces in Figure 1.17. Are different groupings indicated?
1.25. Using the data in Table 12.4 and on the web at www.prenhall.com/statistics, represent the 22 public utility companies as stars. Visually group the companies into four or five clusters.
1.26. The data in Table 1.10 (see the bull data on the web at www.prenhall.com/statistics) are the measured characteristics of 76 young (less than two years old) bulls sold at auction. Also included in the table are the selling prices (SalePr) of these bulls. The column headings (variables) are defined as follows:

Breed = 1 (Angus), 5 (Hereford), 8 (Simental)
YrHgt = Yearling height at shoulder (inches)
FtFrBody = Fat-free body (pounds)
PrctFFB = Percent fat-free body
Frame = Scale from 1 (small) to 8 (large)
BkFat = Back fat (inches)
SaleHt = Sale height at shoulder (inches)
SaleWt = Sale weight (pounds)

(a) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations. Do some of these variables appear to distinguish one breed from another?
(b) View the data in three dimensions using the variables Breed, Frame, and BkFat. Rotate the coordinate axes in various directions. Check for outliers. Are the breeds well separated in this coordinate system?
(c) Repeat part b using Breed, FtFrBody, and SaleHt. Which three-dimensional display appears to result in the best separation of the three breeds of bulls?
Table 1.10 Data on Bulls

Breed  SalePr  YrHgt  FtFrBody  PrctFFB  Frame  BkFat
1      2200    51.0   1128      70.9     7      .25
1      2250    51.9   1108      72.1     7      .25
1      1625    49.9   1011      71.6     6      .15
1      4600    53.1   993       68.9     8      .35
1      2150    51.2   996       68.6     7      .25
⋮
8      1450    51.4   997       73.4     7      .10
8      1200    49.8   991       70.8     6      .15
8      1425    50.0   928       70.8     6      .10
8      1250    50.1   990       71.0     6      .10
8      1500    51.7   992       70.6     7      .15

Source: Data courtesy of Mark Ellersieck.

1.27. Table 1.11 presents the 2005 attendance (millions) at the fifteen most visited national parks and their size (acres).
(a) Create a scatter plot and calculate the correlation coefficient.
(b) Identify the park that is unusual. Drop this point and recalculate the correlation coefficient. Comment on the effect of this one point on correlation.
(c) Would the correlation in Part b change if you measured size in square miles instead of acres? Explain.
Table 1.11 Attendance and Size of National Parks

National Park    Size (acres)  Visitors (millions)
Acadia           47.4          2.05
Bryce Canyon     35.8          1.02
Cuyahoga Valley  32.9          2.53
Everglades       1508.5        1.23
Grand Canyon     1217.4        4.40
Grand Teton      310.0         2.46
Great Smoky      521.8         9.19
Hot Springs      5.6           1.34
Olympic          922.7         3.14
Mount Rainier    235.6         1.17
Rocky Mountain   265.8         2.80
Shenandoah       199.0         1.09
Yellowstone      2219.8        2.84
Yosemite         761.3         3.30
Zion             146.6         2.59
References
1. Becker, R. A., W. S. Cleveland, and A. R. Wilks. "Dynamic Graphics for Data Analysis." Statistical Science, 2, no. 4 (1987), 355-395.
2. Benjamin, Y., and M. Igbaria. "Clustering Categories for Better Prediction of Computer Resources Utilization." Applied Statistics, 40, no. 2 (1991), 295-307.
3. Capon, N., J. Farley, D. Lehman, and J. Hulbert. "Profiles of Product Innovators among Large U.S. Manufacturers." Management Science, 38, no. 2 (1992), 157-169.
4. Chernoff, H. "Using Faces to Represent Points in k-Dimensional Space Graphically." Journal of the American Statistical Association, 68, no. 342 (1973), 361-368.
5. Cochran, W. G. Sampling Techniques (3rd ed.). New York: John Wiley, 1977.
6. Cochran, W. G., and G. M. Cox. Experimental Designs (2nd ed., paperback). New York: John Wiley, 1992.
7. Davis, J. C. "Information Contained in Sediment Size Analysis." Mathematical Geology, 2, no. 2 (1970), 105-112.
8. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43, no. 2 (1989), 110-115.
9. Dudoit, S., J. Fridlyand, and T. P. Speed. "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data." Journal of the American Statistical Association, 97, no. 457 (2002), 77-87.
10. Dunham, R. B., and D. J. Kravetz. "Canonical Correlation Analysis in a Predictive System." Journal of Experimental Education, 43, no. 4 (1975), 35-42.
11. Everitt, B. Graphical Techniques for Multivariate Data. New York: North-Holland, 1978.
12. Gable, G. G. "A Multidimensional Model of Client Success when Engaging External Consultants." Management Science, 42, no. 8 (1996), 1175-1198.
13. Halinar, J. C. "Principal Component Analysis in Plant Breeding." Unpublished report based on data collected by Dr. F. A. Bliss, University of Wisconsin, 1979.
14. Johnson, R. A., and G. K. Bhattacharyya. Statistics: Principles and Methods (5th ed.). New York: John Wiley, 2005.
15. Kim, L., and Y. Kim. "Innovation in a Newly Industrializing Country: A Multiple Discriminant Analysis." Management Science, 31, no. 3 (1985), 312-322.
16. Klatzky, S. R., and R. W. Hodge. "A Canonical Correlation Analysis of Occupational Mobility." Journal of the American Statistical Association, 66, no. 333 (1971), 16-22.
17. Lee, J. "Relationships Between Properties of Pulp-Fibre and Paper." Unpublished doctoral thesis, University of Toronto, Faculty of Forestry (1992).
18. MacCrimmon, K., and D. Wehrung. "Characteristics of Risk Taking Executives." Management Science, 36, no. 4 (1990), 422-435.
19. Marriott, F. H. C. The Interpretation of Multiple Observations. London: Academic Press, 1974.
20. Mather, P. M. "Study of Factors Influencing Variation in Size Characteristics in Fluvioglacial Sediments." Mathematical Geology, 4, no. 3 (1972), 219-234.
21. McLaughlin, M., et al. "Professional Mediators' Judgments of Mediation Tactics: Multidimensional Scaling and Cluster Analysis." Journal of Applied Psychology, 76, no. 3 (1991), 465-473.
22. Naik, D. N., and R. Khattree. "Revisiting Olympic Track Records: Some Practical Considerations in the Principal Component Analysis." The American Statistician, 50, no. 2 (1996), 140-144.
23. Nason, G. "Three-dimensional Projection Pursuit." Applied Statistics, 44, no. 4 (1995), 411-430.
24. Smith, M., and R. Taffler. "Improving the Communication Function of Published Accounting Statements." Accounting and Business Research, 14, no. 54 (1984), 139-146.
25. Spenner, K. I. "From Generation to Generation: The Transmission of Occupation." Ph.D. dissertation, University of Wisconsin, 1977.
26. Tabakoff, B., et al. "Differences in Platelet Enzyme Activity between Alcoholics and Nonalcoholics." New England Journal of Medicine, 318, no. 3 (1988), 134-139.
27. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, CA: Brooks/Cole, 1975.
28. Trieschmann, J. S., and G. E. Pinches. "A Multivariate Model for Predicting Financially Distressed P-L Insurers." Journal of Risk and Insurance, 40, no. 3 (1973), 327-338.
29. Tukey, J. W. Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
30. Wainer, H., and D. Thissen. "Graphical Data Analysis." Annual Review of Psychology, 32 (1981), 191-241.
31. Wartzman, R. "Don't Wave a Red Flag at the IRS." The Wall Street Journal (February 24, 1993), C1, C15.
32. Weihs, C., and H. Schmidli. "OMEGA (On Line Multivariate Exploratory Graphical Analysis): Routine Searching for Structure." Statistical Science, 5, no. 2 (1990), 175-226.
Chapter 2

MATRIX ALGEBRA AND RANDOM VECTORS

Vectors

An array x of n real numbers x1, x2, ..., xn is called a vector, and it is written as

x = [x1, x2, ..., xn]'

where the prime denotes the operation of transposing a column to a row.
A vector x can be represented geometrically as a directed line in n dimensions with component x1 along the first axis, x2 along the second axis, ..., and xn along the nth axis. This is illustrated in Figure 2.1 for n = 3.

A vector can be expanded or contracted by multiplying it by a constant c. In particular, we define the vector cx as
cx = [cx1, cx2, ..., cxn]'
That is, cx is the vector obtained by multiplying each element of x by c. [See Figure 2.2(a).]
Figure 2.2 (a) Scalar multiplication of a vector. (b) Addition of two vectors.
Two vectors may be added. Addition of x and y is defined, element by element, as

x + y = [x1 + y1, x2 + y2, ..., xn + yn]'
so that x + y is the vector with ith element xi + yi.

The sum of two vectors emanating from the origin is the diagonal of the parallelogram formed with the two original vectors as adjacent sides. This geometrical interpretation is illustrated in Figure 2.2(b).

A vector has both direction and length. In n = 2 dimensions, we consider the vector
x = [x1, x2]'

The length of x, written Lx, is defined to be

Lx = √(x1² + x2²)
Geometrically, the length of a vector in two dimensions can be viewed as the hypotenuse of a right triangle. This is demonstrated schematically in Figure 2.3.

The length of a vector x' = [x1, x2, ..., xn], with n components, is defined by

Lx = √(x1² + x2² + ⋯ + xn²)    (2-1)
Multiplication of a vector x by a scalar c changes the length. From Equation (2-1),

Lcx = √(c²x1² + c²x2² + ⋯ + c²xn²) = |c| √(x1² + x2² + ⋯ + xn²) = |c| Lx    (2-2)
Multiplication by c does not change the direction of the vector x if c > 0. However, a negative value of c creates a vector with a direction opposite that of x. From (2-2), it is clear that x is expanded if |c| > 1 and contracted if 0 < |c| < 1. [Recall Figure 2.2(a).] Choosing c = Lx⁻¹, we obtain the unit vector Lx⁻¹x, which has length 1 and lies in the direction of x.
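Equations (2-1) and (2-2) can be illustrated directly. A sketch (Python; the function names are ours) showing that scaling by c multiplies the length by |c| and that Lx⁻¹x has length 1:

```python
import math

def length(x):
    """L_x as in (2-1): the square root of the sum of squared components."""
    return math.sqrt(sum(xi * xi for xi in x))

def scale(c, x):
    """cx: multiply every component of x by the scalar c."""
    return [c * xi for xi in x]

x = [1.0, 3.0, 2.0]
print(length(scale(-2, x)))              # |c| L_x = 2 * sqrt(14)
print(length(scale(1 / length(x), x)))   # the unit vector has length 1
```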
Figure 2.3 Length of x = √(x1² + x2²).
Figure 2.4 The angle θ between x' = [x1, x2] and y' = [y1, y2].
A second geometrical concept is angle. Consider two vectors in a plane and the angle θ between them, as in Figure 2.4. From the figure, θ can be represented as the difference between the angles θ1 and θ2 formed by the two vectors and the first coordinate axis. Since, by definition,

cos(θ1) = x1/Lx    cos(θ2) = y1/Ly
sin(θ1) = x2/Lx    sin(θ2) = y2/Ly

the angle θ between the two vectors x' = [x1, x2] and y' = [y1, y2] is specified by

cos(θ) = cos(θ2 − θ1) = cos(θ2)cos(θ1) + sin(θ2)sin(θ1)
       = (y1/Ly)(x1/Lx) + (y2/Ly)(x2/Lx) = (x1y1 + x2y2)/(LxLy)    (2-3)

We find it convenient to introduce the inner product of two vectors. For n = 2 dimensions, the inner product of x and y is

x'y = x1y1 + x2y2

With this definition and Equation (2-3),

Lx = √(x'x)    cos(θ) = x'y/(LxLy) = x'y/(√(x'x) √(y'y))

Since cos(90°) = cos(270°) = 0 and cos(θ) = 0 only if x'y = 0, x and y are perpendicular when x'y = 0.

For an arbitrary number of dimensions n, we define the inner product of x and y as

x'y = x1y1 + x2y2 + ⋯ + xnyn    (2-4)
Some Basics of Matrix and Vector Algebra

Using the inner product, we have the natural extension of length and angle to vectors of n components:

    L_x = length of x = sqrt(x'x)    (2-5)

    cos(θ) = x'y/(L_x L_y) = x'y/(sqrt(x'x) sqrt(y'y))    (2-6)

Since, again, cos(θ) = 0 only if x'y = 0, we say that x and y are perpendicular when x'y = 0.
Example 2.1 (Calculating lengths of vectors and the angle between them) Given the vectors x' = [1, 3, 2] and y' = [-2, 1, -1], find 3x and x + y. Next, determine the length of x, the length of y, and the angle between x and y. Also, check that the length of 3x is three times the length of x.

First,

    3x = [3, 9, 6]'    and    x + y = [1 - 2, 3 + 1, 2 - 1]' = [-1, 4, 1]'

Next, x'x = 1^2 + 3^2 + 2^2 = 14, y'y = (-2)^2 + 1^2 + (-1)^2 = 6, and x'y = 1(-2) + 3(1) + 2(-1) = -1. Therefore,

    L_x = sqrt(x'x) = sqrt(14) = 3.742    and    L_y = sqrt(y'y) = sqrt(6) = 2.449

and

    cos(θ) = x'y/(L_x L_y) = -1/(3.742 × 2.449) = -.109

so θ = 96.3°. Finally,

    L_3x = sqrt(3^2 + 9^2 + 6^2) = sqrt(126) = 3 sqrt(14) = 3 L_x
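These length and angle computations are easy to verify numerically. A minimal NumPy sketch (not part of the text) reproducing the numbers of Example 2.1, with the vectors as read here:

```python
# Checking Example 2.1 with NumPy (illustration only).
import numpy as np

x = np.array([1.0, 3.0, 2.0])
y = np.array([-2.0, 1.0, -1.0])

L_x = np.sqrt(x @ x)                      # length of x, Equation (2-5)
L_y = np.sqrt(y @ y)
cos_theta = (x @ y) / (L_x * L_y)         # Equation (2-6)
theta = np.degrees(np.arccos(cos_theta))  # angle between x and y, in degrees
L_3x = np.sqrt((3 * x) @ (3 * x))         # length of 3x equals 3 * L_x
```

The computed angle agrees with the 96.3° reported in the example.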
A pair of vectors x and y of the same dimension is said to be linearly dependent if there exist constants c_1 and c_2, both not zero, such that

    c_1 x + c_2 y = 0

A set of vectors x_1, x_2, ..., x_k is said to be linearly dependent if there exist constants c_1, c_2, ..., c_k, not all zero, such that

    c_1 x_1 + c_2 x_2 + ... + c_k x_k = 0    (2-7)

Linear dependence implies that at least one vector in the set can be written as a linear combination of the other vectors. Vectors of the same dimension that are not linearly dependent are said to be linearly independent.
Example 2.2 (Identifying linearly independent vectors) Consider the set of vectors

    x_1 = [1, 2, 1]'    x_2 = [1, 0, -1]'    x_3 = [1, -2, 1]'

Setting c_1 x_1 + c_2 x_2 + c_3 x_3 = 0 gives the system of equations

     c_1 + c_2 + c_3 = 0
    2c_1       - 2c_3 = 0
     c_1 - c_2 + c_3 = 0

with the unique solution c_1 = c_2 = c_3 = 0. As we cannot find three constants c_1, c_2, and c_3, not all zero, such that c_1 x_1 + c_2 x_2 + c_3 x_3 = 0, the vectors x_1, x_2, and x_3 are linearly independent.

The projection (or shadow) of a vector x on a vector y is

    Projection of x on y = (x'y / y'y) y = (x'y / L_y) (1/L_y) y    (2-8)

where the vector L_y^{-1} y has unit length. The length of the projection is

    Length of projection = |x'y| / L_y = L_x |cos(θ)|    (2-9)
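Formulas (2-8) and (2-9) translate directly into code. A short NumPy sketch (the particular vectors are arbitrary, chosen only for illustration):

```python
import numpy as np

def project(x, y):
    """Projection (shadow) of x on y, Equation (2-8): (x'y / y'y) y."""
    return (x @ y) / (y @ y) * y

x = np.array([2.0, 1.0])
y = np.array([1.0, 0.0])

p = project(x, y)
proj_len = abs(x @ y) / np.sqrt(y @ y)    # length of projection, Equation (2-9)
```

Here the projection of [2, 1]' on the first coordinate axis is [2, 0]', with length 2, as expected geometrically.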
Matrices

A matrix is any rectangular array of real numbers. We denote an arbitrary array of n rows and p columns by

    A    = [a_11  a_12  ...  a_1p
 (n×p)      a_21  a_22  ...  a_2p
             ...   ...  ...  ...
            a_n1  a_n2  ...  a_np]
Many of the vector concepts just introduced have direct generalizations to matrices. The transpose operation A' of a matrix changes the columns into rows, so that the first column of A becomes the first row of A', the second column becomes the second row, and so forth.
Example 2.3 (The transpose of a matrix) If

    A    = [3  -1  2
 (2×3)      1   5  4]

then

    A'   = [ 3  1
 (3×2)      -1  5
              2  4]
A matrix may also be multiplied by a constant c. The product cA is the matrix that results from multiplying each element of A by c. Thus

    cA   = [c a_11  c a_12  ...  c a_1p
 (n×p)      c a_21  c a_22  ...  c a_2p
              ...     ...   ...    ...
            c a_n1  c a_n2  ...  c a_np]

Two matrices A and B of the same dimensions can be added. The sum A + B has (i, j)th entry a_ij + b_ij.
Example 2.4 (The sum of two matrices and multiplication of a matrix by a constant) If

    A    = [0   3  1        and    B    = [1  -2  -3
 (2×3)      1  -1  1]            (2×3)     2   5   1]

then

    4A   = [0  12  4        and    A + B = [0 + 1   3 - 2  1 - 3     [1  1  -2
 (2×3)      4  -4  4]            (2×3)      1 + 2  -1 + 5  1 + 1]  =  3  4   2]
It is also possible to define the multiplication of two matrices if the dimensions of the matrices conform in the following manner: When A is (n × k) and B is (k × p), so that the number of elements in a row of A is the same as the number of elements in a column of B, we can form the matrix product AB. An element of the new matrix AB is formed by taking the inner product of each row of A with each column of B. The matrix product AB is

    A B = the (n × p) matrix whose entry in the ith row and jth column
          is the inner product of the ith row of A and the jth column of B

that is,

    (i, j) entry of AB = a_i1 b_1j + a_i2 b_2j + ... + a_ik b_kj = Σ_{ℓ=1}^{k} a_iℓ b_ℓj    (2-10)
When k = 4, we have four products to add for each entry in the matrix AB. Thus,

    A    B   = [a_11  a_12  a_13  a_14      [b_11 ... b_1j ... b_1p
 (n×4)(4×p)      ...                         b_21 ... b_2j ... b_2p
               (a_i1  a_i2  a_i3  a_i4)      b_31 ... b_3j ... b_3p
                 ...        (Row i)          b_41 ... b_4j ... b_4p]
                a_n1  a_n2  a_n3  a_n4]           (Column j)

           = [... (a_i1 b_1j + a_i2 b_2j + a_i3 b_3j + a_i4 b_4j) ...]
Example 2.5 (Matrix multiplication) If

    A    = [3  -1  2        and    b    = [-2
 (2×3)      1   5  4]            (3×1)      7
                                            9]

then

    A  b  = [3(-2) + (-1)(7) + 2(9)     [ 5
 (2×3)(3×1)  1(-2) + 5(7) + 4(9) ]  =    69]
                                       (2×1)

and

    [2   0  [3  -1  2     [2(3) + 0(1)  2(-1) + 0(5)  2(2) + 0(4)      [6  -2   4
     1  -1]  1   5  4] =   1(3) - 1(1)  1(-1) - 1(5)  1(2) - 1(4)]  =   2  -6  -2]
  (2×2)(2×3)                                                           (2×3)
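The inner-product rule (2-10) can be verified entry by entry against a library matrix product. A NumPy sketch (B here is an arbitrary 3 × 2 matrix, not taken from the text):

```python
import numpy as np

A = np.array([[3.0, -1.0, 2.0],
              [1.0,  5.0, 4.0]])      # 2x3
B = np.array([[1.0, 0.0],
              [2.0, -1.0],
              [0.5, 3.0]])            # 3x2, made up for illustration

# Entry (i, j) of AB is the inner product of row i of A with column j of B,
# as in Equation (2-10).
AB = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])]
               for i in range(A.shape[0])])
```

The element-by-element construction matches `A @ B` exactly.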
When a matrix B consists of a single column, it is customary to use the lowercase b vector notation.

Example 2.6 (Some typical products and their dimensions) Let

    A = [1  -2   3        b = [ 7        c = [ 5        d = [2
         2   4  -1]           -3              8             9]
                               6]            -4]

Then Ab, bc', b'c, and d'Ab are typical products.

    A b = [1(7) + (-2)(-3) + 3(6)     [31
           2(7) + 4(-3) + (-1)(6)] =   -4]

The product Ab is a vector with dimension equal to the number of rows of A.

    b'c = [7  -3  6] [ 5   = [7(5) + (-3)(8) + 6(-4)] = [-13]
                       8
                      -4]

The product b'c is a 1 × 1 vector or a single number.

    b c' = [ 7  [5  8  -4] = [ 35   56  -28
            -3                -15  -24   12
             6]                30   48  -24]

The product bc' is a matrix whose row dimension equals the dimension of b and whose column dimension equals that of c. This product is unlike b'c, which is a single number.

    d'A b = [2  9] [31   = [2(31) + 9(-4)] = [26]
                    -4]
Square matrices will be of special importance in our development of statistical methods. A square matrix is said to be symmetric if A = A' or a_ij = a_ji for all i and j.

Example 2.7 (A symmetric matrix) The matrix

    [3  5
     5 -2]

is symmetric; the matrix

    [3  6
     4 -2]

is not symmetric.
When two square matrices A and B are of the same dimension, both products AB and BA are defined, although they need not be equal. (See Supplement 2A.) If we let I denote the square matrix with ones on the diagonal and zeros elsewhere, it follows from the definition of matrix multiplication that the (i, j)th entry of AI is

    a_i1 × 0 + ... + a_i,j-1 × 0 + a_ij × 1 + a_i,j+1 × 0 + ... + a_ik × 0 = a_ij

so AI = A. Similarly, IA = A, so

    I A = A I = A    for any A    (2-11)
 (k×k)(k×k) (k×k)(k×k) (k×k)        (k×k)

The matrix I acts like 1 in ordinary multiplication (1·a = a·1 = a), so it is called the identity matrix. The fundamental scalar relation about the existence of an inverse number a^{-1} such that a^{-1}a = aa^{-1} = 1 if a ≠ 0 has the following matrix algebra extension: If there exists a matrix B such that

    B A = A B = I
 (k×k)(k×k) (k×k)(k×k) (k×k)

then B is called the inverse of A and is denoted by A^{-1}. The technical condition that an inverse exists is that the k columns a_1, a_2, ..., a_k of A are linearly independent. That is, the existence of A^{-1} is equivalent to

    c_1 a_1 + c_2 a_2 + ... + c_k a_k = 0  only if  c_1 = ... = c_k = 0    (2-12)

(See Result 2A.9 in Supplement 2A.)
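Condition (2-12) and the inverse itself are easy to check numerically. A minimal NumPy sketch (the 2 × 2 matrix is arbitrary, not the one used in the examples):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# The columns are linearly independent exactly when A has full rank,
# so A^{-1} exists, per condition (2-12).
assert np.linalg.matrix_rank(A) == 2

A_inv = np.linalg.inv(A)
# Confirm that A A^{-1} and A^{-1} A both equal the identity matrix.
ok = (np.allclose(A @ A_inv, np.eye(2)) and
      np.allclose(A_inv @ A, np.eye(2)))
```

Checking both products against I is a cheap safeguard against numerical problems when an inverse is produced by software.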
Example 2.8 (The existence of a matrix inverse) For

    A = [3  2
         4  1]

you may verify that

    [-.2   .4  [3  2     [(-.2)3 + (.4)4   (-.2)2 + (.4)1      [1  0
     .8  -.6]  4  1]  =   (.8)3 + (-.6)4    (.8)2 + (-.6)1] =   0  1]

so

    [-.2   .4
     .8  -.6]

is A^{-1}. We note that

    c_1 [3  + c_2 [2  = [0
         4]        1]    0]
implies that c_1 = c_2 = 0, so the columns of A are linearly independent. This confirms the condition stated in (2-12).

A method for computing an inverse, when one exists, is given in Supplement 2A. The routine, but lengthy, calculations are usually relegated to a computer, especially when the dimension is greater than three. Even so, you must be forewarned that if the column sum in (2-12) is nearly 0 for some constants c_1, ..., c_k, then the computer may produce incorrect inverses due to extreme errors in rounding. It is always good to check the products AA^{-1} and A^{-1}A for equality with I when A^{-1} is produced by a computer package. (See Exercise 2.10.)

Diagonal matrices have inverses that are easy to compute. For example,

    [a_11   0     0     0     0
      0   a_22    0     0     0
      0     0   a_33    0     0
      0     0     0   a_44    0
      0     0     0     0   a_55]

has inverse

    [1/a_11    0       0       0       0
       0    1/a_22     0       0       0
       0       0    1/a_33     0       0
       0       0       0    1/a_44     0
       0       0       0       0    1/a_55]

if all the a_ii ≠ 0.

Another special class of square matrices with which we shall become familiar are the orthogonal matrices, characterized by

    Q Q' = Q'Q = I    or    Q' = Q^{-1}    (2-13)
The name derives from the property that if Q has ith row q_i', then QQ' = I implies that q_i'q_i = 1 and q_i'q_j = 0 for i ≠ j, so the rows have unit length and are mutually perpendicular (orthogonal). According to the condition Q'Q = I, the columns have the same property.

We conclude our brief introduction to the elements of matrix algebra by introducing a concept fundamental to multivariate statistical analysis. A square matrix A is said to have an eigenvalue λ, with corresponding eigenvector x ≠ 0, if

    Ax = λx    (2-14)
Ordinarily, we normalize x so that it has length unity; that is, 1 = x'x. It is convenient to denote normalized eigenvectors by e, and we do so in what follows. Sparing you the details of the derivation (see [1]), we state the following basic result: Let A be a k × k square symmetric matrix. Then A has k pairs of eigenvalues and eigenvectors, namely,

    λ_1, e_1    λ_2, e_2    ...    λ_k, e_k    (2-15)

The eigenvectors can be chosen to satisfy 1 = e_1'e_1 = ... = e_k'e_k and be mutually perpendicular. The eigenvectors are unique unless two or more eigenvalues are equal.

Example 2.9 (Verifying eigenvalues and eigenvectors) Let

    A = [ 1  -5
         -5   1]

Then, since

    [ 1  -5  [ 1/sqrt(2)       [ 1/sqrt(2)
     -5   1]  -1/sqrt(2)]  = 6  -1/sqrt(2)]

λ = 6 is an eigenvalue, and

    e_1 = [ 1/sqrt(2)
           -1/sqrt(2)]

is its corresponding normalized eigenvector. You may wish to show that a second eigenvalue-eigenvector pair is λ = -4, e_2' = [1/sqrt(2), 1/sqrt(2)].

A method for calculating the λ's and e's is described in Supplement 2A. It is instructive to do a few sample calculations to understand the technique. We usually rely on a computer when the dimension of the square matrix is greater than two or three.
2.3 Positive Definite Matrices

Quadratic forms play a central role in multivariate analysis. In this section, we consider quadratic forms that are always nonnegative and the associated positive definite matrices.

Results involving quadratic forms and symmetric matrices are, in many cases, a direct consequence of an expansion for symmetric matrices known as the spectral decomposition. The spectral decomposition of a k × k symmetric matrix A is given by

    A = λ_1 e_1 e_1' + λ_2 e_2 e_2' + ... + λ_k e_k e_k'    (2-16)
 (k×k)   (k×1)(1×k)     (k×1)(1×k)           (k×1)(1×k)

where λ_1, λ_2, ..., λ_k are the eigenvalues of A and e_1, e_2, ..., e_k are the associated normalized eigenvectors. (See also Result 2A.14 in Supplement 2A.) Thus, e_i'e_i = 1 for i = 1, 2, ..., k, and e_i'e_j = 0 for i ≠ j.
Example 2.10 (The spectral decomposition of a matrix) Consider the symmetric matrix

    A = [13  -4   2
         -4  13  -2
          2  -2  10]

The eigenvalues obtained from the characteristic equation |A - λI| = 0 are λ_1 = 9, λ_2 = 9, and λ_3 = 18 (Definition 2A.30). The corresponding normalized eigenvectors e_1, e_2, and e_3 are the solutions of the equations A e_i = λ_i e_i, with e_i'e_i = 1, for i = 1, 2, 3; they may be taken as

    e_1' = [1/sqrt(2),  1/sqrt(2),  0]
    e_2' = [1/sqrt(18), -1/sqrt(18), -4/sqrt(18)]
    e_3' = [2/3, -2/3, 1/3]

The spectral decomposition (2-16) of A is then A = 9 e_1 e_1' + 9 e_2 e_2' + 18 e_3 e_3', or

    [13  -4   2       [1/2  1/2  0       [ 1/18  -1/18  -4/18        [ 4/9  -4/9   2/9
     -4  13  -2  = 9   1/2  1/2  0  + 9   -1/18   1/18   4/18  + 18   -4/9   4/9  -2/9
      2  -2  10]        0    0   0]       -4/18   4/18  16/18]         2/9  -2/9   1/9]

as you may readily verify.
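The spectral decomposition can be confirmed numerically by rebuilding A from its eigenpairs. A NumPy sketch using the matrix of Example 2.10:

```python
import numpy as np

A = np.array([[13.0, -4.0,  2.0],
              [-4.0, 13.0, -2.0],
              [ 2.0, -2.0, 10.0]])

lams, E = np.linalg.eigh(A)   # eigenvalues ascending, eigenvectors in columns

# Spectral decomposition (2-16): A = sum_i lambda_i e_i e_i'
A_rebuilt = sum(lam * np.outer(e, e) for lam, e in zip(lams, E.T))
```

The sum of the three rank-one terms reproduces A exactly (to floating-point precision).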
The spectral decomposition is an important analytical tool. With it, we are very easily able to demonstrate certain statistical results. The first of these is a matrix explanation of distance, which we now develop.

Because x'Ax has only squared terms x_i^2 and product terms x_i x_k, it is called a quadratic form. When a k × k symmetric matrix A is such that

    0 ≤ x'Ax    (2-17)

for all x' = [x_1, x_2, ..., x_k], both the matrix A and the quadratic form are said to be nonnegative definite. If equality holds in (2-17) only for the vector x' = [0, 0, ..., 0], then A or the quadratic form is said to be positive definite. In other words, A is positive definite if

    0 < x'Ax    (2-18)

for all vectors x ≠ 0.
Example 2.11 (A positive definite matrix and quadratic form) Show that the matrix for the following quadratic form is positive definite:

    3x_1^2 + 2x_2^2 - 2 sqrt(2) x_1 x_2

To illustrate the general approach, we first write the quadratic form in matrix notation as

    [x_1  x_2] [    3      -sqrt(2)  [x_1   = x'Ax
                -sqrt(2)       2  ]   x_2]

By Definition 2A.30, the eigenvalues of A are the solutions of the equation |A - λI| = 0, or (3 - λ)(2 - λ) - 2 = 0. The solutions are λ_1 = 4 and λ_2 = 1. Using the spectral decomposition in (2-16), we can write

    A = λ_1 e_1 e_1' + λ_2 e_2 e_2' = 4 e_1 e_1' + e_2 e_2'
 (2×2)   (2×1)(1×2)     (2×1)(1×2)    (2×1)(1×2)   (2×1)(1×2)

where e_1 and e_2 are the normalized and orthogonal eigenvectors associated with the eigenvalues λ_1 = 4 and λ_2 = 1, respectively. Because 4 and 1 are scalars, premultiplication and postmultiplication of A by x' and x, respectively, where x' = [x_1, x_2] is any nonzero vector, give

    x'A x = 4 x'e_1 e_1'x + x'e_2 e_2'x = 4y_1^2 + y_2^2 ≥ 0

with

    y_1 = x'e_1 = e_1'x    and    y_2 = x'e_2 = e_2'x

We now show that y_1 and y_2 are not both zero and, consequently, that x'Ax = 4y_1^2 + y_2^2 > 0, or A is positive definite.

From the definitions of y_1 and y_2, we have

    y    = [y_1   = [e_1'    x    = E x
 (2×1)      y_2]     e_2'] (2×1) (2×2)(2×1)

Now E is an orthogonal matrix and hence has inverse E'. Thus, x = E'y. But x is a nonzero vector, and 0 ≠ x = E'y implies that y ≠ 0.

Using the spectral decomposition, we can easily show that a k × k symmetric matrix A is a positive definite matrix if and only if every eigenvalue of A is positive. (See Exercise 2.17.) A is a nonnegative definite matrix if and only if all of its eigenvalues are greater than or equal to zero.

Assume for the moment that the p elements x_1, x_2, ..., x_p of a vector x are realizations of p random variables X_1, X_2, ..., X_p. As we pointed out in Chapter 1,
we can regard these elements as the coordinates of a point in p-dimensional space, and the "distance" of the point [x_1, x_2, ..., x_p]' to the origin can, and in this case should, be interpreted in terms of standard deviation units. In this way, we can account for the inherent uncertainty (variability) in the observations. Points with the same associated "uncertainty" are regarded as being at the same distance from the origin.

If we use the distance formula introduced in Chapter 1 [see Equation (1-22)], the distance from the origin satisfies the general formula

    (distance)^2 = a_11 x_1^2 + a_22 x_2^2 + ... + a_pp x_p^2
                   + 2(a_12 x_1 x_2 + a_13 x_1 x_3 + ... + a_{p-1,p} x_{p-1} x_p)

provided that (distance)^2 > 0 for all [x_1, x_2, ..., x_p] ≠ [0, 0, ..., 0]. Setting a_ij = a_ji, i ≠ j, i = 1, 2, ..., p, j = 1, 2, ..., p, we have

    0 < (distance)^2 = [x_1, x_2, ..., x_p] [a_11  a_12  ...  a_1p   [x_1
                                             a_21  a_22  ...  a_2p    x_2
                                              ...   ...  ...   ...     ...
                                             a_p1  a_p2  ...  a_pp]   x_p]    (2-19)

for x ≠ 0.
From (2-19), we see that the p × p symmetric matrix A is positive definite. In sum, distance is determined from a positive definite quadratic form x'Ax. Conversely, a positive definite quadratic form can be interpreted as a squared distance.
Comment. Let the square of the distance from the point x' = [x_1, x_2, ..., x_p] to the origin be given by x'Ax, where A is a p × p symmetric positive definite matrix. Then the square of the distance from x to an arbitrary fixed point μ' = [μ_1, μ_2, ..., μ_p] is given by the general expression (x - μ)'A(x - μ).
Expressing distance as the square root of a positive definite quadratic form allows us to give a geometrical interpretation based on the eigenvalues and eigenvectors of the matrix A. For example, suppose p = 2. Then the points x' = [x_1, x_2] of constant distance c from the origin satisfy

    x'Ax = a_11 x_1^2 + a_22 x_2^2 + 2 a_12 x_1 x_2 = c^2

By the spectral decomposition, as in Example 2.11,

    A = λ_1 e_1 e_1' + λ_2 e_2 e_2'    so    x'Ax = λ_1 (x'e_1)^2 + λ_2 (x'e_2)^2

Now, c^2 = λ_1 y_1^2 + λ_2 y_2^2 is an ellipse in y_1 = x'e_1 and y_2 = x'e_2 because λ_1, λ_2 > 0 when A is positive definite. (See Exercise 2.17.) We easily verify that x = c λ_1^{-1/2} e_1 satisfies x'Ax = λ_1 (c λ_1^{-1/2} e_1'e_1)^2 = c^2. Similarly, x = c λ_2^{-1/2} e_2 gives the appropriate distance in the e_2 direction. Thus, the points at distance c lie on an ellipse whose axes are given by the eigenvectors of A, with lengths proportional to the reciprocals of the square roots of the eigenvalues. The constant of proportionality is c. The situation is illustrated in Figure 2.6.
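This geometry can be confirmed numerically: each point c λ_i^{-1/2} e_i lies on the contour x'Ax = c². A NumPy sketch using the matrix of Example 2.11 (the constant c is arbitrary):

```python
import numpy as np

A = np.array([[3.0, -np.sqrt(2.0)],
              [-np.sqrt(2.0), 2.0]])   # matrix of Example 2.11
c = 2.0                                # arbitrary constant distance

lams, E = np.linalg.eigh(A)            # eigenvalues in ascending order

# Each point x = c * lambda^{-1/2} e lies on the ellipse x'Ax = c^2,
# so the half-axis length in direction e_i is c / sqrt(lambda_i).
on_ellipse = [float((c / np.sqrt(lam) * e) @ A @ (c / np.sqrt(lam) * e))
              for lam, e in zip(lams, E.T)]
```

Both axis endpoints evaluate to c², as claimed.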
If p > 2, the points x' = [x_1, x_2, ..., x_p] a constant distance c = sqrt(x'Ax) from the origin lie on hyperellipsoids c^2 = λ_1 (x'e_1)^2 + ... + λ_p (x'e_p)^2, whose axes are given by the eigenvectors of A. The half-length in the direction e_i is equal to c/sqrt(λ_i), i = 1, 2, ..., p, where λ_1, λ_2, ..., λ_p are the eigenvalues of A.

2.4 A Square-Root Matrix

The spectral decomposition allows us to express the inverse of a square matrix in terms of its eigenvalues and eigenvectors, and this leads to a useful square-root matrix. Let A be a k × k positive definite matrix with the spectral decomposition

    A = Σ_{i=1}^{k} λ_i e_i e_i'    (2-20)
 (k×k)      (k×1)(1×k)

Let the normalized eigenvectors be the columns of another matrix P = [e_1, e_2, ..., e_k]. Then

    A = P Λ P'
 (k×k) (k×k)(k×k)(k×k)

where P P' = P'P = I and Λ is the diagonal matrix

    Λ    = [λ_1   0  ...   0
 (k×k)       0  λ_2  ...   0
            ...  ... ...  ...
             0    0  ...  λ_k]    with λ_i > 0
Thus,

    A^{-1} = P Λ^{-1} P' = Σ_{i=1}^{k} (1/λ_i) e_i e_i'    (2-21)
 (k×k)

since (P Λ^{-1} P')(P Λ P') = (P Λ P')(P Λ^{-1} P') = P P' = I.

Next, let Λ^{1/2} denote the diagonal matrix with sqrt(λ_i) as the ith diagonal element. The square-root matrix, of a positive definite matrix A,

    A^{1/2} = Σ_{i=1}^{k} sqrt(λ_i) e_i e_i' = P Λ^{1/2} P'    (2-22)

has the following properties:

1. (A^{1/2})' = A^{1/2} (that is, A^{1/2} is symmetric).
2. A^{1/2} A^{1/2} = A.
3. (A^{1/2})^{-1} = Σ_{i=1}^{k} (1/sqrt(λ_i)) e_i e_i' = P Λ^{-1/2} P', where Λ^{-1/2} is a diagonal matrix with 1/sqrt(λ_i) as the ith diagonal element.
4. A^{1/2} A^{-1/2} = A^{-1/2} A^{1/2} = I, and A^{-1/2} A^{-1/2} = A^{-1}, where A^{-1/2} = (A^{1/2})^{-1}.    (2-23)
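The square-root construction (2-22) and its properties are straightforward to verify in code. A NumPy sketch with an arbitrary positive definite matrix (not from the text):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # positive definite: eigenvalues 1 and 3

lams, P = np.linalg.eigh(A)
A_half = P @ np.diag(np.sqrt(lams)) @ P.T             # A^{1/2} = P Lambda^{1/2} P'
A_half_inv = P @ np.diag(1.0 / np.sqrt(lams)) @ P.T   # A^{-1/2} = P Lambda^{-1/2} P'
```

The assertions below check properties 1, 2, and 4 directly.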
2.5 Random Vectors and Matrices

A random vector is a vector whose elements are random variables. Similarly, a random matrix is a matrix whose elements are random variables. The expected value of a random matrix (or vector) is the matrix (vector) consisting of the expected values of each of its elements. Specifically, let X = {X_ij} be an n × p random matrix. Then the expected value of X, denoted by E(X), is the n × p matrix of numbers (if they exist) with (i, j) element

    E(X_ij) = ∫_{-∞}^{∞} x_ij f_ij(x_ij) dx_ij    if X_ij is a continuous random variable with
                                                  probability density function f_ij(x_ij)

              Σ_{all x_ij} x_ij p_ij(x_ij)        if X_ij is a discrete random variable with
                                                  probability function p_ij(x_ij)
Example 2.12 (Computing expected values for discrete random variables) Suppose p = 2 and n = 1, and consider the random vector X' = [X_1, X_2]. Let the discrete random variable X_1 have the following probability function:

    x_1:       -1    0    1
    p_1(x_1):  .3   .3   .4

Then E(X_1) = Σ_{all x_1} x_1 p_1(x_1) = (-1)(.3) + (0)(.3) + (1)(.4) = .1.

Similarly, let the discrete random variable X_2 have the probability function

    x_2:        0    1
    p_2(x_2):  .8   .2

Then E(X_2) = Σ_{all x_2} x_2 p_2(x_2) = (0)(.8) + (1)(.2) = .2.

Thus,

    E(X) = [E(X_1)   = [.1
            E(X_2)]     .2]
Two results involving the expectation of sums and products of matrices follow directly from the definition of the expected value of a random matrix and the univariate properties of expectation, E(X_1 + Y_1) = E(X_1) + E(Y_1) and E(cX_1) = cE(X_1). Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants. Then (see Exercise 2.40)

    E(X + Y) = E(X) + E(Y)
    E(AXB) = A E(X) B    (2-24)

² If you are unfamiliar with calculus, you should concentrate on the interpretation of the expected value and, eventually, variance. Our development is based primarily on the properties of expectation rather than its particular evaluation for continuous or discrete random variables.
2.6 Mean Vectors and Covariance Matrices

Suppose X' = [X_1, X_2, ..., X_p] is a p × 1 random vector. Then each element of X is a random variable with its own marginal probability distribution. The marginal mean μ_i and variance σ_i^2 are defined as μ_i = E(X_i) and σ_i^2 = E(X_i - μ_i)^2, i = 1, 2, ..., p. Specifically,

    μ_i   = ∫_{-∞}^{∞} x_i f_i(x_i) dx_i              if X_i is a continuous random variable with
                                                       probability density function f_i(x_i)
            Σ_{all x_i} x_i p_i(x_i)                   if X_i is a discrete random variable with
                                                       probability function p_i(x_i)
                                                                                          (2-25)
    σ_i^2 = ∫_{-∞}^{∞} (x_i - μ_i)^2 f_i(x_i) dx_i    if X_i is a continuous random variable with
                                                       probability density function f_i(x_i)
            Σ_{all x_i} (x_i - μ_i)^2 p_i(x_i)         if X_i is a discrete random variable with
                                                       probability function p_i(x_i)

It will be convenient in later sections to denote the marginal variances by σ_ii rather than the more traditional σ_i^2, and consequently, we shall adopt this notation.

The behavior of any pair of random variables, such as X_i and X_k, is described by their joint probability function, and a measure of the linear association between them is provided by the covariance

    σ_ik = E(X_i - μ_i)(X_k - μ_k)

         = ∫_{-∞}^{∞} ∫_{-∞}^{∞} (x_i - μ_i)(x_k - μ_k) f_ik(x_i, x_k) dx_i dx_k
               if X_i, X_k are continuous random variables with
               joint density function f_ik(x_i, x_k)
                                                                                          (2-26)
           Σ_{all x_i} Σ_{all x_k} (x_i - μ_i)(x_k - μ_k) p_ik(x_i, x_k)
               if X_i, X_k are discrete random variables with
               joint probability function p_ik(x_i, x_k)
and μ_i and μ_k, i, k = 1, 2, ..., p, are the marginal means. When i = k, the covariance becomes the marginal variance.

More generally, the collective behavior of the p random variables X_1, X_2, ..., X_p or, equivalently, the random vector X' = [X_1, X_2, ..., X_p], is described by a joint probability density function f(x_1, x_2, ..., x_p) = f(x). As we have already noted in this book, f(x) will often be the multivariate normal density function. (See Chapter 4.)

If the joint probability P[X_i ≤ x_i and X_k ≤ x_k] can be written as the product of the corresponding marginal probabilities, so that

    P[X_i ≤ x_i and X_k ≤ x_k] = P[X_i ≤ x_i] P[X_k ≤ x_k]    (2-27)
for all pairs of values x_i, x_k, then X_i and X_k are said to be statistically independent. When X_i and X_k are continuous random variables with joint density f_ik(x_i, x_k) and marginal densities f_i(x_i) and f_k(x_k), the independence condition becomes

    f_ik(x_i, x_k) = f_i(x_i) f_k(x_k)

for all pairs (x_i, x_k).

The p continuous random variables X_1, X_2, ..., X_p are mutually statistically independent if their joint density can be factored as

    f_{12...p}(x_1, x_2, ..., x_p) = f_1(x_1) f_2(x_2) ... f_p(x_p)    (2-28)

for all p-tuples (x_1, x_2, ..., x_p).

Statistical independence has an important implication for covariance. The factorization in (2-28) implies that Cov(X_i, X_k) = 0. Thus,

    Cov(X_i, X_k) = 0    if X_i and X_k are independent    (2-29)

The converse of (2-29) is not true in general; there are situations where Cov(X_i, X_k) = 0, but X_i and X_k are not independent. (See [5].)

The means and covariances of the p × 1 random vector X can be set out as matrices. The expected value of each element is contained in the vector of means μ = E(X), and the p variances σ_ii and the p(p - 1)/2 distinct covariances σ_ik (i < k) are contained in the symmetric variance-covariance matrix Σ = E(X - μ)(X - μ)'. Specifically,
    E(X) = [E(X_1)     [μ_1
            E(X_2)  =   μ_2   = μ    (2-30)
             ...        ...
            E(X_p)]     μ_p]

and

    Σ = E(X - μ)(X - μ)'

      = E [ (X_1 - μ_1)^2           (X_1 - μ_1)(X_2 - μ_2)  ...  (X_1 - μ_1)(X_p - μ_p)
            (X_2 - μ_2)(X_1 - μ_1)  (X_2 - μ_2)^2           ...  (X_2 - μ_2)(X_p - μ_p)
             ...                     ...                    ...   ...
            (X_p - μ_p)(X_1 - μ_1)  (X_p - μ_p)(X_2 - μ_2)  ...  (X_p - μ_p)^2          ]

or

    Σ = Cov(X) = [σ_11  σ_12  ...  σ_1p
                  σ_21  σ_22  ...  σ_2p
                   ...   ...  ...   ...
                  σ_p1  σ_p2  ...  σ_pp]    (2-31)
Example 2.13 (Computing the covariance matrix) Find the covariance matrix for the two random variables X_1 and X_2 introduced in Example 2.12 when their joint probability function p_12(x_1, x_2) is represented by the entries in the body of the following table:

    x_1 \ x_2 |   0     1   | p_1(x_1)
    ----------+-------------+---------
       -1     |  .24   .06  |   .3
        0     |  .16   .14  |   .3
        1     |  .40   .00  |   .4
    ----------+-------------+---------
    p_2(x_2)  |  .8    .2   |   1

We have already shown that μ_1 = E(X_1) = .1 and μ_2 = E(X_2) = .2. (See Example 2.12.) In addition,

    σ_11 = E(X_1 - μ_1)^2 = Σ_{all x_1} (x_1 - .1)^2 p_1(x_1)
         = (-1 - .1)^2 (.3) + (0 - .1)^2 (.3) + (1 - .1)^2 (.4) = .69

    σ_22 = E(X_2 - μ_2)^2 = Σ_{all x_2} (x_2 - .2)^2 p_2(x_2)
         = (0 - .2)^2 (.8) + (1 - .2)^2 (.2) = .16

    σ_12 = E(X_1 - μ_1)(X_2 - μ_2) = Σ_{all pairs (x_1, x_2)} (x_1 - .1)(x_2 - .2) p_12(x_1, x_2)
         = (-1 - .1)(0 - .2)(.24) + (-1 - .1)(1 - .2)(.06) + ... + (1 - .1)(1 - .2)(.00)
         = -.08

    σ_21 = E(X_2 - μ_2)(X_1 - μ_1) = σ_12 = -.08

Consequently, with X' = [X_1, X_2],

    μ = E(X) = [E(X_1)   = [.1
                E(X_2)]     .2]

and

    Σ = E(X - μ)(X - μ)' = [σ_11  σ_12   = [ .69  -.08
                            σ_21  σ_22]    -.08   .16]
We note that the computation of means, variances, and covariances for discrete random variables involves summation (as in Examples 2.12 and 2.13), while analogous computations for continuous random variables involve integration.

Because σ_ik = E(X_i - μ_i)(X_k - μ_k) = σ_ki, it is convenient to write the matrix appearing in (2-31) as

    Σ = E(X - μ)(X - μ)' = [σ_11  σ_12  ...  σ_1p
                            σ_12  σ_22  ...  σ_2p
                             ...   ...  ...   ...
                            σ_1p  σ_2p  ...  σ_pp]    (2-32)

We shall refer to μ and Σ as the population mean (vector) and population variance-covariance (matrix), respectively.

The multivariate normal distribution is completely specified once the mean vector μ and variance-covariance matrix Σ are given (see Chapter 4), so it is not surprising that these quantities play an important role in many multivariate procedures.

It is frequently informative to separate the information contained in variances σ_ii from that contained in measures of association and, in particular, the measure of association known as the population correlation coefficient ρ_ik. The correlation coefficient ρ_ik is defined in terms of the covariance σ_ik and variances σ_ii and σ_kk as

    ρ_ik = σ_ik / (sqrt(σ_ii) sqrt(σ_kk))    (2-33)
The correlation coefficient measures the amount of linear association between the random variables X_i and X_k. (See, for example, [5].)

Let the population correlation matrix be the p × p symmetric matrix

    ρ = [σ_11/(sqrt(σ_11) sqrt(σ_11))  σ_12/(sqrt(σ_11) sqrt(σ_22))  ...  σ_1p/(sqrt(σ_11) sqrt(σ_pp))
         σ_12/(sqrt(σ_11) sqrt(σ_22))  σ_22/(sqrt(σ_22) sqrt(σ_22))  ...  σ_2p/(sqrt(σ_22) sqrt(σ_pp))
          ...                           ...                          ...   ...
         σ_1p/(sqrt(σ_11) sqrt(σ_pp))  σ_2p/(sqrt(σ_22) sqrt(σ_pp))  ...  σ_pp/(sqrt(σ_pp) sqrt(σ_pp))]

      = [1     ρ_12  ...  ρ_1p
         ρ_12  1     ...  ρ_2p
          ...   ...  ...   ...
         ρ_1p  ρ_2p  ...  1   ]    (2-34)

and let the p × p standard deviation matrix be

    V^{1/2} = [sqrt(σ_11)     0       ...      0
                  0       sqrt(σ_22)  ...      0
                 ...          ...     ...     ...
                  0           0       ...  sqrt(σ_pp)]    (2-35)

Then it is easily verified (see Exercise 2.23) that

    V^{1/2} ρ V^{1/2} = Σ    (2-36)

and

    ρ = (V^{1/2})^{-1} Σ (V^{1/2})^{-1}    (2-37)

That is, Σ can be obtained from V^{1/2} and ρ, whereas ρ can be obtained from Σ. Moreover, the expression of these relationships in terms of matrix operations allows the calculations to be conveniently implemented on a computer.
Example 2.14 (Computing the correlation matrix from the covariance matrix) Suppose

    Σ = [4   1   2
         1   9  -3
         2  -3  25]

Obtain V^{1/2} and ρ.

Here

    V^{1/2} = [sqrt(σ_11)     0          0         [2  0  0
                  0       sqrt(σ_22)     0      =   0  3  0
                  0           0      sqrt(σ_33)]    0  0  5]

and

    (V^{1/2})^{-1} = [1/2   0    0
                       0   1/3   0
                       0    0   1/5]

Consequently, from (2-37), the correlation matrix ρ is given by

    ρ = (V^{1/2})^{-1} Σ (V^{1/2})^{-1}

      = [1/2   0    0  [4   1   2  [1/2   0    0     [1     1/6   1/5
          0   1/3   0   1   9  -3    0   1/3   0  =   1/6   1    -1/5
          0    0   1/5]  2  -3  25]   0    0   1/5]   1/5  -1/5   1  ]
Partitioning the Covariance Matrix

Often, the characteristics measured on individual trials will fall naturally into two or more groups. Let the overall collection of characteristics be partitioned so that

    X    = [X_1, ..., X_q | X_{q+1}, ..., X_p]' = [X^(1)    } q
 (p×1)                                             X^(2)]   } p - q

and

    μ = E(X) = [μ_1, ..., μ_q | μ_{q+1}, ..., μ_p]' = [μ^(1)
                                                       μ^(2)]    (2-38)

From the definitions of the transpose and matrix multiplication,

    (X^(1) - μ^(1)) (X^(2) - μ^(2))'
      (q×1)           (1×(p-q))

      = [(X_1 - μ_1)(X_{q+1} - μ_{q+1})  (X_1 - μ_1)(X_{q+2} - μ_{q+2})  ...  (X_1 - μ_1)(X_p - μ_p)
         (X_2 - μ_2)(X_{q+1} - μ_{q+1})  (X_2 - μ_2)(X_{q+2} - μ_{q+2})  ...  (X_2 - μ_2)(X_p - μ_p)
          ...                             ...                            ...   ...
         (X_q - μ_q)(X_{q+1} - μ_{q+1})  (X_q - μ_q)(X_{q+2} - μ_{q+2})  ...  (X_q - μ_q)(X_p - μ_p)]

Upon taking the expectation of the matrix (X^(1) - μ^(1))(X^(2) - μ^(2))', we get

    E(X^(1) - μ^(1))(X^(2) - μ^(2))' = [σ_1,q+1  σ_1,q+2  ...  σ_1p
                                        σ_2,q+1  σ_2,q+2  ...  σ_2p
                                          ...      ...    ...   ...
                                        σ_q,q+1  σ_q,q+2  ...  σ_qp]  = Σ_12    (2-39)

which gives all the covariances σ_ij, i = 1, 2, ..., q, j = q + 1, q + 2, ..., p, between a component of X^(1) and a component of X^(2). Note that the matrix Σ_12 is not necessarily symmetric or even square.

Making use of the partitioning in Equation (2-38), we can easily demonstrate that

    Σ    = E(X - μ)(X - μ)'
 (p×p)

      = [σ_11 ... σ_1q | σ_1,q+1 ... σ_1p
          ...          |   ...               } q
         σ_q1 ... σ_qq | σ_q,q+1 ... σ_qp
         --------------------------------------
         σ_q+1,1 ... σ_q+1,q | σ_q+1,q+1 ... σ_q+1,p
          ...                |   ...                  } p - q
         σ_p1    ... σ_pq    | σ_p,q+1   ... σ_pp   ]

      = [Σ_11  Σ_12
         Σ_21  Σ_22]    (2-40)

Note that Σ_12 = Σ_21'. The covariance matrix of X^(1) is Σ_11, that of X^(2) is Σ_22, and that of elements from X^(1) and X^(2) is Σ_12 (or Σ_21). It is sometimes convenient to use the Cov(X^(1), X^(2)) notation where

    Cov(X^(1), X^(2)) = Σ_12

is a matrix containing all of the covariances between a component of X^(1) and a component of X^(2).
The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables
Recall that if a single random variable, such as X_1, is multiplied by a constant c, then

    E(cX_1) = cE(X_1) = cμ_1

and

    Var(cX_1) = E(cX_1 - cμ_1)^2 = c^2 Var(X_1) = c^2 σ_11

If X_2 is a second random variable and a and b are constants, then, using additional properties of expectation, we get

    Cov(aX_1, bX_2) = E(aX_1 - aμ_1)(bX_2 - bμ_2) = ab E(X_1 - μ_1)(X_2 - μ_2)
                    = ab Cov(X_1, X_2) = ab σ_12

Finally, for the linear combination aX_1 + bX_2, we have

    E(aX_1 + bX_2) = aE(X_1) + bE(X_2) = aμ_1 + bμ_2

    Var(aX_1 + bX_2) = E[(aX_1 + bX_2) - (aμ_1 + bμ_2)]^2
                     = E[a(X_1 - μ_1) + b(X_2 - μ_2)]^2
                     = E[a^2 (X_1 - μ_1)^2 + b^2 (X_2 - μ_2)^2 + 2ab(X_1 - μ_1)(X_2 - μ_2)]
                     = a^2 Var(X_1) + b^2 Var(X_2) + 2ab Cov(X_1, X_2)    (2-41)
With c' = [a, b], aX_1 + bX_2 can be written as

    [a  b] [X_1   = c'X
            X_2]

Similarly, E(aX_1 + bX_2) = aμ_1 + bμ_2 can be written as

    [a  b] [μ_1   = c'μ
            μ_2]

If we let

    Σ = [σ_11  σ_12
         σ_12  σ_22]

be the variance-covariance matrix of X, Equation (2-41) becomes

    Var(aX_1 + bX_2) = Var(c'X) = c'Σc    (2-42)

since

    c'Σc = [a  b] [σ_11  σ_12  [a   = a^2 σ_11 + 2ab σ_12 + b^2 σ_22
                   σ_12  σ_22]  b]
The preceding results can be extended to a linear combination of p random variables: The linear combination c'X = c_1 X_1 + ... + c_p X_p has

    mean = E(c'X) = c'μ
    variance = Var(c'X) = c'Σc    (2-43)

where μ = E(X) and Σ = Cov(X).

In general, consider the q linear combinations of the p random variables X_1, ..., X_p:

    Z_1 = c_11 X_1 + c_12 X_2 + ... + c_1p X_p
    Z_2 = c_21 X_1 + c_22 X_2 + ... + c_2p X_p
     ...
    Z_q = c_q1 X_1 + c_q2 X_2 + ... + c_qp X_p

or

    Z = [Z_1     [c_11  c_12  ...  c_1p   [X_1
         Z_2  =   c_21  c_22  ...  c_2p    X_2   = C X    (2-44)
         ...       ...   ...  ...   ...    ...  (q×p)(p×1)
         Z_q]     c_q1  c_q2  ...  c_qp]   X_p]

The linear combinations Z = CX have

    μ_Z = E(Z) = E(CX) = Cμ_X
    Σ_Z = Cov(Z) = Cov(CX) = CΣ_X C'    (2-45)

where μ_X and Σ_X are the mean vector and variance-covariance matrix of X, respectively. (See Exercise 2.28 for the computation of the off-diagonal terms in CΣ_X C'.)

We shall rely heavily on the result in (2-45) in our discussions of principal components and factor analysis in Chapters 8 and 9.
Example 2.15 (Means and covariances of linear combinations) Let X' = [X_1, X_2] be a random vector with mean vector μ_X' = [μ_1, μ_2] and variance-covariance matrix

    Σ_X = [σ_11  σ_12
           σ_12  σ_22]

Find the mean vector and covariance matrix for the linear combinations

    Z_1 = X_1 - X_2
    Z_2 = X_1 + X_2

or

    Z = [Z_1   = [1  -1  [X_1   = C X
         Z_2]     1   1]  X_2]

in terms of μ_X and Σ_X. Here

    μ_Z = E(Z) = Cμ_X = [1  -1  [μ_1   = [μ_1 - μ_2
                         1   1]  μ_2]     μ_1 + μ_2]

and

    Σ_Z = Cov(Z) = CΣ_X C' = [1  -1  [σ_11  σ_12  [ 1  1
                              1   1]  σ_12  σ_22]  -1  1]

        = [σ_11 - 2σ_12 + σ_22    σ_11 - σ_22
           σ_11 - σ_22            σ_11 + 2σ_12 + σ_22]

Note that if σ_11 = σ_22, that is, if X_1 and X_2 have equal variances, the off-diagonal terms in Σ_Z vanish. This demonstrates the well-known result that the sum and difference of two random variables with identical variances are uncorrelated.
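Property (2-45) holds exactly for sample covariance matrices as well, which gives a quick numerical check. A NumPy sketch with simulated data (the simulation itself is not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))         # 500 simulated draws on X' = [X1, X2]

C = np.array([[1.0, -1.0],            # Z1 = X1 - X2
              [1.0,  1.0]])           # Z2 = X1 + X2

Z = X @ C.T                           # each row is z' = (C x)'
cov_Z = np.cov(Z, rowvar=False)
cov_rule = C @ np.cov(X, rowvar=False) @ C.T   # C Sigma_X C', as in (2-45)
```

Because the sample covariance is bilinear, the two matrices agree exactly (up to floating-point error), not just approximately.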
Partitioning the Sample Mean Vector and Covariance Matrix

Results for the population mean vector and covariance matrix have sample analogues. Let

    x̄' = [x̄_1, x̄_2, ..., x̄_p]

be the vector of sample averages constructed from n observations on p variables, and let

    S_n = [(1/n) Σ_{j=1}^{n} (x_{j1} - x̄_1)^2                  ...  (1/n) Σ_{j=1}^{n} (x_{j1} - x̄_1)(x_{jp} - x̄_p)
             ...                                               ...   ...
           (1/n) Σ_{j=1}^{n} (x_{jp} - x̄_p)(x_{j1} - x̄_1)     ...  (1/n) Σ_{j=1}^{n} (x_{jp} - x̄_p)^2              ]

be the corresponding sample variance-covariance matrix.

The sample mean vector and the covariance matrix can be partitioned in order to distinguish quantities corresponding to groups of variables. Thus,

    x̄   = [x̄_1, ..., x̄_q | x̄_{q+1}, ..., x̄_p]' = [x̄^(1)
 (p×1)                                              x̄^(2)]    (2-46)

and

    S_n  = [s_11 ... s_1q | s_1,q+1 ... s_1p
 (p×p)       ...          |   ...
            s_q1 ... s_qq | s_q,q+1 ... s_qp
            ------------------------------------
            s_q+1,1 ... s_q+1,q | s_q+1,q+1 ... s_q+1,p
             ...                |   ...
            s_p1    ... s_pq    | s_p,q+1   ... s_pp   ]

         = [S_11  S_12
            S_21  S_22]    (2-47)

where x̄^(1) and x̄^(2) are the sample mean vectors constructed from observations x^(1) = [x_1, ..., x_q]' and x^(2) = [x_{q+1}, ..., x_p]', respectively; S_11 is the sample covariance matrix computed from observations x^(1); S_22 is the sample covariance matrix computed from observations x^(2); and S_12 = S_21' is the sample covariance matrix for elements of x^(1) and elements of x^(2).
2.7 Matrix Inequalities and Maximization

Maximization principles play an important role in several multivariate techniques. The matrix inequalities presented in this section will allow us to derive certain maximization results, which will be referenced in later chapters.

Cauchy-Schwarz Inequality. Let b and d be any two p × 1 vectors. Then

    (b'd)^2 ≤ (b'b)(d'd)    (2-48)

with equality if and only if b = cd (or d = cb) for some constant c.

Proof. The inequality is obvious if either b = 0 or d = 0. Excluding this possibility, consider the vector b - xd, where x is an arbitrary scalar. Since the length of b - xd is positive for b - xd ≠ 0, in this case

    0 < (b - xd)'(b - xd) = b'b - xd'b - b'(xd) + x^2 d'd
                          = b'b - 2x(b'd) + x^2 (d'd)

The last expression is quadratic in x. If we complete the square by adding and subtracting the scalar (b'd)^2/d'd, we get

    0 < b'b - (b'd)^2/d'd + (b'd)^2/d'd - 2x(b'd) + x^2 (d'd)

      = b'b - (b'd)^2/d'd + d'd [x - (b'd)/d'd]^2

The term in brackets is zero if we choose x = b'd/d'd, so we conclude that

    0 < b'b - (b'd)^2/d'd

or (b'd)^2 < (b'b)(d'd) if b ≠ xd for some x.

Note that if b = cd, 0 = (b - cd)'(b - cd), and the same argument produces (b'd)^2 = (b'b)(d'd).
A simple, but important, extension of the Cauchy-Schwarz inequality follows directly.

Extended Cauchy-Schwarz Inequality. Let b and d be any two p × 1 vectors, and let B be a p × p positive definite matrix. Then

    (b'd)^2 ≤ (b'Bb)(d'B^{-1}d)    (2-49)

with equality if and only if b = cB^{-1}d (or d = cBb) for some constant c.

Proof. The inequality is obvious when b = 0 or d = 0. For cases other than these, consider the square-root matrix B^{1/2} defined in terms of its eigenvalues λ_i and the normalized eigenvectors e_i as

    B^{1/2} = Σ_{i=1}^{p} sqrt(λ_i) e_i e_i'

If we set [see also (2-22)]

    B^{-1/2} = Σ_{i=1}^{p} (1/sqrt(λ_i)) e_i e_i'

it follows that

    b'd = b'Id = b'B^{1/2} B^{-1/2} d = (B^{1/2}b)'(B^{-1/2}d)

and the proof is completed by applying the Cauchy-Schwarz inequality to the vectors (B^{1/2}b) and (B^{-1/2}d).

The extended Cauchy-Schwarz inequality gives rise to the following maximization result.
Maximization Lemma. Let B (p × p) be positive definite and d (p × 1) be a given vector. Then, for an arbitrary nonzero vector x (p × 1),

    max_{x ≠ 0} (x'd)^2 / (x'Bx) = d'B^{-1}d    (2-50)

with the maximum attained when x = cB^{-1}d for any constant c ≠ 0.

Proof. By the extended Cauchy-Schwarz inequality, (x'd)^2 ≤ (x'Bx)(d'B^{-1}d). Because x ≠ 0 and B is positive definite, x'Bx > 0. Dividing both sides of the inequality by the positive scalar x'Bx yields the upper bound

    (x'd)^2 / (x'Bx) ≤ d'B^{-1}d

Taking the maximum over x gives Equation (2-50) because the bound is attained for x = cB^{-1}d.

A final maximization result will provide us with an interpretation of eigenvalues.

Maximization of Quadratic Forms for Points on the Unit Sphere. Let B (p × p) be a positive definite matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0 and associated normalized eigenvectors e_1, e_2, ..., e_p. Then

    max_{x ≠ 0} x'Bx / x'x = λ_1    (attained when x = e_1)
                                                                (2-51)
    min_{x ≠ 0} x'Bx / x'x = λ_p    (attained when x = e_p)

Moreover,

    max_{x ⊥ e_1, ..., e_k} x'Bx / x'x = λ_{k+1}    (attained when x = e_{k+1}, k = 1, 2, ..., p - 1)    (2-52)
Proof. Let P (p × p) be the orthogonal matrix whose columns are the eigenvectors e_1, e_2, ..., e_p and Λ be the diagonal matrix with eigenvalues λ_1, λ_2, ..., λ_p along the main diagonal. Let B^{1/2} = PΛ^{1/2}P' [see (2-22)] and y = P'x (p × 1).

Consequently, x ≠ 0 implies y ≠ 0. Thus,

    x'Bx / x'x = x'B^{1/2}B^{1/2}x / (x'PP'x) = y'Λy / y'y
               = Σ_{i=1}^{p} λ_i y_i^2 / Σ_{i=1}^{p} y_i^2
               ≤ λ_1 Σ_{i=1}^{p} y_i^2 / Σ_{i=1}^{p} y_i^2 = λ_1    (2-53)

Setting x = e_1 gives

    y = P'e_1 = [1, 0, ..., 0]'

since

    e_k'e_1 = 1 if k = 1 and 0 if k ≠ 1

For this choice of x, y'Λy / y'y = λ_1 / 1 = λ_1, so the bound is attained. A similar argument produces the second part of (2-51).

Now, x = Py = y_1 e_1 + y_2 e_2 + ... + y_p e_p, so x ⊥ e_1, ..., e_k implies

    0 = e_i'x = y_1 e_i'e_1 + ... + y_p e_i'e_p = y_i,    i ≤ k

Therefore, for x perpendicular to the first k eigenvectors e_i, the left-hand side of the inequality in (2-53) becomes

    x'Bx / x'x = Σ_{i=k+1}^{p} λ_i y_i^2 / Σ_{i=k+1}^{p} y_i^2

Taking y_{k+1} = 1, y_{k+2} = ... = y_p = 0 gives the asserted maximum.

For a fixed x_0 ≠ 0, x_0'Bx_0 / (x_0'x_0) has the same value as x'Bx, where x' = x_0'/sqrt(x_0'x_0) is of unit length. Consequently, Equation (2-51) says that the largest eigenvalue, λ_1, is the maximum value of the quadratic form x'Bx for all points x whose distance from the origin is unity. Similarly, λ_p is the smallest value of the quadratic form for all points x one unit from the origin. The largest and smallest eigenvalues thus represent extreme values of x'Bx for points on the unit sphere. The "intermediate" eigenvalues of the p × p positive definite matrix B also have an interpretation as extreme values when x is further restricted to be perpendicular to the earlier choices.
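The eigenvalue bounds on the quadratic form can be illustrated numerically: for any nonzero x, the Rayleigh quotient x'Bx/x'x falls between the smallest and largest eigenvalues. A NumPy sketch (B is an arbitrary positive definite matrix, not from the text):

```python
import numpy as np

B = np.array([[5.0, 2.0],
              [2.0, 2.0]])            # positive definite; eigenvalues 1 and 6

lams, E = np.linalg.eigh(B)           # ascending order
rng = np.random.default_rng(1)

# Rayleigh quotients x'Bx / x'x for random nonzero x all lie in
# [lambda_min, lambda_max], per (2-51).
ratios = np.array([(x @ B @ x) / (x @ x)
                   for x in rng.normal(size=(1000, 2))])

e_max = E[:, -1]
attained = e_max @ B @ e_max          # equals lambda_max since e'e = 1
```

The maximum value λ_1 is attained exactly at the corresponding unit eigenvector.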
Supplement 2A: Vectors and Matrices: Basic Concepts

Vectors

Definition 2A.1. An m-tuple of real numbers (x_1, x_2, ..., x_m) arranged in a column is called a vector. Vectors are said to be equal if their corresponding entries are the same.
Definition 2A.2 (Scalar multiplication). Let c be an arbitrary scalar. Then the product cx is a vector with ith entry cx_i.

To illustrate scalar multiplication, take c_1 = 5 and c_2 = -1.2. Then, for y' = [1, -2, 2],

    c_1 y = 5 [ 1     [  5
               -2  =   -10
                2]      10]

and

    c_2 y = (-1.2) [ 1     [-1.2
                    -2  =    2.4
                     2]     -2.4]
Definition 2A.3 (Vector addition). The sum of two vectors x and y, each having the same number of entries, is that vector

    z = x + y    with ith entry    z_i = x_i + y_i

Taking the zero vector, 0, to be the m-tuple (0, 0, ..., 0) and the vector -x to be the m-tuple (-x_1, -x_2, ..., -x_m), the two operations of scalar multiplication and vector addition can be combined in a useful manner.
Definition 2A.4. The space of all real m-tuples, with scalar multiplication and vector addition as just defined, is called a vector space.

Definition 2A.5. The vector y = a_1 x_1 + a_2 x_2 + ... + a_k x_k is a linear combination of the vectors x_1, x_2, ..., x_k. The set of all linear combinations of x_1, x_2, ..., x_k is called their linear span.

Definition 2A.6. A set of vectors x_1, x_2, ..., x_k is said to be linearly dependent if there exist k numbers (a_1, a_2, ..., a_k), not all zero, such that

    a_1 x_1 + a_2 x_2 + ... + a_k x_k = 0

Otherwise the set of vectors is said to be linearly independent. For example, for a set of four linearly independent vectors x_1, x_2, x_3, x_4,

    a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 = 0

implies that a_1 = a_2 = a_3 = a_4 = 0.
As another example, suppose the vectors x_1, x_2, x_3 satisfy

    2x_1 - x_2 + 3x_3 = 0

Thus, x_1, x_2, x_3 are a linearly dependent set of vectors, since any one can be written as a linear combination of the others (for example, x_2 = 2x_1 + 3x_3).
Definition 2A.7. Any set of m linearly independent vectors is called a basis for the vector space of all m-tuples of real numbers.

Result 2A.1. Every vector can be expressed as a unique linear combination of a fixed basis. The four linearly independent vectors just considered form a basis when m = 4, and any vector x can be uniquely expressed as a linear combination of them.

A vector consisting of m elements may be regarded geometrically as a point in m-dimensional space. For example, with m = 2, the vector x may be regarded as representing the point in the plane with coordinates x_1 and x_2. Vectors have the geometrical properties of length and direction.
Definition 2A.8. The length of a vector of m elements emanating from the origin is given by the Pythagorean formula:

    length of x = L_x = sqrt(x_1^2 + x_2^2 + ... + x_m^2)
Definition 2A.9. The angle θ between two vectors x and y, both having m entries, is defined from

    cos(θ) = (x_1 y_1 + x_2 y_2 + ... + x_m y_m) / (L_x L_y)

where L_x = length of x and L_y = length of y, x_1, x_2, ..., x_m are the elements of x, and y_1, y_2, ..., y_m are the elements of y.

Let

    x = [-1, 5, 2, -2]'    and    y = [4, -3, 0, 1]'

Then the length of x, the length of y, and the cosine of the angle between the two vectors are

    length of x = sqrt((-1)^2 + 5^2 + 2^2 + (-2)^2) = sqrt(34) = 5.83

    length of y = sqrt(4^2 + (-3)^2 + 0^2 + 1^2) = sqrt(26) = 5.10

and

    cos(θ) = (1/(L_x L_y)) [x_1 y_1 + x_2 y_2 + x_3 y_3 + x_4 y_4]
           = (1/(5.83 × 5.10)) [(-1)4 + 5(-3) + 2(0) + (-2)1]
           = -21/29.73 = -.706

Consequently, θ = 135°.
Definition 2A.10. The inner (or dot) product of two vectors x and y with the same number of entries is defined as the sum of component products:

    x_1 y_1 + x_2 y_2 + ... + x_m y_m

We use the notation x'y or y'x to denote this inner product. With the x'y notation, we may express the length of a vector and the cosine of the angle between two vectors as

    L_x = length of x = sqrt(x_1^2 + x_2^2 + ... + x_m^2) = sqrt(x'x)

    cos(θ) = x'y / (sqrt(x'x) sqrt(y'y))
Definition 2A.11. When the angle between two vectors x, y is θ = 90° or 270°, we say that x and y are perpendicular. Since cos(θ) = 0 only if θ = 90° or 270°, the condition becomes

    x and y are perpendicular if x'y = 0

We write x ⊥ y.
Result 2A.2. (a) z is perpendicular to every vector if and only if z = 0. (b) If z is perpendicular to each vector x1 , x 2, ... , xk, then z is perpendicular to their linear span. (c) Mutually perpendicular vectors are linearly independent. Definition 2A.I2. The projection (or shadow) of a vector x on a vector y is
    projection of x on y = (x'y / L_y^2) y = (x'y / y'y) y

If y1, y2, ..., yr are mutually perpendicular, the projection (or shadow) of a vector x on the linear span of y1, y2, ..., yr is

    (x'y1 / y1'y1) y1 + (x'y2 / y2'y2) y2 + ... + (x'yr / yr'yr) yr
Result 2A.3 (Gram-Schmidt Process). Given linearly independent vectors x1, x2, ..., xk, there exist mutually perpendicular vectors u1, u2, ..., uk with the same linear span. These may be constructed sequentially by setting

    u1 = x1

    u2 = x2 - (x2'u1 / u1'u1) u1

    ...

    uk = xk - sum_{j=1}^{k-1} (xk'uj / uj'uj) uj

We can also convert the u's to unit length by setting z_j = u_j / sqrt(u_j'u_j).
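The sequential construction of Result 2A.3 can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the two input vectors are our own example, not from the text:

```python
import numpy as np

def gram_schmidt(xs):
    """Result 2A.3: from linearly independent x1, ..., xk, build mutually
    perpendicular u1, ..., uk with the same linear span."""
    us = []
    for x in xs:
        u = x.astype(float)
        for uj in us:
            # Subtract the projection of x on each previously built uj.
            u -= (x @ uj) / (uj @ uj) * uj
        us.append(u)
    return us

# Illustrative input vectors (ours, not the book's).
xs = [np.array([1.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0])]
us = gram_schmidt(xs)

# Unit-length versions z_j = u_j / sqrt(u_j'u_j).
zs = [u / np.sqrt(u @ u) for u in us]
```

Because the u's are mutually perpendicular, projecting on the original x's or on the running u's gives the same result; the code follows the formula in Result 2A.3 literally.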
Matrices
Definition 2A.13. An m x k matrix, generally denoted by a boldface uppercase letter such as A, R, Σ, and so forth, is a rectangular array of elements having m rows and k columns.
Examples of matrices are the 3 x 3 identity matrix

    I      = [1 0 0]
    (3x3)    [0 1 0]
             [0 0 1]

together with matrices A, B, X, Σ, and E with numerical or symbolic entries.
In our work, the matrix elements will be real numbers or functions taking on values in the real numbers.
Definition 2A.14. The dimension (abbreviated dim) of an m x k matrix is the ordered pair (m, k); m is the row dimension and k is the column dimension. The dimension of a matrix is frequently indicated in parentheses below the letter representing the matrix. Thus, the m x k matrix A is denoted by A (m x k). In the preceding examples, the dimension of the matrix Σ is 3 x 3, and this information can be conveyed by writing Σ (3 x 3).
An m x k matrix A of arbitrary constants can be written

    A      = [a11  a12  ...  a1k]
    (m x k)  [a21  a22  ...  a2k]
             [ .    .          . ]
             [am1  am2  ...  amk]

or more compactly as A (m x k) = {a_ij}, where the index i refers to the row and the index j refers to the column.

An m x 1 matrix is referred to as a column vector. A 1 x k matrix is referred to as a row vector. Since matrices can be considered as vectors side by side, it is natural to define multiplication by a scalar and the addition of two matrices with the same dimensions.
Definition 2A.15. Two matrices A (m x k) = {a_ij} and B (m x k) = {b_ij} are said to be equal, written A = B, if a_ij = b_ij, i = 1, 2, ..., m, j = 1, 2, ..., k. That is, two matrices are equal if

(a) Their dimensionality is the same.
(b) Every corresponding element is the same.
Definition 2A.16 (Matrix addition). Let the matrices A and B both be of dimension m x k with arbitrary elements a_ij and b_ij, i = 1, 2, ..., m, j = 1, 2, ..., k, respectively. The sum of the matrices A and B is an m x k matrix C, written C = A + B, such that the arbitrary element of C is given by

    c_ij = a_ij + b_ij,    i = 1, 2, ..., m, j = 1, 2, ..., k

Note that the addition of matrices is defined only for matrices of the same dimension. For example,

    [3 1] + [1 0] = [4 1]
    [2 4]   [5 2]   [7 6]
Definition 2A.17 (Scalar multiplication). Let c be an arbitrary scalar and A (m x k) = {a_ij}. Then

    cA = Ac = B (m x k) = {b_ij},  where b_ij = c a_ij = a_ij c,  i = 1, 2, ..., m, j = 1, 2, ..., k

Multiplication of a matrix by a scalar produces a new matrix whose elements are the elements of the original matrix, each multiplied by the scalar. For example, if c = 2,

    2 [1 3] = [2  6]
      [2 5]   [4 10]
Definition 2A.18 (Matrix subtraction). Let A (m x k) = {a_ij} and B (m x k) = {b_ij} be two matrices of equal dimension. Then the difference between A and B, written A - B, is an m x k matrix C = {c_ij} given by

    C = A - B = A + (-1)B

That is, c_ij = a_ij + (-1)b_ij = a_ij - b_ij, i = 1, 2, ..., m, j = 1, 2, ..., k.
Definition 2A.19. Consider the m x k matrix A with arbitrary elements a_ij, i = 1, 2, ..., m, j = 1, 2, ..., k. The transpose of the matrix A, denoted by A', is the k x m matrix with elements a_ji, j = 1, 2, ..., k, i = 1, 2, ..., m. That is, the transpose of the matrix A is obtained from A by interchanging the rows and columns.
As an example, if

    A      = [2  1  3]
    (2x3)    [7 -4  6]

then

    A'     = [2  7]
    (3x2)    [1 -4]
             [3  6]
Result 2A.4. For all matrices A, B, and C (of equal dimension) and scalars c and d, the following hold:

(a) (A + B) + C = A + (B + C)
(b) A + B = B + A
(c) c(A + B) = cA + cB
(d) (c + d)A = cA + dA
(e) (A + B)' = A' + B'  (That is, the transpose of the sum is equal to the sum of the transposes.)
(f) (cd)A = c(dA)
(g) (cA)' = cA'
Definition 2A.20. If an arbitrary matrix A has the same number of rows and columns, then A is called a square matrix. The matrices Σ, I, and E given after Definition 2A.13 are square matrices.

Definition 2A.21. Let A be a k x k (square) matrix. Then A is said to be symmetric if A = A'. That is, A is symmetric if a_ij = a_ji, i = 1, 2, ..., k, j = 1, 2, ..., k.
Examples of symmetric matrices are the identity matrix

    I      = [1 0 0]
    (3x3)    [0 1 0]
             [0 0 1]

and any matrix whose entries are mirrored across the main diagonal.

Definition 2A.22. The k x k identity matrix, denoted by I (k x k), is the square matrix with ones on the main (NW-SE) diagonal and zeros elsewhere. The 3 x 3 identity matrix is shown before this definition.
Definition 2A.23 (Matrix multiplication). The product AB of an m x n matrix A = {a_ij} and an n x k matrix B = {b_ij} is the m x k matrix C whose elements are

    c_ij = sum_{l=1}^{n} a_il b_lj,    i = 1, 2, ..., m, j = 1, 2, ..., k

Note that for the product AB to be defined, the column dimension of A must equal the row dimension of B. If that is so, then the row dimension of AB equals the row dimension of A, and the column dimension of AB equals the column dimension of B. For example, let
    A      = [3 -1 2]        B      = [3  4]
    (2x3)    [4  0 5]        (3x2)    [6 -2]
                                      [4  3]

Then

    A B = C      = [11 20]
        (2x2)      [32 31]

since

    c11 = (3)(3) + (-1)(6) + (2)(4) = 11        c12 = (3)(4) + (-1)(-2) + (2)(3) = 20
    c21 = (4)(3) + (0)(6) + (5)(4) = 32         c22 = (4)(4) + (0)(-2) + (5)(3) = 31
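The product can be verified numerically. A minimal NumPy sketch (the NumPy dependency is an addition of this transcription), using the entries A = [3 -1 2; 4 0 5] and B = [3 4; 6 -2; 4 3] as reconstructed in this example:

```python
import numpy as np

A = np.array([[3, -1, 2],
              [4,  0, 5]])
B = np.array([[3,  4],
              [6, -2],
              [4,  3]])

# (2x3)(3x2) -> 2x2: row dimension of AB comes from A, column dimension from B.
C = A @ B
print(C)   # [[11 20]
           #  [32 31]]
```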
As an additional example, let x and y both be 4 x 1 vectors. Note that the product xy is undefined, since x is a 4 x 1 matrix and y is a 4 x 1 matrix, so the column dimension of x, 1, is unequal to the row dimension of y, 4. If x and y are vectors of the same dimension, such as n x 1, both of the products x'y and xy' are defined. In particular, y'x = x'y = x1 y1 + x2 y2 + ... + xn yn, and xy' is an n x n matrix with (i, j)th element x_i y_j.

Result 2A.5. For all matrices A, B, and C (of dimensions such that the indicated products are defined) and a scalar c,
(a) c(AB) = (cA)B
(b) A(BC) = (AB)C
(c) A(B + C) = AB + AC
(d) (B + C)A = BA + CA
(e) (AB)' = B'A'

More generally, for any x_j such that A x_j is defined,

    sum_{j=1}^{n} A x_j = A sum_{j=1}^{n} x_j
There are several important differences between the algebra of matrices and the algebra of real numbers. Two of these differences are as follows:

1. Matrix multiplication is, in general, not commutative. That is, in general, AB ≠ BA. Several examples will illustrate the failure of the commutative law (for matrices).
For example, if A is 2 x 3 and b is 3 x 1, the product Ab is defined, but the product bA is not defined, since the column dimension of b, 1, is unequal to the row dimension of A, 2. Even when AB and BA are both defined, they need not have the same dimension; and when AB and BA are both defined and of the same dimension, the two products are in general unequal.
2. Let 0 denote the zero matrix, that is, the matrix with zero for every element. In the algebra of real numbers, if the product of two numbers, ab, is zero, then a = 0 or b = 0. In matrix algebra, however, the product of two nonzero matrices may be the zero matrix. Hence,

    A B = 0
    (m x n)(n x k)   (m x k)

does not imply that A = 0 (m x n) or B = 0 (n x k).
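Both of these failures of real-number algebra are easy to exhibit with small matrices. The particular entries below are our own illustrations, not the text's:

```python
import numpy as np

# 1. AB != BA in general, even when both products are defined and square.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])
print(A @ B)   # [[2 1], [4 3]]
print(B @ A)   # [[3 4], [1 2]]

# 2. The product of two nonzero matrices can be the zero matrix.
C = np.array([[1, 1],
              [1, 1]])
D = np.array([[ 1, -1],
              [-1,  1]])
print(C @ D)   # [[0 0], [0 0]], although C != 0 and D != 0
```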
Definition 2A.24 (Determinant). The determinant of the square k x k matrix A = {a_ij}, denoted by |A|, is the scalar

    |A| = a11    if k = 1

    |A| = sum_{j=1}^{k} a_1j |A_1j| (-1)^{1+j}    if k > 1

where A_1j is the (k - 1) x (k - 1) matrix obtained by deleting the first row and jth column of A. Also, |A| = sum_{j=1}^{k} a_ij |A_ij| (-1)^{i+j}, with the expansion taken along any row i. Examples of determinants (evaluated by Definition 2A.24) are

    |1 3; 6 4| = 1|4|(-1)^2 + 3|6|(-1)^3 = 1(4) - 3(6) = -14

    |3 1 6; 7 4 5; 2 -7 1| = 3|4 5; -7 1|(-1)^2 + 1|7 5; 2 1|(-1)^3 + 6|7 4; 2 -7|(-1)^4
                           = 3(39) - 1(-3) + 6(-57) = -222

If I is the k x k identity matrix, |I| = 1. In general, for a 3 x 3 matrix,

    |a11 a12 a13; a21 a22 a23; a31 a32 a33|
        = a11|a22 a23; a32 a33|(-1)^2 + a12|a21 a23; a31 a33|(-1)^3 + a13|a21 a22; a31 a32|(-1)^4
The determinant of any 3 x 3 matrix can be computed by summing the products of elements along one set of diagonals of the array and subtracting the products along the other set (the familiar diagonal diagram). This procedure is not valid for matrices of higher dimension, but in general, Definition 2A.24 can be employed to evaluate these determinants.
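Definition 2A.24 is directly recursive, and a short routine makes that explicit. This is a plain-Python sketch of cofactor expansion along the first row, fine for small k (practical software uses faster factorizations):

```python
def det(A):
    """Determinant by cofactor expansion along the first row (Definition 2A.24).
    A is a list of row lists."""
    k = len(A)
    if k == 1:
        return A[0][0]
    total = 0
    for j in range(k):
        # Minor A_1j: delete the first row and the (j+1)st column.
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        # Sign (-1)^(1+j) with 1-based column index j+1.
        total += A[0][j] * det(minor) * (-1) ** j
    return total

print(det([[1, 3], [6, 4]]))                     # -14
print(det([[3, 1, 6], [7, 4, 5], [2, -7, 1]]))   # -222
```

The two calls reproduce the worked examples above.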
We next want to state a result that describes some properties of the determinant. However, we must first introduce some notions related to matrix inverses.
Definition 2A.25. The row rank of a matrix is the maximum number of linearly independent rows, considered as vectors (that is, row vectors). The column rank of a matrix is the rank of its set of columns, considered as vectors.
For example, let A be the 3 x 3 matrix whose rows, written as vectors, were shown to be linearly dependent after Definition 2A.6; its row rank is therefore 2. Note that the column rank of A is also 2, since one column is a linear combination of the other two, but columns 1 and 2 are linearly independent. This is no coincidence, as the following result indicates.
Result 2A.6. The row rank and the column rank of a matrix are equal.
Thus, the rank of a matrix is either the row rank or the column rank.
Definition 2A.26. A square matrix A (k x k) is nonsingular if

    A x = 0
    (k x k)(k x 1)   (k x 1)

implies that x (k x 1) = 0 (k x 1). If a matrix fails to be nonsingular, it is called singular. Equivalently, a square matrix is nonsingular if its rank is equal to the number of rows (or columns) it has. Note that Ax = x1 a1 + x2 a2 + ... + xk ak, where a_i is the ith column of A, so that the condition of nonsingularity is just the statement that the columns of A are linearly independent.
Result 2A.7. Let A be a nonsingular square matrix of dimension k x k. Then there is a unique k x k matrix B such that

    AB = BA = I

where I is the k x k identity matrix.

Definition 2A.27. The B such that AB = BA = I is called the inverse of A and is denoted by A^(-1). In fact, if BA = I or AB = I, then B = A^(-1), and both products must equal I.
For example,

    A = [2 3]
        [1 5]

has

    A^(-1) = [ 5/7 -3/7]
             [-1/7  2/7]

since

    [2 3][ 5/7 -3/7]   [ 5/7 -3/7][2 3]   [1 0]
    [1 5][-1/7  2/7] = [-1/7  2/7][1 5] = [0 1]

Result 2A.8. (a) The inverse of any 2 x 2 matrix

    A = [a11 a12]
        [a21 a22]

is given by

    A^(-1) = (1/|A|) [ a22 -a12]
                     [-a21  a11]
(b) The inverse of any 3 x 3 matrix A = {a_ij} is given by

    A^(-1) = (1/|A|) {c_ji},  where c_ij = (-1)^(i+j) |A_ij|

and |A_ij| is the determinant of the 2 x 2 matrix obtained by deleting the ith row and jth column of A. (Note the transposition of subscripts: the (i, j)th entry of A^(-1) is the cofactor of a_ji, divided by |A|.) In both cases, |A| ≠ 0 is required for the inverse to exist.
Result 2A.9. Let A be a k x k square matrix.

(a) A x (k x 1) = 0 (k x 1) implies x = 0 (k x 1) if and only if A is nonsingular.
(b) The columns (rows) of A are linearly dependent if and only if |A| = 0.
(c) If A is nonsingular, then A^(-1) exists and A^(-1) A = A A^(-1) = I (k x k).
Result 2A.11. Let A and B be k x k square matrices, and let the indicated inverses exist. Then the following hold:

(a) |A| = |A'|
(b) If each element of a row (column) of A is zero, then |A| = 0
(c) If any two rows (columns) of A are identical, then |A| = 0
(d) If A is nonsingular, then |A| = 1/|A^(-1)|; that is, |A||A^(-1)| = 1
(e) |AB| = |A||B|
(f) |cA| = c^k |A|, where c is a scalar.

You are referred to [6] for proofs of parts of Results 2A.9 and 2A.11. Some of these proofs are rather complex and beyond the scope of this book.
Definition 2A.28. Let A = {a_ij} be a k x k square matrix. The trace of the matrix A, written tr(A), is the sum of the diagonal elements; that is,

    tr(A) = sum_{i=1}^{k} a_ii
Result 2A.12. Let A and B be k x k matrices and c be a scalar.

(a) tr(cA) = c tr(A)
(b) tr(A ± B) = tr(A) ± tr(B)
(c) tr(AB) = tr(BA)
(d) tr(B^(-1)AB) = tr(A)
(e) tr(AA') = sum_{i=1}^{k} sum_{j=1}^{k} a_ij^2
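The trace identities of Result 2A.12 can be spot-checked on random matrices. A NumPy sketch (the seeded generator and matrix size are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))   # invertible with probability 1

# (c) tr(AB) = tr(BA), even though AB != BA in general.
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# (d) tr(B^{-1} A B) = tr(A): the trace is invariant under similarity.
assert np.isclose(np.trace(np.linalg.inv(B) @ A @ B), np.trace(A))

# (e) tr(AA') equals the sum of squared entries of A.
assert np.isclose(np.trace(A @ A.T), (A ** 2).sum())
```

A numerical check on one random pair is not a proof, of course, but it catches transcription errors in the identities.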
Definition 2A.29. A square matrix A is said to be orthogonal if its rows, considered as vectors, are mutually perpendicular and have unit lengths; that is, AA' = I.

Result 2A.13. A matrix A is orthogonal if and only if A^(-1) = A'. For an orthogonal matrix, AA' = A'A = I, so the columns are also mutually perpendicular and have unit lengths.

An example of an orthogonal matrix is

    A = [-1/2  1/2  1/2  1/2]
        [ 1/2 -1/2  1/2  1/2]
        [ 1/2  1/2 -1/2  1/2]
        [ 1/2  1/2  1/2 -1/2]

Clearly, A = A', so AA' = A'A = AA. You may verify that AA = I, so A' = A^(-1), and A must be an orthogonal matrix.

Square matrices are best understood in terms of quantities called eigenvalues and eigenvectors.

Definition 2A.30. Let A be a k x k square matrix and I be the k x k identity matrix. Then the scalars λ1, λ2, ..., λk satisfying the polynomial equation |A - λI| = 0 are called the eigenvalues (or characteristic roots) of the matrix A. The equation |A - λI| = 0 (as a function of λ) is called the characteristic equation.

For example, let
    A = [1 0]
        [1 3]

Then

    |A - λI| = |1-λ   0 |
               | 1   3-λ| = (1 - λ)(3 - λ) = 0

implies that there are two roots, λ1 = 1 and λ2 = 3. The eigenvalues of A are 3 and 1. Let
    A = [13 -4  2]
        [-4 13 -2]
        [ 2 -2 10]

Then the equation

    |A - λI| = |13-λ  -4    2 |
               | -4  13-λ  -2 | = -λ^3 + 36λ^2 - 405λ + 1458 = 0
               |  2   -2  10-λ|

has three roots: λ1 = 9, λ2 = 9, and λ3 = 18; that is, 9, 9, and 18 are the eigenvalues of A.
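The eigenvalues 9, 9, and 18 can be confirmed numerically. Since A is symmetric, NumPy's routine for symmetric (Hermitian) matrices is the appropriate one; the NumPy dependency is our addition:

```python
import numpy as np

A = np.array([[13, -4,  2],
              [-4, 13, -2],
              [ 2, -2, 10]])

# eigvalsh is specialized for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(A)
print(np.sort(eigenvalues))   # approximately [9, 9, 18]
```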
Definition 2A.31. Let A be a square matrix of dimension k x k and let λ be an eigenvalue of A. If x (k x 1) is a nonzero vector (x ≠ 0 (k x 1)) such that

    Ax = λx

then x is said to be an eigenvector (characteristic vector) of the matrix A associated with the eigenvalue λ.

An equivalent condition for λ to be a solution of the eigenvalue-eigenvector equation is |A - λI| = 0. This follows because the statement that Ax = λx for some λ and x ≠ 0 implies that

    0 = (A - λI)x = x1 col1(A - λI) + ... + xk colk(A - λI)

That is, the columns of A - λI are linearly dependent so, by Result 2A.9(b), |A - λI| = 0, as asserted. Following Definition 2A.30, we have shown that the eigenvalues of
    A = [1 0]
        [1 3]

are λ1 = 1 and λ2 = 3. The eigenvectors associated with these eigenvalues can be determined by solving the following equations:

    Ax = λ1 x:    x1 = x1,     x1 + 3x2 = x2
    Ax = λ2 x:    x1 = 3x1,    x1 + 3x2 = 3x2

From the first system, x1 = -2x2. There are many solutions for x1 and x2. Setting x2 = 1 (arbitrarily) gives x1 = -2, and hence

    x = [-2]
        [ 1]

is an eigenvector corresponding to the eigenvalue 1. From the second system, x1 = 0 with x2 arbitrary, so x = [0, 1]' is an eigenvector corresponding to the eigenvalue 3. It is usual practice to determine an eigenvector so that it has length unity. That is, if Ax = λx, we take e = x/sqrt(x'x) as the eigenvector corresponding to λ. For example, the eigenvector for λ1 = 1 is e1' = [-2/√5, 1/√5].
Definition 2A.32. A quadratic form Q(x) in the k variables x1, x2, ..., xk is Q(x) = x'Ax, where x' = [x1, x2, ..., xk] and A is a k x k symmetric matrix. Note that a quadratic form can be written as

    Q(x) = sum_{i=1}^{k} sum_{j=1}^{k} a_ij x_i x_j

For example,

    Q(x) = [x1 x2] [1 1] [x1]
                   [1 1] [x2] = x1^2 + 2 x1 x2 + x2^2
Any symmetric square matrix can be reconstructed from its eigenvalues and eigenvectors. The particular expression reveals the relative importance of each pair according to the relative size of the eigenvalue and the direction of the eigenvector.
Result 2A.14 (The Spectral Decomposition). Let A be a k x k symmetric matrix. Then A can be expressed in terms of its k eigenvalue-eigenvector pairs (λi, ei) as

    A = sum_{i=1}^{k} λi ei ei'

For example, let

    A = [2.2  .4]
        [ .4 2.8]

Then

    |A - λI| = λ^2 - 5λ + 6 = (λ - 3)(λ - 2)

so A has eigenvalues λ1 = 3 and λ2 = 2. The corresponding eigenvectors are e1' = [1/√5, 2/√5] and e2' = [2/√5, -1/√5], respectively. Consequently,

    A = [2.2  .4] = 3 [1/√5] [1/√5  2/√5] + 2 [ 2/√5] [2/√5  -1/√5]
        [ .4 2.8]     [2/√5]                  [-1/√5]

      = 3 [1/5  2/5] + 2 [ 4/5 -2/5]
          [2/5  4/5]     [-2/5  1/5]
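Result 2A.14 says a symmetric matrix is the eigenvalue-weighted sum of the outer products e_i e_i'. A NumPy sketch using the 2 x 2 matrix of this example (NumPy is our addition):

```python
import numpy as np

A = np.array([[2.2, 0.4],
              [0.4, 2.8]])

# For symmetric A, eigh returns real eigenvalues (ascending) and
# orthonormal eigenvectors as the columns of E.
lams, E = np.linalg.eigh(A)

# Rebuild A as sum_i lambda_i e_i e_i'.
A_rebuilt = sum(lam * np.outer(e, e) for lam, e in zip(lams, E.T))

print(lams)   # approximately [2, 3]
```

The reconstruction agrees with A to floating-point precision, illustrating the decomposition.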
Result 2A.15 (Singular-Value Decomposition). Let A be an m x k matrix of real numbers. Then there exist an m x m orthogonal matrix U and a k x k orthogonal matrix V such that

    A = U Λ V'

where the m x k matrix Λ has (i, i) entry λi ≥ 0 for i = 1, 2, ..., min(m, k) and the other entries are zero. The positive constants λi are called the singular values of A.

The singular-value decomposition can also be expressed as a matrix expansion that depends on the rank r of A. Specifically, there exist r positive constants λ1, λ2, ..., λr, r orthogonal m x 1 unit vectors u1, u2, ..., ur, and r orthogonal k x 1 unit vectors v1, v2, ..., vr, such that

    A = sum_{i=1}^{r} λi ui vi' = Ur Λr Vr'

where Ur = [u1, u2, ..., ur], Vr = [v1, v2, ..., vr], and Λr is an r x r diagonal matrix with diagonal entries λi.

Here AA' has eigenvalue-eigenvector pairs (λi^2, ui), so

    AA' ui = λi^2 ui

with λ1^2 ≥ λ2^2 ≥ ... ≥ λk^2 ≥ 0 = λ^2_{k+1} = ... = λ^2_m (for m > k). Then vi = λi^(-1) A' ui. Alternatively, the vi are the eigenvectors of A'A with the same nonzero eigenvalues λi^2.

The matrix expansion for the singular-value decomposition written in terms of the full-dimensional matrices U, V, Λ is

    A     =   U    Λ    V'
    (m x k)  (m x m)(m x k)(k x k)

where U has the m orthogonal eigenvectors of AA' as its columns, V has the k orthogonal eigenvectors of A'A as its columns, and Λ is specified in Result 2A.15. For example, let
    A = [ 3 1 1]
        [-1 3 1]

Then

    AA' = [11  1]
          [ 1 11]

You may verify that the eigenvalues γ = λ^2 of AA' satisfy the equation γ^2 - 22γ + 120 = (γ - 12)(γ - 10) = 0, and consequently, the eigenvalues are γ1 = λ1^2 = 12 and γ2 = λ2^2 = 10. The corresponding eigenvectors are

    u1' = [1/√2, 1/√2]    and    u2' = [1/√2, -1/√2]

Also,

    A'A = [10  0  2]
          [ 0 10  4]
          [ 2  4  2]

so |A'A - γI| = -γ^3 + 22γ^2 - 120γ = -γ(γ - 12)(γ - 10), and the eigenvalues are γ1 = λ1^2 = 12, γ2 = λ2^2 = 10, and γ3 = λ3^2 = 0. The nonzero eigenvalues are the same as those of AA'. A computer calculation gives the eigenvectors

    v1' = [1/√6, 2/√6, 1/√6],    v2' = [2/√5, -1/√5, 0],    and    v3' = [1/√30, 2/√30, -5/√30]

Taking λ1 = √12 and λ2 = √10, we find that the singular-value decomposition of A is

    A = [ 3 1 1] = √12 [1/√2] [1/√6  2/√6  1/√6] + √10 [ 1/√2] [2/√5  -1/√5  0]
        [-1 3 1]       [1/√2]                          [-1/√2]
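The singular values √12 and √10 of this example can be confirmed with NumPy, and the expansion A = Σ λi ui vi' rebuilt from the computed factors (the NumPy dependency is our addition):

```python
import numpy as np

A = np.array([[ 3, 1, 1],
              [-1, 3, 1]])

# Reduced SVD: U is 2x2, s holds the singular values (descending), Vt is 2x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s ** 2)   # approximately [12, 10], the nonzero eigenvalues of AA'

# Rebuild A from the rank expansion sum_i lambda_i u_i v_i'.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
```

The signs of individual singular vectors may differ from those printed in the text (a singular vector is determined only up to sign), but the products λi ui vi' agree.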
The equality may be checked by carrying out the operations on the right-hand side.

The singular-value decomposition is closely connected to a result concerning the approximation of a rectangular matrix by a lower-dimensional matrix, due to Eckart and Young ([2]). If an m x k matrix A is approximated by B, having the same dimension but lower rank, the sum of squared differences

    sum_{i=1}^{m} sum_{j=1}^{k} (a_ij - b_ij)^2 = tr[(A - B)(A - B)']
Result 2A.16. Let A be an m x k matrix of real numbers with m ≥ k and singular value decomposition UΛV'. Let s < k = rank(A). Then

    B = sum_{i=1}^{s} λi ui vi'

is the rank-s least squares approximation to A. It minimizes tr[(A - B)(A - B)'] over all m x k matrices B having rank no greater than s. The minimum value, or error of approximation, is sum_{i=s+1}^{k} λi^2.

To see why this is so, write B = UCV' for some matrix C = {c_ij}. Then

    tr[(A - B)(A - B)'] = sum_{i} (λi - c_ii)^2 + sum_{i ≠ j} c_ij^2

which, subject to the rank constraint, is minimized by taking c_ii = λi for the s largest singular values and the other c_ij = 0. That is, U'BV = C, or B = sum_{i=1}^{s} λi ui vi'.
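Result 2A.16 suggests a direct recipe: keep the s largest singular values and their singular vectors. A NumPy sketch (the function name best_rank_s is ours) that also checks the stated error formula on the 2 x 3 matrix used above:

```python
import numpy as np

def best_rank_s(A, s):
    """Rank-s least squares approximation of A (Result 2A.16), built from
    the s largest singular values and their singular vectors."""
    U, sv, Vt = np.linalg.svd(A, full_matrices=False)   # sv is descending
    return sum(sv[i] * np.outer(U[:, i], Vt[i, :]) for i in range(s))

A = np.array([[ 3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])
B = best_rank_s(A, 1)

# The approximation error tr[(A-B)(A-B)'] equals the sum of the
# squared singular values that were dropped (here, lambda_2^2 = 10).
err = np.trace((A - B) @ (A - B).T)
dropped = np.linalg.svd(A, compute_uv=False)[1:] ** 2
print(err, dropped.sum())
```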
Exercises
2.1. Let x' = [5, 1, 3] and y' = [-1, 3, 1].
(a) Graph the two vectors.
(b) Find (i) the length of x, (ii) the angle between x and y, and (iii) the projection of y on x.
(c) Since x̄ = 3 and ȳ = 1, graph [5 - 3, 1 - 3, 3 - 3] = [2, -2, 0] and [-1 - 1, 3 - 1, 1 - 1] = [-2, 2, 0].
2.2. Given the matrices A, B, and C, perform the indicated multiplications.
(a) 5A
(b) BA
(c) A'B'
(d) C'B
(e) Is AB defined?
2.3. Verify the following properties of the transpose for matrices A, B, and C of appropriate dimensions:
(a) (A')' = A
(b) (C')^(-1) = (C^(-1))'
(c) (AB)' = B'A'
(d) For general A (m x k) and B (k x l), (AB)' = B'A'.
2.4. When A^(-1) and B^(-1) exist, prove each of the following.
(a) (A')^(-1) = (A^(-1))'
(b) (AB)^(-1) = B^(-1)A^(-1)
Hint: Part a can be proved by noting that AA^(-1) = I, I = I', and (AA^(-1))' = (A^(-1))'A'. Part b follows from (B^(-1)A^(-1))AB = B^(-1)(A^(-1)A)B = B^(-1)B = I.

2.5. Check that

    Q = [  5/13 12/13]
        [-12/13  5/13]

is an orthogonal matrix.
2.6. Let

    A = [ 9 -2]
        [-2  6]

(a) Is A symmetric?
(b) Show that A is positive definite.
2.7. Let A be as given in Exercise 2.6.
(a) Determine the eigenvalues and eigenvectors of A.
(b) Write the spectral decomposition of A.
(c) Find A^(-1).
(d) Find the eigenvalues and eigenvectors of A^(-1).
2.8. Given the matrix

    A = [1  2]
        [2 -2]

find the eigenvalues λ1 and λ2 and the associated normalized eigenvectors e1 and e2. Determine the spectral decomposition (2-16) of A.
2.9. Let A be as in Exercise 2.8.
(a) Find A^(-1).
(b) Compute the eigenvalues and eigenvectors of A^(-1).
(c) Write the spectral decomposition of A^(-1), and compare it with that of A from Exercise 2.8.
2.10. Consider the matrices

    A = [4     4.001]        and        B = [4     4.001   ]
        [4.001 4.002]                       [4.001 4.002001]

These matrices are identical except for a small difference in the (2, 2) position. Moreover, the columns of A (and B) are nearly linearly dependent. Show that A^(-1) ≈ (-3)B^(-1). Consequently, small changes, perhaps caused by rounding, can give substantially different inverses.
2.11. Show that the determinant of the p x p diagonal matrix A = {a_ij} with a_ij = 0, i ≠ j, is given by the product of the diagonal elements; thus, |A| = a11 a22 ... app.
Hint: By Definition 2A.24, |A| = a11 A11 + 0 + ... + 0. Repeat for the submatrix A11 obtained by deleting the first row and first column of A.
2.12. Show that the determinant of a square symmetric p x p matrix A can be expressed as the product of its eigenvalues λ1, λ2, ..., λp; that is, |A| = prod_{i=1}^{p} λi.
Hint: From (2-16) and (2-20), A = PΛP' with P'P = I. From Result 2A.11(e), |A| = |PΛP'| = |P||ΛP'| = |P||Λ||P'| = |Λ||I|, since |I| = |P'P| = |P'||P|. Apply Exercise 2.11.
2.13. Show that |Q| = +1 or -1 if Q is a p x p orthogonal matrix.
Hint: |QQ'| = |I|. Also, from Result 2A.11, |Q||Q'| = |Q|^2. Thus, |Q|^2 = |I|. Now use Exercise 2.11.

2.14. Show that Q' (p x p) A (p x p) Q (p x p) and A (p x p) have the same eigenvalues if Q is orthogonal.
Hint: Let λ be an eigenvalue of A. Then 0 = |A - λI|. By Exercise 2.13 and Result 2A.11(e), we can write 0 = |Q'||A - λI||Q| = |Q'AQ - λI|, since Q'Q = I.
2.15. A quadratic form x'Ax is said to be positive definite if the matrix A is positive definite. Is the quadratic form 3x1^2 + 3x2^2 - 2x1x2 positive definite?
2.16. Consider an arbitrary n x p matrix A. Then A'A is a symmetric p x p matrix. Show that A'A is necessarily nonnegative definite.
Hint: Set y = Ax so that y'y = x'A'Ax.
2.17. Prove that every eigenvalue of a k x k positive definite matrix A is positive.
Hint: Consider the definition of an eigenvalue, where Ae = λe. Multiply on the left by e' so that e'Ae = λe'e.

2.18. Consider the sets of points (x1, x2) whose "distances" from the origin are given by

    c^2 = 4x1^2 + 3x2^2 - 2√2 x1x2

for c^2 = 1 and for c^2 = 4. Determine the major and minor axes of the ellipses of constant distances and their associated lengths. Sketch the ellipses of constant distances and comment on their positions. What will happen as c^2 increases?
2.19. Let

    A^(1/2) (m x m) = sum_{i=1}^{m} √λi ei ei' = PΛ^(1/2)P'

where PP' = P'P = I. (The λi's and the ei's are the eigenvalues and associated normalized eigenvectors of the matrix A.) Show Properties (1)-(4) of the square-root matrix in (2-22).

2.20. Determine the square-root matrix A^(1/2), using the matrix A in Exercise 2.3. Also, determine A^(-1/2), and show that A^(1/2) A^(-1/2) = A^(-1/2) A^(1/2) = I.
2.21. (See Result 2A.15.) Using the matrix A,
(a) Calculate A'A and obtain its eigenvalues and eigenvectors.
(b) Calculate AA' and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A.
2.22. (See Result 2A.15.) Using the matrix A,
(a) Calculate AA' and obtain its eigenvalues and eigenvectors.
(b) Calculate A'A and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A.

2.23. Verify the relationships V^(1/2) ρ V^(1/2) = Σ and ρ = (V^(1/2))^(-1) Σ (V^(1/2))^(-1), where Σ is the p x p population covariance matrix [Equation (2-32)], ρ is the p x p population correlation matrix [Equation (2-34)], and V^(1/2) is the population standard deviation matrix [Equation (2-35)].
2.24. Let X have covariance matrix

    Σ = [25 -2  4]
        [-2  4  1]
        [ 4  1  9]

Find
(a) Σ^(-1)
(b) The eigenvalues and eigenvectors of Σ.
(c) The eigenvalues and eigenvectors of Σ^(-1).

2.25. Let X have covariance matrix Σ.
(a) Determine ρ and V^(1/2).
(b) Multiply your matrices to check the relation V^(1/2) ρ V^(1/2) = Σ.

2.26. Use Σ as given in Exercise 2.25.
(a) Find ρ13.
(b) Find the correlation between X1 and (1/2)X2 + (1/2)X3.
2.27. Derive expressions for the mean and variances of the following linear combinations in terms of the means and covariances of the random variables X1, X2, and X3.
(a) X1 - 2X2
(b) X1 + 3X2
(c) X1 + X2 + X3
(e) X1 + 2X2 - X3
(f) 3X1 - 4X2, if X1 and X2 are independent random variables.

2.28. Show that
Cov(c1'X, c2'X) = c1' Σ c2, where c1' = [c11, c12, ..., c1p] and c2' = [c21, c22, ..., c2p]. This verifies the off-diagonal elements of CΣC' in (2-45), or the diagonal elements if c1 = c2.
Hint: By (2-43), Z1 - E(Z1) = c11(X1 - μ1) + ... + c1p(Xp - μp) and Z2 - E(Z2) = c21(X1 - μ1) + ... + c2p(Xp - μp), so

    Cov(Z1, Z2) = E[(Z1 - E(Z1))(Z2 - E(Z2))]
                = E[(c11(X1 - μ1) + ... + c1p(Xp - μp))(c21(X1 - μ1) + ... + c2p(Xp - μp))]
                = sum_{l=1}^{p} sum_{m=1}^{p} c1l c2m E[(Xl - μl)(Xm - μm)]
                = sum_{l=1}^{p} sum_{m=1}^{p} c1l c2m σlm = c1' Σ c2

Verify the last step by the definition of matrix multiplication. The same steps hold for all elements.
2.29. Consider the arbitrary random vector X' = [X1, X2, X3, X4, X5] with mean vector μ' = [μ1, μ2, μ3, μ4, μ5]. Partition X into

    X = [X^(1)]
        [X^(2)]

where

    X^(1) = [X1]    and    X^(2) = [X3]
            [X2]                   [X4]
                                   [X5]

Let Σ be the covariance matrix of X with general element σik. Partition Σ into the covariance matrices of X^(1) and X^(2) and the covariance matrix of an element of X^(1) and an element of X^(2).
2.30. You are given the random vector X' = [X1, X2, X3, X4] with mean vector μX' = [4, 3, 2, 1] and variance-covariance matrix

    ΣX = [3  0  2  2]
         [0  1  1  0]
         [2  1  9 -2]
         [2  0 -2  4]

Partition X as

    X = [X^(1)]    with    X^(1) = [X1]    and    X^(2) = [X3]
        [X^(2)]                    [X2]                   [X4]

Let

    A = [1  2]    and    B = [1 -2]
                             [2 -1]

and consider the linear combinations AX^(1) and BX^(2). Find
(a) E(X^(1))
(b) E(AX^(1))
(c) Cov(X^(1))
(d) Cov(AX^(1))
(e) E(X^(2))
(f) E(BX^(2))
(g) Cov(X^(2))
(h) Cov(BX^(2))
(i) Cov(X^(1), X^(2))
(j) Cov(AX^(1), BX^(2))
2.31. Repeat Exercise 2.30, but with A and B replaced by

    A = [1 -1]    and    B = [2 -1]
                             [1  1]
2.32. You are given the random vector X' = [X1, X2, ..., X5] with mean vector μX and variance-covariance matrix ΣX. Partition X as

    X = [X^(1)]    with    X^(1) = [X1]    and    X^(2) = [X3]
        [X^(2)]                    [X2]                   [X4]
                                                          [X5]

Let A and B be the indicated matrices of constants, and consider the linear combinations AX^(1) and BX^(2). Find
(a) E(X^(1))
(b) E(AX^(1))
(c) Cov(X^(1))
(d) Cov(AX^(1))
(e) E(X^(2))
(f) E(BX^(2))
(g) Cov(X^(2))
(h) Cov(BX^(2))
(i) Cov(X^(1), X^(2))
(j) Cov(AX^(1), BX^(2))
2.33. Repeat Exercise 2.32, but with X partitioned differently and with A and B replaced by the new matrices A (2 x 3) and B (2 x 2).
2.35. Using the vectors b' = [-4, 3] and d' = [1, 1], verify the extended Cauchy-Schwarz inequality (b'd)^2 ≤ (b'Bb)(d'B^(-1)d) if

    B = [ 2 -2]
        [-2  5]
2.36. Find the maximum and minimum values of the quadratic form 4x1^2 + 4x2^2 + 6x1x2 for all points x' = [x1, x2] such that x'x = 1.
2.37. With A as given in Exercise 2.6, find the maximum value of x'Ax for x'x = 1.

2.38. Find the maximum and minimum values of the ratio x'Ax/x'x for any nonzero vectors x' = [x1, x2, x3] if

    A = [13 -4  2]
        [-4 13 -2]
        [ 2 -2 10]
s
(rXs)(sXr)(tXv)
Hint: BC has ( e, j)th entry 2: bekCki = de; So A(BC) has (i, j)th element
k~l
2.40. Verify (2-24): E(X + Y) = E(X) + E(Y) and E(AXB) = AE(X)B.
Hint: X + Y has Xij + Yij as its (i, j)th element. Now, E(Xij + Yij) = E(Xij) + E(Yij) by a univariate property of expectation, and this last quantity is the (i, j)th element of E(X) + E(Y). Next (see Exercise 2.39), AXB has (i, j)th entry sum_l sum_k a_il Xlk b_kj, and by the additive property of expectation,

    E(sum_l sum_k a_il Xlk b_kj) = sum_l sum_k a_il E(Xlk) b_kj

which is the (i, j)th element of AE(X)B.
2.41. You are given the random vector X' = [X1, X2, X3, X4] with mean vector μX' = [3, 2, -2, 0] and variance-covariance matrix

    ΣX = [3 0 0 0]
         [0 3 0 0]
         [0 0 3 0]
         [0 0 0 3]

Let

    A = [1 -1  0  0]
        [1  1 -2  0]
        [1  1  1 -3]

(a) Find E(AX), the mean of AX.
(b) Find Cov(AX), the variances and covariances of AX.
(c) Which pairs of linear combinations have zero covariances?
References

1. Bellman, R. Introduction to Matrix Analysis (2nd ed.). Philadelphia: Society for Industrial & Applied Mathematics (SIAM), 1997.
2. Eckart, C., and G. Young. "The Approximation of One Matrix by Another of Lower Rank." Psychometrika, 1 (1936), 211-218.
3. Graybill, F. A. Introduction to Matrices with Applications in Statistics. Belmont, CA: Wadsworth, 1969.
4. Halmos, P. R. Finite-Dimensional Vector Spaces. New York: Springer-Verlag, 1993.
5. Johnson, R. A., and G. K. Bhattacharyya. Statistics: Principles and Methods (5th ed.). New York: John Wiley, 2005.
6. Noble, B., and J. W. Daniel. Applied Linear Algebra (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1988.
Chapter 3

SAMPLE GEOMETRY AND RANDOM SAMPLING

3.2 The Geometry of the Sample

The data matrix of n observations on p variables can be written

    X      = [x11 x12 ... x1p]
    (n x p)  [x21 x22 ... x2p]
             [ .            . ]
             [xn1 xn2 ... xnp]
Each row of X represents a multivariate observation. Since the entire set of measurements is often one particular realization of what might have been observed, we say that the data are a sample of size n from a p-variate "population." The sample then consists of n measurements, each of which has p components. As we have seen, the data can be plotted in two different ways. For the p-dimensional scatter plot, the rows of X represent n points in p-dimensional space. We can write
    X      = [x11 x12 ... x1p]   [x1']
    (n x p)  [x21 x22 ... x2p] = [x2']
             [ .            . ]  [ . ]
             [xn1 xn2 ... xnp]   [xn']

The row vector xj', representing the jth observation, contains the coordinates of a point. The scatter plot of n points in p-dimensional space provides information on the locations and variability of the points. If the points are regarded as solid spheres, the sample mean vector x̄, given by (1-8), is the center of balance. Variability occurs in more than one direction, and it is quantified by the sample variance-covariance matrix Sn. A single numerical measure of variability is provided by the determinant of the sample variance-covariance matrix. When p is greater than 3, this scatter plot representation cannot actually be graphed. Yet the consideration of the data as n points in p dimensions provides insights that are not readily available from algebraic expressions. Moreover, the concepts illustrated for p = 2 or p = 3 remain valid for the other cases.
Example 3.1 (Computing the mean vector) Compute the mean vector x̄ from the data matrix

    X = [ 4 1]
        [-1 3]
        [ 3 5]

Plot the n = 3 data points in p = 2 space, and locate x̄ on the resulting diagram. The first point, x1, has coordinates x1' = [4, 1]. Similarly, the remaining two points are x2' = [-1, 3] and x3' = [3, 5]. Finally,

    x̄ = [(4 - 1 + 3)/3]   [2]
        [(1 + 3 + 5)/3] = [3]

Figure 3.1 shows the scatter plot of the three points with x̄ located as their center of balance.
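The column means of the data matrix give x̄ directly. A NumPy sketch using the data of this example (NumPy is an addition of this transcription):

```python
import numpy as np

# Data matrix of Example 3.1: n = 3 observations on p = 2 variables.
X = np.array([[ 4, 1],
              [-1, 3],
              [ 3, 5]])

# The sample mean vector is the vector of column means.
x_bar = X.mean(axis=0)
print(x_bar)   # [2. 3.]
```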
The alternative geometrical representation is constructed by considering the data as p vectors in n-dimensional space. Here we take the elements of the columns of the data matrix to be the coordinates of the vectors. Let

    X      = [x11 x12 ... x1p]
    (n x p)  [x21 x22 ... x2p] = [y1  y2  ...  yp]        (3-2)
             [ .            . ]
             [xn1 xn2 ... xnp]

Then the coordinates of the first point y1' = [x11, x21, ..., xn1] are the n measurements on the first variable. In general, the ith point yi' = [x1i, x2i, ..., xni] is determined by the n-tuple of all measurements on the ith variable. In this geometrical representation, we depict y1, ..., yp as vectors rather than points, as in the p-dimensional scatter plot. We shall be manipulating these quantities shortly using the algebra of vectors discussed in Chapter 2.
Example 3.2 (Data as p vectors in n dimensions) Plot the following data as p = 2 vectors in n = 3 space:

    X = [ 4 1]
        [-1 3]
        [ 3 5]

Here y1' = [4, -1, 3] and y2' = [1, 3, 5], and these vectors are shown in Figure 3.2.
Many of the algebraic expressions we shall encounter in multivariate analysis can be related to the geometrical notions of length, angle, and volume. This is important because geometrical representations ordinarily facilitate understanding and lead to further insights. Unfortunately, we are limited to visualizing objects in three dimensions, and consequently, the n-dimensional representation of the data matrix X may not seem like a particularly useful device for n > 3. It turns out, however, that geometrical relationships and the associated statistical concepts depicted for any three vectors remain valid regardless of their dimension. This follows because three vectors, even if n-dimensional, can span no more than a three-dimensional space, just as two vectors with any number of components must lie in a plane. By selecting an appropriate three-dimensional perspective, that is, a portion of the n-dimensional space containing the three vectors of interest, a view is obtained that preserves both lengths and angles. Thus, it is possible, with the right choice of axes, to illustrate certain algebraic statistical concepts in terms of only two or three vectors of any dimension n. Since the specific choice of axes is not relevant to the geometry, we shall always label the coordinate axes 1, 2, and 3.

It is possible to give a geometrical interpretation of the process of finding a sample mean. We start by defining the n x 1 vector 1n' = [1, 1, ..., 1]. (To simplify the notation, the subscript n will be dropped when the dimension of the vector 1n is clear from the context.) The vector 1 forms equal angles with each of the n coordinate axes, so the vector (1/√n)1 has unit length in the equal-angle direction. Consider the vector yi' = [x1i, x2i, ..., xni]. The projection of yi on the unit vector (1/√n)1 is, by (2-8),

    yi'(1/√n)1 (1/√n)1 = ((x1i + x2i + ... + xni)/n) 1 = x̄i 1        (3-3)

That is, the sample mean x̄i = (x1i + x2i + ... + xni)/n = yi'1/n corresponds to the multiple of 1 required to give the projection of yi onto the line determined by 1.
Further, for each i, we can decompose yi into its projection on the line determined by 1 and the deviation from this projection:

    di = yi - x̄i 1 = [x1i - x̄i]
                     [x2i - x̄i]        (3-4)
                     [   .    ]
                     [xni - x̄i]

The elements of di are the deviations of the measurements on the ith variable from their sample mean. Decomposition of the yi vectors into mean components and deviation from the mean components is shown in Figure 3.3 for p = 3 and n = 3.

Figure 3.3 The decomposition of yi into a mean component x̄i 1 and a deviation component di = yi - x̄i 1, i = 1, 2, 3.
Example 3.3 (Decomposing a vector into its mean and deviation components) Let us carry out the decomposition of yi into x̄i 1 and di = yi - x̄i 1, i = 1, 2, for the data given in Example 3.2. Here x̄1 = (4 - 1 + 3)/3 = 2 and x̄2 = (1 + 3 + 5)/3 = 3, so

    x̄1 1 = [2]        d1 = y1 - x̄1 1 = [ 4]   [2]   [ 2]
           [2]                          [-1] - [2] = [-3]
           [2]                          [ 3]   [2]   [ 1]

We note that x̄1 1 and d1 = y1 - x̄1 1 are perpendicular, because

    (x̄1 1)'(y1 - x̄1 1) = [2 2 2] [ 2]
                                 [-3] = 4 - 6 + 2 = 0
                                 [ 1]

A similar result holds for x̄2 1 and d2 = y2 - x̄2 1:

    x̄2 1 = [3]        d2 = y2 - x̄2 1 = [1]   [3]   [-2]
           [3]                          [3] - [3] = [ 0]
           [3]                          [5]   [3]   [ 2]

and (x̄2 1)'d2 = [3 3 3][-2, 0, 2]' = -6 + 0 + 6 = 0.
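The decomposition and the perpendicularity check (x̄i 1)'di = 0 can be verified in a few lines, using the columns y1 and y2 of this example (NumPy is our addition):

```python
import numpy as np

# Columns of the data matrix in Example 3.2, viewed as vectors in n = 3 space.
y1 = np.array([4.0, -1.0, 3.0])
y2 = np.array([1.0, 3.0, 5.0])
one = np.ones(3)

for y in (y1, y2):
    mean_part = y.mean() * one   # projection of y on the equal-angle vector 1
    d = y - mean_part            # deviation (residual) vector
    # The two components are perpendicular: (x_bar 1)' d = 0.
    assert np.isclose(mean_part @ d, 0.0)
```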
For the time being, we are interested in the deviation (or residual) vectors di = yi - x̄i 1. A plot of the deviation vectors of Figure 3.3 is given in Figure 3.4.
We have translated the deviation vectors to the origin without changing their lengths or orientations. Now consider the squared lengths of the deviation vectors. Using (2-5) and (3-4), we obtain

    L^2_{di} = di'di = sum_{j=1}^{n} (xji - x̄i)^2        (3-5)

(Length of deviation vector)^2 = sum of squared deviations. From (1-3), we see that the squared length is proportional to the variance of the measurements on the ith variable.¹ Equivalently, the length is proportional to the standard deviation. Longer vectors represent more variability than shorter vectors. For any two deviation vectors di and dk,

    di'dk = sum_{j=1}^{n} (xji - x̄i)(xjk - x̄k)        (3-6)

Let θik denote the angle formed by the vectors di and dk. From (2-6), we get di'dk = L_{di} L_{dk} cos(θik), or, using (3-5) and (3-6),

    sum_{j=1}^{n} (xji - x̄i)(xjk - x̄k) = sqrt(sum_{j=1}^{n} (xji - x̄i)^2) sqrt(sum_{j=1}^{n} (xjk - x̄k)^2) cos(θik)

so that

    r_ik = s_ik / (sqrt(s_ii) sqrt(s_kk)) = cos(θik)        (3-7)

The cosine of the angle is the sample correlation coefficient. Thus, if the two deviation vectors have nearly the same orientation, the sample correlation will be close to 1. If the two vectors are nearly perpendicular, the sample correlation will be approximately zero. If the two vectors are oriented in nearly opposite directions, the sample correlation will be close to -1.
Example 3.4 (Calculating Sn and R from deviation vectors) Given the deviation vectors in Example 3.3, let us compute the sample variance-covariance matrix Sn and sample correlation matrix R using the geometrical concepts just introduced. From Example 3.3,

    d1 = [ 2]        d2 = [-2]
         [-3]             [ 0]
         [ 1]             [ 2]

These vectors, translated to the origin, are shown in Figure 3.5. Now,

    d1'd1 = 4 + 9 + 1 = 14 = 3 s11

or s11 = 14/3. Also,

    d2'd2 = 4 + 0 + 4 = 8 = 3 s22

or s22 = 8/3. Finally,

    d1'd2 = -4 + 0 + 2 = -2 = 3 s12

or s12 = -2/3. Consequently,

    r12 = s12 / (sqrt(s11) sqrt(s22)) = (-2/3) / (sqrt(14/3) sqrt(8/3)) = -.189

and

    Sn = [14/3 -2/3]        R = [   1  -.189]
         [-2/3  8/3]            [-.189    1 ]
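The same quantities follow from inner products of the deviation vectors, since di'dk = n s_ik. A NumPy sketch using d1 and d2 from this example:

```python
import numpy as np

d1 = np.array([2.0, -3.0, 1.0])   # deviation vectors from Example 3.3
d2 = np.array([-2.0, 0.0, 2.0])
n = 3

s11 = d1 @ d1 / n                 # 14/3
s22 = d2 @ d2 / n                 # 8/3
s12 = d1 @ d2 / n                 # -2/3
r12 = s12 / np.sqrt(s11 * s22)
print(round(r12, 3))              # -0.189
```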
The concepts of length, angle, and projection have provided us with a geometrical interpretation of the sample.
3.3 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix

In order to study the sampling variability of statistics such as x̄ and S_n with the ultimate aim of making inferences, we need to make assumptions about the variables whose observed values constitute the data set X. Suppose, then, that the data have not yet been observed, but we intend to collect n sets of measurements on p variables. Before the measurements are made, their values cannot, in general, be predicted exactly. Consequently, we treat them as random variables. In this context, let the (j, k)th entry in the data matrix be the random variable X_{jk}. Each set of measurements X_j on p variables is a random vector, and we have the random matrix
X       [ X_11  X_12  ...  X_1p ]   [ X_1' ]
(n×p) = [ X_21  X_22  ...  X_2p ] = [ X_2' ]
        [  ⋮     ⋮           ⋮  ]   [  ⋮   ]
        [ X_n1  X_n2  ...  X_np ]   [ X_n' ]    (3-8)
A random sample can now be defined. If the row vectors X_1', X_2', ..., X_n' in (3-8) represent independent observations from a common joint distribution with density function f(x) = f(x_1, x_2, ..., x_p), then X_1, X_2, ..., X_n are said to form a random sample from f(x). Mathematically, X_1, X_2, ..., X_n form a random sample if their joint density function is given by the product f(x_1)f(x_2)···f(x_n), where f(x_j) = f(x_{j1}, x_{j2}, ..., x_{jp}) is the density function for the jth row vector.

Two points connected with the definition of random sample merit special attention:

1. The measurements of the p variables in a single trial, such as X_j' = [X_{j1}, X_{j2}, ..., X_{jp}], will usually be correlated. Indeed, we expect this to be the case. The measurements from different trials must, however, be independent.
1 The square of the length and the inner product are (n-1)s_ii and (n-1)s_ik, respectively, when the divisor n-1 is used in the definitions of the sample variance and covariance.
2. The independence of measurements from trial to trial may not hold when the variables are likely to drift over time, as with sets of p stock prices or p economic indicators. Violations of the tentative assumption of independence can have a serious impact on the quality of statistical inferences.

The following examples illustrate these remarks.
Example 3.5 (Selecting a random sample) As a preliminary step in designing a permit system for utilizing a wilderness canoe area without overcrowding, a natural-resource manager took a survey of users. The total wilderness area was divided into subregions, and respondents were asked to give information on the regions visited, lengths of stay, and other variables.

The method followed was to select persons randomly (perhaps using a random number table) from all those who entered the wilderness area during a particular week. All persons were equally likely to be in the sample, so the more popular entrances were represented by larger proportions of canoeists. Here one would expect the sample observations to conform closely to the criterion for a random sample from the population of users or potential users. On the other hand, if one of the samplers had waited at a campsite far in the interior of the area and interviewed only canoeists who reached that spot, successive measurements would not be independent. For instance, lengths of stay in the wilderness area for different canoeists from this group would all tend to be large. ■

Example 3.6 (A nonrandom sample) Because of concerns with future solid-waste disposal, an ongoing study concerns the gross weight of municipal solid waste generated per year in the United States (Environmental Protection Agency). Estimated amounts attributed to x_1 = paper and paperboard waste and x_2 = plastic waste, in millions of tons, are given for selected years in Table 3.1. Should these measurements on X' = [X_1, X_2] be treated as a random sample of size n = 7? No! In fact, except for a slight but fortunate downturn in paper and paperboard waste in 2003, both variables are increasing over time.

Table 3.1 Solid Waste
Year            x_1 (paper)    x_2 (plastics)
1960            29.2           .4
As we have argued heuristically in Chapter 1, the notion of statistical independence has important implications for measuring distance. Euclidean distance appears appropriate if the components of a vector are independent and have the same variances. Suppose we consider the location of the kth column y_k' = [X_{1k}, X_{2k}, ..., X_{nk}] of X, regarded as a point in n dimensions. The location of this point is determined by the joint probability distribution f(y_k) = f(x_{1k}, x_{2k}, ..., x_{nk}). When the measurements X_{1k}, X_{2k}, ..., X_{nk} are a random sample, f(y_k) = f(x_{1k}, x_{2k}, ..., x_{nk}) = f_k(x_{1k}) f_k(x_{2k}) ··· f_k(x_{nk}) and, consequently, each coordinate x_{jk} contributes equally to the location through the identical marginal distributions f_k(x_{jk}).
If the n components are not independent or the marginal distributions are not identical, the influence of individual measurements (coordinates) on location is asymmetrical. We would then be led to consider a distance function in which the coordinates were weighted unequally, as in the "statistical" distances or quadratic forms introduced in Chapters 1 and 2.

Certain conclusions can be reached concerning the sampling distributions of X̄ and S_n without making further assumptions regarding the form of the underlying joint distribution of the variables. In particular, we can see how X̄ and S_n fare as point estimators of the corresponding population mean vector μ and covariance matrix Σ.
Result 3.1. Let X_1, X_2, ..., X_n be a random sample from a joint distribution that has mean vector μ and covariance matrix Σ. Then X̄ is an unbiased estimator of μ, and its covariance matrix is (1/n)Σ. That is,

E(X̄) = μ    (population mean vector)

Cov(X̄) = (1/n)Σ    (3-9)

For the sample variance-covariance matrix S_n,

E(S_n) = ((n-1)/n)Σ = Σ - (1/n)Σ

Thus,

E( (n/(n-1)) S_n ) = Σ    (3-10)

so [n/(n-1)]S_n is an unbiased estimator of Σ, while S_n is a biased estimator with (bias) = E(S_n) - Σ = -(1/n)Σ.
Proof. Now, X̄ = (X_1 + X_2 + ··· + X_n)/n. The repeated use of the properties of expectation in (2-24) for two vectors gives

E(X̄) = E((1/n)X_1 + (1/n)X_2 + ··· + (1/n)X_n)
     = E((1/n)X_1) + E((1/n)X_2) + ··· + E((1/n)X_n)
     = (1/n)E(X_1) + (1/n)E(X_2) + ··· + (1/n)E(X_n)
     = (1/n)μ + (1/n)μ + ··· + (1/n)μ
     = μ
Next,

(X̄ - μ)(X̄ - μ)' = ( (1/n) Σ_{j=1}^n (X_j - μ) ) ( (1/n) Σ_{ℓ=1}^n (X_ℓ - μ) )'
                 = (1/n²) Σ_{j=1}^n Σ_{ℓ=1}^n (X_j - μ)(X_ℓ - μ)'

so

Cov(X̄) = E(X̄ - μ)(X̄ - μ)' = (1/n²) Σ_{j=1}^n Σ_{ℓ=1}^n E(X_j - μ)(X_ℓ - μ)'
For j ≠ ℓ, each entry in E(X_j - μ)(X_ℓ - μ)' is zero because the entry is the covariance between a component of X_j and a component of X_ℓ, and these are independent. [See Exercise 3.17 and (2-29).] Therefore,
Cov(X̄) = (1/n²) Σ_{j=1}^n E(X_j - μ)(X_j - μ)'

Since Σ = E(X_j - μ)(X_j - μ)' is the common population covariance matrix for each X_j, we have

Cov(X̄) = (1/n²)(Σ + Σ + ··· + Σ) = (1/n²)(nΣ) = (1/n)Σ

To obtain the expected value of S_n, we first note that

Σ_{j=1}^n (X_j - X̄)(X_j - X̄)' = Σ_{j=1}^n X_j X_j' - n X̄ X̄'

since Σ_{j=1}^n (X_j - X̄) = 0 and n X̄' = Σ_{j=1}^n X_j'. Therefore, its expected value is

E( Σ_{j=1}^n X_j X_j' - n X̄ X̄' ) = Σ_{j=1}^n E(X_j X_j') - n E(X̄ X̄')

For any random vector V with E(V) = μ_V and Cov(V) = Σ_V, we have E(VV') = Σ_V + μ_V μ_V'. (See Exercise 3.16.) Consequently,

E(X_j X_j') = Σ + μμ'    and    E(X̄ X̄') = (1/n)Σ + μμ'

Using these results, we obtain

Σ_{j=1}^n E(X_j X_j') - n E(X̄ X̄') = nΣ + nμμ' - n((1/n)Σ + μμ') = (n-1)Σ

and thus, since S_n = (1/n)( Σ_{j=1}^n X_j X_j' - n X̄ X̄' ), it follows that

E(S_n) = ((n-1)/n) Σ    ■
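The bias result just proved lends itself to a Monte Carlo check. A sketch, assuming an arbitrary bivariate normal population (the choices of mu, Sigma, n, and the trial count are illustrative, not from the text):

```python
import numpy as np

# Monte Carlo check of Result 3.1 with an arbitrary normal population.
rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
n, trials = 5, 20000

Sn_avg = np.zeros((2, 2))
for _ in range(trials):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    D = X - X.mean(axis=0)
    Sn_avg += D.T @ D / n        # S_n uses the divisor n
Sn_avg /= trials

# E(S_n) = ((n-1)/n) Sigma, so rescaling by n/(n-1) removes the bias.
S_avg = Sn_avg * n / (n - 1)
print(Sn_avg)  # close to (4/5) Sigma for n = 5
print(S_avg)   # close to Sigma
```

With n as small as 5 the bias factor (n-1)/n = 4/5 is visible in the averaged S_n, and the n/(n-1) rescaling removes it, as (3-10) asserts.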
Generalized Variance
Result 3.1 shows that the (i, k)th entry of [n/(n-1)]S_n, namely (n-1)^{-1} Σ_{j=1}^n (X_{ji} - X̄_i)(X_{jk} - X̄_k), is an unbiased estimator of σ_ik. However, the individual sample standard deviations √s_ii, calculated with either n or n-1 as a divisor, are not unbiased estimators of the corresponding population quantities √σ_ii. Moreover, the correlation coefficients r_ik are not unbiased estimators of the population quantities ρ_ik. However, the bias E(√s_ii) - √σ_ii, or E(r_ik) - ρ_ik, can usually be ignored if the sample size n is moderately large.

Consideration of bias motivates a slightly modified definition of the sample variance-covariance matrix. Result 3.1 provides us with an unbiased estimator S of Σ:
(Unbiased) Sample Variance-Covariance Matrix

S = (n/(n-1)) S_n = (1/(n-1)) Σ_{j=1}^n (X_j - X̄)(X_j - X̄)'    (3-11)

Here S, without a subscript, has (i, k)th entry (n-1)^{-1} Σ_{j=1}^n (X_{ji} - X̄_i)(X_{jk} - X̄_k).

This definition of sample covariance is commonly used in many multivariate test statistics. Therefore, it will replace S_n as the sample covariance matrix in most of the material throughout the rest of this book.
The sample covariance matrix contains p variances and (1/2)p(p-1) potentially different covariances. Sometimes it is desirable to assign a single numerical value for the variation expressed by S. One choice for a value is the determinant of S, which reduces to the usual sample variance of a single characteristic when p = 1. This determinant² is called the generalized sample variance:
Generalized sample variance = |S|    (3-12)
2 Definition 2A.24 defines "determinant" and indicates one method for calculating the value of a determinant.
Example 3.7 (Calculating a generalized variance) Employees (x_1) and profits per employee (x_2) for the 16 largest publishing firms in the United States are shown in Figure 1.3. The sample covariance matrix, obtained from the data in the April 30, 1990, Forbes magazine article, is

S = [ 252.04   -68.43 ]
    [ -68.43   123.67 ]

Evaluate the generalized variance. In this case, we compute

|S| = (252.04)(123.67) - (-68.43)(-68.43) = 26,487    ■
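The determinant in Example 3.7 is easy to evaluate directly. A brief NumPy sketch (matrix entries as given in the example):

```python
import numpy as np

# Covariance matrix from Example 3.7.
S = np.array([[252.04, -68.43],
              [-68.43, 123.67]])

# Generalized sample variance = |S| = s11*s22 - s12^2 when p = 2.
gen_var = np.linalg.det(S)
print(gen_var)  # about 26,487
```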
The generalized sample variance provides one way of writing the information on all variances and covariances as a single number. Of course, when p > 1, some information about the sample is lost in the process.

A geometrical interpretation of |S| will help us appreciate its strengths and weaknesses as a descriptive summary. Consider the area generated within the plane by two deviation vectors d_1 = y_1 - x̄_1 1 and d_2 = y_2 - x̄_2 1. Let L_{d_1} be the length of d_1 and L_{d_2} the length of d_2. By elementary geometry, taking d_2 as the base, the perpendicular height of d_1 above it is

Height = L_{d_1} sin(θ)

where θ is the angle between d_1 and d_2, and the area of the trapezoid is |L_{d_1} sin(θ)| L_{d_2}. Since cos²(θ) + sin²(θ) = 1, we can express this area as

Area = L_{d_1} L_{d_2} √(1 - cos²(θ))
From (3-5) and (3-7),

L_{d_1} = √( Σ_{j=1}^n (x_{j1} - x̄_1)² ) = √((n-1) s_11)

L_{d_2} = √( Σ_{j=1}^n (x_{j2} - x̄_2)² ) = √((n-1) s_22)

and

cos(θ) = r_12

Therefore,

Area = (n-1) √s_11 √s_22 √(1 - r_12²) = (n-1) √( s_11 s_22 (1 - r_12²) )    (3-13)

Also,

|S| = | s_11  s_12 | = s_11 s_22 - s_12² = s_11 s_22 - s_11 s_22 r_12² = s_11 s_22 (1 - r_12²)    (3-14)
      | s_12  s_22 |
Figure 3.6 (a) "Large" generalized sample variance for p = 3. (b) "Small" generalized sample variance for p = 3.
If we compare (3-14) with (3-13), we see that

|S| = (area)²/(n-1)²
Assuming now that |S| = (n-1)^{-(p-1)}(volume)² holds for the volume generated in n space by the p-1 deviation vectors d_1, d_2, ..., d_{p-1}, we can establish the following general result for p deviation vectors by induction (see [1], p. 266):

Generalized sample variance = |S| = (n-1)^{-p}(volume)²    (3-15)
Equation (3-15) says that the generalized sample variance, for a fixed set of data, is proportional³ to the square of the volume generated by the p deviation vectors d_1 = y_1 - x̄_1 1, d_2 = y_2 - x̄_2 1, ..., d_p = y_p - x̄_p 1. Figures 3.6(a) and (b) show trapezoidal regions, generated by p = 3 residual vectors, corresponding to "large" and "small" generalized variances.

For a fixed sample size, it is clear from the geometry that volume, or |S|, will increase when the length of any d_i = y_i - x̄_i 1 (or √s_ii) is increased. In addition, volume will increase if the residual vectors of fixed length are moved until they are at right angles to one another, as in Figure 3.6(a). On the other hand, the volume, or |S|, will be small if just one of the s_ii is small or one of the deviation vectors lies nearly in the (hyper) plane formed by the others, or both. In the second case, the trapezoid has very little height above the plane. This is the situation in Figure 3.6(b), where d_3 lies nearly in the plane formed by d_1 and d_2.
3 If generalized variance is defined in terms of the sample covariance matrix S_n = [(n-1)/n]S, then, using Result 2A.11, |S_n| = |[(n-1)/n] I_p S| = |[(n-1)/n] I_p| |S| = [(n-1)/n]^p |S|. Consequently, using (3-15), we can also write the following: Generalized sample variance = |S_n| = n^{-p}(volume)².
Chapter 3 Sample Geometry and Random Sampling

Generalized variance also has interpretations in the p-space scatter plot representation of the data. The most intuitive interpretation concerns the spread of the scatter about the sample mean point x̄' = [x̄_1, x̄_2, ..., x̄_p]. Consider the measure of distance given in the comment below (2-19), with x̄ playing the role of the fixed point μ and S^{-1} playing the role of A. With these choices, the coordinates x' = [x_1, x_2, ..., x_p] of the points a constant distance c from x̄ satisfy
(x - x̄)' S^{-1} (x - x̄) = c²    (3-16)

[When p = 1, (x - x̄)'S^{-1}(x - x̄) = (x_1 - x̄_1)²/s_11 is the squared distance from x_1 to x̄_1 in standard deviation units.]

Equation (3-16) defines a hyperellipsoid (an ellipse if p = 2) centered at x̄. It can be shown using integral calculus that the volume of this hyperellipsoid is related to |S|. In particular,

Volume of {x: (x - x̄)'S^{-1}(x - x̄) ≤ c²} = k_p |S|^{1/2} c^p    (3-17)

or

(Volume of ellipsoid)² = (constant)(generalized sample variance)

where the constant k_p is rather formidable.⁴ A large volume corresponds to a large generalized variance.

Although the generalized variance has some intuitively pleasing geometrical interpretations, it suffers from a basic weakness as a descriptive summary of the sample covariance matrix S, as the following example shows.
Example 3.8 (Interpreting the generalized variance) Figure 3.7 gives three scatter plots with very different patterns of correlation. All three data sets have x̄' = [2, 1], and the covariance matrices are

S = [ 5  4 ],  r = .8        S = [ 3  0 ],  r = 0        S = [ 5  -4 ],  r = -.8
    [ 4  5 ]                     [ 0  3 ]                    [ -4  5 ]
Each covariance matrix S contains the information on the variability of the component variables and also the information required to calculate the correlation coefficient. In this sense, S captures the orientation and size of the pattern of scatter.

The eigenvalues and eigenvectors extracted from S further describe the pattern in the scatter plot. For

S = [ 5  4 ]
    [ 4  5 ]

the eigenvalues satisfy

0 = |S - λI| = (λ - 5)² - 4² = (λ - 9)(λ - 1)
4 For those who are curious, k_p = 2π^{p/2}/(p Γ(p/2)), where Γ(z) denotes the gamma function evaluated at z.
[Figure 3.7 Scatter plots, panels (a), (b), and (c), with three different patterns of correlation.]
and we determine the eigenvalue-eigenvector pairs λ_1 = 9, e_1' = [1/√2, 1/√2] and λ_2 = 1, e_2' = [1/√2, -1/√2].

The mean-centered ellipse, with center x̄' = [2, 1] for all three cases, is

(x - x̄)' S^{-1} (x - x̄) ≤ c²
To describe this ellipse, as in Section 2.3, with A = S^{-1}, we notice that if (λ, e) is an eigenvalue-eigenvector pair for S, then (λ^{-1}, e) is an eigenvalue-eigenvector pair for S^{-1}. That is, if Se = λe, then multiplying on the left by S^{-1} gives S^{-1}Se = λS^{-1}e, or S^{-1}e = λ^{-1}e. Therefore, using the eigenvalues from S, we know that the ellipse extends c√λ_i in the direction of e_i from x̄.
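The eigenvalue-eigenvector description of the ellipse can be computed directly. A sketch for the first matrix of Example 3.8 (c² = 5.99 is the approximate 95% value used in the text):

```python
import numpy as np

# First covariance matrix of Example 3.8.
S = np.array([[5.0, 4.0],
              [4.0, 5.0]])
c2 = 5.99

# The ellipse (x - xbar)' S^{-1} (x - xbar) = c^2 extends c*sqrt(lambda_i)
# in the direction of the eigenvector e_i of S.
lam, E = np.linalg.eigh(S)       # eigenvalues in ascending order: 1, 9
half_lengths = np.sqrt(c2 * lam)

print(lam)           # [1. 9.]
print(half_lengths)  # [sqrt(5.99), 3*sqrt(5.99)]
```

The columns of E are the eigenvectors e_i, so the axis endpoints of the mean-centered ellipse are xbar plus or minus half_lengths[i] * E[:, i].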
In p = 2 dimensions, the choice c² = 5.99 will produce an ellipse that contains approximately 95% of the observations. The vectors 3√5.99 e_1 and √5.99 e_2 are drawn in Figure 3.8(a). Notice how the directions are the natural axes for the ellipse, and observe that the lengths of these scaled eigenvectors are comparable to the size of the pattern in each direction. Next, for
S = [ 3  0 ]
    [ 0  3 ]

the eigenvalues satisfy

0 = |S - λI| = (λ - 3)²

and we arbitrarily choose the eigenvectors so that λ_1 = 3, e_1' = [1, 0] and λ_2 = 3, e_2' = [0, 1]. The vectors √3 √5.99 e_1 and √3 √5.99 e_2 are drawn in Figure 3.8(b).
Figure 3.8 Axes of the mean-centered 95% ellipses for the scatter plots in Figure 3.7.
Finally, for

S = [ 5  -4 ]
    [ -4  5 ]

the eigenvalues satisfy

0 = |S - λI| = (λ - 5)² - 4² = (λ - 9)(λ - 1)

and we determine the eigenvalue-eigenvector pairs λ_1 = 9, e_1' = [1/√2, -1/√2] and λ_2 = 1, e_2' = [1/√2, 1/√2]. The scaled eigenvectors 3√5.99 e_1 and √5.99 e_2 are drawn in Figure 3.8(c).

In two dimensions, we can often sketch the axes of the mean-centered ellipse by eye. However, the eigenvector approach also works for high dimensions where the data cannot be examined visually.

Note: Here the generalized variance |S| gives the same value, |S| = 9, for all three patterns. But generalized variance does not contain any information on the orientation of the patterns. Generalized variance is easier to interpret when the two or more samples (patterns) being compared have nearly the same orientations.

Notice that our three patterns of scatter appear to cover approximately the same area. The ellipses that summarize the variability

(x - x̄)' S^{-1} (x - x̄) ≤ c²

do have exactly the same area [see (3-17)], since all have |S| = 9.    ■
As Example 3.8 demonstrates, different correlation structures are not detected by |S|. The situation for p > 2 can be even more obscure. Consequently, it is often desirable to provide more than the single number |S| as a summary of S. From Exercise 2.12, |S| can be expressed as the product λ_1 λ_2 ··· λ_p of the eigenvalues of S. Moreover, the mean-centered ellipsoid based on S^{-1} [see (3-16)] has axes whose lengths are proportional to the square roots of the λ_i's (see Section 2.3). These eigenvalues then provide information on the variability in all directions in the p-space representation of the data. It is useful, therefore, to report their individual values, as well as their product. We shall pursue this topic later when we discuss principal components.
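Both points, that |S| is the product of the eigenvalues and that it is blind to orientation, can be verified on the three matrices of Example 3.8. A brief sketch:

```python
import numpy as np

# The three covariance matrices of Example 3.8 (r = .8, 0, -.8).
matrices = [np.array([[5.0, 4.0], [4.0, 5.0]]),
            np.array([[3.0, 0.0], [0.0, 3.0]]),
            np.array([[5.0, -4.0], [-4.0, 5.0]])]

for S in matrices:
    lam = np.linalg.eigvalsh(S)
    # |S| equals the product of the eigenvalues; all three patterns
    # give 9 despite very different correlation structures.
    print(lam, np.prod(lam), np.linalg.det(S))
```

The individual eigenvalues (9 and 1 versus 3 and 3) distinguish the patterns even though the product does not.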
Situations in which the Generalized Sample Variance Is Zero

The generalized sample variance will be zero in certain situations. A generalized variance of zero is indicative of extreme degeneracy, in the sense that at least one column of the matrix of deviations,

[ x_11 - x̄_1   x_12 - x̄_2   ...   x_1p - x̄_p ]
[ x_21 - x̄_1   x_22 - x̄_2   ...   x_2p - x̄_p ]   =   X    -    1     x̄'       (3-18)
[     ⋮             ⋮                  ⋮      ]      (n×p)    (n×1) (1×p)
[ x_n1 - x̄_1   x_n2 - x̄_2   ...   x_np - x̄_p ]

can be expressed as a linear combination of the other columns. As we have shown geometrically, this is a case where one of the deviation vectors, for instance, d_i' = [x_{1i} - x̄_i, ..., x_{ni} - x̄_i], lies in the (hyper) plane generated by d_1, ..., d_{i-1}, d_{i+1}, ..., d_p.
Result 3.2. The generalized variance is zero when, and only when, at least one deviation vector lies in the (hyper) plane formed by all linear combinations of the others, that is, when the columns of the matrix of deviations in (3-18) are linearly dependent.
Proof. If the columns of the deviation matrix (X - 1x̄') are linearly dependent, there is a linear combination of the columns such that

0 = a_1 col_1(X - 1x̄') + ··· + a_p col_p(X - 1x̄') = (X - 1x̄')a    for some a ≠ 0

But then, as you may verify, (n-1)S = (X - 1x̄')'(X - 1x̄'), and so

(n-1)Sa = (X - 1x̄')'(X - 1x̄')a = 0

so the same a corresponds to a linear dependency, a_1 col_1(S) + ··· + a_p col_p(S) = Sa = 0, in the columns of S. So, by Result 2A.9, |S| = 0.

In the other direction, if |S| = 0, then there is some linear combination Sa of the columns of S such that Sa = 0. That is, 0 = (n-1)Sa = (X - 1x̄')'(X - 1x̄')a. Premultiplying by a' yields

0 = a'(X - 1x̄')'(X - 1x̄')a = L²_{(X - 1x̄')a}

the squared length of the vector (X - 1x̄')a. For the length to equal zero, we must have (X - 1x̄')a = 0, so the columns of (X - 1x̄') are linearly dependent.    ■
Example 3.9 (A case where the generalized variance is zero) Show that |S| = 0 for

X       [ 1  2  5 ]
(3×3) = [ 4  1  6 ]
        [ 4  0  4 ]

and determine the degeneracy. Here x̄' = [3, 1, 5], so

           [ 1-3  2-1  5-5 ]   [ -2   1   0 ]
X - 1x̄' = [ 4-3  1-1  6-5 ] = [  1   0   1 ]
           [ 4-3  0-1  4-5 ]   [  1  -1  -1 ]

The deviation (column) vectors are d_1' = [-2, 1, 1], d_2' = [1, 0, -1], and d_3' = [0, 1, -1]. Since d_3 = d_1 + 2d_2, there is column degeneracy. (Note that there is row degeneracy also.) This means that one of the deviation vectors, for example, d_3, lies in the plane generated by the other two residual vectors. Consequently, the three-dimensional volume is zero. This case is illustrated in Figure 3.9 and may be verified algebraically by showing that |S| = 0. We have

S       [  3    -3/2   0  ]
(3×3) = [ -3/2   1    1/2 ]
        [  0    1/2    1  ]
[Figure 3.9 A case where the three-dimensional volume is zero (|S| = 0); d_3 lies in the plane generated by d_1 and d_2.]

and, expanding along the first row,

|S| = 3[(1)(1) - (1/2)(1/2)] - (-3/2)[(-3/2)(1) - (1/2)(0)] + 0 = 3(3/4) - 9/4 = 9/4 - 9/4 = 0    ■
When large data sets are sent and received electronically, investigators are sometimes unpleasantly surprised to find a case of zero generalized variance, so that S does not have an inverse. We have encountered several such cases, with their associated difficulties, before the situation was unmasked. A singular covariance matrix occurs when, for instance, the data are test scores and the investigator has included variables that are sums of the others. For example, an algebra score and a geometry score could be combined to give a total math score, or class midterm and final exam scores summed to give total points. Once, the total weight of a number of chemicals was included along with that of each component. This common practice of creating new variables that are sums of the original variables and then including them in the data set has caused enough lost time that we emphasize the necessity of being alert to avoid these consequences.
Example 3.10 (Creating new variables that lead to a zero generalized variance) Consider the data matrix

    [ 1   9  10 ]
    [ 4  12  16 ]
X = [ 2  10  12 ]
    [ 5   8  13 ]
    [ 3  11  14 ]

where the third column is the sum of the first two columns. These data could be the number of successful phone solicitations per day by a part-time and a full-time employee, respectively, so the third column is the total number of successful solicitations per day. Show that the generalized variance |S| = 0, and determine the nature of the dependency in the data.
We find that the mean-corrected data matrix, with entries x_{jk} - x̄_k, is

           [ -2  -1  -3 ]
           [  1   2   3 ]
X - 1x̄' = [ -1   0  -1 ]
           [  2  -2   0 ]
           [  0   1   1 ]

The resulting covariance matrix is

    [ 2.5  0    2.5 ]
S = [ 0    2.5  2.5 ]
    [ 2.5  2.5  5.0 ]

We verify that, in this case, the generalized variance

|S| = (2.5)²(5.0) + 0 + 0 - (2.5)³ - (2.5)³ - 0 = 0

In general, if the three columns of the data matrix X satisfy a linear constraint a_1 x_{j1} + a_2 x_{j2} + a_3 x_{j3} = c, a constant for all j, then a_1 x̄_1 + a_2 x̄_2 + a_3 x̄_3 = c, so that
a_1(x_{j1} - x̄_1) + a_2(x_{j2} - x̄_2) + a_3(x_{j3} - x̄_3) = 0    for all j
and the columns of the mean-corrected data matrix are linearly dependent. Thus, the inclusion of the third variable, which is linearly related to the first two, has led to the case of a zero generalized variance. Whenever the columns of the mean-corrected data matrix are linearly dependent,
(n-1)Sa = (X - 1x̄')'(X - 1x̄')a = (X - 1x̄')'0 = 0

and Sa = 0 establishes the linear dependency of the columns of S. Hence, |S| = 0.

Since Sa = 0 = 0·a, we see that a is a scaled eigenvector of S associated with an eigenvalue of zero. This gives rise to an important diagnostic: If we are unaware of any extra variables that are linear combinations of the others, we can find them by calculating the eigenvectors of S and identifying the one associated with a zero eigenvalue. That is, if we were unaware of the dependency in this example, a computer calculation would find an eigenvector proportional to a' = [1, 1, -1], since

     [ 2.5  0    2.5 ] [  1 ]   [ 0 ]
Sa = [ 0    2.5  2.5 ] [  1 ] = [ 0 ]
     [ 2.5  2.5  5.0 ] [ -1 ]   [ 0 ]

The coefficients reveal that

(1)(x_{j1} - x̄_1) + (1)(x_{j2} - x̄_2) + (-1)(x_{j3} - x̄_3) = 0    for all j
In addition, the sum of the first two variables minus the third is a constant c for all n units. Here the third variable is actually the sum of the first two variables, so the columns of the original data matrix satisfy a linear constraint with c = 0. Because we have the special case c = 0, the constraint establishes the fact that the columns of the data matrix are linearly dependent.
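The diagnostic just described, recovering the dependency from the eigenvector of S that belongs to a zero eigenvalue, can be sketched for the data of Example 3.10:

```python
import numpy as np

# Data from Example 3.10; the third column is the sum of the first two.
X = np.array([[1.0, 9.0, 10.0],
              [4.0, 12.0, 16.0],
              [2.0, 10.0, 12.0],
              [5.0, 8.0, 13.0],
              [3.0, 11.0, 14.0]])
n = X.shape[0]
D = X - X.mean(axis=0)
S = D.T @ D / (n - 1)

# The eigenvector belonging to the (numerically) zero eigenvalue gives
# the coefficients of the dependency, proportional to a' = [1, 1, -1].
lam, E = np.linalg.eigh(S)   # ascending order: lam[0] is the smallest
a = E[:, 0]
print(lam[0])      # essentially 0
print(a / a[0])    # proportional to [1, 1, -1]
```

In floating point the "zero" eigenvalue comes out as a tiny number rather than exactly 0, so in practice one tests it against a small tolerance.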
Let us summarize the important equivalent conditions for a generalized variance to be zero that we discussed in the preceding example. Whenever a nonzero vector a satisfies one of the following three conditions, it satisfies all of them:

(1) Sa = 0    (a is a scaled eigenvector of S with eigenvalue 0.)
(2) a'(x_j - x̄) = 0 for all j    (The linear combination of the mean-corrected data, using a, is zero.)
(3) a'x_j = c for all j    (The linear combination of the original data, using a, is a constant.)
We showed that if condition (3) is satisfied, that is, if the values for one variable can be expressed in terms of the others, then the generalized variance is zero because S has a zero eigenvalue. In the other direction, if condition (1) holds, then the eigenvector a gives coefficients for the linear dependency of the mean-corrected data.

In any statistical analysis, |S| = 0 means that the measurements on some variables should be removed from the study as far as the mathematical computations are concerned. The corresponding reduced data matrix will then lead to a covariance matrix of full rank and a nonzero generalized variance. The question of which measurements to remove in degenerate cases is not easy to answer. When there is a choice, one should retain measurements on a (presumed) causal variable instead of those on a secondary characteristic. We shall return to this subject in our discussion of principal components. At this point, we settle for delineating some simple conditions for S to be of full rank or of reduced rank.
Result 3.3. If n ≤ p, that is, (sample size) ≤ (number of variables), then |S| = 0 for all samples.
Proof. We must show that the rank of S is less than or equal to p and then apply Result 2A.9. For any fixed sample, the n row vectors in (3-18) sum to the zero vector. The existence of this linear combination means that the rank of X - 1x̄' is less than or equal to n - 1, which, in turn, is less than or equal to p - 1 because n ≤ p. Since

(n-1) S = (X - 1x̄')' (X - 1x̄')
 (p×p)     (p×n)      (n×p)

the kth column of S, col_k(S), can be written as a linear combination of the columns of (X - 1x̄')'. In particular,

(n-1) col_k(S) = (X - 1x̄')' col_k(X - 1x̄')
               = (x_{1k} - x̄_k) col_1(X - 1x̄')' + ··· + (x_{nk} - x̄_k) col_n(X - 1x̄')'

Since the column vectors of (X - 1x̄')' sum to the zero vector, we can write, for example, col_1(X - 1x̄')' as the negative of the sum of the remaining column vectors. After substituting for col_1(X - 1x̄')' in the preceding equation, we can express col_k(S) as a linear combination of the at most n - 1 linearly independent vectors col_2(X - 1x̄')', ..., col_n(X - 1x̄')'. The rank of S is therefore less than or equal to n - 1, which, as noted at the beginning of the proof, is less than or equal to p - 1, and S is singular. This implies, from Result 2A.9, that |S| = 0.    ■
Result 3.4. Let the p × 1 vectors x_1, x_2, ..., x_n, where x_j' is the jth row of the data matrix X, be realizations of the independent random vectors X_1, X_2, ..., X_n. Then

1. If the linear combination a'X_j has positive variance for each constant vector a ≠ 0, then, provided that p < n, S has full rank with probability 1 and |S| > 0.

2. If, with probability 1, a'X_j is a constant (for example, c) for all j, then |S| = 0.
Proof. (Part 2). If a'X_j = a_1 X_{j1} + a_2 X_{j2} + ··· + a_p X_{jp} = c with probability 1, then a'x_j = c for all j, and the sample mean of this linear combination is

c = Σ_{j=1}^n (a_1 x_{j1} + a_2 x_{j2} + ··· + a_p x_{jp})/n = a_1 x̄_1 + a_2 x̄_2 + ··· + a_p x̄_p = a'x̄

Then

a'(x_j - x̄) = a'x_j - a'x̄ = c - c = 0    for all j

indicating linear dependence; the conclusion follows from Result 3.2. The proof of Part (1) is difficult and can be found in [2].    ■
Generalized Variance Determined by |R| and Its Geometrical Interpretation

The generalized sample variance of the standardized variables is the determinant |R| of the sample correlation matrix. Since the deviation vectors of the standardized variables,

(y_k - x̄_k 1)/√s_kk,    k = 1, 2, ..., p    (3-19)

all have length √(n-1), the generalized sample variance of the standardized variables will be large when these vectors are nearly perpendicular and will be small
when two or more of these vectors are in almost the same direction. Employing the argument leading to (3-7), we readily find that the cosine of the angle θ_ik between (y_i - x̄_i 1)/√s_ii and (y_k - x̄_k 1)/√s_kk is the sample correlation coefficient r_ik. Therefore, we can make the statement that |R| is large when all the r_ik are nearly zero and it is small when one or more of the r_ik are nearly +1 or -1.

In sum, we have the following result: Let
                  [ (x_{1i} - x̄_i)/√s_ii ]
(y_i - x̄_i 1)    [ (x_{2i} - x̄_i)/√s_ii ]
-------------- =  [           ⋮          ]        i = 1, 2, ..., p
    √s_ii         [ (x_{ni} - x̄_i)/√s_ii ]
be the deviation vectors of the standardized variables. The ith deviation vector lies in the direction of d_i, but all such vectors have a squared length of n-1. The volume generated in p-space by the deviation vectors can be related to the generalized sample variance. The same steps that lead to (3-15) produce

(Generalized sample variance of the standardized variables) = |R| = (n-1)^{-p}(volume)²    (3-20)
The volume generated by deviation vectors of the standardized variables is illustrated in Figure 3.10 for the two sets of deviation vectors graphed in Figure 3.6. A comparison of Figures 3.10 and 3.6 reveals that the influence of the d_2 vector (large variability in x_2) on the squared volume |S| is much greater than its influence on the squared volume |R|.
Figure 3.10 The volume generated by equal-length deviation vectors of the standardized variables.
The quantities |S| and |R| are connected by the relationship

|S| = (s_11 s_22 ··· s_pp) |R|    (3-21)

so

(n-1)^p |S| = (n-1)^p (s_11 s_22 ··· s_pp) |R|    (3-22)

[The proof of (3-21) is left to the reader as Exercise 3.12.]

Interpreting (3-22) in terms of volumes, we see from (3-15) and (3-20) that the squared volume (n-1)^p |S| is proportional to the squared volume (n-1)^p |R|. The constant of proportionality is the product of the variances, which, in turn, is proportional to the product of the squares of the lengths (n-1)s_ii of the d_i. Equation (3-21) shows, algebraically, how a change in the measurement scale of X_1, for example, will alter the relationship between the generalized variances. Since |R| is based on standardized measurements, it is unaffected by the change in scale. However, the relative value of |S| will be changed whenever the multiplicative factor s_11 changes.
Example 3.11 (Illustrating the relation between |S| and |R|) Let us illustrate the relationship in (3-21) for the generalized variances |S| and |R| when p = 3. Suppose

S       [ 4  3  1 ]
(3×3) = [ 3  9  2 ]
        [ 1  2  1 ]

Then s_11 = 4, s_22 = 9, and s_33 = 1, and

    [ 1    1/2  1/2 ]
R = [ 1/2  1    2/3 ]
    [ 1/2  2/3  1   ]

Using Definition 2A.24, we obtain

|S| = 4(9(1) - 2(2)) - 3(3(1) - 2(1)) + 1(3(2) - 9(1)) = 14

|R| = 1(1 - (2/3)²) - (1/2)((1/2)(1) - (2/3)(1/2)) + (1/2)((1/2)(2/3) - (1)(1/2)) = 7/18

It then follows that

14 = |S| = s_11 s_22 s_33 |R| = (4)(9)(1)(7/18) = 14    (check)    ■
Another Generalization of Variance

We conclude this discussion by mentioning one additional generalization of variance. Instead of the determinant of S, a single numerical measure of variation is sometimes taken to be the sum of the diagonal elements of S:

Total sample variance = s_11 + s_22 + ··· + s_pp    (3-23)
Example 3.12 (Calculating the total sample variance) Calculate the total sample variance for the variance-covariance matrices S in Examples 3.7 and 3.9.

From Example 3.7,

S = [ 252.04   -68.43 ]
    [ -68.43   123.67 ]

and

Total sample variance = s_11 + s_22 = 252.04 + 123.67 = 375.71

From Example 3.9,

    [  3    -3/2   0  ]
S = [ -3/2   1    1/2 ]
    [  0    1/2    1  ]

and

Total sample variance = s_11 + s_22 + s_33 = 3 + 1 + 1 = 5    ■
Geometrically, the total sample variance is the sum of the squared lengths of the p deviation vectors d_1 = (y_1 - x̄_1 1), ..., d_p = (y_p - x̄_p 1), divided by n-1. The total sample variance criterion pays no attention to the orientation (correlation structure) of the residual vectors. For instance, it assigns the same values to both sets of residual vectors (a) and (b) in Figure 3.6.
3.5 Sample Mean, Covariance, and Correlation as Matrix Operations

We have it that x̄_i = (x_{1i}·1 + x_{2i}·1 + ··· + x_{ni}·1)/n = y_i'1/n. That is,

    [ x̄_1 ]       [ x_11  x_21  ...  x_n1 ] [ 1 ]
    [ x̄_2 ]    1  [ x_12  x_22  ...  x_n2 ] [ 1 ]
x̄ = [  ⋮  ]  = -  [   ⋮     ⋮          ⋮  ] [ ⋮ ]
    [ x̄_p ]    n  [ x_1p  x_2p  ...  x_np ] [ 1 ]

or

x̄ = (1/n) X'1    (3-24)
That is, x̄ is calculated from the transposed data matrix by postmultiplying by the vector 1 and then multiplying the result by the constant 1/n.

Next, we create an n × p matrix of means by transposing both sides of (3-24) and premultiplying by 1; that is,

1x̄' = (1/n) 11'X    (3-25)
Subtracting this result from X produces the n × p matrix of deviations (residuals)

                [ x_11 - x̄_1   x_12 - x̄_2   ...   x_1p - x̄_p ]
X - (1/n)11'X = [ x_21 - x̄_1   x_22 - x̄_2   ...   x_2p - x̄_p ]    (3-26)
                [     ⋮             ⋮                  ⋮      ]
                [ x_n1 - x̄_1   x_n2 - x̄_2   ...   x_np - x̄_p ]

Now, the matrix (n-1)S representing sums of squares and cross products is just the transpose of the matrix (3-26) times the matrix itself, or

         [ x_11 - x̄_1   x_21 - x̄_1   ...   x_n1 - x̄_1 ] [ x_11 - x̄_1   ...   x_1p - x̄_p ]
(n-1)S = [ x_12 - x̄_2   x_22 - x̄_2   ...   x_n2 - x̄_2 ] [ x_21 - x̄_1   ...   x_2p - x̄_p ]
         [     ⋮             ⋮                  ⋮      ] [     ⋮                  ⋮      ]
         [ x_1p - x̄_p   x_2p - x̄_p   ...   x_np - x̄_p ] [ x_n1 - x̄_1   ...   x_np - x̄_p ]

       = (X - (1/n)11'X)'(X - (1/n)11'X) = X'(I - (1/n)11')'(I - (1/n)11')X = X'(I - (1/n)11')X
since

(I - (1/n)11')'(I - (1/n)11') = I - (1/n)11' - (1/n)11' + (1/n²)11'11' = I - (1/n)11'

To summarize, the matrix expressions relating x̄ and S to the data set X are
x̄ = (1/n) X'1

S = (1/(n-1)) X'(I - (1/n)11')X    (3-27)
The result for S_n is similar, except that 1/n replaces 1/(n-1) as the first factor.

The relations in (3-27) show clearly how matrix operations on the data matrix X lead to x̄ and S.

Once S is computed, it can be related to the sample correlation matrix R. The resulting expression can also be "inverted" to relate R to S. We first define the p × p sample standard deviation matrix D^{1/2} and compute its inverse, (D^{1/2})^{-1} = D^{-1/2}. Let
D^{1/2}   [ √s_11    0    ...    0   ]
(p×p)  =  [   0    √s_22  ...    0   ]
          [   ⋮      ⋮            ⋮  ]
          [   0      0    ...  √s_pp ]    (3-28)
Then

D^{-1/2}   [ 1/√s_11     0      ...     0     ]
(p×p)   =  [    0     1/√s_22   ...     0     ]
           [    ⋮        ⋮               ⋮    ]
           [    0        0      ...  1/√s_pp  ]
Since S has (i, k)th entry s_ik and R has (i, k)th entry r_ik = s_ik/(√s_ii √s_kk), we have

R = D^{-1/2} S D^{-1/2}    (3-29)
Postmultiplying and premultiplying both sides of (3-29) by D^{1/2}, and noting that D^{-1/2}D^{1/2} = D^{1/2}D^{-1/2} = I, gives

S = D^{1/2} R D^{1/2}    (3-30)

That is, R can be obtained from the information in S, whereas S can be obtained from D^{1/2} and R. Equations (3-29) and (3-30) are sample analogs of (2-36) and (2-37).
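Relations (3-29) and (3-30) translate directly into matrix code. A sketch, reusing the S of Example 3.11 (any sample covariance matrix with positive diagonal entries would do):

```python
import numpy as np

# Sample covariance matrix (here the S of Example 3.11).
S = np.array([[4.0, 3.0, 1.0],
              [3.0, 9.0, 2.0],
              [1.0, 2.0, 1.0]])

# Sample standard deviation matrix D^{1/2} and its inverse D^{-1/2}.
D_half = np.diag(np.sqrt(np.diag(S)))
D_half_inv = np.diag(1.0 / np.sqrt(np.diag(S)))

R = D_half_inv @ S @ D_half_inv   # (3-29)
S_back = D_half @ R @ D_half      # (3-30): recovers S from D^{1/2} and R
print(R)
print(S_back)
```

Going in one direction loses the scale information in D^{1/2}; going back requires supplying it again, which is exactly what (3-30) says.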
3.6 Sample Values of Linear Combinations of Variables

We have introduced linear combinations of p variables in Section 2.6. In many multivariate procedures, we are led naturally to consider a linear combination of the form

c'X = c_1 X_1 + c_2 X_2 + ··· + c_p X_p

whose observed value on the jth trial is

c'x_j = c_1 x_{j1} + c_2 x_{j2} + ··· + c_p x_{jp},    j = 1, 2, ..., n    (3-31)

The n derived observations in (3-31) have

Sample mean = (c'x_1 + c'x_2 + ··· + c'x_n)/n = c'(x_1 + x_2 + ··· + x_n)(1/n) = c'x̄    (3-32)
Since (c'x_j - c'x̄)² = (c'(x_j - x̄))² = c'(x_j - x̄)(x_j - x̄)'c, we have

Sample variance = [ (c'x_1 - c'x̄)² + (c'x_2 - c'x̄)² + ··· + (c'x_n - c'x̄)² ] / (n-1)

= [ c'(x_1 - x̄)(x_1 - x̄)'c + c'(x_2 - x̄)(x_2 - x̄)'c + ··· + c'(x_n - x̄)(x_n - x̄)'c ] / (n-1)

= c' [ ( (x_1 - x̄)(x_1 - x̄)' + (x_2 - x̄)(x_2 - x̄)' + ··· + (x_n - x̄)(x_n - x̄)' ) / (n-1) ] c

or

Sample variance of c'X = c'Sc    (3-33)
Equations (3-32) and (3-33) are sample analogs of (2-43). They correspond to substituting the sample quantities x̄ and S for the "population" quantities μ and Σ, respectively, in (2-43).

Now consider a second linear combination b'X = b_1 X_1 + b_2 X_2 + ··· + b_p X_p whose observed value on the jth trial is

b'x_j = b_1 x_{j1} + b_2 x_{j2} + ··· + b_p x_{jp},    j = 1, 2, ..., n    (3-34)
It follows from (3-32) and (3-33) that the sample mean and variance of these derived observations are

Sample mean of b'X = b'x̄
Sample variance of b'X = b'Sb

Moreover, the sample covariance computed from pairs of observations on b'X and c'X is

Sample covariance
= [ (b'x_1 - b'x̄)(c'x_1 - c'x̄) + (b'x_2 - b'x̄)(c'x_2 - c'x̄) + ··· + (b'x_n - b'x̄)(c'x_n - c'x̄) ] / (n-1)
= [ b'(x_1 - x̄)(x_1 - x̄)'c + b'(x_2 - x̄)(x_2 - x̄)'c + ··· + b'(x_n - x̄)(x_n - x̄)'c ] / (n-1)
= b' [ ( (x_1 - x̄)(x_1 - x̄)' + (x_2 - x̄)(x_2 - x̄)' + ··· + (x_n - x̄)(x_n - x̄)' ) / (n-1) ] c

or

Sample covariance of b'X and c'X = b'Sc    (3-35)

In sum, we have the following result.

Result 3.5. The linear combinations

b'X = b_1 X_1 + b_2 X_2 + ··· + b_p X_p
c'X = c_1 X_1 + c_2 X_2 + ··· + c_p X_p

have sample means, variances, and covariances that are related to x̄ and S by

Sample mean of b'X = b'x̄
Sample mean of c'X = c'x̄
Sample variance of b'X = b'Sb
Sample variance of c'X = c'Sc
Sample covariance of b'X and c'X = b'Sc    (3-36)
Example 3.13 (Means and covariances for linear combinations) We shall consider two linear combinations and their derived values for the n = 3 observations given in Example 3.9 as

    [ 1  2  5 ]
X = [ 4  1  6 ]
    [ 4  0  4 ]

Consider the two linear combinations

                 [ X_1 ]
b'X = [2  2  -1] [ X_2 ] = 2X_1 + 2X_2 - X_3
                 [ X_3 ]

and

                 [ X_1 ]
c'X = [1  -1  3] [ X_2 ] = X_1 - X_2 + 3X_3
                 [ X_3 ]
The means, variances, and covariance will flfSt be evaluate.d directly and then be evaluated by (336). Observations on these linear combinations are obtained by replacing X 1 , X2 , and X 3 with their observed values. For example, then = 3 observations on b'X are
b'x 1 = 2xu + 2.:i: 12  x 13 = 2(1) + 2(2)  (5) = 1 b'x2 = 2x21 + 2x22 x23 = 2(4) + 2(1) (6) = 4 b'x 3 = 2x31 + 2x32  X33 = 2(4) + 2(0)  (4) = 4
The sample mean and variance of these values are, respectively,

Sample mean = (1 + 4 + 4)/3 = 3
Sample variance = [(1 − 3)² + (4 − 3)² + (4 − 3)²]/(3 − 1) = 3

In a similar manner, the n = 3 observations on c'X are

c'x₁ = 1(1) − 1(2) + 3(5) = 14
c'x₂ = 1(4) − 1(1) + 3(6) = 21
c'x₃ = 1(4) − 1(0) + 3(4) = 16

and

Sample mean = (14 + 21 + 16)/3 = 17
Sample variance = [(14 − 17)² + (21 − 17)² + (16 − 17)²]/(3 − 1) = 13

Moreover, the sample covariance, computed from the pairs of observations (b'x₁, c'x₁), (b'x₂, c'x₂), and (b'x₃, c'x₃), is

Sample covariance = [(1 − 3)(14 − 17) + (4 − 3)(21 − 17) + (4 − 3)(16 − 17)]/(3 − 1) = 9/2
Alternatively, we use the sample mean vector x̄ and sample covariance matrix S derived from the original data matrix X to calculate the sample means, variances, and covariances for the linear combinations. Thus, if only the descriptive statistics are of interest, we do not even need to calculate the observations b'xⱼ and c'xⱼ. From Example 3.9,
\[
\bar{x} = \begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix}
\quad\text{and}\quad
S = \begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}
\]
Consequently, using (3-36), we find that the two sample means for the derived observations are
\[
\text{Sample mean of } b'X = b'\bar{x} = \begin{bmatrix} 2 & 2 & -1 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} = 3 \quad\text{(check)}
\]
\[
\text{Sample mean of } c'X = c'\bar{x} = \begin{bmatrix} 1 & -1 & 3 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} = 17 \quad\text{(check)}
\]
Using (3-36), we also have
\[
\text{Sample variance of } b'X = b'Sb = \begin{bmatrix} 2 & 2 & -1 \end{bmatrix}\begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} 2 \\ 2 \\ -1 \end{bmatrix} = 3 \quad\text{(check)}
\]
\[
\text{Sample variance of } c'X = c'Sc = 13 \quad\text{(check)}
\]
\[
\text{Sample covariance of } b'X \text{ and } c'X = b'Sc = \begin{bmatrix} 2 & 2 & -1 \end{bmatrix}\begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \\ 3 \end{bmatrix} = \frac{9}{2} \quad\text{(check)}
\]
As indicated, these last results check with the corresponding sample quantities computed directly from the observations on the linear combinations.

The sample mean and covariance relations in Result 3.5 pertain to any number of linear combinations. Consider the q linear combinations
\[
a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p, \qquad i = 1, 2, \ldots, q \tag{3-37}
\]
These can be expressed in matrix notation as
\[
\begin{bmatrix}
a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p \\
a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p \\
\vdots \\
a_{q1}X_1 + a_{q2}X_2 + \cdots + a_{qp}X_p
\end{bmatrix}
=
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots & & \vdots \\
a_{q1} & a_{q2} & \cdots & a_{qp}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}
= AX \tag{3-38}
\]
Taking the ith row of A, aᵢ', to be b' and the kth row of A, aₖ', to be c', we see that Equations (3-36) imply that the ith row of AX has sample mean aᵢ'x̄ and that the ith and kth rows of AX have sample covariance aᵢ'Saₖ. Note that aᵢ'Saₖ is the (i, k)th element of ASA'.
Result 3.6. The q linear combinations AX in (3-38) have sample mean vector Ax̄ and sample covariance matrix ASA'.
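Stacking the coefficient vectors of Example 3.13 into the rows of A gives a quick numerical illustration of Result 3.6. This NumPy sketch is our own aside; the data are those of Example 3.9.

```python
import numpy as np

X = np.array([[1., 2., 5.],
              [4., 1., 6.],
              [4., 0., 4.]])
# Rows of A are the b' and c' of Example 3.13 (q = 2 linear combinations).
A = np.array([[2., 2., -1.],
              [1., -1., 3.]])

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

mean_vec = A @ xbar        # sample mean vector of AX (Result 3.6)
cov_mat = A @ S @ A.T      # sample covariance matrix of AX

# Compare with the statistics of the derived observations X A'.
Z = X @ A.T
assert np.allclose(mean_vec, Z.mean(axis=0))
assert np.allclose(cov_mat, np.cov(Z, rowvar=False))
print(cov_mat)   # diagonal: b'Sb = 3 and c'Sc = 13; off-diagonal: b'Sc = 4.5
```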
Exercises
3.1.
Given the data matrix
(a) Graph the scatter plot in p = 2 dimensions. Locate the sample mean on your diagram. (b) Sketch the n = 3-dimensional representation of the data, and plot the deviation vectors y₁ − x̄₁1 and y₂ − x̄₂1. (c) Sketch the deviation vectors in (b) emanating from the origin. Calculate the lengths of these vectors and the cosine of the angle between them. Relate these quantities to Sₙ and R.
3.2. Given the data matrix
(a) Graph the scatter plot in p = 2 dimensions, and locate the sample mean on your diagram. (b) Sketch the n = 3-space representation of the data, and plot the deviation vectors y₁ − x̄₁1 and y₂ − x̄₂1. (c) Sketch the deviation vectors in (b) emanating from the origin. Calculate their lengths and the cosine of the angle between them. Relate these quantities to Sₙ and R.
3.3. Perform the decomposition of y₁ into x̄₁1 and y₁ − x̄₁1 using the first column of the data matrix in Example 3.9.
3.4. Use the six observations on the variable X₁, in units of millions, from Table 1.1. (a) Find the projection on 1' = [1, 1, 1, 1, 1, 1]. (b) Calculate the deviation vector y₁ − x̄₁1. Relate its length to the sample standard deviation.
(c) Graph (to scale) the triangle formed by y₁, x̄₁1, and y₁ − x̄₁1. Identify the length of each component in your graph. (d) Repeat Parts a-c for the variable X₂ in Table 1.1. (e) Graph (to scale) the two deviation vectors y₁ − x̄₁1 and y₂ − x̄₂1. Calculate the value of the angle between them.
3.5. Calculate the generalized sample variance |S| for (a) the data matrix X in Exercise 3.1 and (b) the data matrix X in Exercise 3.2.
3.6. Consider the data matrix
X=
[~
5
! ~]
2 3
(a) Calculate the matrix of deviations (residuals), X − 1x̄'. Is this matrix of full rank? Explain. (b) Determine S and calculate the generalized sample variance |S|. Interpret the latter geometrically. (c) Using the results in (b), calculate the total sample variance. [See (3-23).]
3.7. Sketch the solid ellipsoids (x − x̄)'S⁻¹(x − x̄) ≤ 1 [see (3-16)] for the three matrices
\[
S = \begin{bmatrix} 5 & 4 \\ 4 & 5 \end{bmatrix}, \qquad
S = \begin{bmatrix} 5 & -4 \\ -4 & 5 \end{bmatrix}, \qquad
S = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}
\]
(Note that these matrices have the same generalized variance |S|.)
3.8. Given
\[
S = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad\text{and}\quad
S = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 & -\tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & 1 \end{bmatrix}
\]
(a) Calculate the total sample variance for each S. Compare the results. (b) Calculate the generalized sample variance for each S, and compare the results. Comment on the discrepancies, if any, found between Parts a and b.
3.9. The following data matrix contains data on test scores, with x 1 = score on first test, x 2 = score on second test, and x 3 = total score on the two tests:
\[
X = \begin{bmatrix} 12 & 17 & 29 \\ 18 & 20 & 38 \\ 14 & 16 & 30 \\ 20 & 18 & 38 \\ 16 & 19 & 35 \end{bmatrix}
\]
(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an a' = [a₁, a₂, a₃] vector that establishes the linear dependence. (b) Obtain the sample covariance matrix S, and verify that the generalized variance is zero. Also, show that Sa = 0, so a can be rescaled to be an eigenvector corresponding to eigenvalue zero. (c) Verify that the third column of the data matrix is the sum of the first two columns. That is, show that there is linear dependence, with a₁ = 1, a₂ = 1, and a₃ = −1.
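As a numerical aside (not a substitute for the algebra the exercise asks for), the linear dependence in these test-score data is easy to check with NumPy:

```python
import numpy as np

# Test-score data of Exercise 3.9; the third column is the sum of the first two.
X = np.array([[12., 17., 29.],
              [18., 20., 38.],
              [14., 16., 30.],
              [20., 18., 38.],
              [16., 19., 35.]])

Xc = X - X.mean(axis=0)           # mean corrected data matrix
S = np.cov(X, rowvar=False)       # sample covariance matrix
a = np.array([1., 1., -1.])       # a1 = 1, a2 = 1, a3 = -1

assert np.allclose(Xc @ a, 0)               # columns of Xc are linearly dependent
assert np.allclose(S @ a, 0)                # a is an eigenvector for eigenvalue 0
assert abs(np.linalg.det(S)) < 1e-9         # generalized variance |S| is zero
print(np.linalg.matrix_rank(Xc))            # 2
```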
3.10. When the generalized variance is zero, it is the columns of the mean corrected data matrix Xc = X − 1x̄' that are linearly dependent, not necessarily those of the data matrix itself. Given the data
(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an a' = [a!, az, a3] vector that establishes the dependence.. (b) Obtain the sample covariance matrix S, and verify that the generalized variance is zero. (c) Show that the columns ofthe data matrix are linearly independent in this case.
3.11. Use the sample covariance obtained in Example 3.7 to verify (3-29) and (3-30), which state that R = D⁻¹/² S D⁻¹/² and D¹/² R D¹/² = S.
3.12. Show that |S| = (s₁₁s₂₂⋯sₚₚ)|R|.
Hint: From Equation (3-30), S = D¹/² R D¹/². Taking determinants gives |S| = |D¹/²||R||D¹/²|. (See Result 2A.11.) Now examine |D¹/²|.
3.13. Given a data matrix X and the resulting sample correlation matrix R, consider the standardized observations (xⱼₖ − x̄ₖ)/√sₖₖ, k = 1, 2, ..., p, j = 1, 2, ..., n. Show that these standardized quantities have sample covariance matrix R.
3.14. Consider the data matrix X in Exercise 3.1. We have n = 3 observations on p = 2 variables X₁ and X₂. Form the linear combinations
\[
c'X = \begin{bmatrix} -1 & 2 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = -X_1 + 2X_2
\]
\[
b'X = \begin{bmatrix} 2 & 3 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = 2X_1 + 3X_2
\]
(a) Evaluate the sample means, variances, and covariance of b'X and c'X from first principles. That is, calculate the observed values of b'X and c'X, and then use the sample mean, variance, and covariance formulas. (b) Calculate the sample means, variances, and covariance of b'X and c'X using (3-36). Compare the results in (a) and (b).
3.15. Repeat Exercise 3.14 using the data matrix
14 7
l]mJ
and
3.16. Let V be a vector random variable with mean vector E(V) = μ_V and covariance matrix E(V − μ_V)(V − μ_V)' = Σ_V. Show that E(VV') = Σ_V + μ_V μ_V'.
3.17. Show that, if X (p × 1) and Z (q × 1) are independent, then each component of X is independent of each component of Z.
Hint: P[X₁ ≤ x₁, ..., Xₚ ≤ xₚ and Z₁ ≤ z₁, ..., Z_q ≤ z_q] = P[X₁ ≤ x₁, ..., Xₚ ≤ xₚ] · P[Z₁ ≤ z₁, ..., Z_q ≤ z_q] by independence. Let x₂, ..., xₚ and z₂, ..., z_q tend to infinity to obtain
P[X₁ ≤ x₁ and Z₁ ≤ z₁] = P[X₁ ≤ x₁] · P[Z₁ ≤ z₁]
for all x₁, z₁. So X₁ and Z₁ are independent. Repeat for other pairs.
3.18. Energy consumption in 2001, by state, from the major sources

x₁ = petroleum, x₂ = natural gas, x₃ = hydroelectric power, x₄ = nuclear electric power

is recorded in quadrillions (10¹⁵) of BTUs (Source: Statistical Abstract of the United States 2006). The resulting sample mean vector is
\[
\bar{x} = \begin{bmatrix} 0.766 \\ 0.508 \\ 0.438 \\ 0.161 \end{bmatrix}
\]
with sample covariance matrix S.
(a) Using the summary statistics, determine the sample mean and variance of a state's total energy consumption for these major sources. (b) Determine the sample mean and variance of the excess of petroleum consumption over natural gas consumption. Also find the sample covariance of this variable with the total variable in part a.
3.19. Using the summary statistics for the first three variables in Exercise 3.18, verify the relation |S| = (s₁₁s₂₂s₃₃)|R|.
3.20. In northern climates, roads must be cleared of snow quickly following a storm. One measure of storm severity is x₁ = its duration in hours, while the effectiveness of snow removal can be quantified by x₂ = the number of hours crews, men, and machines spend to clear snow. Here are the results for 25 incidents in Wisconsin.

Table 3.2 Snow Data

x₁  x₂    x₁  x₂    x₁  x₂

(a) Find the sample mean and variance of the difference x₂ − x₁ by first obtaining the summary statistics. (b) Obtain the mean and variance by first obtaining the individual values xⱼ₂ − xⱼ₁, for j = 1, 2, ..., 25, and then calculating the mean and variance. Compare these values with those obtained in part a.
References
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.
2. Eaton, M., and M. Perlman. "The Non-Singularity of Generalized Sample Covariance Matrices." Annals of Statistics, 1 (1973), 710-717.
Chapter 4

THE MULTIVARIATE NORMAL DISTRIBUTION

4.1 Introduction
A generalization of the familiar bell-shaped normal density to several dimensions plays a fundamental role in multivariate analysis. In fact, most of the techniques encountered in this book are based on the assumption that the data were generated from a multivariate normal distribution. While real data are never exactly multivariate normal, the normal density is often a useful approximation to the "true" population distribution.

One advantage of the multivariate normal distribution stems from the fact that it is mathematically tractable and "nice" results can be obtained. This is frequently not the case for other data-generating distributions. Of course, mathematical attractiveness per se is of little use to the practitioner. It turns out, however, that normal distributions are useful in practice for two reasons: First, the normal distribution serves as a bona fide population model in some instances; second, the sampling distributions of many multivariate statistics are approximately normal, regardless of the form of the parent population, because of a central limit effect.

To summarize, many real-world problems fall naturally within the framework of normal theory. The importance of the normal distribution rests on its dual role as both population model for certain natural phenomena and approximate sampling distribution for many statistics.
4.2 The Multivariate Normal Density and Its Properties

Recall that the univariate normal distribution, with mean μ and variance σ², has the probability density function
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x-\mu)/\sigma]^2/2}, \qquad -\infty < x < \infty \tag{4-1}
\]
Figure 4.1 A normal density with mean μ and variance σ² and selected areas under the curve.
A plot of this function yields the familiar bellshaped curve shown in Figure 4.1.
Also shown in the figure are approximate areas under the curve within ±1 standard deviation and ±2 standard deviations of the mean. These areas represent probabilities, and thus, for the normal random variable X,

P(μ − σ ≤ X ≤ μ + σ) ≈ .68
P(μ − 2σ ≤ X ≤ μ + 2σ) ≈ .95
It is convenient to denote the normal density function with mean μ and variance σ² by N(μ, σ²). Therefore, N(10, 4) refers to the function in (4-1) with μ = 10 and σ = 2. This notation will be extended to the multivariate case later. The term
\[
\left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu) \tag{4-2}
\]
in the exponent of the univariate normal density function measures the square of the distance from x to μ in standard deviation units. This can be generalized for a p × 1 vector x of observations on several variables as
\[
(x-\mu)'\Sigma^{-1}(x-\mu) \tag{4-3}
\]
The p × 1 vector μ represents the expected value of the random vector X, and the p × p matrix Σ is the variance-covariance matrix of X. [See (2-30) and (2-31).] We shall assume that the symmetric matrix Σ is positive definite, so the expression in (4-3) is the square of the generalized distance from x to μ.

The multivariate normal density is obtained by replacing the univariate distance in (4-2) by the multivariate generalized distance of (4-3) in the density function of (4-1). When this replacement is made, the univariate normalizing constant (2π)^(−1/2)(σ²)^(−1/2) must be changed to a more general constant that makes the volume under the surface of the multivariate density function unity for any p. This is necessary because, in the multivariate case, probabilities are represented by volumes under the surface over regions defined by intervals of the xᵢ values. It can be shown (see [1]) that this constant is (2π)^(−p/2)|Σ|^(−1/2), and consequently, a p-dimensional normal density for the random vector X' = [X₁, X₂, ..., Xₚ] has the form
\[
f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-(x-\mu)'\Sigma^{-1}(x-\mu)/2} \tag{4-4}
\]
where −∞ < xᵢ < ∞, i = 1, 2, ..., p. We shall denote this p-dimensional normal density by Nₚ(μ, Σ), which is analogous to the normal density in the univariate case.
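The density (4-4) is straightforward to evaluate directly. The NumPy sketch below is our own illustration; the parameter values are arbitrary choices, not from the text.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Evaluate the N_p(mu, Sigma) density of (4-4) from the formula."""
    p = len(mu)
    dev = x - mu
    dist2 = dev @ np.linalg.solve(Sigma, dev)   # (x - mu)' Sigma^{-1} (x - mu)
    const = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5)
    return const * np.exp(-dist2 / 2)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# The density is largest at x = mu, where the exponent vanishes and the
# value equals the normalizing constant (2 pi)^{-p/2} |Sigma|^{-1/2}.
val_at_mean = mvn_density(mu, mu, Sigma)
assert np.isclose(val_at_mean, 1 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))
assert mvn_density(np.array([0.5, 0.0]), mu, Sigma) < val_at_mean
print(val_at_mean)
```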
Example 4.1 (Bivariate normal density) Let us evaluate the p = 2-variate normal density in terms of the individual parameters μ₁ = E(X₁), μ₂ = E(X₂), σ₁₁ = Var(X₁), σ₂₂ = Var(X₂), and ρ₁₂ = σ₁₂/(√σ₁₁ √σ₂₂) = Corr(X₁, X₂). Using Result 2A.8, we find that the inverse of the covariance matrix
\[
\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
\]
is
\[
\Sigma^{-1} = \frac{1}{\sigma_{11}\sigma_{22}-\sigma_{12}^2}\begin{bmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{bmatrix}
\]
Introducing the correlation coefficient ρ₁₂ by writing σ₁₂ = ρ₁₂√σ₁₁√σ₂₂, we obtain σ₁₁σ₂₂ − σ₁₂² = σ₁₁σ₂₂(1 − ρ₁₂²), and the squared distance becomes
\[
(x-\mu)'\Sigma^{-1}(x-\mu)
= \frac{1}{1-\rho_{12}^2}\left[\left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2\rho_{12}\left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)\right] \tag{4-5}
\]
The last expression is written in terms of the standardized values (x₁ − μ₁)/√σ₁₁ and (x₂ − μ₂)/√σ₂₂.

Next, since |Σ| = σ₁₁σ₂₂ − σ₁₂² = σ₁₁σ₂₂(1 − ρ₁₂²), we can substitute for Σ⁻¹ and |Σ| in (4-4) to get the expression for the bivariate (p = 2) normal density involving the individual parameters μ₁, μ₂, σ₁₁, σ₂₂, and ρ₁₂:
\[
f(x_1,x_2) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1-\rho_{12}^2)}}
\exp\left\{-\frac{1}{2(1-\rho_{12}^2)}\left[\left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)^2 + \left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^2 - 2\rho_{12}\left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)\right]\right\} \tag{4-6}
\]
The expression in (4-6) is somewhat unwieldy, and the compact general form in (4-4) is more informative in many ways. On the other hand, the expression in (4-6) is useful for discussing certain properties of the normal distribution. For example, if the random variables X₁ and X₂ are uncorrelated, so that ρ₁₂ = 0, the joint density can be written as the product of two univariate normal densities each of the form of (4-1).
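The agreement of the parametrized form (4-6) with the matrix form (4-4) can be checked numerically. This NumPy sketch is our own aside; the parameter values are arbitrary.

```python
import numpy as np

# Parameters of a bivariate normal in the (mu_i, sigma_ii, rho_12) form of (4-6).
mu1, mu2 = 0.0, 2.0
s11, s22, rho = 4.0, 1.0, 0.75
s12 = rho * np.sqrt(s11 * s22)

def f_46(x1, x2):
    """Bivariate density written with the individual parameters, as in (4-6)."""
    z1 = (x1 - mu1) / np.sqrt(s11)
    z2 = (x2 - mu2) / np.sqrt(s22)
    q = (z1**2 - 2*rho*z1*z2 + z2**2) / (1 - rho**2)
    return np.exp(-q / 2) / (2*np.pi*np.sqrt(s11*s22*(1 - rho**2)))

def f_44(x1, x2):
    """The same density computed from the matrix form (4-4)."""
    Sigma = np.array([[s11, s12], [s12, s22]])
    dev = np.array([x1 - mu1, x2 - mu2])
    q = dev @ np.linalg.solve(Sigma, dev)
    return np.exp(-q / 2) / (2*np.pi*np.sqrt(np.linalg.det(Sigma)))

for pt in [(0.0, 2.0), (1.0, 1.0), (-2.0, 3.5)]:
    assert np.isclose(f_46(*pt), f_44(*pt))
print("forms (4-4) and (4-6) agree")
```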
That is, f(x₁, x₂) = f(x₁)f(x₂) and X₁ and X₂ are independent. [See (2-28).] This result is true in general. (See Result 4.5.)

Two bivariate distributions with σ₁₁ = σ₂₂ are shown in Figure 4.2. In Figure 4.2(a), X₁ and X₂ are independent (ρ₁₂ = 0). In Figure 4.2(b), ρ₁₂ = .75. Notice how the presence of correlation causes the probability to concentrate along a line.
Figure 4.2 Two bivariate normal distributions. (a) σ₁₁ = σ₂₂ and ρ₁₂ = 0. (b) σ₁₁ = σ₂₂ and ρ₁₂ = .75.
From the expression in (4-4) for the density of a p-dimensional normal variable, it should be clear that the paths of x values yielding a constant height for the density are ellipsoids. That is, the multivariate normal density is constant on surfaces where the square of the distance (x − μ)'Σ⁻¹(x − μ) is constant. These paths are called contours:

Constant probability density contour = {all x such that (x − μ)'Σ⁻¹(x − μ) = c²} = surface of an ellipsoid centered at μ

The axes of each ellipsoid of constant density are in the direction of the eigenvectors of Σ⁻¹, and their lengths are proportional to the reciprocals of the square roots of the eigenvalues of Σ⁻¹. Fortunately, we can avoid the calculation of Σ⁻¹ when determining the axes, since these ellipsoids are also determined by the eigenvalues and eigenvectors of Σ. We state the correspondence formally for later reference.
Result 4.1. If Σ is positive definite, so that Σ⁻¹ exists, then
\[
\Sigma e = \lambda e \quad\text{implies}\quad \Sigma^{-1} e = \left(\frac{1}{\lambda}\right) e
\]
so (λ, e) is an eigenvalue-eigenvector pair for Σ corresponding to the pair (1/λ, e) for Σ⁻¹. Also, Σ⁻¹ is positive definite.
Proof. For Σ positive definite and e ≠ 0 an eigenvector, we have 0 < e'Σe = e'(Σe) = e'(λe) = λe'e = λ. Moreover, e = Σ⁻¹(Σe) = Σ⁻¹(λe), or e = λΣ⁻¹e, and division by λ > 0 gives Σ⁻¹e = (1/λ)e. Thus, (1/λ, e) is an eigenvalue-eigenvector pair for Σ⁻¹. Also, for any p × 1 x, by (2-21),
\[
x'\Sigma^{-1}x = x'\left(\sum_{i=1}^{p}\left(\frac{1}{\lambda_i}\right)e_i e_i'\right)x = \sum_{i=1}^{p}\left(\frac{1}{\lambda_i}\right)(x'e_i)^2 \ge 0
\]
since each term λᵢ⁻¹(x'eᵢ)² is nonnegative. In addition, x'eᵢ = 0 for all i only if x = 0. So x ≠ 0 implies that Σᵢ(1/λᵢ)(x'eᵢ)² > 0, and Σ⁻¹ is positive definite.

The following summarizes these concepts: Contours of constant density for the p-dimensional normal distribution are ellipsoids defined by x such that
\[
(x-\mu)'\Sigma^{-1}(x-\mu) = c^2 \tag{4-7}
\]
These ellipsoids are centered at μ and have axes ±c√λᵢ eᵢ, where Σeᵢ = λᵢeᵢ for i = 1, 2, ..., p.

A contour of constant density for a bivariate normal distribution with σ₁₁ = σ₂₂ is obtained in the following example.
Example 4.2 (Contours of the bivariate normal density) We shall obtain the axes of constant probability density contours for a bivariate normal distribution when σ₁₁ = σ₂₂. From (4-7), these axes are given by the eigenvalues and eigenvectors of Σ. Here |Σ − λI| = 0 becomes
\[
0 = \begin{vmatrix} \sigma_{11}-\lambda & \sigma_{12} \\ \sigma_{12} & \sigma_{11}-\lambda \end{vmatrix}
= (\sigma_{11}-\lambda)^2 - \sigma_{12}^2
= (\lambda-\sigma_{11}-\sigma_{12})(\lambda-\sigma_{11}+\sigma_{12})
\]
Consequently, the eigenvalues are λ₁ = σ₁₁ + σ₁₂ and λ₂ = σ₁₁ − σ₁₂. The eigenvector e₁ is determined from
\[
\sigma_{11}e_1 + \sigma_{12}e_2 = (\sigma_{11}+\sigma_{12})e_1
\]
\[
\sigma_{12}e_1 + \sigma_{11}e_2 = (\sigma_{11}+\sigma_{12})e_2
\]
These equations imply that e₁ = e₂, and after normalization, the first eigenvalue-eigenvector pair is
\[
\lambda_1 = \sigma_{11}+\sigma_{12}, \qquad e_1' = \left[\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right]
\]
Similarly, λ₂ = σ₁₁ − σ₁₂ yields the eigenvector e₂' = [1/√2, −1/√2].

When the covariance σ₁₂ (or correlation ρ₁₂) is positive, λ₁ = σ₁₁ + σ₁₂ is the largest eigenvalue, and its associated eigenvector e₁' = [1/√2, 1/√2] lies along the 45° line through the point μ' = [μ₁, μ₂]. This is true for any positive value of the covariance (correlation). Since the axes of the constant-density ellipses are given by ±c√λ₁ e₁ and ±c√λ₂ e₂ [see (4-7)], and the eigenvectors each have length unity, the major axis will be associated with the largest eigenvalue. For positively correlated normal random variables, then, the major axis of the constant-density ellipses will be along the 45° line through μ. (See Figure 4.3.)

Figure 4.3 A constant-density contour for a bivariate normal distribution with σ₁₁ = σ₂₂ and σ₁₂ > 0 (or ρ₁₂ > 0).
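The eigen-analysis of Example 4.2 can be reproduced with NumPy. This sketch is our own illustration; the numbers σ₁₁ = 3, σ₁₂ = 1 are arbitrary values satisfying σ₁₁ = σ₂₂ and σ₁₂ > 0.

```python
import numpy as np

# Covariance matrix with sigma_11 = sigma_22 and positive covariance.
s11, s12 = 3.0, 1.0
Sigma = np.array([[s11, s12],
                  [s12, s11]])

lam, E = np.linalg.eigh(Sigma)   # eigenvalues in ascending order; columns of E are e_i

# Example 4.2: lambda_1 = s11 + s12 with e1' = [1/sqrt(2), 1/sqrt(2)],
#              lambda_2 = s11 - s12 with e2' = [1/sqrt(2), -1/sqrt(2)].
assert np.isclose(lam[1], s11 + s12) and np.isclose(lam[0], s11 - s12)
assert np.allclose(np.abs(E[:, 1]), [1/np.sqrt(2), 1/np.sqrt(2)])  # 45-degree direction

# Half-lengths of the constant-density ellipse axes are c * sqrt(lambda_i).
c = 1.0
print(c * np.sqrt(lam[::-1]))    # major axis half-length first
```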
When the covariance (correlation) is negative, λ₂ = σ₁₁ − σ₁₂ will be the largest eigenvalue, and the major axes of the constant-density ellipses will lie along a line at right angles to the 45° line through μ. (These results are true only for σ₁₁ = σ₂₂.)

To summarize, the axes of the ellipses of constant density for a bivariate normal distribution with σ₁₁ = σ₂₂ are determined by
\[
\pm c\sqrt{\sigma_{11}+\sigma_{12}}\begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}
\quad\text{and}\quad
\pm c\sqrt{\sigma_{11}-\sigma_{12}}\begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}
\]
We show in Result 4.7 that the choice c² = χ²ₚ(α), where χ²ₚ(α) is the upper (100α)th percentile of a chi-square distribution with p degrees of freedom, leads to contours that contain (1 − α) × 100% of the probability. Specifically, the following is true for a p-dimensional normal distribution: the solid ellipsoid of x values satisfying
\[
(x-\mu)'\Sigma^{-1}(x-\mu) \le \chi_p^2(\alpha)
\]
has probability 1 − α.

The constant-density contours containing 50% and 90% of the probability under the bivariate normal surfaces in Figure 4.2 are pictured in Figure 4.4.
"2 x2
/l2
@
f.lt
/l2
x,
L'...._x,
f.lt
#
Figure 4.4 The 50% and 90% contours for the bivariate normal distributions in Figure 4.2.
The p-variate normal density in (4-4) has a maximum value when the squared distance in (4-3) is zero, that is, when x = μ. Thus, μ is the point of maximum density, or mode, as well as the expected value of X, or mean. The fact that μ is the mean of the multivariate normal distribution follows from the symmetry exhibited by the constant-density contours: These contours are centered, or balanced, at μ.
The following are true for a random vector X having a multivariate normal distribution:
1. Linear combinations of the components of X are normally distributed.
2. All subsets of the components of X have a (multivariate) normal distribution.
3. Zero covariance implies that the corresponding components are independently distributed.
4. The conditional distributions of the components are (multivariate) normal.
These statements are developed in the results that follow.
Example 4.3 (The distribution of a linear combination of the components of a normal random vector) Consider the linear combination a'X of a multivariate normal random vector determined by the choice a' = [1, 0, ..., 0]. Since
\[
a'X = [1, 0, \ldots, 0]\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix} = X_1
\]
we have
\[
a'\Sigma a = [1, 0, \ldots, 0]\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & & \vdots \\
\sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp}
\end{bmatrix}\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sigma_{11}
\]
and it follows from Result 4.2 that X₁ is distributed as N(μ₁, σ₁₁). More generally, the marginal distribution of any component Xᵢ of X is N(μᵢ, σᵢᵢ).

The next result considers several linear combinations of a multivariate normal vector X.

Result 4.3. If X is distributed as Nₚ(μ, Σ), the q linear combinations
\[
A_{(q\times p)}\, X_{(p\times 1)} = \begin{bmatrix}
a_{11}X_1 + \cdots + a_{1p}X_p \\
a_{21}X_1 + \cdots + a_{2p}X_p \\
\vdots \\
a_{q1}X_1 + \cdots + a_{qp}X_p
\end{bmatrix}
\]
are distributed as N_q(Aμ, AΣA'). Also, X (p × 1) + d (p × 1), where d is a vector of constants, is distributed as Nₚ(μ + d, Σ).
Proof. The expected value E(AX) and the covariance matrix of AX follow from (2-45). Any linear combination b'(AX) is a linear combination of X, of the form a'X with a = A'b. Thus, the conclusion concerning AX follows directly from Result 4.2. The second part of the result can be obtained by considering a'(X + d) = a'X + (a'd), where a'X is distributed as N(a'μ, a'Σa). It is known from the univariate case that adding a constant a'd to the random variable a'X leaves the variance unchanged and translates the mean to a'μ + a'd = a'(μ + d). Since a was arbitrary, X + d is distributed as Nₚ(μ + d, Σ).
Example 4.4 (The distribution of two linear combinations of the components of a normal random vector) For X distributed as N₃(μ, Σ), find the distribution of
\[
\begin{bmatrix} X_1 - X_2 \\ X_2 - X_3 \end{bmatrix}
= \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = AX
\]
By Result 4.3, the distribution of AX is multivariate normal with mean
\[
A\mu = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}
= \begin{bmatrix} \mu_1-\mu_2 \\ \mu_2-\mu_3 \end{bmatrix}
\]
and covariance matrix
\[
A\Sigma A' = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{bmatrix}
= \begin{bmatrix}
\sigma_{11}-2\sigma_{12}+\sigma_{22} & \sigma_{12}+\sigma_{23}-\sigma_{22}-\sigma_{13} \\
\sigma_{12}+\sigma_{23}-\sigma_{22}-\sigma_{13} & \sigma_{22}-2\sigma_{23}+\sigma_{33}
\end{bmatrix}
\]
Alternatively, the mean vector Aμ and covariance matrix AΣA' may be verified by direct calculation of the means and covariances of the two random variables Y₁ = X₁ − X₂ and Y₂ = X₂ − X₃.

We have mentioned that all subsets of a multivariate normal random vector X are themselves normally distributed. We state this property formally as Result 4.4.
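A Monte Carlo check of Result 4.3 is straightforward. In the NumPy sketch below (our own aside), the numeric μ and Σ are arbitrary illustrative values; A is the matrix of Example 4.4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameters (not from the text).
mu = np.array([1., 2., 3.])
Sigma = np.array([[4., 1., 0.],
                  [1., 3., 1.],
                  [0., 1., 2.]])
A = np.array([[1., -1., 0.],     # X1 - X2
              [0., 1., -1.]])    # X2 - X3

# Theoretical distribution of AX by Result 4.3.
mean_AX = A @ mu
cov_AX = A @ Sigma @ A.T         # [[5, -1], [-1, 3]] for these parameters

# Monte Carlo confirmation from a large simulated sample.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Z = X @ A.T
assert np.allclose(Z.mean(axis=0), mean_AX, atol=0.05)
assert np.allclose(np.cov(Z, rowvar=False), cov_AX, atol=0.1)
print(cov_AX)
```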
Result 4.4. All subsets of X are normally distributed. If we respectively partition X, its mean vector μ, and its covariance matrix Σ as
\[
X_{(p\times 1)} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad
\mu_{(p\times 1)} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad
\Sigma_{(p\times p)} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
\]
where X₁ and μ₁ are q × 1, X₂ and μ₂ are (p − q) × 1, and Σ₁₁ is q × q, then X₁ is distributed as N_q(μ₁, Σ₁₁).

Proof. Set A (q × p) = [I (q × q) ⋮ 0 (q × (p − q))] in Result 4.3, and the conclusion follows.

To apply Result 4.4 to an arbitrary subset of the components of X, we simply relabel the subset of interest as X₁ and select the corresponding component means and covariances as μ₁ and Σ₁₁, respectively.
Example 4.5 (The distribution of a subset of a normal random vector) If X is distributed as N₅(μ, Σ), find the distribution of [X₂, X₄]'. We set
\[
X_1 = \begin{bmatrix} X_2 \\ X_4 \end{bmatrix}, \qquad
\mu_1 = \begin{bmatrix} \mu_2 \\ \mu_4 \end{bmatrix}, \qquad
\Sigma_{11} = \begin{bmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{bmatrix}
\]
and note that with this assignment, X, μ, and Σ can respectively be rearranged and partitioned as
\[
X = \begin{bmatrix} X_2 \\ X_4 \\ \hline X_1 \\ X_3 \\ X_5 \end{bmatrix}, \qquad
\mu = \begin{bmatrix} \mu_2 \\ \mu_4 \\ \hline \mu_1 \\ \mu_3 \\ \mu_5 \end{bmatrix}, \qquad
\Sigma = \left[\begin{array}{cc|ccc}
\sigma_{22} & \sigma_{24} & \sigma_{12} & \sigma_{23} & \sigma_{25} \\
\sigma_{24} & \sigma_{44} & \sigma_{14} & \sigma_{34} & \sigma_{45} \\ \hline
\sigma_{12} & \sigma_{14} & \sigma_{11} & \sigma_{13} & \sigma_{15} \\
\sigma_{23} & \sigma_{34} & \sigma_{13} & \sigma_{33} & \sigma_{35} \\
\sigma_{25} & \sigma_{45} & \sigma_{15} & \sigma_{35} & \sigma_{55}
\end{array}\right]
\]
Thus, from Result 4.4, the distribution of [X₂, X₄]' is
\[
N_2\!\left(\begin{bmatrix} \mu_2 \\ \mu_4 \end{bmatrix},
\begin{bmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{bmatrix}\right)
\]
It is clear from this example that the normal distribution for any subset can be expressed by simply selecting the appropriate means and covariances from the original μ and Σ. The formal process of relabeling and partitioning is unnecessary.
We are now in a position to state that zero correlation between normal random variables or sets of normal random variables is equivalent to statistical independence.

Result 4.5.
(a) If X₁ (q₁ × 1) and X₂ (q₂ × 1) are independent, then Cov(X₁, X₂) = 0, a q₁ × q₂ matrix of zeros.
(b) If [X₁' , X₂']' is N_{q₁+q₂}([μ₁', μ₂']', [Σ₁₁ Σ₁₂; Σ₂₁ Σ₂₂]), then X₁ and X₂ are independent if and only if Σ₁₂ = 0.
(c) If X₁ and X₂ are independent and are distributed as N_{q₁}(μ₁, Σ₁₁) and N_{q₂}(μ₂, Σ₂₂), respectively, then [X₁', X₂']' has the multivariate normal distribution
\[
N_{q_1+q_2}\!\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}\right)
\]

Proof. (See Exercise 4.14 for partial proofs based upon factoring the density function when Σ₁₂ = 0.)
Example 4.6 (The equivalence of zero covariance and independence for normal variables) Let X (3 × 1) be N₃(μ, Σ) with
\[
\Sigma = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix}
\]
Are X₁ and X₂ independent? What about (X₁, X₂) and X₃?

Since X₁ and X₂ have covariance σ₁₂ = 1, they are not independent. However, partitioning X and Σ as
\[
X = \begin{bmatrix} X_1 \\ X_2 \\ \hline X_3 \end{bmatrix}, \qquad
\Sigma = \left[\begin{array}{cc|c} 4 & 1 & 0 \\ 1 & 3 & 0 \\ \hline 0 & 0 & 2 \end{array}\right]
= \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
\]
we see that X₁ = [X₁, X₂]' and X₃ have covariance matrix Σ₁₂ = [0, 0]'. Therefore, (X₁, X₂) and X₃ are independent by Result 4.5. This implies X₃ is independent of X₁ and also of X₂.

We pointed out in our discussion of the bivariate normal distribution that ρ₁₂ = 0 (zero correlation) implied independence because the joint density function [see (4-6)] could then be written as the product of the marginal (normal) densities of X₁ and X₂. This fact, which we encouraged you to verify directly, is simply a special case of Result 4.5 with q₁ = q₂ = 1.
Result 4.6. Let X = [X₁', X₂']' be distributed as Nₚ(μ, Σ) with μ = [μ₁', μ₂']',
\[
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
\]
and |Σ₂₂| > 0. Then the conditional distribution of X₁, given that X₂ = x₂, is normal and has

Mean = μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂)

and

Covariance = Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁

Note that the covariance does not depend on the value x₂ of the conditioning variable.
Proof. We shall give an indirect proof. (See Exercise 4.13, which uses the densities directly.) Take
\[
A_{(p\times p)} = \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{bmatrix}
\]
so
\[
A(X-\mu) = \begin{bmatrix} X_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(X_2-\mu_2) \\ X_2 - \mu_2 \end{bmatrix}
\]
is jointly normal with covariance matrix
\[
A\Sigma A' = \begin{bmatrix} \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0' & \Sigma_{22} \end{bmatrix}
\]
Since X₁ − μ₁ − Σ₁₂Σ₂₂⁻¹(X₂ − μ₂) and X₂ − μ₂ have zero covariance, they are independent. Moreover, the quantity X₁ − μ₁ − Σ₁₂Σ₂₂⁻¹(X₂ − μ₂) has distribution N_q(0, Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁). Given that X₂ = x₂, μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂) is a constant. Because X₁ − μ₁ − Σ₁₂Σ₂₂⁻¹(X₂ − μ₂) and X₂ − μ₂ are independent, the conditional distribution of X₁ − μ₁ − Σ₁₂Σ₂₂⁻¹(x₂ − μ₂) is the same as the unconditional distribution of X₁ − μ₁ − Σ₁₂Σ₂₂⁻¹(X₂ − μ₂). Since X₁ − μ₁ − Σ₁₂Σ₂₂⁻¹(X₂ − μ₂) is N_q(0, Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁), so is the random vector X₁ − μ₁ − Σ₁₂Σ₂₂⁻¹(x₂ − μ₂) when X₂ has the particular value x₂. Equivalently, given that X₂ = x₂, X₁ is distributed as N_q(μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂), Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁).
Example 4.7 (The conditional density of a bivariate normal distribution) The conditional density of X₁, given that X₂ = x₂ for any bivariate distribution, is defined by
\[
f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)}
\]
where f(x₂) is the marginal distribution of X₂. If f(x₁, x₂) is the bivariate normal density, show that f(x₁ | x₂) is
\[
N\!\left(\mu_1 + \frac{\sigma_{12}}{\sigma_{22}}(x_2-\mu_2),\; \sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}}\right)
\]
Here σ₁₁ − σ₁₂²/σ₂₂ = σ₁₁(1 − ρ₁₂²). The two terms involving x₁ − μ₁ in the exponent of the bivariate normal density [see Equation (4-6)] become, apart from the multiplicative constant −1/(2(1 − ρ₁₂²)),
\[
\frac{(x_1-\mu_1)^2}{\sigma_{11}} - 2\rho_{12}\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}}
= \frac{1}{\sigma_{11}}\left[x_1-\mu_1-\rho_{12}\frac{\sqrt{\sigma_{11}}}{\sqrt{\sigma_{22}}}(x_2-\mu_2)\right]^2 - \rho_{12}^2\,\frac{(x_2-\mu_2)^2}{\sigma_{22}}
\]
Because ρ₁₂ = σ₁₂/(√σ₁₁√σ₂₂), or ρ₁₂√σ₁₁/√σ₂₂ = σ₁₂/σ₂₂, the complete exponent is
\[
-\frac{1}{2(1-\rho_{12}^2)}\left[\left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)^2 - 2\rho_{12}\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} + \left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^2\right]
= -\frac{1}{2\sigma_{11}(1-\rho_{12}^2)}\left(x_1-\mu_1-\frac{\sigma_{12}}{\sigma_{22}}(x_2-\mu_2)\right)^2 - \frac{(x_2-\mu_2)^2}{2\sigma_{22}}
\]
The constant 2π√(σ₁₁σ₂₂(1 − ρ₁₂²)) also factors as
\[
\sqrt{2\pi}\sqrt{\sigma_{11}(1-\rho_{12}^2)} \times \sqrt{2\pi}\sqrt{\sigma_{22}}
\]
Dividing the joint density by the marginal density f(x₂) = (2πσ₂₂)^(−1/2) e^(−(x₂−μ₂)²/(2σ₂₂)) cancels the terms involving x₂ alone, leaving the conditional density
\[
f(x_1 \mid x_2) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}(1-\rho_{12}^2)}}
\, e^{-\left[x_1-\mu_1-(\sigma_{12}/\sigma_{22})(x_2-\mu_2)\right]^2 / \left(2\sigma_{11}(1-\rho_{12}^2)\right)},
\qquad -\infty < x_1 < \infty
\]
Thus, with our customary notation, the conditional distribution of X₁ given that X₂ = x₂ is N(μ₁ + (σ₁₂/σ₂₂)(x₂ − μ₂), σ₁₁(1 − ρ₁₂²)). Now, Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁ = σ₁₁ − σ₁₂²/σ₂₂ = σ₁₁(1 − ρ₁₂²) and Σ₁₂Σ₂₂⁻¹ = σ₁₂/σ₂₂, agreeing with Result 4.6, which we obtained by an indirect method.
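The conditional mean and covariance of Result 4.6 reduce to a small amount of linear algebra. The NumPy sketch below is our own illustration; the numeric μ, Σ, and x₂ are arbitrary choices, with X₁ the first component and X₂ the remaining two.

```python
import numpy as np

# Arbitrary illustrative parameters (not from the text): q = 1, p = 3.
mu = np.array([1., 0., 2.])
Sigma = np.array([[4., 1., 1.],
                  [1., 2., 0.],
                  [1., 0., 3.]])
q = 1
mu1, mu2 = mu[:q], mu[q:]
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

x2 = np.array([1., 1.])    # observed value of the conditioning variables

# Result 4.6: conditional mean and covariance of X1 given X2 = x2.
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)

# Conditioning never increases the variance, and cond_cov is free of x2.
assert cond_cov[0, 0] <= S11[0, 0]
print(cond_mean, cond_cov)   # 7/6 and 19/6 for these parameters
```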
For the multivariate normal distribution, we note the following:

1. All conditional distributions are (multivariate) normal.
2. The conditional mean is of the form
\[
\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \tag{4-9}
\]
where the coefficients in Σ₁₂Σ₂₂⁻¹ may be written as
\[
\Sigma_{12}\Sigma_{22}^{-1} = \begin{bmatrix}
\beta_{1,q+1} & \beta_{1,q+2} & \cdots & \beta_{1,p} \\
\beta_{2,q+1} & \beta_{2,q+2} & \cdots & \beta_{2,p} \\
\vdots & \vdots & & \vdots \\
\beta_{q,q+1} & \beta_{q,q+2} & \cdots & \beta_{q,p}
\end{bmatrix}
\]
3. The conditional covariance, Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁, does not depend upon the value(s) of the conditioning variable(s).
We conclude this section by presenting two final properties of multivariate normal random vectors. One has to do with the probability content of the ellipsoids of constant density. The other discusses the distribution of another form of linear combinations.

The chi-square distribution determines the variability of the sample variance s² = s₁₁ for samples from a univariate normal population. It also plays a basic role in the multivariate case.

Result 4.7. Let X be distributed as Nₚ(μ, Σ) with |Σ| > 0. Then
(a) (X − μ)'Σ⁻¹(X − μ) is distributed as χ²ₚ, where χ²ₚ denotes the chi-square distribution with p degrees of freedom.
(b) The Nₚ(μ, Σ) distribution assigns probability 1 − α to the solid ellipsoid {x : (x − μ)'Σ⁻¹(x − μ) ≤ χ²ₚ(α)}, where χ²ₚ(α) denotes the upper (100α)th percentile of the χ²ₚ distribution.

Proof. We know that χ²ₚ is defined as the distribution of the sum Z₁² + Z₂² + ⋯ + Zₚ², where Z₁, Z₂, ..., Zₚ are independent N(0, 1) random variables. Next, by the spectral decomposition [see Equations (2-16) and (2-21) with A = Σ, and see Result 4.1], Σ⁻¹ = Σᵢ (1/λᵢ) eᵢeᵢ', where Σeᵢ = λᵢeᵢ, so Σ⁻¹eᵢ = (1/λᵢ)eᵢ. Consequently,
\[
(X-\mu)'\Sigma^{-1}(X-\mu)
= \sum_{i=1}^{p}\frac{1}{\lambda_i}(X-\mu)'e_i e_i'(X-\mu)
= \sum_{i=1}^{p}\frac{1}{\lambda_i}\left(e_i'(X-\mu)\right)^2
= \sum_{i=1}^{p}\left[\frac{1}{\sqrt{\lambda_i}}\,e_i'(X-\mu)\right]^2
\]
Now, we can write Z = A(X − μ), where Zᵢ = (1/√λᵢ) eᵢ'(X − μ) and
\[
A_{(p\times p)} = \begin{bmatrix}
\frac{1}{\sqrt{\lambda_1}}\,e_1' \\
\frac{1}{\sqrt{\lambda_2}}\,e_2' \\
\vdots \\
\frac{1}{\sqrt{\lambda_p}}\,e_p'
\end{bmatrix}
\]
and X − μ is distributed as Nₚ(0, Σ). Therefore, by Result 4.3, Z = A(X − μ) is distributed as Nₚ(0, AΣA'), where
\[
A\Sigma A' = \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\,e_1' \\ \vdots \\ \frac{1}{\sqrt{\lambda_p}}\,e_p' \end{bmatrix}
\left(\sum_{i=1}^{p}\lambda_i e_i e_i'\right)
\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\,e_1 & \cdots & \frac{1}{\sqrt{\lambda_p}}\,e_p \end{bmatrix} = I
\]
By Result 4.5, Z₁, Z₂, ..., Zₚ are independent standard normal variables, and we conclude that (X − μ)'Σ⁻¹(X − μ) has a χ²ₚ distribution.

For Part b, we note that P[(X − μ)'Σ⁻¹(X − μ) ≤ c²] is the probability assigned to the ellipsoid (x − μ)'Σ⁻¹(x − μ) ≤ c² by the density Nₚ(μ, Σ). But from Part a, P[(X − μ)'Σ⁻¹(X − μ) ≤ χ²ₚ(α)] = 1 − α, and Part b holds.
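Both parts of Result 4.7 can be confirmed by simulation. In this NumPy sketch (our own aside), μ and Σ are arbitrary illustrative values; the 0.95 quantile of χ²₃, about 7.815, is a standard tabled value.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
mu = np.zeros(p)
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

X = rng.multivariate_normal(mu, Sigma, size=100_000)
dev = X - mu
d2 = np.einsum('ij,ij->i', dev @ np.linalg.inv(Sigma), dev)

# Part (a): d2 behaves like chi-square with p df (mean p, variance 2p).
assert abs(d2.mean() - p) < 0.05
assert abs(d2.var() - 2 * p) < 0.2

# Part (b): the ellipsoid with c^2 = chi2_3(.05) ~ 7.815 holds 95% of the mass.
coverage = (d2 <= 7.815).mean()
assert abs(coverage - 0.95) < 0.01
print(round(coverage, 3))
```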
Remark (Interpretation of statistical distance) Result 4.7 provides an interpretation of a squared statistical distance. When X is distributed as Nₚ(μ, Σ),
\[
(X-\mu)'\Sigma^{-1}(X-\mu)
\]
is the squared statistical distance from X to the population mean vector μ. If one component has a much larger variance than another, it will contribute less to the squared distance. Moreover, two highly correlated random variables will contribute less than two variables that are nearly uncorrelated. Essentially, the use of the inverse of the covariance matrix (1) standardizes all of the variables and (2) eliminates the effects of correlation. From the proof of Result 4.7,
\[
(X-\mu)'\Sigma^{-1}(X-\mu) = Z_1^2 + Z_2^2 + \cdots + Z_p^2
\]
The squared statistical distance is calculated as if, first, the random vector X were transformed to p independent standard normal random variables and then the usual squared distance, the sum of the squares of the variables, were applied.

Next, consider the linear combination of vector random variables
\[
c_1X_1 + c_2X_2 + \cdots + c_nX_n
= \left[X_1 \;\vdots\; X_2 \;\vdots\; \cdots \;\vdots\; X_n\right]_{(p\times n)}
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}_{(n\times 1)} \tag{4-10}
\]
This linear combination differs from the linear combinations considered earlier in that it defines a p × 1 vector random variable that is a linear combination of vectors. Previously, we discussed a single random variable that could be written as a linear combination of other univariate random variables.
Result 4.8. Let X₁, X₂, ..., Xₙ be mutually independent with Xⱼ distributed as Nₚ(μⱼ, Σ). (Note that each Xⱼ has the same covariance matrix Σ.) Then
\[
V_1 = c_1X_1 + c_2X_2 + \cdots + c_nX_n
\]
is distributed as Nₚ(Σⱼ cⱼμⱼ, (Σⱼ cⱼ²)Σ). Moreover, V₁ and V₂ = b₁X₁ + b₂X₂ + ⋯ + bₙXₙ are jointly multivariate normal with covariance matrix
\[
\begin{bmatrix}
\left(\sum_{j=1}^{n} c_j^2\right)\Sigma & (b'c)\Sigma \\
(b'c)\Sigma & \left(\sum_{j=1}^{n} b_j^2\right)\Sigma
\end{bmatrix}
\]
Consequently, V₁ and V₂ are independent if b'c = Σⱼ cⱼbⱼ = 0.

Proof. By Result 4.5(c), the np × 1 vector obtained by stacking X₁, X₂, ..., Xₙ is multivariate normal with mean vector and covariance matrix
\[
\mu_X = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}_{(np\times 1)}
\quad\text{and}\quad
\Sigma_X = \begin{bmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \Sigma \end{bmatrix}_{(np\times np)}
\]
The choice
\[
A_{(2p\times np)} = \begin{bmatrix} c_1I & c_2I & \cdots & c_nI \\ b_1I & b_2I & \cdots & b_nI \end{bmatrix}
\]
where I is the p × p identity matrix, gives
\[
AX = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}
\]
and AX is normal N₂ₚ(Aμ_X, AΣ_XA') by Result 4.3. Straightforward block multiplication shows that AΣ_XA' has the first block diagonal term
\[
[c_1\Sigma, c_2\Sigma, \ldots, c_n\Sigma][c_1I, c_2I, \ldots, c_nI]' = \left(\sum_{j=1}^{n} c_j^2\right)\Sigma
\]
and off-diagonal term
\[
[c_1\Sigma, c_2\Sigma, \ldots, c_n\Sigma][b_1I, b_2I, \ldots, b_nI]' = \left(\sum_{j=1}^{n} c_jb_j\right)\Sigma
\]
The off-diagonal term is the covariance matrix between V₁ and V₂. It is the zero matrix when Σⱼ cⱼbⱼ = b'c = 0, and then, by Result 4.5(b), V₁ and V₂ are independent.

For sums of the type in (4-10), the property of zero correlation is equivalent to requiring the coefficient vectors b and c to be perpendicular.
Example 4.8 (Linear combinations of random vectors) Let X₁, X₂, X₃, and X₄ be independent and identically distributed 3 × 1 random vectors with
\[
\mu = \begin{bmatrix} 3 \\ -1 \\ 1 \end{bmatrix}
\quad\text{and}\quad
\Sigma = \begin{bmatrix} 3 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 2 \end{bmatrix}
\]
We first consider a linear combination a'X₁ of the three components of X₁. This is a random variable with mean
\[
a'\mu = 3a_1 - a_2 + a_3
\]
and variance
\[
a'\Sigma a = 3a_1^2 + a_2^2 + 2a_3^2 - 2a_1a_2 + 2a_1a_3
\]
That is, a linear combination a'X₁ of the components of a random vector is a single random variable consisting of a sum of terms that are each a constant times a variable. This is very different from a linear combination of random vectors, say,
\[
c_1X_1 + c_2X_2 + c_3X_3 + c_4X_4
\]
which is itself a random vector. Here each term in the sum is a constant times a random vector.

Now consider two linear combinations of random vectors
\[
\frac{1}{2}X_1 + \frac{1}{2}X_2 + \frac{1}{2}X_3 + \frac{1}{2}X_4
\quad\text{and}\quad
X_1 + X_2 + X_3 - 3X_4
\]
Find the mean vector and covariance matrix for each linear combination of vectors and also the covariance between them.

By Result 4.8 with c₁ = c₂ = c₃ = c₄ = 1/2, the first linear combination has mean vector
\[
(c_1+c_2+c_3+c_4)\mu = 2\mu = \begin{bmatrix} 6 \\ -2 \\ 2 \end{bmatrix}
\]
and covariance matrix
\[
(c_1^2+c_2^2+c_3^2+c_4^2)\Sigma = 1\cdot\Sigma = \begin{bmatrix} 3 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 2 \end{bmatrix}
\]
For the second linear combination of random vectors, we apply Result 4.8 with b₁ = b₂ = b₃ = 1 and b₄ = −3 to get mean vector
\[
(b_1+b_2+b_3+b_4)\mu = 0\,\mu = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\]
and covariance matrix
\[
(b_1^2+b_2^2+b_3^2+b_4^2)\Sigma = 12\,\Sigma = \begin{bmatrix} 36 & -12 & 12 \\ -12 & 12 & 0 \\ 12 & 0 & 24 \end{bmatrix}
\]
Finally, since b'c = (1/2)(1) + (1/2)(1) + (1/2)(1) + (1/2)(−3) = 0, the covariance matrix for the two linear combinations of random vectors is
\[
(b'c)\Sigma = 0\,\Sigma = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
\]
Every component of the first linear combination of random vectors has zero covariance with every component of the second linear combination of random vectors. If, in addition, each X has a trivariate normal distribution, then the two linear combinations have a joint six-variate normal distribution, and the two linear combinations of vectors are independent.
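The bookkeeping of Result 4.8 applied in Example 4.8 amounts to a few scalar products. The NumPy sketch below (our own illustration) reproduces the calculation.

```python
import numpy as np

# Parameters of Example 4.8.
mu = np.array([3., -1., 1.])
Sigma = np.array([[ 3., -1., 1.],
                  [-1.,  1., 0.],
                  [ 1.,  0., 2.]])

c = np.array([0.5, 0.5, 0.5, 0.5])
b = np.array([1., 1., 1., -3.])

# Result 4.8: means and covariances of V1 = sum c_j X_j and V2 = sum b_j X_j.
mean_V1 = c.sum() * mu          # = 2*mu, since every X_j has the same mean mu
cov_V1 = (c @ c) * Sigma        # (sum c_j^2) Sigma = Sigma
mean_V2 = b.sum() * mu          # = 0
cov_V2 = (b @ b) * Sigma        # 12 * Sigma
cov_V1_V2 = (b @ c) * Sigma     # b'c = 0, so the 3 x 3 zero matrix

assert np.allclose(cov_V1, Sigma)
assert np.allclose(cov_V2, 12 * Sigma)
assert np.allclose(cov_V1_V2, 0)
print(mean_V1)   # [ 6. -2.  2.]
```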
4.3 Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation

We discussed sampling and selecting random samples briefly in Chapter 3. In this section, we shall be concerned with samples from a multivariate normal population, in particular, with the sampling distribution of X̄ and S.
Let us assume that the p × 1 vectors X₁, X₂, ..., Xₙ represent a random sample from a multivariate normal population with mean vector μ and covariance matrix Σ. Since X₁, X₂, ..., Xₙ are mutually independent and each has distribution Nₚ(μ, Σ), the joint density function of all the observations is the product of the marginal normal densities:
\[
\left\{\text{Joint density of } X_1, X_2, \ldots, X_n\right\}
= \prod_{j=1}^{n}\left\{\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-(x_j-\mu)'\Sigma^{-1}(x_j-\mu)/2}\right\}
= \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\, e^{-\sum_{j=1}^{n}(x_j-\mu)'\Sigma^{-1}(x_j-\mu)/2} \tag{4-11}
\]
When the numerical values of the observations become available, they may be substituted for the xⱼ in Equation (4-11). The resulting expression, now considered as a function of μ and Σ for the fixed set of observations x₁, x₂, ..., xₙ, is called the likelihood.

Many good statistical procedures employ values for the population parameters that "best" explain the observed data. One meaning of best is to select the parameter values that maximize the joint density evaluated at the observations. This technique is called maximum likelihood estimation, and the maximizing parameter values are called maximum likelihood estimates.

At this point, we shall consider maximum likelihood estimation of the parameters μ and Σ for a multivariate normal population. To do so, we take the observations x₁, x₂, ..., xₙ as fixed and consider the joint density of Equation (4-11) evaluated at these values. The result is the likelihood function. In order to simplify matters, we rewrite the likelihood function in another form. We shall need some additional properties for the trace of a square matrix. (The trace of a matrix is the sum of its diagonal elements, and the properties of the trace are discussed in Definition 2A.28 and Result 2A.12.)
Result 4.9. Let A be a k × k symmetric matrix and x be a k × 1 vector. Then
(a) x'Ax = tr(x'Ax) = tr(Axx')
(b) tr(A) = ∑_{i=1}^k λ_i, where the λ_i are the eigenvalues of A.

Proof. For Part a, we note that x'Ax is a scalar, so x'Ax = tr(x'Ax). We pointed out in Result 2A.12 that tr(BC) = tr(CB) for any two matrices B and C of dimensions m × k and k × m, since BC has ∑_{j=1}^k b_ij c_ji as its i-th diagonal element.
Let x' play the role of the matrix B with m = 1, and let Ax play the role of the matrix C. Then tr(x'(Ax)) = tr((Ax)x'), and the result follows.

Part b is proved by using the spectral decomposition of (2-20) to write A = P'ΛP, where PP' = I and Λ is a diagonal matrix with entries λ_1, λ_2, ..., λ_k. Therefore, tr(A) = tr(P'ΛP) = tr(ΛPP') = tr(Λ) = λ_1 + λ_2 + ⋯ + λ_k.

Now the exponent in the joint density in (4-11) can be simplified. By Result 4.9(a),
(x_j - μ)'Σ^{-1}(x_j - μ) = tr[(x_j - μ)'Σ^{-1}(x_j - μ)] = tr[Σ^{-1}(x_j - μ)(x_j - μ)']   (4-12)

Next,
∑_{j=1}^n (x_j - μ)'Σ^{-1}(x_j - μ) = ∑_{j=1}^n tr[Σ^{-1}(x_j - μ)(x_j - μ)'] = tr[Σ^{-1} ∑_{j=1}^n (x_j - μ)(x_j - μ)']   (4-13)
since the trace of a sum of matrices is equal to the sum of the traces of the matrices, according to Result 2A.12(b). We can add and subtract x̄ = (1/n) ∑_{j=1}^n x_j in each term (x_j - μ) in ∑_{j=1}^n (x_j - μ)(x_j - μ)' to give

∑_{j=1}^n (x_j - x̄ + x̄ - μ)(x_j - x̄ + x̄ - μ)' = ∑_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ)(x̄ - μ)'   (4-14)

because the cross-product terms, ∑_{j=1}^n (x_j - x̄)(x̄ - μ)' and ∑_{j=1}^n (x̄ - μ)(x_j - x̄)',
are both matrices of zeros. (See Exercise 4.15.) Consequently, using Equations (4-13) and (4-14), we can write the joint density of a random sample from a multivariate normal population as

Joint density of X_1, X_2, ..., X_n = (2π)^{-np/2} |Σ|^{-n/2} exp{ -tr[Σ^{-1}( ∑_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ)(x̄ - μ)' )]/2 }   (4-15)
Substituting the observed values x_1, x_2, ..., x_n into the joint density yields the likelihood function. We shall denote this function by L(μ, Σ), to stress the fact that it is a function of the (unknown) population parameters μ and Σ. Thus, when the vectors x_j contain the specific numbers actually observed, we have
L(μ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp{ -tr[Σ^{-1}( ∑_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ)(x̄ - μ)' )]/2 }   (4-16)
It will be convenient in later sections of this book to express the exponent in the likelihood function (4-16) in different ways. In particular, we shall make use of the identity

∑_{j=1}^n (x_j - μ)'Σ^{-1}(x_j - μ) = tr[Σ^{-1}( ∑_{j=1}^n (x_j - x̄)(x_j - x̄)' )] + n(x̄ - μ)'Σ^{-1}(x̄ - μ)   (4-17)
Result 4.10. Given a p × p symmetric positive definite matrix B and a scalar b > 0, it follows that

(1/|Σ|^b) e^{-tr(Σ^{-1}B)/2} ≤ (1/|B|^b) (2b)^{pb} e^{-bp}

for all positive definite Σ (p × p), with equality holding only for Σ = (1/(2b))B.
Proof. Let B^{1/2} be the symmetric square root of B [see Equation (2-22)], so B^{1/2}B^{1/2} = B, B^{1/2}B^{-1/2} = I, and B^{-1/2}B^{-1/2} = B^{-1}. Then tr(Σ^{-1}B) = tr[(Σ^{-1}B^{1/2})B^{1/2}] = tr[B^{1/2}(Σ^{-1}B^{1/2})]. Let η be an eigenvalue of B^{1/2}Σ^{-1}B^{1/2}. This matrix is positive definite because y'B^{1/2}Σ^{-1}B^{1/2}y = (B^{1/2}y)'Σ^{-1}(B^{1/2}y) > 0 if B^{1/2}y ≠ 0 or, equivalently, y ≠ 0. Thus, the eigenvalues η_i of B^{1/2}Σ^{-1}B^{1/2} are positive by Exercise 2.17. Result 4.9(b) then gives

tr(Σ^{-1}B) = tr(B^{1/2}Σ^{-1}B^{1/2}) = ∑_{i=1}^p η_i

and, since the determinant is the product of the eigenvalues, |B^{1/2}Σ^{-1}B^{1/2}| = ∏_{i=1}^p η_i. From the properties of determinants,

|B^{1/2}Σ^{-1}B^{1/2}| = |B^{1/2}| |Σ^{-1}| |B^{1/2}| = |Σ^{-1}| |B^{1/2}| |B^{1/2}| = |Σ^{-1}| |B| = (1/|Σ|) |B|

or

1/|Σ| = |B^{1/2}Σ^{-1}B^{1/2}| / |B| = (∏_{i=1}^p η_i) / |B|

Combining the results for the trace and the determinant yields

(1/|Σ|^b) e^{-tr[Σ^{-1}B]/2} = ((∏_{i=1}^p η_i)^b / |B|^b) e^{-∑_{i=1}^p η_i/2} = (1/|B|^b) ∏_{i=1}^p η_i^b e^{-η_i/2}

But the function η^b e^{-η/2} has a maximum, with respect to η, of (2b)^b e^{-b}, occurring at η = 2b. The choice η_i = 2b, for each i, therefore gives

(1/|Σ|^b) e^{-tr(Σ^{-1}B)/2} ≤ (1/|B|^b) (2b)^{pb} e^{-bp}

The upper bound is uniquely attained when Σ = (1/(2b))B, since, for this choice,

B^{1/2}Σ^{-1}B^{1/2} = B^{1/2}(2b)B^{-1}B^{1/2} = (2b) I  (p × p)

and

tr[Σ^{-1}B] = tr[B^{1/2}Σ^{-1}B^{1/2}] = tr[(2b) I] = 2bp

Moreover,

1/|Σ| = |B^{1/2}Σ^{-1}B^{1/2}| / |B| = |(2b) I| / |B| = (2b)^p / |B|

Straightforward substitution for tr[Σ^{-1}B] and 1/|Σ|^b yields the bound asserted.
The maximum likelihood estimates of μ and Σ are those values, denoted by μ̂ and Σ̂, that maximize the function L(μ, Σ) in (4-16). The estimates μ̂ and Σ̂ will depend on the observed values x_1, x_2, ..., x_n through the summary statistics x̄ and S.
Result 4.11. Let X_1, X_2, ..., X_n be a random sample from a normal population with mean μ and covariance Σ. Then

μ̂ = X̄  and  Σ̂ = (1/n) ∑_{j=1}^n (X_j - X̄)(X_j - X̄)' = ((n - 1)/n) S

are the maximum likelihood estimators of μ and Σ, respectively. Their observed values, x̄ and (1/n) ∑_{j=1}^n (x_j - x̄)(x_j - x̄)', are called the maximum likelihood estimates of μ and Σ.

Proof. The exponent in the likelihood function (4-16), apart from the multiplicative factor -1/2, is [see (4-17)]

tr[Σ^{-1}( ∑_{j=1}^n (x_j - x̄)(x_j - x̄)' )] + n(x̄ - μ)'Σ^{-1}(x̄ - μ)
By Result 4.1, Σ^{-1} is positive definite, so the distance (x̄ - μ)'Σ^{-1}(x̄ - μ) > 0 unless μ = x̄. Thus, the likelihood is maximized with respect to μ at μ̂ = x̄. It remains to maximize

L(x̄, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp{ -tr[Σ^{-1}( ∑_{j=1}^n (x_j - x̄)(x_j - x̄)' )]/2 }

over Σ. By Result 4.10 with b = n/2 and B = ∑_{j=1}^n (x_j - x̄)(x_j - x̄)', the maximum occurs at Σ̂ = (1/n) ∑_{j=1}^n (x_j - x̄)(x_j - x̄)', as stated.
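As a numerical illustration of Result 4.11, the estimates are easy to compute. The sketch below is ours, not the book's (the helper name mvn_mle is an assumption, and NumPy is assumed available); the key point is the divisor n rather than n - 1:

```python
import numpy as np

def mvn_mle(X):
    """Maximum likelihood estimates for a multivariate normal sample.

    X is an n x p data matrix.  Per Result 4.11, mu_hat is the sample
    mean vector and sigma_hat uses the divisor n, not n - 1.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / n  # equals ((n - 1)/n) S
    return mu_hat, sigma_hat
```

By contrast, np.cov(X, rowvar=False) divides by n - 1, so sigma_hat equals ((n - 1)/n) times that unbiased estimate.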
The maximum likelihood estimators are random quantities. They are obtained by replacing the observations x_1, x_2, ..., x_n in the expressions for μ̂ and Σ̂ with the corresponding random vectors, X_1, X_2, ..., X_n. We note that the maximum likelihood estimator X̄ is a random vector and the maximum likelihood estimator Σ̂ is a random matrix. The maximum likelihood estimates are their particular values for the given data set. In addition, the maximum of the likelihood is
L(μ̂, Σ̂) = (2π)^{-np/2} e^{-np/2} |Σ̂|^{-n/2}   (4-18)

or, since |Σ̂| = [(n - 1)/n]^p |S|,

L(μ̂, Σ̂) = constant × (generalized variance)^{-n/2}   (4-19)
The generalized variance determines the "peakedness" of the likelihood function and, consequently, is a natural measure of variability when the parent population is multivariate normal.

Maximum likelihood estimators possess an invariance property. Let θ̂ be the maximum likelihood estimator of θ, and consider estimating the parameter h(θ), which is a function of θ. Then the maximum likelihood estimate of

h(θ)  (a function of θ)  is given by  h(θ̂)  (same function of θ̂)   (4-20)
For example,
1. The maximum likelihood estimator of μ'Σ^{-1}μ is μ̂'Σ̂^{-1}μ̂, where μ̂ = X̄ and Σ̂ = ((n - 1)/n)S are the maximum likelihood estimators of μ and Σ, respectively.
2. The maximum likelihood estimator of √σ_ii is √σ̂_ii, where
σ̂_ii = (1/n) ∑_{j=1}^n (X_ji - X̄_i)²
is the maximum likelihood estimator of σ_ii = Var(X_i).
Sufficient Statistics
From expression (4-15), the joint density depends on the whole set of observations x_1, x_2, ..., x_n only through the sample mean x̄ and the sum-of-squares-and-cross-products matrix

∑_{j=1}^n (x_j - x̄)(x_j - x̄)' = (n - 1)S   (4-21)

We express this fact by saying that x̄ and (n - 1)S (or S) are sufficient statistics.
The importance of sufficient statistics for normal populations is that all of the information about μ and Σ in the data matrix X is contained in x̄ and S, regardless of the sample size n. This generally is not true for nonnormal populations. Since many multivariate techniques begin with sample means and covariances, it is prudent to check on the adequacy of the multivariate normal assumption. (See Section 4.6.) If the data cannot be regarded as multivariate normal, techniques that depend solely on x̄ and S may be ignoring other useful sample information.
4.4 The Sampling Distribution of X̄ and S

The tentative assumption that X_1, X_2, ..., X_n constitute a random sample from a normal population with mean μ and covariance Σ completely determines the sampling distributions of X̄ and S. Here we present the results by drawing a parallel with the familiar univariate conclusions. In the univariate case (p = 1), X̄ is normal with mean μ and variance σ²/n. For the sample variance, recall that

(n - 1)s² = ∑_{j=1}^n (X_j - X̄)²

is distributed as σ² times a chi-square variable having n - 1 degrees of freedom (d.f.). In turn, this chi-square is the distribution of a sum of squares of independent standard normal random variables. That is, (n - 1)s² is distributed as σ²(Z_1² + ⋯ + Z_{n-1}²) = (σZ_1)² + ⋯ + (σZ_{n-1})². The individual terms σZ_i are independently distributed as N(0, σ²). It is this latter form that is suitably generalized to the basic sampling distribution for the sample covariance matrix.
The sampling distribution of the sample covariance matrix is called the Wishart distribution, after its discoverer; it is defined as the sum of independent products of multivariate normal random vectors. Specifically,
W_m(· | Σ) = Wishart distribution with m d.f. = distribution of ∑_{j=1}^m Z_j Z_j'   (4-22)

where the Z_j are each independently distributed as N_p(0, Σ). We summarize the sampling distribution results as follows:
Let X_1, X_2, ..., X_n be a random sample of size n from a p-variate normal distribution with mean μ and covariance matrix Σ. Then
1. X̄ is distributed as N_p(μ, (1/n)Σ).
2. (n - 1)S is distributed as a Wishart random matrix with n - 1 d.f.
3. X̄ and S are independent.   (4-23)
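The defining sum in (4-22) can be simulated directly, which gives a rough check on the first moment: since E(Z_j Z_j') = Σ, a W_m(· | Σ) draw has expected value mΣ. A hedged sketch (the function name and Monte Carlo settings are ours, with NumPy assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)

def wishart_draw(m, sigma, rng):
    """One draw from W_m(. | sigma) via (4-22): the sum of m
    independent outer products Z_j Z_j' with Z_j ~ N_p(0, sigma)."""
    p = sigma.shape[0]
    Z = rng.multivariate_normal(np.zeros(p), sigma, size=m)  # m x p
    return Z.T @ Z  # equals sum_j Z_j Z_j'

sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
m = 10
draws = [wishart_draw(m, sigma, rng) for _ in range(2000)]
# E(Z_j Z_j') = sigma, so E(W) = m * sigma; check by Monte Carlo
mean_w = np.mean(draws, axis=0)
```

With 2,000 replicates, the entrywise Monte Carlo average mean_w should be close to m·Σ.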
For the record, the density of the Wishart distribution exists only when the sample size n exceeds the number of variables p. When it does exist, it is given by

w_{n-1}(A | Σ) = |A|^{(n-p-2)/2} e^{-tr[AΣ^{-1}]/2} / ( 2^{p(n-1)/2} π^{p(p-1)/4} |Σ|^{(n-1)/2} ∏_{i=1}^p Γ(½(n - i)) ),   A positive definite   (4-25)

where Γ(·) is the gamma function.
4.5 Large-Sample Behavior of X̄ and S

When the sample size is large, the following results describe the behavior of X̄ and S without the assumption of an underlying normal population.

Result 4.12 (Law of large numbers). Let Y_1, Y_2, ..., Y_n be independent observations from a population with mean E(Y_i) = μ. Then

Ȳ = (Y_1 + Y_2 + ⋯ + Y_n)/n

converges in probability to μ as n increases without bound. That is, for any prescribed accuracy ε > 0, P[-ε < Ȳ - μ < ε] approaches unity as n → ∞.

Proof. See [9].
As a direct consequence of the law of large numbers, which says that each X̄_i converges in probability to μ_i, i = 1, 2, ..., p,

X̄ converges in probability to μ   (4-26)

Also, each sample covariance s_ik converges in probability to σ_ik, i, k = 1, 2, ..., p, and

S (or Σ̂ = S_n) converges in probability to Σ   (4-27)
The statement concerning S is proved by noting that

(n - 1)s_ik = ∑_{j=1}^n (X_ji - X̄_i)(X_jk - X̄_k)
= ∑_{j=1}^n (X_ji - μ_i + μ_i - X̄_i)(X_jk - μ_k + μ_k - X̄_k)
= ∑_{j=1}^n (X_ji - μ_i)(X_jk - μ_k) - n(X̄_i - μ_i)(X̄_k - μ_k)
Letting Y_j = (X_ji - μ_i)(X_jk - μ_k), with E(Y_j) = σ_ik, we see that the first term in s_ik converges to σ_ik and the second term converges to zero, by applying the law of large numbers.

The practical interpretation of statements (4-26) and (4-27) is that, with high probability, X̄ will be close to μ and S will be close to Σ whenever the sample size is large. The statement concerning X̄ is made even more precise by a multivariate version of the central limit theorem.

Result 4.13 (The central limit theorem). Let X_1, X_2, ..., X_n be independent observations from any population with mean μ and finite covariance Σ. Then

√n (X̄ - μ) has an approximate N_p(0, Σ) distribution

for large sample sizes. Here n should also be large relative to p.
The approximation provided by the central limit theorem applies to discrete, as well as continuous, multivariate populations. Mathematically, the limit is exact, and the approach to normality is often fairly rapid. Moreover, from the results in Section 4.4, we know that X̄ is exactly normally distributed when the underlying population is normal. Thus, we would expect the central limit theorem approximation to be quite good for moderate n when the parent population is nearly normal.

As we have seen, when n is large, S is close to Σ with high probability. Consequently, replacing Σ by S in the approximating normal distribution for X̄ will have a negligible effect on subsequent probability calculations. Result 4.7 can be used to show that n(X̄ - μ)'Σ^{-1}(X̄ - μ) has a χ²_p distribution when X̄ is distributed as N_p(μ, (1/n)Σ) or, equivalently, when √n (X̄ - μ) has an N_p(0, Σ) distribution. The χ²_p distribution is approximately the sampling distribution of n(X̄ - μ)'Σ^{-1}(X̄ - μ) when X̄ is approximately normally distributed. Replacing Σ^{-1} by S^{-1} does not seriously affect this approximation for n large and much greater than p.

We summarize the major conclusions of this section as follows:
Let X_1, X_2, ..., X_n be independent observations from a population with mean μ and finite (nonsingular) covariance Σ. Then

√n (X̄ - μ) is approximately N_p(0, Σ)

and

n(X̄ - μ)'S^{-1}(X̄ - μ) is approximately χ²_p   (4-28)

for n - p large.

In the next three sections, we consider ways of verifying the assumption of normality and methods for transforming nonnormal observations into observations that are approximately normal.
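As an aside, the chi-square approximation in the second line of (4-28) can be checked by simulation. In this hedged sketch (the function name, seed, and constants are ours), repeated samples of size n = 200 are drawn from a bivariate normal population; the statistic should fall at or below the chi-square upper 5% point χ²_2(.05) = 5.99 about 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(1)

def t2_stat(X, mu):
    """n (xbar - mu)' S^{-1} (xbar - mu), approximately chi-square
    with p d.f. for n - p large, by (4-28)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    d = X.mean(axis=0) - mu
    S = np.cov(X, rowvar=False)
    return n * d @ np.linalg.solve(S, d)

mu = np.array([1.0, -2.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
trials = 500
cover = 0
for _ in range(trials):
    X = rng.multivariate_normal(mu, sigma, size=200)
    if t2_stat(X, mu) <= 5.99:  # chi-square_2 upper 5% point
        cover += 1
rate = cover / trials  # should be near .95
```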
4.6 Assessing the Assumption of Normality

Many of the techniques in later chapters assume that each observation vector comes from a multivariate normal distribution. In situations where this assumption matters, we should address the following questions:
1. Do the marginal distributions of the elements of X appear to be normal? What about a few linear combinations of the components X_i?
2. Do the scatter plots of pairs of observations on different characteristics give the elliptical appearance expected from normal populations?
3. Are there any "wild" observations that should be checked for accuracy?
It will become clear that our investigations of normality will concentrate on the behavior of the observations in one or two dimensions (for example, marginal distributions and scatter plots). As might be expected, it has proved difficult to construct a "good" overall test of joint normality in more than two dimensions because of the large number of things that can go wrong. To some extent, we must pay a price for concentrating on univariate and bivariate examinations of normality: We can never be sure that we have not missed some feature that is revealed only in higher dimensions. (It is possible, for example, to construct a nonnormal bivariate distribution with normal marginals. [See Exercise 4.8.]) Yet many types of nonnormality are often reflected in the marginal distributions and scatter plots. Moreover, for most practical work, onedimensional and twodimensional investigations are ordinarily sufficient. Fortunately, pathological data sets that are normal in lower dimensional representations, but nonnormal in higher dimensions, are not frequently encountered in practice.
If the normal assumption is plausible, we would expect the observed proportion p̂_i1 of the observations lying in the interval (x̄_i - √s_ii, x̄_i + √s_ii) to be about .683. Similarly, the observed proportion p̂_i2 of the observations in (x̄_i - 2√s_ii, x̄_i + 2√s_ii) should be about .954. Using the normal approximation to the sampling distribution of p̂_i (see [9]), we observe that either

|p̂_i1 - .683| > 3 √(.683(.317)/n) = 1.396/√n

or

|p̂_i2 - .954| > 3 √(.954(.046)/n) = .628/√n   (4-29)
would indicate departures from an assumed normal distribution for the ith characteristic. When the observed proportions are too small, parent distributions with thicker tails than the normal are suggested.

Plots are always useful devices in any data analysis. Special plots called Q-Q plots can be used to assess the assumption of normality. These plots can be made for the marginal distributions of the sample observations on each variable. They are, in effect, plots of the sample quantile versus the quantile one would expect to observe if the observations actually were normally distributed. When the points lie very nearly along a straight line, the normality assumption remains tenable. Normality is suspect if the points deviate from a straight line. Moreover, the pattern of the deviations can provide clues about the nature of the nonnormality. Once the reasons for the nonnormality are identified, corrective action is often possible. (See Section 4.8.)

To simplify notation, let x_1, x_2, ..., x_n represent n observations on any single characteristic X_i. Let x_(1) ≤ x_(2) ≤ ⋯ ≤ x_(n) represent these observations after they are ordered according to magnitude. For example, x_(2) is the second smallest observation and x_(n) is the largest observation. The x_(j)'s are the sample quantiles. When the x_(j) are distinct, exactly j observations are less than or equal to x_(j). (This is theoretically always true when the observations are of the continuous type, which we usually assume.) The proportion j/n of the sample at or to the left of x_(j) is often approximated by (j - ½)/n for analytical convenience.¹ For a standard normal distribution, the quantiles q_(j) are defined by the relation
P[Z ≤ q_(j)] = ∫_{-∞}^{q_(j)} (2π)^{-1/2} e^{-z²/2} dz = p_(j) = (j - ½)/n   (4-30)

(See Table 1 in the appendix.) Here p_(j) is the probability of getting a value less than or equal to q_(j) in a single drawing from a standard normal population. The idea is to look at the pairs of quantiles (q_(j), x_(j)) with the same associated cumulative probability (j - ½)/n. If the data arise from a normal population, the pairs (q_(j), x_(j)) will be approximately linearly related, since σq_(j) + μ is nearly the expected sample quantile.²

¹ The ½ in the numerator of (j - ½)/n is a "continuity" correction. Some authors (see [5] and [10]) have suggested replacing (j - ½)/n by (j - ⅜)/(n + ¼).
² A better procedure is to plot (m_(j), x_(j)), where m_(j) = E(z_(j)) is the expected value of the j-th order statistic in a sample of size n from a standard normal distribution. (See [13] for further discussion.)
Example 4.9 (Constructing a Q-Q plot) A sample of n = 10 observations gives the following ordered values, probability levels, and standard normal quantiles:

Ordered observations x_(j):      -1.00  -.10   .16   .41   .62   .80  1.26  1.54  1.71  2.30
Probability levels (j - ½)/n:      .05   .15   .25   .35   .45   .55   .65   .75   .85   .95
Standard normal quantiles q_(j): -1.645 -1.036 -.674 -.385 -.125  .125  .385  .674 1.036 1.645

Here, for example,

P[Z ≤ -.385] = ∫_{-∞}^{-.385} (2π)^{-1/2} e^{-z²/2} dz = .35

Let us now construct the Q-Q plot and comment on its appearance. The Q-Q plot for the foregoing data, which is a plot of the ordered data x_(j) against the normal quantiles q_(j), is shown in Figure 4.5. The pairs of points (q_(j), x_(j)) lie very nearly along a straight line, and we would not reject the notion that these data are normally distributed, particularly with a sample size as small as n = 10.
The calculations required for Q-Q plots are easily programmed for electronic computers. Many statistical programs available commercially are capable of producing such plots. The steps leading to a Q-Q plot are as follows:
1. Order the original observations to get x_(1), x_(2), ..., x_(n) and their corresponding probability values (1 - ½)/n, (2 - ½)/n, ..., (n - ½)/n;
2. Calculate the standard normal quantiles q_(1), q_(2), ..., q_(n); and
3. Plot the pairs of observations (q_(1), x_(1)), (q_(2), x_(2)), ..., (q_(n), x_(n)), and examine the "straightness" of the outcome.
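The three steps can be carried out in a few lines of Python; the standard library's statistics.NormalDist.inv_cdf supplies the standard normal quantiles (the function name qq_pairs is ours):

```python
from statistics import NormalDist

def qq_pairs(x):
    """Q-Q plot coordinates: pairs (q_(j), x_(j)) built from the
    ordered data and the probability levels (j - 1/2)/n."""
    n = len(x)
    xs = sorted(x)                                    # step 1
    probs = [(j - 0.5) / n for j in range(1, n + 1)]
    qs = [NormalDist().inv_cdf(p) for p in probs]     # step 2
    return list(zip(qs, xs))                          # step 3: plot these
```

Plotting the returned pairs and judging their straightness completes the procedure.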
Q-Q plots are not particularly informative unless the sample size is moderate to large, for instance, n ≥ 20. There can be quite a bit of variability in the straightness of the Q-Q plot for small samples, even when the observations are known to come from a normal population.
Example 4.10 (A Q-Q plot for radiation data) The quality-control department of a manufacturer of microwave ovens is required by the federal government to monitor the amount of radiation emitted when the doors of the ovens are closed. Observations of the radiation emitted through closed doors of n = 42 randomly selected ovens were made. The data are listed in Table 4.1.
Table 4.1 Radiation Data (Door Closed): the emitted radiation for oven numbers 1 through 42; the recorded values range from .01 to .40.
In order to determine the probability of exceeding a prespecified tolerance level, a probability distribution for the radiation emitted was needed. Can we regard the observations here as being normally distributed? A computer was used to assemble the pairs (q_(j), x_(j)) and construct the Q-Q plot, pictured in Figure 4.6. It appears from the plot that the data as a whole are not normally distributed. The points indicated by the circled locations in the figure are outliers, values that are too large relative to the rest of the observations.

For the radiation data, several observations are equal. When this occurs, those observations with like values are associated with the same normal quantile. This quantile is calculated using the average of the quantiles the tied observations would have if they all differed slightly.
Figure 4.6 A Q-Q plot of the radiation data (door closed) from Example 4.10. (The integers in the plot indicate the number of points occupying the same location.)
The straightness of the Q-Q plot can be measured by calculating the correlation coefficient of the points in the plot. The correlation coefficient for the Q-Q plot is defined by
r_Q = ∑_{j=1}^n (x_(j) - x̄)(q_(j) - q̄) / ( √(∑_{j=1}^n (x_(j) - x̄)²) √(∑_{j=1}^n (q_(j) - q̄)²) )   (4-31)
and a powerful test of normality can be based on it. (See [5], [10], and [12].) Formally, we reject the hypothesis of normality at level of significance α if r_Q falls below the appropriate value in Table 4.2.
Table 4.2 Critical Points for the Q-Q Plot Correlation Coefficient Test for Normality

Sample size n   α = .01   α = .05   α = .10
      5          .8299     .8788     .9032
     10          .8801     .9198     .9351
     15          .9126     .9389     .9503
     20          .9269     .9508     .9604
     25          .9410     .9591     .9665
     30          .9479     .9652     .9715
     35          .9538     .9682     .9740
     40          .9599     .9726     .9771
     45          .9632     .9749     .9792
     50          .9671     .9768     .9809
     55          .9695     .9787     .9822
     60          .9720     .9801     .9836
     75          .9771     .9838     .9866
    100          .9822     .9873     .9895
    150          .9879     .9913     .9928
    200          .9905     .9931     .9942
    300          .9935     .9953     .9960
Example 4.11 (A correlation coefficient test for normality) Let us calculate the correlation coefficient r_Q from the Q-Q plot of Example 4.9 (see Figure 4.5) and test for normality. Using the information from Example 4.9, we have x̄ = .770 and

∑_{j=1}^{10} (x_(j) - x̄)q_(j) = 8.584,   ∑_{j=1}^{10} (x_(j) - x̄)² = 8.472,   and   ∑_{j=1}^{10} q_(j)² = 8.795

Since always, q̄ = 0,

r_Q = 8.584 / (√8.472 √8.795) = .994
A test of normality at the 10% level of significance is provided by referring r_Q = .994 to the entry in Table 4.2 corresponding to n = 10 and α = .10. This entry is .9351. Since r_Q > .9351, we do not reject the hypothesis of normality.

Instead of r_Q, some software packages evaluate the original statistic proposed by Shapiro and Wilk [12]. Its correlation form corresponds to replacing q_(j) by a function of the expected value of standard normal order statistics and their covariances. We prefer r_Q because it corresponds directly to the points in the normal-scores plot. For large sample sizes, the two statistics are nearly the same (see [13]), so either can be used to judge lack of fit.

Linear combinations of more than one characteristic can be investigated. Many statisticians suggest plotting
ê_1' x_j,   j = 1, 2, ..., n,   where  S ê_1 = λ̂_1 ê_1

in which λ̂_1 is the largest eigenvalue of S. Here x_j' = [x_j1, x_j2, ..., x_jp] is the j-th observation on the p variables X_1, X_2, ..., X_p. The linear combination ê_p' x_j corresponding to the smallest eigenvalue is also frequently singled out for inspection. (See Chapter 8 and [6] for further details.)
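Returning to the marginal test, the statistic r_Q of (4-31) is also simple to compute. This sketch (function name ours) exploits q̄ = 0 and the probability levels (j - ½)/n; for the ten observations of Example 4.9 it reproduces r_Q ≈ .994, which exceeds the n = 10, α = .10 entry .9351 of Table 4.2:

```python
import math
from statistics import NormalDist

def rq_statistic(x):
    """Correlation coefficient r_Q of the Q-Q plot points, (4-31).
    The normal quantiles are symmetric about zero, so q_bar = 0."""
    n = len(x)
    xs = sorted(x)
    qs = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
    xbar = sum(xs) / n
    num = sum((xi - xbar) * qi for xi, qi in zip(xs, qs))
    den = math.sqrt(sum((xi - xbar) ** 2 for xi in xs)) \
        * math.sqrt(sum(qi ** 2 for qi in qs))
    return num / den
```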
By Result 4.7, the set of bivariate outcomes x such that (x - μ)'Σ^{-1}(x - μ) ≤ χ²_2(.5) has probability .5. Thus, we should expect roughly the same percentage, 50%, of sample observations to lie in the ellipse given by

{all x such that (x - x̄)'S^{-1}(x - x̄) ≤ χ²_2(.5)}

where we have replaced μ by its estimate x̄ and Σ^{-1} by its estimate S^{-1}. If not, the normality assumption is suspect.
Example 4.12 (Checking bivariate normality) Although not a random sample, data consisting of the pairs of observations (x_1 = sales, x_2 = profits) for the 10 largest companies in the world are listed in Exercise 1.4. These data give

x̄ = [155.60]      S = [7476.45   303.62]
    [ 14.70]          [ 303.62    26.19]

so

S^{-1} = [ .000253  -.002930]
         [-.002930   .072148]

From Table 3 in the appendix, χ²_2(.5) = 1.39. Thus, any observation x' = [x_1, x_2] satisfying

(x - x̄)'S^{-1}(x - x̄) ≤ 1.39

is on or inside the estimated 50% contour. Otherwise the observation is outside this contour. The first pair of observations in Exercise 1.4 is [x_1, x_2]' = [108.28, 17.05]. In this case

(x - x̄)'S^{-1}(x - x̄) = 1.61 > 1.39

and this point falls outside the 50% contour. The remaining nine points have generalized distances from x̄ of .30, .62, 1.79, 1.30, 4.38, 1.64, 3.53, 1.71, and 1.16, respectively. Since four of these distances are less than 1.39, a proportion, .40, of the data falls within the 50% contour. If the observations were normally distributed, we would expect about half, or 5, of them to be within this contour. This difference in proportions might ordinarily provide evidence for rejecting the notion of bivariate normality; however, our sample size of 10 is too small to reach this conclusion. (See also Example 4.13.)

Computing the fraction of the points within a contour and subjectively comparing it with the theoretical probability is a useful, but rather rough, procedure.
A somewhat more formal method for judging joint normality is based on the squared generalized distances

d²_j = (x_j - x̄)'S^{-1}(x_j - x̄),   j = 1, 2, ..., n   (4-32)

where x_1, x_2, ..., x_n are the sample observations. The procedure we are about to describe is not limited to the bivariate case; it can be used for all p ≥ 2. When the parent population is multivariate normal and both n and n - p are greater than 25 or 30, each of the squared distances d²_1, d²_2, ..., d²_n should behave like a chi-square random variable. [See Result 4.7 and Equations (4-26) and (4-27).] Although these distances are not independent or exactly chi-square distributed, it is helpful to plot them as if they were. The resulting plot is called a chi-square plot or gamma plot, because the chi-square distribution is a special case of the more general gamma distribution. (See [6].) To construct the chi-square plot,
1. Order the squared distances in (4-32) from smallest to largest as d²_(1) ≤ d²_(2) ≤ ⋯ ≤ d²_(n).
2. Graph the pairs (q_{c,p}((j - ½)/n), d²_(j)), where q_{c,p}((j - ½)/n) is the 100(j - ½)/n quantile of the chi-square distribution with p degrees of freedom.
Quantiles are specified in terms of proportions, whereas percentiles are specified in terms of percentages. The quantiles q_{c,p}((j - ½)/n) are related to the upper percentiles of a chi-squared distribution. In particular, q_{c,p}((j - ½)/n) = χ²_p((n - j + ½)/n). The plot should resemble a straight line through the origin having slope 1. A systematic curved pattern suggests lack of normality. One or two points far above the line indicate large distances, or outlying observations, that merit further attention.
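For p = 2 the needed quantiles have the closed form q_{c,2}(u) = -2 ln(1 - u), so the plotting coordinates of steps 1 and 2 can be computed without tables. A sketch (function name ours):

```python
import math

def chisq2_plot_points(d2):
    """Chi-square plot coordinates for p = 2: pairs
    (q_{c,2}((j - 1/2)/n), d2_(j)).  With 2 d.f. the chi-square
    quantile has the closed form q(u) = -2 ln(1 - u)."""
    n = len(d2)
    ordered = sorted(d2)  # step 1
    qs = [-2.0 * math.log(1.0 - (j - 0.5) / n) for j in range(1, n + 1)]
    return list(zip(qs, ordered))  # step 2: graph these pairs
```

For the ordered distances of Example 4.13 below, this reproduces the tabulated quantiles .10, .33, ..., 5.99.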
Example 4.13 (Constructing a chi-square plot) Let us construct a chi-square plot of the generalized distances given in Example 4.12. The ordered distances and the corresponding chi-square percentiles for p = 2 and n = 10 are listed in the following table:

 j    d²_(j)    q_{c,2}((j - ½)/10)
 1      .30        .10
 2      .62        .33
 3     1.16        .58
 4     1.30        .86
 5     1.61       1.20
 6     1.64       1.60
 7     1.71       2.10
 8     1.79       2.77
 9     3.53       3.79
10     4.38       5.99
Figure 4.7 A chi-square plot of the ordered distances in Example 4.13.
A graph of the pairs (q_{c,2}((j - ½)/10), d²_(j)) is shown in Figure 4.7. The points in Figure 4.7 are reasonably straight. Given the small sample size, it is difficult to reject bivariate normality on the evidence in this graph. If further analysis of the data were required, it might be reasonable to transform them to observations more nearly bivariate normal. Appropriate transformations are discussed in Section 4.8.

In addition to inspecting univariate plots and scatter plots, we should check multivariate normality by constructing a chi-square or d² plot. Figure 4.8 contains d²
Figure 4.8 Chi-square plots for two simulated four-variate normal data sets with n = 30.
plots based on two computer-generated samples of 30 four-variate normal random vectors. As expected, the plots have a straight-line pattern, but the top two or three ordered squared distances are quite variable. The next example contains a real data set comparable to the simulated data set that produced the plots in Figure 4.8.
Example 4.14 (Evaluating multivariate normality for a four-variable data set) The data in Table 4.3 were obtained by taking four different measures of stiffness, x_1, x_2, x_3, and x_4, of each of n = 30 boards. The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances d²_j = (x_j - x̄)'S^{-1}(x_j - x̄) are also presented in the table.
Table 4.3 Four Measurements of Stiffness

Observation no.    x1     x2     x3     x4      d²
        1         1889   1651   1561   1778     .60
        2         2403   2048   2087   2197    5.48
        3         2119   1700   1815   2222    7.62
        4         1645   1627   1110   1533    5.21
        5         1976   1916   1614   1883    1.40
        6         1712   1712   1439   1546    2.22
        7         1943   1685   1271   1671    4.99
        8         2104   1820   1717   1874    1.49
        9         2983   2794   2412   2581   12.26
       10         1745   1600   1384   1508     .77
       11         1710   1591   1518   1667    1.93
       12         2046   1907   1627   1898     .46
       13         1840   1841   1595   1741    2.70
       14         1867   1685   1493   1678     .13
       15         1859   1649   1389   1714    1.08
       16         1954   2149   1180   1281   16.85
       17         1325   1170   1002   1176    3.50
       18         1419   1371   1252   1308    3.99
       19         1828   1634   1602   1755    1.36
       20         1725   1594   1313   1646    1.46
       21         2276   2189   1547   2111    9.90
       22         1899   1614   1422   1477    5.06
       23         1633   1513   1290   1516     .80
       24         2061   1867   1646   2037    2.54
       25         1856   1493   1356   1533    4.58
       26         1727   1412   1238   1469    3.40
       27         2168   1896   1701   1834    2.38
       28         1655   1675   1414   1597    3.00
       29         2326   2301   2065   2234    6.28
       30         1490   1382   1214   1284    2.58
Source:
The marginal distributions appear quite normal (see Exercise 4.33), with the possible exception of specimen (board) 9. To further evaluate multivariate normality, we constructed the chi-square plot shown in Figure 4.9. The two specimens with the largest squared distances are clearly removed from the straight-line pattern. Together with the next largest point or two, they make the plot appear curved at the upper end. We will return to a discussion of this plot in Example 4.15.

We have discussed some rather simple techniques for checking the multivariate normality assumption. Specifically, we advocate calculating the d²_j, j = 1, 2, ..., n [see Equation (4-32)] and comparing the results with chi-square quantiles. For example, p-variate normality is indicated if
1. Roughly half of the d²_j are less than or equal to q_{c,p}(.50).
2. A plot of the ordered squared distances versus the corresponding chi-square quantiles is nearly a straight line having slope 1 through the origin.
Figure 4.9 A chi-square plot of the squared distances for the stiffness data in Example 4.14.
(See [6] for a more complete exposition of methods for assessing normality.)

We close this section by noting that all measures of goodness of fit suffer the same serious drawback. When the sample size is small, only the most aberrant behavior will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distribution may be very small and technically unimportant to the inferential conclusions.
4.7 Detecting Outliers and Cleaning Data

Outliers are best detected visually whenever this is possible. When the number of observations n is large, dot plots are not feasible. When the number of characteristics p is large, the large number p(p - 1)/2 of scatter plots may prevent viewing them all. Even so, we suggest first visually inspecting the data whenever possible. What should we look for? For a single random variable, the problem is one-dimensional, and we look for observations that are far from the others. For instance, the dot diagram
reveals a single large observation, which is circled. In the bivariate case, the situation is more complicated. Figure 4.10 shows a situation with two unusual observations. The data point circled in the upper right corner of the figure is detached from the pattern, and its second coordinate is large relative to the rest of the x_2 measurements, as shown by the vertical dot diagram. The second outlier, also circled, is far from the elliptical pattern of the rest of the points, but, separately, each of its components has a typical value. This outlier cannot be detected by inspecting the marginal dot diagrams. In higher dimensions, there can be outliers that cannot be detected from the univariate plots or even the bivariate scatter plots. Here a large value of (x_j - x̄)'S^{-1}(x_j - x̄) will suggest an unusual observation, even though it cannot be seen visually.

Figure 4.10 Two outliers; one univariate and one bivariate.
Steps for Detecting Outliers
1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values z_jk = (x_jk - x̄_k)/√s_kk for j = 1, 2, ..., n and each column k = 1, 2, ..., p. Examine these standardized values for large or small values.
4. Calculate the generalized squared distances (x_j - x̄)'S^{-1}(x_j - x̄). Examine these distances for unusually large values. In a chi-square plot, these would be the points farthest from the origin.

In step 3, "large" must be interpreted relative to the sample size and number of variables. There are n × p standardized values. When n = 100 and p = 5, there are 500 values. You expect 1 or 2 of these to exceed 3 or be less than -3, even if the data came from a multivariate distribution that is exactly normal. As a guideline, 3.5 might be considered large for moderate sample sizes. In step 4, "large" is measured by an appropriate percentile of the chi-square distribution with p degrees of freedom. If the sample size is n = 100, we would expect 5 observations to have values of d²_j that exceed the upper fifth percentile of the chi-square distribution. A more extreme percentile must serve to determine observations that do not fit the pattern of the remaining data.

The data we presented in Table 4.3 concerning lumber have already been cleaned up somewhat. Similar data sets from the same study also contained data on x_5 = tensile strength. Nine observation vectors, out of the total of 112, are given as rows in the following table, along with their standardized values.
x1   x2   x3   x4   x5   z1   z2   z3   z4   z5
[Nine rows of observations with their standardized values; the rows containing the entries x4 = 2746 and x5 = 2791 discussed below have standardized values over 4.5.]
The standardized values are based on the sample mean and variance, calculated from all 112 observations. There are two extreme standardized values. Both are too large, with standardized values over 4.5. During their investigation, the researchers recorded measurements by hand in a logbook and then performed calculations that produced the values given in the table. When they checked their records regarding the values pinpointed by this analysis, errors were discovered. The value x_5 = 2791 was corrected to 1241, and x_4 = 2746 was corrected to 1670. Incorrect readings on an individual variable are quickly detected by locating a large leading digit for the standardized value.

The next example returns to the data on lumber discussed in Example 4.14.
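The standardized-value and distance screens of steps 3 and 4 are mechanical to apply. A hedged NumPy sketch (the function name is ours); the identity ∑_j d²_j = (n - 1)p provides a handy check on the computation:

```python
import numpy as np

def outlier_screens(X):
    """Standardized values z_jk = (x_jk - xbar_k)/sqrt(s_kk) and
    generalized squared distances d2_j = (x_j - xbar)' S^{-1} (x_j - xbar)."""
    X = np.asarray(X, dtype=float)
    c = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    Z = c / np.sqrt(np.diag(S))
    d2 = np.sum(c * np.linalg.solve(S, c.T).T, axis=1)
    return Z, d2
```

Large |z_jk| flag errors in a single variable, while large d2_j flag multivariate outliers that no single coordinate reveals.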
Example 4.15 (Detecting outliers in the data on lumber) Table 4.4 contains the data in Table 4.3, along with the standardized observations. These data consist of four different measures of stiffness x_1, x_2, x_3, and x_4, on each of n = 30 boards. Recall that the first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The standardized measurements are
Table 4.4 Four Measurements of Stiffness with Standardized Values
[The table repeats the observations x1, x2, x3, x4 of Table 4.3 together with the standardized values z1, z2, z3, z4 and the squared distances d²; the largest distances are d² = 16.85 for specimen 16 and d² = 12.26 for specimen 9.]
Figure 4.11 Scatter plots for the lumber stiffness data with specimens 9 and 16 plotted as solid dots.
The standardized measurements are

z_jk = (x_jk − x̄_k)/√s_kk,    k = 1, 2, 3, 4;  j = 1, 2, ..., 30
and the squares of the distances are d_j² = (x_j − x̄)'S⁻¹(x_j − x̄). The last column in Table 4.4 reveals that specimen 16 is a multivariate outlier, since χ²₄(.005) = 14.86; yet all of the individual measurements are well within their respective univariate scatters. Specimen 9 also has a large d² value.
The two specimens (9 and 16) with large squared distances stand out as clearly different from the rest of the pattern in Figure 4.9. Once these two points are removed, the remaining pattern conforms to the expected straight-line relation. Scatter plots for the lumber stiffness measurements are given in Figure 4.11 above.
The solid dots in these figures correspond to specimens 9 and 16. Although the dot for specimen 16 stands out in all the plots, the dot for specimen 9 is "hidden" in the scatter plot of x3 versus x4 and nearly hidden in that of x1 versus x3. However, specimen 9 is clearly identified as a multivariate outlier when all four variables are considered. Scientists specializing in the properties of wood conjectured that specimen 9 was unusually clear and therefore very stiff and strong. It would also appear that specimen 16 is a bit unusual, since both of its dynamic measurements are above average and the two static measurements are low. Unfortunately, it was not possible to investigate this specimen further because the material was no longer available.
If outliers are identified, they should be examined for content, as was done in the case of the data on lumber stiffness in Example 4.15. Depending upon the nature of the outliers and the objectives of the investigation, outliers may be deleted or appropriately "weighted" in a subsequent analysis. Even though many statistical techniques assume normal populations, those based on the sample mean vectors usually will not be disturbed by a few moderate outliers. Hawkins [7] gives an extensive treatment of the subject of outliers.
1. Counts, y:  √y
2. Proportions, p̂:  logit(p̂) = (1/2) log( p̂ / (1 − p̂) )    (4-33)
3. Correlations, r:  Fisher's z(r) = (1/2) log( (1 + r) / (1 − r) )
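Each of these reexpressions is a one-liner in software. A minimal sketch (Python with NumPy; the sample values are hypothetical):

```python
import numpy as np

y = np.array([4.0, 9.0, 16.0, 25.0])     # counts
p_hat = np.array([0.2, 0.5, 0.8])        # sample proportions
r = np.array([-0.3, 0.0, 0.9])           # sample correlations

sqrt_y = np.sqrt(y)                              # square roots of counts
logit_p = 0.5 * np.log(p_hat / (1.0 - p_hat))    # logit as defined in (4-33)
fisher_z = 0.5 * np.log((1.0 + r) / (1.0 - r))   # Fisher's z; identical to np.arctanh(r)
```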
In many instances, the choice of a transformation to improve the approximation to normality is not obvious. For such cases, it is convenient to let the data suggest a transformation. A useful family of transformations for this purpose is the family of power transformations. Power transformations are defined only for positive variables. However, this is not as restrictive as it seems, because a single constant can be added to each observation in the data set if some of the values are negative.
Let x represent an arbitrary observation. The power family of transformations is indexed by a parameter λ. A given value for λ implies a particular transformation. For example, consider x^λ with λ = −1. Since x^(−1) = 1/x, this choice of λ corresponds to the reciprocal transformation. We can trace the family of transformations as λ ranges from negative to positive powers of x. For λ = 0, we define x^(0) = ln x. A sequence of possible transformations is
...,  x^(−1) = 1/x,  x^(−1/2) = 1/√x,  x^(0) = ln x,  x^(1/4) = ⁴√x,  x^(1/2) = √x,  ...
To select a power transformation, an investigator looks at the marginal dot diagram or histogram and decides whether large values have to be "pulled in" or "pushed out" to improve the symmetry about the mean. Trial-and-error calculations with a few of the foregoing transformations should produce an improvement. The final choice should always be examined by a QQ plot or other checks to see whether the tentative normal assumption is satisfactory.
The transformations we have been discussing are data based in the sense that it is only the appearance of the data themselves that influences the choice of an appropriate transformation. There are no external considerations involved, although the transformation actually used is often determined by some mix of information supplied by the data and extra-data factors, such as simplicity or ease of interpretation.
A convenient analytical method is available for choosing a power transformation. We begin by focusing our attention on the univariate case. Box and Cox [3] consider the slightly modified family of power transformations
x^(λ) = { (x^λ − 1)/λ,   λ ≠ 0
        { ln x,          λ = 0        (4-34)
which is continuous in λ for x > 0. (See [8].) Given the observations x1, x2, ..., xn, the Box-Cox solution for the choice of an appropriate power λ is the solution that maximizes the expression
ℓ(λ) = −(n/2) ln[ (1/n) Σ_{j=1}^n (x_j^(λ) − x̄^(λ))² ] + (λ − 1) Σ_{j=1}^n ln x_j    (4-35)

where

x̄^(λ) = (1/n) Σ_{j=1}^n x_j^(λ) = (1/n) Σ_{j=1}^n (x_j^λ − 1)/λ    (4-36)
is the arithmetic average of the transformed observations. The first term in (4-35) is, apart from a constant, the logarithm of a normal likelihood function, after maximizing it with respect to the population mean and variance parameters.
The calculation of ℓ(λ) for many values of λ is an easy task for a computer. It is helpful to have a graph of ℓ(λ) versus λ, as well as a tabular display of the pairs (λ, ℓ(λ)), in order to study the behavior near the maximizing value λ̂. For instance, if either λ = 0 (logarithm) or λ = 1/2 (square root) is near λ̂, one of these may be preferred because of its simplicity.
Rather than program the calculation of (4-35), some statisticians recommend the equivalent procedure of fixing λ, creating the new variable
y_j^(λ) = (x_j^λ − 1) / ( λ [ (Π_{i=1}^n x_i)^{1/n} ]^{λ−1} ),    j = 1, ..., n    (4-37)
and then calculating the sample variance. The minimum of the variance occurs at the same λ that maximizes (4-35).
Comment. It is now understood that the transformation obtained by maximizing ℓ(λ) usually improves the approximation to normality. However, there is no guarantee that even the best choice of λ will produce a transformed set of values that adequately conform to a normal distribution. The outcomes produced by a transformation selected according to (4-35) should always be carefully examined for possible violations of the tentative assumption of normality. This warning applies with equal force to transformations selected by any other technique.
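A grid search over ℓ(λ) takes only a few lines. The sketch below (Python with NumPy; the positive readings are hypothetical, not the radiation data) evaluates (4-34) and (4-35) over a grid like the one tabulated in Example 4.16:

```python
import numpy as np

def box_cox(x, lam):
    """The power transformation (4-34); continuous at lam = 0."""
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

def ell(x, lam):
    """Objective (4-35):
    -(n/2) ln[(1/n) sum_j (x_j^(lam) - xbar^(lam))^2] + (lam - 1) sum_j ln x_j."""
    n = len(x)
    xl = box_cox(x, lam)
    return -(n / 2.0) * np.log(np.mean((xl - xl.mean()) ** 2)) \
           + (lam - 1.0) * np.log(x).sum()

# Hypothetical positive readings; grid of candidate powers
x = np.array([0.15, 0.09, 0.18, 0.10, 0.05, 0.12, 0.08, 0.05, 0.08, 0.10])
grid = np.round(np.arange(-1.0, 1.51, 0.01), 2)
lam_hat = grid[np.argmax([ell(x, lam) for lam in grid])]
```

The same maximization is performed over a continuous range of λ by scipy.stats.boxcox, which can serve as a cross-check.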
Example 4.16 (Determining a power transformation for univariate data) We gave readings of the microwave radiation emitted through the closed doors of n = 42 ovens in Example 4.10. The QQ plot of these data in Figure 4.6 indicates that the observations deviate from what would be expected if they were normally distributed. Since all the observations are positive, let us perform a power transformation of the data which, we hope, will produce results that are more nearly normal. Restricting our attention to the family of transformations in (4-34), we must find that value of λ maximizing the function ℓ(λ) in (4-35).
The pairs (λ, ℓ(λ)) are listed in the following table for several values of λ:
  λ       ℓ(λ)        λ       ℓ(λ)
−1.00    70.52       .40    106.20
 −.90    75.65       .50    105.50
 −.80    80.46       .60    104.43
 −.70    84.94       .70    103.03
 −.60    89.06       .80    101.33
 −.50    92.79       .90     99.34
 −.40    96.10      1.00     97.10
 −.30    98.97      1.10     94.64
 −.20   101.39      1.20     91.96
 −.10   103.35      1.30     89.10
  .00   104.83      1.40     86.07
  .10   105.84      1.50     82.88
  .20   106.39
  .30   106.50
Figure 4.12 Plot of ℓ(λ) versus λ for radiation data (door closed).
The curve of ℓ(λ) versus λ that allows the more exact determination λ̂ = .28 is shown in Figure 4.12. It is evident from both the table and the plot that a value of λ̂ around .30 maximizes ℓ(λ). For convenience, we choose λ̂ = .25. The data x_j were reexpressed as
x_j^(1/4) = (x_j^(1/4) − 1) / (1/4),    j = 1, 2, ..., 42
and a QQ plot was constructed from the transformed quantities. This plot is shown in Figure 4.13 on page 196. The quantile pairs fall very close to a straight line, and we would conclude from this evidence that the x_j^(1/4) are approximately normal.
With multivariate observations, a power transformation must be selected for each of the variables. Let λ1, λ2, ..., λp be the power transformations for the p measured characteristics. Each λ_k can be selected by maximizing

ℓ_k(λ_k) = −(n/2) ln[ (1/n) Σ_{j=1}^n (x_jk^(λ_k) − x̄_k^(λ_k))² ] + (λ_k − 1) Σ_{j=1}^n ln x_jk    (4-38)
Figure 4.13 A QQ plot of the transformed radiation data (door closed). (The integers in the plot indicate the number of points occupying the same location.)
where x_1k, x_2k, ..., x_nk are the n observations on the kth variable, k = 1, 2, ..., p. Here

x̄_k^(λ_k) = (1/n) Σ_{j=1}^n x_jk^(λ_k) = (1/n) Σ_{j=1}^n (x_jk^{λ_k} − 1)/λ_k    (4-39)
is the arithmetic average of the transformed observations. The jth transformed multivariate observation is
x_j^(λ̂) = [ (x_j1^{λ̂1} − 1)/λ̂1 ,  (x_j2^{λ̂2} − 1)/λ̂2 ,  ... ,  (x_jp^{λ̂p} − 1)/λ̂p ]'

where λ̂1, λ̂2, ..., λ̂p are the values that individually maximize (4-38).
The procedure just described is equivalent to making each marginal distribution approximately normal. Although normal marginals are not sufficient to ensure that the joint distribution is normal, in practical applications this may be good enough. If not, we could start with the values λ̂1, λ̂2, ..., λ̂p obtained from the preceding transformations and iterate toward the set of values λ' = [λ1, λ2, ..., λp], which collectively maximizes
ℓ(λ1, λ2, ..., λp) = −(n/2) ln |S(λ)| + (λ1 − 1) Σ_{j=1}^n ln x_j1 + (λ2 − 1) Σ_{j=1}^n ln x_j2 + ··· + (λp − 1) Σ_{j=1}^n ln x_jp    (4-40)

where S(λ) is the sample covariance matrix computed from

x_j^(λ) = [ (x_j1^{λ1} − 1)/λ1 ,  (x_j2^{λ2} − 1)/λ2 ,  ... ,  (x_jp^{λp} − 1)/λp ]' ,    j = 1, 2, ..., n
Maximizing (4-40) not only is substantially more difficult than maximizing the individual expressions in (4-38), but also is unlikely to yield remarkably better results. The selection method based on Equation (4-40) is equivalent to maximizing a multivariate likelihood over μ, Σ, and λ, whereas the method based on (4-38) corresponds to maximizing the kth univariate likelihood over μ_k, σ_kk, and λ_k. The latter likelihood is generated by pretending there is some λ_k for which the observations (x_jk^{λ_k} − 1)/λ_k, j = 1, 2, ..., n have a normal distribution. See [3] and [2] for detailed discussions of the univariate and multivariate cases, respectively. (Also, see [8].)
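For small p, (4-40) can be maximized by the same kind of brute-force grid search used in the univariate case. A sketch (Python with NumPy; the positive bivariate data are simulated, not the radiation measurements):

```python
import numpy as np

def power_transform(col, lam):
    """(x^lam - 1)/lam, with the ln x limit at lam = 0, as in (4-34)."""
    return np.log(col) if lam == 0 else (col**lam - 1.0) / lam

def ell_joint(X, lams):
    """Objective (4-40): -(n/2) ln|S(lam)| + sum_k (lam_k - 1) sum_j ln x_jk.
    np.cov divides by n - 1 rather than n; the difference is constant in
    lam, so the maximizing values are unchanged."""
    n = X.shape[0]
    Xl = np.column_stack([power_transform(X[:, k], lam)
                          for k, lam in enumerate(lams)])
    _, logdet = np.linalg.slogdet(np.cov(Xl, rowvar=False))
    extra = sum((lam - 1.0) * np.log(X[:, k]).sum() for k, lam in enumerate(lams))
    return -(n / 2.0) * logdet + extra

# Simulated positive bivariate data; the true normalizing powers are near 0
rng = np.random.default_rng(0)
X = np.exp(rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=42))
grid = np.round(np.arange(0.0, 0.51, 0.05), 2)
best = max(((l1, l2) for l1 in grid for l2 in grid),
           key=lambda pair: ell_joint(X, pair))
```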
Example 4.17 (Determining power transformations for bivariate data) Radiation measurements were also recorded through the open doors of the n = 42 microwave ovens introduced in Example 4.10. The amount of radiation emitted through the open doors of these ovens is listed in Table 4.5.
In accordance with the procedure outlined in Example 4.16, a power transformation for these data was selected by maximizing ℓ(λ) in (4-35). The approximate maximizing value was λ̂ = .30. Figure 4.14 on page 199 shows QQ plots of the untransformed and transformed door-open radiation data. (These data were actually
Table 4.5 Radiation Data (Door Open)

Oven no.  Radiation    Oven no.  Radiation    Oven no.  Radiation
 1          .30         16         .20         31         .10
 2          .09         17         .04         32         .10
 3          .30         18         .10         33         .10
 4          .10         19         .01         34         .30
 5          .10         20         .60         35         .12
 6          .12         21         .12         36         .25
 7          .09         22         .10         37         .20
 8          .10         23         .05         38         .40
 9          .09         24         .05         39         .33
10          .10         25         .15         40         .32
11          .07         26         .30         41         .12
12          .05         27         .15         42         .12
13          .01         28         .09
14          .45         29         .09
15          .12         30         .28
transformed by taking the fourth root, as in Example 4.16.) It is clear from the figure that the transformed data are more nearly normal, although the normal approximation is not as good as it was for the door-closed data.
Let us denote the door-closed data by x_11, x_21, ..., x_42,1 and the door-open data by x_12, x_22, ..., x_42,2. Choosing a power transformation for each set by maximizing the expression in (4-35) is equivalent to maximizing ℓ_k(λ) in (4-38) with k = 1, 2. Thus, using the outcomes from Example 4.16 and the foregoing results, we have λ̂1 = .30 and λ̂2 = .30. These powers were determined for the marginal distributions of x1 and x2.
We can consider the joint distribution of x1 and x2 and simultaneously determine the pair of powers (λ̂1, λ̂2) that makes this joint distribution approximately bivariate normal. To do this, we must maximize ℓ(λ1, λ2) in (4-40) with respect to both λ1 and λ2.
We computed ℓ(λ1, λ2) for a grid of (λ1, λ2) values covering 0 ≤ λ1 ≤ .50 and 0 ≤ λ2 ≤ .50, and we constructed the contour plot shown in Figure 4.15 on page 200. We see that the maximum occurs at about (λ̂1, λ̂2) = (.16, .16). The "best" power transformations for this bivariate case do not differ substantially from those obtained by considering each marginal distribution.
As we saw in Example 4.17, making each marginal distribution approximately normal is roughly equivalent to addressing the bivariate distribution directly and making it approximately normal. It is generally easier to select appropriate transformations for the marginal distributions than for the joint distributions.
Figure 4.14 QQ plots of (a) the original and (b) the transformed radiation data (with door open). (The integers in the plot indicate the number of points occupying the same location.)
[Figure 4.15 Contour plot of ℓ(λ1, λ2) for the radiation data; the maximum occurs near (λ̂1, λ̂2) = (.16, .16).]
Exercises
4.1. Consider a bivariate normal distribution with μ1 = 1, μ2 = 3, σ11 = 2, σ22 = 1, and ρ12 = −.8.
(a) Write out the bivariate normal density.
(b) Write out the squared statistical distance expression (x − μ)'Σ⁻¹(x − μ) as a quadratic function of x1 and x2.
4.2. Consider a bivariate normal population with μ1 = 0, μ2 = 2, σ11 = 2, σ22 = 1, and ρ12 = .5.
(a) Write out the bivariate normal density.
(b) Write out the squared generalized distance expression (x − μ)'Σ⁻¹(x − μ) as a function of x1 and x2.
(c) Determine (and sketch) the constant-density contour that contains 50% of the probability.
4.3. Let X be N3(μ, Σ) with μ' = [−3, 1, 4] and

Σ = [  1  −2   0
      −2   5   0
       0   0   2 ]

Which of the following random variables are independent? Explain.
(a) X1 and X2
(b) X2 and X3
(c) (X1, X2) and X3
(d) (X1 + X2)/2 and X3
(e) X2 and X2 − (5/2)X1 − X3
4.4. Let X be N3(μ, Σ) with μ' = [2, −3, 1] and

Σ = [ 1  1  1
      1  3  2
      1  2  2 ]

(a) Find the distribution of 3X1 − 2X2 + X3.
(b) Relabel the variables if necessary, and find a 2 × 1 vector a such that X2 and X2 − a'[X1, X3]' are independent.
4.5. Specify each of the following.
(a) The conditional distribution of X1, given that X2 = x2 for the joint distribution in Exercise 4.2.
(b) The conditional distribution of X2, given that X1 = x1 and X3 = x3 for the joint distribution in Exercise 4.3.
(c) The conditional distribution of X3, given that X1 = x1 and X2 = x2 for the joint distribution in Exercise 4.4.
4.6. Let X be distributed as N3(μ, Σ), where μ' = [1, −1, 2] and

Σ = [  4   0  −1
       0   5   0
      −1   0   2 ]

Which of the following random variables are independent? Explain.
(a) X1 and X2
(b) X1 and X3
(c) X2 and X3
(d) (X1, X3) and X2
(e) X1 and X1 + 3X2 − 2X3

4.7. Refer to Exercise 4.6 and specify each of the following.
(a) The conditional distribution of X1, given that X3 = x3.
(b) The conditional distribution of X1, given that X2 = x2 and X3 = x3.
4.8. (Example of a nonnormal bivariate distribution with normal marginals.) Let X1 be N(0, 1), and let

X2 = −X1   if −1 ≤ X1 ≤ 1
X2 = X1    otherwise

Show each of the following.
(a) X2 also has an N(0, 1) distribution.
(b) X1 and X2 do not have a bivariate normal distribution.
Hint:
(a) Since X1 is N(0, 1), P[−1 < X1 ≤ x] = P[−x ≤ X1 < 1] for any x. When −1 < x2 < 1, P[X2 ≤ x2] = P[X2 ≤ −1] + P[−1 < X2 ≤ x2] = P[X1 ≤ −1] + P[−1 < −X1 ≤ x2] = P[X1 ≤ −1] + P[−x2 ≤ X1 < 1]. But P[−x2 ≤ X1 < 1] = P[−1 < X1 ≤ x2] from the symmetry argument in the first line of this hint. Thus, P[X2 ≤ x2] = P[X1 ≤ −1] + P[−1 < X1 ≤ x2] = P[X1 ≤ x2], which is a standard normal probability.
(b) Consider the linear combination X1 − X2, which equals zero with probability P[|X1| > 1] = .3174.
4.9. Refer to Exercise 4.8, but modify the construction by replacing the break point 1 by c so that

X2 = −X1   if −c ≤ X1 ≤ c
X2 = X1    elsewhere

Show that c can be chosen so that Cov(X1, X2) = 0, but that the two random variables are not independent.
Hint:
For c = 0, evaluate Cov(X1, X2) = E[X1(X1)]. For c very large, evaluate Cov(X1, X2) ≈ E[X1(−X1)].
4.10. Show each of the following.

(a)  | A  0  |
     | 0' B  |  = |A| |B|

(b)  | A  C  |
     | 0' B  |  = |A| |B|   for |A| ≠ 0

Hint:
(a) Expanding the determinant | A 0 ; 0' B | by the first row (see Definition 2A.24) gives a11 times a determinant of the same form, with the order of A reduced by one. This procedure is repeated until 1 × |B| is obtained. Similarly, expanding the determinant | A 0 ; 0' I | by the last row gives | A 0 ; 0' I | = |A|.
(b) Write | A C ; 0' B | = | A 0 ; 0' B | · | I A⁻¹C ; 0' I |, and note that expanding | I A⁻¹C ; 0' I | by the last row repeatedly gives | I A⁻¹C ; 0' I | = 1.
4.11. Show that, if A is square,

|A| = |A22| |A11 − A12 A22⁻¹ A21|   for |A22| ≠ 0
|A| = |A11| |A22 − A21 A11⁻¹ A12|   for |A11| ≠ 0

Hint: Partition A and verify that

[ I  −A12A22⁻¹ ] [ A11 A12 ] [    I     0 ]   =   [ A11 − A12A22⁻¹A21   0  ]
[ 0'     I     ] [ A21 A22 ] [ −A22⁻¹A21 I ]       [        0'          A22 ]

Take determinants on both sides of this equality. Use Exercise 4.10 for the first and third determinants on the left and for the determinant on the right. The second equality for |A| follows by considering the analogous factorization with the roles of A11 and A22 interchanged.

4.12. Show that, for A symmetric,

A⁻¹ = [    I     0 ] [ (A11 − A12A22⁻¹A21)⁻¹   0   ] [ I  −A12A22⁻¹ ]
      [ −A22⁻¹A21 I ] [          0'           A22⁻¹ ] [ 0'     I     ]

Thus, (A11 − A12A22⁻¹A21)⁻¹ is the upper left-hand block of A⁻¹.
Hint: Premultiply the expression in the hint to Exercise 4.11 by [ I −A12A22⁻¹ ; 0' I ]⁻¹ and postmultiply by [ I 0 ; −A22⁻¹A21 I ]⁻¹. Take inverses of the resulting expression.
4.13. Show the following if X is Np(μ, Σ) with |Σ| ≠ 0.
(a) Check that |Σ| = |Σ22| |Σ11 − Σ12Σ22⁻¹Σ21|. (Note that |Σ| can be factored into the product of contributions from the marginal and conditional distributions.)
(b) Check that

(x − μ)'Σ⁻¹(x − μ) = [x1 − μ1 − Σ12Σ22⁻¹(x2 − μ2)]' (Σ11 − Σ12Σ22⁻¹Σ21)⁻¹ [x1 − μ1 − Σ12Σ22⁻¹(x2 − μ2)] + (x2 − μ2)'Σ22⁻¹(x2 − μ2)

(Thus, the joint density exponent can be written as the sum of two terms corresponding to contributions from the conditional and marginal distributions.)
Hint:
(a) Apply Exercise 4.11 with A11 = Σ11, A12 = Σ12, A21 = Σ21, and A22 = Σ22.
(b) Write (x − μ)'Σ⁻¹(x − μ) using the factorization of Σ⁻¹ given by Exercise 4.12:

Σ⁻¹ = [    I     0 ] [ (Σ11 − Σ12Σ22⁻¹Σ21)⁻¹   0   ] [ I  −Σ12Σ22⁻¹ ]
      [ −Σ22⁻¹Σ21 I ] [          0'           Σ22⁻¹ ] [ 0'     I     ]

If we group the product so that

[ I  −Σ12Σ22⁻¹ ] [ x1 − μ1 ]   =   [ x1 − μ1 − Σ12Σ22⁻¹(x2 − μ2) ]
[ 0'     I     ] [ x2 − μ2 ]       [          x2 − μ2            ]

the result follows. Here X1 is q × 1 and X2 is (p − q) × 1.
4.15. Show that Σ_{j=1}^n (x_j − x̄)(x̄ − μ)' = 0 (a p × p matrix of zeros). Here x_j, j = 1, 2, ..., n, and x̄ are p × 1 vectors, and μ is the population mean vector.
4.16. Let X1, X2, X3, and X4 be independent Np(μ, Σ) random vectors.
(a) Find the marginal distributions for each of the random vectors

V1 = (1/4)X1 − (1/4)X2 + (1/4)X3 − (1/4)X4

and

V2 = (1/4)X1 + (1/4)X2 − (1/4)X3 − (1/4)X4

(b) Find the joint density of the random vectors V1 and V2 defined in (a).
4.17. Let X1, X2, X3, X4, and X5 be independent and identically distributed random vectors with mean vector μ and covariance matrix Σ. Find the mean vector and covariance matrices for each of the two linear combinations of random vectors

(1/5)X1 + (1/5)X2 + (1/5)X3 + (1/5)X4 + (1/5)X5

and

X1 − X2 + X3 − X4 + X5

in terms of μ and Σ. Also, obtain the covariance between the two linear combinations of random vectors.
4.18. Find the maximum likelihood estimates of the 2 × 1 mean vector μ and the 2 × 2 covariance matrix Σ based on the random sample
from a bivariate normal population.
4.19. Let X1, X2, ..., X20 be a random sample of size n = 20 from an N6(μ, Σ) population. Specify each of the following completely.
(a) The distribution of (X1 − μ)'Σ⁻¹(X1 − μ)
(b) The distributions of X̄ and √n(X̄ − μ)
(c) The distribution of (n − 1)S
4.20. For the random variables X1, X2, ..., X20 in Exercise 4.19, specify the distribution of B(19S)B' in each case.
Chapter 4 The Multivariate Normal Distribution

(b) Carry out a test of normality based on the correlation coefficient rQ. [See (4-31).] Set the significance level at α = .10. Do the results of these tests corroborate the results in Part a?
4.25. Refer to the data for the world's 10 largest companies in Exercise 1.4. Construct a chisquare plot using all three variables. The chisquare quantiles are 0.3518 0.7978 1.2125 1.6416 2.1095 2.6430 3.2831 4.1083 5.3170 7.8147
4.26. Exercise 1.2 gives the age x1, measured in years, as well as the selling price x2, measured in thousands of dollars, for n = 10 used cars. These data are reproduced as follows:
x1    1      2      3      3      4      5      6     8     9     11
x2  18.95  19.00  17.95  15.54  14.00  12.95  8.94  7.49  6.00  3.99
(a) Use the results of Exercise 1.2 to calculate the squared statistical distances (x_j − x̄)'S⁻¹(x_j − x̄), j = 1, 2, ..., 10, where x_j' = [x_j1, x_j2].
(b) Using the distances in Part a, determine the proportion of the observations falling within the estimated 50% probability contour of a bivariate normal distribution.
(c) Order the distances in Part a and construct a chi-square plot.
(d) Given the results in Parts b and c, are these data approximately bivariate normal? Explain.
4.27. Consider the radiation data (with door closed) in Example 4.10. Construct a QQ plot
for the natural logarithms of these data. [Note that the natural logarithm transformation corresponds to the value λ = 0 in (4-34).] Do the natural logarithms appear to be normally distributed? Compare your results with Figure 4.13. Does the choice λ = 1/4 or λ = 0 make much difference in this case?
The following exercises may require a computer.
4.28. Consider the air-pollution data given in Table 1.5. Construct a QQ plot for the solar radiation measurements and carry out a test for normality based on the correlation coefficient rQ [see (4-31)]. Let α = .05 and use the entry corresponding to n = 40 in Table 4.2.
4.29. Given the air-pollution data in Table 1.5, examine the pairs X5 = NO2 and X6 = O3 for bivariate normality.
(a) Calculate statistical distances (x_j − x̄)'S⁻¹(x_j − x̄), j = 1, 2, ..., 42, where x_j' = [x_j5, x_j6].
(b) Determine the proportion of observations x_j' = [x_j5, x_j6], j = 1, 2, ..., 42, falling within the approximate 50% probability contour of a bivariate normal distribution.
(c) Construct a chi-square plot of the ordered distances in Part a.
4.30. Consider the used-car data in Exercise 4.26.
(a) Determine the power transformation λ̂1 that makes the x1 values approximately normal. Construct a QQ plot for the transformed data.
(b) Determine the power transformation λ̂2 that makes the x2 values approximately normal. Construct a QQ plot for the transformed data.
(c) Determine the power transformations λ̂' = [λ̂1, λ̂2] that make the [x1, x2] values jointly normal using (4-40). Compare the results with those obtained in Parts a and b.
4.31. Examine the marginal normality of the observations on variables X1, X2, ..., X5 for the multiple-sclerosis data in Table 1.6. Treat the non-multiple-sclerosis and multiple-sclerosis groups separately. Use whatever methodology, including transformations, you feel is appropriate.
4.32. Examine the marginal normality of the observations on variables X1, X2, ..., X6 for the radiotherapy data in Table 1.7. Use whatever methodology, including transformations, you feel is appropriate.
4.33. Examine the marginal and bivariate normality of the observations on variables X1, X2, X3, and X4 for the data in Table 4.3.
4.34. Examine the data on bone mineral content in Table 1.8 for marginal and bivariate normality.
4.35. Examine the data on paper-quality measurements in Table 1.2 for marginal and multivariate normality.
4.36. Examine the data on women's national track records in Table 1.9 for marginal and multivariate normality.
4.37. Refer to Exercise 1.18. Convert the women's track records in Table 1.9 to speeds measured in meters per second. Examine the data on speeds for marginal and multivariate normality.
4.38. Examine the data on bulls in Table 1.10 for marginal and multivariate normality. Consider only the variables YrHgt, FtFrBody, PrctFFB, BkFat, SaleHt, and SaleWt.
4.39. The data in Table 4.6 (see the psychological profile data: www.prenhall.com/statistics) consist of 130 observations generated by scores on a psychological test administered to Peruvian teenagers (ages 15, 16, and 17). For each of these teenagers the gender (male = 1, female = 2) and socioeconomic status (low = 1, medium = 2) were also recorded. The scores were accumulated into five subscale scores labeled independence (indep), support (supp), benevolence (benev), conformity (conform), and leadership (leader).
[Table 4.6 Psychological Profile Data: subscale scores (indep, supp, benev, conform, leader) with gender and socioeconomic status; only scattered entries of the printed excerpt are legible in this scan. The complete data set is available at www.prenhall.com/statistics.]
(a) Examine each of the variables independence, support, benevolence, conformity and leadership for marginal normality. (b) Using all five variables, check for multivariate normality. (c) Refer to part (a). For those variables that are nonnormal, determine the transformation that makes them more nearly normal.
4.41. Consider the data on snow removal in Exercise 3.20.
(a) Comment on any possible outliers in a scatter plot of the original variables.
(b) Determine the power transformation λ̂1 that makes the x1 values approximately normal. Construct a QQ plot of the transformed observations.
(c) Determine the power transformation λ̂2 that makes the x2 values approximately normal. Construct a QQ plot of the transformed observations.
(d) Determine the power transformation for approximate bivariate normality using (4-40).
References
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.
2. Andrews, D. F., R. Gnanadesikan, and J. L. Warner. "Transformations of Multivariate Data." Biometrics, 27, no. 4 (1971), 825-840.
3. Box, G. E. P., and D. R. Cox. "An Analysis of Transformations" (with discussion). Journal of the Royal Statistical Society (B), 26, no. 2 (1964), 211-252.
4. Daniel, C., and F. S. Wood. Fitting Equations to Data: Computer Analysis of Multifactor Data. New York: John Wiley, 1980.
5. Filliben, J. J. "The Probability Plot Correlation Coefficient Test for Normality." Technometrics, 17, no. 1 (1975), 111-117.
6. Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations (2nd ed.). New York: Wiley-Interscience, 1977.
7. Hawkins, D. M. Identification of Outliers. London, UK: Chapman and Hall, 1980.
8. Hernandez, F., and R. A. Johnson. "The Large-Sample Behavior of Transformations to Normality." Journal of the American Statistical Association, 75, no. 372 (1980), 855-861.
9. Hogg, R. V., A. T. Craig, and J. W. McKean. Introduction to Mathematical Statistics (6th ed.). Upper Saddle River, N.J.: Prentice Hall, 2004.
10. Looney, S. W., and T. R. Gulledge, Jr. "Use of the Correlation Coefficient with Normal Probability Plots." The American Statistician, 39, no. 1 (1985), 75-79.
11. Mardia, K. V., J. T. Kent, and J. M. Bibby. Multivariate Analysis (Paperback). London: Academic Press, 2003.
12. Shapiro, S. S., and M. B. Wilk. "An Analysis of Variance Test for Normality (Complete Samples)." Biometrika, 52, no. 4 (1965), 591-611.
13. Verrill, S., and R. A. Johnson. "Tables and LargeSample Distribution Theory for CensoredData Correlation Statistics for Testing Normality." Journal of the American Statistical Association, 83, no. 404 (1988), 11921197.
14. Yeo, I., and R. A. Johnson. "A New Family of Power Transformations to Improve Normality or Symmetry." Biometrika, 87, no. 4 (2000), 954-959.
15. Zehna, P. "Invariance of Maximum Likelihood Estimators." Annals of Mathematical Statistics, 37, no. 3 (1966), 744.
Chapter 5 Inferences about a Mean Vector
Let us start by recalling the univariate theory for determining whether a specific value is a plausible value for the population mean μ. From the point of view of hypothesis testing, this problem can be formulated as a test of the competing hypotheses
H0: μ = μ0    and    H1: μ ≠ μ0
Here H 0 is the null hypothesis and H 1 is the (twosided) alternative hypothesis. If X 1 , X 2 , ... , Xn denote a random sample from a normal population, the appropriate test statistic is
t = (X̄ − μ0) / (s/√n)

where X̄ = (1/n) Σ_{j=1}^n X_j and s² = (1/(n − 1)) Σ_{j=1}^n (X_j − X̄)².
This test statistic has a Student's t-distribution with n − 1 degrees of freedom (d.f.). We reject H0, that μ0 is a plausible value of μ, if the observed |t| exceeds a specified percentage point of a t-distribution with n − 1 d.f. Rejecting H0 when |t| is large is equivalent to rejecting H0 if its square,
t² = (X̄ − μ0)² / (s²/n) = n(X̄ − μ0)(s²)⁻¹(X̄ − μ0)    (5-1)
is large. The variable t² in (5-1) is the square of the distance from the sample mean X̄ to the test value μ0. The units of distance are expressed in terms of s/√n, or estimated standard deviations of X̄. Once X̄ and s² are observed, the test becomes: Reject H0 in favor of H1, at significance level α, if
n(X̄ − μ0)(s²)⁻¹(X̄ − μ0) > t²_{n−1}(α/2)    (5-2)
where t_{n−1}(α/2) denotes the upper 100(α/2)th percentile of the t-distribution with n − 1 d.f. If H0 is not rejected, we conclude that μ0 is a plausible value for the normal population mean. Are there other values of μ which are also consistent with the data? The answer is yes! In fact, there is always a set of plausible values for a normal population mean. From the well-known correspondence between acceptance regions for tests of H0: μ = μ0 versus H1: μ ≠ μ0 and confidence intervals for μ, we have
{Do not reject H0: μ = μ0 at level α} is equivalent to
{μ0 lies in the 100(1 − α)% confidence interval x̄ ± t_{n−1}(α/2) s/√n}

or

x̄ − t_{n−1}(α/2) s/√n ≤ μ0 ≤ x̄ + t_{n−1}(α/2) s/√n    (5-3)
The confidence interval consists of all those values μ0 that would not be rejected by the level α test of H0: μ = μ0.
Before the sample is selected, the 100(1 − α)% confidence interval in (5-3) is a random interval because the endpoints depend upon the random variables X̄ and s. The probability that the interval contains μ is 1 − α; among large numbers of such independent intervals, approximately 100(1 − α)% of them will contain μ.
Consider now the problem of determining whether a given p × 1 vector μ0 is a plausible value for the mean of a multivariate normal distribution. We shall proceed by analogy to the univariate development just presented. A natural generalization of the squared distance in (5-1) is its multivariate analog
T² = (X̄ − μ0)' ((1/n)S)⁻¹ (X̄ − μ0) = n(X̄ − μ0)' S⁻¹ (X̄ − μ0)    (5-4)

where

X̄ = (1/n) Σ_{j=1}^n X_j,   S = (1/(n − 1)) Σ_{j=1}^n (X_j − X̄)(X_j − X̄)',   and   μ0 = [μ10, μ20, ..., μp0]'

The statistic T² is called Hotelling's T² in honor of Harold Hotelling, a pioneer in multivariate analysis, who first obtained its sampling distribution. Here (1/n)S is the estimated covariance matrix of X̄. (See Result 3.1.)
If the observed statistical distance T² is too large, that is, if x̄ is "too far" from μ0, the hypothesis H0: μ = μ0 is rejected. It turns out that special tables of T² percentage points are not required for formal tests of hypotheses. This is true because

T² is distributed as ((n − 1)p/(n − p)) F_{p,n−p}    (5-5)
where F_{p,n−p} denotes a random variable with an F-distribution with p and n − p d.f. To summarize, we have the following:
Let X1, X2, ..., Xn be a random sample from an Np(μ, Σ) population. Then with X̄ = (1/n) Σ_{j=1}^n X_j and S = (1/(n − 1)) Σ_{j=1}^n (X_j − X̄)(X_j − X̄)',

α = P[ T² > ((n − 1)p/(n − p)) F_{p,n−p}(α) ]    (5-6)
whatever the true μ and Σ. Here F_{p,n−p}(α) is the upper (100α)th percentile of the F_{p,n−p} distribution.
Statement (5-6) leads immediately to a test of the hypothesis H0: μ = μ0 versus H1: μ ≠ μ0. At the α level of significance, we reject H0 in favor of H1 if the observed

T² = n(x̄ − μ0)' S⁻¹ (x̄ − μ0) > ((n − 1)p/(n − p)) F_{p,n−p}(α)    (5-7)
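Carrying out the test in (5-7) is mechanical once x̄, S, and μ0 are in hand. A minimal sketch (Python with NumPy and SciPy; the sample here is simulated, not one of the data sets in the text):

```python
import numpy as np
from scipy import stats

def hotelling_t2_test(X, mu0, alpha=0.05):
    """T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0) and the critical value
    (n - 1)p/(n - p) F_{p,n-p}(alpha) from (5-7)."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    t2 = n * d @ np.linalg.solve(S, d)
    critical = (n - 1) * p / (n - p) * stats.f.ppf(1.0 - alpha, p, n - p)
    return t2, critical, t2 > critical

# Simulated sample of size n = 20 on p = 3 variables
rng = np.random.default_rng(7)
X = rng.multivariate_normal([4.0, 50.0, 10.0], np.eye(3), size=20)
t2, critical, reject = hotelling_t2_test(X, np.array([4.0, 50.0, 10.0]), alpha=0.10)
```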
It is informative to discuss the nature of the T²-distribution briefly and its correspondence with the univariate test statistic. In Section 4.4, we described the manner in which the Wishart distribution generalizes the chi-square distribution. We can write
T² = √n(X̄ − μ0)' ( (1/(n − 1)) Σ_{j=1}^n (X_j − X̄)(X_j − X̄)' )⁻¹ √n(X̄ − μ0)
which combines a normal, Np(0, Σ), random vector and a Wishart, W_{p,n−1}(Σ), random matrix in the form

T²_{p,n−1} = (multivariate normal)' ( Wishart random matrix / d.f. )⁻¹ (multivariate normal)
           = Np(0, Σ)' [ (1/(n − 1)) W_{p,n−1}(Σ) ]⁻¹ Np(0, Σ)    (5-8)

This is analogous to

t²_{n−1} = (normal random variable) ( scaled chi-square random variable / d.f. )⁻¹ (normal random variable)
for the univariate case. Since the multivariate normal and Wishart random variables are independently distributed [see (4-23)], their joint density function is the product of the marginal normal and Wishart distributions. Using calculus, the distribution (5-5) of T² as given previously can be derived from this joint distribution and the representation (5-8).
It is rare, in multivariate situations, to be content with a test of H0: μ = μ0, where all of the mean vector components are specified under the null hypothesis. Ordinarily, it is preferable to find regions of μ values that are plausible in light of the observed data. We shall return to this issue in Section 5.4.
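The distributional identity in (5-5) and (5-8) is easy to check by simulation. The sketch below (Python with NumPy and SciPy; the dimensions are arbitrary choices) draws repeated samples from Np(0, I), computes T² for each, and compares an upper quantile of the simulated values with the scaled F quantile:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p, reps = 20, 3, 10_000
mu0 = np.zeros(p)

t2 = np.empty(reps)
for i in range(reps):
    X = rng.standard_normal((n, p))     # one sample from N_p(0, I)
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    t2[i] = n * d @ np.linalg.solve(S, d)

# (5-5): T^2 is distributed as (n - 1)p/(n - p) F_{p,n-p}
scale = (n - 1) * p / (n - p)
q_sim = np.quantile(t2, 0.95)
q_theory = scale * stats.f.ppf(0.95, p, n - p)  # the two should nearly agree
```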
Example 5.1 (Evaluating T²) Let the data matrix for a random sample of size n = 3 from a bivariate normal population be

X = [  6  9
      10  6
       8  3 ]

Evaluate the observed T² for μ0' = [9, 5]. What is the sampling distribution of T² in this case?
We find

x̄ = [ (6 + 10 + 8)/3 , (9 + 6 + 3)/3 ]' = [8, 6]'

and

s11 = ((6 − 8)² + (10 − 8)² + (8 − 8)²)/2 = 4
s12 = ((6 − 8)(9 − 6) + (10 − 8)(6 − 6) + (8 − 8)(3 − 6))/2 = −3
s22 = ((9 − 6)² + (6 − 6)² + (3 − 6)²)/2 = 9

Thus,

S = [  4 −3 ]        S⁻¹ = 1/((4)(9) − (−3)(−3)) [ 9 3 ]  =  (1/27) [ 9 3 ]
    [ −3  9 ] ,                                  [ 3 4 ]            [ 3 4 ]

and, from (5-4),

T² = 3 [8 − 9, 6 − 5] (1/27) [ 9 3 ; 3 4 ] [ 8 − 9 , 6 − 5 ]' = 3 [−1, 1] (1/27) [−6, 1]' = 7/9

Before the sample is selected, T² has the distribution of a ((3 − 1)·2/(3 − 2)) F_{2,3−2} = 4 F_{2,1} random variable.
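The hand calculation in Example 5.1 can be verified numerically; a sketch in Python with NumPy:

```python
import numpy as np

X = np.array([[6.0, 9.0],
              [10.0, 6.0],
              [8.0, 3.0]])
mu0 = np.array([9.0, 5.0])

n, p = X.shape
xbar = X.mean(axis=0)               # [8, 6]
S = np.cov(X, rowvar=False)         # [[4, -3], [-3, 9]]
d = xbar - mu0                      # [-1, 1]
t2 = n * d @ np.linalg.solve(S, d)  # 7/9, matching the hand calculation
```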
The next example illustrates a test of the hypothesis H0: μ = μ0 using data collected as part of a search for new diagnostic techniques at the University of Wisconsin Medical School.
Example 5.2 (Testing a multivariate mean vector with T²) Perspiration from 20 healthy females was analyzed. Three components, X1 = sweat rate, X2 = sodium content, and X3 = potassium content, were measured, and the results, which we call the sweat data, are presented in Table 5.1. Test the hypothesis H0: μ' = [4, 50, 10] against H1: μ' ≠ [4, 50, 10] at level of significance α = .10. Computer calculations provide

x̄ = [4.640; 45.400; 9.965]

S = |  2.879  10.010  -1.810 |
    | 10.010 199.788  -5.640 |
    | -1.810  -5.640   3.628 |

and

S⁻¹ = |  .586  -.022   .258 |
      | -.022   .006  -.002 |
      |  .258  -.002   .402 |

We evaluate

T² = 20 [4.640 - 4, 45.400 - 50, 9.965 - 10] S⁻¹ [4.640 - 4; 45.400 - 50; 9.965 - 10]
   = 20 [.640, -4.600, -.035] S⁻¹ [.640; -4.600; -.035] = 9.74
Table 5.1 Sweat Data

Individual   X1 (Sweat rate)   X2 (Sodium)   X3 (Potassium)
    1             3.7              48.5            9.3
    2             5.7              65.1            8.0
    3             3.8              47.2           10.9
    4             3.2              53.2           12.0
    5             3.1              55.5            9.7
    6             4.6              36.1            7.9
    7             2.4              24.8           14.0
    8             7.2              33.1            7.6
    9             6.7              47.4            8.5
   10             5.4              54.1           11.3
   11             3.9              36.9           12.7
   12             4.5              58.8           12.3
   13             3.5              27.8            9.8
   14             4.5              40.2            8.4
   15             1.5              13.5           10.1
   16             8.5              56.4            7.1
   17             4.5              71.6            8.2
   18             6.5              52.8           10.9
   19             4.1              44.1           11.2
   20             5.5              40.9            9.4
Comparing the observed T² = 9.74 with the critical value

[(n - 1)p/(n - p)] F_{p,n-p}(.10) = [19(3)/17] F_{3,17}(.10) = 3.353(2.44) = 8.18

we see that T² = 9.74 > 8.18, and consequently, we reject H0 at the 10% level of significance. We note that H0 will be rejected if one or more of the component means, or some combination of means, differs too much from the hypothesized values [4, 50, 10]. At this point, we have no idea which of these hypothesized values may not be supported by the data.

We have assumed that the sweat data are multivariate normal. The Q-Q plots constructed from the marginal distributions of X1, X2, and X3 all approximate straight lines. Moreover, scatter plots for pairs of observations have approximate elliptical shapes, and we conclude that the normality assumption was reasonable in this case. (See Exercise 5.4.)

One feature of the T²-statistic is that it is invariant (unchanged) under changes in the units of measurement for X of the form
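The sweat-data test can be reproduced from Table 5.1. The sketch below is our own illustration (it assumes NumPy and SciPy are available; the variable names are ours).

```python
import numpy as np
from scipy import stats

# Sweat data of Table 5.1: columns are sweat rate, sodium, potassium
X = np.array([
    [3.7, 48.5,  9.3], [5.7, 65.1,  8.0], [3.8, 47.2, 10.9], [3.2, 53.2, 12.0],
    [3.1, 55.5,  9.7], [4.6, 36.1,  7.9], [2.4, 24.8, 14.0], [7.2, 33.1,  7.6],
    [6.7, 47.4,  8.5], [5.4, 54.1, 11.3], [3.9, 36.9, 12.7], [4.5, 58.8, 12.3],
    [3.5, 27.8,  9.8], [4.5, 40.2,  8.4], [1.5, 13.5, 10.1], [8.5, 56.4,  7.1],
    [4.5, 71.6,  8.2], [6.5, 52.8, 10.9], [4.1, 44.1, 11.2], [5.5, 40.9,  9.4],
])
n, p = X.shape
mu0 = np.array([4.0, 50.0, 10.0])

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)

# Critical value (n-1)p/(n-p) * F_{p,n-p}(.10) from (5-7)
crit = (n - 1) * p / (n - p) * stats.f.ppf(0.90, p, n - p)
# T2 ≈ 9.74 exceeds the critical value ≈ 8.18, so H0 is rejected at the 10% level
print(round(T2, 2), round(crit, 2), T2 > crit)
```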
Y = C X + d,  where C is a nonsingular p × p matrix and d is a p × 1 vector    (5-9)
(p×1)  (p×p)(p×1)  (p×1)
A transformation of the observations of this kind arises when a constant b_i is subtracted from the ith variable to form X_i - b_i and the result is multiplied by a constant a_i > 0 to get a_i(X_i - b_i). Premultiplication of the centered and scaled quantities a_i(X_i - b_i) by any nonsingular matrix will yield Equation (5-9). As an example, the operations involved in changing X_i to a_i(X_i - b_i) correspond exactly to the process of converting temperature from a Fahrenheit to a Celsius reading.

Given observations x1, x2, ..., xn and the transformation in (5-9), it immediately follows from Result 3.6 that

ȳ = Cx̄ + d  and  S_y = (1/(n-1)) Σ_{j=1}^n (y_j - ȳ)(y_j - ȳ)' = CSC'

Moreover, by (2-24) and (2-45),

μ_Y = Cμ + d
Therefore, T² computed with the y's and a hypothesized value μ_Y,0 = Cμ0 + d is

T² = n(ȳ - μ_Y,0)' S_y⁻¹ (ȳ - μ_Y,0)
   = n(C(x̄ - μ0))'(CSC')⁻¹(C(x̄ - μ0))
   = n(x̄ - μ0)'C'(C')⁻¹S⁻¹C⁻¹C(x̄ - μ0) = n(x̄ - μ0)'S⁻¹(x̄ - μ0)

The statistic T² is also connected with the likelihood ratio test of H0: μ = μ0. The maximum of the multivariate normal likelihood, over all possible μ and Σ, is

max_{μ,Σ} L(μ, Σ) = (2π)^{-np/2} |Σ̂|^{-n/2} e^{-np/2}    (5-10)
where

Σ̂ = (1/n) Σ_{j=1}^n (x_j - x̄)(x_j - x̄)'  and  μ̂ = x̄ = (1/n) Σ_{j=1}^n x_j

are the maximum likelihood estimates. Recall that μ̂ and Σ̂ are those choices for μ and Σ that best explain the observed values of the random sample.
Under the hypothesis H0: μ = μ0, the mean μ0 is now fixed, but Σ can be varied to find the value that is "most likely" to have led, with μ0 fixed, to the observed sample. This value is obtained by maximizing L(μ0, Σ) with respect to Σ.

Following the steps in (4-13), the exponent in L(μ0, Σ) may be written as

-(1/2) Σ_{j=1}^n (x_j - μ0)'Σ⁻¹(x_j - μ0) = -(1/2) tr[ Σ⁻¹ ( Σ_{j=1}^n (x_j - μ0)(x_j - μ0)' ) ]

Applying Result 4.10 with B = Σ_{j=1}^n (x_j - μ0)(x_j - μ0)', we have

max_Σ L(μ0, Σ) = (2π)^{-np/2} |Σ̂0|^{-n/2} e^{-np/2}    (5-11)

with

Σ̂0 = (1/n) Σ_{j=1}^n (x_j - μ0)(x_j - μ0)'
To determine whether μ0 is a plausible value of μ, the maximum of L(μ0, Σ) is compared with the unrestricted maximum of L(μ, Σ). The resulting ratio is called the likelihood ratio statistic. Using Equations (5-10) and (5-11), we get

Likelihood ratio = Λ = max_Σ L(μ0, Σ) / max_{μ,Σ} L(μ, Σ) = ( |Σ̂| / |Σ̂0| )^{n/2}    (5-12)

The equivalent statistic Λ^{2/n} = |Σ̂|/|Σ̂0| is called Wilks' lambda. If the observed value of this likelihood ratio is too small, the hypothesis H0: μ = μ0 is unlikely to be true and is, therefore, rejected. Specifically, the likelihood ratio test of H0: μ = μ0 against H1: μ ≠ μ0 rejects H0 if

Λ = ( |Σ̂| / |Σ̂0| )^{n/2} = ( | Σ_{j=1}^n (x_j - x̄)(x_j - x̄)' | / | Σ_{j=1}^n (x_j - μ0)(x_j - μ0)' | )^{n/2} < c_α    (5-13)

where c_α is the lower (100α)th percentile of the distribution of Λ. (Note that the likelihood ratio test statistic is a power of the ratio of generalized variances.) Fortunately, because of the following relation between T² and Λ, we do not need the distribution of the latter to carry out the test.
Result 5.1. Let X1, X2, ..., Xn be a random sample from an Np(μ, Σ) population. Then the test in (5-7) based on T² is equivalent to the likelihood ratio test of H0: μ = μ0 versus H1: μ ≠ μ0 because

Λ^{2/n} = ( 1 + T²/(n - 1) )⁻¹

Proof. Let the (p + 1) × (p + 1) matrix

A = | Σ_{j=1}^n (x_j - x̄)(x_j - x̄)'   √n(x̄ - μ0) |  =  | A11  A12 |
    | √n(x̄ - μ0)'                       -1         |     | A21  A22 |

Using the formula for the determinant of a partitioned matrix, |A| = |A22| |A11 - A12 A22⁻¹ A21| = |A11| |A22 - A21 A11⁻¹ A12|, we obtain

(-1) | Σ_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ0)(x̄ - μ0)' |
  = | Σ_{j=1}^n (x_j - x̄)(x_j - x̄)' | ( -1 - n(x̄ - μ0)' ( Σ_{j=1}^n (x_j - x̄)(x_j - x̄)' )⁻¹ (x̄ - μ0) )

Since, by (4-14),

Σ_{j=1}^n (x_j - μ0)(x_j - μ0)' = Σ_{j=1}^n (x_j - x̄)(x_j - x̄)' + n(x̄ - μ0)(x̄ - μ0)'

the foregoing equality involving determinants can be written

(-1) | Σ_{j=1}^n (x_j - μ0)(x_j - μ0)' | = | Σ_{j=1}^n (x_j - x̄)(x_j - x̄)' | (-1)( 1 + T²/(n - 1) )

or

|nΣ̂0| = |nΣ̂| ( 1 + T²/(n - 1) )

Thus,

Λ^{2/n} = |Σ̂| / |Σ̂0| = ( 1 + T²/(n - 1) )⁻¹    (5-14)

Here H0 is rejected for small values of Λ^{2/n} or, equivalently, large values of T². The critical values of T² are determined by (5-6).
Incidentally, relation (5-14) shows that T² may be calculated from two determinants, thus avoiding the computation of S⁻¹. Solving (5-14) for T², we have

T² = (n - 1) |Σ̂0| / |Σ̂| - (n - 1)
   = (n - 1) | Σ_{j=1}^n (x_j - μ0)(x_j - μ0)' | / | Σ_{j=1}^n (x_j - x̄)(x_j - x̄)' | - (n - 1)    (5-15)
Likelihood ratio tests are common in multivariate analysis. Their optimal large-sample properties hold in very general contexts, as we shall indicate shortly. They are well suited for the testing situations considered in this book. Likelihood ratio methods yield test statistics that reduce to the familiar F- and t-statistics in univariate situations.
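Relation (5-15) is easy to verify numerically. The sketch below is our own illustration (assuming NumPy); it computes T² for the Example 5.1 data both from S⁻¹ and from the two determinants.

```python
import numpy as np

# Data and hypothesized mean from Example 5.1
X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0]])
mu0 = np.array([9.0, 5.0])
n = X.shape[0]
xbar = X.mean(axis=0)

# Direct route (5-4): T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)
S = np.cov(X, rowvar=False)
d = xbar - mu0
T2_direct = n * d @ np.linalg.solve(S, d)

# Determinant route (5-15): no inverse of S is needed
A0 = (X - mu0).T @ (X - mu0)     # sum of (x_j - mu0)(x_j - mu0)'
A = (X - xbar).T @ (X - xbar)    # sum of (x_j - xbar)(x_j - xbar)'
T2_det = (n - 1) * np.linalg.det(A0) / np.linalg.det(A) - (n - 1)

print(T2_direct, T2_det)   # both equal 7/9
```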
The preceding test is a special case of a general likelihood ratio method. Let θ be a vector of unknown population parameters taking values in a parameter set Θ, and let the null hypothesis restrict θ to a subset Θ0 of Θ. The likelihood ratio test rejects H0: θ ∈ Θ0 in favor of H1: θ ∉ Θ0 if

Λ = max_{θ∈Θ0} L(θ) / max_{θ∈Θ} L(θ) < c    (5-16)

where c is a suitably chosen constant. Intuitively, we reject H0 if the maximum of the likelihood obtained by allowing θ to vary over the set Θ0 is much smaller than the maximum of the likelihood obtained by varying θ over all values in Θ. When the maximum in the numerator of expression (5-16) is much smaller than the maximum in the denominator, Θ0 does not contain plausible values for θ.

In each application of the likelihood ratio method, we must obtain the sampling distribution of the likelihood-ratio test statistic Λ. Then c can be selected to produce a test with a specified significance level α. However, when the sample size is large and certain regularity conditions are satisfied, the sampling distribution of -2 ln Λ is well approximated by a chi-square distribution. This attractive feature accounts, in part, for the popularity of likelihood ratio procedures.
Result 5.2. When the sample size n is large, under the null hypothesis H0,

-2 ln Λ = -2 ln [ max_{θ∈Θ0} L(θ) / max_{θ∈Θ} L(θ) ]

is, approximately, a χ²_{ν-ν0} random variable. Here the degrees of freedom are ν - ν0 = (dimension of Θ) - (dimension of Θ0).
Statistical tests are compared on the basis of their power, which is defined as the curve or surface whose height is P[test rejects H0 | θ], evaluated at each parameter vector θ. Power measures the ability of a test to reject H0 when it is not true. In the rare situation where θ = θ0 is completely specified under H0 and the alternative H1 consists of the single specified value θ = θ1, the likelihood ratio test has the highest power among all tests with the same significance level α = P[test rejects H0 | θ = θ0]. In many single-parameter cases (θ has one component), the likelihood ratio test is uniformly most powerful against all alternatives to one side of H0: θ = θ0. In other cases, this property holds approximately for large samples. We shall not give the technical details required for discussing the optimal properties of likelihood ratio tests in the multivariate situation. The general import of these properties, for our purposes, is that they have the highest possible (average) power when the sample size is large.
Let θ be a vector of unknown population parameters and Θ the set of all possible values of θ. A confidence region is a region of likely θ values, determined by the data; we denote it by R(X), where X = [X1, X2, ..., Xn]' is the data matrix. The region R(X) is said to be a 100(1 - α)% confidence region if, before the sample is selected,

P[R(X) will cover the true θ] = 1 - α    (5-17)

This probability is calculated under the true, but unknown, value of θ.

The confidence region for the mean μ of a p-dimensional normal population is available from (5-6). Before the sample is selected,

P[ n(X̄ - μ)'S⁻¹(X̄ - μ) ≤ (n - 1)p/(n - p) F_{p,n-p}(α) ] = 1 - α

whatever the values of μ and Σ. In words, X̄ will be within [(n - 1)p F_{p,n-p}(α)/(n - p)]^{1/2} of μ, with probability 1 - α, provided that distance is defined in terms of nS⁻¹. For a particular sample, x̄ and S can be computed, and the inequality
n(x̄ - μ)'S⁻¹(x̄ - μ) ≤ (n - 1)p F_{p,n-p}(α)/(n - p)

will define a region R(x) within the space of all possible parameter values. In this case, the region will be an ellipsoid centered at x̄. This ellipsoid is the 100(1 - α)% confidence region for μ.
A 100(1 - α)% confidence region for the mean of a p-dimensional normal distribution is the ellipsoid determined by all μ such that

n(x̄ - μ)'S⁻¹(x̄ - μ) ≤ p(n - 1)/(n - p) F_{p,n-p}(α)    (5-18)

where x̄ = (1/n) Σ_{j=1}^n x_j, S = (1/(n - 1)) Σ_{j=1}^n (x_j - x̄)(x_j - x̄)', and x1, x2, ..., xn are the sample observations.
To determine whether any μ0 lies within the confidence region (is a plausible value for μ), we need to compute the generalized squared distance n(x̄ - μ0)'S⁻¹(x̄ - μ0) and compare it with [p(n - 1)/(n - p)]F_{p,n-p}(α). If the squared distance is larger than [p(n - 1)/(n - p)]F_{p,n-p}(α), μ0 is not in the confidence region. Since this is analogous to testing H0: μ = μ0 versus H1: μ ≠ μ0 [see (5-7)], we see that the confidence region of (5-18) consists of all μ0 vectors for which the T²-test would not reject H0 in favor of H1 at significance level α.

For p ≥ 4, we cannot graph the joint confidence region for μ. However, we can calculate the axes of the confidence ellipsoid and their relative lengths. These are determined from the eigenvalues λi and eigenvectors ei of S. As in (4-7), the directions and lengths of the axes of
n(x̄ - μ)'S⁻¹(x̄ - μ) ≤ c² = p(n - 1)/(n - p) F_{p,n-p}(α)

are determined by going

√λi c/√n = √λi √( p(n - 1) F_{p,n-p}(α) / (n(n - p)) )

units along the eigenvectors ei. Beginning at the center x̄, the axes of the confidence ellipsoid are

±√λi √( p(n - 1)/(n(n - p)) F_{p,n-p}(α) ) ei,  where S ei = λi ei,  i = 1, 2, ..., p    (5-19)
The ratios of the λi's will help identify relative amounts of elongation along pairs of axes.

Example 5.3 (Constructing a confidence ellipse for μ) Data for radiation from microwave ovens were introduced in Examples 4.10 and 4.17. Let
x1 = fourth root of measured radiation with door closed

and

x2 = fourth root of measured radiation with door open

For the n = 42 pairs of transformed observations, we find that

x̄ = [.564; .603],  S = | .0144  .0117 |
                        | .0117  .0146 |

S⁻¹ = |  203.018  -163.391 |
      | -163.391   200.228 |

and the eigenvalue-eigenvector pairs of S are

λ1 = .026, e1' = [.704, .710];  λ2 = .002, e2' = [-.710, .704]
The 95% confidence ellipse for μ consists of all values (μ1, μ2) satisfying

42 [.564 - μ1, .603 - μ2] | 203.018  -163.391; -163.391  200.228 | [.564 - μ1; .603 - μ2] ≤ 2(41)/40 F_{2,40}(.05)

or, since F_{2,40}(.05) = 3.23,

42(203.018)(.564 - μ1)² + 42(200.228)(.603 - μ2)² - 84(163.391)(.564 - μ1)(.603 - μ2) ≤ 6.62

To see whether μ' = [.562, .589] is in the confidence region, we compute

42(203.018)(.564 - .562)² + 42(200.228)(.603 - .589)² - 84(163.391)(.564 - .562)(.603 - .589) = 1.30 ≤ 6.62

We conclude that μ' = [.562, .589] is in the region. Equivalently, a test of H0: μ = [.562; .589] would not be rejected in favor of H1: μ ≠ [.562; .589] at the α = .05 level of significance.

The joint confidence ellipsoid is plotted in Figure 5.1. The center is at x̄' = [.564, .603], and the half-lengths of the major and minor axes are given by
√λ1 √( p(n - 1)/(n(n - p)) F_{p,n-p}(α) ) = √.026 √( 2(41)/(42(40)) (3.23) ) = .064

and

√λ2 √( p(n - 1)/(n(n - p)) F_{p,n-p}(α) ) = √.002 √( 2(41)/(42(40)) (3.23) ) = .018

respectively. The axes lie along e1' = [.704, .710] and e2' = [-.710, .704] when these vectors are plotted with x̄ as the origin. An indication of the elongation of the confidence ellipse is provided by the ratio of the lengths of the major and minor axes. This ratio is

2√λ1 √( p(n - 1)/(n(n - p)) F_{p,n-p}(α) ) / ( 2√λ2 √( p(n - 1)/(n(n - p)) F_{p,n-p}(α) ) ) = √λ1/√λ2 = .161/.045 = 3.6

The length of the major axis is 3.6 times the length of the minor axis.
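The half-lengths and the axis ratio of Example 5.3 can be recomputed from (5-19). The sketch below is our own illustration; it assumes SciPy for the F percentile and takes the reported eigenvalues .026 and .002 as given.

```python
import numpy as np
from scipy import stats

n, p = 42, 2
lam = np.array([0.026, 0.002])   # eigenvalues of S reported for the radiation data

# c^2 = p(n-1)/(n-p) F_{p,n-p}(.05); half-lengths are sqrt(lam_i) * c / sqrt(n), per (5-19)
c2 = p * (n - 1) / (n - p) * stats.f.ppf(0.95, p, n - p)
half_lengths = np.sqrt(lam * c2 / n)
axis_ratio = np.sqrt(lam[0] / lam[1])

print(np.round(half_lengths, 3))   # [0.064 0.018]
print(round(axis_ratio, 1))        # 3.6
```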
Let X have an Np(μ, Σ) distribution and form the linear combination

Z = a1X1 + a2X2 + ··· + apXp = a'X

From (2-43),

μZ = E(Z) = a'μ

and

σ²Z = Var(Z) = a'Σa

Moreover, by Result 4.2, Z has an N(a'μ, a'Σa) distribution. If a random sample X1, X2, ..., Xn from the Np(μ, Σ) population is available, a corresponding sample of Z's can be created by taking linear combinations. Thus,

Zj = a'Xj,  j = 1, 2, ..., n

The sample mean and variance of the observed values z1, z2, ..., zn are, by (3-36),

z̄ = a'x̄  and  s²z = a'Sa
where x̄ and S are the sample mean vector and covariance matrix of the x_j's, respectively.

Simultaneous confidence intervals can be developed from a consideration of confidence intervals for a'μ for various choices of a. The argument proceeds as follows. For a fixed and σ²z unknown, a 100(1 - α)% confidence interval for μz = a'μ is based on student's t-ratio

t = (z̄ - μz)/(sz/√n) = √n(a'x̄ - a'μ)/√(a'Sa)    (5-20)

and leads to the statement

z̄ - t_{n-1}(α/2) sz/√n ≤ μz ≤ z̄ + t_{n-1}(α/2) sz/√n

or

a'x̄ - t_{n-1}(α/2) √(a'Sa)/√n ≤ a'μ ≤ a'x̄ + t_{n-1}(α/2) √(a'Sa)/√n    (5-21)
where t_{n-1}(α/2) is the upper 100(α/2)th percentile of a t-distribution with n - 1 d.f. Inequality (5-21) can be interpreted as a statement about the components of the mean vector μ. For example, with a' = [1, 0, ..., 0], a'μ = μ1, and (5-21) becomes the usual confidence interval for a normal population mean. (Note, in this case, that a'Sa = s11.) Clearly, we could make several confidence statements about the components of μ, each with associated confidence coefficient 1 - α, by choosing different coefficient vectors a. However, the confidence associated with all of the statements taken together is not 1 - α.

Intuitively, it would be desirable to associate a "collective" confidence coefficient of 1 - α with the confidence intervals that can be generated by all choices of a. However, a price must be paid for the convenience of a large simultaneous confidence coefficient: intervals that are wider (less precise) than the interval of (5-21) for a specific choice of a.

Given a data set x1, x2, ..., xn and a particular a, the confidence interval in (5-21) is that set of a'μ values for which

|t| = | √n(a'x̄ - a'μ)/√(a'Sa) | ≤ t_{n-1}(α/2)

or, equivalently,

t² = n(a'x̄ - a'μ)²/(a'Sa) ≤ t²_{n-1}(α/2)    (5-22)
A simultaneous confidence region is given by the set of a'μ values such that t² is relatively small for all choices of a. It seems reasonable to expect that the constant t²_{n-1}(α/2) in (5-22) will be replaced by a larger value, c², when statements are developed for many choices of a.
Considering the values of a for which t² ≤ c², we are naturally led to the determination of

max_a t² = max_a [ n(a'(x̄ - μ))²/(a'Sa) ]

Using the maximization lemma (2-50) with x = a, d = (x̄ - μ), and B = S, we get

max_a [ n(a'(x̄ - μ))²/(a'Sa) ] = n max_a [ (a'(x̄ - μ))²/(a'Sa) ] = n(x̄ - μ)'S⁻¹(x̄ - μ) = T²    (5-23)

with the maximum occurring for a proportional to S⁻¹(x̄ - μ).
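The maximization behind (5-23) can be checked empirically. The sketch below is our own illustration (assuming NumPy); it compares the quadratic form at many random directions a with T² and with the maximizing direction a ∝ S⁻¹(x̄ - μ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Any positive definite S and any deviation vector d = xbar - mu will do
p, n = 3, 25
A = rng.standard_normal((p, p))
S = A @ A.T + p * np.eye(p)      # positive definite by construction
d = rng.standard_normal(p)

T2 = n * d @ np.linalg.solve(S, d)

def t2(a):
    """The quadratic form n (a'd)^2 / (a'Sa) of (5-23)."""
    return n * (a @ d) ** 2 / (a @ S @ a)

best_random = max(t2(rng.standard_normal(p)) for _ in range(10000))
a_star = np.linalg.solve(S, d)   # maximizing direction a proportional to S^{-1} d

print(best_random <= T2 + 1e-9)   # no random direction beats T^2
print(np.isclose(t2(a_star), T2)) # the bound is attained at a = S^{-1} d
```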
Result 5.3. Let X1, X2, ..., Xn be a random sample from an Np(μ, Σ) population with Σ positive definite. Then, simultaneously for all a, the interval

( a'X̄ - √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √(a'Sa/n),  a'X̄ + √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √(a'Sa/n) )

will contain a'μ with probability 1 - α.

Proof. From (5-23),

T² = n(x̄ - μ)'S⁻¹(x̄ - μ) ≤ c²  implies  n(a'x̄ - a'μ)²/(a'Sa) ≤ c²

for every a, or

a'x̄ - c √(a'Sa/n) ≤ a'μ ≤ a'x̄ + c √(a'Sa/n)

for every a. Choosing c² = p(n - 1) F_{p,n-p}(α)/(n - p) [see (5-6)] gives intervals that will contain a'μ for all a, with probability 1 - α = P[T² ≤ c²].
It is convenient to refer to the simultaneous intervals of Result 5.3 as T²-intervals, since the coverage probability is determined by the distribution of T². The successive choices a' = [1, 0, ..., 0], a' = [0, 1, ..., 0], and so on through a' = [0, 0, ..., 1] for the T²-intervals allow us to conclude that

x̄1 - √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √(s11/n) ≤ μ1 ≤ x̄1 + √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √(s11/n)
x̄2 - √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √(s22/n) ≤ μ2 ≤ x̄2 + √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √(s22/n)
    ...
x̄p - √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √(spp/n) ≤ μp ≤ x̄p + √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √(spp/n)    (5-24)
all hold simultaneously with confidence coefficient 1 - α. Note that, without modifying the coefficient 1 - α, we can make statements about the differences μi - μk corresponding to a' = [0, ..., 0, ai, 0, ..., 0, ak, 0, ..., 0], where ai = 1 and ak = -1. In this case a'Sa = sii - 2sik + skk, and we have the statement

x̄i - x̄k - √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √((sii - 2sik + skk)/n) ≤ μi - μk ≤ x̄i - x̄k + √( p(n - 1)/(n - p) F_{p,n-p}(α) ) √((sii - 2sik + skk)/n)    (5-25)
The simultaneous T²-confidence intervals are ideal for "data snooping." The confidence coefficient 1 - α remains unchanged for any choice of a, so linear combinations of the components μi that merit inspection based upon an examination of the data can be estimated. In addition, according to the results in Supplement 5A, we can include the statements about (μi, μk) belonging to the sample mean-centered ellipses
n [x̄i - μi, x̄k - μk] | sii  sik; sik  skk |⁻¹ [x̄i - μi; x̄k - μk] ≤ p(n - 1)/(n - p) F_{p,n-p}(α)    (5-26)
and still maintain the confidence coefficient (1 - α) for the whole set of statements. The simultaneous T²-confidence intervals for the individual components of a mean vector are just the shadows, or projections, of the confidence ellipsoid on the component axes. This connection between the shadows of the ellipsoid and the simultaneous confidence intervals given by (5-24) is illustrated in the next example.
Example 5.4 (Simultaneous confidence intervals as shadows of the confidence ellipsoid)
In Example 5.3, we obtained the 95% confidence ellipse for the means of the fourth roots of the door-closed and door-open microwave radiation measurements. The 95% simultaneous T²-intervals for the two component means are, from (5-24),

( x̄1 - √( 2(41)/40 F_{2,40}(.05) ) √(s11/n),  x̄1 + √( 2(41)/40 F_{2,40}(.05) ) √(s11/n) )
  = ( .564 - √(2.05(3.23)) √(.0144/42),  .564 + √(2.05(3.23)) √(.0144/42) )  or  (.516, .612)

( x̄2 - √( 2(41)/40 F_{2,40}(.05) ) √(s22/n),  x̄2 + √( 2(41)/40 F_{2,40}(.05) ) √(s22/n) )
  = ( .603 - √(2.05(3.23)) √(.0146/42),  .603 + √(2.05(3.23)) √(.0146/42) )  or  (.555, .651)

In Figure 5.2, we have redrawn the 95% confidence ellipse from Example 5.3. The 95% simultaneous intervals are shown as shadows, or projections, of this ellipse on the axes of the component means.
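The two shadow intervals can be reproduced with a short sketch (our own illustration, assuming SciPy; s11 = .0144 and s22 = .0146 are taken from Example 5.3).

```python
import numpy as np
from scipy import stats

n, p = 42, 2
xbar = np.array([0.564, 0.603])
s_diag = np.array([0.0144, 0.0146])   # s11 and s22 for the transformed radiation data

# Multiplier c = sqrt[p(n-1)/(n-p) F_{p,n-p}(.05)] from (5-24)
c = np.sqrt(p * (n - 1) / (n - p) * stats.f.ppf(0.95, p, n - p))
half = c * np.sqrt(s_diag / n)
lower, upper = np.round(xbar - half, 3), np.round(xbar + half, 3)

print(lower, upper)   # [0.516 0.555] [0.612 0.651]
```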
Example 5.5 (Constructing simultaneous confidence intervals and ellipses) The scores obtained by n = 87 college students on the College Level Examination Program (CLEP) subtest X1 and the College Qualification Test (CQT) subtests X2 and X3 are given in Table 5.2 for X1 = social science and history, X2 = verbal, and X3 = science. These data give
Figure 5.2 Simultaneous T²-intervals for the component means as shadows of the confidence ellipse on the axes (microwave radiation data).
x̄ = [526.29; 54.69; 25.13]

and the sample covariance matrix S, whose entries used below are s11 = 5726.55, s22 = 126.05, s33 = 23.11, and s23 = 23.37.

Let us compute the 95% simultaneous confidence intervals for μ1, μ2, and μ3. We have

p(n - 1)/(n - p) F_{p,n-p}(α) = 3(87 - 1)/(87 - 3) F_{3,84}(.05) = 3.07(2.7) = 8.29

and we obtain the simultaneous confidence statements [see (5-24)]:

526.29 - √8.29 √(5726.55/87) ≤ μ1 ≤ 526.29 + √8.29 √(5726.55/87)

or

502.93 ≤ μ1 ≤ 549.65

54.69 - √8.29 √(126.05/87) ≤ μ2 ≤ 54.69 + √8.29 √(126.05/87)

or

51.22 ≤ μ2 ≤ 58.16

25.13 - √8.29 √(23.11/87) ≤ μ3 ≤ 25.13 + √8.29 √(23.11/87)
Table 5.2 College Test Data (n = 87 students): individual scores on X1 (social science and history, CLEP), X2 (verbal, CQT), and X3 (science, CQT).
or

23.65 ≤ μ3 ≤ 26.61
With the possible exception of the verbal scores, the marginal Q-Q plots and two-dimensional scatter plots do not reveal any serious departures from normality for the college qualification test data. (See Exercise 5.18.) Moreover, the sample size is large enough to justify the methodology, even though the data are not quite normally distributed. (See Section 5.5.)

The simultaneous T²-intervals above are wider than univariate intervals because all three must hold with 95% confidence. They may also be wider than necessary, because, with the same confidence, we can make statements about differences. For instance, with a' = [0, 1, -1], the interval for μ2 - μ3 has endpoints
(x̄2 - x̄3) ± √( p(n - 1)/(n - p) F_{p,n-p}(.05) ) √((s22 - 2s23 + s33)/n)
  = (54.69 - 25.13) ± √8.29 √((126.05 - 2(23.37) + 23.11)/87) = 29.56 ± 3.12

so (26.44, 32.68) is a 95% confidence interval for μ2 - μ3. Simultaneous intervals can also be constructed for the other differences.

Finally, we can construct confidence ellipses for pairs of means, and the same 95% confidence holds. For example, for the pair (μ2, μ3), we have [see (5-26)]

87 [54.69 - μ2, 25.13 - μ3] | 126.05  23.37; 23.37  23.11 |⁻¹ [54.69 - μ2; 25.13 - μ3] ≤ 8.29

This ellipse is shown in Figure 5.3, along with the 95% confidence ellipses for the other two pairs of means. The projections or shadows of these ellipses on the axes are also indicated, and these projections are the T²-intervals.
An alternative approach to the construction of confidence intervals is to consider the components μi one at a time, as suggested by (5-21) with a' = [0, ..., 0, ai, 0, ..., 0], where ai = 1. This approach ignores the covariance structure of the p variables and leads to the one-at-a-time intervals

x̄i - t_{n-1}(α/2) √(sii/n) ≤ μi ≤ x̄i + t_{n-1}(α/2) √(sii/n),  i = 1, 2, ..., p    (5-27)
"'
__ L
r
...
0
1'3
"'
I
"'
500 522
.JL,
544
V)
"'
1l3
r
"'
500
522
JL,
544
50.5
54.5
58.5
Figure 5.3 95% confidence ellipses for pairs of means and the simultaneous T 2 intervals{X)llege test data.
Although prior to sampling, the ith interval has probability 1 - α of covering μi, we do not know what to assert, in general, about the probability of all intervals containing their respective μi's. As we have pointed out, this probability is not 1 - α. To shed some light on the problem, consider the special case where the observations have a joint normal distribution and

Σ = | σ11   0  ···   0  |
    |  0   σ22 ···   0  |
    |  ⋮    ⋮   ⋱    ⋮  |
    |  0    0  ···  σpp |

Since the observations on the first variable are independent of those on the second variable, and so on, the product rule for independent events can be applied. Before the sample is selected,

P[all t-intervals in (5-27) contain the μi's] = (1 - α)(1 - α)···(1 - α) = (1 - α)^p

If 1 - α = .95 and p = 6, this probability is (.95)^6 = .74.

To guarantee a probability of 1 - α that all of the statements about the component means hold simultaneously, the individual intervals must be wider than the separate t-intervals; just how much wider depends on both p and n, as well as on 1 - α.

For 1 - α = .95, n = 15, and p = 4, the multipliers of √(sii/n) in (5-24) and (5-27) are

√( p(n - 1)/(n - p) F_{p,n-p}(.05) ) = √( 4(14)/11 (3.36) ) = 4.14

and t14(.025) = 2.145, respectively. Consequently, in this case the simultaneous intervals are 100(4.14 - 2.145)/2.145 = 93% wider than those derived from the one-at-a-time t method.
Table 5.3 Critical Distance Multipliers for One-at-a-Time t-Intervals and T²-Intervals for Selected n and p (1 - α = .95)

                        √( (n - 1)p/(n - p) F_{p,n-p}(.05) )
   n    t_{n-1}(.025)        p = 4        p = 10
  15        2.145             4.14         11.52
  25        2.064             3.60          6.39
  50        2.010             3.31          5.05
 100        1.984             3.19          4.61
  ∞         1.960             3.08          4.28
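The entries of Table 5.3 can be regenerated with SciPy percentiles; this sketch is our own illustration, and slight differences in the second decimal can occur because the published table was built from rounded F values.

```python
import numpy as np
from scipy import stats

alpha = 0.05

def t2_multiplier(n, p):
    """sqrt[(n-1)p/(n-p) * F_{p,n-p}(alpha)], the T^2 multiplier in (5-24)."""
    return np.sqrt((n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p))

for n in (15, 25, 50, 100):
    t_mult = stats.t.ppf(1 - alpha / 2, n - 1)   # one-at-a-time t multiplier
    print("n=%3d  t=%.3f  p=4: %.2f  p=10: %.2f"
          % (n, t_mult, t2_multiplier(n, 4), t2_multiplier(n, 10)))
```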
The comparison implied by Table 5.3 is a bit unfair, since the confidence level associated with any collection of T²-intervals, for fixed n and p, is .95, and the overall confidence associated with a collection of individual t-intervals, for the same n, can, as we have seen, be much less than .95. The one-at-a-time t-intervals are too short to maintain an overall confidence level for separate statements about, say, all p means. Nevertheless, we sometimes look at them as the best possible information concerning a mean, if this is the only inference to be made. Moreover, if the one-at-a-time intervals are calculated only when the T²-test rejects the null hypothesis, some researchers think they may more accurately represent the information about the means than the T²-intervals do.

The T²-intervals are too wide if they are applied only to the p component means. To see why, consider the confidence ellipse and the simultaneous intervals shown in Figure 5.2. If μ1 lies in its T²-interval and μ2 lies in its T²-interval, then (μ1, μ2) lies in the rectangle formed by these two intervals. This rectangle contains the confidence ellipse and more. The confidence ellipse is smaller but has probability .95 of covering the mean vector μ with its component means μ1 and μ2. Consequently, the probability of covering the two individual means μ1 and μ2 will be larger than .95 for the rectangle formed by the T²-intervals. This result leads us to consider a second approach to making multiple comparisons known as the Bonferroni method.
Suppose that, prior to the collection of data, confidence statements about m linear combinations a1'μ, a2'μ, ..., am'μ are required. Let Ci denote a confidence statement about the value of ai'μ with P[Ci true] = 1 - αi, i = 1, 2, ..., m. Now,

P[all Ci true] = 1 - P[at least one Ci false]
  ≥ 1 - Σ_{i=1}^m P(Ci false) = 1 - Σ_{i=1}^m (1 - P(Ci true))
  = 1 - (α1 + α2 + ··· + αm)    (5-28)
Inequality (5-28), a special case of the Bonferroni inequality, allows an investigator to control the overall error rate α1 + α2 + ··· + αm, regardless of the correlation structure behind the confidence statements. There is also the flexibility of controlling the error rate for a group of important statements and balancing it by another choice for the less important statements.

Let us develop simultaneous interval estimates for the restricted set consisting of the components μi of μ. Lacking information on the relative importance of these components, we consider the individual t-intervals

x̄i ± t_{n-1}(αi/2) √(sii/n),  i = 1, 2, ..., m

with αi = α/m. Since P[X̄i ± t_{n-1}(α/2m) √(sii/n) contains μi] = 1 - α/m, i = 1, 2, ..., m, we have, from (5-28),

P[ X̄i ± t_{n-1}(α/2m) √(sii/n) contains μi, all i ] ≥ 1 - (α/m + α/m + ··· + α/m)  (m terms)
  = 1 - α

Therefore, with an overall confidence level greater than or equal to 1 - α, we can make the following m = p statements:

x̄1 - t_{n-1}(α/2p) √(s11/n) ≤ μ1 ≤ x̄1 + t_{n-1}(α/2p) √(s11/n)
x̄2 - t_{n-1}(α/2p) √(s22/n) ≤ μ2 ≤ x̄2 + t_{n-1}(α/2p) √(s22/n)
    ...
x̄p - t_{n-1}(α/2p) √(spp/n) ≤ μp ≤ x̄p + t_{n-1}(α/2p) √(spp/n)    (5-29)
The statements in (5-29) can be compared with those in (5-24). The percentage point t_{n-1}(α/2p) replaces √((n - 1)p F_{p,n-p}(α)/(n - p)), but otherwise the intervals are of the same structure.
Example 5.6 (Constructing Bonferroni simultaneous confidence intervals and comparing them with T²-intervals) Let us return to the microwave oven radiation data in Examples 5.3 and 5.4. We shall obtain the simultaneous 95% Bonferroni confidence intervals for the means, μ1 and μ2, of the fourth roots of the door-closed and door-open measurements with αi = .05/2, i = 1, 2. We make use of the results in Example 5.3, noting that n = 42 and t41(.05/2(2)) = t41(.0125) = 2.327, to get
x̄1 ± t41(.0125) √(s11/n) = .564 ± 2.327 √(.0144/42)  or  .521 ≤ μ1 ≤ .607

x̄2 ± t41(.0125) √(s22/n) = .603 ± 2.327 √(.0146/42)  or  .560 ≤ μ2 ≤ .646
Figure 5.4 shows the 95% T² simultaneous confidence intervals for μ1, μ2 from Figure 5.2, along with the corresponding 95% Bonferroni intervals. For each component mean, the Bonferroni interval falls within the T²-interval. Consequently, the rectangular (joint) region formed by the two Bonferroni intervals is contained in the rectangular region formed by the two T²-intervals. If we are interested only in the component means, the Bonferroni intervals provide more precise estimates than
Figure 5.4 The 95% T² and 95% Bonferroni simultaneous confidence intervals for the component means (microwave radiation data).
the T²-intervals. On the other hand, the 95% confidence region for μ gives the plausible values for the pairs (μ1, μ2) when the correlation between the measured variables is taken into account.
The Bonferroni intervals for linear combinations a'μ and the analogous T²-intervals (recall Result 5.3) have the same general form:

a'X̄ ± (critical value) √(a'Sa/n)

Consequently, in every instance where αi = α/m,

Length of Bonferroni interval / Length of T²-interval = t_{n-1}(α/2m) / √( p(n - 1)/(n - p) F_{p,n-p}(α) )    (5-30)

which does not depend on the random quantities X̄ and S. As we have pointed out, for a small number m of specified parametric functions a'μ, the Bonferroni intervals will always be shorter. How much shorter is indicated in Table 5.4 for selected n and p.
Table 5.4 (Length of Bonferroni Interval)/(Length of T²-Interval) for 1 - α = .95 and αi = .05/m

          m = p
   n        4
  15       .69
  25       .75
  50       .78
 100       .80
  ∞        .81
the closer the underlying population is to multivariate normal, the more efficiently the sample information will be utilized in making inferences.

All large-sample inferences about μ are based on a χ²-distribution. From (4-28), we know that (X̄ - μ)'(n⁻¹S)⁻¹(X̄ - μ) = n(X̄ - μ)'S⁻¹(X̄ - μ) is approximately χ² with p d.f., and thus,

P[ n(X̄ - μ)'S⁻¹(X̄ - μ) ≤ χ²p(α) ] = 1 - α    (5-31)

approximately, where χ²p(α) is the upper (100α)th percentile of the χ²p-distribution.

Equation (5-31) immediately leads to large-sample tests of hypotheses and simultaneous confidence regions. These procedures are summarized in Results 5.4 and 5.5.
Result 5.4. Let X1, X2, ..., Xn be a random sample from a population with mean μ and positive definite covariance matrix Σ. When n - p is large, the hypothesis H0: μ = μ0 is rejected in favor of H1: μ ≠ μ0, at a level of significance approximately α, if the observed

n(x̄ - μ0)'S⁻¹(x̄ - μ0) > χ²p(α)

Here χ²p(α) is the upper (100α)th percentile of a chi-square distribution with p d.f.

Comparing the test in Result 5.4 with the corresponding normal-theory test in (5-7), we see that the test statistics have the same structure, but the critical values are different. A closer examination, however, reveals that both tests yield essentially the same result in situations where the χ²-test of Result 5.4 is appropriate. This follows directly from the fact that (n - 1)p F_{p,n-p}(α)/(n - p) and χ²p(α) are approximately equal for n large relative to p. (See Tables 3 and 4 in the appendix.)
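The convergence of the two critical values can be seen with a quick check (our own illustration, assuming SciPy).

```python
from scipy import stats

p, alpha = 4, 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, p)   # large-sample critical value, ≈ 9.49 for p = 4

def f_crit(n):
    """Normal-theory critical value (n-1)p/(n-p) F_{p,n-p}(alpha) from (5-7)."""
    return (n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p)

# The F-based value decreases toward the chi-square value as n grows
for n in (25, 100, 1000):
    print(n, round(f_crit(n), 2), round(chi2_crit, 2))
```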
Result 5.5. Let X1, X2, ..., Xn be a random sample from a population with mean μ and positive definite covariance matrix Σ. If n - p is large,

a'X̄ ± √(χ²p(α)) √(a'Sa/n)

will contain a'μ, for every a, with probability approximately 1 - α. Consequently, we can make the 100(1 - α)% simultaneous confidence statements

x̄i ± √(χ²p(α)) √(sii/n)  contains μi, for all i

and, in addition, for all pairs (μi, μk), i, k = 1, 2, ..., p, the sample mean-centered ellipses

n [x̄i - μi, x̄k - μk] | sii  sik; sik  skk |⁻¹ [x̄i - μi; x̄k - μk] ≤ χ²p(α)  contain (μi, μk)

Proof. The probability level is a consequence of (5-31). The statements for the μi are obtained by the special choices a' = [0, ..., 0, ai, 0, ..., 0], where ai = 1, i = 1, 2, ..., p. The ellipsoids for pairs of means follow from Result 5A.2 with c² = χ²p(α). The overall confidence level of approximately 1 - α for all statements is, once again, a result of the large-sample distribution theory summarized in (5-31).

The question of what is a large sample size is not easy to answer. In one or two dimensions, sample sizes in the range 30 to 50 can usually be considered large. As the number of characteristics becomes large, certainly larger sample sizes are required for the asymptotic distributions to provide good approximations to the true distributions of various test statistics. Lacking definitive studies, we simply state that n - p must be large and realize that the true case is more complicated. An application with p = 2 and sample size 50 is much different than an application with p = 52 and sample size 100, although both have n - p = 48.

It is good statistical practice to subject these large-sample inference procedures to the same checks required of the normal-theory methods. Although small to moderate departures from normality do not cause any difficulties for n large, extreme deviations could cause problems. Specifically, the true error rate may be far removed from the nominal level α. If, on the basis of Q-Q plots and other investigative devices, outliers and other forms of extreme departures are indicated (see, for example, [2]), appropriate corrective actions, including transformations, are desirable. Methods for testing mean vectors of symmetric multivariate distributions that are relatively insensitive to departures from normality are discussed in [11].

In some instances, Results 5.4 and 5.5 are useful only for very large samples. The next example allows us to illustrate the construction of large-sample simultaneous statements for all single mean components.
Example 5.7 (Constructing large sample simultaneous confidence intervals) A music educator tested thousands of Finnish students on their native musical ability in order to set national norms in Finland. Summary statistics for part of the data set are given in Table 5.5. These statistics are based on a sample of n = 96 Finnish 12th graders.
Table 5.5 Musical Aptitude Profile Means and Standard Deviations for 96 12th-Grade Finnish Students Participating in a Standardization Program

Raw score
Variable          Mean (x̄i)   Standard deviation (√sii)
X1 = melody         28.1            5.76
X2 = harmony        26.6            5.85
X3 = tempo          35.4            3.82
X4 = meter          34.2            5.12
X5 = phrasing       23.6            3.76
X6 = balance        22.0            3.93
X7 = style          22.7            4.03

Source: Data courtesy of V. Sell.
Let us construct 90% simultaneous confidence intervals for the individual mean components μi, i = 1, 2, ..., 7. From Result 5.5, simultaneous 90% confidence limits are given by x̄i ± √(χ²7(.10)) √(sii/96), i = 1, 2, ..., 7, where χ²7(.10) = 12.02. Thus, with approximately 90% confidence,

28.1 ± √12.02 (5.76/√96)  contains μ1  or  26.06 ≤ μ1 ≤ 30.14
26.6 ± √12.02 (5.85/√96)  contains μ2  or  24.53 ≤ μ2 ≤ 28.67
35.4 ± √12.02 (3.82/√96)  contains μ3  or  34.05 ≤ μ3 ≤ 36.75
34.2 ± √12.02 (5.12/√96)  contains μ4  or  32.39 ≤ μ4 ≤ 36.01
23.6 ± √12.02 (3.76/√96)  contains μ5  or  22.27 ≤ μ5 ≤ 24.93
22.0 ± √12.02 (3.93/√96)  contains μ6  or  20.61 ≤ μ6 ≤ 23.39
22.7 ± √12.02 (4.03/√96)  contains μ7  or  21.27 ≤ μ7 ≤ 24.13
Based, perhaps, upon thousands of American students, the investigator could hypothesize the musical aptitude profile to be

μ0' = [31, 27, 34, 31, 23, 22, 22]

We see from the simultaneous statements above that the melody, tempo, and meter components of μ0 do not appear to be plausible values for the corresponding means of Finnish scores.

When the sample size is large, the one-at-a-time confidence intervals for individual means are

x̄i - z(α/2) √(sii/n) ≤ μi ≤ x̄i + z(α/2) √(sii/n),  i = 1, 2, ..., p
where z(a/2) is the upper 100(a/2)th percentile of the standard normal distribution. The Bonferroni simultaneous confidence intervals for the m = p statements about the individual means take the same form, but use the modified percentile z( af2p) to give
x;  z
p.,; :s:
x; + z
(a) \j;;{S;;
2
p
i = 1, 2, ... , p
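The two normal percentiles can be obtained with the standard library alone; the sketch below recomputes the Bonferroni interval for the melody mean, whose limits appear in Table 5.6:

```python
from math import sqrt
from statistics import NormalDist

n, p, alpha = 96, 7, 0.05
z_one = NormalDist().inv_cdf(1 - alpha / 2)        # z(.025), about 1.96
z_bon = NormalDist().inv_cdf(1 - alpha / (2 * p))  # z(.025/7), about 2.69

# Bonferroni interval for the melody mean: x-bar = 28.1, sqrt(s11) = 5.76
mean, sd = 28.1, 5.76
half = z_bon * sd / sqrt(n)
interval = (mean - half, mean + half)
```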
Table 5.6 gives the individual, Bonferroni, and chi-square-based (or shadow of the confidence ellipsoid) intervals for the musical aptitude data in Example 5.7.
Table 5.6 The Large Sample 95% Individual, Bonferroni, and T²-Intervals for the Musical Aptitude Data
The one-at-a-time confidence intervals use z(.025) = 1.96. The simultaneous Bonferroni intervals use z(.025/7) = 2.69. The simultaneous T², or shadows of the ellipsoid, use χ₇²(.05) = 14.07.

               One-at-a-time      Bonferroni       Shadow of Ellipsoid
Variable       Lower    Upper     Lower    Upper     Lower    Upper
x₁ = melody    26.95    29.25     26.52    29.68     25.90    30.30
x₂ = harmony   25.43    27.77     24.99    28.21     24.36    28.84
x₃ = tempo     34.64    36.16     34.35    36.45     33.94    36.86
x₄ = meter     33.18    35.22     32.79    35.61     32.24    36.16
x₅ = phrasing  22.85    24.35     22.57    24.63     22.16    25.04
x₆ = balance   21.21    22.79     20.92    23.08     20.50    23.50
x₇ = style     21.89    23.51     21.59    23.81     21.16    24.24
Although the sample size may be large, some statisticians prefer to retain the F- and t-based percentiles rather than use the chi-square or standard normal-based percentiles. The latter constants are the infinite sample size limits of the former. The F and t percentiles produce larger intervals and, hence, are more conservative. Table 5.7 gives the individual, Bonferroni, and F-based, or shadow of the confidence ellipsoid, intervals for the musical aptitude data. Comparing Table 5.7 with Table 5.6, we see that all of the intervals in Table 5.7 are larger. However, with the relatively large sample size n = 96, the differences are typically in the third, or tenths, digit.
Table 5.7 The 95% Individual, Bonferroni, and T²-Intervals for the Musical Aptitude Data
The simultaneous T², or shadows of the ellipsoid, use F₇,₈₉(.05) = 2.11.

               One-at-a-time      Bonferroni       Shadow of Ellipsoid
Variable       Lower    Upper     Lower    Upper     Lower    Upper
x₁ = melody    26.93    29.27     26.48    29.72     25.76    30.44
x₂ = harmony   25.41    27.79     24.96    28.24     24.23    28.97
x₃ = tempo     34.63    36.17     34.33    36.47     33.85    36.95
x₄ = meter     33.16    35.24     32.76    35.64     32.12    36.28
x₅ = phrasing  22.84    24.36     22.54    24.66     22.07    25.13
x₆ = balance   21.20    22.80     20.90    23.10     20.41    23.59
x₇ = style     21.88    23.52     21.57    23.83     21.07    24.33
1. Plot the individual observations or sample means in time order.
2. Create and plot the centerline x̄, the sample mean of all of the observations.
3. Calculate and plot the control limits given by

UCL (upper control limit) = x̄ + 3(standard deviation)
LCL (lower control limit) = x̄ − 3(standard deviation)
The standard deviation in the control limits is the estimated standard deviation of the observations being plotted. For single observations, it is often the sample standard deviation. If the means of subsamples of size m are plotted, then the standard deviation is the sample standard deviation divided by √m. The control limits of plus and minus three standard deviations are chosen so that there is a very small chance, assuming normally distributed data, of falsely signaling an out-of-control observation, that is, an observation suggesting a special cause of variation.
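The three-sigma recipe above amounts to a one-line helper; the call below uses the legal appearances summary statistics from Example 5.8:

```python
def control_limits(xbar, sd):
    """Three-sigma control limits (LCL, UCL) for an X-bar chart."""
    return xbar - 3 * sd, xbar + 3 * sd

# Legal appearances overtime hours: x-bar = 3558, s = 607
lcl, ucl = control_limits(3558, 607)
```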
Example 5.8 (Creating a univariate control chart) The Madison, Wisconsin, police department regularly monitors many of its activities as part of an ongoing quality-improvement program. Table 5.8 gives the data on five different kinds of overtime hours. Each observation represents a total for 12 pay periods, or about half a year. We examine the stability of the legal appearances overtime hours. Since individual values will be plotted, the centerline is the sample mean; a computer calculation gives x̄₁ = 3558. Also, the sample standard deviation is √s₁₁ = 607, and the control limits are

UCL = x̄₁ + 3√s₁₁ = 3558 + 3(607) = 5379
LCL = x̄₁ − 3√s₁₁ = 3558 − 3(607) = 1737
Table 5.8 Five Types of Overtime Hours for the Madison, Wisconsin, Police Department

Legal            Extraordinary   Holdover    COA         Meeting
Appearances      Event Hours     Hours       Hours       Hours
Hours (x₁)       (x₂)            (x₃)        (x₄)        (x₅)
  3387             2200            1181       14,861        236
  3109              875            3532       11,367        310
  2670              957            2502       13,329       1182
  3125             1758            4510       12,328       1208
  3469              868            3032       12,847       1385
  3120              398            2130       13,979       1053
  3671             1603            1982       13,528       1046
  4531              523            4675       12,699       1100
  3678             2034            2354       13,534       1349
  3238             1136            4606       11,609       1150
  3135             5326            3044       14,189       1216
  5217             1658            3340       15,052        660
  3728             1945            2111       12,236        299
  3506              344            1291       15,482        206
  3824              807            1365       14,900        239
  3516             1223            1175       15,078        161
The data, along with the centerline and control limits, are plotted as an X̄ chart in Figure 5.5.

Figure 5.5 The X̄ chart for x₁ = legal appearances overtime hours, with centerline x̄₁ = 3558, UCL = 5379, and LCL = 1737.
The legal appearances overtime hours are stable over the period in which the data were collected. The variation in overtime hours appears to be due to common causes, so no special-cause variation is indicated.
With more than one important characteristic, a multivariate approach should be used to monitor process stability. Such an approach can account for correlations between characteristics and will control the overall probability of falsely signaling a special cause of variation when one is not present. High correlations among the variables can make it impossible to assess the overall error rate that is implied by a large number of univariate charts.
The two most common multivariate charts are (i) the ellipse format chart and (ii) the T²-chart. Two cases that arise in practice need to be treated differently:
1. Monitoring the stability of a given sample of multivariate observations
2. Setting a control region for future observations
Initially, we consider the use of multivariate control procedures for a sample of multivariate observations x₁, x₂, ..., xₙ. Later, we discuss these procedures when the observations are subgroup means.
Each Xⱼ − X̄ has mean 0 and

Cov(Xⱼ − X̄) = (1 − 1/n)²Σ + (n − 1)(1/n²)Σ = ((n − 1)/n)Σ

Each Xⱼ − X̄ has a normal distribution, but Xⱼ − X̄ is not independent of the sample covariance matrix S. However, to set control limits, we approximate the distribution of (Xⱼ − X̄)′S⁻¹(Xⱼ − X̄) by a chi-square distribution.
Ellipse Format Chart. The ellipse format chart for a bivariate control region is the more intuitive of the charts, but its approach is limited to two variables. The two characteristics on the jth unit are plotted as a pair (xⱼ₁, xⱼ₂). The 95% quality ellipse consists of all x that satisfy

(x − x̄)′S⁻¹(x − x̄) ≤ χ₂²(.05)     (5-32)
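Criterion (5-32) is a point-in-ellipse test. A minimal sketch, for p = 2 only, exploits the fact that the chi-square upper quantile with 2 degrees of freedom has the closed form −2 ln α, so no table lookup is needed:

```python
import numpy as np
from math import log

def in_quality_ellipse(x, xbar, S, alpha=0.05):
    """Return True if x satisfies (x - xbar)' S^{-1} (x - xbar) <= chi2_2(alpha).

    Valid only for p = 2 variables, where the chi-square upper
    quantile is exactly -2 * ln(alpha)."""
    d2 = (x - xbar) @ np.linalg.inv(S) @ (x - xbar)
    return d2 <= -2.0 * log(alpha)
```

For a 99% ellipse, pass alpha=0.01, giving the cutoff −2 ln(.01) = 9.21.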
Example 5.9 (An ellipse format chart for overtime hours) Let us refer to Example 5.8 and create a quality ellipse for the pair of overtime characteristics (legal appearances, extraordinary event) hours. A computer calculation gives

x̄ = [3558, 1478]′   and   S = [  367,884.7    −72,093.8
                                 −72,093.8   1,399,053.1 ]

We illustrate the quality ellipse format chart using the 99% ellipse, which consists of all x that satisfy

(s₁₁s₂₂ / (s₁₁s₂₂ − s₁₂²)) [ (x₁ − x̄₁)²/s₁₁ − 2s₁₂(x₁ − x̄₁)(x₂ − x̄₂)/(s₁₁s₂₂) + (x₂ − x̄₂)²/s₂₂ ] ≤ χ₂²(.01)
This ellipse format chart is graphed, along with the pairs of data, in Figure 5.6.
Figure 5.6 The quality control 99% ellipse for legal appearances and extraordinary event overtime.
Figure 5.7 The X̄ chart for x₂ = extraordinary event overtime hours, with centerline x̄₂ = 1478, UCL = 5027, and LCL = −2071.
Notice that one point, indicated with an arrow, is definitely outside of the ellipse. When a point is out of the control region, individual X̄ charts are constructed. The X̄ chart for x₁ was given in Figure 5.5; that for x₂ is given in Figure 5.7. When the lower control limit is less than zero for data that must be nonnegative, it is generally set to zero. The LCL = 0 limit is shown by the dashed line in Figure 5.7.
Was there a special cause of the single point for extraordinary event overtime that is outside the upper control limit in Figure 5.7? During this period, the United States bombed a foreign capital, and students at Madison were protesting. A majority of the extraordinary overtime was used in that four-week period. Although, by its very definition, extraordinary overtime occurs only when special events occur and is therefore unpredictable, it still has a certain stability.
T²-Chart. A T²-chart can be applied to a large number of characteristics. Unlike the ellipse format, it is not limited to two variables. Moreover, the points are displayed in time order rather than as a scatter plot, and this makes patterns and trends visible. For the jth point, we calculate the T²-statistic

T²ⱼ = (xⱼ − x̄)′S⁻¹(xⱼ − x̄)     (5-33)

We then plot the T² values on a time axis. The lower control limit is zero, and we use the upper control limit

UCL = χₚ²(.05)

or, sometimes, χₚ²(.01). There is no centerline in the T²-chart. Notice that the T²-statistic is the same as the quantity d²ⱼ used to test normality in Section 4.6.
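Computing the series of T² values in (5-33) is a one-pass calculation. The sketch below uses the usual divisor n − 1 for S; the UCL, χₚ²(.05), would be read from a table. A useful sanity check is the algebraic identity that the T² values then sum to (n − 1)p:

```python
import numpy as np

def t2_values(X):
    """T^2_j = (x_j - xbar)' S^{-1} (x_j - xbar) for each row of X.

    S is the sample covariance matrix with divisor n - 1."""
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    Sinv = np.linalg.inv(S)
    d = X - xbar
    # quadratic form d_j' Sinv d_j for every row j
    return np.einsum('ij,jk,ik->i', d, Sinv, d)
```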
Example 5.10 (A T²-chart for overtime hours) Using the police department data in Example 5.8, we construct a T² plot based on the two variables X₁ = legal appearances hours and X₂ = extraordinary event hours. T²-charts with more than two variables are considered in Exercise 5.26. We take α = .01 to be consistent with the ellipse format chart in Example 5.9.
The T²-chart in Figure 5.8 reveals that the pair (legal appearances, extraordinary event) hours for period 11 is out of control. Further investigation, as in Example 5.9, confirms that this is due to the large value of extraordinary event overtime during that period.
Figure 5.8 The T²-chart for legal appearances hours and extraordinary event hours, α = .01.
When the multivariate T 2chart signals that the jth unit is out of control, it should be determined which variables are responsible. A modified region based on Bonferroni intervals is frequently chosen for this purpose. The kth variable is out of control if xjk does not lie in the interval
(x̄ₖ − t_{n−1}(.005/p)√sₖₖ,  x̄ₖ + t_{n−1}(.005/p)√sₖₖ)

where p is the total number of measured variables.
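Identifying the responsible variables is then a per-component interval check. In the sketch below, the t critical value t_{n−1}(.005/p) is supplied from a table rather than computed:

```python
from math import sqrt

def out_of_control_vars(x, xbar, s_diag, t_crit):
    """Indices k for which x[k] lies outside xbar[k] +/- t_crit * sqrt(s_kk).

    t_crit should be t_{n-1}(.005/p), taken from a t table."""
    flags = []
    for k, (xk, mk, skk) in enumerate(zip(x, xbar, s_diag)):
        half = t_crit * sqrt(skk)
        if xk < mk - half or xk > mk + half:
            flags.append(k)
    return flags
```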
Example 5.11 (Control of robotic welders: more than T² needed) The assembly of a driveshaft for an automobile requires the circle welding of tube yokes to a tube. The inputs to the automated welding machines must be controlled to be within certain operating limits where a machine produces welds of good quality. In order to control the process, one process engineer measured four critical variables: X₁ = voltage (volts), X₂ = current (amps), X₃ = feed speed (in/min), X₄ = (inert) gas flow (cfm).
Table 5.9 Welder Data

Case  Voltage (X₁)  Current (X₂)  Feed speed (X₃)  Gas flow (X₄)
  1      23.0          276            289.6            51.0
  2      22.0          281            289.0            51.7
  3      22.8          270            288.2            51.3
  4      22.1          278            288.0            52.3
  5      22.5          275            288.0            53.0
  6      22.2          273            288.0            51.0
  7      22.0          275            290.0            53.0
  8      22.1          268            289.0            54.0
  9      22.5          277            289.0            52.0
 10      22.5          278            289.0            52.0
 11      22.3          269            287.0            54.0
 12      21.8          274            287.6            52.0
 13      22.3          270            288.4            51.0
 14      22.2          273            290.2            51.3
 15      22.1          274            286.0            51.0
 16      22.1          277            287.0            52.0
 17      21.8          277            287.0            51.0
 18      22.6          276            290.0            51.0
 19      22.3          278            287.0            51.7
 20      23.0          266            289.1            51.0
 21      22.9          271            288.3            51.0
 22      21.3          274            289.0            52.0
 23      21.8          280            290.0            52.0
 24      22.0          268            288.3            51.0
 25      22.8          269            288.7            52.0
 26      22.0          264            290.0            51.0
 27      22.5          273            288.6            52.0
 28      22.2          269            288.2            52.0
 29      22.6          273            286.0            52.0
 30      21.7          283            290.0            52.7
 31      21.9          273            288.7            55.3
 32      22.3          264            287.0            52.0
 33      22.2          263            288.0            52.0
 34      22.3          266            288.6            51.7
 35      22.0          263            288.0            51.7
 36      22.8          272            289.0            52.3
 37      22.0          277            287.7            53.3
 38      22.7          272            289.0            52.0
 39      22.6          274            287.2            52.7
 40      22.7          270            290.0            51.0
The normality assumption is reasonable for most variables, but we take the natural logarithm of gas flow. In addition, there is no appreciable serial correlation for successive observations on each variable.
A T²-chart for the four welding variables is given in Figure 5.9. The dotted line is the 95% limit and the solid line is the 99% limit. Using the 99% limit, no points are out of control, but case 31 is outside the 95% limit.
What do the quality control ellipses (ellipse format charts) show for two variables? Most of the variables are in control. However, the 99% quality ellipse for ln(gas flow) and voltage, shown in Figure 5.10, reveals that case 31 is out of control, and this is due to an unusually large volume of gas flow. The univariate X̄ chart for ln(gas flow), in Figure 5.11, shows that this point is outside the three-sigma limits. It appears that gas flow was reset at the target for case 32. All the other univariate X̄ charts have all points within their three-sigma control limits.
Figure 5.9 The T²-chart for the welding data with 95% and 99% limits.
Figure 5.10 The 99% quality control ellipse for ln(gas flow) and voltage.
Figure 5.11 The univariate X̄ chart for ln(gas flow), with centerline 3.951, UCL = 4.005, and LCL = 3.896.
In this example, a shift in a single variable was masked with 99% limits, or almost masked (with 95% limits), by being combined into a single T²-value.
Result 5.6. Let X₁, X₂, ..., Xₙ be independently distributed as Nₚ(μ, Σ), and let X be a future observation from the same distribution. Then

T² = (n/(n + 1)) (X − X̄)′S⁻¹(X − X̄)   is distributed as   ((n − 1)p/(n − p)) F_{p,n−p}

and a 100(1 − α)% p-dimensional prediction ellipsoid is given by all x satisfying

(x − x̄)′S⁻¹(x − x̄) ≤ ((n² − 1)p/(n(n − p))) F_{p,n−p}(α)
Proof. We first note that X − X̄ has mean 0. Since X is a future observation, X and X̄ are independent, so

Cov(X − X̄) = Cov(X) + Cov(X̄) = Σ + (1/n)Σ = ((n + 1)/n)Σ

and, by Result 4.8, √(n/(n + 1)) (X − X̄) is distributed as Nₚ(0, Σ). Now,

(n/(n + 1)) (X − X̄)′S⁻¹(X − X̄)
which combines a multivariate normal, Nₚ(0, Σ), random vector and an independent Wishart, W_{p,n−1}(Σ), random matrix in the form

(multivariate normal random vector)′ (Wishart random matrix / d.f.)⁻¹ (multivariate normal random vector)

has the scaled F distribution claimed, according to (5-8) and the discussion on page 213. The constant for the ellipsoid follows from (5-6).
Note that the prediction region in Result 5.6 for a future observed value x is an ellipsoid. It is centered at the initial sample mean x̄, and its axes are determined by the eigenvectors of S. Since
P[(X − X̄)′S⁻¹(X − X̄) ≤ ((n² − 1)p/(n(n − p))) F_{p,n−p}(α)] = 1 − α

before any new observations are taken, the probability that X will fall in the prediction ellipse is 1 − α.
Keep in mind that the current observations must be stable before they can be used to determine control regions for future observations. Based on Result 5.6, we obtain the two charts for future observations.
Control Ellipse for Future Observations. The 95% control ellipse for a future observation x, with p = 2, consists of all x satisfying

(x − x̄)′S⁻¹(x − x̄) ≤ ((n² − 1)2/(n(n − 2))) F_{2,n−2}(.05)     (5-34)

Any future observation x is declared to be out of control if it falls outside the control ellipse.

Example 5.12 (A control ellipse for future overtime hours) In Example 5.9, we checked the stability of legal appearances and extraordinary event overtime hours. Let's use these data to determine a control region for future pairs of values. From Example 5.9 and Figure 5.6, we find that the pair of values for period 11 was out of control. We removed this point and determined the new 99% ellipse. All of the points are then in control, so they can serve to determine the 95% prediction region just defined for p = 2. This control ellipse is shown in Figure 5.12 along with the initial 15 stable observations. Any future observation falling in the ellipse is regarded as stable or in control. An observation outside of the ellipse represents a potential out-of-control observation or special-cause variation.
T²-Chart for Future Observations. For each new observation x, plot

T² = (n/(n + 1)) (x − x̄)′S⁻¹(x − x̄)

in time order. Set LCL = 0, and take

UCL = ((n − 1)p/(n − p)) F_{p,n−p}(.05)

A future observation is declared out of control if its T² value exceeds the UCL.
Figure 5.12 The 95% control ellipse for future legal appearances and extraordinary event overtime.
As will be described in Section 6.4, the sample covariances from the n subsamples can be combined to give a single estimate (called S_pooled in Chapter 6) of the common covariance Σ. This pooled estimate is

S̄ = (1/n)(S₁ + S₂ + ... + Sₙ)
Here (nm − n)S̄ is independent of each X̄ⱼ and, therefore, of their mean X̿. Further, (nm − n)S̄ is distributed as a Wishart random matrix with nm − n degrees of freedom. Notice that we are estimating Σ internally from the data collected in any given period. These estimators are combined to give a single estimator with a large number of degrees of freedom. Consequently,
T² = m(X̄ⱼ − X̿)′S̄⁻¹(X̄ⱼ − X̿)     (5-35)

is distributed as

((nm − n)p/(nm − n − p + 1)) F_{p,nm−n−p+1}
Ellipse Format Chart. In an analogous fashion to our discussion on individual multivariate observations, the ellipse format chart for pairs of subsample means is

(x̄ − x̿)′S̄⁻¹(x̄ − x̿) ≤ ((nm − n)2/(m(nm − n − 1))) F_{2,nm−n−1}(.05)     (5-36)

although the right-hand side is usually approximated as χ₂²(.05)/m. Subsamples corresponding to points outside of the control ellipse should be carefully checked for changes in the behavior of the quality characteristics being measured. The interested reader is referred to [10] for additional discussion.
T²-Chart. To construct a T²-chart with subsample data and p characteristics, we plot the quantity

T²ⱼ = m(x̄ⱼ − x̿)′S̄⁻¹(x̄ⱼ − x̿)

for j = 1, 2, ..., n, where the

UCL = ((nm − n)p/(nm − n − p + 1)) F_{p,nm−n−p+1}(.05)

The UCL is often approximated as χₚ²(.05) when n is large. Values of T²ⱼ that exceed the UCL correspond to potentially out-of-control or special-cause variation, which should be checked. (See [10].)
Inferences about Mean Vectors When Some Observations Are Missing 251
Consequently,

T² = (nm/(n + 1)) (X̄ − x̿)′S̄⁻¹(X̄ − x̿)

is distributed as

((nm − n)p/(nm − n − p + 1)) F_{p,nm−n−p+1}

Control Ellipse for Future Subsample Means. The prediction ellipse for a future subsample mean for p = 2 characteristics is defined by the set of all x̄ such that

(x̄ − x̿)′S̄⁻¹(x̄ − x̿) ≤ ((n + 1)(nm − n)2/(nm(nm − n − 1))) F_{2,nm−n−1}(.05)     (5-37)

where, again, the right-hand side is usually approximated as χ₂²(.05)/m.

T²-Chart for Future Subsample Means. As before, we bring nm/(n + 1) into the control limit and plot the quantity

T² = (nm/(n + 1)) (x̄ − x̿)′S̄⁻¹(x̄ − x̿)

at each time point. The UCL is often approximated as χₚ²(.05) when n is large.
Points outside of the prediction ellipse or above the UCL suggest that the current values of the quality characteristics are different in some way from those of the previous stable process. This may be good or bad, but almost certainly warrants a careful search for the reasons for the change.
Chapter 5 Inferences about a Mean Vector

experimental context. If the pattern of missing values is closely tied to the value of the response, such as people with extremely high incomes who refuse to respond in a survey on salaries, subsequent inferences may be seriously biased. To date, no statistical techniques have been developed for these cases. However, we are able to treat situations where data are missing at random, that is, cases in which the chance mechanism responsible for the missing values is not influenced by the value of the variables.
A general approach for computing maximum likelihood estimates from incomplete data is given by Dempster, Laird, and Rubin [5]. Their technique, called the EM algorithm, consists of an iterative calculation involving two steps. We call them the prediction and estimation steps:
1. Prediction step. Given some estimate θ̃ of the unknown parameters, predict the contribution of any missing observation to the (complete-data) sufficient statistics.
2. Estimation step. Use the predicted sufficient statistics to compute a revised estimate of the parameters.
The calculation cycles from one step to the other, until the revised estimates do not differ appreciably from the estimate obtained in the previous iteration.
When the observations X₁, X₂, ..., Xₙ are a random sample from a p-variate normal population, the prediction-estimation algorithm is based on the complete-data sufficient statistics [see (4-21)]

T₁ = Σⱼ₌₁ⁿ Xⱼ = nX̄

and

T₂ = Σⱼ₌₁ⁿ XⱼXⱼ′ = (n − 1)S + nX̄X̄′
In this case, the algorithm proceeds as follows: We assume that the population mean and variance, μ and Σ, respectively, are unknown and must be estimated.
Prediction step. For each vector xⱼ with missing values, let xⱼ⁽¹⁾ denote the missing components and xⱼ⁽²⁾ denote those components which are available, so that xⱼ′ = [xⱼ⁽¹⁾′, xⱼ⁽²⁾′].
Given estimates μ̃ and Σ̃ from the estimation step, use the mean of the conditional normal distribution of x⁽¹⁾, given x⁽²⁾, to estimate the missing values. That is,

x̃ⱼ⁽¹⁾ = E(Xⱼ⁽¹⁾ | xⱼ⁽²⁾; μ̃, Σ̃) = μ̃⁽¹⁾ + Σ̃₁₂Σ̃₂₂⁻¹(xⱼ⁽²⁾ − μ̃⁽²⁾)     (5-38)

estimates the contribution of xⱼ⁽¹⁾ to T₁.
Next, the predicted contribution of xⱼ⁽¹⁾ to T₂ is

x̃ⱼ⁽¹⁾x̃ⱼ⁽¹⁾′ = E(Xⱼ⁽¹⁾Xⱼ⁽¹⁾′ | xⱼ⁽²⁾; μ̃, Σ̃) = Σ̃₁₁ − Σ̃₁₂Σ̃₂₂⁻¹Σ̃₂₁ + x̃ⱼ⁽¹⁾x̃ⱼ⁽¹⁾′     (5-39)

and

x̃ⱼ⁽¹⁾xⱼ⁽²⁾′ = E(Xⱼ⁽¹⁾Xⱼ⁽²⁾′ | xⱼ⁽²⁾; μ̃, Σ̃) = x̃ⱼ⁽¹⁾xⱼ⁽²⁾′
The contributions in (5-38) and (5-39) are summed over all xⱼ with missing components. The results are combined with the sample data to yield T̃₁ and T̃₂.
Estimation step. Compute the revised maximum likelihood estimates (see Result 4.11):

μ̃ = T̃₁/n,   Σ̃ = (1/n)T̃₂ − μ̃μ̃′     (5-40)

We illustrate the computational aspects of the prediction-estimation algorithm in Example 5.13.
Example 5.13 (Illustrating the EM algorithm) Estimate the normal population mean μ and covariance Σ using the incomplete data set

X = [ —   0   3
      7   2   6
      5   1   2
      —   —   5 ]

Here n = 4, p = 3, and parts of observation vectors x₁ and x₄ are missing. We obtain the initial sample averages
μ̃₁ = (7 + 5)/2 = 6,   μ̃₂ = (0 + 2 + 1)/3 = 1,   μ̃₃ = (3 + 6 + 2 + 5)/4 = 4
from the available observations. Substituting these averages for any missing values, so that x̃₁₁ = 6, for example, we can obtain initial covariance estimates. We shall construct these estimates using the divisor n, because the algorithm eventually produces the maximum likelihood estimate Σ̂. Thus,

σ̃₁₁ = 1/2,   σ̃₂₂ = 1/2,   σ̃₃₃ = 5/2
σ̃₁₂ = 1/4,   σ̃₂₃ = 3/4,   σ̃₁₃ = 1
The prediction step consists of using the initial estimates μ̃ and Σ̃ to predict the contributions of the missing values to the sufficient statistics T₁ and T₂. [See (5-38) and (5-39).]
The first component of x₁ is missing, so we partition μ̃ and Σ̃ as
μ̃ = [μ̃₁; μ̃₂; μ̃₃] = [μ̃⁽¹⁾; μ̃⁽²⁾],   Σ̃ = [σ̃₁₁  σ̃₁₂  σ̃₁₃; σ̃₂₁  σ̃₂₂  σ̃₂₃; σ̃₃₁  σ̃₃₂  σ̃₃₃] = [Σ̃₁₁  Σ̃₁₂; Σ̃₂₁  Σ̃₂₂]

and predict

x̃₁₁ = μ̃₁ + Σ̃₁₂Σ̃₂₂⁻¹ [x₁₂ − μ̃₂; x₁₃ − μ̃₃] = 6 + [1/4, 1] [1/2  3/4; 3/4  5/2]⁻¹ [0 − 1; 3 − 4] = 5.73

x̃₁₁² = σ̃₁₁ − Σ̃₁₂Σ̃₂₂⁻¹Σ̃₂₁ + (x̃₁₁)² = 1/2 − [1/4, 1] [1/2  3/4; 3/4  5/2]⁻¹ [1/4; 1] + (5.73)² = .09 + 32.80 = 32.89

x̃₁₁[x₁₂, x₁₃] = 5.73 [0, 3] = [0, 17.18]

For the two missing components of x₄, we partition μ̃ and Σ̃ analogously and predict

[x̃₄₁; x̃₄₂] = [μ̃₁; μ̃₂] + [σ̃₁₃; σ̃₂₃] σ̃₃₃⁻¹ (x₄₃ − μ̃₃) = [6; 1] + [1; 3/4] (2/5)(5 − 4) = [6.4; 1.3]

[x̃₄₁; x̃₄₂][x̃₄₁, x̃₄₂] = [σ̃₁₁  σ̃₁₂; σ̃₂₁  σ̃₂₂] − [σ̃₁₃; σ̃₂₃] σ̃₃₃⁻¹ [σ̃₃₁, σ̃₃₂] + [6.4; 1.3][6.4, 1.3] = [41.06  8.27; 8.27  1.97]

and

[x̃₄₁; x̃₄₂] x₄₃ = [6.4; 1.3] 5 = [32.0; 6.5]
are the contributions to T₂. Thus, the predicted complete-data sufficient statistics are

T̃₁ = [x̃₁₁ + x₂₁ + x₃₁ + x̃₄₁; x₁₂ + x₂₂ + x₃₂ + x̃₄₂; x₁₃ + x₂₃ + x₃₃ + x₄₃]
   = [5.73 + 7 + 5 + 6.4; 0 + 2 + 1 + 1.3; 3 + 6 + 2 + 5] = [24.13; 4.30; 16.00]

T̃₂ = [32.89 + 7² + 5² + 41.06    0 + 7(2) + 5(1) + 8.27    17.18 + 7(6) + 5(2) + 32.0
      0 + 7(2) + 5(1) + 8.27     0² + 2² + 1² + 1.97       0(3) + 2(6) + 1(2) + 6.5
      17.18 + 7(6) + 5(2) + 32.0  0(3) + 2(6) + 1(2) + 6.5  3² + 6² + 2² + 5²]
   = [147.95   27.27  101.18
       27.27    6.97   20.50
      101.18   20.50   74.00]
This completes one prediction step. The next estimation step, using (5-40), provides the revised estimates

μ̃ = (1/n)T̃₁ = [6.03; 1.08; 4.00]

Σ̃ = (1/n)T̃₂ − μ̃μ̃′ = [ .61   .33  1.17
                        .33   .59   .83
                       1.17   .83  2.50 ]

Note that σ̃₁₁ = .61 and σ̃₂₂ = .59 are larger than the corresponding initial estimates obtained by replacing the missing observations on the first and second variables by the sample means of the remaining values. The third variance estimate σ̃₃₃ remains unchanged, because it is not affected by the missing components. The iteration between the prediction and estimation steps continues until the elements of μ̃ and Σ̃ remain essentially unchanged. Calculations of this sort are easily handled with a computer.
Once final estimates μ̃ and Σ̃ are obtained and relatively few missing components occur in X, it seems reasonable to treat

all μ such that n(μ̃ − μ)′Σ̃⁻¹(μ̃ − μ) ≤ χₚ²(α)     (5-41)

as an approximate 100(1 − α)% confidence ellipsoid. The simultaneous confidence statements would then follow as in Section 5.5, but with x̄ replaced by μ̃ and S replaced by Σ̃.
Caution. The prediction-estimation algorithm we discussed is developed on the basis that component observations are missing at random. If missing values are related to the response levels, then handling the missing values as suggested may introduce serious biases into the estimation procedures. Typically, missing values are related to the responses being measured. Consequently, we must be dubious of any computational scheme that fills in values as if they were lost at random. When more than a few values are missing, it is imperative that the investigator search for the systematic causes that created them.
Xₜ − μ = Φ(Xₜ₋₁ − μ) + εₜ     (5-42)

where the εₜ are independent and identically distributed with E[εₜ] = 0 and Cov(εₜ) = Σ_ε, and all of the eigenvalues of the coefficient matrix Φ are between −1 and 1. Under this model Cov(Xₜ, Xₜ₋ⱼ) = Φʲ Σ_X, where

Σ_X = Σⱼ₌₀^∞ Φʲ Σ_ε Φ′ʲ

The AR(1) model (5-42) relates the observation at time t to the observation at time t − 1 through the coefficient matrix Φ. Further, the autoregressive model says the observations are independent, under multivariate normality, if all the entries in the coefficient matrix Φ are 0. The name autoregressive model comes from the fact that (5-42) looks like a multivariate version of a regression with Xₜ as the dependent variable and the previous value Xₜ₋₁ as the independent variable.
S = (1/(n − 1)) Σₜ₌₁ⁿ (Xₜ − X̄)(Xₜ − X̄)′ → Σ_X

where the arrow above indicates convergence in probability, and

Cov(n^(−1/2) Σₜ₌₁ⁿ Xₜ) → (I − Φ)⁻¹Σ_X + Σ_X(I − Φ′)⁻¹ − Σ_X     (5-43)
Moreover, for large n, √n(X̄ − μ) is approximately normal with mean 0 and covariance matrix given by (5-43). To make the calculations easy, suppose the underlying process has Φ = φI, where |φ| < 1. Now consider the large sample nominal 95% confidence ellipsoid for μ,

{all μ such that n(X̄ − μ)′S⁻¹(X̄ − μ) ≤ χₚ²(.05)}

This ellipsoid has large sample coverage probability .95 if the observations are independent. If the observations are related by our autoregressive model, however, this ellipsoid has large sample coverage probability

P[χₚ² ≤ ((1 − φ)/(1 + φ)) χₚ²(.05)]
Table 5.10 shows how the coverage probability is related to the coefficient φ and the number of variables p. According to Table 5.10, the coverage probability can drop very low, to .632, even for the bivariate case. The independence assumption is crucial, and the results based on this assumption can be very misleading if the observations are, in fact, dependent.
Table 5.10 Coverage Probability of the Nominal 95% Confidence Ellipsoid
Supplement 5A  Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids
which extends from 0 along u with length c√(u′Au)/√(u′u). When u is a unit vector, the shadow extends c√(u′Au) units, so |z′u| ≤ c√(u′Au). The shadow also extends c√(u′Au) units in the −u direction.

Proof. By Definition 2A.12, the projection of any z on u is given by (z′u)u/u′u. Its squared length is (z′u)²/u′u. We want to maximize this shadow over all z with z′A⁻¹z ≤ c². The extended Cauchy-Schwarz inequality in (2-49) states that (b′d)² ≤ (b′Bb)(d′B⁻¹d), with equality when b = kB⁻¹d. Setting b = z, d = u, and B = A⁻¹, we obtain

(u′u)(length of projection)² = (z′u)² ≤ (z′A⁻¹z)(u′Au) ≤ c²u′Au   for all z with z′A⁻¹z ≤ c²

The choice z = cAu/√(u′Au) yields equalities and thus gives the maximum shadow, besides belonging to the boundary of the ellipsoid. That is, z′A⁻¹z = c²u′Au/(u′Au) = c² for this z that provides the longest shadow. Consequently, the projection of the
ellipsoid on u is c√(u′Au) u/(u′u), and its length is c√(u′Au)/√(u′u). With the unit vector e_u = u/√(u′u), the projection extends

√(c² e_u′ A e_u) = c√(u′Au)/√(u′u)   units along u

The projection of the ellipsoid also extends the same length in the direction −u.
Result 5A.2. Suppose that the ellipsoid {z: z′A⁻¹z ≤ c²} is given and that U = [u₁ ⋮ u₂] is arbitrary but of rank two. Then

z in the ellipsoid based on A⁻¹ and c²   implies that   U′z is in the ellipse based on (U′AU)⁻¹ and c²

Proof. We first establish a basic inequality. Set P = A^(1/2)U(U′AU)⁻¹U′A^(1/2), where A = A^(1/2)A^(1/2). Note that P = P′ and P² = P, so (I − P)P′ = P − P² = 0. Next, using A⁻¹ = A^(−1/2)A^(−1/2), we write z′A⁻¹z = (A^(−1/2)z)′(A^(−1/2)z) and A^(−1/2)z = PA^(−1/2)z + (I − P)A^(−1/2)z. Then

z′A⁻¹z = (PA^(−1/2)z)′(PA^(−1/2)z) + ((I − P)A^(−1/2)z)′((I − P)A^(−1/2)z) ≥ (PA^(−1/2)z)′(PA^(−1/2)z) = z′U(U′AU)⁻¹U′z

Since z′A⁻¹z ≤ c², we conclude that (U′z)′(U′AU)⁻¹(U′z) ≤ c².
Our next result establishes the two-dimensional confidence ellipse as a projection of the p-dimensional ellipsoid. (See Figure 5.13.)

Figure 5.13 The shadow of the ellipsoid z′A⁻¹z ≤ c² on the u₁, u₂ plane is an ellipse.
Projection on a plane is simplest when the two vectors u₁ and u₂ determining the plane are first converted to perpendicular vectors of unit length. (See Result 2A.3.)
Result 5A.3. Given the ellipsoid {z: z′A⁻¹z ≤ c²} and two perpendicular unit vectors u₁ and u₂, the projection (or shadow) of {z′A⁻¹z ≤ c²} on the u₁, u₂ plane results in the two-dimensional ellipse {(U′z)′(U′AU)⁻¹(U′z) ≤ c²}, where U = [u₁ ⋮ u₂].

Proof. By Result 2A.3, the projection of a vector z on the u₁, u₂ plane is

(u₁′z)u₁ + (u₂′z)u₂ = [u₁ ⋮ u₂][u₁′z; u₂′z] = UU′z

The projection of the ellipsoid {z: z′A⁻¹z ≤ c²} consists of all UU′z with z′A⁻¹z ≤ c². Consider the two coordinates U′z of the projection UU′z. Let z belong to the set {z: z′A⁻¹z ≤ c²} so that UU′z belongs to the shadow of the ellipsoid. By Result 5A.2,

(U′z)′(U′AU)⁻¹(U′z) ≤ c²

so the ellipse {(U′z)′(U′AU)⁻¹(U′z) ≤ c²} contains the coefficient vectors for the shadow of the ellipsoid.
Let Ua be a vector in the u₁, u₂ plane whose coefficients a belong to the ellipse {a′(U′AU)⁻¹a ≤ c²}. If we set z = AU(U′AU)⁻¹a, it follows that

U′z = U′AU(U′AU)⁻¹a = a

and

z′A⁻¹z = a′(U′AU)⁻¹U′A A⁻¹ AU(U′AU)⁻¹a = a′(U′AU)⁻¹a ≤ c²

Thus, U′z belongs to the coefficient vector ellipse, and z belongs to the ellipsoid z′A⁻¹z ≤ c². Consequently, the ellipse contains only coefficient vectors from the projection of {z: z′A⁻¹z ≤ c²} onto the u₁, u₂ plane.
Remark. Projecting the ellipsoid z′A⁻¹z ≤ c² first to the u₁, u₂ plane and then to the line u₁ is the same as projecting it directly to the line determined by u₁. In the context of confidence ellipsoids, the shadows of the two-dimensional ellipses give the single component intervals.
Remark. Results 5A.2 and 5A.3 remain valid if U = [u₁, ..., u_q] consists of 2 < q ≤ p linearly independent columns.
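Result 5A.1 can be checked numerically. The sketch below parametrizes the boundary of a two-dimensional ellipsoid as z = c A^(1/2) w with w on the unit circle, so that z′A⁻¹z = c², and compares the observed half-length of the shadow along a unit vector u with the predicted value c√(u′Au):

```python
import numpy as np

# A positive definite matrix, a scale c, and a unit direction u
A = np.array([[4.0, 1.0], [1.0, 2.0]])
c = 2.0
u = np.array([1.0, 0.0])

# Symmetric square root of A via its spectral decomposition
vals, vecs = np.linalg.eigh(A)
A_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

# Boundary points of {z : z'A^{-1}z = c^2}
theta = np.linspace(0.0, 2.0 * np.pi, 100_000)
W = np.vstack([np.cos(theta), np.sin(theta)])
Z = c * A_half @ W

shadow = np.abs(u @ Z).max()          # observed half-length of the projection
predicted = c * np.sqrt(u @ A @ u)    # c * sqrt(u'Au) from Result 5A.1
```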
Exercises
5.1. (a) Evaluate T², for testing H₀: μ′ = [7, 11], using the data

X = [ 2  12
      8   9
      6   9
      8  10 ]

(b) Specify the distribution of T² for the situation in (a).
(c) Using (a) and (b), test H₀ at the α = .05 level. What conclusion do you reach?

5.2. Using the data in Example 5.1, verify that T² remains unchanged if each observation xⱼ, j = 1, 2, 3, is replaced by Cxⱼ, where

C = [ 1  −1
      1   1 ]
5.3. (a) Use expression (5-15) to evaluate T² for the data in Exercise 5.1. (b) Use the data in Exercise 5.1 to evaluate Λ in (5-13). Also, evaluate Wilks' lambda.
5.4. Use the sweat data in Table 5.1. (See Example 5.2.) (a) Determine the axes of the 90% confidence ellipsoid for μ. Determine the lengths of these axes. (b) Construct QQ plots for the observations on sweat rate, sodium content, and potassium content, respectively. Construct the three possible scatter plots for pairs of observations. Does the multivariate normal assumption seem justified in this case? Comment.
5.5. The quantities x̄, S, and S⁻¹ are given in Example 5.3 for the transformed microwave-radiation data. Conduct a test of the null hypothesis H₀: μ′ = [.55, .60] at the α = .05 level of significance. Is your result consistent with the 95% confidence ellipse for μ pictured in Figure 5.1? Explain.
5.6. Verify the Bonferroni inequality in (5-28) for m = 3. Hint: A Venn diagram for the three events C₁, C₂, and C₃ may help.
5.7. Use the sweat data in Table 5.1. (See Example 5.2.) Find simultaneous 95% T² confidence intervals for μ₁, μ₂, and μ₃ using Result 5.3. Construct the 95% Bonferroni intervals using (5-29). Compare the two sets of intervals.
5.8. From (5-23), we know that T² is equal to the largest squared univariate t-value constructed from the linear combination a′xⱼ with a = S⁻¹(x̄ − μ₀). Using the results in Example 5.3 and the H₀ in Exercise 5.5, evaluate a for the transformed microwave-radiation data. Verify that the t² value computed with this a is equal to T² in Exercise 5.5.
5.9. Harry Roberts, a naturalist for the Alaska Fish and Game Department, studies grizzly bears with the goal of maintaining a healthy population. Measurements on n = 61 bears provided the following summary statistics (see also Exercise 8.23):

Variable        Weight   Body length   Neck   Girth   Head length   Head width
                (kg)     (cm)          (cm)   (cm)    (cm)          (cm)
Sample mean x̄  95.52    164.38        55.69  93.39   17.98         31.13

Covariance matrix

S = [ 3266.46  1343.97   731.54  1175.50  162.68  238.37
      1343.97   721.91   324.25   537.35   80.17  117.73
       731.54   324.25   179.28   281.17   39.15   56.80
      1175.50   537.35   281.17   474.98   63.73   94.85
       162.68    80.17    39.15    63.73    9.95   13.88
       238.37   117.73    56.80    94.85   13.88   21.26 ]
(a) Obtain the large sample 95% simultaneous confidence intervals for the six population mean body measurements. (b) Obtain the large sample 95% simultaneous confidence ellipse for mean weight and mean girth. (c) Obtain the 95% Bonferroni confidence intervals for the six means in Part a. (d) Refer to Part b. Construct the 95% Bonferroni confidence rectangle for the mean weight and mean girth using m = 6. Compare this rectangle with the confidence ellipse in Part b. (e) Obtain the 95% Bonferroni confidence interval for mean head width minus mean head length using m = 6 + 1 = 7 to allow for this statement as well as statements about each individual mean.
5.10. Refer to the bear growth data in Example 1.10 (see Table 1.4). Restrict your attention to the measurements of length. (a) Obtain the 95% T² simultaneous confidence intervals for the four population means for length. (b) Refer to Part a. Obtain the 95% T² simultaneous confidence intervals for the three successive yearly increases in mean length. (c) Obtain the 95% T² confidence ellipse for the mean increase in length from 2 to 3 years and the mean increase in length from 4 to 5 years.
Exercises 263 (d) Refer to Parts a and b. Construct the 95% Bonferroni confidence intervals for the set consisting of four mean lengths and three successive yearly increases in mean length. (e) Refer to Parts c and d. Compare the 95% Bonferroni confidence rectangle for the mean increase in length from 2 to 3 years and the mean increase in length from 4 to 5 years with the confidence ellipse produced by the T 2 procedure.
5.1 1. A physical anthropologist performed a mineral analysis of nine ancient Peruvian hairs. The results for the chromium (xJ) and strontium (x 2 ) levels, in parts per million (ppm), were as follows:
x₁ (Cr):    .48   40.53    2.19    .55    .74   .66   .93   .37   .22
x₂ (Sr):  12.57   73.68   11.13  20.03  20.29   .78  4.64   .43  1.08
Source: Benfer and others, "Mineral Analysis of Ancient Peruvian Hair," American Journal of Physical Anthropology, 48, no. 3 (1978), 277–282.
It is known that low levels (less than or equal to .100 ppm) of chromium suggest the presence of diabetes, while strontium is an indication of animal protein intake.
(a) Construct and plot a 90% joint confidence ellipse for the population mean vector μ′ = [μ₁, μ₂], assuming that these nine Peruvian hairs represent a random sample from individuals belonging to a particular ancient Peruvian culture.
(b) Obtain the individual simultaneous 90% confidence intervals for μ₁ and μ₂ by "projecting" the ellipse constructed in Part a on each coordinate axis. (Alternatively, we could use Result 5.3.) Does it appear as if this Peruvian culture has a mean strontium level of 10? That is, are any of the points (μ₁ arbitrary, 10) in the confidence regions? Is [.30, 10]′ a plausible value for μ? Discuss.
(c) Do these data appear to be bivariate normal? Discuss their status with reference to Q–Q plots and a scatter diagram. If the data are not bivariate normal, what implications does this have for the results in Parts a and b?
(d) Repeat the analysis with the obvious "outlying" observation removed. Do the inferences change? Comment.
5.12. Given the data with missing components, use the prediction–estimation algorithm of Section 5.7 to estimate μ and Σ. Determine the initial estimates, and iterate to find the first revised estimates.
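The prediction–estimation (EM) algorithm alternates between predicting the missing components from the current parameter estimates and revising those estimates. Below is a minimal one-pass sketch for a bivariate sample with one missing value; the data values are invented for illustration and the initial estimates are formed from the available values.

```python
# One E-step/M-step pass of the prediction-estimation (EM) algorithm for a
# bivariate sample in which the second component of one observation is missing.
# The data are invented for illustration.
x1 = [3.0, 4.0, 5.0]      # fully observed first component
x2 = [5.0, None, 7.0]     # second component; observation 2 is missing

n = len(x1)

# Initial estimates from the available values (maximum likelihood divisors
# for x1; available-case divisors for the x2 pieces).
mu1 = sum(x1) / n
obs2 = [v for v in x2 if v is not None]
mu2 = sum(obs2) / len(obs2)
s11 = sum((v - mu1) ** 2 for v in x1) / n
pairs = [(a, b) for a, b in zip(x1, x2) if b is not None]
s12 = sum((a - mu1) * (b - mu2) for a, b in pairs) / len(pairs)

# E-step: predict the missing x2 by its estimated conditional mean given x1,
#   mu2 + (s12 / s11) * (x1 - mu1).
x2_tilde = [b if b is not None else mu2 + (s12 / s11) * (a - mu1)
            for a, b in zip(x1, x2)]

# M-step: the revised estimate of mu2 averages observed and predicted values.
mu2_rev = sum(x2_tilde) / n
```

A full iteration would also carry the conditional variance of the missing component into the revised estimate of Σ and repeat until the estimates stabilize.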
5.13. Determine the approximate distribution of n ln(|Σ̂₀|/|Σ̂|) for the data in Table 5.1. (See Result 5.2.)
5.14. Create a table similar to Table 5.4 using the entries (length of one-at-a-time t-interval)/(length of Bonferroni t-interval).
Frequently, some or all of the population characteristics of interest are in the form of attributes. Each individual in the population may then be described in terms of the attributes it possesses. For convenience, attributes are usually numerically coded with respect to their presence or absence. If we let the variable X pertain to a specific attribute, then we can distinguish between the presence or absence of this attribute by defining X = 1 if the attribute is present and X = 0 if the attribute is absent.

More generally, suppose each individual is classified into one, and only one, of q + 1 mutually exclusive categories. An observation Xⱼ is then a (q + 1) × 1 vector whose kth component is 1, and whose remaining components are 0, when the observation belongs to category k. The category proportions p₁, p₂, …, p_q, p_{q+1} satisfy

p₁ + p₂ + ⋯ + p_q + p_{q+1} = ∑_{i=1}^{q+1} pᵢ = 1
The kth component, Xⱼₖ, of Xⱼ is 1 if the observation (individual) is from category k and is 0 otherwise. The random sample X₁, X₂, …, Xₙ can be converted to a sample proportion vector, which, given the nature of the preceding observations, is a sample mean vector. Thus,
p̂ = [p̂₁, p̂₂, …, p̂_{q+1}]′ = (1/n) ∑_{j=1}^{n} Xⱼ
For large n, the approximate sampling distribution of p̂ is provided by the central limit theorem. We have the following result.
Result. Let X₁, X₂, …, Xₙ be a random sample from a (q + 1)-category multinomial distribution with P[Xⱼₖ = 1] = pₖ, k = 1, 2, …, q + 1, j = 1, 2, …, n. Approximate simultaneous 100(1 − α)% confidence regions for all linear combinations a′p = a₁p₁ + a₂p₂ + ⋯ + a_{q+1}p_{q+1} are given by the observed values of

a′p̂ ± √(χ²_q(α)) √(a′Σ̂a / n)

where Σ̂ = {σ̂ᵢₖ} is the sample covariance matrix with σ̂ₖₖ = p̂ₖ(1 − p̂ₖ) and σ̂ᵢₖ = −p̂ᵢp̂ₖ, i ≠ k. Also, χ²_q(α) is the upper (100α)th percentile of the chi-square distribution with q d.f.

In this result, the requirement that n − q is large is interpreted to mean that np̂ₖ is about 20 or more for each category. We have only touched on the possibilities for the analysis of categorical data. Complete discussions of categorical data analysis are available in [1] and [4].
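A minimal sketch of the intervals in the Result, with a′ picking off one proportion at a time (so a′Σ̂a/n reduces to p̂ₖ(1 − p̂ₖ)/n). The counts are invented, and the chi-square critical value for q = 3 is taken from a standard table.

```python
import math

# Simultaneous intervals  p_hat_k +/- sqrt(chi2_q(alpha)) *
# sqrt(p_hat_k * (1 - p_hat_k) / n),  the a'p intervals of the Result with
# a selecting one category at a time. The counts below are invented.
counts = [40, 30, 20, 10]          # q + 1 = 4 categories
n = sum(counts)
q = len(counts) - 1
chi2_crit = 7.81                   # chi-square(3) upper .05 point

intervals = []
for c in counts:
    p_hat = c / n
    half = math.sqrt(chi2_crit) * math.sqrt(p_hat * (1 - p_hat) / n)
    intervals.append((p_hat - half, p_hat + half))
```

A contrast such as p₁ − p₂ would instead use a′ = (1, −1, 0, 0), with a′Σ̂a assembled from the σ̂ᵢₖ defined above.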
5.15. Let Xⱼᵢ and Xⱼₖ be the ith and kth components, respectively, of Xⱼ.
(a) Show that μᵢ = E(Xⱼᵢ) = pᵢ and σᵢᵢ = Var(Xⱼᵢ) = pᵢ(1 − pᵢ), i = 1, 2, …, q + 1.
(b) Show that σᵢₖ = Cov(Xⱼᵢ, Xⱼₖ) = −pᵢpₖ, i ≠ k. Why must this covariance necessarily be negative?
5.16. As part of a larger marketing research project, a consultant for the Bank of Shorewood wants to know the proportion of savers that uses the bank's facilities as their primary vehicle for saving. The consultant would also like to know the proportions of savers who use the three major competitors: Bank B, Bank C, and Bank D. Each individual contacted in a survey responded to the following question:
Chapter 5 Inferences about a Mean Vector

Which bank is your primary savings bank?

Response: Bank of Shorewood | Bank B | Bank C | Bank D | Another bank | No savings
A sample of n = 355 people with savings accounts produced the following counts when asked to indicate their primary savings banks (the people with no savings will be ignored in the comparison of savers, so there are five categories):
Bank (category)      Observed number   Population proportion   Observed sample proportion
Bank of Shorewood    105               p₁                      p̂₁ = 105/355 = .30
Bank B               119               p₂                      p̂₂ = 119/355
Bank C               56                p₃                      p̂₃ = 56/355
Bank D               25                p₄                      p̂₄ = 25/355
Another bank         50                p₅                      p̂₅ = 50/355
Total                355
p₅ = 1 − (p₁ + p₂ + p₃ + p₄) = proportion of savers at other banks
(a) Construct simultaneous 95% confidence intervals for p₁, p₂, …, p₅.
(b) Construct a simultaneous 95% confidence interval that allows a comparison of the Bank of Shorewood with its major competitor, Bank B. Interpret this interval.
5.17. In order to assess the prevalence of a drug problem among high school students in a particular city, a random sample of 200 students from the city's five high schools were surveyed. One of the survey questions and the corresponding responses are as follows:
What is your typical weekly marijuana usage?

Category              None   Moderate (1–3 joints)   Heavy (4 or more joints)
Number of responses   117    62                      21

Construct 95% simultaneous confidence intervals for the three proportions p₁, p₂, and p₃ = 1 − (p₁ + p₂).
The following exercises may require a computer.

5.18. Use the college test data in Table 5.2. (See Example 5.5.)
(a) Test the null hypothesis H₀: μ′ = [500, 50, 30] versus H₁: μ′ ≠ [500, 50, 30] at the α = .05 level of significance. Suppose [500, 50, 30]′ represent average scores for thousands of college students over the last 10 years. Is there reason to believe that the group of students represented by the scores in Table 5.2 is scoring differently? Explain.
(b) Determine the lengths and directions for the axes of the 95% confidence ellipsoid for μ.
(c) Construct Q–Q plots from the marginal distributions of social science and history, verbal, and science scores. Also, construct the three possible scatter diagrams from the pairs of observations on different variables. Do these data appear to be normally distributed? Discuss.
5.19. Measurements of x₁ = stiffness and x₂ = bending strength for a sample of n = 30 pieces of a particular grade of lumber are given in Table 5.11. The units are pounds/(inches)². Using the data in the table,

Table 5.11 Lumber Data

x₁ (Stiffness)   x₂ (Bending strength)   x₁ (Stiffness)   x₂ (Bending strength)
1232             4175                    1712             7749
1115             6652                    1932             6818
2205             7612                    1820             9307
1897             10,914                  1900             6457
1932             10,850                  2426             10,102
1612             7627                    1558             7414
1598             6954                    1470             7556
1804             8365                    1858             7833
1752             9469                    1587             8309
2067             6410                    2208             9559
2365             10,327                  1487             6255
1646             7320                    2206             10,723
1579             8196                    2332             5430
1880             9709                    2540             12,090
1773             10,370                  2322             10,072
(a) Construct and sketch a 95% confidence ellipse for the pair [μ₁, μ₂]′, where μ₁ = E(X₁) and μ₂ = E(X₂).
(b) Suppose μ₁₀ = 2000 and μ₂₀ = 10,000 represent "typical" values for stiffness and bending strength, respectively. Given the result in (a), are the data in Table 5.11 consistent with these values? Explain.
(c) Is the bivariate normal distribution a viable population model? Explain with reference to Q–Q plots and a scatter diagram.
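The axes of the 95% confidence ellipse for a bivariate mean run along the eigenvectors of S, with half-lengths √λᵢ √[p(n − 1)F_{p,n−p}(α)/(n(n − p))]. A sketch with an invented 2 × 2 covariance matrix and sample size follows; the F critical value is taken from a standard table.

```python
import math

# Half-lengths of the 95% confidence ellipse axes for a bivariate mean:
#   sqrt(lambda_i) * sqrt(p * (n - 1) * F_{p, n-p}(alpha) / (n * (n - p)))
# where lambda_i are the eigenvalues of S. The inputs below are invented.
S = [[2.0, 1.0],
     [1.0, 2.0]]
n, p = 30, 2
f_crit = 3.34                       # F_{2,28} upper .05 point

# Eigenvalues of a symmetric 2x2 matrix via the quadratic formula.
a, b, c = S[0][0], S[0][1], S[1][1]
m = (a + c) / 2.0
r = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
lam = [m + r, m - r]                # largest first

scale = math.sqrt(p * (n - 1) * f_crit / (n * (n - p)))
half_lengths = [math.sqrt(l) * scale for l in lam]
```

The ratio of the half-lengths, √(λ₁/λ₂), describes the elongation of the ellipse, which is what part (a) of exercises like 5.19 asks you to sketch.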
5.20. A wildlife ecologist measured x₁ = tail length (in millimeters) and x₂ = wing length (in millimeters) for a sample of n = 45 female hook-billed kites. These data are displayed in Table 5.12. Using the data in the table,
Table 5.12 Bird Data

x₁ (Tail length)  x₂ (Wing length)  x₁ (Tail length)  x₂ (Wing length)  x₁ (Tail length)  x₂ (Wing length)
191               284               186               266               173               262
197               285               197               285               194               258
208               288               201               295               198
180               273               190               282               180
180               275               209               305               190
188               280               187               285               191
210               283               207               297               196
196               288               178               268               207
191               271               202               271               209
179               257               205               285               179
208               289               190               280               186
202               285               189               277               174
200               272               211               310               181
192               282               216               305               189
199               280               189               274               188
(a) Find and sketch the 95% confidence ellipse for the population means μ₁ and μ₂. Suppose it is known that μ₁ = 190 mm and μ₂ = 275 mm for male hook-billed kites. Are these plausible values for the mean tail length and mean wing length for the female birds? Explain.
(b) Construct the simultaneous 95% T²-intervals for μ₁ and μ₂ and the 95% Bonferroni intervals for μ₁ and μ₂. Compare the two sets of intervals. What advantage, if any, do the T²-intervals have over the Bonferroni intervals?
(c) Is the bivariate normal distribution a viable population model? Explain with reference to Q–Q plots and a scatter diagram.
5.21. Using the data on bone mineral content in Table 1.8, construct the 95% Bonferroni intervals for the individual means. Also, find the 95% simultaneous T² intervals. Compare the two sets of intervals.
5.22. A portion of the data contained in Table 6.10 in Chapter 6 is reproduced in Table 5.13.
These data represent various costs associated with transporting milk from farms to dairy plants for gasoline trucks. Only the first 25 multivariate observations for gasoline trucks are given. Observations 9 and 21 have been identified as outliers from the full data set of 36 observations. (See [2].)
Table 5.13 Milk Transportation-Cost Data

Fuel (x₁)   Repair (x₂)   Capital (x₃)
16.44       12.43         11.23
7.19        2.70          3.92
9.92        1.35          9.75
4.24        5.78          7.78
11.20       5.05          10.67
14.25       5.78          9.88
13.50       10.98         10.60
13.32       14.27         9.45
29.11       15.09         3.28
12.68       7.61          10.23
7.51        5.80          8.13
9.90        3.63          9.13
10.25       5.07          10.17
11.11       6.15          7.61
12.17       14.26         14.39
10.24       2.59          6.09
10.18       6.05          12.14
8.88        2.70          12.23
12.34       7.73          11.68
8.51        14.02         12.01
26.16       17.44         16.89
12.95       8.24          7.18
16.93       13.37         17.59
14.70       10.78         14.58
10.32       5.16          17.00
(a) Construct Q–Q plots of the marginal distributions of fuel, repair, and capital costs. Also, construct the three possible scatter diagrams from the pairs of observations on different variables. Are the outliers evident? Repeat the Q–Q plots and the scatter diagrams with the apparent outliers removed. Do the data now appear to be normally distributed? Discuss.
(b) Construct 95% Bonferroni intervals for the individual cost means. Also, find the 95% T²-intervals. Compare the two sets of intervals.
5.23. Consider the 30 observations on male Egyptian skulls for the first time period given in Table 6.13 on page 349.
(a) Construct Q–Q plots of the marginal distributions of the maxbreath, basheight, baslength, and nasheight variables. Also, construct a chi-square plot of the multivariate observations. Do these data appear to be normally distributed? Explain.
(b) Construct 95% Bonferroni intervals for the individual skull dimension variables. Also, find the 95% T²-intervals. Compare the two sets of intervals.

5.24. Using the Madison, Wisconsin, Police Department data in Table 5.8, construct individual X̄ charts for x₃ = holdover hours and x₄ = COA hours. Do these individual process characteristics seem to be in control? (That is, are they stable?) Comment.
5.25. Refer to Exercise 5.24. Using the data on the holdover and COA overtime hours, construct a quality ellipse and a T² chart. Does the process represented by the bivariate observations appear to be in control? (That is, is it stable?) Comment. Do you learn something from the multivariate control charts that was not apparent in the individual X̄ charts?
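For a T² chart, each observation is scored as T² = (x − x̄)′S⁻¹(x − x̄) and compared with a chi-square upper control limit. The sketch below checks one bivariate point against a 99% limit, χ²₂(.01) = 9.21; the summary statistics and the observation are invented.

```python
# T^2 chart score for one bivariate observation:
#   T^2 = (x - xbar)' S^{-1} (x - xbar),
# flagged if it exceeds the chosen chi-square control limit with 2 d.f.
# The summary statistics and observation below are invented.
xbar = [5.0, 10.0]
S = [[4.0, 1.0],
     [1.0, 9.0]]
x = [9.0, 13.0]

# Explicit inverse of the 2x2 covariance matrix.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det, S[0][0] / det]]

d = [x[0] - xbar[0], x[1] - xbar[1]]
t2 = (d[0] * (S_inv[0][0] * d[0] + S_inv[0][1] * d[1])
      + d[1] * (S_inv[1][0] * d[0] + S_inv[1][1] * d[1]))

UCL = 9.21                 # chi-square(2) upper .01 point
out_of_control = t2 > UCL
```

Repeating this for every observation and plotting the T² values against the UCL gives the chart; the quality ellipse is the companion picture in the original (x₁, x₂) coordinates.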
5.26. Construct a T²-chart using the data on x₁ = legal appearances overtime hours, x₂ = extraordinary event overtime hours, and x₃ = holdover overtime hours from Table 5.8. Compare this chart with the chart in Figure 5.8 of Example 5.10. Does plotting T² with an additional characteristic change your conclusion about process stability? Explain.
5.27. Using the data on x₃ = holdover hours and x₄ = COA hours from Table 5.8, construct a prediction ellipse for a future observation x′ = (x₃, x₄). Remember, a prediction ellipse should be calculated from a stable process. Interpret the result.
5.28. As part of a study of its sheet metal assembly process, a major automobile manufacturer uses sensors that record the deviation from the nominal thickness (millimeters) at six locations on a car. The first four are measured when the car body is complete and the last two are measured on the underbody at an earlier stage of assembly. Data on 50 cars are given in Table 5.14.
(a) The process seems stable for the first 30 cases. Use these cases to estimate x̄ and S. Then construct a T² chart using all of the variables. Include all 50 cases.
(b) Which individual locations seem to show a cause for concern?
5.29. Refer to the car body data in Exercise 5.28. These are all measured as deviations from target value, so it is appropriate to test the null hypothesis that the mean vector is zero. Using the first 30 cases, test H₀: μ = 0 at α = .05.
5.30. Refer to the data on energy consumption in Exercise 3.18.
(a) Obtain the large sample 95% Bonferroni confidence intervals for the mean consumption of each of the four types, the total of the four, and the difference, petroleum minus natural gas.
(b) Obtain the large sample 95% simultaneous T² intervals for the mean consumption of each of the four types, the total of the four, and the difference, petroleum minus natural gas. Compare with your results for Part a.

5.31. Refer to the data on snow storms in Exercise 3.20.
(a) Find a 95% confidence region for the mean vector after taking an appropriate transformation.
(b) On the same scale, find the 95% Bonferroni confidence intervals for the two component means.
TABLE 5.14 Car Body Assembly Data (50 observations, index 1–50; variables listed column by column)

x₁: 0.12 0.60 0.13 0.46 0.46 0.46 0.46 0.13 0.31 0.37 1.08 0.42 0.31 0.14 0.61 0.61 0.84 0.96 0.90 0.46 0.90 0.61 0.61 0.46 0.60 0.60 0.31 0.60 0.31 0.36 0.40 0.60 0.47 0.46 0.44 0.90 0.50 0.38 0.60 0.11 0.05 0.85 0.37 0.11 0.60 0.84 0.46 0.56 0.56 0.25
x₂: 0.36 0.35 0.05 0.37 0.24 0.16 0.24 0.05 0.16 0.24 0.83 0.30 0.10 0.06 0.35 0.30 0.35 0.85 0.34 0.36 0.59 0.50 0.20 0.30 0.35 0.36 0.35 0.25 0.25 0.16 0.12 0.40 0.16 0.18 0.12 0.40 0.35 0.08 0.35 0.24 0.12 0.65 0.10 0.24 0.24 0.59 0.16 0.35 0.16 0.12
x₃: 0.40 0.04 0.84 0.30 0.37 0.07 0.13 0.01 0.20 0.37 0.81 0.37 0.24 0.18 0.24 0.20 0.14 0.19 0.78 0.24 0.13 0.34 0.58 0.10 0.45 0.34 0.45 0.42 0.34 0.15 0.48 0.20 0.34 0.16 0.20 0.75 0.84 0.55 0.35 0.15 0.85 0.50 0.10 0.75 0.13 0.05 0.37 0.10 0.37 0.05
x₄: 0.25 0.28 0.61 0.00 0.13 0.10 0.02 0.09 0.23 0.21 0.05 0.58 0.24 0.50 0.75 0.21 0.22 0.18 0.15 0.58 0.13 0.58 0.20 0.10 0.37 0.11 0.10 0.28 0.24 0.38 0.34 0.32 0.31 0.01 0.48 0.31 0.52 0.15 0.34 0.40 0.55 0.35 0.58 0.10 0.84 0.61 0.15 0.75 0.25 0.20
x₅: 0.25 0.15 0.60 0.95 1.10 0.75 1.18 1.68 1.00 0.75 0.65 1.18 0.30 0.50 0.85 0.60 1.40 0.60 0.35 0.80 0.60 0.00 1.65 0.80 1.85 0.65 0.85 1.00 0.68 0.45 1.05 1.21 1.37 0.25 1.45 0.12 0.78 1.15 0.26 0.15 0.65 1.15 0.21 0.00 0.65 1.25 0.15 0.50 1.65 1.00
x₆: 0.13 0.15 0.25 0.25 0.15 0.18 0.20 0.18 0.15 0.05 0.00 0.45 0.35 0.05 0.20 0.25 0.05 0.08 0.25 0.25 0.08 0.08 0.00 0.10 0.30 0.32 0.25 0.10 0.10 0.10 0.20 0.10 0.60 0.35 0.10 0.10 0.75 0.10 0.85 0.10 0.10 0.21 0.11 0.10 0.15 0.20 0.25 0.20 0.15 0.10
References
1. Agresti, A. Categorical Data Analysis (2nd ed.). New York: John Wiley, 2002.
2. Bacon-Shone, J., and W. K. Fung. "A New Graphical Method for Detecting Single and Multiple Outliers in Univariate and Multivariate Data." Applied Statistics, 36, no. 2 (1987), 153–162.
3. Bickel, P. J., and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, Vol. I (2nd ed.). Upper Saddle River, NJ: Prentice Hall, 2000.
4. Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis: Theory and Practice (Paperback). Cambridge, MA: The MIT Press, 1977.
5. Dempster, A. P., N. M. Laird, and D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)." Journal of the Royal Statistical Society (B), 39, no. 1 (1977), 1–38.
6. Hartley, H. O. "Maximum Likelihood Estimation from Incomplete Data." Biometrics, 14 (1958), 174–194.
7. Hartley, H. O., and R. R. Hocking. "The Analysis of Incomplete Data." Biometrics, 27 (1971), 783–808.
8. Johnson, R. A., and T. Langeland. "A Linear Combinations Test for Detecting Serial Correlation in Multivariate Samples." Topics in Statistical Dependence (1991), Institute of Mathematical Statistics Monograph, Eds. Block, H., et al., 299–313.
9. Johnson, R. A., and R. Li. "Multivariate Statistical Process Control Schemes for Controlling a Mean." Springer Handbook of Engineering Statistics (2006), H. Pham, Ed. Berlin: Springer.
10. Ryan, T. P. Statistical Methods for Quality Improvement (2nd ed.). New York: John Wiley, 2000.
11. Tiku, M. L., and M. Singh. "Robust Statistics for Testing Mean Vectors of Multivariate Distributions." Communications in Statistics-Theory and Methods, 11, no. 9 (1982), 985–1001.
Chapter 6 Comparisons of Several Multivariate Means
after the treatment. In other situations, two or more treatments can be administered to the same or similar experimental units, and responses can be compared to assess the effects of the treatments.
One rational approach to comparing two treatments, or the presence and absence of a single treatment, is to assign both treatments to the same or identical units (individuals, stores, plots of land, and so forth). The paired responses may then be analyzed by computing their differences, thereby eliminating much of the influence of extraneous unit-to-unit variation.
In the single response (univariate) case, let Xⱼ₁ denote the response to treatment 1 (or the response before treatment), and let Xⱼ₂ denote the response to treatment 2 (or the response after treatment) for the jth trial. That is, (Xⱼ₁, Xⱼ₂) are measurements recorded on the jth unit or jth pair of like units. By design, the n differences
Dⱼ = Xⱼ₁ − Xⱼ₂,   j = 1, 2, …, n     (6-1)
should reflect only the differential effects of the treatments. Given that the differences Dⱼ in (6-1) represent independent observations from an N(δ, σ_d²) distribution, the variable
t = (D̄ − δ) / (s_d/√n)     (6-2)

where

D̄ = (1/n) ∑_{j=1}^{n} Dⱼ   and   s_d² = (1/(n − 1)) ∑_{j=1}^{n} (Dⱼ − D̄)²     (6-3)

has a t-distribution with n − 1 d.f. Consequently, an α-level test of

H₀: δ = 0   versus   H₁: δ ≠ 0
may be conducted by comparing |t| with t_{n−1}(α/2), the upper 100(α/2)th percentile of a t-distribution with n − 1 d.f. A 100(1 − α)% confidence interval for the mean difference δ = E(Xⱼ₁ − Xⱼ₂) is provided by the statement

d̄ − t_{n−1}(α/2) s_d/√n  ≤  δ  ≤  d̄ + t_{n−1}(α/2) s_d/√n     (6-4)
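The interval (6-4) takes only a few lines to compute. In the sketch below the differences are invented for illustration, and the critical value t₉(.025) = 2.262 comes from a standard t table.

```python
import math

# 95% confidence interval (6-4) for the mean difference of paired responses.
# The differences below are invented for illustration.
d = [2, -1, 3, 0, 1, 2, -2, 4, 1, 0]
n = len(d)

dbar = sum(d) / n
s2_d = sum((x - dbar) ** 2 for x in d) / (n - 1)   # sample variance (6-3)
t_crit = 2.262                                      # t_{9}(.025) from a table

half = t_crit * math.sqrt(s2_d / n)
lower, upper = dbar - half, dbar + half
```

Since the interval covers zero here, the two-sided test of H₀: δ = 0 at the 5% level would not reject for these invented differences.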
(For example, see [11].) Additional notation is required for the multivariate extension of the paired-comparison procedure. It is necessary to distinguish between p responses, two treatments, and n experimental units. We label the p responses within the jth unit as
X₁ⱼ₁ = variable 1 under treatment 1
X₁ⱼ₂ = variable 2 under treatment 1
⋮
X₁ⱼp = variable p under treatment 1
X₂ⱼ₁ = variable 1 under treatment 2
⋮
X₂ⱼp = variable p under treatment 2

and the p paired-difference random variables become

Dⱼ₁ = X₁ⱼ₁ − X₂ⱼ₁
Dⱼ₂ = X₁ⱼ₂ − X₂ⱼ₂
⋮
Dⱼp = X₁ⱼp − X₂ⱼp     (6-5)

Let Dⱼ′ = [Dⱼ₁, Dⱼ₂, …, Dⱼp], and assume, for j = 1, 2, …, n, that

E(Dⱼ) = δ = [δ₁, δ₂, …, δ_p]′   and   Cov(Dⱼ) = Σ_d     (6-6)
If, in addition, D₁, D₂, …, Dₙ are independent N_p(δ, Σ_d) random vectors, inferences about the vector of mean differences δ can be based upon a T²-statistic. Specifically,
T² = n(D̄ − δ)′ S_d⁻¹ (D̄ − δ)     (6-7)

where

D̄ = (1/n) ∑_{j=1}^{n} Dⱼ   and   S_d = (1/(n − 1)) ∑_{j=1}^{n} (Dⱼ − D̄)(Dⱼ − D̄)′     (6-8)
Result 6.1. Let the differences D₁, D₂, …, Dₙ be a random sample from an N_p(δ, Σ_d) population. Then

T² = n(D̄ − δ)′ S_d⁻¹ (D̄ − δ)

is distributed as an [(n − 1)p/(n − p)]F_{p,n−p} random variable, whatever the true δ and Σ_d. If n and n − p are both large, T² is approximately distributed as a χ²_p random variable, regardless of the form of the underlying population of differences.

Proof. The exact distribution of T² is a restatement of the summary in (5-6), with vectors of differences for the observation vectors. The approximate distribution of T², for n and n − p large, follows from (4-28). ■
The condition δ = 0 is equivalent to "no average difference between the two treatments." For the ith variable, δᵢ > 0 implies that treatment 1 is larger, on average, than treatment 2. In general, inferences about δ can be made using Result 6.1.
Given the observed differences dⱼ′ = [dⱼ₁, dⱼ₂, …, dⱼp], j = 1, 2, …, n, corresponding to the random variables in (6-5), an α-level test of H₀: δ = 0 versus H₁: δ ≠ 0 for an N_p(δ, Σ_d) population rejects H₀ if the observed

T² = n d̄′ S_d⁻¹ d̄ > [(n − 1)p/(n − p)] F_{p,n−p}(α)

where F_{p,n−p}(α) is the upper (100α)th percentile of an F-distribution with p and n − p d.f. Here d̄ and S_d are given by (6-8).
A 100(1 − α)% confidence region for δ consists of all δ such that

(d̄ − δ)′ S_d⁻¹ (d̄ − δ) ≤ [(n − 1)p/(n(n − p))] F_{p,n−p}(α)     (6-9)

Also, 100(1 − α)% simultaneous confidence intervals for the individual mean differences δᵢ are given by

δᵢ:   d̄ᵢ ± √[((n − 1)p/(n − p)) F_{p,n−p}(α)] √(s_{dᵢ}²/n)     (6-10)
where d̄ᵢ is the ith element of d̄ and s_{dᵢ}² is the ith diagonal element of S_d.
For n − p large, [(n − 1)p/(n − p)]F_{p,n−p}(α) ≈ χ²_p(α) and normality need not be assumed.
The Bonferroni 100(1 − α)% simultaneous confidence intervals for the individual mean differences are

δᵢ:   d̄ᵢ ± t_{n−1}(α/2p) √(s_{dᵢ}²/n)     (6-10a)

where t_{n−1}(α/2p) is the upper 100(α/2p)th percentile of a t-distribution with n − 1 d.f.
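The T²-test of H₀: δ = 0 described above reduces to a small matrix computation. A bivariate sketch with invented difference vectors follows; the critical value F_{2,3}(.05) = 9.55 is taken from a standard F table.

```python
# Paired-comparison statistic T^2 = n * dbar' S_d^{-1} dbar, for p = 2.
# The difference vectors below are invented for illustration.
d = [(1, 2), (3, 1), (2, 4), (0, 1), (4, 2)]
n, p = len(d), 2

dbar = [sum(x) / n for x in zip(*d)]

# Sample covariance matrix S_d of the differences (divisor n - 1).
s = [[0.0, 0.0], [0.0, 0.0]]
for row in d:
    dev = [row[0] - dbar[0], row[1] - dbar[1]]
    for i in range(2):
        for k in range(2):
            s[i][k] += dev[i] * dev[k] / (n - 1)

det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
s_inv = [[s[1][1] / det, -s[0][1] / det],
         [-s[1][0] / det, s[0][0] / det]]

t2 = n * (dbar[0] * (s_inv[0][0] * dbar[0] + s_inv[0][1] * dbar[1])
          + dbar[1] * (s_inv[1][0] * dbar[0] + s_inv[1][1] * dbar[1]))

crit = (n - 1) * p / (n - p) * 9.55   # [(n-1)p/(n-p)] F_{2,3}(.05)
reject = t2 > crit
```

With so few observations the critical constant is large, so even a sizable T² need not lead to rejection, which is why the examples in this chapter work with larger n.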
Example 6.1 (Checking for a mean difference with paired observations) Municipal wastewater treatment plants are required by law to monitor their discharges into rivers and streams on a regular basis. Concern about the reliability of data from one of these self-monitoring programs led to a study in which samples of effluent were divided and sent to two laboratories for testing. One half of each sample was sent to the Wisconsin State Laboratory of Hygiene, and one half was sent to a private commercial laboratory routinely used in the monitoring program. Measurements of biochemical oxygen demand (BOD) and suspended solids (SS) were obtained, for n = 11 sample splits, from the two laboratories. The data are displayed in Table 6.1.
Table 6.1 Effluent Data

            Commercial lab          State lab of hygiene
Sample j    x₁ (BOD)   x₂ (SS)      x₁ (BOD)   x₂ (SS)
1           6          27           25         15
2           6          23           28         13
3           18         64           36         22
4           8          44           35         29
5           11         30           15         31
6           34         75           44         64
7           28         26           42         30
8           71         124          54         64
9           43         54           34         56
10          33         30           29         20
11          20         14           39         21

Source: Data courtesy of S. Weber.
Do the two laboratories' chemical analyses agree? If differences exist, what is their nature?
The T²-statistic for testing H₀: δ′ = [δ₁, δ₂] = [0, 0] is constructed from the differences of paired observations:

dⱼ₁ = x₁ⱼ₁ − x₂ⱼ₁:   −19  −22  −18  −27   −4  −10  −14   17    9    4  −19
dⱼ₂ = x₁ⱼ₂ − x₂ⱼ₂:    12   10   42   15   −1   11   −4   60   −2   10   −7

Here

d̄ = [d̄₁, d̄₂]′ = [−9.36, 13.27]′   and   S_d = [ 199.26    88.38
                                                  88.38   418.61 ]

and

T² = n d̄′ S_d⁻¹ d̄ = 11[−9.36  13.27] [  .0055  −.0012 ] [ −9.36 ] = 13.6
                                      [ −.0012   .0026 ] [ 13.27 ]

Taking α = .05, we find that [p(n − 1)/(n − p)]F_{p,n−p}(.05) = [2(10)/9]F_{2,9}(.05) = 9.47. Since T² = 13.6 > 9.47, we reject H₀ and conclude that there is a nonzero mean difference between the measurements of the two laboratories. It appears, from inspection of the data, that the commercial lab tends to produce lower BOD measurements and higher SS measurements than the State Lab of Hygiene.
The 95% simultaneous confidence intervals for the mean differences δ₁ and δ₂ can be computed using (6-10). These intervals are

δ₁:   d̄₁ ± √[((n − 1)p/(n − p)) F_{p,n−p}(α)] √(s_{d₁}²/n) = −9.36 ± √9.47 √(199.26/11),   or   (−22.46, 3.74)
δ₂:   13.27 ± √9.47 √(418.61/11),   or   (−5.71, 32.25)

The 95% simultaneous confidence intervals include zero, yet the hypothesis H₀: δ = 0 was rejected at the 5% level. What are we to conclude?
The evidence points toward real differences. The point δ = 0 falls outside the 95% confidence region for δ (see Exercise 6.1), and this result is consistent with the T²-test. The 95% simultaneous confidence coefficient applies to the entire set of intervals that could be constructed for all possible linear combinations of the form a₁δ₁ + a₂δ₂. The particular intervals corresponding to the choices (a₁ = 1, a₂ = 0) and (a₁ = 0, a₂ = 1) contain zero. Other choices of a₁ and a₂ will produce simultaneous intervals that do not contain zero. (If the hypothesis H₀: δ = 0 were not rejected, then all simultaneous intervals would include zero.) The Bonferroni simultaneous intervals also cover zero. (See Exercise 6.2.)
Our analysis assumed a normal distribution for the Dⱼ. In fact, the situation is further complicated by the presence of one or, possibly, two outliers. (See Exercise 6.3.) These data can be transformed to data more nearly normal, but with such a small sample, it is difficult to remove the effects of the outlier(s). (See Exercise 6.4.) The numerical results of this example illustrate an unusual circumstance that can occur when making inferences.
The experimenter in Example 6.1 actually divided a sample by first shaking it and then pouring it rapidly back and forth into two bottles for chemical analysis. This was prudent because a simple division of the sample into two pieces obtained by pouring the top half into one bottle and the remainder into another bottle might result in more suspended solids in the lower half due to settling. The two laboratories would then not be working with the same, or even like, experimental units, and the conclusions would not pertain to laboratory competence, measuring techniques, and so forth.
Whenever an investigator can control the assignment of treatments to experimental units, an appropriate pairing of units and a randomized assignment of treatments can enhance the statistical analysis. Differences, if any, between supposedly identical units must be identified and most-alike units paired. Further, a random assignment of treatment 1 to one unit and treatment 2 to the other unit will help eliminate the systematic effects of uncontrolled sources of variation. Randomization can be implemented by flipping a coin to determine whether the first unit in a pair receives treatment 1 (heads) or treatment 2 (tails). The remaining treatment is then assigned to the other unit. A separate independent randomization is conducted for each pair. One can conceive of the process as follows:

Experimental Design for Paired Comparisons
Like pairs of experimental units:   1   2   3   ⋯   n
Treatments 1 and 2 assigned at random within each pair.
We conclude our discussion of paired comparisons by noting that d̄ and S_d, and hence T², may be calculated from the full-sample quantities x̄ and S. Here x̄ is the 2p × 1 vector of sample averages for the p variables on the two treatments given by

x̄′ = [x̄₁₁, x̄₁₂, …, x̄₁p, x̄₂₁, x̄₂₂, …, x̄₂p]     (6-11)

and S is the 2p × 2p matrix of sample variances and covariances arranged as

S = [ S₁₁  S₁₂
      S₂₁  S₂₂ ]     (6-12)

where each block Sᵢₖ is p × p.
279
The matrix S₁₁ contains the sample variances and covariances for the p variables on treatment 1. Similarly, S₂₂ contains the sample variances and covariances computed for the p variables on treatment 2. Finally, S₁₂ = S₂₁′ are the matrices of sample covariances computed from observations on pairs of treatment 1 and treatment 2 variables. Defining the matrix
C = [ 1  0  ⋯  0  −1   0  ⋯   0
      0  1  ⋯  0   0  −1  ⋯   0
      ⋮
      0  0  ⋯  1   0   0  ⋯  −1 ]     (6-13)
(p × 2p)

where the −1's begin in the (p + 1)st column, we can verify that

dⱼ = Cxⱼ,   j = 1, 2, …, n

Thus,

d̄ = Cx̄     (6-14)

and

S_d = CSC′     (6-15)
and it is not necessary first to calculate the differences d₁, d₂, …, dₙ. On the other hand, it is wise to calculate these differences in order to check normality and the assumption of a random sample.
Each row cᵢ′ of the matrix C in (6-13) is a contrast vector, because its elements sum to zero. Attention is usually centered on contrasts when comparing treatments. Each contrast is perpendicular to the vector 1′ = [1, 1, …, 1] since cᵢ′1 = 0. The component 1′xⱼ, representing the overall treatment sum, is ignored by the test statistic T² presented in this section.
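The identity dⱼ = Cxⱼ behind (6-14) and (6-15) can be checked numerically. The sketch below builds C for p = 2 and verifies it on a few invented observation vectors [x₁ⱼ₁, x₁ⱼ₂, x₂ⱼ₁, x₂ⱼ₂].

```python
# Verify d_j = C x_j for the contrast matrix C of (6-13) with p = 2.
# Each observation vector stacks treatment-1 and treatment-2 responses;
# the numbers are invented for illustration.
p = 2
# C = [I_p | -I_p], a p x 2p matrix.
C = [[1 if i == k else 0 for k in range(p)] +
     [-1 if i == k else 0 for k in range(p)] for i in range(p)]

X = [[5, 7, 3, 6],
     [6, 8, 2, 5],
     [4, 9, 5, 7]]

def matvec(M, v):
    # Plain matrix-vector product.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

d = [matvec(C, x) for x in X]

# Direct differences, treatment 1 minus treatment 2, variable by variable.
d_direct = [[x[0] - x[2], x[1] - x[3]] for x in X]
```

Applying the same C to the full-sample x̄ and S reproduces d̄ and S_d without ever forming the individual differences, which is the point of (6-14) and (6-15).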
xⱼ′ = [xⱼ₁, xⱼ₂, …, xⱼq],   j = 1, 2, …, n

where xⱼᵢ is the response to the ith treatment on the jth unit. The name repeated measures stems from the fact that all treatments are administered to each unit.
For comparative purposes, we consider contrasts of the components of μ = E(Xⱼ). These could be

C₁μ = [μ₁ − μ₂, μ₁ − μ₃, …, μ₁ − μ_q]′,   where C₁ = [ 1  −1   0  ⋯   0
                                                       1   0  −1  ⋯   0
                                                       ⋮
                                                       1   0   0  ⋯  −1 ]

or

C₂μ = [μ₂ − μ₁, μ₃ − μ₂, …, μ_q − μ_{q−1}]′,   where C₂ = [ −1   1   0  ⋯   0   0
                                                              0  −1   1  ⋯   0   0
                                                              ⋮
                                                              0   0   0  ⋯  −1   1 ]

Both C₁ and C₂ are called contrast matrices, because their q − 1 rows are linearly independent and each is a contrast vector. The nature of the design eliminates much of the influence of unit-to-unit variation on treatment comparisons. Of course, the experimenter should randomize the order in which the treatments are presented to each subject.
When the treatment means are equal, C₁μ = C₂μ = 0. In general, the hypothesis that there are no differences in treatments (equal treatment means) becomes Cμ = 0 for any choice of the contrast matrix C.
Consequently, based on the contrasts Cxⱼ in the observations, we have means Cx̄ and covariance matrix CSC′, and we test Cμ = 0 using the T²-statistic

T² = n(Cx̄)′(CSC′)⁻¹Cx̄

An α-level test of H₀: Cμ = 0 (equal treatment means) rejects H₀ if

T² = n(Cx̄)′(CSC′)⁻¹Cx̄ > [(n − 1)(q − 1)/(n − q + 1)] F_{q−1,n−q+1}(α)     (6-16)
where F_{q−1,n−q+1}(α) is the upper (100α)th percentile of an F-distribution with q − 1 and n − q + 1 d.f. Here x̄ and S are the sample mean vector and covariance matrix defined, respectively, by

x̄ = (1/n) ∑_{j=1}^{n} xⱼ   and   S = (1/(n − 1)) ∑_{j=1}^{n} (xⱼ − x̄)(xⱼ − x̄)′
Paired Comparisons and a Repeated Measures Design

A confidence region for contrasts Cμ, with μ the mean of a normal population, is determined by the set of all Cμ such that

n(Cx̄ − Cμ)′(CSC′)⁻¹(Cx̄ − Cμ) ≤ [(n − 1)(q − 1)/(n − q + 1)] F_{q−1,n−q+1}(α)     (6-17)

where x̄ and S are as defined in (6-16). Consequently, simultaneous 100(1 − α)% confidence intervals for single contrasts c′μ for any contrast vectors of interest are given by (see Result 5A.1)

c′μ:   c′x̄ ± √[((n − 1)(q − 1)/(n − q + 1)) F_{q−1,n−q+1}(α)] √(c′Sc/n)     (6-18)
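Computing Cx̄ and CSC′ for the successive-difference contrast matrix is straightforward. The sketch below uses an invented mean vector and covariance matrix with q = 3 treatments; the critical value F_{2,18}(.05) = 3.55 is taken from a standard table.

```python
import math

# Contrast computations for a repeated measures design with q = 3 treatments.
# The mean vector and covariance matrix below are invented for illustration.
xbar = [10.0, 12.0, 15.0]
S = [[4.0, 2.0, 1.0],
     [2.0, 5.0, 2.0],
     [1.0, 2.0, 6.0]]
n, q = 20, 3

C2 = [[-1, 1, 0],      # successive-difference contrast matrix
      [0, -1, 1]]

Cxbar = [sum(c * x for c, x in zip(row, xbar)) for row in C2]

# CSC' = C S C', computed as (C S) C'.
CS = [[sum(c * S[k][j] for k, c in enumerate(row)) for j in range(q)]
      for row in C2]
CSCt = [[sum(cs * c for cs, c in zip(r1, r2)) for r2 in C2] for r1 in CS]

# Simultaneous interval (6-18) for the first contrast c' = (-1, 1, 0):
f_crit = 3.55           # F_{2,18} upper .05 point
const = math.sqrt((n - 1) * (q - 1) / (n - q + 1) * f_crit)
half = const * math.sqrt(CSCt[0][0] / n)
```

The interval for the second contrast uses CSCt[1][1] in the same way, and inverting CSCt gives the T²-statistic of (6-16).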
Example 6.2 (Testing for equal treatments in a repeated measures design) Improved anesthetics are often developed by first studying their effects on animals. In one study, 19 dogs were initially given the drug pentobarbital. Each dog was then administered carbon dioxide (CO₂) at each of two pressure levels. Next, halothane (H) was added, and the administration of CO₂ was repeated. The response, milliseconds between heartbeats, was measured for the four treatment combinations:

                      Halothane
                  Absent    Present
CO₂        Low
pressure   High

Table 6.2 contains the four measurements for each of the 19 dogs, where

Treatment 1 = high CO₂ pressure without H
Treatment 2 = low CO₂ pressure without H
Treatment 3 = high CO₂ pressure with H
Treatment 4 = low CO₂ pressure with H

We shall analyze the anesthetizing effects of CO₂ pressure and halothane from this repeated-measures design.
There are three treatment contrasts that might be of interest in the experiment. Let μ₁, μ₂, μ₃, and μ₄ correspond to the mean responses for treatments 1, 2, 3, and 4, respectively. Then
(μ₃ + μ₄) − (μ₁ + μ₂) = halothane contrast, representing the difference between the presence and absence of halothane
(μ₁ + μ₃) − (μ₂ + μ₄) = CO₂ contrast, representing the difference between high and low CO₂ pressure
(μ₁ + μ₄) − (μ₂ + μ₃) = contrast representing the influence of halothane on CO₂ pressure differences (H-CO₂ pressure "interaction")
Table 6.2 Sleeping-Dog Data

Dog   Treatment 1   Treatment 2   Treatment 3   Treatment 4
1     426           609           556           600
2     253           236           392           395
3     359           433           349           357
4     432           431           522           600
5     405           426           513           513
6     324           438           507           539
7     310           312           410           456
8     326           326           350           504
9     375           447           547           548
10    286           286           403           422
11    349           382           473           497
12    429           410           488           547
13    348           377           447           514
14    412           473           472           446
15    347           326           455           468
16    434           458           637           524
17    364           367           432           469
18    420           395           508           531
19    397           556           645           625

Source:
The contrast matrix corresponding to the three contrasts above is

C = [ −1  −1   1   1
       1  −1   1  −1
       1  −1  −1   1 ]

The data (see Table 6.2) give

x̄ = [368.21, 404.63, 479.26, 502.89]′

and the sample covariance matrix S. Then

Cx̄ = [209.31, −60.05, −12.79]′

and

T² = n(Cx̄)′(CSC′)⁻¹(Cx̄) = 19(6.11) = 116

With α = .05, the critical value in (6-16) is

[(n − 1)(q − 1)/(n − q + 1)] F_{q−1,n−q+1}(.05) = [18(3)/16] F_{3,16}(.05) = 3.38(3.24) = 10.94
From (6-16), T² = 116 > 10.94, and we reject H₀: Cμ = 0 (no treatment effects). To see which of the contrasts are responsible for the rejection of H₀, we construct 95% simultaneous confidence intervals for these contrasts. From (6-18), the contrast

c₁′μ = (μ₃ + μ₄) − (μ₁ + μ₂) = halothane influence

is estimated by the interval

(x̄₃ + x̄₄) − (x̄₁ + x̄₂) ± √[(18(3)/16) F_{3,16}(.05)] √(c₁′Sc₁/19) = 209.31 ± √10.94 √(9432.32/19) = 209.31 ± 73.70
where c₁′ is the first row of C. Similarly, the remaining contrasts are estimated by

CO₂ pressure influence = (μ₁ + μ₃) − (μ₂ + μ₄):   −60.05 ± √10.94 √(5195.84/19) = −60.05 ± 54.70

H-CO₂ pressure "interaction" = (μ₁ + μ₄) − (μ₂ + μ₃):   −12.79 ± √10.94 √(7557.44/19) = −12.79 ± 65.97

The first confidence interval implies that there is a halothane effect. The presence of halothane produces longer times between heartbeats. This occurs at both levels of CO₂ pressure, since the H-CO₂ pressure interaction contrast, (μ₁ + μ₄) − (μ₂ + μ₃), is not significantly different from zero. (See the third confidence interval.) The second confidence interval indicates that there is an effect due to CO₂ pressure: The lower CO₂ pressure produces longer times between heartbeats.
Some caution must be exercised in our interpretation of the results because the trials with halothane must follow those without. The apparent H-effect may be due to a time trend. (Ideally, the time order of all treatments should be determined at random.) ■
The test in (6-16) is appropriate when the covariance matrix, Cov(X) = Σ, cannot be assumed to have any special structure. If it is reasonable to assume that Σ has a particular structure, tests designed with this structure in mind have higher power than the one in (6-16). (For Σ with the equal correlation structure (8-14), see a discussion of the "randomized block" design in [17] or [22].)
Comparing Mean Vectors from Two Populations

Consider a random sample of size n₁ from population 1 and a sample of size n₂ from population 2. The observations on p variables can be arranged as follows:

Sample                                    Summary statistics
(Population 1)  x₁₁, x₁₂, ..., x₁ₙ₁       x̄₁, S₁
(Population 2)  x₂₁, x₂₂, ..., x₂ₙ₂       x̄₂, S₂

In this notation, the first subscript—1 or 2—denotes the population. We want to make inferences about (mean vector of population 1) − (mean vector of population 2) = μ₁ − μ₂. For instance, we shall want to answer the question, Is μ₁ = μ₂ (or, equivalently, is μ₁ − μ₂ = 0)? Also, if μ₁ − μ₂ ≠ 0, which component means are different? With a few tentative assumptions, we are able to provide answers to these questions.

Assumptions concerning the structure of the data:
1. The sample X₁₁, X₁₂, ..., X₁ₙ₁ is a random sample of size n₁ from a p-variate population with mean vector μ₁ and covariance matrix Σ₁.
2. The sample X₂₁, X₂₂, ..., X₂ₙ₂ is a random sample of size n₂ from a p-variate population with mean vector μ₂ and covariance matrix Σ₂.
3. Also, X₁₁, X₁₂, ..., X₁ₙ₁ are independent of X₂₁, X₂₂, ..., X₂ₙ₂.    (6-19)

We shall see later that, for large samples, this structure is sufficient for making inferences about the p × 1 vector μ₁ − μ₂. However, when the sample sizes n₁ and n₂ are small, more assumptions are needed.
The second assumption, that Σ₁ = Σ₂, is much stronger than its univariate counterpart. Here we are assuming that several pairs of variances and covariances are nearly equal. When Σ₁ = Σ₂ = Σ,

Σ_{j=1}^{n₁} (x₁ⱼ − x̄₁)(x₁ⱼ − x̄₁)'  is an estimate of (n₁ − 1)Σ

and

Σ_{j=1}^{n₂} (x₂ⱼ − x̄₂)(x₂ⱼ − x̄₂)'  is an estimate of (n₂ − 1)Σ

Consequently, we can pool the information in both samples in order to estimate the common covariance Σ. We set

S_pooled = [Σⱼ(x₁ⱼ − x̄₁)(x₁ⱼ − x̄₁)' + Σⱼ(x₂ⱼ − x̄₂)(x₂ⱼ − x̄₂)'] / (n₁ + n₂ − 2)
         = [(n₁ − 1)/(n₁ + n₂ − 2)] S₁ + [(n₂ − 1)/(n₁ + n₂ − 2)] S₂    (6-21)

Since Σⱼ(x₁ⱼ − x̄₁)(x₁ⱼ − x̄₁)' has n₁ − 1 d.f. and Σⱼ(x₂ⱼ − x̄₂)(x₂ⱼ − x̄₂)' has n₂ − 1 d.f., the divisor (n₁ − 1) + (n₂ − 1) in (6-21) is obtained by combining the two component degrees of freedom. [See (4-24).] Additional support for the pooling procedure comes from consideration of the multivariate normal likelihood. (See Exercise 6.11.)
To test the hypothesis that μ₁ − μ₂ = δ₀, a specified vector, we consider the squared statistical distance from x̄₁ − x̄₂ to δ₀. Now,

E(X̄₁ − X̄₂) = E(X̄₁) − E(X̄₂) = μ₁ − μ₂

Since the independence assumption in (6-19) implies that X̄₁ and X̄₂ are independent and thus Cov(X̄₁, X̄₂) = 0 (see Result 4.5), by (3-9), it follows that

Cov(X̄₁ − X̄₂) = Cov(X̄₁) + Cov(X̄₂) = (1/n₁)Σ + (1/n₂)Σ = (1/n₁ + 1/n₂)Σ    (6-22)
Because S_pooled estimates Σ, we see that

(1/n₁ + 1/n₂) S_pooled

is an estimator of Cov(X̄₁ − X̄₂). The likelihood ratio test of

H₀: μ₁ − μ₂ = δ₀

is based on the square of the statistical distance, T², and is given by (see [1]): Reject H₀ if

T² = (x̄₁ − x̄₂ − δ₀)' [(1/n₁ + 1/n₂) S_pooled]⁻¹ (x̄₁ − x̄₂ − δ₀) > c²    (6-23)
where the critical distance c² is determined from the distribution of the two-sample T²-statistic.
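The pooled test (6-21)–(6-23) takes only a few lines of code. The sketch below is not from the text; it uses the soap-manufacturing summary statistics of Example 6.3 (with x̄₁ taken as [8.3, 4.1]', consistent with the difference x̄₁ − x̄₂ = [−1.9, .2]' reported there), and the F-quantile replaces the tabled value.

```python
import numpy as np
from scipy import stats

def two_sample_T2(xbar1, xbar2, S1, S2, n1, n2, alpha=0.05, delta0=None):
    """Pooled two-sample Hotelling T^2 test of H0: mu1 - mu2 = delta0, per (6-21)-(6-23)."""
    p = len(xbar1)
    if delta0 is None:
        delta0 = np.zeros(p)
    Sp = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)     # (6-21)
    d = np.asarray(xbar1) - np.asarray(xbar2) - delta0
    T2 = d @ np.linalg.solve((1 / n1 + 1 / n2) * Sp, d)      # (6-23)
    # c^2 = (n1+n2-2)p/(n1+n2-p-1) * F_{p, n1+n2-p-1}(alpha)
    c2 = (n1 + n2 - 2) * p / (n1 + n2 - p - 1) * stats.f.ppf(1 - alpha, p, n1 + n2 - p - 1)
    return T2, c2

# Summary statistics for the soap data of Example 6.3 (n1 = n2 = 50)
xbar1, xbar2 = np.array([8.3, 4.1]), np.array([10.2, 3.9])
S1 = np.array([[2., 1.], [1., 6.]])
S2 = np.array([[2., 1.], [1., 4.]])
T2, c2 = two_sample_T2(xbar1, xbar2, S1, S2, 50, 50)
```

Since T² ≈ 52.5 exceeds c² ≈ 6.26, H₀: μ₁ = μ₂ is rejected, in agreement with the conclusion of Example 6.3.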
Result 6.1. If X₁₁, X₁₂, ..., X₁ₙ₁ is a random sample of size n₁ from Nₚ(μ₁, Σ) and X₂₁, X₂₂, ..., X₂ₙ₂ is an independent random sample of size n₂ from Nₚ(μ₂, Σ), then

T² = [X̄₁ − X̄₂ − (μ₁ − μ₂)]' [(1/n₁ + 1/n₂) S_pooled]⁻¹ [X̄₁ − X̄₂ − (μ₁ − μ₂)]

is distributed as

[(n₁ + n₂ − 2)p / (n₁ + n₂ − p − 1)] F_{p, n₁+n₂−p−1}

Consequently,

P[ (X̄₁ − X̄₂ − (μ₁ − μ₂))' [(1/n₁ + 1/n₂) S_pooled]⁻¹ (X̄₁ − X̄₂ − (μ₁ − μ₂)) ≤ c² ] = 1 − α    (6-24)

where

c² = [(n₁ + n₂ − 2)p / (n₁ + n₂ − p − 1)] F_{p, n₁+n₂−p−1}(α)

Proof. We first note that X̄₁ − X̄₂ is distributed as a normal random vector with mean μ₁ − μ₂ and covariance matrix (1/n₁ + 1/n₂)Σ, so

(1/n₁ + 1/n₂)^{−1/2} [X̄₁ − X̄₂ − (μ₁ − μ₂)]  is distributed as  Nₚ(0, Σ)

By assumption, the X₁ⱼ's and the X₂ⱼ's are independent, so (n₁ − 1)S₁ and (n₂ − 1)S₂ are also independent. From (4-24), (n₁ − 1)S₁ + (n₂ − 1)S₂ is then distributed as W_{n₁+n₂−2}(Σ), a Wishart random matrix with n₁ + n₂ − 2 d.f. Therefore,

T² = (multivariate normal random vector)' [ (Wishart random matrix)/(d.f.) ]⁻¹ (multivariate normal random vector)

which is the T²-distribution specified in (5-8), with n replaced by n₁ + n₂ − 1. [See (5-5) for the relation to F.]
We are primarily interested in confidence regions for μ₁ − μ₂. From (6-24), we conclude that all μ₁ − μ₂ within squared statistical distance c² of x̄₁ − x̄₂ constitute the confidence region. This region is an ellipsoid centered at the observed difference x̄₁ − x̄₂ and whose axes are determined by the eigenvalues and eigenvectors of S_pooled (or S_pooled⁻¹).
Example 6.3 (Constructing a confidence region for the difference of two mean vectors)
Fifty bars of soap are manufactured in each of two ways. Two characteristics, X 1 = lather and X 2 = mildness, are measured. The summary statistics for bars produced by methods 1 and 2 are
x̄₁ = [8.3, 4.1]',   S₁ = [2 1; 1 6],   n₁ = 50
x̄₂ = [10.2, 3.9]',  S₂ = [2 1; 1 4],   n₂ = 50

Obtain a 95% confidence region for μ₁ − μ₂.
We first note that S₁ and S₂ are approximately equal, so that it is reasonable to pool them. Hence, from (6-21),

S_pooled = (49/98) S₁ + (49/98) S₂ = [2 1; 1 5]

Also,

x̄₁ − x̄₂ = [−1.9, .2]'

so the confidence ellipse is centered at [−1.9, .2]'. The eigenvalues and eigenvectors of S_pooled are obtained from the equation

0 = |S_pooled − λI| = |2 − λ    1; 1    5 − λ| = λ² − 7λ + 9

so λ = (7 ± √(49 − 36))/2. Consequently, λ₁ = 5.303 and λ₂ = 1.697, and the corresponding eigenvectors, e₁ and e₂, determined from

S_pooled eᵢ = λᵢ eᵢ,   i = 1, 2

are

e₁ = [.290, .957]'  and  e₂ = [.957, −.290]'

By Result 6.2,

c² = [(n₁ + n₂ − 2)p / (n₁ + n₂ − p − 1)] F_{p, n₁+n₂−p−1}(.05) = [(98)(2)/97] F_{2,97}(.05) = (2.02)(3.1) = 6.26

so the confidence ellipse extends

√λᵢ √((1/n₁ + 1/n₂) c²) = √λᵢ √(6.26/25)
[Figure 6.1: 95% confidence ellipse for μ₁ − μ₂ (soap data).]
units along the eigenvector eᵢ, or 1.15 units in the e₁ direction and .65 units in the e₂ direction. The 95% confidence ellipse is shown in Figure 6.1. Clearly, μ₁ − μ₂ = 0 is not in the ellipse, and we conclude that the two methods of manufacturing soap produce different results. It appears as if the two processes produce bars of soap with about the same mildness (X₂), but those from the second process have more lather (X₁).
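The axis computation of Example 6.3 can be reproduced in a few lines. This sketch (not from the text) takes S_pooled, the sample sizes, and the rounded value c² = 6.26 from the example:

```python
import numpy as np

# S_pooled, sample sizes, and c^2 from Example 6.3
Sp = np.array([[2., 1.], [1., 5.]])
n1 = n2 = 50
c2 = 6.26

eigvals, eigvecs = np.linalg.eigh(Sp)          # symmetric eigendecomposition, ascending
order = np.argsort(eigvals)[::-1]              # reorder so lambda_1 >= lambda_2
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Half-lengths of the ellipse axes: sqrt(lambda_i) * sqrt((1/n1 + 1/n2) c^2)
half = np.sqrt(eigvals) * np.sqrt((1 / n1 + 1 / n2) * c2)
```

The half-lengths come out as roughly 1.15 and .65, matching the values quoted above (eigenvectors are determined only up to sign).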
In particular, the mean difference μ₁ᵢ − μ₂ᵢ will be covered by

(x̄₁ᵢ − x̄₂ᵢ) ± c √((1/n₁ + 1/n₂) sᵢᵢ,pooled)    for i = 1, 2, ..., p

These simultaneous statements are developed from a consideration of all possible linear combinations of the differences in the mean vectors. Consider the linear combinations given by a'X₁ⱼ = a₁X₁ⱼ₁ + a₂X₁ⱼ₂ + ··· + aₚX₁ⱼₚ and a'X₂ⱼ = a₁X₂ⱼ₁ + a₂X₂ⱼ₂ + ··· + aₚX₂ⱼₚ. These linear combinations have sample means and variances a'x̄₁, a'S₁a and a'x̄₂, a'S₂a, respectively, where x̄₁, S₁ and x̄₂, S₂ are the mean and covariance statistics for the two original samples. (See Result 3.5.) When both parent populations have the same covariance matrix, s²₁,ₐ = a'S₁a and s²₂,ₐ = a'S₂a
are both estimators of a'Σa, the common population variance of the linear combinations a'X₁ and a'X₂. Pooling these estimators, we obtain

s²ₐ,pooled = [(n₁ − 1)s²₁,ₐ + (n₂ − 1)s²₂,ₐ] / (n₁ + n₂ − 2)
           = a'[ (n₁ − 1)/(n₁ + n₂ − 2) S₁ + (n₂ − 1)/(n₁ + n₂ − 2) S₂ ]a
           = a'S_pooled a    (6-25)
To test H₀: a'(μ₁ − μ₂) = a'δ₀, on the basis of the a'X₁ⱼ and a'X₂ⱼ, we can form the square of the univariate two-sample t-statistic

t²ₐ = [a'(X̄₁ − X̄₂ − (μ₁ − μ₂))]² / [(1/n₁ + 1/n₂) a'S_pooled a]    (6-26)

According to the maximization lemma (2-50), with d = X̄₁ − X̄₂ − (μ₁ − μ₂) and B = (1/n₁ + 1/n₂)S_pooled,

t²ₐ ≤ T²

for all a ≠ 0. Thus,

P[T² ≤ c²] = P[t²ₐ ≤ c², for all a]
           = P[ |a'(X̄₁ − X̄₂ − (μ₁ − μ₂))| ≤ c √((1/n₁ + 1/n₂) a'S_pooled a), for all a ]

where c² is selected according to Result 6.2.
Remark. For testing H₀: μ₁ − μ₂ = 0, the linear combination â'(x̄₁ − x̄₂), with coefficient vector â ∝ S_pooled⁻¹(x̄₁ − x̄₂), quantifies the largest population difference. That is, if T² rejects H₀, then â'(x̄₁ − x̄₂) will have a nonzero mean. Frequently, we try to interpret the components of this linear combination for both subject matter and statistical importance.

Example 6.4 (Calculating simultaneous confidence intervals for the differences in mean components) Samples of sizes n₁ = 45 and n₂ = 55 were taken of Wisconsin homeowners with and without air conditioning, respectively. (Data courtesy of Statistical Laboratory, University of Wisconsin.) Two measurements of electrical usage (in kilowatt hours) were considered. The first is a measure of total on-peak consumption (X₁) during July, and the second is a measure of total off-peak consumption (X₂) during July. The resulting summary statistics are

n₁ = 45,  x̄₁ = [204.4, 556.6]',  S₁ = [13825.3 23823.4; 23823.4 73107.4]
n₂ = 55,  x̄₂ = [130.0, 355.0]',  S₂ = [8632.0 19616.7; 19616.7 55964.5]
(The off-peak consumption is higher than the on-peak consumption because there are more off-peak hours in a month.)
Let us find 95% simultaneous confidence intervals for the differences in the mean components. Although there appears to be somewhat of a discrepancy in the sample variances, for illustrative purposes we proceed to a calculation of the pooled sample covariance matrix. Here

S_pooled = [(n₁ − 1)/(n₁ + n₂ − 2)] S₁ + [(n₂ − 1)/(n₁ + n₂ − 2)] S₂ = [10963.7 21505.5; 21505.5 63661.3]

and

c² = [(n₁ + n₂ − 2)p / (n₁ + n₂ − p − 1)] F_{p, n₁+n₂−p−1}(.05) = [98(2)/97](3.1) = (2.02)(3.1) = 6.26
With μ₁' − μ₂' = [μ₁₁ − μ₂₁, μ₁₂ − μ₂₂], the 95% simultaneous confidence intervals for the population differences are

μ₁₁ − μ₂₁:  (204.4 − 130.0) ± √6.26 √((1/45 + 1/55) 10963.7)   or   21.7 ≤ μ₁₁ − μ₂₁ ≤ 127.1   (on-peak)

μ₁₂ − μ₂₂:  (556.6 − 355.0) ± √6.26 √((1/45 + 1/55) 63661.3)   or   74.7 ≤ μ₁₂ − μ₂₂ ≤ 328.5   (off-peak)

We conclude that there is a difference in electrical consumption between those with air conditioning and those without. This difference is evident in both on-peak and off-peak consumption.
The 95% confidence ellipse for μ₁ − μ₂ is determined from the eigenvalue–eigenvector pairs λ₁ = 71323.5, e₁' = [.336, .942] and λ₂ = 3301.5, e₂' = [.942, −.336]. Since

√λ₁ √((1/n₁ + 1/n₂) c²) = √71323.5 √((1/45 + 1/55) 6.26) = 134.3

and

√λ₂ √((1/n₁ + 1/n₂) c²) = √3301.5 √((1/45 + 1/55) 6.26) = 28.9

we obtain the 95% confidence ellipse for μ₁ − μ₂ sketched in Figure 6.2 on page 291. Because the confidence ellipse for the difference in means does not cover 0' = [0, 0], the T²-statistic will reject H₀: μ₁ − μ₂ = 0 at the 5% level.
[Figure 6.2: 95% confidence ellipse for μ₁ − μ₂ (electrical-usage data).]
The coefficient vector for the linear combination most responsible for rejection is proportional to S_pooled⁻¹(x̄₁ − x̄₂). (See Exercise 6.7.)
The Bonferroni 100(1 − α)% simultaneous confidence intervals for the p population mean differences are

μ₁ᵢ − μ₂ᵢ:  (x̄₁ᵢ − x̄₂ᵢ) ± t_{n₁+n₂−2}(α/2p) √((1/n₁ + 1/n₂) sᵢᵢ,pooled)

where t_{n₁+n₂−2}(α/2p) is the upper 100(α/2p)th percentile of a t-distribution with n₁ + n₂ − 2 d.f.
The Two-Sample Situation When Σ₁ ≠ Σ₂

When Σ₁ ≠ Σ₂, we are unable to find a "distance" measure like T², whose distribution does not depend on the unknowns Σ₁ and Σ₂. Bartlett's test [3] is used to test the equality of Σ₁ and Σ₂ in terms of generalized variances. Unfortunately, the conclusions can be seriously misleading when the populations are nonnormal. Nonnormality and unequal covariances cannot be separated with Bartlett's test. (See also Section 6.6.) A method of testing the equality of two covariance matrices that is less sensitive to the assumption of multivariate normality has been proposed by Tiku and Balakrishnan [23]. However, more practical experience is needed with this test before we can recommend it unconditionally.
We suggest, without much factual support, that any discrepancy of the order σ₁,ᵢᵢ = 4σ₂,ᵢᵢ, or vice versa, is probably serious. This is true in the univariate case. The size of the discrepancies that are critical in the multivariate situation probably depends, to a large extent, on the number of variables p.
A transformation may improve things when the marginal variances are quite different. However, for n₁ and n₂ large, we can avoid the complexities due to unequal covariance matrices.
Result 6.4. Let the sample sizes be such that n₁ − p and n₂ − p are large. Then, an approximate 100(1 − α)% confidence ellipsoid for μ₁ − μ₂ is given by all μ₁ − μ₂ satisfying

[x̄₁ − x̄₂ − (μ₁ − μ₂)]' [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ [x̄₁ − x̄₂ − (μ₁ − μ₂)] ≤ χ²ₚ(α)

where χ²ₚ(α) is the upper (100α)th percentile of a chi-square distribution with p d.f. Also, 100(1 − α)% simultaneous confidence intervals for all linear combinations a'(μ₁ − μ₂) are provided by

a'(μ₁ − μ₂)  belongs to  a'(x̄₁ − x̄₂) ± √χ²ₚ(α) √(a'[(1/n₁)S₁ + (1/n₂)S₂]a)
Proof. From (6-22) and (3-9),

E(X̄₁ − X̄₂) = μ₁ − μ₂   and   Cov(X̄₁ − X̄₂) = (1/n₁)Σ₁ + (1/n₂)Σ₂

By the central limit theorem, X̄₁ − X̄₂ is nearly Nₚ[μ₁ − μ₂, n₁⁻¹Σ₁ + n₂⁻¹Σ₂]. If Σ₁ and Σ₂ were known, the square of the statistical distance from X̄₁ − X̄₂ to μ₁ − μ₂ would be

[X̄₁ − X̄₂ − (μ₁ − μ₂)]' [n₁⁻¹Σ₁ + n₂⁻¹Σ₂]⁻¹ [X̄₁ − X̄₂ − (μ₁ − μ₂)]

This squared distance has an approximate χ²ₚ-distribution, by Result 4.7. When n₁ and n₂ are large, with high probability, S₁ will be close to Σ₁ and S₂ will be close to Σ₂. Consequently, the approximation holds with S₁ and S₂ in place of Σ₁ and Σ₂, respectively. The results concerning the simultaneous confidence intervals follow from Result 5A.1.
Remark. If n₁ = n₂ = n, then (n − 1)/(n + n − 2) = 1/2, so

(1/n₁)S₁ + (1/n₂)S₂ = (1/n)(S₁ + S₂) = [(n − 1)S₁ + (n − 1)S₂]/(n + n − 2) (1/n + 1/n) = (1/n + 1/n) S_pooled

With equal sample sizes, the large sample procedure is essentially the same as the procedure based on the pooled covariance matrix. (See Result 6.2.) In one dimension, it is well known that the effect of unequal variances is least when n₁ = n₂ and greatest when n₁ is much less than n₂ or vice versa.
Example 6.5 (Large sample procedures for inferences about the difference in means)
We shall analyze the electrical-consumption data discussed in Example 6.4 using the large sample approach. We first calculate

(1/n₁)S₁ + (1/n₂)S₂ = (1/45)[13825.3 23823.4; 23823.4 73107.4] + (1/55)[8632.0 19616.7; 19616.7 55964.5]
                    = [464.17 886.08; 886.08 2642.15]
The 95% simultaneous confidence intervals for the linear combinations

a'(μ₁ − μ₂) = [1, 0][μ₁₁ − μ₂₁; μ₁₂ − μ₂₂] = μ₁₁ − μ₂₁

and

a'(μ₁ − μ₂) = [0, 1][μ₁₁ − μ₂₁; μ₁₂ − μ₂₂] = μ₁₂ − μ₂₂

are (see Result 6.4)

μ₁₁ − μ₂₁:  74.4 ± √5.99 √464.17    or   (21.7, 127.1)
μ₁₂ − μ₂₂:  201.6 ± √5.99 √2642.15  or   (75.8, 327.4)

Notice that these intervals differ negligibly from the intervals in Example 6.4, where the pooling procedure was employed. The T²-statistic for testing H₀: μ₁ − μ₂ = 0 is

T² = [x̄₁ − x̄₂]' [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ [x̄₁ − x̄₂]
   = [74.4, 201.6] [464.17 886.08; 886.08 2642.15]⁻¹ [74.4; 201.6]
   = [74.4, 201.6] (10⁻⁴)[59.874 −20.080; −20.080 10.519] [74.4; 201.6] = 15.66

For α = .05, the critical value is χ²₂(.05) = 5.99 and, since T² = 15.66 > χ²₂(.05) = 5.99, we reject H₀.
The most critical linear combination leading to the rejection of H₀ has coefficient vector

â ∝ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ (x̄₁ − x̄₂) = (10⁻⁴)[59.874 −20.080; −20.080 10.519] [74.4; 201.6] = [.041; .063]

The difference in off-peak electrical consumption between those with air conditioning and those without contributes more than the corresponding difference in on-peak consumption to the rejection of H₀: μ₁ − μ₂ = 0.
Chapter 6 Comparisons of Several Multivariate Means

A statistic similar to T² that is less sensitive to outlying observations for small and moderately sized samples has been developed by Tiku and Singh [24]. However, if the sample size is moderate to large, Hotelling's T² is remarkably unaffected by slight departures from normality and/or the presence of a few outliers.
An Approximation to the Distribution of T2 for Normal Populations When Sample Sizes Are Not Large
One can test H₀: μ₁ − μ₂ = 0 when the population covariance matrices are unequal even if the two sample sizes are not large, provided the two populations are multivariate normal. This situation is often called the multivariate Behrens–Fisher problem. The result requires that both sample sizes n₁ and n₂ are greater than p, the number of variables. The approach depends on an approximation to the distribution of the statistic

T² = [X̄₁ − X̄₂ − (μ₁ − μ₂)]' [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ [X̄₁ − X̄₂ − (μ₁ − μ₂)]    (6-27)

which is identical to the large sample statistic in Result 6.4. However, instead of using the chi-square approximation to obtain the critical value for testing H₀, the recommended approximation for smaller samples (see [15] and [19]) is given by

T²  is distributed as  [νp/(ν − p + 1)] F_{p, ν−p+1}    (6-28)

where the degrees of freedom ν are estimated from the sample covariance matrices using the relation

ν = (p + p²) / Σ_{i=1}^{2} (1/nᵢ){ tr[ ((1/nᵢ)Sᵢ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹)² ] + ( tr[ (1/nᵢ)Sᵢ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ ] )² }    (6-29)

where min(n₁, n₂) ≤ ν ≤ n₁ + n₂. This approximation reduces to the usual Welch solution to the Behrens–Fisher problem in the univariate (p = 1) case.
With moderate sample sizes and two normal populations, the approximate level α test for equality of means rejects H₀: μ₁ − μ₂ = 0 if

(x̄₁ − x̄₂)' [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ (x̄₁ − x̄₂) > [νp/(ν − p + 1)] F_{p, ν−p+1}(α)

where the degrees of freedom ν are given by (6-29). This procedure is consistent with the large samples procedure in Result 6.4 except that the critical value χ²ₚ(α) is replaced by the larger constant [νp/(ν − p + 1)] F_{p, ν−p+1}(α).
Similarly, the approximate 100(1 − α)% confidence region is given by all μ₁ − μ₂ such that

(x̄₁ − x̄₂ − (μ₁ − μ₂))' [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ (x̄₁ − x̄₂ − (μ₁ − μ₂)) ≤ [νp/(ν − p + 1)] F_{p, ν−p+1}(α)    (6-30)

For normal populations, the approximation to the distribution of T² given by (6-28) and (6-29) usually gives reasonable results.
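The degrees-of-freedom estimate (6-29) is easy to compute but tedious by hand; a sketch (not from the text), applied to the electrical-consumption covariance matrices from Examples 6.4–6.6:

```python
import numpy as np

def approx_df(S1, S2, n1, n2):
    """Estimated degrees of freedom nu for the multivariate Behrens-Fisher
    statistic, per (6-29)."""
    p = S1.shape[0]
    V1, V2 = S1 / n1, S2 / n2
    Vinv = np.linalg.inv(V1 + V2)
    denom = 0.0
    for V, n in ((V1, n1), (V2, n2)):
        A = V @ Vinv
        denom += (np.trace(A @ A) + np.trace(A) ** 2) / n
    return (p + p ** 2) / denom

# Sample covariance matrices and sizes from Examples 6.4-6.6
S1 = np.array([[13825.3, 23823.4], [23823.4, 73107.4]])
S2 = np.array([[8632.0, 19616.7], [19616.7, 55964.5]])
nu = approx_df(S1, S2, 45, 55)
```

The result agrees with the value ν = 77.6 obtained in Example 6.6, and by construction lies between min(n₁, n₂) and n₁ + n₂.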
Example 6.6 (The approximate T² distribution when Σ₁ ≠ Σ₂) Although the sample sizes are rather large for the electrical-consumption data in Example 6.4, we use these data and the calculations in Example 6.5 to illustrate the computations leading to the approximate distribution of T² when the population covariance matrices are unequal. We first calculate

(1/n₁)S₁ = (1/45)[13825.3 23823.4; 23823.4 73107.4] = [307.227 529.409; 529.409 1624.609]

(1/n₂)S₂ = (1/55)[8632.0 19616.7; 19616.7 55964.5] = [156.945 356.667; 356.667 1017.536]

and, using a result from Example 6.5,

[(1/n₁)S₁ + (1/n₂)S₂]⁻¹ = (10⁻⁴)[59.874 −20.080; −20.080 10.519]

Consequently,

(1/n₁)S₁ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ = [.776 −.060; −.092 .646]

and

{ (1/n₁)S₁ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ }² = [.608 −.085; −.131 .423]

so

(1/n₁){ tr[ ((1/n₁)S₁ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹)² ] + ( tr[(1/n₁)S₁ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹] )² } = (1/45){ (.608 + .423) + (.776 + .646)² } = .0678

Further,

(1/n₂)S₂ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ = [.224 .060; .092 .354]

and

{ (1/n₂)S₂ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹ }² = [.056 .035; .053 .131]

so

(1/n₂){ tr[ ((1/n₂)S₂ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹)² ] + ( tr[(1/n₂)S₂ [(1/n₁)S₁ + (1/n₂)S₂]⁻¹] )² } = (1/55){ (.056 + .131) + (.224 + .354)² } = .0095

Then, from (6-29), the estimated degrees of freedom are

ν = (2 + 2²)/(.0678 + .0095) = 6/.0773 = 77.6

and, with α = .05, the approximate critical value is [νp/(ν − p + 1)] F_{p, ν−p+1}(.05) = [77.6(2)/76.6] F_{2,76.6}(.05) ≈ 6.3.
From Example 6.5, the observed value of the test statistic is T² = 15.66, so the hypothesis H₀: μ₁ − μ₂ = 0 is rejected at the 5% level. This is the same conclusion reached with the large sample procedure described in Example 6.5.

As was the case in Example 6.6, the F_{p, ν−p+1} distribution can be defined for noninteger degrees of freedom. A slightly more conservative approach is to use the integer part of ν.
Comparing Several Multivariate Population Means (One-Way MANOVA)

Random samples, collected from each of g populations, are arranged as

Population 1:  X₁₁, X₁₂, ..., X₁ₙ₁
Population 2:  X₂₁, X₂₂, ..., X₂ₙ₂
  ⋮
Population g:  X_g1, X_g2, ..., X_gn_g    (6-31)

MANOVA is used first to investigate whether the population mean vectors are the same and, if not, which mean components differ significantly.
As a review, we begin with the univariate case. Reparameterizing the ℓth population mean as

μ_ℓ = μ + τ_ℓ
(ℓth population mean) = (overall mean) + (ℓth treatment effect)    (6-32)

leads to a restatement of the hypothesis of equality of means. The null hypothesis becomes

H₀: τ₁ = τ₂ = ··· = τ_g = 0

The response X_ℓj, distributed as N(μ + τ_ℓ, σ²), can be expressed in the suggestive form

X_ℓj = μ + τ_ℓ + e_ℓj    (6-33)
(observation) = (overall mean) + (treatment effect) + (random error)

where the e_ℓj are independent N(0, σ²) random variables. To define uniquely the model parameters and their least squares estimates, it is customary to impose the constraint Σ_{ℓ=1}^{g} n_ℓ τ_ℓ = 0.
Motivated by the decomposition in (6-33), the analysis of variance is based upon an analogous decomposition of the observations,

x_ℓj = x̄ + (x̄_ℓ − x̄) + (x_ℓj − x̄_ℓ)    (6-34)
(observation) = (overall sample mean) + (estimated treatment effect) + (residual)

where x̄ is an estimate of μ, τ̂_ℓ = (x̄_ℓ − x̄) is an estimate of τ_ℓ, and (x_ℓj − x̄_ℓ) is an estimate of the error e_ℓj.
Example 6.7 (The sum of squares decomposition for univariate ANOVA) Consider the following independent samples.

Population 1: 9, 6, 9
Population 2: 0, 2
Population 3: 3, 1, 2

Since the sample sizes are 3, 2, and 3, the overall mean is x̄ = (9 + 6 + 9 + 0 + 2 + 3 + 1 + 2)/8 = 4, and the sample means are x̄₁ = 8, x̄₂ = 1, and x̄₃ = 2. The decomposition (6-34), array by array, is

(9 6 9)   (4 4 4)   ( 4  4  4)   ( 1 −2  1)
(0 2  ) = (4 4  ) + (−3 −3   ) + (−1  1   )
(3 1 2)   (4 4 4)   (−2 −2 −2)   ( 1 −1  0)

observation (x_ℓj) = mean (x̄) + treatment effect (x̄_ℓ − x̄) + residual (x_ℓj − x̄_ℓ)
The question of equality of means is answered by assessing whether the contribution of the treatment array is large relative to the residuals. (Our estimates τ̂_ℓ = x̄_ℓ − x̄ of the treatment effects differ from zero because of sampling variation; under H₀, each τ̂_ℓ is an estimate of zero.) If the treatment contribution is large, H₀ should be rejected. The size of an array is quantified by stringing the rows of the array out into a vector and calculating its squared length. This quantity is called the sum of squares (SS). For the observations, we construct the vector y' = [9, 6, 9, 0, 2, 3, 1, 2]. Its squared length is

SS_obs = 9² + 6² + 9² + 0² + 2² + 3² + 1² + 2² = 216
Similarly,

SS_mean = 4² + 4² + 4² + 4² + 4² + 4² + 4² + 4² = 8(4²) = 128

SS_tr = 4² + 4² + 4² + (−3)² + (−3)² + (−2)² + (−2)² + (−2)² = 3(4²) + 2(−3)² + 3(−2)² = 78

and the residual sum of squares is

SS_res = 1² + (−2)² + 1² + (−1)² + 1² + 1² + (−1)² + 0² = 10
The sums of squares satisfy the same decomposition, (6-34), as the observations. Consequently,

SS_obs = SS_mean + SS_tr + SS_res

or 216 = 128 + 78 + 10. The breakup into sums of squares apportions variability in the combined samples into mean, treatment, and residual (error) components. An analysis of variance proceeds by comparing the relative sizes of SS_tr and SS_res. If H₀ is true, variances computed from SS_tr and SS_res should be approximately equal.
The sum of squares decomposition illustrated numerically in Example 6.7 is so basic that the algebraic equivalent will now be developed. Subtracting x̄ from both sides of (6-34) and squaring gives

(x_ℓj − x̄)² = (x̄_ℓ − x̄)² + (x_ℓj − x̄_ℓ)² + 2(x̄_ℓ − x̄)(x_ℓj − x̄_ℓ)

We can sum both sides over j, note that Σ_{j=1}^{n_ℓ}(x_ℓj − x̄_ℓ) = 0, and obtain

Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)² = n_ℓ(x̄_ℓ − x̄)² + Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)²

Next, summing both sides over ℓ, we get

Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)² = Σ_{ℓ=1}^{g} n_ℓ(x̄_ℓ − x̄)² + Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)²    (6-35)

(total corrected SS) = (between samples SS) + (within samples SS)

or

Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} x²_ℓj = (n₁ + n₂ + ··· + n_g)x̄² + Σ_{ℓ=1}^{g} n_ℓ(x̄_ℓ − x̄)² + Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)²    (6-36)

(SS_obs) = (SS_mean) + (SS_tr) + (SS_res)
In the course of establishing (6-36), we have verified that the arrays representing the mean, treatment effects, and residuals are orthogonal. That is, these arrays, considered as vectors, are perpendicular whatever the observation vector y' = [x₁₁, ..., x₁ₙ₁, x₂₁, ..., x₂ₙ₂, ..., x_gn_g]. Consequently, we could obtain SS_res by subtraction, without having to calculate the individual residuals, because SS_res = SS_obs − SS_mean − SS_tr. However, this is false economy because plots of the residuals provide checks on the assumptions of the model.
The vector representations of the arrays involved in the decomposition (6-34) also have geometric interpretations that provide the degrees of freedom. For an arbitrary set of observations, let [x₁₁, ..., x₁ₙ₁, x₂₁, ..., x₂ₙ₂, ..., x_gn_g] = y'. The observation vector y can lie anywhere in n = n₁ + n₂ + ··· + n_g dimensions; the mean vector x̄1 = [x̄, ..., x̄]' must lie along the equiangular line of 1, and the treatment effect vector

(x̄₁ − x̄)u₁ + (x̄₂ − x̄)u₂ + ··· + (x̄_g − x̄)u_g

where u_ℓ has ones in the n_ℓ positions corresponding to the ℓth sample and zeros elsewhere, lies in the hyperplane of linear combinations of the g vectors u₁, u₂, ..., u_g. Since 1 = u₁ + u₂ + ··· + u_g, the mean vector also lies in this hyperplane, and it is always perpendicular to the treatment vector. (See Exercise 6.10.) Thus, the mean vector has the freedom to lie anywhere along the one-dimensional equiangular line, and the treatment vector has the freedom to lie anywhere in the other g − 1 dimensions. The residual vector, ê = y − (x̄1) − [(x̄₁ − x̄)u₁ + ··· + (x̄_g − x̄)u_g], is perpendicular to both the mean vector and the treatment effect vector and has the freedom to lie anywhere in the subspace of dimension n − (g − 1) − 1 = n − g that is perpendicular to their hyperplane.
To summarize, we attribute 1 d.f. to SS_mean, g − 1 d.f. to SS_tr, and n − g = (n₁ + n₂ + ··· + n_g) − g d.f. to SS_res. The total number of degrees of freedom is n = n₁ + n₂ + ··· + n_g. Alternatively, by appealing to the univariate distribution theory, we find that these are the degrees of freedom for the chi-square distributions associated with the corresponding sums of squares.
The calculations of the sums of squares and the associated degrees of freedom are conveniently summarized by an ANOVA table.
ANOVA Table for Comparing Univariate Population Means

Source of variation             | Sum of squares (SS)                              | Degrees of freedom (d.f.)
Treatments                      | SS_tr = Σ_{ℓ=1}^{g} n_ℓ(x̄_ℓ − x̄)²               | g − 1
Residual (error)                | SS_res = Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)² | Σ_{ℓ=1}^{g} n_ℓ − g
Total (corrected for the mean)  | SS_cor = SS_tr + SS_res                          | Σ_{ℓ=1}^{g} n_ℓ − 1

The usual F-test rejects H₀: τ₁ = τ₂ = ··· = τ_g = 0 at level α if

F = [SS_tr/(g − 1)] / [SS_res/(Σ n_ℓ − g)] > F_{g−1, Σn_ℓ−g}(α)

This is equivalent to rejecting H₀ for large values of SS_tr/SS_res or for large values of 1 + SS_tr/SS_res. The statistic appropriate for a multivariate generalization rejects H₀ for small values of the reciprocal

1/(1 + SS_tr/SS_res) = SS_res/(SS_res + SS_tr)    (6-37)
Example 6.8 (A univariate ANOVA table and F-test for treatment effects) Using the information in Example 6.7, we have the following ANOVA table:

Source of variation | Sum of squares | Degrees of freedom
Treatments          | SS_tr = 78     | g − 1 = 3 − 1 = 2
Residual            | SS_res = 10    | Σ n_ℓ − g = (3 + 2 + 3) − 3 = 5
Total (corrected)   | SS_cor = 88    | Σ n_ℓ − 1 = 7

Consequently,

F = [SS_tr/(g − 1)] / [SS_res/(Σ n_ℓ − g)] = (78/2)/(10/5) = 19.5

Since F = 19.5 > F₂,₅(.01) = 13.27, we reject H₀: τ₁ = τ₂ = τ₃ = 0 (no treatment effect) at the 1% level of significance.
Multivariate Analysis of Variance (MANOVA)

Paralleling the univariate reparameterization, the MANOVA model for comparing g population mean vectors is

X_ℓj = μ + τ_ℓ + e_ℓj,    j = 1, 2, ..., n_ℓ  and  ℓ = 1, 2, ..., g    (6-38)

where the e_ℓj are independent Nₚ(0, Σ) variables. Here the parameter vector μ is an overall mean (level), and τ_ℓ represents the ℓth treatment effect with

Σ_{ℓ=1}^{g} n_ℓ τ_ℓ = 0
According to the model in (6-38), each component of the observation vector X_ℓj satisfies the univariate model (6-33). The errors for the components of X_ℓj are correlated, but the covariance matrix Σ is the same for all populations.
A vector of observations may be decomposed as suggested by the model. Thus,

x_ℓj = x̄ + (x̄_ℓ − x̄) + (x_ℓj − x̄_ℓ)    (6-39)
(observation) = (overall sample mean μ̂) + (estimated treatment effect τ̂_ℓ) + (residual ê_ℓj)

The decomposition in (6-39) leads to the multivariate analog of the univariate sum of squares breakup in (6-35). First we note that the product

(x_ℓj − x̄)(x_ℓj − x̄)'
can be written as

(x_ℓj − x̄)(x_ℓj − x̄)' = [(x_ℓj − x̄_ℓ) + (x̄_ℓ − x̄)][(x_ℓj − x̄_ℓ) + (x̄_ℓ − x̄)]'
= (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)' + (x_ℓj − x̄_ℓ)(x̄_ℓ − x̄)' + (x̄_ℓ − x̄)(x_ℓj − x̄_ℓ)' + (x̄_ℓ − x̄)(x̄_ℓ − x̄)'

The sum over j of the middle two expressions is the zero matrix, because Σ_{j=1}^{n_ℓ}(x_ℓj − x̄_ℓ) = 0. Hence, summing the cross product over ℓ and j yields

Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)(x_ℓj − x̄)' = Σ_{ℓ=1}^{g} n_ℓ(x̄_ℓ − x̄)(x̄_ℓ − x̄)' + Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)'    (6-40)

(total corrected sum of squares and cross products) = (treatment/between sum of squares and cross products) + (residual/within sum of squares and cross products)

The within sum of squares and cross products matrix can be expressed as

W = Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)' = (n₁ − 1)S₁ + (n₂ − 1)S₂ + ··· + (n_g − 1)S_g    (6-41)
where S_ℓ is the sample covariance matrix for the ℓth sample. This matrix is a generalization of the (n₁ + n₂ − 2)S_pooled matrix encountered in the two-sample case. It plays a dominant role in testing for the presence of treatment effects.
Analogous to the univariate result, the hypothesis of no treatment effects,

H₀: τ₁ = τ₂ = ··· = τ_g = 0

is tested by considering the relative sizes of the treatment and residual sums of squares and cross products. Equivalently, we may consider the relative sizes of the residual and total (corrected) sum of squares and cross products. Formally, we summarize the calculations leading to the test statistic in a MANOVA table.

MANOVA Table for Comparing Population Mean Vectors

Source of variation             | Matrix of sum of squares and cross products (SSP)             | Degrees of freedom (d.f.)
Treatment                       | B = Σ_{ℓ=1}^{g} n_ℓ(x̄_ℓ − x̄)(x̄_ℓ − x̄)'                      | g − 1
Residual (Error)                | W = Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)'      | Σ_{ℓ=1}^{g} n_ℓ − g
Total (corrected for the mean)  | B + W = Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄)(x_ℓj − x̄)'       | Σ_{ℓ=1}^{g} n_ℓ − 1
This table is exactly the same form, component by component, as the ANOVA table, except that squares of scalars are replaced by their vector counterparts. For example, (x̄_ℓ − x̄)² becomes (x̄_ℓ − x̄)(x̄_ℓ − x̄)'. The degrees of freedom correspond to the univariate geometry and also to some multivariate distribution theory involving Wishart densities. (See [1].)
One test of H₀: τ₁ = τ₂ = ··· = τ_g = 0 involves generalized variances. We reject H₀ if the ratio of generalized variances

Λ* = |W|/|B + W| = |Σ_{ℓ}Σ_{j}(x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)'| / |Σ_{ℓ}Σ_{j}(x_ℓj − x̄)(x_ℓj − x̄)'|    (6-42)

is too small. The quantity Λ* = |W|/|B + W|, proposed originally by Wilks (see [25]), corresponds to the equivalent form (6-37) of the F-test of H₀: no treatment effects in the univariate case. Wilks' lambda has the virtue of being convenient and related to the likelihood ratio criterion. The exact distribution of Λ* can be derived for the special cases listed in Table 6.3. For other cases and large sample sizes, a modification of Λ* due to Bartlett (see [4]) can be used to test H₀.
Table 6.3 Distribution of Wilks' Lambda, Λ* = |W|/|B + W|

No. of variables | No. of groups | Sampling distribution for multivariate normal data
p = 1            | g ≥ 2         | [(Σn_ℓ − g)/(g − 1)] [(1 − Λ*)/Λ*] ~ F_{g−1, Σn_ℓ−g}
p = 2            | g ≥ 2         | [(Σn_ℓ − g − 1)/(g − 1)] [(1 − √Λ*)/√Λ*] ~ F_{2(g−1), 2(Σn_ℓ−g−1)}
p ≥ 1            | g = 2         | [(Σn_ℓ − p − 1)/p] [(1 − Λ*)/Λ*] ~ F_{p, Σn_ℓ−p−1}
p ≥ 1            | g = 3         | [(Σn_ℓ − p − 2)/p] [(1 − √Λ*)/√Λ*] ~ F_{2p, 2(Σn_ℓ−p−2)}

In terms of the eigenvalues λ̂₁, λ̂₂, ..., λ̂_s of W⁻¹B, Wilks' lambda can also be written as

Λ* = Π_{i=1}^{s} 1/(1 + λ̂ᵢ)

where s = min(p, g − 1), the rank of B. Other statistics for checking the equality of several multivariate means, such as Pillai's statistic, the Lawley–Hotelling statistic, and Roy's largest root statistic, can also be written as particular functions of the eigenvalues of W⁻¹B. For large samples, all of these statistics are essentially equivalent. (See the additional discussion on page 336.)
Bartlett (see [4]) has shown that if H₀ is true and Σn_ℓ = n is large,

−(n − 1 − (p + g)/2) ln Λ*    (6-43)

has approximately a chi-square distribution with p(g − 1) d.f. Consequently, for Σn_ℓ = n large, we reject H₀ at significance level α if

−(n − 1 − (p + g)/2) ln Λ* > χ²_{p(g−1)}(α)    (6-44)

where χ²_{p(g−1)}(α) is the upper (100α)th percentile of a chi-square distribution with p(g − 1) d.f.
Example 6.9 (A MANOVA table and Wilks' lambda for testing the equality of three mean vectors) Suppose an additional variable is observed along with the variable introduced in Example 6.7. The sample sizes are n₁ = 3, n₂ = 2, and n₃ = 3. Arranging the observation pairs x_ℓj in rows, we obtain

Population 1: [9, 3], [6, 2], [9, 7]
Population 2: [0, 4], [2, 0]
Population 3: [3, 8], [1, 9], [2, 7]

with

x̄₁ = [8, 4]',  x̄₂ = [1, 2]',  x̄₃ = [2, 8]',  and  x̄ = [4, 5]'

We have already expressed the observations on the first variable as the sum of an overall mean, treatment effect, and residual in our discussion of univariate ANOVA. We found that

(9 6 9)   (4 4 4)   ( 4  4  4)   ( 1 −2  1)
(0 2  ) = (4 4  ) + (−3 −3   ) + (−1  1   )
(3 1 2)   (4 4 4)   (−2 −2 −2)   ( 1 −1  0)
(observation)  (mean)  (treatment effect)  (residual)

and

SS_obs = SS_mean + SS_tr + SS_res
216 = 128 + 78 + 10

Total SS (corrected) = SS_obs − SS_mean = 216 − 128 = 88

Repeating this operation for the observations on the second variable, we have

(3 2 7)   (5 5 5)   (−1 −1 −1)   (−1 −2  3)
(4 0  ) = (5 5  ) + (−3 −3   ) + ( 2 −2   )
(8 9 7)   (5 5 5)   ( 3  3  3)   ( 0  1 −1)
(observation)  (mean)  (treatment effect)  (residual)

and

SS_obs = SS_mean + SS_tr + SS_res
272 = 200 + 48 + 24

Total SS (corrected) = SS_obs − SS_mean = 272 − 200 = 72

These two single-component analyses must be augmented with the sum of entry-by-entry cross products in order to complete the entries in the MANOVA table. Proceeding row by row in the arrays for the two variables, we obtain the cross product contributions:

Mean: 4(5) + 4(5) + ··· + 4(5) = 8(4)(5) = 160
Treatment: 3(4)(−1) + 2(−3)(−3) + 3(−2)(3) = −12
Residual: 1(−1) + (−2)(−2) + 1(3) + (−1)(2) + ··· + 0(−1) = 1
Total: 9(3) + 6(2) + 9(7) + 0(4) + ··· + 2(7) = 149
Total (corrected) cross product = total cross product − mean cross product = 149 − 160 = −11

Thus, the MANOVA table takes the following form:

Source of variation | Matrix of sum of squares and cross products | Degrees of freedom
Treatment           | [78 −12; −12 48]                            | 3 − 1 = 2
Residual            | [10 1; 1 24]                                | 3 + 2 + 3 − 3 = 5
Total (corrected)   | [88 −11; −11 72]                            | 7

Since p = 2 and g = 3, Table 6.3 indicates that an exact test (assuming normality and equal group covariance matrices) of H₀: τ₁ = τ₂ = τ₃ = 0 (no treatment effects) versus H₁: at least one τ_ℓ ≠ 0 is available. To carry out the test, we first compute

Λ* = |W|/|B + W| = [10(24) − (1)²] / [88(72) − (−11)²] = 239/6215 = .0385

and then compare the test statistic

[(Σn_ℓ − g − 1)/(g − 1)] [(1 − √Λ*)/√Λ*] = [(8 − 3 − 1)/(3 − 1)] [(1 − √.0385)/√.0385] = 8.19

with a percentage point of an F-distribution having ν₁ = 2(g − 1) = 4 and ν₂ = 2(Σn_ℓ − g − 1) = 8 d.f. Since 8.19 > F₄,₈(.01) = 7.01, we reject H₀ at the α = .01 level and conclude that treatment differences exist.

When the number of variables, p, is large, the MANOVA table is usually not constructed. Still, it is good practice to have the computer print the matrices B and W so that especially large entries can be located. Also, the residual vectors

ê_ℓj = x_ℓj − x̄_ℓ

should be examined for normality and the presence of outliers using the techniques discussed in Sections 4.6 and 4.7 of Chapter 4.
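The MANOVA computations of Example 6.9 can be verified with a short script. This sketch (not from the text) builds B and W from the raw data and applies the exact test for p = 2, g = 3 from Table 6.3:

```python
import numpy as np

# Example 6.9 data: p = 2 variables, g = 3 groups
groups = [np.array([[9., 3.], [6., 2.], [9., 7.]]),
          np.array([[0., 4.], [2., 0.]]),
          np.array([[3., 8.], [1., 9.], [2., 7.]])]

allx = np.vstack(groups)
xbar = allx.mean(axis=0)                                # overall mean vector

# Between (treatment) and within (residual) SS&CP matrices, per (6-40)-(6-41)
B = sum(len(g) * np.outer(g.mean(axis=0) - xbar, g.mean(axis=0) - xbar) for g in groups)
W = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)

lam = np.linalg.det(W) / np.linalg.det(B + W)           # Wilks' lambda (6-42)

# Exact F transformation for p = 2, g = 3 (Table 6.3)
n, g = len(allx), len(groups)
Fstat = ((n - g - 1) / (g - 1)) * (1 - np.sqrt(lam)) / np.sqrt(lam)
```

Here `lam` reproduces Λ* = .0385 and `Fstat` reproduces the value 8.19 compared against F₄,₈(.01) in the example.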
Example 6.10 (A multivariate analysis of Wisconsin nursing home data) The Wisconsin Department of Health and Social Services reimburses nursing homes in the state for the services provided. The department develops a set of formulas for rates for each facility, based on factors such as level of care, mean wage rate, and average wage rate in the state.
Nursing homes can be classified on the basis of ownership (private party, nonprofit organization, and government) and certification (skilled nursing facility, intermediate care facility, or a combination of the two).
One purpose of a recent study was to investigate the effects of ownership or certification (or both) on costs. Four costs, computed on a per-patient-day basis and measured in hours per patient day, were selected for analysis: X₁ = cost of nursing labor, X₂ = cost of dietary labor, X₃ = cost of plant operation and maintenance labor, and X₄ = cost of housekeeping and laundry labor. A total of n = 516 observations on each of the p = 4 cost variables were initially separated according to ownership. Summary statistics for each of the g = 3 groups are given in the following table.

Group              | Number of observations | Sample mean vector
ℓ = 1 (private)    | n₁ = 271               | x̄₁ = [2.066, .480, .082, .360]'
ℓ = 2 (nonprofit)  | n₂ = 138               | x̄₂ = [2.167, .596, .124, .418]'
ℓ = 3 (government) | n₃ = 107               | x̄₃ = [2.273, .521, .125, .383]'

with Σ_{ℓ=1}^{3} n_ℓ = 516.
The sample covariance matrices S₁, S₂, and S₃ are not reproduced here. Since they seem to be reasonably compatible,³ they were pooled [see (6-41)] to obtain

W = (n₁ − 1)S₁ + (n₂ − 1)S₂ + (n₃ − 1)S₃

Also,

B = Σ_{ℓ=1}^{3} n_ℓ(x̄_ℓ − x̄)(x̄_ℓ − x̄)'

To test H₀: τ₁ = τ₂ = τ₃ = 0 (no ownership effects or, equivalently, no difference in average costs among the three types of owners—private, nonprofit, and government), we can use the result in Table 6.3 for g = 3. Computer-based calculations give

Λ* = |W|/|B + W| = .7714
³However, a normal-theory test of H₀: Σ₁ = Σ₂ = Σ₃ would reject H₀ at any reasonable significance level because of the large sample sizes (see Example 6.12).
and

[(Σn_ℓ − p − 2)/p] [(1 − √Λ*)/√Λ*] = [(516 − 4 − 2)/4] [(1 − √.7714)/√.7714] = 17.67

Let α = .01, so that F_{2(4), 2(510)}(.01) ≈ χ²₈(.01)/8 = 2.51. Since 17.67 > F₈,₁₀₂₀(.01) ≈ 2.51, we reject H₀ at the 1% level and conclude that average costs differ, depending on type of ownership.
It is informative to compare the results based on this "exact" test with those obtained using the large-sample procedure summarized in (6-43) and (6-44). For the present example, Σn_ℓ = n = 516 is large, and H₀ can be tested at the α = .01 level by comparing

−(n − 1 − (p + g)/2) ln Λ* = −511.5 ln(.7714) = 132.76

with χ²_{p(g−1)}(.01) = χ²₈(.01) = 20.09. Since 132.76 > χ²₈(.01) = 20.09, we reject H₀ at the 1% level. This result is consistent with the result based on the foregoing F-statistic.
Simultaneous Confidence Intervals for Treatment Effects

When the hypothesis of equal treatment effects is rejected, those effects that led to the rejection are of interest. For pairwise comparisons, the Bonferroni approach can be used to construct simultaneous confidence intervals for the components of the differences τ_k − τ_ℓ. Let τ_ki be the ith component of τ_k. Since τ_k is estimated by τ̂_k = x̄_k − x̄, the difference τ̂_ki − τ̂_ℓi = x̄_ki − x̄_ℓi is the difference between two independent sample means, and

Var(τ̂_ki − τ̂_ℓi) = Var(X̄_ki − X̄_ℓi) = (1/n_k + 1/n_ℓ) σᵢᵢ

where σᵢᵢ is the ith diagonal element of Σ. As suggested by (6-41), Var(X̄_ki − X̄_ℓi) is estimated by dividing the corresponding element of W by its degrees of freedom. That is,

V̂ar(X̄_ki − X̄_ℓi) = (1/n_k + 1/n_ℓ) wᵢᵢ/(n − g)

where wᵢᵢ is the ith diagonal element of W and n = n₁ + ··· + n_g.
It remains to apportion the error rate over the numerous confidence statements. There are p variables and g(g − 1)/2 pairwise differences, so the Bonferroni correction is based on

m = pg(g − 1)/2    (6-46)

simultaneous confidence statements.

Result 6.5. Let n = Σ_{ℓ=1}^{g} n_ℓ. With confidence at least (1 − α),

τ_ki − τ_ℓi  belongs to  x̄_ki − x̄_ℓi ± t_{n−g}(α/(pg(g − 1))) √((wᵢᵢ/(n − g))(1/n_k + 1/n_ℓ))

for all components i = 1, ..., p and all differences ℓ < k = 1, ..., g. Here wᵢᵢ is the ith diagonal element of W.
We shall illustrate the construction of simultaneous interval estimates for the pairwise differences in treatment means using the nursing-home data introduced in Example 6.10.
Example 6.11 (Simultaneous intervals for treatment differences: nursing homes) We saw in Example 6.10 that average costs for nursing homes differ, depending on the type of ownership. We can use Result 6.5 to estimate the magnitudes of the differences. A comparison of the variable X₃, costs of plant operation and maintenance labor, between privately owned and government-owned nursing homes can be made by estimating τ₁₃ − τ₃₃. Using (6-39) and the information in Example 6.10, we have
τ̂₁ = (x̄₁ − x̄)  and  τ̂₃ = (x̄₃ − x̄)

Consequently,

τ̂₁₃ − τ̂₃₃ = −.043  and  √( (1/n₁ + 1/n₃) w₃₃ / (n − g) ) = √( (1/271 + 1/107) w₃₃ / 513 ) = .00614
310 Chapter 6 Comparisons of Several Multivariate Means

Since p = 4 and g = 3, for 95% simultaneous confidence statements we require that t₅₁₃(.05/(4(3)2)) = 2.87. (See Appendix, Table 1.) The 95% simultaneous confidence statement is
τ₁₃ − τ₃₃  belongs to  τ̂₁₃ − τ̂₃₃ ± t₅₁₃(.00208) √( (1/n₁ + 1/n₃) w₃₃ / (n − g) )
    = −.043 ± 2.87(.00614)
    = −.043 ± .018,  or  (−.061, −.025)
We conclude that the average maintenance and labor cost for government-owned nursing homes is higher by .025 to .061 hour per patient day than for privately owned nursing homes. With the same 95% confidence, we can say that

τ₁₃ − τ₂₃  belongs to the interval  (−.058, −.026)

and

τ₂₃ − τ₃₃  belongs to the interval  (−.021, .019)

Thus, a difference in this cost exists between private and nonprofit nursing homes, but no difference is observed between nonprofit and government nursing homes.
Λ = Πℓ ( |Sℓ| / |Spooled| )^((nℓ−1)/2)    (6-48)

Here nℓ is the sample size for the ℓth group, Sℓ is the ℓth group sample covariance matrix and Spooled is the pooled sample covariance matrix given by

Spooled = (1 / Σℓ(nℓ − 1)) [ (n₁ − 1)S₁ + (n₂ − 1)S₂ + ··· + (ng − 1)Sg ]    (6-49)

Box's test is based on his χ² approximation to the sampling distribution of −2 ln Λ (see Result 5.2). Setting −2 ln Λ = M (Box's M statistic) gives

M = [ Σℓ(nℓ − 1) ] ln|Spooled| − Σℓ (nℓ − 1) ln|Sℓ|    (6-50)
If the null hypothesis is true, the individual sample covariance matrices are not expected to differ too much and, consequently, do not differ too much from the pooled covariance matrix. In this case, the ratios of the determinants in (6-48) will all be close to 1, Λ will be near 1 and Box's M statistic will be small. If the null hypothesis is false, the sample covariance matrices can differ more and the differences in their determinants will be more pronounced. In this case Λ will be small and M will be relatively large. To illustrate, note that the determinant of the pooled covariance matrix, |Spooled|, will lie somewhere near the "middle" of the determinants |Sℓ| of the individual group covariance matrices. As the latter quantities become more disparate, the product of the ratios in (6-48) will get closer to 0. In fact, as the |Sℓ|'s increase in spread, |S₍₁₎|/|Spooled| reduces the product proportionally more than |S₍g₎|/|Spooled| increases it, where |S₍₁₎| and |S₍g₎| are the minimum and maximum determinant values, respectively.
Box's Test for Equality of Covariance Matrices

Set

u = [ Σℓ 1/(nℓ − 1) − 1/Σℓ(nℓ − 1) ] [ (2p² + 3p − 1) / (6(p + 1)(g − 1)) ]    (6-51)

where p is the number of variables and g is the number of groups. Then

C = (1 − u)M    (6-52)

has an approximate χ² distribution with

ν = g p(p + 1)/2 − p(p + 1)/2 = p(p + 1)(g − 1)/2    (6-53)

degrees of freedom. At significance level α, reject H₀ if C > χ²ₚ₍ₚ₊₁₎₍g₋₁₎/₂(α).
Box's χ² approximation works well if each nℓ exceeds 20 and if p and g do not exceed 5. In situations where these conditions do not hold, Box ([7], [8]) has provided a more precise F approximation to the sampling distribution of M.
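A compact implementation of Box's M test is sketched below; it is my own rendering of (6-49) through (6-53), not code from the text:

```python
import numpy as np
from scipy import stats

def box_m_test(S_list, n_list):
    """Box's M test of H0: Sigma_1 = ... = Sigma_g, following
    (6-49)-(6-53): returns M, the corrected statistic C, its d.f.,
    and the approximate chi-square p-value."""
    S_list = [np.asarray(S, dtype=float) for S in S_list]
    dfs = np.array([n - 1 for n in n_list], dtype=float)
    p = S_list[0].shape[0]
    g = len(S_list)
    S_pooled = sum(df * S for df, S in zip(dfs, S_list)) / dfs.sum()
    # M = [sum (n_l - 1)] ln|S_pooled| - sum (n_l - 1) ln|S_l|
    M = dfs.sum() * np.log(np.linalg.det(S_pooled)) - sum(
        df * np.log(np.linalg.det(S)) for df, S in zip(dfs, S_list))
    # scale factor u of (6-51)
    u = (np.sum(1.0 / dfs) - 1.0 / dfs.sum()) * (2 * p**2 + 3 * p - 1) / (
        6.0 * (p + 1) * (g - 1))
    C = (1 - u) * M
    nu = p * (p + 1) * (g - 1) / 2
    return M, C, nu, stats.chi2.sf(C, df=nu)
```

Identical group covariance matrices give M = 0 and a p-value of 1; disparate determinants inflate M, as described above.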
Example 6.12 (Testing equality of covariance matrices: nursing homes) We introduced the Wisconsin nursing home data in Example 6.10. In that example the sample covariance matrices for p = 4 cost variables associated with g = 3 groups of nursing homes are displayed. Assuming multivariate normal data, we test the hypothesis H₀: Σ₁ = Σ₂ = Σ₃ = Σ.
Using the information in Example 6.10, we have n₁ = 271, n₂ = 138, n₃ = 107 and |S₁| = 2.783 × 10⁻⁸, |S₂| = 89.539 × 10⁻⁸, |S₃| = 14.579 × 10⁻⁸, and |Spooled| = 17.398 × 10⁻⁸. Taking the natural logarithms of the determinants gives ln|S₁| = −17.397, ln|S₂| = −13.926, ln|S₃| = −15.741 and ln|Spooled| = −15.564. We calculate

u = [ 1/270 + 1/137 + 1/106 − 1/(270 + 137 + 106) ] [ (2(4)² + 3(4) − 1) / (6(4 + 1)(3 − 1)) ] = .0133

M = [270 + 137 + 106](−15.564) − [270(−17.397) + 137(−13.926) + 106(−15.741)] = 289.3
and C = (1 − .0133)289.3 = 285.5. Referring C to a χ² table with ν = 4(4 + 1)(3 − 1)/2 = 20 degrees of freedom, it is clear that H₀ is rejected at any reasonable level of significance. We conclude that the covariance matrices of the cost variables associated with the three populations of nursing homes are not the same.
Box's M-test is routinely calculated in many statistical computer packages that do MANOVA and other procedures requiring equal covariance matrices. It is known that the M-test is sensitive to some forms of nonnormality. More broadly, in the presence of nonnormality, normal-theory tests on covariances are influenced by the kurtosis of the parent populations (see [16]). However, with reasonably large samples, the MANOVA tests of means or treatment effects are rather robust to nonnormality. Thus the M-test may reject H₀ in some nonnormal cases where it is not damaging to the MANOVA tests. Moreover, with equal sample sizes, some differences in covariance matrices have little effect on the MANOVA tests. To summarize, we may decide to continue with the usual MANOVA tests even though the M-test leads to rejection of H₀.
Two-Way Multivariate Analysis of Variance

combinations of levels. Denoting the rth observation at level ℓ of factor 1 and level k of factor 2 by Xℓkr, we specify the univariate two-way model as

Xℓkr = μ + τℓ + βk + γℓk + eℓkr
    ℓ = 1, 2, ..., g
    k = 1, 2, ..., b    (6-54)
    r = 1, 2, ..., n

where Σℓ τℓ = Σk βk = Σℓ γℓk = Σk γℓk = 0 and the eℓkr are independent N(0, σ²) random variables.
Here μ is an overall level, τℓ represents the fixed effect of factor 1, βk represents the fixed effect of factor 2, and γℓk is the interaction between factor 1 and factor 2. The expected response at the ℓth level of factor 1 and the kth level of factor 2 is thus

E(Xℓkr) = μ + τℓ + βk + γℓk
(mean response) = (overall level) + (factor 1 effect) + (factor 2 effect) + (interaction)

ℓ = 1, 2, ..., g,  k = 1, 2, ..., b    (6-55)

The presence of interaction, γℓk, implies that the factor effects are not additive and complicates the interpretation of the results. Figures 6.3(a) and (b) show
[Figure 6.3: Curves for expected responses (a) with interaction and (b) without interaction; mean response plotted against level of factor 2.]
expected responses as a function of the factor levels with and without interaction, respectively. The absence of interaction means γℓk = 0 for all ℓ and k.
In a manner analogous to (6-55), each observation can be decomposed as

xℓkr = x̄ + (x̄ℓ· − x̄) + (x̄·k − x̄) + (x̄ℓk − x̄ℓ· − x̄·k + x̄) + (xℓkr − x̄ℓk)    (6-56)

where x̄ is the overall average, x̄ℓ· is the average for the ℓth level of factor 1, x̄·k is the average for the kth level of factor 2, and x̄ℓk is the average for the ℓth level of factor 1 and the kth level of factor 2. Squaring and summing the deviations (xℓkr − x̄) gives
Σℓ Σk Σr (xℓkr − x̄)² = Σℓ bn(x̄ℓ· − x̄)² + Σk gn(x̄·k − x̄)²
    + Σℓ Σk n(x̄ℓk − x̄ℓ· − x̄·k + x̄)² + Σℓ Σk Σr (xℓkr − x̄ℓk)²

or

SScor = SSfac1 + SSfac2 + SSint + SSres    (6-57)

The corresponding degrees of freedom associated with the sums of squares in the breakup in (6-57) are

gbn − 1 = (g − 1) + (b − 1) + (g − 1)(b − 1) + gb(n − 1)    (6-58)
The ANOVA table takes the following form:

ANOVA Table for Comparing Effects of Two Factors and Their Interaction

Source of variation    Sum of squares (SS)                                Degrees of freedom (d.f.)
Factor 1               SSfac1 = Σℓ bn(x̄ℓ· − x̄)²                          g − 1
Factor 2               SSfac2 = Σk gn(x̄·k − x̄)²                          b − 1
Interaction            SSint = Σℓ Σk n(x̄ℓk − x̄ℓ· − x̄·k + x̄)²           (g − 1)(b − 1)
Residual (Error)       SSres = Σℓ Σk Σr (xℓkr − x̄ℓk)²                    gb(n − 1)
Total (corrected)      SScor = Σℓ Σk Σr (xℓkr − x̄)²                      gbn − 1
The F-ratios of the mean squares SSfac1/(g − 1), SSfac2/(b − 1), and SSint/((g − 1)(b − 1)) to the mean square SSres/(gb(n − 1)) can be used to test for the effects of factor 1, factor 2, and the factor 1 by factor 2 interaction, respectively. (See [11] for a discussion of univariate two-way analysis of variance.)
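As an illustrative sketch (mine, not the text's), the decomposition (6-57) for a balanced layout can be computed directly from the cell structure:

```python
import numpy as np

def two_way_anova_ss(x):
    """Sums of squares (6-57) for a balanced two-way design;
    x has shape (g, b, n): factor-1 level, factor-2 level, replication."""
    g, b, n = x.shape
    xbar = x.mean()
    xbar_l = x.mean(axis=(1, 2))            # factor-1 level averages
    xbar_k = x.mean(axis=(0, 2))            # factor-2 level averages
    xbar_lk = x.mean(axis=2)                # cell averages
    ss_fac1 = b * n * np.sum((xbar_l - xbar) ** 2)
    ss_fac2 = g * n * np.sum((xbar_k - xbar) ** 2)
    ss_int = n * np.sum(
        (xbar_lk - xbar_l[:, None] - xbar_k[None, :] + xbar) ** 2)
    ss_res = np.sum((x - xbar_lk[:, :, None]) ** 2)
    ss_cor = np.sum((x - xbar) ** 2)
    return ss_fac1, ss_fac2, ss_int, ss_res, ss_cor
```

By (6-57) the four component sums must reproduce SScor exactly, which makes a convenient self-check.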
Proceeding by analogy, we specify the two-way fixed-effects model for a vector response consisting of p components [see (6-54)]:

Xℓkr = μ + τℓ + βk + γℓk + eℓkr
    ℓ = 1, 2, ..., g
    k = 1, 2, ..., b    (6-59)
    r = 1, 2, ..., n

where Σℓ τℓ = Σk βk = Σℓ γℓk = Σk γℓk = 0 and the eℓkr are independent Np(0, Σ) random vectors. Thus, the responses consist of p measurements replicated n times at each of the possible combinations of levels of factors 1 and 2. Following (6-56), we can decompose the observation vectors Xℓkr as
Xℓkr = x̄ + (x̄ℓ· − x̄) + (x̄·k − x̄) + (x̄ℓk − x̄ℓ· − x̄·k + x̄) + (Xℓkr − x̄ℓk)    (6-60)

where x̄ is the overall average of the observation vectors, x̄ℓ· is the average of the observation vectors at the ℓth level of factor 1, x̄·k is the average of the observation vectors at the kth level of factor 2, and x̄ℓk is the average of the observation vectors at the ℓth level of factor 1 and the kth level of factor 2.
Straightforward generalizations of (6-57) and (6-58) give the breakups of the sum of squares and cross products and degrees of freedom:
Σℓ Σk Σr (Xℓkr − x̄)(Xℓkr − x̄)' = Σℓ bn(x̄ℓ· − x̄)(x̄ℓ· − x̄)' + Σk gn(x̄·k − x̄)(x̄·k − x̄)'
    + Σℓ Σk n(x̄ℓk − x̄ℓ· − x̄·k + x̄)(x̄ℓk − x̄ℓ· − x̄·k + x̄)'
    + Σℓ Σk Σr (Xℓkr − x̄ℓk)(Xℓkr − x̄ℓk)'    (6-61)

gbn − 1 = (g − 1) + (b − 1) + (g − 1)(b − 1) + gb(n − 1)    (6-62)

Again, the generalization from the univariate to the multivariate analysis consists simply of replacing a scalar such as (x̄ℓ· − x̄)² with the corresponding matrix (x̄ℓ· − x̄)(x̄ℓ· − x̄)'.
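Sketching (6-61) in code (the array shapes and function names are my own assumptions):

```python
import numpy as np

def manova_ssp(X):
    """SSP matrices of (6-61) for a balanced two-way MANOVA;
    X has shape (g, b, n, p): factor levels, replication, response."""
    g, b, n, p = X.shape
    xbar = X.mean(axis=(0, 1, 2))
    xbar_l = X.mean(axis=(1, 2))            # (g, p) factor-1 averages
    xbar_k = X.mean(axis=(0, 2))            # (b, p) factor-2 averages
    xbar_lk = X.mean(axis=2)                # (g, b, p) cell averages

    def outer_sum(D):
        """Sum of outer products dd' over all leading axes of D."""
        D2 = D.reshape(-1, p)
        return D2.T @ D2

    SSP_fac1 = b * n * outer_sum(xbar_l - xbar)
    SSP_fac2 = g * n * outer_sum(xbar_k - xbar)
    SSP_int = n * outer_sum(
        xbar_lk - xbar_l[:, None, :] - xbar_k[None, :, :] + xbar)
    SSP_res = outer_sum(X - xbar_lk[:, :, None, :])
    SSP_cor = outer_sum(X - xbar)
    return SSP_fac1, SSP_fac2, SSP_int, SSP_res, SSP_cor
```

As in the univariate case, the four component SSP matrices must sum to the total, which serves as a numerical check.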
The MANOVA table is the following:

MANOVA Table for Comparing Factors and Their Interaction

Source of variation    Matrix of sum of squares and cross products (SSP)                      Degrees of freedom (d.f.)
Factor 1               SSPfac1 = Σℓ bn(x̄ℓ· − x̄)(x̄ℓ· − x̄)'                                  g − 1
Factor 2               SSPfac2 = Σk gn(x̄·k − x̄)(x̄·k − x̄)'                                  b − 1
Interaction            SSPint = Σℓ Σk n(x̄ℓk − x̄ℓ· − x̄·k + x̄)(x̄ℓk − x̄ℓ· − x̄·k + x̄)'     (g − 1)(b − 1)
Residual (Error)       SSPres = Σℓ Σk Σr (Xℓkr − x̄ℓk)(Xℓkr − x̄ℓk)'                          gb(n − 1)
Total (corrected)      SSPcor = Σℓ Σk Σr (Xℓkr − x̄)(Xℓkr − x̄)'                              gbn − 1
A test (the likelihood ratio test)⁵ of

H₀: γ₁₁ = γ₁₂ = ··· = γgb = 0  (no interaction effects)

versus

H₁: at least one γℓk ≠ 0

is conducted by rejecting H₀ for small values of the ratio

Λ* = |SSPres| / |SSPint + SSPres|    (6-64)

For large samples, Wilks' lambda, Λ*, can be referred to a chi-square percentile. Using Bartlett's multiplier (see [6]) to improve the chi-square approximation, we reject H₀: γ₁₁ = γ₁₂ = ··· = γgb = 0 at the α level if

−[ gb(n − 1) − (p + 1 − (g − 1)(b − 1))/2 ] ln Λ* > χ²₍g₋₁₎₍b₋₁₎ₚ(α)    (6-65)

where Λ* is given by (6-64) and χ²₍g₋₁₎₍b₋₁₎ₚ(α) is the upper (100α)th percentile of a chi-square distribution with (g − 1)(b − 1)p d.f.
Ordinarily, the test for interaction is carried out before the tests for main factor effects. If interaction effects exist, the factor effects do not have a clear interpretation. From a practical standpoint, it is not advisable to proceed with the additional multivariate tests. Instead, p univariate two-way analyses of variance (one for each variable) are often conducted to see whether the interaction appears in some responses but not

⁵The likelihood test procedures require that p ≤ gb(n − 1), so that SSPres will be positive definite (with probability 1).
others. Those responses without interaction may be interpreted in terms of additive factor 1 and 2 effects, provided that the latter effects exist. In any event, interaction plots similar to Figure 6.3, but with treatment sample means replacing expected values, best clarify the relative magnitudes of the main and interaction effects.
In the multivariate model, we test for factor 1 and factor 2 main effects as follows. First, consider the hypotheses H₀: τ₁ = τ₂ = ··· = τg = 0 and H₁: at least one τℓ ≠ 0. These hypotheses specify no factor 1 effects and some factor 1 effects, respectively. Let
Λ* = |SSPres| / |SSPfac1 + SSPres|    (6-66)

so that small values of Λ* are consistent with H₁. Using Bartlett's correction, the likelihood ratio test is as follows: Reject H₀: τ₁ = τ₂ = ··· = τg = 0 (no factor 1 effects) at level α if

−[ gb(n − 1) − (p + 1 − (g − 1))/2 ] ln Λ* > χ²₍g₋₁₎ₚ(α)    (6-67)
where Λ* is given by (6-66) and χ²₍g₋₁₎ₚ(α) is the upper (100α)th percentile of a chi-square distribution with (g − 1)p d.f.
In a similar manner, factor 2 effects are tested by considering H₀: β₁ = β₂ = ··· = βb = 0 and H₁: at least one βk ≠ 0. Small values of

Λ* = |SSPres| / |SSPfac2 + SSPres|    (6-68)

are consistent with H₁. Once again, for large samples and using Bartlett's correction: Reject H₀: β₁ = β₂ = ··· = βb = 0 (no factor 2 effects) at level α if

−[ gb(n − 1) − (p + 1 − (b − 1))/2 ] ln Λ* > χ²₍b₋₁₎ₚ(α)    (6-69)
where Λ* is given by (6-68) and χ²₍b₋₁₎ₚ(α) is the upper (100α)th percentile of a chi-square distribution with (b − 1)p degrees of freedom.
Simultaneous confidence intervals for contrasts in the model parameters can provide insights into the nature of the factor effects. Results comparable to Result 6.5 are available for the two-way model. When interaction effects are negligible, we may concentrate on contrasts in the factor 1 and factor 2 main effects. The Bonferroni approach applies to the components of the differences τℓ − τm of the factor 1 effects and the components of βk − βq of the factor 2 effects, respectively. The 100(1 − α)% simultaneous confidence intervals for τℓi − τmi are

τℓi − τmi  belongs to  (x̄ℓ·i − x̄m·i) ± t_ν( α / (pg(g − 1)) ) √( (Eii/ν)(2/(bn)) )    (6-70)

where ν = gb(n − 1), Eii is the ith diagonal element of E = SSPres, and x̄ℓ·i is the ith component of x̄ℓ·.
Similarly, the 100(1 − α)% simultaneous confidence intervals for βki − βqi are

βki − βqi  belongs to  (x̄·ki − x̄·qi) ± t_ν( α / (pb(b − 1)) ) √( (Eii/ν)(2/(gn)) )    (6-71)

where ν and Eii are as just defined and x̄·ki is the ith component of x̄·k.
Comment. We have considered the multivariate two-way model with replications. That is, the model allows for n replications of the responses at each combination of factor levels. This enables us to examine the "interaction" of the factors. If only one observation vector is available at each combination of factor levels, the two-way model does not allow for the possibility of a general interaction term γℓk. The corresponding MANOVA table includes only factor 1, factor 2, and residual sources of variation as components of the total variation. (See Exercise 6.13.)
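A sketch of the Wilks lambda tests (6-64) through (6-69), in my own notation; the demonstration reuses the SSP matrices of the plastic film study analyzed in Example 6.13:

```python
import numpy as np
from scipy import stats

def two_way_manova_tests(SSP_fac1, SSP_fac2, SSP_int, SSP_res, g, b, n, p):
    """Wilks' lambda with Bartlett's chi-square correction for the
    interaction (6-64)/(6-65) and main-effect (6-66)-(6-69) tests."""
    det = np.linalg.det
    out = {}
    for name, H, df_h in [("interaction", SSP_int, (g - 1) * (b - 1)),
                          ("factor1", SSP_fac1, g - 1),
                          ("factor2", SSP_fac2, b - 1)]:
        lam = det(SSP_res) / det(H + SSP_res)
        bartlett = -(g * b * (n - 1) - (p + 1 - df_h) / 2) * np.log(lam)
        out[name] = (lam, bartlett, stats.chi2.sf(bartlett, df_h * p))
    return out

# SSP matrices from the plastic film study (g = b = 2, n = 5, p = 3)
H1 = np.array([[1.7405, -1.5045, .8555],
               [-1.5045, 1.3005, -.7395],
               [.8555, -.7395, .4205]])
H2 = np.array([[.7605, .6825, 1.9305],
               [.6825, .6125, 1.7325],
               [1.9305, 1.7325, 4.9005]])
HI = np.array([[.0005, .0165, .0445],
               [.0165, .5445, 1.4685],
               [.0445, 1.4685, 3.9605]])
E = np.array([[1.7640, .0200, -3.0700],
              [.0200, 2.6280, -.5520],
              [-3.0700, -.5520, 64.9240]])
tests = two_way_manova_tests(H1, H2, HI, E, g=2, b=2, n=5, p=3)
```

The interaction lambda comes out near .7771 and its Bartlett statistic near 3.66, matching the values reported in Example 6.13.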
Example 6.13 (A two-way multivariate analysis of variance of plastic film data) The optimum conditions for extruding plastic film have been examined using a technique called Evolutionary Operation. (See [9].) In the course of the study that was done, three responses, X₁ = tear resistance, X₂ = gloss, and X₃ = opacity, were measured at two levels of the factors, rate of extrusion and amount of an additive. The measurements were repeated n = 5 times at each combination of the factor levels. The data are displayed in Table 6.4.
Table 6.4 Plastic Film Data  (entries are [x₁ x₂ x₃]: tear resistance, gloss, opacity)

                                      Factor 2: Amount of additive
Factor 1: Change in            Low (1.0%)                          High (1.5%)
rate of extrusion

Low (−10%)                     [6.5 9.5 4.4]  [6.2 9.9 6.4]        [6.9 9.1 5.7]  [7.2 10.0 2.0]
                               [5.8 9.6 3.0]  [6.5 9.6 4.1]        [6.9 9.9 3.9]  [6.1  9.5 1.9]
                               [6.5 9.2 0.8]                       [6.3 9.4 5.7]

High (10%)                     [6.7 9.1 2.8]  [6.6 9.3 4.1]        [7.1 9.2 8.4]  [7.0  8.8 5.2]
                               [7.2 8.3 3.8]  [7.1 8.4 1.6]        [7.2 9.7 6.9]  [7.5 10.1 2.7]
                               [6.8 8.5 3.4]                       [7.6 9.2 1.9]
The matrices of the appropriate sum of squares and cross products were calculated (see the SAS statistical software output in Panel 6.1⁶), leading to the following MANOVA table:

⁶Additional SAS programs for MANOVA and other procedures discussed in this chapter are available in [13].
Source of variation                       SSP                                  d.f.

Factor 1: change in                [  1.7405  −1.5045    .8555                   1
rate of extrusion                    −1.5045   1.3005   −.7395
                                       .8555   −.7395    .4205 ]

Factor 2: amount of additive       [   .7605    .6825   1.9305                   1
                                       .6825    .6125   1.7325
                                      1.9305   1.7325   4.9005 ]

Interaction                        [   .0005    .0165    .0445                   1
                                       .0165    .5445   1.4685
                                       .0445   1.4685   3.9605 ]

Residual                           [  1.7640    .0200  −3.0700                  16
                                       .0200   2.6280   −.5520
                                     −3.0700   −.5520  64.9240 ]

Total (corrected)                  [  4.2655   −.7855   −.2395                  19
                                      −.7855   5.0855   1.9095
                                      −.2395   1.9095  74.2055 ]
PANEL 6.1

title 'MANOVA';
data film;
  infile 'T6-4.dat';
  input x1 x2 x3 factor1 factor2;
proc glm data = film;
  class factor1 factor2;
  model x1 x2 x3 = factor1 factor2 factor1*factor2 /ss3;
  manova h = factor1 factor2 factor1*factor2 /printe;
  means factor1 factor2;

PROGRAM COMMANDS
General Linear Models Procedure
Class Level Information
  Class     Levels
  FACTOR1   2
  FACTOR2   2
Number of observations in data set = 20                                    OUTPUT

Dependent Variable: X1
  Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
  Model              3       2.50150000     0.83383333        7.56    0.0023
  Error             16       1.76400000     0.11025000
  Corrected Total   19       4.26550000

  R-Square 0.586449    C.V. 4.893724    Root MSE 0.332039    X1 Mean 6.78500000

  Source             DF    Type III SS    Mean Square    F Value    Pr > F
  FACTOR1             1     1.74050000     1.74050000      15.79    0.0011
  FACTOR2             1     0.76050000     0.76050000       6.90    0.0183
  FACTOR1*FACTOR2     1     0.00050000     0.00050000       0.00    0.9471
Dependent Variable: X2
  Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
  Model              3       2.45750000     0.81916667        4.99    0.0125
  Error             16       2.62800000     0.16425000
  Corrected Total   19       5.08550000

  R-Square 0.483237    C.V. 4.350807    Root MSE 0.405278    X2 Mean 9.31500000

  Source             DF    Type III SS    Mean Square    F Value
  FACTOR1             1     1.30050000     1.30050000       7.92
  FACTOR2             1     0.61250000     0.61250000       3.73
  FACTOR1*FACTOR2     1     0.54450000     0.54450000       3.32
Dependent Variable: X3
  Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
  Model              3       9.28150000     3.09383333        0.76    0.5315
  Error             16      64.92400000     4.05775000
  Corrected Total   19      74.20550000

  R-Square 0.125078    C.V. 51.19151    Root MSE 2.014386    X3 Mean 3.93500000

  Source             DF    Type III SS    Mean Square    F Value    Pr > F
  FACTOR1             1     0.42050000     0.42050000       0.10    0.7517
  FACTOR2             1     4.90050000     4.90050000       1.21    0.2881
  FACTOR1*FACTOR2     1     3.96050000     3.96050000       0.98    0.3379
E = Error SS&CP Matrix
        X1        X2        X3
  X1    1.764     0.02     −3.07
  X2    0.02      2.628    −0.552
  X3   −3.07     −0.552    64.924

Manova Test Criteria and Exact F Statistics for the Hypothesis of no Overall FACTOR1 Effect
H = Type III SS&CP Matrix for FACTOR1    E = Error SS&CP Matrix
S = 1    M = 0.5    N = 6

  Statistic                  Value         F        Num DF    Den DF    Pr > F
  Wilks' Lambda              0.38185838    7.5543      3        14      0.0030
  Pillai's Trace             0.61814162    7.5543      3        14      0.0030
  Hotelling-Lawley Trace     1.61877188    7.5543      3        14      0.0030
  Roy's Greatest Root        1.61877188    7.5543      3        14      0.0030
PANEL 6.1 (continued)

Manova Test Criteria and Exact F Statistics for the Hypothesis of no Overall FACTOR2 Effect
H = Type III SS&CP Matrix for FACTOR2    E = Error SS&CP Matrix
S = 1    M = 0.5    N = 6

  Statistic                  Value         F        Num DF    Den DF    Pr > F
  Wilks' Lambda              0.52303490    4.2556      3        14      0.0247
  Pillai's Trace             0.47696510    4.2556      3        14      0.0247
  Hotelling-Lawley Trace     0.91191832    4.2556      3        14      0.0247
  Roy's Greatest Root        0.91191832    4.2556      3        14      0.0247

Manova Test Criteria and Exact F Statistics for the Hypothesis of no Overall FACTOR1*FACTOR2 Effect
H = Type III SS&CP Matrix for FACTOR1*FACTOR2    E = Error SS&CP Matrix
S = 1    M = 0.5    N = 6

  Statistic                  Value         F        Num DF    Den DF
  Wilks' Lambda              0.77710576    1.3385      3        14
  Pillai's Trace             0.22289424    1.3385      3        14
  Hotelling-Lawley Trace     0.28682614    1.3385      3        14
  Roy's Greatest Root        0.28682614    1.3385      3        14

  Level of FACTOR1     N    X3 Mean         SD
  0                   10    3.79000000     1.85379491
  1                   10    4.08000000     2.18214981
To test for interaction, we compute

Λ* = |SSPres| / |SSPint + SSPres| = .7771

For (g − 1)(b − 1) = 1,

F = ( (1 − Λ*) / Λ* ) · ( gb(n − 1) − p + 1 ) / ( |(g − 1)(b − 1) − p| + 1 )

has an exact F-distribution with

ν₁ = |(g − 1)(b − 1) − p| + 1 = |1 − 3| + 1 = 3  and  ν₂ = gb(n − 1) − p + 1 = 2(2)(4) − 3 + 1 = 14

degrees of freedom. For our data,

F = ( (1 − .7771) / .7771 ) (14/3) = 1.34

and F₃,₁₄(.05) = 3.34. Since F = 1.34 < F₃,₁₄(.05) = 3.34, we do not reject the hypothesis H₀: γ₁₁ = γ₁₂ = γ₂₁ = γ₂₂ = 0 (no interaction effects).
Note that the approximate chi-square statistic for this test is −[2(2)(4) − (3 + 1 − 1(1))/2] ln(.7771) = 3.66, from (6-65). Since χ²₃(.05) = 7.81, we would reach the same conclusion as provided by the exact F-test.
To test for factor 1 and factor 2 effects (see page 317), we calculate
Λ₁* = |SSPres| / |SSPfac1 + SSPres| = .3819  and  Λ₂* = |SSPres| / |SSPfac2 + SSPres| = .5230

For both g − 1 = 1 and b − 1 = 1,

F₁ = ( (1 − Λ₁*) / Λ₁* ) · ( gb(n − 1) − p + 1 ) / ( |(g − 1) − p| + 1 )

and

F₂ = ( (1 − Λ₂*) / Λ₂* ) · ( gb(n − 1) − p + 1 ) / ( |(b − 1) − p| + 1 )

have F-distributions with degrees of freedom ν₁ = |(g − 1) − p| + 1, ν₂ = gb(n − 1) − p + 1 and ν₁ = |(b − 1) − p| + 1, ν₂ = gb(n − 1) − p + 1, respectively. (See [1].) In our case,

F₁ = ( (1 − .3819) / .3819 ) (14/3) = 7.55  and  F₂ = ( (1 − .5230) / .5230 ) (14/3) = 4.26

with ν₁ = |1 − 3| + 1 = 3 and ν₂ = (16 − 3 + 1) = 14
From before, F₃,₁₄(.05) = 3.34. We have F₁ = 7.55 > F₃,₁₄(.05) = 3.34, and therefore, we reject H₀: τ₁ = τ₂ = 0 (no factor 1 effects) at the 5% level. Similarly, F₂ = 4.26 > F₃,₁₄(.05) = 3.34, and we reject H₀: β₁ = β₂ = 0 (no factor 2 effects) at the 5% level. We conclude that both the change in rate of extrusion and the amount of additive affect the responses, and they do so in an additive manner.
The nature of the effects of factors 1 and 2 on the responses is explored in Exercise 6.15. In that exercise, simultaneous confidence intervals for contrasts in the components of τℓ and βk are considered.
Profile Analysis
⁷The question, "Assuming that the profiles are parallel, are the profiles linear?" is considered in Exercise 6.12. The null hypothesis of parallel linear profiles can be written H₀: (μ₁ᵢ + μ₂ᵢ) − (μ₁,ᵢ₋₁ + μ₂,ᵢ₋₁) = (μ₁,ᵢ₋₁ + μ₂,ᵢ₋₁) − (μ₁,ᵢ₋₂ + μ₂,ᵢ₋₂), i = 3, ..., p. Although this hypothesis may be of interest in a particular situation, in practice the question of whether two parallel profiles are the same (coincident), whatever their nature, is usually of greater interest.
3. Assuming that the profiles are coincident, are the profiles level? That is, are all the means equal to the same constant? Equivalently: Is H₀₃: μ₁₁ = μ₁₂ = ··· = μ₁p = μ₂₁ = μ₂₂ = ··· = μ₂p acceptable?

The null hypothesis in stage 1 can be written
H₀₁: Cμ₁ = Cμ₂

where C is the (p − 1) × p contrast matrix

C = [ −1   1   0   ···   0   0
       0  −1   1   ···   0   0
       ⋮                 ⋮
       0   0   0   ···  −1   1 ]    (6-72)
For independent samples of sizes n₁ and n₂ from the two populations, the null hypothesis can be tested by constructing the transformed observations

Cx₁ⱼ,  j = 1, 2, ..., n₁  and  Cx₂ⱼ,  j = 1, 2, ..., n₂

These have sample mean vectors Cx̄₁ and Cx̄₂, respectively, and pooled covariance matrix CSpooledC'. Since the two sets of transformed observations have N₍p₋₁₎(Cμ₁, CΣC') and N₍p₋₁₎(Cμ₂, CΣC') distributions, respectively, an application of Result 6.2 provides a test for parallel profiles.
Test for Parallel Profiles for Two Normal Populations

Reject H₀₁: Cμ₁ = Cμ₂ (parallel profiles) at level α if

T² = (x̄₁ − x̄₂)'C' [ (1/n₁ + 1/n₂) C Spooled C' ]⁻¹ C(x̄₁ − x̄₂) > c²    (6-73)

where

c² = ( (n₁ + n₂ − 2)(p − 1) / (n₁ + n₂ − p) ) F₍p₋₁, n₁₊n₂₋p₎(α)
When the profiles are parallel, the first is either above the second (μ₁ᵢ > μ₂ᵢ for all i), or vice versa. Under this condition, the profiles will be coincident only if the total heights μ₁₁ + μ₁₂ + ··· + μ₁p = 1'μ₁ and μ₂₁ + μ₂₂ + ··· + μ₂p = 1'μ₂ are equal. Therefore, the null hypothesis at stage 2 can be written in the equivalent form

H₀₂: 1'μ₁ = 1'μ₂

We can then test H₀₂ with the usual two-sample t-statistic based on the univariate observations 1'x₁ⱼ, j = 1, 2, ..., n₁, and 1'x₂ⱼ, j = 1, 2, ..., n₂.

Test for Coincident Profiles, Given That Profiles Are Parallel

Reject H₀₂: 1'μ₁ = 1'μ₂ (profiles coincident) at level α if

T² = 1'(x̄₁ − x̄₂) [ (1/n₁ + 1/n₂) 1'Spooled1 ]⁻¹ 1'(x̄₁ − x̄₂) > F₍₁, n₁₊n₂₋₂₎(α)    (6-74)
For coincident profiles, x₁₁, x₁₂, ..., x₁n₁ and x₂₁, x₂₂, ..., x₂n₂ are all observations from the same normal population. The next step is to see whether all variables have the same mean, so that the common profile is level.
When H₀₁ and H₀₂ are tenable, the common mean vector μ is estimated, using all n₁ + n₂ observations, by

x̄ = (1/(n₁ + n₂)) ( Σⱼ x₁ⱼ + Σⱼ x₂ⱼ ) = (n₁/(n₁ + n₂)) x̄₁ + (n₂/(n₁ + n₂)) x̄₂

If the common profile is level, then μ₁ = μ₂ = ··· = μp, and the null hypothesis at stage 3 can be written as

H₀₃: Cμ = 0

where C is given by (6-72). Consequently, we have the following test.

Test for Level Profiles, Given That Profiles Are Coincident

Reject H₀₃: Cμ = 0 (profiles level) at level α if

(n₁ + n₂) x̄'C' [CSC']⁻¹ Cx̄ > c²    (6-75)

where S is the sample covariance matrix based on all n₁ + n₂ observations and

c² = ( (n₁ + n₂ − 1)(p − 1) / (n₁ + n₂ − p + 1) ) F₍p₋₁, n₁₊n₂₋p₊₁₎(α)
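The first two stages can be scripted; this is my own sketch of (6-73) and (6-74), demonstrated with the marriage-survey summary statistics of Example 6.14 below:

```python
import numpy as np
from scipy import stats

def profile_tests(x1bar, x2bar, S_pooled, n1, n2, alpha=0.05):
    """Parallel-profile test (6-73) and coincident-profile test (6-74)
    for two groups, from summary statistics."""
    p = len(x1bar)
    C = np.diff(np.eye(p), axis=0)          # (p-1) x p contrast matrix
    d = np.asarray(x1bar) - np.asarray(x2bar)
    # stage 1: parallelism
    M = (1/n1 + 1/n2) * C @ S_pooled @ C.T
    T2_par = (C @ d) @ np.linalg.solve(M, C @ d)
    c2_par = ((n1 + n2 - 2) * (p - 1) / (n1 + n2 - p)) * stats.f.ppf(
        1 - alpha, p - 1, n1 + n2 - p)
    # stage 2: coincidence (two-sample t on the totals 1'x)
    one = np.ones(p)
    T2_coin = (one @ d) ** 2 / ((1/n1 + 1/n2) * (one @ S_pooled @ one))
    c2_coin = stats.f.ppf(1 - alpha, 1, n1 + n2 - 2)
    return T2_par, c2_par, T2_coin, c2_coin

x1bar = [6.833, 7.033, 3.967, 4.700]        # males
x2bar = [6.633, 7.000, 4.000, 4.533]        # females
S = np.array([[.606, .262, .066, .161],
              [.262, .637, .173, .143],
              [.066, .173, .810, .029],
              [.161, .143, .029, .306]])
T2_par, c2_par, T2_coin, c2_coin = profile_tests(x1bar, x2bar, S, 30, 30)
```

Both statistics fall below their critical values (about 1.005 < 8.6 and .501 < 4.0), in line with the conclusions of Example 6.14.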
Example 6.14 (A profile analysis of love and marriage data) As part of a larger study of love and marriage, E. Hatfield, a sociologist, surveyed adults with respect to their marriage "contributions" and "outcomes" and their levels of "passionate" and "companionate" love. Recently married males and females were asked to respond to the following questions, using the 8-point scale in the figure below.

1. All things considered, how would you describe your contributions to the marriage?
2. All things considered, how would you describe your outcomes from the marriage?

Subjects were also asked to respond to the following questions, using the 5-point scale shown.

3. What is the level of passionate love that you feel for your partner?
4. What is the level of companionate love that you feel for your partner?
  None at all    Very little    Some    A great deal    Tremendous amount
       1              2           3           4                 5
Let

x₁ = an 8-point scale response to Question 1
x₂ = an 8-point scale response to Question 2
x₃ = a 5-point scale response to Question 3
x₄ = a 5-point scale response to Question 4

and let the two populations be Population 1 = married men and Population 2 = married women.
The population means are the average responses to the p = 4 questions for the populations of males and females. Assuming a common covariance matrix Σ, it is of interest to see whether the profiles of males and females are the same. A sample of n₁ = 30 males and n₂ = 30 females gave the sample mean vectors
x̄₁ = [6.833, 7.033, 3.967, 4.700]'  (males)      x̄₂ = [6.633, 7.000, 4.000, 4.533]'  (females)

and pooled covariance matrix

Spooled = [ .606  .262  .066  .161
            .262  .637  .173  .143
            .066  .173  .810  .029
            .161  .143  .029  .306 ]

The sample mean vectors are plotted as sample profiles in Figure 6.5 on page 327. Since the sample sizes are reasonably large, we shall use the normal-theory methodology, even though the data, which are integers, are clearly nonnormal. To test for parallelism (H₀₁: Cμ₁ = Cμ₂), we compute
[Figure 6.5: Sample profiles for marriage and love responses. Key: x-x males; o-o females.]
C Spooled C' = [  .719  −.268  −.125
                 −.268  1.101  −.751
                 −.125  −.751  1.058 ]

and

C(x̄₁ − x̄₂) = [ −.167, −.066, .200 ]'

Thus,

T² = (x̄₁ − x̄₂)'C' [ (1/30 + 1/30) C Spooled C' ]⁻¹ C(x̄₁ − x̄₂) = 1.005

Moreover, with α = .05,

c² = ( (30 + 30 − 2)(4 − 1) / (30 + 30 − 4) ) F₃,₅₆(.05) = 3.11(2.8) = 8.7

Since T² = 1.005 < 8.7, we conclude that the hypothesis of parallel profiles for men and women is tenable. Given the plot in Figure 6.5, this finding is not surprising.
Assuming that the profiles are parallel, we can test for coincident profiles. To test H₀₂: 1'μ₁ = 1'μ₂ (profiles coincident), we need

Sum of elements in (x̄₁ − x̄₂) = 1'(x̄₁ − x̄₂) = .367
Sum of elements in Spooled = 1'Spooled1 = 4.027

Using (6-74), we obtain

T² = ( .367 / √( (1/30 + 1/30) 4.027 ) )² = .501
With α = .05, F₁,₅₈(.05) = 4.0, and T² = .501 < F₁,₅₈(.05) = 4.0, we cannot reject the hypothesis that the profiles are coincident. That is, the responses of men and women to the four questions posed appear to be the same.
We could now test for level profiles; however, it does not make sense to carry out this test for our example, since Questions 1 and 2 were measured on a scale of 1 to 8, while Questions 3 and 4 were measured on a scale of 1 to 5. The incompatibility of these scales makes the test for level profiles meaningless and illustrates the need for similar measurements in order to carry out a complete profile analysis.
When the sample sizes are small, a profile analysis will depend on the normality assumption. This assumption can be checked, using methods discussed in Chapter 4, with the original observations xℓⱼ or the contrast observations Cxℓⱼ.
The analysis of profiles for several populations proceeds in much the same fashion as that for two populations. In fact, the general measures of comparison are analogous to those just discussed. (See [13], [18].)
observed over a period of time. For instance, we could measure the weight of a puppy at birth and then once a month. It is the curve traced by a typical dog that must be modeled. In this context, we refer to the curve as a growth curve.
When some subjects receive one treatment and others another treatment, the growth curves for the treatments need to be compared. To illustrate the growth curve model introduced by Potthoff and Roy [21], we consider calcium measurements of the dominant ulna bone in older women. Besides an initial reading, Table 6.5 gives readings after one year, two years, and three years for the control group. Readings obtained by photon absorptiometry from the same subject are correlated, but those from different subjects should be independent. The model assumes that the same covariance matrix Σ holds for each subject. Unlike univariate approaches, this model does not require the four measurements to have equal variances. A profile, constructed from the four sample means (x̄₁, x̄₂, x̄₃, x̄₄), summarizes the growth, which here is a loss of calcium over time. Can the growth pattern be adequately represented by a polynomial in time?
Table 6.5 Calcium Measurements on the Dominant Ulna; Control Group

Subject    Initial    1 year    2 year    3 year
1           87.3       86.9      86.7      75.5
2           59.0       60.2      60.0      53.6
3           76.7       76.5      75.7      69.5
4           70.6       76.1      72.1      65.3
5           54.9       55.1      57.2      49.0
6           78.2       75.3      69.1      67.6
7           73.7       70.8      71.8      74.6
8           61.8       68.7      68.2      57.4
9           85.3       84.4      79.2      67.0
10          82.3       86.9      79.4      77.4
11          68.6       65.4      72.3      60.8
12          67.8       69.2      66.3      57.9
13          66.2       67.0      67.0      56.2
14          81.0       82.3      86.8      73.9
15          72.3       74.6      75.3      66.1
Mean        72.38      73.29     72.47     64.79
When the p measurements on all subjects are taken at times t₁, t₂, ..., tp, the Potthoff-Roy model for quadratic growth becomes

E[X] = [ β₀ + β₁t₁ + β₂t₁²,  β₀ + β₁t₂ + β₂t₂²,  ...,  β₀ + β₁tp + β₂tp² ]'
where the ith mean μᵢ is the quadratic expression evaluated at tᵢ.
Usually groups need to be compared. Table 6.6 gives the calcium measurements for a second set of women, the treatment group, that received special help with diet and a regular exercise program.

Table 6.6 Calcium Measurements on the Dominant Ulna; Treatment Group (3-year readings; n₂ = 16)
3 year: 81.2, 60.6, 75.2, 66.7, 54.2, 68.6, 71.6, 64.1, 70.3, 67.9, 65.9, 48.0, 51.5, 68.0, 65.7, 53.0    Mean 64.53

When a study involves several treatment groups, an extra subscript is needed as in the one-way MANOVA model. Let Xℓ₁, Xℓ₂, ..., Xℓnℓ be the nℓ vectors of measurements on the nℓ subjects in group ℓ, for ℓ = 1, ..., g.
Assumptions. All of the Xℓⱼ are independent and have the same covariance matrix Σ. Under the quadratic growth model, the mean vectors are

E[Xℓⱼ] = [ βℓ0 + βℓ1t₁ + βℓ2t₁²,  βℓ0 + βℓ1t₂ + βℓ2t₂²,  ...,  βℓ0 + βℓ1tp + βℓ2tp² ]' = Bβℓ
where

B = [ 1  t₁  t₁²
      1  t₂  t₂²
      ⋮
      1  tp  tp² ]    and    βℓ = [ βℓ0, βℓ1, βℓ2 ]'    (6-76)

If a qth-order polynomial is fit to the growth data, then

B = [ 1  t₁  ···  t₁^q
      1  t₂  ···  t₂^q
      ⋮
      1  tp  ···  tp^q ]    and    βℓ = [ βℓ0, βℓ1, ..., βℓq ]'    (6-77)
Under the assumption of multivariate normality, the maximum likelihood estimators of the βℓ are

β̂ℓ = ( B' S⁻¹pooled B )⁻¹ B' S⁻¹pooled x̄ℓ    for ℓ = 1, 2, ..., g    (6-78)

where

Spooled = (1/(N − g)) ( (n₁ − 1)S₁ + ··· + (ng − 1)Sg ) = (1/(N − g)) W

with N = Σℓ nℓ, is the pooled estimator of the common covariance matrix Σ. The estimated covariances of the maximum likelihood estimators are

Ĉov(β̂ℓ) = (k/nℓ) ( B' S⁻¹pooled B )⁻¹    for ℓ = 1, 2, ..., g    (6-79)
where k = (N − g)(N − g − 1) / ( (N − g − p + q)(N − g − p + q + 1) ). Also, β̂ℓ and β̂h are independent, for ℓ ≠ h, so their covariance is 0.
We can formally test that a qth-order polynomial is adequate. The model is fit without restrictions; the error sum of squares and cross products matrix is just the within-groups W that has N − g degrees of freedom. Under a qth-order polynomial, the error sum of squares and cross products matrix
Wq = Σℓ Σⱼ ( Xℓⱼ − Bβ̂ℓ )( Xℓⱼ − Bβ̂ℓ )'    (6-80)

has N − g + p − q − 1 degrees of freedom. The likelihood ratio test of the null hypothesis that the qth-order polynomial is adequate can be based on Wilks' lambda
Λ* = |W| / |Wq|    (6-81)

Under the polynomial growth model, there are q + 1 terms instead of the p means for each of the groups. Thus there are (p − q − 1)g fewer parameters. For large sample sizes, the null hypothesis that the polynomial is adequate is rejected if

−( N − (1/2)(p − q + g) ) ln Λ* > χ²₍p₋q₋₁₎g(α)    (6-82)
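A sketch of the estimator (6-78), under assumed inputs of my own choosing (summary means x̄ℓ, within-groups matrix W, and measurement times):

```python
import numpy as np

def potthoff_roy_fit(xbars, n_list, W, times, q):
    """Growth-curve estimates (6-78): beta_hat_l =
    (B' S^-1 B)^-1 B' S^-1 xbar_l, with S = S_pooled = W/(N - g)
    and B the degree-q polynomial design matrix of (6-77)."""
    N, g = sum(n_list), len(n_list)
    S = np.asarray(W, dtype=float) / (N - g)
    B = np.vander(np.asarray(times, dtype=float), q + 1, increasing=True)
    Sinv_B = np.linalg.solve(S, B)          # S^-1 B without an explicit inverse
    G = np.linalg.inv(B.T @ Sinv_B)         # (B' S^-1 B)^-1
    betas = [G @ Sinv_B.T @ np.asarray(xb, dtype=float) for xb in xbars]
    return betas, G
```

If a group's mean vector lies exactly on a degree-q polynomial, the estimator recovers its coefficients exactly, whatever covariance matrix is supplied; standard errors then follow from (6-79) by scaling G with k/nℓ.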
Example 6.15 (Fitting a quadratic growth curve to calcium loss) Refer to the data in Tables 6.5 and 6.6. Fit the model for quadratic growth. A computer calculation gives

[β̂₁, β̂₂] = [ 73.07   70.14
               3.64    4.09
              −2.03   −1.85 ]

so the estimated growth curves are

Control group:      73.07 + 3.64t − 2.03t²
                    (2.58)  (.83)   (.28)

Treatment group:    70.14 + 4.09t − 1.85t²
                    (2.50)  (.80)   (.27)

where

( B' S⁻¹pooled B )⁻¹ = [ 93.1744  −5.8368   0.2184
                          −5.8368   9.5699  −3.0240
                           0.2184  −3.0240      ·   ]

and, by (6-79), the standard errors given below the parameter estimates were obtained by dividing the diagonal elements by nℓ and taking the square root.
Examination of the estimates and the standard errors reveals that the t² terms are needed. Loss of calcium is predicted after 3 years for both groups. Further, there does not seem to be any substantial difference between the two groups.
Wilks' lambda for testing the null hypothesis that the quadratic growth model is adequate becomes

Λ* = |W| / |W₂| = .7627

where

W = [ 2762.282  2660.749  2369.308  2335.912
      2660.749  2756.009  2343.514  2327.961
      2369.308  2343.514  2301.714  2098.544
      2335.912  2327.961  2098.544  2277.452 ]

W₂ = [ 2781.017  2698.589  2363.228  2362.253
       2698.589  2832.430  2331.235  2381.160
       2363.228  2331.235  2303.687  2089.996
       2362.253  2381.160  2089.996  2314.485 ]
Since, with α = .01,

−( N − (1/2)(p − q + g) ) ln Λ* = −( 31 − (1/2)(4 − 2 + 2) ) ln .7627 = 7.86 < χ²₍₄₋₂₋₁₎₂(.01) = 9.21
we fail to reject the adequacy of the quadratic fit at α = .01. Since the p-value is less than .05, there is, however, some evidence that the quadratic does not fit well. We could, without restricting to quadratic growth, test for parallel and coincident calcium loss using profile analysis.
The Potthoff and Roy growth curve model holds for more general designs than one-way MANOVA. However, the β̂ℓ are no longer given by (6-78), and the expression for their covariance matrix becomes more complicated than (6-79). We refer the reader to [14] for more examples and further tests.
There are many other modifications to the model treated here. They include the following:
(a) Dropping the restriction to polynomial growth. Use nonlinear parametric models.
indicate. A single multivariate test, with its associated single p-value, is preferable to performing a large number of univariate tests. The outcome tells us whether or not it is worthwhile to look closer on a variable-by-variable and group-by-group basis.
A single multivariate test is recommended over, say, p univariate tests because, as the next example demonstrates, univariate tests ignore important information and can give misleading results.

Example 6.16 (Comparing multivariate and univariate tests for the differences in means) Suppose we collect measurements on two variables X₁ and X₂ for ten randomly selected experimental units from each of two groups. The hypothetical data are noted here and displayed as scatter plots and marginal dot diagrams in Figure 6.6 on page 334.

[Table of the hypothetical (x₁, x₂) measurements and group labels for the 20 experimental units.]
It is clear from the horizontal marginal dot diagram that there is considerable overlap in the x₁ values for the two groups. Similarly, the vertical marginal dot diagram shows there is considerable overlap in the x₂ values for the two groups. The scatter plots suggest that there is fairly strong positive correlation between the two variables for each group, and that, although there is some overlap, the group 1 measurements are generally to the southeast of the group 2 measurements.
Let μ₁' = [μ₁₁, μ₁₂] be the population mean vector for the first group, and let μ₂' = [μ₂₁, μ₂₂] be the population mean vector for the second group. Using the x₁ observations, a univariate analysis of variance gives F = 2.46 with ν₁ = 1 and ν₂ = 18 degrees of freedom. Consequently, we cannot reject H₀: μ₁₁ = μ₂₁ at any reasonable significance level (F₁,₁₈(.10) = 3.01). Using the x₂ observations, a univariate analysis of variance gives F = 2.68 with ν₁ = 1 and ν₂ = 18 degrees of freedom. Again, we cannot reject H₀: μ₁₂ = μ₂₂ at any reasonable significance level.
[Figure 6.6: Scatter plots and marginal dot diagrams for the data from two groups.]
The univariate tests suggest there is no difference between the component means for the two groups, and hence we cannot discredit p. 1 = p. 2 . On the other hand, if we use Hotelling's T 2 to test for the equality of the mean vectors, we find
T² = 17.29 > c² = [(n1 + n2 − 2)p / (n1 + n2 − p − 1)] F_{p, n1+n2−p−1}(.01)

and we reject H0: μ1 = μ2 at the 1% level. The multivariate test takes into account the positive correlation between the two measurements for each group, information that is unfortunately ignored by the univariate tests. This T²-test is equivalent to the MANOVA test (6-42).
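As a computational aside, a two-sample Hotelling T² of the kind used in this example can be sketched in Python. The data below are synthetic (the example's values are not reproduced here); they are constructed so that the two variables are strongly correlated within each group, the situation in which the multivariate test outperforms the univariate ones.

```python
import numpy as np
from scipy import stats

def hotelling_t2_two_sample(x1, x2):
    """Two-sample Hotelling T^2, assuming equal population covariance matrices."""
    n1, p = x1.shape
    n2, _ = x2.shape
    d = x1.mean(axis=0) - x2.mean(axis=0)
    s_pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
                + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    t2 = d @ np.linalg.solve((1 / n1 + 1 / n2) * s_pooled, d)
    # T^2 is distributed as ((n1+n2-2)p / (n1+n2-p-1)) F_{p, n1+n2-p-1}
    f_stat = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t2
    p_value = stats.f.sf(f_stat, p, n1 + n2 - p - 1)
    return t2, p_value

# Synthetic, strongly correlated bivariate groups (not the book's data):
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 1))                 # shared component -> correlation
g1 = np.hstack([5 + base + 0.2 * rng.normal(size=(10, 1)),
                5 + base + 0.2 * rng.normal(size=(10, 1))])
g2 = g1 + np.array([0.4, -0.4])                 # shift along the minor axis
t2, pval = hotelling_t2_two_sample(g1, g2)
```

The F conversion uses the exact small-sample distribution of T²; the shift direction is chosen so that each marginal overlaps heavily while the joint separation is clear.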
Example 6.17 (Data on lizards that require a bivariate test to establish a difference in means) A zoologist collected lizards in the southwestern United States. Among other variables, he measured mass (in grams) and the snout-vent length (in millimeters). Because the tails sometimes break off in the wild, the snout-vent length is a more representative measure of length. The data for the lizards from two genera, Cnemidophorus (C) and Sceloporus (S), collected in 1997 and 1999 are given in Table 6.7. Notice that there are n1 = 20 measurements for C lizards and n2 = 40 measurements for S lizards. After taking natural logarithms, the summary statistics are
C: n1 = 20    x̄1′ = [2.240, 4.394]

S1 = | 0.35305  0.09417 |
     | 0.09417  0.02595 |

S: n2 = 40    x̄2′ = [      ]    S2 = [      ]
Table 6.7 Lizard Data for Two Genera

         C                    S                    S
  Mass      SVL        Mass      SVL        Mass      SVL
  7.513     74.0      13.911     77.0      14.666     80.0
  5.032     69.5       5.236     62.0       4.790     62.0
  5.867     72.0      37.331    108.0       5.020     61.5
 11.088     80.0      41.781    115.0       5.220     62.0
  2.419     56.0      31.995    106.0       5.690     64.0
 13.610     94.0       3.962     56.0       6.763     63.0
 18.247     95.5       4.367     60.5       9.977     71.0
 16.832     99.5       3.048     52.0       8.831     69.5
 15.910     97.0       4.838     60.0       9.493     67.5
 17.035     90.5       6.525     64.0       7.811     66.0
 16.526     91.0      22.610     96.0       6.685     64.5
  4.530     67.0      13.342     79.5      11.980     79.0
  7.230     75.0       4.109     55.5      16.520     84.0
  5.200     69.5      12.369     75.0      13.630     81.0
 13.450     91.5       7.120     64.5      13.700     82.5
 14.080     91.0      21.077     87.5      10.350     74.0
 14.665     90.0      42.989    109.0       7.900     68.5
  6.092     73.0      27.201     96.0       9.103     70.0
  5.264     69.5      38.901    111.0      13.216     77.5
 16.902     94.0      19.747     84.5       9.787     70.0
Figure 6.7 Scatter plot of ln(Mass) versus ln(SVL) for the lizard data in Table 6.7.
A plot of mass (Mass) versus snout-vent length (SVL), after taking natural logarithms, is shown in Figure 6.7. The large-sample individual 95% confidence intervals for the difference in ln(Mass) means and the difference in ln(SVL) means both cover 0.

ln(Mass): μ11 − μ21: (−0.476, 0.220)
ln(SVL):  μ12 − μ22: (−0.011, 0.183)
Chapter 6 Comparisons of Several Multivariate Means

The corresponding univariate Student's t-test statistics for testing for no difference in the individual means have p-values of .46 and .08, respectively. Clearly, from a univariate perspective, we cannot detect a difference in mass means or a difference in snout-vent length means for the two genera of lizards. However, consistent with the scatter diagram in Figure 6.7, a bivariate analysis strongly supports a difference in size between the two groups of lizards. Using Result 6.4 (also see Example 6.5), the T² statistic has an approximate χ² distribution with 2 degrees of freedom. For this example, T² = 225.4 with a p-value less than .0001. A multivariate method is essential in this case.
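The large-sample chi-square version of the test (Result 6.4) can be sketched directly from summary statistics. In the sketch below, the C-group mean and covariance are the values quoted in the text, but the S-group mean and covariance are placeholders of my own, not the book's numbers, so the resulting T² will not equal 225.4.

```python
import numpy as np
from scipy import stats

# C-group summary statistics from the text; S-group values are hypothetical.
n1, n2 = 20, 40
xbar1 = np.array([2.240, 4.394])
S1 = np.array([[0.35305, 0.09417],
               [0.09417, 0.02595]])
xbar2 = np.array([2.80, 4.30])                  # hypothetical
S2 = np.array([[0.50, 0.14],
               [0.14, 0.05]])                   # hypothetical

# Large-sample T^2 allowing unequal covariance matrices
d = xbar1 - xbar2
T2 = d @ np.linalg.solve(S1 / n1 + S2 / n2, d)
p_value = stats.chi2.sf(T2, df=2)               # approximate chi-square, p = 2 d.f.
```

With moderate-to-large n1 and n2, T² is referred to a chi-square distribution with p = 2 degrees of freedom, which is the approximation the example invokes.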
Examples 6.16 and 6.17 demonstrate the efficacy of a multivariate test relative to its univariate counterparts. We encountered exactly this situation with the effluent data in Example 6.1. In the context of random samples from several populations (recall the one-way MANOVA in Section 6.4), multivariate tests are based on the matrices
W = Σ_{ℓ=1}^{g} Σ_{j=1}^{n_ℓ} (x_ℓj − x̄_ℓ)(x_ℓj − x̄_ℓ)′   and   B = Σ_{ℓ=1}^{g} n_ℓ (x̄_ℓ − x̄)(x̄_ℓ − x̄)′

Wilks' lambda

Λ* = |W| / |B + W|

is equivalent to the likelihood ratio test. Three other multivariate test statistics are regularly included in the output of statistical packages.

Lawley-Hotelling trace = tr[BW⁻¹]
Pillai trace = tr[B(B + W)⁻¹]
Roy's largest root = maximum eigenvalue of W(B + W)⁻¹
All four of these tests appear to be nearly equivalent for extremely large samples. For moderate sample sizes, all comparisons are based on what is necessarily a limited number of cases studied by simulation. From the simulations reported to date, the first three tests have similar power, while the last, Roy's test, behaves differently. Its power is best only when there is a single nonzero eigenvalue and, at the same time, the power is large. This may approximate situations where a large difference exists in just one characteristic and it is between one group and all of the others. There is also some suggestion that Pillai's trace is slightly more robust against nonnormality. However, we suggest trying transformations on the original data when the residuals are nonnormal. All four statistics apply in the two-way setting and in even more complicated MANOVA. More discussion is given in terms of the multivariate regression model in Chapter 7. When, and only when, the multivariate tests signal a difference, or departure from the null hypothesis, do we probe deeper. We recommend calculating the Bonferroni intervals for all pairs of groups and all characteristics. The simultaneous confidence statements determined from the shadows of the confidence ellipse are, typically, too large. The one-at-a-time intervals may be suggestive of differences that
merit further study but, with the current data, cannot be taken as conclusive evidence for the existence of differences. We summarize the procedure developed in this chapter for comparing treatments. The first step is to check the data for outliers using visual displays and other calculations.
We must issue one caution concerning the proposed strategy. It may be the case that differences would appear in only one of the many characteristics and, further, the differences hold for only a few treatment combinations. Then, these few active differences may become lost among all the inactive ones. That is, the overall test may not show significance whereas a univariate test restricted to the specific active variable would detect the difference. The best preventative is a good experimental design. To design an effective experiment when one specific variable is expected to produce differences, do not include too many other variables that are not expected to show differences among the treatments.
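For readers reproducing these procedures in software, all four test statistics can be computed directly from the between-groups matrix B and the within-groups matrix W. A minimal sketch follows; the two matrices are made-up illustrations, not data from this chapter.

```python
import numpy as np

def manova_statistics(B, W):
    """Wilks' lambda, Lawley-Hotelling trace, Pillai trace, Roy's largest root."""
    BW = B + W
    wilks = np.linalg.det(W) / np.linalg.det(BW)          # |W| / |B + W|
    lawley = np.trace(B @ np.linalg.inv(W))               # tr[B W^{-1}]
    pillai = np.trace(B @ np.linalg.inv(BW))              # tr[B (B + W)^{-1}]
    # Roy's statistic: largest eigenvalue of W(B + W)^{-1}
    roy = np.linalg.eigvals(W @ np.linalg.inv(BW)).real.max()
    return wilks, lawley, pillai, roy

# Illustrative positive definite 2x2 matrices (made up):
B = np.array([[36.0, -11.0], [-11.0, 4.0]])
W = np.array([[88.0, -11.0], [-11.0, 72.0]])
wilks, lawley, pillai, roy = manova_statistics(B, W)
```

Because B and W are both positive definite here, Wilks' lambda and Roy's root fall strictly between 0 and 1, and the Pillai trace lies between 0 and p.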
Exercises
6.1. Construct and sketch a joint 95% confidence region for the mean difference vector δ using the effluent data and results in Example 6.1. Note that the point δ = 0 falls outside the 95% contour. Is this result consistent with the test of H0: δ = 0 considered in Example 6.1? Explain.
6.2. Using the information in Example 6.1, construct the 95% Bonferroni simultaneous intervals for the components of the mean difference vector δ. Compare the lengths of these intervals with those of the simultaneous intervals constructed in the example.
6.3. The data corresponding to sample 8 in Table 6.1 seem unusually large. Remove sample 8. Construct a joint 95% confidence region for the mean difference vector δ and the 95% Bonferroni simultaneous intervals for the components of the mean difference vector. Are the results consistent with a test of H0: δ = 0? Discuss. Does the "outlier" make a difference in the analysis of these data?
6.4. (a) Redo the analysis in Example 6.1 after transforming the pairs of observations to ln(BOD) and ln(SS).
(b) Construct the 95% Bonferroni simultaneous intervals for the components of the mean vector δ of transformed variables.
(c) Discuss any possible violation of the assumption of a bivariate normal distribution for the difference vectors of transformed observations.
6.5. A researcher considered three indices measuring the severity of heart attacks. The values of these indices for n = 40 heart-attack patients arriving at a hospital emergency room produced the summary statistics
x̄ = [      ]   and   S = [101.3  63.0  ⋯]
(a) All three indices are evaluated for each patient. Test for the equality of mean indices using (6-16) with α = .05.
(b) Judge the differences in pairs of mean indices using 95% simultaneous confidence intervals. [See (6-18).]
6.6. Use the data for treatments 2 and 3 in Exercise 6.8.
(a) Calculate S_pooled.
(b) Test H0: μ2 − μ3 = 0 employing a two-sample approach with α = .01.
(c) Construct 99% simultaneous confidence intervals for the differences μ2i − μ3i, i = 1, 2.
6.7. Using the summary statistics for the electricity-demand data given in Example 6.4, compute T² and test the hypothesis H0: μ1 − μ2 = 0, assuming that Σ1 = Σ2. Set α = .05. Also, determine the linear combination of mean components most responsible for the rejection of H0.
6.8. Observations on two responses are collected for three treatments. The observation vectors
[x1, x2]′ for the three treatments are

Treatment 1:  ·  ·  ·
Treatment 2:  ·  ·  ·
Treatment 3:  ·  ·  ·
(a) Break up the observations into mean, treatment, and residual components, as in (6-39). Construct the corresponding arrays for each variable. (See Example 6.9.)
(b) Using the information in Part a, construct the one-way MANOVA table.
(c) Evaluate Wilks' lambda, Λ*, and use Table 6.3 to test for treatment effects. Set α = .01. Repeat the test using the chi-square approximation with Bartlett's correction. [See (6-43).] Compare the conclusions.
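The mean/treatment/residual decomposition called for in Part (a), together with the sum-of-squares identity that underlies the MANOVA table, can be sketched numerically for a single response. The observations below are made up for illustration; they are not the vectors of this exercise.

```python
import numpy as np

# One-way decomposition x = (grand mean) + (treatment effect) + (residual),
# for one response variable, with made-up observations.
groups = [np.array([6., 5., 8., 4., 7.]),
          np.array([3., 1., 2.]),
          np.array([2., 5., 3., 2.])]
grand = np.concatenate(groups).mean()

ss_obs = ss_mean = ss_tr = ss_res = 0.0
for g in groups:
    effect = g.mean() - grand       # treatment effect component
    resid = g - g.mean()            # residual component
    assert np.allclose(g, grand + effect + resid)   # decomposition holds
    ss_obs += (g ** 2).sum()
    ss_mean += len(g) * grand ** 2
    ss_tr += len(g) * effect ** 2
    ss_res += (resid ** 2).sum()

# The components are orthogonal, so the sums of squares add up:
assert np.isclose(ss_obs, ss_mean + ss_tr + ss_res)   # 246 = 192 + 36 + 18 here
```

The same arithmetic, applied variable by variable and accumulated into matrices, produces the W and B of the one-way MANOVA table.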
6.9. Using the contrast matrix C in (6-13), verify the relationships d_j = Cx_j, d̄ = Cx̄, and S_d = CSC′ in (6-14).
6.10. Consider the univariate one-way decomposition of the observation x_ℓj given by (6-34). Show that the mean vector x̄1 is always perpendicular to the treatment effect vector (x̄1 − x̄)u1 + (x̄2 − x̄)u2 + ⋯ + (x̄g − x̄)ug where
u1 = [1, …, 1, 0, …, 0]′ (n1 ones, then zeros),  u2 = [0, …, 0, 1, …, 1, 0, …, 0]′ (n2 ones), …, ug = [0, …, 0, 1, …, 1]′ (ng ones)
6.11. A likelihood argument provides additional support for pooling the two independent sample covariance matrices to estimate a common covariance matrix in the case of two normal populations. Give the likelihood function, L(μ1, μ2, Σ), for two independent samples of sizes n1 and n2 from Np(μ1, Σ) and Np(μ2, Σ) populations, respectively. Show that this likelihood is maximized by the choices μ̂1 = x̄1, μ̂2 = x̄2, and

Σ̂ = [(n1 − 1)S1 + (n2 − 1)S2] / (n1 + n2)
6.12. (Test for linear profiles, given that the profiles are parallel.) Let μ1′ = [μ11, μ12, …, μ1p] and μ2′ = [μ21, μ22, …, μ2p] be the mean responses to p treatments for populations 1 and 2, respectively. Assume that the profiles given by the two mean vectors are parallel.
(a) Show that the hypothesis that the profiles are linear can be written as H0: (μ1i + μ2i) − (μ1,i−1 + μ2,i−1) = (μ1,i−1 + μ2,i−1) − (μ1,i−2 + μ2,i−2), i = 3, …, p, or as H0: C(μ1 + μ2) = 0, where C is a (p − 2) × p contrast matrix.
(b) Following an argument similar to the one leading to (6-73), we reject H0: C(μ1 + μ2) = 0 at level α if T² > c², where
Let n1 = 30, n2 = 30, x̄1′ = [6.4, 6.8, 7.3, 7.0], x̄2′ = [4.3, 4.9, 5.3, 5.1], and

S_pooled = | .61  .26  .07  .16 |
           | .26  .64  .17  .14 |
           | .07  .17  .81  .03 |
           | .16  .14  .03  .31 |
Test for linear profiles, assuming that the profiles are parallel. Use α = .05.
6.13. (Two-way MANOVA without replications.) Consider the observations on two responses, x1 and x2, displayed in the form of the following two-way table (there is a single observation vector at each combination of factor levels):

                            Factor 2
               Level 1   Level 2   Level 3   Level 4
Factor 1  Level 1    ·        ·        ·        ·
          Level 2    ·        ·        ·        ·
          Level 3    ·        ·        ·        ·

With no replications, the two-way MANOVA model is

x_ℓk = μ + τ_ℓ + β_k + e_ℓk,   ℓ = 1, 2, …, g;  k = 1, 2, …, b

with Σ_ℓ τ_ℓ = Σ_k β_k = 0, where the e_ℓk are independent Np(0, Σ) random vectors.

(a) Decompose the observations for each of the two variables as

x_ℓk = x̄ + (x̄_ℓ· − x̄) + (x̄_·k − x̄) + (x_ℓk − x̄_ℓ· − x̄_·k + x̄)

(b) Consequently, obtain the matrices SSP_cor, SSP_fac1, SSP_fac2, and SSP_res with degrees of freedom gb − 1, g − 1, b − 1, and (g − 1)(b − 1), respectively.
(c) Summarize the calculations in Part b in a MANOVA table.
Hint: This MANOVA table is consistent with the two-way MANOVA table for comparing factors and their interactions where n = 1. Note that, with n = 1, SSP_res in the general two-way MANOVA table is a zero matrix with zero degrees of freedom. The matrix of interaction sum of squares and cross products now becomes the residual sum of squares and cross products matrix.
(d) Given the summary in Part c, test for factor 1 and factor 2 main effects at the α = .05 level.
Hint: Use the results in (6-67) and (6-69) with gb(n − 1) replaced by (g − 1)(b − 1).
Note: The tests require that p ≤ (g − 1)(b − 1) so that SSP_res will be positive definite (with probability 1).
6.14. A replicate of the experiment in Exercise 6.13 yields the following data:
                            Factor 2
               Level 1   Level 2   Level 3   Level 4
Factor 1  Level 1    ·        ·        ·        ·
          Level 2    ·        ·        ·        ·
          Level 3    ·        ·        ·        ·
(a) Use these data to decompose each of the two measurements in the observation vector as
x_ℓk = x̄ + (x̄_ℓ· − x̄) + (x̄_·k − x̄) + (x_ℓk − x̄_ℓ· − x̄_·k + x̄)
where x̄ is the overall average, x̄_ℓ· is the average for the ℓth level of factor 1, and x̄_·k is the average for the kth level of factor 2. Form the corresponding arrays for each of the two responses.
(b) Combine the preceding data with the data in Exercise 6.13 and carry out the necessary calculations to complete the general two-way MANOVA table.
(c) Given the results in Part b, test for interactions, and if the interactions do not exist, test for factor 1 and factor 2 main effects. Use the likelihood ratio test with α = .05.
(d) If main effects, but no interactions, exist, examine the nature of the main effects by constructing Bonferroni simultaneous 95% confidence intervals for differences of the components of the factor effect parameters.
6.15. Refer to Example 6.13.
(a) Carry out approximate chi-square (likelihood ratio) tests for the factor 1 and factor 2 effects. Set α = .05. Compare these results with the results for the exact F-tests given in the example. Explain any differences.
(b) Using (6-70), construct simultaneous 95% confidence intervals for differences in the factor 1 effect parameters for pairs of the three responses. Interpret these intervals. Repeat these calculations for factor 2 effect parameters.
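The two-way decomposition used in Exercises 6.13 and 6.14 can be checked numerically for one response variable. The 3 × 4 array of values below is made up for illustration only.

```python
import numpy as np

# Two-way layout with one observation per cell: g = 3 levels of factor 1,
# b = 4 levels of factor 2 (illustrative values, not the exercise data).
X = np.array([[8., 3., 6., 5.],
              [5., 2., 3., 4.],
              [9., 6., 8., 7.]])

grand = X.mean()
row_eff = X.mean(axis=1, keepdims=True) - grand   # factor 1 component
col_eff = X.mean(axis=0, keepdims=True) - grand   # factor 2 component
resid = X - grand - row_eff - col_eff             # interaction/residual component

# The four arrays reproduce the observations exactly:
assert np.allclose(X, grand + row_eff + col_eff + resid)
```

The residual array sums to zero across every row and every column, which is what makes its sum of squares and cross products usable as SSP_res when n = 1.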
6.16. Four measures of the response stiffness on each of 30 boards are listed in Table 4.3 (see Example 4.14). The measures, on a given board, are repeated in the sense that they were made one after another. Assuming that the measures of stiffness arise from four treatments, test for the equality of treatments in a repeated measures design context. Set α = .05. Construct a 95% (simultaneous) confidence interval for a contrast in the mean levels representing a comparison of the dynamic measurements with the static measurements.
6.17. The data in Table 6.8 were collected to test two psychological models of numerical cognition. Does the processing of numbers depend on the way the numbers are presented (words, Arabic digits)? Thirty-two subjects were required to make a series of
Table 6.8 Number Parity Data (Median Times in Milliseconds)

 WordDiff   WordSame   ArabicDiff   ArabicSame
   (x1)       (x2)        (x3)         (x4)
  869.0      860.5       691.0        601.0
  995.0      875.0       678.0        659.0
 1056.0      930.5       833.0        826.0
 1126.0      954.0       888.0        728.0
 1044.0      909.0       865.0        839.0
  925.0      856.5      1059.5        797.0
 1172.5      896.5       926.0        766.0
 1408.5     1311.0       854.0        986.0
 1028.0      887.0       915.0        735.0
 1011.0      863.0       761.0        657.0
  726.0      674.0       663.0        583.0
  982.0      894.0       831.0        640.0
 1225.0     1179.0      1037.0        905.5
  731.0      662.0       662.5        624.0
  975.5      872.5       814.0        735.0
 1130.5      811.0       843.0        657.0
  945.0      909.0       867.5        754.0
  747.0      752.5       777.0        687.5
  656.5      659.5       572.0        539.0
  919.0      833.0       752.0        611.0
  751.0      744.0       683.0        553.0
  774.0      735.0       671.0        612.0
  941.0      931.0       901.5        700.0
  751.0      785.0       789.0        735.0
  767.0      737.5       724.0        639.0
  813.5      750.5       711.0        625.0
 1289.5     1140.0       904.5        784.5
 1096.5     1009.0      1076.0        983.0
 1083.0      958.0       918.0        746.5
 1114.0     1046.0      1081.0        796.0
  708.0      669.0       657.0        572.5
 1201.0      925.0      1004.5        673.5
quick numerical judgments about two numbers presented as either two number words ("two," "four") or two single Arabic digits ("2," "4"). The subjects were asked to respond "same" if the two numbers had the same numerical parity (both even or both odd) and "different" if the two numbers had a different parity (one even, one odd). Half of the subjects were assigned a block of Arabic digit trials, followed by a block of number word trials, and half of the subjects received the blocks of trials in the reverse order. Within each block, the order of "same" and "different" parity trials was randomized for each subject. For each of the four combinations of parity and format, the median reaction times for correct responses were recorded for each subject. Here
X1 = median reaction time for word format-different parity combination
X2 = median reaction time for word format-same parity combination
X3 = median reaction time for Arabic format-different parity combination
X4 = median reaction time for Arabic format-same parity combination
(a) Test for treatment effects using a repeated measures design. Set α = .05.
(b) Construct 95% (simultaneous) confidence intervals for the contrasts representing the number format effect, the parity type effect, and the interaction effect. Interpret the resulting intervals.
(c) The absence of interaction supports the M model of numerical cognition, while the presence of interaction supports the C and C model of numerical cognition. Which model is supported in this experiment?
(d) For each subject, construct three difference scores corresponding to the number format contrast, the parity type contrast, and the interaction contrast. Is a multivariate normal distribution a reasonable population model for these data? Explain.
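A repeated measures test of the kind Part (a) asks for can be sketched as follows. The reaction times below are simulated, not the data of Table 6.8, and the successive-difference contrast matrix is one of many valid choices.

```python
import numpy as np
from scipy import stats

def repeated_measures_t2(X, C):
    """Test H0: C mu = 0 for n subjects each measured under p conditions."""
    n, p = X.shape
    q = C.shape[0]                              # number of contrasts (p - 1 here)
    dbar = C @ X.mean(axis=0)
    Sd = C @ np.cov(X, rowvar=False) @ C.T      # S_d = C S C'
    t2 = n * dbar @ np.linalg.solve(Sd, dbar)
    # T^2 ~ ((n-1)q / (n-q)) F_{q, n-q} under H0
    f_stat = (n - q) / ((n - 1) * q) * t2
    return t2, stats.f.sf(f_stat, q, n - q)

rng = np.random.default_rng(2)
# Simulated median reaction times for 32 subjects under 4 conditions:
X = rng.normal(loc=[700, 680, 660, 640], scale=40, size=(32, 4))
C = np.array([[1, -1, 0, 0],
              [0, 1, -1, 0],
              [0, 0, 1, -1]], float)           # successive-difference contrasts
t2, pval = repeated_measures_t2(X, C)
```

Any full-rank set of p − 1 contrasts gives the same T² value, so the format, parity, and interaction contrasts of Part (b) could be substituted for C without changing the overall test.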
6.18. Jolicoeur and Mosimann [12] studied the relationship of size and shape for painted turtles. Table 6.9 contains their measurements on the carapaces of 24 female and 24 male turtles.
(a) Test for equality of the two population mean vectors using α = .05.
(b) If the hypothesis in Part a is rejected, find the linear combination of mean components most responsible for rejecting H0.
(c) Find simultaneous confidence intervals for the component mean differences. Compare with the Bonferroni intervals.
Hint: You may wish to consider logarithmic transformations of the observations.
6.19. In the first phase of a study of the cost of transporting milk from farms to dairy plants, a survey was taken of firms engaged in milk transportation. Cost data on X1 = fuel, X2 = repair, and X3 = capital, all measured on a per-mile basis, are presented in Table 6.10 on page 345 for n1 = 36 gasoline and n2 = 23 diesel trucks.
(a) Test for differences in the mean cost vectors. Set α = .01.
(b) If the hypothesis of equal cost vectors is rejected in Part a, find the linear combination of mean components most responsible for the rejection.
(c) Construct 99% simultaneous confidence intervals for the pairs of mean components. Which costs, if any, appear to be quite different?
(d) Comment on the validity of the assumptions used in your analysis. Note in particular that observations 9 and 21 for gasoline trucks have been identified as multivariate outliers. (See Exercise 5.22 and [2].) Repeat Part a with these observations deleted. Comment on the results.
Table 6.9 Carapace Measurements (in Millimeters) for Painted Turtles

          Female                       Male
 Length   Width   Height    Length   Width   Height
  (x1)     (x2)    (x3)      (x1)     (x2)    (x3)
   98       81      38        93       74      37
  103       84      38        94       78      35
  103       86      42        96       80      35
  105       86      42       101       84      39
  109       88      44       102       85      38
  123       92      50       103       81      37
  123       95      46       104       83      39
  133       99      51       106       83      39
  133      102      51       107       82      38
  133      102      51       112       89      40
  134      100      48       113       88      40
  136      102      49       114       86      40
  138       98      51       116       90      43
  138       99      51       117       90      41
  141      105      53       117       91      41
  147      108      57       119       93      41
  149      107      55       120       89      40
  153      107      56       120       93      44
  155      115      63       121       95      42
  155      117      60       125       93      45
  158      115      62       127       96      45
  159      118      63       128       95      45
  162      124      61       131       95      46
  177      132      67       135      106      47
6.20. The tail lengths in millimeters (x1) and wing lengths in millimeters (x2) for 45 male hook-billed kites are given in Table 6.11 on page 346. Similar measurements for female hook-billed kites were given in Table 5.12.
(a) Plot the male hook-billed kite data as a scatter diagram, and (visually) check for outliers. (Note, in particular, observation 31 with x1 = 284.)
(b) Test for equality of mean vectors for the populations of male and female hook-billed kites. Set α = .05. If H0: μ1 − μ2 = 0 is rejected, find the linear combination most responsible for the rejection of H0. (You may want to eliminate any outliers found in Part a for the male hook-billed kite data before conducting this test. Alternatively, you may want to interpret x1 = 284 for observation 31 as a misprint and conduct the test with x1 = 184 for this observation. Does it make any difference in this case how observation 31 for the male hook-billed kite data is treated?)
(c) Determine the 95% confidence region for μ1 − μ2 and 95% simultaneous confidence intervals for the components of μ1 − μ2.
(d) Are male or female birds generally larger?
Table 6.10 Milk Transportation-Cost Data

Gasoline trucks:
   x1      x2      x3
 16.44   12.43   11.23
  7.19    2.70    3.92
  9.92    1.35    9.75
  4.24    5.78    7.78
 11.20    5.05   10.67
 14.25    5.78    9.88
 13.50   10.98   10.60
 13.32   14.27    9.45
 29.11   15.09    3.28
 12.68    7.61   10.23
  7.51    5.80    8.13
  9.90    3.63    9.13
 10.25    5.07   10.17
 11.11    6.15    7.61
 12.17   14.26   14.39
 10.24    2.59    6.09
 10.18    6.05   12.14
  8.88    2.70   12.23
 12.34    7.73   11.68
  8.51   14.02   12.01
 26.16   17.44   16.89
 12.95    8.24    7.18
 16.93   13.37   17.59
 14.70   10.78   14.58
 10.32    5.16   17.00
  8.98    4.49    4.26
  9.70   11.59    6.83
 12.72    8.63    5.59
  9.49    2.16    6.23
  8.22    7.95    6.72
 13.70   11.22    4.91
  8.21    9.85    8.17
 15.86   11.42   13.06
  9.18    9.18    9.49
 12.49    4.67   11.94
 17.32    6.86    4.44

Diesel trucks:
   x1      x2      x3
  8.50   12.26    9.11
  7.42    5.13   17.15
 10.28    3.32   11.23
 10.16   14.72    5.99
 12.79    4.17   29.28
  9.60   12.72   11.00
  6.47    8.89   19.00
 11.35    9.95   14.53
  9.15    2.94   13.68
  9.70    5.06   20.84
  9.77   17.86   35.18
 11.61   11.75   17.00
  9.09   13.25   20.66
  8.53   10.14   17.45
  8.29    6.22   16.38
 15.90   12.90   19.09
 11.94    5.69   14.77
  9.54   16.77   22.66
 10.43   17.65   10.66
 10.87   21.52   28.47
  7.13   13.22   19.44
 11.88   12.18   21.20
 12.03    9.22   23.09
X 1 = current ratio (a measure of shortterm liquidity) X 2 = longterm interest rate (a measure of interest coverage) X 3 = debttoequity ratio (a measure of financial risk or leverage) X 4 = rate of return on equity (a measure of profitability)
Table 6.11 Male Hook-Billed Kite Data

    x1        x2         x1        x2         x1        x2
  (Tail     (Wing      (Tail     (Wing      (Tail     (Wing
 length)   length)    length)   length)    length)   length)
   180       278        185       282        284       277
   186       277        195       285        176       281
   206       308        183       276        185       287
   184       290        202       308        191       295
   177       273        177       254        177       267
   177       284        177       268        197       310
   176       267        170       260        199       299
   200       281        186       274        190       273
   191       287        177       272        180       278
   193       271        178       266        189       280
   212       302        192       281        194       290
   181       254        204       276        186       287
   195       297        191       290        191       286
   187       281        178       265        187       288
   190       284        177       275        186       275
Source: Data courtesy of S. Temple.

were recorded. The summary statistics are as follows:

Aa bond companies: n1 = 20, x̄1′ = [2.287, 12.600, .347, 14.830], and

S1 = | .459    .254    .026     ·  |
     | .254  27.465    .589     ·  |
     | .026    .589    .030     ·  |
     | .244    .267    .102     ·  |

Baa bond companies: n2 = 20, x̄2′ = [2.404, 7.155, .524, 12.840], and

S2 = [      ]

S_pooled = | .944    .089    .002    .719  |
           | .089  16.432    .400  19.044  |
           | .002    .400    .024    .094  |
           | .719  19.044    .094  61.854  |
(c) Calculate the linear combinations of mean components most responsible for rejecting H0: μ1 − μ2 = 0 in Part b.
(d) Bond rating companies are interested in a company's ability to satisfy its outstanding debt obligations as they mature. Does it appear as if one or more of the foregoing financial ratios might be useful in helping to classify a bond as "high" or "medium" quality? Explain.
(e) Repeat Part b assuming normal populations with unequal covariance matrices [see (6-27), (6-28), and (6-29)]. Does your conclusion change?
6.22. Researchers interested in assessing pulmonary function in nonpathological populations asked subjects to run on a treadmill until exhaustion. Samples of air were collected at definite intervals and the gas contents analyzed. The results on four measures of oxygen consumption for 25 males and 25 females are given in Table 6.12 on page 348. The variables were
X1 = resting volume O2 (L/min)
X2 = resting volume O2 (mL/kg/min)
X3 = maximum volume O2 (L/min)
X4 = maximum volume O2 (mL/kg/min)
(a) Look for gender differences by testing for equality of group means. Use α = .05. If you reject H0: μ1 − μ2 = 0, find the linear combination most responsible.
(b) Construct the 95% simultaneous confidence intervals for each μ1i − μ2i, i = 1, 2, 3, 4. Compare with the corresponding Bonferroni intervals.
(c) The data in Table 6.12 were collected from graduate-student volunteers, and thus they do not represent a random sample. Comment on the possible implications of this information.
6.23. Construct a one-way MANOVA using the width measurements from the iris data in Table 11.5. Construct 95% simultaneous confidence intervals for differences in mean components for the two responses for each pair of populations. Comment on the validity of the assumption that Σ1 = Σ2 = Σ3.
6.24. Researchers have suggested that a change in skull size over time is evidence of the interbreeding of a resident population with immigrant populations. Four measurements were made of male Egyptian skulls for three different time periods: period 1 is 4000 B.C., period 2 is 3300 B.C., and period 3 is 1850 B.C. The data are shown in Table 6.13 on page 349 (see the skull data on the website www.prenhall.com/statistics). The measured variables are
X1 = maximum breadth of skull (mm)
X2 = basibregmatic height of skull (mm)
X3 = basialveolar length of skull (mm)
X4 = nasal height of skull (mm)
Construct a one-way MANOVA of the Egyptian skull data. Use α = .05. Construct 95% simultaneous confidence intervals to determine which mean components differ among the populations represented by the three time periods. Are the usual MANOVA assumptions realistic for these data? Explain.
6.25. Construct a one-way MANOVA of the crude-oil data listed in Table 11.7 on page 662. Construct 95% simultaneous confidence intervals to determine which mean components differ among the populations. (You may want to consider transformations of the data to make them more closely conform to the usual MANOVA assumptions.)
Table 6.12 Oxygen-Consumption Data

Measurements for 25 females and 25 males on x1 = resting O2 (L/min), x2 = resting O2 (mL/kg/min), x3 = maximum O2 (L/min), and x4 = maximum O2 (mL/kg/min).
Table 6.13 Egyptian Skull Data

 MaxBreadth  BasHeight  BasLength  NasHeight   Time
    (x1)       (x2)       (x3)       (x4)     Period
    131        138         89         49         1
    125        131         92         48         1
    131        132          ·         50         1
    119        132          ·         44         1
    136        143          ·         54         1
    138        137          ·         56         1
    139        130          ·         48         1
    125        136          ·         48         1
    131        134          ·         51         1
    134        134          ·         51         1
     ⋮          ⋮          ⋮          ⋮          ⋮
    124        138          ·         48         2
    133        134          ·         48         2
    138        134          ·         45         2
    148        129          ·         51         2
    126        124          ·         45         2
    135        136          ·         52         2
    132        145          ·         54         2
    133        130          ·         48         2
    131        134          ·         50         2
    133        125          ·         46         2
     ⋮          ⋮          ⋮          ⋮          ⋮
    132        130          ·         52         3
    133        131          ·         50         3
    138        137          ·         51         3
    130        127          ·         45         3
    136        133          ·         49         3
    134        123          ·         52         3
    136        137          ·         54         3
    133        131          ·         49         3
    138        133          ·         55         3
    138        133          ·         46         3
6.26. A project was designed to investigate how consumers in Green Bay, Wisconsin, would react to an electrical time-of-use pricing scheme. The cost of electricity during peak periods for some customers was set at eight times the cost of electricity during off-peak hours. Hourly consumption (in kilowatt-hours) was measured on a hot summer day in July and compared, for both the test group and the control group, with baseline consumption measured on a similar day before the experimental rates began. The responses, log(current consumption) − log(baseline consumption)
for the hours ending 9 A.M., 11 A.M. (a peak hour), 1 P.M., and 3 P.M. (a peak hour) produced the following summary statistics:
Test group:    n1 = 28, x̄1′ = [.153, .231, .322, .339]
Control group: n2 = 58, x̄2′ = [.151, .180, .256, .257]

and
S_pooled = |   ·   .355  .228  .232 |
           | .355  .722  .233  .199 |
           | .228  .233  .592  .239 |
           | .232  .199  .239  .479 |
Source: Data courtesy of Statistical Laboratory, University of Wisconsin. Perform a profile analysis. Does time-of-use pricing seem to make a difference in electrical consumption? What is the nature of this difference, if any? Comment. (Use a significance level of α = .05 for any statistical tests.)
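The parallelism step of such a profile analysis can be sketched as follows. The group means are the values quoted above, but the pooled covariance matrix is a hypothetical stand-in of my own (an assumption, not the book's matrix), so the numerical result is illustrative only.

```python
import numpy as np
from scipy import stats

n1, n2, p = 28, 58, 4
x1 = np.array([.153, .231, .322, .339])   # test-group means (from the text)
x2 = np.array([.151, .180, .256, .257])   # control-group means (from the text)
Sp = np.array([[.80, .36, .23, .23],      # hypothetical pooled covariance
               [.36, .72, .23, .20],
               [.23, .23, .59, .24],
               [.23, .20, .24, .48]])

# Parallel-profiles hypothesis: C(mu1 - mu2) = 0 for successive differences
C = np.array([[-1, 1, 0, 0],
              [0, -1, 1, 0],
              [0, 0, -1, 1]], float)
d = C @ (x1 - x2)
T2 = d @ np.linalg.solve((1 / n1 + 1 / n2) * (C @ Sp @ C.T), d)
crit = ((n1 + n2 - 2) * (p - 1) / (n1 + n2 - p)
        * stats.f.ppf(.95, p - 1, n1 + n2 - p))
parallel = T2 <= crit          # fail to reject parallelism if True
```

If parallelism is not rejected, the analysis proceeds to the coincidence and level tests using the corresponding one-dimensional and (p − 1)-dimensional statistics.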
6.27. As part of the study of love and marriage in Example 6.14, a sample of husbands and wives were asked to respond to these questions:
1. What is the level of passionate love you feel for your partner?
2. What is the level of passionate love that your partner feels for you?
3. What is the level of companionate love that you feel for your partner?
4. What is the level of companionate love that your partner feels for you?

The responses were recorded on the following 5-point scale:

    1 = None at all    2 = Very little    3 = Some    4 = A great deal    5 = Tremendous amount
Thirty husbands and 30 wives gave the responses in Table 6.14, where X1 = a 5-point-scale response to Question 1, X2 = a 5-point-scale response to Question 2, X3 = a 5-point-scale response to Question 3, and X4 = a 5-point-scale response to Question 4.
(a) Plot the mean vectors for husbands and wives as sample profiles.
(b) Is the husband rating wife profile parallel to the wife rating husband profile? Test for parallel profiles with α = .05. If the profiles appear to be parallel, test for coincident profiles at the same level of significance. Finally, if the profiles are coincident, test for level profiles with α = .05. What conclusion(s) can be drawn from this analysis?
6.28. Two species of biting flies (genus Leptoconops) are so similar morphologically that for many years they were thought to be the same. Biological differences such as sex ratios of emerging flies and biting habits were found to exist. Do the taxonomic data listed in part in Table 6.15 on page 352 and on the website www.prenhall.com/statistics indicate any difference in the two species, L. carteri and L. torrens? Test for the equality of the two population mean vectors using α = .05. If the hypothesis of equal mean vectors is rejected, determine the mean components (or linear combinations of mean components) most responsible for rejecting H0. Justify your use of normal-theory methods for these data.
6.29. Using the data on bone mineral content in Table 1.8, investigate equality between the dominant and nondominant bones.
Table 6.14 Spouse Data

Responses X1, X2, X3, X4 (each on the 5-point scale) for the 30 husbands and 30 wives.
(a) Test using α = .05.
(b) Construct 95% simultaneous confidence intervals for the mean differences.
(c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the intervals in Part b.
6.30. Table 6.16 on page 353 contains the bone mineral contents, for the first 24 subjects in Table 1.8, 1 year after their participation in an experimental program. Compare the data from both tables to determine whether there has been bone loss.
(a) Test using α = .05.
(b) Construct 95% simultaneous confidence intervals for the mean differences.
(c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the intervals in Part b.
Table 6.15 Biting-Fly Data

Seven taxonomic measurements (x1 through x7: wing length and width, palp length and width, and antennal-segment lengths) for samples of the two species.
Table 6.16 Mineral Content in Bones (After 1 Year)

Subject   Dominant            Dominant            Dominant
number    radius     Radius   humerus   Humerus   ulna      Ulna
   1       1.027     1.051     2.268     2.246     .869      .964
   2        .857      .817     1.718     1.710     .602      .689
   3        .875      .880     1.953     1.756     .765      .738
   4        .873      .698     1.668     1.443     .761      .698
   5        .811      .813     1.643     1.661     .551      .619
   6        .640      .734     1.396     1.378     .753      .515
   7        .947      .865     1.851     1.686     .708      .787
   8        .886      .806     1.742     1.815     .687      .715
   9        .991      .923     1.931     1.776     .844      .656
  10        .977      .925     1.933     2.106     .869      .789
  11        .825      .826     1.609     1.651     .654      .726
  12        .851      .765     2.352     1.980     .692      .526
  13        .770      .730     1.470     1.420     .670      .580
  14        .912      .875     1.846     1.809     .823      .773
  15        .905      .826     1.842     1.579     .746      .729
  16        .756      .727     1.747     1.860     .656      .506
  17        .765      .764     1.923     1.941     .693      .740
  18        .932      .914     2.190     1.997     .883      .785
  19        .843      .782     1.242     1.228     .577      .627
  20        .879      .906     2.164     1.999     .802      .769
  21        .673      .537     1.573     1.330     .540      .498
  22        .949      .900     2.130     2.159     .804      .779
  23        .463      .637     1.041     1.265     .570      .634
  24        .776      .743     1.442     1.411     .585      .640
6.31. Peanuts are an important crop in parts of the southern United States. In an effort to develop improved plants, crop scientists routinely compare varieties with respect to several variables. The data for one two-factor experiment are given in Table 6.17 on page 354. Three varieties (5, 6, and 8) were grown at two geographical locations (1, 2) and, in this case, the three variables representing yield and the two important grade-grain characteristics were measured. The three variables are

X1 = Yield (plot weight)
X2 = Sound mature kernels (weight in grams; maximum of 250 grams)
X3 = Seed size (weight, in grams, of 100 seeds)

There were two replications of the experiment.
(a) Perform a two-factor MANOVA using the data in Table 6.17. Test for a location effect, a variety effect, and a location-variety interaction. Use α = .05.
(b) Analyze the residuals from Part a. Do the usual MANOVA assumptions appear to be satisfied? Discuss.
(c) Using the results in Part a, can we conclude that the location and/or variety effects are additive? If not, does the interaction effect show up for some variables, but not for others? Check by running three separate univariate two-factor ANOVAs.
Table 6.17 Peanut Data

Location   Variety   X1 (Yield)   X2 (SdMatKer)   X3 (Seed Size)
   1          5         195.3         153.1            51.4
   1          5         194.3         167.7            53.7
   2          5         189.7         139.5            55.5
   2          5         180.4         121.1            44.4
   1          6         203.0         156.8            49.8
   1          6         195.9         166.0            45.8
   2          6         202.7         166.1            60.4
   2          6         197.6         161.8            54.1
   1          8         193.5         164.5            57.8
   1          8         187.0         165.1            58.6
   2          8         201.5         166.8            65.0
   2          8         200.0         173.8            67.2
(d) Larger numbers correspond to better yield and grade-grain characteristics. Using location 2, can we conclude that one variety is better than the other two for each characteristic? Discuss your answer, using 95% Bonferroni simultaneous intervals for pairs of varieties.
6.32. In one experiment involving remote sensing, the spectral reflectance of three species of 1-year-old seedlings was measured at various wavelengths during the growing season. The seedlings were grown with two different levels of nutrient: the optimal level, coded +, and a suboptimal level, coded −. The species of seedlings used were sitka spruce (SS), Japanese larch (JL), and lodgepole pine (LP). Two of the variables measured were

X1 = percent spectral reflectance at wavelength 560 nm (green)
X2 = percent spectral reflectance at wavelength 720 nm (near infrared)

(The accompanying table of cell means, giving the 560-nm and 720-nm responses for each species at the two nutrient levels, was scrambled during extraction and is not reproduced here.)

(a) Treating the cell means as individual observations, perform a two-way MANOVA to test for a species effect and a nutrient effect. Use α = .05.
(b) Construct a two-way ANOVA for the 560-nm observations and another two-way ANOVA for the 720-nm observations. Are these results consistent with the MANOVA results in Part a? If not, can you explain any differences?
6.33. Refer to Exercise 6.32. The data in Table 6.18 are measurements on the variables

X1 = percent spectral reflectance at wavelength 560 nm (green)
X2 = percent spectral reflectance at wavelength 720 nm (near infrared)

for three species (sitka spruce [SS], Japanese larch [JL], and lodgepole pine [LP]) of 1-year-old seedlings taken at three different times (Julian day 150 [1], Julian day 235 [2], and Julian day 320 [3]) during the growing season. The seedlings were all grown with the optimal level of nutrient.
(a) Perform a two-factor MANOVA using the data in Table 6.18. Test for a species effect, a time effect, and a species-time interaction. Use α = .05.
(The rows of Table 6.18, labeled by species SS, JL, and LP, were scrambled during extraction and are not reproduced here.)
(b) Do you think the usual MANOVA assumptions are satisfied for these data? Discuss with reference to a residual analysis, and the possibility of correlated observations over time.
(c) Foresters are particularly interested in the interaction of species and time. Does the interaction show up for one variable but not for the other? Check by running a univariate two-factor ANOVA for each of the two responses.
(d) Can you think of another method of analyzing these data (or a different experimental design) that would allow for a potential time trend in the spectral numbers?
6.34. Refer to Example 6.15.
(a) Plot the profiles, the components of x̄1 versus time and those of x̄2 versus time, on the same graph. Comment on the comparison.
(b) Test that linear growth is adequate. Take α = .01.
6.35. Refer to Example 6.15 but treat all 31 subjects as a single group. The maximum likelihood estimate of the (q + 1) × 1 vector β is

β̂ = (B'S⁻¹B)⁻¹B'S⁻¹x̄

where S is the sample covariance matrix. The estimated covariances of the maximum likelihood estimators are

Ĉov(β̂) ≅ [(n − 1)(n − 2) / ((n − 1 − p + q)(n − p + q)n)] (B'S⁻¹B)⁻¹

Fit a quadratic growth curve to this single group and comment on the fit.
6.36. Refer to Example 6.4. Given the summary information on electrical usage in this example, use Box's M-test to test the hypothesis H0: Σ1 = Σ2 = Σ. Here Σ1 is the covariance matrix for the two measures of usage for the population of Wisconsin homeowners with air conditioning, and Σ2 is the electrical usage covariance matrix for the population of Wisconsin homeowners without air conditioning. Set α = .05.
6.37. Table 6.9, page 344, contains the carapace measurements for 24 female and 24 male turtles. Use Box's M-test to test H0: Σ1 = Σ2 = Σ, where Σ1 is the population covariance matrix for carapace measurements for female turtles, and Σ2 is the population covariance matrix for carapace measurements for male turtles. Set α = .05.
6.38. Table 11.7, page 662, contains the values of three trace elements and two measures of hydrocarbons for crude oil samples taken from three groups (zones) of sandstone. Use Box's M-test to test equality of population covariance matrices for the three sandstone groups. Set α = .05. Here there are p = 5 variables and you may wish to consider transformations of the measurements on these variables to make them more nearly normal.
6.39. Anacondas are some of the largest snakes in the world. Jesus Ravis and his fellow researchers capture a snake and measure its (i) snout vent length (cm), the length from the snout of the snake to its vent where it evacuates waste, and (ii) weight (kilograms). A sample of these measurements is shown in Table 6.19.
(a) Test for equality of means between males and females using α = .05. Use the large sample statistic.
(b) Is it reasonable to pool variances in this case? Explain.
(c) Find the 95% Bonferroni confidence intervals for the mean differences between males and females on both length and weight.
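The large sample statistic called for in part (a) can be sketched as follows (a minimal illustration; the function name and the simulated stand-in data are ours, not the text's):

```python
import numpy as np

def large_sample_T2(X1, X2):
    """Large-sample statistic for H0: mu1 = mu2 with unequal covariances.

    T^2 = (xbar1 - xbar2)' [S1/n1 + S2/n2]^{-1} (xbar1 - xbar2),
    referred to a chi-square distribution with p d.f. when n1, n2 are large.
    """
    n1, n2 = len(X1), len(X2)
    d = X1.mean(axis=0) - X2.mean(axis=0)
    V = np.cov(X1, rowvar=False) / n1 + np.cov(X2, rowvar=False) / n2
    return float(d @ np.linalg.solve(V, d))

# Simulated stand-in for female/male (length, weight) measurements.
rng = np.random.default_rng(1)
females = rng.normal([350.0, 25.0], [40.0, 5.0], size=(28, 2))
males = rng.normal([300.0, 15.0], [40.0, 5.0], size=(28, 2))
T2 = large_sample_T2(females, males)
```

With p = 2 variables, the statistic would be compared to an upper percentile of the chi-square distribution with 2 d.f.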
Table 6.19 Anaconda Data

(The snout vent length and weight measurements for the female (F) and male (M) snakes were scrambled during extraction and are not reproduced here.)
6.40. Compare the male national track records in Table 8.6 with the female national track records in Table 1.9 using the results for the 100-m, 200-m, 400-m, 800-m, and 1500-m races. Treat the data as a random sample of size 64 of the twelve record values.
(a) Test for equality of means between males and females using α = .05. Explain why it may be appropriate to analyze differences.
(b) Find the 95% Bonferroni confidence intervals for the mean differences between male and females on all of the races.
6.41. When cell phone relay towers are not working properly, wireless providers can lose great amounts of money, so it is important to be able to fix problems expeditiously. A first step toward understanding the problems involved is to collect data from a designed experiment involving three factors. A problem was initially classified as low or high severity, simple or complex, and the engineer assigned was rated as relatively new (novice) or expert (guru).
Two times were observed. The time to assess the problem and plan an attack and the time to implement the solution were each measured in hours. The data are given in Table 6.20. Perform a MANOVA including appropriate confidence intervals for important effects.

(Table 6.20, with columns for Severity Level, Complexity, Experience [Novice/Guru], Assessment Time, and Implementation Time, was scrambled during extraction and is not reproduced here.)
References
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.
2. Bacon-Shone, J., and W. K. Fung. "A New Graphical Method for Detecting Single and Multiple Outliers in Univariate and Multivariate Data." Applied Statistics, 36, no. 2 (1987), 153-162.
3. Bartlett, M. S. "Properties of Sufficiency and Statistical Tests." Proceedings of the Royal Society of London (A), 160 (1937), 268-282.
4. Bartlett, M. S. "Further Aspects of the Theory of Multiple Regression." Proceedings of the Cambridge Philosophical Society, 34 (1938), 33-40.
5. Bartlett, M. S. "Multivariate Analysis." Journal of the Royal Statistical Society Supplement (B), 9 (1947), 176-197.
6. Bartlett, M. S. "A Note on the Multiplying Factors for Various χ² Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
7. Box, G. E. P. "A General Distribution Theory for a Class of Likelihood Criteria." Biometrika, 36 (1949), 317-346.
8. Box, G. E. P. "Problems in the Analysis of Growth and Wear Curves." Biometrics, 6 (1950), 362-389.
9. Box, G. E. P., and N. R. Draper. Evolutionary Operation: A Statistical Method for Process Improvement. New York: John Wiley, 1969.
10. Box, G. E. P., W. G. Hunter, and J. S. Hunter. Statistics for Experimenters (2nd ed.). New York: John Wiley, 2005.
11. Johnson, R. A., and G. K. Bhattacharyya. Statistics: Principles and Methods (5th ed.). New York: John Wiley, 2005.
12. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
13. Khattree, R., and D. N. Naik. Applied Multivariate Statistics with SAS Software (2nd ed.). Cary, NC: SAS Institute Inc., 1999.
14. Kshirsagar, A. M., and W. B. Smith. Growth Curves. New York: Marcel Dekker, 1995.
15. Krishnamoorthy, K., and J. Yu. "Modified Nel and Van der Merwe Test for the Multivariate Behrens-Fisher Problem." Statistics & Probability Letters, 66 (2004), 161-169.
16. Mardia, K. V. "The Effect of Nonnormality on some Multivariate Tests and Robustness to Nonnormality in the Linear Model." Biometrika, 58 (1971), 105-121.
17. Montgomery, D. C. Design and Analysis of Experiments (6th ed.). New York: John Wiley, 2005.
18. Morrison, D. F. Multivariate Statistical Methods (4th ed.). Belmont, CA: Brooks/Cole Thomson Learning, 2005.
19. Nel, D. G., and C. A. Van der Merwe. "A Solution to the Multivariate Behrens-Fisher Problem." Communications in Statistics-Theory and Methods, 15 (1986), 3719-3735.
20. Pearson, E. S., and H. O. Hartley, eds. Biometrika Tables for Statisticians, vol. II. Cambridge, England: Cambridge University Press, 1972.
21. Potthoff, R. F., and S. N. Roy. "A Generalized Multivariate Analysis of Variance Model Useful Especially for Growth Curve Problems." Biometrika, 51 (1964), 313-326.
22. Scheffé, H. The Analysis of Variance. New York: John Wiley, 1959.
23. Tiku, M. L., and N. Balakrishnan. "Testing the Equality of Variance-Covariance Matrices the Robust Way." Communications in Statistics-Theory and Methods, 14, no. 12 (1985), 3033-3051.
24. Tiku, M. L., and M. Singh. "Robust Statistics for Testing Mean Vectors of Multivariate Distributions." Communications in Statistics-Theory and Methods, 11, no. 9 (1982), 985-1001.
25. Wilks, S. S. "Certain Generalizations in the Analysis of Variance." Biometrika, 24 (1932), 471-494.
Chapter 7

MULTIVARIATE LINEAR REGRESSION MODELS
7.1 Introduction
Regression analysis is the statistical methodology for predicting values of one or more response (dependent) variables from a collection of predictor (independent) variable values. It can also be used for assessing the effects of the predictor variables on the responses. Unfortunately, the name regression, culled from the title of the first paper on the subject by F. Galton [15], in no way reflects either the importance or breadth of application of this methodology.

In this chapter, we first discuss the multiple regression model for the prediction of a single response. This model is then generalized to handle the prediction of several dependent variables. Our treatment must be somewhat terse, as a vast literature exists on the subject. (If you are interested in pursuing regression analysis, see the following books, in ascending order of difficulty: Abraham and Ledolter [1], Bowerman and O'Connell [6], Neter, Wasserman, Kutner, and Nachtsheim [20], Draper and Smith [13], Cook and Weisberg [11], Seber [23], and Goldberger [16].) Our abbreviated treatment highlights the regression assumptions and their consequences, alternative formulations of the regression model, and the general applicability of regression techniques to seemingly different situations.
Consider, for example, the problem of relating the value of a home to relevant predictor variables:

Y = current market value of home
and
z1 = square feet of living area
z2 = location (indicator for zone of city)
z3 = appraised value last year
z4 = quality of construction (price per square foot)
The classical linear regression model states that Y is composed of a mean, which depends in a continuous manner on the zᵢ's, and a random error ε, which accounts for measurement error and the effects of other variables not explicitly considered in the model. The values of the predictor variables recorded from the experiment or set by the investigator are treated as fixed. The error (and hence the response) is viewed as a random variable whose behavior is characterized by a set of distributional assumptions. Specifically, the linear regression model with a single response takes the form

Y = β0 + β1z1 + β2z2 + ··· + βrzr + ε

[Response] = [mean (depending on z1, z2, ..., zr)] + [error]

The term "linear" refers to the fact that the mean is a linear function of the unknown parameters β0, β1, ..., βr. The predictor variables may or may not enter the model as first-order terms.

With n independent observations on Y and the associated values of zᵢ, the complete model becomes

Y1 = β0 + β1z11 + β2z12 + ··· + βrz1r + ε1
Y2 = β0 + β1z21 + β2z22 + ··· + βrz2r + ε2
 ⋮                                                    (7-1)
Yn = β0 + β1zn1 + β2zn2 + ··· + βrznr + εn

where the error terms are assumed to have the following properties:

1. E(εj) = 0;
2. Var(εj) = σ² (constant); and                       (7-2)
3. Cov(εj, εk) = 0, j ≠ k.

In matrix notation, (7-1) becomes

⎡Y1⎤   ⎡1  z11  z12  ···  z1r⎤ ⎡β0⎤   ⎡ε1⎤
⎢Y2⎥ = ⎢1  z21  z22  ···  z2r⎥ ⎢β1⎥ + ⎢ε2⎥
⎢ ⋮⎥   ⎢⋮   ⋮    ⋮          ⋮ ⎥ ⎢ ⋮⎥   ⎢ ⋮⎥
⎣Yn⎦   ⎣1  zn1  zn2  ···  znr⎦ ⎣βr⎦   ⎣εn⎦

or

  Y    =      Z         β      +    ε
(n×1)    (n×(r+1))  ((r+1)×1)     (n×1)

and the specifications in (7-2) become

1. E(ε) = 0; and
2. Cov(ε) = E(εε') = σ²I.
Note that a one in the first column of the design matrix Z is the multiplier of the constant term β0. It is customary to introduce the artificial variable zj0 = 1, so that

β0 + β1zj1 + ··· + βrzjr = β0zj0 + β1zj1 + ··· + βrzjr

Each column of Z consists of the n values of the corresponding predictor variable, while the jth row of Z contains the values for all predictor variables on the jth trial. Collecting these conventions, we have the following model.
Classical Linear Regression Model

  Y    =      Z         β      +    ε,     E(ε) = 0  and  Cov(ε) = σ²I,        (7-3)
(n×1)    (n×(r+1))  ((r+1)×1)     (n×1)                          (n×n)

where β and σ² are unknown parameters and the design matrix Z has jth row [zj0, zj1, ..., zjr].
Although the error-term assumptions in (7-2) are very modest, we shall later need to add the assumption of joint normality for making confidence statements and testing hypotheses. We now provide some examples of the linear regression model.
Example 7.1 (Fitting a straight-line regression model) Determine the linear regression model for fitting a straight line

Mean response = E(Y) = β0 + β1z1

to the data

z1:  0  1  2  3  4
y:   1  4  3  8  9

Before the responses Y' = [Y1, Y2, ..., Y5] are observed, the errors ε' = [ε1, ε2, ..., ε5] are random, and we can write

Y = Zβ + ε

where

Y = [Y1, Y2, ..., Y5]',   β = [β0, β1]',   ε = [ε1, ε2, ..., ε5]'

The data for this model are contained in the observed response vector y and the design matrix Z, where

y = [1, 4, 3, 8, 9]',   Z = ⎡1  1  1  1  1⎤'
                            ⎣0  1  2  3  4⎦

Note that we can handle a quadratic expression for the mean response by introducing the term β2z2, with z2 = z1². The linear regression model for the jth trial in this latter case is

Yj = β0 + β1zj1 + β2zj2 + εj = β0 + β1zj1 + β2zj1² + εj
Example 7.2 (The design matrix for one-way ANOVA as a regression model) Determine the design matrix if the linear regression model is applied to the one-way ANOVA situation in Example 6.6.

We create so-called dummy variables to handle the three population means: μ1 = μ + τ1, μ2 = μ + τ2, and μ3 = μ + τ3. We set

z1 = 1 if the observation is from population 1, 0 otherwise
z2 = 1 if the observation is from population 2, 0 otherwise
z3 = 1 if the observation is from population 3, 0 otherwise

and β0 = μ, β1 = τ1, β2 = τ2, β3 = τ3. Then

Yj = β0 + β1zj1 + β2zj2 + β3zj3 + εj,   j = 1, 2, ..., 8

where we arrange the observations from the three populations in sequence. Thus, we obtain the observed response vector and design matrix

  y   = [9, 6, 9, 0, 2, 3, 1, 2]'
(8×1)

          ⎡1  1  0  0⎤
          ⎢1  1  0  0⎥
          ⎢1  1  0  0⎥
  Z   =   ⎢1  0  1  0⎥
(8×4)     ⎢1  0  1  0⎥
          ⎢1  0  0  1⎥
          ⎢1  0  0  1⎥
          ⎣1  0  0  1⎦
The construction of dummy variables, as in Example 7.2, allows the whole of analysis of variance to be treated within the multiple linear regression framework.
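As a check on this construction, the design matrix and response vector of Example 7.2 can be assembled numerically (a sketch; note that since the intercept column equals the sum of the dummy columns, Z is not of full rank, so the minimum-norm least squares solution is used):

```python
import numpy as np

# Dummy-variable design matrix for the one-way ANOVA of Example 7.2:
# eight observations, three populations arranged in sequence.
y = np.array([9.0, 6.0, 9.0, 0.0, 2.0, 3.0, 1.0, 2.0])
groups = [0, 0, 0, 1, 1, 2, 2, 2]            # population labels

Z = np.zeros((8, 4))
Z[:, 0] = 1.0                                # intercept column (mu)
for j, g in enumerate(groups):
    Z[j, 1 + g] = 1.0                        # tau_1, tau_2, tau_3 dummies

# Z has rank 3 (intercept = sum of dummies), so lstsq returns the
# minimum-norm solution; the fitted values are still the group means.
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ beta
```

The fitted values reproduce the sample means of the three populations (8, 1, and 2), which is exactly what the one-way ANOVA model predicts for each group.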
The method of least squares selects b to minimize the sum of squares

S(b) = Σⱼ₌₁ⁿ (yⱼ − b0 − b1zⱼ1 − ··· − brzⱼr)² = (y − Zb)'(y − Zb)       (7-4)

The coefficients b chosen by the least squares criterion are called least squares estimates of the regression parameters β. They will henceforth be denoted by β̂ to emphasize their role as estimates of β.

The coefficients β̂ are consistent with the data in the sense that they produce estimated (fitted) mean responses, β̂0 + β̂1zⱼ1 + ··· + β̂rzⱼr, the sum of whose squares of the differences from the observed yⱼ is as small as possible. The deviations

ε̂ⱼ = yⱼ − β̂0 − β̂1zⱼ1 − ··· − β̂rzⱼr,   j = 1, 2, ..., n               (7-5)

are called residuals. The vector of residuals ε̂ = y − Zβ̂ contains the information about the remaining unknown parameter σ². (See Result 7.2.)

Result 7.1. Let Z have full rank r + 1 ≤ n.¹ The least squares estimate of β in (7-3) is given by

β̂ = (Z'Z)⁻¹Z'y

Let ŷ = Zβ̂ = Hy denote the fitted values of y, where H = Z(Z'Z)⁻¹Z' is called the "hat" matrix. Then the residuals

ε̂ = y − ŷ = [I − Z(Z'Z)⁻¹Z']y = (I − H)y

satisfy Z'ε̂ = 0 and ŷ'ε̂ = 0. Also, the

residual sum of squares = Σⱼ₌₁ⁿ (yⱼ − β̂0 − β̂1zⱼ1 − ··· − β̂rzⱼr)² = ε̂'ε̂ = y'[I − Z(Z'Z)⁻¹Z']y = y'y − y'Zβ̂

¹ If Z is not of full rank, (Z'Z)⁻¹ is replaced by a generalized inverse of Z'Z. (See Exercise 7.6.)
Proof. Let β̂ = (Z'Z)⁻¹Z'y as asserted. Then ε̂ = y − ŷ = y − Zβ̂ = [I − Z(Z'Z)⁻¹Z']y. The matrix [I − Z(Z'Z)⁻¹Z'] satisfies

1. [I − Z(Z'Z)⁻¹Z']' = [I − Z(Z'Z)⁻¹Z']   (symmetric);
2. [I − Z(Z'Z)⁻¹Z'][I − Z(Z'Z)⁻¹Z']
     = I − 2Z(Z'Z)⁻¹Z' + Z(Z'Z)⁻¹Z'Z(Z'Z)⁻¹Z'
     = [I − Z(Z'Z)⁻¹Z']   (idempotent);                                 (7-6)
3. Z'[I − Z(Z'Z)⁻¹Z'] = Z' − Z' = 0.

Consequently, Z'ε̂ = Z'(y − ŷ) = Z'[I − Z(Z'Z)⁻¹Z']y = 0, so ŷ'ε̂ = β̂'Z'ε̂ = 0. Additionally, ε̂'ε̂ = y'[I − Z(Z'Z)⁻¹Z'][I − Z(Z'Z)⁻¹Z']y = y'[I − Z(Z'Z)⁻¹Z']y = y'y − y'Zβ̂.

To verify the expression for β̂, we write

y − Zb = y − Zβ̂ + Zβ̂ − Zb = (y − Zβ̂) + Z(β̂ − b)

so

S(b) = (y − Zb)'(y − Zb)
     = (y − Zβ̂)'(y − Zβ̂) + (β̂ − b)'Z'Z(β̂ − b) + 2(y − Zβ̂)'Z(β̂ − b)
     = (y − Zβ̂)'(y − Zβ̂) + (β̂ − b)'Z'Z(β̂ − b)

since (y − Zβ̂)'Z = ε̂'Z = 0'. The first term in S(b) does not depend on b, and the second is the squared length of Z(β̂ − b). Because Z has full rank, Z(β̂ − b) ≠ 0 if β̂ ≠ b, so the minimum sum of squares is unique and occurs for b = β̂ = (Z'Z)⁻¹Z'y. Note that (Z'Z)⁻¹ exists since Z'Z has rank r + 1 ≤ n. (If Z'Z is not of full rank, Z'Za = 0 for some a ≠ 0, but then a'Z'Za = 0 or Za = 0, which contradicts Z having full rank r + 1.)
Result 7.1 shows how the least squares estimates β̂ and the residuals ε̂ can be obtained from the design matrix Z and responses y by simple matrix operations.
Example 7.3 (Calculating the least squares estimates, the residuals, and the residual sum of squares) Calculate the least squares estimates β̂, the residuals ε̂, and the residual sum of squares for a straight-line model

Yj = β0 + β1zj1 + εj

fit to the data

z1:  0  1  2  3  4
y:   1  4  3  8  9

We have

Z' = ⎡1  1  1  1  1⎤     y = ⎡1⎤     Z'Z = ⎡ 5  10⎤     Z'y = ⎡25⎤
     ⎣0  1  2  3  4⎦         ⎢4⎥           ⎣10  30⎦           ⎣70⎦
                             ⎢3⎥
                             ⎢8⎥
                             ⎣9⎦

and

(Z'Z)⁻¹ = ⎡ .6  −.2⎤
          ⎣−.2   .1⎦

Consequently,

β̂ = ⎡β̂0⎤ = (Z'Z)⁻¹Z'y = ⎡ .6  −.2⎤ ⎡25⎤ = ⎡1⎤
    ⎣β̂1⎦                ⎣−.2   .1⎦ ⎣70⎦   ⎣2⎦

and the fitted equation is ŷ = 1 + 2z. The vector of fitted (predicted) values is

ŷ = Zβ̂ = [1, 3, 5, 7, 9]'

so the residuals are

ε̂ = y − ŷ = [0, 1, −2, 1, 0]'

The residual sum of squares is ε̂'ε̂ = 0² + 1² + (−2)² + 1² + 0² = 6.
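The hand computations of Example 7.3 can be verified numerically (a minimal sketch; NumPy, not used elsewhere in this book, stands in for the matrix algebra):

```python
import numpy as np

# Verify the least squares computation of Example 7.3.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])
Z = np.column_stack([np.ones(5), z])       # design matrix with intercept

ZtZ_inv = np.linalg.inv(Z.T @ Z)           # should equal [[.6, -.2], [-.2, .1]]
beta_hat = ZtZ_inv @ Z.T @ y               # least squares estimate
resid = y - Z @ beta_hat                   # residuals
rss = resid @ resid                        # residual sum of squares
```

The computation reproduces β̂ = (1, 2)', the residuals (0, 1, −2, 1, 0)', and the residual sum of squares 6, and the residuals are orthogonal to the columns of Z as Result 7.1 asserts.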
Sum-of-Squares Decomposition

According to Result 7.1, ŷ'ε̂ = 0, so the total response sum of squares y'y = Σⱼ₌₁ⁿ yⱼ² satisfies

y'y = (ŷ + y − ŷ)'(ŷ + y − ŷ) = (ŷ + ε̂)'(ŷ + ε̂) = ŷ'ŷ + ε̂'ε̂          (7-7)

Since the first column of Z is 1, the condition Z'ε̂ = 0 includes the requirement

0 = 1'ε̂ = Σⱼ ε̂ⱼ = Σⱼ yⱼ − Σⱼ ŷⱼ

so the fitted values have the same mean ȳ as the observations. Subtracting nȳ² from both sides of the decomposition in (7-7), we obtain the basic decomposition of the sum of squares about the mean:

y'y − nȳ² = ŷ'ŷ − nȳ² + ε̂'ε̂

or

Σⱼ (yⱼ − ȳ)² = Σⱼ (ŷⱼ − ȳ)² + Σⱼ ε̂ⱼ²                                   (7-8)

(total sum of squares about mean) = (regression sum of squares) + (residual (error) sum of squares)

The preceding sum of squares decomposition suggests that the quality of the model's fit can be measured by the coefficient of determination

R² = 1 − Σⱼ ε̂ⱼ² / Σⱼ (yⱼ − ȳ)² = Σⱼ (ŷⱼ − ȳ)² / Σⱼ (yⱼ − ȳ)²           (7-9)

The quantity R² gives the proportion of the total variation in the yⱼ's "explained" by, or attributable to, the predictor variables z1, z2, ..., zr. Here R² (or the multiple correlation coefficient R = +√R²) equals 1 if the fitted equation passes through all the data points, so that ε̂ⱼ = 0 for all j. At the other extreme, R² is 0 if β̂0 = ȳ and β̂1 = β̂2 = ··· = β̂r = 0. In this case, the predictor variables z1, z2, ..., zr have no influence on the response.
Geometry of Least Squares

A geometrical interpretation of the least squares technique sheds light on the nature of the method. According to the classical linear regression model,

Mean response vector = E(Y) = Zβ = β0[1, ..., 1]' + β1[z11, ..., zn1]' + ··· + βr[z1r, ..., znr]'

Thus, E(Y) is a linear combination of the columns of Z. As β varies, Zβ spans the model plane of all linear combinations. Usually, the observation vector y will not lie in the model plane, because of the random error ε; that is, y is not (exactly) a linear combination of the columns of Z. Recall that

Y = Zβ + ε
(response vector) = (vector in model plane) + (error vector)

Once the observations become available, the least squares solution is derived from the deviation vector

y − Zb = (observation vector) − (vector in model plane)

The squared length (y − Zb)'(y − Zb) is the sum of squares S(b). As illustrated in Figure 7.1, S(b) is as small as possible when b is selected such that Zb is the point in the model plane closest to y. This point occurs at the tip of the perpendicular projection of y on the plane. That is, for the choice b = β̂, ŷ = Zβ̂ is the projection of y on the plane consisting of all linear combinations of the columns of Z. The residual vector ε̂ = y − ŷ is perpendicular to that plane. This geometry holds even when Z is not of full rank.

When Z has full rank, the projection operation is expressed analytically as multiplication by the matrix Z(Z'Z)⁻¹Z'. To see this, we use the spectral decomposition (2-16) to write

Z'Z = λ1e1e1' + λ2e2e2' + ··· + λ_{r+1}e_{r+1}e_{r+1}'

where λ1 ≥ λ2 ≥ ··· ≥ λ_{r+1} > 0 are the eigenvalues of Z'Z and e1, e2, ..., e_{r+1} are the corresponding eigenvectors. If Z is of full rank,

(Z'Z)⁻¹ = (1/λ1)e1e1' + (1/λ2)e2e2' + ··· + (1/λ_{r+1})e_{r+1}e_{r+1}'

Consider qᵢ = λᵢ^{−1/2}Zeᵢ, which is a linear combination of the columns of Z. Then qᵢ'qₖ = λᵢ^{−1/2}λₖ^{−1/2}eᵢ'Z'Zeₖ = λᵢ^{−1/2}λₖ^{−1/2}λₖeᵢ'eₖ = 0 if i ≠ k, or 1 if i = k. That is, the r + 1 vectors qᵢ are mutually perpendicular and have unit length. Their linear combinations span the space of all linear combinations of the columns of Z. Moreover,

Z(Z'Z)⁻¹Z' = Σᵢ₌₁^{r+1} λᵢ⁻¹Zeᵢeᵢ'Z' = Σᵢ₌₁^{r+1} qᵢqᵢ'

According to Result 2A.2 and Definition 2A.12, the projection of y on a linear combination of {q1, q2, ..., q_{r+1}} is Σᵢ (qᵢ'y)qᵢ = [Σᵢ qᵢqᵢ']y = Z(Z'Z)⁻¹Z'y = Zβ̂. Thus, multiplication by Z(Z'Z)⁻¹Z' projects a vector onto the space spanned by the columns of Z.² Similarly, [I − Z(Z'Z)⁻¹Z'] is the matrix for the projection of y on the plane perpendicular to the plane spanned by the columns of Z.
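The projection interpretation can be checked numerically for the design matrix of Example 7.3: the hat matrix built from (Z'Z)⁻¹ agrees with the one assembled from the orthonormal vectors qᵢ of the spectral decomposition (a sketch):

```python
import numpy as np

# The hat matrix H = Z (Z'Z)^{-1} Z' is a perpendicular projection:
# symmetric and idempotent.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Z = np.column_stack([np.ones(5), z])
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T

# Equivalent construction from the spectral decomposition of Z'Z:
# q_i = lambda_i^{-1/2} Z e_i, and H = sum_i q_i q_i'.
lam, E = np.linalg.eigh(Z.T @ Z)           # eigenvalues and eigenvectors
Q = (Z @ E) / np.sqrt(lam)                 # columns are the q_i
H_spec = Q @ Q.T
```

Both constructions yield the same matrix, which is symmetric and satisfies H·H = H, as required of a projection.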
Result 7.2. Under the general linear regression model in (7-3), the least squares estimator β̂ = (Z'Z)⁻¹Z'Y has

E(β̂) = β   and   Cov(β̂) = σ²(Z'Z)⁻¹

The residuals ε̂ satisfy E(ε̂) = 0 and Cov(ε̂) = σ²[I − Z(Z'Z)⁻¹Z'] = σ²[I − H]. Also, E(ε̂'ε̂) = (n − r − 1)σ², so defining

s² = ε̂'ε̂/(n − r − 1) = Y'[I − Z(Z'Z)⁻¹Z']Y/(n − r − 1) = Y'[I − H]Y/(n − r − 1)

we have

E(s²) = σ²

Moreover, β̂ and ε̂ are uncorrelated.

The least squares estimator β̂ possesses a minimum variance property that was first established by Gauss. The following result concerns "best" estimators of linear parametric functions of the form c'β = c0β0 + c1β1 + ··· + crβr for any c.
Result 7.3 (Gauss'³ least squares theorem). Let Y = Zβ + ε, where E(ε) = 0, Cov(ε) = σ²I, and Z has full rank r + 1. For any c, the estimator

c'β̂ = c0β̂0 + c1β̂1 + ··· + crβ̂r

of c'β has the smallest possible variance among all linear estimators of the form

a'Y = a1Y1 + a2Y2 + ··· + anYn

that are unbiased for c'β.

² If Z is not of full rank, a generalized inverse (Z'Z)⁻ = Σᵢ λᵢ⁻¹eᵢeᵢ' can be used, where the sum runs over the strictly positive eigenvalues λ1 ≥ λ2 ≥ ··· > 0 of Z'Z and the remaining eigenvalues are zero. Then Z(Z'Z)⁻Z' = Σᵢ qᵢqᵢ' still generates the unique projection of y on the space spanned by the linearly independent columns of Z. This is true for any choice of the generalized inverse. (See [23].)

³ Much later, Markov proved a less general result, which misled many writers into attaching his name to this theorem.
Proof. For any fixed c, let a'Y be any unbiased estimator of c'β. Then E(a'Y) = c'β, whatever the value of β. Also, by assumption, E(a'Y) = E(a'Zβ + a'ε) = a'Zβ. Equating the two expected value expressions yields a'Zβ = c'β or (c' − a'Z)β = 0 for all β, including the choice β = (c' − a'Z)'. This implies that c' = a'Z for any unbiased estimator.

Now, c'β̂ = c'(Z'Z)⁻¹Z'Y = a*'Y with a* = Z(Z'Z)⁻¹c. Moreover, from Result 7.2, E(β̂) = β, so c'β̂ = a*'Y is an unbiased estimator of c'β. Thus, for any a satisfying the unbiased requirement c' = a'Z,

Var(a'Y) = Var(a'Zβ + a'ε) = Var(a'ε) = a'Iσ²a
         = σ²(a − a* + a*)'(a − a* + a*)
         = σ²[(a − a*)'(a − a*) + a*'a*]

since (a − a*)'a* = (a − a*)'Z(Z'Z)⁻¹c = 0 from the condition (a − a*)'Z = a'Z − a*'Z = c' − c' = 0'. Because a* is fixed and (a − a*)'(a − a*) is positive unless a = a*, Var(a'Y) is minimized by the choice a*'Y = c'(Z'Z)⁻¹Z'Y = c'β̂.
This powerful result states that substitution of β̂ for β leads to the best estimator of c'β for any c of interest. In statistical terminology, the estimator c'β̂ is called the best (minimum-variance) linear unbiased estimator (BLUE) of c'β.
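Gauss' theorem can be illustrated for the design of Example 7.3 by comparing the least squares slope weights a* = Z(Z'Z)⁻¹c with another unbiased choice (a sketch; the "endpoint" estimator is our illustrative alternative, not from the text):

```python
import numpy as np

# Compare the BLUE of the slope (c = [0, 1]') with the unbiased
# "endpoint" slope estimator (Y_5 - Y_1)/4 for the design z = 0,...,4.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Z = np.column_stack([np.ones(5), z])
c = np.array([0.0, 1.0])

a_star = Z @ np.linalg.inv(Z.T @ Z) @ c        # least squares weights
a_alt = np.array([-0.25, 0.0, 0.0, 0.0, 0.25])  # endpoint-estimator weights

# Both are unbiased: a'Z = c' in each case.
unbiased = np.allclose(a_star @ Z, c) and np.allclose(a_alt @ Z, c)

# Var(a'Y) = sigma^2 * a'a, so compare squared lengths of the weights.
var_blue, var_alt = a_star @ a_star, a_alt @ a_alt
```

The least squares weights have squared length .1 (matching the (2,2) element of (Z'Z)⁻¹), while the endpoint estimator has squared length .125, so the BLUE indeed has the smaller variance.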
Before we can assess the importance of particular variables in the regression function

E(Y) = β0 + β1z1 + ··· + βrzr                                          (7-10)

we must determine the sampling distributions of β̂ and the residual sum of squares, ε̂'ε̂. To do so, we shall assume that the errors ε have a normal distribution.
Result 7.4. Let Y = Zβ + ε, where Z has full rank r + 1 and ε is distributed as Nₙ(0, σ²I). Then the maximum likelihood estimator of β is the same as the least squares estimator β̂. Moreover,

β̂ = (Z'Z)⁻¹Z'Y   is distributed as   N_{r+1}(β, σ²(Z'Z)⁻¹)

and is distributed independently of the residuals ε̂ = Y − Zβ̂. Further,

nσ̂² = ε̂'ε̂   is distributed as   σ²χ²_{n−r−1}

where σ̂² is the maximum likelihood estimator of σ².
A confidence ellipsoid for β is easily constructed. It is expressed in terms of the estimated covariance matrix s²(Z'Z)⁻¹, where s² = ε̂'ε̂/(n − r − 1).
Result 7.5. Let Y = Zβ + ε, where Z has full rank r + 1 and ε is Nₙ(0, σ²I). Then a 100(1 − α) percent confidence region for β is given by

(β − β̂)'Z'Z(β − β̂) ≤ (r + 1)s²F_{r+1,n−r−1}(α)

where F_{r+1,n−r−1}(α) is the upper (100α)th percentile of an F-distribution with r + 1 and n − r − 1 d.f. Also, simultaneous 100(1 − α) percent confidence intervals for the βᵢ are given by

β̂ᵢ ± √(Var̂(β̂ᵢ)) √((r + 1)F_{r+1,n−r−1}(α)),   i = 0, 1, ..., r

where Var̂(β̂ᵢ) is the diagonal element of s²(Z'Z)⁻¹ corresponding to β̂ᵢ.

Proof. Consider the symmetric square-root matrix (Z'Z)^{1/2}. [See (2-22).] Set V = (Z'Z)^{1/2}(β̂ − β) and note that E(V) = 0,

Cov(V) = (Z'Z)^{1/2}Cov(β̂)(Z'Z)^{1/2} = σ²(Z'Z)^{1/2}(Z'Z)⁻¹(Z'Z)^{1/2} = σ²I

and V is normally distributed, since it consists of linear combinations of the β̂ᵢ's. Therefore, V'V = (β̂ − β)'(Z'Z)^{1/2}(Z'Z)^{1/2}(β̂ − β) = (β̂ − β)'(Z'Z)(β̂ − β) is distributed as σ²χ²_{r+1}. By Result 7.4, (n − r − 1)s² = ε̂'ε̂ is distributed as σ²χ²_{n−r−1}, independently of β̂ and, hence, independently of V. Consequently, [χ²_{r+1}/(r + 1)]/[χ²_{n−r−1}/(n − r − 1)] = [V'V/(r + 1)]/s² has an F_{r+1,n−r−1} distribution, and the confidence ellipsoid for β follows. Projecting this ellipsoid for (β̂ − β) using Result 5A.1 with A⁻¹ = Z'Z/s², c² = (r + 1)F_{r+1,n−r−1}(α), and u' = [0, ..., 0, 1, 0, ..., 0] yields

|βᵢ − β̂ᵢ| ≤ √((r + 1)F_{r+1,n−r−1}(α)) √(Var̂(β̂ᵢ))

where Var̂(β̂ᵢ) is the diagonal element of s²(Z'Z)⁻¹ corresponding to β̂ᵢ.

The confidence ellipsoid is centered at the maximum likelihood estimate β̂, and its orientation and size are determined by the eigenvalues and eigenvectors of Z'Z. If an eigenvalue is nearly zero, the confidence ellipsoid will be very long in the direction of the corresponding eigenvector.
Practitioners often ignore the "simultaneous" confidence property of the interval estimates in Result 7.5. Instead, they replace (r + 1)F_{r+1,n−r−1}(α) with the one-at-a-time t value t_{n−r−1}(α/2) and use the intervals

β̂ᵢ ± t_{n−r−1}(α/2) √(Var̂(β̂ᵢ))                                       (7-11)

when searching for important predictor variables.
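For the small data set of Example 7.3 (n = 5, r = 1), the two kinds of half-widths can be compared directly (a sketch; the critical values t₃(.025) = 3.182 and F₂,₃(.05) ≈ 9.55 are standard tabled values, supplied by hand rather than computed):

```python
import numpy as np

# Interval half-widths from Result 7.5 and (7-11) for the straight-line
# fit of Example 7.3, where n - r - 1 = 3 degrees of freedom.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])
Z = np.column_stack([np.ones(5), z])
ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ y
resid = y - Z @ beta_hat
s2 = resid @ resid / (5 - 1 - 1)           # s^2 = 6/3 = 2
se = np.sqrt(s2 * np.diag(ZtZ_inv))        # estimated std. dev. of beta_i

t_crit, F_crit = 3.182, 9.55               # assumed tabled critical values
one_at_a_time = t_crit * se                # half-widths from (7-11)
simultaneous = np.sqrt((1 + 1) * F_crit) * se   # half-widths from Result 7.5
```

As the text warns, the simultaneous intervals are wider than the one-at-a-time intervals; here the inflation factor is √(2 · 9.55)/3.182 ≈ 1.37.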
Example 7.4 (Fitting a regression model to real-estate data) The assessment data in Table 7.1 were gathered from 20 homes in a Milwaukee, Wisconsin, neighborhood. Fit the regression model

Yj = β0 + β1zj1 + β2zj2 + εj

where z1 = total dwelling size (in hundreds of square feet), z2 = assessed value (in thousands of dollars), and Y = selling price (in thousands of dollars), to these data using the method of least squares.

Table 7.1 Real-Estate Data

z1                 z2               y
Total dwelling     Assessed value   Selling price
size (100 ft²)     ($1000)          ($1000)
15.31              57.3              74.8
15.20              63.8              74.0
16.25              65.4              72.9
14.33              57.0              70.0
14.57              63.8              74.9
17.33              63.2              76.0
14.48              60.2              72.0
14.91              57.7              73.5
15.25              56.4              74.5
13.89              55.6              73.5
15.18              62.6              71.5
14.44              63.4              71.0
14.87              60.2              78.9
18.63              67.2              86.5
15.20              57.1              68.0
25.76              89.6             102.0
19.05              68.6              84.0
15.37              60.1              69.0
18.06              66.3              88.0
16.35              65.8              76.0

A computer calculation yields

            ⎡5.1523               ⎤
(Z'Z)⁻¹ =   ⎢ .2544   .0512       ⎥
            ⎣ .1463   .0172   .0067⎦

and

β̂ = (Z'Z)⁻¹Z'y = [30.967, 2.634, .045]'

Thus, the fitted equation is

ŷ = 30.967 + 2.634z1 + .045z2
    (7.88)    (.785)    (.285)

with s = 3.473. The numbers in parentheses are the estimated standard deviations of the least squares coefficients. Also, R² = .834, indicating that the data exhibit a strong regression relationship. (See Panel 7.1, which contains the regression analysis of these data using the SAS statistical software package.) If the residuals pass the diagnostic checks described in Section 7.6, the fitted equation could be used to predict the selling price of another house in the neighborhood from its size and assessed value.
PANEL 7.1 SAS ANALYSIS FOR EXAMPLE 7.4 USING PROC REG

PROGRAM COMMANDS:

title 'Regression Analysis';
data estate;
  infile 'T7-1.dat';
  input z1 z2 y;
proc reg data = estate;
  model y = z1 z2;

OUTPUT:

Model: MODEL1
Dependent Variable: Y

                    Analysis of Variance
Source     DF    Sum of Squares    Mean Square    F Value    Prob > F
Model       2        1032.87506      516.43753     42.828      0.0001
Error      17         204.99494       12.05853
C Total    19        1237.87000

Root MSE    3.47254      R-square    0.8344
Dep Mean   76.55000      Adj R-sq    0.8149
C.V.        4.53630

                    Parameter Estimates
                 Parameter       Standard       T for H0:
Variable    DF    Estimate          Error       Parameter = 0
INTERCEP     1   30.966566      7.88220844         3.929
z1           1    2.634400      0.78559872         3.353
z2           1    0.045184      0.28518271         0.158
Using the one-at-a-time critical value t17(.025) = 2.110 from (7-11), a 95% confidence interval for β2 is

β̂2 ± t17(.025)(.285) = .045 ± 2.110(.285),   or   (−.556, .647)

Since the confidence interval includes β2 = 0, the variable z2 might be dropped from the regression model and the analysis repeated with the single predictor variable z1. Given dwelling size, assessed value seems to add little to the prediction of selling price.
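The fit reported in Example 7.4 can be reproduced from the Table 7.1 values (a sketch; the text reports β̂ = (30.967, 2.634, .045)' and R² = .834, which this computation should recover up to rounding):

```python
import numpy as np

# Refit the real-estate regression from the Table 7.1 values.
z1 = np.array([15.31, 15.20, 16.25, 14.33, 14.57, 17.33, 14.48, 14.91,
               15.25, 13.89, 15.18, 14.44, 14.87, 18.63, 15.20, 25.76,
               19.05, 15.37, 18.06, 16.35])
z2 = np.array([57.3, 63.8, 65.4, 57.0, 63.8, 63.2, 60.2, 57.7, 56.4,
               55.6, 62.6, 63.4, 60.2, 67.2, 57.1, 89.6, 68.6, 60.1,
               66.3, 65.8])
y = np.array([74.8, 74.0, 72.9, 70.0, 74.9, 76.0, 72.0, 73.5, 74.5,
              73.5, 71.5, 71.0, 78.9, 86.5, 68.0, 102.0, 84.0, 69.0,
              88.0, 76.0])

Z = np.column_stack([np.ones(20), z1, z2])
beta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
resid = y - Z @ beta_hat
R2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
```

The residuals are orthogonal to the columns of Z, as Result 7.1 requires, and R² comes out near the reported .834.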
Likelihood Ratio Tests for the Regression Parameters

Part of regression analysis is concerned with assessing the effects of particular predictor variables on the response. One null hypothesis of interest states that certain of the zᵢ's do not influence the response Y. These predictors will be labeled z_{q+1}, z_{q+2}, ..., z_r, and the statement that they do not influence Y translates into the statistical hypothesis

H0: β_{q+1} = β_{q+2} = ··· = βr = 0   or   H0: β(2) = 0               (7-12)

where β(2) = [β_{q+1}, β_{q+2}, ..., βr]'. Setting

Z = [Z1 ⋮ Z2],   with Z1 of dimension n × (q + 1) and Z2 of dimension n × (r − q)

β = [β(1)' ⋮ β(2)']',   with β(1) of dimension (q + 1) × 1 and β(2) of dimension (r − q) × 1

we can express the general linear model as

Y = Zβ + ε = [Z1 ⋮ Z2][β(1)' ⋮ β(2)']' + ε = Z1β(1) + Z2β(2) + ε
Under the null hypothesis H0: β(2) = 0, Y = Z1β(1) + ε. The likelihood ratio test of H0 is based on the

Extra sum of squares = SS_res(Z1) − SS_res(Z)
                     = (y − Z1β̂(1))'(y − Z1β̂(1)) − (y − Zβ̂)'(y − Zβ̂)   (7-13)

where β̂(1) = (Z1'Z1)⁻¹Z1'y.
Result 7.6. Let Z have full rank r + 1 and ε be distributed as Nₙ(0, σ²I). The likelihood ratio test of H0: β(2) = 0 is equivalent to a test of H0 based on the extra sum of squares in (7-13) and s² = (y − Zβ̂)'(y − Zβ̂)/(n − r − 1). In particular, the likelihood ratio test rejects H0 if

[SS_res(Z1) − SS_res(Z)]/(r − q)
───────────────────────────────  >  F_{r−q,n−r−1}(α)
              s²

where F_{r−q,n−r−1}(α) is the upper (100α)th percentile of an F-distribution with r − q and n − r − 1 d.f.
Proof. Given the data and the normal assumption, the likelihood associated with the parameters β and σ² is

L(β, σ²) = (2π)^{−n/2}σ^{−n} e^{−(y−Zβ)'(y−Zβ)/2σ²} ≤ (2π)^{−n/2}σ̂^{−n} e^{−n/2}

with the maximum occurring at β̂ = (Z'Z)⁻¹Z'y and σ̂² = (y − Zβ̂)'(y − Zβ̂)/n. Under the restriction of the null hypothesis, Y = Z1β(1) + ε and

max L(β(1), σ²) = (2π)^{−n/2}σ̂1^{−n} e^{−n/2}
β(1), σ²

where the maximum occurs at β̂(1) = (Z1'Z1)⁻¹Z1'y and σ̂1² = (y − Z1β̂(1))'(y − Z1β̂(1))/n. Rejecting H0: β(2) = 0 for small values of the likelihood ratio amounts to rejecting H0 for large values of (σ̂1² − σ̂²)/σ̂² or its scaled version,

n(σ̂1² − σ̂²)/(r − q)
────────────────────  =  F
 nσ̂²/(n − r − 1)

The preceding F-ratio has an F-distribution with r − q and n − r − 1 d.f. (See [22] or Result 7.11 with m = 1.)
Comment. The likelihood ratio test is implemented as follows. To test whether all coefficients in a subset are zero, fit the model with and without the terms corresponding to these coefficients. The improvement in the residual sum of squares (the extra sum of squares) is compared to the residual sum of squares for the full model via the F-ratio. The same procedure applies even in analysis of variance situations where Z is not of full rank.⁴

More generally, it is possible to formulate null hypotheses concerning r − q linear combinations of β of the form H0: Cβ = A0. Let the (r − q) × (r + 1) matrix C have full rank, let A0 = 0, and consider

H0: Cβ = 0

(This null hypothesis reduces to the previous choice when C = [0 ⋮ I], with I the (r − q) × (r − q) identity matrix.)

⁴ In situations where Z is not of full rank, rank(Z) replaces r + 1 and rank(Z1) replaces q + 1 in Result 7.6.
Chapter 7 Multivariate Linear Regression Models

Under the full model, Cβ̂ is distributed as N_(r−q)(Cβ, σ²C(Z′Z)⁻¹C′). We reject H₀: Cβ = 0 at level α if 0 does not lie in the 100(1 − α)% confidence ellipsoid for Cβ. Equivalently, we reject H₀: Cβ = 0 if

    (Cβ̂)′(C(Z′Z)⁻¹C′)⁻¹(Cβ̂)
    -------------------------- > F_(r−q),(n−r−1)(α)        (7-14)
           (r − q) s²

where s² = (y − Zβ̂)′(y − Zβ̂)/(n − r − 1) and F_(r−q),(n−r−1)(α) is the upper (100α)th percentile of an F-distribution with r − q and n − r − 1 d.f. The test in (7-14) is the likelihood ratio test, and the numerator in the F-ratio is the extra residual sum of squares incurred by fitting the model subject to the restriction that Cβ = 0. (See [23].)

The next example illustrates how unbalanced experimental designs are easily handled by the general theory just described.
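The extra sum-of-squares F-test of Result 7.6 can be sketched numerically. The following is a minimal illustration, not from the book, using NumPy with synthetic data in which H₀: β_(2) = 0 actually holds; `ss_res` and the other names are our own.

```python
import numpy as np

# Synthetic data: n = 50 trials, r = 3 predictors, but the last r - q = 2
# coefficients are truly zero, so H0: beta_(2) = 0 holds.
rng = np.random.default_rng(0)
n, r, q = 50, 3, 1
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])  # n x (r+1)
beta = np.array([1.0, 2.0, 0.0, 0.0])
y = Z @ beta + rng.normal(size=n)

def ss_res(Zmat, y):
    """Residual sum of squares from a least squares fit."""
    bhat, *_ = np.linalg.lstsq(Zmat, y, rcond=None)
    resid = y - Zmat @ bhat
    return resid @ resid

ss_full = ss_res(Z, y)              # SS_res(Z), full model
ss_red  = ss_res(Z[:, :q + 1], y)   # SS_res(Z1), first q+1 columns only
s2 = ss_full / (n - r - 1)
F = ((ss_red - ss_full) / (r - q)) / s2
print(F)   # compare with the F_{r-q, n-r-1}(alpha) percentile
```

Since β_(2) = 0 was built into the data, F should typically fall below the upper percentile of the F-distribution with r − q and n − r − 1 d.f.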
Example 7.5 (Testing the importance of additional predictors using the extra sum-of-squares approach) Male and female patrons rated the service in three establishments (locations) of a large restaurant chain. The service ratings were converted into an index. Table 7.2 contains the data for n = 18 customers. Each data point in the table is categorized according to location (1, 2, or 3) and gender (male = 0 and female = 1). This categorization has the format of a two-way table with unequal numbers of observations per cell. For instance, the combination of location 1 and male has 5 responses, while the combination of location 2 and female has 2 responses. Introducing three dummy variables to account for location and two dummy variables to account for gender, we can develop a regression model linking the service index Y to location, gender, and their "interaction" using the design matrix
Table 7.2 Restaurant-Service Data

    Location   Gender   Service (Y)
       1         0         15.2
       1         0         21.2
       1         0         27.3
       1         0         21.2
       1         0         21.2
       1         1         36.4
       1         1         92.4
       2         0         27.3
       2         0         15.2
       2         0          9.1
       2         0         18.2
       2         0         50.0
       2         1         44.0
       2         1         63.6
       3         0         15.2
       3         0         30.3
       3         1         36.4
       3         1         40.9

The 18 × 12 design matrix Z consists of a column of 1s, three location indicator columns, two gender indicator columns, and six location–gender interaction columns. Its rows fall into blocks of 5, 2, 5, 2, 2, and 2 responses corresponding to the six location–gender combinations. The parameter vector is

    β′ = [β₀, β₁, β₂, β₃, τ₁, τ₂, γ₁₁, γ₁₂, γ₂₁, γ₂₂, γ₃₁, γ₃₂]
where the βᵢ's (i > 0) represent the effects of the locations on the determination of service, the τᵢ's represent the effects of gender on the service index, and the γ_ik's represent the location–gender interaction effects.

The design matrix Z is not of full rank. (For instance, column 1 equals the sum of columns 2–4 or columns 5–6.) In fact, rank(Z) = 6. For the complete model, results from a computer program give

    SS_res(Z) = 2977.4

and n − rank(Z) = 18 − 6 = 12. The model without the interaction terms has the design matrix Z₁ consisting of the first six columns of Z. We find that

    SS_res(Z₁) = 3419.1

with n − rank(Z₁) = 18 − 4 = 14. To test H₀: γ₁₁ = γ₁₂ = γ₂₁ = γ₂₂ = γ₃₁ = γ₃₂ = 0 (no location–gender interaction), we compute

    F = [SS_res(Z₁) − SS_res(Z)]/(6 − 4) / s² = [(3419.1 − 2977.4)/2] / (2977.4/12) = .89
The F-ratio may be compared with an appropriate percentage point of an F-distribution with 2 and 12 d.f. This F-ratio is not significant for any reasonable significance level α. Consequently, we conclude that the service index does not depend upon any location–gender interaction, and these terms can be dropped from the model.

Using the extra sum-of-squares approach, we may verify that there is no difference between locations (no location effect), but that gender is significant; that is, males and females do not give the same ratings to service.  ■

In analysis-of-variance situations where the cell counts are unequal, the variation in the response attributable to different predictor variables and their interactions cannot usually be separated into independent amounts. To evaluate the relative influences of the predictors on the response in this case, it is necessary to fit the model with and without the terms in question and compute the appropriate F-test statistics.
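The computations of Example 7.5 can be checked with a short script. This is a sketch using NumPy; the cell-means coding below is our own but spans the same column space as the rank-6 design matrix Z in the text.

```python
import numpy as np

# Restaurant-service data from Table 7.2.
loc    = np.array([1]*7 + [2]*7 + [3]*4)
gender = np.array([0,0,0,0,0,1,1, 0,0,0,0,0,1,1, 0,0,1,1])
y = np.array([15.2,21.2,27.3,21.2,21.2,36.4,92.4,
              27.3,15.2, 9.1,18.2,50.0,44.0,63.6,
              15.2,30.3,36.4,40.9])

def ss_res(Z, y):
    """Residual sum of squares (lstsq handles rank-deficient Z)."""
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ b
    return r @ r

# Full model: one indicator per location-gender cell (rank 6).
cells = [(l, g) for l in (1, 2, 3) for g in (0, 1)]
Z_full = np.column_stack([(loc == l) & (gender == g)
                          for l, g in cells]).astype(float)
# Reduced model: intercept + location + gender, no interaction (rank 4).
Z_red = np.column_stack([np.ones(18), loc == 1, loc == 2,
                         gender == 0]).astype(float)

ss_full = ss_res(Z_full, y)   # about 2977.4 = SS_res(Z)
ss_red  = ss_res(Z_red, y)    # about 3419.1 = SS_res(Z1)
F = ((ss_red - ss_full) / 2) / (ss_full / 12)
print(round(F, 2))            # about .89, not significant at 2 and 12 d.f.
```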
Estimating the Regression Function at z₀

Let z₀′ = [1, z₀₁, …, z₀ᵣ] denote selected values of the predictor variables, and let Y₀ denote the corresponding value of the response. According to the model in (7-3), the expected value of Y₀ is

    E(Y₀ | z₀) = β₀ + β₁z₀₁ + ⋯ + β_r z₀ᵣ = z₀′β        (7-15)

Its least squares estimate is z₀′β̂.

Result 7.7. For the linear regression model in (7-3), z₀′β̂ is the unbiased linear estimator of E(Y₀ | z₀) with minimum variance, Var(z₀′β̂) = z₀′(Z′Z)⁻¹z₀ σ². If the errors ε are normally distributed, then a 100(1 − α)% confidence interval for E(Y₀ | z₀) = z₀′β is provided by

    z₀′β̂ ± t_(n−r−1)(α/2) √(z₀′(Z′Z)⁻¹z₀ s²)

where t_(n−r−1)(α/2) is the upper 100(α/2)th percentile of a t-distribution with n − r − 1 d.f.
Proof. For a fixed z₀, z₀′β̂ is just a linear combination of the β̂ᵢ's, so Result 7.3 applies. Also, Var(z₀′β̂) = z₀′Cov(β̂)z₀ = z₀′(Z′Z)⁻¹z₀σ², since Cov(β̂) = σ²(Z′Z)⁻¹ by Result 7.2. Under the further assumption that ε is normally distributed, Result 7.4 asserts that β̂ is N_(r+1)(β, σ²(Z′Z)⁻¹) independently of s²/σ², which is distributed as χ²_(n−r−1)/(n − r − 1). Consequently,

    (z₀′β̂ − z₀′β) / √(s² z₀′(Z′Z)⁻¹z₀)

is distributed as t_(n−r−1), and the confidence interval follows.  ■
Forecasting a New Observation at z₀

Prediction of a new observation, such as Y₀, at z₀′ = [1, z₀₁, …, z₀ᵣ] is more uncertain than estimating the expected value of Y₀. According to the regression model,

    Y₀ = β₀ + β₁z₀₁ + ⋯ + β_r z₀ᵣ + ε₀ = z₀′β + ε₀

where ε₀ is distributed as N(0, σ²) and is independent of ε and, hence, of β̂ and s². The new observation Y₀ is forecast by z₀′β̂ = β̂₀ + β̂₁z₀₁ + ⋯ + β̂ᵣz₀ᵣ. The variance of the forecast error Y₀ − z₀′β̂ is

    Var(Y₀ − z₀′β̂) = σ²(1 + z₀′(Z′Z)⁻¹z₀)

When the errors ε have a normal distribution, a 100(1 − α)% prediction interval for Y₀ is given by

    z₀′β̂ ± t_(n−r−1)(α/2) √(s²(1 + z₀′(Z′Z)⁻¹z₀))

where t_(n−r−1)(α/2) is the upper 100(α/2)th percentile of a t-distribution with n − r − 1 degrees of freedom.
Proof. We forecast Y₀ by z₀′β̂, which estimates E(Y₀ | z₀). By Result 7.7, z₀′β̂ has E(z₀′β̂) = z₀′β and Var(z₀′β̂) = z₀′(Z′Z)⁻¹z₀σ². The forecast error is then Y₀ − z₀′β̂ = z₀′β + ε₀ − z₀′β̂ = ε₀ + z₀′(β − β̂). Thus, E(Y₀ − z₀′β̂) = E(ε₀) + E(z₀′(β − β̂)) = 0, so the predictor is unbiased. Since ε₀ and β̂ are independent,

    Var(Y₀ − z₀′β̂) = Var(ε₀) + Var(z₀′β̂) = σ² + z₀′(Z′Z)⁻¹z₀σ² = σ²(1 + z₀′(Z′Z)⁻¹z₀)

If it is further assumed that ε has a normal distribution, then β̂ is normally distributed, and so is the linear combination Y₀ − z₀′β̂. Consequently, (Y₀ − z₀′β̂)/√(σ²(1 + z₀′(Z′Z)⁻¹z₀)) is distributed as N(0, 1). Dividing this ratio by √(s²/σ²), which is distributed as √(χ²_(n−r−1)/(n − r − 1)), we obtain

    (Y₀ − z₀′β̂) / √(s²(1 + z₀′(Z′Z)⁻¹z₀))

which is distributed as t_(n−r−1). The prediction interval follows.  ■
The prediction interval for Y₀ is wider than the confidence interval for estimating the value of the regression function E(Y₀ | z₀) = z₀′β. The additional uncertainty in forecasting Y₀, which is represented by the extra term s² in the expression s²(1 + z₀′(Z′Z)⁻¹z₀), comes from the presence of the unknown error term ε₀.
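The two interval formulas can be compared side by side in code. This is a sketch with NumPy on synthetic data (seven sites, two predictors, so n − r − 1 = 4 d.f.); the value t₄(.025) = 2.776 is taken from a t table rather than computed.

```python
import numpy as np

# Synthetic data in the spirit of the computer-sites problem: two
# predictors and seven observations (values are invented).
Z = np.column_stack([np.ones(7),
                     [110., 125., 118., 130., 142., 133., 105.],
                     [6.0, 7.1, 6.5, 7.5, 8.2, 7.8, 5.9]])
y = np.array([140., 158., 150., 165., 180., 170., 135.])
n, rp1 = Z.shape
bhat, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ bhat
s2 = resid @ resid / (n - rp1)          # s^2 with n - r - 1 = 4 d.f.

z0 = np.array([1., 130., 7.5])
h0 = z0 @ np.linalg.inv(Z.T @ Z) @ z0   # z0'(Z'Z)^{-1} z0
yhat0 = z0 @ bhat
t = 2.776                                # t_4(.025) from a table

ci = (yhat0 - t*np.sqrt(s2*h0),       yhat0 + t*np.sqrt(s2*h0))
pi = (yhat0 - t*np.sqrt(s2*(1+h0)),   yhat0 + t*np.sqrt(s2*(1+h0)))
print(ci, pi)
```

The prediction interval is always the wider of the two, because of the extra "1 +" term contributed by the new error ε₀.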
Example 7.6 (Interval estimates for a mean response and a future response) Companies considering the purchase of a computer must first assess their future needs in order to determine the proper equipment. A computer scientist collected data from seven similar company sites so that a forecast equation of computer-hardware requirements for inventory management could be developed. The data are given in Table 7.3.
Construct a 95% confidence interval for the mean CPU time, E(Y₀ | z₀) = β₀ + β₁z₀₁ + β₂z₀₂, at z₀′ = [1, 130, 7.5]. Also, find a 95% prediction interval for a new facility's CPU requirement corresponding to the same z₀.

A computer program provides the estimated regression function

    ŷ = 8.42 + 1.08z₁ + .42z₂

with s = 1.204 and z₀′(Z′Z)⁻¹z₀ = .34725. Consequently,

    z₀′β̂ = 8.42 + 1.08(130) + .42(7.5) = 151.97

and s√(z₀′(Z′Z)⁻¹z₀) = 1.204(.58928) = .71. We have t₄(.025) = 2.776, so the 95% confidence interval for the mean CPU time at z₀ is

    z₀′β̂ ± t₄(.025) s√(z₀′(Z′Z)⁻¹z₀) = 151.97 ± 2.776(.71)

or (150.00, 153.94).
Table 7.3 Computer Data

    z₁ (orders)   z₂ (add–delete items)   Y (CPU time)
        ⋯                 ⋯                  141.5
        ⋯                 ⋯                  168.9
        ⋯                 ⋯                  154.8
        ⋯                 ⋯                  146.5
        ⋯                 ⋯                  172.8
        ⋯                 ⋯                  160.1
        ⋯                 ⋯                  108.5
Since s√(1 + z₀′(Z′Z)⁻¹z₀) = (1.204)(1.16071) = 1.40, a 95% prediction interval for the CPU time at a new facility with conditions z₀ is

    z₀′β̂ ± t₄(.025) s√(1 + z₀′(Z′Z)⁻¹z₀) = 151.97 ± 2.776(1.40)

or (148.08, 155.86).  ■

Model Checking and Other Aspects of Regression
The adequacy of the fitted model is examined through the residuals

    ε̂₁ = y₁ − β̂₀ − β̂₁z₁₁ − ⋯ − β̂ᵣz₁ᵣ
    ε̂₂ = y₂ − β̂₀ − β̂₁z₂₁ − ⋯ − β̂ᵣz₂ᵣ
     ⋮
    ε̂ₙ = yₙ − β̂₀ − β̂₁zₙ₁ − ⋯ − β̂ᵣzₙᵣ

or

    ε̂ = y − Zβ̂ = [I − Z(Z′Z)⁻¹Z′]y        (7-16)

If the model is valid, each residual ε̂ⱼ is an estimate of the error εⱼ, which is assumed to be a normal random variable with mean zero and variance σ². Although the residuals ε̂ have expected value 0, their covariance matrix σ²[I − Z(Z′Z)⁻¹Z′] = σ²[I − H] is not diagonal. Residuals have unequal variances and nonzero correlations. Fortunately, the correlations are often small and the variances are nearly equal.

Because the residuals ε̂ have covariance matrix σ²[I − H], the variances of the ε̂ⱼ can vary greatly if the diagonal elements of H, the leverages h_jj, are substantially different. Consequently, many statisticians prefer graphical diagnostics based on studentized residuals. Using the residual mean square s² as an estimate of σ², we have

    ε̂ⱼ* = ε̂ⱼ / √(s²(1 − h_jj)),   j = 1, 2, …, n        (7-17)

We expect the studentized residuals to look, approximately, like independent drawings from an N(0, 1) distribution. Some software packages go one step further and studentize ε̂ⱼ using the delete-one estimated variance s²(j), which is the residual mean square when the jth observation is dropped from the analysis.
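The construction in (7-16) and (7-17) can be sketched directly from the hat matrix. This is a minimal NumPy illustration on synthetic data (not from the book); the variable names are our own.

```python
import numpy as np

# Synthetic single-predictor data for illustration.
rng = np.random.default_rng(1)
n = 20
z1 = rng.uniform(0, 10, n)
Z = np.column_stack([np.ones(n), z1])
y = 2.0 + 0.5*z1 + rng.normal(size=n)

# Hat (leverage) matrix H = Z (Z'Z)^{-1} Z'.
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
resid = (np.eye(n) - H) @ y                  # residuals (7-16)
s2 = resid @ resid / (n - Z.shape[1])        # residual mean square
h = np.diag(H)                               # leverages h_jj
studentized = resid / np.sqrt(s2 * (1 - h))  # studentized residuals (7-17)
print(studentized.round(2))
```

Note that tr(H) = r + 1, so the leverages average to (r + 1)/n, and each h_jj lies strictly between 0 and 1 here.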
Residuals should be plotted in various ways to detect possible anomalies. For general diagnostic purposes, the following are useful graphs:

1. Plot the residuals ε̂ⱼ against the predicted values ŷⱼ = β̂₀ + β̂₁zⱼ₁ + ⋯ + β̂ᵣzⱼᵣ. Departures from the assumptions of the model are typically indicated by two types of phenomena:
   (a) A dependence of the residuals on the predicted value. This is illustrated in Figure 7.2(a). The numerical calculations are incorrect, or a β₀ term has been omitted from the model.
   (b) The variance is not constant. The pattern of residuals may be funnel shaped, as in Figure 7.2(b), so that there is large variability for large ŷ and small variability for small ŷ. If this is the case, the variance of the error is not constant, and transformations or a weighted least squares approach (or both) are required. (See Exercise 7.3.) In Figure 7.2(d), the residuals form a horizontal band. This is ideal and indicates equal variances and no dependence on ŷ.
2. Plot the residuals ε̂ⱼ against a predictor variable, such as z₁, or products of predictor variables, such as z₁² or z₁z₂. A systematic pattern in these plots suggests the need for more terms in the model. This situation is illustrated in Figure 7.2(c).
3. Q–Q plots and histograms. Do the errors appear to be normally distributed? To answer this question, the residuals ε̂ⱼ or ε̂ⱼ* can be examined using the techniques discussed in Section 4.6. The Q–Q plots, histograms, and dot diagrams help to detect the presence of unusual observations or severe departures from normality that may require special attention in the analysis. If n is large, minor departures from normality will not greatly affect inferences about β.
Figure 7.2 Residual plots, panels (a)–(d).
4. Plot the residuals versus time. The assumption of independence is crucial, but hard to check. If the data are naturally chronological, a plot of the residuals versus time may reveal a systematic pattern. (A plot of the positions of the residuals in space may also reveal associations among the errors.) For instance, residuals that increase over time indicate a strong positive dependence. A statistical test of independence can be constructed from the first autocorrelation of the residuals; a popular version, the Durbin–Watson test, is based on the statistic

       Σⱼ₌₂ⁿ (ε̂ⱼ − ε̂ⱼ₋₁)² / Σⱼ₌₁ⁿ ε̂ⱼ²        (7-19)
Example 7.7 (Residual plots) Three residual plots for the computer data discussed in Example 7.6 are shown in Figure 7.3. The sample size n = 7 is really too small to allow definitive judgments; however, it appears as if the regression assumptions are tenable.  ■

Figure 7.3 Residual plots for the computer data of Example 7.6, panels (a)–(c).
If several observations of the response are available for the same values of the predictor variables, then a formal test for lack of fit can be carried out. (See [13] for a discussion of the pure-error lack-of-fit test.)
Leverage plays two roles in diagnostic checking. First, for a model with a single predictor variable z, the jth diagonal element of H = Z(Z′Z)⁻¹Z′ is

    h_jj = 1/n + (zⱼ − z̄)² / Σᵢ₌₁ⁿ (zᵢ − z̄)²

so h_jj measures how far zⱼ lies from the center of the z values. The average leverage is (r + 1)/n. (See Exercise 7.8.) Second, the leverage h_jj is a measure of the pull that a single case exerts on the fit. The vector of predicted values is
    ŷ = Zβ̂ = Z(Z′Z)⁻¹Z′y = Hy

so the jth predicted value is

    ŷⱼ = h_jj yⱼ + Σ_{k≠j} h_jk yₖ

Provided that all other y values are held fixed,

    (change in ŷⱼ) = h_jj (change in yⱼ)

If the leverage h_jj is large relative to the other h_jk, then yⱼ will be a major contributor to the predicted value ŷⱼ.

Observations that significantly affect inferences drawn from the data are said to be influential. Methods for assessing influence are typically based on the change in the vector of parameter estimates, β̂, when observations are deleted. Plots based upon leverage and influence statistics and their use in diagnostic checking of regression models are described in [3], [5], and [10]. These references are recommended for anyone involved in an analysis of regression models.

If, after the diagnostic checks, no serious violations of the assumptions are detected, we can make inferences about β and the future Y values with some assurance that we will not be misled.
Selecting predictor variables from a large set. In practice, it is often difficult to formulate an appropriate regression function immediately. Which predictor variables should be included? What form should the regression function take?

When the list of possible predictor variables is very large, not all of the variables can be included in the regression function. Techniques and computer programs designed to select the "best" subset of predictors are now readily available. The good ones try all subsets: z₁ alone, z₂ alone, …, z₁ and z₂, and so forth. The best choice is decided by examining some criterion quantity like R². [See (7-9).] However, R² always increases with the inclusion of additional predictor variables. Although this problem can be circumvented by using the adjusted R², R̄² = 1 − (1 − R²)(n − 1)/(n − r − 1), a better statistic for selecting variables seems to be Mallows' C_p statistic (see [12]),

    C_p = (residual sum of squares for subset model with p parameters, including an intercept)
          ------------------------------------------------------------------------------------ − (n − 2p)
                               (residual variance for full model)

A plot of the pairs (p, C_p), one for each subset of predictors, will indicate models that forecast the observed responses well. Good models typically have (p, C_p) coordinates near the 45° line. In Figure 7.4, we have circled the point corresponding to the "best" subset of predictor variables. If the list of predictor variables is very long, cost considerations limit the number of models that can be examined. Another approach, called stepwise regression (see [13]), attempts to select important predictors without considering all the possibilities.
Figure 7.4 C_p plot for computer data from Example 7.6 with three predictor variables (z₁ = orders, z₂ = add–delete count, z₃ = number of items; see the example and original source).
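The C_p criterion (and the AIC variant discussed below) can be computed for all subsets in a few lines. This is a sketch on synthetic data with three candidate predictors, of which only z₁ and z₂ truly enter the model; the function and variable names are our own.

```python
import numpy as np
from itertools import combinations

# Synthetic data: the true model uses only the first two predictors.
rng = np.random.default_rng(2)
n = 40
Zfull = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = Zfull @ np.array([1.0, 2.0, -1.5, 0.0]) + rng.normal(size=n)

def ss_res(cols):
    """Residual SS for the subset model with the given predictor columns."""
    Z = Zfull[:, [0] + list(cols)]
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ b
    return r @ r

s2_full = ss_res((1, 2, 3)) / (n - 4)   # residual variance, full model
results = {}
for k in range(4):
    for cols in combinations((1, 2, 3), k):
        p = k + 1                        # parameters, including intercept
        rss = ss_res(cols)
        cp  = rss / s2_full - (n - 2*p)  # Mallows' C_p
        aic = n * np.log(rss / n) + 2*p  # Akaike's information criterion
        results[cols] = (cp, aic)
print(results[(1, 2)])
```

By construction, C_p for the full model always equals p, and a well-fitting subset has C_p near its own p.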
The procedure can be described by listing the basic steps (algorithm) involved in the computations:

Step 1. All possible simple linear regressions are considered. The predictor variable that explains the largest significant proportion of the variation in Y (the variable that has the largest correlation with the response) is the first variable to enter the regression function.

Step 2. The next variable to enter is the one (out of those not yet included) that makes the largest significant contribution to the regression sum of squares. The significance of the contribution is determined by an F-test. (See Result 7.6.) The value of the F-statistic that must be exceeded before the contribution of a variable is deemed significant is often called the F to enter.

Step 3. Once an additional variable has been included in the equation, the individual contributions to the regression sum of squares of the other variables already in the equation are checked for significance using F-tests. If the statistic is less than the one (called the F to remove) corresponding to a prescribed significance level, the variable is deleted from the regression function.

Step 4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and all possible deletions are significant. At this point the selection stops.

Because of the step-by-step procedure, there is no guarantee that this approach will select, for example, the best three variables for prediction. A second drawback is that the (automatic) selection methods are not capable of indicating when transformations of variables are useful.

Another popular criterion for selecting an appropriate model, called an information criterion, also balances the size of the residual sum of squares with the number of parameters in the model. Akaike's information criterion (AIC) is

    AIC = n ln( (residual sum of squares for subset model with p parameters, including an intercept) / n ) + 2p
It is desirable that the residual sum of squares be small, but the second term penalizes too many parameters. Overall, we want to select models from those having the smaller values of AIC.
Collinearity. If Z is not of full rank, some linear combination, such as Za, must equal 0. In this situation, the columns are said to be collinear. This implies that Z′Z does not have an inverse. For most regression analyses, it is unlikely that Za = 0 exactly. Yet, if linear combinations of the columns of Z exist that are nearly 0, the calculation of (Z′Z)⁻¹ is numerically unstable. Typically, the diagonal entries of (Z′Z)⁻¹ will be large. This yields large estimated variances for the β̂ᵢ's, and it is then difficult to detect the "significant" regression coefficients β̂ᵢ. The problems caused by collinearity can be overcome somewhat by (1) deleting one of a pair of predictor variables that are strongly correlated or (2) relating the response Y to the principal components of the predictor variables; that is, the rows zⱼ′ of Z are treated as a sample, and the first few principal components are calculated as is subsequently described in Section 8.3. The response Y is then regressed on these new predictor variables.
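The inflation of the diagonal of (Z′Z)⁻¹ under near-collinearity is easy to demonstrate. This is a sketch, not from the book, comparing an independent second predictor with one that is nearly equal to z₁; the names are our own.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
z1 = rng.normal(size=n)
z2_indep  = rng.normal(size=n)               # unrelated to z1
z2_collin = z1 + 1e-3 * rng.normal(size=n)   # nearly equal to z1

def max_var_factor(z2):
    """Largest diagonal entry of (Z'Z)^{-1}; Var(beta_i) = sigma^2 times this."""
    Z = np.column_stack([np.ones(n), z1, z2])
    return np.diag(np.linalg.inv(Z.T @ Z)).max()

print(max_var_factor(z2_indep), max_var_factor(z2_collin))
```

The nearly collinear design produces a variance factor that is orders of magnitude larger, which is exactly why "significant" coefficients become hard to detect.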
Bias caused by a misspecified model. Suppose some important predictor variables are omitted from the proposed regression model. That is, suppose the true model has Z = [Z₁ | Z₂] with rank r + 1 and

    Y = [Z₁ | Z₂][β_(1)′ | β_(2)′]′ + ε = Z₁β_(1) + Z₂β_(2) + ε        (7-20)

where E(ε) = 0 and Var(ε) = σ²I. However, the investigator unknowingly fits a model using only the first q predictors by minimizing the error sum of squares (Y − Z₁β_(1))′(Y − Z₁β_(1)). The least squares estimator of β_(1) is β̂_(1) = (Z₁′Z₁)⁻¹Z₁′Y. Then, unlike the situation when the model is correct,

    E(β̂_(1)) = (Z₁′Z₁)⁻¹Z₁′E(Y) = (Z₁′Z₁)⁻¹Z₁′(Z₁β_(1) + Z₂β_(2) + E(ε))
             = β_(1) + (Z₁′Z₁)⁻¹Z₁′Z₂β_(2)        (7-21)

That is, β̂_(1) is a biased estimator of β_(1) unless the columns of Z₁ are perpendicular to those of Z₂ (that is, Z₁′Z₂ = 0). If important variables are missing from the model, the least squares estimates β̂_(1) may be misleading.
7.7 Multivariate Multiple Regression

In this section, we consider the problem of modeling the relationship between m responses Y₁, Y₂, …, Yₘ and a single set of predictor variables z₁, z₂, …, z_r. Each response is assumed to follow its own regression model, so that

    Y₁ = β₀₁ + β₁₁z₁ + ⋯ + β_r1 z_r + ε₁
    Y₂ = β₀₂ + β₁₂z₁ + ⋯ + β_r2 z_r + ε₂
     ⋮
    Yₘ = β₀ₘ + β₁ₘz₁ + ⋯ + β_rm z_r + εₘ        (7-22)

The error term ε′ = [ε₁, ε₂, …, εₘ] has E(ε) = 0 and Var(ε) = Σ. Thus, the error terms associated with different responses may be correlated.

To establish notation conforming to the classical linear regression model, let [zⱼ₀, zⱼ₁, …, zⱼᵣ] denote the values of the predictor variables for the jth trial, let Yⱼ′ = [Yⱼ₁, Yⱼ₂, …, Yⱼₘ] be the responses, and let εⱼ′ = [εⱼ₁, εⱼ₂, …, εⱼₘ] be the errors. In matrix notation, the design matrix

    Z (n×(r+1)) = ⎡z₁₀ z₁₁ ⋯ z₁ᵣ⎤
                  ⎢z₂₀ z₂₁ ⋯ z₂ᵣ⎥
                  ⎢ ⋮   ⋮       ⋮ ⎥
                  ⎣zₙ₀ zₙ₁ ⋯ zₙᵣ⎦

is the same as that for the single-response regression model. [See (7-3).] The other matrix quantities have multivariate counterparts. Set
    Y (n×m) = ⎡Y₁₁ Y₁₂ ⋯ Y₁ₘ⎤
              ⎢Y₂₁ Y₂₂ ⋯ Y₂ₘ⎥
              ⎢ ⋮   ⋮       ⋮ ⎥  = [Y_(1) | Y_(2) | ⋯ | Y_(m)]
              ⎣Yₙ₁ Yₙ₂ ⋯ Yₙₘ⎦

    β ((r+1)×m) = ⎡β₀₁ β₀₂ ⋯ β₀ₘ⎤
                  ⎢β₁₁ β₁₂ ⋯ β₁ₘ⎥
                  ⎢ ⋮   ⋮       ⋮ ⎥  = [β_(1) | β_(2) | ⋯ | β_(m)]
                  ⎣β_r1 β_r2 ⋯ β_rm⎦

    ε (n×m) = ⎡ε₁₁ ε₁₂ ⋯ ε₁ₘ⎤
              ⎢ε₂₁ ε₂₂ ⋯ ε₂ₘ⎥
              ⎢ ⋮   ⋮       ⋮ ⎥  = [ε_(1) | ε_(2) | ⋯ | ε_(m)]
              ⎣εₙ₁ εₙ₂ ⋯ εₙₘ⎦
The multivariate linear regression model is

    Y (n×m) = Z (n×(r+1)) β ((r+1)×m) + ε (n×m)        (7-23)

with

    E(ε_(i)) = 0,   Cov(ε_(i), ε_(k)) = σ_ik I,   i, k = 1, 2, …, m

The m observations on the jth trial have covariance matrix Σ = {σ_ik}, but observations from different trials are uncorrelated. Here β and σ_ik are unknown parameters; the design matrix Z has jth row [zⱼ₀, zⱼ₁, …, zⱼᵣ].

Simply stated, the ith response Y_(i) follows the linear regression model

    Y_(i) = Zβ_(i) + ε_(i),   i = 1, 2, …, m        (7-24)
with Cov(ε_(i)) = σ_ii I. However, the errors for different responses on the same trial can be correlated.

Given the outcomes Y and the values of the predictor variables Z with full column rank, we determine the least squares estimates β̂_(i) exclusively from the observations Y_(i) on the ith response. In conformity with the single-response solution, we take

    β̂_(i) = (Z′Z)⁻¹Z′Y_(i)        (7-25)

Collecting these univariate least squares estimates, we obtain

    β̂ = [β̂_(1) | β̂_(2) | ⋯ | β̂_(m)] = (Z′Z)⁻¹Z′[Y_(1) | Y_(2) | ⋯ | Y_(m)]

or

    β̂ = (Z′Z)⁻¹Z′Y        (7-26)
For any choice of parameters B = [b_(1) | b_(2) | ⋯ | b_(m)], the matrix of errors is Y − ZB. The error sum of squares and cross products matrix is

    (Y − ZB)′(Y − ZB)
      = ⎡(Y_(1) − Zb_(1))′(Y_(1) − Zb_(1)) ⋯ (Y_(1) − Zb_(1))′(Y_(m) − Zb_(m))⎤
        ⎢                ⋮                                    ⋮               ⎥        (7-27)
        ⎣(Y_(m) − Zb_(m))′(Y_(1) − Zb_(1)) ⋯ (Y_(m) − Zb_(m))′(Y_(m) − Zb_(m))⎦

The selection b_(i) = β̂_(i) minimizes the ith diagonal sum of squares (Y_(i) − Zb_(i))′(Y_(i) − Zb_(i)). Consequently, tr[(Y − ZB)′(Y − ZB)] is minimized by the choice B = β̂. Also, the generalized variance |(Y − ZB)′(Y − ZB)| is minimized by the least squares estimates β̂. (See Exercise 7.11 for an additional generalized sum of squares property.)

Using the least squares estimates β̂, we can form the matrices of

    Predicted values:  Ŷ = Zβ̂ = Z(Z′Z)⁻¹Z′Y
    Residuals:         ε̂ = Y − Ŷ = [I − Z(Z′Z)⁻¹Z′]Y        (7-28)

The orthogonality conditions among the residuals, predicted values, and columns of Z, which hold in classical linear regression, hold in multivariate multiple regression. They follow from Z′[I − Z(Z′Z)⁻¹Z′] = Z′ − Z′ = 0. Specifically,

    Z′ε̂ = Z′[I − Z(Z′Z)⁻¹Z′]Y = 0        (7-29)

so the residuals ε̂_(i) are perpendicular to the columns of Z. Also,

    Ŷ′ε̂ = β̂′Z′[I − Z(Z′Z)⁻¹Z′]Y = 0        (7-30)

confirming that the predicted values Ŷ_(i) are perpendicular to all residual vectors ε̂_(k). Because Y = Ŷ + ε̂,

    Y′Y = (Ŷ + ε̂)′(Ŷ + ε̂) = Ŷ′Ŷ + ε̂′ε̂ + 0 + 0

or

    Y′Y = Ŷ′Ŷ + ε̂′ε̂
    (total sum of squares and cross products)
      = (predicted sum of squares and cross products) + (residual (error) sum of squares and cross products)        (7-31)
The residual sum of squares and cross products can also be written as

    ε̂′ε̂ = Y′Y − Ŷ′Ŷ = Y′Y − β̂′Z′Zβ̂        (7-32)

Example 7.8 (Fitting a multivariate straight-line regression model) To illustrate the calculations of β̂, Ŷ, and ε̂, we fit a straight-line regression model (see Panel 7.2),

    Yⱼ₁ = β₀₁ + β₁₁zⱼ₁ + εⱼ₁
    Yⱼ₂ = β₀₂ + β₁₂zⱼ₁ + εⱼ₂,   j = 1, 2, …, 5

to two responses Y₁ and Y₂ using the data in Example 7.3. These data, augmented by observations on an additional response, are as follows:

    z₁:   0   1   2   3   4
    y₁:   1   4   3   8   9
    y₂:  −1  −1   2   3   2

The design matrix Z remains unchanged from the single-response problem. We find that

    Z′ = ⎡1 1 1 1 1⎤
         ⎣0 1 2 3 4⎦
PANEL 7.2 SAS ANALYSIS FOR EXAMPLE 7.8 USING PROC GLM

PROGRAM COMMANDS

    title 'Multivariate Regression Analysis';
    data mra;
       infile 'Example7-8data';
       input y1 y2 z1;
    proc glm data = mra;
       model y1 y2 = z1/ss3;
       manova h = z1/printe;

OUTPUT (excerpts)

    Dependent Variable: Y1
       R-Square 0.869565    C.V. 28.28427    Y1 Mean 5.00000000
       Source Z1    DF 1    F Value 20.00

    Dependent Variable: Y2
       R-Square 0.714286    C.V. 115.4701    Y2 Mean 1.00000000
       Source Z1    DF 1    Type III SS 10.00000000    F Value 7.50    Pr > F 0.0714

    Manova Test Criteria and Exact F Statistics for
    the Hypothesis of no Overall Z1 Effect
    H = Type III SS&CP Matrix for Z1    E = Error SS&CP Matrix
    S = 1    M = 0    N = 0

    Statistic                  Value         F        Num DF   Den DF
    Wilks' Lambda              0.06250000    15.0000     2        2
    Pillai's Trace             0.93750000    15.0000     2        2
    Hotelling-Lawley Trace    15.00000000    15.0000     2        2
    Roy's Greatest Root       15.00000000    15.0000     2        2

From Z′Z = [5 10; 10 30],

    (Z′Z)⁻¹ = ⎡ .6  −.2⎤
              ⎣−.2   .1⎦
so

    Z′y_(1) = [25, 70]′,   β̂_(1) = (Z′Z)⁻¹Z′y_(1) = [1, 2]′

and

    Z′y_(2) = [5, 20]′,   β̂_(2) = (Z′Z)⁻¹Z′y_(2) = [−1, 1]′

Hence,

    β̂ = [β̂_(1) | β̂_(2)] = ⎡1  −1⎤
                            ⎣2   1⎦

The fitted values are ŷ₁ = 1 + 2z₁ and ŷ₂ = −1 + z₁. Collectively,

    Ŷ = Zβ̂ = ⎡1 −1⎤        ε̂ = Y − Ŷ = ⎡ 0  0⎤
              ⎢3  0⎥                     ⎢ 1 −1⎥
              ⎢5  1⎥                     ⎢−2  1⎥
              ⎢7  2⎥                     ⎢ 1  1⎥
              ⎣9  3⎦                     ⎣ 0 −1⎦

Note that

    ε̂′Ŷ = ⎡0  1 −2  1  0⎤ Ŷ = ⎡0 0⎤
           ⎣0 −1  1  1 −1⎦     ⎣0 0⎦

Since

    Y′Y = ⎡171 43⎤,   Ŷ′Ŷ = ⎡165 45⎤,   and   ε̂′ε̂ = ⎡ 6 −2⎤
          ⎣ 43 19⎦          ⎣ 45 15⎦               ⎣−2  4⎦

the sum of squares and cross products decomposition

    Y′Y = Ŷ′Ŷ + ε̂′ε̂

is easily verified.  ■
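The numbers in Example 7.8 can be checked in a few lines, since the multivariate least squares estimates come column by column from the same (Z′Z)⁻¹Z′. A NumPy sketch:

```python
import numpy as np

# Data from Example 7.8: one predictor z1 and two responses.
Z = np.column_stack([np.ones(5), [0., 1., 2., 3., 4.]])
Y = np.array([[1., -1.], [4., -1.], [3., 2.], [8., 3.], [9., 2.]])

Bhat = np.linalg.inv(Z.T @ Z) @ Z.T @ Y   # beta-hat, as in (7-26)
Yhat = Z @ Bhat                            # predicted values (7-28)
Ehat = Y - Yhat                            # residuals (7-28)
print(Bhat)            # columns [1, 2] and [-1, 1]
print(Ehat.T @ Yhat)   # zero matrix: residuals are orthogonal to Yhat
```

The same call also confirms the decomposition (7-31): `Y.T @ Y` equals `Yhat.T @ Yhat + Ehat.T @ Ehat`.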
Result 7.9. For the least squares estimator β̂ = [β̂_(1) | β̂_(2) | ⋯ | β̂_(m)] determined under the multivariate multiple regression model (7-23) with full rank(Z) = r + 1 < n,

    E(β̂_(i)) = β_(i)   and   Cov(β̂_(i), β̂_(k)) = σ_ik(Z′Z)⁻¹,   i, k = 1, 2, …, m

The residuals ε̂ = [ε̂_(1) | ε̂_(2) | ⋯ | ε̂_(m)] = Y − Zβ̂ satisfy E(ε̂_(i)) = 0 and

    E( ε̂′ε̂ / (n − r − 1) ) = Σ

Also, ε̂ and β̂ are uncorrelated.

Proof. The ith response follows the single-response regression model, with E(ε_(i)) = 0 and E(ε_(i)ε_(i)′) = σ_ii I, so

    β̂_(i) − β_(i) = (Z′Z)⁻¹Z′ε_(i)   and   ε̂_(i) = Y_(i) − Ŷ_(i) = [I − Z(Z′Z)⁻¹Z′]ε_(i)        (7-33)

Hence E(β̂_(i)) = β_(i) and E(ε̂_(i)) = 0. Since E(ε_(i)ε_(k)′) = σ_ik I,

    Cov(β̂_(i), β̂_(k)) = E[(Z′Z)⁻¹Z′ε_(i)ε_(k)′Z(Z′Z)⁻¹] = σ_ik(Z′Z)⁻¹

Using Result 4.9, with U any random vector and A a fixed matrix, we have E[U′AU] = E[tr(AUU′)] = tr[A E(UU′)]. Consequently, from the proof of Result 7.1 and using Result 2A.12,

    E(ε̂_(i)′ε̂_(k)) = E[ε_(i)′(I − Z(Z′Z)⁻¹Z′)ε_(k)] = σ_ik tr[I − Z(Z′Z)⁻¹Z′] = σ_ik(n − r − 1)

Dividing each entry ε̂_(i)′ε̂_(k) of ε̂′ε̂ by n − r − 1, we obtain the unbiased estimator of Σ. Finally,

    Cov(β̂_(i), ε̂_(k)) = E[(Z′Z)⁻¹Z′ε_(i)ε_(k)′(I − Z(Z′Z)⁻¹Z′)]
                      = σ_ik(Z′Z)⁻¹Z′(I − Z(Z′Z)⁻¹Z′) = 0

so each element of β̂ is uncorrelated with each element of ε̂.  ■
The mean vectors and covariance matrices determined in Result 7.9 enable us to obtain the sampling properties of the least squares predictors.

We first consider the problem of estimating the mean vector when the predictor variables have the values z₀′ = [1, z₀₁, …, z₀ᵣ]. The mean of the ith response variable is z₀′β_(i), and this is estimated by z₀′β̂_(i), the ith component of the fitted regression relationship. Collectively,

    z₀′β̂ = [z₀′β̂_(1) | z₀′β̂_(2) | ⋯ | z₀′β̂_(m)]        (7-34)

is an unbiased estimator of z₀′β, since E(z₀′β̂_(i)) = z₀′β_(i) for each component. From the covariance matrix for β̂_(i) and β̂_(k), the estimation errors z₀′β_(i) − z₀′β̂_(i) have covariances

    E[z₀′(β_(i) − β̂_(i))(β_(k) − β̂_(k))′z₀] = z₀′( E(β_(i) − β̂_(i))(β_(k) − β̂_(k))′ )z₀
      = σ_ik z₀′(Z′Z)⁻¹z₀        (7-35)

The related problem is that of forecasting a new observation vector Y₀′ = [Y₀₁, Y₀₂, …, Y₀ₘ] at z₀. According to the regression model, Y₀ᵢ = z₀′β_(i) + ε₀ᵢ, where the "new" error ε₀′ = [ε₀₁, ε₀₂, …, ε₀ₘ] is independent of the errors ε and satisfies E(ε₀ᵢ) = 0 and E(ε₀ᵢε₀ₖ) = σ_ik. The forecast error for the ith component of Y₀ is

    Y₀ᵢ − z₀′β̂_(i) = Y₀ᵢ − z₀′β_(i) + z₀′β_(i) − z₀′β̂_(i) = ε₀ᵢ + z₀′(β_(i) − β̂_(i))

so E(Y₀ᵢ − z₀′β̂_(i)) = E(ε₀ᵢ) + z₀′E(β_(i) − β̂_(i)) = 0, indicating that z₀′β̂_(i) is an unbiased predictor of Y₀ᵢ. The forecast errors have covariances

    E[(Y₀ᵢ − z₀′β̂_(i))(Y₀ₖ − z₀′β̂_(k))]
      = E(ε₀ᵢε₀ₖ) + z₀′E[(β_(i) − β̂_(i))(β_(k) − β̂_(k))′]z₀
        − z₀′E[(β̂_(i) − β_(i))ε₀ₖ] − E[ε₀ᵢ(β̂_(k) − β_(k))′]z₀
      = σ_ik(1 + z₀′(Z′Z)⁻¹z₀)        (7-36)

Note that E[(β̂_(i) − β_(i))ε₀ₖ] = 0, since β̂_(i) = (Z′Z)⁻¹Z′ε_(i) + β_(i) is independent of ε₀. A similar result holds for E[ε₀ᵢ(β̂_(k) − β_(k))′].

Maximum likelihood estimators and their distributions can be obtained when the errors ε have a normal distribution.
Result 7.10. Let the multivariate multiple regression model in (7-23) hold with full rank(Z) = r + 1, n ≥ (r + 1) + m, and let the errors ε have a normal distribution. Then

    β̂ = (Z′Z)⁻¹Z′Y

is the maximum likelihood estimator of β, and β̂ has a normal distribution with E(β̂) = β and Cov(β̂_(i), β̂_(k)) = σ_ik(Z′Z)⁻¹. Also, β̂ is independent of the maximum likelihood estimator of the positive definite Σ given by

    Σ̂ = (1/n) ε̂′ε̂ = (1/n)(Y − Zβ̂)′(Y − Zβ̂)

and

    nΣ̂ is distributed as W_(p,n−r−1)(Σ)

The maximized likelihood is L(μ̂, Σ̂) = (2π)^(−mn/2)|Σ̂|^(−n/2)e^(−mn/2).

When the errors are normally distributed, β̂ and n⁻¹ε̂′ε̂ are the maximum likelihood estimators of β and Σ, respectively. Therefore, for large samples, they have nearly the smallest possible variances.

Comment. The multivariate multiple regression model poses no new computational problems. Least squares (maximum likelihood) estimates, β̂_(i) = (Z′Z)⁻¹Z′y_(i), are computed individually for each response variable. Note, however, that the model requires that the same predictor variables be used for all responses.

Once a multivariate multiple regression model has been fit to the data, it should be subjected to the diagnostic checks described in Section 7.6 for the single-response model. The residual vectors [ε̂ⱼ₁, ε̂ⱼ₂, …, ε̂ⱼₘ] can be examined for normality or outliers using the techniques in Section 4.6.

The remainder of this section is devoted to brief discussions of inference for the normal theory multivariate multiple regression model. Extended accounts of these procedures appear in [2] and [18].
Likelihood Ratio Tests for Regression Parameters

The multiresponse analog of (7-12), the hypothesis that the responses do not depend on z_(q+1), z_(q+2), …, z_r, becomes

    H₀: β_(2) = 0,   where β = [β_(1)′ | β_(2)′]′, β_(1) of dimension (q+1)×m,
                     β_(2) of dimension (r−q)×m        (7-37)

Setting Z = [Z₁ | Z₂], with Z₁ of dimension n×(q+1) and Z₂ of dimension n×(r−q), we can write the general model as

    E(Y) = Zβ = [Z₁ | Z₂][β_(1)′ | β_(2)′]′ = Z₁β_(1) + Z₂β_(2)

Under H₀: β_(2) = 0, Y = Z₁β_(1) + ε and the likelihood ratio test of H₀ is based on the quantities involved in the

    extra sum of squares and cross products
      = (Y − Z₁β̂_(1))′(Y − Z₁β̂_(1)) − (Y − Zβ̂)′(Y − Zβ̂) = n(Σ̂₁ − Σ̂)

where β̂_(1) = (Z₁′Z₁)⁻¹Z₁′Y and Σ̂₁ = n⁻¹(Y − Z₁β̂_(1))′(Y − Z₁β̂_(1)).

From Result 7.10, the likelihood ratio, Λ, can be expressed in terms of generalized variances:

    Λ = ( |Σ̂| / |Σ̂₁| )^(n/2)

Equivalently, Wilks' lambda statistic Λ^(2/n) = |Σ̂|/|Σ̂₁| can be used.

Result 7.11. Let the multivariate multiple regression model of (7-23) hold with Z of full rank r + 1 and (r + 1) + m ≤ n. Let the errors ε be normally distributed. Under H₀: β_(2) = 0, nΣ̂ is distributed as W_(p,n−r−1)(Σ) independently of n(Σ̂₁ − Σ̂), which, in turn, is distributed as W_(p,r−q)(Σ). The likelihood ratio test of H₀ is equivalent to rejecting H₀ for large values of

    −2 ln Λ = −n ln( |Σ̂| / |Σ̂₁| )

For n large, the modified statistic

    −[ n − r − 1 − ½(m − r + q + 1) ] ln( |Σ̂| / |Σ̂₁| )

has, to a close approximation, a chi-square distribution with m(r − q) d.f.

If Z is not of full rank, but has rank r₁ + 1, then β̂ = (Z′Z)⁻Z′Y, where (Z′Z)⁻ is the generalized inverse discussed in [22]. (See also Exercise 7.6.) The distributional conclusions stated in Result 7.11 remain the same, provided that r is replaced by r₁ and q + 1 by rank(Z₁). However, not all hypotheses concerning β can be tested due to the lack of uniqueness in the identification of β caused by the linear dependencies among the columns of Z. Nevertheless, the generalized inverse allows all of the important MANOVA models to be analyzed as special cases of the multivariate multiple regression model.
Example 7.9 (Testing the importance of additional predictors with a multivariate response) The service in three locations of a large restaurant chain was rated according to two measures of quality by male and female patrons. The first service-quality index was introduced in Example 7.5. Suppose we consider a regression model that allows for the effects of location, gender, and the location–gender interaction on both service-quality indices. The design matrix (see Example 7.5) remains the same for the two-response situation. We shall illustrate the test of no location–gender interaction in either response using Result 7.11. A computer program provides the

    (residual sum of squares and cross products) = nΣ̂

and the

    (extra sum of squares and cross products) = n(Σ̂₁ − Σ̂) = ⎡441.76 246.16⎤
                                                             ⎣246.16 366.12⎦

Let β_(2) be the matrix of interaction parameters for the two responses. Although the sample size n = 18 is not large, we shall illustrate the calculations involved in the test of H₀: β_(2) = 0 given in Result 7.11. Setting α = .05, we test H₀ by referring

    −[ n − r₁ − 1 − ½(m − r₁ + q₁ + 1) ] ln( |Σ̂| / |Σ̂₁| )
      = −[ 18 − 5 − 1 − ½(2 − 5 + 3 + 1) ] ln(.7605) = 3.28

to a chi-square percentage point with m(r₁ − q₁) = 2(2) = 4 d.f. Since 3.28 < χ₄²(.05) = 9.49, we do not reject H₀ at the 5% level. The interaction terms are not needed.  ■

Information criteria are also available to aid in the selection of a simple but adequate multivariate multiple regression model. For a model that includes d predictor variables counting the intercept, let

    Σ̂_d = (1/n)(residual sum of squares and cross products matrix)

Then the multivariate multiple regression version of Akaike's information criterion is

    AIC = n ln( |Σ̂_d| ) − 2p × d

This criterion attempts to balance the generalized variance with the number of parameters. Models with smaller AIC's are preferable.

In the context of Example 7.9, under the null hypothesis of no interaction terms, we have n = 18, p = 2 response variables, and d = 4 terms, so

    AIC = 18 ln( | (1/18) ⎡3419.1  1267.88⎤ | ) − 2 × 2 × 4
                          ⎣1267.88 2417.07⎦

        = 18 × ln(20545.7) − 16 = 162.75
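The AIC arithmetic above is easy to verify numerically. The sketch below assumes, as the surrounding calculation does, that the residual sum of squares and cross products matrix under the no-interaction model has (1,1) entry 3419.1, the value of SS_res(Z₁) for the first response in Example 7.5.

```python
import numpy as np

# Multivariate AIC:  AIC = n ln|Sigma_hat| - 2 p d, with
# Sigma_hat = (residual SSCP)/n for the fitted (no-interaction) model.
n, p, d = 18, 2, 4
resid_sscp = np.array([[3419.1, 1267.88],
                       [1267.88, 2417.07]])   # n * Sigma_hat_1
Sigma_hat = resid_sscp / n
aic = n * np.log(np.linalg.det(Sigma_hat)) - 2 * p * d
print(round(aic, 2))   # about 162.75, matching the text
```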
More generally, we could consider a null hypothesis of the form H₀: Cβ = Γ₀, where C is (r − q) × (r + 1) and is of full rank r − q. For the choices

    C = [0 | I_((r−q)×(r−q))]   and   Γ₀ = 0

this reduces to H₀: β_(2) = 0, the case considered earlier. It can be shown that the extra sum of squares and cross products generated by the hypothesis H₀ is

    n(Σ̂₁ − Σ̂) = (Cβ̂ − Γ₀)′( C(Z′Z)⁻¹C′ )⁻¹(Cβ̂ − Γ₀)

Under the null hypothesis, the statistic n(Σ̂₁ − Σ̂) is distributed as W_(r−q)(Σ) independently of nΣ̂. This distribution theory can be employed to develop a test of H₀: Cβ = Γ₀ similar to the test discussed in Result 7.11. (See, for example, [18].)
Other Multivariate Test Statistics

Tests other than the likelihood ratio test have been proposed for testing H₀: β_(2) = 0 in the multivariate multiple regression model. Popular computer-package programs routinely calculate four multivariate test statistics. To connect with their output, we introduce some alternative notation. Let E be the p × p error, or residual, sum of squares and cross products matrix

    E = nΣ̂

that results from fitting the full model. The p × p hypothesis, or extra, sum of squares and cross-products matrix is

    H = n(Σ̂₁ − Σ̂)

The statistics can be defined in terms of E and H directly, or in terms of the nonzero eigenvalues η₁ ≥ η₂ ≥ ⋯ ≥ η_s of HE⁻¹, where s = min(p, r − q). Equivalently, they are the roots of |(Σ̂₁ − Σ̂) − η Σ̂| = 0. The definitions are

    Wilks' lambda            = ∏ᵢ₌₁ˢ 1/(1 + ηᵢ) = |E| / |E + H|
    Pillai's trace           = Σᵢ₌₁ˢ ηᵢ/(1 + ηᵢ) = tr[ H(H + E)⁻¹ ]
    Hotelling–Lawley trace   = Σᵢ₌₁ˢ ηᵢ = tr[ HE⁻¹ ]
    Roy's greatest root      = η₁/(1 + η₁)

Roy's test selects the coefficient vector a so that the univariate F-statistic based on a′Yⱼ has its maximum possible value. When several of the eigenvalues ηᵢ are moderately large, Roy's test will perform poorly relative to the other three. Simulation studies suggest that its power will be best when there is only one large eigenvalue. Charts and tables of critical values are available for Roy's test. (See [21] and [17].) Wilks' lambda, Roy's greatest root, and the Hotelling–Lawley trace test are nearly equivalent for large sample sizes.

If there is a large discrepancy in the reported P-values for the four tests, the eigenvalues and vectors may lead to an interpretation. In this text, we report Wilks' lambda, which is the likelihood ratio test.
Predictions from Multivariate Multiple Regressions

Suppose the model Y = Z beta + epsilon, with normal errors, has been fit and checked for any inadequacies. If the model is adequate, it can be employed for prediction. The estimator beta_hat = (Z'Z)^{-1}Z'Y is normally distributed, and n Sigma_hat is independently distributed as W_{n-r-1}(Sigma).

One prediction problem concerns the mean responses at a fixed value z0 of the predictor variables. The unknown value of the regression function at z0 is beta' z0. So, from the discussion of the T^2 statistic in Section 5.2, we can write

T^2 = ( (beta_hat' z0 - beta' z0) / sqrt( z0'(Z'Z)^{-1} z0 ) )' ( ( n/(n-r-1) ) Sigma_hat )^{-1} ( (beta_hat' z0 - beta' z0) / sqrt( z0'(Z'Z)^{-1} z0 ) )   (7-39)

and the 100(1-alpha)% confidence ellipsoid for beta' z0 is provided by the inequality

T^2 <= ( m(n-r-1)/(n-r-m) ) F_{m, n-r-m}(alpha)   (7-40)

where F_{m, n-r-m}(alpha) is the upper (100 alpha)th percentile of an F-distribution with m and n-r-m d.f.

The 100(1-alpha)% simultaneous confidence intervals for E(Y_i) = z0' beta_(i) are

z0' beta_hat_(i) +/- sqrt( ( m(n-r-1)/(n-r-m) ) F_{m, n-r-m}(alpha) ) sqrt( z0'(Z'Z)^{-1} z0 ( n sigma_hat_ii /(n-r-1) ) ),   i = 1, 2, ..., m   (7-41)

where beta_hat_(i) is the ith column of beta_hat and sigma_hat_ii is the ith diagonal element of Sigma_hat.

The second prediction problem is concerned with forecasting new responses Y0 = beta' z0 + epsilon_0 at z0. Here epsilon_0 is independent of epsilon. Now,

Y0 - beta_hat' z0 = (beta - beta_hat)' z0 + epsilon_0

is distributed as N_m( 0, (1 + z0'(Z'Z)^{-1} z0) Sigma ) independently of n Sigma_hat, so the 100(1-alpha)% prediction ellipsoid for Y0 becomes

(Y0 - beta_hat' z0)' ( ( n/(n-r-1) ) Sigma_hat )^{-1} (Y0 - beta_hat' z0) <= (1 + z0'(Z'Z)^{-1} z0) ( m(n-r-1)/(n-r-m) ) F_{m, n-r-m}(alpha)   (7-42)

The 100(1-alpha)% simultaneous prediction intervals for the individual responses Y0i are

z0' beta_hat_(i) +/- sqrt( ( m(n-r-1)/(n-r-m) ) F_{m, n-r-m}(alpha) ) sqrt( (1 + z0'(Z'Z)^{-1} z0)( n sigma_hat_ii /(n-r-1) ) ),   i = 1, 2, ..., m   (7-43)
Chapter 7 Multivariate Linear Regression Models

where beta_hat_(i), sigma_hat_ii, and F_{m, n-r-m}(alpha) are the same quantities appearing in (7-41). Comparing (7-41) and (7-43), we see that the prediction intervals for the actual values of the response variables are wider than the corresponding intervals for the expected values. The extra width reflects the presence of the random error epsilon_0i.
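The widths in (7-41) and (7-43) differ only in the factor under the second square root. A small sketch makes the comparison concrete; the F critical value must be supplied by the user (e.g., from tables), and the numerical inputs below are hypothetical:

```python
import math

def simultaneous_halfwidth(f_crit, m, n, r, leverage, sigma_hat_ii, new_obs=False):
    """Half-width of the simultaneous interval for response i at z0.
    leverage = z0'(Z'Z)^{-1} z0; set new_obs=True for a prediction interval (7-43)."""
    factor = (1.0 + leverage) if new_obs else leverage
    c = m * (n - r - 1) / (n - r - m) * f_crit
    return math.sqrt(c) * math.sqrt(factor * n * sigma_hat_ii / (n - r - 1))

# Hypothetical inputs: m = 2 responses, n = 7, r = 2, leverage .34725
ci = simultaneous_halfwidth(9.55, 2, 7, 2, 0.34725, 0.83)
pi = simultaneous_halfwidth(9.55, 2, 7, 2, 0.34725, 0.83, new_obs=True)
print(pi > ci)  # True: prediction intervals are always wider
```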
Example 7.10 (Constructing a confidence ellipse and a prediction ellipse for bivariate responses) A second response variable was measured for the computer-requirement problem discussed in Example 7.6. Measurements on the response Y2, disk input/output capacity, were obtained along with the data on Y1 (CPU time), Z1 (orders), and Z2 (add-delete items) considered there. Fitting the bivariate regression model gives the estimated regression functions

y1_hat = 8.42 + 1.08 z1 + .42 z2
y2_hat = 14.14 + 2.25 z1 + 5.67 z2

Obtain the 95% confidence ellipse for beta' z0 and the 95% prediction ellipse for the responses at a site with z0' = [1, 130, 7.5].

We find that

z0' beta_hat_(1) = 8.42 + 1.08(130) + .42(7.5) = 151.97

and

z0' beta_hat_(2) = 14.14 + 2.25(130) + 5.67(7.5) = 349.17

so

beta_hat' z0 = [151.97; 349.17]

Here n = 7, r = 2, m = 2, z0'(Z'Z)^{-1} z0 = .34725, and

n Sigma_hat = [ 5.80  5.30 ]
              [ 5.30 13.13 ]

From (7-40), the 95% confidence ellipse for beta' z0 is the set

[ z0' beta_(1) - 151.97, z0' beta_(2) - 349.17 ] (4) [ 5.80  5.30 ]^{-1} [ z0' beta_(1) - 151.97 ]  <=  (.34725) ( 2(4)/3 ) F_{2,3}(.05)
                                                    [ 5.30 13.13 ]     [ z0' beta_(2) - 349.17 ]

with F_{2,3}(.05) = 9.55. This ellipse is centered at (151.97, 349.17). Its orientation and the lengths of the major and minor axes can be determined from the eigenvalues and eigenvectors of n Sigma_hat.

Comparing (7-40) and (7-42), we see that the only change required for the calculation of the 95% prediction ellipse is to replace z0'(Z'Z)^{-1} z0 = .34725 with 1 + z0'(Z'Z)^{-1} z0 = 1.34725. Thus, the 95% prediction ellipse for Y0 = [Y01, Y02]' is also centered at (151.97, 349.17), but is larger than the confidence ellipse. Both ellipses are sketched in Figure 7.5. It is the prediction ellipse that is relevant to the determination of computer requirements for a particular site with the given z0.

[Figure 7.5  95% confidence and prediction ellipses for the computer data with two responses.]
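A point can be checked for membership in either ellipse directly from the quadratic form above (a sketch using the quantities quoted in Example 7.10; only the ellipse check is implemented, not the model fitting):

```python
import numpy as np

# Quantities from Example 7.10
center = np.array([151.97, 349.17])        # beta_hat' z0
n_sigma_hat = np.array([[5.80, 5.30],
                        [5.30, 13.13]])    # n * Sigma_hat
leverage = 0.34725                         # z0'(Z'Z)^{-1} z0
f_crit = 9.55                              # F_{2,3}(.05)
bound_factor = 2 * 4 / 3                   # m(n-r-1)/(n-r-m) with n=7, r=2, m=2

def in_ellipse(point, prediction=False):
    """True if `point` lies inside the 95% confidence (or prediction) ellipse."""
    lev = 1.0 + leverage if prediction else leverage
    d = point - center
    lhs = d @ (4 * np.linalg.inv(n_sigma_hat)) @ d   # (n-r-1)(n Sigma_hat)^{-1}
    return lhs <= lev * bound_factor * f_crit

print(in_ellipse(center))                                   # True: the center
print(in_ellipse(np.array([155.0, 355.0])))                 # False: outside confidence ellipse
print(in_ellipse(np.array([155.0, 355.0]), prediction=True))  # True: inside larger prediction ellipse
```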
The Concept of Linear Regression

The classical linear regression model is concerned with the association between a single dependent variable Y and a collection of predictor variables z1, z2, ..., zr. Here we take Y and the predictors Z1, Z2, ..., Zr to be random variables with mean vector and covariance matrix

mu = [ mu_Y ]      Sigma = [ sigma_YY  sigma_ZY' ]   (7-44)
     [ mu_Z ]              [ sigma_ZY  Sigma_ZZ  ]

Sigma_ZZ can be taken to have full rank.(6) Consider the problem of predicting Y using the linear predictor

b0 + b1 Z1 + ... + br Zr = b0 + b'Z   (7-45)

For a given predictor of the form of (7-45), the error in the prediction of Y is

prediction error = Y - b0 - b1 Z1 - ... - br Zr = Y - b0 - b'Z   (7-46)

Because this error is random, it is customary to select b0 and b to minimize the

mean square error = E(Y - b0 - b'Z)^2   (7-47)

Now the mean square error depends on the joint distribution of Y and Z only through the parameters mu and Sigma. It is possible to express the "optimal" linear predictor in terms of these latter quantities.

(6) If Sigma_ZZ is not of full rank, one variable, for example Z1, can be written as a linear combination of the other Zi's and thus is redundant in forming the linear regression function z' beta. That is, Z may be replaced by any subset of components whose nonsingular covariance matrix has the same rank as Sigma_ZZ.
Result 7.12. The linear predictor beta_0 + beta'Z with coefficients

beta = Sigma_ZZ^{-1} sigma_ZY,   beta_0 = mu_Y - beta' mu_Z

has minimum mean square error among all linear predictors of the response Y. Its mean square error is

E(Y - beta_0 - beta'Z)^2 = sigma_YY - sigma_ZY' Sigma_ZZ^{-1} sigma_ZY

Also, beta_0 + beta'Z = mu_Y + sigma_ZY' Sigma_ZZ^{-1}(Z - mu_Z) is the linear predictor having maximum correlation with Y; that is,

Corr(Y, beta_0 + beta'Z) = max over b0, b of Corr(Y, b0 + b'Z) = sqrt( beta' Sigma_ZZ beta / sigma_YY ) = sqrt( sigma_ZY' Sigma_ZZ^{-1} sigma_ZY / sigma_YY )

Proof. Writing the prediction error in terms of deviations from the means, we get

E(Y - b0 - b'Z)^2 = E[ (Y - mu_Y) - b'(Z - mu_Z) + (mu_Y - b0 - b' mu_Z) ]^2
= sigma_YY + b' Sigma_ZZ b + (mu_Y - b0 - b' mu_Z)^2 - 2 b' sigma_ZY

Adding and subtracting sigma_ZY' Sigma_ZZ^{-1} sigma_ZY, we obtain

E(Y - b0 - b'Z)^2 = sigma_YY - sigma_ZY' Sigma_ZZ^{-1} sigma_ZY + (mu_Y - b0 - b' mu_Z)^2 + (b - Sigma_ZZ^{-1} sigma_ZY)' Sigma_ZZ (b - Sigma_ZZ^{-1} sigma_ZY)

The mean square error is minimized by taking b = Sigma_ZZ^{-1} sigma_ZY = beta, making the last term zero, and then choosing b0 = mu_Y - (Sigma_ZZ^{-1} sigma_ZY)' mu_Z = beta_0 to make the third term zero. The minimum mean square error is thus sigma_YY - sigma_ZY' Sigma_ZZ^{-1} sigma_ZY.

Next, we note that Cov(b0 + b'Z, Y) = Cov(b'Z, Y) = b' sigma_ZY so

[Corr(b0 + b'Z, Y)]^2 = (b' sigma_ZY)^2 / ( sigma_YY b' Sigma_ZZ b ),   for all b0, b

Employing the extended Cauchy-Schwarz inequality with B = Sigma_ZZ gives (b' sigma_ZY)^2 <= (b' Sigma_ZZ b)( sigma_ZY' Sigma_ZZ^{-1} sigma_ZY ), or

[Corr(b0 + b'Z, Y)]^2 <= sigma_ZY' Sigma_ZZ^{-1} sigma_ZY / sigma_YY

with equality for b = Sigma_ZZ^{-1} sigma_ZY = beta. The alternative expression for the maximum correlation follows from the equation sigma_ZY' Sigma_ZZ^{-1} sigma_ZY = sigma_ZY' beta = sigma_ZY' Sigma_ZZ^{-1} Sigma_ZZ beta = beta' Sigma_ZZ beta.
The correlation between Y and its best linear predictor is called the population multiple correlation coefficient

rho_Y(Z) = + sqrt( sigma_ZY' Sigma_ZZ^{-1} sigma_ZY / sigma_YY )   (7-48)

The square of the population multiple correlation coefficient, rho_Y(Z)^2, is called the population coefficient of determination. Note that, unlike other correlation coefficients, the multiple correlation coefficient is a positive square root, so 0 <= rho_Y(Z) <= 1.

The population coefficient of determination has an important interpretation. From Result 7.12, the mean square error in using beta_0 + beta'Z to forecast Y is

sigma_YY - sigma_ZY' Sigma_ZZ^{-1} sigma_ZY = sigma_YY - sigma_YY ( sigma_ZY' Sigma_ZZ^{-1} sigma_ZY / sigma_YY ) = sigma_YY (1 - rho_Y(Z)^2)   (7-49)

If rho_Y(Z)^2 = 0, there is no predictive power in Z. At the other extreme, rho_Y(Z)^2 = 1 implies that Y can be predicted with no error.
Example 7.11 (Determining the best linear predictor, its mean square error, and the multiple correlation coefficient) Given the mean vector and covariance matrix of Y, Z1, Z2,

mu = [ mu_Y ] = [ 5 ]      Sigma = [ sigma_YY  sigma_ZY' ] = [ 10  1 -1 ]
     [ mu_Z ]   [ 2 ]              [ sigma_ZY  Sigma_ZZ  ]   [  1  7  3 ]
                [ 0 ]                                        [ -1  3  2 ]

determine (a) the best linear predictor beta_0 + beta_1 Z1 + beta_2 Z2, (b) its mean square error, and (c) the multiple correlation coefficient. Also, verify that the mean square error equals sigma_YY (1 - rho_Y(Z)^2).

First,

beta = Sigma_ZZ^{-1} sigma_ZY = [ 7 3 ]^{-1} [  1 ] = [  .4 -.6 ] [  1 ] = [  1 ]
                                [ 3 2 ]      [ -1 ]   [ -.6 1.4 ] [ -1 ]   [ -2 ]

beta_0 = mu_Y - beta' mu_Z = 5 - [1, -2] [ 2 ] = 3
                                         [ 0 ]

so the best linear predictor is beta_0 + beta'Z = 3 + Z1 - 2 Z2. The mean square error is

sigma_YY - sigma_ZY' Sigma_ZZ^{-1} sigma_ZY = 10 - [1, -1] [  .4 -.6 ] [  1 ] = 10 - 3 = 7
                                                           [ -.6 1.4 ] [ -1 ]

and the multiple correlation coefficient is

rho_Y(Z) = sqrt( sigma_ZY' Sigma_ZZ^{-1} sigma_ZY / sigma_YY ) = sqrt(3/10) = .548

Note that sigma_YY (1 - rho_Y(Z)^2) = 10 (1 - 3/10) = 7 is the required mean square error.
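The computations in Example 7.11 are easy to verify numerically (a sketch using the example's partitioned mean vector and covariance matrix):

```python
import numpy as np

# Partitioned covariance matrix of (Y, Z1, Z2) from Example 7.11
sigma_yy = 10.0
sigma_zy = np.array([1.0, -1.0])
sigma_zz = np.array([[7.0, 3.0], [3.0, 2.0]])
mu_y, mu_z = 5.0, np.array([2.0, 0.0])

beta = np.linalg.solve(sigma_zz, sigma_zy)   # Sigma_ZZ^{-1} sigma_ZY
beta0 = mu_y - beta @ mu_z
mse = sigma_yy - sigma_zy @ beta             # minimum mean square error
rho = np.sqrt(sigma_zy @ beta / sigma_yy)    # multiple correlation

print(np.round(beta, 3), round(beta0, 3))    # [ 1. -2.] 3.0
print(round(mse, 3), round(rho, 3))          # 7.0 0.548
```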
It is possible to show (see Exercise 7.5) that

rho_Y(Z)^2 = 1 - 1/rho^{YY}   (7-50)

where rho^{YY} is the upper-left-hand corner of the inverse of the correlation matrix determined from Sigma.

The restriction to linear predictors is closely connected to the assumption of normality. Specifically, if we take [Y; Z1; Z2; ...; Zr] to be distributed as N_{r+1}(mu, Sigma), then the conditional distribution of Y with z1, z2, ..., zr fixed (see Result 4.6) is normal with mean mu_Y + sigma_ZY' Sigma_ZZ^{-1}(z - mu_Z) and variance sigma_YY - sigma_ZY' Sigma_ZZ^{-1} sigma_ZY. The mean of this conditional distribution is the linear predictor in Result 7.12. That is,

E(Y | z1, z2, ..., zr) = mu_Y + sigma_ZY' Sigma_ZZ^{-1}(z - mu_Z) = beta_0 + beta'z   (7-51)

and we conclude that E(Y | z1, z2, ..., zr) is the best linear predictor of Y when the population is N_{r+1}(mu, Sigma). The conditional expectation of Y in (7-51) is called the regression function. For normal populations, it is linear.

When the population is not normal, the regression function E(Y | z1, z2, ..., zr) need not be of the form beta_0 + beta'z. Nevertheless, it can be shown (see [22]) that E(Y | z1, z2, ..., zr), whatever its form, predicts Y with the smallest mean square error. Fortunately, this wider optimality among all estimators is possessed by the linear predictor when the population is normal.
Result 7.13. Suppose the joint distribution of Y and Z is N_{r+1}(mu, Sigma). Let

mu_hat = [ Y_bar ]   and   S = [ s_YY  s_ZY' ]
         [ Z_bar ]             [ s_ZY  S_ZZ  ]

be the sample mean vector and sample covariance matrix, respectively, for a random sample of size n from this population. Then the maximum likelihood estimators of the coefficients in the linear predictor are

beta_hat = S_ZZ^{-1} s_ZY,   beta_hat_0 = Y_bar - s_ZY' S_ZZ^{-1} Z_bar = Y_bar - beta_hat' Z_bar

Consequently, the maximum likelihood estimator of the linear regression function is

beta_hat_0 + beta_hat' z = Y_bar + s_ZY' S_ZZ^{-1}(z - Z_bar)

and the maximum likelihood estimator of the mean square error E[Y - beta_0 - beta'Z]^2 is

sigma_hat_{YY.Z} = ( (n-1)/n )( s_YY - s_ZY' S_ZZ^{-1} s_ZY )

Proof. We use Result 4.11 and the invariance property of maximum likelihood estimators. [See (4-20).] Since, from Result 7.12,

beta_0 = mu_Y - (Sigma_ZZ^{-1} sigma_ZY)' mu_Z,   beta_0 + beta' z = mu_Y + sigma_ZY' Sigma_ZZ^{-1}(z - mu_Z)

and

mean square error = sigma_{YY.Z} = sigma_YY - sigma_ZY' Sigma_ZZ^{-1} sigma_ZY

the conclusions follow upon substitution of the maximum likelihood estimators

mu_hat = [ Y_bar ]   and   Sigma_hat = [ sigma_hat_YY  sigma_hat_ZY' ] = ( (n-1)/n ) S
         [ Z_bar ]                     [ sigma_hat_ZY  Sigma_hat_ZZ  ]

It is customary to change the divisor from n to n - (r + 1) in the estimator of the mean square error, sigma_{YY.Z} = E(Y - beta_0 - beta'Z)^2, in order to obtain the unbiased estimator

( (n-1)/(n-r-1) )( s_YY - s_ZY' S_ZZ^{-1} s_ZY ) = sum_{j=1}^{n} (Yj - beta_hat_0 - beta_hat' Zj)^2 / (n - r - 1)   (7-52)
Example 7.12 (Maximum likelihood estimate of the regression function, single response) For the computer data of Example 7.6, the n = 7 observations on Y (CPU time), Z1 (orders), and Z2 (add-delete items) give the sample mean vector and sample covariance matrix:

mu_hat = [ y_bar ] = [ 150.44 ]        S = [ s_YY  s_ZY' ] = [ 467.913  418.763  35.983 ]
         [ z_bar ]   [ 130.24 ]            [ s_ZY  S_ZZ  ]   [ 418.763  377.200  28.034 ]
                     [  3.547 ]                              [  35.983   28.034  13.657 ]

Assuming that Y, Z1, and Z2 are jointly normal, obtain the estimated regression function and the estimated mean square error.

Result 7.13 gives the maximum likelihood estimates

beta_hat = S_ZZ^{-1} s_ZY = [  .003128 -.006422 ] [ 418.763 ] = [ 1.079 ]
                            [ -.006422  .086404 ] [  35.983 ]   [  .420 ]

beta_hat_0 = y_bar - beta_hat' z_bar = 150.44 - [1.079, .420] [ 130.24 ] = 150.44 - 142.019 = 8.421
                                                              [  3.547 ]

and the estimated regression function

beta_hat_0 + beta_hat' z = 8.42 + 1.08 z1 + .42 z2

The maximum likelihood estimate of the mean square error arising from the prediction of Y with this regression function is

( (n-1)/n )( s_YY - s_ZY' S_ZZ^{-1} s_ZY )
= (6/7)( 467.913 - [418.763, 35.983] [  .003128 -.006422 ] [ 418.763 ] ) = .894
                                     [ -.006422  .086404 ] [  35.983 ]
Prediction of Several Variables
The extension of the previous results to the prediction of several responses }[, }2, ... , Ym is almost immediate. We present this extension for normal populations. Suppose
with
l l
(mXI)
~y
(rXI)
is distributed as Nm+r(lt, I)
By Result 4.6, the conditional expectation of [ }[, Y2, _.. , Yml', given the fixed values
z1, z2, ... , z, of the predictor variables, is E{YizJ.Z2,,z,] = I'Y + Ivzii\;(z ~tz)
(753)
This conditional expected value, considered as a function of z1 , z2, ... , z,, is called the multivariate regression of the vector Y on Z. It is composed of m univariate regressions. For instance, the first component of the conditional mean vector is J.Ly1 + Iv1zii\;(z  ~tz) = E(Y1 1 ZI> z2 , ... , z,), which minimizes the mean square error for the prediction of Y1 Them X r matrix /3 = Ivzii1z is called the matrix of regression coefficients.
Ivvz = E[Y I'Y Ivzii~(Z ~tz)] [Y I'Y Ivzii~(Z ~tz)]' 1 = Iyy  Ivzii~(Ivz)' Ivzii~Izv + Ivzii zizzii~(Ivz)' = Iyy Ivzii~Izv
(754)
Because I' and I are typically unknown, they must be estimated from a random sample in order to construct the multivariate linear predictor and determine expected prediction errors. Result 7.14. Suppose Y and Z are jointly distributed as Nm+r(lt, I). Then theregression of the vector Y on Z is
Po + fJz =
I'Y Ivzii~l'z
Based on a random sample of size n, the maximum likelihood estimator of the regression function is
Po + Pz = Y + SvzSi~(z Z)
and the maximum likelihood estimator of Ivvz is
Ivvz = (
n:
) (Svv  SvzSi'zSzv)
Proof. The regression function and the covariance matrix for the prediction errors follow from Result 4.6. Using the relationships
fJ = Ivzii~
~tz)
=
Iyy  /Jizz/3'
we deduce the maximum likelihood statements from the invariance property [see (420)] of maximum likelihood estimators upon substitution of
I'
[~]; Zj
1
j; =
Po fJZ)' I
(755)
Example 7.13 (Maximum likelihood estimates of the regression function, two responses) We return to the computer data given in Examples 7.6 and 7.10. For Y1 = CPU time, Y2 = disk input/output capacity, Z1 = orders, and Z2 = add-delete items, we have

mu_hat = [ y_bar ] = [ 150.44 ]
         [ z_bar ]   [ 327.79 ]
                     [ 130.24 ]
                     [  3.547 ]

and

S = [ S_YY  S_YZ ] = [  467.913  1148.556 |  418.763   35.983 ]
    [ S_ZY  S_ZZ ]   [ 1148.556  3072.491 | 1008.976  140.558 ]
                     [  418.763  1008.976 |  377.200   28.034 ]
                     [   35.983   140.558 |   28.034   13.657 ]

Assuming that [Y1, Y2, Z1, Z2]' is normally distributed, the maximum likelihood estimate of the regression function is

beta_hat_0 + beta_hat z = y_bar + S_YZ S_ZZ^{-1}(z - z_bar)

= [ 150.44 ] + [  418.763   35.983 ] [  .003128 -.006422 ] [ z1 - 130.24 ]
  [ 327.79 ]   [ 1008.976  140.558 ] [ -.006422  .086404 ] [ z2 -  3.547 ]

= [ 150.44 + 1.079(z1 - 130.24) + .420(z2 - 3.547) ]
  [ 327.79 + 2.254(z1 - 130.24) + 5.665(z2 - 3.547) ]

The maximum likelihood estimate of the expected squared errors and cross-products matrix Sigma_{YY.Z} is given by

( (n-1)/n )( S_YY - S_YZ S_ZZ^{-1} S_ZY )

= (6/7)( [  467.913  1148.556 ] - [  418.763   35.983 ] [  .003128 -.006422 ] [ 418.763  1008.976 ] )
         [ 1148.556  3072.491 ]   [ 1008.976  140.558 ] [ -.006422  .086404 ] [  35.983   140.558 ]

= (6/7) [ 1.043  1.042 ] = [ .894  .893 ]
        [ 1.042  2.572 ]   [ .893 2.205 ]

The first estimated regression function, 8.42 + 1.08 z1 + .42 z2, and the associated mean square error, .894, are the same as those in Example 7.12 for the single-response case. Similarly, the second estimated regression function, 14.14 + 2.25 z1 + 5.67 z2, is the same as that given in Example 7.10.

We see that the data enable us to predict the first response, Y1, with smaller error than the second response, Y2. The positive covariance .893 indicates that overprediction (underprediction) of CPU time tends to be accompanied by overprediction (underprediction) of disk capacity.
Comment. Result 7.14 states that the assumption of a joint normal distribution for the whole collection Y1, Y2, ..., Ym, Z1, Z2, ..., Zr leads to the prediction equations

y1_hat = beta_hat_01 + beta_hat_11 z1 + ... + beta_hat_r1 zr
y2_hat = beta_hat_02 + beta_hat_12 z1 + ... + beta_hat_r2 zr
  ...
ym_hat = beta_hat_0m + beta_hat_1m z1 + ... + beta_hat_rm zr

We note the following:

1. The same values, z1, z2, ..., zr are used to predict each Yi.
2. The beta_hat_ik are estimates of the (i, k)th entry of the regression coefficient matrix beta = Sigma_YZ Sigma_ZZ^{-1} for i, k >= 1.

We conclude this discussion of the regression problem by introducing one further correlation coefficient.

Partial Correlation Coefficient

Consider the pair of errors

Y1 - mu_Y1 - Sigma_Y1Z Sigma_ZZ^{-1}(Z - mu_Z)
Y2 - mu_Y2 - Sigma_Y2Z Sigma_ZZ^{-1}(Z - mu_Z)

obtained from using the best linear predictors to predict Y1 and Y2. Their correlation, determined from the error covariance matrix Sigma_{YY.Z} = Sigma_YY - Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY, measures the association between Y1 and Y2 after eliminating the effects of Z1, Z2, ..., Zr.

We define the partial correlation coefficient between Y1 and Y2, eliminating Z1, Z2, ..., Zr, by

rho_{Y1Y2.Z} = sigma_{Y1Y2.Z} / ( sqrt(sigma_{Y1Y1.Z}) sqrt(sigma_{Y2Y2.Z}) )   (7-56)

where sigma_{YiYk.Z} is the (i, k)th entry in the matrix Sigma_{YY.Z} = Sigma_YY - Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY. The corresponding sample partial correlation coefficient is

r_{Y1Y2.Z} = s_{Y1Y2.Z} / ( sqrt(s_{Y1Y1.Z}) sqrt(s_{Y2Y2.Z}) )   (7-57)

with s_{YiYk.Z} the (i, k)th element of S_YY - S_YZ S_ZZ^{-1} S_ZY. Assuming that Y and Z have a joint multivariate normal distribution, we find that the sample partial correlation coefficient in (7-57) is the maximum likelihood estimator of the partial correlation coefficient in (7-56).

Example 7.14 (Calculating a partial correlation) From the computer data in Example 7.13,

S_YY - S_YZ S_ZZ^{-1} S_ZY = [ 1.043  1.042 ]
                             [ 1.042  2.572 ]

Therefore,

r_{Y1Y2.Z} = s_{Y1Y2.Z} / ( sqrt(s_{Y1Y1.Z}) sqrt(s_{Y2Y2.Z}) ) = 1.042 / ( sqrt(1.043) sqrt(2.572) ) = .64   (7-58)

Calculating the ordinary correlation coefficient, we obtain r_{Y1Y2} = .96. Comparing the two correlation coefficients, we see that the association between Y1 and Y2 has been sharply reduced after eliminating the effects of the variables Z on both responses.
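The two correlations contrasted in Example 7.14 can be computed directly from the sample covariance blocks (a sketch using the values quoted in Examples 7.13 and 7.14; since S is published only to three decimals, the partial correlation reproduces only roughly):

```python
import numpy as np

# Covariance blocks for (Y1, Y2) and (Z1, Z2) from Example 7.13
S_yy = np.array([[467.913, 1148.556], [1148.556, 3072.491]])
S_yz = np.array([[418.763, 35.983], [1008.976, 140.558]])
S_zz = np.array([[377.200, 28.034], [28.034, 13.657]])

# Ordinary correlation between Y1 and Y2
r_ordinary = S_yy[0, 1] / np.sqrt(S_yy[0, 0] * S_yy[1, 1])

# Residual covariance S_YY.Z = S_YY - S_YZ S_ZZ^{-1} S_ZY, then (7-57)
S_res = S_yy - S_yz @ np.linalg.solve(S_zz, S_yz.T)
r_partial = S_res[0, 1] / np.sqrt(S_res[0, 0] * S_res[1, 1])

print(round(r_ordinary, 2))  # 0.96
print(round(r_partial, 2))   # near the .64 of (7-58), up to rounding in S
```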
Comparing the Two Formulations of the Regression Model

Mean Corrected Form of the Regression Model

The predictor variables can be "centered" by subtracting their means. For instance, beta_1 z_1j = beta_1(z_1j - z1_bar) + beta_1 z1_bar, and we can write

Yj = (beta_0 + beta_1 z1_bar + ... + beta_r zr_bar) + beta_1(z_1j - z1_bar) + ... + beta_r(z_rj - zr_bar) + epsilon_j
   = beta_* + beta_1(z_1j - z1_bar) + ... + beta_r(z_rj - zr_bar) + epsilon_j   (7-59)

with beta_* = beta_0 + beta_1 z1_bar + ... + beta_r zr_bar. The mean corrected design matrix corresponding to the reparameterization in (7-59) is

Zc = [ 1  z_11 - z1_bar  ...  z_1r - zr_bar ]
     [ 1  z_21 - z1_bar  ...  z_2r - zr_bar ]
     [ ...                                  ]
     [ 1  z_n1 - z1_bar  ...  z_nr - zr_bar ]

where the last r columns are each perpendicular to the first column, since

sum_{j=1}^{n} 1 (z_ji - zi_bar) = 0,   i = 1, 2, ..., r

Further, setting Zc = [1 | Zc2] with Zc2' 1 = 0, we obtain

Zc' Zc = [ n    0'       ]
         [ 0   Zc2' Zc2  ]

so

[ beta_hat_* ]
[ beta_hat_1 ] = (Zc' Zc)^{-1} Zc' y = [ n^{-1}    0'            ] [ 1' y   ] = [ y_bar                    ]   (7-60)
[ ...        ]                         [ 0   (Zc2' Zc2)^{-1}     ] [ Zc2' y ]   [ (Zc2' Zc2)^{-1} Zc2' y   ]
[ beta_hat_r ]

That is, the regression coefficients [beta_1, beta_2, ..., beta_r]' are unbiasedly estimated by (Zc2' Zc2)^{-1} Zc2' y and beta_* is estimated by y_bar. Because the definitions of beta_1, beta_2, ..., beta_r remain unchanged by the reparameterization in (7-59), their best estimates computed from the design matrix Zc are exactly the same as the best estimates computed from the design matrix Z. Thus, setting beta_hat_c' = [beta_hat_1, beta_hat_2, ..., beta_hat_r], the linear predictor of Y can be written as

y_hat = beta_hat_* + beta_hat_c'(z - z_bar) = y_bar + beta_hat_c'(z - z_bar)   (7-61)

with z - z_bar = [z1 - z1_bar, z2 - z2_bar, ..., zr - zr_bar]'.
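The equivalence between the centered and uncentered fits asserted by (7-60) and (7-61) is easy to demonstrate numerically (a sketch on a small synthetic data set):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
Z2 = rng.normal(size=(n, r))              # predictor columns
y = 1.0 + Z2 @ np.array([2.0, -1.5]) + rng.normal(scale=0.1, size=n)

# Uncentered fit: design [1, z1, z2]
Z = np.column_stack([np.ones(n), Z2])
b = np.linalg.lstsq(Z, y, rcond=None)[0]

# Centered fit: design [1, z1 - z1_bar, z2 - z2_bar]
Zc = np.column_stack([np.ones(n), Z2 - Z2.mean(axis=0)])
bc = np.linalg.lstsq(Zc, y, rcond=None)[0]

print(np.allclose(b[1:], bc[1:]))   # True: slope estimates agree
print(np.isclose(bc[0], y.mean()))  # True: centered intercept is y_bar
```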
Finally, from (7-60),

[ Var(beta_hat_*)            Cov(beta_hat_*, beta_hat_c)' ] = sigma^2 (Zc' Zc)^{-1} = [ sigma^2/n   0'                           ]   (7-62)
[ Cov(beta_hat_c, beta_hat_*)  Cov(beta_hat_c)            ]                           [ 0   sigma^2 (Zc2' Zc2)^{-1}              ]

Comment. The multivariate multiple regression model yields the same mean corrected design matrix for each response. The least squares estimates of the coefficient vectors for the ith response are given by

beta_hat_(i) = [ yi_bar; (Zc2' Zc2)^{-1} Zc2' y_(i) ],   i = 1, 2, ..., m   (7-63)

Sometimes, for even further numerical stability, "standardized" input variables (z_ji - zi_bar)/sqrt( sum_{j=1}^{n} (z_ji - zi_bar)^2 ) = (z_ji - zi_bar)/sqrt( (n-1) s_{zi zi} ) are used. In this case, the slope coefficients beta_i in the regression model are replaced by beta_tilde_i = beta_i sqrt( (n-1) s_{zi zi} ). The least squares estimates of these beta coefficients become beta_tilde_hat_i = beta_hat_i sqrt( (n-1) s_{zi zi} ), i = 1, 2, ..., r. These relationships hold for each response in the multivariate multiple regression situation as well.

Relating the Formulations

When the variables Y, Z1, Z2, ..., Zr are jointly normal, the estimated predictor of Y (see Result 7.13) is

beta_hat_0 + beta_hat' z = y_bar + s_ZY' S_ZZ^{-1}(z - z_bar) = mu_hat_Y + sigma_hat_ZY' Sigma_hat_ZZ^{-1}(z - mu_hat_Z)   (7-64)

where the estimation procedure leads naturally to the introduction of centered zi's. Recall from the mean corrected form of the regression model that the best linear predictor of Y [see (7-61)] is

y_hat = beta_hat_* + beta_hat_c'(z - z_bar)

with beta_hat_* = y_bar and beta_hat_c = (Zc2' Zc2)^{-1} Zc2' y. Comparing (7-61) and (7-64), we see that beta_hat_* = y_bar = beta_hat_0 + beta_hat' z_bar and beta_hat_c = beta_hat, since(7)

s_ZY' S_ZZ^{-1} = y' Zc2 (Zc2' Zc2)^{-1}   (7-65)

Therefore, both the normal theory conditional mean and the classical regression model approaches yield exactly the same linear predictors. A similar argument indicates that the best linear predictors of the responses in the two multivariate multiple regression setups are also exactly the same.

Example 7.15 (Two approaches yield the same linear predictor) The computer data with the single response Y1 = CPU time were analyzed in Example 7.6 using the classical linear regression model. The same data were analyzed again in Example 7.12, assuming that the variables Y1, Z1, and Z2 were jointly normal so that the best predictor of Y1 is the conditional mean of Y1 given z1 and z2. Both approaches yielded the same predictor, y_hat = 8.42 + 1.08 z1 + .42 z2.

(7) The identity in (7-65) is established by writing y = (y - y_bar 1) + y_bar 1 so that y' Zc2 = (y - y_bar 1)' Zc2 + y_bar 1' Zc2 = (y - y_bar 1)' Zc2 + 0' = (y - y_bar 1)' Zc2. Consequently,

y' Zc2 (Zc2' Zc2)^{-1} = (y - y_bar 1)' Zc2 (Zc2' Zc2)^{-1} = (n-1) s_ZY' [ (n-1) S_ZZ ]^{-1} = s_ZY' S_ZZ^{-1}
Although the two formulations of the linear prediction problem yield the same predictor equations, conceptually they are quite different. For the model in (7-3) or (7-23), the values of the input variables are assumed to be set by the experimenter. In the conditional mean model of (7-51) or (7-53), the values of the predictor variables are random variables that are observed along with the values of the response variable(s). The assumptions underlying the second approach are more stringent, but they yield an optimal predictor among all choices, rather than merely among linear predictors.

We close by noting that the multivariate regression calculations in either case can be couched in terms of the sample mean vectors y_bar and z_bar and the sample sums of squares and cross-products

sum_{j=1}^{n} (yj - y_bar)(yj - y_bar)',   sum_{j=1}^{n} (zj - z_bar)(zj - z_bar)',   sum_{j=1}^{n} (zj - z_bar)(yj - y_bar)'

This is the only information necessary to compute the estimated regression coefficients and their estimated covariances. Of course, an important part of regression analysis is model checking. This requires the residuals (errors), which must be calculated using all the original data.

Multiple Regression Models with Time Dependent Errors

For data collected over time, observations in different periods are often related, or autocorrelated, and the presumption of independent errors may then be unwarranted. Example 7.16 illustrates the problem and one remedy.

The daily demand for natural gas (the sendout) depends importantly on the weather. One standard measure of coldness is degree heating days (DHD) = 65 deg - daily average temperature. A large number for DHD indicates a cold day. Wind speed, again a 24-hour average, can also be a factor in the sendout amount. Because many businesses close for the weekend, the demand for natural gas is typically less on a weekend day. Data on these variables for one winter in a major northern city are shown, in part, in Table 7.4. (See website: www.prenhall.com/statistics for the complete data set. There are n = 63 observations.)
Table 7.4 Natural Gas Data (first five of the n = 63 observations)

  Y          Z1      Z2        Z3           Z4
  Sendout    DHD     DHDLag    Windspeed    Weekend
  227        32      30        12           1
  236        31      32         8           1
  228        30      31         8           0
  252        34      30         8           0
  238        28      34        12           0
   :          :       :         :           :
Initially, we developed a regression model relating gas sendout to degree heating days, wind speed, and a weekend dummy variable. Other variables likely to have some effect on natural gas consumption, like percent cloud cover, are subsumed in the error term. After several attempted fits, we decided to include not only the current DHD but also that of the previous day. (The degree heating day lagged one time period is denoted by DHDLag in Table 7.4.) The fitted model begins

Sendout = 1.858 + 5.874 DHD + 1.405 DHDLag + ...

The residuals from this fit, however, are correlated over time; in particular, the

lag 1 autocorrelation = r1(epsilon_hat) = sum_{j=2}^{n} epsilon_hat_j epsilon_hat_{j-1} / sum_{j=1}^{n} epsilon_hat_j^2 = .52

The value, .52, of the lag 1 autocorrelation is too large to be ignored. A plot of the residual autocorrelations for the first 15 lags shows that there might also be some dependence among the errors 7 time periods, or one week, apart. This amount of dependence invalidates the t-tests and P-values associated with the coefficients in the model.

The first step toward correcting the model is to replace the presumed independent errors in the regression model for sendout with a possibly dependent series of noise terms Nj. That is, we formulate a regression model for the Nj where we relate each Nj to its previous value N_{j-1}, its value one week ago, N_{j-7}, and an independent error epsilon_j. Thus, we consider

Nj = phi_1 N_{j-1} + phi_7 N_{j-7} + epsilon_j

where the epsilon_j are independent normal random variables with mean 0 and variance sigma^2. The form of the equation for Nj is known as an autoregressive model. (See [8].) The SAS commands and part of the output from fitting this combined regression model for sendout, with an autoregressive model for the noise, are shown in Panel 7.3 on page 416. The estimated regression and autoregressive coefficients appear in the panel, and the variance of epsilon is estimated to be sigma_hat^2 = 228.89.

From Panel 7.3, we see that the autocorrelations of the residuals from the enriched model are all negligible. Each is within two estimated standard errors of 0. Also, a weighted sum of squares of residual autocorrelations for a group of consecutive lags is not large as judged by the P-value for this statistic. That is, there is no reason to reject the hypothesis that a group of consecutive autocorrelations are simultaneously equal to 0. The groups examined in Panel 7.3 are those for lags 1-6, 1-12, 1-18, and 1-24. The noise is now adequately modeled. The tests concerning the coefficient of each predictor variable, the significance of the regression, and so forth, are now valid.(8)

The intercept term in the final model can be dropped. When this is done, there is very little change in the resulting model. The enriched model has better forecasting potential and can now be used to forecast sendout of natural gas for given values of the predictor variables. We will not pursue prediction here, since it involves ideas beyond the scope of this book. (See [8].)

(8) These tests are obtained by the extra sum of squares procedure but applied to the regression plus autoregressive noise model. The tests are those described in the computer output.

When modeling relationships using time ordered data, regression models with noise structures that allow for the time dependence are often useful. Modern software packages, like SAS, allow the analyst to easily fit these expanded models.
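The lag 1 autocorrelation statistic r1(epsilon_hat) used in Example 7.16 is simple to compute from any residual series (a minimal sketch; the residuals below are hypothetical):

```python
def lag1_autocorrelation(resid):
    """r1 = sum_{j>=2} e_j e_{j-1} / sum_j e_j^2, as in Example 7.16."""
    num = sum(e1 * e0 for e0, e1 in zip(resid, resid[1:]))
    den = sum(e * e for e in resid)
    return num / den

# Hypothetical residuals; values near .5 or larger signal time dependence
r1 = lag1_autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0])
print(round(r1, 3))  # 0.727
```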
PANEL 7.3  SAS ANALYSIS FOR EXAMPLE 7.16 USING PROC ARIMA

PROGRAM COMMANDS

data a;
  infile 'T7-4.dat';
  time = _n_;
  input obsend dhd dhdlag wind xweekend;
proc arima data = a;
  identify var = obsend crosscor = ( dhd dhdlag wind xweekend );
  estimate p = (1 7) method = ml input = ( dhd dhdlag wind xweekend ) plot;
  estimate p = (1 7) noconstant method = ml input = ( dhd dhdlag wind xweekend ) plot;

OUTPUT

ARIMA Procedure, Maximum Likelihood Estimation. (Approximate standard errors of the parameter estimates: 13.12340, 0.11779, 0.11528, 0.24047, 0.24932, 0.44681, 6.03445, at lags 0, 1, 7, 0, 0, 0, 0, respectively.)

Variance Estimate      228.894
Std Error Estimate     15.1292441
AIC                    528.490321
SBC                    543.492264
Number of Residuals    63

Autocorrelation Check of Residuals

  To Lag    Chi Square    DF
     6         6.04        4
    12        10.27       10
    18        15.92       16
    24        23.44       22

[The panel's table of parameter estimates and its plot of the residual autocorrelations for lags 0-15 are not reproduced here; each residual autocorrelation is within two standard errors of 0.]
Supplement 7A

The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model

We know that nSigma_hat = Y'(I - Z(Z'Z)^{-1}Z')Y, and we now develop its distribution. Let P = I - Z(Z'Z)^{-1}Z' and P1 = I - Z1(Z1'Z1)^{-1}Z1', and choose orthonormal vectors as follows: g_1, ..., g_{q+1} from the columns of Z1; g_{q+2}, ..., g_{r+1} from the columns of Z2 but perpendicular to the columns of Z1; and g_{r+2}, ..., g_n an arbitrary set of orthonormal vectors orthogonal to the columns of Z.

Let (lambda, e) be an eigenvalue-eigenvector pair of Z1(Z1'Z1)^{-1}Z1'. Then, since [Z1(Z1'Z1)^{-1}Z1'][Z1(Z1'Z1)^{-1}Z1'] = Z1(Z1'Z1)^{-1}Z1', it follows that

lambda e = Z1(Z1'Z1)^{-1}Z1' e = ( Z1(Z1'Z1)^{-1}Z1' )^2 e = lambda Z1(Z1'Z1)^{-1}Z1' e = lambda^2 e
and the eigenvalues of Z1(Z1'Z1)^{-1}Z1' are 0 or 1. Moreover,

tr( Z1(Z1'Z1)^{-1}Z1' ) = tr( (Z1'Z1)^{-1}Z1'Z1 ) = tr( I_{(q+1) x (q+1)} ) = q + 1 = lambda_1 + lambda_2 + ... + lambda_{q+1}

where lambda_1 >= lambda_2 >= ... >= lambda_{q+1} > 0 are the eigenvalues of Z1(Z1'Z1)^{-1}Z1'. This shows that Z1(Z1'Z1)^{-1}Z1' has q + 1 eigenvalues equal to 1. Now, ( Z1(Z1'Z1)^{-1}Z1' ) Z1 = Z1, so any linear combination Z1 b of unit length is an eigenvector corresponding to the eigenvalue 1. The orthonormal vectors g_l, l = 1, 2, ..., q + 1, are therefore eigenvectors of Z1(Z1'Z1)^{-1}Z1', since they are formed by taking particular linear combinations of the columns of Z1. By the spectral decomposition (2-16), we have

Z1(Z1'Z1)^{-1}Z1' = sum_{l=1}^{q+1} g_l g_l'

Similarly, by writing ( Z(Z'Z)^{-1}Z' ) Z = Z, we readily see that any linear combination Z b of unit length, for example g_l, is an eigenvector of Z(Z'Z)^{-1}Z' with eigenvalue lambda = 1, so

Z(Z'Z)^{-1}Z' = sum_{l=1}^{r+1} g_l g_l'

Continuing, we have PZ = [I - Z(Z'Z)^{-1}Z']Z = Z - Z = 0, so the g_l = Z b_l, l <= r + 1, are eigenvectors of P with eigenvalues lambda = 0. Also, from the way the g_l, l > r + 1, were constructed, Z'g_l = 0, so that Pg_l = g_l. Consequently, these g_l's are eigenvectors of P corresponding to the n - r - 1 unit eigenvalues. By the spectral decomposition (2-16), P = sum_{l=r+2}^{n} g_l g_l' and

n Sigma_hat = epsilon' P epsilon = sum_{l=r+2}^{n} (epsilon' g_l)(epsilon' g_l)' = sum_{l=r+2}^{n} V_l V_l'

where, because Cov(V_li, V_jk) = E( g_l' epsilon_(i) epsilon_(k)' g_j ) = sigma_ik g_l' g_j = 0 for l not equal to j, the epsilon' g_l = V_l = [V_l1, ..., V_li, ..., V_lm]' are independently distributed as N_m(0, Sigma). Consequently, by (4-22), n Sigma_hat is distributed as W_{p, n-r-1}(Sigma). In the same manner,

P1 g_l = g_l for l > q + 1,   P1 g_l = 0 for l <= q + 1

so P1 = sum_{l=q+2}^{n} g_l g_l'. We can write the extra sum of squares and cross-products as

n( Sigma_hat_1 - Sigma_hat ) = epsilon'(P1 - P)epsilon = sum_{l=q+2}^{r+1} (epsilon' g_l)(epsilon' g_l)' = sum_{l=q+2}^{r+1} V_l V_l'

where the V_l are independently distributed as N_m(0, Sigma). By (4-22), n( Sigma_hat_1 - Sigma_hat ) is distributed as W_{p, r-q}(Sigma) independently of n Sigma_hat, since n( Sigma_hat_1 - Sigma_hat ) involves a different set of independent V_l's.

The large sample distribution for -[ n - r - 1 - (1/2)(m - r + q + 1) ] ln( |Sigma_hat| / |Sigma_hat_1| ) follows from Result 5.2, with nu - nu_0 = m(m+1)/2 + m(r+1) - m(m+1)/2 - m(q+1) = m(r - q) d.f. The use of ( n - r - 1 - (1/2)(m - r + q + 1) ) instead of n in the statistic is due to Bartlett [4], following Box [7], and it improves the chi-square approximation to the large sample distribution.
Exercises

7.1. Given the data

z1 | 10   5   7  19  11  18
y  | 15   9   3  25   7  13

fit the linear regression model Yj = beta_0 + beta_1 z_j1 + epsilon_j, j = 1, 2, ..., 6. Specifically, calculate the least squares estimates beta_hat, the fitted values y_hat, the residuals epsilon_hat, and the residual sum of squares, epsilon_hat' epsilon_hat.
7.2. Given the data

z1 | 10   5   7  19  11  18
z2 |  2   3   3   6   7   9
y  | 15   9   3  25   7  13

fit the regression model Yj = beta_1 z_j1 + beta_2 z_j2 + epsilon_j, j = 1, 2, ..., 6, to the standardized form (see page 412) of the variables y, z1, and z2. From this fit, deduce the corresponding fitted regression equation for the original (not standardized) variables.
7.3. (Weighted least squares estimators.) Let

Y = Z beta + epsilon

where Y is (n x 1), Z is (n x (r+1)), beta is ((r+1) x 1), and E(epsilon) = 0 but E(epsilon epsilon') = sigma^2 V, with V (n x n) known and positive definite. For V of full rank, show that the weighted least squares estimator is

beta_hat_W = (Z' V^{-1} Z)^{-1} Z' V^{-1} Y

If sigma^2 is unknown, it may be estimated, unbiasedly, by

(n - r - 1)^{-1} (Y - Z beta_hat_W)' V^{-1} (Y - Z beta_hat_W)

Hint: V^{-1/2} Y = (V^{-1/2} Z) beta + V^{-1/2} epsilon is of the classical linear regression form Y* = Z* beta + epsilon*, with E(epsilon*) = 0 and E(epsilon* epsilon*') = sigma^2 I. Thus, beta_hat_W = beta_hat_* = (Z*'Z*)^{-1} Z*' Y*.
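The hint in Exercise 7.3 can be checked numerically: running ordinary least squares on the V^{-1/2}-transformed model reproduces the weighted formula (a sketch with arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 12, 2
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
y = Z @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
V = np.diag(rng.uniform(0.5, 2.0, size=n))   # known, positive definite

# Direct weighted least squares formula
Vinv = np.linalg.inv(V)
bw = np.linalg.solve(Z.T @ Vinv @ Z, Z.T @ Vinv @ y)

# Equivalent route: OLS on the V^{-1/2}-transformed model
Vs = np.diag(1.0 / np.sqrt(np.diag(V)))      # V^{-1/2} for a diagonal V
b_star = np.linalg.lstsq(Vs @ Z, Vs @ y, rcond=None)[0]

print(np.allclose(bw, b_star))  # True
```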
7.4. Use the weighted least squares estimator in Exercise 7.3 to derive an expression for the estimate of the slope beta in the model Yj = beta zj + epsilon_j, j = 1, 2, ..., n, when (a) Var(epsilon_j) = sigma^2, (b) Var(epsilon_j) = sigma^2 zj, and (c) Var(epsilon_j) = sigma^2 zj^2. Comment on the manner in which the unequal variances for the errors influence the optimal choice of beta_hat_W.

7.5. Establish (7-50): rho_Y(Z)^2 = 1 - 1/rho^{YY}.
Hint: From (7-49) and Exercise 4.11,

1 - rho_Y(Z)^2 = ( sigma_YY - sigma_ZY' Sigma_ZZ^{-1} sigma_ZY ) / sigma_YY = |Sigma| / ( sigma_YY |Sigma_ZZ| )

From Result 2A.8(c), sigma^{YY} = |Sigma_ZZ| / |Sigma|, where sigma^{YY} is the entry of Sigma^{-1} in the first row and first column. Since (see Exercise 2.23) rho = V^{-1/2} Sigma V^{-1/2} and rho^{-1} = (V^{-1/2} Sigma V^{-1/2})^{-1} = V^{1/2} Sigma^{-1} V^{1/2}, the entry in the (1,1) position of rho^{-1} is rho^{YY} = sigma_YY sigma^{YY}.
7.6. (Generalized inverse of Z'Z) A matrix (Z'Z)^- is called a generalized inverse of Z'Z if Z'Z (Z'Z)^- Z'Z = Z'Z. Let r1 + 1 = rank(Z) and suppose lambda_1 >= lambda_2 >= ... >= lambda_{r1+1} > 0 are the nonzero eigenvalues of Z'Z with corresponding eigenvectors e_1, e_2, ..., e_{r1+1}.

(a) Show that

(Z'Z)^- = sum_{i=1}^{r1+1} lambda_i^{-1} e_i e_i'

is a generalized inverse of Z'Z.
(b) The coefficients beta_hat that minimize the sum of squared errors (y - Z beta)'(y - Z beta) satisfy the normal equations (Z'Z) beta_hat = Z'y. Show that these equations are satisfied for any beta_hat such that Z beta_hat is the projection of y on the columns of Z.
(c) Show that Z beta_hat = Z (Z'Z)^- Z'y is the projection of y on the columns of Z. (See Footnote 2 in this chapter.)
(d) Show directly that beta_hat = (Z'Z)^- Z'y is a solution to the normal equations (Z'Z)[ (Z'Z)^- Z'y ] = Z'y.

Hint: (b) If Z beta_hat is the projection, then y - Z beta_hat is perpendicular to the columns of Z. (d) The eigenvalue-eigenvector requirement implies that (Z'Z)( lambda_i^{-1} e_i ) = e_i for i <= r1 + 1 and 0 = e_i'(Z'Z) e_i for i > r1 + 1. Therefore, (Z'Z)( lambda_i^{-1} e_i ) e_i' Z' = e_i e_i' Z'. Summing over i gives

(Z'Z)(Z'Z)^- Z' = sum_{i=1}^{r1+1} e_i e_i' Z' = ( sum_{i=1}^{r+1} e_i e_i' ) Z' = I Z' = Z'

since e_i' Z' = 0 for i > r1 + 1.
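Part (a) of Exercise 7.6 is easy to verify numerically: build the spectral generalized inverse from the nonzero eigenpairs of Z'Z and check the defining identity (a sketch with a deliberately rank-deficient Z):

```python
import numpy as np

# Rank-deficient design: third column is the sum of the first two
Z = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0]])
A = Z.T @ Z

# Spectral generalized inverse: sum of lambda_i^{-1} e_i e_i' over nonzero eigenvalues
vals, vecs = np.linalg.eigh(A)
G = sum(np.outer(vecs[:, i], vecs[:, i]) / vals[i]
        for i in range(len(vals)) if vals[i] > 1e-10)

print(np.allclose(A @ G @ A, A))   # True: defining property of (Z'Z)^-
y = np.array([1.0, 2.0, 2.0, 4.0])
# The fitted values Z (Z'Z)^- Z' y equal the projection of y onto the columns of Z:
print(np.allclose(Z @ (G @ Z.T @ y), Z @ np.linalg.pinv(Z) @ y))  # True
```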
7.7. Suppose the classical regression model is, with rank(Z) = r + 1, written as

y = Z1 beta_(1) + Z2 beta_(2) + epsilon

where y is (n x 1), Z1 is (n x (q+1)), beta_(1) is ((q+1) x 1), Z2 is (n x (r-q)), beta_(2) is ((r-q) x 1), and epsilon is (n x 1), with rank(Z1) = q + 1 and rank(Z2) = r - q. If the parameters beta_(2) are identified beforehand as being of primary interest, show that a 100(1 - alpha)% confidence region for beta_(2) is given by

( beta_hat_(2) - beta_(2) )' [ Z2'Z2 - Z2'Z1 (Z1'Z1)^{-1} Z1'Z2 ] ( beta_hat_(2) - beta_(2) ) <= s^2 (r - q) F_{r-q, n-r-1}(alpha)

Hint: By Exercise 4.12, with 1's and 2's interchanged,

C^{22} = [ Z2'Z2 - Z2'Z1 (Z1'Z1)^{-1} Z1'Z2 ]^{-1},   where (Z'Z)^{-1} = [ C^{11}  C^{12} ]
                                                                        [ C^{21}  C^{22} ]

Multiply by the square-root matrix (C^{22})^{-1/2}, and conclude that (C^{22})^{-1/2}( beta_hat_(2) - beta_(2) )/sigma is N(0, I), so that ( beta_hat_(2) - beta_(2) )'(C^{22})^{-1}( beta_hat_(2) - beta_(2) ) is sigma^2 chi^2_{r-q}.
7.8. Recall that the hat matrix is defined by H = Z (Z'Z)^{-1} Z' with diagonal elements h_jj.
(a) Show that H is an idempotent matrix. [See Result 7.1 and (7-6).]
(b) Show that 0 < h_jj < 1, j = 1, 2, ..., n, and that sum_{j=1}^{n} h_jj = r + 1, where r is the number of independent variables in the regression model. (In fact, 1/n <= h_jj < 1.)
(c) Verify, for the simple linear regression model with one independent variable z, that the leverage, h_jj, is given by

h_jj = 1/n + (zj - z_bar)^2 / sum_{j=1}^{n} (zj - z_bar)^2
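The facts in Exercise 7.8 can be checked numerically; the sketch below computes the leverages for a simple linear regression and compares them with the closed form in part (c):

```python
import numpy as np

z = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
n = len(z)
Z = np.column_stack([np.ones(n), z])      # simple linear regression design
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)     # hat matrix

h_diag = np.diag(H)
h_formula = 1.0 / n + (z - z.mean())**2 / np.sum((z - z.mean())**2)

print(np.allclose(H @ H, H))              # True: idempotent
print(np.isclose(h_diag.sum(), 2.0))      # True: trace = r + 1 with r = 1
print(np.allclose(h_diag, h_formula))     # True: part (c)
```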
7.9. Consider the following data on one predictor variable z1 and two responses Y1 and Y2:

z1 | -2  -1   0   1   2
y1 |  5   3   4   2   1
y2 | -3  -1  -1   2   3

Determine the least squares estimates of the parameters in the bivariate straight-line regression model

Y_j1 = beta_01 + beta_11 z_j1 + epsilon_j1
Y_j2 = beta_02 + beta_12 z_j1 + epsilon_j2,   j = 1, 2, 3, 4, 5

Also, calculate the matrices of fitted values Y_hat and residuals epsilon_hat with Y = [y1 | y2]. Verify the sum of squares and cross-products decomposition

Y'Y = Y_hat' Y_hat + epsilon_hat' epsilon_hat

7.10. Using the results from Exercise 7.9, calculate each of the following.
(a) A 95% confidence interval for the mean response E(Y01) = beta_01 + beta_11 z01 corresponding to z01 = 0.5
(b) A 95% prediction interval for the response Y01 corresponding to z01 = 0.5
(c) A 95% prediction region for the responses Y01 and Y02 corresponding to z01 = 0.5

7.11. (Generalized least squares for multivariate multiple regression.) Let A be a positive definite matrix, so that d_j^2(B) = (yj - B'zj)' A (yj - B'zj) is a squared statistical distance from the jth observation yj to its regression B'zj. Show that the choice B = beta_hat = (Z'Z)^{-1}Z'Y minimizes sum_{j=1}^{n} d_j^2(B) for any choice of positive definite A. Choices for A include Sigma^{-1} and I.
Hint: Repeat the steps in the proof of Result 7.10 with Sigma^{-1} replaced by A.

7.12. Given the mean vector and covariance matrix of Y, Z1, Z2, determine each of the following.
(a) The best linear predictor beta_0 + beta_1 Z1 + beta_2 Z2 of Y
(b) The mean square error of the best linear predictor
(c) The population multiple correlation coefficient
(d) The partial correlation coefficient rho_{YZ1.Z2}

7.13. The test scores for college students described in Example 5.5 have the mean vector z_bar and covariance matrix S given there for

z = [ Z1 ]
    [ Z2 ]
    [ Z3 ]

Assume joint normality.
(a) Obtain the maximum likelihood estimates of the parameters for predicting Z1 from Z2 and Z3.
(b) Evaluate the estimated multiple correlation coefficient R_{Z1(Z2,Z3)}.
(c) Determine the estimated partial correlation coefficient R_{Z1,Z2.Z3}.

7.14. Twenty-five portfolio managers were evaluated in terms of their performance. Suppose Y represents the rate of return achieved over a period of time, Z1 is the manager's attitude toward risk measured on a five-point scale from "very conservative" to "very risky," and Z2 is years of experience in the investment business. The observed correlation coefficients between pairs of variables are

       Y     Z1    Z2
Y     1.0   .35   .82
Z1    .35   1.0   .60
Z2    .82   .60   1.0

(a) Interpret the sample correlation coefficients r_{YZ1} = .35 and r_{YZ2} = .82.
(b) Calculate the partial correlation coefficient r_{YZ1.Z2} and interpret this quantity with respect to the interpretation provided for r_{YZ1} in Part a.

The following exercises may require the use of a computer.

7.15. Use the real-estate data in Table 7.1 and the linear regression model in Example 7.4.
(a) Verify the results in Example 7.4.
(b) Analyze the residuals to check the adequacy of the model. (See Section 7.6.)
(c) Generate a 95% prediction interval for the selling price (Y0) corresponding to total dwelling size z1 = 17 and assessed value z2 = 46.
(d) Carry out a likelihood ratio test of H0: beta_2 = 0 with a significance level of alpha = .05. Should the original model be modified? Discuss.

7.16. Calculate a CP plot corresponding to the possible linear regressions involving the real-estate data in Table 7.1.

7.17. Consider the Forbes data in Exercise 1.4.
(a) Fit a linear regression model to these data using profits as the dependent variable and sales and assets as the independent variables.
(b) Analyze the residuals to check the adequacy of the model. Compute the leverages associated with the data points. Does one (or more) of these companies stand out as an outlier in the set of independent variable data points?
(c) Generate a 95% prediction interval for profits corresponding to sales of 100 (billions of dollars) and assets of 500 (billions of dollars).
(d) Carry out a likelihood ratio test of H0: beta_2 = 0 with a significance level of alpha = .05. Should the original model be modified? Discuss.
Table 7.5 Battery-Failure Data

 z1           z2              z3                  z4            Y
 Charge       Discharge       Depth of discharge  Temperature   Cycles to
 rate (amps)  rate (amps)     (% of rated                (°C)   failure
                              ampere-hours)
  .375        3.13            60.0                40             101
 1.000        3.13            76.8                30             141
 1.000        3.13            60.0                20              96
 1.000        3.13            60.0                20             125
 1.625        3.13            43.2                10              43
 1.625        3.13            60.0                20              16
 1.625        3.13            60.0                20             188
  .375        5.00            76.8                10              10
 1.000        5.00            43.2                10               3
 1.000        5.00            43.2                30             386
 1.000        5.00           100.0                20              45
 1.625        5.00            76.8                10               2
  .375        1.25            76.8                10              76
 1.000        1.25            43.2                10              78
 1.000        1.25            76.8                30             160
 1.000        1.25            60.0                 0               3
 1.625        1.25            43.2                30             216
 1.625        1.25            60.0                20              73
  .375        3.13            76.8                30             314
  .375        3.13            60.0                20             170
Source: Selected from S. Sidik, H. Leibecki, and J. Bozek, Failure of Silver-Zinc Cells with Competing Failure Modes: Preliminary Data Analysis, NASA Technical Memorandum 81556 (Cleveland: Lewis Research Center, 1980).
7.20. Using the battery-failure data in Table 7.5, regress ln(Y) on the first principal component of the predictor variables z1, z2, ..., z5. (See Section 8.3.) Compare the result with the fitted model obtained in Exercise 7.19(a).
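Exercise 7.20 combines two steps: extract the scores of the first sample principal component of the standardized predictors, then regress ln(Y) on those scores. A minimal sketch, with hypothetical data in place of Table 7.5 (the shapes and variable names are assumptions, not the book's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictor matrix standing in for z1,...,z5 of Table 7.5
n = 20
Z = rng.normal(size=(n, 5))
y = np.exp(0.5 + Z @ np.array([0.4, 0.3, 0.2, 0.1, 0.0]) + rng.normal(0, 0.1, n))

# 1. Standardize the predictors and extract the first principal component.
Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
R = np.corrcoef(Zs, rowvar=False)        # sample correlation matrix
eigval, eigvec = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
e1 = eigvec[:, -1]                       # eigenvector of the largest eigenvalue
pc1 = Zs @ e1                            # scores on the first component

# 2. Regress ln(Y) on the first principal component.
X = np.column_stack([np.ones(n), pc1])
beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print("intercept, slope:", beta)
```

The sign of `e1` is arbitrary (eigenvectors are determined only up to sign), which flips the sign of the slope but not the fit.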
7.21. Consider the air-pollution data in Table 1.5. Let Y1 = NO2 and Y2 = O3 be the two responses (pollutants) corresponding to the predictor variables Z1 = wind and Z2 = solar radiation.
(a) Perform a regression analysis using only the first response Y1.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction interval for NO2 corresponding to z1 = 10 and z2 = 80.
(b) Perform a multivariate multiple regression analysis using both responses Y1 and Y2.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction ellipse for both NO2 and O3 for z1 = 10 and z2 = 80. Compare this ellipse with the prediction interval in Part a (iii). Comment.
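The multivariate multiple regression asked for in part (b) reduces to ordinary least squares applied to each response column with a common design matrix. A sketch with hypothetical stand-in data (the real Table 1.5 values should be used in the exercise; the coefficients below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for the air-pollution setup of Exercise 7.21:
# two responses (NO2, O3) on two predictors (wind, solar radiation).
n = 42
wind = rng.uniform(5, 12, n)
solar = rng.uniform(50, 100, n)
Z = np.column_stack([np.ones(n), wind, solar])          # design matrix, r + 1 = 3 columns
E = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], n)
Y = np.column_stack([10 - 0.4 * wind + 0.02 * solar,
                     8 - 0.2 * wind + 0.03 * solar]) + E

# Least squares for all responses simultaneously: B_hat = (Z'Z)^{-1} Z'Y
B_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)               # (3 x 2) coefficient matrix
resid = Y - Z @ B_hat
Sigma_hat = resid.T @ resid / (n - Z.shape[1])          # estimated error covariance

print(B_hat)
print(Sigma_hat)
```

Each column of `B_hat` is exactly what a separate univariate regression of that response on `Z` would give; the multivariate machinery matters only for joint inference such as the prediction ellipse.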
7.22. Using the data on bone mineral content in Table 1.8:
(a) Perform a regression analysis by fitting the response for the dominant radius bone to the measurements on the last four bones.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(b) Perform a multivariate multiple regression analysis by fitting the responses from both radius bones.
(c) Calculate the AIC for the model you chose in (b) and for the full model.
7.23. Using the data on the characteristics of bulls sold at auction in Table 1.10:
(a) Perform a regression analysis using the response Y1 = SalePr and the predictor variables Breed, YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt.
(i) Determine the "best" regression equation by retaining only those predictor variables that are individually significant.
(ii) Using the best fitting model, construct a 95% prediction interval for selling price for the set of predictor-variable values (in the order listed above) 5, 48.7, 990, 74.0, 7, .18, 54.2, and 1450.
(iii) Examine the residuals from the best fitting model.
(b) Repeat the analysis in Part a, using the natural logarithm of the sales price as the response. That is, set Y1 = Ln(SalePr). Which analysis do you prefer? Why?
7.24. Using the data on the characteristics of bulls sold at auction in Table 1.10:
(a) Perform a regression analysis, using only the response Y1 = SaleHt and the predictor variables Z1 = YrHgt and Z2 = FtFrBody.
(i) Fit an appropriate model and analyze the residuals.
(ii) Construct a 95% prediction interval for SaleHt corresponding to z1 = 50.5 and z2 = 970.
(b) Perform a multivariate regression analysis with the responses Y1 = SaleHt and Y2 = SaleWt and the predictors Z1 = YrHgt and Z2 = FtFrBody.
(i) Fit an appropriate multivariate model and analyze the residuals.
(ii) Construct a 95% prediction ellipse for both SaleHt and SaleWt for z1 = 50.5 and z2 = 970. Compare this ellipse with the prediction interval in Part a (ii). Comment.
7.25. Amitriptyline is prescribed by some physicians as an antidepressant. However, there are also conjectured side effects that seem to be related to the use of the drug: irregular heartbeat, abnormal blood pressures, and irregular waves on the electrocardiogram, among other things. Data gathered on 17 patients who were admitted to the hospital after an amitriptyline overdose are given in Table 7.6. The two response variables are

    Y1 = Total TCAD plasma level (TOT)
    Y2 = Amount of amitriptyline present in TCAD plasma level (AMI)

The five predictor variables are

    Z1 = Gender (GEN)
    Z2 = Amount of drug taken at time of overdose (AMT)
    Z3 = PR wave measurement (PR)
    Z4 = Diastolic blood pressure (DIAP)
    Z5 = QRS wave measurement (QRS)

Table 7.6 Amitriptyline Data
[Measurements of TOT, AMI, GEN, AMT, PR, DIAP, and QRS for the 17 patients; several columns are not recoverable from this scan.]
(a) Perform a regression analysis using only the first response Y1.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction interval for Total TCAD for z1 = 1, z2 = 1200, z3 = 140, z4 = 70, and z5 = 85.
(b) Repeat Part a using the second response Y2.
(c) Perform a multivariate multiple regression analysis using both responses Y1 and Y2.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals.
(iii) Construct a 95% prediction ellipse for both Total TCAD and Amount of amitriptyline for z1 = 1, z2 = 1200, z3 = 140, z4 = 70, and z5 = 85. Compare this ellipse with the prediction intervals in Parts a and b. Comment.
7.26. Measurements of properties of pulp fibers and the paper made from them are contained in Table 7.7 (see also [19] and website: www.prenhall.com/statistics). There are n = 62 observations of the pulp-fiber characteristics, z1 = arithmetic fiber length, z2 = long fiber fraction, z3 = fine fiber fraction, z4 = zero span tensile, and the paper properties, y1 = breaking length, y2 = elastic modulus, y3 = stress at failure, y4 = burst strength.

Table 7.7 Pulp and Paper Properties Data

 y1 (BL)   y2 (EM)   y3 (SF)   y4 (BS)   z1 (AFL)   z2 (LFF)   z3 (FFF)   z4 (ZST)
[First rows of the 62 observations; several columns are not recoverable from this scan.]
Source: See Lee [19].
(a) Perform a regression analysis using each of the response variables Y1, Y2, Y3, and Y4.
(i) Suggest and fit appropriate linear regression models.
(ii) Analyze the residuals. Check for outliers or observations with high leverage.
(iii) Construct a 95% prediction interval for SF (Y3) for z1 = .330, z2 = 45.500, z3 = 20.375, z4 = 1.010.
(b) Perform a multivariate multiple regression analysis using all four response variables, Y1, Y2, Y3, and Y4, and the four independent variables, Z1, Z2, Z3, and Z4.
(i) Suggest and fit an appropriate linear regression model. Specify the matrix of estimated coefficients and the estimated error covariance matrix.
(ii) Analyze the residuals. Check for outliers.
(iii) Construct simultaneous 95% prediction intervals for the individual responses Y0i, i = 1, 2, 3, 4, for the same settings of the independent variables given in part a (iii) above. Compare the simultaneous prediction interval for Y03 with the prediction interval in part a (iii). Comment.
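Several of these exercises ask for a 95% prediction interval for a single new response. A sketch of the generic recipe with hypothetical data; the critical value t_.025,16 = 2.120 is the tabled value for 16 error degrees of freedom, and both the data and the new point z0 are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical single-response fit (n = 19, two predictors plus intercept),
# illustrating the prediction-interval recipe:
#   yhat0 +/- t_{alpha/2, n-r-1} * sqrt(s^2 * (1 + z0' (Z'Z)^{-1} z0))
n = 19
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
y = Z @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 1, n)

beta = np.linalg.solve(Z.T @ Z, Z.T @ y)
resid = y - Z @ beta
s2 = resid @ resid / (n - 3)                 # n - r - 1 = 16 error df

z0 = np.array([1.0, 5.0, 2.5])               # new predictor values (with intercept)
yhat0 = z0 @ beta
half = 2.120 * np.sqrt(s2 * (1 + z0 @ np.linalg.solve(Z.T @ Z, z0)))
print(f"95% prediction interval: ({yhat0 - half:.2f}, {yhat0 + half:.2f})")
```

The `1 +` inside the square root is what distinguishes a prediction interval for a new observation from a confidence interval for the mean response; the degrees of freedom and critical value change with n and the number of predictors.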
7.27. Refer to the data on fixing breakdowns in cell phone relay towers in Table 6.20. In the initial design, experience level was coded as Novice or Guru. Now consider three levels of experience: Novice, Guru, and Experienced. Some additional runs for an experienced engineer are given below. Also, in the original data set, reclassify Guru in run 3 as
Experienced and Novice in run 14 as Experienced. Keep all the other numbers for these two engineers the same. With these changes and the new data below, perform a multivariate multiple regression analysis with assessment and implementation times as the responses, and problem severity, problem complexity, and experience level as the predictor variables. Consider regression models with the predictor variables and two-factor interaction terms as inputs. (Note: The two changes in the original data set and the additional data below unbalance the design, so the analysis is best handled with regression methods.)

 Problem     Problem       Engineer      Problem
 severity    complexity    experience    assessment
 level       level         level         time
 Low         Complex       Experienced   5.3
 Low         Complex       Experienced   5.0
 High        Simple        Experienced   4.0
 High        Simple        Experienced   4.5
 High        Complex       Experienced   6.9
[The implementation-time and total-time columns are not recoverable from this scan.]
References
1. Abraham, B., and J. Ledolter. Introduction to Regression Modeling. Belmont, CA: Thomson Brooks/Cole, 2006.
2. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.
3. Atkinson, A. C. Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford, England: Oxford University Press, 1986.
4. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
5. Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (paperback). New York: Wiley-Interscience, 2004.
6. Bowerman, B. L., and R. T. O'Connell. Linear Statistical Models: An Applied Approach (2nd ed.). Belmont, CA: Thomson Brooks/Cole, 2000.
7. Box, G. E. P. "A General Distribution Theory for a Class of Likelihood Criteria." Biometrika, 36 (1949), 317-346.
8. Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1994.
9. Chatterjee, S., A. S. Hadi, and B. Price. Regression Analysis by Example (4th ed.). New York: Wiley-Interscience, 2006.
10. Cook, R. D., and S. Weisberg. Applied Regression Including Computing and Graphics. New York: John Wiley, 1999.
11. Cook, R. D., and S. Weisberg. Residuals and Influence in Regression. London: Chapman and Hall, 1982.
12. Daniel, C., and F. S. Wood. Fitting Equations to Data (2nd ed.) (paperback). New York: Wiley-Interscience, 1999.
13. Draper, N. R., and H. Smith. Applied Regression Analysis (3rd ed.). New York: John Wiley, 1998.
14. Durbin, J., and G. S. Watson. "Testing for Serial Correlation in Least Squares Regression, II." Biometrika, 38 (1951), 159-178.
15. Galton, F. "Regression Toward Mediocrity in Hereditary Stature." Journal of the Anthropological Institute, 15 (1885), 246-263.
16. Goldberger, A. S. Econometric Theory. New York: John Wiley, 1964.
17. Heck, D. L. "Charts of Some Upper Percentage Points of the Distribution of the Largest Characteristic Root." Annals of Mathematical Statistics, 31 (1960), 625-642.
18. Khattree, R., and D. N. Naik. Applied Multivariate Statistics with SAS Software (2nd ed.). Cary, NC: SAS Institute Inc., 1999.
19. Lee, J. "Relationships Between Properties of Pulp-Fibre and Paper." Unpublished doctoral thesis, University of Toronto, Faculty of Forestry, 1992.
20. Neter, J., W. Wasserman, M. Kutner, and C. Nachtsheim. Applied Linear Regression Models (3rd ed.). Chicago: Richard D. Irwin, 1996.
21. Pillai, K. C. S. "Upper Percentage Points of the Largest Root of a Matrix in Multivariate Analysis." Biometrika, 54 (1967), 189-193.
22. Rao, C. R. Linear Statistical Inference and Its Applications (2nd ed.) (paperback). New York: Wiley-Interscience, 2002.
23. Seber, G. A. F. Linear Regression Analysis. New York: John Wiley, 1977.
24. Rudorfer, M. V. "Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose." Journal of Toxicology-Clinical Toxicology, 19 (1982), 67-71.
Chapter 8
PRINCIPAL COMPONENTS
8.1 Introduction
A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are (1) data reduction and (2) interpretation.
Although p components are required to reproduce the total system variability, often much of this variability can be accounted for by a small number k of the principal components. If so, there is (almost) as much information in the k components as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measurements on p variables, is reduced to a data set consisting of n measurements on k principal components.
An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result. A good example of this is provided by the stock market data discussed in Example 8.5.
Analyses of principal components are more of a means to an end than an end in themselves, because they frequently serve as intermediate steps in much larger investigations. For example, principal components may be inputs to a multiple regression (see Chapter 7) or cluster analysis (see Chapter 12). Moreover, (scaled) principal components are one "factoring" of the covariance matrix for the factor analysis model considered in Chapter 9.
with X1, X2, ..., Xp as the coordinate axes. The new axes represent the directions with maximum variability and provide a simpler and more parsimonious description of the covariance structure.
As we shall see, principal components depend solely on the covariance matrix Σ (or the correlation matrix ρ) of X1, X2, ..., Xp. Their development does not require a multivariate normal assumption. On the other hand, principal components derived for multivariate normal populations have useful interpretations in terms of the constant-density ellipsoids. Further, inferences can be made from the sample components when the population is multivariate normal. (See Section 8.5.)
Let the random vector X' = [X1, X2, ..., Xp] have the covariance matrix Σ with eigenvalues λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0. Consider the linear combinations

    Y1 = a1'X = a11X1 + a12X2 + ... + a1pXp
    Y2 = a2'X = a21X1 + a22X2 + ... + a2pXp
    ...
    Yp = ap'X = ap1X1 + ap2X2 + ... + appXp        (8-1)
Then, using (2-45), we obtain

    Var(Yi) = ai'Σai,          i = 1, 2, ..., p        (8-2)
    Cov(Yi, Yk) = ai'Σak,      i, k = 1, 2, ..., p     (8-3)
The principal components are those uncorrelated linear combinations Y1, Y2, ..., Yp whose variances in (8-2) are as large as possible.
The first principal component is the linear combination with maximum variance. That is, it maximizes Var(Y1) = a1'Σa1. It is clear that Var(Y1) = a1'Σa1 can be increased by multiplying any a1 by some constant. To eliminate this indeterminacy, it is convenient to restrict attention to coefficient vectors of unit length. We therefore define

    First principal component  = linear combination a1'X that maximizes
                                 Var(a1'X) subject to a1'a1 = 1
    Second principal component = linear combination a2'X that maximizes
                                 Var(a2'X) subject to a2'a2 = 1 and
                                 Cov(a1'X, a2'X) = 0

At the ith step,

    ith principal component    = linear combination ai'X that maximizes
                                 Var(ai'X) subject to ai'ai = 1 and
                                 Cov(ai'X, ak'X) = 0 for k < i
Result 8.1. Let Σ be the covariance matrix associated with the random vector X' = [X1, X2, ..., Xp]. Let Σ have the eigenvalue-eigenvector pairs (λ1, e1), (λ2, e2), ..., (λp, ep) where λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0. Then the ith principal component is given by

    Yi = ei'X = ei1X1 + ei2X2 + ... + eipXp,    i = 1, 2, ..., p        (8-4)

With these choices,

    Var(Yi) = ei'Σei = λi,       i = 1, 2, ..., p
    Cov(Yi, Yk) = ei'Σek = 0,    i ≠ k                                  (8-5)

If some λi are equal, the choices of the corresponding coefficient vectors, ei, and hence Yi, are not unique.

Proof. We know from (2-51), with B = Σ, that

    max over a ≠ 0 of  a'Σa / a'a = λ1    (attained when a = e1)

But e1'e1 = 1 since the eigenvectors are normalized. Thus,

    max over a ≠ 0 of  a'Σa / a'a = λ1 = e1'Σe1 / e1'e1 = e1'Σe1 = Var(Y1)

Similarly, using (2-52), we get

    max over a ⊥ e1, ..., ek of  a'Σa / a'a = λ(k+1),    k = 1, 2, ..., p - 1

For the choice a = e(k+1), with e(k+1)'ei = 0 for i = 1, 2, ..., k,

    e(k+1)'Σe(k+1) / e(k+1)'e(k+1) = e(k+1)'Σe(k+1) = Var(Y(k+1))

But e(k+1)'(Σe(k+1)) = λ(k+1) e(k+1)'e(k+1) = λ(k+1), so Var(Y(k+1)) = λ(k+1). It remains to show that ei perpendicular to ek (that is, ei'ek = 0, i ≠ k) gives Cov(Yi, Yk) = 0. Now, the eigenvectors of Σ are orthogonal if all the eigenvalues λ1, λ2, ..., λp are distinct. If the eigenvalues are not all distinct, the eigenvectors corresponding to common eigenvalues may be chosen to be orthogonal. Therefore, for any two eigenvectors ei and ek, ei'ek = 0, i ≠ k. Since Σek = λkek, premultiplication by ei' gives

    Cov(Yi, Yk) = ei'Σek = λk ei'ek = 0

for any i ≠ k, and the proof is complete.
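Result 8.1 is easy to verify numerically: for any covariance matrix, the covariance matrix of the principal components E'X is E'ΣE, which should be diagonal with the eigenvalues on the diagonal. A sketch with an arbitrary illustrative Σ (not one of the book's examples):

```python
import numpy as np

# Numerical check of Result 8.1: the components Y_i = e_i'X are
# uncorrelated with Var(Y_i) = lambda_i.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

eigval, eigvec = np.linalg.eigh(Sigma)             # ascending eigenvalues
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]     # reorder so lambda_1 >= ... >= lambda_p

# Covariance matrix of Y = E'X is E' Sigma E = diag(lambda_1, ..., lambda_p).
cov_Y = eigvec.T @ Sigma @ eigvec
print(np.round(cov_Y, 10))   # diagonal: the eigenvalues; off-diagonal: zeros
```

The same computation, applied to a sample covariance matrix S instead of Σ, underlies the sample principal components of Section 8.3.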
From Result 8.1, the principal components are uncorrelated and have variances equal to the eigenvalues of Σ.
Result 8.2. Let X' = [X1, X2, ..., Xp] have covariance matrix Σ, with eigenvalue-eigenvector pairs (λ1, e1), (λ2, e2), ..., (λp, ep) where λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0. Let Y1 = e1'X, Y2 = e2'X, ..., Yp = ep'X be the principal components. Then

    σ11 + σ22 + ... + σpp = Σ(i=1 to p) Var(Xi) = λ1 + λ2 + ... + λp = Σ(i=1 to p) Var(Yi)

Proof. The sum σ11 + σ22 + ... + σpp is tr(Σ). From the spectral decomposition, Σ = PΛP', where Λ is the diagonal matrix of eigenvalues and P = [e1, e2, ..., ep], so that PP' = P'P = I. Since the trace is invariant under cyclic permutation, tr(Σ) = tr(PΛP') = tr(ΛP'P) = tr(Λ) = λ1 + λ2 + ... + λp. Thus,

    Σ(i=1 to p) Var(Xi) = tr(Σ) = tr(Λ) = Σ(i=1 to p) Var(Yi)        (8-6)

and consequently, the proportion of total variance due to (explained by) the kth principal component is

    (Proportion of total population variance
     due to kth principal component) = λk / (λ1 + λ2 + ... + λp),    k = 1, 2, ..., p    (8-7)
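Equations (8-6) and (8-7) translate directly into a few lines of code. A sketch, again with an arbitrary illustrative covariance matrix rather than one from the text:

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])
lam = np.linalg.eigvalsh(Sigma)[::-1]   # lambda_1 >= lambda_2 >= lambda_3

# (8-7): proportion of total variance explained by each component,
# plus the running total used to decide how many components to keep.
prop = lam / lam.sum()
print(np.round(prop, 3))
print(np.round(np.cumsum(prop), 3))

# (8-6): tr(Sigma) equals the sum of the eigenvalues.
print(np.isclose(np.trace(Sigma), lam.sum()))   # True
```

In practice the same proportions are computed from the eigenvalues of the sample covariance (or correlation) matrix, and the cumulative total is read off a scree plot.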
If most (for instance, 80 to 90%) of the total population variance, for large p, can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much loss of information.
Each component of the coefficient vector ei' = [ei1, ..., eik, ..., eip] also merits inspection. The magnitude of eik measures the importance of the kth variable to the ith principal component, irrespective of the other variables. In particular, eik is proportional to the correlation coefficient between Yi and Xk.
Result 8.3. If Y1 = e1'X, Y2 = e2'X, ..., Yp = ep'X are the principal components obtained from the covariance matrix Σ, then

    ρ(Yi, Xk) = eik √λi / √σkk,    i, k = 1, 2, ..., p        (8-8)

are the correlation coefficients between the components Yi and the variables Xk. Here (λ1, e1), (λ2, e2), ..., (λp, ep) are the eigenvalue-eigenvector pairs for Σ.

Proof. Set ak' = [0, ..., 0, 1, 0, ..., 0] so that Xk = ak'X and Cov(Xk, Yi) = Cov(ak'X, ei'X) = ak'Σei, according to (2-45). Since Σei = λiei, Cov(Xk, Yi) = ak'λiei = λieik. Then Var(Yi) = λi [see (8-5)] and Var(Xk) = σkk yield

    ρ(Yi, Xk) = Cov(Yi, Xk) / (√Var(Yi) √Var(Xk)) = λieik / (√λi √σkk)
              = eik √λi / √σkk,    i, k = 1, 2, ..., p
Although the correlations of the variables with the principal components often help to interpret the components, they measure only the univariate contribution of an individual X to a component Y. That is, they do not indicate the importance of an X to a component Y in the presence of the other X's. For this reason, some
statisticians (see, for example, Rencher [16]) recommend that only the coefficients eik, and not the correlations, be used to interpret the components. Although the coefficients and the correlations can lead to different rankings as measures of the importance of the variables to a given component, it is our experience that these rankings are often not appreciably different. In practice, variables with relatively large coefficients (in absolute value) tend to have relatively large correlations, so the two measures of importance, the first multivariate and the second univariate, frequently give similar results. We recommend that both the coefficients and the correlations be examined to help interpret the principal components.
The following hypothetical example illustrates the contents of Results 8.1, 8.2, and 8.3.
Example 8.1 (Calculating the population principal components) Suppose the random variables X1, X2, and X3 have the covariance matrix

          [  1   -2    0 ]
    Σ  =  [ -2    5    0 ]
          [  0    0    2 ]

It may be verified that the eigenvalue-eigenvector pairs are

    λ1 = 5.83,    e1' = [.383, -.924, 0]
    λ2 = 2.00,    e2' = [0, 0, 1]
    λ3 = 0.17,    e3' = [.924, .383, 0]

Therefore, the principal components become

    Y1 = e1'X = .383X1 - .924X2
    Y2 = e2'X = X3
    Y3 = e3'X = .924X1 + .383X2

The variable X3 is one of the principal components, because it is uncorrelated with the other two variables.
Equation (8-5) can be demonstrated from first principles. For example,

    Var(Y1) = Var(.383X1 - .924X2)
            = (.383)² Var(X1) + (-.924)² Var(X2) + 2(.383)(-.924) Cov(X1, X2)
            = .147(1) + .854(5) - .708(-2)
            = 5.83 = λ1

    Cov(Y1, Y2) = Cov(.383X1 - .924X2, X3)
                = .383 Cov(X1, X3) - .924 Cov(X2, X3)
                = .383(0) - .924(0) = 0

It is also readily apparent that

    σ11 + σ22 + σ33 = 1 + 5 + 2 = λ1 + λ2 + λ3 = 5.83 + 2.00 + .17 = 8
validating Equation (8-6) for this example. The proportion of total variance accounted for by the first principal component is λ1/(λ1 + λ2 + λ3) = 5.83/8 = .73. Further, the first two components account for a proportion (5.83 + 2)/8 = .98 of the population variance. In this case, the components Y1 and Y2 could replace the original three variables with little loss of information.
Next, using (8-8), we obtain

    ρ(Y1, X1) = e11 √λ1 / √σ11 = .383 √5.83 / √1 = .925
    ρ(Y1, X2) = e12 √λ1 / √σ22 = -.924 √5.83 / √5 = -.998
Notice here that the variable X2, with coefficient -.924, receives the greatest weight in the component Y1. It also has the largest correlation (in absolute value) with Y1. The correlation of X1 with Y1, .925, is almost as large as that for X2, indicating that the variables are about equally important to the first principal component. The relative sizes of the coefficients of X1 and X2 suggest, however, that X2 contributes more to the determination of Y1 than does X1. Since, in this case, both coefficients are reasonably large and they have opposite signs, we would argue that both variables aid in the interpretation of Y1.
Finally,

    ρ(Y2, X1) = ρ(Y2, X2) = 0  and  ρ(Y2, X3) = √λ2 / √σ33 = √2 / √2 = 1    (as it should)

The remaining correlations can be neglected, since the third component is unimportant.
It is informative to consider principal components derived from multivariate normal random variables. Suppose X is distributed as Np(μ, Σ). We know from (4-7) that the density of X is constant on the μ-centered ellipsoids
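The eigenvalues quoted in Example 8.1 can be checked numerically. The covariance matrix below reproduces the one the example works with (its diagonal entries 1, 5, 2 sum to 8, matching the trace identity above); note that numpy's `eigh` returns eigenvalues in ascending order, so they are reversed here:

```python
import numpy as np

# Verifying the numbers in Example 8.1 directly from the covariance matrix.
Sigma = np.array([[ 1.0, -2.0,  0.0],
                  [-2.0,  5.0,  0.0],
                  [ 0.0,  0.0,  2.0]])

lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]        # sort so lambda_1 >= lambda_2 >= lambda_3
print(np.round(lam, 2))               # approximately [5.83, 2.00, 0.17]

# Proportions of total variance, as computed in the text (.73, then .98 cumulative):
print(np.round(np.cumsum(lam) / lam.sum(), 2))
```

The exact eigenvalues of the upper 2 x 2 block are 3 ± 2√2, which is where the rounded values 5.83 and .17 come from.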
which have axes ±c√λi ei, i = 1, 2, ..., p, where the (λi, ei) are the eigenvalue-eigenvector pairs of Σ. A point lying on the ith axis of the ellipsoid will have coordinates proportional to ei' = [ei1, ei2, ..., eip] in the coordinate system that has origin μ and axes that are parallel to the original axes x1, x2, ..., xp. It will be convenient to set μ = 0 in the