Applied Multivariate Statistical Analysis
SIXTH EDITION

RICHARD A. JOHNSON
University of Wisconsin-Madison

DEAN W. WICHERN
Texas A&M University

Pearson Prentice Hall, Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Johnson, Richard A.
  Statistical analysis / Richard A. Johnson, Dean W. Wichern. -- 6th ed.
  p. cm.
  Includes index.
  ISBN 0-13-187715-1
  1. Statistical Analysis
  CIP Data Available

Executive Acquisitions Editor: Petra Recter
Vice President and Editorial Director, Mathematics: Christine Haag
Project Manager: Michael Bell
Production Editor: Debbie Ryan
Senior Managing Editor: Linda Mihatov Behrens
Manufacturing Buyer: Maura Zaldivar
Associate Director of Operations: Alexis Heydt-Long
Marketing Manager: Wayne Parkins
Marketing Assistant: Jennifer de Leeuwerk
Editorial Assistant/Print Supplements Editor: Joanne Wendelken
Art Director: Jayne Conte
Director of Creative Services: Paul Belfanti
Cover Designer: Bruce Kenselaar
Art Studio: Laserwords
© 2007 Pearson Education, Inc.
Pearson Prentice Hall
Pearson Education, Inc.
Upper Saddle River, NJ 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Pearson Prentice Hall is a trademark of Pearson Education, Inc.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN-13: 978-0-13-187715-3
ISBN-10: 0-13-187715-1

Pearson Education Ltd., London
Pearson Education Australia Pty. Limited, Sydney
Pearson Education Singapore, Pte. Ltd.
Pearson Education North Asia Ltd., Hong Kong
Pearson Education Canada, Ltd., Toronto
Pearson Educación de México, S.A. de C.V.
Pearson Education Japan, Tokyo
Pearson Education Malaysia, Pte. Ltd.
To the memory of my mother and my father.
R. A. J.

To Dorothy, Michael, and Andrew.
D. W. W.
Contents

PREFACE xv

1 ASPECTS OF MULTIVARIATE ANALYSIS 1
  1.1 Introduction 1
  1.2 Applications of Multivariate Techniques 3
  1.3 The Organization of Data 5
      Arrays, 5; Descriptive Statistics, 6; Graphical Techniques, 11
  1.4 Data Displays and Pictorial Representations 19
      Linking Multiple Two-Dimensional Scatter Plots, 20; Graphs of Growth Curves, 24; Stars, 26; Chernoff Faces, 27
  1.5 Distance 30
  1.6 Final Comments 37
  Exercises 37
  References 47

2 MATRIX ALGEBRA AND RANDOM VECTORS 49
  2.1 Introduction 49
  2.2 Some Basics of Matrix and Vector Algebra 49
      Vectors, 49; Matrices, 54
  2.3 Positive Definite Matrices 60
  2.4 A Square-Root Matrix 65
  2.5 Random Vectors and Matrices 66
  2.6 Mean Vectors and Covariance Matrices 68
      Partitioning the Covariance Matrix, 73; The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables, 75; Partitioning the Sample Mean Vector and Covariance Matrix, 77
  2.7 Matrix Inequalities and Maximization 78
  Supplement 2A: Vectors and Matrices: Basic Concepts 82
      Vectors, 82; Matrices, 87
  Exercises 103
  References 110

3 SAMPLE GEOMETRY AND RANDOM SAMPLING 111
  3.1 Introduction 111
  3.2 The Geometry of the Sample 111
  3.3 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix 119
  3.4 Generalized Variance 123
      Situations in which the Generalized Sample Variance Is Zero, 129; Generalized Variance Determined by |R| and Its Geometrical Interpretation, 134; Another Generalization of Variance, 137
  3.5 Sample Mean, Covariance, and Correlation as Matrix Operations 137
  3.6 Sample Values of Linear Combinations of Variables 140
  Exercises 144
  References 148

4 THE MULTIVARIATE NORMAL DISTRIBUTION 149
  4.1 Introduction 149
  4.2 The Multivariate Normal Density and Its Properties 149
      Additional Properties of the Multivariate Normal Distribution, 156
  4.3 Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation 168
      The Multivariate Normal Likelihood, 168; Maximum Likelihood Estimation of μ and Σ, 170; Sufficient Statistics, 173
  4.4 The Sampling Distribution of X̄ and S 173
      Properties of the Wishart Distribution, 174
  4.5 Large-Sample Behavior of X̄ and S 175
  4.6 Assessing the Assumption of Normality 177
      Evaluating the Normality of the Univariate Marginal Distributions, 177; Evaluating Bivariate Normality, 182
  4.7 Detecting Outliers and Cleaning Data 187
      Steps for Detecting Outliers, 189
  4.8 Transformations to Near Normality 192
      Transforming Multivariate Observations, 195
  Exercises 200
  References 208

5 INFERENCES ABOUT A MEAN VECTOR 210
  5.1 Introduction 210
  5.2 The Plausibility of μ₀ as a Value for a Normal Population Mean 210
  5.3 Hotelling's T² and Likelihood Ratio Tests 216
      General Likelihood Ratio Method, 219
  5.4 Confidence Regions and Simultaneous Comparisons of Component Means 220
      Simultaneous Confidence Statements, 223; A Comparison of Simultaneous Confidence Intervals with One-at-a-Time Intervals, 229; The Bonferroni Method of Multiple Comparisons, 232
  5.5 Large Sample Inferences about a Population Mean Vector 234
  5.6 Multivariate Quality Control Charts 239
      Charts for Monitoring a Sample of Individual Multivariate Observations for Stability, 241; Control Regions for Future Individual Observations, 247; Control Ellipse for Future Observations, 248; T² Chart for Future Observations, 248; Control Charts Based on Subsample Means, 249; Control Regions for Future Subsample Observations, 251
  5.7 Inferences about Mean Vectors when Some Observations Are Missing 251
  5.8 Difficulties Due to Time Dependence in Multivariate Observations 256
  Supplement 5A: Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids 258
  Exercises 261
  References 272

6 COMPARISONS OF SEVERAL MULTIVARIATE MEANS 273
  6.1 Introduction 273
  6.2 Paired Comparisons and a Repeated Measures Design 273
      Paired Comparisons, 273; A Repeated Measures Design for Comparing Treatments, 279
  6.3 Comparing Mean Vectors from Two Populations 284
      Assumptions Concerning the Structure of the Data, 284; Further Assumptions When n₁ and n₂ Are Small, 285; Simultaneous Confidence Intervals, 288; The Two-Sample Situation When Σ₁ ≠ Σ₂, 291; An Approximation to the Distribution of T² for Normal Populations When Sample Sizes Are Not Large, 294
  6.4 Comparing Several Multivariate Population Means (One-Way MANOVA) 296
      Assumptions about the Structure of the Data for One-Way MANOVA, 296; A Summary of Univariate ANOVA, 297; Multivariate Analysis of Variance (MANOVA), 301
  6.5 Simultaneous Confidence Intervals for Treatment Effects 308
  6.6 Testing for Equality of Covariance Matrices 310
  6.7 Two-Way Multivariate Analysis of Variance 312
      Univariate Two-Way Fixed-Effects Model with Interaction, 312; Multivariate Two-Way Fixed-Effects Model with Interaction, 315
  6.8 Profile Analysis 323
  6.9 Repeated Measures Designs and Growth Curves 328
  6.10 Perspectives and a Strategy for Analyzing Multivariate Models 332
  Exercises 337
  References 358

7 MULTIVARIATE LINEAR REGRESSION MODELS 360
  7.1 Introduction 360
  7.2 The Classical Linear Regression Model 360
  7.3 Least Squares Estimation 364
      Sum-of-Squares Decomposition, 366; Geometry of Least Squares, 367; Sampling Properties of Classical Least Squares Estimators, 369
  7.4 Inferences About the Regression Model 370
      Inferences Concerning the Regression Parameters, 370; Likelihood Ratio Tests for the Regression Parameters, 374
  7.5 Inferences from the Estimated Regression Function 378
      Estimating the Regression Function at z₀, 378; Forecasting a New Observation at z₀, 379
  7.6 Model Checking and Other Aspects of Regression 381
      Does the Model Fit?, 381; Leverage and Influence, 384; Additional Problems in Linear Regression, 384
  7.7 Multivariate Multiple Regression 387
      Likelihood Ratio Tests for Regression Parameters, 395; Other Multivariate Test Statistics, 398; Predictions from Multivariate Multiple Regressions, 399
  7.8 The Concept of Linear Regression 401
      Prediction of Several Variables, 406; Partial Correlation Coefficient, 409
  7.9 Comparing the Two Formulations of the Regression Model 410
      Mean Corrected Form of the Regression Model, 410; Relating the Formulations, 412
  7.10 Multiple Regression Models with Time Dependent Errors 413
  Supplement 7A: The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model 418
  Exercises 420
  References 428

8 PRINCIPAL COMPONENTS 430
  8.1 Introduction 430
  8.2 Population Principal Components 430
      Principal Components Obtained from Standardized Variables, 436; Principal Components for Covariance Matrices with Special Structures, 439
  8.3 Summarizing Sample Variation by Principal Components 441
      The Number of Principal Components, 444; Interpretation of the Sample Principal Components, 448; Standardizing the Sample Principal Components, 449
  8.4 Graphing the Principal Components 454
  8.5 Large Sample Inferences 456
      Large Sample Properties of λ̂ᵢ and êᵢ, 456; Testing for the Equal Correlation Structure, 457
  8.6 Monitoring Quality with Principal Components 459
      Checking a Given Set of Measurements for Stability, 459; Controlling Future Values, 463
  Supplement 8A: The Geometry of the Sample Principal Component Approximation 466
      The p-Dimensional Geometrical Interpretation, 468; The n-Dimensional Geometrical Interpretation, 469
  Exercises 470
  References 480

9 FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES 481
  9.1 Introduction 481
  9.2 The Orthogonal Factor Model 482
  9.3 Methods of Estimation 488
      The Principal Component (and Principal Factor) Method, 488; A Modified Approach - the Principal Factor Solution, 494; The Maximum Likelihood Method, 495; A Large Sample Test for the Number of Common Factors, 501
  9.4 Factor Rotation 504
      Oblique Rotations, 512
  9.5 Factor Scores 513
      The Weighted Least Squares Method, 514; The Regression Method, 516
  9.6 Perspectives and a Strategy for Factor Analysis 519
  Supplement 9A: Some Computational Details for Maximum Likelihood Estimation 527
      Recommended Computational Scheme, 528; Maximum Likelihood Estimators of ρ = L_z L_z' + Ψ_z, 529
  Exercises 530
  References 538

10 CANONICAL CORRELATION ANALYSIS 539
  10.1 Introduction 539
  10.2 Canonical Variates and Canonical Correlations 539
  10.3 Interpreting the Population Canonical Variables 545
      Identifying the Canonical Variables, 545; Canonical Correlations as Generalizations of Other Correlation Coefficients, 547; The First r Canonical Variables as a Summary of Variability, 548; A Geometrical Interpretation of the Population Canonical Correlation Analysis, 549
  10.4 The Sample Canonical Variates and Sample Canonical Correlations 550
  10.5 Additional Sample Descriptive Measures 558
      Matrices of Errors of Approximations, 558; Proportions of Explained Sample Variance, 561
  10.6 Large Sample Inferences 563
  Exercises 567
  References 574

11 DISCRIMINATION AND CLASSIFICATION 575
  11.1 Introduction 575
  11.2 Separation and Classification for Two Populations 576
  11.3 Classification with Two Multivariate Normal Populations 584
      Classification of Normal Populations When Σ₁ = Σ₂ = Σ, 584; Scaling, 589; Fisher's Approach to Classification with Two Populations, 590; Is Classification a Good Idea?, 592; Classification of Normal Populations When Σ₁ ≠ Σ₂, 593
  11.4 Evaluating Classification Functions 596
  11.5 Classification with Several Populations 606
      The Minimum Expected Cost of Misclassification Method, 606; Classification with Normal Populations, 609
  11.6 Fisher's Method for Discriminating among Several Populations 621
      Using Fisher's Discriminants to Classify Objects, 628
  11.7 Logistic Regression and Classification 634
      Introduction, 634; The Logit Model, 634; Logistic Regression Analysis, 636; Classification, 638; Logistic Regression with Binomial Responses, 640
  11.8 Final Comments 644
      Including Qualitative Variables, 644; Classification Trees, 644; Neural Networks, 647; Selection of Variables, 648; Testing for Group Differences, 648; Graphics, 649; Practical Considerations Regarding Multivariate Normality, 649
  Exercises 650
  References 669

12 CLUSTERING, DISTANCE METHODS, AND ORDINATION 671
  12.1 Introduction 671
  12.2 Similarity Measures 673
      Distances and Similarity Coefficients for Pairs of Items, 673; Similarities and Association Measures for Pairs of Variables, 677; Concluding Comments on Similarity, 678
  12.3 Hierarchical Clustering Methods 680
      Single Linkage, 682; Complete Linkage, 685; Average Linkage, 690; Ward's Hierarchical Clustering Method, 692; Final Comments - Hierarchical Procedures, 695
  12.4 Nonhierarchical Clustering Methods 696
      K-means Method, 696; Final Comments - Nonhierarchical Procedures, 701
  12.5 Clustering Based on Statistical Models 703
  12.6 Multidimensional Scaling 706
      The Basic Algorithm, 708
  12.7 Correspondence Analysis 716
      Algebraic Development of Correspondence Analysis, 718; Inertia, 725; Interpretation in Two Dimensions, 726; Final Comments, 726
  12.8 Biplots for Viewing Sampling Units and Variables 726
      Constructing Biplots, 727
  12.9 Procrustes Analysis: A Method for Comparing Configurations 732
      Constructing the Procrustes Measure of Agreement, 733
  Supplement 12A: Data Mining 740
      Introduction, 740; The Data Mining Process, 741; Model Assessment, 742
  Exercises 747
  References 755

APPENDIX 757
DATA INDEX 764
SUBJECT INDEX 767
Preface

INTENDED AUDIENCE

This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Sixth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables. Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.

LEVEL

Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques.

The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis. The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience.

In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice a consistency of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.

ORGANIZATION AND APPROACH

The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, "Sample Geometry," and Chapter 4, "Multivariate Normal Distribution."

Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.

The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two-quarter) course are indicated schematically. Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.

Getting Started: Chapters 1-4. For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1; Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6; and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis and clustering. The discussions could feature the many "worked out" examples included in these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.

We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of multivariate analysis of variance (MANOVA), regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.

CHANGES TO THE SIXTH EDITION

New material. Users of the previous editions will notice several major changes in the sixth edition.

• Twelve new data sets including national track records for men and women, psychological profile scores, car body assembly measurements, cell phone tower breakdowns, pulp and paper properties measurements, Mali family farm data, stock price rates of return, and Concho water snake data.
• Thirty-seven new exercises and twenty revised exercises with many of these exercises based on the new data sets.
• Four new data-based examples and fifteen revised examples.
• Six new or expanded sections:
  1. Section 6.6 Testing for Equality of Covariance Matrices
  2. Section 11.7 Logistic Regression and Classification
  3. Section 12.5 Clustering Based on Statistical Models
  4. Expanded Section 6.3 to include "An Approximation to the Distribution of T² for Normal Populations When Sample Sizes Are Not Large"
  5. Expanded Sections 7.6 and 7.7 to include Akaike's Information Criterion
  6. Consolidated previous Sections 11.3 and 11.5 on two group discriminant analysis into single Section 11.3

To make the methods of multivariate analysis more prominent in the text, we have removed the long proofs of Results 7.2, 7.6, 7.7, 7.10 and 10.1 and placed them on a web site accessible through www.prenhall.com/statistics. Click on "Multivariate Statistics" and then click on our book.

Web Site. In addition, all full data sets saved as ASCII files that are used in the book are available on the web site.

Instructors' Solutions Manual. An Instructors Solutions Manual is available on the author's website accessible through www.prenhall.com. For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall web site at www.prenhall.com/statistics.

ACKNOWLEDGMENTS

We thank many of our colleagues who helped improve the applied aspect of the book by contributing their own data sets for examples and exercises. A number of individuals helped guide various revisions of this book, and we are grateful for their suggestions: Christopher Bingham, University of Minnesota; Steve Coad, University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George Mason University; Him Koul, Michigan State University; Bruce McCullough, Drexel University; Shyamal Peddada, University of Virginia; K. Sivakumar, University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman, University of Illinois at Urbana-Champaign. We also acknowledge the feedback of the students we have taught these past 35 years in our applied multivariate analysis courses. Their comments and suggestions are largely responsible for the present iteration of this work. We would also like to give special thanks to Wai Kwong Cheang, Shanhong Guan, Jialiang Li and Zhiguo Xiao for their help with the calculations for many of the examples.

We must thank Dianne Hall for her valuable help with the Solutions Manual, Steve Verrill for computing assistance throughout, and Alison Pollack for implementing a Chernoff faces program. We are indebted to Cliff Gilman for his assistance with the multidimensional scaling examples discussed in Chapter 12. Jacquelyn Forer did most of the typing of the original draft manuscript, and we appreciate her expertise and willingness to endure cajoling of authors faced with publication deadlines. Finally, we would like to thank Petra Recter, Debbie Ryan, Michael Bell, Linda Behrens, Joanne Wendelken and the rest of the Prentice Hall staff for their help with this project.

R. A. Johnson
rich@stat.wisc.edu

D. W. Wichern
dwichern@tamu.edu
Applied Multivariate Statistical Analysis .
Chapter 1

ASPECTS OF MULTIVARIATE ANALYSIS

1.1 Introduction

Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables. This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.

The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics. Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.

Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables. Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [6] and [7] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.

It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or commonsense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.

Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that is both widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. In Section 1.2, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation of the applicability of multivariate techniques across different fields.

The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:

1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:

If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.

1.2 Applications of Multivariate Techniques

The published applications of multivariate methods have increased tremendously in recent years. It is now difficult to cover the variety of real-world applications of these methods with brief discussions, as we did in earlier editions of this book. However, in order to give some indication of the usefulness of multivariate techniques, we offer the following short descriptions of the results of studies from several disciplines. These descriptions are organized according to the categories of objectives given in the previous section. Of course, many of our examples are multifaceted and could be placed in more than one category.

Data reduction or simplification

• Using data on several variables related to cancer patient responses to radiotherapy, a simple measure of patient response to radiotherapy was constructed. (See Exercise 1.15.)
• Track records from many nations were used to develop an index of performance for both male and female athletes. (See [8] and [22].)
• Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions. (See [23].)
• Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants. (See [13].)
• A matrix of tactic similarities was developed from aggregate data derived from professional mediators. From this matrix the number of dimensions by which professional mediators judge the tactics they use in resolving disputes was determined. (See [21].)

Sorting and grouping

• Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
• Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics. (See [26].)
• Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.)
• The U.S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [31].)
• cDNA microarray experiments (gene expression data) are increasingly used to study the molecular variations among cancer tumors. A reliable classification of tumors is essential for successful diagnosis and treatment of cancer. (See [9].)

Investigation of the dependence among variables

• Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [12].)
• Measurements of variables related to innovation, on the one hand, and variables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [3].)
• Measurements of pulp fiber characteristics and subsequent measurements of characteristics of the paper made from them are used to examine the relations between pulp fiber properties and the resulting paper properties. The goal is to determine those fibers that lead to higher quality paper. (See [17].)
• The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)

Prediction

• The associations between test scores and several high school performance variables, on the one hand, and several college performance variables, on the other hand, were used to develop predictors of success in college. (See [10].)
• Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [7] and [20].)
• Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers. (See [28].)

Hypotheses testing

• Several pollution-related variables were measured to determine whether levels for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and weekends. (See Exercise 1.6.)
• Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [27].)
• Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing sociological theories. (See [16] and [25].)
• Data on several variables were used to determine whether different types of firms in newly industrialized countries exhibited different patterns of innovation. (See [15].)
The preceding descriptions offer glimpses into the use of multivariate methods in widely diverse fields.

1.3 The Organization of Data

Throughout this text, we are going to be concerned with analyzing measurements made on several variables or characteristics. These measurements (commonly called data) must frequently be arranged and displayed in various ways. For example, graphs and tabular arrangements are important aids in data analysis. Summary numbers, which quantitatively portray certain features of the data, are also necessary to any description.

We now introduce the preliminary concepts underlying these first steps of data organization.

Arrays

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number p >= 1 of variables or characters to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit.

We will use the notation $x_{jk}$ to indicate the particular value of the kth variable that is observed on the jth item, or trial. That is,

$x_{jk}$ = measurement of the kth variable on the jth item

Consequently, n measurements on p variables can be displayed as follows:

            Variable 1   Variable 2   ...   Variable k   ...   Variable p
Item 1:      $x_{11}$     $x_{12}$    ...    $x_{1k}$    ...    $x_{1p}$
Item 2:      $x_{21}$     $x_{22}$    ...    $x_{2k}$    ...    $x_{2p}$
  ...
Item j:      $x_{j1}$     $x_{j2}$    ...    $x_{jk}$    ...    $x_{jp}$
  ...
Item n:      $x_{n1}$     $x_{n2}$    ...    $x_{nk}$    ...    $x_{np}$

Or we can display these data as a rectangular array, called X, of n rows and p columns:

$$ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np} \end{bmatrix} $$

The array X, then, contains the data consisting of all of the observations on all of the variables.
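In statistical software the array X is simply a two-dimensional array, with rows indexed by items and columns by variables. The following minimal Python/NumPy sketch is our own illustration, not part of the text; the values are arbitrary. It shows how an entry $x_{jk}$ is recovered by row-column indexing:

```python
import numpy as np

# An illustrative data array with n = 3 items (rows) and p = 2 variables
# (columns); the numbers are arbitrary.
X = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])

n, p = X.shape          # number of items, number of variables
x_32 = X[2, 1]          # x_{32}: 3rd item, 2nd variable (0-based indexing)
print(n, p, x_32)       # 3 2 6.0
```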
Example 1.1 (A data array) A selection of four receipts from a university bookstore was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be number of books sold. Then we can regard the corresponding numbers on the receipts as four measurements on two variables. Suppose the data, in tabular form, are

Variable 1 (dollar sales): 42  52  48  58
Variable 2 (number of books): 4  5  4  3

Using the notation just introduced, we have

$x_{11} = 42$, $x_{21} = 52$, $x_{31} = 48$, $x_{41} = 58$
$x_{12} = 4$, $x_{22} = 5$, $x_{32} = 4$, $x_{42} = 3$

and the data array X is

$$ \mathbf{X} = \begin{bmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{bmatrix} $$

with four rows and two columns. •

Considering data in the form of arrays facilitates the exposition of the subject matter and allows numerical calculations to be performed in an orderly and efficient manner. The efficiency is twofold, as gains are attained in both (1) describing numerical calculations as operations on arrays and (2) the implementation of the calculations on computers, which now use many languages and statistical packages to perform array operations. We consider the manipulation of arrays of numbers in Chapter 2. At this point, we are concerned only with their value as devices for displaying data.

Descriptive Statistics

A large data set is bulky, and its very mass poses a serious obstacle to any attempt to visually extract pertinent information. Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic that provides a measure of location, that is, a "central value" for a set of numbers. And the average of the squares of the distances of all of the numbers from the mean provides a measure of the spread, or variation, in the numbers.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. The formal definitions of these quantities follow.

Let $x_{11}, x_{21}, \ldots, x_{n1}$ be n measurements on the first variable. Then the arithmetic average of these measurements is

$$ \bar{x}_1 = \frac{1}{n} \sum_{j=1}^{n} x_{j1} $$

If the n measurements represent a subset of the full set of measurements that might have been observed, then $\bar{x}_1$ is also called the sample mean for the first variable. We adopt this terminology because the bulk of this book is devoted to procedures designed to analyze samples of measurements from larger collections. The sample mean can be computed from the n measurements on each of the p variables, so that, in general, there will be p sample means:

$$ \bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}, \qquad k = 1, 2, \ldots, p \tag{1-1} $$

A measure of spread is provided by the sample variance, defined for n measurements on the first variable as

$$ s_1^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2 $$

where $\bar{x}_1$ is the sample mean of the $x_{j1}$'s. In general, for p variables, we have

$$ s_k^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-2} $$

Two comments are in order. First, many authors define the sample variance with a divisor of n - 1 rather than n. Later we shall see that there are theoretical reasons for doing this, and it is particularly appropriate if the number of measurements, n, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression.

Second, although the $s^2$ notation is traditionally used to indicate the sample variance, we shall eventually consider an array of quantities in which the sample variances lie along the main diagonal. In this situation, it is convenient to use double subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation $s_{kk}$ to denote the same variance computed from measurements on the kth variable, and we have the notational identities

$$ s_k^2 = s_{kk} = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-3} $$

The square root of the sample variance, $\sqrt{s_{kk}}$, is known as the sample standard deviation. This measure of variation uses the same units as the observations.

Consider n pairs of measurements on each of variables 1 and 2:

$$ \begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix}, \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix}, \ldots, \begin{bmatrix} x_{n1} \\ x_{n2} \end{bmatrix} $$

That is, $x_{j1}$ and $x_{j2}$ are observed on the jth experimental item (j = 1, 2, ..., n). A measure of linear association between the measurements of variables 1 and 2 is provided by the sample covariance

$$ s_{12} = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2) $$

or the average product of the deviations from their respective means. If large values for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, $s_{12}$ will be positive. If large values from one variable occur with small values for the other variable, $s_{12}$ will be negative. If there is no particular association between the values for the two variables, $s_{12}$ will be approximately zero.

The sample covariance

$$ s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \qquad i = 1, 2, \ldots, p, \; k = 1, 2, \ldots, p \tag{1-4} $$

measures the association between the ith and kth variables. We note that the covariance reduces to the sample variance when i = k. Moreover, $s_{ik} = s_{ki}$ for all i and k.

The final descriptive statistic considered here is the sample correlation coefficient (or Pearson's product-moment correlation coefficient; see [14]). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the ith and kth variables is defined as

$$ r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}} \sqrt{s_{kk}}} = \frac{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2} \sqrt{\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}} \tag{1-5} $$

for i = 1, 2, ..., p and k = 1, 2, ..., p. Note $r_{ik} = r_{ki}$ for all i and k.

The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that $r_{ik}$ has the same value whether n or n - 1 is chosen as the common divisor for $s_{ik}$, $s_{ii}$, and $s_{kk}$.

The sample correlation coefficient $r_{ik}$ can also be viewed as a sample covariance. Suppose the original values $x_{ji}$ and $x_{jk}$ are replaced by standardized values $(x_{ji} - \bar{x}_i)/\sqrt{s_{ii}}$ and $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$. The standardized values are commensurable because both sets are centered at zero and expressed in standard deviation units. The sample correlation coefficient is just the sample covariance of the standardized observations.

Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation r has the following properties:

1. The value of r must be between -1 and +1 inclusive.
2. Here r measures the strength of the linear association. If r = 0, this implies a lack of linear association between the components. Otherwise, the sign of r indicates the direction of the association: r < 0 implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average, and r > 0 implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of $r_{ik}$ remains unchanged if the measurements of the ith variable are changed to $y_{ji} = a x_{ji} + b$, j = 1, 2, ..., n, and the values of the kth variable are changed to $y_{jk} = c x_{jk} + d$, j = 1, 2, ..., n, provided that the constants a and c have the same sign.

The quantities $s_{ik}$ and $r_{ik}$ do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists. In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present. Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of $s_{ik}$ and $r_{ik}$ should be quoted both with and without these observations.

The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are

$$ w_{kk} = \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p \tag{1-6} $$

and

$$ w_{ik} = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \qquad i = 1, 2, \ldots, p, \; k = 1, 2, \ldots, p \tag{1-7} $$

The descriptive statistics computed from n measurements on p variables can also be organized into arrays.

Arrays of Basic Descriptive Statistics

$$ \text{Sample means:} \quad \bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} \qquad
\text{Sample variances and covariances:} \quad \mathbf{S}_n = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{bmatrix} \qquad
\text{Sample correlations:} \quad \mathbf{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{bmatrix} \tag{1-8} $$

The sample mean array is denoted by $\bar{\mathbf{x}}$, the sample variance and covariance array by the capital letter $\mathbf{S}_n$, and the sample correlation array by R. The subscript n on the array $\mathbf{S}_n$ is a mnemonic device used to remind you that n is employed as a divisor for the elements $s_{ik}$. The size of all of the arrays is determined by the number of variables, p.

The array $\bar{\mathbf{x}}$ is a single column with p rows. The arrays $\mathbf{S}_n$ and R consist of p rows and p columns. The first subscript on an entry in arrays $\mathbf{S}_n$ and R indicates the row; the second subscript indicates the column. Since $s_{ik} = s_{ki}$ and $r_{ik} = r_{ki}$ for all i and k, the entries in symmetric positions about the main northwest-southeast diagonals in arrays $\mathbf{S}_n$ and R are the same, and the arrays are said to be symmetric.

Example 1.2 (The arrays x̄, Sn, and R for bivariate data) Consider the data introduced in Example 1.1. Each receipt yields a pair of measurements: total dollar sales and number of books sold. Find the arrays $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and R.

Since there are four receipts, we have a total of four measurements (observations) on each variable. The sample means are

$$ \bar{x}_1 = \tfrac{1}{4} \textstyle\sum_{j=1}^{4} x_{j1} = \tfrac{1}{4}(42 + 52 + 48 + 58) = 50 $$
$$ \bar{x}_2 = \tfrac{1}{4} \textstyle\sum_{j=1}^{4} x_{j2} = \tfrac{1}{4}(4 + 5 + 4 + 3) = 4 $$

so $\bar{\mathbf{x}} = \begin{bmatrix} 50 \\ 4 \end{bmatrix}$. The sample variances and covariances are

$$ s_{11} = \tfrac{1}{4}\left[(42 - 50)^2 + (52 - 50)^2 + (48 - 50)^2 + (58 - 50)^2\right] = 34 $$
$$ s_{22} = \tfrac{1}{4}\left[(4 - 4)^2 + (5 - 4)^2 + (4 - 4)^2 + (3 - 4)^2\right] = .5 $$
$$ s_{12} = \tfrac{1}{4}\left[(42 - 50)(4 - 4) + (52 - 50)(5 - 4) + (48 - 50)(4 - 4) + (58 - 50)(3 - 4)\right] = -1.5 $$

and

$$ \mathbf{S}_n = \begin{bmatrix} 34 & -1.5 \\ -1.5 & .5 \end{bmatrix} $$

The sample correlation is

$$ r_{12} = \frac{s_{12}}{\sqrt{s_{11}} \sqrt{s_{22}}} = \frac{-1.5}{\sqrt{34} \sqrt{.5}} = -.36 $$

so

$$ \mathbf{R} = \begin{bmatrix} 1 & -.36 \\ -.36 & 1 \end{bmatrix} \quad \bullet $$
article on money in sports. Thus.56 = { . The sample correlation coefficient computed from the values of x 1 and x 2 is . this causeeffect relationship cannot be substantiated. Dun & Bradstreet is the largest firm in terms of number of employees.4 {A scatter plot for baseball data) In a July 17. but is "typical" in terms of profits per employee.1.3 Profits per employee and number of employees for 16 publishing firms. statistics cannot answer the question: Could the Mets have won with $4 million to spend on player salaries? . 1978. The data for the pair of variables x 1 = employees (jobs) and x 2 = profits per employee (productivity) are graphed in Figure 1. but comparatively small (negative) profits per employee.39 . •• • • • • • Time Warner • Dun & Bradstreet • Employees (thousands) Figure 1.The Organization of Data 13 Example 1. because the experiment did not include a random assignment of payrolls.3 {The effect of unusual observations on sample correlations) Some fi nancial data representing jobs and productivity for the 16 largest publishing firms appeared in an article in Forbes magazine on April30. The scatter plot in Figure 1.39 . The results are given in Thble 1. 1990. Example 1. Sports Illustrated magazine provided data on x 1 = player payroll for National League East baseball teams.3. We have labeled two "unusual" observations. Time Warner has a "typical" number of employees.4 supports the claim that a championship team can be bought.. Of course. We have added data on x 2 = wonlost percentage for 1977. •• • • .50 for all16 firms for all firms but Dun & Bradstreet for all firms but Time Warner for all firms but Dun & Bradstreet and Time Warner r12 It is clear that atypical observations can have a considerable effect on the sample • correlation coefficient.
or at right angles to.497.800 x2 = wonlost percentage . To construct the scatter plot in Figure 1.725.475 1. Because of the orientation of fibers within the paper. the machine direction.4 Salaries and wonlost percentage from Table 1.900 2.5.593 . Louis Cardinals Chicago Cubs Montreal Expos New York Mets = playerpayroll 3.2 shows the measured values of x1 = density(gramsjcubiccentinleter) xz x3 = strength (pounds) in the machine direction "' strength (pounds) in the cross direction A novel graphic presentation of these data appears in Figure 1.463 .469.4. The scatter plots are arranged as the offdiagonal elements of a covariance array and box plots as the diagonal elements. The latter are on a different scale with this . Example I.1 as the coordinates of six points in twodimensional space. The figure allows us to examine visually the grouping of teams with respect to the vari• ables total payroll and wonlost percentage.S (Multiple scatter plots for paper strength measurements) Paper is manufactured in continuous sheets several feet wide.575 1.782.875 1.645.1.485. it has a different strength when measured in the direction produced by the machine than when measured across.450 1.512 .395 I •• •• • • 0 Player payroU in millions of dollars Figure 1. we have regarded the six paired observations in Thble 1.14 Chapter 1 Aspects of Multivariate Analysis Table 1.1 1977 Salary and Final Record for the National League East Xt Team Philadelphia Phillies Pittsburgh Pirates St. page"16.500 .623 . Table 1.
30 115.788 .50 127.22 54.21 78.75 80.81 109.759 .51 110.832 .820 .10 130.20 74.841 .41 73.840 .10 115.93 53.68 50.31 114.67 52.810 .39 69.824 .10 120.20 118.48 70.70 129.80 107. .54 71.23 68.33 75.25 72.80 Cross direction 70.00 131.90 123.802 .772 .828 .60 53.70 126.971 .42 31 32 33 34 35 36 37 38 39 40 41 Source: Data courtesy of SONOCO Products Company.796 .822 .772 . 74.822 .836 .The Organization of Data 15 Table 1.770 .70 115.802 .2 PaperQuality Measurements Strength Specimen 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Density .88 68.80 130.816 .842 .70 117.788 .805 .62 53.782 .803 .35 68.10 131.819 .10 125.20 131.29 72.845 .824 .80 124.68 78.836 .02 73.91 122.51 56.80 125.47 78.31 112.41 127.60 103.816 .776 .93 53.758 Machine direction 121.00 125.70 121.40 120.50 126.802 .806 .795 .10 79.759 .10 50.91 68.60 116.31 110.801 .89 71.85 51.826 .70.10 70.60 118.64 76.815 .90 124.68 74.843 .50 127.53 70.51 109.80 135.20 120.12 71.28 76.52 48.42 70.822 .71 113.42 72.10 118.33 76.
..f6 Chapter 1 Aspects of Multivariate Analysis Density Strength (MD) 0. Scatter plots should be made for pairs of important variables and. . we cannot always picture an entire set of data. ..70 Min 48. ... these representations can actually be graphed..... . ··:.:·:. :···· 4·. . . .. In cases where it is possible to capture the essence of the data in three dimensions. .. . Max • • i" . Some of the scatter plots have patterns suggesting that there are two separate clumps of observations. However. . .:· . so we use only the overall shape to provide information on symmetry and possible outliers for each individual characteristic.. . .. there is one unusual observation: the density of specimen 25.93 figure I.. Limited as we are to a three~dimensional world.. These scatter plot arrays are further pursued in our discussion of new software graphics in the next section.... ~ T _l_ 80.. In Figure 1.. ....5 .·~. p variables are simultaneously recorded on n items. ........1 I I 121.97 Strength (CD) Max ·a Q c " Med Min ~ 0.· .81 0. . . .···.r Med I I T ' 135.... for all pairs.5. .33 Med 70. The scatter plots can be inspected for patterns and unusual observations. ...·: ... :·.4 . ..S Scatter plots and boxplots of paperquality data from Thble 1... '· Min 103..76 . two further geometric representations of the data provide an important conceptual framework for viewing multi variable statistical methods. . .. . : =· . 8 c ~ 6 5 0<1 "' · . Max :: . .. ... software.... .2.. • In the general multiresponse situation. .. ·. if the task is not too great to warrant the effort..
The resulting plot with n points not only will exhibit the overall pattern of variability.890 SVL 73.0 124.3. where the p measurements on the jth item represent the coordinates of a point in pdimensional space. but also will show similarities (and differences) among then items.0 125.610 11. Knowing the position on a line along the major axes of the cloud of points would be almost as good as knowing the three measurements Mass.0 123.5 66.0 133.158 12.0 126.149 9.049 5.0 137. so the variables contribute equally to the variation Table 1.0 69.888 7.5 69.733 12. The weight..0 67.5 67.5 142.0 116.0 97.5 63.0 74.5 62.xk)/~. .015 10.5 135.0 77. Zjk = (xjk.5 123. SVL.3 Lizard Size Data Lizard 1 2 3 4 5 6 7 8 9 10 11 12 13 Mass 5.0 70.273 2.5 139.0 129.5 64. .6 {Looking for lowerdimensional structure) A zoologist obtained measurements on n = 25 lizards known scientifically as Cophosaurus texanus. To help answer questions regarding reduced dimensionality. is given in grams while the snoutvent length (SVL) and hind limb span (HLS) are given in millimeters.0 Lizard 14 15 16 17 18 19 20 21 22 23 24 25 Mass 10.063 6.0 118. Consider the natural extension of the scatter plot top dimensions.953 7.0 162.5 150.6.0 117.199 6.5 79.0 Source: Data courtesy of Kevin E.004 8.5 74.067 10. Consequently.0 47.610 7.0 61.213 8. this kind of analysis can be misleading if one variable has a much larger variance than the others.978 6.601 7.5 68. Although there are three size measurements.401 9.0 86. The data are displayed in Table 1. Groupings of items will manifest themselves in this representation.0 141. . we construct the threedimensional scatter plot in Figure 1.447 15. xi 2 units along the second.526 10. However.0 73.0 117.The Organization of Data I7 n Points in p Dimensions (pDimensional Scatter Plot).0 116.0 66. we first calculate the standardized values. Example 1.0 75. so that the jth point is xi! units along the first axis.0 62. Bonine.0 135.5 HLS 113. xiP units along the pth axis.0 140.091 10.622 SVL 59. we can ask whether or not most of the variation is primarily restricted to two dimensions or even to one dimension.. The next example illustrates a threedimensional scatter plot. The coordinate axes are taken to correspond to the variables. Clearly most of the variation is scatter about a onedimensional straight line. or mass.132 6.0 75.0 HLS 136. and HLS.0 59.5 136.493 9.
Figure 1.7 gives the three-dimensional scatter plot for the standardized variables. Most of the variation can be explained by a single variable determined by a line through the cloud of points. ■

[Figure 1.6: 3D scatter plot of lizard data from Table 1.3.]
[Figure 1.7: 3D scatter plot of standardized lizard data.]

A three-dimensional scatter plot can often reveal group structure.

Example 1.7 (Looking for group structure in three dimensions) Referring to Example 1.6, it is interesting to see if male and female lizards occupy different parts of the three-dimensional space containing the size data. The gender, by row, for the lizard data in Table 1.3 are

fmffmfmfmfmfm mmmfmmmffmff
Figure 1.8 repeats the scatter plot for the original variables but with males marked by solid circles and females by open circles. Clearly, males are typically larger than females. ■

[Figure 1.8: 3D scatter plot of male and female lizards.]

p Points in n Dimensions. The n observations of the p variables can also be regarded as p points in n-dimensional space. Each column of X determines one of the points. The ith column, consisting of all n measurements on the ith variable, determines the ith point. In Chapter 3, we show how the closeness of points in n dimensions can be related to measures of association between the corresponding variables.

1.4 Data Displays and Pictorial Representations

The rapid development of powerful personal computers and workstations has led to a proliferation of sophisticated statistical software for data analysis and graphics. It is often possible, for example, to sit at one's desk and examine the nature of multidimensional data with clever computer-generated pictures. These pictures are valuable aids in understanding data and often prevent many false starts and subsequent inferential problems.

As we shall see in Chapters 8 and 12, there are several techniques that seek to represent p-dimensional observations in few dimensions such that the original distances (or similarities) between pairs of observations are (nearly) preserved. In general, if multidimensional observations can be represented in two dimensions, then outliers, relationships, and distinguishable groupings can often be discerned by eye. We shall discuss and illustrate several methods for displaying multivariate data in two dimensions. One good source for more discussion of graphical methods is [11].
Linking Multiple Two-Dimensional Scatter Plots

One of the more exciting new graphical procedures involves electronically connecting many two-dimensional scatter plots.

Example 1.8 (Linked scatter plots and brushing) To illustrate linked two-dimensional scatter plots, we refer to the paper-quality data in Table 1.2. These data represent measurements on the variables x1 = density, x2 = strength in the machine direction, and x3 = strength in the cross direction. Figure 1.9 shows two-dimensional scatter plots for pairs of these variables organized as a 3 × 3 array. For example, the picture in the upper left-hand corner of the figure is a scatter plot of the pairs of observations (x1, x3). That is, the x1 values are plotted along the horizontal axis, and the x3 values are plotted along the vertical axis. The lower right-hand corner of the figure contains a scatter plot of the observations (x3, x1). That is, the axes are reversed. Corresponding interpretations hold for the other scatter plots in the figure. Notice that the variables and their three-digit ranges are indicated in the boxes along the SW-NE diagonal.

The operation of marking (selecting) the obvious outlier in the (x1, x2) scatter plot of Figure 1.9 creates Figure 1.10(a), where the outlier is labeled as specimen 25 and the same data point is highlighted in all the scatter plots. Specimen 25 also appears to be an outlier in the (x1, x3) scatter plot but not in the (x2, x3) scatter plot. The operation of deleting this specimen leads to the modified scatter plots of Figure 1.10(b).

From Figure 1.10, we notice that some points in, for example, the (x2, x3) scatter plot seem to be disconnected from the others. Selecting these points, using the (dashed) rectangle shown in Figure 1.11(a), highlights the selected points in all of the other scatter plots and leads to the display in Figure 1.11(a). Further checking revealed that specimens 16-21, specimen 34, and specimens 38-41 were actually specimens from an older roll of paper that was included in order to have enough plies in the cardboard being manufactured.

[Figure 1.9: Scatter plots for the paper-quality data of Table 1.2.]
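A linked, brushable display requires interactive software, but the static 3 × 3 scatter-plot array of Figure 1.9 is straightforward to sketch. Below is one way to do it with matplotlib; the randomly generated X is only a hypothetical stand-in for the three paper-quality variables.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the paper-quality measurements
rng = np.random.default_rng(0)
X = rng.normal(size=(41, 3))
labels = ["Density (x1)", "Machine (x2)", "Cross (x3)"]

p = X.shape[1]
fig, axes = plt.subplots(p, p, figsize=(7, 7))
for i in range(p):
    for j in range(p):
        ax = axes[i, j]
        if i == j:
            # variable name on the diagonal, as in Figure 1.9
            ax.text(0.5, 0.5, labels[i], ha="center", va="center")
            ax.set_xticks([]); ax.set_yticks([])
        else:
            ax.scatter(X[:, j], X[:, i], s=8)
plt.tight_layout()
plt.show()
```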
[Figure 1.10: Modified scatter plots for the paper-quality data with outlier (25) (a) selected and (b) deleted.]
[Figure 1.11: Modified scatter plots with (a) group of points selected and (b) points, including specimen 25, deleted and the scatter plots rescaled.]
Deleting the outlier and the cases corresponding to the older paper and adjusting the ranges of the remaining observations leads to the scatter plots in Figure 1.11(b).

The operation of highlighting points corresponding to a selected range of one of the variables is called brushing. Brushing could begin with a rectangle, as in Figure 1.11(a), but then the brush could be moved to provide a sequence of highlighted points. The process can be stopped at any time to provide a snapshot of the current situation. ■

Scatter plots like those in Example 1.8 are extremely useful aids in data analysis. Another important new graphical technique uses software that allows the data analyst to view high-dimensional data as slices of various three-dimensional perspectives. This can be done dynamically and continuously until informative views are obtained. A comprehensive discussion of dynamic graphical methods is available in [1]. A strategy for on-line multivariate exploratory graphical analysis, motivated by the need for a routine procedure for searching for structure in multivariate data, is given in [32].

Example 1.9 (Rotated plots in three dimensions) Four different measurements of lumber stiffness are given in Table 4.3, page 186. In Example 4.14, specimen (board) 16 and possibly specimen (board) 9 are identified as unusual observations. Figures 1.12(a), (b), and (c) contain perspectives of the stiffness data in the x1, x2, x3 space. These views were obtained by continually rotating and turning the three-dimensional coordinate axes. Spinning the coordinate axes allows one to get a better understanding of the three-dimensional aspects of the data.

[Figure 1.12: Three-dimensional perspectives for the lumber stiffness data: (a) outliers clear; (b) outliers masked; (c) specimen 9 large; (d) good view of x2, x3, x4 space.]
Figure 1.12(d) gives one picture of the stiffness data in x2, x3, x4 space. Notice that Figures 1.12(a) and (d) visually confirm specimens 9 and 16 as outliers. Specimen 9 is very large in all three coordinates. A counterclockwise-like rotation of the axes in Figure 1.12(a) produces Figure 1.12(b), and the two unusual observations are masked in this view. A further spinning of the x2, x3 axes gives Figure 1.12(c); one of the outliers (16) is now hidden. Additional insights can sometimes be gleaned from visual inspection of the slowly spinning data. It is this dynamic aspect that statisticians are just beginning to understand and exploit. ■

Plots like those in Figure 1.12 allow one to identify readily observations that do not conform to the rest of the data and that may heavily influence inferences based on standard data-generating models.

Graphs of Growth Curves

When the height of a young child is measured at each birthday, the points can be plotted and then connected by lines to produce a graph. This is an example of a growth curve. In general, repeated measurements of the same characteristic on the same unit or subject can give rise to a growth curve if an increasing, decreasing, or an increasing followed by a decreasing, pattern is expected.

Example 1.10 (Arrays of growth curves) The Alaska Fish and Game Department monitors grizzly bears with the goal of maintaining a healthy population. Bears are shot with a dart to induce sleep and weighed on a scale hanging from a tripod. Measurements of length are taken with a steel tape. Table 1.4 gives the weights (wt) in kilograms and lengths (lngth) in centimeters of seven female bears at 2, 3, 4, and 5 years of age.

Table 1.4 Female Bear Data
Bear   Wt 2   Wt 3   Wt 4   Wt 5   Lngth 2   Lngth 3   Lngth 4   Lngth 5
1        48     59     95     82       141       157       168       183
2        59     68    102    102       140       168       174       170
3        61     77     93    107       145       162       172       177
4        54     43    104    104       146       159       176       171
5       100    145    185    247       150       158       168       175
6        68     82     95    118       142       140       178       189
7        68     95    109    111       139       171       176       175
Source: Data courtesy of H. Roberts.

First, for each bear, we plot the weights versus the ages and then connect the weights at successive years by straight lines. This gives an approximation to the growth curve for weight. Figure 1.13 shows the growth curves for all seven bears. The noticeable exception to a common pattern is the curve for bear 5. Is this an outlier or just natural variation in the population? In the field, bears are weighed on a scale that
reads pounds. Further inspection revealed that, in this case, an assistant later failed to convert the field readings to kilograms when creating the electronic database. The correct weights are (45, 66, 84, 112) kilograms.

[Figure 1.13: Combined growth curves for weight for seven female grizzly bears.]

Because it can be difficult to inspect visually the individual growth curves in a combined plot, the individual curves should be replotted in an array where similarities and differences are easily observed. Figure 1.14 gives the array of seven curves for weight. Some growth curves look linear and others quadratic.

[Figure 1.14: Individual growth curves for weight for female grizzly bears.]
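Both displays are easy to sketch in code. The following minimal example uses the Table 1.4 weights as transcribed above (so it inherits any transcription uncertainty) and simply joins the yearly weights with straight lines, first in one combined plot and then as an array of panels.

```python
import matplotlib.pyplot as plt

ages = [2, 3, 4, 5]
# Weights (kg), one row per bear, from Table 1.4 as transcribed above
weights = [[48, 59, 95, 82], [59, 68, 102, 102], [61, 77, 93, 107],
           [54, 43, 104, 104], [100, 145, 185, 247], [68, 82, 95, 118],
           [68, 95, 109, 111]]

# Combined plot (as in Figure 1.13): all growth curves on one set of axes
plt.figure()
for w in weights:
    plt.plot(ages, w)
plt.xlabel("Year"); plt.ylabel("Weight (kg)")

# Array of individual curves (as in Figure 1.14): one small panel per bear
fig, axes = plt.subplots(1, 7, sharey=True, figsize=(12, 2))
for ax, w in zip(axes, weights):
    ax.plot(ages, w)
plt.show()
```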
[Figure 1.15: Individual growth curves for length for female grizzly bears.]

Figure 1.15 gives a growth curve array for length. One bear seemed to get shorter from 2 to 3 years old, but the researcher knows that the steel tape measurement of length can be thrown off by the bear's posture when sedated. ■

We now turn to two popular pictorial representations of multivariate data in two dimensions: stars and Chernoff faces.

Stars

Suppose each data unit consists of nonnegative observations on p ≥ 2 variables. In two dimensions, we can construct circles of a fixed (reference) radius with p equally spaced rays emanating from the center of the circle. The lengths of the rays represent the values of the variables. The ends of the rays can be connected with straight lines to form a star. Each star represents a multivariate observation, and the stars can be grouped according to their (subjective) similarities.

It is often helpful, when constructing the stars, to standardize the observations. In this case some of the observations will be negative. The observations can then be reexpressed so that the center of the circle represents the smallest standardized observation within the entire data set.

Example 1.11 (Utility data as stars) Stars representing the first 5 of the 22 public utility firms in Table 12.4, page 688, are shown in Figure 1.16. There are eight variables; consequently, the stars are distorted octagons.
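The star construction just described translates directly into code: standardize, shift so the smallest standardized value sits at the circle's center, then plot p equally spaced rays and join their tips. The sketch below is purely illustrative (the data are random stand-ins for five units on p = 8 variables).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))               # 5 illustrative units, p = 8 variables

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each variable
Z = Z - Z.min()                           # smallest standardized value -> center

p = Z.shape[1]
angles = np.pi / 2 - 2 * np.pi * np.arange(p) / p  # clockwise from 12 o'clock

fig, axes = plt.subplots(1, 5, figsize=(12, 2.5))
for ax, z in zip(axes, Z):
    # ray tips at distance z_k along the kth ray; close the polygon
    tips = np.column_stack([z * np.cos(angles), z * np.sin(angles)])
    tips = np.vstack([tips, tips[0]])
    ax.plot(tips[:, 0], tips[:, 1])
    ax.set_aspect("equal"); ax.set_xticks([]); ax.set_yticks([])
plt.show()
```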
[Figure 1.16: Stars for the first five public utilities: (1) Arizona Public Service, (2) Boston Edison Co., (3) Central Louisiana Electric Co., (4) Commonwealth Edison Co., (5) Consolidated Edison Co. (NY).]

The observations on all variables were standardized. Among the first five utilities, the smallest standardized observation for any variable was -1.6. Treating this value as zero, the variables are plotted on identical scales along eight equiangular rays originating from the center of the circle. The variables are ordered in a clockwise direction, beginning in the 12 o'clock position.

At first glance, none of these utilities appears to be similar to any other. However, because of the way the stars are constructed, each variable gets equal weight in the visual impression. If we concentrate on the variables 6 (sales in kilowatt-hour [kWh] use per year) and 8 (total fuel costs in cents per kWh), then Boston Edison and Consolidated Edison are similar (small variable 6, large variable 8), and Arizona Public Service, Central Louisiana Electric, and Commonwealth Edison are similar (moderate variable 6, moderate variable 8). ■

Chernoff Faces

People react to faces. Chernoff [4] suggested representing p-dimensional observations as a two-dimensional face whose characteristics (face shape, mouth curvature, nose length, eye size, pupil position, and so forth) are determined by the measurements on the p variables.
As originally designed, Chernoff faces can handle up to 18 variables. The assignment of variables to facial features is done by the experimenter, and different choices produce different results. Some iteration is usually necessary before satisfactory representations are achieved.

Chernoff faces appear to be most useful for verifying (1) an initial grouping suggested by subject-matter knowledge and intuition or (2) final groupings produced by clustering algorithms.

Example 1.12 (Utility data as Chernoff faces) From the data in Table 12.4, the 22 public utility companies were represented as Chernoff faces. We have the following correspondences:

Variable                                     Facial characteristic
X1: Fixed-charge coverage                    Half-height of face
X2: Rate of return on capital                Face width
X3: Cost per kW capacity in place            Position of center of mouth
X4: Annual load factor                       Slant of eyes
X5: Peak kWh demand growth from 1974         Eccentricity (height/width) of eyes
X6: Sales (kWh use per year)                 Half-length of eye
X7: Percent nuclear                          Curvature of mouth
X8: Total fuel costs (cents per kWh)         Length of nose

The Chernoff faces are shown in Figure 1.17. We have subjectively grouped "similar" faces into seven clusters. If a smaller number of clusters is desired, we might combine clusters 5, 6, and 7 and, perhaps, clusters 2 and 3 to obtain four or five clusters. (See [24].) In the figure, the firms group largely according to geographical location. ■

Constructing Chernoff faces is a task that must be done with the aid of a computer. The data are ordinarily standardized within the computer program as part of the process for determining the locations, sizes, and orientations of the facial characteristics. With some training, we can use Chernoff faces to communicate similarities or dissimilarities, as the next example indicates.

Example 1.13 (Using Chernoff faces to show changes over time) Figure 1.18 illustrates an additional use of Chernoff faces. In the figure, the faces are used to track the financial well-being of a company over time. As indicated, each facial feature represents a single financial indicator, and the longitudinal changes in these indicators are thus evident at a glance. ■

[Figure 1.17: Chernoff faces for 22 public utilities, grouped into seven clusters.]
[Figure 1.18: Chernoff faces over time.]

Chernoff faces have also been used to display differences in multivariate observations in two dimensions. For example, the two-dimensional coordinate axes might represent latitude and longitude (geographical location), and the faces might represent multivariate measurements on several U.S. cities. Additional examples of this kind are discussed in [30].

There are several ingenious ways to picture multivariate data in two dimensions. We have described some of them. Further advances are possible and will almost certainly take advantage of improved computer graphics.

1.5 Distance

Although they may at first appear formidable, most multivariate techniques are based upon the simple concept of distance. Straight-line, or Euclidean, distance should be familiar. If we consider the point P = (x1, x2) in the plane, the straight-line distance, d(O, P), from P to the origin O = (0, 0) is, according to the Pythagorean theorem,

d(O, P) = \sqrt{x_1^2 + x_2^2}                                    (1-9)

The situation is illustrated in Figure 1.19. In general, if the point P has p coordinates so that P = (x1, x2, ..., xp), the straight-line distance from P to the origin O = (0, 0, ..., 0) is

d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}                   (1-10)

(See Chapter 2.) All points (x1, x2, ..., xp) that lie a constant squared distance, such as c², from the origin satisfy the equation

d^2(O, P) = x_1^2 + x_2^2 + \cdots + x_p^2 = c^2                  (1-11)

Because this is the equation of a hypersphere (a circle if p = 2), points equidistant from the origin lie on a hypersphere.

[Figure 1.19: Distance given by the Pythagorean theorem.]

The straight-line distance between two arbitrary points P and Q with coordinates P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp) is given by

d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}   (1-12)

Straight-line, or Euclidean, distance is unsatisfactory for most statistical purposes. This is because each coordinate contributes equally to the calculation of Euclidean distance. When the coordinates represent measurements that are subject to random fluctuations of differing magnitudes, it is often desirable to weight coordinates subject to a great deal of variability less heavily than those that are not highly variable. This suggests a different measure of distance.

Our purpose now is to develop a "statistical" distance that accounts for differences in variation and, in due course, the presence of correlation.
Because our choice will depend upon the sample variances and covariances, at this point we use the term statistical distance to distinguish it from ordinary Euclidean distance. It is statistical distance that is fundamental to multivariate analysis.

To begin, we take as fixed the set of observations graphed as the p-dimensional scatter plot. From these, we shall construct a measure of distance from the origin to a point P = (x1, x2, ..., xp). In our arguments, the coordinates (x1, x2, ..., xp) of P can vary to produce different locations for the point. The data that determine distance will, however, remain fixed.

To illustrate, suppose we have n pairs of measurements on two variables each having mean zero. Call the variables x1 and x2, and assume that the x1 measurements vary independently of the x2 measurements.¹ In addition, assume that the variability in the x1 measurements is larger than the variability in the x2 measurements. A scatter plot of the data would look something like the one pictured in Figure 1.20.

¹ At this point, "independently" means that the x2 measurements cannot be predicted with any accuracy from the x1 measurements, and vice versa.

[Figure 1.20: A scatter plot with greater variability in the x1 direction than in the x2 direction.]

Glancing at Figure 1.20, we see that values which are a given deviation from the origin in the x1 direction are not as "surprising" or "unusual" as are values equidistant from the origin in the x2 direction. This is because the inherent variability in the x1 direction is greater than the variability in the x2 direction. Consequently, large x1 coordinates (in absolute value) are not as unexpected as large x2 coordinates. It seems reasonable, then, to weight an x2 coordinate more heavily than an x1 coordinate of the same value when computing the "distance" to the origin.

One way to proceed is to divide each coordinate by the sample standard deviation. Therefore, upon division by the standard deviations, we have the "standardized" coordinates x1* = x1/\sqrt{s_{11}} and x2* = x2/\sqrt{s_{22}}. The standardized coordinates are now on an equal footing with one another. After taking the differences in variability into account, we determine distance using the standard Euclidean formula.

Thus, a statistical distance of the point P = (x1, x2) from the origin O = (0, 0) can be computed from its standardized coordinates as

d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}   (1-13)

Comparing (1-13) with (1-9), we see that the difference between the two expressions is due to the weights k1 = 1/s11 and k2 = 1/s22 attached to x1² and x2² in (1-13). Note that if the sample variances are the same, k1 = k2, then x1² and x2² will receive the same weight. In cases where the weights are the same, it is convenient to ignore the common divisor and use the usual Euclidean distance formula. In other words, if the variability in the x1 direction is the same as the variability in the x2 direction, and the x1 values vary independently of the x2 values, Euclidean distance is appropriate.

Using (1-13), we see that all points which have coordinates (x1, x2) and are a constant squared distance c² from the origin must satisfy

\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2                 (1-14)

Equation (1-14) is the equation of an ellipse centered at the origin whose major and minor axes coincide with the coordinate axes. That is, the statistical distance in (1-13) has an ellipse as the locus of all points a constant distance from the origin. This general case is shown in Figure 1.21.

[Figure 1.21: The ellipse of constant statistical distance d²(O, P) = x1²/s11 + x2²/s22 = c².]

Example 1.14 (Calculating a statistical distance) A set of paired measurements (x1, x2) on two variables yields x̄1 = x̄2 = 0, s11 = 4, and s22 = 1. Suppose the x1 measurements are unrelated to the x2 measurements; that is, measurements within a pair vary independently of one another. Since the sample variances are unequal, we measure the square of the distance of an arbitrary point P = (x1, x2) to the origin O = (0, 0) by

d^2(O, P) = \frac{x_1^2}{4} + \frac{x_2^2}{1}

All points (x1, x2) that are a constant distance 1 from the origin satisfy the equation

\frac{x_1^2}{4} + \frac{x_2^2}{1} = 1

The coordinates of some points a unit distance from the origin are presented in the following table:
Coordinates (x1, x2)     Distance: x1²/4 + x2²/1 = 1
(0, 1)                   0²/4 + 1²/1 = 1
(0, -1)                  0²/4 + (-1)²/1 = 1
(2, 0)                   2²/4 + 0²/1 = 1
(1, √3/2)                1²/4 + (√3/2)²/1 = 1

A plot of the equation x1²/4 + x2²/1 = 1 is an ellipse centered at (0, 0) whose major axis lies along the x1 coordinate axis and whose minor axis lies along the x2 coordinate axis. The half-lengths of these major and minor axes are √4 = 2 and √1 = 1, respectively. The ellipse of unit distance is plotted in Figure 1.22. All points on the ellipse are regarded as being the same statistical distance from the origin; in this case, a distance of 1. ■

[Figure 1.22: Ellipse of unit distance, x1²/4 + x2²/1 = 1.]

The expression in (1-13) can be generalized to accommodate the calculation of statistical distance from an arbitrary point P = (x1, x2) to any fixed point Q = (y1, y2). If we assume that the coordinate variables vary independently of one another, the distance from P to Q is given by

d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}}}   (1-15)

The extension of this statistical distance to more than two dimensions is straightforward. Let the points P and Q have p coordinates such that P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp). Suppose Q is a fixed point [it may be the origin O = (0, 0, ..., 0)] and the coordinate variables vary independently of one another. Let s11, s22, ..., spp be sample variances constructed from n measurements on x1, x2, ..., xp, respectively. Then the statistical distance from P to Q is

d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}   (1-16)

All points P that are a constant squared distance from Q lie on a hyperellipsoid centered at Q whose major and minor axes are parallel to the coordinate axes. We note the following:

1. The distance of P to the origin O is obtained by setting y1 = y2 = ... = yp = 0 in (1-16).
2. If s11 = s22 = ... = spp, the Euclidean distance formula in (1-12) is appropriate.

The distance in (1-16) still does not include most of the important cases we shall encounter, because of the assumption of independent coordinates. The scatter plot in Figure 1.23 depicts a two-dimensional situation in which the x1 measurements do not vary independently of the x2 measurements. In fact, the coordinates of the pairs (x1, x2) exhibit a tendency to be large or small together, and the sample correlation coefficient is positive. Moreover, the variability in the x2 direction is larger than the variability in the x1 direction.

[Figure 1.23: A scatter plot for positively correlated measurements and a rotated coordinate system.]

What is a meaningful measure of distance when the variability in the x1 direction is different from the variability in the x2 direction and the variables x1 and x2 are correlated? Actually, we can use what we have already introduced, provided that we look at things in the right way. From Figure 1.23, we see that if we rotate the original coordinate system through the angle θ while keeping the scatter fixed and label the rotated axes x̃1 and x̃2, the scatter in terms of the new axes looks very much like that in Figure 1.20. (You may wish to turn the book to place the x̃1 and x̃2 axes in their customary positions.) This suggests that we calculate the sample variances using the x̃1 and x̃2 coordinates and measure distance as in Equation (1-13). That is, with reference to the x̃1 and x̃2 axes, we define the distance from the point P = (x̃1, x̃2) to the origin O = (0, 0) as

d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}}   (1-17)

where s̃11 and s̃22 denote the sample variances computed with the x̃1 and x̃2 measurements.
The relation between the original coordinates (x1, x2) and the rotated coordinates (x̃1, x̃2) is provided by

\tilde{x}_1 = x_1 \cos(\theta) + x_2 \sin(\theta)
\tilde{x}_2 = -x_1 \sin(\theta) + x_2 \cos(\theta)                (1-18)

Given the relations in (1-18), we can formally substitute for x̃1 and x̃2 in (1-17) and express the distance in terms of the original coordinates.

After some straightforward algebraic manipulations, the distance from P = (x1, x2) to the origin O = (0, 0) can be written in terms of the original coordinates x1 and x2 of P as

d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}   (1-19)

where the a's are numbers such that the distance is nonnegative for all possible values of x1 and x2. Here a11, a12, and a22 are determined by the angle θ, and s11, s12, and s22 calculated from the original data.² The particular forms for a11, a12, and a22 are not important at this point. What is important is the appearance of the cross-product term 2a12x1x2 necessitated by the nonzero correlation r12.

Equation (1-19) can be compared with (1-13). The expression in (1-13) can be regarded as a special case of (1-19) with a11 = 1/s11, a22 = 1/s22, and a12 = 0.

In general, the statistical distance of the point P = (x1, x2) from the fixed point Q = (y1, y2) for situations in which the variables are correlated has the general form

d(P, Q) = \sqrt{a_{11}(x_1 - y_1)^2 + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2}   (1-20)

and can always be computed once a11, a12, and a22 are known. In addition, the coordinates of all points P = (x1, x2) that are a constant squared distance c² from Q satisfy

a_{11}(x_1 - y_1)^2 + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2 = c^2   (1-21)

By definition, this is the equation of an ellipse centered at Q. The graph of such an equation is displayed in Figure 1.24. The major (long) and minor (short) axes are indicated. They are parallel to the x̃1 and x̃2 axes.

[Figure 1.24: Ellipse of points a constant distance from the point Q.]

The generalization of the distance formulas of (1-19) and (1-20) to p dimensions is straightforward.

² Specifically,

a_{11} = \frac{\cos^2(\theta)}{\cos^2(\theta) s_{11} + 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{22}} + \frac{\sin^2(\theta)}{\cos^2(\theta) s_{22} - 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{11}}

a_{22} = \frac{\sin^2(\theta)}{\cos^2(\theta) s_{11} + 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{22}} + \frac{\cos^2(\theta)}{\cos^2(\theta) s_{22} - 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{11}}

and

a_{12} = \frac{\cos(\theta)\sin(\theta)}{\cos^2(\theta) s_{11} + 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{22}} - \frac{\sin(\theta)\cos(\theta)}{\cos^2(\theta) s_{22} - 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{11}}
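The algebra of footnote 2 can be checked numerically: rotate bivariate data by θ, compute s̃11 and s̃22, and confirm that (1-17) agrees with the quadratic form (1-19) built from a11, a12, and a22. A sketch under made-up data (variances use divisor n, as elsewhere in this chapter):

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.6 * x1 + 0.4 * rng.normal(size=200)   # positively correlated
x1, x2 = x1 - x1.mean(), x2 - x2.mean()      # mean zero, as in the text

s11, s22, s12 = (x1 * x1).mean(), (x2 * x2).mean(), (x1 * x2).mean()

theta = np.radians(26.0)
c, s = np.cos(theta), np.sin(theta)

# Rotated coordinates (1-18) and their sample variances
xt1, xt2 = c * x1 + s * x2, -s * x1 + c * x2
st11, st22 = (xt1 * xt1).mean(), (xt2 * xt2).mean()

# Coefficients from footnote 2; d1 and d2 equal the rotated-axis variances
d1 = c * c * s11 + 2 * s * c * s12 + s * s * s22
d2 = c * c * s22 - 2 * s * c * s12 + s * s * s11
a11 = c * c / d1 + s * s / d2
a22 = s * s / d1 + c * c / d2
a12 = c * s / d1 - s * c / d2

# Distance of the point P = (x1[0], x2[0]) from the origin, both ways
d_rotated = np.sqrt(xt1[0] ** 2 / st11 + xt2[0] ** 2 / st22)   # (1-17)
d_original = np.sqrt(a11 * x1[0] ** 2 + 2 * a12 * x1[0] * x2[0]
                     + a22 * x2[0] ** 2)                       # (1-19)
print(np.isclose(d_rotated, d_original))   # True
```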
0. The a.*'s cannot be arbitrary numbers. We note that the distances in (122) and (123) are completely determined by the coefficients (weights) O. i = 1. 0 = (0. since they are multiplied by 2 in the distance formulas./s with i # k are displayed twice.10.Yrd(xp . ..p(xpl..Y1)(x3. the entries in this array specify the distance functions. they must be such that the computed distance is nonnegative for every pair of points. and Jet Q == (y1 . A hyperellipsoid resembles a football when p = 3.. .P) = Va 11xf + rra 22 x~ + ··· + aPPx~ + 2a 12 x 1x2 + 2a 13 x 1x 3 + · · · + 2ap.Q) = \j" . 2.. 0) denote the origin. y1 ) + azz(x2.1. . }?_. . .•. Yp) be a specified fixed point Then the distances from P to 0 and from P to Q have the general forms d(O... It is possible to display these quadratic fonns in a simpler manner using matrix algebra.k. (See Exercise 1.) + 2a 13 (xJ. l ou OJ 2 a12 022 (124) OJ p o2 p .pxplxp (122) and ~d(P. we shall do so in Section 2.) Contours of constant distances computed from (122) and (123) are hyperellipsoids. in particular.2. p.. it is impossible to visualize in more than three dimensions. These coefficients can be set out in the rectangular array 3 where the o.}2) + · · · + app(xp . k "" 1.3 of Chapter 2. positive definite quadratic forms.Y1)(x2 .Yp) 2 2 2 + 2an(xl .24 Ellipse of points a constant distance from the point Q. . p..>?. :Yrhe algebraic expressions for the squares of the distances in (122) and (123) are known as quadratic forms and.36 Chapter 1 Aspec!s of Multivariate Analysis / / / ' ' '' Figure 1.YJ) + ··· + 2ar 1 . . . .Yp)] (123) where the a's are numbers such that the distances are always nonnegative. Consequently.
The need to consider statistical rather than Euclidean distance is illustrated heuristically in Figure 1.25. Figure 1.25 depicts a cluster of points whose center of gravity (sample mean) is indicated by the point Q. Consider the Euclidean distances from the point Q to the point P and the origin O. The Euclidean distance from Q to P is larger than the Euclidean distance from Q to O. However, P appears to be more like the points in the cluster than does the origin. If we take into account the variability of the points in the cluster and measure distance by the statistical distance in (1-20), then Q will be closer to P than to O. This result seems reasonable, given the nature of the scatter.

[Figure 1.25: A cluster of points relative to a point P and the origin.]

Other measures of distance can be advanced. (See Exercise 1.12.) At times, it is useful to consider distances that are not related to circles or ellipses. Any distance measure d(P, Q) between two points P and Q is valid provided that it satisfies the following properties, where R is any other intermediate point:

d(P, Q) = d(Q, P)
d(P, Q) > 0 if P ≠ Q
d(P, Q) = 0 if P = Q                                              (1-25)
d(P, Q) ≤ d(P, R) + d(R, Q)   (triangle inequality)

1.6 Final Comments

We have attempted to motivate the study of multivariate analysis and to provide you with some rudimentary, but important, methods for organizing, summarizing, and displaying data. In addition, a general concept of distance has been introduced that will be used repeatedly in later chapters.

Exercises

1.1. Consider the seven pairs of measurements (x1, x2) plotted in Figure 1.1:

x1: 3    4    2    6    8    2    5
x2: 5    5.5  4    7    10   5    7.5

Calculate the sample means x̄1 and x̄2, the sample variances s11 and s22, and the sample covariance s12.

1.2. A morning newspaper lists the following used-car prices for a foreign compact with age x1 measured in years and selling price x2 measured in thousands of dollars:

x1: 1      2      3      3      4      5      6     8     9     11
x2: 18.95  19.00  17.95  15.54  14.00  12.95  8.94  7.49  6.00  3.99

(a) Construct a scatter plot of the data and marginal dot diagrams.
(b) Infer the sign of the sample covariance s12 from the scatter plot.
(c) Compute the sample means x̄1 and x̄2 and the sample variances s11 and s22. Compute the sample covariance s12 and the sample correlation coefficient r12. Interpret these quantities.
(d) Display the sample mean array x̄, the sample variance-covariance array Sn, and the sample correlation array R using (1-8).

1.3. The following are five measurements on the variables x1, x2, and x3:

x1: 9    2    6    5    8
x2: 12   8    6    4    10
x3: 3    4    0    2    1

Find the arrays x̄, Sn, and R.

1.4. The world's 10 largest companies yield the following data:

The World's 10 Largest Companies¹
Company                x1 = sales    x2 = profits    x3 = assets
                       (billions)    (billions)      (billions)
Citigroup                 108.28         17.05         1,484.10
General Electric          152.36         16.59           750.33
American Intl Group        95.04         10.91           766.42
Bank of America            65.45         14.14         1,110.46
HSBC Group                 62.97          9.52         1,031.29
ExxonMobil                263.99         25.33           195.26
Royal Dutch/Shell         265.19         18.54           193.83
BP                        285.06         15.73           191.11
ING Group                  92.01          8.10         1,175.16
Toyota Motor              165.68         11.13           211.15

¹ From www.Forbes.com, partially based on Forbes, The Forbes Global 2000, April 18, 2005.

(a) Plot the scatter diagram and marginal dot diagrams for variables x1 and x2. Comment on the appearance of the diagrams.
(b) Compute x̄1, x̄2, s11, s22, s12, and r12. Interpret r12.

1.5. Use the data in Exercise 1.4.
(a) Plot the scatter diagrams and dot diagrams for (x2, x3) and (x1, x3). Comment on the patterns.
(b) Compute the x̄, Sn, and R arrays for (x1, x2, x3).
1.6. The data in Table 1.5 are 42 measurements on air-pollution variables recorded at 12:00 noon in the Los Angeles area on different days. (See also the air-pollution data on the web at www.prenhall.com/statistics.)
(a) Plot the marginal dot diagrams for all the variables.
(b) Construct the x̄, Sn, and R arrays, and interpret the entries in R.

[Table 1.5: Air-Pollution Data, comprising the 42 daily measurements on wind (x1), solar radiation (x2), CO (x3), NO (x4), NO2 (x5), O3 (x6), and HC (x7). Source: Data courtesy of Professor G. C. Tiao.]
1.7. You are given the following n = 3 observations on p = 2 variables:

Variable 1: x11 = 2, x21 = 3, x31 = 4
Variable 2: x12 = 1, x22 = 2, x32 = 4

(a) Plot the pairs of observations in the two-dimensional "variable space." That is, construct a two-dimensional scatter plot of the data.
(b) Plot the data as two points in the three-dimensional "item space."

1.8. Evaluate the distance of the point P = (-1, -1) to the point Q = (1, 0) using the Euclidean distance formula in (1-12) with p = 2 and using the statistical distance in (1-20) with a11 = 1/3, a22 = 4/27, and a12 = 1/9. Sketch the locus of points that are a constant squared statistical distance 1 from the point Q.

1.9. Consider the following eight pairs of measurements on two variables x1 and x2:

x1: -6   -3   -2    1    2    5    6    8
x2: -2   -3    1   -1    2    1    5    3

(a) Plot the data as a scatter diagram, and compute s11, s22, and s12.
(b) Using (1-18), calculate the corresponding measurements on variables x̃1 and x̃2, assuming that the original coordinate axes are rotated through an angle of θ = 26° [given cos(26°) = .899 and sin(26°) = .438].
(c) Using the x̃1 and x̃2 measurements from (b), compute the sample variances s̃11 and s̃22.
(d) Consider the new pair of measurements (x1, x2) = (4, -2). Transform these to measurements on x̃1 and x̃2 using (1-18), and calculate the distance d(O, P) of the new point P = (x̃1, x̃2) from the origin O = (0, 0) using (1-17). Note: You will need s̃11 and s̃22 from (c).
(e) Calculate the distance from P = (4, -2) to the origin O = (0, 0) using (1-19) and the expressions for a11, a22, and a12 in footnote 2. Note: You will need s11, s22, and s12 from (a). Compare the distance calculated here with the distance calculated using the x̃1 and x̃2 values in (d). (Within rounding error, the numbers should be the same.)

1.10. Are the following distance functions valid for distance from the origin? Explain.
(a) x1² + 4x2² + x1x2 = (distance)²
(b) x1² - 2x2² = (distance)²

1.11. Verify that distance defined by (1-20) with a11 = 4, a22 = 1, and a12 = -1 satisfies the first three conditions in (1-25). (The triangle inequality is more difficult to verify.)

1.12. Define the distance from the point P = (x1, x2) to the origin O = (0, 0) as

d(O, P) = max(|x1|, |x2|)

(a) Compute the distance from P = (-3, 4) to the origin.
(b) Plot the locus of points whose squared distance from the origin is 1.
(c) Generalize the foregoing distance expression to points in p dimensions.

1.13. A large city has major roads laid out in a grid pattern, as indicated in the following diagram. Streets 1 through 5 run north-south (NS), and streets A through E run east-west (EW). Suppose there are retail stores located at intersections (A, 2), (E, 3), and (C, 5).
[Diagram: a grid of city streets; NS streets are numbered 1 through 5 and EW streets are lettered A through E, with retail stores marked at intersections (A, 2), (E, 3), and (C, 5).]

Assume the distance along a street between two intersections in either the NS or EW direction is 1 unit. Define the distance between any two intersections (points) on the grid to be the "city block" distance. (For example, the distance between intersections (D, 1) and (C, 2), which we might call d((D, 1), (C, 2)), is given by d((D, 1), (C, 2)) = d((D, 1), (D, 2)) + d((D, 2), (C, 2)) = 1 + 1 = 2. Also, d((D, 1), (C, 2)) = d((D, 1), (C, 1)) + d((C, 1), (C, 2)) = 1 + 1 = 2.)

Locate a supply facility (warehouse) at an intersection such that the sum of the distances from the warehouse to the three retail stores is minimized.

The following exercises contain fairly extensive data sets. A computer may be necessary for the required calculations.

1.14. Table 1.6 contains some of the raw data discussed in Section 1.2. (See also the multiple-sclerosis data on the web at www.prenhall.com/statistics.) Two different visual stimuli (S1 and S2) produced responses in both the left eye (L) and the right eye (R) of subjects in the study groups. The values recorded in the table include x1 (subject's age); x2 (total response of both eyes to stimulus S1, that is, S1L + S1R); x3 (difference between responses of eyes to stimulus S1, |S1L - S1R|); and so forth.
(a) Plot the two-dimensional scatter diagram for the variables x2 and x4 for the multiple-sclerosis group. Comment on the appearance of the diagram.
(b) Compute the x̄, Sn, and R arrays for the non-multiple-sclerosis and multiple-sclerosis groups separately.

[Table 1.6: Multiple-Sclerosis Data, recording x1 (age), x2 (S1L + S1R), x3 (|S1L - S1R|), x4 (S2L + S2R), and x5 (|S2L - S2R|) for the 69 non-multiple-sclerosis subjects and the 29 multiple-sclerosis subjects. Source: Data courtesy of Dr. G. G. Celesia.]

1.15. Some of the 98 measurements described in Section 1.2 are listed in Table 1.7. (See also the radiotherapy data on the web at www.prenhall.com/statistics.) The data consist of average ratings over the course of treatment for patients undergoing radiotherapy. Variables measured include x1 (number of symptoms, such as sore throat or nausea); x2 (amount of activity, on a 1-5 scale); x3 (amount of sleep, on a 1-5 scale); x4 (amount of food consumed, on a 1-3 scale); x5 (appetite, on a 1-5 scale); and x6 (skin reaction, on a 0-3 scale).
(a) Construct the two-dimensional scatter plot for variables x2 and x3 and the marginal dot diagrams (or histograms). Do there appear to be any errors in the x3 data?
(b) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations.

[Table 1.7: Radiotherapy Data, with columns x1 (symptoms), x2 (activity), x3 (sleep), x4 (eat), x5 (appetite), and x6 (skin reaction). Source: Data courtesy of Mrs. Annette Tealey, R.N. Values of x2 and x3 less than 1.0 are due to errors in the data-collection process; rows containing such values may be omitted.]

1.16. At the start of a study to determine whether exercise or dietary supplements would slow bone loss in older women, an investigator measured the mineral content of bones by photon absorptiometry. Measurements were recorded for three bones on the dominant and nondominant sides and are shown in Table 1.8. (See also the mineral-content data on the web at www.prenhall.com/statistics.) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations.

[Table 1.8: Mineral Content in Bones, recording dominant radius, radius, dominant humerus, humerus, dominant ulna, and ulna measurements for 25 subjects. Source: Data courtesy of Everett Smith.]

1.17. Some of the data described in Section 1.2 are listed in Table 1.9. (See also the national-track-records data on the web at www.prenhall.com/statistics.) The national track records for women in 54 countries can be examined for the relationships among the running events. Compute the x̄, Sn, and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100-meter) to the longer (marathon) running distances. Interpret these pairwise correlations.

1.18. Convert the national track records for women in Table 1.9 to speeds measured in meters per second. For example, the record speed for the 100-m dash for Argentinian women is 100 m/11.57 sec = 8.643 m/sec. Notice that the records for the 800-m, 1500-m, 3000-m, and marathon runs are measured in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Compute the x̄, Sn, and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100 m) to the longer (marathon) running distances. Interpret these pairwise correlations. Compare your results with the results you obtained in Exercise 1.17.

1.19. Create the scatter plot and boxplot displays of Figure 1.5 for (a) the mineral-content data in Table 1.8 and (b) the national-track-records data in Table 1.9.

[Table 1.9: National Track Records for Women, listing 100 m (s), 200 m (s), 400 m (s), 800 m (min), 1500 m (min), 3000 m (min), and marathon (min) times for 54 countries, from Argentina through the U.S.A. Source: IAAF/ATFS Track and Field Handbook for Helsinki 2005 (courtesy of Ottavio Castellini).]

1.20. Refer to the bankruptcy data in Table 11.4, page 657, and on the web at www.prenhall.com/statistics. Using appropriate computer software,
(a) View the entire data set in x1, x2, x3 space. Rotate the coordinate axes in various directions. Check for unusual observations.
(b) Highlight the set of points corresponding to the bankrupt firms. Examine various three-dimensional perspectives. Are there some orientations of three-dimensional space for which the bankrupt firms can be distinguished from the nonbankrupt firms? Are there observations in each of the two groups that are likely to have a significant impact on any rule developed to classify firms based on the sample means, variances, and covariances calculated from these data? (See Exercise 11.24.)

1.21. Refer to the milk transportation-cost data in Table 6.10, page 345, and on the web at www.prenhall.com/statistics. Using appropriate computer software,
(a) View the entire data set in three dimensions. Rotate the coordinate axes in various directions. Check for unusual observations.
(b) Highlight the set of points corresponding to gasoline trucks. Do any of the gasoline-truck points appear to be multivariate outliers? (See Exercise 6.17.) Are there some orientations of x1, x2, x3 space for which the set of points representing gasoline trucks can be readily distinguished from the set of points representing diesel trucks?

1.22. Refer to the oxygen-consumption data in Table 6.12, page 348, and on the web at www.prenhall.com/statistics. Using appropriate computer software,
(a) View the entire data set in three dimensions employing various combinations of three variables to represent the coordinate axes. Begin with the x1, x2, x3 space.
(b) Check this data set for outliers.

1.23. Using the data in Table 11.9, page 666, and on the web at www.prenhall.com/statistics, represent the cereals in each of the following ways.
(a) Stars.
(b) Chernoff faces. (Experiment with the assignment of variables to facial characteristics.)

1.24. Using the utility data in Table 12.4, page 688, and on the web at www.prenhall.com/statistics, represent the public utility companies as Chernoff faces with assignments of variables to facial characteristics different from those considered in Example 1.12. Compare your faces with the faces in Figure 1.17. Are different groupings indicated?

1.25. Using the data in Table 12.4, page 688, and on the web at www.prenhall.com/statistics, represent the 22 public utility companies as stars. Visually group the companies into four or five clusters.

1.26. The data in Table 1.10 (see the bull data on the web at www.prenhall.com/statistics) are the measured characteristics of 76 young (less than two years old) bulls sold at auction. Also included in the table are the selling prices (SalePr) of these bulls. The column headings (variables) are defined as follows:

Breed = 1 Angus, 5 Hereford, 8 Simental
YrHgt = Yearling height at shoulder (inches)
FtFrBody = Fat-free body (pounds)
PrctFFB = Percent fat-free body
Frame = Scale from 1 (small) to 8 (large)
BkFat = Back fat (inches)
SaleHt = Sale height at shoulder (inches)
SaleWt = Sale weight (pounds)

(a) Compute the x̄, Sn, and R arrays. Interpret the pairwise correlations. Do some of these variables appear to distinguish one breed from another?
(b) View the data in three dimensions using the variables Breed, Frame, and BkFat. Rotate the coordinate axes in various directions. Check for outliers. Are the breeds well separated in this coordinate system?
(c) Repeat part b using Breed, FtFrBody, and SaleHt. Which three-dimensional display appears to result in the best separation of the three breeds of bulls?

[Table 1.10: Data on Bulls, recording Breed, SalePr, YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt for the 76 bulls. Source: Data courtesy of Mark Ellersieck.]

1.27. Table 1.11 presents the 2005 attendance (millions) at the fifteen most visited national parks and their size (acres).
(a) Create a scatter plot and calculate the correlation coefficient.
(b) Identify the park that is unusual. Drop this point and recalculate the correlation coefficient. Comment on the effect of this one point on correlation.
(c) Would the correlation in Part b change if you measure size in square miles instead of acres? Explain.

Table 1.11 Attendance and Size of National Parks

National Park       Size (acres)   Visitors (millions)
Acadia                     47.4          2.05
Bryce Canyon               35.8          1.02
Cuyahoga Valley            32.9          2.53
Everglades              1,508.5          1.23
Grand Canyon            1,217.4          4.40
Grand Teton               310.0          2.46
Great Smoky               521.8          9.19
Hot Springs                 5.6          1.34
Olympic                   922.7          3.14
Mount Rainier             235.6          1.17
Rocky Mountain            265.8          2.80
Shenandoah                199.0          1.09
Yellowstone             2,219.8          2.84
Yosemite                  761.3          3.30
Zion                      146.6          2.59

References

1. Becker, R. A., W. S. Cleveland, and A. R. Wilks. "Dynamic Graphics for Data Analysis." Statistical Science, 2, no. 4 (1987), 355-395.
2. Benjamin, Y., and M. Igbaria. "Clustering Categories for Better Prediction of Computer Resources Utilization." Applied Statistics, 40, no. 2 (1991), 295-307.
3. Capon, N., J. Farley, J. Hulbert, and D. Lehman. "Profiles of Product Innovators among Large U.S. Manufacturers." Management Science, 38, no. 2 (1992), 157-169.
4. Chernoff, H. "Using Faces to Represent Points in K-Dimensional Space Graphically." Journal of the American Statistical Association, 68, no. 342 (1973), 361-368.
5. Cochran, W. G. Sampling Techniques (3rd ed.). New York: John Wiley, 1977.
6. Cochran, W. G., and G. M. Cox. Experimental Designs (2nd ed., paperback). New York: John Wiley, 1992.
7. Davis, J. C. "Information Contained in Sediment Size Analysis." Mathematical Geology, 2, no. 2 (1970), 105-112.
8. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43, no. 2 (1989), 110-115.
9. Dudoit, S., J. Fridlyand, and T. P. Speed. "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data." Journal of the American Statistical Association, 97, no. 457 (2002), 77-87.
10. Dunham, R. B., and D. J. Kravetz. "Canonical Correlation Analysis in a Predictive System." Journal of Experimental Education, 43, no. 4 (1975), 35-42.
11. Everitt, B. Graphical Techniques for Multivariate Data. New York: North-Holland, 1978.
12. Gable, G. G. "A Multidimensional Model of Client Success when Engaging External Consultants." Management Science, 42, no. 8 (1996), 1175-1198.
13. Halinar, J. C. "Principal Component Analysis in Plant Breeding." Unpublished report based on data collected by Dr. F. A. Bliss, University of Wisconsin.
14. Johnson, R. A., and G. K. Bhattacharyya. Statistics: Principles and Methods (5th ed.). New York: John Wiley, 2005.
15. Kim, L., and Y. Kim. "Innovation in a Newly Industrializing Country: A Multiple Discriminant Analysis." Management Science, 31, no. 3 (1985), 312-322.
16. Klatzky, S. R., and R. W. Hodge. "A Canonical Correlation Analysis of Occupational Mobility." Journal of the American Statistical Association, 66, no. 333 (1971), 16-22.
17. Lee, L. "Relationships Between Properties of Pulp-Fibre and Paper." Unpublished doctoral thesis, University of Toronto, Faculty of Forestry, 1992.
18. MacCrimmon, K., and D. Wehrung. "Characteristics of Risk Taking Executives." Management Science, 36, no. 4 (1990), 422-435.
19. Marriott, F. H. C. The Interpretation of Multiple Observations. London: Academic Press, 1974.
20. Mather, P. M. "Study of Factors Influencing Variation in Size Characteristics in Fluvioglacial Sediments." Mathematical Geology, 4, no. 3 (1972), 219-234.
21. McLaughlin, M., et al. "Professional Mediators' Judgments of Mediation Tactics: Multidimensional Scaling and Cluster Analysis." Journal of Applied Psychology, 76, no. 3 (1991), 465-473.
22. Naik, D. N., and R. Khattree. "Revisiting Olympic Track Records: Some Practical Considerations in the Principal Component Analysis." The American Statistician, 50, no. 2 (1996), 140-144.
23. Nason, G. "Three-dimensional Projection Pursuit." Applied Statistics, 44, no. 4 (1995), 411-430.
24. Smith, M., and R. Taffler. "Improving the Communication Function of Published Accounting Statements." Accounting and Business Research, 14, no. 54 (1984), 139-146.
25. Spenner, K. I. "From Generation to Generation: The Transmission of Occupation." Ph.D. dissertation, University of Wisconsin, 1977.
26. Tabakoff, B., et al. "Differences in Platelet Enzyme Activity between Alcoholics and Nonalcoholics." New England Journal of Medicine, 318, no. 3 (1988), 134-139.
27. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, CA: Brooks/Cole, 1975.
28. Trieschmann, J. S., and G. E. Pinches. "A Multivariate Model for Predicting Financially Distressed P-L Insurers." Journal of Risk and Insurance, 40, no. 3 (1973), 327-338.
29. Tukey, J. W. Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
30. Wainer, H., and D. Thissen. "Graphical Data Analysis." Annual Review of Psychology, 32 (1981), 191-241.
31. Wartzman, R. "Don't Wave a Red Flag at the IRS." The Wall Street Journal (February 24, 1993), C1, C15.
32. Weihs, C., and H. Schmidli. "OMEGA (On Line Multivariate Exploratory Graphical Analysis): Routine Searching for Structure." Statistical Science, 5, no. 2 (1990), 175-226.
x11 ] where the prime denotes the operation of transposing a column to a row.2niJ or x' = [xbx 2. a rectangular array of numbers with.2 Some Basics of Matrix and Vector Algebra Vectors An array x of n real numbers x 1 . . ..1 Introduction We saw in Chapter 1 that multivariate data can be conveniently displayed as an array of numbers. the formal relations expressed in matrix terms are easily programmed on computers to allow the routine calculation of important statistical quantities. x 2 . for instance.Chapter MATRIX ALGEBRA AND RANDOM VECTORS 2. Moreover. 2.. The study of multivariate methods is greatly facilitated by the use of matrix algebra. If you have not been previously exposed to the rudiments of matrix algebra.. and it is written as X =~ l :X. We begin by introducing some very basic concepts that are essential to both our geometrical interpretations and algebraic explanations of subsequent statistical techniques. n rows and p columns is called a matrix of dimension n X p. Xn is called a vector. . 49 . . In general. The matrix algebra results presented in this chapter will enable us to concisely state statistical models. you may prefer to follow the brief refresher in the next section by the more detailed review provided in Supplement 2A..
r (a) (b) figure 2. and Xn along the nth axis.50 Chapter 2 Matrix Algebra and Random Vectors I I I I I I I I ~~~" I I / figure2.] 2 2 1. ex is the vector obtained by multiplying each element of x by c.1bis is illustrated in Figure 2. we define the vector ex as ex= l 1 c. [See Figure 2. x 2 along the second axis.. .. A vector x can be represented geometrically as a directed line in n dimensions with component x 1 along the first axis.2].2(a).3. . A vector can be expanded or contracted by multiplying it by a constant c.1 for n = 3. In particular.x ] · c~z CXn That is.1 Thevectorx' = [1. .2 Scalar multiplication and vector addition.
. A vector has both direction and length. . we obtain the unit vector L.2(b). which has length 1 and lies in the direction of x.Some Basics of Matrix and Vector Algebra 51 1\vo vectors may be added. Lc" = Vc 2 x~ + c 2 x~ + · · · + c 2 x~ = Ic IVxt + X~ + · · · + X~ = Ic ILx Multiplication by c does not change the direction of the vector x if c > 0.3. However. 1 . written L. + Yi· The sum of two vectors emanating from the origin is the diagonal of the parallelogram formed with the two original vectors as adjacent sides. 2 figure 2. we consider the vector X=[::] The length of x. This is demonstrated schematically in Figure 2. xn]. The length of a vector x' = [xi> x 2 .. . From (22) it is clear that x is expanded if Ic I > 1 and contracted if 0 < Ic I < 1. is defined by Lx = V X~ + X~ + · · · + X~ (21) Multiplication of a vector x by a scalar c changes the length. Xz Xz Xn Yn Xn + Yn so that x + y is the vector with ith element X.] Choosing c = L. In n = 2 dimensions.3 Length of x = V xi + x~. the length of a vector in two dimensions can be viewed as the hypotenuse of a right triangle. . [Recall Figure 2. 1x.. Addition of x and y is defined as x+y= l Y2 xlllYlJ lx1 + Y2 Y1J : + : : . is defined to be Lx = Vx~ + x~ Geometrically.2(a). a negative value of c creates a vector with a direction opposite that of x. with n components. This geometrical interpretation is iJlustrated in Figure 2. . . From Equation (21).
x and y are perpendicular when x'y = 0. 8 can be represented as :he difference_ be~ween the a~!?~ 81 and 82 formed by the two vectors and the first coordinate axis._ and Xz YI sin(02 ) == ~ Ly the angle 8 between the two vectors x' = [ x 1 . we define the inner product of x andy as x'y = XIYJ. cos (8) = .. .Ly x'y = '' x'y WxwY since cos (90°) = cos (270°) == 0 and cos (8) == 0 only if x' y = 0. Smce.52 Cbapter 2 Matrix Algebra and Random Vectors 2 :r Figure 2.4. For an arbitrary number of dimensions n. as in Figure 2. Consider two vectors in a plane and the ngle {} between them. xz] and y' = [y1 . A second geometrical concept is angle. + XzY2 + · ·· + XnJ'n (24) 'Jbe inner product is denoted by either x'y or y'x. Y2] is specified by (23) We find it convenient to introduce the inner product of two vectors. the inner product of x andy is · x'y = X1Y1 + XzY2 With this definition and Equation (23). by defimtwn. From the figure.L. cos(02) ==Ly sin (Oil == L. For n = 2 dimensions.4 The angle (J between x' == [x1 . x2] andy' == [yl> Yz].
449 = . First. Therefore. and x'y= 1(2) + 3(1) + 2(1) = 1. Also. . 3.y (26) Since. determine the length of x. we have the natural extension of length and angle to vectors of n components: Lx = lengthofx = ~ ~ (25) cos(O) = x'y L. Next. 1.. y'y=(2) 2 +12 +(1) 2 =6. L3x = V3 2 + 92 + 62 = v126 and 3Lx = 3\1'14 = v126 showing L3x = 3Lx. not all zero..53 Using the inner product. Vectors of the same dimension that are not linearly dependent are said to be linearly independent. .Ly = x'y ~vy. Finally. find 3x and x + y.1 (Calculating lengths of vectors and the angle between them) Given the vectors x' = [1. x'x=l 2 +3 2 +2 2 =14. again. ck.. . c2 .742 Ly = vy. such that (27) Linear dependence implies that at least one vector in the set can be written as a linear combination of the other vectors.Some Basics of Matrix and Vector Algebra . Lx and = ~ = v'I4 = 3. xk is said to be linearly dependent if there exist constants c1 .. and the angle between x and y. . 1]. we say that x andy are perpendicular when x'y = 0. cos (0) = 0 only if x'y = 0.• . • c1 X+ c2 y = 0 A pair of vectors x and y of the same dimension is said to be linearly dependent if there exist constants c 1 and c2 . check that the length of 3x is three times the length of x. x2 . both not zero. Example 2. such that A set of vectors x1 . 2] and y' = [ 2.y = V6 = 2. the length of y. Next.l09 so 0 = 96.742 X 2.449 x'y 1 cos(O) = LxLy = 3..3°.
) • y (. not all zero.S The projection of x on y.l cos (8) I (29) where 8 is the angle between x andy. and x3 are linearly independent.2 {Identifying linearly independent vectors) Consider the set of vectors Setting implies that c1 { Cz Cz + c3 = 0  2c1 CJ  2c3 =0 = + c3 0 with the unique solution c 1 = c2 = c3 = 0. Matrices A matrix is any rectangular array of real numbers. As we cannot find three constants c1 . x2 .54 Chapter 2 Matrix Algebra and Random Vectors Example 2. ProJectwnofxony = .Ly I = L. cos {8) ~ Figure 2.y = . • The projection (or shadow) of a vector x on a vector y is . (See Figure 2. c2 . IL.L L y yy y y (x'y) (x'y) 1 (28) where the vector Ly1y has unit length.:~)y fL. and c3 . We denote an arbitrary array of n rows and p columns by A {nXp) = ['" a~I : al2 an anz ani . . such that c1 x1 + c2 x2 + c3 x3 = 0. the vectors x1 .. '"l azp anp ..5. The length of the projection is Length of projection = I x'y x'y T I = L.
The product cA is the matrix that results from multiplying each element of A by c. so that the first column of A becomes the first row of A'. j)th entry a.j + b.j. Example 2.3 {The transpose of a matrix) If A (2x3)  3 [1 1 2] 5 4 2 then A' = (3X2) [~ ca 12 ca 22 canz !] 4 • A matrix may also be multiplied by a constant c. the second column becomes the second row. = c~21 : canl '""l ca 2P canp Two matrices A and B of the same dimensions can be added. .[1 2 5 2 ~] then (2X3) 12 4A = [O 4 4 B :] and A+ (2X3) (2X3) 32 =[0+1 13]=[1 1 1 + 2 1 + 5 1 + 1 3 4 ~] • It is also possible to define the multiplication of two matrices if the dimensions of the matrices conform in the following manner: When A is (n X k) and B is (k X p ). Example 2. we can form the matrix product AB.[0 3 1 1 ~] and B (2x3) .Some Basics of Matrix and Vector Algebra 55 Many of the vector concepts just introduced have direct generalizations to matrices.. so that the number of elements in a row of A is the same as the number of elements in a column of B. and so forth. Thus cA (nxp) r~. The sum A + B has (i.4 (The sum of two matrices and multiplication of a matrix by a constant) If A (2x3) . The transpose operation A' of a matrix changes the columns into rows. An element of the new matrix AB is formed by taking the inner product of each row of A with each column of B.
Thus.J . a. + a...i + a.j)entryofAB = a.i + a... ·J Example 2.. hzl .56 Chapter 2 Matrix Algebra and Random Vectors The matrix product AB is A B = (nxk)(kxp) the ( n X p) matrix whose entry in the ith row and jth column is the inner product of the ith row of A and the jth column of B k or (i.i) b31 b41 ..kbki = When k = L f=l a.[1( 2) + 5(7) 9 _ 3(2) + (1)(7) + 2(9) + 4(9) J = [6~] (2XI) and 1 2] 5 4 = [2(3) + 0(1) 1(3)..1(1) 6 2 (2X3) 2(1) + 0(5) 2(2) + 0(4)] 1(1).1(5) 1(2).b.b. b.cbci (210) 4. a12 aiJ a.~...z anz "lb : II . . b3p b4p ~'l bzp an4 Column ~ A= [ 1 Row f· j (a..5 (Matrix multiplication) If 3 1 2] 54' then A B _ (2X3){3XI)  [1 3 1 2 5 4 7 J[2] . 2 b2i + ··· + a.1(4) = [~ 2 4] • . 1b1i + a....3 an3 (nx4)(4xp) A B = ["" (at! : ani a. entry in the matrix AB. we have four products to add for each. blj bzi b3j b4j .
and d' Abare typical products. A square matrix is said to be symmetric if A = A' or a. b'c.Some Basics of Matrix and Vector Algebra 57 When a matrix B consists of a single column. be' = [ 3 [5 6 7] 8 4] = [15 24 28] 35 56 12 30 48 24 The product b c' is a matrix whose row dimension equals the dimension of b and whose column dimension equals that of c. be'. b'< ~ [7 3 6] [ _. The product A b is a vector with dimension equal to the number of rows of A. . Example 2.] ~ [13] The product b' cis a 1 X 1 vector or a single number.j = aji for all i andj. it is customary to use the lowercase b vector notation. The product d' A b is a 1 X 1 vector or a single number.6 (Some typical products and their dimensions) Let A=[~ 2 3] 4 1 d = [~] Then Ab. here 13. • Square matrices will be of special importance in our development of statistical methods. here 26. which is a single number. This product is unlike b'c.
That is. so (kxk)(kxk) I A = (kxk)(kxk) A I = A (kxk) forany A (kxk) (211) The matrix I acts like 1 in ordinary multiplication (1 ·a = a ·1 = a).) Example 2.0 has the following matrix algebra extension: If there exists a matrix B such that (kxk)(kxk) BA=AB=I (kxk)(kxk) (kxk) then B is called the inverse of A and is denoted by A1. •.jl X 0 + a.2)2 + (. j)th entry of AI is ail X 0 + · · · + ai. lA = A. ak of A are linearly independent.8)3 + (.j+l X 0 + · · · + a.6)4 J = [~ ~] .8)2 + (. both products AB and BA are defined.) If we let I denote the square matrix with ones on the diagonal and zeros elsewhere.8 .6)1 [ .4] [3 2] = [(.2 . • When two square matrices A and B are of the same dimension.9 in Supplement 2A. The fundamental scalar relation about the existence of an inverse number a1 such that a1a = aa1 = 1 if a #. it follows from the definition of matrix multiplication that the ( i.4)1 (.i x 1 + ai.1 is equivalent to (212) (See Result 2A.2)3 + (. The technical condition that an inverse exists is that the k columns a 1 . Similarly.6 4 1 (.8 (The existence of a matrix inverse) For A=[! you may verify that ~] (.7 {A symmetric matrix) The matrix [~ ~J is symmetric.. . the matrix is not symmetric. (See Supplement 2A.k x 0 = aii• so AI = A.4)4 . the existence of A. a 2 . so it is called the identity matrix.58 Chapter 2 Matrix Algebra and Random Vectors Example 2. although they need not be equal.
4] implies that c1 = c2 = 0. According to the condition Q'Q = I. We note that . We conclude our brief introduction to the elements of matrix algebra by introducing a concept fundamental to multivariate statistical analysis. It is always good to check the products AA. when one exists. A square matrix A is said to have an eigenvalue A. ck.1 and A. with corresponding eigenvector x #.0. • A method for computing an inverse.j. #.. ~ 1 and qjqi = 0 for i #.10. Another special class of square matrices with which we shall become familiar are the orthogonal matrices.6 .2 . then the computer may produce incorrect inverses due to extreme errors in rounding.1A for equality with I when A. calculations are usually relegated to a computer. 1 all 0 1 a22 0 0 1 a33 0 0 0 0 0 1 a44 0 0 0 0 l ass r·~· 0 a22 0 0 a33 0 0 0 0 0 0 0 0 a44 0 n 0 0 0 0 has inverse 0 0 0 0 if all the a. if Ax= Ax (214) ..1 (213) The name derives from the property that if Q has ith row q).0. the columns have the same property.8 . so the rows have unit length and are mutually perpendicular (orthogonal). but lengthy.1 is produced by a computer package. .Some Basics of Matrix and Vector Algebra 59 so [ is A1 .) Diagonal matrices have inverses that are easy to compute. so the columns of A are linearly independent. •. This confirms the condition stated in (212). then QQ' =I implies that q/q. especially when the dimension is greater than three. Even so. you must be forewarned that if the column sum in (212) is nearly 0 for some constants c1 . For example. (See Exercise 2. The routine. characterized by QQ' = Q'Q =I or Q' = Q. is given in Supplement 2A.
Example 2. Sparing you the details of the derivation (see [1 ]). It is instructive to do a few sample calculations to understand the technique. We usually rely on a computer when the dimension of the square matrix is greater than two or three. The eigenvectors· are unique unless two or more eigenvalues are equal. Then A has k pairs of eigenvalues and eigenvectorsnamely. 1/v'2]. we state the following basic result: Let A be a k X k square symmetric matrix. It is convenient to denote normalized eigenvectors bye. (215) The eigenvectors can be chosen to satisfy 1 = e. and 1 is its corresponding normalized eigenvector. 2. e2 = [1/vl. and we do so in what follows. 1 = x'x. You may wish to show that a second • eigenvalueeigenvector pair is A = 4.9 (Verifying eigenvalues and eigenvectors) Let Then.e1 = · · · = ek:ek and be mutually perpendicular.60 Chapter 2 Matrix Algebra and Random Vectors Ordinarily. Squared distances (see Chapter 1) and the multivariate normal density can be expressed in terms of matrix products called quadratic forms (see Chapter 4). Consequently. it should not be surprising that quadratic forms play a central role in . that is.3 Positive Definite Matrices The study of the variation and interrelationships in multivariate data is often based upon distances and the assumption that the data are multivariate normally distributed. 2 A method for calculating the A's and e's is described in Supplement 2A. since A = 6 is an eigenvalue. we normalize x so that it has length unity.
. A ... 2. ek are the associated 2 normalized eigenvectors.andejej = Ofori ¢ j.30). 3. In this section. (See also Result 2A. 1jv'I8. Thus.4e 21 + 2e31 = 9e 11 4e11 + 13e21 . Consequently. You may verify that e2 = [1/vTB.k.Positive Definite Matrices 61 multivariate analysis. OJ. in many cases. 1/3] is the normalized eigenvector corresponding 2 to the eigenvalue A = 18. and A = 18 (Definition 2A. 3 1 A proof of Equation (216) is beyond the scope oftbis book. the normalized eigenvector is ej = [1/\112 + 12 + 02. = 1 fori= 1. Moreover. we consider quadratic forms that are always nonnegative and the associated positive definite matrices.. and e3 = [2/3. Ak are the eigenvalues of A and e 1 . 4jv'IB] is also an eigenvector for 9 = A . Thus. The corresponding eigenvectors 2 3 ei> e 2 .2e31 = 9e2 1 2e 11 . a direct consequence of an expansion for symmetric matrices known as the spectral decomposition. but two of the equations are redundant. 1/\112 + 12 + 0 2. 1jv'2. . ejej = 0 for i ¢ j.2e 21 + 10e31 = 9e31 Moving the terms on the right of the equals sign to the left yields three homogeneous equations in three unknowns.. . for i = 1. we find that e31 = 0.AI I = 0 are A1 = 9.e. 2j3. . Ae 1 = Ae 1 gives [ ~! ~: ~] [:~:] 9[:::] = 2 2 10 e31 e31 or 13e11 . A = 9. Example 2 . .I 0 {The spectral decomposition of a matrix) Consider the symmetric matrix 4 13 2 ~] 10 The eigenvalues obtained from the characteristic equation I A . Selecting one of the equations and arbitrarily setting e11 = 1 and e21 = 1.14 in Supplement 2A). The interested reader will find a proof in [6].2. . The spectral decomposition of a k X k symmetric matrix A is given by 1 A (kxk) = A e1 1 (kxi)(iXk) ej + A2 e 2 e2 + · · · + (kxi)(ixk) Ak ek elc (kxi)(ixk) (216) where A1. and e 3 are the (normalized) solutions of the equations Ae. eje. since the sum of the squares of its elements is unity.. = A. OjV12 + 12 + 02] = [1jv'2. Chapter 8.. . e 2 . Results involving quadratic forms and symmetric matrices are.
OJ. xk]. With it. 1e 1e! + A.3 1 3  4 ~ 18 4 18 16 18  + 18 9  4 9 4 9 2 9 4 9 4  2 9 2 9 1  2 9 9 as you may readily verify. The first of these is a matrix explanation of distance. . • The spectral decomposition is an important analytical tool. it is called a quadratic form. we are very easily able to demonstrate certain statistical results.. then A or the quadratic form is said to be positive definite. . x 2.62 Chapter 2 Matrix Algebra and Random Vectors The spectral decomposition of A is then A = A. . 2 e 2e2 + A. If equality holds in (217) only for the vector x' = (0. . . 0. 3 e 3e3 or 1 VI8 +9 1 VI8 4 [~ 4 VI8 vT8 1 J VI8 1 18 1 18 4 18 1 18 1 18 4 18 2 3 2 + 18 . When a k X k symmetric matrix A is such that 0 :5 x'Ax (217) for all x' == [x 1 . which we now develop. both the matrix A and the quadratic form are said to be nonnegative definite... Because x' Ax has only squared terms x[ and product terms x. In other words.x k.. A is positive definite if 0 < x'Ax for all vectors x ¢ (218) 0.
.) A is a nonnegative definite matrix if and only if ali of its eigenvalues are greater than or equal to zero. xP of a vector x are realizations of p random variables X 1 . we first write the quadratic form in matrix notation as VZJ 2 [XI] x2 = x'Ax By Definition 2A. = 4yi x 0 Y2 + x' (I X2)(2X 1)(1 X2)(2X 1) e 2 e2 x (I X2)(2Xl)(l X2)(2Xl) + y~ 2: with y 1 = x'e 1 = eix and = x'e 2 = e2x We now show that y 1 and Y2 are not both zero and.17. Thus.All = 0.•. . respectively. give x' A x (I X2){2X2)(2X I) = 4x' e 1 e. From the definitions of Yl and Y2. the eigenvalues of A are the solutions of the equation . • Using the spectral decomposition.A)(2 .2 v'2 x 1x 2 To illustrate the general approach. respectively. . .11 {A positive definite matrix and quadratic form) Show that the matrix for the following quadratic form is positive definite: 3xi + 2x~ . and 0 ¢ x = E'y implies that y ¢ 0.. or (3 . x = E'y.2 = 0. X 2 . Because 4 and 1 are scalars.A) . (See Exercise 2.Positive Definite Matrices 63 Example 2. But xis a nonzero vector. x 2 . premultiplication and postmultiplication of A by x' and x.• .:] or y (2Xl) = E X (2X2)(2Xl) Now E is an orthogonal matrix and hence has inverse E'. The solutions are A1 = 4 and Az = 1. As we pointed out in Chapter 1. Using the spectral decomposition in (216). x 2 ] is any nonzero vector. we have [~] = [:~] [.30. Xp. or A is positive definite. consequently. we can write IA A (2X2) = A1e 1 ei (2XIJ{IX2) + A2e 2 e2 (2XIJ{IX2) = 4e 1 ei (2XIJ{IX2) + e 2 e2 (2XIJ{IX2) where e 1 and e 2 are the normalized and orthogonal eigenvectors associated with the eigenvalues A1 = 4 and A2 = 1. that x' Ax = 4yi + y?: > 0. we can easily show that a k X k symmetric matrix A is a positive definite matrix if and only if every eigenvalue of A is positive. where x' = [ x 1 . Assume for the moment that the p elements x 1 .
we have a13x1x3 + ··· + apI. In sum.... a positive definite quadratic form can be interpreted as a squared distance. be interpreted in terms of standard deviation units.) We easily verify that x = cA! 1 2e 1 2 1 f2eiei) = 2. 2_ 0 <(distance) . Comment. the distance from the origin satisfies the general formula (distance )2 = a 11 xi + a2 2 x~ + · · · + aPPx~ + 2(a12x1x2 + provided that (distance) 2 > Oforall [xi. . . i = 1. Thus. . xp] to the origin be given bY x' Ax. Similarly. Let the square of the distance from the point x' = [x 1 . )' A(x .al decomposition. . OJ.p..[xi. [x1l . In this way. . . . . ··· a2p a1pl . x 2] of constant distance c from the origin satisfy x' Ax = a11xf + a22X~ + 2a12x x = c2 1 2 By the spectr. distance is determined from a positive definite quadratic form x' Ax. ... Pp] is given by the general expression (x . ali . and the "distance" of the point [X1.j = lij. . Setting a. 2. . x 2.p.. x2 . x2. Points with the same associated "uncertainty" are regarded as being at the same distance from the origin. Then the square of the distance from x to an arbitrary fixed point p. j = 1. suppose p = 2.17.···.p. . where A is a p X p symmetric positive definite matrix. aPP xP or 0 < (distance? = x' Ax forx ¢ 0 (219) From (219)... .2. xp] i j. The constant of proportionality is c. (See Exercise 2. .x2. . x2.pXplxp) * ¢ [0. ). as in Example 2. Then the points x' = [x 1 . x = cA2 f2e 2 gives the appropriate satisfies x' Ax = A ( cA!1 1 distance in the e 2 direction. 2 2 A= A e1ei + A2e2ez so x'Ax = A1(x'ed + A (x'e 2) 2 1 Now.. xp]' to the origin can. ..' = [PI> p. ..p. we see that the p X p symmetric matrix A is positive definite.b4 ChaP ter 2 Matrix Algebra and Random Vectors we can regard these elements as the coordinates of a point in pdimensional space. If we use the distance formula introduced in Chapter 1 [see Equation (122)].. Expressing distance as the square root of a positive definite quadratic formallows us to give a geometrical interpretation based on the eigenvalues and eigenvectors of the matrix A. the points at distance c lie on an ellipse whose axes are given by the eigenvectors of A with lengths proportional to the reciprocals of the square roots of the eigenvalues. 0. Conversely. The situation is illustrated in Figure 26..xp] [ a21 : apl .. c2 = A1yi + A2y~ is an ellipse in Y1 = x'e1 and Y2 = x'e 2 because AI> A2 > 0 1 when A is positive definite.. we can account for the inherent uncertainty (variability) in the observations. and in this case should. . .11. For example.2.
the points x' = [xi> x 2 . 1 s A1 < Az). . Let the normalized eigenvectors be the columns of another matrix i=l k P = [ e 1 . . p. Then k A (kxk) = 2: A. .. withA. i = 1. and this leads to a useful squareroot matrix. A2.. 2. .4 A SquareRoot Matrix The spectral decomposition allows us to express the inverse of a square matrix in terms of its eigenvalues and eigenvectors.e. whose axes are given by the eigenvectors of A.. Let A be a k X k positive definite matrix with the spectral decomposition A = 2: A. ek].e. The halflength in the direction e. If p > 2. 2... Ap are the eigenvalues of A. > 0 0 .• . is equal to cjVi.. where A1 .A SquareRoot Matrix 65 Figure 2. (kxl)(lxk) e. xp] a constant distance c = ~from 2 2 the origin lie on hyperellipsoids c 2 = A1 (x'eJ) + · · · + Ap(x'ep) . .. 6 Points a constant distance c from the origin (p = 2.. . .. ei i=l P A P' (220) (k xk)(k xk)(k x k) where PP' = P'P = I and A is the diagonal matrix A (kxk) = l AI 0 ~ . e 2 .
Next. (A112 f e.andA112AIf2 = A1. a random matrix is a matrix whose elements are random variables. A112 A 112 = A. e. 1/Vi:. ~ 4. The expected value of a random matrix (or vector) is the matrix (vector) consisting of the expected values of each of its elements. = PA 112P' (222) 2 1. 2. Specifically.whereA112 = (A112f1 . Similarly. let X = { X. Then the expected value of X. let A I/2 denote the diagonal matrix with k VA.e. denoted by E(X). e. is the n X p matrix of numbers (if they exist) E(X p)] E(X2 p) E(Xnp) 1 (223) . (A11 )' = A1f2(thatis. i=l 1 = ±.66 Chapter 2 Matrix Algebra and Random Vectors Thus.5 Random Vectors and Matrices A random vector is a vector whose elements are random variables.j} be an n X p random matrix. = PA I/2P'. 2. where A1/2 is a diagonal matrix with vA. 3. of a positive definite matrix A. The matrix AI/2. (221) since (PA .1P')PAP' = PAP'(PA lp') = PP' =I. A 112A112 = A1f2A112 = I. 2: VA. A 112 has the following properties: = L i=l k vT.ei = P AI/2p> is called the square root of A and is denoted by i=l The squareroot matrix.A112issymmetric).e. as the ith diagonal element. . as the ith diagonal element.
eventually.3 ThenE(X1 ) 2: allx 1 x 1 p 1(x 1 ) = (1)(.j(X.iPii(x.3) + (0)(.3 0 1 .i) X.i) = l = i: 2: X.1.2 = 2: all x 2 XzPz(x 2) = (0) (. and consider the random vector X' = [X 1 .2. let the discrete random variable X 2 have the probability function Xz Pz(xz) Then E(X2) Thus. E(X1 + Yj) = E(X1 ) + E(Yj) and E(cX1 ) = cE(XJ). Let X and Y be random matrices of the same dimension. [E(Xd] [·1] E(Xz) = .i is a continuous random variable with probability density function.j if X.i is a discrete random variable with probability function p.j) dx. variance. X 2 ].4) = .if.fij(xii) if X.4 . for each element of the matrix? E(X. Similarly.Random Vectors and Matrices 67 where.2 • TWo results involving the expectation of sums and products of matrices follow directly from the definition of the expected value of a random matrix and the univariate properties of expectation.j(x. Let the discrete random variable X 1 have the following probability function: p 1 . E(X) = 0 .8) + (1) (.2) = .3) + (1)(.i) allxij Example 2.40) E(X + Y) = E(X) + E(Y) (224) E(AXB) = AE(X)B 2 If you are unfamiliar with calculus. Then (see Exercise 2. . and let A and B be conformable matrices of constants. you should concentrate on the interpretation of the expected value and.12 {Computing expected values for discrete random variables) Suppose = 2 and n = 1.8 1 . Our development is based primarily on the properties of expectation rather than its particular evaluation for continuous or discrete random variables.
p. p.. p..6 Mean Vectors and Covariance Matrices Suppose X'= [X1 .) if X. . . is described by a joint probability density function f(x 1 .) if X.(x. xk) (226) and JL.. Xp] isap X 1 randomvector.12. . X 2 . the random vector X' = [ X 1 . p. X 2 . 2. (See Example 2.. we shall adopt this notation.TheneachelementofXisa random variable with its own marginal probability distri]::mtion. is a continuous random variable with probability density function J. respectively. s x. . k = 1. .{x·) dx· ' ' 2: (x.= a?= l It will be convenient in later sections to denote the marginal variances by a.k = E(X.)(xk . ..k) if X.t..p. Xp or. Xp]. = E(X.) (225) 2: X. and JLk> i..) if X. is a discrete random variable with probability function p. More generally. . rather than the more traditional al. is described by their joint probability function. . equivalently.....) a. xk) 2: 2: all Xi (x.P. so that (227) ..) The marginalmeansp.andvarianceso}aredefinedasp.) all. xp) = f(x).JL.) (Xk . xk) all x1c if X.) 2 .. Specifically. and consequently. .JLk)P. i = 1. .(x. and Xk s xk] can be written as the product of the corresponding marginal probabilities. . .( x·) dx· ' ' ' ' oo if X..k(x. and X k. X 2 . As we have already noted in this book.) If the joint probability P[ X.JL.( x. and a measure of the linear association between them is provided by the covariance !1 ! all xi 1 00 X·!. X k are discrete random variables with joint probability function P.k(x.(x. is a discrete randotn variable with probability function p. the collective behavior of the p random variables X 1 .k( X.)Zf.•.)andal = E(X.. is a continuous random variable withprobabilitydensityfunctionjj(x.68 Chapter 2 Matrix Algebra and Random Vectors 2.. such as X. 00 _ 00 ' ' ' (x· _ p.(x. . are the marginal means. the covariance becomes the marginal variance. (See Chapter 4.. 2. Xk are continuous random variables with the joint density function/.. Xz.. The behavior of any pair of random variables.JLYP. . When i = k. f (x) will often be the multivariate normal density function....
Thus.Pz)(Xp . there are situations where Cov(X.JLp) (Xz .JLI) .k(x.JLp)] E(X2 .Pz) E(X1 . Xk) = 0 if X..JLz) 2 (XI .P)'. xp).. and Xk are not independent. The factorization in (228) implies that Cov (X. the independence condition becomes /.)fk(xk) for aJ1 pairs (x.JLd (Xz . Xk) = 0. Cov(X. and Xk are continuous random variables with joint density /.JLp) E(Xp.(x.P!) 2 E(Xz . E(X) = and l E(~z) = E(XI)lli'Il = '1 l'p 1' (230) E(Xp) = E = l l (XI ... xk)· The p continuous random variables X 1.Pd 2 (Xz .Pz) E(Xz .JLz) 2 (Xp .= E(X)...Pp)(XI  £( X1 .P)(X. Xk) = 0. xk) = /.JLd(Xp.1'2) (Xz . Statistical independence has an important implication for covariance.Pz)(X1  E(Xp . and the p(p .) and fk(xk).JLp)] (Xz .JLz)(Xp . xk> then X.) The means and covariances of the p X 1 random vector X can be set out as matrices.k( i < k) are contained in the symmetric variancecovariance matrix .Mean Vectors and Covariance Matrices 69 for all pairs of values x. The expected value of each element is contained in the vector of means 1'. . but X.JLp:)(X1 .I= E(X. X 2 .. .Pz)/X1  Pd Pd JLd (XI . . . (See [5].k(x..•.JLd (Xz .JLp) (Xp. and Xk are independent (229) The converse of (229) is not true in general.(x. xk) and marginal densities /. Specifically. and the p variances a.JLd (Xp .JLp) 2 (Xp . Xp are mutually statistically independent if their joint density can be factored as (228) for all ptuples (xi> x 2 .1)/2 distinct covariances a.. and Xk are said to be statistically independent. When X..JLp)z E(X1 .
JLd (X2 .. + (1 .. (See Exam JLJ) 2 = L all x 1 (x 1 .00) = ...4 pz(x2) .2)(.1) 2(.1)(1.JL2) 2 = = (0..JL2) = L allpairs(x.2) 2 (.00 1 0 1 .13 (Computing the covariance matrix) Find the covariance matrix for the two random variables X 1 and X 2 introduced in Example 2.24) + (1.) In addition.8 Pl(xJ) .1)(0.06 .1) 2(..08 ..40 .1) (x2 .24 .JLJ) = E(X1  JL1)(X2 .L 2 = E(X2 ) = .2)(.3) + (0. x2) = (1.4) a22 = E(X2.2) .2...12 when their joint probability function p 12 (x 1 .JLz) = a 12 = .1)(1..3 .2 We have already shown that JL 1 = E(XJ) ple 2..1 and J. x 2 ) ·is represented by the entries in the body of the following table: ~ I 0 .12..2)(.2)pn(xi..16 .1) 2(.70 Chapter 2 Matrix Algebra and Random Vectors or all I = Cov(X) = [ af 1 (231) a pi Example 2.8) = L aiJ x2 (x2.06) + .69 = (1.1)2p 1(xJ) = .2) 2 (..16 a12 = E(X1 .. ail = E(X1  = ..3 .3) + (1.14 ..2) 2P2(x2) + (1 .x2) (xi ..08 a 21 = E(X2  JL2)(X1 .
1 ] = [·1] Pz . respectively.P)(X. and akk as a. so it is not surprising that these quantities play an important role in many multivariate procedures. with X' 1' = [X~> X 2 ].~ (233) The correlation coefficient measures the amount of linear association between the random variables X.:. the measure of association known as the population correlation coefficient Pik· The correlation coefficient P..12 and 2.Mean Vectors and Covariance Matrices 71 Consequently.k and variances a.... and covariances for discrete random variables involves summation (as in Examples 2. [5). variances.)(Xk . and Xk.69 .k is defined in terms of the covariance a.P)(X..k = E(X.13). (See. from that contained in measures of association and.and variancecovariance matrix I are given (see Chapter 4).P.08 . The multivariate normal distribution is completely specified once the mean vector 1'. for example.P)' = [uu : ai2 a!z azz u. Because a.16 • We note that the computation of means.) . it is convenient to write the matrix appearing in (231) as I= E(X.k Pik = =.1'k) = ak. .08] . It is frequently informative to separate the information contained in variances a. in particular. = E(X) = [E(XJ)J E(Xz) = [~'. while analogous computations for continuous random variables involve integration.2 and I= E(X.P)' azz al2J = [ .l azp app (232) alp azp We shaH refer to 1'.::=== \Ia.and I as the population mean (vector) and population variancecovariance (matrix).
14 (Computing the correlation matrix from the covariance matrix) Suppose Obtain V 112 and p. ... Moreover.. the expression of these relationships in terms of matrix operations allows the calculations to be conveniently implemented on a computer. Example 2. Uzz p= lTJz ~va:. .. lTzp ~va... Pip P12 1 Pzp .72 Chapter 2 Matrix Algebra and Random Vectors Let the population correlation matrix be the p lTJJ lTJ2 X p symmetric matrix ~~ ~vo.. va:.vu.vo. whereas p can be obtained from I. •.] Pzp (234) 1 and let the p X p standard deviation matrix be (235) Then it is easily verified (see Exercise 223) that (236) and (237) That is.. lT]p va. I can be obtained from Vlf2 and p. +:..
Mean Vectors and Covariance Matrices. consider measurements of variables representing consumption and income or variables representing personality traits and physical characteristics. the characteristics measured on individual trials will fall naturally into two or more groups. One approach to handling these situations is to let the characteristics defining the distinct groups be subsets of the total coi!ection of characteristics.. the subsets can be regarded as components of X and can be sorted by partitioning X. If the total collection is represented by a (p X 1)dimensional random vector X. As examples.~~~] q (238) . we can partition the p characteristics contained in the p X 1 random vector X into. In general. the correlation matrix pis given by 1 ~] [.q. respectively.2 0 ~ 1 9 3 3 25 2] [! 0 ~ ~ 0 0 • Partitioning the Covariance Matrix Often. For example. we can write XI Xq JLI }· X= Xq+I Xp }p = [i·~~~] JLq and p = E(X) = JLq+I JLp = [. 0 and 0 ] . for instance.[20 0 YO. 0 Consequently. 73 Here 0 va. two groups of size q and p .. from (237).
.~tq)(Xq+z..q+I 0" pq (240) O"q+l. ..~tz):(Xp..1L (Z) )'.q+l O"qp ... .~tq) (Xq+I . ··· _ (X} ..."(IJ)' (qxi) (lxq) (X(IJ ...... between a component of x(IJ and a component of X(2)_ Note that the matrix 1:12 is not necessarily symmetric or even square.. O"q+J...ilq+Z) Upon taking the expectation of the matrix (X(!) ....L(2J)• ((pq)X!) (IX(pq)) (X(2) .... .....llI) (Xp .. ~..JL)(X.... we can easily demonstrate that l O"I../tq..JL<IJ) cx(IJ . q + 2... 2. j = q + 1....EJ .. .q lO"q+l..1Lz)(Xq+2. p. .z) (Xz./tq+Z) [ ........ . q... l ex<!) .JL)' = and consequently...J... i = 1.... ..q+I O"I..J...JL (I)) cx<2J ..u1i.IL(2)r = which gives all the covariances. ::_. (X(IJ _ JL<'l) (x<zJ _ JL(z))' (XI.../tq)(Xp.JL)(X. + ~.q+I .JL< 2l) (X(JJ .p 0" PP ~ (]" p...ilq+I) (Xq. .JL(IJ)' ((pq)x!) (lxq) "I (pxp) = E(X .1L (I)) (X< 2l . Making use of the partitioning in Equation (238)..ILJ)(Xq+I...L.LOl) (X(2J ..JL)' = q pq [~!. we get Ecx<IJ .ILp) (Xq .q+2 = "II2 (239) O"q.ilq+I) (Xz...q+I (X. ITI pj O"fp O"qp (]"2ri Uzrz O"q...ilq+I) ~ .I O"pi O"q+J...ILp) (Xq...ILp)J (Xz. "Izi : "I22j (pXp) q ' pq O"qi O"qq j O"q..L(2J)'J (qXI) (IX(pq)) (X(2) _ JL(2J) (X(2J _ J..14 Chapter 2 Matrix Algebra and Random Vectors From the definitions of the transpose and matrix multiplication...1tz)(Xq+l. (XI~ ILI)(Xq+2....q+2 .
J.l2 can be expressed as [a b] [:J = c'JL If we let . is multiplied by a constant c. The covariance matrix of x(ll is Iu. 1 + bJ.t!} 2 = c2Var (X 1 ) = c2u 11  If X 2 is a second random variable and a and b are constants. .p.(ap. such as X 1 . using additional properties of expectation. E(aX1 + bX2) = aJ.L 1 + bJ.Ld (X2 . we get Cov(aX1 . It is sometimes convenient to use the Cov (X(1l.bX2) = E(aX1  aJL 1 )(bX2 .J.Mean Vectors and Covariance Matrices 75 Note that I 12 = I2 1.l2) 2 + 2ab(X. then E(cX1 ) and Var (eX!) = E(cX1 = cE(X1 ) = CJL1 q. that of x< 2l is I that of elements from xol and X(2) is I 12 (or I 21 ). aX1 + bX2 can be written as [a b] [~J = c'X Similarly..L2lf J. b ]. + bJL2)f I L I I p.J.bp. x< 2l) = I 12 22 . 1) + b(X2.. for the linear combination aX1 + b X 2. x< 2l) notation where Cov (XOl.. The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables Recall that if a single random variable..JL2)] 2Var(X ) + b2Var(X ) + 2abCov(X.. then... and is a matrix: containing all of the covariances between a component of X(ll and a component of x< 2l.. 2) p.Ld + b2(X2 .l2 Var(aX1 + bX2) = E[(aX1 + bX2) . 2) = abE(X1  = abCov(X1 .!)(X2. we have E(aX1 + bX2) = aE(X1 ) + bE(X2) = ap. X2) = a 1 2 (241) = E[a(X1  = E[a 2(X1  With c' = [a. X2 ) = aba12 Finally.
Equation (241) becomes Var(aX1 + bX2 ) = Var(c'X) since c'Ie = [a = c'Ic (242) b] [uu u u12 12 <Tzz ] [ab] = a2uu + 2abu12 + b2<Tzz The preceding results can be extended to a linear combination of p random variables: The linear combination c'X .) We shall rely heavily on the result in (245) in our discussions of principal components and factor analysis in Chapters 8 and 9.z] and variancecovariance matrix .::hapter 2 Matrix Algebra and Random Vectors be the variancecovariance matrix of X.28 for the computation of the offdiagonal terms in CixC'. respectively. Example 2. . c~XP has (243) variance = Var(c'X) = c' 1' = c'Ic . consider the q linear combinations of the p random variables XI>···· XP: Z 1 = c11 X 1 + c12 X 2 + ··· + c1 PXP Zz = cz1X1 + c22 X 2 + · · · + CzpXp ..= c1 X1 + .1 . or Ctz cz2 Cqz (qxp) (244) The linear combinations z = ex have (245) 1'z = E(Z) = E(CX) = Cl'x Iz = Cov(Z) = Cov(CX) = CixC' where 1'x and Ix are the mean vector and variancecovariance matrix of X. p. . In general. (See Exercise 2.76 <. · + mean = E(c'X) where 1' = E(X) and I = Cov (X). .1 S (Means and covariances of linear combinations) Let X' = [XI> Xz] be a random vector with mean vector 1'X = [p. .
.  ip)l L be the corresponding sample variance<:ovariance matrix. . X2 .Mean Vectors and Covariance Matrices T7 Find the mean vector and covariance matrix for the linear combinations zl or ==:XI. and (240) also hold if the population quantities are replaced by their appropriately defined sample counterparts. Xp.. (237). (238)..~1 ± (xil  id (xiP . Here p. ••. .z = and E(Z) = Cp.x and l:x . Let i' = [x 1 . . . This demonstrates the wellknown result that the sum and difference of two random variables with identical variances are uncorrelated..Xz ~=XI+ Xz in terms of P. . if X 1and X 2 have equal variancesthe offdiagonal terms in Iz vanish. ' • Partitioning the Sample Mean Vector and Covariance Matrix Many of the matrix results in this section have been expressed in terms of population means and variances (covariances). .ILz] ILz ILl + ILz 1 l:z = Cov(Z) = C~xC' = D1] [<Tu 12 1 u u22 u 12 ] [ 1 11] 1 Note that if u 11 = u 22 that is. 1 n _ 2 n i=l (x1 P. . The results in (236). and let ··· !.xp) . xp] be the vector of sample averages constructed from n observations on p variables X 1 .x = [11 1] [IL1] _ [IL1. x2 .
. respectively.l Sq+i. The matrix inequalities presented in this section will easily allow us to derive certain maximization results. Let b and d be any two p X 1 vectors. . principal components are linear combinations of measurements with maximum variability.78 Chapter 2 Matrix Algebra and Random Vectors The sample mean vector and the covariance matrix can be partitioned in order to distinguish quantities corresponding to groups of variables.q+l (p><p) s" = ···J······Sq+l. The allocation rule is often a linear function of measurements that ma:cimizes the separation between groups relative to their withingroup variability. S 11 is the sample covari ance matrix computed from observations x(Il. Unear discriminant analysis. Thus.q Sq+i. for example.1 Matrix Inequalities and Maximization Maximization principles play an important role in several multivariate techniques.q+i Sq p Spl Spq Sp. xq]' and x( 2 ) = [xq+ I> .. (p><l) x . 2. and S12 = S2 1 is the sample covariance matrix for elements of x(Il and elements of x( 2 l. CauchySchwarz Inequality. As another example.q+I Sq+I.q+l Spp (247) where x(IJ i( 1) and i( 2 l are the sample mean vectors constructed from observations = [xi> . .59. is concerned with allocating observations to predetennined groups.. xp]'. which will be referenced in later chapters.__ xq+I (246) and sl. Then (248) (b'd) 2 $ (b'b)(d'd) with equality if and onlyifb = cd (ord = cb) for some constant c.p Sq 1 Sqq i i Sq. . S22 is the sample covariance matrix computed from observations x( 2 l..
e. in this case 0 < (b.. The inequality is obvious if either b = 0 or d = 0. 0 = (b . Excluding this possibility. If we set [see also (222)] 1  i=J ±.x d is positive for b . and the same argument produces (b'd) = (b'b)(d'd). but important..+ . so we conclude that 0 < b'b.x d #.ej.. and (249) (b'd) :s (b'Bb)(d'B.cd)'(b .0.d'd or (b'd) < (b'b)(d'd) if b #. Let let B be a positive definite matrix.+ (d'd) d'd ( X  b'd) d'd 2 The term in brackets is zero if we choose x = b'd/d'd.1f2d) and the proof is completed by applying the CauchySchwarz inequality to the vectors (B 112b) and (B1f2d). If we complete the square by adding and 2 subtracting the scalar (b'd) /d'd. extension of the CauchySchwarz inequality follows ' directly. and the normalized eigenvectors e.. Then (pxp) 2 b (pxl) and d (pxl) be any two vectors. we get 0 < b'b. as 81/2 = sl/2 = L i=I p VA. .e·e~ VA.cd)..xd) = b'b. • The extended CauchySchwarz inequality gives rise to the following maximization result. where x is an arbitrary scalar.b'(xd) + x 2 d'd = b'b. For cases other than these.xd for some x. consider the vector b . consider the squareroot matrix sl/2 defined in terms of its eigenvalues A.x d.Matrix Inequalities and Maximization 79 Proof.xd'b.2x(b'd) + x 2 (d'd) The last expression is quadratic in x. Note that if b = cd.. The inequality is obvious when b = 0 or d = 0.2x(b'd) + x 2 (d'd) d'd d'd 2 (b'd) 2 (b'd) 2 = (b'd) b'b . Extended Cauchy5chwarz Inequality. 2 2 (b'd) 2 • A simple. Proof..xd)'(b. Since the length of b . I I it follows that b'd = b'Id = b'B 112s112d = (B 1f2b)'(s.1d) with equality if and only if b = c B1d (or d = c B b) for some constant c.
(attainedwhenx = ep) max x'Bx .. (pXI) (pXp)(pxl) ~=~~:::: x'B1 112x 12B x'!!:x I (pXp) x'PA 1 12P'PA 112p•x y'y =:: y'Ay y'y f A.. A2. 2. Then.FO X. Let (pxp) B be positive definite and (pXI) (pXi) d be a given vector.::L. . . x'Bx > 0. ' \ (pXp) positive d e fi mte matnx Wit h e1genva1 ues "I 2:: "Z 2:: · · · 2:: AP 2:: Q and associated normalized eigenvectors e 1 . Maximization of Quadratic Forms for Points on the Unit Sphere.80 Chapter 2 Matrix Algebra and Random Vectors Maximization Lemma..L_ = A1 f yJ i=l . ez . .. for an arbitrary nonzero vector x .. max ~X~." Proof.0 and B is positive definite. Let nl/2 == PA l/2p• [see (222)] and y = P' X • Consequently.. Then x'Bx max.1) (252) where the symbol is read "is perpendicular to. p ...!.e. ~.:S: A1. Thus.. . d'B1d _x x'Bx Taking the maximum over x gives Equation (250) because the bound is attained for x = cB.. .1 d (pXp)(p>li) for any constant c 'I' 0.. Let B be a •. x i'...f Yf f yf (253) .= d' B1d x"'o x'Bx (pXI) ( 'd)2 (250) with the maximum attained when x = cB.E.0 implies y x'Bx x'x • • * 0. eP. . Because x i'.X (attainedwhenx == e!) (251) x'Bx min"" A x:=FO x'x P Moreover. (x'd/ ::. . eP and A be the diagonal matrix with eigenvalues A1 . e 2 . .Yf i=l = ... • • . 'd)2 • A final maximization result will provide us wilh an interpretation of eigenvalues. Ap along the main diagonal.=A 1 x. .. Proof.Le~o . Dividing both sides of the inequality by the positive scalar x'Bx yields the upper bound ( __ ::.· = XX Ak+I _l (attained when x = ek+i• k "= 1. Let (pXp) P be the orthogonal matrix whose columns are the eigenvectors e 1 . (x'Bx) (d'BId)..1d. By tne extended CauchySchwarz inequality.
Matrix Inequalities and Maximization 81 Setting x = e 1 gives
since
k =' 1 k#1
For this choice of x, we have y' Ayjy'y = AJl = A1 , or
(254)
A similar argument produces the second part of (251). Now, x = Py = y1e 1 + }2e 2 + · · · + ypep. sox .l el> ... , ek implies
Therefore, for x perpendicular to the first k eigenvectors e;, the lefthand side of the inequality in (253) becomes
x'Bx x'x
L A;YT i=k+!
i=k+l
p
2:
p
y[
Taking Yk+t
= 1, Yk+2 = · · · =
yP
= 0 gives the asserted maximum.
•
For a fixed Xo 0, xoBXo/XoXo has the same value as x'Bx, where x' = x 0 j~ is of unit length. Consequently, Equation (251) says that the largest eigenvalue, AI> is the maximum value of the quadratic form x'Bx for all points x whose distance from the origin is unity. Similarly, Ap is the smallest value of the quadratic form for all points x one unit from the origin. The largest and smallest eigenvalues thus represent extreme values of x'Bx for points on the unit sphere. The "intermediate" eigenvalues of the p X p positive definite matrix B also have an interpretation as extreme values when xis further restricted to be perpendicular to the earlier choices.
*
Supplement
VECTORS AND MATRICES: BAsic CoNCEPTs
Vectors
Many concepts, such as a person's health, intellectual abilities, or personality, cannot be adequately quantified as a single nuinber. Rather, several different measurements x 1 , x2 , ... , Xm are required.
Definition 2A.I. An mtuple of real numbers (x 1 , xz, ... , x., ... , xm) arranged in a column is called a vector and is denoted by a boldfaced, lowercase letter. Examples of vectors are
Vectors are said to be equal if their corresponding entries are the same.
Definition 2A.2 (Scalar multiplication). Let c be an arbitrary scalar. Then the product ex is a vector with ith entry cxi. To illustrate scalar multiplication, take c1 = 5 and c2 = 1.2. Then
c1
y s[
=
2
~] = [ l~J
10
82
and c2
y= (1.2) [
2
~] = [=~~]
2.4
Vectors and Matrices: Basic Concepts 83
Definition 2A.3 (Vector addition). The sum of two vectors x andy, each having the same number of entries, is that vector
z
Thus,
=x+
y
with ith entry
Z;
= x; + Y;
X
+
y
z
Taking the zero vector, 0, to be themtuple (0, 0, ... , 0) and the vector x to be the mtuple ( x 1 ,  x 2 , ... ,  xm), the two operations of scalar multiplication and vector addition can be combined in a useful manner.
Definition 2A.4. The space of all real mtuples, with scalar multiplication and vector addition as just defined, is called a vector space. Definition 2A.S. The vector y = a 1x 1 + a 2x 2 + · · · + akxk is a linear combination of the vectors x 1 , x 2 , ... , xk. The set of all linear combinations of x 1 , x 2, ... , xk> is called their linear span. Definition 2A.6. A set of vectors x 1 , x 2, ... , xk is said to be linearly dependent if there exist k numbers (a 1, a 2 , ... , ak), not all zero, such that
Otherwise the set of vectors is said to be linearly independent.
If one of the vectors, for example, x;, is 0, the set is linearly dependent. (Let a; be the only nonzero coefficient in Definition 2A.6.) The familiar vectors with a one as an entry and zeros elsewhere are linearly independent. Form = 4,
so
implies that a 1
= a 2 = a3
= a4
= 0.
84
Chapter 2 Matrix Algebra and Random Vectors
As another example, let k
= 3 and m = 3, and Jet
Then 2x 1

x2 + 3x3 == 0
Thus, x 1, x 2 , x3 are a linearly dependent set of vectors, since any one can be written as a linear combination of the others (for example, ~ 2 = 2x 1 + 3x 3).
Definition 2A.7. Any set of m linearly independent vectors is called a basis for the vector space of all mtuples of real numbers. Result 2A.I. Every vector can be expressed as a unique linear combination of a • fixed basis.
With m = 4, the usual choice of a basis is
These four vectors were shown to be linearly independent. Any vector x can be uniquely expressed as
A vector consisting of m elements may be regarded geometrically as a point in mdimensional space. For example, with m = 2, the vector x may be regarded as representing the point in the plane with coordinates x 1 and x 2 • Vectors have the geometrical properties of length and direction.
2 •
Definition 2A.8. The length of a vector of m elements emanating from the origin is given by the Pythagorean formula:
lengthofx = L, =
v'xi
+ x~ + · .. + x~
Vectors and Matrices: Basic Concepts 85 Definition 2A.9. The angle (J between two vectors x andy, both having m entries, is defined from
where Lx = length of x and Ly = length of y, x 1 , x 2, ... , xm are the elements of x, and y1 , Ji2, ... , Ym are the elements of y. Let
Then the length of x, the length of y, and the cosine of the angle between the two vectors are lengthofx
= v'(_:_1) 2
+5 2 + 2 2 + (2) 2 02 + 12
= V34 = 5.83
= 5.10
length of y = and
vf42+ (3) 2 +
= V26
=
V34 V26 [(1)4 + 5(3) + 2(0) + (2)1]
1
1
1
5.83
Consequently, (J
X
5.10 [ 21} = .706
= 135°.
Definition 2A.I 0. The inner (or dot) product of two vectors x andy with the same number of entries is defined as the sum of component products:
We use the notation x'y or y'x to denote this inner product. With the x'y notation, we may express the length of a vector and the cosine of the angle between two vectors as ·
L 1 = lengthofx =
v'xt +
x~ + · ·· +
x; = ~
cos(O)
=.
x'y ~ ~ vx'x vy'y
86
Chapter 2 Matrix Algebra and Random Vectors
Definition 2A.II. When the angle between two vectors x, y is (J = 90" or 270°, we say that x and y are perpendicular. Since cos( 0) = 0 only if (J = 90° or 270°, the condition becomes
x andy are perpendicular if x'y = 0 We write x
1_
y. _
The basis vectors
[~]· [~]· [~]· [~]
.
are mutually perpendicular. Also, each has length unity. The same construction holds for any number of entries m.
Result 2A.2. (a) z is perpendicular to every vector if and only if z = 0. (b) If z is perpendicular to each vector x1 , x 2, ... , xk, then z is perpendicular to their linear span. (c) Mutually perpendicular vectors are linearly independent. • Definition 2A.I2. The projection (or shadow) of a vector x on a vector y is
proJeCtiOn ofx on y = 
. .
(~y)
2 Ly 
Y
If y has unit length so that Ly = 1,
projectionofxony = {x'y)y
If y1, y2, ... , y, are mutually perpendicular, the projection {or shadow) of a vector x on the linear span ofyb J2, ... , y, is
(x'yJ) ,yl
Y1Y1
+ ,y2 + ···+ ,y,
Y2Y2
y,y,
(x'Y2)
(x'y,)
Result 2A.3 (GramSchmidt Process). Given linearly independent vectors x 1 , x 2, ... , xk, there exist mutually perpendicular vectors u1 , u2 , ... , uk with the same linear span. These may be constructed sequentially by setting
UJ =X]
Vectors and Matrices: Basic Concepts
87
We can also convert the u's to unit length by setting zi = uif~· In this
k1
construction, (xizj) zi is the projection of xk on zi and
2: (x/.,lj)Zj is the projection
j~l
of xk on the linear span of x 1, x 2, ... , XkJ·
For example, to construct perpendicular vectors from
•
and
we take
so
and x2u 1 = 3(4) + 1(0) + 0(0) 1(2) = 10
Thus,
Matrices
Definition 2A.Il. An m X k matrix, generally denoted by a boldface uppercase letter such as A, R, :I, and so forth, is a rectangular array of elements having m rows and k columns.
Examples of matrices are
A=
[7~ 2] ~
.7
3
'
B = [:
2
1/~J
I"[~
0
1 0
X" [
;
.3
2
3]
8
n
1 '
E = [eJ]
88 Chapter 2 Matrix Algebra and Random Vectors In our work, the matrix elements will be real numbers or functions taking on values in the real numbers.
Definition 2A.I4. The dimension (abbreviated dim)ofan m X k matrix is the ordered pair (m, k); m is the row dimension and k is the column dimension. The dimension of a matrix is frequentlyindicated in parentheses below the letter representing the matrix. Thus, the m x k matrix A is denoted by A .
(mXk)
In the preceding examples, the dimension of the matrix l: is 3 information can be conveyed by WI:iting l: .
(3X3)
X
3, and this
An m x k matrix, say, A, of arbitrary constants can be written
A
(mxk)
=
l'" '"l
a~i
: azz a2k ami amz amk
ai2
... ...
or more compactly as
(mxk)
A
{aij}, where the index i refers to the row and the
index j refers to the column. An m X 1 matrix is referred to as a column vector. A 1 x k matrix is referred to as a row vector. Since matrices can be considered as vectors side by side, it is natural to define multiplication by a scalar and the addition of two matrices with the same dimensions.
Definition2A.IS. 1\vomatrices A
(mXk)
= {a;j}and (mxk) = B
{bij}aresaidtobeequa/,
written A= B, if a;j = b;j, i = 1, 2, ... , m, j = 1,2, ... , k. That is, two matrices are equal if (a) Their dimensionality is the same. (b) Every corresponding element is the same.
Definition 2A.I6 {Matrix addition). Let the matrices A and B both be of dimension k with arbitrary elements a;j and bij• i = 1,2, ... ,m, j = 1,2, ... ,k, respectively. The sum of the matrices A and B is an m X k matrix C, written C = A + B, such that the arbitrary element of C is given by
m X
i = 1, 2, ... , m, j = 1, 2, ... , k
Note that the addition of matrices is defined only for matrices of the same dimension. For example;
[~ ~ 1~J
c
Vectors and Matrices: Basic Concepts 89
Definition 2A.I T(Scalar multiplication). Let c be an arbitrary scalar and A .r= {a;j}.
Then
(m><k)
cA
==
(m><k)
Ac
=
(m><k)
B
= {bij},
where b;j
= ca;j = a;jc,
(m><k)
i = 1, 2, ... , m,
j = 1,2, ... ,k.
Multiplication of a matrix by a scalar produces a new matrix whose elements are the elements of the original matrix, each multiplied by the scalar. For example, if c = 2,
4] [3 4] [! ~~]
6 5 2 0 cA 6 2 5 Ac
0
10
B
== {a;j}
Definition 2A.I8 (Matrix subtraction). Let A
(mxk)
and
(m><k)
B
= {b;j}
be two
matrices of equal dimension. Then the difference between A and B, written A  B, is an m x k matrix C = { c;j} given by
C=AB=A+(1)8
Thatis,c;j
= a;j +
(1)b;j
= a;j
b;j,i
=
1,2, ... ,m,j
= 1,2,
... ,k.
Definition 2A.I9. Consider them X k matrix A with arbitrary elements aij, i == 1, 2, ... , m, j == 1, 2, ... , k. The transpose of the matrix A, denoted by A', is the k x m matrix with elements aj;, j = 1, 2, ... , k, i = 1, 2, ... , m. That is, the transpose of the matrix A is obtained from A by interchanging the rows and columns.
As an example, if
A (2><3)
[2 1 3]
7

4 6
,
then
(3X2)
A' =
[2 7]
1 4 3 6
Result 2A.4. For all matrices A, B, and C (of equal dimension) and scalars c and d, the following hold:
(a) (A + B) + C (b) A+ B
=
=A
+ (B + C)
B +A
(c) c(A +B) = cA
+ cB
(That is, the transpose of the sum is equal to the sum of the transposes.)
(d) (c + d)A = cA + dA
(e) (A + B)'
=
A' + B'
(f) (cd)A = c(dA) (g) (cA)'
= cA'
•
90 Chapter 2 Matrix Algebra and Random Vectors
Definition 2A.20. If an arbitrary matrix A has the same number of rows and columns, then A is called a square matrix. The matrices I, I, and E given after Definition 2A.l3 are square matrices. Definition 2A.21._Let A beak X k (square) matrix. Then A is said to be symmetric if A = A'. That is, A is symmetric if a;i = aii• i = 1, 2, ... , k, j = 1, 2, ... , k.
Examples of symmetric matrices are
l 0 OJ I=010, [Q Q 1 (3X3)
Definition 2A.22. The k X k identity matrix, denoted by I , is the square matrix
(kXk)
with ones on the main (NWSE) diagonal and zeros elsewhere. The 3 X 3 identity matrix is shown before this definition.
Definition 2A.23 (Matrix multiplication). The product AB of an m X n matrix A = {a;i} and an n X k matrix B = {b;i} is them X k matrix C whose elements are
c;i =
f=I
L a;ebei
n
i = ·1, 2, ... , m
j = 1, 2, ... , k
Note that for the product AB to be defined, the column dimension of A must equal the row dimension of B. If that is so, then the row dimension of AB equals the row dimension of A, and the column dimension of AB equals the column dimension of B. For example, let
A (2X3)
3 [4
~ ~ J and
B =
(3X2)
3 4] [
6 2
4 3
Then
[! ~ :][:
(2X3)
(3X2)
J [:~ :~H::
(2X2)
c12]
Czz
Vectors and Matrices: Basic Concepts 91
where cu = (3)(3) + (1)(6) + (2)(4) = 11
CJ2
= =
(3)( 4) + ( 1)( 2) + (2)(3)
= 20
c 21 c22
= (4)(3)
+ (0)(6) + (5)(4)
= 32
= 31
(4)(4) + (0)( 2) + (5)(3)
As an additional example, consider the product of two vectors. Let
Then x' = [1
0
2
3] and
,.,
~
[1
0 2 3] [
=~] ~
[20]
~
[2 3
1
8] [
~] ~
,.,
Note that the product xy is undefined, since xis a 4 X 1 matrix andy is a 4 X 1 matrix, so the column dim of x, 1, is unequal to the row dim of y, 4.1f x andy are vectors of the same dimension, such as n X 1, both of the products x'y and xy' are defined. In particular, y'x = x'y = XI)'J + Xz.Y2 + · · · + x"y,., and xy' is an n X n matrix with i,jth element x;y1. Result 2A.S. For all matrices A, B, and C (of dimensions such that the indicated products are defined) and a scalar c,
(a) c(AB) = (c A)B
(b) A(BC) = (AB)C
(c) A(B + C) = AB + AC
(d) (B + C)A = BA + CA
(e) (AB)' = B'A'
More generally, for any x1 such that Ax1 is defined,
(f)
LA
j=l
n
Xj
= A
2: Xj
j=l
n
•
92
Chapter 2 Matrix Algebra and Random Vectors
There are several important differences between the algebra of matrices and the algebra of real numbers. TWo of these differences are as follows: 1. Matrix multiplication is, in general, not commutative. That is, in general, AB ¢ BA. Several examples will illustrate the failure of the commutative law (for matrice~).
[! ~J[~J ~~J
= [
but
[~][! ~]
is not defined.
[~
but
0
3
:J[: n.r,: 10]
33
1]  [ 19  1 2 3 6 10 18 3 12 26
[
Also, but
3 1 2 4
76] [1 0
0 1
4~]
[4 1][ 21]=[11 OJ
3 4 3 4
[ 21] [4 1] = [ 81]
3 4 0 1
12
7
2. Let 0 denote the zero matrix, that is, the matrix with zero for every element. In the algebra of real numbers, if the product of two numbers, ab, is zero, then a = 0 or b = 0. In matrix algebra, however, the product of two nonzero matrices may be the zero matrix. Hence,
(mxn)(nxk)
AB
(mXk)
0
does not imply that A
= 0 orB = 0. For example,
It is true, however, that if either A B 0 .
(mxn)(nxk) (mXk)
A
(mXn)
=
0
(mXn)
or
B
(nXk)
0 , then
(nXk)
Vectors and Matrices: Basic Concepts
93
Definition 2A.24. The determinant of the square k × k matrix A = {a_ij}, denoted by |A|, is the scalar

    |A| = a_11                                      if k = 1
    |A| = Σ_{j=1}^{k} a_1j |A_1j| (-1)^{1+j}        if k > 1

where A_1j is the (k - 1) × (k - 1) matrix obtained by deleting the first row and jth column of A. Also, |A| = Σ_{j=1}^{k} a_ij |A_ij| (-1)^{i+j}, with the ith row in place of the first row.
Examples of determinants (evaluated using Definition 2A.24) are

    |1 3; 6 4| = 1|4|(-1)^2 + 3|6|(-1)^3 = 1(4) + 3(6)(-1) = -14

    |3 1 6; 7 4 5; 2 -7 1| = 3|4 5; -7 1|(-1)^2 + 1|7 5; 2 1|(-1)^3 + 6|7 4; 2 -7|(-1)^4
                           = 3(39) - 1(-3) + 6(-57) = -222

    |1 0 0; 0 1 0; 0 0 1| = 1|1 0; 0 1|(-1)^2 + 0|0 0; 0 1|(-1)^3 + 0|0 1; 0 0|(-1)^4
                          = 1(1) = 1

If I is the k × k identity matrix, |I| = 1.

In general,

    |a_11 a_12 a_13; a_21 a_22 a_23; a_31 a_32 a_33|
        = a_11|a_22 a_23; a_32 a_33|(-1)^2 + a_12|a_21 a_23; a_31 a_33|(-1)^3
          + a_13|a_21 a_22; a_31 a_32|(-1)^4
The determinant of any 3 × 3 matrix can be computed by summing the products of elements along the solid lines and subtracting the products along the dashed lines in the following diagram. This procedure is not valid for matrices of higher dimension, but in general, Definition 2A.24 can be employed to evaluate these determinants.

    [Diagram: the familiar 3 × 3 "diagonals" pattern, with solid lines marking the three products that are added and dashed lines marking the three products that are subtracted.]
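For readers who prefer a computational statement of Definition 2A.24, here is a minimal Python sketch (ours; the function name det is our choice) of the cofactor expansion along the first row, checked against the examples above:

    # Determinant by cofactor expansion along the first row (Definition 2A.24).
    def det(A):
        k = len(A)
        if k == 1:
            return A[0][0]
        total = 0
        for j in range(k):                                   # 0-based column index
            minor = [row[:j] + row[j + 1:] for row in A[1:]] # delete row 1 and column j+1
            sign = (-1) ** j                                 # equals (-1)^(1+j) in the book's 1-based indexing
            total += A[0][j] * det(minor) * sign
        return total

    print(det([[1, 3], [6, 4]]))                    # -14
    print(det([[3, 1, 6], [7, 4, 5], [2, -7, 1]]))  # -222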
We next want to state a result that describes some properties of the determinant. However, we must first introduce some notions related to matrix inverses.
Definition 2A.25. The row rank of a matrix is the maximum number of linearly independent rows, considered as vectors (that is, row vectors). The column rank of a matrix is the rank of its set of columns, considered as vectors.
For example, let the matrix

    A = [1 1 1; 2 5 -1; 0 1 -1]

The rows of A, written as vectors, were shown to be linearly dependent after Definition 2A.6. Note that the column rank of A is also 2, since

    [1; -1; -1] = 2[1; 2; 0] - [1; 5; 1]

but columns 1 and 2 are linearly independent. This is no coincidence, as the following result indicates.
Result 2A.6. The row rank and the column rank of a matrix are equal.
•
Thus, the rank of a matrix is either the row rank or the column rank.
Definition 2A.26. A square matrix A (k×k) is nonsingular if

    A x = 0
  (k×k)(k×1) (k×1)

implies that x = 0 (k×1). If a matrix fails to be nonsingular, it is called singular. Equivalently, a square matrix is nonsingular if its rank is equal to the number of rows (or columns) it has.

Note that Ax = x_1 a_1 + x_2 a_2 + ... + x_k a_k, where a_i is the ith column of A, so that the condition of nonsingularity is just the statement that the columns of A are linearly independent.
Result 2A.7. Let A be a nonsingular square matrix of dimension k × k. Then there is a unique k × k matrix B such that

    AB = BA = I

where I is the k × k identity matrix.                                            •
Definition 2A.27. The B such that AB = BA = I is called the inverse of A and is denoted by A^{-1}. In fact, if BA = I or AB = I, then B = A^{-1}, and both products must equal I.
For example,

    A = [2 3; 1 5]   has   A^{-1} = [5/7 -3/7; -1/7 2/7]

since

    [2 3; 1 5][5/7 -3/7; -1/7 2/7] = [5/7 -3/7; -1/7 2/7][2 3; 1 5] = [1 0; 0 1]

Result 2A.8. (a) The inverse of any 2 × 2 matrix

    A = [a_11 a_12; a_21 a_22]

is given by

    A^{-1} = (1/|A|)[a_22 -a_12; -a_21 a_11]

(b) The inverse of any 3 × 3 matrix
    A = [a_11 a_12 a_13; a_21 a_22 a_23; a_31 a_32 a_33]

is given by

    A^{-1} = (1/|A|) [  |a_22 a_23; a_32 a_33|   -|a_12 a_13; a_32 a_33|    |a_12 a_13; a_22 a_23| ;
                       -|a_21 a_23; a_31 a_33|    |a_11 a_13; a_31 a_33|   -|a_11 a_13; a_21 a_23| ;
                        |a_21 a_22; a_31 a_32|   -|a_11 a_12; a_31 a_32|    |a_11 a_12; a_21 a_22| ]

In both (a) and (b), it is clear that |A| ≠ 0 if the inverse is to exist. In general, A^{-1} has (j, i)th entry [|A_ij|/|A|](-1)^{i+j}, where A_ij is the matrix obtained from A by deleting the ith row and jth column.                                        •

Result 2A.9. Let A be a k × k square matrix. Then the following are equivalent:

(a) A x = 0 implies x = 0 (A is nonsingular).
  (k×k)(k×1) (k×1)      (k×1)
(b) |A| ≠ 0.
(c) There exists a matrix A^{-1} such that AA^{-1} = A^{-1}A = I (k×k).          •

Result 2A.10. Let A and B be square matrices of the same dimension, and let the indicated inverses exist. Then the following hold:

(a) (A^{-1})' = (A')^{-1}
(b) (AB)^{-1} = B^{-1}A^{-1}                                                     •

The determinant has the following properties.

Result 2A.11. Let A and B be k × k square matrices.

(a) |A| = |A'|
(b) If each element of a row (column) of A is zero, then |A| = 0.
(c) If any two rows (columns) of A are identical, then |A| = 0.
(d) If A is nonsingular, then |A| = 1/|A^{-1}|; that is, |A||A^{-1}| = 1.
(e) |AB| = |A||B|
(f) |cA| = c^k |A|, where c is a scalar.

You are referred to [6] for proofs of parts of Results 2A.9 and 2A.11. Some of these proofs are rather complex and beyond the scope of this book.                    •

Definition 2A.28. Let A = {a_ij} be a k × k square matrix. The trace of the matrix A, written tr(A), is the sum of the diagonal elements; that is, tr(A) = Σ_{i=1}^{k} a_ii.

Result 2A.12. Let A and B be k × k matrices and c be a scalar.

(a) tr(cA) = c tr(A)
(b) tr(A ± B) = tr(A) ± tr(B)
(c) tr(AB) = tr(BA)
(d) tr(B^{-1}AB) = tr(A)
(e) tr(AA') = Σ_{i=1}^{k} Σ_{j=1}^{k} a_ij^2                                     •

Definition 2A.29. A square matrix A is said to be orthogonal if its rows, considered as vectors, are mutually perpendicular and have unit lengths; that is, AA' = I.

Result 2A.13. A matrix A is orthogonal if and only if A^{-1} = A'. For an orthogonal matrix, AA' = A'A = I, so the columns are also mutually perpendicular and have unit lengths.                                                                    •

An example of an orthogonal matrix is

    A = [-1/2  1/2  1/2  1/2;
          1/2 -1/2  1/2  1/2;
          1/2  1/2 -1/2  1/2;
          1/2  1/2  1/2 -1/2]

Clearly, A = A', so AA' = A'A = AA. We verify that

    AA' = A'A = AA = I

so A' = A^{-1}, and A must be an orthogonal matrix.

Square matrices are best understood in terms of quantities called eigenvalues and eigenvectors.

Definition 2A.30. Let A be a k × k square matrix and I be the k × k identity matrix. Then the scalars λ_1, λ_2, ..., λ_k satisfying the polynomial equation |A - λI| = 0 are called the eigenvalues (or characteristic roots) of the matrix A. The equation |A - λI| = 0 (as a function of λ) is called the characteristic equation.

For example, let

    A = [1 0; 1 3]

Then

    |A - λI| = |1-λ 0; 1 3-λ| = (1 - λ)(3 - λ) = 0

implies that there are two roots, λ_1 = 1 and λ_2 = 3. The eigenvalues of A are 3 and 1. Let

    A = [13 -4 2; -4 13 -2; 2 -2 10]

Then the equation

    |A - λI| = -λ^3 + 36λ^2 - 405λ + 1458 = 0

has three roots: λ_1 = 9, λ_2 = 9, and λ_3 = 18; that is, 9, 9, and 18 are the eigenvalues of A.

Definition 2A.31. Let A be a square matrix of dimension k × k and let λ be an eigenvalue of A. If x (k×1) is a nonzero vector (x ≠ 0) such that

    Ax = λx

then x is said to be an eigenvector (characteristic vector) of the matrix A associated with the eigenvalue λ.

An equivalent condition for λ to be a solution of the eigenvalue-eigenvector equation is |A - λI| = 0. This follows because the statement that Ax = λx for some λ and x ≠ 0 implies that

    0 = (A - λI)x = x_1 col_1(A - λI) + ... + x_k col_k(A - λI)

That is, the columns of A - λI are linearly dependent, so by Result 2A.9(b), |A - λI| = 0, as asserted. Following Definition 2A.30, we have shown that the eigenvalues of

    A = [1 0; 1 3]

are λ_1 = 1 and λ_2 = 3. The eigenvectors associated with these eigenvalues can be determined by solving the following equations:

    Ax = λ_1 x,    Ax = λ_2 x

From the first expression,

    x_1 = x_1
    x_1 + 3x_2 = x_2

or x_1 = -2x_2. There are many solutions for x_1 and x_2. Setting x_2 = 1 (arbitrarily) gives x_1 = -2, and hence,

    x = [-2; 1]

is an eigenvector corresponding to the eigenvalue 1. From the second expression,

    x_1 = 3x_1
    x_1 + 3x_2 = 3x_2

implies that x_1 = 0 and x_2 = 1 (arbitrarily), and hence,

    x = [0; 1]

is an eigenvector corresponding to the eigenvalue 3. It is usual practice to determine an eigenvector so that it has length unity. That is, if Ax = λx, we take e = x/√(x'x) as the eigenvector corresponding to λ. For example, the eigenvector for λ_1 = 1 is e_1' = [-2/√5, 1/√5].

Definition 2A.32. A quadratic form Q(x) in the k variables x_1, x_2, ..., x_k is Q(x) = x'Ax, where x' = [x_1, x_2, ..., x_k] and A is a k × k symmetric matrix.

Note that a quadratic form can be written as Q(x) = Σ_{i=1}^{k} Σ_{j=1}^{k} a_ij x_i x_j. For example,

    Q(x) = [x_1 x_2][1 1; 1 1][x_1; x_2] = x_1^2 + 2x_1x_2 + x_2^2

    Q(x) = [x_1 x_2 x_3][1 3 0; 3 0 -2; 0 -2 2][x_1; x_2; x_3] = x_1^2 + 6x_1x_2 - 4x_2x_3 + 2x_3^2

Any symmetric square matrix can be reconstructed from its eigenvalues and eigenvectors. The particular expression reveals the relative importance of each pair according to the relative size of the eigenvalue and the direction of the eigenvector.

Result 2A.14. The Spectral Decomposition. Let A be a k × k symmetric matrix. Then A can be expressed in terms of its k eigenvalue-eigenvector pairs (λ_i, e_i) as

    A = Σ_{i=1}^{k} λ_i e_i e_i'                                                 •

For example, let

    A = [2.2 0.4; 0.4 2.8]

Then

    |A - λI| = λ^2 - 5λ + 6.16 - 0.16 = (λ - 3)(λ - 2)

so A has eigenvalues λ_1 = 3 and λ_2 = 2. The corresponding eigenvectors are e_1' = [1/√5, 2/√5] and e_2' = [2/√5, -1/√5], respectively. Consequently,

    A = [2.2 0.4; 0.4 2.8] = 3[1/√5; 2/√5][1/√5 2/√5] + 2[2/√5; -1/√5][2/√5 -1/√5]
                           = [0.6 1.2; 1.2 2.4] + [1.6 -0.8; -0.8 0.4]

The ideas that lead to the spectral decomposition can be extended to provide a decomposition for a rectangular, rather than a square, matrix. If A is a rectangular matrix, then the vectors in the expansion of A are the eigenvectors of the square matrices AA' and A'A.

Result 2A.15. Singular-Value Decomposition. Let A be an m × k matrix of real numbers. Then there exist an m × m orthogonal matrix U and a k × k orthogonal matrix V such that

    A = UΛV'

where the m × k matrix Λ has (i, i) entry λ_i ≥ 0 for i = 1, 2, ..., min(m, k) and the other entries are zero. The positive constants λ_i are called the singular values of A.                                                                            •

The singular-value decomposition can also be expressed as a matrix expansion that depends on the rank r of A. Specifically, there exist r positive constants λ_1, λ_2, ..., λ_r, r orthogonal m × 1 unit vectors u_1, u_2, ..., u_r, and r orthogonal k × 1 unit vectors v_1, v_2, ..., v_r such that

    A = Σ_{i=1}^{r} λ_i u_i v_i' = U_r Λ_r V_r'

where U_r = [u_1, u_2, ..., u_r], V_r = [v_1, v_2, ..., v_r], and Λ_r is an r × r diagonal matrix with diagonal entries λ_i.
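Result 2A.14 is easy to confirm numerically. The following short sketch (ours, using NumPy) rebuilds the 2 × 2 example above from its eigenvalue-eigenvector pairs:

    import numpy as np

    # Spectral decomposition of a symmetric matrix: A = sum over i of lambda_i * e_i e_i'.
    A = np.array([[2.2, 0.4], [0.4, 2.8]])
    lam, E = np.linalg.eigh(A)        # eigh handles symmetric matrices; columns of E are eigenvectors
    A_rebuilt = sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(len(lam)))
    print(np.round(lam, 6))           # [2. 3.] (eigh returns eigenvalues in ascending order)
    print(np.allclose(A, A_rebuilt))  # True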
As an example of the singular-value decomposition, let

    A = [3 1 1; -1 3 1]

Then

    AA' = [3 1 1; -1 3 1][3 -1; 1 3; 1 1] = [11 1; 1 11]

You may verify that the eigenvalues γ = λ^2 of AA' satisfy the equation γ^2 - 22γ + 120 = (γ - 12)(γ - 10) = 0, and consequently, the eigenvalues are γ_1 = λ_1^2 = 12 and γ_2 = λ_2^2 = 10. The corresponding eigenvectors are

    u_1' = [1/√2, 1/√2]   and   u_2' = [1/√2, -1/√2]

respectively. Also,

    A'A = [3 -1; 1 3; 1 1][3 1 1; -1 3 1] = [10 0 2; 0 10 4; 2 4 2]

so |A'A - γI| = -γ^3 + 22γ^2 - 120γ = -γ(γ - 12)(γ - 10), and the eigenvalues are γ_1 = λ_1^2 = 12, γ_2 = λ_2^2 = 10, and γ_3 = λ_3^2 = 0. The nonzero eigenvalues are the same as those of AA'. A computer calculation gives the eigenvectors

    v_1' = [1/√6, 2/√6, 1/√6],   v_2' = [2/√5, -1/√5, 0],   and   v_3' = [1/√30, 2/√30, -5/√30]

Eigenvectors v_1 and v_2 can be verified by checking:

    A'A v_1 = 12 v_1 = λ_1^2 v_1,    A'A v_2 = 10 v_2 = λ_2^2 v_2

Here AA' has eigenvalue-eigenvector pairs (λ_i^2, u_i), so AA'u_i = λ_i^2 u_i, with λ_1^2, λ_2^2, ..., λ_k^2 > 0 = λ_{k+1}^2, ..., λ_m^2 (for m > k). Then v_i = λ_i^{-1} A'u_i. Alternatively, the v_i are the eigenvectors of A'A with the same nonzero eigenvalues λ_i^2.

Taking λ_1 = √12 and λ_2 = √10, we find that the singular-value decomposition of A is

    A = [3 1 1; -1 3 1]
      = √12 [1/√2; 1/√2][1/√6 2/√6 1/√6] + √10 [1/√2; -1/√2][2/√5 -1/√5 0]

The equality may be checked by carrying out the operations on the right-hand side.

The matrix expansion for the singular-value decomposition written in terms of the full-dimensional matrices U, V, Λ is

    A   =   U   Λ   V'
  (m×k)  (m×m)(m×k)(k×k)

where U has the m orthogonal eigenvectors of AA' as its columns, V has the k orthogonal eigenvectors of A'A as its columns, and Λ is specified in Result 2A.15.

The singular-value decomposition is closely connected to a result concerning the approximation of a rectangular matrix by a lower-dimensional matrix, due to Eckart and Young ([2]). If an m × k matrix A is approximated by B, having the same dimension but lower rank, the sum of squared differences is

    Σ_{i=1}^{m} Σ_{j=1}^{k} (a_ij - b_ij)^2 = tr[(A - B)(A - B)']

Result 2A.16. Let A be an m × k matrix of real numbers with m ≥ k and singular value decomposition UΛV'. Let s < k = rank(A). Then

    B = Σ_{i=1}^{s} λ_i u_i v_i'

is the rank-s least squares approximation to A. It minimizes

    tr[(A - B)(A - B)']

over all m × k matrices B having rank no greater than s. The minimum value, or error of approximation, is Σ_{i=s+1}^{k} λ_i^2.                                        •

To establish this result, we use UU' = I_m and VV' = I_k to write the sum of squares as

    tr[(A - B)(A - B)'] = tr[UU'(A - B)VV'(A - B)']
                        = tr[U'(A - B)V V'(A - B)'U]
                        = tr[(Λ - C)(Λ - C)'] = Σ_{i=1}^{m} Σ_{j=1}^{k} (λ_ij - c_ij)^2
                        = Σ_i (λ_i - c_ii)^2 + Σ Σ_{i≠j} c_ij^2

where C = U'BV. Clearly, the minimum occurs when c_ij = 0 for i ≠ j and c_ii = λ_i for the s largest singular values; the other c_ii = 0. That is, B = UCV' = Σ_{i=1}^{s} λ_i u_i v_i'.
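A short NumPy sketch (ours, for illustration) of Result 2A.16: build the rank-s approximation from the leading singular value and check that the squared error equals the sum of the omitted λ_i^2:

    import numpy as np

    A = np.array([[3.0, 1.0, 1.0], [-1.0, 3.0, 1.0]])
    U, svals, Vt = np.linalg.svd(A)            # singular values come back in decreasing order
    print(np.round(svals ** 2, 6))             # [12. 10.], as in the example above

    s = 1                                      # keep only the largest singular value
    B = svals[0] * np.outer(U[:, 0], Vt[0, :]) # rank-1 least squares approximation to A
    error = np.sum((A - B) ** 2)
    print(np.isclose(error, np.sum(svals[s:] ** 2)))   # True: error = sum of omitted lambda_i^2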
Exercises

2.1. Let x' = [5, 1, 3] and y' = [-1, 3, 1].
(a) Graph the two vectors.
(b) Find (i) the length of x, (ii) the angle between x and y, and (iii) the projection of y on x.
(c) Since x̄ = 3 and ȳ = 1, graph [5 - 3, 1 - 3, 3 - 3] = [2, -2, 0] and [-1 - 1, 3 - 1, 1 - 1] = [-2, 2, 0].

2.2. Given the matrices

    A = [-1 3; 4 2],   B = [4 -3; 1 -2; -2 0],   and   C = [5; -4; 2]

perform the indicated multiplications.
(a) 5A
(b) BA
(c) A'B'
(d) C'B
(e) Is AB defined?

2.3. Verify the following properties of the transpose when

    A = [2 1; 1 3],   B = [1 4 2; 5 0 3],   and   C = [1 4; 3 2]

(a) (A')' = A
(b) (C')^{-1} = (C^{-1})'
(c) (AB)' = B'A'
(d) For general A (m×k) and B (k×ℓ), (AB)' = B'A'.

2.4. When A^{-1} and B^{-1} exist, prove each of the following.
(a) (A')^{-1} = (A^{-1})'
(b) (AB)^{-1} = B^{-1}A^{-1}
Hint: Part a can be proved by noting that AA^{-1} = I, I = I', and (AA^{-1})' = (A^{-1})'A'. Part b follows from (B^{-1}A^{-1})AB = B^{-1}(A^{-1}A)B = B^{-1}B = I.

2.5. Check that

    Q = [5/13 12/13; -12/13 5/13]

is an orthogonal matrix.

2.6. Let

    A = [9 -2; -2 6]

(a) Is A symmetric?
(b) Show that A is positive definite.
2.7. Let A be as given in Exercise 2.6.
(a) Determine the eigenvalues and eigenvectors of A.
(b) Write the spectral decomposition of A.
(c) Find A^{-1}.
(d) Find the eigenvalues and eigenvectors of A^{-1}.

2.8. Given the matrix

    A = [1 2; 2 -2]

find the eigenvalues λ_1 and λ_2 and the associated normalized eigenvectors e_1 and e_2. Determine the spectral decomposition (2-16) of A.

2.9. Let A be as in Exercise 2.8.
(a) Find A^{-1}.
(b) Compute the eigenvalues and eigenvectors of A^{-1}.
(c) Write the spectral decomposition of A^{-1}, and compare it with that of A from Exercise 2.8.

2.10. Consider the matrices

    A = [4 4.001; 4.001 4.002]   and   B = [4 4.001; 4.001 4.002001]

These matrices are identical except for a small difference in the (2, 2) position. Moreover, the columns of A (and B) are nearly linearly dependent. Show that A^{-1} ≐ (-3)B^{-1}. Consequently, small changes, perhaps caused by rounding, can give substantially different inverses.

2.11. Show that the determinant of the p × p diagonal matrix A = {a_ij} with a_ij = 0, i ≠ j, is given by the product of the diagonal elements; thus, |A| = a_11 a_22 ··· a_pp.
Hint: By Definition 2A.24, |A| = a_11 A_11 + 0 + ··· + 0. Repeat for the submatrix A_11 obtained by deleting the first row and first column of A.

2.12. Show that the determinant of a square symmetric p × p matrix A can be expressed as the product of its eigenvalues λ_1, λ_2, ..., λ_p; that is, |A| = Π_{i=1}^{p} λ_i.
Hint: From (2-16) and (2-20), A = PΛP' with P'P = PP' = I. From Result 2A.11(e), |A| = |PΛP'| = |P||ΛP'| = |P||Λ||P'| = |Λ||I|, since |I| = |P'P| = |P'||P|. Apply Exercise 2.11.

2.13. Show that |Q| = +1 or -1 if Q is a p × p orthogonal matrix.
Hint: |QQ'| = |I|. Also, from Result 2A.11, |Q||Q'| = |Q|^2. Thus, |Q|^2 = |I|. Now use Exercise 2.11.

2.14. Show that Q' A Q and A have the same eigenvalues if Q (p×p) is orthogonal.
Hint: Let λ be an eigenvalue of A. Then 0 = |A - λI|. By Exercise 2.13 and Result 2A.11(e), we can write 0 = |Q'||A - λI||Q| = |Q'AQ - λI|, since Q'Q = I.

2.15. A quadratic form x'Ax is said to be positive definite if the matrix A is positive definite. Is the quadratic form 3x_1^2 + 3x_2^2 - 2x_1x_2 positive definite?

2.16. Consider an arbitrary n × p matrix A. Then A'A is a symmetric p × p matrix. Show that A'A is necessarily nonnegative definite.
Hint: Set y = Ax so that y'y = x'A'Ax.
2.17. Prove that every eigenvalue of a k × k positive definite matrix A is positive.
Hint: Consider the definition of an eigenvalue, where Ae = λe. Multiply on the left by e' so that e'Ae = λe'e.

2.18. Consider the sets of points (x_1, x_2) whose "distances" from the origin are given by

    c^2 = 4x_1^2 + 3x_2^2 - 2√2 x_1x_2

for c^2 = 1 and for c^2 = 4. Determine the major and minor axes of the ellipses of constant distances and their associated lengths. Sketch the ellipses of constant distances and comment on their positions. What will happen as c^2 increases?

2.19. Let A^{1/2} (m×m) = Σ_{i=1}^{m} √λ_i e_i e_i' = PΛ^{1/2}P', where PP' = P'P = I. (The λ_i's and the e_i's are the eigenvalues and associated normalized eigenvectors of the matrix A.) Show Properties (1)-(4) of the square-root matrix in (2-22).

2.20. Determine the square-root matrix A^{1/2}, using the matrix A in Exercise 2.3. Also, determine A^{-1/2}, and show that A^{1/2}A^{-1/2} = A^{-1/2}A^{1/2} = I.

2.21. (See Result 2A.15) Using the matrix

    A = [1 1; 2 -2; 2 2]

(a) Calculate A'A and obtain its eigenvalues and eigenvectors.
(b) Calculate AA' and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A.

2.22. (See Result 2A.15) Using the matrix

    A = [4 8 8; 3 6 -9]

(a) Calculate AA' and obtain its eigenvalues and eigenvectors.
(b) Calculate A'A and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A.

2.23. Verify the relationships V^{1/2}ρV^{1/2} = Σ and ρ = (V^{1/2})^{-1} Σ (V^{1/2})^{-1}, where Σ is the p × p population covariance matrix [Equation (2-32)], ρ is the p × p population correlation matrix [Equation (2-34)], and V^{1/2} is the population standard deviation matrix [Equation (2-35)].

2.24. Let X have covariance matrix

    Σ = [4 0 0; 0 9 0; 0 0 1]

Find
(a) Σ^{-1}
(b) The eigenvalues and eigenvectors of Σ.
(c) The eigenvalues and eigenvectors of Σ^{-1}.
2.25. Let X have covariance matrix

    Σ = [25 -2 4; -2 4 1; 4 1 9]

(a) Determine ρ and V^{1/2}.
(b) Multiply your matrices to check the relation V^{1/2} ρ V^{1/2} = Σ.

2.26. Use Σ as given in Exercise 2.25.
(a) Find ρ_13.
(b) Find the correlation between X_1 and (1/2)X_2 + (1/2)X_3.

2.27. Derive expressions for the mean and variances of the following linear combinations in terms of the means and covariances of the random variables X_1, X_2, and X_3.
(a) X_1 - 2X_2
(b) -X_1 + 3X_2
(c) X_1 + X_2 + X_3
(e) X_1 + 2X_2 - X_3
(f) 3X_1 - 4X_2 if X_1 and X_2 are independent random variables.

2.28. Show that

    Cov(c_1'X, c_2'X) = c_1' Σ_X c_2

where c_1' = [c_11, c_12, ..., c_1p] and c_2' = [c_21, c_22, ..., c_2p]. This verifies the off-diagonal elements of C Σ_X C' in (2-45) or diagonal elements if c_1 = c_2.
Hint: By (2-43), Z_1 - E(Z_1) = c_11(X_1 - μ_1) + ... + c_1p(X_p - μ_p) and Z_2 - E(Z_2) = c_21(X_1 - μ_1) + ... + c_2p(X_p - μ_p), so

    Cov(Z_1, Z_2) = E[(Z_1 - E(Z_1))(Z_2 - E(Z_2))]
                  = E[(c_11(X_1 - μ_1) + ... + c_1p(X_p - μ_p))(c_21(X_1 - μ_1) + c_22(X_2 - μ_2) + ... + c_2p(X_p - μ_p))]

The product

    (c_11(X_1 - μ_1) + ... + c_1p(X_p - μ_p))(c_21(X_1 - μ_1) + ... + c_2p(X_p - μ_p))
        = Σ_{ℓ=1}^{p} Σ_{m=1}^{p} c_1ℓ c_2m (X_ℓ - μ_ℓ)(X_m - μ_m)

has expected value

    Σ_{ℓ=1}^{p} Σ_{m=1}^{p} c_1ℓ c_2m σ_ℓm = [c_11, ..., c_1p] Σ_X [c_21, ..., c_2p]'

Verify the last step by the definition of matrix multiplication. The same steps hold for all elements.
2.29. Consider the arbitrary random vector X' = [X_1, X_2, X_3, X_4, X_5] with mean vector μ' = [μ_1, μ_2, μ_3, μ_4, μ_5]. Partition X into

    X = [X^(1); X^(2)]

where

    X^(1) = [X_1; X_2]   and   X^(2) = [X_3; X_4; X_5]

Let Σ be the covariance matrix of X with general element σ_ik. Partition Σ into the covariance matrices of X^(1) and X^(2) and the covariance matrix of an element of X^(1) and an element of X^(2).

2.30. You are given the random vector X' = [X_1, X_2, X_3, X_4] with mean vector μ_X' = [4, 3, 2, 1] and variance-covariance matrix

    Σ_X = [3 0 2 2; 0 1 1 0; 2 1 9 -2; 2 0 -2 4]

Partition X as

    X = [X^(1); X^(2)]   with   X^(1) = [X_1; X_2]   and   X^(2) = [X_3; X_4]

Let

    A = [1 2]   and   B = [1 -2; 2 -1]

and consider the linear combinations AX^(1) and BX^(2). Find
(a) E(X^(1))
(b) E(AX^(1))
(c) Cov(X^(1))
(d) Cov(AX^(1))
(e) E(X^(2))
(f) E(BX^(2))
(g) Cov(X^(2))
(h) Cov(BX^(2))
(i) Cov(X^(1), X^(2))
(j) Cov(AX^(1), BX^(2))

2.31. Repeat Exercise 2.30, but with A and B replaced by

    A = [1 -1]   and   B = [2 -1; 1 1]
2.32. You are given the random vector X' = [X_1, X_2, ..., X_5] with mean vector μ_X' = [2, 4, -1, 3, 0] and variance-covariance matrix

    Σ_X = [  4   -1   1/2  -1/2   0;
            -1    3    1    -1    0;
            1/2   1    6     1   -1;
           -1/2  -1    1     4    0;
             0    0   -1     0    2]

Partition X as

    X = [X^(1); X^(2)]   with   X^(1) = [X_1; X_2]   and   X^(2) = [X_3; X_4; X_5]

Let

    A = [1 -1]   and   B = [1 1 1; 1 1 -2]

and consider the linear combinations AX^(1) and BX^(2). Find
(a) E(X^(1))
(b) E(AX^(1))
(c) Cov(X^(1))
(d) Cov(AX^(1))
(e) E(X^(2))
(f) E(BX^(2))
(g) Cov(X^(2))
(h) Cov(BX^(2))
(i) Cov(X^(1), X^(2))
(j) Cov(AX^(1), BX^(2))

2.33. Repeat Exercise 2.32, but with X partitioned as

    X^(1) = [X_1; X_2; X_3]   and   X^(2) = [X_4; X_5]

and with A and B replaced by

    A = [2 -1 0; 1 1 3]   and   B = [1 2; 1 -1]

2.34. Consider the vectors b' = [2, -1, 4, 0] and d' = [-1, 3, -2, 1]. Verify the Cauchy-Schwarz inequality (b'd)^2 ≤ (b'b)(d'd).
2.35. Using the vectors b' = [-4, 3] and d' = [1, 1], verify the extended Cauchy-Schwarz inequality (b'd)^2 ≤ (b'Bb)(d'B^{-1}d) if

    B = [2 -2; -2 5]

2.36. Find the maximum and minimum values of the quadratic form 4x_1^2 + 4x_2^2 + 6x_1x_2 for all points x' = [x_1, x_2] such that x'x = 1.

2.37. With A as given in Exercise 2.6, find the maximum value of x'Ax for x'x = 1.

2.38. Find the maximum and minimum values of the ratio x'Ax/x'x for any nonzero vectors x' = [x_1, x_2, x_3] if

    A = [13 -4 2; -4 13 -2; 2 -2 10]

2.39. Show that

    A B C
  (r×s)(s×t)(t×v)

has (i, j)th entry Σ_{ℓ=1}^{s} Σ_{k=1}^{t} a_iℓ b_ℓk c_kj.
Hint: BC has (ℓ, j)th entry Σ_{k=1}^{t} b_ℓk c_kj = d_ℓj. So A(BC) has (i, j)th entry Σ_{ℓ=1}^{s} a_iℓ d_ℓj = Σ_{ℓ=1}^{s} Σ_{k=1}^{t} a_iℓ b_ℓk c_kj.

2.40. Verify (2-24): E(X + Y) = E(X) + E(Y) and E(AXB) = AE(X)B.
Hint: X + Y has X_ij + Y_ij as its (i, j)th element. Now, E(X_ij + Y_ij) = E(X_ij) + E(Y_ij) by a univariate property of expectation, and this last quantity is the (i, j)th element of E(X) + E(Y). Next (see Exercise 2.39), AXB has (i, j)th entry Σ_ℓ Σ_k a_iℓ X_ℓk b_kj, and by the additive property of expectation,

    E(Σ_ℓ Σ_k a_iℓ X_ℓk b_kj) = Σ_ℓ Σ_k a_iℓ E(X_ℓk) b_kj

which is the (i, j)th element of AE(X)B.

2.41. You are given the random vector X' = [X_1, X_2, X_3, X_4] with mean vector μ_X' = [3, 2, -2, 0] and variance-covariance matrix

    Σ_X = [3 0 0 0; 0 3 0 0; 0 0 3 0; 0 0 0 3]

Let

    A = [1 -1 0 0; 1 1 -2 0; 1 1 1 -3]

(a) Find E(AX), the mean of AX.
(b) Find Cov(AX), the variances and covariances of AX.
(c) Which pairs of linear combinations have zero covariances?
2.42. Repeat Exercise 2.41, but with A replaced by

    A = [1 -1 0 0; 1 1 -2 0]

References

1. Bellman, R. Introduction to Matrix Analysis (2nd ed.). Philadelphia: Society for Industrial & Applied Mathematics (SIAM), 1997.
2. Eckart, C., and G. Young. "The Approximation of One Matrix by Another of Lower Rank." Psychometrika, 1 (1936), 211-218.
3. Graybill, F. A. Introduction to Matrices with Applications in Statistics. Belmont, CA: Wadsworth, 1969.
4. Halmos, P. R. Finite-Dimensional Vector Spaces. New York: Springer-Verlag, 1993.
5. Johnson, R. A., and G. K. Bhattacharyya. Statistics: Principles and Methods (5th ed.). New York: John Wiley, 2005.
6. Noble, B., and J. W. Daniel. Applied Linear Algebra (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1988.
Chapter 3

SAMPLE GEOMETRY AND RANDOM SAMPLING

3.1 Introduction

With the vector concepts introduced in the previous chapter, we can now delve deeper into the geometrical interpretations of the descriptive statistics x̄, S_n, and R; we do so in Section 3.2. Many of our explanations use the representation of the columns of X as p vectors in n dimensions. In Section 3.3 we introduce the assumption that the observations constitute a random sample. Simply stated, random sampling implies that (1) measurements taken on different items (or trials) are unrelated to one another and (2) the joint distribution of all p variables remains the same for all items. Ultimately, it is this structure of the random sample that justifies a particular choice of distance and dictates the geometry for the n-dimensional representation of the data. Furthermore, when data can be treated as a random sample, statistical inferences are based on a solid foundation.

Returning to geometric interpretations in Section 3.4, we introduce a single number, called generalized variance, to describe variability. This generalization of variance is an integral part of the comparison of multivariate means. In later sections we use matrix algebra to provide concise expressions for the matrix products and sums that allow us to calculate x̄ and S_n directly from the data matrix X. The connection between x̄, S_n, and the means and covariances for linear combinations of variables is also clearly delineated, using the notion of matrix products.

3.2 The Geometry of the Sample

A single multivariate observation is the collection of measurements on p different variables taken on the same item or trial. As in Chapter 1, if n observations have been obtained, the entire data set can be placed in an n × p array (matrix):

    X (n×p) = [x_11 x_12 ... x_1p; x_21 x_22 ... x_2p; ...; x_n1 x_n2 ... x_np]
Each row of X represents a multivariate observation. Since the entire set of measurements is often one particular realization of what might have been observed, we say that the data are a sample of size n from a p-variate "population." The sample then consists of n measurements, each of which has p components.

As we have seen, the data can be plotted in two different ways. For the p-dimensional scatter plot, the rows of X represent n points in p-dimensional space. We can write

    X (n×p) = [x_11 x_12 ... x_1p; x_21 x_22 ... x_2p; ...; x_n1 x_n2 ... x_np]
            = [x_1'; x_2'; ...; x_n']   ← 1st (multivariate) observation
                                        ...
                                        ← nth (multivariate) observation      (3-1)

The row vector x_j', representing the jth observation, contains the coordinates of a point.

The scatter plot of n points in p-dimensional space provides information on the locations and variability of the points. If the points are regarded as solid spheres, the sample mean vector x̄, given by (1-8), is the center of balance. Variability occurs in more than one direction, and it is quantified by the sample variance-covariance matrix S_n. A single numerical measure of variability is provided by the determinant of the sample variance-covariance matrix. When p is greater than 3, this scatter plot representation cannot actually be graphed. Yet the consideration of the data as n points in p dimensions provides insights that are not readily available from algebraic expressions. Moreover, the concepts illustrated for p = 2 or p = 3 remain valid for the other cases.

Example 3.1 (Computing the mean vector) Compute the mean vector x̄ from the data matrix

    X = [4 1; -1 3; 3 5]

Plot the n = 3 data points in p = 2 space, and locate x̄ on the resulting diagram.

The first point, x_1, has coordinates x_1' = [4, 1]. Similarly, the remaining two points are x_2' = [-1, 3] and x_3' = [3, 5]. Here,

    x̄_1 = (4 - 1 + 3)/3 = 2   and   x̄_2 = (1 + 3 + 5)/3 = 3

so x̄' = [2, 3].                                                                 •
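For readers following along in software, a one-line NumPy check (ours, not part of the text) of Example 3.1: the sample mean vector is simply the vector of column averages of X.

    import numpy as np

    X = np.array([[4, 1], [-1, 3], [3, 5]])
    print(X.mean(axis=0))   # [2. 3.], the sample mean vector x-bar of Example 3.1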
[Figure 3.1: A plot of the data matrix X as n = 3 points in p = 2 space, with the sample mean x̄ marked.]

Figure 3.1 shows that x̄ is the balance point (center of gravity) of the scatter plot.

The alternative geometrical representation is constructed by considering the data as p vectors in n-dimensional space. Here we take the elements of the columns of the data matrix to be the coordinates of the vectors. Let

    X (n×p) = [x_11 x_12 ... x_1p; x_21 x_22 ... x_2p; ...; x_n1 x_n2 ... x_np]
            = [y_1 y_2 ... y_p]                                                (3-2)

Then the coordinates of the first point y_1' = [x_11, x_21, ..., x_n1] are the n measurements on the first variable. In general, the ith point y_i' = [x_1i, x_2i, ..., x_ni] is determined by the n-tuple of all measurements on the ith variable. In this geometrical representation, we depict y_1, ..., y_p as vectors rather than points, as in the p-dimensional scatter plot. We shall be manipulating these quantities shortly using the algebra of vectors discussed in Chapter 2.

Example 3.2 (Data as p vectors in n dimensions) Plot the following data as p = 2 vectors in n = 3 space:

    X = [4 1; -1 3; 3 5]
Here y_1' = [4, -1, 3] and y_2' = [1, 3, 5]. These vectors are shown in Figure 3.2.

[Figure 3.2: A plot of the data matrix X as p = 2 vectors in n = 3 space.]       •

Many of the algebraic expressions we shall encounter in multivariate analysis can be related to the geometrical notions of length, angle, and volume. This is important because geometrical representations ordinarily facilitate understanding and lead to further insights.

Unfortunately, we are limited to visualizing objects in three dimensions, and consequently, the n-dimensional representation of the data matrix X may not seem like a particularly useful device for n > 3. It turns out, however, that geometrical relationships and the associated statistical concepts depicted for any three vectors remain valid regardless of their dimension. This follows because three vectors, even if n-dimensional, can span no more than a three-dimensional space, just as two vectors with any number of components must lie in a plane. By selecting an appropriate three-dimensional perspective, that is, a portion of the n-dimensional space containing the three vectors of interest, a view is obtained that preserves both lengths and angles. Thus, it is possible, with the right choice of axes, to illustrate certain algebraic statistical concepts in terms of only two or three vectors of any dimension n. Since the specific choice of axes is not relevant to the geometry, we shall always label the coordinate axes 1, 2, and 3.

It is possible to give a geometrical interpretation of the process of finding a sample mean. We start by defining the n × 1 vector 1_n' = [1, 1, ..., 1]. (To simplify the notation, the subscript n will be dropped when the dimension of the vector 1_n is clear from the context.) The vector 1 forms equal angles with each of the n coordinate axes, so the vector (1/√n)1 has unit length in the equal-angle direction. Consider the vector y_i' = [x_1i, x_2i, ..., x_ni]. The projection of y_i on the unit vector (1/√n)1 is, by (2-8),

    y_i'((1/√n)1)(1/√n)1 = ((x_1i + x_2i + ... + x_ni)/n)1 = x̄_i 1             (3-3)

That is, the sample mean x̄_i = (x_1i + x_2i + ... + x_ni)/n = y_i'1/n corresponds to the multiple of 1 required to give the projection of y_i onto the line determined by 1.
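A small NumPy check (ours, for illustration) of (3-3) using the data of Example 3.2: projecting each column y_i onto the equal-angle vector 1 recovers x̄_i 1, and the leftover piece d_i = y_i - x̄_i 1 is perpendicular to 1.

    import numpy as np

    X = np.array([[4.0, 1.0], [-1.0, 3.0], [3.0, 5.0]])
    n = X.shape[0]
    one = np.ones(n)

    for i in range(X.shape[1]):
        y = X[:, i]
        xbar = y @ one / n          # multiple of 1 giving the projection, as in (3-3)
        d = y - xbar * one          # deviation (residual) vector
        print(xbar, d @ one)        # prints 2.0 0.0, then 3.0 0.0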
The Geometry of the Sample I 15 Further. vectors into mean components and deviation from the mean components is shown in Figure 3.x.l and a deviation component d.l.x 1 = (4.x· ' : _· x. = Y.:X.3 The decomposition ofy.l = The elements of d. or mean corrected. Decomposition of the Y.J (34) Xn. we have the decomposition where :X. 2.. for the data given in Example 3. Example 3. into a mean component :X. vector is d.l.x. .1 + 3)/3 = 2andx 2 = (1 + 3 + 5)/3 = 3.l.so . .i. = Y. The deviation. 3. = Y. . i = 1. into :X. Figure 3. for each y. l xu  Xz·..l is perpendicular to Y.l and d.2: Here.3 (Decomposing a vector into its mean and deviation components) Let us carry out the decomposition of y.3 for p = 3 and n = 3. 2. are the deviations of the measurements on the ith variable from their sample mean. i = 1. .X.
= Y1  x. .X...I)'(y..4 The deviation vectors d. The decomposition is .[n.I){2 2 2][1]46+20 A similar result holds for x2 1 and dz = Yz .X. from Figure 3.Xzl.m[n and We note that x11 and d.l. we are interested in the deviation (or residual) vectors d. [} [:HlJ hmmf:J • For the time being.. A plot of the deviation vectors of Figur_e 3. because (X. . Figure 3.l are perpendicular. = Y.4.3.1 16 Chapter 3 Sample Geometry and Random Sampling Consequently.3 is given in Figure 3. . ··..
k denote the angle formed by the vectors d. Example 3.) (xjk . and dk. . we get or. we see that the squared length is proportional to the variance of the measurements on the ith variable. let us compute the sample variancecovariance matrix Sn and sample correlation matrix R using the geometrical concepts just introduced. and db djdk = L (xj. the sample correlation will be close to 1.) 2 ~ ~ (xjk  xk/ cos ( fJik) so that [see (15)] (37) The cosine of the angle is the sample correlation coefficient.3. Now consider the squared lengths of the deviation vectors. Longer vectors represent more variability than shorter vectors. Using (25) and (34). we obtain L3. If the two vectors are nearly perpendicular. = djd. For any two deviation vectors d. Equivalently. If the two vectors are oriented in nearly opposite directions.  x.x. if the two deviation vectors have nearly the same orientation.) 2 (35) (Length of deviation vector ) 2 = sum of squared deviations From (13). = j=l ± (xji .3.using (35) and (36). the length is proportional to the standard deviation.)(xjk . the sample correlation will be close to 1.4 (Calculating Sn and R from deviation vectors) Given the deviation vec tors in Example 3. From (26).The Geometry of the Sample I 17 We have translated the deviation vectors to the origin without changing their lengths or orientations.xk) = ~ ~ (xji  x. j=l n x. we obtain ~ (xj.xk) (36) Let fJ. the sample correlation will be approximately zero. From Example 3. Thus.
or s12 = . or s22 = ~. and Sn =[ i 11 ~] ~ ' R = [1 . translated to the origin.189] . are shown in Figure 3. v'¥ v1 = _:?. '12 = s ~ 12 vs.. or s 11 =!f.~. 3 = .189 .189 1 • . These vectors. Now.5.5 The deviation vectors d1 and d2. Consequently. Also. Finally.I 18 Chapter 3 Sample Geometry and Random Sampling 3 2 3 4 5 Figure 3.
Random Samples and the Expected Values of the Sample Mean and Covariance Matrix
I 19
The concepts of length, angle, and projection have provided us with a geometrical interpretation of the sample. We summarize as follows:
Geometrical Interpretation of the Sample
1. The projection of a column Y; of the data matrix X onto the equal angular vector 1 is the vector x;l. The vector :X;l has length Vn I x; 1. Therefore, the ith sample mean, x;, is related to the length of the projection of Y; on 1. 2. The information comprising sn is obtained from the deviation vectors d; = y;  x;l = [xli  X;, x 2 ;  xi> ... , xni  x;]'. The square of the length of d; is ns;;, and the (inner) product between d; and dk is nsik. 1 3. The sample correlation r;k is the cosine of the angle between d; and dk.
3.3 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix In order to study the sampling variability of statistics such as x and Sn with the ultimate aim of making inferences, we need to make assumptions about the variables whose observed values constitute the data set X. Suppose, then, that the data have not yet been observed, but we intend to collect n sets of measurements on p variables. Before the measurements are made, their values cannot, in general, be predicted exactly. Consequently, we treat them as random variables. In this context, let the (j, k}th entry in the data matrix be the random variable Xi k. Each set of measurements Xi on p variables is a random vector, and we have the random matrix
xll
(nxp)
X
=
Xzl
[
:
Xnl
X1p] = x~p . . Xnp
[X~] ~z .
.
X~
(38)
A random sample can now be defined. If the row vectors Xj, X2, ... , X~ in (38} represent independent observations from a common joint distribution with density function f(x) = f(xb x 2, ... , xp), then X1o X 2 , ... , Xn are said to form a random sample from f(x). Mathematically, X 1 , X 2 , ... , Xn form a random sample if their joint density function is given by the product f(x 1 )f(x 2 ) · · · f(xn), where f(xi) = f(xi 1, xi 2, ... , xi p) is the density function for the jth row vector. Two points connected with the definition of random sample merit special attention: 1. The measurements of the p variables in a single trial, such as Xj = [ Xi 1 , Xi 2 , .•• , Xip], will usually be correlated. Indeed, we expect this to be the case. The measurements from different trials must, however, be independent.
1 The square of the length and the inner product are (n  1 )s;; and (n  l)sik, respectively, when the divisor n  1 is used in the definitions of tbe sample variance and covariance.
120
Chapter 3 Sample Geometry and Random Sampling
2. The independence of measurements from trial to triaJ may not hold when the variables are likely to drift over time, as with sets of p stock prices or p economic indicators. Violations of the tentative assumption of independence can have a serious impact on the quality of statistical inferences. The following exjlmples illustrate these remarks.
Example 3.5 (Selecting a random sample) As a preliminary step in designing a permit system for utilizing a wilderness canoe area without overcrowding, a naturalresource manager took a survey of users. The total wilderness area was divided into subregions, and respondents were asked to give information on the regions visited, lengths of stay, and other variables. The method followed was to select persons randomly (perhaps using a random· number table) from all those who entered the wilderness area during a particular week. All persons were equally likely to be in the sample, so the more popular entrances were represented by larger proportions of canoeists. Here one would expect the sample observations to conform closely to the criterion for a random sample from the population of users or potential users. On the other hand, if one of the samplers had waited at a campsite far in the interior of the area and interviewed only canoeists who reached that spot, successive measurements would not be independent. For instance, lengths of stay in the wilderness area for dif• ferent canoeists from this group would all tend to be large. Example 3.6 (A nonrandom sample) Because of concerns with future solidwaste disposal, an ongoing study concerns the gross weight of municipaJ solid waste generated per year in the United States (EnvironmentaJ Protection Agency). Estimated amounts attributed to x 1 = paper and paperboard waste and x 2 = plastic waste, in millions of tons, are given for selected years in Table 3.1. Should these measurements on X' = [XI> X 2 ] be treated as a random sample of size n = 7? No! In fact, except for a slight but fortunate downturn in paper and paperboard waste in 2003, both variables are increasing over time. Table 3.1 Solid Waste
Year
x 1 (paper)
1960 29.2
.4
1970 44.3 2.9
1980 55.2 6.8
1990 72.7 17.1
1995 81.7 18.9
2000 87.7 24.7
2003 83.1 26.7
x2 (plastics)
•
As we have argued heuristically in Chapter 1, the notion of statistical independence has important implications for measuring distance. Euclidean distance appears appropriate if the components of a vector are independent and have the same variances. Suppose we consider the location of the kthcolumn Yk = [Xlk, X2k, ... , Xnk] of X, regarded as a point inn dimensions. The location of this point is determined by the joint probability distribution f(yk) = f(xlk, x2k, . .. , Xnk)· When the measurements Xlk,X2k,····Xnk are a random sample, f(Yk) = f(xlk,x2k,····xnk) = fk(xlk)fk(x2k) · · · fk(xnk) and, consequently, each coordinate xjk contributes equally to the location through the identical marginal distributionsfk(xjk)·
Random Samples and the Expected Values of the Sample Mean and Covariance Matrix
121
If the n components are not independent or the marginal distributions are not identical, the influence of individual measurements (coordinates) on location is asymmetrical. We would then be led to consider a distance function in which the coordinates were weighted unequally, as in the "statistical" distances or quadratic forms introduced in Chapters 1 and 2. Certain conclusions can be reached concerning the sampling distributions of X and Sn without making further assumptions regarding the form of the underlying joint distribution of the variables. In particular, we can see how X and Sn fare as point estimators of the corresponding population mean vector p. and covariance matrix .I.
Result 3.1. Let X 1 , X 2 , ... , Xn be a random sample from a joint distribution that has mean vector p. and covariance matrix .I. Then X is an unbiased estimator of p., and its covariance matrix is
That is,
E(X) = p.
1 Cov(X) =.I n
(population mean vector) ( population variancecovariance matrix) divided by sample size
(39)
For the covariance matrix Sn,
Thus,
Ec: 1
Sn) =.I
(310)
so [nj(n  1) ]Sn is an unbiased estimator of .I, while Sn is a biased estimator with (bias)= E(Sn) .I= (1/n).I.
Proof. Now, X = (X 1 + X 2 + · · · + Xn)/n. The repeated use of the properties of expectation in (224) for two vectors gives
E(X)
=E
=
(1 1 1 ~X1 + ~Xz + ··· + ~Xn )
E
(~ X1) + E (~ Xz)
1 n
+ ··· + E
1 n
(~ Xn)
1 n 1 n 1 n
= E(XJ) + E(Xz) + ··· + E(Xn) = p. + p. + ··· + p.
=p.
1 n
Next,
(X p.)(X p.)' =
=
_
_
(1 L
n
nj~l
(Xj p.) )
n
(1 L
n
(Xe p.) ) '
ne=l
1
2
n
L L (Xj p.)(Xe p.)' j=l f=l
n
122
Chapter 3 Sample Geometry and R11ndom Sampling
so
For j # e, each entry in E(Xj  J.t)(Xe  1')' is zero because the entry is the covariance between a component of Xj and a component of Xe, and these are independent. [See Exercise 3.17 and (229).] Therefore,
Since I = E(Xj  1' )(Xj  1' )' is the common population covariance matrix for each Xj, we have
1 n 1 Cov(X) =  ( ~ E(X  J.t)(X· J.t)' ) =  (I +I+···+ I) n2 j=i 1 1 n2 .....:...._~~ nterms
= J._(ni) =(.!.)I n2 n To obtain the expected value of Sn, we first note that ( X 1 ; X,) (Xjk  Xk) is the (i, k)th element of (Xj  X) (Xj  X)'. The matrix representing sums of squares and cross products can then be written as
~ (Xj X)(Xj X)'=~ (Xj X)Xj
j=i . j=l
n
n
+
(
~(Xj X)
]=1
n
)
(X)'
n
= ~ xjxjj=!
nx.x·
n
n
i=i
· since ~ (Xj  X) = 0 and nX' = ~ Xj. Titerefore, its expected value is
,1
For any random vector V with E(V) = 1'v and Cov(V) = Iv, we have E(VV') Iv + 1'vl'v· (See Exercise 3.16.) Consequently,
E(XjX}) =I+ 1'1''
=
and
E(XX') =.!.I + 1'1'' n
Using these results, we obtain
~ :=: E(XjXj)
 nE(XX') = ni + nJ.LJ.t' n ;;I+ 1'1''

(1
)
= (n 1)I
and thus, since Sn = (1/n) (
j=i
±
XjXj nXX'), it follows immediately that
E(S ) = (n  1) I n n
•
Generalized Variance
123
Result 3.1 shows that the (i, k)th entry, (n  1)1
2: (Xj;
j=l
n
X;) (Xjk  Xk), of
[n/( n  1) JSn is an unbiased estimator of a;k· However, the individual sample standard deviations Vi;;, calculated with either nor n  1 as a divisor, are not unbiased estimators of the corresponding population quantities ~.Moreover, the correlation coefficients r;k are not unbiased estimators of the population quantities Pik· However, the bias E( ~)  ~'or E(r;k)  Pik> can usually be ignored if the sample size n is moderately large. Consideration of bias motivates a slightly modified definition of the sample variancecovariance matrix. Result 3.1 provides us with an unbiased estimator S of I:
(Unbiased) Sample VarianceCovariance Matrix
S
=
( n) Sn = 1
~
n
~ LJ
n
1~ (Xj  X)(Xj  X)'
1 Fl
(311)
HereS, without a subscript, has (i, k)th entry (n  1)1
2: (Xji
j=l
n
X;) (Xfk  Xk)·
This definition of sample covariance is commonly used in many multivariate test statistics. Therefore, it will replace Sn as the sample covariance matrix in most of the material throughout the rest of this book.
3.4 Generalized Variance
With a single variable, the sample variance is often used to describe the amount of variation in the measurements on that variable. When p variables are observed on each unit, the variation is described by the sample variancecovariance matrix
S=
The sample covariance matrix contains p variances and !P(P  1) potentially different co variances. Sometimes it is desirable to assign a single numerical value for the variation expressed by S. One choice for a value is the determinant of S, which reduces to the usual sample variance of a single characteristic when p = 1. This determinant 2 is called the generalized sample variance: Generalized sample variance =
l
su
s~2
Stp
Isl
(312)
2 Definition 2A.24 defines "determinant" and indicates one method for calculating the value of a determinant.
124
Chapter 3 Sample Geometry and Random Sampling
Example 3.7 (Calculating a generalized variance) Employees (x 1) and profits per employee (x 2 ) for the 161argest publishing firms in the United States are shown in Figure 1.3. The sample covariance matrix, obtained from the data in the April 30, 1990, Forbes magazine article, is
s=[
Evaluate the generalized variance. In this case, we compute
1
252.()4 68.43
68.43] 123.67
s1=
(252.04) ( 123.67)  ( 68.43 )( 68.43) = 26,487
•
The generalized sample variance provides one way of writing the information on all variances and covariances as a single number. Of course, when p > 1, some information about the sample is lost in the process. A geometrical interpretation of IS I will help us appreciate its strengths and weaknesses as a descriptive summary. Consider the area generated within the plane by two deviation vectors d 1 = y1  .X1 1 and d 2 = y2  .X 2 1. Let Ld 1 be the length of d 1 and Ld 2 the length of d 2 . By elementary geometry, we have the diagram
~·
Height= Ld 1 sin (ll)
and theareaofthe trapezoid is I Ld1 sin(O) ILd2 • Sincecos2 (0) + sin2 (0) = 1, we can express this area as
From (35) and (37),
Ld 1 = ) Ld2 =
~ (xil 
2 .X1)
= V(n
1)su
)j~ (xj2 xd =
cos(O) = r1 z
V(n 1)Szz
and Therefore, Area= (n  1)~ v'.l;";" V1 Also,
IS I =
d2=
(n l)"Ysus22 (1 
d 2)
(313)
I[su S12
S12] I =
Szz  SuSzzdz
I[ ~ su r12 ~ Szz v'.l;";"r12] I v'S22
= SuSzz(1
 r1z)
(314)
= SuSzz
Generalized Variance 125
,~,
1\
3
"\ ,': \
I
\
3
\
'' ,,,
I(
I
I\
\
'
\
\
\
\
,,
\
(a)
(b)
Figure 3.6 (a) "Large" generalized sample variance for p (b) "Small" generalized sample variance for p = 3.
=
3.
If we compare (314) with (313), we see that
lSI=
(area) 2/(n 1) 2
Assuming now that IS I = (n  1)(pl)(volume )2 holds for the volume generated inn space by the p  1 deviation vectors d 1 , d 2 , ... , dpl, we can establish the following general result for p deviation vectors by induction (see [1], p. 266): Generalized sample variance
= 1S 1 = (n
 1rP(volume) 2
(315)
Equation (315) says that the generalized sample variance, for a fixed set of data, is 3 proportional to the square of the volume generated by the p deviation vectors d1 = Y1 .X1l, d 2 = y2  .X 2 l, ... ,dP = Yp xPl. Figures 3.6(a) and (b) show trapezoidal regions, generated by p = 3 residual vectors, corresponding to "large" and "small" generalized variances. For a fixed sample size, it is clear from the geometry that volume, or IS I, will increase when the length of any d; = Y;  .X;l (or ~) is increased. In addition, volume will increase if the residual vectors of fixed length are moved until they are at right angles to one another, as in figure 3.6(a). On the other hand, the volume, or I S /,will be small if just one of the S;; is small or one of the deviation vectors lies nearly in the (hyper) plane formed by the others, or both. In the second case, the trapezoid has very little height above the plane. This is the situation in Figure 3 .6(b), where d 3 lies nearly in the plane formed by d 1 and d 2 .
3 1f generalized variance is defmed in terms of the sample covariance matrix s. = [ ( n  1 )/n ]S, then, using Result 2A.11,jS.I = i[(n 1)/n]IpSI = j[{n 1)/n]IpiiSI = [(n 1)/nJPISj. Consequently, using (315), we can also write the following: Generalized sample variance = IS.l = nP(volumef ·
12 6 Chapter 3 Sample Geometry and Random Sampling Generalized variance also has interpretations in the pspace scatter plot representation of the data. The most intuitive interpretation concerns the spread of the scatter about the sample mean point x' = [:i\, x2 , ... , xp] Consider the measure of distancegiven in the comment below (219), with x playing the role of the fixed point 1' and sI playing the role of A. With these choices, the coordinates x' = [ x 1, x2 , ... , x P] of the points a constant distance c from i satisfy
(x  i)'S1(x x) = CJ
(316)
[When p = 1, (x  x)'S 1(x.  i) = (x!  XJ'l/su is the squared distance from Xj to x1 in standard deviation units.) Equation (316) defines a hyperellipsoid (an ellipse if p = 2) centered at i. It can be shown using integral calculus that the volume of this hyperellipsoid is related to IS I· In particular, Volume of{x: (x i)'S1(x i) or (Volume of ellipsoid ) 2 = (constant) (generalized sample variance) where the constant kp is rather formidable. 4 A large volume corresponds to a large generalized variance. Although the generalized variance has some intuitively pleasing geometrical interpretations, it suffers from a basic weakness as a descriptive summary of the sample covariance matrix S, as the following example shows.
:S
c2}
=
kpiSI 1 12cP
(317)
Example 3.8 (Interpreting the generalized variance) Figure 3.7 gives three scatter plots with very different patterns of correlation. All three data sets have i' = [2, 1], and the covariance matrices are
5 4 S = [ 4 5] ' r
=
.8 S
= [
3 0 0 3] ' r
= 0 S = [ 4  5] ' r
5
4
=
.8
Each covariance matrix S contains the information on the variability of the component variables and also the information required to calculate the correlation coefficient. In this sense, S captures the orientation and size of the pattern of scatter. The eigenvalues and eigenvectors extracted from S further describe the pattern in the scatter plot. For
s=
[5 4]
4
5 '
the eigenvalues satisfy
0 = (A  5) 2  42 = (A  9) (A  1)
4
For those who are curious, k P = 2w"f21p r(p/2), where f( <)denotes the gamma function evaluated
~
d~
Generalized Variance
x,
7
X,
127
7
•
• •
•• •.. ·' .
• • • •
• • • •• • •• • •
• •
7
x,
• • • • •• • • •• • • • • • ••• ••• • • • • •• • • • •
(b)
7 x,
(a)
x,
• • •• ••• •
•
7
..... .. . ,. •• • • • .... • • • ••
(c)
.• • •
Figure 3. 7 Scatter plots with three different orientations.
and we determine the eigenvalueeigenvector pairs AI = 9, ej = [ 1/VZ' 1/ vz] and Az = 1,e2 = [1/v'i,1/VZ]. The meancentered ellipse, with center x' = [2, 1] for all three cases, is
To describe this ellipse, as in Section 2.3, with A = s 1 , we notice that if (A, e) is an eigenvalueeigenvector pair for S, then (A I, e) is an eigenvalueeigenvector pair for s1• That is, if Se = Ae, then multiplying on the left by s1 gives s1Se = AS 1e, or s1e = Ale. Therefore, using the eigenvalues from S, we know that the ellipse extends evA; in the direction of e; from x.
128
Chapter 3 Sample Geometry and Random Sampling
In p = 2 dimensions, the choice c 2 = 5.99 will produce an ellipse that contains approximately 95% of the observations. The vectors 3v'5.99 e 1 and v'5.99 e 2 are drawn in Figure 3.8(a). Notice how the directions are the natural axes for the ellipse, and observe that the lengths of these scaled eigenvectors are comparable to the size of the pattern in each direction. Next, for
S=[~ ~].
the eigenvalues satisfy
0 == (A  3) 2
and we arbitrarily choose the eigenvectors so that A = 3, e) = [1, OJ and A = 3, 1 2 e2 = [0, 1]. The vectors v3 v'5.99 e1 and v3 v'5.99 e 2 are drawn in Figure 3.8(b ).
'i
7 7
• • • • • •
'
•
• ... •
• •
• •
7
x,
•
.f
(a)
(b)
• • • • • • • •• • tp ••••
7
•
x,
• • • • • • ••
(c)
Figure 3.8 Axes of the meancentered 95% ellipses for the scatter plots in Figure 3.7.
Generalized Variance
12 9
Finally, for
s=
[
5 4
4]
5 ,
the eigenvalt,1es satisfy
0 = (A  5) 2

(
4) 2
= (A  9)(A  1)
and we determine the eigenvalueeigenvector pairs A1 = 9, ej = [1/ V2, 1/ \12] and A = 1, e2 = [lj\12, 1/VZ]. The scaled eigenvectors 3v'5.99e 1 and v'5.99e2 are 2 drawn in Figure 3.8(c). In two dimensions, we can often sketch the axes of the meancentered ellipse by eye. However, the eigenvector approach also works for high dimensions where the data cannot be examined visually. Note: Here the generalized variance IS I gives the same value, IS I = 9, for all three patterns. But generalized variance does not contain any information on the orientation of the patterns. Generalized variance is easier to interpret when the two or more samples (patterns) being compared have nearly the same orientations. Notice that our three patterns of scatter appear to cover approximately the same area. The ellipses that summarize the variability (x  i)'S1(x  i)
:s?
do have exactly the same area [see (317)], since all have
IS I =
9.
•
As Example 3.8 demonstrates, different correlation structures are not detected by IS 1. The situation for p > 2 can be even more obscure. . Consequently, it is often desirable to provide more than the single number IS I _as a summary of S. From Exercise 2.12, IS I can be expressed as the product A1 A · · · AP of the eigenvalues of S. Moreover, the meancentered ellipsoid based on 2 s1 [see (316)] has axes whose lengths are proportional to the square roots of the A/s (see Section 2.3). These eigenvalues then provide information on the variability in all directions in the pspace representation of the data. It is useful, therefore, to report their individual values, as well as their product. We shall pursue this topic later when we discuss principal components.
Situations in which the Generalized Sample Variance Is Zero
The generalized sample variance will be zero in certain situations. A generalized variance of zero is indicative of extreme degeneracy, in the sense that at least one column of the matrix of deviations,
[
di+l> ... 'dp.
X~  ~:J = [XII ~I x
X 2 : X
X1p 
. .
21 : XI
. .
..
· ··
x 2P 
XPJ
xP
. .
X~ 
x'
Xnl 
XI
Xnp Xp
=X(nxp)
(nxl)(lxp)
1
x'
(318)
can be expressed as a linear combination of the other columns. As we have shown geometrically, this is a case where one of the deviation vectorsfor instance, d/ = [xu X;, ... ' Xni xJIies in the (hyper) plane generated by dl' ... 'd;1,
130
Chapter 3 Sample Geometry and Random Sampling
Result 3.2. The generalized variance is zero when, and only when, at least one deviation vector lies in the (hyper) plane formed by all linear combinations of the othersthat is, when the columns of the matrix of deviations in (318) are linearly dependent.
Proof. If the columns of the deviation matrix (X  IX') are linearly dependent, there is a linear combination of the columns such that
0 = a1 col 1(X  li') =(X IX~)a
+ ·· · + apcolp(X  li')
forsomea # 0
But then, as you may verify, (n 1)S = (X  IX')' (X lx') and
(n 1)Sa =(X li')'(X IX')a =
o
so the same a corresponds to a linear dependency, a 1 col 1(S) + · · · + aP colp(S) = Sa = o, in the columns of S. So, by Result 2A.9, IS I = 0. In the other direction, if IS I = 0, then there is some linear combination Sa of the columns of S such that Sa= 0. That is, 0 = (n 1)Sa =(X lx')'(X li')a. Premultiplying by a' yields 0 = a'(X li')'(X li')a = LfxJJ.')a and, for the length to equal zero, we must have (X  IX')a of (X  li') are linearly dependent.
= 0. Thus, the columns
•
Example 3.9 (A case where the generalized variance is zero) Show that 1 S I = 0 for
X
(3x3)
=
12 5] [
4 1 6 4 0 4
and determine the degeneracy. Here x' = [3, 1, 5], so
1 3
X
IX'=
[
43 43 0 1 45
~ =~ ~ = [~ ~ ~]
=
!]
1
1
1
The deviation (column) vectors are di = [2,1,1], d2 = [1,0,l], and d3 = [0, 1, 1 ]. Since d3 = d1 + 2d2, there is column degeneracy. (Note that there is row degeneracy also.) This means that one of the deviation vectorsfor example, d Iies in the plane generated by the other two residual vectors. Consequently, the 3 threedimensional volume is zero. This case is illustrated in Figure 3.9 and may be verified algebraically by showing that IS I = 0. We have 3
s (3x3)
~
[
~
Generalized Variance
131
6
5
3
4
figure 3.9 A case where the threedimensional volume is zero (ISI=O).
and from Definition 2A.24,
lSI
=31! :l{1)z + (n~~ :1(1)3 + {0)~~ !l(1)4
= 3 (1

n+ m (
~
o)
+ 0 = ~ ~
=
0
•
When large data sets are sent and received electronically, investigators are sometimes unpleasantly surprised to find a case of zero generalized variance, so that S does not have an inverse. We have encountered several such cases, with their associated difficulties, before the situation was unmasked. A singular covariance matrix occurs when, for instance, the data are test scores and the investigator has included variables that are sums of the others. For example, an algebra score and a geometry score could be combined to give a total math score, or class midterm and final exam scores summed to give total points. Once, the total weight of a number of chemicals was included along with that of each component. This common practice of creating new variables that are sums of the original variables and then including them in the data set has caused enough lost time that we emphasize the necessity of being alert to avoid these consequences.
Example 3.10 (Creating new variables that lead to a zero generalized variance) Consider the data matrix

    [1   9  10]
    [4  12  16]
X = [2  10  12]
    [5   8  13]
    [3  11  14]

where the third column is the sum of the first two columns. These data could be the number of successful phone solicitations per day by a part-time and a full-time employee, respectively, so the third column is the total number of successful solicitations per day. Show that the generalized variance |S| = 0, and determine the nature of the dependency in the data.

We find that the mean corrected data matrix, with entries xjk − x̄k, is

           [−2  −1  −3]
           [ 1   2   3]
X − 1x̄′ =  [−1   0  −1]
           [ 2  −2   0]
           [ 0   1   1]

The resulting covariance matrix is

    [2.5  0    2.5]
S = [0    2.5  2.5]
    [2.5  2.5  5.0]

We verify that, in this case, the generalized variance

|S| = (2.5)²(5.0) + 0 + 0 − (2.5)³ − (2.5)³ − 0 = 0

In general, if the three columns of the data matrix X satisfy a linear constraint a1 xj1 + a2 xj2 + a3 xj3 = c, a constant for all j, then a1 x̄1 + a2 x̄2 + a3 x̄3 = c, so that

a1(xj1 − x̄1) + a2(xj2 − x̄2) + a3(xj3 − x̄3) = 0   for all j

That is,

(X − 1x̄′)a = 0

and the columns of the mean corrected data matrix are linearly dependent. Thus, the inclusion of the third variable, which is linearly related to the first two, has led to the case of a zero generalized variance. Whenever the columns of the mean corrected data matrix are linearly dependent,

(n − 1)Sa = (X − 1x̄′)′(X − 1x̄′)a = (X − 1x̄′)′0 = 0

and Sa = 0 establishes the linear dependency of the columns of S. Hence, |S| = 0.

Since Sa = 0 = 0·a, we see that a is a scaled eigenvector of S associated with an eigenvalue of zero. This gives rise to an important diagnostic: If we are unaware of any extra variables that are linear combinations of the others, we can find them by calculating the eigenvectors of S and identifying the one associated with a zero eigenvalue. That is, if we were unaware of the dependency in this example, a computer calculation would find an eigenvalue of zero with eigenvector proportional to a′ = [1, 1, −1], since

     [2.5  0    2.5] [ 1]   [0]       [ 1]
Sa = [0    2.5  2.5] [ 1] = [0] = 0 · [ 1]
     [2.5  2.5  5.0] [−1]   [0]       [−1]

The coefficients reveal that

1(xj1 − x̄1) + 1(xj2 − x̄2) + (−1)(xj3 − x̄3) = 0   for all j

In addition, the sum of the first two variables minus the third is a constant c for all n units. Here the third variable is actually the sum of the first two variables, so the columns of the original data matrix satisfy a linear constraint with c = 0. Because we have the special case c = 0, the constraint establishes the fact that the columns of the data matrix are linearly dependent. ■
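The eigenvector diagnostic described above is easy to carry out in practice. The following sketch, which is not part of the original text and assumes NumPy, recovers the dependency in Example 3.10 from the eigendecomposition of S alone.

```python
import numpy as np

# Data matrix from Example 3.10; the third column is the sum of the first two.
X = np.array([[1.0,  9.0, 10.0],
              [4.0, 12.0, 16.0],
              [2.0, 10.0, 12.0],
              [5.0,  8.0, 13.0],
              [3.0, 11.0, 14.0]])
S = np.cov(X, rowvar=False)                  # divisor n - 1, matching S

eigenvalues, eigenvectors = np.linalg.eigh(S)   # eigenvalues in ascending order
print(eigenvalues[0])                        # ~0: S is singular
a = eigenvectors[:, 0]                       # eigenvector for the zero eigenvalue
print(a / a[0])                              # proportional to [1, 1, -1]
print(np.allclose(S @ a, 0))                 # True: condition (1), Sa = 0
print(np.ptp(X @ a))                         # ~0: a'x_j is the same constant for all j
```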
Let us summarize the important equivalent conditions for a generalized variance to be zero that we discussed in the preceding example. Whenever a nonzero vector a satisfies one of the following three conditions, it satisfies all of them:

(1) Sa = 0   (a is a scaled eigenvector of S with eigenvalue 0.)

(2) a′(xj − x̄) = 0 for all j   (The linear combination of the mean corrected data, using a, is zero.)

(3) a′xj = c for all j, where c = a′x̄   (The linear combination of the original data, using a, is a constant.)

We showed that if condition (3) is satisfied, that is, if the values for one variable can be expressed in terms of the others, then the generalized variance is zero because S has a zero eigenvalue. In the other direction, if condition (1) holds, then the eigenvector a gives coefficients for the linear dependency of the mean corrected data.

In any statistical analysis, |S| = 0 means that the measurements on some variables should be removed from the study as far as the mathematical computations are concerned. The corresponding reduced data matrix will then lead to a covariance matrix of full rank and a nonzero generalized variance. The question of which measurements to remove in degenerate cases is not easy to answer. When there is a choice, one should retain measurements on a (presumed) causal variable instead of those on a secondary characteristic. We shall return to this subject in our discussion of principal components. At this point, we settle for delineating some simple conditions for S to be of full rank or of reduced rank.
Result 3.3. If n ≤ p, that is, (sample size) ≤ (number of variables), then |S| = 0 for all samples.
Proof. We must show that the rank of S is less than p and then apply Result 2A.9. For any fixed sample, the n row vectors in (3-18) sum to the zero vector. The existence of this linear combination means that the rank of X − 1x̄′ is less than or equal to n − 1, which, in turn, is less than or equal to p − 1 because n ≤ p. Since

(n − 1) S  =  (X − 1x̄′)′ (X − 1x̄′)
    (p×p)       (p×n)      (n×p)

the kth column of S, colk(S), can be written as a linear combination of the columns of (X − 1x̄′)′. In particular,

(n − 1) colk(S) = (X − 1x̄′)′ colk(X − 1x̄′)
               = (x1k − x̄k) col1(X − 1x̄′)′ + ··· + (xnk − x̄k) coln(X − 1x̄′)′

Since the column vectors of (X − 1x̄′)′ sum to the zero vector, we can write, for example, col1(X − 1x̄′)′ as the negative of the sum of the remaining column vectors. After substituting for col1(X − 1x̄′)′ in the preceding equation, we can express colk(S) as a linear combination of the at most n − 1 linearly independent vectors col2(X − 1x̄′)′, ..., coln(X − 1x̄′)′. The rank of S is therefore less than or equal to n − 1, which, as noted at the beginning of the proof, is less than or equal to p − 1, so S is singular. This implies, from Result 2A.9, that |S| = 0. ■
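Result 3.3 is easy to demonstrate numerically. The following sketch is not from the text; it assumes NumPy and shows that with n ≤ p the sample covariance matrix is singular no matter what the data are.

```python
import numpy as np

# Illustration of Result 3.3: n = 3 observations on p = 4 variables,
# drawn at random (any data would do).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))

S = np.cov(X, rowvar=False)
print(np.linalg.det(S))            # 0 up to rounding error
print(np.linalg.matrix_rank(S))    # at most n - 1 = 2, so S is singular
```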
Result 3.4. Let the p × 1 vectors x1, x2, ..., xn, where xj′ is the jth row of the data matrix X, be realizations of the independent random vectors X1, X2, ..., Xn. Then

1. If the linear combination a′Xj has positive variance for each constant vector a ≠ 0, then, provided that p < n, S has full rank with probability 1 and |S| > 0.

2. If, with probability 1, a′Xj is a constant (for example, c) for all j, then |S| = 0.

Proof. (Part 2). If a′Xj = a1Xj1 + a2Xj2 + ··· + apXjp = c with probability 1, then a′xj = c for all j, and the sample mean of this linear combination is c̄ = Σj (a1xj1 + a2xj2 + ··· + apxjp)/n = a1x̄1 + a2x̄2 + ··· + apx̄p = a′x̄. Then

[a′x1 − a′x̄]   [c − c̄]   [0]
[     ⋮    ] = [  ⋮  ] = [⋮]
[a′xn − a′x̄]   [c − c̄]   [0]

indicating linear dependence; the conclusion follows from Result 3.2. The proof of Part (1) is difficult and can be found in [2]. ■

Generalized Variance Determined by |R| and Its Geometrical Interpretation

The generalized sample variance is unduly affected by the variability of measurements on a single variable. For example, suppose some skk is either large or quite small. Then, geometrically, the corresponding deviation vector dk = (yk − x̄k1) will be very long or very short and will therefore clearly be an important factor in determining volume. Consequently, it is sometimes useful to scale all the deviation vectors so that they have the same length.

Scaling the residual vectors is equivalent to replacing each original observation xjk by its standardized value (xjk − x̄k)/√skk. The sample covariance matrix of the standardized variables is then R, the sample correlation matrix of the original variables. (See Exercise 3.13.) We define

(Generalized sample variance of the standardized variables) = |R|   (3-19)

Since the resulting vectors

[(x1k − x̄k)/√skk, (x2k − x̄k)/√skk, ..., (xnk − x̄k)/√skk]′ = (yk − x̄k1)′/√skk

all have length √(n − 1), the generalized sample variance of the standardized variables will be large when these vectors are nearly perpendicular and will be small when two or more of them are in almost the same direction.
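The claim that standardizing each column turns the covariance matrix into R can be confirmed directly. A small sketch, not part of the original text and assuming NumPy, with an arbitrary made-up data matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3)) @ np.array([[2.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.7]])
# Standardized values (x_jk - xbar_k) / sqrt(s_kk):
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

R_from_Z = np.cov(Z, rowvar=False)          # covariance of standardized variables
R_direct = np.corrcoef(X, rowvar=False)     # correlation matrix of the original X
print(np.allclose(R_from_Z, R_direct))      # True
```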
Employing the argument leading to (3-7), we readily find that the cosine of the angle θik between (yi − x̄i1)/√sii and (yk − x̄k1)/√skk is the sample correlation coefficient rik. Therefore, we can make the statement that |R| is large when all the rik are nearly zero and it is small when one or more of the rik are nearly +1 or −1.

The volume generated in p-space by the deviation vectors can be related to the generalized sample variance. The same steps that lead to (3-15) produce

(Generalized sample variance of the standardized variables) = |R| = (n − 1)^(−p) (volume)²   (3-20)

The volume generated by deviation vectors of the standardized variables is illustrated in Figure 3.10 for the two sets of deviation vectors graphed in Figure 3.6. The ith deviation vectors lie in the direction of di, but all have a squared length of n − 1. A comparison of Figures 3.10 and 3.6 reveals that the influence of the d2 vector (large variability in x2) on the squared volume |S| is much greater than its influence on the squared volume |R|.

Figure 3.10 The volume generated by equal-length deviation vectors of the standardized variables.
The quantities |S| and |R| are connected by the relationship

|S| = (s11 s22 ··· spp) |R|   (3-21)

so

(n − 1)^p |S| = (s11 s22 ··· spp)(n − 1)^p |R|   (3-22)

[The proof of (3-21) is left to the reader as Exercise 3.12.]

Interpreting (3-22) in terms of volumes, we see from (3-15) and (3-20) that the squared volume (n − 1)^p|S| is proportional to the squared volume (n − 1)^p|R|. The constant of proportionality is the product of the variances, which, in turn, is proportional to the product of the squares of the lengths (n − 1)sii of the di. Equation (3-21) shows, algebraically, how a change in the measurement scale of X1, for example, will alter the relationship between the generalized variances. Since |R| is based on standardized measurements, it is unaffected by the change in scale. However, the relative value of |S| will be changed whenever the multiplicative factor s11 changes.

Example 3.11 (Illustrating the relation between |S| and |R|) Let us illustrate the relationship in (3-21) for the generalized variances |S| and |R| when p = 3. Suppose

      [4  3  1]
S  =  [3  9  2]
(3×3) [1  2  1]

Then s11 = 4, s22 = 9, and s33 = 1. Moreover,

    [ 1   1/2  1/2]
R = [1/2   1   2/3]
    [1/2  2/3   1 ]

Using Definition 2A.24, we obtain

|S| = 4 |9  2; 2  1| (−1)² + 3 |3  2; 1  1| (−1)³ + 1 |3  9; 1  2| (−1)⁴
    = 4(9 − 4) − 3(3 − 2) + 1(6 − 9) = 14

|R| = 1 |1  2/3; 2/3  1| (−1)² + (1/2) |1/2  2/3; 1/2  1| (−1)³ + (1/2) |1/2  1; 1/2  2/3| (−1)⁴
    = (1 − 4/9) − (1/2)(1/2 − 1/3) + (1/2)(1/3 − 1/2) = 14/36

It then follows that

14 = |S| = s11 s22 s33 |R| = (4)(9)(1)(14/36) = 14   (check) ■
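The relation (3-21) can be verified mechanically. A short sketch, not part of the original text and assuming NumPy, using the S of Example 3.11:

```python
import numpy as np

S = np.array([[4.0, 3.0, 1.0],
              [3.0, 9.0, 2.0],
              [1.0, 2.0, 1.0]])
d = np.sqrt(np.diag(S))            # standard deviations sqrt(s_ii)
R = S / np.outer(d, d)             # r_ik = s_ik / sqrt(s_ii s_kk)

print(np.linalg.det(S))                          # 14.0
print(np.linalg.det(R))                          # 14/36 = 0.3888...
print(np.prod(np.diag(S)) * np.linalg.det(R))    # 14.0, confirming (3-21)
```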
Another Generalization of Variance

We conclude this discussion by mentioning another generalization of variance. Specifically, we define the total sample variance as the sum of the diagonal elements of the sample variance-covariance matrix S. Thus,

Total sample variance = s11 + s22 + ··· + spp   (3-23)

Example 3.12 (Calculating the total sample variance) Calculate the total sample variance for the variance-covariance matrices S in Examples 3.7 and 3.9.

From Example 3.7,

S = [252.04  −68.43]
    [−68.43  123.67]

and

Total sample variance = s11 + s22 = 252.04 + 123.67 = 375.71

From Example 3.9,

    [  3    −3/2    0 ]
S = [−3/2    1     1/2]
    [  0    1/2     1 ]

and

Total sample variance = s11 + s22 + s33 = 3 + 1 + 1 = 5 ■

Geometrically, the total sample variance is the sum of the squared lengths of the p deviation vectors d1 = (y1 − x̄1 1), ..., dp = (yp − x̄p 1), divided by n − 1. The total sample variance criterion pays no attention to the orientation (correlation structure) of the residual vectors. For instance, it assigns the same values to both sets of residual vectors (a) and (b) in Figure 3.6.
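Because the total sample variance is just the trace of S, it takes one line to compute. A sketch, not part of the original text and assuming NumPy, for the two matrices of Example 3.12:

```python
import numpy as np

S1 = np.array([[252.04, -68.43],
               [-68.43, 123.67]])
S2 = np.array([[3.0, -1.5, 0.0],
               [-1.5, 1.0, 0.5],
               [0.0, 0.5, 1.0]])
print(np.trace(S1))    # 375.71, the total sample variance
print(np.trace(S2))    # 5.0
```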
3.5 Sample Mean, Covariance, and Correlation as Matrix Operations

We have developed geometrical representations of the data matrix X and the derived descriptive statistics x̄ and S. In addition, it is possible to link algebraically the calculation of x̄ and S directly to X using matrix operations. The resulting expressions, which depict the relation between x̄, S, and the full data set X concisely, are easily programmed on electronic computers.

We have it that x̄i = (x1i + x2i + ··· + xni)/n = yi′1/n. Therefore,

     [x̄1]         [x11  x21  ···  xn1] [1]
     [x̄2]    1    [x12  x22  ···  xn2] [1]        1
x̄ =  [⋮ ] = ―――  [ ⋮    ⋮         ⋮ ] [⋮]   =   ――― X′1   (3-24)
     [x̄p]    n    [x1p  x2p  ···  xnp] [1]        n

That is, x̄ is calculated from the transposed data matrix by postmultiplying by the vector 1 and then multiplying the result by the constant 1/n.

Next, we create an n × p matrix of means by transposing both sides of (3-24) and premultiplying by 1; that is,

1x̄′ = (1/n) 11′X   (3-25)

Subtracting this result from X produces the n × p matrix of deviations (residuals)

                [x11 − x̄1   x12 − x̄2   ···   x1p − x̄p]
X − (1/n)11′X = [   ⋮           ⋮                ⋮   ]   (3-26)
                [xn1 − x̄1   xn2 − x̄2   ···   xnp − x̄p]
Now, the matrix (n − 1)S representing sums of squares and cross products is just the transpose of the matrix (3-26) times the matrix itself, or

(n − 1)S = (X − (1/n)11′X)′(X − (1/n)11′X) = X′(I − (1/n)11′)′(I − (1/n)11′)X = X′(I − (1/n)11′)X

since

(I − (1/n)11′)′(I − (1/n)11′) = I − (1/n)11′ − (1/n)11′ + (1/n²)11′11′ = I − (1/n)11′

because 1′1 = n. To summarize, the matrix expressions relating x̄ and S to the data set X are

x̄ = (1/n) X′1
S = (1/(n − 1)) X′(I − (1/n)11′)X   (3-27)

The result for Sn is similar, except that 1/n replaces 1/(n − 1) as the first factor.

The relations in (3-27) show clearly how matrix operations on the data matrix X lead to x̄ and S.

Once S is computed, it can be related to the sample correlation matrix R. The resulting expression can also be "inverted" to relate R to S. We first define the p × p sample standard deviation matrix D^(1/2) and compute its inverse, (D^(1/2))^(−1) = D^(−1/2). Let

D^(1/2)  =  [√s11    0    ···    0  ]
 (p×p)      [  0   √s22   ···    0  ]   (3-28)
            [  ⋮           ⋱         ]
            [  0     0    ···   √spp]

Then

D^(−1/2) = [1/√s11    0     ···     0   ]
 (p×p)     [   0    1/√s22  ···     0   ]
           [   ⋮              ⋱          ]
           [   0       0     ···  1/√spp]

and we have

R = D^(−1/2) S D^(−1/2)   (3-29)
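The matrix formulas (3-24), (3-27), and (3-29) translate directly into code. A sketch, not part of the original text and assuming NumPy, checked against the library's own routines:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))           # an arbitrary n x p data matrix
n = X.shape[0]
one = np.ones((n, 1))

xbar = X.T @ one / n                   # (3-24): xbar = X'1 / n
H = np.eye(n) - one @ one.T / n        # I - (1/n)11'
S = X.T @ H @ X / (n - 1)              # (3-27)
Dinv = np.diag(1.0 / np.sqrt(np.diag(S)))
R = Dinv @ S @ Dinv                    # (3-29): R = D^(-1/2) S D^(-1/2)

print(np.allclose(xbar.ravel(), X.mean(axis=0)))      # True
print(np.allclose(S, np.cov(X, rowvar=False)))        # True
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True
```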
Postmultiplying and premultiplying both sides of (3-29) by D^(1/2), and noting that D^(−1/2)D^(1/2) = D^(1/2)D^(−1/2) = I, gives

S = D^(1/2) R D^(1/2)   (3-30)

That is, R can be obtained from the information in S, whereas S can be obtained from D^(1/2) and R. Equations (3-29) and (3-30) are sample analogs of (2-36) and (2-37).

3.6 Sample Values of Linear Combinations of Variables

We have introduced linear combinations of p variables in Section 2.6. In many multivariate procedures, we are led naturally to consider a linear combination of the form

c′X = c1X1 + c2X2 + ··· + cpXp

whose observed value on the jth trial is

c′xj = c1xj1 + c2xj2 + ··· + cpxjp,   j = 1, 2, ..., n   (3-31)

The n derived observations in (3-31) have

Sample mean = (c′x1 + c′x2 + ··· + c′xn)/n = c′(x1 + x2 + ··· + xn)/n = c′x̄   (3-32)

Since (c′xj − c′x̄)² = (c′(xj − x̄))² = c′(xj − x̄)(xj − x̄)′c, we have

Sample variance = [(c′x1 − c′x̄)² + (c′x2 − c′x̄)² + ··· + (c′xn − c′x̄)²]/(n − 1)
               = c′[(x1 − x̄)(x1 − x̄)′ + (x2 − x̄)(x2 − x̄)′ + ··· + (xn − x̄)(xn − x̄)′]c/(n − 1)

or

Sample variance of c′X = c′Sc   (3-33)

Equations (3-32) and (3-33) are sample analogs of (2-43). They correspond to substituting the sample quantities x̄ and S for the "population" quantities μ and Σ, respectively, in (2-43).

Now consider a second linear combination

b′X = b1X1 + b2X2 + ··· + bpXp

whose observed value on the jth trial is

b′xj = b1xj1 + b2xj2 + ··· + bpxjp,   j = 1, 2, ..., n   (3-34)
It follows from (3-32) and (3-33) that the sample mean and variance of these derived observations are

Sample mean of b′X = b′x̄
Sample variance of b′X = b′Sb

Moreover, the sample covariance computed from pairs of observations on b′X and c′X is

Sample covariance
= [(b′x1 − b′x̄)(c′x1 − c′x̄) + (b′x2 − b′x̄)(c′x2 − c′x̄) + ··· + (b′xn − b′x̄)(c′xn − c′x̄)]/(n − 1)
= b′[(x1 − x̄)(x1 − x̄)′ + (x2 − x̄)(x2 − x̄)′ + ··· + (xn − x̄)(xn − x̄)′]c/(n − 1)

or

Sample covariance of b′X and c′X = b′Sc

In sum, we have the following result.

Result 3.5. The linear combinations

b′X = b1X1 + b2X2 + ··· + bpXp
c′X = c1X1 + c2X2 + ··· + cpXp

have sample means, variances, and covariances that are related to x̄ and S by

Sample mean of b′X = b′x̄
Sample mean of c′X = c′x̄   (3-35)

Sample variance of b′X = b′Sb
Sample variance of c′X = c′Sc   (3-36)
Sample covariance of b′X and c′X = b′Sc ■

Example 3.13 (Means and covariances for linear combinations) We shall consider two linear combinations and their derived values for the n = 3 observations given in Example 3.9 as

    [1  2  5]
X = [4  1  6]
    [4  0  4]

Consider the two linear combinations

b′X = [2  2  −1] [X1; X2; X3] = 2X1 + 2X2 − X3

and

c′X = [1  −1  3] [X1; X2; X3] = X1 − X2 + 3X3
Observations on these linear combinations are obtained by replacing X1, X2, and X3 with their observed values. The n = 3 observations on b′X are

b′x1 = 2x11 + 2x12 − x13 = 2(1) + 2(2) − (5) = 1
b′x2 = 2x21 + 2x22 − x23 = 2(4) + 2(1) − (6) = 4
b′x3 = 2x31 + 2x32 − x33 = 2(4) + 2(0) − (4) = 4

The sample mean and variance of these values are, respectively,

Sample mean = (1 + 4 + 4)/3 = 3
Sample variance = [(1 − 3)² + (4 − 3)² + (4 − 3)²]/(3 − 1) = 3

In a similar manner, the n = 3 observations on c′X are

c′x1 = 1x11 − 1x12 + 3x13 = 1(1) − 1(2) + 3(5) = 14
c′x2 = 1(4) − 1(1) + 3(6) = 21
c′x3 = 1(4) − 1(0) + 3(4) = 16

and

Sample mean = (14 + 21 + 16)/3 = 17
Sample variance = [(14 − 17)² + (21 − 17)² + (16 − 17)²]/(3 − 1) = 13

Moreover, the sample covariance, computed from the pairs of observations (b′x1, c′x1), (b′x2, c′x2), and (b′x3, c′x3), is

Sample covariance = [(1 − 3)(14 − 17) + (4 − 3)(21 − 17) + (4 − 3)(16 − 17)]/(3 − 1) = 9/2

Alternatively, if only the descriptive statistics are of interest, we do not even need to calculate the observations b′xj and c′xj. We use the sample mean vector x̄ and sample covariance matrix S derived from the original data matrix X to calculate the sample means, variances, and covariances for the linear combinations. From Example 3.9,

x̄ = [3; 1; 5]   and   S = [  3    −3/2    0 ]
                          [−3/2    1     1/2]
                          [  0    1/2     1 ]
Consequently, using (3-36), we find that the two sample means for the derived observations are

Sample mean of b′X = b′x̄ = [2  2  −1][3; 1; 5] = 3   (check)
Sample mean of c′X = c′x̄ = [1  −1  3][3; 1; 5] = 17   (check)

Using (3-36), we also have

Sample variance of b′X = b′Sb = [2  2  −1] S [2; 2; −1] = 3   (check)
Sample variance of c′X = c′Sc = [1  −1  3] S [1; −1; 3] = 13   (check)
Sample covariance of b′X and c′X = b′Sc = 9/2   (check)

As indicated, these last results check with the corresponding sample quantities computed directly from the observations on the linear combinations. ■
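Both routes through Example 3.13 take only a few lines of code. A sketch, not part of the original text and assuming NumPy:

```python
import numpy as np

X = np.array([[1.0, 2.0, 5.0],
              [4.0, 1.0, 6.0],
              [4.0, 0.0, 4.0]])
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
b = np.array([2.0, 2.0, -1.0])
c = np.array([1.0, -1.0, 3.0])

# Summary-statistics route, via (3-36):
print(b @ xbar, c @ xbar)      # 3.0 17.0  (sample means)
print(b @ S @ b, c @ S @ c)    # 3.0 13.0  (sample variances)
print(b @ S @ c)               # 4.5       (sample covariance, 9/2)
# First-principles route, from the derived observations themselves:
print(np.cov(X @ b, X @ c))    # [[3. 4.5] [4.5 13.]]
```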
The sample mean and covariance relations in Result 3.5 pertain to any number of linear combinations. Consider the q linear combinations

ai1X1 + ai2X2 + ··· + aipXp,   i = 1, 2, ..., q   (3-37)

These can be expressed in matrix notation as

[a11X1 + a12X2 + ··· + a1pXp]   [a11  a12  ···  a1p] [X1]
[a21X1 + a22X2 + ··· + a2pXp] = [a21  a22  ···  a2p] [X2]  =  AX   (3-38)
[            ⋮              ]   [ ⋮    ⋮         ⋮ ] [⋮ ]
[aq1X1 + aq2X2 + ··· + aqpXp]   [aq1  aq2  ···  aqp] [Xp]

Taking the ith row of A, ai′, to be b′ and the kth row of A, ak′, to be c′, we see that Equations (3-36) imply that the ith row of AX has sample mean ai′x̄ and that the ith and kth rows of AX have sample covariance ai′Sak. Note that ai′Sak is the (i, k)th element of ASA′.

Result 3.6. The q linear combinations AX in (3-38) have sample mean vector Ax̄ and sample covariance matrix ASA′. ■

Exercises

3.1. Given the data matrix

    [9  1]
X = [5  3]
    [1  2]

(a) Graph the scatter plot in p = 2 dimensions. Locate the sample mean on your diagram.
(b) Sketch the n = 3-dimensional representation of the data, and plot the deviation vectors y1 − x̄1 1 and y2 − x̄2 1.
(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate the lengths of these vectors and the cosine of the angle between them. Relate these quantities to Sn and R.

3.2. Given the data matrix

    [3   4]
X = [6  −2]
    [3   1]

(a) Graph the scatter plot in p = 2 dimensions, and locate the sample mean on your diagram.
(b) Sketch the n = 3-space representation of the data, and plot the deviation vectors y1 − x̄1 1 and y2 − x̄2 1.
(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate their lengths and the cosine of the angle between them. Relate these quantities to Sn and R.

3.3. Perform the decomposition of y1 into x̄1 1 and y1 − x̄1 1 using the first column of the data matrix in Example 3.9.

3.4. Use the six observations on the variable X1, in units of millions, from Table 1.1.
(a) Find the projection on 1′ = [1, 1, 1, 1, 1, 1].
(b) Calculate the deviation vector y1 − x̄1 1. Relate its length to the sample standard deviation.
(c) Graph (to scale) the triangle formed by y1, x̄1 1, and y1 − x̄1 1. Identify the length of each component in your graph.
(d) Repeat Parts a-c for the variable X2 in Table 1.1.
(e) Graph (to scale) the two deviation vectors y1 − x̄1 1 and y2 − x̄2 1. Calculate the value of the angle between them.
3.5. Calculate the generalized sample variance |S| for (a) the data matrix X in Exercise 3.1 and (b) the data matrix X in Exercise 3.2.

3.6. Consider the data matrix

    [−1  3  −2]
X = [ 2  4   2]
    [ 5  2   3]

(a) Calculate the matrix of deviations (residuals), X − 1x̄′. Is this matrix of full rank? Explain.
(b) Determine S and calculate the generalized sample variance |S|. Interpret the latter geometrically.
(c) Using the results in (b), calculate the total sample variance. [See (3-23).]

3.7. Sketch the solid ellipsoids (x − x̄)′S⁻¹(x − x̄) ≤ 1 [see (3-16)] for the three matrices

S = [5  4]     S = [5  −4]     S = [3  0]
    [4  5]         [−4  5]         [0  3]

(Note that these matrices have the same generalized variance |S|.)

3.8. Given

    [1  0  0]          [  1   −1/2  −1/2]
S = [0  1  0]   and  S = [−1/2    1   −1/2]
    [0  0  1]          [−1/2  −1/2    1 ]

(a) Calculate the total sample variance for each S. Compare the results.
(b) Calculate the generalized sample variance for each S, and compare the results. Comment on the discrepancies, if any, found between Parts a and b.

3.9. The following data matrix contains data on test scores, with x1 = score on first test, x2 = score on second test, and x3 = total score on the two tests:

    [12  17  29]
    [18  20  38]
X = [14  16  30]
    [20  18  38]
    [16  19  35]

(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an a′ = [a1, a2, a3] vector that establishes the linear dependence.
(b) Obtain the sample covariance matrix S, and verify that the generalized variance is zero. Also, show that Sa = 0, so a can be rescaled to be an eigenvector corresponding to eigenvalue zero.
(c) Verify that the third column of the data matrix is the sum of the first two columns. That is, show that there is linear dependence, with a1 = 1, a2 = 1, and a3 = −1.
3.10. When the generalized variance is zero, it is the columns of the mean corrected data matrix Xc = X − 1x̄′ that are linearly dependent, not necessarily those of the data matrix itself. Given the data
(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an a′ = [a1, a2, a3] vector that establishes the dependence.
(b) Obtain the sample covariance matrix S, and verify that the generalized variance is zero.
(c) Show that the columns of the data matrix are linearly independent in this case.

3.11. Use the sample covariance obtained in Example 3.7 to verify (3-29) and (3-30), which state that R = D^(−1/2) S D^(−1/2) and D^(1/2) R D^(1/2) = S.

3.12. Show that |S| = (s11 s22 ··· spp)|R|.
Hint: From Equation (3-30), S = D^(1/2) R D^(1/2). Taking determinants gives |S| = |D^(1/2)| |R| |D^(1/2)|. (See Result 2A.11.) Now examine |D^(1/2)|.

3.13. Given a data matrix X and the resulting sample correlation matrix R, consider the standardized observations (xjk − x̄k)/√skk, k = 1, 2, ..., p, j = 1, 2, ..., n. Show that these standardized quantities have sample covariance matrix R.

3.14. Given a data matrix, form the linear combinations

c′X = [1  2] [X1; X2] = X1 + 2X2
b′X = [2  3] [X1; X2] = 2X1 + 3X2

(a) Evaluate the sample means, variances, and covariance of b′X and c′X from first principles. That is, calculate the observed values of b′X and c′X, and then use the sample mean, variance, and covariance formulas.
(b) Calculate the sample means, variances, and covariance of b′X and c′X using (3-36). Compare the results in (a) and (b).

3.15. Repeat Exercise 3.14 using the data matrix
and the linear combinations

b′X = [1  1  1] [X1; X2; X3]   and   c′X = [1  2  −3] [X1; X2; X3]

3.16. Let V be a vector random variable with mean vector E(V) = μV and covariance matrix E(V − μV)(V − μV)′ = ΣV. Show that E(VV′) = ΣV + μVμV′.

3.17. Show that, if X (p×1) and Z (q×1) are independent, then each component of X is independent of each component of Z.
Hint: P[X1 ≤ x1, X2 ≤ x2, ..., Xp ≤ xp and Z1 ≤ z1, ..., Zq ≤ zq] = P[X1 ≤ x1, ..., Xp ≤ xp] · P[Z1 ≤ z1, ..., Zq ≤ zq] by independence. Let x2, ..., xp and z2, ..., zq tend to infinity, to obtain P[X1 ≤ x1 and Z1 ≤ z1] = P[X1 ≤ x1] · P[Z1 ≤ z1] for all x1, z1. So X1 and Z1 are independent. Repeat for other pairs.

3.18. Energy consumption in 2001, by state, from the major sources

x1 = petroleum   x2 = natural gas
x3 = hydroelectric power   x4 = nuclear electric power

is recorded in quadrillions (10^15) of BTUs (Source: Statistical Abstract of the United States 2006). The resulting mean and covariance matrix are

x̄ = [0.766; 0.508; 0.438; 0.161]

    [0.856  0.635  0.173  0.096]
S = [0.635  0.568  0.127  0.067]
    [0.173  0.127  0.171  0.039]
    [0.096  0.067  0.039  0.043]

(a) Using the summary statistics, determine the sample mean and variance of a state's total energy consumption for these major sources.
(b) Determine the sample mean and variance of the excess of petroleum consumption over natural gas consumption. Also find the sample covariance of this variable with the total variable in part a.

3.19. Using the summary statistics for the first three variables in Exercise 3.18, verify the relation |S| = (s11 s22 s33)|R|.
3.20. In northern climates, roads must be cleared of snow quickly following a storm. One measure of storm severity is x1 = its duration in hours, while the effectiveness of snow removal can be quantified by x2 = the number of hours crews, men, and machine, spend to clear snow. Here are the results for 25 incidents in Wisconsin.

Table 3.2 Snow Data: x1 (duration, hours) and x2 (clearing hours) for the 25 incidents.

(a) Find the sample mean and variance of the difference x2 − x1 by first obtaining the summary statistics.
(b) Obtain the mean and variance by first obtaining the individual values xj2 − xj1, for j = 1, 2, ..., 25, and then calculating the mean and variance. Compare these values with those obtained in part a.

References

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.
2. Eaton, M., and M. Perlman. "The Non-Singularity of Generalized Sample Covariance Matrices." Annals of Statistics, 1 (1973), 710-717.
Chapter 4
THE MULTIVARIATE NORMAL DISTRIBUTION

4.1 Introduction

A generalization of the familiar bell-shaped normal density to several dimensions plays a fundamental role in multivariate analysis. In fact, most of the techniques encountered in this book are based on the assumption that the data were generated from a multivariate normal distribution. While real data are never exactly multivariate normal, the normal density is often a useful approximation to the "true" population distribution.

One advantage of the multivariate normal distribution stems from the fact that it is mathematically tractable and "nice" results can be obtained. This is frequently not the case for other data-generating distributions. Of course, mathematical attractiveness per se is of little use to the practitioner. It turns out, however, that normal distributions are useful in practice for two reasons: First, the normal distribution serves as a bona fide population model in some instances; second, the sampling distributions of many multivariate statistics are approximately normal, regardless of the form of the parent population, because of a central limit effect.

To summarize, many real-world problems fall naturally within the framework of normal theory. The importance of the normal distribution rests on its dual role as both population model for certain natural phenomena and approximate sampling distribution for many statistics.

4.2 The Multivariate Normal Density and Its Properties

The multivariate normal density is a generalization of the univariate normal density to p ≥ 2 dimensions. Recall that the univariate normal distribution, with mean μ and variance σ², has the probability density function

f(x) = (1/√(2πσ²)) e^(−[(x−μ)/σ]²/2),   −∞ < x < ∞   (4-1)

A plot of this function yields the familiar bell-shaped curve. Also shown in the figure are approximate areas under the curve within ±1 standard deviation and ±2 standard deviations of the mean. These areas represent probabilities, and thus, for the normal random variable X,

P(μ − σ ≤ X ≤ μ + σ) ≈ .68
P(μ − 2σ ≤ X ≤ μ + 2σ) ≈ .95

Figure 4.1 A normal density with mean μ and variance σ² and selected areas under the curve.

It is convenient to denote the normal density function with mean μ and variance σ² by N(μ, σ²). Therefore, N(10, 4) refers to the function in (4-1) with μ = 10 and σ = 2. This notation will be extended to the multivariate case later.

The term

((x − μ)/σ)² = (x − μ)(σ²)⁻¹(x − μ)   (4-2)

in the exponent of the univariate normal density function measures the square of the distance from x to μ in standard deviation units. This can be generalized for a p × 1 vector x of observations on several variables as

(x − μ)′Σ⁻¹(x − μ)   (4-3)

The p × 1 vector μ represents the expected value of the random vector X, and the p × p matrix Σ is the variance-covariance matrix of X. [See (2-30) and (2-31).] We shall assume that the symmetric matrix Σ is positive definite, so the expression in (4-3) is the square of the generalized distance from x to μ.

The multivariate normal density is obtained by replacing the univariate distance in (4-2) by the multivariate generalized distance of (4-3) in the density function of (4-1). When this replacement is made, the univariate normalizing constant (2π)^(−1/2)(σ²)^(−1/2) must be changed to a more general constant that makes the volume under the surface of the multivariate density function unity for any p. This is necessary because, in the multivariate case, probabilities are represented by volumes under the surface over regions defined by intervals of the xi values. It can be shown (see [1]) that this constant is (2π)^(−p/2)|Σ|^(−1/2), and consequently, a p-dimensional normal density for the random vector X′ = [X1, X2, ..., Xp] has the form

f(x) = (1/((2π)^(p/2) |Σ|^(1/2))) e^(−(x−μ)′Σ⁻¹(x−μ)/2)   (4-4)

where −∞ < xi < ∞, i = 1, 2, ..., p. We shall denote this p-dimensional normal density by Np(μ, Σ), which is analogous to the normal density in the univariate case.

Example 4.1 (Bivariate normal density in terms of the individual parameters) Let us evaluate the p = 2-variate normal density in terms of the individual parameters μ1 = E(X1), μ2 = E(X2), σ11 = Var(X1), σ22 = Var(X2), and ρ12 = σ12/(√σ11 √σ22) = Corr(X1, X2).

Using Result 2A.8, we find that the inverse of the covariance matrix

Σ = [σ11  σ12]
    [σ12  σ22]

is

Σ⁻¹ = (1/(σ11σ22 − σ12²)) [σ22  −σ12]
                          [−σ12  σ11]

Introducing the correlation coefficient ρ12 by writing σ12 = ρ12 √σ11 √σ22, we obtain σ11σ22 − σ12² = σ11σ22(1 − ρ12²), and the squared distance becomes

(x − μ)′Σ⁻¹(x − μ)
= (1/(σ11σ22(1 − ρ12²))) [σ22(x1 − μ1)² + σ11(x2 − μ2)² − 2ρ12 √σ11 √σ22 (x1 − μ1)(x2 − μ2)]
= (1/(1 − ρ12²)) [((x1 − μ1)/√σ11)² + ((x2 − μ2)/√σ22)² − 2ρ12 ((x1 − μ1)/√σ11)((x2 − μ2)/√σ22)]   (4-5)

The last expression is written in terms of the standardized values (x1 − μ1)/√σ11 and (x2 − μ2)/√σ22.

Next, since |Σ| = σ11σ22 − σ12² = σ11σ22(1 − ρ12²), we can substitute for Σ⁻¹ and |Σ| in (4-4) to get the expression for the bivariate (p = 2) normal density involving the individual parameters μ1, μ2, σ11, σ22, and ρ12:

f(x1, x2) = (1/(2π √(σ11σ22(1 − ρ12²))))
  × exp{ −(1/(2(1 − ρ12²))) [((x1 − μ1)/√σ11)² + ((x2 − μ2)/√σ22)² − 2ρ12 ((x1 − μ1)/√σ11)((x2 − μ2)/√σ22)] }   (4-6)

The expression in (4-6) is somewhat unwieldy, and the compact general form in (4-4) is more informative in many ways. On the other hand, the expression in (4-6) is useful for discussing certain properties of the normal distribution. For example, if the random variables X1 and X2 are uncorrelated, so that ρ12 = 0, the joint density can be written as the product of two univariate normal densities each of the form of (4-1). That is, f(x1, x2) = f(x1)f(x2) and X1 and X2 are independent. [See (2-28).] This result is true in general. ■

Two bivariate distributions with σ11 = σ22 are shown in Figure 4.2. In Figure 4.2(a), X1 and X2 are independent (ρ12 = 0). In Figure 4.2(b), ρ12 = .75. Notice how the presence of correlation causes the probability to concentrate along a line.

Figure 4.2 Two bivariate normal distributions. (a) σ11 = σ22 and ρ12 = 0. (b) σ11 = σ22 and ρ12 = .75.
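Evaluating (4-4) is a direct translation of the formula. The following sketch is not part of the original text; it assumes NumPy and SciPy, and the numbers (mean at the origin, ρ12 = .75) are illustrative choices, not values from the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def multivariate_normal_pdf(x, mu, Sigma):
    """Evaluate the N_p(mu, Sigma) density (4-4) at the point x."""
    p = len(mu)
    dev = x - mu
    maha = dev @ np.linalg.solve(Sigma, dev)   # (x-mu)' Sigma^{-1} (x-mu)
    norm_const = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * maha)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.75],
                  [0.75, 1.0]])                # sigma11 = sigma22 = 1, rho12 = .75
print(multivariate_normal_pdf(mu, mu, Sigma))  # density at the mean, its maximum
print(multivariate_normal(mean=mu, cov=Sigma).pdf(mu))   # same value via SciPy
```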
From the expression in (4-4) for the density of a p-dimensional normal variable, it should be clear that the paths of x values yielding a constant height for the density are ellipsoids. That is, the multivariate normal density is constant on surfaces where the square of the distance (x − μ)′Σ⁻¹(x − μ) is constant. These paths are called contours:

Constant probability density contour = {all x such that (x − μ)′Σ⁻¹(x − μ) = c²}
                                     = surface of an ellipsoid centered at μ

The axes of each ellipsoid of constant density are in the direction of the eigenvectors of Σ⁻¹, and their lengths are proportional to the reciprocals of the square roots of the eigenvalues of Σ⁻¹. Fortunately, we can avoid the calculation of Σ⁻¹ when determining the axes, since these ellipsoids are also determined by the eigenvalues and eigenvectors of Σ. We state the correspondence formally for later reference.

Result 4.1. If Σ is positive definite, so that Σ⁻¹ exists, then

Σe = λe   implies   Σ⁻¹e = (1/λ)e

so (λ, e) is an eigenvalue-eigenvector pair for Σ corresponding to the pair (1/λ, e) for Σ⁻¹. Also, Σ⁻¹ is positive definite.

Proof. For Σ positive definite and e ≠ 0 an eigenvector, we have 0 < e′Σe = e′(Σe) = e′(λe) = λe′e = λ. Moreover, e = Σ⁻¹(Σe) = Σ⁻¹(λe), or e = λΣ⁻¹e, and division by λ > 0 gives Σ⁻¹e = (1/λ)e. Thus, (1/λ, e) is an eigenvalue-eigenvector pair for Σ⁻¹. Also, for any p × 1 x, by (2-21),

x′Σ⁻¹x = x′( Σᵢ (1/λi) ei ei′ ) x = Σᵢ (1/λi)(x′ei)² ≥ 0

since each term λi⁻¹(x′ei)² is nonnegative. In addition, x′ei = 0 for all i only if x = 0. So x ≠ 0 implies that Σᵢ (1/λi)(x′ei)² > 0, and it follows that Σ⁻¹ is positive definite. ■

The following summarizes these concepts: Contours of constant density for the p-dimensional normal distribution are ellipsoids defined by x such that

(x − μ)′Σ⁻¹(x − μ) = c²   (4-7)

These ellipsoids are centered at μ and have axes ±c√λi ei, where Σei = λi ei for i = 1, 2, ..., p.
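The axes in (4-7) drop out of a single eigendecomposition. A sketch, not part of the original text, assuming NumPy and an illustrative 2 × 2 covariance matrix:

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])           # sigma11 = sigma22, sigma12 > 0
lam, E = np.linalg.eigh(Sigma)           # eigenvalues ascending; columns are e_i

c = 1.0
for lam_i, e_i in zip(lam, E.T):
    print("half-axis:", c * np.sqrt(lam_i) * e_i)   # +/- c sqrt(lambda_i) e_i

# Result 4.1: (1/lambda, e) are the eigenpairs of Sigma^{-1}.
lam_inv = np.linalg.eigvalsh(np.linalg.inv(Sigma))
print(np.allclose(np.sort(lam_inv), np.sort(1.0 / lam)))   # True
```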
A contour of constant density for a bivariate normal distribution with σ11 = σ22 is obtained in the following example.

Example 4.2 (Contours of the bivariate normal density) We shall obtain the axes of constant probability density contours for a bivariate normal distribution when σ11 = σ22. From (4-7), these axes are given by the eigenvalues and eigenvectors of Σ. Here |Σ − λI| = 0 becomes

0 = |σ11 − λ    σ12   |
    |  σ12    σ11 − λ |  = (σ11 − λ)² − σ12² = (λ − σ11 − σ12)(λ − σ11 + σ12)

Consequently, the eigenvalues are λ1 = σ11 + σ12 and λ2 = σ11 − σ12. The eigenvector e1 is determined from

[σ11  σ12] [e1]           [e1]
[σ12  σ11] [e2] = (σ11 + σ12) [e2]

or

σ11 e1 + σ12 e2 = (σ11 + σ12) e1
σ12 e1 + σ11 e2 = (σ11 + σ12) e2

These equations imply that e1 = e2, and after normalization, the first eigenvalue-eigenvector pair is

λ1 = σ11 + σ12,   e1′ = [1/√2, 1/√2]

Similarly, λ2 = σ11 − σ12 yields the eigenvector e2′ = [1/√2, −1/√2].

When the covariance σ12 (or correlation ρ12) is positive, λ1 = σ11 + σ12 is the largest eigenvalue, and its associated eigenvector e1′ = [1/√2, 1/√2] lies along the 45° line through the point μ′ = [μ1, μ2]. This is true for any positive value of the covariance (correlation). Since the axes of the constant-density ellipses are given by ±c√λ1 e1 and ±c√λ2 e2 [see (4-7)], and the eigenvectors each have length unity, the major axis will be associated with the largest eigenvalue. For positively correlated normal random variables, then, the major axis of the constant-density ellipses will be along the 45° line through μ. (See Figure 4.3.)

Figure 4.3 A constant-density contour for a bivariate normal distribution with σ11 = σ22 and σ12 > 0 (or ρ12 > 0).

When the covariance (correlation) is negative, λ2 = σ11 − σ12 will be the largest eigenvalue, and the major axes of the constant-density ellipses will lie along a line at right angles to the 45° line through μ. (These results are true only for σ11 = σ22.)

To summarize, the axes of the ellipses of constant density for a bivariate normal distribution with σ11 = σ22 are determined by

±c√(σ11 + σ12) [1/√2]       and       ±c√(σ11 − σ12) [ 1/√2]
               [1/√2]                                [−1/√2] ■

We show in Result 4.7 that the choice c² = χ²p(α), where χ²p(α) is the upper (100α)th percentile of a chi-square distribution with p degrees of freedom, leads to contours that contain (1 − α) × 100% of the probability. Specifically, the following is true for a p-dimensional normal distribution:

The solid ellipsoid of x values satisfying

(x − μ)′Σ⁻¹(x − μ) ≤ χ²p(α)   (4-8)

has probability 1 − α.

The constant-density contours containing 50% and 90% of the probability under the bivariate normal surfaces in Figure 4.2 are pictured in Figure 4.4.

Figure 4.4 The 50% and 90% contours for the bivariate normal distributions in Figure 4.2.

The p-variate normal density in (4-4) has a maximum value when the squared distance in (4-3) is zero, that is, when x = μ. Thus, μ is the point of maximum density, or mode, as well as the expected value of X, or mean. The fact that μ is the mean of the multivariate normal distribution follows from the symmetry exhibited by the constant-density contours: These contours are centered, or balanced, at μ.
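The probability statement in (4-8) can be checked by simulation. A sketch, not part of the original text, assuming NumPy and SciPy; the mean, covariance matrix, and α below are illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

# Monte Carlo check of (4-8): the ellipsoid with c^2 = chi2_p(alpha)
# contains about (1 - alpha) of the N_p(mu, Sigma) probability.
rng = np.random.default_rng(3)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
alpha = 0.10
c2 = chi2.ppf(1 - alpha, df=2)         # upper (100*alpha)th percentile

X = rng.multivariate_normal(mu, Sigma, size=100_000)
dev = X - mu
d2 = np.einsum('ij,jk,ik->i', dev, np.linalg.inv(Sigma), dev)
print((d2 <= c2).mean())               # close to 0.90
```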
Additional Properties of the Multivariate Normal Distribution

Certain properties of the normal distribution will be needed repeatedly in our explanations of statistical models and methods. These properties make it possible to manipulate normal distributions easily and, as we suggested in Section 4.1, are partly responsible for the popularity of the normal distribution. The key properties, which we shall soon discuss in some mathematical detail, can be stated rather simply.

The following are true for a random vector X having a multivariate normal distribution:

1. Linear combinations of the components of X are normally distributed.
2. All subsets of the components of X have a (multivariate) normal distribution.
3. Zero covariance implies that the corresponding components are independently distributed.
4. The conditional distributions of the components are (multivariate) normal.

These statements are reproduced mathematically in the results that follow. Many of these results are illustrated with examples. The proofs that are included should help improve your understanding of matrix manipulations and also lead you to an appreciation for the manner in which the results successively build on themselves. Result 4.2 can be taken as a working definition of the normal distribution. With this in hand, the subsequent properties are almost immediate. Our partial proof of Result 4.2 indicates how the linear combination definition of a normal density relates to the multivariate density in (4-4).

Result 4.2. If X is distributed as Np(μ, Σ), then any linear combination of variables a′X = a1X1 + a2X2 + ··· + apXp is distributed as N(a′μ, a′Σa). Also, if a′X is distributed as N(a′μ, a′Σa) for every a, then X must be Np(μ, Σ).

Proof. The expected value and variance of a′X follow from (2-43). Proving that a′X is normally distributed if X is multivariate normal is more difficult. You can find a proof in [1]. The second part of Result 4.2 is also demonstrated in [1]. ■

Example 4.3 (The distribution of a linear combination of the components of a normal random vector) Consider the linear combination a′X of a multivariate normal random vector determined by the choice a′ = [1, 0, ..., 0]. Since

a′X = [1, 0, ..., 0] [X1; X2; ...; Xp] = X1

and

a′μ = [1, 0, ..., 0] [μ1; μ2; ...; μp] = μ1

we have

a′Σa = [1, 0, ..., 0] [σ11 σ12 ··· σ1p; σ12 σ22 ··· σ2p; ...; σ1p σ2p ··· σpp] [1; 0; ...; 0] = σ11

and it follows from Result 4.2 that X1 is distributed as N(μ1, σ11). More generally, the marginal distribution of any component Xi of X is N(μi, σii). ■

The next result considers several linear combinations of a multivariate normal vector X.

Result 4.3. If X is distributed as Np(μ, Σ), the q linear combinations

A X = [a11X1 + ··· + a1pXp; a21X1 + ··· + a2pXp; ...; aq1X1 + ··· + aqpXp]
(q×p)(p×1)

are distributed as Nq(Aμ, AΣA′). Also, X (p×1) + d (p×1), where d is a vector of constants, is distributed as Np(μ + d, Σ).

Proof. The expected value E(AX) and the covariance matrix of AX follow from (2-45). Any linear combination b′(AX) is a linear combination of X, of the form a′X with a = A′b. Thus, the conclusion concerning AX follows directly from Result 4.2.

The second part of the result can be obtained by considering a′(X + d) = a′X + (a′d), where a′X is distributed as N(a′μ, a′Σa). It is known from the univariate case that adding a constant a′d to the random variable a′X leaves the variance unchanged and translates the mean to a′μ + a′d = a′(μ + d). Since a was arbitrary, X + d is distributed as Np(μ + d, Σ). ■

Example 4.4 (The distribution of two linear combinations of the components of a normal random vector) For X distributed as N3(μ, Σ), find the distribution of

[X1 − X2]   [1  −1   0] [X1]
[X2 − X3] = [0   1  −1] [X2] = AX
                        [X3]

By Result 4.3, the distribution of AX is multivariate normal with mean

Aμ = [1  −1   0] [μ1]   [μ1 − μ2]
     [0   1  −1] [μ2] = [μ2 − μ3]
                 [μ3]

and covariance matrix

AΣA′ = [1  −1   0] [σ11  σ12  σ13] [ 1   0]
       [0   1  −1] [σ12  σ22  σ23] [−1   1]
                   [σ13  σ23  σ33] [ 0  −1]

     = [σ11 − 2σ12 + σ22         σ12 + σ23 − σ22 − σ13]
       [σ12 + σ23 − σ22 − σ13    σ22 − 2σ23 + σ33     ]

Alternatively, the mean vector Aμ and covariance matrix AΣA′ may be verified by direct calculation of the means and covariances of the two random variables Y1 = X1 − X2 and Y2 = X2 − X3. ■

We have mentioned that all subsets of a multivariate normal random vector X are themselves normally distributed. We state this property formally as Result 4.4.

Result 4.4. All subsets of X are normally distributed. If we respectively partition X, its mean vector μ, and its covariance matrix Σ as

X       = [X1 (q×1); X2 ((p−q)×1)]
(p×1)

μ       = [μ1 (q×1); μ2 ((p−q)×1)]
(p×1)

Σ       = [Σ11 (q×q)        Σ12 (q×(p−q))     ]
(p×p)     [Σ21 ((p−q)×q)    Σ22 ((p−q)×(p−q)) ]

then X1 is distributed as Nq(μ1, Σ11).

Proof. Set A (q×p) = [I (q×q) | 0 (q×(p−q))] in Result 4.3, and the conclusion follows. To apply Result 4.4 to an arbitrary subset of the components of X, we simply relabel the subset of interest as X1 and select the corresponding component means and covariances as μ1 and Σ11. ■

Example 4.5 (The distribution of a subset of a normal random vector) If X is distributed as N5(μ, Σ), find the distribution of [X2; X4]. We set

X1 = [X2; X4],   μ1 = [μ2; μ4],   Σ11 = [σ22  σ24]
                                        [σ24  σ44]

and note that with this assignment, X, μ, and Σ can respectively be rearranged and partitioned so that the rows and columns for X2 and X4 come first. Thus, from Result 4.4, X1 = [X2; X4] has the distribution

N2(μ1, Σ11) = N2( [μ2; μ4], [σ22  σ24; σ24  σ44] )

It is clear from this example that the normal distribution for any subset can be expressed by simply selecting the appropriate means and covariances from the original μ and Σ. The formal process of relabeling and partitioning is unnecessary. ■

We are now in a position to state that zero correlation between normal random variables or sets of normal random variables is equivalent to statistical independence.

Result 4.5.
(a) If X1 (q1×1) and X2 (q2×1) are independent, then Cov(X1, X2) = 0, a q1 × q2 matrix of zeros.
(b) If [X1; X2] is Nq1+q2( [μ1; μ2], [Σ11  Σ12; Σ21  Σ22] ), then X1 and X2 are independent if and only if Σ12 = 0.
(c) If X1 and X2 are independent and are distributed as Nq1(μ1, Σ11) and Nq2(μ2, Σ22), respectively, then [X1; X2] has the multivariate normal distribution

Nq1+q2( [μ1; μ2], [Σ11  0; 0′  Σ22] )

Proof. (See Exercise 4.14 for partial proofs based upon factoring the density function when Σ12 = 0.) ■
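Result 4.4's prescription, select the appropriate rows and columns of μ and Σ, is a one-line indexing operation. A sketch, not part of the original text, assuming NumPy; the mean vector and covariance matrix below are made-up illustrative values.

```python
import numpy as np

# Marginal of a subset (Result 4.4): pick out the rows/columns of mu, Sigma.
mu = np.array([2.0, 4.0, -1.0, 3.0, 0.0])
Sigma = np.array([[4.0, 1.0, 0.5, 0.0, 0.0],
                  [1.0, 3.0, 0.0, 1.0, 0.0],
                  [0.5, 0.0, 2.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 5.0, 0.5],
                  [0.0, 0.0, 0.0, 0.5, 1.0]])
idx = [1, 3]                        # the subset (X2, X4), zero-based indexing
mu1 = mu[idx]
Sigma11 = Sigma[np.ix_(idx, idx)]
print(mu1)                          # [4. 3.]
print(Sigma11)                      # [[3. 1.] [1. 5.]]
```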
Example 4.6 (The equivalence of zero covariance and independence for normal variables) Let X (3×1) be N3(μ, Σ) with

    [4  1  0]
Σ = [1  3  0]
    [0  0  2]

Are X1 and X2 independent? What about (X1, X2) and X3?

Since X1 and X2 have covariance σ12 = 1, they are not independent. However, partitioning X and Σ as

X = [(X1; X2); X3],   Σ = [Σ11  Σ12]   with  Σ11 = [4  1],  Σ12 = [0],  Σ22 = [2]
                          [Σ21  Σ22]               [1  3]         [0]

we see that X1 = [X1; X2] and X3 have covariance matrix Σ12 = [0; 0]. Therefore, (X1, X2) and X3 are independent by Result 4.5. This implies X3 is independent of X1 and also of X2. ■

We pointed out in our discussion of the bivariate normal distribution that ρ12 = 0 (zero correlation) implied independence because the joint density function [see (4-6)] could then be written as the product of the marginal (normal) densities of X1 and X2. This fact, which we encouraged you to verify directly, is simply a special case of Result 4.5 with q1 = q2 = 1.

Result 4.6. Let X = [X1 (q×1); X2 ((p−q)×1)] be distributed as Np(μ, Σ) with μ = [μ1; μ2],

Σ = [Σ11  Σ12]
    [Σ21  Σ22]

and |Σ22| > 0. Then the conditional distribution of X1, given that X2 = x2, is normal and has

Mean = μ1 + Σ12Σ22⁻¹(x2 − μ2)

and

Covariance = Σ11 − Σ12Σ22⁻¹Σ21

Note that the covariance does not depend on the value x2 of the conditioning variable.

Proof. We shall give an indirect proof. (See Exercise 4.13, which uses the densities directly.) Take

A = [I (q×q)    −Σ12Σ22⁻¹          ]
    [0          I ((p−q)×(p−q))    ]

so

[X1 − μ1 − Σ12Σ22⁻¹(X2 − μ2)]
[X2 − μ2                    ]  =  A(X − μ)

is jointly normal with covariance matrix AΣA′ given by

[Σ11 − Σ12Σ22⁻¹Σ21    0  ]
[0′                  Σ22 ]

Since X1 − μ1 − Σ12Σ22⁻¹(X2 − μ2) and X2 − μ2 have zero covariance, they are independent. Moreover, the quantity X1 − μ1 − Σ12Σ22⁻¹(X2 − μ2) has distribution Nq(0, Σ11 − Σ12Σ22⁻¹Σ21). Given that X2 = x2, μ1 + Σ12Σ22⁻¹(x2 − μ2) is a constant. Because X1 − μ1 − Σ12Σ22⁻¹(X2 − μ2) and X2 − μ2 are independent, the conditional distribution of X1 − μ1 − Σ12Σ22⁻¹(x2 − μ2) is the same as the unconditional distribution of X1 − μ1 − Σ12Σ22⁻¹(X2 − μ2). Since X1 − μ1 − Σ12Σ22⁻¹(X2 − μ2) is Nq(0, Σ11 − Σ12Σ22⁻¹Σ21), so is the random vector X1 − μ1 − Σ12Σ22⁻¹(x2 − μ2) when X2 has the particular value x2. Equivalently, given that X2 = x2, X1 is distributed as Nq(μ1 + Σ12Σ22⁻¹(x2 − μ2), Σ11 − Σ12Σ22⁻¹Σ21). ■

Example 4.7 (The conditional density of a bivariate normal distribution) The conditional density of X1, given that X2 = x2 for any bivariate distribution, is defined by

f(x1 | x2) = {conditional density of X1 given that X2 = x2} = f(x1, x2)/f(x2)

where f(x2) is the marginal distribution of X2. If f(x1, x2) is the bivariate normal density, show that f(x1 | x2) is

N( μ1 + (σ12/σ22)(x2 − μ2),  σ11 − σ12²/σ22 )
Here σ11 − σ12²/σ22 = σ11(1 − ρ12²). The two terms involving x1 − μ1 in the exponent of the bivariate normal density [see Equation (4-6)] become, apart from the multiplicative constant −1/(2(1 − ρ12²)),

(x1 − μ1)²/σ11 − 2ρ12 (x1 − μ1)(x2 − μ2)/(√σ11 √σ22)
  = (1/σ11) [x1 − μ1 − ρ12 (√σ11/√σ22)(x2 − μ2)]² − (ρ12²/σ22)(x2 − μ2)²

Because ρ12 = σ12/(√σ11 √σ22), or ρ12 √σ11/√σ22 = σ12/σ22, the complete exponent is

−(1/(2σ11(1 − ρ12²))) [x1 − μ1 − (σ12/σ22)(x2 − μ2)]² − (1/2)(x2 − μ2)²/σ22

The constant term 1/(2π √(σ11σ22(1 − ρ12²))) also factors as

(1/(√2π √σ22)) × (1/(√2π √(σ11(1 − ρ12²))))

Dividing the joint density of X1 and X2 by the marginal density

f(x2) = (1/(√2π √σ22)) e^(−(x2 − μ2)²/(2σ22))

and canceling terms yields the conditional density

f(x1 | x2) = (1/(√2π √(σ11(1 − ρ12²)))) e^(−[x1 − μ1 − (σ12/σ22)(x2 − μ2)]²/(2σ11(1 − ρ12²))),   −∞ < x1 < ∞

Thus, with our customary notation, the conditional distribution of X1 given that X2 = x2 is N(μ1 + (σ12/σ22)(x2 − μ2), σ11(1 − ρ12²)). Now, Σ11 − Σ12Σ22⁻¹Σ21 = σ11 − σ12²/σ22 = σ11(1 − ρ12²) and Σ12Σ22⁻¹ = σ12/σ22, agreeing with Result 4.6, which we obtained by an indirect method. ■

For the multivariate normal situation, it is worth emphasizing the following:

1. All conditional distributions are (multivariate) normal.
2. The conditional mean is of the form

μ1 + β1,q+1(xq+1 − μq+1) + ··· + β1,p(xp − μp)
⋮
μq + βq,q+1(xq+1 − μq+1) + ··· + βq,p(xp − μp)   (4-9)

where the β's are defined by

Σ12Σ22⁻¹ = [β1,q+1  β1,q+2  ···  β1,p]
           [β2,q+1  β2,q+2  ···  β2,p]
           [  ⋮       ⋮            ⋮ ]
           [βq,q+1  βq,q+2  ···  βq,p]

3. The conditional covariance, Σ11 − Σ12Σ22⁻¹Σ21, does not depend upon the value(s) of the conditioning variable(s).

We conclude this section by presenting two final properties of multivariate normal random vectors. One has to do with the probability content of the ellipsoids of constant density. The other discusses the distribution of another form of linear combinations.

The chi-square distribution determines the variability of the sample variance s² = s11 for samples from a univariate normal population. It also plays a basic role in the multivariate case.

Result 4.7. Let X be distributed as Np(μ, Σ) with |Σ| > 0. Then

(a) (X − μ)′Σ⁻¹(X − μ) is distributed as χ²p, where χ²p denotes the chi-square distribution with p degrees of freedom.
(b) The Np(μ, Σ) distribution assigns probability 1 − α to the solid ellipsoid {x: (x − μ)′Σ⁻¹(x − μ) ≤ χ²p(α)}, where χ²p(α) denotes the upper (100α)th percentile of the χ²p distribution.

Proof. We know that χ²p is defined as the distribution of the sum Z1² + Z2² + ··· + Zp², where Z1, Z2, ..., Zp are independent N(0, 1) random variables. Next, by the spectral decomposition [see Equations (2-16) and (2-21) with A = Σ, and see Result 4.1], Σ⁻¹ = Σᵢ (1/λi) ei ei′, where Σei = λi ei, so Σ⁻¹ei = (1/λi)ei. Consequently,

(X − μ)′Σ⁻¹(X − μ) = Σᵢ (1/λi)(X − μ)′ei ei′(X − μ) = Σᵢ (1/λi)(ei′(X − μ))²
                   = Σᵢ [(1/√λi) ei′(X − μ)]² = Σᵢ Zi²

where Zi = (1/√λi) ei′(X − μ). In matrix terms, Z = A(X − μ), where

A = [e1′/√λ1]
    [e2′/√λ2]
    [   ⋮   ]
    [ep′/√λp]

and X − μ is distributed as Np(0, Σ). Therefore, by Result 4.3, Z = A(X − μ) is distributed as Np(0, AΣA′), where

AΣA′ = [e1′/√λ1; ...; ep′/√λp] Σ [e1/√λ1 | ··· | ep/√λp] = I

since ei′Σek = λk ei′ek equals λi when i = k and 0 otherwise. By Result 4.5, Z1, Z2, ..., Zp are independent standard normal variables, and we conclude that (X − μ)′Σ⁻¹(X − μ) has a χ²p distribution.

For Part b, we note that P[(X − μ)′Σ⁻¹(X − μ) ≤ c²] is the probability assigned to the ellipsoid (x − μ)′Σ⁻¹(x − μ) ≤ c² by the density Np(μ, Σ). But from Part a, P[(X − μ)′Σ⁻¹(X − μ) ≤ χ²p(α)] = 1 − α, and Part b holds. ■

Remark: (Interpretation of statistical distance) Result 4.7 provides an interpretation of a squared statistical distance. When X is distributed as Np(μ, Σ), (X − μ)′Σ⁻¹(X − μ) is the squared statistical distance from X to the population mean vector μ. If one component has a much larger variance than another, it will contribute less to the squared distance. Moreover, two highly correlated random variables will contribute less than two variables that are nearly uncorrelated. Essentially, the use of the inverse of the covariance matrix (1) standardizes all of the variables and (2) eliminates the effects of correlation. In terms of Σ^(−1/2) [see (2-22)], Z = Σ^(−1/2)(X − μ) has a Np(0, Ip) distribution, and

(X − μ)′Σ⁻¹(X − μ) = (X − μ)′Σ^(−1/2)Σ^(−1/2)(X − μ) = Z′Z = Z1² + Z2² + ··· + Zp²

The squared statistical distance is calculated as if, first, the random vector X were transformed to p independent standard normal random variables and then the usual squared distance, the sum of the squares of the variables, were applied.
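Returning to Result 4.6, the conditional mean and covariance are a few matrix operations. A sketch, not part of the original text, assuming NumPy; the 2 × 2 example values are illustrative, not from the text.

```python
import numpy as np

def conditional_normal(mu, Sigma, q, x2):
    """Mean and covariance of X1 | X2 = x2 under N_p(mu, Sigma),
    where X1 holds the first q components (Result 4.6)."""
    mu1, mu2 = mu[:q], mu[q:]
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    W = S12 @ np.linalg.inv(S22)       # Sigma12 Sigma22^{-1}, the betas in (4-9)
    cond_mean = mu1 + W @ (x2 - mu2)
    cond_cov = S11 - W @ S21           # does not depend on x2
    return cond_mean, cond_cov

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])
m, v = conditional_normal(mu, Sigma, q=1, x2=np.array([1.0]))
print(m, v)   # mean (0.8/2)(1) = [0.4]; variance 1 - 0.8^2/2 = [[0.68]]
```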
We previously discussed a single random variable that could be written as a linear combination of other univariate random variables. Next, consider the linear combination of vector random variables

c1X1 + c2X2 + ··· + cnXn = [X1 | X2 | ··· | Xn] c   (4-10)
                              (p×n)          (n×1)

This linear combination differs from the linear combinations considered earlier in that it defines a p × 1 vector random variable that is a linear combination of vectors.

Result 4.8. Let X1, X2, ..., Xn be mutually independent with Xj distributed as Np(μj, Σ). (Note that each Xj has the same covariance matrix Σ.) Then

V1 = c1X1 + c2X2 + ··· + cnXn

is distributed as Np( Σⱼ cjμj, (Σⱼ cj²)Σ ). Moreover, V1 and V2 = b1X1 + b2X2 + ··· + bnXn are jointly multivariate normal with covariance matrix

[(Σⱼ cj²)Σ    (b′c)Σ   ]
[(b′c)Σ      (Σⱼ bj²)Σ ]

Consequently, V1 and V2 are independent if b′c = Σⱼ cjbj = 0.

Proof. By Result 4.5(c), the np component vector

[X1; X2; ...; Xn] = X (np×1)

is multivariate normal. In particular, X is distributed as Nnp(μ, Σx), where

μ = [μ1; μ2; ...; μn]   and   Σx = [Σ  0  ···  0]
(np×1)                    (np×np)  [0  Σ  ···  0]
                                   [⋮      ⋱   ⋮]
                                   [0  0  ···  Σ]

The choice

A = [c1I  c2I  ···  cnI]
    [b1I  b2I  ···  bnI]

where I is the p × p identity matrix, gives AX = [V1; V2], and AX is normal N2p(Aμ, AΣxA′) by Result 4.3. Straightforward block multiplication shows that AΣxA′ has the first block diagonal term

(c1Σ)c1 + (c2Σ)c2 + ··· + (cnΣ)cn = (Σⱼ cj²)Σ

The off-diagonal term is

(c1Σ)b1 + (c2Σ)b2 + ··· + (cnΣ)bn = (Σⱼ cjbj)Σ = (b′c)Σ

This term is the covariance matrix for V1, V2. Consequently, when b′c = Σⱼ cjbj = 0, so that (b′c)Σ = 0 (p×p), V1 and V2 are independent by Result 4.5(b). ■
For sums of the type in (4-10), the property of zero correlation is equivalent to requiring the coefficient vectors b and c to be perpendicular.

Example 4.8 (Linear combinations of random vectors) Let X1, X2, X3, and X4 be independent and identically distributed 3 × 1 random vectors with mean vector μ and covariance matrix

    [ 3  −1   1]
Σ = [−1   1   0]
    [ 1   0   2]

We first consider a linear combination a′X1 of the three components of X1. This is a random variable with mean a′μ and variance

a′Σa = 3a1² + a2² + 2a3² − 2a1a2 + 2a1a3

That is, a linear combination a′X1 of the components of a random vector is a single random variable consisting of a sum of terms that are each a constant times a variable. This is very different from a linear combination of random vectors, say,

c1X1 + c2X2 + c3X3 + c4X4

which is itself a random vector. Here each term in the sum is a constant times a random vector.

Now consider two linear combinations of random vectors

(1/2)X1 + (1/2)X2 + (1/2)X3 + (1/2)X4

and

X1 + X2 + X3 − 3X4

Find the mean vector and covariance matrix for each linear combination of vectors and also the covariance between them.

By Result 4.8 with c1 = c2 = c3 = c4 = 1/2, the first linear combination has mean vector

(c1 + c2 + c3 + c4)μ = 2μ

and covariance matrix

(c1² + c2² + c3² + c4²)Σ = 1 · Σ = Σ

For the second linear combination of random vectors, we apply Result 4.8 with b1 = b2 = b3 = 1 and b4 = −3 to get mean vector

(b1 + b2 + b3 + b4)μ = 0μ = 0

and covariance matrix

                                  [36  −12  12]
(b1² + b2² + b3² + b4²)Σ = 12Σ =  [−12  12   0]
                                  [12    0  24]

Finally, the covariance matrix for the two linear combinations of random vectors is

(Σⱼ cjbj)Σ = [(1/2)(1) + (1/2)(1) + (1/2)(1) + (1/2)(−3)]Σ = 0Σ = 0

Every component of the first linear combination of random vectors has zero covariance with every component of the second linear combination of random vectors. If, in addition, each X has a trivariate normal distribution, then the two linear combinations have a joint six-variate normal distribution, and the two linear combinations of vectors are independent. ■
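Result 4.8's covariance algebra can be checked by simulation. A sketch, not part of the original text, assuming NumPy, using the coefficients of Example 4.8 (with a zero mean vector, which is an arbitrary illustrative choice):

```python
import numpy as np

# Monte Carlo check of Result 4.8: c = (1/2,...,1/2), b = (1,1,1,-3), b'c = 0.
rng = np.random.default_rng(4)
Sigma = np.array([[3.0, -1.0, 1.0],
                  [-1.0, 1.0, 0.0],
                  [1.0, 0.0, 2.0]])
mu = np.zeros(3)
c = np.array([0.5, 0.5, 0.5, 0.5])
b = np.array([1.0, 1.0, 1.0, -3.0])

X = rng.multivariate_normal(mu, Sigma, size=(200_000, 4))  # rows of 4 iid vectors
V1 = np.tensordot(c, X, axes=([0], [1]))   # sum_j c_j X_j, shape (200000, 3)
V2 = np.tensordot(b, X, axes=([0], [1]))

print(np.cov(V1, rowvar=False))            # approx (sum c_j^2) Sigma = Sigma
print(np.cov(V2, rowvar=False))            # approx 12 Sigma
print(np.cov(np.hstack([V1, V2]), rowvar=False)[:3, 3:])   # approx 0, since b'c = 0
```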
4.3 Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation

We discussed sampling and selecting random samples briefly in Chapter 3. In this section, we shall be concerned with samples from a multivariate normal population, in particular, with the sampling distribution of X̄ and S.

The Multivariate Normal Likelihood

Let us assume that the p × 1 vectors X1, X2, ..., Xn represent a random sample from a multivariate normal population with mean vector μ and covariance matrix Σ. Since X1, X2, ..., Xn are mutually independent and each has distribution Np(μ, Σ), the joint density function of all the observations is the product of the marginal normal densities:

{Joint density of X1, X2, ..., Xn}
  = Πⱼ { (1/((2π)^(p/2)|Σ|^(1/2))) e^(−(xj−μ)′Σ⁻¹(xj−μ)/2) }
  = (1/((2π)^(np/2)|Σ|^(n/2))) e^(−Σⱼ (xj−μ)′Σ⁻¹(xj−μ)/2)   (4-11)

When the numerical values of the observations become available, they may be substituted for the xj in Equation (4-11). The resulting expression, now considered as a function of μ and Σ for the fixed set of observations x1, x2, ..., xn, is called the likelihood.

Many good statistical procedures employ values for the population parameters that "best" explain the observed data. One meaning of best is to select the parameter values that maximize the joint density evaluated at the observations. This technique is called maximum likelihood estimation, and the maximizing parameter values are called maximum likelihood estimates.

At this point, we shall consider maximum likelihood estimation of the parameters μ and Σ for a multivariate normal population. To do so, we take the observations x1, x2, ..., xn as fixed and consider the joint density of Equation (4-11) evaluated at these values. The result is the likelihood function. In order to simplify matters, we rewrite the likelihood function in another form. We shall need some additional properties for the trace of a square matrix. (The trace of a matrix is the sum of its diagonal elements, and the properties of the trace are discussed in Definition 2A.28 and Result 2A.12.)

Result 4.9. Let A be a k × k symmetric matrix and x be a k × 1 vector. Then

(a) x′Ax = tr(x′Ax) = tr(Axx′)
(b) tr(A) = Σᵢ λi, where the λi are the eigenvalues of A.

Proof. For Part a, we note that x′Ax is a scalar, so x′Ax = tr(x′Ax). We pointed out in Result 2A.12 that tr(BC) = tr(CB) for any two matrices B and C of dimensions m × k and k × m, respectively. This follows because BC has Σⱼ bijcji as its ith diagonal element, so tr(BC) = Σᵢ (Σⱼ bijcji). Similarly, the jth diagonal element of CB is Σᵢ cjibij, so tr(CB) = Σⱼ (Σᵢ cjibij) = Σᵢ (Σⱼ bijcji) = tr(BC). Let x′ be the matrix B with m = 1, and let Ax play the role of the matrix C. Then tr(x′(Ax)) = tr((Ax)x′), and the result follows.

Part b is proved by using the spectral decomposition of (2-20) to write A = P′ΛP, where PP′ = I and Λ is a diagonal matrix with entries λ1, λ2, ..., λk. Therefore, tr(A) = tr(P′ΛP) = tr(ΛPP′) = tr(Λ) = λ1 + λ2 + ··· + λk. ■
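Both trace identities in Result 4.9 can be spot-checked numerically. A sketch, not part of the original text, assuming NumPy and a randomly generated symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                    # a symmetric k x k matrix
x = rng.normal(size=4)

# (a) x'Ax = tr(A x x')
print(np.isclose(x @ A @ x, np.trace(A @ np.outer(x, x))))    # True
# (b) tr(A) equals the sum of the eigenvalues of A
print(np.isclose(np.trace(A), np.linalg.eigvalsh(A).sum()))   # True
```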
Now the exponent in the joint density in (4-11) can be simplified. By Result 4.9(a),

(xj − μ)′Σ⁻¹(xj − μ) = tr[(xj − μ)′Σ⁻¹(xj − μ)] = tr[Σ⁻¹(xj − μ)(xj − μ)′]   (4-12)

Next,

Σⱼ (xj − μ)′Σ⁻¹(xj − μ) = Σⱼ tr[(xj − μ)′Σ⁻¹(xj − μ)]
                        = Σⱼ tr[Σ⁻¹(xj − μ)(xj − μ)′]
                        = tr[Σ⁻¹( Σⱼ (xj − μ)(xj − μ)′ )]   (4-13)

since the trace of a sum of matrices is equal to the sum of the traces of the matrices. We can add and subtract x̄ = (1/n) Σⱼ xj in each term (xj − μ) in Σⱼ (xj − μ)(xj − μ)′ to give

Σⱼ (xj − x̄ + x̄ − μ)(xj − x̄ + x̄ − μ)′ = Σⱼ (xj − x̄)(xj − x̄)′ + n(x̄ − μ)(x̄ − μ)′   (4-14)

because the cross-product terms, Σⱼ (xj − x̄)(x̄ − μ)′ and Σⱼ (x̄ − μ)(xj − x̄)′, are both matrices of zeros. (See Exercise 4.15.) Consequently, using Equations (4-13) and (4-14), we can write the joint density of a random sample from a multivariate normal population as

{Joint density of X1, X2, ..., Xn}
  = (2π)^(−np/2) |Σ|^(−n/2) exp{ −tr[ Σ⁻¹( Σⱼ (xj − x̄)(xj − x̄)′ + n(x̄ − μ)(x̄ − μ)′ ) ]/2 }   (4-15)
Substituting the observed values x_1, x_2, ..., x_n into the joint density yields the likelihood function. We shall denote this function by L(μ, Σ), to stress the fact that it is a function of the (unknown) population parameters μ and Σ. Thus, when the vectors x_j contain the specific numbers actually observed, we have

\[
L(\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\boldsymbol{\Sigma}|^{n/2}}\, e^{-\operatorname{tr}\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{j=1}^{n}(\mathbf{x}_j-\bar{\mathbf{x}})(\mathbf{x}_j-\bar{\mathbf{x}})' + n(\bar{\mathbf{x}}-\boldsymbol{\mu})(\bar{\mathbf{x}}-\boldsymbol{\mu})'\right)\right]/2}
\tag{4-16}
\]

It will be convenient in later sections of this book to express the exponent in the likelihood function (4-16) in different ways. In particular, we shall make use of the identity

\[
\operatorname{tr}\!\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{j=1}^{n}(\mathbf{x}_j-\bar{\mathbf{x}})(\mathbf{x}_j-\bar{\mathbf{x}})' + n(\bar{\mathbf{x}}-\boldsymbol{\mu})(\bar{\mathbf{x}}-\boldsymbol{\mu})'\right)\right]
= \operatorname{tr}\!\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{j=1}^{n}(\mathbf{x}_j-\bar{\mathbf{x}})(\mathbf{x}_j-\bar{\mathbf{x}})'\right)\right] + n(\bar{\mathbf{x}}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})
\tag{4-17}
\]

Maximum Likelihood Estimation of μ and Σ

The next result will eventually allow us to obtain the maximum likelihood estimators of μ and Σ.

Result 4.10. Given a p × p symmetric positive definite matrix B and a scalar b > 0, it follows that

\[
\frac{1}{|\boldsymbol{\Sigma}|^{b}}\, e^{-\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B})/2} \le \frac{1}{|\mathbf{B}|^{b}}\,(2b)^{pb}\, e^{-bp}
\]

for all positive definite Σ (p × p), with equality holding only for Σ = (1/2b)B.

Proof. Let B^{1/2} be the symmetric square root of B [see Equation (2-22)], so B^{1/2}B^{1/2} = B and B^{-1/2}B^{-1/2} = B^{-1}. Then tr(Σ^{-1}B) = tr[(Σ^{-1}B^{1/2})B^{1/2}] = tr[B^{1/2}(Σ^{-1}B^{1/2})] = tr(B^{1/2}Σ^{-1}B^{1/2}). The matrix B^{1/2}Σ^{-1}B^{1/2} is positive definite because y'B^{1/2}Σ^{-1}B^{1/2}y = (B^{1/2}y)'Σ^{-1}(B^{1/2}y) > 0 if B^{1/2}y ≠ 0 or, equivalently, y ≠ 0. Thus, the eigenvalues η_i of B^{1/2}Σ^{-1}B^{1/2} are positive by Exercise 2.17. Result 4.9(b) then gives

\[
\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B}) = \operatorname{tr}(\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}) = \sum_{i=1}^{p}\eta_i
\]

and |B^{1/2}Σ^{-1}B^{1/2}| = ∏_{i=1}^{p} η_i by Exercise 2.12. From the properties of determinants in Result 2A.11, we can write

\[
|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}| = |\mathbf{B}^{1/2}||\boldsymbol{\Sigma}^{-1}||\mathbf{B}^{1/2}| = |\boldsymbol{\Sigma}^{-1}||\mathbf{B}^{1/2}||\mathbf{B}^{1/2}| = |\boldsymbol{\Sigma}^{-1}||\mathbf{B}| = \frac{1}{|\boldsymbol{\Sigma}|}|\mathbf{B}|
\]
or

\[
\frac{1}{|\boldsymbol{\Sigma}|} = \frac{|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}|}{|\mathbf{B}|} = \frac{\prod_{i=1}^{p}\eta_i}{|\mathbf{B}|}
\]

Combining the results for the trace and the determinant yields

\[
\frac{1}{|\boldsymbol{\Sigma}|^{b}}\, e^{-\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B})/2} = \frac{\left(\prod_{i=1}^{p}\eta_i\right)^{b}}{|\mathbf{B}|^{b}}\, e^{-\sum_{i=1}^{p}\eta_i/2} = \frac{1}{|\mathbf{B}|^{b}} \prod_{i=1}^{p} \eta_i^{b}\, e^{-\eta_i/2}
\]

But the function η^b e^{-η/2} has a maximum, with respect to η, of (2b)^b e^{-b}, occurring at η = 2b. The choice η_i = 2b, for each i, therefore gives the asserted bound. The upper bound is uniquely attained when Σ = (1/2b)B, since, for this choice,

B^{1/2}Σ^{-1}B^{1/2} = B^{1/2}(2b)B^{-1}B^{1/2} = (2b)I

and consequently tr(Σ^{-1}B) = 2bp and |Σ| = |B|/(2b)^p. Straightforward substitution for tr[Σ^{-1}B] and 1/|Σ|^b yields the bound asserted. ∎

Result 4.11. Let X_1, X_2, ..., X_n be a random sample from a normal population with mean μ and covariance Σ. Then

\[
\hat{\boldsymbol{\mu}} = \bar{\mathbf{X}} \qquad\text{and}\qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{X}_j-\bar{\mathbf{X}})(\mathbf{X}_j-\bar{\mathbf{X}})' = \frac{(n-1)}{n}\mathbf{S}
\]

are the maximum likelihood estimators of μ and Σ, respectively. Their observed values, x̄ and (1/n) Σ_{j=1}^{n}(x_j − x̄)(x_j − x̄)', are called the maximum likelihood estimates of μ and Σ.

Proof. The maximum likelihood estimates of μ and Σ are those values, denoted by μ̂ and Σ̂, that maximize the function L(μ, Σ) in (4-16). The estimates μ̂ and Σ̂ will depend on the observed values x_1, x_2, ..., x_n through the summary statistics x̄ and S. The exponent in the likelihood function [see Equation (4-16)], apart from the multiplicative factor −1/2, is [see (4-17)]

\[
\operatorname{tr}\!\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{j=1}^{n}(\mathbf{x}_j-\bar{\mathbf{x}})(\mathbf{x}_j-\bar{\mathbf{x}})'\right)\right] + n(\bar{\mathbf{x}}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})
\]
Since Σ^{-1} is positive definite, the distance (x̄ − μ)'Σ^{-1}(x̄ − μ) > 0 unless μ = x̄. Thus, the likelihood is maximized with respect to μ at μ̂ = x̄. It remains to maximize

\[
L(\bar{\mathbf{x}},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\boldsymbol{\Sigma}|^{n/2}}\, e^{-\operatorname{tr}\left[\boldsymbol{\Sigma}^{-1}\sum_{j=1}^{n}(\mathbf{x}_j-\bar{\mathbf{x}})(\mathbf{x}_j-\bar{\mathbf{x}})'\right]/2}
\]

over Σ. By Result 4.10 with b = n/2 and B = Σ_{j=1}^{n}(x_j − x̄)(x_j − x̄)', the maximum occurs at Σ̂ = (1/n) Σ_{j=1}^{n}(x_j − x̄)(x_j − x̄)', as stated.

The maximum likelihood estimators are random quantities. They are obtained by replacing the observations x_1, x_2, ..., x_n in the expressions for μ̂ and Σ̂ with the corresponding random vectors X_1, X_2, ..., X_n. ∎

We note that the maximum likelihood estimator X̄ is a random vector and the maximum likelihood estimator Σ̂ is a random matrix. The maximum likelihood estimates are their particular values for the given data set. In addition, the maximum of the likelihood is

\[
L(\hat{\boldsymbol{\mu}},\hat{\boldsymbol{\Sigma}}) = \frac{1}{(2\pi)^{np/2}}\, e^{-np/2}\, \frac{1}{|\hat{\boldsymbol{\Sigma}}|^{n/2}}
\tag{4-18}
\]

or, since |Σ̂| = [(n − 1)/n]^p |S|,

L(μ̂, Σ̂) = constant × (generalized variance)^{−n/2}  (4-19)

The generalized variance determines the "peakedness" of the likelihood function and, consequently, is a natural measure of variability when the parent population is multivariate normal.

Maximum likelihood estimators possess an invariance property. Let θ̂ be the maximum likelihood estimator of θ, and consider estimating the parameter h(θ), which is a function of θ. Then the maximum likelihood estimate of h(θ) is given by h(θ̂) (the same function of θ̂).  (4-20)

(See [1] and [15].) For example, the maximum likelihood estimator of μ'Σ^{-1}μ is μ̂'Σ̂^{-1}μ̂, where μ̂ = X̄ and Σ̂ = ((n − 1)/n)S are the maximum likelihood estimators of μ and Σ, respectively. The maximum likelihood estimator of √σ_ii is √σ̂_ii, where σ̂_ii = (1/n) Σ_{j=1}^{n}(X_ji − X̄_i)² is the maximum likelihood estimator of σ_ii = Var(X_i).
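Result 4.11 and the invariance property (4-20) are easy to exercise numerically. The following is a minimal sketch (not from the text), using a small hypothetical data matrix; it computes μ̂, Σ̂ = ((n − 1)/n)S, and the maximum likelihood estimate of μ'Σ^{-1}μ.

```python
import numpy as np

# Hypothetical data matrix X: n = 5 observations on p = 2 variables.
X = np.array([[1.2, 0.8],
              [2.0, 1.1],
              [1.7, 0.3],
              [0.9, 1.5],
              [1.4, 0.9]])
n, p = X.shape

mu_hat = X.mean(axis=0)              # maximum likelihood estimate of mu
S = np.cov(X, rowvar=False)          # sample covariance (divisor n - 1)
sigma_hat = (n - 1) / n * S          # MLE of Sigma uses divisor n

# Invariance property (4-20): the MLE of mu' Sigma^{-1} mu is
# mu_hat' sigma_hat^{-1} mu_hat.
quad_form = mu_hat @ np.linalg.solve(sigma_hat, mu_hat)
print(mu_hat, sigma_hat, quad_form)
```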
Sufficient Statistics

From expression (4-15), the joint density depends on the whole set of observations x_1, x_2, ..., x_n only through the sample mean x̄ and the sum-of-squares-and-cross-products matrix Σ_{j=1}^{n}(x_j − x̄)(x_j − x̄)' = (n − 1)S. We express this fact by saying that x̄ and (n − 1)S (or S) are sufficient statistics:

Let X_1, X_2, ..., X_n be a random sample from a multivariate normal population with mean μ and covariance Σ. Then X̄ and S are sufficient statistics.  (4-21)

The importance of sufficient statistics for normal populations is that all of the information about μ and Σ in the data matrix X is contained in x̄ and S, regardless of the sample size n. This generally is not true for nonnormal populations. Since many multivariate techniques begin with sample means and covariances, it is prudent to check on the adequacy of the multivariate normal assumption. (See Section 4.6.) If the data cannot be regarded as multivariate normal, techniques that depend solely on x̄ and S may be ignoring other useful sample information.

4.4 The Sampling Distribution of X̄ and S

The tentative assumption that X_1, X_2, ..., X_n constitute a random sample from a normal population with mean μ and covariance Σ completely determines the sampling distributions of X̄ and S. Here we present the results on the sampling distributions of X̄ and S by drawing a parallel with the familiar univariate conclusions.

In the univariate case (p = 1), we know that X̄ is normal with mean μ = (population mean) and variance

σ²/n = (population variance)/(sample size)

The result for the multivariate case (p ≥ 2) is analogous in that X̄ has a normal distribution with mean μ and covariance matrix (1/n)Σ.

For the sample variance, recall that (n − 1)s² = Σ_{j=1}^{n}(X_j − X̄)² is distributed as σ² times a chi-square variable having n − 1 degrees of freedom (d.f.). In turn, this chi-square is the distribution of a sum of squares of independent standard normal random variables. That is, (n − 1)s² is distributed as σ²(Z₁² + ··· + Z²_{n−1}) = (σZ₁)² + ··· + (σZ_{n−1})². The individual terms σZ_i are independently distributed as N(0, σ²). It is this latter form that is suitably generalized to the basic sampling distribution for the sample covariance matrix.
The sampling distribution of the sample covariance matrix is called the Wishart distribution, after its discoverer; it is defined as the sum of independent products of multivariate normal random vectors. Specifically,
\[
W_{m}(\,\cdot\,|\,\boldsymbol{\Sigma}) = \text{Wishart distribution with } m \text{ d.f.} = \text{distribution of } \sum_{j=1}^{m} \mathbf{Z}_j \mathbf{Z}_j'
\tag{4-22}
\]

where the Z_j are each independently distributed as N_p(0, Σ). We summarize the sampling distribution results as follows:
Let X_1, X_2, ..., X_n be a random sample of size n from a p-variate normal distribution with mean μ and covariance matrix Σ. Then

1. X̄ is distributed as N_p(μ, (1/n)Σ).
2. (n − 1)S is distributed as a Wishart random matrix with n − 1 d.f.  (4-23)
3. X̄ and S are independent.
Because Σ is unknown, the distribution of X̄ cannot be used directly to make inferences about μ. However, S provides independent information about Σ, and the distribution of S does not depend on μ. This allows us to construct a statistic for making inferences about μ, as we shall see in Chapter 5.

For the present, we record some further results from multivariable distribution theory. The following properties of the Wishart distribution are derived directly from its definition as a sum of the independent products Z_j Z_j'. Proofs can be found in [1].
Properties of the Wishart Distribution
1. If A_1 is distributed as W_{m₁}(A_1 | Σ) independently of A_2, which is distributed as W_{m₂}(A_2 | Σ), then A_1 + A_2 is distributed as W_{m₁+m₂}(A_1 + A_2 | Σ). That is, the degrees of freedom add.  (4-24)
2. If A is distributed as W_m(A | Σ), then CAC' is distributed as W_m(CAC' | CΣC').

Although we do not have any particular need for the probability density function of the Wishart distribution, it may be of some interest to see its rather complicated form. The density does not exist unless the sample size n is greater than the number of variables p. When it does exist, its value at the positive definite matrix A is
\[
w_{n-1}(\mathbf{A}\,|\,\boldsymbol{\Sigma}) = \frac{|\mathbf{A}|^{(n-p-2)/2}\; e^{-\operatorname{tr}\left(\mathbf{A}\boldsymbol{\Sigma}^{-1}\right)/2}}{2^{p(n-1)/2}\, \pi^{p(p-1)/4}\, |\boldsymbol{\Sigma}|^{(n-1)/2} \prod_{i=1}^{p} \Gamma\!\left(\tfrac{1}{2}(n-i)\right)}, \qquad \mathbf{A} \text{ positive definite}
\tag{4-25}
\]

where Γ(·) is the gamma function. (See [1] and [11].)
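The defining sum in (4-22) translates directly into simulation. Below is a minimal sketch (not from the text) that draws Wishart matrices as sums of outer products Z_j Z_j'; the dimension, degrees of freedom, and Σ used here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 3, 20                      # dimension and degrees of freedom
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

def wishart_draw(m, Sigma):
    """One W_m(. | Sigma) draw via the definition (4-22)."""
    Z = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=m)
    return Z.T @ Z                # sum of outer products z_j z_j'

# E[W_m(. | Sigma)] = m * Sigma, so the Monte Carlo average of draws,
# divided by m, should be close to Sigma.
avg = sum(wishart_draw(m, Sigma) for _ in range(2000)) / 2000
print(np.round(avg / m, 2))
```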
4.5 Large-Sample Behavior of X̄ and S
Suppose the quantity X is determined by a large number of independent causes V_1, V_2, ..., V_n, where the random variables V_i representing the causes have approximately the same variability. If X is the sum

X = V_1 + V_2 + ··· + V_n
then the central limit theorem applies, and we conclude that X has a distribution that is nearly normal. This is true for virtually any parent distribution of the V_i's, provided that n is large enough.

The univariate central limit theorem also tells us that the sampling distribution of the sample mean, X̄, for a large sample size is nearly normal, whatever the form of the underlying population distribution. A similar result holds for many other important univariate statistics.

It turns out that certain multivariate statistics, like X̄ and S, have large-sample properties analogous to their univariate counterparts. As the sample size is increased without bound, certain regularities govern the sampling variation in X̄ and S, irrespective of the form of the parent population. Therefore, the conclusions presented in this section do not require multivariate normal populations. The only requirements are that the parent population, whatever its form, have a mean μ and a finite covariance Σ.
Result 4.12 (Law of large numbers). Let Y_1, Y_2, ..., Y_n be independent observations from a population with mean E(Y_i) = μ. Then

\[
\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}
\]

converges in probability to μ as n increases without bound. That is, for any prescribed accuracy ε > 0, P[−ε < Ȳ − μ < ε] approaches unity as n → ∞.

Proof. See [9]. ∎
As a direct consequence of the law of large numbers, which says that each X̄_i converges in probability to μ_i, i = 1, 2, ..., p,

X̄ converges in probability to μ  (4-26)

Also, each sample covariance s_ik converges in probability to σ_ik, i, k = 1, 2, ..., p, and

S (or Σ̂ = S_n) converges in probability to Σ  (4-27)
Statement (4-27) follows from writing

\[
(n-1)s_{ik} = \sum_{j=1}^{n}(X_{ji}-\bar{X}_i)(X_{jk}-\bar{X}_k)
= \sum_{j=1}^{n}(X_{ji}-\mu_i+\mu_i-\bar{X}_i)(X_{jk}-\mu_k+\mu_k-\bar{X}_k)
= \sum_{j=1}^{n}(X_{ji}-\mu_i)(X_{jk}-\mu_k) - n(\bar{X}_i-\mu_i)(\bar{X}_k-\mu_k)
\]
Letting Y_j = (X_ji − μ_i)(X_jk − μ_k), with E(Y_j) = σ_ik, we see that the first term in s_ik converges to σ_ik and the second term converges to zero, by applying the law of large numbers.

The practical interpretation of statements (4-26) and (4-27) is that, with high probability, X̄ will be close to μ and S will be close to Σ whenever the sample size is large. The statement concerning X̄ is made even more precise by a multivariate version of the central limit theorem.

Result 4.13 (The central limit theorem). Let X_1, X_2, ..., X_n be independent observations from any population with mean μ and finite covariance Σ. Then
√n (X̄ − μ) has an approximate N_p(0, Σ) distribution
for large sample sizes. Here n should also be large relative to p.

Proof. See [1]. ∎
The approximation provided by the central limit theorem applies to discrete, as well as continuous, multivariate populations. Mathematically, the limit is exact, and the approach to normality is often fairly rapid. Moreover, from the results in Section 4.4, we know that X̄ is exactly normally distributed when the underlying population is normal. Thus, we would expect the central limit theorem approximation to be quite good for moderate n when the parent population is nearly normal.

As we have seen, when n is large, S is close to Σ with high probability. Consequently, replacing Σ by S in the approximating normal distribution for X̄ will have a negligible effect on subsequent probability calculations.

Result 4.7 can be used to show that n(X̄ − μ)'Σ^{-1}(X̄ − μ) has a χ²_p distribution when X̄ is distributed as N_p(μ, (1/n)Σ) or, equivalently, when √n (X̄ − μ) has an N_p(0, Σ) distribution. The χ²_p distribution is approximately the sampling distribution of n(X̄ − μ)'Σ^{-1}(X̄ − μ) when X̄ is approximately normally distributed. Replacing Σ^{-1} by S^{-1} does not seriously affect this approximation for n large and much greater than p.

We summarize the major conclusions of this section as follows:
Let X_1, X_2, ..., X_n be independent observations from a population with mean μ and finite (nonsingular) covariance Σ. Then

√n (X̄ − μ) is approximately N_p(0, Σ)

and

n(X̄ − μ)'S^{-1}(X̄ − μ) is approximately χ²_p  (4-28)

for n − p large.
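A quick simulation makes (4-28) tangible. The sketch below (not from the text; the nonnormal population and its parameters are hypothetical) compares n(x̄ − μ)'S^{-1}(x̄ − μ) with chi-square quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, reps = 200, 3, 2000
mu = np.array([1.0, 2.0, 0.5])          # exponential-population means

stat = np.empty(reps)
for r in range(reps):
    # Nonnormal parent: independent exponentials with means mu.
    X = rng.exponential(mu, size=(n, p))
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    d = xbar - mu
    stat[r] = n * d @ np.linalg.solve(S, d)

# Fraction exceeding the upper 5th percentile of chi-square(p):
# should be near .05 when n - p is large.
print(np.mean(stat > stats.chi2.ppf(0.95, df=p)))
```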
In the next three sections, we consider ways of verifying the assumption of normality and methods for transforming nonnormal observations into observations that are approximately normal.
4.6 Assessing the Assumption of Normality
As we have pointed out, most of the statistical techniques discussed in subsequent chapters assume that each vector observation x_j comes from a multivariate normal distribution. On the other hand, in situations where the sample size is large and the techniques depend solely on the behavior of X̄, or distances involving X̄ of the form n(X̄ − μ)'S^{-1}(X̄ − μ), the assumption of normality for the individual observations is less crucial. But to some degree, the quality of inferences made by these methods depends on how closely the true parent population resembles the multivariate normal form. It is imperative, then, that procedures exist for detecting cases where the data exhibit moderate to extreme departures from what is expected under multivariate normality.

We want to answer this question: Do the observations x_j appear to violate the assumption that they came from a normal population? Based on the properties of normal distributions, we know that all linear combinations of normal variables are normal and the contours of the multivariate normal density are ellipsoids. Therefore, we address these questions:
1. Do the marginal distributions of the elements of X appear to be normal? What about a few linear combinations of the components X_i?
2. Do the scatter plots of pairs of observations on different characteristics give the elliptical appearance expected from normal populations?
3. Are there any "wild" observations that should be checked for accuracy?
It will become clear that our investigations of normality will concentrate on the behavior of the observations in one or two dimensions (for example, marginal distributions and scatter plots). As might be expected, it has proved difficult to construct a "good" overall test of joint normality in more than two dimensions because of the large number of things that can go wrong. To some extent, we must pay a price for concentrating on univariate and bivariate examinations of normality: We can never be sure that we have not missed some feature that is revealed only in higher dimensions. (It is possible, for example, to construct a nonnormal bivariate distribution with normal marginals. [See Exercise 4.8.]) Yet many types of nonnormality are often reflected in the marginal distributions and scatter plots. Moreover, for most practical work, onedimensional and twodimensional investigations are ordinarily sufficient. Fortunately, pathological data sets that are normal in lower dimensional representations, but nonnormal in higher dimensions, are not frequently encountered in practice.
Evaluating the Normality of the Univariate Marginal Distributions
Dot diagrams for smaller n and histograms for n > 25 or so help reveal situations where one tail of a univariate distribution is much longer than the other. If the histogram for a variable X_i appears reasonably symmetric, we can check further by counting the number of observations in certain intervals. A univariate normal distribution assigns probability .683 to the interval (μ_i − √σ_ii, μ_i + √σ_ii) and probability .954 to the interval (μ_i − 2√σ_ii, μ_i + 2√σ_ii). Consequently, with a large sample size n, we expect the observed proportion p̂_i1 of the observations lying in the
interval (x̄_i − √s_ii, x̄_i + √s_ii) to be about .683. Similarly, the observed proportion p̂_i2 of the observations in (x̄_i − 2√s_ii, x̄_i + 2√s_ii) should be about .954. Using the normal approximation to the sampling distribution of p̂_i (see [9]), we observe that either

\[
|\hat{p}_{i1} - .683| > 3\sqrt{\frac{(.683)(.317)}{n}} = \frac{1.396}{\sqrt{n}}
\]

or

\[
|\hat{p}_{i2} - .954| > 3\sqrt{\frac{(.954)(.046)}{n}} = \frac{.628}{\sqrt{n}}
\tag{4-29}
\]

would indicate departures from an assumed normal distribution for the ith characteristic. When the observed proportions are too small, parent distributions with thicker tails than the normal are suggested.
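As an illustration, here is a minimal sketch (not from the text) of the interval counts behind (4-29), applied to a hypothetical sample on one variable.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(10.0, 2.0, size=200)     # hypothetical univariate sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

p1 = np.mean(np.abs(x - xbar) < s)      # proportion within one std dev
p2 = np.mean(np.abs(x - xbar) < 2 * s)  # proportion within two std devs

# Flag departures from normality using the bounds in (4-29).
flag1 = abs(p1 - .683) > 1.396 / np.sqrt(n)
flag2 = abs(p2 - .954) > .628 / np.sqrt(n)
print(p1, p2, flag1, flag2)
```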
Plots are always useful devices in any data analysis. Special plots called Q-Q plots can be used to assess the assumption of normality. These plots can be made for the marginal distributions of the sample observations on each variable. They are, in effect, plots of the sample quantile versus the quantile one would expect to observe if the observations actually were normally distributed. When the points lie very nearly along a straight line, the normality assumption remains tenable. Normality is suspect if the points deviate from a straight line. Moreover, the pattern of the deviations can provide clues about the nature of the nonnormality. Once the reasons for the nonnormality are identified, corrective action is often possible. (See Section 4.8.)

To simplify notation, let x_1, x_2, ..., x_n represent n observations on any single characteristic X_i. Let x_(1) ≤ x_(2) ≤ ··· ≤ x_(n) represent these observations after they are ordered according to magnitude. For example, x_(2) is the second smallest observation and x_(n) is the largest observation. The x_(j)'s are the sample quantiles. When the x_(j) are distinct, exactly j observations are less than or equal to x_(j). (This is theoretically always true when the observations are of the continuous type, which we usually assume.) The proportion j/n of the sample at or to the left of x_(j) is often approximated by (j − ½)/n for analytical convenience.¹
For a standard normal distribution, the quantiles q_(j) are defined by the relation

\[
P[Z \le q_{(j)}] = \int_{-\infty}^{q_{(j)}} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = p_{(j)} = \frac{j - \tfrac{1}{2}}{n}
\tag{4-30}
\]

(See Table 1 in the appendix.) Here p_(j) is the probability of getting a value less than or equal to q_(j) in a single drawing from a standard normal population.

The idea is to look at the pairs of quantiles (q_(j), x_(j)) with the same associated cumulative probability (j − ½)/n. If the data arise from a normal population, the pairs (q_(j), x_(j)) will be approximately linearly related, since σq_(j) + μ is nearly the expected sample quantile.²
¹ The ½ in the numerator of (j − ½)/n is a "continuity" correction. Some authors (see [5] and [10]) have suggested replacing (j − ½)/n by (j − 3/8)/(n + 1/4).
² A better procedure is to plot (m_(j), x_(j)), where m_(j) = E(z_(j)) is the expected value of the jth-order statistic in a sample of size n from a standard normal distribution. (See [13] for further discussion.)
Example 4.9 (Constructing a Q-Q plot) A sample of n = 10 observations gives the values in the following table:

Ordered observations x_(j):        -1.00  -.10   .16   .41   .62   .80  1.26  1.54  1.71  2.30
Probability levels (j - 1/2)/n:      .05   .15   .25   .35   .45   .55   .65   .75   .85   .95
Standard normal quantiles q_(j):  -1.645 -1.036 -.674 -.385 -.125  .125  .385  .674 1.036 1.645

Here, for example,

\[
P[Z \le .385] = \int_{-\infty}^{.385} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = .65 \quad \text{[See (4-30).]}
\]

Let us now construct the Q-Q plot and comment on its appearance. The Q-Q plot for the foregoing data, which is a plot of the ordered data x_(j) against the normal quantiles q_(j), is shown in Figure 4.5. The pairs of points (q_(j), x_(j)) lie very nearly along a straight line, and we would not reject the notion that these data are normally distributed, particularly with a sample size as small as n = 10. ∎

Figure 4.5 A Q-Q plot for the data in Example 4.9.
The calculations required for Q-Q plots are easily programmed for electronic computers. Many statistical programs available commercially are capable of producing such plots. The steps leading to a Q-Q plot are as follows:

1. Order the original observations to get x_(1), x_(2), ..., x_(n) and their corresponding probability values (1 − ½)/n, (2 − ½)/n, ..., (n − ½)/n;
2. Calculate the standard normal quantiles q_(1), q_(2), ..., q_(n); and
3. Plot the pairs of observations (q_(1), x_(1)), (q_(2), x_(2)), ..., (q_(n), x_(n)), and examine the "straightness" of the outcome.
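These steps translate directly into a few lines of code. A minimal sketch (not from the text), reusing the data of Example 4.9:

```python
import numpy as np
from scipy import stats

x = np.array([-1.00, -.10, .16, .41, .62, .80, 1.26, 1.54, 1.71, 2.30])
n = len(x)

x_ord = np.sort(x)                          # step 1: order the observations
probs = (np.arange(1, n + 1) - 0.5) / n     # probability levels (j - 1/2)/n
q = stats.norm.ppf(probs)                   # step 2: standard normal quantiles

# Step 3: the pairs (q_(j), x_(j)); plot these and inspect for straightness.
for qj, xj in zip(q, x_ord):
    print(f"{qj:7.3f}  {xj:6.2f}")
```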
Q-Q plots are not particularly informative unless the sample size is moderate to large, for instance, n ≥ 20. There can be quite a bit of variability in the straightness of the Q-Q plot for small samples, even when the observations are known to come from a normal population.
Example 4.10 (A Q-Q plot for radiation data) The quality-control department of a manufacturer of microwave ovens is required by the federal government to monitor the amount of radiation emitted when the doors of the ovens are closed. Observations of the radiation emitted through closed doors of n = 42 randomly selected ovens were made. The data are listed in Table 4.1.
Table 4.1 Radiation Data (Door Closed)

Oven no.  Radiation | Oven no.  Radiation | Oven no.  Radiation
   1        .15     |   16        .10     |   31        .10
   2        .09     |   17        .02     |   32        .20
   3        .18     |   18        .10     |   33        .11
   4        .10     |   19        .01     |   34        .30
   5        .05     |   20        .40     |   35        .02
   6        .12     |   21        .10     |   36        .20
   7        .08     |   22        .05     |   37        .20
   8        .05     |   23        .03     |   38        .30
   9        .08     |   24        .05     |   39        .30
  10        .10     |   25        .15     |   40        .40
  11        .07     |   26        .10     |   41        .30
  12        .02     |   27        .15     |   42        .05
  13        .01     |   28        .09     |
  14        .10     |   29        .08     |
  15        .10     |   30        .18     |

Source: Data courtesy of J. D. Cryer.
In order to determine the probability of exceeding a prespecified tolerance level, a probability distribution for the radiation emitted was needed. Can we regard the observations here as being normally distributed?

A computer was used to assemble the pairs (q_(j), x_(j)) and construct the Q-Q plot, pictured in Figure 4.6. It appears from the plot that the data as a whole are not normally distributed. The points indicated by the circled locations in the figure are outliers, values that are too large relative to the rest of the observations.

For the radiation data, several observations are equal. When this occurs, those observations with like values are associated with the same normal quantile. This quantile is calculated using the average of the quantiles the tied observations would have if they all differed slightly. ∎
Figure 4.6 A Q-Q plot of the radiation data (door closed) from Example 4.10. (The integers in the plot indicate the number of points occupying the same location.)
The straightness of the Q-Q plot can be measured by calculating the correlation coefficient of the points in the plot. The correlation coefficient for the Q-Q plot is defined by

\[
r_Q = \frac{\displaystyle\sum_{j=1}^{n}(x_{(j)}-\bar{x})(q_{(j)}-\bar{q})}{\sqrt{\displaystyle\sum_{j=1}^{n}(x_{(j)}-\bar{x})^2}\,\sqrt{\displaystyle\sum_{j=1}^{n}(q_{(j)}-\bar{q})^2}}
\tag{4-31}
\]

and a powerful test of normality can be based on it. (See [5], [10], and [12].) Formally, we reject the hypothesis of normality at level of significance α if r_Q falls below the appropriate value in Table 4.2.
Table 4.2 Critical Points for the Q-Q Plot Correlation Coefficient Test for Normality

Sample size n | α = .01 | α = .05 | α = .10
      5         .8299     .8788     .9032
     10         .8801     .9198     .9351
     15         .9126     .9389     .9503
     20         .9269     .9508     .9604
     25         .9410     .9591     .9665
     30         .9479     .9652     .9715
     35         .9538     .9682     .9740
     40         .9599     .9726     .9771
     45         .9632     .9749     .9792
     50         .9671     .9768     .9809
     55         .9695     .9787     .9822
     60         .9720     .9801     .9836
     75         .9771     .9838     .9866
    100         .9822     .9873     .9895
    150         .9879     .9913     .9928
    200         .9905     .9931     .9942
    300         .9935     .9953     .9960
Example 4.11 (A correlation coefficient test for normality) Let us calculate the correlation coefficient r_Q from the Q-Q plot of Example 4.9 (see Figure 4.5) and test for normality.

Using the information from Example 4.9, we have x̄ = .770 and

\[
\sum_{j=1}^{10}(x_{(j)}-\bar{x})q_{(j)} = 8.584, \qquad \sum_{j=1}^{10}(x_{(j)}-\bar{x})^2 = 8.472, \qquad \sum_{j=1}^{10} q_{(j)}^2 = 8.795
\]

Since always, q̄ = 0,

\[
r_Q = \frac{8.584}{\sqrt{8.472}\,\sqrt{8.795}} = .994
\]
A test of normality at the 10% level of significance is provided by referring r_Q = .994 to the entry in Table 4.2 corresponding to n = 10 and α = .10. This entry is .9351. Since r_Q > .9351, we do not reject the hypothesis of normality. ∎

Instead of r_Q, some software packages evaluate the original statistic proposed by Shapiro and Wilk [12]. Its correlation form corresponds to replacing q_(j) by a function of the expected value of standard normal-order statistics and their covariances. We prefer r_Q because it corresponds directly to the points in the normal-scores plot. For large sample sizes, the two statistics are nearly the same (see [13]), so either can be used to judge lack of fit.
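Computing r_Q is a one-liner once the Q-Q pairs are in hand. A minimal sketch (not from the text), reusing the Example 4.9 data:

```python
import numpy as np
from scipy import stats

x = np.sort([-1.00, -.10, .16, .41, .62, .80, 1.26, 1.54, 1.71, 2.30])
n = len(x)
q = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

# Correlation coefficient (4-31) between ordered data and normal quantiles
# (q has mean zero, so this equals the expression in (4-31) exactly).
r_q = np.corrcoef(x, q)[0, 1]
print(round(r_q, 3))            # about .994; compare with Table 4.2

# Reject normality at the 10% level if r_q < .9351 (Table 4.2, n = 10).
print(r_q < .9351)
```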
Linear combinations of more than one characteristic can be investigated. Many statisticians suggest plotting

ê'_1 x_j    where    S ê_1 = λ̂_1 ê_1

in which λ̂_1 is the largest eigenvalue of S. Here x'_j = [x_{j1}, x_{j2}, ..., x_{jp}] is the jth observation on the p variables X_1, X_2, ..., X_p. The linear combination ê'_p x_j corresponding to the smallest eigenvalue is also frequently singled out for inspection. (See Chapter 8 and [6] for further details.)
Evaluating Bivariate Normality
We would like to check on the assumption of normality for all distributions of 2, 3, ..., p dimensions. However, as we have pointed out, for practical work it is usually sufficient to investigate the univariate and bivariate distributions. We considered univariate marginal distributions earlier. It is now of interest to examine the bivariate case.

In Chapter 1, we described scatter plots for pairs of characteristics. If the observations were generated from a multivariate normal distribution, each bivariate distribution would be normal, and the contours of constant density would be ellipses. The scatter plot should conform to this structure by exhibiting an overall pattern that is nearly elliptical.

Moreover, by Result 4.7, the set of bivariate outcomes x such that

(x − μ)'Σ^{-1}(x − μ) ≤ χ²₂(.5)

has probability .5. Thus, we should expect roughly the same percentage, 50%, of sample observations to lie in the ellipse given by

{all x such that (x − x̄)'S^{-1}(x − x̄) ≤ χ²₂(.5)}

where we have replaced μ by its estimate x̄ and Σ^{-1} by its estimate S^{-1}. If not, the normality assumption is suspect.
Example 4.12 (Checking bivariate normality) Although not a random sample, data consisting of the pairs of observations (x₁ = sales, x₂ = profits) for the 10 largest companies in the world are listed in Exercise 1.4. These data give

\[
\bar{\mathbf{x}} = \begin{bmatrix} 155.60 \\ 14.70 \end{bmatrix}, \qquad
\mathbf{S} = \begin{bmatrix} 7476.45 & 303.62 \\ 303.62 & 26.19 \end{bmatrix}
\]

so

\[
\mathbf{S}^{-1} = \frac{1}{103{,}623.12} \begin{bmatrix} 26.19 & -303.62 \\ -303.62 & 7476.45 \end{bmatrix}
= \begin{bmatrix} .000253 & -.002930 \\ -.002930 & .072148 \end{bmatrix}
\]

From Table 3 in the appendix, χ²₂(.5) = 1.39. Thus, any observation x' = [x₁, x₂] satisfying

\[
\begin{bmatrix} x_1 - 155.60 \\ x_2 - 14.70 \end{bmatrix}'
\begin{bmatrix} .000253 & -.002930 \\ -.002930 & .072148 \end{bmatrix}
\begin{bmatrix} x_1 - 155.60 \\ x_2 - 14.70 \end{bmatrix} \le 1.39
\]

is on or inside the estimated 50% contour. Otherwise the observation is outside this contour. The first pair of observations in Exercise 1.4 is [x₁, x₂]' = [108.28, 17.05]. In this case

\[
\begin{bmatrix} 108.28 - 155.60 \\ 17.05 - 14.70 \end{bmatrix}'
\begin{bmatrix} .000253 & -.002930 \\ -.002930 & .072148 \end{bmatrix}
\begin{bmatrix} 108.28 - 155.60 \\ 17.05 - 14.70 \end{bmatrix} = 1.61 > 1.39
\]

and this point falls outside the 50% contour. The remaining nine points have generalized distances from x̄ of .30, .62, 1.79, 1.30, 4.38, 1.64, 3.53, 1.71, and 1.16, respectively. Since four of these distances are less than 1.39, a proportion, .40, of the data falls within the 50% contour. If the observations were normally distributed, we would expect about half, or 5, of them to be within this contour. This difference in proportions might ordinarily provide evidence for rejecting the notion of bivariate normality; however, our sample size of 10 is too small to reach this conclusion. (See also Example 4.13.) ∎
and this point falls outside the 50% contour. The remaining nine points have generalized distances from i of .30, .62, 1.79, 1.30, 4.38, 1.64, 3.53, 1.71, and 1.16, respectively. Since four of these distances are less than 1.39, a proportion, .40, of the data falls within the 50% contour. If the observations were normally distributed, we would expect about half, or 5, of them to be within this contour. This difference in proportions might ordinarily provide evidence for rejecting the notion of bivariate normality; however, our sample size of 10 is too small to reach this conclusion. (See • also Example 4.13.) Computing the fraction of the points within a contour and subjectively comparing it with the theoretical probability is a useful, but rather rough, procedure.
A somewhat more formal method for judging the joint normality of a data set is based on the squared generalized distances

\[
d_j^2 = (\mathbf{x}_j-\bar{\mathbf{x}})' \mathbf{S}^{-1} (\mathbf{x}_j-\bar{\mathbf{x}}), \qquad j = 1, 2, \ldots, n
\tag{4-32}
\]

where x₁, x₂, ..., x_n are the sample observations. The procedure we are about to describe is not limited to the bivariate case; it can be used for all p ≥ 2.

When the parent population is multivariate normal and both n and n − p are greater than 25 or 30, each of the squared distances d²₁, d²₂, ..., d²_n should behave like a chi-square random variable. [See Result 4.7 and Equations (4-26) and (4-27).] Although these distances are not independent or exactly chi-square distributed, it is helpful to plot them as if they were. The resulting plot is called a chi-square plot or gamma plot, because the chi-square distribution is a special case of the more general gamma distribution. (See [6].)

To construct the chi-square plot,

1. Order the squared distances in (4-32) from smallest to largest as d²₍₁₎ ≤ d²₍₂₎ ≤ ··· ≤ d²₍ₙ₎.
2. Graph the pairs (q_{c,p}((j − ½)/n), d²₍ⱼ₎), where q_{c,p}((j − ½)/n) is the 100(j − ½)/n quantile of the chi-square distribution with p degrees of freedom.

Quantiles are specified in terms of proportions, whereas percentiles are specified in terms of percentages. The quantiles q_{c,p}((j − ½)/n) are related to the upper percentiles of a chi-squared distribution. In particular, q_{c,p}((j − ½)/n) = χ²_p((n − j + ½)/n). The plot should resemble a straight line through the origin having slope 1. A systematic curved pattern suggests lack of normality. One or two points far above the line indicate large distances, or outlying observations, that merit further attention.
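A minimal sketch (not from the text) of the two construction steps, producing the quantile-distance pairs for hypothetical multivariate data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0], np.eye(3), size=30)  # hypothetical data
n, p = X.shape

# Squared generalized distances (4-32).
dev = X - X.mean(axis=0)
Sinv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', dev, Sinv, dev)

d2_ord = np.sort(d2)                                        # step 1
q = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)   # step 2

# Plot (q, d2_ord); under normality the points hug the 45-degree line.
for pair in zip(q[:5], d2_ord[:5]):
    print(pair)
```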
Example 4.13 (Constructing a chi-square plot) Let us construct a chi-square plot of the generalized distances given in Example 4.12. The ordered distances and the corresponding chi-square percentiles for p = 2 and n = 10 are listed in the following table:

 j   d²₍ⱼ₎   q_{c,2}((j - 1/2)/10)
 1    .30      .10
 2    .62      .33
 3   1.16      .58
 4   1.30      .86
 5   1.61     1.20
 6   1.64     1.60
 7   1.71     2.10
 8   1.79     2.77
 9   3.53     3.79
10   4.38     5.99
Figure 4.7 A chi-square plot of the ordered distances in Example 4.13.
A graph of the pairs (q_{c,2}((j − ½)/10), d²₍ⱼ₎) is shown in Figure 4.7. The points in Figure 4.7 are reasonably straight. Given the small sample size, it is difficult to reject bivariate normality on the evidence in this graph. If further analysis of the data were required, it might be reasonable to transform them to observations more nearly bivariate normal. Appropriate transformations are discussed in Section 4.8. ∎

In addition to inspecting univariate plots and scatter plots, we should check multivariate normality by constructing a chi-squared or d² plot. Figure 4.8 contains d²
Figure 4.8 Chi-square plots for two simulated four-variate normal data sets with n = 30.
plots based on two computer-generated samples of 30 four-variate normal random vectors. As expected, the plots have a straight-line pattern, but the top two or three ordered squared distances are quite variable.

The next example contains a real data set comparable to the simulated data set that produced the plots in Figure 4.8.
Example 4.14 (Evaluating multivariate normality for a four-variable data set) The data in Table 4.3 were obtained by taking four different measures of stiffness, x₁, x₂, x₃, and x₄, of each of n = 30 boards. The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances d²_j = (x_j − x̄)'S^{-1}(x_j − x̄) are also presented in the table.

Table 4.3 Four Measurements of Stiffness

Obs. no.   x1    x2    x3    x4     d²  | Obs. no.   x1    x2    x3    x4     d²
   1      1889  1651  1561  1778    .60 |   16      1954  2149  1180  1281  16.85
   2      2403  2048  2087  2197   5.48 |   17      1325  1170  1002  1176   3.50
   3      2119  1700  1815  2222   7.62 |   18      1419  1371  1252  1308   3.99
   4      1645  1627  1110  1533   5.21 |   19      1828  1634  1602  1755   1.36
   5      1976  1916  1614  1883   1.40 |   20      1725  1594  1313  1646   1.46
   6      1712  1712  1439  1546   2.22 |   21      2276  2189  1547  2111   9.90
   7      1943  1685  1271  1671   4.99 |   22      1899  1614  1422  1477   5.06
   8      2104  1820  1717  1874   1.49 |   23      1633  1513  1290  1516    .80
   9      2983  2794  2412  2581  12.26 |   24      2061  1867  1646  2037   2.54
  10      1745  1600  1384  1508    .77 |   25      1856  1493  1356  1533   4.58
  11      1710  1591  1518  1667   1.93 |   26      1727  1412  1238  1469   3.40
  12      2046  1907  1627  1898    .46 |   27      2168  1896  1701  1834   2.38
  13      1840  1841  1595  1741   2.70 |   28      1655  1675  1414  1597   3.00
  14      1867  1685  1493  1678    .13 |   29      2326  2301  2065  2234   6.28
  15      1859  1649  1389  1714   1.08 |   30      1490  1382  1214  1284   2.58

Source: Data courtesy of William Galligan.
The marginal distributions appear quite normal (see Exercise 4.33), with the possible exception of specimen (board) 9.

To further evaluate multivariate normality, we constructed the chi-square plot shown in Figure 4.9. The two specimens with the largest squared distances are clearly removed from the straight-line pattern. Together, with the next largest point or two, they make the plot appear curved at the upper end. We will return to a discussion of this plot in Example 4.15. ∎

We have discussed some rather simple techniques for checking the multivariate normality assumption. Specifically, we advocate calculating the d²_j, j = 1, 2, ..., n [see Equation (4-32)] and comparing the results with χ² quantiles. For example, p-variate normality is indicated if

1. Roughly half of the d²_j are less than or equal to q_{c,p}(.50).
2. A plot of the ordered squared distances d²₍₁₎ ≤ d²₍₂₎ ≤ ··· ≤ d²₍ₙ₎ versus their chi-square quantiles q_{c,p}((j − ½)/n) is nearly a straight line having slope 1 and that passes through the origin.

Figure 4.9 A chi-square plot for the data in Example 4.14.
(See [6] for a more complete exposition of methods for assessing normality.)

We close this section by noting that all measures of goodness of fit suffer the same serious drawback. When the sample size is small, only the most aberrant behavior will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distribution may be very small and technically unimportant to the inferential conclusions.
4.7 Detecting Outliers and Cleaning Data
Most data sets contain one or a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations. With data on a single characteristic, unusual observations are those that are either very large or very small relative to the others. The situation can be more complicated with multivariate data. Before we address the issue of identifying these outliers, we must emphasize that not all outliers are wrong numbers. They may, justifiably, be part of the group and may lead to a better understanding of the phenomena being studied.
Outliers are best detected visually whenever this is possible. When the number of observations n is large, dot plots are not feasible. When the number of characteristics p is large, the large number of scatter plots, p(p − 1)/2, may prevent viewing them all. Even so, we suggest first visually inspecting the data whenever possible.

What should we look for? For a single random variable, the problem is one dimensional, and we look for observations that are far from the others. For instance, the dot diagram
[Dot diagram: a row of closely spaced points along the x axis, with one circled point far to the right.]
reveals a single large observation which is circled.

In the bivariate case, the situation is more complicated. Figure 4.10 shows a situation with two unusual observations. The data point circled in the upper right corner of the figure is detached from the pattern, and its second coordinate is large relative to the rest of the x₂ measurements, as shown by the vertical dot diagram. The second outlier, also circled, is far from the elliptical pattern of the rest of the points, but, separately, each of its components has a typical value. This outlier cannot be detected by inspecting the marginal dot diagrams.
Figure 4.10 Two outliers; one univariate and one bivariate.
In higher dimensions, there can be outliers that cannot be detected from the univariate plots or even the bivariate scatter plots. Here a large value of (x_j − x̄)'S^{-1}(x_j − x̄) will suggest an unusual observation, even though it cannot be seen visually.

Steps for Detecting Outliers

1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values z_{jk} = (x_{jk} − x̄_k)/√s_{kk} for j = 1, 2, ..., n and each column k = 1, 2, ..., p. Examine these standardized values for large or small values.
4. Calculate the generalized squared distances (x_j − x̄)'S^{-1}(x_j − x̄). Examine these distances for unusually large values. In a chi-square plot, these would be the points farthest from the origin.

In step 3, "large" must be interpreted relative to the sample size and number of variables. There are n × p standardized values. When n = 100 and p = 5, there are 500 values. You expect 1 or 2 of these to exceed 3 or be less than −3, even if the data came from a multivariate distribution that is exactly normal. As a guideline, 3.5 might be considered large for moderate sample sizes.

In step 4, "large" is measured by an appropriate percentile of the chi-square distribution with p degrees of freedom. If the sample size is n = 100, we would expect 5 observations to have values of d²_j that exceed the upper fifth percentile of the chi-square distribution. A more extreme percentile must serve to determine observations that do not fit the pattern of the remaining data.
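Steps 3 and 4 are mechanical. A minimal sketch (not from the text), applied to a hypothetical data matrix X:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
X = rng.normal(1600, 250, size=(30, 4))          # hypothetical stiffness-like data
n, p = X.shape

# Step 3: standardized values z_jk = (x_jk - xbar_k) / sqrt(s_kk).
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.argwhere(np.abs(z) > 3.5))              # candidate wrong numbers

# Step 4: generalized squared distances, flagged against a chi-square percentile.
dev = X - X.mean(axis=0)
d2 = np.einsum('ij,jk,ik->i', dev, np.linalg.inv(np.cov(X, rowvar=False)), dev)
print(np.argwhere(d2 > stats.chi2.ppf(0.995, df=p)))  # multivariate outliers
```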
The data we presented in Table 4.3 concerning lumber have already been cleaned up somewhat. Similar data sets from the same study also contained data on x₅ = tensile strength. Nine observation vectors, out of the total of 112, are given as rows in the following table, along with their standardized values.

  x1    x2    x3    x4    x5   |   z1     z2     z3     z4     z5
 1631  1528  1452  1559  1602  |   .06   -.15    .05    .28   -.12
 1770  1677  1707  1738  1785  |   .64    .43   1.07    .94    .60
 1376  1190   723  1285  2791  | -1.01  -1.47  -2.87   -.73   4.57 (circled)
 1705  1577  1332  1703  1664  |   .37    .04   -.43    .81    .13
 1643  1535  1510  1494  1582  |   .11   -.12    .28    .04   -.20
 1567  1510  1301  1405  1553  |  -.21   -.22   -.56   -.28   -.31
 1528  1591  1714  1685  1698  |  -.38    .10   1.10    .75    .26
 1803  1826  1748  2746  1764  |   .78   1.01   1.23   4.65 (circled)   .52
 1587  1554  1352  1554  1551  |  -.13   -.05   -.35    .26   -.32

The standardized values are based on the sample mean and variance, calculated from all 112 observations. There are two extreme standardized values. Both are too large with standardized values over 4.5. During their investigation, the researchers recorded measurements by hand in a logbook and then performed calculations that produced the values given in the table. When they checked their records regarding the values pinpointed by this analysis, errors were discovered. The value x₅ = 2791 was corrected to 1241, and x₄ = 2746 was corrected to 1670. Incorrect readings on an individual variable are quickly detected by locating a large leading digit for the standardized value.

The next example returns to the data on lumber discussed in Example 4.14.

Example 4.15 (Detecting outliers in the data on lumber) Table 4.4 contains the data in Table 4.3, along with the standardized observations. These data consist of four different measures of stiffness x₁, x₂, x₃, and x₄, on each of n = 30 boards. Recall that the first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The standardized measurements are
\[
z_{jk} = \frac{x_{jk}-\bar{x}_k}{\sqrt{s_{kk}}}, \qquad k = 1, 2, 3, 4; \quad j = 1, 2, \ldots, 30
\]

and the squares of the distances are d²_j = (x_j − x̄)'S^{-1}(x_j − x̄).

Table 4.4 Four Measurements of Stiffness with Standardized Values

Obs. no.   x1    x2    x3    x4  |   z1    z2    z3    z4  |    d²
   1      1889  1651  1561  1778 |  -.1   -.3    .2    .2  |    .60
   2      2403  2048  2087  2197 |  1.5    .9   1.9   1.5  |   5.48
   3      2119  1700  1815  2222 |   .7   -.2   1.0   1.5  |   7.62
   4      1645  1627  1110  1533 |  -.8   -.4  -1.3   -.6  |   5.21
   5      1976  1916  1614  1883 |   .2    .5    .3    .5  |   1.40
   6      1712  1712  1439  1546 |  -.6   -.1   -.2   -.6  |   2.22
   7      1943  1685  1271  1671 |   .1   -.2   -.8   -.2  |   4.99
   8      2104  1820  1717  1874 |   .6    .2    .7    .5  |   1.49
   9      2983  2794  2412  2581 |  3.3   3.3   3.0   2.7  |  12.26
  10      1745  1600  1384  1508 |  -.5   -.5   -.4   -.7  |    .77
  11      1710  1591  1518  1667 |  -.6   -.5    .0   -.2  |   1.93
  12      2046  1907  1627  1898 |   .4    .5    .4    .5  |    .46
  13      1840  1841  1595  1741 |  -.2    .3    .3    .0  |   2.70
  14      1867  1685  1493  1678 |  -.1   -.2   -.1   -.1  |    .13
  15      1859  1649  1389  1714 |  -.1   -.3   -.4    .0  |   1.08
  16      1954  2149  1180  1281 |   .1   1.3  -1.1  -1.4  |  16.85
  17      1325  1170  1002  1176 | -1.8  -1.8  -1.7  -1.7  |   3.50
  18      1419  1371  1252  1308 | -1.5  -1.2   -.8  -1.3  |   3.99
  19      1828  1634  1602  1755 |  -.2   -.4    .3    .1  |   1.36
  20      1725  1594  1313  1646 |  -.6   -.5   -.6   -.2  |   1.46
  21      2276  2189  1547  2111 |  1.1   1.4    .1   1.2  |   9.90
  22      1899  1614  1422  1477 |   .0   -.4   -.3   -.8  |   5.06
  23      1633  1513  1290  1516 |  -.8   -.7   -.7   -.6  |    .80
  24      2061  1867  1646  2037 |   .5    .4    .5   1.0  |   2.54
  25      1856  1493  1356  1533 |  -.2   -.8   -.5   -.6  |   4.58
  26      1727  1412  1238  1469 |  -.6  -1.1   -.9   -.8  |   3.40
  27      2168  1896  1701  1834 |   .8    .5    .6    .3  |   2.38
  28      1655  1675  1414  1597 |  -.8   -.2   -.3   -.4  |   3.00
  29      2326  2301  2065  2234 |  1.3   1.7   1.8   1.6  |   6.28
  30      1490  1382  1214  1284 | -1.3  -1.2  -1.0  -1.4  |   2.58

The last column in Table 4.4 reveals that specimen 16 is a multivariate outlier, since χ²₄(.005) = 14.86; yet all of the individual measurements are well within their respective univariate scatters. Specimen 9 also has a large d² value.

The two specimens (9 and 16) with large squared distances stand out as clearly different from the rest of the pattern in Figure 4.11. Once these two points are removed, the remaining pattern conforms to the expected straight-line relation. Scatter plots for the lumber stiffness measurements are given in Figure 4.11.

Figure 4.11 Scatter plots for the lumber stiffness data with specimens 9 and 16 plotted as solid dots.
The solid dots in these figures correspond to specimens 9 and 16. Although the dot for specimen 16 stands out in all the plots, the dot for specimen 9 is "hidden" in the scatter plot of x₃ versus x₄ and nearly hidden in that of x₁ versus x₃. However, specimen 9 is clearly identified as a multivariate outlier when all four variables are considered. Scientists specializing in the properties of wood conjectured that specimen 9 was unusually clear and therefore very stiff and strong. It would also appear that specimen 16 is a bit unusual, since both of its dynamic measurements are above average and the two static measurements are low. Unfortunately, it was not possible to investigate this specimen further because the material was no longer available. ∎

If outliers are identified, they should be examined for content, as was done in the case of the data on lumber stiffness in Example 4.15. Depending upon the nature of the outliers and the objectives of the investigation, outliers may be deleted or appropriately "weighted" in a subsequent analysis. Even though many statistical techniques assume normal populations, those based on the sample mean vectors usually will not be disturbed by a few moderate outliers. Hawkins [7] gives an extensive treatment of the subject of outliers.

4.8 Transformations to Near Normality

If normality is not a viable assumption, what is the next step? One alternative is to ignore the findings of a normality check and proceed as if the data were normally distributed. This practice is not recommended, since, in many instances, it could lead to incorrect conclusions. A second alternative is to make nonnormal data more "normal looking" by considering transformations of the data. Normal-theory analyses can then be carried out with the suitably transformed data.

Transformations are nothing more than a reexpression of the data in different units. For example, when a histogram of positive observations exhibits a long right-hand tail, transforming the observations by taking their logarithms or square roots will often markedly improve the symmetry about the mean and the approximation to a normal distribution. It frequently happens that the new units provide more natural expressions of the characteristics being studied.

Appropriate transformations are suggested by (1) theoretical considerations or (2) the data themselves (or both). It has been shown theoretically that data that are counts can often be made more normal by taking their square roots. Similarly, the logit transformation applied to proportions and Fisher's z-transformation applied to correlation coefficients yield quantities that are approximately normally distributed.

Helpful Transformations to Near Normality

Original Scale          Transformed Scale
1. Counts, y            √y
2. Proportions, p̂       logit(p̂) = ½ log( p̂ / (1 − p̂) )
3. Correlations, r      Fisher's z(r) = ½ log( (1 + r) / (1 − r) )     (4-33)
In many instances, the choice of a transformation to improve the approximation to normality is not obvious. For such cases, it is convenient to let the data suggest a transformation. A useful family of transformations for this purpose is the family of power transformations.

Power transformations are defined only for positive variables. However, this is not as restrictive as it seems, because a single constant can be added to each observation in the data set if some of the values are negative.

Let x represent an arbitrary observation. The power family of transformations is indexed by a parameter λ. A given value for λ implies a particular transformation. For example, consider x^λ with λ = −1. Since x^{−1} = 1/x, this choice of λ corresponds to the reciprocal transformation. We can trace the family of transformations as λ ranges from negative to positive powers of x. A sequence of possible transformations is

..., x^{−1} = 1/x, x^{−1/2} = 1/√x, x⁰ = ln x, x^{1/4} = ⁴√x, x^{1/2} = √x  (shrinks large values of x);  x², x³, ...  (increases large values of x)

To select a power transformation, an investigator looks at the marginal dot diagram or histogram and decides whether large values have to be "pulled in" or "pushed out" to improve the symmetry about the mean. Trial-and-error calculations with a few of the foregoing transformations should produce an improvement. The final choice should always be examined by a Q-Q plot or other checks to see whether the tentative normal assumption is satisfactory.

The transformations we have been discussing are data based in the sense that it is only the appearance of the data themselves that influences the choice of an appropriate transformation. There are no external considerations involved, although the transformation actually used is often determined by some mix of information supplied by the data and extra-data factors, such as simplicity or ease of interpretation.

A convenient analytical method is available for choosing a power transformation. We begin by focusing our attention on the univariate case. Box and Cox [3] consider the slightly modified family of power transformations

\[
x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda}-1}{\lambda}, & \lambda \ne 0 \\[6pt] \ln x, & \lambda = 0 \end{cases}
\tag{4-34}
\]

which is continuous in λ for x > 0. (See [8].) Given the observations x₁, x₂, ..., x_n, the Box-Cox solution for the choice of an appropriate power λ is the solution that maximizes the expression

\[
\ell(\lambda) = -\frac{n}{2} \ln\!\left[\frac{1}{n}\sum_{j=1}^{n}\left(x_j^{(\lambda)}-\overline{x^{(\lambda)}}\right)^2\right] + (\lambda-1)\sum_{j=1}^{n} \ln x_j
\tag{4-35}
\]

We note that x_j^{(λ)} is defined in (4-34) and

\[
\overline{x^{(\lambda)}} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(\lambda)} = \frac{1}{n}\sum_{j=1}^{n}\left(\frac{x_j^{\lambda}-1}{\lambda}\right)
\tag{4-36}
\]
is the arithmetic average of the transformed observations. The first term in (4-35) is, apart from a constant, the logarithm of a normal likelihood function, after maximizing it with respect to the population mean and variance parameters.

The calculation of ℓ(λ) for many values of λ is an easy task for a computer. It is helpful to have a graph of ℓ(λ) versus λ, as well as a tabular display of the pairs (λ, ℓ(λ)), in order to study the behavior near the maximizing value λ̂. For instance, if either λ = 0 (logarithm) or λ = ½ (square root) is near λ̂, one of these may be preferred because of its simplicity.

Rather than program the calculation of (4-35), some statisticians recommend the equivalent procedure of fixing λ, creating the new variable

\[
y_j^{(\lambda)} = \frac{x_j^{\lambda}-1}{\lambda\left[\left(\prod_{i=1}^{n} x_i\right)^{1/n}\right]^{\lambda-1}}, \qquad j = 1, \ldots, n
\tag{4-37}
\]

and then calculating the sample variance. The minimum of the variance occurs at the same λ that maximizes (4-35).

Comment. It is now understood that the transformation obtained by maximizing ℓ(λ) usually improves the approximation to normality. However, there is no guarantee that even the best choice of λ will produce a transformed set of values that adequately conform to a normal distribution. The outcomes produced by a transformation selected according to (4-35) should always be carefully examined for possible violations of the tentative assumption of normality. This warning applies with equal force to transformations selected by any other technique.

Example 4.16 (Determining a power transformation for univariate data) We gave readings of the microwave radiation emitted through the closed doors of n = 42 ovens in Example 4.10. The Q-Q plot of these data in Figure 4.6 indicates that the observations deviate from what would be expected if they were normally distributed. Since all the observations are positive, let us perform a power transformation of the data which, we hope, will produce results that are more nearly normal. Restricting our attention to the family of transformations in (4-34), we must find that value of λ maximizing the function ℓ(λ) in (4-35).

The pairs (λ, ℓ(λ)) are listed in the following table for several values of λ:

   λ     ℓ(λ)   |    λ     ℓ(λ)
 -1.00   70.52  |   .40   106.20
  -.90   75.65  |   .50   105.50
  -.80   80.46  |   .60   104.43
  -.70   84.94  |   .70   103.03
  -.60   89.06  |   .80   101.33
  -.50   92.79  |   .90    99.34
  -.40   96.10  |  1.00    97.10
  -.30   98.97  |  1.10    94.64
  -.20  101.39  |  1.20    91.96
  -.10  103.35  |  1.30    89.10
   .00  104.83  |  1.40    86.07
   .10  105.84  |  1.50    82.88
   .20  106.39  |
   .30  106.51  |

It is evident from both the table and the plot that a value of λ around .30 maximizes ℓ(λ).
The curve of ℓ(λ) versus λ that allows the more exact determination λ̂ = .28 is shown in Figure 4.12.

Figure 4.12 Plot of ℓ(λ) versus λ for radiation data (door closed).

For convenience, we choose λ = .25. The data x_j were reexpressed as

\[
x_j^{(1/4)} = \frac{x_j^{1/4}-1}{\tfrac{1}{4}}, \qquad j = 1, 2, \ldots, 42
\]

and a Q-Q plot was constructed from the transformed quantities. This plot is shown in Figure 4.13. The quantile pairs fall very close to a straight line, and we would conclude from this evidence that the x_j^{(1/4)} are approximately normal. ∎

Figure 4.13 A Q-Q plot of the transformed radiation data (door closed). (The integers in the plot indicate the number of points occupying the same location.)
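A minimal sketch (not from the text) of the grid search behind Example 4.16, maximizing ℓ(λ) of (4-35) for positive data; the sample values used here are hypothetical.

```python
import numpy as np

def ell(lam, x):
    """Box-Cox profile log-likelihood (4-35) for positive data x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xt = np.log(x) if abs(lam) < 1e-12 else (x**lam - 1) / lam
    return -n / 2 * np.log(np.mean((xt - xt.mean())**2)) \
           + (lam - 1) * np.sum(np.log(x))

rng = np.random.default_rng(5)
x = rng.lognormal(mean=-2.3, sigma=0.8, size=42)   # hypothetical positive data

grid = np.arange(-1.0, 1.51, 0.01)
vals = [ell(l, x) for l in grid]
lam_hat = grid[int(np.argmax(vals))]
print(round(lam_hat, 2))   # data-suggested power; prefer a nearby simple value
```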
Transforming Multivariate Observations

With multivariate observations, a power transformation must be selected for each of the variables. Let λ₁, λ₂, ..., λ_p be the power transformations for the p measured characteristics. Each λ_k can be selected by maximizing

\[
\ell_k(\lambda) = -\frac{n}{2}\ln\!\left[\frac{1}{n}\sum_{j=1}^{n}\left(x_{jk}^{(\lambda)}-\overline{x_k^{(\lambda)}}\right)^2\right] + (\lambda-1)\sum_{j=1}^{n}\ln x_{jk}
\tag{4-38}
\]

where x_{1k}, x_{2k}, ..., x_{nk} are the n observations on the kth variable, k = 1, 2, ..., p. Here

\[
\overline{x_k^{(\lambda)}} = \frac{1}{n}\sum_{j=1}^{n}\left(\frac{x_{jk}^{\lambda}-1}{\lambda}\right)
\tag{4-39}
\]

is the arithmetic average of the transformed observations. The jth transformed multivariate observation is

\[
\mathbf{x}_j^{(\hat{\boldsymbol{\lambda}})} = \begin{bmatrix} \dfrac{x_{j1}^{\hat{\lambda}_1}-1}{\hat{\lambda}_1} \\[8pt] \dfrac{x_{j2}^{\hat{\lambda}_2}-1}{\hat{\lambda}_2} \\ \vdots \\ \dfrac{x_{jp}^{\hat{\lambda}_p}-1}{\hat{\lambda}_p} \end{bmatrix}
\]

where λ̂₁, λ̂₂, ..., λ̂_p are the values that individually maximize (4-38).
The procedure just described is equivalent to making each marginal distribution approximately normal. Although normal marginals are not sufficient to ensure that the joint distribution is normal, in practical applications this may be good enough. If not, we could start with the values λ̂₁, λ̂₂, ..., λ̂_p obtained from the preceding transformations and iterate toward the set of values λ' = [λ₁, λ₂, ..., λ_p], which collectively maximizes

\[
\ell(\lambda_1, \lambda_2, \ldots, \lambda_p) = -\frac{n}{2}\ln|\mathbf{S}(\boldsymbol{\lambda})| + (\lambda_1-1)\sum_{j=1}^{n}\ln x_{j1} + (\lambda_2-1)\sum_{j=1}^{n}\ln x_{j2} + \cdots + (\lambda_p-1)\sum_{j=1}^{n}\ln x_{jp}
\tag{4-40}
\]

where S(λ) is the sample covariance matrix computed from the transformed observations

\[
\mathbf{x}_j^{(\boldsymbol{\lambda})} = \left[ \frac{x_{j1}^{\lambda_1}-1}{\lambda_1},\; \frac{x_{j2}^{\lambda_2}-1}{\lambda_2},\; \ldots,\; \frac{x_{jp}^{\lambda_p}-1}{\lambda_p} \right]', \qquad j = 1, 2, \ldots, n
\]

Maximizing (4-40) not only is substantially more difficult than maximizing the individual expressions in (4-38), but also is unlikely to yield remarkably better results. The selection method based on Equation (4-40) is equivalent to maximizing a multivariate likelihood over μ, Σ and λ₁, λ₂, ..., λ_p, whereas the method based on (4-38) corresponds to maximizing the kth univariate likelihood over μ_k, σ_kk, and λ_k. The latter likelihood is generated by pretending there is some λ_k for which the observations (x_{jk}^{λ_k} − 1)/λ_k, j = 1, 2, ..., n have a normal distribution. See [3] and [2] for detailed discussions of the univariate and multivariate cases, respectively. (Also, see [8].)

Example 4.17 (Determining power transformations for bivariate data) Radiation measurements were also recorded through the open doors of the n = 42 microwave ovens introduced in Example 4.10. The amount of radiation emitted through the open doors of these ovens is listed in Table 4.5.

In accordance with the procedure outlined in Example 4.16, a power transformation for these data was selected by maximizing ℓ(λ) in (4-35). The approximate maximizing value was λ̂ = .30. Figure 4.14 shows Q-Q plots of the untransformed and transformed door-open radiation data. (These data were actually
transformed by taking the fourth root, as in Example 4.16.) It is clear from the figure that the transformed data are more nearly normal, although the normal approximation is not as good as it was for the door-closed data.

Table 4.5 Radiation Data (Door Open)

Oven no.  Radiation | Oven no.  Radiation | Oven no.  Radiation
   1        .30     |   16        .20     |   31        .10
   2        .09     |   17        .04     |   32        .10
   3        .30     |   18        .10     |   33        .10
   4        .10     |   19        .01     |   34        .30
   5        .10     |   20        .60     |   35        .12
   6        .12     |   21        .12     |   36        .25
   7        .09     |   22        .10     |   37        .20
   8        .10     |   23        .05     |   38        .40
   9        .09     |   24        .05     |   39        .33
  10        .10     |   25        .15     |   40        .32
  11        .07     |   26        .30     |   41        .12
  12        .05     |   27        .15     |   42        .12
  13        .01     |   28        .09     |
  14        .45     |   29        .09     |
  15        .12     |   30        .28     |

Source: Data courtesy of J. D. Cryer.

Let us denote the door-closed data by x₁₁, x₂₁, ..., x₄₂,₁ and the door-open data by x₁₂, x₂₂, ..., x₄₂,₂. Choosing a power transformation for each set by maximizing the expression in (4-35) is equivalent to maximizing ℓ_k(λ) in (4-38) with k = 1, 2. Thus, using the outcomes from Example 4.16 and the foregoing results, we have λ̂₁ = .30 and λ̂₂ = .30. These powers were determined for the marginal distributions of x₁ and x₂.

We can consider the joint distribution of x₁ and x₂ and simultaneously determine the pair of powers (λ₁, λ₂) that makes this joint distribution approximately bivariate normal. To do this, we must maximize ℓ(λ₁, λ₂) in (4-40) with respect to both λ₁ and λ₂.

We computed ℓ(λ₁, λ₂) for a grid of λ₁, λ₂ values covering 0 ≤ λ₁ ≤ .50 and 0 ≤ λ₂ ≤ .50, and we constructed the contour plot shown in Figure 4.15. We see that the maximum occurs at about (λ̂₁, λ̂₂) = (.16, .16).

The "best" power transformations for this bivariate case do not differ substantially from those obtained by considering each marginal distribution. ∎

As we saw in Example 4.17, making each marginal distribution approximately normal is roughly equivalent to addressing the bivariate distribution directly and making it approximately normal. It is generally easier to select appropriate transformations for the marginal distributions than for the joint distributions.
Figure 4.14 Q-Q plots of (a) the original and (b) the transformed radiation data (with door open). (The integers in the plot indicate the number of points occupying the same location.)
Figure 4.15 Contour plot of ℓ(λ1, λ2) for the radiation data.

If the data include some large negative values and have a single long tail, a more general transformation (see Yeo and Johnson [14]) should be applied:

x^{(\lambda)} =
  \{(x + 1)^{\lambda} - 1\}/\lambda            x ≥ 0, λ ≠ 0
  \ln(x + 1)                                   x ≥ 0, λ = 0
  -\{(-x + 1)^{2-\lambda} - 1\}/(2 - \lambda)  x < 0, λ ≠ 2
  -\ln(-x + 1)                                 x < 0, λ = 2
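A direct transcription of this four-branch family into code may help readers who wish to experiment with it. The Python sketch below assumes only NumPy; SciPy users could instead call scipy.stats.yeojohnson, which implements the same family.

```python
import numpy as np

def yeo_johnson(x, lam):
    # The transformation of Yeo and Johnson [14] displayed above;
    # unlike the Box-Cox family, it accepts zero and negative values.
    x = np.asarray(x, dtype=float)
    pos = x >= 0
    out = np.empty_like(x)
    if lam != 0:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log(x[pos] + 1.0)
    if lam != 2:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    else:
        out[~pos] = -np.log(-x[~pos] + 1.0)
    return out
```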
Exercises

4.1. Consider a bivariate normal distribution with μ1 = 1, μ2 = 3, σ11 = 2, σ22 = 1 and ρ12 = −.8.
(a) Write out the bivariate normal density.
(b) Write out the squared statistical distance expression (x − μ)'Σ⁻¹(x − μ) as a quadratic function of x1 and x2.

4.2. Consider a bivariate normal population with μ1 = 0, μ2 = 2, σ11 = 2, σ22 = 1, and ρ12 = .5.
(a) Write out the bivariate normal density.
(b) Write out the squared generalized distance expression (x − μ)'Σ⁻¹(x − μ) as a function of x1 and x2.
(c) Determine (and sketch) the constant-density contour that contains 50% of the probability.

4.3. Let X be distributed as N3(μ, Σ) with μ' = [−3, 1, 4] and

\Sigma = \begin{bmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}

Which of the following random variables are independent? Explain.
(a) X1 and X2
(b) X2 and X3
(c) (X1, X2) and X3
(d) (X1 + X2)/2 and X3
(e) X2 and X2 − (5/2)X1 − X3

4.4. Let X be N3(μ, Σ) with μ' = [2, −3, 1] and

\Sigma = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 3 & 2 \\ 1 & 2 & 2 \end{bmatrix}

(a) Find the distribution of 3X1 − 2X2 + X3.
(b) Relabel the variables if necessary, and find a 2 × 1 vector a such that X2 and X2 − a'[X1, X3]' are independent.

4.5. Specify each of the following.
(a) The conditional distribution of X1, given that X2 = x2 for the joint distribution in Exercise 4.2.
(b) The conditional distribution of X2, given that X1 = x1 and X3 = x3 for the joint distribution in Exercise 4.3.
(c) The conditional distribution of X3, given that X1 = x1 and X2 = x2 for the joint distribution in Exercise 4.4.

4.6. Let X be distributed as N3(μ, Σ), where μ' = [1, −1, 2] and

\Sigma = \begin{bmatrix} 4 & 0 & -1 \\ 0 & 5 & 0 \\ -1 & 0 & 2 \end{bmatrix}

Which of the following random variables are independent? Explain.
(a) X1 and X2
(b) X1 and X3
(c) X2 and X3
(d) (X1, X3) and X2
(e) X1 and X1 + 3X2 − 2X3
4.7. Refer to Exercise 4.6 and specify each of the following.
(a) The conditional distribution of X1, given that X3 = x3.
(b) The conditional distribution of X1, given that X2 = x2 and X3 = x3.

4.8. (Example of a nonnormal bivariate distribution with normal marginals.) Let X1 be N(0, 1), and let

X_2 = \begin{cases} -X_1 & \text{if } -1 \le X_1 \le 1 \\ X_1 & \text{otherwise} \end{cases}

Show each of the following.
(a) X2 also has an N(0, 1) distribution.
(b) X1 and X2 do not have a bivariate normal distribution.

Hint:
(a) Since X1 is N(0, 1), P[−1 < X1 ≤ x] = P[−x ≤ X1 < 1] for any x. When −1 < x2 < 1,

P[X2 ≤ x2] = P[X2 ≤ −1] + P[−1 < X2 ≤ x2] = P[X1 ≤ −1] + P[−1 < −X1 ≤ x2] = P[X1 ≤ −1] + P[−x2 ≤ X1 < 1]

But P[−x2 ≤ X1 < 1] = P[−1 < X1 ≤ x2] from the symmetry argument in the first line of this hint. Thus,

P[X2 ≤ x2] = P[X1 ≤ −1] + P[−1 < X1 ≤ x2] = P[X1 ≤ x2]

which is a standard normal probability.
(b) Consider the linear combination X1 − X2, which equals zero with probability P[|X1| > 1] = .3174.

4.9. Refer to Exercise 4.8, but modify the construction by replacing the break point 1 by c so that

X_2 = \begin{cases} -X_1 & \text{if } -c \le X_1 \le c \\ X_1 & \text{elsewhere} \end{cases}

Show that c can be chosen so that Cov(X1, X2) = 0, but that the two random variables are not independent.

Hint: For c = 0, evaluate Cov(X1, X2) = E[X1(X1)]. For c very large, evaluate Cov(X1, X2) ≈ E[X1(−X1)].

4.10. Show each of the following.

(a) \begin{vmatrix} A & 0 \\ 0' & B \end{vmatrix} = |A||B|

(b) \begin{vmatrix} A & C \\ 0' & B \end{vmatrix} = |A||B| for |A| ≠ 0

Hint:
(a) \begin{vmatrix} A & 0 \\ 0' & B \end{vmatrix} = \begin{vmatrix} A & 0 \\ 0' & I \end{vmatrix} \begin{vmatrix} I & 0 \\ 0' & B \end{vmatrix}. Expanding the determinant |I 0; 0' B| by the first row (see Definition 2A.24) gives 1 times a determinant of the same form, with the order of I reduced by one. This procedure is repeated until 1 × |B| is obtained. Similarly, expanding the determinant |A 0; 0' I| by the last row gives |A 0; 0' I| = |A|.
(b) \begin{vmatrix} A & C \\ 0' & B \end{vmatrix} = \begin{vmatrix} A & C \\ 0' & I \end{vmatrix} \begin{vmatrix} I & 0 \\ 0' & B \end{vmatrix}. But expanding the determinant |A C; 0' I| by the last row gives 1 times a determinant of the same form, with the order of I reduced by one. This procedure is repeated until |A| is obtained.

4.11. Show that, if A is square,

|A| = |A_{22}||A_{11} - A_{12}A_{22}^{-1}A_{21}|  for |A_{22}| ≠ 0
|A| = |A_{11}||A_{22} - A_{21}A_{11}^{-1}A_{12}|  for |A_{11}| ≠ 0

Hint: Partition A and verify that

\begin{bmatrix} I & -A_{12}A_{22}^{-1} \\ 0' & I \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{bmatrix} = \begin{bmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ 0' & A_{22} \end{bmatrix}

Take determinants on both sides of this equality. Use Exercise 4.10 for the first and third determinants on the left and for the determinant on the right. The second equality for |A| follows by considering the analogous product with the roles of A11 and A22 interchanged.

4.12. Show that, for A symmetric,

A^{-1} = \begin{bmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{bmatrix} \begin{bmatrix} (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1} & 0 \\ 0' & A_{22}^{-1} \end{bmatrix} \begin{bmatrix} I & -A_{12}A_{22}^{-1} \\ 0' & I \end{bmatrix}

Thus, (A11 − A12A22⁻¹A21)⁻¹ is the upper left-hand block of A⁻¹.

Hint: Premultiply the expression in the hint to Exercise 4.11 by [I −A12A22⁻¹; 0' I]⁻¹ and postmultiply by [I 0; −A22⁻¹A21 I]⁻¹. Take inverses of the resulting expression.

4.13. Show the following if X is Np(μ, Σ) with |Σ| ≠ 0.
(a) Check that |Σ| = |Σ22||Σ11 − Σ12Σ22⁻¹Σ21|. (Note that |Σ| can be factored into the product of contributions from the marginal and conditional distributions.)
(b) Check that

(x - \mu)'\Sigma^{-1}(x - \mu) = [x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)]'(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}[x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)] + (x_2 - \mu_2)'\Sigma_{22}^{-1}(x_2 - \mu_2)

(Thus, the joint density exponent can be written as the sum of two terms corresponding to contributions from the conditional and marginal distributions.)
(c) Given the results in Parts a and b, identify the marginal distribution of X2 and the conditional distribution of X1 | X2 = x2.
Hint:
(a) Apply Exercise 4.11.
(b) Note from Exercise 4.12 that we can write (x − μ)'Σ⁻¹(x − μ) as

\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}' \begin{bmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{bmatrix} \begin{bmatrix} (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1} & 0 \\ 0' & \Sigma_{22}^{-1} \end{bmatrix} \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0' & I \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}

If we group the product so that

\begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0' & I \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} = \begin{bmatrix} x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) \\ x_2 - \mu_2 \end{bmatrix}

the result follows.

4.14. Show that \sum_{j=1}^{n} (x_j - \bar{x})(\bar{x} - \mu)' and \sum_{j=1}^{n} (\bar{x} - \mu)(x_j - \bar{x})' are both p × p matrices of zeros. Here x_j' = [x_{j1}, x_{j2}, \ldots, x_{jp}], j = 1, 2, ..., n, and \bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j.

4.15. If X is distributed as Np(μ, Σ) with |Σ| ≠ 0, show that the joint density can be written as the product of marginal densities for X1 (q × 1) and X2 ((p − q) × 1) if Σ12 = 0 (q × (p − q)).

Hint: Show by block multiplication that [Σ11⁻¹ 0; 0' Σ22⁻¹] is the inverse of Σ = [Σ11 0; 0' Σ22]. Then write

(x - \mu)'\Sigma^{-1}(x - \mu) = [(x_1 - \mu_1)', (x_2 - \mu_2)'] \begin{bmatrix} \Sigma_{11}^{-1} & 0 \\ 0' & \Sigma_{22}^{-1} \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} = (x_1 - \mu_1)'\Sigma_{11}^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)'\Sigma_{22}^{-1}(x_2 - \mu_2)

Note that |Σ| = |Σ11||Σ22| from Exercise 4.10(a). Now factor the joint density.

4.16. Let X1, X2, X3, and X4 be independent Np(μ, Σ) random vectors.
(a) Find the marginal distributions for each of the random vectors

V1 = (1/4)X1 − (1/4)X2 + (1/4)X3 − (1/4)X4  and  V2 = (1/4)X1 + (1/4)X2 − (1/4)X3 − (1/4)X4

(b) Find the joint density of the random vectors V1 and V2 defined in (a).

4.17. Let X1, X2, X3, X4, and X5 be independent and identically distributed random vectors with mean vector μ and covariance matrix Σ. Find the mean vector and covariance matrices for each of the two linear combinations of random vectors

(1/5)X1 + (1/5)X2 + (1/5)X3 + (1/5)X4 + (1/5)X5

and

X1 − X2 + X3 − X4 + X5

in terms of μ and Σ. Also, obtain the covariance between the two linear combinations of random vectors.
4.18. Find the maximum likelihood estimates of the 2 × 1 mean vector μ and the 2 × 2 covariance matrix Σ based on the random sample

X = \begin{bmatrix} 3 & 6 \\ 4 & 4 \\ 5 & 7 \\ 4 & 7 \end{bmatrix}

from a bivariate normal population.

4.19. Let X1, X2, ..., X20 be a random sample of size n = 20 from an N6(μ, Σ) population. Specify each of the following completely.
(a) The distribution of (X1 − μ)'Σ⁻¹(X1 − μ)
(b) The distributions of X̄ and √n(X̄ − μ)
(c) The distribution of (n − 1)S

4.20. For the random variables X1, X2, ..., X20 in Exercise 4.19, specify the distribution of B(19S)B' in each case.

(a) B = \begin{bmatrix} 1 & -\tfrac12 & -\tfrac12 & 0 & 0 & 0 \\ 0 & 0 & 0 & \tfrac12 & \tfrac12 & -1 \end{bmatrix}

(b) B = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}

4.21. Let X1, ..., X60 be a random sample of size 60 from a four-variate normal distribution having mean μ and covariance Σ. Specify each of the following completely.
(a) The distribution of X̄
(b) The distribution of (X1 − μ)'Σ⁻¹(X1 − μ)
(c) The distribution of n(X̄ − μ)'Σ⁻¹(X̄ − μ)
(d) The approximate distribution of n(X̄ − μ)'S⁻¹(X̄ − μ)

4.22. Let X1, X2, ..., X75 be a random sample from a population distribution with mean μ and covariance matrix Σ. What is the approximate distribution of each of the following?
(a) X̄
(b) n(X̄ − μ)'S⁻¹(X̄ − μ)

4.23. Consider the annual rates of return (including dividends) on the Dow-Jones industrial average for the years 1996-2005. These data, multiplied by 100, are

−0.6  3.1  25.3  −16.8  −7.1  −6.2  25.2  22.6  26.0  16.1

Use these 10 observations to complete the following.
(a) Construct a Q-Q plot. Do the data seem to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient rQ. [See (4-31).] Let the significance level be α = .10.

4.24. Exercise 1.4 contains data on three variables for the world's 10 largest companies as of April 2005. For the sales (x1) and profits (x2) data:
(a) Construct Q-Q plots. Do these data appear to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient rQ. [See (4-31).] Set the significance level at α = .10. Do the results of these tests corroborate the results in Part a?
4.25. Refer to the data for the world's 10 largest companies in Exercise 1.4. Construct a chi-square plot using all three variables. The chi-square quantiles are

0.3518  0.7978  1.2125  1.6416  2.1095  2.6430  3.2831  4.1083  5.3170  7.8147

4.26. Exercise 1.2 gives the age x1, measured in years, as well as the selling price x2, measured in thousands of dollars, for n = 10 used cars. These data are reproduced as follows:

x1 |     1      2      3      3      4      5     6     8     9    11
x2 | 18.95  19.00  17.95  15.54  14.00  12.95  8.94  7.49  6.00  3.99

(a) Use the results of Exercise 1.2 to calculate the squared statistical distances (x_j − x̄)'S⁻¹(x_j − x̄), j = 1, 2, ..., 10, where x_j' = [x_{j1}, x_{j2}].
(b) Using the distances in Part a, determine the proportion of the observations falling within the estimated 50% probability contour of a bivariate normal distribution.
(c) Order the distances in Part a and construct a chi-square plot.
(d) Given the results in Parts b and c, are these data approximately bivariate normal? Explain.

The following exercises may require a computer.

4.27. Consider the radiation data (with door closed) in Example 4.10. Construct a Q-Q plot for the natural logarithms of these data. [Note that the natural logarithm transformation corresponds to the value λ = 0 in (4-34).] Do the natural logarithms appear to be normally distributed? Compare your results with Figure 4.13. Does the choice λ = 1/4 or λ = 0 make much difference in this case?

4.28. Consider the air-pollution data given in Table 1.5. Construct a Q-Q plot for the solar radiation measurements and carry out a test for normality based on the correlation coefficient rQ [see (4-31)]. Let α = .05 and use the entry corresponding to n = 40 in Table 4.2.

4.29. Given the air-pollution data in Table 1.5, examine the pairs X5 = NO2 and X6 = O3 for bivariate normality.
(a) Calculate statistical distances (x_j − x̄)'S⁻¹(x_j − x̄), j = 1, 2, ..., 42, where x_j' = [x_{j5}, x_{j6}].
(b) Determine the proportion of observations x_j', j = 1, 2, ..., 42, falling within the approximate 50% probability contour of a bivariate normal distribution.
(c) Construct a chi-square plot of the ordered distances in Part a.

4.30. Consider the used-car data in Exercise 4.26.
(a) Determine the power transformation λ̂1 that makes the x1 values approximately normal. Construct a Q-Q plot for the transformed data.
(b) Determine the power transformation λ̂2 that makes the x2 values approximately normal. Construct a Q-Q plot for the transformed data.
(c) Determine the power transformations λ̂' = [λ̂1, λ̂2] that make the [x1, x2] values jointly normal using (4-40). Compare the results with those obtained in Parts a and b.
4.31. Examine the marginal normality of the observations on variables X1, X2, ..., X5 for the multiple-sclerosis data in Table 1.6. Treat the non-multiple-sclerosis and multiple-sclerosis groups separately. Use whatever methodology, including transformations, you feel is appropriate.

4.32. Examine the marginal normality of the observations on variables X1, X2, ..., X6 for the radiotherapy data in Table 1.7. Use whatever methodology, including transformations, you feel is appropriate.

4.33. Examine the marginal and bivariate normality of the observations on variables X1, X2, X3, and X4 for the data in Table 4.3.

4.34. Examine the data on bone mineral content in Table 1.8 for marginal and bivariate normality.

4.35. Examine the data on paper-quality measurements in Table 1.2 for marginal and multivariate normality.

4.36. Examine the data on women's national track records in Table 1.9 for marginal and multivariate normality.

4.37. Refer to Exercise 1.18. Convert the women's track records in Table 1.9 to speeds measured in meters per second. Examine the data on speeds for marginal and multivariate normality.

4.38. Examine the data on bulls in Table 1.10 for marginal and multivariate normality. Consider only the variables YrHgt, FtFrBody, PrctFFB, BkFat, SaleHt, and SaleWt.

4.39. The data in Table 4.6 (see the psychological profile data: www.prenhall.com/statistics) consist of 130 observations generated by scores on a psychological test administered to Peruvian teenagers (ages 15, 16, and 17). For each of these teenagers the gender (male = 1, female = 2) and socioeconomic status (low = 1, medium = 2) were also recorded. The scores were accumulated into five subscale scores labeled independence (indep), support (supp), benevolence (benev), conformity (conform), and leadership (leader).
(a) Examine each of the variables independence, support, benevolence, conformity and leadership for marginal normality.
(b) Using all five variables, check for multivariate normality.
(c) Refer to part (a). For those variables that are nonnormal, determine the transformation that makes them more nearly normal.

Table 4.6 Psychological Profile Data

Indep  Supp  Benev  Conform  Leader  Gender  Socio
  ⋮     ⋮     ⋮       ⋮        ⋮       ⋮       ⋮
(The table lists the first and last few of the 130 observations; the full data set is available at the website cited above.)

Source: Data courtesy of C. Soto.
" Journal of the American Statistical Association. and D.41. 3. T. R.).1980. M. no. Methods for Statistical Data Analysis of Multivariate Observations (2nd ed. Fitting Equations to Data: Computer Analysis of Multifactor Data.no.e the power transformation for approximate bivariate normality (440). Consider the data on snow removal in Exercise 3. Multivariate Analysis (Paperback). normal. 9. 855861. Construct a QQ plot of the transformed observations. 372 (1980). 2 (1964). . "An Analysis of Transformations" (with discussion). and J. S.39.1 (1975). V. W. R. T. Anderson. (a) Comment on any possible outliers in a scatter plot of the original variables. 8. R Gulledge.. Gnanadesikan. New York: WileyInterscience. "The LargeSample Behavior of Transformations to Normality. Construct a QQ plot of the transformed observations. Hogg. J. Filliben. Andrews. F. 27. K. Shapiro. Hernandez.20. Bibby. 825840. N. S. T.. and J. D. W.40. "An Analysis of Variance Test for Normality (Complete Samples). Jr. 111117. Hawkins. 7579. New York: John Wiley. Box. and F. An Introduction to Multivariate Statistical Analysis (3rd ed. Warner. Cox. Gnanadesikan. S. Mckean Introduction to Mathematical Statistics (6th ed. 211252. and T. 26. B. Construct a QQ plot of the transformed observations.208 Chapter 4 The Multivariate Normal Distribution 4. S. 591611. 4. Wood. 1980. 1977. using~ 4. "Transformations of Multivariate Data. and J. :_ (c) Determine the power transformation Az the makes the x 2 values approximately normal.27. J.no. 4 (1971). 5. (a) Comment on any possible outliers in a scatter plot of the original variables. Journal oft he Royal Statistical Society (B).: Prentice Hall. 2003. E. A. (b) Determine the power transformation A the makes the x 1 values approximately. C. L. P.Identification of Outliers. "The Probability Plot Correlation Coefficient Test for Normality.. New York: John Wiley. G. UK: Chapman and Hall.J.1 (1985). Wilk. _ (c) Determinethe power transformation Az the makes the x2 values approximately·' normal." The American Statistician." Technometrics. "Use of the Correlation Coefficient with Normal Probability Plots. Mardia. (d) Determin. 2. London: Academic Press. 7. 52. D. . V. Daniel. W. Construct a QQ plot of the transformed observations. References 1. 11. Looney. and R. 4 (1965). 2003. F. Consider the data on national parks in Exercise 1. J.: 1 normal.17. no. no. 6.. London. Kent. R. 75." Biometrics. M.). R. Upper Saddle River.." Biometrika. 12. (d) Determine the power transformation for approximate bivariate normality using_ (4·40). no. 2004. A Johnson.). (b) Determine the power transformation A1 the makes the x 1 values approximately·. 10.. Craig. and M..
13. Verrill, S., and R. A. Johnson. "Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality." Journal of the American Statistical Association, 83, no. 404 (1988), 1192-1197.
14. Yeo, I., and R. A. Johnson. "A New Family of Power Transformations to Improve Normality or Symmetry." Biometrika, 87, no. 4 (2000), 954-959.
15. Zehna, P. "Invariance of Maximum Likelihood Estimators." Annals of Mathematical Statistics, 37, no. 3 (1966), 744.
Chapter 5

INFERENCES ABOUT A MEAN VECTOR

5.1 Introduction

This chapter is the first of the methodological sections of the book. We shall now use the concepts and results set forth in Chapters 1 through 4 to develop techniques for analyzing data. A large part of any analysis is concerned with inference: that is, reaching valid conclusions concerning a population on the basis of information from a sample.

At this point, we shall concentrate on inferences about a population mean vector and its component parts. Although we introduce statistical inference through initial discussions of tests of hypotheses, our ultimate aim is to present a full statistical analysis of the component means based on simultaneous confidence statements. One of the central messages of multivariate analysis is that p correlated variables must be analyzed jointly. This principle is exemplified by the methods presented in this chapter.

5.2 The Plausibility of μ0 as a Value for a Normal Population Mean

Let us start by recalling the univariate theory for determining whether a specific value μ0 is a plausible value for the population mean μ. From the point of view of hypothesis testing, this problem can be formulated as a test of the competing hypotheses

H0: μ = μ0  and  H1: μ ≠ μ0

Here H0 is the null hypothesis and H1 is the (two-sided) alternative hypothesis. If X1, X2, ..., Xn denote a random sample from a normal population, the appropriate test statistic is

t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}},  where \bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j and s^2 = \frac{1}{n-1}\sum_{j=1}^{n} (X_j - \bar{X})^2
This test statistic has a student's t-distribution with n − 1 degrees of freedom (d.f.). We reject H0, that μ0 is a plausible value of μ, if the observed |t| exceeds a specified percentage point of a t-distribution with n − 1 d.f. Rejecting H0 when |t| is large is equivalent to rejecting H0 if its square,

t^2 = \frac{(\bar{X} - \mu_0)^2}{s^2/n} = n(\bar{X} - \mu_0)(s^2)^{-1}(\bar{X} - \mu_0)    (5-1)

is large. The variable t² in (5-1) is the square of the distance from the sample mean X̄ to the test value μ0. The units of distance are expressed in terms of s/√n, or estimated standard deviations of X̄. Once X̄ and s² are observed, the test becomes: Reject H0 in favor of H1, at significance level α, if

n(\bar{X} - \mu_0)(s^2)^{-1}(\bar{X} - \mu_0) > t_{n-1}^2(\alpha/2)    (5-2)

where t_{n−1}(α/2) denotes the upper 100(α/2)th percentile of the t-distribution with n − 1 d.f.

If H0 is not rejected, we conclude that μ0 is a plausible value for the normal population mean. Are there other values of μ which are also consistent with the data? The answer is yes! In fact, there is always a set of plausible values for a normal population mean. From the well-known correspondence between acceptance regions for tests of H0: μ = μ0 versus H1: μ ≠ μ0 and confidence intervals for μ, we have

{Do not reject H0: μ = μ0 at level α}  is equivalent to  {μ0 lies in the 100(1 − α)% confidence interval x̄ ± t_{n−1}(α/2) s/√n}    (5-3)

or

\bar{x} - t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \le \mu_0 \le \bar{x} + t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}}

The confidence interval consists of all those values μ0 that would not be rejected by the level α test of H0: μ = μ0.

Before the sample is selected, the 100(1 − α)% confidence interval in (5-3) is a random interval because the endpoints depend upon the random variables X̄ and s. The probability that the interval contains μ is 1 − α; among large numbers of such independent intervals, approximately 100(1 − α)% of them will contain μ.

Consider now the problem of determining whether a given p × 1 vector μ0 is a plausible value for the mean of a multivariate normal distribution. We shall proceed by analogy to the univariate development just presented. A natural generalization of the squared distance in (5-1) is its multivariate analog
X)(Xi. we described the manner in which the Wishart distribution generalizes the chisquare distribution.f. = P.o versus ILo· At the a level of significance. Statement (56) leads immediately to a test of the hypothesis H 0 : p.n  c X  P.np distribution.X)' 1 .o = (p><l) 1 n n .o ) > (n.) S (X.np 1stn (n (55)  where Fp. I) population.~ Xi and S = ( _ ) ~ (Xi .np(a) _ [ 2  1 ll 1 ll  _.Z12 Chapter 5 Inferences about a Mean Vector where (p><l) 1 n X =~X· 1 n~ ' (p><p) S =~~(Xi.. X. 1'pO 1)pF T Z·IS d' 'bute d as (n _ p) p. )1 p. who first obtained its sampling distribution.X).P n(X.0the hypothesis H 0 : p. we reject H 0 in favor of H 1 if the H 1: p. To summarize.p. = p.) > (n _ p) Fp.X)(X1 .1)p Fp. X 2 .) If the observed statistical distance T 2 is too largethat is. .1)p .o )'sIcX  P. In Section 4. We can write n ~ (Xi  T2 = vn (X  X)(X1 . be a random sample from an Np(l£. a pioneer in ~ multivariate analysis. : .1)p J J (56) _ [ .p. observed * T z. 0 is rejected. Here Fp.n. p. and I.X). (See Result 3. _1 (n. if i is "too far" from.p d. This is true because l 1'20 l'10] .ile of the Fp.p( a ) (n _ P) (57) It is informative to discuss the nature of the T2distribution briefly and its correspondence with the univariate test statistic. . 1 i=l n J=l n a . Here ( 1/n )Sis the estimated covariance matrix of X.np(a) whatever the true p. and P. we have the following: Let X1.1. Then with X = . It turns out that special tables of T 2 percentage points are not required for formal tests of hypotheses.P T > (n _ p) Fp.1 i=l   I The statistic T 2 is called Hotelling's T 2 in honor of Harold Hotelling.np(a) is the upper (100a)th percent. 0 )' ( i=l n _ vn (X  P.o) .np denotes a random variable with an Fdistribution with p and n . (n ..4.
It is informative to discuss the nature of the T²-distribution briefly and its correspondence with the univariate test statistic. In Section 4.4, we described the manner in which the Wishart distribution generalizes the chi-square distribution. We can write

T^2 = \sqrt{n}(\bar{X} - \mu_0)' \left( \frac{1}{n-1}\sum_{j=1}^{n} (X_j - \bar{X})(X_j - \bar{X})' \right)^{-1} \sqrt{n}(\bar{X} - \mu_0)

which combines a normal, Np(0, Σ), random vector and a Wishart, W_{p,n−1}(Σ), random matrix in the form

T^2_{p,n-1} = (multivariate normal)' \left( \frac{Wishart\ random\ matrix}{d.f.} \right)^{-1} (multivariate normal) = N_p(0, \Sigma)' \left[ \frac{W_{p,n-1}(\Sigma)}{n-1} \right]^{-1} N_p(0, \Sigma)    (5-8)

This is analogous to

t^2_{n-1} = (normal\ random\ variable) \left( \frac{(scaled)\ chi\text{-}square\ random\ variable}{d.f.} \right)^{-1} (normal\ random\ variable)

for the univariate case. Since the multivariate normal and Wishart random variables are independently distributed [see (4-23)], their joint density function is the product of the marginal normal and Wishart distributions. Using calculus, the distribution (5-5) of T² as given previously can be derived from this joint distribution and the representation (5-8).

It is rare, in multivariate situations, to be content with a test of H0: μ = μ0, where all of the mean vector components are specified under the null hypothesis. Ordinarily, it is preferable to find regions of μ values that are plausible in light of the observed data. We shall return to this issue in Section 5.4.

Example 5.1 (Evaluating T²) Let the data matrix for a random sample of size n = 3 from a bivariate normal population be

X = \begin{bmatrix} 6 & 9 \\ 10 & 6 \\ 8 & 3 \end{bmatrix}

Evaluate the observed T² for μ0' = [9, 5]. What is the sampling distribution of T² in this case?

We find

\bar{x} = \begin{bmatrix} (6 + 10 + 8)/3 \\ (9 + 6 + 3)/3 \end{bmatrix} = \begin{bmatrix} 8 \\ 6 \end{bmatrix}

and

s_{11} = \frac{(6-8)^2 + (10-8)^2 + (8-8)^2}{2} = 4
s_{12} = \frac{(6-8)(9-6) + (10-8)(6-6) + (8-8)(3-6)}{2} = -3
s_{22} = \frac{(9-6)^2 + (6-6)^2 + (3-6)^2}{2} = 9
so

S = \begin{bmatrix} 4 & -3 \\ -3 & 9 \end{bmatrix}

Thus,

S^{-1} = \frac{1}{(4)(9) - (-3)(-3)} \begin{bmatrix} 9 & 3 \\ 3 & 4 \end{bmatrix} = \frac{1}{27} \begin{bmatrix} 9 & 3 \\ 3 & 4 \end{bmatrix}

and, from (5-4),

T^2 = 3[8 - 9, 6 - 5] \frac{1}{27}\begin{bmatrix} 9 & 3 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 8 - 9 \\ 6 - 5 \end{bmatrix} = 3[-1, 1]\frac{1}{27}\begin{bmatrix} -6 \\ 1 \end{bmatrix} = \frac{7}{9}

Before the sample is selected, T² has the distribution of a ((3 − 1)·2/(3 − 2)) F_{2,3−2} = 4F_{2,1} random variable. ■

The next example illustrates a test of the hypothesis H0: μ = μ0 using data collected as part of a search for new diagnostic techniques at the University of Wisconsin Medical School.

Example 5.2 (Testing a multivariate mean vector with T²) Perspiration from 20 healthy females was analyzed. Three components, X1 = sweat rate, X2 = sodium content, and X3 = potassium content, were measured, and the results, which we call the sweat data, are presented in Table 5.1.

Test the hypothesis H0: μ' = [4, 50, 10] against H1: μ' ≠ [4, 50, 10] at level of significance α = .10.

Computer calculations provide

\bar{x} = \begin{bmatrix} 4.640 \\ 45.400 \\ 9.965 \end{bmatrix},  S = \begin{bmatrix} 2.879 & 10.010 & -1.810 \\ 10.010 & 199.788 & -5.640 \\ -1.810 & -5.640 & 3.628 \end{bmatrix}

and

S^{-1} = \begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix}

We evaluate

T^2 = 20[4.640 - 4, 45.400 - 50, 9.965 - 10] \begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix} \begin{bmatrix} .640 \\ -4.600 \\ -.035 \end{bmatrix} = 20[.640, -4.600, -.035]\begin{bmatrix} .467 \\ -.042 \\ .160 \end{bmatrix} = 9.74
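The same arithmetic can be verified from the printed summary statistics alone, since T² depends on the data only through x̄ and S. A minimal sketch, again assuming NumPy and SciPy:

```python
import numpy as np
from scipy.stats import f

n, p, alpha = 20, 3, 0.10
xbar = np.array([4.640, 45.400, 9.965])
S = np.array([[ 2.879,  10.010, -1.810],
              [10.010, 199.788, -5.640],
              [-1.810,  -5.640,  3.628]])
mu0 = np.array([4.0, 50.0, 10.0])
d = xbar - mu0
T2 = n * d @ np.linalg.solve(S, d)                          # about 9.74
crit = (n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p)   # about 8.18
reject = T2 > crit                                          # True: reject H0
```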
Table 5.1 Sweat Data

Individual  X1 (Sweat rate)  X2 (Sodium)  X3 (Potassium)
 1   3.7   48.5    9.3
 2   5.7   65.1    8.0
 3   3.8   47.2   10.9
 4   3.2   53.2   12.0
 5   3.1   55.5    9.7
 6   4.6   36.1    7.9
 7   2.4   24.8   14.0
 8   7.2   33.1    7.6
 9   6.7   47.4    8.5
10   5.4   54.1   11.3
11   3.9   36.9   12.7
12   4.5   58.8   12.3
13   3.5   27.8    9.8
14   4.5   40.2    8.4
15   1.5   13.5   10.1
16   8.5   56.4    7.1
17   4.5   71.6    8.2
18   6.5   52.8   10.9
19   4.1   44.1   11.2
20   5.5   40.9    9.4

Source: Courtesy of Dr. Gerald Bargman.

Comparing the observed T² = 9.74 with the critical value

\frac{(n-1)p}{n-p} F_{p,n-p}(.10) = \frac{19(3)}{17} F_{3,17}(.10) = 3.353(2.44) = 8.18

we see that T² = 9.74 > 8.18, and consequently, we reject H0 at the 10% level of significance.

We note that H0 will be rejected if one or more of the component means, or some combination of means, differs too much from the hypothesized values [4, 50, 10]. At this point, we have no idea which of these hypothesized values may not be supported by the data.

We have assumed that the sweat data are multivariate normal. The Q-Q plots constructed from the marginal distributions of X1, X2, and X3 all approximate straight lines. Moreover, scatter plots for pairs of observations have approximate elliptical shapes, and we conclude that the normality assumption was reasonable in this case. ■

One feature of the T²-statistic is that it is invariant (unchanged) under changes in the units of measurements for X of the form

Y = CX + d,  C nonsingular    (5-9)
A transformation of the observations of this kind arises when a constant bi is subtracted from the ith variable to form Xi − bi and the result is multiplied by a constant ai > 0 to get ai(Xi − bi). Premultiplication of the centered and scaled quantities ai(Xi − bi) by any nonsingular matrix will yield Equation (5-9). As an example, the operations involved in changing Xi to ai(Xi − bi) correspond exactly to the process of converting temperature from a Fahrenheit to a Celsius reading.

Given observations x1, x2, ..., xn and the transformation in (5-9), it immediately follows from Result 3.6 that

\bar{y} = C\bar{x} + d  and  S_y = \frac{1}{n-1}\sum_{j=1}^{n} (y_j - \bar{y})(y_j - \bar{y})' = CSC'

Moreover, by (2-24) and (2-45),

\mu_Y = E(Y) = E(CX + d) = E(CX) + E(d) = C\mu + d

Therefore, T² computed with the y's and a hypothesized value μ_{Y,0} = Cμ0 + d is

T^2 = n(\bar{y} - \mu_{Y,0})' S_y^{-1} (\bar{y} - \mu_{Y,0}) = n(C(\bar{x} - \mu_0))' (CSC')^{-1} (C(\bar{x} - \mu_0))
    = n(\bar{x} - \mu_0)' C'(C')^{-1} S^{-1} C^{-1} C(\bar{x} - \mu_0) = n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0)

The last expression is recognized as the value of T² computed with the x's.

5.3 Hotelling's T² and Likelihood Ratio Tests

We introduced the T²-statistic by analogy with the univariate squared distance t². There is a general principle for constructing test procedures called the likelihood ratio method, and the T²-statistic can be derived as the likelihood ratio test of H0: μ = μ0. The general theory of likelihood ratio tests is beyond the scope of this book. (See [3] for a treatment of the topic.) Likelihood ratio tests have several optimal properties for reasonably large samples, and they are particularly convenient for hypotheses formulated in terms of multivariate normal parameters.

We know from (4-18) that the maximum of the multivariate normal likelihood as μ and Σ are varied over their possible values is given by

\max_{\mu,\Sigma} L(\mu, \Sigma) = \frac{1}{(2\pi)^{np/2} |\hat{\Sigma}|^{n/2}} e^{-np/2}    (5-10)

where

\hat{\Sigma} = \frac{1}{n}\sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})'  and  \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j

are the maximum likelihood estimates. Recall that μ̂ and Σ̂ are those choices for μ and Σ that best explain the observed values of the random sample.
Under the hypothesis H0: μ = μ0, the normal likelihood specializes to

L(\mu_0, \Sigma) = \frac{1}{(2\pi)^{np/2} |\Sigma|^{n/2}} \exp\left( -\frac{1}{2}\sum_{j=1}^{n} (x_j - \mu_0)'\Sigma^{-1}(x_j - \mu_0) \right)

The mean μ0 is now fixed, but Σ can be varied to find the value that is "most likely" to have led, with μ0 fixed, to the observed sample. This value is obtained by maximizing L(μ0, Σ) with respect to Σ.

Following the steps in (4-13), the exponent in L(μ0, Σ) may be written as

-\frac{1}{2}\sum_{j=1}^{n} (x_j - \mu_0)'\Sigma^{-1}(x_j - \mu_0) = -\frac{1}{2}\sum_{j=1}^{n} \operatorname{tr}[\Sigma^{-1}(x_j - \mu_0)(x_j - \mu_0)'] = -\frac{1}{2}\operatorname{tr}\left[\Sigma^{-1}\left(\sum_{j=1}^{n} (x_j - \mu_0)(x_j - \mu_0)'\right)\right]

Applying Result 4.10 with B = \sum_{j=1}^{n} (x_j - \mu_0)(x_j - \mu_0)' and b = n/2, we have

\max_{\Sigma} L(\mu_0, \Sigma) = \frac{1}{(2\pi)^{np/2} |\hat{\Sigma}_0|^{n/2}} e^{-np/2}    (5-11)

with

\hat{\Sigma}_0 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \mu_0)(x_j - \mu_0)'

To determine whether μ0 is a plausible value of μ, the maximum of L(μ0, Σ) is compared with the unrestricted maximum of L(μ, Σ). The resulting ratio is called the likelihood ratio statistic. Using Equations (5-10) and (5-11), we get

Likelihood ratio = \Lambda = \frac{\max_{\Sigma} L(\mu_0, \Sigma)}{\max_{\mu,\Sigma} L(\mu, \Sigma)} = \left( \frac{|\hat{\Sigma}|}{|\hat{\Sigma}_0|} \right)^{n/2}    (5-12)

The equivalent statistic Λ^{2/n} = |Σ̂|/|Σ̂0| is called Wilks' lambda. If the observed value of this likelihood ratio is too small, the hypothesis H0: μ = μ0 is unlikely to be true and is, therefore, rejected. Specifically, the likelihood ratio test of H0: μ = μ0 against H1: μ ≠ μ0 rejects H0 if

\Lambda = \left( \frac{|\hat{\Sigma}|}{|\hat{\Sigma}_0|} \right)^{n/2} < c_\alpha    (5-13)

where c_α is the lower (100α)th percentile of the distribution of Λ. (Note that the likelihood ratio test statistic is a power of the ratio of generalized variances.) Fortunately, because of the following relation between T² and Λ, we do not need the distribution of the latter to carry out the test.
Result 5.1. Let X1, X2, ..., Xn be a random sample from an Np(μ, Σ) population. Then the test in (5-7) based on T² is equivalent to the likelihood ratio test of H0: μ = μ0 versus H1: μ ≠ μ0 because

\Lambda^{2/n} = \left( 1 + \frac{T^2}{n-1} \right)^{-1}    (5-14)

Proof. Let the (p + 1) × (p + 1) matrix

A = \begin{bmatrix} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' & \sqrt{n}(\bar{x} - \mu_0) \\ \sqrt{n}(\bar{x} - \mu_0)' & -1 \end{bmatrix}

By Exercise 4.11, |A| = |A22||A11 − A12A22⁻¹A21| = |A11||A22 − A21A11⁻¹A12|, from which we obtain

(-1)\left| \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)' \right| = \left| \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' \right| \left( -1 - n(\bar{x} - \mu_0)'\left( \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' \right)^{-1}(\bar{x} - \mu_0) \right)

Since, by (4-14),

\sum_{j=1}^{n} (x_j - \mu_0)(x_j - \mu_0)' = \sum_{j=1}^{n} (x_j - \bar{x} + \bar{x} - \mu_0)(x_j - \bar{x} + \bar{x} - \mu_0)' = \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' + n(\bar{x} - \mu_0)(\bar{x} - \mu_0)'

the foregoing equality involving determinants can be written

|n\hat{\Sigma}_0| = |n\hat{\Sigma}| \left( 1 + \frac{T^2}{n-1} \right)

Thus,

\Lambda^{2/n} = \frac{|\hat{\Sigma}|}{|\hat{\Sigma}_0|} = \left( 1 + \frac{T^2}{n-1} \right)^{-1}

Here H0 is rejected for small values of Λ^{2/n} or, equivalently, large values of T². The critical values of T² are determined by (5-6). ■
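Relation (5-14) is easy to confirm numerically. The sketch below, assuming NumPy, redoes Example 5.1 through the determinants in (5-12):

```python
import numpy as np

X = np.array([[6.0, 9.0], [10.0, 6.0], [8.0, 3.0]])   # data of Example 5.1
mu0 = np.array([9.0, 5.0])
n = X.shape[0]
xbar = X.mean(axis=0)
Sig_hat = (X - xbar).T @ (X - xbar) / n        # MLE of Sigma, from (5-10)
Sig_hat0 = (X - mu0).T @ (X - mu0) / n         # restricted MLE, from (5-11)
wilks = np.linalg.det(Sig_hat) / np.linalg.det(Sig_hat0)   # Lambda^(2/n) = .72
T2 = (n - 1) * (1.0 / wilks - 1.0)             # recovers T2 = 7/9 via (5-14)
```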
Incidentally, relation (5-14) shows that T² may be calculated from two determinants, thus avoiding the computation of S⁻¹. Solving (5-14) for T², we have

T^2 = (n-1)\frac{|\hat{\Sigma}_0|}{|\hat{\Sigma}|} - (n-1) = (n-1)\left( \frac{\left|\sum_{j=1}^{n} (x_j - \mu_0)(x_j - \mu_0)'\right|}{\left|\sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})'\right|} - 1 \right)    (5-15)

Likelihood ratio tests are common in multivariate analysis. Their optimal large-sample properties hold in very general contexts, as we shall indicate shortly. They are well suited for the testing situations considered in this book. Likelihood ratio methods yield test statistics that reduce to the familiar F- and t-statistics in univariate situations.

General Likelihood Ratio Method

We shall now consider the general likelihood ratio method. Let θ be a vector consisting of all the unknown population parameters, and let L(θ) be the likelihood function obtained by evaluating the joint density of X1, X2, ..., Xn at their observed values x1, x2, ..., xn. The parameter vector θ takes its value in the parameter set Θ. For example, in the p-dimensional multivariate normal case, θ' = [μ1, ..., μp, σ11, ..., σ1p, σ22, ..., σ2p, ..., σp−1,p, σpp] and Θ consists of the p-dimensional space −∞ < μ1 < ∞, ..., −∞ < μp < ∞, combined with the [p(p + 1)/2]-dimensional space of variances and covariances such that Σ is positive definite. Therefore, Θ has dimension ν = p + p(p + 1)/2.

Under the null hypothesis H0: θ = θ0, θ is restricted to lie in a subset Θ0 of Θ. For the multivariate normal situation with μ = μ0 and Σ unspecified, Θ0 = {μ1 = μ10, μ2 = μ20, ..., μp = μp0; σ11, ..., σ1p, σ22, ..., σ2p, ..., σpp with Σ positive definite}, so Θ0 has dimension ν0 = 0 + p(p + 1)/2 = p(p + 1)/2.

A likelihood ratio test of H0: θ ∈ Θ0 rejects H0 in favor of H1: θ ∉ Θ0 if

\Lambda = \frac{\max_{\theta \in \Theta_0} L(\theta)}{\max_{\theta \in \Theta} L(\theta)} < c    (5-16)

where c is a suitably chosen constant. Intuitively, we reject H0 if the maximum of the likelihood obtained by allowing θ to vary over the set Θ0 is much smaller than the maximum of the likelihood obtained by varying θ over all values in Θ. When the maximum in the numerator of expression (5-16) is much smaller than the maximum in the denominator, Θ0 does not contain plausible values for θ.

In each application of the likelihood ratio method, we must obtain the sampling distribution of the likelihood-ratio test statistic Λ. Then c can be selected to produce a test with a specified significance level α. However, when the sample size is large and certain regularity conditions are satisfied, the sampling distribution of −2 ln Λ is well approximated by a chi-square distribution. This attractive feature accounts, in part, for the popularity of likelihood ratio procedures.
Result 5.2. When the sample size n is large, under the null hypothesis H0,

-2\ln\Lambda = -2\ln\left( \frac{\max_{\theta \in \Theta_0} L(\theta)}{\max_{\theta \in \Theta} L(\theta)} \right)

is, approximately, a χ²_{ν−ν0} random variable. Here the degrees of freedom are ν − ν0 = (dimension of Θ) − (dimension of Θ0). ■

Statistical tests are compared on the basis of their power, which is defined as the curve or surface whose height is P[test rejects H0 | θ], evaluated at each parameter vector θ. Power measures the ability of a test to reject H0 when it is not true. In the rare situation where θ = θ0 is completely specified under H0 and the alternative H1 consists of the single specified value θ = θ1, the likelihood ratio test has the highest power among all tests with the same significance level α = P[test rejects H0 | θ = θ0]. In many single-parameter cases (θ has one component), the likelihood ratio test is uniformly most powerful against all alternatives to one side of H0: θ = θ0. In other cases, this property holds approximately for large samples.

We shall not give the technical details required for discussing the optimal properties of likelihood ratio tests in the multivariate situation. The general import of these properties, for our purposes, is that they have the highest possible (average) power when the sample size is large.

5.4 Confidence Regions and Simultaneous Comparisons of Component Means

To obtain our primary method for making inferences from a sample, we need to extend the concept of a univariate confidence interval to a multivariate confidence region. Let θ be a vector of unknown population parameters and Θ be the set of all possible values of θ. A confidence region is a region of likely θ values. This region is determined by the data, and for the moment, we shall denote it by R(X), where X = [X1, X2, ..., Xn]' is the data matrix.

The region R(X) is said to be a 100(1 − α)% confidence region if, before the sample is selected,

P[R(X) will cover the true θ] = 1 − α    (5-17)

This probability is calculated under the true, but unknown, value of θ.

The confidence region for the mean μ of a p-dimensional normal population is available from (5-6). Before the sample is selected,

P\left[ n(\bar{X} - \mu)'S^{-1}(\bar{X} - \mu) \le \frac{(n-1)p}{n-p} F_{p,n-p}(\alpha) \right] = 1 - \alpha

whatever the values of the unknown μ and Σ. In words, X̄ will be within [(n − 1)pF_{p,n−p}(α)/(n − p)]^{1/2} of μ, with probability 1 − α, provided that distance is defined in terms of nS⁻¹. For a particular sample, x̄ and S can be computed, and the inequality n(x̄ − μ)'S⁻¹(x̄ − μ) ≤ (n − 1)pF_{p,n−p}(α)/(n − p) will define a region R(X) within the space of all possible parameter values. In this case, the region will be an ellipsoid centered at x̄. This ellipsoid is the 100(1 − α)% confidence region for μ.

A 100(1 − α)% confidence region for the mean of a p-dimensional normal distribution is the ellipsoid determined by all μ such that

n(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu) \le \frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)    (5-18)

where \bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j, S = \frac{1}{n-1}\sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' and x1, x2, ..., xn are the sample observations.

To determine whether any μ0 lies within the confidence region (is a plausible value for μ), we need to compute the generalized squared distance n(x̄ − μ0)'S⁻¹(x̄ − μ0) and compare it with [p(n − 1)/(n − p)]F_{p,n−p}(α). If the squared distance is larger than [p(n − 1)/(n − p)]F_{p,n−p}(α), μ0 is not in the confidence region. Since this is analogous to testing H0: μ = μ0 versus H1: μ ≠ μ0 [see (5-7)], we see that the confidence region of (5-18) consists of all μ0 vectors for which the T²-test would not reject H0 in favor of H1 at significance level α.

For p ≥ 4, we cannot graph the joint confidence region for μ. However, we can calculate the axes of the confidence ellipsoid and their relative lengths. These are determined from the eigenvalues λi and eigenvectors ei of S. As in (4-7), the directions and lengths of the axes of

n(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu) \le c^2 = \frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)

are determined by going \sqrt{\lambda_i}\, c/\sqrt{n} = \sqrt{\lambda_i}\sqrt{p(n-1)F_{p,n-p}(\alpha)/(n(n-p))} units along the eigenvectors ei. Beginning at the center x̄, the axes of the confidence ellipsoid are

\pm \sqrt{\lambda_i} \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)}\; e_i  where S e_i = \lambda_i e_i, i = 1, 2, \ldots, p    (5-19)

The ratios of the λi's will help identify relative amounts of elongation along pairs of axes.

Example 5.3 (Constructing a confidence ellipse for μ) Data for radiation from microwave ovens were introduced in Examples 4.10 and 4.17. Let

x1 = fourth root of measured radiation with door closed

and

x2 = fourth root of measured radiation with door open
.228) (.228 203.16_1 = 3 6 /p(n.ro 2(41) F2.L2 ) :s: 6.603]. = [.562] woul d not be reJecte d m f avor of H1: p.564 ILl? + 42(200.564 .J.603 . e! e2 = [.30 :5 6.018) (.np(a) I = v':002 2(41) ( 0) (3.228 .L2 ] [ 163. 203.. = [.562.5~ of significance. .L' = [. .562] at the a = .589) = 1._• = [.J.1. = [ .391 200. we compute 42(203.589] is in the region.589) 2 .391 163. .62 To see whether . .018) (.018 42 4 respectively.704.84(163. a test of H 0: J.L2) 2 .564.. we find that x = [.L consists of all values (tL1 .562.84(163.564 . The axes lie along e. .603 . since F2.23) = ..710] .1) VA.603 .1) 2\t'A.np(a) I == .710] and e2 = [ . The joint confidence ellipsoid is plotted in Figure 5.f.704. = . .J..564 . Equivalently.J.018 163. An indication of the elongation of the confidence ellipse is provided by the ratio of the lengths of the major and minor axes.562) 2 + 42(200.J.564.y n(n _ p) Fp. and p(n.603 . 42(203.62 We conclude that J.704] when these vectors are plotted with i as the origin.0146 .0117] . S = [.562)(.391] [.018 ·603 . This ratio is p(n .L1] 200.L .603 .40( .704] The 95% confidence ellipse for J.f..y n(n _ p) Fp. .Ld(. and the halflengths of the major and minor axes are given by VA.LI.026.391) (.391 J The eigenvalue and eigenvector pairs for s are A1 A2 = .L2 :5 .np(a) = v':026 42( 40) (3. 40 ( . 163.391)(.1) v'f.710.002. The center is at x' = [.589] is in the confidence region.0144 . .05) or.y ( ) Fpnp(a) yT.710.064 p(n .05 level .564] _ = [ 5 1 .05) = 3. .603.~9 'F [.228) (.23) 2(41) = .1) n(n _ p) Fp.0117 . tL 2 ) satisfying 42 [· 564 .23.045 · 2v'f.222 Chapter 5 Inferences about a Mean Vector For the n = 42 pairs of transformed observations. 1 _ _rn=:=n==p==· = _A = _.
An indication of the elongation of the confidence ellipse is provided by the ratio of the lengths of the major and minor axes. This ratio is

\frac{2\sqrt{\lambda_1}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}}{2\sqrt{\lambda_2}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}} = \frac{\sqrt{\lambda_1}}{\sqrt{\lambda_2}} = \frac{.161}{.045} = 3.6

The length of the major axis is 3.6 times the length of the minor axis. ■

Figure 5.1 A 95% confidence ellipse for μ based on microwave-radiation data.

Simultaneous Confidence Statements

While the confidence region n(x̄ − μ)'S⁻¹(x̄ − μ) ≤ c² correctly assesses the joint knowledge concerning plausible values of μ, any summary of conclusions ordinarily includes confidence statements about the individual component means. In so doing, we adopt the attitude that all of the separate confidence statements should hold simultaneously with a specified high probability. It is the guarantee of a specified probability against any statement being incorrect that motivates the term simultaneous confidence intervals. We begin by considering simultaneous confidence statements which are intimately related to the joint confidence region based on the T²-statistic.

Let X have an Np(μ, Σ) distribution and form the linear combination

Z = a_1X_1 + a_2X_2 + \cdots + a_pX_p = a'X

From (2-43),

\mu_Z = E(Z) = a'\mu  and  \sigma_Z^2 = \operatorname{Var}(Z) = a'\Sigma a

Moreover, by Result 4.2, Z has an N(a'μ, a'Σa) distribution. If a random sample X1, X2, ..., Xn from the Np(μ, Σ) population is available, a corresponding sample of Z's can be created by taking linear combinations. Thus,

Z_j = a_1X_{j1} + a_2X_{j2} + \cdots + a_pX_{jp} = a'X_j,  j = 1, 2, \ldots, n

The sample mean and variance of the observed values z1, z2, ..., zn are, by (3-36),

\bar{z} = a'\bar{x}  and  s_z^2 = a'Sa

where x̄ and S are the sample mean vector and covariance matrix of the xj's, respectively.

Simultaneous confidence intervals can be developed from a consideration of confidence intervals for a'μ for various choices of a. The argument proceeds as follows.

For a fixed and σZ² unknown, a 100(1 − α)% confidence interval for μZ = a'μ is based on student's t-ratio

t = \frac{\bar{z} - \mu_Z}{s_z/\sqrt{n}} = \frac{\sqrt{n}(a'\bar{x} - a'\mu)}{\sqrt{a'Sa}}    (5-20)

and leads to the statement

a'\bar{x} - t_{n-1}(\alpha/2)\frac{\sqrt{a'Sa}}{\sqrt{n}} \le a'\mu \le a'\bar{x} + t_{n-1}(\alpha/2)\frac{\sqrt{a'Sa}}{\sqrt{n}}    (5-21)

where t_{n−1}(α/2) is the upper 100(α/2)th percentile of a t-distribution with n − 1 d.f.

Inequality (5-21) can be interpreted as a statement about the components of the mean vector μ. For example, with a' = [1, 0, ..., 0], a'μ = μ1, and (5-21) becomes the usual confidence interval for a normal population mean. (Note, in this case, that a'Sa = s11.) Clearly, we could make several confidence statements about the components of μ, each with associated confidence coefficient 1 − α, by choosing different coefficient vectors a. However, the confidence associated with all of the statements taken together is not 1 − α.

Intuitively, it would be desirable to associate a "collective" confidence coefficient of 1 − α with the confidence intervals that can be generated by all choices of a. However, a price must be paid for the convenience of a large simultaneous confidence coefficient: intervals that are wider (less precise) than the interval of (5-21) for a specific choice of a.

Given a data set x1, x2, ..., xn and a particular a, the confidence interval in (5-21) is that set of a'μ values for which

|t| = \left| \frac{\sqrt{n}(a'\bar{x} - a'\mu)}{\sqrt{a'Sa}} \right| \le t_{n-1}(\alpha/2)

or, equivalently,

t^2 = \frac{n(a'\bar{x} - a'\mu)^2}{a'Sa} = \frac{n(a'(\bar{x} - \mu))^2}{a'Sa} \le t_{n-1}^2(\alpha/2)    (5-22)

A simultaneous confidence region is given by the set of a'μ values such that t² is relatively small for all choices of a. It seems reasonable to expect that the constant t²_{n−1}(α/2) in (5-22) will be replaced by a larger value, c², when statements are developed for many choices of a.
Considering the values of a for which t² ≤ c², we are naturally led to the determination of

\max_a t^2 = \max_a \frac{n(a'(\bar{x} - \mu))^2}{a'Sa}

Using the maximization lemma (2-50) with x = a, d = (x̄ − μ), and B = S, we get

\max_a \frac{n(a'(\bar{x} - \mu))^2}{a'Sa} = n\left[ \max_a \frac{(a'(\bar{x} - \mu))^2}{a'Sa} \right] = n(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu) = T^2    (5-23)

with the maximum occurring for a proportional to S⁻¹(x̄ − μ).

Result 5.3. Let X1, X2, ..., Xn be a random sample from an Np(μ, Σ) population with Σ positive definite. Then, simultaneously for all a, the interval

\left( a'\bar{X} - \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)\, a'Sa},\; a'\bar{X} + \sqrt{\frac{p(n-1)}{n(n-p)} F_{p,n-p}(\alpha)\, a'Sa} \right)

will contain a'μ with probability 1 − α.

Proof. From (5-23),

T^2 = n(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu) \le c^2  implies  \frac{n(a'\bar{x} - a'\mu)^2}{a'Sa} \le c^2  for every a

or

a'\bar{x} - c\sqrt{\frac{a'Sa}{n}} \le a'\mu \le a'\bar{x} + c\sqrt{\frac{a'Sa}{n}}  for every a

Choosing c² = p(n − 1)F_{p,n−p}(α)/(n − p) [see (5-6)] gives intervals that will contain a'μ for all a, with probability 1 − α = P[T² ≤ c²]. ■

It is convenient to refer to the simultaneous intervals of Result 5.3 as T²-intervals, since the coverage probability is determined by the distribution of T². The successive choices a' = [1, 0, ..., 0], a' = [0, 1, ..., 0], and so on through a' = [0, 0, ..., 1] for the T²-intervals allow us to conclude that

\bar{x}_1 - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{11}}{n}} \le \mu_1 \le \bar{x}_1 + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{11}}{n}}
\bar{x}_2 - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{22}}{n}} \le \mu_2 \le \bar{x}_2 + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{22}}{n}}
  ⋮
\bar{x}_p - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{pp}}{n}} \le \mu_p \le \bar{x}_p + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{pp}}{n}}    (5-24)

all hold simultaneously with confidence coefficient 1 − α. Note that, without modifying the coefficient 1 − α, we can make statements about the differences μi − μk corresponding to a' = [0, ..., 0, ai, 0, ..., 0, ak, 0, ..., 0], where ai = 1 and ak = −1. In this case a'Sa = sii − 2sik + skk, and we have the statement

\bar{x}_i - \bar{x}_k - \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii} - 2s_{ik} + s_{kk}}{n}} \le \mu_i - \mu_k \le \bar{x}_i - \bar{x}_k + \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii} - 2s_{ik} + s_{kk}}{n}}    (5-25)
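In software, the component intervals (5-24) come from a single multiplier applied to each standard error. A small Python helper (the function name is our own choosing) might look as follows; NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.stats import f

def t2_simultaneous_intervals(xbar, S, n, alpha=0.05):
    # Component-wise shadows (5-24) of the 100(1 - alpha)% T2 ellipsoid.
    p = len(xbar)
    c = np.sqrt(p * (n - 1) / (n - p) * f.ppf(1 - alpha, p, n - p))
    half = c * np.sqrt(np.diag(S) / n)
    return np.column_stack((xbar - half, xbar + half))
```

For the microwave data (x̄ and S as in Example 5.3), this helper reproduces the intervals (.516, .612) and (.555, .651) obtained in Example 5.4 below.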
The simultaneous T² confidence intervals are ideal for "data snooping." The confidence coefficient 1 − α remains unchanged for any choice of a, so linear combinations of the components μi that merit inspection based upon an examination of the data can be estimated.

In addition, according to the results in Supplement 5A, we can include the statements about (μi, μk) belonging to the sample mean-centered ellipses

n[\bar{x}_i - \mu_i, \bar{x}_k - \mu_k] \begin{bmatrix} s_{ii} & s_{ik} \\ s_{ik} & s_{kk} \end{bmatrix}^{-1} \begin{bmatrix} \bar{x}_i - \mu_i \\ \bar{x}_k - \mu_k \end{bmatrix} \le \frac{p(n-1)}{n-p} F_{p,n-p}(\alpha)    (5-26)

and still maintain the confidence coefficient (1 − α) for the whole set of statements.

The simultaneous T² confidence intervals for the individual components of a mean vector are just the shadows, or projections, of the confidence ellipsoid on the component axes. This connection between the shadows of the ellipsoid and the simultaneous confidence intervals given by (5-24) is illustrated in the next example.

Example 5.4 (Simultaneous confidence intervals as shadows of the confidence ellipsoid) In Example 5.3, we obtained the 95% confidence ellipse for the means of the fourth roots of the door-closed and door-open microwave radiation measurements. The 95% simultaneous T² intervals for the two component means are, from (5-24),

\left( \bar{x}_1 \pm \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)}\sqrt{\frac{s_{11}}{n}} \right) = \left( .564 \pm \sqrt{\frac{2(41)}{40}(3.23)}\sqrt{\frac{.0144}{42}} \right)  or  (.516, .612)

\left( \bar{x}_2 \pm \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)}\sqrt{\frac{s_{22}}{n}} \right) = \left( .603 \pm \sqrt{\frac{2(41)}{40}(3.23)}\sqrt{\frac{.0146}{42}} \right)  or  (.555, .651)

In Figure 5.2, we have redrawn the 95% confidence ellipse from Example 5.3. The 95% simultaneous intervals are shown as shadows, or projections, of this ellipse on the axes of the component means. ■

Figure 5.2 Simultaneous T²-intervals for the component means as shadows of the confidence ellipse on the axes: microwave radiation data.

Example 5.5 (Constructing simultaneous confidence intervals and ellipses) The scores obtained by n = 87 college students on the College Level Examination Program (CLEP) subtest X1 and the College Qualification Test (CQT) subtests X2 and X3 are given in Table 5.2 on page 228 for X1 = social science and history, X2 = verbal, and X3 = science. These data give
\bar{x} = \begin{bmatrix} 526.29 \\ 54.69 \\ 25.13 \end{bmatrix}  and  S = \begin{bmatrix} 5808.06 & 597.84 & 222.03 \\ 597.84 & 126.05 & 23.39 \\ 222.03 & 23.39 & 23.11 \end{bmatrix}

Let us compute the 95% simultaneous confidence intervals for μ1, μ2, and μ3. We have

\frac{p(n-1)}{n-p} F_{p,n-p}(\alpha) = \frac{3(87-1)}{87-3} F_{3,84}(.05) = \frac{3(86)}{84}(2.7) = 8.29

and we obtain the simultaneous confidence statements [see (5-24)]

526.29 - \sqrt{8.29}\sqrt{\frac{5808.06}{87}} \le \mu_1 \le 526.29 + \sqrt{8.29}\sqrt{\frac{5808.06}{87}}  or  502.77 \le \mu_1 \le 549.81

54.69 - \sqrt{8.29}\sqrt{\frac{126.05}{87}} \le \mu_2 \le 54.69 + \sqrt{8.29}\sqrt{\frac{126.05}{87}}  or  51.22 \le \mu_2 \le 58.16

25.13 - \sqrt{8.29}\sqrt{\frac{23.11}{87}} \le \mu_3 \le 25.13 + \sqrt{8.29}\sqrt{\frac{23.11}{87}}  or  23.65 \le \mu_3 \le 26.61
Table 5.2 College Test Data

Individual  X1 (Social science and history)  X2 (Verbal)  X3 (Science)
 1   468   41   26
 2   428   39   26
 3   514   53   21
 4   547   67   33
 5   614   61   27
 6   501   67   29
 7   421   46   22
 ⋮     ⋮    ⋮    ⋮
(Scores for all 87 students are tabled in the original.)

Source: Data courtesy of Richard W. Johnson.
With the possible exception of the verbal scores, the marginal Q-Q plots and two-dimensional scatter plots do not reveal any serious departures from normality for the college qualification test data. (See Exercise 5.18.) Moreover, the sample size is large enough to justify the methodology, even though the data are not quite normally distributed. (See Section 5.5.)

The simultaneous T²-intervals above are wider than univariate intervals because all three must hold with 95% confidence. They may also be wider than necessary, because, with the same confidence, we can make statements about differences. For instance, with a' = [0, 1, −1], the interval for μ2 − μ3 has endpoints

(\bar{x}_2 - \bar{x}_3) \pm \sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)}\sqrt{\frac{s_{22} + s_{33} - 2s_{23}}{n}} = (54.69 - 25.13) \pm \sqrt{8.29}\sqrt{\frac{126.05 + 23.11 - 2(23.39)}{87}} = 29.56 \pm 3.12

so (26.44, 32.68) is a 95% confidence interval for μ2 − μ3. Simultaneous intervals can also be constructed for the other differences.

Finally, we can construct confidence ellipses for pairs of means, and the same 95% confidence holds. For example, for the pair (μ2, μ3), we have

87[54.69 - \mu_2, 25.13 - \mu_3] \begin{bmatrix} 126.05 & 23.39 \\ 23.39 & 23.11 \end{bmatrix}^{-1} \begin{bmatrix} 54.69 - \mu_2 \\ 25.13 - \mu_3 \end{bmatrix} = .849(54.69 - \mu_2)^2 + 4.633(25.13 - \mu_3)^2 - 2(.859)(54.69 - \mu_2)(25.13 - \mu_3) \le 8.29

This ellipse is shown in Figure 5.3 on page 230, along with the 95% confidence ellipses for the other two pairs of means. The projections or shadows of these ellipses on the axes are also indicated. ■

A Comparison of Simultaneous Confidence Intervals with One-at-a-Time Intervals

An alternative approach to the construction of confidence intervals is to consider the components μi one at a time, as suggested by (5-21) with a' = [0, ..., 0, ai, 0, ..., 0] where ai = 1. This approach ignores the covariance structure of the p variables and leads to the intervals

\bar{x}_1 - t_{n-1}(\alpha/2)\sqrt{\frac{s_{11}}{n}} \le \mu_1 \le \bar{x}_1 + t_{n-1}(\alpha/2)\sqrt{\frac{s_{11}}{n}}
\bar{x}_2 - t_{n-1}(\alpha/2)\sqrt{\frac{s_{22}}{n}} \le \mu_2 \le \bar{x}_2 + t_{n-1}(\alpha/2)\sqrt{\frac{s_{22}}{n}}
  ⋮
\bar{x}_p - t_{n-1}(\alpha/2)\sqrt{\frac{s_{pp}}{n}} \le \mu_p \le \bar{x}_p + t_{n-1}(\alpha/2)\sqrt{\frac{s_{pp}}{n}}    (5-27)
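To see the contrast concretely, one can compute both sets of intervals from the college-test summary statistics; the sketch below (NumPy and SciPy assumed) uses only the diagonal of S.

```python
import numpy as np
from scipy.stats import f, t

n, p, alpha = 87, 3, 0.05
xbar = np.array([526.29, 54.69, 25.13])
s_diag = np.array([5808.06, 126.05, 23.11])   # s11, s22, s33 from Example 5.5
se = np.sqrt(s_diag / n)
m_t = t.ppf(1 - alpha / 2, n - 1)                                   # about 1.99
m_T2 = np.sqrt(p * (n - 1) / (n - p) * f.ppf(1 - alpha, p, n - p))  # about 2.88
one_at_a_time = np.column_stack((xbar - m_t * se, xbar + m_t * se))
simultaneous = np.column_stack((xbar - m_T2 * se, xbar + m_T2 * se))
# The one-at-a-time intervals are shorter, but only the T2 set holds jointly.
```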
Figure 5.3 95% confidence ellipses for pairs of means and the simultaneous T²-intervals: college test data.

Although prior to sampling, the ith interval has probability 1 − α of covering μi, we do not know what to assert, in general, about the probability of all intervals containing their respective μi's. As we have pointed out, this probability is not 1 − α.

To shed some light on the problem, consider the special case where the observations have a joint normal distribution and

\Sigma = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix}

Since the observations on the first variable are independent of those on the second variable, and so on, the product rule for independent events can be applied. Before the sample is selected,

P[all t-intervals in (5-27) contain the μi's] = (1 − α)(1 − α)···(1 − α) = (1 − α)^p

If 1 − α = .95 and p = 6, this probability is (.95)^6 = .74.
To guarantee a probability of 1 − α that all of the statements about the component means hold simultaneously, the individual intervals must be wider than the separate t-intervals; just how much wider depends on both p and n, as well as on 1 − α.

For 1 − α = .95, n = 15, and p = 4, the multipliers of √(sii/n) in (5-24) and (5-27) are

\sqrt{\frac{p(n-1)}{n-p} F_{p,n-p}(.05)} = \sqrt{\frac{4(14)}{11}(3.36)} = 4.14

and t_{n−1}(.025) = 2.145, respectively. Consequently, in this case the simultaneous intervals are 100(4.14 − 2.145)/2.145 = 93% wider than those derived from the one-at-a-time t method.

Table 5.3 gives some critical distance multipliers for one-at-a-time t-intervals computed according to (5-21), as well as the corresponding simultaneous T²-intervals. In general, the width of the T²-intervals, relative to the t-intervals, increases for fixed n as p increases and decreases for fixed p as n increases.

Table 5.3 Critical Distance Multipliers for One-at-a-Time t-Intervals and T²-Intervals for Selected n and p (1 − α = .95)

                      √((n−1)p F_{p,n−p}(.05)/(n−p))
 n     t_{n−1}(.025)    p = 4     p = 10
 15      2.145           4.14     11.52
 25      2.064           3.60      6.39
 50      2.010           3.31      5.05
100      1.970           3.19      4.61
 ∞       1.960           3.08      4.28

The comparison implied by Table 5.3 is a bit unfair, since the confidence level associated with any collection of T²-intervals, for fixed n and p, is .95, and the overall confidence associated with a collection of individual t-intervals, for the same n, can, as we have seen, be much less than .95. The one-at-a-time t-intervals are too short to maintain an overall confidence level for separate statements about, say, all p means. Nevertheless, we sometimes look at them as the best possible information concerning a mean, if this is the only inference to be made. Moreover, if the one-at-a-time intervals are calculated only when the T²-test rejects the null hypothesis, some researchers think they may more accurately represent the information about the means than the T²-intervals do.

The T²-intervals are too wide if they are applied only to the p component means. To see why, consider the confidence ellipse and the simultaneous intervals shown in Figure 5.2. If μ1 lies in its T²-interval and μ2 lies in its T²-interval, then (μ1, μ2) lies in the rectangle formed by these two intervals. This rectangle contains the confidence ellipse and more. The confidence ellipse is smaller but has probability .95 of covering the mean vector μ, with its component means μ1 and μ2. Consequently, the probability of covering the two individual means μ1 and μ2 will be larger than .95 for the rectangle formed by the T²-intervals. This result leads us to consider a second approach to making multiple comparisons known as the Bonferroni method.
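Table 5.3 can be recomputed in a few lines, which also makes it easy to examine other choices of n, p, or confidence level. A sketch assuming SciPy:

```python
import numpy as np
from scipy.stats import f, t

alpha = 0.05
for n in (15, 25, 50, 100):
    t_mult = t.ppf(1 - alpha / 2, n - 1)            # one-at-a-time multiplier
    T2_mults = [np.sqrt(p * (n - 1) / (n - p) * f.ppf(1 - alpha, p, n - p))
                for p in (4, 10)]                   # simultaneous multipliers
    print(n, round(t_mult, 3), [round(m, 2) for m in T2_mults])
```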
The Bonferroni Method of Multiple Comparisons

Often, attention is restricted to a small number of individual confidence statements. In these situations it is possible to do better than the simultaneous intervals of Result 5.3. If the number m of specified component means μᵢ or linear combinations a′μ = a₁μ₁ + a₂μ₂ + ⋯ + a_pμ_p is small, simultaneous confidence intervals can be developed that are shorter (more precise) than the simultaneous T²-intervals. The alternative method for multiple comparisons is called the Bonferroni method, because it is developed from a probability inequality carrying that name.

Suppose that, prior to the collection of data, confidence statements about m linear combinations a₁′μ, a₂′μ, ..., a_m′μ are required. Let Cᵢ denote a confidence statement about the value of aᵢ′μ with P[Cᵢ true] = 1 − αᵢ, i = 1, 2, ..., m. Now (see Exercise 5.6),

    P[all Cᵢ true] = 1 − P[at least one Cᵢ false]
                   ≥ 1 − Σᵢ₌₁^m P(Cᵢ false) = 1 − Σᵢ₌₁^m (1 − P(Cᵢ true))
                   = 1 − (α₁ + α₂ + ⋯ + α_m)                                  (5-28)

Inequality (5-28), a special case of the Bonferroni inequality, allows an investigator to control the overall error rate α₁ + α₂ + ⋯ + α_m, regardless of the correlation structure behind the confidence statements. There is also the flexibility of controlling the error rate for a group of important statements and balancing it by another choice for the less important statements.

Let us develop simultaneous interval estimates for the restricted set consisting of the components μᵢ of μ. Lacking information on the relative importance of these components, we consider the individual t-intervals

    x̄ᵢ ± t_{n−1}(αᵢ/2) √(sᵢᵢ/n)    i = 1, 2, ..., m

with αᵢ = α/m. Since P[X̄ᵢ ± t_{n−1}(α/2m)√(Sᵢᵢ/n) contains μᵢ] = 1 − α/m, i = 1, 2, ..., m, we have, from (5-28),

    P[X̄ᵢ ± t_{n−1}(α/2m)√(Sᵢᵢ/n) contains μᵢ, all i] ≥ 1 − (α/m + α/m + ⋯ + α/m)   (m terms)
                                                      = 1 − α

Therefore, with an overall confidence level greater than or equal to 1 − α, we can make the following m = p statements:

    x̄ᵢ ± t_{n−1}(α/2p) √(sᵢᵢ/n)  contains μᵢ,    i = 1, 2, ..., p            (5-29)
The statements in (5-29) can be compared with those in (5-24). The percentage point t_{n−1}(α/2p) replaces √((n − 1)p F_{p,n−p}(α)/(n − p)), but otherwise the intervals are of the same structure.

Example 5.6 (Constructing Bonferroni simultaneous confidence intervals and comparing them with T²-intervals) Let us return to the microwave oven radiation data in Examples 5.3 and 5.4. We shall obtain the simultaneous 95% Bonferroni confidence intervals for the means, μ₁ and μ₂, of the fourth roots of the door-closed and door-open measurements with αᵢ = .05/2, i = 1, 2. We make use of the results in Example 5.3, noting that n = 42 and t₄₁(.05/2(2)) = t₄₁(.0125) = 2.327, to get

    x̄₁ ± t₄₁(.0125) √(s₁₁/n) = .564 ± 2.327 √(.0144/42)   or   .521 ≤ μ₁ ≤ .607
    x̄₂ ± t₄₁(.0125) √(s₂₂/n) = .603 ± 2.327 √(.0146/42)   or   .560 ≤ μ₂ ≤ .646

Figure 5.4 shows the 95% T² simultaneous confidence intervals for μ₁, μ₂ from Figure 5.2, along with the corresponding 95% Bonferroni intervals. For each component mean, the Bonferroni interval falls within the T²-interval. Consequently, the rectangular (joint) region formed by the two Bonferroni intervals is contained in the rectangular region formed by the two T²-intervals. If we are interested only in the component means, the Bonferroni intervals provide more precise estimates than the T²-intervals. •

[Figure 5.4 The 95% T² and 95% Bonferroni simultaneous confidence intervals for the component means (microwave radiation data).]
On the other hand, the 95% confidence region for μ gives the plausible values for the pairs (μ₁, μ₂) when the correlation between the measured variables is taken into account.

The Bonferroni intervals for linear combinations a′μ and the analogous T²-intervals (recall Result 5.3) have the same general form:

    a′x̄ ± (critical value) √(a′Sa/n)

Consequently, in every instance where αᵢ = α/m,

    (Length of Bonferroni interval) / (Length of T²-interval)
        = t_{n−1}(α/2m) / √( p(n − 1) F_{p,n−p}(α) / (n − p) )               (5-30)

which does not depend on the random quantities x̄ and S. As we have pointed out, for a small number m of specified parametric functions a′μ, the Bonferroni intervals will always be shorter. How much shorter is indicated in Table 5.4 for selected n and p.

Table 5.4 (Length of Bonferroni Interval)/(Length of T²-Interval) for 1 − α = .95 and αᵢ = .05/m

              m = p
    n       2      4      10
    15     .88    .69    .29
    25     .90    .75    .48
    50     .91    .78    .58
   100     .91    .80    .62
    ∞      .91    .81    .66

We see from Table 5.4 that the Bonferroni method provides shorter intervals when m = p. Because they are easy to apply and provide the relatively short confidence intervals needed for inference, we will often apply simultaneous t-intervals based on the Bonferroni method.

5.5 Large Sample Inferences about a Population Mean Vector

When the sample size is large, tests of hypotheses and confidence regions for μ can be constructed without the assumption of a normal population. As illustrated in Exercises 5.15, 5.16, and 5.17, we are able to make inferences about the population mean even though the parent distribution is discrete. In fact, serious departures from a normal population can be overcome by large sample sizes. Both tests of hypotheses and simultaneous confidence statements will then possess (approximately) their nominal levels.

The advantages associated with large samples may be partially offset by a loss in sample information caused by using only the summary statistics x̄ and S. On the other hand, since (x̄, S) is a sufficient summary for normal populations [see (4-21)], the closer the underlying population is to multivariate normal, the more efficiently the sample information will be utilized in making inferences.
All large-sample inferences about μ are based on a χ²-distribution. From (4-28), we know that (X̄ − μ)′(n⁻¹S)⁻¹(X̄ − μ) = n(X̄ − μ)′S⁻¹(X̄ − μ) is approximately χ² with p d.f., and thus,

    P[ n(X̄ − μ)′S⁻¹(X̄ − μ) ≤ χ²_p(α) ] ≅ 1 − α                             (5-31)

where χ²_p(α) is the upper (100α)th percentile of the χ²_p-distribution.

Equation (5-31) immediately leads to large sample tests of hypotheses and simultaneous confidence regions. These procedures are summarized in Results 5.4 and 5.5.

Result 5.4. Let X₁, X₂, ..., Xₙ be a random sample from a population with mean μ and positive definite covariance matrix Σ. When n − p is large, the hypothesis H₀: μ = μ₀ is rejected in favor of H₁: μ ≠ μ₀, at a level of significance approximately α, if the observed

    n(x̄ − μ₀)′S⁻¹(x̄ − μ₀) > χ²_p(α)

Here χ²_p(α) is the upper (100α)th percentile of a chi-square distribution with p d.f. •

Comparing the test in Result 5.4 with the corresponding normal theory test in (5-7), we see that the test statistics have the same structure, but the critical values are different. A closer examination, however, reveals that both tests yield essentially the same result in situations where the χ²-test of Result 5.4 is appropriate. This follows directly from the fact that (n − 1)p F_{p,n−p}(α)/(n − p) and χ²_p(α) are approximately equal for n large relative to p. (See Tables 3 and 4 in the appendix.)

Result 5.5. Let X₁, X₂, ..., Xₙ be a random sample from a population with mean μ and positive definite covariance Σ. If n − p is large,

    a′X̄ ± √(χ²_p(α)) √(a′Sa/n)

will contain a′μ, for every a, with probability approximately 1 − α. Consequently, we can make the 100(1 − α)% simultaneous confidence statements

    x̄₁ ± √(χ²_p(α)) √(s₁₁/n)   contains μ₁
    x̄₂ ± √(χ²_p(α)) √(s₂₂/n)   contains μ₂
         ⋮
    x̄_p ± √(χ²_p(α)) √(s_pp/n)  contains μ_p

and, in addition, for all pairs (μᵢ, μₖ), i, k = 1, 2, ..., p, the sample mean-centered ellipses

    n [x̄ᵢ − μᵢ, x̄ₖ − μₖ] [sᵢᵢ  sᵢₖ; sᵢₖ  sₖₖ]⁻¹ [x̄ᵢ − μᵢ; x̄ₖ − μₖ] ≤ χ²_p(α)
Proof. The first part follows from Result 5A.1, with c² = χ²_p(α). The probability level is a consequence of (5-31). The statements for the μᵢ are obtained by the special choices a′ = [0, ..., 0, aᵢ, 0, ..., 0], where aᵢ = 1, i = 1, 2, ..., p. The ellipsoids for pairs of means follow from Result 5A.2, with c² = χ²_p(α). The overall confidence level of approximately 1 − α for all statements is, once again, a result of the large sample distribution theory summarized in (5-31). •

The question of what is a large sample size is not easy to answer. In one or two dimensions, sample sizes in the range 30 to 50 can usually be considered large. As the number of characteristics becomes large, certainly larger sample sizes are required for the asymptotic distributions to provide good approximations to the true distributions of various test statistics. Lacking definitive studies, we simply state that n − p must be large and realize that the true case is more complicated. An application with p = 2 and sample size 50 is much different than an application with p = 52 and sample size 100, although both have n − p = 48.

It is good statistical practice to subject these large sample inference procedures to the same checks required of the normal-theory methods. Although small to moderate departures from normality do not cause any difficulties for n large, extreme deviations could cause problems. Specifically, the true error rate may be far removed from the nominal level α. If, on the basis of Q-Q plots and other investigative devices, outliers and other forms of extreme departures are indicated (see, for example, [2]), appropriate corrective actions, including transformations, are desirable. Methods for testing mean vectors of symmetric multivariate distributions that are relatively insensitive to departures from normality are discussed in [11]. In some instances, Results 5.4 and 5.5 are useful only for very large samples.

The next example allows us to illustrate the construction of large sample simultaneous statements for all single mean components.

Example 5.7 (Constructing large sample simultaneous confidence intervals) A music educator tested thousands of Finnish students on their native musical ability in order to set national norms in Finland. Summary statistics for part of the data set are given in Table 5.5. These statistics are based on a sample of n = 96 Finnish 12th graders.

Table 5.5 Musical Aptitude Profile Means and Standard Deviations for 96 12th-Grade Finnish Students Participating in a Standardization Program

                           Raw score
    Variable          Mean (x̄ᵢ)   Standard deviation (√sᵢᵢ)
    X₁ = melody          28.1            5.76
    X₂ = harmony         26.6            5.85
    X₃ = tempo           35.4            3.82
    X₄ = meter           34.2            5.12
    X₅ = phrasing        23.6            3.76
    X₆ = balance         22.0            3.93
    X₇ = style           22.7            4.03

Source: Data courtesy of V. Sell.
Let us construct 90% simultaneous confidence intervals for the individual mean components μᵢ, i = 1, 2, ..., 7. From Result 5.5, simultaneous 90% confidence limits are given by x̄ᵢ ± √(χ²₇(.10)) √(sᵢᵢ/n), i = 1, 2, ..., 7, where χ²₇(.10) = 12.02. Thus, with approximately 90% confidence,

    28.1 ± √12.02 (5.76/√96)  contains μ₁   or   26.06 ≤ μ₁ ≤ 30.14
    26.6 ± √12.02 (5.85/√96)  contains μ₂   or   24.53 ≤ μ₂ ≤ 28.67
    35.4 ± √12.02 (3.82/√96)  contains μ₃   or   34.05 ≤ μ₃ ≤ 36.75
    34.2 ± √12.02 (5.12/√96)  contains μ₄   or   32.39 ≤ μ₄ ≤ 36.01
    23.6 ± √12.02 (3.76/√96)  contains μ₅   or   22.27 ≤ μ₅ ≤ 24.93
    22.0 ± √12.02 (3.93/√96)  contains μ₆   or   20.61 ≤ μ₆ ≤ 23.39
    22.7 ± √12.02 (4.03/√96)  contains μ₇   or   21.27 ≤ μ₇ ≤ 24.13

Based, perhaps, upon thousands of American students, the investigator could hypothesize the musical aptitude profile to be

    μ₀′ = [31, 27, 34, 31, 23, 22, 22]

We see from the simultaneous statements above that the melody, tempo, and meter components of μ₀ do not appear to be plausible values for the corresponding means of Finnish scores. •

When the sample size is large, the one-at-a-time confidence intervals for individual means are

    x̄ᵢ ± z(α/2) √(sᵢᵢ/n),    i = 1, 2, ..., p

where z(α/2) is the upper 100(α/2)th percentile of the standard normal distribution. The Bonferroni simultaneous confidence intervals for the m = p statements about the individual means take the same form, but use the modified percentile z(α/2p) to give

    x̄ᵢ ± z(α/2p) √(sᵢᵢ/n),    i = 1, 2, ..., p
Table 5.6 gives the individual, Bonferroni, and chi-square-based (or shadow of the confidence ellipsoid) intervals for the musical aptitude data in Example 5.7.

Table 5.6 The Large Sample 95% Individual, Bonferroni, and T²-Intervals for the Musical Aptitude Data
The one-at-a-time confidence intervals use z(.025) = 1.96.
The simultaneous Bonferroni intervals use z(.025/7) = 2.69.
The simultaneous T², or shadows of the ellipsoid, use χ²₇(.05) = 14.07.

                    One-at-a-time      Bonferroni       Shadow of Ellipsoid
    Variable        Lower   Upper     Lower   Upper       Lower   Upper
    X₁ = melody     26.95   29.25     26.52   29.68       25.89   30.31
    X₂ = harmony    25.43   27.77     24.99   28.21       24.36   28.84
    X₃ = tempo      34.64   36.16     34.35   36.45       33.94   36.86
    X₄ = meter      33.18   35.22     32.79   35.61       32.24   36.16
    X₅ = phrasing   22.85   24.35     22.57   24.63       22.16   25.04
    X₆ = balance    21.21   22.79     20.92   23.08       20.50   23.50
    X₇ = style      21.89   23.51     21.59   23.81       21.16   24.24

Although the sample size may be large, some statisticians prefer to retain the F- and t-based percentiles rather than use the chi-square or standard normal-based percentiles. The latter constants are the infinite sample size limits of the former constants. The F and t percentiles produce larger intervals and, hence, are more conservative. Table 5.7 gives the individual, Bonferroni, and F-based, or shadow of the confidence ellipsoid, intervals for the musical aptitude data. Comparing Table 5.7 with Table 5.6, we see that all of the intervals in Table 5.7 are larger. However, with the relatively large sample size n = 96, the differences are typically in the third, or tenths, digit.

Table 5.7 The 95% Individual, Bonferroni, and T²-Intervals for the Musical Aptitude Data
The one-at-a-time confidence intervals use t₉₅(.025) = 1.99.
The simultaneous Bonferroni intervals use t₉₅(.025/7) = 2.75.
The simultaneous T², or shadows of the ellipsoid, use F₇,₈₉(.05) = 2.11.

                    One-at-a-time      Bonferroni       Shadow of Ellipsoid
    Variable        Lower   Upper     Lower   Upper       Lower   Upper
    X₁ = melody     26.93   29.27     26.48   29.72       25.77   30.43
    X₂ = harmony    25.41   27.79     24.96   28.24       24.23   28.97
    X₃ = tempo      34.62   36.18     34.33   36.47       33.85   36.95
    X₄ = meter      33.16   35.24     32.76   35.64       32.13   36.28
    X₅ = phrasing   22.84   24.36     22.54   24.66       22.08   25.12
    X₆ = balance    21.20   22.80     20.90   23.10       20.41   23.59
    X₇ = style      21.88   23.52     21.57   23.83       21.07   24.33
5.6 Multivariate Quality Control Charts

To improve the quality of goods and services, data need to be examined for causes of variation. When a manufacturing process is continuously producing items or when we are monitoring activities of a service, data should be collected to evaluate the capabilities and stability of the process. When a process is stable, the variation is produced by common causes that are always present, and no one cause is a major source of variation.

The purpose of any control chart is to identify occurrences of special causes of variation that come from outside of the usual process. These causes of variation often indicate a need for a timely repair, but they can also suggest improvements to the process. Control charts make the variation visible and allow one to distinguish common from special causes of variation.

A control chart typically consists of data plotted in time order and horizontal lines, called control limits, that indicate the amount of variation due to common causes. One useful control chart is the X̄-chart (read X-bar chart). To create an X̄-chart:

1. Plot the individual observations or sample means in time order.
2. Create and plot the centerline x̄, the sample mean of all of the observations.
3. Calculate and plot the control limits given by

    Upper control limit (UCL) = x̄ + 3(standard deviation)
    Lower control limit (LCL) = x̄ − 3(standard deviation)

The standard deviation in the control limits is the estimated standard deviation of the observations being plotted. For single observations, it is often the sample standard deviation. If the means of subsamples of size m are plotted, then the standard deviation is the sample standard deviation divided by √m. The control limits of plus and minus three standard deviations are chosen so that there is a very small chance, assuming normally distributed data, of falsely signaling an out-of-control observation, that is, an observation suggesting a special cause of variation.

Example 5.8 (Creating a univariate control chart) The Madison, Wisconsin, police department regularly monitors many of its activities as part of an ongoing quality improvement program. Table 5.8 gives the data on five different kinds of overtime hours. Each observation represents a total for 12 pay periods, or about half a year.

We examine the stability of the legal appearances overtime hours. A computer calculation gives x̄₁ = 3558. Since individual values will be plotted, x̄₁ is the same as x̄. Also, the sample standard deviation is √s₁₁ = 607, and the control limits are

    UCL = x̄₁ + 3(√s₁₁) = 3558 + 3(607) = 5379
    LCL = x̄₁ − 3(√s₁₁) = 3558 − 3(607) = 1737
Table 5.8 Five Types of Overtime Hours for the Madison, Wisconsin, Police Department

    x₁                  x₂                    x₃           x₄          x₅
    Legal Appearances   Extraordinary Event   Holdover     COA¹        Meeting
    Hours               Hours                 Hours        Hours       Hours
    3387                2200                  1181         14861        236
    3109                 875                  3532         11367        310
    2670                 957                  2502         13329       1182
    3125                1758                  4510         12328       1208
    3469                 868                  3032         12847       1385
    3120                 398                  2130         13979       1053
    3671                1603                  1982         13528       1046
    4531                 523                  4675         12699       1100
    3678                2034                  2354         13534       1349
    3238                1136                  4606         11609       1150
    3135                5326                  3044         14189       1216
    5217                1658                  3340         15052        660
    3728                1945                  2111         12236        299
    3506                 344                  1291         15482        206
    3824                 807                  1365         14900        239
    3516                1223                  1175         15078        161

¹Compensatory overtime allowed.

The data, along with the centerline and control limits, are plotted as an X̄-chart in Figure 5.5.

[Figure 5.5 The X̄-chart for x₁ = legal appearances overtime hours, with centerline x̄₁ = 3558, UCL = 5379, and LCL = 1737.]
The legal appearances overtime hours are stable over the period in which the data were collected. The variation in overtime hours appears to be due to common causes, so no special-cause variation is indicated. •

With more than one important characteristic, a multivariate approach should be used to monitor process stability. Such an approach can account for correlations between characteristics and will control the overall probability of falsely signaling a special cause of variation when one is not present. High correlations among the variables can make it impossible to assess the overall error rate that is implied by a large number of univariate charts.

The two most common multivariate charts are (i) the ellipse format chart and (ii) the T²-chart.

Two cases that arise in practice need to be treated differently:

1. Monitoring the stability of a given sample of multivariate observations
2. Setting a control region for future observations

Initially, we consider the use of multivariate control procedures for a sample of multivariate observations x₁, x₂, ..., xₙ. Later, we discuss these procedures when the observations are subgroup means.

Charts for Monitoring a Sample of Individual Multivariate Observations for Stability

We assume that X₁, X₂, ..., Xₙ are independently distributed as N_p(μ, Σ). By Result 4.8,

    Xⱼ − X̄ = (1 − 1/n)Xⱼ − (1/n)X₁ − ⋯ − (1/n)Xⱼ₋₁ − (1/n)Xⱼ₊₁ − ⋯ − (1/n)Xₙ

has

    E(Xⱼ − X̄) = 0   and   Cov(Xⱼ − X̄) = (1 − 1/n)²Σ + (n − 1)(1/n²)Σ = ((n − 1)/n) Σ

Each Xⱼ − X̄ has a normal distribution, but Xⱼ − X̄ is not independent of the sample covariance matrix S. However, to set control limits, we approximate that (Xⱼ − X̄)′S⁻¹(Xⱼ − X̄) has a chi-square distribution.

Ellipse Format Chart. The ellipse format chart for a bivariate control region is the more intuitive of the charts, but its approach is limited to two variables. The two characteristics on the jth unit are plotted as a pair (xⱼ₁, xⱼ₂). The 95% quality ellipse consists of all x that satisfy

    (x − x̄)′S⁻¹(x − x̄) ≤ χ²₂(.05)                                          (5-32)
Example 5.9 (An ellipse format chart for overtime hours) Let us refer to Example 5.8 and create a quality ellipse for the pair of overtime characteristics (legal appearances, extraordinary event) hours. A computer calculation gives

    x̄ = [3558; 1478]   and   S = [ 367844.7   −72093.8
                                   −72093.8  1399053.1]

We illustrate the quality ellipse format chart using the 99% ellipse, which consists of all x that satisfy (x − x̄)′S⁻¹(x − x̄) ≤ χ²₂(.01). Here p = 2, so χ²₂(.01) = 9.21, and the ellipse becomes

    (s₁₁s₂₂ / (s₁₁s₂₂ − s₁₂²)) [ (x₁ − x̄₁)²/s₁₁ − 2s₁₂(x₁ − x̄₁)(x₂ − x̄₂)/(s₁₁s₂₂) + (x₂ − x̄₂)²/s₂₂ ]

      = (367844.7 × 1399053.1 / (367844.7 × 1399053.1 − (−72093.8)²))
        × [ (x₁ − 3558)²/367844.7 + 2(72093.8)(x₁ − 3558)(x₂ − 1478)/(367844.7 × 1399053.1)
            + (x₂ − 1478)²/1399053.1 ] ≤ 9.21

This ellipse format chart is graphed, along with the pairs of data, in Figure 5.6.

[Figure 5.6 The quality control 99% ellipse for legal appearances and extraordinary event overtime.]
Notice that one point, indicated with an arrow, is definitely outside of the ellipse. When a point is out of the control region, individual X̄-charts are constructed. The X̄-chart for x₁ was given in Figure 5.5; that for x₂ is given in Figure 5.7.

When the lower control limit is less than zero for data that must be nonnegative, it is generally set to zero. The LCL = 0 limit is shown by the dashed line in Figure 5.7.

[Figure 5.7 The X̄-chart for x₂ = extraordinary event hours, with centerline x̄₂ = 1478, UCL = 5027, and LCL = −2071 (set to 0).]

Was there a special cause of the single point for extraordinary event overtime that is outside the upper control limit in Figure 5.7? During this period, the United States bombed a foreign capital, and students at Madison were protesting. A majority of the extraordinary overtime was used in that four-week period. Although, by its very definition, extraordinary overtime occurs only when special events occur and is therefore unpredictable, it still has a certain stability. •

T²-Chart. A T²-chart can be applied to a large number of characteristics. Unlike the ellipse format, it is not limited to two variables. Moreover, the points are displayed in time order rather than as a scatter plot, and this makes patterns and trends visible.

For the jth point, we calculate the T²-statistic

    T²ⱼ = (xⱼ − x̄)′S⁻¹(xⱼ − x̄)                                             (5-33)

We then plot the T² values on a time axis. The lower control limit is zero, and we use the upper control limit

    UCL = χ²_p(.05)   or, sometimes,   χ²_p(.01)

There is no centerline in the T²-chart. Notice that the T²-statistic is the same as the quantity d²ⱼ used to test normality in Section 4.6.
Example 5.10 (A T²-chart for overtime hours) Using the police department data in Example 5.8, we construct a T² plot based on the two variables X₁ = legal appearances hours and X₂ = extraordinary event hours. T²-charts with more than two variables are considered in Exercise 5.26. We take α = .01 to be consistent with the ellipse format chart in Example 5.9.

The T²-chart in Figure 5.8 reveals that the pair (legal appearances, extraordinary event) hours for period 11 is out of control. Further investigation, as in Example 5.9, confirms that this is due to the large value of extraordinary event overtime during that period. •

[Figure 5.8 The T²-chart for legal appearances hours and extraordinary event hours, α = .01.]

When the multivariate T²-chart signals that the jth unit is out of control, it should be determined which variables are responsible. A modified region based on Bonferroni intervals is frequently chosen for this purpose. The kth variable is out of control if xⱼₖ does not lie in the interval

    ( x̄ₖ − t_{n−1}(.005/p)√sₖₖ ,  x̄ₖ + t_{n−1}(.005/p)√sₖₖ )

where p is the total number of measured variables.

Example 5.11 (Control of robotic welders: more than T² needed) The assembly of a driveshaft for an automobile requires the circle welding of tube yokes to a tube. The inputs to the automated welding machines must be controlled to be within certain operating limits where a machine produces welds of good quality. In order to control the process, one process engineer measured four critical variables:

    X₁ = Voltage (volts)
    X₂ = Current (amps)
    X₃ = Feed speed (in/min)
    X₄ = (inert) Gas flow (cfm)

Table 5.9 gives the values of these variables at five-second intervals.
[Table 5.9 Welder Data: cases 1 through 40, with the measurements Voltage (X₁) in volts, Current (X₂) in amps, Feed speed (X₃) in in/min, and Gas flow (X₄) in cfm. Source: Data courtesy of Mark Abbotoy.]
The normal assumption is reasonable for most variables, but we take the natural logarithm of gas flow. In addition, there is no appreciable serial correlation for successive observations on each variable.

A T²-chart for the four welding variables is given in Figure 5.9. The dotted line is the 95% limit and the solid line is the 99% limit. Using the 99% limit, no points are out of control, but case 31 is outside the 95% limit.

What do the quality control ellipses (ellipse format charts) show for two variables? Most of the variables are in control. However, the 99% quality ellipse for ln(gas flow) and voltage, shown in Figure 5.10, reveals that case 31 is out of control and that this is due to an unusually large volume of gas flow. The univariate X̄-chart for ln(gas flow), shown in Figure 5.11, also shows that this point is outside the three sigma limits. It appears that gas flow was reset at the target for case 32. All the other univariate X̄-charts have all points within their three sigma control limits.

[Figure 5.9 The T²-chart for the welding data with 95% and 99% limits.]

[Figure 5.10 The 99% quality control ellipse for ln(gas flow) and voltage.]
[Figure 5.11 The univariate X̄-chart for ln(gas flow), with UCL = 4.005, mean = 3.951, and LCL = 3.896.]

In this example, a shift in a single variable was masked with 99% limits, or almost masked (with 95% limits), by being combined into a single T²-value. •

Control Regions for Future Individual Observations

The goal now is to use data x₁, x₂, ..., xₙ, collected when a process is stable, to set a control region for a future observation x or future observations. The region in which a future observation is expected to lie is called a forecast, or prediction, region. If the process is stable, we take the observations to be independently distributed as N_p(μ, Σ). Because these regions are of more general importance than just for monitoring quality, we give the basic distribution theory as Result 5.6.

Result 5.6. Let X₁, X₂, ..., Xₙ be independently distributed as N_p(μ, Σ), and let X be a future observation from the same distribution. Then

    T² = (n/(n + 1)) (X − X̄)′S⁻¹(X − X̄)   is distributed as   ((n − 1)p/(n − p)) F_{p,n−p}

and a 100(1 − α)% p-dimensional prediction ellipsoid is given by all x satisfying

    (x − x̄)′S⁻¹(x − x̄) ≤ ((n² − 1)p/(n(n − p))) F_{p,n−p}(α)

Proof. We first note that X − X̄ has mean 0. Since X is a future observation, X and X̄ are independent, so

    Cov(X − X̄) = Cov(X) + Cov(X̄) = Σ + (1/n)Σ = ((n + 1)/n) Σ

and, by Result 4.8, √(n/(n + 1)) (X − X̄) is distributed as N_p(0, Σ). Now,

    [√(n/(n + 1)) (X − X̄)]′ S⁻¹ [√(n/(n + 1)) (X − X̄)]
which combines a multivariate normal, N_p(0, Σ), random vector and an independent Wishart, W_{p,n−1}(Σ), random matrix in the form

    (multivariate normal)′ ( (Wishart random matrix)/(d.f.) )⁻¹ (multivariate normal)

has the scaled F distribution claimed according to (5-8) and the discussion on page 213. The constant for the ellipsoid follows from (5-6). •

Note that the prediction region in Result 5.6 for a future observed value x is an ellipsoid. It is centered at the initial sample mean x̄, and its axes are determined by the eigenvectors of S. Since

    P[ (X − X̄)′S⁻¹(X − X̄) ≤ ((n² − 1)p/(n(n − p))) F_{p,n−p}(α) ] = 1 − α

before any new observations are taken, the probability that X will fall in the prediction ellipse is 1 − α.

Keep in mind that the current observations must be stable before they can be used to determine control regions for future observations. Based on Result 5.6, we obtain the two charts for future observations.

Control Ellipse for Future Observations

With p = 2, the 95% prediction ellipse in Result 5.6 specializes to

    (x − x̄)′S⁻¹(x − x̄) ≤ ((n² − 1)2/(n(n − 2))) F_{2,n−2}(.05)              (5-34)

Any future observation x is declared to be out of control if it falls out of the control ellipse.

Example 5.12 (A control ellipse for future overtime hours) In Example 5.9, we checked the stability of legal appearances and extraordinary event overtime hours. Let's use these data to determine a control region for future pairs of values. From Example 5.9 and Figure 5.6, we find that the pair of values for period 11 were out of control. We removed this point and determined the new 99% ellipse. All of the points are then in control, so they can serve to determine the 95% prediction region just defined for p = 2. This control ellipse is shown in Figure 5.12 along with the initial 15 stable observations. Any future observation falling in the ellipse is regarded as stable or in control. An observation outside of the ellipse represents a potential out-of-control observation or special-cause variation. •
T² Chart for Future Observations. For each new observation x, plot

    T² = (n/(n + 1)) (x − x̄)′S⁻¹(x − x̄)

in chronological order. Set LCL = 0, and take

    UCL = ((n − 1)p/(n − p)) F_{p,n−p}(.05)

Points above the upper control limit represent potential special cause variation and suggest that the process in question should be examined to determine whether immediate corrective action is warranted. See [9] for discussion of other procedures.

[Figure 5.12 The 95% control ellipse for future legal appearances and extraordinary event overtime.]

Control Charts Based on Subsample Means

It is assumed that each random vector of observations from the process is independently distributed as N_p(μ, Σ). We proceed differently when the sampling procedure specifies that m > 1 units be selected, at the same time, from the process.

From the first sample, we determine its sample mean X̄₁ and covariance matrix S₁. When the population is normal, these two random quantities are independent. For a general subsample mean X̄ⱼ, X̄ⱼ − X̿ has a normal distribution with mean 0 and

    Cov(X̄ⱼ − X̿) = (1 − 1/n)² Cov(X̄ⱼ) + ((n − 1)/n²) Cov(X̄₁) = ((n − 1)/(nm)) Σ
Consequently. of their mean X.nmnp+l Ellipse Format Chart. 2. the ellipse format chart for pairs of subsample means is _ = .nmn1(. which should be checked. x. Further..x)  1 _ ~ < (n . This pooled estimate is Here ( nm .n )S is independent of each and. Notice that we are estimating I internally from the data collected in any given period. These estimators are combined to give a single estimator with a large number of degrees of freedom.1 (536) although the righthand side is usually approximated as x~(. Subsamples corresponding to points outside of the control ellipse should be carefully checked for changes in the behavior of the quality characteristics being measured. . therefore.4.05)/m. the sample covariances from the n subsamples can be combined to give a single estimate (called Spooled in Chapter 6) of the common covariance I.05) (n.) . n.n p + 1) Fp.1)(m. (535) is distributed as (nm. In an analogous fashion to our discussion on individual multivariate observations. we plot the quantity for j = 1. _ (x . Values of Tj that exceed the UCL correspond to potentially outofcontrol or special cause variation.05.1)(m 1)p The UCL is often approximated as ~(. To construct a T 2chart with subsample data and p characteristics.described in Section 6.n )Sis distributed as a Wishart random matrix with nm . T1 Chart..n)p (nm. (See [10]. ( nm . The interested reader is referred to [10] for additional discussion.1SO Chapter 5 Inferences about a Mean Vector where As will be .X) S (x . .n degrees of freedom.1)2 ) Fz. where the UCL = (nmnp+ I) FP ·nmnp+l(.) when n is large.05) ( mnmn.
Control Regions for Future Subsample Observations

Once data are collected from the stable operation of a process, they can be used to set control limits for future observed subsample means.

If X̄ is a future subsample mean, then X̄ − X̿ has a multivariate normal distribution with mean 0 and

    Cov(X̄ − X̿) = Cov(X̄) + (1/n) Cov(X̄₁) = ((n + 1)/(nm)) Σ

Consequently,

    T² = (nm/(n + 1)) (X̄ − X̿)′S⁻¹(X̄ − X̿)

is distributed as ((nm − n)p/(nm − n − p + 1)) F_{p,nm−n−p+1}.

T² Chart for Future Subsample Means. As before, we bring n/(n + 1) into the control limit and plot the quantity

    T² = m(x̄ − x̿)′S⁻¹(x̄ − x̿)

for future sample means in chronological order. Set LCL = 0, and take

    UCL = ((n + 1)(m − 1)p/(nm − n − p + 1)) F_{p,nm−n−p+1}(.05)

The UCL is often approximated as χ²_p(.05) when n is large.

Control Ellipse for Future Subsample Means. The prediction ellipse for a future subsample mean for p = 2 characteristics is defined by the set of all x̄ such that

    (x̄ − x̿)′S⁻¹(x̄ − x̿) ≤ ((n + 1)(m − 1)2/(m(nm − n − 1))) F_{2,nm−n−1}(.05)   (5-37)

where, again, the right-hand side is usually approximated as χ²₂(.05)/m.

Points outside of the prediction ellipse or above the UCL suggest that the current values of the quality characteristics are different in some way from those of the previous stable process. This may be good or bad, but almost certainly warrants a careful search for the reasons for the change.

5.7 Inferences about Mean Vectors When Some Observations Are Missing

Often, some components of a vector observation are unavailable. This may occur because of a breakdown in the recording equipment or because of the unwillingness of a respondent to answer a particular item on a survey questionnaire. The best way to handle incomplete observations, or missing values, depends, to a large extent, on the experimental context.
If the pattern of missing values is closely tied to the value of the response, such as people with extremely high incomes who refuse to respond in a survey on salaries, subsequent inferences may be seriously biased. To date, no statistical techniques have been developed for these cases. However, we are able to treat situations where data are missing at random, that is, cases in which the chance mechanism responsible for the missing values is not influenced by the value of the variables.

A general approach for computing maximum likelihood estimates from incomplete data is given by Dempster, Laird, and Rubin [5]. Their technique, called the EM algorithm, consists of an iterative calculation involving two steps. We call them the prediction and estimation steps:

1. Prediction step. Given some estimate θ̃ of the unknown parameters, predict the contribution of any missing observation to the (complete-data) sufficient statistics.
2. Estimation step. Use the predicted sufficient statistics to compute a revised estimate of the parameters.

The calculation cycles from one step to the other, until the revised estimates do not differ appreciably from the estimate obtained in the previous iteration.

When the observations X₁, X₂, ..., Xₙ are a random sample from a p-variate normal population, the prediction-estimation algorithm is based on the complete-data sufficient statistics [see (4-21)]

    T₁ = Σⱼ₌₁ⁿ Xⱼ = nX̄

and

    T₂ = Σⱼ₌₁ⁿ XⱼXⱼ′ = (n − 1)S + nX̄X̄′

In this case, the algorithm proceeds as follows: We assume that the population mean and variance, μ and Σ, respectively, are unknown and must be estimated.

Prediction step. For each vector xⱼ with missing values, let xⱼ⁽¹⁾ denote the missing components and xⱼ⁽²⁾ denote those components which are available. Thus, xⱼ′ = [xⱼ⁽¹⁾′, xⱼ⁽²⁾′].

Given estimates μ̃ and Σ̃ from the estimation step, use the mean of the conditional normal distribution of x⁽¹⁾, given x⁽²⁾, to estimate the missing values. That is,

    x̃ⱼ⁽¹⁾ = E(Xⱼ⁽¹⁾ | xⱼ⁽²⁾; μ̃, Σ̃) = μ̃⁽¹⁾ + Σ̃₁₂Σ̃₂₂⁻¹(xⱼ⁽²⁾ − μ̃⁽²⁾)          (5-38)

estimates the contribution of xⱼ⁽¹⁾ to T₁. (If all the components of xⱼ are missing, set x̃ⱼ = μ̃ and x̃ⱼx̃ⱼ′ = Σ̃ + μ̃μ̃′.)

Next, the predicted contribution of xⱼ⁽¹⁾ to T₂ is

    E(Xⱼ⁽¹⁾Xⱼ⁽¹⁾′ | xⱼ⁽²⁾; μ̃, Σ̃) = Σ̃₁₁ − Σ̃₁₂Σ̃₂₂⁻¹Σ̃₂₁ + x̃ⱼ⁽¹⁾x̃ⱼ⁽¹⁾′          (5-39)
and, for the cross products,

    E(Xⱼ⁽¹⁾Xⱼ⁽²⁾′ | xⱼ⁽²⁾; μ̃, Σ̃) = x̃ⱼ⁽¹⁾xⱼ⁽²⁾′

The contributions in (5-38) and (5-39) are summed over all xⱼ with missing components. The results are combined with the sample data to yield T̃₁ and T̃₂.

Estimation step. Compute the revised maximum likelihood estimates (see Result 4.11):

    μ̃ = T̃₁/n,    Σ̃ = (1/n)T̃₂ − μ̃μ̃′                                       (5-40)

We illustrate the computational aspects of the prediction-estimation algorithm in Example 5.13.

Example 5.13 (Illustrating the EM algorithm) Estimate the normal population mean μ and covariance Σ using the incomplete data set

    X = [ —  0  3
          7  2  6
          5  1  2
          —  —  5 ]

Here n = 4, p = 3, and parts of observation vectors x₁ and x₄ are missing.

We obtain the initial sample averages

    μ̃₁ = (7 + 5)/2 = 6,   μ̃₂ = (0 + 2 + 1)/3 = 1,   μ̃₃ = (3 + 6 + 2 + 5)/4 = 4

from the available observations. Substituting these averages for any missing values, so that x̃₁₁ = 6, for example, we can obtain initial covariance estimates. We shall construct these estimates using the divisor n because the algorithm eventually produces the maximum likelihood estimate Σ̂. Thus,

    σ̃₁₁ = ((6 − 6)² + (7 − 6)² + (5 − 6)² + (6 − 6)²)/4 = 1/2
    σ̃₂₂ = 1/2,   σ̃₃₃ = 5/2
    σ̃₁₂ = ((6 − 6)(0 − 1) + (7 − 6)(2 − 1) + (5 − 6)(1 − 1) + (6 − 6)(1 − 1))/4 = 1/4
    σ̃₁₃ = 1,   σ̃₂₃ = 3/4

The prediction step consists of using the initial estimates μ̃ and Σ̃ to predict the contributions of the missing values to the sufficient statistics T₁ and T₂. [See (5-38) and (5-39).]

The first component of x₁ is missing,
so we partition μ̃ and Σ̃ as

    μ̃ = [μ̃₁; μ̃₂; μ̃₃] = [μ̃⁽¹⁾; μ̃⁽²⁾],    Σ̃ = [σ̃₁₁  σ̃₁₂  σ̃₁₃; σ̃₁₂  σ̃₂₂  σ̃₂₃; σ̃₁₃  σ̃₂₃  σ̃₃₃] = [Σ̃₁₁  Σ̃₁₂; Σ̃₂₁  Σ̃₂₂]

and predict

    x̃₁₁ = μ̃₁ + Σ̃₁₂Σ̃₂₂⁻¹ [x₁₂ − μ̃₂; x₁₃ − μ̃₃]
        = 6 + [1/4, 1] [1/2  3/4; 3/4  5/2]⁻¹ [0 − 1; 3 − 4] = 5.73

    x̃₁₁² = σ̃₁₁ − Σ̃₁₂Σ̃₂₂⁻¹Σ̃₂₁ + x̃₁₁²
         = 1/2 − [1/4, 1] [1/2  3/4; 3/4  5/2]⁻¹ [1/4; 1] + (5.73)² = 32.89

    x̃₁₁ [x₁₂, x₁₃] = 5.73 [0, 3] = [0, 17.18]

For the two missing components of x₄, we partition μ̃ and Σ̃ with μ̃⁽¹⁾ = [μ̃₁; μ̃₂] and the available component x₄₃, and predict

    [x̃₄₁; x̃₄₂] = [μ̃₁; μ̃₂] + [σ̃₁₃; σ̃₂₃] σ̃₃₃⁻¹ (x₄₃ − μ̃₃)
               = [6; 1] + [1; 3/4] (5/2)⁻¹ (5 − 4) = [6.4; 1.3]

for the contribution to T₁. Also, from (5-39),

    [x̃₄₁²  x̃₄₁x̃₄₂; x̃₄₁x̃₄₂  x̃₄₂²] = [1/2  1/4; 1/4  1/2] − [1; 3/4](5/2)⁻¹[1, 3/4] + [6.4; 1.3][6.4, 1.3]
                                  = [41.06  8.27; 8.27  1.97]

and

    [x̃₄₁; x̃₄₂] x₄₃ = [6.4; 1.3] 5 = [32.00; 6.50]
are the contributions to T₂. Thus, the predicted complete-data sufficient statistics are

    T̃₁ = [x̃₁₁ + x₂₁ + x₃₁ + x̃₄₁; x₁₂ + x₂₂ + x₃₂ + x̃₄₂; x₁₃ + x₂₃ + x₃₃ + x₄₃]
       = [5.73 + 7 + 5 + 6.4; 0 + 2 + 1 + 1.3; 3 + 6 + 2 + 5] = [24.13; 4.30; 16.00]

    T̃₂ = [147.95   27.27  101.18
            27.27    6.97   20.50
           101.18   20.50   74.00]

This completes one prediction step. The next estimation step, using (5-40), provides the revised estimates²

    μ̃ = (1/n)T̃₁ = [6.03; 1.08; 4.00]

    Σ̃ = (1/n)T̃₂ − μ̃μ̃′ = [ .61   .33  1.17
                           .33   .59   .83
                          1.17   .83  2.50]

Note that σ̃₁₁ = .61 and σ̃₂₂ = .59 are larger than the corresponding initial estimates obtained by replacing the missing observations on the first and second variables by the sample means of the remaining values. The third variance estimate σ̃₃₃ remains unchanged, because it is not affected by the missing components.

The iteration between the prediction and estimation steps continues until the elements of μ̃ and Σ̃ remain essentially unchanged. Calculations of this sort are easily handled with a computer. •

² The final entries in Σ̃ are exact to two decimal places.
Once final estimates μ̂ and Σ̂ are obtained and relatively few missing components occur in X, it seems reasonable to treat

    all μ such that n(μ̂ − μ)′Σ̂⁻¹(μ̂ − μ) ≤ χ²_p(α)                          (5-41)

as an approximate 100(1 − α)% confidence ellipsoid. The simultaneous confidence statements would then follow as in Section 5.5, but with x̄ replaced by μ̂ and S replaced by Σ̂.

Caution. The prediction-estimation algorithm we discussed is developed on the basis that component observations are missing at random. If missing values are related to the response levels, then handling the missing values as suggested may introduce serious biases into the estimation procedures. Typically, missing values are related to the responses being measured. Consequently, we must be dubious of any computational scheme that fills in values as if they were lost at random. When more than a few values are missing, it is imperative that the investigator search for the systematic causes that created them.

5.8 Difficulties Due to Time Dependence in Multivariate Observations

For the methods described in this chapter, we have assumed that the multivariate observations X₁, X₂, ..., Xₙ constitute a random sample; that is, they are independent of one another. If the observations are collected over time, this assumption may not be valid. The presence of even a moderate amount of time dependence among the observations can cause serious difficulties for tests, confidence regions, and simultaneous confidence intervals, which are all constructed assuming that independence holds.

We will illustrate the nature of the difficulty when the time dependence can be represented as a multivariate first order autoregressive [AR(1)] model. Let the p × 1 random vector Xₜ follow the multivariate AR(1) model

    Xₜ − μ = Φ(Xₜ₋₁ − μ) + εₜ                                               (5-42)

where the εₜ are independent and identically distributed with E[εₜ] = 0 and Cov(εₜ) = Σ_ε, and all of the eigenvalues of the coefficient matrix Φ are between −1 and 1. Under this model

    Cov(Xₜ, Xₜ₋ᵣ) = Φʳ Σ_X    where    Σ_X = Σⱼ₌₀^∞ Φʲ Σ_ε (Φ′)ʲ

The AR(1) model (5-42) relates the observation at time t to the observation at time t − 1 through the coefficient matrix Φ. The name autoregressive model comes from the fact that (5-42) looks like a multivariate version of a regression with Xₜ as the dependent variable and the previous value Xₜ₋₁ as the independent variable. Further, if all the entries in the coefficient matrix Φ are 0, the autoregressive model says the observations are independent under multivariate normality.
As shown in Johnson and Langeland [8],

    X̄ → μ,    S = (1/(n − 1)) Σₜ₌₁ⁿ (Xₜ − X̄)(Xₜ − X̄)′ → Σ_X

where the arrow above indicates convergence in probability, and, for large n, √n(X̄ − μ) is approximately normal with mean 0 and covariance matrix given by

    (I − Φ)⁻¹Σ_X + Σ_X(I − Φ′)⁻¹ − Σ_X                                      (5-43)

To make the calculations easy, suppose the underlying process has Φ = φI, where |φ| < 1. Now consider the large sample nominal 95% confidence ellipsoid for μ,

    {all μ such that n(X̄ − μ)′S⁻¹(X̄ − μ) ≤ χ²_p(.05)}

This ellipsoid has large sample coverage probability .95 if the observations are independent. If the observations are related by our autoregressive model, however, the covariance in (5-43) becomes ((1 + φ)/(1 − φ))Σ_X, so this ellipsoid has large sample coverage probability

    P[ χ²_p ≤ ((1 − φ)/(1 + φ)) χ²_p(.05) ]

Table 5.10 shows how the coverage probability is related to the coefficient φ and the number of variables p.

Table 5.10 Coverage Probability of the Nominal 95% Confidence Ellipsoid

                      φ
    p       −.25      0      .25      .5
    1       .989    .950    .871    .742
    2       .993    .950    .834    .632
    5       .998    .950    .751    .405
    10      .999    .950    .641    .193
    15     1.000    .950    .548    .090

According to Table 5.10, the coverage probability can drop very low, to .632, even for the bivariate case. The independence assumption is crucial, and the results based on this assumption can be very misleading if the observations are, in fact, dependent.
Supplement 5A

SIMULTANEOUS CONFIDENCE INTERVALS AND ELLIPSES AS SHADOWS OF THE p-DIMENSIONAL ELLIPSOIDS

We begin this supplementary section by establishing the general result concerning the projection (shadow) of an ellipsoid onto a line.

Result 5A.1. Let the constant c > 0 and positive definite p × p matrix A determine the ellipsoid {z: z′A⁻¹z ≤ c²}. For a given vector u ≠ 0, and z belonging to the ellipsoid, the

    (projection (shadow) of {z′A⁻¹z ≤ c²} on u) = (c√(u′Au)/u′u) u

which extends from 0 along u with length c√(u′Au)/√(u′u). When u is a unit vector, the shadow extends c√(u′Au) units, so |z′u| ≤ c√(u′Au). The shadow also extends c√(u′Au) units in the −u direction.

Proof. By Definition 2A.12, the projection of any z on u is given by (z′u)u/u′u. Its squared length is (z′u)²/u′u. We want to maximize this shadow over all z with z′A⁻¹z ≤ c². The extended Cauchy-Schwarz inequality in (2-49) states that (b′d)² ≤ (b′Bb)(d′B⁻¹d), with equality when b = kB⁻¹d. Setting b = z, d = u, and B = A⁻¹, we obtain

    (u′u)(length of projection)² = (z′u)² ≤ (z′A⁻¹z)(u′Au) ≤ c²u′Au   for all z: z′A⁻¹z ≤ c²

The choice z = cAu/√(u′Au) yields equalities and thus gives the maximum shadow, besides belonging to the boundary of the ellipsoid. That is, z′A⁻¹z = c²u′Au/u′Au = c² for this z that provides the longest shadow. Consequently, the projection of the ellipsoid on u is c√(u′Au)u/u′u, and its length is c√(u′Au)/√(u′u). With the unit vector e_u = u/√(u′u), the projection extends

    √(c² e_u′A e_u) e_u = (c√(u′Au)/√(u′u)) e_u   units along u

The projection of the ellipsoid also extends the same length in the direction −u. •
Result 5A.2. Suppose that the ellipsoid {z: z′A⁻¹z ≤ c²} is given and that U = [u₁ ⋮ u₂] is arbitrary but of rank two. Then

    {z in the ellipsoid based on A⁻¹ and c²}
        implies that {for all U, U′z is in the ellipsoid based on (U′AU)⁻¹ and c²}

or

    z′A⁻¹z ≤ c²   implies that   (U′z)′(U′AU)⁻¹(U′z) ≤ c²   for all U

Proof. We first establish a basic inequality. Set P = A^(1/2)U(U′AU)⁻¹U′A^(1/2), where A = A^(1/2)A^(1/2). Note that P = P′ and P² = P, so (I − P)P′ = P − P² = 0. Next, using A⁻¹ = A^(−1/2)A^(−1/2), we write z′A⁻¹z = (A^(−1/2)z)′(A^(−1/2)z) and A^(−1/2)z = PA^(−1/2)z + (I − P)A^(−1/2)z. Then

    z′A⁻¹z = (A^(−1/2)z)′(A^(−1/2)z)
           = (PA^(−1/2)z + (I − P)A^(−1/2)z)′(PA^(−1/2)z + (I − P)A^(−1/2)z)
           = (PA^(−1/2)z)′(PA^(−1/2)z) + ((I − P)A^(−1/2)z)′((I − P)A^(−1/2)z)
           ≥ z′A^(−1/2)P′PA^(−1/2)z = z′A^(−1/2)PA^(−1/2)z = z′U(U′AU)⁻¹U′z          (5A-1)

Since z′A⁻¹z ≤ c² and U was arbitrary, the result follows. •

Our next result establishes the two-dimensional confidence ellipse as a projection of the p-dimensional ellipsoid. (See Figure 5.13.)

[Figure 5.13 The shadow of the ellipsoid z′A⁻¹z ≤ c² on the u₁, u₂ plane is an ellipse.]
Projection on a plane is simplest when the two vectors u₁ and u₂ determining the plane are first converted to perpendicular vectors of unit length.

Result 5A.3. Given the ellipsoid {z: z′A⁻¹z ≤ c²} and two perpendicular unit vectors u₁ and u₂, the projection (or shadow) of {z′A⁻¹z ≤ c²} on the u₁, u₂ plane results in the two-dimensional ellipse {(U′z)′(U′AU)⁻¹(U′z) ≤ c²}, where U = [u₁ ⋮ u₂].

Proof. By Result 2A.2, the projection of a vector z on the u₁, u₂ plane is

    (u₁′z)u₁ + (u₂′z)u₂ = [u₁ ⋮ u₂][u₁′z; u₂′z] = UU′z

The projection of the ellipsoid {z: z′A⁻¹z ≤ c²} consists of all UU′z with z′A⁻¹z ≤ c². Consider the two coordinates U′z of the projection U(U′z). Let z belong to the set {z: z′A⁻¹z ≤ c²} so that UU′z belongs to the shadow of the ellipsoid. By Result 5A.2,

    (U′z)′(U′AU)⁻¹(U′z) ≤ c²

so the ellipse {(U′z)′(U′AU)⁻¹(U′z) ≤ c²} contains the coefficient vectors for the shadow of the ellipsoid.

Let Ua be a vector in the u₁, u₂ plane whose coefficients a belong to the ellipse {a′(U′AU)⁻¹a ≤ c²}. If we set z = AU(U′AU)⁻¹a, it follows that

    U′z = U′AU(U′AU)⁻¹a = a

and

    z′A⁻¹z = a′(U′AU)⁻¹U′AA⁻¹AU(U′AU)⁻¹a = a′(U′AU)⁻¹a ≤ c²

Thus, U′z belongs to the coefficient vector ellipse, and z belongs to the ellipsoid z′A⁻¹z ≤ c². Consequently, the ellipse contains only coefficient vectors from the projection of {z: z′A⁻¹z ≤ c²} onto the u₁, u₂ plane. •

Remark. Projecting the ellipsoid z′A⁻¹z ≤ c² first to the u₁, u₂ plane and then to the line u₁ is the same as projecting it directly to the line determined by u₁. In the context of confidence ellipsoids, the shadows of the two-dimensional ellipses give the single component intervals.

Remark. Results 5A.2 and 5A.3 remain valid if U = [u₁, ..., u_q] consists of 2 < q ≤ p linearly independent columns.
Exercises

5.1. (a) Evaluate T², for testing H₀: μ′ = [7, 11], using the data

    X = [2  12
         8   9
         6   9
         8  10]

(b) Specify the distribution of T² for the situation in (a).
(c) Using (a) and (b), test H₀ at the α = .05 level. What conclusion do you reach?

5.2. Using the data in Example 5.1, verify that T² remains unchanged if each observation xⱼ, j = 1, 2, 3, is replaced by Cxⱼ, where

    C = [1  −1
         1   1]

Note that the observations

    Cxⱼ = [xⱼ₁ − xⱼ₂; xⱼ₁ + xⱼ₂]

yield the data matrix

    [(6 − 9)  (10 − 6)  (8 − 3)
     (6 + 9)  (10 + 6)  (8 + 3)]′

5.3. (a) Use expression (5-15) to evaluate T² for the data in Exercise 5.1.
(b) Use the data in Exercise 5.1 to evaluate Λ in (5-13). Also, evaluate Wilks' lambda.

5.4. Use the sweat data in Table 5.1. (See Example 5.2.)
(a) Determine the axes of the 90% confidence ellipsoid for μ. Determine the lengths of these axes.
(b) Construct Q-Q plots for the observations on sweat rate, sodium content, and potassium content, respectively. Construct the three possible scatter plots for pairs of observations. Does the multivariate normal assumption seem justified in this case? Comment.

5.5. The quantities x̄, S, and S⁻¹ are given in Example 5.3 for the transformed microwave-radiation data. Conduct a test of the null hypothesis H₀: μ′ = [.55, .60] at the α = .05 level of significance. Is your result consistent with the 95% confidence ellipse for μ pictured in Figure 5.1? Explain.

5.6. Verify the Bonferroni inequality in (5-28) for m = 3.
Hint: A Venn diagram for the three events C₁, C₂, and C₃ may help.

5.7. Use the sweat data in Table 5.1. (See Example 5.2.) Find simultaneous 95% T² confidence intervals for μ₁, μ₂, and μ₃ using Result 5.3. Construct the 95% Bonferroni intervals using (5-29). Compare the two sets of intervals.
5.8. From (5-23), we know that T² is equal to the largest squared univariate t-value constructed from the linear combination a′xⱼ with a = S⁻¹(x̄ − μ₀). Using the results in Example 5.3 and the H₀ in Exercise 5.5, evaluate a for the transformed microwave-radiation data. Verify that the t²-value computed with this a is equal to T² in Exercise 5.5.

5.9. Harry Roberts, a naturalist for the Alaska Fish and Game department, studies grizzly bears with the goal of maintaining a healthy population. Measurements on n = 61 bears provided the following summary statistics (see also Exercise 8.23):

    Variable        Weight   Body length   Neck    Girth   Head length   Head width
                     (kg)       (cm)       (cm)    (cm)       (cm)          (cm)
    Sample mean x̄   95.52     164.38      55.69   93.39      17.98         31.13

Covariance matrix

    S = [3266.46  1343.97   731.54  1175.50  162.68  238.37
         1343.97   721.91   324.25   537.35   80.17  117.73
          731.54   324.25   179.28   281.17   39.15   56.80
         1175.50   537.35   281.17   474.98   63.73   94.85
          162.68    80.17    39.15    63.73    9.95   13.88
          238.37   117.73    56.80    94.85   13.88   21.26]

(a) Obtain the large sample 95% simultaneous confidence intervals for the six population mean body measurements.
(b) Obtain the large sample 95% simultaneous confidence ellipse for mean weight and mean girth.
(c) Obtain the 95% Bonferroni confidence intervals for the six means in Part a.
(d) Refer to Part b. Construct the 95% Bonferroni confidence rectangle for the mean weight and mean girth using m = 6. Compare this rectangle with the confidence ellipse in Part b.
(e) Obtain the 95% Bonferroni confidence interval for

    mean head width − mean head length

using m = 6 + 1 = 7 to allow for this statement as well as statements about each individual mean.

5.10. Refer to the bear growth data in Example 1.10 (see Table 1.4). Restrict your attention to the measurements of length.
(a) Obtain the 95% T² simultaneous confidence intervals for the four population means for length.
(b) Refer to Part a. Obtain the 95% T² simultaneous confidence intervals for the three successive yearly increases in mean length.
(c) Obtain the 95% T² confidence ellipse for the mean increase in length from 2 to 3 years and the mean increase in length from 4 to 5 years.
(d) Refer to Parts b and c. Construct the 95% Bonferroni confidence intervals for the set consisting of four mean lengths and three successive yearly increases in mean length.
(e) Refer to Parts c and d. Compare the 95% Bonferroni confidence rectangle for the mean increase in length from 2 to 3 years and the mean increase in length from 4 to 5 years with the confidence ellipse produced by the T² procedure.

5.11. Create a table similar to Table 5.4 using the entries (length of one-at-a-time t-interval)/(length of Bonferroni t-interval).

5.12. A physical anthropologist performed a mineral analysis of nine ancient Peruvian hairs. The results for the chromium (x₁) and strontium (x₂) levels, in parts per million (ppm), were as follows:

    x₁ (Cr):   .48   40.53   2.19    .55    .74    .66    .93    .37    .22
    x₂ (St): 12.57   73.68  11.13  20.03  20.29    .78   4.64    .43   1.08

Source: Benfer and others, "Mineral Analysis of Ancient Peruvian Hair," American Journal of Physical Anthropology, 48, no. 3 (1978), 277-282.

It is known that low levels (less than or equal to .100 ppm) of chromium suggest the presence of diabetes, while strontium is an indication of animal protein intake.
(a) Construct and plot a 90% joint confidence ellipse for the population mean vector μ′ = [μ₁, μ₂], assuming that these nine Peruvian hairs represent a random sample from individuals belonging to a particular ancient Peruvian culture.
(b) Obtain the individual simultaneous 90% confidence intervals for μ₁ and μ₂ by "projecting" the ellipse constructed in Part a on each coordinate axis. (Alternatively, we could use Result 5.3.) Does it appear as if this Peruvian culture has a mean strontium level of 10? That is, are any of the points (μ₁ arbitrary, 10) in the confidence regions? Is [.30, 10]′ a plausible value for μ? Discuss.
(c) Do these data appear to be bivariate normal? Discuss their status with reference to Q-Q plots and a scatter diagram. If the data are not bivariate normal, what implications does this have for the results in Parts a and b?
(d) Repeat the analysis with the obvious "outlying" observation removed. Do the inferences change? Comment.

5.13. Given the data with missing components, use the prediction-estimation algorithm of Section 5.7 to estimate μ and Σ. Determine the initial estimates, and iterate to find the first revised estimates.

5.14. Determine the approximate distribution of −n ln(|Σ̂|/|Σ̂₀|) for the sweat data in Table 5.1. (See Result 5.2.)
Exercises 5.15, 5.16, and 5.17 refer to the following information:

Frequently, some or all of the population characteristics of interest are in the form of attributes. Each individual in the population may then be described in terms of the attributes it possesses. For convenience, attributes are usually numerically coded with respect to their presence or absence. If we let the variable X pertain to a specific attribute, then we can distinguish between the presence or absence of this attribute by defining

    X = { 1  if attribute present
        { 0  if attribute absent

In this way, we can assign numerical values to qualitative characteristics.

When attributes are numerically coded as 0-1 variables, a random sample from the population of interest results in statistics that consist of the counts of the number of sample items that have each distinct set of characteristics. If the sample counts are large, methods for producing simultaneous confidence statements can be easily adapted to situations involving proportions.

We consider the situation where an individual with a particular combination of attributes can be classified into one of q + 1 mutually exclusive and exhaustive categories. The probability distribution for an observation from the population of individuals in q + 1 mutually exclusive and exhaustive categories is known as the multinomial distribution. It has the following structure:

    Category:               1          2      ⋯      k      ⋯      q        q + 1
    Outcome (value):     [1;0;⋮;0]  [0;1;⋮;0]  ⋯  (1 in kth  ⋯  (1 in qth  [0;0;⋮;1]
                                                  position)     position)
    Probability:            p₁         p₂     ⋯     pₖ      ⋯     p_q      p_{q+1} = 1 − Σᵢ₌₁^q pᵢ

An individual from category k will be assigned the ((q + 1) × 1) vector value [0, ..., 0, 1, 0, ..., 0]′ with 1 in the kth position. The corresponding probabilities are denoted by p₁, p₂, ..., p_{q+1}. Since the categories include all possibilities, we take p_{q+1} = 1 − (p₁ + p₂ + ⋯ + p_q).

Let Xⱼ, j = 1, 2, ..., n, be a random sample of size n from the multinomial distribution. The kth component, Xⱼₖ, of Xⱼ is 1 if the observation (individual) is from category k and is 0 otherwise. The random sample X₁, X₂, ..., Xₙ can be converted to a sample proportion vector, which, given the nature of the preceding observations, is a sample mean vector. Thus,

    p̂ = [p̂₁; p̂₂; ⋮; p̂_{q+1}] = (1/n) Σⱼ₌₁ⁿ Xⱼ
and

Cov(p̂) = (1/n) Cov(Xj) = (1/n) Σ

with σkk = pk(1 − pk) and σik = −pi pk, i ≠ k. For large n, the approximate sampling distribution of p̂ is provided by the central limit theorem. We have

√n (p̂ − p) is approximately N(0, Σ)

The normal approximation remains valid when σkk is estimated by σ̂kk = p̂k(1 − p̂k) and σik is estimated by σ̂ik = −p̂i p̂k, i ≠ k.

Since each individual must belong to exactly one category, X(q+1)j = 1 − (X1j + X2j + ... + Xqj), so p̂(q+1) = 1 − (p̂1 + p̂2 + ... + p̂q), and as a result, Σ has rank q. The usual inverse of Σ does not exist, but it is still possible to develop simultaneous 100(1 − α)% confidence intervals for all linear combinations a'p.

Result. Let X1, X2, ..., Xn be a random sample from a q + 1 category multinomial distribution with P[Xjk = 1] = pk, k = 1, 2, ..., q + 1, j = 1, 2, ..., n. Approximate simultaneous 100(1 − α)% confidence regions for all linear combinations a'p = a1 p1 + a2 p2 + ... + a(q+1) p(q+1) are given by the observed values of

a'p̂ ± √(χ²_q(α)) √(a'Σ̂a/n)

provided that n − q is large. Here p̂ = (1/n) Σ_{j=1}^{n} Xj, and Σ̂ = {σ̂ik} is a (q + 1) × (q + 1) matrix with σ̂kk = p̂k(1 − p̂k) and σ̂ik = −p̂i p̂k, i ≠ k. Also, χ²_q(α) is the upper (100α)th percentile of the chi-square distribution with q d.f.

In this result, the requirement that n − q is large is interpreted to mean n p̂k is about 20 or more for each category. We have only touched on the possibilities for the analysis of categorical data. Complete discussions of categorical data analysis are available in [1] and [4].
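These intervals are simple to compute. The following minimal Python sketch (our own illustration, not part of the original text; numpy and scipy are assumed available, and the function name is hypothetical) evaluates a'p̂ ± √(χ²_q(α)) √(a'Σ̂a/n) from the raw category counts:

import numpy as np
from scipy.stats import chi2

def multinomial_simultaneous_ci(counts, a, alpha=0.05):
    # Approximate simultaneous 100(1 - alpha)% interval for a'p (result above).
    counts = np.asarray(counts, dtype=float)
    a = np.asarray(a, dtype=float)
    n = counts.sum()
    p_hat = counts / n                                 # sample proportion vector
    # Estimated covariance: sigma_kk = p_k(1 - p_k), sigma_ik = -p_i p_k
    sigma_hat = np.diag(p_hat) - np.outer(p_hat, p_hat)
    q = len(counts) - 1                                # chi-square d.f.
    half = np.sqrt(chi2.ppf(1 - alpha, q) * a @ sigma_hat @ a / n)
    center = a @ p_hat
    return center - half, center + half

Passing a coordinate vector for a yields an interval for a single proportion; a contrast such as a' = [1, −1, 0, ..., 0] yields an interval for a difference of proportions.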
5.15. Let Xji and Xjk be the ith and kth components, respectively, of Xj.
(a) Show that μi = E(Xji) = pi and σii = Var(Xji) = pi(1 − pi), i = 1, 2, ..., p.
(b) Show that σik = Cov(Xji, Xjk) = −pi pk, i ≠ k. Why must this covariance necessarily be negative?

5.16. As part of a larger marketing research project, a consultant for the Bank of Shorewood wants to know the proportion of savers that uses the bank's facilities as their primary vehicle for saving. The consultant would also like to know the proportions of savers who use the three major competitors: Bank B, Bank C, and Bank D. Each individual contacted in a survey responded to the following question:

Which bank is your primary savings bank?

Response: Bank of Shorewood; Bank B; Bank C; Bank D; Another bank; No savings

A sample of n = 355 people with savings accounts produced the following counts when asked to indicate their primary savings banks (the people with no savings will be ignored in the comparison of savers, so there are five categories):

Bank (category)              Bank of Shorewood   Bank B   Bank C   Bank D   Another bank   Total
Observed number                     105            119      56       25          50         355
Population proportion                p1             p2      p3       p4    p5 = 1 − (p1 + p2 + p3 + p4)
Observed sample proportion    p̂1 = 105/355 = .30

Let the population proportions be

p1 = proportion of savers at Bank of Shorewood
p2 = proportion of savers at Bank B
p3 = proportion of savers at Bank C
p4 = proportion of savers at Bank D
1 − (p1 + p2 + p3 + p4) = proportion of savers at other banks

(a) Construct simultaneous 95% confidence intervals for p1, p2, ..., p5.
(b) Construct a simultaneous 95% confidence interval that allows a comparison of the Bank of Shorewood with its major competitor, Bank B. Interpret this interval.
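For Exercise 5.16, the counts above can be fed directly to the sketch given after the preceding result (again our own hypothetical code, not part of the text):

import numpy as np
# Reuses multinomial_simultaneous_ci from the sketch following the result above.
counts = [105, 119, 56, 25, 50]            # Shorewood, B, C, D, another bank
for k in range(5):                         # part (a): one interval per category
    a = np.zeros(5)
    a[k] = 1.0                             # a'p picks out the single proportion p_k
    print(k + 1, multinomial_simultaneous_ci(counts, a))
a = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # part (b): p1 - p2, Shorewood vs. Bank B
print(multinomial_simultaneous_ci(counts, a))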
5.17. In order to assess the prevalence of a drug problem among high school students in a particular city, a random sample of 200 students from the city's five high schools were surveyed. One of the survey questions and the corresponding responses are as follows:

What is your typical weekly marijuana usage?

Category:             None   Moderate (1-3 joints)   Heavy (4 or more joints)
Number of responses:   117           62                       21

Construct 95% simultaneous confidence intervals for the three proportions p1, p2, and p3 = 1 − (p1 + p2).

The following exercises may require a computer.

5.18. Use the college test data in Table 5.2. (See Example 5.5.)
(a) Test the null hypothesis H0: μ' = [500, 50, 30] versus H1: μ' ≠ [500, 50, 30] at the α = .05 level of significance. Suppose [500, 50, 30]' represent average scores for thousands of college students over the last 10 years. Is there reason to believe that the group of students represented by the scores in Table 5.2 is scoring differently? Explain.
(b) Determine the lengths and directions for the axes of the 95% confidence ellipsoid for μ.
(c) Construct Q-Q plots from the marginal distributions of social science and history, verbal, and science scores. Also, construct the three possible scatter diagrams from the pairs of observations on different variables. Do these data appear to be normally distributed? Discuss.

5.19. Measurements of x1 = stiffness and x2 = bending strength for a sample of n = 30 pieces of a particular grade of lumber are given in Table 5.11. The units are pounds/(inches)². Using the data in the table,
(a) Construct and sketch a 95% confidence ellipse for the pair [μ1, μ2]', where μ1 = E(X1) and μ2 = E(X2).
(b) Suppose μ10 = 2000 and μ20 = 10,000 represent "typical" values for stiffness and bending strength, respectively. Given the result in (a), are the data in Table 5.11 consistent with these values? Explain.

Table 5.11 Lumber Data

x1 (Stiffness:           x2 (Bending   x1 (Stiffness:           x2 (Bending
modulus of elasticity)   strength)     modulus of elasticity)   strength)
1232                      4175         1712                      7749
1115                      6652         1932                      6818
2205                      7612         1820                      9307
1897                     10914         1900                      6457
1932                     10850         2426                     10102
1612                      7627         1558                      7414
1598                      6954         1470                      7556
1804                      8365         1858                      7833
1752                      9469         1587                      8309
2067                      6410         2208                      9559
2365                     10327         1487                      6255
1646                      7320         2206                     10723
1579                      8196         2332                      5430
1880                      9709         2540                     12090
1773                     10370         2322                     10072

Source: Data courtesy of U.S. Forest Products Laboratory.
(c) Is the bivariate normal distribution a viable population model? Explain with reference to Q-Q plots and a scatter diagram.

5.20. A wildlife ecologist measured x1 = tail length (in millimeters) and x2 = wing length (in millimeters) for a sample of n = 45 female hook-billed kites. These data are displayed in Table 5.12. Using the data in the table,
(a) Find and sketch the 95% confidence ellipse for the population means μ1 and μ2. Suppose it is known that μ1 = 190 mm and μ2 = 275 mm for male hook-billed kites. Are these plausible values for the mean tail length and mean wing length for the female birds? Explain.
(b) Construct the simultaneous 95% T²-intervals for μ1 and μ2 and the 95% Bonferroni intervals for μ1 and μ2. Compare the two sets of intervals. What advantage, if any, do the T²-intervals have over the Bonferroni intervals?
(c) Is the bivariate normal distribution a viable population model? Explain with reference to Q-Q plots and a scatter diagram.

Table 5.12 Bird Data

x1 (Tail   x2 (Wing   x1 (Tail   x2 (Wing   x1 (Tail   x2 (Wing
length)    length)    length)    length)    length)    length)
191        284        186        266        173        271
197        285        197        285        194        280
208        288        201        295        198        300
180        273        190        282        180        272
180        275        209        305        190        292
188        280        187        285        191        286
210        283        207        297        196        285
196        288        178        268        207        286
191        271        202        271        209        303
179        257        205        285        179        261
208        289        190        280        186        262
202        285        189        277        174        245
200        272        211        310        181        250
192        282        216        305        189        262
199        280        189        274        188        258

Source: Data courtesy of S. Temple.

5.21. Using the data on bone mineral content in Table 1.8, construct the 95% Bonferroni intervals for the individual means. Also, find the 95% simultaneous T²-intervals. Compare the two sets of intervals.

5.22. A portion of the data contained in Table 6.10 in Chapter 6 is reproduced in Table 5.13. These data represent various costs associated with transporting milk from farms to dairy plants for gasoline trucks. Only the first 25 multivariate observations for gasoline trucks are given. Observations 9 and 21 have been identified as outliers from the full data set of 36 observations. (See [2].)
Table 5.13 Milk Transportation-Cost Data

Fuel (x1)   Repair (x2)   Capital (x3)
16.44        12.43         11.23
 7.19         2.70          3.92
 9.92         1.35          9.75
 4.24         5.78          7.78
11.20         5.05         10.67
14.25         5.78          9.88
13.50        10.98         10.60
13.32        14.27          9.45
29.11        15.09          3.28
12.68         7.61         10.23
 7.51         5.80          8.13
 9.90         3.63          9.13
10.25         5.07         10.17
11.11         6.15          7.61
12.17        14.26         14.39
10.24         2.59          6.09
10.18         6.05         12.14
 8.88         2.70         12.23
12.34         7.73         11.68
 8.51        14.02         12.01
26.16        17.44         16.89
12.95         8.24          7.18
16.93        13.37         17.59
14.70        10.78         14.58
10.32         5.16         17.00

(a) Construct Q-Q plots of the marginal distributions of fuel, repair, and capital costs. Also, construct the three possible scatter diagrams from the pairs of observations on different variables. Are the outliers evident? Repeat the Q-Q plots and the scatter diagrams with the apparent outliers removed. Do the data now appear to be normally distributed? Discuss.
(b) Construct 95% Bonferroni intervals for the individual cost means. Also, find the 95% T²-intervals. Compare the two sets of intervals.

5.23. Consider the 30 observations on male Egyptian skulls for the first time period given in Table 6.13 on page 349.
(a) Construct Q-Q plots of the marginal distributions of the maxbreath, basheight, baslength, and nasheight variables. Also, construct a chi-square plot of the multivariate observations. Do these data appear to be normally distributed? Explain.
(b) Construct 95% Bonferroni intervals for the individual skull dimension variables. Also, find the 95% T²-intervals. Compare the two sets of intervals.

5.24. Using the Madison, Wisconsin, Police Department data in Table 5.8, construct individual X charts for x3 = holdover hours and x4 = COA hours. Do these individual process characteristics seem to be in control? (That is, are they stable?) Comment.
5.25. Refer to Exercise 5.24. Using the data on the holdover and COA overtime hours, construct a quality ellipse and a T²-chart. Does the process represented by the bivariate observations appear to be in control? (That is, is it stable?) Comment. Do you learn something from the multivariate control charts that was not apparent in the individual X charts?

5.26. Construct a T²-chart using the data on x1 = legal appearances overtime hours, x2 = extraordinary event overtime hours, and x3 = holdover overtime hours from Table 5.8. Compare this chart with the chart in Figure 5.8 of Example 5.9. Does plotting T² with an additional characteristic change your conclusion about process stability? Explain.

5.27. Using the data on x3 = holdover hours and x4 = COA hours from Table 5.8, construct a prediction ellipse for a future observation x' = (x3, x4). Remember, a prediction ellipse should be calculated from a stable process. Interpret the result.

5.28. As part of a study of its sheet metal assembly process, a major automobile manufacturer uses sensors that record the deviation from the nominal thickness (millimeters) at six locations on a car. The first four are measured when the car body is complete and the last two are measured on the underbody at an earlier stage of assembly. Data on 50 cars are given in Table 5.14.
(a) The process seems stable for the first 30 cases. Use these cases to estimate S and x̄. Then construct a T²-chart using all of the variables. Include all 50 cases.
(b) Which individual locations seem to show a cause for concern?

5.29. Refer to the car body data in Exercise 5.28. These are all measured as deviations from target value so it is appropriate to test the null hypothesis that the mean vector is zero. Using the first 30 cases, test H0: μ = 0 at α = .05.

5.30. Refer to the data on energy consumption in Exercise 3.18.
(a) Obtain the large sample 95% Bonferroni confidence intervals for the mean consumption of each of the four types, the total of the four, and the difference, petroleum minus natural gas.
(b) Obtain the large sample 95% simultaneous T² intervals for the mean consumption of each of the four types, the total of the four, and the difference, petroleum minus natural gas. Compare with your results for Part a.

5.31. Refer to the data on snow storms in Exercise 3.20.
(a) Find a 95% confidence region for the mean vector after taking an appropriate transformation.
(b) On the same scale, find the 95% Bonferroni confidence intervals for the two component means.
Table 5.14 Car Body Assembly Data

Index   x1   x2   x3   x4   x5   x6
[The body of the table records the six thickness deviations, in millimeters from nominal, for each of the 50 cars, indexed 1 through 50.]

Source: Data courtesy of Darek Ceglarek.
References

1. Agresti, A. Categorical Data Analysis (2nd ed.). New York: John Wiley, 2002.
2. Bacon-Shone, J., and W. K. Fung. "A New Graphical Method for Detecting Single and Multiple Outliers in Univariate and Multivariate Data." Applied Statistics, 36, no. 2 (1987), 153-162.
3. Bickel, P. J., and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, Vol. I (2nd ed.). Upper Saddle River, NJ: Prentice Hall, 2000.
4. Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis: Theory and Practice (Paperback). Cambridge, MA: The MIT Press, 1977.
5. Dempster, A. P., N. M. Laird, and D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)." Journal of the Royal Statistical Society (B), 39, no. 1 (1977), 1-38.
6. Hartley, H. O. "Maximum Likelihood Estimation from Incomplete Data." Biometrics, 14 (1958), 174-194.
7. Hartley, H. O., and R. R. Hocking. "The Analysis of Incomplete Data." Biometrics, 27 (1971), 783-808.
8. Johnson, R. A., and T. Langeland. "A Linear Combinations Test for Detecting Serial Correlation in Multivariate Samples." Topics in Statistical Dependence (1991), Institute of Mathematical Statistics Monograph, Eds. Block, H., et al., 299-313.
9. Johnson, R. A., and R. Li. "Multivariate Statistical Process Control Schemes for Controlling a Mean." Springer Handbook of Engineering Statistics (2006), H. Pham, Ed. Berlin: Springer.
10. Ryan, T. P. Statistical Methods for Quality Improvement (2nd ed.). New York: John Wiley, 2000.
11. Tiku, M. L., and M. Singh. "Robust Statistics for Testing Mean Vectors of Multivariate Distributions." Communications in Statistics-Theory and Methods, 11, no. 9 (1982), 985-1001.
Chapter 6

COMPARISONS OF SEVERAL MULTIVARIATE MEANS

6.1 Introduction

The ideas developed in Chapter 5 can be extended to handle problems involving the comparison of several mean vectors. The theory is a little more complicated and rests on an assumption of multivariate normal distributions or large sample sizes. Similarly, the notation becomes a bit cumbersome. To circumvent these problems, we shall often review univariate procedures for comparing several means and then generalize to the corresponding multivariate cases by analogy. The numerical examples we present will help cement the concepts.

Because comparisons of means frequently (and should) emanate from designed experiments, we take the opportunity to discuss some of the tenets of good experimental practice. A repeated measures design, useful in behavioral studies, is explicitly considered, along with modifications required to analyze growth curves.

We begin by considering pairs of mean vectors. In later sections, we discuss several comparisons among mean vectors arranged according to treatment levels. The corresponding test statistics depend upon a partitioning of the total variation into pieces of variation attributable to the treatment sources and error. This partitioning is known as the multivariate analysis of variance (MANOVA).

6.2 Paired Comparisons and a Repeated Measures Design

Paired Comparisons

Measurements are often recorded under different sets of experimental conditions to see whether the responses differ significantly over these sets. For example, the efficacy of a new drug or of a saturation advertising campaign may be determined by comparing measurements before the "treatment" (drug or advertising) with those
after the treatment. In other situations, two or more treatments can be administered to the same or similar experimental units, and responses can be compared to assess the effects of the treatments.

One rational approach to comparing two treatments, or the presence and absence of a single treatment, is to assign both treatments to the same or identical units (individuals, stores, plots of land, and so forth). The paired responses may then be analyzed by computing their differences, thereby eliminating much of the influence of extraneous unit-to-unit variation.

In the single response (univariate) case, let Xj1 denote the response to treatment 1 (or the response before treatment), and let Xj2 denote the response to treatment 2 (or the response after treatment) for the jth trial. That is, (Xj1, Xj2) are measurements recorded on the jth unit or jth pair of like units. By design, the n differences

Dj = Xj1 − Xj2,  j = 1, 2, ..., n    (6-1)

should reflect only the differential effects of the treatments.

Given that the differences Dj in (6-1) represent independent observations from an N(δ, σ²_d) distribution, the variable

t = (D̄ − δ)/(s_d/√n)    (6-2)

where

D̄ = (1/n) Σ_{j=1}^{n} Dj  and  s²_d = (1/(n − 1)) Σ_{j=1}^{n} (Dj − D̄)²    (6-3)

has a t-distribution with n − 1 d.f. Consequently, an α-level test of

H0: δ = 0 (zero mean difference for treatments)  versus  H1: δ ≠ 0

may be conducted by comparing |t| with t_{n−1}(α/2), the upper 100(α/2)th percentile of a t-distribution with n − 1 d.f. A 100(1 − α)% confidence interval for the mean difference δ = E(Xj1 − Xj2) is provided by the statement

d̄ − t_{n−1}(α/2) s_d/√n ≤ δ ≤ d̄ + t_{n−1}(α/2) s_d/√n    (6-4)

(For example, see [11].)
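A short Python sketch of the interval (6-4) may help fix ideas (our own illustration under the stated normality assumption; numpy and scipy are assumed, and the function name is ours):

import numpy as np
from scipy.stats import t

def paired_t_interval(x1, x2, alpha=0.05):
    # 100(1 - alpha)% confidence interval (6-4) for delta = E(X_j1 - X_j2).
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)  # differences (6-1)
    n = len(d)
    dbar = d.mean()
    s_d = d.std(ddof=1)                  # sample standard deviation of the d_j
    half = t.ppf(1 - alpha / 2, n - 1) * s_d / np.sqrt(n)
    return dbar - half, dbar + half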
Additional notation is required for the multivariate extension of the paired-comparison procedure. It is necessary to distinguish between p responses, two treatments, and n experimental units. We label the p responses within the jth unit as

X1j1 = variable 1 under treatment 1
X1j2 = variable 2 under treatment 1
...
X1jp = variable p under treatment 1
X2j1 = variable 1 under treatment 2
X2j2 = variable 2 under treatment 2
...
X2jp = variable p under treatment 2

and the p paired-difference random variables become

Dj1 = X1j1 − X2j1
Dj2 = X1j2 − X2j2
...
Djp = X1jp − X2jp    (6-5)

Let Dj' = [Dj1, Dj2, ..., Djp], and assume, for j = 1, 2, ..., n, that

E(Dj) = δ = [δ1, δ2, ..., δp]'  and  Cov(Dj) = Σ_d    (6-6)

If, in addition, D1, D2, ..., Dn are independent Np(δ, Σ_d) random vectors, inferences about the vector of mean differences δ can be based upon a T²-statistic. Specifically,

T² = n(D̄ − δ)' S_d⁻¹ (D̄ − δ)    (6-7)

where

D̄ = (1/n) Σ_{j=1}^{n} Dj  and  S_d = (1/(n − 1)) Σ_{j=1}^{n} (Dj − D̄)(Dj − D̄)'    (6-8)

Result 6.1. Let the differences D1, D2, ..., Dn be a random sample from an Np(δ, Σ_d) population. Then

T² = n(D̄ − δ)' S_d⁻¹ (D̄ − δ)

is distributed as an [(n − 1)p/(n − p)] F_{p,n−p} random variable, whatever the true δ and Σ_d. If n and n − p are both large, T² is approximately distributed as a χ²_p random variable, regardless of the form of the underlying population of differences.

Proof. The exact distribution of T² is a restatement of the summary in (5-6), with vectors of differences for the observation vectors. The approximate distribution of T², for n and n − p large, follows from (4-28). ■

The condition δ = 0 is equivalent to "no average difference between the two treatments." For the ith variable, δi > 0 implies that treatment 1 is larger, on average, than treatment 2. In general, inferences about δ can be made using Result 6.1.

Given the observed differences dj' = [dj1, dj2, ..., djp], j = 1, 2, ..., n, corresponding to the random variables in (6-5), an α-level test of H0: δ = 0 versus H1: δ ≠ 0 for an Np(δ, Σ_d) population rejects H0 if the observed

T² = n d̄' S_d⁻¹ d̄ > ((n − 1)p/(n − p)) F_{p,n−p}(α)

where F_{p,n−p}(α) is the upper (100α)th percentile of an F-distribution with p and n − p d.f. Here d̄ and S_d are given by (6-8).
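This test is easy to program. A sketch of ours, under the same assumptions as the sketches above (numpy and scipy available; the function name is hypothetical):

import numpy as np
from scipy.stats import f

def paired_T2_test(X1, X2, alpha=0.05):
    # T2 test of H0: delta = 0 from n paired p-variate observations (Result 6.1).
    D = np.asarray(X1, dtype=float) - np.asarray(X2, dtype=float)  # differences (6-5)
    n, p = D.shape
    dbar = D.mean(axis=0)
    S_d = np.cov(D, rowvar=False)                                  # (6-8)
    T2 = n * dbar @ np.linalg.solve(S_d, dbar)                     # (6-7) with delta = 0
    crit = (n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p)
    return T2, crit, T2 > crit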
A 100(1 − α)% confidence region for δ consists of all δ such that

(d̄ − δ)' S_d⁻¹ (d̄ − δ) ≤ ((n − 1)p/(n(n − p))) F_{p,n−p}(α)    (6-9)

Also, 100(1 − α)% simultaneous confidence intervals for the individual mean differences δi are given by

δi:  d̄i ± √(((n − 1)p/(n − p)) F_{p,n−p}(α)) √(s²_{d_i}/n)    (6-10)

where d̄i is the ith element of d̄ and s²_{d_i} is the ith diagonal element of S_d.

For n − p large, [(n − 1)p/(n − p)] F_{p,n−p}(α) ≈ χ²_p(α) and normality need not be assumed.

The Bonferroni 100(1 − α)% simultaneous confidence intervals for the individual mean differences are

δi:  d̄i ± t_{n−1}(α/2p) √(s²_{d_i}/n)    (6-10a)

where t_{n−1}(α/2p) is the upper 100(α/2p)th percentile of a t-distribution with n − 1 d.f.

Example 6.1 (Checking for a mean difference with paired observations) Municipal wastewater treatment plants are required by law to monitor their discharges into rivers and streams on a regular basis. Concern about the reliability of data from one of these self-monitoring programs led to a study in which samples of effluent were divided and sent to two laboratories for testing. One-half of each sample was sent to the Wisconsin State Laboratory of Hygiene, and one-half was sent to a private commercial laboratory routinely used in the monitoring program. Measurements of biochemical oxygen demand (BOD) and suspended solids (SS) were obtained, for n = 11 sample splits, from the two laboratories. The data are displayed in Table 6.1.

Table 6.1 Effluent Data

            Commercial lab          State lab of hygiene
Sample j    x1j1 (BOD)  x1j2 (SS)   x2j1 (BOD)  x2j2 (SS)
1               6          27           25          15
2               6          23           28          13
3              18          64           36          22
4               8          44           35          29
5              11          30           15          31
6              34          75           44          64
7              28          26           42          30
8              71         124           54          64
9              43          54           34          56
10             33          30           29          20
11             20          14           39          21

Source: Data courtesy of S. Weber.
Do the two laboratories' chemical analyses agree? If differences exist, what is their nature?

The T²-statistic for testing H0: δ' = [δ1, δ2] = [0, 0] is constructed from the differences of paired observations:

dj1 = x1j1 − x2j1:  −19  −22  −18  −27   −4  −10  −14   17    9    4  −19
dj2 = x1j2 − x2j2:   12   10   42   15   −1   11   −4   60   −2   10   −7

Here

d̄ = [d̄1, d̄2]' = [−9.36, 13.27]'  and  S_d = [199.26  88.38; 88.38  418.61]

and

T² = 11[−9.36, 13.27] [199.26  88.38; 88.38  418.61]⁻¹ [−9.36; 13.27] = 13.6

Taking α = .05, we find that [p(n − 1)/(n − p)] F_{p,n−p}(.05) = [2(10)/9] F_{2,9}(.05) = 9.47. Since T² = 13.6 > 9.47, we reject H0 and conclude that there is a nonzero mean difference between the measurements of the two laboratories. It appears, from inspection of the data, that the commercial lab tends to produce lower BOD measurements and higher SS measurements than the State Lab of Hygiene.

The 95% simultaneous confidence intervals for the mean differences δ1 and δ2 can be computed using (6-10). These intervals are

δ1:  d̄1 ± √(((n − 1)p/(n − p)) F_{p,n−p}(α)) √(s²_{d_1}/n) = −9.36 ± √9.47 √(199.26/11)  or  (−22.46, 3.74)
δ2:  13.27 ± √9.47 √(418.61/11)  or  (−5.71, 32.25)

The 95% simultaneous confidence intervals include zero, yet the hypothesis H0: δ = 0 was rejected at the 5% level. What are we to conclude?

The evidence points toward real differences. The point δ = 0 falls outside the 95% confidence region for δ (see Exercise 6.1), and this result is consistent with the T² test. The 95% simultaneous confidence coefficient applies to the entire set of intervals that could be constructed for all possible linear combinations of the form a1δ1 + a2δ2. The particular intervals corresponding to the choices (a1 = 1, a2 = 0) and (a1 = 0, a2 = 1) contain zero. Other choices of a1 and a2 will produce simultaneous intervals that do not contain zero. (If the hypothesis H0: δ = 0 were not rejected, then all simultaneous intervals would include zero.) The Bonferroni simultaneous intervals also cover zero. (See Exercise 6.2.)

Our analysis assumed a normal distribution for the Dj. In fact, the situation is further complicated by the presence of one or, possibly, two outliers. (See Exercise 6.3.) These data can be transformed to data more nearly normal, but with such a small sample, it is difficult to remove the effects of the outlier(s). (See Exercise 6.4.) The numerical results of this example illustrate an unusual circumstance that can occur when making inferences. ■
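The computations of this example can be verified with a few lines of Python (our own check, not part of the original text):

import numpy as np
from scipy.stats import f

d = np.array([[-19, 12], [-22, 10], [-18, 42], [-27, 15], [-4, -1], [-10, 11],
              [-14, -4], [17, 60], [9, -2], [4, 10], [-19, -7]], dtype=float)
n, p = d.shape
dbar = d.mean(axis=0)                                  # [-9.36, 13.27]
S_d = np.cov(d, rowvar=False)                          # [[199.26, 88.38], [88.38, 418.61]]
T2 = n * dbar @ np.linalg.solve(S_d, dbar)             # 13.6
crit = (n - 1) * p / (n - p) * f.ppf(0.95, p, n - p)   # 9.47
half = np.sqrt(crit * np.diag(S_d) / n)                # simultaneous half-widths (6-10)
print(T2 > crit, dbar - half, dbar + half)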
The experimenter in Example 6.1 actually divided a sample by first shaking it and then pouring it rapidly back and forth into two bottles for chemical analysis. This was prudent because a simple division of the sample into two pieces obtained by pouring the top half into one bottle and the remainder into another bottle might result in more suspended solids in the lower half due to settling. The two laboratories would then not be working with the same, or even like, experimental units, and the conclusions would not pertain to laboratory competence, measuring techniques, and so forth.

Whenever an investigator can control the assignment of treatments to experimental units, an appropriate pairing of units and a randomized assignment of treatments can enhance the statistical analysis. Differences, if any, between supposedly identical units must be identified and most-alike units paired. Further, a random assignment of treatment 1 to one unit and treatment 2 to the other unit will help eliminate the systematic effects of uncontrolled sources of variation. Randomization can be implemented by flipping a coin to determine whether the first unit in a pair receives treatment 1 (heads) or treatment 2 (tails). The remaining treatment is then assigned to the other unit. A separate independent randomization is conducted for each pair. One can conceive of the process as follows:

Experimental Design for Paired Comparisons
[Like pairs of experimental units 1, 2, 3, ..., n; within each pair, treatments 1 and 2 are assigned at random.]

We conclude our discussion of paired comparisons by noting that d̄ and S_d, and hence T², may be calculated from the full-sample quantities x̄ and S. Here x̄ is the 2p × 1 vector of sample averages for the p variables on the two treatments given by

x̄' = [x̄11, x̄12, ..., x̄1p, x̄21, x̄22, ..., x̄2p]    (6-11)

and S is the 2p × 2p matrix of sample variances and covariances arranged as

S = [S11 (p×p)  S12 (p×p); S21 (p×p)  S22 (p×p)]    (6-12)
The matrix S11 contains the sample variances and covariances for the p variables on treatment 1. Similarly, S22 contains the sample variances and covariances computed for the p variables on treatment 2. Finally, S12 = S21' are the matrices of sample covariances computed from observations on pairs of treatment 1 and treatment 2 variables.

Defining the (p × 2p) matrix

C = [1 0 ... 0 −1  0 ...  0
     0 1 ... 0  0 −1 ...  0
     ...
     0 0 ... 1  0  0 ... −1]    (6-13)

where the −1 in each row begins in the (p + 1)st column, we can verify (see Exercise 6.9) that

dj = C xj,  j = 1, 2, ..., n

and

d̄ = C x̄  and  S_d = CSC'    (6-14), (6-15)

Thus, it is not necessary first to calculate the differences d1, d2, ..., dn. On the other hand, it is wise to calculate these differences in order to check normality and the assumption of a random sample.

Each row ci' of the matrix C in (6-13) is a contrast vector, because its elements sum to zero. Attention is usually centered on contrasts when comparing treatments. Each contrast is perpendicular to the vector 1' = [1, 1, ..., 1] since ci'1 = 0. The component 1'xj, representing the overall treatment sum, is ignored by the test statistic T² presented in this section.
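The identities (6-14) and (6-15) can be confirmed numerically. In the following sketch of ours, simulated data stand in for actual paired observations:

import numpy as np

p = 2
C = np.hstack([np.eye(p), -np.eye(p)])     # the p x 2p contrast matrix (6-13)
rng = np.random.default_rng(0)
X = rng.normal(size=(11, 2 * p))           # simulated stand-in for the 2p-variate data
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)                # 2p x 2p full-sample covariance (6-12)
D = X[:, :p] - X[:, p:]                    # explicit paired differences
assert np.allclose(C @ xbar, D.mean(axis=0))              # d-bar = C x-bar, (6-14)
assert np.allclose(C @ S @ C.T, np.cov(D, rowvar=False))  # S_d = CSC', (6-15)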
A Repeated Measures Design for Comparing Treatments

Another generalization of the univariate paired t-statistic arises in situations where q treatments are compared with respect to a single response variable. Each subject or experimental unit receives each treatment once over successive periods of time. The jth observation is

Xj = [Xj1, Xj2, ..., Xjq]',  j = 1, 2, ..., n

where Xji is the response to the ith treatment on the jth unit. The name repeated measures stems from the fact that all treatments are administered to each unit.

For comparative purposes, we consider contrasts of the components of μ = E(Xj). These could be

[μ1 − μ2, μ1 − μ3, ..., μ1 − μq]' = C1 μ,  where C1 = [1 −1 0 ... 0; 1 0 −1 ... 0; ...; 1 0 0 ... −1]

or

[μ2 − μ1, μ3 − μ2, ..., μq − μ(q−1)]' = C2 μ,  where C2 = [−1 1 0 ... 0; 0 −1 1 ... 0; ...; 0 0 ... −1 1]

Both C1 and C2 are called contrast matrices, because their q − 1 rows are linearly independent and each is a contrast vector. The nature of the design eliminates much of the influence of unit-to-unit variation on treatment comparisons. Of course, the experimenter should randomize the order in which the treatments are presented to each subject.

When the treatment means are equal, C1μ = C2μ = 0. In general, the hypothesis that there are no differences in treatments (equal treatment means) becomes Cμ = 0 for any choice of the contrast matrix C. Consequently, based on the contrasts Cxj in the observations, we have means Cx̄ and covariance matrix CSC', and we test Cμ = 0 using the T²-statistic

T² = n(Cx̄)'(CSC')⁻¹(Cx̄)

Test for Equality of Treatments in a Repeated Measures Design

Consider an Nq(μ, Σ) population, and let C be a contrast matrix. An α-level test of H0: Cμ = 0 (equal treatment means) versus H1: Cμ ≠ 0 is as follows: Reject H0 if

T² = n(Cx̄)'(CSC')⁻¹(Cx̄) > ((n − 1)(q − 1)/(n − q + 1)) F_{q−1,n−q+1}(α)    (6-16)

where F_{q−1,n−q+1}(α) is the upper (100α)th percentile of an F-distribution with q − 1 and n − q + 1 d.f. Here x̄ and S are the sample mean vector and covariance matrix defined, respectively, by

x̄ = (1/n) Σ_{j=1}^{n} xj  and  S = (1/(n − 1)) Σ_{j=1}^{n} (xj − x̄)(xj − x̄)'

It can be shown that T² does not depend on the particular choice of C.¹

¹ Any pair of contrast matrices C1 and C2 must be related by C1 = BC2, with B nonsingular. This follows because each C has the largest possible number, q − 1, of linearly independent rows, all perpendicular to the vector 1. Then (BC2 x̄)'(BC2SC2'B')⁻¹(BC2 x̄) = (C2 x̄)'B'(B')⁻¹(C2SC2')⁻¹B⁻¹B(C2 x̄) = (C2 x̄)'(C2SC2')⁻¹(C2 x̄), so T² computed with C2 or C1 = BC2 gives the same result.
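A direct implementation of the test (6-16) might look as follows (a sketch of ours; the successive-difference contrast matrix is one valid choice and, as noted above, any other choice gives the same T²):

import numpy as np
from scipy.stats import f

def repeated_measures_test(X, alpha=0.05):
    # Test of H0: C mu = 0 (equal treatment means) via (6-16); X is n x q.
    X = np.asarray(X, dtype=float)
    n, q = X.shape
    C = np.diff(np.eye(q), axis=0)         # (q-1) x q successive-difference contrasts
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    Cx = C @ xbar
    T2 = n * Cx @ np.linalg.solve(C @ S @ C.T, Cx)
    crit = (n - 1) * (q - 1) / (n - q + 1) * f.ppf(1 - alpha, q - 1, n - q + 1)
    return T2, crit, T2 > crit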
A confidence region for contrasts Cμ, with μ the mean of a normal population, is determined by the set of all Cμ such that

n(Cx̄ − Cμ)'(CSC')⁻¹(Cx̄ − Cμ) ≤ ((n − 1)(q − 1)/(n − q + 1)) F_{q−1,n−q+1}(α)    (6-17)

where x̄ and S are as defined in (6-16). Consequently, simultaneous 100(1 − α)% confidence intervals for single contrasts c'μ for any contrast vectors of interest are given by (see Result 5A.1)

c'μ:  c'x̄ ± √(((n − 1)(q − 1)/(n − q + 1)) F_{q−1,n−q+1}(α)) √(c'Sc/n)    (6-18)

Example 6.2 (Testing for equal treatments in a repeated measures design) Improved anesthetics are often developed by first studying their effects on animals. In one study, 19 dogs were initially given the drug pentobarbital. Each dog was then administered carbon dioxide (CO2) at each of two pressure levels. Next, halothane (H) was added, and the administration of CO2 was repeated. The response, milliseconds between heartbeats, was measured for the four treatment combinations:

                        Halothane
                   Present    Absent
CO2 pressure High     3          1
             Low      4          2

Table 6.2 contains the four measurements for each of the 19 dogs, where

Treatment 1 = high CO2 pressure without H
Treatment 2 = low CO2 pressure without H
Treatment 3 = high CO2 pressure with H
Treatment 4 = low CO2 pressure with H

We shall analyze the anesthetizing effects of CO2 pressure and halothane from this repeated-measures design.

There are three treatment contrasts that might be of interest in the experiment. Let μ1, μ2, μ3, and μ4 correspond to the mean responses for treatments 1, 2, 3, and 4, respectively. Then

(μ3 + μ4) − (μ1 + μ2) = halothane contrast representing the difference between the presence and absence of halothane
(μ1 + μ3) − (μ2 + μ4) = CO2 contrast representing the difference between high and low CO2 pressure
(μ1 + μ4) − (μ2 + μ3) = contrast representing the influence of halothane on CO2 pressure differences (H-CO2 pressure "interaction")
Table 6.2 Sleeping-Dog Data

        Treatment
Dog     1      2      3      4
1      426    609    556    600
2      253    236    392    395
3      359    433    349    357
4      432    431    522    600
5      405    426    513    513
6      324    438    507    539
7      310    312    410    456
8      326    326    350    504
9      375    447    547    548
10     286    286    403    422
11     349    382    473    497
12     429    410    488    547
13     348    377    447    514
14     412    473    472    446
15     347    326    455    468
16     434    458    637    524
17     364    367    432    469
18     420    395    508    531
19     397    556    645    625

Source: Data courtesy of Dr. J. Atlee.

With μ' = [μ1, μ2, μ3, μ4], the contrast matrix C is

C = [−1  −1   1   1
      1  −1   1  −1
      1  −1  −1   1]

The data (see Table 6.2) give

x̄ = [368.21, 404.63, 479.26, 502.89]'

and

S = [2819.29  3568.42  2943.49  2295.35
     3568.42  7963.14  5303.98  4065.44
     2943.49  5303.98  6851.32  4499.63
     2295.35  4065.44  4499.63  4878.99]

It can be verified that

Cx̄ = [209.31, −60.05, −12.79]'  and  CSC' = [9432.32  1098.92   927.62
                                              1098.92  5195.84   914.54
                                               927.62   914.54  7557.44]

and

T² = n(Cx̄)'(CSC')⁻¹(Cx̄) = 19(6.11) = 116
With α = .05,

((n − 1)(q − 1)/(n − q + 1)) F_{q−1,n−q+1}(α) = (18(3)/16) F_{3,16}(.05) = (18(3)/16)(3.24) = 10.94

From (6-16), T² = 116 > 10.94, and we reject H0: Cμ = 0 (no treatment effects). To see which of the contrasts are responsible for the rejection of H0, we construct 95% simultaneous confidence intervals for these contrasts. From (6-18), the contrast

c1'μ = (μ3 + μ4) − (μ1 + μ2) = halothane influence

is estimated by the interval

(x̄3 + x̄4) − (x̄1 + x̄2) ± √((18(3)/16) F_{3,16}(.05)) √(c1'Sc1/19) = 209.31 ± √10.94 √(9432.32/19) = 209.31 ± 73.70

where c1' is the first row of C. Similarly, the remaining contrasts are estimated by

CO2 pressure influence = (μ1 + μ3) − (μ2 + μ4):  −60.05 ± √10.94 √(5195.84/19) = −60.05 ± 54.70
H-CO2 pressure "interaction" = (μ1 + μ4) − (μ2 + μ3):  −12.79 ± √10.94 √(7557.44/19) = −12.79 ± 65.97

The first confidence interval implies that there is a halothane effect. The presence of halothane produces longer times between heartbeats. This occurs at both levels of CO2 pressure, since the H-CO2 pressure interaction contrast, (μ1 + μ4) − (μ2 + μ3), is not significantly different from zero. (See the third confidence interval.) The second confidence interval indicates that there is an effect due to CO2 pressure: The lower CO2 pressure produces longer times between heartbeats.

Some caution must be exercised in our interpretation of the results because the trials with halothane must follow those without. The apparent H-effect may be due to a time trend. (Ideally, the time order of all treatments should be determined at random.) ■

The test in (6-16) is appropriate when the covariance matrix, Cov(Xj) = Σ, cannot be assumed to have any special structure. If it is reasonable to assume that Σ has a particular structure, tests designed with this structure in mind have higher power than the one in (6-16). (For Σ with the equal correlation structure (8-14), see a discussion of the "randomized block" design in [17] or [22].)
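Using the summary quantities Cx̄ and CSC' reported in Example 6.2, the T² value and the three simultaneous intervals can be reproduced as follows (our own numerical check, assuming numpy and scipy):

import numpy as np
from scipy.stats import f

n, q = 19, 4
Cx = np.array([209.31, -60.05, -12.79])            # the three estimated contrasts
CSC = np.array([[9432.32, 1098.92,  927.62],
                [1098.92, 5195.84,  914.54],
                [ 927.62,  914.54, 7557.44]])
T2 = n * Cx @ np.linalg.solve(CSC, Cx)             # about 116
crit = (n - 1) * (q - 1) / (n - q + 1) * f.ppf(0.95, q - 1, n - q + 1)  # 10.94
half = np.sqrt(crit * np.diag(CSC) / n)            # 73.70, 54.70, 65.97 via (6-18)
print(T2, crit, np.column_stack([Cx - half, Cx + half]))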
6.3 Comparing Mean Vectors from Two Populations

A T²-statistic for testing the equality of vector means from two multivariate populations can be developed by analogy with the univariate procedure. (See [11] for a discussion of the univariate case.) This T²-statistic is appropriate for comparing responses from one set of experimental settings (population 1) with independent responses from another set of experimental settings (population 2). The comparison can be made without explicitly controlling for unit-to-unit variability, as in the paired-comparison case.

If possible, the experimental units should be randomly assigned to the sets of experimental conditions. Randomization will, to some extent, mitigate the effects of unit-to-unit variability in a subsequent comparison of treatments. Although some precision is lost relative to paired comparisons, the inferences in the two-population case are, ordinarily, applicable to a more general collection of experimental units simply because unit homogeneity is not required.

Consider a random sample of size n1 from population 1 and a sample of size n2 from population 2. The observations on p variables can be arranged as follows:

Sample                                    Summary statistics
(Population 1)  x11, x12, ..., x1n1       x̄1 = (1/n1) Σ_{j=1}^{n1} x1j,  S1 = (1/(n1 − 1)) Σ_{j=1}^{n1} (x1j − x̄1)(x1j − x̄1)'
(Population 2)  x21, x22, ..., x2n2       x̄2 = (1/n2) Σ_{j=1}^{n2} x2j,  S2 = (1/(n2 − 1)) Σ_{j=1}^{n2} (x2j − x̄2)(x2j − x̄2)'

In this notation, the first subscript (1 or 2) denotes the population.

We want to make inferences about

(mean vector of population 1) − (mean vector of population 2) = μ1 − μ2

For instance, we shall want to answer the question, Is μ1 = μ2 (or, equivalently, is μ1 − μ2 = 0)? Also, if μ1 − μ2 ≠ 0, which component means are different? With a few tentative assumptions, we are able to provide answers to these questions.

Assumptions Concerning the Structure of the Data

1. The sample X11, X12, ..., X1n1 is a random sample of size n1 from a p-variate population with mean vector μ1 and covariance matrix Σ1.
2. The sample X21, X22, ..., X2n2 is a random sample of size n2 from a p-variate population with mean vector μ2 and covariance matrix Σ2.
3. Also, X11, X12, ..., X1n1 are independent of X21, X22, ..., X2n2.    (6-19)

We shall see later that, for large samples, this structure is sufficient for making inferences about the p × 1 vector μ1 − μ2. However, when the sample sizes n1 and n2 are small, more assumptions are needed.
Further Assumptions When n1 and n2 Are Small

1. Both populations are multivariate normal.
2. Also, Σ1 = Σ2 (same covariance matrix).    (6-20)

The second assumption, that Σ1 = Σ2, is much stronger than its univariate counterpart. Here we are assuming that several pairs of variances and covariances are nearly equal.

When Σ1 = Σ2 = Σ, Σ_{j=1}^{n1} (x1j − x̄1)(x1j − x̄1)' is an estimate of (n1 − 1)Σ and Σ_{j=1}^{n2} (x2j − x̄2)(x2j − x̄2)' is an estimate of (n2 − 1)Σ. Consequently, we can pool the information in both samples in order to estimate the common covariance Σ. We set

S_pooled = [Σ_{j=1}^{n1} (x1j − x̄1)(x1j − x̄1)' + Σ_{j=1}^{n2} (x2j − x̄2)(x2j − x̄2)'] / (n1 + n2 − 2)
         = ((n1 − 1)/(n1 + n2 − 2)) S1 + ((n2 − 1)/(n1 + n2 − 2)) S2    (6-21)

Since Σ_{j=1}^{n1} (x1j − x̄1)(x1j − x̄1)' has n1 − 1 d.f. and Σ_{j=1}^{n2} (x2j − x̄2)(x2j − x̄2)' has n2 − 1 d.f., the divisor (n1 − 1) + (n2 − 1) in (6-21) is obtained by combining the two component degrees of freedom. [See (4-24).] Additional support for the pooling procedure comes from consideration of the multivariate normal likelihood. (See Exercise 6.11.)

To test the hypothesis that μ1 − μ2 = δ0, a specified vector, we consider the squared statistical distance from x̄1 − x̄2 to δ0. Now,

E(X̄1 − X̄2) = E(X̄1) − E(X̄2) = μ1 − μ2

Since the independence assumption in (6-19) implies that X̄1 and X̄2 are independent and thus Cov(X̄1, X̄2) = 0 (see Result 4.5), it follows from (3-9) that

Cov(X̄1 − X̄2) = Cov(X̄1) + Cov(X̄2) = (1/n1)Σ + (1/n2)Σ = (1/n1 + 1/n2)Σ    (6-22)

Because S_pooled estimates Σ, we see that (1/n1 + 1/n2) S_pooled is an estimator of Cov(X̄1 − X̄2).

The likelihood ratio test of H0: μ1 − μ2 = δ0 is based on the square of the statistical distance, T², and is given by (see [1]): Reject H0 if

T² = (x̄1 − x̄2 − δ0)' [(1/n1 + 1/n2) S_pooled]⁻¹ (x̄1 − x̄2 − δ0) > c²    (6-23)

where the critical distance c² is determined from the distribution of the two-sample T²-statistic, given in Result 6.2 below.
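A sketch of the pooled test (6-21)-(6-23), written by us with the critical constant c² taken from Result 6.2 below (numpy and scipy assumed; names hypothetical):

import numpy as np
from scipy.stats import f

def two_sample_T2(X1, X2, alpha=0.05):
    # Pooled-covariance T2 test of H0: mu1 - mu2 = 0, via (6-21) and (6-23).
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    (n1, p), n2 = X1.shape, X2.shape[0]
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
                + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    T2 = diff @ np.linalg.solve((1 / n1 + 1 / n2) * S_pooled, diff)
    c2 = ((n1 + n2 - 2) * p / (n1 + n2 - p - 1)
          * f.ppf(1 - alpha, p, n1 + n2 - p - 1))   # critical constant, Result 6.2
    return T2, c2, T2 > c2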
Result 6.2. If X11, X12, ..., X1n1 is a random sample of size n1 from Np(μ1, Σ) and X21, X22, ..., X2n2 is an independent random sample of size n2 from Np(μ2, Σ), then

T² = [X̄1 − X̄2 − (μ1 − μ2)]' [(1/n1 + 1/n2) S_pooled]⁻¹ [X̄1 − X̄2 − (μ1 − μ2)]

is distributed as ((n1 + n2 − 2)p/(n1 + n2 − p − 1)) F_{p,n1+n2−p−1}. Consequently,

P[ (X̄1 − X̄2 − (μ1 − μ2))' [(1/n1 + 1/n2) S_pooled]⁻¹ (X̄1 − X̄2 − (μ1 − μ2)) ≤ c² ] = 1 − α    (6-24)

where

c² = ((n1 + n2 − 2)p/(n1 + n2 − p − 1)) F_{p,n1+n2−p−1}(α)

Proof. We first note that

X̄1 − X̄2 = (1/n1)X11 + (1/n1)X12 + ... + (1/n1)X1n1 − (1/n2)X21 − (1/n2)X22 − ... − (1/n2)X2n2

is distributed as Np(μ1 − μ2, (1/n1 + 1/n2)Σ) by Result 4.8, with c1 = c2 = ... = c_{n1} = 1/n1 and c_{n1+1} = c_{n1+2} = ... = c_{n1+n2} = −1/n2. According to (4-23), (n1 − 1)S1 is distributed as W_{n1−1}(Σ) and (n2 − 1)S2 as W_{n2−1}(Σ). By assumption, the X1j's and the X2j's are independent, so (n1 − 1)S1 and (n2 − 1)S2 are also independent. From (4-24), (n1 − 1)S1 + (n2 − 1)S2 is then distributed as W_{n1+n2−2}(Σ). Therefore,

T² = (1/n1 + 1/n2)^{−1/2} (X̄1 − X̄2 − (μ1 − μ2))' S_pooled⁻¹ (1/n1 + 1/n2)^{−1/2} (X̄1 − X̄2 − (μ1 − μ2))
   = (multivariate normal random vector)' (Wishart random matrix / d.f.)⁻¹ (multivariate normal random vector)
   = Np(0, Σ)' [W_{n1+n2−2}(Σ)/(n1 + n2 − 2)]⁻¹ Np(0, Σ)

which is the T²-distribution specified in (5-8), with n replaced by n1 + n2 − 1. [See (5-5) for the relation to F.] ■

We are primarily interested in confidence regions for μ1 − μ2. From (6-24), we conclude that all μ1 − μ2 within squared statistical distance c² of x̄1 − x̄2 constitute the confidence region. This region is an ellipsoid centered at the observed difference x̄1 − x̄2 and whose axes are determined by the eigenvalues and eigenvectors of S_pooled (or S_pooled⁻¹).

Example 6.3 (Constructing a confidence region for the difference of two mean vectors) Fifty bars of soap are manufactured in each of two ways. Two characteristics, X1 = lather and X2 = mildness, are measured. The summary statistics for bars produced by methods 1 and 2 are

x̄1 = [8.3, 4.1]',   S1 = [2 1; 1 6]
x̄2 = [10.2, 3.9]',  S2 = [2 1; 1 4]

Obtain a 95% confidence region for μ1 − μ2.

We first note that S1 and S2 are approximately equal, so that it is reasonable to pool them. Hence, from (6-21),

S_pooled = (49/98) S1 + (49/98) S2 = [2 1; 1 5]

Also,

x̄1 − x̄2 = [−1.9, .2]'

so the confidence ellipse is centered at [−1.9, .2]'. The eigenvalues and eigenvectors of S_pooled are obtained from the equation

0 = |S_pooled − λI| = (2 − λ)(5 − λ) − 1 = λ² − 7λ + 9

so λ = (7 ± √(49 − 36))/2. Consequently, λ1 = 5.303 and λ2 = 1.697, and the corresponding eigenvectors, e1 and e2, determined from S_pooled ei = λi ei, i = 1, 2, are

e1 = [.290, .957]'  and  e2 = [.957, −.290]'

By Result 6.2,

(1/n1 + 1/n2)c² = (1/50 + 1/50)((98)(2)/97) F_{2,97}(.05) = .25

since F_{2,97}(.05) = 3.1. The confidence ellipse extends √λi √((1/n1 + 1/n2)c²) = √λi √.25
units along the eigenvector ei, or 1.15 units in the e1 direction and .65 units in the e2 direction. The 95% confidence ellipse is shown in Figure 6.1. Clearly, μ1 − μ2 = 0 is not in the ellipse, and we conclude that the two methods of manufacturing soap produce different results. It appears as if the two processes produce bars of soap with about the same mildness (X2), but those from the second process have more lather (X1). ■

Figure 6.1 95% confidence ellipse for μ1 − μ2.
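The eigenanalysis behind Figure 6.1 can be reproduced numerically from the summary statistics of Example 6.3 (our own check; note that numpy returns the eigenvalues in ascending order):

import numpy as np
from scipy.stats import f

n1 = n2 = 50
S_pooled = np.array([[2.0, 1.0], [1.0, 5.0]])
center = np.array([-1.9, 0.2])                        # xbar1 - xbar2
lam, e = np.linalg.eigh(S_pooled)                     # eigenvalues 1.697 and 5.303
c2 = (n1 + n2 - 2) * 2 / (n1 + n2 - 3) * f.ppf(0.95, 2, n1 + n2 - 3)
half_lengths = np.sqrt(lam * (1 / n1 + 1 / n2) * c2)  # about .65 and 1.15
print(center, lam, e, half_lengths)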
Simultaneous Confidence Intervals

It is possible to derive simultaneous confidence intervals for the components of the vector μ1 − μ2. These confidence intervals are developed from a consideration of all possible linear combinations of the differences in the mean vectors. It is assumed that the parent multivariate populations are normal with a common covariance Σ.

Result 6.3. Let c² = ((n1 + n2 − 2)p/(n1 + n2 − p − 1)) F_{p,n1+n2−p−1}(α). With probability 1 − α,

a'(X̄1 − X̄2) ± c √(a'(1/n1 + 1/n2) S_pooled a)

will cover a'(μ1 − μ2) for all a. In particular, μ1i − μ2i will be covered by

(X̄1i − X̄2i) ± c √((1/n1 + 1/n2) s_{ii,pooled})  for i = 1, 2, ..., p

Proof. Consider univariate linear combinations of the observations X11, X12, ..., X1n1 and X21, X22, ..., X2n2 given by a'X1j = a1X1j1 + a2X1j2 + ... + apX1jp and a'X2j = a1X2j1 + a2X2j2 + ... + apX2jp. These linear combinations have sample means and variances a'x̄1, a'S1a and a'x̄2, a'S2a, respectively, where x̄1, S1, and x̄2, S2 are the mean and covariance statistics for the two original samples. (See Result 3.5.) When both parent populations have the same covariance matrix Σ, s²_{1,a} = a'S1a and s²_{2,a} = a'S2a are both estimators of a'Σa, the common population variance of the linear combinations a'X1 and a'X2. Pooling these estimators, we obtain

s²_{a,pooled} = ((n1 − 1)s²_{1,a} + (n2 − 1)s²_{2,a})/(n1 + n2 − 2) = a'[((n1 − 1)S1 + (n2 − 1)S2)/(n1 + n2 − 2)]a = a' S_pooled a    (6-25)

To test H0: a'(μ1 − μ2) = a'δ0, on the basis of the a'X1j and a'X2j, we can form the square of the univariate two-sample t-statistic

t²_a = [a'(X̄1 − X̄2) − a'(μ1 − μ2)]² / ((1/n1 + 1/n2) a' S_pooled a)    (6-26)

According to the maximization lemma (2-50), with d = X̄1 − X̄2 − (μ1 − μ2) and B = (1/n1 + 1/n2) S_pooled,

t²_a ≤ (X̄1 − X̄2 − (μ1 − μ2))' [(1/n1 + 1/n2) S_pooled]⁻¹ (X̄1 − X̄2 − (μ1 − μ2)) = T²

for all a ≠ 0. Thus,

(1 − α) = P[T² ≤ c²] = P[t²_a ≤ c² for all a] = P[ |a'(X̄1 − X̄2) − a'(μ1 − μ2)| ≤ c √(a'(1/n1 + 1/n2) S_pooled a) for all a ]

where c² is selected according to Result 6.2. ■

Remark. For testing H0: μ1 − μ2 = 0, the linear combination a'(x̄1 − x̄2), with coefficient vector â ∝ S_pooled⁻¹(x̄1 − x̄2), quantifies the largest population difference. That is, if T² rejects H0, then a'(x̄1 − x̄2) will have a nonzero mean. Frequently, we try to interpret the components of this linear combination for both subject matter and statistical importance.

Example 6.4 (Calculating simultaneous confidence intervals for the differences in mean components) Samples of sizes n1 = 45 and n2 = 55 were taken of Wisconsin homeowners with and without air conditioning, respectively. (Data courtesy of Statistical Laboratory, University of Wisconsin.) Two measurements of electrical usage (in kilowatt hours) were considered. The first is a measure of total on-peak consumption (X1) during July, and the second is a measure of total off-peak consumption (X2) during July. The resulting summary statistics are

x̄1 = [204.4, 556.6]',  S1 = [13825.3  23823.4; 23823.4  73107.4],  n1 = 45
x̄2 = [130.0, 355.0]',  S2 = [8632.0  19616.7; 19616.7  55964.5],  n2 = 55
(The off-peak consumption is higher than the on-peak consumption because there are more off-peak hours in a month.)

Let us find 95% simultaneous confidence intervals for the differences in the mean components. Although there appears to be somewhat of a discrepancy in the sample variances, for illustrative purposes we proceed to a calculation of the pooled sample covariance matrix. Here

S_pooled = ((n1 − 1)/(n1 + n2 − 2)) S1 + ((n2 − 1)/(n1 + n2 − 2)) S2 = [10963.7  21505.5; 21505.5  63661.3]

and

c² = ((n1 + n2 − 2)p/(n1 + n2 − p − 1)) F_{p,n1+n2−p−1}(.05) = (98(2)/97) F_{2,97}(.05) = (2.02)(3.1) = 6.26

With μ1' − μ2' = [μ11 − μ21, μ12 − μ22], the 95% simultaneous confidence intervals for the population differences are

μ11 − μ21:  (204.4 − 130.0) ± √6.26 √((1/45 + 1/55)10963.7)  or  21.7 ≤ μ11 − μ21 ≤ 127.1  (on-peak)
μ12 − μ22:  (556.6 − 355.0) ± √6.26 √((1/45 + 1/55)63661.3)  or  74.7 ≤ μ12 − μ22 ≤ 328.5  (off-peak)

We conclude that there is a difference in electrical consumption between those with air-conditioning and those without. This difference is evident in both on-peak and off-peak consumption.

The 95% confidence ellipse for μ1 − μ2 is determined from the eigenvalue-eigenvector pairs λ1 = 71323.5, e1' = [.336, .942] and λ2 = 3301.5, e2' = [.942, −.336]. Since

√λ1 √((1/n1 + 1/n2)c²) = √71323.5 √((.0404)(6.26)) = 134.3
√λ2 √((1/n1 + 1/n2)c²) = √3301.5 √((.0404)(6.26)) = 28.9

we obtain the 95% confidence ellipse for μ1 − μ2 sketched in Figure 6.2 on page 291. Because the confidence ellipse for the difference in means does not cover 0' = [0, 0], the T²-statistic will reject H0: μ1 − μ2 = 0 at the 5% level.
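These intervals follow directly from the summary statistics (a short check of ours, with numpy and scipy assumed):

import numpy as np
from scipy.stats import f

n1, n2, p = 45, 55, 2
diff = np.array([204.4 - 130.0, 556.6 - 355.0])
S_pooled = np.array([[10963.7, 21505.5], [21505.5, 63661.3]])
c2 = (n1 + n2 - 2) * p / (n1 + n2 - p - 1) * f.ppf(0.95, p, n1 + n2 - p - 1)  # about 6.26
half = np.sqrt(c2 * (1 / n1 + 1 / n2) * np.diag(S_pooled))
print(np.column_stack([diff - half, diff + half]))  # about (21.7, 127.1) and (74.7, 328.5)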
Figure 6.2 95% confidence ellipse for μ1 − μ2 = (μ11 − μ21, μ12 − μ22).

The coefficient vector for the linear combination most responsible for rejection is proportional to S_pooled⁻¹(x̄1 − x̄2). (See Exercise 6.7.) ■

The Bonferroni 100(1 − α)% simultaneous confidence intervals for the p population mean differences are

μ1i − μ2i:  (x̄1i − x̄2i) ± t_{n1+n2−2}(α/2p) √((1/n1 + 1/n2) s_{ii,pooled})

where t_{n1+n2−2}(α/2p) is the upper 100(α/2p)th percentile of a t-distribution with n1 + n2 − 2 d.f.

The Two-Sample Situation When Σ1 ≠ Σ2

When Σ1 ≠ Σ2, we are unable to find a "distance" measure like T², whose distribution does not depend on the unknowns Σ1 and Σ2. Bartlett's test [3] is used to test the equality of Σ1 and Σ2 in terms of generalized variances. Unfortunately, the conclusions can be seriously misleading when the populations are nonnormal. Nonnormality and unequal covariances cannot be separated with Bartlett's test. (See also Section 6.6.) A method of testing the equality of two covariance matrices that is less sensitive to the assumption of multivariate normality has been proposed by Tiku and Balakrishnan [23]. However, more practical experience is needed with this test before we can recommend it unconditionally.

We suggest, without much factual support, that any discrepancy of the order σ1,ii = 4σ2,ii, or vice versa, is probably serious. This is true in the univariate case. The size of the discrepancies that are critical in the multivariate situation probably depends, to a large extent, on the number of variables p.

A transformation may improve things when the marginal variances are quite different. However, for n1 and n2 large, we can avoid the complexities due to unequal covariance matrices.
Result 6.4. Let the sample sizes be such that n1 − p and n2 − p are large. Then, an approximate 100(1 − α)% confidence ellipsoid for μ1 − μ2 is given by all μ1 − μ2 satisfying

[x̄1 − x̄2 − (μ1 − μ2)]' [(1/n1)S1 + (1/n2)S2]⁻¹ [x̄1 − x̄2 − (μ1 − μ2)] ≤ χ²_p(α)

where χ²_p(α) is the upper (100α)th percentile of a chi-square distribution with p d.f. Also, 100(1 − α)% simultaneous confidence intervals for all linear combinations a'(μ1 − μ2) are provided by

a'(μ1 − μ2) belongs to a'(x̄1 − x̄2) ± √(χ²_p(α)) √(a'((1/n1)S1 + (1/n2)S2)a)

Proof. By the central limit theorem, X̄1 − X̄2 is nearly Np[μ1 − μ2, n1⁻¹Σ1 + n2⁻¹Σ2]. If Σ1 and Σ2 were known, the square of the statistical distance from X̄1 − X̄2 to μ1 − μ2 would be

[X̄1 − X̄2 − (μ1 − μ2)]' [(1/n1)Σ1 + (1/n2)Σ2]⁻¹ [X̄1 − X̄2 − (μ1 − μ2)]

This squared distance has an approximate χ²_p-distribution. When n1 and n2 are large, with high probability, S1 will be close to Σ1 and S2 will be close to Σ2. Consequently, the approximation holds with S1 and S2 in place of Σ1 and Σ2, respectively. The results concerning the simultaneous confidence intervals follow from Result 5A.1. ■

Remark. If n1 = n2 = n, then (n − 1)/(n + n − 2) = 1/2, so

(1/n1)S1 + (1/n2)S2 = (1/n)(S1 + S2) = [((n − 1)S1 + (n − 1)S2)/(n + n − 2)](1/n + 1/n) = S_pooled (1/n + 1/n)

With equal sample sizes, the large sample procedure is essentially the same as the procedure based on the pooled covariance matrix. (See Result 6.2.) In one dimension, it is well known that the effect of unequal variances is least when n1 = n2 and greatest when n1 is much less than n2, or vice versa.
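Result 6.4 translates into a few lines of code (a sketch of ours; the function name is hypothetical):

import numpy as np
from scipy.stats import chi2

def large_sample_mean_diff_intervals(xbar1, S1, n1, xbar2, S2, n2, alpha=0.05):
    # Simultaneous intervals for the component differences mu_1i - mu_2i (Result 6.4).
    xbar1, xbar2 = np.asarray(xbar1, dtype=float), np.asarray(xbar2, dtype=float)
    V = np.asarray(S1, dtype=float) / n1 + np.asarray(S2, dtype=float) / n2
    half = np.sqrt(chi2.ppf(1 - alpha, len(xbar1)) * np.diag(V))
    diff = xbar1 - xbar2
    return diff - half, diff + half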
Example 6.5 (Large sample procedures for inferences about the difference in means) We shall analyze the electrical-consumption data discussed in Example 6.4 using the large sample approach. We first calculate

(1/n1)S1 + (1/n2)S2 = (1/45)[13825.3  23823.4; 23823.4  73107.4] + (1/55)[8632.0  19616.7; 19616.7  55964.5] = [464.17  886.08; 886.08  2642.15]

The 95% simultaneous confidence intervals for the linear combinations

a'(μ1 − μ2) = [1, 0][μ11 − μ21; μ12 − μ22] = μ11 − μ21
a'(μ1 − μ2) = [0, 1][μ11 − μ21; μ12 − μ22] = μ12 − μ22

are (see Result 6.4)

μ11 − μ21:  74.4 ± √5.99 √464.17   or  (21.7, 127.1)
μ12 − μ22:  201.6 ± √5.99 √2642.15  or  (75.8, 327.4)

Notice that these intervals differ negligibly from the intervals in Example 6.4, where the pooling procedure was employed. The T²-statistic for testing H0: μ1 − μ2 = 0 is

T² = [x̄1 − x̄2]' [(1/n1)S1 + (1/n2)S2]⁻¹ [x̄1 − x̄2]
   = [204.4 − 130.0, 556.6 − 355.0] [464.17  886.08; 886.08  2642.15]⁻¹ [74.4; 201.6]
   = [74.4, 201.6] (10⁻⁴)[59.874  −20.080; −20.080  10.519] [74.4; 201.6] = 15.66

For α = .05, the critical value is χ²₂(.05) = 5.99. Since T² = 15.66 > χ²₂(.05) = 5.99, we reject H0.

The most critical linear combination leading to the rejection of H0 has coefficient vector

â ∝ [(1/n1)S1 + (1/n2)S2]⁻¹ (x̄1 − x̄2) = (10⁻⁴)[59.874  −20.080; −20.080  10.519] [74.4; 201.6] = [.041; .063]

The difference in off-peak electrical consumption between those with air conditioning and those without contributes more than the corresponding difference in on-peak consumption to the rejection of H0: μ1 − μ2 = 0. ■
A statistic similar to T² that is less sensitive to outlying observations for small and moderately sized samples has been developed by Tiku and Singh [24]. However, if the sample size is moderate to large, Hotelling's T² is remarkably unaffected by slight departures from normality and/or the presence of a few outliers.

An Approximation to the Distribution of T² for Normal Populations When Sample Sizes Are Not Large

One can test H0: μ1 − μ2 = 0 when the population covariance matrices are unequal even if the two sample sizes are not large, provided the two populations are multivariate normal. This situation is often called the multivariate Behrens-Fisher problem. The result requires that both sample sizes n1 and n2 are greater than p, the number of variables. The approach depends on an approximation to the distribution of the statistic

T² = [X̄1 − X̄2 − (μ1 − μ2)]' [(1/n1)S1 + (1/n2)S2]⁻¹ [X̄1 − X̄2 − (μ1 − μ2)]    (6-27)

which is identical to the large sample statistic in Result 6.4. However, instead of using the chi-square approximation to obtain the critical value for testing H0, the recommended approximation for smaller samples (see [15] and [19]) is given by

T² is distributed approximately as (νp/(ν − p + 1)) F_{p,ν−p+1}    (6-28)

where the degrees of freedom ν are estimated from the sample covariance matrices using the relation

ν = (p + p²) / Σ_{i=1}^{2} (1/n_i){ tr[ ((1/n_i)S_i ((1/n1)S1 + (1/n2)S2)⁻¹ )² ] + ( tr[ (1/n_i)S_i ((1/n1)S1 + (1/n2)S2)⁻¹ ] )² }    (6-29)

where min(n1, n2) ≤ ν ≤ n1 + n2. This approximation reduces to the usual Welch solution to the Behrens-Fisher problem in the univariate (p = 1) case.

With moderate sample sizes and two normal populations, the approximate level α test for equality of means rejects H0: μ1 − μ2 = 0 if

(x̄1 − x̄2 − (μ1 − μ2))' [(1/n1)S1 + (1/n2)S2]⁻¹ (x̄1 − x̄2 − (μ1 − μ2)) > (νp/(ν − p + 1)) F_{p,ν−p+1}(α)

where the degrees of freedom ν are given by (6-29). This procedure is consistent with the large samples procedure in Result 6.4 except that the critical value χ²_p(α) is replaced by the larger constant (νp/(ν − p + 1)) F_{p,ν−p+1}(α).

Similarly, the approximate 100(1 − α)% confidence region is given by all μ1 − μ2 such that

(x̄1 − x̄2 − (μ1 − μ2))' [(1/n1)S1 + (1/n2)S2]⁻¹ (x̄1 − x̄2 − (μ1 − μ2)) ≤ (νp/(ν − p + 1)) F_{p,ν−p+1}(α)    (6-30)
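The degrees-of-freedom estimate (6-29) and the resulting critical value can be computed as follows (our sketch; scipy's F distribution accepts the noninteger degrees of freedom that typically result):

import numpy as np
from scipy.stats import f

def behrens_fisher_nu(S1, n1, S2, n2):
    # Estimated degrees of freedom nu in (6-29); S1, S2 are p x p covariance matrices.
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    V_inv = np.linalg.inv(S1 / n1 + S2 / n2)
    p = S1.shape[0]
    denom = 0.0
    for S, n in ((S1, n1), (S2, n2)):
        A = (S / n) @ V_inv
        denom += (np.trace(A @ A) + np.trace(A) ** 2) / n
    return (p + p ** 2) / denom

def behrens_fisher_critical(S1, n1, S2, n2, alpha=0.05):
    # Critical constant (nu p / (nu - p + 1)) F_{p, nu-p+1}(alpha) from (6-28).
    nu = behrens_fisher_nu(S1, n1, S2, n2)
    p = np.asarray(S1).shape[0]
    return nu * p / (nu - p + 1) * f.ppf(1 - alpha, p, nu - p + 1)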
Example 6.6 (The approximate T² distribution when Σ1 ≠ Σ2) Although the sample sizes are rather large for the electrical consumption data in Example 6.4, we use these data and the calculations in Example 6.5 to illustrate the computations leading to the approximate distribution of T² when the population covariance matrices are unequal.

We first calculate

(1/n1)S1 = (1/45)[13825.3  23823.4; 23823.4  73107.4] = [307.227  529.409; 529.409  1624.609]
(1/n2)S2 = (1/55)[8632.0  19616.7; 19616.7  55964.5] = [156.945  356.667; 356.667  1017.536]

and using a result from Example 6.5,

[(1/n1)S1 + (1/n2)S2]⁻¹ = (10⁻⁴)[59.874  −20.080; −20.080  10.519]

Consequently,

(1/n1)S1 [(1/n1)S1 + (1/n2)S2]⁻¹ = [307.227  529.409; 529.409  1624.609](10⁻⁴)[59.874  −20.080; −20.080  10.519] = [.776  −.060; −.092  .646]

and

((1/n1)S1 [(1/n1)S1 + (1/n2)S2]⁻¹)² = [.776  −.060; −.092  .646][.776  −.060; −.092  .646] = [.608  −.085; −.131  .423]
6 1 X '.055 + . Population 1: Population 2: Population g: Xtt•X 1 2. • from each of g populations.m. = __!_{ (. f = 1.. 5% level.:omponents differ significantly.J '.6 and the a = .. are arranged as ·.5..2+ 1(.0678 + ... This is the same conclusimf~ reached with the large sample procedure described in Example 6. Random samples..0095 77. 2. ~ Using (629).:! v _ vp p + 155.296 Chapter 6 Comparisons of Several Multivariate Means Then 1 2 ) ··Ffi 1 2 ]) } ~4 . X 22.. . which mean ...2 _ f. ~~ wi~t 1 6.423) + (.0671 ~· A ·fM.~ .'~ ~. the estimated degrees of freedom vis · v = 2 . . . .. ..66 so the:. .. collected·.J ~ ··.05) = F .:!: ·" ·c•. the Fp..131) + (.X8 n8 (631) MAN OVA is used first to investigate whether the population mean vectors are the same and.l! __!_ {tr[(_!_si(_!_si + _!_s2)n1 n1 n1 n2 ) + (tr[_!_s 1(_!_s 1 + _!_s 2)n1 n1 n2 = 45 {(. more than two populations need to be compared. A slightly more conservative approach is to use tli~ integer part of v. if not. the observed value of the test statistic is T2 = 15.32 77 .6.224 + .. The random samples from different populations are independe~' ·.4 Comparing Several Multivariate Population Means (OneWay MANOVA) Often... Assumptions about the Structure of the Data forOne~ay ~ANO~~. ··~ __!_ {tr((__!_s2(__!_sl + _!_s2)112 n2 n1 n2 ) 2 ) 1 + (tr[_!_s 2(_!_s 1 n2 n1 + _!_s 2)n2 ]) 2 } ·. v.~ '.·. .~ ·<i . Xcne• is a random sample of size ntfromapopulatton With meaD#£ .646) 2} = 1 1 ::= . •.6 2 + 1 2 76 ..····Xtn 1 X 21. .12 = 6. . X2.p+ 1 distribution can be defined noninteger degrees of freedom.354)2} =. g. From Example 6. Xc2._ p+ 1(. ~ As was the case in Example 6..6 2 +~ = 77.608 + .JL 2 = 0 is rejected at the. hypothesis H 0 : 1'! .05 critical value is ~! . Xgl> Xg 2 ..776 + .05) = 3.~ · L Xn.5.
. . + (observation) overall ) ( sample mean (xc. + Tc.) or J. and therefore. = J. All populations have a common covariance matrix :t. the assumptions are that Xn.Lc = J. .L + (J. 1 = p. it is convenient to investigate the deviations Te associated with the fth population (treatment). Each population is multivariate normal. f = 1. u 2) random variables. . it is customary to impose the constraint t=l ± nrrc = 0. such as J.• distributed as N(p.J.L.Lc .Lc. Although the null hypothesis of equality of means could be formulated as p. A review of the univariate analysis of variance (ANOVA) will facilitate our discussion of the multivariate assumptions and solution methods. and a component due to the specific population. and that the random samples are independent. it is customary to regard J.x) is an estimate of rc.L + ( Tc + ecj (overall mean) treatment) effect 3 (random) (63 ) error where the ecj are independent N(O. The null hypothesis becomes H0 : r 1 = r 2 = · · · = r 8 = 0 The response XI'.. g.Lc + overall) ( mean Tc ( fth population) mean fth population ) ( (treatment) effect (6 32) leads to a restatement of the hypothesis of equality of means. Xf2.L· Populations usually correspond to different sets of experimental conditions.x) estimated ) ( treatment effect + (xcj. 2 = · · · = p.L. we can write J. To define uniquely the model parameters and their least squares estimates. A Summary of Univariate ANOVA In the univariate situation. 3.Lc as the sum of an overall mean component.Lc ..8 . .Comparing Several Multivariate Population Means (Oneway MANOVA} 297 2. 2. For instance. can be expressed in the suggestive form Xc.L + Tc where Tc = J.. Xcne is a random sample from an N(J. the analysis of variance is based upon an analogous decomposition of the observations.:Xc) (634) (residual) where xis an estimate of J.Lc = J. The reparameterization J.:Xc) is an estimate of the error efj.13) when the sample sizes nc are large. Condition 3 can be relaxed by appealing to the central limit theorem (Result 4.p. Motivated by the decomposition in (633). u 2 ). . u 2 ) population. rc = (:Xc. and (xcj.
.•• = 12 + (2)2 + 12 + (1)2 + 12 + 12 + SSobs (1)2 + 02 = 10 The sums of squares satisfy the same decomposition. 2 = 2 and + 2)(3 x= (9 + 6 + 9 + 0 +2+ 3 = XJI = ~ + (i3 . variances computed from SS" and SSres should be approximately equal.i 3 ) = 4 + (2 . 3. The size of an array is quantified by stringing the rows of the array out into a vector and calculating its squared length. 6. SSmean = 42 + 42 + 42 + 42 + 42 + 42 + 42 + 42 = 8( 42) = 128 2 2 SSt.7 (The sum of squares decomposition for univariate ANOVA) Consider the following independent samples. For the observations. 9 Population 2: 0. and residual (error) components. for example. 2 Population 3: Since. 9. we construct the vector y' = [9.2) = 4 + ( 2) + 1 Repeating this operation for each observation.i of Tc always satisfy _2: n/re = 0. i3 = (3 + 1 3 + 1 + 2)(8 = 4. If Ho is true. Under H 0. we find that 3. (634). treatment. H0 should be rejected. as the observations. we obtain the arrays G: :) G: :) observation (xcj) mean (i) + ( + (: + treatment effect + residual (ie. each ie is an f=l estimate of zero. = SSmean + SStr + SSres or216 = 128 + 78 + 10.4) + (3 .298 Chapter 6 Comparisons of Several Multivariate Means Example 6. (Our estig mates ie = ie .ie) The question of equality of means is answered by assessing whether the contribution of the treatment array is large relative to the residuals. = 42 + 4 + 4 + (3) 2 + ( 3) 2 + ( 2) 2 + ( 2) 2 + ( 2)2 2 + 2(3i + 3(2/ = 78 = 3(4 ) and the residual sum of squares is ss.i) + (x31 .) If the treatment contribution is large. This quantity is called the sum of squares (SS). 6.i) =i =i J ~: :J (xti. 2].. 0.. 1. Its squared length is Similarly. An analysis of variance proceeds by comparing the relative sizes of SS" and ss. The breakup into sums of squares apportions variability in the combined samples into mean. • . 1. 2. Population 1: 9. Consequently.
the mean vector xl = [x. without having to calculate the individual residuals.x)(xej . and residuals are orthogonal. xgn. . X1 111 . x]' must lie along the equiangular line of 1. The observation vector y can lie anywhere in n = n 1 + n 2 + · · · + ng 'i!imensions. . we have verified that the arrays representing the mean.2 .Comparing Several Multivariate Population Means (Oneway MANOVA) 299 ' The sum of squares decomposition illustrated numerically in Example 6. 1 . considered as vectors. . 2.x) 2 + 2: f=l 2. The vector representations of the arrays involved in the decomp_osition (634) also have geometric interpretations that provide the degrees of freedom. treatment effects.. 2.. x2.. we could obtain SSres by subtraction. (xcj .x) = e we get C=I ( ~. For an arbitrary set of observations.SSmean .x)u 2 + · · · + (x 8 x)u 8 . ..xd + 2(xc .• . (xfj ..SS 1. .xe) We can sum both sides over j.xl + (xej . . However. x 1 .7 is so basic that the algebraic equivalent will now be developed. .x) 0 0 0 0 }·· + (x 2 .x) = (xe . Consequently. x 2 I> . x 8 . this is false economy because plots of the residuals provide checks on the assumptions of the model. .es) 2 f=l j=l + (SSt..xc) x) + = 0.xt) (SS.x) 0 0 0 0 0 0 1 1 0 0 } n2 + + (x8  x) 1 1 }··  (x 1 . . xel Next. x21> .x)u 1 + (1 2 . let [ x 11 . these arrays..•...•. are perpendicular whatever the observation vector y' = [xll. because SSres = SSobs . J = y'.) + (636) In the course of establishing (636). . ~ 2 2. . summing both sides over 1. note that ~ j=I (xej .cor 2: ne(ie (C=I g 2 x) + 2: 2: (xfj f=I j)=I ( xd SSres (635) ) = SStr ) total (corrected) SS between (samples) SS g + within (samples) SS g ~ or L l:x1j t=l j=l (SSobs) g "t ( n 1 + n2 + · · + n8 )x 2 + (SSmean) 2: nr(ic... That is. x 2 . and the treatment effect vector 1 1 (x1 . . Subtracting x from both sides of (634) and squaring gives 2 (xrj .J.. and obtain ~ (xcjj=l x) 2 = ne(xe 2 ~ (xejj=l g n.
the mean . The total number of degrees of freedom is: n = n~ + n 2 + · · · + n 8 • Alternatively...300 Chapter 6 Comparisons of Several Multivariate Means lies in the hyperplane of linear combinations of the g vectors JJ 1 . or for large values of 1 + ss...1 di~~ mension~ The residual Vector.(Xl) . to SS 1" and n . L i=l (xei . toSS.f.s~es· The statistic appropriate for a multivariate generalization rejects H0 for small values of the reciprocal 1 1 + ss" .:Xe) 2 L f=l L neg f=I C=I g ± nc 1 The usual Ftest rejects Ho: T1 = T2 = · · · = T 8 = 0 at level a if 8 ) SS.) g. g .g f=l > FgI. o 8 . = ~ ne(:Xe f=l g :X) 2 = n.:l:ncg(a) is the upper (100a)th percentile of the Fdistribution with g . · To summarize.f.g degrees of freedom.X)u j is .g d. (637) ..1 = n .. 2:nr. and it is always perpendicular to the treatment vector.j(g . 1 d. This is equivalent to rejecting Ho for large values of ss. o 2 . we attribute 1 d.. to SSmean. .1 and 2:nc . by appealing to the univariate distribution theory.e.Xncg( a) where Fg!.g "' ... vector has the freedom to lie anywhere along the onedimensional equiangular line and the treatment vector has the freedom to lie anywhere in the other g .. The calculations of the sums of squares and the associated degrees of freedom are conveniently summarized by an AN OVA table.. .1) F = /( SS.(g ... Since . = Y .f.1)... we find that these are the degrees of freedom for the chisquare distributions· associated with the corresponding sums of squares. ss.10. (n 1 + n 2 + · · · + n 8 ) . (See Exercise 6.1 SSu SS..) Thus. the mean vector also lies in this hyperplane.g ~ that is perpendicular to their hyperplane.:X)u 1 + · · · + (x g . 8 perpendicular to both the mean vector and the treatment effect vector and has the freedom to lie anywhere in the subspace of dimension n .ss.((XI .... ss. 1 = o 1 + o 2 + · · · + o 8 ..ss.. + ss.f... e ANOVA Table for Comparing Univariate Population Means Source of variation neatments Residual (error) Total (corrected for the mean) Sum of squares (SS) Degrees of freedom (d.
Here the parameter vector 1' is an overall mean (level).2. First we note that the product (xei . I) variables. and Te represents the eth treatment effect with L neTc = 0..1 = 7 L f=l g SStr/(g .i)' .. . .i) estimated) treatment ( effect Tc + (xei .01) = 13. g + (observation) ( overall sa~ple) mean/' (ie .g = (3 + 2 + 3) .2.5 ss.g) 10/5 T2 Since F = 19.1) 78/2 = . F = Sum of squares SStr = 78 SSres = 10 SScor = 88 Degrees of freedom g1=31=2 f=l f ne .g (638) where the eei are independent Np(O. .i)(xei .= 19.7. The errors for the components of Xei are correlated. we have the following ANOVA table: Source of variation Treatments Residual Total (corrected) Consequently.ne and e= 1. 5 (. j = 1. = T3 = 0 (no treatment • Multivariate Analysis of Variance (MANOVA) Paralleling the univariate reparameterization. e=I According to the model in (638). we specify the MANOVA model: MANOVA Model For Comparing g Population Mean Vectors Xej = 1' + Te + eej.•• /(Ine ..8 (A univariate A NOVA table and Ftest for treatment effects) Using the information in Example 6. we reject H 0 : T 1 = effect) at the 1% level of significance.. Thus. but the covariance matrix I is the same for all populations. .ie) (639) The decomposition in (639) leads to the multivariate analog of the univariate sum of squares breakup in (635).Comparing Several Multivariate Population Means (Oneway MAN OVA) 301 Example 6. A vector of observations may be decomposed as suggested by the model.5 > F2 .each component of the observation vector Xei satisfies the univariate model (633).3 = 5 ne.27.
we summarize the calculations leading to the test statistic in a MAN OVA table.i)' ic) (xc.ir)' + (xe.x}(xe. . C=l j=! C=l total (corrected) sum of squares and cross J ( products 1 1 treatment (!!etween)l sum of squares and ( cross products residual (Within) sum) of squares and cross ( products The within sum of squares and cross products matrix can be expressed as w == 2: 2: (xr. . Hence.i}(ic..i)] [(xc. H0 : r 1 = Tz = · · · = T 11 == 0 is tested by considering the relative sizes of the treatment and residual sums of squares and cross products.ie) == 0.i)(xe. Equivalently.ic) (xc...ic)' + (ic .ic)' g. 8 g "r B+W= .ic) +(it. e and j yields ic)' (640) j=l ~~~ 2: 2.i)]' == (xc.i)(xc.2.302 Chapter 6 Comparisons of Several Multivariate Means can be written as (xei.i)' == [(xr. f=l j=l g n.xr)' 1)~ == (n 1 l)SJ + (nz + ···+ (n8 (641)  l)S8 where Se is the sample covariance matrix for the Cth sample.ie) (ic .i)' + (it.. we may consider the relative sizes of the residual and total (corrected) sum of squares and cross products.1 W= C=i 2: j=l (xei . the hypothesis of no treatment effects. ne(ic.f.i)' f=l j=l = 2.. (x 1.i) (ie x/ The sum over j of the middle two expressions is the zero matrix..2)Spooled matrix encountered in the twosample case. ic)(x.1 f=l .2...) B == 2: nc(ir f=l g i) (ic . i (xej . It plays a dominant role in testing for the presence of treatment effects.ic) (Xej. MANOVA Table for Comparing Population Mean Vectors Source of variation Treatment Residual (Error) Total (corrected for the mean) Matrix of sum of squares and cross products (SSP) Degrees of freedom ( d.i)' + g~ 2: 2.. Formally. This matrix is a generalization of the (n 1 + n2 . because.it) + (ic..f=l j=l nc i)' ~ ne. . (XCj. summing the cross product over J...i)(xe.2. (xc.. Analogous to the univariate result.
x)(ie .g ~ Fz(gI).* = IW 1/1 B + WI.3 Distribution ofWilks' Lambda.\. Wilks' lambda has the virtue of being convenient and related to the likelihood ratio criterion.x)' I (642) is too small. the rank of B. such as Pillai's statistic. as the AN OVA table. For other cases and large sample sizes.== Tg = 0 involves generalized variances. For example. (See [1]) One test of H0 : T 1 == 'Tz == • ·. component by component.Comparing Several Multivariate Population Means (Oneway MANOVA) 303 This table is exactly the same form.* can be derived for the special cases listed in Table 6.c: 1 g=3 ( ev VA* . (See the additional discussion on page 336.1B 2 as where s = min (p. and Roy's largest root statistic can also be written as particular functions of the eigenvalues of W" 1 B.*== IWI IB +WI ~~ ~ (xei.!:n. essentially equivalent.c: 1 g=2 VA*) 1 1) w (~nep p.g) g:__1 e A. g .* due to Bartlett (see [4]) can be used to test H 0 • Table 6.2("ln.2("ln.p1 A.p2) 2 Wilks' lambda can also be expressed as a function of the eigenvalues of .rl) p ."ln.p . the LawleyHotelling statistic. proposed originally by Wilks (see [25]). a modification of A. (xc . all of these statistics are.*) A* p=2 g2!2 (~negg _ p . 2 The exact distribution of A.x) 2 becomes (ie .ie) (xej. ) A" A. ••• .2) p c ~ FgI. except that squares of scalars are replaced by their vector counterparts.t\. A. Wereject H0 if the ratio of generalized variances A.) . corresponds to the equivalent form (637) of the Ftest of H0 : no treatment effects in the univariate case.* Fp. 1 A*=D ( 1+.i)'.1) e~) ~ne .ie)'l I~ ~ (xei  x)(xei .*= IWI/IB +WI No.*)  F2p. of variables p=1 No. The quantity A. For large samples. of groups Sampling distribution for multivariate normal data g2!2 ( ~ne . A.1 ).3. The degrees of freedom correspond to the univariate geometry and also to some multivariate distribution theory involving Wishart densities. of w. Other statistics for checking the equality of se~eral multivariate means.
and n 3 = 3.(gl)(a) is the upper (100a)th percentile of a chisquare distribution with p(g . treatment effect.128 = 88 Repeating this operation for the obs.1) d.[8] 4 .f. for 2:nc = n large.SSmean = 216 . Consequently. we have (! ~ 7) 8 9 7 (observation) (~ ~ 5) + (=~ =~ 1) 3 ( + (~ =~ 3) 0 1 1 5 5 5 (mean) 3 3 treatment) effect (residual) . The sample sizes are n 1 = 3.0. we reject H 0 at significance level a if (p+g)) ( lwl ) In jB + Wj > .0.304 Chapter 6 Comparisons of Several Multivariate Means Bartlett (see [4]) has shown that if H 0 is true and 2:nc = n is large.1) d. we obtain [:J [~] [~] [~] [~] [~] [~] [~] w1thx 1 = . Example 6. and residual in our discussion of univariate ANOVA.1  (643) has approximately a chisquare distribution with p(g . Arranging the observation pairs Xcj in rows.(p +g)) in( IB+WI ) IWI 2 2 .f.7. andi = [:] We have already expressed the observations on the first variable as the sum of an overall mean.(p +g)) InA*= (n. We found that G: :) G: :) (observation) and + (=~ =~ J ( treatment) effect + ( : ~: :) (mean) (residual) SSobs = SSmean + SStr + SSres 216 = 128 + 78 + 10 Total SS (corrected) = SSobs .1. .9 (A MANOVA table and Wilks' lambda for testing the equality of three mean vectors) Suppose an additional variable is observed along with the variable introduced in Example 6.( n.ervations on the second variable. (n 1.(gIJ(a) 2 (644) where . n 2 = 2.
the MANOVA table takes the following form: Source of variation Treatment Residual Matrix of sum of squares and cross products Degrees of freedom [ 78 12 12] 48 2:] 11] 72 3 .mean cross product = 149 .1 = 2 [ [ 1~ 88 11 3+2+33=5 Total (corrected) 7 Equation (640) is verified by noting that Using (642).160 = 11 Thus. + 0(1) = 1 Total: 9(3) + 6(2) + 9(7) + 0(4) + .0385 88(72).200 = 72 These two singlecomponent analyses must be augmented with the sum of entrybyentry cross products in order to complete the entries in the MANOVA table. we get 88 I 11 Ill 72 10(24) .. Proceeding row by row in the arrays for the two variables..= .. + SS... 1(1) + (2)(2) + 1(3) + (1)(2) + . · + 2(7) = 149 Total (corrected) cross product = total cross product .(1) 2 239 ''''..es 272 = 200 + 48 + 24 Total SS (corrected) = SSobs .= ..SSmean = 272 . we obtain the cross product contributions: Mean: 4(5) + 4(5) + · · · + 4(5) = 8(4)(5) = 160 Treatment: 3(4)(1) + 2(3)(3) + 3(2)(3) = 12 Residual.(11) 2 6215 .Comparing Several Multivariate Population Means (Oneway MANOVA) 305 and SSobs = SSmean + SSt..
.521 . intermediate care facility.01) = 7.6and 4.167J l2.g. Group Number of observations Sample mean vectors e = 1 (private) e= 2 (nonprofit) XI = l2. Table 6.082 . the residual vectors ecj = Xcj . To carry out the test.1) = 4 and ..1) = 819 3 1 . v2 = 2(Lnc.Ollevel and conclude that treatment differences exist.480 . computed on a perpatientday basis and measured in hours per patient day.596 . Nursing homes can be classified on the basis of ownership (private party. Example 6. When the number of variables.1) = (1(g .X2 = cost of dietary labor.3. One purpose of a recent study was to investigate the effects of ownership or certification (or both) on costs.01. Four costs.125 l .306 Chapter 6 Comparisons of Several Multivariate Means Since p = 2 and g = 3. p.360 . is large. Still. the MANOVA table is usually not constructed. A total of n = 516 observations on each of the p = 4 cost variables were initially separated according to ownership..066J .418 .273l 2. or a tombination of the two). Also. Since 8.1) = 8 d.7 of Chapter 4. and government) and certification (skilled nursing facility. X 3 = cost of plant operation and maintenance labor. were selected for analysis: X 1 = cost of nursing labor. effects) versus H 1: at least one Tc 0 is available.10 (A multivariate analysis of Wisconsin nursing home data) The Wisconsin Department of Health and Social Services reimburses nursing homes in the state for the services provided. and X 4 = cost of housekeeping and laundry labor. mean wage rate.f. and average wage rate in the state.383 e = 3 (government) 3 ~ nc = 516 c~1 .Xc should be examined for normality and the presence of outliers using the techniquesdiscussed in Sections 4.124 . Summary statistics for each of the g = 3 groups are given in the following table.1) Y. nonprofit organization.0385 V]385) (8.3 indicates that an exact test (assuming normal·. • . The department develops a set of formulas for rates for each facility. with a percentage point of an Fdistribution having v 1 = 2(g . we compare~ the test statistic ' * l( W) w (Lnc g. ity and equal group covariance matrices) of H 0 : T1 = Tz = T 3 = 0 (no treatment. it is good practice to have the computer print the matrices Band W so that especially large entries can be located. XJ = . 8(.19 > F4. we reject H0 at the_ a = . based on factors such as level of care. Xz = .
001 .584 1.018 .037 .025 .Comparing Several Multivariate Population Means (Oneway MANOVA) 307 Sample covariance matrices l.017 .010 .007 .001 Source: Data courtesy of State of Wisconsin Department of Health and SociatServices.x) = _ _ _  1 To test H0 : T 1 = Tz = T 3 (no ownership effects or.000 .633 9.  s3  lUI .000 .003 .291 sl == . equivalently.011 .200 1.821 . .230 A*= IWI IB +WI = . no difference in average costs among the three types of ownersprivate.695 . a normaltheory test of H0 : I 1 = I 2 = I 3 would reject H0 at any reasonable significance level because of the large sample sizes (see Example 6. Computerbased calculations give l 3.225 . and government). we can use the result in Table 6.001 .581 2.610 .428 1.004 .7714 3 However.3 for g = 3.004 .005 .030 .002 .394 6.484 .111 .3 they were pooled [see (641)] to obtain W = (n 1  1)S1 + (n 2  l)S 2 + (n 3 ] 1)S3 Also.002 J.538 and L J ne(xe.x)(xe . l B = f=l 182.006 OJ s2 = l'" .962 4.475 1.235 .408 8.011 . nonprofit. J Since theSe's seem to be reasonably compatible.453 .12).001 .003 .000 .
1..(p + g)/2) In ( /B /WI ) = 511. is the ith diagonal element of I.51..g 1)wu where w.ne n. Notice that .Te (or l'k .. Var(rk.Xc. .) =  (1 1) nk_ nc +. those effects that led to the rejection of the hypothesis are of interest.4) can be used to construct simultaneous confidence intervals for the components of the differences Tk . • 6. = xk. .77I4 Y..:~ . is the difference between two independent sample means. ) = var(xki X1.51. where O". depending on type of ownership. ~Var(X*.rc. and Ho can be tested at the a = .p .l't)· These intervals are shorter than those obtained for all contrasts. As suggested by (641).01) = 20. we reject Ho at the 1% level. This result is consistent with the result based on the foregoing Fs ta tis tic. Since 132.7714} = +WI 132.2) p (!__:_VA*) VA* = (516 .76 with X~(gIJ(.01) . the Bonferroni approach (see Section 5.7714) = 17.i..x (645) and i~.. Since Tk is estimated by:. Since 17.Ol) = 20.is(.01.* = x* .u. For the present example. and they require critical values only for the univariate tstatistic. For pairwise comparisons.09. Var (Xki ... Let Tk.5 Simultaneous Confidence Intervals for Treatment Effects When the hypothesis of equal treatment effects is rejected.4 _:___.i(510)(.' The twosample tbased confidence interval is valid with an appropriately modified a.01) : 2. so that Fz( 4). is the ith diagonal element of Wand n = n 1 + · · · + ng. Lnr = n = 516 is large.76 > x~(.09.X"u) is estimated by dividing the corresponding element of W by its degrees of freedom.01 level by comparing (n.Ol)/8 = 2.) = (1  n* +.?_) 4 (1Y. .01) = x~(. That is..5ln(.67 Let a = . It is informative to compare the results based on this "exact" test with those obtained using the largesample procedure summarized in (643) and (644)..1 .308 Chapter 6 Comparisons of Several Multivariate Means and ( 'Lnc . 1020 (.ic.67 > F 8. be the ith component of rk. we reject Ho at the 1o/o level and conclude that average costs differ.
1)/2 (646) is the number of simultaneous confidence statements. There are p variables and g(g . Relation (528) still applies..581 8.023 = .J r13 .1 n .10 that average costs for nursing homes differ.. so that J( 1+ 1) n1 n3 W33 n. depending on the type of ownership. Using (639) and the information in Example 6. costs of plant operation and maintenance labor. Tk. between privately owned nursing homes and governmentowned nursing homes can be made by estimating r 13 .g = _!_ + _1_) 1. e < k = 1. . so each twosample tinterval will employ the critical value tng( a/2m).200 .484 2..g nk ~) +nc for all components i = 1.428 .137] .020 ' .394 .xc. belongs to x* · . is the We shall illustrate the construction of simultaneous interval estimates for the pairwise differences in treatment means using the nursinghome data introduced in Example 6.Simultaneous Confidence Intervals for 'freatment Effects 309 It remains to apportion the error rate over the numerous confidence statements.. g.10.. Result 6.408 1. with confidence at least {1 a). p and all differences ith diagonal element ofW.± I I t ng ( pg(g . 4. .5 to estimate the magnitudes of the differences.039 .003 182. we have ' TJ =(xi. . .962 w = [ Consequently.3 . Tc..r 33 .002 [ .( . For the model in (638). where m = pg(g .x) = [ ..020 .695 9.070j .1) a ) J W·· ".r33 = . Let n = k=l ± nk. We can use Result 6.484 = 00614 ( 271 107 516 .1)/2 pairwise differences.020 TJ ' = (x 3   x) = . .023 .11 (Simultaneous intervals for treatment differencesnursing homes) We saw in Example 6.633 1. Here W.5. A comparison of the variable X 3 . Example 6.043 and n = 271 + 138 + 107 = 516.10.
018. n3 n.) Before pooling the variation across samples to form a pooled covariance matrix when comparing mean vectors. or(.g = . The alternative hYPothesis is that at least two of the covariance matrices are not equal. a likelihood ratio statistic for testing (fr47) is given by (see [1]) A= II ( I Spooled I e I Se I )(n. (This assumption will appear again in Chapter 11 when we discuss discrimination and classification. • 6. g.. r 33 ± t 513 ( . Se is the (th group sample covariance matrix and Spooled is the pooled sample covariance matrix given by (649) .043 ± .S/4(3)2) == 2.00208 ) ~(1 n1 W33 + 1 ) .061. .. (See Appendix.87(.021. belongs to r13 ~  ..310 Chapter 6 Comparisons of Several Multivariate Means Since p = 4 and g = 3.025) We conclude that the average maintenance and labor cost for governmentowned nursing homes is higher by . 1.1)12 (648) Here ne is the sample size for the eth group.019) Thus.058. 2. a difference in this cost exists between private and nonprofit nursing homes. for 95% simultaneous confidence statements we require that t513 (. . e .043 ± 2. One commonly employed test for equal covariance matrices is Box's Mtest ((8].026) r33 belongs to the interval ( .0.00614) = . With g populations. but no difference is observed between nonprofit and government nursing homes.) The 95% simultaneous confidence statement is . we can say that r 13 and r23  r 23 belongs to the interval ( .025 to . Assuming multivariate normal populations. .061 hour per patient day than for privately owned nursing homes. the null hypothesis is (647) where l:e is the covariance matrix for the eth population.6 Testing for Equality of Covariance Matrices One of the assumptions made when comparing two or more multivariate mean vectors is that the covariance matrices of the potentially different populations are the same. . . Table 1. (9]). With the same 95% confidence.87. it can be worthwhile to test the equality of the population covariance matrices. and :i is the presumed common covariance matrix.
where ISen I and I S(g) I are the minimum and maximum determinant values. reject 110 if C > ~(p+l)(gt)l2(a). As the latter quantities become more disparate. consequently.1) 2 (651) where pis the number of variables and g is the number of groups. .1) (653) degrees of freedom. do not differ too much from the pooled covariance matrix. In situations where these conditions do not hold. Example 6. At significance level a.1 ] 6(p + l)(g. note that the determinant of the pooled covariance matrix.~[(nr l)lniSciJ e c (650) If the null hypothesis is true.ln fact. I Sen Ill Spooled I reduces the product proportionally more than I S(g) Ill Spooled I increases it.2p(p + 1) 1 = 2p(p has an approximate y xz distribution with 1 = g2p(p + l)(g. If the null hypothesis is false. ISpooled I. the sample covariance matrices can differ more and the differences in their determinants will be more pronounced. the individual sample covariance matrices are not expected to differ too much and. In this case.u)M = (1 u){[~(nel)}n I Spooled I. Box's Test for Equality of Covariance Matrices Set u = r l ~(nc. In that example the sample covariance matrices for p = 4 cost variables associated with g = 3 groups of nursing homes are displayed. to the sampling distribution of 2ln A (see Result 5.10. Box's x2 approximation works well if each ne exceeds 20 and if p and g do not exceed 5.I. respectively.Testing for Equality of Covariance Matrices 31 I Box's test is based on his x2 approximation. as the I Sc l's increase in spread. In this case A will be small and M will be relatively large.z = l:3 = l:. the product of the ratios in (644) will get closer to O. Assuming multivariate normal data.1) .~(ne _ 1 1 1) l[ J 2p + 3p . will lie somewhere near the "middle" of the determinants I ScI's of the individual group covariance matrices.2). we test the hypothesis H 0 : l:1 = .12 (Testing equality of covariance matricesnursing homes) We introduced the Wisconsin nursing home data in Example 6.~[(nc1)1n ISeiJ}(652) + 1). Setting 2 In A = M (Box's M statistic) gives M = [L:(ne 1)]1niSpooledl. [8]) has provided a more precise F approximation to the sampling distribution of M. Then C = (1. To illustrate. the ratio of the determinants in (648) will all be close to 1. Box ([7). A will be near 1 and Box's M statistic will be small.
270 + 1 137 + ][2(4 ) + 3(4) . Thus the Mtest may reject H 0 in some nonnormal cases where it is not damaging to the MANOVA tests. The particular experimental design employed will not concern us in this book.l ISzl Is31 I I IS31 I I 1 u = [ 270 + 137 + f 1 106 .564). To summarize.397. In S:z = 13. however. and 8 spooled = 17.) We shall.and that n independent observations can be observed at each of the gb combi4 The use of the tenn "factor" to indicate an experimental condition is convenient. it is clear that H 0 is rejected <1t any reasonable level of significance.397) + 137(13. = 89. respectively. In some cases. In = 15. Let the two sets of experimental conditions be the levels of. in the presence of nonnormality. Moreover. 6. Univariate TwoWay FixedEffects Model with Interaction We assume that measurements are recorded at various levels of two factors. for instance..564. with equal sample sizes. assume that observations at different combinations of experimental conditions are independent of one another. It is known that theMtest is sensitive to some forms of nonnormality. we have n 1 = 271. with reasonably large samples.1) = · 2 M = [270 + 137+ 106](15. 7 TwoWay Multivariate Analysis of Variance Following our approach to t~e oneway MANOVA. the MANOVA tests of means or treatment effects are rather robust to nonnormality. some differences in covariance matrices have little effect on the MANOVA tests. Referring C to a / table with v = 4( 4 + 1 )(3 . . More broadly.3 = 285. The factors discussed here should not be confused with the unobservable factor. However.1)12 = 20 degrees of freedom.926.3 12 Chapter 6 Comparisons of Several Multivariate Means Using the information in Example 6. = 14.[270(17.l I Is. We conclude that the covariance matrices of the cost variables associated with the three populations of nursing homes are not the same.783 X 108.3 and C = (1 . we may decide to continue with the usual MAN OVA tests even though theMtest leads to rejection of H0 . these experimental conditions represent levels of a single treatment arranged within several blocks.. (See [10] and [17] for discussions of experimental design. We calculate n3 I = 107 and Is. considered in Chapter 9 in the context of factor analysis.0133)289. Taking the natural logarithms of the determinants gives In = 17. normal theory tests on covariances are influenced by the kurtosis of the parent populations (see [16]).741)] = 289. n 2 == 138 = 2.579 X 108. factor 1 and factor 2.539 X 108.1] 0133 106 6(4 + 1)(3 .5.926) + 106( 15. 4 Suppose there are g levels of factor 1 and b levels of factor 2. • Box's Mtest is routinely calculated in many statistical computer packages that do MANOVA and other procedures requiring equal covariance matrices.741 and In Spooled = 15. we shall briefly review the analysis for a univariate twoway fixedeffects model and then simply generalize to the multivariate case by analogy..398 X 10.10.
.b e= + Te + (654) r = 1.. Here /L represents + Te + f3k + Yek mean ) ( response (overall) + (effect of) + (effect of) + level factor 1 factor 2 (fa~ tor 1fa~tor 2) mteract10n (655) e = 1..3 Curves for expected responses (a) with interaction and (b) without interaction.2.. Yek. Denoting the rth observation at level e of factor 1 and level k of factor 2 by Xu.3(a) and (b) show Level I of factor 1 Level 3 of factor I Level 2 of factor 1 2 (a) 3 4 Level of factor 2 Level 3 of factor 1 Level 1 of factor I Level 2 offactor I 2 (b) 3 4 Level of factor 2 Figure 6. {3 k represents the fixed effect of factor 2.. re represents the fixed effect of factor 1. ..2. 2. we specify the univariate twoway model as Xur = /L f3k + Yek + eekr 1. . .. . an overall level. implies that the factor effects are not additive and complicates the interpretation of the results. 2. g k = 1. . .'TWoWay Multivariate Analysis of Variance 313 . . Figures 6. . .nations of levels. and Ye k is the interaction between factor 1 and factor 2. k = 1. g.... 2.. b The presence of interaction. The expected response at the eth level of factor 1 and the kth level of factor 2 is thus /L f=l k=l f=l k=l u 2 ) random variables. ..n g b g b where ~ Te = ~ f3k = ~ Yek = ~ Yek =0 and the eekr are independent N(O.
L b n x) 2 gbn . Total (corrected) C=l k=J L r=l (xu.1) + (b. Xe· is the average for the Cth level of factor 1.1) (b. In a manner analogous to (6SS).x) 2 b 2 k=l + or 2: 2: 2: Cxrk.k is the average for the kth level of factor 2..1) + gb(n .  g :X) + ~ gn(x ·k .:xrd C=l k=l r=l g b n (657) SScor = SScact + SStac2 + SS. and Xfk is the average for the Cth level of factor 1 and the kth level of factor 2.x.k + x) 2 (g.1) gb(n . Squaring and summing the deviations (xekr .1)(b.1 .1) (658) TheANOVA table takes the following form: ANOVA Table for Comparing Effects of Two Factors and Their Interaction Source of variation Factor 1 Factor 2 Interaction Residual (Error) SScor = ~ Sum of squares (SS) Degrees of freedom (d..) g.k k=I b g x) 2 x) 2 b .and k.1 SS.1) k=l .nl = L L c~t g b n(xtk . x. each observation can be decomposed as (656) where xis the overall average.1) + (g.ie.  g b n x) = 2 f=l 2: bn(xe.f.J."' + SSres The corresponding degrees of freedom associated with the sums of squares in the breakup in (657) are gbn.314 Chapter 6 Comparisons of Several Multivariate Means expected responses as a function of the factor levels with and without interaction ' respectively. .1 ssfacl SStac2 = = 2: bnCXe· f=l L gn(x..1 = (g. The absense of interaction means rek = 0 for all e.x) gives f=l k=l r•l 2: 2: 2: (xu.
and factor 1factor 2 interaction.ic·.x) with the corresponding matrix (ic. factor 2. are independent Np(O.k + i) + (xckr. . we can decompose the observation vectors Xtkr as Xckr = i + (ic.1) + (b.ic.1)(b .. and the eck. 2.1 ).Two. 2. (See [11) for a discussion of univariate twoway analysis of variance.x)'.1) to the mean square. ic· is the average of the observation vectors at the Cth level of factor 1. . ....x)' = L: bn(xcC=l g . SS.es/(gb(n .1) (662) Again. Thus.2. we specify the twoway fixedeffects model for a vector response consisting of p components [see (654)] Xfkr = P.1 ).k + i)' (661) gbn.i)' b L L f=l k=l n(iek.k + i)(ifk.Xfk) (660) where i is the overall average of the observation vectors.. .1) (b.. the responses consist of p measurements replicated n times at each of the possible combinations of levels of factors 1 and 2.k.1)) can be used to test for the effects of factor 1. ' b r = 1.ie. and iek is the average of the observation vectors at the eth level of factor 1 and the kth level of factor 2.1) + (g. l:) random vectors.i.. and SS.x)(xc· .i) (i. + Tc + /h + 'Ytk + eckr (659) e= k 1..k is the average of the observation vectors at the kth level of factor 2.i.1) + gb(n. C=J k=l r=l g b n x)(xtk.Way Multivariate Analysis of Variance 315 The Fratios of the mean squares.k . i. 01 /(g .k .x)' + + L k=I g b gn(i. . . The vectors are all of order p X 1.n where f=J ± Tc = k=l ± fh = _! 'Yfk = f=l k=l ± 'Ytk = 0. . Following (656). g = 1.1 = (g.i)(ie..i.) Multivariate TwoWay FixedEffects Model with Interaction Proceeding by analogy.. respectively. the generalization from the univariate to the multivariate analysis consists 2 simply of replacing a scalar such as (xc. SScac 2 /(b .i) + (iu. . Straightforward generalizations of (657) and (658) give the breakups of the sum of squares and cross products and degrees of freedom: L: L: L: cxtk.i) + (i. SScacd(g . .
.(g .e. f.k k~l i)' f=I k=l ±± n(iek ie. the lest for interaction is carried out before the tests for main factor fects. Wilks' lambda. Ordinarily.i) (ie. 2.1)  p + 1 .k + i) (iek.1) SSPcor = .~ reJectH0:y11 = y 1 z = ··· = Ygb = Oatthealevelif ~ ·~ .i) (i.::.p univariate twoway analyses of variance (one for each are often conducted to see whether the interaction appears in some responses 5 The likelihood test procedures require that p (with probability 1).i.1)(b.1 = ~ gri(i.i)' .[ gb(n. 5 gb(n .3 16 Chapter 6 Comparisons of Several Multivariate Means The MANOVA table is the following: MANOVA Table for Comparing Factors and Their Interaction Degrees of .l)..es I ~~1 ~ (664zJ For large samples..Instead.. 01 = e~1 L b g bn(ie· . : Source of variation Factor 1 Factor 2 Interaction Residual (Error) Total (corrected) Matrix of sum of squares and cross products (SSP) freedom~ (df.!. will be positive .) SSPraci = SSPcac2 SSP.. the factor effects do not have a clear i"n.l)(b .x..l)p d.k + i)' gb(n .k  g1 b.. it is not advisable to proceed with the additional variate tests. If interaction effects exist. . can be referred to a chisquare percentile2~ U~ing Bartlett's multiplier (see [6]) to improve th~ chisquare approximation.. A*.. 2: 2: (Xfkr  b n i)(Xfkr .telipreltati<>~ From a practical standpoint.1)] 2 InA*> xZgl)(bl)p(a) where A* is given by (664) and xfgI)(bI)p(a) is the upper (lOOa)th percentile chisquare distribution with (g . so thai SSP.i)' gbn 1 f=I k=l r=l A test (the likelihood ratio test) 5 of Ho:YII = Y12 = ··· = versus Ygb =0 (no interaction effects) is conducted by rejecting H 0 for small values of the ratio A* = ISSPres! ''""!SSPint + SSP.. i.
but with treatment sample means replacing expected values. . In the multivariate model./3q of the factor 2 effects.. Using Bartlett's correction. First. Xm·i) ±tv ( pg(g _ a 1 ) ).•• I ISSPtacl + SSPres I (666) so that small values of A* are consistent with H 1 .'~::.1)  p+1(g1)] In A* > xfgl)p(a) 2 where A* is given by (666) and xfgl)p(a) is the upper (100a)th percentile of a chisquare distribution with (g . is the ith diagonal element of E = SSP. for large samples and using Bartlett's correction: Reject H0 : /3 1 = f3 2 = · · · = f3b = 0 (no factor 2 effects) at level a if . Let '* A* = .[ gb(n . These hypotheses specify no factor 1 effects and some factor 1 effects. ISSP. . When interaction effects are negligible..•• . {£. Once again. interaction plots similar to Figure 6..3..:_. consider the hypotheses H 0 : r 1 = r 2 = · · · = r 8 = 0 and H 1 : at least one re 0.5 are available for the twoway model.Xm· . The 100{1 . Those responses without interaction may be interpreted in terms of additive factor 1 and 2 effects. In any event.1)p degrees of freedom. factor 2 effects are tested by considering H 0 : fJ1 = f3z = · · · = f3b = 0 and H 1 : at least one f3 k 0.[ gb(n. provided that the latter effects exist.1)  p+1(b1)] In A*> xfbl)p(a) 2 (669) where A* is given by (668) and xfbl)p(a) is the upper (100a)th percentile of a chisquare distribution with ( b .Tm of the factor 1 effects and the components of f3k . Results comparable to Result 6.. . Tmi belongs to (xe.T mi are re.TwoWay Multivariate Analysis of Variance 3.2 y. The Bonferroni approach applies to the components of the differences re . E.17 others. Simultaneous confidence intervals for contrasts in the model parameters can provide insights into the nature of the·factor effects. (670) Xm·i where v = gb(n . we may concentrate on contrasts in the factor 1 and factor 2 main effects.1)p d. and xe. Small values of '* jSSP. the likelihood ratio test is as follows: Reject H 0 : r 1 = r2 = ··· = r 8 = 0 (no factor 1 effects) at level a if (667) .. is the ith component of ie.. best clarify the relative magnitudes of the main and interaction effects.a)% simultaneous confidence intervals for re.•• I A*=jSSPtac2 + SSPresl (668) are consistent with H 1 . we test for factor 1 and factor 2 main effects as follows.f. respectively. respective! y. In a similar manner...1)..
2 ~ 9. If only one observation vector is available at each combination of factor levels.5%) ~ ~ Factor 1: Change in rate of extrusion [6.2 9.1] 3.0] 4.4] 6.5 [6. (671)  where v and E. leading to the following MANOVA table: 6 Additional SAS programs for MANOVA and other procedures discussed in this chapter are available in [13].0 8.4] 3.13 (A twoway multivariate analysis of variance of plastic film data) The optimum conditions for extruding plastic film have been examined using a technique called Evolutionary Operation. Comment.j ~ [6.4.6 9. the model allows for n replications of the responses at each combination of factor levels.5 4. (See Exercise 6.7 6.1] 0.0 [6. x 2 = gloss.8] 1.7 [6.6] 3. the 100( 1 .2] [7. three responsesX1 = tear resistance. and residual sources of variation as components of the total variation.8] 4. (See [9]. This enables us to examine the "interaction" of the factors.9 9. That is.318 Chapter 6 Comparisons of Several Multivariate Means Similarly.5 9.q.8 [6.k x. The measurements were repeated n = 5 times at each combination of the factor levels.5 10.71 [7.5 1.2 [7.7] 57l High(lO%) [6.2 10.1 9.8 2. the ' twoway model does not allow for the possibility ofa general interaction term 'Ytk· The corresponding MANOVA table includes only factor 1.) Example 6.1 9.2 1.9 9.8] XJ x2 x3.'are as just defined and i.9] [7.1 9.ki the ith component ofi. Table 6.3 8.9 3.g.) In the course of the study that was done. and X 3 = opacitywere measured at two levels of the factors. factor 2.2 8..6 9..6 [7.is  13 qi /3k.6 9.1 6).1 [6.5 [6.5 ~ 9.0 2...4] [7.Xq.9 [6.13. We have considered the multivariate twoway model with replications.X·qi) ±tv ( pb(b _ .O%) ~ High (1.9 [6.2 y. . rate of extrusion and amount of an additive.9 9.2 Low(10)% [5. X 2 = gloss./3q.8 5.a) percent simultaneous confidence intervals for f3 ki are a (x·k.1 [7.4 5.3 9.4 Plastic Film Data x 1 = tear resistance.3 8. and x 3 = opacity Factor 2: Amount of additive Low(l.9 The matrices of the appropriate sum of squares and cross products were calculated (see the SAS statistical software output in Panel6. The data are displayed in Table 6.1 2. belongs to 1 ) ) {E.4 8..4] x3 ~ ~ [7.
class factor1 factor2.0023 c.dat'.76050000 0.5045 1.74050000 0.5445 . manova h =factor1 factor2 factor1*factor2 jprinte.7395 .78500000 F Value 15.0011 0.0165 .76400000 4.893724 Type Ill SS X1 Mean 6. model x1 x2 x3 = factor1 factor2 factor1*factor2 jss3. PROGRAM COMMANDS General linear Models Procedure Class level Information Class levels FACTOR1 2 FACTOR2 2 Number of observations in J [ Oep~tf~~nt V~riabl~: X1 Values 0 1 0 1 data set= 20 FValue 7.6125 .7640 r2655 0445] 1 Residual 3.0183 0. tj~s~.0855 .00050000 Pr > F 0. data film.9095 74.6280 .7605 1.4205 . infile 'T64.9005 1.50150000 1.26550000 Mean Square 0. o.5520 64.4685 3.11025000 Root MSE 0332039 Mean Square 1.8555] 1 Factor 2: [ . 4. 0'76050000 (continues on next page) .7325 4.79 6.9471 Source DF ~.83383333 0.90 0.2055 19 PANEL 6. [ 1.9605 Interaction [ ~5 [1.TwoWay Multivariate Analysis of Variance 319 Source of variation Factor 1: change in rate of extrusion amount of additive SSP d. input x1 x2 x3 factor1 factor2. means factor1 factor2.9240 J 16 Total (corrected) 2395] 1.0700 .7855 5.3005 .0200 2.93()5] 1 1.7405 1.v.00 Pr > F 0.586449 Sum of Squares 2.13 USING PROC GLM title 'MANOVA'.ooosoooo. proc glm data= film.f.6825 .56 OUTPUT Source Model Error Corrected Total DF 3 16 19 RSquare 0.1 SAS ANALYSIS FOR EXAMPLE 6.
Wilks' La111bda: ~.45750000 2. 51.0714 0.28150000 64.61877188 1.125078 Source fACTOR FAaoR2 FACTOR1*FACTOR2 DF Sum of Squares 9.628 {).54450000 Root MSE ·0.09383333 4.31SOOOOO Pr > F 0.v.98 X3 Mean 3.61814162 1.v.42050000 4.16425000 F Value 4.300so00o !).61250000 0. ' "~·:· .5 f H =Type Ill S5&CP Matrix for FAGOR1 E = Error SS&CP Matrix N=6 Nu~_Df rStatistic .c .l FACTOR2 FACTOR1 *FACTOR2 DF 1 1 1 l [Dependent Variable: X3 / Source Model Error Corrected Total DF 3 16 19 RSquare 0. 4.5543 3 3 3 3 Pr>fd 0.awley Trace Roy's Greatest Root 14 14 14 0.93500000 Pr> F 0.0030 0.30050000 0.19151 Type Ill SS 0.•.92 3.20SSOOOO Mean Square 3.32 X2Mean 9.42050000 4.96050000 RootMSE 2.483237 Sum of Squares 2. 0.~:·.99 Pr > F 0.552 64.0874 Source fACTORf.1 (continued) ~pendent Variable: X2 Source Model Error Corrected Total I  OF 3 16 19 RSquare 0.61250000 0.96050000 F Value 0.5543 7.3379 l j 1 1 1 I £ =Error SS&CP Matrix XI I X2 0.07 X3 3.5543 7.924 Manova Test Criteria and Exact F Statistics for the IHypOthesis of no Overall FACTOR1 Effect / s =1 M:0.08550000 Mean Square 0.0030:·" ''""3" Pillai's Trace Hotellin!rl.73 3.552 X2 X3 XI 1.~.0125 c.0125 0.2881 0.764 0.54450000 F Value 7. I Den OF 14 7..405278 Mean Square 1.05775000 F Value 0.61877188 \1.90050000.:~ .5543 7.0030 (continues on next page) . .350807 Type Ill SS 1. .320 Chapter 6 Comparisons of Several Multivariate M<:ans PANEL 6.014386 Mean Square 0.90050000 3.·o:~~~.02 2.5315 c.02 3.0030 0.. .76 Pr > F 0.924000QO 74.7517 0.81916667 0. 3. .07 {).62800000 5.10 121 0.
49000000 7.0247 0.5 Value 0.0247 0.79000000 4..85379491 2.98000000 SO 0.52303490 Pillai's Trace Hotel lingLawley Trace Roy's Greatest Root 0.2556 Manova Test Criteria and Exact F Statistics for the Hypothesis of no Overall FACTOR1 *FACTOR2 Effect E =Error SS&CP Matrix H =Type Ill SS&CP Matrix for FACTOR1*FACTOR2 S= 1 M = 0.255~ E =Error SS&CP Matrix N =6 NumDF 3 3 3 3 4.7098 7771 354.42804465 N 10 10 Mean 3.22289424 0.47328638 X2Mean 9.49000000 SO 0.56015871 0.55077042 2.3385 1._ _ ISSPresl ISSPint + SSPres I 275.43000000 X350 1.57580861 N 10 10 X3Mean SD 3.='=:.77710576 0.59000000 6.0247 H =Type Ill SS&CP Matrix for FACTOR2 S= 1 M =0.30123155 To test for interaction.3385 1.28682614 F 1.29832868 0.40674863 0.08000000 1.44000000 4.3018 0.0247 0.7906 = • .08000000 SD 0.3018 0.3018 0.47696510 0.32249031 X2Mean 9.91191832 F 4.3385 NumbF Wilks' ~mbda Pillai's Trace Hotel linglawley Trace Roy's Greatest Root level of FACTOR! 0 3 3 3 Den DF 14 14 14 14 Pr > F 0. we compute A* = .3385 1.57000000 9.5 N=6 Statistic Value 0.28682614 0.42018514 0.18214981 level of FACTOR2 0 N 10 10 level of FACTOR2 0 X1Mean 6.3018 3 1 N 10 10 level of FACTOR! 0 X1Mean 6.TwoWay Multivariate Analysis of Variance 321 pANEL 6.14000000 9.2556 4.06000000 SD 0.:.2556 4.91191832 0.1 (continued) Manova Test Criteria and Exact F Statistics for the j Hypothesis of no Overall FACTOR2 Effect I DenDF 14 14 14 14 Pr> F 0.
3 + I)j2 = 7. we do not reject the hypothesis H 0 : y 11 = y 12 = y 21 = y 22 = 0 (no interaction effects).I) PI +I.3819 (II31 + I)j2 F = (1 .31 + I)/2 .. we would reach the same conclusion as provided by the exact Ftest.p + 1.66.14(.) In our case. 1F = ( .3 + I)j2 = 4_ 26 2 .1347 · For both g .31 + I)j2 and "I = II .)Forourexample..p + I)j2 (l(gI)pi+I)/2 and F = (I 2 A.f.5230) (I6.34 (11(1). Since xj(.1 = 1 and b .. 14 ( .p + I)j2 (l(bI)pi+I)/2 have £distributions with degrees of freedom v1 = I(g .05) = 7. = (1 1 A~ A~) (gb(n I).7771) = 3. F1 = (1 .(See[l].PI + I.I). A. we calculate A~ = and A.31 + 1) =3 I'J = "2 = (2(2)(4) .12 (I (g.1) PI + 1).I.81. "2 = gb(n.1(1))/2] ln(. = I SSPres I = 275.1) .34 < F3.34.p + 1 and v1 = I (b . from (665).02I2 · ISSPres I = 275.3 + I)j2 . To test for factor 1 and factor 2 effects (see page 3I7).7771) (2(2)(4). respectively.05) = 3.1)(b . "2 = gb (n .1) = 1.3 + I) = 14 .) (gb(n I) .7098 = I 38 9 ISSPracl + SSPres I 722.7098 = 5230 ISSPrac2 + SSPres I 527. p.5230 (II.PI + 1 and gb(n 1). "2 F  . · Note that the approximate chisquare statistic for this test is [2(2)(4)"i> (3 + 1 .1 = 1.31 + 1 = 3 "2 = (I6 .34.1) ..05) = 3.3819) (I6.1)(b.I) .p + 1).1 )( b .7771 (11(1) . Since F = 1. (See [1].55 .3 + 1) = 14 and F3 .p + 1d..12 has an exact £distribution with v1 = I(g. A*) (gb(n 1).(1 .322 Chapter 6 Comparisons of Several Multivariate Means For ( g .
05) = 3. • 6. We conclude that both the change in rate of extrusion and the amount of additive affect the responses. F2 = 4. .. We have F1 = 7. Similarly.. i = 3.1...~ Variable 2 3 4 p Figure 6. Although this hypothesis may be of interest in a particular situation. We shall concentrate on two groups. acceptable? 2.. it is assumed that the responses for the different groups are independent of one another. is usually of greater interest. J.4 The population profile = 4.. is shown in Figure 6. are the profiles coincident? 7 Equivalently: Is Ho 2 : /L!i = J.2 = [J. 2. 1.1.05) = 3.IL2iI.. In that exercise. p. respectively. we might pose the question.. and so forth) are administered to two or more groups of subjects.1. J.1 + l'2•1) .(P. Let /.34. Are the profiles parallel? Equivalently: Is H 01 : IL!i .8 Profile Analysis Profile analysis pertains to situations in which a battery of p treatments (tests.. acceptable? Mean response ' .34. Ordinarily..L22 . we can formulate the question of equality in a stepwise fashion.! = 1.L 21 . . 2 implies that the treatments have the same (average) effect on the two populations. Assuming that the profiles are parallel.1.._ . 7 The question.12. .) . . A plot of these means.. Consider the population means 1. we reject H 0 : TI = T 2 = 0 (no factor 1 effects) at the 5% level. /L 2 p] be the mean responses top treatments for populations 1 and 2.55 > F3. . __ _J __ _. F3. p. All responses must be expressed in similar units. /Lip] and 1. I = [J. and we reject H 0 : PI= fJ 2 = 0 (no factor 2 effects) at the 5% level. 14 (. .1) = (p.. and they do so in an additive manner. 1.. The null hypothesis of parallel linear profiles can be written H 0: (p.. i = 1.LI b J. connected by straight lines..It1 + 1'2. /LI 2 .. simultaneous confidence intervals for contrasts in the components of Te and Pk are considered._. + 1'2.! = [J.LliI = /L2i ._ _ . are the population mean vectors the same? In profile analysis. whatever their nature.LII.05) = 3....Ll4] representing the average responses to four treatments for the fust group..26 > F3 . and therefore.L 2 ..(P.Profile Analysis 323 From before. . . . 14 (. 14 (. . in practice the question of whether two parallel profiles are the same (coincident). questions. This brokenline graph is the profile for population 1.4.li2 + 1'2.J.... i = 2.15.. p.. the question of equality of mean vectors is divided into several specific possibilities.J. are the profiles linear?" is considered in Exercise 6.34. The hypothesis H 0 : 1. . Further. "Assuming that the profiles are parallel. In terms of the population profiles.LJ2...I.L.. Profiles can be constructed for each population (group). The nature of the effects of factors 1 and 2 on the responses is explored in Exercise 6.2 ). 3. . /LIJ.
.j = 1. are the profiles level? That is.. the profiles will be coincident only if the total heights ILJI + J. the null hypothesis at stage 2 can be written in the equivalent form We can then test H02 with the usual twosample !statistic based on the univariate observationsi'x 11 ..2 provides a test for para!! el profiles. and pooled covariance matrix CSpooledC'..j = 1. the null hypothesis can be tested by constructing the transformed observations Cx11 . CIC') distributions. Assuming that the profiles are coincident. . respectively.n 1 j=l. . ..L2J + J.n2 These have sample mean vectors Ci 1 and C i 2 .2.LI. Since the two sets of transformed observations have Np. 1 = Cp 2 (parallel profiles) at level a if T 2 = (i 1  i 2)'c•[ ( ~1 + ~2 )cspooledC' J 1 C(i 1  i 2) > c 2 (673) where When the profiles are parallel. an application of Result 6.324 Chapter 6 Comparisons of Several Multivariate Means 3. . Test for Parallel Profiles for Two Normal Populations Reject H 01 : Cp.n 2 .L2. . and Cx2f• j=1.. .L22 + · · · + IL2p = 1' 1t2 are equal. are all the means equal to the same constant? Equivalently: Is H03 : ILII = J. Therefore.. + IL!p = 1' p 1 and J.and1'x 21 . Under this condition.2.L12 = · · · = IL!p = J.n 1 .Lii > J. . CIC) and Np! (Cp.2.. for all i). the first is either above the second (J.. respectively.. 2 . .2.L2 2 = · · · = IL2p acceptable? The null hypothesis in stage 1 can be written where Cis the contrast matrix c ((p!)Xp) = [? ~ : 0 ~~ (672) 0 0 0 For independent samples of sizes n 1 and n 2 from the two populations.1(CJ. or vice versa.L12 + · ..L2I = J.
we have the following test. .1)( p . x2 n 2 are all observations from the same normal population? The next step is to see whether all variables have the same mean. . a sociologist. so that the common profile is level.. Hatfield.p + 1) (675) where S is the sample covariance matrix based on all n 1 + n 2 observations and ( ) Fp!. the common mean vector JL is estimated. then p. using all n 1 + n2 observations.x2) (674) For coincident profiles. X _ 1 = . and the null hypothesis at stage 3 can be written as H 03 : CJL = 0 where Cis given by (672)..x2) [ (: + :J1'Srooled1 1 2 J 1 1'(xi. Consequently..Profile Analysis 325 Test for Coincident Profiles.J XJ 1· + ~ X2 1 · i=l "2 ) = n1 _ X1 + n2 _ X2 Test for level Profiles.( "1 . Given That Profiles Are Parallel For two normal populations. Given That Profiles Are Coincident For two normal populations: Reject H 03 : CJL = 0 (profiles level) at level a if (n 1 + n2)x'C'[CSC'tcx > c 2 c2 = ( n 1 + n2 . by 11 11 n 1 + n 2 J=! (nt + n2) (n 1 + n2) If the common profile is level. reject H 02 : 1' JLI = 1' JL 2 (profiles coincident) at level a if T = 1'(xi.. . .14 (A profile analysis of love and marriage data) As part of a larger study of love and marriage. When H 01 and H02 are tenable. E. x11 . 2 3 4 5 6 7 8 . 2 = · · · = JLp. using the 8point scale in the figure below...t::.nl+n2p+! a Example 6. 1 = p. Rece11tly married males and females were asked to respond to the following questions. x 1 n 1 and x2 1> x22 . surveyed adults with respect to their marriage "contributions" and "outcomes" and their levels of "passionate" and "companionate" love.1) ( nl + n2. x 12 .
606 .161 .066 .143 . we shall use the normal theory methodology.5 on page 327.029 .066 . What is the level of companionate love that you feel for your partner? None at all Very little 2 Some A great deal Tremendous amount 5 4 Let x 1 = an 8point scale response to Question 1 x 2 = an 8point scale response to Question 2 x 3 = a 5point scale response to Question 3 x 4 = a 5point scale response to Question 4 and the two populations be defined as Population 1 = married men Population 2 = married women The population means are the average responses to the p = 4 questions for the populations of males and females. how would you describe your contributions to the marriage? 2.1 = Cp.161J .633J 7.033 3.533 (females) . 3.967 ' (males) x2 =  4. All things considered. A sample of n 1 = 30 males and n 2 = 30 females gave the sample mean vectors XJ = and pooled covariance matrix spooled = The sample mean vectors are plotted as sample profiles in Figure 6.143 .000 4.306 .029 .000 4.637 . 1b test for parallelism (H0 1: Cp.700 6. we compute l l l 6.2). how would you describe your outcomes from the marriage? SubjeGts were also asked to respond to the following questions. are clearly nonnormal. What is the level of passionate love that you feel for your partner? 4.173 .833J 7. even though the data. Assuming a common covariance matrix :t. Since the sample sizes are reasonably large.326 Chapter 6 Comparisons of Several Multivariate Means 1.262 .262 .810 . using the 5point scale shown. All things considered.173 . which are integers. it is of interest to see whether the profiles of males and females are the same.
268 1.058 ._ 2 (profiles coincident).058 Thus.066. . 6 4 Key: xxMales o.207 .05) = 3. we need Sum of elements in (i1  = 8..067) = 1.367 S urn of elements in Spooled = 1' Spooled1 = 4.066 1.125 . 56 (.268 1.268 .. To test H 02 : 1'1'I = 1' . with a= .200] (to+ = 15(.167. this finding is not surprising. Since T 2 = 1. we can test for coincident profiles.5.719 .7. c 2 = [(30+302)(41)/(30+304)]F3.268 [ .751 . Given the plot in Figure 6.Profile Analysis 327 sample mean response :i e.101 .5 Sample profiles for marriagelove responses.005 2 fat .125]l [.200 profiles for men and women is tenable. T = [ . .11(2.a Females 2 2 3 4 Figure 6.05. Assuming that the profiles are parallel.005 < 8.167] .751 .8) i 2 ) = 1' (i 1  i 2 ) = .751 125] .751 1.7. we conclude that the hypothesis of parallel Moreover.125 and . 1 CSpooledC = [ 1 ~ 1 1 0 0 1 1 ~}fj ~] 0 1 1 0 719 = [ .101 .
~8 (. A profile. We could now test for level profiles. summarizes the growth which here is a loss of calcium over time. To illustrate the growth curve model introduced by Potthoff and Roy [21].501 With a = .9 Repeated Measures Designs and Growth Curves As we said earlier. a profile analysis will depend on the normality assumption. the term "repeated measures" refers to situations where the same characteristic is observed. while Questions3 and 4 were measured on a scale of 15. When some subjects receive one treatment and others another treatment.5 gives readings after one year. we could measure the weight of a puppy at birth and then once a month. the general measures of comparison are analogous to those just discussed. x3 . 58 ( . Can the growth pattern be adequately represented by a polynomial in time? . This assumption can be checked.328 Chapter 6 Comparisons of Several Multivariate Means Using (674). The incompatibility of these scales makes the test for level profiles meaningless and illustrates the need for similar measurements in order to carry out a complete profile analysis.05) = 4.05. In this context. [18].367 )2 = . F1 . That is. two years. and T 2 = . Unlike univariate approaches. constructed from the four sample means (xi> x2 . holds for each subject. at different times or locations. (See [13].) 6. this model does not require the four measurements to have equal variances. For instance. the growth curves for the treatments need to be compared. The model assumes that the same covariance matrix l. we consider calcium measurements of the dominant ulna bone in older women. we obtain T2 = ( v'(~ + ~)4.0.501 < F1. Readings obtained by photon absorptiometry from the same subject are correlated but those from different subjects should be independent.027 . it does not make sense to carry out this test for our example. It is the curve traced by a typical dog that must be modeled. • When the sample sizes are small. In fact. and three years for the control group. x4 ). using methods discussed in Chapter 4.05) = 4. (b) A single treatment may be applied to each subject and a single characteristic observed over a period of time.0. we refer to the curve as a growth curve. on the same subject. Besides an initial reading. however. we cannot reject the hypothesis that the profiles are coincident. Table 6. since Que·stions 1 and 2 were measured on a scale of 18. The treatments need to be compared when the responses on the same subject are correlated. the responses of men and women to the four questions posed appear to be the same. (a) The observations on a subject may correspond to different treatments as in Example 62 where the time between heartbeats was measured under the 2 X 2 treatment combinations applied to each dog. with the original observations Xcj or the contrast observations Cxcj· The analysis of profiles for several populations proceeds in much the same fashion as that for two populations.
[Table 6.5 Calcium Measurements on the Dominant Ulna, Control Group: initial, 1-year, 2-year, and 3-year readings for 15 subjects, with column means in the last row. Source: Data courtesy of Everett Smith.]

Usually groups need to be compared. Table 6.6 gives the calcium measurements for a second set of women, the treatment group, that received special help with diet and a regular exercise program. When a study involves several treatment groups, an extra subscript is needed as in the one-way MANOVA model. Let Xℓ1, Xℓ2, ..., Xℓnℓ be the nℓ vectors of measurements on the nℓ subjects in group ℓ, for ℓ = 1, ..., g.

Assumptions. All of the Xℓj are independent and have the same covariance matrix Σ. When the p measurements on all subjects are taken at times t1, t2, ..., tp, the Potthoff-Roy model for quadratic growth becomes

E[Xℓj] = [ βℓ0 + βℓ1 t1 + βℓ2 t1², βℓ0 + βℓ1 t2 + βℓ2 t2², ..., βℓ0 + βℓ1 tp + βℓ2 tp² ]'

where the ith mean μℓi is the quadratic expression evaluated at ti.
[Table 6.6 Calcium Measurements on the Dominant Ulna, Treatment Group: initial, 1-year, 2-year, and 3-year readings for 16 subjects, with column means in the last row. Source: Data courtesy of Everett Smith.]

The quadratic growth model can be written as

E[Xℓj] = [ 1  t1  t1² ]           [ βℓ0 ]
         [ 1  t2  t2² ]  βℓ,  βℓ = [ βℓ1 ]                         (6-76)
         [ ⋮   ⋮   ⋮  ]           [ βℓ2 ]
         [ 1  tp  tp² ]

If a qth-order polynomial is fit to the growth data, then

B = [ 1  t1  ⋯  t1^q ]            [ βℓ0 ]
    [ 1  t2  ⋯  t2^q ]   and  βℓ = [  ⋮  ]                         (6-77)
    [ ⋮   ⋮       ⋮  ]            [ βℓq ]
    [ 1  tp  ⋯  tp^q ]

Under the assumption of multivariate normality, the maximum likelihood estimators of the βℓ are

β̂ℓ = (B' Spooled⁻¹ B)⁻¹ B' Spooled⁻¹ x̄ℓ    for ℓ = 1, 2, ..., g    (6-78)

where

Spooled = (1/(N - g)) ((n1 - 1)S1 + ⋯ + (ng - 1)Sg) = W/(N - g)

with N = Σℓ nℓ, is the pooled estimator of the common covariance matrix Σ.
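A minimal computational sketch of the estimator (6-78), in Python with NumPy; this is our own illustration, with a hypothetical function name, not code from the text.

```python
import numpy as np

def growth_curve_fit(xbars, S_pooled, times, q):
    """GLS estimates (6-78) of qth-order polynomial growth coefficients,
    one (q+1)-vector per group; xbars is a list of group mean vectors."""
    B = np.vander(np.asarray(times, dtype=float), N=q + 1, increasing=True)  # columns 1, t, ..., t^q
    SinvB = np.linalg.solve(S_pooled, B)          # S_pooled^{-1} B
    G = B.T @ SinvB                               # B' S_pooled^{-1} B
    return np.array([np.linalg.solve(G, SinvB.T @ xb) for xb in xbars])

# By (6-79), the estimated covariance of each coefficient vector is
# (k / n_l) inv(G), with k = (N - g)(N - g - 1) / ((N - g - p + q)(N - g - p + q + 1)).
```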
The estimated covariances of the maximum likelihood estimators are

Côv(β̂ℓ) = (k/nℓ) (B' Spooled⁻¹ B)⁻¹    for ℓ = 1, 2, ..., g    (6-79)

where k = (N - g)(N - g - 1)/((N - g - p + q)(N - g - p + q + 1)). Also, β̂ℓ and β̂h are independent, for ℓ ≠ h, so their covariance is 0.

We can formally test that a qth-order polynomial is adequate. The model is fit without restrictions, and the error sum of squares and cross products matrix is just the within groups W that has N - g degrees of freedom. Under a qth-order polynomial, the error sum of squares and cross products

Wq = Σℓ Σj (Xℓj - B β̂ℓ)(Xℓj - B β̂ℓ)'    (6-80)

has N - g + p - q - 1 degrees of freedom. Under the polynomial growth model, there are q + 1 terms instead of the p means for each of the groups. Thus there are (p - q - 1)g fewer parameters. The likelihood ratio test of the null hypothesis that the qth-order polynomial is adequate can be based on Wilks' lambda

Λ* = |W| / |Wq|    (6-81)

For large sample sizes, the null hypothesis that the polynomial is adequate is rejected if

-(N - ½(p - q + g)) ln Λ* > χ²(p-q-1)g(α)    (6-82)

Example 6.15 (Fitting a quadratic growth curve to calcium loss) Refer to the data in Tables 6.5 and 6.6. Fit the model for quadratic growth. A computer calculation gives

[β̂1, β̂2] = [ 73.0701   70.1387 ]
            [  3.6444    4.0900 ]
            [ -2.0274   -1.8534 ]

so the estimated growth curves are

Control group:   73.07 + 3.64t - 2.03t²
                 (2.58)  (.83)   (.28)
Treatment group: 70.14 + 4.09t - 1.85t²
                 (2.50)  (.80)   (.27)

where

(B' Spooled⁻¹ B)⁻¹ = [ 93.1744  -5.8368   0.2184 ]
                     [ -5.8368   9.5699  -3.0240 ]
                     [  0.2184  -3.0240   1.1051 ]

and, from (6-79), k = (N - g)(N - g - 1)/((N - g - p + q)(N - g - p + q + 1)) = (29)(28)/((27)(28)) = 1.07. The standard errors given below the parameter estimates were obtained by dividing the diagonal elements of k(B' Spooled⁻¹ B)⁻¹ by nℓ and taking the square root.
Examination of the estimates and the standard errors reveals that the t² terms are needed. Loss of calcium is predicted after 3 years for both groups. Also, there does not seem to be any substantial difference between the two groups.

Wilks' lambda for testing the null hypothesis that the quadratic growth model is adequate becomes

Λ* = |W| / |Wq| = .7627

where W and Wq are the 4 × 4 error sums of squares and cross products matrices computed without and with the quadratic growth restriction, respectively. Since, with α = .01,

-(N - ½(p - q + g)) ln Λ* = -(31 - ½(4 - 2 + 2)) ln(.7627) = 7.86 < χ²(4-2-1)2(.01) = 9.21

we fail to reject the adequacy of the quadratic fit at α = .01. Since the p-value is less than .05 there is, however, some evidence that the quadratic does not fit well. We could, without restricting to quadratic growth, test for parallel and coincident calcium loss using profile analysis. ■
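The adequacy test (6-81)-(6-82) is equally direct to compute once W and Wq are in hand. A brief Python sketch (ours, not from the text):

```python
import numpy as np
from scipy.stats import chi2

def growth_adequacy_test(W, Wq, N, p, q, g, alpha=0.01):
    """Likelihood ratio test that a qth-order polynomial growth model is adequate."""
    wilks = np.linalg.det(W) / np.linalg.det(Wq)          # Wilks' lambda (6-81)
    stat = -(N - 0.5 * (p - q + g)) * np.log(wilks)       # test statistic in (6-82)
    crit = chi2.ppf(1 - alpha, df=(p - q - 1) * g)
    return wilks, stat, crit

# For Example 6.15 (N = 31, p = 4, q = 2, g = 2), lambda = .7627 gives
# -(31 - 2) ln(.7627) = 7.86, compared with the critical value chi2_2(.01) = 9.21.
```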
The Potthoff and Roy growth curve model holds for more general designs than one-way MANOVA. However, the β̂ℓ are no longer given by (6-78) and the expression for their covariance matrix becomes more complicated than (6-79). We refer the reader to [14] for more examples and further tests.

There are many other modifications to the model treated here. They include the following:
(a) Dropping the restriction to polynomial growth. Use nonlinear parametric models or even nonparametric splines.
(b) Restricting the covariance matrix to a special form such as equally correlated responses on the same individual.
(c) Observing more than one response variable, over time, on the same individual. This results in a multivariate version of the growth curve model.
More discussion is given in terms of the multivariate regression model in Chapter 7.

6.10 Perspectives and a Strategy for Analyzing Multivariate Models

We emphasize that, with several characteristics, it is important to control the overall probability of making any incorrect decision. This is particularly important when testing for the equality of two or more treatments as the examples in this chapter indicate. A single multivariate test, with its associated single p-value, is preferable to performing a large number of univariate tests. The outcome tells us whether or not it is worthwhile to look closer on a variable-by-variable and group-by-group basis. A single multivariate test is recommended over, say, p univariate tests because, as the next example demonstrates, univariate tests ignore important information and can give misleading results.

Example 6.16 (Comparing multivariate and univariate tests for the differences in means) Suppose we collect measurements on two variables X1 and X2 for ten randomly selected experimental units from each of two groups. The hypothetical data are noted here and displayed as scatter plots and marginal dot diagrams in Figure 6.6 on page 334.

[The ten bivariate observations (x1, x2) for each of the two groups are listed with the example and plotted in Figure 6.6.]

It is clear from the horizontal marginal dot diagram that there is considerable overlap in the x1 values for the two groups. Similarly, the vertical marginal dot diagram shows there is considerable overlap in the x2 values for the two groups. The scatter plots suggest that there is fairly strong positive correlation between the two variables for each group, and that, although there is some overlap, the group 1 measurements are generally to the southeast of the group 2 measurements.

Let μ1' = [μ11, μ12] be the population mean vector for the first group, and let μ2' = [μ21, μ22] be the population mean vector for the second group. Using the x1 observations, a univariate analysis of variance gives F = 2.46 with ν1 = 1 and ν2 = 18 degrees of freedom. Consequently, we cannot reject H0: μ11 = μ21 at any reasonable significance level (F1,18(.10) = 3.01). Similarly, using the x2 observations, a univariate analysis of variance gives F = 2.68 with ν1 = 1 and ν2 = 18 degrees of freedom. Again, we cannot reject H0: μ12 = μ22 at any reasonable significance level.
[Figure 6.6 Scatter plots and marginal dot diagrams for the data from two groups.]

The univariate tests suggest there is no difference between the component means for the two groups, and hence we cannot discredit μ1 = μ2. On the other hand, if we use Hotelling's T² to test for the equality of the mean vectors, we find

T² = 17.29 > c² = ((18)(2)/17) F2,17(.01) = 2.118 × 6.11 = 12.94

and we reject H0: μ1 = μ2 at the 1% level. The multivariate test takes into account the positive correlation between the two measurements for each group, information that is unfortunately ignored by the univariate tests. This T²-test is equivalent to the MANOVA test (6-42). ■
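Given the raw data matrices for the two groups, the two-sample T² comparison in this example can be reproduced with a few lines of Python (our own sketch; the function name is hypothetical):

```python
import numpy as np
from scipy.stats import f

def hotelling_two_sample(X1, X2, alpha=0.01):
    """Two-sample Hotelling T2 test of H0: mu1 = mu2, assuming equal covariance matrices."""
    (n1, p), n2 = X1.shape, X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    T2 = d @ np.linalg.solve((1.0 / n1 + 1.0 / n2) * S_pooled, d)
    c2 = ((n1 + n2 - 2) * p / (n1 + n2 - p - 1)) * f.ppf(1 - alpha, p, n1 + n2 - p - 1)
    return T2, c2

# For the data of Example 6.16 (n1 = n2 = 10, p = 2), this comparison
# should yield T2 = 17.29 against c2 = 12.94 at the 1% level.
```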
5 97.781 31.0 67.3 4. 0.5 79.700 10.048 4.367 3.5 94.~t 21 : ( 0.5 75.Perspectives and a Strategy for Analyzing Multivariate Models 335 Table 6. A plot of mass (Mass) versus snoutvent length (SVL).342 4.666 4. is shown in Figure 6.200 13. .0 SVL = snoutvent length.0 84.0 64.0 82.236 37.5 55. 0.0 108.103 13.1 4.011.201 38.5 91.5 62.0 62.350 7.0 60.910 17.690 6.747 s SVL 77.0 56. 4~~ ~ 0 0 8o 0 3 0 0 ofiJ o r::8o6' .220 5.0 81. Source: Data courtesy of Kevin E.0 84.989 27.0 Mass 13.5 52.450 14.5 70.109 12.5 99.4 4.088 2.0 62.9 figure 6.035 16.120 21.5 91.092 5.995 3.216 9.5 66.911 5.0 60.0 90.790 5.6 4.526 4.0 68.0 63.0 69.0 64.7.0 94.230 5.0 115.831 9.0 61.8 ln(SVL) 3.493 7.787 s SVL 80. • 4.901 19.5 72.0 75.0 64.665 6.220) ln(SVL): ~t 12 .610 13.0 79.476.0 69.630 13.5 70.0 69. Bonine.610 18.5 87.0 80.0 96.685 11.0 71.020 5. 7 4.525 22.0 96.419 13.900 9.0 69.977 8.0 56.5 91.763 9.811 6.520 13. The large sample individual 95% confidence intervals for the difference in Jn(Mass) means and the difference in Jn(SVL) means both cover 0.~t22 : ( 0. 7 Lizard Data for Two Genera c Mass 7.077 42.080 14.0 64.980 16.0 73.264 16.5 4.0 106.838 6.5 67.902 SVL 74.369 7.7.2 4.0 95.247 16.867 11. after taking natural logarithms. ln(Mass): 1tll .0 4.513 5.0 90.962 4.7 Scatter plot of ln(Mass) versus ln(SVL) for the lizard data in Table 6.832 15.331 41.0 111._ ·'·· o<e~g ' • ~ 0 ~ @0 .5 74.5 109.530 7.5 Mass 14.0 77.032 5.183) .
The corresponding univariate Student's t-test statistics for testing for no difference in the individual means have p-values of .46 and .08, respectively. Clearly, from a univariate perspective, we cannot detect a difference in mass means or a difference in snout-vent length means for the two genera of lizards. However, consistent with the scatter diagram in Figure 6.7, a bivariate analysis strongly supports a difference in size between the two groups of lizards. Using Result 6.4, the T² statistic has an approximate χ²₂ distribution; for this example, T² = 225.4 with a p-value less than .0001. A multivariate method is essential in this case. ■

Examples 6.16 and 6.17 demonstrate the efficacy of a multivariate test relative to its univariate counterparts. We encountered exactly this situation with the effluent data in Example 6.1.

In the context of random samples from several populations (recall the one-way MANOVA in Section 6.4), multivariate tests are based on the matrices

W = Σℓ Σj (xℓj - x̄ℓ)(xℓj - x̄ℓ)'   and   B = Σℓ nℓ (x̄ℓ - x̄)(x̄ℓ - x̄)'

Throughout this chapter, we have used Wilks' lambda statistic

Λ* = |W| / |B + W|

which is equivalent to the likelihood ratio test. Three other multivariate test statistics are regularly included in the output of statistical packages:

Lawley-Hotelling trace = tr[B W⁻¹]
Pillai trace = tr[B(B + W)⁻¹]
Roy's largest root = maximum eigenvalue of W(B + W)⁻¹

All four of these tests appear to be nearly equivalent for extremely large samples. For moderate sample sizes, all comparisons are based on what is necessarily a limited number of cases studied by simulation. From the simulations reported to date, the first three tests have similar power, while the last, Roy's test, behaves differently. Its power is best only when there is a single nonzero eigenvalue and, at the same time, the power is large. This may approximate situations where a large difference exists in just one characteristic and it is between one group and all of the others. There is also some suggestion that Pillai's trace is slightly more robust against nonnormality. However, we suggest trying transformations on the original data when the residuals are nonnormal.

When, and only when, the multivariate test signals a difference, or departure from the null hypothesis, do we probe deeper. We recommend calculating the Bonferroni intervals for all pairs of groups and all characteristics. The simultaneous confidence statements determined from the shadows of the confidence ellipse are, typically, too large. The one-at-a-time intervals may be suggestive of differences that merit further study but, with the current data, cannot be taken as conclusive evidence for the existence of differences.
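For completeness, here is a small Python sketch (ours, assuming B and W have already been formed as above) that evaluates all four statistics from the eigenvalues of W⁻¹B:

```python
import numpy as np

def manova_statistics(B, W):
    """Wilks, Lawley-Hotelling, Pillai, and Roy statistics from the
    between (B) and within (W) sums of squares and cross products matrices."""
    lam = np.linalg.eigvals(np.linalg.solve(W, B)).real   # eigenvalues of W^{-1}B
    wilks = np.prod(1.0 / (1.0 + lam))                    # |W| / |B + W|
    lawley_hotelling = lam.sum()                          # tr[B W^{-1}]
    pillai = np.sum(lam / (1.0 + lam))                    # tr[B (B + W)^{-1}]
    roy = 1.0 / (1.0 + lam.min())   # max eigenvalue of W(B + W)^{-1}, as defined above
    return wilks, lawley_hotelling, pillai, roy
```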
We summarize the procedure developed in this chapter for comparing treatments. The first step is to check the data for outliers using visual displays and other calculations.

A Strategy for the Multivariate Comparison of Treatments
1. Try to identify outliers. Check the data group by group for outliers. Also check the collection of residual vectors from any fitted model for outliers. Be aware of any outliers so calculations can be performed with and without them.
2. Perform a multivariate test of hypothesis. Our choice is the likelihood ratio test, which is equivalent to Wilks' lambda test.
3. Calculate the Bonferroni simultaneous confidence intervals. If the multivariate test reveals a difference, then proceed to calculate the Bonferroni confidence intervals for all pairs of groups or treatments, and all characteristics. If no differences are significant, try looking at Bonferroni intervals for the larger set of responses that includes the differences and sums of pairs of responses.

We must issue one caution concerning the proposed strategy. It may be the case that differences would appear in only one of the many characteristics and, further, the differences hold for only a few treatment combinations. Then, these few active differences may become lost among all the inactive ones. That is, the overall test may not show significance whereas a univariate test restricted to the specific active variable would detect the difference. The best preventative is a good experimental design. To design an effective experiment when one specific variable is expected to produce differences, do not include too many other variables that are not expected to show differences among the treatments.

Exercises

6.1. Construct and sketch a joint 95% confidence region for the mean difference vector δ using the effluent data and results in Example 6.1. Note that the point δ = 0 falls outside the 95% contour. Is this result consistent with the test of H0: δ = 0 considered in Example 6.1? Explain.

6.2. Using the information in Example 6.1, construct the 95% Bonferroni simultaneous intervals for the components of the mean difference vector δ. Compare the lengths of these intervals with those of the simultaneous intervals constructed in the example.

6.3. The data corresponding to sample 8 in Table 6.1 seem unusually large. Remove sample 8. Construct a joint 95% confidence region for the mean difference vector δ and the 95% Bonferroni simultaneous intervals for the components of the mean difference vector. Are the results consistent with a test of H0: δ = 0? Discuss. Does the "outlier" make a difference in the analysis of these data?
6.4. Refer to Example 6.1.
(a) Redo the analysis in Example 6.1 after transforming the pairs of observations to ln(BOD) and ln(SS).
(b) Construct the 95% Bonferroni simultaneous intervals for the components of the mean vector δ of transformed variables.
(c) Discuss any possible violation of the assumption of a bivariate normal distribution for the difference vectors of transformed observations.

6.5. A researcher considered three indices measuring the severity of heart attacks. The values of these indices for n = 40 heart-attack patients arriving at a hospital emergency room produced the summary statistics

x̄ = [46.1, 57.3, 50.4]'   and   S = [ 101.3   63.0   71.0 ]
                                    [  63.0   80.2   55.6 ]
                                    [  71.0   55.6   97.4 ]

(a) All three indices are evaluated for each patient. Test for the equality of mean indices using (6-16) with α = .05.
(b) Judge the differences in pairs of mean indices using 95% simultaneous confidence intervals. [See (6-18).]

6.6. Use the data for treatments 2 and 3 in Exercise 6.8.
(a) Calculate Spooled.
(b) Test H0: μ2 - μ3 = 0 employing a two-sample approach with α = .01.
(c) Construct 99% simultaneous confidence intervals for the differences μ2i - μ3i, i = 1, 2.

6.7. Using the summary statistics for the electricity-demand data given in Example 6.4:
(a) Compute T² and test the hypothesis H0: μ1 - μ2 = 0, assuming that Σ1 = Σ2. Set α = .05. (See Example 6.4.)
(b) Using the information in Part a, determine the linear combination of mean components most responsible for the rejection of H0.

6.8. Observations on two responses are collected for three treatments. The observation vectors [x1, x2]' are

Treatment 1: [6, 7]', [5, 9]', [8, 6]', [4, 9]', [7, 9]'
Treatment 2: [3, 3]', [1, 6]', [2, 3]'
Treatment 3: [2, 3]', [5, 1]', [3, 1]', [2, 3]'

(a) Break up the observations into mean, treatment, and residual components, as in (6-39). Construct the corresponding arrays for each variable.
(b) Using the information in Part a, construct the one-way MANOVA table.
(c) Evaluate Wilks' lambda, Λ*, and use Table 6.3 to test for treatment effects. Set α = .01. Repeat the test using the chi-square approximation with Bartlett's correction. [See (6-43).] Compare the conclusions.
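For hand-sized problems like Exercise 6.8, the one-way MANOVA matrices can be checked numerically. A short Python sketch (our own illustration, not part of the exercises):

```python
import numpy as np

def one_way_manova(groups):
    """Between (B) and within (W) SSP matrices, and Wilks' lambda,
    for a one-way MANOVA; groups is a list of (n_l x p) data arrays."""
    X = np.vstack(groups)
    xbar = X.mean(axis=0)
    B = sum(len(g) * np.outer(g.mean(axis=0) - xbar, g.mean(axis=0) - xbar) for g in groups)
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    wilks = np.linalg.det(W) / np.linalg.det(B + W)
    return B, W, wilks
```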
6.9. Using the contrast matrix C in (6-13), verify the relationships dj = Cxj, d̄ = Cx̄, and Sd = CSC' in (6-14).

6.10. Consider the univariate one-way decomposition of the observation xℓj given by (6-34). Show that the mean vector x̄1 is always perpendicular to the treatment effect vector (x̄1 - x̄)u1 + (x̄2 - x̄)u2 + ⋯ + (x̄g - x̄)ug, where

u1 = [1, ..., 1, 0, ..., 0, ..., 0]'   (n1 ones, then zeros),
u2 = [0, ..., 0, 1, ..., 1, ..., 0]'   (n2 ones in the next n2 positions),
⋮
ug = [0, ..., 0, ..., 1, ..., 1]'      (ng ones in the last ng positions).

6.11. A likelihood argument provides additional support for pooling the two independent sample covariance matrices to estimate a common covariance matrix in the case of two normal populations. Give the likelihood function, L(μ1, μ2, Σ), for two independent samples of sizes n1 and n2 from Np(μ1, Σ) and Np(μ2, Σ) populations, respectively. Show that this likelihood is maximized by the choices μ̂1 = x̄1, μ̂2 = x̄2, and Σ̂ = (1/(n1 + n2))[(n1 - 1)S1 + (n2 - 1)S2].
Hint: Use (4-16) and the maximization Result 4.10.

6.12. (Test for linear profiles, given that the profiles are parallel.) Let μ1' = [μ11, μ12, ..., μ1p] and μ2' = [μ21, μ22, ..., μ2p] be the mean responses to p treatments for populations 1 and 2, respectively. Assume that the profiles given by the two mean vectors are parallel.
(a) Show that the hypothesis that the profiles are linear can be written as H0: (μ1i + μ2i) - (μ1,i-1 + μ2,i-1) = (μ1,i-1 + μ2,i-1) - (μ1,i-2 + μ2,i-2), i = 3, ..., p, or as H0: C(μ1 + μ2) = 0, where the (p - 2) × p matrix

C = [ 1  -2   1   0  ⋯  0   0   0 ]
    [ 0   1  -2   1  ⋯  0   0   0 ]
    [ ⋮                           ]
    [ 0   0   0   0  ⋯  1  -2   1 ]

(b) Following an argument similar to the one leading to (6-73), we reject H0: C(μ1 + μ2) = 0 at level α if

T² = (x̄1 + x̄2)' C' [ (1/n1 + 1/n2) C Spooled C' ]⁻¹ C (x̄1 + x̄2) > c²

where c² = ((n1 + n2 - 2)(p - 2)/(n1 + n2 - p + 1)) Fp-2,n1+n2-p+1(α).
Let n1 = 30, n2 = 30, x̄1' = [6.4, 6.8, 7.3, 7.0], x̄2' = [4.3, 4.9, 5.3, 5.1], and

Spooled = [ .61  .26  .07  .16 ]
          [ .26  .64  .17  .14 ]
          [ .07  .17  .81  .03 ]
          [ .16  .14  .03  .31 ]

Test for linear profiles, assuming that the profiles are parallel. Use α = .05.

6.13. (Two-way MANOVA without replications.) Consider the observations on two responses, x1 and x2, displayed in the form of the following two-way table (note that there is a single observation vector at each combination of factor levels):

[The 3 × 4 table of bivariate observation vectors, with Factor 1 at three levels and Factor 2 at four levels, is given with the exercise.]

With no replications, the two-way MANOVA model is

Xℓk = μ + τℓ + βk + eℓk,   Σℓ τℓ = Σk βk = 0

where the eℓk are independent Np(0, Σ) random vectors.
(a) Decompose the observations for each of the two variables as

xℓk = x̄ + (x̄ℓ· - x̄) + (x̄·k - x̄) + (xℓk - x̄ℓ· - x̄·k + x̄)

similar to the arrays in Example 6.9. For each response, this decomposition will result in several 3 × 4 matrices. Here x̄ is the overall average, x̄ℓ· is the average for the ℓth level of factor 1, and x̄·k is the average for the kth level of factor 2.
(b) Regard the rows of the matrices in Part a as strung out in a single "long" vector, and compute the sums of squares SStot = SSmean + SSfac1 + SSfac2 + SSres and sums of cross products SCPtot = SCPmean + SCPfac1 + SCPfac2 + SCPres. Consequently, obtain the matrices SSPcor, SSPfac1, SSPfac2, and SSPres with degrees of freedom gb - 1, g - 1, b - 1, and (g - 1)(b - 1), respectively.
(c) Summarize the calculations in Part b in a MANOVA table.
Hint: This MANOVA table is consistent with the two-way MANOVA table for comparing factors and their interactions where n = 1. Note that, with n = 1, SSPres in the general two-way MANOVA table is a zero matrix with zero degrees of freedom. The matrix of interaction sum of squares and cross products now becomes the residual sum of squares and cross products matrix.
(d) Given the summary in Part c, test for factor 1 and factor 2 main effects at the α = .05 level.
Hint: Use the results in (6-67) and (6-69) with gb(n - 1) replaced by (g - 1)(b - 1).
Note: The tests require that p ≤ (g - 1)(b - 1) so that SSPres will be positive definite (with probability 1).

6.14. Refer to Example 6.13.
(a) Carry out approximate chi-square (likelihood ratio) tests for the factor 1 and factor 2 effects. Set α = .05. Compare these results with the results for the exact F-tests given in the example. Explain any differences.
(b) Using (6-70), construct simultaneous 95% confidence intervals for differences in the factor 1 effect parameters for pairs of the three responses. Interpret these intervals. Repeat these calculations for factor 2 effect parameters.

6.15. A replicate of the experiment in Exercise 6.13 yields the following data:

[The replicate 3 × 4 table of bivariate observation vectors is given with the exercise.]

(a) Use these data to decompose each of the two measurements in the observation vector as

xℓk = x̄ + (x̄ℓ· - x̄) + (x̄·k - x̄) + (xℓk - x̄ℓ· - x̄·k + x̄)

where x̄ is the overall average, x̄ℓ· is the average for the ℓth level of factor 1, and x̄·k is the average for the kth level of factor 2. Form the corresponding arrays for each of the two responses.
(b) Combine the preceding data with the data in Exercise 6.13 and carry out the necessary calculations to complete the general two-way MANOVA table.
(c) Given the results in Part b, test for interactions, and if the interactions do not exist, test for factor 1 and factor 2 main effects. Use the likelihood ratio test with α = .05.
(d) If main effects, but no interactions, exist, examine the nature of the main effects by constructing Bonferroni simultaneous 95% confidence intervals for differences of the components of the factor effect parameters.
0 1056.0 1046.0 659.0 1011.0 735.5 ArabicSame (x4) 601.0 925.0 761.5 1028.0 784.0 826.0 797.0 941.0 WordSame (x2) 860. Assuming that the measures of stiffness arise from four treatments.0 1081.5 Source: Data courtesy of J.5 659.0 909.0 752.0 1225.0 583.14).5 833. sen ted (words.0 735.0 1126.0 754.5 983.0 735. Four measures of the response stiffness on .0 767.0 639.8 were collected to test two psychological models of numerical cognition. The measures.0 553.0 986.0 751.5 539.5 789.5 896.0 678. Construct a 95% (simultaneous) confidence interval for a contrast in the mean levels representing a comparison of the dynamic measurements with the static measurements.0 724.0 918.0 975.0 854.0 751.0 611.0 1201.0 737. .0 843.0 958.5 1096.0 728.0 700.0 888.0 711.0 865.0 612.0 982.0 863.5 777.0 1004.5 919. are repeated in the sense that they were made one after another.0 671.5 1076.0 831.5 624.05.0 901.0 656.0 785.5 673.5 875.0 662. 6.5 1140.0 731.5 1311.0 1179.0 752.each of 30 boards are listed in Table 4.0 572.0 674.1.0 663.5 750. on a given board.5 811.0 ArabicDiff (xJ) 691.0 909. test for the equality of treatments in a repeated measures design context.0 662. Does the processfng of. Carr.0 872.0 1009. Set· a == .342 Chapter 6 Comparisons of Several Multivariate Means The following exercises may require the use of a computer.0 1044.0 887.0 813.0 746.5 926.0 931.0 930.0 657.numbers depend on the way the numbers are pre.0 625.0 708.5 1083.0 747. The data in Table 6.5 945.0 833.0 839.0 867.0 904.0 687.0 657.0 640.0 995.0 669. Arabic digits)? Thirtytwo subjects were required to make a series of Table 6.0 905.8 Number Parity Data (Median Times in Milliseconds) WordDiff (xi) 869.0 726.0 856.0 1114.0 894.16.0 774.0 683.0 1172.0 744.0 735.5 954.0 657.3 (see Example 4.5 796. 6.0 1037.5 1289.0 1059.7.0 766.0 915.0 572.0 925.5 1130.5 1408.5 814.
quick numerical judgments about two numbers presented as either two number words ("two," "four") or two single Arabic digits ("2," "4"). The subjects were asked to respond "same" if the two numbers had the same numerical parity (both even or both odd) and "different" if the two numbers had a different parity (one even, one odd). Half of the subjects were assigned a block of Arabic digit trials, followed by a block of number word trials, and half of the subjects received the blocks of trials in the reverse order. Within each block, the order of "same" and "different" parity trials was randomized for each subject. For each of the four combinations of parity and format, the median reaction times for correct responses were recorded for each subject. Here

X1 = median reaction time for word format-different parity combination
X2 = median reaction time for word format-same parity combination
X3 = median reaction time for Arabic format-different parity combination
X4 = median reaction time for Arabic format-same parity combination

(a) Test for treatment effects using a repeated measures design. Set α = .05.
(b) Construct 95% (simultaneous) confidence intervals for the contrasts representing the number format effect, the parity type effect and the interaction effect. Interpret the resulting intervals.
(c) The absence of interaction supports the M model of numerical cognition, while the presence of interaction supports the C and C model of numerical cognition. Which model is supported in this experiment?
(d) For each subject, construct three difference scores corresponding to the number format contrast, the parity type contrast, and the interaction contrast. Is a multivariate normal distribution a reasonable population model for these data? Explain.

6.18. In the first phase of a study of the cost of transporting milk from farms to dairy plants, a survey was taken of firms engaged in milk transportation. Cost data on X1 = fuel, X2 = repair, and X3 = capital, all measured on a per-mile basis, are presented in Table 6.10 on page 345 for n1 = 36 gasoline and n2 = 23 diesel trucks.
(a) Test for differences in the mean cost vectors. Set α = .01.
(b) If the hypothesis of equal cost vectors is rejected in Part a, find the linear combination of mean components most responsible for the rejection.
(c) Construct 99% simultaneous confidence intervals for the pairs of mean components. Which costs, if any, appear to be quite different?
(d) Comment on the validity of the assumptions used in your analysis. Note in particular that observations 9 and 21 for gasoline trucks have been identified as multivariate outliers. (See Exercise 5.22 and [2].) Repeat Part a with these observations deleted. Comment on the results.

6.19. Jolicoeur and Mosimann [12] studied the relationship of size and shape for painted turtles. Table 6.9 contains their measurements on the carapaces of 24 female and 24 male turtles.
(a) Test for equality of the two population mean vectors using α = .05.
(b) If the hypothesis in Part a is rejected, find the linear combination of mean components most responsible for rejecting H0.
(c) Find simultaneous confidence intervals for the component mean differences. Compare with the Bonferroni intervals.
Hint: You may wish to consider logarithmic transformations of the observations.
6.20. The tail lengths in millimeters (x1) and wing lengths in millimeters (x2) for 45 male hook-billed kites are given in Table 6.11 on page 346. Similar measurements for female hook-billed kites were given in Table 5.12.
(a) Plot the male hook-billed kite data as a scatter diagram, and (visually) check for outliers. (Note, in particular, observation 31 with x1 = 284.)
(b) Test for equality of mean vectors for the populations of male and female hook-billed kites. Set α = .05. If H0: μ1 - μ2 = 0 is rejected, find the linear combination most responsible for the rejection of H0. (You may want to eliminate any outliers found in Part a for the male hook-billed kite data before conducting this test. Alternatively, you may want to interpret x1 = 284 for observation 31 as a misprint and conduct the test with x1 = 184 for this observation. Does it make any difference in this case how observation 31 for the male hook-billed kite data is treated?)
(c) Determine the 95% confidence region for μ1 - μ2 and 95% simultaneous confidence intervals for the components of μ1 - μ2.
(d) Are male or female birds generally larger?

Table 6.9 Carapace Measurements (in Millimeters) for Painted Turtles

        Female                        Male
Length (x1)  Width (x2)  Height (x3)   Length (x1)  Width (x2)  Height (x3)
   98    81    38    93    74    37
  103    84    38    94    78    35
  103    86    42    96    80    35
  105    86    42   101    84    39
  109    88    44   102    85    38
  123    92    50   103    81    37
  123    95    46   104    83    39
  133    99    51   106    83    39
  133   102    51   107    82    38
  133   102    51   112    89    40
  134   100    48   113    88    40
  136   102    49   114    86    40
  138    98    51   116    90    43
  138    99    51   117    90    41
  141   105    53   117    91    41
  147   108    57   119    93    41
  149   107    55   120    89    40
  153   107    56   120    93    44
  155   115    63   121    95    42
  155   117    60   125    93    45
  158   115    62   127    96    45
  159   118    63   128    95    45
  162   124    61   131    95    46
  177   132    67   135   106    47
26 5.20 23.88 12.93 14.70 12.99 29.14 12.42 10.22 9.90 11.00 20.66 10.24 10.43 2.95 2.13 11.15 9.94 5.59 14.26 2.61 5.59 8.63 2.87 7.00 14.90 10.22 12.28 10.13 10.10 X! Milk TransportationCost Data Diesel trucks XJ Gasoline trucks X2 X! x2 x3 16.86 11.67 9.23 5.72 4.78 5.07 6.77 17.49 11.06 17.09 7.68 12.79 9.37 10.80 3.86 11.09 14.16 4.63 5.16 12.51 26.44 8.98 9.19 9.72 4.22 13.11 12.23 11.75 13.49 11.35 9.50 13.66 17.Exercises 345 Table 6.18 17.11 17.18 12. samples of 20 Aa (middlehigh quality) corporate bonds and 20 Baa (topmedium quality) corporate bonds were selected.49 8.53 13.26 6.25 11.32 8.51 9.24 11.16 7.22 9.20 14.44 7.11 12.18 9.28 10.52 13.95 11.44 21. Using Moody's bond ratings.65 21. the ratios X 1 = current ratio (a measure of shortterm liquidity) X 2 = longterm interest rate (a measure of interest coverage) X 3 = debttoequity ratio (a measure of financial risk or leverage) X 4 = rate of return on equity (a measure of profitability) .60 9.24 13.77 22.68 7.01 16.84 35.18 17.61 14.39 6.23 6.13 3.05 2.88 10.35 5.17 10.23 3.90 5.32 29. For each of the corresponding companies.73 14.66 28.25 10.58 17.15 14.69 16.54 10.02 17.94 4.29 15.27 15.59 6.83 5.03 12.78 10.88 12.45 3.28 11.85 11.17 13.18 4.72 8.94 9.00 19.38 19.15 11.17 7.95 16.47 11.98 14. Keaton.91 8.59 6.89 7.42 9.32 14.23 8.70 9.17 12.77 11.21.09 Source: Data courtesy of M.61 9.70 7.67 6.72 9.18 8.21 15.34 8.16 12.70 8.60 6.92 9.00 4.06 9.86 9.70 10.70 1.49 17.13 9.47 19.09 12.75 7.89 9.53 8.32 12.50 7.22 12.25 13.78 5.09 8.92 4.05 5.68 20. 6.78 10.44 8.14 6.45 16.43 10.
Table 6.11 Male Hook-Billed Kite Data

   x1      x2        x1      x2        x1      x2
 (Tail)  (Wing)    (Tail)  (Wing)    (Tail)  (Wing)
  180     278       185     282       284     277
  186     277       195     285       176     281
  206     308       183     276       185     287
  184     290       202     308       191     295
  177     273       177     254       177     267
  177     284       177     268       197     310
  176     267       170     260       199     299
  200     281       186     274       190     273
  191     287       177     272       180     278
  193     271       178     266       189     280
  212     302       192     281       194     290
  181     254       204     276       186     287
  195     297       191     290       191     286
  187     281       178     265       187     288
  190     284       177     275       186     275

Source: Data courtesy of S. Temple.

were recorded. The summary statistics are as follows:

Aa bond companies:  n1 = 20,  x̄1 = [2.287, 12.600, .347, 14.830]'
Baa bond companies: n2 = 20,  x̄2 = [2.404, 7.155, .524, 12.840]'

together with the sample covariance matrices S1 and S2 and their pooled estimate Spooled.

(a) Does pooling appear reasonable here? Comment on the pooling procedure in this case.
(b) Are the financial characteristics of firms with Aa bonds different from those with Baa bonds? Using the pooled covariance matrix, test for the equality of mean vectors. Set α = .05.
(c) Calculate the linear combinations of mean components most responsible for rejecting H0: μ1 - μ2 = 0 in Part b.
(d) Bond rating companies are interested in a company's ability to satisfy its outstanding debt obligations as they mature. Does it appear as if one or more of the foregoing financial ratios might be useful in helping to classify a bond as "high" or "medium" quality? Explain.
(e) Repeat Part b assuming normal populations with unequal covariance matrices (see (6-27), (6-28) and (6-29)). Does your conclusion change?

6.22. Researchers interested in assessing pulmonary function in nonpathological populations asked subjects to run on a treadmill until exhaustion. Samples of air were collected at definite intervals and the gas contents analyzed. The results on 4 measures of oxygen consumption for 25 males and 25 females are given in Table 6.12 on page 348. The variables were

X1 = resting volume O2 (L/min)
X2 = resting volume O2 (mL/kg/min)
X3 = maximum volume O2 (L/min)
X4 = maximum volume O2 (mL/kg/min)

(a) Look for gender differences by testing for equality of group means. Use α = .05. If you reject H0: μ1 - μ2 = 0, find the linear combination most responsible.
(b) Construct the 95% simultaneous confidence intervals for each μ1i - μ2i, i = 1, 2, 3, 4. Compare with the corresponding Bonferroni intervals.
(c) The data in Table 6.12 were collected from graduate-student volunteers, and thus they do not represent a random sample. Comment on the possible implications of this information.

6.23. Construct a one-way MANOVA using the width measurements from the iris data in Table 11.5. Construct 95% simultaneous confidence intervals for differences in mean components for the two responses for each pair of populations. Comment on the validity of the assumption that Σ1 = Σ2 = Σ3.

6.24. Researchers have suggested that a change in skull size over time is evidence of the interbreeding of a resident population with immigrant populations. Four measurements were made of male Egyptian skulls for three different time periods: period 1 is 4000 B.C., period 2 is 3300 B.C., and period 3 is 1850 B.C. The data are shown in Table 6.13 on page 349 (see the skull data on the website www.prenhall.com/statistics). The measured variables are

X1 = maximum breadth of skull (mm)
X2 = basibregmatic height of skull (mm)
X3 = basialveolar length of skull (mm)
X4 = nasal height of skull (mm)

Construct a one-way MANOVA of the Egyptian skull data. Use α = .05. Construct 95% simultaneous confidence intervals to determine which mean components differ among the populations represented by the three time periods. Are the usual MANOVA assumptions realistic for these data? Explain.

6.25. Construct a one-way MANOVA of the crude-oil data listed in Table 11.7 on page 662. Construct 95% simultaneous confidence intervals to determine which mean components differ among the populations. (You may want to consider transformations of the data to make them more closely conform to the usual MANOVA assumptions.)
Source.97 6.30 39.87 3.56 51.69 3.25 0.38 4.95 4.10 38.15 55.86 3.07 4.86 35.10 4.37 4.51 2.95 40.71 2.00 47.43 3.• l.75 48.36 0.0.29 0.60 3.00 6.43 2.46 5.02 48.77 5.39 0.31 0.23 5.26 0.t 33.41 28.39 3.07 4.97 2.37 0.16 38.82 Maximum0 2 (mL/kg/min) 30.80 57.03 2.23 4.04 6.39 0.50 48.31 0.35 0.50 2.57 1.13 3.34 34.27 0.87 38. ·' ' .50 0. .30 0.30 51.32 2.85 ' 0.88 5.67 58.55 0.10 4.05 ~.·.20 5..97 4.60 43.60 36.80 3.44 0.22 4.43 0.11 0.08 5.12 OxygenConsumption Data Males Xt x2 Females x3 x4 xi x2 XJ X4 I Resting 0 2 (L/min) I I Resting 0 2 (mL/kg/rnin) 3.30 0.97 37.59 3.73 5.80 6.40 39.35 32.19 39.37 0.13 3.69 30.16 11.85 35.12 1.95 5.48 2.78 46.92 48. ···'""""""""'"'"")) ' '"' .07 .42 48.34 56.58 50.34 0.25 .68 4.00 2.05 5.48 2.40 0.86 Resting0 2 (L/min) 0.48 0.40 37.34 0.29 3.90 2.66 0.33 0. •' .47 3..25 1.06 3.80 37.31 0.31 1.46 39.49 2.45 Maximum0 2 (L/min) 2.85 44.31 0.50 3.70 63..28 0.32 0..95 4.Table 6.40 5.30 6.11 3.89 5.60 2.69 4..56 3. ~.99 6.58 1.27 4.71 4.12 4.40 0.18 0.06 2.36 0.22 0..82 3.21 0.25 51.40 2.80 4. Data courtesy of S.28 7.80 31.94 42.29 0.48 0.37 ll>e I I ~ ""' 0.55 4.08 58.28 0.01 6.48 .l ""''~"" .85 5.32 2.21 39.74 4.35 Resting 0 2 Maximum 0 2 Maximum0 2 (mL/kg/min) (L/min) (mL/kg/min) 5.98 2.23 55.66 5.30 46.93 2.71 5.38 50.87 43.55 5.10 2.76 2. Rokicki.54 0.32 0.31 0.46 1.28 0.04 3.51 4.35 7.77 6.58 3.82 36.43 5.34 0.50 0.32 0.44 0.51 46.99 42.95 4.31 3.00 5.49 49.32 6.33 0.
Wisconsin. for both the test group and the control group. Hourly consumption (in kilowatthours) was measured on a hot summer day in July and compared.13 Egyptian Skull Data MaxBreath (xi) BasHeight (x2) BasLength (x3) NasHeight (x4) Tune Period 131 125 131 119 136 138 139 125 131 134 124 133 138 148 126 135 132 133 131 133 132 133 138 130 136 134 136 133 138 138 138 131 132 132 143 137 130 136 134 134 : 89 92 99 96 100 89 108 93 102 99 : 49 48 50 44 54 56 48 48 51 51 48 48 45 51 45 52 54 48 50 46 52 50 51 45 49 52 54 49 55 46 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 138 134 134 129 124 136 145 130 134 125 : 101 97 98 104 95 98 100 102 96 94 : 130 131 137 127 133 123 137 131 133 133 91 100 94 99 91 95 101 96 100 91 Source: Data courtesy of J. The responses.Exercises 349 Table 6. A project was designed to investigate how consumers in Green Bay. 6. The cost of electricity during peak periods for some customers was set at eight times the cost of electricity during offpeak hours. Jackson. with baseline consumption measured on a similar day before the experimental rates began.26.log(baseline consumption) . log( current consumption) . would react to an electrical timeofuse pricing scheme.
for the hours ending 9 A.M., 11 A.M., 1 P.M., and 3 P.M. (a peak hour) produced the following summary statistics:

Test group:    n1 = 28,  x̄1 = [.153, .231, .322, .339]'
Control group: n2 = 58,  x̄2 = [.151, .180, .256, .257]'

and

Spooled = [ .804  .355  .228  .232 ]
          [ .355  .722  .233  .199 ]
          [ .228  .233  .592  .239 ]
          [ .232  .199  .239  .479 ]

Source: Data courtesy of Statistical Laboratory, University of Wisconsin.

Perform a profile analysis. Does time-of-use pricing seem to make a difference in electrical consumption? What is the nature of this difference, if any? Comment. (Use a significance level of α = .05 for any statistical tests.)

6.27. As part of the study of love and marriage in Example 6.14, a sample of husbands and wives were asked to respond to these questions:
1. What is the level of passionate love you feel for your partner?
2. What is the level of passionate love that your partner feels for you?
3. What is the level of companionate love that you feel for your partner?
4. What is the level of companionate love that your partner feels for you?
The responses were recorded on the following 5-point scale: 1, none at all; 2, very little; 3, some; 4, a great deal; 5, tremendous amount. Thirty husbands and 30 wives gave the responses in Table 6.14, where X1 = a 5-point-scale response to Question 1, X2 = a 5-point-scale response to Question 2, X3 = a 5-point-scale response to Question 3, and X4 = a 5-point-scale response to Question 4.
(a) Plot the mean vectors for husbands and wives as sample profiles.
(b) Is the husband rating wife profile parallel to the wife rating husband profile? Test for parallel profiles with α = .05. If the profiles appear to be parallel, test for coincident profiles at the same level of significance. Finally, if the profiles are coincident, test for level profiles with α = .05. What conclusion(s) can be drawn from this analysis?

6.28. Two species of biting flies (genus Leptoconops) are so similar morphologically that for many years they were thought to be the same. Biological differences such as sex ratios of emerging flies and biting habits were found to exist. Do the taxonomic data listed in part in Table 6.15 on page 352 and on the website www.prenhall.com/statistics indicate any difference in the two species L. carteri and L. torrens? Test for the equality of the two population mean vectors using α = .05. If the hypotheses of equal mean vectors is rejected, determine the mean components (or linear combinations of mean components) most responsible for rejecting H0. Justify your use of normal-theory methods for these data.

6.29. Using the data on bone mineral content in Table 1.8, investigate equality between the dominant and nondominant bones.
(a) Test using α = .05.
(b) Construct 95% simultaneous confidence intervals for the mean differences.
(c) Construct the Bonferroni 95% simultaneous intervals,
and compare these with the intervals in Part b. (c) Construct the Bonferroni 95% simultaneous intervals. for the first 24 subjects in Thble 1. (b) Construct 95% simultaneous confidence intervals for the mean differences. 6.Exercises 351 Table 6. 1 year after their participation in an experimental program.8.14 Spouse Data Husband rating wife Xt Xz • XJ Wife rating husband x4 XI xz XJ x4 2 3 5 4 5 4 5 4 4 3 3 3 4 4 4 4 5 4 4 4 3 4 5 5 5 3 3 3 4 4 5 4 4 5 4 5 4 5 4 4 5 5 4 5 5 3 5 5 3 5 4 4 5 4 4 5 4 4 4 4 4 3 4 3 4 3 4 4 5 4 5 4 3 3 4 4 4 5 5 5 5 5 4 5 5 5 5 5 4 4 5 4 4 5 5 5 4 5 4 3 4 3 5 4 4 4 3 5 4 3 5 5 4 4 4 4 5 5 5 4 5 4 5 4 4 5 5 5 4 5 5 4 4 5 5 4 5 4 5 4 4 4 3 5 5 5 4 4 4 4 3 5 4 4 4 4 5 4 4 4 4 5 5 5 4 3 3 5 4 4 4 4 4 5 5 5 4 2 3 4 4 4 3 4 4 5 5 3 4 3 4 4 5 5 3 4 4 5 3 5 5 3 4 4 5 3 4 3 4 4 5 3 5 5 5 4 3 4 4 4 4 4 4 4 4 5 5 5 4 5 5 4 5 4 5 4 4 5 5 5 Source: Data courtesy of E. (b) Construct 95% simultaneous confidence intervals for the mean differences. . Hatfield.05. (c) Construct the Bonferroni 95% simultaneous intervals.16 on page 353 contains the bone mineral contents.30. (a) Test using a = . and compare these with the intervals in Part b.05. Thble 6. Compare the data from both tables to determine whether there has been bone loss. (a) Test using a = .
15 BitingFly Data Xt X2 X3 X4 x5 x6 x7 (Wing) (Wing) length width 85 87 94 92 96 91 90 92 91 87 L.352 Chapter 6 Comparisons of Several Multivariate Means Table 6. . carteri : 46 19 40 48 41 43 43 45 43 41 44 : 15 14 15 12 14 11 40 42 43 : 14 14 12 15 15 14 18 15 16 : 26 31 23 24 27 30 23 29 22 30 25 31 33 25 32 25 29 31 31 34 : 10 10 10 10 11 10 9 9 9 10 9 6 10 9 9 9 11 11 10 10 : 10 11 10 10 10 10 10 10 10 10 9 7 10 8 9 9 11 10 10 10 : 99 110 99 42 45 44 103 95 101 103 99 105 99 Source: 43. 28 26 24 26 26 23 24 : 9 13 8 9 10 9 9 9 9 9 : 8 13 9 9 10 9 9 9 9 10 : 106 105 103 100 109 104 95 104 90 104 86 94 103 82 15 14 15 14 13 103 101 103 100 99 100 L. 46 47 47 43 50 47 38 41 35 38 36 38 40 37 40 39 14 17 16 14 15 14 15 14 16 14 33 36 31 32 31 37 32 23 33 34 9 9 10 10 8 11 11 11 12 7 9 10 10 10 8 11 11 10 11 7 Data courtesy of William Atchley. torrens : Chi'd) Clri'd) palp length 31 32 36 32 35 36 36 36 36 35 38 34 34 35 36 36 35 34 37 37 37 38 39 35 42 40 44 palp width 13 comili) ( Length of ) ( Lengili of · · antenna! antenna] palp length segment 12 segment 13 25 r 41 38 44 43 43 44 42 43 41 38 47 46 44 41 44 45 40 44 40 14 15 17 14 12 16 17 14 11 : 22 27.
875 .17 on page 354.860 1.330 2.265 1. but not for others? Check by running three separate univariate twofactor ANOVAs.637 . 2) and.619 .742 1.734 .997 1228 1.813 . The data for one twofactor experiment are given in Table 6.914 . Three varieties (5.420 1.05.463 . and a locationvariety interaction.931 1.715 .869 . and 8) were grown at two geographical locations (1.130 1.999 1.826 .164 1. The three variables are xl = Yield (plot weight) X 2 = Sound mature kernels (weight in gramsmaximum of 250 grams) X 3 = Seed size (weight.857 .16 Mineral Content in Bones (After 1 Year) Subject number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Dominant radius 1.756 1.498 . (c) Using the results in Part a.964 .159 1.609 2.933 1.953 1. Use a = .640 Source: Data courtesy of Everett Smith.580 .851 .941 1.730 .727 .051 .443 1.923 .825 . can we conclude that the location and/or variety effects are additive? If not.551 .668 1.779 .842 1.190 1.844 . does the interaction effect show up for some variables. 6.17.352 1.246 1.689 .765 .809 1.900 .627 . .823 .923 2.027 .991 .651 1.776 Radius 1.925 .643 1.873 . in grams.740 .540 .656 .670 .875 .932 .537 .746 .442 Humerus 2.747 1. crop scientists routinely compare varieties with respect to several variables.573 2.949 .765 .879 .843 .Exercises 353 Table 6.396 1. in this case.886 .769 .883 .726 .869 .806 .673 .764 .585 Ulna . of 100 seeds) There were two replications of the experiment.515 .785 .977 .773 .106 1.880 .710 1.729 .815 1.242 2.817 .579 1.846 1. Test for a location effect.765 .905 .687 .698 .654 .577 .756 .912 .698 .804 .693 .708 .761 . the three variables representing yield and the two important gradegrain characteristics were measured.661 1. Peanuts are an important crop in parts of the southern United States.656 .753 .570 . a variety effect.738 .787 . (a) Perform a twofactor MANOVA using the data in Table 6.634 .470 1.718 1.782 .770 .947 . In an effort to develop improved plants.268 1.980 1.526 .041 1.743 Dominant humerus 2.506 .692 .640 . (b) Analyze the residuals from Part a.811 .31.686 1.789 .602 .826 .411 Dominant ulna . 6.378 1. Do the usual MANOVA assumptions appear to be satisfied? Discuss.776 2.802 .865 .906 .851 1.
1 156.20 Species Nutrient ss JL LP ss + + + . test for a species effect and a nutrient effect.45 29.8 Seed Size 51. Use a = .05.9 202.3 194.6 193.·.354 Chapter 6 Comparisons of Several Multivariate Means Table 6. coded .a ANOVA for the 720CM observations. Japanese larch (JL).40 17.0 195.5 187.7 139.jJ acteristic? Discuss your answer. sured were ~: ~\~ species~. ~~ X 2 = percent spectral reflectance at wavelength 720 nm (near infrared) The cell means (CM) for Julian day 235 for each combination of species and nutri~ level are as follows.1 161. perform a twoway MANOVJ\.78 10. TWo of the variables me~.·. The species of seedlings used were sit!( spruce (SS).35 13.8 45.0 SdMatKer 153. Data courtesy of Yolanda Lopez.8 164.7 180.'J.4 49.25 41.5 200.5 121.1 167.5 165.17 Peanut Data Factor 1 Location Factor 2 Variety 5 5 5 5 6 6 X] x2 XJ Yield 195. (b) Construct a twoway ANOVA for the 560CM observations and another t~o'!. In one experiment involving remote sensing. the spectral reflectance of three 1yearold seedlings was measured at various wavelengths during the growing seaSOJ! · The seedlings were grown with two different levels of nutrient: the optimal lev .63 25.41 7. Usinglo~ cation 2.0 672 1 1 2 2 1 1 2 2 1 1 2 2 6 6 8 8 8 8 Source. Are these results consistent WJth· tfi MAN OVA results in Part a? If not. X 1 = percent spectral reflectance at wavelength 560 nm (green) . (d) Larger numbers correspond to better yield and gradegrain characteristics. using 95% Bonferroni simultaneous intervals fo~ pairs of varieties.32.!~ 6.1 166.4 203. These averages are based on four replications.1 57.8 58.40 720CM 25.8 173.8 60.3 189.5 44.4 53.78 10.w ~"" ~s~ ~ JL LP ~ (a) neating the cell means as individual observations. ·. coded +.0 201. and a suboptimal level. and lodgepole pine (LP).4 54.7 197.6 65.7 55.15 24. ·~~ 560CM 10. can you explain any differences? ~ .8 166. can we conclude that one variety is better than the other two for each char.0 166.93 38.
54 14. and lodgepole pine [LP]) of 1yearold seedlings taken at three different times (Julian day 150 [1]. and Julian day 320 [3]) during the growing season.31 8. (a) Perform a twofactor MANOVA using the data in Table 6.42 10.74 67.71 44.87 Species Time 1 1 1 1 2 2 2 2 3 3 3 3 1 1 1 1 2 2 2 2 3 3 3 3 1 1 1 1 2 2 2 2 3 3 3 3 Replication 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ss ss ss ss ss ss ss ss ss ss ss ss JL JL JL JL JL JL JL JL JL JL JL JL LP LP LP LP LP LP LP LP LP LP LP LP Source: Data courtesy of Mairtin Mac Siurtain.14 78.33 8.25 16.14 19.37 25.74 9.38 14.37 7.62 15.67 24.67 40.16 21. a time effect and speciestime interaction.18 Spectral Reflectance Data 560nrn 9.50 33.94 8.27 20.32 22.27 10.77 12.12 26.93 60.03 32.24 16.12 15.78 26. Japanese larch [JL]. .87 44.21 8. Table 6.Exercises 355 6.15 77.21 9.79 8.45 6.22 10.87 22.44 37.24 12.89 36.18 are measurements on the variables X 1 = percent spectral reflectance at wavelength 560 nm (green) X 2 = percent spectral reflectance at wavelength 720 nm (near infrared) for three species (sitka spruce [SS].73 26.37 31.00 23.86 8.31 33.00 40.55 19.51 13.48 33.22 17.67 37. Refer to Exercise 6.33 12.32 27.90 40.07 11.93 37.77 720nm 19.33 40.32. Julian day 235 [2].28 38.03 12.48 12.00 25.69 14.34 7.04 13. The seedlings were all grown with the optimal level of nutrient. Test for a species effect.35 38.43 45.13 10.05.33.57 71. Use a = . The data in Table 6.73 7.18.74 36.
(b) Do you think the usual MANOVA assumptions are satisfied for these data? Discuss with reference to a residual analysis and the possibility of correlated observations over time.
(c) Foresters are particularly interested in the interaction of species and time. Does the interaction show up for one variable but not for the other? Check by running a univariate two-factor ANOVA for each of the two responses.
(d) Can you think of another method of analyzing these data (or a different experimental design) that would allow for a potential time trend in the spectral reflectance numbers?

6.34. Refer to Example 6.15.
(a) Plot the profiles, the components of x̄1 versus time and those of x̄2 versus time, on the same graph. Comment on the comparison.
(b) Test that linear growth is adequate. Take α = .01.

6.35. Refer to Example 6.15 but treat all 31 subjects as a single group. The maximum likelihood estimate of the (q + 1) × 1 vector β is

β̂ = (B'S⁻¹B)⁻¹ B'S⁻¹ x̄

where S is the sample covariance matrix. The estimated covariances of the maximum likelihood estimators are

Côv(β̂) = ((n - 1)(n - 2) / ((n - 1 - p + q)(n - p + q)n)) (B'S⁻¹B)⁻¹

Fit a quadratic growth curve to this single group and comment on the fit.

6.36. Table 11.7 on page 662 contains the values of three trace elements and two measures of hydrocarbons for crude-oil samples taken from three groups (zones) of sandstone. Use Box's M-test to test the equality of the population covariance matrices for the three sandstone groups. Set α = .05. Here there are p = 5 variables and you may wish to consider transformations of the measurements on these variables to make them more nearly normal.

6.37. Refer to Example 6.4. Given the summary information on electrical usage in this example, use Box's M-test to test the hypothesis H0: Σ1 = Σ2 = Σ. Here Σ1 is the covariance matrix for the two measures of usage for the population of Wisconsin homeowners with air conditioning, and Σ2 is the electrical usage covariance matrix for the population of Wisconsin homeowners without air conditioning. Set α = .05.

6.38. Table 6.9 on page 344 contains the carapace measurements for 24 female and 24 male turtles. Use Box's M-test to test H0: Σ1 = Σ2 = Σ, where Σ1 is the population covariance matrix for carapace measurements for female turtles, and Σ2 is the population covariance matrix for carapace measurements for male turtles. Set α = .05.

6.39. Anacondas are some of the largest snakes in the world. Jesus Ravis and his fellow researchers capture a snake and measure its (i) snout-vent length (cm), or the length from the snout of the snake to its vent where it evacuates waste, and (ii) weight (kilograms). A sample of these measurements is shown in Table 6.19.
(a) Test for equality of means between males and females using α = .05.
(b) Is it reasonable to pool variances in this case? Explain.
(c) Find the 95% Bonferroni confidence intervals for the mean differences between males and females on both length and weight.
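For the Box's M-test exercises above, the chi-square version of the test can be computed as in the following Python sketch (our own; it uses the standard Box correction factor):

```python
import numpy as np
from scipy.stats import chi2

def box_m_test(covs, ns, alpha=0.05):
    """Box's M-test of H0: Sigma_1 = ... = Sigma_g (chi-square approximation)."""
    g, p = len(covs), covs[0].shape[0]
    dfs = np.array([n - 1 for n in ns], dtype=float)
    S_pooled = sum(df * S for df, S in zip(dfs, covs)) / dfs.sum()
    M = dfs.sum() * np.log(np.linalg.det(S_pooled)) - sum(
        df * np.log(np.linalg.det(S)) for df, S in zip(dfs, covs))
    u = (np.sum(1.0 / dfs) - 1.0 / dfs.sum()) * (
        (2.0 * p**2 + 3.0 * p - 1.0) / (6.0 * (p + 1.0) * (g - 1.0)))
    C = (1.0 - u) * M                    # corrected statistic, approximately chi-square
    df = 0.5 * p * (p + 1) * (g - 1)
    return C, chi2.ppf(1 - alpha, df)
```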
50 23.Exercises 357 Table 6.75 10. and tb.0 261.00 39.75 5.25 7.4 246. Compare the male national track records in Table 8.3 319. Explain why it may be appropriate to analyze differences.7 212. A first step toward understanding the problems involved is to collect data from a designed experiment .00 15. simple or complex.3 466. A problem was initially classified as low or high severity.00 24.05.6 172.e engineer assigned was rated as relatively new (novice) or expert (guru).7 347.7 438.7 365. 200m.00 61. .5 307.00 53.75 3.2 290.9 236.75 5.0 477.7 235.51 19.3 336.98 24.00 30.45 6.65 7.0 236.50 57.6 221.0 226.19 Anaconda Data Snout vent Length 271.2 225.25 9.0 417. (b) Find the 95% Bonferroni confidence intervals for the mean differences between male and females on all of the races.50 6.3 Weight 3.0 306.40.4 223.4 280.7 235. Treat the data as a random sample of size 64 of the twelve record values.00 70.40 33.7 259.85 10.75 6.50 63.0 215.9 331.8 326.3 365.0 303.0 229.75 5. 800m and 1500m races.6 with the female national track records in Table 1.0 244.50 Gender Snout vent length 176.15 29.5 186.7 435.3 222.75 9.8 360.00 9.7 224. wireless providers can lose great amounts of money so it is important to be able to fix problems expeditiously.00 34.00 Gender F F F F F F F F F F F F F F F F F F F F F F F F F F F F 10.0 237.75 9.8 257.0 440.50 82.53 5.00 7.00 25.50 69.5 258. 400m. When cell phone relay towers are not working properly.6 377.0 228.3 441.5 223.3 384.7 312.41.7 315.25 9. (a) Test for equality of means between males and females using a = .9 using the results for the lOOm.07 7.1 Weight 18.8 233.75 7.5 268.0 223.00 9.49 6.25 21.00 M M M M M M M M M M M M M M M M M M M M M M M M M M M M Source: Data Courtesy of Jesus Ravis.75 7.25 30. 6.involving three factors.00 3.5 247.00 54.5 238.00 9.97 56.84 7.7 231.75 23.00 15.75 44.37 15.50 5. 6.
"A New Graphical Method for Detecting Single and Multiple Outliers in Univariate and Multivariate Data.S. 36.4 13. 5\!1 Perform a MANOVA including appropriate confidence intervals for important eff~~ J 'I 1'~~··"" Problem Problem Engineer Problem Problem Total . G.7 14.9 6. G..3 9.8 2. Box.7 6.0 5. W. M. S. S. J. M.7 Source: Data courtesy of Dan Porter. .. P. "A General Distribution Theory for a Class of Likelihood Criteria. K. ~ References 1.3 Implementation Time 6. Box.).. 6. "A Note on the Multiplying Factors for Various x2 Approximations.8 19.4 15. '{I J :1! ~ '!\¥!! ~ Novice Guru Guru Novice Novice Guru Guru Novice Novice Guru Guru High High 4.358 Chapter 6 Comparisons of Several Multivariate Means .20 Fixing Breakdowns Complexity Level Simple Simple Simple Simple Complex Complex Complex Complex Simple Simple Simple Simple Complex Complex Complex Complex Severity Level Low Low Low Low Low Low Low Low Experience Level Novice Novice Guru Guru Novice Assessment Tune 3.6 14." Biometrics.3 2.1 5. 362389. The time to assess the problem and plan an attack ~ the time to implement the solution were each measured in hours." Journal of the Royal Statistical Sociery (B)." Biometrika.9 10. and W.3 5. Bartlett.6 12.3 7. 268282.. An Introduction to Multivariate Statistical Analysis (3rd ed.16 (1954)." Proceedings of the Cambridge Philosophical Society. 3340. no.6 . BaconShone.6 23.4 I~ High High High High High High 15.5 10. Fung.296298. 176197.2 Resolution £ii Time ::: 9. 2003. P.8 1. 6 (1950).0 2. 7.5 21.4 10.6 12.9 14. Bartlett. '"'!J 1\vo times were observed." Journal of the Royal Statistical Society Supplement (B). Anderson.i~ Table 6. E.4 9. 36 (1949). 2.2 6.0 7. T.6 3. "Further Aspects of the Theory of Multiple Regression.5 4.6 15.20. New York: John Wiley. 5.5 4. "Properties of Sufficiency and Statistical Tests.8 8. The data are given in:l Table6." Proceedings of the Royal Society of London (A). Bartlett.3 19.9 5. E. 8.." Applied Statistics.3 5.1 3. "Multivariate Analysis. 34 (1938).0 15. 2 (1987).4 8.8 9. Bartlett.7 1.7 7. "Problems in the Analysis of Growth and Wear Curves. 4. M. 317346. 9 (1947).7 3. 160 {1937).. M. S. 153162.1 1. 3.
Chapter 7

MULTIVARIATE LINEAR REGRESSION MODELS

7.1 Introduction

Regression analysis is the statistical methodology for predicting values of one or more response (dependent) variables from a collection of predictor (independent) variable values. It can also be used for assessing the effects of the predictor variables on the responses. Unfortunately, the name regression, culled from the title of the first paper on the subject by F. Galton [15], in no way reflects either the importance or breadth of application of this methodology.

In this chapter, we first discuss the multiple regression model for the prediction of a single response. This model is then generalized to handle the prediction of several dependent variables. Our treatment must be somewhat terse, as a vast literature exists on the subject. (If you are interested in pursuing regression analysis, see the following books, in ascending order of difficulty: Abraham and Ledolter [1], Bowerman and O'Connell [6], Neter, Wasserman, Kutner, and Nachtsheim [20], Draper and Smith [13], Cook and Weisberg [11], Seber [23], and Goldberger [16].) Our abbreviated treatment highlights the regression assumptions and their consequences, alternative formulations of the regression model, and the general applicability of regression techniques to seemingly different situations.

7.2 The Classical Linear Regression Model

Let z1, z2, ..., zr be r predictor variables thought to be related to a response variable Y. For example, with r = 4, we might have

Y = current market value of home
and

z1 = square feet of living area
z2 = location (indicator for zone of city)
z3 = appraised value last year
z4 = quality of construction (price per square foot)

The classical linear regression model states that Y is composed of a mean, which depends in a continuous manner on the zi's, and a random error ε, which accounts for measurement error and the effects of other variables not explicitly considered in the model. The values of the predictor variables recorded from the experiment or set by the investigator are treated as fixed. The error (and hence the response) is viewed as a random variable whose behavior is characterized by a set of distributional assumptions.

Specifically, the linear regression model with a single response takes the form

Y = β0 + β1z1 + ··· + βrzr + ε
[Response] = [mean (depending on z1, z2, ..., zr)] + [error]

The term "linear" refers to the fact that the mean is a linear function of the unknown parameters β0, β1, ..., βr. The predictor variables may or may not enter the model as first-order terms.

With n independent observations on Y and the associated values of the zi, the complete model becomes

$$\begin{aligned}
Y_1 &= \beta_0 + \beta_1 z_{11} + \beta_2 z_{12} + \cdots + \beta_r z_{1r} + \varepsilon_1 \\
Y_2 &= \beta_0 + \beta_1 z_{21} + \beta_2 z_{22} + \cdots + \beta_r z_{2r} + \varepsilon_2 \\
&\;\;\vdots \\
Y_n &= \beta_0 + \beta_1 z_{n1} + \beta_2 z_{n2} + \cdots + \beta_r z_{nr} + \varepsilon_n
\end{aligned} \tag{7-1}$$

where the error terms are assumed to have the following properties:

1. E(εj) = 0;
2. Var(εj) = σ² (constant); and
3. Cov(εj, εk) = 0, j ≠ k.          (7-2)

In matrix notation, (7-1) becomes

$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix} 1 & z_{11} & \cdots & z_{1r} \\ 1 & z_{21} & \cdots & z_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & z_{n1} & \cdots & z_{nr} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

or

$$\underset{(n\times 1)}{\mathbf{Y}} = \underset{(n\times(r+1))}{\mathbf{Z}}\;\underset{((r+1)\times 1)}{\boldsymbol{\beta}} + \underset{(n\times 1)}{\boldsymbol{\varepsilon}}$$

and the specifications in (7-2) become

1. E(ε) = 0; and
2. Cov(ε) = E(εε') = σ²I.
Note that a one in the first column of the design matrix Z is the multiplier of the constant term β0. It is customary to introduce the artificial variable zj0 = 1, so that

β0 + β1zj1 + ··· + βrzjr = β0zj0 + β1zj1 + ··· + βrzjr

Each column of Z consists of the n values of the corresponding predictor variable, while the jth row of Z contains the values for all predictor variables on the jth trial.

Classical Linear Regression Model

$$\underset{(n\times 1)}{\mathbf{Y}} = \underset{(n\times(r+1))}{\mathbf{Z}}\;\underset{((r+1)\times 1)}{\boldsymbol{\beta}} + \underset{(n\times 1)}{\boldsymbol{\varepsilon}},
\qquad E(\boldsymbol{\varepsilon}) = \mathbf{0} \quad\text{and}\quad \operatorname{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I} \tag{7-3}$$

where β and σ² are unknown parameters and the design matrix Z has jth row [zj0, zj1, ..., zjr].

Although the error-term assumptions in (7-2) are very modest, we shall later need to add the assumption of joint normality for making confidence statements and testing hypotheses.

We now provide some examples of the linear regression model.

Example 7.1 (Fitting a straight-line regression model) Determine the linear regression model for fitting a straight line

Mean response = E(Y) = β0 + β1z1

to the data

z1 | 0 1 2 3 4
y  | 1 4 3 8 9

Before the responses Y' = [Y1, Y2, ..., Y5] are observed, the errors ε' = [ε1, ε2, ..., ε5] are random, and we can write

Y = Zβ + ε

where

$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_5 \end{bmatrix}, \qquad
\mathbf{Z} = \begin{bmatrix} 1 & z_{11} \\ \vdots & \vdots \\ 1 & z_{51} \end{bmatrix}, \qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_5 \end{bmatrix}$$
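In code, the design matrix and response vector of Example 7.1 take one line each to assemble, and responses can be simulated from model (7-3). The following is a minimal numpy sketch; the values chosen below for β and σ are hypothetical illustrations, not estimates from the text:

    import numpy as np

    rng = np.random.default_rng(0)

    z1 = np.arange(5.0)                          # predictor values 0, 1, 2, 3, 4
    y = np.array([1.0, 4.0, 3.0, 8.0, 9.0])      # observed responses
    Z = np.column_stack([np.ones_like(z1), z1])  # leading column of ones multiplies beta_0

    # simulate new responses Y = Z beta + eps with hypothetical beta and sigma
    beta, sigma = np.array([1.0, 2.0]), 1.0
    y_sim = Z @ beta + rng.normal(0.0, sigma, size=len(z1))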
Note that we can handle a quadratic expression for the mean response by introducing the term β2z2, with z2 = z1². The linear regression model for the jth trial in this latter case is

Yj = β0 + β1zj1 + β2zj2 + εj = β0 + β1zj1 + β2z²j1 + εj

•

Example 7.2 (The design matrix for one-way ANOVA as a regression model) Determine the design matrix if the linear regression model is applied to the one-way ANOVA situation in Example 6.6.

We create so-called dummy variables to handle the three population means: μ1 = μ + τ1, μ2 = μ + τ2, and μ3 = μ + τ3. We set

z1 = 1 if the observation is from population 1, and 0 otherwise;
z2 = 1 if the observation is from population 2, and 0 otherwise;
z3 = 1 if the observation is from population 3, and 0 otherwise;

and β0 = μ, β1 = τ1, β2 = τ2, β3 = τ3. Then

Yj = β0 + β1zj1 + β2zj2 + β3zj3 + εj,   j = 1, 2, ..., 8

where we arrange the observations from the three populations in sequence. Thus, we obtain the observed response vector and design matrix

$$\underset{(8\times 1)}{\mathbf{y}} = \begin{bmatrix} 9 \\ 6 \\ 9 \\ 0 \\ 2 \\ 3 \\ 1 \\ 2 \end{bmatrix}; \qquad
\underset{(8\times 4)}{\mathbf{Z}} = \begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1
\end{bmatrix}$$

•

The construction of dummy variables, as in Example 7.2, allows the whole of analysis of variance to be treated within the multiple linear regression framework.
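Because the dummy-variable construction is mechanical, the one-way ANOVA design matrix can be generated directly from the population labels. A small sketch, numpy assumed:

    import numpy as np

    pop = np.array([1, 1, 1, 2, 2, 3, 3, 3])     # population of each of the 8 trials
    y = np.array([9.0, 6, 9, 0, 2, 3, 1, 2])

    # column of ones for beta_0 = mu, then one 0/1 indicator column per population
    Z = np.column_stack(
        [np.ones(len(pop))] + [(pop == g).astype(float) for g in (1, 2, 3)]
    )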
7.3 Least Squares Estimation

One of the objectives of regression analysis is to develop an equation that will allow the investigator to predict the response for given values of the predictor variables. Thus, it is necessary to "fit" the model in (7-3) to the observed yj corresponding to the known values 1, zj1, ..., zjr. That is, we must determine the values for the regression coefficients β and the error variance σ² consistent with the available data.

Let b be trial values for β. Consider the difference yj - b0 - b1zj1 - ··· - brzjr between the observed response yj and the value b0 + b1zj1 + ··· + brzjr that would be expected if b were the "true" parameter vector. Typically, these differences will not be zero, because the response fluctuates (in a manner characterized by the error term assumptions) about its expected value. The method of least squares selects b so as to minimize the sum of the squares of the differences:

$$S(\mathbf{b}) = \sum_{j=1}^{n} (y_j - b_0 - b_1 z_{j1} - \cdots - b_r z_{jr})^2
= (\mathbf{y} - \mathbf{Z}\mathbf{b})'(\mathbf{y} - \mathbf{Z}\mathbf{b}) \tag{7-4}$$

The coefficients b chosen by the least squares criterion are called least squares estimates of the regression parameters β. They will henceforth be denoted by β̂ to emphasize their role as estimates of β.

The coefficients β̂ are consistent with the data in the sense that they produce estimated (fitted) mean responses, β̂0 + β̂1zj1 + ··· + β̂rzjr, the sum of whose squared differences from the observed yj is as small as possible. The deviations

ε̂j = yj - β̂0 - β̂1zj1 - ··· - β̂rzjr,   j = 1, 2, ..., n          (7-5)

are called residuals. The vector of residuals ε̂ = y - Zβ̂ contains the information about the remaining unknown parameter σ². (See Result 7.2.)

Result 7.1. Let Z have full rank r + 1 ≤ n.¹ The least squares estimate of β in (7-3) is given by

β̂ = (Z'Z)⁻¹Z'y

Let ŷ = Zβ̂ = Hy denote the fitted values of y, where H = Z(Z'Z)⁻¹Z' is called the "hat" matrix. Then the residuals

ε̂ = y - ŷ = [I - Z(Z'Z)⁻¹Z']y = (I - H)y

satisfy Z'ε̂ = 0 and ŷ'ε̂ = 0. Also, the

residual sum of squares = Σ (yj - β̂0 - β̂1zj1 - ··· - β̂rzjr)² = ε̂'ε̂ = y'[I - Z(Z'Z)⁻¹Z']y = y'y - y'Zβ̂

¹ If Z is not full rank, (Z'Z)⁻¹ is replaced by (Z'Z)⁻, a generalized inverse of Z'Z. (See Exercise 7.6.)
Proof. Let β̂ = (Z'Z)⁻¹Z'y as asserted. Then ε̂ = y - ŷ = y - Zβ̂ = [I - Z(Z'Z)⁻¹Z']y. The matrix [I - Z(Z'Z)⁻¹Z'] satisfies

1. [I - Z(Z'Z)⁻¹Z']' = [I - Z(Z'Z)⁻¹Z'] (symmetric);
2. [I - Z(Z'Z)⁻¹Z'][I - Z(Z'Z)⁻¹Z'] = I - 2Z(Z'Z)⁻¹Z' + Z(Z'Z)⁻¹Z'Z(Z'Z)⁻¹Z'
   = [I - Z(Z'Z)⁻¹Z'] (idempotent);          (7-6)
3. Z'[I - Z(Z'Z)⁻¹Z'] = Z' - Z' = 0.

Consequently, Z'ε̂ = Z'(y - ŷ) = Z'[I - Z(Z'Z)⁻¹Z']y = 0, so ŷ'ε̂ = β̂'Z'ε̂ = 0. Additionally,

ε̂'ε̂ = y'[I - Z(Z'Z)⁻¹Z'][I - Z(Z'Z)⁻¹Z']y = y'[I - Z(Z'Z)⁻¹Z']y = y'y - y'Zβ̂

To verify the expression for β̂, we write

y - Zb = y - Zβ̂ + Zβ̂ - Zb = (y - Zβ̂) + Z(β̂ - b)

so

S(b) = (y - Zb)'(y - Zb)
     = (y - Zβ̂)'(y - Zβ̂) + (β̂ - b)'Z'Z(β̂ - b) + 2(y - Zβ̂)'Z(β̂ - b)
     = (y - Zβ̂)'(y - Zβ̂) + (β̂ - b)'Z'Z(β̂ - b)

since (y - Zβ̂)'Z = ε̂'Z = 0'. The first term in S(b) does not depend on b, and the second is the squared length of Z(β̂ - b). Because Z has full rank, Z(β̂ - b) ≠ 0 if β̂ ≠ b, so the minimum sum of squares is unique and occurs for b = β̂ = (Z'Z)⁻¹Z'y. Note that (Z'Z)⁻¹ exists since Z'Z has rank r + 1 ≤ n. (If Z'Z is not of full rank, Z'Za = 0 for some a ≠ 0, but then a'Z'Za = 0 or Za = 0, which contradicts Z having full rank r + 1.) •

Result 7.1 shows how the least squares estimates β̂ and the residuals ε̂ can be obtained from the design matrix Z and responses y by simple matrix operations.

Example 7.3 (Calculating the least squares estimates, the residuals, and the residual sum of squares) Calculate the least squares estimates β̂, the residuals ε̂, and the residual sum of squares for a straight-line model

Yj = β0 + β1zj1 + εj

fit to the data

z1 | 0 1 2 3 4
y  | 1 4 3 8 9
We have

$$\mathbf{Z}' = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \end{bmatrix}, \qquad
\mathbf{y} = \begin{bmatrix} 1 \\ 4 \\ 3 \\ 8 \\ 9 \end{bmatrix}, \qquad
\mathbf{Z}'\mathbf{Z} = \begin{bmatrix} 5 & 10 \\ 10 & 30 \end{bmatrix}, \qquad
(\mathbf{Z}'\mathbf{Z})^{-1} = \begin{bmatrix} .6 & -.2 \\ -.2 & .1 \end{bmatrix}, \qquad
\mathbf{Z}'\mathbf{y} = \begin{bmatrix} 25 \\ 70 \end{bmatrix}$$

Consequently,

$$\hat{\boldsymbol{\beta}} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}
= (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}
= \begin{bmatrix} .6 & -.2 \\ -.2 & .1 \end{bmatrix}\begin{bmatrix} 25 \\ 70 \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

and the fitted equation is

ŷ = 1 + 2z

The vector of fitted (predicted) values is

ŷ = Zβ̂ = [1, 3, 5, 7, 9]'

so

ε̂ = y - ŷ = [0, 1, -2, 1, 0]'

The residual sum of squares is

ε̂'ε̂ = 0² + 1² + (-2)² + 1² + 0² = 6

•

Sum-of-Squares Decomposition

According to Result 7.1, ŷ'ε̂ = 0, so the total response sum of squares y'y = Σ yj² satisfies

$$\mathbf{y}'\mathbf{y} = (\hat{\mathbf{y}} + \mathbf{y} - \hat{\mathbf{y}})'(\hat{\mathbf{y}} + \mathbf{y} - \hat{\mathbf{y}})
= (\hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}})'(\hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}})
= \hat{\mathbf{y}}'\hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} \tag{7-7}$$
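These hand calculations are easily confirmed numerically; the following numpy sketch reproduces β̂, the residuals, and the decomposition (7-7) for the data of Example 7.3:

    import numpy as np

    z = np.arange(5.0)
    y = np.array([1.0, 4, 3, 8, 9])
    Z = np.column_stack([np.ones(5), z])

    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)   # [1., 2.]
    y_hat = Z @ beta_hat                            # [1., 3., 5., 7., 9.]
    resid = y - y_hat                               # [0., 1., -2., 1., 0.]

    # decomposition (7-7): y'y = yhat'yhat + resid'resid, here 171 = 165 + 6
    print(y @ y, y_hat @ y_hat, resid @ resid)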
Since the first column of Z is 1, the condition Z'ε̂ = 0 includes the requirement 0 = 1'ε̂ = Σ ε̂j = Σ yj - Σ ŷj, so the mean of the yj equals the mean of the ŷj. Subtracting nȳ² from both sides of the decomposition in (7-7), we obtain the basic decomposition of the sum of squares about the mean:

$$\sum_{j=1}^{n}(y_j - \bar{y})^2 = \sum_{j=1}^{n}(\hat{y}_j - \bar{y})^2 + \sum_{j=1}^{n}\hat{\varepsilon}_j^2 \tag{7-8}$$

(total sum of squares about mean) = (regression sum of squares) + (residual (error) sum of squares)

The preceding sum of squares decomposition suggests that the quality of the model's fit can be measured by the coefficient of determination

$$R^2 = 1 - \frac{\sum_{j=1}^{n}\hat{\varepsilon}_j^2}{\sum_{j=1}^{n}(y_j - \bar{y})^2}
= \frac{\sum_{j=1}^{n}(\hat{y}_j - \bar{y})^2}{\sum_{j=1}^{n}(y_j - \bar{y})^2} \tag{7-9}$$

The quantity R² gives the proportion of the total variation in the yj's "explained" by, or attributable to, the predictor variables z1, z2, ..., zr. Here R² (or the multiple correlation coefficient R = +√R²) equals 1 if the fitted equation passes through all the data points, so that ε̂j = 0 for all j. At the other extreme, R² is 0 if β̂0 = ȳ and β̂1 = β̂2 = ··· = β̂r = 0. In this case, the predictor variables z1, z2, ..., zr have no influence on the response.
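For the fit of Example 7.3, (7-9) gives R² = 1 - 6/46 ≈ .870; in code (the arrays from the previous sketch are repeated so the fragment stands alone):

    import numpy as np

    y = np.array([1.0, 4, 3, 8, 9])
    resid = np.array([0.0, 1, -2, 1, 0])     # residuals from Example 7.3

    R2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    print(round(R2, 4))                      # 0.8696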
Geometry of Least Squares

A geometrical interpretation of the least squares technique highlights the nature of the concept. According to the classical linear regression model,

$$\text{Mean response vector} = E(\mathbf{Y}) = \mathbf{Z}\boldsymbol{\beta}
= \beta_0\begin{bmatrix}1\\1\\\vdots\\1\end{bmatrix}
+ \beta_1\begin{bmatrix}z_{11}\\z_{21}\\\vdots\\z_{n1}\end{bmatrix}
+ \cdots
+ \beta_r\begin{bmatrix}z_{1r}\\z_{2r}\\\vdots\\z_{nr}\end{bmatrix}$$

Thus, E(Y) is a linear combination of the columns of Z. As β varies, Zβ spans the model plane of all linear combinations. Usually, the observation vector y will not lie in the model plane, because of the random error ε; that is, y is not (exactly) a linear combination of the columns of Z. Recall that

Y = Zβ + ε
(response vector) = (vector in model plane) + (error vector)

Once the observations become available, the least squares solution is derived from the deviation vector

y - Zb = (observation vector) - (vector in model plane)

The squared length (y - Zb)'(y - Zb) is the sum of squares S(b). As illustrated in Figure 7.1, S(b) is as small as possible when b is selected such that Zb is the point in the model plane closest to y. This point occurs at the tip of the perpendicular projection of y on the plane; that is, for the choice b = β̂, ŷ = Zβ̂ is the projection of y on the plane consisting of all linear combinations of the columns of Z. The residual vector ε̂ = y - ŷ is perpendicular to that plane. This geometry holds even when Z is not of full rank.

[Figure 7.1 Least squares as a projection for n = 3, r = 1.]

When Z has full rank, the projection operation is expressed analytically as multiplication by the matrix Z(Z'Z)⁻¹Z'. To see this, we use the spectral decomposition (2-16) to write

Z'Z = λ1e1e1' + λ2e2e2' + ··· + λr+1er+1e'r+1

where λ1 ≥ λ2 ≥ ··· ≥ λr+1 > 0 are the eigenvalues of Z'Z and e1, e2, ..., er+1 are the corresponding eigenvectors. If Z is of full rank,

(Z'Z)⁻¹ = λ1⁻¹e1e1' + λ2⁻¹e2e2' + ··· + λ⁻¹r+1er+1e'r+1

Consider qi = λi^(-1/2) Zei, which is a linear combination of the columns of Z. Then qi'qk = λi^(-1/2)λk^(-1/2) ei'Z'Zek = λi^(-1/2)λk^(-1/2) λk ei'ek = 0 if i ≠ k, or 1 if i = k. That is, the r + 1 vectors q1, q2, ..., qr+1 are mutually perpendicular and have unit length. Their linear combinations span the space of all linear combinations of the columns of Z. Moreover,

$$\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'
= \sum_{i=1}^{r+1}\lambda_i^{-1}\mathbf{Z}\mathbf{e}_i\mathbf{e}_i'\mathbf{Z}'
= \sum_{i=1}^{r+1}\mathbf{q}_i\mathbf{q}_i'$$
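The identity Z(Z'Z)⁻¹Z' = Σ qiqi' and the idempotency of the projection are easy to confirm numerically. A sketch using the Example 7.3 design matrix (note that numpy's eigh returns eigenvalues in ascending order, which does not affect the sums):

    import numpy as np

    Z = np.column_stack([np.ones(5), np.arange(5.0)])
    lam, E = np.linalg.eigh(Z.T @ Z)    # spectral decomposition of Z'Z
    Q = (Z @ E) / np.sqrt(lam)          # columns q_i = lambda_i**(-1/2) Z e_i

    H = Q @ Q.T                         # sum of the rank-one matrices q_i q_i'
    assert np.allclose(H, Z @ np.linalg.inv(Z.T @ Z) @ Z.T)   # the projection matrix
    assert np.allclose(H @ H, H)                              # idempotent, as in (7-6)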
According to Result 2A.2 and Definition 2A.12, the projection of y on a linear combination of {q1, q2, ..., qr+1} is

$$\sum_{i=1}^{r+1}(\mathbf{q}_i'\mathbf{y})\mathbf{q}_i
= \left(\sum_{i=1}^{r+1}\mathbf{q}_i\mathbf{q}_i'\right)\mathbf{y}
= \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \mathbf{Z}\hat{\boldsymbol{\beta}}$$

Thus, multiplication by Z(Z'Z)⁻¹Z' projects a vector onto the space spanned by the columns of Z.² Similarly, [I - Z(Z'Z)⁻¹Z'] is the matrix for the projection of y on the plane perpendicular to the plane spanned by the columns of Z.

Sampling Properties of Classical Least Squares Estimators

The least squares estimator β̂ and the residuals ε̂ have the sampling properties detailed in the next result.

Result 7.2. Under the general linear regression model in (7-3), the least squares estimator β̂ = (Z'Z)⁻¹Z'Y has

E(β̂) = β   and   Cov(β̂) = σ²(Z'Z)⁻¹

The residuals ε̂ have the properties

E(ε̂) = 0   and   Cov(ε̂) = σ²[I - Z(Z'Z)⁻¹Z'] = σ²[I - H]

Also, E(ε̂'ε̂) = (n - r - 1)σ², so defining

s² = ε̂'ε̂/(n - r - 1) = Y'[I - Z(Z'Z)⁻¹Z']Y/(n - r - 1) = Y'[I - H]Y/(n - r - 1)

we have E(s²) = σ². Moreover, β̂ and ε̂ are uncorrelated.

Proof. (See webpage: www.prenhall.com/statistics) •

The least squares estimator β̂ possesses a minimum variance property that was first established by Gauss. The following result concerns "best" estimators of linear parametric functions of the form c'β = c0β0 + c1β1 + ··· + crβr for any c.

Result 7.3 (Gauss'³ least squares theorem). Let Y = Zβ + ε, where E(ε) = 0, Cov(ε) = σ²I, and Z has full rank r + 1. For any c, the estimator c'β̂ = c0β̂0 + c1β̂1 + ··· + crβ̂r

² If Z is not of full rank, we can use the generalized inverse (Z'Z)⁻ = Σ λi⁻¹eiei', summed over the eigenvalues λ1 ≥ ··· ≥ λr1+1 > 0 = λr1+2 = ··· = λr+1, as described in Exercise 7.12. Then Z(Z'Z)⁻Z' = Σ qiqi' has rank r1 + 1 and generates the unique projection of y on the space spanned by the linearly independent columns of Z. This is true for any choice of the generalized inverse. (See [23].)

³ Much later, Markov proved a less general result, which misled many writers into attaching his name to this theorem.
. froiJi1 Result 7. Methods for checking the general adequacy of the model are considered in Section 7. ~ Var(a'Y) = Var(a'Z/3 + a'e) = £T = Var(a'e) = a'Iu2a ~ ~"' 2 (a a* + a*)'(a. squares estimator /3.a'Z)fJ = 0 for aU {J.4 Inferences About the Regression Model We describe inferential procedures based on the classical linear regression model in (73) with the additional (tentative) assumption that the errors E have a normal distribution. I Proof.a*'Z = c' . whatever the value of fj. c' 13. Equating the two expected value expressions yieldi a'Z/3 = c'/3 or(c'.~(Z'Zf ) 1 j ' . soc' f3 = a*'Y is an unbiased estimator of c' /3..a'Zi'i This implies that c' = a'Z for any unbiased estimator. Result 1.2 E( /3) = /3. Inferences Concerning the Regression Parameters Before we can assess the importance of particular variables in the regression function (710) we must determine the sampling distributions of [3 and the residual sum of squares. let a'Y be any unbiased estimator of c' /3... Also.. where Z has full rank r + 1 and E is distributed as ~ Nn(O.a*)'Z = a'Z .. the estimator c' [3 is called the best (minimumvariance) linear unbiased estimator (BLUE) of c' fJ. Moreover.4. i' i. for an~ a satisfying the unbiased requirement c' = a'Z.a*)'(a.a*)'a* = (a . Then the maximum likelihood estimator of f3 is the same as the least~ ' . f. •• This powerful result states that substitution of [3 for fJleads to the best estima~...+ 1 (fJ. Var(a'Y) is minimized by the choice a*'Y = c'(Z'Zf1Z'Y = c' p.a*) 1 + a*'a*] since (a '. tor of c' fJ for any c of interest.a*)'(a.c' = 0'. Thus. u 21).370 Chapter 7 Multivariate Linear Regression Models ~ of c' f3 has the smallest possible variance among all linear estimators of the form :ij ~ . = c'(Z'Z) ~·y = a*'Y with a• = Z(Z'Z) c. by assumption.a*) is positive unless a =a*. The~ E( a'Y) = c' /3.a*)'Z(Z'Zf c = 0 from the condition (a.i ' 1 1 . including the choice /3 = (c'. Let Y = Z/3 + E.. For any fixed c. '·! ·111 that are unbiased for c' /3. Now. Moreover.6. we shall assume that the errors E have a normal distribution... Because a• is fixed and (a. In statistical terminology.a* +a*) =~[(a. E(a'Z/3 + a'e) = a'ZfJ.. To do so. E(a'Y) ·. '~ {J = (Z'Zf Z'Y 1 isdistributedas N. 1..
and is distributed independently of the residuals ε̂ = Y - Zβ̂. Further,

nσ̂² = ε̂'ε̂ is distributed as σ²χ²n-r-1

independently of β̂, where σ̂² is the maximum likelihood estimator of σ².

Proof. (See webpage: www.prenhall.com/statistics) •

A confidence ellipsoid for β is easily constructed. It is expressed in terms of the estimated covariance matrix s²(Z'Z)⁻¹, where s² = ε̂'ε̂/(n - r - 1).

Result 7.5. Let Y = Zβ + ε, where Z has full rank r + 1 and ε is Nn(0, σ²I). Then a 100(1 - α) percent confidence region for β is given by

(β - β̂)'Z'Z(β - β̂) ≤ (r + 1)s²Fr+1,n-r-1(α)

where Fr+1,n-r-1(α) is the upper (100α)th percentile of an F-distribution with r + 1 and n - r - 1 d.f. Also, simultaneous 100(1 - α) percent confidence intervals for the βi are given by

β̂i ± √(Var̂(β̂i)) √((r + 1)Fr+1,n-r-1(α)),   i = 0, 1, ..., r

where Var̂(β̂i) is the diagonal element of s²(Z'Z)⁻¹ corresponding to β̂i.

Proof. Consider the symmetric square-root matrix (Z'Z)^(1/2). [See (2-22).] Set V = (Z'Z)^(1/2)(β̂ - β) and note that E(V) = 0 and

Cov(V) = (Z'Z)^(1/2)Cov(β̂)(Z'Z)^(1/2) = σ²(Z'Z)^(1/2)(Z'Z)⁻¹(Z'Z)^(1/2) = σ²I

and V is normally distributed, since it consists of linear combinations of the β̂i's. Therefore, V'V = (β̂ - β)'(Z'Z)^(1/2)(Z'Z)^(1/2)(β̂ - β) = (β̂ - β)'(Z'Z)(β̂ - β) is distributed as σ²χ²r+1. By Result 7.4, (n - r - 1)s² = ε̂'ε̂ is distributed as σ²χ²n-r-1, independently of β̂ and, hence, independently of V. Consequently, [χ²r+1/(r + 1)]/[χ²n-r-1/(n - r - 1)] = [V'V/(r + 1)]/s² has an Fr+1,n-r-1 distribution, and the confidence ellipsoid for β follows. Projecting this ellipsoid for (β̂ - β) using Result 5A.1 with A⁻¹ = Z'Z/s², c² = (r + 1)Fr+1,n-r-1(α), and u' = [0, ..., 0, 1, 0, ..., 0] yields |βi - β̂i| ≤ √((r + 1)Fr+1,n-r-1(α)) √(Var̂(β̂i)), where Var̂(β̂i) is the diagonal element of s²(Z'Z)⁻¹ corresponding to β̂i. •

The confidence ellipsoid is centered at the maximum likelihood estimate β̂, and its orientation and size are determined by the eigenvalues and eigenvectors of Z'Z. If an eigenvalue is nearly zero, the confidence ellipsoid will be very long in the direction of the corresponding eigenvector.
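Result 7.5 translates directly into code. The sketch below (numpy and scipy assumed; the function name is ours) returns the simultaneous intervals for all coefficients:

    import numpy as np
    from scipy import stats

    def simultaneous_cis(Z, y, alpha=0.05):
        """Simultaneous 100(1 - alpha)% intervals for the beta_i (Result 7.5)."""
        n, r_plus_1 = Z.shape
        ZtZ_inv = np.linalg.inv(Z.T @ Z)
        beta_hat = ZtZ_inv @ Z.T @ y
        resid = y - Z @ beta_hat
        s2 = resid @ resid / (n - r_plus_1)
        c2 = r_plus_1 * stats.f.ppf(1 - alpha, r_plus_1, n - r_plus_1)
        half = np.sqrt(c2 * s2 * np.diag(ZtZ_inv))
        return np.column_stack([beta_hat - half, beta_hat + half])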
Practitioners often ignore the "simultaneous" confidence property of the interval estimates in Result 7.5. Instead, they replace (r + 1)Fr+1,n-r-1(α) with the one-at-a-time t value tn-r-1(α/2) and use the intervals

β̂i ± tn-r-1(α/2) √(Var̂(β̂i))          (7-11)

when searching for important predictor variables.

Example 7.4 (Fitting a regression model to real-estate data) The assessment data in Table 7.1 were gathered from 20 homes in a Milwaukee, Wisconsin, neighborhood. Fit the regression model

Yj = β0 + β1zj1 + β2zj2 + εj

where z1 = total dwelling size (in hundreds of square feet), z2 = assessed value (in thousands of dollars), and Y = selling price (in thousands of dollars), to these data using the method of least squares.

Table 7.1 Real-Estate Data: z1 = total dwelling size (100 ft²), z2 = assessed value ($1000), and y = selling price ($1000), for each of the 20 homes.

A computer calculation yields

$$(\mathbf{Z}'\mathbf{Z})^{-1} = \begin{bmatrix} 5.1523 & & \\ .2544 & .0512 & \\ -.1463 & -.0172 & .0067 \end{bmatrix}$$
and

β̂ = (Z'Z)⁻¹Z'y = [30.967, 2.634, .045]'

Thus, the fitted equation is

ŷ = 30.967 + 2.634z1 + .045z2
     (7.88)    (.785)    (.285)

with s = 3.473. The numbers in parentheses are the estimated standard deviations of the least squares coefficients. Also, R² = .834, indicating that the data exhibit a strong regression relationship. (See Panel 7.1, which contains the regression analysis of these data using the SAS statistical software package.) If the residuals ε̂ pass the diagnostic checks described in Section 7.6, the fitted equation could be used to predict the selling price of another house in the neighborhood from its size

PANEL 7.1 SAS ANALYSIS FOR EXAMPLE 7.4 USING PROC REG.

PROGRAM COMMANDS:

    title 'Regression Analysis';
    data estate;
    infile 'T7-1.dat';
    input z1 z2 y;
    proc reg data = estate;
    model y = z1 z2;

OUTPUT:

    Model: MODEL1
    Dependent Variable: Y

    Analysis of Variance
    Source     DF    Sum of Squares    Mean Square    F Value    Prob > F
    Model       2        1032.87506      516.43753     42.828      0.0001
    Error      17         204.99494       12.05853
    C Total    19        1237.87000

    Root MSE    3.47254     R-square    0.8344
    Dep Mean   76.55000     Adj R-sq    0.8149
    C.V.        4.53630

    Parameter Estimates
    Variable    DF    Parameter Estimate    Standard Error    T for H0: Parameter = 0    Prob > |T|
    INTERCEP     1             30.966566        7.88220844                      3.929        0.0011
    Z1           1              2.634400        0.78559872                      3.353        0.0038
    Z2           1              0.045184        0.28518271                      0.158        0.8760
and assessed value. We note that a 95% confidence interval for β2 [see (7-11)] is given by

β̂2 ± t17(.025) √(Var̂(β̂2)) = .045 ± 2.110(.285)

or

(-.556, .647)

Since the confidence interval includes β2 = 0, the variable z2 might be dropped from the regression model and the analysis repeated with the single predictor variable z1. Given dwelling size, assessed value seems to add little to the prediction of selling price. •

Likelihood Ratio Tests for the Regression Parameters

Part of regression analysis is concerned with assessing the effects of particular predictor variables on the response variable. One null hypothesis of interest states that certain of the zi's do not influence the response Y. These predictors will be labeled zq+1, zq+2, ..., zr. The statement that zq+1, zq+2, ..., zr do not influence Y translates into the statistical hypothesis

H0: βq+1 = βq+2 = ··· = βr = 0   or   H0: β(2) = 0          (7-12)

where β(2) = [βq+1, βq+2, ..., βr]'.

Setting

$$\mathbf{Z} = \Big[\underset{n\times(q+1)}{\mathbf{Z}_1} \;\vdots\; \underset{n\times(r-q)}{\mathbf{Z}_2}\Big], \qquad
\boldsymbol{\beta} = \begin{bmatrix} \underset{((q+1)\times 1)}{\boldsymbol{\beta}_{(1)}} \\ \underset{((r-q)\times 1)}{\boldsymbol{\beta}_{(2)}} \end{bmatrix}$$

we can express the general linear model as

Y = Zβ + ε = [Z1 ⋮ Z2][β(1)', β(2)']' + ε = Z1β(1) + Z2β(2) + ε

Under the null hypothesis H0: β(2) = 0, Y = Z1β(1) + ε. The likelihood ratio test of H0 is based on the

Extra sum of squares = SSres(Z1) - SSres(Z)
= (y - Z1β̂(1))'(y - Z1β̂(1)) - (y - Zβ̂)'(y - Zβ̂)          (7-13)

where β̂(1) = (Z1'Z1)⁻¹Z1'y.

Result 7.6. Let Z have full rank r + 1 and ε be distributed as Nn(0, σ²I). The likelihood ratio test of H0: β(2) = 0 is equivalent to a test of H0 based on the extra sum of squares in (7-13) and s² = (y - Zβ̂)'(y - Zβ̂)/(n - r - 1). In particular, the likelihood ratio test rejects H0 if

$$\frac{(SS_{\text{res}}(\mathbf{Z}_1) - SS_{\text{res}}(\mathbf{Z}))/(r-q)}{s^2} > F_{r-q,\,n-r-1}(\alpha)$$

where Fr-q,n-r-1(α) is the upper (100α)th percentile of an F-distribution with r - q and n - r - 1 d.f.
Proof. Given the data and the normal assumption, the likelihood associated with the parameters β and σ² is

$$L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^n}\,e^{-(\mathbf{y}-\mathbf{Z}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{Z}\boldsymbol{\beta})/2\sigma^2}
\le \frac{1}{(2\pi)^{n/2}\hat{\sigma}^n}\,e^{-n/2}$$

with the maximum occurring at β̂ = (Z'Z)⁻¹Z'y and σ̂² = (y - Zβ̂)'(y - Zβ̂)/n. Under the restriction of the null hypothesis, Y = Z1β(1) + ε and

$$\max_{\boldsymbol{\beta}_{(1)},\,\sigma^2} L(\boldsymbol{\beta}_{(1)}, \sigma^2) = \frac{1}{(2\pi)^{n/2}\hat{\sigma}_1^n}\,e^{-n/2}$$

where the maximum occurs at β̂(1) = (Z1'Z1)⁻¹Z1'y. Moreover,

σ̂1² = (y - Z1β̂(1))'(y - Z1β̂(1))/n

Rejecting H0: β(2) = 0 for small values of the likelihood ratio is equivalent to rejecting H0 for large values of (σ̂1² - σ̂²)/σ̂² or its scaled version,

$$\frac{n(\hat{\sigma}_1^2 - \hat{\sigma}^2)/(r-q)}{n\hat{\sigma}^2/(n-r-1)}
= \frac{(SS_{\text{res}}(\mathbf{Z}_1) - SS_{\text{res}}(\mathbf{Z}))/(r-q)}{s^2} = F$$

The preceding F-ratio has an F-distribution with r - q and n - r - 1 d.f. (See [22] or Result 7.11 with m = 1.) •

Comment. The likelihood ratio test is implemented as follows. To test whether all coefficients in a subset are zero, fit the model with and without the terms corresponding to these coefficients. The improvement in the residual sum of squares (the extra sum of squares) is compared to the residual sum of squares for the full model via the F-ratio. The same procedure applies even in analysis of variance situations where Z is not of full rank.⁴

⁴ In situations where Z is not of full rank, rank(Z) replaces r + 1 and rank(Z1) replaces q + 1 in Result 7.6.
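The fit-with-and-without recipe of the Comment is short to implement. A sketch (numpy and scipy assumed; names are ours) returning the F statistic of Result 7.6 and its critical value:

    import numpy as np
    from scipy import stats

    def extra_ss_ftest(Z1, Z2, y, alpha=0.05):
        """F test of H0: beta_(2) = 0 via the extra sum of squares (7-13)."""
        Z = np.hstack([Z1, Z2])
        n, r_plus_1 = Z.shape
        q_plus_1 = Z1.shape[1]

        def rss(X):
            e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
            return e @ e

        ss_full, ss_reduced = rss(Z), rss(Z1)
        df1, df2 = r_plus_1 - q_plus_1, n - r_plus_1
        F = ((ss_reduced - ss_full) / df1) / (ss_full / df2)
        return F, stats.f.ppf(1 - alpha, df1, df2)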
More generally, it is possible to formulate null hypotheses concerning r - q linear combinations of β of the form H0: Cβ = A0. Let the (r - q) × (r + 1) matrix C have full rank, and let A0 = 0. (This null hypothesis reduces to the previous choice when C = [0 ⋮ I(r-q)×(r-q)].) Under the full model, Cβ̂ is distributed as Nr-q(Cβ, σ²C(Z'Z)⁻¹C'). We reject H0: Cβ = 0 at level α if 0 does not lie in the 100(1 - α)% confidence ellipsoid for Cβ. Equivalently, we reject H0: Cβ = 0 if

(Cβ̂)'(C(Z'Z)⁻¹C')⁻¹(Cβ̂)/s² > (r - q)Fr-q,n-r-1(α)          (7-14)

where s² = (y - Zβ̂)'(y - Zβ̂)/(n - r - 1) and Fr-q,n-r-1(α) is the upper (100α)th percentile of an F-distribution with r - q and n - r - 1 d.f. The test in (7-14) is the likelihood ratio test, and the numerator in the F-ratio is the extra residual sum of squares incurred by fitting the model subject to the restriction that Cβ = 0. (See [23].)

The next example illustrates how unbalanced experimental designs are easily handled by the general theory just described.

Example 7.5 (Testing the importance of additional predictors using the extra sum-of-squares approach) Male and female patrons rated the service in three establishments (locations) of a large restaurant chain. The service ratings were converted into an index. Table 7.2 contains the data for n = 18 customers. Each data point in the table is categorized according to location (1, 2, or 3) and gender (male = 0 and female = 1). This categorization has the format of a two-way table with unequal numbers of observations per cell. For instance, the combination of location 1 and male has 5 responses, while the combination of location 2 and female has 2 responses. Introducing three dummy variables to account for location and two dummy variables to account for gender, we can develop a regression model linking the service index Y to location, gender, and their "interaction" using the design matrix

Table 7.2 Restaurant-Service Data
Location   Gender   Service (Y)
   1          0        15.2
   1          0        21.2
   1          0        27.3
   1          0        21.2
   1          0        21.2
   1          1        36.4
   1          1        92.4
   2          0        27.3
   2          0        15.2
   2          0         9.1
   2          0        18.2
   2          0        50.0
   2          1        44.0
   2          1        63.6
   3          0        15.2
   3          0        30.2
   3          1        36.4
   3          1        40.9
The 18 × 12 design matrix Z has a constant column, three location columns, two gender columns, and six location-gender interaction columns:

    constant   location   gender   interaction
    1          1 0 0      1 0      1 0 0 0 0 0   } 5 responses (location 1, male)
    1          1 0 0      0 1      0 1 0 0 0 0   } 2 responses (location 1, female)
    1          0 1 0      1 0      0 0 1 0 0 0   } 5 responses (location 2, male)
    1          0 1 0      0 1      0 0 0 1 0 0   } 2 responses (location 2, female)
    1          0 0 1      1 0      0 0 0 0 1 0   } 2 responses (location 3, male)
    1          0 0 1      0 1      0 0 0 0 0 1   } 2 responses (location 3, female)

The coefficient vector can be set out as

β' = [β0, τ1, τ2, τ3, β1, β2, γ11, γ12, γ21, γ22, γ31, γ32]

where the τi's (i > 0) represent the effects of the locations on the determination of service, the βi's represent the effects of gender on the service index, and the γik's represent the location-gender interaction effects.

The design matrix Z is not of full rank. (For instance, column 1 equals the sum of columns 2-4 or columns 5-6.) In fact, rank(Z) = 6.

For the complete model, results from a computer program give

SSres(Z) = 2977.4

with n - rank(Z) = 18 - 6 = 12.

The model without the interaction terms has the design matrix Z1 consisting of the first six columns of Z. We find that

SSres(Z1) = 3419.1

with n - rank(Z1) = 18 - 4 = 14. To test H0: γ11 = γ12 = γ21 = γ22 = γ31 = γ32 = 0 (no location-gender interaction), we compute

$$F = \frac{(SS_{\text{res}}(\mathbf{Z}_1) - SS_{\text{res}}(\mathbf{Z}))/(6-4)}{SS_{\text{res}}(\mathbf{Z})/12}
= \frac{(3419.1 - 2977.4)/2}{2977.4/12} = .89$$
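The arithmetic is quick to verify in plain Python, using the sums of squares reported above:

    ss_full, ss_reduced = 2977.4, 3419.1         # SSres(Z) and SSres(Z1)
    df_extra = 6 - 4                             # rank(Z) - rank(Z1)
    df_error = 18 - 6                            # n - rank(Z)

    F = ((ss_reduced - ss_full) / df_extra) / (ss_full / df_error)
    print(round(F, 2))                           # 0.89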
The F-ratio may be compared with an appropriate percentage point of an F-distribution with 2 and 12 d.f. This F-ratio is not significant for any reasonable significance level α. Consequently, we conclude that the service index does not depend upon any location-gender interaction, and these terms can be dropped from the model.

Using the extra sum-of-squares approach, we may verify that there is no difference between locations (no location effect), but that gender is significant; that is, males and females do not give the same ratings to service.

In analysis-of-variance situations where the cell counts are unequal, the variation in the response attributable to different predictor variables and their interactions cannot usually be separated into independent amounts. To evaluate the relative influences of the predictors on the response in this case, it is necessary to fit the model with and without the terms in question and compute the appropriate F-test statistics. •

7.5 Inferences from the Estimated Regression Function

Once an investigator is satisfied with the fitted regression model, it can be used to solve two prediction problems. Let z0' = [1, z01, ..., z0r] be selected values for the predictor variables. Then z0 and β̂ can be used (1) to estimate the regression function β0 + β1z01 + ··· + βrz0r at z0 and (2) to estimate the value of the response Y at z0.

Estimating the Regression Function at z0

Let Y0 denote the value of the response when the predictor variables have values z0' = [1, z01, ..., z0r]. According to the model in (7-3), the expected value of Y0 is

E(Y0 | z0) = β0 + β1z01 + ··· + βrz0r = z0'β          (7-15)

Its least squares estimate is z0'β̂.

Result 7.7. For the linear regression model in (7-3), z0'β̂ is the unbiased linear estimator of E(Y0 | z0) with minimum variance, and Var(z0'β̂) = z0'(Z'Z)⁻¹z0σ². If the errors ε are normally distributed, then a 100(1 - α)% confidence interval for E(Y0 | z0) = z0'β is provided by

z0'β̂ ± tn-r-1(α/2) √(z0'(Z'Z)⁻¹z0 s²)

where tn-r-1(α/2) is the upper 100(α/2)th percentile of a t-distribution with n - r - 1 d.f.

Proof. For a fixed z0, z0'β̂ is just a linear combination of the β̂i's, so Result 7.3 applies. Also, Var(z0'β̂) = z0'Cov(β̂)z0 = z0'(Z'Z)⁻¹z0σ², since Cov(β̂) = σ²(Z'Z)⁻¹ by Result 7.2. Under the further assumption that ε is normally distributed, Result 7.4 asserts that β̂ is Nr+1(β, σ²(Z'Z)⁻¹) independently of s²/σ², which
is distributed as χ²n-r-1/(n - r - 1). Consequently, the linear combination z0'β̂ is N(z0'β, σ²z0'(Z'Z)⁻¹z0) and

$$\frac{(\mathbf{z}_0'\hat{\boldsymbol{\beta}} - \mathbf{z}_0'\boldsymbol{\beta})\big/\sqrt{\sigma^2\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0}}{\sqrt{s^2/\sigma^2}}
= \frac{\mathbf{z}_0'\hat{\boldsymbol{\beta}} - \mathbf{z}_0'\boldsymbol{\beta}}{\sqrt{s^2\,\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0}}$$

is distributed as tn-r-1. The confidence interval follows. •

Forecasting a New Observation at z0

Prediction of a new observation, such as Y0, at z0' = [1, z01, ..., z0r] is more uncertain than estimating the expected value of Y0. According to the regression model of (7-3),

Y0 = z0'β + ε0

or

(new response Y0) = (expected value of Y0 at z0) + (new error)

where ε0 is distributed as N(0, σ²) and is independent of ε and, hence, of β̂ and s². The errors ε influence the estimators β̂ and s² through the responses Y, but ε0 does not.

Result 7.8. Given the linear regression model of (7-3), a new observation Y0 has the unbiased predictor

z0'β̂ = β̂0 + β̂1z01 + ··· + β̂rz0r

The variance of the forecast error Y0 - z0'β̂ is

Var(Y0 - z0'β̂) = σ²(1 + z0'(Z'Z)⁻¹z0)

When the errors ε have a normal distribution, a 100(1 - α)% prediction interval for Y0 is given by

z0'β̂ ± tn-r-1(α/2) √(s²(1 + z0'(Z'Z)⁻¹z0))

where tn-r-1(α/2) is the upper 100(α/2)th percentile of a t-distribution with n - r - 1 degrees of freedom.

Proof. We forecast Y0 by z0'β̂, which estimates E(Y0 | z0). By Result 7.7, z0'β̂ has E(z0'β̂) = z0'β and Var(z0'β̂) = z0'(Z'Z)⁻¹z0σ². The forecast error is then Y0 - z0'β̂ = z0'β + ε0 - z0'β̂ = ε0 + z0'(β - β̂). Thus, E(Y0 - z0'β̂) = E(ε0) + E(z0'(β - β̂)) = 0, so the predictor is unbiased. Since ε0 and β̂ are independent,

Var(Y0 - z0'β̂) = Var(ε0) + Var(z0'β̂) = σ² + z0'(Z'Z)⁻¹z0σ² = σ²(1 + z0'(Z'Z)⁻¹z0)

If it is further assumed that ε has a normal distribution, then β̂ is normally distributed, and so is the linear combination Y0 - z0'β̂. Consequently, (Y0 - z0'β̂)/√(σ²(1 + z0'(Z'Z)⁻¹z0)) is distributed as N(0, 1). Dividing this ratio by √(s²/σ²), which is distributed as √(χ²n-r-1/(n - r - 1)), we obtain

$$\frac{Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}}{\sqrt{s^2\,(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}}$$

which is distributed as tn-r-1. The prediction interval follows immediately. •
The prediction interval for Y0 is wider than the confidence interval for estimating the value of the regression function E(Y0 | z0) = z0'β. The additional uncertainty in forecasting Y0, which is represented by the extra term s² in the expression s²(1 + z0'(Z'Z)⁻¹z0), comes from the presence of the unknown error term ε0.

Example 7.6 (Interval estimates for a mean response and a future response) Companies considering the purchase of a computer must first assess their future needs in order to determine the proper equipment. A computer scientist collected data from seven similar company sites so that a forecast equation of computer-hardware requirements for inventory management could be developed. The data are given in Table 7.3 for

z1 = customer orders (in thousands)
z2 = add-delete item count (in thousands)
Y = CPU (central processing unit) time (in hours)

Table 7.3 Computer Data
z1 (Orders)   z2 (Add-delete items)   Y (CPU time)
   123.5             2.108               141.5
   146.1             9.213               168.9
   133.9             1.905               154.8
   128.5              .815               146.5
   151.5             1.061               172.8
   136.2             8.603               160.1
    92.0             1.125               108.5

Source: Data taken from H. P. Artis, Forecasting Computer Requirements: A Forecaster's Dilemma (Piscataway, NJ: Bell Laboratories, 1979).

Construct a 95% confidence interval for the mean CPU time, E(Y0 | z0) = β0 + β1z01 + β2z02, at z0' = [1, 130, 7.5]. Also, find a 95% prediction interval for a new facility's CPU requirement corresponding to the same z0.

A computer program provides the estimated regression function

ŷ = 8.42 + 1.08z1 + .42z2

s = 1.204, and √(z0'(Z'Z)⁻¹z0) = .58928. Consequently,

z0'β̂ = 8.42 + 1.08(130) + .42(7.5) = 151.97

and s√(z0'(Z'Z)⁻¹z0) = 1.204(.58928) = .71. We have t4(.025) = 2.776, so the 95% confidence interval for the mean CPU time at z0 is

z0'β̂ ± t4(.025)s√(z0'(Z'Z)⁻¹z0) = 151.97 ± 2.776(.71)

or (150.00, 153.94).
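Both the confidence interval above and the prediction interval computed next can be reproduced from Table 7.3; a numpy/scipy sketch:

    import numpy as np
    from scipy import stats

    z1 = np.array([123.5, 146.1, 133.9, 128.5, 151.5, 136.2, 92.0])
    z2 = np.array([2.108, 9.213, 1.905, 0.815, 1.061, 8.603, 1.125])
    y = np.array([141.5, 168.9, 154.8, 146.5, 172.8, 160.1, 108.5])

    Z = np.column_stack([np.ones(7), z1, z2])
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)       # approx. [8.42, 1.08, .42]
    resid = y - Z @ beta_hat
    s2 = resid @ resid / (7 - 3)                        # s approx. 1.204

    z0 = np.array([1.0, 130.0, 7.5])
    h0 = z0 @ np.linalg.inv(Z.T @ Z) @ z0               # z0'(Z'Z)^(-1) z0
    t = stats.t.ppf(0.975, 4)                           # 2.776
    fit = z0 @ beta_hat                                 # approx. 151.97

    ci = (fit - t * np.sqrt(s2 * h0), fit + t * np.sqrt(s2 * h0))
    pi = (fit - t * np.sqrt(s2 * (1 + h0)), fit + t * np.sqrt(s2 * (1 + h0)))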
Since s√(1 + z0'(Z'Z)⁻¹z0) = 1.204√(1 + .34725) = (1.204)(1.16071) = 1.40, a 95% prediction interval for the CPU time at a new facility with conditions z0 is

z0'β̂ ± t4(.025)s√(1 + z0'(Z'Z)⁻¹z0) = 151.97 ± 2.776(1.40)

or (148.08, 155.86). •

7.6 Model Checking and Other Aspects of Regression

Does the Model Fit?

Assuming that the model is "correct," we have used the estimated regression function to make inferences. Of course, it is imperative to examine the adequacy of the model before the estimated function becomes a permanent part of the decision-making apparatus.

All the sample information on lack of fit is contained in the residuals

ε̂1 = y1 - β̂0 - β̂1z11 - ··· - β̂rz1r
ε̂2 = y2 - β̂0 - β̂1z21 - ··· - β̂rz2r
⋮
ε̂n = yn - β̂0 - β̂1zn1 - ··· - β̂rznr

or

ε̂ = [I - Z(Z'Z)⁻¹Z']y = [I - H]y          (7-16)

If the model is valid, each residual ε̂j is an estimate of the error εj, which is assumed to be a normal random variable with mean zero and variance σ². Although the residuals ε̂ have expected value 0, their covariance matrix σ²[I - Z(Z'Z)⁻¹Z'] = σ²[I - H] is not diagonal. Residuals have unequal variances and nonzero correlations. Fortunately, the correlations are often small and the variances are nearly equal.

Because the residuals ε̂ have covariance matrix σ²[I - H], the variances of the ε̂j can vary greatly if the diagonal elements of H, the leverages hjj, are substantially different. Consequently, many statisticians prefer graphical diagnostics based on studentized residuals. Using the residual mean square s² as an estimate of σ², we have

Var̂(ε̂j) = (1 - hjj)s²,   j = 1, 2, ..., n          (7-17)

and the studentized residuals are

$$\hat{\varepsilon}_j^{\,*} = \frac{\hat{\varepsilon}_j}{\sqrt{s^2(1 - h_{jj})}}, \qquad j = 1, 2, \ldots, n \tag{7-18}$$

We expect the studentized residuals to look, approximately, like independent drawings from an N(0, 1) distribution. Some software packages go one step further and studentize ε̂j using the delete-one estimated variance s²(j), which is the residual mean square when the jth observation is dropped from the analysis.
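Leverages and studentized residuals come straight from the hat matrix; a short numpy sketch of (7-17)-(7-18):

    import numpy as np

    def studentized_residuals(Z, y):
        """Residuals scaled as in (7-18), plus the leverages h_jj."""
        n, r_plus_1 = Z.shape
        H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
        h = np.diag(H)
        resid = y - H @ y
        s2 = resid @ resid / (n - r_plus_1)
        return resid / np.sqrt(s2 * (1.0 - h)), h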
Residuals should be plotted in various ways to detect possible anomalies. For general diagnostic purposes, the following are useful graphs:

1. Plot the residuals ε̂j against the predicted values ŷj = β̂0 + β̂1zj1 + ··· + β̂rzjr. Departures from the assumptions of the model are typically indicated by two types of phenomena:
(a) A dependence of the residuals on the predicted value. This is illustrated in Figure 7.2(a). The numerical calculations are incorrect, or a β0 term has been omitted from the model.
(b) The variance is not constant. The pattern of residuals may be funnel shaped, as in Figure 7.2(b), so that there is large variability for large ŷ and small variability for small ŷ. If this is the case, the variance of the error is not constant, and transformations or a weighted least squares approach (or both) are required. (See Exercise 7.3.) In Figure 7.2(d), the residuals form a horizontal band. This is ideal and indicates equal variances and no dependence on ŷ.

2. Plot the residuals ε̂j against a predictor variable, such as z1, or products of predictor variables, such as z1² or z1z2. A systematic pattern in these plots suggests the need for more terms in the model. This situation is illustrated in Figure 7.2(c).

3. Do the errors appear to be normally distributed? To answer this question, the residuals ε̂j or ε̂j* can be examined using the techniques discussed in Section 4.6. The Q-Q plots, histograms, and dot diagrams help to detect the presence of unusual observations or severe departures from normality that may require special attention in the analysis. If n is large, minor departures from normality will not greatly affect inferences about β.

[Figure 7.2 Residual plots, panels (a)-(d).]
4. Plot the residuals versus time. The assumption of independence is crucial, but hard to check. If the data are naturally chronological, a plot of the residuals versus time may reveal a systematic pattern. (A plot of the positions of the residuals in space may also reveal associations among the errors.) For instance, residuals that increase over time indicate a strong positive dependence. A statistical test of independence can be constructed from the first autocorrelation,

$$r_1 = \frac{\displaystyle\sum_{j=2}^{n} \hat{\varepsilon}_j \hat{\varepsilon}_{j-1}}{\displaystyle\sum_{j=1}^{n} \hat{\varepsilon}_j^2} \tag{7-19}$$

of residuals from adjacent periods. A popular test based on the statistic

$$\sum_{j=2}^{n} (\hat{\varepsilon}_j - \hat{\varepsilon}_{j-1})^2 \Big/ \sum_{j=1}^{n} \hat{\varepsilon}_j^2 \approx 2(1 - r_1)$$

is called the Durbin-Watson test. (See [14] for a description of this test and tables of critical values.)

Example 7.7 (Residual plots) Three residual plots for the computer data discussed in Example 7.6 are shown in Figure 7.3. The sample size n = 7 is really too small to allow definitive judgments; however, it appears as if the regression assumptions are tenable. •

[Figure 7.3 Residual plots for the computer data of Example 7.6.]
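The Durbin-Watson statistic itself is one line of numpy; consult the tables in [14] for critical values:

    import numpy as np

    def durbin_watson(resid):
        """Durbin-Watson statistic; values near 2 suggest uncorrelated errors."""
        d = np.diff(resid)
        return (d @ d) / (resid @ resid)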
If several observations of the response are available for the same values of the predictor variables, then a formal test for lack of fit can be carried out. (See [13] for a discussion of the pure-error lack-of-fit test.)

If, after the diagnostic checks, no serious violations of the assumptions are detected, we can make inferences about β and the future Y values with some assurance that we will not be misled.

Leverage and Influence

Although a residual analysis is useful in assessing the fit of a model, departures from the regression model are often hidden by the fitting process. For example, there may be "outliers" in either the response or explanatory variables that can have a considerable effect on the analysis yet are not easily detected from an examination of residual plots. In fact, these outliers may determine the fit.

The leverage hjj, the (j, j) diagonal element of H = Z(Z'Z)⁻¹Z', can be interpreted in two related ways. (See [10].) First, the leverage is associated with the jth data point and measures, in the space of the explanatory variables, how far the jth observation is from the other n - 1 observations. Second, the leverage is a measure of pull that a single case exerts on the fit. The vector of predicted values is

ŷ = Zβ̂ = Z(Z'Z)⁻¹Z'y = Hy

where the jth row expresses the fitted value ŷj in terms of the observations as

ŷj = hjj yj + Σ hjk yk   (sum over k ≠ j)

Provided that all other y values are held fixed,

(change in ŷj) = hjj (change in yj)

If the leverage is large relative to the other hjk, then yj will be a major contributor to the predicted value ŷj.

For simple linear regression with one explanatory variable z,

$$h_{jj} = \frac{1}{n} + \frac{(z_j - \bar{z})^2}{\displaystyle\sum_{i=1}^{n}(z_i - \bar{z})^2}$$

The average leverage is (r + 1)/n. (See Exercise 7.8.)

Observations that significantly affect inferences drawn from the data are said to be influential. Methods for assessing influence are typically based on the change in the vector of parameter estimates, β̂, when observations are deleted. Plots based upon leverage and influence statistics and their use in diagnostic checking of regression models are described in [3], [5], and [10].

Additional Problems in Linear Regression

We shall briefly discuss several important aspects of regression that deserve and receive extensive treatments in texts devoted to regression analysis. (See [11], [13], and [23].) These references are recommended for anyone involved in an analysis of regression models.
Selecting predictor variables from a large set. In practice, it is often difficult to formulate an appropriate regression function immediately. Which predictor variables should be included? What form should the regression function take?

When the list of possible predictor variables is very large, not all of the variables can be included in the regression function. Techniques and computer programs designed to select the "best" subset of predictors are now readily available. The good ones try all subsets: z1 alone, z2 alone, z1 and z2, and so forth. The best choice is decided by examining some criterion quantity like R². [See (7-9).] However, R² always increases with the inclusion of additional predictor variables. Although this problem can be circumvented by using the adjusted R²,

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - r - 1}$$

a better statistic for selecting variables seems to be Mallows' Cp statistic (see [12]),

$$C_p = \frac{\text{residual sum of squares for subset model with } p \text{ parameters, including an intercept}}{\text{residual variance for full model}} - (n - 2p)$$

A plot of the pairs (p, Cp), one for each subset of predictors, will indicate models that forecast the observed responses well. Good models typically have (p, Cp) coordinates near the 45° line. In Figure 7.4, we have circled the point corresponding to the "best" subset of predictor variables.

[Figure 7.4 Cp plot for computer data from Example 7.6 with three predictor variables (z1 = orders, z2 = add-delete count, z3 = number of items; see the example and original source).]

If the list of predictor variables is very long, cost considerations limit the number of models that can be examined. Another approach, called stepwise regression (see [13]), attempts to select important predictors without considering all the possibilities.
The procedure can be described by listing the basic steps (algorithm) involved in the computations:

Step 1. All possible simple linear regressions are considered. The predictor variable that explains the largest significant proportion of the variation in Y (the variable that has the largest correlation with the response) is the first variable to enter the regression function.

Step 2. The next variable to enter is the one (out of those not yet included) that makes the largest significant contribution to the regression sum of squares. The significance of the contribution is determined by an F-test. (See Result 7.6.) The value of the F-statistic that must be exceeded before the contribution of a variable is deemed significant is often called the F to enter.

Step 3. Once an additional variable has been included in the equation, the individual contributions to the regression sum of squares of the other variables already in the equation are checked for significance using F-tests. If the F-statistic is less than the one (called the F to remove) corresponding to a prescribed significance level, the variable is deleted from the regression function.

Step 4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and all possible deletions are significant. At this point the selection stops.

Because of the step-by-step procedure, there is no guarantee that this approach will select, for example, the best three variables for prediction. A second drawback is that the (automatic) selection methods are not capable of indicating when transformations of variables are useful.

Another popular criterion for selecting an appropriate model, called an information criterion, also balances the size of the residual sum of squares with the number of parameters in the model. Akaike's information criterion (AIC) is

$$\text{AIC} = n \ln\!\left(\frac{\text{residual sum of squares for subset model with } p \text{ parameters, including an intercept}}{n}\right) + 2p$$

It is desirable that the residual sum of squares be small, but the second term penalizes for too many parameters. Overall, we want to select models from those having the smaller values of AIC.
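Both criteria are simple to tabulate over all subsets when r is moderate. A sketch (numpy assumed; the function name is ours) that returns (subset, p, Cp, AIC) for every nonempty subset of predictors, with the intercept always retained:

    import numpy as np
    from itertools import combinations

    def cp_aic_table(Z_full, y):
        """Mallows' Cp and AIC for each subset; Z_full has a leading column of ones."""
        n, r_plus_1 = Z_full.shape

        def rss(X):
            e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
            return float(e @ e)

        s2_full = rss(Z_full) / (n - r_plus_1)   # residual variance, full model
        rows = []
        for k in range(1, r_plus_1):
            for cols in combinations(range(1, r_plus_1), k):
                X = Z_full[:, (0,) + cols]
                p = k + 1                        # parameters, including intercept
                r = rss(X)
                cp = r / s2_full - (n - 2 * p)
                aic = n * np.log(r / n) + 2 * p
                rows.append((cols, p, cp, aic))
        return rows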
Thus. z. unlike the situation when the model is correct. In matrix notation.n and a single set of predictor variables z1 . z2 .Z1Z2il( 2) 1 (721) That is.ZJil(l))· The least squares estimator of il(l) is il(l) = 1 (Z!Z 1 f ZjY. the investigator unknowingly fits a model using only the first q predictors by minimizing the error su. so that Y1 = f3ol Y2 = f3o2 + + f3llZ! f312Z! + ·· · + + ··· + f3r!Zr f3r2Zr + e1 + e2 (722) The error term E' = [e 1 .•. e2. Y2 . . E(P(l)) = (Z!ZJf ZjE(Y) = (ZjZJf Zj(ZJ/l(l) + Z2fl(2) + E(E)) 1 1 = flo> + (ZjZJ). Each response is assumed to follow its own regression model.m of squares (Y.. let Yj = [lj 1 .Multivariate Multiple Regression 387 Bias caused by a misspecified model... the least squares estimates fl(l) may be misleading. the error terms associated with different responses may be correlated. . ZjZ 2 = Ot If important variables are missing from the model. Y. 1. . However.•.. lfm] be the responses. let [ZjO• Zj!• .1 Multivariate Multiple Regression In this section. Suppos·e some important predictor variables are omitted from the proposed regression model. Ejm] be the errors. lj 2 . That is. ••. . Then. . and let Ej = [ eil. To establish notation conforming to the classical linear regression model. . • • 7 . Zjr] denote the values of the predictor variables for the jth trial. the design matrix Z!r~ Z2r . . ••• . suppose the true model has Z = [Z 1 ! Z 2] with rank r + 1 and (720) where E( E) = 0 and Var( E) = £T 21. {J(l) is a biased estimator of fl(l) unless the columns of Z 1 are perpendicular to those of Z 2 (that is. we consider the problem of modeling the relationship between m responses Y1 . . ei 2 . em] has E(E) = 0 and Var(E) = l:..Zdl(J))'(Y.
$$\underset{(n\times(r+1))}{\mathbf{Z}} = \begin{bmatrix} z_{10} & z_{11} & \cdots & z_{1r} \\ z_{20} & z_{21} & \cdots & z_{2r} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n0} & z_{n1} & \cdots & z_{nr} \end{bmatrix}$$

is the same as that for the single-response regression model. [See (7-3).] The other matrix quantities have multivariate counterparts. Set

$$\underset{(n\times m)}{\mathbf{Y}} = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1m} \\ Y_{21} & Y_{22} & \cdots & Y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{n1} & Y_{n2} & \cdots & Y_{nm} \end{bmatrix}
= [\mathbf{Y}_{(1)} \;\vdots\; \mathbf{Y}_{(2)} \;\vdots\; \cdots \;\vdots\; \mathbf{Y}_{(m)}]$$

$$\underset{((r+1)\times m)}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_{01} & \beta_{02} & \cdots & \beta_{0m} \\ \beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{r1} & \beta_{r2} & \cdots & \beta_{rm} \end{bmatrix}
= [\boldsymbol{\beta}_{(1)} \;\vdots\; \boldsymbol{\beta}_{(2)} \;\vdots\; \cdots \;\vdots\; \boldsymbol{\beta}_{(m)}]$$

$$\underset{(n\times m)}{\boldsymbol{\varepsilon}} = \begin{bmatrix} \varepsilon_{11} & \varepsilon_{12} & \cdots & \varepsilon_{1m} \\ \varepsilon_{21} & \varepsilon_{22} & \cdots & \varepsilon_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \varepsilon_{n1} & \varepsilon_{n2} & \cdots & \varepsilon_{nm} \end{bmatrix}
= [\boldsymbol{\varepsilon}_{(1)} \;\vdots\; \boldsymbol{\varepsilon}_{(2)} \;\vdots\; \cdots \;\vdots\; \boldsymbol{\varepsilon}_{(m)}]
= \begin{bmatrix} \boldsymbol{\varepsilon}_1' \\ \boldsymbol{\varepsilon}_2' \\ \vdots \\ \boldsymbol{\varepsilon}_n' \end{bmatrix}$$

The multivariate linear regression model is

$$\underset{(n\times m)}{\mathbf{Y}} = \underset{(n\times(r+1))}{\mathbf{Z}}\;\underset{((r+1)\times m)}{\boldsymbol{\beta}} + \underset{(n\times m)}{\boldsymbol{\varepsilon}} \tag{7-23}$$

with

E(ε(i)) = 0,   Cov(ε(i), ε(k)) = σikI,   i, k = 1, 2, ..., m          (7-24)

The m observations on the jth trial have covariance matrix Σ = {σik}, but observations from different trials are uncorrelated. Here β and σik are unknown parameters; the design matrix Z has jth row [zj0, zj1, ..., zjr].

Simply stated, the ith response Y(i) follows the linear regression model

Y(i) = Zβ(i) + ε(i),   i = 1, 2, ..., m

with Cov(ε(i)) = σiiI. However, the errors for different responses on the same trial can be correlated.

Given the outcomes Y and the values of the predictor variables Z with full column rank, we determine the least squares estimates β̂(i) exclusively from the n observations Y(i) on the ith response. In conformity with the single-response solution, we take

β̂(i) = (Z'Z)⁻¹Z'Y(i)          (7-25)
Collecting these univariate least squares estimates, we obtain

β̂ = [β̂(1) ⋮ β̂(2) ⋮ ··· ⋮ β̂(m)] = (Z'Z)⁻¹Z'[Y(1) ⋮ Y(2) ⋮ ··· ⋮ Y(m)]

or

β̂ = (Z'Z)⁻¹Z'Y          (7-26)

For any choice of parameters B = [b(1) ⋮ b(2) ⋮ ··· ⋮ b(m)], the matrix of errors is Y - ZB. The error sum of squares and cross-products matrix is

$$(\mathbf{Y} - \mathbf{Z}\mathbf{B})'(\mathbf{Y} - \mathbf{Z}\mathbf{B})
= \begin{bmatrix}
(\mathbf{Y}_{(1)} - \mathbf{Z}\mathbf{b}_{(1)})'(\mathbf{Y}_{(1)} - \mathbf{Z}\mathbf{b}_{(1)}) & \cdots & (\mathbf{Y}_{(1)} - \mathbf{Z}\mathbf{b}_{(1)})'(\mathbf{Y}_{(m)} - \mathbf{Z}\mathbf{b}_{(m)}) \\
\vdots & \ddots & \vdots \\
(\mathbf{Y}_{(m)} - \mathbf{Z}\mathbf{b}_{(m)})'(\mathbf{Y}_{(1)} - \mathbf{Z}\mathbf{b}_{(1)}) & \cdots & (\mathbf{Y}_{(m)} - \mathbf{Z}\mathbf{b}_{(m)})'(\mathbf{Y}_{(m)} - \mathbf{Z}\mathbf{b}_{(m)})
\end{bmatrix} \tag{7-27}$$

The selection b(i) = β̂(i) minimizes the ith diagonal sum of squares (Y(i) - Zb(i))'(Y(i) - Zb(i)). Consequently, tr[(Y - ZB)'(Y - ZB)] is minimized by the choice B = β̂. Also, the generalized variance |(Y - ZB)'(Y - ZB)| is minimized by the least squares estimates β̂. (See Exercise 7.11 for an additional generalized sum of squares property.)

Using the least squares estimates β̂, we can form the matrices of

Predicted values: Ŷ = Zβ̂ = Z(Z'Z)⁻¹Z'Y
Residuals: ε̂ = Y - Ŷ = [I - Z(Z'Z)⁻¹Z']Y          (7-28)

The orthogonality conditions among the residuals, predicted values, and columns of Z, which hold in classical linear regression, hold in multivariate multiple regression. They follow from Z'[I - Z(Z'Z)⁻¹Z'] = Z' - Z' = 0. Specifically,

Z'ε̂ = Z'[I - Z(Z'Z)⁻¹Z']Y = 0          (7-29)

so the residuals ε̂(i) are perpendicular to the columns of Z. Also,

Ŷ'ε̂ = β̂'Z'[I - Z(Z'Z)⁻¹Z']Y = 0          (7-30)

confirming that the predicted values Ŷ(i) are perpendicular to all residual vectors ε̂(k). Because Y = Ŷ + ε̂,

Y'Y = (Ŷ + ε̂)'(Ŷ + ε̂) = Ŷ'Ŷ + ε̂'ε̂ + 0 + 0'

or

Y'Y = Ŷ'Ŷ + ε̂'ε̂

(total sum of squares and cross products) = (predicted sum of squares and cross products) + (residual (error) sum of squares and cross products)          (7-31)
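Computationally, (7-26) and (7-28) are the univariate formulas applied to all m response columns at once. A numpy sketch (names are ours):

    import numpy as np

    def mv_least_squares(Z, Y):
        """Multivariate least squares: Y is n x m, Z is n x (r+1)."""
        B = np.linalg.solve(Z.T @ Z, Z.T @ Y)    # beta_hat, (r+1) x m, as in (7-26)
        Y_hat = Z @ B                            # predicted values
        E_hat = Y - Y_hat                        # residual matrix, as in (7-28)
        return B, Y_hat, E_hat

The orthogonality conditions (7-29) and (7-30) can then be checked with np.allclose(Z.T @ E_hat, 0) and np.allclose(Y_hat.T @ E_hat, 0).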
The residual sum of squares and cross products can also be written as

ε̂'ε̂ = Y'Y - Ŷ'Ŷ = Y'Y - β̂'Z'Zβ̂          (7-32)

Example 7.8 (Fitting a multivariate straight-line regression model) To illustrate the calculations of β̂, Ŷ, and ε̂, we fit a straight-line regression model (see Panel 7.2),

Yj1 = β01 + β11zj1 + εj1
Yj2 = β02 + β12zj1 + εj2,   j = 1, 2, ..., 5

to two responses Y1 and Y2 using the data in Example 7.3. These data, augmented by observations on an additional response, are as follows:

z1 |  0   1  2  3  4
y1 |  1   4  3  8  9
y2 | -1  -1  2  3  2

The design matrix Z remains unchanged from the single-response problem. We find that

$$\mathbf{Z}' = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \end{bmatrix}, \qquad
(\mathbf{Z}'\mathbf{Z})^{-1} = \begin{bmatrix} .6 & -.2 \\ -.2 & .1 \end{bmatrix}$$

PANEL 7.2 SAS ANALYSIS FOR EXAMPLE 7.8 USING PROC GLM.

PROGRAM COMMANDS:

    title 'Multivariate Regression Analysis';
    data mra;
    infile 'Example 7-8 data';
    input y1 y2 z1;
    proc glm data = mra;
    model y1 y2 = z1/ss3;
    manova h = z1/printe;

OUTPUT:

    General Linear Models Procedure

    Dependent Variable: Y1
    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               1       40.00000000    40.00000000      20.00    0.0208
    Error               3        6.00000000     2.00000000
    Corrected Total     4       46.00000000

    R-Square 0.869565    C.V. 28.28427    Root MSE 1.414214    Y1 Mean 5.00000000

    Source    DF    Type III SS    Mean Square    F Value    Pr > F
    Z1         1    40.00000000    40.00000000      20.00    0.0208
    Parameter    Estimate       T for H0: Parameter = 0    Pr > |T|    Std Error of Estimate
    INTERCEPT    1.00000000                        0.91      0.4286             1.09544512
    Z1           2.00000000                        4.47      0.0208             0.44721360

    Dependent Variable: Y2
    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               1       10.00000000    10.00000000       7.50    0.0714
    Error               3        4.00000000     1.33333333
    Corrected Total     4       14.00000000

    R-Square 0.714286    Root MSE 1.154701    Y2 Mean 1.00000000

    Source    DF    Type III SS    Mean Square    F Value    Pr > F
    Z1         1    10.00000000    10.00000000       7.50    0.0714

    Parameter    Estimate        T for H0: Parameter = 0    Pr > |T|    Std Error of Estimate
    INTERCEPT    -1.00000000                       -1.12      0.3450             0.89442719
    Z1            1.00000000                        2.74      0.0714             0.36514837

    E = Error SS & CP Matrix
          Y1    Y2
    Y1     6    -2
    Y2    -2     4

    Manova Test Criteria and Exact F Statistics for
    the Hypothesis of no Overall Z1 Effect
    H = Type III SS&CP Matrix for Z1    E = Error SS&CP Matrix
    S = 1    M = 0    N = 0

    Statistic                      Value          F    Num DF    Den DF    Pr > F
    Wilks' Lambda             0.06250000    15.0000         2         2    0.0625
    Pillai's Trace            0.93750000    15.0000         2         2    0.0625
    Hotelling-Lawley Trace   15.00000000    15.0000         2         2    0.0625
    Roy's Greatest Root      15.00000000    15.0000         2         2    0.0625
fJ = [13(1) i 13(2)] = ' ' ' [1 1] 2 1 = = (Z'Z) 1 Z'[Y(I) i Y(2)] The fitted values are generated from j/1 1 + 2z 1 and 5'2 = 1 + z2 • Collectively.392 Chapter 7 Multivariate Linear Regression Models and Z'y''' so ~ [~ ~ : 13(1) ' : !] [=~] ~ [ 2~] ~(2) = (Z'zrlz. Y 'Y. = [165 45 45] 15 and '•i = [ 6 2] 2 . and i=YY= Note that ' [0 1 0 1 2 1 1 1 1 o]' ' e'Y Since = [0 0 1 1 2 1 1 1 ~J[~ t]~[~ ~] Y'Y = [ 1 1 4 3 1 2 8 3 ~J[~ =~ ~ ~: 1 E l[ 4 43 19 J .3. (Z'Z) Z'Y(l) [1lJ 2 Hence.Y(2) = [ :~ = 1 :n [2~J = =[ ~ J From Example 7.
9. . For the least squares estimator /J = [~(!) ~( 2 ) : · · · : ~(m)l determined under the multivariate multiple regression model (723) with full rank (Z) = r + 1 < n.I ~(il . E(E(i)) = 0.Multivariate Multiple Regression 393 the sum of squares and crossproducts decomposition is easily verified.Z(Z'Zf Z')] = 1 cr. . and Cov(fJ(i)• fJ(k)) ' ' = cr. Proof.k 1 tr[(I.r. Cov(~(i)• ~(k)) = E(~(i).12 E(i(. we have that E[U' AU] = E[tr(AUU')] = tr[AE(UU')].k(Z'Z) .Z(Z'Zf Z']Y(il =[I.2. from (724).1 and using Result 2A.1 e'e) =I fJ are uncorrelated.Z(Z'Zf Z']"Ul 1 1 so E(~(i)) = fJ(i) and E(i(i)) = 0. from the proof of Result 7.. Eand and E( n 1 r.)i(k)) = E(E{i)(I.Z(Z'Zf 1Z')E(k)) = tr[(I.fJ(i) and = (Z'Zf Z'Y(i) .ki] = cr.k(Z'Zf Using Result 4.fJ(i) 1 = (Z'Zf Z'E(i) 1 (733) iul = Y(i) Y(i) =[I. with U any random vector and A a fixed matrix.Z(Z'Zf Z')cr..fJ(k))' 1 1 1 = (Z'Zf Z' E(E(i)"Ck))Z(Z'Zf = CT. • Result 7.b so ! i(m)] = Y Z/J satisfy E(i(i)) = 0 and E(E) = 0 Also. and E(E(i)"Cil) = cr.1)cr.9. The ith response follows the multiple regression model Y(i) = ZfJ(i) + "U)• Also.k = 1. Consequently.r  1) . Next..fJ(i))(~(k).k(n. 1 i.m The residuals E = [i(l) ! i( 2) ! · · · E(i(i)i(k)) = (n.
k(l + z0 (Z'Zf1z0) 1 (736) Note that E((P(i) .zoP(i))(Yok .Z(Z'Zf Z')" (Z'Zf Z'CT.f.. 'Yoml at Zo.zoP(k)) = E(eo. .ki(I. the estimation errors zo/J(i).l(i) + Zo/l(i) .ziJE( (P(i) .P(i))(fJ(k) .• .. e02 .ziJPuJ) = E(eo. where the "new" error Eo = [e01 .ziJE(PuJ . e' eby n 1 r ..PuJHf.. .1.fJ(k))'zo . .. = zo/J(i) + Bo. Yo.. The forecast errors have covariances E(Yo.According to the regression model.P(k))'zo] = zb(E(P(i).].( [J(k) .fJ(k))')zo = a.P(k))')zo = a.. ·p' 'p' i •p' Zo = [zo (1) ' Zo (2) is an unbiased estiillator have covariances .(Z'Zf Z') = 1 1 o so each element of jJ is uncorrelated with each element of e.E(e 0. .z&(P(i) .. indicating that zoP(i) is an unbiosed predictor of Yo.eod = aik· Theforecasterrorforthe ith component of Y0 is }()i  zoP(i) = Yo.fl(kJ)').(P(k) ..ziJ(P(k) .f.kzo(Z'Z )~ zo I ( 735) The related problem is that of forecasting a new observation vector Y'o = [Yo!' Yo2.l(k) .9 enable us·· to obtain the sampling properties of the least squares predictors.l(kJ)) = E(eo. Finally.fl(i)) = 0.From the covariance matrix for /J(i) and p(k). .z0 /J(i) E[z 0 (/l(i) . z0 .zoP(i) =eo.*((Z'Zf Z' .fl(i)) so E(Yo. Collectively. The mean of the ith response variable is zofl(i)' and this is estimated by zij[J (i)' the ith component of the fitted regression relationship. we obtain the unbiased estimator Cov(P(•l•i(kJ) = E[(Z'Zf1Z'E(iJE(kJ(I= = 1 Z(Z'Zf Z')J 1 1 1 = (Z'Zf Z'E(E(iJEfkJ)(I. J •• • 'i Zo (m) J •p' (734) for each compo zb/3 since E(z 0Pn) = z0E(p 1 = zbfJu) i)) 1 nent.Ji(kJ of of I.394 Chapter 7 Multivariate Linear Regression Models Dividing each entry i(.l(i)))(eok.Z(Z'Zf Z') CT. .f.) = 0 and E(eo.l(i))~>ok) = Osince P(i) = (Z'Zf Z' E(i) + fl(i) is independent: of Eo.fl(i))(P{k). We first consider the problem of estimating the mean vector when the predictor : variables have the values z0 = [1. z01 .) .ziJ(P(i) . .: The mean vectors and covariance matrices determined in Result 7.l(i))SOk) .. ~>om] is independent of the errors e and satisfies E( e0 .f.A similar result holds forE( eo.zof.eok) + zQE(p(i). . Maximum likelihood estimators and their distributions can be obtained when the errors e have a normal distribution.
. The residual vectors [ 2 . however. .((rq)xm) (737) (nx(q+l)) i i (nx(rq)) Z2 ]·we can write the general model as E(Y) = Z/J = [Z 1 ! Z 2] [!!... that the model requires that the same predictor variables be used for all responses.J.] /3(2) = Z1/J(IJ + Z2/J(2) .e'e n and = I. and let the errors e have a normal distribution. are the maximum likeliWhen the errors are normally distributed. . for large samples. Note. respectively.~!. • supp~rt Proof.Multivariate Multiple Regression 395 Result 7. 111 ] can be examined for normality or outliers using the techniques in Section 4.:.6 for the singleresponse model.(yn z/3)'(Y. ei ei likelihood Ratio Tests for Regression Parameters The multiresponse analog of (712). The multivariate multiple regression model posAes no new computa1 tional problems. I) = (21T)mn/2jijn/2emn/2.. The remainder of this section is devoted to brief discussions of inference for the normal theory multivariate multiple regression model.1 0. fJ and n. are computed individually for each response variable.nr1 (:I) ni is distributed as The maximized likelihood L (ji. z..10 provides additional for using least squares estimates. .z/3) wp. Extended accounts of these procedures appear in [2] and [18]...has a normal distribution with E(P) = fJ and Cov (p(i).Z'y(.. (See website: www. P(k)) = CT. Then is the maximum likelihood estimator of fJ and 1 fJ .1 hood estimators of fJ and :I.k(Z'Zf • Also. it should be subjected to the diagnostic checks described in Section 7. the hypothesis that the responses do not depend on Zq+lo Zq+ 2 .prenhall.6. Therefore. becomes a Ho: P(2) = 0 Setting Z = [ Z1 where p a = r((q~W"')j /3(. e'e eil. Let the multivariate multiple regression model in (723) hold with full rank (Z) = r + 1. Least squares (maximum likelihood) estimates.com/statistics) Result 7. fJuJ = (Z'Z). Once a multivariate multiple regression model has been fit to the data.. Pis independent of the maximum likelihood estimator of the positive definite :I given by I= I. . Comment. they have nearly the smallest possible variances. n?: (r + 1) + m.
which. then fJ = (Z'ZfZ'Y. Let the errors e be normally d1stnbuted.11 remain the same.m should also be large to obtain a good chi~uare .!(m. Y = Z 1/Jcl} + e and the likelihood ratio test of Ha is b on the quantities involved in the extra sum of squares and cross products =: (Y  zJJcl})' (Y .5 the modified statistic [n . is distributed as wp. A. Nevertheless. Let the multivariate multiple regression model of (7 23) h?ld .l = n1(Y. Under Ho: Pcz) = 0.r + q + 1)] In( 1~1) '~' ·:t ·~ A 2 has.) .r and n . The likelihood ratio test of H0 is equivalen'' to rejecting H 0 for large values o(f Ii.t: '~ If Z is not of full rank.f~ .. both n . wher~fii (Z'zr is the generalized inverse discussed in [22]. not all hypotheses concerning can be tested due to the lack of uniqueness in the identification of fJ caused by linear dependencies among the columns of z.rq(I ). the generalized · allows all of the important MANOVA models to be analyzed as special cases of multivariate multiple regression model.i:) where Pel}= (ZizirlzjY and i. ·~ ZlnA = nln II1I = nlnlni + n(II i:~l For n 1arge.with .396 Chapter 7 Multivariate Linear Regression Models Under H 0 : /Jc 2 >= 0. I ) ni: Ini I ~ . a chisquare distribution with m(r..nrJ(I) independently of n(II I_. However.(Y .. but has rank r1 + 1. is distributed as Wp.zi/J(IJ)'(Y.q) d. (See Supplement 7A. (See also Exercise 7. the likelihood ratio. Jl 5 TechnicaUy.z/J) = n(i:I .1rProof. to a close approximation.F.1 1. in tum. Wilks' lambda statistic A2fn = IIII can be used.) ThdJ distributional conclusions stated in Result 7.z/J)' (Y . provided that r i(~ replaced by r 1 and q + 1 by rank (ZJ).1' of full rank r + 1 and ( r + 1) + m s: n. can be expressed in terms of generalize variances: · A Equivalently.zJJ(I>) .zl/J(I))· From Result 7 .Z ¥11 . ~~~ ~ ••¥<~ ~ Result 7.10.6.
A comp1.qJ) = 2(2) = 4 d. and the locationgender interaction on both servicequality indices.q) X (r + 1) and is of full rank (r . The first servicequality index was introduced in Example 7. The design matrix (see Example 7.49.11. let Id = .q). Although the sample size n = 18 is not large.9. we shall illustrate the calculations involved in the test of H 0 : /Jr 2 l = 0 given in Result 7. Setting a = .72 2050.5) remains the same for the tworesponse situation. • Information criterion are also available to aid in the selection of a simple but adequate multivariate multiple regresson model.2 2417. so AIC = n ln(J I J) . we have n = 18.!.76 246.1  ~(2  5 + 3 + 1) }n(. we do not reject H 0 at the 5% level.12 Let /Jr 2 l be the matrix of interaction parameters for the two responses.1ter program provides ( residual sum of squares) and cross products extra sum of squares) ( and cross products = ni: = [2977.88"]1) . we could consider a null hypothesis of the form H 0 : CfJ = r 0 . (residual sum of squares and cross products matrix) n Then.I)J In~ I A ) = .[ 18 .Since3.2p X d id This criterion attempts to balance the generalized variance with the number of paramete~s.5 . and d = 4 terms.07 X 2 X 4 = 18 X ln(20545. For a model that includes d predictor variables counting the intercept.7605) = 3.2p X d = 18ln (1 118 [3419. we test H 0 by referring [n .16] 366. the multivariate multiple regression version of the Akaike's information criterion is AIC = n ln(l J) .15 1267.95 _ i:) = [441. where C is (r . We shall illustrate the test of no location gender interaction ih either response using Result 7.72] 1021.~(m r + ri 1 q1.9 (Testing the importance of additional predictors with a multivariate response) The service in three locations of a large restaurant chain was rated according to two measures of quality by male and female patrons.16 = 162.05. + 1)] In ( A lni + n(I1.f. gender. The interaction terms are not needed.5.28 < ~(. Suppose we consider a regression model that allows for the effects of location.75 More generally.11. p = 2 response variables. In the context of Example 7 . under the null hypothesis of no interaction terms. Models with smaller AIC's are preferable.88 1267.39 = n(i: 1 1021. For the choices .1.28 to a chisquare percentage point with m(r1 .7) .05) = 9.Multivariate Multiple Regression 397 Example 7.16 246.
i=l 1 11. Let . are moderately large. (CP.r.q).1 + 111 Roy's test selects the coefficient vector a so that the univariate Fstatistic based on a a' Yj has its maximum possible value. for example. (See.11. The p X p hypothesis. The definitions are Equivalently. or extra. (18]. Roy's test will perform poorly relative to the other three.= /E IE IH/ .) Wilks' lambda. they are the roots of I 1  (I I)  Wilks' lambda = Pillai's trace = HotellingLawley trace = • i=l s i=1 ±+ 111 s 1 II +. When several of the eigenvalues 11. this null hypothesis becomes Ho: cp == Pcz) == 0 ' the case considered earlier. which is the likelihood ratio test. the statistic n(I 1 is distributed as W.. ~ 11s ofHE1 . fo)'(C(Z'Zf 1 C'f1(C/J f 0) Under the null hypothesis._q(I) independently of I. where s = min(p. the eigenvalues and vectors may lead to an interpretation. sum of squares and cross products matrix E = ni I) that results from fitting the full model.n(I1 I)= .1] Roy'sgreatestroot = . 11I I = 0. To connect with their output. sum of squares and crossproducts matrix H = n(II  The statistics can be defined in terms of E and H directly... This distribution theory can be employed to develop a test of H 0 : C{J = r 0 similar to the test discussed in Result 7. or in terms of the nonzero eigenvalues 111 ~ 112 ~ . lJi = tr[HE.398 Chapter 7 Multivariate Linear Regression Models c = [o i (rq)X(rq) I ] and r 0 = 0. Roy's greatest root. we report Wilks' lambda. or residual. Charts and tables of critical values are available for Roy's test.1li = tr[H(H 1 1li + Er IJ 2:. It can be shown that the extra sum of squares and cross products generated by the hypothesis H0 is . (See [21] and [17]. In this text. E be the p x p error. and the HotellingLawley trace test are nearly equivalent for large sample sizes. If there is a large discrepancy in the reported Pvalues for the four tests. Popular computerpackage programs routinely calculate four multivariate test statistics. + . we introduce some alternative notation.) I) Other Multivariate Test Statistics Tests other than the likelihood ratio test have been proposed for testing H 0 : /Jr 2 > = 0 in the multivariate multiple regression model. . Simulation studies suggest that its power will be best when there is only one large eigenvalue.
z ) P'zo. with normal errors E.))Fm nrm(a)] nrm · (742) The 100( 1 .a)% simultaneous confidence intervals for E(Y.nrm( a) is the upper (lOOa)th percentile of an £distribution with m and n.2. it can be employed for predictive purposes. From this result. we determine that and fJ'z 0 is distributed as Nm(fJ'z 0 .P'zo)' ( n nr.m (743) .z 1 n .nrm(a) 'J/(1 n _ _ + z'o(Z'Z)1zo) ( n _ r _ n 1 lr.rnrm 1)) p Fm nrm(a) · ) zo(Z'Z).. One problem is to predict the mean responses corresponding to fixed values zo of the predictor variables.1 $ i)I (Yo.) ni is independently distributed as Wnr. The second prediction problem is concerned with forecasting new responses Yo= f3'z 0 + e 0 at z 0 • Here e0 is independent of E.. z0fJ(i) ± )(m(n.1 i = 1.Multivariate Multiple Regression 399 Predictions from Multivariate Multiple Regressions Suppose the model Y = Z/3 + E. So. is the ith diagonal element of i.1 (I) The unknown value of the regression function at z 0 is /3' z0 .10. The 100(1 . ) .a)% confidence ellipsoid for /3' z 0 is provided by the inequality (740) where Fm. i=1.) = zofJ(I) are . we can write T = P'zo. nr. .m (741) ( 0 where {J(i) is the ith column of and lr. Inferences about the mean responses can be made using the distribution theory in Result 7.P'zo = (/3 .a)% prediction ellipsoid for Yo becomes (Yo.f.2. so the 100(1 . .md..r.P'zo) 1 (1 + zo(Z'Z)1zo) [(m(n.2. .fJ'zo 1 0 (739) and the 100(1 .fJ'zo 2 (Yz (Z'Z) z )' ( 0 1 0 n n r  1I ·)I (Yz6(Z'Z).) independently of ni:. ) 17. Yo.. If the model is adequate.z 0 (Z'Z)1z0 I.a)% simultaneous prediction intervals for the individual responses Yo. from the discussion of the T 2 statistic in Section 5.r.. are • zofJ(i) ± 'Jf(m(nr rm1)) Fm.. has been fit and checked for any inadequacies. Now.. ./J)'zo + Eo is distributed as Nm(O. (1 + z0 (Z'Z)1z0 )I. .
with s = 1. [14. P(z) = P(!) = [8. r = 2. Thus. Example 7. 328.151.25. This ellipse is centered at (151. va! ues.97] [zofJ(IJ.97.05) ] . Its orientation and the lengths of the major and minor axes can be determined from the eigenvalues and eigenvectors of Comparing (740) and (742).8.a = [151.400 Chapter 7 Multivariate Linear Regression Models where PuJ• CT. 2. . we see that the only change required for the calculation of the 95% prediction ellipse is to replace z0 (Z'Zr 1zo = ..349. zoPcl) = 151. we see that the prediction intervals for the actual values of. and Fm.34725 1 We find that zoPcz> = 14.3(.5) = 349.10 (Constructing a confidence ellipse and a prediction ellipse for bivariateresponses) A second response variable was measured for the computerrequiremenr: problem discussed in Example 7.349. 42].2.25(13o) + 5. c rom (740).4. 362.5. 307. 396.13 :s.(z) · IS. Com.97. The extra width reflects the presence of the random error eo. the response variables are wider than the corresponding intervals for the expected'..4. .nrm(a) are the same quantities appearing in (741). a• and m = 2.17 and Since P'zo = [~i:}_]zo [zo~~~~] f1(2) a• = zo.05) = 9. Measurements on the response Y2 ..67].1.3o]1 [zofJ(IJ . disk.17 5. and zh(Z'Zr zo = .55.34725 with ni:.25z 1 + 5. 130.3 with F2 .67(7. . input/output capacity. paring (741) and (743).5].08. From Example 7.z 0fJc 2 J.. 1.17](4) [ · zofJ(z) ..97] 349.6.14 + 2.1] Obtain the 95% confidence ellipse for fJ' z0 and the 95% prediction ellipse ·for Y 0 = [Yo~> Yoz] for a site with the configuration z0 = [1.34725) [(2(4)) F2..30 13. corresponding to the z1 and z2 values in that example were j Y2 = [301. 229.67z2 ' .14 + 2..17 n = 7.97.3(. 5. 7.14.349.151. a 95% confidence ellipse for p zo = [zofJcl)J 'Uzo.. Computer calculations provide the fitted equation )2 = 14.6.(z) .17).812. the set 5 80 5. ' (. 369.42.
Suppose all the variables Y. Z 1 . ••• .. The regression model that we have considered treats Y as a random variable whose mean depends upon fixed values of the z. .5... . 6 Consider the problem of predicting Y using the 6 If l:zz is not of full rank. The linear regression model also arises in a different setting. uy z.8 The Concept of Linear Regression The classical linear regression model is concerned with the association between a single dependent variable Yanda collection of predictor variables z1 ..L normal. /3 1 .'s. Z may be replaced by any subset of components whose nonsingular covariance matrix has the same rank as l:zz. . ••• . Partitioning J.34725. the 95% prediction ellipse for Yo= [Yo~> Yoz] is also centered at (151.L and covariance matrix (r+I)XI (r+I)X(r+l) and I in an obvious fashion. Both ellipses are sketched in Figure 7. Zt<:an be written lis a linear combination of the other z..97. one variablefor example.349. Thus.The Concept of Linear Regression Response 2 40 I dPrediction ellipse ~onfidence 0 ellipse Figure 7. . • 7. we write lTyy ! uzy] (lXI) i (IXr) and I= with rt~~T~~~· (744) (745) ui: y == [uy z1 .5 95% confidence and prediction ellipses for the computer data with two responses.. z2 .l Izz can be taken to have full rank. uy z2 . z. not necessarily I . . Z 2 . with mean vector J. /3. Z. but is larger than the confidence ellipse. It is the prediction ellipse that is relevant to the determination of computer requirements for a particular site with the given z0 .17).s and thus is redundant in forming the linear regression function z· fl. are random and have a joint distribution. This mean is assumed to be a linear function of the regression coefficients /3 0 . 1 + z0 (Z'Z)1z 0 = 1. That is.
uirii~(Z.Lr.Ly + Uzrizlz(Z .z) is the linear predictor having maximum correlation with Y. (b'uzrf 2 [Corr (bo + b Z.l.Ly .b'pz)] 2 E(Y .b0 .b'J.J.402 Chapter 7 Multivariate Linear Regression Models For a given predictor of the form of (745).. Y) = b'uzy so _ .z = /3 0 to make the third term zero. Y) = Cov(b'Z.Z. {3 0 + f.b0  b'Z (746) " Because this error is random.Lr . Corr(Y.J.bo .bo..uyy(b'Izzb) .z) )2 + (J. the error in the prediction of Y is prediction error= Y  bo  b1Z1 . it is customary to select b0 and b to minimize the mean square error= E(Y . we get Proof.Ly. ·.b'Z) 2 (747) Now the mean square error depends on the joint distribution of Y and z only through the parameters p.pz)) 2 = uyy .(b'Z.Lz) + (J. Writing bo + b'Z = E(Y bo + b'Z + (J.l'Z) = max Corr(Y. It is possible to express the "optimal" linear predictor in terms of these latter quantities..p.b'Z) 2 = = E[Y. b'Z) = urr...b' 11z? + (b . that is. Result T. b0 + b'Z) = J bo. we obtain + (J.b'pz) 2  2b'uzy Adding and subtracting 2 Uz y Ii~uz y. Its mean square error is E(Y  /3 0  f.pz)(Y.b0 .b {J'Izzfl = 17yy uhii~uzy uyy (J. The linear predictor {3 0 + {J'Z with C?efficients f.b' pz).(Izlzuzy)'p.b' 11d .Ly .uzrii~uzr The mean square error is minimized by taking b =Izlzuzy = f.bo :. Next.b.l'Z) 2 = E(Y. and then choosing bo = J. f3o = J.bo . we obtain .uhii~uzy Also. for all b0 . we note that Cov(bo + b'Z.z has minimum mean square among all linear predictors of the response Y. Y)] .fJ'p.b'p.Ly.Ly.Ly. =Y .p. b Employing the extended CauchySchwartz inequality of (249) with B = Izz.2E(b'(Z.Ly.l'Z = J.J. making the last term zero. /30 + f. The minimum mean square error is thus uyy .uzyiz~uzy.Lr)] = uyy + b'Izzb + (J.z)  bo.l = Ii~uzy.Ld + E(b'(Z . and I.J.Ii~uzy )':Izz(b Ii~uzr) E(Y .J2.Ly.
ZI. Y)j2 :s u' :t1 u ZY uyy zz ZY with equality for b = l:i1zuzy = fl. I = uyy .4 = [ 2 1] = J. the multiple correlation coefficient is a positive square root.Z2. and (c) the multiple correlation coefficient.Phz))· First. 1] [ :: 2Z2 • The mean square error is 1 1 6 · ] [ ] = 10 . determine (a) the best linear predictor f3o + {3 1Z 1 + {3 2Z 2 . (b) its mean square error. is called the population coefficient of determination.6] [ 1 1] 1. so 0 :s PY(Z) :s 1.4 1 = [ .fl'p.3 = 7 1. the mean square error in using {3 0 + fl'Z to forecast Y is uyy .Uzyiz1zUzy = 10.[1. Phz). At the other extreme.6 . Example T.12.4 . its mean square error. The alternative expression for the maximum correlation follows from the equation uzyl:z1 zuzy = uzyfl = uzyl:z1 zl:zzfl = /l'l:zzfl. and the multiple correlation coefficient) Given the mean vector and covariance matrix of Y. Note that.LY.21[~] = 3  so the best linear predictor is {3 0 + fl' Z = 3 + Z 1 Uyy. The population coefficient of determination has an important interpretation.The Concept of Linear Regression 403 or [Corr(bo + b'Z. From Result 7. verify that the mean square error equals uyy(1 . Also. Phz) = 1 implies that Y can be predicted with no error. fl = IzzUzy = f3o 1 [7 3]l [ 3 2 = 1] . • The correlation between Y and its best linear predictor is called the population multiple correlation coefficient PY(Z) = + (748) The square of the population multiple correlation coefficient. there is no predictive power in Z.uzyizzUzy .uyy (uzvizlzuzv) uyy = uyy(1  2 PY(Z)) (749) If Phz) = 0.ll (Determining the best linear predictor.z 5 [1. unlike other correlation coefficients.
.__Phz)) = 10(1 fa)= 7 is the mean square error. Result 7. z2. Specifically. .fl'Z  . . Let [L = [~] ' and S = [~.ILz) = f3o + fl'z (751) and we conclude that E(Y /zh z2. E(YizJ. It is possible to show (see Exercise 7. .404 Chapter 7 Multivariate Linear Regression Models and the multiple correlation coefficient is PY(Z) = uh:Iziuzy Uyy = [3 = _ 548 vw Note that uyy(1 . predicts Y with the smallest mean square error. :I).Z2.Lv z~..6) is + uzv:Iii(Z. z.+J(IL. Fortunately. The restriction to linear predictors is closely connected to the assumption of normality. . the regression function E(Y I ZJ.5) that 1PY(Z) 2 ·(750) . . z. That is. z2.). for a random sample of size n from this population..Ly + Uzyizi(z. z. . to be distributed as N.ILz). For normal populations.. respectively.···•Z. =pyy 1 where pYY is the upperlefthand corner of the inverse of the correlation matrix determined from :I.uyy. whatever its form. The conditional expectation of Yin (751) is called the regression function.. fixed (see Result 4.. it can be shown (see [22]) that E(Y IZJ.. this wider optimality among all estimators is possessed by the linear predictor when the population is normal. Then the maximum likelihood estimators of the coefficients in the linear predictor are flo = Y .uzy:Iziuzy) The mean of this conditional distri!Jution is the linear predictor in Result 7. .. z2.. Suppose the joint distribution of Yand Z is N. :I). :I) [ then the conditional distribution of Y with iv(J.+ 1(.H~~J 1· be the sample mean vector and sample covariance matrix. .. z.12. When the population is not normal.) need not be of the form /3 0 + fl'z. if we take 1 2 z:Y: ] .sZySzzZ = Y .+J(#L.. . it is linear.) is the best linear predictor of Y when the population is N.) = J. Nevertheless.13.
12 (Maximum likelihood estimate of the regression functionsingle response) For the computer data of Example 7.fl' Z ]2 is .12.Ly.~+t~~] = ( n: 1 )s • It is customary to change the divisor from n to n .] Since.11 and the in variance property of maximum likelihood estimators.szvSzlzszy) = nri=I n "' .f3o . Z 1 (orders). _1 uyy·z = . n . and mean square error = uyy·z = uyy . [See (420). /3o = J.uzy:Iz~uzy the conclusions follow upon substitution of the maximum likelihood estimators [L = for [i] and i = [~.. We use Result 4.( s y y .(:Iz~uzy)'ll. 2 /3o .SzySzzSzy) n Proof.. the maximum likelihood estimator of the linear regression function is ~o + P'z = 'Y + szvSi~(z  :Z) and the maximum likelihood estimator of the mean square errorE[ Y . and Z 2 (adddelete items) give the sample mean vector and sample covariance matrix: ~ ~ [i] ~ [ii~~.The Concept of Linear Regression 405 Consequently.!:!!~:!] ..1 . from Result 7.z = E(Y . in order to obtain the unbiased estimator 2: (lj ( _n__1_1) (syy.6.z.fl'Z) 2 .fl'Zi) 1 nr (752) Example 7.~ s ~ [~:j~:~] ~ [~!~:jj.( r + 1) in the estimator of the mean square error. uyy. the n = 7 observations on Y (CPU time).f3o .
35..006422] [418. zzSzy = (D (467.079] .)..···.983 .] = I'Y + Ivzii\.z..1..763] = [1.. obtain the estimated regression function and the estimated mean square error. .913. I) By Result 4. the conditional expectation of [ }[. of the predictor variables. is called the multivariate regression of the vector Y on Z. z. . Ym is almost immediate.44 .(z . . .Z2.. is E{YizJ.(J'z = 150. Y2.003128 . . .006422] [418. .08z1 + .086404 35.079.006422 . It is composed of m univariate regressions.. 1) ( . z.[418.~tz) (753) This conditional expected value. the first component of the conditional mean vector is J. For instance.. considered as a function of z1 . Suppose with l l (mXI) ~y (rXI) is distributed as Nm+r(lt.406 Chapter 7 Multivariate Linear Regression Models Assuming that Y.420 ' ' f3o = ji .983 = .086404 35.[1.~tz) = E(Y1 1 ZI> z2 ..42z2 The maximum likelihood estimate of the mean square error arising from the prediction of Y with this regression function is n ( .983] [ _:::~ syy Szy sI ) . which minimizes the mean square error for the prediction of Y1• Them X r matrix /3 = Ivzii1z is called the matrix of regression coefficients. given the fixed values z1.(z. Result 7. z.142.019 f3o + (J'z = 8.. . Yml'. z2.. }2.420] [13024 3547 = 8.894 Prediction of Several Variables • The extension of the previous results to the prediction of several responses }[. _. Z1. . We present this extension for normal populations.44 .Ly1 + Iv1zii\. and Z 2 are jointly normal.42 .13 gives the maximum likelihood estimates f1 • =sIs = [ zz_ zy . z2. .763. .421 and the estimated regression function J = 150.6.763]) .
Then theregression of the vector Y on Z is Po + fJz = I'Y. Po + fJz = I'Y + Ivzii~(z fJ = Ivzii~ ~tz) = Ivv·z = Iyy.SvzSi'zSzv) Proof. Suppose Y and Z are jointly distributed as Nm+r(lt.1 i=l I I I 1 n .Ivzii~Izv Iyy .I'Y. Result 7.r .fJZ)' = Ivvz = Iyy .Ivzii~(Z.Po. they must be estimated from a random sample in order to construct the multivariate linear predictor and determine expected prediction errors.fJZ)' I (755) .1) (Svv ·SvzSi zSzv) 2: (Y.Ivzii~Izv (754) Because I' and I are typically unknown./JZ)(Yn .The Concept of Linear Regression 407 The error of prediction vector Y. = [~~~+~~~] = ( 1) S= ( 1) [~~~t~~~] Izv i Izz Szv Szz n : n : j • It can be shown that an unbiased estimator of Ivv·z is 1 n( nr.Ivzii~l'z + Ivzii~z =#tv + Ivzii~(z .Po .~tz) has the expected squares and crossproducts matrix Ivv·z = E[Y.I'Y.Ivzii~(Ivz)'.Ivzii~l'z. . The regression function and the covariance matrix for the prediction errors follow from Result 4.Ivzii~(Z./Jizz/3' we deduce the maximum likelihood statements from the invariance property [see (420)] of maximum likelihood estimators upon substitution of I' = [~]. . I). Zj 1 j.14. Using the relationships Po = ltv .~tz)]' 1 = Iyy .Z) and the maximum likelihood estimator of Ivv·z is Ivv·z = ( n: 1 ) (Svv .Ivzii~(Z.Po . .6.I'Y .Ivzii1zizv Based on a random sample of size n.~tz)] [Y.fJZ) (Y . the maximum likelihood estimator of the regression function is Po + Pz = Y + SvzSi~(z. Po.~tz) The expected squares and crossproducts matrix for the errors is E(Y .Ivzii~Izv + Ivzii zizzii~(Ivz)' = Iyy.
913 1148.665(z2 ..408 Chapter 7 Multivariate Linear Regression Models Example 7.976 140558 x [ .043 1.034 13. the best predictor of }2 is The maximum likeiip. Y2 = disk 110 capacity.3.5561 418.006422] [z1 .547) 14.6 and 7.547)] 130.763 [ 1008.24) + 5.572 = .983 1.420(z2 .~?.536 418.3.42 + l.003128 .ood estimate of the expected squared errors and crossproducts matrix :Ivv·z is given by (n: 1 ) (Svv  SvzSi~Szv) 1148.130. [ 418.79 + 1008. 467.976 35. For Y1 = CPU time.97~!~0.893] 2.976]) 140.3.25z 1 + 5.79 + 2.983 140.44] [ 418. we find that the estimated regression function is Po + fJz = y + SyzSi~(z .205 7 6) [ 1.763 1008.44] [l.42z2 Similarly. ~h~ ' J [L =[I]= ij6~~[ 150.086404 Z2 .983] [ .z) = 150.5581 28.034 35.24) + .08z 1 + .547) Thus.420(z2 .893 2.042] [.24] . the minimum mean square error predictor of Yi is  130.67z2 = 8.763 140.079(z1  .006422 150.547 = [ 327.491 1008. and Z 2 = adddelete items.983l 3072.491 i ~008. ( = ( 6) ([ 467.042 .763 35.558 =7 .894 .556 LSzv! Szz .14 + 2.44 3_547 and s = 1_~~~l~~?.976/ 377100 28.13 (Maximum likelihood estimates of the regression functionstwo responses) We return to the computer data given in Examples 7.44 + 1.536] 3072.10.079( z 1 13024) + .763 35.254(z1 150.657 Assuming normality.006422] [418.086404 35.3.006422 .913 1148.558.003128 ..§.983] [ 327.] = ~8. Z 1 = orders.
YI> with smaller error than the second response.Zr leads to the prediction equations . are the same as those in Example 7. determined from the error covariance matrix :Ivv·z = :Ivv . The ~ik are estimates of the (i.. .Y>Z is the (i. . k)th entry in the matrix :Ivv·z = :Ivv .08z 1 + .by PY 1Y 2·z= . Y2 • The positive covariance .Zr. }2. is the same as that given in Example 7. ...:. the second estimated regression function. We define the partial correlation coefficient between Y 1 and }2.25z 1 + 5.42 + l. .... Zr are used to predict each Y.:Ivz:Iz1 corresponding sample partial cor:relation coefficient is (757) .. 1. ..1'z) l:y~l:z~(Z . Result 7.Zr. measures the association between Y1 and Y2 after eliminating the effects of ZI> Z2. 14. ZI> Z 2. ..67z 2 .14 + 2.10.. . We conclude this discussion of the regression problem by introducing one further correlation coefficient.k)th entry of the regression coefficient matrix fJ = :Ivz:Iz~ fori. . We see that the data enable us to predict the first response.~ VUy 1yl"z VUy 2YfZ (756) z:Izv· The where uY. Partial Correlation Coefficient Consider the pair of errors Y1  ILY1 ILY 2  :Iy 1z:Iz1 z(Z . z2. z 1 .Y. k . and the associated mean square error.The Concept of Linear Regression 409 The first estimated regression function.12 for the singleresponse case. . Similarly. The same values.:Ivz:Iz~:Izy.. .893 indicates that overprediction (underprediction) of CPU time tends to be accompanied by overpredic• tion (underprediction) of disk capacity. eliminating ZJ.~.894. Comment. Their correlation. 2.. .14 states that the assumption of a joint normal distribution for the whole collection }J..42z 2 . Z 2. 8.1'z) Y2  obtained from using the best linear predictors to predict Y1 and Y2.YI Yl Ym = f3oJ = f3o2 + + /311Z! f312Z1 + ··· + + · ·· + f3r!Zr f3r2Zr = ~Om + ~!mZ! + · · · + ~rmZr We note the following: 1.
2 and 7. Assuming that Y and z have a joint multivariate normal distribution..z.043 1. · + {3r(Zrj. we obtain ry 1y 1 = .042] 1. respectively.8with a set of variables that have a joint normal distribution. we presented the multiple regression models for one and several response variables.. Comparing the two correlation coefficients. The two approaches to multiple regression are related.410 Chapter 7 Multivariate Linear Regression Models with sv. k )th element of Svv .) + E:j (759) . Alternatively.7.(Zrj.96. we see that the association between Y 1 and 12 has been sharply reduced after eliminating the effects of the variables Z on both responses.. The process of conditioning on one subset of variables in order to predict values of the other set leads to a conditional expectation that is a multiple regression model.13.042 2. the predictor variables had fixed values zi at the jth trial.z. To show this relationship explicitly..v. (758) Calculating the ordinary correlation coefficient. 1) + {3 1 1 and we can write z z yj = (f3o + f3JZ! + .·z the (i. we introduce two minor variants of the regression model formulation. In these treatments.SvzSz~Szy.) + f3J(Z!j T" ZJ) + . For instance.14 (Calculating a partial correlation) From the computer data Example 7. • 7.9 Comparing the Two Formulations of the Regression Model In Sections 7. Syy . we can startas in Section 7. + {3.572 Therefore. + {3.Z.) + ej = /3• + f3J(ZJj. we find that the sample partial correlation coefficient in (757) is the maximum likelihood estimator of the partial correlation coefficient in (756).SyzSzzSzy 1  10 [1. the multiple regression model asserts that {3 1z 1i = {3 1(z 1i  The predictor variables can be "centered" by subtracting their means. Example 7.z!) + ·. Mean Corrected Form of the Regression Model For any response variable Y.
]' are unbiasedly estimated by (Z~ 2 Zc 2 rJz~ 2 y and /3• is estimated by y..j Zr Zni  Znr  L j=l n 1(Zji  Z. ••• . i = 1. {3.cr = 1 [ !T_ 2 n (762} Cov(IJJ 0 .BJ... Thus. . the linear predictor of Y can be written as (761) Cov(/3. best e~timates computed from the design matrix Z. their best estimates computed from the design matrix Zc are exactly the sa111e as the. {3 2 . Because the definitions /31> {3 2 . . ~~ z. . remain unchanged by the reparameterization in (759).1.. since l 1 1 Z11  z1 :I Z1 ··• Z1r Z2r Z21 ~ ·.2. {3. = (Z~ZJ. ••• ..Comparing the Two Formulations of the Regression Model 411 with /3• = {3 0 + {3 1 1 + · · · + {3. lh . we obtain so (760) That is... the regression coefficients [{3 1 . setting Zc = [11 Zc2] with Z~ 21 = 0.) = 0. setting /J~ = [. ].. The mean corrected design matrix corresponding to the reparameterization in (759) is z zc = ~ where the last r columns are each perpendicular to the first column. {3.r Further./JJ ] ' ' ..
. "standardized" input variables· (Zji  zJ / ~ 1~1 ± (zi. Comparing (761) 1 flc = /l since 7 A A shSz~ = y'ZdZ~zZd 1 (765) Therefore.1 S (Two approaches yield the same linear predictor) The computer data with the single response Y1 o: CPU time were analyzed in Example 7. .•. ~The least squares estimates of the beta coefficients become = ~.6 using the classical linear regression model. . r.t)'ZdZ~2zc2r' = (n.m (7{i3) Sometimes. /J. + P~(z with ~.z) A' ' o: !Jy + Uzy. = Y = f3o and A y and P~ = y'ZdZ~ 2Zd. y = 8. in the regression model are replaced by /3.<~ ' [ J i = l. Example 7.z. are jointly normal. Z. cient vectors for the ith response are given by fl(iJ A Y(i) = (z~zizf1 Z~z.:.. In this case. V( Relating the Formulations When the variables Y.412 Chapter 7 Multivariate Linear Regression Models Comment. both the normal theory conditional mean and the classical regression model approaches yield exactly the same linear predictors... A similar argument indicates that the best linear predictors of the responses in the two multivariate multiple regression setups are also exactly the same. Both approaches yielded the same predictor. slope coefficients {3. The multivariate multiple regression model yields the same mean corrected design matrix for each response. Z 1. the .yl)'Zc2 + 0' Consequently..ILz) A AT i1 A (764) where the estimation procedure leads naturally to the introduction of centered z.. yZdz. for even further numerical stability..)il)'Zc2 + yl'Zc2 = (y. The least squares estimates of the coeffj. we see that {3.'s.zz(Z .. 2.. the estimated predictor of y (see Result 7.) 2 =. ..= {3. .2zd·' = (y. The same data were analyzed again in Example 7. (zi..t)Szzt1 = szySz1 z .yl) + yl so that (y . are used. = A i) and (764).12. /i.yl)'Zc2 y'Zc2 = (y .l)szy[(n . These relationships hold for each response in the multivariate multi~i~ regression situation as well.. assuming that the variables }J.!)s.08zt + .13) is {3 0 A 1 _ + /l z = y + SzySzz(z .42zz 7 • '= The identify in (765) is established by writing y ~ (y . Recall from the mean corrected form of the regression model that the best linear predictor of Y [see (761)] is y = ~. n . ..1) s ' ' i = 1.)/V(n.42 + 1. Z1 . z.Z... ~ •. and Z 2 were jointly normal so that the best predictor of Y1 is the conditional mean of}! given z1 and z2 .
This issue is important so. Similarly. More gas is required on cold days. inferences in regression can be misleading when regression models are fit to time ordered data and the standard regression assumptions are used. the errors. rather than merely among linear predictors. but also how to incorporate this dependence into the multiple regression model. particularly during the coldest days of the year. but they yield an optimal predictor among all choices. In the conditional mean model of (751) or (753).10 Multiple Regression Models with Time Dependent Errors For data collected over time. We close by noting that the multivariate regression calculations in either case can be couched in terms of the sample mean vectors y and z and the sample sums of squares and crossproducts: This is the only information necessary to compute the estimated regression coefficients and their estimated covariances. As indicated in our discussion of dependence in Section 5.16 (Incorporating time dependent errors in a regression model) Power companies must have enough natural gas to heat all of their customers' homes and businesses. the values of the input variables are assumed to be set by the experimenter. Of course. equivalently. which must be calculated using all the original data. the observations on the dependent variable or. it is customary to nse degree heating days . we not only show how to detect the presence of time dependence. conceptually they are quite different. time dependence in the observations can invalidate inferences made using the usual independence assumption. an important part of regression analysis is model checking. For the model in (73) or (723). observations in different time periods are often related. like temperature. that clearly have some relationship to the amount of gas consumed. in the example that follows.8. The assumptions underlying the second approach are more stringent. cannot be independent. 7.Multiple Regression Models with Time Dependent Errors 413 Although the two formulations of the linear prediction problem yield the same predictor equations. Consequently. This requires the residuals (errors). A major component of the planning process is a forecasting exercise based on a model relating the sendouts of natural gas to factors. the values of the predictor variables are random variables that are observed along with the values of the response variable(s). in a regression context. Rather than use the daily average temperature. Example 7. or autocorrelated.
com/statistics for the complete data set.daily average temperature.= . we decided to include not only the current DHD but also that of the previous day. the lag 1 autocorrelation. A large number for DHD indicates a cold day.4 Natural Gas Data y Sendout 227 236 228 252 238 : DHD 32 31 30 34 28 46 33 38 52 57 zl DHDLag 30 32 31 30 34 : z2 Winds peed 12 8 8 8 12 8 8 18 22 18 z3 Weekend 1 1 \ z4 0 0 0 0 0 0 0 0 333 266 280 386 415 41 46 33 38 52 Initially.4.) Table 7. can also be a factor in the sendout amount.414 Chapter 7 Multivariate Linear Regression Models (DHD) = 65 deg . (See website: www. are significant and it looks like we have a very good fit. (The intercept term could be dropped.52 j~l j~2 :L n SjSj! :L 8y . again a 24hour average.) However.4. There are n = 63 observations. Other variables likely to have some affect on natural gas consumption.15. the demand for natural gas is typically less on a weekend day. Wind speed.315 Windspeed . the results do not change substantially. if we calculate the correlation of the residuals that are adjacent in time.prenhall.857 Weekend with R 2 = . are subsumed in the error term.) The fitted model is Sendout = 1.874 DHD + .952.405DHDLag + 1. like percent cloud cover. wind speed· and a weekend dummy variable.1. we get lag 1 autocorrelation = TJ (e) = 'n. we developed a regression model relating gas sendout to degree heating days. in Table 7. All the coefficients. with the exception of the intercept. (The degree heating day lagged one time period is denoted by DHDLag in Table 7. After several attempted fits. Data on these variables for one winter in a major northern city are shown. Because many businesses close for the weekend. in part. When this is done.858 + 5.
Each is within two estimated standard errors of 0. Thus.Multiple Regression Models with Time Dependent Errors 415 The value.109 Weekend and the time dependence in the noise terms is estimated by The variance of sis estimated to be G2 = 228. A plot of the residual autocorrelations for the first 15 lags shows that there might also be some dependence among the errors 7 time periods. of the lag 1 autocorrelation is too large to be ignored. we formulate a regression model for the Nj where we relate each~ to its previous value Njl. The noise is now adequately modeled. Also. We will not pursue prediction here. That is.426 DHDLag + 1. we consider where the sj are independent normal random variables with mean 0 and variance cr 2• The form of the equation for Nj is known as an autoregressive model. This amount of dependence invalidates the ttests and ?values associated with the coefficients in the model. we see that the autocorrelations of the residuals from the enriched model are all negligible. and an independent error sj. The tests are those described in the computer output. . since it involves ideas beyond the scope of this book. and so forth. there is very little change in the resulting model. The fitted model is Sendout = 2. 112.10.810 DHD + 1. 8 The intercept term in the final model can be dropped. and 124.118. That is.89. the significance of the regression. its value one week ago. (See [8).207 Windspeed . From Panel 7. a weighted sum of squares of residual autocorrelations for a group of consecutive lags is not large as judged by the ?value for this statistic. . The enriched model has better forecasting potential and can now be used to forecast sendout of natural gas for given values of the predictor variables.3 are those for lags 16. there is no reason to reject the hypothesis that a group of consecutive autocorrelations are simultaneously equal to 0. Nj_ 7 . (See [8).3 on page 416. apart. are now valid.130 + 5. or one week.52. When this is done. The groups examined in Panel 7.3. The tests concerning the coefficient of each predictor variable.) • "These tests are obtained by the extra sum of squares procedure but applied to the regression plus autoregressive noise model. The first step toward correcting the model is to replace the presumed independent errors in the regression model for sendout with a possibly dependent series of noise terms Nj.) The SAS commands and part of the output from fitting this combined regression model for sendout with an autoregressive model for the noise are shown in Panel 7.
416 Chapter 7 Multivariate Linear Regression Models When modeling relationships using time ordered data.70 1. ~.067 0.72 2.27 15.111 {).61770069 228.127 {).24932 0.1292441 528.92 23.144 0.Stt~·i_ (continues on next page) . 1 AR1.11779 0.12957 0. estimate p =(1 7) method = ml input = ( dhd dhdlag wind xweekend ) plot.03445 OUTPUT Parameter MU AR1.137 0.056 {).80976 1. Modern soft. a. PANEL 7.12340 0.170 {).16 3.44681 6.23986 5. infile 'T74.079 0.1_96 0.051 Prob.3 SAS ANALYSIS FOR EXAMPLE 7.42632 1.99 2. allow the analyst to easily fit these expanded models. Std Error 13.04 10.161 {). regression models with noise structures that allow for the time dependence are often useful.68 Lag 0 1 7 0 0 0 0 Variable OBSEND OBSEND OBSEND DHD DHDLAG WIND XWEEKEND Shift 0 0 0 0 0 0 0 I variance Estimate 15.41i<:.192 {).D18 0.16 5. ware packages.11528 0.24047 0.492264 SBC 63 Number of Residuals = Std Error Estimate AIC Autocorrelation Check of Residuals To Lag 6 12 18 24 Chi Square 6.490321 543.10890 0.dat'. like SAS. input obsend dhd dhdlag wind xweekend.056 {).004 0.069 0.013 O. 2 NUM1 NUM2 NUM3 NUM4 Constant Estimate Estimat~ 2.080 {).20740 10.08 24.250 0.079 {).012 {).022 {).16 USING PROC ARIMA data a. time =_n_. proc arima data =a. estimate p = (1 7) noconstant method ml input dhd dhdlag wind xweekend ) plot.8940281 T Ratio 0.D18 {).47008 0.108 O.106 0. identify var = obsend crosscor dhd dhdlag wind xweekend ).44 DF 4 10 16 22  Autocorrelations 0. PROGRAM COMMANDS =( = =( ARIMA Procedure Maximum Likelihood Estimation Approx.
894 18.10551 {)." marks two standard errors .825623 2.379057 12.777280 24.13721 1 9 8 7 6 5 4 3 2 0 1 2 3 4 5 6 7 8 9 1 !*******************~ I I I** I I I I**** . I ***I I I*** I I*** I *I I **I I *I I *I I **I I I I** I ***I ".02201 0.14421 {).008858 15.424015 25.3 (continued) Autocorrelation Plot of Residuals Lag Covariance Correlation 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 228.12722 0.118892 36.890888 12.038727 44.407314 1.01207 0.01298 0.19249 {).05582 {).10846 0.Multiple Regression Models with Time Dependent Errors 417 pANEL 7.970197 24.150168 31.194945 2.06738 {).059835 29.00000 0.05632 {).904291 33.11088 {).763255 5.07949 0.16123 0.
~ ']c. ·~ .: RATIO FOR THE MULTIVARIATE MULTIPLE REGRESSION MODEL The development in this supplement establishes Result 7J l. Set P =[I... it follows that 1 Ae = zl(Z!Z 1 f Zie 1 = (Z 1(Z]Z 1f 418 1 Zi) e = A(Z 1(ZiZif Zi)e 2 1 = A2e . we have gl.gq+3····.~ . and under H0 . . ni1 = Y'[I. Thus..Z(Z'Zf1Z'].Z(Z'Zf Z'][ZI i Zz] = [PZ1 i PZ 2 ) the columns of Z are perpendicular to P.1 vectors orthogonal to the previous vectors. Z 2}. . Consequently.Supplement ~. 1 1 Since 0 = (I.~ .~ il THE DISTRIBUTION OF THE LIKELIHOOD.Z 1(Z]Z 1 f Z). gq+ d = G from the columns of Z 1 • Then we continue..][Z 1 (Z]Z 1f Z\] = Z 1(Z!Zd. e) be an eigenvalueeigenvector pair of Z 1(ZIZ 1f Zj.gr+l• gr+2• gr+3• · ·' • gn ~ ~ 1 from columns from columns of Z 2 arbitrary set of ofZ 1 but perpendicular orthonormal to columns of Z 1 vectors orthogonal to columns of Z Let (A.gq+l> gq+2. and finally complete the set to n dimensions by constructing an arbitrary orthonormal set of n . . obtaining the orthonormal setfrom [G. since 1 1 1 [Z!(Z}Zlf Z.Z 1(Z!Z 1 f Zi]Y with Y = ZdJ(I) + t:. we can write We know that 1 ni = Y'(l.Zt.r .Z(Z'Zf1Z')Y ni = (Z{J + e)'P(Z/3 +e) = t:'Pt: ni 1 = (Z 1{J(l) + €)'P1(Z 1/3(l) + t:) = E:'P1£ where P 1 = I . .:I ·._. We then use the GramSchmidt process (see Result 2A.gz.3) to construct the orthonormal vectors [gh g2. Then..Z(Z'Zf Z']Z = (1.
(n r. Consequently.1(1: ). and it improves the .2. Moreover.r.1. .The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model 419 1 1 and the eigenvalues of Z 1(ZjZ 1f Z! are 0 or 1.r . . . because Cov(Vr. Z'gc = 0. are eigenvectors of P with eigenvalues A = 0. the E'gc = Ve = [Vn. q + 1.~ (m . The use of chisquare approximation.Z(Z'Zf Z']Z == Z. In the same manner..r + q + 1)) of n in the statistic is due to Bartlett [4] following Box [7). by (422). Ve. Consequently.Zi = geg{.v0 = m(m + 1)/2 + m(r + 1) .P)E = L r+l (E'ge) (E'ge)' = L VeVe e~q+2 C=q+2 where the Ve are independently distributed as Nm(O. so that Z (Z'Zf Z' = 1 r+l C=l L gcg{. tr (Z 1 (Z'1Z 1f Zl) 1 = tr((Zjzlr ZjZI) = tr( I ) == q + 1 =AI+ A2 + .i) involves a different set of independent Vc's. + Aq+l• where (q+I)X(q+l) 1 A1 ~ A ~ · · · ~ Aq+l > 0 are the eigenvalues of Z 1(ZjZ 1f Zj. e j.Z = 0 so ge = Zbe. The large sample distribution for[ n . are therefore eigen1 vectors of Z 1(ZiZ 1r zj.~(m. l: ). By the spectral decomposition (216). .i) is distributed as wp. so that Pge = ge. is an eigenvector of Z (Z'Zf Z' L with eigenvalue A= 1. By the spectraldecomposition(216).kgfgj == 0. r + 1... since they are formed by taking particular linear combinations of the columns of Z 1. Veml' are independently distributed as Nm(O.... The orthonormal vectors ge. (Z 1 (ZiZ 1f Zl) Z1 == Zl> so any linear combination Z 1be of unit length is an eigenvector corresponding to the eigenvalue 1. r + 1. ni is distributed as Wp.. e = 1.l:) = E'(PI . 2..rq(l:) independently of ni. . we readily see f=l 1 that the linear combination Zbc = ge. we have PZ =[I.q) d.r + q + 1) ]ln (I i 1/1 ill) instead follows from Result 5. Similarly. since n(il . es e> 1 Continuing.. n(i 1 .1 . we have q+l 1 1 Z 1(ZiZJ). .nr. were constructed. ge P1gc = { o so P 1 = * e> es q + 1 q +1 f=q+2 A L n gcg(.P = f=r+2 " L geg{and (E'gc)(E'gc)' = ni = E'PE = L t=r+2 " L C=r+2 n VeVe where. Also. We can write the extra sum of squares and cross products as A r+l n(l:l . By (422).f. with v .m(m + 1)/2 m(q + 1) = m(r. for example.1 unit eigenvalues. from the way the ge. Now. This shows that 2 1 1 Z 1(Z]Z1f Zi has q + 1 eigenvalues equal to 1. these ge's are eigenvectors of P corresponding to then . l:). by writing (Z (Z'Zf Z') Z == Z. Vfk) = E(g(E(i)E(k)gj) = O'..
j = 1. Use the weighted least squares estimator in Exercise 7. 1) position of p1 is pYY = 17YYuyy.2.e errors influence the optimal choice of f3w.) Let Y (nXl) = Z {3 (nx(r+l)) ((r+l)xl) + E (nXl) where E( e) = 0 but E(ee') = 17 2 V. Comment on tQ.Z*'Y*.1/prr. Hint: From (749) and Exercise 4.n.112 f = V 112 I.11 7. unbiasedly. . . to the standardized form (see page 412) of the variables y.5. with E(e*) = 0 and E(e*e*') = 17 21.uzri:Z~uzr) lizzi uyy 1 IIzz luyy From Result 2A. Thus. Given the data Zt Z2 10 2 15 5 3 7 3 3 19 6 25 11 7 18 9 13 y 9 7 fit the regression model Yi = f3zz11 + f32z12 + ei.112 I v.uYY = IIzz 1/1 I 1.2.1(Y.. 2.ZPw)· Hint: v. .. (b) Var(ej) = 172Zi. and (c) Var(ei) = 172zJ..420 Chapter 7 Multivariate Linear Regression Models Exercises 7.. the fitted values y.lf 1 X (Y. For V of full rank..e manner in which the unequal variances for tb. the entry in the (1.112 y = (vIi 2Z)/3 + vl/ 2£ is of the classical linear regression form Y* = 1 Z* fl + e*. j = 1. . Since (see Exercise 2.r. 7. deduce the corresponding fitted regression equation for the original (not standardized) variables. 6. and the residual sum of squares. Establish (750): Phz) = 1 .3 to derive an expression for the estimate of the slope f3 in the model Yi = f3zi + ei. • calculate the least squares estimates {3. (Weighted least squares estimators.8(c).23) p = v. z 1 . . Specifically. Pw p* 1. it may be estimated. 1 PY(Z) = 2 uyy  u2:riz~uzr l7yy lizzi (17rr. the residuals £. with V(n X n) known and positive definite.3. when (a) Var (ei) = u 2 .ZPw)'V. Given the data Zt I 10 5 7 3 19 25 11 7 8 13 9 fit the linear regression model ~ =)30 + f3 1z11 + ei. 7. e' E.1. and z2 • From this fit. wberei7YY is theentry~fr in the first row and 1 first column.1V 112 .1 = (V. j = 1. show that the weighted least squares estimator is Pw = (Z'V1Zt1Z'V1Y If u 2 is unknown.1/2I vl/2 and p..4.6. . by (n. . = = (Z*Z* ).2.
q./l(2J)'[Z2Z2. (c) Show that z[J = Z (Z'Z)Z'y is the projection of yon the columns of Z.ej ) Z' ~ ( = (r+l z=l ~ e.ZIZ2] (P( 2J. and that ~ hii = r + 1./l(2))/u is N(O.I). If the parameters fJ(2) are identified beforehand as being of primary interest.. with rank (Z) zl /lp) + y = (nxl) (nx(q+l)) ((q+I)XI) (nX(rq)) = r + 1. Summing over i gives zp zp (Z'Z)(Z'Z)z' = Z'Z ( ~ Aj 1e. . ~:~] 2 Multiply by the squareroot matrix (C 22 t 112. (See Footnote 2 in this chapter.) (d) Show directly that [J = (Z'Z)Z'y is a solution to the normal equations (Z'Z) [(Z'Z)Z'y] = Z'y.Exercises 42 I 7.so that 1 (P(2). fori> r1 + l. and conclude that (C 22 t 11\/Jc 2J. (In fact.. (Z'Z) (Ai 1e. (Generalized inverse ofZ'Z) A matrix (Z'Zt is called a generalized inverse of Z'Z if Z'Z (Z'ZtZ'Z = Z'Z. show that a 100(1 .ej i=l is a generalized inverse of Z'Z.Z2Z 1(Z[Z 1t Z[Z 2 1 r 1. (b) The coefficients [J that minimize the sum of squared errors (y. then y is perpendicular to the columns of Z. where r is the j=l number of independent variables in the regression model. where (Z'Zt = [ 1 ~~.Z/l) satisfy the normal equations (Z'Z)P = Z'y.ej Z' r 1+1 ) = ~+I e.(Pc2J.) .Z/l)'(y . Suppose the classical regression model is. e 2.1 and (76). hi i < 1. written as ~ ((rq)xl) fl(2) + e (nxJ) where rank(Z 1) = q + 1 and rank(Z 2) = r . Recall that the hat matrix is defined by H = Z (Z'Zt Z' with diagonal elements hn· (a) Show that His an idempotent matrix. Show that these equations are satisfied for any [J such that z[J is the projection of yon the columns of z.2. . fori ::5 r1 + 1 and 0 = ei(Z'Z)e. C 22 = [Z2Z2 . Hint: (b) If is the projection.e[ Z' ) = IZ' = Z' z=l 1. 1. s2(r q)F.ejZ'. e.)e[Z'= e. Let r1 + 1 = rank(Z) and suppose A1 ~ A2 ~ · · · ~ A.8.n..q. Therefore.6.Z2ZI(ZIZI). since ejZ' = 0 fori > r 1 + 1.+l > 0 are the nonzero eigenvalues of Z'Z with corresponding eigenvectors e 1 . [See Result 7.a)% confidence region for fl(2) is given by (p(2). ( 1/ n) s. . j = 1.fl(2)) is~:lrq· 7.) = e.12.. with 1 'sand 2's interchanged. .) n 1 (b) Show that 0 < hii < 1.13(2))'(C 22 ).. (d) The eigenvalueeigenvector requirement implies that (Z'Z) (Aj1e.+l· (a) Show that (Z'Z)/ = r 1+1 ~ Aj 1e.nrt(a) Hint: By Exercise 4../lc2J) 1 s.
calculate each of the following.) Let A be a positive )'A(y1 .B'z1 ) distance from the jth observation Y. for the simple linear regression model with one independent variable the leverage. Using the results from Exercise 7. to its regression B'z1. .5 7. and Zz.4. (a) A 95% confidence interval for the mean response E(Yo 1) = /3 01 + {3 11 z01 corresponding to z01 = 0.422 Chapter 7 Multivariate Linear Regression Models (c) Verify. Show that the choice B = {J = (Z'Zf 1Z'Y minimizes the sum of squared statistical distances. 7. for any choice of positive definite A.5 (b) A 95% prediction interval for the response Yo 1 corresponding to z01 = 0.1 0. (Generalized least squares for multivariate multiple regression.9.·z. Hint: Repeat the steps in the proof of Result 7. Choices for A include I. I 2. I.3.5 (c) A 95% prediction region for the responses Y01 and Yo 2 corresponding to z01 = 0. that 7.B'z1 is a squared statistical definite matrix. calculate the matrices of fitted values and residuals Verify the sum of squares and crossproducts decomposition Y e with Y = [y 1 ! J'2]. Consider the following data on one predictor variable z1 and two responses Y1 and y 2 1 0 ·1 2 : 2 5 3 3 1 4 1 2 2 1 3 Determine the least squares estimates of the parameters in the bivariate straightline regression model lft }/2 = /3ot + /311ZJI + EfJ f3o2 + f312Zji = + Ej2> j = 1.10 with I1 replaced by A. 7. determine each of the following.2. . Z 1 .5 Also.9.1 and ~I ± d](B). is given by z. so that dJ(B) = (y1 .11. (a) The best linear predictor /3o + f3tZI + /32Zz of Y (b) The mean square error of the best linear predictor (c) The population multiple correlation coefficient (d) The partial correlation coefficient pyz. h11 . Given the mean vector and covariance matrix of Y.
(b) Calculate the partial correlation coefficient ryz 1. Does one (or more) of these companies stand out as an outlier in the set of independent variable data points? (c) Generate a 95% prediction interval for profits corresponding to sales of 100 (billions of dollars) and assets of 500 (billions of dollars). 1\ventyfive portfolio managers were evaluated in terms of their performance. 7.. Calculate a CP plot corresponding to the possible linear regressions involving the realestate data in Table 7. 7. Use the realestate data in Table 7.16.zfz. (c) Determine the estimated partial correlation coefficient Rz. 7. The following exercises may require the use of a computer.35 1.74] 54.) (c) Generate a 95% prediction interval for the selling price (Yo) corresponding to total dwelling size z1 = 17 and assessed value z2 = 46.4. Compute the leverages associated with the data points.05.6.35 [ . · . (a) Verify the results in Example 7.13 s= 5691.17.0 R = . Consider the Forbes data in Exercise 1.1. Should the original model be modified? Discuss.82] .69 . ..60 1. Z 1 is the manager's attitude toward risk measured on a fivepoint scale from "very conservative" to "very risky.37 23.Exercises 423 7. The observed correlation coefficients bet ween pairs of variables are y zl .34 600. (b) Analyze the residuals to check the adequacy of the model.05 23. (d) Carry out a likelihood ratio test of H 0 : {3 2 = 0 with a significance level of a = . (b) Analyze the residuals to check the adequacy of the model.4. Suppose Y represents the rate of return achieved over a period of time.(z 2. 7.1 S.13.51 [ 217.0 1.14.4.z2 and interpret this quantity with respect to the interpretation provided for ryz 1 in Part a. (d) Carry out a likelihood ratio test of H 0 : {3 2 = 0 with a significance level of a = .35 and ryz 2 = .z3 ). (a) Fita linear regression model to these data using profits as the dependent variable and sales and assets as the independent variables. Should the original model be modified? Discuss." and Z 2 is years of experience in the investment business.11 Assume joint normality.25 ] 126. (See Section 7.60 z2 . (b) Evaluate the estimated multiple correlation coefficient Rz. 25. The test scores for college students described in Example 5.82.05.82 (a) Interpret the sample correlation coefficients ryz 1 = .5 have i = z1 ] [ z2 = z3 [527. (a) Obtain the maximum likelihood estimates of the parameters for predicting Z1 from Z 2 and Z 3 .0 .1 and the linear regression model in Example 7.
000 1.625 1.625 1.13 3.625 .8 60. Failure of SilverZinc Cells with Competing Failure ModesPreliminary Data Analysis.99 2.8 43.00 y Cycles to failure "101 141 96 125 43 16 188 10 3 386 45 2 76 78 160 3 216 73 314 170  30 20 10 10 10 30 0 30 20 30 20 Source: Selected from S.25 1.375 1.375 1. life cycle. .2 60.13 3.01 1.375 .375 Z:z Discharge rate (amps) 3.000 1.000 1.19.00 1.01 2.625 1.0 43. and J.13 3. Sidik.99 2.2 100.00 5. 1980).00 1. Bozek.0 76. 4.13 3. Table 1.02 2.000 1.375 1.. contains failure data collected to characterize the performance of the battery during its.25 1. Calculate (a) a CP plot corresponding to the possible regressions involving the Forbes data in~ Exercise 1.13 3.01 2.0 60. " (b) the AIC for each possible regression. z5 .00 5. . Use these data.98 2.00 1.000 1. Satellite applications motivated the development of a silverzinc battery..19(a).8 60. Using the batteryfailure data in Table 7.25 1.8 76.8 43.0 Z:J z4 Temperature (OC) 40 30 20 20 10 20 20 10 10 Zs End of charge voltage (volts) 2.5 BatteryF~ilure Data zl Charge rate (amps) . Table 7.00 2.2 76. .00 1.00 5. H.2 43.625 .00 1.99 2.00 5.2 60.13 3. regress ln(Y) on the first principal component of the predictor variables z1 .8 60.625 .20.0 43.5. 1.13 5.000 1. (a) Find the estimated linear regression of In(Y) on an appropriate ("best") subset of predictor variables.000 1. (b) Plot the residuals from the fitted model chosen in Part a to check the normal" assumption.0 76.00 2.00 1.25 3.25 1.5.) Compare the result with the fitted model obtained in Exercise 7. 1.000 1.25 1.0 76. Leibecki.3.18.13 Depth of discharge (%of rated amperehours) 60. NASA Technical Memorandum 81556 (Cleveland: Lewis Research Center.424 Chapter 7 Multivariate Linear Regression Models 1. (See Section 8.99 2.000 1.0 76.99 2.0 60.99 2. z2 .13 3.01 1.
7.21. Using the data on bone mineral content in Table 1.8:
(a) Perform a regression analysis by fitting the response for the dominant radius bone to the measurements on the last four bones.
  (i) Suggest and fit appropriate linear regression models.
  (ii) Analyze the residuals.
(b) Perform a multivariate multiple regression analysis by fitting the responses from both radius bones.
  (i) Suggest and fit appropriate linear regression models.
  (ii) Analyze the residuals.

7.22. Consider the air-pollution data in Table 1.5. Let $Y_1 = \mathrm{NO}_2$ and $Y_2 = \mathrm{O}_3$ be the two responses (pollutants) corresponding to the predictor variables $Z_1$ = wind and $Z_2$ = solar radiation.
(a) Perform a regression analysis using only the first response $Y_1$.
  (i) Suggest and fit appropriate linear regression models.
  (ii) Analyze the residuals.
  (iii) Construct a 95% prediction interval for $\mathrm{NO}_2$ corresponding to $z_1 = 10$ and $z_2 = 80$.
(b) Perform a multivariate multiple regression analysis using both responses $Y_1$ and $Y_2$.
  (i) Suggest and fit appropriate linear regression models.
  (ii) Analyze the residuals.
  (iii) Construct a 95% prediction ellipse for both $\mathrm{NO}_2$ and $\mathrm{O}_3$ for $z_1 = 10$ and $z_2 = 80$. Compare this ellipse with the prediction interval in Part a (iii). Comment.

7.23. Using the data on the characteristics of bulls sold at auction in Table 1.10:
(a) Perform a regression analysis using only the response $Y_1$ = SaleHt and the predictor variables $Z_1$ = YrHgt and $Z_2$ = FtFrBody.
  (i) Fit an appropriate model and analyze the residuals.
  (ii) Construct a 95% prediction interval for SaleHt corresponding to $z_1 = 50.5$ and $z_2 = 970$.
(b) Perform a multivariate regression analysis with the responses $Y_1$ = SaleHt and $Y_2$ = SaleWt and the predictors $Z_1$ = YrHgt and $Z_2$ = FtFrBody.
  (i) Fit an appropriate multivariate model and analyze the residuals.
  (ii) Construct a 95% prediction ellipse for both SaleHt and SaleWt for $z_1 = 50.5$ and $z_2 = 970$. Compare this ellipse with the prediction interval in Part a (ii). Comment.

7.24. Using the data on the characteristics of bulls sold at auction in Table 1.10:
(a) Perform a regression analysis using the response $Y_1$ = SalePr and the predictor variables Breed, YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt.
  (i) Determine the "best" regression equation by retaining only those predictor variables that are individually significant.
  (ii) Using the best fitting model, construct a 95% prediction interval for selling price for the set of predictor variable values (in the order listed above) 5, 48.7, 990, 74.0, 7, .18, 54.2, and 1450.
  (iii) Examine the residuals from the best fitting model.
(b) Repeat the analysis in Part a, using the natural logarithm of the sales price as the response. That is, set $Y_1 = \ln(\text{SalePr})$. Which analysis do you prefer? Why?
(c) Calculate the AIC for the model you chose in (b) and for the full model.

7.25. Amitriptyline is prescribed by some physicians as an antidepressant. However, there are also conjectured side effects that seem to be related to the use of the drug: irregular heartbeat, abnormal blood pressures, and irregular waves on the electrocardiogram, among other things. Data gathered on 17 patients who were admitted to the hospital after an amitriptyline overdose are given in Table 7.6.
The two response variables are

$Y_1$ = Total TCAD plasma level (TOT)
$Y_2$ = Amount of amitriptyline present in TCAD plasma level (AMI)

The five predictor variables are

$Z_1$ = Gender: 1 if female, 0 if male (GEN)
$Z_2$ = Amount of antidepressants taken at time of overdose (AMT)
$Z_3$ = PR wave measurement (PR)
$Z_4$ = Diastolic blood pressure (DIAP)
$Z_5$ = QRS wave measurement (QRS)

Table 7.6 Amitriptyline Data

  Y1     Y2     Z1    Z2    Z3   Z4    Z5
  TOT    AMI    GEN   AMT   PR   DIAP  QRS
  3389   3149   1     7500  220    0   140
  1101    653   1     1975  200    0   100
  1131    810   0     3600  205   60   111
   596    448   1      675  160   60   120
   896    844   1      750  185   70    83
  1767   1450   1     2500  180   60    80
   807    493   1      350  154   80    98
  1111    941   0     1500  200   70    93
   645    547   1      375  137   60   105
   628    392   1     1050  167   60    74
  1360   1283   1     3000  180   60    80
   652    458   1      450  160   64    60
   860    722   1     1750  135   90    79
   500    384   0     2000  160   60    80
   781    501   0     4500  180    0   100
  1070    405   0     1500  170   90   120
  1754   1520   1     3000  180    0   129

Source: See [24].

(a) Perform a regression analysis using only the first response $Y_1$.
  (i) Suggest and fit appropriate linear regression models.
  (ii) Analyze the residuals.
  (iii) Construct a 95% prediction interval for Total TCAD for $z_1 = 1$, $z_2 = 1200$, $z_3 = 140$, $z_4 = 70$, and $z_5 = 85$.
(b) Repeat Part a using the second response $Y_2$.
(c) Perform a multivariate multiple regression analysis using both responses $Y_1$ and $Y_2$.
  (i) Suggest and fit appropriate linear regression models.
  (ii) Analyze the residuals.
  (iii) Construct a 95% prediction ellipse for both Total TCAD and Amount of amitriptyline for $z_1 = 1$, $z_2 = 1200$, $z_3 = 140$, $z_4 = 70$, and $z_5 = 85$. Compare this ellipse with the prediction intervals in Parts a and b. Comment.

7.26. Measurements of properties of pulp fibers and the paper made from them are contained in Table 7.7 (see also [19] and website: www.prenhall.com/statistics). There are n = 62 observations of the pulp fiber characteristics, $z_1$ = arithmetic fiber length, $z_2$ = long fiber fraction, $z_3$ = fine fiber fraction, $z_4$ = zero span tensile, and the paper properties, $y_1$ = breaking length, $y_2$ = elastic modulus, $y_3$ = stress at failure, $y_4$ = burst strength.

Table 7.7 Pulp and Paper Properties Data

  Y1      Y2     Y3     Y4     Z1     Z2      Z3      Z4
  BL      EM     SF     BS     AFL    LFF     FFF     ZST
  21.312  7.039  5.326   .932  -.030  35.239  36.991  1.057
  21.206  6.979  5.237   .871   .015  35.713  36.851  1.064
  20.709  6.779  5.060   .742   .025  39.220  30.586  1.053
  19.542  6.601  4.479   .513   .030  39.756  21.072  1.050
  20.449  6.795  4.912   .577  -.070  32.991  36.570  1.049
    :       :      :       :      :      :       :       :
  16.441  6.315  2.997  -.400  -.605   2.845  84.554  1.008
  16.294  6.572  3.017  -.478  -.694   1.515  81.988   .998

Source: See Lee [19].

(a) Perform a regression analysis using each of the response variables $Y_1$, $Y_2$, $Y_3$ and $Y_4$.
  (i) Suggest and fit appropriate linear regression models.
  (ii) Analyze the residuals. Check for outliers.
  (iii) Construct a 95% prediction interval for SF ($Y_3$) for $z_1 = .330$, $z_2 = 45.500$, $z_3 = 20.375$, $z_4 = 1.010$.
(b) Perform a multivariate multiple regression analysis using all four response variables, $Y_1$, $Y_2$, $Y_3$ and $Y_4$, and the four independent variables, $Z_1$, $Z_2$, $Z_3$ and $Z_4$.
  (i) Suggest and fit an appropriate linear regression model. Specify the matrix of estimated coefficients and the estimated error covariance matrix.
  (ii) Analyze the residuals. Check for outliers or observations with high leverage.
  (iii) Construct simultaneous 95% prediction intervals for the individual responses $Y_{0i}$, $i = 1, 2, 3, 4$, for the same settings of the independent variables given in part a (iii) above. Compare the simultaneous prediction interval for $Y_{03}$ with the prediction interval in part a (iii). Comment.

7.27. Refer to the data on fixing breakdowns in cell phone relay towers in Table 6.20. In the initial design, experience level was coded as Novice or Guru. Now consider three levels of experience: Novice, Guru and Experienced. Some additional runs for an experienced engineer are given below. Also, in the original data set, reclassify Guru in run 3 as
Experienced and Novice in run 14 as Experienced. Keep all the other numbers for these two engineers the same. With these changes and the new data below, perform a multivariate multiple regression analysis with assessment and implementation times as the responses, and problem severity, problem complexity and experience level as the predictor variables. Consider regression models with the predictor variables and two-factor interaction terms as inputs. (Note: The two changes in the original data set and the additional data below unbalance the design, so the analysis is best handled with regression methods.)

  Problem    Problem      Engineer     Problem     Problem         Total
  severity   complexity   experience   assessment  implementation  resolution
  level      level        level        time        time            time
  Low        Complex      Experienced   5.3          9.2            14.5
  Low        Complex      Experienced   5.0         10.9            15.9
  High       Simple       Experienced   4.0          8.6            12.6
  High       Simple       Experienced   4.5          8.7            13.2
  High       Complex      Experienced   6.9         14.9            21.8

References

1. Abraham, B., and J. Ledolter. Introduction to Regression Modeling. Belmont, CA: Thompson Brooks/Cole, 2006.
2. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.
3. Atkinson, A. C. Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford, England: Oxford University Press, 1985.
4. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
5. Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (paperback). New York: Wiley-Interscience, 2004.
6. Bowerman, B. L., and R. T. O'Connell. Linear Statistical Models: An Applied Approach (2nd ed.). Belmont, CA: Thompson Brooks/Cole, 2000.
7. Box, G. E. P. "A General Distribution Theory for a Class of Likelihood Criteria." Biometrika, 36 (1949), 317-346.
8. Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, 1994.
9. Chatterjee, S., A. S. Hadi, and B. Price. Regression Analysis by Example (4th ed.). New York: Wiley-Interscience, 2006.
10. Cook, R. D., and S. Weisberg. Residuals and Influence in Regression. London: Chapman and Hall, 1982.
11. Cook, R. D., and S. Weisberg. Applied Regression Including Computing and Graphics. New York: John Wiley, 1999.
12. Daniel, C., and F. S. Wood. Fitting Equations to Data (2nd ed.). New York: John Wiley, 1999.
13. Draper, N. R., and H. Smith. Applied Regression Analysis (3rd ed.). New York: John Wiley, 1998.
14. Durbin, J., and G. S. Watson. "Testing for Serial Correlation in Least Squares Regression, II." Biometrika, 38 (1951), 159-178.
15. Galton, F. "Regression Toward Mediocrity in Hereditary Stature." Journal of the Anthropological Institute, 15 (1885), 246-263.
16. Goldberger, A. S. Econometric Theory. New York: John Wiley, 1964.
17. Heck, D. L. "Charts of Some Upper Percentage Points of the Distribution of the Largest Characteristic Root." Annals of Mathematical Statistics, 31 (1960), 625-642.
18. Khattree, R., and D. N. Naik. Applied Multivariate Statistics with SAS® Software (2nd ed.). Cary, NC: SAS Institute Inc., 1999.
19. Lee, J. "Relationships Between Properties of Pulp-Fibre and Paper." Unpublished doctoral thesis, University of Toronto, Faculty of Forestry, 1992.
20. Neter, J., W. Wasserman, M. Kutner, and C. Nachtsheim. Applied Linear Regression Models (3rd ed.). Chicago: Richard D. Irwin, 1996.
21. Pillai, K. C. S. "Upper Percentage Points of the Largest Root of a Matrix in Multivariate Analysis." Biometrika, 54 (1967), 189-193.
22. Rao, C. R. Linear Statistical Inference and Its Applications (2nd ed.) (paperback). New York: Wiley-Interscience, 2002.
23. Rudorfer, M. V. "Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose." Journal of Toxicology-Clinical Toxicology, 19 (1982), 67-71.
24. Seber, G. A. F. Linear Regression Analysis. New York: John Wiley, 1977.
Chapter 8

PRINCIPAL COMPONENTS

8.1 Introduction

A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are (1) data reduction and (2) interpretation.

Although p components are required to reproduce the total system variability, often much of this variability can be accounted for by a small number k of the principal components. If so, there is (almost) as much information in the k components as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measurements on p variables, is reduced to a data set consisting of n measurements on k principal components.

An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result. A good example of this is provided by the stock market data discussed in Example 8.5.

Analyses of principal components are more of a means to an end rather than an end in themselves, because they frequently serve as intermediate steps in much larger investigations. For example, principal components may be inputs to a multiple regression (see Chapter 7) or cluster analysis (see Chapter 12). Moreover, (scaled) principal components are one "factoring" of the covariance matrix for the factor analysis model considered in Chapter 9.

8.2 Population Principal Components

Algebraically, principal components are particular linear combinations of the p random variables $X_1$,
$X_2, \ldots, X_p$. Geometrically, these linear combinations represent the selection of a new coordinate system obtained by rotating the original system with $X_1, X_2, \ldots, X_p$ as the coordinate axes. The new axes represent the directions with maximum variability and provide a simpler and more parsimonious description of the covariance structure.

As we shall see, principal components depend solely on the covariance matrix $\boldsymbol{\Sigma}$ (or the correlation matrix $\boldsymbol{\rho}$) of $X_1, X_2, \ldots, X_p$. Their development does not require a multivariate normal assumption. On the other hand, principal components derived for multivariate normal populations have useful interpretations in terms of the constant density ellipsoids. Further, inferences can be made from the sample components when the population is multivariate normal. (See Section 8.5.)

Let the random vector $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ have the covariance matrix $\boldsymbol{\Sigma}$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

Consider the linear combinations

$$\begin{aligned} Y_1 &= \mathbf{a}_1'\mathbf{X} = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p \\ Y_2 &= \mathbf{a}_2'\mathbf{X} = a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p \\ &\ \ \vdots \\ Y_p &= \mathbf{a}_p'\mathbf{X} = a_{p1}X_1 + a_{p2}X_2 + \cdots + a_{pp}X_p \end{aligned} \tag{8-1}$$

Then, using (2-45), we obtain

$$\operatorname{Var}(Y_i) = \mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_i, \qquad i = 1, 2, \ldots, p \tag{8-2}$$

$$\operatorname{Cov}(Y_i, Y_k) = \mathbf{a}_i'\boldsymbol{\Sigma}\mathbf{a}_k, \qquad i, k = 1, 2, \ldots, p \tag{8-3}$$

The principal components are those uncorrelated linear combinations $Y_1, Y_2, \ldots, Y_p$ whose variances in (8-2) are as large as possible.

The first principal component is the linear combination with maximum variance. That is, it maximizes $\operatorname{Var}(Y_1) = \mathbf{a}_1'\boldsymbol{\Sigma}\mathbf{a}_1$. It is clear that $\operatorname{Var}(Y_1) = \mathbf{a}_1'\boldsymbol{\Sigma}\mathbf{a}_1$ can be increased by multiplying any $\mathbf{a}_1$ by some constant. To eliminate this indeterminacy, it is convenient to restrict attention to coefficient vectors of unit length. We therefore define

First principal component = linear combination $\mathbf{a}_1'\mathbf{X}$ that maximizes $\operatorname{Var}(\mathbf{a}_1'\mathbf{X})$ subject to $\mathbf{a}_1'\mathbf{a}_1 = 1$

Second principal component = linear combination $\mathbf{a}_2'\mathbf{X}$ that maximizes $\operatorname{Var}(\mathbf{a}_2'\mathbf{X})$ subject to $\mathbf{a}_2'\mathbf{a}_2 = 1$ and $\operatorname{Cov}(\mathbf{a}_1'\mathbf{X}, \mathbf{a}_2'\mathbf{X}) = 0$

At the ith step,

ith principal component = linear combination $\mathbf{a}_i'\mathbf{X}$ that maximizes $\operatorname{Var}(\mathbf{a}_i'\mathbf{X})$ subject to $\mathbf{a}_i'\mathbf{a}_i = 1$ and $\operatorname{Cov}(\mathbf{a}_i'\mathbf{X}, \mathbf{a}_k'\mathbf{X}) = 0$ for $k < i$
Result 8.1. Let $\boldsymbol{\Sigma}$ be the covariance matrix associated with the random vector $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$. Let $\boldsymbol{\Sigma}$ have the eigenvalue-eigenvector pairs $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Then the ith principal component is given by

$$Y_i = \mathbf{e}_i'\mathbf{X} = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p, \qquad i = 1, 2, \ldots, p \tag{8-4}$$

With these choices,

$$\begin{aligned} \operatorname{Var}(Y_i) &= \mathbf{e}_i'\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i, \qquad i = 1, 2, \ldots, p \\ \operatorname{Cov}(Y_i, Y_k) &= \mathbf{e}_i'\boldsymbol{\Sigma}\mathbf{e}_k = 0, \qquad i \ne k \end{aligned} \tag{8-5}$$

If some $\lambda_i$ are equal, the choices of the corresponding coefficient vectors $\mathbf{e}_i$, and hence $Y_i$, are not unique.

Proof. We know from (2-51), with $\mathbf{B} = \boldsymbol{\Sigma}$, that

$$\max_{\mathbf{a}\ne\mathbf{0}} \frac{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}{\mathbf{a}'\mathbf{a}} = \lambda_1 \qquad (\text{attained when } \mathbf{a} = \mathbf{e}_1)$$

But $\mathbf{e}_1'\mathbf{e}_1 = 1$ since the eigenvectors are normalized. Thus,

$$\max_{\mathbf{a}\ne\mathbf{0}} \frac{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}{\mathbf{a}'\mathbf{a}} = \lambda_1 = \frac{\mathbf{e}_1'\boldsymbol{\Sigma}\mathbf{e}_1}{\mathbf{e}_1'\mathbf{e}_1} = \mathbf{e}_1'\boldsymbol{\Sigma}\mathbf{e}_1 = \operatorname{Var}(Y_1)$$

Similarly, using (2-52), we get

$$\max_{\mathbf{a}\perp\mathbf{e}_1,\ldots,\mathbf{e}_k} \frac{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}}{\mathbf{a}'\mathbf{a}} = \lambda_{k+1}, \qquad k = 1, 2, \ldots, p - 1$$

For the choice $\mathbf{a} = \mathbf{e}_{k+1}$, with $\mathbf{e}_{k+1}'\mathbf{e}_i = 0$ for $i = 1, 2, \ldots, k$ and $k = 1, 2, \ldots, p - 1$,

$$\mathbf{e}_{k+1}'\boldsymbol{\Sigma}\mathbf{e}_{k+1}/\mathbf{e}_{k+1}'\mathbf{e}_{k+1} = \mathbf{e}_{k+1}'\boldsymbol{\Sigma}\mathbf{e}_{k+1} = \operatorname{Var}(Y_{k+1})$$

But $\mathbf{e}_{k+1}'(\boldsymbol{\Sigma}\mathbf{e}_{k+1}) = \lambda_{k+1}\mathbf{e}_{k+1}'\mathbf{e}_{k+1} = \lambda_{k+1}$, so $\operatorname{Var}(Y_{k+1}) = \lambda_{k+1}$. It remains to show that $\mathbf{e}_i$ perpendicular to $\mathbf{e}_k$ (that is, $\mathbf{e}_i'\mathbf{e}_k = 0$, $i \ne k$) gives $\operatorname{Cov}(Y_i, Y_k) = 0$. Now, the eigenvectors of $\boldsymbol{\Sigma}$ are orthogonal if all the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ are distinct. If the eigenvalues are not all distinct, the eigenvectors corresponding to common eigenvalues may be chosen to be orthogonal. Therefore, for any two eigenvectors $\mathbf{e}_i$ and $\mathbf{e}_k$, $\mathbf{e}_i'\mathbf{e}_k = 0$, $i \ne k$. Since $\boldsymbol{\Sigma}\mathbf{e}_k = \lambda_k\mathbf{e}_k$, premultiplication by $\mathbf{e}_i'$ gives

$$\operatorname{Cov}(Y_i, Y_k) = \mathbf{e}_i'\boldsymbol{\Sigma}\mathbf{e}_k = \mathbf{e}_i'\lambda_k\mathbf{e}_k = \lambda_k\mathbf{e}_i'\mathbf{e}_k = 0$$

for any $i \ne k$, and the proof is complete.

From Result 8.1, the principal components are uncorrelated and have variances equal to the eigenvalues of $\boldsymbol{\Sigma}$.

Result 8.2. Let $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ have covariance matrix $\boldsymbol{\Sigma}$, with eigenvalue-eigenvector pairs $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Let $Y_1 = \mathbf{e}_1'\mathbf{X}$, $Y_2 = \mathbf{e}_2'\mathbf{X}$, $\ldots$, $Y_p = \mathbf{e}_p'\mathbf{X}$ be the principal components. Then

$$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \sum_{i=1}^{p}\operatorname{Var}(X_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_{i=1}^{p}\operatorname{Var}(Y_i)$$
Proof. From (2-20) with $\mathbf{A} = \boldsymbol{\Sigma}$, we can write $\boldsymbol{\Sigma} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$ where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues and $\mathbf{P} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p]$ so that $\mathbf{P}\mathbf{P}' = \mathbf{P}'\mathbf{P} = \mathbf{I}$. Using Result 2A.12(c), we have

$$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \operatorname{tr}(\boldsymbol{\Sigma}) = \operatorname{tr}(\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}') = \operatorname{tr}(\boldsymbol{\Lambda}\mathbf{P}'\mathbf{P}) = \operatorname{tr}(\boldsymbol{\Lambda}) = \lambda_1 + \lambda_2 + \cdots + \lambda_p$$

Thus,

$$\sum_{i=1}^{p}\operatorname{Var}(X_i) = \operatorname{tr}(\boldsymbol{\Sigma}) = \operatorname{tr}(\boldsymbol{\Lambda}) = \sum_{i=1}^{p}\operatorname{Var}(Y_i)$$

Result 8.2 says that

$$\text{Total population variance} = \sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \lambda_1 + \lambda_2 + \cdots + \lambda_p \tag{8-6}$$

and consequently, the proportion of total variance due to (explained by) the kth principal component is

$$\left(\begin{array}{c}\text{Proportion of total}\\ \text{population variance}\\ \text{due to kth principal}\\ \text{component}\end{array}\right) = \frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}, \qquad k = 1, 2, \ldots, p \tag{8-7}$$

If most (for instance, 80 to 90%) of the total population variance, for large p, can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much loss of information.

Each component of the coefficient vector $\mathbf{e}_i' = [e_{i1}, \ldots, e_{ik}, \ldots, e_{ip}]$ also merits inspection. The magnitude of $e_{ik}$ measures the importance of the kth variable to the ith principal component, irrespective of the other variables. In particular, $e_{ik}$ is proportional to the correlation coefficient between $Y_i$ and $X_k$.

Result 8.3. If $Y_1 = \mathbf{e}_1'\mathbf{X}$, $Y_2 = \mathbf{e}_2'\mathbf{X}$, $\ldots$, $Y_p = \mathbf{e}_p'\mathbf{X}$ are the principal components obtained from the covariance matrix $\boldsymbol{\Sigma}$, then

$$\rho_{Y_i, X_k} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}, \qquad i, k = 1, 2, \ldots, p \tag{8-8}$$

are the correlation coefficients between the components $Y_i$ and the variables $X_k$. Here $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ are the eigenvalue-eigenvector pairs for $\boldsymbol{\Sigma}$.

Proof. Set $\mathbf{a}_k' = [0, \ldots, 0, 1, 0, \ldots, 0]$ so that $X_k = \mathbf{a}_k'\mathbf{X}$ and $\operatorname{Cov}(X_k, Y_i) = \operatorname{Cov}(\mathbf{a}_k'\mathbf{X}, \mathbf{e}_i'\mathbf{X}) = \mathbf{a}_k'\boldsymbol{\Sigma}\mathbf{e}_i$, according to (2-45). Since $\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i$, $\operatorname{Cov}(X_k, Y_i) = \mathbf{a}_k'\lambda_i\mathbf{e}_i = \lambda_ie_{ik}$. Then $\operatorname{Var}(Y_i) = \lambda_i$ [see (8-5)] and $\operatorname{Var}(X_k) = \sigma_{kk}$ yield

$$\rho_{Y_i, X_k} = \frac{\operatorname{Cov}(Y_i, X_k)}{\sqrt{\operatorname{Var}(Y_i)}\sqrt{\operatorname{Var}(X_k)}} = \frac{\lambda_ie_{ik}}{\sqrt{\lambda_i}\sqrt{\sigma_{kk}}} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}, \qquad i, k = 1, 2, \ldots, p$$

Although the correlations of the variables with the principal components often help to interpret the components, they measure only the univariate contribution of an individual X to a component Y. That is, they do not indicate the importance of an X to a component Y in the presence of the other X's. For this reason, some
statisticians (see, for example, Rencher [16]) recommend that only the coefficients $e_{ik}$, and not the correlations, be used to interpret the components. Although the coefficients and the correlations can lead to different rankings as measures of the importance of the variables to a given component, it is our experience that these rankings are often not appreciably different. In practice, variables with relatively large coefficients (in absolute value) tend to have relatively large correlations, so the two measures of importance, the first multivariate and the second univariate, frequently give similar results. We recommend that both the coefficients and the correlations be examined to help interpret the principal components.

The following hypothetical example illustrates the contents of Results 8.1, 8.2, and 8.3.

Example 8.1 (Calculating the population principal components) Suppose the random variables $X_1$, $X_2$ and $X_3$ have the covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$$

It may be verified that the eigenvalue-eigenvector pairs are

$$\begin{aligned} \lambda_1 &= 5.83, & \mathbf{e}_1' &= [.383, -.924, 0] \\ \lambda_2 &= 2.00, & \mathbf{e}_2' &= [0, 0, 1] \\ \lambda_3 &= 0.17, & \mathbf{e}_3' &= [.924, .383, 0] \end{aligned}$$

Therefore, the principal components become

$$\begin{aligned} Y_1 &= \mathbf{e}_1'\mathbf{X} = .383X_1 - .924X_2 \\ Y_2 &= \mathbf{e}_2'\mathbf{X} = X_3 \\ Y_3 &= \mathbf{e}_3'\mathbf{X} = .924X_1 + .383X_2 \end{aligned}$$

The variable $X_3$ is one of the principal components, because it is uncorrelated with the other two variables.

Equation (8-5) can be demonstrated from first principles. For example,

$$\operatorname{Var}(Y_1) = \operatorname{Var}(.383X_1 - .924X_2) = (.383)^2\operatorname{Var}(X_1) + (-.924)^2\operatorname{Var}(X_2) + 2(.383)(-.924)\operatorname{Cov}(X_1, X_2) = .147(1) + .854(5) - .708(-2) = 5.83 = \lambda_1$$

$$\operatorname{Cov}(Y_1, Y_2) = \operatorname{Cov}(.383X_1 - .924X_2, X_3) = .383\operatorname{Cov}(X_1, X_3) - .924\operatorname{Cov}(X_2, X_3) = .383(0) - .924(0) = 0$$

It is also readily apparent that

$$\sigma_{11} + \sigma_{22} + \sigma_{33} = 1 + 5 + 2 = \lambda_1 + \lambda_2 + \lambda_3 = 5.83 + 2.00 + .17,$$

validating Equation (8-6) for this example.
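These quantities are easy to verify numerically. A minimal Python sketch (the use of numpy here is an assumption of this illustration, not part of the text) reproduces the eigenvalue-eigenvector pairs above:

```python
import numpy as np

# Covariance matrix of Example 8.1
Sigma = np.array([[ 1., -2., 0.],
                  [-2.,  5., 0.],
                  [ 0.,  0., 2.]])

# np.linalg.eigh returns eigenvalues in ascending order; reverse so that
# lambda_1 >= lambda_2 >= lambda_3, matching the book's convention.
lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]

print(lam.round(2))                   # [5.83 2.   0.17]
print(E.round(3))                     # columns e1, e2, e3 (signs may be flipped)
print(np.isclose(lam.sum(), np.trace(Sigma)))  # True: verifies (8-6)
```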
The proportion of total variance accounted for by the first principal component is $\lambda_1/(\lambda_1 + \lambda_2 + \lambda_3) = 5.83/8 = .73$. Further, the first two components account for a proportion $(5.83 + 2)/8 = .98$ of the population variance. In this case, the components $Y_1$ and $Y_2$ could replace the original three variables with little loss of information.

Next, using (8-8), we obtain

$$\rho_{Y_1,X_1} = \frac{e_{11}\sqrt{\lambda_1}}{\sqrt{\sigma_{11}}} = \frac{.383\sqrt{5.83}}{\sqrt{1}} = .925, \qquad \rho_{Y_1,X_2} = \frac{e_{12}\sqrt{\lambda_1}}{\sqrt{\sigma_{22}}} = \frac{-.924\sqrt{5.83}}{\sqrt{5}} = -.998$$

Notice here that the variable $X_2$, with coefficient $-.924$, receives the greatest weight in the component $Y_1$. It also has the largest correlation (in absolute value) with $Y_1$. The correlation of $X_1$ with $Y_1$, .925, is almost as large as that for $X_2$, indicating that the variables are about equally important to the first principal component. The relative sizes of the coefficients of $X_1$ and $X_2$ suggest, however, that $X_2$ contributes more to the determination of $Y_1$ than does $X_1$. Since, in this case, both coefficients are reasonably large and they have opposite signs, we would argue that both variables aid in the interpretation of $Y_1$.

Finally, $\rho_{Y_2,X_3} = \sqrt{\lambda_2}/\sqrt{\sigma_{33}} = \sqrt{2}/\sqrt{2} = 1$ (as it should). The remaining correlations can be neglected, since the third component is unimportant.

It is informative to consider principal components derived from multivariate normal random variables. Suppose X is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. We know from (4-7) that the density of X is constant on the $\boldsymbol{\mu}$-centered ellipsoids

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2$$

which have axes $\pm c\sqrt{\lambda_i}\,\mathbf{e}_i$, $i = 1, 2, \ldots, p$, where the $(\lambda_i, \mathbf{e}_i)$ are the eigenvalue-eigenvector pairs of $\boldsymbol{\Sigma}$. A point lying on the ith axis of the ellipsoid will have coordinates proportional to $\mathbf{e}_i' = [e_{i1}, e_{i2}, \ldots, e_{ip}]$ in the coordinate system that has origin $\boldsymbol{\mu}$ and axes that are parallel to the original axes $x_1, x_2, \ldots, x_p$. It will be convenient to set $\boldsymbol{\mu} = \mathbf{0}$ in the argument that follows.¹

From our discussion in Section 2.3 with $\mathbf{A} = \boldsymbol{\Sigma}^{-1}$, we can write

$$c^2 = \mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = \frac{1}{\lambda_1}(\mathbf{e}_1'\mathbf{x})^2 + \frac{1}{\lambda_2}(\mathbf{e}_2'\mathbf{x})^2 + \cdots + \frac{1}{\lambda_p}(\mathbf{e}_p'\mathbf{x})^2$$

¹This can be done without loss of generality because the normal random vector X can always be translated to the normal random vector W = X - μ with E(W) = 0, and Cov(X) = Cov(W).
where $\mathbf{e}_1'\mathbf{x}, \mathbf{e}_2'\mathbf{x}, \ldots, \mathbf{e}_p'\mathbf{x}$ are recognized as the principal components of x. Setting $y_1 = \mathbf{e}_1'\mathbf{x}$, $y_2 = \mathbf{e}_2'\mathbf{x}$, $\ldots$, $y_p = \mathbf{e}_p'\mathbf{x}$, we have

$$c^2 = \frac{1}{\lambda_1}y_1^2 + \frac{1}{\lambda_2}y_2^2 + \cdots + \frac{1}{\lambda_p}y_p^2$$

and this equation defines an ellipsoid (since $\lambda_1, \lambda_2, \ldots, \lambda_p$ are positive) in a coordinate system with axes $y_1, y_2, \ldots, y_p$ lying in the directions of $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$, respectively. If $\lambda_1$ is the largest eigenvalue, then the major axis lies in the direction $\mathbf{e}_1$. The remaining minor axes lie in the directions defined by $\mathbf{e}_2, \ldots, \mathbf{e}_p$.

To summarize, the principal components $y_1 = \mathbf{e}_1'\mathbf{x}$, $y_2 = \mathbf{e}_2'\mathbf{x}$, $\ldots$, $y_p = \mathbf{e}_p'\mathbf{x}$ lie in the directions of the axes of a constant density ellipsoid. Therefore, any point on the ith ellipsoid axis has x coordinates proportional to $\mathbf{e}_i' = [e_{i1}, e_{i2}, \ldots, e_{ip}]$ and, necessarily, principal component coordinates of the form $[0, \ldots, 0, y_i, 0, \ldots, 0]$. When $\boldsymbol{\mu} \ne \mathbf{0}$, it is the mean-centered principal component $y_i = \mathbf{e}_i'(\mathbf{x} - \boldsymbol{\mu})$ that has mean 0 and lies in the direction $\mathbf{e}_i$.

A constant density ellipse and the principal components for a bivariate normal random vector with $\boldsymbol{\mu} = \mathbf{0}$ and $\rho = .75$ are shown in Figure 8.1. We see that the principal components are obtained by rotating the original coordinate axes through an angle $\theta$ until they coincide with the axes of the constant density ellipse. This result holds for p > 2 dimensions as well.

Figure 8.1 The constant density ellipse $\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = c^2$ and the principal components $y_1$, $y_2$ for a bivariate normal random vector X having mean 0.

Principal Components Obtained from Standardized Variables

Principal components may also be obtained for the standardized variables

$$Z_1 = \frac{X_1 - \mu_1}{\sqrt{\sigma_{11}}}, \quad Z_2 = \frac{X_2 - \mu_2}{\sqrt{\sigma_{22}}}, \quad \ldots, \quad Z_p = \frac{X_p - \mu_p}{\sqrt{\sigma_{pp}}} \tag{8-9}$$

In matrix notation,

$$\mathbf{Z} = (\mathbf{V}^{1/2})^{-1}(\mathbf{X} - \boldsymbol{\mu}) \tag{8-10}$$

where the diagonal standard deviation matrix $\mathbf{V}^{1/2}$ is defined in (2-35). Clearly, $E(\mathbf{Z}) = \mathbf{0}$ and

$$\operatorname{Cov}(\mathbf{Z}) = (\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1} = \boldsymbol{\rho}$$

by (2-37). The principal components of Z may be obtained from the eigenvectors of the correlation matrix $\boldsymbol{\rho}$ of X. All our previous results apply, with some simplifications, since the variance of each $Z_i$ is unity. We shall continue to use the notation $Y_i$ to refer to the ith principal component and $(\lambda_i, \mathbf{e}_i)$ for the eigenvalue-eigenvector pair from either $\boldsymbol{\rho}$ or $\boldsymbol{\Sigma}$. However, the $(\lambda_i, \mathbf{e}_i)$ derived from $\boldsymbol{\Sigma}$ are, in general, not the same as the ones derived from $\boldsymbol{\rho}$.

Result 8.4. The ith principal component of the standardized variables $\mathbf{Z}' = [Z_1, Z_2, \ldots, Z_p]$ with $\operatorname{Cov}(\mathbf{Z}) = \boldsymbol{\rho}$ is given by

$$Y_i = \mathbf{e}_i'\mathbf{Z} = \mathbf{e}_i'(\mathbf{V}^{1/2})^{-1}(\mathbf{X} - \boldsymbol{\mu}), \qquad i = 1, 2, \ldots, p$$

Moreover,

$$\sum_{i=1}^{p}\operatorname{Var}(Y_i) = \sum_{i=1}^{p}\operatorname{Var}(Z_i) = p \tag{8-11}$$

and

$$\rho_{Y_i, Z_k} = e_{ik}\sqrt{\lambda_i}, \qquad i, k = 1, 2, \ldots, p$$

In this case, $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ are the eigenvalue-eigenvector pairs for $\boldsymbol{\rho}$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

Proof. Result 8.4 follows from Results 8.1, 8.2, and 8.3, with $Z_1, Z_2, \ldots, Z_p$ in place of $X_1, X_2, \ldots, X_p$ and $\boldsymbol{\rho}$ in place of $\boldsymbol{\Sigma}$.

We see from (8-11) that the total (standardized variables) population variance is simply p, the sum of the diagonal elements of the matrix $\boldsymbol{\rho}$. Using (8-7) with Z in place of X, we find that the proportion of total variance explained by the kth principal component of Z is

$$\left(\begin{array}{c}\text{Proportion of (standardized)}\\ \text{population variance due}\\ \text{to kth principal component}\end{array}\right) = \frac{\lambda_k}{p}, \qquad k = 1, 2, \ldots, p \tag{8-12}$$

where the $\lambda_k$'s are the eigenvalues of $\boldsymbol{\rho}$.

Example 8.2 (Principal components obtained from covariance and correlation matrices are different) Consider the covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 4 \\ 4 & 100 \end{bmatrix}$$
and the derived correlation matrix

$$\boldsymbol{\rho} = \begin{bmatrix} 1 & .4 \\ .4 & 1 \end{bmatrix}$$

The eigenvalue-eigenvector pairs from $\boldsymbol{\Sigma}$ are

$$\begin{aligned} \lambda_1 &= 100.16, & \mathbf{e}_1' &= [.040, .999] \\ \lambda_2 &= .84, & \mathbf{e}_2' &= [.999, -.040] \end{aligned}$$

Similarly, the eigenvalue-eigenvector pairs from $\boldsymbol{\rho}$ are

$$\begin{aligned} \lambda_1 &= 1 + \rho = 1.4, & \mathbf{e}_1' &= [.707, .707] \\ \lambda_2 &= 1 - \rho = .6, & \mathbf{e}_2' &= [.707, -.707] \end{aligned}$$

The respective principal components become

$$\boldsymbol{\Sigma}\colon \quad Y_1 = .040X_1 + .999X_2, \qquad Y_2 = .999X_1 - .040X_2$$

$$\boldsymbol{\rho}\colon \quad Y_1 = .707Z_1 + .707Z_2 = .707\frac{(X_1 - \mu_1)}{1} + .707\frac{(X_2 - \mu_2)}{10} = .707(X_1 - \mu_1) + .0707(X_2 - \mu_2)$$

$$\qquad\ \ Y_2 = .707Z_1 - .707Z_2 = .707(X_1 - \mu_1) - .0707(X_2 - \mu_2)$$

Because of its large variance, $X_2$ completely dominates the first principal component determined from $\boldsymbol{\Sigma}$. Moreover, this first principal component explains a proportion

$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{100.16}{101} = .992$$

of the total population variance.

When the variables $X_1$ and $X_2$ are standardized, however, the resulting variables contribute equally to the principal components determined from $\boldsymbol{\rho}$. Using Result 8.4, we obtain

$$\rho_{Y_1,Z_1} = e_{11}\sqrt{\lambda_1} = .707\sqrt{1.4} = .837 \qquad \text{and} \qquad \rho_{Y_1,Z_2} = e_{21}\sqrt{\lambda_1} = .707\sqrt{1.4} = .837$$

In this case, the first principal component explains a proportion

$$\frac{\lambda_1}{p} = \frac{1.4}{2} = .7$$

of the total (standardized) population variance.

Most strikingly, we see that the relative importance of the variables to, for instance, the first principal component is greatly affected by the standardization.
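The two analyses are easy to compare numerically. A minimal Python sketch (numpy assumed; not part of the text's software) recovers both sets of eigenvalue-eigenvector pairs from the matrices above:

```python
import numpy as np

# Example 8.2: the same bivariate population, analyzed on two scales.
Sigma = np.array([[1.,   4.],
                  [4., 100.]])
d = np.sqrt(np.diag(Sigma))
rho = Sigma / np.outer(d, d)          # derived correlation matrix, as in (2-37)

for name, M in (("Sigma", Sigma), ("rho  ", rho)):
    lam, E = np.linalg.eigh(M)
    lam, E = lam[::-1], E[:, ::-1]    # order lambda_1 >= lambda_2
    print(name, lam.round(3), E[:, 0].round(3))
# Sigma: eigenvalues ~ [100.16, 0.84], first eigenvector ~ [.040, .999]
# rho:   eigenvalues   [1.4,    0.6 ], first eigenvector ~ [.707, .707]
```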
. Suppose :t is the diagonal matrix ull 0 u 22 ··· ··· :t = [ 0 . For a covariance matrix with the pattern of (813).Population Principal Components 439 When the first principal component obtained from p is expressed in terms of X 1 and X 2 .999 attached to these variables in the principal • component obtained from :t. 0 . the relative magnitudes of the weights .. [T u22 0 ..000 to $350. 1...) is the ith eigenvalueeigenvector pair. Furthermore..01 to . This suggests that the standardization is not inconsequential. one set of principal components is not a simple function of the other. nothing is gained by extracting the principal components. . . In this case.2. the contours of constant density are ellipsoids whose axes already lie in the directions of maximum variation. 0. 0 0 and we conclude that (u. :t ). The preceding example demonstrates that the principal components derived from :t are different from those derived from p. . .U 1 0 0 or Ie. 0 . Principal Components for Covariance Mat(ices with Special Structures There are certain patterned covariance and correlation matrices whose principal components can be expressed in simple forms.040 and . X = X. if X 1 represents annual sales in the $10.e. Variables should probably be standardized if they are measured on scales with widely differing ranges or if the units of measurement are not commensurate. For example.000 range and X 2 is the ratio (net annual income)/(total assets) that falls in the ..0707 are in direct opposition to those of the weights . . Consequently. and X 2 (or~) will play a larger role in the construction of the principal components.. 0. with 1 in the ith position. . their subsequent magnitudes will be of the same order. . This behavior was observed in Example 8. if X is distributed as Np(/L. (813) Setting e! = [0. we observe that 0 0 0 0 0 1u. e. ..60 range. Since the linear combination e. = u. if both variables are standardized. From another point of view. . OJ. there is no need to rotate the coordinate system.. . we would expect a single (important) principal component with a heavy weighting of X 1 • Alternatively. the set of principal components is just the original set of uncorrelated random variables. then the total variation will be due almost exclusively to dollar sales.707 and .
Standardization does not substantially alter the situation for the $\boldsymbol{\Sigma}$ in (8-13). In that case, $\boldsymbol{\rho} = \mathbf{I}$, the p × p identity matrix. Clearly, $\boldsymbol{\rho}\mathbf{e}_i = 1\mathbf{e}_i$, so the eigenvalue 1 has multiplicity p and $\mathbf{e}_i' = [0, \ldots, 0, 1, 0, \ldots, 0]$, $i = 1, 2, \ldots, p$, are convenient choices for the eigenvectors. Consequently, the principal components determined from $\boldsymbol{\rho}$ are also the original variables $Z_1, \ldots, Z_p$. Moreover, in this case of equal eigenvalues, the multivariate normal ellipsoids of constant density are spheroids.

Another patterned covariance matrix, which often describes the correspondence among certain biological variables such as the sizes of living things, has the general form

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2 & \rho\sigma^2 & \cdots & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \cdots & \rho\sigma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \rho\sigma^2 & \rho\sigma^2 & \cdots & \sigma^2 \end{bmatrix} \tag{8-14}$$

The resulting correlation matrix

$$\boldsymbol{\rho} = \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{bmatrix} \tag{8-15}$$

is also the covariance matrix of the standardized variables. The matrix in (8-15) implies that the variables $X_1, X_2, \ldots, X_p$ are equally correlated.

It is not difficult to show (see Exercise 8.5) that the p eigenvalues of the correlation matrix (8-15) can be divided into two groups. When $\rho$ is positive, the largest is

$$\lambda_1 = 1 + (p - 1)\rho \tag{8-16}$$

with associated eigenvector

$$\mathbf{e}_1' = \left[\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}, \ldots, \frac{1}{\sqrt{p}}\right] \tag{8-17}$$

The remaining p - 1 eigenvalues are

$$\lambda_2 = \lambda_3 = \cdots = \lambda_p = 1 - \rho$$

and one choice for their eigenvectors is

$$\begin{aligned} \mathbf{e}_2' &= \left[\frac{1}{\sqrt{1\cdot 2}}, \frac{-1}{\sqrt{1\cdot 2}}, 0, \ldots, 0\right] \\ \mathbf{e}_3' &= \left[\frac{1}{\sqrt{2\cdot 3}}, \frac{1}{\sqrt{2\cdot 3}}, \frac{-2}{\sqrt{2\cdot 3}}, 0, \ldots, 0\right] \\ &\ \ \vdots \\ \mathbf{e}_i' &= \left[\frac{1}{\sqrt{(i-1)i}}, \ldots, \frac{1}{\sqrt{(i-1)i}}, \frac{-(i-1)}{\sqrt{(i-1)i}}, 0, \ldots, 0\right] \\ &\ \ \vdots \\ \mathbf{e}_p' &= \left[\frac{1}{\sqrt{(p-1)p}}, \ldots, \frac{1}{\sqrt{(p-1)p}}, \frac{-(p-1)}{\sqrt{(p-1)p}}\right] \end{aligned}$$
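The pair in (8-16) and (8-17) can be checked directly. Writing (8-15) as $\boldsymbol{\rho} = (1 - \rho)\mathbf{I} + \rho\,\mathbf{1}\mathbf{1}'$ and using $\mathbf{1}'\mathbf{1} = p$,

$$\boldsymbol{\rho}\,\mathbf{e}_1 = \left[(1-\rho)\mathbf{I} + \rho\,\mathbf{1}\mathbf{1}'\right]\frac{1}{\sqrt{p}}\,\mathbf{1} = \frac{(1-\rho)}{\sqrt{p}}\,\mathbf{1} + \frac{\rho\,p}{\sqrt{p}}\,\mathbf{1} = \left[1 + (p-1)\rho\right]\frac{1}{\sqrt{p}}\,\mathbf{1} = \lambda_1\mathbf{e}_1$$

Likewise, any vector orthogonal to $\mathbf{1}$, such as each $\mathbf{e}_i$ above with $i \ge 2$, satisfies $\mathbf{1}'\mathbf{e}_i = 0$, so $\boldsymbol{\rho}\,\mathbf{e}_i = (1-\rho)\mathbf{e}_i$, confirming $\lambda_2 = \cdots = \lambda_p = 1 - \rho$.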
. . Recall that the n values of any linear combination j = 1.. The uncorrelated combinations with the largest variances will be called the sample principal components. . have sample covariance aJ. 1] X.Sa 2 [see (336)].. The minor axes (and remaining principal components) occur in spherically symmetric directions perpendicular to the major axis (and first principal component). ." with the major axis proportional to the first principal component Y1 = (1/Vp) [1..3 Summarizing Sample Variation by Principal Components We now have the framework necessary to study the problem of summarizing the variation in n measurements on p variables with a few judiciously chosen linear combinations. retaining only the first principal component Y1 = (1/ Vp) [1. When p is near 1. ZP have a multivariate normal distribution with a covariance matrix given by (815).. ..1 components collectively contribute very little to the total variance and can often be neglected. 8.1. 2. In this special case. the first component explains 84% of the total variance. Vpi=l is proportional to the sum of the p standarized variables. . Our objective in this section will be to construct uncorrelated linear combinations of the measured characteristics that account for much of the variation in the sample. and covariance matrix l:. 2 2 . If the standardized variables 2 1 . still explains the same proportion (818) of total variance. .... We see that AJ/p = p for p close to 1 or p large. and the sample correlation matrix R. . . 1]. . It might be regarded as an "index" with equal weights. .p =p+p p (818) of the total population variation.Summarizing Sample Variation by Principal Components 441 The first principal component 1 p Y1 = ejZ = .80 and p = 5. for two linear combinations. 1. Suppose the data xi> x 2 .1 )p 1... . Xn represent n independent drawings from some pdimensional population with mean vector 1. n have sample mean aii and sample variance aiSa 1 • Also. This principal component explains a proportion = A1 p 1 + (p . 1. then the ellipsoids of constant density are "cigar shaped. 1] Z. the sample covariance matrix S. the last p . the pairs of values (a)xi. For example... . a2xi). These data yield the sample mean vector i. if p = . This principal component is the projection of Z on the equiangular line 1' = [1. 1. a measure of total size.~ Z.
p A A A where A1 2: A2 2: · · · XI' Xz.. ·' First sample linear combination a. As with the population quantities. i #' k In addition. equivalently.. Also. k < 1 ·. Specifically. covariance for the pairs (a! xi. 2.2. J The first principal component maximizes a... . ala.3.xi that maximizes the sample :..h) = Ak> k = 1. ' xp.xi that maximizes principal component = the sample variance of a!xi subject to aja 1 = 1 ::i ·'>""' Second sample linear combination a2x1 that maximizes the sample'"'" principal component = variance of a2xi subject to a2a1 = 1 and zero sampl.18. xi subj~ct to. . . . xi. x '' k = e.to ek. 1 ). .442 Chapter 8 Principal Components ~ The sample principal components are defined as those linear combinati which have maximum sample variance. (Ap. . perpendicular . we have :. Thus. Yk) = 0.. . w~:. to satisfy aja. ep). as in the proofs of Resuii 8. the ith sample principal component is given. = A + A + · ·· + 2 1 AP r. .{ il . 2: AP 2: 0 and x is any observation on the variables· A: Sample variance(.k} is ~e p X p sa~ple covariance matrix with eigenvalueeigenvecto~~ pairs (A 1 .·'"'" :. Total sample variance = and (820) ·~f ~ i=l ± S. (A1 . '>i e ~ i = 1. Successive choices of a." 1 and z~ro sample covanance for all pairs (a..... p Sample covariance(y. ~. principal component = varia~ce of a. ez). = 1.fl strict the coefficient vectors a. a1xi) ~{ At the ith step..2.Akek' or 8. we obtain the following results concerning sample principal component§~ e td/ If S = js. akxi). . • rV Skk i.k = 1. . the maximum is the largest eigenvalue A attained for the choice~ 1 a 1 = eigenvector 1 of S.k~ .. Sa 1 or...p .. ~ By (251).. maximize (819) subject 0 = ajSek = a. ith sample linear combination a..
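In computational terms, the sample principal components come directly from the eigen-decomposition of S. The following Python sketch is one way to organize the computation (numpy assumed; the function name and the data matrix X are placeholders introduced here for illustration):

```python
import numpy as np

def sample_principal_components(X):
    """Return (eigenvalues, eigenvectors, scores) for the n x p data matrix X."""
    S = np.cov(X, rowvar=False)            # sample covariance, divisor n - 1
    lam, E = np.linalg.eigh(S)
    lam, E = lam[::-1], E[:, ::-1]         # order lambda_1 >= ... >= lambda_p
    scores = (X - X.mean(axis=0)) @ E      # value of each component, each observation
    return lam, E, scores

# The component scores have zero sample means, their sample variances equal
# the eigenvalues, and the eigenvalues sum to the total sample variance (8-20).
```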
We shall denote the sample principal components by $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_p$, irrespective of whether they are obtained from S or R.² The components constructed from S and R are not the same, in general, but it will be clear from the context which matrix is being used, and the single notation is convenient. It is also convenient to label the component coefficient vectors $\hat{\mathbf{e}}_i$ and the component variances $\hat{\lambda}_i$ for both situations.

The observations $\mathbf{x}_j$ are often "centered" by subtracting $\bar{\mathbf{x}}$. This has no effect on the sample covariance matrix S and gives the ith principal component

$$\hat{y}_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}}), \qquad i = 1, 2, \ldots, p \tag{8-21}$$

for any observation vector x. If we consider the values of the ith component

$$\hat{y}_{ji} = \hat{\mathbf{e}}_i'(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad j = 1, 2, \ldots, n \tag{8-22}$$

generated by substituting each observation $\mathbf{x}_j$ for the arbitrary x in (8-21), then

$$\bar{\hat{y}}_i = \frac{1}{n}\sum_{j=1}^{n}\hat{\mathbf{e}}_i'(\mathbf{x}_j - \bar{\mathbf{x}}) = \frac{1}{n}\hat{\mathbf{e}}_i'\left(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})\right) = \hat{\mathbf{e}}_i'\mathbf{0} = 0 \tag{8-23}$$

That is, the sample mean of each principal component is zero. The sample variances are still given by the $\hat{\lambda}_i$'s, as in (8-20).

²Sample principal components also can be obtained from $\hat{\boldsymbol{\Sigma}} = ((n-1)/n)\mathbf{S}$, the maximum likelihood estimate of the covariance matrix $\boldsymbol{\Sigma}$, if the $\mathbf{X}_j$ are normally distributed. (See Result 4.11.) In this case, provided that the eigenvalues of $\boldsymbol{\Sigma}$ are distinct, the sample principal components can be viewed as the maximum likelihood estimates of the corresponding population counterparts. (See [1].) We shall not consider $\hat{\boldsymbol{\Sigma}}$ because the assumption of normality is not required in this section. Also, $\hat{\boldsymbol{\Sigma}}$ has eigenvalues $[(n-1)/n]\hat{\lambda}_i$ and corresponding eigenvectors $\hat{\mathbf{e}}_i$, where $(\hat{\lambda}_i, \hat{\mathbf{e}}_i)$ are the eigenvalue-eigenvector pairs for S. Thus, both S and $\hat{\boldsymbol{\Sigma}}$ give the same sample principal components $\hat{\mathbf{e}}_i'\mathbf{x}$ [see (8-20)] and the same proportion of explained variance $\hat{\lambda}_i/(\hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p)$. Finally, both S and $\hat{\boldsymbol{\Sigma}}$ give the same sample correlation matrix R, so if the variables are standardized, the choice of S or $\hat{\boldsymbol{\Sigma}}$ is irrelevant.

Example 8.3 (Summarizing sample variability with two sample principal components) A census provided information, by tract, on five socioeconomic variables for the Madison, Wisconsin, area. The data from 61 tracts are listed in Table 8.5 in the exercises at the end of this chapter. These data produced the following summary statistics:

$$\bar{\mathbf{x}}' = [4.47,\ 3.96,\ 71.42,\ 26.91,\ 1.64]$$

where the entries are, in order, total population (thousands), professional degree (percent), employed age over 16 (percent), government employment (percent), and median home value (\$100,000), and

$$\mathbf{S} = \begin{bmatrix} 3.397 & -1.102 & 4.306 & -2.078 & 0.027 \\ -1.102 & 9.673 & -1.513 & 10.953 & 1.203 \\ 4.306 & -1.513 & 55.626 & -28.937 & -0.044 \\ -2.078 & 10.953 & -28.937 & 89.067 & 0.957 \\ 0.027 & 1.203 & -0.044 & 0.957 & 0.319 \end{bmatrix}$$

Can the sample variation be summarized by one or two principal components?
We find the following:

Coefficients for the Principal Components (Correlation Coefficients in Parentheses)

  Variable                      e1 (r)          e2 (r)         e3       e4       e5
  Total population              -0.039(-.22)    0.071(.24)     0.188    0.977   -0.058
  Profession                     0.105(.35)     0.130(.26)    -0.961    0.171   -0.139
  Employment (%)                -0.492(-.68)    0.864(.73)     0.046   -0.091    0.005
  Government employment (%)      0.863(.95)     0.480(.32)     0.153   -0.030    0.007
  Medium home value              0.009(.16)     0.015(.17)    -0.125    0.082    0.989
  Variance (lambda_i):         107.02          39.67           8.37     2.87     0.15
  Cumulative percentage
  of total variance             67.7           92.8           98.1     99.9    100

The first principal component explains 67.7% of the total sample variance. The first two principal components, collectively, explain 92.8% of the total sample variance. Consequently, sample variation is summarized very well by two principal components, and a reduction in the data from 61 observations on 5 variables to 61 observations on 2 principal components is reasonable.

Given the foregoing component coefficients, the first component appears to be essentially a weighted difference between the percent employed by government and the percent total employment. The second component appears to be a weighted sum of the two.

As we said in our discussion of the population components, the component coefficients $\hat{e}_{ik}$ and the correlations $r_{\hat{y}_i, x_k}$ should both be examined to interpret the principal components. The correlations allow for differences in the variances of the original variables, but only measure the importance of an individual X without regard to the other X's making up the component. We notice in Example 8.3, however, that the correlation coefficients displayed in the table confirm the interpretation provided by the component coefficients.

The Number of Principal Components

There is always the question of how many components to retain. There is no definitive answer to this question. Things to consider include the amount of total sample variance explained, the relative sizes of the eigenvalues (the variances of the sample components), and the subject-matter interpretations of the components. In addition, as we discuss later, a component associated with an eigenvalue near zero and, hence, deemed unimportant, may indicate an unsuspected linear dependency in the data.

A useful visual aid to determining an appropriate number of principal components is a scree plot.³ With the eigenvalues ordered from largest to smallest, a scree plot is a plot of $\hat{\lambda}_i$ versus i, the magnitude of an eigenvalue versus its number. To determine the appropriate number of components, we look for an elbow (bend) in the scree plot. The number of components is taken to be the point at which the remaining eigenvalues are relatively small and all about the same size. Figure 8.2 shows a scree plot for a situation with six principal components. An elbow occurs in the plot in Figure 8.2 at about i = 3. That is, the eigenvalues after $\hat{\lambda}_2$ are all relatively small and about the same size. In this case, it appears, without any other evidence, that two (or perhaps three) sample principal components effectively summarize the total sample variance.

Figure 8.2 A scree plot.

³Scree is the rock debris at the bottom of a cliff.
Example 8.4 (Summarizing sample variability with one sample principal component) In a study of size and shape relationships for painted turtles, Jolicoeur and Mosimann [11] measured carapace length, width, and height. Their data, reproduced in Exercise 6.18, suggest an analysis in terms of logarithms. (Jolicoeur [10] generally suggests a logarithmic transformation in studies of size-and-shape relationships.) Perform a principal component analysis.

The natural logarithms of the dimensions of 24 male turtles have sample mean vector $\bar{\mathbf{x}}' = [4.725, 4.478, 3.703]$ and covariance matrix

$$\mathbf{S} = 10^{-3}\begin{bmatrix} 11.072 & 8.019 & 8.160 \\ 8.019 & 6.417 & 6.005 \\ 8.160 & 6.005 & 6.773 \end{bmatrix}$$

A principal component analysis (see Panel 8.1 for the output from the SAS statistical software package) yields the following summary:

Coefficients for the Principal Components (Correlation Coefficients in Parentheses)

  Variable              e1 (r)        e2       e3
  ln(length)            .683 (.99)   -.159    -.713
  ln(width)             .510 (.97)   -.594     .622
  ln(height)            .523 (.97)    .788     .324
  Variance (lambda_i):  23.30e-3      .60e-3   .36e-3
  Cumulative percentage
  of total variance     96.1          98.5    100

A scree plot is shown in Figure 8.3. The very distinct elbow in this plot occurs at i = 2. There is clearly one dominant principal component, which explains 96% of the total variance. The first principal component has an interesting subject-matter interpretation.

Figure 8.3 A scree plot for the turtle data.

PANEL 8.1 SAS ANALYSIS FOR EXAMPLE 8.4 USING PROC PRINCOMP

PROGRAM COMMANDS:

  title 'Principal Component Analysis';
  data turtle;
  infile 'E8-4.dat';
  input length width height;
  x1 = log(length);
  x2 = log(width);
  x3 = log(height);
  proc princomp cov data = turtle out = result;
  var x1 x2 x3;

OUTPUT:

  Principal Components Analysis
  24 Observations
  3 Variables

  Simple Statistics
            X1            X2            X3
  Mean      4.725443647   4.477573765   3.703185794
  StD       0.105223590   0.080104466   0.082296771

  Covariance Matrix
        X1            X2            X3
  X1    0.0110720040  0.0080191419  0.0081596480
  X2    0.0080191419  0.0064167255  0.0060052707
  X3    0.0081596480  0.0060052707  0.0067727585

  Total Variance = 0.024261488

  Eigenvalues of the Covariance Matrix
          Eigenvalue   Difference   Proportion   Cumulative
  PRIN1   0.023303     0.022705     0.960508     0.96051
  PRIN2   0.000598     0.000238     0.024661     0.98517
  PRIN3   0.000360                  0.014832     1.00000

  Eigenvectors
        PRIN1      PRIN2      PRIN3
  X1    0.683102   -0.159479  -0.712697
  X2    0.510220   -0.594012   0.621953
  X3    0.522539    0.788490   0.324401
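The decomposition in the panel can be cross-checked from the covariance matrix it prints. A minimal Python sketch of the same computation (numpy assumed; this is an illustration, not part of the SAS session):

```python
import numpy as np

# Covariance matrix as printed in Panel 8.1
S = np.array([[0.0110720040, 0.0080191419, 0.0081596480],
              [0.0080191419, 0.0064167255, 0.0060052707],
              [0.0081596480, 0.0060052707, 0.0067727585]])

lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]
print(lam.round(6))      # [0.023303 0.000598 0.00036 ], matching PRIN1-PRIN3
print(E.round(3))        # columns match the panel's eigenvectors up to sign
print((lam / lam.sum()).round(4))  # proportions 0.9605, 0.0247, 0.0148
```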
with A1 > A2 • The sample principal components are well determined. ••• . Then the sample principal components. centered at i.. suppose the'" underlying distribution of X is nearly Np(p. · · · ~f th~se hyperel~ipsoid vA. lie along the axes of the hyperellipsoid.) = c of the underlying~ normal density. of S.4(a) shows an ellipse of constant distance. the absolute value of the ith principal component. gives the length of the projection of the vector (x . (See Section 2. Figure 8.If Sis : positive definite.. the sample principal components can be viewed as the result of translating the origin of the original coordinate system to i and then rotating the coordinate axes until they pass through the scatter in the directions of maximum variance. c: Also.. AP·~ and (A..3 and Result 4.x) = 1 C2 2 (824) estimates the constant d'ensity contour (x .448 Chapter 8 Principal Components the first principal component may be viewed as the In (volume) of a box with ad. the axes of the ellipse (circle) of constant distance are not. with'·~ A1 = Az.(x . the sample principal components = e.] Thus..· ~usted dimensions.p. First. Even when the normal assumption is suspect and the scatter plot may depart somewhat from an elliptical pattern.i in the directions of the axes Consequently. centered . For instance. . . p. • . = e[(X .) The lengths axes are proportional to i = 1. which accounts. The data can then be expressed in the new coordinates. but it is not required for the development of the properties of the sample principal components summarized in (820). the contour consisting of all p X 1 vectors x satisfying · (x . p.:. e. The normality assumption is useful for the inference procedures discussed in Section 8..1. Now. i = 1.. . Y. for the rounded shape of the carapace. 2. eJ are the eigenvalueeigenvector pairs of I. where Ap :2:: 0 are the eigenvaluesofS. by i and I by S.~ Yi = el( X .. The geometrical interpretation of the sample principal components is illustrated in Figure 8. · Because has length 1. A) distribution.x) 1. I I = Ief(x .x). e. m some sense. ).1 for = 2. lie along the axes of the ellipse in the perpendicular directions of maximum~ ~amp!~ variap..)'I.1(x.. including the~ A1 :2:: Az . the data may be plotted as n points in pspace. . If A1 = A2 . .1 or. Interpretation of the Sample Principal Components The sample principal components have several interpretations. at i.(x .~ uniquely determined and can lie in any two perpendicular directions. .. I).x)'S. from the sample values xi. tain the sample principal components.ce.x) on the unit vector [See (28) and (29).. which coincide with the axes of the contour of (824). e.p. 2. Geometrically.. A2 .~ which have an Np(O. (824) defines a hyperellipsoid that is centered at i and whose axes are given by the eigenvectors of s. we can approximate p. and their absolute values are the lengths of the projections of x .5..x) are realizations of population principal components Y. we can still extract the eigenvalues from Sand ob.. the adjusted height is (height)· 523 . equivalently. The diagonal matrix A has entries A1 . with S in place of I.IJgure 8.4(b) shows a constant distance ellipse. They. The approximate contours can be drawn on the scatter plot to indicate the normal distribution that generated the data. ~ Yi e.
i)'s• (x.7.x) = vS. in general. When the contours of constant distance are nearly circular or. directions of the original coordinate axes.. If the last few eigenvalues A. variables measured on different scales or on a common scale with widely differing ranges are often standardized. It is then not possible to represent the data well in fewer than p dimensions.x) = c2 (x~~x~. (See Exercises 8.~x. when the eigenvalues of S are nearly equal. x)'s• (xi)= c 2 Figure 8. standardization is accomplished by constructing Xjl X) X2 ~ Xj2  Zj = DI/2(Xj . including those of the original coordi· nate axes. . the sample variation is homogeneous in all directions.x. equivalently.Summarizing Sample Variation by Principal Components 449 (x. For the sample. .2. n (8~25) . the sample principal components can lie in any two perpendicular directions.4 Sample principal components and ellipses of constant distance. and the data can be adequately approximated by their representations in the space of the retained components. (See Section 8. j = 1.) As we mentioned in the treatment of population components. e. Supplement 8A gives a further result concerning the role of the sample principal components when directly approximating the meancentered data xi. not invariant with respect to changes in scale. are sufficiently small such that the variation in the corresponding directions is negligible.) Finally. Standardizing the Sample Principal Components Sample principal components are. Similarly.4. the last few sample principal components can often be ignored..6 and 8.
Xp (826) ~ vs..u·z)'(z.1)sll su (n .Xp xu .450 Chapter 8 Principal Components The n X p data matrix of standardized observations z= [ ~~ = ZJl [Zll Z12 ··· Z!pl z~2 : · · z:P T Zn Zn I Zn2 x2 Znp X]p. with the matrix R in place of S.1)s2 P ~v.l)s2p Vs....:X1 x12xn ~ X2J...u·z) n n 1 . Spp The sample principal components of the standardized observations are given by (820). Xn2..X2 vs..vs..X1 vs. Xnp. yields the sample mean vector [see (324)) (827) and sample covariance matrix [see (327)] 1 st = n1 =  (z.i.!.1 1 .Xp x2 vs..XnJ. X2p.... Since the observations are already "centered" by construction.1 (n ~vs.JYS.vs. .(Z n..s.. there is no need to write the components in the form of (821).1)s22 S22 1 Vs.li')'(Z ..1 )su (n l)s 1 P % v'. n. (n vs.. (n . l)spp =R (828) l)s1 P (n. vs..l)s12 (n . vs.1 Z'Z (n . (n. .tz') = n.XI vs..!.
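The identities (8-27) and (8-28) are easy to confirm numerically. A minimal Python sketch (numpy assumed; the data matrix X below is a synthetic stand-in, not data from the text):

```python
import numpy as np

# Synthetic n x p data with some cross-correlation, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ np.array([[1., .5, 0., 0.],
                                         [0., 1., .5, 0.],
                                         [0., 0., 1., .5],
                                         [0., 0., 0., 1.]])

# z_jk = (x_jk - xbar_k) / sqrt(s_kk), as in (8-25)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

print(np.allclose(Z.mean(axis=0), 0))              # True: (8-27)
print(np.allclose(Z.T @ Z / (len(X) - 1), R))      # True: S_z = R, as in (8-28)
print(np.isclose(np.trace(R), X.shape[1]))         # total standardized variance = p
```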
If $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n$ are standardized observations with covariance matrix R, the ith sample principal component is

$$\hat{y}_i = \hat{\mathbf{e}}_i'\mathbf{z} = \hat{e}_{i1}z_1 + \hat{e}_{i2}z_2 + \cdots + \hat{e}_{ip}z_p, \qquad i = 1, 2, \ldots, p$$

where $(\hat{\lambda}_i, \hat{\mathbf{e}}_i)$ is the ith eigenvalue-eigenvector pair of R with $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$. Also,

$$\text{Sample variance}(\hat{y}_i) = \hat{\lambda}_i, \quad i = 1, 2, \ldots, p; \qquad \text{Sample covariance}(\hat{y}_i, \hat{y}_k) = 0, \quad i \ne k \tag{8-29}$$

In addition,

$$\text{Total (standardized) sample variance} = \operatorname{tr}(\mathbf{R}) = p = \hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p$$

and

$$r_{\hat{y}_i, z_k} = \hat{e}_{ik}\sqrt{\hat{\lambda}_i}, \qquad i, k = 1, 2, \ldots, p$$

Using (8-29), we see that the proportion of the total sample variance explained by the ith sample principal component is

$$\left(\begin{array}{c}\text{Proportion of (standardized)}\\ \text{sample variance due to ith}\\ \text{sample principal component}\end{array}\right) = \frac{\hat{\lambda}_i}{p}, \qquad i = 1, 2, \ldots, p \tag{8-30}$$

A rule of thumb suggests retaining only those components whose variances $\hat{\lambda}_i$ are greater than unity or, equivalently, only those components which, individually, explain at least a proportion 1/p of the total variance. This rule does not have a great deal of theoretical support, however, and it should not be applied blindly. As we have mentioned, a scree plot is also useful for selecting the appropriate number of components.

Example 8.5 (Sample principal components from standardized data) The weekly rates of return for five stocks (JP Morgan, Citibank, Wells Fargo, Royal Dutch Shell, and ExxonMobil) listed on the New York Stock Exchange were determined for the period January 2004 through December 2005. The weekly rates of return are defined as (current week closing price - previous week closing price)/(previous week closing price), adjusted for stock splits and dividends. The data are listed in Table 8.4 in the Exercises. The observations in 103 successive weeks appear to be independently distributed, but the rates of return across stocks are correlated, because, as one expects, stocks tend to move together in response to general economic conditions.

Let $x_1, x_2, \ldots, x_5$ denote observed weekly rates of return for JP Morgan, Citibank, Wells Fargo, Royal Dutch Shell, and ExxonMobil, respectively. Then

$$\bar{\mathbf{x}}' = [.0011,\ .0007,\ .0016,\ .0040,\ .0040]$$

and

$$\mathbf{R} = \begin{bmatrix} 1.000 & .632 & .511 & .115 & .155 \\ .632 & 1.000 & .574 & .322 & .213 \\ .511 & .574 & 1.000 & .183 & .146 \\ .115 & .322 & .183 & 1.000 & .683 \\ .155 & .213 & .146 & .683 & 1.000 \end{bmatrix}$$

We note that R is the covariance matrix of the standardized observations

$$z_1 = \frac{x_1 - \bar{x}_1}{\sqrt{s_{11}}}, \quad z_2 = \frac{x_2 - \bar{x}_2}{\sqrt{s_{22}}}, \quad \ldots, \quad z_5 = \frac{x_5 - \bar{x}_5}{\sqrt{s_{55}}}$$

The eigenvalues and corresponding normalized eigenvectors of R, determined by a computer, are

$$\begin{aligned} \hat{\lambda}_1 &= 2.437, & \hat{\mathbf{e}}_1' &= [.469, .532, .465, .387, .361] \\ \hat{\lambda}_2 &= 1.407, & \hat{\mathbf{e}}_2' &= [-.368, -.236, -.315, .585, .606] \\ \hat{\lambda}_3 &= .501, & \hat{\mathbf{e}}_3' &= [-.604, -.136, .772, .093, -.109] \\ \hat{\lambda}_4 &= .400, & \hat{\mathbf{e}}_4' &= [.363, -.629, .289, -.381, .493] \\ \hat{\lambda}_5 &= .255, & \hat{\mathbf{e}}_5' &= [.384, -.496, .071, .595, -.498] \end{aligned}$$

Using the standardized variables, we obtain the first two sample principal components:

$$\begin{aligned} \hat{y}_1 &= \hat{\mathbf{e}}_1'\mathbf{z} = .469z_1 + .532z_2 + .465z_3 + .387z_4 + .361z_5 \\ \hat{y}_2 &= \hat{\mathbf{e}}_2'\mathbf{z} = -.368z_1 - .236z_2 - .315z_3 + .585z_4 + .606z_5 \end{aligned}$$

These components, which account for

$$\left(\frac{\hat{\lambda}_1 + \hat{\lambda}_2}{p}\right)100\% = \left(\frac{2.437 + 1.407}{5}\right)100\% = 77\%$$

of the total (standardized) sample variance, have interesting interpretations. The first component is a roughly equally weighted sum, or "index," of the five stocks. This component might be called a general stock-market component, or, simply, a market component.

The second component represents a contrast between the banking stocks (JP Morgan, Citibank, Wells Fargo) and the oil stocks (Royal Dutch Shell, ExxonMobil). It might be called an industry component. Thus, we see that most of the variation in these stock returns is due to market activity and uncorrelated industry activity. This interpretation of stock price behavior also has been suggested by King [12].

The remaining components are not easy to interpret and, collectively, they do not explain much of the total sample variance. In any event, they represent variation that is probably specific to each stock.

Example 8.6 (Components from a correlation matrix with a special structure) Geneticists are often concerned with the inheritance of characteristics that can be measured several times during an animal's lifetime. Body weight (in grams) for n = 150 female mice were obtained immediately after the birth of their first four litters.⁴ The sample mean vector and sample correlation matrix were, respectively,

$$\bar{\mathbf{x}}' = [39.88,\ 45.08,\ 48.11,\ 49.95]$$

and

$$\mathbf{R} = \begin{bmatrix} 1.000 & .7501 & .6329 & .6363 \\ .7501 & 1.000 & .6925 & .7386 \\ .6329 & .6925 & 1.000 & .6625 \\ .6363 & .7386 & .6625 & 1.000 \end{bmatrix}$$

The eigenvalues of this matrix are

$$\hat{\lambda}_1 = 3.058, \quad \hat{\lambda}_2 = .382, \quad \hat{\lambda}_3 = .342, \quad \text{and} \quad \hat{\lambda}_4 = .217$$

We note that the first eigenvalue is nearly equal to $1 + (p - 1)\bar{r} = 1 + (4 - 1)(.6854) = 3.056$, where $\bar{r}$ is the arithmetic average of the off-diagonal elements of R. The remaining eigenvalues are small and about equal, although $\hat{\lambda}_4$ is somewhat smaller than $\hat{\lambda}_2$ and $\hat{\lambda}_3$. Thus, there is some evidence that the corresponding population correlation matrix $\boldsymbol{\rho}$ may be of the "equal-correlation" form of (8-15). This notion is explored further in Example 8.9.

The first principal component

$$\hat{y}_1 = \hat{\mathbf{e}}_1'\mathbf{z} = .49z_1 + .52z_2 + .49z_3 + .50z_4$$

accounts for $100(\hat{\lambda}_1/p)\% = 100(3.058/4)\% = 76\%$ of the total variance. Although the average post-birth weights increase over time, the variation in weights is fairly well explained by the first principal component with (nearly) equal coefficients.

⁴Data courtesy of J. J. Rutledge.

Comment. An unusually small value for the last eigenvalue from either the sample covariance or correlation matrix can indicate an unnoticed linear dependency in the data set. If this occurs, one (or more) of the variables is redundant and should be deleted. Consider a situation where $x_1$, $x_2$, and $x_3$ are subtest scores and the total score $x_4$ is the sum $x_1 + x_2 + x_3$. Then, although the linear combination $\mathbf{e}'\mathbf{x} = [1, 1, 1, -1]\mathbf{x} = x_1 + x_2 + x_3 - x_4$ is always zero, rounding error in the computation of eigenvalues may lead to a small nonzero value. If the linear expression relating $x_4$ to $(x_1, x_2, x_3)$ was initially overlooked, the smallest eigenvalue-eigenvector pair should provide a clue to its existence. (See the discussion in Section 3.4, pages 131-133.) Thus, although "large" eigenvalues and the corresponding eigenvectors are important in a principal component analysis, eigenvalues very close to zero should not be routinely ignored. The eigenvectors associated with these latter eigenvalues may point out linear dependencies in the data set that can cause interpretive and computational problems in a subsequent analysis.
8.4 Graphing the Principal Components

Plots of the principal components can reveal suspect observations, as well as provide checks on the assumption of normality. Since the principal components are linear combinations of the original variables, it is not unreasonable to expect them to be nearly normal. It is often necessary to verify that the first few principal components are approximately normally distributed when they are to be used as the input data for additional analyses.

The last principal components can help pinpoint suspect observations. Each observation can be expressed as a linear combination

$$\mathbf{x}_j = (\mathbf{x}_j'\hat{\mathbf{e}}_1)\hat{\mathbf{e}}_1 + (\mathbf{x}_j'\hat{\mathbf{e}}_2)\hat{\mathbf{e}}_2 + \cdots + (\mathbf{x}_j'\hat{\mathbf{e}}_p)\hat{\mathbf{e}}_p = \hat{y}_{j1}\hat{\mathbf{e}}_1 + \hat{y}_{j2}\hat{\mathbf{e}}_2 + \cdots + \hat{y}_{jp}\hat{\mathbf{e}}_p$$

of the complete set of eigenvectors $\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_p$ of $\mathbf{S}$. Thus, the magnitudes of the last principal components determine how well the first few fit the observations. That is, $\hat{y}_{j1}\hat{\mathbf{e}}_1 + \hat{y}_{j2}\hat{\mathbf{e}}_2 + \cdots + \hat{y}_{j,q-1}\hat{\mathbf{e}}_{q-1}$ differs from $\mathbf{x}_j$ by $\hat{y}_{jq}\hat{\mathbf{e}}_q + \cdots + \hat{y}_{jp}\hat{\mathbf{e}}_p$, the square of whose length is $\hat{y}_{jq}^2 + \cdots + \hat{y}_{jp}^2$. Suspect observations will often be such that at least one of the coordinates $\hat{y}_{jq}, \ldots, \hat{y}_{jp}$ contributing to this squared length will be large. (See Supplement 8A for more general approximation results.)

The following statements summarize these ideas.

1. To help check the normal assumption, construct scatter diagrams for pairs of the first few principal components. Also, make Q–Q plots from the sample values generated by each principal component.
2. Construct scatter diagrams and Q–Q plots for the last few principal components. These help identify suspect observations.

Example 8.7 (Plotting the principal components for the turtle data) We illustrate the plotting of principal components for the data on male turtles discussed in Example 8.4. The three sample principal components are

$$\hat{y}_1 = .683(x_1 - 4.725) + .510(x_2 - 4.478) + .523(x_3 - 3.703)$$
$$\hat{y}_2 = -.159(x_1 - 4.725) - .594(x_2 - 4.478) + .788(x_3 - 3.703)$$
$$\hat{y}_3 = -.713(x_1 - 4.725) + .622(x_2 - 4.478) + .324(x_3 - 3.703)$$

where $x_1 = \ln(\text{length})$, $x_2 = \ln(\text{width})$, and $x_3 = \ln(\text{height})$, respectively.

Figure 8.5 shows the Q–Q plot for $\hat{y}_2$, and Figure 8.6 shows the scatter plot of $(\hat{y}_1, \hat{y}_2)$. The observation for the first turtle is circled; it lies in the lower right corner of the scatter plot and in the upper right corner of the Q–Q plot, so it may be suspect. This point should have been checked for recording errors, or the turtle should have been examined for structural anomalies. Apart from the first turtle, the scatter plot appears to be reasonably elliptical. The plots for the other sets of principal components do not indicate any substantial departures from normality.  •
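Both graphical checks listed above reduce to computing the component scores and then plotting. The Python sketch that follows is ours: placeholder data and the hypothetical array name `X` stand in for the turtle measurements.

```python
import numpy as np
from scipy import stats

# Sketch of the checks in Section 8.4; `X` is a hypothetical n x p data matrix
# (placeholder data here in place of the log turtle measurements).
rng = np.random.default_rng(1)
X = rng.normal(size=(24, 3))

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

scores = (X - X.mean(0)) @ eigvecs       # all p sample principal components

# Q-Q check of a single component against the normal distribution:
y2 = np.sort(scores[:, 1])
n = len(y2)
q = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
# Plot q versus y2; an approximately straight line supports normality.
```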
Figure 8.5 A Q–Q plot for the second principal component $\hat{y}_2$ from the data on male turtles.

Figure 8.6 Scatter plot of the principal components $\hat{y}_1$ and $\hat{y}_2$ of the data on male turtles.

The diagnostics involving principal components apply equally well to the checking of assumptions for a multivariate multiple regression model. In fact, having fit any model by any method of estimation, it is prudent to consider the

Residual vector = (observation vector) − (vector of predicted (estimated) values)

or

$$\hat{\boldsymbol{\varepsilon}}_j = \mathbf{y}_j - \hat{\boldsymbol{\beta}}'\mathbf{z}_j, \qquad j = 1, 2, \ldots, n \tag{8-31}$$

for the multivariate linear model. Principal components, derived from the covariance matrix of the residuals,

$$\frac{1}{n - p} \sum_{j=1}^{n} (\hat{\boldsymbol{\varepsilon}}_j - \bar{\hat{\boldsymbol{\varepsilon}}})(\hat{\boldsymbol{\varepsilon}}_j - \bar{\hat{\boldsymbol{\varepsilon}}})' \tag{8-32}$$

can be scrutinized in the same manner as those determined from a random sample. You should be aware that there are linear dependencies among the residuals from a linear regression analysis, so the last eigenvalues will be zero, within rounding error.
8.5 Large Sample Inferences

We have seen that the eigenvalues and eigenvectors of the covariance (correlation) matrix are the essence of a principal component analysis. The eigenvectors determine the directions of maximum variability, and the eigenvalues specify the variances. When the first few eigenvalues are much larger than the rest, most of the total variance can be "explained" in fewer than p dimensions.

In practice, decisions regarding the quality of the principal component approximation must be made on the basis of the eigenvalue–eigenvector pairs $(\hat{\lambda}_i, \hat{\mathbf{e}}_i)$ extracted from $\mathbf{S}$ or $\mathbf{R}$. Because of sampling variation, these eigenvalues and eigenvectors will differ from their underlying population counterparts. The sampling distributions of $\hat{\lambda}_i$ and $\hat{\mathbf{e}}_i$ are difficult to derive and beyond the scope of this book. If you are interested, you can find some of these derivations for multivariate normal populations in [1], [2], and [5]. We shall simply summarize the pertinent large sample results.

Large Sample Properties of $\hat{\lambda}_i$ and $\hat{\mathbf{e}}_i$

Currently available results concerning large sample confidence intervals for $\hat{\lambda}_i$ and $\hat{\mathbf{e}}_i$ assume that the observations $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are a random sample from a normal population. It must also be assumed that the (unknown) eigenvalues of $\boldsymbol{\Sigma}$ are distinct and positive, so that $\lambda_1 > \lambda_2 > \cdots > \lambda_p > 0$. The one exception is the case where the number of equal eigenvalues is known. Usually the conclusions for distinct eigenvalues are applied, unless there is a strong reason to believe that $\boldsymbol{\Sigma}$ has a special structure that yields equal eigenvalues. Even when the normal assumption is violated, the confidence intervals obtained in this manner still provide some indication of the uncertainty in $\hat{\lambda}_i$ and $\hat{\mathbf{e}}_i$.

Anderson [2] and Girshick [5] have established the following large sample distribution theory for the eigenvalues $\hat{\boldsymbol{\lambda}}' = [\hat{\lambda}_1, \ldots, \hat{\lambda}_p]$ and eigenvectors $\hat{\mathbf{e}}_1, \ldots, \hat{\mathbf{e}}_p$ of $\mathbf{S}$:

1. Let $\boldsymbol{\Lambda}$ be the diagonal matrix of eigenvalues $\lambda_1, \ldots, \lambda_p$ of $\boldsymbol{\Sigma}$; then $\sqrt{n}(\hat{\boldsymbol{\lambda}} - \boldsymbol{\lambda})$ is approximately $N_p(\mathbf{0}, 2\boldsymbol{\Lambda}^2)$.

2. Let

$$\mathbf{E}_i = \lambda_i \sum_{\substack{k=1 \\ k \neq i}}^{p} \frac{\lambda_k}{(\lambda_k - \lambda_i)^2} \, \mathbf{e}_k\mathbf{e}_k'$$

then $\sqrt{n}(\hat{\mathbf{e}}_i - \mathbf{e}_i)$ is approximately $N_p(\mathbf{0}, \mathbf{E}_i)$.

3. Each $\hat{\lambda}_i$ is distributed independently of the elements of the associated $\hat{\mathbf{e}}_i$.

Result 1 implies that, for n large, the $\hat{\lambda}_i$ are independently distributed. Moreover, $\hat{\lambda}_i$ has an approximate $N(\lambda_i, 2\lambda_i^2/n)$ distribution. Using this normal distribution, we obtain $P[\,|\hat{\lambda}_i - \lambda_i| \leq z(\alpha/2)\lambda_i\sqrt{2/n}\,] = 1 - \alpha$. A large sample $100(1 - \alpha)\%$ confidence interval for $\lambda_i$ is thus provided by

$$\frac{\hat{\lambda}_i}{1 + z(\alpha/2)\sqrt{2/n}} \;\leq\; \lambda_i \;\leq\; \frac{\hat{\lambda}_i}{1 - z(\alpha/2)\sqrt{2/n}} \tag{8-33}$$
where $z(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of a standard normal distribution. Bonferroni-type simultaneous $100(1 - \alpha)\%$ intervals for m $\lambda_i$'s are obtained by replacing $z(\alpha/2)$ with $z(\alpha/2m)$. (See Section 5.4.)

Result 2 implies that the $\hat{\mathbf{e}}_i$'s are normally distributed about the corresponding $\mathbf{e}_i$'s for large samples. The elements of each $\hat{\mathbf{e}}_i$ are correlated, and the correlation depends to a large extent on the separation of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ (which is unknown) and the sample size n. Approximate standard errors for the coefficients $\hat{e}_{ik}$ are given by the square roots of the diagonal elements of $(1/n)\hat{\mathbf{E}}_i$, where $\hat{\mathbf{E}}_i$ is derived from $\mathbf{E}_i$ by substituting $\hat{\lambda}$'s for the $\lambda$'s and $\hat{\mathbf{e}}$'s for the $\mathbf{e}$'s.

Example 8.8 (Constructing a confidence interval for $\lambda_1$) We shall obtain a 95% confidence interval for $\lambda_1$, the variance of the first population principal component, using the stock price data listed in Table 8.4 in the Exercises.

Assume that the stock rates of return represent independent drawings from an $N_5(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population, where $\boldsymbol{\Sigma}$ is positive definite with distinct eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_5 > 0$. Since n = 103 is large, we can use (8-33) with i = 1 to construct a 95% confidence interval for $\lambda_1$. From Exercise 8.10, $\hat{\lambda}_1 = .0014$ and, in addition, $z(.025) = 1.96$. Therefore, with 95% confidence,

$$\frac{.0014}{1 + 1.96\sqrt{\tfrac{2}{103}}} \;\leq\; \lambda_1 \;\leq\; \frac{.0014}{1 - 1.96\sqrt{\tfrac{2}{103}}} \qquad \text{or} \qquad .0011 \leq \lambda_1 \leq .0019 \qquad \bullet$$

Whenever an eigenvalue is large, such as 100 or even 1000, the intervals generated by (8-33) can be quite wide, for reasonable confidence levels, even though n is fairly large. In general, the confidence interval gets wider at the same rate that $\hat{\lambda}_i$ gets larger. Consequently, some care must be exercised in dropping or retaining principal components based on an examination of the $\hat{\lambda}_i$'s.

Testing for the Equal Correlation Structure

The special correlation structure $\text{Cov}(X_i, X_k) = \sqrt{\sigma_{ii}\sigma_{kk}}\,\rho$, or $\text{Corr}(X_i, X_k) = \rho$, all $i \neq k$, is one important structure in which the eigenvalues of $\boldsymbol{\Sigma}$ are not distinct and the previous results do not apply.

To test for this structure, let

$$H_0: \boldsymbol{\rho} = \boldsymbol{\rho}_0 = \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{bmatrix} \qquad \text{and} \qquad H_1: \boldsymbol{\rho} \neq \boldsymbol{\rho}_0$$

A test of $H_0$ versus $H_1$ may be based on a likelihood ratio statistic, but Lawley [14] has demonstrated that an equivalent test procedure can be constructed from the off-diagonal elements of $\mathbf{R}$.
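The interval (8-33), together with its Bonferroni modification, takes only a few lines to compute. The Python sketch below is ours; applied with m = 1 it reproduces the interval of Example 8.8.

```python
import numpy as np
from scipy import stats

def eigenvalue_cis(lam_hats, n, alpha=0.05):
    """Bonferroni-type simultaneous intervals for m eigenvalues: the large-
    sample interval (8-33) with z(alpha/2) replaced by z(alpha/(2m))."""
    m = len(lam_hats)
    z = stats.norm.ppf(1 - alpha / (2 * m))
    half = z * np.sqrt(2.0 / n)
    return [(lam / (1 + half), lam / (1 - half)) for lam in lam_hats]

# Example 8.8: n = 103, lam1_hat = 0.0014 gives roughly (0.0011, 0.0019).
print(eigenvalue_cis([0.0014], n=103))
```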
Lawley's procedure requires the quantities

$$\bar{r}_k = \frac{1}{p-1} \sum_{\substack{i=1 \\ i \neq k}}^{p} r_{ik}, \quad k = 1, 2, \ldots, p; \qquad \bar{r} = \frac{2}{p(p-1)} \sum_{i<k} \sum r_{ik} \tag{8-34}$$

$$\hat{\gamma} = \frac{(p-1)^2\,[\,1 - (1-\bar{r})^2\,]}{p - (p-2)(1-\bar{r})^2}$$

It is evident that $\bar{r}_k$ is the average of the off-diagonal elements in the kth column (or row) of $\mathbf{R}$ and $\bar{r}$ is the overall average of the off-diagonal elements.

The large sample approximate $\alpha$-level test is to reject $H_0$ in favor of $H_1$ if

$$T = \frac{(n-1)}{(1-\bar{r})^2} \left[\, \sum_{i<k} \sum (r_{ik} - \bar{r})^2 - \hat{\gamma} \sum_{k=1}^{p} (\bar{r}_k - \bar{r})^2 \,\right] > \chi^2_{(p+1)(p-2)/2}(\alpha) \tag{8-35}$$

where $\chi^2_{(p+1)(p-2)/2}(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution with $(p+1)(p-2)/2$ d.f.

Example 8.9 (Testing for equicorrelation structure) From Example 8.6, the sample correlation matrix constructed from the n = 150 post-birth weights of female mice is

$$\mathbf{R} = \begin{bmatrix} 1.0 & .7501 & .6329 & .6363 \\ .7501 & 1.0 & .6925 & .7386 \\ .6329 & .6925 & 1.0 & .6625 \\ .6363 & .7386 & .6625 & 1.0 \end{bmatrix}$$

We shall use this correlation matrix to illustrate the large sample test in (8-35). Here p = 4, and we set

$$H_0: \boldsymbol{\rho} = \boldsymbol{\rho}_0 = \begin{bmatrix} 1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{bmatrix}, \qquad H_1: \boldsymbol{\rho} \neq \boldsymbol{\rho}_0$$

Using (8-34) and (8-35), we obtain

$$\bar{r}_1 = \tfrac{1}{3}(.7501 + .6329 + .6363) = .6731, \qquad \bar{r}_2 = .7271, \qquad \bar{r}_3 = .6626, \qquad \bar{r}_4 = .6791$$

$$\bar{r} = \frac{2}{4(3)}(.7501 + .6329 + .6363 + .6925 + .7386 + .6625) = .6855$$

and

$$\sum_{i<k} \sum (r_{ik} - \bar{r})^2 = (.7501 - .6855)^2 + (.6329 - .6855)^2 + \cdots + (.6625 - .6855)^2 = .01277$$
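The quantities in (8-34) and (8-35) are mechanical to compute. The following Python sketch is ours; applied to the matrix of Example 8.9 it reproduces $T \approx 11.4$ and the 5% cutoff 11.07.

```python
import numpy as np
from scipy import stats

def lawley_equicorrelation_test(R, n, alpha=0.05):
    """Sketch of Lawley's test (8-34)-(8-35) for H0: equal correlations."""
    p = R.shape[0]
    off = R[np.triu_indices(p, k=1)]            # off-diagonal elements r_ik
    rbar = off.mean()                           # overall average
    rbar_k = (R.sum(axis=0) - 1.0) / (p - 1)    # column averages
    gamma = (p - 1) ** 2 * (1 - (1 - rbar) ** 2) / (p - (p - 2) * (1 - rbar) ** 2)
    T = ((n - 1) / (1 - rbar) ** 2) * (
        np.sum((off - rbar) ** 2) - gamma * np.sum((rbar_k - rbar) ** 2)
    )
    df = (p + 1) * (p - 2) // 2
    return T, stats.chi2.ppf(1 - alpha, df)

R = np.array([[1.0, .7501, .6329, .6363],
              [.7501, 1.0, .6925, .7386],
              [.6329, .6925, 1.0, .6625],
              [.6363, .7386, .6625, 1.0]])
print(lawley_equicorrelation_test(R, n=150))    # about (11.4, 11.07)
```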
$$\sum_{k=1}^{4} (\bar{r}_k - \bar{r})^2 = (.6731 - .6855)^2 + \cdots + (.6791 - .6855)^2 = .00245$$

$$\hat{\gamma} = \frac{(4-1)^2\,[\,1 - (1 - .6855)^2\,]}{4 - (4-2)(1 - .6855)^2} = 2.1329$$

and

$$T = \frac{(150-1)}{(1-.6855)^2}\,[\,.01277 - (2.1329)(.00245)\,] = 11.4$$

Since $(p+1)(p-2)/2 = 5(2)/2 = 5$, the 5% critical value for the test in (8-35) is $\chi^2_5(.05) = 11.07$. The value of our test statistic is approximately equal to the large sample 5% critical point, so the evidence against $H_0$ (equal correlations) is strong, but not overwhelming.

As we saw in Example 8.6, the smallest eigenvalues $\hat{\lambda}_2$, $\hat{\lambda}_3$, and $\hat{\lambda}_4$ are slightly different, with $\hat{\lambda}_4$ being somewhat smaller than the other two. Consequently, with the large sample size in this problem, small differences from the equal correlation structure show up as statistically significant.  •

Assuming a multivariate normal population, a large sample test that all variables are independent (all the off-diagonal elements of $\boldsymbol{\Sigma}$ are zero) is contained in Exercise 8.9.

8.6 Monitoring Quality with Principal Components

In Section 5.6, we introduced multivariate control charts, including the quality ellipse and the $T^2$ chart. Today, with electronic and other automated methods of data collection, it is not uncommon for data to be collected on 10 or 20 process variables. Major chemical and drug companies report measuring over 100 process variables, including temperature, pressure, concentration, and weight, at various positions along the production process. Even with 10 variables to monitor, there are 45 pairs for which to create quality ellipses. Consequently, another approach is required to both visually display important quantities and still have the sensitivity to detect special causes of variation.

Checking a Given Set of Measurements for Stability

Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. We consider the first two sample principal components, $\hat{y}_{j1} = \hat{\mathbf{e}}_1'(\mathbf{x}_j - \bar{\mathbf{x}})$ and $\hat{y}_{j2} = \hat{\mathbf{e}}_2'(\mathbf{x}_j - \bar{\mathbf{x}})$. Additional principal components could be considered, but two are easier to inspect visually and, of any two components, the first two explain the largest cumulative proportion of the total sample variance.

If a process is stable over time, so that the measured characteristics are influenced only by variations in common causes, then the values of the first two principal components should be stable. Conversely, if the principal components remain stable over time, the common effects that influence the process are likely to remain constant. To monitor quality using principal components, we consider a two-part procedure. The first part of the procedure is to construct an ellipse format chart for the pairs of values $(\hat{y}_{j1}, \hat{y}_{j2})$ for $j = 1, 2, \ldots, n$.
By (8-20), the sample variance of the first principal component $\hat{y}_1$ is given by the largest eigenvalue $\hat{\lambda}_1$, and the sample variance of the second principal component $\hat{y}_2$ is the second-largest eigenvalue $\hat{\lambda}_2$. The two sample components are uncorrelated, so the quality ellipse for n large (see Section 5.6) reduces to the collection of pairs of possible values $(\hat{y}_1, \hat{y}_2)$ such that

$$\frac{\hat{y}_1^2}{\hat{\lambda}_1} + \frac{\hat{y}_2^2}{\hat{\lambda}_2} \leq \chi_2^2(\alpha) \tag{8-36}$$

Example 8.10 (An ellipse format chart based on the first two principal components) Refer to the police department overtime data given in Table 5.8. Table 8.1 contains the five normalized eigenvectors and eigenvalues of the sample covariance matrix $\mathbf{S}$. The first two sample components explain 82% of the total variance. The sample values for all five components are displayed in Table 8.2.

Table 8.1 Eigenvectors and Eigenvalues of Sample Covariance Matrix for Police Department Data. [Rows: appearances overtime ($x_1$), extraordinary event ($x_2$), holdover hours ($x_3$), COA hours ($x_4$), meeting hours ($x_5$); columns: $\hat{\mathbf{e}}_1, \ldots, \hat{\mathbf{e}}_5$ and the eigenvalues $\hat{\lambda}_i$.]

Table 8.2 Values of the Principal Components for the Police Department Data. [Rows: periods 1–16; columns: $\hat{y}_{j1}, \hat{y}_{j2}, \hat{y}_{j3}, \hat{y}_{j4}, \hat{y}_{j5}$.]
Let us construct a 95% ellipse format chart using the first two sample principal components and plot the 16 pairs of component values in Table 8.2.

Although n = 16 is not large, we use $\chi_2^2(.05) = 5.99$, and the ellipse becomes

$$\frac{\hat{y}_1^2}{\hat{\lambda}_1} + \frac{\hat{y}_2^2}{\hat{\lambda}_2} \leq 5.99$$

This ellipse, centered at (0, 0), is shown in Figure 8.7, along with the data.

Figure 8.7 The 95% control ellipse based on the first two principal components of overtime hours.

One point is out of control, because the second principal component for this point has a large value. Scanning Table 8.2, we see that this is the value 3936.9 for period 11. According to the entries of $\hat{\mathbf{e}}_2$ in Table 8.1, the second principal component is essentially extraordinary event overtime hours. The principal component approach has led us to the same conclusion we came to in Example 5.9.  •

In the event that special causes are likely to produce shocks to the system, a second chart is required. This chart is created from the information in the principal components not involved in the ellipse format chart.

Consider the deviation vector $\mathbf{X} - \boldsymbol{\mu}$, and assume that $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Even without the normal assumption, $\mathbf{X} - \boldsymbol{\mu}$ can be expressed as the sum of its projections on the eigenvectors of $\boldsymbol{\Sigma}$:

$$\mathbf{X} - \boldsymbol{\mu} = (\mathbf{X} - \boldsymbol{\mu})'\mathbf{e}_1\mathbf{e}_1 + (\mathbf{X} - \boldsymbol{\mu})'\mathbf{e}_2\mathbf{e}_2 + (\mathbf{X} - \boldsymbol{\mu})'\mathbf{e}_3\mathbf{e}_3 + \cdots + (\mathbf{X} - \boldsymbol{\mu})'\mathbf{e}_p\mathbf{e}_p$$
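The ellipse-format check amounts to flagging any period whose pair of scores violates (8-36). The Python sketch below is ours; placeholder data and the hypothetical array name `X` stand in for the 16 periods of overtime hours.

```python
import numpy as np
from scipy import stats

# Sketch of the ellipse-format check (8-36); `X` is a hypothetical n x p
# matrix of overtime hours (placeholder data here).
rng = np.random.default_rng(3)
X = rng.normal(size=(16, 5))

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = (X - X.mean(0)) @ eigvecs
d2 = scores[:, 0] ** 2 / eigvals[0] + scores[:, 1] ** 2 / eigvals[1]
print(np.where(d2 > stats.chi2.ppf(0.95, df=2))[0])   # out-of-control periods
```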
or

$$\mathbf{X} - \boldsymbol{\mu} = Y_1\mathbf{e}_1 + Y_2\mathbf{e}_2 + Y_3\mathbf{e}_3 + \cdots + Y_p\mathbf{e}_p \tag{8-37}$$

where $Y_i = (\mathbf{X} - \boldsymbol{\mu})'\mathbf{e}_i$ is the population ith principal component centered to have mean 0. The approximation to $\mathbf{X} - \boldsymbol{\mu}$ by the first two principal components has the form $Y_1\mathbf{e}_1 + Y_2\mathbf{e}_2$. This leaves an unexplained component of

$$\mathbf{X} - \boldsymbol{\mu} - Y_1\mathbf{e}_1 - Y_2\mathbf{e}_2$$

Let $\mathbf{E} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p]$ be the orthogonal matrix whose columns are the eigenvectors of $\boldsymbol{\Sigma}$. The orthogonal transformation $\mathbf{E}'$ applied to the unexplained part yields the vector of the last p − 2 centered principal components, so the last p − 2 principal components are obtained as an orthogonal transformation of the approximation errors. Rather than base the $T^2$ chart on the approximation errors, we can, equivalently, base it on these last principal components. Recall that $\text{Var}(Y_i) = \lambda_i$ for $i = 1, 2, \ldots, p$ and $\text{Cov}(Y_i, Y_k) = 0$ for $i \neq k$. Consequently, the statistic $\mathbf{Y}_{(2)}'\boldsymbol{\Sigma}_{\mathbf{Y}_{(2)},\mathbf{Y}_{(2)}}^{-1}\mathbf{Y}_{(2)}$, based on the last p − 2 population principal components, becomes

$$\frac{Y_3^2}{\lambda_3} + \frac{Y_4^2}{\lambda_4} + \cdots + \frac{Y_p^2}{\lambda_p} \tag{8-38}$$

This is just the sum of the squares of p − 2 independent standard normal variables, $\lambda_k^{-1/2}Y_k$, and so has a chi-square distribution with p − 2 degrees of freedom.

In terms of the sample data, the principal components and eigenvalues must be estimated. Because the coefficients of the linear combinations $\hat{\mathbf{e}}_i$ are also estimates, the principal components do not have a normal distribution even when the population is normal. However, it is customary to create a $T^2$ chart based on the statistic

$$T_j^2 = \frac{\hat{y}_{j3}^2}{\hat{\lambda}_3} + \frac{\hat{y}_{j4}^2}{\hat{\lambda}_4} + \cdots + \frac{\hat{y}_{jp}^2}{\hat{\lambda}_p}$$

which involves the estimated eigenvalues and vectors. This $T^2$ statistic is based on high-dimensional data. For example, when p = 20 variables are measured, it uses the information in the 18-dimensional space perpendicular to the first two eigenvectors $\hat{\mathbf{e}}_1$ and $\hat{\mathbf{e}}_2$. Still, this $T^2$ based on the unexplained variation in the original observations is reported as highly effective in picking up special causes of variation. Because the statistic does not have an exact chi-square distribution, it is usual to appeal to the large sample approximation described by (8-38) and set the upper control limit of the $T^2$ chart as $\text{UCL} = c^2 = \chi_{p-2}^2(\alpha)$.
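The $T^2$ statistic above is a one-line computation once the component scores are available. The Python sketch below is ours and again uses placeholder data in place of the overtime hours.

```python
import numpy as np
from scipy import stats

# Sketch of the T^2 chart built from the last p - 2 sample principal
# components, as in (8-38); placeholder data stand in for the real series.
rng = np.random.default_rng(4)
X = rng.normal(size=(16, 5))

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = (X - X.mean(0)) @ eigvecs

T2 = (scores[:, 2:] ** 2 / eigvals[2:]).sum(axis=1)   # components 3, ..., p
UCL = stats.chi2.ppf(0.95, df=X.shape[1] - 2)          # chi2_3(.05) = 7.81 here
print(np.where(T2 > UCL)[0])                           # flagged periods
```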
Example 8.11 (A $T^2$ chart for the unexplained [orthogonal] overtime hours) Consider the quality control analysis of the police department overtime hours in Example 8.10. The first part of the quality monitoring procedure, the quality ellipse based on the first two principal components, was shown in Figure 8.7. To illustrate the second step of the two-part monitoring procedure, we create the chart for the other principal components.

Since p = 5, this chart is based on 5 − 2 = 3 dimensions, and the upper control limit is $\chi_3^2(.05) = 7.81$. Using the eigenvalues and the values of the principal components, given in Example 8.10, we plot the time sequence of values

$$T_j^2 = \frac{\hat{y}_{j3}^2}{\hat{\lambda}_3} + \frac{\hat{y}_{j4}^2}{\hat{\lambda}_4} + \frac{\hat{y}_{j5}^2}{\hat{\lambda}_5}$$

where the first value is $T^2 = .891$ and so on. The $T^2$ chart is shown in Figure 8.8.

Figure 8.8 A $T^2$ chart based on the last three principal components of overtime hours.

Since points 12 and 13 exceed or are near the upper control limit, something has happened during these periods. We note that they are just beyond the period in which the extraordinary event overtime hours peaked. From Table 8.2, $\hat{y}_{j3}$ is large in period 12, and, from Table 8.1, the large coefficients in $\hat{\mathbf{e}}_3$ belong to legal appearances, holdover, and COA hours. Was there some adjusting of these other categories following the period in which extraordinary hours peaked?  •

Controlling Future Values

Previously, we considered checking whether a given series of multivariate observations was stable by considering separately the first two principal components and then the last p − 2.
Because the chi-square distribution was used to approximate the UCL of the $T^2$ chart and the critical distance for the ellipse format chart, no further modifications are necessary for monitoring future values.

Example 8.12 (Control ellipse for future principal components) In Example 8.10, we determined that case 11 was out of control. We drop this point and recalculate the eigenvalues and eigenvectors based on the covariance of the remaining 15 observations. The results are shown in Table 8.3.

Table 8.3 Eigenvectors and Eigenvalues from the 15 Stable Observations. [Rows: appearances overtime ($x_1$), extraordinary event ($x_2$), holdover hours ($x_3$), COA hours ($x_4$), meeting hours ($x_5$); columns: $\hat{\mathbf{e}}_1, \ldots, \hat{\mathbf{e}}_5$ and the eigenvalues $\hat{\lambda}_i$.]

The principal components have changed. The component consisting primarily of extraordinary event overtime is now the third principal component and is not included in the chart of the first two. Because our initial sample size is only 16, dropping a single case can make a substantial difference. Usually, at least 50 or more observations are needed, from stable operation of the process, in order to set future limits.

Figure 8.9 gives the 99% prediction (8-36) ellipse for future pairs of values for the new first two principal components of overtime. The 15 stable pairs of principal components are also shown.  •

Figure 8.9 A 99% ellipse format chart for the first two principal components of future values of overtime.
In some applications of multivariate control in the chemical and pharmaceutical industries, more than 100 variables are monitored simultaneously. These include numerous process variables as well as quality variables. Typically, the space orthogonal to the first few principal components has a dimension greater than 100, and some of the eigenvalues are very small. An alternative approach (see [13]) to constructing a control chart, one that avoids the difficulty caused by dividing a small squared principal component by a very small eigenvalue, has been successfully applied. To implement this approach, we proceed as follows.

For each stable observation, take the sum of squares of its unexplained component

$$d_{U,j}^2 = (\mathbf{x}_j - \bar{\mathbf{x}} - \hat{y}_{j1}\hat{\mathbf{e}}_1 - \hat{y}_{j2}\hat{\mathbf{e}}_2)'(\mathbf{x}_j - \bar{\mathbf{x}} - \hat{y}_{j1}\hat{\mathbf{e}}_1 - \hat{y}_{j2}\hat{\mathbf{e}}_2)$$

Note that, by inserting $\hat{\mathbf{E}}\hat{\mathbf{E}}' = \mathbf{I}$, we also have

$$d_{U,j}^2 = \hat{y}_{j3}^2 + \hat{y}_{j4}^2 + \cdots + \hat{y}_{jp}^2$$

which is just the sum of the squares of the neglected principal components.

Using either form, the $d_{U,j}^2$ are plotted versus j to create a control chart. The lower limit of the chart is 0, and the upper limit is set by approximating the distribution of $d_{U,j}^2$ as the distribution of a constant c times a chi-square random variable with $\nu$ degrees of freedom. For the chi-square approximation, the constant c and degrees of freedom $\nu$ are chosen to match the sample mean and variance of the $d_{U,j}^2$, $j = 1, 2, \ldots, n$. In particular, since a $c\chi_\nu^2$ variable has mean $c\nu$ and variance $2c^2\nu$, we set

$$\bar{d}_U^2 = \frac{1}{n}\sum_{j=1}^{n} d_{U,j}^2 = c\nu, \qquad s^2 = \frac{1}{n-1}\sum_{j=1}^{n} \left(d_{U,j}^2 - \bar{d}_U^2\right)^2 = 2c^2\nu$$

and determine $c = s^2/(2\bar{d}_U^2)$ and $\nu = 2(\bar{d}_U^2)^2/s^2$. The upper control limit is then $c\,\chi_\nu^2(\alpha)$, where $\alpha = .05$ or .01.
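The moment-matching step is the only nonstandard piece of this chart. The short Python sketch below is ours; `d2` is a hypothetical array of the $d_{U,j}^2$ values from the stable observations.

```python
import numpy as np
from scipy import stats

def d2_chart_limit(d2, alpha=0.05):
    """Upper limit c * chi2_v(alpha), with c and v chosen so that c * chi2_v
    matches the sample mean (= c v) and variance (= 2 c^2 v) of d2."""
    m, v = d2.mean(), d2.var(ddof=1)
    c = v / (2.0 * m)
    nu = 2.0 * m ** 2 / v           # chi2.ppf accepts non-integer df
    return c * stats.chi2.ppf(1 - alpha, df=nu)
```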
Supplement 8A

THE GEOMETRY OF THE SAMPLE PRINCIPAL COMPONENT APPROXIMATION

In this supplement, we shall present interpretations for approximations to the data based on the first r sample principal components. The interpretations of both the p-dimensional scatter plot and the n-dimensional representation rely on the algebraic result that follows. We consider approximations of the form $\mathbf{A}_{(n \times p)} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]'$ to the mean-corrected data matrix

$$[\mathbf{x}_1 - \bar{\mathbf{x}}, \; \mathbf{x}_2 - \bar{\mathbf{x}}, \; \ldots, \; \mathbf{x}_n - \bar{\mathbf{x}}]'$$

The error of approximation is quantified as the sum of the np squared errors

$$\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{a}_j)'(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{a}_j) = \sum_{j=1}^{n} \sum_{i=1}^{p} (x_{ji} - \bar{x}_i - a_{ji})^2 \tag{8A-1}$$

Result 8A.1. Let $\mathbf{A}_{(n \times p)}$ be any matrix with $\text{rank}(\mathbf{A}) \leq r < \min(p, n)$. Let $\hat{\mathbf{E}}_r = [\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_r]$, where $\hat{\mathbf{e}}_i$ is the ith eigenvector of $\mathbf{S}$. The error of approximation sum of squares in (8A-1) is minimized by the choice

$$\hat{\mathbf{A}} = \left[\hat{\mathbf{E}}_r\hat{\mathbf{E}}_r'(\mathbf{x}_1 - \bar{\mathbf{x}}), \; \hat{\mathbf{E}}_r\hat{\mathbf{E}}_r'(\mathbf{x}_2 - \bar{\mathbf{x}}), \; \ldots, \; \hat{\mathbf{E}}_r\hat{\mathbf{E}}_r'(\mathbf{x}_n - \bar{\mathbf{x}})\right]'$$

so the jth column of its transpose $\hat{\mathbf{A}}'$ is

$$\hat{\mathbf{a}}_j = \hat{y}_{j1}\hat{\mathbf{e}}_1 + \hat{y}_{j2}\hat{\mathbf{e}}_2 + \cdots + \hat{y}_{jr}\hat{\mathbf{e}}_r$$

where $[\hat{y}_{j1}, \hat{y}_{j2}, \ldots, \hat{y}_{jr}]' = [\hat{\mathbf{e}}_1'(\mathbf{x}_j - \bar{\mathbf{x}}), \hat{\mathbf{e}}_2'(\mathbf{x}_j - \bar{\mathbf{x}}), \ldots, \hat{\mathbf{e}}_r'(\mathbf{x}_j - \bar{\mathbf{x}})]'$ are the values of the first r sample principal components for the jth unit. Moreover,

$$\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \hat{\mathbf{a}}_j)'(\mathbf{x}_j - \bar{\mathbf{x}} - \hat{\mathbf{a}}_j) = (n-1)(\hat{\lambda}_{r+1} + \cdots + \hat{\lambda}_p)$$

where $\hat{\lambda}_{r+1} \geq \cdots \geq \hat{\lambda}_p$ are the smallest eigenvalues of $\mathbf{S}$.
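Both claims of Result 8A.1 are easy to verify numerically on simulated data. The Python sketch below is ours; the two printed values agree to rounding error.

```python
import numpy as np

# Numeric check of Result 8A.1: the approximation built from the first r
# sample principal components attains error sum of squares
# (n - 1)(lambda_{r+1} + ... + lambda_p).
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, E_r = eigvals[order], eigvecs[:, order][:, :2]   # r = 2

A = Xc @ E_r @ E_r.T                          # rows are y_j1 e1' + y_j2 e2'
print(np.sum((Xc - A) ** 2))                  # error sum of squares (8A-1)
print((X.shape[0] - 1) * eigvals[2:].sum())   # (n - 1)(lambda_3 + lambda_4)
```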
Proof. Consider first any $\mathbf{A}$ whose transpose $\mathbf{A}'$ has columns $\mathbf{a}_j$ that are a linear combination of a fixed set of r perpendicular vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r$, so that $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r]$ satisfies $\mathbf{U}'\mathbf{U} = \mathbf{I}$. For fixed $\mathbf{U}$, $\mathbf{x}_j - \bar{\mathbf{x}}$ is best approximated by its projection on the space spanned by $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r$ (see Result 2A.3), or

$$(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{u}_1\mathbf{u}_1 + (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{u}_2\mathbf{u}_2 + \cdots + (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{u}_r\mathbf{u}_r = \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) \tag{8A-2}$$

This follows because, for an arbitrary vector $\mathbf{b}_j$,

$$\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j = \mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) + \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) - \mathbf{U}\mathbf{b}_j = (\mathbf{I} - \mathbf{U}\mathbf{U}')(\mathbf{x}_j - \bar{\mathbf{x}}) + \mathbf{U}\left(\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) - \mathbf{b}_j\right)$$

so the error sum of squares is

$$(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j)'(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j) = (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{I} - \mathbf{U}\mathbf{U}')(\mathbf{x}_j - \bar{\mathbf{x}}) + 0 + \left(\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) - \mathbf{b}_j\right)'\left(\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) - \mathbf{b}_j\right)$$

where the cross product vanishes because $(\mathbf{I} - \mathbf{U}\mathbf{U}')\mathbf{U} = \mathbf{U} - \mathbf{U}\mathbf{U}'\mathbf{U} = \mathbf{U} - \mathbf{U} = \mathbf{0}$. The last term is positive unless $\mathbf{b}_j$ is chosen so that $\mathbf{b}_j = \mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})$, in which case $\mathbf{U}\mathbf{b}_j = \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})$ is the projection of $\mathbf{x}_j - \bar{\mathbf{x}}$ on the plane.

Further, with the choice $\mathbf{a}_j = \mathbf{U}\mathbf{b}_j = \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})$, (8A-1) becomes

$$\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{I} - \mathbf{U}\mathbf{U}')(\mathbf{x}_j - \bar{\mathbf{x}}) = \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{x}_j - \bar{\mathbf{x}}) - \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) \tag{8A-3}$$

We are now in a position to minimize the error over choices of $\mathbf{U}$ by maximizing the last term in (8A-3). By the properties of trace (see Result 2A.12),

$$\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) = \sum_{j=1}^{n} \text{tr}\left[\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right] = (n-1)\,\text{tr}[\mathbf{U}\mathbf{U}'\mathbf{S}] = (n-1)\,\text{tr}[\mathbf{U}'\mathbf{S}\mathbf{U}] \tag{8A-4}$$
That is, the best choice for $\mathbf{U}$ maximizes the sum of the diagonal elements of $\mathbf{U}'\mathbf{S}\mathbf{U}$. From (8-19), selecting $\mathbf{u}_1$ to maximize $\mathbf{u}_1'\mathbf{S}\mathbf{u}_1$, the first diagonal element of $\mathbf{U}'\mathbf{S}\mathbf{U}$, gives $\mathbf{u}_1 = \hat{\mathbf{e}}_1$. For $\mathbf{u}_2$ perpendicular to $\hat{\mathbf{e}}_1$, $\mathbf{u}_2'\mathbf{S}\mathbf{u}_2$ is maximized by $\hat{\mathbf{e}}_2$. [See (2-52).] Continuing, we find that $\hat{\mathbf{U}} = [\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_r] = \hat{\mathbf{E}}_r$ and $\hat{\mathbf{A}}' = \hat{\mathbf{E}}_r\hat{\mathbf{E}}_r'[\mathbf{x}_1 - \bar{\mathbf{x}}, \mathbf{x}_2 - \bar{\mathbf{x}}, \ldots, \mathbf{x}_n - \bar{\mathbf{x}}]$, as asserted.

With this choice, the ith diagonal element of $\hat{\mathbf{U}}'\mathbf{S}\hat{\mathbf{U}}$ is $\hat{\mathbf{e}}_i'\mathbf{S}\hat{\mathbf{e}}_i = \hat{\mathbf{e}}_i'(\hat{\lambda}_i\hat{\mathbf{e}}_i) = \hat{\lambda}_i$, so $\text{tr}[\hat{\mathbf{U}}'\mathbf{S}\hat{\mathbf{U}}] = \hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_r$. Also,

$$\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'(\mathbf{x}_j - \bar{\mathbf{x}}) = \text{tr}\left[\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\right] = (n-1)\,\text{tr}(\mathbf{S}) = (n-1)(\hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p)$$

Let $\mathbf{U} = \hat{\mathbf{U}}$ in (8A-3), and the error bound follows.  ∎

The p-Dimensional Geometrical Interpretation

The geometrical interpretations involve the determination of best approximating planes to the p-dimensional scatter plot. The plane through the origin, determined by $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r$, consists of all points $\mathbf{x}$ with

$$\mathbf{x} = b_1\mathbf{u}_1 + b_2\mathbf{u}_2 + \cdots + b_r\mathbf{u}_r = \mathbf{U}\mathbf{b} \qquad \text{for some } \mathbf{b}$$

This plane, translated to pass through $\mathbf{a}$, becomes $\mathbf{a} + \mathbf{U}\mathbf{b}$ for some $\mathbf{b}$.

We want to select the r-dimensional plane $\mathbf{a} + \mathbf{U}\mathbf{b}$ that minimizes the sum of squared distances $\sum_{j=1}^n d_j^2$ between the observations $\mathbf{x}_j$ and the plane. If $\mathbf{x}_j$ is approximated by $\mathbf{a} + \mathbf{U}\mathbf{b}_j$ with $\sum_{j=1}^n \mathbf{b}_j = \mathbf{0}$,⁵ then

$$\sum_{j=1}^{n} (\mathbf{x}_j - \mathbf{a} - \mathbf{U}\mathbf{b}_j)'(\mathbf{x}_j - \mathbf{a} - \mathbf{U}\mathbf{b}_j) = \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j + \bar{\mathbf{x}} - \mathbf{a})'(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j + \bar{\mathbf{x}} - \mathbf{a})$$
$$= \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j)'(\mathbf{x}_j - \bar{\mathbf{x}} - \mathbf{U}\mathbf{b}_j) + n(\bar{\mathbf{x}} - \mathbf{a})'(\bar{\mathbf{x}} - \mathbf{a}) \;\geq\; (n-1)(\hat{\lambda}_{r+1} + \cdots + \hat{\lambda}_p)$$

by Result 8A.1, since $[\mathbf{U}\mathbf{b}_1, \ldots, \mathbf{U}\mathbf{b}_n] = \mathbf{A}'$ has $\text{rank}(\mathbf{A}) \leq r$. The lower bound is reached by taking $\mathbf{a} = \bar{\mathbf{x}}$, so the plane passes through the sample mean. This plane is determined by $\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \ldots, \hat{\mathbf{e}}_r$. The coefficients of $\hat{\mathbf{e}}_k$ are $\hat{\mathbf{e}}_k'(\mathbf{x}_j - \bar{\mathbf{x}}) = \hat{y}_{jk}$, the kth sample principal component evaluated at the jth observation.

An alternative interpretation can be given. The investigator places a plane through $\bar{\mathbf{x}}$ and moves it about to obtain the largest spread among the shadows of the

⁵ If $\sum_{j=1}^n \mathbf{b}_j = n\bar{\mathbf{b}} \neq \mathbf{0}$, use $\mathbf{a} + \mathbf{U}\mathbf{b}_j = (\mathbf{a} + \mathbf{U}\bar{\mathbf{b}}) + \mathbf{U}(\mathbf{b}_j - \bar{\mathbf{b}}) = \mathbf{a}^* + \mathbf{U}\mathbf{b}_j^*$.
observations. From (8A-2), the projection of the deviation $\mathbf{x}_j - \bar{\mathbf{x}}$ on the plane $\mathbf{U}\mathbf{b}$ is $\mathbf{v}_j = \mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}})$. Now, $\bar{\mathbf{v}} = \mathbf{0}$, and the sum of the squared lengths of the projection deviations

$$\sum_{j=1}^{n} \mathbf{v}_j'\mathbf{v}_j = \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{U}\mathbf{U}'(\mathbf{x}_j - \bar{\mathbf{x}}) = (n-1)\,\text{tr}[\mathbf{U}'\mathbf{S}\mathbf{U}]$$

is maximized by $\mathbf{U} = \hat{\mathbf{E}}_r$. Also, since $\bar{\mathbf{v}} = \mathbf{0}$,

$$(n-1)\mathbf{S}_v = \sum_{j=1}^{n} (\mathbf{v}_j - \bar{\mathbf{v}})(\mathbf{v}_j - \bar{\mathbf{v}})' = \sum_{j=1}^{n} \mathbf{v}_j\mathbf{v}_j'$$

and this plane also maximizes the total variance

$$\text{tr}(\mathbf{S}_v) = \frac{1}{n-1}\,\text{tr}\left[\sum_{j=1}^{n} \mathbf{v}_j\mathbf{v}_j'\right] = \frac{1}{n-1} \sum_{j=1}^{n} \mathbf{v}_j'\mathbf{v}_j$$

Figure 8.10 The r = 2-dimensional plane that approximates the scatter plot by minimizing $\sum_{j=1}^n d_j^2$.

The n-Dimensional Geometrical Interpretation

Let us now consider, by columns, the approximation of the mean-centered data matrix by $\mathbf{A}$. For r = 1, the ith column $[x_{1i} - \bar{x}_i, x_{2i} - \bar{x}_i, \ldots, x_{ni} - \bar{x}_i]'$ is approximated by a multiple $c_i\mathbf{b}$ of a fixed vector $\mathbf{b}' = [b_1, b_2, \ldots, b_n]$. The square of the length of the error of approximation is

$$L_i^2 = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i - c_i b_j)^2$$

Considering $\mathbf{A}$ to be of rank one, we conclude from Result 8A.1 that
$$\hat{\mathbf{A}} = \hat{\mathbf{y}}_{(1)}\hat{\mathbf{e}}_1'$$

minimizes the sum of squared lengths $\sum_{i=1}^p L_i^2$. That is, the best direction is determined by the vector of values of the first principal component. This is illustrated in Figure 8.11(a). Note that the longer deviation vectors (the larger $s_{ii}$'s) have the most influence on the minimization of $\sum_{i=1}^p L_i^2$.

If the variables are first standardized, the resulting vector $[(x_{1i} - \bar{x}_i)/\sqrt{s_{ii}}, (x_{2i} - \bar{x}_i)/\sqrt{s_{ii}}, \ldots, (x_{ni} - \bar{x}_i)/\sqrt{s_{ii}}]$ has length $\sqrt{n-1}$ for all variables, and each vector exerts equal influence on the choice of direction. [See Figure 8.11(b).]

Figure 8.11 The first sample principal component, $\hat{y}_1$, minimizes the sum of the squares of the distances, $L_i^2$, from the deviation vectors, $\mathbf{d}_i' = [x_{1i} - \bar{x}_i, x_{2i} - \bar{x}_i, \ldots, x_{ni} - \bar{x}_i]$, to a line: (a) principal component of S; (b) principal component of R.

In either case, the vector $\mathbf{b}$ is moved around in n-space to minimize the sum of the squares of the distances $\sum_{i=1}^p L_i^2$. In the former case, $L_i^2$ is the squared distance between $[x_{1i} - \bar{x}_i, x_{2i} - \bar{x}_i, \ldots, x_{ni} - \bar{x}_i]'$ and its projection on the line determined by $\mathbf{b}$. The second principal component minimizes the same quantity among all vectors perpendicular to the first choice.

Exercises

8.1. Determine the population principal components $Y_1$ and $Y_2$ for the covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 5 & 2 \\ 2 & 2 \end{bmatrix}$$

Also, calculate the proportion of the total population variance explained by the first principal component.

8.2. Convert the covariance matrix in Exercise 8.1 to a correlation matrix $\boldsymbol{\rho}$.
(a) Determine the principal components $Y_1$ and $Y_2$ from $\boldsymbol{\rho}$ and compute the proportion of total population variance explained by $Y_1$.
(b) Compare the components calculated in Part a with those obtained in Exercise 8.1. Are they the same? Should they be?
(c) Compute the correlations $\rho_{Y_1,Z_1}$, $\rho_{Y_1,Z_2}$, and $\rho_{Y_2,Z_1}$.

8.3. Let

$$\boldsymbol{\Sigma} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix}$$

Determine the principal components $Y_1$, $Y_2$, and $Y_3$. What can you say about the eigenvectors (and principal components) associated with eigenvalues that are not distinct?

8.4. Find the principal components and the proportion of the total population variance explained by each when the covariance matrix is

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2 & \sigma^2\rho & 0 \\ \sigma^2\rho & \sigma^2 & \sigma^2\rho \\ 0 & \sigma^2\rho & \sigma^2 \end{bmatrix}, \qquad -\frac{1}{\sqrt{2}} < \rho < \frac{1}{\sqrt{2}}$$

8.5. (a) Find the eigenvalues of the correlation matrix

$$\boldsymbol{\rho} = \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}$$

Are your results consistent with (8-16) and (8-17)?
(b) Verify the eigenvalue–eigenvector pairs for the $p \times p$ matrix $\boldsymbol{\rho}$ given in (8-15).

8.6. Data on $x_1$ = sales and $x_2$ = profits for the 10 largest companies in the world were listed in Exercise 1.4 of Chapter 1. The sample mean vector and covariance matrix for these data are

$$\bar{\mathbf{x}} = \begin{bmatrix} 155.60 \\ 14.70 \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} 7476.45 & 303.62 \\ 303.62 & 26.19 \end{bmatrix}$$

(a) Determine the sample principal components and their variances for these data. (You may need the quadratic formula to solve for the eigenvalues of S.)
(b) Find the proportion of the total sample variance explained by $\hat{y}_1$.
(c) Sketch the constant density ellipse $(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = 1$, and indicate the principal components $\hat{y}_1$ and $\hat{y}_2$ on your graph.
(d) Compute the correlation coefficients $r_{\hat{y}_1, x_k}$, $k = 1, 2$. What interpretation, if any, can you give to the first principal component?

8.7. Convert the covariance matrix S in Exercise 8.6 to a sample correlation matrix R.
(a) Find the sample principal components $\hat{y}_1$, $\hat{y}_2$ and their variances.
(b) Compute the proportion of the total sample variance explained by $\hat{y}_1$. Interpret $\hat{y}_1$.
(c) Compute the correlation coefficients $r_{\hat{y}_1, z_k}$, $k = 1, 2$.
(d) Compare the components obtained in Part a with those obtained in Exercise 8.6(a). Given the original data displayed in Exercise 1.4, do you feel that it is better to determine principal components from the sample covariance matrix or sample correlation matrix? Explain.
8.8. Use the results in Example 8.5.
(a) Compute the correlations $r_{\hat{y}_i, z_k}$ for $i = 1, 2$ and $k = 1, 2, \ldots, 5$. Do these correlations reinforce the interpretations given to the first two components? Explain.
(b) Test the hypothesis

$$H_0: \boldsymbol{\rho} = \boldsymbol{\rho}_0 = \begin{bmatrix} 1 & \rho & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho & \rho \\ \rho & \rho & 1 & \rho & \rho \\ \rho & \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & \rho & 1 \end{bmatrix} \qquad \text{versus} \qquad H_1: \boldsymbol{\rho} \neq \boldsymbol{\rho}_0$$

at the 5% level of significance. List any assumptions required in carrying out this test.

8.9. (A test that all variables are independent.)
(a) Consider the normal theory likelihood ratio test of $H_0$: $\boldsymbol{\Sigma}$ is the diagonal matrix

$$\boldsymbol{\Sigma}_0 = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix}$$

Show that the test is as follows: Reject $H_0$ if

$$\Lambda = \frac{|\mathbf{S}|^{n/2}}{\prod_{i=1}^{p} s_{ii}^{n/2}} = |\mathbf{R}|^{n/2} < c$$

For a large sample size, $-2\ln\Lambda$ is approximately $\chi^2_{p(p-1)/2}$. Bartlett [3] suggests that the test statistic $-2[1 - (2p + 11)/(6n)]\ln\Lambda$ be used in place of $-2\ln\Lambda$. This results in an improved chi-square approximation. The large sample $\alpha$ critical point is $\chi^2_{p(p-1)/2}(\alpha)$. Note that testing $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_0$ is the same as testing $\boldsymbol{\rho} = \mathbf{I}$.
(b) Show that the likelihood ratio test of $H_0$: $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ rejects $H_0$ if

$$\Lambda = \frac{|\mathbf{S}|^{n/2}}{(\text{tr}(\mathbf{S})/p)^{np/2}} = \left[ \frac{\prod_{i=1}^{p} \hat{\lambda}_i}{\left( \frac{1}{p} \sum_{i=1}^{p} \hat{\lambda}_i \right)^{p}} \right]^{n/2} = \left[ \frac{\text{geometric mean } \hat{\lambda}_i}{\text{arithmetic mean } \hat{\lambda}_i} \right]^{np/2} < c$$

For a large sample size, Bartlett [3] suggests that $-2[1 - (2p^2 + p + 2)/(6pn)]\ln\Lambda$ is approximately $\chi^2_{(p+2)(p-1)/2}$. Thus, the large sample $\alpha$ critical point is $\chi^2_{(p+2)(p-1)/2}(\alpha)$. This test is called a sphericity test, because the constant density contours are spheres when $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$.
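The independence test of part (a) is simple to carry out in practice. The Python sketch below is ours and implements only the part (a) statistic with Bartlett's correction; an analogous function could be written for the sphericity test of part (b).

```python
import numpy as np
from scipy import stats

def independence_test(R, n, alpha=0.05):
    """Large-sample test of Exercise 8.9(a) with Bartlett's correction:
    -2[1 - (2p + 11)/(6n)] ln(Lambda), where ln(Lambda) = (n/2) ln|R|,
    compared to a chi-square with p(p - 1)/2 degrees of freedom."""
    p = R.shape[0]
    stat = -(1 - (2 * p + 11) / (6.0 * n)) * n * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return stat, stats.chi2.ppf(1 - alpha, df)
```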
Hint: (a) $\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}_0} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}_0)$ is the product of the univariate likelihoods,

$$\max_{\mu_i, \sigma_{ii}} \; (2\pi)^{-n/2} \sigma_{ii}^{-n/2} \exp\left[ -\sum_{j=1}^{n} (x_{ji} - \mu_i)^2 / 2\sigma_{ii} \right]$$

Hence $\hat{\mu}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ji}$ and $\hat{\sigma}_{ii} = \frac{1}{n}\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2$. The divisor n cancels in $\Lambda$, so S may be used.
(b) Verify $\hat{\sigma}^2 = \left[ \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2 + \cdots + \sum_{j=1}^{n} (x_{jp} - \bar{x}_p)^2 \right] / np$ under $H_0$. Again, the divisors n cancel in the statistic, so S may be used. Use Result 5.2 to calculate the chi-square degrees of freedom.

The following exercises require the use of a computer.

8.10. The weekly rates of return for five stocks listed on the New York Stock Exchange are given in Table 8.4. (See the stock-price data on the following website: www.prenhall.com/statistics.)

Table 8.4 Stock-Price Data (Weekly Rate of Return). [Rows: weeks 1–103; columns: JP Morgan, Citibank, Wells Fargo, Royal Dutch Shell, Exxon Mobil. The complete data set is on the website above.]

(a) Construct the sample covariance matrix S, and find the sample principal components in (8-20). (Note that the sample mean vector $\bar{\mathbf{x}}$ is displayed in Example 8.5.)
(b) Determine the proportion of the total sample variance explained by the first three principal components. Interpret these components.
(c) Construct Bonferroni simultaneous 90% confidence intervals for the variances $\lambda_1$, $\lambda_2$, and $\lambda_3$ of the first three population components $Y_1$, $Y_2$, and $Y_3$.
(d) Given the results in Parts a–c, do you feel that the stock rates-of-return data can be summarized in fewer than five dimensions? Explain.
8.11. Consider the census-tract data listed in Table 8.5. Suppose the observations on $X_5$ = median value home were recorded in ten thousands, rather than hundred thousands, of dollars; that is, multiply all the numbers listed in the sixth column of the table by 10.
(a) Construct the sample covariance matrix S for the census-tract data when $X_5$ = median value home is recorded in ten thousands of dollars. (Note that this covariance matrix can be obtained from the covariance matrix given in Example 8.3 by multiplying the off-diagonal elements in the fifth column and row by 10 and the diagonal element $s_{55}$ by 100. Why?)
(b) Obtain the eigenvalue–eigenvector pairs and the first two sample principal components for the covariance matrix in Part a.
(c) Compute the proportion of total variance explained by the first two principal components obtained in Part b. Calculate the correlation coefficients $r_{\hat{y}_i, x_k}$, and interpret these components if possible. Compare your results with the results in Example 8.3. What can you say about the effects of this change in scale on the principal components?

8.12. Consider the air-pollution data listed in Table 1.5. Your job is to summarize these data in fewer than p = 7 dimensions if possible. Conduct a principal component analysis of the data using both the covariance matrix S and the correlation matrix R. What have you learned? Does it make any difference which matrix is chosen for analysis? Can the data be summarized in three or fewer dimensions? Can you interpret the principal components?

Table 8.5 Census-Tract Data. [Rows: 61 tracts; columns: total population (thousands), professional degree (percent), employed age over 16 (percent), government employment (percent), median home value ($100,000). The complete data set is available at www.prenhall.com/statistics.]

Note: Observations from adjacent census tracts are likely to be correlated. That is, these 61 observations may not constitute a random sample.
8.13. In the radiotherapy data listed in Table 1.7 (see also the radiotherapy data on the website www.prenhall.com/statistics), the n = 98 observations on p = 6 variables represent patients' reactions to radiotherapy.
(a) Obtain the covariance and correlation matrices S and R for these data.
(b) Pick one of the matrices S or R (justify your choice), and determine the eigenvalues and eigenvectors. Prepare a table showing, in decreasing order of size, the percent that each eigenvalue contributes to the total sample variance.
(c) Given the results in Part b, decide on the number of important sample principal components. Is it possible to summarize the radiotherapy data with a single reaction-index component? Explain.
(d) Prepare a table of the correlation coefficients between each principal component you decide to retain and the original variables.

8.14. Perform a principal component analysis using the sample covariance matrix of the sweat data given in Example 5.2. Construct a Q–Q plot for each of the important principal components. Are there any suspect observations? Explain.

8.15. The four sample standard deviations for the postbirth weights discussed in Example 8.6 are

$$\sqrt{s_{11}} = 32.9909, \quad \sqrt{s_{22}} = 33.5918, \quad \sqrt{s_{33}} = 36.5534, \quad \text{and} \quad \sqrt{s_{44}} = 37.3517$$

Use these and the correlations given in Example 8.6 to construct the sample covariance matrix S. Perform a principal component analysis using S.

8.16. Over a period of five years in the 1990s, yearly samples of fishermen on 28 lakes in Wisconsin were asked to report the time they spent fishing and how many of each type of game fish they caught. Their responses were then converted to a catch rate per hour for

$$x_1 = \text{Bluegill}, \quad x_2 = \text{Black crappie}, \quad x_3 = \text{Smallmouth bass}, \quad x_4 = \text{Largemouth bass}, \quad x_5 = \text{Walleye}, \quad x_6 = \text{Northern pike}$$

The estimated correlation matrix (courtesy of Jodi Barnet)

$$\mathbf{R} = \begin{bmatrix} 1 & .4919 & .2636 & .4653 & -.2277 & .0652 \\ .4919 & 1 & .3127 & .3506 & -.1917 & .2045 \\ .2636 & .3127 & 1 & .4108 & .0647 & .2493 \\ .4653 & .3506 & .4108 & 1 & -.2249 & .2293 \\ -.2277 & -.1917 & .0647 & -.2249 & 1 & -.2144 \\ .0652 & .2045 & .2493 & .2293 & -.2144 & 1 \end{bmatrix}$$

is based on a sample of about 120. (There were a few missing values.)

Fish caught by the same fisherman live alongside of each other, so the data should provide some evidence on how the fish group. The first four fish belong to the centrarchids, the most plentiful family. The walleye is the most popular fish to eat.
(a) Comment on the pattern of correlation within the centrarchid family $x_1$ through $x_4$. Does the walleye appear to group with the other fish?
(b) Perform a principal component analysis using only $x_1$ through $x_4$. Interpret your results.
(c) Perform a principal component analysis using all six variables. Interpret your results.
8.17. Using the data on bone mineral content in Table 1.8, perform a principal component analysis of S.

8.18. The data on national track records for women are listed in Table 1.9.
(a) Obtain the sample correlation matrix R for these data, and determine its eigenvalues and eigenvectors.
(b) Determine the first two principal components for the standardized variables. Prepare a table showing the correlations of the standardized variables with the components, and the cumulative percentage of the total (standardized) sample variance explained by the two components.
(c) Interpret the two principal components obtained in Part b. (Note that the first component is essentially a normalized unit vector and might measure the athletic excellence of a given nation. The second component might measure the relative strength of a nation at the various running distances.)
(d) Rank the nations based on their score on the first principal component. Does this ranking correspond with your intuitive notion of athletic excellence for the various countries?

8.19. Refer to Exercise 8.18. Convert the national track records for women in Table 1.9 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 3000 m, and the marathon are given in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Perform a principal component analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.18. Do your interpretations of the components differ? If the nations are ranked on the basis of their score on the first principal component, does the subsequent ranking differ from that in Exercise 8.18? Which analysis do you prefer? Why?

8.20. The data on national track records for men are listed in Table 8.6. (See also the data on national track records for men on the website www.prenhall.com/statistics.) Repeat the principal component analysis outlined in Exercise 8.18 for the men. Are the results consistent with those obtained from the women's data?

8.21. Refer to Exercise 8.20. Convert the national track records for men in Table 8.6 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 5000 m, 10,000 m, and the marathon are given in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Perform a principal component analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.20. Which analysis do you prefer? Why?

8.22. Consider the data on bulls in Table 1.10. Utilizing the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt, perform a principal component analysis using the covariance matrix S and the correlation matrix R. Your analysis should include the following:
(a) Determine the appropriate number of components to effectively summarize the sample variability. Construct a scree plot to aid your determination.
(b) Interpret the sample principal components.
(c) Do you think it is possible to develop a "body size" or "body configuration" index from the data on the seven variables above? Explain.
(d) Using the values for the first two principal components, plot the data in a two-dimensional space with $\hat{y}_1$ along the vertical axis and $\hat{y}_2$ along the horizontal axis. Can you distinguish groups representing the three breeds of cattle? Are there any outliers?
(e) Construct a Q–Q plot using the first principal component. Interpret the plot.
Table 8.6 National Track Records for Men. [Rows: one record per country, Argentina through the U.S.A.; columns: 100 m (s), 200 m (s), 400 m (s), 800 m (min), 1500 m (min), 5000 m (min), 10,000 m (min), marathon (min). Source: IAAF/ATFS Track and Field Statistics Handbook for the Helsinki 2005 Olympics; courtesy of Ottavio Castellini.]
8.23. A naturalist for the Alaska Fish and Game Department studies grizzly bears with the goal of maintaining a healthy population. Measurements on n = 61 bears provided the following summary statistics for the variables Weight (kg), Body length (cm), Neck (cm), Girth (cm), Head length (cm), and Head width (cm):

$$\bar{\mathbf{x}}' = [\;95.52, \;164.38, \;55.69, \;93.39, \;17.98, \;31.13\;]$$

$$\mathbf{S} = \begin{bmatrix} 3266.46 & 1343.97 & 731.54 & 1175.50 & 162.68 & 238.37 \\ 1343.97 & 721.91 & 324.25 & 537.35 & 80.17 & 117.73 \\ 731.54 & 324.25 & 179.28 & 281.17 & 39.15 & 56.80 \\ 1175.50 & 537.35 & 281.17 & 474.98 & 63.73 & 94.85 \\ 162.68 & 80.17 & 39.15 & 63.73 & 9.95 & 13.88 \\ 238.37 & 117.73 & 56.80 & 94.85 & 13.88 & 21.26 \end{bmatrix}$$

(a) Perform a principal component analysis using the covariance matrix. Can the data be effectively summarized in fewer than six dimensions?
(b) Perform a principal component analysis using the correlation matrix.
(c) Comment on the similarities and differences between the two analyses.

8.24. Refer to Example 8.10 and the data in Table 5.8, page 240. Add the variable $x_6$ = regular overtime hours whose values are (read across)

6187  7679  7336  8259  6988  10954  6964  9353
8425  6291  6778  4969  5922  4825  7307  6019

and redo Example 8.10.

8.25. Refer to the police overtime hours data in Example 8.10. Construct an alternate control chart, based on the sum of squares $d_{U,j}^2$, to monitor the unexplained variation in the original observations summarized by the additional principal components.

8.26. Consider the psychological profile data in Table 4.6. Using the five variables Indep, Supp, Benev, Conform, and Leader, perform a principal component analysis using the covariance matrix S and the correlation matrix R. Your analysis should include the following:
(a) Determine the appropriate number of components to effectively summarize the variability. Construct a scree plot to aid in your determination.
(b) Interpret the sample principal components.
(c) Using the values for the first two principal components, plot the data in a two-dimensional space with $\hat{y}_1$ along the vertical axis and $\hat{y}_2$ along the horizontal axis. Can you distinguish groups representing the two socioeconomic levels and/or the two genders? Are there any outliers?
(d) Construct a 95% confidence interval for $\lambda_1$, the variance of the first population principal component from the covariance matrix.

8.27. The pulp and paper properties data are given in Table 7.7. Using the four paper variables, BL (breaking length), EM (elastic modulus), SF (stress at failure), and BS (burst strength), perform a principal component analysis using the covariance matrix S and the correlation matrix R. Your analysis should include the following:
(a) Determine the appropriate number of components to effectively summarize the variability. Construct a scree plot to aid in your determination.
(b) Interpret the sample principal components.
(c) Do you think it is possible to develop a "paper strength" index that effectively contains the information in the four paper variables? Explain.
(d) Using the values for the first two principal components, plot the data in a two-dimensional space with $\hat{y}_1$ along the vertical axis and $\hat{y}_2$ along the horizontal axis. Identify any outliers in this data set.

8.28. Survey data were collected as part of a study to assess options for enhancing food security through the sustainable use of natural resources in the Sikasso region of Mali (West Africa). A total of n = 76 farmers were surveyed, and observations on the nine variables

x1 = Family (total number of individuals in household)
x2 = DistRd (distance in kilometers to nearest passable road)
x3 = Cotton (hectares of cotton planted in year 2000)
x4 = Maize (hectares of maize planted in year 2000)
x5 = Sorg (hectares of sorghum planted in year 2000)
x6 = Millet (hectares of millet planted in year 2000)
x7 = Bull (total number of bullocks or draft animals)
x8 = Cattle (total)
x9 = Goats (total)

were recorded. The data are listed in Table 8.7 and on the website www.prenhall.com/statistics.

Table 8.7 Mali Family Farm Data. [Rows: 76 farms; columns: Family, DistRd, Cotton, Maize, Sorg, Millet, Bull, Cattle, Goats. Source: Data courtesy of Jay Angerer. The complete data set is on the website above.]

(a) Construct two-dimensional scatterplots of Family versus DistRd, and DistRd versus Cattle. Remove any obvious outliers from the data set.
(b) Perform a principal component analysis using the correlation matrix R. Determine the number of components to effectively summarize the variability. Use the proportion of variation explained and a scree plot to aid in your determination.
(c) Interpret the first five principal components. Can you identify, for example, a "farm-size" component? A, perhaps, "goats and distance to road" component?

8.29. Using the covariance matrix S for the first 30 cases of the car body assembly data, obtain the sample principal components.
(a) Construct a 95% ellipse format chart using the first two principal components $\hat{y}_1$ and $\hat{y}_2$. Identify the car locations that appear to be out of control.
(b) Construct an alternative control chart, based on the sum of squares $d_{U,j}^2$, to monitor the variation in the original observations summarized by the remaining four principal components. Interpret this chart.

References

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (3rd ed.). New York: John Wiley, 2003.
2. Anderson, T. W. "Asymptotic Theory for Principal Components Analysis." Annals of Mathematical Statistics, 34 (1963), 122-148.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43 (1989), 110-115.
5. Girshick, M. A. "On the Sampling Theory of Roots of Determinantal Equations." Annals of Mathematical Statistics, 10 (1939), 203-224.
6. Hotelling, H. "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24 (1933), 417-441, 498-520.
7. Hotelling, H. "The Most Predictable Criterion." Journal of Educational Psychology, 26 (1935), 139-142.
8. Hotelling, H. "Relations between Two Sets of Variates." Biometrika, 28 (1936), 321-377.
9. Hotelling, H. "Simplified Calculation of Principal Components." Psychometrika, 1 (1936), 27-35.
10. Jolicoeur, P. "The Multivariate Generalization of the Allometry Equation." Biometrics, 19 (1963), 497-499.
11. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
12. King, B. "Market and Industry Factors in Stock Price Behavior." Journal of Business, 39 (1966), 139-190.
13. Kourti, T., and J. MacGregor. "Multivariate SPC Methods for Process and Product Monitoring." Journal of Quality Technology, 28 (1996), 409-428.
14. Lawley, D. N. "On Testing a Set of Correlation Coefficients for Equality." Annals of Mathematical Statistics, 34 (1963), 149-151.
15. Rao, C. R. Linear Statistical Inference and Its Applications (2nd ed.). New York: Wiley-Interscience, 2002.
16. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992), 217-225.
Chapter 9

FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES

9.1 Introduction

Factor analysis has provoked rather turbulent controversy throughout its history. Its modern beginnings lie in the early-20th-century attempts of Karl Pearson, Charles Spearman, and others to define and measure intelligence. Because of this early association with constructs such as intelligence, factor analysis was nurtured and developed primarily by scientists interested in psychometrics. Arguments over the psychological interpretations of several early studies and the lack of powerful computing facilities impeded its initial development as a statistical method. The advent of high-speed computers has generated a renewed interest in the theoretical and computational aspects of factor analysis. Most of the original techniques have been abandoned and early controversies resolved in the wake of recent developments. It is still true, however, that each application of the technique must be examined on its own merits to determine its success.

The essential purpose of factor analysis is to describe, if possible, the covariance relationships among many variables in terms of a few underlying, but unobservable, random quantities called factors. Basically, the factor model is motivated by the following argument: Suppose variables can be grouped by their correlations. That is, suppose all variables within a particular group are highly correlated among themselves, but have relatively small correlations with variables in a different group. Then it is conceivable that each group of variables represents a single underlying construct, or factor, that is responsible for the observed correlations. For example, correlations from the group of test scores in classics, French, English, mathematics, and music collected by Spearman suggested an underlying "intelligence" factor. A second group of variables, representing physical-fitness scores, if available, might correspond to another factor. It is this type of structure that factor analysis seeks to confirm.
.. 9. the model in (92) implies certain covariance relationships...482 Chapter 9 Factor Analysis and Inference for Structured Covariance Matrices FaCtor analysis can be considered an extension of principal component analysis. _. .1. ..L2. We assume that E(F) = (mXI) 0 . .• ' xp . Fm. F m.. .Li J. Both can be viewed as attempts to approximate the covariance matrix I. . and p additional sources of variation e 1 . Note that the ith specific factor e. .Lp are expressed in terms of p + m random variables PI> F 2.Li> x2 . which can be checked.J. e2. called errors or.. sometimes.l2 = = eliFj e21F1 + + ei2F2 e22F2 + ... However the approximation based on the factor analysis model is more elaborate. and covariance:~ matrix I.. .. F2. 1 In particular. With so many unobservable quantities. However. e2 . a direct verification of the factor model from observations on XI' x2. in which the independent variables [whose position is occupied by Fin (92)] can be observed. . in matrix notation.J. e 1 . with some additional assumptions about the random vectors F and e.1.·observable random variables F 1 .2 The Orthogonal Factor Model The observable random vector X. . The p deviations XI .with p components. .. eP. (pXI) Cov(e) = E[u'] = 'I' = (pXp) l' n ? : 0 "'2 0 (93) 0 1 As Maxwell [12] points out. The factor model postulates that X is linearly dependent upon a few un.1. This distinguishes the factor model of (92) from the multivariate regression model in (723). + elmFm + I':J + · · · + e2mFm + e2 (91) or. the factor analysis model is XI X2 J. eP which are unobservable..... 'xp is hopeless. in many investigations the£.. tend to be combinations of measurement 'l error and factors that are uniquely associated with the individual variables. . Cov (F) = E[FF'] = I (mXm) E(e) = 0 . .J. called common factors. X . so the matrix L is the nwtrix of factor loadings. Th~~ primary question in factor analysis is whether the data are consistent with a ' prescribed structure. (pXI) = L F (pXm)(mXI) + e (pXI) (92) The coefficient eij is called the loading of the ith variable on the jth factor. specific factors. is associated only with the ith response X. has mean 1.
and that F and ε are independent, so

Cov(ε, F) = E(εF′) = 0
                    (p×m)

These assumptions and the relation in (9-2) constitute the orthogonal factor model.²

Orthogonal Factor Model with m Common Factors

X = μ + L F + ε    (9-4)
(p×1) (p×1) (p×m)(m×1) (p×1)

μᵢ = mean of variable i
εᵢ = ith specific factor
Fⱼ = jth common factor
ℓᵢⱼ = loading of the ith variable on the jth factor

The unobservable random vectors F and ε satisfy the following conditions:
F and ε are independent
E(F) = 0, Cov(F) = I
E(ε) = 0, Cov(ε) = Ψ, where Ψ is a diagonal matrix

The orthogonal factor model implies a covariance structure for X. From the model in (9-4),

(X − μ)(X − μ)′ = (LF + ε)(LF + ε)′ = LF(LF)′ + ε(LF)′ + LFε′ + εε′

so that

Σ = Cov(X) = E(X − μ)(X − μ)′ = LE(FF′)L′ + E(εF′)L′ + LE(Fε′) + E(εε′) = LL′ + Ψ

according to (9-3). Also, by independence, Cov(ε, F) = E(εF′) = 0, and, by the model in (9-4),

(X − μ)F′ = (LF + ε)F′ = LFF′ + εF′

so Cov(X, F) = E(X − μ)F′ = LE(FF′) + E(εF′) = L.

² Allowing the factors F to be correlated so that Cov(F) is not diagonal gives the oblique factor model. The oblique model presents some additional estimation difficulties and will not be discussed in this book. (See [20].)
Covariance Structure for the Orthogonal Factor Model

1. Cov(X) = LL′ + Ψ, or

   Var(Xᵢ) = ℓᵢ₁² + ⋯ + ℓᵢₘ² + ψᵢ
   Cov(Xᵢ, Xₖ) = ℓᵢ₁ℓₖ₁ + ⋯ + ℓᵢₘℓₖₘ    (9-5)

2. Cov(X, F) = L, or Cov(Xᵢ, Fⱼ) = ℓᵢⱼ

The model X − μ = LF + ε is linear in the common factors. If the p responses X are, in fact, related to underlying factors, but the relationship is nonlinear, such as in X₁ − μ₁ = ℓ₁₁F₁F₃ + ε₁, X₂ − μ₂ = ℓ₂₁F₂F₃ + ε₂, and so forth, then the covariance structure LL′ + Ψ given by (9-5) may not be adequate. The very important assumption of linearity is inherent in the formulation of the traditional factor model.

That portion of the variance of the ith variable contributed by the m common factors is called the ith communality. That portion of Var(Xᵢ) = σᵢᵢ due to the specific factor is often called the uniqueness, or specific variance. Denoting the ith communality by hᵢ², we see from (9-5) that

σᵢᵢ = (ℓᵢ₁² + ℓᵢ₂² + ⋯ + ℓᵢₘ²) + ψᵢ
Var(Xᵢ) = communality + specific variance    (9-6)

or

hᵢ² = ℓᵢ₁² + ℓᵢ₂² + ⋯ + ℓᵢₘ²  and  σᵢᵢ = hᵢ² + ψᵢ,  i = 1, 2, …, p

The ith communality is the sum of squares of the loadings of the ith variable on the m common factors.
Example 9.1 (Verifying the relation Σ = LL′ + Ψ for two factors) Consider the covariance matrix

Σ = [19 30  2 12
     30 57  5 23
      2  5 38 47
     12 23 47 68]

The equality

[19 30  2 12      [ 4 1
 30 57  5 23   =    7 2   [ 4  7 −1  1      [2 0 0 0
  2  5 38 47       −1 6     1  2  6  8]  +   0 4 0 0
 12 23 47 68]       1 8]                     0 0 1 0
                                             0 0 0 3]

or

Σ = LL′ + Ψ

may be verified by matrix algebra. Therefore, Σ has the structure produced by an m = 2 orthogonal factor model. Since

L = [ℓ₁₁ ℓ₁₂        [ 4 1
     ℓ₂₁ ℓ₂₂    =     7 2
     ℓ₃₁ ℓ₃₂         −1 6
     ℓ₄₁ ℓ₄₂]         1 8],    Ψ = diag(ψ₁, ψ₂, ψ₃, ψ₄) = diag(2, 4, 1, 3)

the communality of X₁ is, from (9-6),

h₁² = ℓ₁₁² + ℓ₁₂² = 4² + 1² = 17

and the variance of X₁ can be decomposed as

σ₁₁ = (ℓ₁₁² + ℓ₁₂²) + ψ₁ = h₁² + ψ₁

or

19 = 17 + 2
variance = communality + specific variance

A similar breakdown occurs for the other variables. •

The factor model assumes that the p + p(p − 1)/2 = p(p + 1)/2 variances and covariances for X can be reproduced from the pm factor loadings ℓᵢⱼ and the p specific variances ψᵢ. When m = p, any covariance matrix Σ can be reproduced exactly as LL′ [see (9-11)], so Ψ can be the zero matrix. However, it is when m is small relative to p that factor analysis is most useful. In this case, the factor model provides a "simple" explanation of the covariation in X with fewer parameters than the p(p + 1)/2 parameters in Σ. For example, if X contains p = 12 variables, and the factor model in (9-4) with m = 2 is appropriate, then the p(p + 1)/2 = 12(13)/2 = 78 elements of Σ are described in terms of the mp + p = 12(2) + 12 = 36 parameters ℓᵢⱼ and ψᵢ of the factor model.
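The relations (9-5) and (9-6), and the decomposition verified by hand in Example 9.1, are easy to check numerically. The following is a minimal sketch in Python, assuming NumPy (neither is used in the text); it simply reproduces the Example 9.1 arithmetic.

import numpy as np

# Loadings and specific variances from Example 9.1
L = np.array([[4.0, 1.0],
              [7.0, 2.0],
              [-1.0, 6.0],
              [1.0, 8.0]])
Psi = np.diag([2.0, 4.0, 1.0, 3.0])

Sigma = np.array([[19.0, 30.0, 2.0, 12.0],
                  [30.0, 57.0, 5.0, 23.0],
                  [2.0, 5.0, 38.0, 47.0],
                  [12.0, 23.0, 47.0, 68.0]])

# Covariance structure (9-5): Sigma = L L' + Psi
assert np.allclose(L @ L.T + Psi, Sigma)

# Communalities (9-6): h_i^2 is the sum of squared loadings in row i
h2 = (L**2).sum(axis=1)
print(h2)                  # [17. 53. 37. 65.]
print(h2 + np.diag(Psi))   # recovers the diagonal of Sigma: [19. 57. 38. 68.]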
Unfortunately for the factor analyst, most covariance matrices cannot be factored as LL′ + Ψ, where the number of factors m is much less than p. The following example demonstrates one of the problems that can arise when attempting to determine the parameters ℓᵢⱼ and ψᵢ from the variances and covariances of the observable variables.
Example 9.2 (Nonexistence of a proper solution) Let p = 3 and m = 1, and suppose the random variables X₁, X₂, and X₃ have the positive definite covariance matrix

Σ = [ 1 .9 .7
     .9  1 .4
     .7 .4  1]

Using the factor model in (9-4), we obtain

X₁ − μ₁ = ℓ₁₁F₁ + ε₁
X₂ − μ₂ = ℓ₂₁F₁ + ε₂
X₃ − μ₃ = ℓ₃₁F₁ + ε₃

The covariance structure in (9-5) implies that

Σ = LL′ + Ψ

or

1 = ℓ₁₁² + ψ₁    .90 = ℓ₁₁ℓ₂₁    .70 = ℓ₁₁ℓ₃₁
1 = ℓ₂₁² + ψ₂    .40 = ℓ₂₁ℓ₃₁
1 = ℓ₃₁² + ψ₃

The pair of equations

.70 = ℓ₁₁ℓ₃₁
.40 = ℓ₂₁ℓ₃₁

implies that

ℓ₂₁ = (.40/.70)ℓ₁₁

Substituting this result for ℓ₂₁ in the equation .90 = ℓ₁₁ℓ₂₁ yields ℓ₁₁² = 1.575, or ℓ₁₁ = ±1.255. Since Var(F₁) = 1 (by assumption) and Var(X₁) = 1, ℓ₁₁ = Cov(X₁, F₁) = Corr(X₁, F₁). Now, a correlation coefficient cannot be greater than unity (in absolute value), so, from this point of view, |ℓ₁₁| = 1.255 is too large. Also, the equation

1 = ℓ₁₁² + ψ₁,  or  ψ₁ = 1 − ℓ₁₁²

gives

ψ₁ = 1 − 1.575 = −.575

which is unsatisfactory, since it gives a negative value for Var(ε₁) = ψ₁. Thus, for this example with m = 1, it is possible to get a unique numerical solution to the equations Σ = LL′ + Ψ. However, the solution is not consistent with the statistical interpretation of the coefficients, so it is not a proper solution. •
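The arithmetic of Example 9.2 can be mirrored in a few lines; a minimal sketch (Python with NumPy assumed, as before) that solves the three off-diagonal equations and exposes the improper solution:

import numpy as np

r12, r13, r23 = 0.90, 0.70, 0.40

# From l11*l21 = r12, l11*l31 = r13, l21*l31 = r23:
l11_sq = r12 * r13 / r23      # 1.575
l11 = np.sqrt(l11_sq)         # 1.255 > 1: impossible for a correlation
psi1 = 1.0 - l11_sq           # -0.575: a negative "variance"
print(l11, psi1)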
When m > 1, there is always some inherent ambiguity associated with the factor model. To see this, let T be any m × m orthogonal matrix, so that TT′ = T′T = I. Then the expression in (9-2) can be written

X − μ = LF + ε = LTT′F + ε = L*F* + ε    (9-7)

where

L* = LT  and  F* = T′F

Since

E(F*) = T′E(F) = 0

and

Cov(F*) = T′Cov(F)T = T′T = I
                            (m×m)

it is impossible, on the basis of observations on X, to distinguish the loadings L from the loadings L*. That is, the factors F and F* = T′F have the same statistical properties, and even though the loadings L* are, in general, different from the loadings L, they both generate the same covariance matrix Σ. That is,

Σ = LL′ + Ψ = LTT′L′ + Ψ = (L*)(L*)′ + Ψ    (9-8)

This ambiguity provides the rationale for "factor rotation," since orthogonal matrices correspond to rotations (and reflections) of the coordinate system for X.

Factor loadings L are determined only up to an orthogonal matrix T. Thus, the loadings

L* = LT  and  L    (9-9)

both give the same representation. The communalities, given by the diagonal elements of LL′ = (L*)(L*)′, are also unaffected by the choice of T.
The analysis of the factor model proceeds by imposing conditions that allow one to uniquely estimate L and Ψ. The loading matrix is then rotated (multiplied by an orthogonal matrix), where the rotation is determined by some "ease-of-interpretation" criterion. Once the loadings and specific variances are obtained, factors are identified, and estimated values for the factors themselves (called factor scores) are frequently constructed.
9.3 Methods of Estimation
Given observations x₁, x₂, …, xₙ on p generally correlated variables, factor analysis seeks to answer the question, Does the factor model of (9-4), with a small number of factors, adequately represent the data? In essence, we tackle this statistical model-building problem by trying to verify the covariance relationship in (9-5).

The sample covariance matrix S is an estimator of the unknown population covariance matrix Σ. If the off-diagonal elements of S are small or those of the sample correlation matrix R essentially zero, the variables are not related, and a factor analysis will not prove useful. In these circumstances, the specific factors play the dominant role, whereas the major aim of factor analysis is to determine a few important common factors.

If Σ appears to deviate significantly from a diagonal matrix, then a factor model can be entertained, and the initial problem is one of estimating the factor loadings ℓᵢⱼ and specific variances ψᵢ. We shall consider two of the most popular methods of parameter estimation, the principal component (and the related principal factor) method and the maximum likelihood method. The solution from either method can be rotated in order to simplify the interpretation of factors, as described in Section 9.4. It is always prudent to try more than one method of solution; if the factor model is appropriate for the problem at hand, the solutions should be consistent with one another.

Current estimation and rotation methods require iterative calculations that must be done on a computer. Several computer programs are now available for this purpose.
The Principal Component (and Principal Factor) Method
The spectral decomposition of (2-16) provides us with one factoring of the covariance matrix Σ. Let Σ have eigenvalue-eigenvector pairs (λᵢ, eᵢ) with λ₁ ≥ λ₂ ≥ ⋯ ≥ λₚ ≥ 0. Then

Σ = λ₁e₁e₁′ + λ₂e₂e₂′ + ⋯ + λₚeₚeₚ′
  = [√λ₁ e₁ ⋮ √λ₂ e₂ ⋮ ⋯ ⋮ √λₚ eₚ] [√λ₁ e₁′
                                    √λ₂ e₂′
                                    ⋮
                                    √λₚ eₚ′]    (9-10)

This fits the prescribed covariance structure for the factor analysis model having as many factors as variables (m = p) and specific variances ψᵢ = 0 for all i. The loading matrix has jth column given by √λⱼ eⱼ. That is, we can write

Σ  =  L  L′ + 0 = LL′    (9-11)
(p×p)  (p×p)(p×p) (p×p)

Apart from the scale factor √λⱼ, the factor loadings on the jth factor are the coefficients for the jth principal component of the population.

Although the factor analysis representation of Σ in (9-11) is exact, it is not particularly useful: It employs as many common factors as there are variables and does not allow for any variation in the specific factors ε in (9-4). We prefer models that explain the covariance structure in terms of just a few common factors. One approach, when the last p − m eigenvalues are small, is to neglect the contribution of λₘ₊₁eₘ₊₁eₘ₊₁′ + ⋯ + λₚeₚeₚ′ to Σ in (9-10). Neglecting this contribution, we obtain the approximation

Σ ≐ [√λ₁ e₁ ⋮ √λ₂ e₂ ⋮ ⋯ ⋮ √λₘ eₘ] [√λ₁ e₁′
                                    ⋮          =  L  L′    (9-12)
                                    √λₘ eₘ′]    (p×m)(m×p)

The approximate representation in (9-12) assumes that the specific factors ε in (9-4) are of minor importance and can also be ignored in the factoring of Σ. If specific factors are included in the model, their variances may be taken to be the diagonal elements of Σ − LL′, where LL′ is as defined in (9-12).

Allowing for specific factors, we find that the approximation becomes

Σ ≐ LL′ + Ψ
  = [√λ₁ e₁ ⋮ √λ₂ e₂ ⋮ ⋯ ⋮ √λₘ eₘ] [√λ₁ e₁′
                                    ⋮          +  diag(ψ₁, ψ₂, …, ψₚ)    (9-13)
                                    √λₘ eₘ′]

where ψᵢ = σᵢᵢ − Σⱼ₌₁ᵐ ℓᵢⱼ² for i = 1, 2, …, p.

To apply this approach to a data set x₁, x₂, …, xₙ, it is customary first to center the observations by subtracting the sample mean x̄. The centered observations

xⱼ − x̄ = [xⱼ₁ − x̄₁, xⱼ₂ − x̄₂, …, xⱼₚ − x̄ₚ]′,  j = 1, 2, …, n    (9-14)

have the same sample covariance matrix S as the original observations.

In cases in which the units of the variables are not commensurate, it is usually desirable to work with the standardized variables

zⱼ = [(xⱼ₁ − x̄₁)/√s₁₁, (xⱼ₂ − x̄₂)/√s₂₂, …, (xⱼₚ − x̄ₚ)/√sₚₚ]′,  j = 1, 2, …, n

whose sample covariance matrix is the sample correlation matrix R of the observations x₁, x₂, …, xₙ. Standardization avoids the problems of having one variable with large variance unduly influencing the determination of factor loadings.
The representation in (9-13), when applied to the sample covariance matrix S or the sample correlation matrix R, is known as the principal component solution. The name follows from the fact that the factor loadings are the scaled coefficients of the first few sample principal components. (See Chapter 8.)
Principal Component Solution of the Factor Model

The principal component factor analysis of the sample covariance matrix S is specified in terms of its eigenvalue-eigenvector pairs (λ̂₁, ê₁), (λ̂₂, ê₂), …, (λ̂ₚ, êₚ), where λ̂₁ ≥ λ̂₂ ≥ ⋯ ≥ λ̂ₚ. Let m < p be the number of common factors. Then the matrix of estimated factor loadings {ℓ̃ᵢⱼ} is given by

L̃ = [√λ̂₁ ê₁ ⋮ √λ̂₂ ê₂ ⋮ ⋯ ⋮ √λ̂ₘ êₘ]    (9-15)

The estimated specific variances are provided by the diagonal elements of the matrix S − L̃L̃′, so

Ψ̃ = diag(ψ̃₁, ψ̃₂, …, ψ̃ₚ)  with  ψ̃ᵢ = sᵢᵢ − Σⱼ₌₁ᵐ ℓ̃ᵢⱼ²    (9-16)

Communalities are estimated as

h̃ᵢ² = ℓ̃ᵢ₁² + ℓ̃ᵢ₂² + ⋯ + ℓ̃ᵢₘ²    (9-17)

The principal component factor analysis of the sample correlation matrix is obtained by starting with R in place of S.
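Equations (9-15)-(9-17) transcribe directly into code. A minimal sketch, again in Python with NumPy (an assumption beyond the text), where R may be a sample correlation or covariance matrix:

import numpy as np

def pc_factor_solution(R, m):
    """Principal component solution (9-15)-(9-17)."""
    eigval, eigvec = np.linalg.eigh(R)           # eigh returns ascending order
    idx = np.argsort(eigval)[::-1][:m]           # keep the m largest pairs
    L = eigvec[:, idx] * np.sqrt(eigval[idx])    # (9-15): columns sqrt(lam_j) e_j
    psi = np.diag(R) - (L**2).sum(axis=1)        # (9-16): specific variances
    h2 = (L**2).sum(axis=1)                      # (9-17): communalities
    return L, psi, h2

Applied to a sample correlation matrix such as the one in Example 9.3 below, this reproduces the tabulated loadings up to the sign of each column (eigenvectors are determined only up to sign).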
For the principal component solution, the estimated loadings for a given factor do not change as the number of factors is increased. For example, if m = 1, L̃ = [√λ̂₁ ê₁], and if m = 2, L̃ = [√λ̂₁ ê₁ ⋮ √λ̂₂ ê₂], where (λ̂₁, ê₁) and (λ̂₂, ê₂) are the first two eigenvalue-eigenvector pairs for S (or R).

By the definition of ψ̃ᵢ, the diagonal elements of S are equal to the diagonal elements of L̃L̃′ + Ψ̃. However, the off-diagonal elements of S are not usually reproduced by L̃L̃′ + Ψ̃. How, then, do we select the number of factors m?

If the number of common factors is not determined by a priori considerations, such as by theory or the work of other researchers, the choice of m can be based on the estimated eigenvalues in much the same manner as with principal components. Consider the residual matrix

S − (L̃L̃′ + Ψ̃)    (9-18)

resulting from the approximation of S by the principal component solution. The diagonal elements are zero, and if the other elements are also small, we may subjectively take the m factor model to be appropriate. Analytically, we have (see Exercise 9.5)

Sum of squared entries of S − (L̃L̃′ + Ψ̃) ≤ λ̂²ₘ₊₁ + ⋯ + λ̂²ₚ    (9-19)

Consequently, a small value for the sum of the squares of the neglected eigenvalues implies a small value for the sum of the squared errors of approximation.

Ideally, the contributions of the first few factors to the sample variances of the variables should be large. The contribution to the sample variance sᵢᵢ from the first common factor is ℓ̃ᵢ₁². The contribution to the total sample variance, s₁₁ + s₂₂ + ⋯ + sₚₚ = tr(S), from the first common factor is then

ℓ̃₁₁² + ℓ̃₂₁² + ⋯ + ℓ̃ₚ₁² = (√λ̂₁ ê₁)′(√λ̂₁ ê₁) = λ̂₁

since the eigenvector ê₁ has unit length. In general,

(Proportion of total sample variance due to jth factor) =
  λ̂ⱼ/(s₁₁ + s₂₂ + ⋯ + sₚₚ)  for a factor analysis of S
  λ̂ⱼ/p                      for a factor analysis of R    (9-20)

Criterion (9-20) is frequently used as a heuristic device for determining the appropriate number of common factors. The number of common factors retained in the model is increased until a "suitable proportion" of the total sample variance has been explained. Another convention, frequently encountered in packaged computer programs, is to set m equal to the number of eigenvalues of R greater than one if the sample correlation matrix is factored, or equal to the number of positive eigenvalues of S if the sample covariance matrix is factored. These rules of thumb should not be applied indiscriminately. For example, m = p if the rule for S is obeyed, since all the eigenvalues are expected to be positive for large sample sizes. The best approach is to retain few rather than many factors, assuming that they provide a satisfactory interpretation of the data and yield a satisfactory fit to S or R.
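The residual bound (9-19) and the proportion criterion (9-20) can be examined together for candidate values of m. A self-contained sketch (Python with NumPy assumed):

import numpy as np

def pc_fit_summary(R, m):
    """Residual check (9-18)-(9-19) and proportions (9-20) for the m-factor PC solution."""
    lam, vec = np.linalg.eigh(R)
    order = np.argsort(lam)[::-1]
    lam, vec = lam[order], vec[:, order]
    L = vec[:, :m] * np.sqrt(lam[:m])
    psi = np.diag(R) - (L**2).sum(axis=1)
    residual = R - (L @ L.T + np.diag(psi))     # (9-18): zero diagonal
    sq_error = (residual**2).sum()
    bound = (lam[m:]**2).sum()                  # right-hand side of (9-19)
    prop = lam[:m] / np.trace(R)                # (9-20); trace(R) = p when R is used
    return sq_error, bound, prop.cumsum()

Scanning m = 1, 2, … and stopping when the cumulative proportion is "suitable" and the residual sum of squares is small mirrors the informal procedure described above.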
Example 9.3 (Factor analysis of consumer-preference data) In a consumer-preference study, a random sample of customers were asked to rate several attributes of a new product. The responses, on a 7-point semantic differential scale, were tabulated and the attribute correlation matrix constructed. The correlation matrix is presented next (the entries that were circled in the original display are shown here in parentheses):

Attribute (Variable)            1      2      3      4      5
1. Taste                      1.00    .02   (.96)   .42    .01
2. Good buy for money          .02   1.00    .13    .71   (.85)
3. Flavor                    (.96)    .13   1.00    .50    .11
4. Suitable for snack          .42    .71    .50   1.00    .79
5. Provides lots of energy     .01   (.85)   .11    .79   1.00

It is clear from the circled entries in the correlation matrix that variables 1 and 3 and variables 2 and 5 form groups. Variable 4 is "closer" to the (2, 5) group than the (1, 3) group. Given these results and the small number of variables, we might expect that the apparent linear relationships between the variables can be explained in terms of, at most, two or three common factors.
The first two eigenvalues, λ̂₁ = 2.85 and λ̂₂ = 1.81, of R are the only eigenvalues greater than unity. Moreover, m = 2 common factors will account for a cumulative proportion

(λ̂₁ + λ̂₂)/p = (2.85 + 1.81)/5 = .93

of the total (standardized) sample variance. The estimated factor loadings, communalities, and specific variances, obtained using (9-15), (9-16), and (9-17), are given in Table 9.1.

Table 9.1
                                 Estimated factor    Communalities   Specific variances
                                 loadings
Variable                          F₁       F₂           h̃ᵢ²           ψ̃ᵢ = 1 − h̃ᵢ²
1. Taste                          .56      .82          .98            .02
2. Good buy for money             .78     −.53          .88            .12
3. Flavor                         .65      .75          .98            .02
4. Suitable for snack             .94     −.10          .89            .11
5. Provides lots of energy        .80     −.54          .93            .07
Eigenvalues                      2.85     1.81
Cumulative proportion of total
(standardized) sample variance   .571     .932

Now,

L̃L̃′ + Ψ̃ = [.56  .82                                  [.02  0    0    0    0
            .78 −.53   [.56 .78 .65  .94  .80           0   .12  0    0    0
            .65  .75    .82 −.53 .75 −.10 −.54]   +     0    0  .02   0    0
            .94 −.10                                    0    0   0   .11   0
            .80 −.54]                                   0    0   0    0   .07]

         = [1.00  .01  .97  .44  .00
             .01 1.00  .11  .79  .91
             .97  .11 1.00  .53  .11
             .44  .79  .53 1.00  .81
             .00  .91  .11  .81 1.00]

nearly reproduces the correlation matrix R. Thus, on a purely descriptive basis, we would judge a two-factor model with the factor loadings displayed in Table 9.1 as providing a good fit to the data. The communalities (.98, .88, .98, .89, .93) indicate that the two factors account for a large percentage of the sample variance of each variable.

We shall not interpret the factors at this point. As we noted in Section 9.2, the factors (and loadings) are unique up to an orthogonal rotation. A rotation of the factors often reveals a simple structure and aids interpretation. We shall consider this example again (see Example 9.9 and Panel 9.1) after factor rotation has been discussed. •
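The check that L̃L̃′ + Ψ̃ nearly reproduces R can be repeated directly from the Table 9.1 entries; a minimal sketch (Python with NumPy assumed):

import numpy as np

# Table 9.1 loadings (m = 2) and specific variances
L = np.array([[.56, .82], [.78, -.53], [.65, .75], [.94, -.10], [.80, -.54]])
psi = np.array([.02, .12, .02, .11, .07])

R_fitted = L @ L.T + np.diag(psi)
print(np.round(R_fitted, 2))   # compare entry by entry with the attribute correlation matrix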
Example 9.4 (Factor analysis of stock-price data) Stock-price data consisting of n = 103 weekly rates of return on p = 5 stocks were introduced in Example 8.5. In that example, the first two sample principal components were obtained from R. Taking m = 1 and m = 2, we can easily obtain principal component solutions to the orthogonal factor model. Specifically, the estimated factor loadings are the sample principal component coefficients (eigenvectors of R), scaled by the square root of the corresponding eigenvalues. The estimated factor loadings, communalities, specific variances, and proportion of total (standardized) sample variance explained by each factor for the m = 1 and m = 2 factor solutions are available in Table 9.2. The communalities are given by (9-17). So, for example, with m = 2, h̃₁² = ℓ̃₁₁² + ℓ̃₁₂² = (.732)² + (−.437)² = .73.

Table 9.2
                         One-factor solution            Two-factor solution
                         Estimated    Specific          Estimated factor    Specific
                         factor       variances         loadings            variances
Variable                 loadings F₁  ψ̃ᵢ = 1 − h̃ᵢ²      F₁       F₂         ψ̃ᵢ = 1 − h̃ᵢ²
1. JP Morgan              .732         .46              .732    −.437        .27
2. Citibank               .831         .31              .831    −.280        .23
3. Wells Fargo            .726         .47              .726    −.374        .33
4. Royal Dutch Shell      .605         .63              .605     .694        .15
5. ExxonMobil             .563         .68              .563     .719        .17
Cumulative proportion
of total (standardized)
sample variance explained .487                          .487     .769

The residual matrix corresponding to the solution for m = 2 factors is

R − L̃L̃′ − Ψ̃ = [  0   −.099 −.185 −.025  .056
               −.099    0  −.134  .014 −.054
               −.185 −.134    0   .003  .006
               −.025  .014  .003    0  −.156
                .056 −.054  .006 −.156    0 ]

The proportion of the total variance explained by the two-factor solution is appreciably larger than that for the one-factor solution. However, for m = 2, L̃L̃′ produces numbers that are, in general, larger than the sample correlations. This is particularly true for r₁₃.

It seems fairly clear that the first factor, F₁, represents general economic conditions and might be called a market factor. All of the stocks load highly on this factor, and the loadings are about equal. The second factor contrasts the banking stocks with the oil stocks. (The banks have relatively large negative loadings, and the oils have large positive loadings, on the factor.) Thus, F₂ seems to differentiate stocks in different industries and might be called an industry factor. To summarize, rates of return appear to be determined by general market conditions and activities that are unique to the different industries, as well as a residual or firm-specific factor. This is essentially the conclusion reached by an examination of the sample principal components in Example 8.5. •
A Modified Approach: The Principal Factor Solution
A modification of the principal component approach is sometimes considered. We describe the reasoning in terms of a factor analysis of R, although the procedure is also appropriate for S. If the factor model ρ = LL′ + Ψ is correctly specified, the m common factors should account for the off-diagonal elements of ρ, as well as the communality portions of the diagonal elements

ρᵢᵢ = 1 = hᵢ² + ψᵢ

If the specific factor contribution ψᵢ is removed from the diagonal or, equivalently, the 1 replaced by hᵢ², the resulting matrix is ρ − Ψ = LL′.

Suppose, now, that initial estimates ψᵢ* of the specific variances are available. Then replacing the ith diagonal element of R by hᵢ*² = 1 − ψᵢ*, we obtain a "reduced" sample correlation matrix Rᵣ. Now, apart from sampling variation, all of the elements of the reduced sample correlation matrix Rᵣ should be accounted for by the m common factors. In particular, Rᵣ is factored as

Rᵣ ≐ Lᵣ*Lᵣ*′    (9-21)

where Lᵣ* = {ℓᵢⱼ*} are the estimated loadings. The principal factor method of factor analysis employs the estimates

Lᵣ* = [√λ̂₁* ê₁* ⋮ √λ̂₂* ê₂* ⋮ ⋯ ⋮ √λ̂ₘ* êₘ*]
ψᵢ* = 1 − Σⱼ₌₁ᵐ ℓᵢⱼ*²    (9-22)

where (λ̂ᵢ*, êᵢ*), i = 1, 2, …, m are the (largest) eigenvalue-eigenvector pairs determined from Rᵣ. In turn, the communalities would then be (re)estimated by

h̃ᵢ*² = Σⱼ₌₁ᵐ ℓᵢⱼ*²    (9-23)

The principal factor solution can be obtained iteratively, with the communality estimates of (9-23) becoming the initial estimates for the next stage.

In the spirit of the principal component solution, consideration of the estimated eigenvalues λ̂₁*, λ̂₂*, …, λ̂ₚ* helps determine the number of common factors to retain. An added complication is that now some of the eigenvalues may be negative, due to the use of initial communality estimates. Ideally, we should take the number of common factors equal to the rank of the reduced population matrix. Unfortunately, this rank is not always well determined from Rᵣ, and some judgment is necessary.

Although there are many choices for initial estimates of specific variances, the most popular choice, when one is working with a correlation matrix, is ψᵢ* = 1/rⁱⁱ, where rⁱⁱ is the ith diagonal element of R⁻¹. The initial communality estimates then become

hᵢ*² = 1 − ψᵢ* = 1 − 1/rⁱⁱ    (9-24)

which is equal to the square of the multiple correlation coefficient between Xᵢ and the other p − 1 variables. The relation to the multiple correlation coefficient means that hᵢ*² can be calculated even when R is not of full rank. For factoring S, the initial specific variance estimates use sⁱⁱ, the diagonal elements of S⁻¹. Further discussion of these and other initial estimates is contained in [6].

Although the principal component method for R can be regarded as a principal factor method with initial communality estimates of unity, or specific variances equal to zero, the two are philosophically and geometrically different. (See [6].) In practice, however, the two frequently produce comparable factor loadings if the number of variables is large and the number of common factors is small.

We do not pursue the principal factor solution, since, to our minds, the solution methods that have the most to recommend them are the principal component method and the maximum likelihood method, which we discuss next.
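An iterative principal factor pass is short to write. A sketch (Python with NumPy assumed) starting from the squared-multiple-correlation estimates of (9-24); the clipping of negative eigenvalues reflects the complication noted above:

import numpy as np

def principal_factor(R, m, n_iter=50):
    """Iterative principal factor solution (9-21)-(9-24)."""
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))    # (9-24) initial communalities
    for _ in range(n_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                   # "reduced" correlation matrix
        eigval, eigvec = np.linalg.eigh(Rr)
        idx = np.argsort(eigval)[::-1][:m]
        # (9-22); eigenvalues of Rr may be negative, so clip at zero
        L = eigvec[:, idx] * np.sqrt(np.clip(eigval[idx], 0.0, None))
        h2 = (L**2).sum(axis=1)                    # (9-23) re-estimated communalities
    return L, 1.0 - h2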
The Maximum Likelihood Method

If the common factors F and the specific factors ε can be assumed to be normally distributed, then maximum likelihood estimates of the factor loadings and specific variances may be obtained. When Fⱼ and εⱼ are jointly normal, the observations Xⱼ − μ = LFⱼ + εⱼ are then normal, and from (4-16), the likelihood is

L(μ, Σ) = (2π)^(−np/2) |Σ|^(−n/2) exp{−(1/2) tr[Σ⁻¹(Σⱼ₌₁ⁿ (xⱼ − x̄)(xⱼ − x̄)′ + n(x̄ − μ)(x̄ − μ)′)]}
        = (2π)^(−(n−1)p/2) |Σ|^(−(n−1)/2) exp{−(1/2) tr[Σ⁻¹ Σⱼ₌₁ⁿ (xⱼ − x̄)(xⱼ − x̄)′]}
          × (2π)^(−p/2) |Σ|^(−1/2) exp{−(n/2)(x̄ − μ)′Σ⁻¹(x̄ − μ)}    (9-25)

which depends on L and Ψ through Σ = LL′ + Ψ. This model is still not well defined, because of the multiplicity of choices for L made possible by orthogonal transformations. It is desirable to make L well defined by imposing the computationally convenient uniqueness condition

L′Ψ⁻¹L = Δ,  a diagonal matrix    (9-26)

The maximum likelihood estimates L̂ and Ψ̂ must be obtained by numerical maximization of (9-25). Fortunately, efficient computer programs now exist that enable one to get these estimates rather easily.

We summarize some facts about maximum likelihood estimators and, for now, rely on a computer to perform the numerical details.
Result 9.1. Let X₁, X₂, …, Xₙ be a random sample from Nₚ(μ, Σ), where Σ = LL′ + Ψ is the covariance matrix for the m common factor model of (9-4). The maximum likelihood estimators L̂, Ψ̂, and μ̂ = x̄ maximize (9-25) subject to L̂′Ψ̂⁻¹L̂ being diagonal.

The maximum likelihood estimates of the communalities are

ĥᵢ² = ℓ̂ᵢ₁² + ℓ̂ᵢ₂² + ⋯ + ℓ̂ᵢₘ²  for i = 1, 2, …, p    (9-27)

so

(Proportion of total sample variance due to jth factor) = (ℓ̂₁ⱼ² + ℓ̂₂ⱼ² + ⋯ + ℓ̂ₚⱼ²)/(s₁₁ + s₂₂ + ⋯ + sₚₚ)    (9-28)

Proof. By the invariance property of maximum likelihood estimates (see Section 4.3), functions of L and Ψ are estimated by the same functions of L̂ and Ψ̂. In particular, the communalities hᵢ² = ℓᵢ₁² + ⋯ + ℓᵢₘ² have maximum likelihood estimates ĥᵢ² = ℓ̂ᵢ₁² + ⋯ + ℓ̂ᵢₘ². •

If, as in (8-10), the variables are standardized so that Z = V^(−1/2)(X − μ), then the covariance matrix ρ of Z has the representation

ρ = V^(−1/2)ΣV^(−1/2) = (V^(−1/2)L)(V^(−1/2)L)′ + V^(−1/2)ΨV^(−1/2)    (9-29)

Thus, ρ has a factorization analogous to (9-5) with loading matrix L_z = V^(−1/2)L and specific variance matrix Ψ_z = V^(−1/2)ΨV^(−1/2). By the invariance property of maximum likelihood estimators, the maximum likelihood estimator of ρ is

ρ̂ = (V̂^(−1/2)L̂)(V̂^(−1/2)L̂)′ + V̂^(−1/2)Ψ̂V̂^(−1/2) = L̂_zL̂_z′ + Ψ̂_z    (9-30)

where V̂^(−1/2) and L̂ are the maximum likelihood estimators of V^(−1/2) and L, respectively. (See Supplement 9A.)

As a consequence of the factorization of (9-30), whenever the maximum likelihood analysis pertains to the correlation matrix, we call

ĥᵢ² = ℓ̂ᵢ₁² + ℓ̂ᵢ₂² + ⋯ + ℓ̂ᵢₘ²,  i = 1, 2, …, p    (9-31)

the maximum likelihood estimates of the communalities, and we evaluate the importance of the factors on the basis of

(Proportion of total (standardized) sample variance due to jth factor) = (ℓ̂₁ⱼ² + ℓ̂₂ⱼ² + ⋯ + ℓ̂ₚⱼ²)/p    (9-32)

To avoid more tedious notations, the preceding ℓ̂ᵢⱼ's denote the elements of L̂_z.

Comment. Ordinarily, the observations are standardized, and a sample correlation matrix is factor analyzed. The sample correlation matrix R is inserted for [(n − 1)/n]S in the likelihood function of (9-25), and the maximum likelihood estimates L̂_z and Ψ̂_z are obtained using a computer. Although the likelihood in (9-25) is appropriate for S, not R, surprisingly, this practice is equivalent to obtaining the maximum likelihood estimates L̂ and Ψ̂ based on the sample covariance matrix S, setting L̂_z = V̂^(−1/2)L̂ and Ψ̂_z = V̂^(−1/2)Ψ̂V̂^(−1/2). Here V̂^(−1/2) is the diagonal matrix with the reciprocal of the sample standard deviations (computed with the divisor √n) on the main diagonal.

Going in the other direction, given the estimated loadings L̂_z and specific variances Ψ̂_z obtained from R, we find that the resulting maximum likelihood estimates for a factor analysis of the covariance matrix [(n − 1)/n]S are L̂ = V̂^(1/2)L̂_z and Ψ̂ = V̂^(1/2)Ψ̂_zV̂^(1/2), or

ℓ̂ᵢⱼ = ℓ̂_z,ᵢⱼ √σ̂ᵢᵢ  and  ψ̂ᵢ = ψ̂_z,ᵢ σ̂ᵢᵢ

where σ̂ᵢᵢ is the sample variance computed with divisor n. The distinction between divisors can be ignored with principal component solutions. •

The equivalency between factoring S and R has apparently been confused in many published discussions of factor analysis. (See Supplement 9A.)
Example 9.5 (Factor analysis of stock-price data using the maximum likelihood method) The stock-price data of Examples 8.5 and 9.4 were reanalyzed assuming an m = 2 factor model and using the maximum likelihood method. The estimated factor loadings, communalities, specific variances, and proportion of total (standardized) sample variance explained by each factor are in Table 9.3.³ The corresponding figures for the m = 2 factor solution obtained by the principal component method (see Example 9.4) are also provided. The communalities corresponding to the maximum likelihood factoring of R are of the form [see (9-31)] ĥᵢ² = ℓ̂ᵢ₁² + ℓ̂ᵢ₂². So, for example,

ĥ₁² = (.115)² + (.755)² = .58

³ The maximum likelihood solution leads to a Heywood case. For this example, the solution of the likelihood equations gives estimated loadings such that a specific variance is negative. The software program obtains a feasible solution by slightly adjusting the loadings so that all specific variance estimates are nonnegative. A Heywood case is suggested here by the .00 value for the specific variance of Royal Dutch Shell.
Table 9.3
                      Maximum likelihood                 Principal components
                      Estimated factor   Specific        Estimated factor   Specific
                      loadings           variances       loadings           variances
Variable               F₁      F₂        ψ̂ᵢ = 1 − ĥᵢ²    F₁       F₂        ψ̃ᵢ = 1 − h̃ᵢ²
1. JP Morgan           .115    .755       .42            .732    −.437       .27
2. Citibank            .322    .788       .27            .831    −.280       .23
3. Wells Fargo         .182    .652       .54            .726    −.374       .33
4. Royal Dutch Shell  1.000   −.000       .00            .605     .694       .15
5. ExxonMobil          .683    .032       .53            .563     .719       .17
Cumulative proportion
of total (standardized)
sample variance explained
                       .323    .647                      .487     .769
The residual matrix is

R − L̂L̂′ − Ψ̂ = [ 0   .001  .002  .000  .052
               .001   0   .002  .000  .033
               .002  .002   0   .000  .001
               .000  .000  .000   0   .000
               .052  .033  .001  .000   0 ]

The elements of R − L̂L̂′ − Ψ̂ are much smaller than those of the residual matrix corresponding to the principal component factoring of R presented in Example 9.4. On this basis, we prefer the maximum likelihood approach and typically feature it in subsequent examples.

The cumulative proportion of the total sample variance explained by the factors is larger for principal component factoring than for maximum likelihood factoring. It is not surprising that this criterion typically favors principal component factoring. Loadings obtained by a principal component factor analysis are related to the principal components, which have, by design, a variance-optimizing property. [See the discussion preceding (8-19).]

Focusing attention on the maximum likelihood solution, we see that all variables have positive loadings on F₁. We call this factor the market factor, as we did in the principal component solution. The interpretation of the second factor is not as clear as it appeared to be in the principal component solution. The bank stocks have large positive loadings and the oil stocks have negligible loadings on the second factor F₂. From this perspective, the second factor differentiates the bank stocks from the oil stocks and might be called an industry factor. Alternatively, the second factor might be simply called a banking factor.
Methods of Estimation 499 The patterns of the initial factor loadings for the maximum likelihood solution are constrained by the uniqueness condition that i•.4). The second factor is primarily a strength factor because shot put and discus load highly on this factor.0000 .11 after a discussion of factor rotation.2100 .1546 . 4. In this case.3449 . which based on all 280 cases.2812 .3151 .0000 .500meter run have large positive loading on the first factor. the first factor appears to be a general athletic ability factor but the loading pattern is not as strong as with principal component factor solution.92.4728 . A subsequent interpretation.1012 .0120 .4357 1. Therefore.2380 .3102 .2100 . reinforces the choice m = 4.4163 .3998 .3227 .4036 1.2539 .3509 .0120 .0109 . Factor 2. analyze the correlation matrix.4953 .0983 .4213 . Following his approach we examine the n = 280 complete starts from 1960 through 2004. We.1546 1. all events except the 1.4682 .4008 .4682 .1712 .4179 . The fourfactor maximum likelihood solution accounts for less of the total sample .2116 .0000 .6386 . 1.2395 .0000 .1012 .3503 . This suggests that some events might require unique or specific attributes not required for the other events.0000 .4706 .0983 1.1712 .3262 . Again. The recorded values for each event were standardized and the signs of the timed events changed so that large scores are good for all events.2380 . We shall return to an interpretation of the factors in Example 9.3151 1. although the estimated specific variances are large in some cases (for example. . the two solution methods produced very different results.06. the first four eigenvalues.4125 . although it may have something to do with jumping ability or leg strength.2344 . For the principal component factorization.P.0352 .2812 .000 .6040 .7926 .1i be a diagonal matrix.4357 .4953 1.3503 1..0352 .2344 .4036 .4728 .3657 .4008 .3449 . This factor might be labeled general athletic ability.0000 .6386 1.0002 .500meter run might be called a running endurance factor. much like Linden's original analysis.3657 .4752 . too.2539 .0000 . The remaining factors cannot be easily interpreted to our minds. of R suggest a factor solution with m = 3 or m = 4. The fourfactor principal component solution accounts for much of the total (standardized) sample variance.500meter run have large loadings.21.3509 .3998 . the javelin).5668 .4213 . 1. is R= 1.2395 .3262 . useful factor patterns are often not revealed until the factors are rotated (see Section 9.3102 .6040 .5520 .1821 .39.4752 .0000 .4125 . • Example 9.4163 .2553 .0000 From a principal component factor analysis perspective.0002 .3227 .3520 .5167 .1821 .4179 . For the maximum likelihood method.5167 . The third factor is running endurance since the 400meter run and 1.3520 .5520 . which loads heavily on the 400meter run and 1.6 (Factor analysis of Olympic decathlon data) Linden [11] originally conducted a factor analytic study of Olympic decathlon results for all 160 complete starts from the end of World War II until the midseventies.4706 .2553 1. the fourth factor is not easily identified.0109 .7926 .2116 .5668 .
Longjump Shot put Highjump 400m run 100m hurdles Discus Pole vault Javelin 1500m run .069 .304 .56 .305 .145 1/1. 8.771 . 7.17 .696 .609 F4 .115 .075 .h.620 .20 .085 .197 .27  . 10.60 Cumulative proportion of total variance explained .083 . 6.605 .456 .29 . = 1.62  .018 .57 .239 .424 .4 Principal component Estimated factor loadings Variable 1.220 .67 .416 .255 .23 .39 .078 .28 .513 .084 .493 F4 .45 .289 .407 .33 .12 .746 .30 . 5.162 .17 .571 .079 .993 . 9.434 .01 .402 .30 .189 .711 .102 .42 . = 1.440 .09 .665 . 4.045 .263 .718 .390 .76  .372 .421 .343 .397 .074 .363 .091 .793 .519 .761 .016 F2 F3 .323 .428 .530 .019 .367 .220 .252 .095 .252 .39 .h.518 .002 .73 .549 .777 . 100m run F1 F2 Maximum likelihood Specific variances Estimated factor loadings ~2 Specific variances 6 .Table 9.141 .33 .090 .022 .73 .085 ~ 1/1.461 .181 .021 .468 .15 .005 . 3. Fl .690 .561 .112 .218 .42 . ~ ~2 "' g 2.
Principal component:

R − L̃L̃′ − Ψ̃    [10 × 10 residual matrix with zero diagonal]

Maximum likelihood:

R − L̂L̂′ − Ψ̂    [10 × 10 residual matrix with zero diagonal and uniformly smaller entries]
•

A Large Sample Test for the Number of Common Factors

The assumption of a normal population leads directly to a test of the adequacy of the model. Suppose the m common factor model holds. In this case, Σ = LL′ + Ψ, and testing the adequacy of the m common factor model is equivalent to testing

H₀:  Σ  =  L  L′ +  Ψ     (9-33)
    (p×p)  (p×m)(m×p) (p×p)

versus H₁: Σ any other positive definite matrix. When Σ does not have any special form, the maximum of the likelihood function [see (4-18) and Result 4.11 with Σ̂ = ((n − 1)/n)S = Sₙ] is proportional to

|Sₙ|^(−n/2) e^(−np/2)    (9-34)
Under H₀, Σ is restricted to have the form of (9-33). In this case, the maximum of the likelihood function [see (9-25) with μ̂ = x̄ and Σ̂ = L̂L̂′ + Ψ̂, where L̂ and Ψ̂ are the maximum likelihood estimates of L and Ψ, respectively] is proportional to

|Σ̂|^(−n/2) exp{−(n/2) tr[Σ̂⁻¹Sₙ]}    (9-35)

Using (9-34) and (9-35), we find that the likelihood ratio statistic for testing H₀ is

−2 ln Λ = −2 ln[(maximized likelihood under H₀)/(maximized likelihood)]
        = n ln(|Σ̂|/|Sₙ|) + n[tr(Σ̂⁻¹Sₙ) − p]    (9-36)

with degrees of freedom

ν − ν₀ = ½p(p + 1) − [p(m + 1) − ½m(m − 1)] = ½[(p − m)² − p − m]    (9-37)

Supplement 9A indicates that tr(Σ̂⁻¹Sₙ) − p = 0 provided that Σ̂ = L̂L̂′ + Ψ̂ is the maximum likelihood estimate of Σ = LL′ + Ψ. Thus, we have

−2 ln Λ = n ln(|Σ̂|/|Sₙ|)    (9-38)

Bartlett [3] has shown that the chi-square approximation to the sampling distribution of −2 ln Λ can be improved by replacing n in (9-38) with the multiplicative factor (n − 1 − (2p + 4m + 5)/6). Using Bartlett's correction,⁴ we reject H₀ at the α level of significance if

(n − 1 − (2p + 4m + 5)/6) ln(|L̂L̂′ + Ψ̂|/|Sₙ|) > χ²₍₍ₚ₋ₘ₎²₋ₚ₋ₘ₎/₂(α)    (9-39)

provided that n and n − p are large. Since the number of degrees of freedom, ½[(p − m)² − p − m], must be positive, it follows that

m < ½(2p + 1 − √(8p + 1))    (9-40)

in order to apply the test (9-39).

⁴ Many factor analysts obtain an approximate maximum likelihood estimate by replacing Sₙ with the unbiased estimate S = [n/(n − 1)]Sₙ and then minimizing ln|Σ| + tr[Σ⁻¹S]. The dual substitution of S and the approximate maximum likelihood estimator into the test statistic of (9-39) does not affect its large sample properties.
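The test (9-39) and the degrees-of-freedom computation (9-37) reduce to a few lines; a sketch (Python with SciPy assumed) that, given the determinant ratio, reproduces the numbers of Example 9.7 below:

import numpy as np
from scipy.stats import chi2

def bartlett_factor_test(det_ratio, n, p, m, alpha=0.05):
    """LR test (9-39): det_ratio = |LL' + Psi| / |S_n|, or the R version in (9-41)."""
    df = ((p - m)**2 - p - m) / 2                    # (9-37); must be positive, see (9-40)
    stat = (n - 1 - (2*p + 4*m + 5) / 6) * np.log(det_ratio)
    return stat, chi2.ppf(1 - alpha, df), df

stat, crit, df = bartlett_factor_test(1.0216, n=103, p=5, m=2)
print(round(stat, 2), round(crit, 2), df)            # 2.1  3.84  1.0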
Comment. In implementing the test in (9-39), we are testing for the adequacy of the m common factor model by comparing the generalized variances |L̂L̂′ + Ψ̂| and |Sₙ|. If n is large and m is small relative to p, the hypothesis H₀ will usually be rejected, leading to a retention of more common factors. However, Σ̂ = L̂L̂′ + Ψ̂ may be close enough to Sₙ so that adding more factors does not provide additional insights, even though those factors are "significant." Some judgment must be exercised in the choice of m.

Example 9.7 (Testing for two common factors) The two-factor maximum likelihood analysis of the stock-price data was presented in Example 9.5. The residual matrix there suggests that a two-factor solution may be adequate. Test the hypothesis H₀: Σ = LL′ + Ψ, with m = 2, at level α = .05.

The test statistic in (9-39) is based on the ratio of generalized variances

|Σ̂|/|Sₙ| = |L̂L̂′ + Ψ̂|/|Sₙ|

Let V̂^(−1/2) be the diagonal matrix such that V̂^(−1/2)SₙV̂^(−1/2) = R. By the properties of determinants (see Result 2A.11),

|V̂^(−1/2)| |L̂L̂′ + Ψ̂| |V̂^(−1/2)| = |V̂^(−1/2)L̂L̂′V̂^(−1/2) + V̂^(−1/2)Ψ̂V̂^(−1/2)|

and

|V̂^(−1/2)| |Sₙ| |V̂^(−1/2)| = |V̂^(−1/2)SₙV̂^(−1/2)|

Consequently,

|Σ̂|/|Sₙ| = |L̂_zL̂_z′ + Ψ̂_z|/|R|    (9-41)

by (9-30). From Example 9.5, we determine

|L̂_zL̂_z′ + Ψ̂_z|/|R| = .17898/.17519 = 1.0216
Using Bartlett's correction, we evaluate the test statistic in (9-39):

[n − 1 − (2p + 4m + 5)/6] ln(|L̂L̂′ + Ψ̂|/|Sₙ|) = [103 − 1 − (10 + 8 + 5)/6] ln(1.0216) = 2.10

Since ½[(p − m)² − p − m] = ½[(5 − 2)² − 5 − 2] = 1, the 5% critical value χ²₁(.05) = 3.84 is not exceeded, and we fail to reject H₀. We conclude that the data do not contradict a two-factor model. In fact, the observed significance level, or P-value, P[χ²₁ > 2.10] ≈ .15 implies that H₀ would not be rejected at any reasonable level. •

Large sample variances and covariances for the maximum likelihood estimates ℓ̂ᵢⱼ, ψ̂ᵢ have been derived when these estimates have been determined from the sample covariance matrix S. (See [10].) The expressions are, in general, quite complicated.

9.4 Factor Rotation

As we indicated in Section 9.2, all factor loadings obtained from the initial loadings by an orthogonal transformation have the same ability to reproduce the covariance (or correlation) matrix. [See (9-8).] From matrix algebra, we know that an orthogonal transformation corresponds to a rigid rotation (or reflection) of the coordinate axes. For this reason, an orthogonal transformation of the factor loadings, as well as the implied orthogonal transformation of the factors, is called factor rotation.

If L̂ is the p × m matrix of estimated factor loadings obtained by any method (principal component, maximum likelihood, and so forth) then

L̂* = L̂T,  where TT′ = T′T = I    (9-42)

is a p × m matrix of "rotated" loadings. Moreover, the estimated covariance (or correlation) matrix remains unchanged, since

L̂L̂′ + Ψ̂ = L̂TT′L̂′ + Ψ̂ = L̂*L̂*′ + Ψ̂    (9-43)

Equation (9-43) indicates that the residual matrix, Sₙ − L̂L̂′ − Ψ̂ = Sₙ − L̂*L̂*′ − Ψ̂, remains unchanged. Moreover, the specific variances ψ̂ᵢ, and hence the communalities ĥᵢ², are unaltered. Thus, from a mathematical viewpoint, it is immaterial whether L̂ or L̂* is obtained.

Since the original loadings may not be readily interpretable, it is usual practice to rotate them until a "simpler structure" is achieved. The rationale is very much akin to sharpening the focus of a microscope in order to see the detail more clearly. Ideally, we should like to see a pattern of loadings such that each variable loads highly on a single factor and has small to moderate loadings on the remaining factors. However, it is not always possible to get this simple structure, although the rotated loadings for the decathlon data discussed in Example 9.11 provide a nearly ideal pattern.

We shall concentrate on graphical and analytical methods for determining an orthogonal rotation to a simple structure. When m = 2, or the common factors are considered two at a time, the transformation to a simple structure can frequently be determined graphically. The uncorrelated common factors are regarded as unit vectors along perpendicular coordinate axes.
A plot of the pairs of factor loadings (ℓ̂ᵢ₁, ℓ̂ᵢ₂) yields p points, each point corresponding to a variable. The coordinate axes can then be visually rotated through an angle, call it φ, and the new rotated loadings ℓ̂ᵢⱼ* are determined from the relationships

L̂*  =  L̂  T    (9-44)
(p×2)  (p×2)(2×2)

where

T = [ cos φ  sin φ      (clockwise rotation)
     −sin φ  cos φ]

T = [cos φ  −sin φ      (counterclockwise rotation)
     sin φ   cos φ]

The relationship in (9-44) is rarely implemented in a two-dimensional graphical analysis. In this situation, clusters of variables are often apparent by eye, and these clusters enable one to identify the common factors without having to inspect the magnitudes of the rotated loadings. On the other hand, for m > 2, orientations are not easily visualized, and the magnitudes of the rotated loadings must be inspected to find a meaningful interpretation of the original data. The choice of an orthogonal matrix T that satisfies an analytical measure of simple structure will be considered shortly.

Example 9.8 (A first look at factor rotation) Lawley and Maxwell [10] present the sample correlation matrix of examination scores in p = 6 subject areas for n = 220 male students. The correlation matrix is

          Gaelic  English  History  Arithmetic  Algebra  Geometry
R = [      1.0
            .439    1.0
            .410    .351     1.0
            .288    .354     .164      1.0
            .329    .320     .190      .595      1.0
            .248    .329     .181      .470      .464      1.0   ]

and a maximum likelihood solution for m = 2 common factors yields the estimates in Table 9.5.

Table 9.5
                 Estimated factor loadings    Communalities
Variable            F₁        F₂                 ĥᵢ²
1. Gaelic           .553      .429               .490
2. English          .568      .288               .406
3. History          .392      .450               .356
4. Arithmetic       .740     −.273               .623
5. Algebra          .724     −.211               .569
6. Geometry         .595     −.132               .372

All the variables have positive loadings on the first factor. Lawley and Maxwell suggest that this factor reflects the overall response of the students to instruction and might be labeled a general intelligence factor. Half the loadings are positive and half are negative on the second factor. A factor with this pattern of loadings is called a bipolar factor. (The assignment of negative and positive poles is arbitrary, because the signs of the loadings on a factor can be reversed without affecting the analysis.) This factor is not easily identified, but is such that individuals who get above-average scores on the verbal tests get above-average scores on the factor. Individuals with above-average scores on the mathematical tests get below-average scores on the factor. Perhaps this factor can be classified as a "math-nonmath" factor.

The factor loading pairs (ℓ̂ᵢ₁, ℓ̂ᵢ₂) are plotted as points in Figure 9.1. The points are labeled with the numbers of the corresponding variables. Also shown is a clockwise orthogonal rotation of the coordinate axes through an angle of φ ≈ 20°. This angle was chosen so that one of the new axes passes through (ℓ̂₄₁, ℓ̂₄₂). When this is done, all the points fall in the first quadrant (the factor loadings are all positive), and the two distinct clusters of variables are more clearly revealed.

[Figure 9.1 Factor rotation for test scores.]

The mathematical test variables load highly on F₁* and have negligible loadings on F₂*. The first factor might be called a mathematical-ability factor. Similarly, the three verbal test variables have high loadings on F₂* and moderate to small loadings on F₁*. The second factor might be labeled a verbal-ability factor. The general-intelligence factor identified initially is submerged in the factors F₁* and F₂*.

The rotated factor loadings obtained from (9-44) with φ ≈ 20° and the corresponding communality estimates are shown in Table 9.6. [Table 9.6 lists the estimated rotated factor loadings F₁*, F₂* for the six subjects; the communalities ĥᵢ*² = ĥᵢ² = .490, .406, .356, .623, .569, .372 are unchanged from Table 9.5.] The magnitudes of the rotated factor loadings reinforce the interpretation of the factors suggested by Figure 9.1. The communality estimates are unchanged by the orthogonal rotation, since L̂L̂′ = L̂TT′L̂′ = L̂*L̂*′, and the communalities are the diagonal elements of these matrices.

We point out that Figure 9.1 suggests an oblique rotation of the coordinates. One new axis would pass through the cluster {1, 2, 3} and the other through the {4, 5, 6} group. Oblique rotations are so named because they correspond to a nonrigid rotation of coordinate axes leading to new axes that are not perpendicular. It is apparent, however, that the interpretation of the oblique factors for this example would be much the same as that given previously for an orthogonal rotation. •

Kaiser [9] has suggested an analytical measure of simple structure known as the varimax (or normal varimax) criterion. Define ℓ̃ᵢⱼ* = ℓ̂ᵢⱼ*/ĥᵢ to be the rotated coefficients scaled by the square root of the communalities. Then the (normal) varimax procedure selects the orthogonal transformation T that makes

V = (1/p) Σⱼ₌₁ᵐ [ Σᵢ₌₁ᵖ ℓ̃ᵢⱼ*⁴ − (Σᵢ₌₁ᵖ ℓ̃ᵢⱼ*²)²/p ]    (9-45)

as large as possible.

Scaling the rotated coefficients ℓ̂ᵢⱼ* has the effect of giving variables with small communalities relatively more weight in the determination of simple structure. After the transformation T is determined, the loadings ℓ̃ᵢⱼ* are multiplied by ĥᵢ so that the original communalities are preserved.

Although (9-45) looks rather forbidding, it has a simple interpretation. In words,

V ∝ Σⱼ₌₁ᵐ (variance of squares of (scaled) loadings for jth factor)    (9-46)

Effectively, maximizing V corresponds to "spreading out" the squares of the loadings on each factor as much as possible. Therefore, we hope to find groups of large and negligible coefficients in any column of the rotated loadings matrix L̂*.

Computing algorithms exist for maximizing V, and most popular factor analysis computer programs (for example, the statistical software packages SAS, SPSS, BMDP, and MINITAB) provide varimax rotations. As might be expected, varimax rotations of factor loadings obtained by different solution methods (principal components, maximum likelihood, and so forth) will not, in general, coincide. Also, the pattern of rotated loadings may change considerably if additional common factors are included in the rotation. If a dominant single factor exists, it will generally be obscured by any orthogonal rotation. By contrast, it can always be held fixed and the remaining factors rotated.
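The criterion (9-45) can be maximized with the widely used singular-value-decomposition iteration. The following is a sketch of that standard algorithm (Python with NumPy assumed); it is not the specific routine used by SAS or the other packages named above, and it implements the normal varimax, in which rows are first scaled by the square roots of the communalities:

import numpy as np

def varimax(L, tol=1e-8, max_iter=500):
    """Normal varimax: rotate loadings L (p x m) to maximize (9-45)."""
    h = np.sqrt((L**2).sum(axis=1))             # square roots of communalities
    A = L / h[:, None]                          # scale rows (Kaiser normalization)
    p, m = A.shape
    T = np.eye(m)
    d_old = 0.0
    for _ in range(max_iter):
        B = A @ T
        G = A.T @ (B**3 - B @ np.diag((B**2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                              # nearest orthogonal matrix
        d = s.sum()
        if d < d_old * (1 + tol):
            break
        d_old = d
    return (A @ T) * h[:, None], T              # unscale and return rotated loadings

Applied to the Table 9.1 loadings, this reproduces the rotated pattern reported in Example 9.9 below, up to the order and signs of the columns.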
Example 9.9 (Rotated loadings for the consumer-preference data) Let us return to the marketing data discussed in Example 9.3. The original factor loadings (obtained by the principal component method), the communalities, and the (varimax) rotated factor loadings are shown in Table 9.7. (See the SAS statistical software output in Panel 9.1.)

Table 9.7
                             Estimated factor   Rotated estimated    Communalities
                             loadings           factor loadings
Variable                      F₁      F₂         F₁*      F₂*           ĥᵢ²
1. Taste                      .56     .82        .02      .99           .98
2. Good buy for money         .78    −.53        .94     −.01           .88
3. Flavor                     .65     .75        .13      .98           .98
4. Suitable for snack         .94    −.10        .84      .43           .89
5. Provides lots of energy    .80    −.54        .97     −.02           .93
Cumulative proportion of
total (standardized)
sample variance explained     .571    .932       .507     .932

It is clear that variables 2, 4, and 5 define factor 1 (high loadings on factor 1, small or negligible loadings on factor 2), while variables 1 and 3 define factor 2 (high loadings on factor 2, small or negligible loadings on factor 1). Variable 4 is most closely aligned with factor 1, although it has aspects of the trait represented by factor 2. We might call factor 1 a nutritional factor and factor 2 a taste factor.

The factor loadings for the variables are pictured with respect to the original and (varimax) rotated factor axes in Figure 9.2.

[Figure 9.2 Factor rotation for hypothetical marketing data.] •
PANEL 9.1 SAS ANALYSIS FOR EXAMPLE 9.9 USING PROC FACTOR

PROGRAM COMMANDS

title 'Factor Analysis';
data consumer(type = corr);
_type_ = 'CORR';
input _name_$ taste money flavor snack energy;
cards;
taste  1.00  .    .    .    .
money   .02 1.00  .    .    .
flavor  .96  .13 1.00  .    .
snack   .42  .71  .50 1.00  .
energy  .01  .85  .11  .79 1.00
;
proc factor res data = consumer method = prin nfact = 2 rotate = varimax preplot plot;
var taste money flavor snack energy;

OUTPUT

Prior Communality Estimates: ONE

Eigenvalues of the Correlation Matrix: Total = 5  Average = 1

              1           2           3           4           5
Eigenvalue    2.853090    1.806332    0.204490    0.102409    0.033677
Difference    1.046758    1.601842    0.102081    0.068732
Proportion    0.5706      0.3613      0.0409      0.0205      0.0067
Cumulative    0.5706      0.9319      0.9728      0.9933      1.0000

2 factors will be retained by the NFACTOR criterion.

Factor Pattern
           FACTOR1     FACTOR2
TASTE      0.55986     0.81610
MONEY      0.77726    -0.52420
FLAVOR     0.64534     0.74795
SNACK      0.93911    -0.10492
ENERGY     0.79821    -0.54323

Final Communality Estimates: Total = 4.659423
TASTE 0.97961   MONEY 0.87892   FLAVOR …   SNACK …   ENERGY …

Rotation Method: Varimax

Rotated Factor Pattern
           FACTOR1     FACTOR2
TASTE      0.01970     0.98948
MONEY      0.93744    -0.01123
FLAVOR     0.12856     0.97947
SNACK      0.84244     0.42805
ENERGY     0.96539    -0.01563

Variance explained by each factor
FACTOR1 2.537396    FACTOR2 2.122027

Rotation of factor loadings is recommended particularly for loadings obtained by maximum likelihood, since the initial values are constrained to satisfy the uniqueness condition that L̂′Ψ̂⁻¹L̂ be a diagonal matrix. This condition is convenient for computational purposes, but may not lead to factors that can easily be interpreted.

Example 9.10 (Rotated loadings for the stock-price data) Table 9.8 shows the initial and rotated maximum likelihood estimates of the factor loadings for the stock-price data of Examples 8.5 and 9.5. An m = 2 factor model is assumed. The estimated specific variances and cumulative proportions of the total (standardized) sample variance explained by each factor are also given.

Table 9.8
                      Maximum likelihood     Rotated estimated    Specific
                      estimates of factor    factor loadings      variances
                      loadings
Variable               F₁       F₂            F₁*      F₂*         ψ̂ᵢ = 1 − ĥᵢ²
1. JP Morgan           .115     .755           …        …           .42
2. Citibank            .322     .788           …        …           .27
3. Wells Fargo         .182     .652           …        …           .54
4. Royal Dutch Shell  1.000    −.000           …        …           .00
5. ExxonMobil          .683     .032           …        …           .53
Cumulative proportion
of total sample
variance explained     .323     .647                    .647
[Table 9.9 Varimax rotated estimated factor loadings F_1*, F_2*, F_3*, F_4* and specific variances ψ̃_i for the m = 4 principal component and maximum likelihood solutions to the Olympic decathlon data (100-m run, long jump, shot put, high jump, 400-m run, 110-m hurdles, discus, pole vault, javelin, 1500-m run), together with the cumulative proportions of total sample variance explained. The largest loadings are boxed in the original display.]

Plots of rotated maximum likelihood loadings for factor pairs (1, 2) and (1, 3) are displayed in Figure 9.3 on page 513. The points are generally grouped along the factor axes. Plots of rotated principal component loadings are very similar. •

Oblique Rotations

Orthogonal rotations are appropriate for a factor model in which the common factors are assumed to be independent. Many investigators in social sciences consider oblique (nonorthogonal) rotations, as well as orthogonal rotations.
However.. f. 2.. 3)decathlon data. 2) and (1. for example. If we regard the m common factors as coordinate axes. the point with the m coordinates ( f.0 1.4 6 • • 0. An oblique rotation seeks to express each variable in terms of a minimum number of factorspreferably. Rather.4 Factor I 0. . the estimated values of the common factors. 2 . f.6 0. after rotation.2 cv I 0 (7 I I 0. as well as inputs to a subsequent analysis.6 Factor I 9 01 Figure 9.2 • 0. 111 ) represents the position of the ith variable in the factor space.0 0. . These quantities are often used for diagnostic purposes.2 4 2 0. they are estimates of values for the unobserved random factor vectors Fi.4 8 0. · Factor scores are not estimates of unknown parameters in the usual sense.) often suggested after one views the estimated factor loadings and do not follow from our postulated model.6 • I 9 0 I " ~ 0. interest is usually centered on the parameters in the factor model. j = 1. .Factor Scores 513 1.6 0.8 g 0. 9. That is. 1 .00.0 0. pass as closely to the clusters as possible. (The numbers in the figures correspond to variables.4 0.S Factor Scores In factor analysis.8 • • 0.0 0.0 0. an oblique rotatio!J. Assuming that the variables are grouped into nonoverlapping clusters...8 0. called factor scores.3 Rotated maximum likelihood loadings for factor pairs (1. is frequently a useful aid in factor analysis. n. Oblique rotations are discussed in several sources (see. factor scores fi = estimate of the values fi attained by Fi (jth case) . a single factor.2 0. An oblique rotation to a simple structure corresponds to a nonrigid rotation of the coordinate system such that the rotated axes (no longer perpendicular) pass (nearly) through the clusters. [6] or [10]) and will not be pursued in this book. Nevertheless. may also be required. an orthogonal rotation to a simple structure corresponds to a rigid rotation of the coordi: nate axes such that the axes. .
The estimation situation is complicated by the fact that the unobserved quantities f_j and ε_j outnumber the observed x_j. To overcome this difficulty, some rather heuristic, but reasoned, approaches to the problem of estimating factor values have been advanced. We describe two of these approaches.

Both of the factor score approaches have two elements in common:

1. They treat the estimated factor loadings ℓ̂_ij and specific variances ψ̂_i as if they were the true values.
2. They involve linear transformations of the original data, perhaps centered or standardized. Typically, the estimated rotated loadings, rather than the original estimated loadings, are used to compute factor scores. The computational formulas, as given in this section, do not change when rotated loadings are substituted for unrotated loadings, so we will not differentiate between them.

The Weighted Least Squares Method

Suppose first that the mean vector μ, the factor loadings L, and the specific variance Ψ are known for the factor model

    X − μ  =  L F + ε
    (p×1)    (p×m)(m×1) (p×1)

Further, regard the specific factors ε' = [ε_1, ε_2, ..., ε_p] as errors. Since Var(ε_i) = ψ_i, i = 1, 2, ..., p, need not be equal, Bartlett [2] has suggested that weighted least squares be used to estimate the common factor values. The sum of the squares of the errors, weighted by the reciprocal of their variances, is

    ε'Ψ⁻¹ε = Σ_{i=1}^p ε_i²/ψ_i = (x − μ − Lf)'Ψ⁻¹(x − μ − Lf)    (9-47)

Bartlett proposed choosing the estimates f̂ of f to minimize (9-47). The solution (see Exercise 7.3) is

    f̂ = (L'Ψ⁻¹L)⁻¹L'Ψ⁻¹(x − μ)    (9-48)

Motivated by (9-48), we take the estimates L̂, Ψ̂, and μ̂ = x̄ as the true values and obtain the factor scores for the jth case as

    f̂_j = (L̂'Ψ̂⁻¹L̂)⁻¹L̂'Ψ̂⁻¹(x_j − x̄)    (9-49)

When L̂ and Ψ̂ are determined by the maximum likelihood method, these estimates must satisfy the uniqueness condition, L̂'Ψ̂⁻¹L̂ = Δ̂, a diagonal matrix. We then have the following:
Factor Scores Obtained by Weighted Least Squares from the Maximum Likelihood Estimates

    f̂_j = (L̂'Ψ̂⁻¹L̂)⁻¹L̂'Ψ̂⁻¹(x_j − x̄) = Δ̂⁻¹L̂'Ψ̂⁻¹(x_j − x̄),    j = 1, 2, ..., n

or, if the correlation matrix is factored,

    f̂_j = (L̂_z'Ψ̂_z⁻¹L̂_z)⁻¹L̂_z'Ψ̂_z⁻¹z_j = Δ̂_z⁻¹L̂_z'Ψ̂_z⁻¹z_j,    j = 1, 2, ..., n    (9-50)

where z_j = D^(−1/2)(x_j − x̄), as in (8-25), and ρ̂ = L̂_zL̂_z' + Ψ̂_z.

The factor scores generated by (9-50) have sample mean vector 0 and zero sample covariances. (See Exercise 9.16.) If rotated loadings L̂* = L̂T are used in place of the original loadings in (9-50), the subsequent factor scores, f̂*_j, are related to f̂_j by f̂*_j = T'f̂_j, j = 1, 2, ..., n.

Comment. If the factor loadings are estimated by the principal component method, it is customary to generate factor scores using an unweighted (ordinary) least squares procedure. Implicitly, this amounts to assuming that the ψ_i are equal or nearly equal. The factor scores are then

    f̂_j = (L̂'L̂)⁻¹L̂'(x_j − x̄)    or    f̂_j = (L̂_z'L̂_z)⁻¹L̂_z'z_j    for standardized data.

Since L̂ = [√λ̂_1 ê_1 | √λ̂_2 ê_2 | ··· | √λ̂_m ê_m] [see (9-15)], we have

    f̂_j = [ (1/√λ̂_1) ê_1'(x_j − x̄), (1/√λ̂_2) ê_2'(x_j − x̄), ..., (1/√λ̂_m) ê_m'(x_j − x̄) ]'    (9-51)

For these factor scores,

    (1/n) Σ_{j=1}^n f̂_j = 0    (sample mean)

and

    (1/(n − 1)) Σ_{j=1}^n f̂_j f̂_j' = I    (sample covariance)
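Formula (9-50) translates directly into code. Below is a minimal numpy sketch under the convention that the rows of X are observations; the function name and arguments are our own.

import numpy as np

def bartlett_scores(X, L, psi):
    """Weighted least squares (Bartlett) factor scores, equation (9-50).

    X   : (n, p) data matrix
    L   : (p, m) estimated factor loadings (possibly rotated)
    psi : (p,) estimated specific variances
    """
    Psi_inv = np.diag(1.0 / psi)
    A = L.T @ Psi_inv @ L                        # equals Delta-hat for ML loadings
    coef = np.linalg.solve(A, L.T @ Psi_inv)     # (L'Psi^-1 L)^-1 L'Psi^-1
    return (X - X.mean(axis=0)) @ coef.T         # row j is f-hat_j'

A quick check on any data set: the returned scores have sample mean vector 0, as asserted above (scores.mean(axis=0) is numerically zero).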
Comparing (9-51) with (8-21), we see that the f̂_j are nothing more than the first m (scaled) principal components, evaluated at x_j. •

The Regression Method

Starting again with the original factor model X − μ = LF + ε, we initially treat the loadings matrix L and specific variance matrix Ψ as known. When the common factors F and the specific factors (or errors) ε are jointly normally distributed with means and covariances given by (9-3), the linear combination X − μ = LF + ε has an N_p(0, LL' + Ψ) distribution. (See Result 4.3.) Moreover, the joint distribution of (X − μ) and F is N_{m+p}(0, Σ*), where

    Σ* = [ Σ = LL' + Ψ    L ]
         [ L'             I ]        ((m + p) × (m + p))    (9-52)

and 0 is an (m + p) × 1 vector of zeros. Using Result 4.6, we find that the conditional distribution of F given x is multivariate normal with

    mean = E(F | x) = L'Σ⁻¹(x − μ) = L'(LL' + Ψ)⁻¹(x − μ)    (9-53)

and

    covariance = Cov(F | x) = I − L'Σ⁻¹L = I − L'(LL' + Ψ)⁻¹L    (9-54)

The quantities L'(LL' + Ψ)⁻¹ in (9-53) are the coefficients in a (multivariate) regression of the factors on the variables. Estimates of these coefficients produce factor scores that are analogous to the estimates of the conditional mean values in multivariate regression analysis. (See Chapter 7.) Consequently, given any vector of observations x_j, and taking the maximum likelihood estimates L̂ and Ψ̂ as the true values, we see that the jth factor score vector is given by

    f̂_j = L̂'Σ̂⁻¹(x_j − x̄) = L̂'(L̂L̂' + Ψ̂)⁻¹(x_j − x̄),    j = 1, 2, ..., n    (9-55)

The calculation of f̂_j in (9-55) can be simplified by using the matrix identity (see Exercise 9.6)

    L̂'(L̂L̂' + Ψ̂)⁻¹ = (I + L̂'Ψ̂⁻¹L̂)⁻¹L̂'Ψ̂⁻¹    (9-56)

This identity allows us to compare the factor scores in (9-55), generated by the regression argument, with those generated by the weighted least squares procedure [see (9-50)]. Temporarily, we denote the former by f̂_j^R and the latter by f̂_j^LS. Then, using (9-56), we obtain

    f̂_j^LS = (L̂'Ψ̂⁻¹L̂)⁻¹(I + L̂'Ψ̂⁻¹L̂) f̂_j^R = (I + (L̂'Ψ̂⁻¹L̂)⁻¹) f̂_j^R    (9-57)

For maximum likelihood estimates, (L̂'Ψ̂⁻¹L̂)⁻¹ = Δ̂⁻¹, and if the elements of this diagonal matrix are close to zero, the regression and generalized least squares methods will give nearly the same factor scores.
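The identity (9-56) is easy to confirm numerically for any admissible L and Ψ. A quick sketch with arbitrary simulated values (not taken from any example in the text):

import numpy as np

rng = np.random.default_rng(0)
p, m = 5, 2
L = rng.normal(size=(p, m))
psi = rng.uniform(0.2, 1.0, size=p)
Sigma = L @ L.T + np.diag(psi)
Psi_inv = np.diag(1.0 / psi)

lhs = L.T @ np.linalg.inv(Sigma)                       # L'(LL' + Psi)^-1
rhs = np.linalg.solve(np.eye(m) + L.T @ Psi_inv @ L,   # (I + L'Psi^-1 L)^-1 L'Psi^-1
                      L.T @ Psi_inv)
print(np.allclose(lhs, rhs))                           # True: verifies (9-56)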
In an attempt to reduce the effects of a (possibly) incorrect determination of the number of factors, practitioners tend to calculate the factor scores in (9-55) by using S (the original sample covariance matrix) instead of Σ̂ = L̂L̂' + Ψ̂. We then have the following:

Factor Scores Obtained by Regression

    f̂_j = L̂'S⁻¹(x_j − x̄),    j = 1, 2, ..., n

or, if a correlation matrix is factored,

    f̂_j = L̂_z'R⁻¹z_j,    j = 1, 2, ..., n    (9-58)

where, see (8-25), z_j = D^(−1/2)(x_j − x̄) and ρ̂ = L̂_zL̂_z' + Ψ̂_z.

Again, if rotated loadings L̂* = L̂T are used in place of the original loadings in (9-58), the subsequent factor scores f̂*_j are related to f̂_j by f̂*_j = T'f̂_j, j = 1, 2, ..., n.

A numerical measure of agreement between the factor scores generated from two different calculation methods is provided by the sample correlation coefficient between scores on the same factor. Of the methods presented, none is recommended as uniformly superior.

Example 9.12 (Computing factor scores) We shall illustrate the computation of factor scores by the least squares and regression methods using the stock-price data discussed in Example 9.10. A maximum likelihood solution from R gave the estimated rotated loadings and specific variances

    L̂*_z = [ .763  .024 ]
            [ .821  .227 ]        Ψ̂_z = diag[.42, .27, .54, .00, .53]
            [ .669  .104 ]
            [ .118  .993 ]
            [ .113  .675 ]

The vector of standardized observations,

    z' = [.50, −1.40, −.20, −.70, 1.40]

yields the following scores on factors 1 and 2:
Weighted least squares (9-50):⁶

    f̂ = (L̂*_z'Ψ̂_z⁻¹L̂*_z)⁻¹L̂*_z'Ψ̂_z⁻¹z = [−.61, −.61]'

Regression (9-58):

    f̂ = L̂*_z'R⁻¹z = [ .331   .526   .221  −.137   .011 ] z = [−.50]
                     [−.040  −.026  −.001  1.023   .063 ]     [−.61]

In this case, the two methods produce very similar results. All of the factor scores, obtained using (9-58), are plotted in Figure 9.4.

⁶ In order to calculate the weighted least squares factor scores, ψ̂_4 = .00 was set to .01 so that this matrix could be inverted.

[Figure 9.4 Factor scores using (9-58) for factors 1 and 2 of the stock-price data (maximum likelihood estimates of the factor loadings).]
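The two score vectors above can be reproduced in a few lines. In the sketch below, the regression scores are computed with Σ̂ = L̂*_z L̂*_z' + Ψ̂_z standing in for R, since R itself is not reprinted here, so the second result only approximates the (9-58) values displayed above.

import numpy as np

Lz = np.array([[.763, .024], [.821, .227], [.669, .104],
               [.118, .993], [.113, .675]])     # rotated ML loadings
psi = np.array([.42, .27, .54, .01, .53])       # psi_4 raised from .00 to .01 (footnote 6)
z = np.array([.50, -1.40, -.20, -.70, 1.40])

Pinv = np.diag(1.0 / psi)
f_wls = np.linalg.solve(Lz.T @ Pinv @ Lz, Lz.T @ Pinv @ z)   # (9-50): about (-.61, -.61)
Sigma_hat = Lz @ Lz.T + np.diag(psi)
f_reg = Lz.T @ np.linalg.solve(Sigma_hat, z)                  # (9-55) form: about (-.52, -.62)
print(np.round(f_wls, 2), np.round(f_reg, 2))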
Comment. Factor scores with a rather pleasing intuitive property can be constructed very simply. Group the variables with high loadings (in absolute value) on a factor. The scores for factor 1 are then formed by summing the (standardized) observed values of the variables in the group, combined according to the sign of the loadings. The factor scores for factor 2 are the sums of the standardized observations corresponding to variables with high loadings on factor 2, and so forth. Data reduction is accomplished by replacing the standardized data by these simple factor scores. The simple factor scores are frequently highly correlated with the factor scores obtained by the more complex least squares and regression methods.

Example 9.13 (Creating simple summary scores from factor analysis groupings) The principal component factor analysis of the stock-price data in Example 9.4 produced the estimated loadings

    L̂ = [ .732  −.437 ]        L̂* = L̂T = [ .852  .030 ]
        [ .831  −.280 ]                    [ .851  .214 ]
        [ .726  −.374 ]                    [ .813  .079 ]
        [ .605   .694 ]                    [ .133  .911 ]
        [ .563   .719 ]                    [ .084  .909 ]

For each factor, take the loadings with largest absolute value in L̂ as equal in magnitude, and neglect the smaller loadings. Thus, we create the linear combinations

    f̂_1 = x_1 + x_2 + x_3 + x_4 + x_5
    f̂_2 = x_4 + x_5 − x_1 − x_2 − x_3

as a summary. In practice, we would standardize these new variables. If, instead of L̂, we start with the varimax rotated loadings L̂*, the simple factor scores would be

    f̂_1 = x_1 + x_2 + x_3
    f̂_2 = x_4 + x_5

The identification of high loadings and negligible loadings is really quite subjective. Linear compounds that make subject-matter sense are preferable. •
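The simple summary scores of the Comment and of Example 9.13 amount to multiplying the standardized data by a matrix of signs. A minimal sketch, with a hypothetical cutoff of .4 for "high" loadings (the cutoff, like the function name, is our own choice):

import numpy as np

def simple_scores(Z, L, threshold=0.4):
    """Sum standardized variables with |loading| > threshold, signed by the loading."""
    W = np.where(np.abs(L) > threshold, np.sign(L), 0.0)   # p x m matrix of -1 / 0 / +1
    return Z @ W

# Rotated loadings L* from Example 9.13: factor 1 -> x1 + x2 + x3, factor 2 -> x4 + x5
L_star = np.array([[.852, .030], [.851, .214], [.813, .079],
                   [.133, .911], [.084, .909]])
Z = np.array([[.50, -1.40, -.20, -.70, 1.40]])   # the standardized vector of Example 9.12
print(simple_scores(Z, L_star))                  # [[-1.10, .70]]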
Although multivariate normality is often assumed for the variables in a factor analysis, it is very difficult to justify the assumption for a large number of variables. As we pointed out in Chapter 4, marginal transformations may help. Moreover, the factor scores may or may not be normally distributed. Bivariate scatter plots of factor scores can produce all sorts of nonelliptical shapes. Plots of factor scores should be examined prior to using these scores in other analyses. They can reveal outlying values and the extent of the (possible) nonnormality.

9.6 Perspectives and a Strategy for Factor Analysis

There are many decisions that must be made in any factor analytic study. Probably the most important decision is the choice of m, the number of common factors. Although a large sample test of the adequacy of a model is available for a given m, it is suitable only for data that are approximately normally distributed. Moreover, the test will most assuredly reject the model for small m if the number of variables and observations is large. Yet this is the situation when factor analysis provides a useful approximation. Most often, the final choice of m is based on some combination of (1) the proportion of the sample variance explained, (2) subject-matter knowledge, and (3) the "reasonableness" of the results.

The choice of the solution method and type of rotation is a less crucial decision. In fact, the most satisfactory factor analyses are those in which rotations are tried with more than one method and all the results substantially confirm the same factor structure. At the present time, factor analysis still maintains the flavor of an art, and no single strategy should yet be "chiseled into stone." We suggest and illustrate one reasonable option:

1. Perform a principal component factor analysis. This method is particularly appropriate for a first pass through the data. (It is not required that R or S be nonsingular.)
   (a) Look for suspicious observations by plotting the factor scores. Also calculate standardized scores for each observation and squared distances as described in Section 4.6.
   (b) Try a varimax rotation.
2. Perform a maximum likelihood factor analysis, including a varimax rotation.
3. Compare the solutions obtained from the two factor analyses.
   (a) Do the loadings group in the same manner?
   (b) Plot factor scores obtained for principal components against scores from the maximum likelihood analysis.
4. Repeat the first three steps for other numbers of common factors m. Do extra factors necessarily contribute to the understanding and interpretation of the data?
5. For large data sets, split them in half and perform a factor analysis on each part. Compare the two results with each other and with that obtained from the complete data set to check the stability of the solution. (The data might be divided by placing the first half of the cases in one group and the second half of the cases in the other group; a code sketch of this check follows the list.)
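Steps 2 and 5 of this option can be sketched with scikit-learn, whose FactorAnalysis estimator fits the maximum likelihood model and, in recent versions, accepts a varimax rotation. The random split below is one convenient way to divide the data; the division into first and second halves described in step 5 works equally well. Function and variable names are our own.

import numpy as np
from sklearn.decomposition import FactorAnalysis

def split_half_loadings(X, m, seed=0):
    # Step 5: fit an m-factor maximum likelihood model to each half of the data
    rng = np.random.default_rng(seed)
    halves = np.array_split(rng.permutation(len(X)), 2)
    fits = [FactorAnalysis(n_components=m, rotation="varimax").fit(X[h]) for h in halves]
    return [f.components_.T for f in fits]        # two p x m loading matrices

# L1, L2 = split_half_loadings(X, m=3)
# Compare L1 and L2 with each other and with the full-data loadings (step 3).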
Example 9.14 (Factor analysis of chicken-bone data) We present the results of several factor analyses on bone and skull measurements of white leghorn fowl. The original data were taken from Dunn [5]. Factor analysis of Dunn's data was originally considered by Wright [15], who started his analysis from a different correlation matrix than the one we use. The full data set consists of n = 276 measurements on bone dimensions:

    Head:  x_1 = skull length      x_2 = skull breadth
    Leg:   x_3 = femur length      x_4 = tibia length
    Wing:  x_5 = humerus length    x_6 = ulna length

The sample correlation matrix

        [ 1.000                                        ]
        [  .505  1.000                                 ]
    R = [  .569   .422  1.000                          ]
        [  .602   .467   .926  1.000                   ]
        [  .621   .482   .877   .874  1.000            ]
        [  .603   .450   .878   .894   .937  1.000     ]

was factor analyzed by the principal component and maximum likelihood methods for an m = 3 factor model. The results are given in Table 9.10.⁷

[Table 9.10 Factor Analysis of Chicken-Bone Data. For each of the principal component and maximum likelihood solutions: estimated factor loadings F_1, F_2, F_3; varimax rotated loadings F_1*, F_2*, F_3*; specific variances ψ̃_i = 1 − h̃_i²; and cumulative proportions of total (standardized) sample variance explained, for the six variables above.]

⁷ Notice the estimated specific variance of .00 for tibia length in the maximum likelihood solution. This suggests that maximizing the likelihood function may produce a Heywood case. Readers attempting to replicate our results should try the Hey(wood) option if SAS or similar software is used.
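For a first pass (step 1 of the strategy above), the unrotated principal component solution of Table 9.10 can be computed directly from the correlation matrix just displayed. A short numpy sketch; the residual matrix computed at the end is the quantity examined later in this example.

import numpy as np

R = np.array([
    [1.000, .505, .569, .602, .621, .603],
    [ .505, 1.000, .422, .467, .482, .450],
    [ .569, .422, 1.000, .926, .877, .878],
    [ .602, .467, .926, 1.000, .874, .894],
    [ .621, .482, .877, .874, 1.000, .937],
    [ .603, .450, .878, .894, .937, 1.000]])

vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]                 # eigh returns ascending order
vals, vecs = vals[order], vecs[:, order]

m = 3
L = vecs[:, :m] * np.sqrt(vals[:m])            # principal component loadings, (9-15)
psi = 1 - (L**2).sum(axis=1)                   # specific variances
residual = R - L @ L.T - np.diag(psi)          # residual matrix (zero diagonal)
print(np.cumsum(vals[:m]) / R.shape[0])        # cumulative proportion of variance explained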
After rotation, the two methods of solution appear to give somewhat different results. Focusing our attention on the principal component method and the cumulative proportion of the total sample variance explained, we see that a three-factor solution appears to be warranted. The third factor explains a "significant" amount of additional sample variation. The first factor appears to be a body-size factor dominated by wing and leg dimensions. The second and third factors, collectively, represent skull dimensions and might be given the same names as the variables, skull breadth and skull length, respectively.

The rotated maximum likelihood factor loadings are consistent with those generated by the principal component method for the first factor, but not for factors 2 and 3. For the maximum likelihood method, the second factor appears to represent head size. The meaning of the third factor is unclear, and it is probably not needed.

Further support for retaining three or fewer factors is provided by the residual matrix obtained from the maximum likelihood estimates:

    R − L̂_zL̂_z' − Ψ̂_z  =  [a 6 × 6 matrix whose entries are .000 to three decimal places, apart from a few entries of magnitude .001 to .004]

All of the entries in this matrix are very small. We shall pursue the m = 3 factor model in this example. An m = 2 factor model is considered in Exercise 9.10.

Factor scores for factors 1 and 2 produced from (9-58) with the rotated maximum likelihood estimates are plotted in Figure 9.5 on page 523. Plots of this kind allow us to identify observations that, for one reason or another, are not consistent with the remaining observations. Potential outliers are circled in the figure.

It is also of interest to plot pairs of factor scores obtained using estimated loadings from two solution methods. Such plots are good tools for detecting outliers. For the chicken-bone data, plots of pairs of factor scores from the principal component and maximum likelihood estimates are given in Figure 9.6 on pages 524-526. If the sets of loadings for a factor tend to agree, the pairs of scores should cluster tightly about the 45° line through the origin, and outliers will appear as points in the neighborhood of the 45° line but far from the origin and the cluster of the remaining points. Sets of loadings that do not agree will produce factor scores that deviate from this pattern. When the latter occurs, it is usually associated with the last factors and may suggest that the number of factors is too large; that is, the last factors are not meaningful. This seems to be the case with the third factor in the chicken-bone data, as indicated by Plot (c) in Figure 9.6.

It is clear from Plot (b) in Figure 9.6 that one of the 276 observations is not consistent with the others. It has an unusually large F_2-score. When this point was removed and the analysis repeated, the loadings were not altered appreciably.

When the data set is large, it should be divided into two (roughly) equal sets, and a factor analysis should be performed on each half. The results of these analyses can be compared with each other and with the analysis for the full data set to test the stability of the solution.
[Figure 9.5 Factor scores for the first two factors of chicken-bone data.]

The chicken-bone data were divided into two sets of n_1 = 137 and n_2 = 139 observations, respectively. The resulting sample correlation matrices were

          [ 1.000                                        ]
          [  .696  1.000                                 ]
    R_1 = [  .588   .540  1.000                          ]
          [  .639   .575   .901  1.000                   ]
          [  .694   .606   .844   .835  1.000            ]
          [  .660   .584   .866   .863   .931  1.000     ]

and

          [ 1.000                                        ]
          [  .366  1.000                                 ]
    R_2 = [  .572   .352  1.000                          ]
          [  .587   .406   .950  1.000                   ]
          [  .598   .420   .909   .940  1.000            ]
          [  .587   .386   .911   .894   .927  1.000     ]

If the results from the two halves are consistent with one another, confidence in the solution is increased.
[Figure 9.6 Pairs of factor scores for the chicken-bone data. (Loadings are estimated by principal component and maximum likelihood methods.) Plot (a): first factor.]

The rotated estimated loadings, specific variances, and proportion of the total (standardized) sample variance explained for a principal component solution of an m = 3 factor model, computed separately for each half of the data, are given in Table 9.11 on page 525. The results for the two halves of the chicken-bone measurements are very similar. Factors F_2* and F_3* interchange with respect to their labels, skull length and skull breadth, but they collectively seem to represent head size. The first factor, F_1*, again appears to be a body-size factor dominated by leg and wing dimensions. These are the same interpretations we gave to the results from a principal component factor analysis of the entire set of data. The solution is remarkably stable, and we can be fairly confident that the large loadings are "real." As we have pointed out, however, three factors are probably too many. A one- or two-factor model is surely sufficient for the chicken-bone data, and you are encouraged to repeat the analyses here with fewer factors and alternative solution methods. (See Exercise 9.10.) •
[Figure 9.6 (continued). Plot (b): second factor.]

[Table 9.11 Rotated estimated factor loadings F_1*, F_2*, F_3* and specific variances ψ̃_i for principal component solutions of an m = 3 factor model fitted separately to the first set (n_1 = 137 observations) and the second set (n_2 = 139 observations) of chicken-bone measurements, together with the cumulative proportions of total (standardized) sample variance explained.]
[Figure 9.6 (continued). Plot (c): third factor.]

Factor analysis has a tremendous intuitive appeal for the behavioral and social sciences. In these areas, it is natural to regard multivariate observations on animal and human processes as manifestations of underlying unobservable "traits." Factor analysis provides a way of explaining the observed variability in behavior in terms of these traits.

Still, when all is said and done, factor analysis remains very subjective. Our examples, in common with most published sources, consist of situations in which the factor analysis model provides reasonable explanations in terms of a few interpretable factors. In practice, the vast majority of attempted factor analyses do not yield such clear-cut results. Unfortunately, the criterion for judging the quality of any factor analysis has not been well quantified. Rather, that quality seems to depend on a WOW criterion.

If, while scrutinizing the factor analysis, the investigator can shout "Wow, I understand these factors," the application is deemed successful.
Supplement 9A

SOME COMPUTATIONAL DETAILS FOR MAXIMUM LIKELIHOOD ESTIMATION

Although a simple analytical expression cannot be obtained for the maximum likelihood estimators L̂ and Ψ̂, they can be shown to satisfy certain equations. Not surprisingly, the conditions are stated in terms of the maximum likelihood estimator S_n = (1/n) Σ_{j=1}^n (x_j − x̄)(x_j − x̄)' of an unstructured covariance matrix. Some factor analysts employ the usual sample covariance S, but still use the title maximum likelihood to refer to resulting estimates. This modification, referenced in Footnote 4 of this chapter, amounts to employing the likelihood obtained from the Wishart distribution of Σ(x_j − x̄)(x_j − x̄)' and ignoring the minor contribution due to the normal density for x̄. The factor analysis of R is, of course, unaffected by the choice of S_n or S, since they both produce the same correlation matrix.

Result 9A.1. Let x_1, x_2, ..., x_n be a random sample from a normal population. The maximum likelihood estimates L̂ and Ψ̂ are obtained by maximizing (9-25) subject to the uniqueness condition in (9-26). They satisfy

    (Ψ̂^(−1/2) S_n Ψ̂^(−1/2)) (Ψ̂^(−1/2) L̂) = (Ψ̂^(−1/2) L̂)(I + Δ̂)    (9A-1)

so the jth column of Ψ̂^(−1/2)L̂ is the (nonnormalized) eigenvector of Ψ̂^(−1/2) S_n Ψ̂^(−1/2) corresponding to the eigenvalue 1 + Δ̂_j. Here

    S_n = n⁻¹ Σ_{j=1}^n (x_j − x̄)(x_j − x̄)' = n⁻¹(n − 1)S    and    Δ̂_1 ≥ Δ̂_2 ≥ ··· ≥ Δ̂_m
Also, at convergence,

    ψ̂_i = ith diagonal element of S_n − L̂L̂'    (9A-2)

and

    tr[Σ̂⁻¹S_n] = p

We avoid the details of the proof. However, it is evident that μ̂ = x̄ and a consideration of the log-likelihood leads to the maximization of −(n/2)[ln|Σ| + tr(Σ⁻¹S_n)] over L and Ψ. Equivalently, since S_n and p are constant with respect to the maximization, we minimize

    h(Σ) = ln|Σ| − ln|S_n| + tr(Σ⁻¹S_n) − p    (9A-3)

subject to L'Ψ⁻¹L = Δ, a diagonal matrix. •

Comment. Lawley and Maxwell [10], along with many others who do factor analysis, use the unbiased estimate S of the covariance matrix instead of the maximum likelihood estimate S_n. Also, for normal data, (n − 1)S has a Wishart distribution. [See (4-21) and (4-23).] If we ignore the contribution to the likelihood in (9-25) from the second term involving (μ − x̄), then maximizing the reduced likelihood over L and Ψ is equivalent to maximizing the Wishart likelihood

    Likelihood ∝ |Σ|^(−(n−1)/2) e^(−[(n−1)/2] tr[Σ⁻¹S])

over L and Ψ. Equivalently, we can minimize

    ln|Σ| + tr(Σ⁻¹S)

or, as in (9A-3),

    ln|Σ| − ln|S| + tr(Σ⁻¹S) − p

Under these conditions, Result (9A-1) holds with S in place of S_n. Also, for large n, S and S_n are almost identical, and the corresponding maximum likelihood estimates, L̂ and Ψ̂, would be similar. For testing the factor model [see (9-39)], |L̂L̂' + Ψ̂| should be compared with |S_n| if the actual likelihood of (9-25) is employed, and |L̂L̂' + Ψ̂| should be compared with |S| if the foregoing Wishart likelihood is used to derive L̂ and Ψ̂.

Recommended Computational Scheme

For m > 1, the condition L'Ψ⁻¹L = Δ effectively imposes m(m − 1)/2 constraints on the elements of L and Ψ, and the likelihood equations are solved, subject to these constraints, in an iterative fashion. One procedure is the following:

1. Compute initial estimates of the specific variances ψ_1, ψ_2, ..., ψ_p. Jöreskog [8] suggests setting

    ψ̂_i = (1 − (1/2)(m/p)) (1/s^{ii})    (9A-4)

where s^{ii} is the ith diagonal element of S⁻¹.
2. Given ψ̂, compute the first m distinct eigenvalues, λ̂_1 > λ̂_2 > ··· > λ̂_m > 1, and corresponding eigenvectors, ê_1, ê_2, ..., ê_m, of the "uniqueness-rescaled" covariance matrix

    S* = Ψ̂^(−1/2) S_n Ψ̂^(−1/2)    (9A-5)

Let Ê = [ê_1 | ê_2 | ··· | ê_m] be the p × m matrix of normalized eigenvectors and Λ̂ = diag[λ̂_1, λ̂_2, ..., λ̂_m] be the m × m diagonal matrix of eigenvalues. From (9A-1), Δ̂ = Λ̂ − I and Ê = Ψ̂^(−1/2)L̂Δ̂^(−1/2). Thus, we obtain the estimates

    L̂ = Ψ̂^(1/2) Ê (Λ̂ − I)^(1/2)    (9A-6)

3. Substitute L̂ obtained in (9A-6) into the likelihood function (9A-3), and minimize the result with respect to ψ̂_1, ψ̂_2, ..., ψ̂_p. A numerical search routine must be used. The values ψ̂_1, ψ̂_2, ..., ψ̂_p obtained from this minimization are employed at Step (2) to create a new L̂. Steps (2) and (3) are repeated until convergence; that is, until the differences between successive values of ℓ̂_ij and ψ̂_i are negligible.

Comment. It often happens that the objective function in (9A-3) has a relative minimum corresponding to negative values for some ψ̂_i. This solution is clearly inadmissible and is said to be improper, or a Heywood case. For most packaged computer programs, negative ψ̂_i, if they occur on a particular iteration, are changed to small positive numbers before proceeding with the next step.
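The scheme above is easily sketched in Python. In the version below, the eigenvalue step (2) is nested inside the numerical search of step (3), which is equivalent to alternating the two steps, and the search is carried out over log ψ so that the specific variances stay positive, sidestepping improper (Heywood) iterates. This is an illustrative sketch under these choices, not production code; it accepts S in place of S_n, as discussed in the Comment following (9A-3).

import numpy as np
from scipy.optimize import minimize

def ml_factor(S, m):
    """Maximum likelihood factor analysis via (9A-4) through (9A-6)."""
    p = S.shape[0]
    psi0 = (1 - 0.5 * m / p) / np.diag(np.linalg.inv(S))    # Joreskog start, (9A-4)

    def loadings(psi):
        d = 1.0 / np.sqrt(psi)
        vals, vecs = np.linalg.eigh(d[:, None] * S * d[None, :])   # rescaled S, (9A-5)
        vals, vecs = vals[::-1][:m], vecs[:, ::-1][:, :m]
        return np.sqrt(psi)[:, None] * vecs * np.sqrt(np.maximum(vals - 1, 1e-8))  # (9A-6)

    def objective(log_psi):                                  # h(Sigma) of (9A-3)
        psi = np.exp(log_psi)
        L = loadings(psi)
        Sigma = L @ L.T + np.diag(psi)
        return (np.linalg.slogdet(Sigma)[1] - np.linalg.slogdet(S)[1]
                + np.trace(np.linalg.solve(Sigma, S)) - p)

    res = minimize(objective, np.log(psi0), method="Nelder-Mead",
                   options={"maxiter": 2000})
    psi = np.exp(res.x)
    return loadings(psi), psi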
Maximum Likelihood Estimators of ρ = L_zL_z' + Ψ_z

When Σ has the factor analysis structure Σ = LL' + Ψ, ρ can be factored as

    ρ = V^(−1/2) Σ V^(−1/2) = (V^(−1/2)L)(V^(−1/2)L)' + V^(−1/2) Ψ V^(−1/2) = L_zL_z' + Ψ_z

The loading matrix for the standardized variables is L_z = V^(−1/2)L, and the corresponding specific variance matrix is Ψ_z = V^(−1/2)ΨV^(−1/2), where V^(−1/2) is the diagonal matrix with ith diagonal element σ_ii^(−1/2). If R is substituted for S_n in the objective function of (9A-3), the investigator minimizes

    ln(|L_zL_z' + Ψ_z| / |R|) + tr[(L_zL_z' + Ψ_z)⁻¹R] − p    (9A-7)

Introducing the diagonal matrix V̂^(1/2), whose ith diagonal element is the square root of the ith diagonal element of S_n, we can write the objective function in (9A-7) as

    ln( |(V̂^(1/2)L_z)(V̂^(1/2)L_z)' + V̂^(1/2)Ψ_zV̂^(1/2)| / |S_n| )
        + tr[ ((V̂^(1/2)L_z)(V̂^(1/2)L_z)' + V̂^(1/2)Ψ_zV̂^(1/2))⁻¹ S_n ]  −  p
    ≥ ln( |L̂L̂' + Ψ̂| / |S_n| ) + tr[ (L̂L̂' + Ψ̂)⁻¹ S_n ] − p    (9A-8)

The last inequality follows because the maximum likelihood estimates L̂, Ψ̂ minimize the objective function (9A-3). [Equality holds in (9A-8) for L̂_z = V̂^(−1/2)L̂ and Ψ̂_z = V̂^(−1/2)Ψ̂V̂^(−1/2).] Therefore, minimizing (9A-7) over L_z and Ψ_z is equivalent to obtaining L̂_z and Ψ̂_z from S_n and estimating L_z = V^(−1/2)L by L̂_z = V̂^(−1/2)L̂ and Ψ_z = V^(−1/2)ΨV^(−1/2) by Ψ̂_z = V̂^(−1/2)Ψ̂V̂^(−1/2). The rationale for the latter procedure comes from the invariance property of maximum likelihood estimators. [See (4-20).]

Exercises

9.1. Show that the covariance matrix

        [ 1.0   .63   .45 ]
    ρ = [ .63   1.0   .35 ]
        [ .45   .35   1.0 ]

for the p = 3 standardized random variables Z_1, Z_2, and Z_3 can be generated by the m = 1 factor model

    Z_1 = .9F_1 + ε_1
    Z_2 = .7F_1 + ε_2
    Z_3 = .5F_1 + ε_3

where Var(F_1) = 1, Cov(ε, F_1) = 0, and

                  [ .19    0     0  ]
    Ψ = Cov(ε) =  [  0    .51    0  ]
                  [  0     0    .75 ]

That is, write ρ in the form ρ = LL' + Ψ.

9.2. Use the information in Exercise 9.1.
(a) Calculate communalities h_i², i = 1, 2, 3, and interpret these quantities.
(b) Calculate Corr(Z_i, F_1) for i = 1, 2, 3. Which variable might carry the greatest weight in "naming" the common factor? Why?

9.3. The eigenvalues and eigenvectors of the correlation matrix ρ in Exercise 9.1 are

    λ_1 = 1.96,    e_1' = [.625, .593, .507]
    λ_2 = .68,     e_2' = [−.219, −.491, .843]
    λ_3 = .36,     e_3' = [.749, −.638, −.177]

(a) Assuming an m = 1 factor model, calculate the loading matrix L and matrix of specific variances Ψ using the principal component solution method. Compare the results with those in Exercise 9.1.
(b) What proportion of the total population variance is explained by the first common factor?

9.4. Given ρ and Ψ in Exercise 9.1 and an m = 1 factor model, calculate the reduced correlation matrix ρ̃ = ρ − Ψ and the principal factor solution for the loading matrix L. Is the result consistent with the information in Exercise 9.1? Should it be?
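The factorization asserted in Exercise 9.1 can be confirmed numerically in a few lines; a quick numpy check:

import numpy as np

L = np.array([[.9], [.7], [.5]])
Psi = np.diag([.19, .51, .75])
rho = np.array([[1.00, .63, .45],
                [ .63, 1.00, .35],
                [ .45, .35, 1.00]])
print(np.allclose(L @ L.T + Psi, rho))   # True: rho = LL' + Psi as claimed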
9.5. Establish the inequality (9-19).
Hint: Since S − L̂L̂' − Ψ̂ has zeros on the diagonal,

    (sum of squared entries of S − L̂L̂' − Ψ̂) ≤ (sum of squared entries of S − L̂L̂')

Now, S − L̂L̂' = λ̂_{m+1}ê_{m+1}ê'_{m+1} + ··· + λ̂_p ê_p ê'_p = P̂_(2)Λ̂_(2)P̂'_(2), where P̂_(2) = [ê_{m+1} | ··· | ê_p] and Λ̂_(2) is the diagonal matrix with elements λ̂_{m+1}, ..., λ̂_p. Use (sum of squared entries of A) = tr AA' and tr[P̂_(2)Λ̂_(2)Λ̂_(2)P̂'_(2)] = tr[Λ̂_(2)Λ̂_(2)].

9.6. Verify the following matrix identities.
(a) (I + L'Ψ⁻¹L)⁻¹L'Ψ⁻¹L = I − (I + L'Ψ⁻¹L)⁻¹
Hint: Premultiply both sides by (I + L'Ψ⁻¹L).
(b) (LL' + Ψ)⁻¹ = Ψ⁻¹ − Ψ⁻¹L(I + L'Ψ⁻¹L)⁻¹L'Ψ⁻¹
Hint: Postmultiply both sides by (LL' + Ψ) and use (a).
(c) L'(LL' + Ψ)⁻¹ = (I + L'Ψ⁻¹L)⁻¹L'Ψ⁻¹
Hint: Postmultiply the result in (b) by L, use (a), and take the transpose, noting that (LL' + Ψ)⁻¹, Ψ⁻¹, and (I + L'Ψ⁻¹L)⁻¹ are symmetric matrices.

9.7. (The factor model parameterization need not be unique.) Let the factor model with p = 2 and m = 1 prevail. Show that

    σ_11 = ℓ_11² + ψ_1,    σ_12 = σ_21 = ℓ_11ℓ_21,    σ_22 = ℓ_21² + ψ_2

and, for given σ_11, σ_22, and σ_12, there is an infinity of choices for L and Ψ.

9.8. (Unique but improper solution: Heywood case.) Consider an m = 1 factor model for the population with covariance matrix

        [ 1   .4   .9 ]
    Σ = [ .4   1   .7 ]
        [ .9  .7    1 ]

Show that there is a unique choice of L and Ψ with Σ = LL' + Ψ, but that ψ_3 < 0, so the choice is not admissible.

9.9. In a study of liquor preference in France, Stoetzel [14] collected preference rankings of p = 9 liquor types from n = 1442 individuals. A factor analysis of the 9 × 9 sample correlation matrix of rank orderings gave estimated loadings on three factors, F_1, F_2, and F_3, for the liquor types, among them Kirsch, Mirabelle, Rum, Marc, Whiskey, Calvados, Cognac, and Armagnac. [Table of estimated factor loadings for the nine liquor types; one loading is reported as .97 and starred.]

* This figure is too high. It exceeds the maximum value of .64, as a result of an approximation method for obtaining the estimated factor loadings used by Stoetzel.
Given these results, Stoetzel concluded the following: The major principle of liquor preference in France is the distinction between sweet and strong liquors. The second motivating element is price, which can be understood by remembering that liquor is both an expensive commodity and an item of conspicuous consumption. Except in the case of the two most popular and least expensive items (rum and marc), this second factor plays a much smaller role in producing preference judgments. The third factor concerns the sociological and, primarily, the regional variability of the judgments. (See [14].)
(a) Given what you know about the various liquors involved, does Stoetzel's interpretation seem reasonable?
(b) Plot the loading pairs for the first two factors. Conduct a graphical orthogonal rotation of the factor axes. Generate approximate rotated loadings. Interpret the rotated loadings for the first two factors. Does your interpretation agree with Stoetzel's interpretation of these factors from the unrotated loadings? Explain.

9.10. The correlation matrix for chicken-bone measurements (see Example 9.14) is

        [ 1.000                                        ]
        [  .505  1.000                                 ]
        [  .569   .422  1.000                          ]
        [  .602   .467   .926  1.000                   ]
        [  .621   .482   .877   .874  1.000            ]
        [  .603   .450   .878   .894   .937  1.000     ]

The following estimated factor loadings were extracted by the maximum likelihood procedure:

                          Estimated factor    Varimax rotated estimated
                          loadings            factor loadings
    Variable              F_1       F_2       F_1*      F_2*
    1. Skull length       .602      .200      .484      .411
    2. Skull breadth      .467      .154      .375      .319
    3. Femur length       .926      .143      .603      .717
    4. Tibia length      1.000      .000      .519      .855
    5. Humerus length     .874      .476      .861      .499
    6. Ulna length        .894      .327      .744      .594

Using the unrotated estimated factor loadings, obtain the maximum likelihood estimates of the following.
(a) The specific variances.
(b) The communalities.
(c) The proportion of variance explained by each factor.
(d) The residual matrix R − L̂_zL̂_z' − Ψ̂_z.

9.11. Refer to Exercise 9.10. Compute the value of the varimax criterion using both unrotated and rotated estimated factor loadings. Comment on the results.

9.12. The covariance matrix for the logarithms of turtle measurements (see Example 8.4) is

    S = 10⁻³ [ 11.072                ]
             [  8.019   6.417        ]
             [  8.160   6.005  6.773 ]

The following maximum likelihood estimates of the factor loadings for an m = 1 model were obtained:
.688 .it. HRS Replacement return on assets.303 HRS RRA RRE RRS Q 1.13.712 . and marketvalue measures of profitability for a sample of firms operating in 1977 is as follows: REV Variable Historical return on assets. (d) The residual matrix Sn. The correlation matrix of accounting historical.610 1.000 . RRE Replacement return on sales. (a) Specific variances. ln(height) Fl .419 .513 .579 . HRA Historical return on equity.604 HRE 1.000 .000 . (b) Communalities.520 . that = . As part of their study. 9.937 1.i 112 where A= A I is a diagonal matrix.15.625 . Refer to Exercise 9. ln(length) 2. REV  HRA 1.617 1. [See (940).000 .0765 = 1 model Using the estimated factor loadings.1022 . 9.639 . Hint: Convert S to S n. for this choice. and uses of accounting and marketvalue measures of profitability.831 . obtain the maximum likelihood estimates of each of the following. a factor analysis of accounting profit measures and market esti~ates of economic profits was conducted. RRS Market Q ratio.352 1.867 .692 .828 . ln(width) 3.563 .LL' .738 .731 .14.0752 .Exercises 533 The following maximum likelihood estimates of the factor loadings for an m were obtained: Estimated factor loadings Variable 1. RRA Replacement return on equity.000 . Compute the test statistic in (939).] 9. Indicate why a test of H 0 : I= LL' + 'IJf (with m = 1) versus H 1 : I unrestricted cannot be carried out for this example. Q Market relative excess value.826 .652 . 1/2i.000 . HRE Historical return on sales. determinants.12. The maximum likelihood factor loading estimates are given in (9A6) by i Verify.608 .000 .681 .887 1.000 . (c) Proportion of variance explained by the factor. accounting replacement. Hirschey and Wichern [7] investigate the consistency.543 322 .q.
The following rotated principal component estimates of factor loadings for an m = 3 factor model were obtained:

                                          Estimated factor loadings
    Variable                              F_1     F_2     F_3
    Historical return on assets           .433    .612    .499
    Historical return on equity           .125    .892    .234
    Historical return on sales            .296    .238    .887
    Replacement return on assets          .406    .708    .483
    Replacement return on equity          .198    .895    .283
    Replacement return on sales           .331    .414    .789
    Market Q ratio                        .928    .160    .294
    Market relative excess value          .910    .079    .355
    Cumulative proportion
    of total variance explained           .287    .628    .908

(a) Using the estimated factor loadings, determine the specific variances and communalities.
(b) Determine the residual matrix, R − L̂_zL̂_z' − Ψ̂_z. Given this information and the cumulative proportion of total variance explained in the preceding table, does an m = 3 factor model appear appropriate for these data?
(c) Assuming that estimated loadings less than .4 are small, interpret the three factors. Does it appear, for example, that market-value measures provide evidence of profitability distinct from that provided by accounting measures? Can you separate accounting historical measures of profitability from accounting replacement measures?

9.16. Verify that factor scores constructed according to (9-50) have sample mean vector 0 and zero sample covariances.

9.17. Refer to Example 9.12. Using the information in this example, evaluate (L̂_z'Ψ̂_z⁻¹L̂_z)⁻¹.
Note: Set the fourth diagonal element of Ψ̂_z to .01 so that Ψ̂_z⁻¹ can be determined.
Will the regression and generalized least squares methods for constructing factor scores for standardized stock-price observations give nearly the same results? Hint: See equation (9-57) and the discussion following it.

The following exercises require the use of a computer.

9.18. Refer to Exercise 8.16 concerning the numbers of fish caught.
(a) Using only the measurements x_1 through x_4, obtain the principal component solution for factor models with m = 1 and m = 2.
(b) Using only the measurements x_1 through x_4, obtain the maximum likelihood solution for factor models with m = 1 and m = 2.
(c) Rotate your solutions in Parts (a) and (b). Compare the solutions and comment on them. Interpret each factor.
(d) Perform a factor analysis using the measurements x_1 through x_6. Determine a reasonable number of factors m, and compare the principal component and maximum likelihood solutions after rotation. Interpret the factors.

9.19. A firm is attempting to evaluate the quality of its sales staff and is trying to find an examination or series of tests that may reveal the potential for good performance in sales.
The firm has selected a random sample of 50 salespeople and has evaluated each on 3 measures of performance: growth of sales, profitability of sales, and new-account sales. These measures have been converted to a scale, on which 100 indicates "average" performance. Each of the 50 individuals took each of 4 tests, which purported to measure creativity, mechanical reasoning, abstract reasoning, and mathematical ability, respectively. The n = 50 observations on p = 7 variables are listed in Table 9.12 on page 536.
(a) Assume an orthogonal factor model for the standardized variables Z_i = (X_i − μ_i)/√σ_ii, i = 1, 2, ..., 7. Obtain either the principal component solution or the maximum likelihood solution for m = 2 and m = 3 common factors.
(b) Given your solution in (a), obtain the rotated loadings for m = 2 and m = 3. Compare the two sets of rotated loadings. Interpret the m = 2 and m = 3 factor solutions.
(c) List the estimated communalities, specific variances, and L̂L̂' + Ψ̂ for the m = 2 and m = 3 solutions. Compare the results. Which choice of m do you prefer at this point? Why?
(d) Conduct a test of H_0: Σ = LL' + Ψ versus H_1: Σ ≠ LL' + Ψ for both m = 2 and m = 3 at the α = .01 level. With these results and those in Parts b and c, which choice of m appears to be the best?
(e) Suppose a new salesperson, selected at random, obtains the test scores x' = [x_1, x_2, ..., x_7] = [110, 98, 105, 15, 18, 12, 35]. Calculate the salesperson's factor score using the weighted least squares method and the regression method.
Note: The components of x must be standardized using the sample means and variances calculated from the original data.

9.20. Using the air-pollution variables X_1, X_2, X_5, and X_6 given in Table 1.5, generate the sample covariance matrix.
(a) Obtain the principal component solution to a factor model with m = 1 and m = 2.
(b) Find the maximum likelihood estimates of L and Ψ for m = 1 and m = 2.
(c) Compare the factorization obtained by the principal component and maximum likelihood methods.

9.21. Perform a varimax rotation of both m = 2 solutions in Exercise 9.20. Interpret the results. Are the principal component and maximum likelihood solutions consistent with each other?

9.22. Refer to Exercise 9.20.
(a) Calculate the factor scores from the m = 2 maximum likelihood estimates by (i) weighted least squares in (9-50) and (ii) the regression approach of (9-58).
(b) Find the factor scores from the principal component solution, using (9-51).
(c) Compare the three sets of factor scores.

9.23. Repeat Exercise 9.20, starting from the sample correlation matrix. Interpret the factors for the m = 1 and m = 2 solutions. Does it make a difference if R, rather than S, is factored? Explain.

9.24. Perform a factor analysis of the census-tract data in Table 8.5. Start with R and obtain both the maximum likelihood and principal component solutions. Comment on your choice of m. Your analysis should include factor rotation and the computation of factor scores.

9.25. Perform a factor analysis of the "stiffness" measurements given in Table 4.3 and discussed in Example 4.14. Compute factor scores, and check for outliers in the data. Use the sample covariance matrix S.
[Table 9.12 Salespeople Data. For each of the 50 salespeople: index of sales growth (x_1), sales profitability (x_2), and new-account sales (x_3); score on creativity test (x_4), mechanical reasoning test (x_5), abstract reasoning test (x_6), and mathematics test (x_7).]
9.26. Perform a factor analysis of the national track records for women given in Table 1.9. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain.

9.27. Refer to Exercise 9.26. Convert the national track records for women to speeds measured in meters per second. (See Exercise 8.19.) Perform a factor analysis of the speed data. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain.

9.28. Perform a factor analysis of the national track records for men given in Table 8.6. Repeat the steps given in Exercise 9.26. Is the appropriate factor model for the men's data different from the one for the women's data? If not, are the interpretations of the factors roughly the same? If the models are different, explain the differences.

9.29. Refer to Exercise 9.28. Convert the national track records for men to speeds measured in meters per second. (See Exercise 8.21.) Perform a factor analysis of the speed data. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Compare your results with the results in Exercise 9.28. Does it make a difference if R, rather than S, is factored? Explain.

9.30. Perform a factor analysis of the data on bulls given in Table 1.10. Use the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt. Factor the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers. Repeat the analysis with the sample correlation matrix R. Compare the results obtained from S with the results from R. Does it make a difference if R, rather than S, is factored? Explain.

9.31. Perform a factor analysis of the psychological profile data in Table 4.6. Use the sample correlation matrix R constructed from measurements on the five variables, Indep, Supp, Benev, Conform, and Leader. Obtain both the principal component and maximum likelihood solutions for m = 2 and m = 3 factors. Can you interpret the factors? Your analysis should include factor rotation and the computation of factor scores.

9.32. Consider the mice-weight data in Example 8.6. Start with the sample covariance matrix. (See Exercise 8.15.)
(a) Obtain the principal component solution to the factor model with m = 1 and m = 2.
(b) Find the maximum likelihood estimates of the loadings and specific variances for m = 1 and m = 2.
(c) Perform a varimax rotation of the solutions in Parts a and b.

9.33. Repeat Exercise 9.32 by factoring R instead of the sample covariance matrix S. Also, for the mouse with standardized weights [..., .5], obtain the factor scores using the maximum likelihood estimates of the loadings and Equation (9-58).

9.34. The pulp and paper properties data are given in Table 7.7. Perform a factor analysis using observations on the four paper property variables, BL, EM, SF, and BS, and the sample correlation matrix R. Can the information in these data be summarized by a single factor? If so, can you interpret the factor? Try both the principal component and maximum likelihood solution methods. Repeat this analysis with the sample covariance matrix S. Does your interpretation of the factor(s) change if S rather than R is factored?