# STATISTICS 407

APPLIED MULTIVARIATE ANALYSIS

1

TOPICS
Principal Component Analysis (PCA): Reduce the ________________, summarize the sources of variation in the data, transform the data into a new data set where the variables are uncorrelated. Factor Analysis (FA): When its not possible to _________________ of interest directly, measure what’s possible and create the variables of interest from the observed data. Discriminant Analysis (classiﬁcation, supervised learning): Build a rule to _____________ or group id from observed training data.

2

Individual-directed: Summarizing relationships that exist between individuals. PLUS how to plot multivariate data. or experimental units. Multivariate Regression and Canonical Correlation Analysis: ______________________ variables.TOPICS Cluster Analysis (unsupervised learning): Find similar groups of individuals. or ______________________ based on their similarity. eg _______________________________________________. ____________. 3 A TAXONOMY OF TECHNIQUES Variable-directed: Quantifying the relationships between variables. 4 . eg ___________________________ _______________________________________________ _______________________________________________. M(ultivariate)ANOVA: Infer information about the ____________________ based on the sample means. and explore the association between the set of dependent variables and a set of explanatory variables.

Xp]      =  X11 X12 X21 X22 .  .. . . .. . X2p    . . . .. that is ith case and j th variable.MULTIVARIATE DATA Example. p variables) has matrix form as follows: X = [X1 X2 .. . 5 6 . . X1p . . Xnp n×p Xij is the element in the ith row and j th column. . . Xn1 Xn2 . . Matrix Notation Data (n observations. . . . nutritional information of chocolates (100g equivalent ) from around the world: 5 SOME MATH.

Sp1 Sp2 .. . . r2p    . Spp       How do you calculate S11. we’d calculate the mean vector. r12? R= 1 r12 . S2p . . . . . . and proportions. ¯p X           S= S11 S12 .  rp1 rp2 . . S12. Variance-Covariance/Correlation Matrices ¯1 X  . ¯ = X .. . . .   . and correlation matrix for the _________________________. S1p S21 S22 . . . . . var-cov matrix. 1 8 7 MULTIVARIATE DATA For the chocolates data example. . . . .SOME MATH. The mean vector.. Mean Vector. . r1p r21 1 . . var-cov and corr matrix might also be reported _____________ for each category of the categorical variables. . . .. 8 . . .. .. For the __________ variables we’d report counts. . . .

230 −0.6 196.197 0.3   −156.26 45.97 −5.9 −0.703 0.4 90.27 40.5 −108.62 115.130 −0.48 X  56.14 −16.400 0.000 −0.MULTIVARIATE DATA Chocolates data: n=10.000 0. These are the ____________.01 229.8 −0.632  551.1 −66.380 −0.136 9 MULTIVARIATE DATA Chocolates data .65 60.303 −0.212 −0.629 1.31   ¯ =  9.05 −16.50 5.520 −0.8 2275.712 1.18  36.197 −0.400          48.212 1.50 −108.3 60.6  229. or even the dependent variables that might be used for classifying observations.30 7.332 −0.632 −0.520 0.712 −0.36 −0.230 1.3 90.summary of categorical variables 1 using counts.60 0.303 −0.480 1.380 0.2 −61.130 0.000 −0.16 −23. 10 .78 5.629 −425.3 −166.000 −0.332 −0.97 −66.62 −166.88   6.8 −425.480 0.65 −61.1    R=     1.3   48.000 −0. p=6 What’s the mean Calories? variance of Fiber? correlation between Sugars and Calories? Which variables have negative covariance? Which variable has the largest variance? Standard deviation of Chol? 2299.000                 −296.28 0.79 −5.1 S=  −296.136 −0.703 −156.8 −23.

Bp ). and Xα is a projection If α1 p of the data. and projections: If  α1  . . Linear combinations. B) = (A − B)S−1 (A − B)￿ . (2) d(A.. . Xα =   . B). . A). if A ￿= B. . Distance measures: For two points (rows of the data matrix) A = (A1 A2 . + αp Xpn n×1   then ￿ 2 + . . . . C) + d(C. B) = d(B. Ap ) and B = (B1 B2 . . B) = (A − B)(A − B)￿ = (A1 − B1 )2 + . but it must satisfy (1) d(A. ..  αp p×1  α1 X11 + α2 X21 + . 11 MORE MATH. for any intermediate point C.. B) > 0. Used in ______________________. B) ≤ d(A. B) = 0. Generally any distance measure can be deﬁned. Used in ______________________________________.. . 1 12 . Euclidean distance is deﬁned as ￿ ￿ d(A.  α= . α1 X1n + α2 X2n + . . + αp Xp1   . . . if A = B. . (4) d(A.MORE MATH. + (Ap − Bp )2 and statistical distance (or Mahalobis distance) is deﬁned as ￿ d(A. (3) d(A. + α2 = 1 then α is a projection vector.

13 This work is licensed under the Creative Commons Attribution-Noncommercial 3..0 United States License. The standardized data has mean vector all zeros. . Scaling: The standardized data matrix is  z11 z12  z21 z22  Z= .. . and variances all equal to 1. j = 1. 94105.  . .. zn 1 zn 2 x −x ¯ .. p. n.MORE MATH.. . znp where zij = ijsj j . _____________ does remove correlation . To view a copy of this license. . Different from ____________! Doesn’t change the correlation between variables. .more to come on this... .0/us/ or send a letter to Creative Commons. USA... Suite 300. 171 Second Street.      . i = 1. z1p z2 p . visit http://creativecommons.org/ licenses/by-nc/3. 14 . California. .. . San Francisco. .