SUMMARY OF CONTENT
Main Book:
Rencher, Alvin (2002). Methods of Multivariate Analysis, 2nd ed. John Wiley & Sons, Inc., USA.
Reference Books:
Tabachnick, Barbara, and Fidell, Linda (2007). Using Multivariate Statistics. Pearson.
MULTIVARIATE STATISTICS
In recent decades, Multivariate Statistical Analysis has developed rapidly, due in part to its numerous applications in almost all the experimental sciences, and its use has become practically indispensable. The spectacular possibilities offered by today's computers have had a decisive influence on this development. While the first applications relied on simple models or on very few variables, today high-capacity, high-speed computers allow the use of complex and powerful methods and the handling of large numbers of variables (Cuadras, 1996, 2018).
Multivariate Analysis (MA) is the part of statistics and data analysis that studies, analyzes, represents and interprets the data that result from observing more than one statistical variable on a sample of individuals. The observed variables are homogeneous and correlated, with none predominating over the others. The information in Multivariate Analysis is therefore multidimensional in nature, so geometry, matrix calculus and multivariate distributions play a key role.
Multivariate information usually takes the form of a data matrix, although the input to an MA technique often consists of a distance or similarity matrix, whose entries measure the degree of discrepancy between individuals. We start with techniques based on n × p data matrices, where n is the number of individuals and p the number of variables (Cuadras, 2018).
OBJECTIVES
The analysis of multivariate data aims at the statistical study of several variables measured on the elements of a population. It has the following objectives.
Individual   v1   v2     v3   v4
1            45   1.56   13   14
2            50   1.60   12   16
3            50   1.65   13   15
4            60   1.75   15    9
5            60   1.70   14   10
6            65   1.70   14    7
7            70   1.60   15    8
8            65   1.60   13   13
9            60   1.55   15   17
10           65   1.70   14   11
2) In a maternity clinic, a study was carried out to determine the size (in cm)
of an infant, taking into account age (in days), size (in cm), weight (in kg)
and thorax size (in cm) at birth. For this purpose, a sample of 9 children was
taken, from which the following results were obtained:
𝐲_i = (y_i1, ... , y_ip)'
All n observation vectors 𝐲_1, 𝐲_2, ..., 𝐲_n can be transposed to row vectors and listed as the rows of the n × p data matrix 𝐘 = (y_ij).
Since n is usually greater than p, the data can be more conveniently tabulated
by entering the observation vectors as rows rather than columns. Note that
the first subscript i corresponds to units (subjects or objects) and the second
subscript j refers to variables. This convention will be followed whenever
possible.
The sample mean vector 𝐲̅ can be found either as the average of the n
observation vectors or by calculating the average of each of the p variables
separately:
𝐲̄ = (1/n) ∑_{i=1}^{n} 𝐲_i = (ȳ_1, ȳ_2, ... , ȳ_p)'
EXAMPLE: Suppose that we have measured the height (in metres) and
weight (in kilograms) of 5 students. We have n = 5 and p = 2, and the data
matrix is as follows:
1.67 65.0
1.78 85.0
𝐘 = 1.60 54.5
1.83 72.0
( 1.80 94.5 )
We can calculate the mean vector of this data matrix in three different ways:
1.-First by calculating the column means.
2.-As the sample mean of the observation vectors.
3.-We can use the matrix expression for the sample mean.
NOTE: This example hopefully makes clear that our three different ways of
thinking about computing the sample mean are all equivalent. However, the
final method based on a matrix multiplication operation is the neatest both
mathematically and computationally, and so we will make use of this
expression, as well as other similar expressions, throughout the course.
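The three equivalent computations can be sketched in R with the height/weight data above (variable names are illustrative):

```r
# Height (m) and weight (kg) of the 5 students; n = 5, p = 2
Y <- matrix(c(1.67, 65.0,
              1.78, 85.0,
              1.60, 54.5,
              1.83, 72.0,
              1.80, 94.5), nrow = 5, byrow = TRUE)
n <- nrow(Y)

# 1) Column means
m1 <- colMeans(Y)

# 2) Average of the n observation vectors (rows)
m2 <- Reduce(`+`, lapply(seq_len(n), function(i) Y[i, ])) / n

# 3) Matrix expression: ybar = (1/n) Y' j, where j is a vector of ones
j  <- rep(1, n)
m3 <- (1/n) * t(Y) %*% j

# All three give ybar = (1.736, 74.2)'
```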
The mean of y over all possible values in the population is called the
population mean vector or expected value of y. It is defined as a vector of
expected values of each variable,
E(𝐲) = E( (y_1, ... , y_p)' ) = ( E(y_1), ... , E(y_p) )' = ( μ_1, ... , μ_p )' = 𝛍
> colMeans(y)
Then:
> j <- rep(1, 9)
> vm <- (1/9) * (j %*% y)
> vm
BIVARIATE CONTEXT:
SCATTER PLOTS
[Scatter plot of the centered data; horizontal axis ZC[,1].]
If the two random variables x and y in a bivariate random variable are added or multiplied, a new random variable is obtained. The mean of x + y is E(x + y) = E(x) + E(y) = μ_x + μ_y; the mean of xy is E(xy) = μ_x μ_y when x and y are independent.
Formally, x and y are INDEPENDENT if their joint density factors into the
product of their individual densities: f (x, y) = g(x) h(y). Informally, x and y
are independent if the random behavior of either of the variables is not
affected by the behavior of the other.
The notion of independence of x and y is more general than that of zero
covariance. The covariance σxy measures linear relationship only, whereas
if two random variables are independent, they are not related either linearly
or nonlinearly.
One way to demonstrate that the converse is not true is to construct examples
of bivariate x and y that have zero covariance and yet are related in a
nonlinear way (the relationship will have zero slope).
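One such construction, sketched in R: take x symmetric about its mean and let y be a quadratic function of x. The covariance vanishes even though y is completely determined by x.

```r
x <- c(-2, -1, 0, 1, 2)
y <- x^2        # exact nonlinear (quadratic) dependence on x
cov(x, y)       # 0: no linear relationship is detected
```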
Example. To obtain the sample covariance for the height and weight data, we first calculate x̄, ȳ and ∑_i x_i y_i, where x is height and y is weight:

x̄ = (69 + 74 + ... + 76)/20 = 71.45
ȳ = (153 + 175 + ... + 220)/20 = 164.7

Now we have:

s_xy = (∑_{i=1}^{n} x_i y_i − n x̄ ȳ)/(n − 1) = 128.88
By itself, the sample covariance 128.88, given in the example, is not very meaningful: we cannot tell whether it represents a small, moderate, or large amount of relationship between y and x. A method of standardizing the covariance is given now: divide by the two standard deviations to obtain the sample correlation r_xy = s_xy/(s_x s_y).
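The standardization divides s_xy by the two standard deviations, giving r_xy = s_xy/(s_x s_y). A sketch in R with illustrative data (not the height/weight sample, which is only partially listed above):

```r
x <- c(69, 74, 68, 70, 72)       # illustrative heights
y <- c(153, 175, 155, 160, 170)  # illustrative weights
s_xy <- cov(x, y)                # sample covariance
r_xy <- s_xy / (sd(x) * sd(y))   # standardized covariance
all.equal(r_xy, cor(x, y))       # agrees with the built-in correlation
```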
X:  0    1    2    3    4    5    6    7    8    9    10
Y:  0  144  256  336  384  400  384  336  256  144     0
b)What can be concluded about the relationship between the variables X and
Y, using the result of part a)?
c) Draw the scatter plot and discuss the appearance of the graph.
> cor(A)[1, 2]
[1] 0
> plot(A)
[Scatter plot of A[,2] against A[,1]: the points rise to a maximum at X = 5 and fall back symmetrically, with no linear trend.]
Note: The sample correlation r_xy equals the cosine of the angle between the two mean-centered data vectors.
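This can be checked numerically: after centering each variable, r_xy is exactly the cosine of the angle between the two data vectors. A small sketch (the data are illustrative):

```r
x <- c(1, 3, 2, 5)
y <- c(2, 2, 4, 6)
xc <- x - mean(x)                 # centered vectors
yc <- y - mean(y)
cos_theta <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
all.equal(cos_theta, cor(x, y))   # TRUE
```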
THE CENTERING MATRIX
The centering matrix is H = I − (1/n)J, where J is the n × n matrix of ones; premultiplying a data matrix Y by H subtracts the column means, so HY has columns with zero mean.
MULTIVARIATE GRAPHICS
MULTIVARIATE CONTEXT
COVARIANCE MATRICES
The sample covariance matrix S = (s_jk) is the matrix of sample variances and covariances of the p variables:
In S the sample variances of the p variables are on the diagonal, and all
possible pairwise sample covariances appear off the diagonal. The jth row
(column) contains the covariances of yj with the other p − 1 variables. Other
names used for the covariance matrix are variance matrix, variance-
covariance matrix, and dispersion matrix.
𝐒 = (1/(n − 1)) ∑_{i=1}^{n} (𝐲_i − 𝐲̄)(𝐲_i − 𝐲̄)' = (1/(n − 1)) ( ∑_{i=1}^{n} 𝐲_i 𝐲_i' − n 𝐲̄ 𝐲̄' )

In matrix form, with I the n × n identity and J the n × n matrix of ones:

𝐒 = (1/(n − 1)) 𝐘'(𝐈 − (1/n)𝐉)𝐘
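The matrix form can be verified in R with the height/weight data from the earlier example; here H = I − (1/n)J is the centering matrix, with J the n × n matrix of ones:

```r
# Height/weight data matrix for the 5 students
Y <- matrix(c(1.67, 65.0, 1.78, 85.0, 1.60, 54.5,
              1.83, 72.0, 1.80, 94.5), nrow = 5, byrow = TRUE)
n <- nrow(Y)
H <- diag(n) - (1/n) * matrix(1, n, n)  # centering matrix: HY has zero column means
S <- t(Y) %*% H %*% Y / (n - 1)         # sample covariance matrix
all.equal(S, var(Y))                    # matches R's built-in estimator
```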
The diagonal elements σ_jj = σ_j² are the population variances of the y's, and the off-diagonal elements σ_jk are the population covariances of all possible pairs of y's. The population covariance matrix is

𝚺 = E[(𝐲 − 𝛍)(𝐲 − 𝛍)'] = E(𝐲𝐲') − 𝛍𝛍'
CORRELATION MATRICES
R = (r_jk) = ( 1     r_12  ...  r_1p
               ⋮     ⋮           ⋮
               r_p1  r_p2  ...  1    )
The correlation matrix can be obtained from the covariance matrix, and
vice versa. Define:
𝐃_s = diag(s_1, s_2, ... , s_p) = ( s_1  0   ...  0
                                    ⋮    ⋮         ⋮
                                    0    0   ...  s_p )
Then:
𝐑 = 𝐃s−1 𝐒 𝐃s−1
𝐒 = 𝐃s 𝐑 𝐃s
𝐏_ρ = (ρ_jk) = ( 1     ρ_12  ...  ρ_1p
                 ⋮     ⋮           ⋮
                 ρ_p1  ρ_p2  ...  1    )

where ρ_jk = σ_jk / (σ_j σ_k).
Example: The table below gives partial data from Kramer and Jensen (1969a). Three variables, y1, y2 and y3, were measured (in milliequivalents per 100 g) at 10 different locations in the South.

Location number   y1    y2     y3
1 35 3.5 2.80
2 35 4.9 2.70
3 40 30.0 4.38
4 10 2.8 3.21
5 6 2.7 2.73
6 20 2.8 2.81
7 35 4.6 2.88
8 35 10.9 2.90
9 35 8.0 3.28
10 30 1.6 3.20
To find the mean vector 𝐲̄, we simply calculate the average of each column and obtain 𝐲̄ = (28.1, 7.18, 3.089)'. The diagonal matrix of standard deviations is

𝐃_s = ( 11.8551   0        0
        0         8.4999   0
        0         0        0.5001 )

Then, with R:
>y
[,1] [,2] [,3]
[1,] 35 3.5 2.80
[2,] 35 4.9 2.70
[3,] 40 30.0 4.38
[4,] 10 2.8 3.21
[5,] 6 2.7 2.73
[6,] 20 2.8 2.81
[7,] 35 4.6 2.88
[8,] 35 10.9 2.90
[9,] 35 8.0 3.28
[10,] 30 1.6 3.20
> var(y)
[,1] [,2] [,3]
[1,] 140.544444 49.680000 1.9412222
[2,] 49.680000 72.248444 3.6760889
[3,] 1.941222 3.676089 0.2501211
> cor(y)
[,1] [,2] [,3]
[1,] 1.0000000 0.4930154 0.327411
[2,] 0.4930154 1.0000000 0.864762
[3,] 0.3274110 0.8647620 1.000000
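The relation R = D_s⁻¹ S D_s⁻¹ can be checked directly on these data; `y` below is the 10 × 3 data matrix listed above:

```r
y <- matrix(c(35,  3.5, 2.80,  35,  4.9, 2.70,  40, 30.0, 4.38,
              10,  2.8, 3.21,   6,  2.7, 2.73,  20,  2.8, 2.81,
              35,  4.6, 2.88,  35, 10.9, 2.90,  35,  8.0, 3.28,
              30,  1.6, 3.20), nrow = 10, byrow = TRUE)
S  <- var(y)                        # sample covariance matrix
Ds <- diag(sqrt(diag(S)))           # 11.8551, 8.4999, 0.5001 on the diagonal
R  <- solve(Ds) %*% S %*% solve(Ds) # standardize to correlations
all.equal(R, cor(y), check.attributes = FALSE)   # TRUE
```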
Why R?
(From Zelterman (2015), Applied Multivariate Statistics with R.)
SAS is more suitable for sharing programs and data, as in a business setting. SAS encourages the development of large programs through the use of its powerful macro language: a macro writes code that is expanded before the interpreter actually reads the code that is converted into instructions. In contrast, R has limited macro capabilities.
R was chosen as the software tool for the present course because of its
extensive libraries to perform the relevant analyses and more flexible
graphics capability. R is widely available as a free download from the
Internet. It should not be too difficult to download R and install it on your
computer. R is open source, meaning that in many cases, you can examine
the source code and see exactly what action is being performed. Further, if
you don’t like the way it performs a task, then you can rewrite the code to
have it do what you want it to do. Of course, this is a dangerous capability if
you are just a novice, but it does point out a more useful property: Anybody
can contribute to it. As a result there are hundreds of user-written packages
available to you. These include specialized programs for different analyses,
both statistical and discipline specific, as well as collections of data.
The learning curve for R is not terribly steep. Most users are up and running
quickly, performing many useful actions. R provides a nice graphical
interface that encourages visual displays of information as well as
mathematical calculation. Once you get comfortable with R, you will
probably want to learn more.
1.-Singular value decomposition, svd(A) (the $d component is not shown):
$u
[,1] [,2] [,3]
[1,] -0.3491067 0.4210584 0.5276557
[2,] -0.1612680 -0.6148072 -0.4166424
[3,] -0.8628567 0.1878036 -0.3776532
[4,] -0.3280175 -0.6398841 0.6366841
$v
[,1] [,2] [,3]
[1,] -0.5037009 -0.8612311 0.06757566
[2,] -0.5935025 0.4018330 0.69734139
[3,] -0.6277262 0.3111451 -0.71354644
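The `$u` and `$v` blocks above appear to be the output of `svd()` applied to the 4 × 3 matrix A whose transpose is shown next (this is an assumption: the call itself and the singular values `$d` are missing). A sketch verifying that the factors reconstruct A:

```r
# The 4 x 3 matrix A, read off from the t(A) output below
A <- matrix(c(1, 3, 2,
              2, 0, 1,
              4, 5, 6,
              3, 2, 1), nrow = 4, byrow = TRUE)
s <- svd(A)                                   # returns $d, $u, $v
all.equal(s$u %*% diag(s$d) %*% t(s$v), A)    # A = U D V'
```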
2.-Transpose:
> t(A)
[,1] [,2] [,3] [,4]
[1,] 1 2 4 3
[2,] 3 0 5 2
[3,] 2 1 6 1
3.-Determinant:
> A<-matrix(c(3,7,-2,-1,-3,-5,2,-8,-9),nrow=3, ncol=3, byrow=F)
>A
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 7 -3 -8
[3,] -2 -5 -9
> det(A)
[1] -200
4.-Inverse:
> solve(A)
[,1] [,2] [,3]
[1,] 0.065 0.095 -0.07
[2,] -0.395 0.115 -0.19
[3,] 0.205 -0.085 0.01
5.-Product:
> A%*%t(A)
[,1] [,2] [,3]
[1,] 14 8 -19
[2,] 8 122 73
[3,] -19 73 110
6.-Eigenvalues and eigenvectors:
> eigen(A%*%t(A))
$values
[1] 189.528565 52.447404 4.024031
$vectors
[,1] [,2] [,3]
[1,] -0.04037342 -0.4469577 0.8936436
[2,] 0.73127362 -0.6226824 -0.2783981
[3,] 0.68088830 0.6422581 0.3519882
7.-Trace:
> Tra_A <- sum(diag(A))
> Tra_A
[1] -9
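As a check, the trace of AA' equals the sum of its eigenvalues (14 + 122 + 110 = 246, matching 189.53 + 52.45 + 4.02 from the eigen output):

```r
A <- matrix(c( 3, -1,  2,
               7, -3, -8,
              -2, -5, -9), nrow = 3, byrow = TRUE)
B <- A %*% t(A)
sum(diag(B))                                   # 246
all.equal(sum(diag(B)), sum(eigen(B)$values))  # trace = sum of eigenvalues
```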
Example: The following data refer to the height X1 (in m) of a plant, its root length X2 (in cm), its leaf area X3 (in cm²), and the weight X4 (in g) of the fruit pulp, for a variety of apple tree.
Obs. X1 X2 X3 X4
1 1.38 51 4.6 115
2 1.40 60 5.6 130
3 1.42 69 5.8 138
4 1.54 73 6.5 148
5 1.30 56 5.3 122
6 1.55 75 7.0 152
7 1.50 80 8.1 160
8 1.60 76 7.8 155
9 1.41 58 5.9 135
10 1.34 70 6.1 140
> X1 <- c(1.38, 1.40, 1.42, 1.54, 1.30, 1.55, 1.50, 1.60, 1.41, 1.34)
> X2 <- c(51, 60, 69, 73, 56, 75, 80, 76, 58, 70)
> X3 <- c(4.6, 5.6, 5.8, 6.5, 5.3, 7.0, 8.1, 7.8, 5.9, 6.1)
> X4 <- c(115, 130, 138, 148, 122, 152, 160, 155, 135, 140)
> tabla1 <- data.frame(X1, X2, X3, X4)
# Covariance matrix, rounded to three decimal places
> round(cov(tabla1), 3)
# Correlation matrix
> round(cor(tabla1), 3)
# Determinant of the covariance matrix
> det(cov(tabla1))
# Determinant of the correlation matrix
> det(cor(tabla1))
> round(cov(tabla1),3)
X1 X2 X3 X4
X1 0.010 0.713 0.083 1.150
X2 0.713 96.622 9.509 138.556
X3 0.083 9.509 1.134 14.883
X4 1.150 138.556 14.883 212.056
> det(cov(tabla1))
[1] 0.3402605
> round(cor(tabla1), 3)
X1 X2 X3 X4
X1 1.000 0.737 0.790 0.802
X2 0.737 1.000 0.908 0.968
X3 0.790 0.908 1.000 0.960
X4 0.802 0.968 0.960 1.000
> det(cor(tabla1))
[1] 0.001510327
Conclusions: Note the strong linear relationship of pulp weight (X4) with leaf area (X3) and root length (X2), since these are the elements responsible in the physiology of the plant.
The variable that contributes most to the total variance is pulp weight X4, which accounts for (212.0555/309.8216)*100 = 68.4% of the total variability; in decreasing order, the shares of the other variables are 31.2% for root length X2, 0.37% for leaf area X3, and 0.003% for plant height X1.
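The percentage shares quoted above come from the diagonal of the covariance matrix and can be reproduced directly (using the apple-tree data from the table):

```r
X1 <- c(1.38, 1.40, 1.42, 1.54, 1.30, 1.55, 1.50, 1.60, 1.41, 1.34)
X2 <- c(51, 60, 69, 73, 56, 75, 80, 76, 58, 70)
X3 <- c(4.6, 5.6, 5.8, 6.5, 5.3, 7.0, 8.1, 7.8, 5.9, 6.1)
X4 <- c(115, 130, 138, 148, 122, 152, 160, 155, 135, 140)
v <- diag(cov(data.frame(X1, X2, X3, X4)))  # variance of each variable
round(100 * v / sum(v), 2)                  # X4 accounts for about 68.4%
```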
PROBLEMS
1.-Rencher: 2.7; 2.8; 2.9; 2.11; 2.12; 2.18; 2.19; 2.21; 2.22; 2.23; 2.24;
2.25; 2.26; 2.30; 2.33; 2.34; 2.38; 2.39.
Company X1 X2 X3
1 8.7 0.3 3.1
2 14.3 0.9 7.4
3 18.9 1.8 9.0
4 19.0 0.8 9.4
5 20.5 0.9 8.3
6 14.7 1.1 7.6
7 18.8 2.5 12.6
8 37.3 2.7 18.1
9 12.6 1.3 5.9
10 25.7 3.4 15.9
(a) Draw the scatter plot and discuss the appearance of the graph.
(b) For X1 and X2 calculate the sample means, the sample variances, the covariance between X1 and X2, and the correlation between the two. Analyze the results.
(c) Using the data matrix Y and the centering matrix H, calculate the
sample mean vector and the sample covariance matrix. From this,
obtain the correlation matrix.