
MULTIVARIATE STATISTICAL ANALYSIS

Prof. Dr.: RAFAEL AMARO.


iamaro@yachaytech.edu.ec
Bachelor of Science in Mathematics, USB, Ven.
Master of Science in Statistics, UCV, Ven.
PhD in Applied Multivariate Statistics, USAL, Spain.

SUMMARY OF CONTENT

1.-Introduction to multivariate methods. Motivation examples. The data matrix.
2.-The Multivariate Normal Distribution.
3.-Tests on One or Two Mean Vectors.
4.-Multivariate Analysis of Variance, (MANOVA).
5.-Discriminant Analysis: Description of Group Separation.
6.-Canonical Correlation.
7.-Principal Component Analysis.
8.-Correspondence Factor Analysis. Biplots Methods.
9.-Cluster Analysis.
10.-Topics: Big Data. The CUR decomposition.
BIBLIOGRAPHY

Main Book:
Rencher, Alvin. (2002). Methods of Multivariate Analysis. 2nd Edition. John Wiley and Sons, Inc., USA.

Reference Books:

Everitt, Brian; Hothorn, Torsten. (2011). An Introduction to Applied Multivariate Analysis with R. Springer.

Tabachnick, Barbara, and Fidell, Linda. (2007). Using Multivariate Statistics. Pearson.

Cuadras, C. M. (2018). Nuevos Métodos de Análisis Multivariante. CMC Editions, Barcelona.

Zelterman, Daniel. (2015). Applied Multivariate Statistics with R. Springer.

EVALUATION

Midterm evaluation, 30%.
Final evaluation, 30%.
Continuous evaluation, 40%.

MULTIVARIATE STATISTICS
In recent decades, Multivariate Statistical Analysis has undergone increasing development due, in part, to the numerous applications it has in almost all experimental sciences, and its use has become more or less essential. The spectacular possibilities offered by computers today have had a decisive influence on this development. While the first applications were based on simple models or worked with very few variables, today high-capacity, high-speed computers allow the use of complex and powerful methods and the handling of a large number of variables. (Cuadras, 1996, 2018).

Practitioners and researchers in all applied disciplines often measure several variables on each subject or experimental unit. In some cases, it may be
productive to isolate each variable in a system and study it separately.
Typically, however, the variables are not only correlated with each other, but
each variable is influenced by the other variables as it affects a test statistic
or descriptive statistic. Thus, in many instances, the variables are intertwined
in such a way that when analyzed individually they yield little information
about the system. Using multivariate analysis, the variables can be examined
simultaneously in order to access the key features of the process that
produced them. The multivariate approach enables us to (1) explore the joint
performance of the variables and (2) determine the effect of each variable in
the presence of the others. (Rencher, 2002).

The majority of data sets collected by researchers in all disciplines are multivariate, meaning that several measurements, observations, or
recordings are taken on each of the units in the data set. These units might
be human subjects, archaeological artifacts, countries, or a vast variety of
other things. In a few cases, it may be sensible to isolate each variable and
study it separately, but in most instances all the variables need to be
examined simultaneously in order to fully grasp the structure and key
features of the data. For this purpose, one or another method of multivariate
analysis might be helpful, and it is with such methods that this book is largely
concerned. Multivariate analysis includes methods both for describing and
exploring such data and for making formal inferences about them. The aim
of all the techniques is, in a general sense, to display or extract the signal in
the data in the presence of noise and to find out what the data show us in the
midst of their apparent chaos. (Everitt and Hothorn, 2011).

WHAT IS MULTIVARIATE STATISTICAL ANALYSIS?

Multivariate Analysis (MA) is the part of Statistics and data analysis that studies, analyzes, represents and interprets data resulting from observing more than one statistical variable on a sample of individuals.
The observed variables are homogeneous and correlated, with no one variable predominant over the others. The information in Multivariate Analysis is therefore multidimensional in nature, so geometry, matrix calculus and multivariate distributions play a key role.
Multivariate information usually takes the form of a data matrix, but often the input to MA consists of distance or similarity matrices, which measure the degree of discrepancy between individuals. We start with techniques based on n*p data matrices, where n is the number of individuals and p the number of variables. (Cuadras, 2018).
OBJECTIVES
The analysis of multivariate data aims at the statistical study of several
variables measured in elements of a population. It has the following
objectives.

1. Summarize the set of variables into a few new variables, constructed as transformations of the original ones, with minimal loss of information.
2. Find groups in the data if they exist.
3. Classify new observations in defined groups.
4. Relate two sets of variables.
5. Test hypotheses about the data.

EXAMPLES OF MULTIVARIATE INFORMATION:

1) In a clinical experiment, the effects of three doses of a treatment were tested on four indicators: v1, v2, v3 and v4. An unbalanced completely
randomized design was used, so that 3, 3 and 4 observations per treatment
were taken, respectively. Each treatment with its observations is considered
as a group, because we want to know if there are differences between the
treatments, according to their behavior in the four indicators evaluated.

The table below shows the data table of the experiment:

Individuals v1 v2 v3 v4
1 45 1.56 13 14
2 50 1.60 12 16
3 50 1.65 13 15
4 60 1.75 15 9
5 60 1.70 14 10
6 65 1.70 14 7
7 70 1.60 15 8
8 65 1.60 13 13
9 60 1.55 15 17
10 65 1.70 14 11

2) In a maternity clinic, a study was carried out to determine the size (in cm) of an infant, taking into account its age (in days) and its size (in cm), weight (in kg) and thorax size (in cm) at birth. For this purpose, a sample of 9 children was taken, from which the following results were obtained:

Size    Age     Birth Size  Weight  Thorax Size
57.50   78.00   48.20       2.75    29.50
52.80   69.00   45.50       2.15    26.30
61.30   77.00   46.30       4.41    32.20
67.00   88.00   49.00       5.52    36.50
53.50   67.00   43.00       3.21    27.20
62.70   80.00   48.00       4.32    27.70
56.20   74.00   48.00       2.31    28.30
68.50   94.00   53.00       4.30    30.30
69.20   102.00  58.00       3.71    28.70
3) In the 1988 Olympics held in Seoul, the heptathlon was won by one of the
stars of women's athletics in the USA, Jackie Joyner-Kersee. The results for
all 25 competitors in all seven disciplines are given in the next Table. We
shall analyze these data using Biplot. (Everitt and Hothorn, 2011).
THE DATA MATRIX

Let y represent a random vector of p variables measured on a sampling unit (subject or object). If there are n individuals in the sample, the n observation vectors are denoted by $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, where:

$$\mathbf{y}_i = \begin{pmatrix} y_{i1} \\ \vdots \\ y_{ip} \end{pmatrix}$$

All n observation vectors y1, y2, . . . , yn, can be transposed to row vectors
and listed in the data matrix Y as follows:

$$\mathbf{Y} = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{pmatrix}$$

Since n is usually greater than p, the data can be more conveniently tabulated
by entering the observation vectors as rows rather than columns. Note that
the first subscript i corresponds to units (subjects or objects) and the second
subscript j refers to variables. This convention will be followed whenever
possible.

The sample mean vector 𝐲̅ can be found either as the average of the n
observation vectors or by calculating the average of each of the p variables
separately:
$$\bar{\mathbf{y}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_i = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \vdots \\ \bar{y}_p \end{pmatrix}$$

Note: $\bar{\mathbf{y}}' = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_p)$

PROPOSITION: The sample mean of a data matrix can be computed as:

$$\bar{\mathbf{y}} = \frac{1}{n}\mathbf{Y}'\mathbf{j} \qquad (*)$$

where j is a vector of 1’s.

NOTE: We can transpose (*) to obtain:

$$\bar{\mathbf{y}}' = \frac{1}{n}\mathbf{j}'\mathbf{Y}$$

EXAMPLE: Suppose that we have measured the height (in metres) and
weight (in kilograms) of 5 students. We have n = 5 and p = 2, and the data
matrix is as follows:
$$\mathbf{Y} = \begin{pmatrix} 1.67 & 65.0 \\ 1.78 & 85.0 \\ 1.60 & 54.5 \\ 1.83 & 72.0 \\ 1.80 & 94.5 \end{pmatrix}$$

We can calculate the mean vector of this data matrix in three different ways:
1.-First by calculating the column means.
2.-As the sample mean of the observation vectors.
3.-We can use the matrix expression for the sample mean.

NOTE: This example hopefully makes clear that our three different ways of
thinking about computing the sample mean are all equivalent. However, the
final method based on a matrix multiplication operation is the neatest both
mathematically and computationally, and so we will make use of this
expression, as well as other similar expressions, throughout the course.
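
For instance, here is a minimal R sketch of the three computations for the height/weight data above (the object names Y, ybar and j are our own choices):

> Y <- matrix(c(1.67, 65.0, 1.78, 85.0, 1.60, 54.5, 1.83, 72.0, 1.80, 94.5),
+             nrow = 5, ncol = 2, byrow = TRUE)
> colMeans(Y)                           # 1. column means
[1]  1.736 74.200
> ybar <- rep(0, 2)
> for (i in 1:5) ybar <- ybar + Y[i, ]  # 2. sum the observation vectors...
> ybar / 5                              #    ...and average them
[1]  1.736 74.200
> j <- rep(1, 5)
> (1/5) * t(Y) %*% j                    # 3. the matrix expression (1/n) Y'j
      [,1]
[1,]  1.736
[2,] 74.200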

The mean of y over all possible values in the population is called the
population mean vector or expected value of y. It is defined as a vector of
expected values of each variable,

$$E(\mathbf{y}) = E\begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix} = \begin{pmatrix} E(y_1) \\ \vdots \\ E(y_p) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix} = \boldsymbol{\mu}$$

where μj is the population mean of the jth variable.

It can be shown that $E(\bar{\mathbf{y}}) = \boldsymbol{\mu}$, since $E(\bar{\mathbf{y}}) = \frac{1}{n}\sum_{i=1}^{n} E(\mathbf{y}_i) = \frac{1}{n}\,n\boldsymbol{\mu} = \boldsymbol{\mu}$.

Therefore, $\bar{\mathbf{y}}$ is an unbiased estimator of $\boldsymbol{\mu}$.

Example: For the data of the maternity clinic with R:

> colMeans(y)

[1] 60.966667 81.000000 48.777778 3.631111 29.522222

Then:

$$\bar{\mathbf{y}}' = (60.967,\ 81.000,\ 48.778,\ 3.631,\ 29.522)$$


Another way, using $\bar{\mathbf{y}}' = \frac{1}{n}\mathbf{j}'\mathbf{Y}$:

> j <- rep(1, 9)
> vm <- (1/9) * (t(j) %*% y)
> vm

         [,1] [,2]     [,3]     [,4]     [,5]
[1,] 60.96667   81 48.77778 3.631111 29.52222

REVIEW OF LINEAR ALGEBRA


FROM UNIVARIATE TO MULTIVARIATE....

Multivariate Analysis was born out of necessity:

1.- Practically no process can be explained by a single circumstance, so a univariate study is incomplete.

2.- A series of univariate analyses treated separately leads to errors or omissions in the interpretation of the results.

3.- The development of personal computers.

BIVARIATE CONTEXT:

Covariance: If two variables x and y are measured on each research unit (object or subject), we have a bivariate random variable (x, y). In many cases it is important to study the covariance between x and y.

EXAMPLE: Height and Weight for a Sample of 20 College-age Males

Person  Height (X)  Weight (Y)    Person  Height (X)  Weight (Y)
1 69 153 11 72 140
2 74 175 12 79 265
3 68 155 13 74 185
4 70 135 14 67 112
5 72 172 15 66 140
6 67 150 16 71 150
7 66 115 17 74 165
8 70 137 18 75 185
9 76 200 19 75 210
10 68 130 20 76 220

SCATTER PLOTS

A scatter diagram is a simple but efficient method to visualize the arrangement of individuals or populations with respect to the variables, to represent graphically the relationships between them, and to obtain information about clusters or groupings, outliers, etc.

Example, for the centered matrix:

[Scatter plot of the centered data: ZC[,1] on the horizontal axis (roughly -4 to 8), ZC[,2] on the vertical axis (roughly -50 to 100).]

The POPULATION COVARIANCE is defined as:

$$\mathrm{cov}(x, y) = \sigma_{xy} = E[(x - \mu_x)(y - \mu_y)]$$

where $\mu_x$ and $\mu_y$ are the means of x and y, respectively.

If the two random variables x and y in a bivariate random variable are added
or multiplied, a new random variable is obtained. The mean of x + y or of xy
is as follows:

E(x + y) = E(x) + E(y)

E(xy) = E(x )E(y) if x and y are independent.

Formally, x and y are INDEPENDENT if their joint density factors into the
product of their individual densities: f (x, y) = g(x) h(y). Informally, x and y
are independent if the random behavior of either of the variables is not
affected by the behavior of the other.
The notion of independence of x and y is more general than that of zero
covariance. The covariance σxy measures linear relationship only, whereas
if two random variables are independent, they are not related either linearly
or nonlinearly.

Independence implies $\sigma_{xy} = 0$, but $\sigma_{xy} = 0$ does not imply independence.

It is easy to show that if x and y are independent, then σxy = 0:

$$\sigma_{xy} = E(xy) - \mu_x\mu_y = E(x)E(y) - \mu_x\mu_y = \mu_x\mu_y - \mu_x\mu_y = 0$$

One way to demonstrate that the converse is not true is to construct examples
of bivariate x and y that have zero covariance and yet are related in a
nonlinear way (the relationship will have zero slope).

The SAMPLE COVARIANCE is defined as:

$$s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

It can be shown that:

$$s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n - 1}$$

Example. To obtain the sample covariance for the height and weight data, we first calculate $\bar{x}$, $\bar{y}$, and $\sum_i x_i y_i$, where x is height and y is weight:

$$\bar{x} = \frac{69 + 74 + \cdots + 76}{20} = 71.45, \qquad \bar{y} = \frac{153 + 175 + \cdots + 220}{20} = 164.7$$

$$\sum_{i=1}^{20} x_i y_i = (69)(153) + (74)(175) + \cdots + (76)(220) = 237805$$

Now we have:

$$s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n - 1} = \frac{237805 - 20(71.45)(164.7)}{19} = 128.88$$
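
The same computation can be sketched in R (the vector names height and weight are our own, with the data transcribed from the table above):

> height <- c(69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
+             72, 79, 74, 67, 66, 71, 74, 75, 75, 76)
> weight <- c(153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
+             140, 265, 185, 112, 140, 150, 165, 185, 210, 220)
> cov(height, weight)   # matches the hand computation
[1] 128.8789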

By itself, the sample covariance 128.88, given in the example, is not very
meaningful. We are not sure if this represents a small, moderate, or large
amount of relationship between y and x. A method of standardizing the
covariance is given now.

CORRELATION: Since the covariance depends on the scale of measurement of x and y, it is difficult to compare covariances between different pairs of variables. For example, if we change a measurement from inches to centimeters, the covariance will change. To find a measure of linear relationship that is invariant to changes of scale, we can standardize the covariance by dividing by the standard deviations of the two variables. This standardized covariance is called a correlation. The POPULATION CORRELATION of two random variables x and y is

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sqrt{E(x - \mu_x)^2}\,\sqrt{E(y - \mu_y)^2}}$$

and the SAMPLE CORRELATION is:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Example: for the following data table

X 0 1 2 3 4 5 6 7 8 9 10
Y 0 144 256 336 384 400 384 336 256 144 0

a) Find the correlation between the variables X and Y.

b) What can be concluded about the relationship between the variables X and Y, using the result of part a)?

c) Draw the scatter plot and discuss the appearance of the graph.
> A <- cbind(X = 0:10, Y = c(0, 144, 256, 336, 384, 400, 384, 336, 256, 144, 0))
> cor(A)[1, 2]
[1] 0

> plot(A)

[Scatter plot of A[,2] versus A[,1]: the points trace a parabola, rising from 0 at X = 0 to a maximum of 400 at X = 5 and falling back to 0 at X = 10.]

Note that the correlation is 0 even though X and Y are exactly related: the relationship is nonlinear, and the covariance (and hence the correlation) measures only linear relationship.
Note: The sample correlation 𝑟𝑥𝑦 is related to the cosine of the angle between
two vectors.
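
A small illustration of this fact in R, with made-up vectors x and y:

> x <- c(1, 2, 3, 4, 5); y <- c(2, 1, 4, 3, 5)
> xc <- x - mean(x); yc <- y - mean(y)                # centered vectors
> sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))  # cosine of the angle between them
[1] 0.8
> cor(x, y)                                           # the same value
[1] 0.8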
THE CENTERING MATRIX

DEFINITION: The matrix

$$\mathbf{H}_n = \mathbf{I}_{n \times n} - \frac{1}{n}\mathbf{j}_n\mathbf{j}_n'$$

is known as the CENTERING MATRIX.

PROPOSITION: The centering matrix $\mathbf{H}_n$ has the following properties:

1.- It is symmetric.
2.- It is idempotent.
3.- If Y is an n*p matrix, then the n*p matrix $\mathbf{W} = \mathbf{H}_n\mathbf{Y}$ has sample mean equal to the zero p-vector.

Proof: (1) $\mathbf{H}_n' = \mathbf{I}' - \frac{1}{n}(\mathbf{j}_n\mathbf{j}_n')' = \mathbf{I} - \frac{1}{n}\mathbf{j}_n\mathbf{j}_n' = \mathbf{H}_n$. (2) Since $\mathbf{j}_n'\mathbf{j}_n = n$, we have $\mathbf{H}_n^2 = \mathbf{I} - \frac{2}{n}\mathbf{j}_n\mathbf{j}_n' + \frac{1}{n^2}\mathbf{j}_n(\mathbf{j}_n'\mathbf{j}_n)\mathbf{j}_n' = \mathbf{H}_n$. (3) The sample mean of $\mathbf{W}$ is $\frac{1}{n}\mathbf{W}'\mathbf{j}_n = \frac{1}{n}\mathbf{Y}'\mathbf{H}_n\mathbf{j}_n = \mathbf{0}$, since $\mathbf{H}_n\mathbf{j}_n = \mathbf{j}_n - \frac{1}{n}\mathbf{j}_n(\mathbf{j}_n'\mathbf{j}_n) = \mathbf{j}_n - \mathbf{j}_n = \mathbf{0}$.
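
These properties can also be verified numerically in R (a minimal sketch for n = 5; this is essentially Problem 5 at the end of the section):

> n <- 5
> H <- diag(n) - (1/n) * matrix(1, n, n)    # H = I - (1/n) j j'
> all.equal(H, t(H))                        # symmetric
[1] TRUE
> all.equal(H %*% H, H)                     # idempotent
[1] TRUE
> Y <- matrix(rnorm(n * 2), nrow = n)       # any n x 2 data matrix
> round(colMeans(H %*% Y), 10)              # sample mean of W = HY is the zero vector
[1] 0 0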

MULTIVARIATE GRAPHICS
MULTIVARIATE CONTEXT
COVARIANCE MATRICES
The sample covariance matrix $\mathbf{S} = (s_{jk})$ is the matrix of sample variances and covariances of the p variables:

$$\mathbf{S} = (s_{jk}) = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}$$

In S the sample variances of the p variables are on the diagonal, and all
possible pairwise sample covariances appear off the diagonal. The jth row
(column) contains the covariances of yj with the other p − 1 variables. Other
names used for the covariance matrix are variance matrix, variance-
covariance matrix, and dispersion matrix.

The sample covariance matrix S can also be expressed in terms of the observation vectors:

$$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' = \frac{1}{n-1}\left(\sum_{i=1}^{n}\mathbf{y}_i\mathbf{y}_i' - n\bar{\mathbf{y}}\bar{\mathbf{y}}'\right)$$

We can also obtain S directly from the data matrix Y:

$$\mathbf{S} = \frac{1}{n-1}\left[\mathbf{Y}'\mathbf{Y} - \mathbf{Y}'\left(\frac{1}{n}\mathbf{J}\right)\mathbf{Y}\right] = \frac{1}{n-1}\,\mathbf{Y}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{Y}$$

where J is the n*n matrix of 1's.
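
A quick numerical check of this formula in R, using the maternity-clinic data matrix y from earlier (any n*p data matrix works):

> n <- nrow(y)
> J <- matrix(1, n, n)                              # n x n matrix of 1's
> S <- (1/(n - 1)) * t(y) %*% (diag(n) - J/n) %*% y
> all.equal(S, var(y), check.attributes = FALSE)    # same as the built-in estimator
[1] TRUE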

If y is a random vector taking on any possible value in a multivariate population, the population covariance matrix is defined as:

$$\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{y}) = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}$$

The diagonal elements $\sigma_{jj} = \sigma_j^2$ are the population variances of the y's, and the off-diagonal elements $\sigma_{jk}$ are the population covariances of all possible pairs of y's.

The population covariance matrix can also be found as:

$$\boldsymbol{\Sigma} = E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']$$

It can easily be shown that $\boldsymbol{\Sigma}$ can be expressed as:

$$\boldsymbol{\Sigma} = E(\mathbf{y}\mathbf{y}') - \boldsymbol{\mu}\boldsymbol{\mu}'$$

since expanding the product gives $E(\mathbf{y}\mathbf{y}') - E(\mathbf{y})\boldsymbol{\mu}' - \boldsymbol{\mu}E(\mathbf{y}') + \boldsymbol{\mu}\boldsymbol{\mu}' = E(\mathbf{y}\mathbf{y}') - \boldsymbol{\mu}\boldsymbol{\mu}'$.

CORRELATION MATRICES

The sample correlation matrix is analogous to the covariance matrix, with correlations in place of covariances:

$$\mathbf{R} = (r_{jk}) = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}$$

The correlation matrix can be obtained from the covariance matrix, and vice versa. Define:

$$\mathbf{D}_s = \mathrm{diag}(\sqrt{s_{11}}, \sqrt{s_{22}}, \ldots, \sqrt{s_{pp}}) = \mathrm{diag}(s_1, s_2, \ldots, s_p) = \begin{pmatrix} s_1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & s_p \end{pmatrix}$$

Then:

$$\mathbf{R} = \mathbf{D}_s^{-1}\,\mathbf{S}\,\mathbf{D}_s^{-1}$$

$$\mathbf{S} = \mathbf{D}_s\,\mathbf{R}\,\mathbf{D}_s$$
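
In R, these two relations can be checked directly (y is again a data matrix, S its sample covariance matrix):

> S <- var(y)
> Ds <- diag(sqrt(diag(S)))          # D_s = diag(s_1, ..., s_p)
> R <- solve(Ds) %*% S %*% solve(Ds)
> all.equal(R, cor(y), check.attributes = FALSE)
[1] TRUE
> all.equal(Ds %*% R %*% Ds, S, check.attributes = FALSE)   # recover S from R
[1] TRUE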

The population correlation matrix is defined as:

$$\mathbf{P}_\rho = (\rho_{jk}) = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \vdots & \vdots & & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix}$$

where $\rho_{jk} = \dfrac{\sigma_{jk}}{\sigma_j\sigma_k}$.

Example: Table below gives partial data from Kramer and Jensen (1969a).
Three variables were measured (in milliequivalents per 100 g) at 10 different
locations in the South. The variables are

y1 = available soil calcium,
y2 = exchangeable soil calcium,
y3 = turnip green calcium.

Location Number   y1    y2    y3
1 35 3.5 2.80
2 35 4.9 2.70
3 40 30.0 4.38
4 10 2.8 3.21
5 6 2.7 2.73
6 20 2.8 2.81
7 35 4.6 2.88
8 35 10.9 2.90
9 35 8.0 3.28
10 30 1.6 3.20
To find the mean vector $\bar{\mathbf{y}}$, we simply calculate the average of each column and obtain:

$$\bar{\mathbf{y}}' = (28.1,\ 7.18,\ 3.089)$$

Continuing in this fashion, we obtain:

$$\mathbf{S} = \begin{pmatrix} 140.54 & 49.68 & 1.94 \\ 49.68 & 72.25 & 3.68 \\ 1.94 & 3.68 & 0.25 \end{pmatrix}$$

$$\mathbf{D}_s = \begin{pmatrix} 11.8551 & 0 & 0 \\ 0 & 8.4999 & 0 \\ 0 & 0 & 0.5001 \end{pmatrix}$$

Then:

$$\mathbf{R} = \mathbf{D}_s^{-1}\,\mathbf{S}\,\mathbf{D}_s^{-1} = \begin{pmatrix} 1.000 & 0.493 & 0.327 \\ 0.493 & 1.000 & 0.865 \\ 0.327 & 0.865 & 1.000 \end{pmatrix}$$

With R:
>y
[,1] [,2] [,3]
[1,] 35 3.5 2.80
[2,] 35 4.9 2.70
[3,] 40 30.0 4.38
[4,] 10 2.8 3.21
[5,] 6 2.7 2.73
[6,] 20 2.8 2.81
[7,] 35 4.6 2.88
[8,] 35 10.9 2.90
[9,] 35 8.0 3.28
[10,] 30 1.6 3.20

> var(y)
[,1] [,2] [,3]
[1,] 140.544444 49.680000 1.9412222
[2,] 49.680000 72.248444 3.6760889
[3,] 1.941222 3.676089 0.2501211

> cor(y)
[,1] [,2] [,3]
[1,] 1.0000000 0.4930154 0.327411
[2,] 0.4930154 1.0000000 0.864762
[3,] 0.3274110 0.8647620 1.000000

Why R?
(From Zelterman, (2015). Applied Multivariate Statistics with R).

There are a number of high-quality software packages available to the data


analyst today. As with any type of tool, some are better suited for the task at
hand than others. Understanding the strengths and limitations will help
determine which is appropriate for your needs. It is better to decide this early,
rather than investing a lot of time on a major project, only to be disappointed
later.
Two popular languages regularly in use by the statistical community today
are SAS and R. The most glaring differences between these packages are the
capability to handle huge databases and the capability to provide a
certification for the validity of the results. SAS is the standard package for
many applications such as in pharmaceuticals and financials because it can
handle massive data sets and provide third-party certification. In contrast, R
is more suited for quick and nimble analyses of smaller data sets. There is no
independent review of R and errors can continue, uncorrected, for years.

SAS is also more suitable for sharing programs and data, as in a business
setting. SAS encourages the development of large programs through the use
of its powerful macro language. The macro writes code that is expanded
before the interpreter actually reads the code that is converted into
instructions. In contrast, R has limited macro capabilities.

R was chosen as the software tool for the present course because of its
extensive libraries to perform the relevant analyses and more flexible
graphics capability. R is widely available as a free download from the
Internet. It should not be too difficult to download R and install it on your
computer. R is open source, meaning that in many cases, you can examine
the source code and see exactly what action is being performed. Further, if
you don’t like the way it performs a task, then you can rewrite the code to
have it do what you want it to do. Of course, this is a dangerous capability if
you are just a novice, but it does point out a more useful property: Anybody
can contribute to it. As a result there are hundreds of user-written packages
available to you. These include specialized programs for different analyses,
both statistical and discipline specific, as well as collections of data.

The learning curve for R is not terribly steep. Most users are up and running
quickly, performing many useful actions. R provides a nice graphical
interface that encourages visual displays of information as well as
mathematical calculation. Once you get comfortable with R, you will
probably want to learn more.

It is highly recommended that all users of R work in RStudio, an interface that provides both assistance for novices and productivity tools for experienced users. RStudio opens four windows: one for editing code, one for the console that executes R code, one to keep track of the variables defined in the workspace, and a fourth to display graphical images.

Data processing with R


1.-Singular value decomposition:

$$A = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 0 & 1 \\ 4 & 5 & 6 \\ 3 & 2 & 1 \end{pmatrix}$$

> A<-matrix(c(1,2,4,3,3,0,5,2,2,1,6,1), nrow=4, ncol=3, byrow=F)
> A
[,1] [,2] [,3]
[1,] 1 3 2
[2,] 2 0 1
[3,] 4 5 6
[4,] 3 2 1
> svd(A)
$d
[1] 10.139196 2.295544 1.388229

$u
[,1] [,2] [,3]
[1,] -0.3491067 0.4210584 0.5276557
[2,] -0.1612680 -0.6148072 -0.4166424
[3,] -0.8628567 0.1878036 -0.3776532
[4,] -0.3280175 -0.6398841 0.6366841
$v
[,1] [,2] [,3]
[1,] -0.5037009 -0.8612311 0.06757566
[2,] -0.5935025 0.4018330 0.69734139
[3,] -0.6277262 0.3111451 -0.71354644
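
As a check, the factors U, D, and V' recover A, a defining property of any singular value decomposition:

> s <- svd(A)
> all.equal(s$u %*% diag(s$d) %*% t(s$v), A)   # A = U D V'
[1] TRUE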

2.-Transpose:
> t(A)
[,1] [,2] [,3] [,4]
[1,] 1 2 4 3
[2,] 3 0 5 2
[3,] 2 1 6 1

3.-Determinant:
> A<-matrix(c(3,7,-2,-1,-3,-5,2,-8,-9),nrow=3, ncol=3, byrow=F)
>A
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 7 -3 -8
[3,] -2 -5 -9
> det(A)
[1] -200

4.-Inverse:
> solve(A)
[,1] [,2] [,3]
[1,] 0.065 0.095 -0.07
[2,] -0.395 0.115 -0.19
[3,] 0.205 -0.085 0.01

5.-Product:
> A%*%t(A)
[,1] [,2] [,3]
[1,] 14 8 -19
[2,] 8 122 73
[3,] -19 73 110

6.-Eingenvalues and eigenvectors:

> eigen(A%*%t(A))
$values
[1] 189.528565 52.447404 4.024031

$vectors
[,1] [,2] [,3]
[1,] -0.04037342 -0.4469577 0.8936436
[2,] 0.73127362 -0.6226824 -0.2783981
[3,] 0.68088830 0.6422581 0.3519882
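
Since det(AA') = det(A)·det(A') = det(A)², the product of these eigenvalues must equal (−200)² = 40000, which we can confirm:

> prod(eigen(A %*% t(A))$values)   # equals det(A)^2
[1] 40000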

7.-Trace:

> Tra_A<-sum(diag(A))
> Tra_A
[1] -9

Example: The following data refer to the height of a plant X1 (in m), its root length X2 (in cm), its leaf area X3 (in cm²), and the weight of the fruit pulp X4 (in g), for a variety of apple tree.

Obs. X1 X2 X3 X4
1 1.38 51 4.8 115
2 1.40 60 5.6 130
3 1.42 69 5.8 138
4 1.54 73 6.5 148
5 1.30 56 5.3 122
6 1.55 75 7.0 152
7 1.50 80 8.1 160
8 1.60 76 7.8 155
9 1.41 58 5.9 135
10 1.34 70 6.1 140

># read the data:

>X1<-c(1.38,1.40,1.42,1.54,1.30,1.55,1.50,1.60,1.41,1.34)
>X2<-c(51,60,69,73,56,75,80,76,58,70)
>X3<-c(4.8,5.6,5.8,6.5,5.3,7.0,8.1,7.8,5.9,6.1)
>X4<-c(115,130,138,148,122,152,160,155,135,140)

>tabla1<-data.frame(X1,X2,X3,X4)

# covariance matrix, rounded to three decimal places
>round(cov(tabla1),3)
# correlation matrix
>round(cor(tabla1),3)
# determinant of the covariance matrix
>det(cov(tabla1))
# determinant of the correlation matrix
>det(cor(tabla1))

> round(cov(tabla1),3)
X1 X2 X3 X4
X1 0.010 0.713 0.083 1.150
X2 0.713 96.622 9.509 138.556
X3 0.083 9.509 1.134 14.883
X4 1.150 138.556 14.883 212.056

> det(cov(tabla1))
[1] 0.3402605

> round(cor(tabla1), 3)
X1 X2 X3 X4
X1 1.000 0.737 0.790 0.802
X2 0.737 1.000 0.908 0.968
X3 0.790 0.908 1.000 0.960
X4 0.802 0.968 0.960 1.000

> det(cor(tabla1))
[1] 0.001510327

Conclusions: Note the strong linear relationship of pulp weight (X4) with leaf area (X3) and root length (X2), since these are the elements responsible in the physiology of the plant.

The variable that contributes most to the total variance is pulp weight X4, which accounts for (212.0555/309.8216)*100 = 68.4% of the total variability; analogously and in decreasing order, the contributions of the other variables are 31.2% for root length X2, 0.37% for leaf area X3, and 0.003% for plant height X1.

PROBLEMS
1.-Rencher: 2.7; 2.8; 2.9; 2.11; 2.12; 2.18; 2.19; 2.21; 2.22; 2.23; 2.24;
2.25; 2.26; 2.30; 2.33; 2.34; 2.38; 2.39.

2.- Show that the eigenvalues of A'A are real and non-negative.

3.-Generate a matrix X and a vector u (use R), both with random entries, and construct the matrices A = X'X and B = uu'.
(a) Check that the trace and the determinant of A match, respectively, the
sums and the product of the eigenvalues of A.
(b) Obtain the ranks of A and B and check that they match, respectively, the number of non-null eigenvalues of A and B.
4.-Rencher: 3.4; 3.9; 3.10; 3.11; 3.18; 3.22.

5.-The centering matrix of dimension n is defined as H = I − (1/n) 11', where I is the identity matrix of dimension n and 1 is an n*1 vector of ones. Use R to check the following properties of the H matrix (for example, for n = 5):
(a) H is an idempotent matrix.
(b) rank(H) = tra(H) = n − 1.

6.-The data in the following table correspond to houses built by 10 construction companies in the coastal area:

X1 = Average duration of the mortgage (years).
X2 = Average price (millions of dollars).
X3 = Average kitchen area (m²).

Company X1 X2 X3
1 8.7 0.3 3.1
2 14.3 0.9 7.4
3 18.9 1.8 9.0
4 19.0 0.8 9.4
5 20.5 0.9 8.3
6 14.7 1.1 7.6
7 18.8 2.5 12.6
8 37.3 2.7 18.1
9 12.6 1.3 5.9
10 25.7 3.4 15.9

(a) Draw the scatter plot and discuss the appearance of the graph.
(b) For X1 and X2 calculate, respectively, the sample means, the sample variances, the covariance between X1 and X2, and the correlation between the two. Analyze the results.
(c) Using the data matrix Y and the centering matrix H, calculate the
sample mean vector and the sample covariance matrix. From this,
obtain the correlation matrix.
