Data Matrices

Data matrices
LSTAT2110(A) – Analyse des données
2020–2021
Data matrices
Mean vector
Covariance and correlation
Linear transformations
Data matrices
Datasets as matrices
Data matrix X : n × p:

$$
X = (x_{ij})_{i=1,\dots,n;\; j=1,\dots,p} =
\begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
$$

- n rows: observations, individuals (i = 1, ..., n):
  $x_i' = (x_{i1}, \dots, x_{ip})$
- p columns: variables (j = 1, ..., p):
  $x^j = (x_{1j}, \dots, x_{nj})'$
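The row and column views can be tried out on a small matrix. The sketch below is only an illustration: it copies four rows and three columns of the food data used in these slides into an object called `X_small` (a name assumed here, not part of the course code) and extracts one observation and one variable.

```r
# Toy data matrix: four rows and three columns copied from the food data
# (X_small is a hypothetical name used for illustration only)
X_small <- matrix(c(332, 428, 354,
                    293, 559, 388,
                    372, 767, 562,
                    406, 563, 341),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("w2", "e2", "m2", "w3"),
                                  c("bread", "veget", "fruit")))

dim(X_small)          # n = 4 observations, p = 3 variables
X_small["m2", ]       # one observation (row): a point in R^p
X_small[, "veget"]    # one variable (column): a point in R^n
```

Indexing by name rather than by position keeps the roles of rows (observations) and columns (variables) explicit.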
X <- read.delim(file = "data/food",
row.names = 1, comment.char = "#")
X
bread veget fruit meat poult milk wine

w2 332 428 354 1437 526 247 427
e2 293 559 388 1527 567 239 258
m2 372 767 562 1948 927 235 433
w3 406 563 341 1507 544 324 407
e3 386 608 396 1501 558 319 363
m3 438 843 689 2345 1148 243 341
w4 534 660 367 1620 638 414 407
e4 460 699 484 1856 762 400 416
m4 385 789 621 2366 1149 304 282
w5 655 776 423 1848 759 495 486
e5 584 995 548 2056 893 518 319
m5 515 1097 887 2630 1167 561 284
# In R, the object X is a data.frame,
# but we think of it as a matrix anyway.
class(X)
[1] "data.frame"
Visualizing a data matrix?
plot(X) # graph not very informative... => better way?
[Figure: scatterplot matrix produced by plot(X), with one panel per pair of the variables bread, veget, fruit, meat, poult, milk and wine]
Datasets as clouds of points
View data matrix as a cloud of points. Two different ways:

- rows $x_1', \dots, x_n'$: n points in $\mathbb{R}^p$: point cloud of observations
- columns $x^1, \dots, x^p$: p points in $\mathbb{R}^n$: point cloud of variables

Both clouds contain the same information. But they live in high-dimensional spaces (dimension n or p).

Dimension reduction
Represent both clouds in a low-dimensional space, ideally in a plane
- visualization
- interpretation
Mean vector
Sample means
Sample mean of j-th variable:

$$
\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}
$$

Vector of sample means:

$$
\bar{x}' = (\bar{x}_1, \dots, \bar{x}_p) : 1 \times p
$$

Unless stated otherwise, vectors are considered as column matrices. Therefore, row vectors need to be indicated via matrix transpose.
xbar <- colMeans(X)
xbar
bread veget fruit meat poult milk wine
446.6667 732.0000 505.0000 1886.7500 803.1667 358.2500 368.5833
Centering
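Often, data matrices are centered prior to subsequent analysis: the mean of each variable is subtracted from that variable:

$$
X - 1_n \bar{x}' = (x_{ij} - \bar{x}_j)_{ij} =
\begin{pmatrix}
x_{11} - \bar{x}_1 & \cdots & x_{1p} - \bar{x}_p \\
\vdots & & \vdots \\
x_{n1} - \bar{x}_1 & \cdots & x_{np} - \bar{x}_p
\end{pmatrix} : n \times p
$$

Here, $1_n$ denotes an n × 1 vector, all elements equal to 1, so that

$$
1_n \bar{x}' =
\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}
\begin{pmatrix} \bar{x}_1 & \cdots & \bar{x}_p \end{pmatrix}
=
\begin{pmatrix}
\bar{x}_1 & \cdots & \bar{x}_p \\
\vdots & & \vdots \\
\bar{x}_1 & \cdots & \bar{x}_p
\end{pmatrix} : n \times p
$$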
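The matrix expression for centering can be written almost literally in R. The following is a minimal sketch on a toy matrix (`X_small`, `ones`, `Xc_matrix` and `Xc_sweep` are names assumed here for illustration); it builds the product of the vector of ones with the transposed mean vector explicitly and compares the result with the `sweep()` approach.

```r
# Toy data matrix (hypothetical name): four rows and three columns
# copied from the food data
X_small <- matrix(c(332, 428, 354,
                    293, 559, 388,
                    372, 767, 562,
                    406, 563, 341),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("w2", "e2", "m2", "w3"),
                                  c("bread", "veget", "fruit")))

n    <- nrow(X_small)
ones <- matrix(1, nrow = n, ncol = 1)   # the vector 1_n (n x 1)
xbar <- colMeans(X_small)               # mean vector (length p)

# Centering written as X - 1_n xbar'
Xc_matrix <- X_small - ones %*% t(xbar)

# Same operation with sweep(): subtract xbar[j] from column j
Xc_sweep <- sweep(X_small, MARGIN = 2, STATS = xbar, FUN = "-")

# Compare the two constructions
all.equal(Xc_matrix, Xc_sweep, check.attributes = FALSE)
```

For large n, forming the n × p matrix of repeated means explicitly is wasteful, which is one reason to prefer `sweep()` in practice.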
Xc <- sweep(X, MARGIN = 2, STATS = colMeans(X), FUN = "-")
Xc
bread veget fruit meat poult milk wine
w2 -114.666667 -304 -151 -449.75 -277.16667 -111.25 58.416667
e2 -153.666667 -173 -117 -359.75 -236.16667 -119.25 -110.583333
m2 -74.666667 35 57 61.25 123.83333 -123.25 64.416667
w3 -40.666667 -169 -164 -379.75 -259.16667 -34.25 38.416667
e3 -60.666667 -124 -109 -385.75 -245.16667 -39.25 -5.583333
m3 -8.666667 111 184 458.25 344.83333 -115.25 -27.583333
w4 87.333333 -72 -138 -266.75 -165.16667 55.75 38.416667
e4 13.333333 -33 -21 -30.75 -41.16667 41.75 47.416667
m4 -61.666667 57 116 479.25 345.83333 -54.25 -86.583333
w5 208.333333 44 -82 -38.75 -44.16667 136.75 117.416667
e5 137.333333 263 43 169.25 89.83333 159.75 -49.583333
m5 68.333333 365 382 743.25 363.83333 202.75 -84.583333
# After centering, all variables have mean 0
round(colMeans(Xc), 10)
bread veget fruit meat poult milk wine
0 0 0 0 0 0 0
Covariance and correlation
Sample variance and covariance
- Sample covariance between variables j and k:

  $$
  s_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)
  $$

  Up to the factor 1/n, equal to the scalar product of centered variables j and k

- Sample variance: if j = k

  $$
  s_{jj} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2
  $$

  Up to the factor 1/n, equal to the squared norm of centered variable j

- Standard deviation: square root of the sample variance

  $$
  s_j = \sqrt{s_{jj}}
  $$

- In R, the functions var(), cov() and sd() divide by n − 1 rather than by n
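The difference between the two conventions is easy to check numerically. The sketch below uses four bread values copied from the food data as a toy variable (`x` and `s_jj` are names assumed here) and compares the variance with division by n to the output of `var()`.

```r
# Toy variable: four "bread" values copied from the food data
x <- c(332, 293, 372, 406)
n <- length(x)

# Sample variance with division by n, as defined above
s_jj <- mean((x - mean(x))^2)

# Built-in variance: division by n - 1
var(x)

# The two differ exactly by the factor (n - 1) / n
all.equal(s_jj, ((n - 1) / n) * var(x))

# Standard deviation with division by n
sqrt(s_jj)
```

For small n, as here, the factor (n − 1)/n is far from 1, so the convention matters; for large n the two versions are nearly identical.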
Covariance matrix
Covariance matrix: gather all (co)variances in a matrix.

$$
S = (s_{jk})_{j,k=1,\dots,p} = \frac{1}{n} X_c' X_c : p \times p
$$

- $X_c' X_c$ is the p × p matrix of scalar products of centered variables
- Diagonal of S: sample variances $s_{11}, \dots, s_{pp}$
- The matrix S is symmetric: $s_{jk} = s_{kj}$ for all j, k, so that $S' = S$
n <- nrow(X) # number of observations
S <- ((n-1)/n) * cov(X) # function cov(): division by n-1 rather than by n
round(S, 2) # round to two decimal places
bread veget fruit meat poult milk wine
bread 10523.89 11020.58 3180.42 12487.75 6079.06 9842.58 2141.61
veget 11020.58 32806.67 24514.00 60467.75 35780.92 13462.50 -4437.33
fruit 3180.42 24514.00 24984.17 57463.75 34955.08 5888.17 -5282.50
meat 12487.75 60467.75 57463.75 143566.85 88884.37 15916.48 -11385.77
poult 6079.06 35780.92 34955.08 88884.37 57090.47 6240.21 -6571.01
milk 9842.58 13462.50 5888.17 15916.48 6240.21 12575.52 53.02
wine 2141.61 -4437.33 -5282.50 -11385.77 -6571.01 53.02 4723.24
round((1/n) * t(as.matrix(Xc)) %*% as.matrix(Xc), 2) # equal to S
bread veget fruit meat poult milk wine
bread 10523.89 11020.58 3180.42 12487.75 6079.06 9842.58 2141.61
veget 11020.58 32806.67 24514.00 60467.75 35780.92 13462.50 -4437.33
fruit 3180.42 24514.00 24984.17 57463.75 34955.08 5888.17 -5282.50
meat 12487.75 60467.75 57463.75 143566.85 88884.38 15916.48 -11385.77
poult 6079.06 35780.92 34955.08 88884.38 57090.47 6240.21 -6571.01
milk 9842.58 13462.50 5888.17 15916.48 6240.21 12575.52 53.02
wine 2141.61 -4437.33 -5282.50 -11385.77 -6571.01 53.02 4723.24
Standardization
Variables may be expressed in different units and are therefore difficult to compare.

Standardization or scaling: divide each (centered) variable by its standard deviation

$$
X_{cs} = \left( \frac{x_{ij} - \bar{x}_j}{s_j} \right)_{i=1,\dots,n;\; j=1,\dots,p}
$$
# centered and scaled variables
Xcs <- sweep(Xc, MARGIN = 2, STATS = sqrt(diag(S)), FUN = "/")
round(Xcs, 3)
bread veget fruit meat poult milk wine
w2 -1.118 -1.678 -0.955 -1.187 -1.160 -0.992 0.850
e2 -1.498 -0.955 -0.740 -0.949 -0.988 -1.063 -1.609
m2 -0.728 0.193 0.361 0.162 0.518 -1.099 0.937
w3 -0.396 -0.933 -1.038 -1.002 -1.085 -0.305 0.559
e3 -0.591 -0.685 -0.690 -1.018 -1.026 -0.350 -0.081
m3 -0.084 0.613 1.164 1.209 1.443 -1.028 -0.401
w4 0.851 -0.398 -0.873 -0.704 -0.691 0.497 0.559
e4 0.130 -0.182 -0.133 -0.081 -0.172 0.372 0.690
m4 -0.601 0.315 0.734 1.265 1.447 -0.484 -1.260
w5 2.031 0.243 -0.519 -0.102 -0.185 1.219 1.708
e5 1.339 1.452 0.272 0.447 0.376 1.425 -0.721
m5 0.666 2.015 2.417 1.962 1.523 1.808 -1.231
The variables in $X_{cs}$ have mean zero and unit variance.
round(colMeans(Xcs), 10) # zero means (up to rounding errors)
bread veget fruit meat poult milk wine
0 0 0 0 0 0 0
round((1/n) * t(as.matrix(Xcs)) %*% as.matrix(Xcs), 3) # unit variances
bread veget fruit meat poult milk wine
bread 1.000 0.593 0.196 0.321 0.248 0.856 0.304
veget 0.593 1.000 0.856 0.881 0.827 0.663 -0.356
fruit 0.196 0.856 1.000 0.959 0.926 0.332 -0.486
meat 0.321 0.881 0.959 1.000 0.982 0.375 -0.437
poult 0.248 0.827 0.926 0.982 1.000 0.233 -0.400
milk 0.856 0.663 0.332 0.375 0.233 1.000 0.007
wine 0.304 -0.356 -0.486 -0.437 -0.400 0.007 1.000
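Centering and scaling can also be done in a single call to the base R function `scale()`. The sketch below is an illustration on a toy matrix (the names `X_small`, `s` and `Xcs_small` are assumptions made here). Because `scale()` by default divides by standard deviations computed with n − 1, the standard deviations with division by n are passed explicitly.

```r
# Toy data matrix (hypothetical name): four rows and three columns
# copied from the food data
X_small <- matrix(c(332, 428, 354,
                    293, 559, 388,
                    372, 767, 562,
                    406, 563, 341),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("w2", "e2", "m2", "w3"),
                                  c("bread", "veget", "fruit")))

# Standard deviations s_j with division by n
s <- sqrt(colMeans(sweep(X_small, 2, colMeans(X_small))^2))

# Center each column, then divide column j by s[j]
Xcs_small <- scale(X_small, center = TRUE, scale = s)

round(colMeans(Xcs_small), 10)     # means of the standardized variables
round(colMeans(Xcs_small^2), 10)   # variances (division by n)
```

Calling `scale(X_small)` without the `scale` argument would give columns whose variance is 1 under the n − 1 convention instead.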
Correlation
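Correlation between variables j and k:

$$
\rho_{jk} = \frac{s_{jk}}{s_j s_k} = \frac{1}{n} \sum_{i=1}^{n} \frac{x_{ij} - \bar{x}_j}{s_j} \, \frac{x_{ik} - \bar{x}_k}{s_k}
$$

- Covariance between standardized variables
- $-1 \le \rho_{jk} \le 1$
- Correlation matrix: $R = (\rho_{jk})_{j,k=1,\dots,p} : p \times p$
round(cor(X), 3)
bread veget fruit meat poult milk wine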
bread 1.000 0.593 0.196 0.321 0.248 0.856 0.304
veget 0.593 1.000 0.856 0.881 0.827 0.663 -0.356
fruit 0.196 0.856 1.000 0.959 0.926 0.332 -0.486
meat 0.321 0.881 0.959 1.000 0.982 0.375 -0.437
poult 0.248 0.827 0.926 0.982 1.000 0.233 -0.400
milk 0.856 0.663 0.332 0.375 0.233 1.000 0.007
wine 0.304 -0.356 -0.486 -0.437 -0.400 0.007 1.000
Correlations and angles
Correlation and scalar product of centered variables j and k:

$$
\rho_{jk} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}} = \cos \theta_{jk}
$$

with $\theta_{jk}$ the angle between the centered variables, seen as vectors in $\mathbb{R}^n$.

ρjk   θjk   vectors
 1    0     parallel
 0    π/2   orthogonal
−1    π     anti-parallel
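The link between correlation and angle can be verified directly. The sketch below takes two toy variables (four veget and fruit values copied from the food data; the names `x`, `y`, `xc`, `yc`, `cos_theta` and `theta` are assumptions made here), computes the cosine of the angle between the centered vectors, and compares it with `cor()`.

```r
# Toy variables: four "veget" and "fruit" values copied from the food data
x <- c(428, 559, 767, 563)
y <- c(354, 388, 562, 341)

# Centered variables, seen as vectors in R^n
xc <- x - mean(x)
yc <- y - mean(y)

# Cosine of the angle: scalar product divided by the product of norms
cos_theta <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))

cos_theta
cor(x, y)                  # same quantity: the factors 1/n or 1/(n-1) cancel

theta <- acos(cos_theta)   # angle in radians, between 0 and pi
theta
```

Because the normalizing factor cancels between numerator and denominator, the correlation does not depend on whether variances are computed with n or n − 1.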
Linear transformations
Linear transformation of a data matrix
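Let X : n × p be a data matrix and A : p × ℓ a matrix of constants.

New data matrix Y : n × ℓ:

$$
\underbrace{Y}_{n \times \ell} = \underbrace{X}_{n \times p} \, \underbrace{A}_{p \times \ell}
= \begin{pmatrix} y_1' \\ \vdots \\ y_n' \end{pmatrix}
= \begin{pmatrix} y^1 & \cdots & y^\ell \end{pmatrix}
$$

$$
\begin{aligned}
y_{ik} &= \sum_{j=1}^{p} x_{ij} a_{jk} \\
y^k &= \sum_{j=1}^{p} x^j a_{jk} && \text{k-th variable} \\
y_i' &= x_i' A && \text{i-th observation}
\end{aligned}
$$

- n observations on ℓ new variables $y^1, \dots, y^\ell$
- Each new variable is a linear combination of the p original variables $x^1, \dots, x^p$.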
Creating ℓ = 2 new variables:
- vegan = sum of veget and fruit
- animal = sum of meat and poult
names(X)
[1] "bread" "veget" "fruit" "meat" "poult" "milk" "wine"

A <- cbind(c(0, 1, 1, 0, 0, 0, 0), c(0, 0, 0, 1, 1, 0, 0))
A
[,1] [,2]
[1,] 0 0
[2,] 1 0
[3,] 1 0
[4,] 0 1
[5,] 0 1
[6,] 0 0
[7,] 0 0
Y <- as.data.frame(as.matrix(X) %*% A)
names(Y) <- c("vegan", "animal")
Y
vegan animal
w2 782 1963
e2 947 2094
m2 1329 2875
w3 904 2051
e3 1004 2059
m3 1532 3493
w4 1027 2258
e4 1183 2618
m4 1410 3515
w5 1199 2607
e5 1543 2949
m5 1984 3797
Note: the tidyverse R packages provide a unified approach to manipulating data tables.
Transformed mean vector
Mean vector of transformed data matrix:

$$
\bar{y}' = \bar{x}' A
$$
colMeans(Y) # mean vector of Y
vegan animal
1237.000 2689.917
t(colMeans(X)) %*% A # idem
[,1] [,2]
[1,] 1237 2689.917
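The identity for the transformed mean vector follows in one line once the mean vector is written in matrix form. The derivation below is a sketch that uses the vector of ones $1_n$ from the centering section and the standard fact that $\bar{x}' = \frac{1}{n} 1_n' X$ (column sums divided by n):

```latex
\bar{y}' = \frac{1}{n} 1_n' Y
         = \frac{1}{n} 1_n' X A
         = \Bigl( \frac{1}{n} 1_n' X \Bigr) A
         = \bar{x}' A
```

The only ingredients are the definition Y = XA and associativity of matrix multiplication.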
Transformed covariance matrix
Covariance matrix of transformed data matrix:

$$
S_Y = A' S_X A : \ell \times \ell
$$
n <- nrow(Y) # number of observations

((n-1)/n) * cov(Y) # covariance matrix of Y
vegan animal
vegan 106818.8 188667.5
animal 188667.5 378426.1
((n-1)/n) * t(A) %*% cov(X) %*% A # idem
[,1] [,2]
[1,] 106818.8 188667.5
[2,] 188667.5 378426.1
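A short derivation shows where this formula comes from. It is a sketch that combines the centering formula, the covariance matrix in the form $\frac{1}{n} X_c' X_c$, and the transformed mean vector $\bar{y}' = \bar{x}' A$; the symbol $Y_c$ for the centered version of Y is notation introduced here.

```latex
Y_c = Y - 1_n \bar{y}'
    = X A - 1_n \bar{x}' A
    = (X - 1_n \bar{x}') A
    = X_c A

S_Y = \frac{1}{n} Y_c' Y_c
    = \frac{1}{n} A' X_c' X_c A
    = A' \Bigl( \frac{1}{n} X_c' X_c \Bigr) A
    = A' S_X A
```

In words: centering and applying A can be done in either order, so the covariance matrix of Y is obtained by multiplying that of X by A on the right and by A' on the left.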
Cross-covariance matrix
Covariances between the p old and the ℓ new variables:

$$
S_{XY} = S_X A : p \times \ell
$$
((n-1)/n) * cov(X) %*% A # 7-by-2 cross-covariance matrix
[,1] [,2]
bread 14201.000 18566.81
veget 57320.667 96248.67
fruit 49498.167 92418.83
meat 117931.500 232451.23
poult 70736.000 145974.85
milk 19350.667 22156.69
wine -9719.833 -17956.78
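The cross-covariance matrix can also be computed directly from the two data matrices, since `cov()` accepts a second argument and then returns the covariances between the columns of the first and the columns of the second. The sketch below checks this against the matrix product on a toy example (`X_small`, the matrix `A` chosen here, and the names `S_X` and `S_XY` are assumptions made for illustration).

```r
# Toy data matrix (hypothetical name): four rows and three columns
# copied from the food data
X_small <- matrix(c(332, 428, 354,
                    293, 559, 388,
                    372, 767, 562,
                    406, 563, 341),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("w2", "e2", "m2", "w3"),
                                  c("bread", "veget", "fruit")))

# Illustrative 3 x 2 matrix of constants:
# first new variable = veget + fruit, second new variable = bread
A <- cbind(c(0, 1, 1), c(1, 0, 0))
Y <- X_small %*% A

n    <- nrow(X_small)
S_X  <- ((n - 1) / n) * cov(X_small)      # covariance matrix of X (division by n)
S_XY <- ((n - 1) / n) * cov(X_small, Y)   # cross-covariances, computed directly

# Compare with the formula S_XY = S_X A
all.equal(S_XY, S_X %*% A, check.attributes = FALSE)
```

The direct computation is a useful sanity check when A has been typed in by hand, as a misplaced 0 or 1 shows up as a mismatch.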
Joint covariance matrix
Joining the p old and ℓ new variables:

$$
Z = \begin{pmatrix} X & Y \end{pmatrix} : n \times (p + \ell)
$$

Joint covariance matrix: (p + ℓ) × (p + ℓ) block matrix

$$
S_Z = \begin{pmatrix} S_X & S_{XY} \\ S_{XY}' & S_Y \end{pmatrix}
$$
Z <- cbind(X, Y)
Z
bread veget fruit meat poult milk wine vegan animal

w2 332 428 354 1437 526 247 427 782 1963
e2 293 559 388 1527 567 239 258 947 2094
m2 372 767 562 1948 927 235 433 1329 2875
w3 406 563 341 1507 544 324 407 904 2051
e3 386 608 396 1501 558 319 363 1004 2059
m3 438 843 689 2345 1148 243 341 1532 3493
w4 534 660 367 1620 638 414 407 1027 2258
e4 460 699 484 1856 762 400 416 1183 2618
m4 385 789 621 2366 1149 304 282 1410 3515
w5 655 776 423 1848 759 495 486 1199 2607
e5 584 995 548 2056 893 518 319 1543 2949
m5 515 1097 887 2630 1167 561 284 1984 3797
round(((n-1)/n) * cov(Z))
bread veget fruit meat poult milk wine vegan animal

bread 10524 11021 3180 12488 6079 9843 2142 14201 18567
veget 11021 32807 24514 60468 35781 13462 -4437 57321 96249
fruit 3180 24514 24984 57464 34955 5888 -5282 49498 92419
meat 12488 60468 57464 143567 88884 15916 -11386 117932 232451
poult 6079 35781 34955 88884 57090 6240 -6571 70736 145975
milk 9843 13462 5888 15916 6240 12576 53 19351 22157
wine 2142 -4437 -5282 -11386 -6571 53 4723 -9720 -17957
vegan 14201 57321 49498 117932 70736 19351 -9720 106819 188668
animal 18567 96249 92419 232451 145975 22157 -17957 188668 378426
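The block structure of the joint covariance matrix can be checked by assembling the four blocks by hand. The following sketch does this on a toy example with a single new variable (`X_small`, `A`, `S_Z_blocks` and the other object names are assumptions made here for illustration).

```r
# Toy data matrix (hypothetical name): four rows and three columns
# copied from the food data
X_small <- matrix(c(332, 428, 354,
                    293, 559, 388,
                    372, 767, 562,
                    406, 563, 341),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("w2", "e2", "m2", "w3"),
                                  c("bread", "veget", "fruit")))

# One new variable: vegan = veget + fruit
A <- cbind(vegan = c(0, 1, 1))
Y <- X_small %*% A
Z <- cbind(X_small, Y)
n <- nrow(Z)

# The blocks, computed from S_X and A
S_X  <- ((n - 1) / n) * cov(X_small)
S_XY <- S_X %*% A
S_Y  <- t(A) %*% S_X %*% A

# Assemble the (p + l) x (p + l) block matrix
S_Z_blocks <- rbind(cbind(S_X,     S_XY),
                    cbind(t(S_XY), S_Y))

# Compare with the covariance matrix of Z computed directly
S_Z <- ((n - 1) / n) * cov(Z)
all.equal(S_Z, S_Z_blocks, check.attributes = FALSE)
```

The same pattern of `rbind()` and `cbind()` calls extends to any number of new variables, since only the dimensions of the blocks change.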
