Data matrices

LSTAT2110(A) – Analyse des données

2020–2021

Data matrices

Mean vector

Covariance and correlation

Linear transformations

Data matrices

Datasets as matrices
Data matrix $X : n \times p$:

$$
X = (x_{ij})_{\substack{i=1,\dots,n \\ j=1,\dots,p}} =
\begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
$$

- n rows: observations, individuals (i = 1, ..., n):
  $x_i' = (x_{i1}, \dots, x_{ip})$
- p columns: variables (j = 1, ..., p):
  $x^j = (x_{1j}, \dots, x_{nj})'$
X <- read.delim(file = "data/food",
row.names = 1, comment.char = "#")
X

bread veget fruit meat poult milk wine


w2 332 428 354 1437 526 247 427
e2 293 559 388 1527 567 239 258
m2 372 767 562 1948 927 235 433
w3 406 563 341 1507 544 324 407
e3 386 608 396 1501 558 319 363
m3 438 843 689 2345 1148 243 341
w4 534 660 367 1620 638 414 407
e4 460 699 484 1856 762 400 416
m4 385 789 621 2366 1149 304 282
w5 655 776 423 1848 759 495 486
e5 584 995 548 2056 893 518 319
m5 515 1097 887 2630 1167 561 284
# In R, the object X is a data.frame,
# but we think of it as a matrix anyway.
class(X)

[1] "data.frame"

Visualizing a data matrix?
plot(X) # graph not very informative... => better way?
[Figure: scatterplot matrix produced by plot(X), one panel per pair of the variables bread, veget, fruit, meat, poult, milk, wine]

Datasets as clouds of points

View the data matrix as a cloud of points. Two different ways:

- rows $x_1', \dots, x_n'$: n points in $\mathbb{R}^p$: point cloud of observations
- columns $x^1, \dots, x^p$: p points in $\mathbb{R}^n$: point cloud of variables

Both clouds contain the same information. But they live in high-dimensional spaces (dimension p for the observations, dimension n for the variables).

Dimension reduction
Represent both clouds in a low-dimensional space, ideally in a plane
- visualization
- interpretation
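
The two clouds can be inspected directly in R. The sketch below is only an illustration: it assumes a small subset of the food data printed earlier (first four rows, three variables) typed in by hand, and the object name M is hypothetical.

```r
# Subset of the food data shown earlier: first four rows, three variables
X <- data.frame(
  bread = c(332, 293, 372, 406),
  veget = c(428, 559, 767, 563),
  fruit = c(354, 388, 562, 341),
  row.names = c("w2", "e2", "m2", "w3")
)
M <- as.matrix(X)

dim(M)         # n = 4 observations, p = 3 variables
M["w2", ]      # one observation: a point in R^p (here p = 3)
M[, "bread"]   # one variable: a point in R^n (here n = 4)

# The cloud of variables is the cloud of rows of the transposed matrix
dim(t(M))      # p x n
```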

Mean vector

Sample means

Sample mean of the j-th variable:

$$
\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}
$$

Vector of sample means:

$$
\bar{x}' = (\bar{x}_1, \dots, \bar{x}_p) : 1 \times p
$$

Unless stated otherwise, vectors are considered as column matrices. Therefore, row vectors need to be indicated via matrix transpose.
xbar <- colMeans(X)
xbar

bread veget fruit meat poult milk wine


446.6667 732.0000 505.0000 1886.7500 803.1667 358.2500 368.5833
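
The mean vector can also be written as a matrix product, $\bar{x} = \frac{1}{n} X' 1_n$, with $1_n$ the n × 1 vector of ones. This identity is not stated on the slide; the sketch below checks it on a hand-typed subset of the food data (first four rows, three variables). The names ones and xbar_mat are hypothetical.

```r
# Subset of the food data shown earlier: first four rows, three variables
X <- as.matrix(data.frame(
  bread = c(332, 293, 372, 406),
  veget = c(428, 559, 767, 563),
  fruit = c(354, 388, 562, 341),
  row.names = c("w2", "e2", "m2", "w3")
))
n <- nrow(X)
ones <- rep(1, n)                        # the vector 1_n

# Mean vector as a matrix product: (1/n) X' 1_n
xbar_mat <- drop(t(X) %*% ones) / n

# Should agree with colMeans()
all.equal(xbar_mat, colMeans(X))
```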

Centering

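Often, data matrices are centered prior to subsequent analysis: the mean of each variable is subtracted from that variable:

$$
X - 1_n \bar{x}' = (x_{ij} - \bar{x}_j)_{ij} =
\begin{pmatrix}
x_{11} - \bar{x}_1 & \cdots & x_{1p} - \bar{x}_p \\
\vdots             &        & \vdots             \\
x_{n1} - \bar{x}_1 & \cdots & x_{np} - \bar{x}_p
\end{pmatrix}
: n \times p
$$

Here, $1_n$ denotes an n × 1 vector, all elements equal to 1, so that

$$
1_n \bar{x}' =
\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}
\begin{pmatrix} \bar{x}_1 & \cdots & \bar{x}_p \end{pmatrix}
=
\begin{pmatrix}
\bar{x}_1 & \cdots & \bar{x}_p \\
\vdots    &        & \vdots    \\
\bar{x}_1 & \cdots & \bar{x}_p
\end{pmatrix}
: n \times p
$$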

Xc <- sweep(X, MARGIN = 2, STATS = colMeans(X), FUN = "-")
Xc

bread veget fruit meat poult milk wine


w2 -114.666667 -304 -151 -449.75 -277.16667 -111.25 58.416667
e2 -153.666667 -173 -117 -359.75 -236.16667 -119.25 -110.583333
m2 -74.666667 35 57 61.25 123.83333 -123.25 64.416667
w3 -40.666667 -169 -164 -379.75 -259.16667 -34.25 38.416667
e3 -60.666667 -124 -109 -385.75 -245.16667 -39.25 -5.583333
m3 -8.666667 111 184 458.25 344.83333 -115.25 -27.583333
w4 87.333333 -72 -138 -266.75 -165.16667 55.75 38.416667
e4 13.333333 -33 -21 -30.75 -41.16667 41.75 47.416667
m4 -61.666667 57 116 479.25 345.83333 -54.25 -86.583333
w5 208.333333 44 -82 -38.75 -44.16667 136.75 117.416667
e5 137.333333 263 43 169.25 89.83333 159.75 -49.583333
m5 68.333333 365 382 743.25 363.83333 202.75 -84.583333
# After centering, all variables have mean 0
round(colMeans(Xc), 10)

bread veget fruit meat poult milk wine


0 0 0 0 0 0 0
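
As a cross-check, the formula $X - 1_n \bar{x}'$ can be coded literally and compared with sweep() and with scale(). This is a minimal sketch on a hand-typed subset of the food data (first four rows, three variables); the object names ones, Xc_formula, Xc_sweep and Xc_scale are hypothetical.

```r
# Subset of the food data shown earlier: first four rows, three variables
X <- as.matrix(data.frame(
  bread = c(332, 293, 372, 406),
  veget = c(428, 559, 767, 563),
  fruit = c(354, 388, 562, 341),
  row.names = c("w2", "e2", "m2", "w3")
))
n <- nrow(X)
xbar <- colMeans(X)
ones <- matrix(1, nrow = n, ncol = 1)    # the vector 1_n as an n x 1 matrix

# Literal translation of X - 1_n xbar'
Xc_formula <- X - ones %*% t(xbar)

# Two equivalent ways of centering
Xc_sweep <- sweep(X, MARGIN = 2, STATS = xbar, FUN = "-")
Xc_scale <- scale(X, center = TRUE, scale = FALSE)

all.equal(Xc_formula, Xc_sweep, check.attributes = FALSE)
all.equal(Xc_formula, Xc_scale, check.attributes = FALSE)
```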

Covariance and correlation

Sample variance and covariance
- Sample covariance between variables j and k:

  $$
  s_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)
  $$

  Up to the factor 1/n, equal to the scalar product of the centered variables j and k
- Sample variance: the case j = k

  $$
  s_{jj} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2
  $$

  Up to the factor 1/n, equal to the squared norm of the centered variable j
- Standard deviation: square root of the sample variance

  $$
  s_j = \sqrt{s_{jj}}
  $$
- In R, the functions var(), cov() and sd() divide by n − 1 rather than by n
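
The effect of the two divisors can be checked on a single variable. The sketch below uses the bread column of the food data as printed earlier, typed in by hand; the names s_n and s_nm1 are hypothetical.

```r
# bread column of the food data shown earlier
bread <- c(332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515)
n <- length(bread)

s_n   <- mean((bread - mean(bread))^2)   # variance with division by n
s_nm1 <- var(bread)                      # R's var(): division by n - 1

# The two differ by the factor (n - 1) / n
all.equal(s_n, (n - 1) / n * s_nm1)

# Standard deviation in the convention of these slides
sqrt(s_n)
```

The value of s_n should agree with the entry for bread on the diagonal of the covariance matrix S computed later in these slides.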


Covariance matrix

Covariance matrix: gather all (co)variances in a matrix.

$$
S = (s_{jk})_{\substack{j=1,\dots,p \\ k=1,\dots,p}} = \frac{1}{n} X_c' X_c : p \times p
$$

where $X_c = X - 1_n \bar{x}'$ is the centered data matrix.

- $X_c' X_c$ is the p × p matrix of scalar products of the centered variables
- Diagonal of S: sample variances $s_{11}, \dots, s_{pp}$
- The matrix S is symmetric: $s_{jk} = s_{kj}$ for all j, k, so that $S' = S$

n <- nrow(X) # number of observations
S <- ((n-1)/n) * cov(X) # function cov(): division by n-1 rather than by n
round(S, 2) # rounding to two decimal places

bread veget fruit meat poult milk wine


bread 10523.89 11020.58 3180.42 12487.75 6079.06 9842.58 2141.61
veget 11020.58 32806.67 24514.00 60467.75 35780.92 13462.50 -4437.33
fruit 3180.42 24514.00 24984.17 57463.75 34955.08 5888.17 -5282.50
meat 12487.75 60467.75 57463.75 143566.85 88884.37 15916.48 -11385.77
poult 6079.06 35780.92 34955.08 88884.37 57090.47 6240.21 -6571.01
milk 9842.58 13462.50 5888.17 15916.48 6240.21 12575.52 53.02
wine 2141.61 -4437.33 -5282.50 -11385.77 -6571.01 53.02 4723.24
round((1/n) * t(as.matrix(Xc)) %*% as.matrix(Xc), 2) # equal to S

bread veget fruit meat poult milk wine


bread 10523.89 11020.58 3180.42 12487.75 6079.06 9842.58 2141.61
veget 11020.58 32806.67 24514.00 60467.75 35780.92 13462.50 -4437.33
fruit 3180.42 24514.00 24984.17 57463.75 34955.08 5888.17 -5282.50
meat 12487.75 60467.75 57463.75 143566.85 88884.38 15916.48 -11385.77
poult 6079.06 35780.92 34955.08 88884.38 57090.47 6240.21 -6571.01
milk 9842.58 13462.50 5888.17 15916.48 6240.21 12575.52 53.02
wine 2141.61 -4437.33 -5282.50 -11385.77 -6571.01 53.02 4723.24

Standardization
Variables may be expressed in different units and are therefore difficult to compare.
Standardization or scaling: divide each (centered) variable by its standard deviation:

$$
X_{cs} = \left( \frac{x_{ij} - \bar{x}_j}{s_j} \right)_{\substack{i=1,\dots,n \\ j=1,\dots,p}}
$$
# centered and scaled variables
Xcs <- sweep(Xc, MARGIN = 2, STATS = sqrt(diag(S)), FUN = "/")
round(Xcs, 3)

bread veget fruit meat poult milk wine


w2 -1.118 -1.678 -0.955 -1.187 -1.160 -0.992 0.850
e2 -1.498 -0.955 -0.740 -0.949 -0.988 -1.063 -1.609
m2 -0.728 0.193 0.361 0.162 0.518 -1.099 0.937
w3 -0.396 -0.933 -1.038 -1.002 -1.085 -0.305 0.559
e3 -0.591 -0.685 -0.690 -1.018 -1.026 -0.350 -0.081
m3 -0.084 0.613 1.164 1.209 1.443 -1.028 -0.401
w4 0.851 -0.398 -0.873 -0.704 -0.691 0.497 0.559
e4 0.130 -0.182 -0.133 -0.081 -0.172 0.372 0.690
m4 -0.601 0.315 0.734 1.265 1.447 -0.484 -1.260
w5 2.031 0.243 -0.519 -0.102 -0.185 1.219 1.708
e5 1.339 1.452 0.272 0.447 0.376 1.425 -0.721
m5 0.666 2.015 2.417 1.962 1.523 1.808 -1.231
The variables in $X_{cs}$ have mean zero and unit variance.
round(colMeans(Xcs), 10) # zero means (up to rounding errors)

bread veget fruit meat poult milk wine


0 0 0 0 0 0 0
round((1/n) * t(as.matrix(Xcs)) %*% as.matrix(Xcs), 3) # unit variances

bread veget fruit meat poult milk wine


bread 1.000 0.593 0.196 0.321 0.248 0.856 0.304
veget 0.593 1.000 0.856 0.881 0.827 0.663 -0.356
fruit 0.196 0.856 1.000 0.959 0.926 0.332 -0.486
meat 0.321 0.881 0.959 1.000 0.982 0.375 -0.437
poult 0.248 0.827 0.926 0.982 1.000 0.233 -0.400
milk 0.856 0.663 0.332 0.375 0.233 1.000 0.007
wine 0.304 -0.356 -0.486 -0.437 -0.400 0.007 1.000
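
The built-in scale() function also standardizes, but it divides by the standard deviation computed with n − 1. The sketch below, on a hand-typed subset of the food data (first four rows, three variables), suggests how the two conventions relate; the names s, Xcs_alt and the rescaling factor are worked out here, not taken from the slides.

```r
# Subset of the food data shown earlier: first four rows, three variables
X <- as.matrix(data.frame(
  bread = c(332, 293, 372, 406),
  veget = c(428, 559, 767, 563),
  fruit = c(354, 388, 562, 341),
  row.names = c("w2", "e2", "m2", "w3")
))
n <- nrow(X)

# Standardization as on the slides: standard deviations with division by n
Xc  <- sweep(X, MARGIN = 2, STATS = colMeans(X), FUN = "-")
s   <- sqrt(colMeans(Xc^2))
Xcs <- sweep(Xc, MARGIN = 2, STATS = s, FUN = "/")

# scale() divides by sd(), which uses n - 1; rescale to match
Xcs_alt <- scale(X) * sqrt(n / (n - 1))

all.equal(Xcs, Xcs_alt, check.attributes = FALSE)
colMeans(Xcs^2)   # variances with division by n: all equal to 1
```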

Correlation
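Correlation between variables j and k:

$$
\rho_{jk} = \frac{s_{jk}}{s_j s_k}
= \frac{1}{n} \sum_{i=1}^{n} \frac{x_{ij} - \bar{x}_j}{s_j} \, \frac{x_{ik} - \bar{x}_k}{s_k}
$$

- Covariance between standardized variables
- $-1 \le \rho_{jk} \le 1$
- Correlation matrix: $R = (\rho_{jk})_{\substack{j=1,\dots,p \\ k=1,\dots,p}} : p \times p$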
round(cor(X), 3)

bread veget fruit meat poult milk wine


bread 1.000 0.593 0.196 0.321 0.248 0.856 0.304
veget 0.593 1.000 0.856 0.881 0.827 0.663 -0.356
fruit 0.196 0.856 1.000 0.959 0.926 0.332 -0.486
meat 0.321 0.881 0.959 1.000 0.982 0.375 -0.437
poult 0.248 0.827 0.926 0.982 1.000 0.233 -0.400
milk 0.856 0.663 0.332 0.375 0.233 1.000 0.007
wine 0.304 -0.356 -0.486 -0.437 -0.400 0.007 1.000
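
In matrix form, the definition of $\rho_{jk}$ amounts to $R = D^{-1} S D^{-1}$ with D the diagonal matrix of standard deviations; this formulation is an addition to the slides. Because the factor 1/n (or 1/(n − 1)) cancels in the ratio, cor() needs no correction. A minimal sketch on a hand-typed subset of the food data (first four rows, three variables), with hypothetical names D_inv and R_manual:

```r
# Subset of the food data shown earlier: first four rows, three variables
X <- as.matrix(data.frame(
  bread = c(332, 293, 372, 406),
  veget = c(428, 559, 767, 563),
  fruit = c(354, 388, 562, 341),
  row.names = c("w2", "e2", "m2", "w3")
))
n <- nrow(X)
S <- (n - 1) / n * cov(X)                 # covariance matrix, division by n

# R = D^{-1} S D^{-1}, with D = diag(s_1, ..., s_p)
D_inv <- diag(1 / sqrt(diag(S)))
R_manual <- D_inv %*% S %*% D_inv

all.equal(R_manual, cor(X), check.attributes = FALSE)

# cov2cor() does the same rescaling; the divisor n or n - 1 does not matter
all.equal(cov2cor(S), cor(X))
all.equal(cov2cor(cov(X)), cor(X))
```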

Correlations and angles

Correlation and scalar product of centered variables j and k:

$$
\rho_{jk} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}
{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \, \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}}
= \cos \theta_{jk}
$$

with $\theta_{jk}$ the angle between the centered variables, seen as vectors in $\mathbb{R}^n$.

ρ_jk   θ_jk   centered variables
 1     0      parallel
 0     π/2    orthogonal
−1     π      anti-parallel
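
The cosine formula can be checked numerically. The sketch below uses the bread and milk columns of the food data as printed earlier, typed in by hand; the names a, b, cos_theta and theta are hypothetical.

```r
# bread and milk columns of the food data shown earlier
bread <- c(332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515)
milk  <- c(247, 239, 235, 324, 319, 243, 414, 400, 304, 495, 518, 561)

# Centered variables, seen as vectors in R^n
a <- bread - mean(bread)
b <- milk - mean(milk)

# Cosine of the angle: scalar product divided by the product of the norms
cos_theta <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))
all.equal(cos_theta, cor(bread, milk))

theta <- acos(cos_theta)   # angle in radians
theta * 180 / pi           # angle in degrees
```

The value of cos_theta should agree with the bread/milk entry of the correlation matrix shown earlier.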

Linear transformations

Linear transformation of a data matrix

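Let $X : n \times p$ be a data matrix and $A : p \times \ell$ a matrix of constants.

New data matrix $Y : n \times \ell$:

$$
\underbrace{Y}_{n \times \ell} = \underbrace{X}_{n \times p} \, \underbrace{A}_{p \times \ell}
= \begin{pmatrix} y_1' \\ \vdots \\ y_n' \end{pmatrix}
= \begin{pmatrix} y^1 & \cdots & y^\ell \end{pmatrix}
$$

$$
\begin{aligned}
y_{ik} &= \sum_{j=1}^{p} x_{ij} a_{jk} \\
y^k &= \sum_{j=1}^{p} x^j a_{jk} && \text{k-th variable} \\
y_i' &= x_i' A && \text{i-th observation}
\end{aligned}
$$

- n observations on ℓ new variables $y^1, \dots, y^\ell$
- Each new variable is a linear combination of the p original variables $x^1, \dots, x^p$.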

Creating ℓ = 2 new variables:
- vegan = sum of veget and fruit
- animal = sum of meat and poult
names(X)

[1] "bread" "veget" "fruit" "meat" "poult" "milk" "wine"


A <- cbind(c(0, 1, 1, 0, 0, 0, 0), c(0, 0, 0, 1, 1, 0, 0))
A

[,1] [,2]
[1,] 0 0
[2,] 1 0
[3,] 1 0
[4,] 0 1
[5,] 0 1
[6,] 0 0
[7,] 0 0

Y <- as.data.frame(as.matrix(X) %*% A)
names(Y) <- c("vegan", "animal")
Y

vegan animal
w2 782 1963
e2 947 2094
m2 1329 2875
w3 904 2051
e3 1004 2059
m3 1532 3493
w4 1027 2258
e4 1183 2618
m4 1410 3515
w5 1199 2607
e5 1543 2949
m5 1984 3797
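
The column and row views of Y = XA can be verified on a few rows. This is a sketch on a hand-typed subset of the food data (first four rows, the four variables involved), so the matrix A here is 4 × 2 rather than 7 × 2.

```r
# Subset of the food data shown earlier: first four rows, four variables
X <- as.matrix(data.frame(
  veget = c(428, 559, 767, 563),
  fruit = c(354, 388, 562, 341),
  meat  = c(1437, 1527, 1948, 1507),
  poult = c(526, 567, 927, 544),
  row.names = c("w2", "e2", "m2", "w3")
))
A <- cbind(vegan = c(1, 1, 0, 0), animal = c(0, 0, 1, 1))
Y <- X %*% A

# k-th variable: linear combination of the columns of X
all.equal(Y[, "vegan"],  X[, "veget"] + X[, "fruit"])
all.equal(Y[, "animal"], X[, "meat"]  + X[, "poult"])

# i-th observation: y_i' = x_i' A
all.equal(drop(X["w2", , drop = FALSE] %*% A), Y["w2", ])
Y
```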

Note: the tidyverse R packages offer a unified way of manipulating data matrices.

Transformed mean vector

Mean vector of the transformed data matrix:

$$
\bar{y}' = \bar{x}' A
$$

colMeans(Y) # mean vector of Y

vegan animal
1237.000 2689.917
t(colMeans(X)) %*% A # idem

[,1] [,2]
[1,] 1237 2689.917
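
The rule for the transformed mean vector follows in one line once the mean vector is written as a matrix product. The derivation below is a sketch that assumes the identity $\bar{x}' = \frac{1}{n} 1_n' X$, which is not written out on the slides.

```latex
\bar{y}' = \frac{1}{n} 1_n' Y
         = \frac{1}{n} 1_n' (X A)
         = \Bigl( \frac{1}{n} 1_n' X \Bigr) A
         = \bar{x}' A
```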

Transformed covariance matrix

Covariance matrix of the transformed data matrix:

$$
S_Y = A' S_X A : \ell \times \ell
$$

n <- nrow(Y) # number of observations


((n-1)/n) * cov(Y) # covariance matrix of Y

vegan animal
vegan 106818.8 188667.5
animal 188667.5 378426.1
((n-1)/n) * t(A) %*% cov(X) %*% A # idem

[,1] [,2]
[1,] 106818.8 188667.5
[2,] 188667.5 378426.1
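
A short derivation of the covariance rule, as a sketch: centering commutes with the linear transformation, and the rest follows from $S = \frac{1}{n} X_c' X_c$. The symbol $Y_c$ for the centered version of Y is introduced here for convenience.

```latex
Y_c = Y - 1_n \bar{y}' = X A - 1_n \bar{x}' A = (X - 1_n \bar{x}') A = X_c A

S_Y = \frac{1}{n} Y_c' Y_c
    = \frac{1}{n} A' X_c' X_c A
    = A' \Bigl( \frac{1}{n} X_c' X_c \Bigr) A
    = A' S_X A
```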

Cross-covariance matrix

Covariances between the p old and the ℓ new variables:

$$
S_{XY} = S_X A : p \times \ell
$$

((n-1)/n) * cov(X) %*% A # 7-by-2 cross-covariance matrix

[,1] [,2]
bread 14201.000 18566.81
veget 57320.667 96248.67
fruit 49498.167 92418.83
meat 117931.500 232451.23
poult 70736.000 145974.85
milk 19350.667 22156.69
wine -9719.833 -17956.78
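
The cross-covariance matrix can also be obtained directly, since cov() accepts two arguments and then returns the covariances between the columns of the first and the columns of the second. A minimal sketch on a hand-typed subset of the food data (first four rows, four variables), with hypothetical names S_XY_formula and S_XY_direct:

```r
# Subset of the food data shown earlier: first four rows, four variables
X <- as.matrix(data.frame(
  veget = c(428, 559, 767, 563),
  fruit = c(354, 388, 562, 341),
  meat  = c(1437, 1527, 1948, 1507),
  poult = c(526, 567, 927, 544),
  row.names = c("w2", "e2", "m2", "w3")
))
A <- cbind(vegan = c(1, 1, 0, 0), animal = c(0, 0, 1, 1))
Y <- X %*% A
n <- nrow(X)

S_X <- (n - 1) / n * cov(X)

S_XY_formula <- S_X %*% A                 # S_XY = S_X A
S_XY_direct  <- (n - 1) / n * cov(X, Y)   # covariances between columns of X and Y

all.equal(S_XY_formula, S_XY_direct, check.attributes = FALSE)
```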

Joint covariance matrix

Joining the p old and ℓ new variables:

$$
Z = \begin{pmatrix} X & Y \end{pmatrix} : n \times (p + \ell)
$$

Joint covariance matrix: (p + ℓ) × (p + ℓ) block matrix

$$
S_Z = \begin{pmatrix} S_X & S_{XY} \\ S_{XY}' & S_Y \end{pmatrix}
$$

Z <- cbind(X, Y)
Z

bread veget fruit meat poult milk wine vegan animal


w2 332 428 354 1437 526 247 427 782 1963
e2 293 559 388 1527 567 239 258 947 2094
m2 372 767 562 1948 927 235 433 1329 2875
w3 406 563 341 1507 544 324 407 904 2051
e3 386 608 396 1501 558 319 363 1004 2059
m3 438 843 689 2345 1148 243 341 1532 3493
w4 534 660 367 1620 638 414 407 1027 2258
e4 460 699 484 1856 762 400 416 1183 2618
m4 385 789 621 2366 1149 304 282 1410 3515
w5 655 776 423 1848 759 495 486 1199 2607
e5 584 995 548 2056 893 518 319 1543 2949
m5 515 1097 887 2630 1167 561 284 1984 3797

round(((n-1)/n) * cov(Z))

bread veget fruit meat poult milk wine vegan animal


bread 10524 11021 3180 12488 6079 9843 2142 14201 18567
veget 11021 32807 24514 60468 35781 13462 -4437 57321 96249
fruit 3180 24514 24984 57464 34955 5888 -5282 49498 92419
meat 12488 60468 57464 143567 88884 15916 -11386 117932 232451
poult 6079 35781 34955 88884 57090 6240 -6571 70736 145975
milk 9843 13462 5888 15916 6240 12576 53 19351 22157
wine 2142 -4437 -5282 -11386 -6571 53 4723 -9720 -17957
vegan 14201 57321 49498 117932 70736 19351 -9720 106819 188668
animal 18567 96249 92419 232451 145975 22157 -17957 188668 378426
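
The block structure of the joint covariance matrix can be checked by extracting its four blocks. This is a sketch on a hand-typed subset of the food data (first four rows, four variables); the index vectors old and new are hypothetical helper names.

```r
# Subset of the food data shown earlier: first four rows, four variables
X <- as.matrix(data.frame(
  veget = c(428, 559, 767, 563),
  fruit = c(354, 388, 562, 341),
  meat  = c(1437, 1527, 1948, 1507),
  poult = c(526, 567, 927, 544),
  row.names = c("w2", "e2", "m2", "w3")
))
A <- cbind(vegan = c(1, 1, 0, 0), animal = c(0, 0, 1, 1))
Y <- X %*% A
Z <- cbind(X, Y)

n <- nrow(X); p <- ncol(X); l <- ncol(Y)
S_X <- (n - 1) / n * cov(X)
S_Z <- (n - 1) / n * cov(Z)

old <- 1:p                  # positions of the original variables in Z
new <- (p + 1):(p + l)      # positions of the new variables in Z

all.equal(S_Z[old, old], S_X)                                          # S_X
all.equal(S_Z[old, new], S_X %*% A, check.attributes = FALSE)          # S_XY
all.equal(S_Z[new, old], t(S_X %*% A), check.attributes = FALSE)       # S_XY'
all.equal(S_Z[new, new], t(A) %*% S_X %*% A, check.attributes = FALSE) # S_Y
```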
