You are on page 1of 40

# Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Multivariate Data Analysis

Special focus on Clustering and Multiway Methods Franois Husson & Julie Josse
Applied mathematics department, Agrocampus Rennes

## useR! 2010, July 20, 2010

1 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Why a tutorial on Multivariate Data Analysis?

Our research focus is principal component methods We teach multivariate data analysis We have developed R packages:

## FactoMineR to perform principal component methods

PCA, correspondence analysis (CA), multiple correspondence analysis (MCA), multiple factor analysis (MFA) complementarity between clustering and principal component methods

## missMDA to handle missing values in and with multivariate

data analysis

perform principal component methods (PCA, MCA) with missing values simple and multiple imputation based on principal component models for continuous and categorical data
2 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Outline

Multivariate data analysis with a special focus on clustering and multiway methods
1 2 3

Principal Component Analysis (PCA) Multiple Factor Analysis (MFA) Complementarity between Clustering and Principal Component methods

## Multidimensional descriptive methods Graphical representations

3 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

1 2 3 4

## Data - Issues - Preprocessing Individuals Study Variables Study Helps to Interpret

4 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Principal Component Analysis

Dimensionality reduction describes the dataset with a smaller number of variables Technique widely used for applications such as: data compression, data reconstruction, preprocessing before clustering, and ... Descriptive methods

5 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## PCA deals with which kind of data?

PCA deals with continuous variables, but categorical variables can also be included in the analysis Many examples: Sensory analysis: products - descriptors Ecology: plants - measurements; waters - physico-chemical analyses Economy: countries - economic indicators Microbiology: cheeses - microbiological analyses etc.

## Figure: Data table in PCA

6 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Wine data

## 10 individuals (rows): white wines from Val de Loire 30 variables (columns):

27 continuous variables: sensory descriptors 2 continuous variables: odour and overall preferences 1 categorical variable: label of the wines (Vouvray - Sauvignon)
Aroma.persistency Overall.preference Aroma.intensity Visual.intensity Odor.preferene Astringency Sweetness O.passion Bitterness

O.citrus

O.fruity

Acidity

S S S S S V V V V V

Michaud Renaudie Trotignon Buisse Domaine Buisse Cristal Aub Silex Aub Marigny Font Domaine Font Brls Font Coteaux

4.3 4.4 5.1 4.3 5.6 3.9 2.1 5.1 5.1 4.1

2.4 3.1 4.0 2.4 3.1 0.7 0.7 0.5 0.8 0.9

5.7 5.3 5.3 3.6 3.5 3.3 1.0 2.5 3.8 2.7

3.5 3.3 3.0 3.9 3.4 7.9 3.5 3.0 3.9 3.8

5.9 6.8 6.1 5.6 6.6 4.4 6.4 5.7 5.4 5.1

4.1 3.8 4.1 2.5 5.0 3.0 5.0 4.0 4.0 4.3

1.4 2.3 2.4 3.0 3.1 2.4 4.0 2.5 3.1 4.3

7.1 7.2 6.1 4.9 6.1 5.9 6.3 6.7 7.0 7.3

6.7 6.6 6.1 5.1 5.1 5.6 6.7 6.3 6.1 6.6

5.0 3.4 3.0 4.1 3.6 4.0 6.0 6.4 7.4 6.3

6.0 5.4 5.0 5.3 6.1 5.0 5.1 4.4 4.4 6.0

5.0 5.5 5.5 4.6 5.0 5.5 4.1 5.1 6.4 5.7

Sauvignon Sauvignon Sauvignon Sauvignon Sauvignon Vouvray Vouvray Vouvray Vouvray Vouvray
7 / 40

Label

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Problems - objectives

Individuals study: similarity between individuals with respect to all the variables partition between individuals Variables study: linear relationships between variables visualization of the correlation matrix (denoted S ); nd synthetic variables Link between the two studies: characterization of the groups of individuals by the variables; specic individuals to better understand links between variables

8 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Two clouds of points

Individuals study

Variables study

K 1

I RK

I RI

var 1 ind i

ind 1

var k

9 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Preprocessing

d
2

K
(i , i ) =

k =1

(xik xi k )2

K
d
2

(i , i ) =

k =1

## Standardizing variables or not?

K
d
2

(i , i ) =

1
s
2

k =1 k

(xik xi k )2
10 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Individuals cloud

Study the structure, i.e. the shape of the cloud of individuals Individuals are in RK
11 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Figure: Camel vs dromedary?

Closest representation by projection Best representation of the diversity, variability
12 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

xi.
min max
P

Fi1

u1

## 1 u1 (xi . ) = u1 (u1 u1 ) u1 xi . = < xi . , u1 > u1 Fi 1 = < xi . , u1 >

Minimize the distance between individuals and their projections Maximize the variance of the projected data
u1

## = arg max(var (F.1 )) = arg max(var (Xu1 )) with u1 RK u1 RK

1 1 1 1

u1 u1

=1

u rst eigenvector of the correlation matrix associated with the largest eigenvalue : Su = u
1

Var F.1

## ) = var (Xu1 ) = 1/I

u1 X Xu1

= u1 Su1 = 1 u1 u1 = 1
13 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Fit the individuals cloud

Additional axes are sequentially dened: each new direction maximizes the projected variance among all orthogonal directions Q eigenvectors u ,...,uQ associated to ,...,Q
1 1

## Total variance of the initial individuals cloud (total inertia): 1

I

K
x

i. g

= tr (S ) =

k =1
1 2

k (= K )

Variance of the projected individuals cloud (Q-dimensional representation): var (F ) + var (F ) + ... + var (FQ )
Q k =1 k K k =1 k
14 / 40

## Percentage of variance explained:

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Example: wine data

Sensory descriptors are used as active variables: only these variables are used to construct the axes Variables are (centred and) standardized
Aroma.persistency Overall.preference Aroma.intensity Visual.intensity Odor.preferene

Astringency

Sweetness

O.passion

Bitterness

O.citrus

O.fruity

Acidity

S S S S S V V V V V

Michaud Renaudie Trotignon Buisse Domaine Buisse Cristal Aub Silex Aub Marigny Font Domaine Font Brls Font Coteaux

4.3 4.4 5.1 4.3 5.6 3.9 2.1 5.1 5.1 4.1

2.4 3.1 4.0 2.4 3.1 0.7 0.7 0.5 0.8 0.9

5.7 5.3 5.3 3.6 3.5 3.3 1.0 2.5 3.8 2.7

3.5 3.3 3.0 3.9 3.4 7.9 3.5 3.0 3.9 3.8

5.9 6.8 6.1 5.6 6.6 4.4 6.4 5.7 5.4 5.1

4.1 3.8 4.1 2.5 5.0 3.0 5.0 4.0 4.0 4.3

1.4 2.3 2.4 3.0 3.1 2.4 4.0 2.5 3.1 4.3

7.1 7.2 6.1 4.9 6.1 5.9 6.3 6.7 7.0 7.3

6.7 6.6 6.1 5.1 5.1 5.6 6.7 6.3 6.1 6.6

5.0 3.4 3.0 4.1 3.6 4.0 6.0 6.4 7.4 6.3

6.0 5.4 5.0 5.3 6.1 5.0 5.1 4.4 4.4 6.0

5.0 5.5 5.5 4.6 5.0 5.5 4.1 5.1 6.4 5.7

Sauvignon Sauvignon Sauvignon Sauvignon Sauvignon Vouvray Vouvray Vouvray Vouvray Vouvray

Label

15 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

2

V Font Brls

Dim 2 (25.14%)

S Buisse Cristal
-2

V Font Domaine

S Buisse Domaine
-4

V Aub Silex
-6 -6

-4

-2

0 Dim 1 (43.48%)

## Need variables to interpret the dimensions of variability

16 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Individuals coordinates considered as variables

u2

1 1
V Aub Marigny V Font Coteaux

K F.1 F.2

## S Renaudie S Michaud S Trotignon

Fi1
S Buisse Cristal

V Font Br ls

u1

V Font Domaine

xik

Fi1 Fi2

S Buisse Domaine

Fi2

V Aub Silex

17 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Correlation between variable x.k and F. (and F. )

1 2

r (F.2 , x.k )

x.k
O.Vanilla

-1

r(F.1 , x.k )

-1

Correlation circle
18 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Interpretation of the individuals graph with the variables

Odor.Intensity.before.shaking Expression Odor.Intensity.after.shaking Attack.intensity Aroma.persistency Bitterness Aroma.intensity Acidity O.wooded O.vanilla O.alcohol Astringency Freshness Visual.intensity O.passion Grade O.citrus Surface.feeling O.flower O.mushroom Smoothness O.plante O.candied.fruit Oxidation O.fruity Typicity

Dim 2 (25.14%)

-0.5

0.0

0.5

1.0

Sweetness
-1.0

-1.0

-0.5

## 0.0 Dim 1 (43.48%)

0.5

1.0

1.5

19 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Cloud of variables

x ik

cos

(kl ) =

## < x.k , x.l >

x.

x.

= (

I x x i =1 ik il = r (x.k , x.l ) I x 2 )( I x 2 ) i =1 ik i =1 il

20 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

1

v1

## = 1) which best ts the cloud

1 v1 (x.k ) = v1 (v1 v1 ) v1 x.k Gk 1 = 1/I < v1 , x .k > < v1 , x .k > Gk 1 = 1/I v1 x.k

x.k
P

Gk1 v1

arg max

v1
1

v1 RI i =k

## 2 r (v1 , x.k ) k 1 = arg max v1 RI i =k 2

is the best synthetic variable v , ..., vQ are the eigenvectors of W = XX the inner product matrix associated with the largest eigenvalues: Wvq = q vq
21 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Fit the variables cloud

Odor.Intensity.before.shaking Expression Odor.Intensity.after.shaking Attack.intensity Aroma.persistency Bitterness Aroma.intensity Acidity O.wooded O.vanilla O.alcohol Astringency Freshness Visual.intensity O.passion Grade O.citrus Surface.feeling O.flower O.mushroom Smoothness O.plante O.candied.fruit Oxidation O.fruity Typicity

Dim 2 (25.14%)

-0.5

0.0

0.5

1.0

Sweetness
-1.0

-1.0

-0.5

0.5

1.0

1.5

## Same representation! What a wonderful result!

22 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Projections...
r A B cos

( , ) = cos (A,B ) (A,B ) cos (HA ,HB ) if variables are well projected
A HA HB HD HE E D HC HA HB HD HE HC

## Only well projected variables can be interpreted!

23 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Link between the two representations: transition formulae

v2 u2 Fi1 Fi2
1 1 k
i k Gk2

u1
Gk1

v1

K F.1 F.2

1 1

K v 1 v2

xik

Fi1 Fi2

xik

I u1 u2

G.1 G.2

Gk1 Gk2

24 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Su

=X

Xu

= u
W Xu Wv

XX Xu WF

= X u

) = (Xu )
F

= F and since

= v then

## Since, ||F || = and ||v || = 1 we have:

v u

= =

1 F 1 G

G F

1 =X v = X F 1 = Xu = XG

iq =

K
x

q k =1

ik Gkq

kq =

I
x

q i =1

ik Fiq

F.

G.

## q : principal components, scores q : correlations between variables and principal components

25 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Link between the two representations: transition formulae

iq =

1
q

ik Gkq

kq =

1
q

ik Fiq

What does it mean? An individual is at the same side as the variables for which it takes high values
Odor.Intensity.before.shaking Expression Odor.Intensity.after.shaking Attack.intensity Aroma.persistency Bitterness Aroma.intensity Acidity O.wooded O.vanilla O.alcohol Astringency Freshness Visual.intensity O.passion Grade O.citrus Surface.feeling O.flower O.mushroom Smoothness O.plante O.candied.fruit Oxidation O.fruity Typicity
-0.5
-4

2

Dim 2 (25.14%)

V Font Brls

Dim 2 (25.14%)

S Buisse Cristal
-2

V Font Domaine

S Buisse Domaine

V Aub Silex
-6

0.0

0.5

V Font Coteaux

1.0

Sweetness
-1.0

-6

-4

-2

0 Dim 1 (43.48%)

-1.0

-0.5

## 0.0 Dim 1 (43.48%)

0.5

1.0

1.5

26 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Supplementary information

For the continuous variables: projection of supplementary variables on the dimensions For the individuals: projection For the categories: projection at the barycentre of the individuals who take the categories
Sauvignon Vouvray
S Renaudie S Michaud S Trotignon Sauvignon
2

0.5

## V Font Brls Vouvray V Font Domaine

Dim 2 (25.14%)

Dim 2 (25.14%)

-2

S Buisse Domaine
-4

0.0

S Buisse Cristal

Odor.Intensity.before.shaking Expression Odor.Intensity.after.shaking Attack.intensity Aroma.persistency Bitterness Aroma.intensity Acidity O.wooded O.vanilla O.alcohol Astringency Freshness Odor.preferene Visual.intensity O.passion Grade O.citrus Surface.feeling O.flower O.mushroom Smoothness O.plante O.candied.fruit Overall.preference Oxidation O.fruity Typicity

V Aub Silex
-6

-0.5

1.0

Sweetness
-1.0
-6 -4 -2 0 Dim 1 (43.48%) 2 4

-1.0

-0.5

0.5

1.0

1.5

## Supplementary information do not create the dimensions

27 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Choosing the number of components

Eigenvalues

Bar plot, test on eigenvalues, condence interval, cross-validation (functions estim_ncpPCA and estim_ncp), etc.

0.0

0.5

1.0

1.5

2.0

2.5

3.0

10

x.1

x.k

x.K

F1

FQ

FK

## PCA Data Structure Noise

28 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Is there a structure on my data?

Number of variables nbind 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 25 30 35 40 45 50 100 4 96.5 93.3 90.5 88.1 86.1 84.5 82.8 81.5 80.0 79.0 78.1 77.3 76.5 75.5 75.1 74.1 72.0 69.8 68.5 67.5 66.4 65.6 60.9 5 93.1 88.6 84.9 82.3 79.5 77.5 75.7 74.0 72.5 71.5 70.3 69.4 68.4 67.6 67.0 66.1 63.3 61.1 59.6 58.3 57.1 56.3 51.4 6 90.2 84.8 80.9 77.2 74.8 72.3 70.3 68.6 67.2 65.7 64.6 63.5 62.6 61.8 60.9 60.1 57.1 55.1 53.3 52.0 50.8 49.9 44.9 7 87.6 81.5 77.4 73.8 70.7 68.2 66.3 64.4 62.9 61.5 60.3 59.2 58.2 57.1 56.5 55.6 52.5 50.3 48.6 47.3 46.1 45.2 40.0 8 85.5 79.1 74.4 70.7 67.4 65.0 62.9 61.2 59.4 58.1 57.0 55.6 54.7 53.7 52.8 52.1 48.9 46.7 44.9 43.4 42.4 41.4 36.3 9 83.4 76.9 72.0 68.2 65.1 62.4 60.1 58.3 56.7 55.1 53.9 52.9 51.8 50.8 49.9 49.1 46.0 43.6 41.9 40.5 39.3 38.4 33.3 10 81.9 75.1 70.1 66.1 62.9 60.1 58.0 55.8 54.4 52.8 51.5 50.3 49.3 48.4 47.4 46.6 43.4 41.1 39.5 38.0 36.9 35.9 31.0 11 80.7 73.2 68.3 64.0 61.1 58.3 56.0 54.0 52.2 50.8 49.4 48.3 47.1 46.3 45.5 44.7 41.4 39.1 37.4 36.0 34.8 33.9 28.9 12 79.4 72.2 67.0 62.8 59.4 56.5 54.4 52.4 50.5 49.0 47.8 46.6 45.5 44.6 43.7 42.9 39.6 37.3 35.6 34.1 33.1 32.1 27.2 13 78.1 70.8 65.3 61.2 57.9 55.1 52.7 50.9 48.9 47.5 46.1 45.2 44.0 43.0 42.1 41.3 38.1 35.7 34.0 32.7 31.5 30.5 25.8 14 77.4 69.8 64.3 60.0 56.5 53.7 51.3 49.3 47.7 46.2 44.9 43.6 42.6 41.6 40.7 39.8 36.7 34.4 32.7 31.3 30.2 29.2 24.5 15 76.6 68.7 63.2 59.0 55.4 52.5 50.1 48.2 46.6 45.0 43.6 42.4 41.4 40.4 39.6 38.7 35.5 33.2 31.6 30.1 29.0 28.1 23.3 16 75.5 68.0 62.2 58.0 54.3 51.5 49.2 47.2 45.4 44.0 42.5 41.4 40.3 39.3 38.4 37.5 34.5 32.1 30.4 29.1 27.9 27.0 22.3

Table: 95 % quantile inertia on the two rst dimensions of 10000 PCA on data with independent variables

29 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Percentage of variance obtained under independence

Number of variables nbind 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 25 30 35 40 45 50 100 17 74.9 67.0 61.3 57.0 53.6 50.6 48.1 46.2 44.4 42.9 41.6 40.4 39.4 38.3 37.4 36.7 33.5 31.2 29.5 28.1 27.0 26.1 21.5 18 74.2 66.3 60.7 56.2 52.5 49.8 47.2 45.2 43.4 42.0 40.7 39.5 38.5 37.4 36.5 35.8 32.5 30.3 28.6 27.3 26.1 25.3 20.7 19 73.5 65.6 59.7 55.4 51.8 49.0 46.5 44.4 42.8 41.3 39.8 38.7 37.6 36.7 35.8 34.9 31.8 29.5 27.9 26.5 25.4 24.6 19.9 20 72.8 64.9 59.1 54.5 51.2 48.3 45.8 43.8 41.9 40.4 39.1 37.9 36.9 35.8 34.9 34.2 31.1 28.8 27.1 25.8 24.7 23.8 19.3 25 70.7 62.3 56.4 51.8 48.1 45.2 42.8 40.7 39.0 37.4 36.2 35.0 33.8 32.9 32.0 31.3 28.1 26.0 24.3 23.0 21.9 21.1 16.7 30 68.8 60.4 54.3 49.7 45.9 42.9 40.6 38.5 36.8 35.2 34.0 32.8 31.7 30.7 29.9 29.1 26.0 23.9 22.2 21.0 20.0 19.1 14.9 35 67.4 58.9 52.6 47.8 44.4 41.4 39.0 36.9 35.1 33.6 32.4 31.1 30.1 29.1 28.3 27.5 24.5 22.3 20.7 19.5 18.5 17.7 13.6 40 66.4 57.6 51.4 46.7 42.9 40.1 37.7 35.5 33.9 32.3 31.1 29.8 28.8 27.8 27.0 26.2 23.3 21.1 19.6 18.4 17.4 16.6 12.5 50 64.7 55.8 49.5 44.6 41.0 38.0 35.6 33.5 31.8 30.4 29.0 27.9 26.8 25.9 25.1 24.3 21.4 19.3 17.8 16.6 15.7 14.9 11.0 75 62.0 52.9 46.4 41.6 38.0 35.0 32.6 30.5 28.8 27.4 26.0 24.9 23.9 22.9 22.2 21.4 18.6 16.6 15.2 14.1 13.2 12.5 8.9 100 60.5 51.0 44.6 39.8 36.1 33.2 30.8 28.8 27.1 25.7 24.3 23.2 22.2 21.3 20.5 19.8 17.0 15.1 13.7 12.7 11.8 11.1 7.7 150 58.5 49.0 42.4 37.6 34.0 31.0 28.7 26.7 25.0 23.6 22.4 21.2 20.3 19.4 18.6 18.0 15.2 13.4 12.1 11.1 10.3 9.6 6.4 200 57.4 47.8 41.2 36.4 32.7 29.8 27.5 25.5 23.9 22.4 21.2 20.1 19.2 18.3 17.5 16.9 14.2 12.5 11.1 10.2 9.4 8.7 5.7

Table: 95 % quantile inertia on the two rst dimensions of 10000 PCA on data with independent variables
30 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Quality of the representation: cos

2
2

For the variables: only well projected variables (high cos between the variable and its projection) can be interpreted!
Dim.1 Dim.2 Odor.Intensity.before.shaking 0.01 0.94 Odor.Intensity.after.shaking 0.01 0.89 Expression 0.11 0.71 round(res.pca\$var\$cos2,2)

For the individuals: (same idea) distance between individuals can only be interpreted for well projected individuals
round(res.pca\$ind\$cos2,2) Dim.1 Dim.2 S Michaud 0.62 0.07 S Renaudie 0.73 0.15 S Trotignon 0.78 0.07

31 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Contribution

## for each individual:

Ctr

q (i ) =

Fiq Fiq I F 2 = q i =1 iq
2 2

## Individuals with a large coordinate contribute the most

round(res.pca\$ind\$contrib,2) Dim.1 Dim.2 S Michaud 15.49 3.10 S Renaudie 15.56 5.56 S Trotignon 15.46 2.43

## for each variable:

Ctr

2 Gkq r (x k ,vq )2 q (k ) = q = . q

## contribute the most

32 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Description of the dimensions

By the continuous variables: correlation between each variable and the principal component of rank q is calculated correlation coecients are sorted and signicant ones are given
> dimdesc(res.pca) \$Dim.1\$quanti corr p.value O.candied.fruit 0.93 9.5e-05 Grade 0.93 1.2e-04 Surface.feeling 0.89 5.5e-04 Typicity 0.86 1.4e-03 O.mushroom 0.84 2.3e-03 Visual.intensity 0.83 3.1e-03 ... ... ... O.plante -0.87 1.0e-03 O.flower -0.89 4.9e-04 O.passion -0.90 4.5e-04 Freshness -0.91 2.9e-04 \$Dim.2\$quanti corr p.value Odor.Intensity.before.shaking 0.97 3.1e-06 Odor.Intensity.after.shaking 0.95 3.6e-05 Attack.intensity 0.85 1.7e-03 Expression 0.84 2.2e-03 Aroma.persistency 0.75 1.3e-02 Bitterness 0.71 2.3e-02 Aroma.intensity 0.66 4.0e-02

Sweetness

-0.78 8.0e-03
33 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Description of the dimensions

By the categorical variables: Perform a one-way analysis of variance with the coordinates of the individuals (F.q ) explained by the categorical variable

a F-test by variable for each category, a Student's t -test to compare the average of the category with the general mean
p.value 7.30e-05 p.value 7.30e-05 7.30e-05

> dimdesc(res.pca) Dim.1\$quali R2 Label 0.874 Dim.1\$category Estimate Vouvray 3.203 Sauvignon -3.203

34 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

Practice with R
1 2 3 4 5 6

Choose active variables Scale or not the variables Perform PCA Choose the number of dimensions to interpret Simultaneously interpret the individuals and variables graphs Use indicators to enrich the interpretation

library(FactoMineR) Expert <- read.table("http://factominer.free.fr/useR2010/Expert_wine.csv", header=TRUE, sep=";",row.names=1) res.pca <- PCA(Expert,scale=T,quanti.sup=29:30,quali.sup=1) res.pca x11() barplot(res.pca\$eig[,1],main="Eigenvalues",names.arg=1:nrow(res.pca\$eig)) plot.PCA(res.pca,habillage=1) res.pca\$ind\$coord res.pca\$ind\$cos2 res.pca\$ind\$contrib plot.PCA(res.pca,axes=c(3,4),habillage=1) dimdesc(res.pca) write.infile(res.pca,file="my_FactoMineR_results.csv") #to export a list
35 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Practice with GUI

source("http://factominer.free.fr/install-facto.r")

36 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Obtain the principal components from observed data with an

EM-type algorithm

Impute missing values with PCA using imputePCA function (tuning parameter: number of components) Perform the usual PCA on the completed data set

library(missMDA) data(orange) nb.dim <- estim_ncpPCA(orange,ncp.max=5) res.comp <- imputePCA(orange,ncp=2) res.pca <- PCA(res.comp\$completeObs)

37 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## MCA: problems - objectives

Individuals study: similarity between individuals (for all the variables) partition between individuals Individuals are dierent if they don't take the same levels Variables study: nd some synthetic variables (continuous variables that sum up categorical variables); link between variables levels study Categories study:

two levels of dierent variables are similar if individuals that take these levels are the same (ex: 65 years and retired) two levels are similar if individuals taking these levels behave the same way, they take the same levels for the other variables (ex: 60 years and 65 years)

Link between these studies: characterization of the groups of individuals by the levels (ex: executive dynamic women)
38 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## Binary coding of the factors: a factor with

variable 1 1 variable j

## Kj levels Kj columns containing binary values, also called dummy variables

variable J

individuals

01000

0010

(i , i ) =

I J

Kj

1
I

j =1 k =1

(xik xi k )2
39 / 40

Data - Issues

Individuals Study

Variables Study

Helps to Interpret

## MCA: the superimposed representation

iq =

1
q k

ik G

kq

kq =

1
q i

ik F iq k

Individual i at the barycenter Level k at the barycenter of MCA factor map the individuals who take this level of its levels
tea shop

unpackaged p_upscale

Dim 2 (8.103%)

G G G

green dinner

G G G G G G G G G black G G G GG G G lemon tearoom G GG G G GG G G G G G G No.sugar Not.friends GG G Not.breakfast G G Not.resto Not.work G G G G chain store+tea shop GG Not.tea time G G GG G alone Not.lunch G G Not.pub GG tea bag+unpackaged always G Not.evening G other G G GG home G G G G G G Not.always G G Not.dinner G G evening G G Not.home G Not.tearoom G G G G G G friends G time G tea G G G G G G GG GG GG G GG G G G GG G G G pub G G G G G GGG G G sugar G G Gbreakfast G GG G G G p_variable GG GG G G G G G G G G G G GG G GG G G GG G GG p_cheap G G GG G G Earl GG GG GGGrey G G G G GG G G G G GG G G G G G GG G G G G G GG G work G GG G G G G G G G G G G G G G G GG G G G G G G milk G G G G bag G chain store G resto G GG G G G G G GG G lunch GG p_branded G G tea G G GG G G G G G G G G GG G G G G p_private label G G G G G G G G G G G G G G G G

G G

G G G G G G

p_unknown

40 / 40