PCA with missing data using the missMDA R package

François Husson

Applied Mathematics Department, Rennes Agrocampus

husson@agrocampus-ouest.fr

Using missMDA to deal with missing data

> library(missMDA)
> data(orange)

   Color.intensity Odor.intensity Attack.intensity Sweet Acid Bitter Pulp Typicity
1             4.79           5.29               NA    NA   NA   2.83   NA     5.21
2             4.58           6.04             4.42  5.46 4.13   3.54 4.62     4.46
3             4.71           5.33               NA    NA 4.29   3.17 6.25     5.17
4             6.58           6.00             7.42  4.17 6.75     NA 1.42     3.42
5               NA           6.17             5.33  4.08   NA   4.38 3.42     4.42
6             6.33           5.00             5.38  5.00 5.50   3.63 4.21     4.88
7             4.29           4.92             5.29  5.54 5.25     NA 1.29     4.33
8               NA           4.54             4.83    NA 4.96   2.92 1.54     3.96
9             4.42             NA             5.17  4.62 5.04   3.67 1.54     3.96
10            4.54           4.29               NA  5.79 4.38     NA   NA     5.00
11            4.08           5.13             3.92    NA   NA     NA 7.33     5.25
12            6.50           5.88             6.13  4.88 5.29   4.17 1.50     3.50
Some (bad) easy methods
• Delete individuals or variables with missing data: usually not a good idea
• Replace missing data with the mean (the default in several packages, including FactoMineR)

> library(FactoMineR)
> res.pca <- PCA(orange)

[Figure: individuals factor map (PCA) and variables factor map (PCA) for the mean-imputed orange data; Dim 1 (51.45%), Dim 2 (18.32%)]


[Figure: scatter plot of y against x]

Big distortion of links between variables
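The distortion can be reproduced with a small simulation. The sketch below uses simulated data (not the orange data) and base R only: it deletes 40% of the y values at random, replaces them with the mean of the observed y values, and compares the correlation before and after. The variable names, the noise level and the missingness rate are illustrative assumptions.

```r
set.seed(1)                      # reproducible simulated data
n <- 200
x <- rnorm(n)
y <- x + rnorm(n, sd = 0.3)      # x and y are strongly correlated

# Remove 80 of the 200 y values (40%) at random
y_miss <- y
y_miss[sample(n, size = 80)] <- NA

# Mean imputation: every missing y is replaced by the observed mean
y_mean <- ifelse(is.na(y_miss), mean(y_miss, na.rm = TRUE), y_miss)

cor_complete <- cor(x, y)        # correlation on the full data
cor_imputed  <- cor(x, y_mean)   # correlation after mean imputation

c(complete = cor_complete, mean_imputed = cor_imputed)
```

All imputed points sit on a horizontal line at the mean of y, which weakens the apparent link between x and y.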


Iterative PCA

Ideas:
• As x and y are strongly correlated, impute a missing y value using the x value
• If individuals i and j have similar values for all observed variables, impute the missing value of i using the value of j for that variable

⇒ takes into account the global similarity between individuals and the links between variables
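The first idea can be illustrated with a plain regression on the five-point table used in the iterative PCA example of these slides. This is only a sketch of the intuition (regression imputation with `lm`), not the iterative PCA algorithm itself; the object names are illustrative.

```r
# Toy table: x2 is missing for the fourth individual
toy <- data.frame(
  x1 = c(-2.0, -1.5, 0.0, 1.5, 2.0),
  x2 = c(-2.01, -1.48, -0.01, NA, 1.98)
)

# Fit x2 ~ x1 on the complete rows (lm drops rows with NA by default)
fit <- lm(x2 ~ x1, data = toy)

# Predict the missing x2 from the observed x1
x2_pred <- predict(fit, newdata = toy[is.na(toy$x2), ])
round(unname(x2_pred), 2)
```

Because x1 and x2 are almost perfectly correlated here, the prediction lands close to 1.48, far from what mean imputation would give.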

Iterative PCA

Incomplete data table:

  x1    x2
-2.0 -2.01
-1.5 -1.48
 0.0 -0.01
 1.5    NA
 2.0  1.98

[Figure: scatter plot of x2 against x1, updated at each step]

Initialize: impute the mean

  x1    x2
-2.0 -2.01
-1.5 -1.48
 0.0 -0.01
 1.5  0.00
 2.0  1.98

Do PCA on imputed table → axes and components; values fitted by the PCA:

   x1    x2
-1.98 -2.04
-1.44 -1.56
 0.15 -0.18
 1.00  0.57
 2.27  1.67

Missing data imputed using PCA: the missing x2 value is replaced by its fitted value, 0.57

New imputed data table:

  x1    x2
-2.0 -2.01
-1.5 -1.48
 0.0 -0.01
 1.5  0.57
 2.0  1.98

Do PCA on the new imputed table; values fitted by the PCA:

   x1    x2
-2.00 -2.01
-1.47 -1.52
 0.09 -0.11
 1.20  0.90
 2.18  1.78

New imputed data table:

  x1    x2
-2.0 -2.01
-1.5 -1.48
 0.0 -0.01
 1.5  0.90
 2.0  1.98

Repeat these steps until convergence; imputed data table at convergence:

  x1    x2
-2.0 -2.01
-1.5 -1.48
 0.0 -0.01
 1.5  1.48
 2.0  1.98

Do PCA on imputed data table
Iterative PCA

1. Initialization: impute using the mean
2. Step ℓ:
(a) do PCA on the imputed data table, retaining S dimensions
(b) impute the missing data using PCA
(c) update the means (and standard deviations)
3. Iterate the estimation and imputation steps
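The steps above can be sketched in base R. This is a minimal, unregularized illustration under stated assumptions, not the missMDA implementation: the function name `iterative_pca_impute` and its arguments are hypothetical, the PCA is computed with `svd` on the centred table, and the variables are not standardized.

```r
# Iterative PCA imputation: a minimal base-R sketch (no regularization).
# The function name and its arguments are illustrative, not part of missMDA.
iterative_pca_impute <- function(X, ncp = 1, tol = 1e-12, max_iter = 1000) {
  X <- as.matrix(X)
  miss <- is.na(X)

  # 1. Initialization: replace each NA by the mean of its column
  Xhat <- X
  Xhat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]

  for (iter in seq_len(max_iter)) {
    # (c) update the means from the current completed table
    mu <- colMeans(Xhat)
    Xc <- sweep(Xhat, 2, mu)

    # (a) PCA of the centred table via SVD, keeping ncp dimensions
    sv <- svd(Xc)
    k <- seq_len(ncp)
    fitted <- sv$u[, k, drop = FALSE] %*%
      diag(sv$d[k], nrow = ncp) %*%
      t(sv$v[, k, drop = FALSE])
    fitted <- sweep(fitted, 2, mu, "+")

    # (b) overwrite only the missing cells with their fitted values
    Xnew <- Xhat
    Xnew[miss] <- fitted[miss]

    # Stop when the imputed values no longer change
    converged <- sum((Xnew - Xhat)^2) < tol
    Xhat <- Xnew
    if (converged) break
  }
  list(completeObs = Xhat, iterations = iter)
}

# Five-point table from the worked example (x2 missing for individual 4)
toy <- cbind(x1 = c(-2.0, -1.5, 0.0, 1.5, 2.0),
             x2 = c(-2.01, -1.48, -0.01, NA, 1.98))
res <- iterative_pca_impute(toy, ncp = 1)
round(res$completeObs, 2)
```

On this table the imputed x2 should end up close to the 1.48 reached at convergence in the worked example, and the observed cells are never modified.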

Overfitting problem due to believing too much in the links between variables
⇒ regularized iterative PCA
Running missMDA in R
> library(missMDA)
> library(FactoMineR)                            ## Provides PCA()
> data(orange)
> nb <- estim_ncpPCA(orange, scale=TRUE)         ## Estimate the number of dimensions
> comp <- imputePCA(orange, ncp=2, scale=TRUE)   ## Impute the table
> res.pca <- PCA(comp$completeObs)               ## Do the PCA

> orange (last five columns)

Sweet Acid Bitter Pulp Typicity
   NA   NA   2.83   NA     5.21
 5.46 4.13   3.54 4.62     4.46
   NA 4.29   3.17 6.25     5.17
  ...
 4.88 5.29   4.17 1.50     3.50

> comp$completeObs (last five columns)

Sweet Acid Bitter Pulp Typicity
 5.54 4.13   2.83 5.89     5.21
 5.46 4.13   3.54 4.62     4.46
 5.45 4.29   3.17 6.25     5.17
  ...
 4.88 5.29   4.17 1.50     3.50
[Figure: individuals factor map (PCA) and variables factor map (PCA) for the imputed orange data; Dim 1 (71.34%), Dim 2 (17.16%)]
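As a variation on the commands of this slide, the number of dimensions estimated by `estim_ncpPCA` can be passed on instead of hard-coding `ncp=2`. This sketch assumes the missMDA package is installed; it relies on the `ncp` component returned by `estim_ncpPCA` and the `completeObs` component returned by `imputePCA`, and the final checks are only illustrative.

```r
library(missMDA)
data(orange)

# Estimate the number of dimensions by cross-validation
nb <- estim_ncpPCA(orange, scale = TRUE)
nb$ncp                                   # retained number of dimensions

# Impute with the estimated number of dimensions
comp <- imputePCA(orange, ncp = nb$ncp, scale = TRUE)
completed <- as.matrix(comp$completeObs)

# The completed table has no NA and leaves the observed cells untouched
observed <- !is.na(as.matrix(orange))
anyNA(completed)
all.equal(unname(completed[observed]), unname(as.matrix(orange)[observed]))
```

Only the cells that were `NA` in `orange` differ between the two tables, so the completed table can be handed to `PCA()` as before.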


Is running the imputation algorithm once sufficient?

[Figure: scatter plot of y against x]

⇒ Reinforces links between variables


Visualizing uncertainty due to missing data

What confidence can we give to the results? How can we get an idea of the variance?

⇒ A single value cannot show the variability in the predicted value $(\hat{F}\hat{U}')_{ik}$

⇒ Multiple imputation: generate several plausible values for each missing data point
Visualizing uncertainty due to missing data
> mi <- MIPCA(orange, scale = TRUE, ncp=2)
> plot(mi)

[Figure: output of plot(mi), with panels titled "Supplementary projection", "Variable representation", "Multiple imputation using Procrustes" and "Projection of the Principal Components"; Dim 1 (71.33%), Dim 2 (17.17%)]
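The imputed tables behind these plots can also be inspected directly. The sketch below assumes missMDA is installed and uses the `nboot` argument and the `res.MI` component of `MIPCA`; the choice of cell (individual 1, `Sweet`, which is missing in `orange`) and of 20 imputations is purely illustrative.

```r
library(missMDA)
data(orange)

# Generate 20 imputed tables instead of the default 100 to keep this quick
mi <- MIPCA(orange, scale = TRUE, ncp = 2, nboot = 20)

# Collect the value imputed for one missing cell across all imputed tables
sweet_1 <- sapply(mi$res.MI, function(tab) tab[1, "Sweet"])

length(sweet_1)   # one plausible value per imputed table
range(sweet_1)    # spread of the plausible values
sd(sweet_1)       # variability due to the missing value
```

The spread of these values is what a single imputed table cannot show. The results vary from run to run because the imputations are random.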
