PCA with Missing Data in R Using missMDA
missMDA R package
François Husson
husson@agrocampus-ouest.fr
Using missMDA to deal with missing data
> library(missMDA)
> data(orange)
Some (bad) easy methods
• Delete individuals or variables with missing data: usually not a good idea
• Replace missing data with the mean (the default in several packages, including FactoMineR)
> library(FactoMineR)      ## Provides PCA()
> res.pca <- PCA(orange)   ## Missing values are replaced by the variable means
[Figure: individuals factor map (PCA) and variables plot for the orange data with mean imputation; Dim 2 (18.32%)]
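To see why mean imputation is risky, here is a minimal sketch on simulated data. It does not use the orange table: the variables x and y, the noise level and the 30% missingness rate are assumptions chosen for illustration. It replaces the missing y values with the observed mean and compares correlation and spread before and after.

```r
# Simulated example (assumption: two strongly correlated variables)
set.seed(42)
n <- 200
x <- rnorm(n)
y <- x + rnorm(n, sd = 0.3)

# Remove 60 of the 200 y values at random
y_obs <- y
y_obs[sample(n, 60)] <- NA

# Mean imputation: every missing y gets the mean of the observed y
y_mean <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

# Compare the correlation with x and the spread of y
cor_full <- cor(x, y)        # correlation on the complete data
cor_mean <- cor(x, y_mean)   # correlation after mean imputation
sd_obs   <- sd(y_obs, na.rm = TRUE)
sd_mean  <- sd(y_mean)

print(c(cor_full = cor_full, cor_mean = cor_mean))
print(c(sd_obs = sd_obs, sd_mean = sd_mean))
```

All imputed points sit at the mean of y, so the spread of y shrinks and its link with x is weakened.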
[Figure: scatter plot of y against x, both axes from −2 to 2]
Ideas:
• As x and y are strongly correlated, impute a missing y value using the x value
• If individuals i and j have similar values on the observed variables, impute the missing value of i using the value of j for that variable
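The first idea can be sketched with a simple regression in base R. This is only an illustration on simulated data (the variables, sample size and number of missing values are assumptions), and it is not the method implemented in missMDA.

```r
# Simulated example (assumption: y depends linearly on x)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x + rnorm(n, sd = 0.2)
y[sample(n, 20)] <- NA          # remove 20 y values at random

# Fit y ~ x on the complete rows (lm drops rows with NA by default)
fit <- lm(y ~ x)

# Predict the missing y values from their x values
miss  <- is.na(y)
y_imp <- y
y_imp[miss] <- predict(fit, newdata = data.frame(x = x[miss]))

head(cbind(x = x, y = y, y_imp = y_imp)[miss, ])
```

The imputed values lie exactly on the regression line, so this uses the link between x and y, unlike mean imputation.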
Iterative PCA
  x1     x2
-2.0  -2.01
-1.5  -1.48
 0.0  -0.01
 1.5     NA
 2.0   1.98
[Figure: scatter plot of x2 against x1, axes from −2 to 3]
Step 1: fill in the missing value with an initial guess (0.00)
  x1     x2
-2.0  -2.01
-1.5  -1.48
 0.0  -0.01
 1.5   0.00
 2.0   1.98
Step 2: run the PCA on the completed table; fitted values:
   x1     x2
-1.98  -2.04
-1.44  -1.56
 0.15  -0.18
 1.00   0.57
 2.27   1.67
Step 3: replace the missing value with its fitted value (0.57); the observed values are kept
  x1     x2
-2.0  -2.01
-1.5  -1.48
 0.0  -0.01
 1.5   0.57
 2.0   1.98
Repeat steps 2 and 3. Fitted values at the next iteration:
   x1     x2
-2.00  -2.01
-1.47  -1.52
 0.09  -0.11
 1.20   0.90
 2.18   1.78
New imputed value (0.90):
  x1     x2
-2.0  -2.01
-1.5  -1.48
 0.0  -0.01
 1.5   0.90
 2.0   1.98
At convergence, the imputed value is 1.48:
  x1     x2
-2.0  -2.01
-1.5  -1.48
 0.0  -0.01
 1.5   1.48
 2.0   1.98
[Figure: at each step, scatter plot of x2 against x1 (axes from −2 to 3) showing the completed table]
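The iteration illustrated on the small x1/x2 table can be sketched in base R. This is a minimal, unregularized version written for illustration, not the code used by missMDA: the function name iterative_pca, the rank-1 fit, the column-mean starting value (the slide starts from 0.00) and the stopping rule are all assumptions.

```r
# The five-row example table, with one missing x2 value
X <- cbind(x1 = c(-2.0, -1.5, 0.0, 1.5, 2.0),
           x2 = c(-2.01, -1.48, -0.01, NA, 1.98))

# Hypothetical helper: iterative PCA imputation with ncp dimensions
iterative_pca <- function(X, ncp = 1, tol = 1e-12, max_iter = 1000) {
  miss <- is.na(X)
  Xhat <- X
  # Initialisation: fill each missing cell with its column mean
  Xhat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  k <- seq_len(ncp)
  for (iter in seq_len(max_iter)) {
    # PCA of the completed table: centre, then truncated SVD
    mu <- colMeans(Xhat)
    s  <- svd(sweep(Xhat, 2, mu))
    fitted <- s$u[, k, drop = FALSE] %*% diag(s$d[k], nrow = ncp) %*%
      t(s$v[, k, drop = FALSE])
    fitted <- sweep(fitted, 2, mu, "+")
    # Keep observed cells, replace missing cells with their fitted values
    Xnew <- Xhat
    Xnew[miss] <- fitted[miss]
    change <- sum((Xnew - Xhat)^2)
    Xhat <- Xnew
    if (change < tol) break   # stop once the imputed values no longer move
  }
  Xhat
}

completed <- iterative_pca(X, ncp = 1)
print(round(completed, 2))
```

On this table the imputed x2 value should settle near the 1.48 shown at convergence. imputePCA in missMDA offers a regularized variant and variable scaling, which this sketch leaves out.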
Running missMDA in R
> library(missMDA)
> library(FactoMineR)                               ## Provides PCA()
> data(orange)
> nb <- estim_ncpPCA(orange, scale = TRUE)          ## Estimate the number of dimensions (see nb$ncp)
> comp <- imputePCA(orange, ncp = 2, scale = TRUE)  ## Impute the table
> res.pca <- PCA(comp$completeObs)                  ## Run the PCA on the completed table
[Figure: individuals and variables plots from the PCA on the imputed orange table; Dim 2 (17.16%)]
[Figure: scatter plot of y against x, both axes from −2 to 2]
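The number of dimensions need not be typed by hand: the value estimated by estim_ncpPCA can be passed on to imputePCA. The sketch below assumes that missMDA is installed and that the estimate is stored in nb$ncp and the completed table in comp$completeObs, as in the package documentation. The choice ncp.max = 5 is an assumption made here.

```r
library(missMDA)
data(orange)

# Estimate the number of dimensions by cross-validation
nb <- estim_ncpPCA(orange, ncp.max = 5, scale = TRUE)
print(nb$ncp)

# Impute with the estimated number of dimensions
comp <- imputePCA(orange, ncp = nb$ncp, scale = TRUE)

# Check that no missing value remains in the completed table
n_missing_before <- sum(is.na(orange))
n_missing_after  <- sum(is.na(comp$completeObs))
print(c(before = n_missing_before, after = n_missing_after))
```

The completed table can then be passed to PCA() as in the commands on this slide.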
Visualizing uncertainty due to missing data
> mi <- MIPCA(orange, scale = TRUE, ncp=2)
> plot(mi)
[Figure: output of plot(mi), with panels "Supplementary projection" and "Variable representation"; Dim 2 (17.17%)]
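As a hedged sketch of how the multiple imputation behind these plots can be explored, the code below assumes that missMDA is installed, that MIPCA accepts an nboot argument for the number of imputed tables, that the imputed tables are returned in mi$res.MI, and that plot accepts choice = "ind.supp", as described in the package documentation. The seed and nboot = 20 are arbitrary choices made here.

```r
library(missMDA)
data(orange)

set.seed(123)   # the imputed tables are random, so fix the seed
mi <- MIPCA(orange, ncp = 2, scale = TRUE, nboot = 20)

# One completed table per imputation
n_tables <- length(mi$res.MI)
print(n_tables)

# Draw only the projection of the imputed tables as supplementary individuals
plot(mi, choice = "ind.supp")
```

Individuals whose imputed positions vary a lot across the tables are the ones whose location on the map is most uncertain.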