Gilles Gasso
Introduction
Data analysis
Study the data's distribution or geometry.
Goal: acquire or extract knowledge / patterns from data
→ dimensionality reduction, visualization, clustering
Each row below is a sample (point) x_i; each column is a variable j (e.g. hemoglobin, hg):

rcc   wcc  hc    hg    ferr  bmi    ssf   pcBfat  lbm    ht     wt     sex
5.09  8.9  46.3  15.4   44   29.97  71.1  13.97   88     185.1  102.7  m
5.11  9.6  48.2  16.7  103   27.39  65.9  11.66   83     185.5   94.2  m
4.94  6.3  45.7  15.5   50   23.11  34.3   6.43   74     184.9   79    m
4.86  3.9  44.9  15.4   73   22.83  34.5   6.56   70     181     74.8  m
4.51  4.4  41.6  12.7   44   19.44  65.1  15.07   53.42  179.9   62.9  f
4.62  7.3  43.8  14.7   26   21.2   76.8  18.08   61.85  188.7   75.5  f

What are the relations between the variables? How close are the samples?
Introduction
[Figure: 2-D embedding (q = 2) of handwritten-digit images (d = 784); each point is drawn as its digit label 0-9 and the classes form visible clusters. Axes range from -80 to 80.]

d = 784, q = 2
Gilles Gasso Dimensionality reduction and data visualization 4 / 27
Introduction
What for?
Visualization (q = 2 or 3):
check the data
identify outliers
visualize the data according to their categories (if labelled)
Coding/Encoding scheme
cod : IR^d → IR^q , x ↦ z = cod(x)
dec : IR^q → IR^d , z ↦ x̂ = dec(z)
Example: x_i ∈ IR^3 ↦ z_i ∈ IR^2 with distance preservation.
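For linear PCA (used here purely as an illustration of the cod/dec pair), the coder is cod(x) = P^⊤x and the decoder is dec(z) = Pz for an orthonormal P ∈ IR^{d×q}. A minimal numpy sketch; the toy data and the SVD-based choice of P are assumptions, not part of the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N = 100 samples in d = 3 dimensions (illustrative only).
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)

# Orthonormal P with q = 2 columns, taken from the SVD of the data.
q = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:q].T                      # shape (d, q)

def cod(x):
    """Encode: IR^d -> IR^q, z = P^T x."""
    return P.T @ x

def dec(z):
    """Decode: IR^q -> IR^d, x_hat = P z."""
    return P @ z

x = X[0]
z = cod(x)        # point in IR^2
x_hat = dec(z)    # reconstruction in IR^3
```

Note that dec(cod(x)) = PP^⊤x is exactly the projection that appears in the PCA criterion J(P) below.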
Derivation of the PCA criterion (reconstruction error), with z_i = P^\top x_i and P^\top P = I_q:

\begin{aligned}
J(P) &= \sum_{i=1}^{N} \|x_i - \hat x_i\|^2
      = \sum_{i=1}^{N} (x_i - PP^\top x_i)^\top (x_i - PP^\top x_i) \\
     &= \sum_{i=1}^{N} \big( x_i^\top x_i - 2\, x_i^\top PP^\top x_i + x_i^\top PP^\top PP^\top x_i \big) \\
     &= \sum_{i=1}^{N} x_i^\top x_i - \sum_{i=1}^{N} x_i^\top PP^\top x_i
      = \sum_{i=1}^{N} x_i^\top x_i - \sum_{i=1}^{N} z_i^\top z_i \\
     &= \operatorname{trace}\Big(\sum_{i=1}^{N} x_i x_i^\top\Big) - \operatorname{trace}\Big(\sum_{i=1}^{N} z_i z_i^\top\Big)
      = \operatorname{trace}\Big(\sum_{i=1}^{N} x_i x_i^\top\Big) - \operatorname{trace}\Big(P^\top \Big(\sum_{i=1}^{N} x_i x_i^\top\Big) P\Big) \\
J(P) &= \operatorname{trace}\big(X^\top X\big) - \operatorname{trace}\big(P^\top X^\top X\, P\big)
\end{aligned}
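The final trace identity can be checked numerically. A small sketch; the random data and the random orthonormal P are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, q = 200, 5, 2
X = rng.normal(size=(N, d))            # rows are the x_i^T

# Random orthonormal P (q columns) from a reduced QR factorization.
P, _ = np.linalg.qr(rng.normal(size=(d, q)))

# Direct reconstruction error: sum_i ||x_i - P P^T x_i||^2
J_direct = sum(np.sum((x - P @ P.T @ x) ** 2) for x in X)

# Trace form: trace(X^T X) - trace(P^T X^T X P)
G = X.T @ X
J_trace = np.trace(G) - np.trace(P.T @ G @ P)
```

Both quantities agree (up to floating-point error), and are non-negative since projecting can only shrink the norm.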
Data: {x_i ∈ IR^d}_{i=1}^N, stacked as the rows of X ∈ IR^{N×d}
Computing p_1 ∈ IR^d
We seek a unit vector p_1 that maximizes the variance of the {z_i}_{i=1}^N:

\max_{p_1 \in IR^d} \; \frac{1}{N} \sum_{i=1}^{N} (p_1^\top x_i)^2 \quad \text{s.t.} \quad \|p_1\|^2 = 1
Computing p_1

\max_{p_1 \in IR^d} \; p_1^\top C\, p_1 \quad \text{s.t.} \quad \|p_1\|^2 = 1

where C = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top = \frac{1}{N} X^\top X is the correlation matrix.
Solution derivation
The associated Lagrangian: L(p_1, \lambda_1) = -p_1^\top C\, p_1 + \lambda_1 (p_1^\top p_1 - 1)
Optimality conditions:

\nabla_{p_1} L = -2 C p_1 + 2\lambda_1 p_1 = 0 \quad \text{and} \quad \nabla_{\lambda_1} L = p_1^\top p_1 - 1 = 0

\Longrightarrow \; C p_1 = \lambda_1 p_1 \quad \text{and} \quad p_1^\top C\, p_1 = \lambda_1
Solution
Since the maximized variance equals p_1^⊤ C p_1 = λ_1, p_1 is the eigenvector associated with λ_1, the largest eigenvalue of C.
p_2 is the eigenvector associated with λ_2, the 2nd largest eigenvalue of C.
Lemma
The subspace of dimension k that maximizes the variance of the projection necessarily includes the subspace of dimension k − 1.
Algorithm
PCA algorithm
1. Normalize the data: {x_i ∈ R^d}_{i=1}^N −→ x_ij = (x_ij − x̄_j)/σ_j , j = 1, …, d
2. Compute the correlation matrix C = (1/N) X^⊤ X
3. Compute the q leading eigenvectors of C and stack them into P = (p_1, · · · , p_q) ∈ R^{d×q}
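The three steps above can be sketched directly in numpy (the toy data is illustrative; note that np.linalg.eigh returns eigenvalues in ascending order, hence the reordering):

```python
import numpy as np

def pca(X, q):
    """PCA as in the algorithm above: normalize, build the correlation
    matrix, then keep the q leading eigenvectors."""
    N, d = X.shape
    # 1. Normalize each variable: (x_ij - mean_j) / std_j
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Correlation matrix C = (1/N) X^T X
    C = (Xn.T @ Xn) / N
    # 3. Eigendecomposition, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order[:q]]          # (d, q) projection matrix
    Z = Xn @ P                         # (N, q) projected data
    return P, Z, eigvals[order]

# Toy data with very different variable scales (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) * np.array([1.0, 5.0, 0.5, 2.0])
P, Z, lam = pca(X, q=2)
```

After normalization each variable has unit variance, so the eigenvalues sum to d (here 4), which is what makes "percentage of variance per axis" meaningful later.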
Application
Bivariate representation
→ some variables are linearly dependent
→ use correlation
Correlation matrix (columns: rcc, wcc, hc, hg, ferr, bmi, ssf, pcBfat, lbm, ht, wt):

        rcc    wcc   hc     hg     ferr   bmi   ssf    pcBfat  lbm    ht     wt
rcc     1      0.15  0.92   0.89   0.25   0.3   -0.4   -0.49   0.55   0.36   0.4
hc      0.92   0.15  1      0.95   0.26   0.32  -0.45  -0.53   0.58   0.37   0.42
hg      0.89   0.13  0.95   1      0.31   0.38  -0.44  -0.53   0.61   0.35   0.46
ferr    0.25   0.13  0.26   0.31   1      0.3   -0.11  -0.18   0.32   0.12   0.27
bmi     0.3    0.18  0.32   0.38   0.3    1      0.32   0.19   0.71   0.34   0.85
ssf    -0.4    0.14 -0.45  -0.44  -0.11   0.32   1      0.96  -0.21  -0.07   0.15
pcBfat -0.49   0.11 -0.53  -0.53  -0.18   0.19   0.96   1     -0.36  -0.19   0
lbm     0.55   0.1   0.58   0.61   0.32   0.71  -0.21  -0.36   1      0.8    0.93
ht      0.36   0.08  0.37   0.35   0.12   0.34  -0.07  -0.19   0.8    1      0.78
Projection of a sample (dimension d −→ dimension q < d):
x −(normalize with x̄ and σ)−→ x̃ −(projection z = P^⊤ x̃)−→ z = (p_1^⊤ x̃, …, p_q^⊤ x̃)^⊤
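The pipeline above applied to a new sample can be sketched as follows; the values of x̄, σ and P are toy numbers chosen for illustration:

```python
import numpy as np

def project_new(x, x_bar, sigma, P):
    """Apply the pipeline: normalize with the training mean/std,
    then project with z = P^T x_tilde."""
    x_tilde = (x - x_bar) / sigma
    return P.T @ x_tilde

# Toy training statistics and projection matrix, d = 3, q = 2.
x_bar = np.array([1.0, 2.0, 3.0])
sigma = np.array([0.5, 1.0, 2.0])
P = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])           # (d, q), orthonormal columns

z = project_new(np.array([1.5, 1.0, 3.0]), x_bar, sigma, P)
print(z)   # → [ 1. -1.]
```

The key point is that x̄ and σ come from the training data: a new sample must be normalized with the same statistics before projection.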
Data visualization
[Figure: PCA biplot (axes 1 and 2) of the athlete data. Variables are drawn as arrows (ssf, pcBfat, bmi, wt, ht, lbm, wcc, ferr, rcc, hg, hc) and samples are labelled by sex (f/m).]
Algorithm Application
[Figure: percentage of variance captured by each of the 10 principal axes: 45.4%, 23.3%, 10.5%, 8.1%, 7.2%, 3.9%, 1%, 0.4%, 0.2%, 0%.]
How to choose the number of axes:
cross-validation
detection of an "elbow" on the graph of eigenvalues
keep a fixed percentage (for instance 95%) of the variance
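The "fixed percentage" rule can be applied directly to the percentages read off the scree plot above; a small sketch:

```python
import numpy as np

# Per-axis variance percentages from the scree plot above.
pct = np.array([45.4, 23.3, 10.5, 8.1, 7.2, 3.9, 1.0, 0.4, 0.2, 0.0])

cumulative = np.cumsum(pct)
# Smallest q whose cumulative explained variance reaches 95%
q = int(np.argmax(cumulative >= 95.0)) + 1
print(q, cumulative[q - 1])
```

With these values the first five axes capture only 94.5%, so the 95% rule keeps q = 6 axes (98.4%).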
Algorithm Application
[Figure: PCA projection of the digit images (d = 784, q = 2); each point is drawn as its class label 0-9, axes roughly -0.015 to 0.015. The classes overlap substantially in this linear projection.]
Drawbacks of PCA
Beyond PCA
Non-linear PCA
ISOMAP, LLE, MVU, SNE, t-SNE . . .
Neural networks
auto-encoders
embeddings for custom data: word2vec, doc2vec (text), signal2vec (time series)
Computing the {z_i}_{i=1}^N
Minimize the Kullback-Leibler divergence between IP_X and IP_Z:

\min_{z_1,\dots,z_N} \sum_{i,j=1}^{N} IP_X(x_i \,|\, x_j) \,\log \frac{IP_X(x_i \,|\, x_j)}{IP_Z(z_i \,|\, z_j)}

with d_{ij} = \dfrac{\mathrm{dist}(x_i, x_j)^2}{2\sigma_i^2} and δ_ij = ‖z_i − z_j‖. Solved via a numerical method.

Parameter σ_i defines the number of neighbors of sample x_i. It is selected such that

\log(K) = -\sum_{j=1}^{N} IP_X(x_i \,|\, x_j) \,\log IP_X(x_i \,|\, x_j)

where K is the target perplexity.
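The slide does not spell out how σ_i is computed; a common approach (an assumption here, not stated on the slide) is a bisection on σ_i until the entropy of IP_X(· | x_i) matches log K. A sketch with toy squared distances:

```python
import numpy as np

def neighbor_probs(d2, sigma):
    """IP_X(x_j | x_i) ∝ exp(-d2_j / (2 sigma^2)), normalized over j,
    where d2 holds squared distances from x_i to every other sample."""
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max()             # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """K = exp(entropy): the effective number of neighbors."""
    h = -np.sum(p * np.log(np.maximum(p, 1e-12)))
    return np.exp(h)

def sigma_for_perplexity(d2, K, lo=1e-3, hi=1e3, iters=60):
    """Bisection on sigma so that perplexity ~ K (perplexity grows
    monotonically with sigma for distinct distances)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if perplexity(neighbor_probs(d2, mid)) < K:
            lo = mid                   # sigma too small: too few neighbors
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
d2 = rng.uniform(0.1, 4.0, size=50)    # toy squared distances from one x_i
s = sigma_for_perplexity(d2, K=10.0)
p = neighbor_probs(d2, s)
```

A small σ_i concentrates all probability on the nearest neighbor (perplexity → 1); a large σ_i spreads it uniformly (perplexity → N − 1), so the bisection always brackets the target K.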
Illustration of SNE
[Figure: input data X and its SNE embedding Z.]
t-SNE variant

SNE (prob. that z_j is a neighbor of z_i):

IP_Z(z_i \,|\, z_j) = \frac{\exp(-\delta_{ij}^2)}{\sum_{k \neq i} \exp(-\delta_{ik}^2)}

t-SNE (prob. that z_j is a neighbor of z_i):

IP_Z(z_i \,|\, z_j) = \frac{(1 + \delta_{ij}^2)^{-1}}{\sum_{k \neq i} (1 + \delta_{ik}^2)^{-1}}

with δ_ij = ‖z_i − z_j‖
[Figure: probabilities vs. distances for the Gaussian kernel exp(−x²) and the Student-t kernel 1/(1+x²); the t kernel has a much heavier tail.]
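The two (unnormalized) kernels can be compared directly; a small sketch showing that the Student-t affinity decays much more slowly with distance than the Gaussian one:

```python
import numpy as np

# Unnormalized neighbor affinities in the embedding space:
# SNE uses a Gaussian kernel, t-SNE a Student-t (Cauchy) kernel.
delta = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
gauss = np.exp(-delta ** 2)
student = 1.0 / (1.0 + delta ** 2)

# The ratio grows with distance: the t kernel keeps a non-negligible
# affinity where the Gaussian has already vanished.
ratio = student / gauss
```

This heavier tail lets moderately distant points in the embedding still exert some attraction, which is why t-SNE spreads clusters apart more cleanly than SNE.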
Illustration of t-SNE
[Figure: t-SNE embedding of the digit images (d = 784, q = 2); the ten classes 0-9 form clearly separated clusters, axes from -80 to 80.]
https://lvdmaaten.github.io/tsne/
Conclusions