
Dimensionality reduction and data visualization

Gilles Gasso

INSA Rouen - ASI Department


Laboratory LITIS

September 12, 2019




Introduction

Supervised learning (predictive methods)
Develop predictive models using labeled training data
Make sure these models generalize to future data (test data)

Unsupervised learning (descriptive methods)
Data analysis
Study the data's distribution or geometry
Goal: acquire or extract knowledge / patterns from the data
→ dimensionality reduction, visualization, clustering



Data exploration

The data matrix gathers n samples described by d variables. Each row is a sample (point) x_i = (x_{i,1}, · · · , x_{i,d}); each column is a variable j (for instance hg, the hemoglobin level).

X = [ x_{1,1} · · · x_{1,d}
        ⋮     ⋱     ⋮
      x_{n,1} · · · x_{n,d} ] ∈ R^{n×d}

Example (athlete biometric data; the full table is reproduced on the Application slide):

rcc   wcc  hc    hg    ferr  bmi    ssf   pcBfat  lbm    ht     wt    sex
4.82  7.6  43.2  14.4  58    22.37  50    11.64   53.11  163.9  60.1  f
4.32  6.8  40.6  13.7  46    17.54  54.6  12.16   46.12  173    52.5  f
...
4.62  7.3  43.8  14.7  26    21.2   76.8  18.08   61.85  188.7  75.5  f

What are the relations between the variables? How close are the samples?

Dimension reduction: the goal


Let X ∈ R^{N×d} be the data matrix (N samples of dimension d).
Goal: find a projection of X onto Z ∈ R^{N×q} with q < d, i.e. map the N×d matrix X to an N×q matrix Z.

(Figure: 2-D projection of the MNIST handwritten digits, from d = 784 pixels down to q = 2; each point is an image plotted at its projected coordinates and labelled by its digit class.)

What for?

Visualization (q = 2 or 3)
check the data
identify outliers
visualize the data according to their categories (if labelled)

Data representation (q < d)
noise reduction
pre-processing (to reduce the computational cost)
uncover hidden structure in the data (example: manifolds)

Coding/decoding scheme
cod : R^d → R^q, x ↦ z = cod(x)
dec : R^q → R^d, z ↦ x̂ = dec(z)

How to assess the quality of the coding?
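
A natural quality measure, made explicit on the PCA slides below, is the reconstruction error between x and dec(cod(x)). A minimal sketch, with a deliberately naive coding pair (keep the first two coordinates) standing in for a real method:

import numpy as np

def coding_error(X, cod, dec):
    """Mean squared reconstruction error of a coding/decoding pair."""
    X_hat = np.array([dec(cod(x)) for x in X])
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Identity coding loses nothing; keeping only 2 of the 5 coordinates does.
print(coding_error(X, cod=lambda x: x, dec=lambda z: z))                      # 0.0
print(coding_error(X, cod=lambda x: x[:2],
                   dec=lambda z: np.concatenate([z, np.zeros(3)])))           # > 0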




Principle of dimension reduction methods


Project the samples {x_i ∈ R^d}_{i=1}^N onto {z_i ∈ R^q}_{i=1}^N (q < d) such that the data topology is preserved:
preserve the distances between samples
preserve the neighborhoods . . .

(Figure: points x_i ∈ R^3 mapped to z_i ∈ R^2 with distance preservation.)

Methods we will study


Linear: PCA; non-linear: SNE and its t-SNE variant

Principal Component Analysis (PCA)


Model: data = information + noise
X = Z P^T + B

Linear orthogonal projection:
cod : R^d → R^q, x ↦ z = P^T x
dec : R^q → R^d, z ↦ x̂ = P z

Property: the columns of P are orthonormal (P^T P = I_q)

Dimensions: X = [x_1^T ; … ; x_N^T] ∈ R^{N×d},  Z = [z_1^T ; … ; z_N^T] ∈ R^{N×q},  P ∈ R^{d×q}

Objective: minimize the reconstruction error between the x_i and x̂_i = dec(cod(x_i)):

min_{P ∈ R^{d×q}}  Σ_{i=1}^N ‖x_i − P P^T x_i‖²




Another view of PCA

PCA linearly projects {x_i ∈ R^d}_{i=1}^N onto a subspace of dimension q (q < d) such that the variance of the projections {z_i = P^T x_i ∈ R^q}_{i=1}^N remains maximal.

Variance maximization (case q = 1)

max_{p ∈ R^d}  ‖X p‖²   with  ‖p‖² = 1  and  Z = X p

Minimization of error / maximization of variance

J(P) = Σ_{i=1}^N ‖x_i − x̂_i‖² = Σ_{i=1}^N (x_i − P P^T x_i)^T (x_i − P P^T x_i)
     = Σ_{i=1}^N ( x_i^T x_i − 2 x_i^T P P^T x_i + x_i^T P P^T P P^T x_i )
     = Σ_{i=1}^N x_i^T x_i − Σ_{i=1}^N x_i^T P P^T x_i          (since P^T P = I_q)
     = Σ_{i=1}^N x_i^T x_i − Σ_{i=1}^N z_i^T z_i
     = trace( Σ_{i=1}^N x_i x_i^T − Σ_{i=1}^N z_i z_i^T )
     = trace( Σ_{i=1}^N x_i x_i^T − P^T ( Σ_{i=1}^N x_i x_i^T ) P )

J(P) = trace( X^T X ) − trace( P^T X^T X P )

⇒ minimizing J(P) ⇔ maximizing the variance of the projections w.r.t. P
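
A quick numerical check of this identity on centered data, with P any matrix with orthonormal columns (here drawn at random via a QR factorization):

import numpy as np

rng = np.random.default_rng(0)
N, d, q = 200, 6, 2
X = rng.normal(size=(N, d))
X -= X.mean(axis=0)                                  # center the data

P, _ = np.linalg.qr(rng.normal(size=(d, q)))         # random orthonormal columns, P.T @ P = I_q

J = np.sum((X - X @ P @ P.T) ** 2)                   # sum_i ||x_i - P P^T x_i||^2
identity = np.trace(X.T @ X) - np.trace(P.T @ X.T @ X @ P)

print(np.isclose(J, identity))                       # True: minimizing J(P) = maximizing projected variance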



Learning the first projection vector p_1 of P

Data: {x_i ∈ R^d}_{i=1}^N
Assume the x_i are normalized (mean removal and scaling by the standard deviation).
Projections onto p_1: {z_i = p_1^T x_i ∈ R}_{i=1}^N

Computing p_1 ∈ R^d
We seek a unit vector p_1 that maximizes the variance of the {z_i}_{i=1}^N:

max_{p_1 ∈ R^d}  (1/N) Σ_{i=1}^N (p_1^T x_i)²   s.t.  ‖p_1‖² = 1

→ This turns into a constrained optimization problem.



Computing p_1

max_{p_1 ∈ R^d}  p_1^T C p_1   s.t.  ‖p_1‖² = 1

with C = (1/N) Σ_{i=1}^N x_i x_i^T = (1/N) X^T X the correlation matrix.

Solution derivation
The associated Lagrangian: L(p_1, λ_1) = −p_1^T C p_1 + λ_1 (p_1^T p_1 − 1)

Optimality conditions:
∇_{p_1} L = −2 C p_1 + 2 λ_1 p_1 = 0   and   ∇_{λ_1} L = p_1^T p_1 − 1 = 0
⟹ C p_1 = λ_1 p_1   and   p_1^T C p_1 = λ_1

1. (λ_1, p_1) is an (eigenvalue, eigenvector) pair of C
2. p_1^T C p_1 = λ_1 is the quantity we seek to maximize

p_1 is therefore the eigenvector associated with the largest eigenvalue of C.
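
A short NumPy check, on synthetic normalized data, that the leading eigenvector of C attains the largest value of p^T C p among unit vectors:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))    # correlated features
X = (X - X.mean(axis=0)) / X.std(axis=0)                   # normalize

C = X.T @ X / len(X)                      # correlation matrix
eigvals, eigvecs = np.linalg.eigh(C)      # eigh: ascending eigenvalues for a symmetric C
p1, lam1 = eigvecs[:, -1], eigvals[-1]

print(np.isclose(p1 @ C @ p1, lam1))      # p_1^T C p_1 = lambda_1
for _ in range(5):                        # random unit directions never do better
    p = rng.normal(size=4)
    p /= np.linalg.norm(p)
    assert p @ C @ p <= lam1 + 1e-12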



Computing p_2 and beyond

We look for a unit vector p_2, orthogonal to p_1, that maximizes the variance of the projections {p_2^T x_i}_{i=1}^N onto p_2.

Solution
p_2 is the eigenvector associated with λ_2, the 2nd largest eigenvalue of C.

Lemma
The subspace of dimension k that maximizes the variance of the projections necessarily contains the subspace of dimension k − 1.

PCA algorithm

1. Normalize the data: {x_i ∈ R^d}_{i=1}^N → x_ij ← (x_ij − x̄_j)/σ_j, j = 1, …, d
2. Compute the correlation matrix C = (1/N) X^T X
3. Compute the eigenvalue decomposition {p_j ∈ R^d, λ_j ∈ R}_{j=1}^d of C
4. Order the eigenvalues λ_j in decreasing order
5. The projection matrix is P = (p_1, · · · , p_q) ∈ R^{d×q}, where {p_1, · · · , p_q} are the q eigenvectors associated with the q largest eigenvalues.
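
A minimal NumPy sketch of these five steps (the function name pca_fit and the toy data are illustrative, not from the slides):

import numpy as np

def pca_fit(X, q):
    """Steps 1-5 of the PCA algorithm: returns the mean, std and projection matrix P."""
    x_bar, sigma = X.mean(axis=0), X.std(axis=0)
    X_tilde = (X - x_bar) / sigma                    # 1. normalize
    C = X_tilde.T @ X_tilde / len(X_tilde)           # 2. correlation matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # 3. eigenvalue decomposition
    order = np.argsort(eigvals)[::-1]                # 4. decreasing eigenvalues
    P = eigvecs[:, order[:q]]                        # 5. q leading eigenvectors, P in R^{d x q}
    return x_bar, sigma, P

# Example on random data: d = 11 variables reduced to q = 2
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))
x_bar, sigma, P = pca_fit(X, q=2)
Z = ((X - x_bar) / sigma) @ P                        # projected samples, shape (200, 2)
print(P.shape, Z.shape)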




Application

rcc wcc hc hg ferr bmi ssf pcBfat lbm ht wt sex


4.82 7.6 43.2 14.4 58 22.37 50 11.64 53.11 163.9 60.1 f
4.32 6.8 40.6 13.7 46 17.54 54.6 12.16 46.12 173 52.5 f
5.16 7.2 44.3 14.5 88 18.29 61.9 12.92 48.76 175 56 f
4.66 6.4 40.9 13.9 109 18.37 38.2 8.45 41.93 157.9 45.8 f
4.19 9 39 13.4 69 18.93 43.5 10.16 42.95 158.9 47.8 f
4.53 5 40.7 14 41 17.79 56.8 12.55 38.3 156.9 43.8 f
4.42 6.4 42.8 14.5 63 20.31 58.9 13.46 39.03 149 45.1 f
4.32 4.3 41.6 14 177 26.73 35.2 6.46 91 190.4 96.9 m
4.73 6.7 42.8 14.9 8 19.81 41.8 7.19 70 195.2 75.5 m
4.71 7.2 43.6 14 32 20.39 30.5 5.63 67 186.6 71 m
4.93 7.3 46.2 15.1 41 21.12 34 6.59 67 184.4 71.8 m
5.21 7.5 47.5 16.5 20 21.89 46.7 9.5 70 187.3 76.8 m
5.09 8.9 46.3 15.4 44 29.97 71.1 13.97 88 185.1 102.7 m
5.11 9.6 48.2 16.7 103 27.39 65.9 11.66 83 185.5 94.2 m
4.94 6.3 45.7 15.5 50 23.11 34.3 6.43 74 184.9 79 m
4.86 3.9 44.9 15.4 73 22.83 34.5 6.56 70 181 74.8 m
4.51 4.4 41.6 12.7 44 19.44 65.1 15.07 53.42 179.9 62.9 f
4.62 7.3 43.8 14.7 26 21.2 76.8 18.08 61.85 188.7 75.5 f



Pair plots of the variables

Bivariate representation
→ some variables are linearly dependent
→ use the correlation

Correlation matrix:

        rcc    wcc    hc     hg     ferr   bmi    ssf    pcBfat  lbm    ht     wt
rcc     1      0.15   0.92   0.89   0.25   0.3   -0.4   -0.49    0.55   0.36   0.4
wcc     0.15   1      0.15   0.13   0.13   0.18   0.14   0.11    0.1    0.08   0.16
hc      0.92   0.15   1      0.95   0.26   0.32  -0.45  -0.53    0.58   0.37   0.42
hg      0.89   0.13   0.95   1      0.31   0.38  -0.44  -0.53    0.61   0.35   0.46
ferr    0.25   0.13   0.26   0.31   1      0.3   -0.11  -0.18    0.32   0.12   0.27
bmi     0.3    0.18   0.32   0.38   0.3    1      0.32   0.19    0.71   0.34   0.85
ssf    -0.4    0.14  -0.45  -0.44  -0.11   0.32   1      0.96   -0.21  -0.07   0.15
pcBfat -0.49   0.11  -0.53  -0.53  -0.18   0.19   0.96   1      -0.36  -0.19   0
lbm     0.55   0.1    0.58   0.61   0.32   0.71  -0.21  -0.36    1      0.8    0.93
ht      0.36   0.08   0.37   0.35   0.12   0.34  -0.07  -0.19    0.8    1      0.78
wt      0.4    0.16   0.42   0.46   0.27   0.85   0.15   0       0.93   0.78   1
Perform data projection

Knowing P ∈ R^{d×q}, the mean x̄ and the standard deviation σ:

x (dimension d)  →  normalize with x̄ and σ  →  x̃  →  project with z = P^T x̃  →  z = (p_1^T x̃, · · · , p_q^T x̃) (dimension q < d)

→ we achieve dimensionality reduction
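
A small stand-alone sketch of this projection step; the numeric values of x̄, σ and P below are made up for illustration (in practice they come from the PCA fit on the training data):

import numpy as np

def pca_transform(x_new, x_bar, sigma, P):
    """Normalize a new sample with the stored mean/std, then project: z = P^T x_tilde."""
    x_tilde = (x_new - x_bar) / sigma
    return P.T @ x_tilde                  # vector of length q: (p_1^T x_tilde, ..., p_q^T x_tilde)

# Illustrative values (d = 3, q = 2)
x_bar = np.array([1.0, 2.0, 0.5])
sigma = np.array([0.5, 1.0, 2.0])
P = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # hypothetical orthonormal projection matrix
print(pca_transform(np.array([1.5, 1.0, 0.5]), x_bar, sigma, P))   # -> [ 1. -1.]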



Data visualization

PCA biplot

(Figure: biplot of the athlete data on the first two principal axes; arrows show the variables (ssf, pcBfat, bmi, wt, wcc, ferr, lbm, ht, rcc, hg, hc) and points show the samples, coloured by sex (f/m).)

Choice of q, the projection space size


(Figure: percentage of variance explained by each axis — 45.4%, 23.3%, 10.5%, 8.1%, 7.2%, 3.9%, 1%, 0.4%, 0.2%, 0% for axes 1 to 10.)

Cross-validation
Detection of an "elbow" on the graph of eigenvalues
Keep a fixed percentage (for instance 95%) of the variance
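
A sketch of the "fixed percentage of the variance" rule, using the per-axis percentages read off the bar chart above; with the eigenvalues λ_j at hand one would instead use explained = 100 * λ / λ.sum():

import numpy as np

# Percentage of variance carried by each axis (values shown on this slide)
explained = np.array([45.4, 23.3, 10.5, 8.1, 7.2, 3.9, 1.0, 0.4, 0.2, 0.0])

cumulative = np.cumsum(explained)
q = int(np.searchsorted(cumulative, 95.0) + 1)   # smallest q reaching 95% of the variance
print(cumulative)    # 45.4, 68.7, 79.2, 87.3, 94.5, 98.4, ...
print(q)             # 6 axes are needed to keep at least 95% of the variance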

Visualizing the MNIST dataset

d = 784 → q = 2

(Figure: two-dimensional projection Z of the MNIST digit images; each point is an image labelled by its digit class. The digit classes overlap considerably.)



Drawbacks of PCA

It is a linear projection method
Implicit assumption: the data are drawn from a Gaussian distribution
PCA only relies on second-order statistics of the data (variance)

Beyond PCA
Non-linear PCA
ISOMAP, LLE, MVU, SNE, t-SNE . . .
Neural networks
auto-encoders
embeddings for custom data: word2vec, doc2vec (text), signal2vec (time series)




SNE (Stochastic Neighbor Embedding) and t-SNE


The idea
Transform pairwise distances dist(x_i, x_j) into probabilities P_X(x_i|x_j) that x_i and x_j are close:
low distance → high probability of being close
Assume the same for the projections: dist(z_i, z_j) → P_Z(z_i|z_j)
Find the projections {z_i} that minimize the distance between the distributions P_X and P_Z




SNE: the maths

For the {x_i}_{i=1}^n, define the probability that x_j and x_i are close:

P_X(x_i|x_j) = exp(−d_ij²) / Σ_{k=1, k≠i}^N exp(−d_ik²),   with  d_ij² = dist(x_i, x_j)² / (2 σ_i²)

For the {z_i}_{i=1}^n (the unknowns), the probability that z_j is a neighbor of z_i:

P_Z(z_i|z_j) = exp(−δ_ij²) / Σ_{k=1, k≠i}^N exp(−δ_ik²),   with  δ_ij = ‖z_i − z_j‖

The parameter σ_i sets the effective number of neighbors K of sample x_i. It is selected such that

log(K) = − Σ_{j=1}^N P_X(x_i|x_j) log P_X(x_i|x_j)

Computing the {z_i}_{i=1}^n
Minimize the Kullback-Leibler divergence between P_X and P_Z:

min_{z_1, …, z_N}  Σ_{i,j=1}^N P_X(x_i|x_j) log [ P_X(x_i|x_j) / P_Z(z_i|z_j) ]

Solved via a numerical method (e.g. gradient descent).
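
A direct, unoptimized transcription of these formulas for a single point x_i, with σ_i fixed by hand instead of tuned so that the entropy matches log(K) (function names are illustrative):

import numpy as np

def p_x_given_i(X, i, sigma_i):
    """P_X(. | x_i): softmax of -dist(x_i, x_j)^2 / (2 sigma_i^2) over j != i."""
    d2 = np.sum((X - X[i]) ** 2, axis=1) / (2 * sigma_i ** 2)
    w = np.exp(-d2)
    w[i] = 0.0                            # a point is not its own neighbor
    return w / w.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Kullback-Leibler divergence between two neighborhood distributions."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))              # high-dimensional points
Z = rng.normal(size=(20, 2))              # a candidate embedding (normally optimized)
P = p_x_given_i(X, i=0, sigma_i=1.0)
Q = p_x_given_i(Z, i=0, sigma_i=np.sqrt(0.5))   # sigma = 1/sqrt(2) gives exp(-||z_i - z_j||^2)
print(kl_divergence(P, Q))                # one term of the objective SNE minimizes over the z_i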




Illustration of SNE

(Figure: the original data X and its SNE embedding Z.)




t-SNE variant
SNE: probability that z_j is a neighbor of z_i:
P_Z(z_i|z_j) = exp(−δ_ij²) / Σ_{k=1, k≠i}^N exp(−δ_ik²)

t-SNE: probability that z_j is a neighbor of z_i:
P_Z(z_i|z_j) = (1 + δ_ij²)^{−1} / Σ_{k=1, k≠i}^N (1 + δ_ik²)^{−1}

with δ_ij = ‖z_i − z_j‖

(Figure: the Gaussian kernel exp(−x²) versus the Student kernel 1/(1 + x²) as a function of the distance; the Student kernel has a heavier tail.)

Matching P_X = P_Z at large probabilities ⇒ δ_Z < d_X (attraction)
Matching P_X = P_Z at low probabilities ⇒ δ_Z > d_X (repulsion)
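
A toy comparison of the two neighbor probabilities, following the conditional form written above; the point placed far from z_0 keeps a much larger probability under the Student kernel:

import numpy as np

def p_z_sne(Z, i):
    """SNE: exp(-delta_ij^2) normalized over j != i."""
    w = np.exp(-np.sum((Z - Z[i]) ** 2, axis=1))
    w[i] = 0.0
    return w / w.sum()

def p_z_tsne(Z, i):
    """t-SNE: (1 + delta_ij^2)^-1 normalized over j != i (heavier tail)."""
    w = 1.0 / (1.0 + np.sum((Z - Z[i]) ** 2, axis=1))
    w[i] = 0.0
    return w / w.sum()

Z = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]])   # one close and one far neighbor of z_0
print(p_z_sne(Z, 0))     # far point gets a nearly-zero probability
print(p_z_tsne(Z, 0))    # heavy tail keeps a non-negligible probability for the far point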




Illustration of t-SNE
(Figure: t-SNE embedding of the MNIST digits, d = 784 → q = 2; the ten digit classes form clearly separated clusters.)
https://lvdmaaten.github.io/tsne/




Conclusions

PCA is a linear projection method for dimensionality reduction

There are several non-linear projection approaches (t-SNE, UMAP, Isomap . . . )
They involve advanced optimization issues

All these methods are useful for data visualization
Which one to pick in practice? Not so easy to say

Some toolboxes
Matlab: https://lvdmaaten.github.io/drtoolbox/
Python: http://scikit-learn.org/stable/modules/manifold.html#manifold
Graphical tool: http://divvy.ucsd.edu/
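
As an illustration of the scikit-learn route, a minimal sketch comparing PCA and t-SNE on the small 8x8 digits set shipped with scikit-learn (a lightweight stand-in for MNIST):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 images of size 8x8 -> d = 64

Z_pca = PCA(n_components=2).fit_transform(X)
Z_tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Z, title in zip(axes, [Z_pca, Z_tsne], ["PCA", "t-SNE"]):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(title)
plt.show()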

