Matthieu Puigt
1 Introduction
2 Low-rank matrices
7 Conclusion
Examples
In healthcare, each person—aka observation—may be described by their height,
weight, temperature, blood pressure, heart rate, etc. (features)
In social science, a researcher may ask some volunteers—aka
observations—tens to hundreds of questions—aka features—which are used to
derive some properties
In movie recommendation systems, each customer may grade hundreds of
movies
Analogy
Let's imagine you play Mario's enemies and you want to catch him... in 1D, in 2D, in 3D.
→ Increasing the dimension in which Mario can move makes your problem much harder
→ Increasing the number of Mario's enemies makes the task easier (provided it is possible to have a good AI to help you play all the characters)
Fortunately!
Many real problems involve structured data
They tend to be low-rank
Approximate Low-rankness
Key property in signal & image processing and in machine learning
The data matrix / tensor can be explained by a limited number of
hidden/latent variables (with a limited/negligible approximation error)
X as a matrix product
$$X = W \cdot H^T = \begin{pmatrix} | & & | \\ w_1 & \cdots & w_k \\ | & & | \end{pmatrix} \cdot \begin{pmatrix} \text{---} & h_1^T & \text{---} \\ & \vdots & \\ \text{---} & h_k^T & \text{---} \end{pmatrix}$$
Equivalently, $X = w_1 \cdot h_1^T + w_2 \cdot h_2^T + \ldots + w_k \cdot h_k^T$, i.e., a sum of k rank-1 matrices.
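As a small illustration (not from the original slides; dimensions and the NumPy code are my own sketch), one can build a rank-k matrix X = W·H^T from hypothetical factors and check that the block product and the sum of rank-1 terms coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 50, 40, 3                    # hypothetical dimensions

W = rng.standard_normal((d, k))        # columns w_1, ..., w_k
H = rng.standard_normal((n, k))        # columns h_1, ..., h_k

X = W @ H.T                            # X = W · H^T  (d x n, rank k)
print(np.linalg.matrix_rank(X))        # -> 3

# X is also a sum of k rank-1 matrices w_i · h_i^T
X_sum = sum(np.outer(W[:, i], H[:, i]) for i in range(k))
print(np.allclose(X, X_sum))           # -> True
```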
Clustering (K-means)
X ≈ W · H^T, where each column of H^T (i.e., each membership vector) is all zeros except for a single entry equal to one
→ W contains the centroids and H^T is a cluster membership matrix
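A minimal sketch of this view (the data, dimensions, and use of scikit-learn's KMeans are my own assumptions, not part of the slides): the centroids give W and the one-hot encoded labels give H^T.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, d, k = 300, 2, 3
# three hypothetical clusters in the plane (observations as rows here)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(n // k, d))
                  for c in ([0, 0], [3, 0], [0, 3])])

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)

W = km.cluster_centers_.T                  # d x k: centroids as columns
Ht = np.zeros((k, data.shape[0]))          # k x n: cluster membership matrix
Ht[km.labels_, np.arange(data.shape[0])] = 1.0   # one-hot columns

X = data.T                                 # d x n, observations as columns
print(np.linalg.norm(X - W @ Ht) / np.linalg.norm(X))  # small relative residual
```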
Denoising
Y = X + E, where X contains some dominant patterns (X low-rank) and E does not (noise)
Y ≈ WH^T → WH^T is closer to X than to Y
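A minimal NumPy sketch of this idea (the dimensions, noise level, and the use of a truncated SVD to obtain the rank-k approximation are my own choices): a low-rank X is corrupted by noise, and the best rank-k approximation of Y ends up closer to X than to Y.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 60, 80, 2
X = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))   # low-rank "signal"
E = 0.1 * rng.standard_normal((d, n))                            # noise
Y = X + E

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
WHt = (U[:, :k] * s[:k]) @ Vt[:k, :]     # best rank-k approximation of Y

print(np.linalg.norm(WHt - X))           # small: WH^T recovers the clean patterns
print(np.linalg.norm(WHt - Y))           # larger: the discarded part is mostly noise
```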
In 2021 (source: Statista):
500 hours of YouTube video uploaded
695,000 photos shared on Instagram
197 million emails sent
X = U · Σ · V^T
where:
U is a d × d matrix with orthonormal columns (left singular vectors)
V is an n × n matrix with orthonormal columns (right singular vectors)
Σ is a d × n diagonal matrix with σ_1 ≥ 0 and σ_i ≥ σ_j if i < j (singular values)
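A quick NumPy check of these definitions (a sketch with hypothetical dimensions):

```python
import numpy as np

d, n = 6, 4
X = np.random.default_rng(3).standard_normal((d, n))

# Full SVD: U is d x d, Vt is n x n, and s holds the min(d, n) singular values
U, s, Vt = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vt.shape)          # (6, 6) (4,) (4, 4)
print(np.all(s[:-1] >= s[1:]))             # singular values are sorted: True

# Rebuild the d x n diagonal matrix Sigma and check X = U @ Sigma @ V^T
Sigma = np.zeros((d, n))
Sigma[:n, :n] = np.diag(s)
print(np.allclose(X, U @ Sigma @ Vt))      # True
```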
$$U\Sigma = \begin{pmatrix} u_{11} & \cdots & u_{1n} & u_{1,n+1} & \cdots & u_{1d}\\ u_{21} & \cdots & u_{2n} & u_{2,n+1} & \cdots & u_{2d}\\ \vdots & & \vdots & \vdots & & \vdots\\ u_{d1} & \cdots & u_{dn} & u_{d,n+1} & \cdots & u_{dd} \end{pmatrix} \cdot \begin{pmatrix} \sigma_1 & 0 & \cdots & 0\\ 0 & \sigma_2 & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & \sigma_n\\ 0 & \cdots & \cdots & 0\\ \vdots & & & \vdots\\ 0 & \cdots & \cdots & 0 \end{pmatrix}$$

→ The last d − n columns of U only multiply zeros: they do not contribute to U · Σ.
Orthonormality
UU^T = I, U^T U = I, VV^T = I, V^T V = I
Be careful! There might be some differences in the dimension of the identity matrices (esp. with the econ. SVD)
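This warning can be checked directly (a minimal sketch with assumed dimensions): with the economy-size SVD, U is no longer square, so only U^T·U is an identity matrix.

```python
import numpy as np

d, n = 6, 4
X = np.random.default_rng(4).standard_normal((d, n))

# Economy-size SVD: U is d x min(d, n), Vt is min(d, n) x n
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, Vt.shape)                       # (6, 4) (4, 4)

print(np.allclose(U.T @ U, np.eye(n)))         # True: orthonormal columns
print(np.allclose(U @ U.T, np.eye(d)))         # False: U is no longer square
```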
Bases of X
The columns of U and V are orthonormal bases for the column space and row space of X, respectively
Rank
If rank(X) = k < min(n, d), then
$$\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_k > \sigma_{k+1} = \ldots = \sigma_{\min(n,d)} = 0$$
and
$$X = \sigma_1 \cdot u_1 \cdot v_1^T + \sigma_2 \cdot u_2 \cdot v_2^T + \ldots + \sigma_k \cdot u_k \cdot v_k^T$$
Patterns: the terms σ_1·u_1·v_1^T, σ_2·u_2·v_2^T, ..., σ_k·u_k·v_k^T are the most important, 2nd most important, ..., k-th most important patterns of X, respectively.
Example
Truncated SVD applied to a real image
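A minimal sketch of such an experiment (the image and the retained ranks are placeholders, not the ones from the slides): keep only the k largest singular values/vectors and look at the relative approximation error.

```python
import numpy as np

# Hypothetical grayscale "image": replace with your own 2-D array,
# e.g. img = plt.imread("my_image.png")[..., 0]
x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
img = np.sin(8 * x) * np.cos(5 * y) + 0.5 * x * y

U, s, Vt = np.linalg.svd(img, full_matrices=False)

for k in (5, 20, 50):
    img_k = (U[:, :k] * s[:k]) @ Vt[:k, :]          # truncated (rank-k) SVD
    err = np.linalg.norm(img - img_k) / np.linalg.norm(img)
    print(f"rank {k}: relative error {err:.3e}")
```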
$$x_n(t) \approx \alpha_n\, y_n(t) + \beta_n$$

[Figure: scatter plot of the measurements in the (x_1, x_2) plane, showing the points (x_1(t_1), x_2(t_1)), (x_1(t_2), x_2(t_2)) and (x_1(t_3), x_2(t_3))]
The "how" of calibration (2) – Example
Example with known rank-1 subspace S
x1 (t) = α1 · y1 (t) + β1
∈ S, ∀t
x2 (t) = α2 · y2 (t) + β2
y2
• S
•
•
•
•
•
gain effect
y1
y2 •
•
• S
•
• •
•
•
•
gain effect
offset effect
y1
y1
The "how" of calibration (3) – General strategy
•
1 Removing the offset contributions
y2
• (by centering the signals)
S
Indeed...
•
d
1X
yn = yn (tj )
d
i=1
1
Pd
xn (ti ) βn
= d n=1 −
αn αn
d
• 1X
yn = yn (tj )
d
i=1
1
Pd
xn (ti ) βn
= d n=1 −
αn αn
y1
S⊥
y1
S⊥
Recall: centering an observation consists of removing its mean! If data are arranged as n observations of d features, the data points in each column are centered around the origin.
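A minimal check of this recall (the data values are arbitrary, not from the course): after removing the per-column means, every feature is centered around the origin.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 3
# n observations x d features, with non-zero feature means
X = rng.normal(loc=[5.0, -2.0, 10.0], scale=1.0, size=(n, d))

Xc = X - X.mean(axis=0)            # remove the mean of each column (feature)
print(X.mean(axis=0).round(2))     # original means, far from 0
print(Xc.mean(axis=0).round(10))   # ~[0, 0, 0]: data centered around the origin
```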
$$C = \frac{1}{n}\, X \cdot X^T$$

Finding the first principal direction consists of solving
$$\max_{\|f\|_2^2 = 1} \ \frac{1}{n}\, f^T \cdot X \cdot X^T \cdot f$$
$$= \max_{\|f\|_2^2 = 1} \ \frac{1}{n}\, f^T \cdot U \cdot \Sigma \cdot V^T \cdot (U \cdot \Sigma \cdot V^T)^T \cdot f$$
$$= \max_{\|f\|_2^2 = 1} \ \frac{1}{n}\, f^T \cdot U \cdot \Sigma \cdot V^T \cdot V \cdot \Sigma \cdot U^T \cdot f$$
$$= \max_{\|f\|_2^2 = 1} \ \frac{1}{n}\, f^T \cdot U \cdot \Sigma^2 \cdot U^T \cdot f$$

... whose solution is f = u_1, i.e., the first principal vector is the first left singular vector.

And the associated variance is
$$\frac{1}{n}\, u_1^T \cdot X \cdot X^T \cdot u_1 = \frac{\sigma_1^2}{n}.$$

Similarly for the other principal vectors!
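A minimal NumPy sketch of this link (the data are hypothetical; X is d × n and assumed centered): the leading eigenvector of C = (1/n)·X·X^T coincides with the first left singular vector u_1, and the associated variance equals σ_1²/n.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 5, 200
X = rng.standard_normal((d, 2)) @ rng.standard_normal((2, n)) \
    + 0.05 * rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)        # center (X is d features x n samples)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

C = X @ X.T / n                              # covariance matrix
eigval, eigvec = np.linalg.eigh(C)           # eigh sorts eigenvalues increasingly
f1 = eigvec[:, -1]                           # first principal direction

print(np.allclose(np.abs(f1 @ U[:, 0]), 1.0))   # f1 = +/- u_1
print(np.isclose(eigval[-1], s[0] ** 2 / n))    # variance = sigma_1^2 / n
```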
Denoting
$$A = \begin{pmatrix} a_{11} & a_{12}\\ a_{21} & a_{22} \end{pmatrix},$$
Eq. (1) reads, at each sample time,
$$\begin{cases} a_{11}\cdot s_1(t_1) + a_{12}\cdot s_2(t_1) = 5\\ a_{21}\cdot s_1(t_1) + a_{22}\cdot s_2(t_1) = 1 \end{cases} \qquad \begin{cases} a_{11}\cdot s_1(t_2) + a_{12}\cdot s_2(t_2) = 0\\ a_{21}\cdot s_1(t_2) + a_{22}\cdot s_2(t_2) = 7 \end{cases}$$
$$\begin{cases} a_{11}\cdot s_1(t_3) + a_{12}\cdot s_2(t_3) = -4.2\\ a_{21}\cdot s_1(t_3) + a_{22}\cdot s_2(t_3) = 0.7 \end{cases} \qquad \begin{cases} a_{11}\cdot s_1(t_4) + a_{12}\cdot s_2(t_4) = 2.1\\ a_{21}\cdot s_1(t_4) + a_{22}\cdot s_2(t_4) = 7.5 \end{cases}$$
and, in general,
$$\begin{cases} a_{11}\cdot s_1(t) + a_{12}\cdot s_2(t) = x_1(t)\\ a_{21}\cdot s_1(t) + a_{22}\cdot s_2(t) = x_2(t), \end{cases}$$
i.e., in matrix form,
$$\begin{pmatrix} \text{---} & x_1 & \text{---}\\ \text{---} & x_2 & \text{---} \end{pmatrix} = A \cdot \begin{pmatrix} \text{---} & s_1 & \text{---}\\ \text{---} & s_2 & \text{---} \end{pmatrix} \quad \text{or} \quad X = A \cdot S.$$

Fortunately...
In many signal processing / machine learning / high-dimensional data problems, we get several samples of the data,
→ i.e., we have a series of systems of equations!
We can use some statistical properties to invert A
A generic problem
Many applications, e.g., biomedical, audio processing, telecommunications,
astrophysics, image classification, underwater acoustics, finance, quantum
information processing
From PCA to ICA
Independent ⇒ uncorrelated (but not the converse in general)
Let us see a graphical example with uniform sources, x = A·s, with P = N = 2 and
$$A = \begin{pmatrix} -0.2485 & 0.8352\\ 0.4627 & -0.6809 \end{pmatrix}$$
Y = W · X
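A minimal sketch of this graphical example (the sample size, random seed, and the use of scikit-learn's FastICA as the separation algorithm are my own assumptions): uniform sources are mixed by the A above, then un-mixed as Y = W·X, up to permutation and scaling.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
T = 5000
S = rng.uniform(-1, 1, size=(T, 2))               # two independent uniform sources
A = np.array([[-0.2485,  0.8352],
              [ 0.4627, -0.6809]])                # mixing matrix from the slide
X = S @ A.T                                       # observed mixtures, x(t) = A s(t)

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                          # estimated sources Y = W X
                                                  # (up to permutation and scaling)
print(np.corrcoef(Y.T).round(3))                  # nearly diagonal correlation matrix
```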
Principal Component Analysis applied to a face dataset (source: Lee & Seung, 1999)
General strategy
1. Initialize the iteration number t = 1, and initialize G_1 and F_1
2. For t = 2 until a maximum number of iterations or a stopping criterion is reached:
   1. Update G s.t.
      $$\frac{1}{2}\|X - G_{t+1} F_t\|_F^2 \le \frac{1}{2}\|X - G_t F_t\|_F^2$$
   2. Update F s.t.
      $$\frac{1}{2}\|X - G_{t+1} F_{t+1}\|_F^2 \le \frac{1}{2}\|X - G_{t+1} F_t\|_F^2$$
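A minimal Python skeleton of this alternating strategy (the function and argument names, random initialization, and stopping test are mine, not from the course); the concrete update rules discussed in the following slides can be plugged in as `update_G` / `update_F`:

```python
import numpy as np

def alternating_nmf(X, k, update_G, update_F, max_iter=200, tol=1e-6, seed=0):
    """Generic alternating scheme: each update should not increase 1/2 ||X - G F||_F^2."""
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[0], k))          # G_1: random nonnegative initialization
    F = rng.random((k, X.shape[1]))          # F_1
    prev = 0.5 * np.linalg.norm(X - G @ F) ** 2
    for _ in range(max_iter):
        G = update_G(X, G, F)                # step 2.1: update G with F fixed
        F = update_F(X, G, F)                # step 2.2: update F with the new G
        loss = 0.5 * np.linalg.norm(X - G @ F) ** 2
        if prev - loss < tol * prev:         # stopping criterion: small relative progress
            break
        prev = loss
    return G, F
```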
Comments
Very fast
but not accurate!
Comments
Very easy to implement (e.g., in Matlab, lsqnonneg)
but slow!
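For this lsqnonneg-style variant, here is a minimal Python sketch using scipy.optimize.nnls (the analogue of Matlab's lsqnonneg); the column-by-column solves explain the slowness mentioned above, and the functions plug into the skeleton shown earlier (signatures are mine):

```python
import numpy as np
from scipy.optimize import nnls

def update_F_nnls(X, G, F):
    # Solve min_{f >= 0} ||G f - x_j|| for each column x_j of X
    # (the previous F is ignored; kept in the signature to match the skeleton)
    return np.column_stack([nnls(G, X[:, j])[0] for j in range(X.shape[1])])

def update_G_nnls(X, G, F):
    # Same problem on the transposed system: each row of X gives one row of G
    return np.column_stack([nnls(F.T, X[i, :])[0] for i in range(X.shape[0])]).T
```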
Gradient descent
Update using gradient descent, e.g., G_{t+1} = G_t − ν · ∇_G J(G_t, F_t), where
$$J(G, F) = \frac{1}{2}\|X - GF\|_F^2$$
$$\nabla_G J(G, F) = G \cdot F \cdot F^T - X \cdot F^T$$
ν is a step size (weight)
Replace negative entries by zero or a small positive threshold

Comments
Choice of ν?
→ It is possible to find an optimal ν (but it takes some time)
Alternative: replace classical gradient descent by extrapolation (Guan et al., 2012)
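A minimal sketch of this projected gradient update (the fixed step size ν is chosen by hand here, which is exactly the difficulty raised above); these functions also plug into the earlier skeleton:

```python
import numpy as np

def update_G_pgd(X, G, F, nu=1e-3):
    grad_G = G @ F @ F.T - X @ F.T            # gradient of J w.r.t. G
    return np.maximum(G - nu * grad_G, 0.0)   # gradient step, then clip negatives to zero

def update_F_pgd(X, G, F, nu=1e-3):
    grad_F = G.T @ G @ F - G.T @ X            # gradient of J w.r.t. F
    return np.maximum(F - nu * grad_F, 0.0)
```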
Multiplicative updates
Gradient descent with a well-chosen weight, so that the additive update is replaced by a product

Proof
Gradient descent:
$$\forall i, j, \quad F_{ij} = F_{ij} + \nu_{ij}\left[(G^T \cdot X)_{ij} - (G^T \cdot G \cdot F)_{ij}\right]$$
We set $\nu_{ij} = \dfrac{F_{ij}}{(G^T \cdot G \cdot F)_{ij}}$ and we obtain
$$\forall i, j, \quad F_{ij} = F_{ij} + \frac{F_{ij}}{(G^T \cdot G \cdot F)_{ij}}\left[(G^T \cdot X)_{ij} - (G^T \cdot G \cdot F)_{ij}\right] = F_{ij} \cdot \left(1 + \frac{(G^T \cdot X)_{ij}}{(G^T \cdot G \cdot F)_{ij}} - 1\right) = F_{ij} \cdot \frac{(G^T \cdot X)_{ij}}{(G^T \cdot G \cdot F)_{ij}}$$

Principle of the "heuristic" method:
$$G_{t+1} = G_t \circ \frac{\nabla J^-(G_t, F_t)}{\nabla J^+(G_t, F_t)}, \quad \text{i.e.,} \quad G_{t+1} = G_t \circ \frac{X \cdot F_t^T}{G_t \cdot F_t \cdot F_t^T}$$
where
$$\nabla_G J(G, F) = \underbrace{G \cdot F \cdot F^T}_{\nabla J^+(G,F)} - \underbrace{X \cdot F^T}_{\nabla J^-(G,F)}$$
◦ and the division are elementwise operations (.* and ./ in Matlab)
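The corresponding updates in Python are a direct transcription (the small epsilon added to the denominators to avoid division by zero is a common implementation detail, not shown on the slide); they plug into the same alternating skeleton:

```python
import numpy as np

EPS = 1e-12  # avoids division by zero (implementation detail, not on the slide)

def update_G_mu(X, G, F):
    return G * (X @ F.T) / (G @ F @ F.T + EPS)      # G <- G .* (X F^T) ./ (G F F^T)

def update_F_mu(X, G, F):
    return F * (G.T @ X) / (G.T @ G @ F + EPS)      # F <- F .* (G^T X) ./ (G^T G F)

# Example use with the alternating_nmf skeleton sketched earlier (hypothetical data):
# rng = np.random.default_rng(0)
# X = rng.random((50, 40))
# G, F = alternating_nmf(X, k=5, update_G=update_G_mu, update_F=update_F_mu)
```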
Comments
Non-negativity of G and F is preserved along the iterations
Easy to implement
But very slow when X is large!
Hands on: Hyperspectral unmixing using NMF
We are now going to see an application of NMF. The content of the next
slides is inspired by:
M. Puigt, O. Berné, R. Guidara, Y. Deville, S. Hosseini, C. Joblin:
Cross-validation of blindly separated interstellar dust spectra, Proc. of
ECMS 2009, pp. 41–48, Mondragon, Spain, July 8-10, 2009
Problem Statement (1)

Interstellar medium
Lies between stars in our galaxy
Concentrated in dust clouds, which play a major role in the evolution of galaxies

Interstellar dust
Absorbs UV light and re-emits it in the IR domain
Several grain populations in Photo-Dissociation Regions (PDRs): Polycyclic Aromatic Hydrocarbons, Very Small Grains, Big Grains
The Spitzer IR spectrograph provides hyperspectral datacubes
$$x_{(n,m)}(\lambda) = \sum_{j=1}^{N} a_{(n,m),j}\, s_j(\lambda)$$
→ Blind Source Separation (BSS)
[Figure: observed hyperspectral data and FastICA separation results (image credit: R. Croman, www.rc-astro.com)]
$$x_{(n,m)}(\lambda) = \sum_{j=1}^{N} a_{(n,m),j}\, s_j(\lambda) \quad \longrightarrow \quad y_k(\lambda) = \eta_j\, s_j(\lambda)$$
Conclusion
1 Cross-validation of separated spectra with various BSS methods
Very similar results across all BSS methods
Physically relevant
2 Distribution maps provide another validation of the separation step
Spatial distribution not used in the separation step
Physically relevant
Your turn!
Joint lab subject
→ Lab report to write!