Matthieu Puigt
1 Introduction
2 Low-rank matrices
7 Conclusion
Examples
In healthcare, each person—aka observation—may be described by their height,
weight, temperature, blood pressure, heart rate, etc. (features)
In social science, a researcher may ask some volunteers—aka
observations—tens to hundreds of questions—aka features—which are used to
derive some properties
In movie recommendation systems, each customer may grade hundreds of
movies
Analogy
Let's imagine you play Mario's enemies and you want to catch him... in 1D, in 2D, in 3D.
→ Increasing the dimension in which Mario can move makes your problem much harder
→ Increasing the number of Mario's enemies makes the task easier (provided it is possible to have a good AI to help you play all the characters)
Fortunately!
Many real problems involve structured data
They tend to be low-rank
Approximate Low-rankness
Key property in signal & image processing and in machine learning
The data matrix / tensor can be explained by a limited number of
hidden/latent variables (with a limited/negligible approximation error)
X as a matrix product
$$X = W \cdot H^T = \begin{pmatrix} | & & | \\ w_1 & \cdots & w_k \\ | & & | \end{pmatrix} \cdot \begin{pmatrix} \text{---} & h_1^T & \text{---} \\ & \vdots & \\ \text{---} & h_k^T & \text{---} \end{pmatrix}$$
Equivalently, $X = w_1 \cdot h_1^T + w_2 \cdot h_2^T + \ldots + w_k \cdot h_k^T$, i.e., a sum of k rank-1 matrices.
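As a small illustration (not from the original slides; dimensions and the NumPy code are my own sketch), one can build a rank-k matrix X = W·H^T from hypothetical factors and check that the block product and the sum of rank-1 terms coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 50, 40, 3                    # hypothetical dimensions

W = rng.standard_normal((d, k))        # columns w_1, ..., w_k
H = rng.standard_normal((n, k))        # columns h_1, ..., h_k

X = W @ H.T                            # X = W · H^T  (d x n, rank k)
print(np.linalg.matrix_rank(X))        # -> 3

# X is also a sum of k rank-1 matrices w_i · h_i^T
X_sum = sum(np.outer(W[:, i], H[:, i]) for i in range(k))
print(np.allclose(X, X_sum))           # -> True
```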
Clustering (K-means)
X ≈ W · H^T, where each column of H^T (i.e., each membership vector) is all zeros except for a single entry equal to one
→ W contains the centroids and H^T is a cluster membership matrix
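A minimal sketch of this view (the data, dimensions, and use of scikit-learn's KMeans are my own assumptions, not part of the slides): the centroids give W and the one-hot encoded labels give H^T.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, d, k = 300, 2, 3
# three hypothetical clusters in the plane (observations as rows here)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(n // k, d))
                  for c in ([0, 0], [3, 0], [0, 3])])

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)

W = km.cluster_centers_.T                  # d x k: centroids as columns
Ht = np.zeros((k, data.shape[0]))          # k x n: cluster membership matrix
Ht[km.labels_, np.arange(data.shape[0])] = 1.0   # one-hot columns

X = data.T                                 # d x n, observations as columns
print(np.linalg.norm(X - W @ Ht) / np.linalg.norm(X))  # small relative residual
```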
Denoising
Y = X + E, where X contains some dominant patterns (X low-rank) and E does not (noise)
Y ≈ WH^T → WH^T is closer to X than to Y
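A minimal NumPy sketch of this idea (the dimensions, noise level, and the use of a truncated SVD to obtain the rank-k approximation are my own choices): a low-rank X is corrupted by noise, and the best rank-k approximation of Y ends up closer to X than to Y.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 60, 80, 2
X = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))   # low-rank "signal"
E = 0.1 * rng.standard_normal((d, n))                            # noise
Y = X + E

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
WHt = (U[:, :k] * s[:k]) @ Vt[:k, :]     # best rank-k approximation of Y

print(np.linalg.norm(WHt - X))           # small: WH^T recovers the clean patterns
print(np.linalg.norm(WHt - Y))           # larger: the discarded part is mostly noise
```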
In 2021 (source: Statista):
500 hours of YouTube video uploaded
695,000 photos shared on Instagram
197 million emails sent
X = U · Σ · V^T
where:
U is a d × d matrix with orthonormal columns (left singular vectors)
V is an n × n matrix with orthonormal columns (right singular vectors)
Σ is a d × n diagonal matrix with σ_1 ≥ 0 and σ_i ≥ σ_j if i < j (singular values)
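A quick NumPy check of these definitions (a sketch with hypothetical dimensions):

```python
import numpy as np

d, n = 6, 4
X = np.random.default_rng(3).standard_normal((d, n))

# Full SVD: U is d x d, Vt is n x n, and s holds the min(d, n) singular values
U, s, Vt = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vt.shape)          # (6, 6) (4,) (4, 4)
print(np.all(s[:-1] >= s[1:]))             # singular values are sorted: True

# Rebuild the d x n diagonal matrix Sigma and check X = U @ Sigma @ V^T
Sigma = np.zeros((d, n))
Sigma[:n, :n] = np.diag(s)
print(np.allclose(X, U @ Sigma @ Vt))      # True
```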
$$U\Sigma = \begin{pmatrix} u_{11} & \cdots & u_{1n} & u_{1,n+1} & \cdots & u_{1d}\\ u_{21} & \cdots & u_{2n} & u_{2,n+1} & \cdots & u_{2d}\\ \vdots & & \vdots & \vdots & & \vdots\\ u_{d1} & \cdots & u_{dn} & u_{d,n+1} & \cdots & u_{dd} \end{pmatrix} \cdot \begin{pmatrix} \sigma_1 & 0 & \cdots & 0\\ 0 & \sigma_2 & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & \sigma_n\\ 0 & \cdots & \cdots & 0\\ \vdots & & & \vdots\\ 0 & \cdots & \cdots & 0 \end{pmatrix}$$

→ The last d − n columns of U only multiply zeros: they do not contribute to U · Σ.
Orthonormality
UU^T = I, U^T U = I, VV^T = I, V^T V = I
Be careful! There might be some differences in the dimension of the identity matrices (esp. with the econ. SVD)
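This warning can be checked directly (a minimal sketch with assumed dimensions): with the economy-size SVD, U is no longer square, so only U^T·U is an identity matrix.

```python
import numpy as np

d, n = 6, 4
X = np.random.default_rng(4).standard_normal((d, n))

# Economy-size SVD: U is d x min(d, n), Vt is min(d, n) x n
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, Vt.shape)                       # (6, 4) (4, 4)

print(np.allclose(U.T @ U, np.eye(n)))         # True: orthonormal columns
print(np.allclose(U @ U.T, np.eye(d)))         # False: U is no longer square
```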
Bases of X
The columns of U and V are orthonormal bases for the column space and row space of X, respectively
Rank
If rank(X) = k < min(n, d), then
$$\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_k > \sigma_{k+1} = \ldots = \sigma_{\min(n,d)} = 0$$
and
$$X = \sigma_1 \cdot u_1 \cdot v_1^T + \sigma_2 \cdot u_2 \cdot v_2^T + \ldots + \sigma_k \cdot u_k \cdot v_k^T$$
Patterns: the terms σ_1·u_1·v_1^T, σ_2·u_2·v_2^T, ..., σ_k·u_k·v_k^T are the most important, 2nd most important, ..., k-th most important patterns of X, respectively.
Example
Truncated SVD applied to a real image
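A minimal sketch of such an experiment (the image and the retained ranks are placeholders, not the ones from the slides): keep only the k largest singular values/vectors and look at the relative approximation error.

```python
import numpy as np

# Hypothetical grayscale "image": replace with your own 2-D array,
# e.g. img = plt.imread("my_image.png")[..., 0]
x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
img = np.sin(8 * x) * np.cos(5 * y) + 0.5 * x * y

U, s, Vt = np.linalg.svd(img, full_matrices=False)

for k in (5, 20, 50):
    img_k = (U[:, :k] * s[:k]) @ Vt[:k, :]          # truncated (rank-k) SVD
    err = np.linalg.norm(img - img_k) / np.linalg.norm(img)
    print(f"rank {k}: relative error {err:.3e}")
```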
$$x_n(t) \approx \alpha_n\, y_n(t) + \beta_n$$

[Figure: scatter plot of the measurements in the (x_1, x_2) plane, showing the points (x_1(t_1), x_2(t_1)), (x_1(t_2), x_2(t_2)) and (x_1(t_3), x_2(t_3))]
The "how" of calibration (2) – Example
Example with known rank-1 subspace S
x1 (t) = α1 · y1 (t) + β1
∈ S, ∀t
x2 (t) = α2 · y2 (t) + β2
y2
• S
•
•
•
•
•
gain effect
y1
y2 •
•
• S
•
• •
•
•
•
gain effect
offset effect
y1
y1
The "how" of calibration (3) – General strategy
•
1 Removing the offset contributions
y2
• (by centering the signals)
S
Indeed...
•
d
1X
yn = yn (tj )
d
i=1
1
Pd
xn (ti ) βn
= d n=1 −
αn αn
d
• 1X
yn = yn (tj )
d
i=1
1
Pd
xn (ti ) βn
= d n=1 −
αn αn
y1
S⊥
y1
S⊥
Recall: centering an observation consists of removing its mean! If data are arranged as n observations of d features, the data points in each column are centered around the origin.
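A minimal check of this recall (the data values are arbitrary, not from the course): after removing the per-column means, every feature is centered around the origin.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 3
# n observations x d features, with non-zero feature means
X = rng.normal(loc=[5.0, -2.0, 10.0], scale=1.0, size=(n, d))

Xc = X - X.mean(axis=0)            # remove the mean of each column (feature)
print(X.mean(axis=0).round(2))     # original means, far from 0
print(Xc.mean(axis=0).round(10))   # ~[0, 0, 0]: data centered around the origin
```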
$$C = \frac{1}{n}\, X \cdot X^T$$

Finding the first principal direction consists of solving
$$\max_{\|f\|_2^2 = 1} \ \frac{1}{n}\, f^T \cdot X \cdot X^T \cdot f$$
$$= \max_{\|f\|_2^2 = 1} \ \frac{1}{n}\, f^T \cdot U \cdot \Sigma \cdot V^T \cdot (U \cdot \Sigma \cdot V^T)^T \cdot f$$
$$= \max_{\|f\|_2^2 = 1} \ \frac{1}{n}\, f^T \cdot U \cdot \Sigma \cdot V^T \cdot V \cdot \Sigma \cdot U^T \cdot f$$
$$= \max_{\|f\|_2^2 = 1} \ \frac{1}{n}\, f^T \cdot U \cdot \Sigma^2 \cdot U^T \cdot f$$

... whose solution is f = u_1, i.e., the first principal vector is the first left singular vector.

And the associated variance is
$$\frac{1}{n}\, u_1^T \cdot X \cdot X^T \cdot u_1 = \frac{\sigma_1^2}{n}.$$

Similarly for the other principal vectors!
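A minimal NumPy sketch of this link (the data are hypothetical; X is d × n and assumed centered): the leading eigenvector of C = (1/n)·X·X^T coincides with the first left singular vector u_1, and the associated variance equals σ_1²/n.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 5, 200
X = rng.standard_normal((d, 2)) @ rng.standard_normal((2, n)) \
    + 0.05 * rng.standard_normal((d, n))
X = X - X.mean(axis=1, keepdims=True)        # center (X is d features x n samples)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

C = X @ X.T / n                              # covariance matrix
eigval, eigvec = np.linalg.eigh(C)           # eigh sorts eigenvalues increasingly
f1 = eigvec[:, -1]                           # first principal direction

print(np.allclose(np.abs(f1 @ U[:, 0]), 1.0))   # f1 = +/- u_1
print(np.isclose(eigval[-1], s[0] ** 2 / n))    # variance = sigma_1^2 / n
```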
Denoting
$$A = \begin{pmatrix} a_{11} & a_{12}\\ a_{21} & a_{22} \end{pmatrix},$$
Eq. (1) reads, at each sample time,
$$\begin{cases} a_{11}\cdot s_1(t_1) + a_{12}\cdot s_2(t_1) = 5\\ a_{21}\cdot s_1(t_1) + a_{22}\cdot s_2(t_1) = 1 \end{cases} \qquad \begin{cases} a_{11}\cdot s_1(t_2) + a_{12}\cdot s_2(t_2) = 0\\ a_{21}\cdot s_1(t_2) + a_{22}\cdot s_2(t_2) = 7 \end{cases}$$
$$\begin{cases} a_{11}\cdot s_1(t_3) + a_{12}\cdot s_2(t_3) = -4.2\\ a_{21}\cdot s_1(t_3) + a_{22}\cdot s_2(t_3) = 0.7 \end{cases} \qquad \begin{cases} a_{11}\cdot s_1(t_4) + a_{12}\cdot s_2(t_4) = 2.1\\ a_{21}\cdot s_1(t_4) + a_{22}\cdot s_2(t_4) = 7.5 \end{cases}$$
and, in general,
$$\begin{cases} a_{11}\cdot s_1(t) + a_{12}\cdot s_2(t) = x_1(t)\\ a_{21}\cdot s_1(t) + a_{22}\cdot s_2(t) = x_2(t), \end{cases}$$
i.e., in matrix form,
$$\begin{pmatrix} \text{---} & x_1 & \text{---}\\ \text{---} & x_2 & \text{---} \end{pmatrix} = A \cdot \begin{pmatrix} \text{---} & s_1 & \text{---}\\ \text{---} & s_2 & \text{---} \end{pmatrix} \quad \text{or} \quad X = A \cdot S.$$

Fortunately...
In many signal processing / machine learning / high-dimensional data problems, we get several samples of the data,
→ i.e., we have a series of systems of equations!
We can use some statistical properties to invert A
A generic problem
Many applications, e.g., biomedical, audio processing, telecommunications,
astrophysics, image classification, underwater acoustics, finance, quantum
information processing
From PCA to ICA
Independent ⇒ uncorrelated (but not the converse in general)
Let us see a graphical example with uniform sources, x = A·s, with P = N = 2 and
$$A = \begin{pmatrix} -0.2485 & 0.8352\\ 0.4627 & -0.6809 \end{pmatrix}$$
Y = W · X
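A minimal sketch of this graphical example (the sample size, random seed, and the use of scikit-learn's FastICA as the separation algorithm are my own assumptions): uniform sources are mixed by the A above, then un-mixed as Y = W·X, up to permutation and scaling.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
T = 5000
S = rng.uniform(-1, 1, size=(T, 2))               # two independent uniform sources
A = np.array([[-0.2485,  0.8352],
              [ 0.4627, -0.6809]])                # mixing matrix from the slide
X = S @ A.T                                       # observed mixtures, x(t) = A s(t)

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                          # estimated sources Y = W X
                                                  # (up to permutation and scaling)
print(np.corrcoef(Y.T).round(3))                  # nearly diagonal correlation matrix
```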
Principal Component Analysis applied to a face dataset (source: Lee & Seung, 1999)
General strategy
1. Initialize the iteration number t = 1, and initialize G_1 and F_1
2. For t = 2 until a maximum number of iterations or a stopping criterion is reached:
   1. Update G s.t.
      $$\frac{1}{2}\|X - G_{t+1} F_t\|_F^2 \le \frac{1}{2}\|X - G_t F_t\|_F^2$$
   2. Update F s.t.
      $$\frac{1}{2}\|X - G_{t+1} F_{t+1}\|_F^2 \le \frac{1}{2}\|X - G_{t+1} F_t\|_F^2$$
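A minimal Python skeleton of this alternating strategy (the function and argument names, random initialization, and stopping test are mine, not from the course); the concrete update rules discussed in the following slides can be plugged in as `update_G` / `update_F`:

```python
import numpy as np

def alternating_nmf(X, k, update_G, update_F, max_iter=200, tol=1e-6, seed=0):
    """Generic alternating scheme: each update should not increase 1/2 ||X - G F||_F^2."""
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[0], k))          # G_1: random nonnegative initialization
    F = rng.random((k, X.shape[1]))          # F_1
    prev = 0.5 * np.linalg.norm(X - G @ F) ** 2
    for _ in range(max_iter):
        G = update_G(X, G, F)                # step 2.1: update G with F fixed
        F = update_F(X, G, F)                # step 2.2: update F with the new G
        loss = 0.5 * np.linalg.norm(X - G @ F) ** 2
        if prev - loss < tol * prev:         # stopping criterion: small relative progress
            break
        prev = loss
    return G, F
```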
Comments
Very fast
but not accurate!
Comments
Very easy to implement (e.g., in Matlab, lsqnonneg)
but slow!
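For this lsqnonneg-style variant, here is a minimal Python sketch using scipy.optimize.nnls (the analogue of Matlab's lsqnonneg); the column-by-column solves explain the slowness mentioned above, and the functions plug into the skeleton shown earlier (signatures are mine):

```python
import numpy as np
from scipy.optimize import nnls

def update_F_nnls(X, G, F):
    # Solve min_{f >= 0} ||G f - x_j|| for each column x_j of X
    # (the previous F is ignored; kept in the signature to match the skeleton)
    return np.column_stack([nnls(G, X[:, j])[0] for j in range(X.shape[1])])

def update_G_nnls(X, G, F):
    # Same problem on the transposed system: each row of X gives one row of G
    return np.column_stack([nnls(F.T, X[i, :])[0] for i in range(X.shape[0])]).T
```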
Gradient descent
Update using gradient descent, e.g., G_{t+1} = G_t − ν · ∇_G J(G_t, F_t), where
$$J(G, F) = \frac{1}{2}\|X - GF\|_F^2$$
$$\nabla_G J(G, F) = G \cdot F \cdot F^T - X \cdot F^T$$
ν is a step size (weight)
Replace negative entries by zero or a small positive threshold

Comments
Choice of ν?
→ It is possible to find an optimal ν (but it takes some time)
Alternative: replace classical gradient descent by extrapolation (Guan et al., 2012)
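A minimal sketch of this projected gradient update (the fixed step size ν is chosen by hand here, which is exactly the difficulty raised above); these functions also plug into the earlier skeleton:

```python
import numpy as np

def update_G_pgd(X, G, F, nu=1e-3):
    grad_G = G @ F @ F.T - X @ F.T            # gradient of J w.r.t. G
    return np.maximum(G - nu * grad_G, 0.0)   # gradient step, then clip negatives to zero

def update_F_pgd(X, G, F, nu=1e-3):
    grad_F = G.T @ G @ F - G.T @ X            # gradient of J w.r.t. F
    return np.maximum(F - nu * grad_F, 0.0)
```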
Multiplicative updates
Gradient descent with a well-chosen weight, so that the additive update is replaced by a product

Proof
Gradient descent:
$$\forall i, j, \quad F_{ij} = F_{ij} + \nu_{ij}\left[(G^T \cdot X)_{ij} - (G^T \cdot G \cdot F)_{ij}\right]$$
We set $\nu_{ij} = \dfrac{F_{ij}}{(G^T \cdot G \cdot F)_{ij}}$ and we obtain
$$\forall i, j, \quad F_{ij} = F_{ij} + \frac{F_{ij}}{(G^T \cdot G \cdot F)_{ij}}\left[(G^T \cdot X)_{ij} - (G^T \cdot G \cdot F)_{ij}\right] = F_{ij} \cdot \left(1 + \frac{(G^T \cdot X)_{ij}}{(G^T \cdot G \cdot F)_{ij}} - 1\right) = F_{ij} \cdot \frac{(G^T \cdot X)_{ij}}{(G^T \cdot G \cdot F)_{ij}}$$

Principle of the "heuristic" method:
$$G_{t+1} = G_t \circ \frac{\nabla J^-(G_t, F_t)}{\nabla J^+(G_t, F_t)}, \quad \text{i.e.,} \quad G_{t+1} = G_t \circ \frac{X \cdot F_t^T}{G_t \cdot F_t \cdot F_t^T}$$
where
$$\nabla_G J(G, F) = \underbrace{G \cdot F \cdot F^T}_{\nabla J^+(G,F)} - \underbrace{X \cdot F^T}_{\nabla J^-(G,F)}$$
◦ and the division are elementwise operations (.* and ./ in Matlab)
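The corresponding updates in Python are a direct transcription (the small epsilon added to the denominators to avoid division by zero is a common implementation detail, not shown on the slide); they plug into the same alternating skeleton:

```python
import numpy as np

EPS = 1e-12  # avoids division by zero (implementation detail, not on the slide)

def update_G_mu(X, G, F):
    return G * (X @ F.T) / (G @ F @ F.T + EPS)      # G <- G .* (X F^T) ./ (G F F^T)

def update_F_mu(X, G, F):
    return F * (G.T @ X) / (G.T @ G @ F + EPS)      # F <- F .* (G^T X) ./ (G^T G F)

# Example use with the alternating_nmf skeleton sketched earlier (hypothetical data):
# rng = np.random.default_rng(0)
# X = rng.random((50, 40))
# G, F = alternating_nmf(X, k=5, update_G=update_G_mu, update_F=update_F_mu)
```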
Comments
Non-negativity of G and F is preserved along the iterations
Easy to implement
But very slow when X is large!
Hands on: Hyperspectral unmixing using NMF
We are now going to see an application of NMF. The content of the next
slides is inspired by:
M. Puigt, O. Berné, R. Guidara, Y. Deville, S. Hosseini, C. Joblin:
Cross-validation of blindly separated interstellar dust spectra, Proc. of
ECMS 2009, pp. 41–48, Mondragon, Spain, July 8-10, 2009
Problem Statement (1)

Interstellar medium
Lies between stars in our galaxy
Concentrated in dust clouds, which play a major role in the evolution of galaxies

Interstellar dust
Absorbs UV light and re-emits it in the IR domain
Several grain populations in Photo-Dissociation Regions (PDRs): Polycyclic Aromatic Hydrocarbons, Very Small Grains, Big Grains
The Spitzer IR spectrograph provides hyperspectral datacubes
$$x_{(n,m)}(\lambda) = \sum_{j=1}^{N} a_{(n,m),j}\, s_j(\lambda)$$
→ Blind Source Separation (BSS)
[Figure: observed hyperspectral data and FastICA separation results (image credit: R. Croman, www.rc-astro.com)]
$$x_{(n,m)}(\lambda) = \sum_{j=1}^{N} a_{(n,m),j}\, s_j(\lambda) \quad \longrightarrow \quad y_k(\lambda) = \eta_j\, s_j(\lambda)$$
Conclusion
1 Cross-validation of separated spectra with various BSS methods
Very similar results across all BSS methods
Physically relevant
2 Distribution maps provide another validation of the separation step
Spatial distribution not used in the separation step
Physically relevant
Your turn!
Joint lab subject
→ Lab report to write!