
A short tutorial on blind source separation

Fabian J. Theis
Institute of Biophysics, University of Regensburg, Germany (fabian@theis.name)

TUAT, Tokyo, 28-3-2005


Why do we need BSS?

e.g. for the analysis of biomedical signals such as EEG, MEG or fMRI recordings

[Figure: multichannel EEG recording over time (electrodes Fp1, Fp2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T3, T4, T5, T6, Fz, Cz, Pz, A1, A2), together with independent signals extracted from it, such as eye movement, sensor noise and heart beat.]


Outline

Motivation

Basics
  Probability theory
  Information theory

Linear blind source separation
  Principal component analysis
  Independent component analysis
  Maximization of non-Gaussianity
  Second-order BSS using time structure

Conclusions


A one-page primer on probability theory

main object: random variable/vector X
  definition: a measurable function on a probability space, determined a.e. by its density $p_X : \mathbb{R}^n \to [0, \infty)$

properties of a probability density function (pdf)
  normalization: $\int_{\mathbb{R}^n} p_X(x)\,dx = 1$
  transformation: $p_{AX}(y) = |\det A|^{-1}\, p_X(A^{-1} y)$

indices derived from densities (probabilistic quantities)
  expectation or mean: $E(X) = \int_{\mathbb{R}^n} x\, p_X(x)\,dx$
  covariance: $\mathrm{Cov}(X) = E\big((X - E(X))(X - E(X))^\top\big)$

decorrelation and independence
  X is decorrelated if Cov(X) is diagonal, and white if Cov(X) = I
  X is independent if its density factorizes: $p_X(x_1, \ldots, x_n) = p_{X_1}(x_1) \cdots p_{X_n}(x_n)$
  independent $\Rightarrow$ decorrelated (but not vice versa in general)
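The last implication can be illustrated numerically; the following is a minimal NumPy sketch (not from the tutorial) of a pair that is decorrelated yet clearly dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# S1 symmetric uniform, S2 a centered deterministic function of S1:
# E(S1 * S2) = E(S1^3) = 0, so the pair is decorrelated, but it is not independent.
s1 = rng.uniform(-1, 1, n)
s2 = s1**2 - np.mean(s1**2)
S = np.vstack([s1, s2])

print("covariance (nearly diagonal):\n", np.cov(S).round(4))
# the dependence shows up in higher-order statistics:
print("E(S1^2 S2) =", np.mean(s1**2 * s2).round(4),
      " vs  E(S1^2) E(S2) =", (np.mean(s1**2) * np.mean(s2)).round(4))
```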


[Figure: plots of three example densities.]

uniform: $p_X = \frac{1}{\mathrm{vol}(K)} \mathbf{1}_K$
Laplacian: $p_X(x) = c \exp\left(-\sum_{i=1}^n |x_i|\right)$
Gaussian: $p_X(x) = c \exp\left(-\tfrac{1}{2}(x - \mu)^\top C^{-1} (x - \mu)\right)$

properties of a Gaussian X
  X decorrelated $\Leftrightarrow$ X independent
  AX is Gaussian
  X independent $\Rightarrow$ AX independent if A is orthogonal
  normalization constant: $c = \frac{1}{\sqrt{(2\pi)^n \det C}}$
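These Gaussian properties are the reason why ICA will later need non-Gaussianity; a small numerical sketch of the last two points (my own illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
theta = np.deg2rad(30)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # orthogonal mixing

sources = {
    "Gaussian": rng.standard_normal((2, n)),                      # white Gaussian
    "uniform":  rng.uniform(-np.sqrt(3), np.sqrt(3), (2, n)),     # white uniform
}
for name, S in sources.items():
    X = A @ S
    cov_off = np.cov(X)[0, 1]                    # stays ~0 in both cases
    # higher-order cross statistic: E(X1^2 X2^2) - 1 vanishes for independent white pairs
    dep = np.mean(X[0]**2 * X[1]**2) - 1.0
    print(f"{name:8s} off-diag cov = {cov_off:+.3f}   E(X1^2 X2^2) - 1 = {dep:+.3f}")
```

For the Gaussian sources the rotated data remain white and independent (the mixing is invisible); for the uniform sources the rotation introduces measurable higher-order dependence.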


The second page...

higher-order moments
  central moment of a random variable X (n = 1): $\mu_j(X) := E\big((X - E(X))^j\big)$
  the first moment $E(X)$ is the mean, and $\mu_2(X) = \mathrm{Cov}(X) =: \mathrm{var}(X)$ is the variance
  $\mu_3(X)$ is called the skewness; it measures asymmetry ($\mu_3(X) = 0$ if X is symmetric)

kurtosis
  the combination of moments $\mathrm{kurt}(X) := E(X^4) - 3\big(E(X^2)\big)^2$ is called the kurtosis of X
  kurt(X) = 0 if X is Gaussian, < 0 if uniform and > 0 if Laplacian

sampling
  in practice the density is unknown; only some samples, i.e. values of the random function, are given
  given independent $(X_i)_{i=1,\ldots,n}$ with the same density p, then $X_1(\omega), \ldots, X_n(\omega)$ for some event $\omega$ are called i.i.d. samples of p
  strong law of large numbers: given a pairwise i.i.d. sequence $(X_i)_{i \in \mathbb{N}}$ in $L^1(\Omega)$, then (for almost all $\omega$) $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i(\omega) - E(X_1) = 0$
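The claimed signs of the kurtosis are easy to verify on samples; a minimal sketch (assuming unit-variance parametrizations of the three densities):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

def kurt(x):
    """Sample version of kurt(X) = E(X^4) - 3 (E(X^2))^2 for centered data."""
    x = x - x.mean()
    return np.mean(x**4) - 3 * np.mean(x**2) ** 2

samples = {
    "Gaussian":  rng.standard_normal(n),
    "uniform":   rng.uniform(-np.sqrt(3), np.sqrt(3), n),    # unit variance
    "Laplacian": rng.laplace(scale=1 / np.sqrt(2), size=n),  # unit variance
}
for name, x in samples.items():
    print(f"{name:9s} kurtosis = {kurt(x):+.3f}")
# expected signs: ~0 for the Gaussian, ~-1.2 for the uniform, ~+3 for the Laplacian
```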

Kurtosis example

[Figure: densities of three random variables with different kurtosis.]

random variables with different kurtosis
  blue densities: centered white Gaussian (kurt = 0)
  left: Laplacian with $\lambda = \sqrt{2}$
  middle: uniform density in $[-\sqrt{3}, \sqrt{3}]$, kurtosis $-1.2$
  right: subgaussian random variable $X := \cos(Y)$ with Y uniform in $[-\pi, \pi]$, kurtosis $-3/8$

An even shorter introduction to information theory

entropy
  $H(X) := -E_X(\log_2 p_X)$ is called the (differential) entropy of X
  transformation: $H(AX) = H(X) + E_X(\log|\det A|)$
  given X, let $X_{\mathrm{gauss}}$ be the Gaussian with mean E(X) and covariance Cov(X); then $H(X_{\mathrm{gauss}}) \geq H(X)$

negentropy
  the negentropy of X is defined by $J(X) := H(X_{\mathrm{gauss}}) - H(X)$
  transformation: $J(AX) = J(X)$
  approximation in 1d: $J(X) = \frac{1}{12} E(X^3)^2 + \frac{1}{48} \mathrm{kurt}(X)^2 + \ldots$

information
  $I(X) := \sum_{i=1}^n H(X_i) - H(X)$ is called the mutual information of X
  $I(X) \geq 0$, and $I(X) = 0$ if and only if X is independent
  transformation: $I(LPX + c) = I(X)$ for a scaling L, permutation P and translation $c \in \mathbb{R}^n$
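The 1d negentropy approximation can be evaluated directly from samples. A small sketch (the standardization inside the helper is my addition, since the formula assumes centered, unit-variance data):

```python
import numpy as np

def negentropy_approx(x):
    """Moment-based 1d negentropy approximation
    J(X) ~ E(X^3)^2 / 12 + kurt(X)^2 / 48 on standardized samples."""
    x = (x - x.mean()) / x.std()
    skew_term = np.mean(x**3) ** 2 / 12.0
    kurt = np.mean(x**4) - 3.0
    return skew_term + kurt**2 / 48.0

rng = np.random.default_rng(3)
n = 500_000
print("Gaussian :", round(negentropy_approx(rng.standard_normal(n)), 4))   # ~0
print("uniform  :", round(negentropy_approx(rng.uniform(-1, 1, n)), 4))    # > 0
print("Laplacian:", round(negentropy_approx(rng.laplace(size=n)), 4))      # > 0
```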



Example use of ICA for performing BSS

[Figure: two source signals (source 1, source 2), their two mixtures (mixture 1, mixture 2), and the sources recovered by ICA (output 1, output 2).]


Linear BSS

Blind source separation (BSS) problem: X = AS + N
  X: observed m-dimensional random vector
  A: (unknown) full-rank real matrix
  S: (unknown) n-dimensional source signals, $n \leq m$
  N: (unknown) noise (often assumed to be white Gaussian)

goal: recover the unknown A and S given only X
additional assumptions are necessary
  without them, the problem is ill-posed
  depending on the assumptions: FA, PCA, ICA, SCA, NMF

remark: often N = 0, also in the following
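To make the model concrete, here is a minimal sketch (my own toy setup, not data from the slides) that generates the BSS model X = AS + N for two sources:

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples = 10_000
t = np.arange(n_samples)

# two independent, non-Gaussian toy sources (a sine wave and uniform noise)
S = np.vstack([np.sin(2 * np.pi * t / 200),
               rng.uniform(-np.sqrt(3), np.sqrt(3), n_samples)])

A = np.array([[1.0, 0.6],        # some full-rank mixing matrix (unknown in practice)
              [0.4, 1.0]])
N = 0.01 * rng.standard_normal(S.shape)   # small white Gaussian noise

X = A @ S + N                     # the observed mixtures: X = AS + N
print("X shape:", X.shape)
print("Cov(X):\n", np.cov(X).round(3))
```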


Principal component analysis

principal component analysis (PCA)
  also called the Karhunen-Loève transformation
  a very common multivariate data analysis tool
  transforms the data into a feature space, where a few main features (principal components) make up most of the data
  iteratively projects into the directions of maximal variance
  second-order analysis
  main application: prewhitening and dimension reduction

model and algorithm
  assumption: S is decorrelated, without loss of generality white
  construction:
    eigenvalue decomposition of Cov(X): $D = V^\top \mathrm{Cov}(X)\, V$ with diagonal D and orthogonal V
    the PCA matrix W is constructed as $W := D^{-1/2} V^\top$
  indeterminacy: unique up to a right transformation in the orthogonal group

MATLAB example
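The slide refers to a MATLAB demo at this point; the following NumPy sketch implements the same construction, $W = D^{-1/2} V^\top$ from the eigenvalue decomposition of Cov(X):

```python
import numpy as np

def pca_whitening_matrix(X):
    """PCA / whitening matrix W = D^{-1/2} V^T from the EVD of Cov(X).
    X has shape (dimension, number of samples)."""
    Xc = X - X.mean(axis=1, keepdims=True)       # center the data
    d, V = np.linalg.eigh(np.cov(Xc))            # Cov(X) = V diag(d) V^T
    W = np.diag(d ** -0.5) @ V.T                 # W := D^{-1/2} V^T
    return W, Xc

# demo on some correlated 2D data
rng = np.random.default_rng(5)
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ rng.uniform(-1, 1, (2, 10_000))
W, Xc = pca_whitening_matrix(X)
Z = W @ Xc
print("Cov(Z):\n", np.cov(Z).round(3))           # approximately the identity
```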

Independent component analysis

Blind source separation (BSS) problem: X = AS + N
  X: observed m-dimensional random vector
  A: (unknown) full-rank real matrix
  S: (unknown) n-dimensional source signals, $n \leq m$
  N: (unknown) noise (often assumed to be white Gaussian)

now: the ICA assumption
  S is independent, i.e. I(S) = 0
  indeterminacies: only permutation and scaling (!) if S contains at most one Gaussian component


Additional model assumptions

in linear ICA, additional model assumptions are possible
  the sources can be assumed to be centered, i.e. E(S) = 0 (coordinate transformation X := X - E(X))

white sources
  if $A := (a_1 | \ldots | a_n)$, then the scaling indeterminacy means
  $X = AS = \sum_{i=1}^n a_i S_i = \sum_{i=1}^n \tfrac{1}{\lambda_i}\, a_i (\lambda_i S_i)$
  hence a normalization is possible, e.g. $\mathrm{var}(S_i) = 1$

white mixtures (complete case m = n)
  by assumption Cov(S) = I
  let V be a PCA matrix of X
  then Z := VX is white, and an ICA of Z gives an ICA of X

orthogonal A
  by assumption Cov(S) = Cov(X) = I
  hence $I = \mathrm{Cov}(X) = A\, \mathrm{Cov}(S)\, A^\top = A A^\top$
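A quick numerical check of the last point (a sketch under the stated assumptions of white sources and m = n, not from the slides): after whitening with a PCA matrix V, the remaining mixing VA is orthogonal.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

S = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, n))   # white (unit-variance) sources
A = np.array([[2.0, 0.5],
              [-0.3, 1.0]])                         # arbitrary full-rank mixing
X = A @ S

d, U = np.linalg.eigh(np.cov(X))                    # PCA / whitening matrix V = D^{-1/2} U^T
V = np.diag(d ** -0.5) @ U.T

B = V @ A                                           # the mixing that remains after whitening
print("B B^T:\n", (B @ B.T).round(3))               # approximately the identity, so B is orthogonal
```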


ICA-Algorithms

basic scheme of ICA algorithms (case m = n)
  search for an invertible $W \in \mathrm{Gl}(n)$ that minimizes some dependence measure of WX
  for example, minimize the mutual information I(WX) [Comon, 1994] or maximize the neural network output entropy H(f(WX)) [Bell and Sejnowski, 1995]
  earliest algorithm: extend PCA by performing nonlinear decorrelation [Hérault and Jutten, 1986]

another dependence measure is explained in more detail in the following


Maximization of non-Gaussianity

basic idea
  given X = AS, construct the ICA matrix W, which ideally equals $A^{-1}$
  at first, recover only one source: search for $b \in \mathbb{R}^n$ with $Y = b^\top X = b^\top A S =: q^\top S$
  ideally b is a row of $A^{-1}$, so $q = e_i$
  central limit theorem: gaussianity(sum of indep. RVs) > gaussianity(indep. RVs), so $Y = q^\top S$ is more Gaussian than all source components $S_i$
  at ICA solutions Y equals some $S_i$ (up to scaling), hence the solutions are least Gaussian

Algorithm (FastICA): find b such that $b^\top X$ is maximally non-Gaussian.


Example: sources

[Figure: kurtosis maximization, scatterplots of the sources and of the white mixtures (rotation by 30 degrees).]


Example: histograms

[Figure: kurtosis maximization, histograms of the projection for rotation angles alpha = 0, 10, ..., 90 degrees; the corresponding absolute kurtosis values are 0.73, 0.93, 1.11, 1.19, 1.12, 0.95, 0.75, 0.62, 0.62, 0.75, with the maximum at alpha = 30 degrees.]


Measuring non-gaussianity using kurtosis

  kurtosis was defined as $\mathrm{kurt}(Y) := E(Y^4) - 3(E(Y^2))^2$
  if Y is Gaussian, then $E(Y^4) = 3(E(Y^2))^2$, so kurt(Y) = 0
  hence kurtosis (or squared kurtosis) gives a simple measure for the deviation from Gaussianity
  assumption of unit variance, $E(Y^2) = 1$: so $\mathrm{kurt}(Y) = E(Y^4) - 3$
  two-dimensional example: $q = A^\top b = (q_1, q_2)^\top$
  then $Y = b^\top X = q^\top S = q_1 S_1 + q_2 S_2$
  linearity of kurtosis: $\mathrm{kurt}(Y) = \mathrm{kurt}(q_1 S_1) + \mathrm{kurt}(q_2 S_2) = q_1^4 \mathrm{kurt}(S_1) + q_2^4 \mathrm{kurt}(S_2)$
  normalization: $E(S_1^2) = E(S_2^2) = E(Y^2) = 1$, so $q_1^2 + q_2^2 = 1$, i.e. q lies on the unit circle
  but the maxima of $S^1 \to \mathbb{R},\ q \mapsto |q_1^4 \mathrm{kurt}(S_1) + q_2^4 \mathrm{kurt}(S_2)|$ are exactly the unit vectors $\pm e_i$!

Algorithm
S is not known; after whitening Z = VX, search for $w \in \mathbb{R}^n$ with $w^\top Z$ maximally non-Gaussian
because of $q = (VA)^\top w$ we get $|q|^2 = q^\top q = (w^\top VA)(A^\top V^\top w) = |w|^2$, so if $q \in S^{n-1}$ then also $w \in S^{n-1}$

Algorithm (kurtosis maximization): maximize $w \mapsto |\mathrm{kurt}(w^\top Z)|$ on $S^{n-1}$ after whitening.
[Figure: absolute kurtosis versus angle; plotted is $|\mathrm{kurt}((\cos\alpha, \sin\alpha)\, Z)|$ for the whitened uniform mixtures Z.]
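The curve in the figure can be reproduced in a few lines; a sketch assuming white uniform sources mixed by a 30-degree rotation, as in the earlier example:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

S = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, n))    # white uniform sources
phi = np.deg2rad(30)
A = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
Z = A @ S                                            # already white, since A is orthogonal

def kurt(y):
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

angles = np.deg2rad(np.arange(0, 181))
abs_kurt = [abs(kurt(np.array([np.cos(a), np.sin(a)]) @ Z)) for a in angles]
best = np.rad2deg(angles[int(np.argmax(abs_kurt))])
print("maximal |kurt| at an angle of about", best, "degrees")   # ~30 (or 120), matching the mixing rotation
```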

Maximization

algorithmic maximization by gradient ascent:
  a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ can be maximized by local updates in the direction of its gradient
  for a sufficiently small learning rate $\eta > 0$ and a starting point $x(0) \in \mathbb{R}^n$, local maxima of f can be found by iterating $x(t+1) = x(t) + \eta\, \Delta x(t)$ with $\Delta x(t) = (Df)^\top(x(t)) = \mathrm{grad}\, f(x(t)) = \nabla f(x(t))$, the gradient of f at x(t)

in our case
  $\mathrm{grad}\, |\mathrm{kurt}(w^\top Z)|(w) = 4\, \mathrm{sgn}(\mathrm{kurt}(w^\top Z)) \left( E\big(Z (w^\top Z)^3\big) - 3 |w|^2 w \right)$

Algorithm (gradient ascent kurtosis maximization): choose $\eta > 0$ and $w(0) \in S^{n-1}$. Then iterate
  $\Delta w(t) := \mathrm{sgn}\big(\mathrm{kurt}(w(t)^\top Z)\big)\, E\big(Z (w(t)^\top Z)^3\big)$
  $v(t+1) := w(t) + \eta\, \Delta w(t)$
  $w(t+1) := v(t+1) / |v(t+1)|$
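A compact NumPy sketch of this iteration (an illustrative one-unit implementation of the stated update rule; the learning rate, iteration count and toy data are my choices, and the deflation to further components is omitted):

```python
import numpy as np

def kurtosis_gradient_ascent(Z, eta=0.1, n_iter=500, seed=0):
    """One-unit separation by gradient ascent on |kurt(w^T Z)|.
    Z: whitened data of shape (n, n_samples). Returns a unit vector w."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Z
        k = np.mean(y**4) - 3 * np.mean(y**2) ** 2          # kurt(w^T Z)
        dw = np.sign(k) * (Z * y**3).mean(axis=1)           # sgn(kurt) E(Z (w^T Z)^3)
        v = w + eta * dw
        w = v / np.linalg.norm(v)                           # project back onto the unit sphere
    return w

# demo: two white uniform sources mixed by a rotation (so the mixtures are already white)
rng = np.random.default_rng(8)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), (2, 50_000))
phi = np.deg2rad(30)
A = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
Z = A @ S
w = kurtosis_gradient_ascent(Z)
print("recovered direction:", w.round(3), "(a column of A up to sign)")
print("|corr| with S1, S2:", abs(np.corrcoef(w @ Z, S[0])[0, 1]).round(3),
      abs(np.corrcoef(w @ Z, S[1])[0, 1]).round(3))
```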

PCA versus ICA

[Figure: comparison of the ICA basis and the PCA basis when applied to a transformed independent uniform density.]


Application: ICA analysis of financial data sets

use closing-day values of stock prices
consider a portfolio of stocks

[Figure: historical data of four stocks from the S&P 500 (MSFT, SUNW, INTC, KO), from 1-29-2004 to 1-28-2005.]

Preprocessing

stock prices are non-stationary
use stock returns $\Delta x(t) = x(t+1) - x(t)$ or logarithmic differences $\Delta x(t) = \log x(t+1) - \log x(t)$

[Figure: stock returns of the four stocks (MSFT, SUNW, INTC, KO).]
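In code, both preprocessing variants are one-liners; a sketch on synthetic prices (not the original data):

```python
import numpy as np

rng = np.random.default_rng(9)
prices = 25 + np.cumsum(0.2 * rng.standard_normal(250))   # synthetic closing prices

returns = np.diff(prices)                 # x(t+1) - x(t)
log_diffs = np.diff(np.log(prices))       # log x(t+1) - log x(t)
print(returns[:5].round(3), log_diffs[:5].round(4))
```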



Independent components

perform ICA
first extract 4 sources (complete case)
recombine: $s(t) = \Delta s(t) + s(t-1)$ with $s(0) = 0$

[Figure: the 4 extracted sources (independent components) using ICA.]


Independent components

extract only 2 sources (dimension reduction by PCA)
recover the original mixtures / stocks from the reduced data set

[Figure: the 2 extracted sources (independent components) using ICA.]


[Figure: the recovered stocks (MSFT, SUNW, INTC, KO), reconstructed from the reduced 2-source data set.]


Prediction

use a linear filter to predict the data
simplest model: estimate the parameters by minimizing the LMSE

[Figure: one-step estimation of MSFT using filter length 50.]



[Figure: one-step estimation of MSFT using filter length 200.]



Second-order BSS using time structure

instead of independence, assume here:
  the data possesses additional time structure S(t)
  the sources have diagonal autocovariances $R_S(\tau) := E\big((S(t+\tau) - E(S(t)))(S(t) - E(S(t)))^\top\big)$ for all $\tau$

goal: find A (then estimate S(t) e.g. using regression)
as before, centering and prewhitening (by PCA) allow the assumptions
  zero-mean X(t) and S(t)
  equal source and sensor dimension (m = n)
  orthogonal A

but hard-prewhitening gives a bias...


AMUSE

bilinearity of the autocovariance:
  $R_X(\tau) = E\big(X(t+\tau)X(t)^\top\big) = \begin{cases} A R_S(0) A^\top + \sigma^2 I & \tau = 0 \\ A R_S(\tau) A^\top & \tau \neq 0 \end{cases}$

so the symmetrized autocovariance $\bar{R}_X(\tau) := \frac{1}{2}\big(R_X(\tau) + R_X(\tau)^\top\big)$ fulfills (for $\tau \neq 0$) $\bar{R}_X(\tau) = A\, \bar{R}_S(\tau)\, A^\top$

identifiability:
  A can only be found up to permutation and scaling
  if there exists a $\bar{R}_S(\tau)$ with pairwise different eigenvalues: no more indeterminacies

AMUSE (algorithm for multiple unknown signals extraction)
  [Tong et al., 1991]
  recover A by eigenvalue decomposition of $\bar{R}_X(\tau)$
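A minimal NumPy sketch along these lines (my illustrative implementation, not the reference code): whiten, estimate the symmetrized lag-tau autocovariance, and take its eigenvectors.

```python
import numpy as np

def amuse(X, tau=1):
    """AMUSE-style separation: returns an unmixing matrix W with S_hat = W @ X.
    X: observed signals of shape (n, T)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, U = np.linalg.eigh(np.cov(Xc))                 # 1) whiten
    V = np.diag(d ** -0.5) @ U.T
    Z = V @ Xc
    C = Z[:, tau:] @ Z[:, :-tau].T / (Z.shape[1] - tau)
    C = (C + C.T) / 2                                 # 2) symmetrized lag-tau autocovariance
    _, E = np.linalg.eigh(C)                          # 3) its eigenvectors give the rotation
    return E.T @ V

# demo: two sources with different time structure, linearly mixed
T = 20_000
t = np.arange(T)
rng = np.random.default_rng(10)
S = np.vstack([np.sin(2 * np.pi * t / 20),
               np.sign(np.sin(2 * np.pi * t / 333))])          # fast sine and slow square wave
X = np.array([[1.0, 0.7], [0.5, 1.0]]) @ S + 0.01 * rng.standard_normal((2, T))
S_hat = amuse(X) @ X
print("|corr(S_hat, S)|:\n", np.abs(np.corrcoef(np.vstack([S_hat, S]))[:2, 2:]).round(3))
```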


SOBI

problems of AMUSE
  choice of $\tau$
  susceptible to noise or bad estimates of $\bar{R}_X(\tau)$

SOBI (second-order blind identification)
  [Belouchrani et al., 1997] (similar: TDSEP)
  identify A by joint diagonalization of a whole set $\{\bar{R}_X(\tau^{(1)}), \ldots, \bar{R}_X(\tau^{(K)})\}$ of autocovariance matrices
  joint diagonalization e.g. by the Jacobi algorithm (iterative Givens rotations in two coordinates)
  more robust against noise and against the choice of $\tau$
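For intuition, here is a deliberately simplified two-dimensional sketch of the joint-diagonalization step (a grid search over the rotation angle rather than the Jacobi algorithm used in practice; all names and parameters are illustrative):

```python
import numpy as np

def sym_autocov(Z, tau):
    C = Z[:, tau:] @ Z[:, :-tau].T / (Z.shape[1] - tau)
    return (C + C.T) / 2

def sobi_2d(X, lags=(1, 2, 3, 5, 8, 13)):
    """Toy 2D SOBI: whiten, then grid-search the rotation angle that jointly
    diagonalizes the symmetrized autocovariances at the given lags."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, U = np.linalg.eigh(np.cov(Xc))
    V = np.diag(d ** -0.5) @ U.T
    Z = V @ Xc
    Cs = [sym_autocov(Z, tau) for tau in lags]

    def offdiag_cost(theta):
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        return sum((R.T @ C @ R)[0, 1] ** 2 for C in Cs)

    thetas = np.linspace(0, np.pi / 2, 2000)           # rotations suffice in 2D
    best = thetas[np.argmin([offdiag_cost(th) for th in thetas])]
    R = np.array([[np.cos(best), -np.sin(best)],
                  [np.sin(best),  np.cos(best)]])
    return R.T @ V                                     # unmixing matrix: S_hat = (R^T V) X

t = np.arange(20_000)
S = np.vstack([np.sin(2 * np.pi * t / 20), np.sign(np.sin(2 * np.pi * t / 333))])
X = np.array([[1.0, 0.7], [0.5, 1.0]]) @ S
print(np.abs(np.corrcoef(np.vstack([sobi_2d(X) @ X, S]))[:2, 2:]).round(3))
```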


Data sets with additional structure

goal
  improve the SOBI performance for random processes with a higher-dimensional parametrization

additional structure
  the random processes S and X depend on multiple variables $(z_1, \ldots, z_M)$
  examples: images $(z_1, z_2)$, 3D data sets such as fMRI scans $(z_1, z_2, z_3)$
  use this additional information!


From nD to 1D

usual reduction (example: images)
  transform the vector of images $S(z_1, z_2)$ to S(t)
  fix a mapping from the 2D parameter set to the 1D time parametrization (e.g. row concatenation)

result
  no problem in ICA, because i.i.d. samples are assumed
  but time-structure based algorithms effectively lose information


Multidimensional autocovariance

M-dimensional autocovariance (for centered processes):
  $R_S(\tau_1, \ldots, \tau_M) := E\big(S(z_1 + \tau_1, \ldots, z_M + \tau_M)\, S(z_1, \ldots, z_M)^\top\big)$
estimation from samples as usual; it now depends on the M shifts $\tau_i$

advantage
  the random processes S and X depend on multiple variables $(z_1, \ldots, z_M)$
  examples: images $(z_1, z_2)$, 3D data sets such as fMRI scans $(z_1, z_2, z_3)$
  use this additional information!

[Figure: 1D autocovariance (1dcov) versus 2D autocovariance (2dcov) as a function of |tau|.]
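A sketch of the sample estimator for the two-dimensional case (a single image, with circular shifts assumed for simplicity):

```python
import numpy as np

def autocov_2d(img, tau1, tau2):
    """Sample 2D autocovariance of a single centered image at shift (tau1, tau2),
    using circular shifts for simplicity."""
    img = img - img.mean()
    shifted = np.roll(np.roll(img, -tau1, axis=0), -tau2, axis=1)
    return np.mean(img * shifted)

rng = np.random.default_rng(11)
y, x = np.mgrid[0:128, 0:128]
img = np.sin(2 * np.pi * x / 32) * np.cos(2 * np.pi * y / 32) \
      + 0.1 * rng.standard_normal((128, 128))     # smooth toy "image" plus noise

print("R(0,0)  =", round(autocov_2d(img, 0, 0), 3))
print("R(1,1)  =", round(autocov_2d(img, 1, 1), 3))    # close distance: still large
print("R(16,0) =", round(autocov_2d(img, 16, 0), 3))   # half a period in x: negative
```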


Multidimensional second-order BSS

replace the 1D autocovariances by M-dimensional ones in AMUSE and SOBI

mdAMUSE (multidimensional AMUSE)
  prewhiten $X(z_1, \ldots, z_M)$
  the EVD of the symmetrized multidimensional autocovariance $\bar{R}_X(\tau_1, \ldots, \tau_M)$ detects A
  how to choose the lags $\tau_i$?

mdSOBI (multidimensional SOBI)
  joint diagonalization of a set of symmetrized multidimensional autocovariances
    $\bar{R}_X\big(\tau_1^{(1)}, \ldots, \tau_M^{(1)}\big), \ldots, \bar{R}_X\big(\tau_1^{(K)}, \ldots, \tau_M^{(K)}\big)$
  the joint diagonalizer equals A except for permutation and signs
  in practice: choose $(\tau_1^{(k)}, \ldots, \tau_M^{(k)})$ with increasing modulus for increasing k

SOBI versus mdSOBI


key observation: data sets often do not have long-distance autocorrelations, but they do have quite strong multi-dimensional close-distance correlations

advantages of mdSOBI
exploits the close-distance structure
SOBI weighs each matrix equally, which leads to a strong deterioration of performance for large K
mdSOBI instead uses the stronger close-distance correlations (one possible lag schedule is sketched below)
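One way to realize such close-distance lag sets is to enumerate lag tuples ordered by their modulus; the helper below is a hypothetical illustration of this choice, not the lag schedule actually used in the experiments.

    import numpy as np

    def lags_by_modulus(M, K, max_shift=20):
        # enumerate M-dimensional non-negative lag tuples (excluding the origin),
        # sort them by Euclidean modulus and return the first K
        grids = np.meshgrid(*([np.arange(max_shift + 1)] * M), indexing="ij")
        lags = np.stack([g.ravel() for g in grids], axis=1)
        lags = lags[np.any(lags > 0, axis=1)]            # drop (0, ..., 0)
        order = np.argsort(np.linalg.norm(lags, axis=1), kind="stable")
        return [tuple(int(t) for t in lag) for lag in lags[order][:K]]

    # lags_by_modulus(2, 6) -> [(0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2)]

With K = 128, SOBI's 1D lags reach up to 128 samples away, whereas 2D lags ordered in this way stay within a modulus of roughly the square root of K, i.e. close to the current grid point.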

Results: Artificial mixtures

artificial example to compare the algorithms
linear mixture of n = 3 images with a randomly chosen 3 × 3 matrix A
noise level increased from 0% to 50%
SOBI and mdSOBI compared for K = 32 and K = 128 autocovariance matrices (a setup sketch follows below)

[figure: the three source images]
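A minimal sketch of how such an artificial benchmark could be set up; the noise scaling relative to the mixtures' standard deviation is my own choice, and the performance index used for the comparison in the slides is not reproduced here.

    import numpy as np

    def make_mixture(sources, noise_level, seed=0):
        # sources: (3, H, W) array of source images; noise_level between 0.0 and 0.5
        rng = np.random.default_rng(seed)
        n = sources.shape[0]
        A = rng.standard_normal((n, n))                   # random 3 x 3 mixing matrix
        X = np.tensordot(A, sources, axes=1)              # linear mixtures A S
        X = X + noise_level * X.std() * rng.standard_normal(X.shape)
        return X, A

The estimated separating matrices of SOBI (run on the row-concatenated data) and of the multidimensional variants (run on the 2D data) can then be compared against A.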

Results: Artificial mixtures

same performance in the low-noise case
mdSOBI outperforms SOBI with increasing noise
reason: natural images do not have any substantial long-distance autocorrelations

[figure: SOBI and mdSOBI performance as a function of the noise level]

Results: fMRI analysis


functional magnetic resonance imaging (fMRI)
noninvasive brain imaging technique
provides information on brain activation patterns
activation maps help to identify task-related brain regions
BSS techniques are applicable to fMRI, see [McKeown et al., 1998]

data set
block design protocol with 5 periods of visual stimulation and 5 periods of rest
100 scans of 3 s duration each
acquired by D. Auer, MPI of Psychiatry, Munich, Germany

preprocessing
motion correction
dimension reduction to the first 8 principal components (a sketch follows below)
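For the dimension-reduction step, a minimal SVD-based sketch; the data layout (voxels × scans) and the centering convention are assumptions of this illustration, not necessarily the preprocessing actually used.

    import numpy as np

    def pca_reduce(X, n_components=8):
        # X: (voxels, scans) fMRI data matrix after motion correction
        Xc = X - X.mean(axis=1, keepdims=True)          # center each voxel time course
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        return U[:, :n_components].T @ Xc               # (n_components, scans) scores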

Results: fMRI results


[figure: component maps and time courses of the eight recovered components; crosscorrelations cc = 0.08, 0.19, 0.11, 0.21, 0.43, 0.21, 0.16, 0.86]

mdSOBI was performed with K = 32
component 8 is the desired stimulus component, active in the visual cortex (cc = 0.86)
poorer SOBI performance: two stimulus components with cc = 0.81 and 0.84 (a sketch of the cc computation follows below)
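Assuming cc denotes the crosscorrelation between a component's time course and the task reference, it could be computed along the following lines; the boxcar reference (10 blocks of 10 scans, matching the block design above) is an illustrative reconstruction.

    import numpy as np

    def stimulus_cc(time_courses, block_len=10, n_blocks=10):
        # time_courses: (n_components, n_scans) recovered component time courses
        # boxcar reference: alternating rest / stimulation blocks
        ref = np.tile(np.r_[np.zeros(block_len), np.ones(block_len)], n_blocks // 2)
        ref = ref[: time_courses.shape[1]]
        return np.array([abs(np.corrcoef(tc, ref)[0, 1]) for tc in time_courses])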

Conclusions

summary
1D autocovariances do not sufficiently describe second-order image statistics
the proposed extension mdSOBI of SOBI uses multi-dimensional autocovariances
performance increase in both simulations and real-world applications

future work
also take higher-order statistics into account
applications to fMRI analysis, also with 3D autocovariances

Conclusions
Summary
data analysis: identifiability and algorithms
independence characterized by diagonality of the Hessian of the logarithmic density / characteristic function
linear ICA by Hessian diagonalization: HessianICA
postnonlinear BSS by nonlinearity detection as a preprocessing step
application to fMRI data sets

Questions:
Solve the Hessian diagonalization differential equation in the postnonlinear model. Other models?
Further applications in biological and medical data analysis?

Details and papers on my website: http://fabian.theis.name
Support by the DFG (graduate college "Nonlinearity and Nonequilibrium in Condensed Matter") and the BMBF (project ModKog) is gratefully acknowledged.

References
A. Bell and T. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129–1159, 1995.

A. Belouchrani, K. A. Meraim, J.-F. Cardoso, and E. Moulines. A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2):434–444, 1997.

P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.

J. Hérault and C. Jutten. Space or time adaptive signal processing by neural network models. In J. Denker, editor, Neural Networks for Computing. Proceedings of the AIP Conference, pages 206–211, New York, 1986. American Institute of Physics.

M. McKeown, T. Jung, S. Makeig, G. Brown, S. Kindermann, A. Bell, and T. Sejnowski. Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping, 6:160–188, 1998.

L. Tong, R.-W. Liu, V. Soon, and Y.-F. Huang. Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38:499–509, 1991.


