
Independent Component Analysis and Projection Pursuit: A Tutorial Introduction

James V Stone and John Porrill,

Psychology Department, Sheffield University,

Sheffield, S10 2UR, England.

Tel: 0114 222 6522 Fax: 0114 276 6515

Email: j.porrill, j.v.stone@shef.ac.uk

Web: http://www.shef.ac.uk/psychology/stone

April 2, 1998

Abstract

Independent component analysis (ICA) and projection pursuit (PP) are two related techniques for separating mixtures of source signals into their individual components. These rapidly evolving techniques are currently finding applications in speech separation, ERP, EEG, fMRI, and low-level vision. Their power resides in the simple and realistic assumption that different physical processes tend to generate statistically independent signals. We provide an account that is intended as an informal introduction, as well as a mathematical and geometric description of the methods.

1 Introduction

Independent component analysis (ICA) [Jutten and Herault, 1988] and projection pursuit (PP) [Friedman, 1987]

are methods for recovering underlying source signals from linear mixtures of these signals. This rather terse

description does not capture the deep connection between ICA/PP and the fundamental nature of the physical

world. In the following pages, we hope to establish, not only that ICA/PP are powerful and useful tools, but

that this power follows naturally from the fact that ICA/PP are based on assumptions which are remarkably

attuned to the spatiotemporal structure of the physical world.


Most measured quantities are actually mixtures of other quantities. Typical examples are, i) sound signals in

a room with several people talking simultaneously, ii) an EEG signal, which contains contributions from many

different brain regions, and, iii) a person's height, which is determined by contributions from many different

genetic and environmental factors. Science is, to a large extent, concerned with establishing the precise nature

of the component processes responsible for a given set of measurements, whether these involve height, EEG

signals, or even IQ. Under certain conditions, the underlying sources of measured quantities can be recovered

by making use of methods (PP and ICA) based on two intimately related assumptions.

The more intuitively obvious of these assumptions is that different physical processes tend to generate signals

that are statistically independent of each other. This suggests that one way to recover source signals from

signal mixtures is to find transformations of those mixtures that produce independent signal components. This

independence is given much emphasis in the ICA literature, although an apparently subsidiary assumption that

source signals have amplitude histograms that are non-Gaussian is also required. In (apparent) contrast, the

PP method relies on the assumption that any linear mixture of any set of (finite variance) source signals is

Gaussian, and that the source signals themselves are not Gaussian. Thus, another method for extracting source

signals from linear mixtures of those signals is to find transformations of the signal mixtures that extract non-Gaussian signals. It can be shown that the assumption of statistical independence is implicit in the assumption

that source signals are non-Gaussian, and therefore that both PP and ICA are actually based on the same

assumptions.

Within the literature, PP is used to extract one signal at a time, whereas ICA extracts a set of signals simultaneously. However, like the apparently different assumptions of PP and ICA, this difference is superficial, and reflects the underlying histories of the two methods, rather than any fundamental difference between them. Recent applications of ICA include separation of different speech signals [Bell and Sejnowski, 1995], analysis of EEG data [Makeig et al., 1997], functional magnetic resonance imaging (fMRI) data [McKeown et al., 1998], image processing [Bell and Sejnowski, 1997], and the relation between biological image processing and ICA [van Hateren and van der Schaaf, 1998].

Before becoming too embroiled in the intricacies of ICA, we need to establish the class of problems they can address. Given N time-varying source signals, we define the amplitudes of these signals at time t as a column vector s_t = (s_{1t}, ..., s_{Nt})^T. These signals can be linearly combined to form a signal mixture x_t = a s_t, where each element of the row vector a specifies how much of the corresponding source signal s_{it} contributes to the signal mixture x_t. Given M signal mixtures x_t = (x_{1t}, ..., x_{Mt})^T we can define a mixing matrix A = (a_1, ..., a_M)^T in which each row a_i specifies a unique mixture x_{it} of the signals s_t = (s_{1t}, ..., s_{Nt})^T. (Note that the t subscript denotes time, whereas the T superscript denotes the transpose operator.) Using this matrix notation, the set of mixtures can be written as:

x_t = A s_t    (1)

Both ICA and PP are capable of taking the signal mixtures x and recovering the sources s. That the mixtures can be separated in principle is easily demonstrated now that the problem has been summarised in matrix algebra. An `unmixing' matrix W is defined such that:

s_t = W x_t    (2)

Given that each row in W specifies how the mixtures in x are recombined to produce one source signal, it follows that it must be possible to recover one signal at a time by using a different row vector to extract each signal. For example, if only one signal is to be extracted from M signal mixtures then W is a 1 × M matrix. Thus, the shape of the unmixing matrix W depends upon how many signals are to be extracted. Usually, ICA is used to extract a number of sources simultaneously, whereas PP is used to extract one source at a time.
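Equations (1) and (2) can be checked numerically. The sketch below uses made-up sources and a made-up 2 × 2 mixing matrix; it confirms that when A is known, the unmixing matrix is simply W = A^{-1}. The real problem, addressed by PP and ICA below, is finding W when A is unknown.

```python
import numpy as np

# Two made-up sources over 1000 time steps: a square wave and a sine wave.
t = np.linspace(0.0, 100.0, 1000)
s = np.vstack([np.sign(np.sin(2.0 * t)),
               np.sin(3.0 * t)])

# Mix them with a known 2x2 matrix A:  x_t = A s_t   (Equation 1)
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
x = A @ s

# When A is known, the unmixing matrix is simply W = A^{-1}:  s_t = W x_t
W = np.linalg.inv(A)
s_rec = W @ x

assert np.allclose(s_rec, s)
```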

The nature of the linear `unmixing' transformation matrix W can be conveniently explored in terms of two signal mixtures. Consider two signals s = (s_1, s_2) that have been mixed with a 2 × 2 mixing matrix A to produce two signal mixtures x = As, where x = (x_1, x_2). If we interpret this in terms of two voices (sources) and two microphones, then the elements of the ith row in A specify the proximity of each voice to the ith microphone. Each microphone records a weighted mixture x_i of the two sources s_1 and s_2, where the weightings for each microphone are given by a row of A.

Plots of s_1 versus s_2, and of x_1 versus x_2, can be seen in Figure 1. We define the space in which s exists as S, with axes S_1 and S_2, and x's space is defined as X, with axes X_1 and X_2. The amplitudes of the source signals s_{1t} and s_{2t} at time t are represented as a point with coordinates (s_{1t}, s_{2t}) in S. The corresponding amplitudes of the signal mixtures x = As at time t are represented as a point x_t with coordinates (x_{1t}, x_{2t}) in X. The `mixing' matrix A defines a linear transformation, so that the mapping from S to X consists of a rotation and shearing of axes in S. Thus, the orthogonal axes S_1 and S_2 in S appear as two skewed lines S_1' and S_2' in X (see Figure 1b). Note that variation along each axis in S is caused by variation in the amplitude of one source signal. Given that each axis S_i in S corresponds to a direction S_i' in X, variation along the projected axes S_1' and S_2' in X is caused by variation in the signal amplitudes defined as s_1 and s_2, respectively. If we can extract variations associated with one direction, say, S_1', in X whilst ignoring variations along all other directions, then we can recover the amplitudes of the signal s_1. This can be achieved by projecting all points in X onto a line that is orthogonal to all but one direction S_1'. Such a line is defined by a vector w_1 = (w_1, w_2) (depicted as a dashed line in Figure 1b), defined so that only components of X that lie along the direction S_1' are transformed to non-zero values of y = w_1 x. This is depicted graphically in Figure 1b, with the result of unmixing both signals y = Wx depicted in Figure 2.

To summarise, the linear transformation y_t = W x_t produces a scalar value for each point x_t in X, so that a single signal results from the transformation y = Wx. The signal amplitude y_t at time t is found by taking the inner product of W with a point x_t. As the row vector W is defined to be orthogonal to directions corresponding to all but one source signal in X, only that signal will be projected to non-zero values y = Wx.

Having demonstrated that an unmixing matrix W exists that can extract one or more source signals from a

mixture, the following sections describe how PP and ICA can be used to obtain values for W .

4.1 Independence and Correlation

Statistical independence lies at the core of the ICA/PP methods. Therefore, in order to understand ICA/PP, it is essential to understand independence. At an intuitive level, if two variables x and y are independent then the value of one variable cannot be predicted from knowledge of the value of the other.

One simple way to understand independence relies on the more familiar definition of correlation. The correlation between two variables x and y is:

ρ(x, y) = Cov(x, y) / (σ_x σ_y)    (3)

where σ_x and σ_y are the standard deviations of x and y, respectively, and Cov(x, y) is the covariance between x and y:

Cov(x, y) = (1/n) Σ_i (x_i − μ_x)(y_i − μ_y)    (4)

where μ_x and μ_y are the means of x and y, respectively. Correlation is simply a form of covariance that has been normalised to lie in the range [−1, +1]. Note that if two variables x and y are uncorrelated then ρ(x, y) = Cov(x, y) = 0, although ρ(x, y) and Cov(x, y) are not equal in general.
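The definitions in Equations (3) and (4) translate directly into code. This minimal sketch (with made-up data) also checks the alternative form of covariance given in Equation (5):

```python
import numpy as np

def covariance(x, y):
    # Cov(x, y) = (1/n) * sum_i (x_i - mu_x) * (y_i - mu_y)   (Equation 4)
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    # rho(x, y) = Cov(x, y) / (sigma_x * sigma_y)   (Equation 3)
    return covariance(x, y) / (x.std() * y.std())

rng = np.random.default_rng(1)
x = rng.normal(size=10000)
y = 2.0 * x + rng.normal(size=10000)   # y depends linearly on x

# Equation (5): covariance as a difference of means.
assert np.isclose(covariance(x, y), np.mean(x * y) - np.mean(x) * np.mean(y))

# Correlation is normalised to [-1, +1]; x is perfectly correlated with itself.
assert abs(correlation(x, x) - 1.0) < 1e-9
assert 0.8 < correlation(x, y) < 1.0
```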

The covariance Cov(x, y) can be shown to be:

Cov(x, y) = (1/n) Σ_i x_i y_i − [(1/n) Σ_i x_i][(1/n) Σ_i y_i]    (5)

Each term in Equation (5) is a mean, or expected value E, and can be written more succinctly as:

Cov(x, y) = E[xy] − E[x]E[y]    (6)

A histogram plot with abscissas x and y, and with the ordinate denoting frequency, approximates the probability density function (pdf) of the joint distribution of xy. The quantity E[xy] is known as a second moment of this joint distribution. Similarly, histograms of x and y approximate their respective pdfs, and are known as the marginal distributions of the joint distribution xy. The quantities E[x] and E[y] are the first moments (respectively) of these marginal distributions. Thus, covariance is defined in terms of moments associated with the joint distribution xy.

Just because x and y are uncorrelated, this does not imply that they are independent. To take a simple example, given a variable z = {0, ..., 2π}, we can define x = sin(z) and y = cos(z). Intuitively, it can be seen that both x and y depend on z. As can be seen from Figure 3, the variables x and y are highly interdependent. However, the covariance (and therefore the correlation) of x and y is zero:

Cov(x, y) = E[xy] − E[x]E[y]    (7)
          = E[cos(z) sin(z)] − 0    (8)
          = 0    (9)

In summary, covariance does not capture all types of dependencies between x and y, whereas measures of

statistical independence do.

Like covariance, independence is defined in terms of the expected values of the joint distribution xy. We have established that if x and y are uncorrelated then they have zero covariance:

E[xy] − E[x]E[y] = 0    (10)

Using a generalised form of covariance involving powers of x and y, if x and y are statistically independent then:

E[x^p y^q] − E[x^p]E[y^q] = 0    (11)

for all positive integer values of p and q. Whereas covariance uses p = q = 1, all positive integer values of p and q are implicit in measures of independence. Formally, if x and y are independent then each moment E[x^p y^q] is equal to the product of the expected values of the pdf's marginal distributions E[x^p]E[y^q], which leads to the result stated in Equation (11).

The formal similarity between measures of independence and covariance can be interpreted as follows. Whereas

covariance measures the amount of linear covariation between x and y, independence measures the linear

covariation between [x raised to powers p] and [y raised to powers q]. Thus, independence can be considered

as a generalised form of covariance, which measures the linear covariation between non-linear functions (e.g.

cubed power) of two variables.

For example, using x = sin(z) and y = cos(z) we know that Cov(x, y) = 0. However, the measure of linear covariation between the variables x^p and y^q as depicted in Figure (3) for p = q = 2 is:

E[x^p y^q] − E[x^p]E[y^q] = −0.123    (12)

This corresponds to a correlation between x^2 and y^2 of −0.864 (see Figure 3). Thus, whereas the correlation between x = sin(z) and y = cos(z) is zero, the fact that the value of x can be predicted from y is implicit in the non-zero values of the higher-order moments of the distribution of xy.
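The sin/cos example can be verified numerically. The exact values depend on how z is sampled; for z spread uniformly over [0, 2π], the first-order covariance vanishes while the p = q = 2 generalised covariance is about −0.125:

```python
import numpy as np

z = np.linspace(0.0, 2.0 * np.pi, 10000)
x, y = np.sin(z), np.cos(z)

def cov(a, b):
    # Generalised covariance: E[ab] - E[a]E[b]
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# First-order covariance is (numerically) zero: x and y are uncorrelated...
assert abs(cov(x, y)) < 1e-3

# ...but the p = q = 2 moment exposes their dependence
# (about -0.125 for z uniform on [0, 2*pi]).
assert cov(x**2, y**2) < -0.1
```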

We have shown that both the covariance and interdependence between two variables x and y are defined in terms of the moments of the pdf of their joint distribution. However, any variable with a Gaussian pdf is special in the sense that it is completely specified by its second moment E[xy]. That is, the values of all higher moments are implicit in the value of the second moment of a Gaussian distribution. Thus, if the covariance E[xy] − E[x]E[y] of the joint distribution of two Gaussian variables is zero then it can be shown that the quantity E[x^p y^q] − E[x^p]E[y^q] is zero for all positive integer values of p and q. From Equation (11) we know that such variables are statistically independent, and it therefore follows that uncorrelated Gaussian variables are also independent.

However, non-Gaussian variables that are uncorrelated are not, in general, independent. As stated above, the non-Gaussian variables x = sin(z) and y = cos(z) are uncorrelated. Here, E[xy] − E[x]E[y] = 0, but (for example) E[x^2 y^2] − E[x^2]E[y^2] = −0.123, and the correlation between x^2 and y^2 is r = −0.86. Thus, for non-Gaussian variables, the dependency between x and y only becomes apparent in their higher-order moments.

5 Separation

5.1 Projection Pursuit: Mixtures of Source Signals Are Gaussian

A critical feature of a random linear mixture of any signals (with finite variance) is that a histogram of its values is approximately Gaussian; that is, it has a Gaussian probability density function (pdf). This follows from the central limit theorem, and is illustrated in Figures 5, 6 and 7. Most mixtures of a set of signals therefore produce a signal mixture with a Gaussian pdf. As methods for separating sources use a set of mixtures as input, and produce a linear weighting of them as output, it follows that arbitrary `unmixing' matrices W also produce Gaussian signals. However, if an `unmixing' matrix exists that produces a non-Gaussian signal from the set of mixtures then such a signal is unlikely to be a mixture of signals.


If we assume that source signals have non-Gaussian pdfs then, whilst most transformations produce data with Gaussian distributions, a small number of transformations exist that produce data with non-Gaussian distributions. Under certain conditions, the non-Gaussian signals extracted from signal mixtures by such a transformation are in fact the original source signals. This is the basis of projection pursuit methods [Friedman, 1987].

In order to set about finding non-Gaussian component signals, it is necessary to define precisely what is meant by the term `non-Gaussian'. Two important classes of signals with non-Gaussian pdfs have super-Gaussian and sub-Gaussian pdfs. These are defined in terms of kurtosis, which is defined as

k = [ (1/T) ∫ (s_t − s̄)^4 dt ] / [ (1/T) ∫ (s_t − s̄)^2 dt ]^2 − 3    (13)

where s_t is the value of a signal at time t, s̄ is the mean value of s_t, and the constant (3) ensures that super-Gaussian signals have positive kurtosis, whereas sub-Gaussian signals have negative kurtosis. In terms of expected values, this is:

k = E[(s − s̄)^4] / (E[(s − s̄)^2])^2 − 3    (14)

A signal with a super-Gaussian pdf has most of its values clustered around zero, whereas a signal with a sub-Gaussian pdf does not. As examples, a speech signal has a super-Gaussian pdf, and a sine function and white noise have sub-Gaussian pdfs (see Figure 4).

FIGURE 4 HERE.
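Equation (14) is easy to evaluate numerically. The sketch below uses a Laplacian signal as a made-up stand-in for speech (speech itself is roughly Laplacian-distributed), a sine wave, and Gaussian noise:

```python
import numpy as np

def kurtosis(s):
    # k = E[(s - s_bar)^4] / (E[(s - s_bar)^2])^2 - 3   (Equation 14)
    d = s - s.mean()
    return np.mean(d**4) / np.mean(d**2)**2 - 3.0

rng = np.random.default_rng(2)
laplacian = rng.laplace(size=100000)             # speech-like, super-Gaussian
sine = np.sin(np.linspace(0.0, 200.0, 100000))   # sub-Gaussian
gauss = rng.normal(size=100000)

assert kurtosis(laplacian) > 1.0    # super-Gaussian: positive kurtosis
assert kurtosis(sine) < -1.0        # sine wave: kurtosis is about -1.5
assert abs(kurtosis(gauss)) < 0.1   # Gaussian: (approximately) zero
```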

PP methods tend to make use of higher-order moments of distributions, such as kurtosis, in order to estimate the extent to which a signal is non-Gaussian. However, here we will use a more general measure, which is borrowed from ICA. The extent to which a signal's pdf is non-Gaussian depends on the following critical observation: if the scalar values of a signal s are transformed by the cumulative density function (cdf) Φ of that signal then the resultant distribution of values is uniform. This is useful because it permits the extent of deviation from a Gaussian pdf to be recast in terms of the uniformity, or equivalently, the entropy of the transformed signal Y = Φ(s) (one way to think of entropy is as a measure of the uniformity of a given distribution).
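This uniformising effect of a signal's own cdf (the probability integral transform) can be checked numerically. As a made-up illustration, the signal below is drawn from a logistic distribution, whose cdf is the logistic sigmoid:

```python
import numpy as np

rng = np.random.default_rng(3)
# A signal whose cdf is the logistic sigmoid (sampled for illustration).
s = rng.logistic(size=200000)

# Transform s by its own cdf: Phi(s) = 1 / (1 + exp(-s)).
Y = 1.0 / (1.0 + np.exp(-s))

# Y is uniform on [0, 1]: every decile holds about 10% of the samples.
hist, _ = np.histogram(Y, bins=10, range=(0.0, 1.0))
assert np.allclose(hist / len(s), 0.1, atol=0.01)
```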

The question of how to find the linear transformation capable of recovering a source signal follows from the definition of our measure of Gaussian deviation. We have established that a linear transformation W exists such that a signal s = Wx can be recovered from a set of M signal mixtures x, and that this transformation produces a signal s = Wx such that Y = Φ(Wx) has maximum entropy. By inverting the flow of logic in this argument, it follows that s can be recovered from x by finding a W that maximises the entropy H(Y) of Y = Φ(Wx).

It can be shown [Girolami and Fyfe, 1996] that if a number of signals are extracted from a mixture x (and these are the most non-Gaussian signal components of the mixture) then they are guaranteed to be mutually independent. Thus, even though a measure of independence is not explicitly maximised as part of the PP optimisation process, extracting non-Gaussian signals produces signals that are mutually independent. In contrast, ICA explicitly maximises the mutual independence of extracted signals.

The ICA methods described in this section are based on the following simple observation: if a set of N signals are from different physical sources (e.g. N different speakers) then they tend to be statistically independent of each other. The method of ICA is based on the assumption that if a set of independent signals can be extracted from signal mixtures then these extracted signals are likely to be the original source signals. Like PP, ICA requires assumptions of independence that involve the cdfs of source signals, and this is the link that binds ICA and PP methods together.

As in the previous section, this problem can be considered in geometric terms. The amplitudes of N source signals at a given time can be represented as a point in an N-dimensional space, and, considered over all times, they define a distribution of points in this space. If the signals are from different sources (e.g. N different speakers) then they tend to be statistically independent of each other. As with PP, a key observation is that if a signal s has a cdf Φ then the distribution of Φ(s) has maximum entropy (i.e. is uniform). Similarly, if N signals each have cdf Φ then the joint distribution of Φ(s) = (Φ(s_1), ..., Φ(s_N))^T has maximum entropy, and is therefore uniform. For a set of signal mixtures x = As, an `unmixing' matrix W exists such that s = Wx. Given that Φ(s) has maximum entropy, it follows that s can be recovered by finding a matrix W that maximises the entropy of Y = Φ(Wx) (where Φ is a vector of cdfs in one-to-one correspondence with the transformed signals in y = Wx), at which point Φ(Wx) = Φ(s). In summary, for any distribution x which is a mixture of N independent signals each with cdf Φ, there exists a linear unmixing transformation W followed by a non-linear transformation Φ, such that the resultant distribution Y = Φ(Wx) has maximum entropy. This can be used to recover the original sources by defining a plausible cdf, and then finding an unmixing matrix W that maximises the entropy of Y.

The explicit assumption of independence upon which ICA is based is less critical than the apparently subsidiary assumption regarding the non-Gaussian nature of source signals. It can be shown [Girolami and Fyfe, 1996] that, given a set of mixtures of independent non-Gaussian signals, the sources can be extracted by finding component signals that have appropriate cdfs, and that these signals are independent. The converse is not true, in general, if the number of extracted signals is less than the number of independent signals in the set of signal mixtures. That is, simply finding a subset of independent signals in a set of independent non-Gaussian source signal mixtures is not, in general, equivalent to finding the component sources. This is because linear combinations of disjoint sets of source signals are independent. For example, if a subset of independent signals are combined to form a signal mixture x_1, and a non-overlapping subset of other signals are combined to form a mixture x_2, then x_1 and x_2 are mutually independent, even though both consist of mixtures of source signals. Thus, statistical independence of extracted signals is a necessary, but not sufficient, condition for source separation.

Having established the connection between ICA and PP, and conditions under which they are equivalent, we

proceed by describing the `standard' ICA method [Bell and Sejnowski, 1995].

Suppose that the outputs x = (x_1, ..., x_M)^T of M measurement devices are a linear mixture of N = M independent signal sources s = (s_1, ..., s_N)^T, x = As, where A is an N × N mixing matrix. We wish to find an N × N unmixing matrix W such that each of the N components recovered by y = Wx is one of the original signals s (i.e. K = N).

As discussed above, an unmixing matrix W can be found by maximising the entropy H(Y) of the joint distribution Y = (Y_1, ..., Y_N) = (Φ_1(y_1), ..., Φ_N(y_N)), where y_i = W_i x. The correct Φ_i have the same form as the cdfs of the source signals s_i. However, in many cases it is sufficient to approximate these cdfs by sigmoids (see footnote 2): Y_i = tanh(y_i).

The entropy of a signal y with pdf f_y(y) is given by:

H(y) = −∫ f_y(y) ln f_y(y) dy    (15)

As might be expected, the transformation of a given data set x affects the entropy of the transformed data Y according to the change in the amount of `spread' introduced by the transformation. Given a multidimensional signal x, if a cluster of points in x is mapped to a large region in Y then the entropy increases; the transformation implicitly maps infinitesimal volumes from one space to another. The `volumetric mapping' between spaces is given by the Jacobian of the transformation between spaces. The Jacobian combines the derivative of each axis in x with respect to every axis in y to form a ratio of infinitesimal volumes in x and y. The change in entropy induced by the transformation W can be shown to be equal to the expected value of ln |J|, where |.| denotes absolute value.

Given that Y = Φ(Wx), the output entropy H(Y) can be shown to be related to the entropy of the input H(x) by

H(Y) = H(x) + E[ln |J|]    (16)

where |J| is the determinant of the Jacobian matrix J = ∂Y/∂x. Note that the entropy of the input H(x) is constant. Given that we wish to find a W that maximises H(Y), any W that maximises H(Y) is unaffected by H(x), which can therefore be ignored.

(Footnote 2: In fact, sources s_i normalised so that E[s_i tanh s_i] = 1/2 can be separated using tanh sigmoids if and only if the pairwise conditions κ_i κ_j > 1 are satisfied, where κ_i = 2E[s_i^2]E[sech^2 s_i] [Porrill, 1997].)

Using the chain rule, we can evaluate |J| as:

|J| = |∂Y/∂x| = |∂Y/∂y| |∂y/∂x| = [ Π_{i=1}^N Φ_i'(y_i) ] |W|    (17)

where ∂Y/∂y and ∂y/∂x are Jacobian matrices. Substituting Equation (17) in Equation (16) yields

H(Y) = H(x) + E[ Σ_{i=1}^N ln Φ_i'(y_i) ] + ln |W|    (18)

As the entropy of the x is unaffected by W, it can be ignored in the maximisation of H(Y). The term E[ Σ_i ln Φ_i'(y_i) ] can be estimated given n samples from the distribution defined by y:

E[ Σ_{i=1}^N ln Φ_i'(y_i) ] ≈ (1/n) Σ_{j=1}^n Σ_{i=1}^N ln Φ_i'(y_i^(j))    (19)

Ignoring H(x), and substituting Equation (19) in (18), yields a new function that differs from H(Y) by a constant equal to H(x):

h(W) = (1/n) Σ_{j=1}^n Σ_{i=1}^N ln Φ_i'(y_i^(j)) + ln |W|    (20)

If we define the cdf Φ_i = tanh then this evaluates to

h(W) = (1/n) Σ_{j=1}^n Σ_{i=1}^N ln(1 − y_i^(j)2) + ln |W|
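The tanh form of the objective h(W) in Equation (20) can be written as a short function. Following the Bell and Sejnowski convention, y below denotes the sigmoid output tanh(Wx), so that Φ' = 1 − y²; the small epsilon is an added numerical guard, not part of the original formula:

```python
import numpy as np

def h(W, x):
    # h(W) = (1/n) sum_j sum_i log(1 - y_i(j)^2) + log|det W|
    # with y = tanh(Wx); epsilon guards against log(0).
    y = np.tanh(W @ x)
    n = x.shape[1]
    return (np.sum(np.log(1.0 - y**2 + 1e-12)) / n
            + np.log(abs(np.linalg.det(W))))

# Sanity checks against hand-computed values: with x = 0, tanh(Wx) = 0,
# so only the log-determinant term contributes.
x0 = np.zeros((2, 1))
assert abs(h(np.eye(2), x0)) < 1e-9
assert np.isclose(h(2.0 * np.eye(2), x0), np.log(4.0))
```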

This function can be maximised by taking its derivative with respect to the matrix W:

∇_W h = [W^T]^{-1} − 2yx^T    (22)

Now an unmixing matrix can be found by taking small steps of size η to update W:

ΔW = η([W^T]^{-1} − 2yx^T)    (23)

In fact, the matching of the pdf of y to each cdf also requires that each signal y_i has zero mean. This is easily accommodated by introducing a `bias' weight w_i to ensure that y_i = W_i x + w_i has zero mean. The value of each bias weight is learned like any other weight in W. For a tanh cdf, this evaluates to:

Δw_i = −2η y_i    (24)

In practice, h(W) is maximised either, a) by using a `natural gradient' [Amari, 1998], which normalises the error surface so that the step-size along each dimension is scaled by the local gradient in that direction, and which obviates the need to invert W at each step, or b) by a second-order technique (such as BFGS, conjugate gradients, or a Marquardt method) which estimates an optimal search direction and step-size under the assumption that the error surface is locally quadratic.
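The whole procedure can be sketched end-to-end. This is a minimal illustration, not the authors' code: it uses made-up Laplacian (super-Gaussian) sources, a made-up mixing matrix, and the natural-gradient form of the update just described, which avoids inverting W at each step:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Two super-Gaussian (Laplacian) sources, mixed by a made-up 2x2 matrix A.
s = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

# Natural-gradient ascent on h(W) with tanh cdfs:
#   dW = eta * (I - 2 y u^T / n) W,  where u = Wx and y = tanh(u).
W = np.eye(2)
eta = 0.1
for _ in range(2000):
    u = W @ x
    y = np.tanh(u)
    W += eta * (np.eye(2) - 2.0 * (y @ u.T) / n) @ W

# Each row of u = Wx should now match one source up to scale and sign.
u = W @ x
corr = np.corrcoef(np.vstack([u, s]))[:2, 2:]
assert all(np.abs(corr).max(axis=0) > 0.9)   # every source recovered
```

Note that the sources are recovered only up to permutation, scale, and sign, which is why the check uses correlations rather than an exact comparison with s.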


In `standard' ICA, each of N signal mixtures is measured over T time steps, and N sources are recovered as

y = W x, where each source is independent over time of every other source. However, when using ICA to analyse

temporal sequences of images it rapidly becomes apparent that there are two alternative ways to implement

ICA.

Normally, independent temporal sequences are extracted by placing the image corresponding to each time step in a column of x. We refer to this as ICAt. This essentially treats each of the N pixels as a separate `microphone' or mixture, so that each mixture consists of T time steps. The (large) N × N matrix W then finds temporally independent sources that contribute to each pixel grey-level over time. Having discovered the temporally independent signals for an image sequence, this begs the question: what was it that varied independently over time? Given y = Wx we can derive

x = Ay    (25)

where A = W^{-1} is an N × N matrix. Therefore, each row (source signal) of y specifies how the contribution to x of one column (image) of A varies over time. So, whereas each row y_i of y specifies a signal that is independent of all rows in y, each column a_i of A consists of an image that varies independently over time according to the amplitude of y_i. Note that, in general, the rows of y are constrained to be mutually independent, whereas the relationship between columns of A is completely unconstrained.

Instead of placing each image in a column of x, we can place each image in a row of x. This is equivalent to treating each time step as a mixture of independent images. We refer to this as ICAs. In this case, each source signal (row of y) is an image, where the pixel values in each image (row) are independent of every other image, so that these images are said to be spatially independent. Each column of the T × T matrix A is a temporal sequence.

In summary, both ICAs and ICAt produce a set of images and a corresponding set of temporal sequences. However, ICAt produces a set of mutually independent temporal sequences and a corresponding set of unconstrained images, whereas ICAs produces mutually independent images and a corresponding set of unconstrained temporal sequences.

If it is known that either temporal or spatial independence cannot be assumed then this rules out ICAt or ICAs, respectively. (Note that ICAt requires an N × N matrix W, whereas ICAs requires a T × T matrix W.) ICAs has been used to good effect on fMRI images [McKeown et al., 1998]. If neither spatial nor temporal independence can be assumed then a form of ICA that requires assumptions of minimal dependence over time and space can be used.

For many data sets, it is impracticable to find an N × N unmixing matrix W because the number N of rows in x is large. In such cases, principal component analysis (PCA) can be used to reduce the size of W.

Each of the T N-dimensional column vectors in x defines a single point in an N-dimensional space. If most of these points lie in a K-dimensional subspace (where K ≪ N) then we can use K judiciously chosen basis vectors to represent the T columns of x. (E.g. if all the points in a box lie in a two-dimensional square then we can describe the points in terms of the two basis vectors defined by two sides of that square.) Such a set of K N-dimensional eigenvectors U can be obtained using PCA.

Just as ICA can be used with data vectors in the rows or columns of x, so PCA can be used to find a set V of K T-dimensional eigenvectors. More importantly (and momentarily setting K = N), the two sets of eigenvectors U and V are related to each other by a diagonal matrix:

x = U D V^T    (26)

where the diagonal elements of D contain the ordered eigenvalues of the corresponding eigenvectors in the columns of U and of V. This decomposition is produced by singular value decomposition (SVD).

Note that each eigenvalue specifies the amount of data variance associated with the direction defined by a corresponding eigenvector in U and V. We can therefore discard eigenvectors with small eigenvalues, because these account for trivial variations in the data set. Setting K ≪ N permits a more economical representation of x:

x ≈ x̃ = Ũ D̃ Ṽ^T    (27)

Note that Ũ is now an N × K matrix, Ṽ is a T × K matrix, and D̃ is a diagonal K × K matrix. As with ICAs and ICAt, these can be considered in temporal and spatial terms. If each column of x is an image of N pixels then each column of U is an eigenimage, and each column of V is an eigensequence.

Given that we require a small unmixing matrix W, it is desirable to use Ũ instead of x for ICAt, and Ṽ instead of x^T for ICAs. The basic method consists of performing ICA on Ũ or Ṽ to obtain K ICs, and then using the relation x̃ = Ũ D̃ Ṽ^T to obtain the K corresponding columns of A.
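The truncation in Equation (27) can be demonstrated with synthetic data. The sketch below (with made-up dimensions N, T, K and artificial nearly-rank-K data) verifies that the K largest singular values capture almost all of the variance:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, K = 100, 50, 5   # pixels, time steps, retained components

# Synthetic data that is (nearly) rank K: K spatial patterns times K time
# courses, plus a little noise (purely illustrative numbers).
x = rng.normal(size=(N, K)) @ rng.normal(size=(K, T)) \
    + 0.01 * rng.normal(size=(N, T))

U, d, Vt = np.linalg.svd(x, full_matrices=False)

# Keep only the K largest singular values: x ~ U~ D~ V~^T   (Equation 27)
U_k, D_k, Vt_k = U[:, :K], np.diag(d[:K]), Vt[:K, :]
x_tilde = U_k @ D_k @ Vt_k

# The truncated reconstruction captures nearly all of the variance.
rel_err = np.linalg.norm(x - x_tilde) / np.linalg.norm(x)
assert rel_err < 0.05
assert U_k.shape == (N, K) and Vt_k.shape == (K, T)
```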


Replacing x with Ṽ^T in y = Wx produces

y = W Ṽ^T    (28)

where each row of the K × T matrix Ṽ^T is an `eigensequence', and W is a K × K matrix. In this case, ICA recovers K mutually independent sequences, each of length T.

The set of images corresponding to the K temporal ICs can be obtained as follows. Given

Ṽ^T = Ay = W^{-1} y    (29)

and

x = Ũ D̃ Ṽ^T    (30)

we have

x = Ũ D̃ W^{-1} y    (31)
  = Ay    (32)

so that

A = Ũ D̃ W^{-1}    (33)

where A is an N × K matrix in which each column is an image. Thus, we have extracted K independent T-dimensional sequences and their corresponding N-dimensional images using a K × K unmixing matrix W.

A similar method can be used to find K independent images and their corresponding time courses. Replacing x^T with Ũ^T in y = W x^T produces

y = W Ũ^T    (34)

where each row of the K × N matrix Ũ^T is an `eigenimage', and W is a K × K matrix. In this case, ICA recovers K mutually independent images, each of length N.

The set of time courses corresponding to the K spatial ICs can be obtained as follows. Given

Ũ^T = Ay = W^{-1} y    (35)

and

x^T = Ṽ D̃ Ũ^T    (36)

we have

x^T = Ṽ D̃ W^{-1} y    (37)
    = Ay    (38)

so that

A = Ṽ D̃ W^{-1}    (39)

where A is a T × K matrix in which each column is a time course. Thus, we have extracted K independent N-dimensional images and their corresponding T-dimensional time courses using a K × K unmixing matrix W.

Note that using SVD in this manner requires the assumption that the ICs are not distributed amongst the
smaller eigenvectors, which are usually discarded. The validity of this assumption is by no means guaranteed.

We have shown how ICA can be made tractable by using principal components (PCs) obtained from SVD.
However, for high-dimensional data, U or V can be too large for most computers. For instance, if the data
matrix x is an N×T matrix then U is N×T and V is T×T. If each column of x contains an image with N
pixels then neither x nor U may be small enough to fit into the RAM available on a computer.

It is possible to compute U, D and V using techniques that do not rely on SVD.
Throughout the following we assume that V is smaller than U. For convenience, we assume that x consists of
one image per column, and that each column corresponds to one of N time steps. We proceed by finding V and
D, from which U can be obtained by combining them with x.

The matrix V contains one eigensequence per column. It can be obtained from the temporal N×N covariance
matrix of x, which is defined by:

C = x^T x    (40)

This covariance matrix is the starting point of many standard PCA algorithms. After PCA we have V and a
corresponding set of ordered eigenvalues. The matrix D, which is normally obtained with SVD, can be constructed
by setting each diagonal element to the square root of the corresponding eigenvalue. Given that:

x = U D V^T    (41)

it follows that:

U = x V D^-1    (42)


So, given V and D from a PCA of the covariance matrix of x, we can obtain the eigenimages U . Note that we

can compute as many eigenimages as required by simply omitting corresponding eigensequences and eigenvalues

from V and D, respectively.
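The covariance route of equations (40)-(42) can be verified numerically. This is a minimal sketch on random data; the sign of each eigenvector is arbitrary, so the check compares the reconstruction x = U D V^T rather than the factors themselves.

```python
import numpy as np

rng = np.random.default_rng(2)
Npix, Nt = 200, 40              # pixels per image, number of time steps
x = rng.standard_normal((Npix, Nt))

C = x.T @ x                     # Nt x Nt temporal covariance (equation 40)
evals, V = np.linalg.eigh(C)    # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]
evals, V = evals[order], V[:, order]   # reorder: largest eigenvalue first

D = np.diag(np.sqrt(evals))     # singular values = sqrt of covariance eigenvalues
U = x @ V @ np.linalg.inv(D)    # equation (42): U = x V D^-1

print(np.allclose(U @ D @ V.T, x))  # True: x = U D V^T (equation 41)
```

Because C is only Nt×Nt, this route avoids ever forming the full Npix×Npix spatial covariance, which is the point of the construction.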

If we distribute the singular values between U and V by multiplying each column in U and V by the square root
of its corresponding singular value then we have:

x = U V^T    (43)

which has a similar form to the ICA decomposition:

x = A s    (44)

The main difference between SVD and ICA is as follows. Each matrix produced by SVD has orthogonal columns.
That is, the variation in each column is uncorrelated with variations in every other column within U and V. In
contrast, ICA produces two matrices with quite different properties. Rather than being uncorrelated, the rows
of s are independent. This stringent requirement on the rows of s suggests that the columns of A cannot also be
independent in general, and ICA actually places no constraints on the relationships between the columns of A.
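The orthogonality claim for SVD is easy to verify directly; this sketch on random data checks that the columns of U and of V are mutually orthogonal, which is a weaker condition than the independence ICA demands of the rows of s.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal((60, 30))
U, d, Vt = np.linalg.svd(x, full_matrices=False)

# Columns of U (and of V) are orthonormal: U^T U = I and V^T V = I.
print(np.allclose(U.T @ U, np.eye(30)), np.allclose(Vt @ Vt.T, np.eye(30)))
```

Orthogonality fixes only the second-order (correlation) structure; as Figure 3 illustrates, signals can be uncorrelated yet strongly dependent through higher-order moments.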


References

[Amari, 1998] Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276.

[Bell and Sejnowski, 1995] Bell, A. and Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159.

[Bell and Sejnowski, 1997] Bell, A. and Sejnowski, T. (1997). The `independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338.

[Friedman, 1987] Friedman, J. (1987). Exploratory projection pursuit. J. Amer. Statistical Association, 82(397):249-266.

[Girolami and Fyfe, 1996] Girolami, M. and Fyfe, C. (1996). Negentropy and kurtosis as projection pursuit indices provide generalised ICA algorithms. NIPS'96 Blind Signal Separation Workshop.

[Jutten and Herault, 1988] Jutten, C. and Herault, J. (1988). Independent component analysis versus PCA. In Proc. EUSIPCO, pages 643-646.

[Makeig et al., 1997] Makeig, S., Jung, T., Bell, A., Ghahremani, D., and Sejnowski, T. (1997). Blind separation of auditory event-related brain responses into independent components. Proc. Natl. Acad. Sci., 94:10979-10984.

[McKeown et al., 1998] McKeown, M., Makeig, S., Brown, G., Jung, T., Kindermann, S., and Sejnowski, T. (1998). Spatially independent activity patterns in functional magnetic resonance imaging data during the Stroop color-naming task. Proceedings of the National Academy of Sciences USA (in press).

[Porrill, 1997] Porrill, J. (1997). Independent component analysis: Conditions for a local maximum. Technical Report 123, Psychology Department, Sheffield University, England.

[van Hateren and van der Schaaf, 1998] van Hateren, J. and van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. Royal Soc. London (B), 265(7):359-366.


Figure 1: The geometry of source separation. a) Plot of signal s1 versus s2. Each point st in S represents the
amplitudes of the source signals s1t and s2t at time t. These signals are plotted separately in Figure 2. b) Plot
of signal mixture x1 versus x2. Each point xt = Ast in X represents the amplitudes of the signal mixtures
x1t and x2t at time t. These signal mixtures are plotted separately in Figure 2. The orthogonal axes S1 and
S2 in S (solid lines in Figure a) are transformed by the mixing matrix A to form the skewed axes S1' and S2'
in X (solid lines in Figure b). An `unmixing' matrix W consists of two row vectors, each of which `selects' a
direction associated with a different signal in X. The dashed line in Figure b specifies one row vector w1 of an
`unmixing' matrix W which is (in general) orthogonal to every transformed axis Si' except one (S1', in this case).
Variations in signal amplitude associated with directions (such as S2') that are orthogonal to w1 have no effect
on the inner product y1 = w1 x. Therefore, y1 only reflects amplitude changes associated with the direction S1',
so that y1 = k s1 where k is a constant that equals unity if S1' and w1 are co-linear.


Figure 2: Separation of two signals. The original signals s = (s1, s2) are displayed in the left-hand graphs. Two
signal mixtures x = As are displayed in the middle graphs. The results of applying an unmixing matrix
W = A^-1 to the mixtures, y = Wx, are displayed in the right-hand graphs.


Figure 3: The interdependence of x = sin(z) and y = cos(z) is only apparent in the higher order moments of
the joint distribution of x and y. a) Plot of x = sin(z) versus y = cos(z). Even though the value of x is highly
predictable given the corresponding value of y (and vice versa), the correlation between x and y is r = 0. For
display purposes, noise has been added in order to make the set of points visible. b) Plot of sin^2(z) versus
cos^2(z). The correlation between sin^2(z) and cos^2(z) is r = 0.864. Whereas x and y are uncorrelated if the
correlation between x and y is zero, they are statistically independent only if the correlation between x^p and
y^q is zero for all positive integer values of p and q. Therefore, sin(z) and cos(z) are uncorrelated, but not
independent.
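The uncorrelated-but-dependent property illustrated in Figure 3 is easy to reproduce numerically. This sketch uses noiseless signals, so the squared signals come out perfectly anti-correlated (since sin^2(z) = 1 - cos^2(z)) rather than matching the value quoted for the noisy data in the figure.

```python
import numpy as np

z = np.linspace(0.0, 2 * np.pi, 10000, endpoint=False)
x, y = np.sin(z), np.cos(z)

# First-order correlation vanishes over whole periods...
r1 = np.corrcoef(x, y)[0, 1]

# ...but sin^2(z) = 1 - cos^2(z), so the squares are perfectly
# (anti-)correlated: the signals are uncorrelated yet dependent.
r2 = np.corrcoef(x**2, y**2)[0, 1]

print(abs(r1) < 1e-6, r2)
```

Checking correlations of powers x^p and y^q in this way probes exactly the higher-order moments that ICA exploits and that PCA/SVD ignore.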


Figure 4: Histograms of signals with different probability density functions. From left to right: histograms
of a super-Gaussian, Gaussian, and sub-Gaussian signal. The left-hand histogram is derived from a portion of
Handel's Messiah, the middle histogram is derived from Gaussian noise, and the right-hand histogram is derived
from a sine wave.


Figure 5: Six sound signals and their pdfs. Each signal consists of ten thousand samples. From top to bottom:

chirping, gong, Handel's Messiah, people laughing, whistle-plop, steam train.


Figure 6: The outputs of six microphones, each of which receives input from six sound sources according to
its proximity to each source. Each microphone receives a different mixture of the six non-Gaussian signals
displayed in Figure 5. Note that the pdf of each signal mixture, shown on the right-hand side, is approximately
Gaussian.


Figure 7: A typical signal produced by applying a random `unmixing' matrix to the six signal mixtures displayed
in Figure 6. The resultant signal has a pdf that is approximately Gaussian. From top to bottom: a single mixture
of the six signals shown in Figure 5, the mixture's pdf, and the pdf of a Gaussian signal. Note that the correct
unmixing matrix would produce each of the original source signals displayed in Figure 5.

