
Clustering neural systems using generative embedding

Semester Thesis

Ajita Gupta

December 7, 2011

Advisor: Kay H. Brodersen(1,2)
Supervisors: Professor Joachim M. Buhmann(1), Professor Klaas E. Stephan(2)

(1) Machine Learning Laboratory, Department of Computer Science, ETH Zurich
(2) Laboratory for Social and Neural Systems Research, Department of Economics, University of Zurich

Abstract. Multivariate decoding models have been increasingly used to infer cognitive or clinical brain states from measures of brain activity. The performance of conventional clustering techniques is restricted by high data dimensionality, low sample size, and a lack of mechanistic interpretability. In this project, we extend previous work on classification of neural dynamical systems to clustering, asking what structure can be discovered when no labelling information is available. We illustrate the utility of our approach in the context of neuroimaging and validate our solutions in relation to known clinical diagnostics. We also investigate how our solution can be visualized and interpreted in the context of the underlying generative model. We envisage that future applications of model-based clustering may help dissect spectrum disorders into physiologically more well-defined subgroups.

Acknowledgements. I would like to express my deepest gratitude to everyone who has accompanied me throughout the course of this project in the past four months. To begin with, I would like to thank Prof. Dr. Joachim M. Buhmann and Prof. Dr. Dr. med. Klaas E. Stephan for giving me this wonderful opportunity to work in their group, as well as to attend the first International SystemsX.ch Conference on Systems Biology this year. It was a memorable experience and a significant step in my academic career. I am grateful to Kay H. Brodersen for his guidance, help and patience. He has been an excellent advisor and has consistently engaged with me in insightful conversations, providing valuable hints and regular feedback. I have been fortunate to work with Kate I. Lomakina. She mentored me at the initial stage of my project by introducing me to the neuroscience background, as well as the fundamental mathematical concepts needed to accomplish my task. Special thanks to the participants of the Dynamic Causal Modelling Seminar, who were a constant source of inspiration; this project builds upon their expertise and input. My heartfelt thanks to Prof. Dr. Andreas Krause, Prof. Dr. Randy McIntosh, Alexander Vezhnevets, Alberto G. Busetto and Lin Zhihao for their contributions. Finally, I would like to thank my family for their unflinching support and cooperation.

Contents
1 Introduction
  1.1 Motivation
  1.2 Adopted Approach
2 Methods
  2.1 Dynamic causal modelling (DCM)
  2.2 Generative embedding
    2.2.1 Model inversion
    2.2.2 Kernel construction
  2.3 Clustering Techniques
    2.3.1 K-Means clustering
    2.3.2 Gaussian Mixture Models
  2.4 Model selection
    2.4.1 Distortion
    2.4.2 Davies-Bouldin Index
    2.4.3 Log Likelihood
    2.4.4 Bayesian Information Criterion
  2.5 Predictive validity
    2.5.1 Balanced Purity
    2.5.2 Normalized Mutual Information
  2.6 Bootstrap
  2.7 Illustration of Clusters
3 Results
  3.1 Application to LFP Data
    3.1.1 DCM for LFP
    3.1.2 Clustering Validation
    3.1.3 Visualization
    3.1.4 Computational Efficiency
  3.2 Application to fMRI Data
    3.2.1 DCM for fMRI
    3.2.2 Regional Correlations
    3.2.3 PCA Reduction
    3.2.4 Clustering Validation
    3.2.5 Visualization
    3.2.6 Computational Efficiency


4 Discussion
  4.1 Summary of Results
  4.2 Limitations
  4.3 Future Work
A MATLAB Implementation
  A.1 Feature Extraction
  A.2 Clustering
  A.3 Bootstrap
  A.4 Visualization
  A.5 Operating on Cluster
Bibliography


List of Figures
1.1 Analysis Pipeline
2.1 K-Means Algorithm
2.2 Gaussian Mixture Models
3.1 LFP data (Experimental Design)
3.2 Internal Validation on LFP Data for k-Means
3.3 Internal Validation on LFP Data for Gaussian mixture models
3.4 External Validation on LFP Data
3.5 MDS for k-Means on LFP Data
3.6 MDS for Gaussian Mixture Models on LFP Data
3.7 Criteria stability for LFP Data
3.8 Right Hemisphere of the Brain
3.9 Internal Validation on fMRI Data
3.10 Approach Comparison for fMRI Data
3.11 MDS for k-Means on fMRI Data
3.12 MDS for GMM on fMRI Data
3.13 Criteria stability for fMRI Data

List of Tables
A.1 Feature Extraction
A.2 Clustering
A.3 Bootstrap
A.4 Visualization
A.5 Cluster Operation

Chapter 1

Introduction
1.1 Motivation

Complex biological systems can be studied using dynamical systems models. These are built upon differential equations which describe how individual system elements interact in time. Even simple systems may induce exceptionally complex trajectories; conversely, a system that seems complicated at the surface may be driven by a surprisingly simple mechanism underneath. This is a fundamental insight that dynamics offers into cognition. Two previous studies proposed a novel approach that applies dynamical systems models in neurobiology using the concept of generative embedding. In the first study, a generative model of local field potentials (LFP) in mice was used to read out the trial-by-trial identity of a sensory stimulus from activity in the somatosensory cortex [1]. In the second study, a model of functional magnetic resonance imaging (fMRI) data was used to diagnose aphasia, an impairment of language ability, in human stroke patients. The model was based on activity in non-lesioned, specifically thalamo-temporal brain regions recorded during speech processing [2]. Whether generative embedding could also aid in discovering plausible structure in unlabelled datasets has remained an open question.

1.2 Adopted Approach

In this project, we address this question by developing a model-based clustering approach. More precisely, we (i) inverted a dynamic causal model [3] of neuroimaging data in a trial-wise or subject-wise fashion; (ii) constructed a kernel function; (iii) applied baseline clustering algorithms to the subject-specific parameter estimates; (iv) compared the solutions with respect to group labels given a priori; and finally (v) interpreted the obtained results with regard to the underlying model (see Figure 1.1). After laying down this analysis pipeline, we applied our methods to experimental data. First, we analyzed LFP data recorded from mice with different whiskers being stimulated on each trial, asking whether the model could recognize the distinct sets of trials. Second, we reanalyzed previously acquired functional MRI data from stroke patients and healthy controls, examining what group structure might emerge when feeding unlabelled data through a generative kernel. Using these examples, we investigated questions of model order selection and examined the nature and extent of agreement between our unsupervised approach and a conventional supervised analysis. We anticipate that analyses based on mechanistically interpretable models will play an increasingly vital role in the future. For instance, they might become particularly relevant for the objective grouping of spectrum disorders [4], where the intention is to split groups of patients with similar symptoms into pathophysiologically distinct subsets. This work represents one step towards this goal.

Figure 1.1: Analysis Pipeline. This figure depicts the five salient milestones in using generative embedding for model-based clustering of fMRI data.

Chapter 2

Methods
In this chapter, we elaborate on the mathematical concepts which serve as the foundation of our analysis. We then look at the various analysis and illustration techniques, followed by a discussion of the criteria used to validate our approach.

2.1 Dynamic causal modelling (DCM)

Our entire analysis is built upon a mathematical modelling technique called Dynamic Causal Modelling (DCM), whose generic idea we describe in this section. The key concept of DCM (see [3] for an elaborate description) is to treat the brain as a deterministic nonlinear dynamic system. Effective connectivity is parameterised in terms of coupling between brain regions (e.g., neuronal activity in different regions). Dynamic causal modelling is distinguished from alternative approaches by accommodating the nonlinear and dynamic aspects of neuronal interactions, and by framing the estimation problem in terms of experimental perturbations. DCM calls upon the same experimental design principles to evoke region-specific interactions that are used in traditional approaches. However, the causal (or explanatory) variables now become external inputs, whereas the parameters represent effective connectivity. DCM represents an attempt to establish forward models that convincingly capture how neuronal dynamics respond to different inputs and generate the recorded responses. This reflects the increasing acceptance of neuronal models and their importance for understanding measured brain activity (see [5] for a discussion).
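To make the state-equation view concrete, the following Python sketch integrates a generic bilinear system of the form dx/dt = (A + u B) x + C u, which has the same structure as the standard DCM neuronal state equation. All matrices, the input function and the integration settings are invented toy values; the thesis itself used MATLAB and the DCM machinery, not this code.

```python
def simulate_dcm(A, B, C, u, dt=0.01, steps=500):
    """Euler integration of dx/dt = (A + u(t) * B) x + C u(t) for a small
    network; returns the trajectory of neuronal states over time."""
    n = len(A)
    x = [0.0] * n                      # states start at rest
    traj = [list(x)]
    for t in range(steps):
        ut = u(t * dt)                 # experimental input at time t
        # rate of change of every region, given state, input and parameters
        dx = [sum((A[i][j] + ut * B[i][j]) * x[j] for j in range(n)) + C[i] * ut
              for i in range(n)]
        x = [x[i] + dt * dx[i] for i in range(n)]
        traj.append(list(x))
    return traj
```

With a stable connectivity matrix A (negative self-connections) and a constant input driving region 1, the states converge to a steady level determined by the coupling strengths.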

2.2 Generative embedding

Today's traditional clustering methods are limited by two major issues. Firstly, even the most refined algorithms have great difficulty separating useful features from uninformative sources of noise. Secondly, most clustering studies are blind to the neuronal mechanisms that discriminate between brain states (e.g., diseases); hence, they are unable to improve our mechanistic understanding of the discovered structures. Generative embedding stands on two major pillars: a generative model for the selection of mechanistically interpretable features, and a discriminative method for clustering (see Figure 1.1). Generative kernels are used to construct a so-called generative score space, in which the set of observations is mapped to statistical representations [1, 6-21]. Well-known examples are the P-kernel [22] and the Fisher kernel [17]. Generative models have proven highly beneficial in explaining how observed data are generated by the underlying (neuronal) system. One example in neuroimaging is DCM, which is the foundation of our work. It enables statistical inference on physiological quantities that are not directly observable with current methods, such as directed interregional coupling strengths and their modulation, e.g., by synaptic gating [23]. From a pathophysiological perspective, disturbances of synaptic plasticity and neuromodulation are at the heart of psychiatric spectrum diseases such as schizophrenia [4] and depression [24]. It is therefore likely that clustering of disease states could benefit from exploiting estimates of these quantities. We anticipate that generative embedding for model-based clustering yields better class separability than conventional techniques based on brain structure or activation when fed into a discriminative clustering algorithm, and thus provides a convincing answer to the challenges outlined above.

2.2.1 Model inversion

Bayesian inversion of a given dynamic causal model m defines a projection X -> M that maps subject-specific data x in X to a multivariate probability distribution p(\theta | x, m) in a parametric family M. The model m determines the neuronal regions of interest, external inputs u, synaptic connections, and a prior distribution p(\theta | m) over the parameters. Given the model m and the data x, model inversion proceeds in an unsupervised and sample-wise manner. Combining the prior density p(\theta | m) with the likelihood function p(x | \theta, m) yields the posterior density p(\theta | x, m). The most efficient way of performing inversion is to maximize a variational free-energy bound on the log model evidence, ln p(x | m), under Gaussian assumptions about the posterior (see [25] for an overview). Model inversion yields, for each sample x, a posterior density p(\theta | x, m) characterized by the vector of posterior means \mu in R^d and a covariance matrix \Sigma in R^{d x d}, given d parameters.

2.2.2 Kernel construction

The kernel defines a similarity measure for comparing inverted generative models. The choice of this metric depends on the definition of the feature space. In generative embedding, for instance, one could use the posterior means or maximum a posteriori (MAP) estimates of relevant model parameters (e.g., parameters encoding synaptic connection strengths). We define a mapping M -> R^d that extracts the MAP estimates \theta_MAP from the posterior distribution p(\theta | x, m). This d-dimensional vector space represents the discriminative information required for the subsequent clustering step. When group differences carry rich information, it may additionally be beneficial to include elements of the posterior covariance matrix in the vector space. After creating a generative score space, any conventional kernel k: R^d x R^d -> R can be used to compare two inverted models. The simplest is the linear kernel k(x_i, x_j) = <x_i, x_j>, the inner product between two vectors x_i and x_j. Nonlinear kernels (e.g., quadratic, polynomial or radial basis functions), on the other hand, have several disadvantages for generative embedding: complex kernels come with an increased risk of overfitting, and the contribution of each model parameter is easiest to interpret in relation to the underlying model when the parameters do not undergo further transformation. A simple linear kernel is therefore highly recommended.
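As a minimal sketch of the kernel construction step, the function below builds the linear Gram matrix over a set of score-space vectors. The feature vectors passed in are hypothetical stand-ins for subject-wise MAP parameter estimates.

```python
def linear_kernel(features):
    """Gram matrix K with K[i][j] = <x_i, x_j> over score-space vectors."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in features]
            for xi in features]
```

The resulting symmetric matrix can be handed directly to any kernel-based clustering routine.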

2.3 Clustering Techniques

Clustering considers the problem of identifying groups of data points in a high-dimensional space. Suppose we have a data set consisting of N observations of a random D-dimensional variable x. Our goal is to partition the data set into some number K of clusters, where we assume for the moment that the value of K is known a priori. Intuitively, one might think of a cluster as comprising a group of data points which exhibit similar behavior. In this section, we will look at two commonly used baseline techniques: k-Means clustering and Gaussian mixture models.

2.3.1 K-Means clustering

K-means [26] is the simplest unsupervised learning algorithm that solves the well-known clustering problem. The goal is to find an assignment of data points to clusters, as well as a set of cluster centers called centroids, such that the sum of squared distances of each data point to its centroid becomes minimal. This is achieved with an iterative procedure which alternately optimizes the data point assignments and the centroid positions, repeated until the assignments no longer change. Because each phase reduces the value of the cost function, convergence of the algorithm is assured. One example is illustrated in Figure 2.1.
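The two alternating phases can be sketched as follows (an illustrative Python reimplementation of Lloyd's algorithm; the thesis' own implementation was in MATLAB, see Appendix A):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - c) ** 2 for a, c in zip(p, cen))
                     for cen in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl
               else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:
            break  # assignments are stable: converged
        centroids = new
    return centroids, clusters
```

On two well-separated blobs the procedure recovers the expected partition regardless of the (random) initialisation.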

Figure 2.1: K-Means Algorithm. This figure (adapted from [27]) shows the iterative k-Means algorithm for k = 2 on a given data set.

2.3.2 Gaussian Mixture Models

Gaussian mixture models are formed by combining multivariate normal density components. Similar to k-means clustering, Gaussian mixture modeling uses an iterative algorithm, Expectation-Maximization (EM), that converges to a local optimum. An example is given in Figure 2.2. Gaussian mixture models are more flexible, since they take into account both variance and covariance; the EM outcome can therefore accommodate clusters of variable size and correlated features much better than k-means. Furthermore, data assignments are now soft rather than strictly binary (hard), as they are based on the posterior probability of each component for each data point.
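For intuition, here is a deliberately simplified sketch of EM for a two-component mixture in one dimension (the actual analysis used multivariate mixtures; the initialisation and iteration count are arbitrary illustrative choices):

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture with soft assignments."""
    mu = [min(xs), max(xs)]      # crude but deterministic initialisation
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [pi[i] / math.sqrt(2 * math.pi * var[i])
                    * math.exp(-(x - mu[i]) ** 2 / (2 * var[i]))
                    for i in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate mixing weights, means and variances.
        for i in range(2):
            nk = sum(r[i] for r in resp)
            pi[i] = nk / len(xs)
            mu[i] = sum(r[i] * x for r, x in zip(resp, xs)) / nk
            var[i] = max(sum(r[i] * (x - mu[i]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return pi, mu, var
```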


Figure 2.2: Gaussian Mixture Models. This figure (adapted from [27]) illustrates the EM algorithm on the same data set for two clusters.

2.4 Model selection

One of the fundamental challenges of clustering is how to evaluate results without auxiliary information. A common approach is to use validity indices. Clustering validity approaches can draw on two different kinds of criteria: external and internal. The final selection depends on the kind of information available and the nature of the problem. For model selection, clustering results are evaluated purely on the basis of the data themselves. The best scores are typically assigned to solutions with high similarity within clusters and low similarity between clusters. In our analysis, we looked at four different internal validation measures.

2.4.1 Distortion

The simplest quality measure for cluster validation is the distortion. It is used as the cost function in k-Means clustering and is given by (see [27])

J = \sum_{n=1}^{N} \sum_{i=1}^{k} r_{ni} \, \lVert x_n - \mu_i \rVert^2    (2.1)

The distortion is the sum of squared distances of each data point to its corresponding cluster center. The goal is to find values for the assignments r_{ni} and the cluster means \mu_i that minimize J; the lower J, the better the model fit. Here k denotes the number of clusters and N the total number of data points.
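A direct transcription of this cost function, assuming hard assignments as produced by k-Means:

```python
def distortion(points, centroids, assignment):
    """Sum of squared distances of each point to its assigned centroid."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[assignment[i]]))
               for i, p in enumerate(points))
```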


2.4.2 Davies-Bouldin Index

The Davies-Bouldin Index (DBI) goes one step further, since it aims to identify sets of clusters that are compact and well separated. It is defined by

DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(\mu_i, \mu_j)}    (2.2)

where d(\mu_i, \mu_j) stands for the respective inter-cluster (centroid) distance and \sigma_i for the average intra-cluster distance of cluster i. The smaller the value, the more appropriate the clustering solution.
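The index can be computed as below; here the scatter σ_i is taken to be the average distance of cluster members to their centroid, a common convention that the text does not spell out explicitly:

```python
import math

def davies_bouldin(centroids, clusters):
    """Mean over clusters of the worst-case ratio of within-cluster
    scatter to centroid separation (lower is better)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    k = len(centroids)
    # sigma_i: average distance of the members of cluster i to its centroid
    sigma = [sum(dist(p, centroids[i]) for p in clusters[i]) / len(clusters[i])
             for i in range(k)]
    return sum(max((sigma[i] + sigma[j]) / dist(centroids[i], centroids[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k
```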

2.4.3 Log Likelihood

The results of GMM clustering differ from those computed by k-means, so we need a measure (see [27]) which takes the probabilistic, i.e. soft, assignments into account:

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{i=1}^{K} \pi_i \, \mathcal{N}(x_n \mid \mu_i, \Sigma_i)    (2.3)

In a mixture of Gaussians, the goal is to maximize this likelihood function with respect to its parameters (comprising the means \mu_i and covariances \Sigma_i of the components as well as the mixing coefficients \pi_i). The higher the likelihood value, the better the model fits the data.

2.4.4 Bayesian Information Criterion

The Bayesian Information Criterion (BIC) is designed to avoid overfitting and can be computed using the following formula:

BIC = -2 \ln(L) + k \ln(N)    (2.4)

where N is the sample size, L is the maximized value of the likelihood function for the estimated model, and k is the number of free parameters (here: clusters) in the Gaussian model. The first term of the BIC represents model fit, whereas the second expresses model complexity. The model with the lowest BIC is considered the optimal trade-off between the two.
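The criterion itself is a one-liner; the sketch below takes the maximized log likelihood and the parameter count as computed elsewhere:

```python
import math

def bic(log_likelihood, k, n):
    """BIC = -2 ln L + k ln N: the first term rewards fit, the second
    penalises model complexity; lower is better."""
    return -2.0 * log_likelihood + k * math.log(n)
```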

2.5 Predictive validity

Clustering results can also be evaluated with respect to an external criterion, such as known class labels. External criteria measure how closely the clustering solution has captured known structure in the data. We opted for the following two metrics of external validation, since they were most suitable to our application.

2.5.1 Balanced Purity

Purity is a simple evaluation measure [28]. To compute purity, each cluster is assigned to the label which is most frequent in the cluster. The accuracy of this assignment is measured by counting the number of correctly assigned points and dividing by the total number of samples. Formally,

purity(\Omega, L) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap l_j|    (2.5)

where \Omega = \{\omega_1, \omega_2, \ldots, \omega_K\} is the set of clusters and L = \{l_1, l_2, \ldots, l_J\} is the set of label classes. We interpret \omega_k as the set of points in cluster k and l_j as the set of points carrying label j. The term balanced emphasizes that the labels should be uniformly distributed within the data set; if the dataset is imbalanced, the purity is inflated. In order to remove this bias, we perform a linear shift,

balanced purity = \frac{1}{2} \cdot \frac{purity - R_{labels}}{1 - R_{labels}} + \frac{1}{2}    (2.6)

where R_{labels} refers to the proportion of the larger label class. In the case where the label classes have equal cardinality (R_{labels} = 50%), the balanced purity reduces to the purity itself. Bad clusterings have purity values close to 0; a perfect clustering has a purity of 1. Note that purity increases with a growing number of clusters; in particular, the purity is 1 if each point gets its own cluster. We therefore cannot use this measure to trade off the quality of the clustering against the number of clusters.
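Both measures can be sketched in a few lines of Python; `balanced_purity` implements the linear shift described in the text, with `r_labels` the proportion of the larger label class (the exact form of the shift is reconstructed from the text, not quoted from the thesis code):

```python
from collections import Counter

def purity(assignment, labels):
    """Each cluster votes for its most frequent label; score is the
    fraction of points matching their cluster's majority label."""
    n = len(labels)
    hits = 0
    for c in set(assignment):
        in_c = [labels[i] for i in range(n) if assignment[i] == c]
        hits += Counter(in_c).most_common(1)[0][1]
    return hits / n

def balanced_purity(p, r_labels):
    """Linear shift removing the bias of imbalanced label classes;
    for r_labels = 0.5 (equal classes) it reduces to the purity itself."""
    return 0.5 * (p - r_labels) / (1.0 - r_labels) + 0.5
```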

2.5.2 Normalized Mutual Information

A measure that allows us to make this trade-off is the Normalized Mutual Information (NMI) [28]:

NMI(\Omega, L) = \frac{I(\Omega; L)}{[H(\Omega) + H(L)]/2}    (2.7)

where I denotes the mutual information,

I(\Omega; L) = \sum_{k} \sum_{j} P(\omega_k \cap l_j) \log \frac{P(\omega_k \cap l_j)}{P(\omega_k) P(l_j)}
             = \sum_{k} \sum_{j} \frac{|\omega_k \cap l_j|}{N} \log \frac{N |\omega_k \cap l_j|}{|\omega_k| |l_j|}    (2.8)

P(\omega_k), P(l_j), and P(\omega_k \cap l_j) are the probabilities of a data point being in cluster \omega_k, in label class l_j, and in their intersection, respectively. The second line follows from the maximum likelihood estimates of these probabilities (i.e., the estimate of each probability is the corresponding relative frequency). H is called the entropy and is defined as:

H(\Omega) = -\sum_{k} P(\omega_k) \log P(\omega_k) = -\sum_{k} \frac{|\omega_k|}{N} \log \frac{|\omega_k|}{N}    (2.9)

H(L) = -\sum_{j} P(l_j) \log P(l_j) = -\sum_{j} \frac{|l_j|}{N} \log \frac{|l_j|}{N}    (2.10)

where the second equality again uses the maximum likelihood estimates of the probabilities. I(\Omega; L) measures the amount of information by which our knowledge about the cluster assignments increases when we learn the class labels. Maximum mutual information is reached for a clustering that perfectly recreates the class labels, but also, for instance, when K = N; as with purity, large cardinalities are not penalized. The normalization by the denominator [H(\Omega) + H(L)]/2 resolves this issue, since entropy increases with the number of clusters. For instance, H(\Omega) reaches its maximum \log N for K = N, which ensures that the NMI is low for K = N. Because NMI is normalized, we can use it to compare clusterings with different numbers of clusters. The denominator is chosen in this particular form because [H(\Omega) + H(L)]/2 is a tight upper bound on I(\Omega; L); the NMI therefore always lies between 0 and 1.
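A compact sketch of the NMI computation, using relative frequencies as the maximum likelihood probability estimates:

```python
import math
from collections import Counter

def nmi(assignment, labels):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(labels)
    pc = Counter(assignment)                    # cluster sizes
    pl = Counter(labels)                        # label class sizes
    joint = Counter(zip(assignment, labels))    # intersection sizes
    mi = sum(c / n * math.log(n * c / (pc[k] * pl[l]))
             for (k, l), c in joint.items())
    h_c = -sum(c / n * math.log(c / n) for c in pc.values())
    h_l = -sum(c / n * math.log(c / n) for c in pl.values())
    return mi / ((h_c + h_l) / 2)
```

A clustering that exactly recreates the labels scores 1; one independent of the labels scores 0.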

2.6 Bootstrap

The bootstrap technique (see [29, 30] for more details) is an integral part of statistics. Essentially, it allows us to estimate the uncertainty about a statistical property of the data (such as its mean, variance or standard deviation). This is achieved by measuring the desired property repeatedly while sampling from an approximating distribution; one standard choice is the empirical distribution of the observed data. In the particular case where the observations can be assumed to be independent and identically distributed, this can be implemented by creating a number of new samples of equal size, each drawn by random sampling with replacement from the original dataset. A striking advantage of the bootstrap is its simplicity: it is straightforward to derive estimates of standard errors and confidence intervals. Using a bootstrap allows us to characterize the uncertainty of any validation statistic with regard to the population from which the subjects were drawn. By capturing this between-subjects (or random-effects) uncertainty, our results apply to the entire population, not just to the particular sample drawn from it. Hence, it is a pertinent way to assess the stability of the obtained results (here, the different clustering validation criteria).
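A percentile-bootstrap sketch of the resampling scheme described above (the 2000 replicates and the seed are arbitrary illustrative choices):

```python
import random

def bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic, and read off an empirical (1 - alpha) confidence interval."""
    rng = random.Random(seed)
    stats = sorted(statistic([rng.choice(data) for _ in data])
                   for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

In our setting, `statistic` would be one of the clustering validation criteria evaluated on a resampled set of trials or subjects.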

2.7 Illustration of Clusters

In order to visualize our data set, we have opted for an exploratory illustration technique called multidimensional scaling (MDS). This facilitates interpretation and serves as a sanity check of the obtained clustering results. Multidimensional scaling (see [31] for an overview) encompasses a collection of methods which provide insight into the underlying structure of relations between entities by representing these relations geometrically. As such, these methods belong to the more generic class of techniques for multivariate data analysis. Any relation between a pair of entities that can be transformed into a proximity or dissimilarity measure can serve as input for multidimensional scaling. The choice of the type of spatial representation is the most important modelling decision and is determined by the application context. In this project, we have applied Classical Multidimensional Scaling (CMDS), also known as Principal Coordinates Analysis, to visualize the clustering solutions of both algorithms. This method takes a matrix of interpoint distances and creates a new configuration of points, reconstructed as a linear combination of all feature dimensions, whose Euclidean distances approximately reproduce the original distance matrix.
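Classical MDS reduces to an eigendecomposition of the double-centred squared distance matrix. The following NumPy sketch is illustrative (the thesis' implementation was in MATLAB):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS (principal coordinates): double-centre the squared
    distance matrix and embed along the leading eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # Gram matrix of centred points
    w, v = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]        # keep the largest `dim`
    scale = np.sqrt(np.maximum(w[idx], 0.0))
    return v[:, idx] * scale
```

For a Euclidean-embeddable distance matrix, the pairwise distances of the returned coordinates reproduce the input matrix.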


Chapter 3

Results
This chapter is devoted to the results of our analysis. First, we briefly review our two independent neuroimaging data sets. Then, we validate the performance of generative embedding using several different criteria. As an additional control, selected solutions are visualized to obtain a deeper understanding of the underlying structure. Finally, we examine the outcome with respect to its computational efficiency. Results are presented for both data sets.

3.1 Application to LFP Data

Our first dataset was obtained from mice in an earlier experiment (see [1] for more details). After the induction of anaesthesia and surgical preparation, the animals were mounted on a stereotactic frame. After inserting a single-shank electrode with 16 channels into the barrel cortex, voltage traces were monitored from all sites (duration: 2 s). Local field potentials were obtained by passing the data through a band-pass filter (bandwidth: 1-200 Hz). In every trial, one of two whiskers was stimulated by means of a quick flexure of a piezo actuator. The two neighboring whiskers were chosen such that they generated reliable responses at the site of each recording (dataset A1: whiskers E1 and D3; dataset A2: whiskers C1 and C3; datasets A3-A4: whiskers D3 and ). The experiment involved a total of 600 samples for each mouse (see Fig. 3.1). The goal was to determine, based on the neuronal activity, which particular whisker had been twitched in each trial. Data was collected from three adult male mice. In one of these, an additional session (with 100 trials) was conducted after the standard procedure described above. In this session, the actuator was moved very close to the whiskers but did not touch them, serving as a control condition to ensure that experimental artifacts were not interfering with the decoding performance.


Figure 3.1: LFP data (Experimental Design). This figure (as given in [1]) shows how the stimuli are delivered with the help of a piezo actuator. Local field potentials are recorded from the barrel cortex using a 16-site silicon probe.

3.1.1 DCM for LFP

In this section, we provide a brief summary of the main modelling principles of DCM for LFP data (see [1] for an elaborate description). The neural-mass model in DCM acts as the bottom layer within the model chain. It describes a set of n neuronal populations (defined by m states each) as a system of interacting elements and models their dynamics under experimental perturbations. At each time point t, the state of the system is represented by a vector x(t) in R^{nm}. The evolution of the system over time is described by a set of differential equations that evolve the state vector and account for conduction delays among spatially distinct populations. The equations specify the rate of change of activity in each region (i.e., of each element in x(t)) as a function of three variables: the current state x(t) itself, the strength of experimental inputs u(t) (e.g., sensory stimulation), and a set of time-invariant parameters \theta. More formally, the dynamics of the model are given by dx/dt = f(x, u, \theta). The counterpart to this neuronal model is the forward model within DCM, which describes how (hidden) neuronal activity in individual regions generates the (observed) measurements. The model expresses field potentials as a combination of activity in three local neuronal populations of every single brain region: excitatory pyramidal cells (60%); inhibitory interneurons (20%); and spiny stellate (or granular) cells (20%).


After inverting the model, we constructed the feature space of a single-region DCM by including the estimated posterior means of all intrinsic parameters \theta, as well as the posterior variances. This led to a feature vector in R^{14} for each animal.

3.1.2 Clustering Validation

We investigate how well our algorithms have performed with respect to goodness of fit and in comparison to external benchmarks (here: labels), considering both standard algorithms: k-Means and Gaussian mixture models. The figures below (Fig. 3.2 and 3.3) show two internal criteria each for k-Means (distortion, Davies-Bouldin index) and for Gaussian mixture models (negative log likelihood, Bayesian information criterion). The mean values are shown together with 95% confidence intervals for all animals. The normalized distortion(1) represents the cost function for k-Means. Its monotonically falling behavior is self-explanatory, since the average intra-cluster distance decreases with an increasing number of clusters. The normalized Davies-Bouldin index behaves in a similar manner(2), although its descent is additionally scaled by the inter-cluster distance and the number of clusters. The negative log likelihood serves as a cost function, and its steady descent is expected, since its counterpart, the log likelihood, is the objective of the GMM scheme: a growing number of clusters makes it easier to accommodate the data points, so the model fit improves. The BIC weighs model fit against model complexity; a lower BIC value indicates a better model. The drop tells us that the minimum value is achieved at k = 10. However, this is only a local optimum restricted by our chosen window of cluster sizes [1, 10]. In this particular scenario, we therefore cannot consider the BIC a very meaningful metric.

¹ The term "Normalized" articulates the scaling of values by the sample size of trials.
² One must note that the DB Index is only computed from k = 2 onwards, as the inter-cluster metric does not exist for a single cluster.
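Two of these internal criteria can be computed in a few lines. The sketch below uses toy data, not the LFP features, and a hypothetical parameter count for the BIC:

```python
import numpy as np

def distortion(X, centers, labels):
    """k-Means cost: summed squared distance of each point to its centroid."""
    return float(((X - centers[labels]) ** 2).sum())

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: fit penalized by model complexity."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

# Two tight clusters: the two-centroid solution has a far smaller cost
# than collapsing everything onto the grand mean.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.05, 0.0], [5.05, 5.0]])
labels = np.array([0, 0, 1, 1])
d2 = distortion(X, centers, labels)
d1 = distortion(X, X.mean(0, keepdims=True), np.zeros(4, dtype=int))
```

Adding parameters at a fixed likelihood always increases the BIC, which is exactly the fit-versus-complexity trade-off discussed above.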


Figure 3.2: Internal Validation. This figure illustrates the internal criteria for the outcome of k-Means (for up to 10 clusters) on the electrophysiological dataset.

Figure 3.3: Internal Validation. This figure shows the internal criteria for the outcome of GMM (for up to 10 clusters) on the electrophysiological dataset.

The two external quality metrics, Balanced Purity and Normalized Mutual Information, can be compared across individual subjects. From Figure 3.4, we observe a noticeably disjunct evolution for our experimental subjects in contrast to the control animal, which exhibits only a slim advantage over the null baseline, where the class labels are shuffled in random order (i.e., class membership is arbitrary) in every computation. The highest discrepancy in NMI values lies between animals 1 and 4 in k-Means, and between animals 3 and 4 in GMM (see the next subsection for a detailed discussion).
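Both external criteria compare a clustering against the known labels. A from-scratch sketch of standard purity and NMI follows; note that the thesis uses a balanced purity variant, which additionally corrects for unequal class sizes:

```python
import numpy as np
from collections import Counter

def purity(clusters, classes):
    """Fraction of points lying in the majority class of their cluster."""
    total = 0
    for c in set(clusters):
        members = [cls for k, cls in zip(clusters, classes) if k == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(clusters)

def nmi(clusters, classes):
    """Normalized mutual information between a clustering and true classes."""
    n = len(clusters)
    def entropy(xs):
        p = np.array(list(Counter(xs).values())) / n
        return -np.sum(p * np.log(p + 1e-12))
    mi = 0.0
    for c, nc in Counter(clusters).items():
        for g, ng in Counter(classes).items():
            ncg = sum(1 for k, cls in zip(clusters, classes) if k == c and cls == g)
            if ncg:
                mi += (ncg / n) * np.log(n * ncg / (nc * ng))
    h = np.sqrt(entropy(clusters) * entropy(classes))
    return mi / h if h > 0 else 0.0
```

A clustering that reproduces the labels exactly scores 1 on both criteria, whereas a label-independent clustering scores an NMI of 0, which is precisely the behaviour of the shuffled-label baseline.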


According to k-Means, the ideal number of clusters is k = 2; in the case of GMM it lies between k = 3 and k = 5. Since the ground truth is known to be two, we learn that our selected algorithms are able to discover groups that are in line with the true structure. We therefore conclude that our clustering technique is sensitive, or well tuned, to the input data.

Figure 3.4: External Validation. This figure compares the two external criteria for all four animals in k-Means and GMM (for up to 10 clusters).


3.1.3

Visualization

We use a technique called CMDS (classical multidimensional scaling; see Section 2.7) to visualize our clustering solutions graphically. We contrast the previously selected cluster sizes (k = 2, 3 and 4) for an experimental subject and a control animal. From Figure 3.5 we see that the newly discovered structures for both animals are almost equally balanced between the whiskers for both k = 2 and k = 4. Hence, the purity remains the same (as seen in Figure 3.4). However, the NMI value decreases due to the two additional clusters. We conclude that k-Means favours two clusters for the experimental subject, which is in agreement with the number of whisker types.
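CMDS reduces the pairwise distances in the feature space to two dimensions for plotting: the squared-distance matrix is double-centred and eigendecomposed. A self-contained sketch, with arbitrary toy input sizes:

```python
import numpy as np

def classical_mds(X, dim=2):
    """Classical (metric) MDS: embed points so that pairwise Euclidean
    distances are preserved as well as possible in `dim` dimensions."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                   # centering matrix
    B = -0.5 * J @ D2 @ J                                 # double centering
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]                       # top eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 14))        # e.g. 10 subjects in a 14-D feature space
Y = classical_mds(X, dim=2)          # 2-D coordinates for plotting
```

When the data are intrinsically low-dimensional, the embedding reproduces the original pairwise distances exactly (up to rotation and reflection).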

Figure 3.5: MDS for k-Means. This figure compares the k-Means clustering solution for k = 2 and k = 4 for an experimental animal (subject 1) and the control subject (subject 4).

Figure 3.6 looks at the GMM clustering solution for k = 2 and k = 3 for two subjects. We see that the additional cluster in the control animal hardly affects the accuracy, whereas the majority of members of the new blue grouping for subject 3 belong to one whisker class. Thus, the overall homogeneity rises, as seen in Figure 3.4.


Figure 3.6: MDS for Gaussian Mixture Models. This figure contrasts the GMM clustering solution for k = 2 and k = 3 for an experimental animal (subject 3) and the control subject (subject 4).

3.1.4

Computational Eciency

All validation measures used in this study are subject to two sources of variance. First, they are influenced by within-trial (or within-subject) variability, due to the non-deterministic nature of the employed clustering algorithms. Second, they are subject to between-trial (or between-subject) uncertainty, due to the bootstrap procedure that we used to enable inference on the population from which the available trials (or subjects) were sampled. Using simulations, we assessed how many algorithmic iterations we would need to obtain estimates that were sufficiently stable to allow for a comparison of different models. In particular, we repeated our analysis (using resampled data) 100 times and analysed the evolution of the running mean (and running standard error) of the statistics of interest. We also visualize an error mark of 1% (relative to the final value), which tells us that from this point onwards, our results lie within 1% of the final estimate.

The simulation plots given below (Figure 3.7) visualize the computational efficiency for the LFP data set, for the same cluster range as used for CMDS, on the most resource-intensive computations: one internal and one external criterion. This is depicted for k-Means clustering. We see that the normalized DBI shows no fluctuations at all for k = 2 (since both error marks are located at 0), but takes a while to converge for higher cluster numbers. The NMI is perfectly stable for two or three clusters and starts varying only from k = 4 onwards.

Figure 3.7: Criteria Stability. The two plots show the cumulative mean estimates for validation criteria over a total of 100 bootstrap samples. The validation is performed on k-Means.
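The 1% error mark can be computed directly from the running mean. A sketch, with synthetic draws standing in for the bootstrapped criterion values:

```python
import numpy as np

def iterations_to_stability(samples, tol=0.01):
    """First index from which the running mean stays within `tol`
    (relative) of its final value -- a sketch of the 1% error mark."""
    running = np.cumsum(samples) / np.arange(1, len(samples) + 1)
    final = running[-1]
    ok = np.abs(running - final) <= tol * abs(final)
    bad = np.where(~ok)[0]            # iterations still outside the band
    return int(bad[-1] + 1) if bad.size else 0

rng = np.random.default_rng(1)
samples = rng.normal(loc=0.8, scale=0.05, size=100)   # e.g. bootstrapped NMI
k = iterations_to_stability(samples)
```

For a perfectly stable criterion the function returns 0; noisier criteria need more iterations before their running mean settles inside the 1% band.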

3.2

Application to fMRI Data

We used data from two groups of participants (patients with moderate aphasia vs. healthy controls) engaged in a simple speech-processing task (see [2] for details on the experiment). The two groups consisted of 26 right-handed healthy participants with normal hearing, English as their first language, and no neurological disease in their medical record (12 female; mean age 54.1 years; range 26-72 years); and 11 patients diagnosed with moderate aphasia due to stroke (1 female; mean age 66.1; range 45-90 years). The patients' aphasia profile was typified using the Comprehensive Aphasia Test [32]. The scores fell within the aphasic range for: spoken and written word comprehension (single-word and sentence level), single-word repetition, and object naming. One must keep in mind that the lesions did not affect any of the temporal regions included in our analysis model (see Fig. 3.8).


Subjects were presented with two types of auditory stimuli: (i) normal speech; and (ii) time-reversed speech. They were asked to make a gender judgement on each auditory stimulus by a brief button press.

Figure 3.8: Right Hemisphere of the Brain. This figure (as given in [33]) highlights the selected brain regions. Our analysis was performed on the activity of six non-lesioned thalamo-temporal regions, three from each hemisphere.

3.2.1

DCM for fMRI

DCMs for fMRI data consist of two hierarchical layers [34]. The first layer is a neuronal model of the dynamics of interacting neuronal populations under experimental perturbations. Experimental manipulations u can enter the model in two different ways: they can evoke responses through direct influences on specific regions (e.g., sensory inputs) or modulate the strength of coupling among regions (e.g., task demands or learning). In this project, we have used the classical bilinear DCM for fMRI [3]:

dz(t)/dt = f(z(t), θ_n, u(t)) = (A + Σ_j u_j(t) B^(j)) z(t) + C u(t),    (3.1)

where z(t) represents the neuronal state vector z at time t, A is a matrix of endogenous connection strengths, B^(j) represents the additive change of these connection strengths induced by modulatory input j, and C denotes the strengths of direct (driving) inputs. These neuronal parameters θ_n = (A, B^(1), ..., B^(J), C) are rate constants with units s^-1. The second layer of a DCM is a biophysically motivated forward model that

describes how a given neuronal state translates into a measurement. This model has haemodynamic parameters θ_h and comprises a Gaussian measurement error ε. The nonlinear operator g(z(t), θ_h), a set of nonlinear differential equations, connects a neuronal state z(t) to a predicted blood oxygen level dependent (BOLD) signal via changes in vasodilation, blood flow, blood volume, and deoxyhaemoglobin content (see [35] for details):

y(t) = g(z(t), θ_h) + ε.    (3.2)

The construction of the generative score space was based on the MAP estimates of the neuronal model parameters θ_n. The resulting space contained 22 features in total.
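The neuronal layer in Equation (3.1) can be simulated directly. The coupling values, input timing and region count below are toy assumptions, and forward Euler replaces a proper ODE solver:

```python
import numpy as np

def simulate_bilinear_dcm(A, B, C, u, z0, dt=1e-2, steps=200):
    """Forward-Euler simulation of the bilinear neuronal model (Eq. 3.1)
    with a single modulatory input: dz/dt = (A + u(t) B) z + C u(t)."""
    z = np.asarray(z0, dtype=float)
    out = [z.copy()]
    for k in range(steps):
        uk = u(k * dt)
        z = z + dt * ((A + uk * B) @ z + C * uk)
        out.append(z.copy())
    return np.array(out)

A = np.array([[-0.5, 0.0],
              [0.3, -0.5]])              # endogenous coupling (s^-1)
B = np.array([[0.0, 0.0],
              [0.2, 0.0]])               # input-dependent modulation of 1 -> 2
C = np.array([1.0, 0.0])                 # driving input enters region 1 only
u = lambda t: 1.0 if t < 0.5 else 0.0
Z = simulate_bilinear_dcm(A, B, C, u, z0=[0.0, 0.0])
```

Because region 2 receives no driving input, its activity is induced purely through the (modulated) coupling from region 1, which is the kind of mechanism the estimated parameters θ_n encode.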

3.2.2

Regional Correlations

We compared generative embedding to a simple approach based on undirected regional correlations. Given the time series (i.e., the raw data) of brain recordings, we calculated the spatial mean activity for each region of interest (six in total). In the next step, we computed the Pearson correlation coefficients, which measure the linear dependence between the time series of any two regions. This resulted in a final feature vector in R^15, which was fed into the clustering algorithms.
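With six regional time series, the strictly upper triangle of the 6×6 correlation matrix yields the 15 features. A sketch, with synthetic time series standing in for the regional mean activities:

```python
import numpy as np

def correlation_features(ts):
    """Upper-triangular Pearson correlations between regional time series.

    ts: array of shape (n_timepoints, n_regions)."""
    R = np.corrcoef(ts, rowvar=False)        # n_regions x n_regions matrix
    iu = np.triu_indices_from(R, k=1)        # strictly upper triangle
    return R[iu]

rng = np.random.default_rng(2)
ts = rng.normal(size=(120, 6))               # six regions of interest
feats = correlation_features(ts)             # 6 * 5 / 2 = 15 features
```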

3.2.3

PCA Reduction

Another common method applied to high-dimensional data sets is Principal Component Analysis (PCA). This technique condensed the original data, comprising almost 4000 dimensions, to a meagre 37-dimensional feature space of neuronal activity. The dimensionality was chosen so as to match the number of features in generative embedding. The individual variance components were sorted in decreasing order. In contrast to generative embedding, PCA provides a linear decomposition of the data without a mechanistic view of how it could have been produced.
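A PCA reduction of this kind can be written with a plain SVD; the data shape below (37 subjects by 500 voxels) is a scaled-down stand-in for the real set:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project centred data onto its top principal components,
    sorted by decreasing explained variance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_var = s ** 2 / (X.shape[0] - 1)
    return Xc @ Vt[:n_components].T, explained_var[:n_components]

rng = np.random.default_rng(3)
X = rng.normal(size=(37, 500))           # subjects x voxel-wise activity
Z, var = pca_reduce(X, n_components=10)  # reduced feature space
```

The singular values returned by the SVD are already sorted, so the retained components are automatically ordered by explained variance.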

3.2.4

Clustering Validation

We first verify the performance of our model-based approach and compare it to the results of the two alternative approaches. This is done for both algorithms, k-Means as well as Gaussian mixture models. The figure below (Fig. 3.9) illustrates the internal quality criteria. All measures are shown along with 95% confidence intervals. The Distortion measure, the Davies-Bouldin Index and the Negative log likelihood follow a similar pattern as seen for the electrophysiological data. We recall the previously established conclusion that the model fit improves with the number of clusters and confirm this statement for our second data set. However, this time we do attain an optimal number of clusters, namely k = 7 (see the plot for the BIC). We mark this area for further investigation (see the next subsection).

Figure 3.9: Internal Validation. This figure shows the internal criteria for validating clustering solutions obtained by k-Means and GMM (for up to 30 clusters).

In a next step, the external criteria are juxtaposed with the alternative data-extraction approaches as well as with the null baseline (where the class labels are randomly permuted). This is shown in Figure 3.10.


The external criteria lie, by definition, within the interval [0, 1]. The Balanced purity starts at 0.5 (which can be seen as the equilibrium point, since all points are assigned to one single group) and gradually rises towards 1, where each point is assigned to its own cluster. Unlike the Balanced purity, the Normalized mutual information penalizes large cluster cardinalities. We observe a striking jump from k = 3 to the peak value at k = 4 for both criteria in k-Means, which we examine in more detail by visualizing the clustering solution (see the next subsection). In contrast to k-Means clustering, the external metrics for Gaussian mixture models evolve smoothly, without salient fluctuations.

Figure 3.10: Approach Comparison. This figure shows the external criteria across different feature-extraction modes, computed on the solutions of k-Means and GMM.

We derive two fundamental conclusions from these illustrations: (i) all techniques are consistently superior to their baseline variant that uses randomly permuted labels; and (ii) our model-based approach, generative embedding, performs noticeably better than the alternative methods based on regional correlations and PCA reduction.

3.2.5

Visualization

We choose our critical cluster regions (as mentioned in the previous subsection) and visualize them for k-Means and Gaussian mixture models. Figure 3.11 explains the startling jump we observed before in Figure 3.10. The transition from k = 2 to k = 3 gives a minor increase in homogeneity. However, in the next step, the red cluster (for k = 2) is split up into two perfectly homogeneous clusters, each of which favours healthy controls. A new red cluster contains patients with only three exceptions, as seen in Figure 3.11. Hence, the value of the Balanced purity rises instantly to 84%.

Figure 3.11: MDS for k-Means. This figure shows a graphical illustration of the k-Means clustering solution for k = 2 and 4.

In contrast to k-Means, the Gaussian mixture model features no remarkable improvement in cluster homogeneity in this area. Instead, we pick the region of interest to be where the BIC reaches its minimum. From Figure 3.12 we observe that the two clusters dominated by patients are almost identical for k = 5 and k = 6. The healthy controls in the lower half are, however, separated into two groups, which increases the model fit. The progression from k = 6 to k = 7 and from k = 7 to k = 10 yields a change in model fit (another partition of the lower half, coloured in gray) which, however, is not compensated by the model-complexity penalty (the second term of the BIC) as well as in the first transition. Therefore, k = 7 is our best feasible trade-off between fit and complexity in this scenario.


Figure 3.12: MDS for GMM. This figure shows a CMDS plot of the Gaussian mixture models clustering solution for k = 5, 6, 7 and 10.

3.2.6

Computational Eciency

As before, our quality metrics are inspected with regard to computational resources. The iteration plots given below (Figure 3.13) illustrate the computational efficiency for the fMRI data set, for the cluster range k = 2, 3 and 4 in the case of GMM, on two selected criteria. We learn that the Normalized mutual information takes long to converge, whereas the Balanced purity reaches its final value after about half of the total bootstrap iterations. One must note that the time and computational resources required for the LFP data set exceed those required for the fMRI data, for the simple reason that the former set is more data-intensive than the latter. The primary conclusion we derive from these simulations is that we do not need millions of iterations; a couple of hundred suffice to give us a reasonable estimate. This makes our approach considerably faster.

29

Figure 3.13: Criteria Stability. The two plots show the cumulative mean estimates for the external validation criteria over a total of 200 bootstrap samples. Here, we look at the values for GMM.


Chapter 4

Discussion
In this chapter, we first give a brief rundown of our findings. We then take a closer look at the limitations that confine our approach, before closing this report with suggestions for further improvement and an outlook on more advanced analysis techniques.

4.1

Summary of Results

Generative embedding has previously been shown to be a powerful technique for classifying neural systems by relating patterns of connectivity parameters to an external (clinical) label. However, it has been unknown whether generative embedding might also prove useful for discovering new structure in a group of data, where the relationship between neural activity and external labels is not given beforehand. In the course of this study, clustering was performed and validated on two independent data sets. The implementation in MATLAB has led to two key results. First, it was shown that the baseline techniques applied to a set of experimental trials recorded using electrophysiology (LFP data) were able to detect plausible structure within the data, which agreed with the ground truth to a fair degree. Second, while aiming to discover clusters in a group of subjects engaged in a passive speech-listening task, our model-based analysis scheme demonstrated a compelling advantage over conventional approaches, which do not exploit the discriminative information encoded in hidden physiological quantities.

4.2

Limitations

We observe huge error bars (see Section 3.2.4, Clustering Validation). Hence, our results come with a high level of uncertainty. From a technical perspective, it is the low amount of fMRI data which limits the reliability of our results. In order to make more accurate and dependable statements, we must increase the quantity of acquired data. However, it is organisationally difficult to recruit many patients (one must find them within the given time window of a study and obtain their consent), and it is also rather expensive to use an MRI scanner (around 400 CHF per hour).

A crucial algorithmic aspect is internal validation. In contrast to the Bayesian information criterion, we have not made use of any metric for k-Means that is capable of balancing model fit against model complexity. Our current verification targets the goodness of fit and, therefore, does not give us the full picture required to pass a solid judgement.

From an even more fundamental perspective, the choice of a variable of interest represents a subtle challenge. As long as decoding restricts itself to experimental variables or presumed cognitive states, inference about information processing will be limited. For instance, if decision making is implemented by a series of processing stages, each representing a different decision-related quantity, then spatial localization of choice will not afford major insights into the nature of neural computations.

4.3

Future Work

Along with an increasing interest in the analysis of neuroimaging data sets, a rich variety of new methods has been proposed over the last few years. These methods have profited enormously from previous decades of machine-learning research, though various field-specific lessons have been learnt as well. Here, we look at two recent techniques, which address the problem from an information-theoretic or Bayesian point of view.

Approximation Set Coding (ASC) (see [36] for details) represents an information-theoretic approach to clustering. Under this view, the best solution is the one that optimally balances the competing objectives of informativeness and stability; this is the solution which extracts the most bits from the data. Informativeness is maximized with as many clusters as data points, whereas stability reaches its peak value with a single cluster that comprises the entire data set.

Variational Bayes Gaussian Mixture Models (VB-GMM) (see [37] for details) are another approach to clustering, which uses the model evidence as the criterion for model selection. Similar to a traditional GMM, the method starts from arbitrary initialization parameters and iteratively converges to an optimum of the log model evidence (using an EM-style algorithm).

Though still at a preliminary stage, clustering of neuroimaging data sets is likely to continue to evolve rapidly in the coming years of brain research. Much is at stake: basic research about the mapping between structure and function on the one hand; applications in engineering, in a legal context, and in medical diagnosis on the other. In the domain of spectrum disorders, for instance, one could decompose groups of patients with similar symptoms into pathophysiologically distinct subgroups.
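The iterative scheme that VB-GMM refines can be illustrated with plain maximum-likelihood EM for a two-component, one-dimensional mixture. This is a deliberately minimal sketch: VB-GMM additionally places priors on the parameters and optimizes a bound on the model evidence rather than the likelihood alone:

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])        # crude initialization
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) \
               / (sd * np.sqrt(2 * np.pi))
        r = dens / dens.sum(1, keepdims=True)
        # M-step: update mixing weights, means and standard deviations
        nk = r.sum(0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk)
    return pi, mu, sd

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)])
pi, mu, sd = em_gmm_1d(x)    # recovers the two well-separated modes
```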


Appendix A

MATLAB Implementation
This part of the appendix provides a short description of the main analysis and code scripts that were implemented in MATLAB for the purpose of this study. They are subdivided into the following categories: feature extraction, clustering, bootstrapping, visualization, and operating on the cluster.

A.1

Feature Extraction

Table A.1 shows the MATLAB scripts which load the data and true labels from their respective directories for further processing of each data set.

Function           | Input                        | Output
load_elec_data     | subject (1, 2, 3 or 4)       | data matrix, labels vector
load_aphasic_data  | mode (o, m or p for k-Means) | data matrix, labels vector

Table A.1: Feature Extraction

A.2

Clustering

Table A.2 enumerates the scripts which perform clustering and output metrics for evaluation.

Function               | Input                                                          | Output
kmeans_on_dataset_data | k (cluster size), data, labels, varargin (visualization flag)  | (Normalized) Distortion, (Normalized) DBI, Balanced purity, Random purity, NMI, Random NMI
gmm_on_dataset_data    | k (cluster size), data, labels, varargin (visualization flag)  | Negative Log Likelihood, BIC, Balanced purity, Random purity, NMI, Random NMI

Table A.2: Clustering


A.3

Bootstrap

The table below (Table A.3) lists the three main scripts for the bootstrap technique.

Function                  | Input                                         | Output
cumbootstraps             | mean vector, confidence interval              | running mean, upper and lower bound, 99% mark
bootstrap_on_aphasic_data | mode (o, m or p), algo (kmeans or gmm)        | None
bootstrap_on_elec_data    | subject (1, 2, 3 or 4), algo (kmeans or gmm)  | None

Table A.3: Bootstrap

A.4

Visualization

Various measures have been visualized to understand, interpret and compare results. Table A.4 contains the plot and MDS scripts.

Function                | Input | Output
mds_plot                | None  | None
plot_quality_measures   | None  | None
plot_allmodes           | None  | None
plot_allsubjects        | None  | None
plot_iteration_subplots | None  | None

Table A.4: Visualization

A.5

Operating on Cluster

In order to speed up computation and save time, heavy computations were delegated to Nash. Table A.5 summarizes the scripts used to run code on the cluster.

Function               | Input | Output
run_elec_on_cluster    | None  | None
run_aphasic_on_cluster | None  | None

Table A.5: Cluster Operation


Bibliography
[1] Kay H. Brodersen, F. Haiss, C.S. Ong, M. Tittgemeyer, F. Jung, J.M. Buhmann, B. Weber, and K.E. Stephan. Model-based feature construction for multivariate decoding. NeuroImage, 56:601-615, May 2011.
[2] Kay H. Brodersen, Thomas M. Schofield, Alexander P. Leff, Cheng Soon Ong, Ekaterina I. Lomakina, Joachim M. Buhmann, and Klaas E. Stephan. Generative embedding for model-based classification of fMRI data. PLoS Computational Biology, 7, June 2011.
[3] Lee M. Harrison, Will D. Penny, and Karl J. Friston. Dynamic causal modelling. NeuroImage, 19:1273-1302, August 2003.
[4] Klaas E. Stephan, Karl J. Friston, and Chris D. Frith. Dysconnection in schizophrenia: From abnormal synaptic plasticity to failures of self-monitoring. Schizophrenia Bulletin, 35:509-527, May 2009.
[5] B. Horwitz, K.J. Friston, and J.G. Taylor. Neural modeling and functional brain imaging: an overview. Neural Networks, 13:829-846, November 2000.
[6] Manuele Bicego, Vittorio Murino, and Mario A.T. Figueiredo. Similarity-based classification of sequences using hidden Markov models. Pattern Recognition, 37:2281-2291, December 2004.
[7] Tony Jebara, Risi Kondor, and Andrew Howard. Probability product kernels. Journal of Machine Learning Research, 5:819-844, December 2004.
[8] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. Proceedings of AISTATS, 10:136-143, 2004.
[9] Marco Cuturi, Kenji Fukumizu, and Jean-Philippe Vert. Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169-1198, December 2005.
[10] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Scene classification via pLSA. In Proceedings of the European Conference on Computer Vision (ECCV), LNCS 3954:517-530, 2006.
[11] A. Bosch, A. Zisserman, and X. Munoz. Scene classification using a hybrid generative/discriminative approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4):712-727, 2008.
[12] Manuele Bicego, Elzbieta Pekalska, David M.J. Tax, and Robert P.W. Duin. Component-based discriminative classification for hidden Markov models. Pattern Recognition, 42(11):2637-2648, 2009.
[13] Nathan Smith and Mahesan Niranjan. Data-dependent kernels in SVM classification of speech patterns. Sixth International Conference on Spoken Language Processing, 2001.
[14] Alex D. Holub, Max Welling, and Pietro Perona. Combining generative models and Fisher kernels for object recognition. Tenth IEEE International Conference on Computer Vision (ICCV'05), 1:136-143, 2005.
[15] Manuele Bicego, Marco Cristani, and Vittorio Murino. Clustering-based construction of hidden Markov models for generative kernels. 3:466-479, 2009.
[16] Thomas Hofmann. Learning the similarity of documents: An information-geometric approach to document retrieval and categorization. Neural Information Processing Systems, pages 914-920, 2000.
[17] Tommi S. Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, pages 487-493. MIT Press, 1999.
[18] T.P. Minka. Discriminative models, not discriminative training. Technical Report TR-2005-144, Microsoft Research, Cambridge, 2005.
[19] Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1:87-94, 2006.
[20] A. Perina, M. Cristani, U. Castellani, V. Murino, and N. Jojic. A hybrid generative/discriminative classification framework based on free-energy terms. IEEE 12th International Conference on Computer Vision, 1:2058-2065, 2009.
[21] M. Figueiredo, P. Aguiar, A. Martins, V. Murino, and M. Bicego. Information theoretical kernels for generative embeddings based on hidden Markov models. In Proceedings of the 2010 Joint IAPR International Conference on Structural, Syntactic, and Statistical Pattern Recognition, pages 463-472, Berlin, Heidelberg, 2010. Springer-Verlag.
[22] David Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10. Available: http://www.cbse.ucsc.edu/sites/default/files/convolutions.pdf, July 1999.
[23] Klaas E. Stephan, Lars Kasper, Lee M. Harrison, Jean Daunizeau, Hanneke E.M. den Ouden, Michael Breakspear, and Karl J. Friston. Nonlinear dynamic causal models for fMRI. NeuroImage, 42:649-662, May 2008.
[24] Eero Castren. Is mood chemistry? Nature Reviews Neuroscience, 6:241-246, March 2005.
[25] Karl Friston, Jeremie Mattout, Nelson Trujillo-Barreto, John Ashburner, and Will Penny. Variational free energy and the Laplace approximation. 2006.
[26] J.B. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1:281-297, 1967.
[27] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[28] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[29] Peter Buehlmann and Martin Maechler. Computational statistics. Technical report, ETH, Zurich, February 2008.
[30] Bradley Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, 7(1):1-26, 1979.
[31] Ingwer Borg and Patrick J.F. Groenen. Modern Multidimensional Scaling. Springer, 2005.
[32] Kate Swinburn, Gillian Porter, and David Howard. Comprehensive Aphasia Test. Psychology Press, 2004.
[33] Alexander P. Leff, Thomas M. Schofield, Klaas E. Stephan, Jennifer T. Crinion, Karl J. Friston, and Cathy J. Price. The cortical dynamics of intelligible speech. Journal of Neuroscience, 28(49):13209-13215, 2008.
[34] Klaas E. Stephan, L.M. Harrison, S.J. Kiebel, O. David, W.D. Penny, and K.J. Friston. Dynamic causal models of neural system dynamics: Current state and future extensions. J Biosci, 32:129-144, June 2007.
[35] Klaas E. Stephan, N. Weiskopf, P.M. Drysdale, P.A. Robinson, and K.J. Friston. Comparing hemodynamic models with DCM. NeuroImage, 38:387-401, November 2007.
[36] Joachim M. Buhmann. Information theoretic model validation for clustering. IEEE, 2010 (in press).
[37] W.D. Penny. Variational Bayes for d-dimensional Gaussian mixture models. Technical report, Wellcome Department of Cognitive Neurology, University College London, July 2001.
