
Gaussian Process Latent Variable Models

Ahmad Ashar
Group 256: Modelling and simulation
Supervisors: Conf. Lehel Csato, Conf. Radu Trimbitas
January 10, 2013

Contents

1 Introduction
  1.1 Initial Strides
  1.2 Probabilistic Models
    1.2.1 Bayesian Inference
    1.2.2 Gaussian Process
    1.2.3 Gaussian Process Regression
    1.2.4 Gaussian Process Latent Variable Model
2 Motivations and Aims
  2.1 Previous Work
  2.2 Methodology and aims
3 Results and Discussions
4 Conclusions

Abstract

This thesis presents Gaussian Process Latent Variable Models (GPLVMs), which are, in a sense, a probabilistic form of Principal Component Analysis. The main body of work was to implement some of the GPLVMs and to compare their efficiency with that of other standard models. The thesis is arranged into four key sections: Introduction, Motivations and Aims, Results and Discussions, and Conclusions. As GPLVMs are fairly complex, the introductory part is relatively long. The introduction gives a brief sketch of the history of machine learning and its present trends. The mathematical details of Bayesian inference are then explained, and Gaussian process regression and the GPLVM are defined mathematically. The section on Motivations and Aims reviews the research work in this field and then explains our own aims and methodology for approaching the problem. The section on Results and Discussions follows from the observations we made during our experiments and simulations. Finally, the Conclusions section places this work within the scientific literature and outlines how the thesis could be built upon.


1 Introduction

1.1 Initial Strides

Developing intelligent systems and robots has long been of prime importance to researchers in computer science. The field of Artificial Intelligence (AI) was introduced in the 1950s. After a lot of initial enthusiasm, when AI was expected to solve all the problems related to building a human-like machine, a long hiatus followed. By the early 1990s, however, AI had many subfields such as Natural Language Processing, Perception, Motion, Knowledge Representation and Reasoning. Machine Learning had some success in the early 1990s, with Neural Networks as the dominant theme. The highly prestigious conference on Neural Information Processing Systems (NIPS) was established in 1987 in Colorado, USA, and has since attracted the best researchers in cognitive science, machine learning and computational neuroscience. The pattern of papers at NIPS gives an indication of how Machine Learning has evolved since the 1990s. At the beginning, much of the emphasis was on understanding and mimicking the functionality of the brain, as in the case of Artificial Neural Networks. At the turn of the century, a new class of algorithms called Kernel Methods started to dominate.

More recently there has been a further shift, with more and more researchers using Kernel Methods within Probabilistic Models. One of the reasons for this shift was the so-called MCMC revolution, in which computationally intensive calculations were circumvented with statistical sampling methods. Nowadays Bayesian probabilistic models have come to dominate the machine learning community. As a result, earlier methods such as regression, classification, dimensionality reduction, filtering in dynamical systems, time series analysis and inverse problems have all been given a probabilistic treatment in the past ten years. Machine learning has also succeeded in finding common ground between game theory and decision theory in the case of probabilistic multi-agent systems. The field of graphical modelling for dynamical systems and other mathematically complex bio-systems has been attracting a lot of interest. Graphical models are simply graphical representations of the conditional probability relationships between variables. It is therefore very important to understand the probabilistic paradigm.


1.2 Probabilistic Models

1.2.1 Bayesian Inference

The foundation of probabilistic reasoning is Bayes' theorem. Predictions for a new data point $\tilde{x}$ are made with the following distribution:

$$p(\tilde{x} \mid X, \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid X, \alpha)\, d\theta \tag{1}$$

The left-hand side of the equation is known as the posterior predictive distribution of the data point $\tilde{x}$, given a likelihood $p(\tilde{x} \mid \theta)$ and a posterior over the parameters $\theta$ given by $p(\theta \mid X, \alpha)$. Here $X$ are the training data points of the model and $\alpha$ are called the hyperparameters of the model. The posterior over the parameters is given by Bayes' theorem:

$$p(\theta \mid X, \alpha) = \frac{p(X \mid \theta)\, p(\theta \mid \alpha)}{p(X \mid \alpha)} \tag{2}$$

The term $p(\theta \mid \alpha)$ is called the prior over the parameter space, and $p(X \mid \theta)$ is called the likelihood of the data (i.e. given a parameter value, the probability of observing the data $X$). The key problem of Bayesian analysis is the evaluation of the denominator of (2), which requires a second integral:

$$p(X \mid \alpha) = \int p(X \mid \theta)\, p(\theta \mid \alpha)\, d\theta \tag{3}$$



The complex integrals appearing in (1), (2) and (3) make Bayesian inference very difficult to carry out, so approximation methods are required. If we are interested in a point estimate of the parameters, we can use the maximum a posteriori (MAP) estimate, given as

$$\theta_{MAP} = \arg\max_{\theta}\; p(\theta \mid X, \alpha) \tag{4}$$

The hyperparameters are obtained by maximizing the marginal likelihood of the training data:

$$\hat{\alpha} = \arg\max_{\alpha} \int p(X \mid \theta)\, p(\theta \mid \alpha)\, d\theta \tag{5}$$
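As a hedged illustration of the quantities above, the following sketch evaluates the posterior, the evidence integral, the MAP estimate and a posterior predictive probability numerically on a grid. The coin-flip model (Bernoulli likelihood with a Beta(a, b) prior, so that the pair (a, b) plays the role of the hyperparameters) and the function name `grid_posterior` are assumptions chosen for illustration; they do not come from the thesis.

```python
import math

def grid_posterior(heads, tails, a=2.0, b=2.0, n_grid=2001):
    """Grid-based Bayesian inference for a coin with unknown bias theta.

    Illustrative assumptions: Bernoulli likelihood, Beta(a, b) prior with
    a, b >= 1. Returns (MAP estimate, evidence, predictive P(next = heads)).
    """
    step = 1.0 / (n_grid - 1)
    thetas = [i * step for i in range(n_grid)]
    beta_norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)

    def prior(t):
        # p(theta | alpha): Beta(a, b) density
        return t ** (a - 1) * (1 - t) ** (b - 1) / beta_norm

    def likelihood(t):
        # p(X | theta): probability of the observed flips
        return t ** heads * (1 - t) ** tails

    joint = [likelihood(t) * prior(t) for t in thetas]

    def trapezoid(values):
        return step * (sum(values) - 0.5 * (values[0] + values[-1]))

    # Evidence p(X | alpha): the integral in the denominator of Bayes' theorem.
    evidence = trapezoid(joint)
    posterior = [j / evidence for j in joint]
    # MAP estimate: the grid point with the highest posterior density.
    theta_map = thetas[max(range(n_grid), key=lambda i: posterior[i])]
    # Posterior predictive: integrate p(heads | theta) = theta against the posterior.
    predictive = trapezoid([t * p for t, p in zip(thetas, posterior)])
    return theta_map, evidence, predictive

theta_map, evidence, predictive = grid_posterior(heads=7, tails=3)
print(theta_map, evidence, predictive)
```

A grid works here only because the parameter is one-dimensional; in higher dimensions the same integrals become intractable, which is what motivates the approximation methods mentioned above.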

Bayesian inference has been used for classifying spam e-mails with a Naive Bayes classifier. Its use with graphical models makes it easy to integrate with MCMC sampling methods such as the Gibbs sampler or the Metropolis-Hastings sampler. Its usefulness for model selection has also been demonstrated.

1.2.2 Gaussian Process

Gaussian Processes (GPs) help us better understand the formalism of Bayesian inference. A Gaussian process is defined as a collection of infinitely many random variables, any finite number of which follow a joint multivariate Gaussian distribution. If we view a function as a collection of infinitely many points in its domain $X$, with $f(X)$ taking random values depending on $f$, a GP can be thought of as a probability distribution over functions. In that case $f \sim GP(m, K)$, where $m$ is the mean, a vector of size equal to the number of points in the domain, and $K$ is known as the covariance matrix or the kernel of the Gaussian process. There are different types of kernels. One example is the linear kernel, $K_L(x, x') = x^T x'$. Another common kernel is the squared exponential kernel, which is used to obtain smooth, infinitely differentiable curves. It is given by the formula

$$K_{SE}(x, x') = \exp\left(-\frac{|d|^2}{2 l^2}\right) \tag{6}$$

The value $d$ is the distance between the two points $x$ and $x'$, and $l$ is the hyperparameter of the kernel, which acts as the scaling factor for the distance. If we assume a Gaussian process prior on a function with $x = (x_1, x_2, \ldots, x_k)$, then

$$P(x_1, \ldots, x_k) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right) \tag{7}$$

where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix.
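The two kernels and the idea of a distribution over functions can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: a zero mean, a small diagonal jitter added for numerical stability, and hypothetical function names (`linear_kernel`, `se_kernel`, `sample_gp_prior`) that are not taken from the thesis.

```python
import numpy as np

def linear_kernel(Xa, Xb):
    """Linear kernel K_L(x, x') = x^T x' for inputs stored as rows."""
    return Xa @ Xb.T

def se_kernel(Xa, Xb, lengthscale=1.0):
    """Squared exponential kernel exp(-|d|^2 / (2 l^2)) for inputs stored as rows."""
    sq_dist = (np.sum(Xa**2, axis=1)[:, None]
               + np.sum(Xb**2, axis=1)[None, :]
               - 2.0 * Xa @ Xb.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * lengthscale**2))

def sample_gp_prior(X, lengthscale=1.0, n_samples=3, jitter=1e-6, seed=0):
    """Draw functions from a zero-mean GP prior, evaluated at the rows of X.

    Any finite set of function values is jointly Gaussian, so a draw is
    L z with K = L L^T and z standard normal. The jitter is an assumption
    added only to keep the Cholesky factorisation stable.
    """
    rng = np.random.default_rng(seed)
    K = se_kernel(X, X, lengthscale) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(X), n_samples))
    return L @ z  # one sampled function per column

X = np.linspace(-3.0, 3.0, 25)[:, None]
F = sample_gp_prior(X, lengthscale=1.0)
print(F.shape)
```

Increasing `lengthscale` makes the sampled functions vary more slowly, which is the sense in which l scales the distance between points.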

A Gaussian process can be used as a prior over the function space. The variable $\theta$ of Bayesian inference is then the random function $f$. It has been shown that placing a Gaussian process prior is equivalent to using an infinite-dimensional prior over $\theta$; the model is therefore also known as a non-parametric model (one in which the number of parameters is infinite).

1.2.3 Gaussian Process Regression

Regression is a common problem in which, given $\{x_i, y_i\}_{i=1,\ldots,n}$, the aim is to predict $y(x_*)$ for some unseen input $x_*$. In the case of linear regression the problem is formulated as follows:

$$y = X^T w + \epsilon \tag{8}$$

where $w$ is the vector of weights and $\epsilon$ is the noise vector. The idea is to determine the weights so as to minimize some cost function:

$$\arg\min_{w} L = \arg\min_{w} \sum_i \| y_i - x_i^T w \| \tag{9}$$


The Bayesian interpretation of regression places a prior on $w$ such that $p(w) = \mathcal{N}(0, \Sigma_{par})$, where $\Sigma_{par}$ is the prior covariance matrix. If we assume Gaussian white noise, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, we can write the likelihood as:

$$p(y \mid X, w) = \mathcal{N}(X^T w, \sigma^2 I) \tag{10}$$
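With a Gaussian prior and a Gaussian likelihood, the posterior over the weights is again Gaussian and available in closed form (the standard result from Rasmussen and Williams [1]). The sketch below computes it next to the ordinary least-squares solution for comparison. It assumes the convention used above, with inputs stored as the columns of X so that y = X^T w + noise, and a squared-error cost for the least-squares fit; the function names are hypothetical.

```python
import numpy as np

def least_squares(X, y):
    """Weights minimising the sum of squared residuals of y = X^T w."""
    return np.linalg.lstsq(X.T, y, rcond=None)[0]

def weight_posterior(X, y, sigma2, Sigma_p):
    """Gaussian posterior over w for the weight-space model.

    X       : (d, n) array, one input per column
    y       : (n,) targets
    sigma2  : noise variance
    Sigma_p : (d, d) prior covariance of w
    Returns the posterior mean sigma^-2 A^-1 X y and covariance A^-1,
    with A = sigma^-2 X X^T + Sigma_p^-1.
    """
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_p)
    mean = np.linalg.solve(A, X @ y) / sigma2
    cov = np.linalg.inv(A)
    return mean, cov

# Synthetic example: two weights, 200 noisy observations.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
w_true = np.array([1.5, -0.7])
y = X.T @ w_true + 0.1 * rng.standard_normal(200)

mean, cov = weight_posterior(X, y, sigma2=0.01, Sigma_p=np.eye(2))
print(mean, least_squares(X, y))
```

With a very broad prior the posterior mean approaches the least-squares solution, while a tight prior shrinks the weights towards zero.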

The expression for the posterior was developed by Rasmussen and Williams [1] and shown to be:

$$p(w \mid y, X) = \mathcal{N}(\sigma^{-2} A^{-1} X y,\; A^{-1}) \tag{11}$$

where $A = \sigma^{-2} X X^T + \Sigma_{par}^{-1}$. The method described is also known as the weight-space representation of Gaussian process regression. The function-space representation places a Gaussian process prior over the function $f(x)$ and thus does not assume a linear relationship between $x$ and $y$. We assume the following prior:

$$p(f \mid x_1, x_2, \ldots, x_n) = \mathcal{N}(0, K_{ff}) \tag{12}$$

where $f = [f(x_1), f(x_2), \ldots, f(x_n)]$ and $K_{ff}$ is the covariance matrix, which is responsible for properties such as smoothness, stationarity and local independence. The posterior estimate is given as:

$$p(f_* \mid y) = \frac{1}{p(y)} \int p(y \mid f)\, p(f, f_*)\, df \tag{13}$$

Using this posterior we can make predictions as described in the section above. The posterior has been calculated by Rasmussen and Williams [1] and has the following form:

$$p(f_* \mid y) = \mathcal{N}\left(K_{*,f} (K_{f,f} + \sigma^2 I)^{-1} y,\; K_{*,*} - K_{*,f} (K_{f,f} + \sigma^2 I)^{-1} K_{f,*}\right) \tag{14}$$
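The predictive mean and covariance can be computed as in the following sketch. It assumes a squared exponential kernel with fixed hyperparameters and uses a Cholesky factorisation instead of an explicit matrix inverse, which is a common implementation choice rather than something prescribed by the text; the function names are hypothetical.

```python
import numpy as np

def se_kernel(Xa, Xb, lengthscale=1.0):
    """Squared exponential kernel for inputs stored as rows."""
    sq_dist = (np.sum(Xa**2, axis=1)[:, None]
               + np.sum(Xb**2, axis=1)[None, :]
               - 2.0 * Xa @ Xb.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * lengthscale**2))

def gp_predict(X, y, X_star, sigma2=0.01, lengthscale=1.0):
    """Posterior mean and covariance of f_* at test inputs X_star."""
    K_ff = se_kernel(X, X, lengthscale)
    K_sf = se_kernel(X_star, X, lengthscale)
    K_ss = se_kernel(X_star, X_star, lengthscale)
    # Factorise (K_ff + sigma^2 I); this is the O(n^3) step.
    L = np.linalg.cholesky(K_ff + sigma2 * np.eye(len(X)))
    # alpha = (K_ff + sigma^2 I)^-1 y via two triangular solves.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_sf @ alpha
    V = np.linalg.solve(L, K_sf.T)
    cov = K_ss - V.T @ V
    return mean, cov

# Noise-free samples of sin(x), predicted at a new input.
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X).ravel()
mean, cov = gp_predict(X, y, np.array([[2.5]]))
print(mean, cov)
```

The diagonal of the returned covariance gives the predictive variance at each test input, which grows as the test input moves away from the training data.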

The problem is the evaluation of the term $(K_{f,f} + \sigma^2 I)^{-1}$, which is $O(n^3)$, so the computation scales badly as the number of data points $n$ increases. A number of sparse approximations have been proposed, for example by Csato and Opper [2], including ideas such as selecting a subset of the data or using many local GPs. Quiñonero-Candela and Rasmussen [3] have unified many of these approximations in a Gaussian process approximation framework based on inducing variables.

1.2.4 Gaussian Process Latent Variable Model

The GPLVM was introduced by Lawrence [4]. The motivation behind it was the probabilistic Principal Component Analysis (PPCA) described by Bishop [5]. The problem in dimensionality reduction is as follows: given $N$ data points of high-dimensional data $Y \in \mathbb{R}^{N \times D}$, we want to find a low-dimensional representation $X \in \mathbb{R}^{N \times q}$, where $q < D$. The idea of the GPLVM is similar to that of Kernel PCA. PCA is basically a linear mapping from the latent $X$ space to the $Y$ space. The probabilistic model for PCA is:

$$y_n = W x_n + \epsilon_n \tag{15}$$

where $W$ is a $D \times q$ matrix and $\epsilon_n \sim \mathcal{N}(0, \beta^{-1} I)$, with $I$ the $D \times D$ identity matrix. The likelihood of the model is then given by:

$$p(y_n \mid x_n, W, \beta) = \mathcal{N}(W x_n, \beta^{-1} I) \tag{16}$$

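We assume that the inputs are independent of one another. There are two unknowns in this equation, $W$ and $X$. In PPCA, we place a prior on $X$ and marginalize it out as follows:

$$p(Y \mid W, \beta) = \prod_{n=1}^{N} \int \mathcal{N}(y_n \mid W x_n, \beta^{-1} I)\, \mathcal{N}(x_n \mid 0, I)\, dx_n \tag{17}$$

Using some linear algebra, Lawrence [4] showed that

$$p(Y \mid W, \beta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid 0, C), \quad \text{where } C = W W^T + \beta^{-1} I \tag{18}$$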
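The marginal likelihood with covariance C = W W^T + beta^{-1} I has a well-known closed-form maximiser obtained from the eigendecomposition of the sample covariance (the PPCA solution of Tipping and Bishop). The sketch below assumes the data are centred first and that q < D; the function names are hypothetical and the rotation ambiguity in W is resolved by simply taking the identity rotation.

```python
import numpy as np

def ppca_ml(Y, q):
    """Maximum-likelihood PPCA fit.

    Y : (N, D) data matrix with rows y_n; q : latent dimension, q < D.
    Returns W of shape (D, q) and the noise variance 1/beta.
    """
    Yc = Y - Y.mean(axis=0)                # centre the data (assumption)
    S = Yc.T @ Yc / len(Yc)                # (D, D) sample covariance
    eigval, eigvec = np.linalg.eigh(S)     # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    noise_var = eigval[q:].mean()          # average of the discarded eigenvalues
    W = eigvec[:, :q] * np.sqrt(eigval[:q] - noise_var)
    return W, noise_var

def ppca_log_likelihood(Y, W, noise_var):
    """Log of prod_n N(y_n | 0, C) with C = W W^T + noise_var * I."""
    N, D = Y.shape
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / N
    C = W @ W.T + noise_var * np.eye(D)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * N * (D * np.log(2.0 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C, S)))

# Synthetic data with a two-dimensional latent structure.
rng = np.random.default_rng(1)
W_true = rng.standard_normal((5, 2))
Y = rng.standard_normal((500, 2)) @ W_true.T + 0.1 * rng.standard_normal((500, 5))
W, noise_var = ppca_ml(Y, q=2)
print(ppca_log_likelihood(Y, W, noise_var))
```

The fitted C reproduces the q largest eigenvalues of the sample covariance exactly and replaces the remaining ones by their average.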


The covariance $C = W W^T + \beta^{-1} I$ is also known as the reduced-rank representation. We now take the log-likelihood and maximize it to obtain $W$. This reduces to an eigenvalue problem in which the largest $q$ eigenvalues are retained. A similar approach is used in dual PPCA, but the other way round: first a prior is placed on $W$, which is then marginalized out, and the likelihood is maximized to obtain $X$. We give the form of the gradient and the corresponding eigenvalue problem as given by Lawrence in 2006 [9]. Here $L$ is taken to be the negative log-likelihood scaled by $1/D$, and $S = D^{-1} Y Y^T$:

$$\frac{\partial L}{\partial X} = -K^{-1} S K^{-1} X + K^{-1} X, \quad \text{where } K = X X^T + \beta^{-1} I \tag{19}$$

Setting this gradient to zero gives

$$S \left[\beta^{-1} I + X X^T\right]^{-1} X = X$$

The GPLVM instead places a prior over the function $f$ such that

$$y_{nj} = f(x_n) + \epsilon_{nj} \tag{20}$$

$$\text{where } f \sim GP(0, K) \tag{21}$$


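If we assume the noise is independent, we obtain the following likelihood:

$$p(Y \mid X, \theta) = \prod_{i=1}^{D} \mathcal{N}(y_i \mid 0, K), \quad K \in \mathbb{R}^{N \times N} \text{ parametrised by } \theta \tag{22}$$

where $y_i$ denotes the $i$-th column of $Y$.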

Again, we take the log-likelihood and maximize it with respect to both the latent space locations $X$ and the kernel hyperparameters $\theta$. This is a non-convex optimization problem with many local maxima, and a number of ways to solve it have been proposed. In our work we implement some of the existing algorithms for solving this problem. We also look in more detail at the issues with such a problem formulation and discuss a number of variational and sampling methods employed to solve this problem more efficiently.
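The optimisation over the latent locations can be sketched as follows. This is a simplified illustration and not the algorithm of any of the cited papers: it assumes a squared exponential kernel with fixed length-scale and noise variance (so only X is optimised, not the hyperparameters), initialises X from PCA, and uses plain gradient descent with a backtracking step size in place of scaled conjugate gradients. All function names are hypothetical.

```python
import numpy as np

def se_kernel(X, lengthscale):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def gplvm_nll_and_grad(X, Y, lengthscale=1.0, noise_var=0.1):
    """Negative log-likelihood of the GPLVM and its gradient with respect to X."""
    N, D = Y.shape
    K = se_kernel(X, lengthscale) + noise_var * np.eye(N)
    K_inv = np.linalg.inv(K)
    _, logdet = np.linalg.slogdet(K)
    A = K_inv @ Y
    nll = 0.5 * (D * N * np.log(2.0 * np.pi) + D * logdet + np.sum(Y * A))
    G = 0.5 * (D * K_inv - A @ A.T)        # d nll / d K
    M = G * K                              # chain rule through the SE kernel
    grad = -(2.0 / lengthscale**2) * (M.sum(axis=1)[:, None] * X - M @ X)
    return nll, grad

def pca_init(Y, q):
    """Whitened PCA scores as the starting point for X."""
    U, _, _ = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
    return U[:, :q] * np.sqrt(len(Y))

def fit_gplvm(Y, q=2, n_iter=50, lengthscale=1.0, noise_var=0.1):
    Y = Y - Y.mean(axis=0)                 # the model assumes zero mean
    X = pca_init(Y, q)
    nll, grad = gplvm_nll_and_grad(X, Y, lengthscale, noise_var)
    history, step = [nll], 1e-2
    for _ in range(n_iter):
        while step > 1e-12:                # backtrack until the objective decreases
            X_new = X - step * grad
            nll_new, grad_new = gplvm_nll_and_grad(X_new, Y, lengthscale, noise_var)
            if nll_new < nll:
                break
            step *= 0.5
        else:
            break                          # no improving step found: stop
        X, nll, grad = X_new, nll_new, grad_new
        history.append(nll)
        step *= 2.0
    return X, history

# Toy data: a one-dimensional curve embedded in three dimensions.
rng = np.random.default_rng(0)
t = np.linspace(-2.0, 2.0, 15)
Y = np.column_stack([np.sin(t), np.cos(t), t**2]) + 0.05 * rng.standard_normal((15, 3))
X_fit, history = fit_gplvm(Y, q=2, n_iter=20)
print(history[0], history[-1])
```

Because the objective is non-convex, the result depends on the initialisation; the PCA start is a common choice, but a different start may end in a different local optimum.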


2 Motivations and Aims

2.1 Previous Work

In this section we give a detailed account of the work already done in the field of GPLVMs. As stated previously, the GPLVM was introduced by Neil D. Lawrence from the University of Sheffield, UK, in 2004 [4]. In his work he traces the motivation to the work by Bishop [5] on probabilistic principal component analysis. In that paper Lawrence uses scaled conjugate gradients to solve eq. 22 and obtain the latent variables X. He uses an RBF kernel for K and gives a comparison with PCA on the oil-flow data set. This data set is freely available, and the experiments can be reproduced using the practical algorithm described in the paper. Simulations are performed on two further data sets: digit rotation and images from a video sequence.

Building on this idea, Grochow et al. [7], including Aaron Hertzmann from the University of Toronto, employed the GPLVM for inverse kinematics (IK). They mapped different human poses onto a two-dimensional latent space, from which they could then synthesise various poses. They suggested a novel style-based IK in which, given some constraints, the model generates the most probable pose according to the training data. This has applications in computer animation, where human-like characters have to interact with other characters in the animation, which leads to constraints.

In 2005 Wang et al. [8] introduced a dynamical GPLVM known as the Gaussian Process Dynamical Model (GPDM). A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. The result is a non-parametric dynamical model which accounts for the uncertainty in the data. The paper gives an application to motion-capture data, which can be used for generating new motions, and also addresses the missing data problem. In 2006, Raquel Urtasun et al. [20] applied the GPDM to 3D people tracking. In that paper they modify the original GPDM to permit learning from motions with significant stylistic variation.

Lawrence and Quiñonero-Candela [9] proposed a variation of the GPLVM using back constraints, which essentially preserve local distances from the high-dimensional space in the low-dimensional space. Lawrence also extended the model to large data sets in a technical report in 2006 [10], in which he discusses many sparse approximations and the iterative optimization of eq. 22. In 2007 Urtasun and Darrell [11] developed what they called the Discriminative Gaussian Process Latent Variable Model and used it for classification. The results were compared to other discriminative models for manifold learning such as LDA and GDA. Lawrence [12] introduced the Hierarchical GPLVM in 2007. In 2008 the topologically constrained GPLVM was proposed by Urtasun, Lawrence et al. [13]. In 2010 and 2011 Michael Titsias introduced the Bayesian GPLVM and its corresponding variational counterpart [14], [15].

Working in the field of computer vision, Carl Ek from KTH, Sweden, has co-authored three contributions: one on human pose identification using the GPLVM [16] in 2007, a second in 2008 on data consolidation using the GPLVM [17], and, most importantly, the Shared GPLVM [18] in 2009.


2.2 Methodology and aims

The thesis aims to assimilate the work done to date on Gaussian Process Latent Variable Models. This involves understanding the various proposed algorithms and their implementations. As can be seen, the GPLVM has been an intensely researched area since its inception in 2004 and has involved some of the world's leading researchers, from MIT to Oxford. As is clear from the section above, the main application areas of the GPLVM have been Computer Vision (human pose determination, 3D people tracking, manifold determination for high-resolution images, etc.) and Robotics (inverse kinematics using a style-based approach). Recently many researchers, including Lawrence [19], have been using the GPLVM in systems biology research to better understand gene expression and transcriptional regulation.

The author plans to implement some of the algorithms mentioned above and to recreate the experiments and simulations reported in the literature. Most of the researchers provide their code online, which can serve as a starting point for reproducing the observations. The data sets for most of the papers are also freely available. The author also plans to introduce MCMC methods for faster inference in the GPLVM, although such an approach becomes infeasible for large data sets, for which variational methods have been preferred. The goal of this thesis is to better understand the implementation issues of the above-mentioned algorithms and to propose changes which could yield better and faster results. The main software to be used is the GPLVM toolkit written by Lawrence et al. The data sets will be chosen appropriately. The other important aspect is that of large data sets and the corresponding sparse approximations. The author will draw on the expertise of the supervisors in this area and test various algorithms with different sparse approximations.

3 Results and Discussions

The simulations indicate the usefulness of the GPLVM in comparison with other conventional methods. However, some issues remain, such as local maxima and scaling to large data sets. We also observe that certain kernel functions do a better job in dimensionality reduction. For example, using the ARD kernel and the variational method we are able to find the hidden dimensionality of the manifold using just the training data. The SE kernel gave better results in the case of GPDMs and is more robust to missing data points and noise. Sampling algorithms such as MCMC provide faster inference when the noise does not follow a Gaussian distribution. However, their usefulness decreases when the training data set consists of a large number of points. Variational methods have also proved very useful for obtaining a lower bound on the marginal likelihood and hence should be employed more often. Our simulations also suggest combining different Gaussian process priors and then optimizing the parameter space with the training data. This observation hints at using a mixture model, and this possibility should be explored.
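To illustrate how an ARD kernel can reveal the effective latent dimensionality, the sketch below uses one common form, a squared exponential kernel with a separate length-scale per latent dimension. This is an assumption for illustration: the exact ARD kernel and variational procedure used in the experiments are not specified here, and the function name is hypothetical.

```python
import numpy as np

def ard_se_kernel(Xa, Xb, lengthscales):
    """Squared exponential kernel with one length-scale per input dimension.

    A very large length-scale makes the kernel insensitive to that
    dimension, which effectively switches the dimension off.
    """
    ls = np.asarray(lengthscales, dtype=float)
    A, B = Xa / ls, Xb / ls
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
# With a huge length-scale on the second dimension, only the first matters.
K_ard = ard_se_kernel(X, X, lengthscales=[1.0, 1e6])
K_1d = ard_se_kernel(X[:, :1], X[:, :1], lengthscales=[1.0])
print(np.allclose(K_ard, K_1d))
```

When the length-scales are learned from the training data, the dimensions whose length-scales grow very large contribute nothing to the covariance, and the number of remaining dimensions indicates the dimensionality of the manifold.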


4 Conclusions

In conclusion, probabilistic models for machine learning are very useful in uncertain situations or in situations with missing data. The GPLVM is a proven tool for dimensionality reduction and for modelling probabilistic dynamical systems. A comparison has also been provided between the conventional methods and their probabilistic counterparts. However, optimizing the algorithms and proposing faster approximations remains a challenge. The future of these methods lies in how efficiently they can reduce the complexity of the enormous amounts of data now available. Sampling methods and variational methods have not been examined in detail in this monograph because of space restrictions; however, they will play a key role in solving the optimization problem for large data sets.

References

[1] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, USA, 2006.
[2] L. Csato and M. Opper. Sparse Gaussian Processes. Neural Computation, 14(3), The MIT Press, 2002.
[3] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian Process regression. Journal of Machine Learning Research, 6:1939-1959, December 2005.
[4] N. D. Lawrence. Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research, 2005.
[5] C. M. Bishop. Bayesian PCA. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 382-388, Cambridge, MA, 1999. MIT Press.
[6] N. D. Lawrence. The Gaussian process latent variable model. Technical Report no. CS-06-03, The University of Sheffield, Department of Computer Science, 2006.
[7] K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popovic. Style-based Inverse Kinematics. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2004), 2004, 23(3), pp. 522-531.
[8] J. Wang, D. J. Fleet, and A. Hertzmann. Gaussian Process dynamical models. NIPS 18, MIT Press, 2005.
[9] N. D. Lawrence and J. Quiñonero-Candela. Local distance preservation in the GP-LVM through back constraints. In W. Cohen and A. Moore (eds), Proceedings of the International Conference in Machine Learning, Omnipress, pp. 513-520, 2006.
[10] N. D. Lawrence. Large scale learning with the Gaussian process latent variable model. Technical Report no. CS-06-05, University of Sheffield, 2006.
[11] R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), 2007.
[12] N. D. Lawrence and A. J. Moore. Hierarchical Gaussian process latent variable models. In Z. Ghahramani (ed.), Proceedings of the International Conference in Machine Learning, Omnipress, pp. 481-488, 2007.
[13] R. Urtasun, D. J. Fleet, A. Geiger, J. Popovic, T. J. Darrell, and N. D. Lawrence. Topologically-constrained latent variable models. In Proceedings of the 25th International Conference on Machine Learning (ICML '08). ACM, New York, NY, USA, pp. 1080-1087, 2008.
[14] M. K. Titsias and N. D. Lawrence. Bayesian Gaussian Process Latent Variable Model. Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: WCP 9, pp. 844-851, 2010.
[15] A. C. Damianou, M. K. Titsias, and N. D. Lawrence. Variational Gaussian Process Dynamical Systems. NIPS, 24, 2012.
[16] C. H. Ek, P. H. Torr, and N. D. Lawrence. Gaussian process latent variable models for human pose estimation. In MLMI: Multimodal Interaction and Related Machine Learning Algorithms, June 2007.
[17] C. H. Ek, P. H. Torr, and N. D. Lawrence. GP-LVM for data consolidation. Learning from Multiple Sources, NIPS, 2008.
[18] C. H. Ek. Shared Gaussian Process Latent Variable Models. PhD thesis, Oxford Brookes University, 2009.
[19] N. D. Lawrence, M. Rattray, P. Gao, and M. K. Titsias. Gaussian processes for missing species in biochemical systems. In N. D. Lawrence, M. Girolami, M. Rattray and G. Sanguinetti (eds), Learning and Inference in Computational Systems Biology, MIT Press, Cambridge, MA, 2010.
[20] R. Urtasun, D. J. Fleet, and P. Fua. 3D People Tracking with Gaussian Process Dynamical Models. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), Vol. 1. IEEE Computer Society, Washington, DC, USA, pp. 238-245, 2006.