
Gaussian Processes for Natural Language Processing

Trevor Cohn, Daniel Preoţiuc-Pietro and Neil Lawrence


Computing and Information Systems
The University of Melbourne
trevor.cohn@gmail.com

Department of Computer Science
The University of Sheffield
{daniel,n.lawrence}@dcs.shef.ac.uk

1 Introduction

Gaussian Processes (GPs) are a powerful modelling framework incorporating kernels and Bayesian inference, and are recognised as state-of-the-art for many machine learning tasks. Despite this, GPs have seen few applications in natural language processing (notwithstanding several recent papers by the authors). We argue that the GP framework offers many benefits over commonly used machine learning frameworks, such as linear models (logistic regression, least squares regression) and support vector machines. Moreover, GPs are extremely flexible and can be incorporated into larger graphical models, forming an important additional tool for probabilistic inference. Notably, GPs are one of the few models which support analytic Bayesian inference, avoiding the many approximation errors that plague the approximate inference techniques in common use for Bayesian models (e.g. MCMC, variational Bayes).[1] GPs accurately model not just the underlying task, but also the uncertainty in the predictions, so that uncertainty can be propagated through pipelines of probabilistic components. Overall, GPs provide an elegant, flexible and simple means of probabilistic inference and are well overdue for consideration by the NLP community.

This tutorial will focus primarily on regression and classification, both fundamental techniques of widespread use in the NLP community. Within NLP, linear models are near ubiquitous, because they provide good results for many tasks, support efficient inference (including dynamic programming in structured prediction) and support simple parameter interpretation. However, linear models are inherently limited in the types of relationships between variables they can model. Often, non-linear methods are required for better understanding and improved performance. Currently, kernel methods such as Support Vector Machines (SVMs) represent a popular choice for non-linear modelling. These suffer from a lack of interoperability with down-stream processing as part of a larger model, and from inflexibility in terms of parameterisation and the associated high cost of hyperparameter optimisation. GPs appear similar to SVMs in that they also incorporate kernels; however, their probabilistic formulation allows for much wider applicability in larger graphical models. Moreover, several properties of Gaussian distributions (closure under integration and Gaussian-Gaussian conjugacy) mean that GP regression supports analytic formulations for the posterior and predictive inference.

This tutorial will cover the basic motivation, ideas and theory of Gaussian Processes and several applications to natural language processing tasks. GPs have been actively researched since the early 2000s, and are now reaching maturity: the fundamental theory and practice is well understood, and research is now focused on their applications and on improved inference algorithms, e.g., for scaling inference to large and high-dimensional datasets. Several open-source packages (e.g. GPy and GPML) have been developed which allow GPs to be easily used for many applications. This tutorial aims to promote GPs, emphasising their potential for widespread application across many NLP tasks.

[1] This holds for GP regression; note that approximate inference is needed for non-Gaussian likelihoods.

2 Overview

Our goal is to present the main ideas and theory behind Gaussian Processes in order to increase awareness within the NLP community. The first part of the tutorial will focus on the basics of Gaussian Processes in the context of regression.

The Gaussian Process defines a prior over functions which, applied at each input point, gives a response value. Given data, we can analytically infer the posterior distribution over these functions, assuming Gaussian noise.
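To make the preceding point concrete, the following is the textbook form of the GP regression posterior (a standard result, not specific to this tutorial). Given training inputs X with targets y, kernel matrix K = k(X, X), observation noise variance \sigma_n^2 and a test input x_*, the posterior over the latent function value f_* is Gaussian with

\[
  \mu_* = k(x_*, X)\,[K + \sigma_n^2 I]^{-1} y, \qquad
  \sigma_*^2 = k(x_*, x_*) - k(x_*, X)\,[K + \sigma_n^2 I]^{-1} k(X, x_*),
\]

so both the prediction and its uncertainty are available in closed form.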
This tutorial will contrast two main application settings for regression: interpolation and extrapolation. Interpolation suits the use of simple radial basis function kernels, which bias towards smooth latent functions. For extrapolation, however, the choice of the kernel is paramount, as it encodes our prior belief about the type of function we wish to learn. We present several different kernels, including non-stationary kernels and kernels for structured data (string and tree kernels).
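As a brief illustration of this flexibility, the sketch below combines GPy kernel objects by addition to encode a prior belief such as "a smooth trend plus a periodic component"; the particular kernels and settings are purely illustrative assumptions, not configurations taken from the work cited below.

    import GPy

    # A smooth trend, a periodic component and independent noise;
    # kernel objects in GPy can be combined with + and *.
    trend = GPy.kern.RBF(input_dim=1, lengthscale=10.0)
    seasonal = GPy.kern.StdPeriodic(input_dim=1)
    noise = GPy.kern.White(input_dim=1)

    kernel = trend + seasonal + noise

A sum of kernels corresponds to a sum of independent latent functions, which is one simple way to encode structured prior beliefs for extrapolation.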
using grid search on held-out validation data. In In the final section of the tutorial we give a
the GP framework, we can compute the probabil- brief overview of advanced topics in the field of
ity of the data given the model which involves the GPs. First, we look at non-conjugate likelihoods
integral over the parameter space. This marginal for modelling classification, count and rank data.
likelihood or Bayesian evidence can be used for This is harder than regression, as Bayesian pos-
model selection using only training data, where by terior inference can no longer be solved analyti-
model selection we refer either to choosing from a cally. We will outline strategies for non-conjugate
set of given covariance kernels or choosing from inference, such as expectation propagation and the
different model hyperparameters (kernel parame- Laplace approximation. Second, we will outline
ters). We will present the key algorithms for type- recent work on scaling GPs to big data using vari-
II maximum likelihood estimation with respect to ational inference to induce sparse kernel matrices
the hyper-parameters, using gradient ascent on the (Hensman et al., 2013). Finally – time permitting
marginal likelihood. – we will finish with unsupervised learning in GPs
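For reference, the objective maximised in type-II maximum likelihood is the log marginal likelihood, which for GP regression has the standard closed form

\[
  \log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top (K_\theta + \sigma_n^2 I)^{-1} y
  \;-\; \tfrac{1}{2} \log \left| K_\theta + \sigma_n^2 I \right| \;-\; \tfrac{n}{2} \log 2\pi,
\]

where \theta denotes the kernel hyperparameters. Its gradients with respect to \theta are also available analytically, which is what makes the gradient ascent mentioned above practical.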
Many problems in NLP involve learning from a range of different tasks. We present multi-task learning models that represent inter-task transfer simply and explicitly as part of a parameterised kernel function. GPs are an extremely flexible probabilistic framework and have been successfully adapted for multi-task learning by modelling multiple correlated output variables (Alvarez et al., 2011). This literature develops early work from geostatistics (kriging and co-kriging) on learning latent continuous spatio-temporal models from sparse point measurements, a problem setting that has clear parallels to transfer learning (including domain adaptation).
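One common parameterisation of such inter-task transfer, the intrinsic coregionalisation model reviewed by Alvarez et al. (2011), multiplies a kernel over inputs by a learned task covariance:

\[
  k\big((x, d), (x', d')\big) = k_{\mathrm{data}}(x, x') \, B_{dd'}, \qquad
  B = W W^\top + \mathrm{diag}(\kappa),
\]

where d and d' index tasks and the entries of B control how strongly pairs of tasks share information. (The symbols k_data, W and \kappa are generic notation for exposition here, not notation taken from the cited papers.)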
In the application section, we start by presenting an open-source software package for GP modelling in Python: GPy (http://github.com/SheffieldML/GPy). The first application we approach is the regression task of predicting user influence on Twitter based on a range of profile and word features (Lampos et al., 2014). We exemplify how to identify which features are best for predicting user impact by optimising the hyperparameters (e.g. RBF kernel length-scales) using Automatic Relevance Determination (ARD). This essentially gives a ranking of the features by importance, allowing the models to be interpreted. Switching to a multi-task regression setting, we present an application to Machine Translation Quality Estimation. Our method shows large improvements over the previous state of the art (Cohn and Specia, 2013). Concepts in automatic kernel selection are exemplified in an extrapolation regression setting, where we model word time series in Social Media using different kernels (Preoţiuc-Pietro and Cohn, 2013). The Bayesian evidence helps to select the most suitable kernel, thus giving an implicit classification of time series.
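To give a flavour of how such a model looks in code, the following minimal GPy sketch fits an ARD RBF regression model on synthetic data and inspects the learned length-scales; the data, feature count and variable names are illustrative assumptions, not the experimental setup of the papers cited above.

    import numpy as np
    import GPy

    # Toy data: 100 items with 3 features, where only the first two matter.
    np.random.seed(0)
    X = np.random.randn(100, 3)
    y = (np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * np.random.randn(100)).reshape(-1, 1)

    # ARD=True gives one length-scale per input dimension.
    kernel = GPy.kern.RBF(input_dim=3, ARD=True)
    model = GPy.models.GPRegression(X, y, kernel)

    # Type-II maximum likelihood: gradient-based optimisation of the
    # log marginal likelihood with respect to the hyperparameters.
    model.optimize()

    # A large length-scale suggests the corresponding feature is irrelevant.
    print(model.kern.lengthscale)

    # Predictive mean and variance for new inputs.
    mean, var = model.predict(np.random.randn(5, 3))

Features with large learned length-scales contribute little to the prediction, which is the sense in which ARD ranks features by importance.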
In the final section of the tutorial we give a brief overview of advanced topics in the field of GPs. First, we look at non-conjugate likelihoods for modelling classification, count and rank data. This is harder than regression, as Bayesian posterior inference can no longer be solved analytically. We will outline strategies for non-conjugate inference, such as expectation propagation and the Laplace approximation. Second, we will outline recent work on scaling GPs to big data using variational inference to induce sparse kernel matrices (Hensman et al., 2013). Finally, time permitting, we will finish with unsupervised learning in GPs using the GP latent variable model (Lawrence, 2004), a non-linear Bayesian analogue of principal component analysis.
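As one concrete example of these strategies (standard textbook material rather than anything specific to this tutorial), the Laplace approximation replaces the intractable posterior over the latent function values f with a Gaussian centred at its mode:

\[
  p(f \mid X, y) \approx \mathcal{N}\big(f \mid \hat{f},\, (K^{-1} + W)^{-1}\big), \qquad
  \hat{f} = \arg\max_f p(f \mid X, y), \quad
  W = -\nabla\nabla \log p(y \mid f)\big|_{f = \hat{f}},
\]

where K is the kernel matrix and W is diagonal whenever the likelihood factorises over data points. Expectation propagation instead approximates the posterior by iteratively matching moments of one likelihood term at a time.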
3 Outline

1. GP Regression (60 mins)

   (a) Weight space view
   (b) Function space view
   (c) Kernels

2. NLP Applications (60 mins)

   (a) Sparse GPs: Predicting user impact
   (b) Multi-output GPs: Modelling multi-annotator data
   (c) Model selection: Identifying temporal patterns in word frequencies

3. Further topics (45 mins)

   (a) Non-conjugate likelihoods: classification, counts and ranking
   (b) Scaling GPs to big data: Sparse GPs and stochastic variational inference
   (c) Unsupervised inference with the GP-LVM

4 Instructors

Trevor Cohn[2] is a Senior Lecturer and ARC Future Fellow at the University of Melbourne. His research deals with probabilistic machine learning models, particularly structured prediction and non-parametric Bayesian models. He has recently published several seminal papers on Gaussian Process models for NLP, with applications ranging from translation evaluation to temporal dynamics in social media.

Daniel Preoţiuc-Pietro[3] is a final-year PhD student in Natural Language Processing at the University of Sheffield. His research deals with applying Machine Learning models to large volumes of data, mostly coming from Social Media. Applications include forecasting future behaviours of text, users or real-world quantities (e.g. political voting intention), user geo-location and impact.

Neil Lawrence[4] is a Professor at the University of Sheffield. He is one of the foremost experts on Gaussian Processes and non-parametric Bayesian inference, with a long history of publications and innovations in the field, including their application to multi-output scenarios, unsupervised learning, deep networks and scaling to big data. He has been programme chair for top machine learning conferences (NIPS, AISTATS), and has run several past tutorials on Gaussian Processes.

References
Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. 2011. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266.

Trevor Cohn and Lucia Specia. 2013. Modelling annotator bias with multi-task Gaussian processes: an application to machine translation quality estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL.

James Hensman, Nicolo Fusi, and Neil D. Lawrence. 2013. Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, UAI.

Vasileios Lampos, Nikolaos Aletras, Daniel Preoţiuc-Pietro, and Trevor Cohn. 2014. Predicting and characterising user impact on Twitter. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL.

Neil D. Lawrence. 2004. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, NIPS, pages 329–336.

Daniel Preoţiuc-Pietro and Trevor Cohn. 2013. A temporal model of text periodicities using Gaussian Processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP.

Kashif Shah, Trevor Cohn, and Lucia Specia. 2013. An investigation on the effectiveness of features for translation quality estimation. In Proceedings of the Machine Translation Summit.
[2] http://staffwww.dcs.shef.ac.uk/people/T.Cohn
[3] http://www.preotiuc.ro
[4] http://staffwww.dcs.shef.ac.uk/people/N.Lawrence
