Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Tutorials, pages 1–3,
Baltimore, Maryland, USA, 22 June 2014. © 2014 Association for Computational Linguistics
value. Given data, we can analytically infer the posterior distribution of these functions assuming Gaussian noise.

This tutorial will contrast two main application settings for regression: interpolation and extrapolation. Interpolation suits the use of simple radial basis function kernels which bias towards smooth latent functions. For extrapolation, however, the choice of the kernel is paramount, encoding our prior belief about the type of function we wish to learn. We present several different kernels, including non-stationary kernels and kernels for structured data (string and tree kernels). One of the main issues for kernel methods is setting the hyperparameters, which is often done in the support vector literature using grid search on held-out validation data. In the GP framework, we can compute the probability of the data given the model, which involves an integral over the parameter space. This marginal likelihood, or Bayesian evidence, can be used for model selection using only training data, where by model selection we refer either to choosing from a set of given covariance kernels or choosing between different model hyperparameters (kernel parameters). We will present the key algorithms for type-II maximum likelihood estimation with respect to the hyperparameters, using gradient ascent on the marginal likelihood.

Many problems in NLP involve learning from a range of different tasks. We present multi-task learning models by representing intra-task transfer simply and explicitly as a part of a parameterised kernel function. GPs are an extremely flexible probabilistic framework and have been successfully adapted for multi-task learning, by modelling multiple correlated output variables (Alvarez et al., 2011). This literature develops early work from geostatistics (kriging and co-kriging) on learning latent continuous spatio-temporal models from sparse point measurements, a problem setting that has clear parallels to transfer learning (including domain adaptation).

In the application section, we start by presenting an open-source software package for GP modelling in Python: GPy.² The first application we approach is the regression task of predicting user influence on Twitter based on a range of profile and word features (Lampos et al., 2014). We exemplify how to identify which features are best for predicting user impact by optimising the hyperparameters (e.g. RBF kernel length-scales) using Automatic Relevance Determination (ARD). This essentially gives a ranking of the features by importance, allowing interpretability of the models. Switching to a multi-task regression setting, we present an application to Machine Translation Quality Estimation. Our method shows large improvements over the previous state of the art (Cohn and Specia, 2013). Concepts in automatic kernel selection are exemplified in an extrapolation regression setting, where we model word time series in Social Media using different kernels (Preoţiuc-Pietro and Cohn, 2013). The Bayesian evidence helps to select the most suitable kernel, thus giving an implicit classification of time series.

In the final section of the tutorial we give a brief overview of advanced topics in the field of GPs. First, we look at non-conjugate likelihoods for modelling classification, count and rank data. This is harder than regression, as Bayesian posterior inference can no longer be solved analytically. We will outline strategies for non-conjugate inference, such as expectation propagation and the Laplace approximation. Second, we will outline recent work on scaling GPs to big data using variational inference to induce sparse kernel matrices (Hensman et al., 2013). Finally – time permitting – we will finish with unsupervised learning in GPs using the latent variable model (Lawrence, 2004), a non-linear Bayesian analogue of principal component analysis.

²http://github.com/SheffieldML/GPy

3 Outline

1. GP Regression (60 mins)
   (a) Weight space view
   (b) Function space view
   (c) Kernels

2. NLP Applications (60 mins)
   (a) Sparse GPs: Predicting user impact
   (b) Multi-output GPs: Modelling multi-annotator data
   (c) Model selection: Identifying temporal patterns in word frequencies

3. Further topics (45 mins)
   (a) Non-conjugate likelihoods: classification, counts and ranking
   (b) Scaling GPs to big data: Sparse GPs and stochastic variational inference
   (c) Unsupervised inference with the GP-LVM

4 Instructors

Trevor Cohn is a Senior Lecturer and ARC Future Fellow at the University of Melbourne. His research deals with probabilistic machine learning models, particularly structured prediction and non-parametric Bayesian models. He has recently published several seminal papers on Gaussian Process models for NLP with applications ranging from translation evaluation to temporal dynamics in social media.

Daniel Preoţiuc-Pietro is a final year PhD student in Natural Language Processing at the University of Sheffield. His research deals with applying Machine Learning models to model large volumes of data, mostly coming from Social Media. Applications include forecasting future behaviours of text, users or real world quantities (e.g. political voting intention), user geo-location and impact.

Neil Lawrence is a Professor at the University of Sheffield. He is one of the foremost experts on Gaussian Processes and non-parametric Bayesian inference, with a long history of publications and innovations in the field, including their application to multi-output scenarios, unsupervised learning, deep networks and scaling to big data. He has been programme chair for top machine learning conferences (NIPS, AISTATS), and has run several past tutorials on Gaussian Processes.

References

Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. 2011. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266.

Vasileios Lampos, Nikolaos Aletras, Daniel Preoţiuc-Pietro, and Trevor Cohn. 2014. Predicting and characterising user impact on Twitter. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL.

Neil D. Lawrence. 2004. Gaussian process latent variable models for visualisation of high dimensional data. NIPS, 16(329–336):3.

Daniel Preoţiuc-Pietro and Trevor Cohn. 2013. A temporal model of text periodicities using Gaussian Processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP.

Kashif Shah, Trevor Cohn, and Lucia Specia. 2013. An investigation on the effectiveness of features for translation quality estimation. In Proceedings of the Machine Translation Summit.