
My PhD Bible

Kieron McCallan
October 2021 - Present

Terminology
Gaussian Processes
Stochastic Process: A collection of random variables indexed by some variable x ∈ X. Usually y(x) ∈ R and
X ⊂ Rn , and we think of y as a function of x.

General ML
Generalisation: the process of going from data that is related to, but not directly taken from, a situation/problem, to a predicted answer. E.g. I have temperature data for this room from yesterday; how can I be sure that using it to predict today's temperature is reasonable?

Supervised Learning: involves a dataset organised into sets of pairs:

Dn = {(x(1), y(1)), . . . , (x(n), y(n))}

where x is an input and y is an output, so that eventually we can take a given x and predict a y. We can think of the x values as vectors in d dimensions. Consider the discrete case where y(i) ∈ {+1, −1}: this is binary classification, where y is either +1 or −1. The learning system is given some inputs and is then told what specific outputs are to be associated with them. Supervised learning is split into two categories, depending on whether the outputs are drawn from a small finite set (classification) or a large finite/continuous set (regression). The output y values are sometimes referred to as target values.
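The dataset of pairs above can be sketched concretely. This is a toy construction: the data and the labelling rule are invented purely for illustration.

```python
import numpy as np

# A toy binary-classification dataset Dn = {(x(1), y(1)), ..., (x(n), y(n))}.
# Inputs x(i) are d-dimensional vectors; targets y(i) are drawn from {+1, -1}.
rng = np.random.default_rng(0)
n, d = 6, 2
X = rng.normal(size=(n, d))        # n inputs, each a vector in R^d
y = np.where(X[:, 0] > 0, 1, -1)   # invented target rule: sign of the first feature

D = list(zip(X, y))                # the dataset as explicit (x, y) pairs
for x_i, y_i in D:
    print(x_i, "->", y_i)
```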

Classification: This can be binary if it draws from a set of two possible values, or multi-class otherwise.

Feature Mapping: e.g. say we want to classify songs into two groups. We can't put the song itself into the dataset directly; we have to characterise it in some way. Here we would have some kind of feature representation that takes the song and turns it into a vector.
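A minimal sketch of such a feature mapping; the song fields and the scalings are hypothetical, chosen only to illustrate the idea of turning an object into a vector.

```python
# A hypothetical feature mapping phi: song -> R^3. The fields and the
# scaling constants are invented for illustration, not a standard scheme.
def phi(song: dict) -> list:
    return [
        song["duration_s"] / 600.0,              # scaled duration
        song["tempo_bpm"] / 250.0,               # scaled tempo
        1.0 if song["has_vocals"] else 0.0,      # binary indicator feature
    ]

x = phi({"duration_s": 215, "tempo_bpm": 128, "has_vocals": True})
print(x)
```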

Unsupervised Learning: This doesn’t involve learning a function from inputs to outputs based on a set
of pairs. Instead, we are given a data set and expected to find some kind of pattern or structure that is behind
it.

Clustering: Given samples x(1), . . . , x(n) ∈ RD, the goal is to find a partitioning ("clustering") of the samples which groups together samples that are similar. There are different objectives here, depending on what the desired outcome is, e.g. minimising the average distance between elements in a cluster, maximising the average distance between clusters, etc. Other methods involve "soft clustering", where a sample may be assigned partial membership (e.g. 0.9 membership in one cluster and 0.1 in another). Clustering can be used as a step in density estimation, and sometimes to find useful structure in data.
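One common way to pursue the "minimise average within-cluster distance" objective is k-means. This is a minimal sketch under that assumption, not a production implementation.

```python
import numpy as np

# Minimal hard k-means sketch: alternately assign each sample to its
# nearest centre, then move each centre to the mean of its assigned samples.
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest centre (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        # recompute each centre as the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

# Two well-separated blobs of invented data
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(3, 0.3, (20, 2))])
labels, centres = kmeans(X, k=2)
```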

Dimensionality Reduction: Given samples x(1), . . . , x(n) ∈ RD , the problem is to re-represent them as
points in a d-dimensional space, where d < D. The goal is usually to retain the information in the dataset that
will for example allow elements of one class to be discriminated from another. This method is useful for visual-
ising or understanding high-dimensional data. If the overall goal is to perform regression or classification on the
data after the dimensionality is reduced, it is usually best to articulate an objective for the overall prediction
problem rather than to do dimensionality reduction without knowing which dimensions will be important for
the prediction task.
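A common concrete choice for dimensionality reduction is PCA; this is a minimal sketch via the SVD, assuming we simply want the d directions of greatest variance.

```python
import numpy as np

# Dimensionality reduction via PCA: project D-dimensional samples onto
# the d directions of greatest variance, found from the SVD of the
# centred data matrix.
def pca_reduce(X, d):
    Xc = X - X.mean(axis=0)                      # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                         # n x d representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # n = 100 samples in R^D, D = 5
Z = pca_reduce(X, d=2)
print(Z.shape)
```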

Reinforcement Learning: The goal here is to learn a mapping from input values x to output values y,

but without a direct supervision signal to specify which output values y are best for a particular input. There
is no training set specified a priori. Instead the learning problem is framed as an agent interacting with an
environment as follows:

• The agent observes the current state, x0

• It selects an action, y0

• It receives a reward, r0, which depends on x0 and possibly y0

• The environment transitions probabilistically to a new state, x1, with a distribution that depends only on
x0 and y0

• The agent observes the current state, x1

• ...

The goal here is to find a policy, π, mapping x to y (i.e. mapping states to actions) such that some long-term sum or average of rewards, r, is maximised. This is very different from supervised and unsupervised learning, because the agent's choices affect both its reward and its ability to observe the environment. It needs careful consideration of the long-term effects of actions, as well as all the issues relating to supervised learning.
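The interaction loop above can be sketched with a toy two-state environment and a hand-written policy; the reward and transition rules here are invented for illustration.

```python
import random

# Toy environment: states are {0, 1}; the reward depends on the current
# state and action, and the next state is drawn at random (uniform here,
# purely for simplicity).
def step(state, action, rng):
    reward = 1.0 if action == state else 0.0   # r_t depends on x_t and y_t
    next_state = rng.choice([0, 1])            # probabilistic transition to x_{t+1}
    return reward, next_state

def pi(state):
    return state                               # policy: map state x to action y

rng = random.Random(0)
state, total = 0, 0.0
for t in range(100):
    action = pi(state)                         # agent selects action y_t
    reward, state = step(state, action, rng)   # receives r_t, observes x_{t+1}
    total += reward
print(total / 100)                             # average reward under pi
```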

Sequence Learning: Here the goal is to learn a mapping from input sequences x0 , . . . , xn to output se-
quences y1 , . . . , ym . The mapping is typically represented as a state machine, with one function f used to
compute the next hidden internal state given the input, and another function g used to compute the output
given the current hidden state. It is supervised in the sense that we are told what output sequence to generate
for which input sequence, but the internal functions have to be learned by some method other than direct
supervision, because we don’t know what the hidden state sequence is.
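The f/g state-machine view can be sketched directly. Here f and g are fixed by hand to make the mechanics visible; in sequence learning they would have to be learned.

```python
# State-machine view of sequence learning: f updates the hidden state
# from the input, g computes the output from the current hidden state.
def run_state_machine(xs, f, g, s0=0):
    s, ys = s0, []
    for x in xs:
        s = f(s, x)        # next hidden state from current state and input
        ys.append(g(s))    # output from the current hidden state
    return ys

# Example: a running-sum machine where g reports the sign of the state
ys = run_state_machine([1, -2, 3],
                       f=lambda s, x: s + x,
                       g=lambda s: 1 if s >= 0 else -1)
print(ys)  # [1, -1, 1]
```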

Semi-supervised Learning: In semi-supervised learning, we have a supervised-learning target, but there may be an additional set of x(i) values with no known values of y(i). These values can still be used to improve learning performance if they are drawn from the marginal Pr(X) of the joint Pr(X, Y) that governs the rest of the data set.

Active Learning: In active learning, it is assumed to be expensive to acquire a label y(i) (e.g. asking a human to read an X-ray), so the learning algorithm can sequentially ask for particular inputs x(i) to be labelled, and must select its queries carefully in order to learn as effectively as possible while minimising the cost of labelling.

Transfer Learning: Also called meta-learning. In transfer learning, there are multiple tasks with data drawn from different but related distributions. The goal is for experience with previous tasks to apply to learning a current task, so that less experience with the new task is required.

Deterministic Simulators: A simulator with no randomness and no random variables. Simulations made using these simulators have known inputs, and the output for a given set of inputs will not change if the simulation is run multiple times. These can be thought of as the deterministic counterpart of a stochastic process.
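A minimal illustration of determinism, using an invented closed-form toy model (the formula has no particular physical significance here).

```python
import math

# A deterministic simulator: no random variables, so the same inputs
# always produce exactly the same output.
def simulate(damping, stiffness, t=1.0):
    # invented toy model of a damped oscillation, for illustration only
    return math.exp(-damping * t) * math.cos(math.sqrt(stiffness) * t)

a = simulate(0.5, 4.0)
b = simulate(0.5, 4.0)
assert a == b   # repeated runs with identical inputs agree exactly
```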

Blocking Factor: Some kind of delimiter used to separate experimental data into smaller, more similar groups ('blocks'), e.g. using age, ethnicity, or gender to 'block' cancer patients in a trial.
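Blocking amounts to grouping experimental units by the value of the blocking factor; a tiny sketch with invented patient data.

```python
from collections import defaultdict

# Invented patient records: (patient id, age band). The age band acts
# as the blocking factor.
patients = [("A", "18-40"), ("B", "41-65"), ("C", "18-40")]

blocks = defaultdict(list)
for pid, age_band in patients:
    blocks[age_band].append(pid)   # group units with the same block value

print(dict(blocks))  # {'18-40': ['A', 'C'], '41-65': ['B']}
```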

Theory
Gaussian Processes
What is a GP?
Why are they used for ML?
How are they implemented?
What are their strengths and weaknesses?
Advanced GPs eg. treed, chained, etc.

Basics
A Gaussian Process (GP) is a stochastic process in which any finite collection of the random variables has a joint Gaussian distribution.
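The defining property, that any finite set of inputs induces a joint Gaussian, can be illustrated by sampling from a zero-mean GP prior with a squared-exponential (RBF) kernel. The lengthscale, variance, and jitter values here are arbitrary choices for the sketch.

```python
import numpy as np

# Squared-exponential (RBF) kernel: k(x, x') = v * exp(-(x - x')^2 / (2 l^2))
def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    sq = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

x = np.linspace(0, 5, 50)                        # finite set of input locations
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))     # jitter for numerical stability

# The GP prior restricted to these inputs is just y ~ N(0, K)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # three draws
print(samples.shape)
```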

Treed Gaussian Processes


Chained Gaussian Processes

Notes: The Design and Analysis of Computer Experiments
0.1 Chapter 1
0.1.1 Introduction
It is desirable at times to replace the use of a physical experiment with that of a deterministic simulator.
Deterministic simulators contain no degree of randomness and no random variable. These are mathematical
models that relate input and output variables. For example, a mathematical model may consist of a set of
coupled partial differential equations.

The gold standard of data collection for establishing cause-and-effect relationships uses a prospective design for the relevant physical experiment, i.e. watching for a change in outcome.

The classical treatment of responses from a physical experiment considers them to be stochastic with a mean
value that depends on a set of experimenter-selected treatment variables, where the primary objective of the
study is knowledge of the effect of the treatment variables.

Other experiments may consider effects from other input variables.

Environmental ("noise") variables describe either the operating conditions of an experimental subject/unit employing a given treatment, or the conditions under which a manufacturing process is conducted. An example of this is an experiment in agriculture, where the different tests have different locations, each with their own separate temperature and amount of rainfall.

A blocking factor is a qualitative (measured) environmental variable that identifies homogeneous groups of experimental material, e.g. using gender or ethnicity to group subjects in a cancer clinical trial.

Confounding variables: A type of input that can be present in a physical experiment. They are unrecog-
nised by the experimenter but affect the mean output of a physical system. Confounding variables can possibly
mask/exaggerate the effect of a treatment variable eg. if an active confounding variable has values that are
correlated with the values of an inert treatment variable, the effect of the active confounding variable may be
incorrectly attributed to the treatment variable.

There are multiple ways to increase the validity of treatment comparisons in a physical experiment:

1. Randomisation: The treatments are assigned to experimental units at random and are applied in a
random order. When an experiment is randomised, the chance of a treatment variable being misinterpreted
due to the presence of a highly correlated and active confounding variable is reduced.

2. Blocking: Experimental units are grouped to be as similar as possible. Allocating treatments in a balanced way within blocks allows for valid within-block treatment comparisons.

3. Adequate Replication: Using a large enough scale (e.g. many data points) to ensure that unavoidable "measurement" variation in the output does not obscure treatment differences.

We can classify the inputs to a simulator in the same way we classify the parameters of a physical experiment:

Control Variables: The simulation version of a treatment variable. Simulators often include a class of variables not found in a physical system, called model (or calibration) variables. Examples of model variables are unknown rates of change, the unknown magnitude of a component of friction, and material properties. Usually these variables would only be known from previous research, with a quoted uncertainty.
This book will follow the convention:
xc = control variables, xe = environmental variables, xm = model variables.

Computer simulator experiments can also have additional characteristics that are not seen in physical experiments. Computer experiments yield deterministic output (up to numerical noise). This is to say that if the same simulation were run twice with the same set of inputs, the same output would be returned (again, up to numerical noise). Therefore, trying to block the runs into groups of "similar" experimental units does not work. In fact, none of the methods of blocking, randomisation, or replication will work in a computer simulator experiment where they would have worked in a physical experiment.
Computer experiments can also be very time-consuming to run.
