Bayesian Inference on Change Point

Problems
by
Xiang Xuan
B.Sc., The University of British Columbia, 2004
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
The Faculty of Graduate Studies
(Computer Science)
The University Of British Columbia
March, 2007
c Xiang Xuan 2007
Abstract
Change point problems are referred to detect heterogeneity in temporal or spa-
tial data. They have applications in many areas like DNA sequences, financial
time series, signal processing, etc. A large number of techniques have been
proposed to tackle the problems. One of the most difficult issues is estimating
the number of the change points. As in other examples of model selection, the
Bayesian approach is particularly appealing, since it automatically captures a
trade off between model complexity (the number of change points) and model
fit. It also allows one to express uncertainty about the number and location of
change points.
In a series of papers [13, 14, 16], Fearnhead developed efficient dynamic pro-
gramming algorithms for exactly computing the posterior over the number and
location of change points in one dimensional series. This improved upon earlier
approaches, such as [12], which relied on reversible jump MCMC.
We extend Fearnhead’s algorithms to the case of multiple dimensional series.
This allows us to detect changes on correlation structures, as well as changes on
mean, variance, etc. We also model the correlation structures using Gaussian
graphical models. This allow us to estimate the changing topology of dependen-
cies among series, in addition to detecting change points. This is particularly
useful in high dimensional cases because of sparsity.
ii
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Hidden Markov Models . . . . . . . . . . . . . . . . . . . 3
1.2.2 Reversible Jump MCMC . . . . . . . . . . . . . . . . . . . 3
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 One Dimensional Time Series . . . . . . . . . . . . . . . . . . . . 5
2.1 Prior on Change Point Location and Number . . . . . . . . . . . 6
2.2 Likelihood Functions for Data in Each Partition . . . . . . . . . . 9
2.2.1 Linear Regression Models . . . . . . . . . . . . . . . . . . 10
2.2.2 Poisson-Gamma Models . . . . . . . . . . . . . . . . . . . 13
2.2.3 Exponential-Gamma Models . . . . . . . . . . . . . . . . 15
2.3 Basis Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Polynomial Basis . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Autoregressive Basis . . . . . . . . . . . . . . . . . . . . . 16
iii
Table of Contents
2.4 Offline Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Backward Recursion . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Forward Simulation . . . . . . . . . . . . . . . . . . . . . 18
2.5 Online Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Forward Recursion . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 Backward Simulation . . . . . . . . . . . . . . . . . . . . . 21
2.5.3 Viterbi Algorithms . . . . . . . . . . . . . . . . . . . . . . 22
2.5.4 Approximate Algorithm . . . . . . . . . . . . . . . . . . . 23
2.6 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7.2 British Coal Mining Disaster Data . . . . . . . . . . . . . 24
2.7.3 Tiger Woods Data . . . . . . . . . . . . . . . . . . . . . . 25
2.8 Choices of Hyperparameters . . . . . . . . . . . . . . . . . . . . . 26
3 Multiple Dimensional Time Series . . . . . . . . . . . . . . . . . 35
3.1 Independent Sequences . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Dependent Sequences . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . 36
3.2.2 Multivariate Linear Regression Models . . . . . . . . . . . 37
3.3 Gaussian Graphical Models . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Structure Representation . . . . . . . . . . . . . . . . . . 41
3.3.2 Sliding Windows and Structure Learning Methods . . . . 42
3.3.3 Likelihood Functions . . . . . . . . . . . . . . . . . . . . . 45
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.2 U.S. Portfolios Data . . . . . . . . . . . . . . . . . . . . . 51
3.4.3 Honey Bee Dance Data . . . . . . . . . . . . . . . . . . . 52
4 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . 61
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
iv
List of Figures
1.1 Examples on possible changes over consecutive segments . . . . . 2
2.1 Graphical model of linear regression models . . . . . . . . . . . . 10
2.2 Graphical model of Poisson-Gamma models . . . . . . . . . . . . 14
2.3 Results of synthetic data Blocks . . . . . . . . . . . . . . . . . . . 28
2.4 Results of synthetic data AR1 . . . . . . . . . . . . . . . . . . . . 29
2.5 Results of coal mining disaster data . . . . . . . . . . . . . . . . 30
2.6 Results of Tiger Woods data . . . . . . . . . . . . . . . . . . . . 31
2.7 Results of synthetic data Blocks under different values of hyper-
parameter λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8 Results of synthetic data Blocks under different values of hyper-
parameter γ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.9 Results of synthetic data AR1 under different values of hyperpa-
rameter γ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 An example to show the importance of modeling correlation struc-
tures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Graphical model of multivariate linear regression models . . . . . 38
3.3 An example of graphical representation of Gaussian graphical
models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Graphical representation for independent model . . . . . . . . . . 43
3.5 Graphical representation for fully dependent model . . . . . . . . 43
3.6 An example to show sliding window method . . . . . . . . . . . . 44
3.7 Results of synthetic data 2D . . . . . . . . . . . . . . . . . . . . . 49
v
List of Figures
3.8 Results of synthetic data 10D . . . . . . . . . . . . . . . . . . . . 54
3.9 Candidate list of graphs generated by sliding windows in 10D data 55
3.10 Results of synthetic data 20D . . . . . . . . . . . . . . . . . . . . 56
3.11 Results of U.S. portfolios data of 5 industries . . . . . . . . . . . 57
3.12 Results of U.S. portfolios data of 30 industries . . . . . . . . . . . 58
3.13 Results of honey bee dance sequence 4 . . . . . . . . . . . . . . . 59
3.14 Results of honey bee dance sequence 6 . . . . . . . . . . . . . . . 60
vi
Acknowledgements
I would like to thank all the people who gave me help and support throughout
my degree.
First of all, I would like to thank my supervisor, Professor Kevin Murphy for his
constant encouragement and rewarding guidance throughout this thesis work.
Kevin introduced me to this interesting problem. I am grateful to Kevin for
giving me the Bayesian education and directing me to the research in machine
learning. I am amazed by Kevin’s broad knowledge and many quick and bril-
liant ideas.
Secondly, I would like to thank Professor Nando de Freitas for dedicating his
time and effort in reviewing my thesis.
Thirdly, I would like to extend my appreciation to my colleagues for their friend-
ship and help, and especially to the following people: Sohrab Shah, Wei-Lwun
Lu and Chao Yan.
Last but not least, I would like to thank my parents and my wife for their end-
less love and support.
XIANG XUAN
The University of British Columbia
March 2007
vii
Chapter 1
Introduction
1.1 Problem Statement
Change point problems are commonly referred to detect heterogeneity of tempo-
ral or spatial data. Given a sequence of data over time or space, change points
split data into a set of disjoint segments. Then it is assumed that the data from
the same segment comes from the same model. If we assume
data = model +noise
then the data on two successive segments could be different in the following
ways:
• different models (model types or model orders)
• same model with different parameters
• same model with different noise levels
• in multiple sequences, different correlation among sequences
Figure 1.1 shows four examples of changes over successive segments. In all
four examples, there is one change point at location 50 (black vertical solid
line) which separates 100 observations into two segments. The top left panel
shows an example of different model orders. The 1st segment is a 2nd order
Autoregressive model and the 2nd segment is a 4th order Autoregressive model.
The top right panel shows an example of same model with different parameters.
Both segments are linear models, but the 1st segment has a negative slope while
the 2nd segment has a positive slope. The bottom left panel shows an example
1
Chapter 1. Introduction
of same model with different noise level. Both segments are constant models
which have means at 0, but the noise level (the standard deviation) of the 2nd
segment is three times as large as the one on the 1st segment. The bottom right
panel is an example of different correlation between two series. We can see that
two series are positive correlated in the 1st segment, but negative correlated in
the 2nd segment.
0 20 40 60 80 100
−20
−15
−10
−5
0
5
10
15
20
0 20 40 60 80 100
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
0 20 40 60 80 100
−10
−8
−6
−4
−2
0
2
4
6
8
10
0 20 40 60 80 100
−3
−2
−1
0
1
2
3
4
Figure 1.1: Examples show possible changes over successive segments. The
top left panel shows changes on AR model orders. The top right panel shows
changes on parameters. The bottom left panel shows changes on noise level.
The bottom right panel shows changes on correlation between two series.
The aim of the change point problems is to make inference about the number
and location of change points.
2
Chapter 1. Introduction
1.2 Related Works
Many works have been done by many researchers in different areas. This the-
sis is an extension based on the Fearnhead’s work which is a special case of
the Product Partition Models (PPM) (defined later) and using dynamic pro-
gramming algorithms. Here I like to mention two approaches that are closely
related to our works. The first approach is based on a different models Hidden
Markov model (HMM), and the second approach is using a different algorithm
Reversible Jump Monte Carlo Markov Chain.
1.2.1 Hidden Markov Models
A HMM is a statistical model where system being modeled is assumed to be
Markov process with hidden states. In a regular Markov model, the state is
directly visible. In a HMM, the state is not directly visible, but variables influ-
enced by the hidden states are visible. The challenge is to determine the hidden
states from observed variables.
In change point problems, we view the change is due to change on hidden
states. Hence by inference on all hidden states, we can segment data implicitly.
In HMM, we need to fix the number of hidden states, and often the number of
state cannot be too large.
1.2.2 Reversible Jump MCMC
Reversible jump is a MCMC algorithm which has been extensively used in the
change point problems. It starts by a initial set of change points. At each step,
it can make the following three kinds of moves:
• Death move to delete a change point or merge two consecutive segments,
• Birth move to add a new change point or split a segment into two,
• Update move to shift the position of a change point
3
Chapter 1. Introduction
Each step, we will accept a move based on a probability calculated by the
Metropolis-Hastings algorithm. We run until the chain converges. Then we can
find out the posterior probability of the number and position of change points.
The advantage of reversible jump MCMC is that it can handle a large family
of distributions even when we only know the distribution up to a normalized
constant. The disadvantages are slow and difficult to diagnose convergence of
the chain.
1.3 Contribution
Our contributions of the thesis are:
• We extend Fearnhead’s work to the case of multiple dimensional series
which allows us to detect changes on correlation structures, as well as
changes on means, variance, etc.
• We further model the correlation structures using Gaussian graphical mod-
els which allows us to estimate the changing topology of dependencies
among series, in addition to detecting change points.
• We illustrate the algorithms by applying them to some synthetic and real
data sets.
1.4 Thesis Outline
The remaining of the thesis is organized as follows. In Chapter 2, we will review
Fearnhead’s work in one dimensional series in details, and provide experimen-
tal results on some synthetic and real data sets. In Chapter 3, we will show
how to extend Fearnhead’s work in multiple dimensional series and how to use
Gaussian graphical models to model and learn correlation structures. We will
also provide experimental results on some synthetic and real data sets. Finally,
our conclusions are stated in Chapter 4, along with a number of suggestions for
future works.
4
Chapter 2
One Dimensional Time
Series
Let’s consider the change point problems with the following conditional inde-
pendence property: given the position of a change point, the data before that
change point is independent of the data after the change point. Then these mod-
els are exactly the Product Partition Models (PPM) in one dimension [3, 4, 11].
Here a dataset Y
1:N
is partitioned into K partitions, where the number of the
partitions K is unknown. The data on the k-th segment Y
k
is assumed to be
independent with the data on the other segments given a set of parameters θ
k
for that partition . Hence given the segmentation (partition) S
1:K
, we can write
P(Y
1:N
|K, S
1:K
, θ
1:K
) =
K
¸
k=1
P(Y
k

k
) (2.1)
Fearnhead [13, 14, 16] proposed three algorithms (offline, online exact and online
approximate) to solve the change point problems under PPM. They all first
calculate the joint posterior distribution of the number and positions of change
points P(K, S
1:K
|Y
1:N
) using dynamic programming, then sample change points
from this posterior distribution by perfect sampling [22].
K, S
1:K
∼ P(K, S
1:K
|Y
1:N
) (2.2)
After sampling change points, making inference on models and their parameters
over segments is straight forward.
The offline algorithm and online exact algorithm run in O(N
2
), and the online
approximate algorithm runs in O(N).
5
Chapter 2. One Dimensional Time Series
2.1 Prior on Change Point Location and
Number
To express the uncertainty of the number and position of change points, we put
the following prior distribution on change points.
Let’s assume that the change points occur at discrete time points and we model
the change point positions by a Markov process. Let the transition probabilities
of this Markov process be the following,
P(next change point at t|change point at s) = g(|t −s|) (2.3)
We assume ( 2.3) only depends on the distance between two change points. Also
we let the probability mass function for the distance between two successive
change points s and t be g(|t − s|). Furthermore, we define the cumulative
distribution function for the distance as following,
G(l) =
l
¸
i=1
g(i) (2.4)
and assume that g() is also the probability mass function for the position of
the first change point. In general, g() can be any arbitrary probability mass
function with the domain over 1, 2, · · · , N −1. Then g() and G() imply a prior
distribution on the number and positions of change points.
For example, if we use the Geometric distribution as g(), then our model implies
a Binomial distribution for the number of change points and a Uniform distri-
bution for the locations of change points. To see that, let’s suppose there are N
data points and we use a Geometric distribution with parameter λ. We denote
P(C
i
= 1) as the probability of location i being a change point. By default,
position 0 is always a change point. That is,
P(C
0
= 1) = 1 (2.5)
First, we show that the distribution for the location of change points is Uniform
by induction. That is, ∀i = 1, . . . , N
P(C
i
= 1) = λ (2.6)
6
Chapter 2. One Dimensional Time Series
When i = 1, we only have one case: position 0 and 1 both are change points.
Hence the length of the segment is 1. We have,
P(C
1
= 1) = g(1) = λ
Suppose ∀ i ≤ k, we have
P(C
i
= 1) = λ (2.7)
Now when i = k+1, conditioning on the last change point before position k+1,
we have,
P(C
k+1
= 1) =
k
¸
j=0
P(C
k+1
= 1|C
j
= 1)P(C
j
= 1) (2.8)
where
P(C
k+1
= 1|C
j
= 1) = g(k + 1 −j) = λ(1 −λ)
k−j
(2.9)
By ( 2.5), ( 2.7) and ( 2.9), ( 2.8) becomes,
P(C
k+1
= 1) = P(C
k+1
= 1|C
0
= 1)P(C
0
= 1) +
k
¸
j=1
P(C
k+1
= 1|C
j
= 1)P(C
j
= 1)
(let t = k −j) = λ(1 −λ)
k
+
k
¸
j=1
λ(1 −λ)
k−j
λ = λ(1 −λ)
k

2
k−1
¸
t=0
(1 −λ)
t
= λ(1 −λ)
k

2
1 −(1 −λ)
k
1 −(1 −λ)
= λ(1 −λ)
k
+λ(1 −(1 −λ)
k
)
= λ (2.10)
By induction, this proves ( 2.6). Next we show the number of change points
follows Binomial distribution.
Let’s consider each position as a trial with two outcomes, either being a change
point or not. By ( 2.6), we know the probability of being a change point is
the same. Then we only need to show each trial is independent. That is,
∀i, j = 1, . . . , N and i = j,
P(C
i
= 1) = P(C
i
= 1|C
j
= 1) = λ (2.11)
7
Chapter 2. One Dimensional Time Series
When i < j, it is true by default, since future will not change history. When
i > j, we show it by induction on j.
When j = i −1, we only have one case: position j and i both are change points.
Hence the length of the segment is 1. We have,
P(C
i
= 1|C
i−1
= 1) = g(1) = λ
Suppose ∀ j ≥ k, we have
P(C
i
= 1|C
j
= 1) = λ (2.12)
Now when j = k −1, conditioning on the next change point after position k −1,
we have,
P(C
i
= 1|C
k−1
= 1) =
i
¸
t=k
P(C
i
= 1|C
t
= 1, C
k−1
= 1)P(C
t
= 1|C
k−1
= 1)
(2.13)
where
P(C
i
= 1|C
t
= 1, C
k−1
= 1) = P(C
i
= 1|C
t
= 1) =

λ if t < i
1 if t = i
P(C
t
= 1|C
k−1
= 1) = g(t −k + 1) = λ(1 −λ)
t−k
(2.14)
Hence ( 2.13) becomes,
P(C
i
= 1|C
k−1
= 1) =
i−1
¸
t=k
λλ(1 −λ)
t−k
+λ(1 −λ)
i−k
(let s = t-k) = λ
2
i−k−1
¸
s=0
(1 −λ)
s
+λ(1 −λ)
i−k
= λ
2
1 −(1 −λ)
i−k
1 −(1 −λ)
+λ(1 −λ)
i−k
= λ(1 −(1 −λ)
i−k
) +λ(1 −λ)
i−k
= λ (2.15)
By induction, this proves ( 2.11). Hence the number of change points follows
Binomial distribution.
8
Chapter 2. One Dimensional Time Series
2.2 Likelihood Functions for Data in Each
Partition
If we assume that s and t are two successive change points, then data Y
s+1:t
forms one segment. We use the likelihood function P(Y
s+1:t
) to evaluate how
good the data Y
s+1:t
can be fit in one segment. Here there are two levels of
uncertainty.
The first is the uncertainty of model types. There are many possible candidate
models and we certainly cannot enumerate all of them. As a result, we will
only consider a finite set of models. For example, polynomial regression models
up to order 2 or autoregressive models up to order 3. Then if we put a prior
distribution π() on the set of models, and define P(Y
s+1:t
|q) as the likelihood
function of Y
s+1:t
conditional on a model q, then we can evaluate P(Y
s+1:t
) as
following,
P(Y
s+1:t
) =
¸
q
P(Y
s+1:t
|q)π(q) (2.16)
The second is the uncertainty of model parameters. If the parameter on this
segment conditional on model q is θ
q
, and we put a prior distribution π(θ
q
) on
parameters, we can evaluate P(Y
s+1:t
|q) as following,
P(Y
s+1:t
|q) =

t
¸
i=s+1
f(Y
i

q
, q)π(θ
q
|q) dθ
q
(2.17)
We assume that P(Y
s+1:t
|q) can be efficiently calculated for all s, t and q, where
s < t. In practice, this requires either conjugate priors on θ
q
which allow us to
work out the likelihood function analytically, or fast numerical routines which
are able to evaluate the required integration. In general, for any data and
models, as long as we can evaluate the likelihood function ( 2.17), we can use
Fearnhead’s algorithms. Our extensions are mainly based on this.
Now let’s look at some examples.
9
Chapter 2. One Dimensional Time Series
2.2.1 Linear Regression Models
The linear regression models are one of the most widely used models. Here, we
assume
Y
s+1:t
= Hβ +ǫ (2.18)
where H is a (t − s) by q matrix of basis functions, β is a q by 1 vector of
regression parameters and ǫ is a (t − s) by 1 vector of iid Gaussian noise with
mean 0 and variance σ
2
. Since we assume conjugate priors, σ
2
has an Inverse
Figure 2.1: Graphical representation of Hierarchical structures of Linear Re-
gression Models
Gamma distribution with parameters ν/2 and γ/2, and the jth component of
regression parameter β
j
has a Gaussian distribution with mean 0 and variance
σ
2
δ
2
j
. This hierarchical model can be illustrated by Figure 2.1. For simplicity,
we write Y
s+1:t
as Y and let n = t −s. And we have the following,
P(Y |β, σ
2
) =
1
(2π)
n/2
σ
n
exp


1

2
(Y −Hβ)
T
I
−1
n
(Y −Hβ)

(2.19)
P(β|D, σ
2
) =
1
(2π)
q/2
σ
q
|D|
1/2
exp


1

2
β
T
D
−1
β

(2.20)
10
Chapter 2. One Dimensional Time Series
P(σ
2
|ν, γ) =
(γ/2)
ν/2
Γ(ν/2)

2
)
−ν/2−1
exp


γ

2

(2.21)
where D = diag(δ
2
1
, . . . , δ
2
q
) and I
n
is a n by n identity matrix.
By ( 2.17), we have,
P(Y
s+1,t
|q) = P(Y |D, ν, γ)
=

P(Y, β, σ
2
|D, ν, γ) dβ dσ
2
=

P(Y |β, σ
2
)P(β|D, σ
2
)P(σ
2
|ν, γ) dβ dσ
2
Multiplying ( 2.19) and ( 2.20), we have following,
P(Y |β, σ
2
)P(β|D, σ
2
) ∝ exp


1

2
((Y −Hβ)
T
(Y −Hβ) +β
T
D
−1
β)

∝ exp


1

2
(Y
T
Y −2Y
T
Hβ +β
T
H
T
Hβ +β
T
D
−1
β)

Now let
M = (H
T
H +D
−1
)
−1
P = (I −HMH
T
)
||Y ||
2
P
= Y
T
PY
(∗) = Y
T
Y −2Y
T
Hβ +β
T
H
T
Hβ +β
T
D
−1
β
Then
(∗) = β
T
(H
T
H +D
−1
)β −2Y
T
Hβ +Y
T
Y
= β
T
M
−1
β −2Y
T
HMM
−1
β +Y
T
Y
= β
T
M
−1
β −2Y
T
HMM
−1
β +Y
T
HMM
−1
M
T
H
T
Y −Y
T
HMM
−1
M
T
H
T
Y +Y
T
Y
Using fact M = M
T
(∗) = (β −MH
T
Y )
T
M
−1
(β −MH
T
Y ) +Y
T
Y −Y
T
HMH
T
Y
= (β −MH
T
Y )
T
M
−1
(β −MH
T
Y ) +Y
T
PY
= (β −MH
T
Y )
T
M
−1
(β −MH
T
Y ) +||Y ||
2
P
11
Chapter 2. One Dimensional Time Series
Hence
P(Y |β, K)P(β|D, K) ∝ exp


1

2
((β −MH
T
Y )
T
M
−1
(β −MH
T
Y ) +||Y ||
2
P

So the posterior for β is still Gaussian with mean MH
T
Y and variance σ
2
M:
P(β|D, σ
2
) =
1
(2π)
q/2
σ
q
|M|
1/2
exp


1

2
(β −MH
T
Y )
T
M
−1
(β −MH
T
Y )

(2.22)
Then integrating out β, we have
P(Y |D, σ
2
) =

P(Y |β, σ
2
)P(β|D, σ
2
) dβ
=
1
(2π)
n/2
σ
n
(2π)
q/2
σ
q
|M|
1/2
(2π)
q/2
σ
q
|D|
1/2
exp


1

2
||Y ||
2
P

=
1
(2π)
n/2
σ
n

|M|
|D|
1
2
exp


1

2
||Y ||
2
P

(2.23)
Now we multiply ( 2.21) and ( 2.23),
P(Y |D, σ
2
)P(σ
2
|ν, γ) ∝ (σ
2
)
−n/2−ν/2−1
exp


γ +||Y ||
2
P

2

So the posterior for σ
2
is still Inverse Gamma with parameters (n + ν)/2 and
(γ +||Y ||
2
P
)/2:
P(σ
2
|ν, γ) =
((γ +||Y ||
2
P
)/2)
(n+ν)/2
Γ((n +ν)/2)

2
)
−(n+ν)/2−1
exp


γ +||Y ||
2
P

2

(2.24)
Then integrating out σ
2
, we have
P(Y
s+1:t
|q) = P(Y |D, ν, γ)
=

P(Y |D, σ
2
)P(σ
2
|ν, γ) dσ
2
=
¸
1
(2π)
n/2

|M|
|D|
1
2
¸
¸
(γ/2)
ν/2
Γ(ν/2)
¸
Γ((n +ν)/2)
((γ +||Y ||
2
P
)/2)
(n+ν)/2

= π
−n/2

|M|
|D|
1
2
(γ)
ν/2
(γ +||Y ||
2
P
)
(n+ν)/2
Γ((n +ν)/2)
Γ(ν/2)
(2.25)
12
Chapter 2. One Dimensional Time Series
In implementation, we rewrite ( 2.25) in log space as following,
log(P(Y
s+1:t
|q)) = −
n
2
log(π) −
1
2
(log|M| −log|D|) +
ν
2
log(γ) −
n +ν
2
log(γ +||Y ||
2
P
)
+ log(Γ((n +ν)/2)) −log(Γ(ν/2)) (2.26)
To speed up, −
n
2
log(π),
ν
2
log(γ), −
1
2
log|D|, log(Γ((n +ν)/2)) and log(Γ(ν/2))
can be pre-computed. At each iteration, M and ||Y ||
2
P
can be computed by the
following rank one update,
H
T
1:i+1,:
H
1:i+1,:
= H
T
1:i,:
H
1:i,:
+H
T
i+1,:
H
i+1,:
Y
T
1:i+1
Y
1:i+1
= Y
T
1:i
Y
1:i
+Y
T
i+1
Y
i+1
H
T
1:i+1,:
Y
1:i+1
= H
T
1:i,:
Y
1:i
+H
T
i+1,:
Y
i+1
We use the following notations: if Y is a vector, then Y
s:t
denotes the entries
from position s to t inclusive. If Y is a matrix, then Y
s:t,:
denotes the s-th row
to the t-th row inclusive, and Y
:,s:t
denotes the s-th column to the t-th column
inclusive.
2.2.2 Poisson-Gamma Models
In Poisson Models, each observation Y
i
is a non-negative integer which fol-
lows Poisson distribution with parameter λ. With conjugate prior, λ follows a
Gamma distribution with parameters α and β. This hierarchical model can be
illustrated by Figure 2.2. Hence we have the following,
P(Y
i
|λ) =
e
−λ
λ
Y
i
Y
i
!
(2.27)
P(λ|α, β) = λ
(α−1)
β
α
e
−βλ
Γ(α)
(2.28)
For simplicity, we write Y
s+1:t
as Y and let n = t −s. By ( 2.17), we have,
P(Y
s+1:t
|q) = P(Y |α, β)
=

P(Y, λ|α, β) dλ
=

P(Y |λ)P(λ|α, β) dλ
=

¸
i
P(Y
i
|λ)

P(λ|α, β) dλ
13
Chapter 2. One Dimensional Time Series
Figure 2.2: Graphical representation of Hierarchical structures of Poisson-
Gamma Models. Note the Exponential-Gamma Models have the same Graphical
representation of their Hierarchical structures.
By multiply ( 2.27) and ( 2.28), and let SY =
¸
i
Y
i
, we have,
P(Y |λ)P(λ|α, β) ∝ e
−nλ−βλ
λ
SY +α−1
So the posterior for λ is still Gamma with parameters SY +α and n +β:
P(λ|α, β) = λ
(SY +α−1)
(n +β)
SY +α
e
−(n+β)λ
Γ(SY +α)
(2.29)
Then integrating out λ, we have
P(Y
s+1:t
|q) = P(Y |α, β)
=

¸
i
1
Y
i
!

β
α
Γ(α)
Γ(SY +α)
(n +β)
SY +α
=

¸
i
1
Y
i
!

Γ(SY +α)
Γ(α)
β
α
(n +β)
SY +α
(2.30)
Similarly, in log space, we have the following,
log(P(Y
s+1:t
|q)) = −
¸
i
log(Y
i
!) +log(Γ(SY +α)) −log(Γ(α)) +αlog(β)
14
Chapter 2. One Dimensional Time Series
− (SY +α)log(n +β) (2.31)
Where
¸
i
log(Y
i
!), log(Γ(α)) and αlog(β) can be pre-computed, and SY can
use the following rank one update,
SY
i+1
= SY
i
+Y
i+1
2.2.3 Exponential-Gamma Models
In Exponential Models, each observation Y
i
is a positive real number which
follows Exponential distribution with parameter λ. With conjugate prior, λ
follows a Gamma distribution with parameters α and β. This hierarchical model
is the same as the one in Poisson-Gamma Models, which can be illustrated by
Figure 2.2. Hence we have the following,
P(Y
i
|λ) = λe
−λY
i
(2.32)
P(λ|α, β) = λ
(α−1)
β
α
e
−βλ
Γ(α)
(2.33)
For simplicity, we write Y
s+1:t
as Y and let n = t −s. By ( 2.17), we have,
P(Y
s+1:t
|q) = P(Y |α, β)
=

P(Y, λ|α, β) dλ
=

P(Y |λ)P(λ|α, β) dλ
=

¸
i
P(Y
i
|λ)

P(λ|α, β) dλ
By multiply ( 2.32) and ( 2.33), and let SY =
¸
i
Y
i
, we have,
P(Y |λ)P(λ|α, β) ∞ e
−SY λ−βλ
λ
n+α−1
So the posterior for λ is still Gamma with parameters n +α and SY +β:
P(λ|α, β) = λ
(n+α−1)
(SY +β)
n+α
e
−(SY +β)λ
Γ(n +α)
(2.34)
Then integrating out λ, we have
P(Y
s+1:t
|q) = P(Y |α, β)
=
β
α
Γ(α)
Γ(n +α)
(SY +β)
n+α
(2.35)
15
Chapter 2. One Dimensional Time Series
Similarly, in log space, we have the following,
log(P(Y
s+1:t
|q)) = αlog(β) −log(Γ(α)) +log(Γ(n +α)) −(n +α)log(SY +β)
(2.36)
Where log(Γ(α)) and αlog(β) can be pre-computed, and SY can use the rank
one update as mentioned earlier.
2.3 Basis Functions
In the linear regression model ( 2.18), we have a basis function H.
Some common basis functions are the following.
2.3.1 Polynomial Basis
Polynomial model of order r is defined as following,
Y
i
= β
0

1
X
i

2
X
2
i
+· · · +β
r
X
r
i
(2.37)
Hence the polynomial basis H is following,
H
i:j
=

¸
¸
¸
¸
¸
¸
¸
1 X
i
X
2
i
· · · X
r
i
1 X
i+1
X
2
i+1
· · · X
r
i+1
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1 X
j
X
2
j
· · · X
r
j
¸

(2.38)
where X
i
=
i
N
.
2.3.2 Autoregressive Basis
Autoregressive model of order r is defined as following,
Y
i
= β
1
Y
i−1

2
Y
i−2
+· · · +β
r
Y
i−r
(2.39)
16
Chapter 2. One Dimensional Time Series
Hence the autoregressive basis H is following,
H
i:j
=

¸
¸
¸
¸
¸
¸
¸
Y
i−1
Y
i−2
· · · Y
i−r
Y
i
Y
i−1
· · · Y
i−r+1
.
.
.
.
.
.
.
.
.
.
.
.
Y
j−1
Y
j−2
· · · Y
j−r
¸

(2.40)
There are many other possible basis functions as well (eg, Fourier Basis). Here
we just want to point out that the basis function does provide a way to extend
Fearnhead’s algorithms (eg, Kernel Basis).
2.4 Offline Algorithms
Now we review the offline algorithm in detail. By its name, we know it works in
batch mode. It first recurses backward, then simulates change points forward.
2.4.1 Backward Recursion
Let’s define for s = 2, . . . , N,
Q(s) = Prob(Y
s:N
|location s −1 is a change point) (2.41)
and Q(1) = Prob(Y
1:N
) since location 0 is a change point by default.
Offline algorithm will calculate Q(s) for all s = 1, 2, · · · , N.
When s = N
Q(N) =
¸
q
P(Y
N:N
|q)π(q) (2.42)
When s < N
Q(s) =
N−1
¸
t=s
¸
q
P(Y
s:t
|q)π(q)Q(t + 1)g(t −s + 1)
+
¸
q
P(Y
s:N
|q)π(q)(1 −G(N −s)) (2.43)
where π(q) is the prior probability of model q and g() and G() are defined in
( 2.4).
17
Chapter 2. One Dimensional Time Series
The reason behind ( 2.43) is that (drop the explicit conditioning on a change
point at s −1 for notational convenience)
Q(s) =
N−1
¸
t=s
P(Y
s:N
, next change point is at t)
+ P(Y
s:N
, no further change points) (2.44)
For the first part we have,
P(Y
s:N
, next change point is at t)
= P(next change point is at t)P(Y
s:t
, Y
t+1:N
|next change point is at t)
= g(t −s + 1)P(Y
s:t
|s,t form a segment)P(Y
t+1:N
|next change point is at t)
=
¸
q
P(Y
s:t
|q)π(q)Q(t + 1)g(t −s + 1)
For the second part we have,
P(Y
s:N
, no further change points)
= P(Y
s:N
|s,N form a segment)P(the length of segment > N −s)
=
¸
q
P(Y
s:N
|q)π(q)(1 −G(N −s))
We will calculate Q(s) backward for s = N, · · · , 1 by ( 2.42) and ( 2.43). Since
s is from N to 1, and at step s we will compute sum over t where t is from s to
N, the whole algorithm runs in O(N
2
).
2.4.2 Forward Simulation
After we calculate Q(s) for all s = 1, . . . , N, we can simulate all change points
forward. To simulate one realisation, we do the following,
1. Set τ
0
= 0, and k = 0.
2. Compute the posterior distribution of τ
k+1
given τ
k
as following,
For τ
k+1
= τ
k
+ 1, τ
k
+ 2, · · · , N −1,
P(τ
k+1

k
, Y
1:N
) =
¸
q
P(Y
τ
k
+1:τ
k+1
|q)π(q)Q(τ
k+1
+ 1)g(τ
k+1
−τ
k
)/Q(τ
k
+ 1)
18
Chapter 2. One Dimensional Time Series
For τ
k+1
= N, which means no further change points
P(τ
k+1

k
, Y
1:N
) =
¸
q
P(Y
τ
k
+1:N
|q)π(q)(1 −G(N −τ
k
))/Q(τ
k
+ 1)
3. Simulated τ
k+1
from P(τ
k+1

k
, Y
1:N
), and set k = k + 1.
4. If τ
k
< N return to step (2); otherwise output the set of simulated change
points, τ
1
, τ
2
, . . . , τ
k
.
2.5 Online Algorithms
Now we discuss the online algorithms which have two versions (exact and ap-
proximate). Both work in online mode, and in the same way that first calculate
recursion forward, then simulate change points backward.
We first review the online exact algorithm in detail. Let’s introduce C
t
as a
state at time t, which is defined to be the time of the most recent change point
prior to t. If C
t
= 0, then there is no change point before time t. Then the
state C
t
can take values 0, 1, . . . , t −1, and C
1
, C
2
, . . . , C
t
satisfy Markov prop-
erty since we assume conditional independence over segments. Conditional on
C
t−1
= j, either t − 1 is not a change point, which leads to C
t
= j, or t − 1 is
a change point, which leads to C
t
= t −1. Hence the transition probabilities of
this Markov Chain can be calculated as following,
TP(C
t
= j|C
t−1
= i) =

1−G(t−i−1)
1−G(t−i−2)
if j = i
G(t−i−1)−G(t−i−2)
1−G(t−i−2)
if j = t −1
0 otherwise
(2.45)
where G is defined in ( 2.4).
The reason behind ( 2.45) is : given C
t−1
= i, we know there is no change point
between i and t −2. This also means that the length of segment must be longer
than t −2 −i. At the same time, G(n) means the cumulative probability of the
distance between two consecutive change points no more than n, which can also
be considered as the cumulative probability of the length of segment no more
19
Chapter 2. One Dimensional Time Series
than n. Hence We have P(C
t−1
= i) = 1 −G(t −2 −i). Then if j = i, we have
P(C
t
= i) = 1 −G(t −1 −i). If j = t −1, which means there is a change point
at t − 1, and by definition of g(), the length of two change points is t − 1 − i,
and this is just g(t −1 −i), which is also G(t −i −1) −G(t −i −2).
If we use a Geometric distribution as g(), then g(i) = (1 − λ)
i−1
λ and G(i) =
1 −(1 −λ)
i
. Now if j = i,
TP(C
t
= j|C
t−1
= i) =
1 −G(t −i −1)
1 −G(t −i −2)
=
(1 −λ)
t−i−1
(1 −λ)
t−i−2
= 1 −λ
And if j = t −1
TP(C
t
= j|C
t−1
= i) =
G(t −i −1) −G(t −i −2)
1 −G(t −i −2)
=
(1 −λ)
t−i−2
−(1 −λ)
t−i−1
(1 −λ)
t−i−2
= λ
Hence ( 2.45) becomes,
TP(C
t
= j|C
t−1
= i) =

1 −λ if j = i
λ if j = t −1
0 otherwise
where λ is the parameter of the Geometric distribution.
2.5.1 Forward Recursion
Let’s define the filtering density P(C
t
= j|Y
1:t
) as: given data Y
1:t
, the proba-
bility of the last change point is at position j. Online algorithm will compute
P(C
t
= j|Y
1:t
) for all t, j, such that t = 1, 2, . . . , N and j = 0, 1, . . . , t −1.
When t = 1, we have j = 0, hence
P(C
1
= 0|Y
1:1
) = 1 (2.46)
When t > 1, by standard filtering recursions, we have
P(C
t
= j|Y
1:t
) = P(C
t
= j|Y
t
, Y
1:t−1
)
=
P(C
t
= j, Y
t
|Y
1:t−1
)
P(Y
t
|Y
1:t−1
)
20
Chapter 2. One Dimensional Time Series
=
P(Y
t
|C
t
= j, Y
1:t−1
)P(C
t
= j|Y
1:t−1
)
P(Y
t
|Y
1:t−1
)
∝ P(Y
t
|C
t
= j, Y
1:t−1
)P(C
t
= j|Y
1:t−1
) (2.47)
and
P(C
t
= j|Y
1:t−1
) =
t−2
¸
i=0
TP(C
t
= j|C
t−1
= i)P(C
t−1
= i|Y
1:t−1
) (2.48)
If we define w
j
t
= P(Y
t
|C
t
= j, Y
1:t−1
), and with ( 2.45), we have
P(C
t
= j|Y
1:t
) ∝

w
j
t
1−G(t−i−1)
1−G(t−i−2)
P(C
t−1
= j|Y
1:t−1
) if j < t −1
w
t
t
¸
t−2
i=0

G(t−i−1)−G(t−i−2)
1−G(t−i−2)
P(C
t−1
= i|Y
1:t−1
)

if j = t −1
(2.49)
Now w
j
t
can be calculated as following.
When j < t −1,
w
j
t
=
P(Y
j+1:t
|C
t
= j)
P(Y
j+1:t−1
|C
t
= j)
=
¸
q
P(Y
j+1:t
|q)π(q)
¸
q
P(Y
j+1:t−1
|q)π(q)
When j = t −1,
w
t
t
=
¸
q
P(Y
t:t
|q)π(q) (2.50)
where π(q) is the prior probability of model q.
We will calculate P(C
t
|Y
1:t
) forward for t = 1, · · · , N by ( 2.46) and ( 2.49).
Since t is from 1 to N, and at step t we will compute C
t
= j where j is from 0
to t −1, the whole algorithm runs in O(N
2
).
2.5.2 Backward Simulation
After we calculate the filtering densities P(C
t
|Y
1:t
) for all t = 1, . . . , N, we
can use idea in [9] to simulate all change points backward. To simulate one
realisation from this joint density, we do the following,
1. Set τ
0
= N, and k = 0.
21
Chapter 2. One Dimensional Time Series
2. Simulated τ
k+1
from the filtering density P(C
τ
k
|Y
1:τ
k
), and set k = k +1.
3. If τ
k
> 0 return to step (2); otherwise output the set of simulated change
points backward, τ
k−1
, τ
k−2
, . . . , τ
1
.
2.5.3 Viterbi Algorithms
We can also obtain an online Viterbi algorithm for calculating the maximum a
posterior (MAP) estimate of positions of change points and model parameters
for each segment as following. We define M
i
to be the event that given a change
point at time i, the MAP choice of change points and model parameters occurs
prior to time i.
For t = 1, . . . , n, and i = 0, . . . , t −1, and all q,
P
t
(i, q) = P(C
t
= i, modelq, M
i
, Y
1:t
)
and
P
MAP
t
= P(change point at t,M
t
, Y
1:t
)
Then we can compute P
t
(i, q) using the following:
P
t
(i, q) = (1 −G(t −j −1))P(Y
j+1:t
|q)π(q)P
MAP
j
(2.51)
and
P
MAP
t
= max
j,q
P
t
(j, q)g(t −j)
1 −G(t −j −1)
(2.52)
At time t, the MAP estimates of C
t
and the current model parameters are given
respectively by the values of j and q which maximize P
t
(i, q). Given a MAP
estimate of C
t
, we can then calculate the MAP estimates of the change point
prior to C
t
and the model parameters of that segment by the value of j and q
that maximized the right hand side of ( 2.52). This procedure can be repeated
to find the MAP estimates of all change points and model parameters.
22
Chapter 2. One Dimensional Time Series
2.5.4 Approximate Algorithm
Now we look at the online approximate algorithm. The online approximate
algorithm works in almost same way as the online exact algorithm. The only
difference is the following. At step t, online exact algorithm stores the complete
posterior distribution P(C
t
= j|Y
1:t
) for j = 0, 1, · · · , t − 1. We approximate
P(C
t
= j|Y
1:t
) by a discrete distribution with fewer points. This approximate
distribution can be described by a set of support points. We call them particles.
We impose a maximum number (eg, M) of particles to be stored at each step.
Whenever we have M particles, we perform resampling to reduce the number
of particles to L, where L < M. Hence the maximum computational cost per
iteration is proportional to M. Since M is not dependent on the size of data
N, the approximate algorithm runs in O(N).
There are many ways to perform resampling [7, 12, 15, 16]. Here, we use the
simplest one. We let L = M −1, hence each step we just drop the particle that
has the lowest support. We will show later that this approximation will not
affect the accuracy of the result much, but runs much faster.
2.6 Implementation Issues
There are two issues that need to be mentioned in implementation.
The first issue is numerical stability. We perform all calculations in log space
to prevent overflow and underflow. We also use the functions logsumexp and
logdet. In all three algorithms, we introduce a threshold (eg, 1×10
−30
). During
recursions, whenever the quantity that we need to compute is less than the
threshold, we will set it to be −∞, hence it will not be computed in the next
iteration. This will provide not only stable calculation but also speed up because
we will calculate less terms in each iteration. At the end of each iteration, we
will rescale the quantity to prevent underflow.
The second one is rank one update. This is very important in term of speed.
The way to do rank one update depends on the likelihood function we use.
23
Chapter 2. One Dimensional Time Series
2.7 Experimental Results
Now we will show experimental results on some synthetic and real data sets.
2.7.1 Synthetic Data
We will look at two synthetic data sets.
The first one is called Blocks. It has been previously analyzed in [10, 13, 21]. It
is a very simple data set, and the results are shown in Figure 2.3. Each segment
is a piecewise constant model, with mean level shifting over segments. We set
the hyper parameter ν = 2 and γ = 2 on σ
2
. The top panel shows the raw
data with the true change points (red vertical line). The bottom three panels
show the posterior probability of being change points at each position and the
number of segments that are calculated (from top to bottom) by the offline, the
online exact and the online approximate algorithms. We can see that the results
are essentially the same.
The second data set is called AR1, and results are shown in Figure 2.4. Each
segment is an autoregressive model. We set the hyper parameter ν = 2 and
γ = 0.02 on σ
2
. The top panel shows the raw data with the true change points
(red vertical line). The bottom three panels show the posterior probability
of being change points at each position and the number of segments that are
calculated (from top to bottom) by the offline, the online exact and the online
approximate algorithms. Again, we see the results from three algorithms are
essentially the same. Henceforth only use online approximate method.
2.7.2 British Coal Mining Disaster Data
Now let’s look at a real data set which records British coal-mining disasters [17]
by year during the 112 year period from 1851 to 1962. This is a well-known
data set and is previously studied by many researchers [6, 14, 18, 25]. Here Y
i
is
the number of disasters in the i-th year, which follows Poisson distribution with
parameter (rate) λ. Hence the natural conjugate prior is a Gamma distribution.
24
Chapter 2. One Dimensional Time Series
We set the hyper parameter α = 1.66 and β = 1, since we let the prior mean
of λ is equal to the empirical mean (
α
β
=
¯
Y = 1.66) and the prior strength β
to be weak. Then we use Fearnhead’s algorithms to analyse the data, and the
results are shown in Figure 2.5. In top left panel, it shows the raw data as
a Poisson sequence. The bottom left panel shows the posterior distribution on
the number of segments. It shows the most probable number of segments is
four. The bottom right panel shows the posterior distribution of being change
points at each location. Since we have four segments, we will pick the most
probable three change points at location 41, 84 and 102 which corresponding
to year 1891, 1934 and 1952. Then in the up right panel, it shows the resulted
segmentation (the red vertical line) and the posterior estimators of rate λ on
each segment (the red horizontal line). On four segments, the posterior rates
are roughly as following, 3, 1, 1.5 and 0.5.
2.7.3 Tiger Woods Data
Now let’s look at another interesting example. In sports, when an individual
player or team enjoys periods of good form, it, is called ’streakiness’. It is
interesting to detect a streaky player or a streaky team in many sports [1, 5, 26]
including baseball, basketball and golf. The opposite of a streaky competitor
is a player or team with a constant rate of success over time. A streaky player
is different, since the associated success rate is not constant over time. Streaky
players might have a large number of success during one or more periods, with
fewer or no successes during other periods. More streaky players tend to have
more change points. Here we use Tiger Woods as an example.
Tiger Woods is one of the most famous golf players in the world, and the first
ever to win all four professional major championships consecutively. Woods
turned professional at the Great Milwaukee Open in August 29, 1996 and play
230 tournaments, winning 62 championships in the period September 1996 to
December 2005. Let Y
i
= 1 if Woods won the i-th tournament and Y
i
=
0 otherwise. Then the data can be expressed as a Bernoulli sequence with
25
Chapter 2. One Dimensional Time Series
parameter (rate) λ. We put a conjugate Beta prior with hyper parameters
α = 1 and β = 1 on λ since this is the weakest prior. Then we perform
Fearnhead’s algorithms to analyse the data, and the results are shown in Figure
2.6. In top left panel, it shows the raw data as a Bernoulli sequence. The
bottom left panel shows the posterior distribution on the number of segments.
It shows the most probable number of segments is three. The bottom right
panel shows the posterior distribution of being change points at each location.
Since we have three segments, we will pick the most probable two change points
at location 73 and 167 which corresponding to May 1999 and March 2003.
Then in the up right panel, it shows the resulted segmentation (the red vertical
line) and the posterior estimators of rate λ on each segment (the red horizontal
line). We can see clearly that Tiger Woods’s career from September 1996 to
December 2005 can be splitted into three periods. Initially, from September 1996
to May 1999, he was an golf player with winning rate lower than 0.2. However,
from May 1999 to March 2003, he was in his peak period with winning rate
nearly 0.5. After March 2003 till December 2005, his winning rate was dropped
back to lower than 0.2. The data comes from Tiger Woods’s official website
http://www.tigerwoods.com/, and is previous studied by [26]. In [26], the data
is only from September 1996 to June 2001.
2.8 Choices of Hyperparameters
Hyperparameters λ is the rate of change points and can be set as following,
λ =
the total number of expected segments
the total number of data points
(2.53)
For example, if there are 1000 data points and we expect there are 10 seg-
ments, then we will set λ = 0.01. If we increase λ, we will encourage more
change points. We use the synthetic data Blocks as an example. Results ob-
tained by the online approximate algorithms under different values of λ are
shown in Figure 2.7. From top to bottom, the values of λ are: 0.5, 0.1, 0.01,
0.001 and 0.0002. We see when λ = 0.5 (this prior says the length of segment is
26
Chapter 2. One Dimensional Time Series
2, which is too short.), the result is clearly oversegmented. Under other values
of λ, results are fine.
Different likelihood functions have different hyperparameters. In linear regres-
sion, we have hyperparameters ν and γ on the variance σ
2
. Since ν represents
the strength of prior, we normally set ν = 2 such that it is a weak prior. Then
we can set γ to reflect our belief on how large the variance will be within each
segment. (Note: we parameterize Inverse-Gamma in term of
ν
2
and
γ
2
.) When
ν = 2, the mean does not exist. We use the mode
γ
ν+2
to set γ = 4σ
2
, where σ
2
is expected variance within each segment. For example, if we believe the vari-
ance is 0.01, then we will set γ = 0.04. Now we show results from synthetic data
Blocks and AR1 obtained by the online approximate algorithms under different
values of γ in Figure 2.8 and Figure 2.9. From top to bottom, the values of γ
are: 100, 20, 2, 0.2 and 0.04. We see that the data Blocks is very robust to the
choices of γ. For the data AR1, we see when γ = 100, we only detect 3 change
points instead of 5. For other values of γ, results are fine.
27
Chapter 2. One Dimensional Time Series
0 100 200 300 400 500 600 700 800 900 1000
−10
0
10
20
Y
0 100 200 300 400 500 600 700 800 900 1000
0
0.2
0.4
0.6
0.8
1
Offline
0 100 200 300 400 500 600 700 800 900 1000
0
0.2
0.4
0.6
0.8
1
Online Exact
0 100 200 300 400 500 600 700 800 900 1000
0
0.2
0.4
0.6
0.8
1
Online Approximate
0 5 10
0
0.2
0.4
0.6
0.8
1
P(N)
0 5 10
0
0.2
0.4
0.6
0.8
1
P(N)
0 5 10
0
0.2
0.4
0.6
0.8
1
P(N)
Figure 2.3: Results on synthetic data Blocks (1000 data points). The top
panel is the Blocks data set with true change points. The rest are the posterior
probability of being change points at each position and the number of segments
calculated by (from top to bottom) the offline, the online exact and the online
approximate algorithms. Results are generated by ’showBlocks’.
28
Chapter 2. One Dimensional Time Series
0 100 200 300 400 500 600 700 800 900 1000
−20
−10
0
10
20
Y
0 100 200 300 400 500 600 700 800 900 1000
0
0.2
0.4
0.6
0.8
1
Offline
0 100 200 300 400 500 600 700 800 900 1000
0
0.2
0.4
0.6
0.8
1
Online Exact
0 100 200 300 400 500 600 700 800 900 1000
0
0.2
0.4
0.6
0.8
1
Online Approximate
0 2 4 6 8
0
0.2
0.4
0.6
0.8
1
P(N)
0 2 4 6 8
0
0.2
0.4
0.6
0.8
1
P(N)
0 2 4 6 8
0
0.2
0.4
0.6
0.8
1
P(N)
Figure 2.4: Results on synthetic data AR1 (1000 data points). The top panel is
the AR1 data set with true change points. The rest are the posterior probability
of being change points at each position and the number of segments calculated
by (from top to bottom) the offline, the online exact and the online approximate
algorithms. Results are generated by ’showAR1’.
29
Chapter 2. One Dimensional Time Series
0 20 40 60 80 100
0
1
2
3
4
5
6
Y
0 2 4 6 8 10
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
P(N)
0 20 40 60 80 100
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 20 40 60 80 100
0
1
2
3
4
5
6
Figure 2.5: Results on Coal Mining Disaster data. The top left panel shows
the raw data as a Poisson sequence. The bottom left panel shows the posterior
distribution on the number of segments. The bottom right panel shows the
posterior distribution of being change points at each position. The up right
panel shows the resulted segmentation and the posterior estimators of rate λ on
each segment. Results are generated by ’showCoalMining’.
30
Chapter 2. One Dimensional Time Series
0 50 100 150 200
0
0.2
0.4
0.6
0.8
1
Y
0 1 2 3 4 5 6 7 8
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
P(N)
0 50 100 150 200
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 50 100 150 200
0
0.2
0.4
0.6
0.8
1
Figure 2.6: Results on Tiger Woods data. The top left panel shows the raw
data as a Bernoulli sequence. The bottom left panel shows the posterior distri-
bution on the number of segments. The bottom right panel shows the posterior
distribution of being change points at each position. The up right panel shows
the resulted segmentation and the posterior estimators of rate λ on each seg-
ment. Results are generated by ’showTigerWoods’. This data comes from Tiger
Woods’s official website http://www.tigerwoods.com/.
31
Chapter 2. One Dimensional Time Series
−10
0
10
20
Y
0
0.5
1
λ = 0.5
0
0.5
1
λ = 0.1
0
0.5
1
λ = 0.01
0
0.5
1
λ = 0.001
0 100 200 300 400 500 600 700 800 900 1000
0
0.5
1
λ = 0.0002
0 20 40
0
0.5
1
P(N)
0 20 40
0
0.5
1
P(N)
0 20 40
0
0.5
1
P(N)
0 20 40
0
0.5
1
P(N)
0 20 40
0
0.5
1
P(N)
Figure 2.7: Results on synthetic data Blocks (1000 data points) under different
values of hyperparameter λ. The top panel is the Blocks data set with true
change points. The rest are the posterior probability of being change points at
each position and the number of segments calculated by the online approximate
algorithms under different values of λ. From top to bottom, the values of λ are:
0.5, 0.1, 0.01, 0.001 and 0.0002. Large value of λ will encourage more segments.
Results are generated by ’showBlocksLambda’.
32
Chapter 2. One Dimensional Time Series
−10
0
10
20
Y
0
0.5
1
γ = 100
0
0.5
1
γ = 20
0
0.5
1
γ = 2
0
0.5
1
γ = 0.2
0 100 200 300 400 500 600 700 800 900 1000
0
0.5
1
γ = 0.04
0 5 10
0
0.5
1
P(N)
0 5 10
0
0.5
1
P(N)
0 5 10
0
0.5
1
P(N)
0 5 10
0
0.5
1
P(N)
0 5 10
0
0.5
1
P(N)
Figure 2.8: Results on synthetic data Blocks (1000 data points) under different
values of hyperparameter γ. The top panel is the Blocks data set with true
change points. The rest are the posterior probability of being change points at
each position and the number of segments calculated by the online approximate
algorithms under different values of γ. From top to bottom, the values of γ
are: 100, 20, 2, 0.2 and 0.04. Large value of γ will allow higher variance in one
segment, hence encourage less segments. Results are generated by ’showBlocks-
Gamma’.
33
Chapter 2. One Dimensional Time Series
−20
0
20
Y
0
0.5
1
γ = 100
0
0.5
1
γ = 20
0
0.5
1
γ = 2
0
0.5
1
γ = 0.2
0 100 200 300 400 500 600 700 800 900 1000
0
0.5
1
γ = 0.04
0 2 4 6
0
0.5
1
P(N)
0 2 4 6
0
0.5
1
P(N)
0 2 4 6
0
0.5
1
P(N)
0 2 4 6
0
0.5
1
P(N)
0 2 4 6
0
0.5
1
P(N)
Figure 2.9: Results on synthetic data AR1 (1000 data points) under differ-
ent values of hyperparameter γ. The top panel is the AR1 data set with true
change points. The rest are the posterior probability of being change points
at each position and the number of segments calculated by the online approx-
imate algorithms under different values of γ. From top to bottom, the values
of γ are: 100, 20, 2, 0.2 and 0.04. Large value of γ will allow higher variance
within one segment, hence encourage less segments. Results are generated by
’showAR1Gamma’.
34
Chapter 3
Multiple Dimensional Time
Series
We extend Fearnhead’s algorithms to the case of multiple dimensional series.
Now Y
i
, the observation at time i, is a d-dimensional vector. Then we can
detect change points based on changing correlation structure. We model the
correlation structures using Gaussian graphical models. Hence we can estimate
the changing topology of the dependencies among the series, in addition to
detecting change points.
3.1 Independent Sequences
The first obvious extension is to assume that each dimension is independent.
Under this assumption, the marginal likelihood function ( 2.16) can be written
as the following product,
P(Y
s+1:t
) =
d
¸
j=1
P(Y
j
s+1:t
) (3.1)
where P(Y
j
s+1:t
) is the marginal likelihood in the j-th dimension. Since now
Y
j
s+1:t
is one dimension, P(Y
j
s+1:t
) can be any likelihood function discussed in
the previous chapter.
The independent model is simple to use. Even when independent assumption is
not valid, it can be used as an approximate model similar to Naive Bayes when
we cannot model the correlation structures among each dimension.
35
Chapter 3. Multiple Dimensional Time Series
3.2 Dependent Sequences
Correlation structures only exist when we have multiple series. This is the main
difference when we go from one dimension to multiple dimensions. Hence we
should model correlation structures whenever we are able to do so.
3.2.1 A Motivating Example
Let’s use the following example to illustrate the importance of modeling corre-
lation structures. As shown in Figure 3.1, we have two series. The data on
0 100 200 300
−3
−2
−1
0
1
2
3
Y1
0 100 200 300
−3
−2
−1
0
1
2
3
Y2
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
Figure 3.1: A simple example to show the importance of modeling chang-
ing correlation structures. We are unable to identify any change if we look
at each series individually. However, their correlation structures are changed
over segments(positive, independent, negative). Results are generated by
’plot2DExample’.
three segments are generated from the following Gaussian distributions. On the
36
Chapter 3. Multiple Dimensional Time Series
k-th segment,
Y
k
∼ N(0, Σ
k
)
where Σ
1
=

1 0.75
0.75 1
¸
¸
, Σ
2
=

1 0
0 1
¸
¸
, Σ
3
=

1 −0.75
−0.75 1
¸
¸
As a result, the marginal distribution on each dimension is, N(0, 1), the same
over all three segments. Hence if we look at each dimension individually, we are
unable to identify any changes. However if we consider them jointly, then we
find their correlation structures are changed. For example, in the first segment,
they are positive correlated; in the second segment, they are independent; in
the last segment, they are negative correlated.
3.2.2 Multivariate Linear Regression Models
For multivariate linear regression models, we know how to model correlation
structures, and we can calculate the likelihood function ( 2.16) analytically.
On segment Y
s+1:t
we still have,
Y
s+1:t
= Hβ +ǫ (3.2)
For simplicity, we write Y
s+1:t
as Y , and let n = t − s. Now Y is a n by d
matrix of data, H is a n by q matrix of basis functions, β is a q by d matrix of
regression parameters and ǫ is a n by d matrix of noise.
Here we need introduce the Matrix-Gaussian distribution. A random m by n
matrix A is Matrix-Gaussian distributed with parameters M
A
, V and W if the
density function of A is
A ∼ N(M
A
, V, W)
P(A) =
1
(2π)
mn/2
|V |
n/2
|W|
m/2
exp


1
2
trace((A−M
A
)
T
V
−1
(A−M
A
)W
−1
)

where M is a m by n matrix representing the mean, V is a m by m matrix
representing covariance among rows and W is a n by n matrix representing
covariance among columns.
37
Chapter 3. Multiple Dimensional Time Series
Similar as before we assume conjugate priors, ǫ has a Matrix-Gaussian distri-
bution with mean 0 and covariance I
n
and Σ, since we assume ǫ is independent
across time(rows) but dependent across features(columns). And β has a Matrix-
Gaussian distribution with mean 0 and covariance D and Σ. Finally covariance
Σ has an Inverse-Wishart distribution with parameter N
0
and Σ
0
. This hierar-
chical model can be illustrated by Figure 3.2.
Here D = diag(δ
2
1
, · · · , δ
2
q
) and I
n
is a n by n identity matrix. Hence we have,
Figure 3.2: Graphical representation of Hierarchical structures of Multivariate
Linear Regression Models
P(Y |β, Σ) =
1
(2π)
nd/2
|I
n
|
d/2
|Σ|
n/2
exp


1
2
trace((Y −Hβ)
T
I
−1
n
(Y −Hβ)Σ
−1
)

(3.3)
P(β|D, Σ) =
1
(2π)
qd/2
|D|
d/2
|Σ|
q/2
exp


1
2
trace(β
T
D
−1
βΣ
−1
)

(3.4)
P(Σ|N
0
, Σ
0
) =

0
|
N
0
/2
Z(N
0
, d)2
N
0
d/2
1
|Σ|
(N
0
+d+1)/2
exp


1
2
trace(Σ
0
Σ
−1
)

(3.5)
where N
0
≥ d and
Z(n, d) = π
d(d−1)/4
d
¸
i=1
Γ((n + 1 −i)/2)
38
Chapter 3. Multiple Dimensional Time Series
By ( 2.17), we have,
P(Y
s+1:t
|q) = P(Y |D, N
0
, Σ
0
)
=

P(Y, β, Σ|D, N
0
, Σ
0
) dβ dΣ
=

P(Y |β, Σ)P(β|D, Σ)P(Σ|N
0
, Σ
0
) dβ dΣ
Now multiply ( 3.3) and ( 3.4),
P(Y |β, Σ)P(β|D, Σ) ∝ exp


1
2
trace(((Y −Hβ)
T
(Y −Hβ) +β
T
D
−1
β)Σ
−1
)

∝ exp


1
2
trace((Y
T
Y −2Y
T
Hβ +β
T
H
T
Hβ +β
T
D
−1
β)Σ
−1
)

Now let
M = (H
T
H +D
−1
)
−1
P = (I −HMH
T
)
(∗) = Y
T
Y −2Y
T
Hβ +β
T
H
T
Hβ +β
T
D
−1
β
Then
(∗) = β
T
(H
T
H +D
−1
)β −2Y
T
Hβ +Y
T
Y
= β
T
M
−1
β −2Y
T
HMM
−1
β +Y
T
Y
= β
T
M
−1
β −2Y
T
HMM
−1
β +Y
T
HMM
−1
M
T
H
T
Y −Y
T
HMM
−1
M
T
H
T
Y +Y
T
Y
Using fact M = M
T
(∗) = (β −MH
T
Y )
T
M
−1
(β −MH
T
Y ) +Y
T
Y −Y
T
HMH
T
Y
= (β −MH
T
Y )
T
M
−1
(β −MH
T
Y ) +Y
T
PY
Hence
P(Y |β, Σ)P(β|D, Σ) ∝ exp


1
2
trace(((β −MH
T
Y )
T
M
−1
(β −MH
T
Y ))Σ
−1
+Y
T
PY Σ
−1
)

So the posterior for β is still Matrix-Gaussian with mean MH
T
Y and covariance
M and W,
P(β|D, Σ) ∼ N(MH
T
Y, M, Σ) (3.6)
39
Chapter 3. Multiple Dimensional Time Series
Then integrating out β, we have
P(Y |D, Σ) =

P(Y |β, Σ)P(β|D, Σ) dβ (3.7)
=
1
(2π)
nd/2
|I
n
|
d/2
|Σ|
n/2
(2π)
qd/2
|M|
d/2
|Σ|
q/2
(2π)
qd/2
|D|
d/2
|Σ|
q/2
exp(−
1
2
trace(Y
T
PY Σ
−1
))
=
1
(2π)
nd/2
|Σ|
n/2

|M|
|D|
d
2
exp(−
1
2
trace(Y
T
PY Σ
−1
))
Now multiply ( 3.5) and ( 3.7)
P(Y |D, Σ)P(Σ|N
0
, Σ
0
) ∝
1
|Σ|
n/2
|Σ|
(N
0
+d+1)/2
exp(−
1
2
trace((Y
T
PY + Σ
0

−1
))

1
|Σ|
(n+N
0
+d+1)/2
exp(−
1
2
trace((Y
T
PY + Σ
0

−1
))
So the posterior for Σ is still Inverse-Wishart with parameter n + N
0
and
Y
T
PY + Σ
0
,
P(Σ) ∼ IW(n +N
0
, Y
T
PY + Σ
0
) (3.8)
Then integrating out Σ, we have
P(Y
s+1:t
|q) = P(Y |D, N
0
, Σ
0
)
=

P(Y |D, Σ)P(Σ|N
0
, Σ
0
) dΣ
=

|M|
|D|
d
2
1
(2π)
nd/2

0
|
N
0
/2
Z(N
0
, d)2
N
0
d/2
Z(n +N
0
, d)2
(n+N
0
)d/2
|Y
T
PY + Σ
0
|
(n+N
0
)/2
= π
−nd/2

|M|
|D|
d
2

0
|
N
0
/2
|Y
T
PY + Σ
0
|
(n+N
0
)/2
Z(n +N
0
, d)
Z(N
0
, d)
(3.9)
When d = 1 and by setting N
0
= ν and Σ
0
= γ, ( 2.25) is just a special case
of ( 3.9), since InverseWishart(ν, γ) ≡ InverseGamma(
ν
2
,
γ
2
). In implementa-
tion, we rewrite ( 3.9) in log space as following,
log(P(Y
s+1:t
|q)) =
N
0
2
log(|Σ
0
|) −
n +N
0
2
log(|Y
T
PY + Σ
0
|) −
d
2
(log|M| −log|D|)

nd
2
log(π) +
d
¸
i=1
log(Γ((n +N
0
+ 1 −i)/2)) −
d
¸
i=1
log(Γ((N
0
+ 1 −i)/2))
(3.10)
40
Chapter 3. Multiple Dimensional Time Series
To speed up, −
nd
2
log(π),
N
0
2
log(|Σ
0
|), −
d
2
log|D|,
¸
d
i=1
log(Γ((n+N
0
+1−i)/2))
and
¸
d
i=1
log(Γ((N
0
+ 1 − i)/2)) can be pre-computed. At each iteration, M
and Y
T
PY can be computed by the following rank one update,
H
T
1:i+1,:
H
1:i+1,:
= H
T
1:i,:
H
1:i,:
+H
T
i+1,:
H
i+1,:
Y
T
1:i+1,:
Y
1:i+1,:
= Y
T
1:i,:
Y
1:i,:
+Y
T
i+1,:
Y
i+1,:
H
T
1:i+1,:
Y
1:i+1,:
= H
T
1:i,:
Y
1:i,:
+H
T
i+1,:
Y
i+1,:
3.3 Gaussian Graphical Models
Gaussian graphical models are more general tools to model correlation struc-
tures, especially conditional independence structures. There are two kinds of
graphs: directed and undirected. Here we use undirected graphs as examples,
and we will talk about directed graphs later.
3.3.1 Structure Representation
In an undirected graph, each node represents a random variable. Define the
precision matrix Λ = Σ
−1
, which is the inverse of covariance matrix Σ. There
is no edge between node i and node j if and only if Λ
ij
= 0. As shown in Figure
3.3, there are four nodes. There is an edge between node 1 and node 2, since
Λ
12
= 0. (”X” denotes non-zero entries in the matrix.)
Then the independent model and the multivariate linear regression model are
just two special cases of Gaussian graphical models. In independent model, Λ
is a diagonal matrix, which represents the graph with all isolated nodes.
In multivariate linear regression model, Λ is a full matrix, which represents the
graph with fully connected nodes.
Unlike independent models which are too simple, full covariance models could
be too complex when the dimension d is large since the number of parameters
needed by the models are O(d
2
). Gaussian graphical models provide a more flex-
ible and better solution between these two extremes, especially when d becomes
large, since in higher dimensions, sparse structures are more important.
41
Chapter 3. Multiple Dimensional Time Series
Figure 3.3: A simple example to show how to use a graph to represent correla-
tion structures. We first compute precision matrix Λ. Then from Λ, if Λ
ij
= 0,
then there is no edge on from node i to node j. ”X” represent non-zero entries
in the matrix.
3.3.2 Sliding Windows and Structure Learning Methods
In order to use Fearnhead’s algorithms, we need to do the following two things:
• generate a list of possible graph structures,
• compute the marginal likelihood given a graph structure.
For the first one, the number of all possible undirected graphs is O(2
d
2
). Hence
it is infeasible to try all possible graphs. We can only choose a subset of all
graphs. How to choose them is a chicken and egg problem: if we knew the
segmentation, then we could run a fast structure learning algorithm on each
segment, but we need to know the structures in order to compute the segmen-
tation. Hence we propose the following sliding window method to generate the
set of graphs. We slide a window of width w = 0.2N across the data, shifting
42
Chapter 3. Multiple Dimensional Time Series
Figure 3.4: Gaussian graphical model of independent models.
Figure 3.5: Gaussian graphical model of full models.
by s = 0.1w at each step. This will generate a set of about 50 candidate seg-
mentations. We can repeat this for different setting of w and s. Then we run a
fast structure learning algorithm on each windowed segment, and sort resulting
set of graphs by frequency of occurrence. Finally we pick the top M = 20 to
form the set of candidate graphs. We hope this set will contain the true graph
structures or at least the ones that are very similar.
As shown in Figure 3.6, suppose the vertical pink dot lines are true segmen-
tations. When we run a sliding window inside of a true segment, (eg, the red
window), we hope the structure we learn from this window is similar to the
true structure of this segment. And when we shift the window one step, (eg,
shifting to the blue window), if it is still inside of the same segment, we hope
the structure we learn is the same or at least very similar to the one we learn
in the previous window. Of course we will have the window that overlaps two
segments (eg, the black window), then we know the structure we learn from this
window will represent neither segments. However, this brings no harm since
these ”wrong” graph structures will later receive negligible posterior probabil-
ity. We can choose the number of the graphs we want to consider based on how
43
Chapter 3. Multiple Dimensional Time Series
Figure 3.6: An example to show sliding window method.
fast we want the algorithms to run.
Now on each windowed segment, we need to learn the graph structure.
The simplest method is the thresholding method. It works as following,
1. compute the empirical covariance Σ,
2. compute the precision Λ = Σ
−1
,
3. compute the partial correlation coefficients by ρ
ij
=
Λ
ij

Λ
ii
Λ
jj
,
4. set edge G
ij
= 0 if |ρ
ij
| < θ for some threshold θ (eg, θ = 0.2).
The thresholding method is simple and fast, but it may not give a good esti-
mator. We can also use the shrinkage method discussed in [23] to get a better
estimator of Σ, which helps regularize the problem when the segment is too
short and d is large.
If we further pose the sparsity on the graph structure, we can use convex op-
timization techniques discussed in [2] to compute the MAP estimator for the
precision Λ under a prior that encourage many entries to go to 0. We first form
44
Chapter 3. Multiple Dimensional Time Series
the following problem,
max
Σ≻0
log(det(Λ)) −trace(ΣΛ) −ρ||Λ||
1
(3.11)
where Σ is the empirical covariance, ||Λ||
1
= Σ
ij

ij
|, and ρ > 0 is the regular-
ization parameter which controls the sparsity of the graph. Then we can solve
( 3.11) by block coordinate descent algorithms.
We summarize the overall algorithm as following,
1. Input: data Y
1:N
, hyperparameters λ for change point rate and θ for
likelihood functions, observation model obslik and parameter ρ for graph
structure learning methods.
2. S = make overlapping segments from Y
1:N
by sliding windows
3. G = estimate graph structure estG(Y
s
, ρ) : s ∈ S
4. while not converged do
5. (K, S
1:K
) = segment(Y
1:N
, obslik, λ, θ)
6. G = estG(Y
s
, ρ) : s ∈ S
1:K
7. end while
8. Inference model m
i
= max
m
P(m|Y
s
) for i = 1 : K
9. Output: the number of segments K, the set of segments S
1:K
, the model
inferences on each segment m
1:K
3.3.3 Likelihood Functions
After we get the set of candidate graphs, we need to be able to compute marginal
likelihood for each graph. However, we can only do so for decomposable graphs
in undirected graphs. For non-decomposable graphs, we will use approximation
by adding minimum number of edges to make it decomposable.
Given a decomposable graph, we will assume the following conjugate priors.
Comparing with the multivariate linear regression models, everything is the
45
Chapter 3. Multiple Dimensional Time Series
same except the prior on Σ is now a Hyper-Inverse-Wishart instead of Inverse-
Wishart.
Hence we still have the following model, (let Y
s+1:t
as Y and n = t −s)
Y = Hβ +ǫ
ǫ ∼ N(0, I
n
, Σ)
The priors are,
β ∼ N(0, D, Σ)
Σ ∼ HIW(b
0
, Σ
0
) (3.12)
where D, I
n
are defined before, and b
0
= N
0
+ 1 −d > 0.
When a graph G is decomposable, by the standard graph theory [8], G can
be decomposited into a list of components and a list of separators, and each
component and separator itself is a clique. Then the joint density of Hyper-
Inverse-Wishart can be decomposed as following,
P(Σ|b
0
, Σ
0
) =
¸
C
P(Σ
C
|b
0
, Σ
C
0
)
¸
S
P(Σ
S
|b
0
, Σ
S
0
)
(3.13)
where C is each component and S is each separator.
For each C, we have Σ
C
∼ IW(b
0
+ |C| − 1, Σ
C
0
) and Σ
C
0
is the block of Σ
0
corresponding to Σ
C
. The same applies to each S. By ( 2.17), we have,
P(Y
s+1:t
|q) = P(Y |D, b
0
, Σ
0
)
=

P(Y, β, Σ|D, b
0
, Σ
0
) dβ dΣ
=

P(Y |β, Σ)P(β|D, Σ)P(Σ|b
0
, Σ
0
) dβ dΣ
Since P(Y |β, Σ)P(β|D, Σ) are the same as before, by integrating out β, we have,
P(Y |D, Σ) =

P(Y |β, Σ)P(β|D, Σ) dβ (3.14)
=
1
(2π)
nd/2
|Σ|
n/2

|M|
|D|
d
2
exp(−
1
2
trace(Y
T
PY Σ
−1
))
where M = (H
T
H +D
−1
)
−1
and P = (I −HMH
T
) are defined before.
When P is positive definite, P has the Cholesky decomposition Q
T
Q. (In
46
Chapter 3. Multiple Dimensional Time Series
practice, we can let the prior D = c ∗I, where I is the identity matrix and c is a
scalar. By setting c into a proper value, we can make sure P is always positive
definite. This is very important in high dimension.) Now let X = QY , then
( 3.14) can rewrite as the following,
P(Y |D, Σ) =
1
(2π)
nd/2
|Σ|
n/2

|M|
|D|
d
2
exp(−
1
2
trace(X
T

−1
))
=

|M|
|D|
d
2

1
(2π)
nd/2
|I
n
|
d/2
|Σ|
n/2
exp(−
1
2
trace(X
T
I
−1
n

−1
))

=

|M|
|D|
d
2
P(X|Σ) (3.15)
where P(X|Σ) ∼ N(0, I
n
, Σ), or just P(X|Σ) ∼ N(0, Σ).
Since we use decomposable graphs, this likelihood P(X|Σ) can be decomposited
as following [8],
P(X|Σ) =
¸
C
P(X
C

C
)
¸
S
P(X
S

S
)
(3.16)
where C is each component and S is each separator.
For each C, we have X
C
∼ N(0, Σ
C
) and Σ
C
is the block of Σ corresponding
to X
C
. The same applies to each S.
Now multiply ( 3.13) and ( 3.16), we have,
P(X|Σ)P(Σ|b
0
, Σ
0
) =
¸
C
P(X
C

C
)P(Σ
C
|b
0
, Σ
C
0
)
¸
S
P(X
S

S
)P(Σ
S
|b
0
, Σ
S
0
)
(3.17)
Then for each C, the posterior of Σ
C
∼ IW(b
0
+|C| −1+n, Σ
C
0
+X
T
C
X
C
). The
same holds for each S. As a result, the posterior of Σ ∼ HIW(b
0
+n, Σ
0
+X
T
X).
Hence by integrating out Σ, we have,
P(Y
s+1:t
|q) = P(Y |D, b
0
, Σ
0
)
=

P(Y |D, Σ)P(Σ|b
0
, Σ
0
) dΣ
=

|M|
|D|
d
2
(π)
−nd/2
h(G, b
0
, Σ
0
)
h(G, b
n
, Σ
n
)
(3.18)
where
h(G, b, Σ) =
¸
C

C
|
b+|C|−1
2
Z(b
0
+|C| −1, |C|)
−1
¸
S

S
|
b+|S|−1
2
Z(b
0
+|S| −1, |S|)
−1
47
Chapter 3. Multiple Dimensional Time Series
Z(n, d) = π
d(d−1)/4
d
¸
i=1
Γ((n + 1 −i)/2)
Σ
n
= Σ
0
+Y
T
PY
b
n
= b
0
+n
In log space, ( 3.18) can re-write as following,
log(P(Y
s+1:t
|q)) = −
nd
2
log(π) −
d
2
(log|M| −log|D|) +log(h(G, b
0
, Σ
0
)) −log(h(G, b
n
, Σ
n
))
(3.19)
We will also use the similar rank one update as mentioned earlier. Also notice
that h(G, b, Σ) contains many local terms. When we evaluate h(G, b, Σ) over
different graphs, we will cache all local terms. Then later when we meet the
same term, we don’t need to re-evalue it.
3.4 Experimental Results
Now we will show experimental results on some synthetic and real data sets.
3.4.1 Synthetic Data
First, we revisit the 2D synthetic data mentioned in the beginning of this chap-
ter. We run it with all three different models (independent, full and Gaussian
graphical models). We set the hyper parameter ν = 2 and γ = 2 on σ
2
for each
dimension in the independent model; and set the hyper parameter N
0
= d and
Σ
0
= I on Σ where d is the number of the dimension (in this case d = 2) and
I is the identity matrix in the full model; and set the hyper parameter b
0
= 1
and Σ
0
= I on Σ in the Gaussian graphical model. We know there are three
segments. From Figure 3.7, the raw data are shown in the first two rows. The
independent model thinks there is only one segment, since the posterior prob-
ability of the number of segments is mainly at 1. Hence it detects no change
points. The other two models both think there are three segments. Both models
detect positions of change points that are close to the ground truth with some
48
Chapter 3. Multiple Dimensional Time Series
uncertainty.
0 100 200 300
−2
0
2
Y1
0 100 200 300
−2
0
2
Y2
0 100 200 300
0
0.5
1
Truth
0 100 200 300
0
0.5
1
Indep
0 2 4 6
0
0.5
1
Indep : P(N)
0 100 200 300
0
0.5
1
Full
0 2 4 6
0
0.5
1
Full : P(N)
0 100 200 300
0
0.5
1
GGM
0 2 4 6
0
0.5
1
GGM : P(N)
Figure 3.7: Results on synthetic data 2D. The top two rows are raw data. The
3rd row is the ground truth of change points. From the 4th row to the 6th row,
results are the posterior distributions of being change points at each position
and the number of segments generated from the independent model, the full
model and the Gaussian graphical model respectfully. Results are generated by
’show2D’.
Now let’s look at a 10D synthetic data. We generate data in a similar way, but
this time we have 10 series. And we set the hyper parameters in a similar way.
To save space, we only show the first two dimensions since the rest are similar.
From Figure 3.8, we see as before the independent model thinks there is only
one segment hence detects no change point. The full model thinks there might
be one or two segments. Also it detects the position of change point is close to
49
Chapter 3. Multiple Dimensional Time Series
two ends which is wrong. In this case, only the Gaussian graphical model think
there are three segments, and detects positions that are very close to the ground
truth. Then based on the segments estimated by Gaussian graphical model, we
plot the posterior over all graph structures, P(G|Y
s:t
), the true graph struc-
ture, the MAP structure G
MAP
= argmax
G
P(G|Y
s:t
), and the marginal edge
probability, P(G
i,j
= 1|Y
s:t
), computed using Bayesian model averaging (Gray
squares represent edges about which we are uncertain) in the bottom two rows
of Figure 3.8. Note, we plot the graph structure by its adjacency matrix, such
that if node i and j are connected, then the corresponding entry is a black square;
otherwise it is a white square.
We plot all candidate graph strutures selected by sliding windows in Figure 3.9.
In this case, we have 30 graphs in total. For the 1st segment, we notice that
the true graph structure is not included in the candidate list. As a result, GGM
picks the graphs that are most close to the true structure. In this case, there are
two graphs that both are close. Hence we see some uncertainty over the edges.
However on the 3rd segment, the true structure is included in the candidate list.
In this case, GGM can identify it correctly, and we see very little uncertainty in
the posterior probability. As mentioned earlier, the candidate list contains 30
graphs. Some are very different from the true structures which might be gener-
ated by the window overlapping two segments. However, we find these graphs
only slow the algorithms but won’t hurt their results, because these ”useless”
graphs all get very low posterior probabilities.
Finally let’s look at a 20D synthetic data. We generate data in a similar way and
set the hyper parameters in a similar way. To save space, we only show the first
two dimensions since the rest are similar. From Figure 3.10, we see as before
the independent model thinks there is only one segment hence detects no change
point. The full model clearly oversegments the data. Again, only the Gaussian
graphical model correctly segments, and detects positions that are very close to
the ground truth. Comparing with the results from 2d and 10d cases, we find
that the independent model fails to detect in all cases since the changes are on
the correlation structures, and the independent model cannot model it. The
50
Chapter 3. Multiple Dimensional Time Series
full model can detect in low dimensional case, but fails in high dimension cases,
because the full model has more parameters to learn. The Gaussian graphical
model can detect on all cases, and it is more confident in high dimension cases
since the sparsity structures are more important in high dimension.
3.4.2 U.S. Portfolios Data
Now let’s look at two real data sets which record annually rebalanced value-
weighted monthly returns from July 1926 to December 2001 total 906 month
of U.S. portfolios data. The first data set has five industries (manufacturing,
utilities, shops, finance and other) and the second has thirty industries.
The first data set has been previously studied by Talih and Hengartner in [24].
We set the hyper parameter ν = 2 and γ = 2 on σ
2
for each dimension in the
independent model; and set the hyper parameter N
0
= 5 and Σ
0
= I on Σ in
the full model; and set the hyper parameter b
0
= 1 and Σ
0
= I on Σ in the
Gaussian graphical model. The raw data are shown in the first five rows of
Figure 3.11. Talih’s result is shown on the 6th row. From the 7th row to the
9th row, results are from independent model, full model and GGM respectfully.
We find that full model and GGM have similar results since they think there
are roughly 7 segments, and they agree on 4 out of 6 change points. Indepen-
dent model seems to be over-segmented. In this problem, we do not know the
ground truth and the true graph structures. Note, only 2 change points that we
discovered coincide with the results of Talih (namely 1959 and 1984). There are
many possible reasons for this. First, they assume a different prior over models
(the graph changes by one arc at a time between neighboring segments); second,
they use reversible jump MCMC; third, their model requires to pre-specify the
number of change points. In this case, the positions of change points are very
sensitive to the number of change points. We think their results could be over-
segmented. In this data, sliding windows generates 17 graphs. As usual, based
on the segmentation estimated by GGM, we plot the posterior over all 17 graphs
51
Chapter 3. Multiple Dimensional Time Series
structures, P(G|Y
s:t
), the MAP graph structure G
MAP
= argmax
G
P(G|Y
s:t
),
and the marginal edge probability, P(G
i,j
= 1|Y
s:t
), computed using Bayesian
model averaging (Gray squares represent edges about which we are uncertain)
in the bottom three rows of Figure 3.11. From the MAP graph structures
detected from our results, we find Talih’s assumption on the graph structure
changing by one arc at a time between neighboring segments may not be true.
For example, between the third segment and the fourth segment in our results,
the graph structure is changed by three arcs, rather than one.
The second data set is similar to the first one, but has 30 series. Since from
the results of synthetic data, we know that in higher dimension, the Gaussian
graphical model performs much better than the independent model and the full
model, we only show the result from the Gaussian graphical model in Figure
3.12.
We show the first two industry raw data in the top two rows. In the third row,
we show the resulted posterior distribution of being change points at each po-
sition and the number of segments generated by the Gaussian graphical model.
We still don’t know the ground truth and the true graph structures. We also
show the MAP graph structures detected on three consecutive regions. We can
see that the graph structures are very sparse and they differ more than one arcs.
3.4.3 Honey Bee Dance Data
Finally, we analyse the honey bee dance data set used in [19, 20]. This consists
of the x and y coordinates of a honey bee, and its head angle θ, as it moves
around an enclosure, as observed by an overhead camera. Two examples of the
data, together with a ground truth segmentation (created by human experts)
are shown in Figure 3.13 and 3.14. We also show the results of segmenting
this using a first-order auto-regressive AR(1) model, using independent model
or with full covariate model. We preprocessed the data by replacing θ with sinθ
and cosθ to overcome the discontinuity as the bees moves between −π to π. We
set the hyper parameter ν = 2 and γ = 0.02 on σ
2
for each dimension in the
52
Chapter 3. Multiple Dimensional Time Series
independent model; and set the hyper parameter N
0
= 4 and Σ
0
= 0.01 ∗ I on
Σ in the full model.
From both Figures, the results from full covariate model are very close to the
ground truth, but the results from independent model are clearly over seg-
mented.
53
Chapter 3. Multiple Dimensional Time Series
0 100 200 300
−5
0
5
Y1
0 100 200 300
−5
0
5
Y2
0 100 200 300
0
0.5
1
Truth
0 100 200 300
0
0.5
1
Indep
0 5
0
0.5
1
Indep : P(N)
0 100 200 300
0
0.5
1
Full
0 2 4 6
0
0.5
1
Full : P(N)
0 100 200 300
0
0.5
1
GGM
0 2 4 6
0
0.5
1
GGM : P(N)
Truth
0 10 20 30
0
0.5
1
P(G)
MAP P(G(i,j)=1)
Truth
0 10 20 30
0
0.5
1
P(G)
MAP P(G(i,j)=1)
Truth
0 10 20 30
0
0.5
1
P(G)
MAP P(G(i,j)=1)
Figure 3.8: Results on synthetic data 10D. The top two rows are the first two
dimensions of raw data. The 3rd row is the ground truth of change points. From
the 4th row to the 6th row, results are the posterior distributions of being change
points at each position and the number of segments generated from the inde-
pendent model, the full model and the Gaussian graphical model respectfully.
In the bottom 2 rows, we plot the posterior over all graph stuctures, P(G|Y
s:t
),
the true graph structure, the MAP structure G
MAP
= argmax
G
P(G|Y
s:t
), and
the marginal edge probability, P(G
i,j
= 1|Y
s:t
) on 3 segments detected by the
Gaussian graphical model. Results are generated by ’show10D’.
54
Chapter 3. Multiple Dimensional Time Series
Figure 3.9: Candidate list of graphs generated by sliding windows in 10D data.
We plot the graph structure by its adjacency matrix, such that if node i and j
are connected, then the corresponding entry is a black square; otherwise it is a
white square. Results are generated by ’show10D’.
55
Chapter 3. Multiple Dimensional Time Series
0 100 200 300
−5
0
5
Y1
0 100 200 300
−5
0
5
Y2
0 100 200 300
0
0.5
1
Truth
0 100 200 300
0
0.5
1
Indep
0 2 4 6
0
0.5
1
Indep : P(N)
0 100 200 300
0
0.5
1
Full
0 2 4 6
0
0.5
1
Full : P(N)
0 100 200 300
0
0.5
1
GGM
0 2 4 6
0
0.5
1
GGM : P(N)
Truth
0 20 40
0
0.5
1
P(G)
MAP P(G(i,j)=1)
Truth
0 20 40
0
0.5
1
P(G)
MAP P(G(i,j)=1)
Truth
0 20 40
0
0.5
1
P(G)
MAP P(G(i,j)=1)
Figure 3.10: Results on synthetic data 20D. The top two rows are the first two
dimensions of raw data. The 3rd row is the ground truth of change points. From
the 4th row to the 6th row, results are the posterior distributions of being change
points at each position and the number of segments generated from the inde-
pendent model, the full model and the Gaussian graphical model respectfully.
In the bottom 2 rows, we plot the posterior over all graph stuctures, P(G|Y
s:t
),
the true graph structure, the MAP structure G
MAP
= argmax
G
P(G|Y
s:t
), and
the marginal edge probability, P(G
i,j
= 1|Y
s:t
) on 3 segments detected by the
Gaussian graphical model. Results are generated by ’show20D’.
56
Chapter 3. Multiple Dimensional Time Series
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
−40
−20
0
20
40
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
−40
−20
0
20
40
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
−40
−20
0
20
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
−50
0
50
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
−20
0
20
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
0
0.5
1
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
0
0.5
1
0 10 20
0
0.5
1
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
0
0.5
1
0 10 20
0
0.5
1
192606 193410 194302 195106 195910 196802 197606 198410 199302 200106
0
0.5
1
0 10 20
0
0.5
1
0 10
0
0.5
1
0 10
0
0.5
1
0 10
0
0.5
1
0 10
0
0.5
1
0 10
0
0.5
1
0 10
0
0.5
1
0 10
0
0.5
1
Figure 3.11: Results on U.S. portfolios data of 5 industries. The top five rows
are the raw data representing annually rebalanced value-weighted monthly re-
turn in 5 industries from July 1926 to December 2001. Total there are 906
month. The 6th row is the result by Talih. From the 7th row to the 9th row,
results are the posterior distributions of being change points at each position
and the number of segments generated from the independent model, the full
model and the Gaussian graphical model respectfully. In the bottom three
rows, we plot the posterior over all graph structures, P(G|Y
s:t
), the MAP
graph structure G
MAP
= argmax
G
P(G|Y
s:t
), and the marginal edge proba-
bility, P(G
i,j
= 1|Y
s:t
). Results are generated by ’showPortofios’.
57
Chapter 3. Multiple Dimensional Time Series
192606 194302 195910 197606 199302 192606 194302 195910 197606 199302
−30
−20
−10
0
10
20
30
192606 194302 195910 197606 199302 192606 194302 195910 197606 199302
−40
−20
0
20
40
60
192606 194302 195910 197606 199302 192606 194302 195910 197606 199302
0
0.2
0.4
0.6
0.8
1
GGM
0 5 10 15
0
0.2
0.4
0.6
0.8
1
GMM : P(N)
Figure 3.12: Results on U.S. portfolios data of 30 industries. We show the first
two industry raw data in the top two rows. The third row is the result of the
posterior distribution of being change points at each position and the number
of segments generated from the Gaussian graphical model. In the fourth row,
we show the MAP graph structures in 3 consecutive regions detected by the
Gaussian graphical model. Results are generated by ’showPort30’.
58
Chapter 3. Multiple Dimensional Time Series
0 100 200 300 500 600 700
−2
0
2
X
0 100 200 300 500 600 700
−2
0
2
Y
0 100 200 300 500 600 700
−1
0
1
Sin(T)
0 100 200 300 500 600 700
−1
0
1
Cos(T)
0 100 200 300 500 600 700
0
0.5
1
Truth
0 100 200 300 500 600 700
0
0.5
1
Indep
0 100 200 300 500 600 700
0
0.5
1
Full
0 20 40 60
0
0.5
1
Indep : P(N)
0 20 40 60
0
0.5
1
Full : P(N)
Figure 3.13: Results on honey bee dance data 4. The top four rows are the raw
data representing x, y coordinates of honey bee and sin, cos of its head angle
θ. The 5th row is the ground truth. The 6th row is the result from independent
model and the 7th row is the result from full covariate model. Results are
generated by ’showBees(4)’.
59
Chapter 3. Multiple Dimensional Time Series
0 100 200 400 500 600
−2
0
2
X
0 100 200 400 500 600
−2
0
2
Y
0 100 200 400 500 600
−1
0
1
Sin(T)
0 100 200 400 500 600
−1
0
1
Cos(T)
0 100 200 400 500 600
0
0.5
1
Truth
0 100 200 400 500 600
0
0.5
1
Indep
0 100 200 400 500 600
0
0.5
1
Full
0 20 40
0
0.5
1
Indep : P(N)
0 20 40
0
0.5
1
Full : P(N)
Figure 3.14: Results on honey bee dance data 6. The top four rows are the raw
data representing x, y coordinates of honey bee and sin, cos of its head angle
θ. The 5th row is the ground truth. The 6th row is the result from independent
model and the 7th row is the result from full covariate model. Results are
generated by ’showBees(6)’.
60
Chapter 4
Conclusion and Future
Work
In this thesis, we extended Fearnhead’s algorithms to the case of multiple di-
mensional series. This allowed us to detect changes on correlation structures, as
well as changes on mean, variance, etc. We modeled the correlation structures
using undirected Gaussian graphical models. This allowed us to estimate the
changing topology of dependencies among series, in addition to detecting change
points. This is particularly useful in high dimensional cases because of sparsity.
We can also model the correlation structures by directed graphs. In directed
graphs, we can compute the marginal likelihood for any graph structures. How-
ever, we don’t have a fast way to generate candidate list of graph structures
since currently all structure learning algorithms for directed graphs are too
slow. Hence if there is a fast way to learn structure of directed graphs, we can
use directed graphs.
Finally, the conditional independence property sometime might not be reason-
able. Or we might want to model other constrains on changing over consecutive
segments. This requires us to come up with new models.
61
Bibliography
[1] Jim Albert and Patricia Williamson. Using model/data simulation to detect
streakiness. The American statistician, 55:41–50, 2001.
[2] Onureena Banerjee, Laurent El Ghaoui, Alexandre d’Aspremont, and
Georges Natsoulis. Convex optimization techniques for fitting sparse Gaus-
sian graphical models. Proceedings of the 23rd international conference on
Machine learning, pages 89–96, 2006.
[3] Daniel Barry and J. A. Hartigan. Product partition models for change
point problems. The Annals of Statistics, 20(1):260–279, 1992.
[4] Daniel Barry and J. A. Hartigan. A Bayesian analysis for change point
problems. Journal of the American Statistical Association, 88(421):309–
319, 1993.
[5] Albright S. C., Albert J., Stern H. S., and Morris C. N. A statistical
analysis of hitting streaks in baseball. Journal of the American Statistical
Association, 88:1175–1196, 1993.
[6] Bradley P. Carlin, Alan E. Gelfand, and Adrian F. M. Smith. Hierarchical
Bayesian analysis of changepoints problems. Applied Statistics, 41(2):389–
405, 1992.
[7] J. Carpenter, Peter Clifford, and Paul Fearnhead. An improved particle
filter for non-linear problems. IEE Proceedings of Radar and Sonar Navi-
gation, 146:2–7, 1999.
62
Bibliography
[8] Carlos Marinho Carvalho. Structure and sparsity in high-dimensional mul-
tivariate analysis. PhD thesis, Duke University, 2006.
[9] Nicolas Chopin. Dynamic detection of change points in long time series.
The Annals of the Institute of Statistical Mathematics, 2006.
[10] D. G. T. Denison, B. K. Mallick, and A. F. M. Smith. Automatic Bayesian
curve fitting. Journal of Royal Statistical Society Series B, 60:333–350,
1998.
[11] David G. T. Denison, Christopher C. Holmes, Bani K. Mallick, and Adrian
F. M. Smith. Bayesian methods for nonlinear classification and regression.
John Wiley & Sons, Ltd, Baffins, Chichester, West Sussex, England, 2002.
[12] Arnaud Doucet, Nando De Freitas, and Neil Gordon. Sequential Monte
Carlo methods in practice. Springer-Verlag, New York, 2001.
[13] Paul Fearnhead. Exact Bayesian curve fitting and signal segmentation.
IEEE Transactions on Signal Processing, 53:2160–2166, 2005.
[14] Paul Fearnhead. Exact and efficient Bayesian inference for multiple change-
point problems. Statistics and Computing, 16(2):203–213, 2006.
[15] Paul Fearnhead and Peter Clifford. Online inference for hidden Markov
models via particle filters. Journal of the Royal Statistical Society: Series
B, 65:887–899, 2003.
[16] Paul Fearnhead and Zhen Liu. Online inference for multiple change point
problems. 2005.
[17] R. G. Jarrett. A note on the intervals between coal-mining disasters.
Biometrika, 66(1):191–193, 1979.
[18] Peter J.Green. Reversible jump Markov Chain Monte Carlo computation
and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
63
Bibliography
[19] Sang Min Oh, James M. Rehg, Tucker Balch, and Frank Dellaert. Learn-
ing and inference in parametric switching linear dynamic systems. IEEE
International Conference on Computer Vision, 2:1161–1168, 2005.
[20] Sang Min Oh, James M. Rehg, and Frank Dellaert. Parameterized du-
ration modeling for switching linear dynamic systems. IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, 2:1694–
1700, 2006.
[21] Elena Punskaya, Christophe Andrieu, Arnaud Doucet, and William J.
Fitzgerald. Bayesian curve fitting using MCMC with application to sig-
nal segmentation. IEEE Transactions on Signal Processing, 50:747–758,
2002.
[22] Christian P. Robert and George Casella. Monte Carlo Statistical Methods.
Springer, New York, 2004.
[23] Juliane Schafer and Korbinian Strimmer. A shrinkage approach to large-
scale covariance matrix estimation and implications for functional ge-
nomics. Statistical Applications in Genetics and Molecular Biology, 4, 2005.
[24] Makram Talih and Nicolas Hengartner. Structural learning with time-
varying components: Tracking the cross-section of financial time series.
Journal of Royal Statistical Society Series B, 67:321–341, 2005.
[25] T. Y. Yang and L. Kuo. Bayesian binary segmentation procedure for a
Poisson process with multiple changepoints. Journal of Computational and
Graphical Statistics, 10:772–785, 2001.
[26] Tae Young Yang. Bayesian binary segmentation procedure for detecting
streakiness in sports. Journal of Royal Statistical Society Series A, 167:627–
637, 2004.
64

Abstract
Change point problems are referred to detect heterogeneity in temporal or spatial data. They have applications in many areas like DNA sequences, financial time series, signal processing, etc. A large number of techniques have been proposed to tackle the problems. One of the most difficult issues is estimating the number of the change points. As in other examples of model selection, the Bayesian approach is particularly appealing, since it automatically captures a trade off between model complexity (the number of change points) and model fit. It also allows one to express uncertainty about the number and location of change points. In a series of papers [13, 14, 16], Fearnhead developed efficient dynamic programming algorithms for exactly computing the posterior over the number and location of change points in one dimensional series. This improved upon earlier approaches, such as [12], which relied on reversible jump MCMC. We extend Fearnhead’s algorithms to the case of multiple dimensional series. This allows us to detect changes on correlation structures, as well as changes on mean, variance, etc. We also model the correlation structures using Gaussian graphical models. This allow us to estimate the changing topology of dependencies among series, in addition to detecting change points. This is particularly useful in high dimensional cases because of sparsity.

ii

Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii iii v vii 1 1 3 3 3 4 4 5 6 9 10 13 15 16 16 16

Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 1.2.2 1.3 1.4 Hidden Markov Models . . . . . . . . . . . . . . . . . . . Reversible Jump MCMC . . . . . . . . . . . . . . . . . . .

Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2 One Dimensional Time Series . . . . . . . . . . . . . . . . . . . . 2.1 2.2 Prior on Change Point Location and Number . . . . . . . . . . . Likelihood Functions for Data in Each Partition . . . . . . . . . . 2.2.1 2.2.2 2.2.3 2.3 Linear Regression Models . . . . . . . . . . . . . . . . . . Poisson-Gamma Models . . . . . . . . . . . . . . . . . . . Exponential-Gamma Models . . . . . . . . . . . . . . . .

Basis Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 2.3.2 Polynomial Basis . . . . . . . . . . . . . . . . . . . . . . . Autoregressive Basis . . . . . . . . . . . . . . . . . . . . .

iii

. . . . . . . . . . . . . . . . . . 3 Multiple Dimensional Time Series 3. . . . . .6 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 3. . . . . . . .4. . . . . . . . . . .2 Independent Sequences . . . . . . . . . . . . 2. . . . . . . . . . . . . . . .1 3. . . . . . . . . . . . . . . . . . . . . . 2. . . . . .5. . . .5. . . . . . . .3 Gaussian Graphical Models . . . . .2 2. . . . . . . . . . . .3.3 Synthetic Data . . . . . 3. . . . . . . . . . . . . . . . . . . 3. . . . . . . . . . . 17 17 18 19 20 21 22 23 23 24 24 24 25 26 35 35 36 36 37 41 41 42 45 48 48 51 52 61 62 iv Online Algorithms . . . . . . . . . . . . . . . . 3. . . . . . . . . British Coal Mining Disaster Data . . . . .4 Experimental Results . . . . . . . . . . . . . .2 3. 2.7. . . . . . Bibliography . . . . 2. . . . . . . . . Viterbi Algorithms . . . . . . . Tiger Woods Data . .2 3. . . . . .1 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 2. . . . . . .2 2. . . Dependent Sequences . . . . .4. . . .4 Offline Algorithms . . . . . . .3 Synthetic Data . . . U. . . . . . . . . Sliding Windows and Structure Learning Methods . . .1 2.1 3. .S. . . . . . . . .8 Choices of Hyperparameters . . . . . . . . . . .Table of Contents 2. . . . . . . . 2. . . . . . . . . . . . . . . . . . . . .5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7. . . Likelihood Functions .3. Multivariate Linear Regression Models . . . . . . . . .2.2.7 Implementation Issues . . . . . . Approximate Algorithm . . . . . . . . . . . . . . . . . .2 A Motivating Example . . . . Forward Simulation . . . . . . . . . . . . . . . . . .3 Structure Representation . . . . . . . . . . . . . . . . . . . . . . .2 2. . . . . . . . . . . . . 3. . . . . . . . . . . . . . .4. . . . . . . . . . . . . . . . . . . 3.5 Backward Recursion . . . Portfolios Data . . . . . . .4. . Experimental Results . . . . . . . . . 4 Conclusion and Future Work . . . . . . .4. . . . . . . . Honey Bee Dance Data . . . .1 2. . . . . . . .7. . . . .1 3. . . . . . .3. . . . . . . . . . . . . . . . . . . . . . Backward Simulation . . . . .4 Forward Recursion . . . . . .5. .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . An example to show sliding window method . . . . . Results of synthetic data AR1 .1 2. . . . . . . Graphical model of linear regression models . . . . . . . . . . . . . . . . . . . . . . . . . . Results of synthetic data Blocks . . . . .8 Results of synthetic data Blocks under different values of hyperparameter γ . . . . . . An example of graphical representation of Gaussian graphical models . . . . . . . . .6 3. . .1 An example to show the importance of modeling correlation structures . . . . . . . . . .3 Graphical model of multivariate linear regression models . . .5 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results of synthetic data 2D . . . . . . . . . . 2 10 14 28 29 30 31 Results of synthetic data Blocks under different values of hyperparameter λ . . 33 2. . . . . . . . . . . . . Graphical representation for fully dependent model . .4 3. .2 2. . . . . . . Graphical model of Poisson-Gamma models . . . . 34 3. . . . .4 2. . . . . Results of coal mining disaster data . . .9 Results of synthetic data AR1 under different values of hyperparameter γ . . . . . . . . . . .3 2. . . . . . . . . . . . . . . . . . .5 2. . . . . . . . . . . . . . . .List of Figures 1. .7 Examples on possible changes over consecutive segments . . . . . . 42 43 43 44 49 v 3. . . . . . . . . . . . . .6 2. . . . . . . . . . . .2 3.7 Graphical representation for independent model . . . . . . . 36 38 3. . . . . . . . 32 2. . . . . Results of Tiger Woods data . . . . . . . . . . . . . . . . . . . . . . . . . .1 2. . . . . . . . . . . . . . . . . . . . . . . .

. .11 Results of U. . . . 3. . . . .10 Results of synthetic data 20D . . 3. . . . . . . . . . . 54 Candidate list of graphs generated by sliding windows in 10D data 55 56 57 58 59 60 3. . . .S. . . . portfolios data of 5 industries . 3. . . . . . .13 Results of honey bee dance sequence 4 . . . .List of Figures 3. .14 Results of honey bee dance sequence 6 . . . . . . . . . . . . . . . . . . . . . .9 Results of synthetic data 10D . . . . . . . . . . vi . . . 3.8 3. . . . . . . . .S. . . . . . . . portfolios data of 30 industries . . .12 Results of U. .

I would like to thank my parents and my wife for their endless love and support. and especially to the following people: Sohrab Shah. I am amazed by Kevin’s broad knowledge and many quick and brilliant ideas. Secondly. Last but not least. First of all. XIANG XUAN The University of British Columbia March 2007 vii .Acknowledgements I would like to thank all the people who gave me help and support throughout my degree. I would like to extend my appreciation to my colleagues for their friendship and help. I am grateful to Kevin for giving me the Bayesian education and directing me to the research in machine learning. Kevin introduced me to this interesting problem. Thirdly. I would like to thank my supervisor. Professor Kevin Murphy for his constant encouragement and rewarding guidance throughout this thesis work. I would like to thank Professor Nando de Freitas for dedicating his time and effort in reviewing my thesis. Wei-Lwun Lu and Chao Yan.

Both segments are linear models.Chapter 1 Introduction 1.1 shows four examples of changes over successive segments. change points split data into a set of disjoint segments. Then it is assumed that the data from the same segment comes from the same model. different correlation among sequences Figure 1.1 Problem Statement Change point problems are commonly referred to detect heterogeneity of temporal or spatial data. If we assume data = model + noise then the data on two successive segments could be different in the following ways: • different models (model types or model orders) • same model with different parameters • same model with different noise levels • in multiple sequences. The bottom left panel shows an example 1 . The top right panel shows an example of same model with different parameters. Given a sequence of data over time or space. The top left panel shows an example of different model orders. but the 1st segment has a negative slope while the 2nd segment has a positive slope. there is one change point at location 50 (black vertical solid line) which separates 100 observations into two segments. In all four examples. The 1st segment is a 2nd order Autoregressive model and the 2nd segment is a 4th order Autoregressive model.

4 0. The top left panel shows changes on AR model orders.8 0.2 −0. 20 15 10 5 0 −5 −10 −15 −20 1.2 0 −0. We can see that two series are positive correlated in the 1st segment. The bottom left panel shows changes on noise level. Introduction of same model with different noise level. 2 . but negative correlated in the 2nd segment.Chapter 1.4 0 20 40 60 80 100 0 20 40 60 80 100 10 8 4 3 6 4 2 0 −2 −4 −6 −2 −8 −10 0 20 40 60 80 100 −3 0 20 40 60 80 100 0 −1 2 1 Figure 1.1: Examples show possible changes over successive segments. The aim of the change point problems is to make inference about the number and location of change points. but the noise level (the standard deviation) of the 2nd segment is three times as large as the one on the 1st segment. The top right panel shows changes on parameters.2 1 0. The bottom right panel is an example of different correlation between two series. Both segments are constant models which have means at 0. The bottom right panel shows changes on correlation between two series.6 0.

the state is directly visible.2. The first approach is based on a different models Hidden Markov model (HMM). 1. the state is not directly visible. This thesis is an extension based on the Fearnhead’s work which is a special case of the Product Partition Models (PPM) (defined later) and using dynamic programming algorithms. we view the change is due to change on hidden states. • Update move to shift the position of a change point 3 . The challenge is to determine the hidden states from observed variables. In a regular Markov model. Introduction 1. • Birth move to add a new change point or split a segment into two.Chapter 1.2 Reversible Jump MCMC Reversible jump is a MCMC algorithm which has been extensively used in the change point problems. Here I like to mention two approaches that are closely related to our works.2 Related Works Many works have been done by many researchers in different areas. Hence by inference on all hidden states. we need to fix the number of hidden states.2. we can segment data implicitly. and the second approach is using a different algorithm Reversible Jump Monte Carlo Markov Chain. In a HMM. and often the number of state cannot be too large.1 Hidden Markov Models A HMM is a statistical model where system being modeled is assumed to be Markov process with hidden states. It starts by a initial set of change points. it can make the following three kinds of moves: • Death move to delete a change point or merge two consecutive segments. but variables influenced by the hidden states are visible. 1. In change point problems. At each step. In HMM.

and provide experimental results on some synthetic and real data sets. 1.Chapter 1. along with a number of suggestions for future works. Introduction Each step. we will show how to extend Fearnhead’s work in multiple dimensional series and how to use Gaussian graphical models to model and learn correlation structures. • We illustrate the algorithms by applying them to some synthetic and real data sets.4 Thesis Outline The remaining of the thesis is organized as follows. as well as changes on means. in addition to detecting change points. We run until the chain converges.3 Contribution Our contributions of the thesis are: • We extend Fearnhead’s work to the case of multiple dimensional series which allows us to detect changes on correlation structures. In Chapter 2. We will also provide experimental results on some synthetic and real data sets. Then we can find out the posterior probability of the number and position of change points. Finally. we will accept a move based on a probability calculated by the Metropolis-Hastings algorithm. 4 . The disadvantages are slow and difficult to diagnose convergence of the chain. our conclusions are stated in Chapter 4. 1. In Chapter 3. • We further model the correlation structures using Gaussian graphical models which allows us to estimate the changing topology of dependencies among series. variance. we will review Fearnhead’s work in one dimensional series in details. etc. The advantage of reversible jump MCMC is that it can handle a large family of distributions even when we only know the distribution up to a normalized constant.

Hence given the segmentation (partition) S1:K . S1:K |Y1:N ) using dynamic programming. we can write K P (Y1:N |K. The data on the k-th segment Y k is assumed to be independent with the data on the other segments given a set of parameters θk for that partition . Here a dataset Y1:N is partitioned into K partitions. where the number of the partitions K is unknown. They all first calculate the joint posterior distribution of the number and positions of change points P (K. and the online approximate algorithm runs in O(N ). Then these models are exactly the Product Partition Models (PPM) in one dimension [3. K. The offline algorithm and online exact algorithm run in O(N 2 ). then sample change points from this posterior distribution by perfect sampling [22]. S1:K |Y1:N ) (2.Chapter 2 One Dimensional Time Series Let’s consider the change point problems with the following conditional independence property: given the position of a change point. 14. S1:K . S1:K ∼ P (K. 5 . 11].1) Fearnhead [13. θ1:K ) = k=1 P (Y k |θk ) (2. 4. online exact and online approximate) to solve the change point problems under PPM. making inference on models and their parameters over segments is straight forward. 16] proposed three algorithms (offline.2) After sampling change points. the data before that change point is independent of the data after the change point.

position 0 is always a change point. By default. if we use the Geometric distribution as g().4) and assume that g() is also the probability mass function for the position of the first change point. let’s suppose there are N data points and we use a Geometric distribution with parameter λ. N − 1. we define the cumulative distribution function for the distance as following.6) 6 . N P (Ci = 1) = λ (2. . g() can be any arbitrary probability mass function with the domain over 1. 2. We denote P (Ci = 1) as the probability of location i being a change point. For example. That is. · · · . .5) First. .3) We assume ( 2. One Dimensional Time Series 2. . To see that. we show that the distribution for the location of change points is Uniform by induction. Let the transition probabilities of this Markov process be the following. Then g() and G() imply a prior distribution on the number and positions of change points. That is. Also we let the probability mass function for the distance between two successive change points s and t be g(|t − s|). l G(l) = i=1 g(i) (2. Furthermore. then our model implies a Binomial distribution for the number of change points and a Uniform distribution for the locations of change points. P (next change point at t|change point at s) = g(|t − s|) (2. Let’s assume that the change points occur at discrete time points and we model the change point positions by a Markov process. we put the following prior distribution on change points.Chapter 2. P (C0 = 1) = 1 (2. ∀i = 1.3) only depends on the distance between two change points. In general.1 Prior on Change Point Location and Number To express the uncertainty of the number and position of change points.

8) P (Ck+1 = 1|Cj = 1) = g(k + 1 − j) = λ(1 − λ)k−j (2. ( 2. P (C1 = 1) Suppose ∀ i ≤ k. Next we show the number of change points follows Binomial distribution. ( 2. Let’s consider each position as a trial with two outcomes. . P (Ci = 1) = P (Ci = 1|Cj = 1) = λ (2.6). We have. we have. N and i = j.10) . either being a change point or not. j = 1.8) becomes. we only have one case: position 0 and 1 both are change points. Then we only need to show each trial is independent. we have P (Ci = 1) = λ (2. we know the probability of being a change point is the same. By ( 2. One Dimensional Time Series When i = 1. . k P (Ck+1 = 1) = P (Ck+1 = 1|C0 = 1)P (C0 = 1) + j=1 k k k−j P (Ck+1 = 1|Cj = 1)P (Cj = 1) k−1 k 2 t=0 (let t = k − j) = λ(1 − λ) + j=1 λ(1 − λ) λ = λ(1 − λ) + λ (1 − λ)t 1 − (1 − λ)k = λ(1 − λ)k + λ2 = λ(1 − λ)k + λ(1 − (1 − λ)k ) 1 − (1 − λ) = λ By induction.9). conditioning on the last change point before position k + 1. . .11) 7 (2.9) By ( 2.5).Chapter 2. Hence the length of the segment is 1. ∀i.7) and ( 2. this proves ( 2. That is.6).7) = g(1) = λ Now when i = k + 1. k P (Ck+1 = 1) where = j=0 P (Ck+1 = 1|Cj = 1)P (Cj = 1) (2.

15) i−k−1 (let s = t-k) = λ2 s=0 = λ(1 − (1 − λ)i−k ) + λ(1 − λ)i−k = λ By induction. i P (Ci = 1|Ck−1 = 1) = t=k P (Ci = 1|Ct = 1.13) where P (Ci = 1|Ct = 1. One Dimensional Time Series When i < j. since future will not change history.11). this proves ( 2. Ck−1 = 1) P (Ct = 1|Ck−1 = 1) Hence ( 2. Hence the length of the segment is 1. it is true by default. When j = i − 1. We have. conditioning on the next change point after position k − 1. we have P (Ci = 1|Cj = 1) = λ (2. Hence the number of change points follows Binomial distribution. we show it by induction on j. i−1 = g(t − k + 1) = λ(1 − λ)t−k   λ = P (Ci = 1|Ct = 1) =  1 if t < i if t = i (2.13) becomes. P (Ci = 1|Ci−1 = 1) Suppose ∀ j ≥ k. Ck−1 = 1)P (Ct = 1|Ck−1 = 1) (2.14) P (Ci = 1|Ck−1 = 1) = t=k λλ(1 − λ)t−k + λ(1 − λ)i−k (1 − λ)s + λ(1 − λ)i−k = λ2 1 − (1 − λ)i−k + λ(1 − λ)i−k 1 − (1 − λ) (2. When i > j.Chapter 2. we only have one case: position j and i both are change points. 8 . we have.12) = g(1) = λ Now when j = k − 1.

The first is the uncertainty of model types. and we put a prior distribution π(θq ) on parameters. If the parameter on this segment conditional on model q is θq . we will only consider a finite set of models. In general. t P (Ys+1:t |q) = i=s+1 f (Yi |θq . Now let’s look at some examples. Our extensions are mainly based on this. 9 . and define P (Ys+1:t |q) as the likelihood function of Ys+1:t conditional on a model q. polynomial regression models up to order 2 or autoregressive models up to order 3. In practice. this requires either conjugate priors on θq which allow us to work out the likelihood function analytically. as long as we can evaluate the likelihood function ( 2. where s < t. As a result. There are many possible candidate models and we certainly cannot enumerate all of them.17) We assume that P (Ys+1:t |q) can be efficiently calculated for all s. we can evaluate P (Ys+1:t |q) as following. t and q. One Dimensional Time Series 2. then data Ys+1:t forms one segment.2 Likelihood Functions for Data in Each Partition If we assume that s and t are two successive change points. then we can evaluate P (Ys+1:t ) as following.17). we can use Fearnhead’s algorithms. or fast numerical routines which are able to evaluate the required integration. q)π(θq |q) dθq (2. Then if we put a prior distribution π() on the set of models.Chapter 2. For example. P (Ys+1:t ) = q P (Ys+1:t |q)π(q) (2.16) The second is the uncertainty of model parameters. We use the likelihood function P (Ys+1:t ) to evaluate how good the data Ys+1:t can be fit in one segment. Here there are two levels of uncertainty. for any data and models.

For simplicity. σ 2 ) = = 1 1 −1 exp − 2 (Y − Hβ)T In (Y − Hβ) 2σ (2π)n/2 σ n 1 1 exp − 2 β T D−1 β 2σ (2π)q/2 σ q |D|1/2 (2. σ 2 ) P (β|D. And we have the following. we write Ys+1:t as Y and let n = t − s. Here.1 Linear Regression Models The linear regression models are one of the most widely used models.18) where H is a (t − s) by q matrix of basis functions. This hierarchical model can be illustrated by Figure 2. P (Y |β. and the jth component of regression parameter βj has a Gaussian distribution with mean 0 and variance 2 σ 2 δj . we assume Ys+1:t = Hβ + ǫ (2. Since we assume conjugate priors.2.1.1: Graphical representation of Hierarchical structures of Linear Regression Models Gamma distribution with parameters ν/2 and γ/2. β is a q by 1 vector of mean 0 and variance σ 2 .Chapter 2. σ 2 has an Inverse regression parameters and ǫ is a (t − s) by 1 vector of iid Gaussian noise with Figure 2. One Dimensional Time Series 2.20) 10 .19) (2.

σ 2 )P (β|D. By ( 2. P (Ys+1. β. σ 2 )P (β|D.20). we have. we have following. . γ) dβ dσ 2 P (Y |β. σ 2 ) ∝ exp − 1 ((Y − Hβ)T (Y − Hβ) + β T D−1 β) 2σ 2 1 ∝ exp − 2 (Y T Y − 2Y T Hβ + β T H T Hβ + β T D−1 β) 2σ Now let M P ||Y ||2 P (∗) Then (∗) = β T (H T H + D−1 )β − 2Y T Hβ + Y T Y = β T M −1 β − 2Y T HM M −1 β + Y T Y = β T M −1 β − 2Y T HM M −1 β + Y T HM M −1 M T H T Y − Y T HM M −1 M T H T Y + Y T Y Using fact M = M T (∗) = = = (β − M H T Y )T M −1 (β − M H T Y ) + Y T Y − Y T HM H T Y (β − M H T Y )T M −1 (β − M H T Y ) + Y T P Y (β − M H T Y )T M −1 (β − M H T Y ) + ||Y ||2 P 11 = = (H T H + D−1 )−1 (I − HM H T ) = Y TPY = Y T Y − 2Y T Hβ + β T H T Hβ + β T D−1 β . ν. γ) (γ/2)ν/2 2 −ν/2−1 γ (σ ) exp − 2 Γ(ν/2) 2σ = (2. γ) dβ dσ 2 Multiplying ( 2.21) 2 2 where D = diag(δ1 . . One Dimensional Time Series P (σ 2 |ν. δq ) and In is a n by n identity matrix. P (Y |β. ν. γ) = = P (Y.Chapter 2. σ 2 )P (σ 2 |ν. . σ 2 |D.17).t |q) = P (Y |D.19) and ( 2. .

σ 2 ) = = = P (Y |β.21) and ( 2. One Dimensional Time Series Hence P (Y |β. σ 2 )P (σ 2 |ν.24) Then integrating out σ 2 . γ) ∝ (σ 2 )−n/2−ν/2−1 exp − γ + ||Y ||2 P 2σ 2 So the posterior for σ 2 is still Inverse Gamma with parameters (n + ν)/2 and (γ + ||Y ||2 )/2: P P (σ 2 |ν. we have P (Y |D. γ) = = P (Y |D. ν.23) Now we multiply ( 2.25) 12 .Chapter 2.23).22) Then integrating out β. σ 2 )P (β|D. σ 2 ) dβ 1 (2π)n/2 σ n 1 (2π)n/2 σ n (2π)q/2 σ q |M |1/2 1 exp − 2 ||Y ||2 P q/2 σ q |D|1/2 2σ (2π) |M | |D| 1 2 exp − 1 ||Y ||2 P 2σ 2 (2. σ 2 )P (σ 2 |ν. P (Y |D. K) ∝ exp − 1 2 ((β − M H T Y )T M −1 (β − M H T Y ) + ||Y ||P 2σ 2 So the posterior for β is still Gaussian with mean M H T Y and variance σ 2 M : P (β|D. K)P (β|D. we have P (Ys+1:t |q) = P (Y |D. γ) dσ 2 1 (2π)n/2 |M | |D| 1 2 1 2 (γ/2)ν/2 Γ(ν/2) Γ((n + ν)/2) ((γ + ||Y ||2 )/2)(n+ν)/2 P = π −n/2 |M | |D| Γ((n + ν)/2) (γ)ν/2 Γ(ν/2) (γ + ||Y ||2 )(n+ν)/2 P (2. γ) = ((γ + ||Y ||2 )/2)(n+ν)/2 2 −(n+ν)/2−1 γ + ||Y ||2 P P (σ ) exp − Γ((n + ν)/2) 2σ 2 (2. σ 2 ) = 1 (2π)q/2 σ q |M |1/2 exp − 1 (β − M H T Y )T M −1 (β − M H T Y ) 2σ 2 (2.

: Hi+1. then Ys:t. log(Γ((n + ν)/2)) and log(Γ(ν/2)) 2 We use the following notations: if Y is a vector.17). β) = = = i P (Y. M and ||Y ||2 can be computed by the P following rank one update. we have. β) dλ 13 . log(P (Ys+1:t |q)) n 1 ν n+ν 2 = − log(π) − (log|M | − log|D|) + log(γ) − log(γ + ||Y ||P ) 2 2 2 2 + log(Γ((n + ν)/2)) − log(Γ(ν/2)) (2. each observation Yi is a non-negative integer which follows Poisson distribution with parameter λ.Chapter 2. By ( 2. we write Ys+1:t as Y and let n = t − s. β) dλ P (Yi |λ) P (λ|α.26) ν 2 log(γ).: Yi+1 To speed up. P (Ys+1:t |q) = P (Y |α. 2 − 1 log|D|. P (Yi |λ) = P (λ|α. If Y is a matrix.: Y1:i+1 T T = H1:i. λ follows a Gamma distribution with parameters α and β. − n log(π). T H1:i+1.: H1:i. then Ys:t denotes the entries from position s to t inclusive.27) (2.: denotes the s-th row to the t-th row inclusive.2.2. we rewrite ( 2. At each iteration. One Dimensional Time Series In implementation. β) e−λ λYi Yi ! β α e−βλ = λ(α−1) Γ(α) (2.: T Y1:i+1 Y1:i+1 T H1:i+1. This hierarchical model can be illustrated by Figure 2.28) For simplicity. With conjugate prior. can be pre-computed. β) dλ P (Y |λ)P (λ|α.2 Poisson-Gamma Models In Poisson Models. Hence we have the following. 2.: H1:i+1. λ|α.: + Hi+1.s:t denotes the s-th column to the t-th column inclusive.: Y1:i + Hi+1. and Y:.: T T = Y1:i Y1:i + Yi+1 Yi+1 T T = H1:i.25) in log space as following.

β) i Yi . log(P (Ys+1:t |q)) = − log(Yi !) + log(Γ(SY + α)) − log(Γ(α)) + αlog(β) 14 i . we have. β) = i 1 Yi ! 1 Yi ! β α Γ(SY + α) Γ(α) (n + β)SY +α Γ(SY + α) βα Γ(α) (n + β)SY +α (2. we have the following.30) = i Similarly. and let SY = P (Y |λ)P (λ|α.27) and ( 2. By multiply ( 2.Chapter 2.2: Graphical representation of Hierarchical structures of Poisson- Gamma Models. ∝ e−nλ−βλ λSY +α−1 (n + β)SY +α e−(n+β)λ Γ(SY + α) So the posterior for λ is still Gamma with parameters SY + α and n + β: P (λ|α. One Dimensional Time Series Figure 2.29) Then integrating out λ.28). we have P (Ys+1:t |q) = P (Y |α. Note the Exponential-Gamma Models have the same Graphical representation of their Hierarchical structures. β) = λ(SY +α−1) (2. in log space.

which can be illustrated by Figure 2. we write Ys+1:t as Y and let n = t − s.31) log(Yi !). This hierarchical model is the same as the one in Poisson-Gamma Models. β) = λ(n+α−1) (SY + β)n+α e−(SY +β)λ Γ(n + α) (2. β) Γ(n + α) βα = Γ(α) (SY + β)n+α (2.Chapter 2. One Dimensional Time Series − (SY + α)log(n + β) Where i (2.2. P (Ys+1:t |q) = P (Y |α.33) For simplicity. β) dλ P (Yi |λ) P (λ|α.33). Hence we have the following. P (Yi |λ) = λe−λYi P (λ|α. β) = = = i P (Y.32) β e Γ(α) α −βλ (2. λ follows a Gamma distribution with parameters α and β. β) dλ P (Y |λ)P (λ|α.3 Exponential-Gamma Models In Exponential Models.32) and ( 2. e−SY λ−βλ λn+α−1 So the posterior for λ is still Gamma with parameters n + α and SY + β: P (λ|α. each observation Yi is a positive real number which follows Exponential distribution with parameter λ. By ( 2. β) dλ i By multiply ( 2.35) 15 . we have. With conjugate prior.2. and let SY = P (Y |λ)P (λ|α. λ|α. we have. β) ∞ Yi . and SY can use the following rank one update.34) Then integrating out λ. β) = λ(α−1) (2. we have P (Ys+1:t |q) = P (Y |α.17). SYi+1 = SYi + Yi+1 2. log(Γ(α)) and αlog(β) can be pre-computed.

Yi = β1 Yi−1 + β2 Yi−2 + · · · + βr Yi−r (2. we have the following.18). . .36) Where log(Γ(α)) and αlog(β) can be pre-computed.Chapter 2.3. . in log space. ..  .39) 16 . 2. 2.37) Hence the polynomial basis H is following. Yi 2 r = β0 + β1 Xi + β2 Xi + · · · + βr Xi (2. 2. . r Xj         (2. One Dimensional Time Series Similarly. . log(P (Ys+1:t |q)) = αlog(β) − log(Γ(α)) + log(Γ(n + α)) − (n + α)log(SY + β) (2.2 Autoregressive Basis Autoregressive model of order r is defined as following.1 Polynomial Basis Polynomial model of order r is defined as following.  .  1 Xj 2 Xj ··· ··· . . we have a basis function H. . Some common basis functions are the following.3 Basis Functions In the linear regression model ( 2. ··· r Xi r Xi+1 . .38) where Xi = i N.  2 1 Xi Xi   2  1 Xi+1 Xi+1 Hi:j =  .3. and SY can use the rank one update as mentioned earlier.

Chapter 2.4 Offline Algorithms Now we review the offline algorithm in detail. Offline algorithm will calculate Q(s) for all s = 1.4).43) + q P (Ys:N |q)π(q)(1 − G(N − s)) where π(q) is the prior probability of model q and g() and G() are defined in ( 2. 2. then simulates change points forward. . Kernel Basis). It first recurses backward. . Yj−r         (2. Fourier Basis). . .  . 17 . By its name. · · · .40) There are many other possible basis functions as well (eg. Here we just want to point out that the basis function does provide a way to extend Fearnhead’s algorithms (eg. N .  Yj−1 Yj−2 ··· Yi−r Yi−r+1 . Q(s) = P rob(Ys:N |location s − 1 is a change point) (2. .  Yi−1 Yi−2 · · ·   Yi−1 · · ·  Yi Hi:j =  . we know it works in batch mode. 2. One Dimensional Time Series Hence the autoregressive basis H is following.  .4.. . 2. When s = N Q(N ) When s < N N −1 = q P (YN :N |q)π(q) (2. . N . .41) and Q(1) = P rob(Y1:N ) since location 0 is a change point by default. . .1 Backward Recursion Let’s define for s = 2.42) Q(s) = t=s q P (Ys:t |q)π(q)Q(t + 1)g(t − s + 1) (2. .

we can simulate all change points forward.t form a segment)P (Yt+1:N |next change point is at t) = q P (Ys:t |q)π(q)Q(t + 1)g(t − s + 1) For the second part we have. · · · . . N − 1. Since s is from N to 1. no further change points) For the first part we have.Chapter 2. · · · . For τk+1 = τk + 1.4. the whole algorithm runs in O(N 2 ). Compute the posterior distribution of τk+1 given τk as following.44) + P (Ys:N . and at step s we will compute sum over t where t is from s to N .N form a segment)P (the length of segment > N − s) = q P (Ys:N |q)π(q)(1 − G(N − s)) We will calculate Q(s) backward for s = N. P (τk+1 |τk . next change point is at t) = P (next change point is at t)P (Ys:t . 2. we do the following. Yt+1:N |next change point is at t) = g(t − s + 1)P (Ys:t |s. Set τ0 = 0. P (Ys:N .43) is that (drop the explicit conditioning on a change point at s − 1 for notational convenience) N −1 Q(s) = t=s P (Ys:N . τk + 2. no further change points) = P (Ys:N |s. 1 by ( 2. 1. To simulate one realisation. Y1:N ) = q P (Yτk +1:τk+1 |q)π(q)Q(τk+1 + 1)g(τk+1 − τk )/Q(τk + 1) 18 . . N . next change point is at t) (2. . One Dimensional Time Series The reason behind ( 2.2 Forward Simulation After we calculate Q(s) for all s = 1. 2. .42) and ( 2. P (Ys:N .43). and k = 0.

τ1 . which means no further change points P (τk+1 |τk . τk . otherwise output the set of simulated change points. and in the same way that first calculate recursion forward. . . . C2 .   1−G(t−i−1)  1−G(t−i−2)   G(t−i−1)−G(t−i−2) T P (Ct = j|Ct−1 = i) = 1−G(t−i−2)     0 where G is defined in ( 2. This also means that the length of segment must be longer than t − 2 − i. We first review the online exact algorithm in detail. and C1 . then there is no change point before time t. 4. 1.45) is : given Ct−1 = i. τ2 . if j = i if j = t − 1 otherwise (2. If τk < N return to step (2).5 Online Algorithms Now we discuss the online algorithms which have two versions (exact and approximate). Both work in online mode.Chapter 2. Ct satisfy Markov property since we assume conditional independence over segments. .45) The reason behind ( 2. G(n) means the cumulative probability of the distance between two consecutive change points no more than n. t − 1.4). Conditional on Ct−1 = j. which leads to Ct = t − 1. Hence the transition probabilities of this Markov Chain can be calculated as following. . . . At the same time. Let’s introduce Ct as a state at time t. then simulate change points backward. Y1:N ). Then the state Ct can take values 0. If Ct = 0. either t − 1 is not a change point. Y1:N ) = q P (Yτk +1:N |q)π(q)(1 − G(N − τk ))/Q(τk + 1) 3. which leads to Ct = j. . . . which can also be considered as the cumulative probability of the length of segment no more 19 . Simulated τk+1 from P (τk+1 |τk . 2. and set k = k + 1. or t − 1 is a change point. One Dimensional Time Series For τk+1 = N . . . we know there is no change point between i and t − 2. which is defined to be the time of the most recent change point prior to t.

. Hence We have P (Ct−1 = i) = 1 − G(t − 2 − i).45) becomes. we have P (Ct = i) = 1 − G(t − 1 − i). When t = 1. 1. by standard filtering recursions.5. If j = t − 1. and this is just g(t − 1 − i). N and j = 0. . . . we have j = 0. we have P (Ct = j|Y1:t ) = P (Ct = j|Yt . then g(i) = (1 − λ)i−1 λ and G(i) = 1 − G(t − i − 1) (1 − λ)t−i−1 = 1 − G(t − i − 2) (1 − λ)t−i−2 1−λ T P (Ct = j|Ct−1 = i) = = And if j = t − 1 T P (Ct = j|Ct−1 = i) G(t − i − 1) − G(t − i − 2) (1 − λ)t−i−2 − (1 − λ)t−i−1 = 1 − G(t − i − 2) (1 − λ)t−i−2 = λ = Hence ( 2. and by definition of g().46) When t > 1.Chapter 2. . Online algorithm will compute P (Ct = j|Y1:t ) for all t. j. Now if j = i. Y1:t−1 ) P (Ct = j. such that t = 1. 1 − (1 − λ)i .1 Forward Recursion Let’s define the filtering density P (Ct = j|Y1:t ) as: given data Y1:t . . which means there is a change point at t − 1. which is also G(t − i − 1) − G(t − i − 2). . t − 1. the length of two change points is t − 1 − i. hence P (C1 = 0|Y1:1 ) = 1 (2. One Dimensional Time Series than n. 2. 2. If we use a Geometric distribution as g(). . the probability of the last change point is at position j.   1 − λ if j = i    λ     0 T P (Ct = j|Ct−1 = i) = if j = t − 1 otherwise where λ is the parameter of the Geometric distribution. Yt |Y1:t−1 ) = P (Yt |Y1:t−1 ) 20 . Then if j = i.

Chapter 2. and with ( 2.5. we can use idea in [9] to simulate all change points backward. To simulate one realisation from this joint density. 1. 21 . we do the following. 2. . and k = 0. Y1:t−1 )P (Ct = j|Y1:t−1 ) = (2.2 Backward Simulation After we calculate the filtering densities P (Ct |Y1:t ) for all t = 1. N . the whole algorithm runs in O(N 2 ). Y1:t−1 )P (Ct = j|Y1:t−1 ) P (Yt |Y1:t−1 ) ∝ P (Yt |Ct = j. we have   wj 1−G(t−i−1) P (Ct−1 = j|Y1:t−1 ) t 1−G(t−i−2) P (Ct = j|Y1:t ) ∝  wt t−2 G(t−i−1)−G(t−i−2) P (Ct−1 = i|Y1:t−1 ) t i=0 1−G(t−i−2) j Now wt can be calculated as following. if j < t − 1 if j = t − 1 (2. One Dimensional Time Series P (Yt |Ct = j.46) and ( 2. Y1:t−1 ). We will calculate P (Ct |Y1:t ) forward for t = 1. j wt = = P (Yj+1:t |Ct = j) P (Yj+1:t−1 |Ct = j) q P (Yj+1:t |q)π(q) q P (Yj+1:t−1 |q)π(q) When j = t − 1. Since t is from 1 to N . and at step t we will compute Ct = j where j is from 0 to t − 1.49). Set τ0 = N .50) where π(q) is the prior probability of model q. N by ( 2.49) When j < t − 1.47) and t−2 P (Ct = j|Y1:t−1 ) = i=0 T P (Ct = j|Ct−1 = i)P (Ct−1 = i|Y1:t−1 ) (2. .48) j If we define wt = P (Yt |Ct = j. .45). . t wt = q P (Yt:t |q)π(q) (2. · · · .

. and all q. If τk > 0 return to step (2).q Pt (j. q)g(t − j) 1 − G(t − j − 1) (2.5. . t − 1. modelq. q). . τk−2 . 3. .52) = (1 − G(t − j − 1))P (Yj+1:t |q)π(q)PjM AP (2. τ1 . One Dimensional Time Series 2.Mt . . Given a MAP estimate of Ct . . . Y1:t ) = P (Ct = i. . . τk−1 . Pt (i. n.Chapter 2. 22 . Y1:t ) Then we can compute Pt (i.52). For t = 1. We define Mi to be the event that given a change point at time i. otherwise output the set of simulated change points backward.3 Viterbi Algorithms We can also obtain an online Viterbi algorithm for calculating the maximum a posterior (MAP) estimate of positions of change points and model parameters for each segment as following. we can then calculate the MAP estimates of the change point prior to Ct and the model parameters of that segment by the value of j and q that maximized the right hand side of ( 2. . Simulated τk+1 from the filtering density P (Cτk |Y1:τk ). q) and PtM AP = maxj. the MAP estimates of Ct and the current model parameters are given respectively by the values of j and q which maximize Pt (i. and set k = k + 1. 2. q) and PtM AP = P (change point at t. This procedure can be repeated to find the MAP estimates of all change points and model parameters.51) At time t. . Mi . and i = 0. . the MAP choice of change points and model parameters occurs prior to time i. q) using the following: Pt (i.

We impose a maximum number (eg. During recursions. where L < M . 16]. · · · . We will show later that this approximation will not affect the accuracy of the result much. In all three algorithms. hence it will not be computed in the next iteration. 12. but runs much faster. we will set it to be −∞. Hence the maximum computational cost per iteration is proportional to M . At step t. 1. the approximate algorithm runs in O(N ).6 Implementation Issues There are two issues that need to be mentioned in implementation. t − 1.Chapter 2. One Dimensional Time Series 2. Here. M ) of particles to be stored at each step. Since M is not dependent on the size of data N . This will provide not only stable calculation but also speed up because we will calculate less terms in each iteration. This is very important in term of speed. We perform all calculations in log space to prevent overflow and underflow. There are many ways to perform resampling [7. This approximate distribution can be described by a set of support points. 23 . We let L = M − 1. Whenever we have M particles. hence each step we just drop the particle that has the lowest support.5. 1 × 10−30 ). We also use the functions logsumexp and logdet. we use the simplest one.4 Approximate Algorithm Now we look at the online approximate algorithm. The second one is rank one update. we perform resampling to reduce the number of particles to L. 15. whenever the quantity that we need to compute is less than the threshold. online exact algorithm stores the complete posterior distribution P (Ct = j|Y1:t ) for j = 0. we introduce a threshold (eg. At the end of each iteration. we will rescale the quantity to prevent underflow. The way to do rank one update depends on the likelihood function we use. 2. We call them particles. We approximate P (Ct = j|Y1:t ) by a discrete distribution with fewer points. The only difference is the following. The first issue is numerical stability. The online approximate algorithm works in almost same way as the online exact algorithm.

24 . 18. Each segment is an autoregressive model. The first one is called Blocks.Chapter 2. Hence the natural conjugate prior is a Gamma distribution.7 Experimental Results Now we will show experimental results on some synthetic and real data sets. This is a well-known data set and is previously studied by many researchers [6. The top panel shows the raw data with the true change points (red vertical line). We set the hyper parameter ν = 2 and γ = 0. 2. Again.3. with mean level shifting over segments.1 Synthetic Data We will look at two synthetic data sets.7. We can see that the results are essentially the same. The bottom three panels show the posterior probability of being change points at each position and the number of segments that are calculated (from top to bottom) by the offline. One Dimensional Time Series 2. The top panel shows the raw data with the true change points (red vertical line). Each segment is a piecewise constant model.4. we see the results from three algorithms are essentially the same. 25]. Here Yi is the number of disasters in the i-th year. and the results are shown in Figure 2. Henceforth only use online approximate method.02 on σ 2 . The second data set is called AR1. 13. The bottom three panels show the posterior probability of being change points at each position and the number of segments that are calculated (from top to bottom) by the offline.7. It is a very simple data set. the online exact and the online approximate algorithms. 2.2 British Coal Mining Disaster Data Now let’s look at a real data set which records British coal-mining disasters [17] by year during the 112 year period from 1851 to 1962. and results are shown in Figure 2. It has been previously analyzed in [10. the online exact and the online approximate algorithms. 21]. which follows Poisson distribution with parameter (rate) λ. 14. We set the hyper parameter ν = 2 and γ = 2 on σ 2 .

2.5. 26] including baseball. 1996 and play 230 tournaments. Then in the up right panel. it shows the raw data as a Poisson sequence. The bottom left panel shows the posterior distribution on the number of segments.7. Woods turned professional at the Great Milwaukee Open in August 29. basketball and golf. 3. It is interesting to detect a streaky player or a streaky team in many sports [1. The bottom right panel shows the posterior distribution of being change points at each location. and the first ever to win all four professional major championships consecutively. 1. we will pick the most probable three change points at location 41. 1934 and 1952. A streaky player is different. 1. Then the data can be expressed as a Bernoulli sequence with 25 . Here we use Tiger Woods as an example. since we let the prior mean ¯ of λ is equal to the empirical mean ( α = Y = 1. One Dimensional Time Series We set the hyper parameter α = 1. 84 and 102 which corresponding to year 1891.5. winning 62 championships in the period September 1996 to December 2005. On four segments. Since we have four segments. It shows the most probable number of segments is four. In sports. 5. since the associated success rate is not constant over time. is called ’streakiness’. it. Streaky players might have a large number of success during one or more periods.66) and the prior strength β β to be weak. Tiger Woods is one of the most famous golf players in the world. Then we use Fearnhead’s algorithms to analyse the data.3 Tiger Woods Data Now let’s look at another interesting example. the posterior rates are roughly as following.66 and β = 1. with fewer or no successes during other periods. and the results are shown in Figure 2. when an individual player or team enjoys periods of good form.Chapter 2. it shows the resulted segmentation (the red vertical line) and the posterior estimators of rate λ on each segment (the red horizontal line). The opposite of a streaky competitor is a player or team with a constant rate of success over time. More streaky players tend to have more change points. In top left panel. Let Yi = 1 if Woods won the i-th tournament and Yi = 0 otherwise.5 and 0.

01. the values of λ are: 0. then we will set λ = 0.Chapter 2.5. from May 1999 to March 2003. Initially. In [26].5 (this prior says the length of segment is 26 . the data is only from September 1996 to June 2001. If we increase λ. Since we have three segments. his winning rate was dropped back to lower than 0. In top left panel. The bottom right panel shows the posterior distribution of being change points at each location. we will encourage more change points. 0. it shows the resulted segmentation (the red vertical line) and the posterior estimators of rate λ on each segment (the red horizontal line).1. We see when λ = 0. 2.53) For example. Then we perform Fearnhead’s algorithms to analyse the data. However. The data comes from Tiger Woods’s official website http://www.01.com/. It shows the most probable number of segments is three. From top to bottom.6.2. 0. The bottom left panel shows the posterior distribution on the number of segments. he was an golf player with winning rate lower than 0. Then in the up right panel. 0.001 and 0. We can see clearly that Tiger Woods’s career from September 1996 to December 2005 can be splitted into three periods.7. we will pick the most probable two change points at location 73 and 167 which corresponding to May 1999 and March 2003. Results obtained by the online approximate algorithms under different values of λ are shown in Figure 2. We use the synthetic data Blocks as an example. from September 1996 to May 1999. and the results are shown in Figure 2.2. λ = the total number of expected segments the total number of data points (2. One Dimensional Time Series parameter (rate) λ.0002. he was in his peak period with winning rate nearly 0.5. We put a conjugate Beta prior with hyper parameters α = 1 and β = 1 on λ since this is the weakest prior.tigerwoods.8 Choices of Hyperparameters Hyperparameters λ is the rate of change points and can be set as following. After March 2003 till December 2005. and is previous studied by [26]. if there are 1000 data points and we expect there are 10 segments. it shows the raw data as a Bernoulli sequence.

results are fine. we normally set ν = 2 such that it is a weak prior.2 and 0.04.) and 2 When to set γ = 4σ .Chapter 2.9.04. For the data AR1. Under other values of λ. In linear regression. Now we show results from synthetic data Blocks and AR1 obtained by the online approximate algorithms under different values of γ in Figure 2.8 and Figure 2. the mean does not exist. For other values of γ. We see that the data Blocks is very robust to the choices of γ. we see when γ = 100. then we will set γ = 0. where σ 2 is expected variance within each segment. which is too short. Different likelihood functions have different hyperparameters. 2. From top to bottom. 27 . (Note: we parameterize Inverse-Gamma in term of ν = 2. the result is clearly oversegmented. We use the mode γ ν+2 ν 2 γ 2 . results are fine. One Dimensional Time Series 2. 20. we only detect 3 change points instead of 5. 0.). Since ν represents the strength of prior. we have hyperparameters ν and γ on the variance σ 2 .01. the values of γ are: 100. For example. Then we can set γ to reflect our belief on how large the variance will be within each segment. if we believe the variance is 0.

2 0 0 100 200 300 400 500 Online Approximate 1 0.Chapter 2.4 0.6 0.3: Results on synthetic data Blocks (1000 data points).8 0.6 0.8 0.2 0 0 100 200 300 400 500 Online Exact 1 0.4 0.2 0 0 100 200 300 400 500 600 700 800 900 1000 1 0. Results are generated by ’showBlocks’.2 0 0 5 P(N) 10 600 700 800 900 1000 P(N) Figure 2. The rest are the posterior probability of being change points at each position and the number of segments calculated by (from top to bottom) the offline. the online exact and the online approximate algorithms.6 0.8 0.4 0.4 0.2 0 0 5 P(N) 10 600 700 800 900 1000 100 200 300 400 500 Offline 1 0.4 0.6 0.8 0.6 0.8 0. The top panel is the Blocks data set with true change points.2 0 0 5 10 600 700 800 900 1000 1 0.4 0. 28 . One Dimensional Time Series Y 20 10 0 −10 0 1 0.8 0.6 0.

8 0.4 0.6 0.4 0.2 0 0 2 4 P(N) 6 8 600 700 800 900 1000 100 200 300 400 500 Offline 1 0.8 0.2 0 0 100 200 300 400 500 Online Exact 1 0.2 0 0 2 4 6 8 600 700 800 900 1000 1 0.6 0.Chapter 2.4: Results on synthetic data AR1 (1000 data points).8 0.4 0.4 0.6 0. Results are generated by ’showAR1’.4 0.8 0.8 0. One Dimensional Time Series Y 20 10 0 −10 −20 0 1 0. 29 . The rest are the posterior probability of being change points at each position and the number of segments calculated by (from top to bottom) the offline.6 0.6 0.2 0 0 100 200 300 400 500 Online Approximate 1 0.8 0.6 0. The top panel is the AR1 data set with true change points. the online exact and the online approximate algorithms.4 0.2 0 0 100 200 300 400 500 600 700 800 900 1000 1 0.2 0 0 2 4 P(N) 6 8 600 700 800 900 1000 P(N) Figure 2.

4 0.2 0.8 0.3 0.5 0.6 0.9 0. The bottom left panel shows the posterior distribution on the number of segments.2 0. 30 .Chapter 2. The up right panel shows the resulted segmentation and the posterior estimators of rate λ on each segment. The top left panel shows the raw data as a Poisson sequence.7 0. One Dimensional Time Series Y 6 5 4 3 2 1 0 0 20 40 60 80 100 6 5 4 3 2 1 0 0 20 40 60 80 100 P(N) 1 0.5 0. The bottom right panel shows the posterior distribution of being change points at each position.4 0.5: Results on Coal Mining Disaster data.7 0.8 0.9 0. Results are generated by ’showCoalMining’.3 0.1 0 0 2 4 6 8 10 1 0.1 0 0 20 40 60 80 100 Figure 2.6 0.

This data comes from Tiger Woods’s official website http://www.4 0.3 0.tigerwoods.3 0.1 0 0 50 100 150 200 Figure 2. The bottom left panel shows the posterior distribution on the number of segments.4 0. The top left panel shows the raw data as a Bernoulli sequence.4 0. 31 .8 0.1 0 0 1 2 3 4 5 6 7 8 1 0.9 0.8 0.6 0.6 0.6: Results on Tiger Woods data. One Dimensional Time Series Y 1 1 0.7 0. Results are generated by ’showTigerWoods’.2 0 0 50 100 150 200 0 0 50 100 150 200 P(N) 1 0.2 0.9 0. The up right panel shows the resulted segmentation and the posterior estimators of rate λ on each segment.8 0.6 0.5 0.Chapter 2.6 0.7 0.4 0. The bottom right panel shows the posterior distribution of being change points at each position.8 0.2 0.2 0.5 0.com/.

the values of λ are: 0.5. From top to bottom.5 0. 0. 0.1 0 0 20 40 P(N) λ = 0.5 P(N) λ = 0.01 0 0 20 40 P(N) λ = 0. Large value of λ will encourage more segments. One Dimensional Time Series Y 20 10 0 −10 λ = 0.0002 0 0 20 40 P(N) 0 100 200 300 400 500 600 700 800 900 1000 0 20 40 Figure 2.Chapter 2. The rest are the posterior probability of being change points at each position and the number of segments calculated by the online approximate algorithms under different values of λ. The top panel is the Blocks data set with true change points.5 0 1 1 0.5 0.5 0 1 1 0.7: Results on synthetic data Blocks (1000 data points) under different values of hyperparameter λ.5 0 0.01.1.5 0 0.5 0.0002. 32 .5 1 1 0.5 0 1 1 0. Results are generated by ’showBlocksLambda’.001 and 0.5 0 1 1 0.001 0 0 20 40 P(N) λ = 0. 0.

0. 20.Chapter 2.5 0. hence encourage less segments. 2.04.5 0 1 1 0. Results are generated by ’showBlocksGamma’. The rest are the posterior probability of being change points at each position and the number of segments calculated by the online approximate algorithms under different values of γ.2 0 0 5 10 P(N) γ = 0.8: Results on synthetic data Blocks (1000 data points) under different values of hyperparameter γ.5 0 0. From top to bottom.5 0 0.5 0 1 1 0. One Dimensional Time Series Y 20 10 0 −10 γ = 100 1 1 0. The top panel is the Blocks data set with true change points.5 0.04 0 0 5 10 P(N) 0 100 200 300 400 500 600 700 800 900 1000 0 5 10 Figure 2.5 0 1 1 0.5 0.2 and 0.5 0 1 1 0. the values of γ are: 100. Large value of γ will allow higher variance in one segment.5 P(N) γ = 20 0 0 5 10 P(N) γ=2 0 0 5 10 P(N) γ = 0. 33 .

2 0 0 2 4 6 P(N) γ = 0. 0. hence encourage less segments.9: Results on synthetic data AR1 (1000 data points) under differ- ent values of hyperparameter γ. 2.5 0 1 1 0. The top panel is the AR1 data set with true change points.5 0 1 1 0.5 0.5 0.5 P(N) γ = 20 0 0 2 4 6 P(N) γ=2 0 0 2 4 6 P(N) γ = 0. 20.04 0 0 2 4 6 P(N) 0 100 200 300 400 500 600 700 800 900 1000 0 2 4 6 Figure 2.5 0.5 0 1 1 0. One Dimensional Time Series Y 20 0 −20 γ = 100 1 1 0.5 0 0. Large value of γ will allow higher variance within one segment. Results are generated by ’showAR1Gamma’. the values of γ are: 100.5 0 1 1 0.04.Chapter 2. The rest are the posterior probability of being change points at each position and the number of segments calculated by the online approximate algorithms under different values of γ. From top to bottom.2 and 0. 34 .5 0 0.

35 . 3. d P (Ys+1:t ) = j=1 j P (Ys+1:t ) (3. P (Ys+1:t ) can be any likelihood function discussed in the previous chapter. The independent model is simple to use.16) can be written as the following product.1) j where P (Ys+1:t ) is the marginal likelihood in the j-th dimension. We model the correlation structures using Gaussian graphical models. Hence we can estimate the changing topology of the dependencies among the series. is a d-dimensional vector. it can be used as an approximate model similar to Naive Bayes when we cannot model the correlation structures among each dimension. Now Yi . Even when independent assumption is not valid. Since now j j Ys+1:t is one dimension.1 Independent Sequences The first obvious extension is to assume that each dimension is independent. the marginal likelihood function ( 2. in addition to detecting change points. Then we can detect change points based on changing correlation structure. Under this assumption.Chapter 3 Multiple Dimensional Time Series We extend Fearnhead’s algorithms to the case of multiple dimensional series. the observation at time i.

negative). The data on Y1 3 2 1 0 −1 −2 −3 0 100 200 300 Y2 3 2 1 0 −1 −2 −3 0 100 200 300 4 4 4 2 2 2 0 0 0 −2 −2 −2 −4 −4 −2 0 2 4 −4 −4 −2 0 2 4 −4 −4 −2 0 2 4 Figure 3. We are unable to identify any change if we look at each series individually. However. Multiple Dimensional Time Series 3.2 Dependent Sequences Correlation structures only exist when we have multiple series. independent. This is the main difference when we go from one dimension to multiple dimensions. their correlation structures are changed over segments(positive.Chapter 3.1: A simple example to show the importance of modeling chang- ing correlation structures. ’plot2DExample’. As shown in Figure 3.2. Hence we should model correlation structures whenever we are able to do so. 3. we have two series.1. Results are generated by three segments are generated from the following Gaussian distributions. On the 36 .1 A Motivating Example Let’s use the following example to illustrate the importance of modeling correlation structures.

∼ N (0. we are unable to identify any changes. in the first segment.75 −0. Σ3 =   . H is a n by q matrix of basis functions.Chapter 3. in the second segment. Ys+1:t = Hβ + ǫ (3.75 0 1 1  Yk where Σ1 =   1 0. On segment Ys+1:t we still have. V and W if the density function of A is A ∼ N (MA .2 Multivariate Linear Regression Models For multivariate linear regression models.75 1   As a result. A random m by n matrix A is Matrix-Gaussian distributed with parameters MA . V. Σ2 =  −0.16) analytically. Now Y is a n by d matrix of data.2) For simplicity. 37 . 1). β is a q by d matrix of regression parameters and ǫ is a n by d matrix of noise. W ) 1 1 P (A) = exp − trace((A − MA )T V −1 (A − MA )W −1 ) 2 (2π)mn/2 |V |n/2 |W |m/2 where M is a m by n matrix representing the mean. then we find their correlation structures are changed. we know how to model correlation structures. they are positive correlated. and we can calculate the likelihood function ( 2.2. they are independent. Σk )    1 1 0 0. N (0. the marginal distribution on each dimension is. 3. For example. they are negative correlated. and let n = t − s. Multiple Dimensional Time Series k-th segment.75  . in the last segment. Hence if we look at each dimension individually. we write Ys+1:t as Y . Here we need introduce the Matrix-Gaussian distribution. V is a m by m matrix representing covariance among rows and W is a n by n matrix representing covariance among columns. However if we consider them jointly. the same over all three segments.

Σ) P (β|D. 2 2 Here D = diag(δ1 . since we assume ǫ is independent across time(rows) but dependent across features(columns).Chapter 3.3) 1 1 exp − trace(β T D−1 βΣ−1 ) (3. And β has a MatrixGaussian distribution with mean 0 and covariance D and Σ. · · · . This hierarchical model can be illustrated by Figure 3. d)2N0 d/2 |Σ|(N0 +d+1)/2 d (3. δq ) and In is a n by n identity matrix.2. ǫ has a Matrix-Gaussian distribution with mean 0 and covariance In and Σ. Multiple Dimensional Time Series Similar as before we assume conjugate priors. Figure 3. Σ0 ) = = = 1 −1 exp − trace((Y − Hβ)T In (Y − Hβ)Σ−1 ) 2 (3. Hence we have. Finally covariance Σ has an Inverse-Wishart distribution with parameter N0 and Σ0 .5) where N0 ≥ d and Z(n.2: Graphical representation of Hierarchical structures of Multivariate Linear Regression Models P (Y |β. d) = π d(d−1)/4 i=1 Γ((n + 1 − i)/2) 38 . Σ) P (Σ|N0 .4) 2 (2π)qd/2 |D|d/2 |Σ|q/2 1 (2π)nd/2 |In |d/2 |Σ|n/2 |Σ0 |N0 /2 1 1 exp − trace(Σ0 Σ−1 ) 2 Z(N0 .

Σ0 ) dβ dΣ Now multiply ( 3. Σ0 ) = = P (Y. M. P (β|D. Σ0 ) dβ dΣ P (Y |β.17). Σ) 1 ∝ exp − trace(((Y − Hβ)T (Y − Hβ) + β T D−1 β)Σ−1 ) 2 1 ∝ exp − trace((Y T Y − 2Y T Hβ + β T H T Hβ + β T D−1 β)Σ−1 ) 2 Now let M P (∗) Then (∗) = β T (H T H + D−1 )β − 2Y T Hβ + Y T Y = β T M −1 β − 2Y T HM M −1 β + Y T Y = β T M −1 β − 2Y T HM M −1 β + Y T HM M −1 M T H T Y − Y T HM M −1 M T H T Y + Y T Y Using fact M = M T (∗) = = Hence P (Y |β. Σ|D. Σ)P (Σ|N0 . Σ) ∼ N (M H T Y. Multiple Dimensional Time Series By ( 2. N0 .3) and ( 3. Σ)P (β|D. Σ) 1 ∝ exp − trace(((β − M H T Y )T M −1 (β − M H T Y ))Σ−1 + Y T P Y Σ−1 ) 2 (β − M H T Y )T M −1 (β − M H T Y ) + Y T Y − Y T HM H T Y (β − M H T Y )T M −1 (β − M H T Y ) + Y T P Y = = (H T H + D−1 )−1 (I − HM H T ) = Y T Y − 2Y T Hβ + β T H T Hβ + β T D−1 β So the posterior for β is still Matrix-Gaussian with mean M H T Y and covariance M and W .4).6) 39 . P (Ys+1:t |q) = P (Y |D. β. we have. N0 .Chapter 3. P (Y |β. Σ)P (β|D. Σ) (3. Σ)P (β|D.

d)2N0 d/2 |Y T P Y + Σ0 |(n+N0 )/2 |M | |D| d 2 Z(n + N0 .5) and ( 3.9). d)2(n+N0 )d/2 1 (2π)nd/2 Z(N0 .7) P (Y |D. since InverseW ishart(ν.10) 40 . Σ)P (Σ|N0 . γ) ≡ InverseGamma( ν . Y T P Y + Σ0 ) (3.9) When d = 1 and by setting N0 = ν and Σ0 = γ.9) in log space as following. Σ0 ) dΣ |M | |D| −nd/2 d 2 |Σ0 |N0 /2 Z(n + N0 .Chapter 3. Σ) dβ (3. we rewrite ( 3. d) |Σ0 |N0 /2 |Y T P Y + Σ0 |(n+N0 )/2 Z(N0 . we have P (Ys+1:t |q) = P (Y |D. Σ)P (β|D.25) is just a special case of ( 3. Multiple Dimensional Time Series Then integrating out β. we have P (Y |D. ( 2.8) Then integrating out Σ. N0 . d) (3. Σ0 ) = = = π P (Y |D. In implementa2 2 tion. P (Σ) ∼ IW (n + N0 . Σ0 ) ∝ ∝ 1 exp(− trace((Y T P Y + Σ0 )Σ−1 )) 2 1 1 exp(− trace((Y T P Y + Σ0 )Σ−1 )) 2 |Σ|(n+N0 +d+1)/2 |Σ|n/2 |Σ|(N0 +d+1)/2 1 So the posterior for Σ is still Inverse-Wishart with parameter n + N0 and Y T P Y + Σ0 . Σ)P (Σ|N0 . Σ) = = = P (Y |β.7) 1 (2π)qd/2 |M |d/2 |Σ|q/2 1 exp(− trace(Y T P Y Σ−1 )) nd/2 |I |d/2 |Σ|n/2 (2π)qd/2 |D|d/2 |Σ|q/2 2 (2π) n 1 (2π)nd/2 |Σ|n/2 |M | |D| d 2 1 exp(− trace(Y T P Y Σ−1 )) 2 Now multiply ( 3. log(P (Ys+1:t |q)) − = n + N0 d N0 log(|Σ0 |) − log(|Y T P Y + Σ0 |) − (log|M | − log|D|) 2 2 2 d i=1 d nd log(π) + 2 log(Γ((n + N0 + 1 − i)/2)) − i=1 log(Γ((N0 + 1 − i)/2)) (3. γ ).

There are two kinds of graphs: directed and undirected.: + Hi+1. Λ is a diagonal matrix.: T T = Y1:i. There is no edge between node i and node j if and only if Λij = 0. there are four nodes.: Y1:i. In independent model. − nd log(π). T H1:i+1.: + Yi+1.: log(Γ((N0 + 1 − i)/2)) can be pre-computed. and we will talk about directed graphs later.: Y1:i+1. There is an edge between node 1 and node 2. since Λ12 = 0. As shown in Figure 3.) Then the independent model and the multivariate linear regression model are just two special cases of Gaussian graphical models. In multivariate linear regression model.1 Structure Representation In an undirected graph. At each iteration. 2 d i=1 log(Γ((n+N0 +1−i)/2)) and Y T P Y can be computed by the following rank one update. 3. especially conditional independence structures.: Hi+1. since in higher dimensions. which represents the graph with all isolated nodes. Define the precision matrix Λ = Σ−1 . Multiple Dimensional Time Series To speed up. which is the inverse of covariance matrix Σ.3 Gaussian Graphical Models Gaussian graphical models are more general tools to model correlation structures. full covariance models could be too complex when the dimension d is large since the number of parameters needed by the models are O(d2 ).: + Hi+1.: Yi+1. Here we use undirected graphs as examples. especially when d becomes large.: Y1:i.3. sparse structures are more important.: H1:i. which represents the graph with fully connected nodes.3. Unlike independent models which are too simple. 41 .: T T = H1:i. Λ is a full matrix. (”X” denotes non-zero entries in the matrix. − d log|D|.Chapter 3. 2 and d i=1 N0 2 log(|Σ0 |). M 3. Gaussian graphical models provide a more flexible and better solution between these two extremes.: T Y1:i+1.: H1:i+1. each node represents a random variable.: T H1:i+1.: T T = H1:i.: Yi+1.: Y1:i+1.

Chapter 3. Multiple Dimensional Time Series

Figure 3.3: A simple example to show how to use a graph to represent correlation structures. We first compute precision matrix Λ. Then from Λ, if Λij = 0, then there is no edge on from node i to node j. ”X” represent non-zero entries in the matrix.

3.3.2

Sliding Windows and Structure Learning Methods

In order to use Fearnhead’s algorithms, we need to do the following two things: • generate a list of possible graph structures, • compute the marginal likelihood given a graph structure. For the first one, the number of all possible undirected graphs is O(2d ). Hence it is infeasible to try all possible graphs. We can only choose a subset of all graphs. How to choose them is a chicken and egg problem: if we knew the segmentation, then we could run a fast structure learning algorithm on each segment, but we need to know the structures in order to compute the segmentation. Hence we propose the following sliding window method to generate the set of graphs. We slide a window of width w = 0.2N across the data, shifting 42
2

Chapter 3. Multiple Dimensional Time Series

Figure 3.4: Gaussian graphical model of independent models.

Figure 3.5: Gaussian graphical model of full models.

by s = 0.1w at each step. This will generate a set of about 50 candidate segmentations. We can repeat this for different setting of w and s. Then we run a fast structure learning algorithm on each windowed segment, and sort resulting set of graphs by frequency of occurrence. Finally we pick the top M = 20 to form the set of candidate graphs. We hope this set will contain the true graph structures or at least the ones that are very similar. As shown in Figure 3.6, suppose the vertical pink dot lines are true segmentations. When we run a sliding window inside of a true segment, (eg, the red window), we hope the structure we learn from this window is similar to the true structure of this segment. And when we shift the window one step, (eg, shifting to the blue window), if it is still inside of the same segment, we hope the structure we learn is the same or at least very similar to the one we learn in the previous window. Of course we will have the window that overlaps two segments (eg, the black window), then we know the structure we learn from this window will represent neither segments. However, this brings no harm since these ”wrong” graph structures will later receive negligible posterior probability. We can choose the number of the graphs we want to consider based on how 43

Chapter 3. Multiple Dimensional Time Series

Figure 3.6: An example to show sliding window method.

fast we want the algorithms to run. Now on each windowed segment, we need to learn the graph structure. The simplest method is the thresholding method. It works as following, 1. compute the empirical covariance Σ, 2. compute the precision Λ = Σ−1 ,
Λ 3. compute the partial correlation coefficients by ρij = √ ij Λii Λjj

,

4. set edge Gij = 0 if |ρij | < θ for some threshold θ (eg, θ = 0.2). The thresholding method is simple and fast, but it may not give a good estimator. We can also use the shrinkage method discussed in [23] to get a better estimator of Σ, which helps regularize the problem when the segment is too short and d is large. If we further pose the sparsity on the graph structure, we can use convex optimization techniques discussed in [2] to compute the MAP estimator for the precision Λ under a prior that encourage many entries to go to 0. We first form

44

(K. 2.11) by block coordinate descent algorithms. obslik. 1. For non-decomposable graphs. λ. Multiple Dimensional Time Series the following problem. Comparing with the multivariate linear regression models. everything is the 45 . while not converged do 5.3 Likelihood Functions After we get the set of candidate graphs. and ρ > 0 is the regularization parameter which controls the sparsity of the graph. S = make overlapping segments from Y1:N by sliding windows 3. Given a decomposable graph. observation model obslik and parameter ρ for graph structure learning methods.3. we can only do so for decomposable graphs in undirected graphs. hyperparameters λ for change point rate and θ for likelihood functions. the model inferences on each segment m1:K 3. ||Λ||1 = Σij |Λij |. end while 8. we need to be able to compute marginal likelihood for each graph. ρ) : s ∈ S1:K 7. the set of segments S1:K .11) where Σ is the empirical covariance. S1:K ) = segment(Y1:N . We summarize the overall algorithm as following. Output: the number of segments K. θ) 6. However. G = estimate graph structure estG(Ys . ρ) : s ∈ S 4. Inference model mi = maxm P (m|Ys ) for i = 1 : K 9.Chapter 3. G = estG(Ys . Then we can solve ( 3. we will assume the following conjugate priors. maxΣ≻0 log(det(Λ)) − trace(ΣΛ) − ρ||Λ||1 (3. Input: data Y1:N . we will use approximation by adding minimum number of edges to make it decomposable.

Σ|D.14) 1 exp(− trace(Y T P Y Σ−1 )) 2 When P is positive definite. b0 . Σ0 ) = = P (Y. P (Ys+1:t |q) = P (Y |D. Σ0 ) = C P (ΣC |b0 . Then the joint density of HyperInverse-Wishart can be decomposed as following.17). Σ) are the same as before. (In 46 where M = (H T H + D−1 )−1 and P = (I − HM H T ) are defined before. Multiple Dimensional Time Series same except the prior on Σ is now a Hyper-Inverse-Wishart instead of InverseWishart.13) where C is each component and S is each separator. Σ) = = P (Y |β. The same applies to each S. b0 . we have. Σ) ∼ HIW (b0 . P (Y |D. In . Σ0 ) dβ dΣ Since P (Y |β. Hence we still have the following model. P has the Cholesky decomposition QT Q. When a graph G is decomposable. β Σ ∼ N (0. Σ)P (β|D. Σ0 ) dβ dΣ P (Y |β. we have. and b0 = N0 + 1 − d > 0. β.12) where D. Σ0 ) (3. D. Σ) The priors are. ΣS ) 0 S (3. ΣC ) and ΣC is the block of Σ0 0 0 corresponding to ΣC . by integrating out β. Σ) dβ 1 (2π)nd/2 |Σ|n/2 |M | |D| d 2 (3. By ( 2. Σ)P (β|D. (let Ys+1:t as Y and n = t − s) Y = Hβ + ǫ ǫ ∼ N (0. . we have ΣC ∼ IW (b0 + |C| − 1. G can be decomposited into a list of components and a list of separators. and each component and separator itself is a clique.Chapter 3. Σ)P (Σ|b0 . ΣC ) 0 P (ΣS |b0 . P (Σ|b0 . In are defined before. by the standard graph theory [8]. Σ)P (β|D. For each C.

This is very important in high dimension. then ( 3. Now multiply ( 3. or just P (X|Σ) ∼ N (0. we can let the prior D = c ∗ I.14) can rewrite as the following. the posterior of ΣC ∼ IW (b0 + |C| − 1 + n.16). Hence by integrating out Σ.) Now let X = QY . |C|)−1 47 . we have XC ∼ N (0. |S|)−1 Z(b0 + |C| − 1. this likelihood P (X|Σ) can be decomposited as following [8]. Multiple Dimensional Time Series practice. ΣC + XC XC ). As a result. we can make sure P is always positive definite. Σn ) (3. b0 . bn . we have. P (X|Σ)P (Σ|b0 . The 0 P (Y |D. the posterior of Σ ∼ HIW (b0 +n. In . b0 . The same applies to each S. For each C. Σ) = = = 1 (2π)nd/2 |Σ|n/2 |M | |D| |M | |D| d 2 |M | |D| d 2 1 exp(− trace(X T XΣ−1 )) 2 d 2 1 1 −1 exp(− trace(X T In XΣ−1 )) 2 (2π)nd/2 |In |d/2 |Σ|n/2 P (X|Σ) (3. Σ0 ) = C P (XC |ΣC )P (ΣC |b0 .Chapter 3. Since we use decomposable graphs. b. Σ0 ) dΣ |M | |D| |ΣC | d 2 (π)−nd/2 h(G.15) where P (X|Σ) ∼ N (0.16) where C is each component and S is each separator. Σ). ΣC ) and ΣC is the block of Σ corresponding to XC . Σ). By setting c into a proper value. P (X|Σ) = C P (XC |ΣC ) S P (XS |ΣS ) (3. ΣC ) 0 S S P (XS |ΣS )P (ΣS |b0 .17) same holds for each S. P (Ys+1:t |q) = P (Y |D. P (Y |D. Σ0 ) = = where h(G.18) b+|C|−1 2 b+|S|−1 2 |ΣS | Z(b0 + |S| − 1. Σ0 ) (3. Σ) = C S T Then for each C. where I is the identity matrix and c is a scalar. Σ0 +X T X). Σ)P (Σ|b0 . we have. Σ0 ) h(G.13) and ( 3.

d) Σn bn = π d(d−1)/4 i=1 Γ((n + 1 − i)/2) = Σ0 + Y T P Y = b0 + n In log space. Σ) over different graphs. full and Gaussian graphical models). The independent model thinks there is only one segment. Σn )) 2 2 (3. Σ0 )) − log(h(G.4 Experimental Results Now we will show experimental results on some synthetic and real data sets. log(P (Ys+1:t |q)) = − nd d log(π) − (log|M | − log|D|) + log(h(G. and set the hyper parameter N0 = d and Σ0 = I on Σ where d is the number of the dimension (in this case d = 2) and I is the identity matrix in the full model. Multiple Dimensional Time Series d Z(n. b0 . since the posterior probability of the number of segments is mainly at 1.1 Synthetic Data First.4. Also notice that h(G.Chapter 3. we don’t need to re-evalue it. Then later when we meet the same term. We set the hyper parameter ν = 2 and γ = 2 on σ 2 for each dimension in the independent model. We know there are three segments. The other two models both think there are three segments.7. we will cache all local terms. b. When we evaluate h(G. Both models detect positions of change points that are close to the ground truth with some 48 . 3. and set the hyper parameter b0 = 1 and Σ0 = I on Σ in the Gaussian graphical model. b. Σ) contains many local terms. ( 3. we revisit the 2D synthetic data mentioned in the beginning of this chapter. From Figure 3. bn . the raw data are shown in the first two rows. We run it with all three different models (independent. Hence it detects no change points.18) can re-write as following.19) We will also use the similar rank one update as mentioned earlier. 3.

5 0. From Figure 3. The full model thinks there might be one or two segments. We generate data in a similar way. Multiple Dimensional Time Series uncertainty. To save space. we only show the first two dimensions since the rest are similar. And we set the hyper parameters in a similar way.5 0 0 100 200 300 0 0 2 4 6 Figure 3. From the 4th row to the 6th row. but this time we have 10 series.5 0.Chapter 3.8.5 0 0 100 200 300 Indep 1 1 Indep : P(N) 0.7: Results on synthetic data 2D. Also it detects the position of change point is close to 49 .5 0. the full model and the Gaussian graphical model respectfully. The 3rd row is the ground truth of change points. Now let’s look at a 10D synthetic data. Results are generated by ’show2D’. The top two rows are raw data. Y1 2 0 −2 0 100 200 300 Y2 2 0 −2 0 1 100 200 300 Truth 0.5 0 0 100 200 300 0 0 2 4 6 Full 1 1 Full : P(N) 0. results are the posterior distributions of being change points at each position and the number of segments generated from the independent model. we see as before the independent model thinks there is only one segment hence detects no change point.5 0 0 100 200 300 0 0 2 4 6 GGM 1 1 GGM : P(N) 0.

As mentioned earlier. The 50 . the MAP structure GM AP = argmaxG P (G|Ys:t ). Again. and the independent model cannot model it. there are two graphs that both are close. However on the 3rd segment.10. We plot all candidate graph strutures selected by sliding windows in Figure 3. In this case. Some are very different from the true structures which might be generated by the window overlapping two segments. As a result. Note. Then based on the segments estimated by Gaussian graphical model. GGM picks the graphs that are most close to the true structure. Hence we see some uncertainty over the edges. Comparing with the results from 2d and 10d cases. In this case. the candidate list contains 30 graphs. For the 1st segment.Chapter 3. we notice that the true graph structure is not included in the candidate list. only the Gaussian graphical model think there are three segments.8. because these ”useless” graphs all get very low posterior probabilities. Finally let’s look at a 20D synthetic data. computed using Bayesian model averaging (Gray squares represent edges about which we are uncertain) in the bottom two rows of Figure 3. otherwise it is a white square. the true structure is included in the candidate list. Multiple Dimensional Time Series two ends which is wrong. and the marginal edge probability. then the corresponding entry is a black square. we see as before the independent model thinks there is only one segment hence detects no change point. GGM can identify it correctly. we plot the posterior over all graph structures. we find these graphs only slow the algorithms but won’t hurt their results. and detects positions that are very close to the ground truth. P (G|Ys:t ). In this case. we find that the independent model fails to detect in all cases since the changes are on the correlation structures. The full model clearly oversegments the data. such that if node i and j are connected. we plot the graph structure by its adjacency matrix. the true graph structure. P (Gi. However. and detects positions that are very close to the ground truth. In this case. To save space. only the Gaussian graphical model correctly segments. we only show the first two dimensions since the rest are similar. From Figure 3.j = 1|Ys:t ). We generate data in a similar way and set the hyper parameters in a similar way. we have 30 graphs in total.9. and we see very little uncertainty in the posterior probability.

Portfolios Data Now let’s look at two real data sets which record annually rebalanced valueweighted monthly returns from July 1926 to December 2001 total 906 month of U.2 U. and set the hyper parameter N0 = 5 and Σ0 = I on Σ in the full model. second. From the 7th row to the 9th row. and they agree on 4 out of 6 change points.S. and it is more confident in high dimension cases since the sparsity structures are more important in high dimension. 3. based on the segmentation estimated by GGM. In this data. finance and other) and the second has thirty industries. sliding windows generates 17 graphs. third. results are from independent model. only 2 change points that we discovered coincide with the results of Talih (namely 1959 and 1984).4. we plot the posterior over all 17 graphs 51 . We find that full model and GGM have similar results since they think there are roughly 7 segments. The first data set has five industries (manufacturing. but fails in high dimension cases. In this case. We set the hyper parameter ν = 2 and γ = 2 on σ 2 for each dimension in the independent model. The raw data are shown in the first five rows of Figure 3. they use reversible jump MCMC. There are many possible reasons for this. Note. The first data set has been previously studied by Talih and Hengartner in [24]. full model and GGM respectfully.S. utilities. The Gaussian graphical model can detect on all cases. shops.11. their model requires to pre-specify the number of change points. We think their results could be oversegmented.Chapter 3. Talih’s result is shown on the 6th row. they assume a different prior over models (the graph changes by one arc at a time between neighboring segments). because the full model has more parameters to learn. the positions of change points are very sensitive to the number of change points. we do not know the ground truth and the true graph structures. portfolios data. Multiple Dimensional Time Series full model can detect in low dimensional case. First. As usual. In this problem. Independent model seems to be over-segmented. and set the hyper parameter b0 = 1 and Σ0 = I on Σ in the Gaussian graphical model.

together with a ground truth segmentation (created by human experts) are shown in Figure 3.02 on σ 2 for each dimension in the and cosθ to overcome the discontinuity as the bees moves between −π to π. computed using Bayesian model averaging (Gray squares represent edges about which we are uncertain) in the bottom three rows of Figure 3. We preprocessed the data by replacing θ with sinθ set the hyper parameter ν = 2 and γ = 0. the MAP graph structure GM AP = argmaxG P (G|Ys:t ). as it moves around an enclosure. From the MAP graph structures detected from our results. We also show the MAP graph structures detected on three consecutive regions. 20]. Multiple Dimensional Time Series structures.11. 3. This consists of the x and y coordinates of a honey bee. P (Gi.13 and 3. the graph structure is changed by three arcs. as observed by an overhead camera. rather than one. and its head angle θ.j = 1|Ys:t ). The second data set is similar to the first one. We also show the results of segmenting this using a first-order auto-regressive AR(1) model. In the third row. we only show the result from the Gaussian graphical model in Figure 3. between the third segment and the fourth segment in our results.14. We still don’t know the ground truth and the true graph structures. and the marginal edge probability. but has 30 series. we know that in higher dimension. P (G|Ys:t ). using independent model or with full covariate model. We can see that the graph structures are very sparse and they differ more than one arcs.4. Two examples of the data. We 52 . the Gaussian graphical model performs much better than the independent model and the full model.12. we find Talih’s assumption on the graph structure changing by one arc at a time between neighboring segments may not be true.Chapter 3.3 Honey Bee Dance Data Finally. Since from the results of synthetic data. For example. we show the resulted posterior distribution of being change points at each position and the number of segments generated by the Gaussian graphical model. we analyse the honey bee dance data set used in [19. We show the first two industry raw data in the top two rows.

53 . From both Figures. and set the hyper parameter N0 = 4 and Σ0 = 0. the results from full covariate model are very close to the ground truth. Multiple Dimensional Time Series independent model. but the results from independent model are clearly over segmented.01 ∗ I on Σ in the full model.Chapter 3.

results are the posterior distributions of being change points at each position and the number of segments generated from the independent model. and the marginal edge probability. Results are generated by ’show10D’. we plot the posterior over all graph stuctures.5 0 1 0. The 3rd row is the ground truth of change points. In the bottom 2 rows.j = 1|Ys:t ) on 3 segments detected by the Gaussian graphical model.j)=1) MAP P(G(i.5 0 Truth 1 0.5 0 1 0 100 Full 200 300 0 1 0 5 Full : P(N) 0. 54 . The top two rows are the first two dimensions of raw data.5 0 P(G) Truth 0 10 20 30 0 10 20 30 0 10 20 30 P(G(i.8: Results on synthetic data 10D. the MAP structure GM AP = argmaxG P (G|Ys:t ).5 0. Multiple Dimensional Time Series Y1 5 0 −5 0 5 0 −5 0 1 0.j)=1) MAP Figure 3.5 0 1 100 100 Y2 200 300 Truth 200 300 0 100 Indep 200 300 Indep : P(N) 1 0.5 0 0 100 GGM 200 300 0 2 4 6 GGM : P(N) 1 0.j)=1) MAP P(G(i. P (G|Ys:t ). From the 4th row to the 6th row.5 0 0. P (Gi.5 0 0 100 200 300 0 2 4 6 P(G) 1 0.Chapter 3.5 0 P(G) Truth 1 0. the true graph structure. the full model and the Gaussian graphical model respectfully.

Multiple Dimensional Time Series Figure 3. 55 .9: Candidate list of graphs generated by sliding windows in 10D data. Results are generated by ’show10D’. otherwise it is a white square.Chapter 3. such that if node i and j are connected. then the corresponding entry is a black square. We plot the graph structure by its adjacency matrix.

56 .5 0 P(G) Truth 1 0. P (Gi.10: Results on synthetic data 20D. The top two rows are the first two dimensions of raw data.5 0 0.j = 1|Ys:t ) on 3 segments detected by the Gaussian graphical model. the true graph structure. the MAP structure GM AP = argmaxG P (G|Ys:t ). we plot the posterior over all graph stuctures.Chapter 3. and the marginal edge probability.5 0 0. From the 4th row to the 6th row.j)=1) MAP P(G(i. results are the posterior distributions of being change points at each position and the number of segments generated from the independent model. Results are generated by ’show20D’.5 0 1 1 0.5 0 0. In the bottom 2 rows.j)=1) MAP Figure 3.5 0 1 1 0. Multiple Dimensional Time Series Y1 5 0 −5 0 5 0 −5 0 1 0. P (G|Ys:t ). the full model and the Gaussian graphical model respectfully.5 0 1 1 0.j)=1) MAP P(G(i.5 0 100 100 Y2 200 300 Truth 200 300 0 100 Indep 200 300 Indep : P(N) 0 100 Full 200 300 0 2 4 6 Full : P(N) 0 100 GGM 200 300 0 2 4 6 GGM : P(N) 0 100 200 300 0 2 4 6 P(G) 1 0.5 0 Truth 1 0.5 0 P(G) Truth 0 20 40 0 20 40 0 20 40 P(G(i. The 3rd row is the ground truth of change points.

j = 1|Ys:t ).S.5 0 193410 194302 195106 195910 196802 197606 198410 199302 200106 193410 194302 195106 195910 196802 197606 198410 199302 200106 193410 194302 195106 195910 196802 197606 198410 199302 200106 193410 194302 195106 195910 196802 197606 198410 199302 200106 193410 194302 195106 195910 196802 197606 198410 199302 200106 193410 194302 195106 195910 196802 197606 198410 199302 200106 1 0. From the 7th row to the 9th row.5 0 200106 0 0 10 20 0 10 0 10 0 10 0 10 0 10 0 10 0 10 Figure 3.5 0 195106 1 195910 0. Total there are 906 month.5 0 1 192606 0.5 0 1 192606 0.5 0 200106 1 0 0. In the bottom three rows.5 0 200106 1 0 0. The 6th row is the result by Talih. P (G|Ys:t ).5 193410 194302 195106 195910 196802 197606 198410 199302 10 20 193410 194302 195106 195910 196802 197606 198410 199302 10 20 193410 1 0. 57 .5 0 197606 1 0. Results are generated by ’showPortofios’.Chapter 3. results are the posterior distributions of being change points at each position and the number of segments generated from the independent model. the full model and the Gaussian graphical model respectfully.5 0 1 192606 0.5 0 194302 1 0. P (Gi. we plot the posterior over all graph structures. The top five rows are the raw data representing annually rebalanced value-weighted monthly return in 5 industries from July 1926 to December 2001. and the marginal edge probability. the MAP graph structure GM AP = argmaxG P (G|Ys:t ).11: Results on U.5 0 198410 1 199302 0.5 0 1 192606 0. portfolios data of 5 industries.5 0 196802 1 0. Multiple Dimensional Time Series 40 20 0 −20 −40 40 192606 20 0 −20 −40 192606 20 0 −20 −40 50 192606 0 −50 192606 20 0 −20 1 192606 0.

2 0 192606 1 0.2 0 GMM : P(N) 194302 195910 197606 199302 192606 194302 195910 197606 199302 0 5 10 15 Figure 3.8 0.Chapter 3.6 0. The third row is the result of the posterior distribution of being change points at each position and the number of segments generated from the Gaussian graphical model. In the fourth row. Results are generated by ’showPort30’.6 0. we show the MAP graph structures in 3 consecutive regions detected by the Gaussian graphical model.12: Results on U. Multiple Dimensional Time Series 30 20 10 0 −10 −20 −30 192606 194302 195910 197606 199302 192606 194302 195910 197606 199302 60 40 20 0 −20 −40 192606 194302 195910 197606 199302 192606 194302 195910 197606 199302 GGM 1 0. 58 .S.4 0.4 0. We show the first two industry raw data in the top two rows.8 0. portfolios data of 30 industries.

y coordinates of honey bee and sin.5 0.5 0 0 1 0 100 200 300 Full 500 600 700 0 20 40 60 Full : P(N) 1 0.5 0 1 0 100 200 300 Indep 500 600 700 Indep : P(N) 1 0. The 5th row is the ground truth.5 0. cos of its head angle θ. The 6th row is the result from independent model and the 7th row is the result from full covariate model. The top four rows are the raw data representing x. Results are generated by ’showBees(4)’.Chapter 3.5 0 0 0 100 200 300 500 600 700 0 20 40 60 Figure 3. Multiple Dimensional Time Series X 2 0 −2 0 2 0 −2 0 1 0 −1 0 1 0 −1 0 1 100 200 300 100 200 300 100 200 300 100 200 300 Y 500 600 700 Sin(T) 500 600 700 Cos(T) 500 600 700 Truth 500 600 700 0. 59 .13: Results on honey bee dance data 4.

Results are generated by ’showBees(6)’.5 0.14: Results on honey bee dance data 6. The 6th row is the result from independent model and the 7th row is the result from full covariate model. The top four rows are the raw data representing x.5 0 0 0 100 200 400 500 600 0 20 40 Figure 3. cos of its head angle θ. y coordinates of honey bee and sin.5 0 1 0 100 200 Indep 400 500 600 Indep : P(N) 1 0.5 0.5 0 0 1 0 100 200 Full 400 500 600 0 20 40 Full : P(N) 1 0. The 5th row is the ground truth.Chapter 3. Multiple Dimensional Time Series X 2 0 −2 0 100 200 Y 400 500 600 2 0 −2 0 1 0 −1 0 1 0 −1 0 1 100 200 100 200 100 200 Sin(T) 400 500 600 Cos(T) 400 500 600 Truth 400 500 600 0. 60 .

However. We modeled the correlation structures using undirected Gaussian graphical models. Hence if there is a fast way to learn structure of directed graphs.Chapter 4 Conclusion and Future Work In this thesis. in addition to detecting change points. This is particularly useful in high dimensional cases because of sparsity. This allowed us to detect changes on correlation structures. the conditional independence property sometime might not be reasonable. Finally. we don’t have a fast way to generate candidate list of graph structures since currently all structure learning algorithms for directed graphs are too slow. we can use directed graphs. etc. as well as changes on mean. we extended Fearnhead’s algorithms to the case of multiple dimensional series. This allowed us to estimate the changing topology of dependencies among series. We can also model the correlation structures by directed graphs. Or we might want to model other constrains on changing over consecutive segments. 61 . In directed graphs. we can compute the marginal likelihood for any graph structures. variance. This requires us to come up with new models.

Bibliography [1] Jim Albert and Patricia Williamson. Product partition models for change point problems. 88(421):309– 319. Stern H. A. 20(1):260–279. [3] Daniel Barry and J.. and Paul Fearnhead. Albert J. [4] Daniel Barry and J. M. Alexandre d’Aspremont. 55:41–50. Carlin. 1993. 146:2–7. and Morris C. 41(2):389– 405. [5] Albright S. 2001. Hierarchical Bayesian analysis of changepoints problems. Journal of the American Statistical Association. Convex optimization techniques for fitting sparse Gaussian graphical models.. Applied Statistics. 1992. Laurent El Ghaoui. Gelfand. 1992. and Adrian F. The American statistician. A Bayesian analysis for change point problems. N. 1999. 1993. Smith. [7] J. 2006. S. Proceedings of the 23rd international conference on Machine learning. Carpenter. IEE Proceedings of Radar and Sonar Navigation. 88:1175–1196. Using model/data simulation to detect streakiness. The Annals of Statistics. and Georges Natsoulis. Hartigan. A statistical analysis of hitting streaks in baseball.. Alan E. Hartigan. Peter Clifford. C. [6] Bradley P. 62 . A. pages 89–96. [2] Onureena Banerjee. Journal of the American Statistical Association. An improved particle filter for non-linear problems.

Journal of the Royal Statistical Society: Series B. 65:887–899. Exact and efficient Bayesian inference for multiple changepoint problems. K. [10] D. Denison. Bayesian methods for nonlinear classification and regression. [12] Arnaud Doucet. F. PhD thesis. Biometrika. Biometrika. 60:333–350. Christopher C. Smith. Statistics and Computing. 2005. Mallick. 66(1):191–193.Bibliography [8] Carlos Marinho Carvalho. A note on the intervals between coal-mining disasters. 2005. [13] Paul Fearnhead. and A. 1979. 1995. M. G. [9] Nicolas Chopin. Structure and sparsity in high-dimensional multivariate analysis. 82(4):711–732. B. Online inference for hidden Markov models via particle filters. [15] Paul Fearnhead and Peter Clifford. [11] David G. Bani K. Jarrett. 1998. Sequential Monte Carlo methods in practice. 2003. 16(2):203–213. [17] R. and Adrian F. Baffins. Chichester. Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Smith. 2006. Mallick. [16] Paul Fearnhead and Zhen Liu. Automatic Bayesian curve fitting. Dynamic detection of change points in long time series. 2002. [14] Paul Fearnhead. John Wiley & Sons. [18] Peter J. The Annals of the Institute of Statistical Mathematics. Online inference for multiple change point problems. M. Springer-Verlag. Ltd. Duke University.Green. 2001. Exact Bayesian curve fitting and signal segmentation. Holmes. New York. and Neil Gordon. 63 . Journal of Royal Statistical Society Series B. 2006. Nando De Freitas. T. West Sussex. Denison. 53:2160–2166. T. England. G. IEEE Transactions on Signal Processing. 2006.

4. Rehg. James M. [26] Tae Young Yang. and William J. Monte Carlo Statistical Methods. New York. 2005. 64 . 2002. 2004. 167:627– 637. Kuo. Journal of Computational and Graphical Statistics. Bayesian curve fitting using MCMC with application to signal segmentation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Rehg. Christophe Andrieu. James M. [23] Juliane Schafer and Korbinian Strimmer. Learning and inference in parametric switching linear dynamic systems. Parameterized duration modeling for switching linear dynamic systems. [21] Elena Punskaya. and Frank Dellaert. 10:772–785. and Frank Dellaert. Journal of Royal Statistical Society Series B. 2:1694– 1700. Bayesian binary segmentation procedure for a Poisson process with multiple changepoints. A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics. IEEE Transactions on Signal Processing. Yang and L. [25] T. Bayesian binary segmentation procedure for detecting streakiness in sports. Robert and George Casella. Y. 2006. IEEE International Conference on Computer Vision. Fitzgerald.Bibliography [19] Sang Min Oh. 2:1161–1168. 2005. [24] Makram Talih and Nicolas Hengartner. Statistical Applications in Genetics and Molecular Biology. 67:321–341. 2005. Structural learning with timevarying components: Tracking the cross-section of financial time series. [22] Christian P. 2001. 2004. Tucker Balch. Springer. 50:747–758. [20] Sang Min Oh. Arnaud Doucet. Journal of Royal Statistical Society Series A.

Sign up to vote on this title
UsefulNot useful