Dong DING
Department of Mathematics
Imperial College
180 Queen’s Gate, London SW7 2BZ
December 2018
I certify that this thesis, and the research to which it refers, are the product
of my own work, and that any ideas or quotations from the work of other
people, published or otherwise, are fully acknowledged in accordance with
the standard referencing practices of the discipline.
Signed:
Copyright
The copyright of this thesis rests with the author and is made available
under a Creative Commons Attribution Non-Commercial No Derivatives li-
cence. Researchers are free to copy, distribute or transmit the thesis on the
condition that they attribute it, that they do not use it for commercial pur-
poses and that they do not alter, transform or build upon it. For any reuse
or redistribution, researchers must make clear to others the licence terms of
this work.
Thesis advisor: Professor Axel Gandy Dong DING
Abstract
Monte Carlo methods are useful tools to approximate the numerical result
of a problem by random sampling when its analytic solution is intractable
or computationally intensive. The main focus of this work is to investigate
Monte Carlo methods in two areas of inference problems: hypothesis testing
and posterior analysis in a hidden Markov model (HMM).
The first part of this thesis focuses on the decision of the p-value with respect
to a fixed threshold via Monte Carlo simulations in a statistical hypothesis
test. We wish to control the resampling risk, which is the probability of
obtaining a different test decision from the true one based on the unknown
p-value. We present confidence sequence method (CSM), a simple Monte
Carlo testing procedure which bounds the resampling risk uniformly. CSM
is useful due to its simple implementation and comparable performance to
its competitors.
TPS and TPE both construct an auxiliary tree for recursively splitting the model
into sub-models. The root of the tree stands for the target distribution of the
model. We propose different forms of intermediate target distributions for the
sub-models associated with the non-root nodes, which are crucial to sampling
quality. For the sampling process, we generate initial samples independently
at the leaf nodes. Then we recursively merge these samples along the
tree until reaching the root. Each merging step involves importance sampling
for the (intermediate) target distribution. A more adaptive design of the
algorithms and an improved accuracy compared to their competitors make
them useful alternatives in practice.
To my family and friends.
Acknowledgments
First and foremost, I would like to sincerely thank my supervisor, Prof. Axel
Gandy, for his great expertise, support and patience. Over the years, he has
not only guided and motivated me with constructive ideas in my research, but
also advised me on academic writing, time management and career planning,
which I really appreciate. I feel very lucky to have met such an excellent PhD
supervisor, and I believe the research skills I learned from him, along with
his guidance and encouragement, will help me in the future.
Moreover, I would like to thank him for securing the college scholarship for
me.
I would also like to thank Dr. Georg Hahn for his useful suggestions and
patience on our submitted journal papers as well as on my research. I still
remember the day he went through every paragraph of my first ever collaborative
paper to give me very detailed advice. Furthermore, I would like to thank
Prof. David Van Dyk and Dr. Nikolas Kantas for their useful comments on
my early and late stage assessments. I would like to thank Jessica Zhuang,
Longjie Jia, Shijing Si, Nanxin Wei, Din-Houn Lau, Xue Lu, Ricardo Monti,
Jeff Leong, Zhana Kuncheva, Diletta Martinelli, Dimos Tsagkrasoulis, Xiyun
Jiao, Francois-Xavier Briol and Louis Ellam for bringing a memorable re-
search environment at Huxley 526 and within the maths department.
I would like to thank my mum and dad for their continuous support and
understanding. They often visit me with great care and cook delicious
food. The constant family reunions never let me feel lonely in the
UK.
I would like to thank all my friends for all the memorable moments
during my PhD study. In particular, I have always been in touch with Renda
Gu, Weiyi Huang, Yiwen Hu, Cecilia Li and Yucheng Shi to share our life
stories. It has also been a great pleasure to work and build friendships at Imperial
CSSA with Tianyu Cheng, Lily Lin, Chris Cheung, Runzhi Zhou, Sizhe Zhou
and Yiqun Huan.
Finally, I would like to express my thanks for receiving the President’s PhD
Scholarship of Imperial College (formerly known as the ‘Imperial College PhD
Scholarship’) as financial support.
List of figures
3.8 CDF of the smoothing, filtering and sampling distribution of TPS-L in the non-linear HMM . . . 142
5.1 Update of the auxiliary tree of TPS in the on-line setting . . . 203
List of tables
4.1 Options in TPE regarding the prior information and the combination method of the overlapping parameters . . . 192
4.2 Simulation results of the parameter estimation algorithms in the HMM . . . 194
Contents
1 Introduction 1
1.1 Preamble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aims of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Overview of the Chapters . . . . . . . . . . . . . . . . . . . . 9
1.4 List of Publications . . . . . . . . . . . . . . . . . . . . . . . . 11
2.9 Extension to Multiple Thresholds . . . . . . . . . . . . . . . . 46
2.9.1 P-value Buckets and Resampling Risk . . . . . . . . . . 47
2.9.2 General Construction of the Algorithms . . . . . . . . 50
2.9.3 Multi-threshold CSM and SIMCTEST . . . . . . . . . 51
2.9.4 Non-stopping Regions of the P-value Buckets . . . . . 56
2.10 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.10.1 Comparison of Penguin Pairs on Two Islands . . . . . . 59
2.10.2 Two-way Contingency Table . . . . . . . . . . . . . . . 61
2.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.7 Intermediate Target Distributions . . . . . . . . . . . . . . . . 108
3.7.1 Distribution Suggested by Lindsten et al. (2017) (TPS-L) . . . 109
3.7.2 Estimates of the Filtering Distributions (TPS-EF) . . . 110
3.7.3 Kullback–Leibler Divergence in TPS . . . . . . . . . . 112
3.7.4 Estimates of the Smoothing Distributions (TPS-ES) . . 115
3.7.5 Intermediate Target Distributions at Leaf Nodes . . . . 117
3.7.6 Exact Filtering Distributions (TPS-F) . . . . . . . . . 120
3.8 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.8.1 Definitions and Properties of RESS and MRESS . . . . 122
3.8.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 127
3.9 Simulation Study in a Linear Gaussian HMM . . . . . . . . . 130
3.9.1 Model Description and Metrics . . . . . . . . . . . . . 130
3.9.2 Simulation Results . . . . . . . . . . . . . . . . . . . . 132
3.10 Simulation Study in a Non-linear HMM . . . . . . . . . . . . . 134
3.10.1 Model Description and Metrics . . . . . . . . . . . . . 135
3.10.2 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 135
3.10.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.10.4 Comparison between TPS and Other Algorithms . . . 140
3.10.5 Comparison between TPS-EF and TPS-ES . . . . . . . 143
3.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.4.2 Sampling Procedure . . . . . . . . . . . . . . . . . . . 160
4.5 Intermediate Target Distributions . . . . . . . . . . . . . . . . 163
4.5.1 Relation to Consensus Monte Carlo . . . . . . . . . . . 163
4.5.2 Sub-HMMs with Original Priors (TPE-O) . . . . . . . 166
4.5.3 Sub-HMMs with Estimated Prediction Priors (TPE-EP) . . . 167
4.6 Combination of TPE and SIR-PE . . . . . . . . . . . . . . . . 171
4.7 Construction of Transformation Functions . . . . . . . . . . . 172
4.7.1 A Toy Model with Conditional Independent States . . 173
4.7.2 Unknown Parameter with Support R in an HMM . . . 179
4.7.3 Unknown Parameter with Support R+ in an HMM . . 182
4.8 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . 185
4.8.1 Model Description . . . . . . . . . . . . . . . . . . . . 185
4.8.2 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 186
4.8.3 Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
4.8.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 189
4.8.5 Simulation Parameters and Results . . . . . . . . . . . 193
4.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
1
Introduction
1.1 Preamble
Monte Carlo methods (Metropolis and Ulam, 1949) are a broad class of
computational algorithms which approximate the deterministic result of a
problem by random sampling. The term ‘Monte Carlo’ was coined by Metropolis and
Ulam (1949) for solving problems in mathematical physics. Since then, the
methods have become a desirable option when a highly complex problem
lacks an analytic solution or a simple implementation. According to Google
Scholar, over three million academic articles are related to the keywords
‘Monte Carlo’, of which over one million have appeared since 2000.
Monte Carlo methods have also been employed in cutting-edge technology.
For instance, a class of methods called Monte Carlo tree search is
applied in artificial intelligence in the program ‘AlphaGo’ for playing
the board game Go (Silver et al., 2016). In early 2016, AlphaGo beat Go
master Lee Se-dol 3-0 in a best-of-five competition (BBC news, 2016).
In this thesis, we develop new Monte Carlo methods for two inference
problems: (statistical) hypothesis testing and posterior analysis in one type
of probabilistic graphical models called hidden Markov models (HMMs). In
both areas, existing Monte Carlo algorithms either do not focus on bounding the
error caused by random sampling, or produce poor approximations under
certain circumstances due to ineffective designs. Hence, we wish to control
the error, or to reduce it relative to existing methods, in the aforementioned areas.
At the same time, we aim for a simple and efficient implementation.
refer to this probability as the p-value. If the p-value is below a user-specified
threshold, the observed sample data is believed to be very unlikely under the
null. Hence, the null hypothesis is rejected.
Hidden Markov models (HMMs) are within the class of PGMs which
incorporate a hidden Markov process {Xt }t∈N and a set of observable random
variables {Yt }t∈N generated by the Markov process (Cappé et al., 2006). The
word ‘hidden’ implies invisibility of the realisations of {Xt }t∈N , and each Xt
is called a hidden state. Each observation Yt is obtainable from the user, and
its distribution only depends on Xt .
This thesis aims to develop Monte Carlo algorithms for statistical hypothesis
testing and for the inference problems in an HMM.
the statistical hypothesis test. The error we wish to control in the test caused
by random sampling is called resampling risk (Fay and Follmann, 2002; Fay
et al., 2007; Gandy, 2009). Formally, it is defined as the probability that
the true p-value and the estimated p-value are on different sides of a fixed
threshold.
Choosing the resampling risk as the metric originates from the ‘first law of
applied statistics’ (Gleser, 1996): ‘Two individuals using the same statistical
method on the same data should arrive at the same conclusion.’ Monte
Carlo simulation violates this law, since it introduces extra variability into
the result, caused by random sampling, that is not inherent in the data itself.
Hence, we hope to regulate this uncertainty.
The aim of the first part of the thesis is to develop and compare two
Monte Carlo testing procedures, which both bound the resampling risk within
a pre-specified error ε uniformly, i.e. for all p-values in [0, 1]. The algorithm
proposed by Gandy (2009), which we call SIMCTEST, achieves this by
constructing a spending sequence: it determines the rate at which the allowed
resampling risk is spent. SIMCTEST ensures that the probability of the sampling
process hitting the wrong boundary given by the spending sequence is at most
ε, thus showing a uniformly bounded resampling risk.
The proposed algorithm, called the confidence sequence method (CSM), is
inspired by the construction of sequential confidence intervals for the
p-value (Robbins, 1970; Lai, 1976). CSM bases its decision on such a
sequence, whose joint coverage probability for the p-value is at least 1 − ε.
The algorithm is suggested due to its simple implementation and comparable
performance to SIMCTEST as well as other algorithms.
In the second part (Chapter 3 & 4) of the thesis, we consider two poste-
rior distributions: smoothing and parameter estimation in a hidden Markov
model (HMM). We establish a novel class of Monte Carlo algorithms, which
approximate the posteriors when their analytic solutions do not exist.
In the literature, one type of Monte Carlo method called sequential Monte
Carlo (SMC) is widely employed in the HMM (Liu and Chen, 1998; Pitt and
Shephard, 1999; Doucet et al., 2000). SMC can produce random samples
sequentially from a list of target distributions with increasing dimension.
Hence, it can be applied to the filtering or smoothing problem in which the
distributions {p(x_{0:t} | y_{0:t})}_{t=0}^{T} are sequentially estimated.
When sampling from the smoothing distribution, the Monte Carlo al-
gorithms using a sequential approach such as SMC usually suffer from a
phenomenon called path degeneracy (Arulampalam et al., 2002). Path de-
generacy refers to low diversity of the samples caused by numerous update
steps.
We illustrate the path degeneracy issue when applying SMC for smooth-
ing. The algorithm starts from t = 0 where the samples of X0 are initially
simulated. When we proceed forward to simulate X1 , the samples of X0 re-
quire an update, which may be accompanied by a resampling process. In this
process, we replicate samples with high probabilities to substitute those with
low probabilities, hence losing diversity. Usually, more updates imply more
resampling steps. In SMC, for each time t ≤ T , the samples of X0 , . . . , Xt−1
from all previous time steps demand updates. As a result, the samples of X0
are updated (T + 1) times upon the end of the algorithm, which implies a
large number of potential resampling steps if T is huge. On the other hand,
XT is only revealed in the sequence of target distributions {p(x_{0:t} | y_{0:t})}_{t=0}^{T}
at the very last step, which contributes to far more diversified samples with at most one
resampling step.
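The collapse of sample diversity described above can be illustrated numerically. In the sketch below (the flat placeholder weights and all names are our own illustration, not part of the thesis), multinomial resampling is applied repeatedly to N particles while we track how many distinct ancestors of the initial samples of X0 survive:

```python
import random

random.seed(0)
N, T = 1000, 1000  # number of particles, number of update steps

# Track only the ancestor index of each particle: every multinomial
# resampling step replicates some indices and drops others, so the
# number of distinct ancestors of the X0 samples can only shrink.
ancestors = list(range(N))
for t in range(T):
    # Placeholder weights (normalised exponentials, i.e. a flat Dirichlet
    # draw) standing in for the importance weights of a real SMC run.
    w = [random.expovariate(1.0) for _ in range(N)]
    ancestors = random.choices(ancestors, weights=w, k=N)

survivors = len(set(ancestors))
print(survivors)  # far fewer than N distinct ancestors remain
```

After many resampling steps only a handful of the original N samples of X0 still have surviving descendants, which is exactly the path degeneracy described above; the samples of XT, by contrast, undergo at most one such step.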
a node only contains a single hidden state. The tree under such construction
has a depth of (1 + ⌈log₂(T + 1)⌉) levels.
consideration of proposals and computational budget. In particular, it is
adjustable with a possible reduction to a linear complexity.
sequence method (CSM) in Section 2.4. We then review the existing approach
called SIMCTEST in Section 2.5, and compare it to CSM in Section 2.6 &
2.7.
We further investigate the resampling risk of the Monte Carlo test when a
maximum number of Monte Carlo samples is specified, which we call trunca-
tion. We empirically show the risk is not uniformly bounded in the truncated
versions of CSM and SIMCTEST as well as other truncated procedures in
Section 2.8. We extend the Monte Carlo tests under a single threshold to
multiple ones in Section 2.9.
in a non-linear HMM in Section 3.9 & 3.10. We complete the chapter with a
discussion in Section 3.11.
In the simulation study of Section 4.8, we perform TPE and other al-
gorithms to estimate a three dimensional unknown parameter in a linear
Gaussian HMM. We finish the chapter with a discussion in Section 4.9.
Chapters 2 and 3 of the thesis have been submitted to journals and are
available on the arXiv preprint server:
https://arxiv.org/abs/1611.01675
• Chapter 2 (Section 2.9): submitted as the third author with main con-
tribution to Theorem 2 (first part) and Lemma 3 in this thesis. The
article is available on
https://arxiv.org/abs/1703.09305
https://arxiv.org/abs/1808.08400
2
Implementing Monte Carlo
Tests with Uniformly
Bounded Resampling Risk
2.1 Introduction
where the measure P is ideally the true null distribution in a simple hypoth-
esis. Otherwise, it can be an estimated distribution in a bootstrap scheme,
or a distribution conditional on an ancillary statistic, etc.
Gleser (1996) suggests that two individuals using the same statistical
method on the same data should reach the same conclusion. For tests, the
standard decision rule is based on comparing p to a threshold α. In the
setting we consider, Monte Carlo methods are used to compute an estimate
p̂ of p, which is then compared to α to reach a decision.
When we compare the p-value with a single threshold, we define the resampling
risk as

RR_p(p̂) = P(p̂ > α)  if p ≤ α,
          P(p̂ ≤ α)  if p > α.
In the first part of this chapter, we are looking for procedures that achieve
a small uniform bound on the resampling risk. The basic estimator

p̂ = (1 + Σ_{i=1}^{n} Xi)/(1 + n)

does not guarantee a small uniform bound on the resampling risk, where n is
the pre-defined number of Monte Carlo samples. In fact, the lowest uniform
bound on the resampling risk for this estimator is at least 0.5 (Gandy, 2009).
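To see why no fixed n can help, consider the following sketch (the helper name and simulation parameters are our own illustration): when the true p-value sits exactly at the threshold α, the estimate lands on either side of α with probability close to one half, however large n is.

```python
import random

def naive_mc_pvalue(p, n, rng):
    """Basic estimator (1 + sum_i X_i) / (1 + n), with Bernoulli(p)
    draws standing in for the exceedance indicators X_i."""
    s = sum(rng.random() < p for _ in range(n))
    return (1 + s) / (1 + n)

rng = random.Random(1)
alpha, n, reps = 0.05, 1000, 2000

# Fraction of runs with p = alpha whose estimate lands above the threshold.
wrong = sum(naive_mc_pvalue(alpha, n, rng) > alpha for _ in range(reps)) / reps
print(wrong)  # close to 0.5: essentially a coin flip at the threshold
```

Since the decision at p = α is essentially a coin flip, the resampling risk near the threshold stays around 0.5 for any fixed sample size, matching the lower bound of Gandy (2009) quoted above.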
A variety of procedures for sequential Monte Carlo testing are available in
the literature which target different error measures. Silva et al. (2009); Silva
and Assunção (2013) bound the power loss of the test while minimising the
expected number of steps. Silva et al. (2018, Section 4) construct truncated
sequential Monte Carlo algorithms which bound the power loss and the level
of significance in comparison to the exact test by arbitrarily small numbers.
Other algorithms aim to control the resampling risk (Fay and Follmann,
2002; Fay et al., 2007; Gandy, 2009; Kim, 2010). Fay et al. (2007) use a
truncated sequential probability ratio test (SPRT) boundary and discuss the
resampling risk, but do not aim for a uniform lower bound on it. Kim (2010);
Fay and Follmann (2002) ensure a uniform bound on the resampling risk
under the assumption that the random variable p belongs to a certain class
of distributions. Although this is a much less restrictive requirement than (2.1),
one drawback of this approach is that, in real situations, the distribution of p
is typically not fully known, as this would require knowledge of the underlying
true sampling distribution.
ples in the simulation in advance and forces a decision before or at the end
of all simulations. Open-ended procedures, e.g. Gandy (2009), do not have
a stopping rule or an upper bound on the number of steps. Open-ended pro-
cedures can be turned into truncated procedures by forcing a decision after
a fixed number of simulation steps. Truncated procedures cannot guarantee
a uniform bound on the resampling risk; Section 2.8 demonstrates this.
In the second part of this chapter, we extend the single testing threshold
to multiple ones, since standard software packages such as R Development
Core Team (2008) and IBM Corporation (2013) usually report the signifi-
cance of a p-value with respect to multiple levels, e.g. 0.1%, 1%, 5%. We
generalise the testing thresholds to user-specified intervals by defining a set
of intervals J satisfying ⋃_{I∈J} I = [0, 1], which we call p-value buckets.
We refine the resampling risk as the probability that the true
p-value p is not contained in the reported bucket I ∈ J based upon the estimated
(Monte Carlo) p-value. Formally, it is defined as RR_p(I) = P_p(p ∉ I). We
aim to design algorithms which bound this error by an arbitrary constant
ε ∈ (0, 1] uniformly:
We formulate a class of algorithms which satisfy (2.2). These algorithms
employ a sequence of confidence intervals for the p-value with the desired
joint coverage probability, and return one p-value bucket once it contains
the confidence interval. Using this construction, we propose new versions of
CSM and SIMCTEST, which we call mCSM and mSIMCTEST, to accom-
modate the situation of multiple thresholds for bounding the resampling risk
uniformly.
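The bucket-reporting rule described above can be sketched minimally as follows (the bucket list and the function name are illustrative assumptions of ours, not the thesis's construction): a p-value bucket I ∈ J is returned as soon as it contains the current confidence interval for p.

```python
def covering_bucket(ci, buckets):
    """Return the first p-value bucket (lo, hi) that contains the
    confidence interval ci = (l, u), or None if none does yet."""
    l, u = ci
    for lo, hi in buckets:
        if lo <= l and u <= hi:
            return (lo, hi)
    return None

# Overlapping buckets covering [0, 1], built around the classical
# thresholds 0.1%, 1% and 5% (an illustrative choice).
buckets = [(0, 0.001), (0, 0.01), (0.001, 0.05), (0.01, 1), (0.05, 1)]

print(covering_bucket((0.012, 0.04), buckets))  # -> (0.001, 0.05)
print(covering_bucket((0.004, 0.06), buckets))  # -> None: keep sampling
```

Because the buckets overlap, a shrinking confidence interval is eventually contained in some bucket, at which point sampling can stop with the joint coverage guarantee bounding the refined resampling risk.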
2.2 Statistical Hypothesis
2.2.1 P-value
The p-value measures the probability of the data being at least as extreme as
the observed one under the null hypothesis (Altman, 1990). To calculate the
p-value, we utilise a statistic T, called the test statistic, which usually has
a known distribution under the null hypothesis H0. The distribution of T under
H0 is called the null distribution of T.
test statistic by t, and follow the convention of rejecting the null hypothesis
when observing a large value of t. The p-value p under the null hypothesis
is then defined as
p = P(T ≥ t|H0 ). (2.3)
When the null hypothesis is a simple one, the distribution of T is known. When
it is composite, three remedies can usually be employed. The first approach
chooses a test statistic T whose distribution is identical for all distributions
F . A well-known example is the Student’s t-statistic used for testing the
mean of a normal distribution with unknown variance. The second approach
conditions on a sufficient statistic S under the null, which eliminates un-
known model parameters (Davison et al., 1997). The conditional p-value is
defined as p = P(T ≥ t|H0 , S = s). The third approach approximates the
distribution F when the nuisance parameter cannot be conditioned away. It
then computes the p-value using the estimated distribution under the null.
The decision of rejecting or not rejecting the null is based upon the p-
value. When a large p-value is obtained from (2.3), we claim that the data
could often occur under the null hypothesis. Alternatively, a tiny p-value
implies that the data is very unlikely to be observed. The test of the null
hypothesis using the p-value is hence a decision on whether it lies above or
below a chosen cut-off. When the p-value is below (resp. above) the cut-off,
we say the test is statistically significant (resp. not significant), and the null
hypothesis is rejected (resp. not rejected).
The cut-off of the significance level depends on the user’s choice. The
classical thresholds are 0.1%, 1% and 5%, and are usually labelled (∗∗∗ ,∗∗ ,∗ )
by a star-rating system in software such as R (R Development Core Team,
2008) and SPSS (IBM Corporation, 2013).
A false decision for testing the null hypothesis can be committed if the
information about the whole population is only partially obtained from the
observed data. Two types of error can be made. The first, called the
type I error, occurs when we falsely reject H0 in favour of H1 given that the null
hypothesis H0 is indeed true. The second, called the type II error, occurs when
we falsely reject H1 in favour of H0 given that H0 is false.
We denote the probabilities of committing a type I and a type II error by α0 and β0,
respectively.
The probability of rejecting the null hypothesis when it is true, i.e. committing
the type I error, is called the level of significance or significance level of
the test. Correspondingly, the probability of rejecting the null hypothesis
when it is false, i.e. not committing the type II error, is called the power of the
test.
denotes the indicator function. We also define the partial sum Sn = Σ_{i=1}^{n} Xi,
which will be used constantly in the later sections. The basic Monte Carlo
p-value estimator is

p̂mc = (1 + Σ_{i=1}^{n} Xi)/(1 + n). (2.4)
The basic Monte Carlo p-value estimator violates the first law of applied
statistics (Gleser, 1996): ‘Two individuals using the same statistical method
on the same data should arrive at the same conclusion.’ In hypothesis testing,
it refers to whether an estimator of p stays in the same region as the
true p-value given a threshold, and can be quantified by the term resampling
risk (Fay and Follmann, 2002; Fay et al., 2007; Gandy, 2009). If we assume
H0 : p > α and H1 : p ≤ α
p̂ and the true p lie on different sides of α:

RR_p(p̂) = P(Not reject H0 | H1) = P(p̂ > α)  if p ≤ α,
          P(Reject H0 | H0) = P(p̂ ≤ α)      if p > α.

We aim for procedures that achieve a small uniform bound ε > 0 on the
resampling risk RR_p(p̂):

sup_{p∈[0,1]} RR_p(p̂) ≤ ε. (2.5)
or weeks to finish depending on the complexity of the algorithm, and hence
an open-ended approach is not recommended.
Besag and Clifford (1991) propose a heuristic stopping rule for the trun-
cated sequential Monte Carlo procedure or bootstrap sampling. The algo-
rithm has two tuning parameters: the number of exceedances h and the
maximum number of samples (nmax − 1). We terminate the sampling pro-
cedure once Sn = h for some n < nmax − 1 or finish the whole simulation
process by obtaining (nmax − 1) Monte Carlo samples. Then, the estimated
p-value p̂ is defined as
p̂ = h/n                       if Sn = h and n < nmax − 1,
     (S_{nmax−1} + 1)/nmax     if S_{nmax−1} < h.
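The stopping rule can be sketched as follows; this is our reconstruction, in which the sampler argument stands in for drawing one exceedance indicator and the handling of edge cases may differ from the original paper.

```python
import random

def besag_clifford(draw_exceedance, h, n_max):
    """Sequential p-value estimate in the spirit of Besag and Clifford
    (1991): stop once h exceedances have occurred, or after n_max - 1
    Monte Carlo samples."""
    s = 0
    for n in range(1, n_max):          # at most n_max - 1 samples
        s += draw_exceedance()
        if s == h and n < n_max - 1:
            return h / n               # stopped early at the h-th exceedance
    return (s + 1) / n_max             # truncated: all n_max - 1 samples used

rng = random.Random(2)
p_true = 0.02  # illustrative true p-value
est = besag_clifford(lambda: rng.random() < p_true, h=10, n_max=10000)
print(est)
```

With a small true p-value the procedure typically stops early (the h-th exceedance arrives after roughly h/p samples), so the sampling effort adapts to the size of p, which is the appeal of this rule.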
2009). The algorithm starts with a relatively small sample size n = nmin and
increases the sample size n until reaching a decision or a maximum number
of samples denoted by nmax. Given Sn = Σ_{i=1}^{n} Xi, the null hypothesis is
accepted or rejected once n lies in

Ψ = {n ∈ N : Σ_{i=0}^{Sn} b(n, α, i) < β or Σ_{i=Sn}^{n} b(n, α, i) < β},
Table 2.1: Stopping boundaries using the method of Davidson and MacKinnon (2000) with
respect to the suggested sample sizes when the level of pretest β = 0.05.
or reject it once
Sn ≤ C2 + nC0 ,
where

C0 = log((1 − p0)/(1 − pα)) / log(r),
C1 = log(B)/log(r),
C2 = log(A)/log(r),
r = pα(1 − p0)/(p0(1 − pα)).

The type I error α0 and type II error β0 can be approximated using
A = (1 − β0)/α0 and B = β0/(1 − α0).
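The constants are direct arithmetic; in the sketch below the values of p0, pα, α0 and β0 are illustrative choices of ours, and the accept boundary is reconstructed by symmetry with the reject boundary quoted above (an assumption).

```python
import math

def sprt_constants(p0, p_alpha, alpha0, beta0):
    """Constants of the SPRT boundaries S_n <= C2 + n*C0 (reject) and,
    by symmetry, S_n >= C1 + n*C0 (accept), with A = (1 - beta0)/alpha0
    and B = beta0/(1 - alpha0) approximating the error probabilities."""
    A = (1 - beta0) / alpha0
    B = beta0 / (1 - alpha0)
    r = (p_alpha * (1 - p0)) / (p0 * (1 - p_alpha))
    C0 = math.log((1 - p0) / (1 - p_alpha)) / math.log(r)
    C1 = math.log(B) / math.log(r)
    C2 = math.log(A) / math.log(r)
    return C0, C1, C2, r

# Illustrative: null side p0 = 0.06 against alternative side p_alpha = 0.04
# around a 5% threshold, with symmetric error targets.
C0, C1, C2, r = sprt_constants(0.06, 0.04, 0.05, 0.05)
print(C0, C1, C2)  # the slope C0 lies between p_alpha and p0; C1 = -C2 here
```

A standard sanity check is that the common slope C0 of the two parallel boundaries falls between the two hypothesised values, and that symmetric error targets give C1 = −C2.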
π(α) = ∫₀¹ π(α|p) FP(dp),

where π(α|p) is the probability of rejecting H0 under the exact test:

π(α|p) = 1  if p ≤ α,
         0  if p > α.
πm(αmc) = ∫₀¹ πm(αmc|p) FP(dp),
where αmc ∈ (0, 1) is a desired significance level for the (Monte Carlo) test
and πm (αmc |p) is the probability of rejecting H0 by the test. The formula of
πm (αmc |p) is given in (Silva et al., 2018, Equation (3.1)). The power loss of
the Monte Carlo test is the difference between π(α) and πm (αmc ).
where the left hand sides of (2.6) and (2.7) are derived in (Silva et al., 2018,
Section 4).
2.4 Confidence Sequence Method (CSM)
Recall that ε ∈ (0, 1) is the desired bound on the resampling risk. Using
independent Bernoulli(p) random variables Xi, i ∈ N, an inequality of Robbins
(1970, p. 1397) states

Pp(∃n ∈ N : (n + 1) b(n, p, Sn) ≤ ε) ≤ ε (2.8)

for all p ∈ (0, 1), where b(n, p, x) = (n choose x) p^x (1 − p)^{n−x} and Sn = Σ_{i=1}^{n} Xi. Then,
In = {p ∈ [0, 1] : (n + 1) b(n, p, Sn) > ε} is a sequence of confidence sets that
has the desired joint coverage probability of at least 1 − ε.
Lai (1976) shows that the In are intervals. Indeed, if 0 < Sn < n we obtain
In = (gn(Sn), fn(Sn)), where gn(x) < fn(x) are the two distinct roots in p of
(n + 1) b(n, p, x) = ε. If Sn = 0, the equation (n + 1) b(n, p, 0) = ε
has only one root rn, leading to In = [0, rn). Likewise for Sn = n, in which
case In = (rn, 1].
τ = inf{n ∈ N : α ∉ In}.
the uniform bound on the resampling risk in (2.5) holds true.
τ = inf{n ∈ N : (n + 1) b(n, α, Sn) ≤ ε},
τ = inf{n ∈ N : Sn ≥ un or Sn ≤ ln },
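The stopping rule can be sketched in a few lines; the exceedance sampler, the function names and the log-space pmf helper below are our own scaffolding, not the thesis's notation.

```python
import math
import random

def log_binom_pmf(n, x, p):
    """log b(n, p, x), computed in log space for numerical stability."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log(1 - p))

def csm_decision(alpha, eps, sample_exceedance, n_max=100000):
    """Sketch of CSM: stop once (n + 1) * b(n, alpha, S_n) <= eps,
    i.e. alpha has left the confidence sequence; the side of alpha on
    which S_n / n then lies gives the decision."""
    s = 0
    for n in range(1, n_max + 1):
        s += sample_exceedance()
        if math.log(n + 1) + log_binom_pmf(n, s, alpha) <= math.log(eps):
            return ("p>alpha" if s / n > alpha else "p<=alpha", n)
    return (None, n_max)  # no decision within n_max samples

rng = random.Random(3)
# True p-value of 0.2 (illustrative), well above the threshold 0.05.
decision, steps = csm_decision(0.05, 1e-3, lambda: rng.random() < 0.2)
print(decision, steps)
```

When the true p-value is far from the threshold, the decision arrives after comparatively few samples; as p approaches α the expected stopping time grows without bound, consistent with the discussion around Figure 2.2.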
The following theorem shows that the resampling risk of CSM is uniformly
bounded.
Figure 2.1: Lower (ln) and upper (un) stopping boundaries of CSM for several thresholds
α (α = 0.01, 0.05, 0.10).
To see the former: When not hitting any boundary, i.e. on the event
{τ = ∞}, we have p̂c = α. When hitting the lower boundary, i.e. on the
event {τ < ∞, Sτ ≤ lτ }, we have p̂c = Sτ /τ ≤ lτ /τ . It thus suffices to show
ln /n ≤ α for all n ∈ N.
Hence, by the definition of ln we have ln ≤ ⌈αn⌉ − 1 < αn.
To finish the proof of this case, we show that the probability of hitting
the upper boundary is bounded by ε, which can be done using (Gandy, 2009,
Lemma 3) and (2.8):

Pp(∃n ∈ N : Sn ≥ un) ≤ Pα(∃n ∈ N : (n + 1) b(n, α, Sn) ≤ ε) ≤ ε.
The case p > α can be shown analogously to the case p < α using that
Pp (τ = ∞) = 0, which is shown in (Lai, 1976, p. 268).
Figure 2.2: Expected number of steps Ep(τ) required to decide whether p lies above or
below the threshold α = 0.05, for ε = 0.01, 0.001, 0.0001.
uniformly. SIMCTEST sequentially updates two integer sequences (Ln )n∈N
and (Un )n∈N serving as lower and upper stopping boundaries and stops the
sampling process once the trajectory (n, Sn ) hits either boundary. The deci-
sion whether the p-value lies above (below) the threshold depends on whether
the upper (lower) boundary is hit first.
The boundaries (Ln)n∈N and (Un)n∈N are a function of α, computed
recursively such that the probability of hitting the upper (lower) boundary,
given p ≤ α (p > α), is at most ε. Starting with U1 = 2, L1 = −1, the
boundaries are recursively defined as

Un = min{j ∈ N : Pα(τ ≥ n, Sn ≥ j) + Pα(τ < n, Sτ ≥ Uτ) ≤ εn},
Ln = max{j ∈ Z : Pα(τ ≥ n, Sn ≤ j) + Pα(τ < n, Sτ ≤ Lτ) ≤ εn},
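This recursion can be implemented by propagating the distribution of Sn restricted to paths that have not yet stopped. The sketch below (function name and structure are ours) uses the default spending sequence εn = ε · n/(n + k) with k = 1000:

```python
def simctest_boundaries(alpha, eps, n_steps, k=1000):
    """Sketch of the recursive boundary computation: q[j] carries
    P_alpha(tau >= n, S_n = j), updated by one Bernoulli(alpha) draw
    per step; eps_n = eps*n/(n + k) is the default spending sequence."""
    q = [1.0]                         # distribution of S_0 = 0
    spent_up = spent_low = 0.0
    U, L = [], []
    for n in range(1, n_steps + 1):
        new_q = [0.0] * (n + 1)
        for j, mass in enumerate(q):  # convolve with (1 - alpha, alpha)
            new_q[j] += mass * (1 - alpha)
            new_q[j + 1] += mass * alpha
        q = new_q
        eps_n = eps * n / (n + k)
        # Smallest U_n whose upper-tail mass, plus the risk already
        # spent on the upper boundary, stays within eps_n.
        tail, U_n = 0.0, n + 1
        for j in range(n, 0, -1):
            if tail + q[j] + spent_up > eps_n:
                break
            tail += q[j]
            U_n = j
        # Largest L_n with the analogous property for the lower tail.
        head, L_n = 0.0, -1
        for j in range(0, n + 1):
            if head + q[j] + spent_low > eps_n:
                break
            head += q[j]
            L_n = j
        spent_up += tail
        spent_low += head
        for j in range(U_n, n + 1):   # stopped paths leave the recursion
            q[j] = 0.0
        for j in range(0, L_n + 1):
            q[j] = 0.0
        U.append(U_n)
        L.append(L_n)
    return U, L

U, L = simctest_boundaries(alpha=0.05, eps=1e-3, n_steps=200)
print(U[0], L[0])  # -> 2 -1, matching U_1 = 2 and L_1 = -1
```

The greedy tail scans are valid because the set of admissible boundary values is monotone in j, so the first violation terminates the search.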
upper boundary, thus leading to the stopping time σ = inf{k ∈ N : Sk ≥
Uk or Sk ≤ Lk }. In this case, a p-value estimate can readily be computed
as p̂s = Sσ /σ if σ < ∞ and p̂s = α otherwise. Similarly to Figure 2.2, the
expected stopping time of SIMCTEST diverges as p approaches the threshold
α.
Figure 2.3: Stopping boundaries for CSM and SIMCTEST (with default spending sequence).
Figure 2.4: Ratio of widths of stopping boundaries (un − ln)/(Un − Ln) for CSM (un upper,
ln lower) and SIMCTEST with default spending sequence (Un, Ln). Log scale on the x-axis.
lower stopping boundaries of CSM (resp. SIMCTEST).
According to Figure 2.4, the boundaries of CSM are initially tighter than
the ones of SIMCTEST, but become wider as the number of steps increases.
However, this will eventually reverse again for large numbers of steps as
depicted in Figure 2.4.
Both SIMCTEST and CSM are guaranteed to bound the resampling risk
by some constant chosen in advance by the user. We will demonstrate in
this section that the actual resampling risk (that is the actual probability of
hitting a boundary leading to a wrong decision in any run of an algorithm)
for SIMCTEST is close to ε, whereas CSM does not make full use of the
allocated resampling risk. This in turn indicates that it might be possible to
construct boundaries for SIMCTEST which are uniformly tighter than the
ones of CSM; we will pursue this in Section 2.7.
Figure 2.5 plots the cumulative probability of hitting the upper and lower
boundaries over 5 · 104 steps for both methods. As before we control the
resampling risk at our default choice of ε = 10⁻³. Up to step 5 · 10⁴, SIMCTEST
spends ε50000 = ε · (5 · 10⁴)/(5 · 10⁴ + 1000) ≈ 9.804 · 10⁻⁴ allocated by the spending
sequence, which is close to the full resampling risk ε = 10⁻³.
CSM tends to be more conservative as it does not spend the full re-
sampling risk. Indeed, the total probabilities of hitting the upper and lower
boundaries in CSM up to step 5 · 104 are 4.726 · 10−4 and 4.472 · 10−5 , respec-
tively. In particular, the probability of hitting the lower boundary in CSM
is far lower than ε.
One advantage of SIMCTEST lies in the fact that it allows control over
the resampling risk spent in each step through suitable adjustment of its
spending sequence εn, n ∈ N. This can be useful in practical situations in
Figure 2.5: Cumulative resampling risk P(hit boundary up to step n | p = α) spent in
CSM and SIMCTEST (upper and lower boundaries of each). Log scale on the x-axis.
Suppose we are given a lower bound L and an upper bound U for the
minimal and maximal number of samples to be spent, respectively. We con-
struct a new spending sequence in SIMCTEST which guarantees that no
resampling risk is spent over both the first L samples as well as after U
samples have been generated. We call this the truncated spending sequence:
εn = 0                 if n ≤ L,
     ε · n/(n + k)     if L < n < U,
     ε                 if n ≥ U.
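As a sketch (parameter and function names are ours), the truncated spending sequence spends nothing up to step L, follows the default rate ε · n/(n + k) in between, and releases the full budget ε from step U onwards:

```python
def truncated_spending(n, eps, L, U, k=1000):
    """Truncated spending sequence: no resampling risk before step L,
    the default rate eps*n/(n + k) between L and U, and the full
    budget eps from step U onwards."""
    if n <= L:
        return 0.0
    if n < U:
        return eps * n / (n + k)
    return eps

eps = 1e-3
seq = [truncated_spending(n, eps, L=100, U=10000) for n in (50, 101, 5000, 10000)]
print(seq)  # 0 before L, increasing in between, eps from U on
```

Forcing εn = 0 for n ≤ L widens the early boundaries (no decision can be reached there), while εn = ε for n ≥ U guarantees the whole risk budget is available by the truncation point.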
Figure 2.6 shows the upper and lower stopping boundaries of CSM and SIMCTEST with the truncated spending sequence.
40
Figure 2.6: Stopping boundaries of CSM and SIMCTEST with the truncated spending se-
quence. Log scale on the x-axis.
As expected, for the first 100 steps the stopping boundaries of SIMCTEST are much wider than the ones of CSM since no resampling risk is spent.
[Figure: two panels plotting $n^l (\epsilon_n^{CSM} - \epsilon_{n-1}^{CSM})$ against the number of steps n for l ∈ {1.4, 1.5, 1.6}.]
Section 2.7.1 showed that it is possible to choose the spending sequence for
SIMCTEST in such a way as to obtain stopping boundaries which are strictly
contained within the ones of CSM for a pre-specified range of steps.
Motivated by Figure 2.5 indicating that CSM does not spend the full
resampling risk, we aim to construct a spending sequence with the property
that the resulting boundaries in SIMCTEST are strictly contained within
the ones of CSM for every number of steps. This implies that the stopping
time of SIMCTEST is never longer than the one of CSM. Our construction depends on the specific choice α = 0.05.
We first determine the rate at which the real resampling risk is spent in
each step in CSM. By matching this rate using a suitably chosen spending
sequence, we will obtain upper and lower stopping boundaries for SIMCTEST
which are uniformly narrower than the ones of CSM (verified for the first 5·104
steps).
We start by estimating the rate at which the real resampling risk is spent in CSM. We are interested in empirically finding an l ∈ R such that $n^l \cdot (\epsilon_n^{CSM} - \epsilon_{n-1}^{CSM})$ is constant, where $\epsilon_n^{CSM}$ is the (cumulative) real resampling risk (the total probability of hitting either boundary) for the first n steps in CSM.
In order to match the $O(n^{-1.5})$ rate for CSM, we generalise the default spending sequence of SIMCTEST to $\epsilon_n^S = \epsilon\, n^\gamma/(n^\gamma + k)$ for n ∈ N and a fixed γ > 0. Similarly to the aforementioned derivation, SIMCTEST in connection with $\epsilon_n^S$ will spend the real resampling risk at a rate of $O\big(n^{-(\gamma+1)}\big)$. We choose
Figure 2.8: Differences between the upper and lower stopping boundaries of CSM and SIM-
CTEST. Log scale on the x-axis.
Figure 2.8 depicts the differences between the upper (lower) boundaries
of CSM (upper un , lower ln ) and SIMCTEST (upper Un , lower Ln ) with
the aforementioned spending sequence. We observe that ln ≤ Ln as well as
un ≥ Un for n ∈ {1, . . . , 5 · 104 }, thus demonstrating that SIMCTEST can
be tuned empirically to spend the resampling risk at the same rate as CSM
while providing strictly tighter upper and lower stopping boundaries (over
a finite number of steps). We observe that the gap between the boundaries
seems to increase with the number of steps, leading to the conjecture that
SIMCTEST has tighter boundaries for all n ∈ N.
2.8 Comparison of Truncated Sequential Monte Carlo Procedures
In this section, we compute the resampling risk for several truncated proce-
dures as a function of p and thus demonstrate that they do not bound the
resampling risk uniformly.
Table 2.2: Parameters of the truncated Monte Carlo testing procedures.
In the previous sections of this chapter, we considered Monte Carlo testing procedures which return a decision on the p-value with respect to a single threshold. In practice, we may be interested in where the p-value lies with respect to multiple levels, e.g. the classical thresholds {0.001, 0.01, 0.05}. In this section, we extend the single threshold to multiple ones, and develop
[Figure 2.9 plots the resampling risk against the p-value for the procedures of Besag & Clifford (1991), Davidson & MacKinnon (2000), Fay et al. (2007), Gandy (2009), Silva & Assunção (2017), and Robbins (1970) & Lai (1976).]
Figure 2.9: Comparison of resampling risks between the truncated Monte Carlo testing
procedures when the threshold α = 0.05.
as with respect to the thresholds. For instance, given the testing thresholds
{0.001, 0.01, 0.05}, we can generate the intervals
is a set of p-value buckets and contains pairwise disjoint intervals; we call them non-overlapping p-value buckets.
which has the property that any p ∈ (0, 1) is contained in the interior of a J ∈ J∗ (for p = 0, we require that there exists J ∈ J and ε > 0 such that [0, ε) ⊆ J, and similarly for p = 1).
The choice of the p-value buckets is, of course, arbitrary. For the clas-
sical thresholds {0.001, 0.01, 0.05}, we usually build J 0 as the set of non-
overlapping p-value buckets and require the set of overlapping p-value buckets
is a superset of J0. We may choose the overlapping buckets for different purposes. For example, we may aim to achieve a relatively small upper bound on the stopping time given a limited computational budget, or to provide stable results which return the same p-value bucket consistently in different simulations; we will explore these two properties in Sections 2.9.4 and 2.10.2, respectively.
Gandy et al. (2017) find that overlapping p-value buckets are a necessary and sufficient condition for an algorithm satisfying (2.12) to have a finite stopping time. Here, the stopping time is measured in terms of the number of simulated Monte Carlo samples, and is denoted by τ. The following statements are proved to be equivalent in (Gandy et al., 2017, Theorem 1):
3. There exists an algorithm satisfying (2.12) with τ < C for some deterministic C > 0.
for all p ∈ [0, 1]. We provide two approaches for computing such confidence sequences in Section 2.9.3.
be interpreted as the number of samples needed until a decision of p with
respect to all but one of the thresholds 0.001, 0.01, or 0.05 is computed.
$$P_p\left(\exists n \in \mathbb{N} : b(n, p, S_n) \le \frac{\epsilon}{n+1}\right) \le \epsilon \qquad (2.16)$$
holds true for all p ∈ (0, 1) and ε ∈ (0, 1), where $S_n = \sum_{i=1}^n X_i$ and $b(n, p, x) = \binom{n}{x} p^x (1-p)^{n-x}$. Therefore, $I_n = \{p \in [0, 1] : (n+1)\, b(n, p, S_n) > \epsilon\}$ is a sequence of confidence sets for p with the desired coverage probability of 1 − ε.
Lai (1976) further shows that solving the left hand side of (2.16) yields
where $g_n(x) < f_n(x)$ are the two distinct roots (Lai, 1976) of $(n+1)\, b(n, p, x) = \epsilon$. Indeed, if $0 < S_n < n$, a sequence of confidence intervals for p is given by $I_n := \big(g_n(S_n), f_n(S_n)\big)$. In the case $S_n = 0$, the equation $(n+1)\, b(n, p, x) = \epsilon$ has only one root $r_n$, leading to $I_n = [0, r_n)$. Likewise for the case $S_n = n$, which leads to the confidence interval $I_n = (r_n, 1]$.
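The two roots can also be located numerically. The following Python sketch finds them by bisection on either side of the mode p = S_n/n of $(n+1)\,b(n,p,S_n)$; it is an illustration only, not the closed-form analysis of Lai (1976), and the function name is ours:

```python
import math

def lai_interval(n, s, eps=1e-3):
    """Approximate the roots g_n(s) < f_n(s) of (n+1) b(n, p, s) = eps
    by bisection, for 0 < s < n (a numerical sketch)."""
    c = math.comb(n, s)

    def h(p):
        return (n + 1) * c * p**s * (1 - p)**(n - s)

    mode = s / n  # h is unimodal with its maximum at p = s/n

    def root(lo, hi, increasing):
        # bisect on an interval where h is monotone
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if (h(mid) < eps) == increasing:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return root(0.0, mode, True), root(mode, 1.0, False)
```

For instance, `lai_interval(100, 50)` returns an interval around 0.5 on which $(n+1)\,b(n,p,S_n)$ exceeds ε.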
Gandy et al. (2017, Lemma 1) prove that the stopping time is bounded above by a deterministic constant in mCSM if overlapping p-value buckets are employed.
mSIMCTEST then creates the stopping boundaries from the default spending sequence in SIMCTEST (Gandy, 2009) for each α ∈ A_J, denoted by $L_{n,\alpha}$ and $U_{n,\alpha}$, using the same resampling risk parameter ρ. We define the corresponding stopping time for each α as $\sigma_\alpha = \inf\{k \in \mathbb{N} : S_k \ge U_{k,\alpha} \text{ or } S_k \le L_{k,\alpha}\}$ (based on the same sequence of $X_j$, $j \in \mathbb{N}$). We also define
$$I_{n,\alpha} = \begin{cases} [0, 1] & \text{if } n < \sigma_\alpha, \\ [0, \alpha] & \text{if } n \ge \sigma_\alpha,\ S_{\sigma_\alpha} \le L_{\sigma_\alpha,\alpha}, \\ (\alpha, 1] & \text{if } n \ge \sigma_\alpha,\ S_{\sigma_\alpha} \ge U_{\sigma_\alpha,\alpha}, \end{cases}$$
and let $I_n = \bigcap_{\alpha \in A_J} I_{n,\alpha}$.
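A minimal sketch of the intersection step: each threshold α contributes an interval $I_{n,\alpha}$, and $I_n$ is their intersection. Intervals are represented as (lo, hi) pairs, ignoring the open/closed endpoint distinction for simplicity (the representation and function name are ours):

```python
def intersect(intervals):
    """Intersect intervals represented as (lo, hi) pairs."""
    lo = max(l for l, _ in intervals)
    hi = min(h for _, h in intervals)
    return (lo, hi)

# alpha = 0.01 concluded p > 0.01, alpha = 0.05 still undecided:
In = intersect([(0.01, 1.0), (0.0, 1.0)])
```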
The following theorem shows that $I_n$ indeed has the desired joint coverage probability given in (2.13) or (2.15) when setting ρ = ε/2. Additionally, employing overlapping buckets in mSIMCTEST leads to a bounded stopping time.
We only prove the first part of Theorem 2 regarding the joint coverage
probability of the confidence sequence. The second part can be found in
(Gandy et al., 2017, Theorem 2).
Let
$$\overline{E}^N_\alpha = \{S_{\tau_\alpha} \ge U_{\tau_\alpha,\alpha},\ \tau_\alpha < N\}$$
be the event that the upper boundary is hit first before time N and likewise let
$$\underline{E}^N_\alpha = \{S_{\tau_\alpha} \le L_{\tau_\alpha,\alpha},\ \tau_\alpha < N\}$$
be the event that the lower boundary is hit first. Then, for all α, α′ ∈ A_J with α < α′ the following holds:
$$\overline{E}^N_\alpha \supseteq \overline{E}^N_{\alpha'} \quad \text{and} \quad \underline{E}^N_\alpha \subseteq \underline{E}^N_{\alpha'}. \qquad (2.18)$$
Indeed, to see $\overline{E}^N_\alpha \supseteq \overline{E}^N_{\alpha'}$, we can argue as follows. On the event $\overline{E}^N_{\alpha'}$, as $U_{n,\alpha} \le U_{n,\alpha'}$ for all n ∈ N, the trajectory $(n, S_n)$ must hit the upper boundary $U_{n,\alpha}$ of α no later than $\tau_{\alpha'}$, hence $\tau_\alpha \le \tau_{\alpha'} < N$. It remains to prove that the trajectory does not first hit the lower boundary $L_{n,\alpha}$ of α. Indeed, if the trajectory does hit the lower boundary of α before hitting its upper boundary, it also hits the lower boundary of α′ (as $L_{n,\alpha} \le L_{n,\alpha'}$ for all n < N) before time $\tau_{\alpha'}$, thus contradicting being on the event $\overline{E}^N_{\alpha'}$. Hence, we have $\overline{E}^N_\alpha \supseteq \overline{E}^N_{\alpha'}$. The proof of $\underline{E}^N_\alpha \subseteq \underline{E}^N_{\alpha'}$ is similar.
Using this notation, for all p ∈ [0, 1],
$$P_p(\exists n < N : p \notin I_n) \le P_p(\exists n < N,\ \alpha \in A_J : p \notin I_{n,\alpha})$$
$$= P_p\Big(\bigcup_{\alpha \in A_J : \alpha < p} \underline{E}^N_\alpha \;\cup \bigcup_{\alpha \in A_J : \alpha \ge p} \overline{E}^N_\alpha\Big)$$
$$\le P_p\Big(\bigcup_{\alpha \in A_J : \alpha < p} \underline{E}^N_\alpha\Big) + P_p\Big(\bigcup_{\alpha \in A_J : \alpha \ge p} \overline{E}^N_\alpha\Big). \qquad (2.19)$$
If p < min A_J, then the first term is equal to 0. Otherwise, let $\alpha_0 = \max\{\alpha \in A_J : \alpha < p\}$. Then, by (2.18),
$$P_p\Big(\bigcup_{\alpha \in A_J : \alpha < p} \underline{E}^N_\alpha\Big) = P_p\big(\underline{E}^N_{\alpha_0}\big) \le \rho.$$
The second term on the right hand side of (2.19) can be dealt with similarly.
Let α, α′ ∈ A_J with α < α′. Then there exists $n_0 \in \mathbb{N}$ such that for all n ≥ n_0,
$$L_{n,\alpha} \le L_{n,\alpha'}, \qquad U_{n,\alpha} \le U_{n,\alpha'}.$$
$$2\left(\frac{\Delta_n}{n} + \frac{1}{n}\right) \le \alpha' - \alpha \quad \text{for all } n \ge n_0. \qquad (2.21)$$
Splitting $\frac{2}{n} = \frac{1}{n} + \frac{1}{n}$ and multiplying by n yields $n\alpha + \Delta_n + 1 \le n\alpha' - \Delta_n - 1$, from which $U_{n,\alpha} \le L_{n,\alpha'}$ follows by (2.20). By definition, we have $L_{n,\alpha} \le U_{n,\alpha}$ and $L_{n,\alpha'} \le U_{n,\alpha'}$ for all n ∈ N, thus implying $L_{n,\alpha} \le L_{n,\alpha'}$, $U_{n,\alpha} \le U_{n,\alpha'}$ for all n ≥ n_0, as desired.
any p-value bucket J ∈ J , i.e.
Suppose we are only interested in whether the p-value is above or below the single threshold α = 0.05; this corresponds to the non-overlapping p-value buckets
$$J^e := \{[0, 0.05],\ (0.05, 1]\}$$
and the non-stopping region (grey) is shown in Figure 2.10 (left) using mSIMCTEST. By construction, those regions bound the resampling risk at ε, where in this and all following simulations in this section we always use ε = 10^{-3}. The sampling process terminates once the trajectory of $(n, S_n)$ leaves the region, and we report p > α (resp. p ≤ α) upon arriving at the upper (resp. lower) boundary of the region first. Adding another bucket {(0.03, 0.07]} to $J^e$ results in a finite non-stopping region (see Figure 2.10, right), which ensures a stopping time no later than approximately 11250 simulated Monte Carlo samples. The sample trajectory $(n, S_n)$ can leave the non-stopping region in Figure 2.10 (right) in three ways: either from the former upper boundary, indicating p ∈ (0.05, 1], or from the former lower boundary, indicating p ∈ [0, 0.05], or through the middle, indicating p ∈ (0.03, 0.07]. We omit the plot using mCSM, which can be obtained similarly using its implied upper and lower boundaries.
Figure 2.10: Non-stopping region (grey) to decide p with respect to J e (left), which corresponds to a 5% threshold, and with respect to the overlapping buckets J e ∪ {(0.03, 0.07]} (right).
where J n is defined as:
2.10 Application
Ruxton and Neuhäuser (2013) employ SIMCTEST to conduct a hypothesis
test to determine whether the means of the penguin counts on Stewart Island
are equal to the ones of the cat-free islands. They apply Welch's t-test (Welch, 1947) to assess whether two population groups have equal means. The test statistic of Welch's t-test is given as follows:
$$T = \frac{\hat\mu_1 - \hat\mu_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$
where $n_1$, $n_2$ are the sample sizes, $\hat\mu_1$, $\hat\mu_2$ are the sample means and $s_1^2$, $s_2^2$ are the sample variances of the two groups.
The degrees of freedom are approximated by
$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{s_1^4}{n_1^2(n_1-1)} + \dfrac{s_2^4}{n_2^2(n_2-1)}}.$$
Using the above data, we obtain t = −0.45 as the observed test statistic and
a p-value of 0.09.
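The two formulas above can be checked with a short Python sketch (the helper name `welch` and the input lists are ours, not the thesis'):

```python
import math

def welch(x1, x2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    m1 = sum(x1) / n1
    m2 = sum(x2) / n2
    v1 = sum((x - m1) ** 2 for x in x1) / (n1 - 1)  # sample variance s1^2
    v2 = sum((x - m2) ** 2 for x in x2) / (n2 - 1)  # sample variance s2^2
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    v = se2 ** 2 / (v1**2 / (n1**2 * (n1 - 1)) + v2**2 / (n2**2 * (n2 - 1)))
    return t, v
```

With equal group variances and equal sample sizes, v reduces to the usual $n_1 + n_2 - 2$ degrees of freedom.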
Likewise, we apply CSM and SIMCTEST with the same bootstrap sampling procedure. We record the average effort measured in terms of the total number of samples generated. We set the resampling risk to ε = 0.001 and use the default spending sequence $\epsilon_n = \epsilon\, n/(n + 1000)$ in SIMCTEST.
We first perform a single run of both CSM and SIMCTEST. CSM and SIMCTEST stop after 751 and 724 steps with p-value estimates of 0.09 and 0.08, respectively. Hence, neither algorithm rejects the null hypothesis at the 5% level. We then conduct 10000 independent runs to stabilise the results. Amongst those 10000 runs, CSM does not reject the null hypothesis 10000 times compared with 9999 times for SIMCTEST. The average efforts of CSM and SIMCTEST are 1440 and 1131, respectively. Therefore, in this example, CSM gives comparable results to SIMCTEST while generating more samples on average. We expect such behaviour due to the wider stopping boundaries of CSM in comparison with SIMCTEST (see Figure 2.3). However, SIMCTEST requires its stopping boundaries to be pre-computed, which is not necessary in CSM.
Table 2.3: Two-way contingency table.
1 2 2 1 1 0 1
2 0 0 2 3 0 0
0 1 1 1 2 7 3
1 1 2 0 0 0 1
0 1 1 1 1 0 0
Note that the overlapping area between any two p-value buckets in $J^w$ is larger than that in $J^n$. We would expect the lowest computational effort when using $J^w$, followed by $J^n$ and $J^0$. We also aim to explore how stable the results are for each set of p-value buckets, i.e., whether the returned buckets are identical over different simulations.
Table 2.4: Decisions returned for J0, Jn and Jw in the contingency example of Table 2.3.

                         J0                     Jn                     Jw
                   CSM    SIMCTEST       CSM     SIMCTEST       CSM     SIMCTEST
  Significance    *   ∼    *   ∼       *    ∼     *    ∼      *    ∼     *    ∼
  %              100  NA  100  NA     99.7 0.3   99.2 0.8   14.7 85.3  19.9 80.1
  Average effort   14879    11796       14843      11703       4930       3479

where * refers to (0.01, 0.05] for J0, Jn and Jw; ∼ refers to (0.04, 0.06] for Jn and (0.03, 0.1] for Jw.
Table 2.5: Comparison between CSM and SIMCTEST.
The average effort measured by stopping time is higher for CSM than
for SIMCTEST when applied to the same p-value buckets. However, this
definition of the effort is not necessarily an indicator of the overall effort if all preparatory work is taken into account: Due to the larger computational overhead of computing boundaries in SIMCTEST, CSM or even a naïve approach with a constant number of samples can be faster in practice, especially when sampling is computationally cheap.
2.11 Discussion
The first part of this chapter introduces a new method called CSM to decide whether an unknown p-value, which can only be approximated via Monte Carlo sampling, lies above or below a fixed threshold α while uniformly bounding the resampling risk at a user-specified ε > 0. The method is straightforward to implement and relies on the construction of a confidence sequence (Robbins, 1970; Lai, 1976) for the unknown p-value.
We compare CSM to SIMCTEST (Gandy, 2009), finding that CSM is the more conservative method: The (implied) stopping boundaries of CSM are generally wider than the ones of SIMCTEST and, in contrast to SIMCTEST, CSM does not fully spend the allocated resampling risk ε.
aries up to $\tau_{max}$ need to be stored, which requires $O(\tau_{max})$ memory. Hence, the total memory requirement of SIMCTEST with pre-computed boundaries is $O(\tau_{max})$. Evaluating the stopping criterion in each step of CSM or SIMCTEST with pre-computed boundaries requires O(1), leading to the total computational effort of O(τ) depicted in Table 2.5 for both cases. Gandy (2009) reasons that the computational effort of SIMCTEST is roughly proportional to $\sum_{n=1}^{\tau} |U_n - L_n|$. Using the empirical result $|U_n - L_n| \sim O(\sqrt{n \log n})$, we obtain a bound of $O(\tau \sqrt{\tau \log \tau})$ for the computational effort of SIMCTEST.
make it a very appealing competitor for practical applications.
3 Tree-based Particle Smoothing Algorithms in a Hidden Markov Model
3.1 Introduction
[Graphical model of an HMM: the hidden states X0, X1, X2, X3, . . . form a Markov chain, and each observation Yt depends only on Xt.]
We make the following assumptions in the entire chapter unless the model is otherwise described: The density of the initial state $X_0$, the transition density of $X_{t+1}$ given $X_t = x_t$ and the emission density of $Y_t$ given $X_t = x_t$, taken with respect to some dominating measure, exist and are defined as follows:
$$X_0 \sim p_0(\,\cdot\,), \qquad X_{t+1} \mid \{X_t = x_t\} \sim p(\,\cdot \mid x_t), \qquad Y_t \mid \{X_t = x_t\} \sim p(\,\cdot \mid x_t).$$
Two common inference problems of the hidden states in the HMM are
filtering and smoothing. The filtering distributions refer to
or the joint smoothing distribution
where $x_{0:T}$ and $y_{0:T}$ are abbreviations of $(x_0, \ldots, x_T)$ and $(y_0, \ldots, y_T)$, respectively. An exact solution is available for a linear Gaussian HMM using the Rauch–Tung–Striebel smoother (Rauch et al., 1965) and for an HMM with a finite-space Markov process (Baum and Petrie, 1966). In most other cases, the smoothing distribution is not analytically tractable.
filter smoother (TFS) which employs a standard forward particle filter and a
backward information filter to sample from the marginal smoothing distribu-
tions. Godsill et al. (2004) propose the forward filtering backward simulation
algorithm (FFBSi) which targets the joint smoothing distribution. Typically,
these algorithms have quadratic complexities in N for generating N samples.
Fearnhead et al. (2010) and Klaas et al. (2006) propose two smoothing al-
gorithms with lower computational complexity, but their methods do not
provide asymptotically unbiased estimates.
Using the idea of D&C SMC, we focus on HMMs rather than a general PGM. We similarly construct an auxiliary tree T to split an HMM into sub-models, and aim to estimate the joint smoothing distribution $p(x_{0:T} \mid y_{0:T})$. We thus call the algorithm the tree-based particle smoothing algorithm (TPS). The key differences between TPS and other smoothing algorithms lie in its non-sequential sampling procedure and a more adaptive merging step of samples.
Our main contribution in this chapter is the investigation of four classes of intermediate target distributions in an HMM, which is key for a good overall performance of TPS. The strategy of Lindsten et al. (2017) for building these distributions is applicable to a general PGM rather than to an HMM only. Moreover, the empirical performance of their method can be unstable, which we explore in Section 3.10.
The second class uses an estimate of the filtering distribution p(xj |y0:j )
at Tj and an estimate of the joint filtering distribution p(xj:l |y0:l ) at Tj:l .
Working with this estimate involves tuning a preliminary particle filter.
sense. Furthermore, under this construction, we approximately retain the marginal distribution of each random variable $X_j$ as the marginal smoothing distribution $p(x_j \mid y_{0:T})$ at every level of the tree. The price of implementing TPS with these intermediate target distributions is that it relies on estimates of both the filtering and the (marginal) smoothing distributions, but not necessarily the joint smoothing distribution. We then propose some parametric and non-parametric approaches to construct these intermediate distributions based on the pre-generated Monte Carlo samples.
The fourth class inherits from the exact (joint) filtering distribution
p(xj |y0:j ) at Tj and p(xj:l |y0:l ) at Tj:l . TPS using this class of intermediate
target distributions employs the samples directly from a filtering algorithm at
the leaf nodes. It is straightforward to implement with no tuning procedures.
random variables given {Xt }t∈N : The conditional distribution of Yt only de-
pends on Xt (Cappé et al., 2006). We assume the underlying Markov pro-
cess {Xt }t∈N is not observable and call each Xt a hidden state of the HMM.
We only have access to the stochastic process {Yt }t∈N linked to the process
{Xt }t∈N , and call each Yt an observation of the HMM. The inference of an
HMM is hence conducted with the information of the observations only.
many fields, and are defined as
1. The stochastic processes $\{q_{t-1}\}_{t \in \mathbb{N}}$ and $\{r_t\}_{t \in \mathbb{N}}$ are independent Gaussian noises with $q_{t-1} \sim N(0, Q_{t-1})$ and $r_t \sim N(0, R_t)$;
3. The transition matrices $\{A_{t-1}\}_{t \in \mathbb{N}}$, $\{H_t\}_{t \in \mathbb{N}}$ are known with proper dimensions.
$$X_t = f_t(X_{t-1}, q_{t-1}), \qquad Y_t = h_t(X_t, r_t), \qquad (3.5)$$
where $\{f_t\}_{t \in \mathbb{N}}$ and $\{h_t\}_{t \in \mathbb{N}}$ are non-linear functions of appropriate dimensions, and $\{q_{t-1}\}_{t \in \mathbb{N}}$ and $\{r_t\}_{t \in \mathbb{N}}$ are independent Gaussian noises.
Markov process. A factorial HMM (Ghahramani and Jordan, 1996) allows
each observation Yt to propagate through multiple hidden states from parallel
Markov processes. A Markov-switching model provides a more general form
of the HMMs in the sense that the conditional distribution of Yt given all
past variables now depends on Xt , Yt−1 and possibly even earlier observations
(Cappé et al., 2006).
HMMs and their extensions are widely exploited in the areas such as
speech recognition (Rabiner, 1989; Rabiner and Juang, 1986; Huang et al.,
1990; Bahl et al., 1986), computer vision (Yamato et al., 1992; Brand et al.,
1997), econometrics (Hamilton, 1989), biology (Ball and Rice, 1992; Sonnham-
mer et al., 1998; Krogh et al., 2001; Petersen et al., 2011) and medical imaging
(Zhang et al., 2001).
for t = 1, . . . , T .
4. The dependence structure between each observation Yt and its corre-
sponding hidden state Xt is defined by the emission probability density
Yt |{Xt = xt } ∼ p( · |xt )
for t = 0, . . . , T .
Filtering and smoothing are two common inference problems which attempt
to estimate the hidden states given the observations in an HMM. Filtering
estimates the current state Xt given the observations up to the same time
step t whereas smoothing conditions on all observations till the final time
step T (Särkkä, 2013).
X0:T given the observations y0:T :
Kalman (1960) presents the exact solution to the filtering problem (3.6) in a linear Gaussian HMM, which is called the Kalman filter (KF).
probabilistic terms:
$$p(x_0) = N(x_0 \mid m_0, P_0).$$
Kalman (1960) proves that the prediction distribution of the hidden state $p(x_t \mid y_{0:t-1})$, the filtering distribution $p(x_t \mid y_{0:t})$ and the prediction distribution of the observation $p(y_t \mid y_{0:t-1})$ are all normally distributed with
cursively from t = 1 with the following prediction and update steps. The prediction step follows
$$m_t^- = A_{t-1} m_{t-1}, \qquad P_t^- = A_{t-1} P_{t-1} A_{t-1}^T + Q_{t-1},$$
and the update step follows
$$v_t = y_t - H_t m_t^-, \qquad S_t = H_t P_t^- H_t^T + R_t, \qquad K_t = P_t^- H_t^T S_t^{-1},$$
$$m_t = m_t^- + K_t v_t, \qquad P_t = P_t^- - K_t S_t K_t^T.$$
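For illustration, one prediction and update step can be written out for a one-dimensional state and observation, where all matrices reduce to scalars (a sketch with hypothetical parameter values, not the thesis code):

```python
def kf_step(m, P, y, A, Q, H, R):
    """One scalar Kalman filter step: predict with (A, Q), update with (H, R)."""
    # Prediction
    m_pred = A * m
    P_pred = A * P * A + Q
    # Update
    v = y - H * m_pred            # innovation
    S = H * P_pred * H + R        # innovation variance
    K = P_pred * H / S            # Kalman gain
    m_new = m_pred + K * v
    P_new = P_pred - K * S * K
    return m_new, P_new
```

With A = H = 1 and Q = 0, the step reduces to a conjugate Gaussian posterior update for a single observation.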
3.3.2 Rauch–Tung–Striebel Smoother
The parameters $\{m_t^s, P_t^s\}_{t=0}^{T-1}$ are computed from backward recursions:
$$m_{t+1}^- = A_t m_t,$$
$$P_{t+1}^- = A_t P_t A_t^T + Q_t,$$
$$G_t = P_t A_t^T [P_{t+1}^-]^{-1},$$
$$m_t^s = m_t + G_t (m_{t+1}^s - m_{t+1}^-),$$
$$P_t^s = P_t + G_t (P_{t+1}^s - P_{t+1}^-) G_t^T,$$
where $m_t$, $P_t$ are the means and covariances computed from the KF. We have $m_T^s = m_T$ and $P_T^s = P_T$, and the backward recursions start from t = T − 1.
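The backward recursion can likewise be sketched for a scalar state, reusing the filtered pair (m_t, P_t) and the smoothed pair of the next step (an illustrative sketch, not the thesis code):

```python
def rts_step(m, P, ms_next, Ps_next, A, Q):
    """One backward RTS smoother recursion for a scalar state."""
    m_pred = A * m                 # m^-_{t+1}
    P_pred = A * P * A + Q         # P^-_{t+1}
    G = P * A / P_pred             # smoother gain G_t
    ms = m + G * (ms_next - m_pred)
    Ps = P + G * (Ps_next - P_pred) * G
    return ms, Ps
```

Starting from (m_T, P_T) and iterating this step down to t = 0 yields the marginal smoothing means and variances.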
3.4 Particle Methods
Particle methods are an alternative way of estimating the filtering and smooth-
ing distributions in HMMs, which do not rely on the Gaussian approximation
techniques described in Section 3.3.1. Particle methods employ Monte Carlo
simulation to produce samples, which are also called particles, from the pos-
terior distribution such as filtering or smoothing. The methods fall into a
sub-class of the sequential Monte Carlo (SMC) procedures, which we will
discuss in Section 3.4.2. Before this, we first demonstrate a fundamental
technique used in SMC called importance sampling.
$$I\big(\varphi(X)\big) = \int_D \varphi(x)\, \pi(x)\, dx,$$
to rewrite $I(\varphi(X))$ as follows:
$$I\big(\varphi(X)\big) = \int_D \varphi(x)\, \frac{\pi(x)}{q(x)}\, q(x)\, dx = E\left[\varphi(Y)\, \frac{\pi(Y)}{q(Y)}\right],$$
where Y ∼ q. A straightforward Monte Carlo estimate $\hat{I}(\varphi(X))$ of $I(\varphi(X))$ is given by
$$\hat{I}\big(\varphi(X)\big) = \frac{1}{N} \sum_{i=1}^N \varphi(x^{(i)})\, \frac{\pi(x^{(i)})}{q(x^{(i)})},$$
where $\{x^{(i)}\}_{i=1}^N \sim q$. The density q is also known as a proposal or an importance density. We reweight each particle $x^{(i)}$ by a factor $\pi(x^{(i)})/q(x^{(i)})$. We call this factor $w^{(i)} = \pi(x^{(i)})/q(x^{(i)})$ the unnormalised importance weight of $x^{(i)}$, and $w(x) = \pi(x)/q(x)$ is the unnormalised importance weight function.
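A minimal Python sketch of this estimator, approximating E[X²] = 1 under a standard normal target π using a wider normal proposal q = N(0, 2²); the target, proposal and sample size are illustrative choices, not taken from the thesis:

```python
import random, math

random.seed(0)

def norm_pdf(x, s):
    """Density of N(0, s^2)."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2 * math.pi))

N = 100000
total = 0.0
for _ in range(N):
    x = random.gauss(0.0, 2.0)                   # x ~ q
    w = norm_pdf(x, 1.0) / norm_pdf(x, 2.0)      # unnormalised weight pi/q
    total += w * x * x
estimate = total / N  # should be close to Var(X) = 1 under pi
```

Since q has heavier tails than π, the weight function is bounded and the estimator has finite variance.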
A similar technique can be applied to the target distribution $\pi_t(x_{0:t})$ defined on the product space $D^{t+1}$. We can rewrite the target distribution $\pi_t(x_{0:t})$ as
$$\pi_t(x_{0:t}) = \frac{\gamma_t(x_{0:t})}{Z_t}, \qquad Z_t = \int_{D^{t+1}} \gamma_t(x_{0:t})\, dx_{0:t}.$$
function:
$$\pi_t(x_{0:t}) = \frac{1}{Z_t}\, w_t(x_{0:t})\, q_t(x_{0:t}), \qquad Z_t = \int_{D^{t+1}} w_t(x_{0:t})\, q_t(x_{0:t})\, dx_{0:t}, \qquad (3.9)$$
where
$$w_t(x_{0:t}) = \frac{\gamma_t(x_{0:t})}{q_t(x_{0:t})}. \qquad (3.10)$$
The particle approximations of $\pi_t$ and $Z_t$ are
$$\hat\pi_t(x_{0:t}) = \sum_{i=1}^N W_t(x_{0:t}^{(i)})\, \delta_{x_{0:t}^{(i)}}(x_{0:t}), \qquad \hat{Z}_t = \frac{1}{N} \sum_{i=1}^N w_t(x_{0:t}^{(i)}),$$
where
$$W_t(x_{0:t}^{(i)}) = \frac{w_t(x_{0:t}^{(i)})}{\sum_{j=1}^N w_t(x_{0:t}^{(j)})}$$
is the normalised weight function and $\delta_{x'}(x)$ is the Dirac measure with mass located at $x'$. We use the notation $\{x_t^{(i)}, W_t^{(i)}\}_{i=1}^N \sim \pi_t$ to indicate that the weighted particles $\{x_t^{(i)}, W_t^{(i)}\}_{i=1}^N$ provide a particle approximation to $\pi_t$. The relative variance of $\hat{Z}_t$ satisfies
$$\frac{\mathrm{Var}(\hat{Z}_t)}{Z_t^2} = \frac{1}{N} \left( \int_{D^{t+1}} \frac{\pi_t^2(x_{0:t})}{q_t(x_{0:t})}\, dx_{0:t} - 1 \right). \qquad (3.11)$$
The expectation of a test function $\varphi_t : D^{t+1} \to \mathbb{R}$, defined as
$$I_t\big(\varphi_t(X_{0:t})\big) = \int_{D^{t+1}} \varphi_t(x_{0:t})\, \pi_t(x_{0:t})\, dx_{0:t},$$
can be approximated by
$$\hat{I}_t\big(\varphi_t(X_{0:t})\big) = \int_{D^{t+1}} \varphi_t(x_{0:t})\, \hat\pi_t(x_{0:t})\, dx_{0:t} = \sum_{i=1}^N W_t(x_{0:t}^{(i)})\, \varphi_t\big(x_{0:t}^{(i)}\big),$$
which is biased for finite N. The asymptotic bias and variance are both O(1/N) (Doucet and Johansen, 2009).
ing dimension by importance sampling (Doucet and Johansen, 2009). We aim to approximate $\{\pi_t\}_{t=0}^T$ and $\{Z_t\}_{t=0}^T$ in (3.9) sequentially using SIS. The proposal $q_t$ in SIS is required to have the following form for t > 0:
$$q_t(x_{0:t}) = q_{t-1}(x_{0:t-1})\, q_t(x_t \mid x_{0:t-1}).$$
We can then rewrite
$$\pi_t(x_{0:t}) = \frac{\gamma_t(x_{0:t})}{Z_t} = \frac{1}{Z_t}\, \frac{\gamma_t(x_{0:t})}{q_t(x_{0:t})}\, q_t(x_{0:t})$$
and define the incremental weight
$$\alpha_t(x_{0:t}) = \frac{\gamma_t(x_{0:t})}{\gamma_{t-1}(x_{0:t-1})\, q_t(x_t \mid x_{0:t-1})},$$
so that
$$\pi_t(x_{0:t}) = \frac{1}{Z_t}\, q_t(x_{0:t})\, w_{t-1}(x_{0:t-1})\, \alpha_t(x_{0:t}). \qquad (3.13)$$
We generate the unnormalised weighted particles $\{x_{0:t}^{(i)}, w_{0:t}^{(i)}\}_{i=1}^N \sim \pi_t$ given $\{x_{0:t-1}^{(i)}, w_{0:t-1}^{(i)}\}_{i=1}^N \sim \pi_{t-1}$ using (3.13). We simulate a new sample $x_t^{(i)}$ from the proposal $q_t(\cdot \mid x_{0:t-1}^{(i)})$, and append $x_t^{(i)}$ to the history $x_{0:t-1}^{(i)}$ for i = 1, …, N. By reweighting each particle $x_{0:t}^{(i)}$ with the factor $w_{t-1}(x_{0:t-1}^{(i)})\, \alpha_t(x_{0:t}^{(i)})$, we obtain $\{x_{0:t}^{(i)}, w_{0:t}^{(i)}\}_{i=1}^N \sim \pi_t$. Starting from t = 0 where an importance sam-
Algorithm 1: Sequential importance sampling (SIS)
1  for t = 0 do
2      for i = 1 to N do
3          Sample x_0^{(i)} ∼ q_0(·);
4          Compute the unnormalised importance weight: w_0^{(i)} = π_0(x_0^{(i)}) / q_0(x_0^{(i)});
5      end
6  end
7  for t = 1 to T do
8      for i = 1 to N do
9          Sample x_t^{(i)} ∼ q_t(· | x_{0:t−1}^{(i)}, y_{0:t}) and denote x_{0:t}^{(i)} = (x_{0:t−1}^{(i)}, x_t^{(i)});
10         Compute the unnormalised importance weight: w_t^{(i)} = w_{t−1}^{(i)} α_t(x_{0:t}^{(i)});
11     end
12 end
In SIS, we sequentially update the weights at each step, which can result in a large variance: Most particles have negligible weights and very few carry massive weights, which then dominate the final estimate of the distribution (Arulampalam et al., 2002). We call this phenomenon weight degeneracy. Choosing a good proposal hence becomes crucial. Doucet et al. (2001) prove that the optimal proposal $q_t^{opt}$, which minimises the variance amongst all proposals conditional on $x_{0:t-1}$ and $y_{0:t}$, has the following form:
to the extended Kalman filter is applied (Doucet et al., 2000). A good approximation should cover the support of the target and capture both the tail behaviour and the mode(s) of the optimal proposal.
Algorithm 2: Multinomial resampling
1  for i = 1 to N do
2      Sample index j(i) from a multinomial distribution with the probability vector (W̃^{(1)}, …, W̃^{(N)});
3      Let x^{(i)} = x̃^{j(i)} and W^{(i)} = 1/N;
4  end
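Algorithm 2 amounts to drawing N indices with replacement according to the normalised weights; a Python sketch (the function name is ours):

```python
import random

random.seed(1)

def multinomial_resample(particles, weights):
    """Draw N particles with probabilities given by the weights;
    resampled particles all receive the equal weight 1/N."""
    N = len(particles)
    new = random.choices(particles, weights=weights, k=N)
    return new, [1.0 / N] * N

xs, ws = multinomial_resample(['a', 'b', 'c'], [0.1, 0.1, 0.8])
```

Particles with large weights are likely duplicated, while those with negligible weights tend to be discarded.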
Implementing an additional resampling procedure in SIS leads to the sequential importance resampling (SIR) algorithm. However, resampling may not be necessary at every time step. Adaptive resampling performs a resampling step once the diversity of the particles drops below a threshold $N_{thres}$. One measure for this diversity is the effective sample size (ESS), which assesses the variability of the weights in importance sampling. The formula (Kong et al., 1994; Liu and Chen, 1995) of the ESS is given by
$$N_{eff} = \frac{N}{1 + \mathrm{Var}(w_t)}, \qquad (3.14)$$
which can be estimated by
$$\hat{N}_{eff} = \frac{\big(\sum_{i=1}^N w_t^{(i)}\big)^2}{\sum_{i=1}^N \big(w_t^{(i)}\big)^2}. \qquad (3.15)$$
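The estimate (3.15) can be computed directly from the unnormalised weights; a small sketch:

```python
def ess(weights):
    """Effective sample size estimate (3.15) from unnormalised weights."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

# Equal weights give ESS = N; a single dominant weight gives ESS near 1.
```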
The SIR procedure which adaptively performs the resampling steps in SIS
is shown in Algorithm 3. As most resampling procedures return normalised
weights, we additionally perform normalisation to the particles if a resam-
pling step is not necessary.
3.4.3 Particle Filtering and Smoothing
We apply SIR to simulate samples from the filtering and smoothing distri-
butions. In the context of filtering and smoothing in the HMM, we have
The algorithm employing SIR to address the filtering problem is also called a particle filter (PF). At each time step t, it does not need to retain the particles of any previous hidden state $X_{t'}$ where t' < t, since the filtering distribution $p(x_{t'} \mid y_{0:t'})$ only conditions on the observations up to t' and does not require updating when new observations come in. See Algorithm 4 for the implementation of the PF, which generates the normalised weighted particles $\{x_t^{(i)}, W_t^{(i)}\}_{i=1}^N \sim p(\,\cdot \mid y_{0:t})$ sequentially from t = 0 to t = T. We can therefore estimate the filtering distribution by
$$\hat{p}(x_t \mid y_{0:t}) = \sum_{i=1}^N W_t^{(i)}\, \delta_{x_t^{(i)}}(x_t).$$
Algorithm 4: Particle filter (PF)
1  for t = 0 do
2      for i = 1 to N do
3          Sample x̃_0^{(i)} ∼ q_0(·);
4          Compute the unnormalised importance weight: w̃_0^{(i)} = p(y_0 | x̃_0^{(i)}) p_0(x̃_0^{(i)}) / q_0(x̃_0^{(i)});
5      end
6      if N̂_eff < N_thres then
7          Implement the resampling step and denote the resampled particles (with normalised weights) by {x_0^{(i)}, W_0^{(i)}}_{i=1}^N;
8      else
9          Normalise the weights and denote them by {W_0^{(i)}}_{i=1}^N;
10         Denote the weighted particles by {x_0^{(i)} = x̃_0^{(i)}, W_0^{(i)}}_{i=1}^N;
11     end
12 end
13 for t = 1 to T do
14     for i = 1 to N do
15         Sample x̃_t^{(i)} ∼ q_t(· | x_{0:t−1}^{(i)}, y_{0:t});
16         Compute the unnormalised importance weight:
               w̃_t^{(i)} = W_{t−1}^{(i)} p(y_t | x̃_t^{(i)}) p(x̃_t^{(i)} | x_{t−1}^{(i)}) / q_t(x̃_t^{(i)} | x_{0:t−1}^{(i)}, y_{0:t});
17     end
18     if N̂_eff < N_thres then
19         Implement the resampling step and denote the resampled particles (with normalised weights) by {x_t^{(i)}, W_t^{(i)}}_{i=1}^N;
20     else
21         Normalise the weights and denote them by {W_t^{(i)}}_{i=1}^N;
22         Denote the weighted particles by {x_t^{(i)} = x̃_t^{(i)}, W_t^{(i)}}_{i=1}^N;
23     end
24 end
Algorithm 5: Bootstrap particle filter (BPF)
1  for t = 0 do
2      for i = 1 to N do
3          Sample x̃_0^{(i)} ∼ p_0(·);
4          Compute the unnormalised importance weight: w̃_0^{(i)} = p(y_0 | x̃_0^{(i)});
5      end
6      if N̂_eff < N_thres then
7          Implement the resampling step and denote the resampled particles (with normalised weights) by {x_0^{(i)}, W_0^{(i)}}_{i=1}^N;
8      else
9          Normalise the weights and denote them by {W_0^{(i)}}_{i=1}^N;
10         Denote the weighted particles by {x_0^{(i)} = x̃_0^{(i)}, W_0^{(i)}}_{i=1}^N;
11     end
12 end
13 for t = 1 to T do
14     for i = 1 to N do
15         Sample x̃_t^{(i)} ∼ p(· | x_{t−1}^{(i)});
16         Compute the unnormalised importance weight: w̃_t^{(i)} = W_{t−1}^{(i)} p(y_t | x̃_t^{(i)});
17     end
18     if N̂_eff < N_thres then
19         Implement the resampling step and denote the resampled particles (with normalised weights) by {x_t^{(i)}, W_t^{(i)}}_{i=1}^N;
20     else
21         Normalise the weights and denote them by {W_t^{(i)}}_{i=1}^N;
22         Denote the weighted particles by {x_t^{(i)} = x̃_t^{(i)}, W_t^{(i)}}_{i=1}^N;
23     end
24 end
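A compact bootstrap particle filter for an illustrative scalar linear Gaussian model (the model, constants and resampling-at-every-step choice are ours, not the thesis'): $X_0 \sim N(0,1)$, $X_t = 0.9 X_{t-1} + N(0,1)$, $Y_t = X_t + N(0, 0.5^2)$.

```python
import random, math

random.seed(2)

def bpf(ys, N=2000, a=0.9, sy=0.5):
    """Bootstrap particle filter returning the filtering means E[X_t | y_{0:t}]."""
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]  # X_0 ~ p_0
    means = []
    for t, y in enumerate(ys):
        if t > 0:
            # propagate through the transition density (bootstrap proposal)
            xs = [random.gauss(a * x, 1.0) for x in xs]
        # weight by the emission density p(y_t | x_t) (constant factor dropped)
        ws = [math.exp(-0.5 * ((y - x) / sy) ** 2) for x in xs]
        tot = sum(ws)
        means.append(sum(w * x for w, x in zip(ws, xs)) / tot)
        # multinomial resampling at every step for simplicity
        xs = random.choices(xs, weights=ws, k=N)
    return means
```

For the first observation, the filtering mean agrees with the conjugate Gaussian posterior mean up to Monte Carlo error.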
PS tracks the full history of particles with necessary resampling updates. See Algorithm 6, which simulates the weighted particles $\{x_{0:T}^{(i)}, W_{0:T}^{(i)}\}_{i=1}^N \sim p(\,\cdot \mid y_{0:T})$.
Algorithm 6: Particle smoother (PS)
1  for t = 0 do
2      for i = 1 to N do
3          Sample x̃_0^{(i)} ∼ q_0(·);
4          Compute the unnormalised importance weight: w̃_0^{(i)} = p(y_0 | x̃_0^{(i)}) p_0(x̃_0^{(i)}) / q_0(x̃_0^{(i)});
5      end
6      if N̂_eff < N_thres then
7          Implement the resampling step and denote the resampled particles (with normalised weights) by {x_0^{(i)}, W_0^{(i)}}_{i=1}^N;
8      else
9          Normalise the weights and denote them by {W_0^{(i)}}_{i=1}^N;
10         Denote the weighted particles by {x_0^{(i)} = x̃_0^{(i)}, W_0^{(i)}}_{i=1}^N;
11     end
12 end
13 for t = 1 to T do
14     for i = 1 to N do
15         Sample x̃_t^{(i)} ∼ q_t(· | x_{0:t−1}^{(i)}, y_{0:t}) and let x̃_{0:t}^{(i)} = (x_{0:t−1}^{(i)}, x̃_t^{(i)});
16         Compute the unnormalised importance weight:
               w̃_t^{(i)} = W_{t−1}^{(i)} p(y_t | x̃_t^{(i)}) p(x̃_t^{(i)} | x_{t−1}^{(i)}) / q_t(x̃_t^{(i)} | x_{0:t−1}^{(i)}, y_{0:t});
17     end
18     if N̂_eff < N_thres then
19         Implement the resampling step and denote the resampled particles (with normalised weights) by {x_{0:t}^{(i)}, W_t^{(i)}}_{i=1}^N;
20     else
21         Normalise the weights and denote them by {W_t^{(i)}}_{i=1}^N; denote the weighted particles by {x_{0:t}^{(i)} = x̃_{0:t}^{(i)}, W_t^{(i)}}_{i=1}^N;
22     end
23 end
FFBSm is based upon the decomposition of the marginal smoothing distribution proposed by Kitagawa (1987):

∫ [ p(x_{t+1} | y_{0:T}) p(x_{t+1} | x_t) / p(x_{t+1} | y_{0:t}) ] dx_{t+1} ≈ Σ_{i=1}^N W_{t+1|T}^(i) p(x_{t+1|T}^(i) | x_t) / p(x_{t+1|T}^(i) | y_{0:t}).   (3.17)
We further approximate p(x_{t+1|T}^(i) | y_{0:t}), appearing in the denominator of (3.17), by

p(x_{t+1|T}^(i) | y_{0:t}) = ∫ p(x_{t+1|T}^(i) | x_t) p(x_t | y_{0:t}) dx_t ≈ Σ_{j=1}^N W_{t|t}^(j) p(x_{t+1|T}^(i) | x_{t|t}^(j)).   (3.18)
Algorithm 7: Forward filtering backward smoothing (FFBSm)
 1 Implement the PF which generates {x_{t|t}^(i), W_{t|t}^(i)}_{i=1}^N ~ p( · | y_{0:t}) for t = 0, . . . , T ;
 2 for t = T − 1 to 0 do
 3   for j = 1 to N do
 4     Compute V_j = Σ_{l=1}^N W_{t|t}^(l) p(x_{t+1|T}^(j) | x_{t|t}^(l)) ;
 5   end
 6   for i = 1 to N do
 7     Let x_{t|T}^(i) = x_{t|t}^(i) ;
 8     Compute the normalised weight: W_{t|T}^(i) = W_{t|t}^(i) Σ_{j=1}^N W_{t+1|T}^(j) p(x_{t+1|T}^(j) | x_{t|T}^(i)) / V_j ;
 9   end
10   if N̂_eff < N_thres then
11     Implement the resampling step and override the notation of {x_{t|T}^(i), W_{t|T}^(i)}_{i=1}^N after resampling ;
12   end
An estimate p̂(x_t | y_{0:T}) of p(x_t | y_{0:T}) using (3.17) and (3.18) is hence given by

p̂(x_t | y_{0:T}) = Σ_{i=1}^N W_{t|t}^(i) δ_{x_t^(i)}(x_t) Σ_{j=1}^N W_{t+1|T}^(j) [ p(x_{t+1|T}^(j) | x_t^(i)) / Σ_{l=1}^N W_{t|t}^(l) p(x_{t+1|T}^(j) | x_{t|t}^(l)) ]
= Σ_{i=1}^N Σ_{j=1}^N W_{t+1|T}^(j) [ W_{t|t}^(i) p(x_{t+1|T}^(j) | x_t^(i)) / Σ_{l=1}^N W_{t|t}^(l) p(x_{t+1|T}^(j) | x_{t|t}^(l)) ] δ_{x_t^(i)}(x_t).

The normalised weight of the particle x_t^(i) is

W_{t|T}^(i) = Σ_{j=1}^N W_{t+1|T}^(j) [ W_{t|t}^(i) p(x_{t+1|T}^(j) | x_t^(i)) / Σ_{l=1}^N W_{t|t}^(l) p(x_{t+1|T}^(j) | x_{t|t}^(l)) ].   (3.19)
3.5.2 Forward Filtering Backward Sampling (FFBSi)

p(x_{0:T} | y_{0:T}) = p(x_T | y_{0:T}) ∏_{t=0}^{T−1} p(x_t | x_{t+1:T}, y_{0:T})
                     = p(x_T | y_{0:T}) ∏_{t=0}^{T−1} p(x_t | x_{t+1}, y_{0:t}).
The Monte Carlo estimate p̂(x_t | x_{t+1:T}, y_{0:T}) of p(x_t | x_{t+1:T}, y_{0:T}) using (3.20) yields the following equation:

p̂(x_t | x_{t+1:T}, y_{0:T}) = Σ_{i=1}^N W_{t|t+1}^(i) δ_{x_t^(i)}(x_t),

with the normalised weight W_{t|t+1}^(i) defined as

W_{t|t+1}^(i) = W_{t|t}^(i) p(x_{t+1} | x_{t|t}^(i)) / Σ_{j=1}^N W_{t|t}^(j) p(x_{t+1} | x_{t|t}^(j)).   (3.21)
Algorithm 8: Forward filtering backward simulation (FFBSi)
 1 Implement the PF which generates {x_{t|t}^(j), W_{t|t}^(j)}_{j=1}^N ~ p( · | y_{0:t}) for t = 0, . . . , T ;
 2 for i = 1 to N do
 3   for t = T do
 4     Sample x_{t|T}^(i) from {x_{t|t}^(j)}_{j=1}^N according to the weights {W_{t|t}^(j)}_{j=1}^N ;
 5   end
 6   for t = T − 1 to 0 do
 7     for j = 1 to N do
 8       Compute the normalised weight: W_{t|t+1}^(j) = W_{t|t}^(j) p(x_{t+1|T}^(i) | x_{t|t}^(j)) / Σ_{l=1}^N W_{t|t}^(l) p(x_{t+1|T}^(i) | x_{t|t}^(l)) ;
 9     end
10     Sample x_{t|T}^(i) from {x_{t|t}^(j)}_{j=1}^N according to the weights {W_{t|t+1}^(j)}_{j=1}^N ;
11     Append the new particle to the selected path: x_{t:T|T}^(i) = (x_{t|T}^(i), x_{t+1:T|T}^(i)) ;
12   end
13 end
The computational complexity for each path is O(T N ), and thus the total
complexity of FFBSi is O(T N 2 ).
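A minimal Python sketch of the backward pass of Algorithm 8, assuming the filtering particles and normalised weights for t = 0, . . . , T are already available as lists of arrays; `transition_pdf` is a hypothetical stand-in for the transition density p(x_{t+1} | x_t):

```python
import numpy as np

def ffbsi_paths(xs, Ws, transition_pdf, rng, n_paths):
    """Backward-simulation smoother.

    xs : list of (N,) arrays of filtering particles x_{t|t}, t = 0..T
    Ws : list of (N,) arrays of normalised filtering weights W_{t|t}
    Returns n_paths trajectories approximating p(x_{0:T} | y_{0:T}).
    """
    T = len(xs) - 1
    N = len(xs[0])
    paths = np.empty((n_paths, T + 1))
    for i in range(n_paths):
        # t = T: draw from the final filtering approximation.
        j = rng.choice(N, p=Ws[T])
        paths[i, T] = xs[T][j]
        for t in range(T - 1, -1, -1):
            # Backward weights W_{t|t+1}^(j) of equation (3.21).
            w = Ws[t] * transition_pdf(paths[i, t + 1], xs[t])
            w /= w.sum()
            j = rng.choice(N, p=w)
            paths[i, t] = xs[t][j]
    return paths
```

Each trajectory requires the O(N) backward-weight computation at every time step, which matches the O(T N) per-path cost and the O(T N²) total cost stated above.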
(TPS) to address the smoothing problem in an HMM using D&C SMC. We
demonstrate a unique construction of the auxiliary tree bearing intermediate
target distributions specified at non-root nodes. The root of the tree exactly
has p(x_{0:T} | y_{0:T}) as the target distribution. We then illustrate the sampling procedure in TPS, which produces particles independently across leaf nodes and recursively merges them via importance sampling towards the root.
TPS splits the HMM into sub-models based upon a binary tree decomposition. It first divides the target variable of all hidden states X_{0:T} = (X_0, . . . , X_T) into two disjoint subsets, and recursively applies binary splits to the resulting subsets, until each subset consists of a single hidden state. Each generated subset corresponds to a tree node and is assigned an intermediate target distribution. The root characterises the complete model with the target distribution p(x_{0:T} | y_{0:T}). All other intermediate target distributions at non-root nodes will be discussed in Section 3.7.
k = j + 2^p,   (3.22)

where p = ⌈log₂(l − j + 1)⌉ − 1.
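The split point of (3.22) is cheap to compute; a small sketch (the function name is ours):

```python
from math import ceil, log2

def split_index(j, l):
    """Split point k of equation (3.22): node T_{j:l}, covering the
    states X_j, ..., X_l, has children T_{j:k-1} and T_{k:l}, where
    k = j + 2^p with p = ceil(log2(l - j + 1)) - 1."""
    p = ceil(log2(l - j + 1)) - 1
    return j + 2 ** p
```

For T = 5, for example, the root T_{0:5} splits at k = 4, giving the children T_{0:3} and T_{4:5}; the left child is then a complete binary tree.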
Given the auxiliary tree using the above binary split, we mark the level L of each node. Let the leaf nodes be at the bottom level with L = 0. Then, the root is at the top level with L = ⌈log₂(T + 1)⌉, which will be illustrated with an example in the next paragraph. Hence, we build the auxiliary tree from top to bottom.
This type of auxiliary tree has several advantages: The random variable within each node has consecutive time indices. The left sub-tree is also a complete binary tree of 2^⌊log₂(T+1)⌋ leaf nodes. This would be useful in an on-line setting: When new observations become available, the samples from the complete sub-tree are retained if their intermediate target distributions remain unchanged.
Moreover, the tree has a depth of ⌈log₂(T + 1)⌉ + 1 levels, which implies a maximum of ⌈log₂(T + 1)⌉ updates of the samples corresponding to a hidden state when moving from the bottom of the tree to the top. In Figure 3.2, the samples of X_0, . . . , X_3 need to be updated three times from the leaf nodes, and those of X_4, X_5 need to be updated twice. When running the particle smoother (PS) to solve the smoothing problem, the samples at time step t = 0 need to be updated at every time step. Hence, the maximum number of updates of a hidden state in the PS is (T + 1), which is larger than ⌈log₂(T + 1)⌉ for T > 1.

[Figure 3.2: auxiliary tree for T = 5, with the root X_{0:5} at level 3 and the leaves X_0, . . . , X_5 at level 0.]

Usually, more updates indicate more resampling
steps. Therefore, TPS with the divide-and-conquer approach can potentially
mitigate path degeneracy for early time steps – Section 3.9.2 will show this
empirically.
The sampling process of TPS proceeds as follows: initial samples are generated at the bottom of the tree, independently between leaf nodes. These samples are recursively merged via importance sampling, aiming at the intermediate targets, until the root of the tree is reached.
a leaf node T_j, we sample from f_j directly. At a non-leaf node T_{j:l} with two children T_{j:k−1} and T_{k:l}, we may proliferate the particles, then merge, reweight and resample them to aim at an approximation of the intermediate target distribution. To illustrate this, we first adopt the pre-stored particles

S_1 = {x̃_{j:k−1}^(i), W̃_{j:k−1}^(i)}_{i=1}^N ~ f_{j:k−1}

from T_{j:k−1} and

S_2 = {x̃_{k:l}^(i), W̃_{k:l}^(i)}_{i=1}^N ~ f_{k:l}

from T_{k:l}. A particle approximation of h_{j:l} can be obtained using the product measure given by the two empirical measures formed by S_1 and S_2, with complexity O(N²). Nevertheless, we choose another routine which potentially achieves a lower cost: we first optionally proliferate the particles (see Section 3.6.3) in S_1 and S_2 respectively. Then we combine the particles with the same indices from the two sets, which we refer to as a merging step (Lindsten et al., 2017, Section 3.2).
S_1′ = {x̃_{j:k−1}^(a_i), Ŵ_{j:k−1}^(a_i)}_{i=1}^{N′} ~ f_{j:k−1}

and

S_2′ = {x̃_{k:l}^(b_i), Ŵ_{k:l}^(b_i)}_{i=1}^{N′} ~ f_{k:l},

where the user-specified sample size N′ is usually larger than the required sample size N, {a_i}_{i=1}^{N′} and {b_i}_{i=1}^{N′} are the indices returned from the proliferation step, and {Ŵ_{j:k−1}^(a_i)}_{i=1}^{N′} and {Ŵ_{k:l}^(b_i)}_{i=1}^{N′} are the associated normalised weights. The merged particles are denoted by

S′ = {x̃_{j:l}^(i) = (x̃_{j:k−1}^(a_i), x̃_{k:l}^(b_i)), w̃_{j:l}^(i) = Ŵ_{j:k−1}^(a_i) Ŵ_{k:l}^(b_i)}_{i=1}^{N′} ~ h_{j:l}.   (3.23)
TPS applies Algorithm 9 recursively from the leaf nodes to the root. The
computational flow is shown in Figure 3.3 when T = 5. In particular, when
the sampling process advances from level 0 to 1, we just preserve the particles
of X4 and X5 , and merge them from level 1 to 2. When the whole sampling
process is finished, each node contains the normalised weighted samples from
the corresponding (intermediate) target distribution. To save memory space,
for L = 0, . . . , dlog2 (T + 1)e, we may discard the samples from level L in the
tree once the sampling process at level (L + 1) is complete.
[Figure 3.3: computational flow of TPS when T = 5. Level 0 holds {x_t^(i), W_t^(i)}_{i=1}^N = TPS_gen(t, t) for t = 0, . . . , 5; level 1 holds {x_{0:1}^(i), W_{0:1}^(i)}_{i=1}^N = TPS_gen(0, 1), {x_{2:3}^(i), W_{2:3}^(i)}_{i=1}^N = TPS_gen(2, 3) and the preserved {x_4^(i), W_4^(i)}_{i=1}^N, {x_5^(i), W_5^(i)}_{i=1}^N; the root at level 3 returns {x_{0:5}^(i), W_{0:5}^(i)}_{i=1}^N ~ p(x_{0:5} | y_{0:5}) = TPS_gen(0, 5).]
 6 else
 7   Let p = ⌈log(l − j + 1) / log 2⌉ − 1 and k = j + 2^p ;
 8   Adopt S_1 = {x̃_{j:k−1}^(i), W̃_{j:k−1}^(i)}_{i=1}^N ← TPS_gen(j, k − 1) from T_{j:k−1} and S_2 = {x̃_{k:l}^(i), W̃_{k:l}^(i)}_{i=1}^N ← TPS_gen(k, l) from T_{k:l} ;
 9   Proliferate S_1 and S_2 if necessary, and denote the merged particles by {x̃_{j:l}^(i), w̃_{j:l}^(i)}_{i=1}^{N′} with updated weights

       ŵ_{j:l}^(i) = w̃_{j:l}^(i) f_{j:l}(x̃_{j:l}^(i)) / ( f_{j:k−1}(x̃_{j:k−1}^(i)) f_{k:l}(x̃_{k:l}^(i)) ) ;   (3.25)

12 end
13 Resample {x̃_{j:l}^(i), ŵ_{j:l}^(i)}_{i=1}^{N′} to obtain the normalised weighted particles {x_{j:l}^(i), W_{j:l}^(i)}_{i=1}^N ;
14 end
3.6.3 Proliferation

The main reason for applying the proliferation step is that more merged particles, with a potentially increased diversity, are considered in importance sampling. When the output sample size N is fixed, proliferation may mitigate weight degeneracy and boost the sampling quality. Nevertheless, empirical evidence suggests that, given sufficient memory space and computational budget, a larger output sample size N = N₁ with no proliferation is always preferable to setting a smaller sample size N = N₂ < N₁ with N′ = N₁ in the proliferation step. We demonstrate three proliferation methods.
Mixture Sampling

S′ = {(x̃_{j:k−1}^(a_i), x̃_{k:l}^(b_i)), w̃_{j:l}^(i) = Ŵ_{j:k−1}^(a_i) Ŵ_{k:l}^(b_i)}_{(a_i, b_i) ∈ {1,...,N}×{1,...,N}},

where the normalised weights are Ŵ_{j:k−1}^(a_i) = W̃_{j:k−1}^(a_i)/N and Ŵ_{k:l}^(b_i) = W̃_{k:l}^(b_i)/N for every pair (a_i, b_i) ∈ {1, . . . , N} × {1, . . . , N}.
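Mixture sampling can be sketched as follows, pairing every particle of one child with every particle of the other. As an assumption of this sketch, we keep the plain product weights W₁^(a) W₂^(b), which already sum to one, rather than the 1/N-scaled version above, since the merged weights are renormalised later anyway:

```python
import numpy as np

def mixture_proliferate(x1, W1, x2, W2):
    """Mixture-sampling proliferation: form all N' = N*N pairs of
    the two children's particles, each pair weighted by the product
    of its parents' normalised weights.

    x1, x2 : (N,) arrays of child particles
    W1, W2 : (N,) arrays of normalised child weights
    """
    N = len(x1)
    a = np.repeat(np.arange(N), N)   # indices a_i into the first child
    b = np.tile(np.arange(N), N)     # indices b_i into the second child
    pairs = np.column_stack([x1[a], x2[b]])
    w = W1[a] * W2[b]                # product weights, summing to one
    return pairs, w
```

This makes the O(N²) cost of mixture sampling explicit: the merged set has N² members, which is why the subsampled alternatives discussed in this section can be preferable.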
and {Ŵ_{j:k−1}^(a_i)}_{i=1}^{N′} are the updated normalised weights. Likewise, we denote
Permutation
Given the auxiliary tree T constructed in the way described in Section 3.6.1, we define the intermediate target distributions of the sub-models. We apply the method of Lindsten et al. (2017) to build one class of intermediate target distributions and develop three new classes based upon the filtering or smoothing distributions.
3.7.1 Distribution Suggested by Lindsten et al. (2017) (TPS-L)

f_{j:l}(x_{j:l}) ∝ p(y_j | x_j) ∏_{i=j}^{l−1} p(x_{i+1} | x_i) p(y_{i+1} | x_{i+1}).
Assume T_{j:l} connects two children T_{j:k−1} and T_{k:l} carrying the pre-generated particles {x̃_{j:k−1}^(i), W̃_{j:k−1}^(i)}_{i=1}^N ~ f_{j:k−1} at T_{j:k−1} ∈ T and {x̃_{k:l}^(i), W̃_{k:l}^(i)}_{i=1}^N ~ f_{k:l} at T_{k:l} ∈ T. The unnormalised importance weight ŵ_{j:l}^(i) of the merged particles x̃_{j:l}^(i) = (x̃_{j:k−1}^(i), x̃_{k:l}^(i)) in (3.25) becomes ŵ_{j:l}^(i) = w̃_{j:l}^(i) p(x̃_k^(i) | x̃_{k−1}^(i)), since all other factors cancel in the ratio f_{j:l} / (f_{j:k−1} f_{k:l}). At the root,

f_{0:T}(x_{0:T}) = p(x_{0:T} | y_{0:T}) ∝ p₀(x₀) p(y₀ | x₀) ∏_{i=0}^{T−1} p(x_{i+1} | x_i) p(y_{i+1} | x_{i+1}).
intermediate target distribution:

f_{j:l}(x_{j:l}) ∝ p̂(x_j | y_{0:j}) ∏_{i=j}^{l−1} p(x_{i+1} | x_i) p(y_{i+1} | x_{i+1}) ≈ p(x_{j:l} | y_{0:l}).
Figure 3.4: Estimated (unnormalised) filtering and smoothing densities at t = 390 (left) and t = 391 (right) in the non-linear HMM. Linear scale on the y-axis.
Given the proposal density h_{j:l} = f_{j:k−1} f_{k:l} at T_{j:l}, being the product of the densities of two independent random variables, we claim that the minimum KL divergence is attained when the two densities are the marginal target densities of the corresponding random variables.
Theorem 4. Let f be the probability density function of (X₁, X₂) defined on D^{n₁+n₂}, and let h₁ and h₂ be the probability density functions of two independent random variables X₁ and X₂ defined on D^{n₁} and D^{n₂}, respectively. If h₁(x₁)h₂(x₂) > 0 whenever f(x₁, x₂) > 0, then

∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log [ f(x₁, x₂) / (h₁(x₁) h₂(x₂)) ] dx₁ dx₂ ≥ ∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log [ f(x₁, x₂) / (f₁(x₁) f₂(x₂)) ] dx₁ dx₂,

where f₁(x₁) = ∫_{D^{n₂}} f(x₁, x₂) dx₂ and f₂(x₂) = ∫_{D^{n₁}} f(x₁, x₂) dx₁ are the marginal densities of f(x₁, x₂).
By Jensen's inequality,

∫_{D^{n₁}} f₁(x₁) log f₁(x₁) dx₁ − ∫_{D^{n₁}} f₁(x₁) log h₁(x₁) dx₁
= ∫_{D^{n₁}} f₁(x₁) log [ f₁(x₁) / h₁(x₁) ] dx₁
= E[ log ( f₁(X₁) / h₁(X₁) ) ] = E[ −log ( h₁(X₁) / f₁(X₁) ) ]
≥ −log E[ h₁(X₁) / f₁(X₁) ] = 0.
Using this and the definition of marginal density,

∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log f₁(x₁) dx₁ dx₂
= ∫_{D^{n₁}} f₁(x₁) log f₁(x₁) dx₁
≥ ∫_{D^{n₁}} f₁(x₁) log h₁(x₁) dx₁   (3.28)
= ∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log h₁(x₁) dx₁ dx₂.
Similarly,

∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log f₂(x₂) dx₁ dx₂ ≥ ∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log h₂(x₂) dx₁ dx₂.   (3.29)
Summing (3.28) and (3.29) and negating gives

∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log [ 1 / (f₁(x₁) f₂(x₂)) ] dx₁ dx₂ ≤ ∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log [ 1 / (h₁(x₁) h₂(x₂)) ] dx₁ dx₂.

Adding ∫_{D^{n₂}} ∫_{D^{n₁}} f(x₁, x₂) log f(x₁, x₂) dx₁ dx₂ to both sides yields the result.
3.7.4 Estimates of the Smoothing Distributions (TPS-ES)

f_{j:l}(x_{j:l}) ∝ p̂(x_j | y_{0:j}) [ p̂(x_l | y_{0:T}) / p̂(x_l | y_{0:l}) ] ∏_{i=j}^{l−1} p(x_{i+1} | x_i) p(y_{i+1} | x_{i+1})
≈ p(x_j | y_{0:j}) [ p(x_l | y_{0:T}) / p(x_l | y_{0:l}) ] ∏_{i=j}^{l−1} p(x_{i+1} | x_i) p(y_{i+1} | x_{i+1})
= p(x_{j:l} | y_{0:T}),

where p̂(x_j | y_{0:j}) approximates the filtering distribution at time step j. The preliminary constructions of {p̂(x_j | y_{0:j})}_{j=0,...,T} and {p̂(x_j | y_{0:T})}_{j=0,...,T} will be proposed in Section 3.7.5.

ŵ_{j:l}^(i) = w̃_{j:l}^(i) [ p̂(x̃_{k−1}^(i) | y_{0:k−1}) / ( p̂(x̃_{k−1}^(i) | y_{0:T}) p̂(x̃_k^(i) | y_{0:k}) ) ] p(x̃_k^(i) | x̃_{k−1}^(i)) p(y_k | x̃_k^(i)).   (3.30)
3.7.5 Intermediate Target Distributions at Leaf Nodes
We first consider parametric approaches. We can fit the data with a common probability distribution such as a normal distribution. We can also accommodate a mixture model if the target distribution appears multi-modal. The parameters can be estimated in various ways, including moment matching, the maximum likelihood method and the expectation–maximisation algorithm.
The parametric approaches are reasonably quick and simple. For instance, assuming a normal distribution requires the evaluation of the mean and variance, which can be easily obtained from the samples using moment
matching. Nevertheless, the target distribution may not be well approximated under a parametric assumption.
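For instance, a sketch of moment matching for a normal leaf-node target from weighted samples (the function name is ours, not the thesis's):

```python
import numpy as np

def fit_normal_moment_matching(x, W):
    """Fit a normal distribution to weighted samples by equating
    its mean and variance to the weighted sample moments.

    x : (n,) array of samples
    W : (n,) array of normalised weights
    """
    mean = np.sum(W * x)
    var = np.sum(W * (x - mean) ** 2)
    return mean, var
```

The fitted density can then be evaluated cheaply in the importance weight (3.25), which is precisely why the parametric route is fast compared with the non-parametric alternatives below.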
Alternatively, we can employ non-parametric approaches, for instance a kernel density estimator (KDE). We need to select the type of kernel and the bandwidth in advance. The complexity of generating N new samples is O(log(n) N), and the evaluation of all densities is more computationally intensive, with complexity O(nN).
f(x) ∝ Σ_{i=1}^n 1_{x ∈ [x_i − ∆/2, x_i + ∆/2)} d_i.   (3.31)
in (3.30). To avoid this, we consider mixture probability distributions which take into account the samples from both the filtering and smoothing distributions. Assume at time step j the first uniform grid consists of the points x₁^f < x₂^f < . . . < x_{n_f}^f such that x_{i+1}^f − x_i^f = ∆^f for i = 1, . . . , (n_f − 1), with the estimated filtering densities d₁^f, . . . , d_{n_f}^f from a KDE, and assume the second uniform grid consists of the points x₁^s < x₂^s < . . . < x_{n_s}^s such that x_{i+1}^s − x_i^s = ∆^s for i = 1, . . . , (n_s − 1), with the estimated smoothing densities d₁^s, . . . , d_{n_s}^s from another KDE. Then the estimated filtering density p̂(x | y_{0:j}) is given by
3.7.6 Exact Filtering Distributions (TPS-F)

ŵ_{j:l}^(i) = w̃_{j:l}^(i) p(x̃_{k:l}^(i) | x̃_{j:k−1}^(i)) / p(x̃_{k:l}^(i) | y_{0:k−1}).   (3.34)
The denominator of (3.34) can be decomposed into

p(x̃_{k:l}^(i) | y_{0:k−1}) = ∫ p(x̃_{k:l}^(i) | x̃_{j:k−1}) p(x̃_{j:k−1} | y_{0:k−1}) dx̃_{j:k−1}
= ∫ p(x̃_k^(i) | x̃_{k−1}) p(x̃_{k+1}^(i) | x̃_k^(i)) . . . p(x̃_l^(i) | x̃_{l−1}^(i)) p(x̃_{j:k−1} | y_{0:k−1}) dx̃_{j:k−1}
= p(x̃_l^(i) | x̃_{l−1}^(i)) . . . p(x̃_{k+1}^(i) | x̃_k^(i)) ∫ p(x̃_k^(i) | x̃_{k−1}) p(x̃_{j:k−1} | y_{0:k−1}) dx̃_{j:k−1}.

The numerator of (3.34) admits the same factorisation of the transition densities, so the common factors cancel and the weight simplifies to

ŵ_{j:l}^(i) = w̃_{j:l}^(i) p(x̃_k^(i) | x̃_{k−1}^(i)) / ∫ p(x̃_k^(i) | x̃_{k−1}) p(x̃_{j:k−1} | y_{0:k−1}) dx̃_{j:k−1}.   (3.35)
Approximating the integral in (3.35) with the particles {x̃_{j:k−1}^(l), W̃_{j:k−1}^(l)}_{l=1}^N from T_{j:k−1} yields

ŵ_{j:l}^(i) ≈ w̃_{j:l}^(i) p(x̃_k^(i) | x̃_{k−1}^(i)) / Σ_{l=1}^N p(x̃_k^(i) | x̃_{k−1}^(l)) W̃_{j:k−1}^(l).
At each non-leaf node, the effort of computing the weights is O(N²), and the total complexity of the algorithm is O(T N²) if no proliferation procedure is applied. Estimating the importance weights using Monte Carlo samples is also seen in Doucet et al. (2000) and Klaas et al. (2005).
TPS-F does not require any tuning process in the construction of the intermediate target distributions, which renders the whole algorithm easy to implement. However, the complexity is still quadratic in the sample size N, so it does not outperform the conventional smoothing algorithms.
3.8 Diagnostics
We define two metrics: relative effective sample size (RESS) and marginal
relative effective sample size (MRESS) to assess the quality of the importance
sampling steps in TPS. We apply the two metrics to a toy model and to the
simulation models from Sections 3.9 and 3.10.
apply a proliferation procedure (see Section 3.6.3), which boosts the number of samples in importance sampling at an additional computational cost. Other techniques such as tempering (Chopin, 2002) may also be employed.

RESS and MRESS can exploit the above two factors. We first review the effective sample size (ESS) defined in Section 3.4.2, which assesses the variability of the weights in a general importance sampling procedure. The formula of ESS (Doucet et al., 2000) is given by

N̂_eff = ( Σ_{i=1}^N w^(i) )² / Σ_{i=1}^N (w^(i))².   (3.36)
The relative effective sample size is then

RESS = (1/N) ( Σ_{i=1}^N w^(i) )² / Σ_{i=1}^N (w^(i))² = N̂_eff / N,

which is the ratio between the effective sample size and the actual sample size. A perfect importance sampling step with equally weighted samples returns a RESS equal to 1.
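RESS is direct to compute from unnormalised weights; a minimal sketch (the helper name is ours):

```python
import numpy as np

def ress(w):
    """Relative effective sample size of unnormalised importance
    weights w: the ESS of equation (3.36) divided by the sample
    size, i.e. (sum w)^2 / (N * sum w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w.size * np.sum(w ** 2))
```

Equal weights give a RESS of 1, while a single dominant weight among N samples drives it towards 1/N.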
to (3.23), we denote the merged samples after a proliferation step by

{(x₁^(a_i), x₂^(b_i)), w^(a_i, b_i)}_{i=1}^{N′} ~ h₁h₂,

where N′ is the number of returned samples, {a_i}_{i=1}^{N′} and {b_i}_{i=1}^{N′} are the indices from the proliferation step, and {w^(a_i, b_i)}_{i=1}^{N′} are the associated unnormalised importance weights. The relative effective sample size (RESS) of the samples {(x₁^(a_i), x₂^(b_i)), w^(a_i, b_i)}_{i=1}^{N′} is defined by

RESS = (1/N′) ( Σ_{a_i} Σ_{b_i} w^(a_i, b_i) )² / Σ_{a_i} Σ_{b_i} (w^(a_i, b_i))².
We then define the marginal relative effective sample size MRESS₁ of the samples {(x₁^(a_i), x₂^(b_i)), w^(a_i, b_i)}_{i=1}^{N′} by

MRESS₁ = (1/N) ( Σ_{a_i} Σ_{b_i} w^(a_i, b_i) )² / Σ_{a_i} ( Σ_{b_i} w^(a_i, b_i) )²,

and likewise

MRESS₂ = (1/N) ( Σ_{a_i} Σ_{b_i} w^(a_i, b_i) )² / Σ_{b_i} ( Σ_{a_i} w^(a_i, b_i) )².
There exists a strict relationship between RESS and MRESS if the proliferation step proceeds with mixture sampling (see Section 3.6.3), which is stated in Theorem 5. Recall that the exact form of the merged particles after mixture sampling is

{(x₁^(a_i), x₂^(b_i)), w^(a_i, b_i)}_{(a_i, b_i) ∈ {1,...,N}×{1,...,N}},

where the unnormalised weight is w^(a_i, b_i) = W₁^(a_i) W₂^(b_i). We simplify the notation and write

{(x₁^(i), x₂^(j)), w^(i,j)}_{(i,j) ∈ {1,...,N}×{1,...,N}}.
Theorem 5. MRESS₁ ≥ RESS and MRESS₂ ≥ RESS.
Proof. We start from

Σ_{k=1}^N Σ_{1≤i<j≤N} (w^(k,i) − w^(k,j))² ≥ 0.

Expanding the above expression, and rearranging the squared terms and the product terms, gives

(N − 1) Σ_{j=1}^N Σ_{i=1}^N (w^(i,j))² ≥ Σ_{k=1}^N Σ_{1≤i<j≤N} 2 w^(k,i) w^(k,j).

Adding Σ_{j=1}^N Σ_{i=1}^N (w^(i,j))² to both sides and completing the squares on the right-hand side gives

N Σ_{j=1}^N Σ_{i=1}^N (w^(i,j))² ≥ Σ_{i=1}^N ( Σ_{j=1}^N w^(i,j) )².
So

1 / ( N² Σ_{j=1}^N Σ_{i=1}^N (w^(i,j))² ) ≤ 1 / ( N Σ_{i=1}^N ( Σ_{j=1}^N w^(i,j) )² ).

Multiplying both sides by ( Σ_{j=1}^N Σ_{i=1}^N w^(i,j) )² gives the result.
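Theorem 5 can also be checked numerically on random weight matrices; the sketch below implements the three definitions for mixture-sampling weights w^(i,j) (the function name is ours):

```python
import numpy as np

def ress_mress(w):
    """RESS, MRESS1 and MRESS2 of an (N, N) matrix of
    mixture-sampling weights w[i, j], with rows indexing the first
    child's particles and columns the second child's."""
    N = w.shape[0]
    total = w.sum() ** 2
    ress = total / (N ** 2 * np.sum(w ** 2))
    mress1 = total / (N * np.sum(w.sum(axis=1) ** 2))   # marginal over x2
    mress2 = total / (N * np.sum(w.sum(axis=0) ** 2))   # marginal over x1
    return ress, mress1, mress2
```

On any non-negative weight matrix both marginal quantities dominate RESS, as the theorem asserts.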
By the law of large numbers, as N → ∞,

lim_{N→∞} (1/N²) Σ_{i=1}^N Σ_{j=1}^N w^(i,j) = lim_{N→∞} (1/N²) Σ_{i=1}^N Σ_{j=1}^N f(x₁^(i), x₂^(j)) / ( h₁(x₁^(i)) h₂(x₂^(j)) ) = 1,   (3.37)

lim_{N→∞} (1/N²) Σ_{i=1}^N Σ_{j=1}^N (w^(i,j))² = ∫_{D^{n₂}} ∫_{D^{n₁}} f²(x₁, x₂) / ( h₁(x₁) h₂(x₂) ) dx₁ dx₂,   (3.38)

lim_{N→∞} (1/N³) Σ_{i=1}^N ( Σ_{j=1}^N w^(i,j) )² = ∫_{D^{n₁}} f₁²(x₁) / h₁(x₁) dx₁,   (3.39)

lim_{N→∞} (1/N³) Σ_{j=1}^N ( Σ_{i=1}^N w^(i,j) )² = ∫_{D^{n₂}} f₂²(x₂) / h₂(x₂) dx₂,   (3.40)

where we recall f₁(x₁) = ∫_{D^{n₂}} f(x₁, x₂) dx₂ and f₂(x₂) = ∫_{D^{n₁}} f(x₁, x₂) dx₁.
Combining (3.37)–(3.40), we have

lim_{N→∞} 1/RESS = ∫_{D^{n₂}} ∫_{D^{n₁}} f²(x₁, x₂) / ( h₁(x₁) h₂(x₂) ) dx₁ dx₂,
lim_{N→∞} 1/MRESS₁ = ∫_{D^{n₁}} f₁²(x₁) / h₁(x₁) dx₁,
lim_{N→∞} 1/MRESS₂ = ∫_{D^{n₂}} f₂²(x₂) / h₂(x₂) dx₂.

When N is large, RESS, MRESS₁ and MRESS₂ therefore roughly quantify the relative effective sample sizes with respect to the target distributions f, f₁ and f₂ when using the proposals h₁h₂, h₁ and h₂, respectively. A low MRESS₁ (resp. MRESS₂) implies a poor marginal proposal h₁ (resp. h₂) for approximating the marginal distribution f₁ (resp. f₂) as the target. Large MRESS₁ and MRESS₂ accompanied by an extremely low RESS imply a strong correlation between X₁ and X₂ under the target distribution f.
3.8.2 Examples
Figure 3.5: Average RESS and MRESS for three parameter settings (m₁ = 0, m₂ = 0; m₁ = 1, m₂ = 0; m₁ = 1, m₂ = 1), computed over 1000 simulations, in the toy model when ρ = 0 (left) and ρ = 0.9 (right).
We set the sample size N = 1000. The average RESS, MRESS1 and
MRESS2 over 1000 simulations are plotted in Figure 3.5 when ρ = 0 (left)
and ρ = 0.9 (right). When h1 (resp. h2 ) is identical to the marginal of f ,
i.e. m1 = 0 (resp. m2 = 0), MRESS1 (resp. MRESS2 ) is (very close to) 1
regardless of ρ. RESS is affected by ρ as expected, which decreases as the
correlation increases. When both RESS and MRESS are low, no decisive
conclusion can be reached whether the poor merging step is caused only
by the ineffective marginal proposals h1 , h2 or additionally by the strong
correlation within the target variable.
We use a relatively small T = 31 in both models for better visualisation,
and set τ = 1, σ = 5 in the non-linear HMM. The output sample size N is
500. We implement TPS in both models and employ normal distributions as
the intermediate target distributions at the leaf nodes.
Figure 3.6 presents two trees, which indicate RESS and MRESS of all
importance sampling steps from TPS applied to the linear Gaussian HMM
(left) and to the non-linear HMM (right). Suppose we aim for RESS and
MRESS at a non-leaf node T_{j:l}, which is situated at level L. We search for a point (in black) with a y-coordinate between (L − 1) and L on the vertical line t = (j + l)/2. Then, the value of RESS at T_{j:l} equals the y-coordinate of the point minus (L − 1). Note that each point connects a square (in red)
of the point minus (L − 1). Note that each point connects a square (in red)
and a triangle (in blue) via solid lines. The corresponding MRESS1 (resp.
MRESS2 ) equals the y-coordinate value of the square (resp. triangle) minus
(L − 1).
In Figure 3.6, the linear Gaussian HMM and the non-linear HMM demonstrate rather different scenarios of RESS and MRESS. In the linear Gaussian HMM, most MRESS and RESS values are greater than 0.5, which indicates relatively effective importance sampling attempts in this model. This is because the proposals and the (intermediate) target distributions are both normally distributed with close means and variances. However, in the non-linear HMM, the RESS and MRESS values are much lower, so our final estimate of the joint smoothing distribution may not be highly accurate. The reason for a poor importance sampling step at a node can be inferred from RESS and MRESS in the same way as in the toy model in (3.41).
Figure 3.6: RESS and MRESS of all importance sampling steps from TPS, presented using a tree, applied to the linear Gaussian HMM (left) and to the non-linear HMM (right). See Section 3.8.2 for an illustration of reading off the values of RESS and MRESS corresponding to each tree node in TPS.
We study the empirical performance of TPS against other Monte Carlo smoothing algorithms in a linear Gaussian HMM. We first describe the model and propose two metrics which respectively measure sampling error and sample diversity. Then we run the algorithms under roughly the same computational effort and discuss the results.
X_t = 0.8 X_{t−1} + V_t,   t = 1, . . . , T,
Y_t = X_t + W_t,   t = 0, . . . , T,   (3.42)
where T = 127, X0 , V1 , . . . , VT , W0 , . . . , WT are independent with X0 ∼
N (0, 1), Vt ∼ N (0, 1), Wt ∼ N (0, 1). The smoothing solution can be ob-
tained analytically from the Rauch–Tung–Striebel smoother (RTSs) (Rauch
et al., 1965) described in Section 3.3.2.
MSEm_m = (1/(T + 1)) Σ_{t=0}^T ( Ê_m[X_t | y_{0:T}] − E[X_t | y_{0:T}] )²,

MSEv_m = (1/(T + 1)) Σ_{t=0}^T ( V̂ar_m[X_t | y_{0:T}] − Var[X_t | y_{0:T}] )²,

been resampled or not. We assume the weights are all positive. We create a new set of weighted particles Ŝ = {x̂^(i), Ŵ^(i)}_{i=1}^{N_r} which satisfies

3. For every x̂^(i) ∈ Ŝ, there exists at least one x^(j) ∈ S such that x^(j) = x̂^(i).
We assign the normalised weight Ŵ^(i) of x̂^(i) ∈ Ŝ to be the total weight of those particles in S that have values identical to x̂^(i). Formally, it is defined as

Ŵ^(i) = Σ_{j=1}^N W^(j) 1_{x^(j) = x̂^(i)},

and we define

ESSoED = 1 / Σ_{i=1}^{N_r} (Ŵ^(i))².
A sampling step where all particles are distinct yields an ESSoED of N , and
in the most extreme case that the empirical distribution of S is degenerate at
one point, its ESSoED is 1. More serious weight degeneracy implies a lower
ESSoED.
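A sketch of ESSoED for univariate particles, grouping identical values before applying the ESS formula (the helper name is ours):

```python
import numpy as np

def essoed(x, W):
    """Effective sample size of the empirical distribution: sum the
    normalised weights of particles sharing the same value, then
    apply 1 / sum of squared grouped weights.

    x : (N,) array of particle values
    W : (N,) array of normalised weights
    """
    x = np.asarray(x)
    W = np.asarray(W, dtype=float)
    values = np.unique(x)
    grouped = np.array([W[x == v].sum() for v in values])
    return 1.0 / np.sum(grouped ** 2)
```

Distinct, equally weighted particles give ESSoED = N, while a degenerate empirical distribution collapses it to 1, in line with the extremes described above.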
Table 3.1: Performance of the smoothing algorithms in the linear Gaussian HMM using
comparable computational effort.
from the samples of a bootstrap particle filter (BPF). The choice of a normal
distribution is motivated by the normality of the true smoothing distribution.
We extend the name of TPS-EF to TPS-EF-N.
The simulation results regarding the mean square errors are shown in Table 3.1. TPS-EF-N and TPS-L both have the same complexity O(T N) as the BPS, and hence generate far more particles than FFBSm, FFBSi and TPS-F, whose complexity is O(T N²). TPS-EF-N has the lowest MSEm and MSEv, followed by TPS-L, and outperforms FFBSm and FFBSi. TPS-F, though not involving any tuning step, produces the largest MSEm and MSEv.
Figure 3.7: Effective sample size of the empirical distribution (ESSoED) of BPS, FFBSm, FFBSi, TPS-EF-N and TPS-F, averaged over 500 simulations for each time step t, in the smoothing algorithms applied to the linear Gaussian HMM. Log scale on the y-axis.
In Figure 3.7, we plot the effective sample size of the empirical distribution (ESSoED) averaged over M = 500 simulations for each time step t. The BPS provides a very large ESSoED at later time steps while suffering, as expected, from path degeneracy at early time steps: its ESSoED drops to approximately 50 with a sample size of 44000 at t = 0. TPS-EF-N has a large ESSoED at all time steps, consistently outperforming FFBSm, FFBSi and TPS-F; it is only surpassed by the BPS after t = 100. We hence regard TPS-EF-N as an effective way of mitigating path degeneracy.
HMM in Section 3.10.2. We then propose two error metrics in Section 3.10.3. We perform the simulations of TPS and other algorithms in Section 3.10.4, and further compare TPS-EF and TPS-ES in Section 3.10.5.

X_t = X_{t−1}/2 + 25 X_{t−1}/(1 + X_{t−1}²) + 8 cos(1.2t) + V_t,   t = 1, 2, . . . , T,
Y_t = X_t²/20 + W_t,   t = 0, 1, . . . , T,   (3.43)
3.10.2 Benchmark

We aim for a new HMM which meets two conditions: it approximates the non-linear HMM, and therefore its smoothing solution; additionally, the smoothing solution of the new HMM can be obtained straightforwardly. We achieve these conditions by discretising the sample space of each hidden state into a finite space, and hence refer to the new HMM as the finite-space HMM.
P(Ẑ = z_i) =
  F_z((z₁ + z₂)/2)                                   if i = 1,
  F_z((z_i + z_{i+1})/2) − F_z((z_{i−1} + z_i)/2)    if i ∈ {2, 3, . . . , n − 1},
  1 − F_z((z_{n−1} + z_n)/2)                         if i = n,
  0                                                  otherwise,   (3.44)
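Equation (3.44) is straightforward to implement; the sketch below discretises a standard normal as an illustration (the grid and CDF here are our choices, not the thesis's simulation settings):

```python
import numpy as np
from math import erf, sqrt

def discretise(grid, cdf):
    """Probability mass function of the discretised variable Z^ on an
    ordered grid z_1 < ... < z_n, following equation (3.44): each
    interior point receives the mass between the midpoints of its
    neighbouring grid cells, and the end points absorb the tails."""
    z = np.asarray(grid, dtype=float)
    mids = (z[:-1] + z[1:]) / 2.0          # the n - 1 cell boundaries
    F = np.array([cdf(m) for m in mids])
    pmf = np.empty(z.size)
    pmf[0] = F[0]                          # left tail
    pmf[1:-1] = np.diff(F)                 # interior cells
    pmf[-1] = 1.0 - F[-1]                  # right tail
    return pmf

# Example: discretise a standard normal onto a grid of step 0.05.
normal_cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
grid = np.arange(-5.0, 5.0 + 1e-9, 0.05)
pmf = discretise(grid, normal_cdf)
```

Because the masses telescope, the resulting pmf sums to one by construction, whatever the grid.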
the distance between two consecutive grid points needs to be determined. For each model, we compute the marginal smoothing means based upon the discretised HMMs with different grid sizes. We then compute the error made by the Monte Carlo algorithms under each grid size, and ensure that the differences between the errors are negligible for some relatively small grid sizes. We then choose one which achieves a relatively low computational cost. In this simulation study, we select the grid size to be 0.02 when τ = 1, σ = 1, and 0.05 when τ = 1, σ = 5 and when τ = 5, σ = 1. Each grid constructed from X_t becomes the sample space of its discrete analogue, denoted by X̂_t. The sample space of each observation Y_t is unchanged.
following decomposition:
3.10.3 Metrics
Given the smoothing distributions {p(x̂_t | y_{0:T})}_{t=0}^T from the finite-space HMM, we define the mean square error of means (MSEm) of a Monte Carlo smoothing algorithm, which targets the original non-linear HMM, in the m-th simulation as

MSEm_m = (1/(T + 1)) Σ_{t=0}^T ( Ê_m[X_t | y_{0:T}] − E(X̂_t | y_{0:T}) )²,

where Ê_m[X_t | y_{0:T}] is the Monte Carlo mean of p(x_t | y_{0:T}) and E(X̂_t | y_{0:T}) is the smoothing mean of p(x̂_t | y_{0:T}) at time t from the finite-space HMM.
of the smoothing distribution (see Figure 3.4), and the measure of mean does not necessarily capture this. We additionally record the Kolmogorov–Smirnov (KS) statistic of a Kolmogorov–Smirnov test (KS test) (Massey Jr, 1951).

In our case, the KS test is not valid given the dependent samples generated by our smoothing algorithms. Neither are we interested in rejecting or not rejecting the null hypothesis of the test. Rather, we use the KS statistic as a metric, which can be viewed as one way of measuring the sampling quality of a Monte Carlo algorithm that simulates samples from the reference distribution.
In the context of the smoothing problem, F_{1,N} is formed by the samples of a Monte Carlo smoothing algorithm which estimates p(x_t | y_{0:T}), and F₂ is the CDF of the discrete distribution p(x̂_t | y_{0:T}) from the finite-space HMM. By choosing a suitable grid size for the discretisation, we guarantee that the error committed by discretisation when computing the KS statistic is negligible compared to the error committed by the smoothing algorithms to be implemented. See Section 3.10.2 for a similar consideration when computing the error based upon the discretised model. We denote the average KS statistic (KSS) over all time steps by KSS_m in the m-th simulation of the algorithm.
We compare the mean square error of means (MSEm) and the KS statistic (KSS) over M = 500 simulations with the same set of observations. The simulation results with different values of τ and σ are shown in Table 3.2. In the first two situations, TPS-L shows the largest MSEm and KSS,
Table 3.2: Performance of the smoothing algorithms using comparable computational effort
in the non-linear HMM.
Figure 3.8: CDF of the smoothing distribution, and of the filtering and sampling distributions of TPS-L, at time step t = 271 in the non-linear HMM when τ = 1, σ = 5.
TPS-EF-P and TPS-L produce dominant results with vastly smaller MSEm. Behind this, the relatively large variance of the transition density decreases the correlation between the hidden states, which makes the merging steps in TPS more effective. TPS-EF-P in this parameter setting exhibits the smallest KSS, whereas the BPS gives the largest despite generating the most samples.
3.10.5 Comparison between TPS-EF and TPS-ES
Table 3.3: Performance between TPS-EF-P and TPS-ES-P in the non-linear HMM.
3.11 Discussion
sten et al., 2017) to estimate the joint smoothing distribution p(x0:T |y0:T ) in
a hidden Markov model (HMM). The algorithm decomposes an HMM into
sub-models based upon a binary tree structure with intermediate target distributions defined at the non-root nodes. The root stands for our target,
which is the joint smoothing distribution.
We propose one generic way of constructing the binary tree which sequentially splits the hidden states X_{0:T}. We then discuss a general sampling procedure in TPS. To obtain the samples at a non-leaf node, we merge the particles from its two children using importance sampling. The merging process can be accompanied by optional proliferation and resampling steps. The computational complexity of this sampling procedure is adjustable and can be made linear in the required sample size.
filtering algorithm. Nevertheless, the proposal in the importance sampling
step can still be ineffective in some highly non-linear and complicated HMMs.
TPS-ES builds the intermediate targets which estimate the (joint) smoothing distributions. It roughly maintains the marginal of each hidden state
from the intermediate target distributions invariant at all levels of the aux-
iliary tree. The algorithm is more computationally intensive and demands
preliminary runs of a filtering and a (marginal) smoothing algorithm.
the proposal due to the assumed independence structure in TPS.
The investigation of TPS for longer time series is left for future work. One advantage of applying the divide-and-conquer approach is its parallel implementation or distributed computing for a lower runtime cost (Lindsten et al., 2017). Given pre-determined intermediate target distributions, TPS can be employed in parallel or in a distributed computing environment. However, the preliminary run of a filtering or smoothing algorithm for constructing the intermediate target distributions may prevent an efficient implementation. One possible solution is to run independent filtering and smoothing algorithms on each machine given part of the observations. Another advantage of TPS is its comparatively high ESSoED under a cost constraint, which could be more pronounced for longer time series.
4 Tree-based Sampling Algorithms for Parameter Estimation in a Hidden Markov Model

4.1 Introduction
process, and the distribution of the observation Y_t only depends on X_t. In this chapter, we assume the following densities exist with respect to some dominating measure and are denoted by

X_0 ∼ p_0(·),
X_{t+1} | X_t = x_t ∼ p_θ(· | x_t) for t = 0, …, T − 1,
Y_t | X_t = x_t ∼ p_θ(· | x_t) for t = 0, …, T,

where θ ∈ Θ is the model parameter and T is the final time step of the process.
The inference problem of the HMM can be categorised into two scenarios
where the model parameter θ is known and unknown respectively. Given an
HMM with all parameters specified, the algorithms for solving the filtering
and smoothing problems have been discussed in Chapter 3.
The Bayesian approach treats θ as a random parameter and computes its posterior. The ML approach finds the parameter value maximising the likelihood function conditional on the observations. Both Bayesian and ML methods can be implemented in an off-line or on-line manner. In this chapter, we focus on off-line Bayesian methods.
The parameter estimation problem has been studied in previous work employing the off-line Bayesian approach. Kitagawa (1998) proposes
two Monte Carlo algorithms which both simulate samples from the posterior
distributions {p(θ, xt |y0:T )}Tt=0 . The first algorithm applies forward filter-
ing backward smoothing (FFBSm) described in Section 3.5.1 to the aug-
mented space that corresponds to (θ, xt ). The second one employs fixed
lag-smoothing from a particle filter. Lee and Chia (2002) introduce a sequential Monte Carlo (SMC) algorithm with rejuvenation steps performed by Markov chain Monte Carlo (MCMC). Andrieu et al. (2010) propose an MCMC algorithm called the particle marginal Metropolis-Hastings (PMMH) sampler, which employs MCMC updates aided by SMC. Andrieu et al. (2010)
also introduce a Gibbs sampler which iteratively samples from p(θ|x0:T , y0:T )
and pθ (x0:T |y0:T ), where the sampling procedure from pθ (x0:T |y0:T ) admits a
conditional SMC update. Whiteley (2010); Lindsten et al. (2014) improve the
Gibbs sampler by introducing backward simulation to rejuvenate samples.
algorithm (TPE). We are interested in sampling from p(θ, x_{0:T} | y_{0:T}):

p(θ, x_{0:T} | y_{0:T}) ∝ µ(θ) p_0(x_0) ∏_{t=0}^{T−1} p_θ(x_{t+1} | x_t) ∏_{t=0}^{T} p_θ(y_t | x_t),

where µ and p_0 are the priors of the two independent random variables θ and X_0.
We first construct the auxiliary tree of TPE, which splits the HMM into sub-models using the divide-and-conquer approach (Lindsten et al., 2017). The tree divides the target variable (θ, X_{0:T}) across multiple levels and requires the random variables at the same level to contain disjoint hidden state(s) and a parameter variable. We also assume the random variables at the same level of the auxiliary tree are mutually independent. As required by D&C SMC (Lindsten et al., 2017), we define intermediate target distributions of the sub-models at the non-root nodes of the tree. At the root, the target distribution is precisely p(θ, x_{0:T} | y_{0:T}).
One challenge in TPE lies in the design of the proposal for the importance sampling steps. Given a non-leaf node T_{j:l}, the target variables of its children T_{j:k−1} (T_j if j = k − 1) and T_{k:l} (T_l if k = l) both contain an unknown parameter, namely θ_{j,k−1} and θ_{k,l}. A simple proposal, the product measure on the product space of (X_{j:k−1}, θ_{j,k−1}) ∼ f_{θ,j:k−1} and (X_{k:l}, θ_{k,l}) ∼ f_{θ,k:l} from the children, can be problematic: it yields a higher-dimensional variable (X_{j:l}, θ_{j,k−1}, θ_{k,l}) ∼ f_{θ,j:k−1} f_{θ,k:l} containing overlapping parameter variables, compared with the target variable (X_{j:l}, θ_{j,l}) at T_{j:l}.
stochastic approach produces additional noise. Both methods rejuvenate the parameter samples, which effectively alleviates weight degeneracy.
dimensional unknown parameter in Section 4.8. The chapter ends with a
discussion in Section 4.9.
where the only degree of freedom is the choice of q(· | θ). The acceptance ratio in the Metropolis-Hastings step is hence

1 ∧ ( p(θ*, x*_{0:T} | y_{0:T}) q(θ, x_{0:T} | θ*, x*_{0:T}) ) / ( p(θ, x_{0:T} | y_{0:T}) q(θ*, x*_{0:T} | θ, x_{0:T}) ) = 1 ∧ ( µ(θ*) p_{θ*}(y_{0:T}) q(θ | θ*) ) / ( µ(θ) p_θ(y_{0:T}) q(θ* | θ) ).    (4.2)
Andrieu et al. (2010) suggest using the SMC approximation in (4.1) for p_{θ*}(· | y_{0:T}) when generating a sample, and for the marginal density p_θ(y_{0:T}).
Therefore, the proposal density becomes

q(θ*, x*_{0:T} | θ, x_{0:T}) = q(θ* | θ) p̂_{θ*}(x*_{0:T} | y_{0:T}),    (4.3)

where p̂_{θ*}(· | y_{0:T}) is the SMC approximation of p_{θ*}(· | y_{0:T}). The acceptance ratio becomes

1 ∧ ( µ(θ*) p̂_{θ*}(y_{0:T}) q(θ | θ*) ) / ( µ(θ) p̂_θ(y_{0:T}) q(θ* | θ) ),    (4.4)

where p̂_θ(y_{0:T}) and p̂_{θ*}(y_{0:T}) are the SMC approximations of the corresponding marginal densities. Andrieu et al. (2010) prove that these PMMH updates leave the target distribution p(θ, x_{0:T} | y_{0:T}) invariant, and that the acceptance ratio (4.4) converges to (4.2) under mild assumptions as the sample size N → ∞.
Pitt et al. (2012) suggest choosing the sample size n in the smoother such that the standard deviation of the log-likelihood estimate, evaluated at a point θ̄ roughly equal to the posterior mean, is 0.92. See Algorithm 10 for the implementation of the PMMH sampler.
Algorithm 10: PMMH sampler
1 For i = 1:
2   Choose an arbitrary starting point θ^{(1)};
3   Generate a sample path x_{0:T}^{(1)} ∼ p̂_{θ^{(1)}}(· | y_{0:T}), where p̂_{θ^{(1)}}(· | y_{0:T}) is the particle approximation of p_{θ^{(1)}}(· | y_{0:T});
4   Denote the estimated density of p_{θ^{(1)}}(y_{0:T}) by p̂_{θ^{(1)}}(y_{0:T}).
5 For i = 2 to N:
6   Propose θ* ∼ q(· | θ^{(i−1)});
7   Generate a sample path x*_{0:T} ∼ p̂_{θ*}(· | y_{0:T}), where p̂_{θ*}(· | y_{0:T}) is the particle approximation of p_{θ*}(· | y_{0:T});
8   Denote the estimated density of p_{θ*}(y_{0:T}) by p̂_{θ*}(y_{0:T});
9   With probability given by the acceptance ratio (4.4), set (θ^{(i)}, x_{0:T}^{(i)}) = (θ*, x*_{0:T}) and p̂_{θ^{(i)}}(y_{0:T}) = p̂_{θ*}(y_{0:T}); otherwise, set (θ^{(i)}, x_{0:T}^{(i)}) = (θ^{(i−1)}, x_{0:T}^{(i−1)}) and p̂_{θ^{(i)}}(y_{0:T}) = p̂_{θ^{(i−1)}}(y_{0:T}).

We describe a sampling algorithm that estimates p(θ, x_{0:T} | y_{0:T}) using sequential Monte Carlo (SMC). The algorithm applies sequential importance resampling (SIR) similarly to a standard bootstrap particle smoother (see Section 3.4.3); the only difference is an initial sampling step for the unknown parameter. We call the algorithm sequential importance resampling for parameter estimation (SIR-PE). Algorithm 11 presents SIR-PE, where N̂_eff is the effective sample size (see Section 3.4.2) and N_thres is the threshold for a resampling step. Note that the samples of the parameter are generated only once and are sequentially reweighted afterwards. Hence, the path degeneracy problem is very likely to occur for t ≪ T. Gilks and Berzuini (2001); Lee and Lee (2006) additionally rejuvenate the particles via MCMC to increase their diversity. When T is small, SIR-PE is fast and efficient, and it will serve as a preliminary run in an extended version of TPE introduced in Section 4.6.
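As an illustration of the PMMH update in Algorithm 10, the following is a minimal sketch for a toy linear-Gaussian HMM with a single unknown parameter ρ; the model, the flat prior on ρ, and the random-walk step size are illustrative assumptions, not the thesis's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lik_hat(rho, y, n=200):
    """Bootstrap particle filter estimate of the log-likelihood log p_rho(y_{0:T})
    for the toy model X_0 ~ N(0,1), X_t = rho*X_{t-1} + V_t, Y_t = X_t + W_t."""
    x = rng.normal(0.0, 1.0, n)
    w = np.full(n, 1.0 / n)
    ll = 0.0
    for t, yt in enumerate(y):
        if t > 0:  # resample ancestors, then propagate through the transition
            x = rho * x[rng.choice(n, n, p=w)] + rng.normal(0.0, 1.0, n)
        lw = -0.5 * (yt - x) ** 2  # N(x_t, 1) emission density, up to a constant
        m = lw.max()
        u = np.exp(lw - m)
        ll += m + np.log(u.mean())
        w = u / u.sum()
    return ll

def pmmh(y, n_iter=100, step=0.2):
    """PMMH sketch: random-walk proposal on rho; the Metropolis-Hastings ratio
    uses the estimated likelihood, as in (4.4), here with a flat prior on rho."""
    rho, ll = 0.0, log_lik_hat(0.0, y)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        prop = rho + step * rng.normal()
        ll_prop = log_lik_hat(prop, y)
        if np.log(rng.uniform()) < ll_prop - ll:  # accept/reject
            rho, ll = prop, ll_prop
        chain[i] = rho
    return chain
```

The constant omitted in the emission density cancels in the acceptance ratio; with data simulated at ρ = 0.5, the chain typically concentrates around the true value.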
Algorithm 11: Sequential importance resampling for parameter estimation (SIR-PE)
1 For t = 0:
2   For i = 1 to N: sample θ̃_0^{(i)} ∼ µ(·) and x̃_0^{(i)} ∼ p_0(·); compute the unnormalised importance weight w̃_0^{(i)} = p_{θ̃_0^{(i)}}(y_0 | x̃_0^{(i)}).
3   If N̂_eff < N_thres, implement the resampling step and denote the resampled particles (with normalised weights) by {(θ_0^{(i)}, x_0^{(i)}), W_0^{(i)}}_{i=1}^{N}; otherwise, calculate the normalised weights {W_0^{(i)}}_{i=1}^{N} and obtain {(θ_0^{(i)} = θ̃_0^{(i)}, x_0^{(i)} = x̃_0^{(i)}), W_0^{(i)}}_{i=1}^{N}.
4 For t = 1 to T:
5   For i = 1 to N: sample x̃_t^{(i)} ∼ p_{θ_{t−1}^{(i)}}(· | x_{t−1}^{(i)}), let x̃_{0:t}^{(i)} = (x_{0:t−1}^{(i)}, x̃_t^{(i)}) and θ̃_t^{(i)} = θ_{t−1}^{(i)}; compute the unnormalised importance weight w̃_t^{(i)} = W_{t−1}^{(i)} p_{θ̃_t^{(i)}}(y_t | x̃_t^{(i)}).
6   If N̂_eff < N_thres, implement the resampling step and denote the resampled particles (with normalised weights) by {(θ_t^{(i)}, x_{0:t}^{(i)}), W_t^{(i)}}_{i=1}^{N}; otherwise, calculate the normalised weights {W_t^{(i)}}_{i=1}^{N} and denote the normalised weighted particles by {(θ_t^{(i)} = θ̃_t^{(i)}, x_{0:t}^{(i)} = x̃_{0:t}^{(i)}), W_t^{(i)}}_{i=1}^{N}.
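A minimal sketch of Algorithm 11 for a toy model with a single unknown parameter θ; the model X_t = θ X_{t−1} + V_t, Y_t = X_t + W_t and the prior θ ∼ N(0.5, 0.5²) are illustrative assumptions. The parameter particles are drawn only once, at t = 0, which is the source of the path degeneracy noted in the text.

```python
import numpy as np

def sir_pe(y, n=1000, seed=0):
    """SIR-PE sketch: sample (theta, X_0) from the priors once, then propagate
    the hidden state, reweight by the emission density, and resample jointly
    whenever the effective sample size falls below a threshold."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.5, 0.5, n)   # primary sampling step of the parameter
    x = rng.normal(0.0, 1.0, n)       # X_0 ~ p_0
    w = np.full(n, 1.0 / n)
    for t, yt in enumerate(y):
        if t > 0:
            x = theta * x + rng.normal(0.0, 1.0, n)   # propagate hidden state
        lw = np.log(w) - 0.5 * (yt - x) ** 2          # reweight by the emission
        w = np.exp(lw - lw.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < n / 2:              # N_eff below N_thres
            idx = rng.choice(n, n, p=w)               # multinomial resampling
            theta, x, w = theta[idx], x[idx], np.full(n, 1.0 / n)
    return theta, x, w
```

The returned weighted particles {θ^{(i)}, W^{(i)}} approximate the parameter posterior; for long series the surviving θ values collapse to a few ancestors, illustrating path degeneracy.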
(TPE), to approximate the posterior distribution p(θ, x0:T |y0:T ) in a hidden
Markov model (HMM).
k = j + 2^p, where p = ⌈log₂(l − j + 1)⌉ − 1,    (4.5)
[Figure 4.1: auxiliary tree of TPE when T = 5; the root (θ = θ_{0,5}, X_{0:5}) sits at level 3.]
the children of T_{j:l}. Then, the target random variables at T_{j:k−1} and T_{k:l} are (θ_{j,k−1}, X_{j:k−1}) and (θ_{k,l}, X_{k:l}) respectively, with intermediate target densities denoted by f_{θ,j:k−1} and f_{θ,k:l}. In the case of j = l, we stop the division and treat the node as a leaf containing the random variable (θ_j, X_j) with density f_{θ,j}. Starting from the root of the tree, which contains the target variable (θ = θ_{0,T}, X_{0:T}), the algorithm recursively bears children according to the rule in (4.5) until each node contains a single hidden state and a parameter variable. The construction of the auxiliary tree when T = 5 is shown in Figure 4.1. We mark the levels of the nodes as described in Section 3.6.1.
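The recursive construction above can be sketched in a few lines; `children` and `build_tree` are hypothetical helper names, and the sketch only enumerates the index ranges (j, l) of the nodes, not the attached distributions.

```python
import math

def children(j, l):
    """Split node T_{j:l} by rule (4.5): p = ceil(log2(l-j+1)) - 1, k = j + 2^p."""
    if j == l:
        return None  # leaf node: a single hidden state plus a parameter variable
    p = math.ceil(math.log2(l - j + 1)) - 1
    k = j + 2 ** p
    return (j, k - 1), (k, l)

def build_tree(j, l):
    """Enumerate all nodes (index ranges) of the auxiliary tree rooted at T_{j:l}."""
    c = children(j, l)
    if c is None:
        return [(j, l)]
    (a, b), (c0, d) = c
    return [(j, l)] + build_tree(a, b) + build_tree(c0, d)
```

For T = 5, `build_tree(0, 5)` yields the root (0, 5), its children (0, 3) and (4, 5), and so on down to the six leaves, matching Figure 4.1.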
from the children to create the random variable

(θ_{j,l}, ∆θ_{j,l}) = (g_1(θ_{j,k−1}, θ_{k,l}), g_2(θ_{j,k−1}, θ_{k,l})).

We denote the density of (θ_{j,l}, ∆θ_{j,l}, X_{j:k−1}, X_{k:l}) by h'_{θ,j:l}, which will serve as a proposal. By the transformation rule of random variables, we obtain

h'_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}) = f_{θ,j:k−1}(g_1^{−1}(θ_{j,l}, ∆θ_{j,l}), x_{j:k−1}) f_{θ,k:l}(g_2^{−1}(θ_{j,l}, ∆θ_{j,l}), x_{k:l}) |J(θ_{j,l}, ∆θ_{j,l})|,    (4.7)

where g_1^{−1} and g_2^{−1} are the inverse transformation functions and J(θ_{j,l}, ∆θ_{j,l}) is the Jacobian matrix.
We expand the probability space of the target variable (θ_{j,l}, X_{j:l}) to conform to that of (θ_{j,l}, ∆θ_{j,l}, X_{j:l}) in the proposal h'_{θ,j:l}. We concatenate a new independent random variable ∆θ_{j,l}, with a pre-defined distribution f̃_{j,l}, to the target variable (θ_{j,l}, X_{j:l}). We denote the density of the extended target variable (θ_{j,l}, ∆θ_{j,l}, X_{j:l}) by f'_{θ,j:l}, defined as the product of f̃_{j,l} and f_{θ,j:l}:

f'_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}) = f_{θ,j:l}(θ_{j,l}, x_{j:l}) f̃_{j,l}(∆θ_{j,l}).    (4.8)

We then apply importance sampling to simulate from f'_{θ,j:l} using the proposal h'_{θ,j:l}, and select the samples corresponding to (θ_{j,l}, X_{j:l}) from (θ_{j,l}, ∆θ_{j,l}, X_{j:l}) ∼ f'_{θ,j:l} to complete the sampling process at the node.
We adopt the samples

S_1 = {(θ̃_{j,k−1}^{(i)}, x̃_{j:k−1}^{(i)}), W̃_{j,k−1}^{(i)}}_{i=1}^{N} ∼ f_{θ,j:k−1} from T_{j:k−1}

and

S_2 = {(θ̃_{k,l}^{(i)}, x̃_{k:l}^{(i)}), W̃_{k,l}^{(i)}}_{i=1}^{N} ∼ f_{θ,k:l} from T_{k:l}.

After the optional proliferation step, we obtain

S'_1 = {(θ̃_{j,k−1}^{(a_i)}, x̃_{j:k−1}^{(a_i)}), Ŵ_{j,k−1}^{(a_i)}}_{i=1}^{N'} ∼ f_{θ,j:k−1}

and

S'_2 = {(θ̃_{k,l}^{(b_i)}, x̃_{k:l}^{(b_i)}), Ŵ_{k,l}^{(b_i)}}_{i=1}^{N'} ∼ f_{θ,k:l},

where N' is the number of samples after proliferation, {a_i}_{i=1}^{N'} and {b_i}_{i=1}^{N'} are the returned indices and {Ŵ_{j,k−1}^{(a_i)}, Ŵ_{k,l}^{(b_i)}}_{i=1}^{N'} are the updated weights. If proliferation is not required, we simply set N' = N, {a_i = i, b_i = i}_{i=1}^{N} and {Ŵ_{j,k−1}^{(a_i)} = W̃_{j,k−1}^{(a_i)}, Ŵ_{k,l}^{(b_i)} = W̃_{k,l}^{(b_i)}}_{i=1}^{N}. Combining the two sets yields

S' = {(θ̃_{j,k−1}^{(a_i)}, θ̃_{k,l}^{(b_i)}, x̃_{j:k−1}^{(a_i)}, x̃_{k:l}^{(b_i)}), w̃_{j,l}^{(i)} = Ŵ_{j,k−1}^{(a_i)} Ŵ_{k,l}^{(b_i)}}_{i=1}^{N'} ∼ f_{θ,j:k−1} f_{θ,k:l}.
We transform the combined samples in S' using the functions g_1 and g_2, and denote the resulting samples by {(θ̃_{j,l}^{(i)}, ∆θ̃_{j,l}^{(i)}, x̃_{j:l}^{(i)})}_{i=1}^{N'}; their importance weights ŵ_{j:l}^{(i)} are then updated with respect to the extended target f'_{θ,j:l}.
Algorithm 12: Sampling process TPE_gen(j, l) which targets f_{θ,j:l} at T_{j:l} in TPE
1 If j = l:
2   For i = 1 to N, simulate (θ_j^{(i)}, x_j^{(i)}) ∼ f_{θ,j}(·);
3   Denote the normalised weighted particles by {(θ_j^{(i)}, x_j^{(i)}), W_j^{(i)} = 1/N}_{i=1}^{N}.
4 Else:
5   Let p = ⌈log₂(l − j + 1)⌉ − 1 and k = j + 2^p;
6   Adopt the samples S_1 = {(θ̃_{j,k−1}^{(i)}, x̃_{j:k−1}^{(i)}), W̃_{j,k−1}^{(i)}}_{i=1}^{N} ← TPE_gen(j, k − 1) from T_{j:k−1} and S_2 = {(θ̃_{k,l}^{(i)}, x̃_{k:l}^{(i)}), W̃_{k,l}^{(i)}}_{i=1}^{N} ← TPE_gen(k, l) from T_{k:l};
7   Perform the optional proliferation step and combine the samples into S' = {(θ̃_{j,k−1}^{(a_i)}, θ̃_{k,l}^{(b_i)}, x̃_{j:k−1}^{(a_i)}, x̃_{k:l}^{(b_i)}), w̃_{j,l}^{(i)}}_{i=1}^{N'}, where N' is the updated sample size, {a_i}_{i=1}^{N'}, {b_i}_{i=1}^{N'} are the updated indices and {w̃_{j,l}^{(i)}}_{i=1}^{N'} are the updated weights;
8   Transform the samples in S' via g_1 and g_2, and update the importance weights with respect to the extended target f'_{θ,j:l} to obtain {(θ̃_{j,l}^{(i)}, ∆θ̃_{j,l}^{(i)}, x̃_{j:l}^{(i)}), ŵ_{j:l}^{(i)}}_{i=1}^{N'};
9   Resample {(θ̃_{j,l}^{(i)}, ∆θ̃_{j,l}^{(i)}, x̃_{j:l}^{(i)}), ŵ_{j:l}^{(i)}}_{i=1}^{N'} to obtain the normalised weighted samples {(θ_{j,l}^{(i)}, ∆θ_{j,l}^{(i)}, x_{j:l}^{(i)}), W_{j,l}^{(i)}}_{i=1}^{N}.
[Figure 4.2: dependence structures of the sub-HMMs at T_{0:3} and T_{4:5}, built from the complete HMM with hidden states X_0, …, X_5 and observations Y_0, …, Y_5.]
the data into groups called 'shards', and performs an independent Monte Carlo algorithm on each shard to produce a posterior estimate. The global result is computed as the weighted average of the posterior estimates, forming a consensus belief.
CMC and TPE are similar in the sense that both algorithms make an independence assumption. In CMC, independence arises from the individual work on each shard. In TPE, independence exists between the random variables at the same level of the auxiliary tree. We can analogously define the shard and the individual work at a tree node T_j (resp. T_{j:l}) in TPE: the shard contains the observation(s) y_j (resp. y_{j:l}), and the individual work refers to the implementation of TPE. Hence, the sub-model at T_j (resp. T_{j:l}) can be treated as the HMM with the observation(s) y_j (resp. y_{j:l}), whose dynamics need to be defined.
We refer to the original HMM with the full observations y_{0:T} as the complete HMM, which belongs to the root of the auxiliary tree. The sub-model at a non-root node is itself an HMM whose observations are a subset of those from the complete HMM. We refer to these sub-models as sub-HMMs.
Building the intermediate target distribution in TPE is hence equivalent to specifying the dynamics of the corresponding sub-HMM. We inherit the same transition and emission densities from the complete HMM. We will discuss the prior and provide the exact form of the intermediate target distributions in Section 4.5.2 and Section 4.5.3.
In practice, we partition the complete HMM according to its tree decomposition to create all sub-HMMs. In Figure 4.2, the dependence structures of some sub-HMMs are constructed from a complete HMM where T = 5. The associated auxiliary tree of the complete HMM is shown in Figure 4.1. From the tree, we first split the complete HMM into two sub-HMMs, which respectively involve the hidden states X_{0:3} at T_{0:3} and X_{4:5} at T_{4:5}. The two sub-HMMs are framed by dashed lines in Figure 4.2. We further split the hidden states X_{0:3} to obtain two sub-HMMs at T_{0:1} and T_{2:3}, which are framed by solid lines. Likewise, we have two sub-HMMs at T_4 and T_5, which both contain a single hidden state. The sub-HMMs at T_{0:1} and T_{2:3} are split further until each sub-HMM has only a single hidden state (figures not included).
We assume the unknown parameter and the initial hidden state of each sub-HMM are independent, with densities exactly µ and p_0 given by the complete HMM. We define the intermediate target distribution f_{θ,j} at the node T_j as

f_{θ,j}(θ_j, x_j) ∝ µ(θ_j) p_0(x_j) p_{θ_j}(y_j | x_j),

and at a non-leaf node T_{j:l} as

f_{θ,j:l}(θ_{j,l}, x_{j:l}) ∝ µ(θ_{j,l}) p_0(x_j) p_{θ_{j,l}}(y_{j:l}, x_{j+1:l} | x_j),

which corresponds to a sub-HMM with the observations y_{j:l}. At the root node T_{0:T}, we obtain the target distribution f_{θ,0:T}(θ_{0,T}, x_{0:T}) = p(θ_{0,T}, x_{0:T} | y_{0:T}).
At a leaf node T_j, the intermediate target distribution is built from the re-estimated priors µ_j and p_j; at a node T_{j:l}, it has the density

f_{θ,j:l}(θ_{j,l}, x_{j:l}) ∝ µ_j(θ_{j,l}) p_j(x_j) p_{θ_{j,l}}(y_{j:l}, x_{j+1:l} | x_j).
produces poor samples at early time steps.
Algorithm 13: Construction of the priors {µ_j, p_j}_{j=0}^{T} in TPE-EP
1 For j = 0: set µ_0 = µ, and set p_0 to the original prior of X_0 in the complete HMM.
2 For j = 1 to T:
3   For i = 1 to N: sample θ̃_j^{(i)} ∼ µ_{j−1}(·) and x̃_{j−1}^{(i)} ∼ p_{j−1}(·); compute the unnormalised importance weight w̃_{j−1}^{(i)} = p_{θ̃_j^{(i)}}(y_{j−1} | x̃_{j−1}^{(i)});
4   If N̂_eff < N_thres, implement the resampling step and denote the resampled samples (with normalised weights) by {(θ_j^{(i)}, x_{j−1}^{(i)}), W_{j−1}^{(i)}}_{i=1}^{N}; otherwise, normalise the weights and denote the weighted samples by {(θ_j^{(i)} = θ̃_j^{(i)}, x_{j−1}^{(i)} = x̃_{j−1}^{(i)}), W_{j−1}^{(i)}}_{i=1}^{N};
5   For i = 1 to N, generate x_j^{(i)} ∼ p_{θ_j^{(i)}}(· | x_{j−1}^{(i)});
6   Estimate µ_j from the weighted samples {θ_j^{(i)}, W_{j−1}^{(i)}}_{i=1}^{N} and p_j from the weighted samples {x_j^{(i)}, W_{j−1}^{(i)}}_{i=1}^{N}.
Algorithm 13 details the construction of {µ_j, p_j}_{j=0}^{T}, where we denote the sample size in the Monte Carlo simulations by N and the threshold in the resampling procedure by N_thres.
This class of priors estimates the prediction distributions and is more informative than the original priors in TPE-O. The Monte Carlo samples that estimate the priors also escape path degeneracy, since each SIR procedure is implemented for only two time steps.
[Figure 4.3: auxiliary tree of TPE-SIR when T = 5; the root (θ = θ_{0,5}, X_{0:5}) sits at level 2.]
We combine TPE with SIR-PE and call the algorithm TPE-SIR. TPE-SIR constructs an auxiliary tree as TPE does, recursively splitting the target variable (θ, X_{0:T}) into two subsets. Nevertheless, the splitting ceases before each subset contains only a single hidden state with a parameter variable: the target variable at each leaf node of TPE-SIR has multiple hidden states, i.e. (θ_{j,l}, X_{j:l}) where j ≠ l. Figure 4.3 shows the construction of an auxiliary tree in TPE-SIR when T = 5. We define the depth D of TPE-SIR as the total number of levels in the auxiliary tree; in Figure 4.3, D = 3. The depth of TPE can be defined similarly, with D = 1 + ⌈log₂(T + 1)⌉.
Given a leaf node with the target variable (θ_{j,l}, X_{j:l}) in TPE-SIR, the intermediate target distribution f_{θ,j:l} is defined as:

f_{θ,j:l}(θ_{j,l}, x_{j:l}) ∝ µ_j(θ_{j,l}) p_j(x_j) p_{θ_{j,l}}(y_{j:l}, x_{j+1:l} | x_j)
of {µ_j, p_j}_{j=0}^{T} in TPE-EP can be accomplished similarly via Algorithm 13, where each SIR procedure needs to be run for more time steps.
[Figure 4.4: Auxiliary tree of the simplified version of TPE constructed from the toy model (4.17) when T = 3; the root θ = θ_{0,3} splits into θ_{0,1} and θ_{2,3}, with leaves θ_0, θ_1, θ_2, θ_3.]

The toy model is

θ ∼ N(0, 1/τ),
Y_t | θ ∼ N(θ, 1) for t = 0, …, T.    (4.17)
intermediate target distribution

f_j(θ_j) = p(θ_j | y_j) ∼ N( y_j/(τ + 1), 1/(τ + 1) ),

and at a non-leaf node,

f_{j:l}(θ_{j,l}) = p(θ_{j,l} | y_{j:l}) ∼ N( Σ_{i=j}^{l} y_i / (τ + d_{j,l}), 1/(τ + d_{j,l}) ),    (4.19)

where d_{j,l} = l − j + 1 is the number of observations in the sub-model. To match the space of the proposal, we also need to extend the target variable θ_{j,l} by creating (θ_{j,l}, ∆θ_{j,l}). We assume ∆θ_{j,l} ∼ f̃_{j,l} is independent of θ_{j,l}. The extended target density of (θ_{j,l}, ∆θ_{j,l}) is hence f_{j:l}(θ_{j,l}) f̃_{j,l}(∆θ_{j,l}).
We propose two types of transformation functions g_1, g_2, both of which give θ_{j,l} the target distribution f_{j:l} marginally. The first method employs a deterministic combination of the overlapping parameters θ_{j,k−1} and θ_{k,l} to match the mean and variance of the target variable θ_{j,l}. The second additionally incorporates independent noise, which further increases sample diversity.
Deterministic combination
We aim to create the random variable θ_{j,l} with density f_{j,l} from θ_{j,k−1} ∼ f_{j:k−1} and θ_{k,l} ∼ f_{k:l} using the function g_1. We first exploit a deterministic linear combination of the overlapping parameter variables θ_{j,k−1} and θ_{k,l}:

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = α(θ_{j,k−1} + θ_{k,l}) + β ∼ N( α Σ_{i=j}^{l} y_i / (τ + d_{j,l}/2) + β, 2α² / (τ + d_{j,l}/2) ).

The unknown constants α and β are computed by matching the mean and variance of the target f_{j:l} in (4.19), which gives

α = (1/2) √((d_{j,l} + 2τ) / (d_{j,l} + τ)),
β = ( Σ_{i=j}^{l} y_i / √(τ + d_{j,l}) ) ( 1/√(τ + d_{j,l}) − 1/√(2τ + d_{j,l}) ).
The second transformation function gives ∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} − θ_{k,l}, and so ∆θ_{j,l} is distributed as

N( (Σ_{i=j}^{k−1} y_i − Σ_{i=k}^{l} y_i) / (τ + d_{j,l}/2), 2 / (τ + d_{j,l}/2) ).    (4.20)

We let f̃_{j,l} have the same density as in (4.20), and hence the marginal distributions of ∆θ_{j,l} in the proposal and in the target are the same.
The proposal h_{j:l}(θ_{j,l}, ∆θ_{j,l}) can be obtained using the transformation of random variables:

h_{j:l}(θ_{j,l}, ∆θ_{j,l}) = f_{j:k−1}(g_1^{−1}(θ_{j,l}, ∆θ_{j,l})) f_{k:l}(g_2^{−1}(θ_{j,l}, ∆θ_{j,l})) |J(θ_{j,l}, ∆θ_{j,l})| = f_{j:l}(θ_{j,l}) f̃_{j,l}(∆θ_{j,l}),    (4.21)

where |J(θ_{j,l}, ∆θ_{j,l})| = 1/(2α) is a constant; (4.21) can be verified by brute force. Hence, the proposal distribution is identical to the extended target distribution.
Stochastic combination
in the model (4.17). We define the function g_1 as the average of θ_{j,k−1} and θ_{k,l} plus independent noise ε:

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = (1/2)(θ_{j,k−1} + θ_{k,l}) + ε,

whose mean and variance are matched to the target:

E( (1/2)(θ_{j,k−1} + θ_{k,l}) + ε ) = (1/2) Σ_{i=j}^{l} y_i / (τ + d_{j,l}/2) + E(ε) = Σ_{i=j}^{l} y_i / (τ + d_{j,l}),
Var( (1/2)(θ_{j,k−1} + θ_{k,l}) + ε ) = (1/4) · 2/(τ + d_{j,l}/2) + Var(ε) = 1/(τ + d_{j,l}).    (4.22)

Solving (4.22) gives

ε ∼ N( τ Σ_{i=j}^{l} y_i / (2(τ + d_{j,l})(τ + d_{j,l}/2)), τ / (2(τ + d_{j,l}/2)(τ + d_{j,l})) ).

As before, ∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} − θ_{k,l}, so

∆θ_{j,l} ∼ N( (Σ_{i=j}^{k−1} y_i − Σ_{i=k}^{l} y_i) / (τ + d_{j,l}/2), 2/(τ + d_{j,l}/2) ),

whose density is also imposed on f̃_{j,l} appearing in the extended target distribution. The inverse transformations g_1^{−1} and g_2^{−1} can be computed accordingly:

θ_{j,k−1} = g_1^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} + ∆θ_{j,l}/2 − ε,
θ_{k,l} = g_2^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} − ∆θ_{j,l}/2 − ε.
We build the proposal on a probability space extended with the noise ε, since ε appears in the transformation functions. Hence, we define

h'_{j:l}(θ_{j,l}, ∆θ_{j,l}, ε) ∝ f_{j:k−1}(θ_{j,l} + ∆θ_{j,l}/2 − ε) f_{k:l}(θ_{j,l} − ∆θ_{j,l}/2 − ε) |J(θ_{j,l}, ∆θ_{j,l})| κ(ε),

where κ is the density of ε, and the extended target density is

f'_{j:l}(θ_{j,l}, ∆θ_{j,l}, ε) = f_{j:l}(θ_{j,l}) f̃_{j,l}(∆θ_{j,l}) κ(ε).

In the proposal, θ_{j,l} and ε are correlated:

corr(θ_{j,l}, ε) = corr( (1/2)(θ_{j,k−1} + θ_{k,l}) + ε, ε ) = Var(ε) / √( Var((1/2)(θ_{j,k−1} + θ_{k,l}) + ε) Var(ε) ) = √(τ / (2τ + d_{j,l})) = √(1 / (2 + d_{j,l}/τ)).
Such correlation between the components of the proposal may lead to inefficiency in importance sampling.
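The correlation between θ_{j,l} and ε in the proposal can also be checked by simulation; the values τ = 1 and y = (1, 2) below are assumptions chosen for illustration.

```python
import numpy as np

# Toy model (4.17) with assumed tau = 1 and two observations, so d_{j,l} = 2.
tau, y = 1.0, np.array([1.0, 2.0])
d, S = len(y), float(np.sum(y))

rng = np.random.default_rng(0)
n = 400_000
# children's posteriors (one observation each): N(y_i/(tau+1), 1/(tau+1))
th1 = rng.normal(y[0] / (tau + 1), np.sqrt(1 / (tau + 1)), n)
th2 = rng.normal(y[1] / (tau + 1), np.sqrt(1 / (tau + 1)), n)
# noise epsilon matching the target mean and variance, as derived above
m_eps = tau * S / ((tau + d) * (2 * tau + d))
v_eps = tau / ((tau + d) * (2 * tau + d))
eps = rng.normal(m_eps, np.sqrt(v_eps), n)
th = 0.5 * (th1 + th2) + eps                 # stochastic combination g_1

corr = np.corrcoef(th, eps)[0, 1]            # theory: sqrt(1/(2 + d/tau)) = 0.5
```

The empirical correlation matches √(1/(2 + d/τ)) = 0.5 for these values, and the combined samples reproduce the target mean S/(τ + d) and variance 1/(τ + d).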
In the toy model (4.17), the distribution of the target parameter θ_{j,l} is known and can be pre-computed, whereas the information about θ_{j,l} in a general HMM is very limited. We can determine α and β from θ_{j,k−1} and θ_{k,l}, and expect the distribution of θ_{j,l} to be a compromise between those of θ_{j,k−1} and θ_{k,l}. We suggest E(θ_{j,l}) to be the average of E(θ_{j,k−1}) and E(θ_{k,l}), and Var(θ_{j,l}) to be the average of Var(θ_{j,k−1}) and Var(θ_{k,l}). We choose not to shrink the variance of θ_{j,l} to prevent an overly concentrated distribution in the proposal, which may not
explore the target space adequately. Therefore, we obtain

α = √(1/2),   β = ((1 − √2)/2) (E(θ_{j,k−1}) + E(θ_{k,l})).    (4.25)
When the expectations E(θ_{j,k−1}) and E(θ_{k,l}) do not have closed forms, we may use estimates from the corresponding Monte Carlo samples. These estimates depend on the samples and vary across simulations of the algorithm. However, the restrictions (4.25) we impose on α and β are a recommended choice rather than a strict requirement; any reasonable values of α and β are valid for the algorithm to work.
The inverse transformation functions are

θ_{j,k−1} = g_1^{−1}(θ_{j,l}, ∆θ_{j,l}) = (θ_{j,l} − β)/(2α) + ∆θ_{j,l}/2,
θ_{k,l} = g_2^{−1}(θ_{j,l}, ∆θ_{j,l}) = (θ_{j,l} − β)/(2α) − ∆θ_{j,l}/2,

with |J(θ_{j,l}, ∆θ_{j,l})| = 1/(2α). The proposal density of (θ_{j,l}, ∆θ_{j,l}, X_{j:l}) in (4.7) is

h'_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}) ∝ f_{θ,j:k−1}( (θ_{j,l} − β)/(2α) + ∆θ_{j,l}/2, x_{j:k−1} ) f_{θ,k:l}( (θ_{j,l} − β)/(2α) − ∆θ_{j,l}/2, x_{k:l} ).
Finally, we propose a simple rule for building f̃_{j,l}, which appears as part of the product in the extended target density (4.8). Similarly to the construction in the toy model (4.17), we require the marginal of ∆θ_{j,l} in the target to be identical to that in the proposal. However, the marginal distribution in the proposal may be analytically intractable, and an approximation is needed. In practice, we can impose a Gaussian distribution on f̃_{j,l} whose moments can be estimated according to (4.24) using the Monte Carlo samples of θ_{j,k−1} and θ_{k,l}.
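A hedged sketch of this moment-based combination from weighted Monte Carlo samples; `combine_params` is a hypothetical helper name, and pairing the children's samples index-by-index mimics the product-measure proposal.

```python
import numpy as np

def combine_params(th1, w1, th2, w2):
    """Deterministic combination sketch (Section 4.7.2 style): choose alpha
    and beta so that the mean and variance of theta_{j,l} equal the averages
    of the children's means and variances, with moments estimated from the
    weighted samples. Samples are paired index-by-index."""
    m1, m2 = np.average(th1, weights=w1), np.average(th2, weights=w2)
    alpha = np.sqrt(0.5)                       # alpha = sqrt(1/2), as in (4.25)
    beta = (1 - np.sqrt(2)) / 2 * (m1 + m2)    # beta from the estimated means
    return alpha * (th1 + th2) + beta, alpha, beta
```

For independent children the combined samples then have mean (m_1 + m_2)/2 and variance (v_1 + v_2)/2, i.e. the averages of the children's moments.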
For the stochastic combination, we define

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = (1/2)(θ_{j,k−1} + θ_{k,l}) + ε,    (4.26)
∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} − θ_{k,l},

which gives

E(θ_{j,l}) = (1/2)(E(θ_{j,k−1}) + E(θ_{k,l})) + E(ε),
Var(θ_{j,l}) = (1/4)(Var(θ_{j,k−1}) + Var(θ_{k,l})) + Var(ε).

As before, we require

E(θ_{j,l}) = (1/2)(E(θ_{j,k−1}) + E(θ_{k,l})),
Var(θ_{j,l}) = (1/2)(Var(θ_{j,k−1}) + Var(θ_{k,l})).

Consequently, the mean of ε is 0 and its variance is (1/4)(Var(θ_{j,k−1}) + Var(θ_{k,l})), which can be estimated from the Monte Carlo samples of θ_{j,k−1} and θ_{k,l}.
The inverse transformation functions g_1^{−1} and g_2^{−1} for the derivation of the proposal are given by:

θ_{j,k−1} = g_1^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} + ∆θ_{j,l}/2 − ε,
θ_{k,l} = g_2^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} − ∆θ_{j,l}/2 − ε.

The proposal density is

h'_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}, ε) ∝ f_{θ,j:k−1}(θ_{j,l} + ∆θ_{j,l}/2 − ε, x_{j:k−1}) f_{θ,k:l}(θ_{j,l} − ∆θ_{j,l}/2 − ε, x_{k:l}) κ(ε),

since |J(θ_{j,l}, ∆θ_{j,l})| = 1. The extended target density f'_{θ,j:l} is:

f'_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}, ε) = f_{θ,j:l}(θ_{j,l}, x_{j:l}) f̃_{j,l}(∆θ_{j,l}) κ(ε).

The density f̃_{j,l} of ∆θ_{j,l} can be managed in the same way as in the deterministic approach, given the same transformation function g_2.
For a positive parameter, such as a variance, we combine the overlapping parameters on the log scale, which gives g_1 and g_2:

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = (θ_{j,k−1} θ_{k,l})^α e^β,
∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} / θ_{k,l}.    (4.28)

The inverse transformation functions are

θ_{j,k−1} = g_1^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l}^{1/(2α)} ∆θ_{j,l}^{1/2} e^{−β/(2α)},
θ_{k,l} = g_2^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l}^{1/(2α)} ∆θ_{j,l}^{−1/2} e^{−β/(2α)},

which gives |J(θ_{j,l}, ∆θ_{j,l})| = (1/(2α)) e^{−β/α} θ_{j,l}^{1/α−1} ∆θ_{j,l}^{−1}.
The density f̃_{j,l} can be approximated from (4.27) for the same reason stated in Section 4.7.2: we impose a parametric assumption with parameters estimated from the Monte Carlo samples of θ_{j,k−1} and θ_{k,l}.
The stochastic version adds independent noise ε on the log scale:

log(θ_{j,l}) = (1/2) log(θ_{j,k−1}) + (1/2) log(θ_{k,l}) + ε,

that is,

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1}^{1/2} θ_{k,l}^{1/2} e^{ε},
∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} / θ_{k,l}.

The inverse transformation functions are

θ_{j,k−1} = g_1^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} ∆θ_{j,l}^{1/2} e^{−ε},
θ_{k,l} = g_2^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} ∆θ_{j,l}^{−1/2} e^{−ε},

which gives |J(θ_{j,l}, ∆θ_{j,l})| = θ_{j,l} ∆θ_{j,l}^{−1} e^{−2ε}. The proposal density is

h'_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}, ε) = θ_{j,l} ∆θ_{j,l}^{−1} e^{−2ε} f_{θ,j:k−1}(θ_{j,l} ∆θ_{j,l}^{1/2} e^{−ε}, x_{j:k−1}) f_{θ,k:l}(θ_{j,l} ∆θ_{j,l}^{−1/2} e^{−ε}, x_{k:l}) κ(ε).

The extended target distribution is

f'_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}, ε) = f_{θ,j:l}(θ_{j,l}, x_{j:l}) f̃_{j,l}(∆θ_{j,l}) κ(ε).
X_t = ρ X_{t−1} + σ_1 V_t,  t = 1, …, T,
Y_t = X_t + σ_2 W_t,  t = 0, …, T,    (4.29)

with prior ρ ∼ N(0.5, 0.01),
arity of the time series {X_t}_{t∈N}. In this example, we choose the prior of ρ to be normally distributed to accommodate the case in Section 4.7.2, where the support of the unknown parameter is the real line. The variance of ρ ensures that ρ ∈ (−1, 1) with probability larger than 99.9%, and the finiteness of the time series also makes the issue of stationarity less of a concern. The prior of θ = (ρ, σ_1^2, σ_2^2) factorises as µ(θ) = µ_ρ(ρ) µ_{σ_1^2}(σ_1^2) µ_{σ_2^2}(σ_2^2), where µ_ρ, µ_{σ_1^2} and µ_{σ_2^2} are the densities of ρ, σ_1^2 and σ_2^2, respectively.
4.8.2 Benchmark
discrete prior µ̂ of θ̂ is
We then discretise the sample space of each hidden state Xt using a grid Gt
consisting of nxt points. The grid G0 becomes the sample space of the discrete
random variable X̂0 which approximates X0 . We compute the discrete prior
of X̂0 based upon (3.44) from G0 . The transition mass pθ̂ (x̂t |x̂t−1 ) can be
similarly computed for every θ̂ ∈ Gθ and x̂t−1 ∈ Gt−1 . The emission density
p_{θ̂}(y_t | x̂_t) is continuous and does not require discretisation, although the possible choices of θ̂ and x̂_t are now finite. The discrete posterior is p(θ̂ | y_{0:T}) ∝ µ̂(θ̂) p_{θ̂}(y_{0:T}), where p_{θ̂}(y_{0:T}) can be computed analytically from the Kalman filter conditional on θ̂.
p(x̂_t | y_{0:T}) = Σ_{θ̂ ∈ G_θ} p_{θ̂}(x̂_t | y_{0:T}) p(θ̂ | y_{0:T}),    (4.30)
where p_{θ̂}(x̂_t | y_{0:T}) is the probability mass discretised from the normal distribution p_{θ̂}(· | y_{0:T}). The mean and variance of p_{θ̂}(· | y_{0:T}) can be obtained from the Rauch-Tung-Striebel smoother (RTSs) conditional on θ̂. However, the computation of (4.30) can be time-consuming as a consequence of n_1 × n_2 × n_3 implementations of the RTSs for each x̂_t ∈ G_t. Alternatively, we simulate equally weighted Monte Carlo samples {θ̂^{(i)}}_{i=1}^{n_mc} from the discrete distribution p(· | y_{0:T}) and estimate the probability mass p(x̂_t | y_{0:T}) for x̂_t ∈ G_t using

p(x̂_t | y_{0:T}) ≈ (1/n_mc) Σ_{i=1}^{n_mc} p_{θ̂^{(i)}}(x̂_t | y_{0:T}).

As the simulations do not guarantee Σ_{x̂_t ∈ G_t} p(x̂_t | y_{0:T}) = 1, we need to normalise the probability masses.
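The Monte Carlo mixture estimate with the final renormalisation can be sketched as follows; `mass_given_theta` is a hypothetical callable returning the (possibly unnormalised) conditional mass p_{θ̂}(x̂_t | y_{0:T}) on the grid.

```python
import numpy as np

def posterior_mass(theta_samples, mass_given_theta, grid_size):
    """Estimate p(x_t | y_{0:T}) on a grid by averaging the conditional masses
    over posterior draws of theta, then renormalising over the grid."""
    total = np.zeros(grid_size)
    for th in theta_samples:
        total += mass_given_theta(th)
    total /= total.sum()   # renormalise, as discussed in the text
    return total
```

The averaging corresponds to the mixture approximation above, and the final division restores a proper probability mass over the grid.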
4.8.3 Metric
which outputs a single number between 0 and 1. We defined the KS test and justified using its statistic, rather than the result of the test, as an error metric for the Monte Carlo smoothing algorithms in Section 3.10.3. The KS statistic between two (empirical) distribution functions F_1 and F_2 is defined as

KSS = sup_x |F_1(x) − F_2(x)|.
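For unweighted samples, the two-sample KS statistic can be computed directly from the empirical CDFs; this sketch assumes equal weights, whereas the experiments may use weighted particles.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: sup_x |F_a(x) - F_b(x)|,
    computed by evaluating both empirical CDFs at every sample point."""
    grid = np.sort(np.concatenate([a, b]))
    f_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    f_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(f_a - f_b).max()
```

Identical samples give a statistic of 0 and samples with disjoint supports give 1, so the output always lies in [0, 1].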
4.8.4 Algorithms
PMMH
In the PMMH sampler (see Algorithm 10), the proposal q(θ* | θ) in (4.3) needs to be specified. We denote θ* = (ρ*, σ_1^{*2}, σ_2^{*2}) and construct the proposal as a product q(θ* | θ) = q_ρ(ρ* | ρ) q_{σ_1^2}(σ_1^{*2} | σ_1^2) q_{σ_2^2}(σ_2^{*2} | σ_2^2). When a negative value of σ_1^{*2} from q_{σ_1^2}(· | σ_1^2) or σ_2^{*2} from q_{σ_2^2}(· | σ_2^2) is generated, we set the acceptance ratio to 0 in the Metropolis-Hastings update. The initial value of θ is determined by the mean from a preliminary run of SIR-PE.
SIR-PE
In SIR-PE (see Algorithm 11), we resample the particles after every impor-
tance sampling step using multinomial resampling.
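Multinomial resampling itself is a one-liner with NumPy; this sketch is a generic implementation, not the thesis's code.

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Multinomial resampling: draw N ancestor indices i.i.d. from the
    normalised weights, then reset all weights to 1/N."""
    n = len(weights)
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```

Drawing the indices i.i.d. from the weights makes the expected number of copies of each particle proportional to its weight, which keeps the weighted approximation unbiased.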
TPE
In TPE-EP, the re-estimated prior of the parameter factorises as µ_j = µ_{j,ρ} µ_{j,σ_1^2} µ_{j,σ_2^2}, where µ_{j,ρ} is a normal distribution and µ_{j,σ_1^2}, µ_{j,σ_2^2} are inverse Gamma distributions. The parameters of µ_{j,ρ}, µ_{j,σ_1^2} and µ_{j,σ_2^2} are all computed by moment matching from their Monte Carlo samples.
The distribution f̃_{j,l} of the random variable ∆θ_{j,l} = (∆ρ_{j,l}, ∆σ_{1,j,l}^2, ∆σ_{2,j,l}^2) in (4.8) is built as follows. We assume ∆ρ_{j,l}, ∆σ_{1,j,l}^2 and ∆σ_{2,j,l}^2 are mutually independent. We impose a normal distribution on ∆ρ_{j,l} and inverse Gamma distributions on ∆σ_{1,j,l}^2 and ∆σ_{2,j,l}^2. The parameters of these distributions can be estimated from the Monte Carlo samples using (4.24) for ∆ρ_{j,l}, and (4.28) for ∆σ_{1,j,l}^2 and ∆σ_{2,j,l}^2.
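The moment matching used for these parametric estimates can be sketched as follows; the closed-form inverse-Gamma solution a = m²/v + 2, b = m(a − 1) follows from its mean b/(a − 1) and variance b²/((a − 1)²(a − 2)).

```python
import numpy as np

def match_normal(samples, weights):
    """Moment-match a normal distribution to weighted samples (mean, std)."""
    m = np.average(samples, weights=weights)
    v = np.average((samples - m) ** 2, weights=weights)
    return m, np.sqrt(v)

def match_inv_gamma(samples, weights):
    """Moment-match an inverse-Gamma(a, b): mean b/(a-1), variance
    b^2/((a-1)^2 (a-2)); solving gives a = m^2/v + 2, b = m (a - 1)."""
    m = np.average(samples, weights=weights)
    v = np.average((samples - m) ** 2, weights=weights)
    a = m * m / v + 2.0
    b = m * (a - 1.0)
    return a, b
```

Note `match_inv_gamma` requires a positive sample mean and variance, which holds for variance parameters.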
We further classify TPE by two criteria. The first is the prior of the
Table 4.1: Options in TPE regarding the prior information of the sub-HMMs and the com-
bination method of the overlapping parameters.
sub-HMMs described in Section 4.5.2 and Section 4.5.3. The second is the
combination method of the overlapping parameters, which is illustrated for
building the function g1 in Section 4.7. The available options are listed in
Table 4.1. We extend the name of TPE with the format TPE-prior(combination), where the prior is O or EP and the combination is D (deterministic) or S (stochastic).
For example, TPE-O(D) applies the original priors to the sub-HMMs and
employs the deterministic approach to combine the overlapping parameters.
We therefore have four versions of TPE: TPE-O(D), TPE-O(S), TPE-EP(D),
TPE-EP(S).
TPE-SIR
We choose the depth of TPE-SIR, given that the tuning procedures of TPE
and SIR-PE have been described. We create an auxiliary tree of 7, 5 and 3
levels respectively with 4, 16 and 64 observations at the leaf nodes. We also
operate every version of TPE within TPE-SIR, and name each algorithm with the format TPE-prior(combination)-SIR-depth. The choice of the prior in the sub-HMMs and the combination method of the overlapping parameters can be found in Table 4.1. Hence, TPE-O(D)-SIR-5 indicates that TPE-SIR employs the original priors in the sub-HMMs with the deterministic combination of the overlapping parameters, and that the auxiliary tree has 5 levels with 16 observations at each leaf node.
The simulation results are shown in Table 4.2. Under the same sample size, SIR-PE enjoys the lowest average runtime of around 0.1 seconds, followed by TPE-SIR using an auxiliary tree of 3 levels. TPE is much slower given a deeper tree, taking over 5 seconds. The PMMH sampler incurs a significantly longer runtime of roughly 87 seconds for a single run, since the particle smoother is implemented for every proposed path.
Table 4.2: Performance of the parameter estimation algorithms under the same sample size
in the HMM
Algorithm N n KSSx (s.e.) KSSρ (s.e.) KSSσ12 (s.e.) KSSσ22 (s.e.) Runtime
SIR-PE 1000 NA 0.42 (0.0046) 0.71 (0.0124) 0.71 (0.0127) 0.72 (0.0121) 0.13
PMMH sampler 1000 1000 0.21 (0.0042) 0.41 (0.0026) 0.58 (0.0167) 0.56 (0.0175) 86.90
TPE-O(D) 1000 NA 0.56 (0.0032) 0.48 (0.0113) 0.77 (0.0088) 0.77 (0.0079) 5.20
TPE-O(S) 1000 NA 0.62 (0.0023) 0.66 (0.0059) 0.88 (0.0047) 0.74 (0.0085) 5.25
TPE-EP(D) 1000 1000 0.51 (0.0070) 0.61 (0.0108) 0.87 (0.0097) 0.58 (0.0126) 5.42
TPE-EP(S) 1000 1000 0.48 (0.0070) 0.68 (0.0096) 0.92 (0.0076) 0.61 (0.0120) 5.49
TPE-O(D)-SIR-7 1000 NA 0.58 (0.0042) 0.45 (0.0113) 0.50 (0.0106) 0.46 (0.0098) 2.48
TPE-O(S)-SIR-7 1000 NA 0.61 (0.0037) 0.58 (0.0088) 0.54 (0.0112) 0.50 (0.0105) 2.48
TPE-EP(D)-SIR-7 1000 1000 0.38 (0.0057) 0.38 (0.0096) 0.50 (0.0110) 0.43 (0.0110) 2.53
TPE-EP(S)-SIR-7 1000 1000 0.37 (0.0027) 0.44 (0.0099) 0.53 (0.0092) 0.45 (0.0093) 2.54
TPE-O(D)-SIR-5 1000 NA 0.57 (0.0049) 0.47 (0.0117) 0.49 (0.0101) 0.47 (0.0100) 1.33
TPE-O(S)-SIR-5 1000 NA 0.62 (0.0047) 0.52 (0.0112) 0.53 (0.0115) 0.53 (0.0126) 1.34
TPE-EP(D)-SIR-5 1000 1000 0.37 (0.0060) 0.34 (0.0096) 0.42 (0.0098) 0.42 (0.0107) 1.45
TPE-EP(S)-SIR-5 1000 1000 0.39 (0.0048) 0.40 (0.0090) 0.47 (0.0092) 0.46 (0.0097) 1.41
TPE-O(D)-SIR-3 1000 NA 0.54 (0.0057) 0.61 (0.0124) 0.61 (0.0124) 0.62 (0.0137) 0.66
TPE-O(S)-SIR-3 1000 NA 0.56 (0.0065) 0.51 (0.0122) 0.55 (0.0128) 0.56 (0.0138) 0.67
TPE-EP(D)-SIR-3 1000 1000 0.43 (0.0062) 0.55 (0.0142) 0.62 (0.0153) 0.60 (0.0164) 0.66
TPE-EP(S)-SIR-3 1000 1000 0.43 (0.0066) 0.54 (0.0132) 0.58 (0.0148) 0.57 (0.0156) 0.66
The best four candidates (or five if a tie exists) in the last five columns are marked in bold. Runtime is averaged
over all simulations and is measured in seconds. Standard error (s.e.) is the standard deviation divided by √M.
In contrast, SIR-PE, TPE and TPE-SIR all suffer from path degeneracy and
produce much larger results.
The former performs slightly better in a deeper auxiliary tree.
4.9 Discussion
timate the posterior distribution p(θ, x0:T | y0:T ) in a hidden Markov model
where θ is an unknown parameter. TPE decomposes the target random
variable (θ, X0:T ) via an auxiliary binary tree structure, which requires that
the random variables at the same level of the tree contain disjoint hidden
states and a parameter variable. TPE samples from the leaf nodes initially.
Following the binary tree, we gradually merge the samples targeting the
intermediate target distribution at each non-root node. The sampling
process ends when we reach the root, which stands for the target distribution.
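The recursion above can be sketched in a few lines. This is only an illustrative skeleton under assumed names: the tree is split at the midpoint, the leaf sampler is an arbitrary placeholder proposal, and the merge step simply pairs samples and averages the duplicated parameter draws, whereas TPE would reweight and resample towards the intermediate target at each non-root node.

```python
import random

def build_tree(lo, hi):
    """Binary tree over time indices [lo, hi]; each node covers a block of states."""
    if lo == hi:
        return {"range": (lo, lo), "children": None}
    mid = (lo + hi) // 2
    return {"range": (lo, hi),
            "children": (build_tree(lo, mid), build_tree(mid + 1, hi))}

def sample_leaf(t, n):
    # Placeholder leaf sampler: draws (theta, x_t) pairs from an arbitrary proposal.
    return [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

def merge(left, right):
    # Placeholder merge: pair samples from the two children, concatenate their
    # disjoint state blocks, and combine the duplicated parameter draws
    # (here by simple averaging; TPE would reweight/resample instead).
    out = []
    for (th_l, xs_l), (th_r, xs_r) in zip(left, right):
        xs_l = xs_l if isinstance(xs_l, list) else [xs_l]
        xs_r = xs_r if isinstance(xs_r, list) else [xs_r]
        out.append(((th_l + th_r) / 2.0, xs_l + xs_r))
    return out

def tpe_sketch(node, n):
    """Sample the leaves, then merge bottom-up until the root is reached."""
    if node["children"] is None:
        return sample_leaf(node["range"][0], n)
    left, right = node["children"]
    return merge(tpe_sketch(left, n), tpe_sketch(right, n))

root = build_tree(0, 5)
samples = tpe_sketch(root, 100)
# Each final sample carries one parameter value and a full path x_{0:5}.
```

The point of the sketch is only the control flow: sampling starts at the leaves and each non-root node is visited exactly once on the way up to the root.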
informative priors, TPE-EP empirically demonstrates superior performance
compared to TPE-O.
estimated prediction priors in the sub-HMMs and has a reasonable depth of
the auxiliary tree. Such an algorithm has the following strengths: it profits
from the efficiency and accuracy of SIR-PE for simulating the initial samples,
and the re-estimated priors, rather than the originals, reduce the discrepancy
between the proposal and the target in the importance sampling steps.
Nevertheless, TPE requires several tuning steps. Given its superior performance
for estimating the unknown parameter at a comparatively fast runtime, we
consider TPE-SIR a desirable option for solving the parameter estimation
problem in an HMM.
5
Conclusion and Future Work
5.1 Conclusion
The present thesis developed Monte Carlo methods to investigate inference
problems in hypothesis testing and in hidden Markov models (HMMs).
In Chapter 2, we introduced Monte Carlo testing procedures for bounding a
specific error called the resampling risk. In Chapters 3 and 4, we proposed
Monte Carlo sampling algorithms to target posterior distributions in an HMM.
The first part of Chapter 2 introduced a new method, CSM, which bounds
the resampling risk uniformly with respect to a single threshold. We observed
that CSM is conservative in the sense that it does not spend the full risk.
We then applied truncation to CSM to accommodate practical settings with
a limited computational budget, and identified a relatively small resampling
risk for CSM compared to other truncated procedures. We conclude that
CSM is an appealing option in practice due to its simplicity.
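The decision rule behind such a sequential test can be sketched as follows. This is an illustrative reading based on a Robbins-type confidence sequence for the unknown p-value, not the thesis's exact notation: the function name, default threshold and risk bound are ours.

```python
import math

def csm_test(sampler, alpha=0.05, eps=1e-3, max_n=100000):
    """Sequential Monte Carlo test in the spirit of CSM (illustrative sketch).

    sampler() returns True when a freshly simulated statistic is at least as
    extreme as the observed one, so s/n estimates the p-value.  We stop as
    soon as a Robbins-style confidence sequence for the p-value excludes the
    threshold alpha, and decide according to which side of alpha s/n lies on.
    """
    s = 0
    for n in range(1, max_n + 1):
        s += bool(sampler())
        # log of (n + 1) * Binomial(n, alpha) pmf at s; the confidence
        # sequence still contains alpha while this exceeds log(eps).
        log_bound = (math.log(n + 1)
                     + math.lgamma(n + 1) - math.lgamma(s + 1)
                     - math.lgamma(n - s + 1)
                     + s * math.log(alpha) + (n - s) * math.log(1 - alpha))
        if log_bound <= math.log(eps):
            return ("reject" if s / n < alpha else "accept", n)
    return ("undecided", max_n)
```

Working in log space keeps the boundary check numerically stable for large n; the sampling effort adapts automatically, stopping quickly when the estimated p-value is far from the threshold.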
under constraints on computational cost. Otherwise, TPS-ES can achieve
higher accuracy than TPS-EF under the same sample size.
Given an auxiliary tree in which every node contains an unknown parameter,
a novelty of TPE lies in the combination of the overlapping parameter
variables during the sampling process. We illustrated the combination
procedures using a deterministic and a stochastic approach, both of which
boost the diversity of the samples against degeneracy and avoid a
conventional Markov chain Monte Carlo (MCMC) update step.
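One plausible reading of the two combination rules for the duplicated parameter draws can be sketched as below. Both functions are illustrative assumptions rather than the thesis's exact procedures: when two merged sub-model samples each carry their own draw of the shared parameter, a deterministic rule could fuse them into one value, while a stochastic rule keeps one of the two at random, preserving diversity.

```python
import random

def combine_deterministic(theta_l, theta_r, w_l, w_r):
    # Deterministic rule (illustrative): a weight-averaged parameter draw,
    # collapsing the two duplicated values into a single one.
    return (w_l * theta_l + w_r * theta_r) / (w_l + w_r)

def combine_stochastic(theta_l, theta_r, w_l, w_r, rng=random):
    # Stochastic rule (illustrative): keep one of the two draws with
    # probability proportional to its weight, so distinct values survive
    # in the merged sample and diversity is maintained.
    return theta_l if rng.random() < w_l / (w_l + w_r) else theta_r
```

Either rule runs inside the merge step of the tree, so no separate MCMC rejuvenation pass over the parameter is needed.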
is the amalgamation of TPE and a sequential Monte Carlo (SMC) algorithm
called SIR-PE. TPE-SIR creates a shallower auxiliary tree than TPE, whose
initial samples are produced by SIR-PE in a simpler and more efficient way.
In the simulation study, a reduced runtime compared to TPE and improved
accuracy in estimating the unknown parameter both contributed to the
strengths of TPE-SIR.
This thesis leaves several topics for further discussion and possible
extension.
[Figure 5.1 depicts the updated auxiliary tree: the root X0:5 splits into X0:3 and X4:5; X0:3 splits into X0:1 and X2:3, and X4:5 into X4 and X5; the leaves are X0, X1, X2, X3, X4, X5, with the newly added nodes joined by dashed edges.]
Figure 5.1: Update of the auxiliary tree when the new observation y5 is available
The updated auxiliary tree can potentially retain the samples at the
old nodes, provided that their intermediate target distributions remain
unchanged. This condition is exactly satisfied in TPE-EF (Section 3.7.2) and
in all versions of TPS. In Figure 5.1, the old nodes are those not connected
by any dashed edge, such as X0 , X2:3 and X0:3 . They are also the nodes of
the left complete sub-tree rooted at X0:3 . In practice, we may even eliminate
all non-root nodes of the left complete sub-tree and only store the samples
of X0:3 . Therefore, on-line TPS and TPE can save computational effort by
preserving the samples from part of the tree.
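Which nodes survive the update can be made concrete with a small sketch. Assuming the tree splits each block at its largest power-of-two prefix (consistent with Figure 5.1, where X0:5 splits into X0:3 and X4:5), comparing the node sets of the old and new trees identifies exactly the nodes whose samples can be reused; the function name and representation are ours.

```python
def tree_nodes(lo, hi):
    """Index ranges of all nodes of the auxiliary tree over states lo..hi,
    splitting each block after its largest power-of-two prefix
    (so (0, 5) splits into (0, 3) and (4, 5), as in Figure 5.1)."""
    nodes = {(lo, hi)}
    if lo < hi:
        size = hi - lo + 1
        half = 1
        while half * 2 < size:
            half *= 2
        nodes |= tree_nodes(lo, lo + half - 1)
        nodes |= tree_nodes(lo + half, hi)
    return nodes

old = tree_nodes(0, 4)        # tree before y5 arrives
new = tree_nodes(0, 5)        # tree after the update
reusable = old & new          # targets unchanged: stored samples are kept
fresh = new - old             # only these nodes need new sampling work
# fresh == {(5, 5), (4, 5), (0, 5)}: the new leaf and the nodes above it.
```

Under this splitting rule, only the right-hand spine of the new tree requires work: the whole left complete sub-tree rooted at (0, 3) is reused untouched.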
The more advanced HMMs raise new challenges to TPS and TPE based
on the divide-and-conquer approach (Lindsten et al., 2017). In a HMM with
multivariate state spaces, the curse of dimensionality implies the error of a
particle filter grows exponentially with respect to the dimension of the state
space (Rebeschini et al., 2015). Bengtsson et al. (2008) state that the maxi-
mum normalised weight in one single time step of a particle filter converges
to one in a specific class of models if the sample size grows sub-exponentially
204
in the cube root of the dimension. Rebeschini et al. (2015) propose block
particle filter by partitioning the state variable of high dimension and lo-
calise the weight computation. They prove the error bound is independent
of the dimension. Finke and Singh (2017) apply the similar technique of
block approximations for the smoothing problem, and prove the bias of their
blocked smoother is uniformly bounded in dimension and the variance is
dimension-independent. TPS and TPE applied in higher state spaces could
have challenging importance sampling steps under the existing sampling pro-
cess. We could employ the block approximation techniques in TPS and TPE
for building intermediate target distributions and for incorporating the exist-
ing sampling procedure. A more ambitious thought would be the possibility
of applying the divide-and-conquer strategy not only upon time but also
upon dimension.
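The localisation idea behind the block particle filter can be sketched in a few lines. This is a simplified illustration under assumed names, not the algorithm of Rebeschini et al. (2015) in full: the state is partitioned into index blocks, weights are computed from each block's coordinates only, and each block is resampled independently, which trades the global weight degeneracy for a bias at block boundaries.

```python
import math
import random

def block_pf_step(particles, y, lik, blocks, rng=random):
    """One assimilation step of a block particle filter sketch.

    particles: list of state vectors (lists); y: observation vector;
    lik(x_k, y_k): per-coordinate likelihood; blocks: list of index lists
    partitioning the coordinates.  Weights are computed and particles
    resampled independently per block, localising the weight computation.
    """
    new = [p[:] for p in particles]
    for block in blocks:
        # Local weight of each particle uses only this block's coordinates.
        w = [math.prod(lik(p[k], y[k]) for k in block) for p in particles]
        tot = sum(w)
        probs = [wi / tot for wi in w]
        # Resample this block's coordinates independently of other blocks.
        for i in range(len(particles)):
            j = rng.choices(range(len(particles)), weights=probs)[0]
            for k in block:
                new[i][k] = particles[j][k]
    return new
```

Because each block's resampling probabilities involve only a fixed number of coordinates, the weight variance within a block does not grow with the overall dimension, which is the mechanism behind the dimension-free error bounds cited above.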
Bibliography
Ball, F. G. and J. A. Rice (1992). Stochastic models for ion channels: intro-
duction and bibliography. Mathematical Biosciences 112 (2), 189–206.
BBC news (2016). Artificial intelligence: Google’s AlphaGo beats Go master
Lee Se-dol.
Cappé, O., E. Moulines, and T. Rydén (2006). Inference in Hidden Markov
Models. Springer Science & Business Media.
Cox, H. (1964). On the estimation of state variables and parameters for noisy
dynamic systems. IEEE Transactions on Automatic Control 9 (1), 5–12.
Doucet, A., S. Godsill, and C. Andrieu (2000). On sequential Monte Carlo
sampling methods for Bayesian filtering. Statistics and Computing 10 (3),
197–208.
Fay, M. P., H.-J. Kim, and M. Hachey (2007). On using truncated sequential
probability ratio test boundaries for Monte Carlo implementation of hy-
pothesis tests. Journal of Computational and Graphical Statistics 16 (4),
946–967.
Gandy, A. (2009). Sequential implementation of Monte Carlo tests with
uniformly bounded resampling risk. Journal of the American Statistical
Association 104 (488), 1504–1511.
Gandy, A., G. Hahn, and D. Ding (2017). Implementing Monte Carlo tests
with p-value buckets. arXiv preprint arXiv:1703.09305 .
Gandy, A. and F. D.-H. Lau (2016). The chopthin algorithm for resampling.
IEEE Transactions on Signal Processing 64 (16), 4273–4281.
Godsill, S., P. Rayner, and O. Cappé (2002). Digital audio restoration. In
Applications of Digital Signal Processing to Audio and Acoustics, pp. 133–
194. Springer.
Haykin, S. (2004). Kalman Filtering and Neural Networks, Volume 47. John
Wiley & Sons.
Hope, A. C. (1968). A simplified Monte Carlo significance test procedure.
Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy) 30 (3), 582–598.
Huang, X. D., Y. Ariki, and M. A. Jack (1990). Hidden Markov Models for
Speech Recognition. Edinburgh University Press.
IBM Corporation (2013). IBM SPSS Statistics for Windows. Armonk, NY:
IBM Corporation.
Kaplan, E. and C. Hegarty (2005). Understanding GPS: Principles and Ap-
plications. Artech House.
Kim, H.-J. (2010). Bounding the resampling risk for sequential Monte Carlo
implementation of hypothesis tests. Journal of Statistical Planning and
Inference 140 (7), 1834–1843.
Carlo: The marginal particle filter. In Proceedings of Uncertainty in Arti-
ficial Intelligence.
Lee, L.-M. and J.-C. Lee (2006). A study on high-order hidden Markov
models and applications to speech recognition. International Conference
on Industrial, Engineering and Other Applications of Applied Intelligent
Systems, 682–690.
Lin, M. T., J. L. Zhang, Q. Cheng, and R. Chen (2005). Independent particle
filters. Journal of the American Statistical Association 100 (472), 1412–
1421.
Liu, J. S. and R. Chen (1998). Sequential Monte Carlo methods for dynamic
systems. Journal of the American Statistical Association 93 (443), 1032–
1044.
size, and their reporting in randomized controlled trials. JAMA 272 (2),
122–124.
Pitt, M. K., R. dos Santos Silva, P. Giordani, and R. Kohn (2012). On some
properties of Markov chain Monte Carlo simulation methods based on the
particle filter. Journal of Econometrics 171 (2), 134–151.
Statistical Computing. Vienna, Austria: R Foundation for Statistical Com-
puting.
Rebeschini, P., R. Van Handel, et al. (2015). Can local particle filters beat
the curse of dimensionality? The Annals of Applied Probability 25 (5),
2809–2866.
Schäfer, C. and N. Chopin (2013). Sequential Monte Carlo on large binary
sampling spaces. Statistics and Computing 23 (2), 163–184.
Sileshi, B., C. Ferrer, and J. Oliver (2013). Particle filters and resampling
techniques: Importance in computational complexity analysis. 2013 Con-
ference on Design and Architectures for Signal and Image Processing, 319–
325.
Silva, I., R. Assunção, et al. (2018). Truncated sequential Monte Carlo test
with exact power. Brazilian Journal of Probability and Statistics 32 (2),
215–238.
Silva, I., R. Assunção, and M. Costa (2009). Power of the sequential Monte
Carlo test. Sequential Analysis 28 (2), 163–174.
Tango, T. and K. Takahashi (2005). A flexibly shaped spatial scan statistic
for detecting clusters. International Journal of Health Geographics 4 (1),
11.
Yamato, J., J. Ohya, and K. Ishii (1992). Recognizing human action in time-
sequential images using hidden Markov model. Proceedings 1992 IEEE
Computer Society Conference on Computer Vision and Pattern Recogni-
tion, 379–385.
maximization algorithm. IEEE Transactions on Medical Imaging 20 (1),
45–57.