
Monte Carlo Algorithms for Hypothesis Testing
and for Hidden Markov Models

a thesis presented for the degree of


Doctor of Philosophy of Imperial College London
and the
Diploma of Imperial College
by

Dong DING

Department of Mathematics
Imperial College
180 Queen’s Gate, London SW7 2BZ

December 2018
I certify that this thesis, and the research to which it refers, are the product
of my own work, and that any ideas or quotations from the work of other
people, published or otherwise, are fully acknowledged in accordance with
the standard referencing practices of the discipline.

Signed:

Copyright

The copyright of this thesis rests with the author and is made available
under a Creative Commons Attribution Non-Commercial No Derivatives li-
cence. Researchers are free to copy, distribute or transmit the thesis on the
condition that they attribute it, that they do not use it for commercial pur-
poses and that they do not alter, transform or build upon it. For any reuse
or redistribution, researchers must make clear to others the licence terms of
this work.

Thesis advisor: Professor Axel Gandy Dong DING

Monte Carlo Algorithms for Hypothesis Testing and for Hidden Markov Models

Abstract

Monte Carlo methods are useful tools to approximate the numerical result
of a problem by random sampling when its analytic solution is intractable
or computationally intensive. The main focus of this work is to investigate
Monte Carlo methods in two areas of inference problems: hypothesis testing
and posterior analysis in a hidden Markov model (HMM).

The first part of this thesis focuses on the decision of the p-value with respect
to a fixed threshold via Monte Carlo simulations in a statistical hypothesis
test. We wish to control the resampling risk, which is the probability of
obtaining a different test decision from the true one based on the unknown
p-value. We present confidence sequence method (CSM), a simple Monte
Carlo testing procedure which bounds the resampling risk uniformly. CSM
is useful due to its simple implementation and comparable performance to
its competitors.

The second part of the thesis focuses on two posterior distributions of an HMM: smoothing and parameter estimation. We apply a divide-and-conquer strategy (Lindsten et al., 2017) to develop Monte Carlo algorithms which provide sample approximations of the target distributions.


We propose an algorithm called tree-based particle smoothing algorithm (TPS) to estimate the joint smoothing distribution. We then assume an
unknown parameter in the HMM, and extend TPS to approximate its poste-
rior, which we refer to as tree-based parameter estimation algorithm (TPE).

TPS and TPE both construct an auxiliary tree for recursively splitting the model into sub-models. The root of the tree stands for the target distribution of the model. We propose different forms of intermediate target distributions for the sub-models associated with the non-root nodes, which are crucial to sampling quality. For the sampling process, we generate initial samples independently at the leaf nodes. Then we recursively merge these samples along the tree until reaching the root. Each merging step involves importance sampling for the (intermediate) target distribution. A more adaptive design of the algorithms and an improved accuracy compared to their competitors make them useful alternatives in practice.

To my family and friends.

Acknowledgments

First and foremost, I would like to sincerely thank my supervisor, Prof. Axel
Gandy, for his great expertise, support and patience. Over the years, he has
not only guided and motivated me with constructive ideas in my research, but
also advised me on academic writing, time management and career planning,
which I really appreciate. I feel very lucky to have met such an excellent PhD supervisor, and believe the research skills I learned from him, along with his instruction and encouragement, will help me in the future.
Moreover, I would like to thank him for securing the college scholarship for
me.

I would also like to thank Dr. Georg Hahn for his useful suggestions and patience with our submitted journal papers as well as with my research. I still remember the day he went through every paragraph of my first ever collaborative paper to give me very detailed advice. Furthermore, I would like to thank
Prof. David Van Dyk and Dr. Nikolas Kantas for their useful comments on
my early and late stage assessments. I would like to thank Jessica Zhuang,
Longjie Jia, Shijing Si, Nanxin Wei, Din-Houn Lau, Xue Lu, Ricardo Monti,
Jeff Leong, Zhana Kuncheva, Diletta Martinelli, Dimos Tsagkrasoulis, Xiyun
Jiao, Francois-Xavier Briol and Louis Ellam for creating a memorable research environment at Huxley 526 and within the maths department.

I would like to thank my mum and dad for their continuous support and
understanding. They often visit me with great care and cook delicious food. The constant family reunions never let me feel lonely in the UK.

I would like to thank all my friends for the many memorable moments during my PhD study. In particular, I have always been in touch with Renda
Gu, Weiyi Huang, Yiwen Hu, Cecilia Li and Yucheng Shi to share our life
stories. It has also been a great pleasure to work and make friends at Imperial CSSA with Tianyu Cheng, Lily Lin, Chris Cheung, Runzhi Zhou, Sizhe Zhou
and Yiqun Huan.

Finally, I would like to express my thanks for receiving the President’s PhD
Scholarship of Imperial College (formerly known as ‘Imperial College PhD
Scholarship’) as financial support.

List of figures

2.1 Lower and upper stopping boundaries of CSM . . . . . . . . . 32


2.2 Expected effort of CSM . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Stopping boundaries for CSM and SIMCTEST . . . . . . . . . 37
2.4 Ratio of widths of stopping boundaries for CSM and SIMCTEST 37
2.5 Cumulative resampling risk in CSM and SIMCTEST . . . . . 40
2.6 Stopping boundaries of truncated CSM and SIMCTEST with
the truncated spending sequence . . . . . . . . . . . . . . . . . 41
2.7 Rate spent in the real resampling risk of CSM . . . . . . . . . 42
2.8 Differences between the upper and lower stopping boundaries
of CSM and SIMCTEST . . . . . . . . . . . . . . . . . . . . . 44
2.9 Comparison of resampling risks between the truncated Monte
Carlo testing procedures . . . . . . . . . . . . . . . . . . . . . 47
2.10 Non-stopping regions for the p-value buckets constructed from J^e . . . 58
2.11 Non-stopping region for the p-value buckets: J^0, J^∗ and J^n . . . 58

3.1 Graphical representation of a hidden Markov model (HMM). . 69


3.2 Auxiliary tree of TPS constructed from an HMM . . . . . . . 102
3.3 Computational flow of TPS in an HMM . . . . . . . . . . . . 105
3.4 Estimated filtering and smoothing distributions in the non-
linear HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.5 Diagnostics plots using RESS and MRESS in the toy model . 128
3.6 Diagnostic plots using RESS and MRESS in the linear and
non-linear HMMs . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.7 Sample diversity measured by ESSoED in the linear Gaussian
HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

3.8 CDF of the smoothing, filtering and sampling distribution of
TPS-L in the non-linear HMM . . . . . . . . . . . . . . . . . . 142

4.1 Auxiliary tree of TPE constructed from an HMM . . . . . . . 160


4.2 Graph representation of the sub-HMMs in an HMM . . . . . . 165
4.3 Auxiliary tree of TPE-SIR constructed from an HMM . . . . . 171
4.4 Auxiliary tree constructed from the toy model . . . . . . . . . 173

5.1 Update of the auxiliary tree of TPS in the on-line setting . . . 203

List of tables

2.1 Stopping boundaries using Davidson and MacKinnon (2000)’s


method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Parameters of the truncated Monte Carlo testing procedures . 46
2.3 Two-way contingency table . . . . . . . . . . . . . . . . . . . . 62
2.4 Simulation results in the contingency example . . . . . . . . . 63
2.5 Comparison between CSM and SIMCTEST . . . . . . . . . . 64

3.1 Simulation results of the smoothing algorithms in the linear


Gaussian HMM . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.2 Simulation results of the smoothing algorithms in the non-
linear Gaussian HMM . . . . . . . . . . . . . . . . . . . . . . 141
3.3 Simulation results between TPS-EF-P and TPS-ES-P in the
linear Gaussian HMM . . . . . . . . . . . . . . . . . . . . . . 144

4.1 Options in TPE regarding the prior information and the com-
bination method of the overlapping parameters . . . . . . . . 192
4.2 Simulation results of the parameter estimation algorithms in
the HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

Contents

1 Introduction 1
1.1 Preamble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aims of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Overview of the Chapters . . . . . . . . . . . . . . . . . . . . 9
1.4 List of Publications . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Implementing Monte Carlo Tests with Uniformly Bounded


Resampling Risk 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Statistical Hypothesis . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 P-value . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Type of Error and Statistical Power . . . . . . . . . . . 21
2.3 Monte Carlo Tests . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Resampling Risk . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Sequential Monte Carlo Procedures . . . . . . . . . . . 24
2.4 Confidence Sequence Method (CSM) . . . . . . . . . . . . . . 30
2.5 Review of SIMCTEST . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Comparison of CSM to SIMCTEST . . . . . . . . . . . . . . . 36
2.6.1 Stopping Boundaries . . . . . . . . . . . . . . . . . . . 36
2.6.2 Real Resampling Risk . . . . . . . . . . . . . . . . . . 38
2.7 Spending Sequences which Dominate CSM . . . . . . . . . . . 39
2.7.1 Example of a Bespoke Spending Sequence . . . . . . . 39
2.7.2 Uniformly Dominating Spending Sequence . . . . . . . 42
2.8 Comparison of Truncated Sequential Monte Carlo Procedures 45

2.9 Extension to Multiple Thresholds . . . . . . . . . . . . . . . . 46
2.9.1 P-value Buckets and Resampling Risk . . . . . . . . . . 47
2.9.2 General Construction of the Algorithms . . . . . . . . 50
2.9.3 Multi-threshold CSM and SIMCTEST . . . . . . . . . 51
2.9.4 Non-stopping Regions of the P-value Buckets . . . . . 56
2.10 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.10.1 Comparison of Penguin Pairs on Two Islands . . . . . . 59
2.10.2 Two-way Contingency Table . . . . . . . . . . . . . . . 61
2.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3 Tree-based Particle Smoothing Algorithms in a Hidden


Markov Model 68
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . 73
3.3 Filtering and Smoothing . . . . . . . . . . . . . . . . . . . . . 77
3.3.1 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . 78
3.3.2 Rauch–Tung–Striebel Smoother . . . . . . . . . . . . . 81
3.4 Particle Methods . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.4.1 Importance Sampling . . . . . . . . . . . . . . . . . . . 82
3.4.2 Sequential Importance Sampling and Resampling . . . 85
3.4.3 Particle Filtering and Smoothing . . . . . . . . . . . . 91
3.5 Previous Monte Carlo Smoothing Algorithms . . . . . . . . . . 94
3.5.1 Forward Filtering Backward Smoothing (FFBSm) . . . 94
3.5.2 Forward Filtering Backward Sampling (FFBSi) . . . . 98
3.6 Tree-based Particle Smoothing Algorithm (TPS) . . . . . . . 99
3.6.1 Construction of the Auxiliary Tree . . . . . . . . . . . 100
3.6.2 Sampling Procedure . . . . . . . . . . . . . . . . . . . 102
3.6.3 Proliferation . . . . . . . . . . . . . . . . . . . . . . . . 106

3.7 Intermediate Target Distributions . . . . . . . . . . . . . . . . 108
3.7.1 Distribution Suggested by Lindsten et al. (2017) (TPS-L) . . . 109
3.7.2 Estimates of the Filtering Distributions (TPS-EF) . . . 110
3.7.3 Kullback–Leibler Divergence in TPS . . . . . . . . . . 112
3.7.4 Estimates of the Smoothing Distributions (TPS-ES) . . 115
3.7.5 Intermediate Target Distributions at Leaf Nodes . . . . 117
3.7.6 Exact Filtering Distributions (TPS-F) . . . . . . . . . 120
3.8 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.8.1 Definitions and Properties of RESS and MRESS . . . . 122
3.8.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 127
3.9 Simulation Study in a Linear Gaussian HMM . . . . . . . . . 130
3.9.1 Model Description and Metrics . . . . . . . . . . . . . 130
3.9.2 Simulation Results . . . . . . . . . . . . . . . . . . . . 132
3.10 Simulation Study in a Non-linear HMM . . . . . . . . . . . . . 134
3.10.1 Model Description and Metrics . . . . . . . . . . . . . 135
3.10.2 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 135
3.10.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.10.4 Comparison between TPS and Other Algorithms . . . 140
3.10.5 Comparison between TPS-EF and TPS-ES . . . . . . . 143
3.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

4 Tree-based Sampling Algorithms for Parameter Esti-


mation in a Hidden Markov Model 149
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.2 Particle Marginal Metropolis-Hastings Sampler (PMMH) . . . 155
4.3 Sequential Importance Resampling for Parameter Estimation
(SIR-PE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.4 Tree-based Parameter Estimation Algorithm (TPE) . . . . . . 157
4.4.1 Construction of the Auxiliary Tree . . . . . . . . . . . 159

4.4.2 Sampling Procedure . . . . . . . . . . . . . . . . . . . 160
4.5 Intermediate Target Distributions . . . . . . . . . . . . . . . . 163
4.5.1 Relation to Consensus Monte Carlo . . . . . . . . . . . 163
4.5.2 Sub-HMMs with Original Priors (TPE-O) . . . . . . . 166
4.5.3 Sub-HMMs with Estimated Prediction Priors (TPE-EP) . . . 167
4.6 Combination of TPE and SIR-PE . . . . . . . . . . . . . . . . 171
4.7 Construction of Transformation Functions . . . . . . . . . . . 172
4.7.1 A Toy Model with Conditional Independent States . . 173
4.7.2 Unknown Parameter with Support R in an HMM . . . 179
4.7.3 Unknown Parameter with Support R+ in an HMM . . 182
4.8 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . 185
4.8.1 Model Description . . . . . . . . . . . . . . . . . . . . 185
4.8.2 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . 186
4.8.3 Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
4.8.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 189
4.8.5 Simulation Parameters and Results . . . . . . . . . . . 193
4.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

5 Conclusion and Future Work 199


5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

1
Introduction

1.1 Preamble

Monte Carlo methods (Metropolis and Ulam, 1949) are a broad class of com-
putational algorithms which approximate the deterministic result of a prob-
lem by random sampling. The term ‘Monte Carlo’ was coined by Metropolis and Ulam (1949) for solving problems in mathematical physics. Since then, the
methods have become a desirable option when a highly complex problem
lacks an analytic solution or a simple implementation. According to Google Scholar, over three million academic articles are related to the keywords ‘Monte Carlo’, of which over one million have appeared since the year 2000.

Monte Carlo methods have also been employed in cutting-edge technology. For instance, one class of methods called Monte Carlo tree search is applied in artificial intelligence to design the program ‘AlphaGo’ for playing the board game Go (Silver et al., 2016). In early 2016, AlphaGo beat Go master Lee Se-dol 3-0 in a best-of-five competition (BBC news, 2016).

In this thesis, we develop new Monte Carlo methods for two inference
problems: (statistical) hypothesis testing and posterior analysis in one type
of probabilistic graphical models called hidden Markov models (HMMs). In
both areas, previous Monte Carlo algorithms either do not focus on bounding the error caused by random sampling, or produce poor approximations under certain circumstances due to ineffective designs. Hence, we wish to control this error, or reduce it relative to existing methods, in the aforementioned areas.
At the same time, we aim for a simple and efficient implementation.

Statistical hypothesis testing is a procedure which evaluates whether the observed sample data is consistent with the established statements about
the population (Altman, 1990). Typically, the statements consist of a null
hypothesis and an alternative hypothesis. The null hypothesis usually postu-
lates that the effect of interest is zero against the alternative which implies
the opposite.

The decision of accepting or rejecting the null hypothesis is determined by a statistical test, which employs a test statistic denoted by T to evaluate
the discrepancy between the observed data and the null hypothesis. The
distribution of T under the null (hypothesis) is usually known, and the level
of discrepancy is measured by the probability of obtaining a realisation of
the test statistic at least as extreme as the observed test statistic. We

refer to this probability as the p-value. If the p-value is below a user-specified
threshold, the observed sample data is believed to be very unlikely under the
null. Hence, the null hypothesis is rejected.

In many real applications, the p-value cannot be evaluated analytically.


Monte Carlo methods offer an alternative way to approximate the p-value; the resulting procedures are called (Monte Carlo) tests. A test generates independent samples from
the distribution of T under the null via Monte Carlo simulations, and com-
pares them to the observed test statistic to obtain an estimate of the p-value.

In the inference of a probabilistic graphical model (PGM), we focus on a specific class of models called hidden Markov models (HMMs), whose definition is built upon a Markov process.

A probabilistic graphical model (PGM) defines a complex probability distribution of a series of random variables over a high-dimensional space
(Koller et al., 2009). The model employs a graph-based representation to
indicate the conditional dependence structure between the random variables.
In the PGM, each node represents a random variable, and dependencies
between the random variables are explained via edges. The overall joint
distribution of the PGM can be expressed as the product of factors, where
each factor corresponds to the distribution over a sub-model.

A (discrete-time) Markov process {Xt}t∈N (Grimmett and Stirzaker, 2001) is a stochastic process with discrete-time indices where the dependence of Xt given X0:t−1 = (X0, . . . , Xt−1) is only on Xt−1 for all t > 0.

Hidden Markov models (HMMs) are within the class of PGMs which incorporate a hidden Markov process {Xt}t∈N and a set of observable random
variables {Yt }t∈N generated by the Markov process (Cappé et al., 2006). The
word ‘hidden’ implies invisibility of the realisations of {Xt }t∈N , and each Xt
is called a hidden state. Each observation Yt is available to the user, and its distribution only depends on Xt.
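To make the setup concrete, the following sketch simulates a small linear Gaussian HMM; the transition coefficient and the noise variances below are illustrative choices, not values used elsewhere in the thesis.

```python
import random

def simulate_hmm(T, a=0.9, q=1.0, r=1.0, seed=0):
    """Simulate hidden states X_0:T and observations Y_0:T of a linear
    Gaussian HMM: X_t = a*X_{t-1} + N(0, q), Y_t = X_t + N(0, r)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)                      # initial state X_0 ~ N(0, 1)
    xs, ys = [], []
    for _ in range(T + 1):
        xs.append(x)
        ys.append(x + rng.gauss(0.0, r ** 0.5))  # Y_t depends only on X_t
        x = a * x + rng.gauss(0.0, q ** 0.5)     # Markov transition to X_{t+1}
    return xs, ys

xs, ys = simulate_hmm(T=50)
```

The realisations of the hidden states `xs` would be invisible in practice; only `ys` is observed.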

Estimating the hidden states given the observations is a common inference task. Typical challenges include the derivation of the posterior distributions: The filtering distribution p(xt |y0:t ) refers to the distribution of a hidden state Xt conditional on the observations up to the same
time step t whereas the smoothing distribution p(xt |y0:T ) conditions on all
observations until the final time step of the process, which we denote by T .
Moreover, given an incomplete HMM with an unknown parameter denoted
by θ, the posterior distribution p(θ|y0:T ) or p(θ, x0:T |y0:T ) is of interest.

Monte Carlo methods are often regarded as a remedy for the intractable posterior distribution in many complicated HMMs with non-linear and non-Gaussian structures. They simulate relevant samples to approximate the
target distribution, and various probabilistic properties can be inferred from
the empirical distribution formed by the samples.

1.2 Aims of the Thesis

This thesis aims to develop Monte Carlo algorithms for statistical hypothesis testing and for inference problems in an HMM.

In the first part (Chapter 2) of the thesis, we develop a Monte Carlo testing procedure which bounds a specific error for an estimated p-value in

the statistical hypothesis test. The error caused by random sampling which we wish to control is called the resampling risk (Fay and Follmann, 2002; Fay et al., 2007; Gandy, 2009). Formally, it is defined as the probability that
the true p-value and the estimated p-value are on different sides of a fixed
threshold.
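For intuition, the resampling risk of the naive fixed-sample-size test (decide p ≤ α if and only if the sample proportion after n samples is at most α) can be estimated empirically when the true p is known; the sample sizes and replication counts below are illustrative.

```python
import random

def resampling_risk(p, alpha=0.05, n=1000, reps=1000, seed=0):
    """Estimate the probability that the decision based on p_hat = S/n
    differs from the decision based on the true p-value p."""
    rng = random.Random(seed)
    true_decision = p <= alpha            # decision from the true p-value
    wrong = 0
    for _ in range(reps):
        s = sum(rng.random() < p for _ in range(n))  # S ~ Binomial(n, p)
        wrong += ((s / n) <= alpha) != true_decision
    return wrong / reps
```

The risk approaches 1/2 when p lies close to the threshold α and decays quickly as p moves away from it, which is why no fixed n can bound it uniformly over all p.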

Choosing the resampling risk as the metric originates from the ‘first law of
applied statistics’ (Gleser, 1996): ‘Two individuals using the same statistical
method on the same data should arrive at the same conclusion.’ Monte
Carlo simulation violates this law, since it introduces extra variability to the result, caused by random sampling, that is not inherent in the data itself.
Hence, we hope to regulate this uncertainty.

In the context of statistical hypothesis testing, the argument of the first law of applied statistics can be translated to a decision on the p-value with
respect to a fixed threshold being identical from the same statistical method
in different trials. The resampling risk exactly measures the variability of
the decision in the Monte Carlo tests.

The aim of the first part of the thesis is to develop and compare two
Monte Carlo testing procedures, which both bound the resampling risk within
a pre-specified error uniformly, i.e. for all p-values in [0,1]. The algo-
rithm proposed by Gandy (2009), which we call SIMCTEST, achieves this by
constructing a spending sequence: It determines the rate of the allowed re-
sampling risk spent. SIMCTEST ensures that the probability of a proposed
metric hitting the wrong boundary given by the spending sequence is at most ε, thus showing a uniformly bounded resampling risk.

The proposed algorithm, called the confidence sequence method (CSM), is inspired by the construction of sequential confidence intervals for the p-value (Robbins, 1970; Lai, 1976). CSM forms a decision upon such a sequence, whose joint coverage probability of the p-value is bounded below by 1 − ε. The algorithm is attractive due to its simple implementation and comparable performance to SIMCTEST as well as other algorithms.
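The essence of this construction can be sketched as follows. Robbins' (1970) inequality gives P(∃n : (n+1) b(n, p, Sn) ≤ ε) ≤ ε, where b(n, p, s) is the Binomial(n, p) probability mass function and Sn = X1 + · · · + Xn; stopping as soon as α falls outside the resulting confidence sequence bounds the resampling risk by ε. This is a simplified reading of the procedure developed in Chapter 2, not its exact pseudo-code.

```python
import math

def log_binom_pmf(s, n, p):
    """log( C(n, s) * p**s * (1 - p)**(n - s) )."""
    return (math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
            + s * math.log(p) + (n - s) * math.log1p(-p))

def csm(draw_x, alpha=0.05, eps=1e-3, max_n=100000):
    """Consume X_i = 1(T_i >= t) until alpha falls outside the Robbins
    confidence sequence; return the decision and the number of samples."""
    s = 0
    for n in range(1, max_n + 1):
        s += draw_x()
        # alpha lies outside the current confidence interval for p iff
        # (n + 1) * b(n, alpha, S_n) <= eps (checked in log space)
        if math.log(n + 1) + log_binom_pmf(s, n, alpha) <= math.log(eps):
            return ("p <= alpha" if s / n <= alpha else "p > alpha"), n
    return "undecided", max_n
```

For instance, a stream that never exceeds the observed test statistic (every X_i = 0) leads to the decision p ≤ α after a few hundred draws at ε = 10⁻³.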

In many practical tests, the threshold of interest is not restricted to a single one, but includes multiple ones. Standard software packages (R Development Core Team, 2008; IBM Corporation, 2013) report the p-value with respect to the significance levels 0.1%, 1% and 5%. Motivated
by this convention, we extend SIMCTEST and CSM to accommodate multi-
ple thresholds in the Monte Carlo test while bounding the refined resampling
risk uniformly.

In the second part (Chapter 3 & 4) of the thesis, we consider two poste-
rior distributions: smoothing and parameter estimation in a hidden Markov
model (HMM). We establish a novel class of Monte Carlo algorithms, which
approximate the posteriors when their analytic solutions do not exist.

In the literature, one type of Monte Carlo method called sequential Monte
Carlo (SMC) is widely employed in the HMM (Liu and Chen, 1998; Pitt and
Shephard, 1999; Doucet et al., 2000). SMC can produce random samples
sequentially from a list of target distributions with increasing dimension.
Hence, it can be applied to the filtering or smoothing problem in which the
distributions p(x0:t |y0:t ) for t = 0, . . . , T are sequentially estimated.

When sampling from the smoothing distribution, Monte Carlo algorithms using a sequential approach such as SMC usually suffer from a
phenomenon called path degeneracy (Arulampalam et al., 2002). Path de-
generacy refers to low diversity of the samples caused by numerous update
steps.

We illustrate the path degeneracy issue when applying SMC for smooth-
ing. The algorithm starts from t = 0 where the samples of X0 are initially
simulated. When we proceed forward to simulate X1 , the samples of X0 re-
quire an update, which may be accompanied by a resampling process. In this
process, we replicate samples with high probabilities to substitute those with
low probabilities, hence losing diversity. Usually, more updates imply more
resampling steps. In SMC, for each time t ≤ T , the samples of X0 , . . . , Xt−1
from all previous time steps demand updates. As a result, the samples of X0
are updated (T + 1) times upon the end of the algorithm, which implies a
large number of potential resampling steps if T is huge. On the other hand,
XT is only revealed in the sequence of targets p(x0:t |y0:t ), t = 0, . . . , T , at the very last step, which contributes to far more diversified samples with at most one
resampling step.
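The effect can be observed directly by running a small bootstrap particle filter with multinomial resampling at every step and counting how many distinct X0 values survive among the final particles; the model, its parameters and the sample sizes below are illustrative.

```python
import math
import random

def surviving_ancestors(T, N=200, seed=0):
    """Run a bootstrap particle filter on a toy linear Gaussian HMM and
    return the number of distinct X_0 ancestors of the final particles."""
    rng = random.Random(seed)
    ys = [rng.gauss(0.0, 1.5) for _ in range(T + 1)]    # synthetic observations
    xs = [rng.gauss(0.0, 1.0) for _ in range(N)]        # particles for X_0
    anc = list(range(N))                                # index of each particle's X_0
    for t in range(T + 1):
        w = [math.exp(-0.5 * (ys[t] - x) ** 2) for x in xs]    # N(y_t; x_t, 1) weight
        idx = rng.choices(range(N), weights=w, k=N)            # multinomial resampling
        xs = [0.9 * xs[i] + rng.gauss(0.0, 1.0) for i in idx]  # propagate to X_{t+1}
        anc = [anc[i] for i in idx]
    return len(set(anc))
```

With N = 200 particles, a single resampling step typically leaves on the order of a hundred distinct X_0 ancestors, whereas after fifty steps only a handful survive.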

The proposed tree-based particle smoothing algorithm (TPS) in an HMM is motivated by divide-and-conquer sequential Monte Carlo (D&C SMC)
(Lindsten et al., 2017). We adapt D&C SMC, which targets a general PGM, to the HMM to investigate the sampling procedure for the smoothing problem. Using the idea of D&C SMC, the construction of TPS can
be pictured using a binary tree structure, where the hidden states X0:T =
(X0 , X1 , . . . , XT ) at the root node are recursively split into two disjoint sub-
sets according to a division rule to create two children. The split ceases once

a node only contains a single hidden state. The tree under such construction
has a depth of 1 + ⌈log2 (T + 1)⌉ levels.
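The construction can be sketched as a recursion on index blocks; the mid-point split below is one natural division rule among those the thesis permits.

```python
import math

def build_tree(lo, hi):
    """Recursively split the block of state indices lo..hi into two
    halves until each leaf holds a single hidden state."""
    if lo == hi:
        return (lo, hi)                          # leaf: one hidden state
    mid = (lo + hi) // 2                         # mid-point division rule
    return ((lo, hi), build_tree(lo, mid), build_tree(mid + 1, hi))

def depth(node):
    if len(node) == 2:                           # leaf node
        return 1
    return 1 + max(depth(node[1]), depth(node[2]))

# For X_0, ..., X_T this split yields depth 1 + ceil(log2(T + 1)).
T = 10
assert depth(build_tree(0, T)) == 1 + math.ceil(math.log2(T + 1))
```

The depth formula holds for any T under this rule, since a block of n states splits into blocks of ⌈n/2⌉ and ⌊n/2⌋ states.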

We need to assign the target distributions of the hidden state(s) at the tree nodes. The root exactly stands for the joint smoothing distribution
we are interested in. The main novelty of the algorithm is the design of the
intermediate target distribution at each non-root node, where different forms
are introduced and compared.

For the sampling process of TPS, we simulate initial samples directly at each leaf node. Using a Monte Carlo technique called importance sampling,
we obtain the samples at each non-leaf node by merging those from the
children. We recursively execute this routine from the leaf nodes to the root.

TPS can potentially mitigate path degeneracy. In a sequential approach, the number of updates of a hidden state Xt is imbalanced for different t,
which can range from 1 to (T + 1) as mentioned earlier. In contrast, TPS
implements roughly ⌈log2 (T + 1)⌉ update steps for each hidden state, where ⌈log2 (T + 1)⌉ relates to the depth of the tree. Therefore, we reduce the maximum number of updates of an individual hidden state from O(T) to O(log T), and mitigate path degeneracy for early time steps.

Apart from degeneracy, most previous Monte Carlo algorithms (Doucet et al., 2000; Godsill et al., 2004; Briers et al., 2010) do not admit an efficient implementation, since their computational complexity is quadratic
with respect to the output sample size. Fearnhead et al. (2010) propose a smoothing algorithm with linear complexity at the cost of fewer options
for proposals. The sampling procedure in TPS is designed to be flexible in

consideration of proposals and computational budget. In particular, it is
adjustable with a possible reduction to a linear complexity.

Having developed TPS to target the smoothing distribution, we extend it to solve the parameter estimation problem in an HMM, and call the result the tree-based parameter estimation algorithm (TPE). Rather than focusing on the
inference of the unknown parameter θ, TPE augments the target space to
draw samples from the joint posterior of the parameter with the hidden states
p(θ, x0:T |y0:T ).

TPE constructs an associated binary tree similarly to TPS, where each node contains an additional unknown parameter governing the hidden state(s).
However, this poses a challenge at each non-leaf node, where two variables representing the unknown parameters from the children overlap in the merging step. Our solution is to combine the overlapping parameter variables to generate a new one. In practice, this also diversifies the samples, and completely avoids the Markov chain Monte Carlo (MCMC) update applied in several existing algorithms (Lee and Chia, 2002; Polson et al., 2008; Gilks and Berzuini, 2001; Chopin et al., 2013). This is a potential advantage of TPE, as an MCMC update can be arduous in terms of choosing a transition kernel and tuning parameters in high dimensions.

1.3 Overview of the Chapters

This thesis is structured as follows.

In Chapter 2, we first review statistical hypothesis testing in Section 2.2 and Monte Carlo tests in Section 2.3 before the construction of confidence

sequence method (CSM) in Section 2.4. We then review the existing approach
called SIMCTEST in Section 2.5, and compare it to CSM in Section 2.6 &
2.7.

We further investigate the resampling risk of the Monte Carlo test when a
maximum number of Monte Carlo samples is specified, which we call trunca-
tion. We empirically show the risk is not uniformly bounded in the truncated
versions of CSM and SIMCTEST as well as other truncated procedures in
Section 2.8. We extend the Monte Carlo tests under a single threshold to
multiple ones in Section 2.9.

In the simulation studies of Section 2.10, we apply CSM and SIMCTEST to a real data example which compares yellow-eyed penguin pairs on two
islands, and to a contingency table. The chapter concludes with a discussion
in Section 2.11.

In Chapter 3, we propose the tree-based particle smoothing algorithm (TPS) to simulate samples from the joint smoothing distribution p(x0:T |y0:T ) in
an HMM. We first review the HMMs with their inference problems from
Section 3.2 to Section 3.5. We describe the strategy of TPS, and conceive a
natural way of establishing an auxiliary binary tree with a general sampling
procedure in Section 3.6.

We specify different classes of intermediate target distributions in the auxiliary tree in Section 3.7. Additionally, we introduce the concepts of
relative effective sample size (RESS) and marginal relative effective sample
size (MRESS) to assess the sampling quality of TPS in Section 3.8.

We perform a series of simulation studies in a linear Gaussian HMM and in a non-linear HMM in Section 3.9 & 3.10. We complete the chapter with a
discussion in Section 3.11.

In Chapter 4, we extend TPS to address the parameter estimation problem in the HMM. We develop the tree-based parameter estimation algorithm (TPE) which samples from the posterior distribution p(θ, x0:T |y0:T ). We first
review two previous Monte Carlo algorithms for parameter estimation in Sec-
tion 4.2 & 4.3. We then define an auxiliary binary tree of TPE and illustrate
a general sampling process in Section 4.4.

We proceed to Section 4.5 to present two classes of intermediate target distributions in TPE. We illustrate an extended algorithm of TPE, which we
refer to as TPE-SIR, in Section 4.6. We demonstrate the combination of the
overlapping parameter variables in Section 4.7.

In the simulation study of Section 4.8, we run TPE and other algorithms to estimate a three-dimensional unknown parameter in a linear
Gaussian HMM. We finish the chapter with a discussion in Section 4.9.

1.4 List of Publications

Chapters 2 and 3 of this thesis have been submitted to journals and are
available on the arXiv preprint server:

• Chapter 2 (Section 2.3 – Section 2.8): submitted as the first author


and available on

https://arxiv.org/abs/1611.01675

• Chapter 2 (Section 2.9): submitted as the third author with main con-
tribution to Theorem 2 (first part) and Lemma 3 in this thesis. The
article is available on

https://arxiv.org/abs/1703.09305

• Chapter 3: submitted as the first author and available on

https://arxiv.org/abs/1808.08400

2 Implementing Monte Carlo Tests with Uniformly Bounded Resampling Risk

2.1 Introduction

Suppose we want to use a one-sided statistical test with null hypothesis H0


based on a test statistic T with observed value t. We aim to calculate the
p-value
p = P(T ≥ t|H0 ),

where the measure P is ideally the true null distribution in a simple hypoth-
esis. Otherwise, it can be an estimated distribution in a bootstrap scheme,
or a distribution conditional on an ancillary statistic, etc.

We consider the scenario in which p cannot be evaluated in closed form,


but can be approximated using Monte Carlo simulation, e.g. through boot-
strapping or drawing permutations. To be precise, we assume we can gener-
ate a stream (Ti )i∈N of i.i.d. random variables from the distribution of a test
statistic T under P. The information about whether or not Ti exceeds the
observed value t is contained in the random variable Xi = ✶(Ti ≥ t), where
✶ denotes the indicator function. It is a Bernoulli random variable satisfying
P(Xi = 1) = p. We will formulate algorithms in terms of Xi .

Gleser (1996) suggests that two individuals using the same statistical
method on the same data should reach the same conclusion. For tests, the
standard decision rule is based on comparing p to a threshold α. In the
setting we consider, Monte Carlo methods are used to compute an estimate
p̂ of p, which is then compared to α to reach a decision.

We are interested in procedures which provide a user-specified uniform


bound > 0 on the resampling risk. The resampling risk (Fay and Follmann,
2002; Fay et al., 2007; Gandy, 2009) is defined as the probability of returning
a different test decision (based on Monte Carlo simulation) than the decision
based on the unknown p.

When we compare the p-value with a single threshold, we let the resampling
risk be

    RRp(p̂) = P(p̂ > α)   if p ≤ α,
             P(p̂ ≤ α)   if p > α,

where α is the threshold and p̂ is a p-value estimate computed by a Monte
Carlo procedure.

In the first part of this chapter, we are looking for procedures that achieve

    sup_{p∈[0,1]} RRp(p̂) ≤ ε                    (2.1)

for a pre-specified ε > 0.

We introduce a simple open-ended sequential Monte Carlo testing procedure
achieving (2.1) for any ε > 0 given a single threshold, which we call
the confidence sequence method (CSM). Our method is based on a confidence
sequence for p, that is, a sequence of (random) intervals with a joint coverage
probability of at least 1 − ε. We will use the sequences constructed by Robbins
(1970) and Lai (1976). A decision whether to reject H0 is reached as soon as
the first interval produced in the confidence sequence ceases to contain the
threshold α.

The basic (non-sequential) Monte Carlo estimator (Davison et al., 1997)

    p̂ = (1 + Σ_{i=1}^n Xi) / (1 + n),

where n is the pre-defined number of Monte Carlo samples, does not guarantee
a small uniform bound on the resampling risk. In fact, the lowest uniform
bound on the resampling risk for this estimator is at least 0.5 (Gandy, 2009).

A variety of procedures for sequential Monte Carlo testing are available in
the literature which target different error measures. Silva et al. (2009); Silva
and Assunção (2013) bound the power loss of the test while minimising the
expected number of steps. Silva et al. (2018, Section 4) construct truncated
sequential Monte Carlo algorithms which bound the power loss and the level
of significance in comparison to the exact test by arbitrarily small numbers.
Other algorithms aim to control the resampling risk (Fay and Follmann,
2002; Fay et al., 2007; Gandy, 2009; Kim, 2010). Fay et al. (2007) use a
truncated sequential probability ratio test (SPRT) boundary and discuss the
resampling risk, but do not aim for a uniform bound on it. Kim (2010) and
Fay and Follmann (2002) ensure a uniform bound on the resampling risk
under the assumption that the random variable p belongs to a certain class
of distributions. While this is a much less restrictive requirement than (2.1),
one drawback of this approach is that in real situations, the distribution of p
is typically not fully known, as this would require knowledge of the underlying
true sampling distribution.

We mainly compare our method to the existing approach of Gandy
(2009), which we call SIMCTEST in the present thesis. SIMCTEST works
on the partial sum Sn = Σ_{i=1}^n Xi and reaches a decision on p as soon as Sn
crosses suitably constructed decision boundaries. SIMCTEST is specifically
constructed to guarantee a desired uniform bound on the resampling risk.

Procedures for Monte Carlo testing can be classified as open-ended and


truncated procedures (Silva and Assunção, 2013). A truncated approach
(Davidson and MacKinnon, 2000; Silva et al., 2018; Besag and Clifford, 1991;
Silva and Assunção, 2013) specifies a maximum number of Monte Carlo sam-

ples in the simulation in advance and forces a decision before or at the end
of all simulations. Open-ended procedures, e.g. Gandy (2009), do not impose
an upper bound on the number of steps. Open-ended procedures
can be turned into truncated procedures by forcing a decision after
a fixed number of simulation steps. Truncated procedures cannot guarantee
a uniform bound on the resampling risk – Section 2.8 demonstrates this.

In practice, a truly open-ended procedure will often not be feasible.


This is obvious in settings where generating samples is very time-consuming
(Tango and Takahashi, 2005; Kulldorff, 2001), but it is also true in other
settings where a large number of samples is being generated, as infinite com-
putational effort is never available.

In the second part of this chapter, we extend the single testing threshold
to multiple ones, since standard software packages such as R (R Development
Core Team, 2008) and SPSS (IBM Corporation, 2013) usually report the signifi-
cance of a p-value with respect to multiple levels, e.g. 0.1%, 1%, 5%. We
generalise the testing thresholds to user-specified intervals by defining a set
of intervals J satisfying ∪_{J∈J} J = [0, 1], which we call p-value buckets.

We refine the resampling risk to be the probability that the true
p-value p is not contained in the reported bucket I ∈ J, based upon the estimated
(Monte Carlo) p-value. Formally, it is defined as RRp(I) = Pp(p ∉ I). We
aim to design algorithms which bound this error uniformly by an arbitrary constant
ε ∈ (0, 1]:

    sup_{p∈[0,1]} RRp(I) ≤ ε.                   (2.2)

We formulate a class of algorithms which satisfy (2.2). These algorithms
employ a sequence of confidence intervals for the p-value with the desired
joint coverage probability, and return one p-value bucket once it contains
the confidence interval. Using this construction, we propose new versions of
CSM and SIMCTEST, which we call mCSM and mSIMCTEST, to accom-
modate the situation of multiple thresholds for bounding the resampling risk
uniformly.

This chapter is structured as follows. We first review statistical hypoth-


esis testing in Section 2.2 and Monte Carlo tests in Section 2.3. We then
derive the confidence sequence method (CSM) in Section 2.4, and briefly re-
view SIMCTEST in Section 2.5. We compare the (implied) stopping bound-
aries of both methods, and investigate the real resampling risk incurred from
their use in Section 2.6. In Section 2.7, we investigate the rate at which both
methods spend the resampling risk, and further construct a new spending se-
quence for SIMCTEST which (empirically) gives uniformly tighter stopping
boundaries than CSM, thus leading to faster decisions on p. In Section 2.8,
we show that neither procedure bounds the resampling risk in the truncated
Monte Carlo setting, but that the truncated version of CSM performs well
compared to more complicated algorithms. We introduce the Monte Carlo
tests under multiple thresholds, and investigate the algorithms which bound
the refined resampling risk uniformly in Section 2.9. In Section 2.10, we con-
duct two simulation studies regarding a real data example of penguin pairs
on two islands and a two-way contingency table. The chapter concludes with
a discussion in Section 2.11.

2.2 Statistical Hypothesis

‘A statistical hypothesis is an assertion or conjecture concerning one or more


populations’ (Walpole and Myers, 1993). The truth or falsity of a hypothesis
cannot be determined unless information on the entire population is
known. In most scenarios, however, only part of the population is observed,
and the decision is made based upon the observed data. Evidence
against the stated hypothesis leads to a rejection. The hypothesis
stated with the hope of being rejected is called the null hypothesis, denoted by
H0. An alternative hypothesis H1 describes ‘what alternatives to H0 it is
most important to detect, or what is thought likely to be true if H0 is not.’
(Davison et al., 1997).

A null hypothesis can be a simple or a composite one. A simple null


hypothesis completely defines the probability distribution of a single sample
from the population with cumulative distribution function F (Davison et al.,
1997). An example would be ‘exponentially distributed with mean 1’. A
composite null hypothesis indicates that the probability distribution contains
an unknown parameter under the null. We refer to this unknown parameter
as a nuisance parameter, since it is not of primary interest, but needs to
be considered when testing the hypothesis. A classical example would be
‘normally distributed with mean 0’ which additionally leaves the unknown
variance as a nuisance parameter.

We make a decision on the hypothesis by measuring the discrepancy


between the observed data and the null hypothesis. This discrepancy is
quantified by the p-value, which will be formulated in Section 2.2.1.

2.2.1 P-value

The p-value measures the probability of the data being at least as extreme as
the observed one under the null hypothesis (Altman, 1990). To calculate the
p-value, we utilise a statistic which usually has a known distribution under
the null hypothesis H0 called test statistic T . The distribution of T under
H0 is called the null distribution of T . We denote the observed value of the
test statistic by t, and follow the convention of rejecting the null hypothesis
when observing a large value of t. The p-value p under the null hypothesis
is then defined as
p = P(T ≥ t|H0 ). (2.3)

When the null hypothesis is a simple one, the distribution of T is known. When
it is composite, three remedies can usually be employed. The first approach
chooses a test statistic T whose distribution is identical for all distributions
F . A well-known example is the Student’s t-statistic used for testing the
mean of a normal distribution with unknown variance. The second approach
conditions on a sufficient statistic S under the null, which eliminates un-
known model parameters (Davison et al., 1997). The conditional p-value is
defined as p = P(T ≥ t|H0 , S = s). The third approach approximates the
distribution F when the nuisance parameter cannot be conditioned away. It
then computes the p-value using the estimated distribution under the null.

The decision of rejecting or not rejecting the null is based upon the p-
value. When a large p-value is obtained from (2.3), we claim that the data
could often occur under the null hypothesis. Alternatively, a tiny p-value
implies that the data is very unlikely to be observed. The test of the null

hypothesis using the p-value is hence a decision on whether it lies above or
below a chosen cut-off. When the p-value is below (resp. above) the cut-off,
we say the test is statistically significant (resp. not significant), and the null
hypothesis is rejected (resp. not rejected).

The cut-off of the significance level depends on the user's choice. The clas-
sical thresholds are 0.1%, 1% and 5%, and are usually labelled as (∗∗∗, ∗∗, ∗)
by a star rating system in software such as R (R Development Core Team,
2008) and SPSS (IBM Corporation, 2013).

2.2.2 Type of Error and Statistical Power

A false decision when testing the null hypothesis may be committed, since
only partial information about the whole population is obtained from the
observed data. Two types of error can be made. A type I error is committed
when we falsely reject H0 in favour of H1 even though the null
hypothesis H0 is in fact true. A type II error is committed when
we fail to reject H0 even though the null hypothesis H0 is false.
We denote the probabilities of committing a type I and a type II error by α0 and β0,
respectively.

The probability of rejecting the null hypothesis when it is true, i.e. com-
mitting the type I error, is called level of significance or significance level of
the test. Correspondingly, the probability of rejecting the null hypothesis
when it is false, i.e. not committing the type II error, is called the power of the
test:

    Level of significance = P(Reject H0 | H0) = α0,
    Power = P(Reject H0 | H1) = 1 − β0.

Sometimes the power function conditions on the parameter of interest, which


refers to a specific alternative hypothesis. The power of a test depends on the
sample size determined by the user (Altman, 1990). In other designs, one
determines the appropriate sample size needed to achieve the desired power (Dupont and
Plummer, 1990; Moher et al., 1994; Hooper et al., 2013). In Section 2.3.2, we
will review some Monte Carlo procedures that control the level of significance
and the power of a hypothesis test.

2.3 Monte Carlo Tests

In hypothesis tests where the null distribution of T is known, the p-value
in (2.3) can be computed analytically. More often, the distribution of the
test statistic T depends upon some unknown nuisance parameter, which may
be eliminated by conditioning on a suitable statistic (Davison et al., 1997). The
calculation of the exact p-value can be challenging or analytically intractable.

In some scenarios, Monte Carlo tests are an alternative way of approx-


imating the p-value. Assume we can generate (Ti )i∈N of independent and
identically distributed (i.i.d.) random variables from the null distribution of
T using Monte Carlo simulation, and denote the realisations by t1 , t2 , . . . , tn .
Then, Xi = ✶(Ti ≥ t) has the desired distribution P(Xi = 1) = p, where ✶

denotes the indicator function. We also define the partial sum Sn = Σ_{i=1}^n Xi,
which will be used repeatedly in later sections.

Monte Carlo tests can be classified as conventional and sequential pro-


cedures. In conventional tests, a fixed number of Monte Carlo samples is
pre-determined (Hope, 1968; Dwass, 1957) while a sequential approach gen-
erates a random number of samples until there is sufficient evidence to make a
decision. The conventional or non-sequential Monte Carlo p-value estimator
p̂mc is defined as (Davison et al., 1997):

    p̂mc = (1 + Σ_{i=1}^n Xi) / (1 + n).          (2.4)
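As a concrete illustration, (2.4) can be computed directly from the indicators Xi. The following Python sketch uses a Bernoulli stream with hypothetical values of p and n to stand in for comparing simulated test statistics Ti against the observed value t.

```python
import random

def mc_p_value(x):
    """Basic Monte Carlo p-value estimate (2.4) from the indicator
    outcomes x_i = 1(T_i >= t) of the simulated test statistics."""
    n = len(x)
    return (1 + sum(x)) / (1 + n)

# Hypothetical setting: true p-value p = 0.03 and n = 999 samples.
random.seed(1)
p, n = 0.03, 999
x = [1 if random.random() < p else 0 for _ in range(n)]
p_hat = mc_p_value(x)
```

With n = 999 the estimate is (1 + Sn)/1000, and the test decision is simply the comparison p̂mc ≤ α; Section 2.3.1 explains why this estimator cannot bound the resampling risk uniformly.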

2.3.1 Resampling Risk

The basic Monte Carlo p-value estimator violates the first law of applied
statistics (Gleser, 1996): ‘Two individuals using the same statistical method
on the same data should arrive at the same conclusion.’ In hypothesis testing,
this refers to whether an estimate of p stays in the same region as the
true p-value given a threshold, and can be quantified by the term resampling
risk (Fay and Follmann, 2002; Fay et al., 2007; Gandy, 2009). If we assume

H0 : p > α and H1 : p ≤ α

where α is a fixed threshold, a Monte Carlo estimate p̂ of p rejects the null


if p̂ ≤ α. The resampling risk is defined as the probability that the estimate

p̂ and the true p lie on different sides of α:

    RRp(p̂) = P(Not reject H0 | H1) = P(p̂ > α)   if p ≤ α,
             P(Reject H0 | H0) = P(p̂ ≤ α)       if p > α.

Controlling the resampling risk is equivalent to controlling both the level of


significance and the power of the test.

We aim for procedures that achieve a small uniform bound ε > 0 on the
resampling risk RRp(p̂):

    sup_{p∈[0,1]} RRp(p̂) ≤ ε.                   (2.5)

The basic (non-sequential) Monte Carlo estimator (Davison et al., 1997) in


(2.4) does not guarantee (2.5). In fact, the lowest uniform bound on the
resampling risk for this estimator is at least 0.5 (Gandy, 2009).

2.3.2 Sequential Monte Carlo Procedures

A sequential Monte Carlo test is a procedure in which a hypothesis test using


a Monte Carlo method is conducted sequentially over time as new samples
become available. Sequential Monte Carlo tests can be classified as open-
ended and truncated procedures (Silva and Assunção, 2013).

In open-ended tests such as Gandy (2009), a decision is made only if a


stopping criterion is reached. The expected number of Monte Carlo samples
can be substantial when p is close to α, and becomes infinite when p = α.
In some real applications (Tango and Takahashi, 2005; Kulldorff, 2001), the
generation of hundreds or thousands of Monte Carlo samples may take days

or weeks to finish depending on the complexity of the algorithm, and hence
an open-ended approach is not recommended.

A truncated approach (Davidson and MacKinnon, 2000; Silva et al., 2018;


Besag and Clifford, 1991; Silva and Assunção, 2013) specifies a maximum
number of Monte Carlo samples to be generated, and forces a decision before
or at the end of the simulations. We summarise these sequential Monte
Carlo procedures in this section and compare their empirical performances
regarding the resampling risk in Section 2.8.

Besag and Clifford (1991) propose a heuristic stopping rule for the trun-
cated sequential Monte Carlo procedure or bootstrap sampling. The algo-
rithm has two tuning parameters: the number of exceedances h and the
maximum number of samples (nmax − 1). We terminate the sampling pro-
cedure once Sn = h for some n < nmax − 1 or finish the whole simulation
process by obtaining (nmax − 1) Monte Carlo samples. Then, the estimated
p-value p̂ is defined as

    p̂ = h / n                     if Sn = h and n < nmax − 1,
         (S_{nmax−1} + 1) / nmax   if S_{nmax−1} < h,

and the estimate p̂ is compared to the threshold α to reach the decision of


the hypothesis test.
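A Python sketch of this stopping rule follows; the tuning parameters h and nmax and the Bernoulli stream standing in for the Monte Carlo samples are hypothetical choices.

```python
import random

def besag_clifford(xs, h, n_max):
    """Stopping rule of Besag and Clifford (1991): stop once h exceedances
    have occurred; otherwise truncate after n_max - 1 samples."""
    s = 0
    for n, x in enumerate(xs, start=1):
        s += x
        if s == h:                  # early stop after the h-th exceedance
            return h / n
        if n == n_max - 1:          # all samples used without h exceedances
            return (s + 1) / n_max

random.seed(2)
p = 0.02                            # hypothetical true p-value
stream = (1 if random.random() < p else 0 for _ in range(10**6))
p_hat = besag_clifford(stream, h=10, n_max=1000)
```

The early-stopping branch keeps the expected sample size small when p is large, while the truncated branch reproduces a basic Monte Carlo estimate when fewer than h exceedances occur.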

Davidson and MacKinnon (2000) introduce a simple sequential procedure


for controlling the resampling risk. Nevertheless, the uniform bound is not
guaranteed since the problem of multiple testing is not considered (Gandy,

2009). The algorithm starts with a relatively small sample size n = nmin and
increases the sample size n until reaching a decision or a maximum number
of samples denoted by nmax. Given Sn = Σ_{i=1}^n Xi, the null hypothesis is

rejected once the probability of obtaining at most Sn successes out of n


Monte Carlo samples is smaller than the level of pretest denoted by β. This
probability can be bounded above using p = α. Similarly, the alternative
is rejected once the probability of obtaining at least Sn successes out of n
is smaller than β, an upper bound of which can be similarly derived using
p = α. If we denote

    Ψ = {n ∈ N : Σ_{i=0}^{Sn} b(n, α, i) < β  or  Σ_{i=Sn}^{n} b(n, α, i) < β},

where N is a pre-specified set of sample sizes which will be suggested in the
next paragraph and b(n, p, x) = (n choose x) p^x (1 − p)^(n−x). The stopping time τ is
hence

    τ = inf Ψ   if Ψ ≠ ∅,
        nmax    if Ψ = ∅.

Davidson and MacKinnon (2000) suggest nmin = 99 and nmax = 12799.


If no decision is reached at one step, they double the sample size and add
one. We then have N = {99, 199, 399, 799, 1599, 3199, 6399, 12799}. We
can equivalently obtain an implied upper (resp. lower) boundary (see Table
2.1 for β = 0.05) which does not reject (resp. rejects) the null hypothesis,
once it is hit by the sample trajectory (n, Sn ). We can therefore decide on
the hypothesis by examining whether the trajectory (n, Sn ) hits the upper
or lower boundary. Otherwise, when n = nmax, we simply adopt the basic

Table 2.1: Stopping boundaries using Davidson and MacKinnon (2000)’s method with re-
spect to the suggested sample sizes when the level of pretest β = 0.05.

Sample size n   99  199  399  799  1599  3199  6399
Upper boundary  12   19   32   56   102   190   362
Lower boundary   1    4   11   26    60   132   280

Monte Carlo estimator p̂mc .
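The sequential check can be sketched by taking the boundaries of Table 2.1 as given; the Bernoulli stream below is a hypothetical stand-in for the Monte Carlo samples.

```python
import random

# Implied stopping boundaries from Table 2.1 (level of pretest beta = 0.05).
CHECKS = {99: (1, 12), 199: (4, 19), 399: (11, 32), 799: (26, 56),
          1599: (60, 102), 3199: (132, 190), 6399: (280, 362)}

def davidson_mackinnon(xs, alpha=0.05, n_max=12799):
    """Check the trajectory (n, S_n) at the sample sizes in N: hitting the
    lower boundary rejects H0, hitting the upper one does not reject; at
    n_max we fall back to the basic estimator (1 + S_n) / (1 + n)."""
    s = 0
    for n, x in enumerate(xs, start=1):
        s += x
        if n in CHECKS:
            lo, up = CHECKS[n]
            if s <= lo:
                return "reject H0", n
            if s >= up:
                return "do not reject H0", n
        if n == n_max:
            p_mc = (1 + s) / (1 + n)
            return ("reject H0" if p_mc <= alpha else "do not reject H0"), n

random.seed(3)
stream = (1 if random.random() < 0.5 else 0 for _ in range(12799))
decision, n_stop = davidson_mackinnon(stream)   # hypothetical p = 0.5
```

All-zero samples (no simulated statistic exceeding t) hit the lower boundary already at n = 99 and reject H0, while all-one samples hit the upper boundary and stop without rejecting.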

Fay et al. (2007) employ a truncated sequential probability ratio test


(SPRT) boundary to build the sequential Monte Carlo test. SPRT is con-
ducted under the transformed hypotheses pα < α < p0 where pα and p0 are
two pre-determined parameters (Wald, 1973). The refined null hypothesis
becomes p = p0 and the alternative is p = pα , from which the decision of the
p-value with respect to the threshold α can be inferred.

The implementation of SPRT demands the choice of two constants A and


B that govern the type I and type II error. It does not reject the hypothesis
p = p0 once
Sn ≥ C1 + nC0 ,

and rejects it once
Sn ≤ C2 + nC0 ,

where

    C0 = log((1 − p0)/(1 − pα)) / log(r),
    C1 = log(B) / log(r),
    C2 = log(A) / log(r),
    r = pα(1 − p0) / (p0(1 − pα)).

The type I error α0 and type II error β0 can be approximated using
A = (1 − β0)/α0 and B = β0/(1 − α0).
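For illustration, the constants can be computed directly; the values pα = 0.04, p0 = 0.06 and α0 = β0 = 0.01 below are hypothetical choices around a threshold α = 0.05.

```python
import math

def sprt_constants(p0, p_alpha, alpha0, beta0):
    """Slope C0 and intercepts C1, C2 of the SPRT boundaries
    S_n >= C1 + n*C0 (accept p = p0) and S_n <= C2 + n*C0 (reject
    p = p0), with Wald's approximations A and B."""
    A = (1 - beta0) / alpha0
    B = beta0 / (1 - alpha0)
    r = p_alpha * (1 - p0) / (p0 * (1 - p_alpha))
    log_r = math.log(r)
    C0 = math.log((1 - p0) / (1 - p_alpha)) / log_r
    C1 = math.log(B) / log_r
    C2 = math.log(A) / log_r
    return C0, C1, C2

C0, C1, C2 = sprt_constants(p0=0.06, p_alpha=0.04, alpha0=0.01, beta0=0.01)
```

Since pα < α < p0, the common slope C0 lies between pα and p0, and the intercepts satisfy C2 < 0 < C1, so the two boundaries form a parallel band around the drift line n C0.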

The truncated SPRT (tSPRT) specifies a maximum number of samples


nmax and combines the SPRT boundary with the curtailment
boundaries Sn ≥ α(nmax + 1) and n − Sn ≥
(1 − α)(nmax + 1). The details of the construction are shown in (Fay et al.,
2007, Section 6.2).

Silva et al. (2018) introduce a truncated sequential Monte Carlo test


which achieves a level of significance and global power arbitrarily close to those of
the exact test. Here the global power π(α) is a function of the threshold α
under the exact test, and is defined as

    π(α) = ∫_0^1 π(α|p) FP(dp),

where FP (p) is the distribution function of p and π(α|p) is the probability of

rejecting H0 under the exact test:

    π(α|p) = 1   if p ≤ α,
             0   if p > α.

The global power πm using a Monte Carlo test is defined as

    πm(αmc) = ∫_0^1 πm(αmc|p) FP(dp),

where αmc ∈ (0, 1) is a desired significance level for the (Monte Carlo) test
and πm (αmc |p) is the probability of rejecting H0 by the test. The formula of
πm (αmc |p) is given in (Silva et al., 2018, Equation (3.1)). The power loss of
the Monte Carlo test is the difference between π(α) and πm (αmc ).

The algorithm proposed by Silva et al. (2018) has four pre-determined


constants m, s, t1 and Ce. The null hypothesis is not rejected if either St1 ≥ s
or (St1 < s and Sm−1 ≥ Ce), and is rejected if St1 < s and Sm−1 < Ce. They
prove that such a Monte Carlo test has a significance level αs and bounds the
power loss by εs if

    P(Not reject H0 | p = α) ≤ εs,              (2.6)
    P(Reject H0 | H0) ≤ αs,                     (2.7)

where the left hand sides of (2.6) and (2.7) are derived in (Silva et al., 2018,
Section 4).
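The resulting decision rule is a simple function of the two checkpoint counts St1 and Sm−1; the numeric values used in the usage lines below are hypothetical.

```python
def silva_decision(s_t1, s_m_minus_1, s, ce):
    """Decision rule of Silva et al. (2018): do not reject H0 if
    S_{t1} >= s (early stop at n = t1) or if S_{m-1} >= Ce at
    truncation; reject H0 only if both counts fall below their
    thresholds."""
    if s_t1 >= s:
        return "do not reject H0"   # sampling can stop already at t1
    return "do not reject H0" if s_m_minus_1 >= ce else "reject H0"

# Hypothetical checkpoint counts with s = 5 and Ce = 50:
early_stop = silva_decision(7, 0, s=5, ce=50)
truncated = silva_decision(2, 10, s=5, ce=50)
```

In practice the first branch allows the simulation to terminate at n = t1, so Sm−1 only needs to be computed when St1 < s.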

2.4 Confidence Sequence Method (CSM)

This section introduces the confidence sequence method (CSM), a simple


technique to compute a decision on the p-value p while bounding the resam-
pling risk uniformly in (2.5).

Recall that ε ∈ (0, 1) is the desired bound on the resampling risk. Using inde-
pendent Bernoulli(p) random variables Xi, i ∈ N, an inequality of (Robbins,
1970, p. 1397) states

    Pp(∃n ∈ N : b(n, p, Sn) ≤ ε/(n + 1)) ≤ ε            (2.8)

for all p ∈ (0, 1), where b(n, p, x) = (n choose x) p^x (1 − p)^(n−x) and Sn = Σ_{i=1}^n Xi. Then,
In = {p ∈ [0, 1] : (n + 1)b(n, p, Sn) > ε} is a sequence of confidence sets that
has the desired joint coverage probability 1 − ε.

Lai (1976) shows that the In are intervals. Indeed, if 0 < Sn < n we obtain
In = (gn(Sn), fn(Sn)), where gn(x) < fn(x) are the two distinct roots of
(n + 1)b(n, p, x) = ε in p. If Sn = 0 then the equation (n + 1)b(n, p, 0) = ε
has only one root rn, leading to In = [0, rn). Likewise for Sn = n, in which
case In = (rn, 1].

CSM will determine a decision on H0 as follows. We simulate Monte
Carlo samples until α ∉ In, leading to the stopping time

    τ = inf{n ∈ N : α ∉ In}.

If In ⊆ [0, α] we reject H0. If In ⊆ (α, 1] we do not reject H0. By construction,

the uniform bound on the resampling risk in (2.5) holds true.

It is not necessary to compute the roots of (n + 1)b(n, p, x) = ε explicitly
to check the stopping criterion. Indeed, by the initial definition of In,

    τ = inf{n ∈ N : (n + 1)b(n, α, Sn) ≤ ε},

which is computationally easier to check.
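In code, the stopping rule requires nothing beyond the Binomial(n, α) pmf, best evaluated on the log scale to avoid overflow for large n. In the Python sketch below, the Bernoulli stream with hypothetical true p-value p = 0.5 stands in for the samples Xi, and n_max is only a safeguard against non-termination when p is close to α.

```python
import math
import random

def log_b(n, p, x):
    """log of the Binomial(n, p) probability mass at x."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log(1 - p))

def csm(xs, alpha=0.05, eps=1e-3, n_max=10**5):
    """CSM: stop once (n + 1) b(n, alpha, S_n) <= eps; reject H0 if the
    lower boundary was hit (S_n at most alpha * n), else do not reject."""
    s = 0
    for n, x in enumerate(xs, start=1):
        s += x
        if math.log(n + 1) + log_b(n, alpha, s) <= math.log(eps):
            return ("reject H0" if s <= alpha * n else "do not reject H0"), n
        if n == n_max:
            return "no decision", n

random.seed(4)
p = 0.5   # hypothetical true p-value, far above alpha = 0.05
decision, steps = csm(1 if random.random() < p else 0 for _ in range(10**5))
```

Since p = 0.5 is far from α = 0.05, the trajectory Sn crosses the implied upper boundary after only a handful of samples and H0 is not rejected; for p close to α, Figure 2.2 shows that the expected number of steps blows up.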

For comparisons with SIMCTEST, it will be useful to write the stopping


time equivalently as

τ = inf{n ∈ N : Sn ≥ un or Sn ≤ ln },

where un = max{k : (n + 1)b(n, α, k) > ε} + 1 and ln = min{k : (n +
1)b(n, α, k) > ε} − 1 for any n ∈ N. We call (ln)n∈N and (un)n∈N the (implied)
stopping boundaries. Figure 2.1 illustrates the implied stopping boundaries
for different testing thresholds α.

We define an estimator p̂c of p as



    p̂c = Sτ/τ   if τ < ∞,
         α      if τ = ∞.                       (2.9)

The following theorem shows that the resampling risk of CSM is uniformly
bounded.

Figure 2.1: Lower (ln) and upper (un) stopping boundaries of CSM for several thresholds α.

Theorem 1. The estimator p̂c satisfies

    sup_{p∈[0,1]} RRp(p̂c) < ε.

Proof. We start by considering the case p ≤ α. Let p ≤ α. The resampling


risk is RRp (p̂c ) = Pp (p̂c > α). We show that p̂c > α only when hitting the
upper boundary and that the probability of hitting the upper boundary is
bounded by .

To see the former: When not hitting any boundary, i.e. on the event
{τ = ∞}, we have p̂c = α. When hitting the lower boundary, i.e. on the
event {τ < ∞, Sτ ≤ lτ }, we have p̂c = Sτ /τ ≤ lτ /τ . It thus suffices to show
ln /n ≤ α for all n ∈ N.

Let n ∈ N. By (2.8), Pα(b(n, α, Sn) > ε/(n + 1)) ≥ 1 − ε. Hence, there exists k
such that (n + 1)b(n, α, k) > ε. Furthermore, b(n, α, x) has a maximum at x =
⌈αn⌉ or at x = ⌊αn⌋. Thus, ∃k ∈ {⌈αn⌉, ⌊αn⌋} such that (n + 1)b(n, α, k) > ε.
Hence, by the definition of ln we have ln ≤ ⌈αn⌉ − 1 < αn.

To finish the proof of this case, we show that the probability of hitting
the upper boundary is bounded by ε, which can be done using (Gandy, 2009,
Lemma 3) and (2.8):

    Pp(τ < ∞, Sτ ≥ uτ) ≤ Pα(τ < ∞, Sτ ≥ uτ) ≤ Pα(τ < ∞)
                       = Pα(∃n ∈ N : (n + 1)b(n, α, Sn) ≤ ε) ≤ ε.

The case p > α can be shown analogously to the case p ≤ α using that
Pp (τ = ∞) = 0, which is shown in (Lai, 1976, p. 268).

Will CSM stop in a finite number of steps? If p = α, then the algorithm
will only stop with probability at most ε. Indeed, Pα(τ < ∞) = Pα(∃n ∈
N : α ∉ In) ≤ ε by construction. However, if p ≠ α, then the algorithm
will stop in a finite number of steps with probability one. Indeed, Lai (1976)
shows that with probability one, limn→∞ fn(Sn) = p = limn→∞ gn(Sn) given
p ≠ α, thus implying the existence of n ∈ N such that α ∉ In.

Figure 2.2 shows the expected number of steps Ep [τ ] as a function of p for


three different values ε ∈ {0.01, 0.001, 0.0001}. The figure is generated based
upon a grid of values for p. For each p, we iteratively compute the distribution
of Sn−1 conditional on not stopping up to n − 1. We can then compute the
probability of stopping for each n, and thus Ep [τ ]. In practice, we also set
a maximum step for n, denoted by nmax = 10000. When p approaches the
threshold α, Ep [τ ] tends to infinity and we provide a lower bound for Ep [τ ]
when we reach nmax by adding nmax times the remaining probability for not

stopping. The testing threshold for Figure 2.2 is α = 0.05.

Figure 2.2: Expected number of steps Ep(τ) required to decide whether p lies above or
below the threshold α = 0.05.

We assume the p-value is a random quantity in a Bayesian sense with
some distribution function F, and that the derivative of F is positive at the thresh-
old α. Then the expected effort E[τ] = ∫_0^1 Ep[τ] dF(p) = ∞. This is a conse-
quence of (Wald, 1945, Equation (4.81)), which also applies to any procedure
satisfying (2.5) for ε < 0.5 (see also Section 3.1 in Gandy (2009)). To have a
feasible algorithm for a random p, a finite upper threshold on the number of
steps of CSM has to be imposed, i.e., a truncated procedure needs to be em-
ployed. An alternative choice is to use a specialised procedure such as Gandy
and Rubin-Delanchy (2013).

2.5 Review of SIMCTEST

This section reviews the SIMCTEST method of Gandy (2009) (Sequential


Implementation of Monte Carlo Tests) which also bounds the resampling risk

uniformly. SIMCTEST sequentially updates two integer sequences (Ln )n∈N
and (Un )n∈N serving as lower and upper stopping boundaries and stops the
sampling process once the trajectory (n, Sn ) hits either boundary. The deci-
sion whether the p-value lies above (below) the threshold depends on whether
the upper (lower) boundary is hit first.

The boundaries (Ln)n∈N and (Un)n∈N are a function of α, computed re-
cursively such that the probability of hitting the upper (lower) boundary,
given p ≤ α (p > α), is less than ε. Starting with U1 = 2, L1 = −1, the
boundaries are recursively defined as

    Un = min{j ∈ N : Pα(τ ≥ n, Sn ≥ j) + Pα(τ < n, Sτ ≥ Uτ) ≤ εn},
    Ln = max{j ∈ Z : Pα(τ ≥ n, Sn ≤ j) + Pα(τ < n, Sτ ≤ Lτ) ≤ εn},

where (εn)n∈N is called a spending sequence. The spending sequence is
non-decreasing and satisfies εn → ε as n → ∞ as well as 0 ≤ εn < ε for all
n ∈ N. Its purpose is to control how the overall resampling risk ε is spent over
all iterations of the algorithm: in any step n of SIMCTEST, new boundaries
are computed with a risk of εn − εn−1.

Gandy (2009) suggests

    εn = ε n / (n + k)                          (2.10)

as a default spending sequence, where k is a constant, and chooses k = 1000.
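The recursion can be sketched as follows: we propagate the distribution of Sn under p = α over trajectories that have not yet stopped, and in each step allow the cumulative hitting probability of each boundary to grow to at most εn. This is a sketch, not a reference implementation, and the horizon n_max is a computational truncation only.

```python
def simctest_boundaries(alpha=0.05, eps=1e-3, k=1000, n_max=500):
    """Sketch of the recursive boundary computation of SIMCTEST with the
    default spending sequence eps_n = eps * n / (n + k)."""
    q = {0: 1.0}                      # q[s] = P_alpha(tau >= n, S_n = s)
    spent_up = spent_lo = 0.0         # risk already spent on each boundary
    bounds = []
    for n in range(1, n_max + 1):
        q = {s: q.get(s - 1, 0.0) * alpha + q.get(s, 0.0) * (1 - alpha)
             for s in range(0, max(q) + 2)}
        eps_n = eps * n / (n + k)
        tail, U = 0.0, n + 1          # U_n: smallest j whose upper tail fits
        for j in range(max(q), -1, -1):
            if tail + q[j] + spent_up > eps_n:
                break
            tail += q[j]
            U = j
        head, L = 0.0, -1             # L_n: largest j whose lower tail fits
        for j in range(0, max(q) + 1):
            if head + q[j] + spent_lo > eps_n:
                break
            head += q[j]
            L = j
        for s in list(q):             # trajectories hitting a boundary stop
            if s >= U:
                spent_up += q.pop(s)
            elif s <= L:
                spent_lo += q.pop(s)
        bounds.append((L, U))
    return bounds

bounds = simctest_boundaries(n_max=200)
```

The first pair returned is (−1, 2), matching the starting values U1 = 2 and L1 = −1 above.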

SIMCTEST stops as soon as the trajectory (n, Sn ) hits the lower or

upper boundary, thus leading to the stopping time σ = inf{k ∈ N : Sk ≥
Uk or Sk ≤ Lk }. In this case, a p-value estimate can readily be computed
as p̂s = Sσ /σ if σ < ∞ and p̂s = α otherwise. Similarly to Figure 2.2, the
expected stopping time of SIMCTEST diverges as p approaches the threshold
α.

SIMCTEST achieves the desired uniform bound on the resampling risk


under certain conditions. To be precise, (Gandy, 2009, Theorem 2) states
that if ε ≤ 1/4 and log(εn − εn−1) = o(n) as n → ∞, then (2.5) holds with
p̂ = p̂s.

2.6 Comparison of CSM to SIMCTEST with the Default Spending Sequence

In this section, we compare the asymptotic behaviour of the width of the


stopping boundaries for CSM and SIMCTEST. SIMCTEST is employed with
the default spending sequence given in Section 2.5. Unless otherwise stated,
we always consider the threshold α = 0.05 and aim to control the resampling
risk at ε = 10^−3.

2.6.1 Stopping Boundaries

We first compare the stopping boundaries of CSM and SIMCTEST. Figure


2.3 gives an overview of the upper and lower stopping boundaries of CSM
and SIMCTEST up to 5000 steps, respectively. Figure 2.4 shows the ratio
of the widths of the stopping boundaries for both methods, that is (un −
ln )/(Un − Ln ), up to 107 steps, where un , ln (resp. Un , Ln ) are the upper and

lower stopping boundaries of CSM (resp. SIMCTEST).

Figure 2.3: Stopping boundaries for CSM and SIMCTEST (with default spending sequence).

Figure 2.4: Ratio of widths of stopping boundaries (un − ln)/(Un − Ln) for CSM (un upper,
ln lower) and SIMCTEST with default spending sequence (Un, Ln). Log scale on the x-axis.

According to Figure 2.4, the boundaries of CSM are initially tighter than
the ones of SIMCTEST, but become wider as the number of steps increases.
This eventually reverses again for large numbers of steps, as depicted in
Figure 2.4.

2.6.2 Real Resampling Risk

Both SIMCTEST and CSM are guaranteed to bound the resampling risk
by some constant ε chosen in advance by the user. We will demonstrate in
this section that the actual resampling risk (that is, the actual probability of
hitting a boundary leading to a wrong decision in any run of an algorithm)
for SIMCTEST is close to ε, whereas CSM does not make full use of the
allocated resampling risk. This in turn indicates that it might be possible to
construct boundaries for SIMCTEST which are uniformly tighter than the
ones of CSM; we will pursue this in Section 2.7.

We compute the actual resampling risk recursively for both methods by


calculating the probability of hitting the upper or the lower stopping bound-
ary in any step for the case p = α; other values of p give smaller resampling
risks. This can be done as follows: Suppose we know the distribution of Sn−1
conditional on not stopping up to step n − 1. This allows us to compute the
probability of stopping at step n as well as to work out the distribution of
Sn conditional on not stopping up to step n.
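This recursion can be sketched in a few lines (an illustrative implementation, not the code used for the thesis; the boundary sequences `upper` and `lower` are placeholder inputs indexed from step 1):

```python
def hitting_probs(upper, lower, p, n_max):
    """Recursively track the distribution of S_n conditional on not having
    stopped, and accumulate the probability of hitting each boundary."""
    dist = {0: 1.0}                 # P(S_0 = s, not stopped by step 0)
    hit_upper = hit_lower = 0.0
    for n in range(1, n_max + 1):
        step = {}
        for s, prob in dist.items():
            step[s + 1] = step.get(s + 1, 0.0) + prob * p       # X_n = 1
            step[s] = step.get(s, 0.0) + prob * (1 - p)         # X_n = 0
        dist = {}
        for s, prob in step.items():
            if s >= upper[n]:
                hit_upper += prob   # stopped: upper boundary hit at step n
            elif s <= lower[n]:
                hit_lower += prob   # stopped: lower boundary hit at step n
            else:
                dist[s] = prob      # continue sampling
    return hit_upper, hit_lower
```

With p = α, the two returned totals are the cumulative probabilities of a wrong decision in each direction, i.e. exactly the quantities plotted in Figure 2.5.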

Figure 2.5 plots the cumulative probability of hitting the upper and lower
boundaries over 5 · 10^4 steps for both methods. As before, we control the
resampling risk at our default choice of ε = 10^−3.

SIMCTEST seems to spend the full resampling risk as the number of


samples goes to infinity. Indeed, the total probabilities of hitting the upper
and lower boundaries in SIMCTEST are both 9.804 · 10^−4 within the first
5 · 10^4 steps. This matches the allowed resampling risk up to that point of
ε50000 = ε · (5 · 10^4)/(5 · 10^4 + 1000) ≈ 9.804 · 10^−4 allocated by the spending
sequence, which is close to the full resampling risk ε = 10^−3.

CSM tends to be more conservative as it does not spend the full resampling
risk. Indeed, the total probabilities of hitting the upper and lower
boundaries in CSM up to step 5 · 10^4 are 4.726 · 10^−4 and 4.472 · 10^−5, respectively.
In particular, the probability of hitting the lower boundary in CSM
is far lower than ε.

This imbalance is more pronounced for even smaller thresholds. We re-


peated the computation depicted in Figure 2.5 for α ∈ {0.02, 0.01, 0.005}
(figure not included), confirming that the total probabilities of hitting the
upper and lower boundaries in CSM both decrease monotonically as α de-
creases.

2.7 Spending Sequences which Dominate CSM

2.7.1 Example of a Bespoke Spending Sequence

One advantage of SIMCTEST lies in the fact that it allows control over
the resampling risk spent in each step through suitable adjustment of its
spending sequence εn, n ∈ N. This can be useful in practical situations in
which the overall computational effort is limited.

Figure 2.5: Cumulative resampling risk spent over all iterations of CSM and SIMCTEST.
Log scale on the x-axis.

In such cases, SIMCTEST


can be tuned to spend the full resampling risk over the maximum number of
samples. In contrast, CSM has no tuning parameters and hence does
not offer a way to influence how the available resampling risk is spent.

Suppose we are given a lower bound L and an upper bound U for the
minimal and maximal number of samples to be spent, respectively. We construct
a new spending sequence in SIMCTEST which guarantees that no
resampling risk is spent during the first L samples nor after U
samples have been generated. We call this the truncated spending sequence:


εn = 0                if n ≤ L,
εn = ε n/(n + k)      if L < n < U,
εn = ε                if n ≥ U.
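As a quick sketch, the truncated spending sequence can be computed as follows (assuming, per the construction above, that the full resampling risk ε is allocated from step U onwards; the parameter values are the defaults used in this section):

```python
def truncated_spending(n, eps=1e-3, L=100, U=10_000, k=1000):
    """Truncated spending sequence: no risk before step L, the default
    spending-sequence shape eps * n / (n + k) between L and U, and the
    full risk eps from step U onwards."""
    if n <= L:
        return 0.0
    if n < U:
        return eps * n / (n + k)
    return eps
```

The sequence is non-decreasing, stays at zero up to L, and jumps to the full budget ε at U, forcing a decision within U samples.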

Figure 2.6 shows the upper and lower stopping boundaries of CSM and
SIMCTEST with the truncated spending sequence (using L = 100, U = 10000,
and k = 1000).

Figure 2.6: Stopping boundaries of CSM and SIMCTEST with the truncated spending
sequence. Log scale on the x-axis.

As expected, for the first 100 steps the stopping boundaries of SIMCTEST
are much wider than the ones of CSM since no resampling risk
is spent.

As soon as SIMCTEST starts spending resampling risk on the compu-


tation of its stopping boundaries, the upper boundary drops considerably.
By construction, the truncated spending sequence is chosen in such a way
as to make SIMCTEST spend all resampling risk within 10^4 steps. Indeed,
we observe in Figure 2.6 that as expected, the stopping boundaries of SIM-
CTEST are uniformly narrower than those of CSM over the interval (L, U),
thus resulting in a uniformly shorter stopping time for SIMCTEST.

We also observe, however, that this improvement in the width of the


stopping boundaries seems to be rather marginal, making the tuning-free
CSM method a simple and appealing competitor.

Figure 2.7: Trajectories of n^l · (ε^CSM_n − ε^CSM_{n−1}) in the upper (left) and lower
(right) boundary for l ∈ {1.4, 1.5, 1.6}. Log scale on the x-axis.

2.7.2 Uniformly Dominating Spending Sequence

Section 2.7.1 showed that it is possible to choose the spending sequence for
SIMCTEST in such a way as to obtain stopping boundaries which are strictly
contained within the ones of CSM for a pre-specified range of steps.

Motivated by Figure 2.5 indicating that CSM does not spend the full
resampling risk, we aim to construct a spending sequence with the property
that the resulting boundaries in SIMCTEST are strictly contained within
the ones of CSM for every number of steps. This implies that the stopping
time of SIMCTEST is never longer than the one of CSM. Our construction
is dependent on the specific choice α = 0.05.

We first determine the rate at which the real resampling risk is spent in
each step in CSM. By matching this rate using a suitably chosen spending
sequence, we will obtain upper and lower stopping boundaries for SIMCTEST
which are uniformly narrower than the ones of CSM (verified for the first 5 · 10^4
steps).

We start by estimating the rate at which the real resampling risk is
spent in CSM. We are interested in empirically finding an l ∈ R such that
n^l · (ε^CSM_n − ε^CSM_{n−1}) is constant, where ε^CSM_n is the (cumulative) real resampling
risk (the total probability of hitting either boundary) for the first n steps in
CSM.

Figure 2.7 depicts n^l · (ε^CSM_n − ε^CSM_{n−1}) for both the upper (left plot) and
lower (right plot) boundary of CSM as a function of the number of steps n
and for three values l ∈ {1.4, 1.5, 1.6}. Based on Figure 2.7, we estimate that
CSM spends the resampling risk at roughly O(n^−1.5) for both the upper and
lower boundaries.

We proceed with an analytical calculation of the rate at which SIMCTEST
spends the resampling risk (as opposed to also estimating it). The
default spending sequence in SIMCTEST is ε^S_n = εn/(n + k), n ∈ N. Hence
the resampling risk spent in step n ∈ N is ∆ε^S_n = ε^S_n − ε^S_{n−1} =
εk/((n + k)(n + k − 1)) ∼ n^−2. We conducted simulations (similar to the ones in Figure 2.7
for CSM) which indeed confirm the analytical O(n^−2) rate for SIMCTEST
(simulations not included in this thesis). The fact that the theoretical and
empirical rates agree is supported by Figure 2.5, which indicates that SIMCTEST
seems to spend the full resampling risk. Overall, the spending rate
of O(n^−1.5) for CSM is thus slower than the O(n^−2) rate of SIMCTEST with
the default spending sequence.

In order to match the O(n^−1.5) rate for CSM, we generalise the default
spending sequence of SIMCTEST to ε^S_n = εn^γ/(n^γ + k) for n ∈ N and a fixed
γ > 0. Similarly to the aforementioned derivation, SIMCTEST in connection
with ε^S_n will spend the real resampling risk at a rate of O(n^−(γ+1)). We choose
the parameters γ and k to obtain dominating stopping boundaries over CSM.

Figure 2.8: Differences between the upper and lower stopping boundaries of CSM and
SIMCTEST. Log scale on the x-axis.


First, we set γ = 0.5 to match the rate of CSM. Second, we empirically
determine k to keep the stopping boundaries of SIMCTEST within the ones
of CSM (for the range of steps n ∈ {1, . . . , 5 · 10^4} considered in Figure 2.8).
We find that the choice k = 3 satisfies this condition.
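A small numerical check of this spending rate (an illustrative sketch; the values ε = 10^−3, k = 3 and γ = 0.5 match the choices above):

```python
def eps_general(n, eps=1e-3, k=3.0, gamma=0.5):
    """Generalised spending sequence eps * n^gamma / (n^gamma + k)."""
    return eps * n**gamma / (n**gamma + k)

def risk_in_step(n, **kw):
    """Resampling risk spent in step n; decays like n^-(gamma + 1)."""
    return eps_general(n, **kw) - eps_general(n - 1, **kw)
```

Multiplying the per-step risk by n^(γ+1) should approach the constant εkγ for large n, confirming the O(n^−(γ+1)) rate.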

Figure 2.8 depicts the differences between the upper (lower) boundaries
of CSM (upper un , lower ln ) and SIMCTEST (upper Un , lower Ln ) with
the aforementioned spending sequence. We observe that ln ≤ Ln as well as
un ≥ Un for n ∈ {1, . . . , 5 · 10^4}, thus demonstrating that SIMCTEST can
be tuned empirically to spend the resampling risk at the same rate as CSM
while providing strictly tighter upper and lower stopping boundaries (over
a finite number of steps). We observe that the gap between the boundaries
seems to increase with the number of steps, leading to the conjecture that
SIMCTEST has tighter boundaries for all n ∈ N.

2.8 Comparison of Truncated Sequential Monte Carlo Procedures

In this section, we compute the resampling risk for several truncated proce-
dures as a function of p and thus demonstrate that they do not bound the
resampling risk uniformly.

We consider truncated versions of CSM and SIMCTEST, which we de-


note by tCSM and tSIMCTEST, as well as the algorithms of Besag and
Clifford (1991); Davidson and MacKinnon (2000); Fay et al. (2007); Silva
et al. (2018) described in Section 2.3.2. The maximum number of samples is
set to 13000 to roughly fit the case in Davidson and MacKinnon (2000). The
methods of Besag and Clifford (1991); Davidson and MacKinnon (2000); Fay
et al. (2007); Silva et al. (2018) have tuning parameters, which we choose
as follows. As suggested in Besag and Clifford (1991), the recommended pa-
rameter h controlling the number of exceedances is set to 20. For Davidson
and MacKinnon (2000), the level of pretest β concerning the resampling risk
is set to 0.05. The minimum and maximum number of simulations are 99
and 12799 as suggested by the authors. For Fay et al. (2007), we choose
one set of tuning parameters (pα , p0 , α0 , β0 ) = (0.04, 0.0614, 0.05, 0.05) rec-
ommended by the authors. For the algorithm of (Silva et al., 2018, Section
4), a grid search on the parameters (m, s, t1 , Ce ) with the aim to have the sig-
nificance level at 0.05 and bound the global power loss at 21% produces the
values (13000, 2, 16, 702). The resampling risk parameter ε in SIMCTEST
and CSM is set to 0.05. A summary of the parameters is shown in Table 2.2.

Table 2.2: Parameters of the truncated Monte Carlo testing procedures.

Algorithm | Parameter value(s) | (nmin, nmax, ∆n)
Besag and Clifford (1991) | h = 20 | (1, 13000, 1)
Davidson and MacKinnon (2000) | β = 0.05 | (99, 12799, n + 1)
Fay et al. (2007) | α0 = β0 = 0.05, p0 = 0.0614, pα = 0.04 | (1, 13000, 1)
Silva et al. (2018) | m = 13000, s = 2, t1 = 16, Ce = 702 | (1, 13000, 1)
tSIMCTEST | ε = 0.05 | (1, 13000, 1)
tCSM | ε = 0.05 | (1, 13000, 1)

nmin and nmax denote the initial and the maximum sample size; ∆n denotes the increment
of the sample size in each step, with n the current sample size.

Figure 2.9 shows the resampling risk as a function of the p-value. As
expected, the truncated tests result in a resampling risk of at least 50%


when p = α. For other p-values than p = α, the resampling risk can be
smaller depending on the type of algorithm used and the number of samples
it draws. However, tCSM is still amongst the best performers as it yields
a low resampling risk almost everywhere with a localised spike at α = 0.05. It
also guarantees to bound the risk uniformly when the number of generated
samples tends to infinity, while other truncated procedures Besag and Clifford
(1991); Davidson and MacKinnon (2000); Fay et al. (2007) do not have this
property.

2.9 Extension to Multiple Thresholds

In the previous sections of this chapter, we considered Monte Carlo testing
procedures which return a decision on the p-value with respect to one single
threshold. In practice, we may be interested in where the p-value lies with
respect to multiple levels, e.g. the classical thresholds {0.001, 0.01, 0.05}. In
this section, we extend the single threshold to multiple ones, and develop

algorithms which bound the refined resampling risk uniformly.

Figure 2.9: Comparison of resampling risks between the truncated Monte Carlo testing
procedures when the threshold α = 0.05.

2.9.1 P-value Buckets and Resampling Risk

We transform the testing thresholds into intervals, as the algorithms we will


formulate output an interval which contains the true p-value with an
arbitrarily large probability. The intervals yield the same decision on the p-value
as with respect to the thresholds. For instance, given the testing thresholds
{0.001, 0.01, 0.05}, we can generate the intervals

{[0, 10−3 ], (10−3 , 0.01], (0.01, 0.05], (0.05, 1]}.

Observing a p-value lying in (10−3 , 0.01] equivalently implies p ≤ 0.01.


Let J be a set of sub-intervals of [0, 1] whose union is [0, 1], i.e. ∪J∈J J =
[0, 1]. We call any such J a set of p-value buckets. For example,

J 0 := {[0, 10−3 ], (10−3 , 0.01], (0.01, 0.05], (0.05, 1]} (2.11)

is a set of p-value buckets containing intervals with mutually empty
intersections. We call them non-overlapping p-value buckets.

We can alternatively establish a set of intervals such that any p ∈ (0, 1)


lies in the interior of at least one interval belonging to the set, which we call
overlapping p-value buckets. For instance, consider the set of p-value buckets
given by

J ∗ := J 0 ∪ {(5 · 10−4 , 2 · 10−3 ], (8 · 10−3 , 0.012], (0.045, 0.055]}

which has the property that any p ∈ (0, 1) is contained in the interior of a
J ∈ J ∗ (for p = 0, we require that there exists J ∈ J and ε > 0 such that
[0, ε) ⊆ J, and similarly for p = 1).

The choice of the p-value buckets is, of course, arbitrary. For the clas-
sical thresholds {0.001, 0.01, 0.05}, we usually build J 0 as the set of
non-overlapping p-value buckets and require the set of overlapping p-value buckets
to be a superset of J 0. We may choose the overlapping buckets for different
purposes. For example, we may aim to achieve a relatively small upper bound
for the stopping time given a limited computational budget, or aim to provide
stable results which return the same p-value bucket consistently across
different simulations; we will explore these two properties in Section 2.9.4
and Section 2.10.2, respectively.

Given a set of p-value buckets J , we aim for algorithms which return


a bucket I ∈ J containing p. We refine the resampling risk, which is now the
probability that p is not contained in the interval I ∈ J, i.e. RRp(I) =
Pp(p ∉ I). Again, we wish to bound this risk by an arbitrary constant
ε ∈ (0, 1] uniformly:

sup_{p∈[0,1]} RRp(I) ≤ ε.   (2.12)

Gandy et al. (2017) find that overlapping p-value buckets are a necessary
and sufficient condition for the existence of an algorithm satisfying (2.12) with
a finite stopping time. Here, the stopping time is measured in terms of
the number of simulated Monte Carlo samples, and is denoted by τ . The
following statements are proved to be equivalent in (Gandy et al., 2017,
Theorem 1):

1. There exists an algorithm satisfying (2.12) with Ep (τ ) < ∞ for all


p ∈ [0, 1].

2. The p-value buckets in J are overlapping.

3. There exists an algorithm satisfying (2.12) with τ < C for some deterministic
C > 0.

2.9.2 General Construction of the Algorithms

The algorithms we consider will return a confidence interval I upon stopping


which satisfies (2.12). For every realisation I from the algorithm, there exists
J ∈ J such that I ⊂ J.

We construct the interval I using confidence sequences. Suppose that


for each n ∈ N, we can compute a confidence interval In for p given the
Monte Carlo samples X1, . . . , Xn such that the joint coverage probability of
the sequence In is at least 1 − ε, for some ε > 0. In other words, we require

Pp(p ∈ In for all n ∈ N) ≥ 1 − ε   (2.13)

for all p ∈ [0, 1]. We provide two approaches to computing such confidence
sequences in Section 2.9.3.

In order to compute a decision for p with respect to the p-value buckets


J , we define the general stopping time

τJ = inf{n ∈ N : there exists J ∈ J such that In ⊆ J}, (2.14)

which denotes the minimal number of samples n needed until a confidence


interval for p is a subset of a p-value bucket J ∈ J .
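Checking the stopping condition in (2.14) amounts to testing whether the current confidence interval fits inside some bucket. A minimal sketch, representing buckets as (lower, upper) pairs and ignoring the open/closed distinction of the endpoints:

```python
def fits_some_bucket(interval, buckets):
    """Return True if the confidence interval is contained in some bucket,
    i.e. whether the stopping time tau_J of (2.14) has been reached."""
    lo, hi = interval
    return any(b_lo <= lo and hi <= b_hi for b_lo, b_hi in buckets)

# non-overlapping buckets J^0 for the classical thresholds
J0 = [(0.0, 0.001), (0.001, 0.01), (0.01, 0.05), (0.05, 1.0)]
```

Adding an overlapping bucket such as (0.03, 0.07] lets intervals straddling 0.05 trigger stopping, which is exactly why overlapping buckets yield a bounded stopping time.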

For J 0 , the time τJ 0 is the number of samples until In is between two


consecutive thresholds in {0, 0.001, 0.01, 0.05, 1}, thus leading to a complete
decision of H0 with respect to all thresholds. Likewise, the time τJ ∗ can

be interpreted as the number of samples needed until a decision of p with
respect to all but one of the thresholds 0.001, 0.01, or 0.05 is computed.

We define the interval I as follows. If τJ < ∞, let I = IτJ. If τJ = ∞, let I be
an arbitrary set in J satisfying limn→∞ Sn/n ∈ I, assuming that the limit
exists. The random interval I constructed in this way satisfies the uniform
bound on the resampling risk in (2.12) by (2.13) and by the strong law of
large numbers.

If τJ is bounded, meaning if there exists N ∈ N such that τJ < N , we


can relax (2.13) to

Pp(p ∈ In for all n < N) ≥ 1 − ε.   (2.15)

2.9.3 Multi-threshold CSM and SIMCTEST

We use CSM and SIMCTEST to build the confidence sequences satisfying


(2.13). Since we now target multiple thresholds rather than a single one, we
rename CSM and SIMCTEST as mCSM and mSIMCTEST.

Following the general construction in Section 2.9.2, we first build the


confidence sequences in mCSM in the same way as in CSM. Recall
from Section 2.4 the inequality

Pp(∃n ∈ N : b(n, p, Sn) ≤ ε/(n + 1)) ≤ ε,   (2.16)

which holds true for all p ∈ (0, 1) and ε ∈ (0, 1), where Sn = Σ^n_{i=1} Xi and
b(n, p, x) = C(n, x) p^x (1 − p)^{n−x}, with C(n, x) the binomial coefficient.
Therefore, In = {p ∈ [0, 1] : (n + 1)b(n, p, Sn) > ε}
is a sequence of confidence sets for p with the desired coverage probability
of 1 − ε.

Lai (1976) further shows that solving the left hand side of (2.16) yields

Pp(gn(Sn) < p < fn(Sn) for all n ∈ N) ≥ 1 − ε,   (2.17)

where gn(x) < fn(x) are the two distinct roots (Lai, 1976) of (n + 1)b(n, p, x) =
ε. Indeed, if 0 < Sn < n, a sequence of confidence intervals for p is given by
In := (gn(Sn), fn(Sn)). In the case Sn = 0, the equation (n + 1)b(n, p, x) = ε
has only one root rn, leading to In = [0, rn). Likewise for the case Sn = n,
which leads to the confidence interval In = (rn, 1].
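The roots gn(Sn) and fn(Sn) have no closed form, but since (n + 1)b(n, p, x) is unimodal in p with its maximum at p = x/n, each root can be located by bisection on either side of x/n. A minimal numerical sketch (not the thesis implementation):

```python
import math

def h(p, n, x, eps=1e-3):
    """(n + 1) * b(n, p, x) - eps, with b the binomial pmf (log-scale for stability)."""
    if p <= 0.0 or p >= 1.0:
        return -eps
    log_b = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
             + x * math.log(p) + (n - x) * math.log(1.0 - p))
    return (n + 1) * math.exp(log_b) - eps

def confidence_interval(n, x, eps=1e-3):
    """Confidence interval (g_n(x), f_n(x)) for 0 < x < n, found by bisection."""
    p_hat = x / n

    def bisect(lo, hi):
        f_lo = h(lo, n, x, eps)
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            f_mid = h(mid, n, x, eps)
            if (f_mid > 0.0) == (f_lo > 0.0):
                lo, f_lo = mid, f_mid   # keep the sign change in [mid, hi]
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return bisect(1e-12, p_hat), bisect(p_hat, 1.0 - 1e-12)
```

For instance, `confidence_interval(1000, 100)` returns an interval containing the point estimate 0.1, which would already be a subset of the bucket (0.05, 1].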

(Gandy et al., 2017, Lemma 1) proves that the stopping time of mCSM is
bounded above by a deterministic constant if overlapping p-value buckets are
employed.

We construct the confidence sequences in mSIMCTEST which satisfy


(2.13). First, we define the set of boundaries AJ of the intervals in J that
are in the interior of [0,1],

AJ = {sup J, inf J : J ∈ J } \ {0, 1}.

mSIMCTEST then creates the stopping boundaries from the default spend-
ing sequence in SIMCTEST (Gandy, 2009) for each α ∈ AJ , denoted by Ln,α
and Un,α, using the same resampling risk parameter ρ. We define the
corresponding stopping time for each α as σα = inf{k ∈ N : Sk ≥ Uk,α or Sk ≤
Lk,α} (based on the same sequence of Xj, j ∈ N). We also define


In,α = [0, 1]   if n < σα,
In,α = [0, α]   if n ≥ σα and Sσα ≤ Lσα,α,
In,α = (α, 1]   if n ≥ σα and Sσα ≥ Uσα,α,

and let In = ∩α∈AJ In,α.
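Once each threshold α has (or has not) delivered a decision, the intersection In = ∩α In,α reduces to simple bookkeeping. A minimal sketch with a hypothetical `decisions` map (None while σα has not been reached, otherwise which boundary was hit):

```python
def joint_interval(decisions):
    """Intersect the per-threshold intervals I_{n,alpha}.
    decisions[alpha] is None (no decision yet for alpha), 'lower'
    (S hit the lower boundary, so p in [0, alpha]) or 'upper'
    (S hit the upper boundary, so p in (alpha, 1])."""
    lo, hi = 0.0, 1.0
    for alpha, hit in decisions.items():
        if hit == "lower":
            hi = min(hi, alpha)
        elif hit == "upper":
            lo = max(lo, alpha)
    return lo, hi   # read as [0, hi] when lo == 0, otherwise (lo, hi]
```

For the monotone boundaries of Theorem 2, the per-threshold decisions are consistent (no "upper" at a larger α together with "lower" at a smaller α), so the intersection is never empty.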

The following theorem shows that In indeed has the desired joint coverage
probability given in (2.13) or (2.15) when setting ρ = ε/2. Additionally,
employing overlapping buckets in mSIMCTEST leads to a bounded stopping
time.

Theorem 2. Let N ∈ N ∪ {∞}. Construct Un,α and Ln,α for each α ∈ AJ


with the same resampling risk parameter ρ. Suppose that Un,α ≤ Un,α0 and
Ln,α ≤ Ln,α0 for all α, α0 ∈ AJ , α < α0 , and n < N .

1. For all p ∈ [0, 1], Pp (p ∈ In for all n < N ) ≥ 1 − 2ρ.

2. Suppose N = ∞, ρ ≤ 1/4 and log(εn − εn−1) = o(n) as n → ∞. If J


is a finite set of overlapping p-value buckets, then there exists c < ∞
such that τJ ≤ c.

Allowing N < ∞ in Theorem 2 is useful for the constructed stopping


boundaries to yield a finite stopping time from (2.15).

The condition on the spending sequence in part 2 of Theorem 2 is iden-


tical to the condition imposed in Theorem 1 of Gandy (2009). It is satisfied
by the default spending sequence (2.10) in SIMCTEST, and hence in mSIM-
CTEST.

We only prove the first part of Theorem 2 regarding the joint coverage
probability of the confidence sequence. The second part can be found in
(Gandy et al., 2017, Theorem 2).

Proof of Theorem 2 (first part). For a given threshold α ∈ AJ, let

Ē^N_α = {Sτα ≥ Uτα,α, τα < N}

be the event that the upper boundary is hit first before time N and likewise
let

E^N_α = {Sτα ≤ Lτα,α, τα < N}

be the event that the lower boundary is hit first. Then, for all α, α′ ∈ AJ
with α < α′ the following holds:

Ē^N_α ⊇ Ē^N_α′ and E^N_α ⊆ E^N_α′.   (2.18)

Indeed, to see Ē^N_α ⊇ Ē^N_α′, we can argue as follows. On the event Ē^N_α′,
as Un,α ≤ Un,α′ for all n ∈ N, the trajectory (n, Sn) must hit the upper
boundary Un,α of α no later than τα′, hence τα ≤ τα′ < N. It remains to
prove that the trajectory does not first hit the lower boundary Ln,α of α.
Indeed, if the trajectory does hit the lower boundary of α before hitting its
upper boundary, it also hits the lower boundary of α′ (as Ln,α ≤ Ln,α′ for all
n < N) before time τα′, thus contradicting being on the event Ē^N_α′. Hence,
we have Ē^N_α ⊇ Ē^N_α′. The proof of E^N_α ⊆ E^N_α′ is similar.
Using this notation, for all p ∈ [0, 1],

Pp(∃n < N : p ∉ In) ≤ Pp(∃n < N, α ∈ AJ : p ∉ In,α)
                    = Pp( (∪α∈AJ:α<p E^N_α) ∪ (∪α∈AJ:α≥p Ē^N_α) )
                    ≤ Pp( ∪α∈AJ:α<p E^N_α ) + Pp( ∪α∈AJ:α≥p Ē^N_α ).   (2.19)

If p < min AJ, then the first term is equal to 0. Otherwise, let α0 = max{α ∈
AJ : α < p}. Then, by (2.18),

Pp( ∪α∈AJ:α<p E^N_α ) = Pp(E^N_α0) ≤ ρ.

The second term on the right hand side of (2.19) can be dealt with similarly.

The condition on the monotonicity of the boundaries (Un,α ≤ Un,α′ and
Ln,α ≤ Ln,α′ for all n ∈ N and α, α′ ∈ AJ with α < α′) in Theorem 2 can
be checked for a fixed spending sequence εn in two ways: for finite N, the
two inequalities can be checked manually after constructing the boundaries.
For N = ∞, the following lemma shows that under certain conditions, the
monotonicity of the boundaries holds true for all n ≥ n0, where n0 ∈ N can
be computed as a solution to inequality (2.21) in Lemma 3. For n < n0, the
inequalities have to be checked manually.

Lemma 3. Suppose N = ∞, ρ ≤ 1/4 and log(εn − εn−1) = o(n) as n → ∞.
Let α, α′ ∈ AJ with α < α′. Then there exists n0 ∈ N such that for all
n ≥ n0,

Ln,α ≤ Ln,α′,  Un,α ≤ Un,α′.

Proof of Lemma 3. By arguments in (Gandy, 2009, Proof of Theorem 1), we
have

(Un,α − nα)/n ≤ (∆n + 1)/n → 0,   (Ln,α′ − nα′)/n ≥ −(∆n + 1)/n → 0   (2.20)

as n → ∞, where ∆n = √(−n log(εn − εn−1)/2). Since ∆n = o(n), there exists
n0 ∈ N such that

2(∆n + 1)/n ≤ α′ − α for all n ≥ n0.   (2.21)

Splitting 2/n = 1/n + 1/n and multiplying by n yields nα + ∆n + 1 ≤ nα′ − ∆n − 1,
from which Un,α ≤ Ln,α′ follows by (2.20).

By definition, we have Ln,α ≤ Un,α and Ln,α′ ≤ Un,α′ for all n ∈ N, thus
implying Ln,α ≤ Ln,α′ and Un,α ≤ Un,α′ for all n ≥ n0, as desired.

2.9.4 Non-stopping Regions of the P-value Buckets

We investigate non-stopping regions for non-overlapping and overlapping p-


value buckets using mSIMCTEST and mCSM. According to the stopping
time in (2.14), the non-stopping region refers to the set of (n, Sn ), n ∈ N
whose confidence interval In returned by the algorithm is not contained in

any p-value bucket J ∈ J , i.e.

{(n, Sn) : n ∈ N, In ⊄ J for all J ∈ J }.

Suppose we are only interested in whether the p-value is above or below the
single threshold α = 0.05; this corresponds to the non-overlapping p-value
buckets

J e := {[0, 0.05], (0.05, 1]}

and the non-stopping region (grey) is shown in Figure 2.10 (left) using
mSIMCTEST. By construction, those regions bound the resampling risk at ε,
where in this and all following simulations in this section we always use
ε = 10^−3. The sampling process terminates once the trajectory of (n, Sn)
moves beyond the region, and we report p > α (resp. p ≤ α) upon arriving
at the upper (resp. lower) boundary of the region first. Adding another
bucket {(0.03, 0.07]} to J e results in a finite non-stopping boundary (see
Figure 2.10, right), which ensures a stopping time no later than approximately
11250 simulations of the Monte Carlo samples. The sample trajectory
(n, Sn) can leave the non-stopping region in three ways: from the former
upper boundary, indicating p ∈ (0.05, 1]; from the former lower boundary,
indicating p ∈ [0, 0.05]; or from the middle, indicating
p ∈ (0.03, 0.07]. We omit the plot using mCSM, which can be obtained
similarly using its implied upper and lower boundaries.

We further investigate the non-stopping regions in Figure 2.11 for the


non-overlapping buckets in J 0 (upper left) and two sets of overlapping
buckets J ∗ (upper right) and J n (lower left) with different overlapping areas,
where J n is defined as:

J n := J 0 ∪ {(10^−4, 3 · 10^−3], (6 · 10^−3, 0.015], (0.04, 0.06]}.   (2.22)

Figure 2.10: Non-stopping region (grey) to decide p with respect to J e (left), which
corresponds to a 5% threshold, and with respect to the overlapping buckets
J e ∪ {(0.03, 0.07]} (right).

Figure 2.11: Non-stopping region (grey) to decide p, which corresponds to a 5% threshold,
with respect to the p-value buckets: J 0 (upper left), J ∗ (upper right) and J n (lower left).

As expected, the non-stopping region is infinite for the non-overlapping


buckets J 0 and finite for the overlapping buckets J ∗ and J n . In particular,
the stopping time using J n is bounded by a much smaller constant than that
using J ∗ .

2.10 Application

2.10.1 Comparison of Penguin Pairs on Two Islands

We apply CSM and SIMCTEST to a real data example which compares


the number of breeding yellow-eyed penguin pairs on two types of islands:
Stewart Island (on which cats are the natural predators of penguins) and
some cat-free islands nearby (Massaro and Blair, 2003). The number of
yellow-eyed penguin pairs are recorded in 19 discrete locations on Stewart
Island, resulting in an average count of 4.2 and the following individual counts
per location:

{7, 3, 3, 7, 3, 7, 3, 10, 1, 7, 4, 1, 3, 2, 1, 2, 9, 4, 2}.

Likewise, counts at 10 discrete locations on the cat-free islands yield an


average count of 9.9 and individual counts of

{15, 32, 1, 13, 14, 11, 1, 3, 2, 7}.

Ruxton and Neuhäuser (2013) employ SIMCTEST to conduct a hypothesis
test to determine whether the means of the penguin counts on Stewart Island
are equal to the ones of the cat-free islands. They apply Welch’s t-test (Welch,
1947) to assess whether two population groups have equal means. The test
statistic of Welch’s t-test is given as follows:

T = (µ̂1 − µ̂2) / √(s1²/n1 + s2²/n2),

where n1, n2 are the sample sizes, µ̂1, µ̂2 the sample means and s1², s2²
the sample variances of the two groups.

Under the assumption of normality of the two population groups, the


distribution of the test statistic under the null hypothesis is approximately
a Student's t-distribution with v degrees of freedom, where

v = (1/n1 + s2²/(s1² n2))² / [ 1/(n1²(n1 − 1)) + s2⁴/(s1⁴ n2²(n2 − 1)) ].

Using the above data, we obtain t = −1.86 as the observed test statistic and
a p-value of 0.09.
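The two formulas above can be evaluated directly on the penguin counts. A minimal sketch in pure Python (illustrative only; the function name `welch` is ours):

```python
stewart = [7, 3, 3, 7, 3, 7, 3, 10, 1, 7, 4, 1, 3, 2, 1, 2, 9, 4, 2]
cat_free = [15, 32, 1, 13, 14, 11, 1, 3, 2, 7]

def welch(x, y):
    """Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((xi - m1) ** 2 for xi in x) / (n1 - 1)   # sample variance s1^2
    v2 = sum((yi - m2) ** 2 for yi in y) / (n2 - 1)   # sample variance s2^2
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / se2 ** 0.5
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

t, df = welch(stewart, cat_free)
```

Comparing t against a Student's t reference distribution with df degrees of freedom then yields the (approximate) p-value quoted above.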

As the normality assumption may not be satisfied in our case and as


the t-distribution is only an approximation, Ruxton and Neuhäuser (2013)
implement a parametric bootstrap test which randomly allocates each of the
178 penguin pairs to one of the 29 islands, where each island is chosen with
equal probability. Based on their experiments, Ruxton and Neuhäuser (2013)
conclude that they cannot reject the null hypothesis at the 5% level.

Likewise, we apply CSM and SIMCTEST with the same bootstrap sam-
pling procedure. We record the average effort measured in terms of the total
number of samples generated. We set the resampling risk to ε = 0.001 and
use the default spending sequence εn = εn/(n + 1000) in SIMCTEST.

We first perform a single run of both CSM and SIMCTEST. CSM and
SIMCTEST stop after 751 and 724 steps with p-value estimates of 0.09 and
0.08, respectively. Hence, both algorithms do not reject the null hypothesis
at the 5% level. We then conduct 10000 independent runs to stabilise the
results. Amongst those 10000 runs, CSM yields the non-rejection decision in
all 10000 runs, compared with 9999 runs for SIMCTEST. The average efforts of CSM
and SIMCTEST are 1440 and 1131, respectively. Therefore, in this example,
CSM gives comparable results to SIMCTEST while generating more samples
on average. We expect such behaviour due to the wider stopping boundaries
of CSM in comparison with SIMCTEST (see Figure 2.3). However, we need
to pre-compute the stopping boundaries of SIMCTEST in advance, which is
not necessary in CSM.

2.10.2 Two-way Contingency Table

We apply mCSM and mSIMCTEST to an example of multinomial counts


of two categorical variables in a 5 × 7 contingency table (see Table 2.3),
considered in Newton and Geyer (1994); Davison et al. (1997); Gandy (2009).
We are interested in testing the null hypothesis that the two variables are
independent. Using a likelihood ratio test, we reject for large values of the
test statistic T(A) = 2 Σi,j aij log(aij/hij), where A = (aij) is a matrix and
hij = (Σv avj)(Σµ aiµ)/Σv,µ avµ.

Table 2.3: Two-way contingency table.

1 2 2 1 1 0 1
2 0 0 2 3 0 0
0 1 1 1 2 7 3
1 1 2 0 0 0 1
0 1 1 1 1 0 0

Under the null hypothesis, the distribution of T (A) converges to a χ2


distribution with 24 degrees of freedom (Davison et al., 1997). For the matrix
Ac in Table 2.3, we observe T (Ac ) = 38.52, thus leading to a p-value estimate
of 0.031 which is significant at a 5% level.
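The observed statistic can be reproduced directly from Table 2.3. A minimal sketch (cells with aij = 0 are skipped, as they contribute zero to the sum):

```python
import math

table = [
    [1, 2, 2, 1, 1, 0, 1],
    [2, 0, 0, 2, 3, 0, 0],
    [0, 1, 1, 1, 2, 7, 3],
    [1, 1, 2, 0, 0, 0, 1],
    [0, 1, 1, 1, 1, 0, 0],
]

def likelihood_ratio_statistic(a):
    """T(A) = 2 * sum_ij a_ij * log(a_ij / h_ij), where h_ij is the
    expected count under independence of the two variables."""
    rows = [sum(r) for r in a]
    cols = [sum(c) for c in zip(*a)]
    total = sum(rows)
    t = 0.0
    for i, row in enumerate(a):
        for j, aij in enumerate(row):
            if aij > 0:
                h = rows[i] * cols[j] / total
                t += 2.0 * aij * math.log(aij / h)
    return t
```

The resulting value can be compared against the χ² distribution with 24 degrees of freedom, matching the asymptotic p-value reported above.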

However, the sparseness of Ac may result in a poor accuracy of the asymp-


totic χ2 approximation. To mitigate this, Davison et al. (1997) recommend a
parametric bootstrap test which generates bootstrapped tables from a
multinomial distribution with Σi,j aij trials and the probability of each cell (i, j)
chosen proportionally to (Σv avj)(Σµ aiµ).

We test with respect to three sets of p-value buckets: J 0 in (2.11),
J n in (2.22) and a new set of overlapping p-value buckets J w, where J w is
defined as:

J w := J 0 ∪ {(1 · 10−5 , 4 · 10−3 ], (4 · 10−3 , 0.03], (0.03, 0.1]}.

Note that the overlapping areas between the p-value buckets in J w are
larger than those in J n. We would expect the lowest computational effort
when using J w followed by J n and J 0 . We also aim to explore how stable
the results are for each set of p-value buckets, i.e., whether the returned
buckets are identical over different simulations.

Table 2.4: Decisions returned for J 0, J n and J w in the contingency example of Table 2.3.

Buckets | J 0 | J n | J w
Method | CSM | SIMCTEST | CSM | SIMCTEST | CSM | SIMCTEST
% (* / ∼) | 100 / NA | 100 / NA | 99.7 / 0.3 | 99.2 / 0.8 | 14.7 / 85.3 | 19.9 / 80.1
Average effort | 14879 | 11796 | 14843 | 11703 | 4930 | 3479

where * refers to (0.01, 0.05] for J 0, J n and J w; ∼ refers to (0.04, 0.06] for J n and
(0.03, 0.1] for J w.

We compute confidence sequences for the unknown p (corresponding to


T ) using both CSM and the SIMCTEST approach. This leads to 6 different
algorithms. We start with a single run of all algorithms for which we em-
ploy the same sample trajectory (n, Sn ). In all six cases, the p-value bucket
returned by the algorithms is (0.01, 0.05] (corresponding to a significance *).
However, the decision on the p-value bucket for J^w is not stable, as we will
see in the next paragraph.

We conduct 10, 000 runs to investigate the stability of the results. In


each run, we use the same trajectory for all algorithms. Table 2.4 displays
the mean stopping time (or equivalently, the average effort measured as num-
ber of samples) and the distribution of the returned p-value buckets for p.
The table shows that while decisions for J^0 are stable using both CSM and
SIMCTEST, they are less so for J^n and J^w . Notably, the decisions obtained
with J^n and J^w are contradictory: Using J^n , we obtain a * significance in
most cases, whereas for J^w we mostly obtain ∼. Both decisions are valid
outcomes in the sense that they were computed with an ε bound on the
resampling risk. However, the decision based on J^n is considerably more
stable than the one based on J^w . We therefore conclude that the decision based on J^n
is the more useful one of the two at the cost of more effort. Indeed, a naı̈ve

Table 2.5: Comparison between CSM and SIMCTEST.

                              CSM         SIMCTEST            SIMCTEST with pre-
                                                              computed boundaries
Memory requirement            O(1)        O(√(τ log τ))*      O(τ_max)
Computational effort          O(τ)        O(τ√(τ log τ))*     O(τ)
Parameters of each method     —           {ε_n}_{n∈N}         {ε_n}_{n∈N}
Implementation from scratch   Very easy   Easy                Easy

The parameter τ denotes the stopping time and τ_max denotes the maximum length of
the pre-computed boundaries for SIMCTEST. Empirical quantities are denoted with *.

approximation of p with 10^7 Monte Carlo samples leads to a p-value estimate


of 0.0415, thus confirming the above result.

The average effort measured by stopping time is higher for CSM than
for SIMCTEST when applied to the same p-value buckets. However, this
definition of the effort is not necessarily an indicator of the overall effort
if all preparatory work needs to be taken into account: Due to the larger
computational overhead to compute boundaries in SIMCTEST, the stopping
time for CSM or merely a naı̈ve approach with a constant number of samples
can be faster in practice, especially when sampling is computationally cheap.

2.11 Discussion

The first part of this chapter introduces a new method called CSM to decide
whether an unknown p-value, which can only be approximated via Monte
Carlo sampling, lies above or below a fixed threshold α while uniformly
bounding the resampling risk at a user-specified ε > 0. The method is
straightforward to implement and relies on the construction of a confidence
sequence (Robbins, 1970; Lai, 1976) for the unknown p-value.

We compare CSM to SIMCTEST (Gandy, 2009), finding that CSM is the
more conservative method: The (implied) stopping boundaries of CSM are
generally wider than the ones of SIMCTEST and in contrast to SIMCTEST,
CSM does not fully spend the allocated resampling risk ε.

We use these findings in two ways: Firstly, an upper bound is usually


known for the maximal number of samples which can be spent in practical
applications. We construct a truncated spending sequence for SIMCTEST
which spends all the available resampling risk within a pre-specified interval,
thus leading to uniformly tighter stopping boundaries and shorter stopping
times than CSM. Secondly, we empirically analyse at which rate CSM spends
the resampling risk. By matching this rate with a suitably chosen spending
sequence, we empirically tune the stopping boundaries of SIMCTEST to
uniformly dominate those of CSM even for open-ended sampling.

A comparison of memory requirement and computational effort for CSM


and SIMCTEST is given in Table 2.5. In SIMCTEST, the boundaries are
sequentially calculated as further samples are being generated whereas in
SIMCTEST with pre-computed boundaries, the boundaries are initially com-
puted and stored up to a maximum number of steps τmax . In CSM, solely
the cumulative sum Sn needs to be stored in each step, leading to a memory
requirement of O(1). Gandy (2009) empirically shows that SIMCTEST with
the default spending sequence has a memory requirement of O(√(τ log τ)). In
SIMCTEST with pre-computed boundaries and the default spending sequence,
the amount of memory required temporarily up to step n is O(√(n log n)).
To compute the boundaries up to τmax , a total memory of O(√(τmax log τmax ))
is hence required. Additionally, the values of the upper and lower bound-

aries up to τmax need to be stored, which requires O(τmax ) memory. Hence,
the total memory requirement of SIMCTEST with pre-computed bound-
aries is O(τmax ). Evaluating the stopping criterion in each step of CSM
or SIMCTEST with pre-computed boundaries requires O(1), leading to the
total computational effort of O(τ ) depicted in Table 2.5 for both cases.
Gandy (2009) reasons that the computational effort of SIMCTEST is roughly

proportional to Σ_{n=1}^{τ} |Un − Ln |. Using the empirical result |Un − Ln | ∼
O(√(n log n)), we obtain a bound of O(τ√(τ log τ)) for the computational effort
of SIMCTEST.

We also compare the truncated versions of CSM and SIMCTEST with


other truncated sequential Monte Carlo procedures. We show empirically
that the resampling risk of the truncated methods cannot be bounded by
an arbitrarily small number and exceeds 0.5 when the true p-value equals
the threshold. Nevertheless, tCSM and tSIMCTEST remain comparable to
other, more sophisticated algorithms regarding the resampling risk.

The advantage of SIMCTEST (with pre-computed boundaries) lies in


its adjustable spending sequence {ε_n}_{n∈N} : This flexibility allows the user
to control the resampling risk spent in each step, thus enabling the user to
spend no risk before a pre-specified step or to spend the full risk within a
finite number of steps (see Section 2.7.1). This leads to (marginally) tighter
stopping boundaries and faster decisions. The strength of CSM, however,
lies in its straightforward implementation compared to SIMCTEST. Both
methods demonstrate superior performance regarding the resampling risk
when truncation is applied. Overall, we conclude that the simplicity of CSM
and its performance, which is comparable to that of SIMCTEST even under truncation,

make it a very appealing competitor for practical applications.

In the second part of this chapter, we investigate methods for identifying


where the unknown p-value (approximated via Monte Carlo simulations) lies
with respect to multiple testing thresholds.

By generalising the thresholds to p-value buckets, we propose two types


of buckets. The non-overlapping buckets produce decisions identical to those of
classical statistical tests with the thresholds {0.001, 0.01, 0.05}. We also
introduce overlapping buckets, at least one of which contains p in its interior
for every p ∈ [0, 1]. We redefine the resampling risk as the probability that the
true p-value is not contained in the p-value bucket returned by an algorithm.
Gandy et al. (2017) prove that the use of overlapping buckets is both a sufficient
and a necessary condition for an algorithm to have a uniformly bounded
resampling risk and a finite stopping time.

We develop a class of algorithms which return a p-value bucket based


upon the confidence sequences to achieve the uniform boundedness of the
resampling risk. We introduce two methods of constructing the desired
sequences by extending CSM and SIMCTEST, which we call mCSM and
mSIMCTEST. We either prove directly or use the result from Gandy et al.
(2017) to see: Both mCSM and mSIMCTEST bound the resampling risk
uniformly and terminate in a finite time for overlapping p-value buckets.
We empirically verify this conclusion by plotting the non-stopping regions of
the proposed non-overlapping and overlapping p-value buckets. Moreover,
we find that overlapping buckets with different overlapping regions may
lead to large differences in the upper bound of the stopping time.

3
Tree-based Particle
Smoothing Algorithms in a
Hidden Markov Model

3.1 Introduction

A hidden Markov model (HMM; Cappé et al., 2006) is a discrete-time


stochastic process {Xt , Yt }t∈N where {Xt }t∈N is an unobserved Markov pro-
cess. We only have access to each Yt whose distribution depends on Xt . The
dependence structure of an HMM is shown in Figure 3.1.

[Figure: the Markov chain X0 → X1 → X2 → X3 → · · · , with an observation Yt attached to each hidden state Xt .]

Figure 3.1: Graphical representation of a hidden Markov model.

We make the following assumptions in the entire chapter unless the model
is otherwise described: The densities of the initial state X0 , the transition
density Xt+1 given Xt = xt and the emission density Yt given Xt = xt taken
with respect to some dominating measure exist, and are defined as follows:

X0 ∼ p0 ( · ),

Xt |{Xt−1 = xt−1 } ∼ p( · |xt−1 ) for t = 1, . . . , T,

Yt |{Xt = xt } ∼ p( · |xt ) for t = 0, . . . , T,

where T is the final time step of the process.

Two common inference problems of the hidden states in the HMM are
filtering and smoothing. The filtering distributions refer to

{p(xt |y0:t )}t=0,...,T . (3.1)

In this chapter, we are interested in the (marginal) smoothing distributions

{p(xt |y0:T )}t=0,...,T (3.2)

or the joint smoothing distribution

p(x0:T |y0:T ), (3.3)

where x0:T and y0:T are abbreviations of (x0 , . . . , xT ) and (y0 , . . . , yT ), respectively.
An exact solution is available for a linear Gaussian HMM using the
Rauch–Tung–Striebel smoother (Rauch et al., 1965) and for an HMM with a
finite-space Markov process (Baum and Petrie, 1966). In most other cases,
the smoothing distribution is not analytically tractable.

A large body of work uses Monte Carlo methods to approximate the


smoothing distributions or the joint smoothing distribution. Sequential Monte
Carlo (SMC) is commonly employed to update the distributions with increas-
ing dimension {p(x0:t |y0:t )}t=0,...,T sequentially (Doucet et al., 2001). SMC
can in principle estimate the joint smoothing distribution p(x0:T |y0:T ). How-
ever, the performance can be poor, as path degeneracy will occur in many
settings (Arulampalam et al., 2002). Advanced SMC methods with desirable
theoretical and practical results have been developed in recent years includ-
ing sequential Quasi-Monte Carlo (Gerber and Chopin, 2015), divide-and-
conquer sequential Monte Carlo (Lindsten et al., 2017), multilevel sequential
Monte Carlo (Beskos et al., 2017) and variational sequential Monte Carlo
(Naesseth et al., 2017).

Other smoothing algorithms have been suggested previously. Doucet


et al. (2000) develop the forward filtering backward smoothing algorithm
(FFBSm) for sampling from the marginal smoothing distributions based on
the formula proposed by Kitagawa (1987). Briers et al. (2010) propose a two-

filter smoother (TFS) which employs a standard forward particle filter and a
backward information filter to sample from the marginal smoothing distribu-
tions. Godsill et al. (2004) propose the forward filtering backward simulation
algorithm (FFBSi) which targets the joint smoothing distribution. Typically,
these algorithms have quadratic complexities in N for generating N samples.
Fearnhead et al. (2010) and Klaas et al. (2006) propose two smoothing al-
gorithms with lower computational complexity, but their methods do not
provide asymptotically unbiased estimates.

Motivated by divide-and-conquer sequential Monte Carlo (D&C SMC)


(Lindsten et al., 2017), we investigate the smoothing problem in this chap-
ter. The D&C SMC algorithm can solve inference problems in general
probabilistic graphical models (PGMs). It splits the target model into multiple levels
of sub-models based upon an auxiliary tree structure T . An intermediate
target distribution needs to be assigned at each non-root node yielding a
sub-model. By generating independent samples between the leaf nodes and
gradually propagating, merging and resampling following the tree towards
the root, the D&C SMC algorithm eventually produces samples from the
target model. Each merging step involves importance sampling which aims
for the (intermediate) target distribution.

Using the idea of D&C SMC, we focus on HMMs rather than a general
PGM. We similarly construct an auxiliary tree T to split an HMM into sub-
models, and aim to estimate the joint smoothing distribution p(x0:T |y0:T ). We
thus call the algorithm tree-based particle smoothing algorithm (TPS). The
key differences between TPS and other smoothing algorithms lie in its non-
sequential sampling procedure and a more adaptive merging step of samples.

Our main contribution in this chapter is the investigation of four classes
of intermediate target distributions in an HMM, which is key for a good overall
performance of TPS. The strategy of Lindsten et al. (2017) for building these
distributions is designed for a general PGM rather than specifically for an HMM.
Moreover, the empirical performance of their method can be unstable, which
will be explored in Section 3.10.

We denote a leaf node corresponding to the random variable Xj of a


single hidden state by Tj ∈ T and a non-leaf node corresponding to the
random variable Xj:l of multiple hidden states by Tj:l ∈ T (j < l).

Lindsten et al. (2017) propose a class of intermediate target distributions


for a general probabilistic graphical model with two examples shown in a
hierarchical model and in a rectangular lattice model. We apply their
method to an HMM, where the intermediate target distribution has a density
proportional to the product of all transition and emission densities associated
with Xj (resp. Xj:l ) at Tj (resp. Tj:l ). This is equivalent to the unnormalised
likelihood of a new HMM given the observation(s) yj (resp. yj:l ), which enjoys
the same dynamics as the original HMM except for an uninformative prior
of Xj if j ≠ 0.

The second class uses an estimate of the filtering distribution p(xj |y0:j )
at Tj and an estimate of the joint filtering distribution p(xj:l |y0:l ) at Tj:l .
Working with this estimate involves tuning a preliminary particle filter.

The third class employs an estimate of the marginal smoothing distribu-


tion p(xj |y0:T ) at Tj and of the joint smoothing distribution p(xj:l |y0:T ) at Tj:l .
We will see that this class of intermediate distributions is optimal in a certain

sense. Furthermore, under this construction, the marginal distribution of
each random variable Xj remains approximately invariant, equal to the
marginal smoothing distribution p(xj |y0:T ), at every level of the tree. The
price of implementing TPS with these intermediate target distributions is
that it relies on estimates of both the filtering and the (marginal) smoothing
distributions, though not necessarily of the joint smoothing distribution.
parametric and non-parametric approaches to construct these intermediate
distributions based on the pre-generated Monte Carlo samples.

The fourth class inherits from the exact (joint) filtering distribution
p(xj |y0:j ) at Tj and p(xj:l |y0:l ) at Tj:l . TPS using this class of intermediate
target distributions employs the samples directly from a filtering algorithm at
the leaf nodes. It is straightforward to implement with no tuning procedures.

This chapter is structured as follows. We first review the HMMs in Sec-


tion 3.2 with their inference problems in Section 3.3. We describe previous
methods for filtering and smoothing in Sections 3.4 and 3.5. We then introduce
TPS in Section 3.6 and discuss its intermediate target distributions in Sec-
tion 3.7. In Section 3.8, we present a diagnostic procedure for TPS which
assesses the sampling quality of the merging step. We conduct simulation
studies in a linear Gaussian HMM in Section 3.9 and in a non-linear HMM
in Section 3.10. The chapter ends with a discussion in Section 3.11.

3.2 Hidden Markov Models

An HMM is a bivariate discrete-time process {Xt , Yt }t∈N where {Xt }t∈N is


a Markov process. The time series {Yt }t∈N form a sequence of independent

random variables given {Xt }t∈N : The conditional distribution of Yt only de-
pends on Xt (Cappé et al., 2006). We assume the underlying Markov pro-
cess {Xt }t∈N is not observable and call each Xt a hidden state of the HMM.
We only have access to the stochastic process {Yt }t∈N linked to the process
{Xt }t∈N , and call each Yt an observation of the HMM. The inference of an
HMM is hence conducted with the information of the observations only.

We assume the HMM is homogeneous which implies the transition kernel


of the Markov process {Xt }t∈N is independent of the time step t. We denote
the state space of the Markov process {Xt }t∈N by X and the sample space of
{Yt }t∈N by Y, respectively.

The dependence structure of an HMM is shown in Figure 3.1. Each


node represents a random variable, and each edge (arrow) indicates depen-
dence between the random variables it connects. Typically, the distribution
of Xt given the history X0 , . . . , Xt−1 only depends on the previous state Xt−1
satisfying the Markov property. Likewise, the distribution of the observa-
tion Yt given all previous states X0 , . . . , Xt and the previous observations
Y0 , . . . , Yt−1 only depends on Xt .

HMMs can be classified according to the variable type or the dynamics of


the model. A finite-space HMM implies that X and Y only take a finite number
of values. A normal HMM requires the conditional distribution of Yt given
Xt to be normally distributed (Cappé et al., 2006). In some applications, its
state space X is assumed to be finite (Rabiner and Juang, 1986; Ball and
Rice, 1992). Linear Gaussian HMMs (Cappé et al., 2006) are prevalent in

many fields, and are defined as

Xt = At−1 Xt−1 + qt−1 ,
Yt = Ht Xt + rt ,          (3.4)

where we assume the following:

1. The stochastic processes {qt−1 }t∈N and {rt }t∈N are independent Gaus-
sian noises with qt−1 ∼ N (0, Qt−1 ) and rt ∼ N (0, Rt );

2. The prior distribution of X0 is normally distributed and is uncorrelated


with the processes {qt−1 }t∈N and {rt }t∈N ;

3. The matrices {At−1 }t∈N and {Ht }t∈N are known and have appropriate
dimensions.
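The model (3.4) can be simulated directly. Below is a minimal sketch (function name and parameter values are ours), assuming time-invariant matrices A, H, Q, R:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_lgssm(T, A, H, Q, R, m0, P0):
    """Simulate hidden states x_{0:T} and observations y_{0:T} from the
    linear Gaussian HMM (3.4), with time-invariant A, H, Q, R."""
    dx, dy = A.shape[0], H.shape[0]
    x = np.empty((T + 1, dx))
    y = np.empty((T + 1, dy))
    x[0] = rng.multivariate_normal(m0, P0)               # X_0 ~ N(m0, P0)
    y[0] = rng.multivariate_normal(H @ x[0], R)          # Y_0 = H X_0 + r_0
    for t in range(1, T + 1):
        x[t] = rng.multivariate_normal(A @ x[t - 1], Q)  # X_t = A X_{t-1} + q_{t-1}
        y[t] = rng.multivariate_normal(H @ x[t], R)      # Y_t = H X_t + r_t
    return x, y

# A scalar random walk observed in Gaussian noise.
x, y = simulate_lgssm(T=50,
                      A=np.array([[1.0]]), H=np.array([[1.0]]),
                      Q=np.array([[0.1]]), R=np.array([[0.5]]),
                      m0=np.zeros(1), P0=np.eye(1))
```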

When a more sophisticated non-linear relationship between the variables


needs to be explained, a non-linear HMM (Cappé et al., 2006) can be em-
ployed with the general form

Xt = ft (Xt−1 , qt−1 ),
(3.5)
Yt = ht (Xt , rt ),

where {ft }t∈N and {ht }t∈N are non-linear functions of appropriate dimensions,
and {qt−1 }t∈N and {rt }t∈N are independent Gaussian noises.

HMMs can be extended to bear more complicated dependence structures


or multiple hidden layers. A higher order HMM (Hadar et al., 2009; Lee and
Lee, 2006) assumes that the conditional distribution of Xt given all past
variables depends on Xt−k , . . . , Xt−1 , where {Xt }t∈N is a kth order

Markov process. A factorial HMM (Ghahramani and Jordan, 1996) allows
each observation Yt to propagate through multiple hidden states from parallel
Markov processes. A Markov-switching model provides a more general form
of the HMMs in the sense that the conditional distribution of Yt given all
past variables now depends on Xt , Yt−1 and possibly even earlier observations
(Cappé et al., 2006).

HMMs and their extensions are widely exploited in the areas such as
speech recognition (Rabiner, 1989; Rabiner and Juang, 1986; Huang et al.,
1990; Bahl et al., 1986), computer vision (Yamato et al., 1992; Brand et al.,
1997), econometrics (Hamilton, 1989), biology (Ball and Rice, 1992; Sonnham-
mer et al., 1998; Krogh et al., 2001; Petersen et al., 2011) and medical imaging
(Zhang et al., 2001).

In this chapter, we consider the inference problems in an HMM with the


most basic dependence structure shown in Figure 3.1. We make the following
assumptions in the HMM as mentioned at the beginning of the chapter:

1. The stochastic process of the HMM starts from t = 0 and terminates


at t = T , where T is referred to as the final step.

2. The initial hidden state X0 has the density p0 .

3. The dynamic of the underlying Markov process {Xt }Tt=0 is specified by


the transition probability density

Xt |{Xt−1 = xt−1 } ∼ p( · |xt−1 )

for t = 1, . . . , T .

4. The dependence structure between each observation Yt and its corre-
sponding hidden state Xt is defined by the emission probability density

Yt |{Xt = xt } ∼ p( · |xt )

for t = 0, . . . , T .

3.3 Filtering and Smoothing

Filtering and smoothing are two common inference problems which attempt
to estimate the hidden states given the observations in an HMM. Filtering
estimates the current state Xt given the observations up to the same time
step t whereas smoothing conditions on all observations till the final time
step T (Särkkä, 2013).

Formally, a filtering distribution is the marginal distribution of the hid-


den state Xt given the past observations up to time t:

p(xt |y0:t ) for t = 0, . . . , T. (3.6)

A (marginal) smoothing distribution is the marginal distribution of the hid-


den state Xt given the observations y0:T :

p(xt |y0:T ) for t = 0, . . . , T. (3.7)

A joint smoothing distribution is the joint distribution of the hidden states

X0:T given the observations y0:T :

p(x0:T |y0:T ). (3.8)

In general, the solutions to the filtering and smoothing distributions


are analytically intractable except for some special circumstances such as
in linear Gaussian HMMs and in finite-space HMMs. We will review two
algorithms: Kalman filter (KF) and Rauch–Tung–Striebel smoother (RTSs)
which respectively provide closed-form solutions to the filtering and smooth-
ing problems in a linear Gaussian HMM.

Applications of filtering and smoothing appear in areas such as global


positioning systems (Kaplan and Hegarty, 2005), finance (Kim et al., 1998;
Jacquier et al., 2002), signal processing (Godsill et al., 2002) and learning
systems (Haykin, 2004).

3.3.1 Kalman Filter

Kalman et al. (1960) present the exact solution to the filtering problem (3.6)
in a linear Gaussian HMM, which is called the Kalman filter (KF).

The linear Gaussian HMM in (3.4) can be expressed in terms of the

probabilistic terms:

p(x0 ) = N (x0 |m0 , P0 ),
p(xt |xt−1 ) = N (xt |At−1 xt−1 , Qt−1 ),
p(yt |xt ) = N (yt |Ht xt , Rt ).

Kalman et al. (1960) prove that the prediction distribution of the hidden state
p(xt |y0:t−1 ), the filtering distribution p(xt |y0:t ) and the prediction distribution
of the observation p(yt |y0:t−1 ) are all normally distributed with

p(xt |y0:t−1 ) = N (xt |m_t^− , P_t^− ),
p(xt |y0:t ) = N (xt |mt , Pt ),
p(yt |y0:t−1 ) = N (yt |Ht m_t^− , St ),

where the set of parameters {mt , Pt , m_t^− , P_t^− , St }_{t=1}^T can be calculated
recursively from t = 1 with the following prediction and update steps. The
prediction step follows

m_t^− = At−1 mt−1 ,
P_t^− = At−1 Pt−1 A_{t−1}^T + Qt−1 ,

and the update step follows

vt = yt − Ht m_t^− ,
St = Ht P_t^− H_t^T + Rt ,
Kt = P_t^− H_t^T S_t^{−1} ,
mt = m_t^− + Kt vt ,
Pt = P_t^− − Kt St K_t^T .
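
The prediction and update recursions above translate directly into code. A minimal sketch (function name ours; we take time-invariant A, H, Q, R and replace the prediction at t = 0 by the prior (m0, P0)):

```python
import numpy as np

def kalman_filter(y, A, H, Q, R, m0, P0):
    """Kalman filter for a time-invariant linear Gaussian HMM: returns the
    filtering means m_t and covariances P_t for t = 0, ..., T."""
    T, dx = len(y), len(m0)
    m = np.empty((T, dx))
    P = np.empty((T, dx, dx))
    m_pred, P_pred = m0, P0
    for t in range(T):
        # Update step: v_t, S_t, K_t, then m_t and P_t.
        v = y[t] - H @ m_pred                    # innovation
        S = H @ P_pred @ H.T + R                 # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
        m[t] = m_pred + K @ v
        P[t] = P_pred - K @ S @ K.T
        # Prediction step for t + 1: m^-_{t+1}, P^-_{t+1}.
        m_pred = A @ m[t]
        P_pred = A @ P[t] @ A.T + Q
    return m, P

# Filter a scalar random walk observed in noise.
A = np.array([[1.0]]); H = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[0.5]])
y = np.array([[0.2], [0.4], [0.1], [0.7]])
m, P = kalman_filter(y, A, H, Q, R, m0=np.zeros(1), P0=np.eye(1))
```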

The KF only works for linear Gaussian HMMs. In a non-linear HMM,


we can still apply the KF to a linear Gaussian HMM derived from the
original non-linear model by an approximation technique, and use its output
as an approximation to the filtering solution.
2007) employs Taylor series expansion to perform the Gaussian approxima-
tion. The statistical linearised filter (Gelb, 1974) derives the optimal linear
Gaussian HMM by minimising the mean square error from the linear ap-
proximation. The unscented Kalman filter (Julier et al., 1995) relies on the
unscented transform, which estimates the mean and variance of the Gaussian
distributions by deterministically selecting a fixed number of sample points
called sigma points. However, these methods can be computationally in-
tensive and potentially provide unsatisfactory results when the true filtering
distributions are highly non-linear or multi-modal (Särkkä, 2013).

3.3.2 Rauch–Tung–Striebel Smoother

The Rauch–Tung–Striebel smoother (RTSs) provides the closed-form solu-


tion to the marginal smoothing distributions in a linear Gaussian HMM
(Rauch et al., 1965). At time t, we have

p(xt |y0:T ) = N (xt |m_t^s , P_t^s ).

The parameters {m_t^s , P_t^s }_{t=0}^{T−1} are computed from the backward recursions:

m_{t+1}^− = At mt ,
P_{t+1}^− = At Pt A_t^T + Qt ,
Gt = Pt A_t^T [P_{t+1}^− ]^{−1} ,
m_t^s = mt + Gt (m_{t+1}^s − m_{t+1}^− ),
P_t^s = Pt + Gt (P_{t+1}^s − P_{t+1}^− ) G_t^T ,

where {mt , Pt }_{t=0}^{T−1} are the means and covariances computed from the KF.
We have m_T^s = mT and P_T^s = PT , and the backward recursions start from
t = T − 1.
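
The backward recursions can be sketched in a few lines (function name ours; time-invariant A and Q assumed for brevity, with the Kalman filter output passed in as arrays):

```python
import numpy as np

def rts_smoother(m, P, A, Q):
    """RTS backward recursions given Kalman filter means m_t and
    covariances P_t for t = 0, ..., T (time-invariant A, Q)."""
    T = len(m) - 1
    ms, Ps = m.copy(), P.copy()              # m^s_T = m_T, P^s_T = P_T
    for t in range(T - 1, -1, -1):
        m_pred = A @ m[t]                            # m^-_{t+1}
        P_pred = A @ P[t] @ A.T + Q                  # P^-_{t+1}
        G = P[t] @ A.T @ np.linalg.inv(P_pred)       # smoother gain G_t
        ms[t] = m[t] + G @ (ms[t + 1] - m_pred)
        Ps[t] = P[t] + G @ (Ps[t + 1] - P_pred) @ G.T
    return ms, Ps

# Smooth the filtering output of a scalar model.
A = np.array([[1.0]]); Q = np.array([[0.1]])
m = np.array([[0.1], [0.3], [0.2]])
P = np.tile(np.array([[[0.4]]]), (3, 1, 1))
ms, Ps = rts_smoother(m, P, A, Q)
```

At the final step the smoothed and filtered quantities coincide, and the smoothed covariances are no larger than the filtered ones.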

Other smoothing algorithms for non-linear HMMs include the extended


Rauch–Tung–Striebel smoother (Cox, 1964), the statistically linearised Rauch–
Tung–Striebel smoother (Särkkä, 2013), the Fourier-Hermite Rauch-Tung-
Striebel smoother (Sarmavuori and Särkkä, 2012) and the unscented Rauch–
Tung–Striebel smoother (Särkkä et al., 2006).

3.4 Particle Methods

Particle methods are an alternative way of estimating the filtering and smooth-
ing distributions in HMMs, which do not rely on the Gaussian approximation
techniques described in Section 3.3.1. Particle methods employ Monte Carlo
simulation to produce samples, which are also called particles, from the pos-
terior distribution such as filtering or smoothing. The methods fall into a
sub-class of the sequential Monte Carlo (SMC) procedures, which we will
discuss in Section 3.4.2. Before this, we first demonstrate a fundamental
technique used in SMC called importance sampling.

3.4.1 Importance Sampling

Importance sampling is a Monte Carlo sampling technique for estimating


the properties of a distribution. It simulates samples from another distribution
and reweights them to represent the target distribution. Importance sampling is very
useful when samples cannot be directly generated from the target distribu-
tion.

Assume π is a probability density function defined on D ⊆ R which


cannot be sampled from directly, and q is another density function which
covers the support of π, i.e. π(x) > 0 ⇒ q(x) > 0 whenever x ∈ D. We aim
to compute the following integral:

I(φ(X)) = ∫_D φ(x)π(x)dx,

where φ is a measurable function. The key idea of importance sampling is

to rewrite I(φ(X)) as follows:

I(φ(X)) = ∫_D φ(x) [π(x)/q(x)] q(x)dx = E[φ(Y ) π(Y )/q(Y )],

where Y ∼ q. A straightforward Monte Carlo estimate Î(φ(X)) of I(φ(X))
is given by

Î(φ(X)) = (1/N) Σ_{i=1}^N φ(x^(i)) π(x^(i))/q(x^(i)),

where {x^(i)}_{i=1}^N ∼ q. The density q is also known as a proposal or an
importance density. We reweight each particle x^(i) by the factor
π(x^(i))/q(x^(i)). We call this factor w^(i) = π(x^(i))/q(x^(i)) the
unnormalised importance weight of x^(i), and w(x) = π(x)/q(x) is the
unnormalised importance weight function.
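As a small numerical illustration of the estimate above (target, proposal and seed are our choices): we take the standard normal as the target π, a wider normal N(0, 2^2) as the proposal q, and estimate E_π[X^2] = 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Target pi: the standard normal density, evaluated pointwise only.
# Proposal q: a wider normal, N(0, 2^2), which covers the support of pi.
N = 100_000
x = rng.normal(0.0, 2.0, size=N)                                  # x^(i) ~ q
pi_x = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)                 # pi(x^(i))
q_x = np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2.0 * np.pi))  # q(x^(i))
w = pi_x / q_x                          # unnormalised importance weights w^(i)

# Estimate I(phi(X)) = E_pi[X^2] (= 1 for the standard normal).
I_hat = np.mean(x**2 * w)
```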
A similar technique can be applied to the target distribution πt (x0:t ) defined
on the product space D^{t+1}. We can rewrite the target distribution πt (x0:t ) as

πt (x0:t ) = γt (x0:t ) / Zt ,

where the unnormalised function γt : D^{t+1} → R+ is known pointwise and Zt
is the normalising constant:

Zt = ∫_{D^{t+1}} γt (x0:t ) dx0:t .

Suppose we have a valid proposal density qt for πt . We rewrite the target


density and the normalising constant in terms of the proposal and the weight

function:

πt (x0:t ) = (1/Zt ) wt (x0:t ) qt (x0:t ),
Zt = ∫_{D^{t+1}} wt (x0:t ) qt (x0:t ) dx0:t ,          (3.9)

where
wt (x0:t ) = γt (x0:t ) / qt (x0:t )          (3.10)

is the unnormalised importance weight function. Assuming we can generate
samples {x_{0:t}^(i)}_{i=1}^N ∼ qt from the proposal, we obtain the Monte Carlo
estimates π̂t of πt and Ẑt of Zt respectively:

π̂t (x0:t ) = Σ_{i=1}^N Wt (x_{0:t}^(i)) δ_{x_{0:t}^(i)} (x0:t ),
Ẑt = (1/N) Σ_{i=1}^N wt (x_{0:t}^(i)),

where
Wt (x_{0:t}^(i)) = wt (x_{0:t}^(i)) / Σ_{j=1}^N wt (x_{0:t}^(j))

is the normalised weight function and δ_{x0} (x) is the Dirac measure with mass
located at x0 . We use the notation {x_t^(i), W_t^(i)}_{i=1}^N ∼ πt to indicate that the
weighted particles {x_t^(i), W_t^(i)}_{i=1}^N provide a particle approximation to πt .

The estimate of the normalising constant Zt is unbiased with variance


(Doucet and Johansen, 2009):

Var(Ẑt ) / Z_t^2 = (1/N) [ ∫_{D^{t+1}} π_t^2 (x0:t ) / qt (x0:t ) dx0:t − 1 ].          (3.11)

The expectation of a test function φt : D^{t+1} → R, defined as

It (φt (X0:t )) = ∫_{D^{t+1}} φt (x0:t ) πt (x0:t ) dx0:t ,

can be approximated by

Ît (φt (X0:t )) = ∫_{D^{t+1}} φt (x0:t ) π̂t (x0:t ) dx0:t = Σ_{i=1}^N Wt (x_{0:t}^(i)) φt (x_{0:t}^(i)),

which is biased for finite N . The asymptotic bias and variance are both
O(1/N ) (Doucet and Johansen, 2009).

We may want to calculate several integrals It with respect to different


test functions φt . Rather than optimising the proposal each time based upon
one typical test function, Doucet and Johansen (2009) suggest choosing a
proposal which minimises the variance of Ẑt in (3.11), or equivalently the
variance of the importance weights. Using qt = πt yields a zero variance,
which violates the assumption that we are unable to sample from the target
distribution. Alternatively, we aim to choose a proposal that is close to the
target and can be sampled from straightforwardly. We will discuss the form of an
optimal proposal in the context of the filtering and smoothing problems in
Section 3.4.2.

3.4.2 Sequential Importance Sampling and Resampling

Choosing an effective proposal in high dimension is often challenging. Se-


quential importance sampling (SIS) is a powerful tool that can sample se-
quentially from a sequence of target distributions {πt (x0:t )}Tt=0 with increas-

ing dimension by importance sampling (Doucet and Johansen, 2009). We
aim to approximate {πt }Tt=0 and {Zt }Tt=0 in (3.9) sequentially using SIS. The
proposal qt in SIS is required to have the following form for t > 0:

qt (x0:t ) = qt−1 (x0:t−1 )qt (xt |x0:t−1 ). (3.12)

The target distribution πt at time t can further be expressed in terms of


the unnormalised function γt−1 and the proposal qt−1 from the previous time
step:

πt (x0:t ) = γt (x0:t ) / Zt = (1/Zt ) [γt (x0:t ) / qt (x0:t )] qt (x0:t )
          = (1/Zt ) qt (x0:t ) [γt−1 (x0:t−1 ) / qt−1 (x0:t−1 )] [γt (x0:t ) / (γt−1 (x0:t−1 ) qt (xt |x0:t−1 ))].

We define the incremental weight function

αt (x0:t ) = γt (x0:t ) / [γt−1 (x0:t−1 ) qt (xt |x0:t−1 )]

and obtain a simple expression for the distribution πt (x0:t ):

πt (x0:t ) = (1/Zt ) qt (x0:t ) wt−1 (x0:t−1 ) αt (x0:t ).          (3.13)

We generate the unnormalised weighted particles {x_{0:t}^(i), w_t^(i)}_{i=1}^N ∼ πt
given {x_{0:t−1}^(i), w_{t−1}^(i)}_{i=1}^N ∼ πt−1 using (3.13). We simulate a new
sample x_t^(i) from the proposal qt (·|x_{0:t−1}^(i)), and append x_t^(i) to the
history x_{0:t−1}^(i) for i = 1, . . . , N . By reweighting each particle x_{0:t}^(i)
with the factor w_{t−1}^(i) αt (x_{0:t}^(i)), we obtain
{x_{0:t}^(i), w_t^(i)}_{i=1}^N ∼ πt . Starting from t = 0, where an importance sam-

Algorithm 1: Sequential importance sampling (SIS)
1  for t = 0 do
2    for i = 1 to N do
3      Sample x_0^(i) ∼ q0 ( · );
4      Compute the unnormalised importance weight: w_0^(i) = π0 (x_0^(i)) / q0 (x_0^(i));
5    end
6  end
7  for t = 1 to T do
8    for i = 1 to N do
9      Sample x_t^(i) ∼ qt ( · |x_{0:t−1}^(i), y0:t ) and denote x_{0:t}^(i) = (x_{0:t−1}^(i), x_t^(i));
10     Compute the unnormalised importance weight: w_t^(i) = w_{t−1}^(i) αt (x_{0:t}^(i));
11   end
12 end

pling step is applied to π0 with the proposal q0 , we proceed sequentially


to obtain the unnormalised weighted particles from the target distributions
{πt }_{t=0}^T . See Algorithm 1 for the implementation.
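
A minimal sketch of Algorithm 1 for a scalar Gaussian random-walk HMM (function names, model parameters and seed are ours). We use the bootstrap proposal q_t(x_t | x_{0:t-1}) = p(x_t | x_{t-1}), for which the incremental weight alpha_t reduces to the emission density p(y_t | x_t):

```python
import numpy as np

rng = np.random.default_rng(3)

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def sis(y, N=1000, q_var=0.1, r_var=0.5):
    """SIS with the bootstrap proposal; weights are tracked on the log
    scale for numerical stability and normalised at the end."""
    T = len(y)
    x = np.empty((T, N))
    x[0] = rng.normal(0.0, 1.0, size=N)            # x_0^(i) ~ p_0 (= q_0)
    logw = np.log(norm_pdf(y[0], x[0], r_var))     # w_0^(i) = p(y_0 | x_0^(i))
    for t in range(1, T):
        x[t] = rng.normal(x[t - 1], np.sqrt(q_var))     # propagate
        logw += np.log(norm_pdf(y[t], x[t], r_var))     # reweight by alpha_t
    W = np.exp(logw - logw.max())
    return x, W / W.sum()                          # normalised weights W_T^(i)

y = np.array([0.1, 0.2, 0.15, 0.3])
x, W = sis(y)
```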

In SIS, we sequentially update the weights at each step, which can
result in a large variance: Most particles have negligible weights and very
few occupy massive weights which will dominate the final estimate of the
distribution (Arulampalam et al., 2002). We call this phenomenon weight
degeneracy. Choosing a good proposal becomes crucial. Doucet et al. (2001)
prove that the optimal proposal q_t^opt , which minimises the variance amongst
all proposals conditional on x0:t−1 and y0:t , has the following form:

q_t^opt (xt |x0:t−1 , y0:t ) = p(xt |xt−1 , yt ).

When the optimal proposal is not analytically tractable, it can be
approximated by a Gaussian proposal obtained from a local linearisation
technique similar to the extended Kalman filter (Doucet et al., 2000). A good
approximation should cover the support of the target and capture both the
tails and the mode(s) of the optimal proposal.

Resampling is also utilised to alleviate the degeneracy issue by replacing
particles of low weight with those of large weight. Practically, it selects
samples randomly from the original data with replacement (Särkkä, 2013).
Some popular resampling procedures which provide an unbiased estimate of the
distribution $\hat{\pi}_t$ include multinomial resampling, residual resampling
(Liu and Chen, 1998), stratified resampling (Carpenter et al., 1999) and
systematic resampling (Kitagawa, 1996). A direct implementation of
multinomial resampling has complexity $O(N \log N)$ based on binary search;
this can be reduced to $O(N)$ using the inverse-CDF method (Hol et al., 2006)
or the recurrence relation described in Devroye (2006). The costs of
stratified resampling and systematic resampling are both $O(N)$. Systematic
resampling has the least expensive operating time, followed by stratified
resampling and multinomial resampling (Hol et al., 2006; Sileshi et al.,
2013). Recently, Gandy and Lau (2016) proposed Chopthin resampling, which
enforces an upper bound on the ratio between the weights and yields particles
with uneven weights after resampling; its expected implementation time is
$O(N)$. Alternatively, we may choose a resampling scheme based upon other
properties such as consistency and convergence results (Gerber et al., 2018).

We review the multinomial resampling procedure in Algorithm 2, which
generates the resampled particles $\{x^{(i)}, W^{(i)}\}_{i=1}^N$ from the
normalised weighted particles $\{\tilde{x}^{(i)}, \tilde{W}^{(i)}\}_{i=1}^N$.
We will use multinomial sampling as the resampling process throughout this
chapter and assume it has complexity $O(N)$.
Algorithm 2: Multinomial resampling
1  for i = 1 to N do
2      Sample index $j(i)$ from a multinomial distribution with the probability vector $(\tilde{W}^{(1)}, \ldots, \tilde{W}^{(N)})$;
3      Let $x^{(i)} = \tilde{x}^{(j(i))}$ and $W^{(i)} = 1/N$;
4  end
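The $O(N)$ inverse-CDF variant mentioned above can be sketched as follows (a hypothetical helper, not the thesis's code): sorted uniforms are generated directly via normalised exponential spacings, and a single linear merge against the weight CDF replaces the per-draw binary search:

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Multinomial resampling (Algorithm 2) in O(N): draw N sorted
    uniforms at once, then walk the weight CDF with a linear merge."""
    N = len(weights)
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0                       # guard against floating-point rounding
    e = rng.exponential(size=N + 1)     # exponential-spacings trick:
    u = np.cumsum(e[:-1]) / e.sum()     # u is already a sorted uniform sample
    idx = np.empty(N, dtype=int)
    j = 0
    for i in range(N):                  # linear merge: O(N) total
        while u[i] > cdf[j]:
            j += 1
        idx[i] = j
    return particles[idx], np.full(N, 1.0 / N)
```

Because the uniforms arrive sorted, the inner `while` advances `j` at most $N$ times over the whole loop, giving the linear cost.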

Algorithm 3: Sequential importance resampling (SIR)
1  for t = 0 do
2      for i = 1 to N do
3          Sample $\tilde{x}_0^{(i)} \sim q_0(\,\cdot\,)$;
4          Compute the unnormalised importance weight: $\tilde{w}_0^{(i)} = \pi_0(\tilde{x}_0^{(i)}) / q_0(\tilde{x}_0^{(i)})$;
5      end
6      if $\hat{N}_{\text{eff}} < N_{\text{thres}}$ then
7          Implement the resampling step and denote the resampled particles (with normalised weights) by $\{x_0^{(i)}, W_0^{(i)}\}_{i=1}^N$;
8      else
9          Normalise the weights and denote them by $\{W_0^{(i)}\}_{i=1}^N$;
10         Denote the weighted particles by $\{x_0^{(i)} = \tilde{x}_0^{(i)}, W_0^{(i)}\}_{i=1}^N$;
11     end
12 end
13 for t = 1 to T do
14     for i = 1 to N do
15         Sample $\tilde{x}_t^{(i)} \sim q_t(\,\cdot \mid x_{0:t-1}^{(i)}, y_{0:t})$ and let $\tilde{x}_{0:t}^{(i)} = (x_{0:t-1}^{(i)}, \tilde{x}_t^{(i)})$;
16         Compute the unnormalised importance weight: $\tilde{w}_t^{(i)} = W_{t-1}^{(i)}\,\alpha_t(\tilde{x}_{0:t}^{(i)})$;
17     end
18     if $\hat{N}_{\text{eff}} < N_{\text{thres}}$ then
19         Implement the resampling step and denote the resampled particles (with normalised weights) by $\{x_{0:t}^{(i)}, W_t^{(i)}\}_{i=1}^N$;
20     else
21         Normalise the weights and denote them by $\{W_t^{(i)}\}_{i=1}^N$;
22         Denote the weighted particles by $\{x_{0:t}^{(i)} = \tilde{x}_{0:t}^{(i)}, W_t^{(i)}\}_{i=1}^N$;
23     end
24 end

Implementing an additional resampling procedure in SIS leads to the
sequential importance resampling (SIR) algorithm. However, resampling may
not be necessary at every time step. Adaptive resampling performs a
resampling step once the diversity of the particles drops below a threshold
$N_{\text{thres}}$. One measure of this diversity is the effective sample
size (ESS), which assesses the variability of the weights in importance
sampling. The ESS (Kong et al., 1994; Liu and Chen, 1995) is given by

\[
N_{\text{eff}} = \frac{N}{1 + \operatorname{Var}(w_t)}, \tag{3.14}
\]

where $w_t$ is the unnormalised importance weight function defined in (3.10).
When $N_{\text{eff}}$ cannot be computed analytically, we can approximate it
using the formula (Doucet et al., 2000)

\[
\hat{N}_{\text{eff}} = \frac{\left(\sum_{i=1}^N w_t^{(i)}\right)^2}{\sum_{i=1}^N \left(w_t^{(i)}\right)^2}. \tag{3.15}
\]

The SIR procedure which adaptively performs the resampling steps in SIS
is shown in Algorithm 3. As most resampling procedures return normalised
weights, we additionally normalise the weights when a resampling step is not
performed.
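The estimator (3.15) is a one-liner in practice; as a sanity check, equal weights give $\hat{N}_{\text{eff}} = N$ while a single non-zero weight gives $\hat{N}_{\text{eff}} = 1$ (illustrative sketch; the function name is ours):

```python
import numpy as np

def ess(w):
    """Effective sample size estimate (3.15) from unnormalised weights."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)
```

A common adaptive rule is to resample whenever `ess(w)` falls below a fixed fraction of $N$, e.g. $N/2$.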

3.4.3 Particle Filtering and Smoothing

We apply SIR to simulate samples from the filtering and smoothing
distributions. In the context of filtering and smoothing in the HMM, we have

\[
\pi_t(x_{0:t}) = p(x_{0:t} \mid y_{0:t}), \qquad Z_t = p(y_{0:t}), \qquad
\alpha_t(x_{0:t}) = \frac{p(y_t \mid x_t)\,p(x_t \mid x_{t-1})}{q_t(x_t \mid x_{0:t-1}, y_{0:t})}.
\]

The algorithm employing SIR to address the filtering problem is also called
a particle filter (PF). At each time step $t$, it does not need to retain the
particles of any previous hidden state $X_{t'}$ with $t' < t$, since the
filtering distribution $p(x_{t'} \mid y_{0:t'})$ conditions only on the
observations up to time $t'$ and does not require updating when new
observations arrive. See Algorithm 4 for the implementation of the PF, which
generates the normalised weighted particles
$\{x_t^{(i)}, W_t^{(i)}\}_{i=1}^N \sim p(\,\cdot \mid y_{0:t})$ sequentially
from $t = 0$ to $t = T$. We can therefore approximate the filtering
distribution $p(x_t \mid y_{0:t})$ by

\[
\hat{p}(x_t \mid y_{0:t}) = \sum_{i=1}^N W_t^{(i)}\,\delta_{x_t^{(i)}}(x_t).
\]

A bootstrap particle filter (BPF, Algorithm 5) is a generic particle filter
which uses the proposal $q_t(x_t \mid x_{0:t-1}, y_{0:t}) = p(x_t \mid x_{t-1})$.
This yields the simple incremental weight function
$\alpha_t(x_{0:t}) = p(y_t \mid x_t)$, at the risk of a poor proposal and
hence a large variance of the weights.
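A compact bootstrap particle filter with adaptive multinomial resampling can be sketched as follows for a hypothetical linear-Gaussian model (the model, the function name and the $N/2$ threshold are our illustrative assumptions):

```python
import numpy as np

def bootstrap_pf(y, N, rng, phi=0.9, tau=1.0, sigma=1.0):
    """Bootstrap particle filter (Algorithm 5) for the toy model
    x_t ~ N(phi x_{t-1}, tau^2), y_t ~ N(x_t, sigma^2), resampling
    multinomially whenever the ESS drops below N/2.
    Returns the estimated filtering means E[x_t | y_{0:t}]."""
    x = rng.normal(0.0, tau, size=N)              # t = 0: sample from the prior
    logw = np.zeros(N)
    means = np.empty(len(y))
    for t in range(len(y)):
        if t > 0:
            x = rng.normal(phi * x, tau)          # proposal = transition density
        logw += -0.5 * ((y[t] - x) / sigma) ** 2  # incremental weight p(y_t|x_t)
        W = np.exp(logw - logw.max())
        W /= W.sum()
        means[t] = np.sum(W * x)
        if 1.0 / np.sum(W ** 2) < N / 2:          # adaptive resampling on ESS
            x = x[rng.choice(N, size=N, p=W)]
            logw = np.zeros(N)
    return means
```

Note that resampling resets the log-weights to zero, since the resampled particles carry equal weight $1/N$.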

A particle smoother (PS) applies SIR to estimate the joint smoothing
distribution (Kitagawa and Sato, 2001), although in the literature it may
still be referred to as the PF (Doucet et al., 2000; Kantas et al., 2009).
Algorithm 4: Particle filter (PF)
1  for t = 0 do
2      for i = 1 to N do
3          Sample $\tilde{x}_0^{(i)} \sim q_0(\,\cdot\,)$;
4          Compute the unnormalised importance weight: $\tilde{w}_0^{(i)} = p(y_0 \mid \tilde{x}_0^{(i)})\,p_0(\tilde{x}_0^{(i)}) / q_0(\tilde{x}_0^{(i)})$;
5      end
6      if $\hat{N}_{\text{eff}} < N_{\text{thres}}$ then
7          Implement the resampling step and denote the resampled particles (with normalised weights) by $\{x_0^{(i)}, W_0^{(i)}\}_{i=1}^N$;
8      else
9          Normalise the weights and denote them by $\{W_0^{(i)}\}_{i=1}^N$;
10         Denote the weighted particles by $\{x_0^{(i)} = \tilde{x}_0^{(i)}, W_0^{(i)}\}_{i=1}^N$;
11     end
12 end
13 for t = 1 to T do
14     for i = 1 to N do
15         Sample $\tilde{x}_t^{(i)} \sim q_t(\,\cdot \mid x_{0:t-1}^{(i)}, y_{0:t})$;
16         Compute the unnormalised importance weight: $\tilde{w}_t^{(i)} = W_{t-1}^{(i)}\,p(y_t \mid \tilde{x}_t^{(i)})\,p(\tilde{x}_t^{(i)} \mid x_{t-1}^{(i)}) / q_t(\tilde{x}_t^{(i)} \mid x_{0:t-1}^{(i)}, y_{0:t})$;
17     end
18     if $\hat{N}_{\text{eff}} < N_{\text{thres}}$ then
19         Implement the resampling step and denote the resampled particles (with normalised weights) by $\{x_t^{(i)}, W_t^{(i)}\}_{i=1}^N$;
20     else
21         Normalise the weights and denote them by $\{W_t^{(i)}\}_{i=1}^N$;
22         Denote the weighted particles by $\{x_t^{(i)} = \tilde{x}_t^{(i)}, W_t^{(i)}\}_{i=1}^N$;
23     end
24 end
Algorithm 5: Bootstrap particle filter (BPF)
1  for t = 0 do
2      for i = 1 to N do
3          Sample $\tilde{x}_0^{(i)} \sim p_0(\,\cdot\,)$;
4          Compute the unnormalised importance weight: $\tilde{w}_0^{(i)} = p(y_0 \mid \tilde{x}_0^{(i)})$;
5      end
6      if $\hat{N}_{\text{eff}} < N_{\text{thres}}$ then
7          Implement the resampling step and denote the resampled particles (with normalised weights) by $\{x_0^{(i)}, W_0^{(i)}\}_{i=1}^N$;
8      else
9          Normalise the weights and denote them by $\{W_0^{(i)}\}_{i=1}^N$;
10         Denote the weighted particles by $\{x_0^{(i)} = \tilde{x}_0^{(i)}, W_0^{(i)}\}_{i=1}^N$;
11     end
12 end
13 for t = 1 to T do
14     for i = 1 to N do
15         Sample $\tilde{x}_t^{(i)} \sim p(\,\cdot \mid x_{t-1}^{(i)})$;
16         Compute the unnormalised importance weight: $\tilde{w}_t^{(i)} = W_{t-1}^{(i)}\,p(y_t \mid \tilde{x}_t^{(i)})$;
17     end
18     if $\hat{N}_{\text{eff}} < N_{\text{thres}}$ then
19         Implement the resampling step and denote the resampled particles (with normalised weights) by $\{x_t^{(i)}, W_t^{(i)}\}_{i=1}^N$;
20     else
21         Normalise the weights and denote them by $\{W_t^{(i)}\}_{i=1}^N$;
22         Denote the weighted particles by $\{x_t^{(i)} = \tilde{x}_t^{(i)}, W_t^{(i)}\}_{i=1}^N$;
23     end
24 end
The PS tracks the full history of particles with the necessary resampling
updates. See Algorithm 6, which simulates the weighted particles
$\{x_{0:T}^{(i)}, W_{0:T}^{(i)}\}_{i=1}^N \sim p(\,\cdot \mid y_{0:T})$. The
smoother is called a bootstrap particle smoother (BPS) if the proposal
$q_t(x_t \mid x_{0:t-1}, y_{0:t}) = p(x_t \mid x_{t-1})$ is employed.

If the resampling step is applied frequently in the PS when $T$ is large, the
particles at times $t \ll T$ are updated numerous times and their diversity
decreases significantly. This phenomenon is called path degeneracy
(Arulampalam et al., 2002) and can cause unsatisfactory estimates of the
marginal smoothing distributions.

3.5 Previous Monte Carlo Smoothing Algorithms

We review two smoothing algorithms: forward filtering backward smoothing
(FFBSm; Doucet et al., 2000) and forward filtering backward simulation
(FFBSi; Godsill et al., 2004). Both algorithms demand an implementation of
the PF in advance and use the generated particles in the reverse-time
direction to accomplish smoothing. FFBSm targets the marginal smoothing
distributions, whereas FFBSi targets the joint smoothing distribution.

3.5.1 Forward Filtering Backward Smoothing (FFBSm)

The forward filtering backward smoothing algorithm (FFBSm) proposed by
Doucet et al. (2000) first simulates normalised weighted samples from the
filtering distributions $\{p(x_t \mid y_{0:t})\}_{t=0}^T$ via a PF; in this
section these are denoted by
$\{x_{t|t}^{(i)}, W_{t|t}^{(i)}\}_{i=1}^N \sim p(x_t \mid y_{0:t})$ for
$t = 0, \ldots, T$.
Algorithm 6: Particle smoother (PS)
1 for t = 0 do
2 for i = 1 to N do
(i)
3 Sample x̃0 ∼ q0 ( · ) ;
(i) (i)
(i) p(y0 |x̃0 )p0 (x̃0 )
4 Compute the unnormalised importance weight: w̃0 = (i)
;
q0 (x̃0 )
5 end
6 if N̂eff < Nthres then
7 Implement the resampling step and denote the resampled particles (with
(i) (i)
normalised weights) by {x0 , W0 }N
i=1 ;
8 else
(i)
9 Normalise the weights and denote by {W0 }N
i=1 ;
(i) (i) (i)
10 Denote the weighted particles by {x0 = x̃0 , W0 }N
i=1 ;
11 end
12 end
13 for t = 1 to T do
14 for i = 1 to N do
(i) (i) (i) (i) (i)
15 Sample x̃t ∼ qt ( · |x0:t−1 , y0:t ) and let x̃0:t = (x0:t−1 , x̃t );
16 Compute the unnormalised importance weight:
(i) (i) (i)
(i) (i) p(yt |x̃t )p(x̃t |xt−1 )
w̃t = Wt−1 (i) (i)
;
qt (x̃t |x0:t−1 , y0:t )
17 end
18 if N̂eff < Nthres then
19 Implement the resampling step and denote the resampled particles (with
(i) (i)
normalised weights) by {x0:t , Wt }N
i=1 ;
20 else
(i)
21 Normalise the weights and denote by {Wt }N
i=1 . Denote the weighted particles
(i) (i) (i)
by {x0:t = x̃0:t , Wt }N
i=1 ;
22 end
23 end

FFBSm is based upon the decomposition of the marginal smoothing distribution
proposed by Kitagawa (1987):

\[
p(x_t \mid y_{0:T}) = p(x_t \mid y_{0:t})
\int \frac{p(x_{t+1} \mid y_{0:T})\,p(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y_{0:t})}\,dx_{t+1}. \tag{3.16}
\]

The algorithm starts from $t = T$, where the filtering distribution is
identical to the marginal smoothing distribution, and proceeds backwards to
generate the particle approximation to the marginal smoothing distribution.
For $t = T-1, \ldots, 0$, the algorithm reweights the particles
$\{x_{t|t}^{(i)}, W_{t|t}^{(i)}\}_{i=1}^N \sim p(\,\cdot \mid y_{0:t})$
generated from the PF according to (3.16). The potentially intractable
integral in (3.16) needs to be estimated using the particles from the
marginal smoothing distribution generated at time $t+1$, which we denote by
$\{x_{t+1|T}^{(i)}, W_{t+1|T}^{(i)}\}_{i=1}^N$.

Practically, the integral part of (3.16) can be approximated as follows:

\[
\int \frac{p(x_{t+1} \mid y_{0:T})\,p(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y_{0:t})}\,dx_{t+1}
\approx \sum_{i=1}^N W_{t+1|T}^{(i)}
\frac{p(x_{t+1|T}^{(i)} \mid x_t)}{p(x_{t+1|T}^{(i)} \mid y_{0:t})}. \tag{3.17}
\]

We further approximate $p(x_{t+1|T}^{(i)} \mid y_{0:t})$, appearing in the
denominator of (3.17), by

\[
p(x_{t+1|T}^{(i)} \mid y_{0:t})
= \int p(x_{t+1|T}^{(i)} \mid x_t)\,p(x_t \mid y_{0:t})\,dx_t
\approx \sum_{j=1}^N W_{t|t}^{(j)}\,p(x_{t+1|T}^{(i)} \mid x_{t|t}^{(j)}). \tag{3.18}
\]

Algorithm 7: Forward filtering backward smoothing (FFBSm)
1  Implement the PF which generates $\{x_{t|t}^{(i)}, W_{t|t}^{(i)}\}_{i=1}^N \sim p(\,\cdot \mid y_{0:t})$ for $t = 0, \ldots, T$;
2  for t = T − 1 to 0 do
3      for j = 1 to N do
4          Compute $V_j = \sum_{l=1}^N W_{t|t}^{(l)}\,p(x_{t+1|T}^{(j)} \mid x_{t|t}^{(l)})$;
5      end
6      for i = 1 to N do
7          Let $x_{t|T}^{(i)} = x_{t|t}^{(i)}$;
8          Compute the normalised weight: $W_{t|T}^{(i)} = \sum_{j=1}^N W_{t+1|T}^{(j)}\,\frac{W_{t|t}^{(i)}\,p(x_{t+1|T}^{(j)} \mid x_{t|T}^{(i)})}{V_j}$;
9      end
10     if $\hat{N}_{\text{eff}} < N_{\text{thres}}$ then
11         Implement the resampling step and override the notation of the $\{x_{t|T}^{(i)}, W_{t|T}^{(i)}\}_{i=1}^N$ after resampling;
12     end
13 end

An estimate $\hat{p}(x_t \mid y_{0:T})$ of $p(x_t \mid y_{0:T})$ using (3.17)
and (3.18) is hence given by

\[
\hat{p}(x_t \mid y_{0:T})
= \sum_{i=1}^N W_{t|t}^{(i)}\,\delta_{x_{t|t}^{(i)}}(x_t)
  \sum_{j=1}^N W_{t+1|T}^{(j)}
  \frac{p(x_{t+1|T}^{(j)} \mid x_t)}{\sum_{l=1}^N W_{t|t}^{(l)}\,p(x_{t+1|T}^{(j)} \mid x_{t|t}^{(l)})}
= \sum_{i=1}^N \sum_{j=1}^N W_{t+1|T}^{(j)}
  \frac{W_{t|t}^{(i)}\,p(x_{t+1|T}^{(j)} \mid x_{t|t}^{(i)})}{\sum_{l=1}^N W_{t|t}^{(l)}\,p(x_{t+1|T}^{(j)} \mid x_{t|t}^{(l)})}\,
  \delta_{x_{t|t}^{(i)}}(x_t).
\]

The normalised weight of the particle $x_{t|t}^{(i)}$ is

\[
W_{t|T}^{(i)} = \sum_{j=1}^N W_{t+1|T}^{(j)}
\frac{W_{t|t}^{(i)}\,p(x_{t+1|T}^{(j)} \mid x_{t|t}^{(i)})}{\sum_{l=1}^N W_{t|t}^{(l)}\,p(x_{t+1|T}^{(j)} \mid x_{t|t}^{(l)})}. \tag{3.19}
\]

Algorithm 7 shows the implementation of FFBSm. The algorithm demands the
storage of the samples from the PF, with a memory requirement of $O(TN)$, and
has a total computational complexity of $O(TN^2)$.
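The backward reweighting (3.19) vectorises naturally: with an $N \times N$ matrix of transition densities, the denominators $V_j$ and the new weights are two matrix products per time step. A sketch with our own (hypothetical) helper names; `trans_dens` must evaluate $p(x' \mid x)$ elementwise with NumPy broadcasting:

```python
import numpy as np

def ffbsm_backward(xf, Wf, trans_dens):
    """Backward pass of FFBSm (Algorithm 7).  xf[t], Wf[t] are filtering
    particles and normalised weights from a PF.  Returns the marginal
    smoothing weights Ws[t] on the same particle locations."""
    T = len(xf) - 1
    Ws = [None] * (T + 1)
    Ws[T] = Wf[T]                     # at t = T, filtering = smoothing
    for t in range(T - 1, -1, -1):
        # D[i, j] = p(x_{t+1|T}^(j) | x_{t|t}^(i))
        D = trans_dens(xf[t + 1][None, :], xf[t][:, None])
        V = Wf[t] @ D                 # V[j]: denominator of (3.19)
        Ws[t] = Wf[t] * (D @ (Ws[t + 1] / V))   # reweighting (3.19)
    return Ws
```

Each `Ws[t]` stays normalised by construction, since summing (3.19) over $i$ telescopes to $\sum_j W_{t+1|T}^{(j)} = 1$.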
3.5.2 Forward Filtering Backward Simulation (FFBSi)

The forward filtering backward simulation algorithm (FFBSi) proposed by
Godsill et al. (2004) provides a particle approximation of
$p(x_{0:T} \mid y_{0:T})$, which can be factorised into

\[
p(x_{0:T} \mid y_{0:T})
= p(x_T \mid y_{0:T}) \prod_{t=0}^{T-1} p(x_t \mid x_{t+1:T}, y_{0:T})
= p(x_T \mid y_{0:T}) \prod_{t=0}^{T-1} p(x_t \mid x_{t+1}, y_{0:t}).
\]

The expression $p(x_t \mid x_{t+1}, y_{0:t})$ can be further written as

\[
p(x_t \mid x_{t+1}, y_{0:t})
= \frac{p(x_t \mid y_{0:t})\,p(x_{t+1} \mid x_t)}{p(x_{t+1} \mid y_{0:t})}
\propto p(x_t \mid y_{0:t})\,p(x_{t+1} \mid x_t). \tag{3.20}
\]

The algorithm requires a preliminary run of the PF, with the normalised
weighted samples denoted by
$\{x_{t|t}^{(j)}, W_{t|t}^{(j)}\}_{j=1}^N \sim p(\,\cdot \mid y_{0:t})$ for
$t = 0, \ldots, T$. The Monte Carlo estimate
$\hat{p}(x_t \mid x_{t+1:T}, y_{0:T})$ of $p(x_t \mid x_{t+1:T}, y_{0:T})$
using (3.20) yields

\[
\hat{p}(x_t \mid x_{t+1:T}, y_{0:T}) = \sum_{i=1}^N W_{t|t+1}^{(i)}\,\delta_{x_{t|t}^{(i)}}(x_t)
\]

with the normalised weight $W_{t|t+1}^{(i)}$ defined as

\[
W_{t|t+1}^{(i)} = \frac{W_{t|t}^{(i)}\,p(x_{t+1} \mid x_{t|t}^{(i)})}{\sum_{j=1}^N W_{t|t}^{(j)}\,p(x_{t+1} \mid x_{t|t}^{(j)})}. \tag{3.21}
\]

Algorithm 8: Forward filtering backward simulation (FFBSi)
1  Implement the PF which generates $\{x_{t|t}^{(j)}, W_{t|t}^{(j)}\}_{j=1}^N \sim p(\,\cdot \mid y_{0:t})$ for $t = 0, \ldots, T$;
2  for i = 1 to N do
3      for t = T do
4          Sample $x_{t|T}^{(i)}$ from $\{x_{t|t}^{(j)}\}_{j=1}^N$ according to the weights $\{W_{t|t}^{(j)}\}_{j=1}^N$;
5      end
6      for t = T − 1 to 0 do
7          for j = 1 to N do
8              Compute the normalised weight: $W_{t|t+1}^{(j)} = \frac{W_{t|t}^{(j)}\,p(x_{t+1|T}^{(i)} \mid x_{t|t}^{(j)})}{\sum_{l=1}^N W_{t|t}^{(l)}\,p(x_{t+1|T}^{(i)} \mid x_{t|t}^{(l)})}$;
9          end
10         Sample $x_{t|T}^{(i)}$ from $\{x_{t|t}^{(j)}\}_{j=1}^N$ according to the weights $\{W_{t|t+1}^{(j)}\}_{j=1}^N$;
11         Append the new particle to the selected path: $x_{t:T|T}^{(i)} = (x_{t|T}^{(i)}, x_{t+1:T|T}^{(i)})$;
12     end
13 end

Practically, the algorithm is initialised at time $t = T$, where it selects a
particle from $\{x_{T|T}^{(j)}\}_{j=1}^N$ according to the weights
$\{W_{T|T}^{(j)}\}_{j=1}^N$. It then iterates backwards, sequentially
sampling a particle at each time step $t$ according to (3.20) and attaching
it to the selected path of $X_{t+1:T}$. Algorithm 8 shows the implementation
of FFBSi, which produces the sample paths
$\{x_{0:T|T}^{(i)}\}_{i=1}^N \sim p(\,\cdot \mid y_{0:T})$. The computational
complexity for each path is $O(TN)$, and thus the total complexity of FFBSi
is $O(TN^2)$.
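The backward-simulation pass can likewise be sketched in a few lines (an illustrative helper; `trans_dens(x_next, x)` evaluates $p(x_{\text{next}} \mid x)$ over a vector of $x$ values):

```python
import numpy as np

def ffbsi(xf, Wf, trans_dens, M, rng):
    """FFBSi (Algorithm 8): draw M trajectories from the joint smoothing
    distribution, given filtering particles xf[t] with weights Wf[t]."""
    T = len(xf) - 1
    paths = np.empty((M, T + 1))
    for m in range(M):
        j = rng.choice(len(Wf[T]), p=Wf[T])                 # sample x_T
        paths[m, T] = xf[T][j]
        for t in range(T - 1, -1, -1):
            w = Wf[t] * trans_dens(paths[m, t + 1], xf[t])  # (3.21), unnormalised
            j = rng.choice(len(w), p=w / w.sum())
            paths[m, t] = xf[t][j]
    return paths
```

Every trajectory only ever revisits locations produced by the forward PF; the backward weights (3.21) decide which of those locations are stitched together.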

3.6 Tree-based Particle Smoothing Algorithm (TPS)

Lindsten et al. (2017) propose divide-and-conquer sequential Monte Carlo
(D&C SMC) to sample from the target distribution of a general probabilistic
graphical model (PGM) based upon an auxiliary tree structure. This section
outlines an algorithm, which we call the ‘tree-based particle smoothing
algorithm’ (TPS), that addresses the smoothing problem in an HMM using D&C
SMC. We demonstrate a particular construction of the auxiliary tree bearing
intermediate target distributions specified at the non-root nodes; the root
of the tree has exactly $p(x_{0:T} \mid y_{0:T})$ as its target distribution.
We then illustrate the sampling procedure in TPS, which produces independent
particles at the leaf nodes and recursively merges them via importance
sampling towards the root.

3.6.1 Construction of the Auxiliary Tree

TPS splits the HMM into sub-models based upon a binary tree decomposition.
It first divides the target variable of all hidden states
$X_{0:T} = (X_0, \ldots, X_T)$ into two disjoint subsets, and recursively
applies binary splits to the resulting subsets until each remaining subset
consists of a single hidden state. Each generated subset corresponds to a
tree node and is assigned an intermediate target distribution. The root
characterises the complete model with the target distribution
$p(x_{0:T} \mid y_{0:T})$. The intermediate target distributions at the
non-root nodes will be discussed in Section 3.7.

We propose one intuitive way of implementing the binary split, which ensures
that the left sub-tree is always a complete binary tree and contains at least
as many nodes as the right sub-tree. We split a non-leaf node
$\mathcal{T}_{j:l}$ with the variable $X_{j:l}$, where $0 \le j < l \le T$,
into two children $\mathcal{T}_{j:k-1}$ and $\mathcal{T}_{k:l}$ with the
random variables $X_{j:k-1}$ and $X_{k:l}$. The split point $k$ satisfies

\[
k = j + 2^p, \qquad p = \lceil \log_2(l - j + 1) \rceil - 1. \tag{3.22}
\]

Given the auxiliary tree constructed using the above binary split, we mark
the level $L$ of each node. Let the leaf nodes be at the bottom level,
$L = 0$; the root is then at the top level, $L = \lceil \log_2(T+1) \rceil$,
as the example in the next paragraph illustrates. Hence, we build the
auxiliary tree from top to bottom.
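The split rule (3.22) and the recursive construction can be sketched directly (the helper names are ours):

```python
import math

def split_point(j, l):
    """Split point (3.22): k = j + 2^p with p = ceil(log2(l - j + 1)) - 1."""
    p = math.ceil(math.log2(l - j + 1)) - 1
    return j + 2 ** p

def build_tree(j, l):
    """Auxiliary tree over X_{j:l} as nested tuples ((j, l), left, right)."""
    if j == l:
        return (j, l)
    k = split_point(j, l)
    return ((j, l), build_tree(j, k - 1), build_tree(k, l))
```

For $T = 5$, `split_point(0, 5)` returns 4, so the root splits into $X_{0:3}$ and $X_{4:5}$, reproducing the tree of Figure 3.2.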

The auxiliary tree for $T = 5$ is shown in Figure 3.2. In particular,
although $X_4$ and $X_5$ are already separated when moving from level 2 to
level 1, we further extend them to level 0 to guarantee that the leaf nodes
at the bottom level contain all single hidden states $X_0, \ldots, X_5$.

This type of auxiliary tree has several advantages: the random variable
within each node has consecutive time indices, and the left sub-tree is a
complete binary tree with $2^{\lfloor \log_2(T+1) \rfloor}$ leaf nodes. The
latter is useful in an on-line setting: when new observations become
available, the samples from the complete sub-tree are retained if their
intermediate target distributions remain unchanged.

Moreover, the tree has a depth of $\lceil \log_2(T+1) \rceil + 1$ levels,
which implies a maximum of $\lceil \log_2(T+1) \rceil$ updates of the samples
corresponding to a hidden state when moving from the bottom of the tree to
the top. In Figure 3.2, the samples of $X_0, \ldots, X_3$ are updated three
times from the leaf nodes, and those of $X_4$ and $X_5$ are updated twice.
When running the particle smoother (PS) to solve the smoothing problem, the
samples at time step $t = 0$ need to be updated at every time step; hence the
maximum number of updates of a hidden state in the PS is $T + 1$, which is
larger than $\lceil \log_2(T+1) \rceil$ for $T > 1$. Usually, more updates
imply more resampling steps. Therefore, TPS with the divide-and-conquer
approach can potentially mitigate path degeneracy for early time steps –
Section 3.9.2 will show this empirically.

[Figure 3.2: Auxiliary tree of TPS constructed from an HMM when T = 5.
Level 3: $X_{0:5}$. Level 2: $X_{0:3}$, $X_{4:5}$. Level 1: $X_{0:1}$,
$X_{2:3}$, $X_4$, $X_5$. Level 0: $X_0, X_1, X_2, X_3, X_4, X_5$.]

3.6.2 Sampling Procedure

The sampling process of TPS proceeds as follows: initial samples are
generated at the bottom of the tree, independently between leaf nodes. These
samples are recursively merged via importance sampling, aiming for the
intermediate targets, until the root of the tree is reached.

At a leaf node Tj ∈ T corresponding to the random variable Xj , we


denote the target density by fj , which can be straightforwardly sampled
from. At a non-leaf node Tj:l ∈ T corresponding to the random variable Xj:l ,
we denote a proper proposal density by hj:l and the target density by fj:l .
At the root, we have the target density f0:T = p(x0:T |y0:T ).

We describe the sampling approach at a leaf and at a non-leaf node of the
auxiliary binary tree $\mathcal{T}$ described in Section 3.6.1. We assume the
random variables at the same level of the auxiliary tree are mutually
independent. At a leaf node $\mathcal{T}_j$, we sample from $f_j$ directly.
At a non-leaf node $\mathcal{T}_{j:l}$ with two children $\mathcal{T}_{j:k-1}$
and $\mathcal{T}_{k:l}$, we may proliferate the particles, then merge,
reweight and resample them to obtain an approximation to the intermediate
target distribution. To illustrate this, we first adopt the pre-stored
particles

\[
S_1 = \{\tilde{x}_{j:k-1}^{(i)}, \tilde{W}_{j:k-1}^{(i)}\}_{i=1}^N \sim f_{j:k-1}
\]

from $\mathcal{T}_{j:k-1}$ and

\[
S_2 = \{\tilde{x}_{k:l}^{(i)}, \tilde{W}_{k:l}^{(i)}\}_{i=1}^N \sim f_{k:l}
\]

from $\mathcal{T}_{k:l}$. A particle approximation of $h_{j:l}$ can be
obtained from the product measure of the two empirical measures formed by
$S_1$ and $S_2$, with complexity $O(N^2)$. Nevertheless, we choose another
routine which potentially achieves a lower cost: we first optionally
proliferate the particles (see Section 3.6.3) in $S_1$ and $S_2$
respectively, and then combine the particles with the same indices from the
two sets, which we refer to as a merging step (Lindsten et al., 2017,
Section 3.2).

Practically, we may proliferate the particles in $S_1$ and $S_2$ to produce
$S_1'$ and $S_2'$ targeting the same particle approximations:

\[
S_1' = \{\tilde{x}_{j:k-1}^{(a_i)}, \hat{W}_{j:k-1}^{(a_i)}\}_{i=1}^{N'} \sim f_{j:k-1}
\]

and

\[
S_2' = \{\tilde{x}_{k:l}^{(b_i)}, \hat{W}_{k:l}^{(b_i)}\}_{i=1}^{N'} \sim f_{k:l},
\]

where the user-specified sample size $N'$ is usually larger than the required
sample size $N$, $\{a_i\}_{i=1}^{N'}$ and $\{b_i\}_{i=1}^{N'}$ are the
indices returned from the proliferation step, and
$\{\hat{W}_{j:k-1}^{(a_i)}\}_{i=1}^{N'}$ and $\{\hat{W}_{k:l}^{(b_i)}\}_{i=1}^{N'}$
are the associated normalised weights. We denote the merged samples by

\[
S' = \{\tilde{x}_{j:l}^{(i)} = (\tilde{x}_{j:k-1}^{(a_i)}, \tilde{x}_{k:l}^{(b_i)}),\;
\tilde{w}_{j:l}^{(i)} = \hat{W}_{j:k-1}^{(a_i)} \hat{W}_{k:l}^{(b_i)}\}_{i=1}^{N'} \sim h_{j:l}. \tag{3.23}
\]

If the proliferation step is not necessary, we simply have $N' = N$ and

\[
S' = \{\tilde{x}_{j:l}^{(i)} = (\tilde{x}_{j:k-1}^{(i)}, \tilde{x}_{k:l}^{(i)}),\;
\tilde{w}_{j:l}^{(i)} = \tilde{W}_{j:k-1}^{(i)} \tilde{W}_{k:l}^{(i)}\}_{i=1}^{N} \sim h_{j:l}. \tag{3.24}
\]

We reweight the particles in $S'$ to target $f_{j:l}$ and resample to obtain
$N$ particles. Algorithm 9 describes the function TPS_gen(j, l), which
produces the normalised weighted samples
$\{x_{j:l}^{(i)}, W_{j:l}^{(i)}\}_{i=1}^N$ from the target $f_{j:l}$ at
$\mathcal{T}_{j:l}$ in TPS. If no proliferation is employed, the total
complexity of TPS is $O(TN)$. Nevertheless, additional effort may be
demanded by the tuning algorithms which determine the intermediate target
distributions $\{f_j\}_{j=0}^T$ at the leaf nodes – we will discuss these in
Section 3.7.5.

TPS applies Algorithm 9 recursively from the leaf nodes to the root. The
computational flow for $T = 5$ is shown in Figure 3.3. In particular, when
the sampling process advances from level 0 to level 1, we simply preserve the
particles of $X_4$ and $X_5$, and merge them from level 1 to level 2. When
the whole sampling process is finished, each node contains the normalised
weighted samples from the corresponding (intermediate) target distribution.
To save memory space, for $L = 0, \ldots, \lceil \log_2(T+1) \rceil$, we may
discard the samples from level $L$ in the tree once the sampling process at
level $L + 1$ is complete.

[Figure 3.3: Computational flow of TPS_gen in an HMM when T = 5. The root
calls TPS_gen(0, 5) to target $p(x_{0:5} \mid y_{0:5})$ at level 3; level 2
calls TPS_gen(0, 3) and TPS_gen(4, 5); level 1 calls TPS_gen(0, 1) and
TPS_gen(2, 3) and reuses the particles of $X_4$ and $X_5$; level 0 calls
TPS_gen(t, t) for $t = 0, \ldots, 5$.]


Algorithm 9: TPS_gen(j, l), which samples from the target $f_{j:l}$ at $\mathcal{T}_{j:l}$ in TPS
1  if j = l then
2      for i = 1 to N do
3          Simulate $x_j^{(i)} \sim f_j(\,\cdot\,)$;
4      end
5      Denote the normalised particles by $\{x_j^{(i)}, W_j^{(i)} = \frac{1}{N}\}_{i=1}^N$;
6  else
7      Let $p = \lceil \log_2(l - j + 1) \rceil - 1$ and $k = j + 2^p$;
8      Adopt $S_1 = \{\tilde{x}_{j:k-1}^{(i)}, \tilde{W}_{j:k-1}^{(i)}\}_{i=1}^N \leftarrow$ TPS_gen(j, k − 1) from $\mathcal{T}_{j:k-1}$ and $S_2 = \{\tilde{x}_{k:l}^{(i)}, \tilde{W}_{k:l}^{(i)}\}_{i=1}^N \leftarrow$ TPS_gen(k, l) from $\mathcal{T}_{k:l}$;
9      Proliferate $S_1$ and $S_2$ if necessary, and denote the merged particles by $\{\tilde{x}_{j:l}^{(i)}, \tilde{w}_{j:l}^{(i)}\}_{i=1}^{N'}$ according to (3.23) or (3.24);
10     for i = 1 to N' do
11         Update the unnormalised importance weight:
           \[
           \hat{w}_{j:l}^{(i)} = \tilde{w}_{j:l}^{(i)}\,
           \frac{f_{j:l}(\tilde{x}_{j:l}^{(i)})}{f_{j:k-1}(\tilde{x}_{j:k-1}^{(i)})\,f_{k:l}(\tilde{x}_{k:l}^{(i)})} \tag{3.25}
           \]
12     end
13     Resample $\{\tilde{x}_{j:l}^{(i)}, \hat{w}_{j:l}^{(i)}\}_{i=1}^{N'}$ to obtain the normalised weighted particles $\{x_{j:l}^{(i)}, W_{j:l}^{(i)}\}_{i=1}^N$;
14 end
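The body of the else-branch, without proliferation, amounts to a few array operations. A sketch with hypothetical density callables acting row-wise on particle arrays (the names and the multinomial-resampling choice are our assumptions):

```python
import numpy as np

def merge_step(S1, S2, f_joint, f_left, f_right, rng):
    """One merge of TPS_gen, (3.24)-(3.25): pair particles with equal
    indices, reweight towards f_{j:l}, then multinomially resample.
    S1 = (x1, W1) with x1 of shape (N, d1); densities map (N, d) -> (N,)."""
    (x1, W1), (x2, W2) = S1, S2
    N = len(W1)
    x = np.hstack([x1, x2])                                # merged particles
    w = W1 * W2 * f_joint(x) / (f_left(x1) * f_right(x2))  # weight (3.25)
    idx = rng.choice(N, size=N, p=w / w.sum())             # resample to size N
    return x[idx], np.full(N, 1.0 / N)
```

When the two child targets are already the marginals of $f_{j:l}$, the ratio in (3.25) stays close to a constant and the resampling discards few particles.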

3.6.3 Proliferation

We now investigate the proliferation step in TPS, which produces a larger
population of $N'$ merged samples in (3.23) at a non-leaf node
$\mathcal{T}_{j:l}$.

The main motivation for the proliferation step is that more merged particles,
with potentially increased diversity, are considered in importance sampling.
When the output sample size $N$ is fixed, proliferation may mitigate weight
degeneracy and boost the sampling quality. Nevertheless, empirical evidence
suggests that, given sufficient memory and computational budget, a larger
output sample size $N = N_1$ with no proliferation is always preferable to a
smaller sample size $N = N_2 < N_1$ with $N' = N_1$ in the proliferation
step. We describe three proliferation methods.

Mixture Sampling

Mixture sampling (Lindsten et al., 2017) returns $N' = N^2$ merged particles
by matching every distinct pair from $S_1$ and $S_2$, i.e.,

\[
S' = \{(\tilde{x}_{j:k-1}^{(a_i)}, \tilde{x}_{k:l}^{(b_i)}),\;
\tilde{w}_{j:l}^{(i)} = \hat{W}_{j:k-1}^{(a_i)} \hat{W}_{k:l}^{(b_i)}\}_{(a_i, b_i) \in \{1, \ldots, N\} \times \{1, \ldots, N\}},
\]

where the normalised weights are
$\hat{W}_{j:k-1}^{(a_i)} = \tilde{W}_{j:k-1}^{(a_i)} / N$ and
$\hat{W}_{k:l}^{(b_i)} = \tilde{W}_{k:l}^{(b_i)} / N$ for every pair
$(a_i, b_i) \in \{1, \ldots, N\} \times \{1, \ldots, N\}$.

Applying mixture sampling for proliferation at every non-leaf node forces
TPS to have the same complexity $O(TN^2)$ as forward filtering backward
smoothing (FFBSm) and forward filtering backward simulation (FFBSi). However,
the quadratic complexity in TPS originates from the presence of $N^2$ samples
in importance sampling rather than from the weight computation of $N$ samples
in FFBSm and FFBSi.

Lightweight Mixture Sampling

Lightweight mixture sampling (Lindsten et al., 2017) simulates particles with
replacement from $S_1$ according to their weights
$\{\tilde{W}_{j:k-1}^{(i)}\}_{i=1}^N$ to produce
$S_1' = \{\tilde{x}_{j:k-1}^{(a_i)}, \hat{W}_{j:k-1}^{(a_i)}\}_{i=1}^{N'}$,
where $\{a_i\}_{i=1}^{N'}$ are the returned indices and
$\{\hat{W}_{j:k-1}^{(a_i)}\}_{i=1}^{N'}$ are the updated normalised weights.
Likewise, we denote the samples produced from $S_2$ using lightweight mixture
sampling by $S_2' = \{\tilde{x}_{k:l}^{(b_i)}, \hat{W}_{k:l}^{(b_i)}\}_{i=1}^{N'}$.
The total complexity of TPS applying lightweight mixture sampling is
$O(TN')$ if we assume $N' > N$. Lindsten et al. (2017) suggest $N' = \kappa N$,
where $1 \le \kappa \ll N$ is a positive tuning parameter specified by the
user.

Permutation

Permutation differs from lightweight mixture sampling in that it selects the
samples without replacement, i.e., it rearranges the labels of the samples
(Lin et al., 2005). The complexity of TPS using permutation is $O(TN')$. We
can still set $N' = \kappa N$, where $1 \le \kappa \ll N$. We replicate the
list $\{1, 2, \ldots, N\}$ $\kappa$ times to form a new list and sample the
indices $\{a_i\}_{i=1}^{N'}$ and $\{b_i\}_{i=1}^{N'}$ without replacement
from this list, respectively. We obtain
$S_1' = \{\tilde{x}_{j:k-1}^{(a_i)}, \hat{W}_{j:k-1}^{(a_i)}\}_{i=1}^{N'}$ and
$S_2' = \{\tilde{x}_{k:l}^{(b_i)}, \hat{W}_{k:l}^{(b_i)}\}_{i=1}^{N'}$, where
$\hat{W}_{j:k-1}^{(a_i)} = \tilde{W}_{j:k-1}^{(a_i)} / \kappa$ and
$\hat{W}_{k:l}^{(b_i)} = \tilde{W}_{k:l}^{(b_i)} / \kappa$ for
$i = 1, \ldots, N'$.
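Proliferation by permutation is essentially a shuffle of the $\kappa$-fold replicated index list (a sketch; the function name is ours):

```python
import numpy as np

def permutation_indices(N, kappa, rng):
    """Indices for proliferation by permutation: replicate {0, ..., N-1}
    kappa times and sample without replacement, i.e. shuffle the list."""
    return rng.permutation(np.tile(np.arange(N), kappa))
```

The associated weights are the original weights divided by $\kappa$, so each parent particle appears exactly $\kappa$ times in the proliferated set.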

3.7 Intermediate Target Distributions

Given the auxiliary tree $\mathcal{T}$ constructed as described in
Section 3.6.1, we define the intermediate target distributions of the
sub-models. We apply the method of Lindsten et al. (2017) to build one class
of intermediate target distributions and develop three new classes based upon
the filtering or smoothing distributions.

3.7.1 Distribution Suggested by Lindsten et al. (2017) (TPS-L)

Lindsten et al. (2017) recommend a class of intermediate target
distributions with densities proportional to the product of factors within
the probabilistic graphical model. We apply the idea to the HMM, which
carries binary and unary factors. Here a binary factor refers to the
transition density of two consecutive hidden states; a unary factor refers
to the prior density of the initial hidden state or the emission density
between a hidden state and its observation. We call TPS with these
intermediate target distributions TPS-L.

At a leaf node $\mathcal{T}_j$, the target distribution contains no binary
factor and is defined as $f_0(x_0) \propto p_0(x_0)\,p(y_0 \mid x_0)$ when
$j = 0$ and $f_j(x_j) \propto p(y_j \mid x_j)$ when $j \ne 0$.

At a non-leaf node $\mathcal{T}_{j:l}$, the target density is proportional to
the product of all transition and emission densities involving the hidden
states of the sub-model:

\[
f_{j:l}(x_{j:l}) \propto p(y_j \mid x_j) \prod_{i=j}^{l-1} p(x_{i+1} \mid x_i)\,p(y_{i+1} \mid x_{i+1}).
\]

This is equivalent to the unnormalised likelihood of an HMM that has the same
dynamics as the original HMM except for an uninformative prior when
$j \ne 0$. When $j = 0$, the prior density of $X_0$ is additionally
multiplied in.

Assume $\mathcal{T}_{j:l}$ connects two children $\mathcal{T}_{j:k-1}$ and
$\mathcal{T}_{k:l}$ carrying the pre-generated particles
$\{\tilde{x}_{j:k-1}^{(i)}, \tilde{W}_{j:k-1}^{(i)}\}_{i=1}^N \sim f_{j:k-1}$
at $\mathcal{T}_{j:k-1} \in \mathcal{T}$ and
$\{\tilde{x}_{k:l}^{(i)}, \tilde{W}_{k:l}^{(i)}\}_{i=1}^N \sim f_{k:l}$ at
$\mathcal{T}_{k:l} \in \mathcal{T}$. The unnormalised importance weight
$\hat{w}_{j:l}^{(i)}$ of the merged particle
$\tilde{x}_{j:l}^{(i)} = (\tilde{x}_{j:k-1}^{(i)}, \tilde{x}_{k:l}^{(i)})$ in
(3.25) becomes

\[
\hat{w}_{j:l}^{(i)} = \tilde{w}_{j:l}^{(i)}\,p(\tilde{x}_k^{(i)} \mid \tilde{x}_{k-1}^{(i)}), \tag{3.26}
\]

where $\tilde{x}_{k-1}^{(i)}$ is the last element of $\tilde{x}_{j:k-1}^{(i)}$
and $\tilde{x}_k^{(i)}$ is the first element of $\tilde{x}_{k:l}^{(i)}$.

The construction of this type of intermediate target distribution is simple
and does not involve any tuning procedure, in contrast to the algorithms in
Sections 3.7.2 and 3.7.4. However, the intermediate target distribution $f_j$
at a leaf node $\mathcal{T}_j$ only considers the single observation $y_j$,
which may be vastly different from the marginal smoothing distribution in
some scenarios, resulting in poor estimates – we will see this in the
simulation study in Section 3.10.

3.7.2 Estimates of the Filtering Distributions (TPS-EF)

The second class of target distributions is based on estimates of the
filtering distributions; we therefore name the algorithm TPS-EF. At the root,
the target distribution is

\[
f_{0:T}(x_{0:T}) = p(x_{0:T} \mid y_{0:T})
\propto p_0(x_0)\,p(y_0 \mid x_0) \prod_{i=0}^{T-1} p(x_{i+1} \mid x_i)\,p(y_{i+1} \mid x_{i+1}).
\]

At a leaf node $\mathcal{T}_j$, we use an estimate of the filtering
distribution, $f_j(x_j) = \hat{p}(x_j \mid y_{0:j}) \approx p(x_j \mid y_{0:j})$,
whose exact form and sampling process will be discussed in Section 3.7.5. At
a non-leaf, non-root node $\mathcal{T}_{j:l}$, we define the intermediate
target distribution

\[
f_{j:l}(x_{j:l}) \propto \hat{p}(x_j \mid y_{0:j})
\prod_{i=j}^{l-1} p(x_{i+1} \mid x_i)\,p(y_{i+1} \mid x_{i+1})
\approx p(x_{j:l} \mid y_{0:l}).
\]

The weight of the merged sample
$\tilde{x}_{j:l}^{(i)} = (\tilde{x}_{j:k-1}^{(i)}, \tilde{x}_{k:l}^{(i)})$ in
(3.25) becomes

\[
\hat{w}_{j:l}^{(i)} = \tilde{w}_{j:l}^{(i)}\,
\frac{p(\tilde{x}_k^{(i)} \mid \tilde{x}_{k-1}^{(i)})\,p(y_k \mid \tilde{x}_k^{(i)})}{\hat{p}(\tilde{x}_k^{(i)} \mid y_{0:k})}. \tag{3.27}
\]

This type of intermediate target distribution incorporates more information
from the observations than TPS-L. Using TPS-EF, the particles at the leaf
nodes are initially generated from (an estimate of) the filtering
distributions; while moving up the tree, their marginal distributions
gradually shift towards the smoothing distributions. However, TPS-EF may
eliminate a large proportion of particles if the discrepancy between the
filtering and smoothing distributions is large. We illustrate this in the
non-linear HMM described in Section 3.10.1, with the parameters $\tau$ and
$\sigma$ set to 1 and 5 respectively. As shown in Figure 3.4, the
unnormalised densities of the filtering and marginal smoothing distributions
at times $t = 390$ and $t = 391$, estimated by a finite-space HMM described
in Section 3.10.2, have markedly different shapes. This clearly indicates
that a proposal estimated from a filtering distribution may not always be
effective for targeting a smoothing distribution.

[Figure 3.4: Estimated (unnormalised) filtering and smoothing densities at
t = 390 (left) and t = 391 (right) in the non-linear HMM. Linear scale on
the y-axis.]

3.7.3 Kullback–Leibler Divergence in TPS

Before proposing the third type of intermediate target distributions, we


present an optimal type of proposal at a non-leaf node in TPS which at-
tains the minimum Kullback–Leibler (KL) divergence (Cover and Thomas,
2012) from the target distribution.

Given that the proposal density hj:l = fj:k−1 fk:l at Tj:l is the product of the densities of two independent random variables, we claim that the minimum KL divergence is attained when the two densities are the marginal target densities of the corresponding random variables.

For notational simplicity, we denote the target variable at a non-leaf node by (X1, X2) with density f(x1, x2). A valid proposal density h1(x1)h2(x2)
satisfies h1 (x1 )h2 (x2 ) > 0 whenever f (x1 , x2 ) > 0, where we assume h1 and
h2 are the probability densities of two independent random variables X1 and
X2 from the children. Using the result from (Botev et al., 2013, Section 2.1),
the proposal f1 (x1 )f2 (x2 ) has the smallest KL divergence amongst all pro-
posals of the form h1 (x1 )h2 (x2 ), where f1 (x1 ) and f2 (x2 ) are the marginal
densities of f (x1 , x2 ). We prove this in Theorem 4.

Theorem 4. Let f be the probability density function of (X1, X2) defined on $D^{n_1+n_2}$, and let h1 and h2 be the probability density functions of two independent random variables X1 and X2 defined on $D^{n_1}$ and $D^{n_2}$, respectively. If h1(x1)h2(x2) > 0 whenever f(x1, x2) > 0, then
\[
\int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log \frac{f(x_1, x_2)}{h_1(x_1)\, h_2(x_2)}\, dx_1\, dx_2 \ge
\int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log \frac{f(x_1, x_2)}{f_1(x_1)\, f_2(x_2)}\, dx_1\, dx_2,
\]
where $f_1(x_1) = \int_{D^{n_2}} f(x_1, x_2)\, dx_2$ and $f_2(x_2) = \int_{D^{n_1}} f(x_1, x_2)\, dx_1$ are the marginal densities of f(x1, x2).

Proof. By Jensen's inequality,
\[
\begin{aligned}
\int_{D^{n_1}} f_1(x_1) \log f_1(x_1)\, dx_1 - \int_{D^{n_1}} f_1(x_1) \log h_1(x_1)\, dx_1
&= \int_{D^{n_1}} f_1(x_1) \log \frac{f_1(x_1)}{h_1(x_1)}\, dx_1 \\
&= \mathbb{E}\left[\log \frac{f_1(X_1)}{h_1(X_1)}\right] = \mathbb{E}\left[-\log \frac{h_1(X_1)}{f_1(X_1)}\right] \\
&\ge -\log \mathbb{E}\left[\frac{h_1(X_1)}{f_1(X_1)}\right] = 0.
\end{aligned}
\]

Using this and the definition of the marginal density,
\[
\begin{aligned}
\int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log f_1(x_1)\, dx_1\, dx_2
&= \int_{D^{n_1}} f_1(x_1) \log f_1(x_1)\, dx_1 \\
&\ge \int_{D^{n_1}} f_1(x_1) \log h_1(x_1)\, dx_1 \\
&= \int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log h_1(x_1)\, dx_1\, dx_2.
\end{aligned} \tag{3.28}
\]

Similarly,
\[
\int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log f_2(x_2)\, dx_1\, dx_2
\ge \int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log h_2(x_2)\, dx_1\, dx_2. \tag{3.29}
\]

Multiplying (3.28) and (3.29) by $-1$ and adding them, we obtain
\[
\int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log \frac{1}{f_1(x_1)\, f_2(x_2)}\, dx_1\, dx_2
\le \int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log \frac{1}{h_1(x_1)\, h_2(x_2)}\, dx_1\, dx_2.
\]
Adding $\int_{D^{n_2}} \int_{D^{n_1}} f(x_1, x_2) \log f(x_1, x_2)\, dx_1\, dx_2$ to both sides yields the result.
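As a quick numerical sanity check of Theorem 4, the following Python sketch (our illustration, not part of the thesis; the bivariate normal target with ρ = 0.8 and the mis-centred proposal are assumed choices) estimates both KL divergences by Monte Carlo and confirms that the product of the true marginals does better:

```python
import math
import random

random.seed(0)
rho = 0.8
n = 50_000

def log_norm_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_joint_pdf(x1, x2):
    # Bivariate standard normal density with correlation rho.
    q = (x1 * x1 - 2 * rho * x1 * x2 + x2 * x2) / (1 - rho * rho)
    return -math.log(2 * math.pi) - 0.5 * math.log(1 - rho * rho) - 0.5 * q

kl_marginal, kl_shifted = 0.0, 0.0
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1, x2 = z1, rho * z1 + math.sqrt(1 - rho * rho) * z2  # sample from f
    lf = log_joint_pdf(x1, x2)
    # Proposal using the true marginals: f1 = f2 = N(0, 1).
    kl_marginal += (lf - log_norm_pdf(x1, 0, 1) - log_norm_pdf(x2, 0, 1)) / n
    # A mis-centred independent proposal: h1 = N(1, 1), h2 = N(0, 1).
    kl_shifted += (lf - log_norm_pdf(x1, 1, 1) - log_norm_pdf(x2, 0, 1)) / n
```

Analytically, KL(f ‖ f1 f2) = −½ log(1 − ρ²) ≈ 0.511 here, and the shifted proposal adds KL(N(0, 1) ‖ N(1, 1)) = ½ on top of it; the Monte Carlo estimates recover both values.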

3.7.4 Estimates of the Smoothing Distributions (TPS-ES)

We build intermediate targets in TPS which estimate the smoothing distribution p(xj:l|y0:T) at each non-leaf and non-root node Tj:l. The intermediate target distribution at each leaf node Tj estimates p(xj|y0:T) and is usually constructed using Monte Carlo samples pre-generated from the marginal smoothing distribution. We thus name the algorithm TPS-ES.

An immediate question arises regarding the incentive for running TPS-ES, since we already obtain samples from the marginal smoothing distributions in the preliminary run. Two reasons justify the use of TPS-ES. Firstly, TPS-ES targets the joint smoothing distribution. Secondly, we hope to improve sampling quality given a sufficient computational budget but a fixed sample size. The simulation study in Section 3.10.5 shows that TPS-ES can achieve this.

We define the intermediate target distributions. At the root, we still


have f0:T (x0:T ) = p(x0:T |y0:T ). At a leaf node Tj ∈ T , we define fj (xj ) =
p̂(xj |y0:T ) ≈ p(xj |y0:T ). At a non-leaf and non-root node Tj:l , we define the
target distribution fj:l :

\[
\begin{aligned}
f_{j:l}(x_{j:l}) &\propto \hat{p}(x_j \mid y_{0:j})\, \frac{\hat{p}(x_l \mid y_{0:T})}{\hat{p}(x_l \mid y_{0:l})} \prod_{i=j}^{l-1} p(x_{i+1} \mid x_i)\, p(y_{i+1} \mid x_{i+1}) \\
&\approx p(x_j \mid y_{0:j})\, \frac{p(x_l \mid y_{0:T})}{p(x_l \mid y_{0:l})} \prod_{i=j}^{l-1} p(x_{i+1} \mid x_i)\, p(y_{i+1} \mid x_{i+1}) \\
&= p(x_{j:l} \mid y_{0:T}),
\end{aligned}
\]

where p̂(xj |y0:j ) approximates the filtering distribution at time step j. The

preliminary constructions of {p̂(xj |y0:j )}j=0,...,T and {p̂(xj |y0:T )}j=0,...,T will
be proposed in Section 3.7.5.

The weight of the merged sample $\tilde{x}^{(i)}_{j:l} = (\tilde{x}^{(i)}_{j:k-1}, \tilde{x}^{(i)}_{k:l})$ at a non-leaf and non-root node Tj:l in (3.25) becomes:
\[
\hat{w}^{(i)}_{j:l} = \tilde{w}^{(i)}_{j:l}\, \frac{\hat{p}(\tilde{x}^{(i)}_{k-1} \mid y_{0:k-1})}{\hat{p}(\tilde{x}^{(i)}_{k-1} \mid y_{0:T})\, \hat{p}(\tilde{x}^{(i)}_k \mid y_{0:k})}\, p(\tilde{x}^{(i)}_k \mid \tilde{x}^{(i)}_{k-1})\, p(y_k \mid \tilde{x}^{(i)}_k). \tag{3.30}
\]

TPS-ES exhibits a sound property regarding the KL divergence stated in Theorem 4. Given the target distribution fj:l(xj:l) = p̂(xj:l|y0:T) estimating p(xj:l|y0:T) at Tj:l, the proposal density hj:l(xj:l) = fj:k−1(xj:k−1) fk:l(xk:l) estimates p(xj:k−1|y0:T) p(xk:l|y0:T). Both p(xj:k−1|y0:T) and p(xk:l|y0:T) are marginal densities of p(xj:l|y0:T), and hence their product forms the proposal attaining the minimum KL divergence from p(xj:l|y0:T). In other words, the distribution that the proposal hj:l estimates attains the minimum KL divergence from the smoothing density that our target fj:l estimates.

Moreover, TPS-ES can be practically useful in highly complicated HMMs for inspecting anomalies in challenging importance sampling steps. At a non-leaf node, we can compare the empirical distributions generated by the samples corresponding to a hidden state from the proposal and from the target, both of which should estimate the marginal smoothing distribution. If there is a substantial difference, we need to examine the merging step; we discuss a diagnostic procedure in Section 3.8.

3.7.5 Intermediate Target Distributions at Leaf Nodes

We mentioned the estimated filtering distributions {p̂(xj|y0:j)}j=0,...,T and the estimated marginal smoothing distributions {p̂(xj|y0:T)}j=0,...,T in Sections 3.7.2 and 3.7.4, which serve as the intermediate targets at the leaf nodes in TPS-EF and TPS-ES. In general, the filtering and smoothing distributions of an HMM are analytically intractable, with some exceptions including linear Gaussian HMMs and finite-space HMMs; {p̂(xj|y0:j)}j=0,...,T and {p̂(xj|y0:T)}j=0,...,T can then be estimated from Monte Carlo samples.

We aim to generate a probability density f̂ estimating the target density f, given n weighted samples {xi, Wi}ni=1 from f. When f is a filtering or marginal smoothing density, we can obtain the weighted samples by running a filtering algorithm or a marginal smoothing algorithm. We are not interested in the empirical distribution generated by these samples, since it is discrete and may also suffer from weight or path degeneracy.

We first consider parametric approaches. We can fit the data with some common probability distribution such as a normal distribution. We can also accommodate a mixture model if the target distribution seems multi-modal. The parameters can be estimated in various ways, including moment matching, the maximum likelihood method and the expectation–maximisation algorithm.

The parametric approaches are reasonably quick and simple. For in-
stance, assuming a normal distribution requires the evaluation of the mean
and variance, which can be easily obtained from the samples using moment
matching. Nevertheless, the target distribution may not be well approxi-
mated under a parametric assumption.
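As a minimal sketch of the moment-matching route (the function name and the synthetic particles below are ours, standing in for the output of a filtering algorithm), the fitted normal is determined by the weighted sample mean and variance:

```python
import random

def moment_match_normal(xs, ws):
    """Fit a normal density to weighted particles by moment matching:
    the fitted mean and variance are the weighted sample moments."""
    total = sum(ws)
    mean = sum(w * x for x, w in zip(xs, ws)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(xs, ws)) / total
    return mean, var

# Hypothetical particles, e.g. the output of a bootstrap particle filter.
random.seed(1)
xs = [random.gauss(2.0, 1.5) for _ in range(20000)]
ws = [1.0] * len(xs)  # equal weights, e.g. after resampling
m, v = moment_match_normal(xs, ws)
```

With these particles, m and v recover (approximately) the generating mean 2.0 and variance 2.25.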

Alternatively, we can employ non-parametric approaches, for instance a kernel density estimator (KDE). We need to select the type of kernel and the bandwidth in advance. The complexity of generating N new samples is O(N log n), and the evaluation of all densities is more computationally intensive, with complexity O(nN).

We propose another non-parametric approximation method using piecewise constant functions, with a lower computational effort than the KDE. We first build a uniform grid consisting of the points x1 < x2 < . . . < xn with densities d1, . . . , dn estimated by the KDE, such that xi+1 − xi = ∆ > 0 for i = 1, . . . , (n − 1). The resulting probability density function formed by these grid points using piecewise constant functions is
\[
f(x) \propto \sum_{i=1}^{n} \mathbb{1}_{x \in [x_i - \Delta/2,\, x_i + \Delta/2)}\, d_i. \tag{3.31}
\]

Compared to the KDE, the evaluation of the sample densities reduces significantly from O(nN) to O(N).
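A minimal sketch of the piecewise-constant construction in (3.31) follows (the grid and density values are illustrative; the uniform grid makes each evaluation an O(1) index lookup, which is the source of the speed-up):

```python
import math

def make_piecewise_pdf(grid, dens):
    """Normalised piecewise-constant density on a uniform grid:
    points x_1 < ... < x_n with spacing delta, unnormalised heights d_i
    on the cells [x_i - delta/2, x_i + delta/2)."""
    delta = grid[1] - grid[0]
    z = sum(dens) * delta          # normalising constant
    x0, n = grid[0], len(grid)

    def pdf(x):
        i = math.floor((x - x0) / delta + 0.5)  # nearest cell index, O(1)
        return dens[i] / z if 0 <= i < n else 0.0

    return pdf

pdf = make_piecewise_pdf([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 1.0])
```

Note the compact support: any x outside the grid's cells gets density zero, which is exactly the bias issue discussed next.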

Despite its fast computation, the proposal constructed by piecewise constant functions has several disadvantages. Firstly, using this compactly supported proposal, based upon finitely many sample points, for a target distribution with larger support causes biased estimates of the normalising constant and the smoothing means. The support is, however, not fixed and can grow to cover the support of the target asymptotically. Secondly, in TPS-ES, if the estimated filtering and smoothing distributions are generated using piecewise constant functions from different samples, there is no guarantee that they have the same support, which may cause zero or infinite weights

in (3.30). To avoid this, we consider mixture probability distributions which take into account the samples from both the filtering and smoothing distributions. Assume that at time step j, the first uniform grid consists of the points $x^f_1 < x^f_2 < \ldots < x^f_{n_f}$ such that $x^f_{i+1} - x^f_i = \Delta^f$ for $i = 1, \ldots, (n_f - 1)$, with the estimated filtering densities $d^f_1, \ldots, d^f_{n_f}$ from a KDE, and that the second uniform grid consists of the points $x^s_1 < x^s_2 < \ldots < x^s_{n_s}$ such that $x^s_{i+1} - x^s_i = \Delta^s$ for $i = 1, \ldots, (n_s - 1)$, with the estimated smoothing densities $d^s_1, \ldots, d^s_{n_s}$ from another KDE. Then the estimated filtering density $\hat{p}(x \mid y_{0:j})$ is given by

\[
\hat{p}(x \mid y_{0:j}) \propto \alpha^f \sum_{i=1}^{n_f} \mathbb{1}_{x \in [x^f_i - \Delta^f/2,\, x^f_i + \Delta^f/2)}\, d^f_i
+ (1 - \alpha^f) \sum_{i=1}^{n_s} \mathbb{1}_{x \in [x^s_i - \Delta^s/2,\, x^s_i + \Delta^s/2)}\, d^s_i, \tag{3.32}
\]

where 0 < αf < 1. Similarly, the estimated smoothing density p̂(x|y0:T ) is


given by

\[
\hat{p}(x \mid y_{0:T}) \propto \alpha^s \sum_{i=1}^{n_s} \mathbb{1}_{x \in [x^s_i - \Delta^s/2,\, x^s_i + \Delta^s/2)}\, d^s_i
+ (1 - \alpha^s) \sum_{i=1}^{n_f} \mathbb{1}_{x \in [x^f_i - \Delta^f/2,\, x^f_i + \Delta^f/2)}\, d^f_i, \tag{3.33}
\]

where 0 < αs < 1. We have no principled choice for the values of αf and αs so far, and set them both close to 1 in the simulation study of Section 3.10.5.

3.7.6 Exact Filtering Distributions (TPS-F)

We propose a variant of TPS which employs the exact (joint) filtering distributions as the intermediate targets. We call it the tree-based particle smoothing algorithm with the filtering distributions (TPS-F). The initial samples at the leaf nodes in TPS-F are taken directly from the Monte Carlo samples generated by a filtering algorithm.

We illustrate the intermediate target distributions in TPS-F. At a leaf node Tj, we have fj(xj) = p(xj|y0:j). At a non-leaf node Tj:l, the (intermediate) target is the joint filtering distribution fj:l(xj:l) = p(xj:l|y0:l). Likewise, at the root node T0:T the target distribution is precisely the joint smoothing distribution f0:T(x0:T) = p(x0:T|y0:T).

We derive the weight formula in the importance sampling step at a non-leaf node Tj:l with children Tj:k−1 and Tk:l. The decomposition of the target fj:l(xj:l) = p(xj:l|y0:l) at Tj:l is given as follows:

\[
\begin{aligned}
f_{j:l}(x_{j:l}) &= p(x_{j:l} \mid y_{0:l}) \\
&= p(x_{j:k-1} \mid x_{k:l}, y_{0:k-1})\, p(x_{k:l} \mid y_{0:l}) \\
&= \frac{p(x_{k:l} \mid x_{j:k-1})\, p(x_{j:k-1} \mid y_{0:k-1})}{p(x_{k:l} \mid y_{0:k-1})}\, p(x_{k:l} \mid y_{0:l}).
\end{aligned}
\]

The weight formula in (3.25) becomes:

\[
\hat{w}^{(i)}_{j:l} = \tilde{w}^{(i)}_{j:l}\, \frac{p(\tilde{x}^{(i)}_{k:l} \mid \tilde{x}^{(i)}_{j:k-1})}{p(\tilde{x}^{(i)}_{k:l} \mid y_{0:k-1})}. \tag{3.34}
\]

The numerator of (3.34) can be decomposed into

\[
p(\tilde{x}^{(i)}_{k:l} \mid \tilde{x}^{(i)}_{j:k-1}) = p(\tilde{x}^{(i)}_l \mid \tilde{x}^{(i)}_{l-1}) \cdots p(\tilde{x}^{(i)}_{k+1} \mid \tilde{x}^{(i)}_k)\, p(\tilde{x}^{(i)}_k \mid \tilde{x}^{(i)}_{k-1}).
\]

The denominator is given by

\[
\begin{aligned}
p(\tilde{x}^{(i)}_{k:l} \mid y_{0:k-1}) &= \int p(\tilde{x}^{(i)}_{k:l} \mid \tilde{x}_{j:k-1})\, p(\tilde{x}_{j:k-1} \mid y_{0:k-1})\, d\tilde{x}_{j:k-1} \\
&= \int p(\tilde{x}^{(i)}_k \mid \tilde{x}_{k-1})\, p(\tilde{x}^{(i)}_{k+1} \mid \tilde{x}^{(i)}_k) \cdots p(\tilde{x}^{(i)}_l \mid \tilde{x}^{(i)}_{l-1})\, p(\tilde{x}_{j:k-1} \mid y_{0:k-1})\, d\tilde{x}_{j:k-1} \\
&= p(\tilde{x}^{(i)}_l \mid \tilde{x}^{(i)}_{l-1}) \cdots p(\tilde{x}^{(i)}_{k+1} \mid \tilde{x}^{(i)}_k) \int p(\tilde{x}^{(i)}_k \mid \tilde{x}_{k-1})\, p(\tilde{x}_{j:k-1} \mid y_{0:k-1})\, d\tilde{x}_{j:k-1}.
\end{aligned}
\]

Therefore, the unnormalised importance weight in (3.34) becomes

\[
\hat{w}^{(i)}_{j:l} = \tilde{w}^{(i)}_{j:l}\, \frac{p(\tilde{x}^{(i)}_k \mid \tilde{x}^{(i)}_{k-1})}{\int p(\tilde{x}^{(i)}_k \mid \tilde{x}_{k-1})\, p(\tilde{x}_{j:k-1} \mid y_{0:k-1})\, d\tilde{x}_{j:k-1}}. \tag{3.35}
\]

The integral in the denominator of (3.35) can further be estimated using the samples $\{\tilde{x}^{(l)}_{j:k-1}, \tilde{W}^{(l)}_{j:k-1}\}_{l=1}^{N} \sim p(\,\cdot \mid y_{0:k-1})$ at Tj:k−1. We hence have the Monte Carlo estimate of the weight:
\[
\hat{w}^{(i)}_{j:l} \approx \tilde{w}^{(i)}_{j:l}\, \frac{p(\tilde{x}^{(i)}_k \mid \tilde{x}^{(i)}_{k-1})}{\sum_{l=1}^{N} p(\tilde{x}^{(i)}_k \mid \tilde{x}^{(l)}_{k-1})\, \tilde{W}^{(l)}_{j:k-1}}.
\]

At each non-leaf node, the effort of computing the weights is O(N 2 ) and the
total complexity of the algorithm is O(T N 2 ) if no proliferation procedure is
applied. Estimating the importance weights using Monte Carlo samples is

also seen in Doucet et al. (2000); Klaas et al. (2005).
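The Monte Carlo weight estimate can be sketched as follows; this is a generic O(N²) illustration with placeholder inputs (the function name, the particle arrays and the transition density `trans_pdf` are our assumptions, not the thesis's implementation):

```python
def tpsf_weights(w_tilde, x_k, x_km1, pool_x_km1, pool_W, trans_pdf):
    """Estimate the TPS-F weights: for merged sample i, the numerator is
    p(x_k^(i) | x_{k-1}^(i)); the denominator is the weighted-sample
    estimate sum_l p(x_k^(i) | x_{k-1}^(l)) W^(l) of p(x_k^(i) | y_{0:k-1})."""
    weights = []
    for i in range(len(w_tilde)):
        num = trans_pdf(x_k[i], x_km1[i])
        den = sum(trans_pdf(x_k[i], xl) * Wl
                  for xl, Wl in zip(pool_x_km1, pool_W))
        weights.append(w_tilde[i] * num / den)
    return weights
```

With a constant transition density and normalised pool weights the denominator sums to the numerator's value, so the input weights pass through unchanged; in an HMM, `trans_pdf` would be the model's transition density p(x_k | x_{k−1}).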

TPS-F does not require any tuning in the construction of the intermediate target distributions, which renders the whole algorithm easy to implement. However, its complexity is still quadratic in the sample size N, so it does not outperform the conventional smoothing algorithms in computational cost.

3.8 Diagnostics

We define two metrics, the relative effective sample size (RESS) and the marginal relative effective sample size (MRESS), to assess the quality of the importance sampling steps in TPS. We apply the two metrics to a toy model and to the simulation models from Sections 3.9 and 3.10.

3.8.1 Definitions and Properties of RESS and MRESS

An unsatisfactory importance sampling step in TPS, which results in extremely uneven weights, can be induced by at least one of two factors. Given the target fj:l at the node Tj:l connecting Tj:k−1 and Tk:l, the first factor is a poor proposal hj:k−1 (resp. hk:l) for estimating the marginal target distribution with respect to Xj:k−1 (resp. Xk:l). Such an issue can, however, be mitigated by adjusting hj:k−1 (resp. hk:l). The second factor lies in the assumed structure of the proposal hj:l = fj:k−1 fk:l, which is the product of the probability densities of two independent random variables. A potentially strong correlation between Xj:k−1 and Xk:l in the target can make it challenging for the merging step to 'match' the samples from the children. We may apply a proliferation procedure (see Section 3.6.3), which boosts the number of samples in importance sampling at an additional computational cost. Other techniques such as tempering (Chopin, 2002) may also be employed.

RESS and MRESS can help distinguish between the above two factors. We first review the effective sample size (ESS) defined in Section 3.4.2, which assesses the variability of the weights in a general importance sampling procedure. The formula of ESS (Doucet et al., 2000) is given by:
\[
\hat{N}_{\text{eff}} = \frac{\left(\sum_{i=1}^{N} w^{(i)}\right)^2}{\sum_{i=1}^{N} (w^{(i)})^2}, \tag{3.36}
\]

where $w^{(i)}$ corresponds to the unnormalised importance weight of the $i$th sample. We define the relative effective sample size (RESS) by
\[
\text{RESS} = \frac{1}{N} \frac{\left(\sum_{i=1}^{N} w^{(i)}\right)^2}{\sum_{i=1}^{N} (w^{(i)})^2} = \frac{\hat{N}_{\text{eff}}}{N},
\]

which is the ratio between the effective sample size and the real sample size.
A perfect importance sampling step with equally weighted samples returns
RESS equal to 1.

In an importance sampling step of TPS, RESS can be defined as follows. For notational simplicity, we let f(x1, x2) be the target density of (X1, X2) at a non-leaf node, defined on $D^{n_1+n_2}$. We assume that h1 and h2, defined on $D^{n_1}$ and $D^{n_2}$, are the probability densities of two independent random variables X1 and X2 from the children of the non-leaf node. The product of h1 and h2 forms a valid proposal for f. Given the normalised weighted samples $\{(x_1^{(i)}, W_1^{(i)})\}_{i=1}^{N} \sim h_1$ and $\{(x_2^{(i)}, W_2^{(i)})\}_{i=1}^{N} \sim h_2$, and similarly to (3.23), we denote the merged samples after a proliferation step by
\[
\{(x_1^{(a_i)}, x_2^{(b_i)}),\, w^{(a_i, b_i)}\}_{i=1}^{N'} \sim h_1 h_2,
\]

where $N'$ is the number of returned samples, $\{a_i\}_{i=1}^{N'}$ and $\{b_i\}_{i=1}^{N'}$ are the indices from the proliferation step, and $\{w^{(a_i, b_i)}\}_{i=1}^{N'}$ are the associated unnormalised importance weights. The relative effective sample size (RESS) of the samples $\{(x_1^{(a_i)}, x_2^{(b_i)}),\, w^{(a_i, b_i)}\}_{i=1}^{N'}$ is defined by

\[
\text{RESS} = \frac{1}{N'} \frac{\left(\sum_{a_i} \sum_{b_i} w^{(a_i, b_i)}\right)^2}{\sum_{a_i} \sum_{b_i} (w^{(a_i, b_i)})^2}.
\]

We then define the marginal relative effective sample size MRESS1 of the samples $\{(x_1^{(a_i)}, x_2^{(b_i)}),\, w^{(a_i, b_i)}\}_{i=1}^{N'}$ by
\[
\text{MRESS}_1 = \frac{1}{N} \frac{\left(\sum_{a_i} \sum_{b_i} w^{(a_i, b_i)}\right)^2}{\sum_{a_i} \left(\sum_{b_i} w^{(a_i, b_i)}\right)^2}.
\]

Similarly, the marginal relative effective sample size MRESS2 is defined by
\[
\text{MRESS}_2 = \frac{1}{N} \frac{\left(\sum_{a_i} \sum_{b_i} w^{(a_i, b_i)}\right)^2}{\sum_{b_i} \left(\sum_{a_i} w^{(a_i, b_i)}\right)^2}.
\]

There exists a strict relationship between RESS and MRESS if the proliferation step proceeds with mixture sampling (see Section 3.6.3), which is stated in Theorem 5. Recall that the exact form of the merged particles after mixture sampling is
\[
\{(x_1^{(a_i)}, x_2^{(b_i)}),\, w^{(a_i, b_i)}\}_{(a_i, b_i) \in \{1, \ldots, N\} \times \{1, \ldots, N\}},
\]
where the unnormalised weight $w^{(a_i, b_i)} = W_1^{(a_i)} W_2^{(b_i)}$. We simplify the notation and denote the particles by

Theorem 5. In a proliferation step which performs mixture sampling, the merged samples $\{(x_1^{(i)}, x_2^{(j)}),\, w^{(i,j)}\}_{(i,j) \in \{1, \ldots, N\} \times \{1, \ldots, N\}}$ from $\{x_1^{(i)}, W_1^{(i)}\}_{i=1}^{N} \sim h_1$ and $\{x_2^{(j)}, W_2^{(j)}\}_{j=1}^{N} \sim h_2$ satisfy
\[
\text{MRESS}_1 \ge \text{RESS}, \qquad \text{MRESS}_2 \ge \text{RESS}.
\]

Proof. We prove MRESS1 ≥ RESS and start from
\[
\sum_{k=1}^{N} \sum_{1 \le i < j \le N} (w^{(k,i)} - w^{(k,j)})^2 \ge 0.
\]
Expanding the above expression, and rearranging the squared terms and the product terms, gives
\[
(N - 1) \sum_{j=1}^{N} \sum_{i=1}^{N} (w^{(i,j)})^2 \ge \sum_{k=1}^{N} \sum_{1 \le i < j \le N} 2\, w^{(k,i)} w^{(k,j)}.
\]
Adding $\sum_{j=1}^{N} \sum_{i=1}^{N} (w^{(i,j)})^2$ to both sides and completing the squares on the right-hand side gives
\[
N \sum_{j=1}^{N} \sum_{i=1}^{N} (w^{(i,j)})^2 \ge \sum_{i=1}^{N} \left(\sum_{j=1}^{N} w^{(i,j)}\right)^2.
\]
So
\[
\frac{1}{N^2} \frac{1}{\sum_{j=1}^{N} \sum_{i=1}^{N} (w^{(i,j)})^2} \le \frac{1}{N} \frac{1}{\sum_{i=1}^{N} \left(\sum_{j=1}^{N} w^{(i,j)}\right)^2}.
\]
Multiplying both sides by $\left(\sum_{j=1}^{N} \sum_{i=1}^{N} w^{(i,j)}\right)^2$ gives the result.

The proof of MRESS2 ≥ RESS is similar.
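The three diagnostics can be computed directly from the N × N matrix of merged weights produced by mixture sampling; a minimal sketch (the function name and inputs are ours):

```python
def ress_mress(w):
    """RESS, MRESS1 and MRESS2 from an N x N matrix of unnormalised
    merged weights w[i][j] (mixture sampling, so N' = N^2)."""
    N = len(w)
    s = sum(x for row in w for x in row)        # sum of all weights
    s2 = sum(x * x for row in w for x in row)   # sum of squared weights
    row_sums = [sum(row) for row in w]          # inner sum over j, per i
    col_sums = [sum(w[i][j] for i in range(N)) for j in range(N)]
    ress = s * s / (N * N * s2)
    mress1 = s * s / (N * sum(r * r for r in row_sums))
    mress2 = s * s / (N * sum(c * c for c in col_sums))
    return ress, mress1, mress2
```

Theorem 5 guarantees MRESS1 ≥ RESS and MRESS2 ≥ RESS for any such weight matrix; equal weights give all three metrics equal to 1.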

We investigate the asymptotic behaviours of RESS and MRESS under mixture sampling. By the strong law of large numbers and Slutsky's theorem,

\[
\lim_{N \to \infty} \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} w^{(i,j)} = \lim_{N \to \infty} \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{f(x_1^{(i)}, x_2^{(j)})}{h_1(x_1^{(i)})\, h_2(x_2^{(j)})} = 1, \tag{3.37}
\]
\[
\lim_{N \to \infty} \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} (w^{(i,j)})^2 = \int_{D^{n_2}} \int_{D^{n_1}} \frac{f^2(x_1, x_2)}{h_1(x_1)\, h_2(x_2)}\, dx_1\, dx_2, \tag{3.38}
\]
\[
\lim_{N \to \infty} \frac{1}{N^3} \sum_{i=1}^{N} \left(\sum_{j=1}^{N} w^{(i,j)}\right)^2 = \int_{D^{n_1}} \frac{f_1^2(x_1)}{h_1(x_1)}\, dx_1, \tag{3.39}
\]
\[
\lim_{N \to \infty} \frac{1}{N^3} \sum_{j=1}^{N} \left(\sum_{i=1}^{N} w^{(i,j)}\right)^2 = \int_{D^{n_2}} \frac{f_2^2(x_2)}{h_2(x_2)}\, dx_2, \tag{3.40}
\]
where we recall
\[
f_1(x_1) = \int_{D^{n_2}} f(x_1, x_2)\, dx_2, \qquad f_2(x_2) = \int_{D^{n_1}} f(x_1, x_2)\, dx_1
\]
are the marginal densities of f(x1, x2). Recall we have N′ = N². By plugging (3.37), (3.38), (3.39) and (3.40) into the definitions of RESS and MRESS, we have

\[
\lim_{N \to \infty} \frac{1}{\text{RESS}} = \int_{D^{n_2}} \int_{D^{n_1}} \frac{f^2(x_1, x_2)}{h_1(x_1)\, h_2(x_2)}\, dx_1\, dx_2,
\]
\[
\lim_{N \to \infty} \frac{1}{\text{MRESS}_1} = \int_{D^{n_1}} \frac{f_1^2(x_1)}{h_1(x_1)}\, dx_1,
\]
\[
\lim_{N \to \infty} \frac{1}{\text{MRESS}_2} = \int_{D^{n_2}} \frac{f_2^2(x_2)}{h_2(x_2)}\, dx_2.
\]

When N is large, MRESS1, MRESS2 and RESS roughly quantify the relative effective sample sizes with respect to the target distributions f1, f2 and f when using the proposals h1, h2 and h1h2, respectively. A low MRESS1 (resp. MRESS2) implies a poor marginal proposal h1 (resp. h2) for approximating the marginal distribution f1 (resp. f2) as the target. Large MRESS1 and MRESS2 accompanied by an extremely low RESS imply a strong correlation between X1 and X2 in the target distribution f.

3.8.2 Examples

We illustrate RESS and MRESS in several examples where mixture sampling is applied for proliferation. We first consider a toy model. The target distribution is
\[
f \sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \tag{3.41}
\]

where ρ ∈ [0, 1) is the correlation between X1 and X2. We let the proposals be h1 ∼ N(m1, 1) and h2 ∼ N(m2, 1), where m1 and m2 are pre-specified constants.

[Figure 3.5 plots MRESS1, RESS and MRESS2 for the settings (m1, m2) = (0, 0), (1, 0) and (1, 1).]

Figure 3.5: Average RESS and MRESS for three parameter settings, which are computed
over 1000 simulations, in the toy model when ρ = 0 (left) and ρ = 0.9 (right).

We set the sample size N = 1000. The average RESS, MRESS1 and MRESS2 over 1000 simulations are plotted in Figure 3.5 when ρ = 0 (left) and ρ = 0.9 (right). When h1 (resp. h2) is identical to the marginal of f, i.e. m1 = 0 (resp. m2 = 0), MRESS1 (resp. MRESS2) is (very close to) 1 regardless of ρ. RESS is affected by ρ as expected, decreasing as the correlation increases. When both RESS and MRESS are low, no decisive conclusion can be reached as to whether the poor merging step is caused only by the ineffective marginal proposals h1, h2 or additionally by the strong correlation within the target variable.

We further investigate RESS and MRESS in every importance sampling


step of TPS in the linear Gaussian HMM from Section 3.9 and in the non-
linear HMM from Section 3.10. At each non-leaf node Tj:l associated with the
target variable denoted by Xj:l , we calculate RESS, MRESS1 and MRESS2
from the importance sampling step which merges the particles of Xj:k−1 ∼
fj:k−1 and Xk:l ∼ fk:l .

We use a relatively small T = 31 in both models for better visualisation,
and set τ = 1, σ = 5 in the non-linear HMM. The output sample size N is
500. We implement TPS in both models and employ normal distributions as
the intermediate target distributions at the leaf nodes.

Figure 3.6 presents two trees, which indicate the RESS and MRESS of all importance sampling steps from TPS applied to the linear Gaussian HMM (left) and to the non-linear HMM (right). Suppose we seek the RESS and MRESS at a non-leaf node Tj:l situated at level L. We locate the point (in black) with y-coordinate between (L − 1) and L on the vertical line t = (j + l)/2. The value of RESS at Tj:l then equals the y-coordinate of that point minus (L − 1). Note that each point connects a square (in red) and a triangle (in blue) via solid lines. The corresponding MRESS1 (resp. MRESS2) equals the y-coordinate of the square (resp. triangle) minus (L − 1).

In Figure 3.6, the linear Gaussian HMM and the non-linear HMM demonstrate rather different scenarios for RESS and MRESS. In the linear Gaussian HMM, most MRESS and RESS values are greater than 0.5, indicating relatively effective importance sampling steps in this model. This is because the proposals and the (intermediate) target distributions are both normally distributed with close means and variances. In the non-linear HMM, however, the RESS and MRESS values are much lower, so our final estimate of the joint smoothing distribution may not be highly accurate. The reason for a poor importance sampling step at a node can be inferred from RESS and MRESS as in the toy model in (3.41).


Figure 3.6: RESS and MRESS of all importance sampling steps from TPS, which are pre-
sented using a tree, applied to the linear Gaussian HMM (left) and to the non-linear HMM
(right). See Section 3.8.2 for the illustration of searching the values of RESS and MRESS
corresponding to each tree node in TPS.

3.9 Simulation Study in a Linear Gaussian HMM

We study the empirical performance of TPS against other Monte Carlo smoothing algorithms in a linear Gaussian HMM. We first describe the model, and propose two metrics which respectively measure sampling error and sample diversity. We then run the algorithms under roughly the same computational effort and discuss the results.

3.9.1 Model Description and Metrics

We consider a simple linear Gaussian HMM similar to Doucet et al. (2000):
\[
\begin{aligned}
X_t &= 0.8\, X_{t-1} + V_t, \qquad t = 1, \ldots, T, \\
Y_t &= X_t + W_t, \qquad\qquad t = 0, \ldots, T,
\end{aligned} \tag{3.42}
\]

where T = 127, X0 , V1 , . . . , VT , W0 , . . . , WT are independent with X0 ∼
N (0, 1), Vt ∼ N (0, 1), Wt ∼ N (0, 1). The smoothing solution can be ob-
tained analytically from the Rauch–Tung–Striebel smoother (RTSs) (Rauch
et al., 1965) described in Section 3.3.2.

We define the mean square error of the means (MSEm) and of the variances (MSEv) in the mth simulation of a Monte Carlo smoothing algorithm:
\[
\text{MSEm}_m = \frac{1}{T+1} \sum_{t=0}^{T} \left( \hat{\mathbb{E}}^m[X_t \mid y_{0:T}] - \mathbb{E}[X_t \mid y_{0:T}] \right)^2,
\]
\[
\text{MSEv}_m = \frac{1}{T+1} \sum_{t=0}^{T} \left( \widehat{\text{Var}}^m[X_t \mid y_{0:T}] - \text{Var}[X_t \mid y_{0:T}] \right)^2,
\]

where $\hat{\mathbb{E}}^m[X_t \mid y_{0:T}]$ and $\widehat{\text{Var}}^m[X_t \mid y_{0:T}]$ are the Monte Carlo estimates of the mean and variance of p(xt|y0:T) in the mth simulation of the algorithm, and E[Xt|y0:T] and Var[Xt|y0:T] are the true mean and variance from the RTSs.

Additionally, we propose a metric for identifying sample diversity, which we call the effective sample size of the empirical distribution (ESSoED). ESSoED aggregates the particles with identical values prior to the calculation of ESS (Schäfer and Chopin, 2013; van de Meent et al., 2015). We denote a set of normalised weighted particles by $S = \{x^{(i)}, W^{(i)}\}_{i=1}^{N}$, which may or may not have been resampled. We assume the weights are all positive. We create a new set of weighted particles $\hat{S} = \{\hat{x}^{(i)}, \hat{W}^{(i)}\}_{i=1}^{N_r}$ which satisfies:

1. $\hat{x}^{(i)} \ne \hat{x}^{(j)}$ for all $\hat{x}^{(i)}, \hat{x}^{(j)} \in \hat{S}$ provided $i \ne j$;

2. for all $x^{(i)} \in S$, there exists $\hat{x}^{(j)} \in \hat{S}$ such that $x^{(i)} = \hat{x}^{(j)}$;

3. for all $\hat{x}^{(i)} \in \hat{S}$, there exists at least one $x^{(j)} \in S$ such that $x^{(j)} = \hat{x}^{(i)}$.

We assign the normalised weight $\hat{W}^{(i)}$ of $\hat{x}^{(i)} \in \hat{S}$ to be the total weight of those particles in S with values identical to $\hat{x}^{(i)}$. Formally, it is defined as
\[
\hat{W}^{(i)} = \sum_{j=1}^{N} W^{(j)}\, \mathbb{1}_{x^{(j)} = \hat{x}^{(i)}}.
\]

Intuitively, we merge the identical samples in S by accumulating their weights,


and produce a set of samples with all unique values. We then define ESSoED:

\[
\text{ESSoED} = \frac{1}{\sum_{i=1}^{N_r} (\hat{W}^{(i)})^2}.
\]

A sampling step where all particles are distinct yields an ESSoED of N , and
in the most extreme case that the empirical distribution of S is degenerate at
one point, its ESSoED is 1. More serious weight degeneracy implies a lower
ESSoED.
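A minimal sketch of ESSoED follows (aggregating identical particle values via a dictionary is our implementation choice):

```python
from collections import defaultdict

def essoed(xs, ws):
    """Effective sample size of the empirical distribution: aggregate the
    normalised weights of particles with identical values, then return
    1 / (sum of squared aggregated weights)."""
    total = sum(ws)
    agg = defaultdict(float)
    for x, w in zip(xs, ws):
        agg[x] += w / total          # merge duplicates, accumulate weight
    return 1.0 / sum(w * w for w in agg.values())
```

All-distinct, equally weighted particles return N; a fully degenerate set of particles returns 1.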

3.9.2 Simulation Results

We compare the performances of the smoothing algorithms in the linear Gaussian HMM. We implement the bootstrap particle smoother (BPS), forward filtering backward smoothing (FFBSm) (Doucet et al., 2000), forward filtering backward simulation (FFBSi) (Godsill et al., 2004), TPS-L suggested by Lindsten et al. (2017), TPS-EF and TPS-F. We perform multinomial resampling after every importance sampling step of the above algorithms. In all versions of TPS, we do not proliferate particles. Moreover, in TPS-EF, we employ a normal distribution as the intermediate target distribution at each leaf node, whose mean and variance are estimated using moment matching
Table 3.1: Performance of the smoothing algorithms in the linear Gaussian HMM using
comparable computational effort.

Algorithm N n MSEm (s.e.) MSEv (s.e.)


BPS 44000 NA 0.0020 (0.000015) 0.0019 (0.000013)
FFBSm 410 410 0.0065 (0.000055) 0.0047 (0.000037)
FFBSi 450 450 0.0059 (0.000056) 0.0044 (0.000031)
TPS-L 13000 NA 0.0012 (0.000009) 0.0011 (0.000007)
TPS-EF-N 10000 10000 0.0007 (0.000005) 0.0007 (0.000006)
TPS-F 500 500 0.0282 (0.000190) 0.0241 (0.000188)
The best candidates in the last two columns are marked in bold. Standard error (s.e.) is the standard deviation divided by √M.

from the samples of a bootstrap particle filter (BPF). The choice of a normal
distribution is motivated by the normality of the true smoothing distribution.
We extend the name of TPS-EF to TPS-EF-N.

We have implemented the above algorithms in R. The output sample size


is denoted by N . Since FFBSm, FFBSi, TPS-EF-N and TPS-F require a
preliminary run of a filtering algorithm, we implement the BPF and denote
its sample size by n in these algorithms. We set the effort of TPS-EF-N
as a benchmark, which is measured by runtime, and adjust N and n in
other algorithms for roughly the same effort. As the implementations are
not deterministic, we allow a 10% error of runtime compared to TPS-EF-N.
We run each algorithm M = 500 times with the same set of observations.

The simulation results regarding the mean square errors are shown in Table 3.1. TPS-EF-N and TPS-L both have the same complexity O(T N) as the BPS, and hence generate far more particles than FFBSm, FFBSi and TPS-F, whose complexity is O(T N²). TPS-EF-N has the lowest MSEm and MSEv, followed by TPS-L, and outperforms FFBSm and FFBSi. TPS-F, though not involving any tuning step, produces the largest MSEm and MSEv.


Figure 3.7: Effective sample size of the empirical distribution (ESSoED) averaged over 500
simulations for each time step t in the smoothing algorithms applied to the linear Gaussian
HMM. Log scale on the y-axis.

In Figure 3.7, we plot the effective sample size of the empirical distribution (ESSoED) averaged over M = 500 simulations for each time step t. The BPS provides a very large ESSoED at later time steps while, as expected, suffering from path degeneracy at early time steps. Its ESSoED drops to approximately 50 at t = 0 under the sample size of 44000. TPS-EF-N has a large ESSoED at all time steps, consistently outperforming FFBSm, FFBSi and TPS-F. It is only surpassed by the BPS after t = 100. We hence regard TPS-EF-N as an effective way of mitigating path degeneracy.

3.10 Simulation Study in a Non-linear HMM

We run the smoothing algorithms including TPS in a non-linear HMM. We


describe the model in Section 3.10.1. As the smoothing distributions are
analytically unavailable, we instead employ the solution from a finite-space
HMM as the benchmark. We illustrate the construction of the finite-space

HMM in Section 3.10.2. We then propose two error metrics in Section 3.10.3.
We perform the simulations of TPS and other algorithms in Section 3.10.4,
and further compare TPS-EF and TPS-ES in Section 3.10.5.

3.10.1 Model Description and Metrics

We consider a well-known non-linear model (Gordon et al., 1993; Andrieu et al., 2010):
\[
\begin{aligned}
X_t &= \frac{1}{2} X_{t-1} + 25\, \frac{X_{t-1}}{1 + X_{t-1}^2} + 8 \cos(1.2t) + V_t, \qquad t = 1, 2, \ldots, T, \\
Y_t &= \frac{X_t^2}{20} + W_t, \qquad t = 0, 1, \ldots, T,
\end{aligned} \tag{3.43}
\]

where T = 511, and X0, V1, ..., VT, W0, ..., WT are independent with X0 ∼ N(0, 1), Vt ∼ N(0, τ²) and Wt ∼ N(0, σ²). We explore three different parameter settings, (τ, σ) ∈ {(1, 1), (1, 5), (5, 1)}.

As the smoothing distributions have no analytic form, we require estimates of them as a benchmark. We achieve this by computing the solution of a finite-space HMM transformed from the original HMM. Here, 'finite-space' refers to the sample space of the hidden states being finite. The construction of the finite-space HMM will be presented in Section 3.10.2.

3.10.2 Benchmark

We aim for a new HMM which meets two conditions: it approximates the non-linear Gaussian HMM, and therefore its smoothing solution; additionally, its own smoothing solution can be obtained straightforwardly.

We achieve these conditions by discretising the sample space of each hidden
state into a finite space, and hence refer to the new HMM as the finite-space
HMM.

We first illustrate a grid method, which approximates a continuous univariate random variable Z ∈ R using a discrete random variable Ẑ whose sample space is finite. We discretise the sample space of Z, and then define its probability mass function. We denote the cumulative distribution function (CDF) of Z by Fz.

We select a grid of points G = {zi}ni=1 where z1 < z2 < . . . < zn (n ≥ 3) from the sample space of Z. The grid G constitutes the sample space of Ẑ, whose probability mass function is defined by
\[
\mathbb{P}(\hat{Z} = z_i) =
\begin{cases}
F_z\!\left(\dfrac{z_1 + z_2}{2}\right) & \text{if } i = 1, \\[6pt]
F_z\!\left(\dfrac{z_i + z_{i+1}}{2}\right) - F_z\!\left(\dfrac{z_{i-1} + z_i}{2}\right) & \text{if } i \in \{2, 3, \ldots, n-1\}, \\[6pt]
1 - F_z\!\left(\dfrac{z_{n-1} + z_n}{2}\right) & \text{if } i = n, \\[6pt]
0 & \text{otherwise.}
\end{cases} \tag{3.44}
\]

The probability mass of an interior point zi (i ∉ {1, n}) is the change of the CDF evaluated at (z_{i−1} + z_i)/2 and (z_i + z_{i+1})/2. Likewise for the end point z1 (resp. zn), where we can create an artificial point z0 = −∞ (resp. z_{n+1} = ∞). The choice of n and the positions of the grid points are specified by the user.
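A sketch of the grid method in (3.44) applied to a standard normal (the grid on [−3, 3] with spacing 0.1 is an illustrative choice):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def discretise(cdf, grid):
    """Probability mass function of the discretised variable per (3.44):
    interior masses are CDF increments between cell midpoints; the two
    end points absorb the tails."""
    n = len(grid)
    mids = [(grid[i] + grid[i + 1]) / 2.0 for i in range(n - 1)]
    pmf = [cdf(mids[0])]
    pmf += [cdf(mids[i]) - cdf(mids[i - 1]) for i in range(1, n - 1)]
    pmf.append(1.0 - cdf(mids[-1]))
    return pmf

grid = [-3.0 + 0.1 * i for i in range(61)]   # uniform grid on [-3, 3]
pmf = discretise(normal_cdf, grid)
```

The resulting masses sum to one by the telescoping construction, and the discretised mean and variance closely match those of N(0, 1).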

We apply the above technique to build a finite-space HMM associated with the non-linear HMM. We first discretise the sample space of each hidden state. We choose a uniform grid Gt for each Xt whose range is determined by a preliminary run of the bootstrap particle filter. The grid size, which is

the distance between two consecutive grid points, needs to be determined.
For each model, we compute the marginal smoothing means based upon the discretised HMMs with different grid sizes. We then compute the error made by the Monte Carlo algorithms under each grid size, and ensure that the differences between the errors are negligible for some relatively small grid sizes. We then choose one which achieves a relatively low computational cost. In this simulation study, we select the grid size to be 0.02 when τ = 1, σ = 1, and 0.05 when τ = 1, σ = 5 and when τ = 5, σ = 1. Each grid constructed from Xt becomes the sample space of its discrete analogue denoted by X̂t. The sample space of each observation Yt is unchanged.

We further define the dynamics of the finite-space HMM. The prior of X̂0 can be constructed from the grid G0 based upon (3.44). We apply the same formula to obtain the transition mass function p(x̂t|x̂t−1) from the grid Gt given x̂t−1 ∈ Gt−1. The emission density p(yt|x̂t) does not need to be discretised given x̂t ∈ Gt.

Having defined the finite-space HMM, we attain its smoothing solution by running the forward filtering backward smoothing algorithm (FFBSm) described in Section 3.5.1. We first derive the filtering distributions. We can obtain p(x̂0|y0) by Bayes' theorem at t = 0. Then we proceed forward to recursively compute {p(x̂t|y0:t)}ᵀₜ₌₁. To see this, we assume the probability mass p(x̂t|y0:t) is known at time t. Deducing p(x̂t+1|y0:t+1) is based upon the following decomposition:

  p(x̂t+1|y0:t+1) ∝ p(yt+1|x̂t+1) p(x̂t+1|y0:t)
                 = p(yt+1|x̂t+1) Σ_{x̂t ∈ Gt} p(x̂t+1|x̂t) p(x̂t|y0:t).

We then work out the smoothing distributions of the finite-space HMM. At time t = T, the smoothing distribution is identical to the filtering distribution. We then perform backward recursions using (3.16) from Section 3.5.1, where

  p(x̂t|y0:T) = p(x̂t|y0:t) Σ_{x̂t+1 ∈ Gt+1} [ p(x̂t+1|y0:T) p(x̂t+1|x̂t) / p(x̂t+1|y0:t) ]
             = p(x̂t|y0:t) Σ_{x̂t+1 ∈ Gt+1} [ p(x̂t+1|y0:T) p(x̂t+1|x̂t) / Σ_{x̂t′ ∈ Gt} p(x̂t+1|x̂t′) p(x̂t′|y0:t) ].

3.10.3 Metrics

Given the smoothing distributions {p(x̂t|y0:T)}ᵀₜ₌₀ from the finite-space HMM, we define the mean square error of means (MSEm) of a Monte Carlo smoothing algorithm, which targets the original non-linear HMM, in the mth simulation:

  MSEmₘ = (1/(T + 1)) Σᵀₜ₌₀ ( Êᵐ[Xt|y0:T] − E(X̂t|y0:T) )²,

where Êᵐ[Xt|y0:T] is the Monte Carlo mean of p(xt|y0:T) and E(X̂t|y0:T) is the smoothing mean of p(x̂t|y0:T) at time t from the finite-space HMM.
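In code the metric is a one-line average; the sketch below assumes the Monte Carlo means and the benchmark smoothing means are given as equal-length sequences over t = 0, . . . , T:

```python
def msem(mc_means, bench_means):
    """Mean square error of the Monte Carlo smoothing means against the
    finite-space benchmark, averaged over the T + 1 time steps."""
    assert len(mc_means) == len(bench_means)
    return sum((a - b) ** 2 for a, b in zip(mc_means, bench_means)) / len(mc_means)

assert msem([1.0, 2.0], [1.0, 2.0]) == 0.0
assert msem([1.0, 3.0], [0.0, 1.0]) == 2.5   # (1^2 + 2^2) / 2
```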

The non-linear structure of the HMM could result in an irregular shape of the smoothing distribution (see Figure 3.4), and the mean alone does not necessarily capture this. We additionally record the Kolmogorov–Smirnov (KS) statistic of a Kolmogorov–Smirnov test (KS test) (Massey Jr, 1951).

The KS test is concerned with the agreement between an empirical distribution and a hypothetical reference distribution. The null hypothesis postulates that the samples which generate the empirical distribution are indeed simulated from the reference distribution. The test employs the Kolmogorov–Smirnov (KS) statistic, which quantifies a distance between the empirical distribution and the reference distribution. It is defined as

  sup_x |F1,N(x) − F2(x)|,

where F1,N is the empirical cumulative distribution function (ECDF) generated by N samples, and F2 is the cumulative distribution function (CDF) of the reference distribution, which is usually assumed to be continuous. For the discrete case, the implementation of the KS test is extended in Arnold and Emerson (2011). We reject the null if the KS statistic is larger than a pre-specified threshold.

In our case, the KS test is not valid given the dependent samples generated by our smoothing algorithms, and we are not interested in rejecting or failing to reject the null hypothesis. Instead, we use the KS statistic as a metric: one way of measuring the sampling quality of a Monte Carlo algorithm that simulates samples from the reference distribution.
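A minimal Python sketch of the statistic used as a metric is given below. Since the reference distribution here is discrete, the supremum is attained either at a grid point or immediately below one, so it suffices to check both one-sided gaps at every grid point; the numerical values in the example are illustrative:

```python
from bisect import bisect_left, bisect_right

def ks_statistic(samples, grid, ref_cdf):
    """KS distance between the ECDF of `samples` and a discrete reference
    distribution supported on `grid`, whose CDF values at the grid points
    are `ref_cdf` (same length as `grid`)."""
    xs = sorted(samples)
    n = len(xs)
    d, prev_f = 0.0, 0.0
    for z, f in zip(grid, ref_cdf):
        ecdf_at = bisect_right(xs, z) / n      # ECDF evaluated at z
        ecdf_below = bisect_left(xs, z) / n    # ECDF just below z
        d = max(d, abs(ecdf_at - f), abs(ecdf_below - prev_f))
        prev_f = f
    return d

# Samples matching the reference exactly give distance zero;
# a point mass concentrated on the first grid point does not.
assert ks_statistic([0, 1, 1, 2], [0, 1, 2], [0.25, 0.75, 1.0]) == 0.0
assert abs(ks_statistic([0, 0, 0, 0], [0, 1, 2], [0.25, 0.75, 1.0]) - 0.75) < 1e-12
```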

In the context of the smoothing problem, F1,N is formed by the samples of a Monte Carlo smoothing algorithm which estimates p(xt|y0:T), and F2 is the CDF of the discrete distribution p(x̂t|y0:T) from the finite-space HMM. By choosing a suitable grid size for discretisation, we ensure that the error committed by discretisation when computing the KS statistic is negligible compared to the error committed by the smoothing algorithms under study. See Section 3.10.2 for a similar argument about the error of the discretised model. We denote the average KS statistic (KSS) over all time steps in the mth simulation of an algorithm by KSSₘ.

3.10.4 Comparison between TPS and Other Algorithms

We investigate the smoothing algorithms in the non-linear HMM. We run the same algorithms as in Section 3.9.2: BPS, FFBSm, FFBSi, TPS-L and TPS-EF. We eliminate TPS-F due to its undesirable performance in the simple linear Gaussian HMM. In TPS-EF, we employ the piecewise constant functions defined in (3.31) for the construction of the intermediate target distributions at the leaf nodes. We call the resulting algorithm TPS-EF-P and set N = n = 10000 as a benchmark, where the notation is inherited from Section 3.9.2. We correspondingly adjust the sample sizes in the other algorithms to achieve roughly the same computational effort.

We compare the mean square error of means (MSEm) and the KS statistic (KSS) over M = 500 simulations with the same set of observations.

The simulation results with different values of τ and σ are shown in Table 3.2. In the first two parameter settings, TPS-L shows the largest MSEm and KSS,
Table 3.2: Performance of the smoothing algorithms using comparable computational effort in the non-linear HMM. The best candidates in the last two columns for each parameter setting are marked in bold.

Parameter values: τ = 1, σ = 1
  Algorithm   N      n      MSEm (s.e.)        KSS (s.e.)
  BPS         40000  NA     0.0239 (0.00085)   0.16 (0.00038)
  FFBSm       315    315    0.0944 (0.01657)   0.15 (0.00017)
  FFBSi       320    320    0.1399 (0.02291)   0.15 (0.00018)
  TPS-L       13000  NA     0.3020 (0.00415)   0.21 (0.00099)
  TPS-EF-P    10000  10000  0.0050 (0.00016)   0.07 (0.00010)

Parameter values: τ = 1, σ = 5
  BPS         40000  NA     0.2086 (0.03064)   0.11 (0.00022)
  FFBSm       315    315    0.6785 (0.05285)   0.13 (0.00017)
  FFBSi       320    320    0.6071 (0.04981)   0.13 (0.00018)
  TPS-L       13000  NA     14.4847 (0.17897)  0.51 (0.00420)
  TPS-EF-P    10000  10000  0.3974 (0.01438)   0.09 (0.00040)

Parameter values: τ = 5, σ = 1
  BPS         40000  NA     1.2182 (0.05684)   0.23 (0.00077)
  FFBSm       315    315    3.4342 (0.22357)   0.18 (0.00031)
  FFBSi       320    320    3.2161 (0.20196)   0.18 (0.00027)
  TPS-L       13000  NA     0.4599 (0.00150)   0.13 (0.00005)
  TPS-EF-P    10000  10000  0.0849 (0.00278)   0.05 (0.00006)

especially when τ = 1 and σ = 5. We inspect this by plotting the CDFs of the following distributions at a particular time step t = 271: the intermediate target distribution ft at the leaf node in TPS-L, the filtering distribution p(x̂t|y0:t) and the marginal smoothing distribution p(x̂t|y0:T), both from the finite-space HMM. In Figure 3.8, the CDF of the TPS-L sampling distribution is wildly different from the marginal smoothing distribution, unlike the filtering one, which contributes to a very ineffective importance sampling step.

The other algorithms need to be discussed case by case in the different parameter settings. When τ = 1, σ = 1, TPS-EF-P shows a much smaller MSEm and KSS, followed by the BPS in terms of MSEm. However, the BPS has the largest KSS among these remaining algorithms. When τ = 1, σ = 5, TPS-EF-P has a larger MSEm than the BPS. In terms of KSS, TPS-EF-P outperforms the other smoothing algorithms. When τ = 5, σ = 1,
[Figure 3.8: CDF of the smoothing distribution, filtering distribution and sampling distribution of TPS-L at time step t = 271 in the non-linear HMM when τ = 1, σ = 5.]

TPS-EF-P and TPS-L produce dominant results with vastly smaller MSEm. The reason is that the relatively large variance in the transition density decreases the correlation between the hidden states, which makes the merging steps in TPS more effective. TPS-EF-P in this parameter setting exhibits the smallest KSS, whereas the BPS gives the largest despite generating the most samples.

To conclude, TPS-EF-P and TPS-L perform very well when τ/σ is large. TPS-EF-P has a more stable and appreciable performance, providing low MSEm and consistently the smallest KSS. In contrast, the results of TPS-L may be misleading due to their instability. The BPS works well in terms of MSEm in some situations, but poorly in terms of KSS. FFBSm and FFBSi produce less accurate results under roughly the same computational budget due to their higher complexity.

3.10.5 Comparison between TPS-EF and TPS-ES

We compare TPS-EF and TPS-ES in the non-linear HMM. TPS-ES demands an additional run of a smoothing algorithm compared to TPS-EF, and hence may not be a fair competitor under the same computational effort. In this section, we compare the algorithms in two situations: under the same effort and under the same sample size.

We now describe the implementations of the two algorithms. For TPS-EF, we apply the same algorithm TPS-EF-P as described in Section 3.10.4. For TPS-ES, we choose TPS-EF-P as the preliminary smoothing algorithm, which generates samples for constructing the intermediate targets at the leaf nodes of TPS-ES. We then apply piecewise constant functions to form these distributions, and thus call the algorithm TPS-ES-P.

We specify the parameters in TPS-EF-P and TPS-ES-P. In TPS-EF-P, and in the preliminary run of TPS-EF-P within TPS-ES-P, the intermediate target distributions at the leaf nodes are both built from n Monte Carlo samples generated by the BPFs. In TPS-ES-P, the intermediate target distributions at the leaf nodes are built from n′ samples generated by TPS-EF-P, where the two tuning parameters αs and αf appearing in (3.32) and (3.33) are both set to 0.95.

We require significantly larger sample sizes n, n′ and N in the simulations compared to those in Section 3.10.4. This setting guarantees a decent estimation of the marginal smoothing distributions at the leaf nodes of TPS-ES-P, and hence an overall good performance of the algorithm. We run TPS-EF-P and TPS-ES-P in two scenarios: the first one sets N = n = 50000

Table 3.3: Performance between TPS-EF-P and TPS-ES-P in the non-linear HMM. The best candidates in the last two columns for each parameter setting are marked in bold.

Parameter values: τ = 1, σ = 1
  Algorithm   N      n      n′     MSEm (s.e.)         KSS (s.e.)
  TPS-EF-P    50000  50000  NA     0.00120 (0.00027)   0.035 (0.00005)
  TPS-ES-P    50000  50000  50000  0.00061 (0.00011)   0.023 (0.00004)
  TPS-ES-P    18000  50000  25000  0.00148 (0.00024)   0.030 (0.00005)

Parameter values: τ = 1, σ = 5
  TPS-EF-P    50000  50000  NA     0.10140 (0.03082)   0.047 (0.00017)
  TPS-ES-P    50000  50000  50000  0.08404 (0.00597)   0.038 (0.00047)
  TPS-ES-P    18000  50000  25000  0.16027 (0.00838)   0.051 (0.00044)

Parameter values: τ = 5, σ = 1
  TPS-EF-P    50000  50000  NA     0.01621 (0.01212)   0.029 (0.00004)
  TPS-ES-P    50000  50000  50000  0.01362 (0.00346)   0.023 (0.00003)
  TPS-ES-P    18000  50000  25000  0.02363 (0.00408)   0.029 (0.00004)

in TPS-EF-P, and N = n = n′ = 50000 in TPS-ES-P. This implies more computational effort for the additional run of the preliminary smoothing algorithm in TPS-ES-P. The second setting forces the two algorithms to have roughly the same effort by decreasing N and n′ in TPS-ES-P.

We compare TPS-EF-P and TPS-ES-P in terms of MSEm and KSS. The results over M = 200 simulations are shown in Table 3.3. TPS-ES-P shows an evident improvement over TPS-EF-P under the same sample size, with consistently smaller MSEm and KSS. However, it does not outperform TPS-EF-P under roughly the same effort.

To conclude, TPS-ES is recommended when we aim for a fixed sample size with a sufficient computational budget. Otherwise, we should choose TPS-EF given a limited budget.

3.11 Discussion

This chapter introduces the tree-based particle smoothing algorithm (TPS) built from divide-and-conquer sequential Monte Carlo (D&C SMC) (Lindsten et al., 2017) to estimate the joint smoothing distribution p(x0:T|y0:T) in a hidden Markov model (HMM). The algorithm decomposes an HMM into sub-models based upon a binary tree structure with intermediate target distributions defined at the non-root nodes. The root stands for our target, which is the joint smoothing distribution.

We propose one generic way of constructing the binary tree, which sequentially splits the hidden states X0:T. We then discuss a general sampling procedure in TPS. To obtain the samples at a non-leaf node, we merge the particles from its two children using importance sampling. The merging process can be accompanied by an optional proliferation and resampling step. The computational complexity of this sampling procedure is adjustable and can be linear with respect to the required sample size.

Using the above settings, we investigate four algorithms with different types of intermediate target distributions: TPS-L, TPS-EF, TPS-ES and TPS-F.

TPS-L (Lindsten et al., 2017) constructs a class of intermediate target distributions in a very simple way with no additional tuning algorithms. The target density is equivalent to the unnormalised likelihood of an HMM which has the same dynamics as the original HMM except for the prior, and which bears part of its observations. However, TPS-L is at risk of providing very poor estimates, since its intermediate targets at the leaf nodes only condition on a single observation.

TPS-EF employs estimated (joint) filtering distributions as the intermediate targets. It is straightforward to run after an initial execution of a filtering algorithm. Nevertheless, the proposal in the importance sampling step can still be ineffective in some highly non-linear and complicated HMMs.

TPS-ES builds the intermediate targets which estimate the (joint) smooth-
ing distributions. It roughly maintains the marginal of each hidden state
from the intermediate target distributions invariant at all levels of the aux-
iliary tree. The algorithm is more computationally intensive and demands
preliminary runs of a filtering and a (marginal) smoothing algorithm.

TPS-F enjoys the exact (joint) filtering distributions as the intermediate targets. It utilises the particles directly from a filtering algorithm. The algorithm, however, has a fixed quadratic complexity with respect to the sample size.

We further illustrate the construction of the estimated filtering distributions and the estimated marginal smoothing distributions, which appear as the intermediate targets at the leaf nodes in TPS-EF and TPS-ES, based upon Monte Carlo samples. Taking into account both accuracy and efficiency, we recommend parametric approaches such as normal assumptions in a linear Gaussian HMM, and non-parametric approaches such as piecewise constant functions in a non-linear HMM.

We propose two metrics, RESS and MRESS, which measure the sampling quality of the importance sampling steps in TPS. We have some findings under a specific proliferation process called mixture sampling (Lindsten et al., 2017): low MRESS often indicates a poor marginal proposal with respect to the marginal target distribution, whereas large MRESS and low RESS usually imply a strong correlation within the target variable, which we ignore in building
the proposal due to the assumed independence structure in TPS.

In the simulation studies, TPS effectively prevents serious path degeneracy, in contrast to some previous algorithms. We introduce a metric called the effective sample size of the empirical distribution (ESSoED) to quantify sample diversity. TPS empirically attains a large and stable ESSoED for all hidden states owing to the fewer updates induced by the tree structure. On the contrary, the bootstrap particle smoother cannot escape from degeneracy and outputs a very low ESSoED at early time steps. Other conventional methods provide a low ESSoED due to their higher complexity when the algorithms are compared under comparable effort.

In terms of sampling accuracy, TPS-L has the smallest error in the linear Gaussian HMM while showing very unstable results in the different settings of the non-linear HMM. TPS-EF exhibits more desirable simulation outcomes. It produces the smallest mean square error in the linear Gaussian HMM, and consistently the smallest average Kolmogorov–Smirnov (KS) statistic in the non-linear HMM. In particular, it outperforms the other algorithms substantially when the variance of the transition density is much larger than that of the emission density. TPS-ES achieves a better KS statistic than TPS-EF under the same sample size. Nevertheless, its improved accuracy comes at the expense of an additional run of a smoothing algorithm. TPS-ES is preferred when a fixed output sample size is demanded and a sufficient computational budget is available.

The investigation of TPS for longer time series can be pursued in the future. One advantage of the divide-and-conquer approach is its parallel or distributed implementation for lower runtime cost (Lindsten et al., 2017). Given pre-determined intermediate target distributions, TPS can be employed in parallel or in a distributed computing environment. However, the preliminary run of a filtering or smoothing algorithm for constructing the intermediate target distributions may prevent an efficient implementation. One possible solution is to run independent filtering and smoothing algorithms on each machine given part of the observations. Another advantage of TPS is its comparatively high ESSoED under a cost constraint, which could become more pronounced for longer time series.

To conclude, TPS presents a new Monte Carlo approach to the smoothing problem with the following advantages: its sampling flow, which follows a binary tree structure, can effectively mitigate path degeneracy; we can adaptively choose and construct the intermediate target distributions, which potentially produce better proposals in the importance sampling steps; and TPS can achieve a linear complexity with respect to the sample size. Nevertheless, its performance may depend on the tuning algorithms. Due to its flexible and relatively fast implementations, which produce noteworthy simulation results, we regard it as a strong competitor among Monte Carlo smoothing algorithms.

4 Tree-based Sampling Algorithms for Parameter Estimation in a Hidden Markov Model

4.1 Introduction

A hidden Markov model (HMM) is a bivariate discrete-time stochastic process {Xt, Yt}t∈N where the latent process {Xt}t∈N is an unobserved Markov process, and the distribution of the observation Yt only depends on Xt. In this chapter, we assume the following densities exist with respect to some dominating measure and are denoted by

X0 ∼ p0 ( · )

Xt+1 |{Xt = xt } ∼ pθ ( · |xt ) for t = 0, . . . , T − 1,

Yt |{Xt = xt } ∼ pθ ( · |xt ) for t = 0, . . . , T,

where θ ∈ Θ is the model parameter and T is the final time step of the
process.
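For concreteness, generating one realisation from these dynamics is straightforward. The sketch below uses an illustrative linear Gaussian instance (X0 ∼ N(0, 1), Xt+1|xt ∼ N(θxt, 1), Yt|xt ∼ N(xt, 1)); the model and the parameter value are assumptions made only for the example:

```python
import random

def simulate_hmm(theta, T, seed=0):
    """Draw one path (x_{0:T}, y_{0:T}) from an illustrative linear Gaussian HMM:
    X_0 ~ N(0, 1), X_{t+1} | x_t ~ N(theta * x_t, 1), Y_t | x_t ~ N(x_t, 1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0)]                      # X_0 ~ p_0
    for _ in range(T):
        xs.append(rng.gauss(theta * xs[-1], 1.0))   # X_{t+1} | x_t
    ys = [rng.gauss(x, 1.0) for x in xs]            # Y_t | x_t
    return xs, ys

xs, ys = simulate_hmm(theta=0.9, T=50)
assert len(xs) == len(ys) == 51
```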

The inference problem of the HMM can be categorised into two scenarios
where the model parameter θ is known and unknown respectively. Given an
HMM with all parameters specified, the algorithms for solving the filtering
and smoothing problems have been discussed in Chapter 3.

In many situations, the HMM contains an unknown parameter θ of primary interest in areas such as speech recognition (Bahl et al., 1986; Rabiner, 1989) and neuroscience (Paninski et al., 2010). Parameter estimation often becomes challenging due to an intractable posterior in some complicated models, and hence various powerful numerical tools are employed.

The methods for parameter estimation can be classified by two criteria (Kantas et al., 2015). Off-line methods perform inference with a fixed number of observations, while on-line methods update the estimate of the parameter sequentially as new observations become available. Alternatively, two different strategies, Bayesian inference or maximum likelihood (ML), can be utilised. The Bayesian approach imposes a prior on the unknown parameter and computes its posterior. The ML approach calculates the parameter value maximising the likelihood function conditional on the observations. Both Bayesian and ML methods can be applied in an off-line or on-line manner. In this chapter, we focus on off-line Bayesian methods.

The parameter estimation problem has been studied in previous work employing the off-line Bayesian approach. Kitagawa (1998) proposes two Monte Carlo algorithms which both simulate samples from the posterior distributions {p(θ, xt|y0:T)}ᵀₜ₌₀. The first algorithm applies forward filtering backward smoothing (FFBSm), described in Section 3.5.1, to the augmented space that corresponds to (θ, xt). The second one employs fixed-lag smoothing from a particle filter. Lee and Chia (2002) introduce a sequential Monte Carlo (SMC) algorithm with rejuvenation steps completed by Markov chain Monte Carlo (MCMC). Andrieu et al. (2010) propose an MCMC algorithm called the particle marginal Metropolis-Hastings (PMMH) sampler, which employs MCMC updates aided by SMC. Andrieu et al. (2010) also introduce a Gibbs sampler which iteratively samples from p(θ|x0:T, y0:T) and pθ(x0:T|y0:T), where the sampling procedure for pθ(x0:T|y0:T) admits a conditional SMC update. Whiteley (2010) and Lindsten et al. (2014) improve the Gibbs sampler by introducing backward simulation to rejuvenate samples.

In this chapter, we investigate a novel Bayesian approach using divide-and-conquer sequential Monte Carlo (D&C SMC) (Lindsten et al., 2017) to approximate the posterior distribution of the unknown parameter θ and the hidden states X0:T. We call the algorithm the tree-based parameter estimation algorithm (TPE). We are interested in the sampling process of p(θ, x0:T|y0:T):

  p(θ, x0:T|y0:T) ∝ µ(θ) p0(x0) ∏ᵀ⁻¹ₜ₌₀ pθ(xt+1|xt) ∏ᵀₜ₌₀ pθ(yt|xt),

where µ and p0 are the priors of the two independent random variables θ and X0.
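Pointwise evaluation of this unnormalised posterior is all that many samplers need. The sketch below evaluates it in log space for the illustrative linear Gaussian model of Section 4.1 (θ ∼ N(0, 1), X0 ∼ N(0, 1), Xt+1|xt ∼ N(θxt, 1), Yt|xt ∼ N(xt, 1)); every modelling choice here is an assumption made for the example:

```python
from math import log, pi

def log_norm(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (log(2 * pi * var) + (x - mean) ** 2 / var)

def log_posterior(theta, xs, ys):
    """Unnormalised log p(theta, x_{0:T} | y_{0:T}):
    log mu(theta) + log p0(x0) + sum_t log p_theta(x_{t+1}|x_t)
                               + sum_t log p_theta(y_t|x_t)."""
    lp = log_norm(theta, 0.0, 1.0) + log_norm(xs[0], 0.0, 1.0)
    lp += sum(log_norm(x1, theta * x0, 1.0) for x0, x1 in zip(xs, xs[1:]))
    lp += sum(log_norm(y, x, 1.0) for x, y in zip(xs, ys))
    return lp

lp = log_posterior(0.5, [0.0, 0.2, -0.1], [0.1, 0.3, 0.0])
assert lp < 0.0   # a finite, negative log density for these values
```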

We first construct an auxiliary tree for TPE which splits the HMM into sub-models using the divide-and-conquer approach (Lindsten et al., 2017). The tree divides the target variable (θ, X0:T) across multiple levels, and requires the random variables at the same level to contain disjoint hidden state(s) and a parameter variable. We also assume the random variables at the same level of the auxiliary tree are mutually independent. As required by D&C SMC (Lindsten et al., 2017), we will define intermediate target distributions of the sub-models at the non-root nodes of the tree. At the root, the target distribution is precisely p(θ, x0:T|y0:T).

The sampling procedure of TPE proceeds as follows: we simulate samples for approximating the intermediate target distributions at the leaves, independently between the nodes. We then sequentially merge the particles along the auxiliary tree until reaching the root. Importance sampling is employed in each merging step for the (intermediate) target distribution.

Based on the auxiliary tree of TPE, we denote the random variable at a leaf node Tj by (θj, Xj) and its target density by fθ,j. At a non-leaf node Tj:l, we denote the random variable by (θj,l, Xj:l) and the target density by fθ,j:l.

One challenge in TPE occurs in the design of the proposal for the importance sampling steps. Given a non-leaf node Tj:l, the target variables of its children Tj:k−1 (Tj if j = k − 1) and Tk:l (Tl if k = l) both contain an unknown parameter, namely θj,k−1 and θk,l. A simple proposal, namely the product measure on the product space of (Xj:k−1, θj,k−1) ∼ fθ,j:k−1 and (Xk:l, θk,l) ∼ fθ,k:l from the children, can be problematic: this construction yields a higher-dimensional variable (Xj:l, θj,k−1, θk,l) ∼ fθ,j:k−1 fθ,k:l containing overlapping parameter variables, compared to the target variable (Xj:l, θj,l) at Tj:l.

The main contribution of this chapter is a general framework for implementing the importance sampling steps in TPE, together with different classes of intermediate target distributions in the auxiliary tree.

The general importance sampling procedure proceeds as follows. At a non-leaf node Tj:l connecting Tj:k−1 (Tj if j = k − 1) and Tk:l (Tk if k = l), we construct an extended target variable (θj,l, ∆θj,l, Xj:l) where we assume ∆θj,l is independent of (θj,l, Xj:l) ∼ fθ,j:l. We denote the density of (θj,l, ∆θj,l, Xj:l) by f′θ,j:l. We still build the random variable (θj,k−1, θk,l, Xj:l) ∼ fθ,j:k−1 fθ,k:l from its children, and transform it to create the proposal (θj,l, ∆θj,l, Xj:l) using two functions θj,l = g1(θj,k−1, θk,l) and ∆θj,l = g2(θj,k−1, θk,l). By reweighting the samples from the proposal to target f′θ,j:l, the samples of (Xj:l, θj,l) ∼ fθ,j:l are easily accessible by marginalising from (Xj:l, ∆θj,l, θj,l) ∼ f′θ,j:l.

We illustrate a deterministic and a stochastic way of establishing the transformation functions θj,l = g1(θj,k−1, θk,l) and ∆θj,l = g2(θj,k−1, θk,l). The deterministic approach employs a fixed combination of (θj,k−1, θk,l), whereas the stochastic approach adds additional noise. Both methods rejuvenate the parameter samples, which effectively alleviates weight degeneracy.

We present two classes of intermediate target distributions in TPE, both inspired by consensus Monte Carlo (Scott et al., 2016). We treat each sub-model, which defines an intermediate target distribution, as a hidden Markov model, and call these sub-models sub-HMMs at the non-root nodes. The two proposed classes differ in the priors of the sub-HMMs. The first class, which we call TPE-O, imposes the original priors µ and p0, while the second, which we call TPE-EP, estimates the priors from the prediction distributions of the original HMM.

Furthermore, we modify TPE to construct a shallower auxiliary tree with fewer levels. The random variable at each leaf node contains multiple hidden states, and the generation of its samples is accomplished by a particle method. Given the initial samples at the leaf nodes, the rest of the sampling process is identical to TPE.

This chapter is structured as follows. We first review two parameter estimation algorithms, the particle marginal Metropolis-Hastings (PMMH) sampler (Andrieu et al., 2010) in Section 4.2 and sequential importance resampling for parameter estimation (SIR-PE) in Section 4.3, which both possess the same target as TPE. We characterise TPE in terms of the auxiliary tree and the sampling procedure in Section 4.4. Two types of intermediate target distributions are investigated in Section 4.5. We illustrate the extended TPE algorithm with a shallower auxiliary tree in Section 4.6. In Section 4.7, we propose the transformation functions g1 and g2. We proceed with a simulation study in a linear Gaussian HMM which involves a three-dimensional unknown parameter in Section 4.8. The chapter ends with a discussion in Section 4.9.

4.2 Particle Marginal Metropolis-Hastings Sampler (PMMH)

Markov chain Monte Carlo (MCMC) is a class of Monte Carlo algorithms. It simulates a Markov process whose equilibrium distribution is identical to the target distribution of interest (Brooks et al., 2011). Particle Markov chain Monte Carlo (PMCMC) algorithms apply MCMC and the particle method (see Section 3.4) to target p(θ, x0:T|y0:T) in an HMM (Andrieu et al., 2010). We review a PMCMC algorithm using Metropolis-Hastings updates called the particle marginal Metropolis-Hastings (PMMH) sampler, which utilises the SMC approximation of pθ(·|y0:T) as part of the proposal. A key feature of PMCMC is that these SMC approximations provide an 'exact' approximation of the target p(θ, x0:T|y0:T).

We now formulate the PMMH sampler. A natural first thought for the proposal density q(θ∗, x∗0:T|θ, x0:T) in a Metropolis-Hastings update has the form

  q(θ∗, x∗0:T|θ, x0:T) = q(θ∗|θ) pθ∗(x∗0:T|y0:T),      (4.1)

where the only degree of freedom is the choice of q(·|θ). The acceptance ratio in the Metropolis-Hastings step is hence

  1 ∧ [p(θ∗, x∗0:T|y0:T) q(θ, x0:T|θ∗, x∗0:T)] / [p(θ, x0:T|y0:T) q(θ∗, x∗0:T|θ, x0:T)] = 1 ∧ [µ(θ∗) pθ∗(y0:T) q(θ|θ∗)] / [µ(θ) pθ(y0:T) q(θ∗|θ)].      (4.2)

Andrieu et al. (2010) suggest using the SMC approximations in (4.1): for pθ∗(·|y0:T) when generating a sample, and for the marginal density pθ(y0:T). The proposal density therefore becomes

  q(θ∗, x∗0:T|θ, x0:T) = q(θ∗|θ) p̂θ∗(x∗0:T|y0:T),      (4.3)

where p̂θ∗(·|y0:T) is the SMC approximation of pθ∗(·|y0:T). The acceptance ratio becomes

  1 ∧ [µ(θ∗) p̂θ∗(y0:T) q(θ|θ∗)] / [µ(θ) p̂θ(y0:T) q(θ∗|θ)],      (4.4)

where p̂θ(y0:T) and p̂θ∗(y0:T) are the SMC approximations of the corresponding marginal densities. Andrieu et al. (2010) prove that these PMMH updates leave the target distribution p(θ, x0:T|y0:T) invariant, and that the acceptance ratio (4.4) converges to (4.2) under mild assumptions as the sample size N → ∞. Pitt et al. (2012) suggest choosing a sample size n in the smoother such that the standard deviation of the estimated log-likelihood log p̂θ̄(y0:T) evaluated at θ̄ is around 0.92, where θ̄ is roughly the posterior mean. See Algorithm 10 for the implementation of the PMMH sampler.
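A compact sketch of the sampler is given below, again on the illustrative linear Gaussian HMM used earlier in this chapter (θ ∼ N(0, 1), X0 ∼ N(0, 1), Xt+1|xt ∼ N(θxt, 1), Yt|xt ∼ N(xt, 1)). All model and tuning choices are assumptions, the random walk proposal is symmetric so the q terms in (4.4) cancel, and the path update is omitted: only θ and the likelihood estimate are tracked:

```python
import random
from math import exp, log, pi, sqrt

def norm_pdf(x, mean, var):
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

def bpf_loglik(theta, ys, n, rng):
    """Bootstrap particle filter estimate of log p_theta(y_{0:T})."""
    parts = [rng.gauss(0.0, 1.0) for _ in range(n)]      # X_0 ~ N(0, 1)
    loglik = 0.0
    for t, y in enumerate(ys):
        if t > 0:  # propagate through the transition X_t | x_{t-1} ~ N(theta x, 1)
            parts = [rng.gauss(theta * x, 1.0) for x in parts]
        ws = [norm_pdf(y, x, 1.0) for x in parts]        # Y_t | x_t ~ N(x_t, 1)
        loglik += log(sum(ws) / n)
        parts = rng.choices(parts, weights=ws, k=n)      # multinomial resampling
    return loglik

def pmmh(ys, iters, n, step=0.2, seed=0):
    """PMMH for theta with a N(0, 1) prior and a symmetric random walk proposal."""
    rng = random.Random(seed)
    theta, ll = 0.0, None
    ll = bpf_loglik(theta, ys, n, rng)
    chain = []
    for _ in range(iters):
        prop = theta + rng.gauss(0.0, step)
        ll_prop = bpf_loglik(prop, ys, n, rng)
        # log of (4.4): likelihood-estimate ratio times the prior ratio.
        log_ratio = (ll_prop - ll) - 0.5 * prop ** 2 + 0.5 * theta ** 2
        if rng.random() < exp(min(0.0, log_ratio)):
            theta, ll = prop, ll_prop
        chain.append(theta)
    return chain

chain = pmmh(ys=[0.1, 0.5, 0.4, 0.8], iters=50, n=100)
assert len(chain) == 50
```

Note that the stale likelihood estimate p̂θ(y0:T) of the current state is reused rather than recomputed; this is exactly what makes the chain target the correct posterior (Andrieu et al., 2010).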

4.3 Sequential Importance Resampling for Parameter Estimation (SIR-PE)

We describe a sampling algorithm that estimates p(θ, x0:T|y0:T) using sequential Monte Carlo (SMC). The algorithm applies sequential importance resampling (SIR) similarly to a standard bootstrap particle smoother (see Section 3.4.3); the only difference is an initial sampling step for the unknown parameter. We call the algorithm sequential importance resampling for parameter estimation (SIR-PE). Algorithm 11 executes SIR-PE, where N̂eff is
Algorithm 10: PMMH sampler
 1  for i = 1 do
 2      Choose an arbitrary start point θ⁽¹⁾;
 3      Generate a sample path x⁽¹⁾₀:T ∼ p̂θ⁽¹⁾(·|y0:T), where p̂θ⁽¹⁾(·|y0:T) is the particle approximation of pθ⁽¹⁾(·|y0:T);
 4      Denote the estimated density of pθ⁽¹⁾(y0:T) by p̂θ⁽¹⁾(y0:T);
 5  end
 6  for i = 2 to N do
 7      Propose θ∗ ∼ q(·|θ⁽ⁱ⁻¹⁾);
 8      Generate a sample path x∗₀:T ∼ p̂θ∗(·|y0:T), where p̂θ∗(·|y0:T) is the particle approximation of pθ∗(·|y0:T);
 9      Denote the estimated density of pθ∗(y0:T) by p̂θ∗(y0:T);
10      Set (θ⁽ⁱ⁾, x⁽ⁱ⁾₀:T) = (θ∗, x∗₀:T) and p̂θ⁽ⁱ⁾(y0:T) = p̂θ∗(y0:T) with probability

            1 ∧ [µ(θ∗) p̂θ∗(y0:T) q(θ⁽ⁱ⁻¹⁾|θ∗)] / [µ(θ⁽ⁱ⁻¹⁾) p̂θ⁽ⁱ⁻¹⁾(y0:T) q(θ∗|θ⁽ⁱ⁻¹⁾)];

        otherwise, set (θ⁽ⁱ⁾, x⁽ⁱ⁾₀:T) = (θ⁽ⁱ⁻¹⁾, x⁽ⁱ⁻¹⁾₀:T) and p̂θ⁽ⁱ⁾(y0:T) = p̂θ⁽ⁱ⁻¹⁾(y0:T);
11  end

the effective sample size (see Section 3.4.2) and Nthres is the threshold for a resampling step. Note that the samples of the parameter are only generated once and are sequentially reweighted afterwards. Hence, the path degeneracy problem is very likely to occur for t ≪ T. Gilks and Berzuini (2001) and Lee and Lee (2006) additionally rejuvenate the particles via MCMC to increase their diversity. In some cases when T is small, SIR-PE is fast and efficient, and it will be used as a preliminary run in an extended version of TPE introduced in Section 4.6.
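The essence of SIR-PE fits in a few lines. The sketch below again assumes the illustrative linear Gaussian HMM (θ ∼ N(0, 1), X0 ∼ N(0, 1), Xt+1|xt ∼ N(θxt, 1), Yt|xt ∼ N(xt, 1)); for simplicity it resamples at every step rather than when N̂eff < Nthres, and it tracks only the current state rather than full paths:

```python
import random
from math import exp, pi, sqrt

def norm_pdf(x, mean, var):
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

def sir_pe(ys, n, seed=0):
    """SIR-PE sketch: theta is sampled once from its prior at t = 0 and is
    only carried through the resampling steps afterwards, which is exactly
    why path degeneracy in the theta-samples arises for long series."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n)]     # theta ~ mu
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]         # x_0 ~ p_0
    for t, y in enumerate(ys):
        if t > 0:                                        # x_t | x_{t-1}, theta
            xs = [rng.gauss(th * x, 1.0) for th, x in zip(thetas, xs)]
        ws = [norm_pdf(y, x, 1.0) for x in xs]           # weight by the emission
        idx = rng.choices(range(n), weights=ws, k=n)     # resample every step
        thetas = [thetas[i] for i in idx]
        xs = [xs[i] for i in idx]
    return thetas

thetas = sir_pe([0.1, 0.5, 0.4, 0.8], n=500)
assert len(thetas) == 500
```

Counting the distinct values left in `thetas` after many steps makes the degeneracy discussed above directly visible.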

4.4 Tree-based Parameter Estimation Algorithm (TPE)

This section elaborates a divide-and-conquer sampling algorithm (Lindsten et al., 2017), which we refer to as the tree-based parameter estimation algorithm
Algorithm 11: Sequential importance resampling for parameter estimation (SIR-PE)
 1  for t = 0 do
 2      for i = 1 to N do
 3          Sample θ̃⁽ⁱ⁾₀ ∼ µ(·) and x̃⁽ⁱ⁾₀ ∼ p0(·);
 4          Compute the unnormalised importance weight w̃⁽ⁱ⁾₀ = pθ̃⁽ⁱ⁾₀(y0|x̃⁽ⁱ⁾₀);
 5      end
 6      if N̂eff < Nthres then
 7          Implement the resampling step and denote the resampled particles (with normalised weights) by {(θ⁽ⁱ⁾₀, x⁽ⁱ⁾₀), W⁽ⁱ⁾₀}ᴺᵢ₌₁;
 8      else
 9          Calculate the normalised weights {W⁽ⁱ⁾₀}ᴺᵢ₌₁ and obtain {(θ⁽ⁱ⁾₀ = θ̃⁽ⁱ⁾₀, x⁽ⁱ⁾₀ = x̃⁽ⁱ⁾₀), W⁽ⁱ⁾₀}ᴺᵢ₌₁;
10      end
11  end
12  for t = 1 to T do
13      for i = 1 to N do
14          Sample x̃⁽ⁱ⁾ₜ ∼ pθ⁽ⁱ⁾ₜ₋₁(·|x⁽ⁱ⁾ₜ₋₁), let x̃⁽ⁱ⁾₀:ₜ = (x⁽ⁱ⁾₀:ₜ₋₁, x̃⁽ⁱ⁾ₜ) and θ̃⁽ⁱ⁾ₜ = θ⁽ⁱ⁾ₜ₋₁;
15          Compute the unnormalised importance weight w̃⁽ⁱ⁾ₜ = W⁽ⁱ⁾ₜ₋₁ pθ̃⁽ⁱ⁾ₜ(yₜ|x̃⁽ⁱ⁾ₜ);
16      end
17      if N̂eff < Nthres then
18          Implement the resampling step and denote the resampled particles (with normalised weights) by {(θ⁽ⁱ⁾ₜ, x⁽ⁱ⁾₀:ₜ), W⁽ⁱ⁾ₜ}ᴺᵢ₌₁;
19      else
20          Calculate the normalised weights {W⁽ⁱ⁾ₜ}ᴺᵢ₌₁ and denote the normalised weighted particles by {(θ⁽ⁱ⁾ₜ = θ̃⁽ⁱ⁾ₜ, x⁽ⁱ⁾₀:ₜ = x̃⁽ⁱ⁾₀:ₜ), W⁽ⁱ⁾ₜ}ᴺᵢ₌₁;
21      end
22  end

158
(TPE), to approximate the posterior distribution p(θ, x0:T |y0:T ) in a hidden
Markov model (HMM).

We first provide one way of constructing an auxiliary tree of TPE, which


splits the target variable recursively until a stopping rule is reached. We then
illustrate a general sampling procedure which produces the samples from the
(intermediate) target distribution via importance sampling at a non-leaf node
in the tree.

4.4.1 Construction of the Auxiliary Tree

Motivated by divide-and-conquer sequential Monte Carlo (Lindsten et al., 2017), TPE partitions the HMM into sub-models based upon an auxiliary binary tree. TPE first divides the target variable (θ, X_{0:T}) into two subsets which have disjoint hidden states and each contain a variable for the unknown parameter. The algorithm applies this routine recursively to the resulting subsets until each subset contains only a single hidden state and a parameter variable.

We now construct the auxiliary tree. The variable at a non-leaf node T_{j:l} is denoted by (θ_{j,l}, X_{j:l}), whose target density is f_{θ,j:l}. We choose the split point k to divide the hidden states X_{j:l} = (X_j, ..., X_l) into X_{j:k−1} and X_{k:l}:

k = j + 2^p,   (4.5)

where p = ⌈log_2(l − j + 1)⌉ − 1. This choice is justified in Section 3.6, where the auxiliary tree in the tree-based particle smoothing algorithm (TPS) is constructed. We also attach the parameter variables, denoted by θ_{j,k−1} and θ_{k,l}, to the hidden states X_{j:k−1} and X_{k:l}, respectively. Let T_{j:k−1} and T_{k:l} be
Figure 4.1: Auxiliary tree of TPE constructed from an HMM when T = 5. [The root (θ = θ_{0,5}, X_{0:5}) at level 3 splits into (θ_{0,3}, X_{0:3}) and (θ_{4,5}, X_{4:5}) at level 2; level 1 holds (θ_{0,1}, X_{0:1}), (θ_{2,3}, X_{2:3}), (θ_4, X_4) and (θ_5, X_5); level 0 holds the leaves (θ_0, X_0), ..., (θ_5, X_5).]

the children of Tj:l . Then, the target random variables at Tj:k−1 and Tk:l are
(θj,k−1 , Xj:k−1 ) and (θk,l , Xk:l ) respectively, whose intermediate target den-
sities are denoted by fθ,j:k−1 and fθ,k:l . In the case of j = l, we stop the
division and treat the node as a leaf containing the random variable (θj , Xj )
distributed with density f_{θ,j}. Starting from the root of the tree, which contains the target variable (θ = θ_{0,T}, X_{0:T}), the algorithm recursively creates children according to the rule in (4.5) until each node contains a single hidden state and a parameter variable.
when T = 5 is shown in Figure 4.1. We similarly mark the level of the nodes
as described in Section 3.6.1.
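The split rule (4.5) and the recursive construction of the tree can be sketched as follows (a minimal illustration; the function names are ours, and nodes are represented by their index pairs (j, l)):

```python
import math

def split_point(j, l):
    """Split rule (4.5): k = j + 2^p with p = ceil(log2(l - j + 1)) - 1."""
    p = math.ceil(math.log2(l - j + 1)) - 1
    return j + 2 ** p

def build_tree(j, l):
    """Auxiliary tree over (theta_{j,l}, X_{j:l}); each leaf holds one state.
    A leaf is the pair (j, j); a non-leaf node is (node, left child, right child)."""
    if j == l:
        return (j, l)
    k = split_point(j, l)
    return ((j, l), build_tree(j, k - 1), build_tree(k, l))
```

For T = 5, the root (0, 5) splits into (0, 3) and (4, 5), reproducing the structure of Figure 4.1.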

4.4.2 Sampling Procedure

We describe a sampling procedure which simulates samples from the (intermediate) target distribution in TPE.

At a leaf node T_j, we assume that we can sample directly from f_{θ,j}. At a non-leaf node T_{j:l}, we first apply the product measure defined on

(θ_{j,k−1}, X_{j:k−1}) ∼ f_{θ,j:k−1} and (θ_{k,l}, X_{k:l}) ∼ f_{θ,k:l}

from the children to create the random variable

(θ_{j,k−1}, θ_{k,l}, X_{j:k−1}, X_{k:l}) = (θ_{j,k−1}, θ_{k,l}, X_{j:l})

with density f_{θ,j:k−1} f_{θ,k:l}. We employ a one-to-one transformation of

(θj,k−1 , θk,l , Xj:k−1 , Xk:l ) ∼ fθ,j:k−1 fθ,k:l

to construct (θ_{j,l}, ∆θ_{j,l}, X_{j:k−1}, X_{k:l}) using the functions

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) and ∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}).   (4.6)

We denote the density of (θ_{j,l}, ∆θ_{j,l}, X_{j:k−1}, X_{k:l}) by h′_{θ,j:l}, which will be used as a proposal. By the transformation rule of random variables, we obtain

h′_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:k−1}, x_{k:l}) = f_{θ,j:k−1}( g_1^{−1}(θ_{j,l}, ∆θ_{j,l}), x_{j:k−1} ) f_{θ,k:l}( g_2^{−1}(θ_{j,l}, ∆θ_{j,l}), x_{k:l} ) |J(θ_{j,l}, ∆θ_{j,l})|,   (4.7)

where g_1^{−1} and g_2^{−1} are the inverse transformation functions and J(θ_{j,l}, ∆θ_{j,l}) is the Jacobian matrix.

We expand the probability space of the target variable (θ_{j,l}, X_{j:l}) to conform to that of (θ_{j,l}, ∆θ_{j,l}, X_{j:l}) in the proposal h′_{θ,j:l}. We concatenate a new independent random variable ∆θ_{j,l} with a pre-defined distribution f̃_{j,l} to the target variable (θ_{j,l}, X_{j:l}). We denote the density of the extended target variable (θ_{j,l}, ∆θ_{j,l}, X_{j:l}) by f′_{θ,j:l}, which is defined as the product of f̃_{j,l} and f_{θ,j:l}:

f′_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}) = f_{θ,j:l}(θ_{j,l}, x_{j:l}) f̃_{j,l}(∆θ_{j,l}).   (4.8)

We then apply importance sampling to simulate from f′_{θ,j:l} using the proposal h′_{θ,j:l}. We select the samples that correspond to (θ_{j,l}, X_{j:l}) from (θ_{j,l}, ∆θ_{j,l}, X_{j:l}) ∼ f′_{θ,j:l} to complete the sampling process at the node.

Practically, we adopt the pre-stored normalised weighted samples

S_1 = {(θ̃_{j,k−1}^(i), x̃_{j:k−1}^(i)), W̃_{j,k−1}^(i)}_{i=1}^N ∼ f_{θ,j:k−1} from T_{j:k−1}

and

S_2 = {(θ̃_{k,l}^(i), x̃_{k:l}^(i)), W̃_{k,l}^(i)}_{i=1}^N ∼ f_{θ,k:l} from T_{k:l}.

We proliferate the samples in S_1 and S_2 (see Section 3.6.3) if necessary to produce

S_1′ = {(θ̃_{j,k−1}^(a_i), x̃_{j:k−1}^(a_i)), Ŵ_{j,k−1}^(a_i)}_{i=1}^{N′} ∼ f_{θ,j:k−1}

and

S_2′ = {(θ̃_{k,l}^(b_i), x̃_{k:l}^(b_i)), Ŵ_{k,l}^(b_i)}_{i=1}^{N′} ∼ f_{θ,k:l},

where N′ is the number of samples after proliferation, {a_i}_{i=1}^{N′} and {b_i}_{i=1}^{N′} are the returned indices and {Ŵ_{j,k−1}^(a_i), Ŵ_{k,l}^(b_i)}_{i=1}^{N′} are the updated weights. If proliferation is not required, we simply set N′ = N, {a_i = i, b_i = i}_{i=1}^N and {Ŵ_{j,k−1}^(a_i) = W̃_{j,k−1}^(a_i), Ŵ_{k,l}^(b_i) = W̃_{k,l}^(b_i)}_{i=1}^N.

We denote the merged samples by

S′ = {(θ̃_{j,k−1}^(a_i), θ̃_{k,l}^(b_i), x̃_{j:k−1}^(a_i), x̃_{k:l}^(b_i)), w̃_{j,l}^(i) = Ŵ_{j,k−1}^(a_i) Ŵ_{k,l}^(b_i)}_{i=1}^{N′} ∼ f_{θ,j:k−1} f_{θ,k:l}.

We then transform the samples of (θ_{j,k−1}, θ_{k,l}, X_{j:k−1}, X_{k:l}) in S′ according to the functions g_1 and g_2, and denote the resulting samples by

S″ = {(θ̃_{j,l}^(i), ∆θ̃_{j,l}^(i), x̃_{j:l}^(i)), w̃_{j,l}^(i)}_{i=1}^{N′} ∼ h′_{θ,j:l},

where x̃_{j:l}^(i) = (x̃_{j:k−1}^(a_i), x̃_{k:l}^(b_i)), θ̃_{j,l}^(i) = g_1(θ̃_{j,k−1}^(a_i), θ̃_{k,l}^(b_i)) and ∆θ̃_{j,l}^(i) = g_2(θ̃_{j,k−1}^(a_i), θ̃_{k,l}^(b_i)). In particular, the parameter samples are rejuvenated in this step by combining the counterparts from the children via g_1. We then reweight the samples in S″ to target f′_{θ,j:l} using importance sampling and resample again to attain N particles. We select the samples with respect to (θ_{j,l}, X_{j:l}) to finish the sampling process.
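The merge at a non-leaf node can be sketched as follows for a scalar parameter, using the average of the overlapping parameters as g_1 and their difference as g_2; the `log_weight_update` callback is our stand-in for the model-specific importance-sampling correction f′/h′ of (4.9), and all names are ours.

```python
import numpy as np

def merge_children(theta1, w1, theta2, w2, log_weight_update, rng=None):
    """Sketch of the merge at a non-leaf node for a scalar parameter.
    g1 averages the overlapping parameters (rejuvenation), g2 keeps their
    difference; `log_weight_update(theta, dtheta)` supplies the model-specific
    importance-sampling correction of eq. (4.9)."""
    rng = np.random.default_rng(rng)
    theta = 0.5 * (theta1 + theta2)              # g1: combined parameter
    dtheta = theta1 - theta2                     # g2: auxiliary difference
    logw = np.log(w1) + np.log(w2) + log_weight_update(theta, dtheta)
    w = np.exp(logw - logw.max()); w /= w.sum()  # normalise product weights
    idx = rng.choice(len(theta), size=len(theta), p=w)   # resample to N
    return theta[idx], dtheta[idx]
```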

The function TPE_gen in Algorithm 12 describes the sampling procedure targeting the density f_{θ,j:l}. The algorithm outputs the samples {(θ_{j,l}^(i), x_{j:l}^(i)), W_{j,l}^(i)}_{i=1}^N ∼ f_{θ,j:l}. TPE applies TPE_gen recursively from the leaf nodes to the root, and thus produces samples from the target distribution f_{θ,0:T}(θ, x_{0:T}) = p(θ, x_{0:T} | y_{0:T}).

4.5 Intermediate Target Distributions

We propose two types of intermediate target distributions in TPE motivated


by consensus Monte Carlo (Scott et al., 2016). We briefly describe CMC
and relate it to our sub-models in TPE. We then construct the intermediate
target distributions of the sub-models in Section 4.5.2 and Section 4.5.3.

4.5.1 Relation to Consensus Monte Carlo

Consensus Monte Carlo (CMC) approximates the posterior distribution of an


unknown parameter from the observed data (Scott et al., 2016). It divides

Algorithm 12: Sampling process TPE_gen(j, l) which targets f_{θ,j:l} at T_{j:l} in TPE
1 if j = l then
2   for i = 1 to N do
3     Simulate (θ_j^(i), x_j^(i)) ∼ f_{θ,j}(·);
4   end
5   Denote the normalised weighted particles by {(θ_j^(i), x_j^(i)), W_j^(i) = 1/N}_{i=1}^N;
6 else
7   Let p = ⌈log_2(l − j + 1)⌉ − 1 and k = j + 2^p;
8   Adopt the samples
      S_1 = {(θ̃_{j,k−1}^(i), x̃_{j:k−1}^(i)), W̃_{j,k−1}^(i)}_{i=1}^N ← TPE_gen(j, k − 1) from T_{j:k−1}
    and
      S_2 = {(θ̃_{k,l}^(i), x̃_{k:l}^(i)), W̃_{k,l}^(i)}_{i=1}^N ← TPE_gen(k, l) from T_{k:l};
9   Denote the merged particles after a potential proliferation step by
      S′ = {(θ̃_{j,k−1}^(a_i), θ̃_{k,l}^(b_i), x̃_{j:k−1}^(a_i), x̃_{k:l}^(b_i)), w̃_{j,l}^(i)}_{i=1}^{N′},
    where N′ is the updated sample size, {a_i}_{i=1}^{N′}, {b_i}_{i=1}^{N′} are the updated indices and {w̃_{j,l}^(i)}_{i=1}^{N′} are the updated weights;
10  Denote the transformed samples from S′ by
      S″ = {(θ̃_{j,l}^(i), ∆θ̃_{j,l}^(i), x̃_{j:l}^(i)), w̃_{j,l}^(i)}_{i=1}^{N′},
    where θ̃_{j,l}^(i) = g_1(θ̃_{j,k−1}^(a_i), θ̃_{k,l}^(b_i)), ∆θ̃_{j,l}^(i) = g_2(θ̃_{j,k−1}^(a_i), θ̃_{k,l}^(b_i)) and x̃_{j:l}^(i) = (x̃_{j:k−1}^(a_i), x̃_{k:l}^(b_i));
11  for i = 1 to N′ do
12    Compute the unnormalised weight:
        ŵ_{j,l}^(i) = w̃_{j,l}^(i) f′_{θ,j:l}(θ̃_{j,l}^(i), ∆θ̃_{j,l}^(i), x̃_{j:l}^(i)) / h′_{θ,j:l}(θ̃_{j,l}^(i), ∆θ̃_{j,l}^(i), x̃_{j:l}^(i))
                   = w̃_{j,l}^(i) f_{θ,j:l}(θ̃_{j,l}^(i), x̃_{j:l}^(i)) f̃_{j,l}(∆θ̃_{j,l}^(i)) / [ |J(θ̃_{j,l}^(i), ∆θ̃_{j,l}^(i))| f_{θ,j:k−1}( g_1^{−1}(θ̃_{j,l}^(i), ∆θ̃_{j,l}^(i)), x̃_{j:k−1}^(i) ) f_{θ,k:l}( g_2^{−1}(θ̃_{j,l}^(i), ∆θ̃_{j,l}^(i)), x̃_{k:l}^(i) ) ];   (4.9)
13  end
14  Resample {(θ̃_{j,l}^(i), ∆θ̃_{j,l}^(i), x̃_{j:l}^(i)), ŵ_{j,l}^(i)}_{i=1}^{N′} to obtain the normalised weighted samples {(θ_{j,l}^(i), ∆θ_{j,l}^(i), x_{j:l}^(i)), W_{j,l}^(i)}_{i=1}^N;
15 end
Figure 4.2: Graph representation of the sub-HMMs in a complete HMM when T = 5. [The figure shows the chain X_0 → ... → X_5 with observations Y_0, ..., Y_5; dashed frames mark the sub-HMMs at T_{0:3} and T_{4:5}, and solid frames mark the sub-HMMs at T_{0:1}, T_{2:3}, T_4 and T_5.]

the data into groups called 'shards', and runs an independent Monte Carlo algorithm on each shard to produce a posterior estimate. The global result is then computed as a weighted average of the shard posterior estimates, forming a consensus belief.
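For a scalar Gaussian parameter, the consensus step can be sketched as a precision-weighted average of the shard draws, with the weights estimated from the shard sample variances as suggested by Scott et al. (2016) for the Gaussian case; the function name is ours.

```python
import numpy as np

def consensus_draws(shard_draws):
    """Consensus Monte Carlo combination (sketch, scalar Gaussian case):
    combine draws theta_s^(i) from S shard posteriors by a weighted average,
    with weight W_s taken as the inverse of shard s's sample variance."""
    shard_draws = np.asarray(shard_draws)          # shape (S, N)
    w = 1.0 / shard_draws.var(axis=1, ddof=1)      # W_s ~ precision of shard s
    return (w[:, None] * shard_draws).sum(axis=0) / w.sum()
```

With two Gaussian shard posteriors N(1, 1) and N(3, 4), the consensus mean is close to the precision-weighted value (1·1 + 0.25·3)/1.25 = 1.4.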

CMC and TPE are similar in the sense that both algorithms rely on an independence assumption. In CMC, independence arises from the individual work on each shard. In TPE, independence exists between the random variables at the same level of the auxiliary tree. We can analogously define the shard and the individual work at a tree node T_j (resp. T_{j:l}) in TPE: the shard contains the observation(s) y_j (resp. y_{j:l}), and the individual work refers to the implementation of TPE. Hence, the sub-model at T_j (resp. T_{j:l}) can be treated as the HMM with the observation(s) y_j (resp. y_{j:l}), whose dynamics need to be defined.

We refer to the original HMM with the full observations y_{0:T} as the complete HMM, which belongs to the root of the auxiliary tree. The sub-model at a non-root node is itself an HMM whose observations are a subset of those from the complete HMM. We refer to these sub-models as sub-HMMs.
Building the intermediate target distribution in TPE is hence equivalent to specifying the dynamics of the corresponding sub-HMM. We inherit the same transition and emission densities from the complete HMM. We will discuss the prior, and provide the exact forms of the intermediate target distributions, in Section 4.5.2 and Section 4.5.3.

Practically, we partition the complete HMM based upon its tree decomposition to create all sub-HMMs. Figure 4.2 shows the dependence structures of some sub-HMMs constructed from a complete HMM with T = 5. The associated auxiliary tree of the complete HMM is shown in Figure 4.1.
From the tree, we first split the complete HMM into two sub-HMMs which
respectively involve the hidden states X0:3 at T0:3 and X4:5 at T4:5 . The two
sub-HMMs are framed by dashed lines in Figure 4.2. We further split the
hidden states X0:3 to obtain two sub-HMMs at T0:1 and T2:3 , which are framed
by solid lines. Likewise, we have two sub-HMMs at T4 and T5 , which both
contain a single hidden state. Moreover, the sub-HMMs at T_{0:1} and T_{2:3} must themselves be split until each sub-HMM has only a single hidden state (figures not included).

4.5.2 Sub-HMMs with Original Priors (TPE-O)

We design the first class of intermediate target distributions whose associated


sub-HMMs incorporate the original prior from the complete HMM. We call
the algorithm TPE-O.

We assume that the unknown parameter and the initial hidden state of each sub-HMM are independent, with densities exactly µ and p_0 as given by the complete HMM. We define the intermediate target distribution f_{θ,j} at
the node T_j as

f_{θ,j}(θ_j, x_j) ∝ µ(θ_j) p_0(x_j) p_{θ_j}(y_j | x_j),

which corresponds to a one-step sub-HMM whose observation is y_j. The intermediate target distribution f_{θ,j:l} at a node T_{j:l} can be similarly written as

f_{θ,j:l}(θ_{j,l}, x_{j:l}) ∝ µ(θ_{j,l}) p_0(x_j) p_{θ_{j,l}}(y_{j:l}, x_{j+1:l} | x_j)
∝ µ(θ_{j,l}) p_0(x_j) p_{θ_{j,l}}(y_j | x_j) ∏_{i=j}^{l−1} p_{θ_{j,l}}(x_{i+1} | x_i) p_{θ_{j,l}}(y_{i+1} | x_{i+1}),   (4.10)

which corresponds to a sub-HMM with the observations y_{j:l}. At the root node T_{0:T}, we obtain the target distribution f_{θ,0:T}(θ, x_{0:T}) = p(θ, x_{0:T} | y_{0:T}).
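For concreteness, the unnormalised log-density of (4.10) can be evaluated as follows for an assumed toy HMM with Gaussian transitions and emissions; the dynamics and priors here are illustrative choices only, not the chapter's specific model.

```python
import numpy as np

def log_norm(x, mean, sd):
    """Log density of N(mean, sd^2)."""
    return -0.5 * np.log(2 * np.pi * sd ** 2) - 0.5 * ((x - mean) / sd) ** 2

def tpeo_log_target(theta, x, y):
    """Unnormalised log f_{theta,j:l} of eq. (4.10) for a toy sub-HMM with
    X_{i+1}|X_i ~ N(theta*X_i, 1), Y_i|X_i ~ N(X_i, 1), and assumed priors
    theta ~ N(0, 1), X_j ~ N(0, 1). `x`, `y` hold x_{j:l} and y_{j:l}."""
    logp = log_norm(theta, 0.0, 1.0) + log_norm(x[0], 0.0, 1.0)  # mu and p_0
    logp += log_norm(y[0], x[0], 1.0)                             # p_theta(y_j|x_j)
    for i in range(len(x) - 1):
        logp += log_norm(x[i + 1], theta * x[i], 1.0)             # transition
        logp += log_norm(y[i + 1], x[i + 1], 1.0)                 # emission
    return logp
```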

This class of intermediate target distributions is straightforward to construct and involves no estimation or tuning step. However, it completely ignores the information in the observations outside the sub-HMM, which can cause the initial samples of {(θ_j, X_j)}_{j=0}^T at the leaf nodes to differ substantially from the true posteriors. Unlike the parameter samples, which are rejuvenated by the transformation function g_1, the samples of the hidden states in TPE are only reweighted in importance sampling and then resampled. Hence serious weight degeneracy may occur.

4.5.3 Sub-HMMs with Estimated Prediction Priors (TPE-EP)

We build the second type of intermediate target distributions. The prior of


the sub-HMM at a non-root node is the estimated prediction distribution.
We call the algorithm TPE-EP.

At a leaf node T_j, the intermediate target distribution has the density

f_{θ,j}(θ_j, x_j) ∝ µ_j(θ_j) p_j(x_j) p_{θ_j}(y_j | x_j),

where we assume θ_j and X_j are independent with priors µ_j and p_j, whose constructions will be investigated later. The intermediate target distribution f_{θ,j:l} at T_{j:l} is correspondingly defined as

f_{θ,j:l}(θ_{j,l}, x_{j:l}) ∝ µ_j(θ_{j,l}) p_j(x_j) p_{θ_{j,l}}(y_{j:l}, x_{j+1:l} | x_j)
∝ µ_j(θ_{j,l}) p_j(x_j) p_{θ_{j,l}}(y_j | x_j) ∏_{i=j}^{l−1} p_{θ_{j,l}}(x_{i+1} | x_i) p_{θ_{j,l}}(y_{i+1} | x_{i+1}).   (4.11)

We wish to choose µ_j and p_j which approximate the prediction distributions from the complete HMM, i.e. µ_j(θ) ≈ p(θ | y_{0:j−1}) and p_j(x_j) ≈ p(x_j | y_{0:j−1}). The intuition is as follows. If we further estimate

p(θ_j, x_j | y_{0:j−1}) ≈ p(θ_j | y_{0:j−1}) p(x_j | y_{0:j−1}) ≈ µ_j(θ_j) p_j(x_j)

by assuming conditional independence between θ_j and X_j, we would have f_{θ,j}(θ_j, x_j) ≈ p(θ_j, x_j | y_{0:j}) at T_j and f_{θ,j:l}(θ_{j,l}, x_{j:l}) ≈ p(θ_{j,l}, x_{j:l} | y_{0:l}) at T_{j:l}.

To build the priors µ_j and p_j for j = 0, ..., T, we could estimate the prediction distributions from their Monte Carlo samples, obtained by running sequential importance resampling (SIR) on the complete HMM. Nevertheless, this option is infeasible for two reasons. First, the samples of the hidden states are correlated, which violates the independence assumption in TPE. Second, path degeneracy could occur, which produces poor samples at early time steps.

We alternatively construct the priors µ_j and p_j in a recursive way. Assume that at time j we have µ_{j−1}(θ) ≈ p(θ | y_{0:j−2}) and p_{j−1}(x_{j−1}) ≈ p(x_{j−1} | y_{0:j−2}). We aim to generate Monte Carlo samples of θ_j and X_j which jointly estimate the prediction distribution p(θ_j, x_j | y_{0:j−1}) using µ_{j−1} and p_{j−1}. To see this, we first apply the decomposition

p(θ_j, x_j | y_{0:j−1}) = ∫ p_{θ_j}(x_j | x_{j−1}) p(θ_j, x_{j−1} | y_{0:j−1}) dx_{j−1}.   (4.12)

We can further approximate p(θ_j, x_{j−1} | y_{0:j−1}) in (4.12) by

p(θ_j, x_{j−1} | y_{0:j−1}) ∝ p(θ_j, x_{j−1}, y_{j−1} | y_{0:j−2})
= p_{θ_j}(y_{j−1} | x_{j−1}) p(θ_j, x_{j−1} | y_{0:j−2})   (4.13)
≈ p_{θ_j}(y_{j−1} | x_{j−1}) p(θ_j | y_{0:j−2}) p(x_{j−1} | y_{0:j−2})   (4.14)
≈ p_{θ_j}(y_{j−1} | x_{j−1}) µ_{j−1}(θ_j) p_{j−1}(x_{j−1}).   (4.15)

From (4.13) to (4.14), we again approximate p(θ_j, x_{j−1} | y_{0:j−2}) by assuming conditional independence between θ_j and X_{j−1}.

Practically, we apply sequential importance resampling (SIR) for two


time steps to obtain the Monte Carlo samples. Given µj−1 and pj−1 , we
start to simulate the samples of (θj , Xj−1 ) using (4.15) which approximate
p(θj , xj−1 |y0:j−1 ). We propagate the samples of Xj based upon (4.12), and
obtain the samples of (θj , Xj ) which approximate the prediction distribution
p(θj , xj |y0:j−1 ). The exact forms of µj and pj are built from these Monte Carlo
samples by assuming parametric or non-parametric structures, which we have
discussed in Section 3.7.5. Algorithm 13 shows a recursive construction of

Algorithm 13: Construction of the priors {µ_j, p_j}_{j=0}^T in TPE-EP
1 for j = 0 do
2   Set µ_0 = µ, and let p_0 be the original prior of X_0 in the complete HMM;
3 end
4 for j = 1 to T do
5   for i = 1 to N do
6     Sample θ̃_j^(i) ∼ µ_{j−1}(·), x̃_{j−1}^(i) ∼ p_{j−1}(·);
7     Compute the unnormalised importance weight: w̃_{j−1}^(i) = p_{θ̃_j^(i)}(y_{j−1} | x̃_{j−1}^(i));
8   end
9   if N̂_eff < N_thres then
10    Implement the resampling step and denote the resampled samples (with normalised weights) by {(θ_j^(i), x_{j−1}^(i)), W_{j−1}^(i)}_{i=1}^N;
11  else
12    Normalise the weights, denoted by {W_{j−1}^(i)}_{i=1}^N;
13    Denote the weighted samples by {(θ_j^(i) = θ̃_j^(i), x_{j−1}^(i) = x̃_{j−1}^(i)), W_{j−1}^(i)}_{i=1}^N;
14  end
15  for i = 1 to N do
16    Generate x_j^(i) ∼ p_{θ_j^(i)}(· | x_{j−1}^(i));
17  end
18  Estimate µ_j from the weighted samples {θ_j^(i), W_{j−1}^(i)}_{i=1}^N and p_j from the weighted samples {x_j^(i), W_{j−1}^(i)}_{i=1}^N;
19 end

{µj , pj }Tj=0 , where we denote the sample size in the Monte Carlo simulations
by N , and the threshold in the resampling procedure by Nthres .
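A minimal sketch of this recursion for an assumed linear-Gaussian toy HMM is given below; the Gaussian parametric fits for µ_j and p_j are one of the options discussed in Section 3.7.5, and the model itself is our assumption for illustration.

```python
import numpy as np

def tpe_ep_priors(y, n=2000, rng=None):
    """Sketch of Algorithm 13 for a toy HMM with X_t|X_{t-1} ~ N(theta*X_{t-1}, 1),
    Y_t|X_t ~ N(X_t, 1); mu_j and p_j are fitted as Gaussians (mean, sd)."""
    rng = np.random.default_rng(rng)
    mu = [(0.0, 1.0)]       # mu_0: original prior of theta (assumed N(0, 1))
    px = [(0.0, 1.0)]       # p_0:  original prior of X_0   (assumed N(0, 1))
    for j in range(1, len(y)):
        m_t, s_t = mu[j - 1]; m_x, s_x = px[j - 1]
        theta = rng.normal(m_t, s_t, n)             # theta_j ~ mu_{j-1}
        x_prev = rng.normal(m_x, s_x, n)            # X_{j-1} ~ p_{j-1}
        logw = -0.5 * (y[j - 1] - x_prev) ** 2      # weight by p_theta(y_{j-1}|x_{j-1})
        w = np.exp(logw - logw.max()); w /= w.sum()
        idx = rng.choice(n, n, p=w)                 # resample, cf. eq. (4.15)
        theta, x_prev = theta[idx], x_prev[idx]
        x = theta * x_prev + rng.normal(size=n)     # propagate, cf. eq. (4.12)
        mu.append((theta.mean(), theta.std()))      # fit mu_j
        px.append((x.mean(), x.std()))              # fit p_j
    return mu, px
```

Each pass through the loop runs SIR for only two time steps, which is why path degeneracy is avoided.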

This class of priors estimates the prediction distributions and is more informative than the priors in TPE-O. The Monte Carlo samples used to estimate the priors also avoid path degeneracy, since each SIR procedure is run for only two time steps.

Figure 4.3: Auxiliary tree of TPE-SIR constructed from an HMM when T = 5. [The root (θ = θ_{0,5}, X_{0:5}) at level 2 splits into (θ_{0,3}, X_{0:3}) and (θ_{4,5}, X_{4:5}) at level 1; the leaves at level 0 are (θ_{0,1}, X_{0:1}), (θ_{2,3}, X_{2:3}) and (θ_{4,5}, X_{4:5}).]

4.6 Combination of TPE and SIR-PE

We combine TPE with SIR-PE, and call the resulting algorithm TPE-SIR. TPE-SIR constructs an auxiliary tree as in TPE, recursively splitting the target variable (θ, X_{0:T}) into two subsets. Nevertheless, the splitting stops before each subset is reduced to a single hidden state with a parameter variable: the target variable at each leaf node of TPE-SIR has multiple hidden states, i.e. (θ_{j,l}, X_{j:l}) with j ≠ l. Figure 4.3 shows the construction of an auxiliary tree in TPE-SIR when T = 5. We also define the depth D of TPE-SIR as the total number of levels in the auxiliary tree; in Figure 4.3, we have D = 3. The depth of TPE can be defined similarly, with D = 1 + ⌈log_2(T + 1)⌉.

Given a leaf node with the target variable (θ_{j,l}, X_{j:l}) in TPE-SIR, the intermediate target distribution f_{θ,j:l} is defined as

f_{θ,j:l}(θ_{j,l}, x_{j:l}) ∝ µ_j(θ_{j,l}) p_j(x_j) p_{θ_{j,l}}(y_{j:l}, x_{j+1:l} | x_j)
∝ µ_j(θ_{j,l}) p_j(x_j) p_{θ_{j,l}}(y_j | x_j) ∏_{i=j}^{l−1} p_{θ_{j,l}}(x_{i+1} | x_i) p_{θ_{j,l}}(y_{i+1} | x_{i+1}).   (4.16)

It corresponds to the same sub-HMM at the non-leaf node Tj:l in TPE-O


if µj = µ0 and pj = p0 , or to that in TPE-EP if µj and pj are otherwise
established from the prediction distributions. In particular, the construction

of {µj , pj }Tj=0 in TPE-EP can be accomplished similarly from Algorithm 13
where each SIR procedure needs to be employed for more time steps.

The initial samples at a leaf node in TPE-SIR can be simulated by running SIR-PE. The rest of the sampling process is identical to TPE: merging with potential proliferation, reweighting and resampling at each non-leaf node.

TPE-SIR has the following advantages. Firstly, the implementation of SIR-PE is simple, fast and accurate when applied to a sub-HMM with a small number of observations, and the generated particles largely avoid path degeneracy. In contrast, the sampling procedure for the same sub-HMM at a non-leaf node in TPE is more complicated, and its one benefit over SIR-PE, the rejuvenation of parameter samples to circumvent degeneracy, is not compelling for such short sub-HMMs. Moreover, TPE-SIR saves computational effort by creating a shallower auxiliary tree, as the effort is proportional to the depth of the tree in both TPE and TPE-SIR.

4.7 Construction of Transformation Functions

We discuss several functions which need to be specified in the importance sampling steps of TPE: g_1 and g_2 in (4.6) for generating the proposal, and f̃_{j,l} in (4.8) as part of the extended target distribution. We first build them intuitively in a toy model using a simplified version of TPE, and then explore them in an HMM.
Figure 4.4: Auxiliary tree of the simplified version of TPE constructed from the toy model (4.17) when T = 3. [The root θ = θ_{0,3} splits into θ_{0,1} and θ_{2,3}, whose leaves are θ_0, θ_1, θ_2, θ_3.]

4.7.1 A Toy Model with Conditionally Independent Observations

Before the investigation into an HMM, we construct g_1, g_2 and f̃_{j,l} in a simple toy model with conditionally independent observations {y_t}_{t=0}^T. The model is given as follows:

θ ∼ N(0, 1/τ),
Y_t | θ ∼ N(θ, 1) for t = 0, ..., T,   (4.17)
where τ is the precision of the normal distribution in the prior of θ. We aim to estimate the posterior distribution p(θ | y_{0:T}) using a simplified version of TPE in which, compared to an HMM, there are no hidden states. We first construct an auxiliary tree, each node of which contains a parameter variable. The tree structure when T = 3 is shown in Figure 4.4. As in TPE, we assume the parameter variables at the same level of the tree are mutually independent. For simplicity, we further assume log_2(T + 1) is an integer in this model, and hence the auxiliary tree is a complete binary tree.

We define the intermediate target distribution as the exact posterior distribution at each node. To be more precise, at a leaf node T_j, we have the intermediate target distribution

f_j(θ_j) = p(θ_j | y_j) ∼ N( y_j / (τ + 1), 1 / (τ + 1) ),

which can be simulated directly. At a non-leaf node T_{j:l}, the intermediate target distribution is defined as

f_{j:l}(θ_{j,l}) = p(θ_{j,l} | y_{j:l}) ∼ N( Σ_{i=j}^l y_i / (τ + l − j + 1), 1 / (τ + l − j + 1) ).   (4.18)

As the binary auxiliary tree is complete, we can let d_{j,l} = l − j + 1 = 2(k − j) = 2(l − k + 1) for simplicity of calculation. Hence the intermediate target distribution f_{j:l} becomes

f_{j:l}(θ_{j,l}) = p(θ_{j,l} | y_{j:l}) ∼ N( Σ_{i=j}^l y_i / (τ + d_{j,l}), 1 / (τ + d_{j,l}) ).   (4.19)

Though fj:l can be sampled directly, we still apply importance sampling


which merges the Monte Carlo samples from its children as in TPE. We build
a proposal. We create a random variable (θj,k−1 , θk,l ) ∼ fj:k−1 fk:l using the
product measure defined on θj,k−1 ∼ fj:k−1 and θk,l ∼ fk:l from the children.
We define a new variable (θj,l , ∆θj,l ) using the transformation functions

θj,l = g1 (θj,k−1 , θk,l ), ∆θj,l = g2 (θj,k−1 , θk,l ),

whose density will be used as the proposal.

To match the space of the proposal, we also need to extend the target variable θ_{j,l} by creating (θ_{j,l}, ∆θ_{j,l}). We assume ∆θ_{j,l} ∼ f̃_{j,l} is independent of θ_{j,l}. The extended target density of (θ_{j,l}, ∆θ_{j,l}) is hence f_{j:l}(θ_{j,l}) f̃_{j,l}(∆θ_{j,l}).
We propose two types of transformation functions g_1, g_2, both of which give θ_{j,l} the target marginal distribution f_{j:l}. The first method employs a deterministic combination of the overlapping parameters θ_{j,k−1} and θ_{k,l} to match the mean and variance of the target variable θ_{j,l}. The second one additionally incorporates independent noise, which further increases sample diversity.

Deterministic combination

We aim to create the random variable θ_{j,l} with density f_{j:l} from θ_{j,k−1} ∼ f_{j:k−1} and θ_{k,l} ∼ f_{k:l} using the function g_1. We first exploit a deterministic linear combination of the overlapping parameter variables θ_{j,k−1} and θ_{k,l}, where

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = α(θ_{j,k−1} + θ_{k,l}) + β ∼ N( α Σ_{i=j}^l y_i / (τ + d_{j,l}/2) + β, 2α² / (τ + d_{j,l}/2) ).

The unknown constants α and β are computed by matching the mean and variance of the target f_{j:l} in (4.19), which gives

α = (1/2) √( (d_{j,l} + 2τ) / (d_{j,l} + τ) ),
β = ( Σ_{i=j}^l y_i / √(τ + d_{j,l}) ) ( 1/√(τ + d_{j,l}) − 1/√(2τ + d_{j,l}) ).

We then define the function g_2:

∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} − θ_{k,l},

and so ∆θ_{j,l} is distributed as

N( ( Σ_{i=j}^{k−1} y_i − Σ_{i=k}^l y_i ) / (τ + d_{j,l}/2), 2 / (τ + d_{j,l}/2) ).   (4.20)

We let f̃_{j,l} have the same density as in (4.20), so that the marginal distributions of ∆θ_{j,l} in the proposal and in the target coincide.

The proposal h_{j:l}(θ_{j,l}, ∆θ_{j,l}) can be obtained using the transformation of random variables:

h_{j:l}(θ_{j,l}, ∆θ_{j,l}) = f_{j:k−1}( (θ_{j,l} − β)/(2α) + ∆θ_{j,l}/2 ) f_{k:l}( (θ_{j,l} − β)/(2α) − ∆θ_{j,l}/2 ) |J(θ_{j,l}, ∆θ_{j,l})|
∝ f_{j:l}(θ_{j,l}) f̃_{j,l}(∆θ_{j,l}),   (4.21)

where |J(θ_{j,l}, ∆θ_{j,l})| = 1/(2α) is a constant and (4.21) can be verified by direct calculation. Hence, the proposal distribution is identical to the extended target distribution.

This construction of g_1, g_2 and f̃_{j,l} has the advantage of producing equally weighted particles in importance sampling. However, the deterministic construction does little to increase the sample diversity of θ_{j,l}: if several samples of (θ_{j,k−1}, θ_{k,l}) have identical values, their transformed samples of (θ_{j,l}, ∆θ_{j,l}) are identical as well.
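The deterministic combination can be checked numerically: drawing exact samples from the two children's posteriors and applying g_1 with the matched α and β should reproduce the mean and variance of the target posterior (4.19). A minimal sketch follows (function name ours; β is written with √(2τ + d_{j,l}) = √(2(τ + d_{j,l}/2)) in the last factor):

```python
import numpy as np

def deterministic_merge(y, tau, n=100_000, rng=None):
    """Toy model (4.17): combine theta_{j,k-1} ~ p(theta|y_{j:k-1}) and
    theta_{k,l} ~ p(theta|y_{k:l}) via theta_{j,l} = alpha*(t1 + t2) + beta,
    which should match the posterior p(theta|y_{j:l}) in mean and variance."""
    rng = np.random.default_rng(rng)
    d = len(y); half = d // 2

    def post(ys):  # exact child posterior N(sum/(tau+len), 1/(tau+len))
        return rng.normal(ys.sum() / (tau + len(ys)), (tau + len(ys)) ** -0.5, n)

    t1, t2 = post(y[:half]), post(y[half:])
    alpha = 0.5 * np.sqrt((d + 2 * tau) / (d + tau))
    beta = y.sum() / np.sqrt(tau + d) * ((tau + d) ** -0.5 - (2 * tau + d) ** -0.5)
    return alpha * (t1 + t2) + beta
```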

Stochastic combination

We introduce a stochastic way to combine the overlapping parameter variables θ_{j,k−1} ∼ f_{j:k−1} at T_{j:k−1} and θ_{k,l} ∼ f_{k:l} at T_{k:l} to create θ_{j,l} ∼ f_{j:l} at T_{j:l} in the model (4.17). We define the function g_1 as the average of θ_{j,k−1} and θ_{k,l} plus independent noise ε:

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = (θ_{j,k−1} + θ_{k,l})/2 + ε,

where the distribution κ of ε is to be determined. To match the mean and variance of θ_{j,l} in (4.19), we demand

E( (θ_{j,k−1} + θ_{k,l})/2 + ε ) = (1/2) Σ_{i=j}^l y_i / (τ + d_{j,l}/2) + E(ε) = Σ_{i=j}^l y_i / (τ + d_{j,l}),
Var( (θ_{j,k−1} + θ_{k,l})/2 + ε ) = (1/4) · 2/(τ + d_{j,l}/2) + Var(ε) = 1/(τ + d_{j,l}).   (4.22)

Solving the equations in (4.22), we obtain

ε ∼ N( τ Σ_{i=j}^l y_i / ( 2(τ + d_{j,l})(τ + d_{j,l}/2) ), τ / ( 2(τ + d_{j,l}/2)(τ + d_{j,l}) ) ).

We still define ∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} − θ_{k,l}, and so

∆θ_{j,l} ∼ N( ( Σ_{i=j}^{k−1} y_i − Σ_{i=k}^l y_i ) / (τ + d_{j,l}/2), 2 / (τ + d_{j,l}/2) ),

whose density is also imposed on f̃_{j,l} appearing in the extended target distribution. The inverse transformations g_1^{−1} and g_2^{−1} can be computed accordingly:

θ_{j,k−1} = g_1^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} + ∆θ_{j,l}/2 − ε,
θ_{k,l} = g_2^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} − ∆θ_{j,l}/2 − ε.
We build a proposal on an extended probability space including the noise ε, since it appears in the transformation functions. Hence, we define

h′_{j:l}(θ_{j,l}, ∆θ_{j,l}, ε) ∝ f_{j:k−1}( θ_{j,l} + ∆θ_{j,l}/2 − ε ) f_{k:l}( θ_{j,l} − ∆θ_{j,l}/2 − ε ) |J(θ_{j,l}, ∆θ_{j,l})| κ(ε),

where |J(θ_{j,l}, ∆θ_{j,l})| = 1. The extended target distribution is defined as

f′_{j:l}(θ_{j,l}, ∆θ_{j,l}, ε) = f_{j:l}(θ_{j,l}) f̃_{j,l}(∆θ_{j,l}) κ(ε).

By construction, the marginal of θ_{j,l} in the proposal is identical to f_{j:l}.

Note that in the proposal we introduce a correlation between θ_{j,l} and ε, as well as between ∆θ_{j,l} and ε, while no such correlation exists in the target. To check this, we compute

corr(θ_{j,l}, ε) = corr( (θ_{j,k−1} + θ_{k,l})/2 + ε, ε )
= Var(ε) / √( Var( (θ_{j,k−1} + θ_{k,l})/2 + ε ) Var(ε) )
= √( τ / (2τ + d_{j,l}) ) = √( 1 / (2 + d_{j,l}/τ) ).

Consequently, the correlation between θ_{j,l} and ε depends on the precision in the prior and the number of time steps d_{j,l} = l − j + 1 at T_{j:l}.

To conclude, the stochastic combination provides the correct marginal proposal and greatly boosts the diversity of the samples by adding independent noise. However, a potential downside is the correlation between the noise and the parameter variable in the proposal, which may cause inefficiency in importance sampling.
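The stochastic combination and the induced correlation can likewise be verified by simulation: the sketch below draws from the children's posteriors, adds the matched noise ε, and checks the target moments and corr(θ_{j,l}, ε) (function name ours).

```python
import numpy as np

def stochastic_merge(y, tau, n=100_000, rng=None):
    """Toy model (4.17): theta_{j,l} = (t1 + t2)/2 + eps, with the mean and
    variance of eps solved from (4.22) so theta_{j,l} matches p(theta|y_{j:l})."""
    rng = np.random.default_rng(rng)
    d = len(y); half = d // 2

    def post(ys):  # exact child posterior N(sum/(tau+len), 1/(tau+len))
        return rng.normal(ys.sum() / (tau + len(ys)), (tau + len(ys)) ** -0.5, n)

    t1, t2 = post(y[:half]), post(y[half:])
    eps_mean = tau * y.sum() / (2 * (tau + d) * (tau + d / 2))
    eps_var = tau / (2 * (tau + d / 2) * (tau + d))
    eps = rng.normal(eps_mean, np.sqrt(eps_var), n)
    return 0.5 * (t1 + t2) + eps, eps
```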

4.7.2 Unknown Parameter with Support R in an HMM

We propose the functions g1 , g2 in (4.6), and the distribution f˜j,l in (4.8)


when TPE is applied to an HMM. We assume θ ∈ R. We similarly construct
the transformation functions g1 and g2 motivated by the example in Section
4.7.1.

We first consider a deterministic transformation of the overlapping parameters θ_{j,k−1} and θ_{k,l}, where α and β are constants:

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = α(θ_{j,k−1} + θ_{k,l}) + β,   (4.23)
∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} − θ_{k,l}.   (4.24)

Applying the expectation and variance to (4.23) gives

E(θ_{j,l}) = α( E(θ_{j,k−1}) + E(θ_{k,l}) ) + β,
Var(θ_{j,l}) = α²( Var(θ_{j,k−1}) + Var(θ_{k,l}) ).   (4.25)

In the toy model (4.17), the distribution of the target parameter θ_{j,l} is known and can be pre-computed, whereas here in the HMM the information about θ_{j,l} is very limited. We can determine α and β from θ_{j,k−1} and θ_{k,l}, and expect the distribution of θ_{j,l} to be a compromise between those of θ_{j,k−1} and θ_{k,l}. We suggest setting E(θ_{j,l}) to the average of E(θ_{j,k−1}) and E(θ_{k,l}), and Var(θ_{j,l}) to the average of Var(θ_{j,k−1}) and Var(θ_{k,l}). We choose not to shrink the variance of θ_{j,l}, to prevent an overly concentrated proposal which may not explore the target space adequately. Therefore, we obtain

α = √(1/2),   β = ((1 − √2)/2)( E(θ_{j,k−1}) + E(θ_{k,l}) ).

When the expectations E(θ_{j,k−1}) and E(θ_{k,l}) do not have closed forms, we may use estimates from the corresponding Monte Carlo samples. These estimates depend on the samples and vary across runs of the algorithm. However, the restrictions (4.25) we impose on α and β are only our recommendation of a good choice of values, and are not required to be (strictly) satisfied; any reasonable values of α and β are valid for the algorithm to work.
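A sketch of how α and β may be computed from Monte Carlo samples of θ_{j,k−1} and θ_{k,l} (the function name is ours; only the sample means enter β, while α = √(1/2) is fixed by the variance restriction):

```python
import numpy as np

def match_alpha_beta(theta1, theta2, w1=None, w2=None):
    """Estimate alpha, beta of the deterministic combination (4.23) from
    (optionally weighted) Monte Carlo samples of theta_{j,k-1} and theta_{k,l},
    so that the moments of theta_{j,l} equal the averages of the children's
    moments, as recommended via (4.25)."""
    m1 = np.average(theta1, weights=w1)
    m2 = np.average(theta2, weights=w2)
    alpha = np.sqrt(0.5)
    beta = (1 - np.sqrt(2)) / 2 * (m1 + m2)
    return alpha, beta
```

By construction, α(θ_1 + θ_2) + β has mean exactly halfway between the two sample means.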

We then work out the inverse transformations g_1^{−1} and g_2^{−1}:

θ_{j,k−1} = g_1^{−1}(θ_{j,l}, ∆θ_{j,l}) = (θ_{j,l} − β)/(2α) + ∆θ_{j,l}/2,
θ_{k,l} = g_2^{−1}(θ_{j,l}, ∆θ_{j,l}) = (θ_{j,l} − β)/(2α) − ∆θ_{j,l}/2,

with |J(θ_{j,l}, ∆θ_{j,l})| = 1/(2α). The proposal density of (θ_{j,l}, ∆θ_{j,l}, X_{j:l}) in (4.7) is

h′_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}) ∝ f_{θ,j:k−1}( (θ_{j,l} − β)/(2α) + ∆θ_{j,l}/2, x_{j:k−1} ) f_{θ,k:l}( (θ_{j,l} − β)/(2α) − ∆θ_{j,l}/2, x_{k:l} ).

Finally, we propose a simple rule for building f̃_{j,l}, which appears as part of the product in the extended target density (4.8). Similar to the construction in the toy model (4.17), we require the marginal of ∆θ_{j,l} under the target to be identical to that under the proposal. However, the marginal distribution in the proposal may be analytically intractable, and an approximation is needed. In practice, we can impose a Gaussian distribution on f̃_{j,l} whose moments are estimated according to (4.24) using the Monte Carlo samples of θ_{j,k−1} and θ_{k,l}.

Alternatively, we apply a stochastic combination of θ_{j,k−1} and θ_{k,l}. We let

θ_{j,l} = g_1(θ_{j,k−1}, θ_{k,l}) = (θ_{j,k−1} + θ_{k,l})/2 + ε,   (4.26)
∆θ_{j,l} = g_2(θ_{j,k−1}, θ_{k,l}) = θ_{j,k−1} − θ_{k,l},

where ε ∼ κ is a random variable independent of θ_{j,k−1} and θ_{k,l}. We can use a parametric approach to build κ, and specify the mean and variance of ε from (4.26). Applying the expectation and variance to both sides yields

E(θ_{j,l}) = ( E(θ_{j,k−1}) + E(θ_{k,l}) )/2 + E(ε),
Var(θ_{j,l}) = ( Var(θ_{j,k−1}) + Var(θ_{k,l}) )/4 + Var(ε).

As in the case of the deterministic combination, we still require

E(θ_{j,l}) = ( E(θ_{j,k−1}) + E(θ_{k,l}) )/2,
Var(θ_{j,l}) = ( Var(θ_{j,k−1}) + Var(θ_{k,l}) )/2.

Consequently, ε has mean 0 and variance (1/4)( Var(θ_{j,k−1}) + Var(θ_{k,l}) ), where the variance can be estimated from the Monte Carlo samples of θ_{j,k−1} and θ_{k,l}.

The inverse transformation functions g_1^{−1} and g_2^{−1} for the derivation of the proposal are given by

θ_{j,k−1} = g_1^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} + ∆θ_{j,l}/2 − ε,
θ_{k,l} = g_2^{−1}(θ_{j,l}, ∆θ_{j,l}) = θ_{j,l} − ∆θ_{j,l}/2 − ε.

We define the proposal h′_{θ,j:l}:

h′_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}, ε) ∝ f_{θ,j:k−1}( θ_{j,l} + ∆θ_{j,l}/2 − ε, x_{j:k−1} ) f_{θ,k:l}( θ_{j,l} − ∆θ_{j,l}/2 − ε, x_{k:l} ) κ(ε),

since |J(θ_{j,l}, ∆θ_{j,l})| = 1. The extended target density f′_{θ,j:l} is

f′_{θ,j:l}(θ_{j,l}, ∆θ_{j,l}, x_{j:l}, ε) = f_{θ,j:l}(θ_{j,l}, x_{j:l}) f̃_{j,l}(∆θ_{j,l}) κ(ε).

The density f̃_{j,l} of ∆θ_{j,l} can be handled in the same way as in the deterministic approach, given the same transformation function g_2.

4.7.3 Unknown Parameter with Support R+ in an HMM

We propose g1, g2 and f̃j,l when the unknown parameter θ in an HMM has support R+. We employ a log transformation and argue by analogy with the case θ ∈ R in Section 4.7.2.

We first consider a deterministic combination

log(θj,l) = α(log(θj,k−1) + log(θk,l)) + β,
log(∆θj,l) = log(θj,k−1) − log(θk,l),    (4.27)

which gives g1 and g2:

θj,l = g1(θj,k−1, θk,l) = (θj,k−1 θk,l)^α e^β,
∆θj,l = g2(θj,k−1, θk,l) = θj,k−1 / θk,l,    (4.28)

where α and β are pre-determined constants.

Similar to the discussion in Section 4.7.2, we set the mean E(log(θj,l)) to be the average of E(log(θj,k−1)) and E(log(θk,l)), and likewise for the variance. Hence, we have

α = √(1/2),    β = ((1 − √2)/2) (E(log(θj,k−1)) + E(log(θk,l))).

In practice, the means can be evaluated using Monte Carlo samples.
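A numerical sketch of this deterministic log-scale combination, with the expectations replaced by sample means as suggested above (the function name and setup are illustrative, not the thesis's implementation):

```python
import numpy as np

def combine_log_deterministic(theta_left, theta_right):
    """Deterministic combination on the log scale as in (4.27)-(4.28):
    log(theta) = alpha*(log(left) + log(right)) + beta, delta = left/right,
    with alpha = sqrt(1/2) and beta = ((1 - sqrt(2))/2)*(E[log left] + E[log right]),
    so that the mean and variance of log(theta) equal the averages of the
    children's log-moments (variance matching assumes independent children)."""
    tl = np.asarray(theta_left, dtype=float)
    tr = np.asarray(theta_right, dtype=float)
    alpha = np.sqrt(0.5)
    beta = (1.0 - np.sqrt(2.0)) / 2.0 * (np.log(tl).mean() + np.log(tr).mean())
    theta = (tl * tr) ** alpha * np.exp(beta)   # g1 in (4.28)
    delta = tl / tr                             # g2 in (4.28)
    return theta, delta
```

Note that (1 − √2)/2 = 1/2 − α, so the sample mean of log(θj,l) equals the average of the children's sample means exactly by construction.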

The inverse transformation functions g1−1 and g2−1 are:

θj,k−1 = g1−1(θj,l, ∆θj,l) = θj,l^{1/(2α)} ∆θj,l^{1/2} e^{−β/(2α)},
θk,l = g2−1(θj,l, ∆θj,l) = θj,l^{1/(2α)} ∆θj,l^{−1/2} e^{−β/(2α)},

which gives |J(θj,l, ∆θj,l)| = (1/(2α)) e^{−β/α} θj,l^{1/α−1} ∆θj,l^{−1}.

The proposal density of (θj:l, ∆θj:l, Xj:l) in (4.7) becomes:

h′j:l,θ(θj,l, ∆θj,l, xj:l)
∝ θj,l^{1/α−1} ∆θj,l^{−1} fθ,j:k−1(θj,l^{1/(2α)} ∆θj,l^{1/2} e^{−β/(2α)}, xj:k−1) fθ,k:l(θj,l^{1/(2α)} ∆θj,l^{−1/2} e^{−β/(2α)}, xk:l).

The density f̃j,l can be approximated from (4.27) for the same reason as stated in Section 4.7.2. We can obtain it by imposing a parametric assumption, with parameters estimated from the Monte Carlo samples of θj,k−1 and θk,l.

We then consider a stochastic transformation of the logarithm of the overlapping parameter variables θj,k−1 and θk,l:

log(θj,l) = (1/2)log(θj,k−1) + (1/2)log(θk,l) + ε,
log(∆θj,l) = log(θj,k−1) − log(θk,l),

where ε ∼ κ is independent of θj,k−1 and θk,l. The mean and variance of ε are 0 and (1/4)(Var(log(θj,k−1)) + Var(log(θk,l))), as explained in the deterministic case. We obtain the transformation functions g1 and g2:

θj,l = g1(θj,k−1, θk,l) = θj,k−1^{1/2} θk,l^{1/2} e^{ε},
∆θj,l = g2(θj,k−1, θk,l) = θj,k−1 / θk,l.

The inverse transformation functions g1−1 and g2−1 are correspondingly

θj,k−1 = g1−1(θj,l, ∆θj,l) = θj,l ∆θj,l^{1/2} e^{−ε},
θk,l = g2−1(θj,l, ∆θj,l) = θj,l ∆θj,l^{−1/2} e^{−ε},

which gives |J(θj,l, ∆θj,l)| = θj,l ∆θj,l^{−1} e^{−2ε}.

The proposal density of (θj:l, ∆θj:l, Xj:l, ε) is hence

h′θ,j:l(θj,l, ∆θj,l, xj:l, ε)
= θj,l ∆θj,l^{−1} e^{−2ε} fθ,j:k−1(θj,l ∆θj,l^{1/2} e^{−ε}, xj:k−1) fθ,k:l(θj,l ∆θj,l^{−1/2} e^{−ε}, xk:l) κ(ε).

The extended target distribution is

f′θ,j:l(θj,l, ∆θj,l, xj:l, ε) = fθ,j:l(θj,l, xj:l) f̃j,l(∆θj,l) κ(ε).

4.8 Simulation Study

We conduct a simulation study in a linear Gaussian HMM with a three-dimensional unknown parameter. We aim to sample from the posterior distribution p(θ, x0:T|y0:T) using TPE and other parameter estimation algorithms.

4.8.1 Model Description

We consider a linear Gaussian HMM adapted from Kantas et al. (2009):

Xt = ρXt−1 + σ1Vt,  t = 1, . . . , T,
Yt = Xt + σ2Wt,  t = 0, . . . , T,    (4.29)

where T = 255, X0 ∼ N(0, 1), and V1, . . . , VT, W0, . . . , WT are independent with Vt ∼ N(0, 1), Wt ∼ N(0, 1). The random variables ρ, σ1² and σ2² are independent, with distributions

ρ ∼ N(0.5, 0.01),
σ1² ∼ IG(1, 1),
σ2² ∼ IG(1, 1),

where IG(δ, ψ) is an inverse Gamma distribution with shape δ and rate ψ.

The prior of ρ usually has support on (−1, 1) to guarantee the stationarity of the time series {Xt}t∈N. In this example, we choose the prior of ρ to be normally distributed to accommodate the case in Section 4.7.2 where the support of the unknown parameter is R. The variance of ρ ensures that the probability of ρ ∈ (−1, 1) is larger than 99.9%, and the finiteness of the time series also makes stationarity less of a concern.

We define the prior µ of θ = (ρ, σ1², σ2²) by

µ(θ) = µ(ρ, σ1², σ2²) = µρ(ρ) µσ1²(σ1²) µσ2²(σ2²),

where µρ, µσ1² and µσ2² are the densities of ρ, σ1² and σ2², respectively.
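The data-generating process in (4.29) can be sketched as follows. The thesis's experiments were implemented in R; this Python sketch is illustrative only, and the function name is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_hmm(T=255, rho=0.5, sigma1=1.0, sigma2=1.0):
    """Simulate the linear Gaussian HMM in (4.29) for a fixed parameter value."""
    x = np.empty(T + 1)
    y = np.empty(T + 1)
    x[0] = rng.normal(0.0, 1.0)               # X_0 ~ N(0, 1)
    y[0] = x[0] + sigma2 * rng.normal()       # Y_0 = X_0 + sigma_2 W_0
    for t in range(1, T + 1):
        x[t] = rho * x[t - 1] + sigma1 * rng.normal()  # X_t = rho X_{t-1} + sigma_1 V_t
        y[t] = x[t] + sigma2 * rng.normal()            # Y_t = X_t + sigma_2 W_t
    return x, y

# draw the parameter from its prior, then simulate the observations
rho = rng.normal(0.5, np.sqrt(0.01))                # N(0.5, 0.01); numpy takes the s.d.
sigma1_sq = 1.0 / rng.gamma(shape=1.0, scale=1.0)   # IG(1, 1) as reciprocal of Gamma(1, rate 1)
sigma2_sq = 1.0 / rng.gamma(shape=1.0, scale=1.0)
x, y = simulate_hmm(rho=rho, sigma1=np.sqrt(sigma1_sq), sigma2=np.sqrt(sigma2_sq))
```

The inverse Gamma draws use the fact that if Z ∼ Gamma(shape δ, rate ψ) then 1/Z ∼ IG(δ, ψ); with rate 1 the numpy scale parameter is also 1.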

4.8.2 Benchmark

Similar to Section 3.10, we aim to employ the posterior distributions in a finite-space HMM as a benchmark, since the analytic solution is unavailable in the original HMM. The finite-space HMM here refers to the sample spaces of the parameter and of each hidden state being finite. However, given a three-dimensional parameter space, the exact posterior solution to the finite-space HMM is computationally intensive. We therefore perform Monte Carlo simulations to approximate it.

We build a finite-space HMM associated with the original model using the grid method described in Section 3.10. We first discretise ρ, σ1² and σ2² using (3.44) from constructed grids Gρ, Gσ1² and Gσ2² to create the discrete random variables ρ̂, σ̂1² and σ̂2², respectively. Their probability mass functions are denoted by µ̂ρ, µ̂σ1², µ̂σ2². The numbers of grid points, which constitute the sample spaces Gρ, Gσ1² and Gσ2², are denoted by n1, n2 and n3, respectively. We define θ̂ = (ρ̂, σ̂1², σ̂2²), whose sample space is Gθ = Gρ × Gσ1² × Gσ2². The discrete prior µ̂ of θ̂ is

µ̂(θ̂) = µ̂(ρ̂, σ̂1², σ̂2²) = µ̂ρ(ρ̂) µ̂σ1²(σ̂1²) µ̂σ2²(σ̂2²).

We then discretise the sample space of each hidden state Xt using a grid Gt consisting of n_{x_t} points. The grid G0 becomes the sample space of the discrete random variable X̂0, which approximates X0. We compute the discrete prior of X̂0 based upon (3.44) from G0. The transition mass pθ̂(x̂t|x̂t−1) can be similarly computed for every θ̂ ∈ Gθ and x̂t−1 ∈ Gt−1. The emission density pθ̂(yt|x̂t) is continuous and does not require discretisation, although the possible choices of θ̂ and x̂t are now finite.

We calculate p(θ̂|y0:T), which is an approximation of the true posterior p(θ|y0:T), using

p(θ̂|y0:T) ∝ pθ̂(y0:T) µ̂(θ̂),

where pθ̂(y0:T) can be computed analytically from the Kalman filter conditional on θ̂.

We demonstrate the derivation of the marginal smoothing distributions {p(x̂t|y0:T)}_{t=0}^{T} from the finite-space HMM using p(θ̂|y0:T). We decompose the probability mass p(x̂t|y0:T) into

p(x̂t|y0:T) = Σ_{θ̂∈Gθ} pθ̂(x̂t|y0:T) p(θ̂|y0:T),    (4.30)

where pθ̂(x̂t|y0:T) is the probability mass discretised from the normal distribution pθ̂(·|y0:T). The mean and variance of pθ̂(·|y0:T) can be obtained from the Rauch–Tung–Striebel smoother (RTSS) conditional on θ̂. However, the computation of (4.30) can be time-consuming as a consequence of n1 × n2 × n3 implementations of the RTSS for each x̂t ∈ Gt. Alternatively, we simulate equally weighted Monte Carlo samples {θ̂^{(i)}}_{i=1}^{nmc} from the discrete distribution p(·|y0:T), and estimate the probability mass p(x̂t|y0:T) for x̂t ∈ Gt using

p(x̂t|y0:T) ≈ (1/nmc) Σ_{i=1}^{nmc} p_{θ̂^{(i)}}(x̂t|y0:T).

As the simulations do not guarantee Σ_{x̂t∈Gt} p(x̂t|y0:T) = 1, we need to normalise the probability masses.

In practice, we choose n1 = n2 = n3 = n_{x_t} = 200 and nmc = 10000. The positions of the grids with respect to ρ, σ1² and σ2² are decided according to their priors, and those with respect to the hidden states are decided from the particle algorithm which sequentially estimates {p(θ, xj|y0:j)}_{j=0}^{T}. Some grids need to be adjusted depending on the smoothness of the resulting cumulative distribution function.
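The averaging and normalisation step above can be sketched as follows. The sketch assumes the per-draw smoothing means and standard deviations (one pair per sampled θ̂^{(i)}, e.g. from RTSS runs) are already available; the helper names are illustrative.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(z):
    # Phi(z); math.erf handles +/-inf correctly, so no special-casing is needed
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def discretise_normal(grid, mean, sd):
    """Masses of N(mean, sd^2) on the grid points, obtained by splitting the
    real line at the midpoints between consecutive grid points."""
    edges = np.concatenate(([-np.inf], (grid[:-1] + grid[1:]) / 2.0, [np.inf]))
    cdf = np.array([normal_cdf((e - mean) / sd) for e in edges])
    return np.diff(cdf)

def smoothing_mass(grid, means, sds):
    """Monte Carlo estimate of p(x_t | y_{0:T}) on the grid: average the
    discretised Gaussian masses over the sampled parameter draws, then
    normalise so the masses sum to one."""
    mass = sum(discretise_normal(grid, m, s) for m, s in zip(means, sds))
    return mass / mass.sum()
```

Discretising each Gaussian before averaging mirrors the construction of pθ̂(x̂t|y0:T) in (4.30); the final normalisation corresponds to the correction described in the text.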

4.8.3 Metric

The Kolmogorov–Smirnov (KS) test measures a distance between an empirical distribution and a hypothetical reference distribution (Massey Jr, 1951). We employ the KS statistic, which is obtained in the KS test, as the error metric in the linear Gaussian HMM, rather than the mean square errors of the moments, for two reasons. First, given the inverse Gamma priors of the variance parameters in the HMM, their posterior distributions can be skewed and asymmetric; the KS test captures both the location and the shape of a distribution, whereas the first two moments may neglect these probabilistic properties. Second, the KS statistic is convenient for reporting and comparison, as it outputs a single number between 0 and 1.

We have defined the KS test, and have justified using its statistic rather than the result of the test as an error metric in the Monte Carlo algorithms for smoothing, in Section 3.10.3. The KS statistic is defined as

sup_x |F1,N(x) − F2(x)|,

where F1,N is the empirical cumulative distribution function (ECDF) generated by N samples, and F2 is the cumulative distribution function (CDF) of the reference distribution.

In the parameter estimation problem of the linear Gaussian HMM, F1,N is the ECDF of the Monte Carlo samples estimating the posterior distribution p(ζ|y0:T), where ζ is one component of θ = (ρ, σ1², σ2²) or a hidden state Xt. F2 is the discrete distribution p(ζ̂|y0:T) from the finite-space HMM. We ensure that the error due to discretisation when computing the KS statistic is insignificant compared to the error from the parameter estimation algorithms to be implemented. We refer to KSSx,m, KSSρ,m, KSSσ1²,m and KSSσ2²,m as the average KS statistic of all hidden states, the KS statistic of ρ, the KS statistic of σ1² and the KS statistic of σ2² in the mth simulation, respectively.
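Computing sup_x |F1,N(x) − F2(x)| against a discrete (or discretised) reference can be sketched as below. Since both CDFs are right-continuous step functions, the supremum is attained at a jump point of one of them, checked both at and just before the jump; the function name is illustrative.

```python
import numpy as np

def ks_statistic(samples, ref_x, ref_pmf):
    """KS statistic between the ECDF of `samples` and a discrete reference
    distribution with sorted support `ref_x` and probability masses `ref_pmf`."""
    samples = np.sort(np.asarray(samples, dtype=float))
    n = len(samples)
    ecdf = np.concatenate(([0.0], np.arange(1, n + 1) / n))
    ref_cdf = np.concatenate(([0.0], np.cumsum(ref_pmf)))

    def cdf_at(jumps, cdf, t, side):
        # side="right" gives F(t); side="left" gives the left limit F(t-)
        return cdf[np.searchsorted(jumps, t, side=side)]

    z = np.union1d(samples, ref_x)  # union of all jump points
    d_right = np.abs(cdf_at(samples, ecdf, z, "right") - cdf_at(ref_x, ref_cdf, z, "right"))
    d_left = np.abs(cdf_at(samples, ecdf, z, "left") - cdf_at(ref_x, ref_cdf, z, "left"))
    return float(max(d_right.max(), d_left.max()))
```

Checking the left limits matters here because the reference F2 is itself a step function, so the largest discrepancy may occur just before a jump of either CDF.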

4.8.4 Algorithms

We explore the following algorithms: the particle marginal Metropolis–Hastings (PMMH) sampler, sequential importance resampling for parameter estimation (SIR-PE), the tree-based parameter estimation algorithm (TPE), and TPE-SIR, which combines TPE and SIR-PE. We denote the output sample size of each algorithm by N, and describe the tuning procedures and parameters.

PMMH

In the PMMH sampler (see Algorithm 10), the proposal q(θ*|θ) in (4.3) needs to be specified. We denote θ* = (ρ*, σ1*², σ2*²) and construct

q(θ*|θ) = q(ρ*, σ1*², σ2*² | ρ, σ1², σ2²)
        = qρ(ρ*|ρ) qσ1²(σ1*²|σ1²) qσ2²(σ2*²|σ2²),

where

qρ( · |ρ) ∼ N(ρ, 0.1),
qσ1²( · |σ1²) ∼ N(σ1², 0.05),
qσ2²( · |σ2²) ∼ N(σ2², 0.05).

When a negative value of σ1*² from qσ1²(·|σ1²) or of σ2*² from qσ2²(·|σ2²) is generated, we set the acceptance ratio to 0 in the Metropolis–Hastings update. The initial value of θ is determined by the mean from a preliminary run of SIR-PE.

We run a particle smoother with n samples to select a proposed path x*0:T and to compute the estimated likelihood p̂θ*(y0:T). Multinomial resampling is executed after every importance sampling step.
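The proposal and the negative-variance rejection rule can be sketched as follows. This is a simplified illustration, not Algorithm 10 itself: `log_lik_hat` and `log_prior` are assumed callables (in an actual PMMH run, the likelihood estimate for the current θ would be stored rather than recomputed), and the Gaussian random walk is symmetric, so the proposal densities cancel in the ratio.

```python
import numpy as np

rng = np.random.default_rng(1)

def propose(theta):
    """Gaussian random-walk proposal for theta = (rho, sigma1^2, sigma2^2)
    with variances (0.1, 0.05, 0.05) as in the text; numpy's normal takes
    the standard deviation, hence the square roots."""
    rho, s1, s2 = theta
    return (rho + rng.normal(0.0, np.sqrt(0.1)),
            s1 + rng.normal(0.0, np.sqrt(0.05)),
            s2 + rng.normal(0.0, np.sqrt(0.05)))

def acceptance_ratio(theta_star, theta, log_lik_hat, log_prior):
    """Metropolis-Hastings ratio for the symmetric proposal; a proposed
    negative variance is rejected outright (ratio 0), as in the text."""
    if theta_star[1] <= 0 or theta_star[2] <= 0:
        return 0.0
    log_r = (log_lik_hat(theta_star) + log_prior(theta_star)
             - log_lik_hat(theta) - log_prior(theta))
    return min(1.0, float(np.exp(log_r)))
```

Working on the log scale avoids underflow in the product of the estimated likelihood and the prior over long observation sequences.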

SIR-PE

In SIR-PE (see Algorithm 11), we resample the particles after every importance sampling step using multinomial resampling.

TPE

In the importance sampling steps of TPE, we do not proliferate particles, and we apply multinomial resampling afterwards.

We now describe the construction of the intermediate target distributions in TPE-O and TPE-EP. In TPE-O, the intermediate target distributions are well defined. In TPE-EP, the priors {µj, pj}_{j=0}^{T} in Algorithm 13 need to be established. We assume

µj(θ) = µj(ρ, σ1², σ2²) = µj,ρ(ρ) µj,σ1²(σ1²) µj,σ2²(σ2²),

where µj,ρ is a normal distribution, and µj,σ1² and µj,σ2² are inverse Gamma distributions. The parameters of µj,ρ, µj,σ1², µj,σ2² are all computed using moment matching from their Monte Carlo samples.

The distribution f̃j,l of the random variable ∆θj,l = (∆ρj,l, ∆σ1²j,l, ∆σ2²j,l) in (4.8) is built as follows. We assume ∆ρj,l, ∆σ1²j,l and ∆σ2²j,l are mutually independent. We impose a normal distribution on ∆ρj,l and inverse Gamma distributions on ∆σ1²j,l and ∆σ2²j,l. The parameters of these distributions can be estimated from the Monte Carlo samples using (4.24) for ∆ρj,l, and (4.28) for ∆σ1²j,l and ∆σ2²j,l.

In the stochastic combination of the overlapping parameters for constructing g1 (see Section 4.7), the distribution κ of the noise ε also needs to be specified. We assume ε to be multivariate normal with a diagonal covariance matrix. The marginal means and variances of the distribution have been discussed in Section 4.7.2 and Section 4.7.3.

We further classify TPE by two criteria. The first is the prior of the sub-HMMs described in Section 4.5.2 and Section 4.5.3. The second is the combination method of the overlapping parameters, which is illustrated for building the function g1 in Section 4.7. The available options are listed in Table 4.1.

Table 4.1: Options in TPE regarding the prior information of the sub-HMMs and the combination method of the overlapping parameters.

Prior in the sub-HMMs        O: Original prior (see Section 4.5.2)
                             EP: Estimated prediction prior (see Section 4.5.3)
Combination method           D: Deterministic approach
(see Section 4.7)            S: Stochastic approach

We extend the name of TPE with the following format:

TPE-'Prior in the sub-HMMs'('Combination method').

For example, TPE-O(D) applies the original priors to the sub-HMMs and employs the deterministic approach to combine the overlapping parameters. We therefore have four versions of TPE: TPE-O(D), TPE-O(S), TPE-EP(D), TPE-EP(S).

TPE-SIR

It remains to choose the depth of the auxiliary tree in TPE-SIR, given that the tuning procedures of TPE and SIR-PE have been described. We create auxiliary trees of 7, 5 and 3 levels with 4, 16 and 64 observations at the leaf nodes, respectively. We also operate every version of TPE in TPE-SIR, and similarly name each algorithm with the format:

TPE-'Prior in the sub-HMMs'('Combination method')-SIR-'Depth of the auxiliary tree'.

The choices of the prior in the sub-HMMs and of the combination method of the overlapping parameters can be found in Table 4.1. Hence, TPE-O(D)-SIR-5 indicates that TPE-SIR employs the original priors in the sub-HMMs with the deterministic combination of the overlapping parameters, and that the auxiliary tree has 5 levels with 16 observations at each leaf node.

4.8.5 Simulation Parameters and Results

We have implemented the PMMH sampler, SIR-PE, and the different versions of TPE and TPE-SIR in R. Due to the intensive runtime of the PMMH sampler, we run all algorithms under the same sample size N = 1000. We also set n = 1000 as the sample size of any tuning procedure involving Monte Carlo sampling in the algorithms to be implemented; this includes the bootstrap particle smoother in the PMMH sampler, and the estimation of the priors in TPE-EP and TPE-EP-SIR. The number of simulations is set to M = 200 with the same set of observations. We record the KS statistics KSSx,m, KSSρ,m, KSSσ1²,m, KSSσ2²,m defined in Section 4.8.3, and the runtime of the algorithms in each simulation.

The simulation results are shown in Table 4.2. Under the same sample size, SIR-PE enjoys the lowest average runtime of around 0.1 seconds, followed by TPE-SIR using an auxiliary tree of 3 levels. TPE, with its deeper tree, is much slower, spending over 5 seconds. The PMMH sampler incurs a significantly longer runtime of roughly 87 seconds for a single run, since the particle smoother is implemented for every proposed path.

The KS statistic KSSx, which indicates the overall performance on the hidden states, is vastly smaller for the PMMH sampler, since each Metropolis–Hastings update produces a completely new path of X0:T if it is accepted.

Table 4.2: Performance of the parameter estimation algorithms under the same sample size
in the HMM

Algorithm N n KSSx (s.e.) KSSρ (s.e.) KSSσ1² (s.e.) KSSσ2² (s.e.) Runtime
SIR-PE 1000 NA 0.42 (0.0046) 0.71 (0.0124) 0.71 (0.0127) 0.72 (0.0121) 0.13
PMMH sampler 1000 1000 0.21 (0.0042) 0.41 (0.0026) 0.58 (0.0167) 0.56 (0.0175) 86.90
TPE-O(D) 1000 NA 0.56 (0.0032) 0.48 (0.0113) 0.77 (0.0088) 0.77 (0.0079) 5.20
TPE-O(S) 1000 NA 0.62 (0.0023) 0.66 (0.0059) 0.88 (0.0047) 0.74 (0.0085) 5.25
TPE-EP(D) 1000 1000 0.51 (0.0070) 0.61 (0.0108) 0.87 (0.0097) 0.58 (0.0126) 5.42
TPE-EP(S) 1000 1000 0.48 (0.0070) 0.68 (0.0096) 0.92 (0.0076) 0.61 (0.0120) 5.49
TPE-O(D)-SIR-7 1000 NA 0.58 (0.0042) 0.45 (0.0113) 0.50 (0.0106) 0.46 (0.0098) 2.48
TPE-O(S)-SIR-7 1000 NA 0.61 (0.0037) 0.58 (0.0088) 0.54 (0.0112) 0.50 (0.0105) 2.48
TPE-EP(D)-SIR-7 1000 1000 0.38 (0.0057) 0.38 (0.0096) 0.50 (0.0110) 0.43 (0.0110) 2.53
TPE-EP(S)-SIR-7 1000 1000 0.37 (0.0027) 0.44 (0.0099) 0.53 (0.0092) 0.45 (0.0093) 2.54
TPE-O(D)-SIR-5 1000 NA 0.57 (0.0049) 0.47 (0.0117) 0.49 (0.0101) 0.47 (0.0100) 1.33
TPE-O(S)-SIR-5 1000 NA 0.62 (0.0047) 0.52 (0.0112) 0.53 (0.0115) 0.53 (0.0126) 1.34
TPE-EP(D)-SIR-5 1000 1000 0.37 (0.0060) 0.34 (0.0096) 0.42 (0.0098) 0.42 (0.0107) 1.45
TPE-EP(S)-SIR-5 1000 1000 0.39 (0.0048) 0.40 (0.0090) 0.47 (0.0092) 0.46 (0.0097) 1.41
TPE-O(D)-SIR-3 1000 NA 0.54 (0.0057) 0.61 (0.0124) 0.61 (0.0124) 0.62 (0.0137) 0.66
TPE-O(S)-SIR-3 1000 NA 0.56 (0.0065) 0.51 (0.0122) 0.55 (0.0128) 0.56 (0.0138) 0.67
TPE-EP(D)-SIR-3 1000 1000 0.43 (0.0062) 0.55 (0.0142) 0.62 (0.0153) 0.60 (0.0164) 0.66
TPE-EP(S)-SIR-3 1000 1000 0.43 (0.0066) 0.54 (0.0132) 0.58 (0.0148) 0.57 (0.0156) 0.66
The best four candidates (or five if a tie exists) in each of the last five columns are marked in bold. Runtime is averaged over all simulations and is measured in seconds. Standard error (s.e.) is the standard deviation divided by √M.

In contrast, SIR-PE, TPE and TPE-SIR all suffer from path degeneracy and produce much larger values.

In terms of the KS statistics of ρ, σ1² and σ2², denoted by KSSρ, KSSσ1² and KSSσ2², the best candidates are TPE-EP(D)-SIR-5 and TPE-EP(D)-SIR-7, which both apply the estimated priors and combine the overlapping parameters in the deterministic way. TPE-EP(S)-SIR-5 and TPE-EP(S)-SIR-7 have slightly larger KS statistics, followed by the comparable result of the PMMH sampler.

We now compare the different versions of TPE (resp. TPE-SIR). In terms of the priors imposed on the sub-HMMs, TPE-EP (resp. TPE-EP-SIR) shows an evident improvement over TPE-O (resp. TPE-O-SIR). In terms of the combination method of the overlapping parameters, no significant difference is observed between the deterministic approach and the stochastic approach; the former performs slightly better with a deeper auxiliary tree.

We compare the simulation results between TPE and TPE-SIR. In general, TPE is not recommended: building the auxiliary tree down to each single hidden state can be unnecessary considering both time efficiency and sampling accuracy, and TPE-SIR outperforms TPE in terms of the KS statistics in all scenarios. Within the different versions of TPE-SIR, there is a trade-off regarding the depth of the tree. A very low depth of 3 levels makes TPE-SIR faster with a comparatively larger error; this is due to the implementation of SIR-PE on 64 observations, which generates poor initial samples at the leaf nodes. The sampling quality is more satisfactory when SIR-PE is operated on 4 or 16 observations. The corresponding TPE-SIR requires more effort due to the deeper auxiliary tree, but is still less expensive than TPE.

To conclude, the PMMH sampler shows the smallest KS statistic in terms of the hidden states, at the cost of an extremely long runtime. SIR-PE is fast while suffering from degeneracy. Some versions of TPE-SIR demonstrate a decent runtime and a superior performance in terms of the posteriors of the unknown parameter, and we thus regard them as a useful alternative to the existing methods in practice. In general, these algorithms employ estimated priors in the sub-HMMs, and the depth of the associated auxiliary tree can potentially be made shallower, at the cost of compromising the good performance of SIR-PE at the leaf nodes.

4.9 Discussion

This chapter introduces a class of divide-and-conquer sequential Monte Carlo (D&C SMC) algorithms (Lindsten et al., 2017), which we call TPE, to estimate the posterior distribution p(θ, x0:T|y0:T) in a hidden Markov model where θ is an unknown parameter. TPE decomposes the target random variable (θ, X0:T) via an auxiliary binary tree structure, which requires the random variables at the same level of the tree to contain disjoint hidden states and a parameter variable. TPE samples at the leaf nodes initially. Following the binary tree, we gradually merge the samples, targeting the intermediate target distribution at each non-root node. The sampling process ends when we reach the root, which stands for the target distribution.

We denote the target density of the random variable Xj at a leaf node Tj by fθ,j, and the target density of the random variable Xj:l at a non-leaf node Tj:l by fθ,j:l.

We first propose a general sampling procedure in TPE which simulates samples from the intermediate target distributions using importance sampling. At a non-leaf node Tj:l, we create a proposal with the random variable (θj,l, ∆θj,l, Xj:l) transformed from (θj,k−1, θk,l, Xj:l) by the functions θj,l = g1(θj,k−1, θk,l) and ∆θj,l = g2(θj,k−1, θk,l), where θj,k−1 ∼ fθ,j:k−1 and θk,l ∼ fθ,k:l are the target variables from the children of Tj:l. To match the space of the proposal, TPE creates an extended target variable (θj,l, ∆θj,l, Xj:l) ∼ fθ,j:l f̃j,l. We demand ∆θj,l to be independent of (θj,l, Xj:l); its density f̃j,l is specified by the user.

We design the intermediate target distributions at each non-root node in TPE with two approaches, TPE-O and TPE-EP, which both consider the sub-models as HMMs enjoying the same dynamics as the original HMM except for the priors. We refer to these sub-models as sub-HMMs. TPE-O directly inherits the priors from the original HMM, whereas TPE-EP re-estimates these priors from the prediction distributions. With the more informative priors, TPE-EP empirically demonstrates a superior performance compared to TPE-O.

We then build the transformation functions g1 and g2 appearing in the importance sampling steps of TPE. At each non-leaf node Tj:l, the function g1 combines the overlapping parameters θj,k−1 and θk,l from the children to form the parameter θj,l in the proposal, which in practice rejuvenates the samples of θj,l and increases their diversity. We propose a deterministic approach and a stochastic approach for constructing the transformation functions, and build them in two scenarios, when the support of the unknown parameter is R and R+ respectively. The deterministic approach linearly combines (the logarithm of) θj,k−1 and θk,l. The stochastic approach adds independent noise to the arithmetic mean of (the logarithm of) θj,k−1 and θk,l. We have no decisive conclusion from the simulation study regarding the superiority of either approach.

A further contribution of this chapter is the combined algorithm of TPE and SIR-PE called TPE-SIR, where we deliberately establish a shallower auxiliary tree. At the leaf nodes of the tree, we first perform SIR-PE on the sub-HMMs to generate initial samples, and then merge them recursively as in TPE. By constructing a tree of suitable depth, TPE-SIR can benefit from a faster implementation compared to TPE as well as an improved accuracy as opposed to both SIR-PE and TPE.

Overall, TPE represents a novel class of algorithms which addresses the parameter estimation problem in an HMM. Compared to previous parameter estimation algorithms, TPE provides a unique way of rejuvenating parameter samples without a Markov chain Monte Carlo update. We develop multiple versions of TPE, and recommend TPE-SIR, which employs the estimated prediction priors in the sub-HMMs and has a reasonable depth of the auxiliary tree. Such an algorithm has the following strengths: it profits from the efficiency and accuracy of SIR-PE for simulating initial samples, and the re-estimated priors offer less discrepancy between the proposal and the target in the importance sampling steps than the original priors do. Nevertheless, TPE requires several tuning steps. Due to its superior performance for estimating the unknown parameter with a comparatively fast runtime, we consider TPE-SIR a desirable option for solving the parameter estimation problem in an HMM.

5 Conclusion and Future Work

5.1 Conclusion

The present thesis developed Monte Carlo methods to investigate inference problems for hypothesis testing and for hidden Markov models (HMMs). In Chapter 2, we introduced Monte Carlo testing procedures for bounding a specific error called the resampling risk. In Chapters 3 & 4, we proposed Monte Carlo sampling algorithms to target posterior distributions in an HMM.

We considered Monte Carlo tests for computing the p-value in Chapter 2. We focused on the control of the resampling risk (Fay and Follmann, 2002; Fay et al., 2007; Gandy, 2009), which measures the misjudgment of a decision on the p-value, estimated from a Monte Carlo test, with respect to a single threshold or multiple thresholds.

The first part of Chapter 2 introduced a new method called CSM which bounds the resampling risk uniformly with respect to a single threshold. We observed that CSM is conservative in the sense that it does not spend the full risk. We then applied truncation to CSM to accommodate real circumstances with a limited computational budget, and identified a relatively small resampling risk in CSM compared to other truncated procedures. We conclude that CSM is an appealing option in practice due to its simplicity.

The second part of Chapter 2 extended the single threshold to multiple ones, and correspondingly refined the definition of the resampling risk we attempt to control. We generalised the thresholds to p-value buckets, and proposed two new algorithms called mCSM and mSIMCTEST. Both algorithms achieve uniform boundedness of the resampling risk as well as a finite runtime, provided that overlapping p-value buckets are utilised.

Chapter 3 introduced a Monte Carlo algorithm called TPS which simulates samples from the joint smoothing distribution in an HMM. The algorithm is built via the divide-and-conquer approach (Lindsten et al., 2017), in contrast to a sequential approach (Liu and Chen, 1998; Pitt and Shephard, 1999; Doucet et al., 2000). Such a construction reduces the maximum number of updating steps of each hidden state from O(T) to O(log(T)), and hence mitigates path degeneracy.

Another advantage of TPS lies in its adaptive design of the intermediate target distributions, which are crucial to the quality of the samples. We proposed three new forms: the estimated filtering distributions (TPS-EF), the estimated smoothing distributions (TPS-ES) and the exact filtering distributions (TPS-F). The choice among these intermediate targets depends on the user's requirements and computational power. TPS-EF is suggested under computational cost constraints; otherwise, TPS-ES can achieve a higher accuracy than TPS-EF under the same sample size.

TPS also enjoys an adjustable complexity, with a possible reduction to linear complexity with respect to the sample size. The algorithm empirically demonstrated a superior performance to its competitors under comparable computational effort.

The intuition behind TPS was extended to the parameter estimation problem in the HMM, which we explored in Chapter 4. The established class of algorithms, called TPE, inherits the divide-and-conquer strategy (Lindsten et al., 2017) to sample from the joint posterior distribution of the unknown parameter and the hidden states.

Given the auxiliary tree with every node consisting of an unknown parameter, a novelty of TPE lies in the combination of the overlapping parameter variables in the sampling process. We illustrated the combination procedures using a deterministic and a stochastic approach, which both boost the diversity of samples against degeneracy, and avoid a conventional Markov chain Monte Carlo (MCMC) update step.

Similar to the flexible design in TPS, we also proposed two different classes of intermediate target distributions in TPE, implemented in TPE-O and TPE-EP. The two algorithms both treat the sub-models of the intermediate targets as HMMs which have the same dynamics as the original HMM except for the priors. TPE-O adopts the original priors and has a simple implementation. TPE-EP constructs the priors from estimated prediction distributions and can provide a better estimate of the unknown parameter.

We further suggested a specific class of algorithms called TPE-SIR, the amalgamation of TPE and a sequential Monte Carlo (SMC) algorithm called SIR-PE. TPE-SIR creates a shallower auxiliary tree than TPE, whose initial samples are produced from SIR-PE in a simpler and more efficient way. In the simulation study, a decreased runtime compared to TPE and an improved accuracy in estimating the unknown parameter both contributed to the strengths of TPE-SIR.

5.2 Future Work

This thesis leaves several topics for further discussion and possible extension.

In Chapters 3 & 4, we considered off-line inference in an HMM, where 'off-line' indicates a fixed number of observations. This could be extended to on-line inference, where new observations constantly become available as the time step T proceeds.

In the on-line setting, the smoothing distribution requires sequential updates. We aim to sample from {p(x0:T|y0:T)}T∈N or the marginal smoothing distributions {p(xt|y0:T)}T∈N for every t ≤ T. For parameter estimation, we similarly estimate {p(θ|y0:T)}T∈N or {p(θ, x0:T|y0:T)}T∈N.

In the literature of parameter estimation, most on-line Bayesian methods still apply Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). A practical filter (Polson et al., 2008) implements parallel MCMC chains based upon a fixed-lag approximation, which introduces extra bias to the algorithm. Gilks and Berzuini (2001) apply sequential importance resampling similar to SIR-PE, and additionally incorporate MCMC to diversify samples. An SMC² algorithm (Chopin et al., 2013) sequentially updates the unknown parameter via the incremental likelihood, followed by potential resampling and MCMC steps.

                      X0:5
            X0:3              X4:5
        X0:1    X2:3        X4    X5
      X0   X1  X2   X3      X4    X5

Figure 5.1: Update of the auxiliary tree when the new observation y5 is available (in the original figure, the edges to the new nodes containing X5, X4:5 and X0:5 are dashed).

The proposed algorithms TPS in Chapter 3 and TPE in Chapter 4 can be extended to accommodate on-line inference. As new observations are obtained, we can expand the auxiliary tree. In Figure 5.1, we illustrate the update of the tree in the smoothing problem when a new observation y5 becomes available. We define the new nodes to be those connected by at least one dashed edge, at which the samples are to be generated or merged. To be more precise, the new nodes are those containing X5, X4:5 and X0:5. We need to simulate the samples of X5 at the leaf node, merge them with those of X4, and continue to merge with those of X0:3.

The updated auxiliary tree can potentially retain the samples at the old nodes, provided that their intermediate target distributions remain unchanged. This condition is satisfied in TPS-EF (Section 3.7.2) and all versions of TPE. In Figure 5.1, the old nodes refer to those not connected by any dashed edge, such as the nodes of X0, X2:3 and X0:3. They are also the nodes of the left complete sub-tree whose root node consists of X0:3. Practically, we may even eliminate all non-root nodes from the left complete sub-tree, and only store the samples of X0:3. Therefore, on-line TPS and TPE have the strength of saving effort by preserving the samples from part of the tree.

In Chapter 3, one type of TPS called TPS-ES can potentially be enhanced through iterative runs. In TPS-ES, the construction of the intermediate target distributions using estimation techniques roughly maintains the marginals of each hidden state at all levels of the auxiliary tree. These marginals need to be relatively close to the true smoothing distribution. Nevertheless, in practice they can be inadequately built due to a small sample size or unsatisfactory estimation techniques. A possible remedy would be an iterative run of TPS-ES: in each iteration, we adopt the samples from the previous run for the construction of the intermediate targets in the next run. We may increase the sample size as the iterations proceed to ensure the convergence of the intermediate target distributions at the leaf nodes to the true marginal smoothing distributions.

In Chapters 3 & 4, the algorithms were constructed in an HMM with a relatively simple dependence structure, whose state space is often of one dimension. An HMM may have multivariate state spaces, such as factorial HMMs (Ghahramani and Jordan, 1996), or the more complicated dependence described in Rebeschini et al. (2015).

These more advanced HMMs raise new challenges for TPS and TPE, which are
based on the divide-and-conquer approach (Lindsten et al., 2017). In an HMM
with a multivariate state space, the curse of dimensionality implies that the
error of a particle filter grows exponentially with the dimension of the
state space (Rebeschini et al., 2015). Bengtsson et al. (2008) state that the
maximum normalised weight in a single time step of a particle filter
converges to one in a specific class of models if the sample size grows
sub-exponentially in the cube root of the dimension. Rebeschini et al. (2015)
propose the block particle filter, which partitions a high-dimensional state
variable and localises the weight computation; they prove that the resulting
error bound is independent of the dimension. Finke and Singh (2017) apply a
similar block-approximation technique to the smoothing problem, and prove
that the bias of their blocked smoother is uniformly bounded in the dimension
and that its variance is dimension-independent. TPS and TPE applied to
higher-dimensional state spaces could face challenging importance sampling
steps under the existing sampling process. We could employ block-approximation
techniques in TPS and TPE to build the intermediate target distributions and
to adapt the existing sampling procedure. A more ambitious idea would be to
apply the divide-and-conquer strategy not only over time but also over
dimension.
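As an illustration of the localisation idea only (not of TPS or TPE), the following sketch performs one predict-weight-resample step of a block particle filter on a toy Gaussian random-walk model with Gaussian observations: the state vector is partitioned into blocks, and each block is weighted and resampled using only its local likelihood, so no single global weight can degenerate with the dimension. The model and all names are illustrative assumptions.

```python
# Minimal sketch of block-localised weighting and resampling, in the spirit
# of the block particle filter of Rebeschini and Van Handel (2015).
import random, math

def local_weight(x_block, y_block):
    """Gaussian likelihood of one block's observations given its state."""
    return math.exp(-0.5 * sum((x - y) ** 2 for x, y in zip(x_block, y_block)))

def block_filter_step(particles, y, blocks):
    """One predict-weight-resample step, localised to each block."""
    n = len(particles)
    # predict: propagate every coordinate with random-walk noise
    particles = [[x + random.gauss(0, 1) for x in p] for p in particles]
    new = [[None] * len(particles[0]) for _ in range(n)]
    for lo, hi in blocks:                    # e.g. blocks = [(0, 2), (2, 4)]
        w = [local_weight(p[lo:hi], y[lo:hi]) for p in particles]
        total = sum(w)
        probs = [wi / total for wi in w]
        # resample this block independently of the other blocks
        for i in range(n):
            j = random.choices(range(n), weights=probs)[0]
            new[i][lo:hi] = particles[j][lo:hi]
    return new

random.seed(1)
d, n = 4, 200
particles = [[0.0] * d for _ in range(n)]
y = [1.0, -1.0, 2.0, 0.5]
particles = block_filter_step(particles, y, blocks=[(0, 2), (2, 4)])
print(len(particles), len(particles[0]))     # 200 particles of dimension 4
```

The price of this localisation, as discussed in the references above, is a bias at the block boundaries; the benefit is that the weight variance within each block depends on the block size rather than on the full dimension d.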

These open questions all point to improvements and extensions of our existing
algorithms to accommodate more complicated scenarios, which can be studied in
future work.

Bibliography

Altman, D. G. (1990). Practical Statistics for Medical Research. CRC Press.

Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain


Monte Carlo methods. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 72 (3), 269–342.

Arnold, T. B. and J. W. Emerson (2011). Nonparametric goodness-of-fit


tests for discrete null distributions. R Journal 3 (2).

Arulampalam, M. S., S. Maskell, N. Gordon, and T. Clapp (2002). A tutorial


on particle filters for online nonlinear/non-Gaussian Bayesian tracking.
IEEE Transactions on Signal Processing 50 (2), 174–188.

Bahl, L., P. Brown, P. De Souza, and R. Mercer (1986). Maximum mutual


information estimation of hidden Markov model parameters for speech
recognition. ICASSP ’86. IEEE International Conference on Acoustics,
Speech, and Signal Processing 11, 49–52.

Ball, F. G. and J. A. Rice (1992). Stochastic models for ion channels: intro-
duction and bibliography. Mathematical Biosciences 112 (2), 189–206.

Baum, L. E. and T. Petrie (1966). Statistical inference for probabilistic func-


tions of finite state Markov chains. The Annals of Mathematical Statis-
tics 37 (6), 1554–1563.

BBC News (2016). Artificial intelligence: Google’s AlphaGo beats Go master
Lee Se-dol.

Bengtsson, T., P. Bickel, B. Li, et al. (2008). Curse-of-dimensionality revis-


ited: Collapse of the particle filter in very large scale systems. In Proba-
bility and Statistics: Essays in Honor of David A. Freedman, pp. 316–334.
Institute of Mathematical Statistics.

Besag, J. and P. Clifford (1991). Sequential Monte Carlo p-values.


Biometrika 78 (2), 301–304.

Beskos, A., A. Jasra, K. Law, R. Tempone, and Y. Zhou (2017). Multilevel


sequential Monte Carlo samplers. Stochastic Processes and Their Applica-
tions 127 (5), 1417–1440.

Botev, Z. I., P. L’Ecuyer, and B. Tuffin (2013). Markov chain importance


sampling with applications to rare event probability estimation. Statistics
and Computing 23 (2), 271–285.

Brand, M., N. Oliver, and A. Pentland (1997). Coupled hidden Markov


models for complex action recognition. Proceedings of IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, 994–
999.

Briers, M., A. Doucet, and S. Maskell (2010). Smoothing algorithms for


state–space models. Annals of the Institute of Statistical Mathemat-
ics 62 (1), 61–89.

Brooks, S., A. Gelman, G. Jones, and X. Meng (2011). Handbook of Markov


Chain Monte Carlo. CRC Press.

Cappé, O., E. Moulines, and T. Rydén (2006). Inference in Hidden Markov
Models. Springer Science & Business Media.

Carpenter, J., P. Clifford, and P. Fearnhead (1999). Improved particle fil-


ter for nonlinear problems. IEE Proceedings–Radar, Sonar and Naviga-
tion 146 (1), 2–7.

Chopin, N. (2002). A sequential particle filter method for static models.


Biometrika 89 (3), 539–552.

Chopin, N., P. E. Jacob, and O. Papaspiliopoulos (2013). SMC²: an efficient


algorithm for sequential analysis of state space models. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 75 (3), 397–
426.

Cover, T. M. and J. A. Thomas (2012). Elements of Information Theory.


John Wiley & Sons.

Cox, H. (1964). On the estimation of state variables and parameters for noisy
dynamic systems. IEEE Transactions on Automatic Control 9 (1), 5–12.

Davidson, R. and J. G. MacKinnon (2000). Bootstrap tests: How many


bootstraps? Econometric Reviews 19 (1), 55–68.

Davison, A. C., D. V. Hinkley, et al. (1997). Bootstrap Methods and Their


Application, Volume 1. Cambridge University Press.

Devroye, L. (2006). Nonuniform random variate generation. Handbooks in


Operations Research and Management Science 13, 83–121.

Doucet, A., N. De Freitas, and N. Gordon (2001). Sequential Monte Carlo


Methods in Practice. Springer.

Doucet, A., S. Godsill, and C. Andrieu (2000). On sequential Monte Carlo
sampling methods for Bayesian filtering. Statistics and Computing 10 (3),
197–208.

Doucet, A. and A. M. Johansen (2009). A tutorial on particle filtering and


smoothing: Fifteen years later. Handbook of Nonlinear Filtering 12, 656–
704.

Dupont, W. D. and W. D. Plummer (1990). Power and sample size calcula-


tions: a review and computer program. Controlled Clinical Trials 11 (2),
116–128.

Dwass, M. (1957). Modified randomization tests for nonparametric hypothe-


ses. The Annals of Mathematical Statistics, 181–187.

Fay, M. P. and D. A. Follmann (2002). Designing Monte Carlo implementa-


tions of permutation or bootstrap hypothesis tests. The American Statis-
tician 56 (1), 63–70.

Fay, M. P., H.-J. Kim, and M. Hachey (2007). On using truncated sequential
probability ratio test boundaries for Monte Carlo implementation of hy-
pothesis tests. Journal of Computational and Graphical Statistics 16 (4),
946–967.

Fearnhead, P., D. Wyncoll, and J. Tawn (2010). A sequential smoothing


algorithm with linear computational cost. Biometrika 97 (2), 447–464.

Finke, A. and S. S. Singh (2017). Approximate smoothing and parameter


estimation in high-dimensional state-space models. IEEE Transactions on
Signal Processing 65 (22), 5982–5994.

Gandy, A. (2009). Sequential implementation of Monte Carlo tests with
uniformly bounded resampling risk. Journal of the American Statistical
Association 104 (488), 1504–1511.

Gandy, A., G. Hahn, and D. Ding (2017). Implementing Monte Carlo tests
with p-value buckets. arXiv preprint arXiv:1703.09305 .

Gandy, A. and F. D.-H. Lau (2016). The chopthin algorithm for resampling.
IEEE Transactions on Signal Processing 64 (16), 4273–4281.

Gandy, A. and P. Rubin-Delanchy (2013). An algorithm to compute the


power of Monte Carlo tests with guaranteed precision. The Annals of
Statistics 41 (1), 125–142.

Gelb, A. (1974). Applied Optimal Estimation. MIT Press.

Gerber, M. and N. Chopin (2015). Sequential quasi Monte Carlo. Journal


of the Royal Statistical Society: Series B (Statistical Methodology) 77 (3),
509–579.

Gerber, M., N. Chopin, and N. Whiteley (2018). Negative association, or-


dering and convergence of resampling methods. The Annals of Statistics,
to appear.

Ghahramani, Z. and M. I. Jordan (1996). Factorial hidden Markov models.


Advances in Neural Information Processing Systems, 472–478.

Gilks, W. R. and C. Berzuini (2001). Following a moving target – Monte


Carlo inference for dynamic Bayesian models. Journal of the Royal Statis-
tical Society: Series B (Statistical Methodology) 63 (1), 127–146.

Gleser, L. (1996). Comment on ‘bootstrap confidence intervals’ by T. J.


DiCiccio and B. Efron. Statistical Science 11, 219–221.

Godsill, S., P. Rayner, and O. Cappé (2002). Digital audio restoration. In
Applications of Digital Signal Processing to Audio and Acoustics, pp. 133–
194. Springer.

Godsill, S. J., A. Doucet, and M. West (2004). Monte Carlo smoothing


for nonlinear time series. Journal of the American Statistical Associa-
tion 99 (465).

Gordon, N. J., D. J. Salmond, and A. F. Smith (1993). Novel approach


to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings –
Radar, Sonar and Navigation 140 (2), 107–113.

Grimmett, G. and D. Stirzaker (2001). Probability and Random Processes.


Oxford University Press.

Hadar, U. et al. (2009). High-order hidden Markov models-estimation and


implementation. 2009 IEEE/SP 15th Workshop on Statistical Signal Pro-
cessing, 249–252.

Hamilton, J. D. (1989). A new approach to the economic analysis of nonsta-


tionary time series and the business cycle. Econometrica: Journal of the
Econometric Society, 357–384.

Haykin, S. (2004). Kalman Filtering and Neural Networks, Volume 47. John
Wiley & Sons.

Hol, J. D., T. B. Schon, and F. Gustafsson (2006). On resampling algo-


rithms for particle filters. 2006 IEEE Nonlinear Statistical Signal Process-
ing Workshop, 79–82.

Hooper, R. et al. (2013). Versatile sample-size calculation using simulation.


The Stata Journal 13 (1), 21–38.

Hope, A. C. (1968). A simplified Monte Carlo significance test procedure.
Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy) 30 (3), 582–598.

Huang, X. D., Y. Ariki, and M. A. Jack (1990). Hidden Markov Models for
Speech Recognition. Edinburgh University Press.

IBM Corporation (2013). IBM SPSS Statistics for Windows. Armonk, NY:
IBM Corporation.

Jacquier, E., N. G. Polson, and P. E. Rossi (2002). Bayesian analysis


of stochastic volatility models. Journal of Business & Economic Statis-
tics 20 (1), 69–87.

Jazwinski, A. H. (2007). Stochastic Processes and Filtering Theory. Courier


Corporation.

Julier, S. J., J. K. Uhlmann, and H. F. Durrant-Whyte (1995). A new


approach for filtering nonlinear systems. Proceedings of 1995 American
Control Conference 3, 1628–1632.

Kalman, R. E. et al. (1960). A new approach to linear filtering and prediction


problems. Journal of Basic Engineering 82 (1), 35–45.

Kantas, N., A. Doucet, S. S. Singh, J. Maciejowski, N. Chopin, et al. (2015).


On particle methods for parameter estimation in state-space models. Sta-
tistical Science 30 (3), 328–351.

Kantas, N., A. Doucet, S. S. Singh, and J. M. Maciejowski (2009). An


overview of sequential Monte Carlo methods for parameter estimation in
general state-space models. IFAC Proceedings Volumes 42 (10), 774–785.

Kaplan, E. and C. Hegarty (2005). Understanding GPS: Principles and Ap-
plications. Artech House.

Kim, H.-J. (2010). Bounding the resampling risk for sequential Monte Carlo
implementation of hypothesis tests. Journal of Statistical Planning and
Inference 140 (7), 1834–1843.

Kim, S., N. Shephard, and S. Chib (1998). Stochastic volatility: likelihood


inference and comparison with ARCH models. Review of Economic Stud-
ies 65 (3), 361–393.

Kitagawa, G. (1987). Non-Gaussian state space modeling of nonstationary


time series. Journal of the American Statistical Association 82 (400), 1032–
1041.

Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian


nonlinear state space models. Journal of Computational and Graphical
Statistics 5 (1), 1–25.

Kitagawa, G. (1998). A self-organizing state-space model. Journal of the


American Statistical Association, 1203–1215.

Kitagawa, G. and S. Sato (2001). Monte Carlo smoothing and self-organising


state-space model. In Sequential Monte Carlo Methods in Practice, pp.
177–195. Springer.

Klaas, M., M. Briers, N. De Freitas, A. Doucet, S. Maskell, and D. Lang


(2006). Fast particle smoothing: If I had a million particles. Proceedings
of the 23rd International Conference on Machine Learning, 481–488.

Klaas, M., N. De Freitas, and A. Doucet (2005). Toward practical N² Monte

Carlo: The marginal particle filter. In Proceedings of Uncertainty in Arti-
ficial Intelligence.

Koller, D., N. Friedman, and F. Bach (2009). Probabilistic Graphical Models:


Principles and Techniques. MIT Press.

Kong, A., J. S. Liu, and W. H. Wong (1994). Sequential imputations and


Bayesian missing data problems. Journal of the American Statistical As-
sociation 89 (425), 278–288.

Krogh, A., B. Larsson, G. Von Heijne, and E. L. Sonnhammer (2001). Pre-


dicting transmembrane protein topology with a hidden Markov model:
application to complete genomes. Journal of Molecular Biology 305 (3),
567–580.

Kulldorff, M. (2001). Prospective time periodic geographical disease surveil-


lance using a scan statistic. Journal of the Royal Statistical Society: Series
A (Statistics in Society) 164 (1), 61–72.

Lai, T. L. (1976). On confidence sequences. The Annals of Statistics 4 (2),


265–280.

Lee, D. S. and N. K. Chia (2002). A particle algorithm for sequential Bayesian


parameter estimation and model selection. IEEE Transactions on Signal
Processing 50 (2), 326–336.

Lee, L.-M. and J.-C. Lee (2006). A study on high-order hidden Markov
models and applications to speech recognition. International Conference
on Industrial, Engineering and Other Applications of Applied Intelligent
Systems, 682–690.

Lin, M. T., J. L. Zhang, Q. Cheng, and R. Chen (2005). Independent particle
filters. Journal of the American Statistical Association 100 (472), 1412–
1421.

Lindsten, F., A. M. Johansen, C. A. Naesseth, B. Kirkpatrick, T. B. Schön,


J. Aston, and A. Bouchard-Côté (2017). Divide-and-conquer with sequen-
tial Monte Carlo. Journal of Computational and Graphical Statistics 26 (2),
445–458.

Lindsten, F., M. I. Jordan, and T. B. Schön (2014). Particle Gibbs with


ancestor sampling. The Journal of Machine Learning Research 15 (1),
2145–2184.

Liu, J. S. and R. Chen (1995). Blind deconvolution via sequential imputa-


tions. Journal of the American Statistical Association 90 (430), 567–576.

Liu, J. S. and R. Chen (1998). Sequential Monte Carlo methods for dynamic
systems. Journal of the American Statistical Association 93 (443), 1032–
1044.

Massaro, M. and D. Blair (2003). Comparison of population numbers of


yellow-eyed penguins, Megadyptes antipodes, on Stewart Island and on
adjacent cat-free islands. New Zealand Journal of Ecology, 107–113.

Massey Jr, F. J. (1951). The Kolmogorov-Smirnov test for goodness of fit.


Journal of the American Statistical Association 46 (253), 68–78.

Metropolis, N. and S. Ulam (1949). The Monte Carlo method. Journal of


the American Statistical Association 44 (247), 335–341.

Moher, D., C. S. Dulberg, and G. A. Wells (1994). Statistical power, sample

size, and their reporting in randomized controlled trials. JAMA 272 (2),
122–124.

Naesseth, C. A., S. W. Linderman, R. Ranganath, and D. M. Blei (2017).


Variational sequential Monte Carlo. arXiv preprint arXiv:1705.11140 .

Newton, M. A. and C. J. Geyer (1994). Bootstrap recycling: a Monte Carlo


alternative to the nested bootstrap. Journal of the American Statistical
Association 89 (427), 905–912.

Paninski, L., Y. Ahmadian, D. G. Ferreira, S. Koyama, K. R. Rad, M. Vidne,


J. Vogelstein, and W. Wu (2010). A new look at state-space models for
neural data. Journal of Computational Neuroscience 29 (1-2), 107–126.

Petersen, T. N., S. Brunak, G. von Heijne, and H. Nielsen (2011). SignalP


4.0: discriminating signal peptides from transmembrane regions. Nature
Methods 8 (10), 785.

Pitt, M. K., R. dos Santos Silva, P. Giordani, and R. Kohn (2012). On some
properties of Markov chain Monte Carlo simulation methods based on the
particle filter. Journal of Econometrics 171 (2), 134–151.

Pitt, M. K. and N. Shephard (1999). Filtering via simulation: Auxiliary


particle filters. Journal of the American Statistical Association 94 (446),
590–599.

Polson, N. G., J. R. Stroud, and P. Müller (2008). Practical filtering with


sequential parameter learning. Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 70 (2), 413–428.

R Development Core Team (2008). R: A Language and Environment for

Statistical Computing. Vienna, Austria: R Foundation for Statistical Com-
puting.

Rabiner, L. and B. Juang (1986). An introduction to hidden Markov models.


IEEE ASSP magazine 3 (1), 4–16.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected


applications in speech recognition. Proceedings of the IEEE 77 (2), 257–
286.

Rauch, H. E., C. Striebel, and F. Tung (1965). Maximum likelihood estimates


of linear dynamic systems. AIAA Journal 3 (8), 1445–1450.

Rebeschini, P., R. Van Handel, et al. (2015). Can local particle filters beat
the curse of dimensionality? The Annals of Applied Probability 25 (5),
2809–2866.

Robbins, H. (1970). Statistical methods related to the law of the iterated


logarithm. Annals of Mathematical Statistics 41, 1397–1409.

Ruxton, G. D. and M. Neuhäuser (2013). Improving the reporting of p-


values generated by randomization methods. Methods in Ecology and Evo-
lution 4 (11), 1033–1036.

Särkkä, S. (2013). Bayesian Filtering and Smoothing, Volume 3. Cambridge


University Press.

Särkkä, S. et al. (2006). Recursive Bayesian Inference on Stochastic Differ-


ential Equations. Helsinki University of Technology.

Sarmavuori, J. and S. Särkkä (2012). Fourier-Hermite Rauch-Tung-Striebel


smoother. 2012 Proceedings of the 20th European Signal Processing Con-
ference (EUSIPCO), 2109–2113.

Schäfer, C. and N. Chopin (2013). Sequential Monte Carlo on large binary
sampling spaces. Statistics and Computing 23 (2), 163–184.

Scott, S. L., A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and


R. E. McCulloch (2016). Bayes and big data: The consensus Monte Carlo
algorithm. International Journal of Management Science and Engineering
Management 11 (2), 78–88.

Sileshi, B., C. Ferrer, and J. Oliver (2013). Particle filters and resampling
techniques: Importance in computational complexity analysis. 2013 Con-
ference on Design and Architectures for Signal and Image Processing, 319–
325.

Silva, I., R. Assunção, et al. (2018). Truncated sequential Monte Carlo test
with exact power. Brazilian Journal of Probability and Statistics 32 (2),
215–238.

Silva, I., R. Assunção, and M. Costa (2009). Power of the sequential Monte
Carlo test. Sequential Analysis 28 (2), 163–174.

Silva, I. R. and R. M. Assunção (2013). Optimal generalized truncated se-


quential Monte Carlo test. Journal of Multivariate Analysis 121, 33–49.

Silver, D., A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driess-


che, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al.
(2016). Mastering the game of Go with deep neural networks and tree
search. Nature 529 (7587), 484.

Sonnhammer, E. L., G. Von Heijne, A. Krogh, et al. (1998). A hidden Markov


model for predicting transmembrane helices in protein sequences. ISMB 6,
175–182.

Tango, T. and K. Takahashi (2005). A flexibly shaped spatial scan statistic
for detecting clusters. International Journal of Health Geographics 4 (1),
11.

van de Meent, J.-W., H. Yang, V. Mansinghka, and F. D. Wood (2015). Par-


ticle Gibbs with ancestor sampling for probabilistic programs. Proceedings
of Artificial Intelligence and Statistics (AISTATS).

Wald, A. (1945). Sequential tests of statistical hypotheses. Annals of Math-


ematical Statistics 16, 117–186.

Wald, A. (1973). Sequential Analysis. Courier Corporation.

Walpole, R. E. and R. H. Myers (1993). Probability and Statistics for Engi-


neers and Scientists. Pearson Education.

Welch, B. L. (1947). The generalization of ‘Student’s’ problem when several


different population variances are involved. Biometrika 34, 28–35.

Whiteley, N. (2010). Contribution to the discussion on ‘particle Markov


chain Monte Carlo methods’ by Andrieu, C., Doucet, A., and Holenstein,
R. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology) 72 (3), 306–307.

Yamato, J., J. Ohya, and K. Ishii (1992). Recognizing human action in time-
sequential images using hidden Markov model. Proceedings 1992 IEEE
Computer Society Conference on Computer Vision and Pattern Recogni-
tion, 379–385.

Zhang, Y., M. Brady, and S. Smith (2001). Segmentation of brain MR


images through a hidden Markov random field model and the expectation-

maximization algorithm. IEEE Transactions on Medical Imaging 20 (1),
45–57.
