
Optimal Thinning of MCMC Output

Marina Riabiz1,2, Wilson Ye Chen3, Jon Cockayne2, Pawel Swietach4,
Steven A. Niederer1, Lester Mackey5, Chris. J. Oates6,2*

1 King's College London, UK   2 Alan Turing Institute, UK
3 University of Sydney, Australia   4 Oxford University, UK
5 Microsoft Research, US   6 Newcastle University, UK

July 1, 2020

arXiv:2005.03952v2 [stat.ME] 29 Jun 2020

Abstract
The use of heuristics to assess the convergence and compress the output of Markov
chain Monte Carlo can be sub-optimal in terms of the empirical approximations that
are produced. Typically a number of the initial states are attributed to “burn in” and
removed, whilst the remainder of the chain is “thinned” if compression is also required.
In this paper we consider the problem of retrospectively selecting a subset of states, of
fixed cardinality, from the sample path such that the approximation provided by their
empirical distribution is close to optimal. A novel method is proposed, based on greedy
minimisation of a kernel Stein discrepancy, that is suitable for problems where heavy
compression is required. Theoretical results guarantee consistency of the method and
its effectiveness is demonstrated in the challenging context of parameter inference for
ordinary differential equations. Software is available in the Stein Thinning package
in both Python and MATLAB.
Keywords: Bayesian computation, greedy optimisation, Markov chain Monte Carlo,
reproducing kernel, Stein’s method

1 Introduction
The most popular computational tool for non-conjugate Bayesian inference is Markov chain
Monte Carlo (MCMC). Since its introduction to statistics from the physics literature in Hastings (1970);
Geman and Geman (1984); Tanner and Wong (1987); Gelfand and Smith (1990), an enormous
amount of research effort has been expended in the advancement of MCMC
methodology. Such is the breadth of this topic that we do not attempt a survey here, but

Address for correspondence: Chris. J. Oates, School of Mathematics, Statistics and Physics, Herschel
Building, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK. E-mail: chris.oates@ncl.ac.uk

instead refer the reader to Robert and Casella (2013); Green et al. (2015) and the references
therein for more advanced material. This paper is motivated by the fact that the approaches
used for convergence assessment and to post-process the output of MCMC can strongly affect
the estimates that are produced.
Let P be a distribution supported on Rd and let (Xi )i∈N be a P -invariant Markov chain,
also on Rd . The Markov chain sample path provides an empirical approximation

$$\frac{1}{n}\sum_{i=1}^{n}\delta(X_i) \tag{1}$$

to P , where δ(x) denotes a point mass centred at x ∈ Rd . Our discussion supposes that a
practitioner is prepared to simulate a Markov chain up to a maximum number of iterations,
n, and that simulating further iterations is not practical; a scenario that is often encountered
(e.g. see Section 4.3). In this setting it is common (and indeed recommended) to replace (1)
with an alternative estimator

$$\frac{1}{m}\sum_{j=1}^{m}\delta(X_{\pi(j)}) \tag{2}$$

that is based on a subset of the total MCMC output. The m indices π(j) ∈ {1, . . . , n}
indicate which states are retained and the identification of a suitable index set π is informed
by the following considerations:
Removal of Initial Bias: The distribution of the initial states of the Markov chain may
be quite different to P. To mitigate this, it is desirable to identify a "burn-in" $(X_i)_{i=1}^{b}$ which
is then discarded. The burn-in period b is typically selected using convergence diagnostics
(Cowles and Carlin, 1996). These are primarily based on the empirical distribution of simple
moment, quantile or density estimates across independent chains, together with a judgement
as to whether the ensemble of chains has converged to the distributional target. The main
limitation of convergence diagnostics, as far as we are concerned in this work, is that in
taking b large enough to make bias negligible, the number n − b of remaining samples may
be rather small, such that the statistical efficiency of the estimator in (2) is sub-optimal as
an approximation of P . Nonetheless, a considerable portion of Bayesian pedagogy is devoted
to the identification of the burn-in period, as facilitated using diagnostic tests that are built
into commercial-grade software such as WinBUGS (Lunn et al., 2000), JAGS (Plummer, 2003),
R (R Core Team, 2020) and Stan (Carpenter et al., 2017).
Increased Statistical Efficiency: It is often stated that discarding part of the MCMC
output leads to a reduction in the statistical efficiency of the estimator (2) compared to
(1). This argument, made e.g. in Geyer (1992), applies only when the procedure used to
discard part of the MCMC output does not itself depend on the MCMC output and when
the length n of the MCMC output is fixed. That estimation efficiency can be improved by
discarding a portion of the samples in a way that depends on the samples themselves is in
fact well-established (see e.g. Dwivedi et al., 2019).

Compression of MCMC Output: A third motivation for estimators of the form (2)
comes from the case where one wishes to approximate the expectation of a function f , where
either evaluation of f or storage of its output is associated with a computational cost. In
such situations one may want to control the cardinality m of the index set π and to use
$(X_{\pi(j)})_{j=1}^{m}$ as an experimental design on which f is evaluated. The most popular solution is
to retain only every tth state visited by the Markov chain, a procedure known as "thinning"
of the MCMC output. Owen (2017) considered the problem of how to optimally allocate a
computational budget that can be used either to perform additional iterations of MCMC (i.e.
larger n) or to evaluate f on the MCMC output (i.e. larger m). Owen’s analysis provides
a recommendation on how t should be selected, as a function of the relative cost of the two
computational operations that can be performed. In particular, it is demonstrated that (2)
can be more efficient than (1) in the context of a fixed computational budget.
Taking these considerations into account, the most common approach used to select an
index set π is based on the identification of a suitable burn-in period b and/or a suitable
thinning frequency t, leading to an approximation of the form
$$\frac{1}{\lfloor (n-b)/t \rfloor}\sum_{i=1}^{\lfloor (n-b)/t \rfloor}\delta(X_{b+it}). \tag{3}$$

Here $\lfloor r \rfloor$ denotes the integer part of r. This corresponds to a set of indices π in (2) that
discards the burn-in states and retains only every tth iteration from the remainder of the
MCMC output. It includes the case where no states are removed when b = 0 and t = 1.
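For concreteness, the index set implied by (3) can be constructed directly; the following is a minimal sketch (the function name and the use of NumPy are our own illustrative choices, not part of any package described in this paper).

```python
import numpy as np

def standard_thinning_indices(n, b, t):
    """Indices retained by the estimator in (3): discard a burn-in of length b,
    then keep every t-th state of the remainder (states are indexed 1, ..., n)."""
    num_kept = (n - b) // t                      # floor((n - b) / t) terms in (3)
    return b + t * np.arange(1, num_kept + 1)    # indices b + t, b + 2t, ...

# Example: n = 10, b = 2, t = 3 retains states X_5 and X_8.
```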
Despite their widespread usage, the interplay between the Markov chain sample path and
the heuristics used to select b and t is not widely appreciated. In general it is unclear how
much bias may be introduced by employing a post-processing heuristic that is itself based on
the MCMC output. Indeed, even the basic question of when the post-processed estimator in
(3) is consistent when b and t are chosen based on the MCMC output appears not to have
been studied.
In this paper we propose a novel method, called Stein Thinning, that selects an index
set π, of specified cardinality m, such that the associated empirical approximation is close
to optimal. The method is designed to ensure that (2) is a consistent approximation of P .
This includes situations when the Markov chain on which it is based is not P -invariant, but
we do of course require that the regions of high probability under P are explored. To achieve
this we adopt a kernel Stein discrepancy as our optimality criterion. The minimisation
of kernel Stein discrepancy is performed using a greedy sequential algorithm and the main
contribution of our theoretical analysis is to study the interplay of the greedy algorithm with
the randomness inherent to the MCMC output. The proposed Stein Thinning method is
simple (see Algorithm 1), applicable to most problems where gradients of the log-posterior
density can be computed, and implemented as convenient Python and MATLAB packages that
require no additional user input other than the number m of states to be selected (see
Appendix S1).

1.1 Related Work
Our work contributes to an active area of research that attempts to cast empirical approx-
imation as an optimisation problem. Liu and Lee (2017) considered the use of kernel Stein
discrepancy to optimally weight an arbitrary set $(X_i)_{i=1}^{n} \subset \mathbb{R}^d$ of states in a manner loosely
analogous to importance sampling, at a computational cost of $O(n^3)$. The combined effect of
applying the approach of Liu and Lee (2017) to MCMC output was analysed in Hodgkinson
et al. (2020), who established situations in which the overall procedure will be consistent.
The present paper differs from Liu and Lee (2017) and Hodgkinson et al. (2020) in that
we attempt compression, rather than weighting of the MCMC output. Thus, although
Stein Thinning also attempts to minimise a kernel Stein discrepancy, the algorithm that
we propose and analyse is of a fundamentally different nature to that considered in previous
work, and addresses a different computational task. In principle one could use the algorithm
of Liu and Lee (2017) to assign a weight wi to each Xi and then retain the m states with
largest absolute weights, but this entails an $O(n^3)$ computational cost which is potentially
far larger than the $O(nm^2)$ computational cost of Stein Thinning and, furthermore, the
mathematical justification for discarding states with small absolute weights has not been
established.
If a compressed representation of the posterior P is required, but one is not wedded to
the use of MCMC for generation of candidate states, then several other methods can be
used. Joseph et al. (2015, 2019) proposed a criterion to capture how well an empirical mea-
sure based on a point set approximates P and then applied a global numerical optimisation
method to arrive at a suitable point set (called a “minimum energy design”; MED). A similar
approach was taken in Chen et al. (2018), where a kernel Stein discrepancy was numerically
minimised (called “Stein points”; SP). The reliance on global optimisation renders the the-
oretical analysis of MED and SP difficult. In Chen et al. (2019) the authors considered
using Markov chains to approximately perform numerical optimisation in the context of SP,
allowing a tractable analytic treatment at the expense of a sub-optimal compression of P .
Mak and Joseph (2018) considered selecting a small number of states to minimise a partic-
ular energy function that quantifies the extent to which an empirical measure supported on
those states approximates P (called Support Points). Their analysis covered the case of
the optimal point set, but did not extend to the numerical methods used to approximately
compute it. Liu and Wang (2016); Liu (2017) identified a gradient flow with P as a fixed
point that can be approximately simulated using a particle method (called “Stein variational
gradient descent”; SVGD) based on a relatively small number of particles. At convergence,
one obtains a compressed representation of P; however, the theoretical analysis of SVGD
remains an open and active research topic (see e.g. Duncan et al., 2019). The present paper
differs from the contributions cited in this paragraph, in that (1) our algorithm requires only
the output from one run of MCMC, which is a realistic requirement in many situations,
and (2) we are able to provide a finite sample size error bound (Theorem 2) and a consis-
tency guarantee (Theorem 3) for Stein Thinning, that cover precisely the algorithm that
we implement.

1.2 Outline of the Paper
The paper proceeds, in Section 2, to recall the construction of a kernel Stein discrepancy
and to present Stein Thinning. Then in Section 3 we establish a finite sample size error
bound, as well as a widely-applicable consistency result that does not require the Markov
chain to be P -invariant. In Section 4 we present an empirical assessment of Stein Thinning
in the context of parameter inference for ordinary differential equation models. Conclusions
are contained in Section 5.

2 Methods
In this section we introduce and analyse the Stein Thinning method. First, in Section 2.1,
we recall the construction of a kernel Stein discrepancy and its theoretical properties. The
Stein Thinning method is presented in Section 2.2, whilst Section 2.3 is devoted to imple-
mentational detail.
Before we proceed, we highlight a standing assumption and recall the mathematical
definition of a reproducing kernel:
Standing Assumption: Throughout we assume that the distributional target P admits a
positive and continuously differentiable density p on Rd .
Reproducing Kernel: A reproducing kernel Hilbert space (RKHS) of functions on $\mathbb{R}^d$ is a
Hilbert space, denoted $\mathcal{H}(k)$, equipped with a function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, called a kernel, such
that $\forall x \in \mathbb{R}^d$ we have $k(\cdot, x) \in \mathcal{H}(k)$ and $\forall x \in \mathbb{R}^d, h \in \mathcal{H}(k)$ we have $h(x) = \langle h, k(\cdot, x)\rangle_{\mathcal{H}(k)}$.
In this paper $\langle \cdot, \cdot \rangle_{\mathcal{H}(k)}$ denotes the inner product in $\mathcal{H}(k)$ and the induced norm will be
denoted $\| \cdot \|_{\mathcal{H}(k)}$. For further details, see Berlinet and Thomas-Agnan (2004).

2.1 Kernel Stein Discrepancy


To construct a criterion for the selection of states from the MCMC output we require a
notion of optimal approximation for probability distributions. To this end, recall that an
integral probability metric (IPM) (Müller, 1997), based on a set $\mathcal{F}$ of measure-determining
functions on $\mathbb{R}^d$, is defined as
$$D_{\mathcal{F}}(P, Q) := \sup_{f \in \mathcal{F}} \left| \int_{\mathbb{R}^d} f \, dP - \int_{\mathbb{R}^d} f \, dQ \right|. \tag{4}$$

The fact that $\mathcal{F}$ is measure-determining means that $D_{\mathcal{F}}(P, Q) = 0$ if and only if $P = Q$
is satisfied. Standard choices for $\mathcal{F}$, e.g. that recover Wasserstein distance as the IPM,
cannot be used in the Bayesian context due to the need to compute integrals with respect
to P in (4). To circumvent this issue, the notion of a Stein discrepancy was proposed in
Gorham and Mackey (2015). This was based on Stein’s method (Stein, 1972), which consists
of finding a differential operator AP , depending on P and acting on d-dimensional vector
fields on Rd , and a set G of sufficiently differentiable d-dimensional vector fields on Rd such

that $\int_{\mathbb{R}^d} \mathcal{A}_P g \, dP = 0$ for all $g \in \mathcal{G}$. The proposal of Gorham and Mackey (2015) was to take
$\mathcal{F} = \mathcal{A}_P \mathcal{G}$ to be the image of $\mathcal{G}$ under $\mathcal{A}_P$ in (4), leading to the Stein discrepancy
$$D_{\mathcal{A}_P \mathcal{G}}(P, Q) = \sup_{g \in \mathcal{G}} \left| \int_{\mathbb{R}^d} \mathcal{A}_P g \, dQ \right|. \tag{5}$$

Theoretical analysis has led to sufficient conditions for $\mathcal{A}_P \mathcal{G}$ to be measure-determining
(Gorham and Mackey, 2015). In this paper we focus on a particular form of (5) due to
Liu et al. (2016); Chwialkowski et al. (2016); Gorham and Mackey (2017), called a kernel
Stein discrepancy (KSD). In this case, $\mathcal{A}_P$ is the Langevin Stein operator $\mathcal{A}_P g := p^{-1} \nabla \cdot (pg)$
derived in Gorham and Mackey (2015), where $\nabla \cdot$ denotes the divergence operator on $\mathbb{R}^d$ and
$\mathcal{G} := \{g : \mathbb{R}^d \to \mathbb{R}^d \mid \sum_{i=1}^{d} \|g_i\|_{\mathcal{H}(k)}^2 \leq 1\}$ is the unit ball in a Cartesian product of RKHS.
It follows from construction that the set $\mathcal{A}_P \mathcal{G}$ is the unit ball of another RKHS, denoted
$\mathcal{H}(k_P)$, whose kernel is
$$k_P(x, y) := \nabla_x \cdot \nabla_y k(x, y) + \langle \nabla_x k(x, y), \nabla_y \log p(y) \rangle + \langle \nabla_y k(x, y), \nabla_x \log p(x) \rangle + k(x, y) \langle \nabla_x \log p(x), \nabla_y \log p(y) \rangle, \tag{6}$$

where $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean inner product, $\nabla$ denotes the gradient operator
and subscripts have been used to indicate the variables being acted on by the differential
operators (Oates et al., 2017). Thus KSD is recognised as a maximum mean discrepancy in
$\mathcal{H}(k_P)$ (Gretton et al., 2006) and is fully characterised by the kernel $k_P$; we therefore adopt
the shorthand notation $D_{k_P}(Q)$ for $D_{\mathcal{A}_P \mathcal{G}}(P, Q)$.
In the remainder of this section we recall the main properties of KSD. The first is a
condition on the kernel k that guarantees elements of $\mathcal{H}(k_P)$ have zero mean with respect to
P. In what follows $\|x\| = \langle x, x \rangle^{1/2}$ denotes the Euclidean norm on $\mathbb{R}^d$. It will be convenient
to abuse operator notation, writing $\nabla_x \nabla_y^\top k$ for the Hessian matrix of a bivariate function
$(x, y) \mapsto k(x, y)$.
Proposition 1 (Proposition 1 of Gorham and Mackey (2017)). Let $(x, y) \mapsto \nabla_x \nabla_y^\top k(x, y)$ be
continuous and uniformly bounded on $\mathbb{R}^d$ and let $\int_{\mathbb{R}^d} \|\nabla \log p\| \, dP < \infty$. Then $\int_{\mathbb{R}^d} h \, dP = 0$
for all $h \in \mathcal{H}(k_P)$, where $k_P$ is defined in (6).

The second main property of KSD that we will need is that it can be explicitly computed
for an empirical measure $Q = \frac{1}{n}\sum_{i=1}^{n}\delta(x_i)$, supported on states $x_i \in \mathbb{R}^d$:

Proposition 2 (Proposition 2 of Gorham and Mackey (2017)). Let $(x, y) \mapsto \nabla_x \nabla_y^\top k(x, y)$
be continuous on $\mathbb{R}^d$. Then
$$D_{k_P}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(x_i)\right) = \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} k_P(x_i, x_j)}, \tag{7}$$
where $k_P$ was defined in (6).
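As an illustration of (7), the KSD of an empirical measure is a simple function of the pairwise Stein kernel evaluations. A minimal sketch, assuming a callable k_P(x, y) that evaluates the kernel in (6) is available (this is not the interface of the Stein Thinning package):

```python
import numpy as np

def ksd(points, k_P):
    """Kernel Stein discrepancy (7) of the empirical measure (1/n) sum_i delta(x_i).

    points : array of shape (n, d); k_P : callable mapping a pair of points to a scalar.
    """
    n = len(points)
    K = np.array([[k_P(points[i], points[j]) for j in range(n)] for i in range(n)])
    return np.sqrt(K.sum()) / n   # sqrt of (1/n^2) * sum_{i,j} k_P(x_i, x_j)
```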

The third main property is that KSD provides convergence control. Let $Q_n \Rightarrow P$ denote
weak convergence of a sequence $(Q_n)$ of measures to P. Theoretical analysis in Gorham
and Mackey (2017); Chen et al. (2018); Huggins and Mackey (2018); Chen et al. (2019);
Hodgkinson et al. (2020) established sufficient conditions for when convergence of (7) to
zero implies $\frac{1}{n}\sum_{i=1}^{n}\delta(x_i) \Rightarrow P$. For our purposes we present one such result, from Chen
et al. (2019).

Proposition 3 (Theorem 4 in Chen et al. (2019)). Let P be distantly dissipative, meaning
that $\liminf_{r \to \infty} \kappa(r) > 0$ where
$$\kappa(r) := \inf\left\{ -2\,\frac{\langle \nabla \log p(x) - \nabla \log p(y),\, x - y \rangle}{\|x - y\|^2} : \|x - y\| = r \right\}.$$
Consider the kernel $k(x, y) = (c^2 + \|\Gamma^{-1/2}(x - y)\|^2)^{\beta}$ for some fixed $c > 0$, a fixed positive
definite matrix $\Gamma$ and a fixed exponent $\beta \in (-1, 0)$. Then $D_{k_P}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(x_i)\right) \to 0$ implies
$\frac{1}{n}\sum_{i=1}^{n}\delta(x_i) \Rightarrow P$, where $k_P$ is defined in (6).

The properties just described ensure that KSD is a suitable optimality criterion to consider
for the post-processing of MCMC output. Our attention turns next to the development of
algorithms for minimisation of KSD.

2.2 Greedy Minimisation of KSD


The convergence control afforded by Proposition 3 motivates the design of methods that
select points $(x_i)_{i=1}^{n}$ in $\mathbb{R}^d$ such that (7) is approximately minimised. Continuous optimisa-
tion algorithms over $\mathbb{R}^d$ were proposed for this task in Chen et al. (2018) and Chen et al.
(2019). In Chen et al. (2018), deterministic optimisation techniques were considered for
low-dimensional problems, whereas in Chen et al. (2019) a Markov chain was used to provide
a more practical optimisation strategy when the state space is high-dimensional. In
each case greedy sequential strategies were considered, wherein at iteration n a new state
xn is appended to the current sequence (x1 , . . . , xn−1 ) by searching over a compact subset of
Rd . Chen et al. (2018) also considered the use of conditional gradient algorithms (so-called
Frank-Wolfe, or kernel herding algorithms) but found that greedy algorithms provided better
performance across a range of experiments and therefore we focus on greedy algorithms in
this manuscript.
The present paper is distinguished from earlier work in that we do not attempt to
solve a continuous optimisation problem for selection of the next point xn ∈ Rd . Such
optimisation problems are fundamentally difficult and can at best be approximately solved.
Instead, we exactly solve the discrete optimisation problem of selecting a suitable element
xn from supplied MCMC output. In this sense we expect our findings will be more widely
applicable than previous work, since we are simply performing post-processing of MCMC
output and there exists a variety of commercial-grade software for MCMC. The method that
we propose, called Stein Thinning, is straightforward to implement and succinctly stated
in Algorithm 1. (The convention $\sum_{i=1}^{0} = 0$ is employed.)

Data: The output $(x_i)_{i=1}^{n}$ from an MCMC method, a kernel $k_P$ for which the
conclusion of Proposition 3 holds, and a desired cardinality $m \in \mathbb{N}$.
Result: The indices $\pi$ of a sequence $(x_{\pi(j)})_{j=1}^{m} \subset \{x_i\}_{i=1}^{n}$, where the $\pi(j)$ are elements
of $\{1, \dots, n\}$.
for $j = 1, \dots, m$ do
    $\pi(j) \in \underset{i=1,\dots,n}{\arg\min} \; \frac{k_P(x_i, x_i)}{2} + \sum_{j'=1}^{j-1} k_P(x_{\pi(j')}, x_i)$;
end
Algorithm 1: The proposed method; Stein Thinning.
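In code, Algorithm 1 amounts to maintaining, for every candidate state, the running sum over previously selected states and adding the diagonal penalty $k_P(x_i, x_i)/2$ at each iteration. The sketch below is our own illustration, not the Stein Thinning package itself; it assumes a vectorised function k_P_vec(x, X) that returns $k_P(x, X_i)$ for every row $X_i$ of X.

```python
import numpy as np

def stein_thinning(points, k_P_vec, m):
    """Greedy KSD minimisation (Algorithm 1), returning 0-based indices into points."""
    n = len(points)
    # Diagonal terms k_P(x_i, x_i), the regulariser in Algorithm 1.
    diag = np.array([k_P_vec(points[i], points[i:i + 1])[0] for i in range(n)])
    running = np.zeros(n)            # sum over j' < j of k_P(x_{pi(j')}, x_i)
    pi = np.zeros(m, dtype=int)
    for j in range(m):
        pi[j] = int(np.argmin(diag / 2.0 + running))   # ties broken by smallest index
        running += k_P_vec(points[pi[j]], points)      # update the running sums
    return pi
```

Each iteration requires n kernel evaluations, consistent with the complexity discussed in Remark 2 below.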

The algorithm is illustrated on a simple bivariate Gaussian mixture in Figure 1. Ob-
serve in this figure that the points selected by Stein Thinning do not belong to the
burn-in period (which is visually clear), and that although the Markov chain spent a dispropor-
tionate amount of time in one of the mixture components, the number of points selected by
Stein Thinning is approximately equal across the two components of the target. A detailed
empirical assessment is presented in Section 4.
Remark 1. In the event of a tie, some additional tie-breaking rule should be used to select
the next index. For example, if the minimum in Algorithm 1 is realised by multiple candidate
values Π(j) ⊆ {1, . . . , n}, one could adopt a tie-breaking rule that selects the smallest element
of Π(j) as the value that is assigned to π(j). The rule that is used has no bearing on our
theoretical analysis in Section 3.
Remark 2. The computation associated with iteration j of Algorithm 1 is $O(n r_j)$ where
$r_j \leq \min(j, n)$ is the number of distinct indices in $\{\pi(1), \dots, \pi(j-1)\}$; the computational
complexity of Algorithm 1 is therefore $O(n \sum_{j=1}^{m} r_j)$. For typical MCMC algorithms the
computational complexity is $O(n)$, so the computational complexity of Stein Thinning is
equal to that for MCMC when m is fixed and higher when m is increasing with n, being at
most $O(nm^2)$.
Remark 3. In general the indices in $\pi$ need not be distinct. That is, Algorithm 1 may
prefer to include a duplicate state rather than to include a state which is not useful for
representing P. Indeed, if $m > n$ then the sequence $(x_{\pi(j)})_{j=1}^{m}$ must contain duplicate entries.
Theorem 1 in Section 3 clarifies this behaviour.
Remark 4. An estimator $\frac{1}{m}\sum_{j=1}^{m} f(X_{\pi(j)})$ based on Stein Thinning of MCMC output in-
volves random variables $X_{\pi(j)}$ that are highly dependent by construction and therefore stan-
dard approaches for construction of confidence intervals that exploit the Markov property
cannot be used. Instead, we note that caching of the quantities computed during Algorithm 1
enables the KSD of the resulting empirical distribution to be computed. This in turn facili-
tates the deterministic error bound
$$\left| \frac{1}{m}\sum_{j=1}^{m} f(x_{\pi(j)}) - \int_{\mathbb{R}^d} f(x) \, dP(x) \right| \leq D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(x_{\pi(j)})\right) \left\| f - \int_{\mathbb{R}^d} f(x) \, dP(x) \right\|_{\mathcal{H}(k_P)}.$$

Figure 1: Illustration of Stein Thinning: (a) Contours of the distributional target P.
(b) Markov chain Monte Carlo (MCMC) output, limited to 500 iterations to mimic a challenging
computational context, exhibiting burn-in and autocorrelation that must be identified
and mitigated. (c) A subset of m = 40 states from the MCMC output selected using
Stein Thinning, which correctly ignores the burn-in period and stratifies states approximately
equally across the two components of the target.

The practical estimation of the final term in this bound was discussed in Section 4 of South
et al. (2020).

2.3 Choice of Kernel


The suitability of KSD to quantify how well Q approximates P is determined by the choice
of the kernel k in (6). Several choices are possible and, based on Proposition 3 together with
extensive empirical assessment, Chen et al. (2019) advocated the pre-conditioned inverse
multi-quadric kernel $k(x, y) := (1 + \|\Gamma^{-1/2}(x - y)\|^2)^{-1/2}$ where, compared to Proposition 3,
we have fixed c = 1 (without loss of generality) and β = −1/2. The positive definite matrix
Γ remains to be specified and it is natural to take a data-driven approach where the MCMC
output is used to select Γ. Provided that a fixed number $n_0 \in \mathbb{N}$ of the states $(X_i)_{i=1}^{n_0}$ from the
MCMC output are used in the construction of Γ, the consistency results for Stein Thinning
that we establish in Section 3 are not affected. To explore different strategies for the selection
of Γ, we focus on the following candidates:

• Median (med): The scaled identity matrix $\Gamma = \ell^2 I$, where $\ell = \text{med} := \text{median}\{\|X_i - X_j\| : 1 \leq i < j \leq n_0\}$ is the median Euclidean distance between states (Garreau et al.,
2018). In the rare case that med = 0, an exception should be used, such as $\ell = 1$, to
ensure a positive definite Γ is used.
• Scaled median (sclmed): The scaled identity matrix $\Gamma = \ell^2 I$, where $\ell = \text{med}/\sqrt{\log(m)}$.
This was proposed in the context of Stein variational gradient descent in Liu and Wang
(2016) and arises from the intuition that $\sum_{j'=1}^{m} k_P(x_{\pi(j)}, x_{\pi(j')}) \approx m \exp(-\ell^{-2}\,\text{med}^2) = 1$.
Note the dependence on m means that the preceding theoretical analysis does not
apply when this heuristic is used.

• Sample covariance (smpcov): The matrix Γ can be taken as a sample covariance
matrix
$$\Gamma = \frac{1}{n_0 - 1}\sum_{i=1}^{n_0} (X_i - \bar{X})(X_i - \bar{X})^\top, \qquad \bar{X} := \frac{1}{n_0}\sum_{i=1}^{n_0} X_i,$$
provided that this matrix is non-singular.
The experiments in Section 4 shed light on which of these settings is the most effective, but
we acknowledge that many other settings could also be considered. In what follows, we set
$n_0 = \min(n, 10^3)$ for the med and sclmed settings, to avoid an $O(n^2)$ cost of computing $\ell$,
and otherwise set n0 = n, so that the whole of the MCMC output is used to select Γ. Python
and MATLAB packages are provided and their usage is described in Appendix S1.
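To make the recommended choice concrete, the sketch below evaluates $k_P$ in (6) for the pre-conditioned inverse multi-quadric kernel with c = 1 and β = −1/2, together with the med setting for Γ. It is a simplified illustration under our own naming conventions, not the interface of the Python or MATLAB packages.

```python
import numpy as np

def stein_kernel_imq(x, y, grad_log_p_x, grad_log_p_y, gamma_inv):
    """k_P(x, y) in (6) for k(x, y) = (1 + (x - y)^T Gamma^{-1} (x - y))^{-1/2}."""
    r = x - y
    gr = gamma_inv @ r                      # Gamma^{-1} (x - y)
    q = 1.0 + r @ gr
    k = q ** (-0.5)
    grad_x_k = -(q ** (-1.5)) * gr          # gradient of k with respect to x
    grad_y_k = -grad_x_k                    # k depends on x, y only through x - y
    div_xy = -3.0 * (q ** (-2.5)) * (gr @ gr) + (q ** (-1.5)) * np.trace(gamma_inv)
    return (div_xy
            + grad_x_k @ grad_log_p_y
            + grad_y_k @ grad_log_p_x
            + k * (grad_log_p_x @ grad_log_p_y))

def med_gamma_inv(sample):
    """Gamma^{-1} for the med setting: Gamma = ell^2 I with ell the median pairwise
    Euclidean distance among the (possibly sub-sampled) states."""
    diffs = sample[:, None, :] - sample[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    ell = np.median(dists[np.triu_indices(len(sample), k=1)])
    ell = ell if ell > 0 else 1.0           # fall back to ell = 1 if the median is 0
    return np.eye(sample.shape[1]) / ell ** 2
```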

3 Theoretical Assessment
The theoretical analysis in this section clarifies the limiting behaviour of Stein Thinning
as m, n → ∞. The arguments that we present hold for a distribution P defined on a general
measurable space X , and are presented as such, but the main focus of this paper is the
case X = Rd . Our first main result concerns the behaviour of Stein Thinning on a fixed
sequence $(x_i)_{i=1}^{n}$:
Theorem 1. Let $\mathcal{X}$ be a measurable space and let P be a probability distribution on $\mathcal{X}$. Let
$k_P : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel with $\int_{\mathcal{X}} k_P(x, \cdot) \, dP(x) = 0$ for all $x \in \mathcal{X}$. Let
$(x_i)_{i=1}^{n} \subset \mathcal{X}$ be fixed and consider an index sequence $\pi$ of length m produced by Algorithm 1.
Then we have the bound
$$D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(x_{\pi(j)})\right)^{\!2} \leq D_{k_P}\!\left(\sum_{i=1}^{n} w_i^* \delta(x_i)\right)^{\!2} + \frac{1 + \log(m)}{m} \max_{i=1,\dots,n} k_P(x_i, x_i),$$
where the weights $w^* = (w_1^*, \dots, w_n^*)$ in the first term satisfy
$$w^* \in \underset{\mathbf{1}_n^\top w = 1,\; w \geq 0}{\arg\min} \; D_{k_P}\!\left(\sum_{i=1}^{n} w_i \delta(x_i)\right) \tag{8}$$
where $\mathbf{1}_n^\top = (1, \dots, 1)$ and $w \geq 0$ indicates that $w_i \geq 0$ for $i = 1, \dots, n$.

The proof of Theorem 1 is provided in Appendix S2.1. Its implication is that, given a se-
quence $(x_i)_{i=1}^{n}$, Stein Thinning produces an empirical distribution that converges in KSD
to the optimal weighted empirical distribution $\sum_{i=1}^{n} w_i^* \delta(x_i)$ based on that sequence. Prop-
erties of such optimally weighted empirical measures were studied in Liu and Lee (2017);
Hodgkinson et al. (2020), and are not the focus of the present paper, where the case $m \ll n$
is of principal interest.
The role of Theorem 1 is to study the interaction between the greedy algorithm and a
given sequence $(x_i)_{i=1}^{n}$, and this bound is central to our proof of Theorem 2 which deals with

the case where $(x_i)_{i=1}^{n}$ is replaced by MCMC output. Figure 2 illustrates the terms involved
in Theorem 1. It is clear that a reduction in KSD is achieved by Stein Thinning of the
MCMC output.

Remark 5. To further improve the empirical approximation, we can consider an optimally-
weighted sum $\sum_{j=1}^{m} w_{m,j}^* \delta(x_{\pi(j)})$ where the $w_{m,j}^*$ solve a convex optimisation problem anal-
ogous to (8). Such weights minimise a quadratic function subject to a linear and a non-
negativity constraint and can therefore be precisely computed. If the non-negativity constraint
is removed and the indices in $\pi$ are distinct then
$$v_m := \underset{\mathbf{1}_m^\top v = 1}{\arg\min} \; D_{k_P}\!\left(\sum_{j=1}^{m} v_j \delta(x_{\pi(j)})\right) = \frac{K_P^{-1}\mathbf{1}_m}{\mathbf{1}_m^\top K_P^{-1}\mathbf{1}_m}, \qquad (K_P)_{i,j} := k_P(x_{\pi(i)}, x_{\pi(j)}),$$
as derived in Oates et al. (2017). Figure 2 indicates that the benefit of applying weights $w_m^*$
(red curve) to the output of Stein Thinning (black curve) is limited, likely because the $x_{\pi(j)}$
were selected in a way that avoids redundancy in the point set. A larger improvement is
provided by the weights $v_m$ (blue curve), but in this case the associated empirical measure
may not be a probability distribution.
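The closed-form weights $v_m$ in Remark 5 require only the solution of an m-by-m linear system. A minimal sketch, assuming the matrix $K_P$ of pairwise Stein kernel evaluations over the selected (distinct) states has already been formed:

```python
import numpy as np

def optimal_signed_weights(K_P):
    """v_m = K_P^{-1} 1_m / (1_m^T K_P^{-1} 1_m), as in Remark 5; K_P must be invertible."""
    ones = np.ones(K_P.shape[0])
    u = np.linalg.solve(K_P, ones)     # K_P^{-1} 1_m without forming the inverse
    return u / (ones @ u)
```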

Remark 6. The use of a conditional gradient algorithm, instead of a greedy algorithm, in this
context amounts to simply removing the term kP (xπ(j) , xπ(j) ) in Algorithm 1. As discussed
in Chen et al. (2018), this term can be thought of as a regulariser that lends stability to the
algorithm, avoiding selection of xi that are far from the effective support of P .

Remark 7. Theorem 1 is formulated at a high level of generality and can be applied on non-
Euclidean domains X . In Barp et al. (2018); Liu and Zhu (2018); Xu and Matsuda (2020);
Le et al. (2020) the authors proposed and discussed Stein operators AP for the non-Euclidean
context.

Next we consider the properties of Stein Thinning applied to MCMC output. Let V
be a function $V : \mathcal{X} \to [1, \infty)$ and, for a function $f : \mathcal{X} \to \mathbb{R}$ and a measure $\mu$ on $\mathcal{X}$, let
$\|f\|_V := \sup_{x \in \mathcal{X}} \frac{|f(x)|}{V(x)}$, $\|\mu\|_V := \sup_{\|f\|_V \leq 1} \left| \int_{\mathcal{X}} f \, d\mu \right|$. Recall that a $\psi$-irreducible and aperiodic
Markov chain $(X_i)_{i \in \mathbb{N}} \subset \mathcal{X}$ with nth step transition kernel $P^n$ is V-uniformly ergodic (see
Theorem 16.0.1 of Meyn and Tweedie, 2012) if and only if $\exists R \in [0, \infty), \rho \in (0, 1)$ such that
$$\|P^n(x, \cdot) - P\|_V \leq R V(x) \rho^n \tag{9}$$

for all initial states x ∈ X and all n ∈ N. The notation E will be used to denote expectation
with respect to the law of the Markov chain in the sequel. Theorem 2 establishes a finite
sample size error bound for Stein Thinning applied to MCMC output:

Theorem 2. Let $\mathcal{X}$ be a measurable space and let P be a probability distribution on $\mathcal{X}$. Let
$k_P : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel with $\int_{\mathcal{X}} k_P(x, \cdot) \, dP(x) = 0$ for all $x \in \mathcal{X}$. Consider
a P-invariant, time-homogeneous, reversible Markov chain $(X_i)_{i \in \mathbb{N}} \subset \mathcal{X}$ generated using a

[Figure 2 plots the kernel Stein discrepancy $D_{k_P}$ against the number of points m, for the four empirical measures described in the caption.]

Figure 2: Illustration of Theorem 1: The gray curve represents the unprocessed output from
MCMC in the example of Figure 1. The black curve represents Stein Thinning applied to
this same output and, in addition, weighted output of Stein Thinning is shown for weights
$w_m^*$ subject to $\sum_j w_{m,j}^* = 1$ and $w_{m,j}^* \geq 0$ (solid red) and weights $v_m$ subject only
to $\sum_j v_{m,j} = 1$ (solid blue). The dashed horizontal lines are the limiting values of their
corresponding solid lines as the number m is increased.

V-uniformly ergodic transition kernel, such that (9) is satisfied with $V(x) \geq \sqrt{k_P(x, x)}$ for
all $x \in \mathcal{X}$. Suppose that, for some $\gamma > 0$,
$$b := \sup_{i \in \mathbb{N}} \mathbb{E}\left[ e^{\gamma k_P(X_i, X_i)} \right] < \infty, \qquad M := \sup_{i \in \mathbb{N}} \mathbb{E}\left[ \sqrt{k_P(X_i, X_i)} \, V(X_i) \right] < \infty.$$
Let $\pi$ be an index sequence of length m produced by Algorithm 1 applied to the Markov chain
output $(X_i)_{i=1}^{n}$. Then, with $C = \frac{2R\rho}{1 - \rho}$, we have that
$$\mathbb{E}\left[ D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(X_{\pi(j)})\right)^{\!2} \right] \leq \frac{\log(b)}{\gamma n} + \frac{CM}{n} + \frac{1 + \log(m)}{m} \, \frac{\log(nb)}{\gamma}. \tag{10}$$

The proof of Theorem 2 is provided in Appendix S2.2.

Remark 8. The upper bound in (10) is asymptotically minimised when (up to log factors) m
is proportional to n. In practice we are interested in the case $m \ll n$, so we may for example
set $m = \lfloor n/1000 \rfloor$ if we aim for substantial compression. It is not claimed that the bound in
(10) is tight and indeed empirical results in Section 4 endorse the use of Stein Thinning in
the small m context.

Remark 9. For $\mathcal{X} = \mathbb{R}^d$ and $k_P$ in (6), based on a radial kernel k in (6), meaning that
$k(x, y) = \phi(x - y)$ for some function $\phi : \mathbb{R}^d \to \mathbb{R}$ satisfying $\nabla\phi(0) = 0$, we have that
$k_P(x, x) = -\Delta\phi(0) + \phi(0)\|\nabla \log p(x)\|^2$. The function $x \mapsto \sqrt{k_P(x, x)}$ appearing in the
preconditions of Theorem 2 can therefore be understood in terms of $\|\nabla \log p(x)\|$. Results on
the V -uniform ergodicity of Markov chains, which relate to the preconditions of Theorem 2,
were discussed in Chen et al. (2019).

Since convergence in mean-square does not in general imply almost sure convergence,
we next strengthen the conclusions of Theorem 2. Our final result, Theorem 3, therefore
establishes an almost sure convergence guarantee for Stein Thinning. Furthermore, the
result that follows applies also in the “biased sampler” case, where (Xi )i∈N is a Q-invariant
Markov chain and Q need not equal P :

Theorem 3. Let Q be a probability distribution on $\mathcal{X}$ with P absolutely continuous with re-
spect to Q. Consider a Q-invariant, time-homogeneous, reversible Markov chain $(X_i)_{i \in \mathbb{N}} \subset \mathcal{X}$
generated using a V-uniformly ergodic transition kernel, such that $V(x) \geq \frac{dP}{dQ}(x)\sqrt{k_P(x, x)}$.
Suppose that, for some $\gamma > 0$,
$$b := \sup_{i \in \mathbb{N}} \mathbb{E}\left[ e^{\gamma \max\left(1, \frac{dP}{dQ}(X_i)^2\right) k_P(X_i, X_i)} \right] < \infty, \qquad M := \sup_{i \in \mathbb{N}} \mathbb{E}\left[ \frac{dP}{dQ}(X_i) \sqrt{k_P(X_i, X_i)} \, V(X_i) \right] < \infty.$$
Let $\pi$ be an index sequence of length m produced by Algorithm 1 applied to the Markov chain
output $(X_i)_{i=1}^{n}$. If $m \leq n$ and the growth of n is limited to at most $\log(n) = O(m^{\beta/2})$ for
some $\beta < 1$, then $D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(X_{\pi(j)})\right) \to 0$ almost surely as $m, n \to \infty$. Furthermore, if
the preconditions of Proposition 3 are satisfied, then $\frac{1}{m}\sum_{j=1}^{m}\delta(X_{\pi(j)}) \Rightarrow P$ almost surely as
$m, n \to \infty$.

The proof of Theorem 3 is provided in Appendix S2.3. The interpretation of Theorem 3 is


that one may sample states from a Markov chain that is not P -invariant and yet, under the
stated assumptions (which ensure that regions of high probability under P are explored),
one can use Stein Thinning to still obtain a consistent approximation of P . This should
be contrasted with the Support Points method of Mak and Joseph (2018), which relies on
P being well-approximated by the MCMC output. This completes our theoretical analysis
of Stein Thinning.

4 Empirical Assessment
In this section we compare the performance of Stein Thinning with existing methods for
post-processing MCMC output. Our motivation derives from a problem in which we must
infer a 38-dimensional parameter in a calcium signalling model defined by a stiff system of
6 coupled ordinary differential equations (ODEs). Posterior uncertainty is required to be
propagated through a high-fidelity simulation in a multi-scale and multi-physics model f
of the human heart. Here, compression of the MCMC output can be used to construct an
approximately optimal experimental design on which f can be evaluated. The calcium model
is, however, unsuitable for conducting a thorough in silico assessment due to its associated
computational cost. Therefore in Section 4.1 we first consider a simpler ODE model, where

P can be accurately approximated. Then, as an intermediate example, in Section 4.2 we
consider an ODE model that induces stronger correlations among the parameters in P , before
addressing the calcium model in Section 4.3.
In Appendix S3 we describe the generic structure of a parameter inference problem for
ODEs. In all instances the aim is to post-process the output from MCMC, in order to
produce an accurate empirical approximation of the posterior supported on a small number
m  n of the states that were visited. The following methods were compared:

• The standard approach, which estimates a burn-in period using either the GR diag-
nostic b̂GR,L , L > 1, of Gelman and Rubin (1992); Brooks and Gelman (1998); Gelman
et al. (2014) or the more sophisticated VK diagnostic b̂VK,L , L ≥ 1, of Vats and Knud-
son (2018), in each case based on L independent chains as described in Appendix S4,
followed by thinning as per (3).

• The Support Points algorithm proposed in Mak and Joseph (2018), implemented in
the R package support. (Recall that there do not yet exist theoretical results for this
algorithm as implemented.)

• The Stein Thinning algorithm that we have proposed, with each of the kernel choices
described in Section 2.3.

To ensure that our empirical findings are not sensitive to the choice of MCMC method,
we implemented four Metropolis–Hastings samplers that differ qualitatively according to
the sophistication of their proposal. These were: (i) the Gaussian random walk (RW); (ii)
the adaptive Gaussian random walk (ADA-RW), which uses an estimate of the covariance of
the target (Haario et al., 1999); (iii) the Metropolis-adjusted Langevin algorithm (MALA),
which takes a step in the direction of increasing Euclidean gradient, perturbed by Gaussian
noise (Roberts and Tweedie, 1996); (iv) the preconditioned version of MALA (P-MALA), which
employs a preconditioner based on the Fisher information matrix (Girolami and Calderhead,
2011). Full details are in Appendix S3. Metropolis–Hastings algorithms were selected on the
basis that we were able to successfully implement them on the challenging calcium signalling
model in Section 4.3, which required manually interfacing with the numerical integrator to
produce reliable output.
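As a point of reference for the samplers just listed, the simplest of them, RW, can be sketched as a generic Gaussian random walk Metropolis-Hastings loop; this is an illustration only, not the exact implementation (with manual interfacing to the numerical integrator) used in our experiments.

```python
import numpy as np

def rw_metropolis(log_p, x0, n, step=0.1, rng=None):
    """Gaussian random walk Metropolis-Hastings targeting the density exp(log_p)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    chain = np.empty((n, x.size))
    for i in range(n):
        prop = x + step * rng.standard_normal(x.size)   # symmetric Gaussian proposal
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:        # Metropolis accept/reject
            x, lp = prop, lp_prop
        chain[i] = x
    return chain
```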

4.1 Goodwin Oscillator


The first example that we consider is a negative feedback oscillator due to Goodwin (1965).
The ODE model and the associated d = 4 dimensional inference problem are described in
Appendix S5.1, where one trace plot for each MCMC method, of length $n = 2 \times 10^6$, is
presented in Figure S2.
First we consider the standard approach to post-processing MCMC output, as per (3).
From the trace plots in Figure S2, it is clear that a burn-in period b > 0 is required. For each
method we therefore computed the GR and VK diagnostics, to arrive at candidate values b for
the burn-in period. Default settings were used for all diagnostics, which were computed both

for the multivariate d-dimensional state vector and for the univariate marginals, as reported
in Appendix S5.1. The GR diagnostics were computed using L = 6 independent chains and
the VK diagnostics were computed using both L = 1 and L = 6 independent chains; note
that when L > 1, these diagnostics have access to more information in comparison with
Stein Thinning, in terms of the number of samples that are available to the method. The
estimated values for the burn-in period are reported in Appendix S5.1, Table S4. For all
MCMC methods, neither the univariate nor the multivariate GR diagnostics were satisfied,
so that b̂GR,6 > n and estimation using (3) cannot proceed. The VK diagnostic produced
values b̂VK,L < n, which typically led to about half of the MCMC output being discarded.
Although these diagnostics are well-suited to their intended task of minimising bias in MCMC output, the smaller
number of states left after burn-in removal may lead to inefficient approximation of P and
derived quantities of interest, strikingly so in the case of the GR diagnostic.
optimality criterion enables Stein Thinning to directly address this bias-variance trade-off.
Of course, one can in principle run more iterations of MCMC to provide more diversity in
the remainder of the sample path after burn-in is removed, but in applications such as the
calcium model of Section 4.3 the computational cost associated with each iteration presents
a practical limitation in running more iterations of an MCMC method. Effective methods
to post-process limited output (or, equivalently, a long output from a poorly mixing Markov
chain) are therefore important.
Having identified a burn-in period, the standard approach thins the remainder of the
sample path according to (3). In the experiments that follow we focus on the VK diagnostic
and consider both the smallest and largest estimates obtained for the burn-in period. The
resulting index sets π are displayed, for m = 20 and RW (the simplest MCMC method)
in Figure 3 (top left panel), and in Appendix S5.1, Figures S5 (ADA-RW), S6 (MALA), S7
(P-MALA). In the same figures (top right panel) we show the set of Support Points obtained
using algorithm proposed by Mak and Joseph (2018). The remaining panels display the
output from Stein Thinning. Compared to the standard approach, Support Points and
Stein Thinning produce sets that are more structured.
To assess the performance of these competing methods, we first considered the toy prob-
lem of approximating the posterior mean of each parameter in the Goodwin oscillator as an
average of m points selected from the MCMC output. Figure 4 displays absolute errors for
each method, based on RW; our ground truth was provided by an extended run of MCMC.
Results for the other MCMC methods are provided in Appendix S5.1, Figures S8 (ADA-RW),
S9 (MALA), S10 (P-MALA). Broadly speaking, Stein Thinning tends to provide more accu-
rate estimators compared to the alternatives considered. From Figure 4 it is difficult to see
any difference in performance between med, sclmed and smpcov. To gain more insight, in
Appendix S5.1 we plot marginal density estimates in Figures S11 (RW), S12 (ADA-RW), S13
(MALA), S14 (P-MALA). It is apparent that Stein Thinning improves on the standard ap-
proach, whilst med and sclmed performed slightly better than smpcov. This may be because
in smpcov there are more degrees of freedom in Γ that must be estimated. Support Points
performed on a par with Stein Thinning based on smpcov.
To facilitate a more principled assessment, we computed two quantitative measures for

Figure 3: Projections on the first two coordinates of the RW MCMC output from the Goodwin
oscillator (gray dots), together with the first m = 20 points selected using: the standard
approach of discarding burn-in and thinning the remainder (the estimated burn-in period
is indicated in the legend); the Support Points method; Stein Thinning, for each of the
settings med, sclmed, smpcov.

how well the resulting empirical distributions approximate the posterior. These were (a) the
energy distance (ED; Székely and Rizzo, 2004; Baringhaus and Franz, 2004), given up to an
additive constant by
$$\mathrm{ED} := \frac{2}{m}\sum_{j=1}^{m} \int \|x - x_{\pi(j)}\|_{\Sigma} \, dP(x) - \frac{1}{m^2}\sum_{j,j'=1}^{m} \|x_{\pi(j)} - x_{\pi(j')}\|_{\Sigma}, \tag{11}$$

Figure 4: Absolute error of estimates for the posterior mean of each parameter in the Goodwin
oscillator, based on output from RW MCMC.

Figure 5: Goodwin oscillator: Energy distance (ED) to the posterior, as per (11), for empirical
distributions obtained through traditional burn-in and thinning (grey lines), Support
Points (black line) and Stein Thinning (colored lines), based on output from four different
MCMC methods.

Figure 6: Goodwin oscillator: Kernel Stein discrepancy (KSD) based on med, for empirical
distributions obtained through traditional burn-in and thinning (grey lines), Support
Points (black line) and Stein Thinning (colored lines), based on output from four different
MCMC methods.

where in this paper we used the norm $\|x\|_{\Sigma} := \|\Sigma^{-1/2} x\|$ induced by the covariance matrix
of P, with both Σ and (11) being estimated from MCMC output, and (b) the KSD based
on med, the simplest setting for Γ. ED serves as an objective performance measure, being
closely related to the quantity that Support Points attempts to minimise (Mak and Joseph
(2018) used the $\|\cdot\|$ norm in place of $\|\cdot\|_{\Sigma}$), while KSD is the performance measure that is
being directly optimised in Stein Thinning. Our decision to include KSD in the assessment
is motivated by three factors; (i) ED is somewhat insensitive to detail, making it difficult
to rank competing methods; (ii) the empirical approximation of ED in (11) relies on access
to high-quality MCMC output, but this will not be available in Section 4.3; (iii) Stein
discrepancies are the only computable performance measures in the Bayesian context, to the
best of our knowledge, that have been proven to provide convergence control.
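The integral in (11) is itself approximated using high-quality reference MCMC output; a minimal sketch of this Monte Carlo estimate of ED, up to the additive constant mentioned above, is given below (the function and argument names are our own).

```python
import numpy as np

def energy_distance(selected, reference, sigma_inv_sqrt):
    """Estimate ED in (11), up to an additive constant, with reference samples
    standing in for P; sigma_inv_sqrt is Sigma^{-1/2}, defining ||x||_Sigma.
    For a very long reference chain, a sub-sample should be used to limit memory."""
    a = selected @ sigma_inv_sqrt.T     # whitened selected states x_{pi(j)}
    b = reference @ sigma_inv_sqrt.T    # whitened reference (ground-truth) sample
    cross = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)).mean()
    within = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(-1)).mean()
    return 2.0 * cross - within
```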
The results for ED are shown in Figure 5. Here Stein Thinning based on sclmed
performed at least as well as the other methods considered and, surprisingly, out-performed
Support Points when applied to MALA and P-MALA output. This may be because MALA
and P-MALA provided worse approximations to P compared with RW and ADA-RW (recall
that Support Points relies on the MCMC output providing an accurate approximation of
P ). Note that neither ED nor KSD values will tend to 0 as m → ∞ in this experiment,
since the number n of MCMC iterations was fixed. The corresponding results for KSD are
presented in Figure 6 and show a clearer performance ordering of the competing methods,
with Stein Thinning based on med and sclmed out-performing all other methods for all

Figure 7: Lotka–Volterra model: Energy distance (ED) to the posterior, as per (11), for empirical
distributions obtained through traditional burn-in and thinning (grey lines), Support
Points (black line) and Stein Thinning (colored lines), based on output from four different
MCMC methods.

but the largest values of m considered. The smpcov setting performed well for small m but
for large m its performance degraded. The performance ordering under KSD was identical
across the different MCMC output.

4.2 Lotka–Volterra
The second example that we consider is the predator-prey model of Lotka (1926) and Volterra
(1926). A description of the d = 4 dimensional inference task, the output from MCMC
methods and the implementation of thinning procedures is reserved for Appendix S5.2. Com-
pared to the Goodwin oscillator, the Lotka–Volterra posterior P exhibits stronger correlation
among parameters. This has consequences for our assessment, since now all MCMC methods,
and in particular MALA, mixed less well compared to corresponding results for the Goodwin
oscillator, as can be seen from the trace plots in Appendix S5.2, Figure S16. Results are
reported for ED in Figure 7. It can be seen that Stein Thinning based on med and sclmed
performed comparably with Support Points, being better for small m in the case of RW and
ADA-RW and marginally worse for large m in RW, ADA-RW and P-MALA. Interestingly, the set-
ting smpcov was associated with poor performance on output from RW, ADA-RW and especially
P-MALA. This may be because, when Γ is poorly conditioned, any error in an estimate for Γ
will be amplified when computing $\Gamma^{-1}$. However, in the case of MALA, which mixed poorly,
the standard approach of burn-in removal and thinning performed poorly and all settings of

Figure 8: Lotka-Volterra model: Kernel Stein discrepancy (KSD) based on med, for empirical
distributions obtained through traditional burn-in and thinning (grey lines), Support
Points (black line) and Stein Thinning (colored lines), based on output from four different
MCMC methods.

Stein Thinning provided an improvement.


Results for KSD are reported in Figure 8. The performance ordering of competing meth-
ods under KSD is similar to that reported in Section 4.1, except for the smpcov setting
which appears to improve the performance of Stein Thinning for larger values of m in the
context of MALA. This may be because smpcov serves to “whiten” the correlation structure in
P , such that the resulting geometry is more favourable for the construction of an empirical
approximation. However, this improved performance was not seen on P-MALA. In all cases
Stein Thinning out-performed Support Points.

4.3 Calcium Signalling Model


Our final and motivating example is a calcium signalling model, illustrated in Appendix S5.3,
Figure S23. The model describes an electrically-activated intracellular calcium signal that in
turn activates the sub-cellular sarcomere, causing the muscle cell to contract and the heart
to beat. The intracellular calcium signal is crucial for healthy cardiac function. However,
under pathological conditions, dysregulation of this intra-cellular signal can play a central
role in the initiation and sustenance of life-threatening arrhythmias. Computational models
are increasingly being applied to study this highly-orchestrated multi-scale signalling cas-
cade to determine how changes in cell-scale calcium regulation, encoded in calcium model
parameters, impact whole-organ cardiac function (Campos et al., 2015; Niederer et al., 2019;

Figure 9: Calcium signalling model. Kernel Stein discrepancy (KSD) based on med, for empirical
distributions obtained through traditional burn-in and thinning (grey lines), Support
Points (black line) and through Stein Thinning (colored lines), based on output from RW
MCMC.

Colman, 2019). The computational cost of simulating from tissue-scale and organ-scale mod-
els is high, with single simulations taking thousands of CPU hours (Niederer et al., 2011;
Augustin et al., 2016; Strocchi et al., 2020). This limits the capacity to propagate uncertainty
in calcium signalling model parameters up to organ-scale simulations, so that at present it
remains unclear how uncertainty in calcium signalling parameters impacts the predictions
made by a whole-organ model. Our motivation for developing Stein Thinning was to ob-
tain a compressed representation of the posterior distribution for the d = 38 dimensional
parameter of a calcium signalling model, based on a cell-scale experimental dataset, which
can subsequently be used as an experimental design to propagate uncertainty through a
whole-organ model.
This motivating problem entails a second complication in that, compared to the example
in Section 4.1 and even the example in Section 4.2, the development of an efficient MCMC
method appears to be difficult. The posterior distribution exhibits strong and nonlinear
dependencies among the parameters such that it is effectively supported on a sub-manifold of
$\mathbb{R}^{38}$. Thus, in the experiment that follows, we cannot rely on any of the MCMC methods that
we described at the start of Section 4 to provide anything more than a crude approximation
of the posterior, at best. The potential for Stein Thinning to perform substantial bias
correction, in the spirit of Theorem 3, will therefore be explored. (Of course, it is possible
that a more sophisticated sampling method may be designed for this task, but our aim here
is not to develop a sampling method.)
Our focus in the remainder is on output from the RW MCMC method, where we compare
the traditional approach of burn-in removal and thinning to Stein Thinning. This MCMC
method was selected since methods that exploit first order gradient information (only) tend
to perform worse than gradient-free methods when the posterior is effectively supported on
a sub-manifold (Livingstone and Zanella, 2019), since a move in the direction of increasing
gradient corresponds to a step perpendicular to the sub-manifold, such that the probability
of actually landing on the sub-manifold is small. Trace plots for $n = 4 \times 10^6$ iterations of

RW MCMC, representing two weeks’ CPU time, are shown in Appendix S5.3, Figure S25. As
before, the GR diagnostic gave b̂GR,10 > n, while the burn-in periods estimated using the (mul-
tivariate) VK diagnostic were b̂VK,10 = 192,000 and b̂VK,1 = 1,626,000; see Appendix S5.3,
Figure S26.
Figure 9 reports the KSD based on med, for index sets of cardinality up to m = 200; see
Appendix S5.3 for results for KSD based on sclmed (Figure S27) and smpcov (Figure S28).
Inspection of the empirical approximations that correspond to Figure 9 reveals that, for
med, 45% of the selected points were selected more than once (for sclmed the corresponding
figure was 26%, while for smpcov it was 0%). This may suggest that the MCMC output
is not representative of the posterior P and that only the "nearest" points to P are
being selected. Traditional post-processing aims to reduce bias, possibly at the expense of
increased variance, while Support Points aim to reduce variance without correcting for bias
in the MCMC output. The results presented here are consistent with the interpretation that
Stein Thinning can out-perform existing methods when both bias correction and variance
reduction (via compression) are required.

5 Conclusion
In this paper, standard approaches used to post-process and compress output from MCMC
were identified as being sub-optimal when one considers the approximation quality of the
empirical distribution that is produced. A novel method, Stein Thinning, was proposed
that seeks a subset of the MCMC output, of fixed cardinality, such that the associated
empirical approximation is close to optimal. To the best of our knowledge, the theoretical
analysis that we have provided for Stein Thinning is the first to handle the effect of the post-
processing procedure jointly with the randomness involved in simulating from the Markov
chain, such that consistency of the overall estimator is established.
Although we focused on MCMC, the proposed method can be applied to any compu-
tational method that provides a collection of states as output. These include approxi-
mate (biased) MCMC methods such as those described in Alquier et al. (2016), where
Stein Thinning may be able to provide bias correction in the spirit of Theorem 3. On
the other hand, the main limitation of Stein Thinning is that it requires gradients of the
log-target to be computed, which is not always practical.
Our research was motivated by challenging parameter inference problems that arise in
ODEs, in particular in cardiac modelling where one is interested in propagating calcium sig-
nalling parameter uncertainty through a whole-organ simulation – a task that would naïvely
be impractical or impossible using the full MCMC output. Our ongoing research is exploiting
Stein Thinning in this context and is enabling us to perform scientific investigations that
were not feasible beforehand.

Acknowledgements The authors are grateful for support from the Lloyd’s Register Foun-
dation programme on data-centric engineering and the programme on health and medical
sciences at the Alan Turing Institute. MR, SN and CJO were supported by the British

Heart Foundation (BHF; SP/18/6/33805). JC was supported by the UKRI Strategic Pri-
orities Fund (EP/T001569/1). PS was supported by the BHF (RG/15/9/31534). SN
was supported by the EPSRC (EP/P01268X/1, NS/A000049/1, EP/M012492/1), the BHF
(PG/15/91/31812, FS/18/27/33543), the NIHR (II-LB-1116-20001) and the Wellcome Trust
(WT 203148/Z/16/Z). The authors thank Matthew Graham, Liam Hodgkinson and Rob Sa-
lomone for helpful discussions related to the manuscript.

References
P. Alquier, N. Friel, R. Everitt, and A. Boland. Noisy Monte Carlo: Convergence of Markov
chains with approximate transition kernels. Statistics and Computing, 26(1-2):29–47, 2016.

C. M. Augustin, A. Neic, M. Liebmann, A. J. Prassl, S. A. Niederer, G. Haase, and G. Plank.
Anatomically accurate high resolution modeling of human whole heart electromechanics:
a strongly scalable algebraic multigrid solver method for nonlinear deformation. Journal
of Computational Physics, 305:622–646, 2016.

L. Baringhaus and C. Franz. On a new multivariate two-sample test. Journal of Multivariate
Analysis, 88(1):190–206, 2004.

A. Barp, C. Oates, E. Porcu, and M. Girolami. A Riemannian-Stein kernel method.
arXiv:1810.04946, 2018.

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and
Statistics. Springer Science & Business Media, New York, 2004.

N. Biswas, P. E. Jacob, and P. Vanetti. Estimating convergence of Markov chains with
L-lag couplings. In Proceedings of the 33rd Conference on Neural Information Processing
Systems, 2019.

S. P. Brooks and A. Gelman. General methods for monitoring convergence of iterative
simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, 1998.

B. Calderhead and M. Girolami. Estimating Bayes factors via thermodynamic integration
and population MCMC. Computational Statistics & Data Analysis, 53(12):4028–4045,
2009.

F. O. Campos, Y. Shiferaw, A. J. Prassl, P. M. Boyle, E. J. Vigmond, and G. Plank.
Stochastic spontaneous calcium release events trigger premature ventricular complexes by
overcoming electrotonic load. Cardiovascular Research, 107(1):175–183, 2015.

B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker,
J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of
Statistical Software, 76(1), 2017.

W. Y. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. J. Oates. Stein points. In Proceedings
of the 35th International Conference on Machine Learning, 2018.
W. Y. Chen, A. Barp, F.-X. Briol, J. Gorham, L. Mackey, M. Girolami, and C. J. Oates. Stein
points Markov chain Monte Carlo. In Proceedings of the 36th International Conference on
Machine Learning, 2019.
K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In
Proceedings of the 33rd International Conference on Machine Learning, 2016.
M. A. Colman. Arrhythmia mechanisms and spontaneous calcium release: Bi-directional
coupling between re-entrant and focal excitation. PLoS Computational Biology, 15(8),
2019.
M. K. Cowles and B. P. Carlin. Markov chain Monte Carlo convergence diagnostics: a
comparative review. Journal of the American Statistical Association, 91(434):883–904,
1996.
A. Duncan, N. Nüsken, and L. Szpruch. On the geometry of Stein variational gradient
descent. arXiv:1912.00894, 2019.
R. Dwivedi, O. N. Feldheim, O. Gurel-Gurevich, and A. Ramdas. The power of online
thinning in reducing discrepancy. Probability Theory and Related Fields, 174(1-2):103–
131, 2019.
J. M. Flegal, M. Haran, and G. L. Jones. Markov chain Monte Carlo: Can we trust the third
significant figure? Statistical Science, 23(2):250–260, 2008.
D. Garreau, W. Jitkrittum, and M. Kanagawa. Large sample analysis of the median heuristic.
arXiv:1707.07269, 2018.
A. E. Gelfand and A. F. Smith. Sampling-based approaches to calculating marginal densities.
Journal of the American Statistical Association, 85(410):398–409, 1990.
A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple sequences.
Statistical science, 7(4):457–472, 1992.
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian
Data Analysis, volume 2. CRC press, 2014.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence,
(6):721–741, 1984.
C. J. Geyer. Practical Markov Chain Monte Carlo. Statistical Science, 7(4):473–483, 1992.
M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo
methods. Journal of the Royal Statistical Society, Series B, 73(2):123–214, 2011.

B. C. Goodwin. Oscillatory behavior in enzymatic control process. Advances in Enzyme
Regulation, 3:318–356, 1965.

J. Gorham and L. Mackey. Measuring sample quality with Stein’s method. In Proceedings
of the 29th Conference on Neural Information Processing Systems, 2015.

J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the
34th International Conference on Machine Learning, 2017.

P. J. Green, K. Latuszyński, M. Pereyra, and C. P. Robert. Bayesian computation: a sum-


mary of the current state, and samples backwards and forwards. Statistics and Computing,
25(4):835–862, 2015.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel method


for the two-sample-problem. In Proceedings of the 20th Conference on Neural Information
Processing Systems, 2006.

H. Haario, E. Saksman, and J. Tamminen. Adaptive proposal distribution for random walk
Metropolis algorithm. Computational Statistics, 14(3):375–396, 1999.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1):97–109, 1970.

R. Hinch, J. Greenstein, A. Tanskanen, L. Xu, and R. Winslow. A simplified local control


model of calcium-induced calcium release in cardiac ventricular myocytes. Biophysical
Journal, 87(6):3723–3736, 2004.

A. C. Hindmarsh, P. N. Brown, K. E. Grant, S. L. Lee, R. Serban, D. E. Shumaker, and C. S.


Woodward. SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers.
ACM Transactions on Mathematical Software (TOMS), 31(3):363–396, 2005.

L. Hodgkinson, R. Salomone, and F. Roosta. The reproducing Stein kernel approach for
post-hoc corrected sampling. arXiv:2001.09266, 2020.

J. Huggins and L. Mackey. Random feature Stein discrepancies. In Proceedings of the 31st
Conference on Neural Information Processing Systems, 2018.

G. L. Jones and J. P. Hobert. Honest exploration of intractable probability distributions via


Markov chain Monte Carlo. Statistical Science, 16(4):312–334, 2001.

V. R. Joseph, T. Dasgupta, R. Tuo, and C. Wu. Sequential exploration of complex surfaces


using minimum energy designs. Technometrics, 57(1):64–74, 2015.

V. R. Joseph, D. Wang, L. Gu, S. Lyu, and R. Tuo. Deterministic sampling of expensive


posteriors using minimum energy designs. Technometrics, 61(3):297–308, 2019.

C. Knudson and D. Vats. stableGR, 2020. R package version 1.0.

H. Le, A. Lewis, K. Bharath, and C. Fallaize. A diffusion approach to Stein’s method on
Riemannian manifolds. arXiv:2003.11497, 2020.

C. Liu and J. Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In
Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.

Q. Liu. Stein variational gradient descent as gradient flow. In Proceedings of the 31st
Conference on Neural Information Processing Systems, 2017.

Q. Liu and J. D. Lee. Black-box importance sampling. In Proceedings of the 20th Interna-
tional Conference on Artificial Intelligence and Statistics, 2017.

Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian infer-
ence algorithm. In Proceedings of the 30th Conference on Neural Information Processing
Systems, 2016.

Q. Liu, J. D. Lee, and M. I. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests
and model evaluation. In Proceedings of the 33rd International Conference on Machine
Learning, 2016.

S. Livingstone and G. Zanella. On the robustness of gradient-based MCMC algorithms.


arXiv:1908.11812, 2019.

A. J. Lotka. Elements of physical biology. Science Progress in the Twentieth Century (1919-
1933), 21(82):341–343, 1926.

D. J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter. WinBUGS - a Bayesian modelling


framework: concepts, structure, and extensibility. Statistics and Computing, 10(4):325–
337, 2000.

S. Mak and V. R. Joseph. Support points. The Annals of Statistics, 46(6A):2562–2592, 2018.

S. Meyn and R. Tweedie. Computable bounds for geometric convergence rates of Markov
chains. The Annals of Applied Probability, 4(4):981–1011, 1994.

S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. Springer Science &
Business Media., 2012.

A. Müller. Integral probability metrics and their generating classes of functions. Advances
in Applied Probability, 29(2):429–443, 1997.

S. A. Niederer, L. Mitchell, N. Smith, and G. Plank. Simulating human cardiac electrophys-


iology on clinical time-scales. Frontiers in Physiology, 2:14, 2011.

S. A. Niederer, J. Lumens, and N. A. Trayanova. Computational models in cardiology.


Nature Reviews Cardiology, 16(2):100–111, 2019.

C. J. Oates, T. Papamarkou, and M. Girolami. The controlled thermodynamic integral for
Bayesian model evidence evaluation. Journal of the American Statistical Association, 111
(514):634–645, 2016.

C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration.
Journal of the Royal Statistical Society, Series B, 79(3):695–718, 2017.

A. B. Owen. Statistically efficient thinning of a Markov chain sampler. Journal of Compu-


tational and Graphical Statistics, 26(3):738–744, 2017.

M. Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs
sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical
Computing, 2003.

M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnosis and output
analysis for MCMC. R News, 6(1):7–11, 2006. URL https://journal.r-project.org/
archive/.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation


for Statistical Computing, 2020. URL https://www.R-project.org.

C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Science & Business
Media, 2013.

G. O. Roberts and J. S. Rosenthal. Optimal scaling for various Metropolis-Hastings algo-


rithms. Statistical Science, 16(4):351–367, 2001.

G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and


their discrete approximations. Bernoulli, 2(4):341–363, 1996.

G. O. Roberts and R. L. Tweedie. Bounds on regeneration times and convergence rates for
Markov chains. Stochastic Processes and Their Applications, 80(2):211–229, 1999.

J. S. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte
Carlo. Journal of the American Statistical Association, 90(430):558–566, 1995.

L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional dis-
tributions with applications to dynamical systems. In Proceedings of the 26th International
Conference on Machine Learning, 2009.

L. F. South, T. Karvonen, C. Nemeth, M. Girolami, and C. Oates. Semi-exact control


functionals from Sard’s method. arXiv:2002.00033, 2020.

C. Stein. A bound for the error in the normal approximation to the distribution of a sum of
dependent random variables. In Proceedings of 6th Berkeley Symposium on Mathematical
Statistics and Probability, pages 583–602. University of California Press, 1972.

M. Strocchi, M. A. Gsell, C. M. Augustin, O. Razeghi, C. H. Roney, A. J. Prassl, E. J.
Vigmond, J. M. Behar, J. S. Gould, C. A. Rinaldi, M. J. Bishop, G. Plank, and S. A.
Niederer. Simulating ventricular systolic motion in a four-chamber heart model with
spatially varying robin boundary conditions to model the effect of the pericardium. Journal
of Biomechanics, 101:109645, 2020.

G. J. Székely and M. L. Rizzo. Testing for equal distributions in high dimension. InterStat,
5(16.10):1249–1272, 2004.

M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmen-


tation. Journal of the American statistical Association, 82(398):528–540, 1987.

D. Vats and J. M. Flegal. Lugsail lag windows and their application to MCMC.
arXiv:1809.04541, 2018.

D. Vats and C. Knudson. Revisiting the Gelman-Rubin diagnostic. arXiv:1812.09384, 2018.

D. Vats, J. M. Flegal, and G. L. Jones. Multivariate output analysis for Markov chain Monte
Carlo. Biometrika, 106(2):321–337, 2019.

V. Volterra. Variazioni e fluttuazioni del numero d’individui in specie animali conviventi.


Memoria della Reale Accademia Nazionale dei Lincei, 6:31–113, 1926.

W. Xu and T. Matsuda. A Stein goodness-of-fit test for directional distributions.


arXiv:2002.06843, 2020.

Supplementary Material
This electronic supplement contains details for our Python and MATLAB implementations of
Stein Thinning, proofs for all novel theoretical results reported in Section 3 of the main
text, as well as additional material relating to the experimental assessment in Section 4 of
the main text. It is structured as follows:

S1 Software

S2 Proofs
    S2.1 Proof of Theorem 1
    S2.2 Proof of Theorem 2
    S2.3 Proof of Theorem 3

S3 Experimental Protocol

S4 Convergence Diagnostics for MCMC

S5 Empirical Assessment: Additional Results
    S5.1 Goodwin Oscillator
    S5.2 Lotka–Volterra
    S5.3 Calcium Signalling Model

S1 Software
To assist with applications of Stein Thinning we have provided code in both Python and
MATLAB. The Python code is available at
https://github.com/wilson-ye-chen/stein_thinning
and the MATLAB code is available at
https://github.com/wilson-ye-chen/stein_thinning_matlab.
In this section we demonstrate Stein Thinning in Python, but the syntax for Stein Thinning
in MATLAB is almost identical. As an illustration of how Stein Thinning can be used to
post-process output from Stan, consider the following simple Stan script that produces 1000
correlated samples from a bivariate Gaussian model:
from pystan import StanModel
mc = """
parameters {vector[2] x;}
model {x ~ multi_normal([0, 0], [[1, 0.8], [0.8, 1]]);}
"""
sm = StanModel(model_code=mc)
fit = sm.sampling(iter=1000)

The bivariate Gaussian model is used for illustration, but regardless of the complexity of the
model being sampled the output of Stan will always be a fit object. The sampled points
xi and the gradients ∇ log p(xi ) can be extracted from the returned fit object:
import numpy as np
smpl = fit['x']
grad = np.apply_along_axis(fit.grad_log_prob, 1, smpl)
One can then perform Stein Thinning to obtain a subset of m = 40 states by running the
following code:
from stein_thinning.thinning import thin
idx = thin(smpl, grad, 40)
The thin function returns a NumPy array containing the row indices in smpl (and grad) of
the selected points. The default usage requires no additional user input and is based on the
sclmed setting from Section 2.3, informed by the empirical analysis of Section 4. Alterna-
tively, the user can choose to specify which setting to use for computing the preconditioning
matrix Γ by setting the option string pre to either 'med', 'sclmed', or 'smpcov'. For
example, the default setting corresponds to
idx = thin(smpl, grad, 40, pre='sclmed')
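The array of indices returned by thin can then be used to form the m-point empirical approximation in (2) and, for example, to estimate posterior expectations. A minimal sketch, using the smpl and idx arrays defined above:

thinned = smpl[idx]                              # the m = 40 selected states
posterior_mean_estimate = thinned.mean(axis=0)   # estimate of the posterior mean
print(posterior_mean_estimate)

Any other integrand of interest could be averaged over smpl[idx] in the same way.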
The ease with which Stein Thinning can be used makes it possible to consider a wide
variety of applications, including the ODE models that we considered in Section 4 of the
main text.

S2 Proofs
This appendix contains detailed proofs for all novel theoretical results in the main text.

S2.1 Proof of Theorem 1


First we state and prove two elementary results that will be useful:

Lemma 1. For all $a, b \ge 0$ it holds that $2a\sqrt{a^2 + b} \le 2a^2 + b$.

Proof. Since all quantities are non-negative, we may square both sides to get an equivalent inequality $4a^2(a^2 + b) \le (2a^2 + b)^2$. Expanding the brackets and cancelling terms leads to $0 \le b^2$, which is guaranteed to hold.

Lemma 2. For all $m \in \mathbb{N}$ it holds that $\sum_{j=1}^{m} \frac{1}{j} \le 1 + \log(m)$.

Proof. Since $x \mapsto \frac{1}{x}$ is convex on $x \in (0, \infty)$, we have that the Riemann sum $\sum_{j=2}^{m} \frac{1}{j}$ is a lower bound for the Riemann integral $\int_1^m \frac{1}{x}\,\mathrm{d}x = \log(m)$. Thus $\sum_{j=1}^{m} \frac{1}{j} = 1 + \sum_{j=2}^{m} \frac{1}{j} \le 1 + \log(m)$, as required.
Now we present the proof of Theorem 1:

Proof of Theorem 1. Let $a_m := m^2 D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(x_{\pi(j)})\right)^{2}$, $f_m := \sum_{j=1}^{m} k_P(x_{\pi(j)}, \cdot)$ and also let $S^2 := \max_{i=1,\dots,n} k_P(x_i, x_i)$, so that

\[a_m = \sum_{j=1}^{m}\sum_{j'=1}^{m} k_P(x_{\pi(j)}, x_{\pi(j')}) = a_{m-1} + k_P(x_{\pi(m)}, x_{\pi(m)}) + 2\sum_{j=1}^{m-1} k_P(x_{\pi(j)}, x_{\pi(m)}) \le a_{m-1} + S^2 + 2\min_{y\in\{x_i\}_{i=1}^{n}} f_{m-1}(y).\]

Recall that $\mathcal{H}(k_P)$ denotes the reproducing kernel Hilbert space of the kernel $k_P$ and pick an element $h^* \in \mathcal{H}(k_P)$ of the form $h^* := \sum_{i=1}^{n} w_i^* k_P(x_i, \cdot)$, where the weight vector $w^*$ satisfies (8). From this definition it follows that $\|h^*\|_{\mathcal{H}(k_P)} = D_{k_P}\!\left(\sum_{i=1}^{n} w_i^*\delta(x_i)\right)$, which is the minimal KSD attainable under the constraint (8). Now, let $M$ denote the convex hull of $\{k_P(x_i,\cdot)\}_{i=1}^{n}$, so that $h^* \in M \subset \mathcal{H}(k_P)$ and therefore

\[\min_{y\in\{x_i\}_{i=1}^{n}} f_{m-1}(y) = \inf_{h\in M}\langle f_{m-1}, h\rangle_{\mathcal{H}(k_P)} \le \langle f_{m-1}, h^*\rangle_{\mathcal{H}(k_P)}. \tag{12}\]

Noting that $a_m = \|f_m\|^2_{\mathcal{H}(k_P)}$, we have from (12) and Cauchy–Schwarz that

\[\min_{y\in\{x_i\}_{i=1}^{n}} f_{m-1}(y) \le \sqrt{a_{m-1}}\,\|h^*\|_{\mathcal{H}(k_P)}\]

and therefore

\[a_m \le a_{m-1} + S^2 + 2\sqrt{a_{m-1}}\,\|h^*\|_{\mathcal{H}(k_P)}. \tag{13}\]

Letting

\[C_m := \left(\frac{1}{m}\sum_{j=1}^{m}\frac{1}{j}\right)\left(S^2 - \|h^*\|^2_{\mathcal{H}(k_P)}\right), \tag{14}\]

we will establish by induction that

\[a_m \le m^2\left(\|h^*\|^2_{\mathcal{H}(k_P)} + C_m\right). \tag{15}\]

This will in turn prove the result, since

\[D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(x_{\pi(j)})\right)^{2} = \frac{a_m}{m^2} \le \|h^*\|^2_{\mathcal{H}(k_P)} + C_m = D_{k_P}\!\left(\sum_{i=1}^{n} w_i^*\delta(x_i)\right)^{2} + C_m \le D_{k_P}\!\left(\sum_{i=1}^{n} w_i^*\delta(x_i)\right)^{2} + \left(\frac{1+\log(m)}{m}\right)S^2,\]

where the upper bound on $C_m$ follows from the fact that $\|h^*\|_{\mathcal{H}(k_P)} \le S$, combined with Lemma 2.

The remainder of the proof is dedicated to establishing the induction in (15). The base case $m = 1$ is satisfied since $a_1 = D_{k_P}(\delta(x_{\pi(1)}))^2 = k_P(x_{\pi(1)}, x_{\pi(1)}) \le S^2$ and $C_1 = S^2 - \|h^*\|^2_{\mathcal{H}(k_P)}$, so that $a_1 \le \|h^*\|^2_{\mathcal{H}(k_P)} + C_1$. For the inductive step, we assume that (15) holds when $m$ is replaced by $m - 1$ and aim to derive (15). From (13) and the inductive assumption, we have that

\begin{align*}
a_m &\le a_{m-1} + S^2 + 2\sqrt{a_{m-1}}\,\|h^*\|_{\mathcal{H}(k_P)} \\
&\le (m-1)^2\left(\|h^*\|^2_{\mathcal{H}(k_P)} + C_{m-1}\right) + S^2 + 2(m-1)\sqrt{\|h^*\|^2_{\mathcal{H}(k_P)} + C_{m-1}}\;\|h^*\|_{\mathcal{H}(k_P)} \\
&= m^2\left(\|h^*\|^2_{\mathcal{H}(k_P)} + C_m\right) + R_m \tag{16}
\end{align*}

where

\[R_m := (m-1)^2 C_{m-1} - m^2 C_m + (1-2m)\|h^*\|^2_{\mathcal{H}(k_P)} + S^2 + 2(m-1)\sqrt{\|h^*\|^2_{\mathcal{H}(k_P)} + C_{m-1}}\;\|h^*\|_{\mathcal{H}(k_P)}.\]

The induction (15) will therefore follow from (16) if $R_m \le 0$. Now, $R_m \le 0$ if and only if

\[2\sqrt{\|h^*\|^2_{\mathcal{H}(k_P)} + C_{m-1}}\;\|h^*\|_{\mathcal{H}(k_P)} \le \frac{m^2 C_m - (m-1)^2 C_{m-1}}{m-1} - \frac{S^2 - \|h^*\|^2_{\mathcal{H}(k_P)}}{m-1} + 2\|h^*\|^2_{\mathcal{H}(k_P)}.\]

From Lemma 1 it must hold that

\[2\sqrt{\|h^*\|^2_{\mathcal{H}(k_P)} + C_{m-1}}\;\|h^*\|_{\mathcal{H}(k_P)} \le 2\|h^*\|^2_{\mathcal{H}(k_P)} + C_{m-1},\]

meaning it is sufficient to show that

\[2\|h^*\|^2_{\mathcal{H}(k_P)} + C_{m-1} \le \frac{m^2 C_m - (m-1)^2 C_{m-1}}{m-1} - \frac{S^2 - \|h^*\|^2_{\mathcal{H}(k_P)}}{m-1} + 2\|h^*\|^2_{\mathcal{H}(k_P)}. \tag{17}\]

Algebraic simplification of (17) reveals that (17) is equivalent to

\[m C_m - (m-1) C_{m-1} \ge \frac{1}{m}\left(S^2 - \|h^*\|^2_{\mathcal{H}(k_P)}\right) \tag{18}\]

and, using (14), we verify that (18) is satisfied as an equality. This completes the inductive argument.

S2.2 Proof of Theorem 2


First we state and prove a technical lemma that will be useful:

Lemma 3. Let $\mathcal{X}$ be a measurable space and let $Q$ be a probability distribution on $\mathcal{X}$. Let $k_Q : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a reproducing kernel with $\int_{\mathcal{X}} k_Q(x,\cdot)\,\mathrm{d}Q(x) = 0$. Consider a $Q$-invariant, time-homogeneous, reversible Markov chain $(X_i)_{i\in\mathbb{N}}\subset\mathcal{X}$ generated using a $V$-uniformly ergodic transition kernel, such that $V(x) \ge \sqrt{k_Q(x,x)}$ for all $x\in\mathcal{X}$, with parameters $R\in[0,\infty)$ and $\rho\in(0,1)$ as in (9). Then, with $C = \frac{2R\rho}{1-\rho}$, we have that

\[\sum_{i=1}^{n}\sum_{r\in\{1,\dots,n\}\setminus\{i\}}\mathbb{E}\left[k_Q(X_i, X_r)\right] \le C\sum_{i=1}^{n-1}\mathbb{E}\left[\sqrt{k_Q(X_i,X_i)}\,V(X_i)\right].\]

Proof. First recall that, given random variables $X, Y$ taking values in $\mathcal{X}$, the conditional mean embedding of the distribution $\mathbb{P}[X|Y=y]$ is the function $\mathbb{E}[k_Q(X,\cdot)|Y=y]\in\mathcal{H}(k_Q)$ (Song et al., 2009). By the reproducing property we have $\mathbb{E}[k_Q(X,y)|Y=y] = \langle k_Q(y,\cdot), \mathbb{E}[k_Q(X,\cdot)|Y=y]\rangle_{\mathcal{H}(k_Q)}$, hence $\mathbb{E}[k_Q(X,Y)|Y] = \langle k_Q(Y,\cdot), \mathbb{E}[k_Q(X,\cdot)|Y]\rangle_{\mathcal{H}(k_Q)}$. Thus

\begin{align*}
\mathbb{E}[k_Q(X,Y)|Y] &= \langle k_Q(Y,\cdot), \mathbb{E}[k_Q(X,\cdot)|Y]\rangle_{\mathcal{H}(k_Q)} \\
&= \|k_Q(Y,\cdot)\|_{\mathcal{H}(k_Q)}\left\langle \frac{k_Q(Y,\cdot)}{\|k_Q(Y,\cdot)\|_{\mathcal{H}(k_Q)}}, \mathbb{E}[k_Q(X,\cdot)|Y]\right\rangle_{\mathcal{H}(k_Q)} \\
&\le \|k_Q(Y,\cdot)\|_{\mathcal{H}(k_Q)}\sup_{\|h\|_{\mathcal{H}(k_Q)}=1}\langle h, \mathbb{E}[k_Q(X,\cdot)|Y]\rangle_{\mathcal{H}(k_Q)}.
\end{align*}

In what follows it is convenient to introduce a new random variable $Z$, independent from the Markov chain, such that $Z\sim Q$. Then, since $\mathbb{E}[k_Q(Z,\cdot)] = 0$, we have $\mathbb{E}[h(Z)] = 0$ for any $h\in\mathcal{H}(k_Q)$. Hence we have that

\[\mathbb{E}[k_Q(X,Y)|Y] \le \sqrt{k_Q(Y,Y)}\sup_{\|h\|_{\mathcal{H}(k_Q)}=1}\left(\langle h, \mathbb{E}[k_Q(X,\cdot)|Y]\rangle_{\mathcal{H}(k_Q)} - \mathbb{E}[h(Z)]\right).\]

Note $|h(x)| \le \|h\|_{\mathcal{H}(k_Q)}\sqrt{k_Q(x,x)}$, so $\|h\|_{\mathcal{H}(k_Q)} = 1$ implies $|h(x)| \le \sqrt{k_Q(x,x)}$, thus

\[\mathbb{E}[k_Q(X,Y)|Y] \le \sqrt{k_Q(Y,Y)}\sup_{|h(x)|\le\sqrt{k_Q(x,x)}}\left(\mathbb{E}[h(X)|Y] - \mathbb{E}[h(Z)]\right).\]

From $V$-uniform ergodicity it follows that

\[\left|\mathbb{E}[h(X_n)|X_0 = y] - \mathbb{E}[h(Z)]\right| = |Q^n(y,\cdot)h - Qh| \le R\,V(y)\,\rho^n.\]

Applying this to $Y = X_i$, $X = X_{i+r}$, we find

\[\mathbb{E}[k_Q(X_{i+r}, X_i)|X_i] \le R\sqrt{k_Q(X_i,X_i)}\,V(X_i)\,\rho^r\]

and taking the expectation on both sides yields

\[\mathbb{E}[k_Q(X_{i+r}, X_i)] \le R\,\mathbb{E}\left[\sqrt{k_Q(X_i,X_i)}\,V(X_i)\right]\rho^r. \tag{19}\]

Finally, we can use (19) to obtain that

\begin{align*}
\sum_{i=1}^{n}\sum_{r\in\{1,\dots,n\}\setminus\{i\}}\mathbb{E}[k_Q(X_r, X_i)] &= 2\sum_{i=1}^{n-1}\sum_{r=1}^{n-i}\mathbb{E}[k_Q(X_{i+r}, X_i)] \\
&\le 2R\sum_{i=1}^{n-1}\mathbb{E}\left[\sqrt{k_Q(X_i,X_i)}\,V(X_i)\right]\sum_{r=1}^{n-i}\rho^r.
\end{align*}

Thus for $C = 2R\sum_{r=1}^{\infty}\rho^r = \frac{2R\rho}{1-\rho} < \infty$, we have that

\[\sum_{i=1}^{n}\sum_{r\in\{1,\dots,n\}\setminus\{i\}}\mathbb{E}[k_Q(X_r, X_i)] \le C\sum_{i=1}^{n-1}\mathbb{E}\left[\sqrt{k_Q(X_i,X_i)}\,V(X_i)\right]\]

as claimed.
We can now prove the main result:
Proof of Theorem 2. Taking expectations of the bound in Theorem 1, we have that

\[\mathbb{E}\left[D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(X_{\pi(j)})\right)^{2}\right] \le \underbrace{\mathbb{E}\left[D_{k_P}\!\left(\sum_{i=1}^{n} w_i^*\delta(X_i)\right)^{2}\right]}_{(*)} + \left(\frac{1+\log(m)}{m}\right)\underbrace{\mathbb{E}\left[\max_{i=1,\dots,n} k_P(X_i, X_i)\right]}_{(**)}.\]

In what follows we construct bounds for $(*)$ and $(**)$.

Bounding $(*)$: To bound the term $(*)$, note that

\[D_{k_P}\!\left(\sum_{i=1}^{n} w_i^*\delta(X_i)\right) \le D_{k_P}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)\right),\]

due to the optimality property of the weights $w^*$ presented in (8). It is therefore sufficient to study the KSD of the un-weighted empirical distribution $\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)$. To this end, we have that

\[\mathbb{E}\left[D_{k_P}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)\right)^{2}\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}[k_P(X_i,X_i)] + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{r\in\{1,\dots,n\}\setminus\{i\}}\mathbb{E}\left[k_P(X_i,X_r)\right]. \tag{20}\]

To bound the first term in (20) we use Jensen's inequality:

\[\frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left[k_P(X_i,X_i)\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left[\frac{1}{\gamma}\log e^{\gamma k_P(X_i,X_i)}\right] \le \frac{1}{\gamma n^2}\sum_{i=1}^{n}\log\mathbb{E}\left[e^{\gamma k_P(X_i,X_i)}\right] \le \frac{\log(b)}{\gamma n}.\]

The second term in (20) can be bounded via Lemma 3 with $Q = P$:

\[\frac{1}{n^2}\sum_{i=1}^{n}\sum_{r\in\{1,\dots,n\}\setminus\{i\}}\mathbb{E}\left[k_P(X_i,X_r)\right] \le \frac{C}{n^2}\sum_{i=1}^{n-1}\mathbb{E}\left[\sqrt{k_P(X_i,X_i)}\,V(X_i)\right] \le \frac{CM(n-1)}{n^2} \le \frac{CM}{n},\]

where $C$ is defined in Lemma 3.

Bounding $(**)$: We proceed as follows:

\begin{align*}
\mathbb{E}\left[\max_{i=1,\dots,n} k_P(X_i,X_i)\right] &= \mathbb{E}\left[\frac{1}{\gamma}\log\max_{i=1,\dots,n} e^{\gamma k_P(X_i,X_i)}\right] \\
&\le \frac{1}{\gamma}\mathbb{E}\left[\log\sum_{i=1}^{n} e^{\gamma k_P(X_i,X_i)}\right] \\
&\le \frac{1}{\gamma}\log\left(\sum_{i=1}^{n}\mathbb{E}\left[e^{\gamma k_P(X_i,X_i)}\right]\right) = \frac{\log(nb)}{\gamma}. \tag{21}
\end{align*}

Overall Bound: Combining our bounds on $(*)$ and $(**)$ leads to the overall bound

\[\mathbb{E}\left[D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(X_{\pi(j)})\right)^{2}\right] \le \frac{\log(b)}{\gamma n} + \frac{CM}{n} + \left(\frac{1+\log(m)}{m}\right)\frac{\log(nb)}{\gamma}\]

as claimed.

S2.3 Proof of Theorem 3


To facilitate a neat proof of Theorem 3 we first present two useful lemmas, the first of which
establishes almost sure convergence in KSD of the empirical distribution based on the full
MCMC output:
Lemma 4. Let $Q$ be a probability distribution on $\mathcal{X}$. Let $k_Q : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a reproducing kernel with $\int_{\mathcal{X}} k_Q(x,\cdot)\,\mathrm{d}Q(x) = 0$. Consider a $Q$-invariant, time-homogeneous, reversible Markov chain $(X_i)_{i\in\mathbb{N}}\subset\mathcal{X}$, generated using a $V$-uniformly ergodic transition kernel such that $V(x) \ge \sqrt{k_Q(x,x)}$ for all $x\in\mathcal{X}$. Suppose that, for some $\gamma > 0$,

\[b := \sup_{i\in\mathbb{N}}\mathbb{E}\left[e^{\gamma k_Q(X_i,X_i)}\right] < \infty, \qquad M := \sup_{i\in\mathbb{N}}\mathbb{E}\left[\sqrt{k_Q(X_i,X_i)}\,V(X_i)\right] < \infty.\]

Then

\[D_{k_Q}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)\right) \to 0\]

Proof. Similarly to the proof of Theorem 2, we start by bounding

\[\mathbb{E}\left[D_{k_Q}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)\right)^{2}\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}[k_Q(X_i,X_i)] + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{r\in\{1,\dots,n\}\setminus\{i\}}\mathbb{E}\left[k_Q(X_i,X_r)\right]. \tag{22}\]

To bound the first term in (22) we use Jensen's inequality:

\[\frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left[k_Q(X_i,X_i)\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\left[\frac{1}{\gamma}\log e^{\gamma k_Q(X_i,X_i)}\right] \le \frac{1}{\gamma n^2}\sum_{i=1}^{n}\log\mathbb{E}\left[e^{\gamma k_Q(X_i,X_i)}\right] \le \frac{\log(b)}{\gamma n}.\]

The second term in (22) can be bounded via Lemma 3:

\[\frac{1}{n^2}\sum_{i=1}^{n}\sum_{r\in\{1,\dots,n\}\setminus\{i\}}\mathbb{E}\left[k_Q(X_i,X_r)\right] \le \frac{C}{n^2}\sum_{i=1}^{n-1}\mathbb{E}\left[\sqrt{k_Q(X_i,X_i)}\,V(X_i)\right] \le \frac{CM(n-1)}{n^2} \le \frac{CM}{n},\]

where $C$ is defined in Lemma 3. This establishes that

\[\mathbb{E}\left[D_{k_Q}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)\right)^{2}\right] \le \frac{\log(b)}{\gamma n} + \frac{CM}{n} =: c_1(n). \tag{23}\]

To simplify notation we adopt the shorthand

\[D_n := D_{k_Q}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)\right)\]

in this proof only. Fix $\epsilon > 0$. If $D_n > \epsilon$ occurs infinitely often (i.o.) then there are infinitely many $r$ such that $\max_{r^2\le n<(r+1)^2} D_n^2 > \epsilon^2$, so that

\[\mathbb{P}\left[D_n^2 > \epsilon^2 \text{ i.o.}\right] \le \mathbb{P}\left[\max_{r^2\le n<(r+1)^2} D_n^2 > \epsilon^2 \text{ i.o.}\right]. \tag{24}\]

Now, consider the bound

\[\sum_{r=1}^{\infty}\mathbb{P}\left[\max_{r^2\le n<(r+1)^2} D_n^2 > \epsilon^2\right] \le \underbrace{\sum_{r=1}^{\infty}\mathbb{P}\left[D_{r^2}^2 > \tfrac{\epsilon^2}{2}\right]}_{(*)} + \underbrace{\sum_{r=1}^{\infty}\mathbb{P}\left[\max_{r^2\le n<(r+1)^2}\left|D_n^2 - D_{r^2}^2\right| > \tfrac{\epsilon^2}{2}\right]}_{(**)},\]

where the inequality follows from the fact that, for any $a, b \in \mathbb{R}$, if $a > \epsilon^2$ then either $b > \frac{\epsilon^2}{2}$ or $|a - b| > \frac{\epsilon^2}{2}$. In the remainder we will show that the sums $(*)$ and $(**)$ are finite, so that from the Borel–Cantelli lemma

\[\mathbb{P}\left[\max_{r^2\le n<(r+1)^2} D_n^2 > \epsilon^2 \text{ i.o.}\right] = 0. \tag{25}\]

Since (25) holds for all $\epsilon > 0$, it will follow from (24) that $\mathbb{P}[D_n \to 0] = 1$, as claimed.

Bounding $(*)$: From the Markov inequality and (23) we have that, for any $\epsilon > 0$,

\[\mathbb{P}\left[D_{r^2}^2 > \tfrac{\epsilon^2}{2}\right] \le \frac{2}{\epsilon^2}\mathbb{E}\left[D_{r^2}^2\right] \le \frac{2}{\epsilon^2}c_1(r^2).\]

Since $c_1(r^2) = O(1/r^2)$, it follows that

\[(*) = \sum_{r=1}^{\infty}\mathbb{P}\left[D_{r^2}^2 > \tfrac{\epsilon^2}{2}\right] \le \frac{2}{\epsilon^2}\sum_{r=1}^{\infty}c_1(r^2) < \infty.\]

Bounding $(**)$: For $r^2 \le n < (r+1)^2$,

\[\left|D_n^2 - D_{r^2}^2\right| = \frac{1}{n^2}\left|n^2 D_n^2 - n^2 D_{r^2}^2\right| \le \frac{1}{n^2}\left|n^2 D_n^2 - r^4 D_{r^2}^2\right| + \frac{1}{n^2}\left|(n^2 - r^4)D_{r^2}^2\right| \le \frac{1}{r^4}\left|n^2 D_n^2 - r^4 D_{r^2}^2\right| + \frac{2(r+1)}{r^2}D_{r^2}^2,\]

and also that, using the reproducing property and Cauchy–Schwarz,

\[\left|n^2 D_n^2 - r^4 D_{r^2}^2\right| = \sum_{i,j=r^2+1}^{n} k_Q(X_i, X_j) \le \sum_{i,j=r^2+1}^{n}\sqrt{k_Q(X_i,X_i)}\sqrt{k_Q(X_j,X_j)} \le (n - r^2)^2\max_{1\le i\le n} k_Q(X_i,X_i) \le 4r^2\max_{1\le i\le n} k_Q(X_i,X_i).\]

Similarly again to the proof of Theorem 2 we have the bound

\[\mathbb{E}\left[\max_{i=1,\dots,n} k_Q(X_i,X_i)\right] = \mathbb{E}\left[\frac{1}{\gamma}\log\max_{i=1,\dots,n} e^{\gamma k_Q(X_i,X_i)}\right] \le \frac{1}{\gamma}\mathbb{E}\left[\log\sum_{i=1}^{n} e^{\gamma k_Q(X_i,X_i)}\right] \le \frac{1}{\gamma}\log\left(\sum_{i=1}^{n}\mathbb{E}\left[e^{\gamma k_Q(X_i,X_i)}\right]\right) = \frac{\log(nb)}{\gamma} \tag{26}\]

so that, taking expectations, we obtain the bound

\[\mathbb{E}\left[\max_{r^2\le n<(r+1)^2}\left|D_n^2 - D_{r^2}^2\right|\right] \le \frac{4}{r^2}\mathbb{E}\left[\max_{1\le i\le n} k_Q(X_i,X_i)\right] + \frac{2(r+1)}{r^2}\mathbb{E}\left[D_{r^2}^2\right] \le \frac{4\log(nb)}{r^2\gamma} + \frac{2(r+1)}{r^2}c_1(r^2) \le \frac{8\log((r+1)b)}{r^2\gamma} + \frac{2(r+1)}{r^2}c_1(r^2) =: c_2(r),\]

where $c_2(r) = O(\log(r)/r^2)$. Using the Markov inequality,

\[(**) = \sum_{r=1}^{\infty}\mathbb{P}\left[\max_{r^2\le n<(r+1)^2}\left|D_n^2 - D_{r^2}^2\right| > \tfrac{\epsilon^2}{2}\right] \le \frac{2}{\epsilon^2}\sum_{r=1}^{\infty}\mathbb{E}\left[\max_{r^2\le n<(r+1)^2}\left|D_n^2 - D_{r^2}^2\right|\right] \le \frac{2}{\epsilon^2}\sum_{r=1}^{\infty}c_2(r) < \infty.\]

This completes the proof.


Our second lemma is a technical result on almost sure convergence:

Lemma 5. Let $f$ be a non-negative function on $\mathcal{X}$. Consider a sequence of random variables $(X_i)_{i\in\mathbb{N}}\subset\mathcal{X}$ such that, for some $\gamma > 0$,

\[b := \sup_{i\in\mathbb{N}}\mathbb{E}\left[e^{\gamma f(X_i)}\right] < \infty.\]

If $m \le n$ and the growth of $n$ is limited to at most $\log(n) = O(m^{\beta/2})$ for some $\beta < 1$, then

\[\left(\frac{\log(m)}{m}\right)\max_{i=1,\dots,n} f(X_i) \to 0\]

almost surely as $m, n \to \infty$.

Proof. To simplify notation we adopt the shorthand

\[E_m := \left(\frac{\log(m)}{m}\right)\max_{i=1,\dots,n} f(X_i)\]

in this proof only, where $n = n(m)$. The argument is similar to the proof of Lemma 4. Fix $\epsilon > 0$. If $E_m > \epsilon$ i.o. then there are infinitely many $r$ such that $\max_{r^2\le m<(r+1)^2} E_m > \epsilon$, so that

\[\mathbb{P}\left[E_m > \epsilon \text{ i.o.}\right] \le \mathbb{P}\left[\max_{r^2\le m<(r+1)^2} E_m > \epsilon \text{ i.o.}\right]. \tag{27}\]

Now, consider the bound

\[\sum_{r=1}^{\infty}\mathbb{P}\left[\max_{r^2\le m<(r+1)^2} E_m > \epsilon\right] \le \underbrace{\sum_{r=1}^{\infty}\mathbb{P}\left[E_{r^2} > \tfrac{\epsilon}{2}\right]}_{(*)} + \underbrace{\sum_{r=1}^{\infty}\mathbb{P}\left[\max_{r^2\le m<(r+1)^2}\left|E_m - E_{r^2}\right| > \tfrac{\epsilon}{2}\right]}_{(**)}.\]

In the remainder we will show that the sums $(*)$ and $(**)$ are finite, so that from the Borel–Cantelli lemma

\[\mathbb{P}\left[\max_{r^2\le m<(r+1)^2} E_m > \epsilon \text{ i.o.}\right] = 0. \tag{28}\]

Since (28) holds for all $\epsilon > 0$, it will follow from (27) that $\mathbb{P}[E_m \to 0] = 1$, as claimed.

Bounding $(*)$: Similarly to the proof of Theorem 2, we have the bound

\[\mathbb{E}\left[\max_{i=1,\dots,n} f(X_i)\right] = \mathbb{E}\left[\frac{1}{\gamma}\log\max_{i=1,\dots,n} e^{\gamma f(X_i)}\right] \le \frac{1}{\gamma}\mathbb{E}\left[\log\sum_{i=1}^{n} e^{\gamma f(X_i)}\right] \le \frac{1}{\gamma}\log\left(\sum_{i=1}^{n}\mathbb{E}\left[e^{\gamma f(X_i)}\right]\right) = \frac{\log(nb)}{\gamma}\]

and thus from the Markov inequality we have that, for any $\epsilon > 0$,

\[\mathbb{P}\left[E_m > \tfrac{\epsilon}{2}\right] \le \frac{2}{\epsilon}\mathbb{E}\left[E_m\right] \le \frac{2}{\epsilon}c_1(m), \qquad\text{where}\qquad c_1(m) := \frac{\log(m)}{m}\frac{\log(nb)}{\gamma}.\]

The assumption $\log(n) = O(m^{\beta/2})$ for some $\beta < 1$ implies that $\log(nb) \le \alpha m^{\beta/2} + \log(b)$ for some constant $\alpha\in(0,\infty)$. Thus

\[c_1(r^2) \le \frac{2\log(r)}{r^2}\left(\frac{\alpha r^{\beta} + \log(b)}{\gamma}\right).\]

This shows that $c_1(r^2) = O(\log(r)/r^{2-\beta})$, and it follows that

\[(*) = \sum_{r=1}^{\infty}\mathbb{P}\left[E_{r^2} > \tfrac{\epsilon}{2}\right] \le \frac{2}{\epsilon}\sum_{r=1}^{\infty}c_1(r^2) < \infty.\]

Bounding $(**)$: For the second term we argue that, since $E_m \ge 0$,

\[\max_{r^2\le m<(r+1)^2}\left|E_m - E_{r^2}\right| \le 2\max_{r^2\le m<(r+1)^2} E_m \le \frac{2\log(r^2)}{r^2}\max_{i=1,\dots,n((r+1)^2)} f(X_i)\]

so that, taking expectations,

\[\mathbb{E}\left[\max_{r^2\le m<(r+1)^2}\left|E_m - E_{r^2}\right|\right] \le \frac{4\log(r)}{r^2}\frac{\log(n((r+1)^2)b)}{\gamma} =: c_2(r).\]

Using the bound $\log(nb) \le \alpha m^{\beta/2} + \log(b)$, the quantity $c_2(r)$ just defined satisfies

\[c_2(r) \le \frac{4\log(r)}{r^2}\left(\frac{\alpha(r+1)^{\beta} + \log(b)}{\gamma}\right),\]

which is $O(\log(r)/r^{2-\beta})$. Using the Markov inequality and the fact that $(a-b)^2 \le |a^2 - b^2|$,

\[(**) = \sum_{r=1}^{\infty}\mathbb{P}\left[\max_{r^2\le m<(r+1)^2}\left|E_m - E_{r^2}\right| > \tfrac{\epsilon}{2}\right] \le \frac{2}{\epsilon}\sum_{r=1}^{\infty}\mathbb{E}\left[\max_{r^2\le m<(r+1)^2}\left|E_m - E_{r^2}\right|\right] \le \frac{2}{\epsilon}\sum_{r=1}^{\infty}c_2(r) < \infty.\]

This completes the proof.


Now we present the proof of Theorem 3:
Proof of Theorem 3. Our starting point is again the bound in Theorem 1:

\[D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(X_{\pi(j)})\right)^{2} \le \underbrace{D_{k_P}\!\left(\sum_{i=1}^{n} w_i^*\delta(X_i)\right)^{2}}_{(*)} + \underbrace{\left(\frac{1+\log(m)}{m}\right)\max_{i=1,\dots,n} k_P(X_i,X_i)}_{(**)}.\]

For term $(*)$, note that

\[D_{k_P}\!\left(\sum_{i=1}^{n} w_i^*\delta(X_i)\right) \le D_{k_P}\!\left(\sum_{i=1}^{n} w_i\delta(X_i)\right), \qquad w_i := \frac{\frac{\mathrm{d}P}{\mathrm{d}Q}(X_i)}{\sum_{i'=1}^{n}\frac{\mathrm{d}P}{\mathrm{d}Q}(X_{i'})},\]

due to the optimality property of the weights $w^*$ presented in (8). Further note that

\[D_{k_P}\!\left(\sum_{i=1}^{n} w_i\delta(X_i)\right) = \frac{1}{\frac{1}{n}\sum_{i'=1}^{n}\frac{\mathrm{d}P}{\mathrm{d}Q}(X_{i'})}\,D_{k_Q}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)\right),\]

where $k_Q(x,y) := \frac{\mathrm{d}P}{\mathrm{d}Q}(x)\,k_P(x,y)\,\frac{\mathrm{d}P}{\mathrm{d}Q}(y)$ is a reproducing kernel such that $\int_{\mathcal{X}} k_Q(x,\cdot)\,\mathrm{d}Q(x) = 0$. The preconditions of Theorem 3 ensure that $V(x) \ge \sqrt{k_Q(x,x)}$ and

\[\sup_{i\in\mathbb{N}}\mathbb{E}\left[e^{\gamma k_Q(X_i,X_i)}\right] \le b < \infty, \qquad M = \sup_{i\in\mathbb{N}}\mathbb{E}\left[\sqrt{k_Q(X_i,X_i)}\,V(X_i)\right] < \infty.\]

Therefore we may apply Lemma 4 to obtain that

\[D_{k_Q}\!\left(\frac{1}{n}\sum_{i=1}^{n}\delta(X_i)\right) \to 0\]

almost surely as $n\to\infty$. Moreover, since $\int_{\mathcal{X}}\frac{\mathrm{d}P}{\mathrm{d}Q}\,\mathrm{d}Q < \infty$, it follows from Meyn and Tweedie (2012, Theorem 17.0.1, part (i)) that

\[\frac{1}{n}\sum_{i=1}^{n}\frac{\mathrm{d}P}{\mathrm{d}Q}(X_i) \to \int_{\mathcal{X}}\frac{\mathrm{d}P}{\mathrm{d}Q}\,\mathrm{d}Q = 1\]

almost surely as $n\to\infty$. Standard properties of almost sure convergence thus imply that $(*) \to 0$ almost surely as $n\to\infty$.

For term $(**)$, we notice that

\[\sup_{i\in\mathbb{N}}\mathbb{E}\left[e^{\gamma k_P(X_i,X_i)}\right] \le b < \infty\]

and we can therefore use Lemma 5 with $f(x) = k_P(x,x)$ to deduce that $(**) \to 0$ almost surely as $m, n\to\infty$.

Thus we have established that

\[D_{k_P}\!\left(\frac{1}{m}\sum_{j=1}^{m}\delta(X_{\pi(j)})\right) \to 0 \tag{29}\]

almost surely as $m, n\to\infty$. The final part of the statement of Theorem 3 is immediate from Proposition 3.

S3 Experimental Protocol
In this appendix we describe the generic structure of a parameter inference problem for a
system of ODEs, that forms our empirical test-bed.
Consider the solution u of a system of q coupled ODEs of the form
\[\frac{\mathrm{d}u_1}{\mathrm{d}t} = F_1(t, u_1, \dots, u_q; x), \qquad \dots, \qquad \frac{\mathrm{d}u_q}{\mathrm{d}t} = F_q(t, u_1, \dots, u_q; x), \tag{30}\]
together with the initial condition u(0) = u0 ∈ Rq . The functions Fi that define the gradient
field are assumed to depend on a number d of parameters, collectively denoted x ∈ Rd , and
the Fi are assumed to be differentiable with respect to u1 , . . . , uq and x. It is assumed that

13
u(t) exists and is unique on an interval t ∈ [0, T ] for all values x ∈ Rd . For simplicity in this
work we assumed that the initial condition u0 is not dependent on x and is known. The goal
is to make inferences about the parameters x based on noisy observations of the state vector
u(ti ) at discrete times ti ; this information is assumed to be contained in a likelihood of the
form
YN
L(x) := φi (u(ti )) (31)
i=1
q
where the functions φi : R → [0, ∞), describing the nature of the measurement at time ti ,
are problem-specific and to be specified. The parameter x is endowed with a prior density
π(x) and the posterior of interest P admits a density p(x) ∝ π(x)L(x). Computation of
the gradient ∇ log p therefore requires computation of ∇ log π and ∇ log L; the latter can
be performed by augmenting the system in (30) with the sensitivity equations, as described
next.
Straightforward application of the chain rule leads to the following expression for the gradient of the log-likelihood:

\[(\nabla \log L)(x) = \sum_{i=1}^{N}\frac{\partial u}{\partial x}(t_i)^{\top}(\nabla \log \phi_i)(u(t_i)),\]
where (∂u/∂x)r,s := ∂ur /∂xs is the matrix of sensitivities of the solution u to the parameter
x and is time-dependent. Sensitivities can be computed by augmenting the system in (30)
and simultaneously solving the forward sensitivity equations

\[\frac{\mathrm{d}}{\mathrm{d}t}\left(\frac{\partial u_r}{\partial x_s}\right) = \frac{\partial F_r}{\partial x_s} + \sum_{l=1}^{q}\frac{\partial F_r}{\partial u_l}\frac{\partial u_l}{\partial x_s} \tag{32}\]

together with the initial condition (∂ur /∂xs )(0) = 0, which follows from the independence
of u0 and x.
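To illustrate (32) concretely, the following sketch (not the implementation used in our experiments, which interfaces with the CVODES library described below) augments the single-parameter ODE du/dt = −x u, u(0) = 1, with its forward sensitivity and integrates both with scipy; for this toy problem the sensitivity has the closed form ∂u/∂x = −t e^{−x t}, which can be used to check the output.

import numpy as np
from scipy.integrate import solve_ivp

def augmented_rhs(t, z, x):
    # z = [u, s], where s = du/dx is the forward sensitivity of u with respect to x.
    u, s = z
    du = -x * u              # F(t, u; x) = -x u
    ds = -u + (-x) * s       # dF/dx + (dF/du) s, as in equation (32)
    return [du, ds]

x = 0.7                      # parameter value at which the sensitivity is evaluated
sol = solve_ivp(augmented_rhs, (0.0, 5.0), [1.0, 0.0], args=(x,),
                t_eval=np.linspace(0.0, 5.0, 11), rtol=1e-8, atol=1e-10)

u_num, s_num = sol.y
s_exact = -sol.t * np.exp(-x * sol.t)        # analytic sensitivity for this toy ODE
print(np.max(np.abs(s_num - s_exact)))       # should be close to zero

The same augmentation, applied to each parameter x_s and each state u_r, yields the matrix of sensitivities appearing in the expression for ∇ log L above.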
The experiments reported in Section 4 were based on four distinct Metropolis–Hastings
MCMC methods, whose details have not yet been described. The generic structure of the
proposal mechanism is x∗ = xn−1 + H∇ log p(xn−1 ) + Gξn , where the ξn ∼ N (0, I) are
independent. The matrices H and G are specified in Table S1. Our implementation of
these samplers interfaces with the CVODES library (Hindmarsh et al., 2005), which presents
a practical barrier to reproducibility. Moreover, the CPU time required to obtain MCMC
samples was approximately two weeks for the calcium model. Since our research focused on
post-processing of MCMC output, rather than MCMC itself, we directly make available the
full output from each sampler on each model considered at
https://doi.org/10.7910/DVN/MDKNWM.
This Harvard database download link consists of a single ZIP archive (1.5GB) that contains,
for each ODE model and each MCMC method, the states (xi )ni=1 visited by the Markov
chain, their corresponding gradients ∇ log p(xi ) and the values p(xi ) up to an unknown
normalisation constant. The Stein Thinning software described in S1 can be used to post-
process these datasets at minimal effort, enabling our findings to be reproduced.

Proposal                            H                       G                          Details
RW                                  0                       ε I                        Step size ε selected following Roberts
                                                                                       and Rosenthal (2001)
ADA-RW (Haario et al., 1999)        0                       Σ̂^{1/2}                    Σ̂ is the sample covariance matrix of
                                                                                       preliminary MCMC output
MALA (Roberts and Tweedie, 1996)    (ε²/2) I                ε I                        Step size ε selected following Roberts
                                                                                       and Rosenthal (2001)
P-MALA (Girolami and Calderhead,    (ε²/2) M⁻¹(x_{n−1})     ε [M⁻¹(x_{n−1})]^{1/2}     M(x) = F(x) + Σ₀⁻¹, where F(x) is the
2011)                                                                                  Fisher information matrix at x and Σ₀
                                                                                       is the prior covariance matrix.

Table S1: Parameters H and G used in the Metropolis–Hastings proposal.
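To make the proposal structure concrete, the following is a minimal sketch of a single Metropolis–Hastings step with the MALA proposal from Table S1, i.e. H = (ε²/2)I and G = εI; it is for illustration only (our experiments used the CVODES-based implementation described above), and log_p and grad_log_p are placeholders for the log-target and its gradient. The RW proposal is recovered by dropping the drift term, in which case the proposal-density correction cancels.

import numpy as np

def mala_step(x, log_p, grad_log_p, eps, rng):
    # One MALA transition targeting exp(log_p); returns the next state of the chain.
    def log_q(x_to, x_from):
        # Log-density (up to a constant) of the Gaussian proposal
        # N(x_from + (eps^2/2) grad_log_p(x_from), eps^2 I) evaluated at x_to.
        mean = x_from + 0.5 * eps**2 * grad_log_p(x_from)
        return -0.5 * np.sum((x_to - mean)**2) / eps**2

    proposal = x + 0.5 * eps**2 * grad_log_p(x) + eps * rng.standard_normal(x.shape)
    log_accept = (log_p(proposal) + log_q(x, proposal)
                  - log_p(x) - log_q(proposal, x))
    return proposal if np.log(rng.uniform()) < log_accept else x

# Illustrative usage on a standard Gaussian target.
rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * np.sum(x**2)
grad_log_p = lambda x: -x
x = np.zeros(2)
for _ in range(1000):
    x = mala_step(x, log_p, grad_log_p, eps=0.5, rng=rng)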

S4 Convergence Diagnostics for MCMC


Rigorous approaches for selecting a burn-in period b have been proposed by authors including
Meyn and Tweedie (1994); Rosenthal (1995); Roberts and Tweedie (1999); see also Jones and
Hobert (2001). Unfortunately, these often involve conditions that are difficult to establish
(Biswas et al., 2019, discuss how some of the terms appearing in these conditions can be
estimated), or, when they hold, they provide loose bounds, implying an unreasonably long
burn-in period.
Convergence diagnostics have emerged as a practical solution to the need to test for non-
convergence of MCMC. Their use is limited to reducing bias in MCMC output; they are
not optimised for the fixed n setting, which requires a bias-variance trade-off. Nevertheless,
convergence diagnostics constitute the principal means by which MCMC output is post-
processed. In this section we recall standard practice for selection of a burn-in period b in
constructing an estimator of the form (3), focussing on the widely-used diagnostics of Gelman
and Rubin (1992); Brooks and Gelman (1998); Gelman et al. (2014) (the GR diagnostic), as
well as the more recent work of Vats and Knudson (2018) (the VK diagnostic).
The GR diagnostic is based on running L independent chains, each of length n, with
starting points that are over-dispersed with respect to the target. Obtaining initial points
with such characterisation is not trivial because the target is not known beforehand; we refer
to the original literature for advice on how to select these initial points, but, in practice, it is
not uncommon to guess them. When the support of the target distribution is uni-dimensional
(or when d > 1, but a specific uni-dimensional summary f(x) is used), the GR diagnostic ($\hat{R}^{\mathrm{GR},L}$) is obtained as the square root of the ratio of two estimators of the variance $\sigma^2$ of the target. In particular,

\[\hat{R}^{\mathrm{GR},L} := \sqrt{\frac{\hat{\sigma}^2}{s^2}}, \tag{33}\]
where $s^2$ is the (arithmetic) mean of the sample variances $s_l^2$, $l = 1,\dots,L$, of the chains, which typically provides an underestimate of $\sigma^2$, and $\hat{\sigma}^2$ is constructed as an overestimate of the target variance,

\[\hat{\sigma}^2 := \frac{n-1}{n}s^2 + \frac{B}{n},\]

where the term $B/n$ is an estimate of the asymptotic variance of the sample mean of the Markov chain. In the original GR diagnostic, this asymptotic variance was estimated as the sample variance of the means $\bar{X}_l$, $l = 1,\dots,L$, from the $L$ chains, leading to

\[\frac{B}{n} = \frac{1}{L-1}\sum_{l=1}^{L}\left(\bar{X}_l - \frac{1}{L}\sum_{l'=1}^{L}\bar{X}_{l'}\right)^{2}.\]
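As an aside, (33) with the estimator B/n above is straightforward to compute directly; a minimal sketch (for illustration only — the diagnostics reported in this work were computed with the R packages listed at the end of this section) is:

import numpy as np

def gr_diagnostic(chains):
    # Univariate GR diagnostic (33) from an (L, n) array of scalar chains.
    L, n = chains.shape
    chain_means = chains.mean(axis=1)
    s2_l = chains.var(axis=1, ddof=1)            # within-chain sample variances
    s2 = s2_l.mean()                             # mean of the within-chain variances
    B_over_n = chain_means.var(ddof=1)           # sample variance of the chain means
    sigma2_hat = (n - 1) / n * s2 + B_over_n     # overestimate of the target variance
    return np.sqrt(sigma2_hat / s2)

# Two long, well-mixed chains should give a value close to 1.
rng = np.random.default_rng(1)
print(gr_diagnostic(rng.standard_normal((2, 10_000))))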

The improved VK diagnostic, $\hat{R}^{\mathrm{VK},L}$, is formally obtained in the same way as (33), but with more efficient estimators $\hat{\tau}^2/n$ of the asymptotic variance used in place of $B/n$. A number of options are available here, but the (lugsail) batch means estimator of Vats and Flegal (2018) is recommended because it is guaranteed to be biased from above, while still being consistent (in our simulations we use batches of size $3\sqrt{n}$). This gain in efficiency leads to
improved performance of the VK diagnostic over the GR diagnostic, in the sense that it is
less sensitive to the randomness in the Markov chains and the number of chains used. In
particular, R̂VK,L can be computed using one chain only (L = 1), which has clear practical
appeal.
For an ergodic Markov chain, R̂GR,L and R̂VK,L converge to 1 as n → ∞, so that selection
of a suitable burn-in period b amounts to observing when these diagnostics are below 1 + δ,
where δ is a suitable threshold. In the literature on R̂GR,L , the somewhat arbitrary choice
δ = 0.1 is commonly used, see Gelman et al. (2014, Ch. 11.5) and the survey in Vats and
Knudson (2018). In the literature on $\hat{R}^{\mathrm{VK},L}$, Vats and Knudson (2018) showed how δ can be selected by exploiting the relationship between $\hat{R}^{\mathrm{VK},L}$ and the effective sample size (ESS) when estimating the mean of the target. In particular, it is possible to re-write

\[\hat{R}^{\mathrm{VK},L} = \sqrt{\frac{n-1}{n} + \frac{L}{\widehat{\mathrm{ESS}}}}, \tag{34}\]

where $\widehat{\mathrm{ESS}}$ is a strongly consistent estimator of the ESS. One can therefore (approximately)
select a δ threshold that corresponds to a pre-specified value of the ESS. The literature on
error assessment for MCMC provides guidance on how large the ESS ought to be in order
that the width of a $(1-\alpha)\%$ confidence interval for the mean is less than a specified threshold $\epsilon$; see Jones and Hobert (2001); Flegal et al. (2008); Vats et al. (2019):

\[\widehat{\mathrm{ESS}} \ge M_{\alpha,\epsilon} := \frac{2^2\,\pi\,\chi^2_{1-\alpha}}{(\Gamma(1/2))^2\,\epsilon^2}, \tag{35}\]

where $\Gamma(\cdot)$ is the Gamma function and $\chi^2_{1-\alpha}$ is the $(1-\alpha)$th quantile of the $\chi^2$ distribution with one degree of freedom. Plugging (35) into (34) leads to the conclusion that, after the first iteration for which $\hat{R}^{\mathrm{VK},L}$ is below $1 + \delta$, where

\[\delta \equiv \delta(L, \alpha, \epsilon) = \sqrt{1 + \frac{L}{M_{\alpha,\epsilon}}} - 1, \tag{36}\]

the chain will provide an estimate of the mean with small Monte Carlo error, when compared
to the variability of the target. The default choices α = 0.05 and ε = 0.05 were suggested
in Vats and Knudson (2018), and were used in our work. For experiments reported in this
paper we used (36) to select an appropriate threshold for both R̂GR,L and R̂VK,L , which leads
to estimated burn-in periods that we denote b̂GR,L , and b̂VK,L , respectively, in the main text.
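As a quick check of (35) and (36), the univariate thresholds reported in Table S3 can be reproduced in a few lines (a sketch; scipy is assumed):

from math import gamma, pi, sqrt
from scipy.stats import chi2

def delta_threshold(L, alpha=0.05, eps=0.05):
    # M_{alpha,eps} from (35); since (Gamma(1/2))^2 = pi this reduces to
    # 4 chi^2_{1-alpha} / eps^2 in the univariate case.
    M = (2**2 * pi * chi2.ppf(1 - alpha, df=1)) / (gamma(0.5)**2 * eps**2)
    # delta(L, alpha, eps) from (36).
    return sqrt(1 + L / M) - 1

for L in (6, 5, 1):
    print(L, delta_threshold(L))   # approximately 4.88e-4, 4.07e-4 and 8.13e-5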
The above discussion focussed on the univariate case, but generalisations of these con-
vergence diagnostics are available and can be found in Brooks and Gelman (1998) and Vats
and Knudson (2018). All convergence diagnostics in this work were computed using the R
packages coda (Plummer et al., 2006) and stableGR (Knudson and Vats, 2020).¹

¹ The GR diagnostic in the software package uses the original definition in Gelman and Rubin (1992), which differs slightly from (33); however, this difference is not expected to strongly affect the simulation results that we present.

S5 Empirical Assessment: Additional Results


This section collects together additional empirical results that accompany the empirical
assessment in Section 4.

S5.1 Goodwin Oscillator


The Goodwin oscillator is a phenomenological model for genetic regulatory processes in a
cell and is described by g coupled ODEs of the form
\[\frac{\mathrm{d}u_1}{\mathrm{d}t} = \frac{a_1}{1 + a_2 u_g^{\rho}} - \alpha u_1, \qquad \frac{\mathrm{d}u_2}{\mathrm{d}t} = k_1 u_1 - \alpha u_2, \qquad \dots, \qquad \frac{\mathrm{d}u_g}{\mathrm{d}t} = k_{g-1} u_{g-1} - \alpha u_g,\]
where the first component u1 represents the concentration of mRNA, u2 that of its corre-
sponding protein product, while u3 , . . . ug represent concentrations of proteins in a signalling
cascade, that can either be present (g > 2) or absent (g = 2), and with the g th protein
having a negative feedback on the production of mRNA, by means of the Hill curve in the
first equation. The nontrivial oscillations that result have led to the Goodwin oscillator
being used in previous studies to assess the performance of Bayesian computational methods
(Calderhead and Girolami, 2009; Oates et al., 2016; Chen et al., 2019). The parameters
a1 , k1 , . . . kg−1 > 0 represent synthesis rates and a2 , α > 0 represent degradation rates. To
cast this model in the setting of Section 2 we set x ∈ Rg+2 to be the vector whose entries are
log(a1 ), log(k1 ), and so forth, so that we have a d = g + 2 dimensional parameter for which
inference is performed.
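For reference, a sketch of the right-hand side of the g = 2 system at a given parameter value is shown below; it is illustrative only (the value of the Hill exponent ρ is an assumption here, and our experiments used the CVODES-based implementation of Appendix S3).

import numpy as np
from scipy.integrate import solve_ivp

def goodwin_rhs(t, u, a1, a2, alpha, k1, rho=10.0):
    # g = 2 Goodwin oscillator: u[0] is the mRNA concentration, u[1] the protein.
    du1 = a1 / (1.0 + a2 * u[1]**rho) - alpha * u[0]
    du2 = k1 * u[0] - alpha * u[1]
    return [du1, du2]

# Solve at the data-generating parameters (a1, a2, alpha, k1) = (1, 3, 0.5, 1),
# from the initial condition u(0) = (0, 0), over the observation window.
sol = solve_ivp(goodwin_rhs, (0.0, 25.0), [0.0, 0.0],
                args=(1.0, 3.0, 0.5, 1.0), t_eval=np.linspace(1.0, 25.0, 2400))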
The experiment that we report considers synthetic data yi ∈ Rg generated in the simple
case g = 2, which are then corrupted by Gaussian noise such that the terms φi in (31) are
equal to

\[\phi_i(u(t_i)) \propto \exp\left(-\frac{1}{2}(y_i - u(t_i))^{\top} C^{-1}(y_i - u(t_i))\right) \tag{37}\]
with C = diag(0.12 , 0.052 ). The initial condition was u(0) = (0, 0) and the data-generating
parameters were (a1 , a2 , α, k1 ) = (1, 3, 0.5, 1). The times ti , i = 1, . . . , 2400, at which data
were obtained were taken to be uniformly spaced on [1,25], in order to capture both the
oscillatory behaviour of the system and its steady state. This relatively high frequency
of observation and corresponding informativeness of the dataset was used to pre-empt a
similarly high frequency observation process in the calcium signalling model of Section 4.3.
Figure S1 displays the dataset. A standard Gaussian prior π(x) was placed on the parameter
x and each MCMC method was applied to approximately sample from the posterior P .
Exemplar trace plots for the MCMC methods are presented in Figure S2. The over-
dispersed initial states used for the L chains are reported in Table S2, while the univariate and
multivariate convergence diagnostics, computed every 1000 iterations, are shown respectively
in Figure S3 and Figure S4. The values of the thresholds δ(L, α, ε) are reported in Table S3.
For each MCMC method, the estimated burn-in period is presented in Table S4. The GR
diagnostic did not fall below the 1 + δ threshold in the allowed number of iterations, which
is consistent with the empirical observations of Vats and Knudson (2018).
The additional results for the Goodwin oscillator that we present in this appendix are as
follows:

• Figures 3 (RW), S5 (ADA-RW), S6 (MALA) and S7 (P-MALA) display point sets of size m = 20 selected us-
ing traditional burn in and thinning methods, Support Points and Stein Thinning,
based on MCMC output. Note that the gray regions are not necessarily regions of high
posterior probability; they are the regions explored by the sample path and, moreover,
these panels are two-dimensional projections from R4 . Therefore we are hesitant to
draw strong conclusions from these figures.

• Figures 4 (RW), S8 (ADA-RW), S9 (MALA) and S10 (P-MALA) display the absolute error in estimating
the first moment of each parameter, for each of the competing methods, where an
extended run from MCMC provided the ground truth.

• Figures S11 (RW), S12 (ADA-RW), S13 (MALA) and S14 (P-MALA) show marginal density
estimates, for each parameter and each of the competing methods, where an extended
run from MCMC provided the ground truth.


Figure S1: Data (gray) and ODE solution corresponding to the true data-generating param-
eters (black) for the Goodwin oscillator.

Chain Number Initial State for Parameters (a1 , a2 , α, k)


1 (0.5, 1, 3, 2)
2 (0.001, 0.2, 0.1, 10)
3 (10, 0.1, 0.9, 0.1)
4 (0.1, 30, 0.1, 0.3)
5 (2, 2, 2, 2)
6 (5, 5, 1, 1)

Table S2: Initial states, over-dispersed with respect to the posterior, for the L = 6 indepen-
dent Markov chains used in the Goodwin oscillator. The parameters used to generate the
data were (a1 , a2 , α, k) = (1, 3, 0.5, 1).

        Univariate Diagnostics    Multivariate Diagnostics
L = 6   4.88 × 10⁻⁴               3.56 × 10⁻⁴
L = 5   4.07 × 10⁻⁴               2.96 × 10⁻⁴
L = 1   8.13 × 10⁻⁵               5.93 × 10⁻⁵

Table S3: The values of the threshold δ(L, α, ε) used in analysis of the Goodwin and Lotka–
Volterra models, with α = 0.05, ε = 0.05, when changing L, and considering the univariate
and multivariate convergence diagnostics R̂VK,L .


Figure S2: Trace plots for each parameter in the Goodwin oscillator, plotted against the
MCMC iteration number. Each row corresponds to one of the four parameters, while each
column corresponds to one of the four MCMC methods considered. (Note the logarithmic
scale on the horizontal axis, used to better visualise the initial part of the MCMC sample
path.)


Figure S3: Univariate convergence diagnostics, for the Goodwin oscillator, plotted against
the MCMC iteration number. The black line represents the GR diagnostic (based on L = 6
chains), while the blue and red lines represent the VK diagnostic (based on L = 6 and
L = 1 chains, respectively). The dash-dotted (L = 6) and dashed (L = 1) horizontal
lines correspond to the critical values δ(L, α, ε), used to determine the burn-in period; see
Table S3.


Figure S4: Multivariate convergence diagnostics for the Goodwin oscillator, plotted against
the MCMC iteration number n. The black line is the GR diagnostic (based on L = 6
chains), while the blue and red lines are the VK diagnostic (based on L = 6 and L = 1
chains, respectively). The dotted (L = 6) and dashed (L = 1) horizontal lines correspond to
the threshold δ that is used to determine the burn-in period; see Table S3 in Appendix S5.1.

MCMC Diagnostics   Sampler   b̂GR,6   b̂VK,6     b̂VK,1

Univariate         RW        > n     70,000    820,000
                   ADA-RW    > n     71,000    816,000
                   MALA      > n     397,000   1,020,000
                   P-MALA    > n     68,000    987,000
Multivariate       RW        > n     93,000    578,000
                   ADA-RW    > n     107,000   824,000
                   MALA      > n     316,000   1,615,000
                   P-MALA    > n     103,000   1,475,000

Table S4: Estimated burn-in period for the Goodwin oscillator, using the GR diagnostic
based on L chains, b̂GR,L (L = 6), and the VK diagnostic based on L chains, b̂VK,L , (L = 1, 6).
In each case both univariate and multivariate convergence diagnostics are presented; in the
univariate case we report the largest value obtained when looking at each of the d parameters
individually to estimate the burn-in period. The symbol “> n” indicates the case in which
a diagnostic did not go below the 1 + δ threshold.

Figure S5: Projections on the first two coordinates of the ADA-RW MCMC output for the
Goodwin oscillator (grey dots), together with the first m = 20 points selected through:
traditional burn-in and thinning (the amount of burn in is indicated in the legend); the
Support Points method; Stein Thinning, for each of the settings med, sclmed, smpcov.

Figure S6: Projections on the first two coordinates of the MALA MCMC output for the
Goodwin oscillator (grey dots), together with the first m = 20 points selected through:
traditional burn-in and thinning (the amount of burn in is indicated in the legend); the
Support Points method; Stein Thinning, for each of the settings med, sclmed, smpcov.

Figure S7: Projections on the first two coordinates of the P-MALA MCMC output for the
Goodwin oscillator (grey dots), together with the first m = 20 points selected through:
traditional burn-in and thinning (the amount of burn in is indicated in the legend); the
Support Points method; Stein Thinning, for each of the settings med, sclmed, smpcov.


Figure S8: Absolute error of estimates for the posterior mean of each parameter in the
Goodwin oscillator, based on output from ADA-RW MCMC.


Figure S9: Absolute error of estimates for the posterior mean of each parameter in the
Goodwin oscillator, based on output from MALA MCMC.


Figure S10: Absolute error of estimates for the posterior mean of each parameter in the
Goodwin oscillator, based on output from P-MALA MCMC.


Figure S11: True and estimated marginal densities of the parameters in the Goodwin oscil-
lator, using m = 20 points selected from RW MCMC output.


Figure S12: True and estimated marginal densities of the parameters in the Goodwin oscil-
lator, using m = 20 points, selected from ADA-RW MCMC output.


Figure S13: True and estimated marginal densities of the parameters in the Goodwin oscil-
lator, using m = 20 points, selected from MALA MCMC output.


Figure S14: True and estimated marginal densities of the parameters in the Goodwin oscil-
lator, using m = 20 points, selected from P-MALA MCMC output.

S5.2 Lotka–Volterra
The Lotka–Volterra model describes the oscillatory evolution of prey (u1 ) and predator (u2 )
species in a closed environment. The prey has an intrinsic mechanism for growth proportional
to its abundance, described by a parameter θ1 > 0, whilst interaction with the predator leads
to a decrease in the prey population at a rate described by a parameter θ2 > 0. Conversely,
the predator has an intrinsic mechanism for decline proportional to its abundance, described
by a parameter θ3 > 0, whilst interaction with the prey leads to an increase in the predator
population at a rate described by a parameter θ4 > 0. The resulting system of ODEs is:
\[\frac{\mathrm{d}u_1}{\mathrm{d}t} = \theta_1 u_1 - \theta_2 u_1 u_2, \qquad \frac{\mathrm{d}u_2}{\mathrm{d}t} = \theta_4 u_1 u_2 - \theta_3 u_2.\]
To cast this model in the setting of Section 2 we set x ∈ R4 to be the vector whose entries
are log(θ1 ), . . . , log(θ4 ), so that we have a d = 4 dimensional parameter for which inference
is performed.
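A sketch of the corresponding right-hand side, in the log-parameterisation just described, is given below (illustrative only; our experiments used the CVODES-based implementation of Appendix S3).

import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra_rhs(t, u, x):
    # x = (log theta_1, ..., log theta_4); exponentiate to recover the positive rates.
    th1, th2, th3, th4 = np.exp(x)
    prey, predator = u
    return [th1 * prey - th2 * prey * predator,
            th4 * prey * predator - th3 * predator]

# Solve at the data-generating parameters theta = (0.67, 1.33, 1, 1), u(0) = (1, 1).
x_true = np.log([0.67, 1.33, 1.0, 1.0])
sol = solve_ivp(lotka_volterra_rhs, (0.0, 25.0), [1.0, 1.0],
                args=(x_true,), t_eval=np.linspace(0.0, 25.0, 2400))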
The experiment that we report considers synthetic data which are corrupted by Gaussian
noise such that the terms φi in (31) have expression (37), with C = diag(0.22 , 0.22 ). The
initial condition was u(0) = (1, 1) and the data-generating parameters were x = log(θ), with
θ = (0.67, 1.33, 1, 1). The times ti , i = 1, . . . , 2400, at which data were obtained were taken
to be uniformly spaced on [0, 25]. Figure S15 displays the dataset. A standard Gaussian
prior π(x) was used.
Exemplar trace plots for the MCMC methods are presented in Figure S16. The over-
dispersed initial states used for the L chains are reported in Table S5, while
the univariate and multivariate convergence diagnostics, computed every 1000 iterations, are
shown respectively in Figure S17 and Figure S18. The values of the thresholds δ(L, α, ε) are
the same as those reported in Table S3. For each MCMC method, the estimated burn-in
period is presented in Table S6.
The additional results for the Lotka–Volterra model that we present in this appendix are
as follows:

• Figures S19 (RW), S20 (ADA-RW), S21 (MALA) and S22 (P-MALA) display point sets of size
m = 20 selected using traditional burn in and thinning methods, Support Points and
Stein Thinning, based on MCMC output.


Figure S15: Data (gray) and ODE solution corresponding to the true data-generating pa-
rameters (black) for the Lotka–Volterra model.

Chain Number   Initial State for Parameters     Initial State for Parameters
               (RW, MALA, P-MALA)               (ADA-RW)
1              (0.55, 1, 0.8, 0.8)              (0.55, 1, 0.8, 0.8)
2              (1.5, 1, 0.8, 0.8)               (0.55, 1, 0.8, 1.3)
3              (1.3, 1.33, 0.5, 0.8)            (1.3, 1.33, 0.5, 0.8)
4              (0.55, 3, 3, 0.8)                (0.55, 1, 1.5, 1.5)
5              (0.55, 1, 1.5, 1.5)              (0.55, 1.3, 1, 0.8)

Table S5: Initial states, θ = exp(x), over-dispersed with respect to the posterior, for the
L = 5 independent Markov chains used in the Lotka–Volterra model. The parameters used
to generate the data were θ = (0.67, 1.33, 1, 1).

MCMC Diagnostics   Sampler   b̂GR,5       b̂VK,5     b̂VK,1

Univariate         RW        > n         88,000    954,000
                   ADA-RW    > n         84,000    764,000
                   MALA      > n         424,000   995,000
                   P-MALA    > n         90,000    820,000
Multivariate       RW        > n         119,000   1,512,000
                   ADA-RW    1,797,000   99,000    743,000
                   MALA      > n         259,000   1,573,000
                   P-MALA    > n         114,000   1,251,000

Table S6: Estimated burn-in period for the Lotka–Volterra model, using the GR diagnostic
based on L chains, b̂GR,L (L = 5), and the VK diagnostic based on L chains, b̂VK,L , (L = 1, 5).
In each case both univariate and multivariate convergence diagnostics are presented; in the
univariate case we report the largest value obtained when looking at each of the d parameters
individually to estimate the burn-in period. The symbol “> n” indicates the case in which
a diagnostic did not go below the 1 + δ threshold.


Figure S16: Trace plots for the parameters xi in the Lotka–Volterra model, plotted against
the MCMC iteration number. Each row corresponds to one of the four parameters, while
each column corresponds to one of the four MCMC methods considered.


Figure S17: Univariate convergence diagnostics, for the Lotka–Volterra model, plotted
against the MCMC iteration number. The black line represents the GR diagnostic (based
on L = 5 chains), while the blue and red lines represent the VK diagnostic (based on L = 5
and L = 1 chains, respectively). The dash-dotted (L = 5) and dashed (L = 1) horizontal
lines correspond to the critical values δ(L, α, ε), used to determine the burn-in period; see
Table S3.


Figure S18: Multivariate convergence diagnostics for Lotka–Volterra, plotted against the
MCMC iteration number. The black line is the GR diagnostic (based on L = 5 chains),
while the blue and red lines are the VK diagnostic (based on L = 5 and L = 1 chains,
respectively). The dotted (L = 5) and dashed (L = 1) horizontal lines correspond to the
critical values δ(L, α, ε), used to determine the burn-in period; see Table S3.

Figure S19: Projections on the first two coordinates of the RW MCMC output for the Lotka–
Volterra model (grey dots), together with the first m = 4 points selected through: traditional
burn-in and thinning (the amount of burn in is indicated in the legend); the Support Points
method; Stein Thinning, for each of the settings med, sclmed, smpcov.

Figure S20: Projections on the first two coordinates of the ADA-RW MCMC output for the
Lotka–Volterra model (grey dots), together with the first m = 4 points selected through:
traditional burn-in and thinning (the amount of burn in is indicated in the legend); the
Support Points method; Stein Thinning, for each of the settings med, sclmed, smpcov.

Figure S21: Projections on the first two coordinates of the MALA MCMC output for the
Lotka–Volterra model (grey dots), together with the first m = 4 points selected through:
traditional burn-in and thinning (the amount of burn in is indicated in the legend); the
Support Points method; Stein Thinning, for each of the settings med, sclmed, smpcov.

Figure S22: Projections on the first two coordinates of the P-MALA MCMC output for the
Lotka–Volterra model (grey dots), together with the first m = 4 points selected through:
traditional burn-in and thinning (the amount of burn-in is indicated in the legend); the
Support Points method; Stein Thinning, for each of the settings med, sclmed, smpcov.
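
For readers who wish to experiment with selections of the kind shown in Figures S19–S22 without installing the Stein Thinning package, the following is a minimal, self-contained Python sketch of greedy kernel Stein discrepancy minimisation, using a Langevin–Stein kernel built from an inverse multiquadric base kernel with preconditioner matrix Γ. The function names are ours, the default choice of Γ is only a rough analogue of the smpcov setting, and the released Python/MATLAB package should be preferred in practice.

import numpy as np

def imq_stein_kernel(X, G, Gamma):
    """Langevin-Stein kernel matrix for the inverse multiquadric base kernel
    k(x, y) = (1 + (x - y)^T Gamma^{-1} (x - y))^{-1/2}.
    X : (n, d) array of sample states; G : (n, d) array of gradients of the
    log-target density evaluated at the rows of X."""
    Gi = np.linalg.inv(Gamma)
    diff = X[:, None, :] - X[None, :, :]            # pairwise x - y, shape (n, n, d)
    q = diff @ Gi                                    # Gamma^{-1} (x - y)
    base = 1.0 + np.einsum('ijk,ijk->ij', diff, q)   # 1 + (x - y)^T Gamma^{-1} (x - y)
    cross = np.einsum('ijk,ik->ij', q, G) - np.einsum('ijk,jk->ij', q, G)
    qq = np.einsum('ijk,ijk->ij', q, q)
    return (base ** -1.5 * (np.trace(Gi) + cross)
            - 3.0 * base ** -2.5 * qq
            + base ** -0.5 * (G @ G.T))

def greedy_stein_thinning(X, G, m, Gamma=None):
    """Greedily select m indices so that the empirical distribution of the
    selected states has small kernel Stein discrepancy (illustration only)."""
    if Gamma is None:
        Gamma = np.cov(X, rowvar=False)              # rough analogue of 'smpcov'
    K = imq_stein_kernel(X, G, Gamma)
    objective = 0.5 * np.diag(K).copy()              # running greedy objective
    index = []
    for _ in range(m):
        i = int(np.argmin(objective))                # best point given selections so far
        index.append(i)
        objective += K[i]                            # add the new point's contribution
    return np.array(index)

Given MCMC output smp (an n × d array) and the matching gradients of the log-posterior grd, a call such as greedy_stein_thinning(smp, grd, m=4) returns indices analogous to the selections displayed in the figures above; the med and sclmed settings instead take Γ proportional to the identity, with length-scales based on the median heuristic (see the main text for precise definitions).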

S5.3 Calcium Signalling Model
This appendix contains a detailed biochemical description of the calcium signalling model
studied in Section 4.3 of the main text, together with the experimental dataset that we
collected.
The Hinch et al. (2004) single cell model simulates the calcium transient evoked by mem-
brane depolarisation in a cardiac cell. The model has a mathematical representation of the
extracellular space and the intracellular compartment consisting of the sarcoplasmic reticu-
lum (SR), dyadic space and cytosol. The major sarcolemmal calcium pathways are included:
the L-type Ca channel (LCC), the plasmalemmal membrane calcium ATPase (PMCA) and
the sodium-calcium exchanger (NCX). Inside the cell, the model has mathematical repre-
sentations for calcium release from the SR to dyadic space through ryanodine receptors
(RyR) and re-sequestration of calcium from the dyadic space into the SR by the SR ATPase
(SERCA). Calcium buffering in the cytosol is also included. A schematic representation of
the cell model is given in Figure S23.
Membrane depolarisation is triggered by an electrical event. This causes calcium to enter
through LCCs into the dyadic space, producing a local rise in Ca concentration, sufficient
to activate RyRs. This process engages a feedback loop, whereby Ca release from the SR causes
more RyR opening events. As the released Ca diffuses into the cytosol, most of it becomes
buffered, but some ions remain free and underpin the Ca transient. Recovery following Ca
release is driven by SERCA, which re-sequesters Ca into the SR, and NCX and PMCA which
extrude calcium across the sarcolemma. This returns the cell to its initial conditions, ready
for the next electrical stimulation.
The Hinch model describes the nonlinear, time-dependent interaction of the four Ca
handling transporters (LCC, PMCA, RyR and SERCA) and lumped buffering by a system
of 6 ODEs whose parameter is d = 38 dimensional. The model includes a simplified four-state description of the interaction between LCC and RyR within the dyadic space. Here,
only three states are simulated due to a conservation of mass constraint. The remaining
differential equations describe calcium concentration in the sarcoplasmic reticulum and the
cytosol, and the calcium bound to cytosolic buffers. Of these state variables, only the
concentration of free calcium in the cytosol can be experimentally observed.
To provide a rich dataset for characterising calcium dynamics in a single cardiac my-
ocyte, we applied three experimental protocols in sequence on a single myocyte. During
these protocols, we controlled membrane potential and measured membrane currents electro-
physiologically and, after appropriate calibration, followed Ca fluorimetrically. The calcium
handling proteins were interrogated by relating currents and Ca concentration in response to
defined membrane potential manoeuvres, and in the presence of drugs to eliminate various
confounding components. The first voltage protocol interrogated LCC currents at different
voltages, and measured their response in terms of SR release. In the second protocol, a train
of depolarisations then triggered Ca transients which provided information about SR release
and their recovery provided a readout of SERCA, NCX and PMCA activities. The third
protocol consisted of rapid exposure to caffeine, which emptied the SR and short-circuited
SERCA. This provided information about SR load, and the subsequent recovery provided a readout
of NCX and PMCA. Buffering was calculated from the quotient of measured Ca rise upon
caffeine exposure and the amount of Ca released, back-calculated from the sarcolemmal current generated by NCX. The dataset contains 12,998 observations of cytosolic free calcium concentration, recorded at a 60 Hz sampling frequency for a duration of 3 minutes. The data
are displayed in Figure S24.
The calcium signalling model in Figure S23 is represented by a coupled system of 6
ODEs and depends on a d = 38 dimensional parameter, which is to be estimated based on
the experimental dataset. As just described, the data consist of measurements of calcium
concentration in the cytoplasm whilst the cell was externally stimulated, so that only one of
the state variables (in our case, u5) was observed. Our likelihood took the simple Gaussian form φi(u(ti)) ∝ exp(−(yi − u5(ti))²/(2σ²)) with σ = 2.07 × 10−8. The ODE was numerically
solved using CVODES (Hindmarsh et al., 2005) and sensitivities were computed by solving the
forward sensitivity equations; see Appendix S3. Further details of the expert-elicited prior,
the data pre-processing procedure and numerical details associated with the ODE solver will
be reported in a separate manuscript, in preparation as of July 1, 2020, and are available on
request.
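
As an illustration of how this likelihood can be evaluated, a minimal Python sketch is given below. The right-hand side rhs, the initial condition u0 and the solver tolerances are placeholders (our computations used CVODES with forward sensitivity analysis; the sketch calls scipy purely for brevity), and only the value of σ is taken from the description above.

import numpy as np
from scipy.integrate import solve_ivp

SIGMA = 2.07e-8   # observation noise standard deviation (see text)

def log_likelihood(theta, t_obs, y_obs, rhs, u0):
    """Gaussian log-likelihood sum_i log phi_i(u(t_i)), where phi_i(u(t_i))
    is proportional to exp(-(y_i - u5(t_i))^2 / (2 sigma^2)) and only the
    fifth state variable (cytosolic free calcium) is observed; additive
    constants not depending on theta are omitted.
    rhs(t, u, theta) is a placeholder for the Hinch et al. (2004) ODE."""
    sol = solve_ivp(lambda t, u: rhs(t, u, theta), (0.0, float(t_obs[-1])), u0,
                    t_eval=t_obs, method="LSODA", rtol=1e-8, atol=1e-10)
    u5 = sol.y[4]                                    # u5, in zero-based indexing
    return -0.5 * np.sum((y_obs - u5) ** 2) / SIGMA ** 2
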
The additional results for the calcium signalling model that we present in this appendix
are as follows:

• Exemplar trace plots for the MCMC methods are presented in Figure S25.

• The multivariate convergence diagnostics are displayed in Figure S26.

• Figures S27 and S28 present results for KSD based on the sclmed and smpcov settings, to
complement Figure 9 in the main text.

Figure S23: Calcium signalling model; a schematic representation due to Hinch et al. (2004).
The model consists of 6 coupled ordinary differential equations and depends upon 38 real-
valued parameters that must be estimated from an experimental dataset.

Figure S24: Calcium concentration data (in nmol) plotted against time (in ms).

Figure S25: Trace plots for the first 16 (of 38) parameters in the calcium signalling model,
plotted against the MCMC iteration number. Each panel corresponds to one parameter in
the model.

Figure S26: Multivariate convergence diagnostics for the calcium signalling model, plotted
against the MCMC iteration number. The blue line is the GR diagnostic (based on L = 10
chains), while the orange and green lines are the VK diagnostic (based on L = 10 and L = 1
chains, respectively). The dotted (L = 10) and dashed (L = 1) horizontal lines correspond
to the critical values δ(L, α, ε), equal to 5.91 × 10−4 and 5.91 × 10−5, respectively, used to
determine the burn-in period.

Figure S27: Calcium signalling model. Kernel Stein discrepancy (KSD) based on sclmed,
for empirical distributions obtained through traditional burn-in and thinning (grey lines),
Support Points (black line) and through Stein Thinning (colored lines), based on output
from RW MCMC.
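
For reference, the quantity plotted on the vertical axes of Figures S27 and S28 (and of Figure 9 in the main text) is the kernel Stein discrepancy of the empirical distribution formed by the first m selected points. Writing π(1), . . . , π(m) for the selected indices and k_P for the Stein kernel, with the preconditioner setting indicated in each caption, this reduces to the standard expression

KSD( (1/m) Σ_{j=1}^{m} δ(X_{π(j)}) ) = [ (1/m²) Σ_{j=1}^{m} Σ_{j'=1}^{m} k_P(X_{π(j)}, X_{π(j')}) ]^{1/2},

so that smaller values indicate a closer approximation to the posterior.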

Figure S28: Calcium signalling model. Kernel Stein discrepancy (KSD) based on smpcov,
for empirical distributions obtained through traditional burn-in and thinning (grey lines),
Support Points (black line) and through Stein Thinning (colored lines), based on output
from RW MCMC.
