
theta – a framework for template-based modeling and inference
Thomas Müller

Jochen Ott

Jeannine Wagner-Kuhr

Institut für Experimentelle Kernphysik, Karlsruhe Institute of Technology (KIT),
Germany
June 17, 2010

Statistical methods such as hypothesis tests and interval estimation for Poisson counts in multiple channels are frequently performed in high energy physics. We present an efficient and extensible software framework which uses a template-based model approach. It includes modules to calculate several likelihood-based quantities on a large number of pseudo experiments or on data. The generated values can be used to apply frequentist methods such as hypothesis testing and the Neyman construction for interval estimation, or modified frequentist methods such as the CLs method. It also includes an efficient Markov-Chain Monte-Carlo implementation for Bayesian inference to calculate Bayes factors or Bayesian credible intervals.


Contents

1 Introduction
2 Template-Based Modeling
  2.1 Example Model
3 Statistical Methods
  3.1 Discovery
    3.1.1 Direct Estimation of Z
    3.1.2 Z via Test Statistic Distribution
    3.1.3 Bayes Factor
  3.2 Measurement
    3.2.1 Profile Likelihood Method
    3.2.2 Neyman Construction
    3.2.3 Bayesian Inference: Posterior
  3.3 Exclusion
4 Including Systematic Uncertainties
  4.1 Rate Uncertainties
  4.2 Template Uncertainties
5 theta Framework
  5.1 Combination with external Likelihood Functions
  5.2 Markov Chains in theta
  5.3 Testing


1 Introduction

In High Energy Physics (HEP), the result of an analysis is often the outcome of a statistical procedure, such as the maximum likelihood estimate for a certain parameter or the value of a likelihood ratio. Common cases are interval estimations and hypothesis tests based on models of Poisson counts in multiple channels. Here, not only the mere number of events is used as input to the statistical method but also the measured distribution of a certain observable. Analytical solutions for the commonly used test statistics are in general not known for such a model. Therefore, a numerical treatment is necessary. In order to accurately generate the test statistic distribution for a certain model, large-scale Monte-Carlo production of pseudo data and an efficient calculation of the test statistic are necessary. This is one main application of theta.

The template-based model used by theta is defined and discussed in more detail in Section 2. Details on how theta can be used in various statistical methods are discussed in Section 3. Section 4 discusses how systematic uncertainties affecting the shape and normalization of templates can be included in theta. Section 5 gives an overview of the architecture and some implementation details of theta. An Appendix is included which contains example configuration files for theta.

theta is licensed under the GPL. The software package and further documentation can be obtained via http://theta-framework.org/.

2 Template-Based Modeling

We only consider the case where the expected and measured distributions are binned in the observable. A typical model predicts the Poisson mean in each bin i as a linear combination of different components,

mi = p1 · t1,i + p2 · t2,i + p3 · t3,i   (1)

where the pj are real model parameters and, fixing j, tj is a one-dimensional template¹ for process j. The probability of observing ni events in bin i is the Poisson probability with mean mi, where the expected Poisson mean mi is the sum of the expected Poisson means of the different contributing processes. The resulting template m contains the predicted Poisson mean for each bin.

In all but the most simple cases, systematic uncertainties will complicate the picture, namely:
• one source of uncertainty might affect p1 and p2, but not p3;
• another source of uncertainty might affect the shape of the templates ti.

Therefore, while eq. (1) covers some simple use cases, a more general model is considered here which allows the coefficients and templates in the linear combination on the right hand side of eq. (1) to depend arbitrarily on model parameters. Writing the expectation as a linear combination of templates is the basis of the (more general) model definition used in theta, which is defined next.

¹ The term “template” is used here to mean both a probability density in an observable as well as the event count, i.e., it can be thought of as a histogram normalized to the expectation.

The most general statistical model in theta predicts templates of a set of observables o1, ..., oN. Given model parameters p⃗, the model prediction for observable oi is

mi(p⃗) = Σ_{k=1..Mi} ci,k(p⃗) · ti,k(p⃗)   (2)

where the ci,k are real-valued coefficients and the ti,k are the templates of the Mi individual processes contributing in this channel. Both the coefficients and the templates are functions of the real model parameters p⃗. It is assumed that the template bin contents ti,k are all strictly positive.

Given model parameters p⃗, the probability of observing the “data” template d is given by a Poisson for each bin of the template:

pm(d|p⃗) = Π_{i=1..N} Π_{l=1..bi} Poisson(di,l | mi,l(p⃗))   (3)

where i runs over all observables, bi is the number of bins for observable oi, di,l is the number of observed events of observable oi in bin l, and

Poisson(n|λ) = λⁿ e^{−λ} / n!

is the Poisson probability of observing n events, given mean λ. Fixing d and reading eq. (3) as a function of p⃗ yields the likelihood function used in many statistical methods discussed in Section 3:

L(p⃗|d) = pm(d|p⃗).   (4)

In general, an additional term D(p⃗) is introduced in the likelihood function and eq. (4) becomes

L(p⃗|d) = pm(d|p⃗) · D(p⃗).   (5)

The origin and interpretation of the term D(p⃗) might differ, depending on the application. A typical application is a two-step measurement with two different models. In the first (“sideband”) measurement, some parameter (such as the background rate) is measured as p̂ ± ∆p̂. In the second measurement, the parameter of interest is estimated. As part of the likelihood function in the second model, a Gaussian term for p with mean p̂ and width ∆p̂ is included in D. In this application, D can be seen as the approximate likelihood of the first measurement.

Note that the templates ti,k have been introduced as one-dimensional objects. However, all bins are statistically independent. Thus, bins can be re-ordered arbitrarily without changing the outcome of any method. In particular, it is also possible to start with multi-dimensional templates and concatenate all their bins to obtain a one-dimensional template. Therefore, using one-dimensional templates does not restrict applications as much as it might seem at first.
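To make the model definition concrete, the following C++ sketch evaluates eqs. (2)–(5) for a single channel. It is illustrative only — none of these names are part of theta's API — and the constant n!-terms of eq. (3) are dropped, as they do not depend on p⃗.

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// All names below are illustrative; this is not part of theta's API.
typedef std::vector<double> Params;    // the model parameters p
typedef std::vector<double> Template;  // bin contents of one template

// model prediction for one channel: m_l = sum_k c_k(p) * t_k(p)_l, cf. eq. (2)
Template prediction(const Params& p,
                    const std::vector<std::function<double(const Params&)> >& coeffs,
                    const std::vector<std::function<Template(const Params&)> >& temps) {
    Template m(temps.at(0)(p).size(), 0.0);
    for (std::size_t k = 0; k < temps.size(); ++k) {
        const Template t = temps[k](p);
        const double c = coeffs[k](p);
        for (std::size_t l = 0; l < m.size(); ++l)
            m[l] += c * t[l];
    }
    return m;
}

// -ln L(p|d) from eqs. (3)-(5); constant ln(n!) terms are dropped.
// extra_nll plays the role of -ln D(p) in eq. (5).
double nll(const Template& m, const std::vector<double>& data, double extra_nll = 0.0) {
    double result = extra_nll;
    for (std::size_t l = 0; l < data.size(); ++l)
        result += m[l] - data[l] * std::log(m[l]);  // -ln Poisson(d_l|m_l) + const.
    return result;
}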

2.1 Example Model

As example used in the following sections, consider the measurement of a signal using two channels: the first channel is a signal-free region used to measure the background (“sideband channel” sb); the second channel contains the signal (“signal channel” s). The parameters of this model are the Poisson mean of the background, µb, and the Poisson mean of the signal, µs, both for the signal channel. The parameter of interest is µs; µb is a nuisance parameter. It is assumed that the Poisson mean in the sideband channel is a fixed multiple of µb, where the factor is known precisely. The model prediction written in the form of eq. (2) is

msb(µs, µb) = µb · tsb,1   (6)
ms(µs, µb) = µb · ts,1 + µs · ts,2   (7)

where the templates ti,j are constant, i.e., they do not depend on any model parameter. The sideband template, tsb,1, is taken to be a counting-only measurement, i.e., it is a one-bin template with τ as bin content. The templates for the signal channel are a mass-like distribution with 100 bins on the range 0–500: the background template ts,1 is exponentially falling, and the signal template ts,2 is a normal distribution around 250 with width 50. Both templates in the signal channel are normalized such that the sum of their bin entries equals one. The templates in the signal region are depicted in Figure 1.

The values used in the following sections will be µb = 20, µs = 10, and 10 as normalization of the sideband background template tsb,1.

[Figure 1: Templates in the signal region of the signal and background process over some mass-like observable (“mass”, arbitrary units). “Background” corresponds to ts,1; the “Signal” template is ts,2 in (7).]
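For reference, the signal-channel templates described above can be generated as follows. This is a sketch under the stated assumptions (100 bins on 0–500, Gaussian signal around 250 with width 50, unit bin sum); the decay constant of the exponential background is not specified in the text and is chosen arbitrarily here.

#include <cmath>
#include <vector>

// build one of the two signal-channel templates of the example model:
// 100 bins on [0, 500); each template is normalized to unit bin sum.
std::vector<double> make_template(bool signal) {
    const int nbins = 100;
    const double xmin = 0.0, xmax = 500.0, width = (xmax - xmin) / nbins;
    std::vector<double> t(nbins);
    double sum = 0.0;
    for (int i = 0; i < nbins; ++i) {
        const double x = xmin + (i + 0.5) * width;  // bin center
        // signal: normal distribution around 250 with width 50;
        // background: exponentially falling (the decay constant 100 is an
        // arbitrary choice here; the text does not specify it)
        t[i] = signal ? std::exp(-0.5 * std::pow((x - 250.0) / 50.0, 2))
                      : std::exp(-x / 100.0);
        sum += t[i];
    }
    for (int i = 0; i < nbins; ++i) t[i] /= sum;    // normalize bin sum to 1
    return t;
}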

3 Statistical Methods

The statistical methods discussed in the following sections are:
1. Discovery: make a hypothesis test with null hypothesis µs = 0 versus µs > 0.
2. Measurement: estimate µs, and an interval for µs.
3. Exclusion: give an upper / lower limit on µs.

Most statistical methods discussed here (such as the likelihood-ratio test, the maximum-likelihood estimator, or the Neyman interval construction) are discussed in detail in many introductory texts on statistics (such as [4]) and will not be explained or referenced in detail here.

In theta, different producers are available to address these questions. A producer calculates one or more values, given data and a model as C++ objects. The calculated values are written to an output file and called products. These products can be used in a second step to make the actual statistical inferences. Examples for producers are:
• the maximum likelihood estimator, which produces the estimated parameter values;
• the ∆ ln L producer for interval estimation, which produces profile likelihood intervals;
• the likelihood ratio producer, which calculates the negative log-likelihood for different parameter restrictions of the same model;
• the Markov-Chain Monte-Carlo producer, which calculates a histogram of the posterior in one parameter.

In the following sections, some applications of these producers are discussed in more detail with the example model from Section 2.1.

3.1 Discovery

Three methods are discussed:
1. direct estimation of the significance using asymptotic properties of the likelihood ratio,
2. significance estimation via the tail of the test statistic distribution,
3. the Bayes factor.

The first two methods are frequentist methods. They both yield a p value or a Z value (as in “a significance of Z sigma”) if applied to a dataset. While the first relies on properties which hold for a large number of events, the second one is more general but also much more time-consuming to apply, as a large number of pseudo experiments is required for reliable results. The third method is the Bayesian counterpart to frequentist hypothesis tests. The Bayes factor is the ratio of posterior probabilities of the null hypothesis and an alternative hypothesis. Thus, it expresses the relative credibility of the null hypothesis: a small Bayes factor would lead to the assumption that the null hypothesis is incorrect.

3.1.1 Direct Estimation of Z

The model is formulated such that the null hypothesis corresponds to a subset B of parameter values p⃗. In this example, B consists of those parameter values with µs = 0. The ratio of maximized likelihoods λ(d) is defined as

λ(d) = ln [ max_{p⃗} L(p⃗|d) / max_{p⃗∈B} L(p⃗|d) ]   (8)

7 . The median estimated Z value is 3. The resulting Zest distribution can be seen in Figure 2. the model without signal. one million pseudo experiments are thrown 2 For simplicity. The p value for a certain dataset dˆ is ˆ defined as the probability of observing a value of λ(d) as least as large as λ(d). if the data is drawn from the model p ∈ B. λ is asymptotically distributed according to a χ2 distribution. the estimated Z value for a given value of λ is √ (9) Zest = 2λ. According to Wilks’ Theorem. this is done using the likelihood ratio producer. the maximum is taken only over the subset B which corresponds to the null hypothesis. In theta. i.02 which can be quoted as expected sensitivity. 3. Calculating the p value thus requires the knowledge of the distribution λ(d) for the null hypothesis.000 pseudo data distributions d according to the model and calculating Zest (d) each time.2 In HEP.e. the background-only model variant of the example model is defined and by choosing a flat distribution for µs . 9 for the null hypothesis numerically. (5). theta is used to generate the Zest distribution by throwing 100. The Z value for a given p value is the number of standard deviations in the sense that Z ∞ G(x)dx = p Z where G(x) is the normal distribution around zero with width one. λ(d) is a real value indicating the compatibility with the null hypothesis: a large value disfavours the null hypothesis. as is the case for the example model. it is assumed that B fixes only one parameter. for µs = 0. it is common to cite the Z value instead of the p value if reporting the outcome of a hypothesis test. the p value of a measurement Zˆ is Z ∞ p= Zest (Z)dZ (10) ˆ Z where Zest (Z) is the distribution of Zest values for the null hypothesis. For the first case. one can determine the distribution of estimated Zest values as defined in eq.2 Z via Test Statistic Distribution Alternatively.1. In order to give an expected p value in case signal is present. the maximum is taken over all allowed parameter values p~ while in the denominator.0 to generate the Zest distribution for the null hypothesis and for µs = 10 to determine the median value of the Zest distribution. In the case considered here. eq.where L is the likelihood function as defined in eq. the signal-plus-background variant is defined.. It minimizes the likelihood function for two model variants which only differ in the definition of the prior distribution D(~ p). theta is run twice. By choosing a δ-distribution fixing µs to zero. From this distribution. (10) is evaluated setting Zˆ to the median Zest value for the signal-plus-background case. In the nominator. assuming the null hypothesis is true.

[Figure 2: Distribution of estimated Z values for the example model with µs = 10, using Zest from eq. (9). The error bars are approximate uncertainties from the finite number of pseudo experiments.]

theta is run twice: with µs = 0.0 to generate the Zest distribution for the null hypothesis, and with µs = 10 to determine the median value of the signal-plus-background Zest distribution. For the first case, one million pseudo experiments are thrown, yielding the distribution shown in Figure 3; for this run, theta took about 10 minutes using one core of a recent CPU. The second run was already carried out for the previous section and yielded a median value Ẑ = 3.02. The median expected p value is given by the fraction of the background-only distribution above Ẑ (the filled area in Figure 3) and corresponds to a Z value of 3.06, which is the expected sensitivity as usually quoted.

The advantage of this method compared to the direct significance estimate discussed in the previous section is that it gives the correct answer even if the assumption of large event numbers used for Zest in eq. (9) is not fulfilled. More importantly, this method is more general than using Zest and allows the straight-forward inclusion of systematic uncertainties via the prior-predictive method, which will be discussed in Section 4. The main disadvantage is that it is much more CPU-intensive, as it requires a large number of pseudo experiments in order to accurately describe the tail of the test statistic distribution.

3.1.3 Bayes Factor

The Bayes factor is the ratio of posterior probabilities

B01(d) = p(H0|d) / p(H1|d)   (11)

where p(H0|d) is the posterior probability (degree of belief) of the null hypothesis and p(H1|d) is the posterior probability of the alternative hypothesis.

[Figure 3: Distribution of Zest for the background-only and signal-plus-background variants of the example model. The y-axis is the number of pseudo experiments per bin; the error bars indicate the approximate statistical errors from the finite number of pseudo experiments. The median of the S+B Zest distribution is marked with a vertical line and defines the integration region (NPE · p̂) of the B-only Zest distribution used to calculate the median expected p value.]

The posterior probabilities of the above equation are given via Bayes’ Theorem:

p(Hi|d) = p(d|Hi) π(Hi) / π(d)

where the π are the prior probabilities and p(d|Hi) is the probability of observing data d, given hypothesis Hi. π(d) is a normalization factor which does not have to be determined explicitly, as it cancels in the ratio (11). In the case considered here, the null hypothesis H0 is µs = 0 and the alternative H1 is given by µs = 10.³

In order to determine the posterior for H0 or H1, the probability is integrated over the nuisance parameters, in this case only µb:

p(d|H0) = ∫ p(d|µs = 0, µb) π(µb) dµb   (12)

and similarly for H1, where µs = 10 is used on the right hand side. As prior π(µb) for µb, a flat prior is used. While the integral in eq. (12) is only one-dimensional, it is not uncommon to have of the order of 10 nuisance parameters. Therefore, the numerical evaluation of the integral is done with the Metropolis-Hastings Markov-Chain Monte-Carlo method [5], which has good performance in high-dimensional parameter spaces.

theta was used to generate the negative logarithm of the posteriors for the two model variants (signal-plus-background and background-only) using Markov Chains. Figure 4 shows the distribution of Bayes factors calculated for 10,000 pseudo datasets sampled according to the signal-plus-background hypothesis. The smaller B01, the smaller the belief that the null hypothesis is correct. The expected (median) Bayes factor is 0.010.

³ If the alternative hypothesis H1 does not specify a concrete value for µs but a whole parameter range, such as µs > 0, the minimum Bayes factor obtained by scanning over the allowed values of µs can be cited as result.
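Since eq. (12) is a low-dimensional integral with a flat prior, it can also be estimated by plain Monte-Carlo averaging of the likelihood over prior samples, as the following sketch (not theta code) shows; the Markov-Chain approach of the text is preferable in higher dimensions.

#include <functional>
#include <random>

// p(d|H) = integral of L(mu_s, mu_b | d) * pi(mu_b) d mu_b, cf. eq. (12),
// estimated by averaging the likelihood over samples from the flat prior
double marginal_likelihood(std::function<double(double, double)> likelihood,
                           double mu_s, double mu_b_min, double mu_b_max,
                           unsigned n_samples = 100000) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> prior(mu_b_min, mu_b_max);
    double sum = 0.0;
    for (unsigned i = 0; i < n_samples; ++i)
        sum += likelihood(mu_s, prior(rng));
    return sum / n_samples;
}

// Bayes factor of eq. (11) for H0: mu_s = 0 versus H1: mu_s = 10:
// B01 = marginal_likelihood(L, 0.0, ...) / marginal_likelihood(L, 10.0, ...)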

[Figure 4: Distribution of Bayes factors B01 in case H1 is true, setting µb = 20 and using 10,000 pseudo experiments. The y errors are approximate statistical errors from the finite number of pseudo experiments; the error bars in x direction indicate the bin borders.]

3.2 Measurement

The term measurement in this context is used as a synonym for point estimation and interval estimation in the statistics literature. Three methods are discussed:
1. the profile likelihood method,
2. the Neyman construction for central intervals, using as test statistic either (i) the likelihood ratio or (ii) the maximum likelihood estimate, and
3. as Bayesian method, the (marginal) posterior of the signal cross section.

3.2.1 Profile Likelihood Method

The profile likelihood function Lp in the parameter µs for the example model is defined as

Lp(µs|d) = max_{µb} L(µs, µb|d)

where L is the likelihood function from eq. (5). In general, the maximum on the right hand side is taken over all nuisance parameters. The likelihood ratio λ defined in eq. (8) can then be written as

λ(d) = ln [ Lp(µ̂s|d) / Lp(µs = 0|d) ]

where µ̂s is defined as the value of µs which maximizes Lp.

In general, any hypothesis test can be used to construct confidence intervals: the confidence interval with confidence level 1 − α for µs consists of those values of µs which are not rejected by a hypothesis test at level α. Applying this principle to the hypothesis test discussed in Section 3.1.1 yields the following construction for a confidence interval at level l for µs:
1. construct Lp(µs|d) by minimizing with respect to µb,
2. find the value µ̂s which maximizes Lp(µs|d),
3. include in the interval all values of µs for which ln Lp(µ̂s) − ln Lp(µs) < ∆(l).

The value ∆(l) in the last step is found by applying Wilks’ Theorem to the likelihood ratio test statistic (cf. Section 3.1.1):

∆(l) = (erf⁻¹(l))²

where erf is the error function; for l = 0.683, this gives ∆ ≈ 0.5.
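A sketch of steps 2 and 3 of this construction (not theta code); the profiling over µb of step 1 is assumed to be available as a function returning −ln Lp(µs|d), and the scan grid is an arbitrary choice:

#include <functional>

struct Interval { double lower, upper, best; };

// steps 2 and 3: scan -ln Lp(mu_s | d) on a grid, locate the minimum, and
// collect all mu_s with ln Lp(mu_s_hat) - ln Lp(mu_s) < delta,
// where delta = erfinv(l)^2, i.e. delta ~ 0.5 for l = 0.683
Interval profile_interval(std::function<double(double)> neg_log_lp,
                          double mu_min, double mu_max, double delta,
                          int n_scan = 1000) {
    const double step = (mu_max - mu_min) / n_scan;
    // step 2: find mu_s_hat, the minimum of -ln Lp
    double best_mu = mu_min, best_nll = neg_log_lp(mu_min);
    for (int i = 1; i <= n_scan; ++i) {
        const double mu = mu_min + i * step, v = neg_log_lp(mu);
        if (v < best_nll) { best_nll = v; best_mu = mu; }
    }
    // step 3: include all grid points within delta of the minimum
    double lo = best_mu, hi = best_mu;
    for (int i = 0; i <= n_scan; ++i) {
        const double mu = mu_min + i * step;
        if (neg_log_lp(mu) - best_nll < delta) {
            if (mu < lo) lo = mu;
            if (mu > hi) hi = mu;
        }
    }
    Interval result = { lo, hi, best_mu };
    return result;
}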

For the example model, theta was run producing pseudo data distributed according to the prediction of the example model, setting µs = 10 and µb = 20, and determining the lower and upper interval borders for each generated pseudo data distribution. The median value of the estimated µs and of the lower and upper one-sigma interval borders (where the medians are taken separately over all pseudo experiments) are

µ̂s = 9.8^{+4.7}_{−3.7}.   (13)

The coverage of the intervals was found to be 68%, as desired, also when varying both µs and µb. The bias in the central value is due to the smallness of µs and is a well-known property of the maximum likelihood method. As a cross-check, it was confirmed that this bias vanishes if scaling µs and µb simultaneously by a factor 100.

3.2.2 Neyman Construction

Using the Neyman construction to construct central intervals, one has to determine the test statistic distribution as a function of the value of the parameter of interest, µs. Then, for each fixed µs, the central 68% (95%) of the test statistic values are included in a confidence belt. Given a measurement which yields a test statistic value T̂, the cited interval consists of those values of µs contained in the belt at this test statistic value: the interval construction starts at the observed test statistic value T̂ on the y-axis, and the central value and 68% C.L. interval are read from the intersection points of a horizontal line at T̂ with the 68% belt.

Likelihood Ratio as Test Statistic. As test statistic in the first case, Zest from eq. (9) is used. The 68% and 95% confidence belts are shown in Figure 5. This Figure was constructed by throwing 200,000 pseudo experiments where for each pseudo experiment, µs is drawn randomly from a flat distribution between 0.0 and 30.0. After throwing the pseudo experiments, the µs range is divided into 30 bins, and for each bin, the quantiles (0.025, 0.16, 0.84, 0.975) of the test statistic distribution are determined. These quantiles define the belts as explained above. For low values of µs, a considerable fraction of pseudo experiments yields Zest = 0, and the band contains more than 68% or 95% of the Zest values. For µs = 10, the median test statistic value is T̂ = 3.02 and the expected confidence interval in this case is µ̂s = 10.0^{+4.6}_{−3.9}.

Maximum Likelihood Estimate as Test Statistic. The maximum likelihood estimate µ̂s is defined as the value of µs which maximizes the likelihood function L(µs, µb|d). This estimate is used as test statistic for the Neyman construction. The confidence belts are shown in Figure 6. The interval in the median case was found to be µ̂s = 10.0^{+4.6}_{−3.9}, i.e., the same expected interval as when using Zest as test statistic.
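The belt construction thus reduces to per-bin quantiles of the test statistic; a sketch (not theta code), assuming the test statistic values of the pseudo experiments falling into one µs bin have already been collected:

#include <algorithm>
#include <vector>

struct Belt { double q025, q16, q84, q975; };

// quantiles (0.025, 0.16, 0.84, 0.975) of the test statistic values of one
// mu_s bin; these define the 95% and 68% central confidence belts
Belt make_belt(std::vector<double> ts_values) {
    std::sort(ts_values.begin(), ts_values.end());
    const std::size_t n = ts_values.size();
    // simple nearest-index quantile, adequate for large pseudo-experiment samples
    struct Q { static double at(const std::vector<double>& v, std::size_t n, double q) {
        return v[static_cast<std::size_t>(q * (n - 1))]; } };
    Belt b = { Q::at(ts_values, n, 0.025), Q::at(ts_values, n, 0.16),
               Q::at(ts_values, n, 0.84),  Q::at(ts_values, n, 0.975) };
    return b;
}

// The cited 68% interval for an observed value T_hat then consists of those
// mu_s bins whose [q16, q84] range contains T_hat.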

[Figure 5: 68% and 95% confidence level confidence belts of a central interval Neyman construction for the example model using Zest as test statistic (which is equivalent to using a likelihood ratio). The bands indicate the central 68% and 95% of the test statistic distribution; the median Zest is also shown.]

[Figure 6: 68% and 95% confidence level confidence belts of a central interval Neyman construction for the example model using the maximum likelihood estimate for µs as test statistic.]

3.2.3 Bayesian Inference: Posterior

In Bayesian statistics, given data d, the posterior in all model parameters (µs, µb) is given by

p(µs, µb|d) = p(d|µs, µb) π(µs, µb) / π(d)   (14)

where the π are the priors. The denominator on the right hand side, π(d), is a normalization factor which does not have to be determined explicitly. While the posterior in all model parameters can be considered the final result of a measurement, in order to make statements about the parameter of interest, µs, only the marginal distribution of µs of the full posterior is considered. It is given by

p(µs|d) = ∫ p(µs, µb|d) dµb.

This marginal posterior is extracted by a Markov-Chain Monte-Carlo method which creates a Markov Chain of parameter values (µs, µb) distributed according to the posterior (14).

As dataset, the template of the model prediction is used directly, i.e., without throwing random pseudo data. The posterior was determined using a chain length of one million and flat priors in µs and µb; the run time for this Markov Chain was about 6 seconds. The posterior for µs is shown in Figure 7.

While the posterior in µs, p(µs|d), can be considered the final result, some derived quantities can be used to summarize the posterior. In this case, the most probable value and the 68% credible level central interval are considered. The most probable value can be determined robustly by fitting a normal distribution to the peak region of the posterior and taking the mean of the fitted distribution as estimated value. The central 68% credible level interval is illustrated in Figure 7. Using these values, the expected result can be summarized as

µ̂s = 10.1^{+5.0}_{−3.6}.

[Figure 7: Posterior in µs for the example model using µs = 10 and a flat prior in µs. The values for µs from the chain were binned in 60 bins from 0 to 30; the y-axis is the number of chain elements in the bin, which is proportional to the posterior p(µs|d). The central 68% credible interval is indicated.]
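Extracting the marginal posterior from the chain amounts to histogramming its µs component; a sketch (not theta code):

#include <cstddef>
#include <utility>
#include <vector>

// histogram the mu_s values of a Markov chain of (mu_s, mu_b) points
// distributed according to the posterior (14); the bin contents are then
// proportional to the marginal posterior p(mu_s | d)
std::vector<double> marginal_posterior(
        const std::vector<std::pair<double, double> >& chain,
        int nbins, double mu_min, double mu_max) {
    std::vector<double> hist(nbins, 0.0);
    for (std::size_t i = 0; i < chain.size(); ++i) {
        const double mu_s = chain[i].first;
        if (mu_s < mu_min || mu_s >= mu_max) continue;
        const int bin = static_cast<int>((mu_s - mu_min) / (mu_max - mu_min) * nbins);
        hist[bin] += 1.0;
    }
    return hist;  // e.g. 60 bins from 0 to 30, as used for Figure 7
}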

3.3 Exclusion

Exclusion can be handled by the methods discussed in Section 3.2, with some minor modifications:
• The upper end of a 90% C.L. interval from the profile likelihood method can be cited as the 95% C.L. upper limit. Note, however, that the profile likelihood method might perform poorly near the parameter boundary µs → 0, and this is typically the case if you want to calculate an upper limit. In this case, coverage tests should be carried out to check the validity of this method.
• The Neyman construction remains valid; instead of taking the central 68% of the test statistic distribution as the confidence belt, one would include the upper 95%.
• The Bayesian result is the marginal posterior in µs, from which the 95% quantile can be easily derived.

The only method discussed here in more detail is the CLs method [6], which can be used to construct upper limits. Unlike other methods, it has the property that a downward fluctuation of the background does not lead to more stringent upper limits. This is seen as a desired property by many physicists, as otherwise a poor background model which systematically overestimates the background level would yield a better upper limit than a realistic background model.

The CLs value for a certain signal s and background b is defined as

CLs = (1 − ps+b) / (1 − pb)   (15)

where pi is the upper tail of the test statistic distribution for model i, i.e., the probability of observing a test statistic value at least as signal-like as the one observed, assuming model i is true. This definition of CLs is depicted in Figure 8. The 95% upper limit is given by the amount of signal for which CLs as defined in eq. (15) is 0.05.

For this example, Zest is used as test statistic, and the p values for eq. (15) are given by eq. (10). The pseudo data sample used is the same as in the Neyman construction using Zest in Section 3.2.2. As was done there, 30 bins in µs are created, and the test statistic distribution in each bin is used to calculate the p values in eq. (15). The expected 95% C.L. upper limit is calculated twice: for µs = 0 and for µs = 10. For these two cases, the median of the Zest distribution is 0.00 and 3.02, respectively.

[Figure 8: Illustration of the definition of the CLs value. With the test statistic (TS) distributions for the background-only and the signal-plus-background hypotheses, the CLs value for a given measurement with test statistic value T̂ is defined as (1 − ps+b)/(1 − pb).]

Upper limits are then determined by calculating CLs for all bins in µs, interpolating this dependence linearly, and finding the value of µs for which CLs is 0.05. This yields an expected upper limit of 7.1 in case of no signal.
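A sketch (not theta code) of eq. (15) and of the limit finding by linear interpolation described above:

#include <cstddef>
#include <vector>

// eq. (15): CLs from the two upper-tail probabilities
double cls(double p_splusb, double p_b) {
    return (1.0 - p_splusb) / (1.0 - p_b);
}

// scan the per-bin CLs values and linearly interpolate to the mu_s value
// where CLs crosses the target (0.05 for a 95% C.L. upper limit)
double upper_limit(const std::vector<double>& mu_s_values,
                   const std::vector<double>& cls_values, double target = 0.05) {
    for (std::size_t i = 1; i < mu_s_values.size(); ++i) {
        const double c0 = cls_values[i - 1], c1 = cls_values[i];
        if ((c0 - target) * (c1 - target) <= 0.0 && c0 != c1) {
            const double f = (c0 - target) / (c0 - c1);
            return mu_s_values[i - 1] + f * (mu_s_values[i] - mu_s_values[i - 1]);
        }
    }
    return mu_s_values.back();  // no crossing found in the scanned range
}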

4 Including Systematic Uncertainties

Systematic uncertainties can be included in the model by introducing additional (nuisance) parameters q⃗ into the model which parametrize the effect of the systematic uncertainty on the predicted templates. Within theta, these nuisance parameters are treated no differently than the other model parameters p⃗; they are given another name here only in order to unambiguously refer to these parameters in the following discussion.

In a simple case, the uncertainty to consider is merely an uncertainty on the rate of a certain process; how such a case is included in theta is discussed first. On the other hand, an uncertainty might affect the whole template, i.e., the template depends on q⃗. This dependence affects not only the rate but also the shape of the template; this case is discussed in the second subsection.

Once the uncertainty is included in the model, there are different possibilities to account for it in the methods previously presented. For methods which use numerical distributions of the test statistic (such as the Neyman construction or the CLs method), the pseudo data used to calculate the test statistic distribution is generated including the systematic uncertainties by choosing random values for q⃗ before drawing a pseudo data distribution from the model and calculating the value of the test statistic. For this, assumptions about the distribution of q⃗ in the form of a prior for q⃗ have to be made. This method is known as the prior-predictive method.⁴ In a Bayesian method, the most natural way would be to include priors for the parameters q⃗ and proceed as before, i.e., integrate with respect to q⃗ in order to construct the posterior for the parameter of interest or the Bayes factor.

The definition of the test statistic is often based on maximizing the likelihood function. The model used to define this likelihood function can be either the model which includes q⃗ (including priors for q⃗), or a model which fixes the values of q⃗ to the most probable ones, i.e., the model used before including the systematic uncertainties. The latter approach is often more robust in practice, as the number of parameters to vary during the minimization of the negative log-likelihood is smaller. On the other hand, including the dependence on some nuisance parameters in the definition of the test statistic can improve the results, as the values of the nuisance parameters can be determined from data simultaneously with the parameter of interest. This is especially true for systematic uncertainties which have a large impact on the background in signal-free regions: in this case, minimization of the negative log-likelihood effectively measures the parameter value qj for this uncertainty, which improves the background prediction in the signal region. However, including a nuisance parameter qj in the model for the likelihood definition which has little impact on the model prediction, and which thus can hardly be measured from data, might even worsen the results. Thus, a general advice about whether or not to include the uncertainties in the model definition used for the likelihood function is to start with a model without these uncertainties and to include the uncertainties one by one, starting with those which have the largest impact on the result; these uncertainties would only be kept if they improve the expected result.

⁴ In an alternative method, the posterior-predictive method, posteriors for q⃗ are derived first in an auxiliary measurement. These posteriors are then used to throw random values for the test statistic generation as discussed. This will not be discussed here.
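A sketch (not theta code) of one prior-predictive pseudo experiment as described above; the prior sampler and the model prediction are passed in as callables:

#include <functional>
#include <random>
#include <vector>

// one prior-predictive pseudo experiment: draw the nuisance parameters q
// from their prior, build the model prediction for these values, and then
// draw a Poisson count for each bin
std::vector<double> pseudo_data(
        std::function<std::vector<double>()> sample_prior,
        std::function<std::vector<double>(const std::vector<double>&)> predict,
        std::mt19937& rng) {
    const std::vector<double> q = sample_prior();   // random nuisance values
    const std::vector<double> mean = predict(q);    // template m(q), cf. eq. (2)
    std::vector<double> data(mean.size());
    for (std::size_t i = 0; i < mean.size(); ++i) {
        std::poisson_distribution<int> pois(mean[i]);
        data[i] = pois(rng);
    }
    return data;
}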

4.1 Rate Uncertainties

Consider an uncertainty in the example model on the relative normalization of the background templates in the different channels, i.e., on the relative normalization of tsb,1 and ts,1 (see eqs. (6) and (7)). Assume that by external knowledge (e.g., an auxiliary measurement), τ, which is the ratio of the background expectation in the sideband and the signal region, is known to be τ = 1 ± 0.2.

To include this uncertainty in the model, an additional parameter τ is introduced. Then, the model equations (6) and (7) become

msb(µs, µb) = τ µb · tsb,1
ms(µs, µb) = µb · ts,1 + µs · ts,2

This uncertainty can be considered in theta by defining D(p⃗) to include a normal distribution for τ (see eq. (5)). The absence of any uncertainty corresponds to fixing τ = 1.

As example, the Neyman construction using the maximum likelihood estimate µ̂s as test statistic from Section 3.2.2 is re-evaluated by generating pseudo data where for each pseudo experiment, τ is chosen randomly from a normal distribution around 1 with width 0.2. The model used to calculate the maximum likelihood test statistic fixes τ to 1. The resulting confidence belt using 200,000 pseudo experiments is shown in Figure 9. The expected result (for the median value of the test statistic) is µ̂s = 10.0^{+5.0}_{−4.2}, and the interval is larger than without this uncertainty.

4.2 Template Uncertainties

In general, a mere rate uncertainty does not suffice to describe the effect of a systematic uncertainty on the expected distribution in an observable. Rather, the whole template is affected. An example would be a bias in the energy measurement which shifts the signal peak of the example model (see Figure 10); the templates shown there represent the estimated one-sigma deviations.

In order to include this effect in the model, a parameter δ is introduced which is used to interpolate between the original template without uncertainties and the templates which include the uncertainty, ts,2,+ and ts,2,−. This is written as a template function ts,2(δ) such that

ts,2(0) = ts,2,   ts,2(1) = ts,2,+,   ts,2(−1) = ts,2,−.

As the templates affected by uncertainties represent a one-sigma deviation, a reasonable prior distribution for δ is a normal distribution with mean 0 and width 1. Additional desirable properties of a template interpolation are:
• for all values of δ, all bins should have positive values;
• the bin values as function of δ should be continuously differentiable.

[Figure 9: 68% and 95% confidence level confidence belts of a central interval Neyman construction for the example model. Pseudo data was generated including a 20% relative uncertainty on the relative background expectation in the sideband and signal region. The test statistic is the maximum likelihood estimate of µs using a model without the uncertainty on τ.]

[Figure 10: Signal templates for the example model (arbitrary units over the “mass” observable). ts,2 is the original signal template unaffected by uncertainties (cf. Figure 1); ts,2,± are the signal templates affected by an energy-scale-like uncertainty which affects both the shape and the normalization of the template.]

There are many different possibilities for a template interpolation with these properties. The one chosen here is

ts,2(δ) = ts,2 · (ts,2,sign(δ) / ts,2)^{|δ|}

where the equation holds for each individual bin of the templates involved. The first property ensures that the model always remains meaningful and that the evaluation of the negative log-likelihood always yields finite values. The second property is important for stable numerical minimization of the negative log-likelihood, as many algorithms assume a continuous first derivative.

For the Neyman construction, pseudo data is generated by choosing a random value for δ from a Gaussian around zero with width one; then, Poisson data is generated from the model prediction for this value of δ. The maximum likelihood estimate used as test statistic was calculated using the same model, i.e., including the template interpolation with δ as parameter and the Gaussian prior for δ. The belts calculated from 200,000 pseudo experiments are shown in Figure 11. The expected (median) result of the interval estimation in this case is µ̂s = 10.0^{+4.9}_{−3.9}. The runtime of theta was about 5 minutes.

[Figure 11: 68% and 95% confidence level confidence belts of a central interval Neyman construction for the example model. Pseudo data was generated including the template uncertainty as depicted in Figure 10.]
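Applied bin by bin, the chosen interpolation is only a few lines; a sketch (not theta code):

#include <cmath>
#include <cstddef>
#include <vector>

// template interpolation t(delta) = t0 * (t_sign(delta) / t0)^|delta|,
// evaluated bin by bin; t_plus and t_minus are the one-sigma shifted
// templates. Strictly positive bins keep the result positive for any delta.
std::vector<double> interpolate(const std::vector<double>& t0,
                                const std::vector<double>& t_plus,
                                const std::vector<double>& t_minus,
                                double delta) {
    const std::vector<double>& t_shift = delta >= 0.0 ? t_plus : t_minus;
    std::vector<double> result(t0.size());
    for (std::size_t i = 0; i < t0.size(); ++i)
        result[i] = t0[i] * std::pow(t_shift[i] / t0[i], std::fabs(delta));
    return result;
}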

This can mean throwing random Poisson data according to a model or always passing 21 . Other classes are merely abstract classes in theta. Which plugins to use and the configuration parameters for these plugins are specified in a configuration file. theta consists of a few core classes which implement central concepts of templatebased modeling. respectively.Figure 12: Overview of the collaboration of some important classes in theta.k (~ p) and templates ti. concrete classes are provided by plugins. 5 theta Framework Figure 12 gives an overview over the architecture of theta. For each pseudo experiment. Data is produced by a DataSource. The theta main program reads this configuration file and creates the requested plugin instances. data and the negative log-likelihood function (NLLikelihood). generation of random parameter values for pseudo data construction as well as term for the likelihood function. it was specifically designed to enable users to write their own plugins for their needs. Rectangles are core classes of theta while the ellipses represent abstract classes which are implemented as plugins. The distribution D(~ p) of a model is represented by a Distribution instance. such as the model. This includes setting up the model which contains a number of Functions and HistogramFunctions which represent the coefficients ci. It can be used for both. The collaboration of plugin classes depicted here is only an example: theta imposes no restriction on the architexture of plugins. As depicted there.k (~ p) in equation (2). While theta includes some common plugins.

For each pseudo experiment, Data is produced by a DataSource. This can mean throwing random Poisson data according to a model, or always passing the same Data which was read from a ROOT file. The Data is passed to each configured Producer, which has access to the Model and the Data, calculates some quantities, and writes them to a Database. The Producer typically constructs the negative log-likelihood function for the Model and the Data (NLLikelihood). A Database contained in theta writes the products in a SQL table to a sqlite3 file.

This core system becomes useful only through plugins. theta provides plugins for many common use cases, such as the access to ROOT histograms, writing the results to a sqlite3 database, and many more. If there is no suitable plugin for a particular use case, defining an own plugin is straight-forward and is merely a matter of deriving from the appropriate base class and implementing its purely virtual methods. The only external dependency of the theta core system is Boost [1], a general-purpose C++ library.

5.1 Combination with external Likelihood Functions

Combination of different channels is possible by configuring a theta model which includes all channels as different observables. A combination with an external analysis on the likelihood level is possible by exploiting the plugin system and writing a Function plugin which calculates the external negative log-likelihood. This Function can then be used as part of the prior D(p⃗) (cf. eq. (5)) of the model. This allows to either
• use methods implemented in theta on the external likelihood function, or
• combine the external likelihood function with the likelihood function calculated internally in theta.

5.2 Markov Chains in theta

The Metropolis-Hastings Markov-Chain Monte-Carlo algorithm produces a Markov chain whose elements are distributed according to a probability density f(x). Given a point xi within the sequence, the next point is found by choosing randomly a proposed point x̂ in the neighborhood of xi. If the probability density at the proposed point is larger than the current one, i.e., if f(x̂) > f(xi), the proposal is accepted and the next point in the chain is x̂. Otherwise, x̂ is only accepted with probability f(x̂)/f(xi). If the proposal is not accepted, the next point in the sequence is xi (again).

One crucial point when applying the algorithm is the typical step size of the proposal step. If the step size is too large, the proposed points are too far from the current point xi, the posterior at the proposed point is very small, and the proposal is rejected too often. On the other hand, if the step size is very small, the acceptance rate will be high, but the chain elements behave like a particle in a diffusion process, and the covered volume in parameter space is only proportional to √N for a chain of length N.

In theta, the jump kernel is a multivariate normal distribution. Its covariance is determined iteratively from the posterior of average data, i.e., using the average prediction of the model without any randomization. The found width is further scaled to maximize the diffusion speed according to [3] by a factor 2.38/√n, where n is the number of dimensions, i.e., the number of non-fixed parameters.
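A minimal sketch of one Metropolis-Hastings step as described above (this is not theta's implementation, which uses the full covariance determined from average data); for brevity, the jump kernel here is a diagonal normal distribution scaled by 2.38/√n:

#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// one Metropolis step: propose x_hat ~ N(x, scale^2 * diag(width^2)) and
// accept with probability min(1, f(x_hat)/f(x)); log_f returns ln f(x)
std::vector<double> metropolis_step(
        const std::vector<double>& x,
        std::function<double(const std::vector<double>&)> log_f,
        const std::vector<double>& width,
        std::mt19937& rng) {
    const double scale = 2.38 / std::sqrt(static_cast<double>(x.size()));
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> proposal(x);
    for (std::size_t i = 0; i < proposal.size(); ++i)
        proposal[i] += scale * width[i] * gauss(rng);
    // always accept if f increases; otherwise accept with prob. f(x_hat)/f(x)
    if (std::log(uniform(rng)) < log_f(proposal) - log_f(x))
        return proposal;
    return x;  // rejected: the chain stays at x
}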

The covariance is determined only once per theta run and used for all pseudo experiments. This provides a considerable speedup compared to methods which determine the jump kernel width for each pseudo experiment.

5.3 Testing

Testing is done by evaluating the producers for simple models for which the result is known analytically or from HEP statistics papers. This includes the likelihood ratio producer for counting experiments and the profile likelihood interval producer on counting experiments, including tests against the values given in [2] for the on/off problem and the Gaussian-mean problem. Also, Bayesian credible intervals for a counting experiment have been checked against analytical results.

Appendix

Two annotated configuration files are given below. The first, exmodel.cfg, defines the example model. The second configuration file, deltanll.cfg, was used to produce the values of the minimized negative log-likelihood functions for eq. (8) used in Section 3.1.1, as well as for the signal-plus-background distribution of the Zest test statistic in Section 3.1.2. The templates in the signal region from eq. (7), i.e., ts,1 and ts,2, are saved in a ROOT file templates.root and called "bkg" and "signal", respectively.

Plugins used in this example are:
• fixed_poly, a HistogramFunction which does not depend on any parameters and is defined by a polynomial;
• root_histogram, a HistogramFunction which does not depend on any model parameters and returns a histogram from a ROOT file;
• mult, a Function which multiplies all parameters in a list;
• flat_distribution, a Distribution which is flat on a given interval;
• delta_distribution, a Distribution to fix a parameter to a given value;
• product_distribution, a Distribution defined as the product of other Distributions;
• model_source, a DataSource which throws random Poisson data according to a given model;
• deltanll_hypotest, a Producer which produces the values of the negative log-likelihoods appearing on the right hand side of equation (8);
• sqlite_database, a Database which writes all data to a single sqlite3 database file.

exmodel.cfg:

// the names of all model parameters
parameters = ("mu_s", "mu_b");

// the observables of the model
observables = {
    // mass variable in the signal region has 100 bins on
    // the range (0, 500):
    mass = {
        range = (0.0, 500.0);
        nbins = 100;
    };
    // counting-only variable for the sideband, one bin
    // on an arbitrary range:
    sb = {
        range = (0.0, 1.0);
        nbins = 1;
    };
};

// flat template, normalized to tau, for the sideband counting part:
t_sb_1 = {
    type = "fixed_poly";
    observable = "sb";
    coefficients = [1.0];
    normalize_to = "@tau";
};
tau = 10.0;

// templates from a root file for the others:
t_s_1 = {
    type = "root_histogram";
    filename = "templates.root";
    histoname = "bkg";
    normalize_to = 1.0;
};
t_s_2 = {
    type = "root_histogram";
    filename = "templates.root";
    histoname = "signal";
    normalize_to = 1.0;
};

// within the model, the prediction is given for each observable by listing
// the terms of the sum in eq. (2). For each term, a "coefficient-function"
// and a "histogram" is specified which correspond to c_ik and t_ik in eq. (2).
example_model = {
    sb = {
        background = {
            coefficient-function = { type = "mult"; parameters = ("mu_b"); };
            histogram = "@t_sb_1";
        };
    };
    mass = {
        background = {
            coefficient-function = { type = "mult"; parameters = ("mu_b"); };
            histogram = "@t_s_1";
        };
        signal = {
            coefficient-function = { type = "mult"; parameters = ("mu_s"); };
            histogram = "@t_s_2";
        };
    };
    // as distribution for the parameters, the product of two
    // flat distributions is used which are defined below.
    parameter-distribution = {
        type = "product_distribution";
        distributions = ("@mu_b-flat", "@mu_s-flat");
    };
};

// use flat distributions which restrict the mu parameters to positive
// values, but do not add an additional term to the likelihood function.
// If sampling from these distributions, they will always return a fixed
// value given by "fix-sample-value". This value is also used as initial
// guess for likelihood function minimization. As initial step size for
// minimization or Markov Chains, the "width" parameter is used.
mu_s-flat = {
    type = "flat_distribution";
    mu_s = {
        range = (0.0, "inf");
        fix-sample-value = 10.0;
        width = 1.0;
    };
};
mu_b-flat = {
    type = "flat_distribution";
    mu_b = {
        range = (0.0, "inf");
        fix-sample-value = 20.0;
        width = 2.0;
    };
};

deltanll.cfg:

@include "exmodel.cfg"

// hypotest is a producer which produces the two minima of the likelihood
// functions using the same model but different parameter distributions,
// namely the "background-only-distribution" and the
// "signal-plus-background-distribution".
hypotest = {
    type = "deltanll_hypotest";
    name = "hypotest";
    minimizer = { type = "root_minuit"; };
    background-only-distribution = {
        type = "product_distribution";
        distributions = ("@mu_s-zero", "@mu_b-flat");
    };
    // as s+b distribution, the distribution as
    // defined in the model is used:
    signal-plus-background-distribution = "@example_model.parameter-distribution";
};

mu_s-zero = {
    type = "delta_distribution";
    mu_s = 0.0;
};

// main is the setting which glues all together. It will run "n-events"
// pseudo experiments which
// 1. sample pseudo data from the "data_source"
// 2. call all the "producers" on the pseudo data, along with "model"
// 3. save the products produced in 2. in "output_database"
main = {
    data_source = {
        type = "model_source";
        name = "source";
        model = "@example_model";
    };
    model = "@example_model";
    n-events = 100000;
    producers = ("@hypotest");
    output_database = {
        type = "sqlite_database";
        filename = "deltanll_hypo.db";
    };
};

options = {
    plugin_files = ("../../lib/core-plugins.so", "../../lib/root.so");
};

References

[1] boost C++ libraries. http://boost.org/.

[2] R. Cousins, J. Linnemann, and J. Tucker. Evaluation of three methods for calculating statistical significance when incorporating a systematic uncertainty into a test of the background-only hypothesis for a Poisson process. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 595(2):480–501, 2008.

[3] A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis Jumping Rules. Bayesian Statistics, 5:599–607, 1996.

[4] F. James. Statistical Methods in Experimental Physics. 2006.

[5] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[6] A. L. Read. Modified frequentist analysis of search results (the CLs method). CERN-OPEN-2000-205, 2000.