Luca Lista
Statistical
Methods for
Data Analysis in
Particle Physics
Second Edition
Lecture Notes in Physics
Volume 941
Founding Editors
W. Beiglböck
J. Ehlers
K. Hepp
H. Weidenmüller
Editorial Board
M. Bartelmann, Heidelberg, Germany
P. Hänggi, Augsburg, Germany
M. Hjorth-Jensen, Oslo, Norway
R.A.L. Jones, Sheffield, UK
M. Lewenstein, Barcelona, Spain
H. von Löhneysen, Karlsruhe, Germany
A. Rubio, Hamburg, Germany
M. Salmhofer, Heidelberg, Germany
W. Schleich, Ulm, Germany
S. Theisen, Potsdam, Germany
D. Vollhardt, Augsburg, Germany
J.D. Wells, Ann Arbor, USA
G.P. Zank, Huntsville, USA
The Lecture Notes in Physics
The series Lecture Notes in Physics (LNP), founded in 1969, reports new
developments in physics research and teaching, quickly and informally, but with a
high quality and the explicit aim to summarize and communicate current knowledge
in an accessible way. Books published in this series are conceived as bridging
material between advanced graduate textbooks and the forefront of research, and
serve three purposes:
• to be a compact and modern up-to-date source of reference on a well-defined
topic
• to serve as an accessible introduction to the field to postgraduate students and
nonspecialist researchers from related areas
• to be a source of advanced teaching material for specialized seminars, courses
and schools
Both monographs and multi-author volumes will be considered for publication.
Edited volumes should, however, consist of a very limited number of contributions
only. Proceedings will not be considered for LNP.
Volumes published in LNP are disseminated both in print and in electronic
formats, the electronic archive being available at springerlink.com. The series
content is indexed, abstracted and referenced by many abstracting and information
services, bibliographic networks, subscription agencies, library networks, and
consortia.
Proposals should be sent to a member of the Editorial Board, or directly to the
managing editor at Springer:
Christian Caron
Springer Heidelberg
Physics Editorial Department I
Tiergartenstrasse 17
69121 Heidelberg/Germany
christian.caron@springer.com
Luca Lista
INFN Sezione di Napoli
Napoli, Italy
The second edition of this book reflects the work I did in preparation of the lectures
that I was invited to give during the CERN-JINR European School of High-Energy
Physics (15–28 June 2016, Skeikampen, Norway). On that occasion, I reviewed,
expanded, and reordered my material.
In addition, with respect to the first edition, I added a chapter about unfolding,
an extended discussion about the best linear unbiased estimator, and an introduction
to machine learning algorithms, in particular artificial neural networks, with hints
about deep learning and boosted decision trees.
Acknowledgments
I am grateful to Louis Lyons who carefully and patiently read the first edition of my
book and provided useful comments and suggestions. I would like to thank Eilam
Gross for providing useful examples and for reviewing the sections about the look
elsewhere effect. I also received useful comments from Vitaliano Ciulli and from
Luis Isaac Ramos Garcia.
I considered all feedback I received in the preparation of this second edition.
Contents
1 Probability Theory
  1.1 Why Probability Matters to a Physicist
  1.2 The Concept of Probability
  1.3 Repeatable and Non-Repeatable Cases
  1.4 Different Approaches to Probability
  1.5 Classical Probability
  1.6 Generalization to the Continuum
    1.6.1 Bertrand's Paradox
  1.7 Axiomatic Probability Definition
  1.8 Probability Distributions
  1.9 Conditional Probability
  1.10 Independent Events
  1.11 Law of Total Probability
  1.12 Average, Variance and Covariance
  1.13 Transformations of Variables
  1.14 The Bernoulli Process
  1.15 The Binomial Process
  1.16 Multinomial Distribution
  1.17 The Law of Large Numbers
  1.18 Frequentist Definition of Probability
  References
2 Probability Distribution Functions
  2.1 Introduction
  2.2 Definition of Probability Distribution Function
  2.3 Average and Variance in the Continuous Case
  2.4 Mode, Median, Quantiles
  2.5 Cumulative Distribution
  2.6 Continuous Transformations of Variables
  2.7 Uniform Distribution
Index
List of Tables
Table 1.1 Possible values of the sum of two dice rolls with all possible pair combinations and corresponding probability
Table 2.1 Probabilities corresponding to Z one-dimensional intervals and two-dimensional contours for different values of Z
Table 3.1 Assessing evidence with Bayes factors according to the scale proposed in [4]
Table 3.2 Jeffreys' priors corresponding to the parameters of some of the most frequently used PDFs
Table 6.1 Properties of different indicators of a measurement's importance within a BLUE combination
Table 10.1 Significance expressed as Z and corresponding p-value in a number of typical cases
Table 10.2 Upper limits in the presence of negligible background evaluated under the Bayesian approach for different numbers of observed events n
Table 10.3 Upper and lower limits in the presence of negligible background (b = 0) with the Feldman–Cousins approach
• Is a signal due to the Higgs boson present in the data recorded at the Large
Hadron Collider?
(We know that LHC data gave a positive answer!)
• Is there a signal due to dark matter particles in the present experimental data?
(At the moment experiments give no evidence; it is possible to exclude a set of
theory parameters that are incompatible with the present data.)
For the latter two questions, the measured quantity may represent a parameter of
the theory that describes the Higgs boson or dark matter, respectively. A measurement
is performed by applying probability theory in order to model the expected
distribution of the data under all possible assumed scenarios. Then, probability theory
must be used, together with the data, to extract information about the observed
processes and to address the relevant questions, in the sense discussed above.
Many processes in nature have uncertain outcomes, in the sense that their result
cannot be predicted in advance. Probability is a measure of how favored one of the
possible outcomes of such a random process is, compared with any of the other
possible outcomes. A possible outcome of an experiment is also called event.
Most experiments in physics can be repeated under the same, or at least very
similar, conditions; nonetheless, different outcomes are frequently observed at every
repetition. Such experiments are examples of random processes, i.e. processes that
can be repeated, to some extent, within some given boundary and initial conditions,
but whose outcome is uncertain.
Randomness in repeatable processes may arise because of insufficient information
about the intrinsic dynamics, which prevents predicting the outcome, or because of a
lack of sufficient accuracy in reproducing the initial and boundary conditions. Some
processes, like the ones ruled by quantum mechanics, have intrinsic randomness
that leads to different possible outcomes, even if the experiment is repeated within
exactly the same conditions.
The result of an experiment may be used to address questions about natural
phenomena, for instance about the knowledge of the value of an unknown physical
quantity, or the existence, or not, of some new phenomenon. Statements that answer
those questions can be assessed by assigning them a probability. Different definitions
of probability may apply to cases in which statements refer to repeatable
cases, such as the experiments described above, or not, as will be discussed in Sect. 1.4.
Examples of questions about repeatable cases are:
• What is the probability of extracting an ace from a deck of cards?
• What is the probability of winning a specific lottery (or bingo, or any other game
based on random extractions)?
• What is the probability that a pion is incorrectly identified as a muon in a particle
identification detector?
• What is the probability that a fluctuation of the background in an observed
spectrum can produce a signal with a magnitude at least equal to what has been
observed by my experiment?
Note: this question is different from asking: “what is the probability that no real
signal was present, and my observation is due to a background fluctuation?”
This latter question refers to a non-repeatable case, because we cannot have
multiple Universes, some with and some without a new physical phenomenon,
in which to repeat our measurement!
Questions may refer to unknown facts, rather than repeatable events. Examples
of questions about probability for such cases are:
• About future events:
– What is the probability that tomorrow it will rain in Geneva?
– What is the probability that your favorite team will win next championship?
• About past events:
– What is the probability that dinosaurs went extinct because of an asteroid
collision with Earth?
• More generally, about unknown statements:
– What is the probability that present climate changes are mainly due to human
intervention?
– What is the probability that dark matter is made of weakly-interacting massive
particles heavier than 1 TeV?
Probability concepts that answer the above questions, for either repeatable or
non-repeatable cases, are introduced in the next Section.
There are two main different approaches to define the concept of probability, called
frequentist and Bayesian.
• Frequentist probability is defined as the fraction of the number of occurrences
of an event of interest over the total number of events in a repeatable experiment,
in the limit of very large number of experiments. Frequentist probability only
applies to processes that can be repeated over a reasonably long period of time,
but does not apply to unknown statements. For instance, it is meaningful to define
the frequentist probability that a particle is detected by a device (if a large number
of particles cross the device, we can count how many are actually detected),
but there is no frequentist meaning of the probability of a particular result in a
football match, or the probability that the mass of an unknown particle is greater
than, say, 200 GeV.
• Bayesian probability measures one’s degree of belief that a statement is true.
The quantitative definition of Bayesian probability is based on an extension
of Bayes’ theorem, and will be discussed in Sect. 3.2. Bayesian probability
applies wherever the frequentist probability is meaningful, as well as to a
wider variety of cases, such as unknown past, future, or present events.
Bayesian probability also applies to values of unknown quantities. For instance,
after some direct and/or indirect experimental measurements, the probability
that an unknown particle’s mass is greater than 200 GeV may have meaning
in the Bayesian sense. Other examples where Bayesian probability applies, but
frequentist probability is meaningless, are the outcome of a future election,
uncertain features of prehistoric extinct species, and so on.
The following Sect. 1.5 will discuss classical probability theory first, as formulated
since the eighteenth century, and will then discuss frequentist probability. Chapter 3
will be entirely devoted to Bayesian probability.
This approach can be used in practice only for relatively simple problems, since
it assumes that all possible cases under consideration are equally probable, which
may not always be the case in complex situations. Examples of cases where
classical probability can be applied are coin tossing, where the two faces of a coin
are assumed to have an equal probability of 1/2 each, or die rolling, where each of
the six faces of a die¹ has an equal probability of 1/6, and so on.
Starting from simple cases, like coins or dice, more complex models can be
built using combinatorial analysis. In the simplest cases, one may proceed by
enumeration of all the finite number of possible cases, and again the probability
of an event can be evaluated as the number of favorable cases divided by the total
number of possible cases of the combinatorial problem.
Table 1.1 Possible values of the sum of two dice rolls with
all possible pair combinations and corresponding probability
Sum Favorable cases Probability
2 (1, 1) 1/36
3 (1, 2), (2, 1) 1/18
4 (1, 3), (2, 2), (3, 1) 1/12
5 (1, 4), (2, 3), (3, 2), (4, 1) 1/9
6 (1, 5), (2, 4), (3, 3), (4, 2), (5, 1) 5/36
7 (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1) 1/6
8 (2, 6), (3, 5), (4, 4), (5, 3), (6, 2) 5/36
9 (3, 6), (4, 5), (5, 4), (6, 3) 1/9
10 (4, 6), (5, 5), (6, 4) 1/12
11 (5, 6), (6, 5) 1/18
12 (6, 6) 1/36
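As a cross-check, the enumeration of Table 1.1 can be reproduced in a few lines of code. This is an illustrative sketch (not part of the book): it counts favorable cases for each sum over the 36 equally probable ordered pairs, following the classical definition of probability.

```python
from fractions import Fraction
from collections import Counter

# Enumerate all 36 equally probable ordered pairs (d1, d2)
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

# Classical probability: favorable cases divided by total cases
prob = {s: Fraction(n, 36) for s, n in counts.items()}

for s in sorted(prob):
    print(s, prob[s])
```

The output reproduces the last column of Table 1.1, e.g. 1/36 for a sum of 2 and 1/6 for a sum of 7.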
¹ Many role-playing games also use dice with solid shapes different from the cube.
Fig. 1.1 Probability distribution of the sum of two dice rolls, d1 + d2: the probability ranges from 1/36 for the extreme values (2 and 12) up to 1/6 for the central value (7)
The generalization of classical probability, introduced in the previous Sect. 1.5 for
discrete cases, to continuous random variables cannot be done in an unambiguous
way.
² Note that in physics an event is often intended as an elementary event. So, the use of the word event
in a text about both physics and statistics may sometimes lead to some confusion.
A way to extend the concept of equiprobability that was applied to the six
possible outcomes of a die roll to a continuous variable x consists in partitioning
its validity range, say [x1, x2], into intervals I1, …, IN, all having the same width
(x2 − x1)/N, for an arbitrarily large number N. The probability distribution of the
variable x can be considered uniform if the probability that a random extraction of x
falls into each of the N intervals is the same, i.e. it is equal to 1/N.
This definition of uniform probability of x in the interval [x1, x2] clearly changes
under reparametrization, i.e. if the variable x is transformed into y = Y(x), and
the interval [x1, x2] into [y1, y2] = [Y(x1), Y(x2)], assuming for simplicity that Y
is a monotonically increasing function of x. In this case, the transformed intervals
J1, …, JN = Y(I1), …, Y(IN) will not in general all have the same width, unless
the transformation Y is linear. So, if x has a uniform distribution, y = Y(x) in general
does not have a uniform distribution.
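This effect is easy to see in a short simulation (an illustrative sketch, not from the book): drawing x uniformly in [0, 1] and transforming it with the nonlinear monotonic map y = x², equal-width intervals of y no longer carry equal probability.

```python
import random

random.seed(1)
N = 100_000
# x uniform in [0, 1]; transform with the nonlinear monotonic map y = x^2
ys = [random.random() ** 2 for _ in range(N)]

# Fraction of y values falling into two intervals of equal width 0.25
p_low  = sum(1 for y in ys if 0.00 <= y < 0.25) / N   # maps back to x in [0, 0.5)
p_high = sum(1 for y in ys if 0.75 <= y < 1.00) / N   # maps back to x in [0.866..., 1)

print(p_low, p_high)  # roughly 0.5 vs 0.134: y is not uniformly distributed
```

The exact probabilities are 0.5 and 1 − √0.75 ≈ 0.134, as can be read off the inverse transformation.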
³ This apparent paradox is due to the French mathematician Joseph Louis François Bertrand (1822–1900).
Fig. 1.2 Illustration of Bertrand's paradox: three different choices of random extraction of a chord
in a circle lead apparently to probabilities that the chord is longer than the inscribed triangle's side
of 1/2 (left), 1/3 (center) and 1/4 (right), respectively. Red solid lines and blue dashed lines represent
chords longer and shorter than the triangle's side, respectively
2. Let us take, instead, one of the chords starting from the top vertex of the triangle
(Fig. 1.2, center plot) and extract uniformly an angle with respect to the tangent
to the circle passing by that vertex. The chord is longer than the triangle's side
when it intersects the base of the triangle, and it is shorter otherwise. This occurs
in one-third of the cases, since the angles of an equilateral triangle measure π/3
each, and the chords span an angle of π. By uniformly extracting a possible
point on the circumference of the circle, one would conclude that P = 1/3, which
is different from P = 1/2, as was derived in the first case.
3. Let us extract uniformly a point inside the circle and construct the chord passing
by that point perpendicular to the radius that passes through the same point
(Fig. 1.2, right plot). With this extraction, it's possible to conclude that P = 1/4,
since the chords whose central point is contained inside (outside) the circle
inscribed in the triangle have a length longer (shorter) than the triangle's side,
and the ratio of the area of the circle inscribed in the triangle to the area of the
circle that circumscribes the triangle is equal to 1/4. P = 1/4 is different from the values
determined in both cases above.
The paradox is clearly only apparent, because the process of uniform random
extraction of a chord in a circle is not uniquely defined, as already discussed in
Sect. 1.6.
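The three extraction procedures can be mimicked with a Monte Carlo sketch (not from the book; the chord-length formulas for a unit circle are standard geometry results assumed here): a chord at distance d from the center has length 2√(1 − d²), and a chord drawn from a vertex at angle a with the tangent has length 2 sin(a).

```python
import math
import random

random.seed(7)
N = 200_000
L = math.sqrt(3)  # side of the equilateral triangle inscribed in a unit circle

# Method 1 (P = 1/2): chord midpoint at uniform distance d along a fixed radius
m1 = sum(2 * math.sqrt(1 - random.random() ** 2) > L for _ in range(N)) / N

# Method 2 (P = 1/3): uniform angle a in (0, pi) at a vertex; chord length 2 sin(a)
m2 = sum(2 * math.sin(random.uniform(0, math.pi)) > L for _ in range(N)) / N

# Method 3 (P = 1/4): chord midpoint uniform inside the circle (rejection sampling)
def random_r():
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return math.hypot(x, y)

m3 = sum(2 * math.sqrt(1 - random_r() ** 2) > L for _ in range(N)) / N

print(m1, m2, m3)  # close to 0.5, 0.333 and 0.25, respectively
```

All three samplers are perfectly legitimate "uniform" extractions; they simply answer three different questions, which is the content of the paradox.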
From the second Kolmogorov axiom, the probability of the event corresponding
to the set of all possible outcomes, x1, …, xN, must be equal to one. Equivalently,
using Eq. (1.2), the sum of the probabilities of all possible outcomes is equal to one:

∑_{i=1}^{N} P(xi) = 1 .  (1.3)
Given two events A and B, the conditional probability represents the probability of
the event A given the condition that the event B has occurred, and is given by:

P(A | B) = P(A ∩ B) / P(B) .  (1.4)
The conditional probability can be visualized in Fig. 1.3: while the probability of A,
P(A), corresponds to the area of the set A relative to the area of the whole sample
space Ω, which is equal to one, the conditional probability, P(A | B), corresponds to
the area of the intersection of A and B relative to the area of the set B.
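Eq. (1.4) can be checked by enumeration on the two-dice sample space; the events A and B below are hypothetical choices for illustration, not taken from the book.

```python
from fractions import Fraction

# Sample space: all 36 equally probable outcomes of two dice
omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

A = {o for o in omega if o[0] + o[1] == 8}   # event A: the sum is 8
B = {o for o in omega if o[0] % 2 == 0}      # event B: the first die is even

def P(E):
    # Classical probability: favorable cases over total cases
    return Fraction(len(E), len(omega))

# Conditional probability, Eq. (1.4): P(A|B) = P(A and B) / P(B)
p_A_given_B = P(A & B) / P(B)
print(p_A_given_B)
```

Here P(A ∩ B) = 3/36 and P(B) = 1/2, so P(A | B) = 1/6, larger than the unconditional P(A) = 5/36: knowing B has occurred changes the probability of A.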
This result clearly does not hold if there are causes of simultaneous
inefficiency of both detectors, e.g.: a fraction of times where the electronics
systems for both A and B are simultaneously switched off for short periods,
or geometrical overlap of inactive regions, and so on.
Ei ∩ Ej = ∅  ∀ i, j   and   ⋃_{i=1}^{N} Ei = E0 .  (1.8)

Given a partition A1, …, AN of the sample space made of disjoint sets, i.e.
such that Ai ∩ Aj = ∅ and ∑_{i=1}^{N} P(Ai) = 1, the following sets can be built, as
visualized in Fig. 1.5:

Ei = E0 ∩ Ai .  (1.10)
Fig. 1.5 Visualization of the law of total probability. The sample space Ω is partitioned into the
sets A1, A2, A3, …, AN, and the set E0 is partitioned into E1, E2, E3, …, EN, where each Ei is
E0 ∩ Ai and has probability equal to P(E0 | Ai) P(Ai)
P(E0) = ∑_{i=1}^{N} P(E0 | Ai) P(Ai) .  (1.12)
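The law of total probability, Eq. (1.12), can be verified by enumeration on the two-dice sample space, partitioning Ω by the value of the first die (an illustrative choice of partition, not from the book).

```python
from fractions import Fraction

omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

def P(E):
    return Fraction(len(E), len(omega))

E0 = {o for o in omega if o[0] + o[1] == 7}                 # event of interest: sum is 7
A = [{o for o in omega if o[0] == i} for i in range(1, 7)]  # partition by first die value

# Eq. (1.12): P(E0) = sum over i of P(E0 | Ai) * P(Ai)
total = sum((P(E0 & Ai) / P(Ai)) * P(Ai) for Ai in A)
print(total)
```

The sum reproduces the direct evaluation P(E0) = 6/36 = 1/6, as it must for any partition of the sample space.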
The average value of a discrete random variable x is defined as:

⟨x⟩ = ∑_{i=1}^{N} xi P(xi) .  (1.13)
Sometimes the notation E[x] or x̄ is also used in the literature to indicate the average
value.
More generally, given a function g of x, the average value of g is:

⟨g(x)⟩ = ∑_{i=1}^{N} g(xi) P(xi) .  (1.14)
The variance of x is defined as:

V[x] = ∑_{i=1}^{N} (xi − ⟨x⟩)² P(xi) ,  (1.15)

and its square root, σx = √V[x], is called standard deviation.⁴
⁴ In some physics literature, the standard deviation is sometimes also called root mean square, or
r.m.s. This may cause some confusion.
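Eqs. (1.13) and (1.15) applied to a fair die give exact rational values; this short sketch (not from the book) evaluates them directly.

```python
from fractions import Fraction

# Fair die: values 1..6, each with probability 1/6
values = range(1, 7)
p = Fraction(1, 6)

mean = sum(x * p for x in values)              # Eq. (1.13): average value
var = sum((x - mean) ** 2 * p for x in values) # Eq. (1.15): variance

print(mean, var)  # 7/2 and 35/12
```

The result ⟨x⟩ = 7/2 = 3.5 is the value the law of large numbers converges to later in this chapter.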
The average of the sum of two random variables x and y can be easily demonstrated
to be equal to the sum of their averages:

⟨x + y⟩ = ⟨x⟩ + ⟨y⟩ ,  (1.20)

and the variance of the sum receives a contribution from the covariance of the two
variables:

V[x + y] = V[x] + V[y] + 2 cov(x, y) .  (1.21)
The value of x that corresponds to the largest probability is called mode. The
mode may not be unique; in that case, the distribution of x is said to be multimodal
(bimodal in the case of two maxima).
The median is the value x̃ that separates the range of possible values of x into
two sets of equal probability. The median can also be defined for an
ordered sequence of values, say {x1, …, xN}, as:

x̃ = x_{(N+1)/2}                  if N is odd ,
x̃ = (x_{N/2} + x_{N/2+1}) / 2    if N is even .  (1.23)
The correlation coefficient of two random variables x and y is defined as:

ρxy = cov(x, y) / (σx σy) .  (1.25)
From Eq. (1.26), the variance of the sum of uncorrelated variables, i.e. variables
which have null covariance, is equal to the sum of their variances.
Given N random variables x1, …, xN, the symmetric matrix Cij = cov(xi, xj)
is called covariance matrix. The diagonal terms Cii are always positive or null, and
correspond to the variances of the N variables. A covariance matrix is diagonal if
and only if all variables are uncorrelated.
For independent variables x and y, the third central moments are additive as well:

γ[x + y] = γ[x] + γ[y] ,  (1.29)

where γ[x] = ⟨(x − ⟨x⟩)³⟩ denotes the unnormalized skewness.
Given a random variable x, let us consider the variable y transformed via the function
y = Y(x). If x can assume the values {x1, …, xN}, then y can assume one of the
values {y1, …, yM} = {Y(x1), …, Y(xN)}. N is equal to M only if all the values
Y(x1), …, Y(xN) are different from each other. The probability corresponding to
each value yj is given by the sum of the probabilities of all values xi that are
transformed into yj by Y:

P(yj) = ∑_{i: Y(xi) = yj} P(xi) .  (1.33)

In case, for a given j, there is a single value of the index i for which Y(xi) = yj, we
have P(yj) = P(xi).
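Eq. (1.33) can be illustrated with a non-injective transformation of a fair die value; the choice Y(x) = |x − 3| below is a hypothetical example, not from the book.

```python
from fractions import Fraction
from collections import defaultdict

# x is a fair die value; transform with the non-injective function Y(x) = |x - 3|
Px = {x: Fraction(1, 6) for x in range(1, 7)}

Py = defaultdict(Fraction)
for x, p in Px.items():
    # Eq. (1.33): accumulate P(x_i) over all x_i with Y(x_i) = y_j
    Py[abs(x - 3)] += p

print(dict(Py))
```

Values of y reached by two values of x (y = 1 from x = 2, 4; y = 2 from x = 1, 5) get probability 1/3, while y = 0 and y = 3, each reached by a single x, keep probability 1/6.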
The generalization to more variables is straightforward: assume that a variable z
is determined from two random variables x and y as z = Z(x, y); if {z1, …, zM} is
the set of all possible values of z, the probability corresponding to each zk is given
by:

P(zk) = ∑_{i, j: Z(xi, yj) = zk} P(xi, yj) .  (1.34)

This expression is consistent with the results obtained for the sum of two dice
considered in Sect. 1.5, as will be seen in Example 1.3.
Let us consider a basket containing a number n of balls, each having one of two
possible colors, say red and white. Assume we know the number r of red balls in the
basket; hence, the number of white balls must be n − r (Fig. 1.6). The probability to
randomly extract a red ball from the basket is p = r/n, according to Eq. (1.1).
Fig. 1.7 Building a binomial process as subsequent random extractions of a single red or white
ball (Bernoulli process). The tree shows all the possible combinations at each extraction step. Each
branching has a corresponding probability equal to p or 1 − p for a red or white ball, respectively.
The number of paths corresponding to each possible combination is shown in parentheses, and is
equal to the binomial coefficient in Eq. (1.38)
The number n of positive outcomes (red balls) is called binomial variable and
is equal to the sum of the N Bernoulli random variables. Its probability distribution
can be simply determined from Fig. 1.7, considering how many red and white balls
were present in each extraction, assigning each extraction a probability p or 1 − p,
respectively, and considering the number of possible paths leading to a given
combination of red/white extractions.
The latter term is called binomial coefficient and can be demonstrated by
recursion (Fig. 1.8) to be equal to [3]:

C(N, n) = N! / (n! (N − n)!) .  (1.38)
Fig. 1.8 The Yang Hui triangle [4], showing the construction of binomial coefficients
The overall probability of counting n red balls after N extractions is then given by
the binomial distribution:

P(n; N, p) = N! / (n! (N − n)!) · pⁿ (1 − p)^(N−n) .  (1.39)
Binomial distributions are shown in Fig. 1.9 for N = 15 and for p = 0.2, 0.5 and
0.8.
Since a binomial variable n is equal to the sum of N independent Bernoulli
variables with probability p, the average and variance of n are equal to N times the
average and variance of a Bernoulli variable, respectively (Eqs. (1.35) and (1.37)):

⟨n⟩ = Np ,  (1.40)
V[n] = Np (1 − p) .  (1.41)
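Eqs. (1.39)–(1.41) can be checked numerically by evaluating the full binomial probability distribution; this sketch (not from the book) uses one of the parameter choices of Fig. 1.9.

```python
from math import comb

def binom_pmf(n, N, p):
    # Eq. (1.39): P(n; N, p) = C(N, n) p^n (1-p)^(N-n)
    return comb(N, n) * p ** n * (1 - p) ** (N - n)

N, p = 15, 0.2
pmf = [binom_pmf(n, N, p) for n in range(N + 1)]

# Mean and variance computed directly from the distribution
mean = sum(n * q for n, q in zip(range(N + 1), pmf))
var = sum((n - mean) ** 2 * q for n, q in zip(range(N + 1), pmf))

print(mean, var)  # N*p = 3 and N*p*(1-p) = 2.4, up to rounding
```

The probabilities sum to one, and the mean and variance agree with Eqs. (1.40) and (1.41) without invoking the Bernoulli decomposition.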
Fig. 1.9 Binomial distributions for N = 15 and for p = 0.2, 0.5 and 0.8
The binomial distribution introduced in Sect. 1.15 can be generalized to the case in
which, out of N extractions, there are more than two outcome categories (success
and failure, in the binomial case). Consider k categories (k = 2 for the binomial
case), and let the possible numbers of outcomes be equal to n1, …, nk for each of the
k categories, with n1 + ⋯ + nk = N. Each category has a probability pi, i = 1, …, k,
of individual extraction, respectively, with p1 + ⋯ + pk = 1. The joint distribution of
n1, …, nk is given by:

P(n1, …, nk; N, p1, …, pk) = N! / (n1! ⋯ nk!) · p1^(n1) ⋯ pk^(nk) ,  (1.42)

and is called multinomial distribution. Equation (1.39) is equivalent to Eq. (1.42) for
k = 2, n = n1, n2 = N − n1, p1 = p and p2 = 1 − p.
The average values of the multinomial variables ni are:

⟨ni⟩ = N pi .  (1.43)

The multinomial variables ni have negative correlation, and their covariance is, for
i ≠ j:

cov(ni, nj) = −N pi pj .  (1.45)

For a binomial distribution, Eq. (1.45) leads to the obvious conclusion that n1 = n
and n2 = N − n are 100% anticorrelated.
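The averages and the negative covariance can be verified by brute-force summation over all multinomial states; the parameter choices N = 6 and (p1, p2, p3) = (0.5, 0.3, 0.2) below are arbitrary illustrative values, not from the book.

```python
from math import factorial
from itertools import product

def multinomial_pmf(ns, ps):
    # Eq. (1.42): N! / (n1! ... nk!) * p1^n1 ... pk^nk
    N = sum(ns)
    coef = factorial(N)
    for n in ns:
        coef //= factorial(n)  # exact integer division at each step
    prob = float(coef)
    for n, p in zip(ns, ps):
        prob *= p ** n
    return prob

N, ps = 6, (0.5, 0.3, 0.2)
# All configurations (n1, n2, n3) with n1 + n2 + n3 = N
states = [ns for ns in product(range(N + 1), repeat=3) if sum(ns) == N]

mean1 = sum(multinomial_pmf(ns, ps) * ns[0] for ns in states)
cov12 = sum(multinomial_pmf(ns, ps)
            * (ns[0] - N * ps[0]) * (ns[1] - N * ps[1]) for ns in states)

print(mean1, cov12)  # N*p1 = 3.0 and -N*p1*p2 = -0.9, up to rounding
```

The negative covariance reflects the constraint n1 + ⋯ + nk = N: an upward fluctuation of one category must be compensated by the others.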
Given N repeated independent extractions x1, …, xN of the same random variable x,
consider their average:

x̄ = (x1 + ⋯ + xN) / N .  (1.46)
x̄ is itself a random variable, and its expected value is, from Eq. (1.20), equal to
the expected value of x. The distribution of x̄, in general, has a smaller range of
fluctuation than the variable x, and central values of x̄ tend to be more probable. This
can be demonstrated, using classical probability and combinatorial analysis, in the
simplest cases. A case with N = 2 is, for instance, the distribution of the sum of
two dice, d1 + d2, in Fig. 1.1, where x̄ is just given by (d1 + d2)/2. The distribution
has the largest probability value for (d1 + d2)/2 = 3.5, which is the expected value
of a single die roll:

⟨x⟩ = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5 .  (1.47)

Repeating the combinatorial exercise for the average of three or more dice gives
even more 'peaked' distributions.
In general, it is possible to demonstrate that, under some conditions about the
distribution of x, as N increases, a smaller probability corresponds to most of the
possible values of x̄, except the ones very close to the expected average ⟨x̄⟩ = ⟨x⟩.
The probability distribution of x̄ becomes a narrow peak around the value ⟨x⟩, and
the interval of values that corresponds to a large fraction of the total probability (we
could choose, say, 90% or 95%) becomes smaller. Eventually, for N → ∞, the
distribution becomes a Dirac delta centered at ⟨x⟩.
This convergence is called the law of large numbers and can be illustrated in a
simulated experiment consisting of repeated die rolls, as shown in Fig. 1.10, where
x̄ is plotted as a function of N for two independent random extractions. Larger values
of N correspond to smaller fluctuations of the result and to a visible convergence
towards the value of 3.5. If we could ideally increase the total number of trials N
to infinity, the average value x̄ would no longer be a random variable, but would take
a single possible value, equal to ⟨x⟩ = 3.5.
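A simulation in the spirit of Fig. 1.10 takes only a few lines; this is a minimal sketch (not the code used for the figure) comparing the running average of die rolls for a small and a large number of trials.

```python
import random

random.seed(42)

def average_of_rolls(n_rolls):
    # Average of n_rolls simulated fair-die extractions
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

small = average_of_rolls(100)        # still fluctuates visibly around 3.5
large = average_of_rolls(1_000_000)  # very close to <x> = 3.5
print(small, large)
```

With a million rolls the standard deviation of the average is about 0.0017, so the second value is expected to agree with 3.5 at the per-mille level.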
Fig. 1.10 An illustration of the law of large numbers using a computer simulation of die rolls. The
average of the first N out of 1000 random extractions is reported as a function of N. The sequence
of 1000 extractions has been repeated twice (red and blue lines) with independent random values
The law of large numbers has many empirical verifications for the vast majority
of random experiments and has a broad validity range.
The frequentist probability of an event E can be defined as:

P(E) = lim_{N→∞} N(E) / N ,  (1.48)

where N(E) is the number of occurrences of the event E in N repetitions of the
experiment. The limit is intended, in this case, as a convergence in probability, given
by the law of large numbers. The limit only rigorously holds in the non-realizable
case of an infinite number of experiments. Rigorously speaking, the definition of
frequentist probability in Eq. (1.48) is defined itself in terms of another probability,
which could introduce conceptual problems. F. James et al. report the following
sentence:

[…] this definition is not very appealing to a mathematician, since it is based on
experimentation, and, in fact, implies unrealizable experiments (N → ∞) [5].
References
1. Laplace, P.: Essai Philosophique Sur les Probabilités, 3rd edn. Courcier Imprimeur, Paris (1816)
2. Kolmogorov, A.: Foundations of the Theory of Probability. Chelsea, New York (1956)
3. The coefficients present in the binomial distribution are the same that appear in the expansion of a
binomial raised to the nth power, (a + b)ⁿ. A simple iterative way to compute those coefficients
is known as Pascal's triangle. In different countries this triangle is named after different authors,
e.g.: the Tartaglia triangle in Italy, the Yang Hui triangle in China, and so on. In particular, the
following publications of the triangle are present in the literature:
• India: published in the tenth century, referring to the work of Pingala, dating back to the
fifth–second century BC.
• Persia: Al-Karaju (953–1029) and Omar Jayyám (1048–1131)
• China: Yang Hui (1238–1298); see Fig. 1.8
• Germany: Petrus Apianus (1495–1552)
• Italy: Nicolò Fontana Tartaglia (1545)
• France: Blaise Pascal (1655)
4. Yang Hui (杨辉) triangle as published by Zhu Shijie (朱世杰) in Siyuan yujian, (四元玉鉴,
Jade Mirror of the four unknowns, 1303). Public domain image.
5. Eadie, W., Drijard, D., James, F., Roos, M., Sadoulet, B.: Statistical Methods in Experimental
Physics. North Holland, Amsterdam (1971)
6. D’Agostini, G.: Bayesian Reasoning in Data Analysis: A Critical Introduction. World Scientific,
Hackensack (2003)
Chapter 2
Probability Distribution Functions
2.1 Introduction
The problem introduced in Sect. 1.6.1 with Bertrand's paradox occurs when we try
to decompose the range of possible values of a random variable x into equally
probable elementary intervals: this is not always possible without ambiguity,
because of the continuous nature of the problem. In Sect. 1.6 we considered a
continuous random variable x with possible values in an interval [x1, x2], and we
saw that if x is uniformly distributed in [x1, x2], a transformed variable y = Y(x)
is not in general uniformly distributed in [y1, y2] = [Y(x1), Y(x2)] (Y is taken as
a monotonic function of x). This makes the choice of the continuous variable on
which equally probable intervals are defined arbitrary.
The following Sections will show how to overcome this difficulty using defini-
tions consistent with the axiomatic approach to probability introduced in Sect. 1.7.
The function f is called probability distribution function (PDF). The function f(x₁, …, xₙ), times dⁿx, can be interpreted as a differential probability, i.e. it is equal to the probability dP corresponding to the infinitesimal hypervolume dx₁ ⋯ dxₙ, divided by the infinitesimal hypervolume:

$$ \frac{\mathrm{d}P}{\mathrm{d}x_1 \cdots \mathrm{d}x_n} = f(x_1, \ldots, x_n) \,. \quad (2.2) $$
The normalization condition for discrete probability distributions (Eq. (1.3)) can be generalized to the continuous case as follows:

$$ \int f(x_1, \ldots, x_n) \, \mathrm{d}^n x = 1 \,. \quad (2.3) $$
Dirac delta functions can be linearly combined with weights equal to the probabilities corresponding to the discrete values. A distribution representing a discrete random variable x that can only take the values x₁, …, x_N with probabilities P₁, …, P_N, respectively, can be written, using the continuous PDF formalism, as:

$$ f(x) = \sum_{i=1}^{N} P_i \, \delta(x - x_i) \,, \quad (2.6) $$

which gives again the normalization condition for a discrete variable, already shown in Eq. (1.3).
For example, the PDF

$$ f(x) = \frac{1}{2}\,\delta(x) + \frac{1}{2}\,g(x) \quad (2.8) $$

gives a 50% probability to have x = 0 and a 50% probability to have a value of x distributed according to the continuous PDF g(x).

2.3 Average and Variance in the Continuous Case
The definitions of average and variance introduced in Sect. 1.12 are generalized to continuous variables as follows. The average value of a continuous variable x whose PDF is f is:

$$ \langle x \rangle = \int x \, f(x) \, \mathrm{d}x \,. \quad (2.9) $$

Integrals are extended over [−∞, +∞], or over the entire validity range of the variable x.
Covariance, correlation coefficient, and covariance matrix can be defined for the continuous case in the same way as in Sect. 1.12, as can skewness and kurtosis.
As in the discrete case, a continuous PDF may have more than one mode; in that case it is called a multimodal distribution.
The median of a PDF f(x) is the value x̃ such that:

$$ \int_{-\infty}^{\tilde{x}} f(x) \, \mathrm{d}x = \frac{1}{2} \,, \quad (2.14) $$

or equivalently¹:

$$ \int_{-\infty}^{\tilde{x}} f(x) \, \mathrm{d}x = \int_{\tilde{x}}^{+\infty} f(x) \, \mathrm{d}x \,. \quad (2.15) $$

¹ If we assume here that f(x) is sufficiently regular, such that P({x̃}) = 0, i.e. f(x) has no Dirac delta component δ(x − x̃), we have P(x < x̃) = P(x > x̃) = 1/2. Otherwise, P(x < x̃) = P(x > x̃) = (1 − P({x̃}))/2.
2.6 Continuous Transformations of Variables
If the variable x follows the PDF f(x), the PDF of the transformed variable y = F(x), where F is the cumulative distribution of f, is uniform between 0 and 1, as can easily be demonstrated:

$$ \frac{\mathrm{d}P}{\mathrm{d}y} = \frac{\mathrm{d}P}{\mathrm{d}x}\,\frac{\mathrm{d}x}{\mathrm{d}y} = f(x)\,\frac{\mathrm{d}x}{\mathrm{d}F(x)} = \frac{f(x)}{f(x)} = 1 \,. \quad (2.20) $$
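This property can be checked numerically: drawing samples of x and mapping each through its own cumulative distribution F should produce values uniformly distributed in [0, 1]. A small sketch, with the exponential case and the sample size as arbitrary illustrative choices:

```python
import math
import random

random.seed(42)
lam = 1.5
F = lambda x: 1.0 - math.exp(-lam * x)   # cumulative distribution of Exp(lam)

# draw exponential samples and transform them through their own CDF
ys = [F(random.expovariate(lam)) for _ in range(100_000)]

# if y = F(x) is uniform in [0, 1], each decile holds ~10% of the sample
counts = [0] * 10
for y in ys:
    counts[min(int(y * 10), 9)] += 1
fracs = [c / len(ys) for c in counts]
```

The flat decile occupancy (and a sample mean close to 1/2) is exactly the uniformity stated by Eq. (2.20).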
Here f(x, y) is the PDF of the variables x and y. In the case of transformations into more than one variable, the generalization is straightforward. If we have, for instance, x′ = X′(x, y) and y′ = Y′(x, y), the transformed two-dimensional PDF can be written as:

$$ f'(x', y') = \int \delta\big(x' - X'(x, y)\big)\, \delta\big(y' - Y'(x, y)\big)\, f(x, y)\, \mathrm{d}x\, \mathrm{d}y \,. \quad (2.23) $$
Examples of uniform distributions are shown in Fig. 2.1 for different extreme values a and b.
The average of a uniformly distributed variable x is:

$$ \langle x \rangle = \frac{a + b}{2} \,. \quad (2.27) $$

Fig. 2.1 Uniform distributions with different values of the range extremes a and b (a = 1, b = 4; a = 2, b = 8; a = 3, b = 12)
2.8 Gaussian Distribution
A Gaussian or normal distribution has the form:

$$ g(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right] , \quad (2.29) $$

where the parameters μ and σ are the average and the standard deviation of the distribution, respectively. For μ = 0 and σ = 1, a normal distribution is called standard normal distribution, and it is equal to:

$$ \phi(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \,. \quad (2.30) $$
Fig. 2.2 Gaussian distributions with different values of the average and standard deviation parameters μ and σ (μ = 0, σ = 1; μ = 1, σ = 2; μ = 2, σ = 4)
The most frequently used values are the ones corresponding to 1σ, 2σ and 3σ (Z = 1, 2, 3), which have probabilities of 68.27%, 95.45% and 99.73%, respectively.
The importance of the Gaussian distribution resides in the central limit theorem (see Sect. 2.14), which makes it possible to approximate with a Gaussian many realistic distributions resulting from the superposition of several random effects, each having a finite and possibly unknown PDF.
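The quoted interval probabilities follow directly from the cumulative of the standard normal, P(|x − μ| < Zσ) = erf(Z/√2). A quick numerical check (a minimal sketch using only the standard library):

```python
import math

def gauss_interval_prob(z):
    """Probability for a Gaussian variable to lie within mu +/- z*sigma."""
    return math.erf(z / math.sqrt(2.0))

probs = {z: gauss_interval_prob(z) for z in (1, 2, 3)}
# probs[1] ~ 0.6827, probs[2] ~ 0.9545, probs[3] ~ 0.9973
```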
2.9 χ² Distribution

Fig. 2.3 χ² distributions with different values of the number of degrees of freedom n (n = 1, 3, 5)
Γ is the so-called gamma function, the analytical extension of the factorial.² The expected value of a χ² distribution is equal to the number of degrees of freedom n, and the variance is equal to 2n. χ² distributions are shown in Fig. 2.3 for different numbers of degrees of freedom n.
Typical applications of the χ² distribution are goodness-of-fit tests (see Sect. 5.12.2), where the cumulative χ² distribution is used.

² Γ(n) = (n − 1)! if n is an integer value.
Fig. 2.4 Log normal distributions with different values of the parameters μ and σ (μ = 0, σ = 0.1; μ = 1, σ = 0.25)
The PDF in Eq. (2.34) can be determined by applying Eq. (2.21) to the case of a normal distribution. A log normal variable has the following average and standard deviation:

$$ \langle x \rangle = e^{\mu + \sigma^2/2} \,, \quad (2.35) $$
$$ \sigma_x = e^{\mu + \sigma^2/2} \sqrt{e^{\sigma^2} - 1} \,. \quad (2.36) $$

Note that Eq. (2.35) implies that ⟨e^y⟩ > e^{⟨y⟩} for a normal random variable y:

$$ \langle e^y \rangle = \langle x \rangle = e^{\mu}\, e^{\sigma^2/2} > e^{\mu} = e^{\langle y \rangle} \,. $$

Examples of log normal distributions are shown in Fig. 2.4 for different values of μ and σ.
Fig. 2.5 Exponential distributions for different values of the parameter λ (λ = 1, 2, 4)
Examples of exponential distributions are shown in Fig. 2.5 for different values of the parameter λ.
Exponential distributions are widely used in physics, in particular to model the distribution of particle lifetimes; in those cases, the average lifetime is the inverse of the parameter λ.
$$ P(n; \nu) = \frac{\nu^n e^{-\nu}}{n!} \,. \quad (2.38) $$

Equation (2.38) is called Poisson distribution; its parameter ν is sometimes also called rate, as will become clearer from Example 2.5. Figure 2.6 shows examples of Poisson distributions for different values of the parameter ν. It is easy to demonstrate that the average and the variance of a Poisson distribution are both equal to ν.
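The statement about the average and variance can be verified directly from the probability mass function, summing the series up to a truncation point where the residual tail is negligible (a minimal sketch; ν = 3.7 is an arbitrary test value):

```python
import math

def poisson_pmf(n, nu):
    return nu ** n * math.exp(-nu) / math.factorial(n)

nu = 3.7
ns = range(120)   # truncation: the omitted tail is negligible for nu = 3.7
norm = sum(poisson_pmf(n, nu) for n in ns)
mean = sum(n * poisson_pmf(n, nu) for n in ns)
var = sum((n - mean) ** 2 * poisson_pmf(n, nu) for n in ns)
# norm ~ 1, mean ~ nu, var ~ nu
```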
Fig. 2.6 Poisson distributions with different values of the rate parameter ν (ν = 2, 4, 8)
Fig. 2.7 A uniform distribution of occurrences along a variable ξ. Two intervals are shown of sizes x and X, where x ≪ X
2.12 Poisson Distribution
The first term, νⁿ/n!, does not depend on N, while the remaining three terms tend to 1, e^{−ν} and 1, respectively, in the limit N → ∞. The distribution of n from Eq. (2.41), in this limit, is equal to the Poisson distribution:

$$ P(n; \nu) = \frac{\nu^n e^{-\nu}}{n!} \,. \quad (2.42) $$
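The convergence of the binomial distribution to the Poisson distribution for N → ∞ at fixed ν = pN can be checked numerically (a sketch; the values of ν and N are arbitrary choices):

```python
import math

def binom_pmf(n, N, p):
    return math.comb(N, n) * p ** n * (1.0 - p) ** (N - n)

def poisson_pmf(n, nu):
    return nu ** n * math.exp(-nu) / math.factorial(n)

nu = 2.0
diffs = []
for N in (10, 100, 10_000):
    p = nu / N   # keep nu = p * N fixed while N grows
    diffs.append(max(abs(binom_pmf(n, N, p) - poisson_pmf(n, nu))
                     for n in range(10)))
# diffs shrinks as N grows: the binomial approaches the Poisson limit
```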
Fig. 2.8 Occurrence times (dots) of events uniformly distributed in time, represented along a horizontal axis. The time origin (t₀ = 0) is marked as a +. The occurrence time of the first event is marked as t₁
Let t₁ be the occurrence time of the first event with respect to an arbitrary time origin t₀, which could also coincide with the occurrence of one of the events. In that case, t₁ represents the time difference between two consecutive events.
The occurrence time of the first event, as well as the time difference between two consecutive events, can be demonstrated to be distributed according to an exponential PDF, as follows.
Let us consider a time t and another time t + δt, where δt ≪ t. The probability that t₁ is greater than or equal to t + δt, P(t₁ ≥ t + δt), is equal to the probability P(0; [0, t + δt[) that no event occurs before t + δt, i.e. in the time interval [0, t + δt[.
The probability P(0; [0, t + δt[) is equal to the probability that no event occurs in the interval [0, t[ and no event occurs in the interval [t, t + δt[, since [0, t + δt[ = [0, t[ ∪ [t, t + δt[. Events occurring in the two disjoint time intervals are independent, hence the combined probability is the product of the two probabilities:

$$ P(0; [0, t + \delta t[\,) = P(0; [0, t[\,) \cdot P(0; [t, t + \delta t[\,) \,. \quad (2.43) $$

Given the event rate r per unit of time, the probability to have n occurrences in a time interval δt is given by a Poisson distribution (see Example 2.5) with rate ν = r δt:

$$ P(n; \nu) = P(n; [t, t + \delta t[\,) = \frac{\nu^n e^{-\nu}}{n!} \,. \quad (2.44) $$
Equation (2.43) can be written, using the result from Eq. (2.45), as:

$$ P(t_1 \ge t + \delta t) = P(t_1 \ge t)\,(1 - r\,\delta t) \,, $$

or, equivalently:

$$ \frac{P(t_1 \ge t + \delta t) - P(t_1 \ge t)}{\delta t} = -\,r\,P(t_1 \ge t) \,, $$
which gives:

$$ \frac{\mathrm{d}P(t_1 \ge t)}{\mathrm{d}t} = -\,r\,P(t_1 \ge t) \,. \quad (2.49) $$

Considering the initial condition P(t₁ ≥ 0) = 1, Eq. (2.49) has the solution P(t₁ ≥ t) = e^{−rt}, from which the exponential PDF of t₁ follows.
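The whole derivation can be illustrated with a small Monte Carlo: generate occurrences uniformly in a long time window and check that the gaps between consecutive events follow the exponential law P(t₁ ≥ t) = e^{−rt}. The rate, window length and test point below are arbitrary illustrative choices:

```python
import math
import random

random.seed(7)
r = 2.0                       # event rate per unit time
T = 50_000.0                  # total observation window
n_events = int(r * T)         # occurrences uniformly distributed in [0, T]
times = sorted(random.uniform(0.0, T) for _ in range(n_events))
gaps = [b - a for a, b in zip(times, times[1:])]

# survival fraction of the gaps vs the exponential prediction e^{-r t}
t = 0.5
frac = sum(g >= t for g in gaps) / len(gaps)
pred = math.exp(-r * t)            # ~0.368
mean_gap = sum(gaps) / len(gaps)   # average lifetime ~ 1/r = 0.5
```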
Fig. 2.9 Poisson distributions with different values of the parameter ν (ν = 2, 5, 10, 20, 30) compared with Gaussian distributions with μ = ν and σ = √ν
Poisson distributions have several interesting properties, some of which are listed in the following.
• For large ν, a Poisson distribution can be approximated with a Gaussian having average ν and standard deviation √ν. See Fig. 2.9 for a visual comparison.
• A binomial distribution with a number of extractions N and probability p ≪ 1 can be approximated with a Poisson distribution with average ν = p N (see Example 2.5, above).
• If two variables n₁ and n₂ follow Poisson distributions with averages ν₁ and ν₂, respectively, it is easy to demonstrate, using Eq. (1.34), that the sum n = n₁ + n₂ again follows a Poisson distribution, with average ν₁ + ν₂. In formulae:

$$ P(n; \nu_1, \nu_2) = \sum_{n_1 = 0}^{n} \mathrm{Pois}(n_1; \nu_1)\, \mathrm{Pois}(n - n_1; \nu_2) = \mathrm{Pois}(n; \nu_1 + \nu_2) \,. \quad (2.55) $$

This property descends from the fact that the superposition of two uniform processes, like the one considered in Example 2.5, is again a uniform process, whose total rate is equal to the sum of the two individual rates.
• Randomly picking occurrences with probability ε from a Poissonian process again gives a Poissonian process. In other words, if a Poisson variable n₀ has expected value (rate) ν₀, then a variable n distributed according to a binomial distribution with probability ε and sample size n₀ is distributed according to a Poisson distribution with average ν = ε ν₀.
This is the case, for instance, when counting the number of cosmic rays recorded by a detector whose efficiency ε is not ideal (ε < 1).
• The cumulative χ² distribution (Eq. (2.33)) and the Poisson distribution are related. Using the so-called incomplete gamma function,

$$ P(x; n) = \frac{1}{\Gamma(n)} \int_0^x t^{\,n-1} e^{-t}\, \mathrm{d}t \,, $$

the following relation holds:

$$ \sum_{k=0}^{n-1} \frac{\nu^k e^{-\nu}}{k!} = \int_{2\nu}^{+\infty} f(\chi^2; 2n)\, \mathrm{d}\chi^2 \,. \quad (2.59) $$
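The thinning property listed above can be verified exactly by marginalizing over the original count n₀: Σ_{n₀} Pois(n₀; ν₀) Binom(n; n₀, ε) = Pois(n; εν₀). A numerical sketch with arbitrary test values ν₀ = 5 and ε = 0.3:

```python
import math

def poisson_pmf(n, nu):
    return nu ** n * math.exp(-nu) / math.factorial(n)

def binom_pmf(n, N, eps):
    return math.comb(N, n) * eps ** n * (1.0 - eps) ** (N - n)

nu0, eps = 5.0, 0.3
errs = []
for n in range(8):
    # marginalize over the unobserved original count n0 (truncated series)
    p = sum(poisson_pmf(n0, nu0) * binom_pmf(n, n0, eps)
            for n0 in range(n, 120))
    errs.append(abs(p - poisson_pmf(n, eps * nu0)))
# errs stays at machine precision: the thinned process is Poissonian
```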
2.13 Other Distributions Useful in Physics

Some of the most commonly used PDFs in physics are presented in the following sections. The list is, of course, not exhaustive.
While the parameter x₀ determines the position of the maximum of the distribution (mode), twice the parameter γ is equal to the full width at half maximum of the distribution.
A Breit–Wigner distribution arises in many resonance problems in physics.
Fig. 2.10 Breit–Wigner distributions centered around zero for different values of the width parameter γ (γ = 0.5, 1, 2)
Since the integrals of both x BW(x) and x² BW(x) are divergent, the mean and variance of a Breit–Wigner distribution are undefined.
Figure 2.10 shows examples of Breit–Wigner distributions for different values of the width parameter γ and for fixed x₀ = 0.
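Even though its mean is undefined, a Breit–Wigner variable is easy to sample by inverse transform, since its cumulative distribution is F(x) = 1/2 + arctan((x − x₀)/γ)/π. The sketch below (parameter values and sample size are arbitrary) checks that half of the probability lies within one half-width γ of x₀:

```python
import math
import random

random.seed(1)
x0, gamma = 0.0, 1.5

def bw_sample():
    # inverse-transform sampling: invert F(x) = 1/2 + atan((x - x0)/gamma)/pi
    u = random.random()
    return x0 + gamma * math.tan(math.pi * (u - 0.5))

xs = [bw_sample() for _ in range(100_000)]
# P(|x - x0| <= gamma) = (atan(1) - atan(-1)) / pi = 1/2 exactly
frac = sum(abs(x - x0) <= gamma for x in xs) / len(xs)
```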
The parameter m determines the position of the maximum of the distribution (mode) and the parameter Γ measures the width of the distribution.
A relativistic Breit–Wigner distribution arises from the square of a virtual particle's propagator (see for instance [1]) with four-momentum squared p² = x², which is proportional to:

$$ \frac{1}{(x^2 - m^2) + i\, m\, \Gamma} \,. \quad (2.63) $$
The ARGUS collaboration introduced [2] a function that models many cases of combinatorial backgrounds where kinematical bounds produce a sharp edge. The ARGUS distribution is given by:

$$ A(x; \theta, \xi) = N\, x \sqrt{1 - \left(\frac{x}{\theta}\right)^{2}}\; e^{-\,\xi^2 \left[1 - (x/\theta)^2\right]/2} \,, \quad (2.64) $$
Fig. 2.11 Relativistic Breit–Wigner distributions with mass parameter m = 100 and different values of the width parameter Γ (Γ = 2.5, 5, 10)
Fig. 2.12 ARGUS distributions with different values of the parameters θ and ξ (θ = 9, ξ = 0.5; θ = 10, ξ = 1)
where N is a normalization coefficient that depends on the parameters θ and ξ. The primitive function of Eq. (2.64) can be computed analytically, and this saves computer time in the evaluation of the normalization coefficient N. Assuming ξ² ≥ 0, and writing u = 1 − (x/θ)², the primitive is:

$$ \frac{1}{N} \int A(x; \theta, \xi)\, \mathrm{d}x = \frac{\theta^2}{\xi^2} \left[ \sqrt{u}\; e^{-\xi^2 u / 2} - \frac{1}{\xi} \sqrt{\frac{\pi}{2}}\; \mathrm{erf}\!\left( \xi \sqrt{\frac{u}{2}} \right) \right] + \text{const} \,, \quad (2.65) $$

so that, integrating over the whole range 0 ≤ x ≤ θ, the normalization condition gives N = ξ³/(θ²√(2π) Ψ(ξ)), where Ψ(ξ) = Φ(ξ) − ξ φ(ξ) − 1/2, φ being the standard normal PDF (Eq. (2.30)) and Φ its cumulative distribution (Eq. (2.31)).
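The primitive as reconstructed in Eq. (2.65) can be cross-checked against a direct numerical integration of the unnormalized ARGUS density x√(1 − (x/θ)²) e^{−ξ²(1−(x/θ)²)/2}; the parameter values and the integration range in this sketch are arbitrary:

```python
import math

theta, xi = 10.0, 1.0

def argus_unnorm(x):
    u = 1.0 - (x / theta) ** 2
    return x * math.sqrt(u) * math.exp(-0.5 * xi ** 2 * u)

def primitive(x):
    # analytic primitive of argus_unnorm (Eq. 2.65, up to a constant)
    u = 1.0 - (x / theta) ** 2
    return (theta ** 2 / xi ** 2) * (
        math.sqrt(u) * math.exp(-0.5 * xi ** 2 * u)
        - (1.0 / xi) * math.sqrt(math.pi / 2.0) * math.erf(xi * math.sqrt(0.5 * u))
    )

a, b, n = 2.0, 9.0, 200_000
h = (b - a) / n
numeric = sum(argus_unnorm(a + (i + 0.5) * h) for i in range(n)) * h
analytic = primitive(b) - primitive(a)
```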
The parameter α determines the starting point of the power-law tail, measured in units of σ, the standard deviation of the Gaussian core.
Examples of Crystal Ball distributions are shown in Fig. 2.13, where the parameter α is varied while the parameters of the Gaussian core are fixed at μ = 0, σ = 1, and the power-law exponent is set to n = 2.
Fig. 2.13 Crystal Ball distributions with μ = 0, σ = 1, n = 2 and different values of the parameter α (α = 0.5, 1, 2)
Fig. 2.14 Landau distributions with μ = 0 and different values of the parameter σ (σ = 0.2, 0.4, 0.6)
A model that describes the fluctuations of the energy loss of particles traversing thin layers of matter is due to Landau [4, 5]. The distribution of the energy loss x is given by the following integral expression, called Landau distribution:

$$ L(x) = \frac{1}{\pi} \int_0^{\infty} e^{-t \log t - x t} \sin(\pi t)\, \mathrm{d}t \,. \quad (2.69) $$

A location parameter μ and a scale parameter σ are often added:

$$ L(x; \mu, \sigma) = L\!\left(\frac{x - \mu}{\sigma}\right) . \quad (2.70) $$

Examples of Landau distributions are shown in Fig. 2.14 for different values of σ and fixed μ = 0. This distribution is also used as an empirical model for several asymmetric distributions.
Fig. 2.15 Approximate visual demonstration of the central limit theorem using a Monte Carlo technique. A random variable x₁ is generated uniformly in the interval [−√3, √3[, in order to have average value μ = 0 and standard deviation σ = 1. The top-left plot shows the distribution of 10⁵ random extractions of x₁; the other plots show 10⁵ random extractions of (x₁ + x₂)/√2, (x₁ + x₂ + x₃)/√3 and (x₁ + x₂ + x₃ + x₄)/√4, respectively, where all xᵢ are extracted with the same uniform distribution as x₁. A Gaussian curve with μ = 0 and σ = 1, normalized to the sample size, is superimposed in each case. The Gaussian approximation becomes better and better as a larger number of variables is added
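The construction used in the figure is easy to reproduce. The sketch below draws scaled sums of n standard uniform variables and checks that the fraction within ±1σ approaches the Gaussian value 68.27% (for n = 1 it is exactly 1/√3 ≈ 57.7%); the sample size here is a reduced, arbitrary choice:

```python
import math
import random

random.seed(0)
S3 = math.sqrt(3.0)

def scaled_sum(n):
    # sum of n uniform variables in [-sqrt(3), sqrt(3)[ (mean 0, sigma 1),
    # divided by sqrt(n) so that the result keeps sigma = 1
    return sum(random.uniform(-S3, S3) for _ in range(n)) / math.sqrt(n)

fracs = {}
for n in (1, 2, 4, 10):
    sample = [scaled_sum(n) for _ in range(20_000)]
    fracs[n] = sum(abs(v) <= 1.0 for v in sample) / len(sample)
# fracs[1] ~ 1/sqrt(3) ~ 0.577; fracs[10] approaches the Gaussian 0.6827
```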
Fig. 2.16 Same as Fig. 2.15, using a PDF that is uniformly distributed in two disjoint intervals, [−3/2, −1/2[ and [1/2, 3/2[, in order to have average value μ = 0 and standard deviation close to 1. The individual distribution and the sums of 2, 3, 4, 6 and 10 independent random extractions of such a variable, divided by √n, n = 2, 3, 4, 6, 10, are shown in the six plots, respectively. A Gaussian distribution with μ = 0 and σ = 1 is superimposed
2.15 Probability Distribution Functions in More than One Dimension
Probability densities can be defined in spaces with more than one dimension, as introduced in Sect. 2.2. In the simplest case of two dimensions, a PDF f(x, y) measures the probability density per unit area, i.e. the ratio of the differential probability dP corresponding to an infinitesimal interval around a point (x, y) and the differential area dx dy:

$$ \frac{\mathrm{d}P}{\mathrm{d}x\, \mathrm{d}y} = f(x, y) \,. \quad (2.71) $$

In three dimensions, the PDF measures the probability density per unit volume:

$$ \frac{\mathrm{d}P}{\mathrm{d}x\, \mathrm{d}y\, \mathrm{d}z} = f(x, y, z) \,, \quad (2.72) $$

and so on in more dimensions.
A PDF in more than one dimension that describes the distribution of more than one random variable is also called a joint probability distribution.
Given a two-dimensional PDF f(x, y), the probability distributions of the two individual variables x and y, called marginal distributions, can be determined by integrating f(x, y) over the other coordinate, y and x, respectively:

$$ f_x(x) = \int f(x, y)\, \mathrm{d}y \,, \quad (2.73) $$
$$ f_y(y) = \int f(x, y)\, \mathrm{d}x \,. \quad (2.74) $$

A pictorial view that illustrates the interplay between the joint distribution f(x, y) and the marginal distributions f_x(x) and f_y(y) is shown in Fig. 2.17.
The events A and B shown in the figure correspond to two values x̂ and ŷ extracted in the intervals [x, x + δx[ and [y, y + δy[, respectively.
Let us recall that, according to Eq. (1.6), two events A and B are independent if P(A ∩ B) = P(A) P(B). Given Eq. (2.79), the equality P(A ∩ B) = P(A) P(B) holds if and only if f(x, y) can be factorized into the product of the two marginal PDFs:

$$ f(x, y) = f_x(x)\, f_y(y) \,. $$

From this result, x and y can be defined as independent random variables if their joint PDF can be written as the product of a PDF of the variable x times a PDF of the variable y.
Note that if two variables x and y are independent, it can easily be demonstrated that they are also uncorrelated, in the sense that their covariance (Eq. (1.24)) is null. Conversely, if two variables are uncorrelated, they are not necessarily independent, as shown in Example 2.7 below.
Given

$$ \langle z \rangle = \mu \,, \qquad \langle z^2 \rangle = \mu^2 + \sigma^2 \,, $$

it is easy to demonstrate that, for x and y distributed according to f(x, y), the following relations also hold:

$$ \langle x \rangle = \langle y \rangle = 0 \,, \qquad \langle x^2 \rangle = \langle y^2 \rangle = \frac{\mu^2 + \sigma^2}{2} \,, \qquad \langle x y \rangle = 0 \,. $$
Fig. 2.18 Example of a PDF of two variables x and y that are uncorrelated but not independent
Given a two-dimensional PDF f(x, y) and a fixed value x₀ of the variable x, the conditional distribution of y given x₀ is defined as:

$$ f(y \mid x_0) = \frac{f(x_0, y)}{\int f(x_0, y')\, \mathrm{d}y'} \,. \quad (2.84) $$

In more dimensions, given a fixed subset of variables, the conditional distribution generalizes to:

$$ f(\vec{y} \mid \vec{x}_0) = \frac{f(\vec{x}_0, \vec{y}\,)}{\int f(\vec{x}_0, \vec{y}\,')\, \mathrm{d}y'_1 \cdots \mathrm{d}y'_k} \,. \quad (2.85) $$
Fig. 2.19 Illustration of a two-dimensional PDF f(x, y) and of the conditional PDF f(y | x₀) for a fixed value x₀ of the variable x
Let us consider, in two dimensions, the product of two Gaussian distributions for the variables x′ and y′ having standard deviations σ_x′ and σ_y′, respectively, and, for simplicity, both averages μ_x′ = μ_y′ = 0 (a translation can always be applied in order to generalize to the case μ_x′, μ_y′ ≠ 0):

$$ g'(x', y') = \frac{1}{2\pi\, \sigma_{x'} \sigma_{y'}} \exp\left[ -\frac{1}{2} \left( \frac{x'^2}{\sigma_{x'}^2} + \frac{y'^2}{\sigma_{y'}^2} \right) \right] . \quad (2.86) $$

The transformed PDF g(x, y) can be obtained using Eq. (2.24), considering that det|∂x′ᵢ/∂xⱼ| = 1, which leads to g′(x′, y′) = g(x, y).
g(x, y) has the form:

$$ g(x, y) = \frac{1}{2\pi\, |C|^{1/2}} \exp\left[ -\frac{1}{2}\, (x, y)\; C^{-1} \begin{pmatrix} x \\ y \end{pmatrix} \right] , \quad (2.88) $$

where the matrix C⁻¹ is the inverse of the covariance matrix of the variables (x, y). C⁻¹ can be obtained by comparing Eqs. (2.86) and (2.88). The rotated variables defined in Eq. (2.87) can be substituted in the following equation:

$$ \frac{x'^2}{\sigma_{x'}^2} + \frac{y'^2}{\sigma_{y'}^2} = (x, y)\; C^{-1} \begin{pmatrix} x \\ y \end{pmatrix} , \quad (2.89) $$
obtaining:

$$ C^{-1} = \begin{pmatrix} \dfrac{\cos^2\phi}{\sigma_{x'}^2} + \dfrac{\sin^2\phi}{\sigma_{y'}^2} & \sin\phi\cos\phi \left( \dfrac{1}{\sigma_{y'}^2} - \dfrac{1}{\sigma_{x'}^2} \right) \\[2ex] \sin\phi\cos\phi \left( \dfrac{1}{\sigma_{y'}^2} - \dfrac{1}{\sigma_{x'}^2} \right) & \dfrac{\sin^2\phi}{\sigma_{x'}^2} + \dfrac{\cos^2\phi}{\sigma_{y'}^2} \end{pmatrix} . \quad (2.90) $$
Here ρ_xy is the correlation coefficient defined in Eq. (1.25); the determinant of C⁻¹ that appears in Eq. (2.88) must be equal to:

$$ \left| C^{-1} \right| = \frac{1}{\sigma_{x'}^2 \sigma_{y'}^2} = \frac{1}{\sigma_x^2 \sigma_y^2 \left( 1 - \rho_{xy}^2 \right)} \,. \quad (2.92) $$

Inverting the matrix C⁻¹ in Eq. (2.90), the covariance matrix in the rotated variables (x, y) is:

$$ C = \begin{pmatrix} \cos^2\phi\, \sigma_{x'}^2 + \sin^2\phi\, \sigma_{y'}^2 & \sin\phi\cos\phi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right) \\[1ex] \sin\phi\cos\phi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right) & \sin^2\phi\, \sigma_{x'}^2 + \cos^2\phi\, \sigma_{y'}^2 \end{pmatrix} . \quad (2.93) $$

The correlation coefficient follows:

$$ \rho_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sin 2\phi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right)}{\sqrt{\sin^2 2\phi \left( \sigma_{x'}^2 - \sigma_{y'}^2 \right)^2 + 4\, \sigma_{x'}^2 \sigma_{y'}^2}} \,. \quad (2.96) $$

Equation (2.96) implies that the correlation coefficient is equal to zero if either σ_y′ = σ_x′ or if φ is a multiple of π/2. The following relation gives tan 2φ in terms of the elements of the covariance matrix:

$$ \tan 2\phi = \frac{2\, \rho_{xy}\, \sigma_x \sigma_y}{\sigma_y^2 - \sigma_x^2} \,. \quad (2.97) $$
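These relations can be verified numerically by building the covariance matrix from principal widths σ_x′, σ_y′ and rotation angle φ, and then checking Eqs. (2.92) and (2.97). A pure-Python sketch with hypothetical parameter values:

```python
import math

sx_p, sy_p, phi = 1.0, 2.5, 0.4   # hypothetical sigma_x', sigma_y', angle
c, s = math.cos(phi), math.sin(phi)

# covariance matrix of the rotated variables, following the Eq. (2.93) pattern
cxx = c * c * sx_p ** 2 + s * s * sy_p ** 2
cyy = s * s * sx_p ** 2 + c * c * sy_p ** 2
cxy = s * c * (sy_p ** 2 - sx_p ** 2)

sx, sy = math.sqrt(cxx), math.sqrt(cyy)
rho = cxy / (sx * sy)

# Eq. (2.92): det(C^-1) = 1/(sigma_x'^2 sigma_y'^2) = 1/(sx^2 sy^2 (1 - rho^2))
det_inv = 1.0 / (sx ** 2 * sy ** 2 * (1.0 - rho ** 2))
# Eq. (2.97): tan(2 phi) recovered from the covariance matrix elements
tan2phi = 2.0 * rho * sx * sy / (sy ** 2 - sx ** 2)
```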
The transformed PDF can finally be written in terms of all the results obtained so far:

$$ g(x, y) = \frac{1}{2\pi\, \sigma_x \sigma_y \sqrt{1 - \rho_{xy}^2}} \exp\left[ -\frac{1}{2\left( 1 - \rho_{xy}^2 \right)} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x\, y\, \rho_{xy}}{\sigma_x \sigma_y} \right) \right] . \quad (2.98) $$

The 1σ contour is the ellipse defined by:

$$ \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x\, y\, \rho_{xy}}{\sigma_x \sigma_y} = 1 \,. \quad (2.99) $$
Fig. 2.20 One-sigma contour for a two-dimensional Gaussian PDF. The two ellipse axes have lengths equal to σ_x′ and σ_y′; the x′ axis is rotated by an angle φ with respect to the x axis, and the lines tangent to the ellipse parallel to the x and y axes, shown in gray, have distances from the respective axes equal to σ_y and σ_x
It is possible to demonstrate that the horizontal and vertical lines tangent to the ellipse defined in Eq. (2.99) have distances from their respective axes equal to σ_y and σ_x.
Similarly to the 1σ contour, defined in Eq. (2.99), the 2σ contour is defined by:

$$ \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x\, y\, \rho_{xy}}{\sigma_x \sigma_y} = 2^2 = 4 \,. \quad (2.100) $$
Fig. 2.21 Plot of two-dimensional 1σ and 2σ Gaussian contours. Each one-dimensional projection of the 1σ or 2σ contour corresponds to a band which has a 68.27% or 95.45% probability content, respectively. As an example, three possible projections are shown: a vertical, a horizontal and a diagonal one. The probability content of the ellipses is smaller than the corresponding one-dimensional projected interval probabilities
In one dimension, the condition

$$ \frac{x^2}{\sigma_x^2} = 1 \quad \Longrightarrow \quad x = \pm\, \sigma_x \quad (2.103) $$

defines the 1σ interval. The probability values corresponding to Zσ one-dimensional intervals for a Gaussian distribution are determined by Eq. (2.32). The corresponding result for the two-dimensional case can be obtained by integrating g(x, y) in two dimensions over the ellipse E_Z corresponding to Zσ:

$$ P_{2\mathrm{D}}(Z) = \int_{E_Z} g(x, y)\, \mathrm{d}x\, \mathrm{d}y \,, \quad (2.104) $$

where

$$ E_Z = \left\{ (x, y) : \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x\, y\, \rho_{xy}}{\sigma_x \sigma_y} \le Z^2 \right\} . \quad (2.105) $$
The probabilities corresponding to 1σ, 2σ and 3σ for the one- and two-dimensional cases are reported in Table 2.1. The two-dimensional integrals are, in all cases, smaller than the one-dimensional ones for a given Z. In particular, in order to recover the same probability content as the corresponding one-dimensional interval, one would need to artificially enlarge a two-dimensional ellipse from 1σ to 1.515σ, from 2σ to 2.486σ, and from 3σ to 3.439σ. Usually, results are reported in the literature as 1σ and 2σ contours, without any artificial interval enlargement, and the conventional probability content of 68.27% or 95.45% refers to any one-dimensional projection of those contours.
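For an uncorrelated (or suitably rotated) two-dimensional Gaussian, the probability inside the Zσ ellipse has the closed form P₂D(Z) = 1 − e^{−Z²/2}, while the one-dimensional interval has P₁D(Z) = erf(Z/√2); comparing the two reproduces the entries of Table 2.1 and the enlargement factors quoted above (a minimal sketch):

```python
import math

def p1d(z):
    # probability within +-z sigma in one dimension
    return math.erf(z / math.sqrt(2.0))

def p2d(z):
    # probability inside the z-sigma ellipse in two dimensions
    return 1.0 - math.exp(-0.5 * z * z)

table = {z: (p1d(z), p2d(z)) for z in (1.0, 2.0, 3.0)}
# p2d(z) < p1d(z) for every z; enlarging to 1.515 sigma recovers 68.27%:
recovered = p2d(1.515)
```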
The generalization to n dimensions of the two-dimensional Gaussian described in Eq. (2.98) is:

$$ g(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\, |C|^{1/2}} \exp\left[ -\frac{1}{2} \sum_{i,j} (x_i - \mu_i)\, \big(C^{-1}\big)_{ij}\, (x_j - \mu_j) \right] , \quad (2.108) $$

where μᵢ is the average of the variable xᵢ and C is the n × n covariance matrix of the variables x₁, …, xₙ.
References

1. Bjorken, J., Drell, S.: Relativistic Quantum Fields. McGraw-Hill, New York (1965)
2. ARGUS Collaboration, Albrecht, H., et al.: Search for hadronic b → u decays. Phys. Lett. B 241, 278–282 (1990)
3. Gaiser, J.: Charmonium spectroscopy from radiative decays of the J/ψ and ψ′. Ph.D. thesis, Stanford University (1982). Appendix F
4. Landau, L.: On the energy loss of fast particles by ionization. J. Phys. (USSR) 8, 201 (1944)
5. Allison, W., Cobb, J.: Relativistic charged particle identification by energy loss. Annu. Rev. Nucl. Part. Sci. 30, 253–298 (1980)
Chapter 3
Bayesian Approach to Probability
3.1 Introduction
$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \,. \quad (3.1) $$
Fig. 3.1 Visual representation of the probabilities P(A) and P(B) and of the conditional probabilities P(A|B) and P(B|A) as ratios of areas
The probability of the event B given the event A, vice versa, can be written as:

$$ P(B \mid A) = \frac{P(A \cap B)}{P(A)} \,, \quad (3.2) $$

from which Bayes' theorem can be derived in the following form:

$$ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \,. \quad (3.4) $$
The probability P(A) can be interpreted as the probability of the event A before the knowledge that the event B has occurred (prior probability), while P(A | B) is the probability of the same event A with the further information that the event B has occurred (posterior probability).
A visual derivation of Bayes' theorem is presented in Fig. 3.2, using the visual notation of Fig. 3.1.
3.2 Bayes' Theorem
Fig. 3.2 Visualization of Bayes' theorem. The areas of the events A and B, equal to P(A) and P(B), respectively, simplify when P(A|B) P(B) and P(B|A) P(A) are multiplied, giving in both cases P(A ∩ B). Representation by R. Cousins
The answer to our question is P(ill | +). Using Bayes' theorem, the conditional probability can be 'inverted' as follows:

$$ P(\mathrm{ill} \mid +) = \frac{P(+ \mid \mathrm{ill})\, P(\mathrm{ill})}{P(+)} \simeq \frac{P(\mathrm{ill})}{P(+)} \,. \quad (3.10) $$
A missing ingredient in the problem can be identified from Eq. (3.10): P(ill), the probability that a random person in the population under consideration is really ill (regardless of any possibly performed test), was not given. In a normal situation of a generally healthy population, we can expect P(ill) ≪ P(healthy). Using:

$$ P(+) = P(+ \mid \mathrm{ill})\, P(\mathrm{ill}) + P(+ \mid \mathrm{healthy})\, P(\mathrm{healthy}) $$

and:

$$ P(+ \mid \mathrm{ill}) \simeq 1 \,, \qquad P(\mathrm{healthy}) \simeq 1 \,, $$

the probability P(ill | +) can then be written as:

$$ P(\mathrm{ill} \mid +) = \frac{P(\mathrm{ill})}{P(+)} \simeq \frac{P(\mathrm{ill})}{P(\mathrm{ill}) + P(+ \mid \mathrm{healthy})} \,. \quad (3.14) $$
If P(ill) is smaller than P(+ | healthy), P(ill | +) turns out to be smaller than 50%. For instance, if P(ill) = 0.15%, compared with the assumed P(+ | healthy) = 0.2%, then:

$$ P(\mathrm{ill} \mid +) = \frac{0.15}{0.15 + 0.20} = 43\% \,. \quad (3.15) $$
The probability of being really ill, given the positive diagnosis, is very different from the naïve conclusion, according to which one would most likely be really ill. The situation is visualized, with the proportions somewhat altered for better readability, in Fig. 3.3.
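The arithmetic of the example is compactly reproduced below; the assumed near-perfect sensitivity P(+ | ill) ≃ 1 is made explicit:

```python
p_ill = 0.0015          # P(ill): prevalence, as assumed in the example
p_pos_healthy = 0.002   # P(+ | healthy): probability of a false positive
p_pos_ill = 1.0         # P(+ | ill): test assumed ~100% sensitive

p_healthy = 1.0 - p_ill
p_pos = p_pos_ill * p_ill + p_pos_healthy * p_healthy   # law of total probability
p_ill_pos = p_pos_ill * p_ill / p_pos                   # Bayes' theorem, ~43%
```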
Fig. 3.3 Visualization of the ill/healthy problem. The red areas correspond to the cases of a positive diagnosis for an ill person (P(+ | ill), vertical red area) and a positive diagnosis for a healthy person (P(+ | healthy), horizontal red area). The probability of being really ill in the case of a positive diagnosis, P(ill | +), is equal to the ratio of the vertical red area to the total red area. In the example it was assumed that P(− | ill) is very small
$$ P(\mu \mid +) = \frac{P(+ \mid \mu)\, P(\mu)}{P(+)} = \frac{P(+ \mid \mu)\, P(\mu)}{P(+ \mid \mu)\, P(\mu) + P(+ \mid \pi)\, P(\pi)} \,. \quad (3.16) $$

Writing ε = P(+ | μ) and δ = P(+ | π) for the selection probabilities, and f_μ = P(μ), f_π = P(π) for the fractions of the two particle species in the sample, the fraction of muons in the selected sample is:

$$ f_{\mathrm{sel}} = P(\mu \mid +) = \frac{\varepsilon\, f_\mu}{\varepsilon\, f_\mu + \delta\, f_\pi} \,. \quad (3.17) $$
$$ \frac{P(\mu \mid +)}{P(\pi \mid +)} = \frac{P(+ \mid \mu)\, P(\mu)}{P(+ \mid \pi)\, P(\pi)} \,. \quad (3.18) $$
The above expression also holds if more than two possible particle types are present in the sample (say muons, pions, and kaons), and it does not require computing the denominator present in Eq. (3.16), which would be needed, instead, in order to compute the individual probabilities for all possible particle hypotheses.
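A numerical sketch of the selection (the fractions and selection probabilities below are hypothetical values, not taken from the text) shows how Eqs. (3.17) and (3.18) relate:

```python
f_mu, f_pi = 0.1, 0.9        # assumed prior fractions P(mu), P(pi)
eff_mu, eff_pi = 0.95, 0.05  # assumed P(+ | mu), P(+ | pi)

# posterior odds, Eq. (3.18): no normalization denominator needed
odds = (eff_mu * f_mu) / (eff_pi * f_pi)

# muon fraction of the selected sample, Eq. (3.17)
f_sel = eff_mu * f_mu / (eff_mu * f_mu + eff_pi * f_pi)
# the two are equivalent: f_sel = odds / (1 + odds)
```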
Section 3.6 will discuss posterior odds and their use further.

3.3 Bayesian Probability Definition

In the above Examples 3.8 and 3.9, Bayes' theorem was applied to cases that can be considered under the frequentist domain. The formulation of Bayes' theorem, as from Eq. (3.4):

$$ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \,, \quad (3.19) $$
can also be interpreted as follows: before we know that B is true, our degree of belief in the event A is equal to the prior probability P(A); after we know that B is true, our degree of belief in the event A changes and becomes equal to the posterior probability P(A | B).
Note that prior and posterior probabilities apply not only to the case in which the former is the probability before and the latter after the event B has occurred, in the chronological sense; more generally, they refer to before and after the knowledge of B, i.e. B may also have occurred (or not), but we do not have any knowledge about B yet.
Using this interpretation, the definition of probability, in this new Bayesian sense, can be extended to events that are not associated with random outcomes of repeatable experiments, but may represent statements about unknown facts, like "my football team will win the next match", or "the mass of a dark-matter candidate particle is between 1000 and 1500 GeV". We can consider a prior probability P(A) of such an unknown statement, representing a measure of our 'prejudice' about that statement before the acquisition of any information that could modify our knowledge. After we know that the event B has occurred, our knowledge of A must change, and our degree of belief must become equal to the posterior probability P(A | B). In other words, Bayes' theorem gives us a quantitative prescription about how to rationally change our subjective degree of belief from an initial prejudice in the light of newly available information. Note, however, that starting from different priors (i.e. different prejudices), different posteriors will be determined.
The term P(B) that appears in the denominator of Eq. (3.4) can be considered as a normalization factor. The sample space Ω can be decomposed into a partition A₁, …, A_N, where:

$$ \bigcup_{i=1}^{N} A_i = \Omega \quad \text{and} \quad A_i \cap A_j = \emptyset \;\; \forall\, i \ne j \,, \quad (3.20) $$

in order to apply the law of total probability in Eq. (1.12), as already done in Examples 3.8 and 3.9, discussed in the previous section:

$$ P(B) = \sum_{i=1}^{N} P(B \mid A_i)\, P(A_i) \,. \quad (3.21) $$
This corresponds to the belief that A₀ is absolutely true, and all other alternatives Aᵢ are absolutely false for i ≠ 0. Whatever knowledge of any event B is achieved, the posterior probability of any Aᵢ will not be different from the prior probability:

$$ P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{P(B)} \,, \quad (3.24) $$

but, if i ≠ 0, clearly:

$$ P(A_i \mid B) = \frac{P(B \mid A_i) \cdot 0}{P(B)} = 0 = P(A_i) \,. \quad (3.25) $$

This situation reflects the case that we may call dogma, or religious belief, i.e. the case in which someone has such strong prejudice about the Aᵢ that no event B, i.e. no new knowledge, can change his or her degree of belief.
The scientific method has allowed mankind's knowledge of Nature to evolve through history by progressively adding knowledge based on the observation of new experimental evidence. The history of science is full of examples in which theories believed to be true have been falsified by new or more precise observations, and better theories have replaced the old ones.
According to Eq. (3.23), instead, scientific progress would not be possible in the presence of religious beliefs about observable facts.
3.4 Bayesian Probability and Likelihood Functions
Given a sample (x₁, …, xₙ) of n random variables whose PDF has a known form that depends on m parameters, θ₁, …, θₘ, the likelihood function is defined as the probability density at the point (x₁, …, xₙ) for fixed values of the parameters θ₁, …, θₘ:

$$ L(x_1, \ldots, x_n; \theta_1, \ldots, \theta_m) = \left. \frac{\mathrm{d}P(x_1, \ldots, x_n)}{\mathrm{d}x_1 \cdots \mathrm{d}x_n} \right|_{\theta_1, \ldots, \theta_m} \,. \quad (3.27) $$
The posterior PDF of the parameters, given the observed sample, is:

$$ P(\theta_1, \ldots, \theta_m \mid x_1, \ldots, x_n) = \frac{L(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_m)\; \pi(\theta_1, \ldots, \theta_m)}{\int L(x_1, \ldots, x_n \mid \theta'_1, \ldots, \theta'_m)\; \pi(\theta'_1, \ldots, \theta'_m)\, \mathrm{d}^m \theta'} \,, \quad (3.28) $$

where the probability distribution function π(θ₁, …, θₘ) is the prior PDF of the parameters θ₁, …, θₘ, i.e. our degree of belief about the unknown parameters before the observation of (x₁, …, xₙ). The denominator in Eq. (3.28), coming from an extension of the law of total probability, is clearly interpreted as a normalization of the posterior PDF.
Fred James et al. wrote the following about the posterior probability density given by Eq. (3.28):

The difference between π(θ) and P(θ | x) shows how one's knowledge (degree of belief) about θ has been modified by the observation x. The distribution P(θ | x) summarizes all one's knowledge of θ and can be used accordingly [3].
where again a normalization factor was omitted. Equation (3.31) can be interpreted
as the application of Bayes’ theorem to the observation of x2 having as prior
probability P1 ./, which was the posterior probability after the observation of x1
(Eq. (3.29)).
Considering a third independent observation x₃, Bayes' theorem again gives a posterior proportional to L(x₃ | θ) P₂(θ), and so on for any further observations.
This section discusses how the estimate of unknown parameters and their uncertainties can be addressed using the posterior Bayesian PDF of the unknown parameters, given by Eq. (3.28). Chapter 5, and in more detail Chap. 7, will discuss the estimate of unknown parameters and their uncertainties more generally, in particular also using a frequentist approach.
First, the most likely values θ̂⃗ of the unknown parameters (or the most likely value, in case of a single parameter) can be determined as the maximum of the posterior PDF. The most likely values can be taken as measurement of the unknown parameters, affected by an uncertainty that will be discussed in Sect. 3.5.2. Similarly, the average value ⟨θ⃗⟩ can also be determined from the same posterior PDF.
If no prior information is available about the parameters θ⃗, the prior density should not privilege particular parameter values. In those cases, the prior is called an uninformative prior. A uniform distribution may appear the most natural choice, but it would not be invariant under reparametrization, as discussed in Sect. 2.1. An invariant uninformative prior is defined in Sect. 3.8.
If a uniform prior distribution is assumed, the most likely parameter values are
the ones that give the maximum likelihood, since the posterior PDF is equal, up to
a normalization constant, to the likelihood function:
$$ P(\vec\theta \mid \vec x\,)\Big|_{\pi(\vec\theta\,) = \mathrm{const.}} = \frac{L(\vec x; \vec\theta\,)}{\int L(\vec x; \vec\theta\,')\,\mathrm{d}^h\theta'} . \qquad (3.34) $$
This gives results similar to the frequentist approach, as noted in Sect. 5.10, where
the maximum likelihood estimator will be introduced. The same result, of course,
does not necessarily hold in case of a non-uniform prior PDF.
Usually, in Bayesian applications, the computation of posterior probabilities and, in general, of most of the quantities of interest requires integrations that, in the vast majority of realistic cases, can only be performed using computer algorithms. Markov chain Monte Carlo (see Sect. 4.8) is one of the most effective numerical integration methods for Bayesian computations.
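As a minimal illustration of Eq. (3.28), the posterior for a single Gaussian measurement with a uniform prior can be normalized by direct numerical integration on a grid; all numbers below are illustrative, not taken from the text:

```python
import numpy as np

# A sketch of Eq. (3.28) by direct numerical integration: one observed
# value x from a Gaussian with known sigma, and a uniform prior on the
# mean mu (illustrative numbers).
x_obs, sigma = 1.2, 0.5
mu = np.linspace(-3.0, 5.0, 2001)              # grid over the parameter
likelihood = np.exp(-0.5 * ((x_obs - mu) / sigma) ** 2)
prior = np.ones_like(mu)                       # uniform prior pi(mu)
unnorm = likelihood * prior
dmu = mu[1] - mu[0]
posterior = unnorm / (unnorm.sum() * dmu)      # denominator of Eq. (3.28)

mode = mu[np.argmax(posterior)]                # most likely value
mean = (mu * posterior).sum() * dmu            # posterior average
```

With a uniform prior the posterior mode coincides with the maximum of the likelihood, as discussed around Eq. (3.34).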
In the presence of nuisance parameters ν⃗ in addition to the parameters of interest θ⃗, the posterior PDF is:

$$ P(\vec\theta, \vec\nu \mid \vec x\,) = \frac{L(\vec x; \vec\theta, \vec\nu\,)\,\pi(\vec\theta, \vec\nu\,)}{\int L(\vec x; \vec\theta\,', \vec\nu\,')\,\pi(\vec\theta\,', \vec\nu\,')\,\mathrm{d}^h\theta'\,\mathrm{d}^l\nu'} , \qquad (3.35) $$

and the posterior PDF for the parameters θ⃗ only can be obtained as marginal PDF, integrating Eq. (3.35) over all the remaining parameters ν⃗:

$$ P(\vec\theta \mid \vec x\,) = \int P(\vec\theta, \vec\nu \mid \vec x\,)\,\mathrm{d}^l\nu = \frac{\int L(\vec x; \vec\theta, \vec\nu\,)\,\pi(\vec\theta, \vec\nu\,)\,\mathrm{d}^l\nu}{\int L(\vec x; \vec\theta\,', \vec\nu\,)\,\pi(\vec\theta\,', \vec\nu\,)\,\mathrm{d}^h\theta'\,\mathrm{d}^l\nu} . \qquad (3.36) $$

Using Eq. (3.36), nuisance parameters can be treated, under the Bayesian approach, with a simple integration. Section 10.10 will discuss in more detail the treatment of nuisance parameters, including the frequentist approach.
$$ \theta = \hat\theta \pm \delta , \qquad (3.37) $$
Fig. 3.4 Different probability interval choices at 68.27% (top) or 90% (bottom) shown as shaded areas. The most probable value is shown as a dashed vertical line. Top left: central interval; the left and right tails have equal probability. Top right: symmetric interval; the most probable value lies at the center of the interval. Bottom left: fully asymmetric interval for an upper limit. Bottom right: fully asymmetric interval for a lower limit
$$ \frac{1}{n!} \int_0^{\infty} s^n\,e^{-s}\,\mathrm{d}s = \left. \frac{\gamma(n+1, s)}{n!} \right|_0^{\infty} = 1 . \qquad (3.40) $$
$$ \hat s = n . \qquad (3.42) $$
The average value and variance of s can also be determined from Eq. (3.41):

$$ \langle s \rangle = n + 1 , \qquad (3.43) $$
$$ \mathrm{Var}[s] = n + 1 . \qquad (3.44) $$

Note that the most probable value ŝ is different from the average value ⟨s⟩, since the distribution is not symmetric.
Those results depend of course on the choice of the prior π(s). A constant prior could not be the most 'natural' choice. Considering the prior choice due to Jeffreys, discussed in Sect. 3.8, a prior π(s) ∝ 1/√s should be chosen.
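The Poisson posterior with a uniform prior can be checked numerically on a grid; the mode, average and variance reproduce Eqs. (3.42)–(3.44):

```python
import math

import numpy as np

# The Poisson posterior with a uniform prior on s (Eq. 3.41):
# P(s | n) = s**n * exp(-s) / n!, evaluated on a grid for n = 5.
n = 5
s = np.linspace(0.0, 30.0, 30001)
posterior = s**n * np.exp(-s) / math.factorial(n)

ds = s[1] - s[0]
norm = posterior.sum() * ds                      # ~1, as in Eq. (3.40)
mode = s[np.argmax(posterior)]                   # s_hat = n       (Eq. 3.42)
mean = (s * posterior).sum() * ds                # <s> = n + 1     (Eq. 3.43)
var = ((s - mean) ** 2 * posterior).sum() * ds   # Var[s] = n + 1  (Eq. 3.44)
```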
Fig. 3.5 Poisson posterior PDFs for n = 5 with a central 68.27% probability interval and for n = 0 with a fully asymmetric 90% probability interval. Intervals are shown as shaded areas
As seen at the end of Example 3.9, there is a convenient way to compare the
probability of two hypotheses using Bayes’ theorem which does not require the
knowledge of all possible hypotheses.
Using Bayes' theorem, one can write the ratio of posterior probabilities evaluated under two hypotheses H₀ and H₁, given our observation x⃗, called posterior odds of the hypothesis H₁ versus the hypothesis H₀:
where π(H₁) and π(H₀) are the priors for the two hypotheses. The ratio of the priors, π(H₁)/π(H₀), is called prior odds. Finally, the ratio:

$$ B_{1/0} = \frac{P(\vec x \mid H_1)}{P(\vec x \mid H_0)} \qquad (3.46) $$
The Bayes factor is equal to the posterior odds if priors are identical for the two
hypotheses.
The computation of the Bayes factor in practice requires the introduction of the likelihood function, as present in Eq. (3.28). In the simplest case, in which no parameter θ⃗ is present in either of the two hypotheses, the Bayes factor is equal to the likelihood ratio of the two hypotheses. If parameters are present, the probability densities P(x⃗ | H₀,₁) should be computed by integrating the product of the likelihood function and the prior over the parameter space:

$$ P(\vec x \mid H_0) = \int L(\vec x \mid H_0, \vec\theta_0)\,\pi_0(\vec\theta_0)\,\mathrm{d}\theta_0 , \qquad (3.48) $$
$$ P(\vec x \mid H_1) = \int L(\vec x \mid H_1, \vec\theta_1)\,\pi_1(\vec\theta_1)\,\mathrm{d}\theta_1 . \qquad (3.49) $$
The scale for the Bayes factor used to assess the evidence of H₁ against H₀, as proposed in [4], is reported in Table 3.1.
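For two simple hypotheses with no free parameters, the Bayes factor of Eq. (3.46) reduces to a likelihood ratio. A sketch for a single Poisson count compared under background-only and signal-plus-background hypotheses (the numbers are illustrative, not from the text):

```python
import math

# Bayes factor for two simple hypotheses (Eq. 3.46): the likelihood
# ratio of a Poisson count n under H1 (mean s + b) and H0 (mean b).
def poisson_prob(n: int, mu: float) -> float:
    """Poisson probability P(n | mu)."""
    return mu**n * math.exp(-mu) / math.factorial(n)

n_obs = 8
b = 3.0        # expected background (H0); illustrative value
s = 5.0        # expected signal on top of background (H1); illustrative

bayes_factor = poisson_prob(n_obs, s + b) / poisson_prob(n_obs, b)
```

Here the observed count favors H₁, so the Bayes factor is well above one; it would then be read against a scale such as the one in Table 3.1.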
is absorbed in the determinant that appears in the Fisher information, from Eq. (3.52), expressed in the transformed coordinates.
Jeffreys' priors corresponding to the parameters of some of the most frequently used PDFs are given in Table 3.2. Note that only the mean of a Gaussian corresponds to a uniform Jeffreys' prior. For instance, for a Poissonian counting experiment, like the one considered in Example 3.11, Jeffreys' prior is proportional to 1/√s, not uniform, as assumed in order to determine Eq. (3.41).
Table 3.2 Jeffreys' priors corresponding to the parameters of some of the most frequently used PDFs

  PDF parameter                                   Jeffreys' prior
  Poissonian mean s                               p(s) ∝ 1/√s
  Poissonian signal mean s with a background b    p(s) ∝ 1/√(s + b)
  Gaussian mean μ                                 p(μ) ∝ 1
  Gaussian standard deviation σ                   p(σ) ∝ 1/σ
  Binomial success fraction ε                     p(ε) ∝ 1/√(ε (1 − ε))
  Exponential parameter λ                         p(λ) ∝ 1/λ
In many cases, like in Table 3.2, Jeffreys' priors have a diverging integral over the entire parameter domain. This is also the case when a uniform prior is chosen over an infinite range. The integrals present in the evaluation of the Bayesian posterior, which involve the product of the likelihood function and the prior, are anyway finite. Such priors are called improper prior distributions.
$$ \mathcal{J}(\tau) \propto \frac{1}{\tau^2} , \qquad (3.59) $$

and Jeffreys' prior is:

$$ \pi(\tau) \propto \sqrt{\mathcal{J}(\tau)} \propto \frac{1}{\tau} . \qquad (3.60) $$
Figure 3.6 shows the posterior distributions p(τ; t⃗) for randomly extracted datasets with τ = 1 using a uniform prior on τ, a uniform prior on λ = 1/τ and Jeffreys' prior, for N = 5, 10 and 50. The differences among the three cases become less relevant as the number of measurements N increases.
The treatment of the same case with the frequentist approach is discussed in Example 5.19.
[Figure 3.6: three panels of p(τ; t₁, …, t_N) versus τ, with legend: uniform prior on λ = 1/τ (π(τ) ∝ 1/τ²); Jeffreys' prior (π(τ) ∝ 1/τ); uniform prior on τ (π(τ) = const.); data histogram (entries / 10)]
Fig. 3.6 Posterior distribution for τ using a uniform prior on τ (dashed line), a uniform prior on λ = 1/τ (dotted line) and Jeffreys' prior (solid line). Data are shown as a blue histogram. A number of measurements N = 5 (top), 10 (middle) and 50 (bottom) has been considered
Given f′(x′, y′), one can determine again, for the transformed variables x′ and y′, the most likely values and credible intervals. The generalization to more variables, x⃗ → x⃗′, is straightforward.
Something to be noted is that the most probable values x⃗̂, i.e. the values that maximize f(x⃗), do not necessarily map into values x⃗̂′ = X⃗(x⃗̂) that maximize f′(x⃗′), just as the average values of x⃗ are not necessarily transformed into the averages of x⃗′: it was already noted in Sect. 2.10, for instance, that ⟨e^y⟩ ≠ e^⟨y⟩ if y is a normal random variable.
Issues with non-trivial transformations of variables and error propagation are also present in the frequentist approach. Section 5.15 will briefly discuss the related case of propagation of asymmetric uncertainties. Section 5.14 will discuss how to propagate errors in the case of transformation of variables using a linear approximation, and the results hold for Bayesian as well as for frequentist inference. Under this simplified assumption, which is a sufficient approximation only in the presence of small uncertainties, one may assume that values that maximize f map into values that maximize f′, i.e. (x̂′, ŷ′) ≃ (X(x̂), Y(ŷ)).
References
5. D’Agostini, G.: Bayesian Reasoning in Data Analysis: A Critical Introduction. World Scientific,
Hackensack (2003)
6. Jeffreys, H.: An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A Math. Phys. Sci. 186, 453–461 (1946)
7. Demortier, L., Jain, S., Prosper, H.B.: Reference priors for high energy physics. Phys. Rev. D82,
034002 (2010)
Chapter 4
Random Numbers and Monte Carlo Methods
Many computer applications, ranging from simulations to video games and 3D graphics, take advantage of computer-generated numeric sequences that have properties very similar to those of truly random variables. Sequences generated by computer algorithms through mathematical operations are not really random, having no intrinsic unpredictability, and are necessarily deterministic and reproducible. Indeed, the possibility to reproduce exactly the same sequence of computer-generated numbers is often a good feature for many applications.
Good algorithms that generate 'random' numbers or, more precisely, pseudorandom numbers, given their reproducibility, must reproduce, in the limit of large numbers, the desired statistical properties of real random variables, with the limitation that pseudorandom sequences can be large, but not infinite.
Considering that computers have finite machine precision, pseudorandom numbers, in practice, have discrete possible values, depending on the number of bits used to store floating-point variables.
Numerical methods involving the repeated use of computer-generated pseudo-
random numbers are also known as Monte Carlo methods, from the name of
the city hosting the famous casino, where the properties of (truly) random num-
bers resulting from roulette and other games are exploited in order to generate
profit.
In the following, we will sometimes refer to pseudorandom numbers simply as
random numbers, when the context creates no ambiguity.
Depending on the value of λ, the sequence may have very different possible behaviors. If the sequence converges to a single asymptotic value x for n → ∞, we have:

$$ \lim_{n\to\infty} x_n = x , \qquad (4.4) $$

and the asymptotic value must satisfy:

$$ x = \lambda\,x\,(1 - x) , \qquad (4.5) $$

whose non-trivial solution is:

$$ x = (\lambda - 1)/\lambda . \qquad (4.6) $$
$$ x_1 = \lambda\,x_2\,(1 - x_2) , \qquad (4.7) $$
$$ x_2 = \lambda\,x_1\,(1 - x_1) . \qquad (4.8) $$

For larger values of λ, up to 1 + √6, the sequence oscillates among four values, and further bifurcations occur for even larger values of λ, until it achieves a very complex and poorly predictable behavior. For λ = 4, the sequence finally densely covers the interval ]0, 1[. The PDF corresponding to the sequence with λ = 4 can be demonstrated to be a beta distribution with parameters α = β = 0.5, where the beta distribution is defined as:

$$ f(x; \alpha, \beta) = \frac{x^{\alpha-1}\,(1 - x)^{\beta-1}}{\int_0^1 u^{\alpha-1}\,(1 - u)^{\beta-1}\,\mathrm{d}u} . \qquad (4.9) $$
Fig. 4.1 Logistic map [2]
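The two regimes discussed above can be reproduced with a few lines of code, iterating the map for a convergent value of λ and for λ = 4 (the starting point 0.3 is an arbitrary illustrative choice):

```python
# Iterating the logistic map x_{n+1} = lambda * x_n * (1 - x_n).
# For lambda = 2.5 the sequence converges to (lambda - 1)/lambda = 0.6
# (Eq. 4.6); for lambda = 4 it wanders over ]0, 1[ without settling.
def logistic_sequence(lam: float, x0: float, n: int) -> list[float]:
    """Return the first n iterates of the logistic map starting from x0."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(lam * xs[-1] * (1.0 - xs[-1]))
    return xs

fixed = logistic_sequence(2.5, 0.3, 200)[-1]   # converges to 0.6
chaotic = logistic_sequence(4.0, 0.3, 200)     # no convergence
```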
The most widely used computer-based random number generators are conveniently
written in order to produce sequences of uniformly distributed numbers ranging
from zero to one.1 Starting from uniform random number generators, most of the
other distributions of interest can be derived using specific algorithms, some of
which are described in the following Sections.
The period of a random sequence, i.e. the number of extractions after which the sequence repeats itself, should be as large as possible, and in any case larger than the number of random numbers required by the specific application.
One example is the function lrand48 [3], which is a standard of the C programming language, defined according to the linear congruential recurrence of Eq. (4.10), x_{n+1} = (a x_n + c) mod m, with:

$$ m = 2^{48} , \qquad (4.11) $$
$$ a = 25214903917 = \mathrm{5DEECE66D}_{\mathrm{hex}} , \qquad (4.12) $$
$$ c = 11 = \mathrm{B}_{\mathrm{hex}} . \qquad (4.13) $$

The sequences obtained from Eq. (4.10) for given initial values x₀ are distributed uniformly, to a good approximation, between 0 and 2⁴⁸ − 1. The obtained sequences of random bits can be used to return 32-bit integer numbers, or can be mapped into sequences of floating-point numbers uniformly distributed in [0, 1[, as implemented in the C function drand48.
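A sketch of this generator, using the constants of Eqs. (4.11)–(4.13); note that the seed handling of the real lrand48/drand48 functions differs slightly from this simplified version:

```python
# A drand48-style linear congruential generator,
# x_{n+1} = (a * x_n + c) mod m, with the constants of Eqs. (4.11)-(4.13).
M = 2**48
A = 0x5DEECE66D   # 25214903917
C = 0xB           # 11

def lcg48(seed: int):
    """Yield an endless sequence of floats uniform in [0, 1[."""
    x = seed % M
    while True:
        x = (A * x + C) % M
        yield x / M   # map the 48-bit integer into [0, 1[

gen = lcg48(seed=12345)
values = [next(gen) for _ in range(1000)]
```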
The value x0 is called seed of the random sequence. By choosing different initial
seeds, different sequences are produced. In this way, one can repeat a computer-
simulated experiment using different random sequences, each time changing the
initial seed and obtaining different results, in order to simulate the statistical
fluctuations occurring in reality when repeating an experiment.
Similarly to lrand48, the gsl_rng_rand [4] generator of the BSD rand function uses the same algorithm but with a = 41C64E6D_hex, c = 3039_hex and m = 2³¹. The period of gsl_rng_rand is about 2³¹, which is lower than that of lrand48.
A popular random generator that offers good statistical properties is due to
Lüscher [5], implemented by F. James in the RANLUX generator [6], whose period
is of the order of 10171 . RANLUX is now considered relatively slower than other
algorithms, like the L’Ecuyer generator [7], which has a period of 288 , or the
¹ In realistic cases of finite numeric precision, one of the extreme values is excluded. Each individual value would have a corresponding zero probability in the case of infinite precision, but this is not exactly true with finite machine precision.
$$ x' = a + x\,(b - a) . \qquad (4.14) $$
Random variables may have discrete values, each corresponding to a given proba-
bility.
The simplest example is the simulation of a detector response whose efficiency is ε. In this case, one can generate a uniform random number r in [0, 1[; if r < ε the response is positive (i.e. the particle has been detected), otherwise the response is negative (i.e. the particle has not been detected).
If we have more possible values, 1, …, n, corresponding to probabilities P₁, …, Pₙ, one possibility is to store the cumulative probabilities into an array:

$$ C_k = \sum_{i=1}^{k} P_i , \quad k = 1, \dots, n - 1 , \qquad (4.15) $$
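The cumulative array of Eq. (4.15) can then be searched, e.g. with a binary search, to turn a uniform number into a discrete value (probabilities below are illustrative):

```python
import bisect
import random

# Sampling a discrete distribution via its cumulative probabilities
# (Eq. 4.15): draw r uniform in [0, 1[ and find the first k with r < C_k.
probs = [0.1, 0.2, 0.3, 0.4]               # P_1 ... P_n (illustrative)
cumulative = []
total = 0.0
for p in probs:
    total += p
    cumulative.append(total)               # C_k = P_1 + ... + P_k

def draw(rng: random.Random) -> int:
    """Return a value in 1..n with probability P_k, via binary search."""
    r = rng.random()
    return bisect.bisect_right(cumulative, r) + 1  # 1-based index

rng = random.Random(42)
counts = [0] * len(probs)
for _ in range(100_000):
    counts[draw(rng) - 1] += 1             # frequencies approach P_k
```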
$$ x = F^{-1}(r) . \qquad (4.17) $$

Since:

$$ r = F(x) , \qquad (4.18) $$

we have:

$$ \mathrm{d}r = \frac{\mathrm{d}F}{\mathrm{d}x}\,\mathrm{d}x = f(x)\,\mathrm{d}x . \qquad (4.19) $$

Introducing the differential probability dP, we have:

$$ \frac{\mathrm{d}P}{\mathrm{d}x} = f(x)\,\frac{\mathrm{d}P}{\mathrm{d}r} . \qquad (4.20) $$

Since r is uniformly distributed, dP/dr = 1, hence:

$$ \frac{\mathrm{d}P}{\mathrm{d}x} = f(x) , \qquad (4.21) $$

which demonstrates that x follows the desired PDF.
This method works conveniently only if the cumulative distribution F(x) can be easily computed and inverted using either analytical or numerical methods. If not, this algorithm may be very slow, and alternative implementations may provide better CPU performance.
$$ 1 - e^{-\lambda x} = r , \qquad (4.24) $$
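Solving Eq. (4.24) for x gives x = −log(1 − r)/λ, which can be implemented directly as an inverse-transform generator for the exponential distribution:

```python
import math
import random

# Inverse-transform sampling of an exponential PDF f(x) = lam * exp(-lam*x):
# solving 1 - exp(-lam*x) = r (Eq. 4.24) gives x = -log(1 - r) / lam.
def exponential_sample(lam: float, rng: random.Random) -> float:
    """Draw x >= 0 distributed as lam * exp(-lam * x)."""
    r = rng.random()                  # uniform in [0, 1[
    return -math.log(1.0 - r) / lam   # x = F^{-1}(r), Eq. (4.17)

rng = random.Random(1)
lam = 2.0
sample = [exponential_sample(lam, rng) for _ in range(200_000)]
mean = sum(sample) / len(sample)      # expect ~ 1/lam
```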
k is a normalization constant such that the PDF integrates to unity over the entire solid angle. From Eq. (4.26), the joint two-dimensional PDF can be factorized into the product of two PDFs, as functions of θ and φ:

$$ \frac{\mathrm{d}P}{\mathrm{d}\theta\,\mathrm{d}\phi} = f(\theta)\,g(\phi) = k \sin\theta , \qquad (4.27) $$

where:

$$ f(\theta) = \frac{\mathrm{d}P}{\mathrm{d}\theta} = c_1 \sin\theta , \qquad (4.28) $$
$$ g(\phi) = \frac{\mathrm{d}P}{\mathrm{d}\phi} = c_2 . \qquad (4.29) $$
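The two factors can be sampled independently with the inverse-transform method: φ is uniform in [0, 2π[, while for f(θ) ∝ sin θ the normalized cumulative is F(θ) = (1 − cos θ)/2, so θ = arccos(1 − 2r). A sketch:

```python
import math
import random

# Sampling a direction uniform over the solid angle: phi is uniform in
# [0, 2*pi[; from f(theta) ∝ sin(theta), the cumulative is
# F(theta) = (1 - cos(theta)) / 2, hence theta = arccos(1 - 2 r).
def random_direction(rng: random.Random) -> tuple[float, float]:
    """Return (theta, phi) distributed uniformly on the sphere."""
    theta = math.acos(1.0 - 2.0 * rng.random())
    phi = 2.0 * math.pi * rng.random()
    return theta, phi

rng = random.Random(7)
dirs = [random_direction(rng) for _ in range(100_000)]
mean_cos = sum(math.cos(t) for t, _ in dirs) / len(dirs)  # expect ~ 0
```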
By virtue of the central limit theorem (see Sect. 2.14), the sum of N random variables, each having a finite variance, is distributed, in the limit N → ∞, according to a Gaussian distribution. A finite but sufficiently large number of random numbers, x₁, …, x_N, can be extracted using a uniform random generator and remapped from [0, 1[ to [−√3, √3[ using Eq. (4.14), so that the average and variance of each xᵢ are equal to zero and one, respectively. Then, the following combination is computed:

$$ x = \frac{x_1 + \cdots + x_N}{\sqrt N} , \qquad (4.32) $$

which has again average equal to zero and variance equal to one. Figure 2.15 shows the distribution of x for N up to four, compared with a Gaussian distribution. The distribution of x, from Eq. (4.32), is necessarily truncated within the range [−√(3N), √(3N)], by construction, while a truly Gaussian variable has no upper nor lower bound.
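A sketch of this construction, with N = 12 terms (an arbitrary illustrative choice):

```python
import random
import statistics

# Approximate Gaussian generation via the central limit theorem:
# sum N uniform numbers remapped to [-sqrt(3), sqrt(3)[ (zero mean,
# unit variance) and divide by sqrt(N), as in Eq. (4.32).
def clt_gaussian(n_terms: int, rng: random.Random) -> float:
    """Return an approximately standard normal number."""
    s3 = 3.0 ** 0.5
    total = sum(-s3 + 2.0 * s3 * rng.random() for _ in range(n_terms))
    return total / n_terms ** 0.5

rng = random.Random(3)
sample = [clt_gaussian(12, rng) for _ in range(50_000)]
mean = statistics.fmean(sample)     # expect ~ 0
stdev = statistics.pstdev(sample)   # expect ~ 1
```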
The approach presented here is simple and may be instructive but, apart from the unavoidable approximations, it is not the most CPU-effective way to generate Gaussian random numbers, since several uniform extractions are needed in order to generate a single value.
The transformation from two variables r₁ and r₂, uniformly distributed in [0, 1[, into two variables z₁ and z₂ distributed according to a standard normal is called Box–Muller transformation [10]:

$$ r = \sqrt{-2 \log(1 - r_1)} , \qquad (4.34) $$
$$ \phi = 2\pi r_2 , \qquad (4.35) $$
$$ z_1 = r \cos\phi , \qquad (4.36) $$
$$ z_2 = r \sin\phi . \qquad (4.37) $$

A Gaussian number x with average μ and standard deviation σ is then obtained with the transformation:

$$ x = \mu + \sigma z . \qquad (4.38) $$
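The transformation of Eqs. (4.34)–(4.37) can be written compactly as:

```python
import math
import random
import statistics

# Box-Muller transformation (Eqs. 4.34-4.37): two uniform numbers in
# [0, 1[ are turned into two independent standard normal numbers.
def box_muller(rng: random.Random) -> tuple[float, float]:
    """Return a pair of independent standard normal deviates."""
    r1, r2 = rng.random(), rng.random()
    r = math.sqrt(-2.0 * math.log(1.0 - r1))   # Eq. (4.34)
    phi = 2.0 * math.pi * r2                   # Eq. (4.35)
    return r * math.cos(phi), r * math.sin(phi)

rng = random.Random(5)
pairs = [box_muller(rng) for _ in range(50_000)]
z = [v for pair in pairs for v in pair]
mean, stdev = statistics.fmean(z), statistics.pstdev(z)
```

Applying Eq. (4.38), `mu + sigma * z1`, then yields a Gaussian number with any desired average and standard deviation.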
More efficient generators for Gaussian random numbers exist. For instance, the
so-called Ziggurat algorithm is described in [11].
In case the cumulative distribution of a PDF cannot be easily computed and inverted, either analytically or numerically, other methods allow generating random numbers according to the desired PDF with reasonably good CPU performance.
[Figure: sketch of the hit-or-miss method — the function f(x) on [a, b[, bounded from above by a horizontal line at its maximum m; extracted points below the curve are 'hits', points above are 'misses']
² I.e. $\int_a^b f(x)\,\mathrm{d}x$ may be different from one.
In other words, the method has an efficiency (i.e. the fraction of accepted values of x) equal to:

$$ \varepsilon = \frac{\int_a^b f(x)\,\mathrm{d}x}{(b - a)\,m} , \qquad (4.39) $$

which may lead to a suboptimal use of computing power, in particular if the shape of f(x) is very peaked.
Hit-or-miss Monte Carlo can also be applied to multidimensional cases with no conceptual difference: first a multidimensional point x⃗ = (x₁, …, xₙ) is extracted, then x⃗ is accepted or rejected according to a random extraction r ∈ [0, m[, compared with f(x₁, …, xₙ).
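A one-dimensional sketch of the hit-or-miss method, for an illustrative peaked function f(x) with maximum m = 1 on [0, 1[:

```python
import math
import random

# Hit-or-miss sampling of f(x) on [a, b[: draw x uniform in [a, b[ and
# r uniform in [0, m[, and accept x when r < f(x). Here f is a peaked
# (truncated Gaussian-shaped) function chosen for illustration.
def f(x: float) -> float:
    return math.exp(-0.5 * ((x - 0.5) / 0.1) ** 2)

def hit_or_miss(a: float, b: float, m: float, rng: random.Random) -> float:
    """Repeat extractions until a value x is accepted (a 'hit')."""
    while True:
        x = a + (b - a) * rng.random()   # uniform in [a, b[, Eq. (4.14)
        r = m * rng.random()             # uniform in [0, m[
        if r < f(x):
            return x                     # hit

rng = random.Random(9)
sample = [hit_or_miss(0.0, 1.0, 1.0, rng) for _ in range(20_000)]
mean = sum(sample) / len(sample)         # expect ~ 0.5 (peak position)
```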
If the function f(x) is very peaked, the efficiency ε of the hit-or-miss method (Eq. (4.39)) may be very low. The algorithm may be adapted in order to improve the efficiency by identifying, in a preliminary stage of the algorithm, a partition of the interval [a, b[ such that in each subinterval the function f(x) has a smaller variation than in the overall range. In each subinterval, the maximum of f(x) is estimated, as sketched in Fig. 4.3. The modified algorithm proceeds as follows: first of all, a subinterval of the partition is randomly chosen with a probability proportional to its area (see Sect. 4.4); then, the hit-or-miss approach is followed in the corresponding rectangle. This approach is often called importance sampling.
A possible variation of this method is to use, instead of the aforementioned partition, an 'envelope' for the function f(x), i.e. a function g(x) that always lies above f(x), and for which a convenient method to extract x according to the normalized distribution g(x) is known. A concrete case of this method is presented in Example 4.16.
It is evident that the efficiency of importance sampling may be significantly larger than that of the 'plain' hit-or-miss Monte Carlo if the partition or envelope is properly chosen.
Monte Carlo methods are often used as numerical techniques in order to compute integrals. The hit-or-miss method described in Sect. 4.6.1, for instance, estimates the integral $\int_a^b f(x)\,\mathrm{d}x$ from the fraction of the accepted hits n̂ over the total number of extractions N:

$$ I = \int_a^b f(x)\,\mathrm{d}x \simeq \hat I = (b - a)\,\frac{\hat n}{N} . \qquad (4.42) $$
With this approach, n̂ follows a binomial distribution. If n̂ is not too close to either 0 nor N, Eq. (5.11) gives an approximate error on Î:³

$$ \delta\hat I = (b - a)\,\sqrt{\frac{\hat I\,(1 - \hat I)}{N}} . \qquad (4.43) $$
The error on Î from Eq. (4.43) decreases as 1/√N. This result holds also if the hit-or-miss method is applied to a multidimensional integration, regardless of the number of dimensions d of the problem. Other numerical methods, not based on random number extractions, may suffer from severe computing time penalties as the number of dimensions d increases. This makes Monte Carlo methods advantageous in cases of a large number of dimensions.
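A sketch of Eqs. (4.42)–(4.43) for f(x) = x² on [0, 1[ with m = 1 (exact value 1/3):

```python
import random

# Hit-or-miss estimate of an integral with its statistical error
# (Eqs. 4.42-4.43), for f(x) = x**2 on [0, 1[ with m = 1, so that
# (b - a) * m = 1; the exact integral is 1/3.
def mc_integral(n_extractions: int, rng: random.Random) -> tuple[float, float]:
    """Return (estimate, error) of the integral of x**2 on [0, 1[."""
    hits = 0
    for _ in range(n_extractions):
        x = rng.random()
        r = rng.random()            # uniform in [0, m[ with m = 1
        if r < x * x:
            hits += 1               # point below the curve
    est = hits / n_extractions      # Eq. (4.42), with (b - a) = 1
    err = (est * (1.0 - est) / n_extractions) ** 0.5  # Eq. (4.43)
    return est, err

rng = random.Random(11)
estimate, error = mc_integral(100_000, rng)
```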
In the case of hit-or-miss Monte Carlo, however, numerical problems may arise in the algorithm that finds the maximum value of the input function in the multidimensional range of interest. Also, partitioning the multidimensional integration range in an optimal way, in the case of importance sampling, may be a non-trivial problem.
The Monte Carlo methods considered so far in the previous sections are based on sequences of uncorrelated pseudorandom numbers that follow a given probability distribution. There are classes of algorithms that sample some probability distributions more efficiently by producing sequences of correlated pseudorandom numbers, i.e. each entry in the sequence depends on the previous ones.
A sequence of random variables x⃗₀, …, x⃗ₙ is a Markov chain if the probability distributions obey:

$$ f_n(\vec x_{n+1}; \vec x_1, \dots, \vec x_n) = f_n(\vec x_{n+1}; \vec x_n) , \qquad (4.44) $$

i.e. if fₙ(x⃗ₙ₊₁) only depends on the previously extracted value x⃗ₙ, which corresponds to a 'loss of memory'. The Markov chain is said to be homogeneous if fₙ(x⃗ₙ₊₁; x⃗ₙ) = f(x⃗ₙ₊₁; x⃗ₙ) does not depend on n.
One example of a Markov chain is the Metropolis–Hastings algorithm [12, 13] described below. Imagine we want to sample a PDF f(x⃗), starting from an initial point x⃗ = x⃗₀. A second point x⃗ is generated according to a PDF q(x⃗; x⃗₀), called the proposal distribution, that depends on x⃗₀. x⃗ is accepted or not according to the Hastings test ratio:

$$ r = \min\left( 1,\ \frac{f(\vec x\,)\,q(\vec x_0; \vec x\,)}{f(\vec x_0)\,q(\vec x; \vec x_0)} \right) , \qquad (4.45) $$
³ See Sect. 5.9; a more rigorous approach is presented in Sect. 7.3.
i.e. the point x⃗ is accepted as the new point x⃗₁ if a uniformly generated value u is less than or equal to r; otherwise the generation of x⃗ is repeated. Once the generated point is accepted, a new generation restarts from x⃗₁, as above, in order to generate x⃗₂, and so on. The process is repeated until the desired number of points is generated.
If a finite set of initial values is discarded, the rest of the values in the sequence can be proven to follow the desired PDF f(x⃗), and each value with non-null probability will eventually be reached, within an arbitrary precision, after a sufficiently large number of extractions (ergodicity).
Usually, it is convenient to choose a symmetric proposal distribution such that q(x⃗; x⃗₀) = q(x⃗₀; x⃗), so that Eq. (4.45) simplifies to the so-called Metropolis ratio, r = min(1, f(x⃗)/f(x⃗₀)).
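A one-dimensional sketch of the algorithm with a symmetric Gaussian proposal, so that the Hastings ratio of Eq. (4.45) reduces to f(x′)/f(x). The target is a two-peak Gaussian mixture with weights 0.45/0.55, a one-dimensional analogue of the example of Fig. 4.4; here a rejected proposal keeps the chain at the current point, which is the common formulation of the algorithm:

```python
import math
import random

# Metropolis-Hastings with a symmetric proposal: accept x' with
# probability min(1, f(x')/f(x)); on rejection the chain repeats the
# current point. All numerical choices are illustrative.
def f(x: float) -> float:
    """Unnormalized target PDF: two Gaussian peaks, weights 0.45/0.55."""
    g1 = 0.45 * math.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)
    g2 = 0.55 * math.exp(-0.5 * ((x - 4.0) / 0.5) ** 2)
    return g1 + g2

def metropolis(n_points: int, rng: random.Random) -> list[float]:
    xs = [3.0]                                  # initial point x_0
    while len(xs) < n_points:
        x_prop = xs[-1] + rng.gauss(0.0, 1.0)   # symmetric proposal
        r = min(1.0, f(x_prop) / f(xs[-1]))     # Metropolis ratio
        xs.append(x_prop if rng.random() <= r else xs[-1])
    return xs

rng = random.Random(17)
chain = metropolis(200_000, rng)[1000:]         # discard initial values
mean = sum(chain) / len(chain)                  # expect ~0.45*2 + 0.55*4
```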
Fig. 4.4 Monte Carlo generation with the Metropolis–Hastings method for the sum of two two-dimensional Gaussian functions with relative weights equal to 0.45 and 0.55. The PDF is shown as a red-to-black color map in the background. The generated points are connected with a line in order to show the generated sequence. The first 20 generated points are shown in purple, all the subsequent ones in blue. The proposal function is a two-dimensional Gaussian with σ = 0.5. The total number of generated points is 100 (top left), 1000 (top right), 2000 (bottom left) and 5000 (bottom right)
References
1. May, R.: Simple mathematical models with very complicated dynamics. Nature 261, 459 (1976)
2. Logistic map. Public domain image. https://commons.wikimedia.org/wiki/File:LogisticMap_
BifurcationDiagram.png (2011)
3. T.O. Group: The Single UNIX® Specification, Version 2. http://www.unix.org (1997)
4. T.G. project: GNU operating system–GSL–GNU scientific library. http://www.gnu.org/
software/gsl/ (1996–2011)
5. Lüscher, M.: A portable high-quality random number generator for lattice field theory
simulations. Comput. Phys. Commun. 79, 100–110 (1994)
6. James, F.: Ranlux: a fortran implementation of the high-quality pseudorandom number
generator of Lüscher. Comput. Phys. Commun. 79, 111–114 (1994)
7. L’Ecuyer, P.: Maximally equidistributed combined Tausworthe generators. Math. Comput. 65,
203–213 (1996)
8. Matsumoto, M., Nishimura, T.: Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Model. Comput. Simul. 8, 3–30 (1998)
9. Marsaglia, G., Tsang, W.W., Wang, J.: Fast generation of discrete random variables. J. Stat.
Softw. 11, 3 (2004)
10. Box, G.E.P., Muller, M.: A note on the generation of random normal deviates. Ann. Math. Stat.
29, 610–611 (1958)
11. Marsaglia, G., Tsang, W.: The Ziggurat method for generating random variables. J. Stat. Softw. 5, 8 (2000)
12. Metropolis, N., et al.: Equations of state calculations by fast computing machines. J. Chem.
Phys. 21(6), 1087–1092 (1953)
13. Hastings, W.: Monte Carlo sampling methods using Markov chains and their application.
Biometrika 57, 97–109 (1970)
14. Caldwell, A., Kollar, D., Kröninger, K.: BAT – The Bayesian Analysis Toolkit. Comput. Phys.
Commun. 180, 2197–2209 (2009). https://wwwold.mppmu.mpg.de/bat/
Chapter 5
Parameter Estimate
5.1 Introduction
5.2 Inference
[Diagram: 'Probability' runs from Theory/Model to Data; 'Inference' runs from Data back to Theory/Model]
The theory provides a probability model that predicts the distribution of the
observable quantities. Some theory parameters are unknown, and the measurement
of those parameters of interest is the goal of our experiment.
There are cases where the parameter of interest is a calibration constant, which is not strictly related to a physics theory, but is related to a model assumed to describe the detector response we are interested in.
Our data sample consists of measured values of the observable quantities, which are, in turn, a sampling of the PDF determined by a combination of the theoretical and the instrumental effects.
We can determine, or estimate, the value of unknown parameters (parameters of interest or nuisance parameters) using the data collected by our experiment. The estimate is not exactly equal to the true value of the parameter, but provides an approximate knowledge of the true value, within some uncertainty. As a result of the measurement of a parameter θ, one quotes the estimated value θ̂ and its uncertainty δ, usually using the notation already introduced in Sect. 3.5.2:

$$ \theta = \hat\theta \pm \delta . \qquad (5.1) $$
The estimate θ̂ is often also called central value.¹ The interval [θ̂ − δ, θ̂ + δ] is referred to as the uncertainty interval. The meaning of the uncertainty interval is different in the frequentist and in the Bayesian approaches. In both approaches, a probability level needs to be specified in order to determine the size of the uncertainty. When not otherwise specified, by convention, a 68.27% probability level is assumed, corresponding to the area under a Gaussian distribution in a ±1σ interval. Other choices are 90% or 95% probability level, usually adopted when quoting upper or lower limits.
In some cases, asymmetric positive and negative uncertainties are taken, and the result is quoted as:

$$ \theta = \hat\theta\,^{+\delta^{+}}_{-\delta^{-}} , \qquad (5.2) $$

¹ The name 'central value' is frequently used in physics, but it may sometimes be ambiguous in statistics, where it could be confused with the median, mean or mode.
5.7 Estimators
$$ \hat\mu(x) = x . \qquad (5.3) $$
Different estimators may have different statistical properties that make one or
another estimator more suitable for a specific problem. In the following, some of the
main properties of estimators are presented. Section 5.10 will introduce maximum
likelihood estimators which have good properties in terms of most of the indicators
described in the following.
5.8.1 Consistency
$$ \forall\,\varepsilon \quad \lim_{n\to\infty} P\left(\,\left| \hat\theta_n - \theta \right| < \varepsilon \,\right) = 1 . \qquad (5.4) $$
5.8.2 Bias
The bias of an estimator is the expected value of the deviation of the parameter estimate from the corresponding true value of that parameter:

$$ b(\hat\theta\,) = \left\langle \hat\theta - \theta \right\rangle = \left\langle \hat\theta \right\rangle - \theta . \qquad (5.5) $$
O is the bias of the estimator (Eq. (5.5)) and the denominator is the Fisher
where b./
information, already defined in Sect. 3.7.
The ratio of the Cramér–Rao bound to the estimator’s variance is called
estimator’s efficiency:
O
VCR ./
O D
"./ : (5.7)
O
VŒ
The good properties of some estimators may be spoiled in case the real distribution
of a data sample has deviations from the assumed PDF model.
Entries in the data sample that introduce visible deviations from the theoretical
PDF, such as data in extreme tails of the PDF where no entry is expected, are called
outliers.
If data may deviate from the nominal PDF model, an important property of an estimator is to have a limited sensitivity to the presence of outliers. This property, which can be quantified more precisely, is in general referred to as robustness.
An example of a robust estimator of the central value of a symmetric distribution from a sample x₁, …, x_N is the median x̃, defined in Eq. (1.23):

$$ \tilde x = \begin{cases} x_{(N+1)/2} & \text{if } N \text{ is odd} , \\ \tfrac{1}{2}\left( x_{N/2} + x_{N/2+1} \right) & \text{if } N \text{ is even} . \end{cases} \qquad (5.8) $$
Clearly, the presence of outliers in the left or right tails of the distribution does not change significantly the value of the median, if it is dominated by measurements in the 'core' of the distribution. Conversely, the usual average (Eq. (1.13)) could be shifted from the true value by an amount that grows with how much broader the outliers distribution is than the 'core' part of the distribution.
An average computed by removing from the sample a given fraction f of data present in the rightmost and leftmost tails is called trimmed average and is also less sensitive to the presence of outliers.
It is convenient to define the breakdown point as the maximum fraction of incorrect measurements (i.e. outliers) above which the estimate may grow arbitrarily large in absolute value. In particular, a trimmed average that removes a fraction f of the events can be demonstrated to have a breakdown point equal to f, while the median has a breakdown point of 0.5. The mean of a distribution, instead, has a breakdown point of 0.
A more detailed treatment of robust estimators is beyond the purpose of this text.
Reference [3] contains a more extensive discussion about robust estimators.
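The different sensitivity to outliers can be seen in a small simulation, comparing mean, median and trimmed average on a contaminated sample (sample composition and trimming fraction are illustrative):

```python
import random
import statistics

# Median, mean and trimmed average on a sample with outliers: the median
# (breakdown point 0.5) and the trimmed average barely move, while the
# mean is dragged away by the outliers (breakdown point 0).
def trimmed_average(xs: list[float], f: float) -> float:
    """Average after removing a fraction f of entries from each tail."""
    xs = sorted(xs)
    k = int(len(xs) * f)
    core = xs[k:len(xs) - k]
    return sum(core) / len(core)

rng = random.Random(8)
core = [rng.gauss(0.0, 1.0) for _ in range(980)]          # true center 0
outliers = [rng.uniform(50.0, 100.0) for _ in range(20)]  # 2% outliers
sample = core + outliers

plain_mean = statistics.fmean(sample)       # pulled towards the outliers
median = statistics.median(sample)          # stays near 0
trimmed = trimmed_average(sample, 0.05)     # stays near 0
```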
The efficiency ε of a device is defined as the probability that the device gives a positive signal when a process of interest occurs. Particle detectors are examples of such devices: they produce a signal when a particle interacts with them, but they may fail in a fraction of the cases. The distribution of the number of positive signals n, if N processes of interest occurred, is given by a binomial distribution (Eq. (1.39)) with parameter p = ε.
A typical problem is the estimate of the efficiency ε of the device. A pragmatic way to estimate the efficiency may consist in performing a large number N of samplings of the process of interest, counting the number of times the device gives a positive signal (i.e. it has been efficient). For a particle detector, the data acquisition time should be sufficiently long in order to have a large number of particles traversing the detector.
Assuming that the result of a real experiment gives a measured value of n equal to n̂, an estimate of the true efficiency ε is given by:

$$ \hat\varepsilon = \frac{\hat n}{N} . \qquad (5.9) $$
The variance of $\hat{n}$, from Eq. (1.41), is equal to $N\varepsilon(1-\varepsilon)$; hence, the uncertainty on
$\hat{\varepsilon} = \hat{n}/N$, using Eq. (1.22), is given by:
\[ \sigma_{\hat{\varepsilon}} = \sqrt{\operatorname{Var}\!\left[\frac{\hat{n}}{N}\right]} = \sqrt{\frac{\varepsilon\,(1-\varepsilon)}{N}}\,. \tag{5.10} \]
Equation (5.10) is not very useful, since $\varepsilon$, the true efficiency value, is unknown.
However, if $N$ is sufficiently large, $\hat{\varepsilon}$ is close to the true efficiency $\varepsilon$ as a consequence
of the law of large numbers (see Sect. 1.17). By simply replacing $\varepsilon$ with $\hat{\varepsilon}$, the
following approximate expression for the uncertainty on $\hat{\varepsilon}$ is obtained:
\[ \sigma_{\hat{\varepsilon}} \simeq \sqrt{\frac{\hat{\varepsilon}\,(1-\hat{\varepsilon})}{N}}\,. \tag{5.11} \]
The above formula is only an approximation; in particular, it leads to a null
uncertainty $\sigma_{\hat{\varepsilon}} = 0$ in cases where $\hat{\varepsilon} = 0$ or $\hat{\varepsilon} = 1$, i.e. for $\hat{n} = 0$ or $\hat{n} = N$, respectively.
Section 7.3 presents how to overcome these problems of Eq. (5.11) with a
more rigorous treatment.
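The estimate of Eq. (5.9) and its approximate uncertainty from Eq. (5.11) can be sketched in a few lines of Python; the counts used here are invented for illustration. Note how the pathological cases $\hat{n} = 0$ or $\hat{n} = N$ return a null uncertainty.

```python
import math

def efficiency_estimate(n_pass, n_total):
    """Efficiency estimate eps_hat = n/N (Eq. 5.9) with the approximate
    binomial uncertainty sqrt(eps_hat * (1 - eps_hat) / N) (Eq. 5.11)."""
    eps = n_pass / n_total
    sigma = math.sqrt(eps * (1.0 - eps) / n_total)
    return eps, sigma

# invented example: 750 positive signals out of 1000 trials
eps, sigma = efficiency_estimate(750, 1000)

# the approximation breaks down at the boundaries: zero uncertainty
eps0, sigma0 = efficiency_estimate(0, 1000)
```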
5.10 Maximum Likelihood Method
The most frequently adopted estimation method is based on the construction of the
combined probability distribution of all measurements in our data sample, called the
likelihood function, which was already introduced in Sect. 3.4. The estimate of
the parameters we want to determine is obtained by finding the parameter set that
corresponds to the maximum value of the likelihood function. This approach gives
the name of maximum likelihood method to this technique.
The procedure is also called best fit because it determines the parameters for
which the theoretical PDF model best fits the experimental data sample.
Maximum likelihood fits are very frequently used because of their very good statistical
properties, according to the indicators discussed in Sect. 5.8.
The estimator discussed in Example 5.17 is a very simple example of the
application of the maximum likelihood method. A Gaussian PDF with unknown
average $\mu$ and known standard deviation $\sigma$ was assumed, and the estimate $\hat{\mu}$ was
the value of a single measurement $x$ following the given Gaussian PDF: $x$ is indeed
the value of $\mu$ that maximizes the likelihood function.
The likelihood function is the function that, for given values of the unknown
parameters, returns the value of the PDF evaluated at the observed data sample. If
the measured values of $n$ random variables are $x_1, \dots, x_n$ and our PDF model depends
on $m$ unknown parameters $\theta_1, \dots, \theta_m$, the likelihood function is:
\[ L(\vec{x}\,; \vec{\theta}\,) = f(x_1, \dots, x_n; \theta_1, \dots, \theta_m)\,. \tag{5.12} \]
The likelihood function of a sample consisting of $N$ events² recorded by our experiment can be written as the product
of the PDFs corresponding to the measurement of each single event:
\[ L(\vec{x}\,; \vec{\theta}\,) = \prod_{i=1}^{N} f(x_1^i, \dots, x_n^i; \theta_1, \dots, \theta_m)\,. \tag{5.13} \]
Often the logarithm of the likelihood function is computed, so that when the
product of many terms appears in the likelihood definition, it is transformed into
a sum of logarithms. The logarithm of the likelihood function in Eq. (5.13) is:
\[ \log L(\vec{x}\,; \vec{\theta}\,) = \sum_{i=1}^{N} \log f(x_1^i, \dots, x_n^i; \theta_1, \dots, \theta_m)\,. \tag{5.14} \]
The maximization of the likelihood function $L$, or the equivalent, but often more
convenient, minimization of $-\log L$, can be performed analytically only in the
simplest cases. Most realistic cases require numerical methods implemented
as computer algorithms. The software MINUIT [4] has been one of the most widely used
minimization tools in the field of high-energy and nuclear physics since the
1970s.
The minimization is based on the steepest descent direction in the parameter
space, which is determined from a numerical evaluation of the gradient of (the
logarithm of) the likelihood function.
MINUIT has been reimplemented in C++ from the original Fortran version and
is available in the ROOT software toolkit [5].
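The numerical minimization of $-\log L$ can be sketched with a generic optimizer in place of MINUIT; this minimal example, assuming SciPy is available, fits the decay parameter of an invented exponential sample and checks the result against the analytic maximum likelihood solution.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# invented sample: exponential decay times with true lambda = 0.5
rng = np.random.default_rng(42)
t = rng.exponential(scale=2.0, size=1000)

def nll(lam):
    # -log L for the exponential PDF f(t; lambda) = lambda * exp(-lambda * t)
    return -(len(t) * np.log(lam) - lam * t.sum())

res = minimize_scalar(nll, bounds=(1e-3, 10.0), method="bounded")
lam_hat = res.x  # should coincide with the analytic MLE, 1 / mean(t)
```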
² In physics, the word event is often used with a different meaning than in statistics: it
refers to a collection of measurements of observable quantities $\vec{x} = (x_1, \dots, x_n)$ corresponding to
a physical phenomenon, like a collision of particles at an accelerator, or the interaction of a particle
or of a shower of particles from cosmic rays in a detector. Measurements performed at different events
are usually uncorrelated, and each sequence of variables taken from $N$ different events, $x_i^1, \dots, x_i^N$,
$i = 1, \dots, n$, can be considered a sampling of independent and identically distributed random
variables, or IID, as defined in Sect. 4.2.

If the number of recorded events $N$ is also a random variable that follows a distribution
$P(N; \theta_1, \dots, \theta_m)$, which may also depend on the $m$ unknown parameters, the
extended likelihood function can be written as:
\[ L = P(N; \theta_1, \dots, \theta_m) \prod_{i=1}^{N} f(x_1^i, \dots, x_n^i; \theta_1, \dots, \theta_m)\,. \tag{5.15} \]
iD1
e
.1 ; ; m /
.1 ; ; m /N Y i
N
LD f .x1 ; ; xin I 1 ; ; m / : (5.16)
NŠ iD1
Consider the case where the PDF $f$ is a linear combination of two PDFs, one
for 'signal', $f_s$, and one for 'background', $f_b$, whose individual yields, $s$ and $b$, are
unknown. The extended likelihood function can be written as:
\[ L(\vec{x}\,; s, b, \vec{\theta}\,) = \frac{(s+b)^N e^{-(s+b)}}{N!} \prod_{i=1}^{N} \left[ w_s\, f_s(x_i; \vec{\theta}\,) + w_b\, f_b(x_i; \vec{\theta}\,) \right], \tag{5.17} \]
where the weights are $w_s = s/(s+b)$ and $w_b = b/(s+b)$, so that the expression simplifies to:
\[ L(\vec{x}\,; s, b, \vec{\theta}\,) = \frac{e^{-(s+b)}}{N!} \prod_{i=1}^{N} \left[ s\, f_s(x_i; \vec{\theta}\,) + b\, f_b(x_i; \vec{\theta}\,) \right]. \tag{5.21} \]
The negative logarithm of the likelihood function is:
\[ -\log L(\vec{x}\,; s, b, \vec{\theta}\,) = s + b - \sum_{i=1}^{N} \log \left[ s\, f_s(x_i; \vec{\theta}\,) + b\, f_b(x_i; \vec{\theta}\,) \right] + \log N!\,. \tag{5.22} \]
[Figure 5.2: Example of unbinned extended maximum likelihood fit of a simulated dataset; events/(0.01) vs $m$ (GeV). The fit curve is superimposed on the data points (black dots with error bars) and shown as a solid blue line.]
The last term, $\log N!$, is constant with respect to the unknown parameters and can
be omitted when minimizing the function $-\log L(\vec{x}\,; s, b, \vec{\theta}\,)$.
An example of application of the extended maximum likelihood from Eq. (5.22)
is the two-component fit shown in Fig. 5.2. The assumed PDF is modeled as the sum
of a Gaussian component ('signal') and an exponential component ('background').
The points with the error bars represent the data sample, which is randomly
extracted according to the assumed PDF. Data are shown as a binned histogram
for convenience, but the individual values of the random variable $m$ are used
in the likelihood function. The unknown parameters that model the signal and
background PDFs, determined by the fit according to the likelihood function in
Eq. (5.22), are the mean $\mu$ and standard deviation $\sigma$ of the Gaussian component
and the decay parameter $\lambda$ of the exponential component. They are determined
simultaneously with the numbers of signal and background events, $s$ and $b$, in
a single minimization of the negative logarithm of the likelihood function.
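A minimal sketch of such an extended maximum likelihood fit, in the spirit of Eq. (5.22), is given below. It is not the actual fit of Fig. 5.2: the shape parameters are held fixed, the yields and all numbers are invented, and a generic SciPy optimizer replaces MINUIT; only the yields $s$ and $b$ are fitted.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# invented toy model: Gaussian 'signal' over an exponential 'background'
# for m in [2.5, 3.5]; shapes fixed, only the yields are free in the fit
S_TRUE, B_TRUE = 450.0, 1550.0
MU, SIG, LAM, LO = 3.1, 0.05, 1.0, 2.5

m_sig = rng.normal(MU, SIG, rng.poisson(S_TRUE))
u = rng.uniform(size=rng.poisson(B_TRUE))
m_bkg = LO - np.log(1.0 - u * (1.0 - np.exp(-LAM))) / LAM  # truncated exponential
m = np.concatenate([m_sig, m_bkg])

def f_s(x):  # normalized Gaussian signal PDF
    return np.exp(-0.5 * ((x - MU) / SIG) ** 2) / (SIG * np.sqrt(2.0 * np.pi))

def f_b(x):  # exponential background PDF, normalized on [2.5, 3.5]
    return LAM * np.exp(-LAM * (x - LO)) / (1.0 - np.exp(-LAM))

def enll(params):  # extended -log L, Eq. (5.22), dropping the constant log N!
    s, b = params
    return (s + b) - np.sum(np.log(s * f_s(m) + b * f_b(m)))

res = minimize(enll, x0=(300.0, 1700.0), method="L-BFGS-B",
               bounds=((1.0, None), (1.0, None)))
s_hat, b_hat = res.x
```

At the minimum of the extended likelihood with fixed shapes, the fitted yields satisfy $\hat{s} + \hat{b} = N$, the total number of observed events.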
For a sample of $N$ measurements $x_1, \dots, x_N$ following a Gaussian distribution with average $\mu$
and standard deviation $\sigma$, twice the negative logarithm of the likelihood function
can be written as:
\[ -2 \log L(\vec{x}\,; \mu, \sigma^2) = \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{\sigma^2} + N \left( \log 2\pi + 2 \log \sigma \right)\,. \tag{5.23} \]
The analytical minimization with respect to $\mu$ and $\sigma^2$ gives the maximum likelihood estimates:
\[ \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i\,, \tag{5.24} \]
\[ \widehat{\sigma^2} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2\,. \tag{5.25} \]
The maximum likelihood estimate $\widehat{\sigma^2}$ is affected by a bias (see Sect. 5.8.2), in the
sense that its average value deviates from the true $\sigma^2$. The bias, anyway, decreases
as $N \to \infty$. A way to correct for the bias present in Eq. (5.25) is discussed in
Example 5.20.³
³ Note that we will use the notation $\widehat{\sigma^2}$ and not $\hat{\sigma}^2$, since we consider $\sigma^2$ as the parameter of interest,
rather than $\sigma$.
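The bias of the maximum likelihood variance estimate can be checked with a quick toy experiment; sample size, seed and true variance below are arbitrary. With $N = 5$ and $\sigma^2 = 4$, the average of the biased estimate should approach $(N-1)/N \cdot \sigma^2 = 3.2$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials, sigma2_true = 5, 200_000, 4.0

x = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))
s2_ml = x.var(axis=1, ddof=0)    # Eq. (5.25): divides by N, biased
s2_unb = x.var(axis=1, ddof=1)   # bias-corrected: divides by N - 1

bias_expected = (N - 1) / N * sigma2_true  # expected average of s2_ml
```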
5.11 Errors with the Maximum Likelihood Method

The covariance matrix of the maximum likelihood estimates can be obtained, in the Gaussian approximation, from the inverse of the matrix of second derivatives of the negative log likelihood at the minimum:
\[ C^{-1}_{ij} = -\frac{\partial^2 \log L(x_1, \dots, x_n; \theta_1, \dots, \theta_m)}{\partial \theta_i\, \partial \theta_j}\,. \tag{5.26} \]
For the Gaussian case of Eq. (5.23), the second derivative with respect to $\mu$ gives:
\[ \frac{1}{\sigma_{\hat{\mu}}^2} = \frac{\partial^2 (-\log L)}{\partial \mu^2} = \frac{N}{\sigma^2}\,, \tag{5.27} \]
i.e. $\sigma_{\hat{\mu}} = \sigma/\sqrt{N}$.
This expression coincides with the result obtained by evaluating the standard
deviation of the average from Eq. (5.24) using the general formulae from Eqs. (1.15)
and (1.16).
Another way to determine the uncertainty interval is based on the excursion of $-2 \log L$ around its minimum, corresponding to the maximum value of the likelihood function,⁵
\[ L_{\max} = L(\vec{x}\,; \hat{\vec{\theta}}\,)\,, \tag{5.29} \]
finding the values of the parameter for which:
\[ -2 \log L(\vec{\theta}\,) = -2 \log L_{\max} + 1\,. \tag{5.30} \]
⁴ For the users of the program MINUIT, this estimate corresponds to calling the methods
MIGRAD/HESSE.
⁵ For MINUIT users, this procedure corresponds to calling the method MINOS.
[Figure 5.3: Scan of $-2\log L$ as a function of a parameter $\theta$. The error interval is determined from the
excursion of $-2\log L$ around the minimum at $\hat{\theta}$, finding the values $\hat{\theta}^-$ and $\hat{\theta}^+$ for which $-2\log L$ increases
by one unit; the corresponding errors are $\delta^- = \hat{\theta} - \hat{\theta}^-$ and $\delta^+ = \hat{\theta}^+ - \hat{\theta}$.]
[Figure 5.4: Two-dimensional contour plot showing the $1\sigma$ uncertainty ellipse for the fit parameters
$s$ and $b$ (numbers of signal and background events) corresponding to the fit in Fig. 5.2. The contour
corresponds to the equation $-2\log L(\vec{x}\,; s, b, \vec{\theta}\,) = -2\log L_{\max} + 1$.]
For instance, the two-dimensional contour plot shown in Fig. 5.4 shows the
approximate ellipse for the sample considered in Fig. 5.2, corresponding to the
points for which $-2\log L(\vec{x}\,; s, b, \vec{\theta}\,)$ in Eq. (5.22) increases by one unit with respect
to its minimum, shown as the central point.
This method leads to errors identical to those from Eq. (5.26) in the Gaussian case, in
which $-2\log L$ has a parabolic shape; in general, however, an uncertainty interval given
by Eq. (5.30) may differ from the uncertainties determined from Eq. (5.26). The
coverage is usually improved using the $-2\log L$ scan with respect to the parabolic
approximation used in Eq. (5.26).
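The scan procedure of Fig. 5.3 can be sketched numerically for an invented exponential sample, where $-2\log L$ is not exactly parabolic and the resulting errors are slightly asymmetric; SciPy's root finder locates the crossings at $-2\log L_{\max} + 1$.

```python
import numpy as np
from scipy.optimize import brentq

# invented sample: 100 exponential decay times, true lambda = 1
rng = np.random.default_rng(7)
t = rng.exponential(scale=1.0, size=100)
n, tsum = len(t), t.sum()

def m2nll(lam):
    # -2 log L for the exponential model f(t; lambda) = lambda * exp(-lambda t)
    return -2.0 * (n * np.log(lam) - lam * tsum)

lam_hat = n / tsum                 # analytic minimum of -2 log L
m2nll_min = m2nll(lam_hat)

# find where -2 log L crosses its minimum + 1, on each side of lam_hat
g = lambda lam: m2nll(lam) - (m2nll_min + 1.0)
lam_lo = brentq(g, 1e-6, lam_hat)
lam_hi = brentq(g, lam_hat, 10.0 * lam_hat)
err_minus, err_plus = lam_hat - lam_lo, lam_hi - lam_hat
```

For this model the parabolic approximation would give $\sigma_{\hat{\lambda}} = \hat{\lambda}/\sqrt{n} = 0.1\,\hat{\lambda}$; the scan returns a slightly larger positive error and a slightly smaller negative one.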
For an uncertainty interval corresponding to $Z$ standard deviations in the Gaussian approximation, the condition becomes:
\[ -2 \log L(\vec{\theta}\,) = -2 \log L_{\max} + Z^2\,. \tag{5.31} \]
For instance, for a sample of $N$ measured decay times $t_1, \dots, t_N$ following an exponential PDF with decay parameter $\lambda$, the likelihood function is:
\[ L(t_1, \dots, t_N; \lambda) = \prod_{i=1}^{N} \lambda\, e^{-\lambda t_i} = \lambda^N e^{-\lambda \sum_{i=1}^{N} t_i}\,. \tag{5.33} \]
The maximum likelihood estimate is $\hat{\lambda} = N / \sum_i t_i$,
with uncertainty, from Eq. (5.26), uncorrected for a possible bias, equal to:
\[ \sigma_{\hat{\lambda}} = \hat{\lambda}/\sqrt{N}\,. \tag{5.35} \]
Consider again the maximum likelihood estimate of the variance of a Gaussian from Eq. (5.25):
\[ \widehat{\sigma^2} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2\,. \tag{5.36} \]
The bias, defined in Eq. (5.5), is the difference between the expected value of $\widehat{\sigma^2}$
and the true value of $\sigma^2$. It is easy to show analytically that the expected
value of $\widehat{\sigma^2}$ is:
\[ \left\langle \widehat{\sigma^2} \right\rangle = \frac{N-1}{N}\, \sigma^2\,, \tag{5.37} \]
so the estimate can be corrected for the bias by multiplying it by $N/(N-1)$, which gives the unbiased estimate:
\[ \widehat{\sigma^2}_{\text{unbiased}} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat{\mu})^2\,. \tag{5.39} \]
5.12 Minimum χ² and Least-Squares Methods

Consider a number $n$ of measurements $(y_1, \dots, y_n)$, each affected by a known Gaussian uncertainty $\sigma_i$ and performed at the values $(x_1, \dots, x_n)$ of a variable $x$, with an assumed model for the dependence of $y$ on $x$:
\[ y = f(x; \vec{\theta}\,)\,, \tag{5.40} \]
where $\vec{\theta} = (\theta_1, \dots, \theta_m)$ is a set of unknown parameters. Twice the negative logarithm of the likelihood function is, in this case:
\[ -2 \log L(\vec{y}\,; \vec{\theta}\,) = \sum_{i=1}^{n} \frac{\left( y_i - f(x_i; \vec{\theta}\,) \right)^2}{\sigma_i^2} + \sum_{i=1}^{n} \log 2\pi \sigma_i^2\,. \tag{5.42} \]
The last term does not depend on the parameters $\vec{\theta}$ if the uncertainties $\sigma_i$ are
known and fixed; hence, it is a constant that can be dropped when performing the
minimization. The first term to be minimized in Eq. (5.42) is a $\chi^2$ variable (see
Sect. 2.9):
\[ \chi^2(\vec{\theta}\,) = \sum_{i=1}^{n} \frac{\left( y_i - f(x_i; \vec{\theta}\,) \right)^2}{\sigma_i^2}\,. \tag{5.43} \]
The terms:
\[ r_i = y_i - \hat{y}_i = y_i - f(x_i; \hat{\vec{\theta}}\,)\,, \tag{5.44} \]
evaluated at the fit values $\hat{\vec{\theta}}$ of the parameters $\vec{\theta}$, are called residuals.
An example of a fit performed with the minimum $\chi^2$ method is shown in Fig. 5.5.
If the data are distributed according to the assumed model, then the residuals are randomly
distributed around zero, according to the data uncertainties.
[Figure 5.5: Example of minimum $\chi^2$ fit of a computer-generated dataset. The points with the error
bars are used to fit a function model of the type $y = f(x) = A\, x\, e^{-Bx}$, where $A$ and $B$ are unknown
parameters determined by the fit. The fit curve is superimposed as a solid blue line. Residuals $r$ are
shown in the bottom panel of the plot.]
In case the uncertainties $\sigma_i$ are all equal, it is possible to equivalently minimize the expression:
\[ S = \sum_{i=1}^{n} \left( y_i - f(x_i; \vec{\theta}\,) \right)^2\,. \tag{5.45} \]
In the simplest case of a linear function, the minimum $\chi^2$ problem can be solved
analytically. The function $f$ can be written as:
\[ y = f(x; a, b) = a + b\, x\,, \tag{5.46} \]
and the $\chi^2$ becomes:
\[ \chi^2(a, b) = \sum_{i=1}^{n} \frac{(y_i - a - b\, x_i)^2}{\sigma_i^2}\,. \tag{5.47} \]
It is convenient to introduce the weights:
\[ w_i = \frac{1/\sigma_i^2}{1/\sigma^2}\,, \tag{5.48} \]
with $1/\sigma^2 = \sum_{i=1}^{n} 1/\sigma_i^2$, so that $\sum_{i=1}^{n} w_i = 1$. If all errors are equal, the weights are
all equal to $1/n$.
The analytical minimization is achieved by imposing $\partial \chi^2(a,b)/\partial a = 0$ and
$\partial \chi^2(a,b)/\partial b = 0$, which gives:
\[ \sum_{i=1}^{n} w_i y_i = a + b \sum_{i=1}^{n} w_i x_i\,, \tag{5.49} \]
\[ \sum_{i=1}^{n} w_i x_i y_i = a \sum_{i=1}^{n} w_i x_i + b \sum_{i=1}^{n} w_i x_i^2\,. \tag{5.50} \]
In matrix form, the system of the two Eqs. (5.49) and (5.50) becomes:
\[ \begin{pmatrix} \sum_{i=1}^{n} w_i y_i \\[4pt] \sum_{i=1}^{n} w_i x_i y_i \end{pmatrix} = \begin{pmatrix} 1 & \sum_{i=1}^{n} w_i x_i \\[4pt] \sum_{i=1}^{n} w_i x_i & \sum_{i=1}^{n} w_i x_i^2 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}. \tag{5.51} \]
The solution is:
\[ \hat{b} = \frac{\operatorname{cov}(x, y)}{\operatorname{V}[x]}\,, \tag{5.52} \]
\[ \hat{a} = \langle y \rangle - \hat{b}\, \langle x \rangle\,, \tag{5.53} \]
where the terms that appear in Eqs. (5.52) and (5.53), and that have to be computed in
order to determine $\hat{a}$ and $\hat{b}$, are:
\[ \langle x \rangle = \sum_{i=1}^{n} w_i x_i\,, \tag{5.54} \]
\[ \langle y \rangle = \sum_{i=1}^{n} w_i y_i\,, \tag{5.55} \]
\[ \operatorname{V}[x] = \langle x^2 \rangle - \langle x \rangle^2 = \sum_{i=1}^{n} w_i x_i^2 - \left( \sum_{i=1}^{n} w_i x_i \right)^{\!2}, \tag{5.56} \]
\[ \operatorname{cov}(x, y) = \langle x y \rangle - \langle x \rangle \langle y \rangle = \sum_{i=1}^{n} w_i x_i y_i - \sum_{i=1}^{n} w_i x_i \sum_{i=1}^{n} w_i y_i\,. \tag{5.57} \]
The uncertainties on the estimates can be obtained, as usual, from the second derivatives of $-\log L$, or equivalently of $\chi^2/2$:
\[ \frac{1}{\sigma_{\hat{a}}^2} = -\frac{\partial^2 \log L}{\partial a^2} = \frac{1}{2} \frac{\partial^2 \chi^2}{\partial a^2}\,, \tag{5.58} \]
\[ \frac{1}{\sigma_{\hat{b}}^2} = \frac{1}{2} \frac{\partial^2 \chi^2}{\partial b^2}\,, \tag{5.59} \]
which give:
\[ \sigma_{\hat{a}} = \sigma\,, \tag{5.60} \]
\[ \sigma_{\hat{b}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} w_i x_i^2}}\,, \tag{5.61} \]
\[ \operatorname{cov}(\hat{a}, \hat{b}) = \frac{\sigma^2 \sum_{i=1}^{n} w_i x_i}{\left( \sum_{i=1}^{n} w_i x_i \right)^{\!2} - \sum_{i=1}^{n} w_i x_i^2}\,. \tag{5.62} \]
A coefficient that is not much used in physics, but appears rather frequently
in linear regressions performed by commercial software, is the coefficient of
determination, or $R^2$, defined as:
\[ R^2 = \frac{\sum_{i=1}^{n} \left( \hat{y}_i - \langle y \rangle \right)^2}{\sum_{i=1}^{n} \left( y_i - \langle y \rangle \right)^2}\,, \tag{5.63} \]
where $\hat{y}_i = f(x_i; \hat{a}, \hat{b}) = \hat{a} + \hat{b}\, x_i$ are the fit values of the individual measurements.
$R^2$ may take values between 0 and 1, and is often expressed as a percentage. $R^2 = 1$
corresponds to measurements perfectly aligned along the fitted regression line,
indicating that the regression line accounts for all of the variation of the measurements as a
function of $x$, while $R^2 = 0$ corresponds to a perfectly horizontal regression line,
indicating that the measurements are insensitive to the variable $x$.
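The analytic solution of Eqs. (5.48)–(5.57) and the $R^2$ of Eq. (5.63) can be sketched directly; the test data below are invented and chosen to lie exactly on a line, so the fit recovers the coefficients exactly and $R^2 = 1$.

```python
import numpy as np

def linear_fit(x, y, sigma):
    """Weighted analytic linear fit y = a + b*x, following Eqs. (5.48)-(5.57)."""
    w = 1.0 / sigma**2
    w = w / w.sum()                      # normalized weights, Eq. (5.48)
    mx, my = (w * x).sum(), (w * y).sum()        # <x>, <y>: Eqs. (5.54)-(5.55)
    vx = (w * x * x).sum() - mx**2                # V[x]: Eq. (5.56)
    cxy = (w * x * y).sum() - mx * my             # cov(x, y): Eq. (5.57)
    b = cxy / vx                                  # Eq. (5.52)
    a = my - b * mx                               # Eq. (5.53)
    yhat = a + b * x
    r2 = ((yhat - my)**2).sum() / ((y - my)**2).sum()   # Eq. (5.63)
    return a, b, r2

# invented, exactly linear data: y = 0.5 + 2 x with unit errors
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 0.5 + 2.0 * x
a, b, r2 = linear_fit(x, y, np.ones_like(x))
```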
5.13 Binned Data Samples

The maximum likelihood method discussed in Sect. 5.10 performs parameter
estimates using the complete information present in our measurement
sample. For repeated measurements of a single variable $x$, this is given by a dataset
$(x_1, \dots, x_n)$. For a very large number of measurements $n$, computing the
likelihood function may become impractical from the numerical point of view, and
the implementation could require intensive computing power. Machine precision
may also become an issue.
For this reason, it is frequently preferred to perform the parameter estimate using
a summary of the sample's information, obtained by binning the distribution of the
random variable (or variables) of interest and using as information the number of
entries in each single bin: $(n_1, \dots, n_N)$, where the number of bins $N$ is typically
much smaller than the number of measurements $n$.
In practice, a histogram of the experimental distribution is built in one or more
variables. If the sample is composed of independent extractions from a given random
distribution, the number of entries in each bin follows a Poisson distribution. The
expected number of entries in each bin can be determined from the theoretical
distribution and depends on the unknown parameters one wants to estimate.
In case of a sufficiently large number of entries in each bin, the Poisson distribution
describing the number of entries in a bin can be approximated by a Gaussian with
variance equal to the expected number of entries in that bin (see Sect. 2.12). In this
case, the expression for $-2 \log L$ becomes:
\[ -2 \log L = \sum_{i=1}^{N} \frac{\left( n_i - \mu_i(\theta_1, \dots, \theta_m) \right)^2}{\mu_i(\theta_1, \dots, \theta_m)} + N \log 2\pi + \sum_{i=1}^{N} \log n_i\,, \tag{5.64} \]
where:
\[ \mu_i(\theta_1, \dots, \theta_m) = \int_{x_i^{\text{lo}}}^{x_i^{\text{up}}} f(x; \theta_1, \dots, \theta_m)\, \mathrm{d}x \tag{5.65} \]
and $[x_i^{\text{lo}}, x_i^{\text{up}}]$ is the interval corresponding to the $i$th bin. If the binning is sufficiently
fine, $\mu_i$ can be replaced by:
\[ \mu_i(\theta_1, \dots, \theta_m) \simeq f(x_i; \theta_1, \dots, \theta_m)\, \delta x_i\,, \tag{5.66} \]
where $x_i = (x_i^{\text{up}} + x_i^{\text{lo}})/2$ is the center of the $i$th bin and $\delta x_i = x_i^{\text{up}} - x_i^{\text{lo}}$ is the bin's width.
The quantity defined in Eq. (5.64), dropping the last two constant terms, is called
Pearson's $\chi^2$:
\[ \chi^2_P = \sum_{i=1}^{N} \frac{\left( n_i - \mu_i(\theta_1, \dots, \theta_m) \right)^2}{\mu_i(\theta_1, \dots, \theta_m)}\,. \tag{5.67} \]
It may be more convenient to replace the expected number of entries at the denominator with the
observed number of entries. This gives the so-called Neyman's $\chi^2$:
\[ \chi^2_N = \sum_{i=1}^{N} \frac{\left( n_i - \mu_i(\theta_1, \dots, \theta_m) \right)^2}{n_i}\,. \tag{5.68} \]
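Both quantities are direct transcriptions of Eqs. (5.67) and (5.68); the counts below are invented for illustration.

```python
import numpy as np

def pearson_chi2(n, mu):
    """Pearson's chi2, Eq. (5.67): expected counts at the denominator."""
    n, mu = np.asarray(n, float), np.asarray(mu, float)
    return np.sum((n - mu)**2 / mu)

def neyman_chi2(n, mu):
    """Neyman's chi2, Eq. (5.68): observed counts at the denominator."""
    n, mu = np.asarray(n, float), np.asarray(mu, float)
    return np.sum((n - mu)**2 / n)

chi2_p = pearson_chi2([10, 20], [12.0, 18.0])
chi2_n = neyman_chi2([10, 20], [12.0, 18.0])
```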
The Gaussian approximation assumed in Sect. 5.13.1 does not hold when the
number of entries per bin is small. A Poissonian model, also valid for a small number
of entries, should be applied in those cases. The negative log likelihood function
that replaces Eq. (5.64) is:
\[ -2 \log L = -2 \log \prod_{i=1}^{N} \operatorname{Pois}\!\left( n_i; \mu_i(\theta_1, \dots, \theta_m) \right) \tag{5.69} \]
\[ = -2 \log \prod_{i=1}^{N} \frac{e^{-\mu_i(\theta_1, \dots, \theta_m)}\, \mu_i(\theta_1, \dots, \theta_m)^{n_i}}{n_i!}\,. \tag{5.70} \]
Using the approach proposed in [6], the likelihood function can be divided by its
maximum value, which does not depend on the unknown parameters and hence does not
change the fit result. The maximum is obtained by replacing $\mu_i$ with $n_i$, which leads to
the following negative log likelihood ratio:
\[ \chi^2_{\lambda} = -2 \log \prod_{i=1}^{N} \frac{L\!\left( n_i; \mu_i(\theta_1, \dots, \theta_m) \right)}{L(n_i; n_i)} = -2 \log \prod_{i=1}^{N} \frac{e^{-\mu_i}\, \mu_i^{n_i}}{n_i!} \cdot \frac{n_i!}{e^{-n_i}\, n_i^{n_i}} \tag{5.71} \]
\[ = 2 \sum_{i=1}^{N} \left[ \mu_i(\theta_1, \dots, \theta_m) - n_i + n_i \log \frac{n_i}{\mu_i(\theta_1, \dots, \theta_m)} \right]. \tag{5.72} \]
From Wilks' theorem (see Sect. 9.8), if the model is correct, the distribution of the
minimum value of $\chi^2_{\lambda}$ can be asymptotically approximated with a $\chi^2$ distribution
(Eq. (2.33)) with a number of degrees of freedom equal to the number of bins minus
the number of fit parameters. $\chi^2_{\lambda}$ can be used to determine a p-value (see Sect. 5.12.2)
that provides a measure of the goodness of the fit.
If the number of measurements is not sufficiently large, the distribution of $\chi^2_{\lambda}$ for
the specific problem may deviate from a $\chi^2$ distribution, but it can still be determined
by generating a sufficiently large number of Monte Carlo pseudo-experiments that
reproduce the theoretical PDF, and the p-value can be computed accordingly. This
technique is often called a toy Monte Carlo.
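Equation (5.72) can be transcribed directly, with the usual convention $0 \cdot \log 0 = 0$ for empty bins; the counts and expectations below are invented.

```python
import numpy as np

def baker_cousins(n, mu):
    """Negative log likelihood ratio of Eq. (5.72) for observed counts n
    and expected counts mu (mu must be strictly positive)."""
    n, mu = np.asarray(n, float), np.asarray(mu, float)
    # n * log(n / mu), with the convention 0 * log(0) = 0 for empty bins
    log_term = np.where(n > 0, n * np.log(np.where(n > 0, n / mu, 1.0)), 0.0)
    return 2.0 * np.sum(mu - n + log_term)

val = baker_cousins([5, 0, 12], [4.0, 1.0, 10.0])
zero = baker_cousins([5, 3], [5.0, 3.0])  # vanishes when mu_i = n_i
```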
5.14 Error Propagation
In the Gaussian approximation, the covariance matrix $H$ of a set of parameters $\eta_1, \dots, \eta_k$, obtained as functions of the original parameters $\theta_1, \dots, \theta_m$ whose covariance matrix is $\Theta$, is:
\[ H = A^{\mathsf{T}}\, \Theta\, A\,, \tag{5.74} \]
where:
\[ A_{ij} = \frac{\partial \eta_j}{\partial \theta_i}\,. \tag{5.75} \]
A simple case is the product of a constant times a variable:
\[ y = a\, x\,, \tag{5.76} \]
hence $\sigma_y = |a|\, \sigma_x$.
For what concerns the product $z = x\, y$, the relative uncertainties should instead be
added in quadrature, plus a possible correlation term:
\[ \frac{\sigma_{xy}}{xy} = \sqrt{ \left( \frac{\sigma_x}{x} \right)^{\!2} + \left( \frac{\sigma_y}{y} \right)^{\!2} + \frac{2\, \rho\, \sigma_x \sigma_y}{xy} }\,. \tag{5.81} \]
For a power law $y = x^\alpha$, the relative uncertainties are related by:
\[ \frac{\sigma_{x^\alpha}}{x^\alpha} = |\alpha|\, \frac{\sigma_x}{x}\,. \tag{5.85} \]
The uncertainty on $\log x$ is equal to the relative uncertainty on $x$:
\[ \sigma_{\log x} = \frac{\sigma_x}{x}\,. \tag{5.86} \]
5.15 Treatment of Asymmetric Errors

In Sect. 5.11 we observed that maximum likelihood fits may lead to asymmetric
errors. The propagation of asymmetric errors and the combination of measurements
having asymmetric errors may require special care. If we have two
measurements, $x = \hat{x}^{\,+\sigma_x^+}_{\,-\sigma_x^-}$ and $y = \hat{y}^{\,+\sigma_y^+}_{\,-\sigma_y^-}$, the naïve extension of the sum in
quadrature of errors, derived in Eq. (5.83), would lead to the (incorrect!) sum in
quadrature of the positive and negative errors:
\[ x + y = (\hat{x} + \hat{y})^{\,+\sqrt{(\sigma_x^+)^2 + (\sigma_y^+)^2}}_{\,-\sqrt{(\sigma_x^-)^2 + (\sigma_y^-)^2}}\,. \tag{5.87} \]
Though Eq. (5.87) has sometimes been applied in real cases, it has no statistical
motivation. One reason why Eq. (5.87) is incorrect may be found in the central limit
theorem. Uncertainties are related to the standard deviation of the distribution of
a sample and, in the case of an asymmetric (skewed) distribution, asymmetric errors
may be related to the skewness (Eq. (1.27)) of the distribution. Adding more random
variables tends, by the central limit theorem, to symmetrize the distribution of their sum,
while Eq. (5.87) would preserve the relative asymmetry of the errors.
Under such an asymmetric model, the average value of the variable $x'$ is shifted with respect to the most likely value $\hat{x}'$:
\[ \langle x' \rangle = \hat{x}' + \frac{1}{\sqrt{2\pi}} \left( \sigma^{+\prime} - \sigma^{-\prime} \right)\,. \tag{5.90} \]
While the average value of the sum of two variables is equal to the sum of the
individual average values (Eq. (1.20)), the same does not hold for the most likely value of
the sum of the two variables. Using a naïve error treatment, like the one in Eq. (5.87),
could even lead, for this reason, to a bias in the estimate of combined values, as
is evident from Eq. (5.90).
Equations (5.90), (5.91) and (5.92) allow one to transform a measurement $\hat{x}'$ and its two
asymmetric error components $\sigma^{+\prime}$ and $\sigma^{-\prime}$ into three other quantities: $\langle x' \rangle$, $\operatorname{Var}[x']$ and
$\gamma[x']$. The advantage of this transformation is that the average, the variance and the
unnormalized skewness add linearly when adding random variables, and this allows
an easier combination of uncertainties.
If we have two measurements affected by asymmetric errors, say $\hat{x}_1{}^{+\sigma_1^+}_{-\sigma_1^-}$ and
$\hat{x}_2{}^{+\sigma_2^+}_{-\sigma_2^-}$, the average, variance and unnormalized skewness of the sum of the two
corresponding random variables $x_1$ and $x_2$ can be computed, assuming an underlying
piece-wise linear model, as the sums of the corresponding individual quantities.
These can be computed individually from $\hat{x}_1$ and $\hat{x}_2$ and their
corresponding asymmetric uncertainties, again from Eqs. (5.90), (5.91) and (5.92).
Using numerical techniques, the relation between $\langle x_1 + x_2 \rangle$, $\operatorname{Var}[x_1 + x_2]$ and
$\gamma[x_1 + x_2]$, and $\hat{x}_{1+2}$, $\sigma^+_{1+2}$ and $\sigma^-_{1+2}$, can be inverted
to obtain the correct estimate for $x_{1+2} = x_1 + x_2$ and its corresponding asymmetric
uncertainty components: $\hat{x}_{1+2}{}^{+\sigma^+_{1+2}}_{-\sigma^-_{1+2}}$.
Barlow [9] also considers the case of a parabolic dependence and obtains a
procedure to estimate $\hat{x}_{1+2}{}^{+\sigma^+_{1+2}}_{-\sigma^-_{1+2}}$ with this second model.
Any estimate of the sum of two measurements affected by asymmetric errors
requires an assumption about the underlying PDF model, and the results may be more or
less sensitive to the assumed model, depending on the case.
References
References
1. Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)
2. Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–89 (1945)
3. Eadie, W., Drijard, D., James, F., Roos, M., Sadoulet, B.: Statistical Methods in Experimental Physics. North Holland, London (1971)
4. James, F., Roos, M.: MINUIT: function minimization and error analysis. CERN Computer Centre Program Library, Geneva, Long Write-up No. D506 (1989)
5. Brun, R., Rademakers, F.: ROOT—an object oriented data analysis framework. In: Proceedings AIHENP'96 Workshop, Lausanne (1996); Nucl. Instrum. Methods A 389, 81–86 (1997). http://root.cern.ch/
6. Baker, S., Cousins, R.: Clarification of the use of chi-square and likelihood functions in fits to histograms. Nucl. Instrum. Methods A 221, 437–442 (1984)
7. Barlow, R.: Asymmetric errors. In: Proceedings of PHYSTAT2003, SLAC, Stanford (2003). http://www.slac.stanford.edu/econf/C030908/
8. Barlow, R.: Asymmetric statistical errors. arXiv:physics/0406120v1 (2004)
9. Barlow, R.: Asymmetric systematic errors. arXiv:physics/0306138 (2003)
10. D'Agostini, G.: Asymmetric uncertainties: sources, treatment and potential dangers. arXiv:physics/0403086 (2004)
Chapter 6
Combining Measurements
6.1 Introduction
Consider the model discussed in Sect. 5.10.2, with a signal peak around $m = 3.1$ GeV,
reported in Fig. 5.2. The two regions with $m < 3$ GeV and $m > 3.2$ GeV can be
considered as control regions, while the region with $3.04\ \mathrm{GeV} < m < 3.18\ \mathrm{GeV}$ can
be taken as the signal region. The background yield in the signal region, under the signal
peak, can be determined from the observed number of events in the control regions,
which contain a negligible amount of signal, extrapolated to the signal region by
applying a scale factor given by the ratio of the signal-region to control-region areas,
as expected from the predicted background distribution (an exponential distribution, in
the considered case). This background constraint is already effectively applied when
performing the fit described in Sect. 5.10.2, where the background shape parameter
$\lambda$, as well as the signal and background yields, are determined directly from data.
The problem of a simultaneous fit can be formalized in general as follows, also
taking into account a possible signal contamination in the control regions. Consider
two data samples, $\vec{x} = (x_1, \dots, x_h)$ and $\vec{y} = (y_1, \dots, y_k)$. The likelihood functions
$L_x(\vec{x}\,; \vec{\theta}\,)$ and $L_y(\vec{y}\,; \vec{\theta}\,)$ for the two individual data samples depend on a parameter
set $\vec{\theta} = (\theta_1, \dots, \theta_m)$. In particular, $L_x$ and $L_y$ may individually depend on a subset
of the comprehensive parameter set $\vec{\theta}$. Some of the parameters determine the signal
and the background yields to be constrained. The combined likelihood function that
comprises both data samples can be written as:
\[ L_{x,y}(\vec{x}, \vec{y}\,; \vec{\theta}\,) = L_x(\vec{x}\,; \vec{\theta}\,)\; L_y(\vec{y}\,; \vec{\theta}\,)\,. \tag{6.1} \]
For instance, the model may determine the signal and background yields in the control region from the fitted values of the
yields in the signal region, which may be free parameters in the fit. In this way, the
background yield is mainly determined from the control sample distribution,
while the signal yield is mainly determined from the distribution in the signal
region.
Consider two measurements of the same quantity $x$:
\[ x = \hat{x}_1 \pm \sigma_1\,, \tag{6.2} \]
\[ x = \hat{x}_2 \pm \sigma_2\,. \tag{6.3} \]
Assuming a Gaussian distribution for $\hat{x}_1$ and $\hat{x}_2$ and no correlation between the two
measurements, the following $\chi^2$ can be built:
\[ \chi^2 = \frac{(x - \hat{x}_1)^2}{\sigma_1^2} + \frac{(x - \hat{x}_2)^2}{\sigma_2^2}\,. \tag{6.4} \]
The minimization of the $\chi^2$ in Eq. (6.4) gives:
\[ \hat{x} = \frac{\dfrac{\hat{x}_1}{\sigma_1^2} + \dfrac{\hat{x}_2}{\sigma_2^2}}{\dfrac{1}{\sigma_1^2} + \dfrac{1}{\sigma_2^2}}\,. \tag{6.6} \]
The uncertainty on $\hat{x}$ is given by:
\[ \frac{1}{\sigma_{\hat{x}}^2} = -\frac{\partial^2 \log L}{\partial x^2} = \frac{1}{2} \frac{\partial^2 \chi^2}{\partial x^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\,. \tag{6.7} \]
Equation (6.6) is called the weighted average, and can be generalized for $N$ measurements
as:
\[ \hat{x} = \sum_{i=1}^{N} w_i\, \hat{x}_i\,, \tag{6.8} \]
where:
\[ w_i = \frac{1/\sigma_i^2}{1/\sigma^2}\,, \tag{6.9} \]
with $1/\sigma^2 = \sum_{i=1}^{N} 1/\sigma_i^2$, in order to ensure that $\sum_{i=1}^{N} w_i = 1$. The error on $\hat{x}$ is
given by:
\[ \sigma_{\hat{x}} = \sigma = \frac{1}{\sqrt{\sum_{i=1}^{N} 1/\sigma_i^2}}\,. \tag{6.10} \]
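Equations (6.8)–(6.10) translate directly into code; the measurement values below are invented.

```python
import numpy as np

def weighted_average(values, errors):
    """Weighted average of uncorrelated measurements, Eqs. (6.8)-(6.10)."""
    values, errors = np.asarray(values, float), np.asarray(errors, float)
    w = 1.0 / errors**2                 # unnormalized weights 1/sigma_i^2
    mean = (w * values).sum() / w.sum() # Eq. (6.8) with weights of Eq. (6.9)
    err = 1.0 / np.sqrt(w.sum())        # Eq. (6.10)
    return mean, err

# invented example: 10 +- 1 combined with 12 +- 2
m, e = weighted_average([10.0, 12.0], [1.0, 2.0])
```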
6.4 $\chi^2$ in $n$ Dimensions
Consider $n$ measurements $(\hat{x}_1, \dots, \hat{x}_n)$ of the quantities $\vec{\mu} = (\mu_1, \dots, \mu_n)$ which, in turn, depend on
another parameter set $\vec{\theta} = (\theta_1, \dots, \theta_m)$: $\mu_i = \mu_i(\theta_1, \dots, \theta_m)$. If the covariance
matrix of the measurements $(\hat{x}_1, \dots, \hat{x}_n)$ is $\sigma$, with elements $\sigma_{ij}$, the $\chi^2$ can be
written as follows:
\[ \chi^2 = \sum_{i,j=1}^{n} (\hat{x}_i - \mu_i)\, \sigma^{-1}_{ij}\, (\hat{x}_j - \mu_j) = (\hat{\vec{x}} - \vec{\mu}\,)^{\mathsf{T}}\, \sigma^{-1}\, (\hat{\vec{x}} - \vec{\mu}\,) \tag{6.11} \]
\[ = (\hat{x}_1 - \mu_1, \dots, \hat{x}_n - \mu_n) \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{pmatrix}^{\!\!-1} \begin{pmatrix} \hat{x}_1 - \mu_1 \\ \vdots \\ \hat{x}_n - \mu_n \end{pmatrix}. \tag{6.12} \]
n
The 2 expression can be minimized in order to determine the estimates O1 ; ; Om
of the unknown parameters 1 ; ; m , with the proper error matrix.
Examples of application of this method are the combinations of electroweak
measurements performed with data taken at the electron–positron collider LEP [1]
at CERN and the precision measurements performed at the SLC collider at
SLAC [2, 3], in the context of the LEP Electroweak Working Group [4] and the
GFitter Group [5]. The effect of radiative corrections that depend on the top-quark
and Higgs-boson masses is taken into account in the predictions, indicated above
by the relation $\mu_i = \mu_i(\theta_1, \dots, \theta_m)$. The combination of precision electroweak
measurements with this approach provided indirect estimates of the top-quark
mass before its discovery at the Tevatron, and a less precise estimate of the Higgs-boson
mass before the beginning of LHC data taking, where eventually the Higgs boson
was discovered. In both cases, the indirect estimates were in agreement with the
measured masses. The global fit performed by the GFitter group for the prediction of the
masses of the W boson and the top quark is shown in Fig. 6.1.
[Figure 6.1: Contours at the 68% and 95% confidence level in the plane $(m_W, m_t)$, mass of the W
boson vs mass of the top quark, from global electroweak fits including (blue) or excluding (gray)
the measurement of the Higgs boson mass $m_H$. The direct measurements of $m_W$ and $m_t$ (green) are
shown for comparison. The fit was performed by the GFitter group [5]; the figure is from [3]
(Open Access).]
6.5 The Best Linear Unbiased Estimator

Consider two measurements of the same quantity:
\[ x = \hat{x}_1 \pm \sigma_1\,, \tag{6.13} \]
\[ x = \hat{x}_2 \pm \sigma_2\,, \tag{6.14} \]
which have a correlation coefficient $\rho$. The case discussed in Sect. 6.3 was a special
case with $\rho = 0$. Taking into account the correlation term, the $\chi^2$ can be written
using the covariance matrix form in Eq. (2.91):
\[ \chi^2 = (x - \hat{x}_1,\ x - \hat{x}_2) \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix}^{\!\!-1} \begin{pmatrix} x - \hat{x}_1 \\ x - \hat{x}_2 \end{pmatrix}. \tag{6.15} \]
The minimization of the $\chi^2$ gives the combined estimate:
\[ \hat{x} = \frac{\hat{x}_1 (\sigma_2^2 - \rho \sigma_1 \sigma_2) + \hat{x}_2 (\sigma_1^2 - \rho \sigma_1 \sigma_2)}{\sigma_1^2 - 2\rho \sigma_1 \sigma_2 + \sigma_2^2}\,, \tag{6.16} \]
with variance:
\[ \sigma_{\hat{x}}^2 = \frac{\sigma_1^2\, \sigma_2^2\, (1 - \rho^2)}{\sigma_1^2 - 2\rho \sigma_1 \sigma_2 + \sigma_2^2}\,. \tag{6.17} \]
The general case of more than two measurements proceeds in a similar way,
and the minimization of the $\chi^2$ is equivalent to finding the best linear unbiased
estimate (BLUE), i.e. the unbiased linear combination of the measurements $\hat{\vec{x}} =
(\hat{x}_1, \dots, \hat{x}_N)$, with known covariance matrix $\sigma = (\sigma_{ij})$, that gives the lowest
variance.
Introducing a set of weights $\vec{w} = (w_1, \dots, w_N)$, the linear combination can be
written as:
\[ \hat{x} = \sum_{i=1}^{N} w_i\, \hat{x}_i = \hat{\vec{x}} \cdot \vec{w}\,. \tag{6.18} \]
The variance of $\hat{x}$ is:
\[ \sigma_{\hat{x}}^2 = \vec{w}^{\mathsf{T}}\, \sigma\, \vec{w}\,. \tag{6.19} \]
It can be shown [6] that the weights that minimize the variance $\sigma_{\hat{x}}^2$ in Eq. (6.19) are
given by the following expression:
\[ \vec{w} = \frac{\sigma^{-1}\, \vec{u}}{\vec{u}^{\mathsf{T}}\, \sigma^{-1}\, \vec{u}}\,, \tag{6.20} \]
where $\vec{u}$ is the vector having all elements equal to unity: $\vec{u} = (1, \dots, 1)$.
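The BLUE combination of Eqs. (6.18)–(6.20) can be sketched with NumPy. The invented example below uses $\rho = \sigma_1/\sigma_2 = 0.5$, which reproduces the zero-weight case noted later in this section: the second measurement receives weight zero.

```python
import numpy as np

def blue(values, cov):
    """BLUE combination: Eqs. (6.18)-(6.20)."""
    values, cov = np.asarray(values, float), np.asarray(cov, float)
    u = np.ones(len(values))
    cinv = np.linalg.inv(cov)
    w = cinv @ u / (u @ cinv @ u)   # weights, Eq. (6.20)
    x = w @ values                   # combined value, Eq. (6.18)
    var = w @ cov @ w                # combined variance, Eq. (6.19)
    return x, np.sqrt(var), w

# invented: sigma1 = 1, sigma2 = 2, rho = 0.5 -> covariance term 1.0
cov = [[1.0, 1.0],
       [1.0, 4.0]]
x, err, w = blue([10.0, 12.0], cov)
```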
The interpretation of Eq. (6.16) becomes more intuitive [7] if we introduce the
common error, defined as:
\[ \sigma_C^2 = \rho\, \sigma_1 \sigma_2\,. \tag{6.21} \]
Imagine, for instance, that the two measurements are affected by a common
uncertainty, like the knowledge of the integrated luminosity in the case of a cross-section
measurement, while the other uncertainties are uncorrelated. In that case, the two
measurements can be written as:
\[ x = \hat{x}_1 \pm \sigma_1' \pm \sigma_C\,, \tag{6.22} \]
\[ x = \hat{x}_2 \pm \sigma_2' \pm \sigma_C\,, \tag{6.23} \]
where $\sigma_1'^{\,2} = \sigma_1^2 - \sigma_C^2$ and $\sigma_2'^{\,2} = \sigma_2^2 - \sigma_C^2$. This is clearly possible only if $\sigma_C \le \sigma_1$
and $\sigma_C \le \sigma_2$, which is equivalent to requiring that the weights $w_1$ and $w_2$ in the BLUE
combination reported in Eq. (6.16) are both positive. Equation (6.16), in that case,
can also be written as:
\[ \hat{x} = \frac{\dfrac{\hat{x}_1}{\sigma_1^2 - \sigma_C^2} + \dfrac{\hat{x}_2}{\sigma_2^2 - \sigma_C^2}}{\dfrac{1}{\sigma_1^2 - \sigma_C^2} + \dfrac{1}{\sigma_2^2 - \sigma_C^2}}\,, \tag{6.24} \]
with a variance:
\[ \sigma_{\hat{x}}^2 = \frac{1}{\dfrac{1}{\sigma_1^2 - \sigma_C^2} + \dfrac{1}{\sigma_2^2 - \sigma_C^2}} + \sigma_C^2\,. \tag{6.25} \]
Equation (6.24) is equivalent to the weighted average in Eq. (6.6), where the errors
$\sigma_1'$ and $\sigma_2'$ are used to determine the weights instead of $\sigma_1$ and $\sigma_2$. Equation (6.25)
shows that the uncertainty contribution of the term $\sigma_C$ has to be added in quadrature to
the expression that gives the error for the ordinary weighted average in the case of no
correlation (Eqs. (6.7) and (6.10)).
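Equations (6.24) and (6.25) in code, with invented values; with $\sigma_C^2 = \rho \sigma_1 \sigma_2$ the result coincides with the general BLUE formulas (6.16)–(6.17).

```python
def common_error_average(x1, s1, x2, s2, sc):
    """Weighted average with a common error sc, Eqs. (6.24)-(6.25)."""
    w1 = 1.0 / (s1**2 - sc**2)   # weight from the uncorrelated part of sigma_1
    w2 = 1.0 / (s2**2 - sc**2)   # weight from the uncorrelated part of sigma_2
    x = (w1 * x1 + w2 * x2) / (w1 + w2)
    var = 1.0 / (w1 + w2) + sc**2  # common error added back in quadrature
    return x, var

# invented: 10 +- 1 and 12 +- 2 with common error sigma_C = 0.5
x, var = common_error_average(10.0, 1.0, 12.0, 2.0, 0.5)
```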
More in general, as is evident from Eq. (6.16), weights in the BLUE method can
also become negative. This may lead to counter-intuitive results. In particular, a
combination of two measurements may lie outside the range delimited by the
two central values. Also, when combining two measurements with $\rho = \sigma_1/\sigma_2$, the
weight of the second measurement is zero, hence the combined central value is not
influenced by $\hat{x}_2$. Conversely, if $\rho = \sigma_2/\sigma_1$, the central value is not influenced by $\hat{x}_1$.
The BLUE method, as well as the standard weighted average, provides the weights with which
individual measurements enter the combination. BLUE weights, however, can
become negative, and this prevents using the weights to quantify the
'importance' of each individual measurement in a combination.
For this reason, the relative importance (RI) of a measurement is sometimes defined in terms of the absolute value of its BLUE weight:
\[ \mathrm{RI}_i = \frac{|w_i|}{\sum_{i=1}^{N} |w_i|}\,, \tag{6.26} \]
in order to obey the normalization condition $\sum_{i=1}^{N} \mathrm{RI}_i = 1$. RI was quoted, for
instance, in combinations of top-quark mass measurements at the Tevatron and at
the LHC [8, 9].
The use of RIs was questioned [10] because it violates the
combination principle: in the case of three measurements, say $\hat{x}$, $\hat{y}_1$ and $\hat{y}_2$, the RI of the
measurement $\hat{x}$ changes depending on whether the three measurements are combined all together,
or $\hat{y}_1$ and $\hat{y}_2$ are first combined into $\hat{y}$, and then the measurement $\hat{x}$ is combined
with the partial combination $\hat{y}$.
Proposed alternatives to RI are based on the Fisher information, defined in Eq. (3.52),
which also poses a lower bound to the variance of an estimator (see Sect. 5.8.3). For
a single parameter, the Fisher information is given by:
\[ J = J_{11} = \vec{u}^{\mathsf{T}}\, \sigma^{-1}\, \vec{u} = \frac{1}{\sigma_{\hat{x}}^2}\,. \tag{6.27} \]
Two quantities are proposed in [10] to replace RI. The first is the intrinsic information weight
(IIW), defined as:
\[ \mathrm{IIW}_i = \frac{1/\sigma_i^2}{1/\sigma_{\hat{x}}^2} = \frac{1/\sigma_i^2}{J}\,. \tag{6.28} \]
The second is the marginal information weight (MIW), defined as the relative difference between the Fisher information of the combination and the
Fisher information of the combination obtained by excluding the $i$th measurement.
Both IIW and MIW do not obey a normalization condition. For IIW, a quantity
$\mathrm{IIW}_{\text{corr}}$ can be defined such that:
\[ \sum_{i=1}^{N} \mathrm{IIW}_i + \mathrm{IIW}_{\text{corr}} = 1\,. \tag{6.30} \]
6.5 The Best Linear Unbiased Estimator 137
Table 6.1 Properties of the different definitions of the weight of an individual measurement in a combination:

Weight type                                                      | ≥ 0 | $\sum_{i=1}^{N} = 1$ | Consistent with partial combination
BLUE coefficient $w_i$                                           | ✗   | ✓                    | ✓
Relative importance $|w_i| / \sum_{i=1}^{N} |w_i|$               | ✓   | ✓                    | ✗
Intrinsic information weight $\mathrm{IIW}_i$                    | ✓   | ✗                    | ✓
Marginal information weight $\mathrm{MIW}_i$                     | ✓   | ✗                    | ✓
$\mathrm{IIW}_{\text{corr}}$ represents the weight assigned to the correlation interplay, not assignable to
any individual measurement, and is given by:
\[ \mathrm{IIW}_{\text{corr}} = \frac{1/\sigma_{\hat{x}}^2 - \sum_{i=1}^{N} 1/\sigma_i^2}{1/\sigma_{\hat{x}}^2} = \frac{J - \sum_{i=1}^{N} 1/\sigma_i^2}{J}\,. \tag{6.31} \]
The properties of the BLUE weights, of RI and of the two proposed alternatives to RI
are summarized in Table 6.1.
The quantities IIW and MIW are reported, instead of RI, in papers presenting the
combination of LHC and Tevatron measurements related to top-quark physics [11–13].
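A sketch of Eqs. (6.27), (6.28), (6.30) and (6.31), reusing the invented correlated covariance matrix from before; note the negative $\mathrm{IIW}_{\text{corr}}$ restoring the normalization of Eq. (6.30).

```python
import numpy as np

def information_weights(cov):
    """Intrinsic information weights and correlation term,
    Eqs. (6.27), (6.28) and (6.31)."""
    cov = np.asarray(cov, float)
    u = np.ones(cov.shape[0])
    J = u @ np.linalg.inv(cov) @ u            # Fisher information, Eq. (6.27)
    iiw = (1.0 / np.diag(cov)) / J             # IIW_i, Eq. (6.28)
    iiw_corr = (J - (1.0 / np.diag(cov)).sum()) / J   # Eq. (6.31)
    return iiw, iiw_corr

# invented covariance: sigma1 = 1, sigma2 = 2, rho = 0.5
iiw, corr = information_weights([[1.0, 1.0],
                                 [1.0, 4.0]])
```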
[Figure 6.2: BLUE coefficient for the second measurement, $w_2$ (top), and combined BLUE variance
$\sigma_{\hat{x}}^2$ divided by $\sigma_1^2$ (bottom), as a function of the correlation $\rho$ between the two measurements, for
various fixed values of the ratio $\sigma_2/\sigma_1$ (1, 1.1, 1.5, 2, 3, 4).]
Fig. 6.3 The most conservative value of an unknown correlation coefficient ρ(cor) between
uncertainties σ₁(cor) and σ₂(cor) as a function of σ₁(cor)/σ₁, for different possible values of
σ₂(cor)/σ₁(cor) ≥ 1
Assume ρ(cor) is the correlation coefficient of the correlated terms, and assume
σ₁(cor) ≤ σ₂(cor). The most conservative value of ρ(cor), i.e. the value that maximizes
the total uncertainty, can be demonstrated [10] to be equal to 1 only for
σ₂(cor)/σ₁(cor) < (σ₁/σ₁(cor))²; for larger values of σ₂(cor)/σ₁(cor), the most
conservative choice of ρ(cor) is smaller than one and is given in [10].
Figure 6.3 shows the most conservative value of ρ(cor) as a function of σ₁(cor)/σ₁
for different possible values of the ratio σ₂(cor)/σ₁(cor).
The BLUE method is unbiased by construction, assuming that the true uncertainties
and their correlations are known. However, it can be proven that BLUE combinations
may exhibit a bias if uncertainty estimates are used in place of the true ones, and
in particular if the uncertainty estimates depend on the measured values. For instance,
when contributions to the total uncertainty are known as relative uncertainties, the
actual uncertainty estimates are obtained as the product of the relative uncertainties
times the measured central values. An iterative application of the BLUE method can
be implemented in order to mitigate such a bias.
L. Lyons et al. remarked in [14] on the limitations of the BLUE method in the
combination of lifetime measurements, where estimates σ̂ᵢ of the true unknown
uncertainties σᵢ were used, and those estimates had a dependency on the measured
lifetime. They also demonstrated that the application of the BLUE method violates,
in that case, the combination principle: if the set of measurements is split into a
number of subsets, the combination is first performed in each subset, and finally all
subset combinations are combined into a single grand combination, the result differs
from the single combination of all individual results of the entire set.
Reference [14] recommends applying the BLUE method iteratively, rescaling at
each iteration the uncertainty estimates according to the central value obtained with
the BLUE method in the previous iteration, until the sequence converges to a stable
result. It was also proven that the bias of this iterative BLUE estimate is reduced
compared to the standard application of the BLUE method.
A more extensive study of the iterative application of the BLUE method is also
available in [15].
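A minimal sketch of this iterative prescription, under the simplifying assumption that all uncertainties are purely relative, σᵢ = rᵢ x̂, and that the correlation matrix is fixed; all input numbers are toy values:

```python
import numpy as np

def iterative_blue(x, rel_err, corr, tol=1e-10, max_iter=1000):
    """Iterative BLUE: at each step the absolute uncertainties are rescaled
    to the current combined value, sigma_i = rel_err_i * x_hat, and the
    standard BLUE combination is recomputed until it stabilizes."""
    x = np.asarray(x, dtype=float)
    u = np.ones(len(x))
    x_hat = x.mean()  # starting point
    for _ in range(max_iter):
        sig = rel_err * x_hat               # rescaled absolute uncertainties
        V = np.outer(sig, sig) * corr       # covariance from correlation matrix
        Vinv = np.linalg.inv(V)
        w = Vinv @ u / (u @ Vinv @ u)
        x_new = w @ x
        if abs(x_new - x_hat) < tol:
            break
        x_hat = x_new
    return x_new

# two uncorrelated toy measurements with 10% and 20% relative uncertainties
x_comb = iterative_blue(np.array([10.0, 12.0]), np.array([0.1, 0.2]), np.eye(2))
```

For uncorrelated inputs the relative-uncertainty weights are independent of x̂, so the iteration converges immediately; the interesting bias reduction discussed in [14, 15] appears once correlations are introduced.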
References
1. The ALEPH, DELPHI, L3, OPAL Collaborations, the LEP Electroweak Working Group:
Electroweak measurements in electron-positron collisions at W-boson-pair energies at LEP.
Phys. Rep. 532, 119 (2013)
2. The ALEPH, DELPHI, L3, OPAL, SLD Collaborations, the LEP Electroweak Working Group,
the SLD Electroweak and Heavy Flavour Groups: Precision electroweak measurements on the
Z resonance. Phys. Rep. 427, 257 (2006)
3. The GFitter Group, Baak, M., et al.: The global electroweak fit at NNLO and prospects for the
LHC and ILC. Eur. Phys. J. C 74, 3046 (2014)
4. The LEP Electroweak Working Group. http://lepewwg.web.cern.ch/LEPEWWG/
5. A Generic Fitter Project for HEP Model Testing. http://project-gfitter.web.cern.ch/project-
gfitter/
6. Lyons, L., Gibaut, D., Clifford, P.: How to combine correlated estimates of a single physical
quantity. Nucl. Instrum. Methods A 270, 110–117 (1988)
7. Greenlee, H.: Combining CDF and D0 physics results. Fermilab Workshop on Confidence
Limits (2000)
8. The CDF and D0 Collaborations: Combination of CDF and D0 results on the mass of the top
quark using up to 5.8 fb⁻¹ of data. FERMILAB-TM-2504-E, CDF-NOTE-10549, D0-NOTE-
6222 (2011)
9. The ATLAS and CMS Collaborations: Combination of ATLAS and CMS results on the mass of
the top quark using up to 4.9 fb⁻¹ of data. ATLAS-CONF-2012-095, CMS-PAS-TOP-12-001
(2012)
10. Valassi, A., Chierici, R.: Information and treatment of unknown correlations in the combination
of measurements using the BLUE method. Eur. Phys. J. C 74, 2717 (2014)
11. ATLAS and CMS Collaborations: Combination of ATLAS and CMS results on the mass of
the top quark using up to 4.9 fb⁻¹ of √s = 7 TeV LHC data. ATLAS-CONF-2013-102,
CMS-PAS-TOP-13-005 (2013)
12. ATLAS and CMS Collaborations: Combination of ATLAS and CMS ttbar charge asymmetry
measurements using LHC proton-proton collisions at √s = 7 TeV. ATLAS-CONF-2014-012,
CMS-PAS-TOP-14-006 (2014)
13. ATLAS, CMS, CDF and D0 Collaborations: First combination of Tevatron and LHC
measurements of the top-quark mass. ATLAS-CONF-2014-008, CDF-NOTE-11071, CMS-
PAS-TOP-13-014, D0-NOTE-6416, FERMILAB-TM-2582, arXiv:1403.4427 (2014)
14. Lyons, L., Martin, A.J., Saxon, D.H.: On the determination of the B lifetime by combining the
results of different experiments. Phys. Rev. D41, 982–985 (1990)
15. Lista, L.: The bias of the unbiased estimator: a study of the iterative application of the BLUE
method. Nucl. Instrum. Methods A 764, 82–93 (2014); corr. ibid. A 773, 87–96 (2015)
Chapter 7
Confidence Intervals
7.1 Introduction
A more rigorous and general treatment of confidence intervals under the frequentist
approach is due to Neyman [4] and is discussed in the following in the simplest case
of a single parameter.
Let us consider a variable x distributed according to a PDF that depends on an
unknown parameter θ. We have in mind that x could be the value of an estimator of
the parameter θ. Neyman's procedure to determine confidence intervals proceeds in
two steps, sketched in Fig. 7.1:
1. the construction of a confidence belt;
2. the inversion of the confidence belt to determine the confidence interval.
Fig. 7.1 Graphical illustration of the Neyman belt construction (left) and inversion (right): for
each fixed θ₀ the belt gives the interval [x^lo(θ₀), x^hi(θ₀)]; its inversion at an observed x₀ gives
the interval [θ^lo(x₀), θ^up(x₀)]
In the first step, the confidence belt is determined by scanning the parameter space,
varying θ within its allowed range. For each fixed value of the parameter, θ = θ₀,
the corresponding PDF, which describes the distribution of x, f(x | θ₀), is known.
According to the PDF f(x | θ₀), an interval [x^lo(θ₀), x^up(θ₀)] is determined whose
corresponding probability is equal to the specified confidence level, defined as
CL = 1 − α, and usually equal to 68.27% (1σ), 90%, or 95%:

    1 - \alpha = \int_{x^{\mathrm{lo}}(\theta_0)}^{x^{\mathrm{up}}(\theta_0)} f(x \mid \theta_0) \, dx .   (7.1)
Fig. 7.2 Three possible choices of ordering rule: central interval (top) and fully asymmetric
intervals (bottom left, right)
Other possibilities consist in choosing the interval having the smallest size, or
fully asymmetric intervals on either side: [x^lo(θ₀), +∞[ or ]−∞, x^up(θ₀)]. More
options are also considered in the literature. Figure 7.2 shows three of the possible
cases described above.
A special ordering rule was introduced by Feldman and Cousins based on a
likelihood-ratio criterion and will be discussed in Sect. 7.5.
Given a choice of the ordering rule, the intervals [x^lo(θ), x^up(θ)], for all possible
values of θ, define the Neyman belt in the plane (x, θ), as shown in Fig. 7.1.
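The two steps can be sketched numerically for the simple case of a Gaussian f(x | θ) with known width and central-interval ordering. The grid ranges and step sizes below are arbitrary choices:

```python
import math

def gauss_cdf(x, theta, sigma=1.0):
    """CDF of f(x | theta), a Gaussian of mean theta and width sigma."""
    return 0.5 * (1.0 + math.erf((x - theta) / (sigma * math.sqrt(2.0))))

def quantile(p, theta, sigma=1.0):
    """Invert the Gaussian CDF by bisection."""
    lo, hi = theta - 10.0 * sigma, theta + 10.0 * sigma
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if gauss_cdf(mid, theta, sigma) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def neyman_interval(x0, cl=0.90, sigma=1.0):
    """Step 1: central acceptance interval [x_lo(theta), x_up(theta)] for each
    theta on a grid; step 2: keep the theta values whose interval contains x0."""
    alpha = 1.0 - cl
    thetas = [x0 + 8.0 * sigma * (i / 2000.0 - 0.5) for i in range(2001)]
    accepted = [t for t in thetas
                if quantile(alpha / 2, t, sigma) <= x0 <= quantile(1 - alpha / 2, t, sigma)]
    return min(accepted), max(accepted)
```

For the Gaussian case the inversion reproduces, up to the grid resolution, the familiar analytic interval [x₀ − 1.645σ, x₀ + 1.645σ] at 90% CL.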
Fig. 7.3 Neyman confidence belt in the (x, μ) plane
Consider a binomial process with N trials and n observed successes. The probability
of n for a given success probability p is:

    P(n; N, p) = \frac{N!}{n! \, (N-n)!} \, p^{n} \, (1-p)^{N-n} .   (7.4)

The parameter p can be estimated as:

    \hat{p} = \frac{\hat{n}}{N} .   (7.5)

This may be justified by the law of large numbers, because for N → ∞ the values of
p̂ and p coincide. The corresponding uncertainty estimate is:

    \sigma_{\hat{p}} = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{N}} .   (7.6)

However, Eq. (7.6) gives a null error in case either n̂ = 0 or n̂ = N, i.e. for
p̂ = 0 or 1, respectively.
A solution to the problem of determining the correct confidence interval for a
binomial distribution is due to Clopper and Pearson [1] and allows one to determine
the interval [p^lo, p^up] that gives at least the correct coverage 1 − α. The extremes
p^lo and p^up of the interval should be taken such that:

    P(n \ge \hat{n} \mid N, p^{\mathrm{lo}}) = \sum_{n=\hat{n}}^{N} \frac{N!}{n! \, (N-n)!} \, (p^{\mathrm{lo}})^{n} \, (1-p^{\mathrm{lo}})^{N-n} = \frac{\alpha}{2} ,   (7.7)

    P(n \le \hat{n} \mid N, p^{\mathrm{up}}) = \sum_{n=0}^{\hat{n}} \frac{N!}{n! \, (N-n)!} \, (p^{\mathrm{up}})^{n} \, (1-p^{\mathrm{up}})^{N-n} = \frac{\alpha}{2} .   (7.8)
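Equations (7.7) and (7.8) can be inverted numerically by bisection, exploiting the fact that both tail probabilities are monotonic in p. A minimal sketch:

```python
import math

def binom_pmf(n, N, p):
    """Binomial probability of Eq. (7.4)."""
    return math.comb(N, n) * p**n * (1.0 - p)**(N - n)

def _bisect(f, target, tol=1e-12):
    """Solve f(p) = target for p in [0, 1]; f must be monotone increasing."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(n_hat, N, cl=0.95):
    """Clopper-Pearson interval [p_lo, p_up] from Eqs. (7.7)-(7.8)."""
    alpha = 1.0 - cl
    # P(n >= n_hat | N, p) grows with p; it equals alpha/2 at p = p_lo
    p_lo = 0.0 if n_hat == 0 else _bisect(
        lambda p: sum(binom_pmf(n, N, p) for n in range(n_hat, N + 1)), alpha / 2)
    # P(n <= n_hat | N, p) falls with p; it equals alpha/2 at p = p_up
    p_up = 1.0 if n_hat == N else _bisect(
        lambda p: -sum(binom_pmf(n, N, p) for n in range(0, n_hat + 1)), -alpha / 2)
    return p_lo, p_up
```

For n̂ = 0 the lower extreme is taken at the boundary p^lo = 0, and Eq. (7.8) reduces to (1 − p^up)^N = α/2, which the bisection reproduces.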
Fig. 7.4 Neyman belt for the parameter p of a binomial distribution with N = 10 at the 68.27%
confidence level
Fig. 7.5 Coverage of Clopper–Pearson intervals as a function of p for the cases N = 10 (top) and
N = 100 (bottom)
So, for p^up we should consider the largest allowed value, p^up = 1, since
the probability P(n ≤ N | N, p^up) is equal to one. The first equation can be
inverted analytically in this case.
Fig. 7.6 Illustration of the flip-flopping problem. The plot shows the quoted central value of μ as
a function of the measured x (dashed line), and the 90% confidence interval corresponding to the
choice of quoting a central interval for x/σ ≥ 3 and an upper limit for x/σ < 3. The coverage
decreases from 90 to 85% for values of μ corresponding to the horizontal lines with arrows
If x/σ ≥ 3, the quoted confidence interval, given the measurement x, has a central
value with a symmetric error equal to ±σ at the 68.27% confidence level (CL), or
±1.645σ at 90% CL. Instead, if x/σ < 3, the confidence interval is [0, μ^up], with
an upper limit μ^up = x + 1.282σ at 90% CL, given the corresponding area under a
Gaussian PDF.
In summary, the quoted confidence interval at 90% CL is:

    [\mu^{\mathrm{lo}}, \mu^{\mathrm{up}}] =
    \begin{cases}
      [x - 1.645\,\sigma,\; x + 1.645\,\sigma] & \text{if } x/\sigma \ge 3 , \\
      [0,\; x + 1.282\,\sigma] & \text{if } x/\sigma < 3 .
    \end{cases}   (7.10)
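The coverage loss can be checked directly: the sketch below computes, for a given true μ ≥ 0, the probability that the flip-flopping interval of Eq. (7.10) contains μ (Gaussian measurement with known σ; purely illustrative):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def flip_flop_coverage(mu, sigma=1.0):
    """Probability that the flip-flopping interval of Eq. (7.10) contains
    the true value mu >= 0, for x Gaussian with mean mu and width sigma."""
    # central-interval branch: x >= 3 sigma and |x - mu| <= 1.645 sigma
    lo1 = max(3.0 * sigma, mu - 1.645 * sigma)
    hi1 = mu + 1.645 * sigma
    p1 = max(0.0, phi((hi1 - mu) / sigma) - phi((lo1 - mu) / sigma))
    # upper-limit branch: x < 3 sigma and mu <= x + 1.282 sigma
    lo2 = mu - 1.282 * sigma
    hi2 = 3.0 * sigma
    p2 = max(0.0, phi((hi2 - mu) / sigma) - phi((lo2 - mu) / sigma))
    return p1 + p2
```

For μ = 4σ, for instance, the two branches together cover only about 85% instead of the nominal 90%, while for large μ the nominal coverage is recovered, matching the dip illustrated in Fig. 7.6.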
7.5 The Unified Feldman–Cousins Approach

In order to avoid the flip-flopping problem and to ensure the correct coverage, the
ordering rule proposed by Feldman and Cousins [3] provides a Neyman confidence
belt, following the procedure described in Sect. 7.2, that smoothly changes from a
central or quasi-central interval to an upper limit in the case of a low observed signal
yield.
The proposed ordering rule is based on a likelihood ratio whose properties will
be further discussed in Sect. 9.5. Given a value θ₀ of the unknown parameter θ, the
chosen interval of the variable x used for the Neyman belt construction is defined
by the ratio of two PDFs of x, one under the hypothesis that θ is equal to the
considered fixed value θ₀, the other under the hypothesis that θ is equal to the
maximum likelihood estimate θ̂(x) corresponding to the given measurement x.
The likelihood ratio must be greater than a constant k_α whose value depends on
the chosen confidence level 1 − α. In a formula:

    \lambda(x \mid \theta_0) = \frac{f(x \mid \theta_0)}{f(x \mid \hat{\theta}(x))} > k_\alpha .   (7.11)

The constant k_α should be taken such that the integral of the PDF over the confidence
interval R_α is equal to 1 − α:

    \int_{R_\alpha} f(x \mid \theta_0) \, dx = 1 - \alpha .   (7.12)
Fig. 7.7 The Feldman–Cousins interval includes the values of x whose likelihood ratio is above
k_α, such that the corresponding probability is 1 − α
Consider the case of a Gaussian PDF with standard deviation σ = 1 and a mean μ
that is bound to be greater than or equal to zero. The maximum likelihood estimate
of μ is:

    \hat{\mu}(x) = \max(x, 0) .   (7.14)

The likelihood ratio in Eq. (7.11) can be written in this case as:

    \lambda(x \mid \mu) = \frac{f(x \mid \mu)}{f(x \mid \hat{\mu}(x))} =
    \begin{cases}
      \exp\left( -(x - \mu)^2 / 2 \right) & \text{if } x \ge 0 , \\
      \exp\left( x \mu - \mu^2 / 2 \right) & \text{if } x < 0 .
    \end{cases}   (7.16)
Fig. 7.8 Neyman confidence belt constructed using the Feldman–Cousins ordering: depending
on the measured x, the resulting interval for μ has symmetric errors, asymmetric errors, or is an
upper limit
with a central interval, like the one shown in Fig. 7.3. As seen, this approach
smoothly changes from a central interval to an upper limit, while correctly ensuring
the required coverage (90% in this case).
More applications of the Feldman–Cousins approach will be presented in
Chap. 10. The application of the Feldman–Cousins method requires, in most cases,
a numerical treatment, even for simple PDF models, like the Gaussian case
discussed above. The reason is that the inversion of the integral in Eq. (7.12) is
required. The inversion usually proceeds with a scan of the parameter space and,
for complex models, it may be very CPU intensive. For this practical reason,
other methods are often preferred to the Feldman–Cousins one in complex cases.
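A brute-force sketch of the Feldman–Cousins construction for the Gaussian case above (σ = 1, μ ≥ 0): for a fixed μ, values of x are accepted in decreasing order of the likelihood ratio λ(x | μ) of Eq. (7.16) until the probability 1 − α is reached. Grid range and spacing are arbitrary choices:

```python
import math

def fc_interval_x(mu, cl=0.90, x_lo=-10.0, x_hi=10.0, nx=4001):
    """Feldman-Cousins acceptance interval in x for a fixed mu, for a
    Gaussian of width sigma = 1 with the physical bound mu >= 0."""
    xs = [x_lo + (x_hi - x_lo) * i / (nx - 1) for i in range(nx)]
    dx = xs[1] - xs[0]

    def pdf(x, m):
        return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

    def lam(x):
        # likelihood ratio of Eq. (7.11), with mu_hat = max(x, 0), Eq. (7.14)
        return pdf(x, mu) / pdf(x, max(x, 0.0))

    # accept x values in decreasing order of lambda until the accumulated
    # probability reaches the confidence level, Eq. (7.12)
    prob, accepted = 0.0, []
    for x in sorted(xs, key=lam, reverse=True):
        accepted.append(x)
        prob += pdf(x, mu) * dx
        if prob >= cl:
            break
    return min(accepted), max(accepted)
```

For μ = 0 the acceptance region extends over all negative x and up to x ≈ 1.28 (an upper-limit-like belt edge), while for μ well away from the boundary the interval approaches the central one, μ ± 1.645.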
References
1. Clopper, C.J., Pearson, E.: The use of confidence or fiducial limits illustrated in the case of the
binomial. Biometrika 26, 404–413 (1934)
2. Cousins, R., Hymes, K.E., Tucker, J.: Frequentist evaluation of intervals estimated for a binomial
parameter and for the ratio of Poisson means. Nucl. Instrum. Methods A 612, 388–398 (2010)
3. Feldman, G., Cousins, R.: Unified approach to the classical statistical analysis of small signals.
Phys. Rev. D57, 3873–3889 (1998)
4. Neyman, J.: Outline of a theory of statistical estimation based on the classical theory of
probability. Philos. Trans. R. Soc. Lond. A Math. Phys. Sci. 236, 333–380 (1937)
Chapter 8
Convolution and Unfolding
8.1 Introduction
This chapter discusses two related problems: how to take into account a realistic
detector response, such as resolution, efficiency, and background, in a probability
model, and how to remove those experimental effects from an observed distribution
in order to recover the original distribution. This second problem is known as
unfolding.
Unfolding is not always necessary in order to compare an experimental dis-
tribution with the expectation from theory, since data can be compared with a
realistic prediction that takes into account experimental effects. In some cases,
however, it is desirable to produce a distribution that can be compared among
different experiments, each introducing different experimental effects. For those
cases, unfolding observed distributions may be a necessary task.
8.2 Convolution
Given an original distribution g(y) and a response function r(x; y) that describes the
detector effects, the observed distribution f(x) is the convolution of g with r:

    f(x) = \int r(x; y) \, g(y) \, dy .   (8.1)

Note that convolution may also be applied if the original distribution g(y) is not
normalized to unity; the integral of g(y) may represent the overall number of
random extractions of the variable y. Convolution of non-normalized distributions
will be more relevant in the discrete case, discussed in Sect. 8.2.2.
The convolution of a theoretical distribution with a finite resolution response is
also called smearing, which typically broadens the peaking structures present in
the original distribution. Examples of the effect of PDF convolution are shown in
Fig. 8.1.
In many cases, the response function r only depends on the difference x − y and
can be described by a function of one variable: r(x; y) = r(x − y). Sometimes the
notation f = g ⊗ r is also used in those cases to indicate the convolution of g and r.
Convolutions have interesting properties under Fourier transform. In particular,
let us define as usual the Fourier transform of g as:

    \hat{g}(k) = \int_{-\infty}^{+\infty} g(y) \, e^{iky} \, dy .   (8.2)

The Fourier transform of the convolution of two PDFs is equal to the product of
their Fourier transforms:

    \widehat{g \otimes r} = \hat{g} \, \hat{r} .   (8.4)

Conversely, the Fourier transform of the product of two PDFs is equal to the
convolution of the two Fourier transforms, i.e.:

    \widehat{g \, r} = \hat{g} \otimes \hat{r} .   (8.5)

A frequent case is a Gaussian response function:

    r(x - y) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-(x-y)^2 / 2\sigma^2} .   (8.6)
Fig. 8.1 Examples of convolution of two PDFs g(y) (solid light blue lines) with Gaussian
kernel functions r(x, y) (dashed black lines). Top: g(y) is the superposition of three Breit–
Wigner functions (see Sect. 2.13.1); bottom: g(y) is a piece-wise constant PDF. The result of the
convolution, f(x), is the solid dark blue line. The kernel function is scaled down by a factor of 2 and
20 in the top and bottom plots, respectively, for display convenience
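The discrete counterpart of Eq. (8.4), the convolution theorem, can be verified directly on two arbitrary arrays (toy numbers; circular convolution, which is what the discrete Fourier transform relates to):

```python
import numpy as np

g = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0])
r = np.array([0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25])

# circular convolution computed directly from its definition
n = len(g)
conv = np.array([sum(g[i] * r[(k - i) % n] for i in range(n)) for k in range(n)])

# Eq. (8.4): the Fourier transform of the convolution equals the
# product of the Fourier transforms
lhs = np.fft.fft(conv)
rhs = np.fft.fft(g) * np.fft.fft(r)
```

The identity holds exactly (up to floating-point rounding) for any pair of arrays, which is why convolutions are often implemented in practice via the FFT.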
In many cases, data samples are available in the form of histograms. For instance,
consider a sample randomly extracted according to a continuous distribution f(x).
The sample is, for convenience, stored into N intervals (bins) of given size of the
variable x, and the content of the bins (the number of entries) is (n₁, …, n_N). Each
nᵢ is distributed according to a Poisson distribution with expected value:

    \nu_i = \int_{x_i^{\mathrm{lo}}}^{x_i^{\mathrm{up}}} f(x) \, dx ,   (8.8)

x_i^lo and x_i^up being the edges of the ith bin. Consider that the values of x are derived
from original values of y, which are unaffected by the experimental effects and are
distributed according to a theoretical distribution g(y). The range of possible values
of y can be divided into M bins, where M is not necessarily equal to N. Each of
those bins j has an expected number of entries equal to:

    \mu_j = \int_{y_j^{\mathrm{lo}}}^{y_j^{\mathrm{up}}} g(y) \, dy ,   (8.9)

y_j^lo and y_j^up being the edges of the jth bin. For such discrete cases, the effect of a
realistic detector response is similar to Eq. (8.1) and can be written as:

    \nu_i = \sum_{j=1}^{M} R_{ij} \, \mu_j .   (8.10)

The N × M matrix R_{ij} is called the response matrix, and is responsible for the bin
migration from the original histogram of values of y to the observed histogram of
values of x.
The case of a kernel function that depends on the difference x − y in the continuous
case corresponds, in the discrete case, to a response matrix that only depends on the
index difference i − j.
An example of the effect of bin migration is shown in Fig. 8.2, where, for
simplicity of representation, M = N has been chosen.
The convolution presented in Eq. (8.1) represents the transformation of g(y) into
f(x), where both g and f are normalized to unity. In Eq. (8.10), instead, the
histograms of the expected values (ν₁, …, ν_N) and (μ₁, …, μ_M) do not obey
a normalization condition, since their content does not represent a probability
distribution, but the expected yield in each bin. In those cases, in addition to
bin migration, detector inefficiency and background may also be present.
Fig. 8.2 Effect of bin migration. Top: response matrix R_{ij}; bottom: the original distribution μ_j
(light blue line) with superimposed the convoluted distribution ν_i = Σ_j R_{ij} μ_j (dark blue line)
The effect of an efficiency ε_j alone, i.e. with no bin migration, corresponds to a
diagonal response matrix:

    \nu_i = \sum_{j=1}^{M} \varepsilon_j \, \delta_{ij} \, \mu_j = \varepsilon_i \, \mu_i .   (8.11)
The effect of a background expectation b_i alone is an additive term:

    \nu_i = \mu_i + b_i .   (8.12)

Combining migration, efficiency, and background, the most general case is:

    \nu_i = \sum_{j=1}^{M} \varepsilon_j \, R_{ij} \, \mu_j + b_i .   (8.13)
The efficiency term ε_j can be dropped from Eq. (8.13) and incorporated into the
definition of the response matrix R_{ij}, giving up the normalization condition:

    \sum_{i=1}^{N} R_{ij} = 1 ,   (8.14)

which would preserve the histogram normalization in Eq. (8.10). Absorbing the
efficiency term, Eq. (8.13) becomes:

    \nu_i = \sum_{j=1}^{M} R_{ij} \, \mu_j + b_i ,   (8.15)
where the response matrix now incorporates the efficiency, which can be recovered
from it as:

    \varepsilon_j = \sum_{i=1}^{N} R_{ij} .   (8.16)

In matrix form, Eq. (8.15) reads:

    \vec{\nu} = R \, \vec{\mu} + \vec{b} ,   (8.17)

where \vec{\nu} = (\nu_1, \ldots, \nu_N) and \vec{\mu} = (\mu_1, \ldots, \mu_M).
8.3 Unfolding by Inversion of the Response Matrix

The most straightforward approach to unfolding is the inversion of Eq. (8.17), but
it leads to results that are not usable in practice, as shown in the following.
The best estimate of \vec{\mu} can be determined with the maximum likelihood method.
The probability distribution for each n_i is a Poissonian with expected value ν_i,
hence the likelihood function is the product of Poisson terms Pois(n_i; ν_i(\vec{\mu})).
The maximum likelihood solution leads [6] to the inversion of Eq. (8.17). Assuming
for simplicity N = M, the estimate \hat{\vec{\mu}} that maximizes the likelihood function is:

    \hat{\vec{\mu}} = R^{-1} \, (\vec{n} - \vec{b}\,) .   (8.19)
The covariance matrix of the estimates is obtained by propagating the covariance
matrix V of the measurements n_i:

    U = R^{-1} \, V \, (R^{-1})^{T} ,   (8.20)

where, due to the Poisson distribution, V is a diagonal matrix with elements on the
diagonal equal to n_i.
It is also possible to demonstrate that this estimate is unbiased and has the smallest
possible variance (see Sect. 5.8.3). Results are shown in Fig. 8.3 for a Monte Carlo
example.
The response matrix was taken from Fig. 8.2 (top), and a histogram containing
the observations \vec{n} was generated according to the convoluted distribution from
Fig. 8.2 (bottom) with 10,000 entries. This histogram, generated by Monte Carlo,
is shown in Fig. 8.3 (top), superimposed on the original convoluted distribution.
The resulting estimates \hat{\mu}_j are shown in Fig. 8.3 (bottom), with uncertainty bars
computed according to Eq. (8.20). The plot shows very large oscillations from one
bin to the next, of the order of the uncertainty bars, which are very large as well.
The numerical values of the non-diagonal terms of the matrix U from Eq. (8.20)
show that each bin \hat{\mu}_j is very strongly anticorrelated with its adjacent bins,
\hat{\mu}_{j+1} and \hat{\mu}_{j-1}, the correlation coefficient being very close to −1. Statistical
fluctuations in the generated numbers of entries n_i, which are bin-by-bin independent,
turn into high-frequency bin-by-bin oscillations of the resulting estimates \hat{\mu}_j, due
to the very large anticorrelation, rather than, as one may expect, reflecting into the
uncertainties of the estimates \hat{\mu}_j.
In order to overcome the problems observed with the simple maximum likelihood
solution of the unfolding, other techniques have been introduced to regularize the
oscillations. The maximum likelihood solution has no bias, and it provides the
smallest possible variance (see Sect. 5.11.3). For this reason, in order to reduce the
large observed variance and the bin-by-bin correlations, regularization methods must
unavoidably introduce some bias in the unfolded estimates.
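The pathology of the plain matrix-inversion estimate can be reproduced with a small toy model. The tridiagonal response matrix below is a hypothetical example, not the one of Fig. 8.2:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
# toy response matrix: each bin leaks 25% into each neighbouring bin
R = 0.5 * np.eye(N) + 0.25 * (np.eye(N, k=1) + np.eye(N, k=-1))
R[0, 0] += 0.25
R[-1, -1] += 0.25  # keep the columns normalized at the edges, Eq. (8.14)

mu_true = np.full(N, 1000.0)
n = rng.poisson(R @ mu_true).astype(float)

# maximum likelihood estimate by matrix inversion, Eq. (8.19), with b = 0
Rinv = np.linalg.inv(R)
mu_hat = Rinv @ n
# covariance of the estimates, Eq. (8.20), with V = diag(n)
U = Rinv @ np.diag(n) @ Rinv.T

# correlation between two adjacent unfolded bins
corr = U[4, 5] / np.sqrt(U[4, 4] * U[5, 5])
```

The adjacent-bin correlation comes out strongly negative, which is the mechanism behind the high-frequency oscillations described above: independent Poisson fluctuations are turned into anticorrelated bin-by-bin swings.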
Fig. 8.3 Top: Monte Carlo generated histogram with 10,000 entries randomly extracted according
to the convoluted distribution \vec{\nu} in Fig. 8.2 (bottom), superimposed on the original convoluted
distribution; bottom: the result of response matrix inversion, i.e. the maximum likelihood estimate
\hat{\vec{\mu}} of the unconvoluted distribution, with error bars. Note that the vertical scale on the bottom
plot extends far above the range of the top plot
8.4 Bin-by-Bin Corrections

A simple approach to unfolding, which also turns out to have disadvantages, consists
in applying to the background-subtracted yields observed in each bin a correction
equal to the ratio of the estimated expected yields before and after the detector
effects, which are assumed to be known (taking N = M):

    \hat{\mu}_i = \frac{\mu_i^{\mathrm{est}}}{\nu_i^{\mathrm{est}}} \, (n_i - b_i) .   (8.22)

μ_i^est and ν_i^est are usually estimates of the true expected yields determined from
simulation. The estimate μ_i^est does not include the background contribution b_i.
This method produces results that, by construction, resemble the estimated
expectation, hence a simple visual inspection of unfolded distributions may not
show any apparent disagreement with the expectation. However, this approach has
the serious drawback that it introduces a bias that drives the estimates \hat{\mu}_i towards
the estimated expectations μ_i^est [6]. The bias, in fact, can be determined to be:

    \langle \hat{\mu}_i \rangle - \mu_i = \left( \frac{\mu_i^{\mathrm{est}}}{\nu_i^{\mathrm{est}}} - \frac{\mu_i}{\nu_i - b_i} \right) (\nu_i - b_i) .   (8.23)

Note that the expectation ν_i includes the background contribution b_i, hence the
term ν_i − b_i is the expected yield due to the signal only.
Due to Eq. (8.23), any comparison of unfolded data with a simulation prediction
performed in this way will be biased towards good agreement, with the risk of hiding
real discrepancies.
8.5 Regularized Unfolding

Acceptable unfolding methods should reduce the large variance that occurs in the
maximum likelihood estimates given by the simple matrix inversion discussed in
Sect. 8.3. This has the unavoidable cost of introducing a bias in the estimate, since,
as seen above, the only unbiased estimator with the smallest possible variance is the
maximum likelihood one. A tradeoff between smallest variance and smallest bias is
achieved by imposing an additional condition on the smoothness of the set of values
to be estimated, \vec{\mu}, which is quantified by a function S(\vec{\mu}). The definition of S may
vary according to the specific implementation. Such methods are called regularized
unfolding.
In practice, instead of minimizing φ = −2 log L (which is a χ², for a Gaussian
case), the function to be minimized is:

    \varphi(\vec{\mu}) + \lambda^2 \, S(\vec{\mu}) = \varphi_\lambda(\vec{\mu}) .   (8.24)
The parameter λ that appears in Eq. (8.24) is called the regularization strength. The
case λ = 0 is equivalent to the maximum likelihood estimate, which has already
been considered, and exhibits large bin-by-bin oscillations and large variances of the
estimates. In the other extreme, when λ is very large, the estimate \hat{\vec{\mu}} is extremely
smooth, in the sense that it minimizes S, but it is insensitive to the observed data \vec{n}.
The regularization procedure should find an optimal value of λ in order to achieve
a solution \hat{\vec{\mu}} that is sufficiently close to the minimum of −2 log L and, at the same
time, sufficiently smooth. The expected yields corresponding to the solution are:

    \hat{\vec{\nu}} = R \, \hat{\vec{\mu}} + \vec{b} ,   (8.25)

and, dropping constant terms, φ can be written, for Poisson-distributed yields, as:

    \varphi(\vec{\mu}) = 2 \sum_{i=1}^{N} \nu_i - 2 \sum_{i=1}^{N} n_i \log \nu_i .   (8.26)
One of the most popular regularization techniques [15, 18, 19] adopts the regulariza-
tion function given by the following expression:

    S(\vec{\mu}) = (L \vec{\mu})^{T} \, L \vec{\mu} ,   (8.27)

where L is a matrix with M columns and a number K of rows that may also be
different from M.
In the simplest implementation, L is taken as the unit matrix, and Eq. (8.27) becomes:

    S(\vec{\mu}) = \sum_{j=1}^{M} \mu_j^2 .   (8.28)

The term λ² S(\vec{\mu}) in the minimization simply damps solutions with very large
deviations of μ_j from zero.
A more frequent choice is an L matrix whose rows contain the coefficients of a
discrete second derivative:

    L = \begin{pmatrix}
          1 & -2 & 1 & 0 & \cdots & 0 \\
          0 & 1 & -2 & 1 & \cdots & 0 \\
          \vdots & & \ddots & \ddots & \ddots & \vdots \\
          0 & \cdots & 0 & 1 & -2 & 1
        \end{pmatrix} .   (8.29)

The reason for this choice is that the derivatives of a function, approximated by a
discrete histogram h, can be computed from finite differences, divided by the bin
size Δ. The first two derivatives can be approximated, in this way, as:

    h_i' = \frac{h_i - h_{i-1}}{\Delta} ,   (8.30)

    h_i'' = \frac{h_{i-1} - 2 h_i + h_{i+1}}{\Delta^2} ,   (8.31)

so that, up to a constant factor, the regularization function is the summed squared
curvature of the histogram:

    S(\vec{\mu}) = \sum_{j=2}^{M-1} (\mu_{j-1} - 2 \mu_j + \mu_{j+1})^2 .   (8.32)

The matrix L in Eq. (8.29), therefore, determines the approximate second derivative
of the histogram.
The method usually adopted to determine the optimal value of the regularization
strength parameter λ is a scan of the two-dimensional curve, called the L curve,
defined by the following coordinates [9]:

    L_x = \log \varphi(\hat{\vec{\mu}}(\lambda)) ,   (8.34)

    L_y = \log S(\hat{\vec{\mu}}(\lambda)) .   (8.35)

The optimal λ corresponds to the point of maximum curvature of the L curve.
8.6 Iterative Unfolding

An iterative unfolding procedure, usually referred to as iterative Bayesian unfolding
[7], updates the estimates at each iteration l as:

    \mu_j^{(l+1)} = \frac{\mu_j^{(l)}}{\varepsilon_j} \sum_{i=1}^{N} \frac{R_{ij} \, n_i}{\sum_{k=1}^{M} R_{ik} \, \mu_k^{(l)}} = \sum_{i=1}^{N} n_i \, M_{ij}^{(l)} .   (8.36)

Above, ε_j is given by Eq. (8.16). Usually, the simulation prediction μ_j^(0) = μ_j^est
is taken as the initial solution, motivated as a prior choice according to a Bayesian
interpretation.
Equation (8.36) can be demonstrated to converge to the maximum likelihood
solution, but a very large number of iterations is needed before it approaches the
asymptotic limit. The method can be stopped after a finite number I of iterations:
\hat{\mu}_j = \mu_j^{(I)}. Here I acts as the regularization parameter, similarly to λ for the
regularized unfolding in Sect. 8.5.
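A compact implementation of the update of Eq. (8.36); the response matrix and true spectrum are toy assumptions, and a flat spectrum is used as the starting point instead of the simulation prediction μ^est:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20
R = 0.5 * np.eye(N) + 0.25 * (np.eye(N, k=1) + np.eye(N, k=-1))
R[0, 0] += 0.25
R[-1, -1] += 0.25
eps = R.sum(axis=0)               # efficiencies, Eq. (8.16); all 1 here

x = np.arange(N)
mu_true = 1000.0 * np.exp(-0.5 * ((x - 10.0) / 3.0) ** 2) + 50.0
n = rng.poisson(R @ mu_true).astype(float)

def iterative_unfold(n, R, mu0, iterations):
    mu = mu0.copy()
    for _ in range(iterations):
        nu = R @ mu                           # folded expectation of the current solution
        mu = (mu / eps) * (R.T @ (n / nu))    # update of Eq. (8.36)
    return mu

mu0 = np.full(N, n.sum() / N)  # flat starting point
mu_hat = iterative_unfold(n, R, mu0, 100)
```

Two properties of the iteration are easy to verify: the estimates stay nonnegative, and, when all efficiencies are one, the total unfolded yield equals the total observed yield at every iteration.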
Equation (8.36) can be motivated using Bayes' theorem as follows. Given a
number of 'causes' C_j, j = 1, …, M, the 'effects' E_i, i = 1, …, N, can be
Fig. 8.4 Top: response matrix R_{ij}; middle: original distribution μ_j with superimposed the
convoluted distribution ν_i = Σ_j R_{ij} μ_j; bottom: observed yields n_i compared with the
expectations ν_i
Fig. 8.5 Top: data sample (solid histogram with error bars) randomly generated according to the
convoluted distribution (dashed histogram, superimposed); bottom: L-curve scan (black line). The
optimal L-curve point is shown as a blue cross
Fig. 8.6 Unfolded distribution using Tikhonov regularization, shown as points with error bars,
superimposed on the original distribution, shown as a dashed line
related to the causes using Bayes' theorem, which can be expressed with the following
formula [7]:

    P(C_j \mid E_i) = \frac{P(E_i \mid C_j) \, P_0(C_j)}{\sum_{k=1}^{M} P(E_i \mid C_k) \, P_0(C_k)} .   (8.37)

In Eq. (8.37), a cause C_j corresponds to an event generated in the jth bin of the
histogram of the original variable y, and an effect E_i corresponds to an event being
observed, after detector effects, in the ith bin of the histogram of the observable
variable x. P(E_i | C_j) has the role of the response matrix R_{ij}, and P_0(C_j) can be
rewritten as P_0(C_j) = μ_j^(0)/n_obs, where n_obs = Σᵢ₌₁ᴺ nᵢ. This allows one to
rewrite Eq. (8.37) as:

    P(C_j \mid E_i) = \frac{R_{ij} \, \mu_j^{(0)}}{\sum_{k=1}^{M} R_{ik} \, \mu_k^{(0)}} .   (8.38)
The number of events assigned to the cause C_j, corrected for the efficiency ε_j, is
then estimated as:

    \mu_j^{(1)} = \hat{n}(C_j) = \frac{1}{\varepsilon_j} \sum_{i=1}^{N} P(C_j \mid E_i) \, n(E_i) = \frac{\mu_j^{(0)}}{\varepsilon_j} \sum_{i=1}^{N} \frac{R_{ij} \, n_i}{\sum_{k=1}^{M} R_{ik} \, \mu_k^{(0)}} .   (8.39)
The covariance matrix of the estimates at iteration l can be obtained by propagating
the covariance matrix V of the measurements:

    U_{pq}^{(l)} = \sum_{i,j} \frac{\partial \hat{\mu}_p^{(l)}}{\partial n_i} \, \frac{\partial \hat{\mu}_q^{(l)}}{\partial n_j} \, V_{ij} ,   (8.40)

where, for Poisson-distributed yields, V can be estimated as:

    \hat{V}_{ij} = \delta_{ij} \, n_i .   (8.41)
The derivative term in Eq. (8.40) is, at the first iteration (l = 1), equal to [7]:

    \frac{\partial \hat{\mu}_j^{(1)}}{\partial n_i} = M_{ij}^{(0)} .   (8.42)
For the following iterations, the derivative can be estimated again iteratively,
according to the following formula [1]¹:

    \frac{\partial \hat{\mu}_j^{(l+1)}}{\partial n_i} = M_{ij}^{(l)} + \frac{\mu_j^{(l+1)}}{\mu_j^{(l)}} \, \frac{\partial \hat{\mu}_j^{(l)}}{\partial n_i} - \sum_{p,q} \frac{n_p \, \varepsilon_q}{\mu_q^{(l)}} \, M_{pj}^{(l)} \, M_{pq}^{(l)} \, \frac{\partial \hat{\mu}_q^{(l)}}{\partial n_i} .   (8.43)
Figure 8.7 shows the result of the iterative unfolding with 100 iterations, which
appears relatively smooth, for the same distribution considered in Fig. 8.6. Figure 8.8
shows the unfolded distributions after 1000, 10,000 and 100,000 iterations. An
oscillating structure, as for the maximum likelihood solution, slowly begins to emerge
as the number of iterations increases.
Regularization is achieved in the iterative unfolding method by stopping the
procedure after a number I of iterations. I should not be too large, otherwise the
solution starts to exhibit oscillations, as seen in Fig. 8.8. On the other hand, if I is
small, the solution is biased towards the initial values, like with bin-by-bin corrections
(see Sect. 8.4). A prescription for the choice of the regularization parameter I,
for iterative unfolding, is not as obvious as for the Tikhonov regularization (see
Sect. 8.5.1).
¹ The original derivation in [7] only reported the first term in Eq. (8.43), similarly to the case with
l = 0, as in Eq. (8.42). The complete formula was reported in [1]. A revised and improved version
of the method proposed in [7] was also later proposed in [8].
Fig. 8.7 Unfolded distribution using the iterative unfolding, shown as points with error bars,
superimposed on the original distribution, shown as a dashed line. 100 iterations have been used
In case a background subtraction is needed (b_i ≠ 0), two approaches can be used
to modify Eq. (8.36). The first is to subtract the background in the numerator:

    \mu_j^{(l+1)} = \frac{\mu_j^{(l)}}{\varepsilon_j} \sum_{i=1}^{N} \frac{R_{ij} \, (n_i - b_i)}{\sum_{k=1}^{M} R_{ik} \, \mu_k^{(l)}} ;   (8.44)

the second is to add the background expectation in the denominator:

    \mu_j^{(l+1)} = \frac{\mu_j^{(l)}}{\varepsilon_j} \sum_{i=1}^{N} \frac{R_{ij} \, n_i}{\sum_{k=1}^{M} R_{ik} \, \mu_k^{(l)} + b_i} .   (8.45)

The latter choice ensures nonnegative yields: \mu_j^{(l)} \ge 0.
8.7 Other Unfolding Methods

The presented list of unfolding methods is not exhaustive, and other unfolding
methods are used in physics applications. Among those, the singular value decom-
position (SVD) approach [10], shape-constrained unfolding [12], and a fully
Bayesian unfolding method [5] deserve to be mentioned.
Fig. 8.8 Unfolded distribution as in Fig. 8.7, using the iterative unfolding with 1000 (top), 10,000
(middle) and 100,000 (bottom) iterations
In the case of distributions in more than one dimension, the methods described above
can be used by considering that, in the response matrix R_{ij}, the indices i and j may
represent a single bin in multiple dimensions. That done, unfolding proceeds
similarly to what has been described in the previous sections. One caveat is the choice
of the matrix L for regularized unfolding, see for instance Eq. (8.29), which may
become nonobvious in multiple dimensions, since the L matrix should represent
an approximation of a derivative by finite differences. See [16] for a more detailed
discussion of this case.
References
1. Adye, T.: Corrected error calculation for iterative Bayesian unfolding. http://hepunx.rl.ac.uk/~
adye/software/unfold/bayes_errors.pdf (2011)
2. Adye, T., Claridge, R., Tackmann, K., Wilson, F.: RooUnfold: ROOT unfolding framework.
http://hepunx.rl.ac.uk/~adye/software/unfold/RooUnfold.html (2012)
3. Bracewell, R.: The Fourier Transform and Its Applications, 3rd. edn. McGraw-Hill, New York
(1999)
4. Brun, R., Rademakers, F.: ROOT—an object oriented data analysis framework. Proceedings
AIHENP'96 Workshop, Lausanne (1996). Nucl. Instrum. Methods Phys. Res. A 389, 81–86
(1997). http://root.cern.ch/
5. Choudalakis, G.: Fully Bayesian unfolding. arXiv:1201.4612 (2012)
6. Cowan, G.: Statistical Data Analysis. Clarendon Press, Oxford (1998)
7. D’Agostini, G.: A multidimensional unfolding method based on Bayes’ theorem. Nucl.
Instrum. Methods Phys. Res. A 362, 487–498 (1995)
8. D’Agostini, G.: Improved iterative Bayesian unfolding. arXiv:1010.0632 (2010)
9. Hansen, P.C.: The L-curve and Its use in the numerical treatment of inverse problems.
In: Johnston, P. (ed.) Computational Inverse Problems in Electrocardiology. WIT Press,
Southampton (2000)
10. Hoecker, A., Kartvelishvili, V.: SVD approach to data unfolding. Nucl. Instrum. Methods A
372, 469–481 (1996)
11. Kondor, A.: Method of converging weights—an iterative procedure for solving Fredholm’s
integral equations of the first kind. Nucl. Instrum. Methods Phys. Res. 216, 177 (1983)
12. Kuusela, M., Stark, P.B.: Shape-constrained uncertainty quantification in unfolding steeply
falling elementary particle spectra. arXiv:1512.00905 (2015)
13. Mülthei, H.N., Schorr, B.: On an iterative method for the unfolding of spectra. Nucl. Instrum.
Methods Phys. Res. A 257, 371 (1983)
14. Mülthei, H.N., Schorr, B.: On an iterative method for a class of integral equations of the first
kind. Math. Methods Appl. Sci. 9, 137 (1987)
15. Phillips, D. L.: A technique for the numerical solution of certain integral equations of the first
kind. J. ACM 9, 84 (1962)
16. Schmitt, S.: TUnfold: an algorithm for correcting migration effects in high energy physics. J.
Instrum. 7, T10003 (2012)
17. Shepp, L., Vardi, Y.: Maximum likelihood reconstruction for emission tomography. IEEE
Trans. Med. Imaging 1, 113–122 (1982)
18. Tikhonov, A. N.: On the solution of improperly posed problems and the method of regulariza-
tion. Sov. Math. 5, 1035 (1963)
19. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. John Wiley, New York (1977)
Chapter 9
Hypothesis Tests
9.1 Introduction
In statistical literature, when two hypotheses are present, these are usually called
null hypothesis, H0 , and alternative hypothesis, H1 .
Assume that the observed data sample consists of measurements of the variables x⃗ = (x₁, …, xₙ), randomly distributed according to some probability density function f(x⃗), which is in general different under the hypotheses H0 and H1: f(x⃗) = f(x⃗ | H0) or f(x⃗) = f(x⃗ | H1), according to which of H0 or H1 is true.
The goal of a test is to determine whether the observed data sample agrees better with H0 or rather with H1. Instead of using all the n available variables, (x₁, …, xₙ),
Fig. 9.1 Probability distribution functions for a discriminating variable t which has two different
PDFs for the signal (blue) and background (red) hypotheses under test
whose PDF f(x⃗) may have complex features, usually a test proceeds by determining the value of a function of the measured sample x⃗, t = t(x⃗), which summarizes the information contained in the data sample. t is called a test statistic, and its PDF, under each of the considered hypotheses, can be derived from the PDF f(x⃗) of the observable quantities x⃗.
A simple example of test statistic is a single variable x which has discriminating power between two hypotheses, say signal = 'muon' versus background = 'pion', as shown in Fig. 9.1, where t = x. A good separation of the two cases can be achieved if the PDFs of x under the hypotheses H1 = signal and H0 = background are appreciably different. Note that this is a conventional choice, and the opposite choice is also adopted in some literature.
If the observed value of the discriminating variable x is x̂, the simplest test statistic can be defined as the measured value itself:

    t̂ = t(x̂) = x̂ .    (9.1)

A selection requirement (in physics jargon sometimes called a cut) can be defined by identifying a particle as a muon if t̂ ≤ t_cut or as a pion if t̂ > t_cut, where the value t_cut is chosen a priori; the direction of the required inequality is motivated because, in this specific example, the background tends to have larger values of x than the signal.
Not all real muons are correctly identified as muons according to the selection criterion, just as not all real pions are correctly identified as pions. The expected fraction of selected signal particles (muons) is usually called signal selection efficiency, and the expected fraction of selected background particles (pions) is called misidentification probability.
Misidentified particles constitute a background to positively identified signal particles. Applying the required selection, in this case t̂ ≤ t_cut, on a data sample containing different detected particles, the sample of selected particles will be enriched in signal and will have a reduced fraction of background, compared to the original unselected sample. This is true if we assume that the selection efficiency
Statistical literature defines the significance level α as the probability to reject the hypothesis H0 if H0 is true. Rejecting H0 if it is true is called an error of the first kind, or type-I error. In our example, this means selecting a particle as a muon in case it is a pion. The background misidentification probability corresponds to α.
The probability β to reject the hypothesis H1 if it is true (error of the second kind, or type-II error) is equal to one minus the signal efficiency, i.e. the probability to incorrectly identify a muon as a pion, in our example.
By varying the value of the selection cut t_cut, different values of selection efficiency (1 − β) and misidentification probability (α) are determined. A typical
curve representing the signal efficiency versus the misidentification probability
obtained by varying the selection requirement is shown in Fig. 9.2, which is also
called receiver operating characteristic, or ROC curve.
A good selection should have a low misidentification probability together with a large selection efficiency. But clearly the background rejection cannot be perfect (i.e. the misidentification probability cannot drop to zero) if the distributions f(x | H0) and f(x | H1) overlap, as in Fig. 9.1.
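As an illustration of how a ROC curve is built, the sketch below (not part of the original text; the Gaussian PDFs and sample sizes are arbitrary assumptions) scans the selection t ≤ t_cut over two simulated samples and records the signal efficiency 1 − β and the misidentification probability α at each cut value:

```python
import numpy as np

def roc_points(signal, background, n_cuts=200):
    """Scan the selection t <= t_cut and return, for each cut value,
    the signal efficiency (1 - beta) and the misidentification
    probability (alpha)."""
    cuts = np.linspace(min(signal.min(), background.min()),
                       max(signal.max(), background.max()), n_cuts)
    eff = np.array([(signal <= c).mean() for c in cuts])        # 1 - beta
    misid = np.array([(background <= c).mean() for c in cuts])  # alpha
    return eff, misid

rng = np.random.default_rng(42)
sig = rng.normal(0.0, 1.0, 10000)  # hypothetical 'muon' distribution of t
bkg = rng.normal(2.0, 1.0, 10000)  # hypothetical 'pion' distribution of t
eff, misid = roc_points(sig, bkg)
```

Plotting `eff` versus `misid` reproduces the shape of the curve in Fig. 9.2: both quantities grow monotonically with t_cut, and the more the two PDFs overlap, the closer the curve gets to the diagonal.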
Fig. 9.2 Signal efficiency versus background misidentification probability (receiver operating
characteristic or ROC curve)
Fig. 9.3 Examples of two-dimensional selections of a signal (blue dots) against a background (red
dots). A linear cut is chosen on the left plot, while a box cut is chosen on the right plot
A test statistic that allows discriminating two samples in more dimensions using a linear combination of n discriminating variables is due to Fisher [1]. The optimal separation between the one-dimensional projections of two sets of random variables is achieved by maximizing the squared difference of the means of the two PDFs, projected on a given axis w⃗, divided by the corresponding variances added in quadrature.
The Fisher discriminant can be defined as:
.
1
2 /2
E/D
J.w ; (9.2)
12 C 22
where
1 and
2 are the averages and 1 and 2 are the standard deviations of the
E
two PDF projections, which depend on the projection direction w.
An example of Fisher projections along two different possible projection lines
is shown in Fig. 9.4 for a two-dimensional case. Different levels of separation of
Fig. 9.4 Example of projections of two-dimensional distributions (plots on the top left and right)
along two different lines. The red and blue distributions are projected on the black dashed lines;
the normal to that line is shown as a green line with an arrow. The bottom plots show the one-
dimensional projections of the corresponding top plots
the two distributions can be achieved by changing the projection line. The goal of the Fisher discriminant is to find the projection direction w⃗ that achieves the optimal separation.
The projected average difference is:

    μ₁ − μ₂ = w⃗ᵀ (m⃗₁ − m⃗₂) ,    (9.3)

where m⃗₁ and m⃗₂ are the n-dimensional averages of the two samples. The square of Eq. (9.3) gives the numerator in Eq. (9.2):

    (μ₁ − μ₂)² = w⃗ᵀ (m⃗₁ − m⃗₂)(m⃗₁ − m⃗₂)ᵀ w⃗ = w⃗ᵀ S_B w⃗ ,    (9.4)
where:

    S_B = (m⃗₁ − m⃗₂)(m⃗₁ − m⃗₂)ᵀ .    (9.5)

The variances of the two projections can be written in terms of the scatter matrices S₁ and S₂ of the two samples:

    σ₁² = w⃗ᵀ S₁ w⃗ ,    (9.6)
    σ₂² = w⃗ᵀ S₂ w⃗ ,    (9.7)

so that the denominator in Eq. (9.2) becomes:

    σ₁² + σ₂² = w⃗ᵀ (S₁ + S₂) w⃗ = w⃗ᵀ S_W w⃗ ,    (9.8)

where:

    S_W = S₁ + S₂ .    (9.9)
The problem of finding the vector w⃗ that maximizes J(w⃗) can be solved by computing the derivatives of J(w⃗) with respect to the components w_i of w⃗, or equivalently by solving the following eigenvalue equation:

    S_B w⃗ = λ S_W w⃗ ,    (9.11)

i.e.:

    S_W⁻¹ S_B w⃗ = λ w⃗ .    (9.12)

Since S_B w⃗ is always proportional to m⃗₁ − m⃗₂, the solution, up to a multiplicative factor, is:

    w⃗ = S_W⁻¹ (m⃗₁ − m⃗₂) .    (9.13)
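The construction above can be sketched numerically. In this illustrative example (the two-dimensional Gaussian samples below are arbitrary assumptions, not from the original text), the scatter matrices are built from the samples and the direction w⃗ = S_W⁻¹ (m⃗₁ − m⃗₂) is obtained by solving a linear system:

```python
import numpy as np

def fisher_direction(sample1, sample2):
    """Fisher discriminant direction w = S_W^{-1} (m1 - m2),
    with S_W = S1 + S2 the sum of the two scatter matrices."""
    m1, m2 = sample1.mean(axis=0), sample2.mean(axis=0)
    S1 = (sample1 - m1).T @ (sample1 - m1)   # scatter matrix of sample 1
    S2 = (sample2 - m2).T @ (sample2 - m2)   # scatter matrix of sample 2
    return np.linalg.solve(S1 + S2, m1 - m2)

rng = np.random.default_rng(1)
cov = [[1.0, 0.6], [0.6, 1.0]]
a = rng.multivariate_normal([0.0, 0.0], cov, 5000)
b = rng.multivariate_normal([2.0, 1.0], cov, 5000)
w = fisher_direction(a, b)

# separation J(w) = (mu1 - mu2)^2 / (sigma1^2 + sigma2^2) of the projections
pa, pb = a @ w, b @ w
J = (pa.mean() - pb.mean()) ** 2 / (pa.var() + pb.var())
```

By construction, `J` evaluated along `w` is at least as large as the separation obtained by projecting on any single input variable.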
The likelihood ratio of the two hypotheses is defined as:

    λ(x⃗) = L(x⃗ | H1) / L(x⃗ | H0) .    (9.14)

The test is optimal in the sense that, for a fixed background misidentification probability α, the selection that corresponds to the largest possible signal selection efficiency 1 − β is given by:

    λ(x⃗) = L(x⃗ | H1) / L(x⃗ | H0) ≥ k_α ,    (9.15)

where, by varying the value of the 'cut' k_α, the required value of α may be achieved. This corresponds to choosing a point on the ROC curve (see Fig. 9.2) such that α corresponds to the required misidentification probability.
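A minimal numerical illustration of the Neyman–Pearson selection follows (a sketch, not from the original text: the two unit-width Gaussian PDFs are assumed known). The cut k_α is chosen as a quantile of λ(x) evaluated on a background sample, so that the desired α is obtained:

```python
import numpy as np

def lam(x):
    """Likelihood ratio L(x|H1)/L(x|H0) for two unit-width Gaussian
    PDFs centered at 0 (H1 = signal) and at 2 (H0 = background);
    the normalization factors cancel in the ratio."""
    return np.exp(-0.5 * x ** 2 + 0.5 * (x - 2.0) ** 2)

rng = np.random.default_rng(7)
bkg = rng.normal(2.0, 1.0, 100000)  # sample distributed according to H0
sig = rng.normal(0.0, 1.0, 100000)  # sample distributed according to H1

# choose k_alpha so that a fraction alpha = 5% of the background passes
k_alpha = np.quantile(lam(bkg), 0.95)
alpha = (lam(bkg) >= k_alpha).mean()
efficiency = (lam(sig) >= k_alpha).mean()  # 1 - beta at this working point
```

For these assumed PDFs the ratio is monotone in x, so the likelihood-ratio selection reduces to a simple cut on x itself, as in Sect. 9.2.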
The Neyman–Pearson lemma provides the selection that achieves the optimal performances only if the joint multidimensional PDFs that characterize the problem are known. In many realistic cases, however, it is not easy to determine the correct model for the multidimensional PDFs, and approximate solutions may be adopted. Numerical methods and algorithms exist to find selections in the variable space whose performances, in terms of efficiency and misidentification probability, are close to the optimal limit given by the Neyman–Pearson lemma. Among those approximate methods, machine-learning algorithms, such as artificial neural networks and boosted decision trees, are widely used in high-energy physics, and will be introduced in Sect. 9.10.
If the PDFs cannot be exactly factorized, however, the test statistic defined in Eq. (9.17) differs from the exact likelihood ratio in Eq. (9.14) and corresponds to worse performances in terms of α and β compared with the best possible performances obtained using the Neyman–Pearson lemma.
In some cases, though, the simplicity of this method can justify its application in spite of the suboptimal performances. The marginal PDFs f_i can be obtained using Monte Carlo training samples with a large number of entries, which allow one to produce histograms of the distributions of the individual variables x_i that, to a good approximation, reproduce the marginal PDFs.
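The histogram-based estimate of the marginal PDFs described above can be sketched as follows (illustrative only; the Gaussian samples, the binning and the small regularization term `eps` are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical training samples in two variables
sig = rng.normal([0.0, 0.0], 1.0, size=(50000, 2))
bkg = rng.normal([1.5, 1.5], 1.0, size=(50000, 2))

edges = np.linspace(-5.0, 7.0, 61)  # common binning for all marginals

def marginal_hists(sample):
    """Histogram-based estimates of the marginal PDFs f_i(x_i)."""
    return [np.histogram(sample[:, i], bins=edges, density=True)[0]
            for i in range(sample.shape[1])]

hs, hb = marginal_hists(sig), marginal_hists(bkg)

def projective_lr(x, eps=1e-12):
    """Projective discriminant: product over i of f_i(x_i|H1)/f_i(x_i|H0)."""
    idx = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)
    num = np.prod([hs[i][idx[i]] for i in range(len(x))])
    den = np.prod([hb[i][idx[i]] for i in range(len(x))])
    return num / (den + eps)
```

The discriminant is exact only if the true joint PDFs factorize into these marginals; here the two variables are generated independently, so the approximation is good by construction.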
Some numerical implementations apply the projective likelihood-ratio discriminant after a proper rotation in the variable space that reduces or eliminates the correlation among variables via a diagonalization of the covariance matrix. This improves the performances of the method, but does not necessarily reach the optimal performances of the Neyman–Pearson lemma: uncorrelated variables are not necessarily independent (see for instance Example 2.7), and the PDF can be factorized only for independent variables (see Sect. 2.15.1).
The approximation of a multidimensional PDF as a binned histogram in multiple
dimensions may become intractable since the size of the training samples should
increase as the number of bins to the power of the number of dimensions.
A test due to Kolmogorov [3], Smirnov [4] and Chakravarti [5] can be used to assess the hypothesis that a data sample is compatible with a given distribution. Consider a sample (x₁, …, xₙ), ordered by increasing values of x, that has to be compared with a distribution f(x), assumed to be a continuous function. The discrete cumulative distribution of the sample can be defined as:

    F_n(x) = (1/n) ∑_{i=1}^{n} θ(x − x_i) ,    (9.18)

where θ denotes the step function, so that F_n(x) is the fraction of measurements below x.
The distribution F_n(x) can be compared with the cumulative distribution of f(x), defined as:

    F(x) = ∫_{−∞}^{x} f(x′) dx′ .    (9.20)
The maximum distance D_n between the two cumulative distributions F_n(x) and F(x) is used to quantify the agreement of the data sample (x₁, …, xₙ) with f(x).
[Figure: the empirical cumulative distribution F_n(x) compared with F(x); D_n is the maximum distance between the two curves]
asymptotically converges to zero if n and m are sufficiently large, and the following test statistic asymptotically follows the Kolmogorov distribution according to Eq. (9.22):

    √( n m / (n + m) ) · D_{n,m} .    (9.24)
Alternative tests to Kolmogorov–Smirnov are due to Stephens [8], Anderson and
Darling [9], and Cramér [10] and von Mises [11].
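Both the one-sample and the two-sample versions of the Kolmogorov–Smirnov test are available in common statistics libraries; the sketch below uses SciPy's implementations (an illustration, not part of the original text; the samples are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sample = rng.normal(0.0, 1.0, 1000)

# one-sample test of the data against a continuous cumulative
# distribution F(x), here the standard normal
d_n, p_one = stats.kstest(sample, 'norm')

# two-sample test: D_{n,m} is the maximum distance between the two
# empirical cumulative distributions
other = rng.normal(0.0, 1.0, 800)
d_nm, p_two = stats.ks_2samp(sample, other)
```

For a sample actually drawn from the tested distribution, D_n is typically of order 1/√n, consistent with Eq. (9.24).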
When the two hypotheses are nested, i.e. Θ₀ ⊂ Θ₁, Wilks' theorem ensures that, under H0 and some regularity conditions, the following test statistic is asymptotically distributed as a χ² with a number of degrees of freedom r equal to the difference of the dimensionalities of Θ₁ and Θ₀:

    χ²_r = −2 ln [ sup_{θ⃗ ∈ Θ₀} ∏_{i=1}^{N} L(x⃗_i; θ⃗) / sup_{θ⃗ ∈ Θ₁} ∏_{i=1}^{N} L(x⃗_i; θ⃗) ] .    (9.25)
    χ²_r(μ₀) = −2 ln [ sup_{θ⃗} ∏_{i=1}^{N} L(x⃗_i; μ₀, θ⃗) / sup_{μ, θ⃗} ∏_{i=1}^{N} L(x⃗_i; μ, θ⃗) ] ,    (9.26)

where the denominator is the likelihood evaluated at the maximum-likelihood estimates of all parameters:

    sup_{μ, θ⃗} ∏_{i=1}^{N} L(x⃗_i; μ, θ⃗) = ∏_{i=1}^{N} L(x⃗_i; μ̂, θ̂⃗) .

The ratio appearing in Eq. (9.26) can therefore be written as:

    λ(μ₀) = L(x⃗ | μ₀, θ̂⃗(μ₀)) / L(x⃗ | μ̂, θ̂⃗) ,    (9.27)

where θ̂⃗(μ₀) is the value of θ⃗ that maximizes the likelihood for the fixed value μ₀, while μ̂ and θ̂⃗ are the overall maximum-likelihood estimates, so that χ²_r(μ₀) = −2 ln λ(μ₀).
This test statistic is known as profile likelihood, and its application to upper limit determination will be discussed in Sect. 10.11.
In the previous section, the likelihood function was considered for a set of independent observations (x⃗₁, …, x⃗_N) with parameters θ⃗:

    L(x⃗₁, …, x⃗_N; θ⃗) = ∏_{i=1}^{N} f(x⃗_i; θ⃗) .    (9.28)
Two hypotheses H1 and H0 are represented as two possible sets of values, Θ₁ and Θ₀, of the parameters θ⃗.
Usually the number of events N can also be used as information, introducing the extended likelihood function (see Sect. 5.10.2):

    L(x⃗₁, …, x⃗_N; θ⃗) = (e^{−ν(θ⃗)} ν(θ⃗)^N / N!) ∏_{i=1}^{N} f(x⃗_i; θ⃗) ,    (9.29)

where, in the Poissonian term, the expected number of events ν may also depend on the parameters θ⃗: ν = ν(θ⃗).
Typically, we want to discriminate between two hypotheses: H1 represents the presence of both signal and background, i.e. ν = s + b, while H0 represents the presence of only background events in our sample, i.e. ν = b; equivalently, under H1:

    f(x⃗; θ⃗) = [ s / (s + b) ] f_s(x⃗; θ⃗) + [ b / (s + b) ] f_b(x⃗; θ⃗) .    (9.30)
In this case, the extended likelihood function in Eq. (9.29) becomes similar to Eq. (5.27):

    L_{s+b}(x⃗₁, …, x⃗_N; μ, θ⃗) = (e^{−(μ s(θ⃗) + b(θ⃗))} / N!) ∏_{i=1}^{N} [ μ s f_s(x⃗_i; θ⃗) + b f_b(x⃗_i; θ⃗) ] ,    (9.31)

where μ is the signal strength.
Note that s and b may also depend on the unknown parameters θ⃗:

    s = s(θ⃗) ,    (9.32)
    b = b(θ⃗) .    (9.33)

For instance, in a search for the Higgs boson, the theoretical cross section may depend on the Higgs boson's mass. Also the PDF for the signal, f_s, which represents a resonance peak, depends on the Higgs boson's mass.
Similarly, the likelihood function for the background-only hypothesis H0 is:

    L_b(x⃗₁, …, x⃗_N; θ⃗) = (e^{−b(θ⃗)} / N!) ∏_{i=1}^{N} b f_b(x⃗_i; θ⃗) .    (9.34)
The term 1/N! disappears when taking the likelihood ratio in Eq. (9.14), which becomes:

    λ(μ, θ⃗) = L_{s+b}(x⃗₁, …, x⃗_N; μ, θ⃗) / L_b(x⃗₁, …, x⃗_N; θ⃗)
             = e^{−(μ s(θ⃗) + b(θ⃗))} ∏_{i=1}^{N} [ μ s f_s(x⃗_i; θ⃗) + b f_b(x⃗_i; θ⃗) ] / ( e^{−b(θ⃗)} ∏_{i=1}^{N} b f_b(x⃗_i; θ⃗) )    (9.35)
             = e^{−μ s(θ⃗)} ∏_{i=1}^{N} ( μ s f_s(x⃗_i; θ⃗) / ( b f_b(x⃗_i; θ⃗) ) + 1 ) .
In the case of a simple event-counting experiment, the likelihood function only accounts for the Poissonian probability term, which depends only on the number of observed events N, and the dependence on the parameters θ⃗ only appears in the expected signal and background yields. The likelihood ratio that defines the test statistic is, in that case:

    λ(θ⃗) = L_{s+b}(N; θ⃗) / L_b(N; θ⃗) ,    (9.37)

which is a simplified version of Eq. (9.36), where the terms f_s and f_b have been dropped.
Equations (9.39) and (9.36) are used to determine upper limits in searches for a new signal, as will be discussed in Chap. 10.
Note that the hypotheses b and s + b are nested (b is a particular case of s + b with μ = 0), but the likelihood ratios that define the test statistics in Eqs. (9.36) and (9.39) have s + b in the numerator and b in the denominator, which is the inverse of the convention of the likelihood ratio introduced in Eq. (9.25). Wilks' theorem hypotheses apply to those cases as well, with an extra minus sign in the definition of the test statistic.
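For the simple counting case of Eq. (9.37), the ratio can be computed directly, since the N! terms cancel; a minimal sketch (not from the original text; the yields are arbitrary numbers):

```python
import math

def counting_lr(N, s, b):
    """lambda = L_{s+b}(N) / L_b(N) for a Poisson counting experiment:
    the N! terms cancel, leaving exp(-s) * ((s + b) / b)**N."""
    return math.exp(-s) * ((s + b) / b) ** N

# observing more events than the background expectation favors s+b
lr_excess = counting_lr(20, 5.0, 10.0)   # N above b + s
lr_deficit = counting_lr(5, 5.0, 10.0)   # N below b
```

Values of the ratio above (below) one indicate that the observed count agrees better with the s + b (background-only) hypothesis.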
Alternatively, algorithms that group and interpret data according to the observed distribution of the input data are called unsupervised learning.
In the following, examples of supervised learning are discussed.
In many cases of supervised learning, training samples are provided by computer simulations that reproduce the expected distributions under either the H0 or the H1 hypothesis. If control samples in data are available with very good purity, those data samples can be used as training samples, avoiding uncertainties due to the simulation modeling. Each entry in a data sample is often called an observation in machine-learning literature, and typically represents a collision event or an observed particle of a given type in particle physics.
9.10.1 Overtraining
A potential general problem with the training of machine learning algorithms is the
possibility that the algorithm tries to exploit artifacts of the necessarily finite size
of training samples that are not representative of the actually expected distributions.
This possibility may occur, in particular, if the size of the training sample given as
input is not very large. The effect of overtraining is illustrated in Fig. 9.6. In practice,
an irregular selection may pick or skip individual observations in training categories.
The presence of overtraining may be spotted by looking at the discriminant
distribution and its performances in terms of efficiency and background rejection
evaluated on the training sample and compare with another independently-generated
test sample. If the discriminant distributions for the training and test samples are not
in good agreement (for instance, Kolmogorov–Smirnov test may be used to check
Fig. 9.6 Illustration of overtraining of a machine learning algorithm in two dimensions. Blue
points are signal training observations, red points are background. The multivariate selection is
represented as a dashed curve. Left: smooth selection based on a regular algorithm output; right:
selection based on an overtrained algorithm that picks artifacts due to statistical fluctuations of the
finite-size training samples
the agreement, see Sect. 9.7), and the performances evaluated on the training sample are significantly better than the ones evaluated on the test sample, this may indicate the presence of overtraining; hence the training of the algorithm and its performances may not be optimal when applied to real data.
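The check described above can be sketched as follows (illustrative only; the classifier scores are simulated here with an arbitrary Beta distribution, and the p-value threshold is an arbitrary choice):

```python
import numpy as np
from scipy import stats

def overtraining_check(score_train, score_test, threshold=0.01):
    """Compare the classifier-output distributions on the training and
    test samples with a two-sample Kolmogorov-Smirnov test (Sect. 9.7);
    a very small p-value signals a disagreement that may hint at
    overtraining."""
    d, p = stats.ks_2samp(score_train, score_test)
    return p >= threshold

rng = np.random.default_rng(11)
# hypothetical classifier scores; train and test drawn from the same PDF
train = rng.beta(2, 5, 5000)
test = rng.beta(2, 5, 5000)
compatible = overtraining_check(train, test)
```

In practice this distribution comparison is complemented by comparing the efficiency and background rejection measured on the two samples, as described in the text.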
Artificial neural networks have been designed in the attempt to mimic with computers a simplified functional model of neuron cells in animals. The discriminant output is computed by combining the response of multiple nodes, each representing a single neuron cell. Nodes are arranged into layers, as shown in Fig. 9.7. The input variable values (x₁, …, x_p) are passed to a first input layer, whose output is passed as input to the next layer, and so on. Finally, the last output layer is usually constituted by a single node that provides the discriminant output. More than one output node can be used to encode more than two possible outcomes of the discrimination, as could be the case for the identification of hand-written characters. Intermediate layers between the input and the output layers are called hidden layers. Such a structure is also called a feedforward multilayer perceptron.
The input layer may apply linear transformations to the input variables in order to adjust their range to a standard interval, typically [0, 1]. The output of each node is computed as a weighted average of the input variables, with weights that are subject to optimization during the training.
[Fig. 9.7: a feedforward multilayer perceptron, with an input layer receiving (x₁, …, x_p), hidden layers, and an output layer providing the discriminant output y; nodes are connected by weights w_ij^(k)]
The activation function φ limits the output to the range [0, 1]. A typical choice for φ is a sigmoid function:

    φ(ν) = 1 / (1 + e^{−ν}) .    (9.41)
The network is trained by minimizing a loss function, for instance the sum of the squared differences between the network output and the true classification of each training observation:

    L(w) = ∑_{i=1}^{N} ( y_i^true − y(x⃗_i) )² .    (9.43)

y_i^true is usually equal to 1 for signal (H1 is true) and 0 for background (H0 is true). The loss function depends on the choice of the network weights w, which are not explicitly written in Eq. (9.43) but enter through y(x⃗_i).
A popular algorithm to optimize the weights in order to minimize the loss function consists in iteratively modifying the weights after each training observation, or after a bunch of training observations. Weights are typically randomly initialized in order to break the symmetry among different nodes. The minimization usually proceeds via the so-called stochastic gradient descent [13], which modifies the weights at each iteration according to the following formula:

    w_ij → w_ij − η ∂L(w)/∂w_ij .

The parameter η is called the learning rate, and controls how large the change to the parameters w is at each iteration. This method is also called back propagation [14],
since the value computed as output drives the changes to the last node's weights, which, in turn, drive the changes to the weights in the previous layer, and so on.
A more extensive introduction to artificial neural networks can be found in [15].
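The forward and backward passes described above can be sketched as follows (a simplified illustration with toy data; the architecture, learning rate, full-batch updates and number of epochs are arbitrary assumptions, not the book's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy training sample: signal (y=1) and background (y=0) blobs in 2 variables
X = np.vstack([rng.normal(-1.0, 1.0, (500, 2)), rng.normal(1.0, 1.0, (500, 2))])
y = np.vstack([np.ones((500, 1)), np.zeros((500, 1))])

# one hidden layer with 8 nodes and a single output node; weights are
# randomly initialized to break the symmetry among nodes
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)

eta = 1.0  # learning rate
for _ in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass (back propagation) for the loss sum_i (y_i - out_i)^2
    d_out = 2.0 * (out - y) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    # gradient-descent update (full batch here, for brevity)
    W2 -= eta * h.T @ d_out / len(X); b2 -= eta * d_out.mean(axis=0)
    W1 -= eta * X.T @ d_h / len(X); b1 -= eta * d_h.mean(axis=0)

accuracy = ((out > 0.5) == (y > 0.5)).mean()
```

The factors `out * (1 - out)` and `h * (1 - h)` are the derivatives of the sigmoid; when node outputs saturate near 0 or 1, these factors vanish and learning slows down, as discussed below for networks with many hidden layers.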
It has been demonstrated that an artificial neural network with a single hidden layer may approximate any analytical function to any desired accuracy, provided that the number of neurons is sufficiently large and that the activation function satisfies certain regularity conditions [16]. In practice, however, the number of nodes required to achieve the desired approximation may be very large, and some performance limitations may be present in artificial neural networks with a reasonably manageable number of nodes. Those performance limitations made boosted decision trees (see Sect. 9.12) a preferred choice over artificial neural networks in several data analyses in particle physics experiments.
The number of nodes required for optimal performances is smaller if more hidden layers are added than for a single hidden layer [17], but adding more hidden layers could make the training process harder to converge, because it is more frequent to meet conditions where the output of the activation function is close to 0 or 1, which corresponds to a gradient of the loss function close to zero, making the learning by stochastic gradient descent slower.
Those limitations have been recently overcome by techniques that use several
hidden layers and possibly relatively large numbers of nodes per layer, and
optimized training algorithms that make the treatment of complex networks feasible
with modern computing technologies, even for cases that were intractable with
traditional algorithms. Those techniques are called deep learning [18], and have
recently become popular for advanced applications like image classification or face
recognition.
One of the first applications of artificial neural networks with several layers and a large number of nodes was the classification of handwritten digits based on a large number of training images [19]. The adopted network had 28² = 784 input variables¹ and 10 outputs in order to encode the ten digits from 0 to 9. Five hidden layers were used, with 2500, 2000, 1500, 1000 and 500 nodes, respectively. The training method was a standard stochastic gradient descent (see Sect. 9.11), and intensive computing power was used.
In high-energy physics, deep learning techniques allow achieving optimal performances, exceeding shallow neural networks and boosted decision trees. Moreover, deep learning provides another very convenient feature: it is possible to directly use very basic quantities, like particles' four-momenta, as input to the network, instead of working out combinations that are suitable for a specific analysis. For instance, in
¹ In order to classify 28 × 28 pixel grayscale images.
a search for a resonance, the invariant mass is a key variable, its distribution being peaked for the signal and flatter for the background. More complex kinematic variables are optimal for other searches for new particles, typically those with missing energy due to escaping undetected particles, as in Supersymmetry searches. While shallow neural networks may not achieve reasonable performances in most cases if those specific kinematic variables are not used as input, it has been shown that the same performances can be achieved with deep learning whether the input variables are complex variable combinations or four-momenta directly [20]. For those cases, typical numbers of nodes per layer may be several hundred, with a number of hidden layers of the order of 5.
This capability of machine-learning algorithms may lead to optimal performances with a reduced human effort in the data-processing steps preliminary to the training of the multivariate analysis.
Optimizations of the training procedure may be applied; these include a modification of the learning-rate parameter η, which may be decreased, for instance, exponentially with the number of iterations [20].
Another particle-physics application of deep learning is the classification of substructures in hadronic jets, which allows one to distinguish jets produced from a W → qq′ decay from jets due to quarks and gluons [21].
More advanced and optimized neural network structures, called convolutional neural networks (CNN or ConvNet) [22], have been developed mainly for image and sound processing. The number of inputs can be as large as the number of pixels in an image.²
Node arrangement is usually done in more dimensions in order to adapt to the dimensional structure of the input data: for instance, three dimensions typically corresponding to the width, height and color channel of an image.
Not all possible node connections are allowed in the network, reducing in this way the total number of weights to be optimized: nodes in a layer have connections with only some of the nodes in the previous layer. A fully connected network would otherwise become intractable for a very large number of inputs, as for images with tens of millions of pixels.
Moreover, translational symmetry along the image in two dimensions, or in one dimension (time) for sound, is also used to reduce the number of parameters. This corresponds, in image analysis, to the inspection of local features, e.g. edges, that can be differently displaced in an image, logically arranged into subsequent more
² Multiplied by three for color images, since each of the three color channels in a pixel is separately encoded.
complex structures, as opposed to a single global image analysis that would consider the interconnections of all possible information contained in individual pixels.
The overall network architecture is potentially similar to the logical structure of
how image processing may proceed in the brain’s visual cortex, which is able to
perform vision using a limited number of neurons.
Several small network units (filters), each aimed at identifying a specific feature, are applied to the input at all possible positions, exploiting the translational symmetry. For image recognition, a filter corresponds to a 'window' whose size is a small number of pixels. The extent of the filter is called the local receptive field. Each pixel in the window is connected to a neuron, whose output is the output of the filter, which evaluates a weighted sum of the pixel values, as in standard neural networks (a convolution, which gives the name to this neural network architecture).
The resulting filter output evaluated at each position is stored in a layer called
feature map. Each filter unit will generate in this way a feature map, which is stored
as a two-dimensional array. The construction of a feature map is shown in Fig. 9.8
for a two-dimensional case and in Fig. 9.9 for a three-dimensional case.
Usually, there are several filters, and the set of all feature maps is stored as a three-dimensional array whose third dimension is given by the number of applied filter units. The feature map array produced at this stage can be used as input to subsequent network layers.
A feature map can be reduced in size by applying a subsampling (pooling).
A reduced feature map can be processed in a subsequent network layer having a
smaller number of nodes.
Fig. 9.8 Construction of a feature map for a 3 × 3 filter applied on a 10 × 10 pixel image at different positions. The resulting feature map will have size 8 × 8 × 1. For simplicity, a grayscale image has been assumed (one color channel instead of three), reducing the dimensionality of the example
Fig. 9.9 Construction of a feature map for a 3 × 3 × 3 filter applied on a 10 × 10 × 3 pixel image at different positions. The resulting feature map will have size 8 × 8 × 1
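The construction of a feature map described above can be sketched as follows (illustrative only; the vertical-edge filter and the random image are arbitrary choices). For a 10 × 10 grayscale image and a 3 × 3 filter, the resulting map has size 8 × 8, as in Fig. 9.8:

```python
import numpy as np

def feature_map(image, filt):
    """Slide a filter window over a grayscale image and evaluate the
    weighted sum (convolution) at each position."""
    H, W = image.shape
    h, w = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
    return out

image = np.random.default_rng(2).random((10, 10))
edge_filter = np.array([[1, 0, -1]] * 3)  # a simple vertical-edge filter
fmap = feature_map(image, edge_filter)
```

The same filter weights are reused at every position, which is how the translational symmetry reduces the number of parameters compared with a fully connected layer.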
The output of this activation function applied to each entry of a previous layer is called a rectified linear unit (ReLU) layer. A smoothed version of the ReLU, called softplus [23], is φ(ν) = ln(1 + e^ν).

[Figure: a sequence of convolution and subsampling layers]
A decision tree is a sequence of selection cuts applied in a specified order on the variables of a given dataset. Each cut splits the sample into nodes, each of which corresponds to a given number of observations classified as signal or as background. A node may be further split by the application of the subsequent cut in the tree.
Nodes in which either signal or background is largely dominant are classified as leaves, and no further selection is applied. A node may also be classified as a leaf, and the selection path stopped, in case too few observations per node remain, or in case the total number of identified nodes is too large; different criteria have been proposed and applied in real implementations.
Each branch of a tree represents one sequence of cuts. An example of a decision tree is shown in Fig. 9.11. Along the decision tree, the same variable may appear multiple times, depending on the depth of the tree, each time with a different applied cut, possibly even with different inequality directions.
Selection cuts can be tuned in order to achieve the best split level in each node according to some metric. One possible optimization consists in maximizing for each node the gain in Gini index achieved after a splitting. The Gini index can be defined as:

    G = P (1 − P) ,    (9.47)

where P is the purity of the node, i.e. the fraction of signal observations. G is equal to zero for nodes containing only signal or only background observations.
Alternatives to the Gini index are also used as the metric to be optimized. One example is the so-called cross entropy, equal to:

    E = −P ln P − (1 − P) ln(1 − P) .
Fig. 9.11 An example of a decision tree. Each node, represented as an ellipse, contains a different number of signal (left number) and background (right number) observations. The relative amount of signal and background in each node is represented as blue and red areas, respectively. Applied requirements are represented as rectangular boxes
The gain due to the splitting of a node A into the nodes B1 and B2, which depends on the chosen cut, is given by:

    ΔI = I(A) − I(B1) − I(B2) ,

where I denotes the adopted metric (G or E, in the case of the Gini index or the cross entropy introduced above). By varying the cut, the optimal gain may be achieved.
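The cut optimization described above can be sketched as follows. One assumption to note: in this sketch each child node's metric is weighted by its fraction of observations, a common convention in decision-tree implementations; the samples and seed are arbitrary:

```python
import numpy as np

def gini(n_sig, n_bkg):
    """Gini index G = P (1 - P), with P the signal purity of the node."""
    p = n_sig / (n_sig + n_bkg)
    return p * (1.0 - p)

def best_split(x_sig, x_bkg):
    """Scan cut values on one variable and pick the one giving the best
    Gini gain; each child node is weighted by its population (an
    assumption of this sketch, common in tree implementations)."""
    n = len(x_sig) + len(x_bkg)
    cuts = np.unique(np.concatenate([x_sig, x_bkg]))[:-1]
    parent = gini(len(x_sig), len(x_bkg))
    best_cut, best_gain = None, -np.inf
    for c in cuts:
        s1, b1 = int((x_sig <= c).sum()), int((x_bkg <= c).sum())
        s2, b2 = len(x_sig) - s1, len(x_bkg) - b1
        gain = (parent - (s1 + b1) / n * gini(s1, b1)
                       - (s2 + b2) / n * gini(s2, b2))
        if gain > best_gain:
            best_cut, best_gain = c, gain
    return best_cut, best_gain

rng = np.random.default_rng(8)
cut, gain = best_split(rng.normal(-1, 1, 1000), rng.normal(1, 1, 1000))
```

For two overlapping Gaussians centered at ∓1, the optimal cut lands near the midpoint, where the purity difference between the two child nodes is largest.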
A single decision tree can be optimized as described above, but its performances are usually far from the optimal limit set by the Neyman–Pearson lemma. A significant improvement can be achieved by combining multiple decision trees into a decision forest.
The random forest algorithm [26] consists of ‘growing’ many decision trees from
replicas of the training samples obtained by randomly resampling the input data. No
minimum size is required for leaf nodes. The final score of the algorithm is given
by an unweighted average of the prediction (zero or one) by each individual tree.
The boosting procedure instead iteratively adds a new tree to a forest obtained by
performing the selection optimization after test observations have been reweighted
according to the score given by the classifier in the previous iteration. The collection
obtained at the end of the iterative procedure is so called boosted decision trees
(BDT) [27].
The boosting algorithm usually proceeds following the steps below:
• Training observations are reweighted using the previous iteration’s classifier
result.
• A new tree is built and optimized using the reweighted observations as a training
sample.
• A score is given to each tree.
• The output of the final BDT classifier is the weighted average over all trees in the
forest:

$$ y(\vec{x}) = \sum_{i=1}^{N_\mathrm{trees}} w^{(i)}\, C^{(i)}(\vec{x})\,. $$
Among the boosting algorithms, the most popular is adaptive boosting [28].
With this approach, only observations misclassified in the previous iteration are
reweighted by a weight which depends on the fraction of misclassified observations
f in the previous tree, usually equal to:
$$ w = \frac{1-f}{f}\,, \qquad f = \frac{N_\mathrm{misclassified}}{N_\mathrm{tot}}\,. \qquad (9.51) $$
The misclassification fraction f is also used to compute the weights in the combination
of the individual decision tree outputs:

$$ y(\vec{x}) = \sum_{i=1}^{N_\mathrm{trees}} \log\!\left(\frac{1-f^{(i)}}{f^{(i)}}\right) C^{(i)}(\vec{x})\,. \qquad (9.52) $$
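The combination rules of Eqs. (9.51) and (9.52) can be sketched as follows. Each trained tree is represented here only by its classification function C(x) in {−1, +1} and its misclassification fraction f; the three decision stumps below are hypothetical placeholders for trained trees, not the output of any real training.

```python
import math

# Sketch of the AdaBoost combination rules, Eqs. (9.51)-(9.52). Each trained
# tree is represented by a classification function C(x) in {-1, +1} and its
# misclassification fraction f; the stumps below are hypothetical examples.
def boost_weight(f):
    """Reweighting factor w = (1 - f) / f for misclassified observations."""
    return (1.0 - f) / f

def bdt_score(trees, x):
    """Combined output y(x) = sum_i log((1 - f_i) / f_i) * C_i(x)."""
    return sum(math.log((1.0 - f) / f) * C(x) for C, f in trees)

trees = [
    (lambda x: 1 if x > 0.0 else -1, 0.30),
    (lambda x: 1 if x > 0.5 else -1, 0.40),
    (lambda x: 1 if x > -0.5 else -1, 0.45),
]
score = bdt_score(trees, 1.0)  # all three stumps vote signal-like: score > 0
```

Note how trees with smaller misclassification fraction receive a larger weight in the sum, so the most accurate trees dominate the final score.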
The software tool TMVA [29] implements the most frequently adopted multivariate
analysis methods in high-energy physics, including Fisher’s linear discriminant,
projective likelihood analysis, artificial neural network and boosted decision trees.
The tool is distributed with the ROOT [7] framework, and provides a common
interface for the different multivariate methods. Each method is configurable and
all parameters can be tuned, though a series of default values, reasonably valid in
the majority of applications, is provided.
CAFFE [30] is a software framework that implements deep learning algorithms
and was used in [24]. Versions of ROOT from 6.08 on also provide a deep learning
implementation in the TMVA library.
The Fisher linear discriminant divides the sample space using a straight line.
The orientation of the line is optimized, but clearly the achieved selection is
suboptimal. The projective likelihood method achieves a better separation,
though not as good as the artificial neural network or the boosted decision trees.
Figure 9.15 shows the ROC curve (receiver operating characteristic) for the
four considered methods, as produced by the graphical user interface provided
by TMVA.
Fig. 9.12 Distribution in two dimensions for signal (blue) and background (red) test
observations
9.13 Multivariate Analysis Implementations 201
Fig. 9.13 Distributions of the discriminant outputs for the test observations: Fisher
discriminant (top left), projective likelihood discriminant (top right), artificial neural
network (bottom left) and boosted decision trees (bottom right)
Fig. 9.14 Distribution of test observations as in Fig. 9.12 showing observations selected
as signal (blue) and background (red) by the different multivariate discriminants: Fisher
linear discriminant (top, left), projective likelihood discriminant (top, right), artificial
neural network (bottom, left) and boosted decision trees (bottom, right)
Fig. 9.15 ROC curve for the four multivariate discriminators. Fisher and projective
likelihood exhibit suboptimal performance while, for this example, the performance
curves of the artificial neural network and boosted decision trees completely overlap
References
1. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–
188 (1936)
2. Neyman, J., Pearson, E.: On the problem of the most efficient tests of statistical hypotheses.
Philos. Trans. R. Soc. Lond. Ser. A 231, 289–337 (1933)
3. Kolmogorov, A.: Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital.
Attuari 4, 83–91 (1933)
4. Smirnov, N.: Table for estimating the goodness of fit of empirical distributions. Ann. Math.
Stat. 19, 279–281 (1948)
5. Chakravarti, I.M., Laha, R.G., Roy, J.: Handbook of Methods of Applied Statistics, vol. I.
Wiley, New York (1967)
6. Marsaglia, G., Tsang, W.W., Wang, J.: Evaluating Kolmogorov's distribution. J. Stat. Softw. 8,
1–4 (2003)
7. Brun, R., Rademakers, F.: ROOT—an object oriented data analysis framework. Proceedings
AIHENP96 Workshop, Lausanne (1996). Nucl. Instrum. Methods A 389 81–86 (1997). http://
root.cern.ch/
8. Stephens, M.A.: EDF statistics for goodness of fit and some comparisons. J. Am. Stat. Assoc.
69, 730–737 (1974)
9. Anderson, T.W., Darling, D.A.: Asymptotic theory of certain “goodness-of-fit” criteria based
on stochastic processes. Ann. Math. Stat. 23, 193–212 (1952)
10. Cramér, H.: On the composition of elementary errors. Scand. Actuar. J. 1928(1), 13–74 (1928)
204 9 Hypothesis Tests
11. von Mises, R.E.: Wahrscheinlichkeit, Statistik und Wahrheit. Julius Springer, Vienna (1928)
12. Wilks, S.: The large-sample distribution of the likelihood ratio for testing composite hypothe-
ses. Ann. Math. Stat. 9, 60–62 (1938)
13. LeCun, Y., Bottou, L., Orr, G.B., Müller, K.R.: Neural Networks: Tricks of the Trade. Springer,
Berlin/Heidelberg (1998)
14. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating
errors. Nature 323, 533–536 (1986)
15. Peterson, C., Rögnvaldsson, T.S.: An introduction to artificial neural networks. LU-TP-91-23.
14th CERN School of Computing, Ystad (1991)
16. Mhaskar, H.N.: Neural networks for optimal approximation of smooth and analytic functions.
Neural Comput. 8(1), 164–177 (1996)
17. Reed, R., Marks, R.: Neural Smithing: Supervised Learning in Feedforward Artificial Neural
Networks. A Bradford book. MIT Press, Cambridge (1999)
18. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
19. Cireşan, D.C., et al.: Deep, big, simple neural nets for handwritten digit recognition. Neural
Comput. 22, 3207–3220 (2010)
20. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics
with deep learning. Nature Commun. 5, 4308 (2014)
21. Baldi, P., Bauer, K., Eng, C., Sadowski, P., Whiteson, D.: Jet substructure classification in
high-energy physics with deep neural networks. Phys. Rev. D 93, 094034 (2016)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. Adv. Neural Inf. Proces. Syst. 25, 1097–1105 (2012)
23. Dugas, C., Bengio, Y., Bélisle, F., Nadeau, C., Garcia, R.: Incorporating second-order
functional knowledge for better option pricing. In: Proceedings of NIPS’2000: Advances in
Neural Information Processing Systems (2001)
24. Aurisano, A., et al.: A convolutional neural network neutrino event classifier. J. Instrum. 11,
P09001 (2016)
25. Photo by Angela Sorrentino: http://angelasorrentino.awardspace.com/ (2007)
26. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). http://www.stat.berkeley.edu/~breiman/RandomForests/
27. Roe, B.P., Yang, H.J., Zhu, J., Liu, Y., Stancu, I., McGregor, G.: Boosted decision trees as an
alternative to artificial neural networks for particle identification. Nucl. Instrum. Methods A
543, 577–584 (2005)
28. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an appli-
cation to boosting. In: Proceedings of EuroCOLT’94: European Conference on Computational
Learning Theory (1994)
29. Hoecker, A., et al.: TMVA – Toolkit for Multivariate Data Analysis. PoS ACAT 040,
arXiv:physics/0703039 (2007)
30. Jia, Y., et al.: Convolutional architecture for fast feature embedding. arXiv:1408.5093 (2014)
Chapter 10
Discoveries and Upper Limits
This chapter will introduce the concept of significance and will present the most
popular methods to set upper limits, discussing their main benefits and limitations.
The so-called modified frequentist approach will be introduced, which is a popular
method in high-energy physics that is neither a purely frequentist nor a Bayesian
method.
10.2.1 p-Values
Given an observed data sample, claiming the discovery of a new signal requires
determining that the sample is sufficiently inconsistent with the hypothesis that
only background is present in the data. A test statistic t can be used to measure the
inconsistency of the observation with the hypothesis of the presence of background
only, typically assumed as the null hypothesis, $H_0$.
The probability p that the considered test statistic t assumes a value greater than or
equal to the observed one in the case of a pure background fluctuation is called the
p-value, as already introduced in Sect. 5.12.2. It is implicitly assumed that, by
convention, large values of t correspond to a more signal-like sample.
The p-value has, by construction (see Sect. 2.5), a uniform distribution between
0 and 1 for the background-only hypothesis H0 and tends to have small values in the
presence of the signal (hypothesis H1 ).
If the number of observed events is adopted as test statistic (event counting
experiment), the p-value can be determined as the probability to count a number
of events equal to or greater than the observed one, assuming the presence of no
signal and the expected background level. In this case, the test statistic and the p-value
may assume discrete values. In general, the distribution of the p-value is only
approximately uniform for a discrete distribution.
$$ p = P(n \geq 8) = \sum_{n=8}^{\infty} \mathrm{Pois}(n; 4.5) = 1 - \sum_{n=0}^{7} e^{-4.5}\, \frac{4.5^n}{n!}\,. $$
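The sum above can be checked numerically; this minimal sketch uses only the Poisson probability mass function:

```python
import math

# Numerical check of the counting-experiment p-value: probability to observe
# n >= 8 events from a Poisson background with expectation b = 4.5.
def poisson_pmf(n, mu):
    return math.exp(-mu) * mu**n / math.factorial(n)

p_value = 1.0 - sum(poisson_pmf(n, 4.5) for n in range(8))
# p_value is about 0.087, in agreement with Fig. 10.1
```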
10.2 Claiming a Discovery 207
Fig. 10.1 Poisson distribution for a null signal and an expected background of b = 4.5.
The probability corresponding to n ≥ 8 (light blue area) is 0.087 and is equal to the
p-value, assuming the event counting as test statistic
Determining the significance is only part of the process that leads to a discovery
in the scientific method. The following statement from a document written by the
American Statistical Association (ASA) is worth reporting:
The p-value was never intended to be a substitute for scientific reasoning. Well-reasoned
statistical arguments contain much more than the value of a single number and whether that
number exceeds an arbitrary threshold. The ASA statement is intended to steer research into
a 'post p < 0.05 era' [1].
This point was also remarked upon within the physics community; the following
statement is reported by Cowan et al.:
It should be emphasized that in an actual scientific context, rejecting the background-only
hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree
of belief that a new process is present will depend in general on other factors as well, such
as the plausibility of the new signal hypothesis and the degree to which it can describe the
data [2].
In order to evaluate the ‘plausibility of a new signal’ and other factors that give
confidence in a discovery, the physicist’s judgment cannot, of course, be replaced
by the statistical evaluation of the p-value only. In this sense, we can say that a
Bayesian interpretation of the very final result is somehow implicitly assumed, even
when reporting a frequentist significance level.
A better approximation than Eq. (10.3), valid even for b well below unity, is given by
Cowan et al. [2]:

$$ Z = \sqrt{2\left[(s+b)\log\left(1+\frac{s}{b}\right)-s\right]}\,. \qquad (10.5) $$

Equation (10.3) may be obtained from Eq. (10.5) in the limit $s \ll b$.
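The two expressions can be compared numerically; the sketch below assumes that Eq. (10.3) is the familiar simple estimate Z ≈ s/√b:

```python
import math

# Comparison of the asymptotic significance of Eq. (10.5) with the simple
# approximation Z ~ s / sqrt(b), here assumed to be the content of Eq. (10.3).
def z_asymptotic(s, b):
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def z_simple(s, b):
    return s / math.sqrt(b)

low = (z_asymptotic(1.0, 100.0), z_simple(1.0, 100.0))   # s << b: very close
high = (z_asymptotic(10.0, 10.0), z_simple(10.0, 10.0))  # s ~ b: they differ
```

For s comparable to b, the simple estimate overstates the significance, which is why the asymptotic formula is preferred.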
In Sect. 9.9, test statistics based on likelihood ratios were introduced. In particular, a
test statistic suitable for searches for a new signal is:

$$ \lambda(\mu, \vec{\theta}) = \frac{L_{s+b}(\vec{x}_1, \cdots, \vec{x}_N; \mu, \vec{\theta})}{L_b(\vec{x}_1, \cdots, \vec{x}_N; \vec{\theta})}\,. \qquad (10.6) $$
A minimum of $-2\log\lambda(\mu)$ at $\mu = \hat{\mu}$ indicates the possible presence of a signal
having a signal strength equal to $\hat{\mu}$.
The advantage of the likelihood ratio as test statistic is that $H_0$, assumed in
the denominator, can be taken as a special case of $H_1$, assumed in the numerator,
with $\mu = 0$. This represents a case of nested hypotheses; hence Wilks' theorem
can be applied, assuming the likelihood function is sufficiently regular. Note that
in Eq. (10.6) numerator and denominator are inverted with respect to Eqs. (9.25)
and (9.26).
This significance Z is also called local significance, in the sense that it corresponds
to a fixed set of values of the parameters $\vec{\theta}$. In case one or more of the
parameters $\vec{\theta}$ are estimated from data, the local significance at fixed values of the
measured parameters may be affected by the look elsewhere effect, see Sect. 10.13.
If the background PDF does not depend on $\vec{\theta}$ (for instance, if we have a single
parameter equal to the mass of an unknown particle, which only affects the
signal distribution), $L_b(\vec{x}; \vec{\theta})$ does not depend on $\vec{\theta}$ either, and the likelihood ratio
$\lambda(\vec{\theta})$ is equal, up to a multiplicative factor, to the likelihood function $L_{s+b}(\vec{x}; \mu, \vec{\theta})$.
Hence, the maximum likelihood estimates $\hat{\mu}$, $\hat{\vec{\theta}}$ can also be determined by
minimizing $-2\log\lambda(\mu, \vec{\theta})$, which is equal to $-2\log L_{s+b}$ up to a constant, and
the error matrix, or the uncertainty contours on $\vec{\theta}$, can be determined as usual for
maximum likelihood estimates (see Sect. 5.11).
¹ Given the definition of $\lambda$ in Eq. (10.6), signal-like cases tend to have $L_{s+b} > L_b$, hence $\lambda > 1$,
which implies $\log\lambda > 0$ and $-2\log\lambda < 0$; vice versa, background-like cases tend to have
$-2\log\lambda > 0$. For this reason, small values of $-2\log\lambda$ correspond to more signal-like cases. Other choices of test
statistic may have the opposite convention.
10.5 Definitions of Upper Limit 211
For the purpose of excluding a signal hypothesis, the requirement applied in
terms of p-value is usually much milder than for a discovery. Instead of requiring a p-value
of $2.87 \times 10^{-7}$ or less (the '$5\sigma$' criterion), upper limits for a signal exclusion are
set requiring p < 0.05, corresponding to a 95% confidence level (CL), or p < 0.10,
corresponding to a 90% CL.
For signal exclusion, p indicates the probability of a signal underfluctuation,
i.e. the null and alternative hypotheses are inverted, when testing for p-values,
with respect to the case of a discovery.
In the frequentist approach, the procedure to set an upper limit is a special case of
confidence interval determination (see Sect. 7.2), typically applied to the unknown
signal yield s or, alternatively, to the signal strength $\mu$.
In order to determine an upper limit instead of the usual central interval, the
choice of the interval corresponding to the desired confidence level $1-\alpha$ (90%
or 95%, usually) is fully asymmetric, $[0, s^\mathrm{up}[$, which translates into an upper limit
quoted as:

$$ s < s^\mathrm{up}\,. $$
In the Bayesian approach, the interpretation of the upper limit $s^\mathrm{up}$ is that the credible
interval $[0, s^\mathrm{up}[$ corresponds to a posterior probability equal to the confidence level
$1-\alpha$.
The Bayesian posterior PDF for a signal yield s,² assuming a prior $\pi(s)$, is given by:

$$ P(s \mid \vec{x}) = \frac{L(\vec{x}; s)\,\pi(s)}{\int_0^{\infty} L(\vec{x}; s')\,\pi(s')\, ds'}\,. \qquad (10.9) $$
The upper limit $s^\mathrm{up}$ can be computed by requiring that the posterior probability
corresponding to the interval $[0, s^\mathrm{up}[$ is equal to the specified confidence level CL or,
equivalently, that the probability corresponding to $[s^\mathrm{up}, \infty[$ is equal to $\alpha = 1 - \mathrm{CL}$:

$$ \alpha = \int_{s^\mathrm{up}}^{\infty} P(s \mid \vec{x})\, ds = \frac{\int_{s^\mathrm{up}}^{\infty} L(\vec{x}; s)\,\pi(s)\, ds}{\int_0^{\infty} L(\vec{x}; s)\,\pi(s)\, ds}\,. \qquad (10.10) $$
Apart from the technical aspects related to the computation of the integrals and the
already mentioned subjectivity in the choice of the prior $\pi(s)$ (see Sect. 3.7), the
above expression poses no particular fundamental problem.
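As a numerical illustration of Eq. (10.10), the sketch below integrates the posterior for a Poisson signal yield with no background and a uniform prior on a grid; the integration range and step count are arbitrary illustrative choices.

```python
import math

# Grid integration of Eq. (10.10) for a Poisson counting experiment with no
# background and a uniform prior pi(s) = const; the range s_max and the
# number of steps are illustrative choices.
def upper_limit_bayes(n, alpha, s_max=50.0, steps=200000):
    ds = s_max / steps
    like = [math.exp(-(i * ds)) * (i * ds) ** n / math.factorial(n)
            for i in range(steps + 1)]
    norm = sum(like) * ds  # denominator of Eq. (10.10)
    acc = 0.0
    for i, l in enumerate(like):
        acc += l * ds
        if acc >= (1.0 - alpha) * norm:  # remaining tail has content alpha
            return i * ds
    return s_max

s_up = upper_limit_bayes(0, 0.05)  # close to -log(0.05), about 3.0
```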
$$ P(s \mid n) = \frac{s^n\, e^{-s}}{n!}\,. \qquad (10.11) $$
In case no event is observed, i.e. n = 0, we have:

$$ P(s \mid 0) = e^{-s}\,, \qquad (10.12) $$

and:

$$ \alpha = \int_{s^\mathrm{up}}^{\infty} e^{-s}\, ds = e^{-s^\mathrm{up}}\,, \qquad (10.13) $$
which gives:

$$ s^\mathrm{up} = -\log\alpha\,. \qquad (10.14) $$

² The same approach could be equivalently formulated in terms of the signal strength $\mu$.

For $\alpha = 0.05$ (95% CL) and $\alpha = 0.10$ (90% CL), the following upper limits can
be set:

$$ s^\mathrm{up} = 3.00 \quad (95\%\ \mathrm{CL})\,, \qquad (10.15) $$
$$ s^\mathrm{up} = 2.30 \quad (90\%\ \mathrm{CL})\,. \qquad (10.16) $$
The general case with expected background $b \neq 0$ was treated by Helene [3],
and Eq. (10.10) becomes:

$$ \alpha = e^{-s^\mathrm{up}}\, \frac{\displaystyle\sum_{m=0}^{n} \frac{(s^\mathrm{up} + b)^m}{m!}}{\displaystyle\sum_{m=0}^{n} \frac{b^m}{m!}}\,. \qquad (10.17) $$
The above expression can be inverted numerically in order to determine $s^\mathrm{up}$ for given
$\alpha$, n and b. In the case of no background $(b = 0)$, Eq. (10.17) becomes:

$$ \alpha = e^{-s^\mathrm{up}} \sum_{m=0}^{n} \frac{(s^\mathrm{up})^m}{m!}\,. \qquad (10.18) $$
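A possible numerical inversion of Eq. (10.17) by bisection can be sketched as follows; the expression is monotonically decreasing in $s^\mathrm{up}$, so the root is easily bracketed.

```python
import math

# Bisection inversion of the Helene formula, Eq. (10.17): the right-hand side
# decreases monotonically with s_up, so the root can be bracketed in [0, s_hi].
def helene_alpha(s_up, n, b):
    num = sum((s_up + b)**m / math.factorial(m) for m in range(n + 1))
    den = sum(b**m / math.factorial(m) for m in range(n + 1))
    return math.exp(-s_up) * num / den

def bayesian_upper_limit(n, b, alpha, s_hi=100.0):
    lo, hi = 0.0, s_hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if helene_alpha(mid, n, b) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s_up_90 = bayesian_upper_limit(0, 0.0, 0.10)  # about 2.30
s_up_95 = bayesian_upper_limit(0, 0.0, 0.05)  # about 3.00
```

For n = 0 the result does not depend on b, the feature visible in Fig. 10.2.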
Fig. 10.2 Upper limits at the 90% CL (top) and 95% CL (bottom) on the signal yield s for a
Poissonian process using the Bayesian approach, as a function of the expected background b and
for a number of observed events n from n = 0 to n = 10
10.7 Frequentist Upper Limits 215
The derivation of Bayesian upper limits presented above assumes a uniform prior
$\pi(s) = \mathrm{const.}$ on the expected signal yield. Assuming a different prior distribution
would result in different upper limits. In general, there is no unique criterion to choose
a specific prior PDF that models the complete lack of knowledge about a variable,
in this case the signal yield, as already discussed in Sect. 3.7.
In searches for new signals, the signal yield may be related to other parameters of
the theory (e.g.: the mass of unknown particles, or specific coupling constants). In
that case, should one choose a uniform prior for the signal yield or a uniform prior
for the theory parameters? As already said, no unique prescription can be derived
from first principles.
A possible approach is to choose several priors that reasonably model one's
ignorance about the unknown parameters and to verify that the obtained upper limits
are not too sensitive to the choice of the prior.
Frequentist upper limits can be computed by inverting the Neyman belt (see
Sect. 7.2) for a parameter $\theta$ with fully asymmetric intervals for the observed
quantity x, as illustrated in Fig. 10.3, which is the equivalent of Fig. 7.1 when
adopting a fully asymmetric interval. In particular, assuming that the Neyman belt is
monotonically increasing, the choice of intervals $]x^\mathrm{lo}(\theta_0), +\infty[$ for x as a function
of $\theta_0$ leads to a confidence interval $[0, \theta^\mathrm{up}(x_0)[$ for $\theta$, given a measurement $x_0$,
which corresponds to the upper limit:

$$ \theta < \theta^\mathrm{up}(x_0)\,. $$
Similarly to Sect. 10.6.1, the case of a counting experiment with negligible background
is analyzed first. The probability to observe n events with an expectation s is given by a Poisson distribution:

$$ P(n; s) = \frac{e^{-s}\, s^n}{n!}\,. \qquad (10.20) $$

An upper limit on the expected signal yield s can be set using n as test statistic
and excluding the values of s for which the probability (p-value) to observe n events
or fewer is below $\alpha = 1 - \mathrm{CL}$. For n = 0 we have:
$$ p(s) = P(0; s) = e^{-s} < \alpha\,, $$

or, equivalently:

$$ s^\mathrm{up} = -\log\alpha\,, $$

which gives $s^\mathrm{up} = 3.00$ at the 95% CL and $s^\mathrm{up} = 2.30$ at the 90% CL.
Those results accidentally coincide with the ones obtained under the Bayesian
approach (Eqs. (10.15) and (10.16)). The numerical coincidence of upper limits
computed under the Bayesian and frequentist approaches for this simple but common
case of a counting experiment may lead to confusion. There is no intrinsic
reason why limits evaluated under the two approaches should coincide and,
in general, with very few exceptions like this one, Bayesian and frequentist
limits do not coincide numerically. Moreover, regardless of their numerical value,
the interpretation of Bayesian and frequentist limits is very different, as already
discussed several times.
Note that if the true value is s = 0, then the interval $[0, s^\mathrm{up}[$ covers the true value
with 100% probability, instead of the required 90% or 95%, similarly to what was
observed in Example 7.23. The extreme overcoverage is due to the discrete nature
of the counting problem and may appear as a counterintuitive feature.
When constructing the Neyman belt for a discrete variable n, as in the Poissonian
case, it is not always possible to find an interval $\{n^\mathrm{lo}, \ldots, n^\mathrm{up}\}$ that has exactly
the desired coverage, because of the intrinsic discreteness of the problem. This issue
was already introduced in Sect. 7.3 when discussing binomial intervals. For discrete
cases, it is possible to take the smallest interval which has a probability greater than or
equal to the desired confidence level. Upper limits determined in those cases are
conservative, i.e. the procedure ensures that the probability content of the confidence
belt is greater than or equal to $1-\alpha$ (overcoverage).
Figure 10.4 shows an example of a Poisson distribution corresponding to the
case with s = 4 and b = 0. Using a fully asymmetric interval as ordering
rule, the interval $\{2, 3, \ldots\}$ of the discrete variable n corresponds to a probability
$P(n \geq 2) = 1 - P(0) - P(1) = 0.9084$ and is the smallest interval which has
a probability greater than or equal to the desired confidence level of 0.90: the interval
$\{3, 4, \ldots\}$ would have a probability $P(n \geq 3)$ less than 90%, while enlarging the
interval to $\{1, 2, 3, \ldots\}$ would produce a probability $P(n \geq 1)$ even larger than
$P(n \geq 2)$.
Fig. 10.4 Poisson distribution for s = 4 and b = 0. The white bins show the smallest possible
fully asymmetric confidence interval ({2, 3, 4, …} in this case) that gives at least the required
coverage of 90%
If n events are observed, the upper limit $s^\mathrm{up}$ is given by the inversion of the
Neyman belt, which corresponds to:

$$ s^\mathrm{up} = \inf\left\{ s : \sum_{m=0}^{n} P(m; s) < \alpha \right\}\,. \qquad (10.26) $$

The simplest case with n = 0 gives the result shown in Sect. 10.7.1.
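The inversion of Eq. (10.26) can be sketched numerically; bisection works because the cumulative probability is monotonically decreasing in s.

```python
import math

# Bisection inversion of Eq. (10.26): the upper limit is the smallest s for
# which the probability to observe n or fewer events drops below alpha.
def p_le_n(n, s):
    return math.exp(-s) * sum(s**m / math.factorial(m) for m in range(n + 1))

def frequentist_upper_limit(n, alpha, s_hi=100.0):
    lo, hi = 0.0, s_hi
    for _ in range(200):  # p_le_n(n, s) is monotonically decreasing in s
        mid = 0.5 * (lo + hi)
        if p_le_n(n, mid) >= alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s_up = frequentist_upper_limit(0, 0.10)  # reproduces -log(alpha), about 2.30
```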
Consider a case with non-negligible background, $b \neq 0$, and take again n as the
test statistic. Even the assumption s = 0, or possibly unphysical values s < 0, could
be excluded if the data have large underfluctuations, which are improbable but possible
according to a Poisson distribution with expected number of events b, or $b - |s|$ for a
negative s, such that the p-value is less than the required 0.10 or 0.05. The possibility
to exclude parameter regions where the experiment is insensitive (s = 0), or
even unphysical regions (s < 0), is rather unpleasant to a physicist.
From the pure frequentist point of view, moreover, this result potentially suffers
from the flip-flopping problem (see Sect. 7.4): if we decide a priori to quote an upper
limit as our final result, Neyman’s construction with a fully asymmetric interval
leads to the correct coverage, but if we choose to switch from fully asymmetric
to central intervals in case a significant signal is observed, this would produce an
incorrect coverage.
Fig. 10.5 Confidence belt at the 90% CL for a Poissonian process using the Feldman–Cousins
approach for b = 3
can be seen in Fig. 10.6 (lowest curve). This feature is absent in Bayesian limits, which
do not depend on the expected background b for n = 0 (see Fig. 10.2).
This dependence of upper limits on the expected amount of background is
somewhat counterintuitive: imagine two experiments (say A and B) performing
a search for a rare signal. Both experiments are designed to achieve a very low
background level, but A can reduce the background level more than B, say b = 0.01
and b = 0.1 expected events for A and B, respectively. If both experiments observe
zero events, which is for both the most likely outcome, the experiment that achieves
Fig. 10.6 Upper limits at the 90% CL on the signal s for a Poissonian process using the Feldman–
Cousins method, as a function of the expected background b and for a number of observed events n
from 0 to 10
the most stringent limit is the one with the largest expected background (B in this
case), i.e. the one with the worse expected performance.
The Particle Data Group published in their review the following statement about
the interpretation of frequentist upper limits, in particular concerning the
difficulty of interpreting a more stringent limit from an experiment with a larger expected
background, in case no event is observed:
The intervals constructed according to the unified [Feldman–Cousins] procedure for a
Poisson variable n consisting of signal and background have the property that for n = 0
observed events, the upper limit decreases for increasing expected background. This is
counter-intuitive, since it is known that if n = 0 for the experiment in question, then
no background was observed, and therefore one may argue that the expected background
should not be relevant. The extent to which one should regard this feature as a drawback is
a subject of some controversy [4].
This modification leads to the same result obtained by Helene in Eq. (10.17),
which apparently indicates a possible convergence of Bayesian and frequen-
tist approaches.
This approach was later criticized by Highland and Cousins [6] who
demonstrated that the modification introduced by Eq. (10.28) produces an
incorrect coverage, and Zech himself admitted the nonrigorous application
of the frequentist approach [7].
This attempt could not provide a way to reconcile the Bayesian and
frequentist approaches, which, as said, have completely different interpretations.
Anyway, Zech's intuition anticipated the formulation of the modified
frequentist approach that will be discussed in Sect. 10.8, which is nowadays
widely used in high-energy physics.
The concerns about frequentist limits discussed at the end of Sects. 10.7.2 and 10.7.3
have been addressed with the definition of a procedure that was adopted for the first
time for the combination of the results obtained by the four LEP experiments, Aleph,
Delphi, Opal, and L3, in the search for the Higgs boson [8].
$$ \lambda(\vec{\theta}) = \frac{L_{s+b}(\vec{x}; \vec{\theta})}{L_b(\vec{x}; \vec{\theta})}\,. \qquad (10.29) $$
Different test statistics have been applied after the original formulation, in
particular, it is now common to use the profile likelihood (Eq. (9.27)). The method
described in the following is valid for any test statistic.
The likelihood ratio in Eq. (10.29) can also be written introducing the signal
strength $\mu$ separately from the other parameters of interest $\vec{\theta}$, as in Eq. (9.35):

$$ \lambda(\mu, \vec{\theta}) = e^{-\mu\, s(\vec{\theta})} \prod_{i=1}^{N} \left( \frac{\mu\, s(\vec{\theta})\, f_s(\vec{x}_i; \vec{\theta})}{b(\vec{\theta})\, f_b(\vec{x}_i; \vec{\theta})} + 1 \right)\,, \qquad (10.30) $$
where the functions $f_s$ and $f_b$ are the PDFs for signal and background of the variables
$\vec{x}$. The negative logarithm of the test statistic is given in Eq. (9.36), also reported
below:

$$ -\log\lambda(\mu, \vec{\theta}) = \mu\, s(\vec{\theta}) - \sum_{i=1}^{N} \log\left( \frac{\mu\, s(\vec{\theta})\, f_s(\vec{x}_i; \vec{\theta})}{b(\vec{\theta})\, f_b(\vec{x}_i; \vec{\theta})} + 1 \right)\,. \qquad (10.31) $$
In order to quote an upper limit using the frequentist approach, the distribution
of the test statistic $\lambda$ (or, equivalently, $-2\log\lambda$) in the hypothesis of signal plus
background has to be known, and the p-value corresponding to the observed value
$\lambda = \hat{\lambda}$, denoted below as $p_{s+b}$, has to be determined as a function of the parameters
of interest $\mu$ and $\vec{\theta}$.
The proposed modification to the purely frequentist approach consists in finding
two p-values corresponding to both the $H_1$ and $H_0$ hypotheses (below, for simplicity
of notation, the set of parameters $\vec{\theta}$ also includes $\mu$, which is omitted):

$$ p_{s+b}(\vec{\theta}) = P_{s+b}(\lambda(\vec{\theta}) \leq \hat{\lambda})\,, \qquad (10.32) $$
$$ p_b(\vec{\theta}) = P_b(\lambda(\vec{\theta}) \geq \hat{\lambda})\,. \qquad (10.33) $$
10.8 Modified Frequentist Approach: The CLs Method 223
From those two probabilities, the following quantity can be derived [26]:

$$ \mathrm{CL}_s(\vec{\theta}) = \frac{p_{s+b}(\vec{\theta})}{1 - p_b(\vec{\theta})}\,. \qquad (10.34) $$

Upper limits are determined by excluding the range of the parameters of interest for
which $\mathrm{CL}_s(\vec{\theta})$ is lower than $\alpha = 1 - \mathrm{CL}$, for a conventional confidence level CL of,
typically, 95% or 90%. For this reason, the modified frequentist approach is often referred to as the
CLs method.
In most of the cases, the probabilities PsCb and Pb in Eqs. (10.32) and (10.33)
are not trivial to obtain analytically and are determined numerically using pseudo-
experiments generated by Monte Carlo. An example of the outcome of this
numerical approach is shown in Fig. 10.7.
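For a simple counting experiment, the pseudo-experiment evaluation of CLs can be sketched as follows, with n itself as test statistic, so that outcomes with fewer events are more background-like; the toy count and seed are illustrative choices.

```python
import math
import random

# Toy Monte Carlo sketch of the CLs construction for a counting experiment
# with n as test statistic: p_{s+b} is the fraction of s+b toys at or below
# the observed count, and 1 - p_b the analogous fraction of b-only toys.
def poisson_toy(mu, rng):
    # inverse-CDF sampling of a Poisson variate (adequate for small mu)
    u, n, c = rng.random(), 0, math.exp(-mu)
    p = c
    while c < u:
        n += 1
        p *= mu / n
        c += p
    return n

def cls(n_obs, s, b, n_toys=50000, seed=1):
    rng = random.Random(seed)
    p_sb = sum(poisson_toy(s + b, rng) <= n_obs for _ in range(n_toys)) / n_toys
    cl_b = sum(poisson_toy(b, rng) <= n_obs for _ in range(n_toys)) / n_toys
    return p_sb / cl_b  # CLs = p_{s+b} / (1 - p_b)

value = cls(0, 3.0, 1.0)  # analytically exp(-3), about 0.05, for n_obs = 0
```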
The modified frequentist approach does not provide the desired coverage from
the frequentist point of view, but does not suffer from the counterintuitive features
of frequentist upper limits, and has convenient statistical properties:
• It is conservative from the frequentist point of view. In fact, since $p_b \leq 1$, we
have that $\mathrm{CL}_s(\vec{\theta}) \geq p_{s+b}(\vec{\theta})$. Hence, the provided intervals overcover, and CLs
limits are less stringent than purely frequentist ones.
Fig. 10.7 Example of the evaluation of CLs from pseudo-experiments. The distribution of the test
statistic $-2\log\lambda$ is shown in blue assuming the signal-plus-background hypothesis and in red
assuming the background-only hypothesis. The black line shows the value of the test statistic
measured in data, and the shaded areas represent $p_{s+b}$ (blue) and $p_b$ (red). CLs is determined
as $p_{s+b}/(1-p_b)$
Fig. 10.8 Illustration of the application of the CLs method in the case of well separated distributions
of the test statistic $-2\log\lambda$ for the s + b and b hypotheses (top, where $p_b \simeq 0$ and
$\mathrm{CL}_s \simeq p_{s+b}$) and in the case of largely overlapping distributions (bottom, where $p_b$ is large
and $p_{s+b} < \mathrm{CL}_s$), for which the experiment has poor sensitivity to the signal
10.9 Presenting Upper Limits: The Brazil Plot 225
For a simple Poissonian counting experiment with expected signal s and a back-
ground b, using the likelihood ratio from Eq. (9.39), it is possible to demonstrate
analytically that the CLs approach leads to a result identical to the Bayesian one
(Eq. (10.17)) which is also identical to the results of the method proposed by
Zech [5], discussed in Example 10.26.
In general, in many realistic applications, the CLs upper limits are numerically
similar to Bayesian upper limits assuming a uniform prior but, of course, the
Bayesian interpretation of upper limits cannot be applied to limits obtained using
the CLs approach.
On the other hand, the fundamental interpretation of limits obtained using the
CLs method is not obvious: it matches neither the frequentist nor the Bayesian
approach.
Under a given hypothesis, typically the background-only one, the upper limit is a
random variable that depends on the observed data sample, and its distribution can
be predicted using Monte Carlo.
When presenting an upper limit as the result of a data analysis, it is often
useful to report, together with the observed upper limit, also the expected value
of the limit and possibly its interval of excursion, quantified as the percentiles that
correspond to $\pm 1\sigma$ and $\pm 2\sigma$. These bands are conventionally colored in green and
yellow, respectively, which gives the jargon name of Brazil plot to this kind of
presentation.
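The construction just described can be sketched for a counting experiment: background-only pseudo-experiments are generated, an upper limit (here the Bayesian/Helene one of Eq. (10.17)) is computed for each, and the band edges are taken as percentiles of the resulting distribution. The toy count, seed and quantile convention are illustrative choices.

```python
import math
import random

# Sketch of the "Brazil plot" bands for a counting experiment: compute the
# upper limit for many background-only pseudo-experiments and take the
# median and the quantiles corresponding to +-1 and +-2 sigma.
def poisson_toy(mu, rng):
    u, n, c = rng.random(), 0, math.exp(-mu)
    p = c
    while c < u:
        n += 1
        p *= mu / n
        c += p
    return n

def limit(n, b, alpha=0.05):
    den = sum(b**m / math.factorial(m) for m in range(n + 1))
    lo, hi = 0.0, 100.0
    for _ in range(100):  # bisection of the Helene formula, Eq. (10.17)
        mid = 0.5 * (lo + hi)
        num = sum((mid + b)**m / math.factorial(m) for m in range(n + 1))
        if math.exp(-mid) * num / den > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def expected_bands(b, alpha=0.05, n_toys=2000, seed=1):
    rng = random.Random(seed)
    limits = sorted(limit(poisson_toy(b, rng), b, alpha) for _ in range(n_toys))
    q = lambda p: limits[int(p * n_toys)]
    return q(0.5), (q(0.1587), q(0.8413)), (q(0.0228), q(0.9772))

median, band1, band2 = expected_bands(3.0)
```

The quantile positions 0.1587/0.8413 and 0.0228/0.9772 correspond to the one-sided Gaussian probabilities at $\pm 1\sigma$ and $\pm 2\sigma$.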
A typical example is shown in Fig. 10.9, which reports the observed and expected
limits on the signal strength $\mu = \sigma/\sigma_\mathrm{SM}$ for Standard Model Higgs boson
production, with the $\pm 1\sigma$ and $\pm 2\sigma$ bands, as a function of the Higgs boson mass,
obtained with the 2011 and 2012 LHC data and reported by the ATLAS experiment [9].
The observed upper limit fluctuates within the expected band over most of the mass
range but exceeds the $+2\sigma$ band around a mass value of about 125 GeV, corresponding to
the presently measured value of the Higgs boson mass. This indicates a deviation from
the background-only hypothesis assumed in the computation of the expected limit.
In some cases, expected limits are presented assuming a nominal signal yield.
Those are sometimes called, in jargon, signal-injected expected limits.
Fig. 10.9 Example of an upper limit reported as a Brazil plot in the context of the search for the Higgs
boson at the LHC by the ATLAS experiment. The expected limit on the signal strength $\mu = \sigma/\sigma_\mathrm{SM}$
is shown as a dashed line, surrounded by the $\pm 1\sigma$ (green) and $\pm 2\sigma$ (yellow) bands. The observed
limit is shown as a solid line. All mass values corresponding to a limit below $\mu = 1$ (dashed
horizontal line) are excluded at the 95% confidence level. The plot is from [9] (open access)
The test statistic may contain some parameters that are not of direct interest for our
measurement but are nuisance parameters, needed to model the PDF of our data
sample, as discussed in Sect. 5.4.
In the following, the parameter set is split into two subsets: parameters of interest,
$\vec{\theta} = (\theta_1, \cdots, \theta_h)$, and nuisance parameters, $\vec{\nu} = (\nu_1, \cdots, \nu_l)$. For instance, if we
are only interested in the measurement of the signal strength $\mu$, we have $\vec{\theta} = (\mu)$; if
we want to measure, instead, both the signal strength and a new particle's mass m,
we have $\vec{\theta} = (\mu, m)$.
The treatment of nuisance parameters is well defined under the Bayesian approach and was already discussed in Sect. 3.5. The posterior joint probability distribution for all the unknown parameters is (Eq. (3.33)):
$$ P(\vec{\theta}, \vec{\nu} \mid \vec{x}\,) = \frac{L(\vec{x}; \vec{\theta}, \vec{\nu})\,\pi(\vec{\theta}, \vec{\nu})}{\displaystyle\int L(\vec{x}; \vec{\theta}\,', \vec{\nu}\,')\,\pi(\vec{\theta}\,', \vec{\nu}\,')\,\mathrm{d}^h\theta'\,\mathrm{d}^l\nu'}\,. \qquad (10.35) $$
The only difficulty that may arise is the numerical integration in multiple
dimensions. A particularly performant class of algorithms in those cases is based
on Markov chain Monte Carlo [10], introduced in Sect. 4.8.
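A minimal Metropolis sketch of such a marginalization, for a Poisson counting experiment with a Gaussian-constrained background (flat priors and all numbers are invented for illustration; keeping only the s coordinate of the chain marginalizes the posterior over b):

```python
import math, random

random.seed(1)
n, b0, sigma_b = 15, 10.0, 2.0   # observed count and background estimate (assumed)

def log_post(s, b):
    # Log posterior (up to constants): Poisson term plus Gaussian constraint on b
    if s < 0 or b <= 0:
        return -math.inf
    lam = s + b
    return n * math.log(lam) - lam - 0.5 * ((b - b0) / sigma_b) ** 2

s, b = 5.0, b0
cur = log_post(s, b)
samples = []
for i in range(50000):
    s_new = s + random.gauss(0.0, 1.0)    # symmetric proposal
    b_new = b + random.gauss(0.0, 1.0)
    new = log_post(s_new, b_new)
    if new - cur > math.log(random.random()):   # Metropolis acceptance
        s, b, cur = s_new, b_new, new
    if i > 5000:                 # burn-in discarded
        samples.append(s)        # keeping only s marginalizes over b

print(sum(samples) / len(samples))   # posterior mean of the signal yield s
```

The histogram of `samples` approximates the marginal posterior of s, with no explicit multidimensional integration required.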
For an extended unbinned likelihood, the corresponding likelihood functions, integrated over the nuisance parameters $\vec{\nu}$, are:
$$ L_{s+b}(\vec{x}_1, \dots, \vec{x}_N \mid \vec{\theta}\,) = \int \frac{e^{-(s(\vec{\theta}, \vec{\nu}) + b(\vec{\theta}, \vec{\nu}))}}{N!} \prod_{i=1}^{N} \left[ s(\vec{\theta}, \vec{\nu})\, f_s(\vec{x}_i; \vec{\theta}, \vec{\nu}) + b(\vec{\theta}, \vec{\nu})\, f_b(\vec{x}_i; \vec{\theta}, \vec{\nu}) \right] \mathrm{d}^l\nu\,, \qquad (10.37) $$
$$ L_b(\vec{x}_1, \dots, \vec{x}_N \mid \vec{\theta}\,) = \int \frac{e^{-b(\vec{\theta}, \vec{\nu})}}{N!} \prod_{i=1}^{N} b(\vec{\theta}, \vec{\nu})\, f_b(\vec{x}_i; \vec{\theta}, \vec{\nu})\,\mathrm{d}^l\nu\,. \qquad (10.38) $$
a function of the true unknown expected background b, $P(b_0; b)$. The likelihood functions, which depend on the parameter of interest s and the unknown nuisance parameter b, can be written as:
$$ L_{s+b}(n, b_0; s, b) = \frac{e^{-(s+b)}\,(s+b)^n}{n!}\, P(b_0; b)\,, \qquad (10.39) $$
$$ L_b(n, b_0; b) = \frac{e^{-b}\, b^n}{n!}\, P(b_0; b)\,. \qquad (10.40) $$
In order to eliminate the dependence on the nuisance parameter b, the hybrid likelihoods, using the Cousins–Highland approach, can be written as:
$$ L_{s+b}(n, b_0; s) = \int_0^\infty \frac{e^{-(s+b)}\,(s+b)^n}{n!}\, P(b_0; b)\,\mathrm{d}b\,, \qquad (10.41) $$
$$ L_b(n, b_0) = \int_0^\infty \frac{e^{-b}\, b^n}{n!}\, P(b_0; b)\,\mathrm{d}b\,. \qquad (10.42) $$
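Numerically, the integrals in Eqs. (10.41) and (10.42) can be approximated by a simple sum over a grid in b. A Python sketch, assuming a Gaussian $P(b_0; b)$ and invented numbers:

```python
import math
import numpy as np

n, b0, sigma_b = 8, 5.0, 1.0    # observed count and background measurement (assumed)

def hybrid_L(s):
    # Eq. (10.41)-style integral: Poisson likelihood times the Gaussian PDF
    # of the background estimate, summed over a grid in the true background b
    b = np.linspace(max(1e-6, b0 - 6 * sigma_b), b0 + 6 * sigma_b, 2001)
    pois = np.exp(-(s + b)) * (s + b) ** n / math.factorial(n)
    gaus = np.exp(-0.5 * ((b0 - b) / sigma_b) ** 2) / (sigma_b * math.sqrt(2 * math.pi))
    return float((pois * gaus).sum() * (b[1] - b[0]))   # Riemann sum over db

print(hybrid_L(0.0), hybrid_L(3.0))
```

With n = 8 and b0 = 5, the hybrid likelihood is larger at s = 3 (where s + b0 matches the observed count) than at s = 0, as expected.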
A test statistic that accounts for nuisance parameters and avoids the hybrid Bayesian approach is the profile likelihood, defined in Eq. (9.27):
$$ \lambda(\mu) = \frac{L(\vec{x} \mid \mu, \hat{\hat{\vec{\nu}}}(\mu))}{L(\vec{x} \mid \hat{\mu}, \hat{\vec{\nu}})}\,, \qquad (10.43) $$
where $\hat{\mu}$ and $\hat{\vec{\nu}}$ are the best-fit values of $\mu$ and $\vec{\nu}$ corresponding to the observed data sample, and $\hat{\hat{\vec{\nu}}}(\mu)$ is the best-fit value of $\vec{\nu}$ obtained for a fixed value of $\mu$. All parameters
$$ t_\mu = -2 \ln \lambda(\mu)\,. \qquad (10.44) $$
A scan of $t_\mu$ as a function of $\mu$ reveals a minimum at the value $\mu = \hat{\mu}$. The minimum value $t_\mu(\hat{\mu})$ is equal to zero by construction. An uncertainty interval for $\mu$ can be obtained from the range in which $t_\mu$ exceeds its minimum value by no more than one unit, corresponding, in the asymptotic limit, to a 68.3% probability interval.
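The scan can be sketched for a Poisson counting experiment with a Gaussian-constrained background nuisance (all numbers invented; the profiling is done by brute-force minimization over a grid in b rather than a real minimizer):

```python
import numpy as np

n, b0, sigma_b = 20, 12.0, 2.0        # observed count and background estimate (assumed)

s = np.linspace(0.0, 20.0, 401)       # parameter of interest: signal yield
b = np.linspace(0.1, 40.0, 2000)      # nuisance parameter: background yield

lam = s[:, None] + b[None, :]         # expected counts for every (s, b) pair
# -2 ln L up to constants: Poisson term plus Gaussian constraint from b0
nll2 = 2.0 * (lam - n * np.log(lam)) + ((b - b0) / sigma_b) ** 2

t = nll2.min(axis=1)                  # profile over b: minimum for each fixed s
t -= t.min()                          # t_s = -2 ln lambda(s), zero at s = s-hat
s_hat = s[np.argmin(t)]
inside = s[t < 1.0]                   # 68.3% interval from t_s < 1
print(s_hat, inside.min(), inside.max())
```

The minimum sits at s-hat = n − b0, and the constraint on b visibly widens the interval compared to the case of a perfectly known background.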
10.12 Variations of the Profile-Likelihood Test Statistic

Different variations of the profile-likelihood definition have been adopted for various data analysis cases. A review of the most popular test statistics is presented in [2], where approximate formulae, valid in the asymptotic limit of a large number of measurements, are provided in order to simplify the computation. The main examples are reported in the following.
In order to assess the presence of a new signal, the hypothesis of a positive signal strength $\mu$ is tested against the hypothesis $\mu = 0$. This is done using the test statistic $t_\mu = -2\ln\lambda(\mu)$ evaluated for $\mu = 0$. The test statistic $t_0 = -2\ln\lambda(0)$, however, may reject the hypothesis $\mu = 0$ in case a downward fluctuation in data results in a negative best-fit value $\hat{\mu}$. A modification of $t_0$ has been proposed that is only sensitive to an excess in data that produces a positive value of $\hat{\mu}$ [2]:
$$ q_0 = \begin{cases} -2\ln\lambda(0) & \hat{\mu} \ge 0\,, \\ 0 & \hat{\mu} < 0\,. \end{cases} \qquad (10.46) $$
The p-value corresponding to the test statistic $q_0$ can be evaluated using Monte Carlo pseudo-samples that simulate only background events. The distribution of $q_0$ has a Dirac delta component $\delta(q_0)$, i.e., a 'spike' at $q_0 = 0$, corresponding to all the cases which give a negative $\hat{\mu}$.
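The spike at $q_0 = 0$ can be illustrated with a deliberately minimal toy model (assumed here for illustration, not the book's example): a single Gaussian measurement x with unit variance, so that $\hat{\mu} = x$ and $q_0 = x^2$ for $x \ge 0$:

```python
import random

random.seed(7)
# Toy model: x ~ N(mu, 1), mu-hat = x, hence q0 = x^2 if x >= 0, else 0
toys = [random.gauss(0.0, 1.0) for _ in range(100000)]   # background only, mu = 0
q0 = [x * x if x >= 0.0 else 0.0 for x in toys]

spike = sum(1 for q in q0 if q == 0.0) / len(q0)
print(spike)             # about half of the toys fall in the delta at q0 = 0

q0_obs = 9.0             # hypothetical observed value, Z0 = sqrt(q0) = 3
p = sum(1 for q in q0 if q >= q0_obs) / len(q0)
print(p)                 # close to the one-sided Gaussian tail at 3 sigma
```

Half of the background-only toys have a downward fluctuation ($\hat{\mu} < 0$) and land exactly at $q_0 = 0$, producing the delta component in the distribution.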
Similarly to the definition of $q_0$, one may not want to consider upward fluctuations in data to exclude a given value of $\mu$ in case the best-fit value $\hat{\mu}$ is greater than the assumed value of $\mu$. In order to avoid those cases, the following modification of $t_\mu$ can be used:
$$ q_\mu = \begin{cases} -2\ln\lambda(\mu) & \hat{\mu} \le \mu\,, \\ 0 & \hat{\mu} > \mu\,. \end{cases} \qquad (10.47) $$
The distribution of $q_\mu$ presents a spike at $q_\mu = 0$, corresponding to those cases which give $\hat{\mu} > \mu$.
Both cases considered for Eqs. (10.45) and (10.47) are taken into account in the test statistic adopted for the Higgs boson search at the LHC:
$$ \tilde{q}_\mu = \begin{cases} -2\ln\dfrac{L(\vec{x} \mid \mu, \hat{\hat{\vec{\nu}}}(\mu))}{L(\vec{x} \mid 0, \hat{\hat{\vec{\nu}}}(0))} & \hat{\mu} < 0\,, \\[2ex] -2\ln\dfrac{L(\vec{x} \mid \mu, \hat{\hat{\vec{\nu}}}(\mu))}{L(\vec{x} \mid \hat{\mu}, \hat{\vec{\nu}})} & 0 \le \hat{\mu} \le \mu\,, \\[2ex] 0 & \hat{\mu} > \mu\,. \end{cases} \qquad (10.48) $$
For the test statistics presented in the previous sections, asymptotic approximations have been computed and are discussed extensively in [2], using Wilks' theorem and approximate formulae due to Wald [15].
For instance, the asymptotic approximation for the significance when using the test statistic for discovery $q_0$ (Eq. (10.46)) is:
$$ Z_0 \simeq \sqrt{q_0}\,. \qquad (10.49) $$
In practice, the values of the random variables present in the dataset are set to their respective expected values. In particular, all variables that represent yields in the data sample (e.g., all entries in a binned histogram) are replaced with their expected values, which may also be noninteger.
The use of a single representative dataset in asymptotic formulae avoids the generation of typically very large sets of Monte Carlo pseudo-experiments, significantly reducing the computation time.
While in the past Asimov datasets have been used as a pragmatic and CPU-efficient solution, a mathematical motivation was provided with the evaluation of the asymptotic formulae in [2].
The asymptotic approximation for the distribution of $\tilde{q}_\mu$ (Eq. (10.48)), for instance, can be computed in terms of the Asimov dataset as follows:
$$ f(\tilde{q}_\mu \mid \mu) = \frac{1}{2}\,\delta(\tilde{q}_\mu) + \begin{cases} \dfrac{1}{2}\,\dfrac{1}{\sqrt{2\pi}}\,\dfrac{1}{\sqrt{\tilde{q}_\mu}}\; e^{-\tilde{q}_\mu/2} & 0 < \tilde{q}_\mu \le \mu^2/\sigma^2\,, \\[2ex] \dfrac{1}{\sqrt{2\pi}\,(2\mu/\sigma)}\, \exp\!\left[ -\dfrac{1}{2}\, \dfrac{(\tilde{q}_\mu + \mu^2/\sigma^2)^2}{(2\mu/\sigma)^2} \right] & \tilde{q}_\mu > \mu^2/\sigma^2\,, \end{cases} \qquad (10.50) $$
where $\delta(\tilde{q}_\mu)$ is a Dirac delta function, modeling the cases in which the test statistic is set to zero, and $\sigma^2 = \mu^2/q_{\mu,A}$ depends on the term $q_{\mu,A}$, which is the value of the profile-likelihood test statistic $-2\ln\lambda$ evaluated at the Asimov dataset, setting the nuisance parameters to their nominal values [16].
From Eq. (10.50), the median significance, under the hypothesis of background only, can be written as the square root of the test statistic evaluated at the Asimov dataset:
$$ \mathrm{med}[Z_\mu \mid 0] = \sqrt{\tilde{q}_{\mu,A}}\,. \qquad (10.51) $$
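For a simple Poisson counting experiment with known background (an assumption made here for illustration), the Asimov recipe reduces to setting the observed count to its expectation, $n_A = s + b$; the median discovery significance $\sqrt{q_{0,A}}$ then takes the familiar closed form $Z = \sqrt{2\,((s+b)\ln(1+s/b) - s)}$:

```python
import math

s, b = 10.0, 30.0    # nominal signal and background yields (invented numbers)

# Asimov dataset: observed count set to its expectation under s + b
n_A = s + b

# q0 evaluated on the Asimov dataset for a Poisson counting model with known b:
# q0_A = -2 ln [ L(n_A | s=0) / L(n_A | s-hat) ], with s-hat = n_A - b
q0_A = 2.0 * (n_A * math.log(n_A / b) - (n_A - b))
Z = math.sqrt(q0_A)
print(Z)   # median discovery significance; compare with the naive s / sqrt(b)
```

For these numbers, Z is somewhat smaller than the naive estimate $s/\sqrt{b}$, as expected from the exact Poisson treatment.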
For a comprehensive treatment of asymptotic approximations, again, refer to [2].
For practical applications, asymptotic formulae are implemented in the ROOSTATS library [17], released within the ROOT [18] framework. A common software interface to most of the commonly used statistical methods allows one to easily switch from one test statistic to another and to perform computations using either asymptotic formulae or Monte Carlo pseudo-experiments. Bayesian as well as frequentist methods are both implemented and can be easily compared.
Example 10.27 Bump Hunting with the $L_{s+b}/L_b$ Test Statistic
A classic 'bump-hunting' case is studied in order to determine the signal significance, using as test statistic the ratio of the likelihood functions in the two hypotheses $s + b$ and $b$. This test statistic was traditionally used by experiments at LEP and at the Tevatron. A systematic uncertainty on the expected background yield is added in the following Example 10.28, and the profile-likelihood test statistic is finally used in Example 10.29 for a similar case.
The data model
A data sample, generated with Monte Carlo, is compared with two hypotheses:
1. background only, assuming an exponential distribution;
2. background plus signal, with a Gaussian signal on top of the exponential background.
The expected distributions in the two hypotheses are shown in Fig. 10.10, superimposed on the data histogram, which is divided into N = 40 bins.
Fig. 10.10 Monte Carlo-generated data sample superimposed on an exponential background model (top) and on an exponential background model plus a Gaussian signal (bottom)
Likelihood function
The likelihood function for a binned distribution is the product of Poisson distributions for the numbers of entries observed in each bin, $\vec{n} = (n_1, \dots, n_N)$:
$$ L(\vec{n} \mid \vec{s}, \vec{b}; \mu, \beta) = \prod_{i=1}^{N} \mathrm{Pois}(n_i \mid \beta b_i + \mu s_i)\,, \qquad (10.52) $$
where the expected distributions for signal and background are modeled as $\vec{s} = (s_1, \dots, s_N)$ and $\vec{b} = (b_1, \dots, b_N)$, respectively, and are determined from the expected binned signal and background distributions.
The normalizations of $\vec{s}$ and $\vec{b}$ are given by theory expectations, and variations of the normalization scales are modeled with the extra parameters $\mu$ and $\beta$ for signal and background, respectively. The parameter of interest is the signal strength $\mu$; $\beta$ has the same role as $\mu$ for the background yield and is, in this case, a nuisance parameter.
For the moment, $\beta = 1$ is assumed, which corresponds to a negligible uncertainty on the expected background yield.
Test statistic
The test statistic based on the likelihood ratio $L_{s+b}/L_b$ can be written as:
$$ q = -2\ln\frac{L(\vec{n} \mid \vec{s}, \vec{b}; \mu, \beta = 1)}{L(\vec{n} \mid \vec{s}, \vec{b}; \mu = 0, \beta = 1)}\,, \qquad (10.53) $$
or, equivalently, in terms of the profile-likelihood function $\lambda$:
$$ q = -2\ln\lambda(\mu) + 2\ln\lambda(0)\,. \qquad (10.54) $$
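A sketch of Eq. (10.53) for a binned model of this kind (exponential background plus Gaussian signal; the shapes, yields, and the injected signal are invented and do not reproduce the exact values of this example):

```python
import numpy as np

rng = np.random.default_rng(3)

# Binned expectations: exponential background plus Gaussian signal (assumed shapes)
centers = np.linspace(12.5, 987.5, 40)               # 40 bins over [0, 1000]
bkg_shape = np.exp(-centers / 200.0)
bkg = 800.0 * bkg_shape / bkg_shape.sum()            # 800 expected background events
sig_shape = np.exp(-0.5 * ((centers - 500.0) / 30.0) ** 2)
sig = 50.0 * sig_shape / sig_shape.sum()             # 50 expected signal events

n = rng.poisson(bkg + sig)      # pseudo-data with an injected signal

def neg2logL(mu):
    lam = bkg + mu * sig
    # -2 ln L up to a constant: the n! terms cancel in the ratio
    return 2.0 * np.sum(lam - n * np.log(lam))

# Eq. (10.53): q = -2 ln [ L(mu = 1) / L(mu = 0) ]
q = neg2logL(1.0) - neg2logL(0.0)
print(q)
```

With an injected signal, the s + b hypothesis fits the pseudo-data better, so q comes out negative, i.e., on the signal-like side.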
Significance evaluation
The distributions of the test statistic q for the background-only and for the signal-plus-background hypotheses are shown in Fig. 10.11, where 100,000 pseudo-experiments have been generated for each hypothesis. The observed value of q is closer to the bulk of the s + b distribution than to the b distribution. The p-value corresponding to the background-only hypothesis can be determined as the fraction of pseudo-experiments having a value of q lower than the one observed in data.
In the distributions shown in Fig. 10.11, 375 out of 100,000 toy samples have a value of q below the one corresponding to our data sample; hence, the p-value is 0.37%. Considering a binomial uncertainty, the p-value is determined with an uncertainty of 0.02%. A significance Z = 2.7 is determined from the p-value using Eq. (10.1).
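The conversion from the toy fraction to a significance can be done with the inverse of the standard normal cumulative distribution, available in the Python standard library; the sketch below reproduces the numbers quoted above:

```python
import math
from statistics import NormalDist

n_below, n_toys = 375, 100000            # toys beyond the observed test statistic
p = n_below / n_toys                     # p-value, 0.375%
err = math.sqrt(p * (1 - p) / n_toys)    # binomial uncertainty, about 0.02%
Z = NormalDist().inv_cdf(1 - p)          # one-sided significance, about 2.7
print(p, err, Z)
```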
[Fig. 10.11: distributions of the test statistic q for the background-only and signal-plus-background hypotheses (number of toys per bin, logarithmic scale)]
Fig. 10.12 Scan of the test statistic q as a function of the parameter of interest $\mu$
$$ L(\vec{n} \mid \vec{s}, \vec{b}; \mu, \beta; \beta_0) = \prod_{i=1}^{N} \mathrm{Pois}(n_i \mid \beta b_i + \mu s_i)\; P(\beta_0 \mid \beta)\,. \qquad (10.57) $$
Fig. 10.13 Toy data sample superimposed on an exponential background model (top) and on an exponential background model plus a Gaussian signal (bottom), adding a 10% uncertainty to the background normalization
The scan of this test statistic is shown in Fig. 10.14. Compared with the case where no uncertainty was included, the shape of the test statistic curve is now broader and the minimum is less deep, resulting in a larger uncertainty, $\mu = 1.40^{+0.61}_{-0.60}$, and a smaller significance, Z = 2.3.
Fig. 10.14 Scan of the test statistic q as a function of the parameter of interest $\mu$, including the systematic uncertainty on $\beta$ (red), compared with the case with no uncertainty (blue)
3 Remember that half the range of a uniform distribution is larger than the corresponding standard deviation by a factor $\sqrt{3}$, so $\beta_0 \pm \delta\beta = 1.0 \pm 0.1$ does not represent a $\pm 1\sigma$ interval but a $\pm\sqrt{3}\,\sigma$ interval. See Sect. 2.7, Eq. (2.28).
[Fig. 10.15: events per 1 GeV as a function of m (GeV), in the range 100–150 GeV]
This exercise was implemented using the ROOSTATS library [17] within the ROOT [18] framework.
In this case, the problem is treated with an unbinned likelihood function, and the signal yield s is fit from data.
For simplicity, all parameters in the model are fixed, i.e., they are considered as constants known with negligible uncertainty, except the background yield, which is assumed to be known with some uncertainty $\sigma_\beta$, modeled with a log-normal distribution.
where:
$$ L_0(m; s, b_0) = \frac{e^{-(s+b_0)}}{n!} \left[ s\,\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(m-\mu)^2/2\sigma^2} + b_0\,\lambda e^{-\lambda m} \right], \qquad (10.60) $$
$$ L_\beta(\beta; \sigma_\beta) = \frac{1}{\sqrt{2\pi}\,\sigma_\beta}\, e^{-\beta^2/2\sigma_\beta^2}\,. \qquad (10.61) $$
$$ L(\vec{m}; s, \beta) = \prod_{i=1}^{N} L(m_i; s, \beta)\,. \qquad (10.62) $$
Fig. 10.16 Negative logarithm of the profile likelihood as a function of the signal yield s. The blue curve is computed assuming a negligible uncertainty on the background yield, while the red curve is computed assuming a 30% uncertainty. The intersections of the curves with the green line at $-\log\lambda(s) = 0.5$ determine the uncertainty intervals on s
4 $-\log\lambda$ is the default visualization choice provided by the ROOSTATS library in ROOT, which differs by a factor of 2 with respect to the choice of $-2\log\lambda$ adopted elsewhere.
10.13 The Look Elsewhere Effect

Many searches for new physical phenomena look for a peak in a distribution, typically a reconstructed particle's mass. In some cases, the location of the peak is known, as in searches for rare decays of known particles, such as $B_s \to \mu^+\mu^-$. But this is not the case in searches for new particles, like the Higgs boson discovered at the LHC, whose mass is not predicted by theory.
If an excess in data, compared with the background expectation, is found at any mass value, the excess could be interpreted as a possible signal of a new resonance at the observed mass. However, the peak could be produced either by the presence of a real new signal or by a background fluctuation.
One way to compute the significance of the new signal is to use the p-value corresponding to the measured test statistic q assuming a fixed value $m_0$ of the resonance mass m. In this case, the significance is called local significance.
Given the PDF $f(q \mid m, \mu)$ of the adopted test statistic q, the local p-value is:
$$ p(m_0) = \int_{q_{\mathrm{obs}}(m_0)}^{\infty} f(q \mid m_0, \mu = 0)\,\mathrm{d}q\,. \qquad (10.64) $$
$p(m_0)$ gives the probability that a background fluctuation at a fixed value of the mass $m_0$ results in a value of q greater than or equal to the observed value $q_{\mathrm{obs}}(m_0)$.
The probability of a background fluctuation at any mass value in the range of interest is called the global p-value and is, in general, larger than the local p-value. So, the local p-value is an underestimate, if interpreted as a global p-value.
In general, the effect of the reduction of significance from local to global, in case one or more parameters of interest are determined from data, is called the look elsewhere effect.
More in general, when an experiment looks for a signal where one or more parameters of interest $\vec{\theta}$⁵ are unknown (e.g., both the mass and the width, or other properties of a new particle), in the presence of an excess in data with respect to the background expectation, the unknown parameter(s) can be determined from the data sample itself. The local p-value of the excess is:
$$ p(\vec{\theta}_0) = \int_{q_{\mathrm{obs}}(\vec{\theta}_0)}^{\infty} f(q \mid \vec{\theta}_0, \mu = 0)\,\mathrm{d}q\,, \qquad (10.65) $$
5 Here, as in Sect. 3.5.1, we denote by $\vec{\theta}$ the parameters of interest. Other possible nuisance parameters $\vec{\nu}$ are dropped for simplicity of notation. The signal strength parameter $\mu$ is written explicitly and is not included in the set of parameters $\vec{\theta}$.
The global p-value can be computed using, as test statistic, the largest value of the estimator over the entire parameter range:
$$ q_{\mathrm{glob}} = \sup_{\theta_i^{\min} < \theta_i < \theta_i^{\max},\; i = 1, \dots, m} q(\vec{\theta}, \mu = 0) = q(\hat{\vec{\theta}}, \mu = 0)\,, \qquad (10.66) $$
where $\hat{\vec{\theta}}$ denotes the set of parameters of interest that maximizes $q(\vec{\theta}, \mu = 0)$.
The global p-value can be determined from the distribution of the test statistic $q_{\mathrm{glob}}$, assuming background only, given the observed value $q_{\mathrm{obs}}^{\mathrm{glob}}$:
$$ p_{\mathrm{glob}} = \int_{q_{\mathrm{obs}}^{\mathrm{glob}}}^{\infty} f(q_{\mathrm{glob}} \mid \mu = 0)\,\mathrm{d}q_{\mathrm{glob}}\,. \qquad (10.67) $$
Even if the test statistic q is derived, as usual, from a likelihood ratio, in this case Wilks' theorem cannot be applied, because the values of the parameters $\vec{\theta}$ are undefined for $\mu = 0$. Consider, for instance, a search for a resonance: in the case of background only ($\mu = 0$), the test statistic no longer depends on the resonance mass m. In this case, the two hypotheses assumed at the numerator and the denominator of the likelihood ratio considered in Wilks' theorem are not nested [19].
The distribution of $q_{\mathrm{glob}}$ from Eq. (10.66) can be computed with Monte Carlo samples. Large significance values, corresponding to very low p-values, require considerably large pseudo-sample sizes, which demand large CPU time.
An approximate way to determine the global significance, taking into account the look elsewhere effect, is reported in [20], relying on the asymptotic behavior of likelihood-ratio estimators.
The correction factor f that needs to be applied to the local significance in order to obtain the global significance is called the trial factor.
The trial factor is related to the peak width, which may be dominated by the experimental resolution if the intrinsic width is small. Empirical evaluations, when the mass is determined from data, give a factor f typically proportional to the ratio of the search range to the peak width, times the local significance [21].
For a single parameter m, the global test statistic is (Eq. (10.66)):
$$ q_{\mathrm{glob}} = q(\hat{m}, \mu = 0)\,. \qquad (10.68) $$
It is possible to demonstrate [22] that the probability that the test statistic $q_{\mathrm{glob}}$ is greater than a given value u, used to determine the global p-value, is bound by the following inequality:
$$ p_{\mathrm{glob}} = P(q(\hat{m}, \mu = 0) > u) \le P(\chi^2 > u) + \langle N_u \rangle\,. \qquad (10.69) $$
The term $P(\chi^2 > u)$, related to the local p-value, is a cumulative $\chi^2$ distribution that comes from the asymptotic approximation, as a $\chi^2$ with one degree of freedom, of the local test statistic.
Fig. 10.17 Visual illustration of up-crossings, counted to determine $\langle N_{u_0} \rangle$. In this example, the number of up-crossings of q(m) above the threshold is $N_u = 3$
One can evaluate $\langle N_{u_0} \rangle$ at a relatively low threshold $u_0$ by generating a not too large number of pseudo-experiments; $\langle N_u \rangle$ can then be determined using the scaling law of Eq. (10.71),
$$ \langle N_u \rangle = \langle N_{u_0} \rangle\, e^{-(u - u_0)/2}\,, $$
preserving a good numerical precision.
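The procedure can be sketched with a toy $\chi^2$ field with one degree of freedom, obtained by squaring a smoothed Gaussian random curve (the smoothing kernel and all numbers are invented stand-ins for a real profile-likelihood scan):

```python
import math
import numpy as np

rng = np.random.default_rng(5)

def count_upcrossings(q, u):
    # number of times the curve q(m) crosses the level u from below
    above = q > u
    return int(np.sum(~above[:-1] & above[1:]))

N_u0 = []
u0 = 0.5                          # low reference threshold
for _ in range(200):              # background-only pseudo-experiments
    z = rng.normal(size=500)      # white noise over the mass scan
    kernel = np.exp(-0.5 * (np.arange(-50, 51) / 10.0) ** 2)
    z = np.convolve(z, kernel / np.linalg.norm(kernel), mode="same")
    N_u0.append(count_upcrossings(z ** 2, u0))   # chi2(1)-like field: q = z^2

mean_N_u0 = float(np.mean(N_u0))
# Scaling law of Eq. (10.71): <N_u> = <N_u0> exp(-(u - u0)/2)
u = 25.0                          # corresponds to a local 5 sigma (u = Z^2)
mean_N_u = mean_N_u0 * math.exp(-(u - u0) / 2.0)
print(mean_N_u0, mean_N_u)
```

The toys are only needed at the cheap, low threshold $u_0$; the exponential scaling then reaches the tiny $\langle N_u \rangle$ relevant at the $5\sigma$ level without generating enormous pseudo-sample sets.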
In practice, one can move from local to global p-values using the following asymptotically approximated relation:
$$ p_{\mathrm{glob}} \simeq p_{\mathrm{loc}} + \langle N_{u_0} \rangle\, e^{-(u - u_0)/2}\,. \qquad (10.72) $$
Fig. 10.18 Local p-value as a function of the Higgs boson mass in the search for the Higgs boson at the LHC performed by ATLAS. The solid line shows the observed p-value, while the dashed line shows the median p-value expected for a Standard Model Higgs boson with a given mass value $m_H$. The plot is from [9] (open access)
$$ \langle N_0 \rangle = 9 \pm 3\,. \qquad (10.73) $$
The local p-value corresponding to $5\sigma$ is about $3 \times 10^{-7}$. From Eq. (10.72), the global p-value is, approximately, $p_{\mathrm{glob}} \simeq 3 \times 10^{-5}$, corresponding to a trial factor:
$$ f = \frac{p_{\mathrm{glob}}}{p_{\mathrm{loc}}} \simeq \frac{3 \times 10^{-5}}{3 \times 10^{-7}} = 100\,. \qquad (10.76) $$
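The arithmetic of Eq. (10.72) for these numbers can be checked directly ($\langle N_0 \rangle = 9$ and the one-sided $5\sigma$ tail probability $p_{\mathrm{loc}} \approx 3 \times 10^{-7}$ are taken from the example above):

```python
import math

N_u0, u0 = 9.0, 0.0        # <N_0> from a few background-only toys
Z_loc = 5.0
u = Z_loc ** 2             # test-statistic threshold at 5 sigma local
p_loc = 3.0e-7             # about the one-sided 5 sigma tail probability

# Eq. (10.72): p_glob ~ p_loc + <N_u0> exp(-(u - u0)/2)
p_glob = p_loc + N_u0 * math.exp(-(u - u0) / 2.0)
trial_factor = p_glob / p_loc
print(p_glob, trial_factor)
```

The up-crossing term dominates the sum, and the resulting trial factor is of order 100, consistent with Eq. (10.76).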
In some cases, more than one parameter is determined from data. For instance, an experiment may measure both the mass and the width of a new resonance, neither of which is predicted by theory.
In more dimensions, the look elsewhere correction proceeds in a way similar to the one-dimensional case, but now the test statistic depends on more than one parameter.
Equation (10.69) and the scaling law in Eq. (10.71) are written in terms of the average number of up-crossings, which is only meaningful in one dimension. The generalization to more dimensions can be obtained by replacing the number of up-crossings with the Euler characteristic, which is equal to the number of disconnected components minus the number of 'holes' in the multidimensional sets of the parameter space defined by $q_{\mathrm{loc}}(\vec{\theta}\,) > u$ [24].
Examples of sets with different values of the Euler characteristic are shown in Fig. 10.19.
Fig. 10.19 Examples of sets with different values of the Euler characteristic $\varphi$, defined as the number of disconnected components minus the number of 'holes'. Top left: $\varphi = 1$ (one component, no holes); top right: $\varphi = 0$ (one component, one hole); bottom left: $\varphi = 2$ (two components, no holes); bottom right: $\varphi = 1$ (two components, one hole)
The expected value of the Euler characteristic above a threshold u can be written as:
$$ \langle \varphi(u) \rangle = \sum_{d=0}^{D} N_d\,\rho_d(u)\,, \qquad (10.77) $$
where the functions $\rho_0(u), \dots, \rho_D(u)$ are characteristic of the specific random field. For a $\chi^2$ field with D = 1, Eq. (10.77) gives the scaling law in Eq. (10.71). For a two-dimensional $\chi^2$ field, which is, for instance, the case when measuring from data both the mass and the width of a new resonance, Eq. (10.77) becomes:
$$ \langle \varphi(u) \rangle = \left( N_1 + N_2 \sqrt{u} \right) e^{-u/2}\,. \qquad (10.78) $$
The expected value $\langle \varphi(u) \rangle$ can be determined, typically with Monte Carlo, at two values of u, $u = u_1$ and $u = u_2$. Once $\langle \varphi(u_1) \rangle$ and $\langle \varphi(u_2) \rangle$ are determined, $N_1$ and $N_2$ can be found by inverting the system of two equations given by Eq. (10.78) for the two values $\langle \varphi(u_1) \rangle$ and $\langle \varphi(u_2) \rangle$.
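Since Eq. (10.78) is linear in $N_1$ and $N_2$, the inversion is a 2×2 linear solve; a sketch with invented values for the two Monte Carlo estimates $\langle \varphi(u_1) \rangle$ and $\langle \varphi(u_2) \rangle$:

```python
import math
import numpy as np

# Suppose <phi(u)> has been estimated with toys at two thresholds (assumed values)
u1, phi1 = 1.0, 20.0
u2, phi2 = 4.0, 6.0

# Eq. (10.78): <phi(u)> = (N1 + N2 sqrt(u)) exp(-u/2)  ->  linear in N1, N2
A = np.array([[math.exp(-u1 / 2), math.sqrt(u1) * math.exp(-u1 / 2)],
              [math.exp(-u2 / 2), math.sqrt(u2) * math.exp(-u2 / 2)]])
N1, N2 = np.linalg.solve(A, np.array([phi1, phi2]))

# Extrapolate the expected Euler characteristic to a high threshold
u = 25.0
phi_u = (N1 + N2 * math.sqrt(u)) * math.exp(-u / 2)
print(N1, N2, phi_u)
```

As in one dimension, the two cheap low-threshold estimates determine the extrapolation to the very small $\langle \varphi(u) \rangle$ relevant for large significances.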
References
1. Wasserstein, R.L., Lazar, N.A.: The ASA’s statement on p-values: context, process, and
purpose. Am. Stat. 70, 129–133 (2016)
2. Cowan, G., Cranmer, K., Gross, E., Vitells, O.: Asymptotic formulae for likelihood-based tests
of new physics. Eur. Phys. J. C 71, 1554 (2011)
3. Helene, O.: Upper limit of peak area. Nucl. Instrum. Methods A 212, 319 (1983)
4. Amsler, C., et al.: The review of particle physics. Phys. Lett. B 667, 1 (2008)
5. Zech, G.: Upper limits in experiments with background or measurement errors. Nucl. Instrum.
Methods A 277, 608 (1989)
6. Highland, V., Cousins, R.: Comment on “upper limits in experiments with background or
measurement errors”. Nucl. Instrum. Methods A 277, 608–610 (1989). Nucl. Instrum. Methods
A 398, 429 (1989)
7. Zech, G.: Reply to comment on “upper limits in experiments with background or measurement
errors”. Nucl. Instrum. Methods A 277, 608–610 (1989). Nucl. Instrum. Methods A 398, 431
(1989)
8. Abbiendi, G., et al.: Search for the standard model Higgs boson at LEP. Phys. Lett. B 565,
61–75 (2003)
9. ATLAS Collaboration: Observation of an excess of events in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. ATLAS-CONF-2012-093 (2012). http://cds.cern.ch/record/1460439
10. Berg, B.: Markov Chain Monte Carlo Simulations and Their Statistical Analysis. World
Scientific, Singapore (2004)
11. Cousins, R., Highland, V.: Incorporating systematic uncertainties into an upper limit. Nucl.
Instrum. Methods A 320, 331–335 (1992)
12. Zhukov, V., Bonsch, M.: Multichannel number counting experiments. In: Proceedings of
PHYSTAT2011 (2011)
13. Blocker, C.: Interval estimation in the presence of nuisance parameters: 2. Cousins and Highland method. CDF/MEMO/STATISTICS/PUBLIC/7539 (2006). https://www-cdf.fnal.gov/physics/statistics/notes/cdf7539_ch_limits_v2.ps
14. Lista, L.: Including Gaussian uncertainty on the background estimate for upper limit calcula-
tions using Poissonian sampling. Nucl. Instrum. Methods A 517, 360 (2004)
15. Wald, A.: Tests of statistical hypotheses concerning several parameters when the number of
observations is large. Trans. Am. Math. Soc. 54, 426–482 (1943)
16. Asimov, I.: Franchise. In: Asimov, I. (ed.) The Complete Stories, vol. 1. Broadway Books, New
York (1990)
17. Schott, G., for the ROOSTATS Team: RooStats for searches. In: Proceedings of PHYSTAT2011 (2011). https://twiki.cern.ch/twiki/bin/view/RooStats
18. Brun, R., Rademakers, F.: ROOT—an object oriented data analysis framework. In: Proceedings AIHENP96 Workshop, Lausanne (1996). Nucl. Instrum. Methods A 389, 81–86 (1997). http://root.cern.ch/
19. Ranucci, G.: The profile likelihood ratio and the look elsewhere effect in high energy physics.
Nucl. Instrum. Methods A 661, 77–85 (2012)
20. Gross, E., Vitells, O.: Trial factors for the look elsewhere effect in high energy physics. Eur.
Phys. J. C 70, 525 (2010)
21. Gross, E., Vitells, O.: Statistical Issues Relevant to Significance of Discovery Claims
(10w5068), Banff, Alberta, 11–16 July 2010
22. Davies, R.: Hypothesis testing when a nuisance parameter is present only under the alternative.
Biometrika 74, 33 (1987)
23. Gross, E.: Proceedings of the European School of High Energy Physics (2015)
24. Vitells, O., Gross, E.: Estimating the significance of a signal in a multi-dimensional search. Astropart. Phys. 35, 230–234 (2011)
25. ATLAS Collaboration: Search for resonances in diphoton events at $\sqrt{s}$ = 13 TeV with the ATLAS detector. J. High Energy Phys. 09, 001 (2016)
26. Read, A.: Modified frequentist analysis of search results (the CLs method). In: Proceedings of
the 1st Workshop on Confidence Limits, CERN (2000)