
Lecture Notes in Physics 941

Luca Lista

Statistical
Methods for
Data Analysis in
Particle Physics
Second Edition
Lecture Notes in Physics

Volume 941

Founding Editors
W. Beiglböck
J. Ehlers
K. Hepp
H. Weidenmüller

Editorial Board
M. Bartelmann, Heidelberg, Germany
P. Hänggi, Augsburg, Germany
M. Hjorth-Jensen, Oslo, Norway
R.A.L. Jones, Sheffield, UK
M. Lewenstein, Barcelona, Spain
H. von Löhneysen, Karlsruhe, Germany
A. Rubio, Hamburg, Germany
M. Salmhofer, Heidelberg, Germany
W. Schleich, Ulm, Germany
S. Theisen, Potsdam, Germany
D. Vollhardt, Augsburg, Germany
J.D. Wells, Ann Arbor, USA
G.P. Zank, Huntsville, USA
The Lecture Notes in Physics

The series Lecture Notes in Physics (LNP), founded in 1969, reports new
developments in physics research and teaching, quickly and informally but with a
high quality and the explicit aim to summarize and communicate current knowledge
in an accessible way. Books published in this series are conceived as bridging
material between advanced graduate textbooks and the forefront of research and to
serve three purposes:
• to be a compact and modern up-to-date source of reference on a well-defined
topic
• to serve as an accessible introduction to the field to postgraduate students and
nonspecialist researchers from related areas
• to be a source of advanced teaching material for specialized seminars, courses
and schools
Both monographs and multi-author volumes will be considered for publication.
Edited volumes should, however, consist of a very limited number of contributions
only. Proceedings will not be considered for LNP.
Volumes published in LNP are disseminated both in print and in electronic
formats, the electronic archive being available at springerlink.com. The series
content is indexed, abstracted and referenced by many abstracting and information
services, bibliographic networks, subscription agencies, library networks, and
consortia.
Proposals should be sent to a member of the Editorial Board, or directly to the
managing editor at Springer:

Christian Caron
Springer Heidelberg
Physics Editorial Department I
Tiergartenstrasse 17
69121 Heidelberg/Germany
christian.caron@springer.com

More information about this series at http://www.springer.com/series/5304


Luca Lista

Statistical Methods for Data Analysis in Particle Physics

Second Edition

Luca Lista
INFN Sezione di Napoli
Napoli, Italy

ISSN 0075-8450    ISSN 1616-6361 (electronic)
Lecture Notes in Physics
ISBN 978-3-319-62839-4    ISBN 978-3-319-62840-0 (eBook)
DOI 10.1007/978-3-319-62840-0
Library of Congress Control Number: 2017948232

© Springer International Publishing AG 2016, 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

This book started as a collection of material from a course of lectures on Statistical
Methods for Data Analysis I gave to Ph.D. students in physics at the University of
Naples Federico II from 2009 to 2017, and was subsequently enriched with material
from other seminars and lectures I have been invited to give in recent years.
The aim of the book is to present and elaborate the main concepts and tools that
physicists use to analyze experimental data.
An introduction to probability theory and basic statistics is provided mainly as
refresher lectures for students who did not take a formal course on statistics before
starting their Ph.D. This also gives the opportunity to introduce the Bayesian
approach to probability, which is a new topic for many students.
More advanced topics follow, up to recent developments in statistical methods
used for particle physics, in particular for data analyses at the Large Hadron Collider.
Many of the covered tools and methods have applications in high-energy physics,
but their scope could well be extended to other fields.
A shorter version of the course was presented at CERN in November 2009
as lectures on Statistical Methods in LHC Data Analysis for the ATLAS and
CMS experiments. The chapter that discusses discoveries and upper limits was
improved after the lectures on the subject I gave in Autrans, France, at the IN2P3
School of Statistics in May 2012. I was also invited to conduct a seminar about
Statistical Methods at Ghent University, Belgium, in October 2014, which gave me
the opportunity to review some of my material and add new examples.

Note to the Second Edition

The second edition of this book reflects the work I did in preparation for the lectures
that I was invited to give during the CERN-JINR European School of High-Energy
Physics (15–28 June 2016, Skeikampen, Norway). On that occasion, I reviewed,
expanded, and reordered my material.


In addition, with respect to the first edition, I added a chapter about unfolding,
an extended discussion about the best linear unbiased estimator, and an introduction
to machine learning algorithms, in particular artificial neural networks, with hints
about deep learning, and boosted decision trees.

Acknowledgments

I am grateful to Louis Lyons, who carefully and patiently read the first edition of my
book and provided useful comments and suggestions. I would like to thank Eilam
Gross for providing useful examples and for reviewing the sections about the look
elsewhere effect. I also received useful comments from Vitaliano Ciulli and from
Luis Isaac Ramos Garcia.
I considered all feedback I received in the preparation of this second edition.

Napoli, Italy Luca Lista


Contents

1 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 1
1.1 Why Probability Matters to a Physicist . . . . . . . .. . . . . . . . . . . . . . . . . . . . 1
1.2 The Concept of Probability .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 2
1.3 Repeatable and Non-Repeatable Cases . . . . . . . .. . . . . . . . . . . . . . . . . . . . 2
1.4 Different Approaches to Probability . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 3
1.5 Classical Probability .. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 4
1.6 Generalization to the Continuum.. . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 6
1.6.1 Bertrand’s Paradox . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 7
1.7 Axiomatic Probability Definition . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8
1.8 Probability Distributions.. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 9
1.9 Conditional Probability .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 9
1.10 Independent Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 10
1.11 Law of Total Probability .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 11
1.12 Average, Variance and Covariance .. . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 12
1.13 Transformations of Variables .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 15
1.14 The Bernoulli Process . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 16
1.15 The Binomial Process. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 17
1.16 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 20
1.17 The Law of Large Numbers . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 21
1.18 Frequentist Definition of Probability.. . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 22
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 23
2 Probability Distribution Functions . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 25
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 25
2.2 Definition of Probability Distribution Function . . . . . . . . . . . . . . . . . . . 25
2.3 Average and Variance in the Continuous Case . . . . . . . . . . . . . . . . . . . . 27
2.4 Mode, Median, Quantiles .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 28
2.5 Cumulative Distribution . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 28
2.6 Continuous Transformations of Variables . . . . .. . . . . . . . . . . . . . . . . . . . 29
2.7 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 30


2.8 Gaussian Distribution .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 31


2.9 χ² Distribution . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 32
2.10 Log Normal Distribution . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 33
2.11 Exponential Distribution.. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 34
2.12 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 35
2.13 Other Distributions Useful in Physics . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 41
2.13.1 Breit–Wigner Distribution .. . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 41
2.13.2 Relativistic Breit–Wigner Distribution.. . . . . . . . . . . . . . . . . 42
2.13.3 Argus Function .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 43
2.13.4 Crystal Ball Function . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 44
2.13.5 Landau Distribution.. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 46
2.14 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 46
2.15 Probability Distribution Functions in More than One
Dimension .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 49
2.15.1 Marginal Distributions . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 49
2.15.2 Independent Variables . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 50
2.15.3 Conditional Distributions .. . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 53
2.16 Gaussian Distributions in Two or More Dimensions .. . . . . . . . . . . . . 54
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 58
3 Bayesian Approach to Probability .. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 59
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 59
3.2 Bayes’ Theorem.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 59
3.3 Bayesian Probability Definition .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 64
3.4 Bayesian Probability and Likelihood Functions .. . . . . . . . . . . . . . . . . . 67
3.4.1 Repeated Use of Bayes’ Theorem and Learning
Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 67
3.5 Bayesian Inference .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 68
3.5.1 Parameters of Interest and Nuisance Parameters . . . . . . . 69
3.5.2 Credible Intervals . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 70
3.6 Bayes Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 73
3.7 Subjectiveness and Prior Choice . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 74
3.8 Jeffreys’ Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 75
3.9 Reference Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 76
3.10 Improper Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 76
3.11 Transformations of Variables and Error Propagation . . . . . . . . . . . . . 79
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 79
4 Random Numbers and Monte Carlo Methods . . . . . .. . . . . . . . . . . . . . . . . . . . 81
4.1 Pseudorandom Numbers .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 81
4.2 Pseudorandom Generators Properties .. . . . . . . . .. . . . . . . . . . . . . . . . . . . . 82
4.3 Uniform Random Number Generators .. . . . . . . .. . . . . . . . . . . . . . . . . . . . 84
4.3.1 Remapping Uniform Random Numbers . . . . . . . . . . . . . . . . 85
4.4 Discrete Random Number Generators . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 85

4.5 Nonuniform Random Number Generators.. . . .. . . . . . . . . . . . . . . . . . . . 86


4.5.1 Nonuniform Distribution from Inversion
of the Cumulative Distribution . . . . . .. . . . . . . . . . . . . . . . . . . . 86
4.5.2 Gaussian Generator Using the Central Limit
Theorem .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 88
4.5.3 Gaussian Generator with the Box–Muller Method .. . . . 89
4.6 Monte Carlo Sampling.. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 89
4.6.1 Hit-or-Miss Monte Carlo . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 90
4.6.2 Importance Sampling . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 91
4.7 Numerical Integration with Monte Carlo Methods .. . . . . . . . . . . . . . . 92
4.8 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 93
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 95
5 Parameter Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 97
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 97
5.2 Inference.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 97
5.3 Parameters of Interest .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 98
5.4 Nuisance Parameters .. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 98
5.5 Measurements and Their Uncertainties . . . . . . . .. . . . . . . . . . . . . . . . . . . . 99
5.5.1 Statistical and Systematic Uncertainties . . . . . . . . . . . . . . . . 99
5.6 Frequentist vs Bayesian Inference . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 100
5.7 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 100
5.8 Properties of Estimators . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 101
5.8.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 102
5.8.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 102
5.8.3 Minimum Variance Bound and Efficiency .. . . . . . . . . . . . . 102
5.8.4 Robust Estimators.. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 103
5.9 Binomial Distribution for Efficiency Estimate . . . . . . . . . . . . . . . . . . . . 104
5.10 Maximum Likelihood Method . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 105
5.10.1 Likelihood Function . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 105
5.10.2 Extended Likelihood Function . . . . . .. . . . . . . . . . . . . . . . . . . . 106
5.10.3 Gaussian Likelihood Functions . . . . .. . . . . . . . . . . . . . . . . . . . 108
5.11 Errors with the Maximum Likelihood Method . . . . . . . . . . . . . . . . . . . . 109
5.11.1 Second Derivatives Matrix . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 109
5.11.2 Likelihood Scan . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 110
5.11.3 Properties of Maximum Likelihood Estimators . . . . . . . . 112
5.12 Minimum χ² and Least-Squares Methods .. . .. . . . . . . . . . . . . . . . . . . . 114
5.12.1 Linear Regression .. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 115
5.12.2 Goodness of Fit and p-Value . . . . . . . .. . . . . . . . . . . . . . . . . . . . 118
5.13 Binned Data Samples . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 118
5.13.1 Minimum χ² Method for Binned Histograms .. . . . . . . . . 119
5.13.2 Binned Poissonian Fits . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 120

5.14 Error Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 121


5.14.1 Simple Cases of Error Propagation .. . . . . . . . . . . . . . . . . . . . 121
5.15 Treatment of Asymmetric Errors .. . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 123
5.15.1 Asymmetric Error Combination with a Linear
Model .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 124
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 127
6 Combining Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 129
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 129
6.2 Simultaneous Fits and Control Regions . . . . . . .. . . . . . . . . . . . . . . . . . . . 129
6.3 Weighted Average.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 131
6.4 χ² in n Dimensions . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 132
6.5 The Best Linear Unbiased Estimator.. . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 133
6.5.1 Quantifying the Importance of Individual
Measurements .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 135
6.5.2 Negative Weights . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 137
6.5.3 Iterative Application of the BLUE Method .. . . . . . . . . . . . 139
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 140
7 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 143
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 143
7.2 Neyman Confidence Intervals . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 143
7.2.1 Construction of the Confidence Belt .. . . . . . . . . . . . . . . . . . . 144
7.2.2 Inversion of the Confidence Belt . . . .. . . . . . . . . . . . . . . . . . . . 146
7.3 Binomial Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 147
7.4 The Flip-Flopping Problem . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 150
7.5 The Unified Feldman–Cousins Approach . . . . .. . . . . . . . . . . . . . . . . . . . 152
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 154
8 Convolution and Unfolding .. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 155
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 155
8.2 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 155
8.2.1 Convolution and Fourier Transform . . . . . . . . . . . . . . . . . . . . 156
8.2.2 Discrete Convolution and Response Matrix . . . . . . . . . . . . 158
8.2.3 Efficiency and Background .. . . . . . . . .. . . . . . . . . . . . . . . . . . . . 158
8.3 Unfolding by Inversion of the Response Matrix.. . . . . . . . . . . . . . . . . . 160
8.4 Bin-by-Bin Correction Factors . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 163
8.5 Regularized Unfolding.. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 163
8.5.1 Tikhonov Regularization . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 164
8.6 Iterative Unfolding .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 166
8.6.1 Treatment of Background . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 171
8.7 Other Unfolding Methods . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 171
8.8 Software Implementations .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 173
8.9 Unfolding in More Dimensions . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 173
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 173

9 Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 175


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 175
9.2 Test Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 175
9.3 Type I and Type II Errors .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 177
9.4 Fisher’s Linear Discriminant . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 178
9.5 The Neyman–Pearson Lemma . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 181
9.6 Projective Likelihood Ratio Discriminant . . . . .. . . . . . . . . . . . . . . . . . . . 181
9.7 Kolmogorov–Smirnov Test . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 182
9.8 Wilks’ Theorem .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 184
9.9 Likelihood Ratio in the Search for a New Signal.. . . . . . . . . . . . . . . . . 185
9.10 Multivariate Discrimination with Machine Learning . . . . . . . . . . . . . 188
9.10.1 Overtraining .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 189
9.11 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 190
9.11.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 192
9.11.2 Convolutional Neural Networks. . . . .. . . . . . . . . . . . . . . . . . . . 193
9.12 Boosted Decision Trees. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 196
9.13 Multivariate Analysis Implementations .. . . . . . .. . . . . . . . . . . . . . . . . . . . 199
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 203
10 Discoveries and Upper Limits . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 205
10.1 Searches for New Phenomena: Discovery and Upper Limits . . . . . 205
10.2 Claiming a Discovery .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 206
10.2.1 p-Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 206
10.2.2 Significance Level . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 207
10.2.3 Significance and Discovery.. . . . . . . . .. . . . . . . . . . . . . . . . . . . . 208
10.2.4 Significance for Poissonian Counting Experiments .. . . 208
10.2.5 Significance with Likelihood Ratio .. . . . . . . . . . . . . . . . . . . . 209
10.2.6 Significance Evaluation with Toy Monte Carlo . . . . . . . . 210
10.3 Excluding a Signal Hypothesis .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 211
10.4 Combined Measurements and Likelihood Ratio . . . . . . . . . . . . . . . . . . 211
10.5 Definitions of Upper Limit . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 211
10.6 Bayesian Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 212
10.6.1 Bayesian Upper Limits for Poissonian Counting.. . . . . . 212
10.6.2 Limitations of the Bayesian Approach.. . . . . . . . . . . . . . . . . 215
10.7 Frequentist Upper Limits . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 215
10.7.1 Frequentist Upper Limits for Counting
Experiments .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 216
10.7.2 Frequentist Limits in Case of Discrete Variables .. . . . . . 217
10.7.3 Feldman–Cousins Unified Approach . . . . . . . . . . . . . . . . . . . 218
10.8 Modified Frequentist Approach: The CLs Method .. . . . . . . . . . . . . . . 221
10.9 Presenting Upper Limits: The Brazil Plot . . . . .. . . . . . . . . . . . . . . . . . . . 225
10.10 Nuisance Parameters and Systematic Uncertainties .. . . . . . . . . . . . . . 226
10.10.1 Nuisance Parameters with the Bayesian Approach . . . . 226
10.10.2 Hybrid Treatment of Nuisance Parameters . . . . . . . . . . . . . 227
10.10.3 Event Counting Uncertainties . . . . . . .. . . . . . . . . . . . . . . . . . . . 227

10.11 Upper Limits Using the Profile Likelihood .. . .. . . . . . . . . . . . . . . . . . . . 228


10.12 Variations of the Profile-Likelihood Test Statistic . . . . . . . . . . . . . . . . . 229
10.12.1 Test Statistic for Positive Signal Strength . . . . . . . . . . . . . . 230
10.12.2 Test Statistic for Discovery .. . . . . . . . .. . . . . . . . . . . . . . . . . . . . 230
10.12.3 Test Statistic for Upper Limits . . . . . .. . . . . . . . . . . . . . . . . . . . 230
10.12.4 Higgs Test Statistic . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 231
10.12.5 Asymptotic Approximations . . . . . . . .. . . . . . . . . . . . . . . . . . . . 231
10.12.6 Asimov Datasets . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 231
10.13 The Look Elsewhere Effect.. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 242
10.13.1 Trial Factors .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 243
10.13.2 Look Elsewhere Effect in More Dimensions . . . . . . . . . . . 246
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 248

Index . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 251
List of Tables

Table 1.1 Possible values of the sum of two dice rolls with
all possible pair combinations and corresponding
probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 5
Table 2.1 Probabilities corresponding to Z one-dimensional
intervals and two-dimensional contours for different
values of Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 58
Table 3.1 Assessing evidence with Bayes factors according to the
scale proposed in [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 74
Table 3.2 Jeffreys’ priors corresponding to the parameters of some
of the most frequently used PDFs . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 76
Table 6.1 Properties of different indicators of a measurement’s
importance within a BLUE combination . . . . . .. . . . . . . . . . . . . . . . . . . . 137
Table 10.1 Significance expressed as ‘Z’ and corresponding p-value
in a number of typical cases . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 208
Table 10.2 Upper limits in presence of negligible background
evaluated under the Bayesian approach for different
numbers of observed events n . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 213
Table 10.3 Upper and lower limits in presence of negligible
background (b = 0) with the Feldman–Cousins approach . . . . . . . 219

List of Examples

Example 1.1    Two Dice Roll Probability   5
Example 1.2    Combination of Detector Efficiencies   10
Example 1.3    Application to the Sum of Dice Rolls   16
Example 2.4    Strip Detectors   31
Example 2.5    Poisson Distributions as Limit of Binomial Distribution from a Uniform Process   35
Example 2.6    Exponential Distributions from Uniformly Distributed Process   37
Example 2.7    Uncorrelated Variables May not Be Independent   51
Example 3.8    An Epidemiology Example   61
Example 3.9    Particle Identification and Purity of a Sample   63
Example 3.10   Extreme Cases of Prior Beliefs   66
Example 3.11   Posterior for a Poisson Rate   71
Example 3.12   Posterior for Exponential Distribution   76
Example 4.13   Transition From Regular to 'Unpredictable' Sequences   82
Example 4.14   Extraction of an Exponential Random Variable   87
Example 4.15   Extraction of a Uniform Point on a Sphere   87
Example 4.16   Combining Different Monte Carlo Techniques   92
Example 5.17   A Very Simple Estimator in a Gaussian Case   101
Example 5.18   Estimators with Variance Below the Cramér–Rao Bound Are not Consistent   103
Example 5.19   Maximum Likelihood Estimate for an Exponential Distribution   112
Example 5.20   Bias of the Maximum Likelihood Estimate of a Gaussian Variance   113
Example 6.21   Reusing Multiple Times the Same Measurement Does not Improve a Combination   135
Example 7.22   Neyman Belt: Gaussian Case   146
Example 7.23   Application of the Clopper–Pearson Method   148
Example 9.24   Comparison of Multivariate Discriminators   199
Example 10.25  p-Value for a Poissonian Counting   206
Example 10.26  Can Frequentist and Bayesian Upper Limits Be 'Unified'?   221
Example 10.27  Bump Hunting with the L_{s+b}/L_b Test Statistic   232
Example 10.28  Adding Systematic Uncertainty with the L_{s+b}/L_b Approach   236
Example 10.29  Bump Hunting with Profile Likelihood   239
Example 10.30  Simplified Look Elsewhere Calculation   245
Chapter 1
Probability Theory

1.1 Why Probability Matters to a Physicist

The main goal of an experimental physicist is to measure quantities of interest, possibly with the best precision. In the luckiest cases, measurements lead to the discovery of new physical phenomena that may represent a breakthrough in the knowledge of Nature. Measurements and, more generally, observations of Nature's behavior are performed with experiments that record quantitative information about the physical phenomenon under observation.
Experiments often introduce randomness into recorded data because of fluctu-
ations in the detector response, due to effects like resolution, efficiency, and so
on. Moreover, natural phenomena contain in many cases intrinsic randomness. For
instance, quantum mechanics may lead to different possible measurement outcomes
of an observable quantity. The distribution (formally defined in Sect. 1.8) of the
possible outcomes can be predicted in terms of probability, given by the square of
the quantum amplitude of the process.
Experiments at particle accelerators record a large number of collision events
containing particles produced by different interaction processes. For each collision,
different quantities are measured, like particle positions when crossing detector
elements, deposited energy, crossing time of a particle through a detector, etc.,
which, in addition to the intrinsic randomness due to quantum mechanics, are also
affected by fluctuations due to detector response.
A measurement of a physical quantity θ can be performed using a data sample consisting of a set of measured quantities whose distribution depends on the value of θ. By comparing the data sample with a prediction that takes into account both theory and detector effects, questions about the nature of the data, such as the following, can be addressed:
• What is the value of θ?
(A typical quantity of interest in particle physics may be a cross section, branching ratio, particle mass, lifetime, etc.)

© Springer International Publishing AG 2017
L. Lista, Statistical Methods for Data Analysis in Particle Physics,
Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0_1

• Is a signal due to the Higgs boson present in the data recorded at the Large
Hadron Collider?
(We know that LHC data gave a positive answer!)
• Is there a signal due to dark matter particles in the present experimental data?
(At the moment experiments give no evidence; it is possible to exclude a set of
theory parameters that are incompatible with the present data.)
For the latter two questions, θ may represent a parameter of the theory that describes the Higgs boson or dark matter, respectively. A measurement is performed by applying
probability theory in order to model the expected distribution of the data in all
possible assumed scenarios. Then, probability theory must be used, together with
data, to extract information about the observed processes and to address the relevant
questions, in the sense discussed above.

1.2 The Concept of Probability

Many processes in nature have uncertain outcomes, in the sense that their result cannot be predicted in advance. Probability is a measure of how favored one of the possible outcomes of such a random process is, compared with any of the other possible outcomes. A possible outcome of an experiment is also called an event.

1.3 Repeatable and Non-Repeatable Cases

Most experiments in physics can be repeated under the same, or at least very similar, conditions; nonetheless, different outcomes are frequently observed at every repetition. Such experiments are examples of random processes, i.e. processes that can be repeated, to some extent, within some given boundary and initial conditions, but whose outcome is uncertain.
Randomness in repeatable processes may arise because of insufficient information about the intrinsic dynamics, which prevents predicting the outcome, or because of a lack of sufficient accuracy in reproducing the initial and boundary conditions. Some
processes, like the ones ruled by quantum mechanics, have intrinsic randomness
that leads to different possible outcomes, even if the experiment is repeated within
exactly the same conditions.
The result of an experiment may be used to address questions about natural
phenomena, for instance about the knowledge of the value of an unknown physical
quantity, or the existence or not of some new phenomena. Statements that answer
those questions can be assessed by assigning them a probability. Different defini-
tions of probability may apply to cases in which statements refers to repeatable
cases, as the experiments described above, or not, as will be discussed in Sect. 1.4.
Examples of questions about repeatable cases are:
• What is the probability to extract one ace in a deck of cards?

• What is the probability to win a specific lottery (or bingo, or any other game
based on random extractions)?
• What is the probability that a pion is incorrectly identified as a muon in a particle
identification detector?
• What is the probability that a fluctuation of the background in an observed
spectrum can produce a signal with a magnitude at least equal to what has been
observed by my experiment?
Note: this question is different from asking: "what is the probability that no real signal was present, and my observation is due to a background fluctuation?" This latter question refers to a non-repeatable case, because we cannot access multiple Universes, with and without a new physical phenomenon, in which to repeat our measurement!
Questions may refer to unknown facts, rather than repeatable events. Examples
of questions about probability for such cases are:
• About future events:
– What is the probability that tomorrow it will rain in Geneva?
– What is the probability that your favorite team will win next championship?
• About past events:
– What is the probability that dinosaurs went extinct because of an asteroid
collision with Earth?
• More in general, about unknown statements:
– What is the probability that present climate changes are mainly due to human
intervention?
– What is the probability that dark matter is made of weakly-interacting massive
particles heavier than 1 TeV?
Probability concepts that answer the above questions, either for repeatable or for non-repeatable cases, are introduced in the next Section.

1.4 Different Approaches to Probability

There are two main approaches to defining the concept of probability, called frequentist and Bayesian.
• Frequentist probability is defined as the fraction of the number of occurrences of an event of interest over the total number of events in a repeatable experiment, in the limit of a very large number of experiments. Frequentist probability only applies to processes that can be repeated over a reasonably long period of time, but does not apply to unknown statements. For instance, it is meaningful to define the frequentist probability that a particle is detected by a device (if a large number

of particles cross the device, we can count how many are actually detected),
but there is no frequentist meaning of the probability of a particular result in a
football match, or the probability that the mass of an unknown particle is greater
than—say—200 GeV.
• Bayesian probability measures one’s degree of belief that a statement is true.
The quantitative definition of Bayesian probability is based on an extension
of Bayes' theorem, and will be discussed in Sect. 3.2. Bayesian probability applies wherever frequentist probability is meaningful, as well as to a wider variety of cases, such as unknown past, future or present events.
Bayesian probability also applies to values of unknown quantities. For instance,
after some direct and/or indirect experimental measurements, the probability
that an unknown particle’s mass is greater than 200 GeV may have meaning
in the Bayesian sense. Other examples where Bayesian probability applies, but
frequentist probability is meaningless, are the outcome of a future election,
uncertain features of prehistoric extinct species, and so on.
The following Sect. 1.5 will discuss classical probability theory first, as formulated
since the eighteenth century, and will then discuss frequentist probability. Chapter 3
will be entirely devoted to Bayesian probability.

1.5 Classical Probability

A random variable represents the outcome of a repeatable experiment whose result is uncertain. An event consists of the occurrence of a certain condition on the value of the random variable resulting from an experiment. For instance, a coin toss gives a head, or a die roll gives an even value.
In 1814, Pierre-Simon Laplace gave the following definition of probability
referred to as classical probability:
The theory of chance consists in reducing all the events of the same kind to a certain number
of cases equally possible, that is to say, to such as we may be equally undecided about in
regard to their existence, and in determining the number of cases favorable to the event
whose probability is sought. The ratio of this number to that of all the cases possible is the
measure of this probability, which is thus simply a fraction whose numerator is the number
of favorable cases and whose denominator is the number of all the cases possible [1].

The probability P of an event E, which corresponds to one out of a certain number of favorable cases, can be written, according to Laplace, as:

  P(E) = (Number of favorable cases) / (Number of possible cases) .        (1.1)

This approach can be used in practice only for relatively simple problems, since it assumes that all possible cases under consideration are equally probable, which may not always be true in complex situations. Examples of cases where classical probability can be applied are coin tossing, where the two faces of a coin are assumed to have an equal probability of 1/2 each, or die rolling, where each of the six faces of a die¹ has an equal probability of 1/6, and so on.
Starting from simple cases, like coins or dice, more complex models can be built using combinatorial analysis. In the simplest cases, one may proceed by enumeration of all the finite number of possible cases, and again the probability of an event can be evaluated as the number of favorable cases divided by the total number of possible cases of the combinatorial problem.

Example 1.1 Two Dice Roll Probability


An easy case of combinatorial analysis is given by the roll of two dice, taking the sum of the two outcomes as the final result.
The possible number of outcomes is given by the 6 × 6 = 36 different combinations that give a sum ranging from 2 to 12. The possible combinations are enumerated in Table 1.1, and the corresponding probabilities, computed as the number of favorable cases divided by 36, are shown in Fig. 1.1.

Table 1.1 Possible values of the sum of two dice rolls with
all possible pair combinations and corresponding probability
Sum Favorable cases Probability
2 (1, 1) 1/36
3 (1, 2), (2, 1) 1/18
4 (1, 3), (2, 2), (3, 1) 1/12
5 (1, 4), (2, 3), (3, 2), (4, 1) 1/9
6 (1, 5), (2, 4), (3, 3), (4, 2), (5, 1) 5/36
7 (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1) 1/6
8 (2, 6), (3, 5), (4, 4), (5, 3), (6, 2) 5/36
9 (3, 6), (4, 5), (5, 4), (6, 3) 1/9
10 (4, 6), (5, 5), (6, 4) 1/12
11 (5, 6), (6, 5) 1/18
12 (6, 6) 1/36


¹ Many role-playing games use dice with solid shapes other than the cube as well.
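The enumeration in Table 1.1 can be reproduced programmatically. The following sketch (variable names are illustrative, not from the text) counts the favorable cases for each sum and divides by the 36 possible cases, as prescribed by Eq. (1.1):

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Count the favorable cases for each possible sum of two dice.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

# Classical probability: favorable cases divided by the 36 possible cases.
prob = {s: Fraction(n, 36) for s, n in counts.items()}

print(prob[7])             # 1/6, the most probable sum
print(sum(prob.values()))  # 1, i.e. the probabilities are normalized
```

Using exact fractions rather than floating-point numbers reproduces the entries of Table 1.1 without rounding.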

Fig. 1.1 Probability of the sum of two dice rolls, d1 and d2

Many combinatorial problems can be decomposed into all possible elementary events.² An event corresponds to the occurrence of one in a specific set of possible outcomes. For instance, the event "sum of dice = 4" corresponds to the set of possible elementary outcomes {(1, 3), (2, 2), (3, 1)}. Other events (e.g.: "sum is an odd number", or "sum is greater than 5") may be associated with different sets of possible pair combinations. In general, formulating the problem in terms of sets allows one to replace the logical operators "and", "or" and "not" in a sentence by the intersection, union and complement of the corresponding sets of elementary outcomes, respectively.
In more complex combinatorial problems it may be hard in practice to decompose the problem into equally probable elementary cases. One possible approach to such complex cases, for instance when dealing with a realistic detector response, which includes effects like efficiency, resolution, etc., is the use of computer simulation performed with Monte Carlo methods, which will be discussed in Chap. 4.

1.6 Generalization to the Continuum

The generalization of classical probability, introduced in the previous Sect. 1.5 for discrete cases, to continuous random variables cannot be done in an unambiguous way.

² Note that in physics an event is often intended as an elementary event. So, the use of the word event in a text about both physics and statistics may sometimes lead to confusion.

A way to extend the concept of equiprobability that was applied to the six possible outcomes of a die roll to a continuous variable x consists in partitioning its validity range, say [x1, x2], into intervals I1, …, IN, all having the same width of (x2 − x1)/N, for an arbitrarily large number N. The probability distribution of the variable x can be considered uniform if the probability that a random extraction of x falls into each of the N intervals is the same, i.e. it is equal to 1/N.
This definition of uniform probability of x in the interval [x1, x2] clearly changes under reparametrization, i.e. if the variable x is transformed into y = Y(x), and the interval [x1, x2] into [y1, y2] = [Y(x1), Y(x2)], assuming for simplicity that Y is a monotonically increasing function of x. In this case, the transformed intervals J1, …, JN = Y(I1), …, Y(IN) will not in general all have the same width, unless the transformation Y is linear. So, if x has a uniform distribution, y = Y(x) in general does not have a uniform distribution.

1.6.1 Bertrand's Paradox

The arbitrariness in the definition of uniformity of a random extraction becomes evident in a famous paradox, called Bertrand's paradox,³ that can be formulated as follows:
• Consider an equilateral triangle inscribed in a circle. Extract uniformly one of the possible chords of the circle. What is the probability that the length of the extracted chord is larger than the side of the inscribed triangle?
The apparent paradox arises because the uniform extraction of a chord on a circle is not a well-defined process. Below, three possible examples are presented of random extractions that give different results in terms of probability.
1. Let us take the circle's diameter passing through one of the vertices of the triangle; extract uniformly a point on this diameter and take the chord perpendicular to the diameter passing through the extracted point. As is evident from Fig. 1.2 (left plot), the side of the triangle opposite to the chosen vertex cuts the diameter at half the radius from the center, and the chord is longer than the triangle's side only when the extracted point lies in the central half of the diameter. Considering all possible diameters of the circle, which we assume to extract uniformly with respect to the azimuthal angle, all possible chords of the circle are spanned. Hence, the probability in question would result to be P = 1/2.

³ This apparent paradox is due to the French mathematician Joseph Louis François Bertrand (1822–1900).

Fig. 1.2 Illustration of Bertrand's paradox: three different choices of random extraction of a chord in a circle apparently lead to probabilities that the chord is longer than the inscribed triangle's side of 1/2 (left), 1/3 (center) and 1/4 (right), respectively. Red solid lines and blue dashed lines represent chords longer and shorter than the triangle's side, respectively

2. Let us take, instead, one of the chords starting from the top vertex of the triangle (Fig. 1.2, center plot) and extract uniformly an angle with respect to the tangent to the circle passing through that vertex. The chord is longer than the triangle's side when it intersects the basis of the triangle, and it is shorter otherwise. This occurs in one-third of the cases, since the angles of an equilateral triangle measure π/3 each, and the chords span an angle of π. By uniformly extracting a possible point on the circumference of the circle, one would conclude that P = 1/3, which is different from P = 1/2, as was derived in the first case.
3. Let us extract uniformly a point inside the circle and construct the chord passing through that point perpendicular to the radius that passes through the same point (Fig. 1.2, right plot). With this extraction, it is possible to conclude that P = 1/4, since the chords whose middle point is contained inside (outside) the circle inscribed in the triangle have a length longer (shorter) than the triangle's side, and the ratio of the area of the circle inscribed in the triangle to the area of the circle that circumscribes the triangle is equal to 1/4. P = 1/4 is different from the values determined in both cases above.
The paradox is clearly only apparent, because the process of uniform random extraction of a chord in a circle is not uniquely defined, as already discussed in Sect. 1.6.
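The three extraction procedures can also be compared numerically. The following Monte Carlo sketch (a minimal illustration; function names and the sample size are invented) draws chords according to each prescription and estimates the probability that the chord exceeds the side of the inscribed triangle:

```python
import math
import random

random.seed(42)
R = 1.0
SIDE = math.sqrt(3) * R   # side of the equilateral triangle inscribed in the circle
N = 200_000

def chord_len_method1():
    # Uniform point on a diameter; chord perpendicular to the diameter.
    d = random.uniform(-R, R)          # signed distance from the center
    return 2 * math.sqrt(R * R - d * d)

def chord_len_method2():
    # One endpoint fixed on the circumference; second endpoint uniform on it.
    phi = random.uniform(0, 2 * math.pi)
    return 2 * R * math.sin(phi / 2)

def chord_len_method3():
    # Uniform midpoint inside the circle; chord perpendicular to the radius.
    while True:
        x, y = random.uniform(-R, R), random.uniform(-R, R)
        if x * x + y * y <= R * R:     # accept only points inside the circle
            return 2 * math.sqrt(R * R - (x * x + y * y))

for f, expected in [(chord_len_method1, 1 / 2),
                    (chord_len_method2, 1 / 3),
                    (chord_len_method3, 1 / 4)]:
    p = sum(f() > SIDE for _ in range(N)) / N
    print(f"{f.__name__}: P = {p:.3f} (expected {expected:.3f})")
```

Each procedure is a perfectly legitimate "uniform" extraction, yet the three estimated probabilities converge to 1/2, 1/3 and 1/4 respectively, which is exactly the content of the paradox.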

1.7 Axiomatic Probability Definition

An axiomatic definition of probability, founded on measure theory and valid both in the discrete and in the continuous case, is due to Kolmogorov [2]. Let us consider a measure space (Ω, F ⊆ 2^Ω, P), where P is a function that maps elements of F to real numbers. F is a subset of the power set 2^Ω of Ω, i.e. it contains subsets of Ω. Ω is called the sample space and F is called the event space. P is a probability measure if

the following properties are satisfied:

1. P(E) ≥ 0 for all E ∈ F,
2. P(Ω) = 1,
3. for all (E1, …, EN) ∈ F^N such that Ei ∩ Ej = ∅ for i ≠ j:

     P( ⋃_{i=1}^{N} Ei ) = Σ_{i=1}^{N} P(Ei) .

Both Bayesian probability, defined in Chap. 3, and frequentist probability obey


Kolmogorov’s axioms.

1.8 Probability Distributions

Consider a random variable x which has possible values x1, …, xN, each occurring with a probability P({xi}) = P(xi), i = 1, …, N. The function that associates the probability P(xi) to each possible value xi of x is called probability distribution.
The probability of an event E corresponding to a set of distinct possible elementary events {x_{E1}, …, x_{EK}}, where x_{Ej} ∈ Ω = {x1, …, xN} for all j = 1, …, K, is, according to the third Kolmogorov axiom, equal to:

  P( ⋃_{j=1}^{K} {x_{Ej}} ) = P({x_{E1}, …, x_{EK}}) = P(E) = Σ_{j=1}^{K} P(x_{Ej}) .        (1.2)

From the second Kolmogorov axiom, the probability of the event Ω corresponding to the set of all possible outcomes, x1, …, xN, must be equal to one. Equivalently, using Eq. (1.2), the sum of the probabilities of all possible outcomes is equal to one:

  Σ_{i=1}^{N} P(xi) = 1 .        (1.3)

This property of probability distributions is called the normalization condition.

1.9 Conditional Probability

Given two events A and B, the conditional probability represents the probability of the event A given the condition that the event B has occurred, and is given by:

  P(A | B) = P(A ∩ B) / P(B) .        (1.4)

Fig. 1.3 If two events, A and B, are represented as sets, the conditional probability P(A | B) is equal to the area of the intersection, A ∩ B, divided by the area of B

The conditional probability can be visualized in Fig. 1.3: while the probability of A, P(A), corresponds to the area of the set A relative to the area of the whole sample space Ω, which is equal to one, the conditional probability, P(A | B), corresponds to the area of the intersection of A and B relative to the area of the set B.

1.10 Independent Events

An event A is said to be independent of an event B if the conditional probability of A, given B, is equal to the probability of A. In other words, the occurrence of B does not change the probability of A:

  P(A | B) = P(A) .        (1.5)

Given the definition of conditional probability in Eq. (1.4), A is independent of B if, and only if, the probability of the simultaneous occurrence of both events is equal to the product of their probabilities:

  P("A and B") = P(A ∩ B) = P(A) P(B) .        (1.6)

From the symmetry of Eq. (1.6), if A is independent of B, then B is also independent of A, and we will say that A and B are independent events.

Example 1.2 Combination of Detector Efficiencies


Consider an experimental apparatus made of two detectors, A and B, and a particle traversing both. Each detector will produce a signal when crossed by a particle with probability εA and εB, respectively. εA and εB are called the efficiencies of the detectors A and B.
If the signals are produced independently in the two detectors, the probability εAB that a particle gives a signal in both detectors, according to Eq. (1.6),

is equal to the product of εA and εB:

  εAB = εA εB .        (1.7)

This result clearly does not hold if there are causes of simultaneous inefficiency of both detectors, e.g. a fraction of time in which the electronics systems of both A and B are simultaneously switched off for short periods, or a geometrical overlap of inactive regions, and so on.
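The factorization of Eq. (1.7) can be checked with a small simulation sketch (the efficiency values and the sample size below are invented for illustration, assuming the two detectors respond independently):

```python
import random

random.seed(1)
eps_A, eps_B = 0.9, 0.8   # hypothetical single-detector efficiencies
N = 100_000

# Each detector fires independently with its own efficiency;
# count the particles that give a signal in both.
n_both = sum((random.random() < eps_A) and (random.random() < eps_B)
             for _ in range(N))

eps_AB = n_both / N
print(eps_AB)             # close to eps_A * eps_B = 0.72
```

Introducing a correlated inefficiency (for instance, switching both detectors off for the same fraction of events) would break the agreement with the product εA εB, as noted above.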

1.11 Law of Total Probability

Let us consider N events corresponding to the sets E1, …, EN, which are subsets of another set E0 included in the sample space Ω. Assume that the set of all Ei is a partition of E0, as visualized in Fig. 1.4, i.e.:

  Ei ∩ Ej = ∅  ∀ i ≠ j   and   ⋃_{i=1}^{N} Ei = E0 .        (1.8)

Given Kolmogorov's third axiom (Sect. 1.7), the probability corresponding to E0 is equal to the sum of the probabilities of Ei:

  P(E0) = Σ_{i=1}^{N} P(Ei) .        (1.9)

Given a partition A1, …, AN of the sample space Ω made of disjoint sets, i.e. such that Ai ∩ Aj = ∅ and Σ_{i=1}^{N} P(Ai) = 1, the following sets can be built, as visualized in Fig. 1.5:

  Ei = E0 ∩ Ai .        (1.10)

Fig. 1.4 Visual example of partition of a set E0 into the union of the disjoint sets E1, E2, E3, …, EN


Fig. 1.5 Visualization of the law of total probability. The sample space Ω is partitioned into the sets A1, A2, A3, …, AN, and the set E0 is partitioned into E1, E2, E3, …, EN, where each Ei is E0 ∩ Ai and has probability equal to P(E0 | Ai) P(Ai)

Each set Ei, using Eq. (1.4), corresponds to a probability:

  P(Ei) = P(E0 ∩ Ai) = P(E0 | Ai) P(Ai) .        (1.11)

In this case, Eq. (1.9) can be rewritten as:

  P(E0) = Σ_{i=1}^{N} P(E0 | Ai) P(Ai) .        (1.12)

This decomposition is called the law of total probability and can be interpreted as a weighted average (see Sect. 6.3) of the conditional probabilities P(E0 | Ai) with weights wi = P(Ai).
Equation (1.12) has application in the computation of Bayesian probability, as discussed in Chap. 3.
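As a numerical illustration of Eq. (1.12), the following sketch evaluates the law of total probability for a three-set partition (the partition and the probability values are invented for this example):

```python
from fractions import Fraction as F

# A partition of the sample space into three disjoint sets A1, A2, A3 ...
P_A = [F(1, 2), F(1, 3), F(1, 6)]
# ... and the conditional probabilities P(E0 | Ai).
P_E0_given_A = [F(1, 10), F(1, 5), F(3, 10)]

# Law of total probability: P(E0) = sum_i P(E0 | Ai) P(Ai).
P_E0 = sum(c * p for c, p in zip(P_E0_given_A, P_A))
print(P_E0)   # 1/20 + 1/15 + 1/20 = 1/6
```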

1.12 Average, Variance and Covariance

In this Section, a number of useful quantities related to probability distributions of random variables are defined.
Consider a discrete random variable x which can assume N possible values, x1, …, xN, with probability distribution P. The average value or expected value

of x is defined as:

  ⟨x⟩ = Σ_{i=1}^{N} xi P(xi) .        (1.13)

Sometimes the notation E[x] or x̄ is also used in the literature to indicate the average value.
More generally, given a function g of x, the average value of g is:

  ⟨g(x)⟩ = Σ_{i=1}^{N} g(xi) P(xi) .        (1.14)

The variance of x is defined as:

  V[x] = Σ_{i=1}^{N} (xi − ⟨x⟩)² P(xi) ,        (1.15)

and the standard deviation is the square root of the variance:

  σx = √V[x] = √( Σ_{i=1}^{N} (xi − ⟨x⟩)² P(xi) ) .        (1.16)

The root mean square, abbreviated as r.m.s., is defined as⁴:

  x_rms = √( Σ_{i=1}^{N} xi² P(xi) ) = √⟨x²⟩ .        (1.17)

The variance in Eq. (1.15) can also be written as:

  V[x] = ⟨(x − ⟨x⟩)²⟩ ,        (1.18)

and is also equal to:

  V[x] = ⟨x²⟩ − ⟨x⟩² .        (1.19)

⁴ In some physics literature, the standard deviation is sometimes also called root mean square or r.m.s. This may cause some confusion.

The average of the sum of two random variables x and y can easily be demonstrated to be equal to the sum of their averages:

  ⟨x + y⟩ = ⟨x⟩ + ⟨y⟩ ,        (1.20)

and the product of a constant a times x has average:

  ⟨a x⟩ = a ⟨x⟩        (1.21)

and variance:

  V[a x] = a² V[x] .        (1.22)

The value of x that corresponds to the largest probability is called the mode. The mode may not be unique, and in that case the distribution of x is said to be multimodal (bimodal in the case of two maxima).
The median is the value x̃ that separates the range of possible values of x into two sets of equal probability. The median can also be defined for an ordered sequence of values, say {x1, …, xN}, as:

  x̃ = x_{(N+1)/2}                     if N is odd ,
  x̃ = (x_{N/2} + x_{N/2+1}) / 2       if N is even .        (1.23)

The notation med[x] is also sometimes used instead of x̃.
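Equation (1.23) corresponds to the usual sample median of an ordered sequence; a minimal sketch (the function name is illustrative):

```python
def median(values):
    """Median of a sequence of values according to Eq. (1.23)."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]        # x_{(N+1)/2}, with 0-based indexing
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

print(median([3, 1, 2]))      # 2
print(median([4, 1, 3, 2]))   # 2.5
```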


Given two random variables x and y, their covariance is defined as:

  cov(x, y) = ⟨x y⟩ − ⟨x⟩ ⟨y⟩ ,        (1.24)

and the correlation coefficient of x and y is defined as:

  ρxy = cov(x, y) / (σx σy) .        (1.25)

It can be demonstrated that:

  V[x + y] = V[x] + V[y] + 2 cov(x, y) .        (1.26)

From Eq. (1.26), the variance of the sum of uncorrelated variables, i.e. variables which have null covariance, is equal to the sum of their variances.
Given N random variables, x1, …, xN, the symmetric matrix Cij = cov(xi, xj) is called the covariance matrix. The diagonal terms Cii are always positive or null, and correspond to the variances of the N variables. A covariance matrix is diagonal if and only if all variables are uncorrelated.
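The identities in Eqs. (1.24) and (1.26) can be verified directly on a small discrete joint distribution (the probability values below are invented for illustration):

```python
# Joint distribution P(x, y) on a few points (probabilities must sum to 1).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def mean(f):
    # Expected value of f(x, y) over the joint distribution, as in Eq. (1.14).
    return sum(p * f(x, y) for (x, y), p in joint.items())

mx, my = mean(lambda x, y: x), mean(lambda x, y: y)
vx = mean(lambda x, y: x * x) - mx ** 2
vy = mean(lambda x, y: y * y) - my ** 2
cov = mean(lambda x, y: x * y) - mx * my           # Eq. (1.24)

# Eq. (1.26): V[x + y] = V[x] + V[y] + 2 cov(x, y)
v_sum = mean(lambda x, y: (x + y) ** 2) - (mx + my) ** 2
print(abs(v_sum - (vx + vy + 2 * cov)) < 1e-12)    # True
```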

Another useful quantity is the skewness, which measures the asymmetry of a distribution and is defined as:

  γ1[x] = ⟨ ((x − ⟨x⟩)/σx)³ ⟩ = γ[x] / σx³ ,        (1.27)

where the quantity:

  γ[x] = ⟨x³⟩ − 3 ⟨x⟩ ⟨x²⟩ + 2 ⟨x⟩³        (1.28)

is called the unnormalized skewness. Symmetric distributions have skewness equal to zero, while negative (positive) skewness corresponds to a mode greater than (less than) the average value ⟨x⟩. It is possible to demonstrate that:

  γ[x + y] = γ[x] + γ[y] ,        (1.29)

and:

  γ[a x] = a³ γ[x] .        (1.30)

The kurtosis, finally, is defined as:

  β2[x] = ⟨ ((x − ⟨x⟩)/σx)⁴ ⟩ .        (1.31)

Usually the kurtosis coefficient, γ2, is defined as:

  γ2[x] = β2[x] − 3        (1.32)

in order to have γ2 = 0 for a normal distribution (defined in Sect. 2.8). γ2 is also called excess.

1.13 Transformations of Variables

Given a random variable x, let us consider the variable y transformed via the function y = Y(x). If x can assume the values {x1, …, xN}, then y can assume one of the values {y1, …, yM} = {Y(x1), …, Y(xN)}. N is equal to M only if all the values Y(x1), …, Y(xN) are different from each other. The probability corresponding to each value yj is given by the sum of the probabilities of all values xi that are transformed into yj by Y:

  P(yj) = Σ_{i: Y(xi) = yj} P(xi) .        (1.33)

In case, for a given j, there is a single value of the index i for which Y(xi) = yj, we have P(yj) = P(xi).
The generalization to more variables is straightforward: assume that a variable z is determined from two random variables x and y as z = Z(x, y); if {z1, …, zM} is the set of all possible values of z, the probability corresponding to each zk is given by:

  P(zk) = Σ_{i, j: Z(xi, yj) = zk} P(xi, yj) .        (1.34)

This expression is consistent with the results obtained for the sum of two dice considered in Sect. 1.5, as will be seen in Example 1.3.

Example 1.3 Application to the Sum of Dice Rolls


Let us compute the probabilities that the sum of two dice rolls is even or odd.
The probability of the sum of two dice can be evaluated using Eq. (1.34), where x and y are the outcomes of each die, and z = x + y.
All cases in Table 1.1 can be considered, and the probabilities for all even and odd values can be added. Even values and their probabilities are: 2: 1/36, 4: 1/12, 6: 5/36, 8: 5/36, 10: 1/12, 12: 1/36. So, the probability of an even result is (1 + 3 + 5 + 5 + 3 + 1)/36 = 1/2, and the probability of an odd result is 1 − 1/2 = 1/2.
Another way to proceed is the following: each die has probability 1/2 to give an even or an odd result. The sum of two dice is even if either two even or two odd results are added. Each case has probability 1/2 × 1/2 = 1/4, since the two dice extractions are independent. Hence, the probability to have either two odd or two even results is 1/4 + 1/4 = 1/2, since the two cases have no intersection.
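The same result follows from a direct application of Eq. (1.34) with z = x + y; a minimal sketch (variable names are illustrative):

```python
from fractions import Fraction
from itertools import product

# Joint probability of two independent dice: P(x, y) = 1/36 for every pair.
P_xy = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

# Eq. (1.34): P(z_k) is the sum of P(x_i, y_j) over pairs with x_i + y_j = z_k.
P_z = {}
for (x, y), p in P_xy.items():
    P_z[x + y] = P_z.get(x + y, 0) + p

p_even = sum(p for z, p in P_z.items() if z % 2 == 0)
print(p_even)   # 1/2
```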

1.14 The Bernoulli Process

Let us consider a basket containing a number n of balls, each having one of two possible colors, say red and white. Assume we know the number r of red balls in the basket; hence the number of white balls must be n − r (Fig. 1.6). The probability to randomly extract a red ball from the basket is p = r/n, according to Eq. (1.1).

Fig. 1.6 A set of n = 10 balls, of which r = 3 are red, considered in a Bernoulli process. The probability to randomly extract a red ball from the shown set is p = r/n = 3/10 = 30%

A variable x equal to the number of extracted red balls is called a Bernoulli
variable and can only assume the values 0 or 1. The probability distribution
of x is simply given by P(1) = p and P(0) = 1 − p. The average of a Bernoulli
variable is easy to compute:

\langle x \rangle = P(0) \cdot 0 + P(1) \cdot 1 = P(1) = p .   (1.35)

Similarly, the average of x² is:

\langle x^2 \rangle = P(0) \cdot 0^2 + P(1) \cdot 1^2 = P(1) = p ,   (1.36)

hence, the variance of x, using Eq. (1.19), is:

V[x] = \langle x^2 \rangle - \langle x \rangle^2 = p\,(1 - p) .   (1.37)
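Equations (1.35)–(1.37) can be verified with a few lines of Python; exact fractions are used so the result is not affected by rounding, and p = 3/10 mirrors the case of Fig. 1.6:

```python
from fractions import Fraction

p = Fraction(3, 10)  # the r/n = 3/10 case of Fig. 1.6

mean = (1 - p) * 0 + p * 1     # Eq. (1.35)
mean_sq = (1 - p) * 0 + p * 1  # Eq. (1.36): 0^2 = 0 and 1^2 = 1
var = mean_sq - mean**2        # Eq. (1.37)
print(mean, var)  # 3/10 and 21/100
```

The variance comes out as p(1 − p) = (3/10)(7/10) = 21/100, as expected.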

1.15 The Binomial Process

A binomial process consists of a given number N of independent Bernoulli
extractions, each with probability p. This could be implemented, for instance, by
randomly extracting a ball from a basket containing a fraction p of red balls; after
each extraction, the extracted ball is placed back in the basket, and the extraction is
repeated N times. Figure 1.7 shows the possible outcomes of a binomial process as
subsequent random extractions of a single ball.


Fig. 1.7 Building a binomial process as subsequent random extractions of a single red or white ball (Bernoulli process). The tree shows all the possible combinations at each extraction step. Each branching has a corresponding probability equal to p or 1 − p for a red or white ball, respectively. The number of paths corresponding to each possible combination is shown in parentheses and is equal to the binomial coefficient in Eq. (1.38)

The number n of positive outcomes (red balls) is called a binomial variable and
is equal to the sum of the N Bernoulli random variables. Its probability distribution
can be simply determined from Fig. 1.7, considering how many red and white balls
were present in each extraction, assigning each extraction a probability p or 1 − p,
respectively, and considering the number of possible paths leading to a given
combination of red/white extractions.
The latter term is called the binomial coefficient and can be demonstrated by
recursion (Fig. 1.8) to be equal to [3]:

\binom{N}{n} = \frac{N!}{n!\,(N - n)!} .   (1.38)

The probability distribution of a binomial variable n for given N and p can be
obtained considering that the N extractions are independent; hence the corresponding
probability terms (p for a red extraction, 1 − p for a white extraction) can be
multiplied, according to Eq. (1.6). This product has to be multiplied by the binomial
coefficient from Eq. (1.38) in order to take into account all possible extraction paths
leading to the same outcome.

Fig. 1.8 The Yang Hui triangle [4], showing the construction of binomial coefficients

The probability to obtain n red and N − n white extractions, called the binomial distribution, can be written, in this way, as:

P(n; N, p) = \frac{N!}{n!\,(N - n)!}\, p^n (1 - p)^{N - n} .   (1.39)

Binomial distributions are shown in Fig. 1.9 for N = 15 and for p = 0.2, 0.5 and
0.8.
Since a binomial variable n is equal to the sum of N independent Bernoulli
variables with probability p, the average and variance of n are equal to N times the
average and variance of a Bernoulli variable, respectively (Eqs. (1.35) and (1.37)):

\langle n \rangle = N p ,   (1.40)
V[n] = N p\,(1 - p) .   (1.41)

These formulae can also be obtained directly from Eq. (1.39).
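As a numerical cross-check, the sketch below evaluates the binomial probabilities of Eq. (1.39) for N = 15 and p = 0.2 (one of the cases of Fig. 1.9) and verifies the normalization and Eqs. (1.40) and (1.41):

```python
from math import comb

def binom_pmf(n, N, p):
    # Eq. (1.39): binomial coefficient times p^n (1-p)^(N-n)
    return comb(N, n) * p**n * (1 - p)**(N - n)

N, p = 15, 0.2  # one of the cases shown in Fig. 1.9
probs = [binom_pmf(n, N, p) for n in range(N + 1)]

total = sum(probs)
mean = sum(n * P for n, P in enumerate(probs))
var = sum(n * n * P for n, P in enumerate(probs)) - mean**2
print(total, mean, var)  # 1, N*p = 3, N*p*(1-p) = 2.4
```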




Fig. 1.9 Binomial distributions for N D 15 and for p D 0:2, 0.5 and 0.8

1.16 Multinomial Distribution

The binomial distribution introduced in Sect. 1.15 can be generalized to the case in
which, out of N extractions, there are more than two outcome categories (success
and failure, in the binomial case). Consider k categories (k = 2 for the binomial
case), and let the possible numbers of outcomes be equal to n_1, \cdots, n_k for each of the
k categories, with \sum_{i=1}^{k} n_i = N. Each category has a probability p_i, i = 1, \cdots, k,
of individual extraction, respectively, with \sum_{i=1}^{k} p_i = 1. The joint distribution of
n_1, \cdots, n_k is given by:

P(n_1, \cdots, n_k; N, p_1, \cdots, p_k) = \frac{N!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k} ,   (1.42)

and is called the multinomial distribution. Equation (1.39) is equivalent to Eq. (1.42) for
k = 2, n = n_1, n_2 = N − n_1, p_1 = p and p_2 = 1 − p.
The average values of the multinomial variables n_i are:

\langle n_i \rangle = N p_i ,   (1.43)

and their variances are:

Var[n_i] = N p_i\,(1 - p_i) .   (1.44)

The multinomial variables n_i have negative correlation, and their covariance is, for
i ≠ j:

Cov(n_i, n_j) = -N p_i\, p_j .   (1.45)

For a binomial distribution, Eq. (1.45) leads to the obvious conclusion that n_1 = n
and n_2 = N − n are 100% anticorrelated.
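Equations (1.43) and (1.45) can be checked with a small Monte Carlo; the sample size, category probabilities and seed below are arbitrary choices for illustration:

```python
import random

random.seed(1)
N = 20               # extractions per experiment
p = [0.5, 0.3, 0.2]  # category probabilities (arbitrary)
trials = 50_000

# Simulate multinomial extractions and count outcomes per category.
counts = []
for _ in range(trials):
    draws = random.choices(range(3), weights=p, k=N)
    counts.append([draws.count(i) for i in range(3)])

mean0 = sum(c[0] for c in counts) / trials
mean1 = sum(c[1] for c in counts) / trials
cov01 = sum(c[0] * c[1] for c in counts) / trials - mean0 * mean1

print(mean0)  # close to N*p_0 = 10, Eq. (1.43)
print(cov01)  # close to -N*p_0*p_1 = -3, Eq. (1.45)
```

The negative covariance reflects the constraint that the counts must add up to N: a fluctuation upwards in one category forces the others downwards.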

1.17 The Law of Large Numbers

Assume an experiment whose outcome is a random variable x with a given
probability distribution is repeated N times. The average of all results is given by:

\bar{x} = \frac{x_1 + \cdots + x_N}{N} .   (1.46)

\bar{x} is itself a random variable, and its expected value is, from Eq. (1.20), equal to
the expected value of x. The distribution of \bar{x}, in general, has a smaller range of
fluctuation than the variable x, and central values of \bar{x} tend to be more probable. This
can be demonstrated, using classical probability and combinatorial analysis, in the
simplest cases. A case with N = 2 is, for instance, the distribution of the sum of
two dice d_1 + d_2 in Fig. 1.1, where \bar{x} is just given by (d_1 + d_2)/2. The distribution
has the largest probability value for (d_1 + d_2)/2 = 3.5, which is the expected value
of a single die roll:

\langle x \rangle = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5 .   (1.47)

Repeating the combinatorial exercise for the average of three or more dice gives
even more 'peaked' distributions.
In general, it is possible to demonstrate that, under some conditions on the
distribution of x, as N increases, a smaller probability corresponds to most of the
possible values of \bar{x}, except the ones very close to the expected average \langle \bar{x} \rangle = \langle x \rangle.
The probability distribution of \bar{x} becomes a narrow peak around the value \langle x \rangle, and
the interval of values that corresponds to a large fraction of the total probability (we
could choose, say, 90% or 95%) becomes smaller. Eventually, for N → ∞, the
distribution becomes a Dirac delta centered at \langle x \rangle.
This convergence is called the law of large numbers and can be illustrated in a
simulated experiment consisting of repeated dice rolls, as shown in Fig. 1.10, where
\bar{x} is plotted as a function of N for two independent random extractions. Larger values
of N correspond to smaller fluctuations of the result and to a visible convergence
towards the value of 3.5. If we could ideally increase the total number of
trials N to infinity, the average value \bar{x} would no longer be a random variable but would take
a single possible value, equal to \langle x \rangle = 3.5.
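The convergence illustrated in Fig. 1.10 is easy to reproduce. The sketch below compares the average of 10 die rolls with the average of 100,000 rolls; the seed is an arbitrary choice made only for reproducibility:

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility

def average_of_rolls(n_rolls):
    # average of n_rolls simulated rolls of a fair die
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

short_run = average_of_rolls(10)
long_run = average_of_rolls(100_000)
print(short_run, long_run)  # the long run converges towards 3.5
```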


Fig. 1.10 An illustration of the law of large numbers using a computer simulation of die rolls. The average of the first N out of 1000 random extractions is reported as a function of N. The 1000 extractions have been repeated twice (red and blue lines) with independent random extractions

The law of large numbers has been empirically verified for the vast majority
of random experiments and has a broad range of validity.

1.18 Frequentist Definition of Probability

The frequentist definition of the probability P(E) of an event E is formulated with the
following limit:

P(E) = p \quad \text{if} \quad \forall \varepsilon \;\; \lim_{N \to \infty} P\left( \left| \frac{N(E)}{N} - p \right| < \varepsilon \right) = 1 .   (1.48)

The limit is intended, in this case, as convergence in probability, given by the law
of large numbers. The limit only rigorously holds in the non-realizable case of an
infinite number of experiments. Rigorously speaking, the frequentist
probability in Eq. (1.48) is defined in terms of another probability, which could
introduce conceptual problems. F. James et al. report the following sentence:

[...] this definition is not very appealing to a mathematician, since it is based on
experimentation, and, in fact, implies unrealizable experiments (N → ∞) [5].

In practice, experiments are reproducible only over a finite span of time (on planet
Earth, for instance, only as long as the Sun and the Solar System continue to exist),
so, for the practical purposes of applications in physics, both the law of large
numbers and the frequentist definition of probability can be considered, beyond any
possible exact mathematical meaning, as pragmatic definitions. They describe, to a
very good level of approximation, the concrete situations of the vast majority of the
cases we are interested in as far as experimental physics is concerned.
Some Bayesian statisticians express very strong concerns about frequentist
probability (see for instance [6]). In this book, we will not enter this kind of debate.
Nonetheless, we will try to remark on the limitations of both the frequentist and Bayesian
approaches, whenever relevant.

References

1. Laplace, P.: Essai Philosophique sur les Probabilités, 3rd edn. Courcier Imprimeur, Paris (1816)
2. Kolmogorov, A.: Foundations of the Theory of Probability. Chelsea, New York (1956)
3. The coefficients present in the binomial distribution are the same that appear in the expansion of a binomial raised to the nth power, (a + b)^n. A simple iterative way to compute those coefficients is known as Pascal's triangle. In different countries this triangle is named after different authors, e.g.: Tartaglia's triangle in Italy, Yang Hui's triangle in China, and so on. In particular, the following publications of the triangle are present in the literature:
• India: published in the tenth century, referring to the work of Pingala, dating back to the fifth–second century BC
• Persia: Al-Karaji (953–1029) and Omar Khayyám (1048–1131)
• China: Yang Hui (1238–1298); see Fig. 1.8
• Germany: Petrus Apianus (1495–1552)
• Italy: Nicolò Fontana Tartaglia (1545)
• France: Blaise Pascal (1655)
4. Yang Hui (杨辉) triangle as published by Zhu Shijie (朱世杰) in Siyuan yujian (四元玉鉴, Jade Mirror of the Four Unknowns, 1303). Public domain image.
5. Eadie, W., Drijard, D., James, F., Roos, M., Sadoulet, B.: Statistical Methods in Experimental Physics. North Holland, Amsterdam (1971)
6. D'Agostini, G.: Bayesian Reasoning in Data Analysis: A Critical Introduction. World Scientific, Hackensack (2003)
Chapter 2
Probability Distribution Functions

2.1 Introduction

The problem introduced in Sect. 1.6.1 with Bertrand's paradox occurs when we try
to decompose the range of possible values of a random variable x into equally
probable elementary intervals; this is not always possible without ambiguity
because of the continuous nature of the problem. In Sect. 1.6 we considered a
continuous random variable x with possible values in an interval [x_1, x_2], and we
saw that if x is uniformly distributed in [x_1, x_2], a transformed variable y = Y(x)
is not in general uniformly distributed in [y_1, y_2] = [Y(x_1), Y(x_2)] (Y is taken as
a monotonic function of x). This makes the choice of the continuous variable on
which equally probable intervals are defined arbitrary.
The following sections will show how to overcome this difficulty using definitions
consistent with the axiomatic approach to probability introduced in Sect. 1.7.

2.2 Definition of Probability Distribution Function

The concept of probability distribution introduced in Sect. 1.8 can be generalized
to the continuous case. Let us consider a sample space Ω ⊆ ℝⁿ. Each random
extraction (an experiment, in the cases of interest to a physicist) will lead to an
outcome (i.e. a measurement) corresponding to one point \vec{x} in the sample space Ω.
We can associate a probability density f(\vec{x}) = f(x_1, \cdots, x_n) with any point \vec{x} in Ω,
which is a real value greater than or equal to zero. The probability of an event A, where
A ⊆ Ω, i.e. the probability that \vec{x} ∈ A, is given by:

P(A) = \int_A f(x_1, \cdots, x_n) \, \mathrm{d}^n x .   (2.1)

© Springer International Publishing AG 2017
L. Lista, Statistical Methods for Data Analysis in Particle Physics, Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0_2

The function f is called the probability distribution function (PDF). The function f(\vec{x}),
times dⁿx, can be interpreted as a differential probability, i.e. f(\vec{x}) is equal to the
probability dP corresponding to the infinitesimal hypervolume dx_1 ⋯ dx_n, divided
by the infinitesimal hypervolume:

\frac{\mathrm{d}P}{\mathrm{d}x_1 \cdots \mathrm{d}x_n} = f(x_1, \cdots, x_n) .   (2.2)

The normalization condition for discrete probability distributions (Eq. (1.3)) can be
generalized to the continuous case as follows:

\int_\Omega f(x_1, \cdots, x_n) \, \mathrm{d}^n x = 1 .   (2.3)

In one dimension, one can write:

\int_{-\infty}^{+\infty} f(x) \, \mathrm{d}x = 1 .   (2.4)

Note that the probability corresponding to a set containing a single point is
rigorously zero if f is a real function, i.e. P({x_0}) = 0 for any x_0, since the set
{x_0} has null measure. The treatment of discrete variables in one dimension can be
done using the same formalism of PDFs, extending the definition of PDF to Dirac
delta functions, δ(x − x_0), with:

\int_{-\infty}^{+\infty} \delta(x - x_0) \, \mathrm{d}x = 1 .   (2.5)

Dirac delta functions can be linearly combined with proper weights equal to the
probabilities corresponding to discrete values. A distribution representing a discrete
random variable x that can only take the values x_1, \cdots, x_N with probabilities
P_1, \cdots, P_N, respectively, can be written, using the continuous PDF formalism, as:

f(x) = \sum_{i=1}^{N} P_i \, \delta(x - x_i) .   (2.6)

The normalization condition in this case can be written as:

\int_{-\infty}^{+\infty} f(x) \, \mathrm{d}x = \sum_{i=1}^{N} P_i \int_{-\infty}^{+\infty} \delta(x - x_i) \, \mathrm{d}x = \sum_{i=1}^{N} P_i = 1 ,   (2.7)

which gives again the normalization condition for a discrete variable, already shown
in Eq. (1.3).

Discrete and continuous distributions can be combined using linear combinations
of continuous PDFs and Dirac delta functions. For instance, if g(x) is a continuous
PDF, the PDF:

f(x) = \frac{1}{2}\, \delta(x) + \frac{1}{2}\, g(x)   (2.8)

gives a 50% probability to have x = 0 and a 50% probability to have a value of x
distributed according to g(x).

2.3 Average and Variance in the Continuous Case

The definitions of average and variance introduced in Sect. 1.12 are generalized to
continuous variables as follows. The average value of a continuous variable x whose
PDF is f is:

\langle x \rangle = \int x \, f(x) \, \mathrm{d}x .   (2.9)

More generally, the average value of g(x) is:

\langle g(x) \rangle = \int g(x) \, f(x) \, \mathrm{d}x .   (2.10)

The variance of x is:

V[x] = \int (x - \langle x \rangle)^2 f(x) \, \mathrm{d}x = \langle x^2 \rangle - \langle x \rangle^2 .   (2.11)

The standard deviation is defined as:

\sigma_x = \sqrt{V[x]} .   (2.12)

Integrals are extended over [−∞, +∞], or over the entire validity range of the
variable x.
Covariance, correlation coefficient, and covariance matrix can be defined for the
continuous case in the same way as in Sect. 1.12, as well as skewness and
kurtosis.
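Equations (2.9) and (2.11) can be evaluated numerically for a concrete PDF. The sketch below uses an exponential PDF with λ = 2 (an arbitrary choice; see Sect. 2.11) and a simple midpoint-rule integration, truncating the integration range where the integrand is negligible:

```python
import math

lam = 2.0  # arbitrary rate parameter of an exponential PDF

def f(x):
    # exponential PDF, normalized over [0, +inf)
    return lam * math.exp(-lam * x)

def integrate(g, a, b, n=100_000):
    # simple midpoint-rule numerical integration
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

mean = integrate(lambda x: x * f(x), 0.0, 50.0)         # Eq. (2.9)
mean_sq = integrate(lambda x: x * x * f(x), 0.0, 50.0)  # <x^2>
var = mean_sq - mean**2                                 # Eq. (2.11)
print(mean, var, math.sqrt(var))  # ~ 0.5, 0.25, 0.5
```

The numerical values reproduce the analytic results 1/λ for the mean and 1/λ² for the variance of an exponential distribution.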

2.4 Mode, Median, Quantiles

The mode of a PDF f is the value M corresponding to the maximum of f(x):

f(M) = \max_x f(x) .   (2.13)

As in the discrete case, a continuous PDF may have more than one mode; in
that case it is called a multimodal distribution.
The median of a PDF f(x) is the value \tilde{x} such that:

P(x < \tilde{x}) = P(x > \tilde{x}) ,   (2.14)

or equivalently¹:

\int_{-\infty}^{\tilde{x}} f(x) \, \mathrm{d}x = \int_{\tilde{x}}^{+\infty} f(x) \, \mathrm{d}x .   (2.15)

More generally, the quantity q_\alpha such that:

\int_{-\infty}^{q_\alpha} f(x) \, \mathrm{d}x = \alpha = 1 - \int_{q_\alpha}^{+\infty} f(x) \, \mathrm{d}x   (2.16)

is called a quantile (or α-quantile). The median is the quantile corresponding to a
probability of 1/2. The 100 quantiles corresponding to probabilities of 1%, 2%, \cdots,
99% are called the 1st, 2nd, \cdots, 99th percentile, respectively.

2.5 Cumulative Distribution

Given a PDF f(x), its cumulative distribution is defined as:

F(x) = \int_{-\infty}^{x} f(x') \, \mathrm{d}x' .   (2.17)

The cumulative distribution F(x) is a monotonically increasing function of x and, from
the normalization of f(x) (Eq. (2.4)), its values range from 0 to 1. In particular:

\lim_{x \to -\infty} F(x) = 0 ,   (2.18)
\lim_{x \to +\infty} F(x) = 1 .   (2.19)

¹ We assume here that f(x) is sufficiently regular, such that P({\tilde{x}}) = 0, i.e. f(x) has no Dirac
delta component δ(x − \tilde{x}); then P(x < \tilde{x}) = P(x > \tilde{x}) = 1/2. Otherwise, P(x < \tilde{x}) =
P(x > \tilde{x}) = (1 − P({\tilde{x}}))/2.

If the variable x follows the PDF f(x), the PDF of the transformed variable y =
F(x) is uniform between 0 and 1, as can easily be demonstrated:

\frac{\mathrm{d}P}{\mathrm{d}y} = \frac{\mathrm{d}P}{\mathrm{d}x}\, \frac{\mathrm{d}x}{\mathrm{d}y} = f(x)\, \frac{\mathrm{d}x}{\mathrm{d}F(x)} = \frac{f(x)}{f(x)} = 1 .   (2.20)

This property turns out to be very useful for generating pseudorandom
numbers with a desired PDF using computer algorithms, as will be discussed
in Sect. 4.5.1.
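This is the mechanism behind inverse transform sampling: generate u uniform in [0, 1[ and solve F(x) = u. For an exponential PDF the inversion is analytic, as in the following sketch (the rate and seed are arbitrary choices):

```python
import math
import random

random.seed(0)
lam = 1.5  # arbitrary exponential rate parameter

# F(x) = 1 - exp(-lam * x); solving F(x) = u for a uniform u in [0, 1)
# gives x = -ln(1 - u) / lam, which is exponentially distributed.
samples = [-math.log(1.0 - random.random()) / lam for _ in range(200_000)]

mean = sum(samples) / len(samples)
print(mean)  # close to the expected value 1/lam ~ 0.667
```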

2.6 Continuous Transformations of Variables

The evaluation of probability distributions under transformations of variables was
discussed in Sect. 1.13 for the discrete case and can be generalized to the continuum.
Consider a transformation of variable y = Y(x), where x follows a PDF f(x).
The following generalization of Eq. (1.33) gives the PDF of the transformed variable
y:

f(y) = \int \delta(y - Y(x)) \, f(x) \, \mathrm{d}x .   (2.21)

Similarly, considering a transformation z = Z(x, y), the PDF of the transformed
variable z can be generalized from Eq. (1.34) as:

f(z) = \int \delta(z - Z(x, y)) \, f(x, y) \, \mathrm{d}x \, \mathrm{d}y ,   (2.22)

where f(x, y) is the PDF of the variables x and y. In the case of transformations
into more than one variable, the generalization is straightforward. If we have, for
instance, x' = X'(x, y), y' = Y'(x, y), the transformed two-dimensional PDF can be
written as:

f'(x', y') = \int \delta(x' - X'(x, y)) \, \delta(y' - Y'(x, y)) \, f(x, y) \, \mathrm{d}x \, \mathrm{d}y .   (2.23)

If the transformation is invertible, the PDF transforms according to the determinant
of the Jacobian of the transformation, which appears in the transformation of
the n-dimensional volume element dⁿx = dx_1 ⋯ dx_n:

f(x_1, \cdots, x_n) = \frac{\mathrm{d}^n P}{\mathrm{d}^n x} = \frac{\mathrm{d}^n P'}{\mathrm{d}^n x'} \left| \det \frac{\partial x'_i}{\partial x_j} \right| = f'(x'_1, \cdots, x'_n) \left| \det \frac{\partial x'_i}{\partial x_j} \right| .   (2.24)

For the simplest case of a single variable:

f(x) = f'(x') \left| \frac{\mathrm{d}x'}{\mathrm{d}x} \right| .   (2.25)

2.7 Uniform Distribution

A variable x is uniformly distributed in the interval [a, b[ if the PDF is constant
in the range x ∈ [a, b[. This condition was discussed in Sect. 1.6.1, before formally
introducing the concept of a PDF. Considering the normalization condition, a uniform
PDF can be written as:

u(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x < b , \\ 0 & \text{if } x < a \text{ or } x \ge b . \end{cases}   (2.26)

Examples of uniform distributions are shown in Fig. 2.1 for different extreme values
a and b.
The average of a uniformly distributed variable x is:

\langle x \rangle = \frac{a + b}{2} ,   (2.27)


Fig. 2.1 Uniform distributions with different values of the range extremes a and b
2.8 Gaussian Distribution 31

and its standard deviation is:

\sigma_x = \frac{b - a}{\sqrt{12}} .   (2.28)

Example 2.4 Strip Detectors

A detector instrumented with strips of a given pitch l receives particles
uniformly distributed along each strip.
The standard deviation of the distribution of the particles' impact point
position along the direction transverse to the strips is given by l/\sqrt{12},
according to Eq. (2.28).

2.8 Gaussian Distribution

A Gaussian or normal distribution is defined by the following PDF, μ and σ being
fixed parameters:

g(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) .   (2.29)

A random variable following a normal distribution is called a normal random
variable. The average value and standard deviation of a normal variable are μ and
σ, respectively. The full width at half maximum (FWHM) of a Gaussian distribution
is equal to 2\sqrt{2 \ln 2}\, \sigma \simeq 2.3548\, \sigma.
Examples of Gaussian distributions are shown in Fig. 2.2 for different values of
μ and σ.
For μ = 0 and σ = 1, a normal distribution is called a standard normal
distribution, and is equal to:

\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} .   (2.30)

The cumulative distribution of a standard normal distribution is:

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-x'^2/2} \, \mathrm{d}x' = \frac{1}{2} \left[ \mathrm{erf}\left( \frac{x}{\sqrt{2}} \right) + 1 \right] .   (2.31)


Fig. 2.2 Gaussian distributions with different values of the average and standard deviation parameters μ and σ

The probability for a Gaussian distribution corresponding to a symmetric interval
around μ, [μ − Zσ, μ + Zσ], frequently used in many applications, can be
computed as:

P(Z) = \frac{1}{\sqrt{2\pi}} \int_{-Z}^{Z} e^{-x^2/2} \, \mathrm{d}x = \Phi(Z) - \Phi(-Z) = \mathrm{erf}\left( \frac{Z}{\sqrt{2}} \right) .   (2.32)

The most frequently used values are the ones corresponding to 1σ, 2σ and 3σ (Z =
1, 2, 3), and have probabilities of 68.27%, 95.45% and 99.73%, respectively.
The importance of the Gaussian distribution resides in the central limit theorem
(see Sect. 2.14), which allows one to approximate with Gaussian distributions many
realistic cases resulting from the superposition of several random effects, each having
a finite and possibly unknown PDF.
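The probabilities quoted for the 1σ, 2σ and 3σ intervals follow directly from Eq. (2.32) and can be reproduced with the error function available in the Python standard library:

```python
import math

def gauss_interval_prob(Z):
    # Eq. (2.32): probability of [mu - Z*sigma, mu + Z*sigma]
    return math.erf(Z / math.sqrt(2))

for Z in (1, 2, 3):
    print(f"{Z} sigma: {gauss_interval_prob(Z):.4%}")
# 1 sigma: ~68.27%, 2 sigma: ~95.45%, 3 sigma: ~99.73%
```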

2.9 χ² Distribution

A χ² random variable with n degrees of freedom is the sum of the squares of n standard
normal variables (see Sect. 2.8). The distribution of a χ² variable is given by:

f(\chi^2; n) = \frac{2^{-n/2}}{\Gamma(n/2)}\, (\chi^2)^{n/2 - 1}\, e^{-\chi^2/2} .   (2.33)

Fig. 2.3 χ² distributions with different values of the number of degrees of freedom n

Γ is the so-called gamma function, the analytical extension of the factorial.² The
expected value of a χ² distribution is equal to the number of degrees of freedom n,
and the variance is equal to 2n. χ² distributions are shown in Fig. 2.3 for different
numbers of degrees of freedom n.
Typical applications of the χ² distribution are goodness-of-fit tests (see
Sect. 5.12.2), where the cumulative χ² distribution is used.

2.10 Log Normal Distribution

If a random variable y is distributed according to a normal distribution with average
μ and standard deviation σ, the variable x = e^y is distributed according to a log
normal distribution, defined as:

f(x; \mu, \sigma) = \frac{1}{x\, \sigma \sqrt{2\pi}} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right) .   (2.34)

² Γ(n) = (n − 1)! if n is an integer value.


Fig. 2.4 Log normal distributions with different values of the parameters μ and σ

The PDF in Eq. (2.34) can be determined by applying Eq. (2.21) to the case of a normal
distribution. A log normal variable has the following average and standard deviation:

\langle x \rangle = e^{\mu + \sigma^2/2} ,   (2.35)

\sigma_x = e^{\mu + \sigma^2/2} \sqrt{e^{\sigma^2} - 1} .   (2.36)

Note that Eq. (2.35) implies that \langle e^y \rangle > e^{\langle y \rangle} for a normal random variable y:

\langle e^y \rangle = \langle x \rangle = e^{\mu}\, e^{\sigma^2/2} > e^{\mu} = e^{\langle y \rangle} .

Examples of log normal distributions are shown in Fig. 2.4 for different values of
μ and σ.

2.11 Exponential Distribution

An exponential distribution of a variable x ≥ 0 is characterized by a PDF
proportional to e^{−λx}, where λ is a constant. The expression of an exponential PDF,
including the overall normalization factor λ, is given by:

f(x; \lambda) = \lambda\, e^{-\lambda x} .   (2.37)




Fig. 2.5 Exponential distributions with different values of the parameter λ

Examples of exponential distributions are shown in Fig. 2.5 for different values of
the parameter λ.
Exponential distributions are widely used in physics, in particular to model the
distribution of particle lifetimes. In those cases, the average lifetime τ is the inverse
of the parameter λ.

2.12 Poisson Distribution

A non-negative integer random variable n is called a Poissonian random variable if it
is distributed, for a given value of the parameter ν, according to the distribution:

P(n; \nu) = \frac{\nu^n e^{-\nu}}{n!} .   (2.38)

Equation (2.38) is called the Poisson distribution, and ν is sometimes also called the rate,
as will become clearer in Example 2.5. Figure 2.6 shows examples of Poisson
distributions for different values of the parameter ν. It is easy to demonstrate that
the average and the variance of a Poisson distribution are both equal to ν.

Example 2.5 Poisson Distribution as the Limit of a Binomial Distribution
from a Uniform Process

Consider a variable ξ uniformly distributed over an interval [0, X[. ξ could
be either a time or a space variable, in a concrete case. Imagine, for instance,


Fig. 2.6 Poisson distributions with different values of the rate parameter ν

the arrival position of a rain drop on the ground, or of a particle on a detector,
along one direction, or the arrival time of a cosmic ray.
If ξ is randomly extracted N times in the range [0, X[, the rate r, equal to
N/X, can be introduced. r represents the number of extractions per unit of ξ.

Fig. 2.7 A uniform distribution of occurrences along a variable ξ. Two intervals are
shown, of sizes x and X, where x ≪ X

Let us consider only values of ξ in a shorter interval [0, x[ (Fig. 2.7).
The extraction of n occurrences out of N in the interval [0, x[, while the
remaining N − n occurrences are in [x, X[, is clearly a binomial process (see
Sect. 1.15).
Consider N and X as constants, i.e. not subject to random fluctuations, and
take the limit N → ∞, X → ∞ while keeping the ratio N/X = r constant.

The expected value ν of the number of extracted values of ξ in the interval
[0, x[ can be determined with a simple proportion:

\nu = \langle n \rangle = \frac{N x}{X} = r x ,   (2.39)

while n follows a binomial distribution (Eq. (1.39)):

P(n; N, \nu) = \frac{N!}{n!\,(N - n)!} \left( \frac{\nu}{N} \right)^n \left( 1 - \frac{\nu}{N} \right)^{N - n} .   (2.40)

Equation (2.40) can also be written as:

P(n; N, \nu) = \frac{\nu^n}{n!}\, \frac{N (N - 1) \cdots (N - n + 1)}{N^n} \left( 1 - \frac{\nu}{N} \right)^N \left( 1 - \frac{\nu}{N} \right)^{-n} .   (2.41)

The first term, ν^n/n!, does not depend on N, while the remaining three terms
tend to 1, e^{−ν} and 1, respectively, in the limit N → ∞. The distribution
of n from Eq. (2.41), in this limit, is equal to the Poisson distribution:

P(n; \nu) = \frac{\nu^n e^{-\nu}}{n!} .   (2.42)
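The convergence of Eq. (2.40) to Eq. (2.42) can be observed numerically; the sketch below keeps ν = 3 fixed (an arbitrary choice) while N grows:

```python
from math import comb, exp, factorial

def binom_pmf(n, N, p):
    # Eq. (2.40) with p = nu / N
    return comb(N, n) * p**n * (1 - p)**(N - n)

def poisson_pmf(n, nu):
    # Eq. (2.42)
    return nu**n * exp(-nu) / factorial(n)

nu, n = 3.0, 2
for N in (10, 100, 10_000):
    print(N, binom_pmf(n, N, nu / N))
print("Poisson limit:", poisson_pmf(n, nu))
```

As N grows with ν = pN held fixed, the binomial probabilities approach the Poisson value.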

Example 2.6 Exponential Distributions from a Uniformly Distributed
Process

Consider a sequence of events uniformly distributed over an indefinitely
large time interval, as in the previous Example 2.5. The time t could be
the arrival time of a cosmic ray, for instance. The situation is sketched in
Fig. 2.8.

Fig. 2.8 Occurrence times (dots) of events uniformly distributed in time, represented
along a horizontal axis. The time origin (t_0 = 0) is marked as a '+'. The occurrence time
of the first event is marked as t_1

Let t_1 be the occurrence time of the first event with respect to an arbitrary
time origin t_0, which could also coincide with the occurrence of one of

the events. In that case, t_1 represents the time difference between two
consecutive events.
The occurrence time of the first event, as well as the time difference between
two consecutive events, can be demonstrated to be distributed according to
an exponential PDF, as follows.
Let us consider a time t and another time t + δt, with δt ≪ t. The
probability that t_1 is greater than or equal to t + δt, P(t_1 ≥ t + δt), is equal
to the probability P(0; [0, t + δt[) that no event occurs before t + δt, i.e. in
the time interval [0, t + δt[.
The probability P(0; [0, t + δt[) is equal to the probability that no event
occurs in the interval [0, t[ and no event occurs in the interval [t, t + δt[,
since [0, t + δt[ = [0, t[ ∪ [t, t + δt[. Events occurring in the two disjoint
time intervals are independent, hence the combined probability is the
product of the two probabilities:

P(0; [0, t + \delta t[) = P(0; [0, t[)\, P(0; [t, t + \delta t[) .   (2.43)

Given the event rate r per unit of time, the probability to have n occurrences
in a time interval δt is given by a Poisson distribution (see Example 2.5)
with rate ν = r δt:

P(n; \nu) = P(n; [t, t + \delta t[) = \frac{\nu^n e^{-\nu}}{n!} .   (2.44)

The probability to have more than one occurrence is of order O(δt²) or
smaller, and the most probable values are n = 0 and n = 1. Neglecting the
probability that n > 1, the normalization condition for the distribution in
Eq. (2.44) gives:

P(0; [t, t + \delta t[) \simeq 1 - P(1; [t, t + \delta t[) \simeq 1 - r\, \delta t .   (2.45)

Equation (2.43) can be written, using the result from Eq. (2.45), as:

P(0; [0, t + \delta t[) = P(0; [0, t[)\, (1 - r\, \delta t) ,   (2.46)

or, equivalently:

P(t_1 \ge t + \delta t) = P(t_1 \ge t)\, (1 - r\, \delta t) ,   (2.47)

which gives:

\frac{P(t_1 \ge t + \delta t) - P(t_1 \ge t)}{\delta t} = -r\, P(t_1 \ge t) .   (2.48)

Taking the limit δt → 0, the following differential equation can be written:

\frac{\mathrm{d}P(t_1 \ge t)}{\mathrm{d}t} = -r\, P(t_1 \ge t) .   (2.49)

Considering the initial condition P(t_1 ≥ 0) = 1, Eq. (2.49) has the following
solution:

P(t_1 \ge t) = e^{-r t} .   (2.50)

If P(t) is the probability distribution function of the first occurrence time
t = t_1, P(t) can be determined from the derivative of P(t_1 ≥ t) in Eq. (2.50):

P(t) = \frac{P(t < t_1 < t + \delta t)}{\delta t} = \frac{\mathrm{d}P(t_1 < t)}{\mathrm{d}t} ,   (2.51)

where:

P(t_1 < t) = 1 - P(t_1 \ge t) = 1 - e^{-r t} .   (2.52)

The derivative with respect to t gives:

P(t) = \frac{\mathrm{d}P(t_1 < t)}{\mathrm{d}t} = \frac{\mathrm{d}(1 - e^{-r t})}{\mathrm{d}t} ,   (2.53)

hence:

P(t) = r\, e^{-r t} .   (2.54)

The exponential distribution is characteristic of particle lifetimes. The
possibility to measure the decay parameter of an exponential distribution
independently of the initial time t_0 allows measuring particle lifetimes
even if the particle's creation time is not known. For instance, the lifetime
of cosmic-ray muons can be measured at sea level even if the muon was
produced in the high atmosphere.
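The derivation can be checked by simulation: generate event times uniformly over a long window, as in Example 2.5, and look at the gaps between consecutive events. The rate, window size and seed below are arbitrary choices:

```python
import random

random.seed(3)
r = 2.0          # event rate per unit time (arbitrary)
T = 100_000.0    # total observation window (arbitrary)
N = int(r * T)   # number of uniformly distributed event times

times = sorted(random.uniform(0.0, T) for _ in range(N))
gaps = [b - a for a, b in zip(times, times[1:])]

mean_gap = sum(gaps) / len(gaps)
frac_above = sum(g >= 1.0 / r for g in gaps) / len(gaps)
print(mean_gap)    # close to 1/r = 0.5
print(frac_above)  # close to exp(-r * (1/r)) = exp(-1) ~ 0.368
```

The fraction of gaps exceeding 1/r matches the exponential survival probability e^{−rt} of Eq. (2.50).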


Fig. 2.9 Poisson distributions with different values of the parameter ν compared with Gaussian distributions with μ = ν and σ = \sqrt{ν}

Poisson distributions have several interesting properties, some of which are listed
in the following.
• For large ν, a Poisson distribution can be approximated with a Gaussian having
average ν and standard deviation \sqrt{ν}. See Fig. 2.9 for a visual comparison.
• A binomial distribution with a number of extractions N and probability p ≪
1 can be approximated with a Poisson distribution with average ν = pN (see
Example 2.5, above).
• If two variables n_1 and n_2 follow Poisson distributions with averages ν_1 and ν_2,
respectively, it is easy to demonstrate, using Eq. (1.34), that the sum n = n_1 + n_2
again follows a Poisson distribution with average ν_1 + ν_2. In formulae:

P(n; \nu_1, \nu_2) = \sum_{n_1 = 0}^{n} \mathrm{Pois}(n_1; \nu_1)\, \mathrm{Pois}(n - n_1; \nu_2) = \mathrm{Pois}(n; \nu_1 + \nu_2) .   (2.55)

This property descends from the fact that the superposition of two uniform
processes, like the one considered in Example 2.5, is again a uniform process,
whose total rate is equal to the sum of the two individual rates.
• Randomly picking occurrences with probability ε from a Poissonian process
gives again a Poissonian process. In other words, if a Poisson variable n_0 has
expected value (rate) ν_0, then a variable n distributed according to a binomial
distribution with probability ε and sample size n_0 is distributed according
to a Poisson distribution with average ν = ε ν_0. In formulae:

P(n; \nu_0, \varepsilon) = \sum_{n_0 = 0}^{\infty} \mathrm{Pois}(n_0; \nu_0)\, \mathrm{Binom}(n; n_0, \varepsilon) = \mathrm{Pois}(n; \varepsilon \nu_0) .   (2.56)

This is the case, for instance, when counting the number of cosmic rays recorded
by a detector whose efficiency ε is not ideal (ε < 1).
• The cumulative χ² distribution (Eq. (2.33)) and the Poisson distribution are related. From the
following formulae:

\int_0^{\chi^2} f(\chi'^2; n) \, \mathrm{d}\chi'^2 = P\left( \frac{n}{2}, \frac{\chi^2}{2} \right) ,   (2.57)

\sum_{k=0}^{n-1} \frac{\nu^k e^{-\nu}}{k!} = 1 - P(n, \nu) ,   (2.58)

where P(n, x) is the so-called incomplete gamma function, the following relation
holds:

\sum_{k=0}^{n-1} \frac{\nu^k e^{-\nu}}{k!} = \int_{2\nu}^{+\infty} f(\chi^2; 2n) \, \mathrm{d}\chi^2 .   (2.59)
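The sum property of Eq. (2.55) and the binomial 'thinning' property of Eq. (2.56) can both be verified numerically term by term; the parameter values below are arbitrary:

```python
from math import exp, factorial, comb

def pois(n, nu):
    # Poisson probability, Eq. (2.38)
    return nu**n * exp(-nu) / factorial(n)

nu1, nu2 = 2.0, 3.5  # rates of the two Poisson variables
nu0, eps = 2.0, 0.4  # rate and selection efficiency for the thinning check
n = 4

# Eq. (2.55): distribution of the sum n = n1 + n2
p_sum = sum(pois(n1, nu1) * pois(n - n1, nu2) for n1 in range(n + 1))

# Eq. (2.56): binomial selection out of a Poisson variable
# (the sum over n0 is truncated where its terms become negligible)
p_thin = sum(pois(n0, nu0) * comb(n0, n) * eps**n * (1 - eps)**(n0 - n)
             for n0 in range(n, 60))

print(p_sum, pois(n, nu1 + nu2))   # the two values agree
print(p_thin, pois(n, eps * nu0))  # and so do these
```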

2.13 Other Distributions Useful in Physics

Some of the most commonly used PDFs in physics are presented in the following
sections. The list is, of course, not exhaustive.

2.13.1 Breit–Wigner Distribution

A (non-relativistic) Breit–Wigner distribution, also known as a Lorentz distribution or
Cauchy distribution, has the following expression:

BW(x; x_0, \gamma) = \frac{1}{\pi}\, \frac{\gamma}{(x - x_0)^2 + \gamma^2} .   (2.60)

While the parameter x_0 determines the position of the maximum of the distribution
(the mode), twice the parameter γ is equal to the full width at half maximum of the
distribution.
A Breit–Wigner distribution arises in many resonance problems in physics.
Fig. 2.10 Breit–Wigner distributions centered around zero for different values of the width parameter γ (γ = 0.5, 1 and 2)

Since the integrals of both x BW(x) and x² BW(x) are divergent, the mean and variance of a Breit–Wigner distribution are undefined.
Figure 2.10 shows examples of Breit–Wigner distributions for different values of the width parameter γ and for fixed x₀ = 0.
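Both the full-width statement and the normalization can be checked with the closed-form cumulative distribution of the Cauchy PDF (a plain-Python sketch; the parameter values are arbitrary):

```python
import math

def bw(x, x0, gamma):
    # Cauchy/Lorentz PDF: (1/pi) * gamma / ((x - x0)^2 + gamma^2)
    return gamma / (math.pi * ((x - x0)**2 + gamma**2))

x0, gamma = 1.0, 0.5
peak = bw(x0, x0, gamma)
# Half maximum is reached at x0 +/- gamma, so the FWHM is 2*gamma
assert abs(bw(x0 + gamma, x0, gamma) - peak / 2) < 1e-12
assert abs(bw(x0 - gamma, x0, gamma) - peak / 2) < 1e-12
# Normalization via the exact primitive: F(x) = 1/2 + arctan((x - x0)/gamma)/pi
F = lambda x: 0.5 + math.atan((x - x0) / gamma) / math.pi
assert abs((F(1e9) - F(-1e9)) - 1.0) < 1e-6
```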

2.13.2 Relativistic Breit–Wigner Distribution

A relativistic Breit–Wigner distribution has the following expression:

  \mathrm{BW_R}(x; m, \Gamma) = \frac{N}{(x^2 - m^2)^2 + m^2 \Gamma^2} \,,   (2.61)

where the constant N is given by:

  N = \frac{2\sqrt{2}\, m\, \Gamma\, k}{\pi \sqrt{m^2 + k}} \,, \quad \text{with:} \quad k = \sqrt{m^2 (m^2 + \Gamma^2)} \,.   (2.62)

The parameter m determines the position of the maximum of the distribution (mode) and the parameter Γ measures the width of the distribution.
A relativistic Breit–Wigner distribution arises from the square of a virtual particle's propagator (see for instance [1]) with four-momentum squared p² = x², which is proportional to:

  \frac{1}{(x^2 - m^2) + i\, m\, \Gamma} \,.   (2.63)

As for a non-relativistic Breit–Wigner, due to integral divergences, the mean and variance of a relativistic Breit–Wigner distribution are also undefined.
Figure 2.11 shows examples of relativistic Breit–Wigner distributions for different values of Γ and fixed m.
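The normalization constant of Eq. (2.62) can be verified by direct integration (a plain-Python sketch, assuming that N normalizes the PDF over x ∈ [0, ∞); the improper integral is truncated where the 1/x⁴ tail is negligible):

```python
import math

def bwr(x, m, G):
    # Relativistic Breit-Wigner PDF, Eqs. (2.61)-(2.62)
    k = math.sqrt(m**2 * (m**2 + G**2))
    N = 2 * math.sqrt(2) * m * G * k / (math.pi * math.sqrt(m**2 + k))
    return N / ((x**2 - m**2)**2 + m**2 * G**2)

def simpson(f, a, b, steps):
    h = (b - a) / steps
    s = f(a) + f(b)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

m, G = 100.0, 5.0
total = simpson(lambda x: bwr(x, m, G), 0.0, 3000.0, 60000)
assert abs(total - 1.0) < 1e-3   # numerically consistent with unit normalization
```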

2.13.3 Argus Function

The Argus collaboration introduced [2] a function that models many cases of combinatorial backgrounds where kinematical bounds produce a sharp edge. The Argus distribution is given by:

  A(x; \theta, \xi) = N\, x \sqrt{1 - \left( \frac{x}{\theta} \right)^2}\; e^{-\xi^2 \left[ 1 - (x/\theta)^2 \right] / 2} \,,   (2.64)

where N is a normalization coefficient which depends on the parameters θ and ξ.

Fig. 2.11 Relativistic Breit–Wigner distributions with mass parameter m = 100 and different values of the width parameter Γ (Γ = 2.5, 5 and 10)

Fig. 2.12 Argus distributions with different values of the parameters θ and ξ (θ = 8, ξ = 0.01; θ = 9, ξ = 0.5; θ = 10, ξ = 1)

Examples of Argus distributions are shown in Fig. 2.12 for different values of the parameters θ and ξ. The primitive function of Eq. (2.64) can be computed analytically, and this saves computer time in the evaluation of the normalization coefficient N. Assuming ξ ≥ 0, the normalization condition for an Argus PDF can be written as follows:

  \frac{1}{N} \int A(x; \theta, \xi)\, dx = \frac{\theta^2}{\xi^2} \left\{ e^{-\xi^2 \left[ 1 - (x/\theta)^2 \right] / 2} \sqrt{1 - \frac{x^2}{\theta^2}} - \sqrt{\frac{\pi}{2}}\, \frac{1}{\xi}\, \mathrm{erf}\!\left( \frac{\xi}{\sqrt{2}} \sqrt{1 - \frac{x^2}{\theta^2}} \right) \right\} ,   (2.65)

and the normalized expression of the Argus function becomes:

  A(x; \theta, \xi) = \frac{\xi^3}{\sqrt{2\pi}\, \Psi(\xi)}\; \frac{x}{\theta^2} \sqrt{1 - \left( \frac{x}{\theta} \right)^2}\; e^{-\xi^2 \left[ 1 - (x/\theta)^2 \right] / 2} \,,   (2.66)

where Ψ(ξ) = Φ(ξ) − ξ φ(ξ) − 1/2, φ(ξ) being a standard normal distribution (Eq. (2.30)) and Φ(ξ) its cumulative distribution (Eq. (2.31)).
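The normalized form of Eq. (2.66) can be checked by integrating it numerically over its support [0, θ] (a plain-Python sketch; the parameter values are arbitrary):

```python
import math

def argus_pdf(x, theta, xi):
    # Normalized Argus PDF, Eq. (2.66), defined on 0 <= x <= theta
    if not 0.0 <= x <= theta:
        return 0.0
    u = 1.0 - (x / theta)**2
    phi = math.exp(-xi**2 / 2) / math.sqrt(2 * math.pi)   # standard normal PDF at xi
    Phi = 0.5 * (1 + math.erf(xi / math.sqrt(2)))         # its cumulative at xi
    Psi = Phi - xi * phi - 0.5
    norm = xi**3 / (math.sqrt(2 * math.pi) * Psi)
    return norm * (x / theta**2) * math.sqrt(u) * math.exp(-xi**2 * u / 2)

def simpson(f, a, b, steps=20000):
    h = (b - a) / steps
    s = f(a) + f(b)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

theta, xi = 9.0, 1.0
total = simpson(lambda x: argus_pdf(x, theta, xi), 0.0, theta)
assert abs(total - 1.0) < 1e-4
```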

2.13.4 Crystal Ball Function

Some random variables only approximately follow a Gaussian distribution, but exhibit an asymmetric tail on one of the two sides. In order to provide a description of such distributions, the collaboration working on the Crystal Ball experiment at SLAC defined the following PDF [3], where a power-law distribution is used in place of one of the two Gaussian tails, ensuring the continuity of the function and of its first derivative. The Crystal Ball distribution is defined as:

  \mathrm{CB}(x; \alpha, n, \bar{x}, \sigma) = N \cdot \begin{cases}
    \exp\!\left( -\dfrac{(x - \bar{x})^2}{2\sigma^2} \right) & \text{for } \dfrac{x - \bar{x}}{\sigma} > -\alpha \,, \\[2ex]
    A \left( B - \dfrac{x - \bar{x}}{\sigma} \right)^{-n} & \text{for } \dfrac{x - \bar{x}}{\sigma} \le -\alpha \,,
  \end{cases}   (2.67)

where N is a normalization coefficient, while A and B can be determined by imposing the continuity of the function and of its first derivative, which gives:

  A = \left( \frac{n}{|\alpha|} \right)^{n} e^{-\alpha^2 / 2} \,, \qquad B = \frac{n}{|\alpha|} - |\alpha| \,.   (2.68)

The parameter α determines the starting point of the power-law tail, measured in units of σ, the standard deviation of the Gaussian 'core'.
Examples of Crystal Ball distributions are shown in Fig. 2.13, where the parameter α was varied, while the parameters of the Gaussian core were fixed at μ = 0, σ = 1, and the power-law exponent was set to n = 2.
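The continuity conditions that fix A and B can be checked with finite differences across the junction point (a plain-Python sketch for the unnormalized shape; the parameter values are arbitrary):

```python
import math

def crystal_ball(t, alpha, n):
    # t = (x - xbar)/sigma; unnormalized Crystal Ball shape (N = 1), Eq. (2.67)
    A = (n / abs(alpha))**n * math.exp(-alpha**2 / 2)
    B = n / abs(alpha) - abs(alpha)
    if t > -alpha:
        return math.exp(-t**2 / 2)
    return A * (B - t)**(-n)

alpha, n, eps = 1.2, 3.0, 1e-7
t0 = -alpha
# Continuity of the function at the junction t = -alpha
left = crystal_ball(t0 - eps, alpha, n)     # power-law side
right = crystal_ball(t0 + eps, alpha, n)    # Gaussian side
assert abs(left - right) < 1e-6
# Continuity of the first derivative (one-sided finite differences)
dleft = (crystal_ball(t0 - eps, alpha, n) - crystal_ball(t0 - 3 * eps, alpha, n)) / (2 * eps)
dright = (crystal_ball(t0 + 3 * eps, alpha, n) - crystal_ball(t0 + eps, alpha, n)) / (2 * eps)
assert abs(dleft - dright) < 1e-4
```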

Fig. 2.13 Crystal Ball distributions with μ = 0, σ = 1, n = 2 and different values of the tail parameter α (α = 0.5, 1 and 2)
Fig. 2.14 Landau distributions with μ = 0 and different values of σ (σ = 0.2, 0.4 and 0.6)

2.13.5 Landau Distribution

A model that describes the fluctuations of the energy loss of particles traversing a thin layer of matter is due to Landau [4, 5]. The distribution of the energy loss x is given by the following integral expression, called the Landau distribution:

  L(x) = \frac{1}{\pi} \int_0^{\infty} e^{-t \log t - x t} \sin(\pi t)\, dt \,.   (2.69)

More frequently, the distribution is shifted by a constant μ and scaled by a constant σ, according to the following expression:

  L(x; \mu, \sigma) = L\!\left( \frac{x - \mu}{\sigma} \right) .   (2.70)

Examples of Landau distributions are shown in Fig. 2.14 for different values of σ and fixed μ = 0. This distribution is also used as an empirical model for several asymmetric distributions.
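Equation (2.69) can be evaluated by direct numerical integration, since the factor e^{−t log t} = t^{−t} decays superexponentially (a plain-Python sketch; the truncation point and step size are illustrative choices, and the mode location near x ≈ −0.22 is a known property of the standard Landau density):

```python
import math

def landau(x, steps=4000, tmax=40.0):
    # Numerical evaluation of Eq. (2.69):
    # (1/pi) * integral_0^inf exp(-t*log(t) - x*t) * sin(pi*t) dt,
    # truncated at tmax where the integrand is negligible
    h = tmax / steps
    s = 0.0
    for i in range(1, steps + 1):   # the integrand vanishes for t -> 0
        t = i * h
        s += math.exp(-t * math.log(t) - x * t) * math.sin(math.pi * t)
    return s * h / math.pi

# The density is positive and peaks near x ~ -0.22, with a long right tail
assert landau(-0.22) > landau(-2.0) > 0
assert landau(-0.22) > landau(2.0) > 0
```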

2.14 Central Limit Theorem

Given N independent random variables, x₁, …, x_N, each distributed according to a PDF with finite variance, the average of those N variables can be approximated, in the limit N → ∞, by a Gaussian distribution, regardless of the underlying PDFs.
The demonstration is not reported here, but quantitative approximate demonstrations of the central limit theorem for specific cases are easy to perform using numerical simulations based on Monte Carlo methods (see Chap. 4).
Two examples of such numerical exercises are shown in Figs. 2.15 and 2.16, where multiple random extractions from two different PDFs are summed and divided by the square root of the number of generated variables, so that this combination has the same variance as the original distribution. The distributions obtained with a large randomly-extracted sample are plotted, superimposed on a Gaussian distribution.
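The numerical exercise of Fig. 2.15 can be sketched as follows (plain Python; the sample sizes are illustrative, and the fraction of events within one standard deviation is used as a simple measure of 'Gaussianity'):

```python
import math
import random

random.seed(42)

def clt_sample(n_vars, n_events=20000):
    # Sum n_vars uniform variables on [-sqrt(3), sqrt(3)[ (mean 0, stddev 1)
    # and divide by sqrt(n_vars), so the combination keeps unit variance
    a = math.sqrt(3.0)
    return [sum(random.uniform(-a, a) for _ in range(n_vars)) / math.sqrt(n_vars)
            for _ in range(n_events)]

for n in (1, 2, 4):
    s = clt_sample(n)
    mean = sum(s) / len(s)
    var = sum((x - mean)**2 for x in s) / len(s)
    assert abs(mean) < 0.05 and abs(var - 1.0) < 0.05

# For a single uniform variable, P(|x| < 1) = 1/sqrt(3) ~ 57.7%;
# for the sum of 10 variables it approaches the Gaussian value of ~68.3%
f1 = sum(1 for x in clt_sample(1) if abs(x) < 1) / 20000
assert abs(f1 - 1 / math.sqrt(3)) < 0.02
f10 = sum(1 for x in clt_sample(10) if abs(x) < 1) / 20000
assert 0.66 < f10 < 0.70
```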

Fig. 2.15 Approximate visual demonstration of the central limit theorem using a Monte Carlo technique. A random variable x₁ is generated uniformly in the interval [−√3, √3[, in order to have average value μ = 0 and standard deviation σ = 1. The top-left plot shows the distribution of 10⁵ random extractions of x₁; the other plots show 10⁵ random extractions of (x₁ + x₂)/√2, (x₁ + x₂ + x₃)/√3 and (x₁ + x₂ + x₃ + x₄)/√4, respectively, where all xᵢ are extracted from the same uniform distribution as x₁. A Gaussian curve with μ = 0 and σ = 1, normalized so as to match the sample size, is superimposed on the extracted distributions in the four cases. The Gaussian approximation becomes better and better as a larger number of variables is added
Fig. 2.16 Same as Fig. 2.15, using a PDF that is uniformly distributed in two disjoint intervals, [−3/2, −1/2[ and [1/2, 3/2[, in order to have average value μ = 0 and standard deviation σ = 1. The individual distribution and the sums of 2, 3, 4, 6 and 10 independent random extractions of such a variable, divided by √n, n = 2, 3, 4, 6, 10, are shown in the six plots, respectively. A Gaussian distribution with μ = 0 and σ = 1 is superimposed
2.15 Probability Distribution Functions in More than One Dimension

Probability densities can be defined in spaces with more than one dimension, as introduced in Sect. 2.2. In the simplest case of two dimensions, a PDF f(x, y) measures the probability density per unit area, i.e. the ratio of the differential probability dP corresponding to an infinitesimal interval around a point (x, y) and the differential area dx dy:

  \frac{dP}{dx\, dy} = f(x, y) \,.   (2.71)

In three dimensions, the PDF measures the probability density per unit volume:

  \frac{dP}{dx\, dy\, dz} = f(x, y, z) \,,   (2.72)

and so on in more dimensions.
A PDF in more dimensions that describes the distribution of more than one random variable is also called a joint probability distribution.

2.15.1 Marginal Distributions

Given a two-dimensional PDF f(x, y), the probability distributions of the two individual variables x and y, called marginal distributions, can be determined by integrating f(x, y) over the other coordinate, y and x, respectively:

  f_x(x) = \int f(x, y)\, dy \,,   (2.73)

  f_y(y) = \int f(x, y)\, dx \,.   (2.74)

The above expressions are also a special case of a continuous transformation of variables, as described in Sect. 2.6, where the applied transformation maps the two variables into one of the two: (x, y) → x or (x, y) → y.
More in general, given a PDF in n = h + k variables (x⃗, y⃗) = (x₁, …, x_h, y₁, …, y_k), the marginal PDF of the subset of h variables (x₁, …, x_h) can be determined by integrating the PDF f(x⃗, y⃗) over the remaining set of variables (y₁, …, y_k):

  f_{x_1, \cdots, x_h}(\vec{x}\,) = \int f(\vec{x}, \vec{y}\,)\, d^k y \,.   (2.75)

2.15.2 Independent Variables

A pictorial view that illustrates the interplay between the joint distribution f(x, y) and the marginal distributions f_x(x) and f_y(y) is shown in Fig. 2.17.
The events A and B shown in the figure correspond to two values x̂ and ŷ extracted in the intervals [x, x + δx[ and [y, y + δy[, respectively:

  A = \{ \hat{x} : x \le \hat{x} < x + \delta x \} \quad \text{and} \quad B = \{ \hat{y} : y \le \hat{y} < y + \delta y \} \,.   (2.76)

The probability of their intersection is:

  P(A \cap B) = P(x \le \hat{x} < x + \delta x \text{ and } y \le \hat{y} < y + \delta y) = f(x, y)\, \delta x\, \delta y \,.   (2.77)

By definition of marginal PDF:

  P(A) = \delta P(x) = f_x(x)\, \delta x \quad \text{and} \quad P(B) = \delta P(y) = f_y(y)\, \delta y \,,   (2.78)

hence, the product of the two probabilities is:

  P(A)\, P(B) = f_x(x)\, f_y(y)\, \delta x\, \delta y \,.   (2.79)

Let us remember that, according to Eq. (1.6), two events A and B are independent if P(A ∩ B) = P(A) P(B). Given Eqs. (2.77) and (2.79), this equality holds if and only if f(x, y) can be factorized into the product of the two marginal PDFs:

  f(x, y) = f_x(x)\, f_y(y) \,.   (2.80)

From this result, x and y can be defined as independent random variables if their joint PDF can be written as the product of a PDF of the variable x times a PDF of the variable y.

Fig. 2.17 In a two-dimensional plane (x, y), a slice in x corresponds to a probability δP(x) = f_x(x) δx, a slice in y corresponds to a probability δP(y) = f_y(y) δy, and their intersection to a probability δP(x, y) = f(x, y) δx δy
More in general, n variables x₁, …, x_n are said to be independent if their n-dimensional PDF can be factorized into the product of n one-dimensional PDFs in each of the variables:

  f(x_1, \cdots, x_n) = f_1(x_1) \cdots f_n(x_n) \,.   (2.81)

In a weaker sense, the variable sets x⃗ = (x₁, …, x_n) and y⃗ = (y₁, …, y_m) are independent if:

  f(\vec{x}, \vec{y}\,) = f_x(\vec{x}\,)\, f_y(\vec{y}\,) \,.   (2.82)

Note that if two variables x and y are independent, it can be easily demonstrated that they are also uncorrelated, in the sense that their covariance (Eq. (1.24)) is null. Conversely, if two variables are uncorrelated, they are not necessarily independent, as shown in Example 2.7 below.

Example 2.7 Uncorrelated Variables May not Be Independent

An example of a PDF that describes uncorrelated variables that are not independent is given by the sum of four two-dimensional Gaussian PDFs as specified below:

  f(x, y) = \frac{1}{4} \left[ g(x; \mu, \sigma)\, g(y; 0, \sigma) + g(x; -\mu, \sigma)\, g(y; 0, \sigma) + g(x; 0, \sigma)\, g(y; \mu, \sigma) + g(x; 0, \sigma)\, g(y; -\mu, \sigma) \right] ,   (2.83)

where g is a one-dimensional Gaussian distribution.
This example is illustrated in Fig. 2.18, which plots the PDF in Eq. (2.83) with numerical values μ = 2.5 and σ = 0.7.
Considering that, for a variable z distributed according to g(z; μ, σ), the following relations hold:

  ⟨z⟩ = μ ,
  ⟨z²⟩ = μ² + σ² ,

it is easy to demonstrate that, for x and y distributed according to f(x, y), the following relations also hold:

  ⟨x⟩ = ⟨y⟩ = 0 ,
  ⟨x²⟩ = ⟨y²⟩ = σ² + μ²/2 ,
  ⟨xy⟩ = 0 .
Fig. 2.18 Example of a PDF of two variables x and y that are uncorrelated but not independent

Applying the definition of covariance in Eq. (1.24) gives cov(x, y) = 0, and for this reason x and y are uncorrelated.
Nonetheless, x and y are clearly not independent, because f(x, y) can't be factorized into the product of two PDFs, i.e. there is no pair of functions f_x(x) and f_y(y) such that f(x, y) = f_x(x) f_y(y).
Consider, for instance, three 'slices' of f(x, y) at x₀ = 0 and x₀ = ±μ = ±2.5. The function of y, f(x₀, y), for a fixed x₀, has two maxima for x₀ = 0 and a single maximum for x₀ = ±μ. For a factorized PDF, instead, the shape of f(x₀, y) = f_x(x₀) f_y(y) should be the same for all values of x₀, up to a scale factor f_x(x₀).
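Both claims of this example can be checked numerically: a Monte Carlo sample from the mixture gives a covariance compatible with zero, while the slices of the PDF show the non-factorizable shape change (a plain-Python sketch using the values μ = 2.5, σ = 0.7 from the example):

```python
import math
import random

random.seed(7)
mu, sigma = 2.5, 0.7

def g(z, m, s):
    # One-dimensional Gaussian PDF
    return math.exp(-(z - m)**2 / (2 * s**2)) / (s * math.sqrt(2 * math.pi))

def f(x, y):
    # Eq. (2.83): equal-weight mixture of four two-dimensional Gaussians
    return 0.25 * (g(x, mu, sigma) * g(y, 0, sigma) + g(x, -mu, sigma) * g(y, 0, sigma)
                 + g(x, 0, sigma) * g(y, mu, sigma) + g(x, 0, sigma) * g(y, -mu, sigma))

# Sample the mixture: pick one of the four components, then draw (x, y)
centers = [(mu, 0), (-mu, 0), (0, mu), (0, -mu)]
pts = [(random.gauss(cx, sigma), random.gauss(cy, sigma))
       for cx, cy in (random.choice(centers) for _ in range(50000))]
cov = sum(x * y for x, y in pts) / len(pts)   # the means are zero by symmetry
assert abs(cov) < 0.05                        # uncorrelated

# ...but not independent: the slice f(0, y) is bimodal, f(mu, y) is unimodal,
# which is impossible for a factorized PDF f(x, y) = fx(x) * fy(y)
assert f(0, mu) > f(0, 0)      # two maxima at y = +/- mu along x0 = 0
assert f(mu, 0) > f(mu, mu)    # single maximum at y = 0 along x0 = mu
```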
2.15.3 Conditional Distributions

Given a two-dimensional PDF f(x, y) and a fixed value x₀ of the variable x, the conditional distribution of y given x₀ is defined as:

  f(y \mid x_0) = \frac{f(x_0, y)}{\int f(x_0, y')\, dy'} \,.   (2.84)

The conditional distribution can be interpreted as being obtained by 'slicing' f(x, y) at x = x₀ and applying a normalization factor to the sliced one-dimensional distribution. An illustration of a conditional PDF is shown in Fig. 2.19.
Recalling Eq. (1.4), and considering again the example in Fig. 2.17, the definition of conditional distribution in Eq. (2.84) is consistent with the definition of conditional probability: P(B | A) = P(A ∩ B)/P(A), where B = "y ≤ ŷ < y + δy" and A = "x₀ ≤ x̂ < x₀ + δx", x̂ and ŷ being the extracted values of x and y, respectively.
In more than two dimensions, Eq. (2.84) can be generalized for a PDF of h + k variables (x⃗, y⃗) = (x₁, …, x_h, y₁, …, y_k) as:

  f(\vec{y} \mid \vec{x}_0) = \frac{f(\vec{x}_0, \vec{y}\,)}{\int f(\vec{x}_0, \vec{y}\,')\, dy'_1 \cdots dy'_k} \,.   (2.85)

Fig. 2.19 Illustration of a conditional PDF in two dimensions

2.16 Gaussian Distributions in Two or More Dimensions

Let us consider in two dimensions the product of two Gaussian distributions for the variables x′ and y′, having standard deviations σ_x′ and σ_y′, respectively, and for simplicity having both averages μ_x′ = μ_y′ = 0 (a translation can always be applied in order to generalize to the case μ_x′, μ_y′ ≠ 0):

  g'(x', y') = \frac{1}{2\pi\, \sigma_{x'} \sigma_{y'}} \exp\left[ -\frac{1}{2} \left( \frac{x'^2}{\sigma_{x'}^2} + \frac{y'^2}{\sigma_{y'}^2} \right) \right] .   (2.86)

Let us apply a rotation from (x′, y′) to new coordinates (x, y) by an angle φ, defined by:

  x' = x \cos\varphi - y \sin\varphi \,, \qquad y' = x \sin\varphi + y \cos\varphi \,.   (2.87)

The transformed PDF g(x, y) can be obtained using Eq. (2.24), considering that det|∂x′ᵢ/∂xⱼ| = 1, which leads to g′(x′, y′) = g(x, y).
g(x, y) has the form:

  g(x, y) = \frac{1}{2\pi\, |C|^{1/2}} \exp\left[ -\frac{1}{2}\, (x, y)\; C^{-1} \begin{pmatrix} x \\ y \end{pmatrix} \right] ,   (2.88)

where the matrix C⁻¹ is the inverse of the covariance matrix of the variables (x, y). C⁻¹ can be obtained by comparing Eqs. (2.86) and (2.88). The rotated variables defined in Eq. (2.87) can be substituted in the following equation:

  \frac{x'^2}{\sigma_{x'}^2} + \frac{y'^2}{\sigma_{y'}^2} = (x, y)\; C^{-1} \begin{pmatrix} x \\ y \end{pmatrix} ,   (2.89)

obtaining:

  C^{-1} = \begin{pmatrix}
    \dfrac{\cos^2\varphi}{\sigma_{x'}^2} + \dfrac{\sin^2\varphi}{\sigma_{y'}^2} &
    \sin\varphi \cos\varphi \left( \dfrac{1}{\sigma_{y'}^2} - \dfrac{1}{\sigma_{x'}^2} \right) \\[2ex]
    \sin\varphi \cos\varphi \left( \dfrac{1}{\sigma_{y'}^2} - \dfrac{1}{\sigma_{x'}^2} \right) &
    \dfrac{\sin^2\varphi}{\sigma_{x'}^2} + \dfrac{\cos^2\varphi}{\sigma_{y'}^2}
  \end{pmatrix} .   (2.90)

Considering that the covariance matrix should have the form:

  C = \begin{pmatrix} \sigma_x^2 & \rho_{xy}\, \sigma_x \sigma_y \\ \rho_{xy}\, \sigma_x \sigma_y & \sigma_y^2 \end{pmatrix} ,   (2.91)
where ρ_xy is the correlation coefficient defined in Eq. (1.25), the determinant of C⁻¹ that appears in Eq. (2.88) must be equal to:

  |C^{-1}| = \frac{1}{\sigma_{x'}^2\, \sigma_{y'}^2} = \frac{1}{\sigma_x^2\, \sigma_y^2\, (1 - \rho_{xy}^2)} \,.   (2.92)

Inverting the matrix C⁻¹ in Eq. (2.90), the covariance matrix in the rotated variables (x, y) is:

  C = \begin{pmatrix}
    \cos^2\varphi\, \sigma_{x'}^2 + \sin^2\varphi\, \sigma_{y'}^2 &
    \sin\varphi \cos\varphi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right) \\[1ex]
    \sin\varphi \cos\varphi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right) &
    \sin^2\varphi\, \sigma_{x'}^2 + \cos^2\varphi\, \sigma_{y'}^2
  \end{pmatrix} .   (2.93)

The variances of x and y and their correlation coefficient can be determined by comparing Eq. (2.93) to Eq. (2.91):

  \sigma_x^2 = \cos^2\varphi\, \sigma_{x'}^2 + \sin^2\varphi\, \sigma_{y'}^2 \,,   (2.94)

  \sigma_y^2 = \sin^2\varphi\, \sigma_{x'}^2 + \cos^2\varphi\, \sigma_{y'}^2 \,,   (2.95)

  \rho_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sin 2\varphi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right)}{\sqrt{ \sin^2 2\varphi \left( \sigma_{y'}^2 - \sigma_{x'}^2 \right)^2 + 4\, \sigma_{x'}^2 \sigma_{y'}^2 }} \,.   (2.96)

Equation (2.96) implies that the correlation coefficient is equal to zero if either σ_y′ = σ_x′ or if φ is a multiple of π/2. The following relation gives tan 2φ in terms of the elements of the covariance matrix:

  \tan 2\varphi = \frac{2\, \rho_{xy}\, \sigma_x \sigma_y}{\sigma_y^2 - \sigma_x^2} \,.   (2.97)

The transformed PDF can finally be written in terms of all the results obtained so far:

  g(x, y) = \frac{1}{2\pi\, \sigma_x \sigma_y \sqrt{1 - \rho_{xy}^2}} \exp\left[ -\frac{1}{2(1 - \rho_{xy}^2)} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x y\, \rho_{xy}}{\sigma_x \sigma_y} \right) \right] .   (2.98)

The geometrical interpretation of σ_x and σ_y in the rotated coordinate system is shown in Fig. 2.20, where the ellipse determined by the following equation is drawn:

  \frac{1}{1 - \rho_{xy}^2} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x y\, \rho_{xy}}{\sigma_x \sigma_y} \right) = 1 \,.   (2.99)
Fig. 2.20 One-sigma contour for a two-dimensional Gaussian PDF. The two ellipse axes have lengths equal to σ_x′ and σ_y′; the x′ axis is rotated by an angle φ with respect to the x axis, and the lines tangent to the ellipse parallel to the x and y axes, shown in gray, have distances with respect to the respective axes equal to σ_y and σ_x

It is possible to demonstrate that the horizontal and vertical lines tangent to the ellipse defined in Eq. (2.99) have distances with respect to their respective axes equal to σ_y and σ_x.
Similarly to the 1σ contour, defined in Eq. (2.99), the 2σ contour is defined by:

  \frac{1}{1 - \rho_{xy}^2} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x y\, \rho_{xy}}{\sigma_x \sigma_y} \right) = Z^2 = 4 \,.   (2.100)

Projecting the two-dimensional Gaussian in Eq. (2.98) on one of the two coordinates gives the following marginal PDFs, which correspond to the expected one-dimensional Gaussian distributions with standard deviations σ_x and σ_y, respectively:

  g_x(x) = \int_{-\infty}^{+\infty} g(x, y)\, dy = \frac{1}{\sqrt{2\pi \sigma_x^2}}\, e^{-x^2 / 2\sigma_x^2} \,,   (2.101)

  g_y(y) = \int_{-\infty}^{+\infty} g(x, y)\, dx = \frac{1}{\sqrt{2\pi \sigma_y^2}}\, e^{-y^2 / 2\sigma_y^2} \,.   (2.102)

Fig. 2.21 Plot of two-dimensional 1σ and 2σ Gaussian contours. Each one-dimensional projection of the 1σ or 2σ contour corresponds to a band which has a 68.27% or 95.45% probability content, respectively. As an example, three possible projections are shown: a vertical, a horizontal and a diagonal one. The probability content of the ellipses is smaller than the corresponding one-dimensional projected interval probabilities

In general, projecting a two-dimensional Gaussian PDF in any direction gives a one-dimensional Gaussian whose standard deviation is equal to the distance from the projection axis of the tangent line to the ellipse perpendicular to the axis along which the two-dimensional Gaussian is projected. This is visually shown in Fig. 2.21, where 1σ and 2σ contours are shown for a two-dimensional Gaussian.
Figure 2.21 shows three possible choices of 1σ and 2σ bands: one along the x axis, one along the y axis and one along a generic oblique direction.
Note that the probability corresponding to the ellipse defined by Eq. (2.99) is smaller than the 68.27% that corresponds to a one-dimensional 1σ interval, which may be defined similarly in one dimension by:

  \frac{x^2}{\sigma^2} = 1 \quad \Longrightarrow \quad x = \pm\sigma \,.   (2.103)

The probability values corresponding to Zσ one-dimensional intervals for a Gaussian distribution are determined by Eq. (2.32). The corresponding result for the two-dimensional case can be obtained by integrating g(x, y) in two dimensions over the ellipse E_Z corresponding to Zσ:

  P_{2D}(Z) = \int_{E_Z} g(x, y)\, dx\, dy \,,   (2.104)

where

  E_Z = \left\{ (x, y) : \frac{1}{1 - \rho_{xy}^2} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2\, x y\, \rho_{xy}}{\sigma_x \sigma_y} \right) \le Z^2 \right\} .   (2.105)

The integral in Eq. (2.104), written in polar coordinates, simplifies to:

  P_{2D}(Z) = \int_0^{Z} e^{-r^2/2}\, r\, dr = 1 - e^{-Z^2/2} \,,   (2.106)
Table 2.1 Probabilities corresponding to Zσ one-dimensional intervals and two-dimensional contours for different values of Z. Bold values correspond to the 1σ, 2σ and 3σ probabilities for a one-dimensional Gaussian

  Zσ        P_1D     P_2D
  1σ        0.6827   0.3934
  2σ        0.9545   0.8647
  3σ        0.9973   0.9889
  1.515σ    0.8702   0.6827
  2.486σ    0.9871   0.9545
  3.439σ    0.9994   0.9973

which can be compared to the one-dimensional case:

  P_{1D}(Z) = \sqrt{\frac{2}{\pi}} \int_0^{Z} e^{-x^2/2}\, dx = \mathrm{erf}\!\left( \frac{Z}{\sqrt{2}} \right) .   (2.107)

The probabilities corresponding to 1σ, 2σ and 3σ for the one- and two-dimensional cases are reported in Table 2.1. The two-dimensional integrals are, in all cases, smaller than the one-dimensional ones for a given Z. In particular, in order to recover the same probability content as the corresponding one-dimensional interval, one would need to artificially enlarge a two-dimensional ellipse from 1σ to 1.515σ, from 2σ to 2.486σ, and from 3σ to 3.439σ. Usually, results are reported in the literature as 1σ and 2σ contours, without any artificial interval enlargement, and the conventional probability content of 68.27% or 95.45% refers to any one-dimensional projection of those contours.
The generalization to n dimensions of the two-dimensional Gaussian described in Eq. (2.98) is:

  g(x_1, \cdots, x_n) = \frac{1}{(2\pi)^{n/2}\, |C|^{1/2}} \exp\left[ -\frac{1}{2} \sum_{i,j} (x_i - \mu_i)\, C^{-1}_{ij}\, (x_j - \mu_j) \right] ,   (2.108)

where μᵢ is the average of the variable xᵢ and Cᵢⱼ is the n × n covariance matrix of the variables x₁, …, x_n.

References

1. Bjorken, J., Drell, S.: Relativistic Quantum Fields. McGraw-Hill, New York (1965)
2. ARGUS Collaboration, Albrecht, H., et al.: Search for hadronic b → u decays. Phys. Lett. B 241, 278–282 (1990)
3. Gaiser, J.: Charmonium spectroscopy from radiative decays of the J/ψ and ψ′. Ph.D. thesis, Stanford University (1982). Appendix F
4. Landau, L.: On the energy loss of fast particles by ionization. J. Phys. (USSR) 8, 201 (1944)
5. Allison, W., Cobb, J.: Relativistic charged particle identification by energy loss. Annu. Rev. Nucl. Part. Sci. 30, 253–298 (1980)
Chapter 3
Bayesian Approach to Probability

3.1 Introduction

The Bayesian approach to probability allows one to quantitatively determine probability values for statements whose truth or falsity is not known with certainty.
Bayesian probability has a wider range of applicability than frequentist probability (see Sect. 1.18), which can instead be applied to repeatable cases only. While under the frequentist approach one can only determine the probability that a random variable lies within a certain interval, the Bayesian approach also allows determining the probability that the value of an unknown parameter lies within a certain interval, which would have no frequentist meaning, since an unknown parameter is not a random variable.
The mathematical procedure needed to quantitatively define Bayesian probability starts from an extension of Bayes' theorem, which is presented in the following section. Bayes' theorem has general validity for any approach to probability, including frequentist probability.

3.2 Bayes' Theorem

The conditional probability, introduced in Eq. (1.4), defines the probability of an event A under the condition that the event B has occurred:

  P(A \mid B) = \frac{P(A \cap B)}{P(B)} \,.   (3.1)
© Springer International Publishing AG 2017 59


L. Lista, Statistical Methods for Data Analysis in Particle Physics,
Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0_3
Fig. 3.1 Visualization of the conditional probabilities P(A | B) and P(B | A). The events A and B are represented as subsets of a sample space Ω. Representation by R. Cousins

The probability of the event B given the event A, vice versa, can be written as:

  P(B \mid A) = \frac{P(A \cap B)}{P(A)} \,.   (3.2)

This situation is visualized in Fig. 3.1.


Extracting from Eqs. (3.1) and (3.2) the common term P(A ∩ B), the following relation is obtained:

  P(A \mid B)\, P(B) = P(B \mid A)\, P(A) \,,   (3.3)

from which Bayes' theorem can be derived in the following form:

  P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \,.   (3.4)

The probability P(A) can be interpreted as the probability of the event A before the knowledge that the event B has occurred (prior probability), while P(A | B) is the probability of the same event A, having as further information the knowledge that the event B has occurred (posterior probability).
A visual derivation of Bayes' theorem is presented in Fig. 3.2, using the visual notation of Fig. 3.1.
Fig. 3.2 Visualization of Bayes' theorem. The areas of the events A and B, equal to P(A) and P(B), respectively, simplify when P(A | B) P(B) and P(B | A) P(A) are multiplied. Representation by R. Cousins

Example 3.8 An Epidemiology Example

Bayes' theorem allows the so-called 'inversion' of conditional probability, which occurs in several popular examples. A typical case is finding the probability that a person who received a positive diagnosis of some illness is really ill, knowing the probability that the test may give a false positive outcome. This example has been reported in several lecture series and books, for instance in [1, 2].
Assume we know that, if a person is really ill, the probability that the test gives a positive result is 100%. But the test also has a small probability, say 0.2%, of giving a false positive result on a healthy person.
If a random person is tested positive and diagnosed with the illness, what is the probability that he/she is really ill?
A common mistake is to conclude that the probability is equal to 100.0% − 0.2% = 99.8%. In the following, it will become clear why this answer is wrong.
The problem can be formulated more precisely as follows, where '+' and '−' indicate positive and negative test results:

  P(+ \mid \mathrm{ill}) \simeq 100\% \,,   (3.5)
  P(- \mid \mathrm{ill}) \simeq 0\% \,,   (3.6)
  P(+ \mid \mathrm{healthy}) = 0.2\% \,,   (3.7)
  P(- \mid \mathrm{healthy}) = 99.8\% \,.   (3.8)

The answer to our question is P(ill | +). Using Bayes' theorem, the conditional probability can be 'inverted' as follows:

  P(\mathrm{ill} \mid +) = \frac{P(+ \mid \mathrm{ill})\, P(\mathrm{ill})}{P(+)} \,,   (3.9)

which, since P(+ | ill) ≃ 1, gives approximately:

  P(\mathrm{ill} \mid +) \simeq \frac{P(\mathrm{ill})}{P(+)} \,.   (3.10)

A missing ingredient in the problem can be identified from Eq. (3.10): P(ill), the probability that a random person in the population under consideration is really ill (regardless of any possibly performed test), was not given. In a normal situation of a generally healthy population, we can expect P(ill) ≪ P(healthy). Using:

  P(\mathrm{ill}) + P(\mathrm{healthy}) = 1 \,,   (3.11)

and:

  P(\mathrm{ill\ and\ healthy}) = 0 \,,   (3.12)

P(+) can be decomposed as follows, according to the law of total probability (Eq. (1.12) in Sect. 1.11):

  P(+) = P(+ \mid \mathrm{ill})\, P(\mathrm{ill}) + P(+ \mid \mathrm{healthy})\, P(\mathrm{healthy}) \simeq P(\mathrm{ill}) + P(+ \mid \mathrm{healthy}) \,.   (3.13)

The probability P(ill | +) can then be written, using Eq. (3.13), as:

  P(\mathrm{ill} \mid +) = \frac{P(\mathrm{ill})}{P(+)} \simeq \frac{P(\mathrm{ill})}{P(\mathrm{ill}) + P(+ \mid \mathrm{healthy})} \,.   (3.14)

Assuming that P(ill) is smaller than P(+ | healthy), P(ill | +) will turn out to be smaller than 50%. For instance, if P(ill) = 0.15%, compared with the assumed P(+ | healthy) = 0.2%, then:

  P(\mathrm{ill} \mid +) = \frac{0.15}{0.15 + 0.20} = 43\% \,.   (3.15)

The probability of being really ill, given the positive diagnosis, is very different from the naïve conclusion, according to which one would be most likely really ill. The situation can be visualized, changing the proportions a bit in order to have a better presentation, in Fig. 3.3.
Fig. 3.3 Visualization of the ill/healthy problem. The red areas correspond to the cases of a positive diagnosis for an ill person (P(+ | ill), vertical red area) and a positive diagnosis for a healthy person (P(+ | healthy), horizontal red area). The probability of being really ill in the case of a positive diagnosis, P(ill | +), is equal to the ratio of the vertical red area to the total red area. In the example it was assumed that P(− | ill) is very small

A large probability of a positive diagnosis in case of illness does not imply that a positive diagnosis corresponds to a large probability of being really ill. The correct answer depends as well on the prior probability for a random person in the population to be ill, P(ill). Bayes' theorem allows one to compute the posterior probability P(ill | +) in terms of the prior probability and the probability of a positive diagnosis for an ill person, P(+ | ill).
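The arithmetic of Example 3.8 can be reproduced in a few lines (plain Python; here the exact denominator of Eq. (3.13) is used rather than its approximation):

```python
def p_ill_given_positive(p_pos_ill, p_pos_healthy, p_ill):
    # Bayes' theorem with the law of total probability in the denominator
    p_healthy = 1.0 - p_ill
    p_pos = p_pos_ill * p_ill + p_pos_healthy * p_healthy
    return p_pos_ill * p_ill / p_pos

# Numbers from Example 3.8: P(+|ill) = 1, P(+|healthy) = 0.2%, P(ill) = 0.15%
post = p_ill_given_positive(1.0, 0.002, 0.0015)
assert abs(post - 0.4286) < 0.001   # ~43%, far from the naive 99.8%
```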

Example 3.9 Particle Identification and Purity of a Sample

The previous Example 3.8 can be applied to a selection based on a particle identification detector, and the conclusion will appear less counterintuitive than in the previous case, since the situation is more familiar to a physicist's experience.
Consider a muon detector that gives a positive signal when traversed by a muon with an efficiency ε = P(+ | μ), and gives a false positive signal when traversed by a pion with a probability δ = P(+ | π).
Given a collection of particles that can be either muons or pions, what is the probability that a selected particle is really a muon, i.e. P(μ | +)?
As in the previous example, in order to give an answer, one also needs to provide the prior probabilities, i.e. the probabilities that a random particle from the sample is really a muon or a pion, P(μ) and P(π) = 1 − P(μ), respectively. Using Bayes' theorem, together with Eq. (1.12), one can write:

  P(\mu \mid +) = \frac{P(+ \mid \mu)\, P(\mu)}{P(+)} = \frac{P(+ \mid \mu)\, P(\mu)}{P(+ \mid \mu)\, P(\mu) + P(+ \mid \pi)\, P(\pi)} \,.   (3.16)

The purity of the sample, f_μ^sel, is the fraction of muons in the sample of selected particles. In terms of the fraction of muons f_μ = P(μ) and the fraction of pions f_π = P(π) of the original sample, we can write:

  f_\mu^{\mathrm{sel}} = P(\mu \mid +) = \frac{\varepsilon\, f_\mu}{\varepsilon\, f_\mu + \delta\, f_\pi} \,.   (3.17)
Another consequence of Bayes' theorem is the relation between the ratio of posterior probabilities and the ratio of prior probabilities. The posteriors' ratio, also called posterior odds, can be written as:

  \frac{P(\mu \mid +)}{P(\pi \mid +)} = \frac{P(+ \mid \mu)}{P(+ \mid \pi)} \cdot \frac{P(\mu)}{P(\pi)} \,.   (3.18)

The above expression also holds if more than two possible particle types are present in the sample (say muons, pions, and kaons), and it does not require computing the denominator present in Eq. (3.16), which would be needed, instead, in order to compute the individual probabilities related to all possible particle cases.
Section 3.6 will discuss posterior odds and their use further.
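Equations (3.17) and (3.18) can be illustrated with a small numerical sketch (plain Python; the efficiency, mis-identification probability and sample composition below are hypothetical values, not taken from the text):

```python
def purity(eps, delta, f_mu):
    # Eq. (3.17): fraction of true muons among positively tagged particles
    f_pi = 1.0 - f_mu
    return eps * f_mu / (eps * f_mu + delta * f_pi)

# Hypothetical detector: 95% muon efficiency, 1% pion mis-id, 10% muons in sample
eps, delta, f_mu = 0.95, 0.01, 0.10
p = purity(eps, delta, f_mu)
assert abs(p - 0.9135) < 1e-3   # the selection strongly enriches the sample

# Eq. (3.18): posterior odds = likelihood ratio times prior odds
posterior_odds = p / (1 - p)
prior_odds = f_mu / (1 - f_mu)
assert abs(posterior_odds - (eps / delta) * prior_odds) < 1e-9
```

Note that the posterior-odds form never requires the full denominator of Eq. (3.16), which is why it generalizes easily to more than two particle hypotheses.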

3.3 Bayesian Probability Definition

In the above Examples 3.8 and 3.9, Bayes' theorem was applied to cases that can be considered under the frequentist domain. The formulation of Bayes' theorem, as from Eq. (3.4):

  P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \,,   (3.19)

can also be interpreted as follows: before we know that B is true, our degree of belief in the event A is equal to the prior probability P(A). After we know that B is true, our degree of belief in the event A changes and becomes equal to the posterior probability P(A | B).
Note that prior and posterior probabilities apply not only to the case in which the former is the probability before and the latter after the event B has occurred, in the chronological sense; more in general, they refer to before and after the knowledge of B, i.e. B may also have occurred (or not), but we don't have any knowledge of B yet.
Using this interpretation, the definition of probability, in this new Bayesian sense, can be extended to events that are not associated with random outcomes of repeatable experiments, but may represent statements about unknown facts, like "my football team will win the next match", or "the mass of a dark-matter candidate particle is between 1000 and 1500 GeV". We can consider a prior probability P(A) of such an unknown statement, representing a measure of our 'prejudice' about that statement, before the acquisition of any information that could modify our knowledge. After we know that the event B has occurred, our knowledge of A should change, and our degree of belief must become equal to the posterior probability P(A | B). In other words, Bayes' theorem gives us a quantitative prescription for how to rationally update our subjective degree of belief from an initial prejudice, in the light of newly available information. However, starting from different priors (i.e. different prejudices), different posteriors will be determined.
The term P(B) that appears in the denominator of Eq. (3.4) can be considered as a normalization factor. The sample space $\Omega$ can be decomposed into a partition $A_1, \cdots, A_N$, where:

\bigcup_{i=1}^{N} A_i = \Omega \quad \text{and} \quad A_i \cap A_j = \emptyset \;\; \forall\, i \neq j ,   (3.20)

in order to apply the law of total probability in Eq. (1.12), as already done in Examples 3.8 and 3.9, discussed in the previous section:

P(B) = \sum_{i=1}^{N} P(B \mid A_i)\, P(A_i) .   (3.21)

The Bayesian definition of probability obeys Kolmogorov's axioms of probability, as defined in Sect. 1.7; hence all the properties of probability discussed in Chap. 1 also apply to Bayesian probability.
An intrinsically unavoidable feature of Bayesian probability is that the probability associated with an event A cannot be defined without a prior probability of that event, which makes Bayesian probability intrinsically subjective.

Example 3.10 Extreme Cases of Prior Beliefs

Consider a set of possible events $\{A_i\}$ that constitute a non-intersecting partition of $\Omega$. Imagine that we have as prior probability for each $A_i$:

P(A_i) = \begin{cases} 1 & \text{if } i = 0 \\ 0 & \text{if } i \neq 0 \end{cases} .   (3.22)

This corresponds to the belief that $A_0$ is absolutely true, and all other alternatives $A_i$ are absolutely false for $i \neq 0$. We will demonstrate that, whatever knowledge of any event B is achieved, the posterior probability of any $A_i$ will not differ from the prior probability:

P(A_i \mid B) = P(A_i), \quad \forall\, B .   (3.23)

From Bayes' theorem:

P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{P(B)} ,   (3.24)

but, if $i \neq 0$, clearly:

P(A_i \mid B) = \frac{P(B \mid A_i) \cdot 0}{P(B)} = 0 = P(A_i) .   (3.25)

If $i = 0$, instead, assuming $P(B \mid A_0) \neq 0$:

P(A_0 \mid B) = \frac{P(B \mid A_0) \cdot 1}{\sum_i P(B \mid A_i)\, P(A_i)} = \frac{P(B \mid A_0)}{P(B \mid A_0) \cdot 1} = 1 = P(A_0) .   (3.26)

This situation reflects the case that we may call dogma, or religious belief, i.e. the case in which someone has such a strong prejudice about the $A_i$ that no event B, i.e. no new knowledge, can change his/her degree of belief.
The scientific method has allowed mankind's knowledge of Nature to evolve through history by progressively adding knowledge based on the observation of new experimental evidence. The history of science is full of examples in which theories believed to be true were falsified by new or more precise observations, and new and better theories replaced the old ones.
According to Eq. (3.23), instead, scientific progress is not possible in the presence of religious beliefs about observable facts.
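The conclusion of Eq. (3.23) can be verified with a short numerical sketch; the partition size and the likelihood values P(B | A_i) below are arbitrary illustrative choices:

```python
# A delta-like prior concentrated on A_0, plus arbitrary likelihoods P(B | A_i)
prior = [1.0, 0.0, 0.0, 0.0]
likelihood = [0.2, 0.7, 0.9, 0.4]   # P(B | A_i), illustrative values

# Bayes' theorem with the law of total probability, Eqs. (3.24) and (3.21)
p_b = sum(l * p for l, p in zip(likelihood, prior))
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]

print(posterior)   # identical to the prior: no observation can change it
```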

3.4 Bayesian Probability and Likelihood Functions

Given a sample $(x_1, \cdots, x_n)$ of n random variables whose PDF has a known form that depends on m parameters, $\theta_1, \cdots, \theta_m$, the likelihood function is defined as the probability density at the point $(x_1, \cdots, x_n)$ for fixed values of the parameters $\theta_1, \cdots, \theta_m$:

L(x_1, \cdots, x_n; \theta_1, \cdots, \theta_m) = \left. \frac{\mathrm{d}P(x_1, \cdots, x_n)}{\mathrm{d}x_1 \cdots \mathrm{d}x_n} \right|_{\theta_1, \cdots, \theta_m} .   (3.27)

The notation $L(x_1, \cdots, x_n \mid \theta_1, \cdots, \theta_m)$ is sometimes used in place of $L(x_1, \cdots, x_n; \theta_1, \cdots, \theta_m)$, similarly to the notation used for conditional probability.
The likelihood function will be discussed more extensively in Sect. 5.10.1.
The posterior Bayesian probability distribution function for the parameters $\theta_1, \cdots, \theta_m$, given the observation of $(x_1, \cdots, x_n)$, can be defined using the likelihood function in Eq. (3.27):

P(\theta_1, \cdots, \theta_m \mid x_1, \cdots, x_n) = \frac{L(x_1, \cdots, x_n \mid \theta_1, \cdots, \theta_m)\, \pi(\theta_1, \cdots, \theta_m)}{\int L(x_1, \cdots, x_n \mid \theta'_1, \cdots, \theta'_m)\, \pi(\theta'_1, \cdots, \theta'_m)\, \mathrm{d}^m \theta'} ,   (3.28)

where the probability distribution function $\pi(\theta_1, \cdots, \theta_m)$ is the prior PDF of the parameters $\theta_1, \cdots, \theta_m$, i.e. our degree of belief about the unknown parameters before the observation of $(x_1, \cdots, x_n)$. The denominator in Eq. (3.28), coming from an extension of the law of total probability, is clearly interpreted as a normalization of the posterior PDF.
Fred James et al. wrote the following about the posterior probability density given by Eq. (3.28):

    The difference between $\pi(\theta)$ and $P(\theta \mid x)$ shows how one's knowledge (degree of belief) about $\theta$ has been modified by the observation x. The distribution $P(\theta \mid x)$ summarizes all one's knowledge of $\theta$ and can be used accordingly [3].

3.4.1 Repeated Use of Bayes’ Theorem and Learning Process

If we initially have a prior PDF $\pi(\theta) = P_0(\theta)$ for an unknown parameter $\theta$, Bayes' theorem can be applied after an observation $x_1$ in order to obtain the posterior probability:

P_1(\theta) \propto P_0(\theta)\, L(x_1 \mid \theta) ,   (3.29)

where the normalization factor $\int P_0(\theta')\, L(x_1 \mid \theta')\, \mathrm{d}\theta'$ has been omitted.

After a second observation $x_2$, independent of $x_1$, the combined likelihood function, corresponding to the two measurements $x_1$ and $x_2$, is given by the product of the individual likelihood functions:

L(x_1, x_2 \mid \theta) = L(x_1 \mid \theta)\, L(x_2 \mid \theta) .   (3.30)

Bayes' theorem can be applied again, giving:

P_2(\theta) \propto P_0(\theta)\, L(x_1, x_2 \mid \theta) = P_0(\theta)\, L(x_1 \mid \theta)\, L(x_2 \mid \theta) ,   (3.31)

where again a normalization factor has been omitted. Equation (3.31) can be interpreted as the application of Bayes' theorem to the observation of $x_2$ with prior probability $P_1(\theta)$, which was the posterior probability after the observation of $x_1$ (Eq. (3.29)).
Considering a third independent observation $x_3$, Bayes' theorem again gives:

P_3(\theta) \propto P_2(\theta)\, L(x_3 \mid \theta) = P_0(\theta)\, L(x_1 \mid \theta)\, L(x_2 \mid \theta)\, L(x_3 \mid \theta) .   (3.32)

As more measurements are added, Bayes' theorem can be applied repeatedly. This allows interpreting the application of Bayes' theorem as a learning process, where one's knowledge about an unknown parameter is influenced and improved by the subsequent observations $x_1, x_2, x_3$, and so on.
The more measurements $x_1, \cdots, x_n$ are added, the less sensitive the final posterior probability $P_n(\theta)$ is to the choice of the prior probability $\pi(\theta) = P_0(\theta)$, because the $\theta$ range in which $L(x_1, \cdots, x_n \mid \theta)$ is significantly different from zero becomes smaller and smaller, and, within a very small $\theta$ range, a reasonably smooth prior $\pi(\theta)$ can be approximated by a constant value that cancels in the normalization of the posterior.
In this sense, a sufficiently large number of observations may asymptotically remove any dependence on subjective choices of the prior probability, provided the prior is a sufficiently smooth and regular function. This was not the case with the extreme assumptions considered in Example 3.10.
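The equivalence between one update with the combined likelihood, Eq. (3.31), and two successive single-observation updates can be checked on a discretized parameter grid. This is a sketch with a Gaussian likelihood; the observations, the grid, and the resolution are arbitrary illustrative choices:

```python
import math

def normalize(p, d_theta):
    """Normalize a discretized PDF so that sum(p) * d_theta = 1."""
    s = sum(p) * d_theta
    return [v / s for v in p]

def gauss(x, mu, sigma):
    # unnormalized Gaussian likelihood; constants cancel in the normalization
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Discretized parameter grid and a flat prior P0(theta)
d_theta = 0.01
thetas = [i * d_theta for i in range(401)]          # theta in [0, 4]
p0 = normalize([1.0] * len(thetas), d_theta)

x1, x2, sigma = 1.9, 2.1, 0.5                       # two observations (illustrative)

# Update twice: P1 ∝ P0 L(x1|theta), then P2 ∝ P1 L(x2|theta)
p1 = normalize([p * gauss(x1, t, sigma) for p, t in zip(p0, thetas)], d_theta)
p2_seq = normalize([p * gauss(x2, t, sigma) for p, t in zip(p1, thetas)], d_theta)

# Update once with the product likelihood L(x1|theta) L(x2|theta), Eq. (3.31)
p2_comb = normalize([p * gauss(x1, t, sigma) * gauss(x2, t, sigma)
                     for p, t in zip(p0, thetas)], d_theta)

max_diff = max(abs(a - b) for a, b in zip(p2_seq, p2_comb))
print(max_diff)   # numerically zero: the two procedures agree
```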

3.5 Bayesian Inference

This section discusses how the estimate of unknown parameters and their uncertainties can be addressed using the posterior Bayesian PDF of the unknown parameters from Eq. (3.28). Chapter 5, and in more detail Chap. 7, will discuss the estimate of unknown parameters and their uncertainties more generally, and in particular also using the frequentist approach.
First, the most likely values $\hat{\vec{\theta}}$ of the unknown parameters (or the most likely value, in the case of a single parameter) can be determined as the maximum of the

posterior PDF:

P(\vec{\theta} \mid \vec{x}) = \frac{L(\vec{x}; \vec{\theta})\, \pi(\vec{\theta})}{\int L(\vec{x}; \vec{\theta}')\, \pi(\vec{\theta}')\, \mathrm{d}^h \theta'} .   (3.33)

The most likely values can be taken as the measurement of the unknown parameters, affected by an uncertainty that will be discussed in Sect. 3.5.2. Similarly, the average value $\langle \vec{\theta} \rangle$ can also be determined from the same posterior PDF.
If no prior information is available about the parameters $\vec{\theta}$, the prior density should not privilege particular parameter values. In those cases, the prior is called an uninformative prior. A uniform distribution may appear the most natural choice, but it is not invariant under reparametrization, as discussed in Sect. 2.1. An invariant uninformative prior is defined in Sect. 3.8.
If a uniform prior distribution is assumed, the most likely parameter values are the ones that maximize the likelihood, since the posterior PDF is equal, up to a normalization constant, to the likelihood function:

\left. P(\vec{\theta} \mid \vec{x}) \right|_{\pi(\vec{\theta}) = \mathrm{const.}} = \frac{L(\vec{x}; \vec{\theta})}{\int L(\vec{x}; \vec{\theta}')\, \mathrm{d}^h \theta'} .   (3.34)

This gives results similar to the frequentist approach, as noted in Sect. 5.10, where the maximum likelihood estimator will be introduced. The same result, of course, does not necessarily hold in the case of a non-uniform prior PDF.
Usually, in Bayesian applications, the computation of posterior probabilities and, in general, of most quantities of interest requires integrations that, in the vast majority of realistic cases, can only be performed using computer algorithms. Markov chain Monte Carlo (see Sect. 4.8) is one of the most performant numerical integration methods for Bayesian computations.

3.5.1 Parameters of Interest and Nuisance Parameters

Imagine that a number of parameters is needed to define our probability model, but we are interested only in a subset of them, say $\vec{\theta} = (\theta_1, \cdots, \theta_h)$. These are the parameters of interest, while the remaining parameters, $\vec{\nu} = (\nu_1, \cdots, \nu_l)$, may be needed to model our PDF but should not appear among the final results of our measurement. Those parameters are called nuisance parameters.
The posterior PDF for both sets of parameters can be written as:

P(\vec{\theta}, \vec{\nu} \mid \vec{x}) = \frac{L(\vec{x}; \vec{\theta}, \vec{\nu})\, \pi(\vec{\theta}, \vec{\nu})}{\int L(\vec{x}; \vec{\theta}', \vec{\nu}')\, \pi(\vec{\theta}', \vec{\nu}')\, \mathrm{d}^h \theta'\, \mathrm{d}^l \nu'} ,   (3.35)

and the posterior PDF for the parameters $\vec{\theta}$ only can be obtained as a marginal PDF, integrating Eq. (3.35) over all the remaining parameters $\vec{\nu}$:

P(\vec{\theta} \mid \vec{x}) = \int P(\vec{\theta}, \vec{\nu} \mid \vec{x})\, \mathrm{d}^l \nu = \frac{\int L(\vec{x}; \vec{\theta}, \vec{\nu})\, \pi(\vec{\theta}, \vec{\nu})\, \mathrm{d}^l \nu}{\int L(\vec{x}; \vec{\theta}', \vec{\nu}')\, \pi(\vec{\theta}', \vec{\nu}')\, \mathrm{d}^h \theta'\, \mathrm{d}^l \nu'} .   (3.36)

Using Eq. (3.36), nuisance parameters can be treated, under the Bayesian approach, with a simple integration. Section 10.10 will discuss the treatment of nuisance parameters in more detail, including the frequentist approach.

3.5.2 Credible Intervals

Given a posterior PDF for an unknown parameter of interest $\theta$, intervals $[\theta^{\mathrm{lo}}, \theta^{\mathrm{up}}]$ can be determined such that the integral of the posterior PDF from $\theta^{\mathrm{lo}}$ to $\theta^{\mathrm{up}}$ corresponds to a given probability value, usually indicated with $1 - \alpha$. The most frequent and natural choice of $1 - \alpha$ is 68.27%, corresponding to a $\pm 1\sigma$ interval for a normal distribution.
Probability intervals determined with the Bayesian approach from the posterior PDF are called credible intervals, and they reflect the uncertainty in the measurement of the unknown parameter, taken as the most likely value according to the posterior PDF.
The choice of the interval for a fixed probability level $1 - \alpha$, however, still has some degree of arbitrariness, since different interval choices are possible, all having the same probability level. Some examples are given below:
• a central interval $[\theta^{\mathrm{lo}}, \theta^{\mathrm{up}}]$ such that the two complementary intervals $]-\infty, \theta^{\mathrm{lo}}[$ and $]\theta^{\mathrm{up}}, +\infty[$ each correspond to a probability of $\alpha/2$;
• a fully asymmetric interval $]-\infty, \theta^{\mathrm{up}}]$ with corresponding probability $1 - \alpha$;
• a fully asymmetric interval $[\theta^{\mathrm{lo}}, +\infty[$ with corresponding probability $1 - \alpha$;
• a symmetric interval around the value with maximum probability $\hat{\theta}$: $[\theta^{\mathrm{lo}} = \hat{\theta} - \delta, \theta^{\mathrm{up}} = \hat{\theta} + \delta]$, corresponding to the specified probability $1 - \alpha$;
• the interval $[\theta^{\mathrm{lo}}, \theta^{\mathrm{up}}]$ with the smallest width corresponding to the specified probability $1 - \alpha$;
• etc.
Cases with fully asymmetric intervals lead to upper or lower limits on the parameter of interest, determined as the upper or lower bound, respectively, of the asymmetric interval. A probability level $1 - \alpha$ of 0.90 or 0.95 is usually chosen when upper or lower limits are reported. The first four of the possible interval choices listed above are shown in Fig. 3.4.
As a result of an inference, a credible interval is usually reported as the corresponding error, or uncertainty, of the measurement:

\theta = \hat{\theta} \pm \delta ,   (3.37)

Fig. 3.4 Different probability interval choices at 68.27% (top) or 90% (bottom), shown as shaded areas under the posterior $P(\theta \mid x)$. The most probable value is shown as a dashed vertical line. Top left: central interval; the left and right tails have equal probability. Top right: symmetric interval; the most probable value lies at the center of the interval. Bottom left: fully asymmetric interval for an upper limit. Bottom right: fully asymmetric interval for a lower limit

in the case of a symmetric interval $[\theta^{\mathrm{lo}} = \hat{\theta} - \delta, \theta^{\mathrm{up}} = \hat{\theta} + \delta]$, or, with an asymmetric notation:

\theta = \hat{\theta}\,^{+\delta_+}_{-\delta_-} ,   (3.38)

in the case of an asymmetric interval $[\theta^{\mathrm{lo}} = \hat{\theta} - \delta_-, \theta^{\mathrm{up}} = \hat{\theta} + \delta_+]$.
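A central credible interval can be extracted from any discretized posterior by scanning its cumulative distribution. As an illustration (a sketch, not the book's own code), take $P(\theta \mid x) = e^{-\theta}$ for $\theta \geq 0$, whose quantiles are known in closed form, so the result can be checked exactly; the grid range and step are arbitrary choices:

```python
import math

d = 1e-4
thetas = [i * d for i in range(200000)]          # theta in [0, 20)
post = [math.exp(-t) * d for t in thetas]        # per-bin probabilities
norm = sum(post)
post = [p / norm for p in post]                  # normalize the discretized posterior

alpha = 0.10                                     # 1 - alpha = 90%

def quantile(prob):
    """Smallest grid point where the cumulative posterior reaches `prob`."""
    acc = 0.0
    for t, p in zip(thetas, post):
        acc += p
        if acc >= prob:
            return t
    return thetas[-1]

theta_lo = quantile(alpha / 2)                   # central interval bounds
theta_up = quantile(1 - alpha / 2)
print(theta_lo, theta_up)   # close to -ln(0.95) and -ln(0.05)
```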

Example 3.11 Posterior for a Poisson Rate

Let us consider the case of a Poisson distribution $P(n \mid s)$, where a certain value of n is observed. Assuming a prior PDF $\pi(s)$, the posterior for s is given by Eq. (3.28):

P(s \mid n) = \frac{\dfrac{s^n e^{-s}}{n!}\, \pi(s)}{\displaystyle\int_0^\infty \dfrac{s'^n e^{-s'}}{n!}\, \pi(s')\, \mathrm{d}s'} .   (3.39)

Taking $\pi(s)$ as a constant, the normalization factor in the denominator becomes:

\frac{1}{n!} \int_0^\infty s^n e^{-s}\, \mathrm{d}s = \left. -\frac{\Gamma(n+1, s)}{n!} \right|_0^\infty = 1 .   (3.40)

The posterior PDF is then:

P(s \mid n) = \frac{s^n e^{-s}}{n!} .   (3.41)

Equation (3.41) has the same expression as the original Poisson distribution, but this time it is interpreted as the posterior PDF of the unknown parameter s, given the observation n.
Figure 3.5 shows the distribution of $P(s \mid n)$ in Eq. (3.41) for the cases n = 5 and n = 0. A 68.27% central probability interval is also shown in the former case, and a 90% fully asymmetric interval in the latter.
The most probable value of s can be determined according to the posterior in Eq. (3.41):

\hat{s} = n .   (3.42)

The average value and variance of s can also be determined from Eq. (3.41):

\langle s \rangle = n + 1 ,   (3.43)
\mathrm{Var}[s] = n + 1 .   (3.44)

Note that the most probable value $\hat{s}$ is different from the average value $\langle s \rangle$, since the distribution is not symmetric.
These results depend, of course, on the choice of the prior $\pi(s)$. A constant prior may not be the most 'natural' choice. Considering the prior choice due to Jeffreys, discussed in Sect. 3.8, a prior $\pi(s) \propto 1/\sqrt{s}$ should be chosen instead.

Fig. 3.5 Poisson posterior PDFs $P(s \mid n)$ for n = 5, with a central 68.27% probability interval, and for n = 0, with a fully asymmetric 90% probability interval. Intervals are shown as shaded areas
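The results in Eqs. (3.40) and (3.42)–(3.44) can be cross-checked by direct numerical integration of the posterior in Eq. (3.41); this is a sketch, and the grid range and step are arbitrary choices:

```python
import math

n = 5
d = 1e-3
s_grid = [i * d for i in range(40000)]                     # s in [0, 40)
post = [s ** n * math.exp(-s) / math.factorial(n) for s in s_grid]

norm = sum(p * d for p in post)                            # Eq. (3.40): ~1
mode = s_grid[max(range(len(post)), key=post.__getitem__)] # Eq. (3.42): s_hat = n
mean = sum(s * p * d for s, p in zip(s_grid, post)) / norm # Eq. (3.43): <s> = n + 1
var = sum((s - mean) ** 2 * p * d
          for s, p in zip(s_grid, post)) / norm            # Eq. (3.44): Var[s] = n + 1

print(norm, mode, mean, var)
```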

3.6 Bayes Factors

As seen at the end of Example 3.9, there is a convenient way to compare the probabilities of two hypotheses using Bayes' theorem that does not require knowledge of all possible hypotheses.
Using Bayes' theorem, one can write the ratio of posterior probabilities evaluated under two hypotheses $H_0$ and $H_1$, given our observation $\vec{x}$, called the posterior odds of the hypothesis $H_1$ versus the hypothesis $H_0$:

\frac{P(H_1 \mid \vec{x})}{P(H_0 \mid \vec{x})} = \frac{P(\vec{x} \mid H_1)}{P(\vec{x} \mid H_0)} \times \frac{\pi(H_1)}{\pi(H_0)} ,   (3.45)

where $\pi(H_1)$ and $\pi(H_0)$ are the priors for the two hypotheses. The ratio of the priors, $\pi(H_1)/\pi(H_0)$, is called the prior odds. Finally, the ratio:

B_{1/0} = \frac{P(\vec{x} \mid H_1)}{P(\vec{x} \mid H_0)}   (3.46)

is called the Bayes factor [4], so Eq. (3.45) reads:

posterior odds = Bayes factor × prior odds .   (3.47)

The Bayes factor is equal to the posterior odds if the priors are identical for the two hypotheses.
The computation of the Bayes factor in practice requires the introduction of the likelihood function, as in Eq. (3.28). In the simplest case, in which no parameter $\vec{\theta}$ is present in either of the two hypotheses, the Bayes factor is equal to the likelihood ratio of the two hypotheses. If parameters are present, the probability densities $P(\vec{x} \mid H_{0,1})$ should be computed by integrating the product of the likelihood function and the prior over the parameter space:

P(\vec{x} \mid H_0) = \int L(\vec{x} \mid H_0, \vec{\theta}_0)\, \pi_0(\vec{\theta}_0)\, \mathrm{d}\vec{\theta}_0 ,   (3.48)
P(\vec{x} \mid H_1) = \int L(\vec{x} \mid H_1, \vec{\theta}_1)\, \pi_1(\vec{\theta}_1)\, \mathrm{d}\vec{\theta}_1 ,   (3.49)

and the Bayes factor can be written as:

B_{1/0} = \frac{\int L(\vec{x} \mid H_1, \vec{\theta}_1)\, \pi_1(\vec{\theta}_1)\, \mathrm{d}\vec{\theta}_1}{\int L(\vec{x} \mid H_0, \vec{\theta}_0)\, \pi_0(\vec{\theta}_0)\, \mathrm{d}\vec{\theta}_0} .   (3.50)

The scale proposed in [4] to assess the evidence of $H_1$ against $H_0$ with Bayes factors is reported in Table 3.1.

Table 3.1 Assessing evidence with Bayes factors according to the scale proposed in [4]

$B_{1/0}$ | Evidence against $H_0$
1–3       | Not worth more than a bare mention
3–20      | Positive
20–150    | Strong
>150      | Very strong

In the Bayesian approach to probability, Bayes factors are an alternative to the hypothesis tests adopted under the frequentist approach, which will be introduced in Chap. 9. In particular, Bayes factors can be used in place of significance levels (see Sect. 10.2) in order to assess the evidence of one hypothesis $H_1$ (e.g. the presence of a signal due to a new particle) against a null hypothesis $H_0$ (e.g. no signal due to a new particle is present in our data sample).
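As a minimal numerical sketch, consider a Poisson counting experiment with a known expected background b under $H_0$ and a fixed additional signal s under $H_1$; with no free parameters, the Bayes factor of Eq. (3.46) reduces to a likelihood ratio. All numbers below are invented for illustration:

```python
import math

def poisson(n, mu):
    """Poisson probability P(n | mu)."""
    return mu ** n * math.exp(-mu) / math.factorial(n)

n_obs = 8     # observed counts (illustrative)
b = 3.0       # expected background under H0
s = 5.0       # expected signal under H1 (fixed: no nuisance parameters)

# Eq. (3.46): with no free parameters the Bayes factor is the likelihood ratio
bayes_factor = poisson(n_obs, s + b) / poisson(n_obs, b)
print(bayes_factor)
```

For these inputs $B_{1/0} \approx 17$, which would count as 'positive' evidence on the scale of Table 3.1.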

3.7 Subjectiveness and Prior Choice

One main feature of Bayesian probability is its intrinsic dependence on a prior probability, which could be chosen by different observers in different ways. This feature is intrinsic and unavoidable in Bayesian probability, which is subjective in the sense that it depends on one's choice of the prior probability. Example 3.10 demonstrated that, in extreme cases, drastic choices of prior PDFs may make the posterior insensitive to the actual observation.
It is also true, as remarked in Sect. 3.4.1, that, for reasonable choices of the prior PDF, adding more and more measurements increases one's knowledge about the unknown parameter(s); hence the posterior probability becomes less and less sensitive to the choice of the prior probability. For this reason, in most cases where a large number of measurements is available, Bayesian and frequentist calculations tend to give consistent results.
But it is also true that many interesting statistical problems arise in cases with a small number of measurements, where the goal of the measurement is to extract the maximum possible information from the limited available sample, which is in general precious because it is the outcome of a complex and labor-intensive experiment. In those cases, applying Bayesian or frequentist methods usually leads to numerically different results, which should also be interpreted in very different ways, and, under the Bayesian approach, the choice of prior probabilities may play a crucial role and have a relevant influence on the results.
One of the main difficulties arises when choosing a probability distribution to model one's complete ignorance about an unknown parameter $\theta$. A frequently adopted prior distribution in physics is a PDF uniform in the interval of validity of $\theta$. Imagine, however, that we change parametrization, from the original parameter $\theta$ to a function of $\theta$; in one dimension, for instance, one may choose $\exp\theta$, $\log\theta$, $\sqrt{\theta}$ or $1/\theta$, etc. The transformed parameter will in general no longer have a uniform prior PDF. This

is particularly evident in the case of the measurement of a particle's lifetime $\tau$: should one choose a PDF uniform in $\tau$, or uniform in the particle's width, $\Gamma = 1/\tau$? There is no preferred choice provided by any first principle.
This subjectiveness in the choice of the prior PDF, intrinsic to the Bayesian approach, raises criticism from supporters of the frequentist approach, who object that results obtained under the Bayesian approach are to some extent arbitrary, while scientific results should not depend on any subjective assumption. Supporters of the Bayesian approach reply that Bayesian results are not arbitrary but intersubjective [5], in the sense that commonly agreed prior choices lead to common results, and that a dependence on prior knowledge is unavoidable and intrinsic in the process of scientific progress. The debate is in some cases still open, and the literature contains opposing opinions on this issue.

3.8 Jeffreys’ Prior

Harold Jeffreys [6] proposed a choice of uninformative prior that is invariant under parameter transformations. Jeffreys' choice is, up to a normalization factor, given by:

\pi(\vec{\theta}) \propto \sqrt{J(\vec{\theta})} ,   (3.51)

where $J(\vec{\theta})$ is the determinant of the Fisher information matrix defined below:

J(\vec{\theta}) = \det \left\langle \frac{\partial \log L(\vec{x} \mid \vec{\theta})}{\partial \theta_i}\, \frac{\partial \log L(\vec{x} \mid \vec{\theta})}{\partial \theta_j} \right\rangle .   (3.52)

It is not difficult to demonstrate that Jeffreys' prior is invariant under a change of parametrization, i.e. transforming $\vec{\theta} \to \vec{\theta}' = \vec{\theta}'(\vec{\theta})$: the Jacobian determinant that appears in the PDF transformation, using Eq. (2.24),

\pi'(\vec{\theta}') = \left| \det \left( \frac{\partial \theta_i}{\partial \theta'_j} \right) \right| \pi(\vec{\theta}) ,   (3.53)

is absorbed in the determinant that appears in the Fisher information, from Eq. (3.52), expressed in the transformed coordinates.
Jeffreys' priors corresponding to the parameters of some of the most frequently used PDFs are given in Table 3.2.
Note that only the mean of a Gaussian corresponds to a uniform Jeffreys' prior. For instance, for a Poissonian counting experiment, like the one considered in Example 3.11, Jeffreys' prior is proportional to $1/\sqrt{s}$, not uniform, as was assumed in order to determine Eq. (3.41).

Table 3.2 Jeffreys' priors corresponding to the parameters of some of the most frequently used PDFs

PDF parameter | Jeffreys' prior
Poissonian mean $s$ | $p(s) \propto 1/\sqrt{s}$
Poissonian signal mean $s$ with a background $b$ | $p(s) \propto 1/\sqrt{s + b}$
Gaussian mean $\mu$ | $p(\mu) \propto 1$
Gaussian standard deviation $\sigma$ | $p(\sigma) \propto 1/\sigma$
Binomial success fraction $\varepsilon$ | $p(\varepsilon) \propto 1/\sqrt{\varepsilon\,(1 - \varepsilon)}$
Exponential parameter $\lambda$ | $p(\lambda) \propto 1/\lambda$
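For the Poisson case, the first row of Table 3.2 can be verified exactly: the score is $\mathrm{d} \log P(n \mid s)/\mathrm{d}s = n/s - 1$, so the Fisher information is $J(s) = \langle (n/s - 1)^2 \rangle = \mathrm{Var}[n]/s^2 = 1/s$, giving $\pi(s) \propto 1/\sqrt{s}$. A numerical sketch (truncating the sum over n at an arbitrary n_max):

```python
import math

def fisher_info_poisson(s, n_max=200):
    """Truncated expectation <(d log P(n|s)/ds)^2> over the Poisson distribution."""
    p = math.exp(-s)                       # P(0 | s)
    total = p * (0.0 / s - 1.0) ** 2
    for n in range(1, n_max):
        p *= s / n                         # P(n | s) from P(n - 1 | s)
        total += p * (n / s - 1.0) ** 2
    return total

for s in (0.5, 2.0, 10.0):
    print(s, fisher_info_poisson(s), 1.0 / s)   # J(s) matches 1/s
```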

3.9 Reference Priors

Another methodology, known as reference analysis, constructs priors that are invariant under reparametrization on the basis of a procedure that minimizes their 'informativeness' according to a mathematical definition. Such reference priors in some cases coincide with Jeffreys' priors. This method, which is beyond the scope of the present book, is described in [7] with examples of application to particle physics.

3.10 Improper Priors

In many cases, as in Table 3.2, Jeffreys' priors have a divergent integral over the entire parameter domain. This is also the case when a uniform prior is chosen on an unbounded domain. The integrals involved in the evaluation of the Bayesian posterior, which contain the product of the likelihood function and the prior, are nonetheless finite. Such priors are called improper prior distributions.

Example 3.12 Posterior for an Exponential Distribution

A particle's mean lifetime $\tau$ can be determined from the measurement of a number N of decay times $t_1, \cdots, t_N$, which are expected to follow an exponential distribution:

f(t) = \frac{1}{\tau}\, e^{-t/\tau} = \lambda\, e^{-\lambda t} ,   (3.54)

where $\lambda = 1/\tau$. The likelihood function is given by the product $f(t_1) \cdots f(t_N)$:

L(\vec{t}; \tau) = \prod_{i=1}^{N} \frac{1}{\tau}\, e^{-t_i/\tau} = \lambda^N e^{-\lambda \sum_{i=1}^{N} t_i} = \frac{e^{-\sum_{i=1}^{N} t_i/\tau}}{\tau^N} .   (3.55)

The posterior distribution for the parameter $\tau$, assuming a prior $\pi(\tau)$, is given by:

p(\tau; \vec{t}) = \frac{\pi(\tau)\, e^{-\sum_{i=1}^{N} t_i/\tau} / \tau^N}{\int \pi(\tau')\, e^{-\sum_{i=1}^{N} t_i/\tau'} / \tau'^N\, \mathrm{d}\tau'} .   (3.56)

A possible prior choice to model one's ignorance about $\tau$ (uninformative prior) is to assume a uniform distribution for $\tau$, $\pi(\tau) = \mathrm{const.}$, but this is not the only possible choice. Another choice might be a uniform prior for $\lambda = 1/\tau$, $\pi(\lambda) = \mathrm{const.}$, which, using Eq. (2.25), gives:

\pi(\tau) = \left| \frac{\mathrm{d}\lambda}{\mathrm{d}\tau} \right| \pi(\lambda) \propto \frac{1}{\tau^2} .   (3.57)

Alternatively, Jeffreys' prior could be used. Using the likelihood function in Eq. (3.55), the Fisher information matrix, defined in Eq. (3.52), has a single element:

J(\tau) = \left\langle \left( \frac{\mathrm{d} \log L(\vec{t}; \tau)}{\mathrm{d}\tau} \right)^2 \right\rangle = \left\langle \left( \frac{\mathrm{d}}{\mathrm{d}\tau} \left( -N \log \tau - \sum_{i=1}^{N} t_i / \tau \right) \right)^2 \right\rangle
= \left\langle \left( -\frac{N}{\tau} + \frac{\sum_{i=1}^{N} t_i}{\tau^2} \right)^2 \right\rangle = \frac{N^2}{\tau^2} - \frac{2N}{\tau^3} \left\langle \sum_{i=1}^{N} t_i \right\rangle + \frac{1}{\tau^4} \left\langle \left( \sum_{i=1}^{N} t_i \right)^2 \right\rangle .   (3.58)

For an exponential distribution, $\langle t_i \rangle = \tau$ and $\langle t_i^2 \rangle = 2\tau^2$; for independent measurements, $\langle \sum_i t_i \rangle = N\tau$ and $\langle (\sum_i t_i)^2 \rangle = N(N+1)\tau^2$, hence:

J(\tau) = \frac{N}{\tau^2} \propto \frac{1}{\tau^2} ,   (3.59)

and Jeffreys' prior is:

\pi(\tau) \propto \sqrt{J(\tau)} \propto \frac{1}{\tau} .   (3.60)

Figure 3.6 shows the posterior distributions $p(\tau; \vec{t})$ for randomly extracted datasets with $\tau = 1$, using a uniform prior on $\tau$, a uniform prior on $\lambda = 1/\tau$, and Jeffreys' prior, for N = 5, 10 and 50. The differences among the three cases become less relevant as the number of measurements N increases.
The treatment of the same case with the frequentist approach is discussed in Example 5.19.

Fig. 3.6 Posterior distribution $p(\tau; t_1, \ldots, t_N)$ for $\tau$ using a uniform prior on $\tau$, $\pi(\tau) = \mathrm{const.}$ (dashed line), a uniform prior on $\lambda = 1/\tau$, $\pi(\tau) \propto 1/\tau^2$ (dotted line), and Jeffreys' prior, $\pi(\tau) \propto 1/\tau$ (solid line). Data are shown as a blue histogram (entries / 10). A number of measurements N = 5 (top), 10 (middle) and 50 (bottom) has been considered
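The role of the prior in this example can also be sketched numerically. For a prior $\pi(\tau) \propto \tau^{-k}$, the mode of the posterior in Eq. (3.56) is $\sum_i t_i/(N + k)$: $\sum_i t_i/N$ for a uniform prior on $\tau$ (k = 0), $\sum_i t_i/(N + 1)$ for Jeffreys' prior (k = 1) and $\sum_i t_i/(N + 2)$ for a prior uniform in $\lambda$ (k = 2). The decay times below are made-up illustrative values, not the dataset used for Fig. 3.6:

```python
import math

times = [0.8, 1.3, 0.4, 2.1, 0.9]    # made-up decay times (tau ~ 1)
N, S = len(times), sum(times)

def log_post(tau, k):
    # unnormalized log posterior with prior pi(tau) ∝ tau^(-k):
    # k = 0 uniform in tau, k = 1 Jeffreys', k = 2 uniform in lambda = 1/tau
    return -(N + k) * math.log(tau) - S / tau

def grid_mode(k, d=1e-3):
    grid = [i * d for i in range(1, 10000)]      # tau in (0, 10)
    return max(grid, key=lambda t: log_post(t, k))

for k, label in [(0, "uniform in tau"), (1, "Jeffreys'"), (2, "uniform in lambda")]:
    print(label, grid_mode(k), S / (N + k))      # grid mode vs analytic S/(N+k)
```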

3.11 Transformations of Variables and Error Propagation

The definition of credible intervals and measurement errors, or uncertainties, was discussed in Sect. 3.5.2. Measurement errors need to be propagated when the originally measured parameters $\vec{\theta}$ are transformed into a different set of parameters $\vec{\eta}$, and uncertainties on the new parameters $\vec{\eta}$ must be quoted.
Error propagation arises naturally within Bayesian inference, whose outcome is a posterior PDF for the unknown parameter(s) of interest. In order to obtain the PDF for a set of transformed parameters, it is sufficient to transform the posterior PDF under the change of variables (see Sect. 2.6).
In the case of a two-variable transformation, $(x, y) \to (x', y') = (X'(x, y), Y'(x, y))$, a PDF $f(x, y)$ transforms according to:

f'(x', y') = \int \delta(x' - X'(x, y))\, \delta(y' - Y'(x, y))\, f(x, y)\, \mathrm{d}x\, \mathrm{d}y .   (3.61)

Given $f'(x', y')$, one can again determine, for the transformed variables $x'$ and $y'$, the most likely values and credible intervals. The generalization to more variables, $\vec{x} \to \vec{x}'$, is straightforward.
Note that the most probable values $\hat{\vec{x}}$, i.e. the values that maximize $f(\vec{x})$, do not necessarily map into values $\hat{\vec{x}}' = \vec{X}'(\hat{\vec{x}})$ that maximize $f'(\vec{x}')$; similarly, the average values of $\vec{x}$ are not necessarily transformed into the average values of $\vec{x}'$: it was already noted in Sect. 2.10, for instance, that $\langle e^y \rangle \neq e^{\langle y \rangle}$ if y is a normal random variable.
Issues with non-trivial transformations of variables and error propagation are also present in the frequentist approach. Section 5.15 will briefly discuss the related case of propagation of asymmetric uncertainties. Section 5.14 will discuss how to propagate errors under a transformation of variables using a linear approximation; those results hold for Bayesian as well as for frequentist inference. Under this simplifying assumption, which is a sufficient approximation only in the presence of small uncertainties, one may assume that values that maximize f map into values that maximize f', i.e. $(\hat{x}', \hat{y}') \simeq (X'(\hat{x}, \hat{y}), Y'(\hat{x}, \hat{y}))$.
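The remark about averages can be checked by direct numerical integration; for a normal y with mean $\mu$ and standard deviation $\sigma$, the exact result is $\langle e^y \rangle = e^{\mu + \sigma^2/2}$, which differs from $e^{\langle y \rangle} = e^\mu$. The mean, standard deviation, grid range, and step below are arbitrary illustrative choices:

```python
import math

mu, sigma = 0.0, 1.0                       # illustrative choice
d = 1e-3
ys = [mu - 8 * sigma + i * d for i in range(int(16 * sigma / d))]

def norm_pdf(y):
    """Normal PDF with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mean_exp = sum(math.exp(y) * norm_pdf(y) * d for y in ys)   # <e^y>
exp_mean = math.exp(mu)                                     # e^<y>

print(mean_exp, exp_mean)   # close to e^(mu + sigma^2/2) vs e^mu
```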

References

1. Cowan, G.: Statistical Data Analysis. Clarendon Press, Oxford (1998)
2. D'Agostini, G.: Telling the Truth with Statistics. CERN Academic Training, Geneva (2005)
3. Eadie, W., Drijard, D., James, F., Roos, M., Sadoulet, B.: Statistical Methods in Experimental Physics. North Holland, Amsterdam (1971)
4. Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90, 773 (1995)
5. D'Agostini, G.: Bayesian Reasoning in Data Analysis: A Critical Introduction. World Scientific, Hackensack (2003)
6. Jeffreys, H.: An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A Math. Phys. Sci. 186, 453–461 (1946)
7. Demortier, L., Jain, S., Prosper, H.B.: Reference priors for high energy physics. Phys. Rev. D 82, 034002 (2010)
Chapter 4
Random Numbers and Monte Carlo Methods

4.1 Pseudorandom Numbers

Many computer applications, ranging from simulations to video games and 3D graphics, take advantage of computer-generated numeric sequences having properties very similar to those of truly random variables. Sequences generated by computer algorithms through mathematical operations are not really random, having no intrinsic unpredictability, and are necessarily deterministic and reproducible. Indeed, the possibility to reproduce exactly the same sequence of computer-generated numbers is often a desirable feature for many applications.
Good algorithms that generate 'random' numbers or, more precisely, pseudorandom numbers, must, despite their reproducibility, obey in the limit of large numbers the desired statistical properties of real random variables, with the limitation that pseudorandom sequences can be long, but not infinite.
Considering that computers have finite machine precision, pseudorandom numbers, in practice, take discrete possible values, depending on the number of bits used to store floating point variables.
Numerical methods involving the repeated use of computer-generated pseudorandom numbers are also known as Monte Carlo methods, from the name of the city hosting the famous casino, where the properties of (truly) random numbers resulting from roulette and other games are exploited in order to generate profit.
In the following, we will sometimes refer to pseudorandom numbers simply as random numbers, when the context creates no ambiguity.

© Springer International Publishing AG 2017
L. Lista, Statistical Methods for Data Analysis in Particle Physics,
Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0_4

4.2 Pseudorandom Generators Properties

Good (pseudo)random number generators must be able to produce sequences of numbers that are statistically independent of previous extractions, even though each number is unavoidably determined mathematically, through the generator's algorithm, from the previously extracted numbers.
All numbers in a sequence should be independent and distributed according to the same PDF f(x) (independent and identically distributed random variables, or IID). Those properties can be written as follows:

f(x_i) = f(x_j), \quad \forall\, i, j ,   (4.1)
f(x_n \mid x_{n-m}) = f(x_n), \quad \forall\, n, m .   (4.2)

Example 4.13 Transition From Regular to ‘Unpredictable’ Sequences


There are several examples of mathematical algorithms that lead to
sequences that are poorly predictable. One example of a transition from a
'regular' to a 'chaotic' regime is given by the logistic map [1]. The sequence
is defined, starting from an initial value x_0, as:

x_{n+1} = \lambda x_n (1 - x_n) \,.   (4.3)

Depending on the value of \lambda, the sequence may have very different possible
behaviors. If the sequence converges to a single asymptotic value x for n \to \infty,
we have:

\lim_{n \to \infty} x_n = x \,,   (4.4)

where x must satisfy:

x = \lambda x (1 - x) \,.   (4.5)

Excluding the trivial solution x = 0, Eq. (4.5) leads to:

x = (\lambda - 1)/\lambda \,.   (4.6)

This solution is stable for values of \lambda smaller than 3. Above \lambda = 3, the
sequence stably approaches a state where it oscillates between two values


x_1 and x_2 that satisfy the following system of two equations:

x_1 = \lambda x_2 (1 - x_2) \,,   (4.7)
x_2 = \lambda x_1 (1 - x_1) \,.   (4.8)

For larger values of \lambda, up to 1 + \sqrt{6}, the sequence keeps oscillating between
the two values; above that, it oscillates among four values, and further
bifurcations occur for even larger values of \lambda, until the sequence achieves a
very complex and poorly predictable behavior. For \lambda = 4, the sequence finally
densely covers the interval ]0, 1[. The PDF corresponding to the sequence with
\lambda = 4 can be demonstrated to be a beta distribution with parameters
\alpha = \beta = 0.5, where the beta distribution is defined as:

f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1 - x)^{\beta-1}}{\int_0^1 u^{\alpha-1} (1 - u)^{\beta-1}\, du} \,.   (4.9)

The behavior of the logistic map for different values of \lambda is shown in
Fig. 4.1.

Fig. 4.1 Logistic map [2] (bifurcation diagram: asymptotic values x as a function of \lambda, for \lambda between 2.4 and 4.0)

4.3 Uniform Random Number Generators

The most widely used computer-based random number generators are conveniently
written in order to produce sequences of uniformly distributed numbers ranging
from zero to one.1 Starting from uniform random number generators, most of the
other distributions of interest can be derived using specific algorithms, some of
which are described in the following Sections.
The period of a random sequence, i.e. the number of extractions after which the
sequence repeats itself, should be as large as possible, and in any case larger than
the number of random numbers required by the specific application.
One example is the function lrand48 [3], defined in the Single UNIX
Specification according to the following algorithm:

x_{n+1} = (a\, x_n + c) \bmod m \,,   (4.10)

where the values of m, a and c are:

m = 2^{48} \,,   (4.11)
a = 25214903917 = \mathrm{5DEECE66D_{hex}} \,,   (4.12)
c = 11 = \mathrm{B_{hex}} \,.   (4.13)

The sequences obtained from Eq. (4.10) for given initial values x_0 are distributed
uniformly, to a good approximation, between 0 and 2^{48} - 1. The obtained sequences
of random bits can be used to return 32-bit integer numbers, or can be
mapped into sequences of floating-point numbers uniformly distributed in [0, 1[,
as implemented in the C function drand48.
The value x_0 is called the seed of the random sequence. By choosing different initial
seeds, different sequences are produced. In this way, one can repeat a computer-
simulated experiment using different random sequences, each time changing the
initial seed and obtaining different results, in order to simulate the statistical
fluctuations occurring in reality when repeating an experiment.
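As an illustration, the recurrence of Eq. (4.10) with the constants of Eqs. (4.11)–(4.13) can be sketched in a few lines of Python. This is a minimal, illustrative implementation (the function names are ours), not a replacement for a library generator:

```python
# Minimal linear congruential generator using the drand48 constants of
# Eqs. (4.11)-(4.13). Illustrative only: use a library RNG in real code.
M = 2 ** 48
A = 25214903917      # 5DEECE66D in hexadecimal
C = 11               # B in hexadecimal

def lcg(seed):
    """Yield the integer sequence x_{n+1} = (a x_n + c) mod m."""
    x = seed
    while True:
        x = (A * x + C) % M
        yield x

def uniform01(seed):
    """Map the integer sequence to floats uniformly distributed in [0, 1[."""
    for x in lcg(seed):
        yield x / M

gen = uniform01(seed=12345)
sample = [next(gen) for _ in range(5)]   # five numbers in [0, 1[
```

The same seed always reproduces the same sequence, which is exactly the property exploited to make simulated experiments reproducible.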
Similarly to lrand48, the gsl_rng_rand [4] generator of the BSD rand
function uses the same algorithm, but with a = \mathrm{41C64E6D_{hex}}, c = \mathrm{3039_{hex}} and
m = 2^{31}. The period of gsl_rng_rand is about 2^{31}, which is lower than that of
lrand48.
A popular random generator that offers good statistical properties is due to
Lüscher [5], implemented by F. James in the RANLUX generator [6], whose period
is of the order of 10^{171}. RANLUX is now considered relatively slower than other
algorithms, like the L'Ecuyer generator [7], which has a period of 2^{88}, or the
Mersenne–Twistor generator [8], which has a period of 2^{19937} - 1 and is relatively
faster than L'Ecuyer's generator.

^1 In realistic cases of finite numeric precision, one of the extreme values is excluded. Each
individual value would have a corresponding zero probability in the case of infinite precision,
but this is not exactly true with finite machine precision.

4.3.1 Remapping Uniform Random Numbers

Given a random variable x uniformly distributed in [0, 1[, it is often convenient
to transform it into a variable x' uniformly distributed in another interval [a, b[
by performing the following linear transformation:

x' = a + x (b - a) \,.   (4.14)

With this transformation, x = 0 corresponds to x' = a and x = 1 corresponds to
x' = b.

4.4 Discrete Random Number Generators

Random variables may have discrete values, each corresponding to a given proba-
bility.
The simplest example is the simulation of a detector response whose efficiency
is \varepsilon. In this case, one can generate a uniform random number r in [0, 1[: if r < \varepsilon, the
response is positive (i.e. the particle has been detected), otherwise the response
is negative (i.e. the particle has not been detected).
If we have more possible values, 1, \cdots, n, corresponding to probabilities
P_1, \cdots, P_n, one possibility is to store the cumulative probabilities into an array:

C_k = \sum_{i=1}^{k} P_i \,, \quad k = 1, \cdots, n - 1 \,,   (4.15)

considering that it is not necessary to store the obvious value C_n = 1. A random
number r uniformly distributed in [0, 1[ can then be compared with the values C_k in
the array. The smallest value of k such that C_k > r is returned as the discrete
random number. A binary search may help in case of a large number of possible
values.
Optimized implementations exist for discrete random extractions and are
described in [9].
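The procedure above can be sketched in Python as follows (a minimal example; the helper name is ours), with the cumulative array of Eq. (4.15) and a binary search:

```python
import bisect
import random

def make_discrete_sampler(probabilities, rng=random.random):
    """Return a sampler of k in {0, ..., n-1}, where k has probability
    probabilities[k], using the cumulative array of Eq. (4.15)."""
    cumulative = []
    total = 0.0
    for p in probabilities[:-1]:   # the obvious value C_n = 1 is not stored
        total += p
        cumulative.append(total)

    def sampler():
        r = rng()   # uniform in [0, 1[
        # binary search for the smallest k such that C_k > r
        return bisect.bisect_right(cumulative, r)

    return sampler
```

With probabilities (0.2, 0.3, 0.5), for instance, a value r = 0.25 falls between C_1 = 0.2 and C_2 = 0.5, so k = 1 is returned.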

4.5 Nonuniform Random Number Generators

Nonuniformly distributed random numbers can be generated starting from a uniform
random number generator using various algorithms. Examples are provided in
the following Sections.

4.5.1 Nonuniform Distribution from Inversion


of the Cumulative Distribution

In order to generate a pseudorandom number x distributed according to a given
function f(x), its cumulative distribution can be built (Eq. (2.17)):

F(x) = \int_{-\infty}^{x} f(x')\, dx' \,.   (4.16)

By inverting the cumulative distribution F(x), it is possible to demonstrate that,
extracting a random number r uniformly distributed in [0, 1[, the transformed
variable:

x = F^{-1}(r)   (4.17)

is distributed according to f(x). If we write:

r = F(x) \,,   (4.18)

we have:

dr = \frac{dF}{dx}\, dx = f(x)\, dx \,.   (4.19)

Introducing the differential probability dP, we have:

\frac{dP}{dx} = f(x)\, \frac{dP}{dr} \,.   (4.20)

Since r is uniformly distributed, dP/dr = 1, hence:

\frac{dP}{dx} = f(x) \,,   (4.21)

which demonstrates that x follows the desired PDF.

This method works conveniently only if the cumulative distribution F(x) can be easily
computed and inverted using either analytical or numerical methods. If not,
this algorithm may be very slow, and alternative implementations may provide better
CPU performance.

Example 4.14 Extraction of an Exponential Random Variable


The inversion of the cumulative distribution presented in Sect. 4.5.1 allows
one to extract random numbers x distributed according to an exponential PDF:

f(x) = \lambda e^{-\lambda x} \,.   (4.22)

The cumulative distribution of f(x) is:

F(x) = \int_0^x f(x')\, dx' = 1 - e^{-\lambda x} \,.   (4.23)

Inverting F(x) leads to:

1 - e^{-\lambda x} = r \,,   (4.24)

which turns into:

x = -\frac{1}{\lambda} \log(1 - r) \,.   (4.25)

If the extraction of r happens in the interval [0, 1[, like with drand48,
r = 1 will never be extracted, hence the argument of the logarithm will
never be equal to zero, ensuring the numerical validity of Eq. (4.25).
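This inversion can be sketched in a few lines of Python (an illustrative example; the function name is ours):

```python
import math
import random

def exponential(lam, rng=random.random):
    """Extract x with PDF lam * exp(-lam * x) by inverting the cumulative
    distribution, as in Eq. (4.25)."""
    r = rng()                        # uniform in [0, 1[, hence 1 - r > 0
    return -math.log(1.0 - r) / lam

random.seed(1)
values = [exponential(2.0) for _ in range(100_000)]
mean = sum(values) / len(values)     # should approach 1/lam = 0.5
```

The sample average approaching 1/\lambda provides a quick sanity check of the extraction.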

Example 4.15 Extraction of a Uniform Point on a Sphere


Assume we want to generate two variables, \theta and \phi, distributed in such
a way that they correspond in polar coordinates to a point uniformly
distributed on a sphere, i.e. the probability density per unit of solid angle
\Omega is uniform:

\frac{dP}{d\Omega} = \frac{dP}{\sin\theta\, d\theta\, d\phi} = k \,.   (4.26)

k is a normalization constant such that the PDF integrates to unity over the
entire solid angle. From Eq. (4.26), the joint two-dimensional PDF can be
factorized into the product of two PDFs, as functions of \theta and \phi:

\frac{dP}{d\theta\, d\phi} = f(\theta)\, g(\phi) = k \sin\theta \,,   (4.27)

where:

f(\theta) = \frac{dP}{d\theta} = c_1 \sin\theta \,,   (4.28)

g(\phi) = \frac{dP}{d\phi} = c_2 \,.   (4.29)

The constants c_1 and c_2 ensure the normalization of f(\theta) and g(\phi) individ-
ually. This factorization implies that \theta and \phi are independent.
\theta can be extracted by inverting the cumulative distribution of f(\theta)
(Eq. (4.17)). \phi is uniformly distributed, since g(\phi) is a constant, so it can
be extracted by remapping the interval [0, 1[ into [0, 2\pi[ (Eq. (4.14)). In
summary, the generation may proceed as follows:

\theta = \arccos(1 - 2 r_1) \in [0, \pi[ \,,   (4.30)

\phi = 2\pi r_2 \in [0, 2\pi[ \,,   (4.31)

where r_1 and r_2 are extracted in [0, 1[ with a standard uniform generator.
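Equations (4.30) and (4.31) translate directly into Python (an illustrative sketch; the function name is ours):

```python
import math
import random

def random_direction(rng=random.random):
    """Extract (theta, phi) uniformly distributed per unit solid angle,
    following Eqs. (4.30) and (4.31)."""
    theta = math.acos(1.0 - 2.0 * rng())   # inversion of the cumulative of sin(theta)
    phi = 2.0 * math.pi * rng()            # uniform remap of [0, 1[ into [0, 2*pi[
    return theta, phi
```

Uniformity on the sphere is equivalent to cos \theta being uniform in [-1, 1], which can be used as a quick check of the generated sample.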

4.5.2 Gaussian Generator Using the Central Limit Theorem

By virtue of the central limit theorem (see Sect. 2.14), the sum of N random
variables, each having a finite variance, is distributed, in the limit N \to \infty,
according to a Gaussian distribution. A finite but sufficiently large number of
random values x_1, \cdots, x_N can be extracted using a uniform random generator
and remapped from [0, 1[ to [-\sqrt{3}, \sqrt{3}[ using Eq. (4.14), so that the average and
variance of each x_i are equal to zero and one, respectively. Then, the following
combination is computed:

x = \frac{x_1 + \cdots + x_N}{\sqrt{N}} \,,   (4.32)

which again has average equal to zero and variance equal to one. Figure 2.15
shows the distribution of x for N up to four, compared with a Gaussian distribution.
The distribution of x, from Eq. (4.32), is necessarily truncated within the range
[-\sqrt{3N}, \sqrt{3N}], by construction, while a truly Gaussian variable has no upper or
lower bound.
The approach presented here is simple and may be instructive but, apart from
the unavoidable approximations, it is not the most CPU-effective way to generate
Gaussian random numbers, since several uniform extractions are needed in order
to generate a single Gaussian number. A better algorithm will be described in the
following Sect. 4.5.3.
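The construction above can be sketched in Python as follows (an illustrative example; the function name and the choice N = 12 are ours):

```python
import math
import random

def gauss_clt(n=12, rng=random.random):
    """Approximate a standard normal number as in Eq. (4.32): sum n uniform
    numbers remapped to [-sqrt(3), sqrt(3)[ (zero mean, unit variance),
    then divide by sqrt(n)."""
    s3 = math.sqrt(3.0)
    total = sum(-s3 + 2.0 * s3 * rng() for _ in range(n))
    return total / math.sqrt(n)
```

Note the truncation: every generated value lies, by construction, within [-\sqrt{3n}, \sqrt{3n}].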

4.5.3 Gaussian Generator with the Box–Muller Method

In order to generate Gaussian random numbers, the inversion of the cumulative
distribution discussed in Sect. 4.5.1 would require inverting an error function, which
cannot be performed with efficient algorithms.
A more efficient algorithm proceeds by extracting pairs of random numbers
simultaneously generated in two dimensions, with a transformation from Cartesian
coordinates (x, y) to polar coordinates (r, \phi). In particular, the radial Gaussian
cumulative distribution was already introduced in Eq. (2.106). In the simplest case of a
standard normal variable, it can be written as:

F(r) = \int_0^r e^{-r'^2/2}\, r'\, dr' = 1 - e^{-r^2/2} \,.   (4.33)

The transformation from two variables r_1 and r_2, uniformly distributed in [0, 1[, into
two variables z_1 and z_2, distributed according to a standard normal, is called the Box–
Muller transformation [10]:

r = \sqrt{-2 \log(1 - r_1)} \,,   (4.34)
\phi = 2\pi r_2 \,,   (4.35)
z_1 = r \cos\phi \,,   (4.36)
z_2 = r \sin\phi \,.   (4.37)

A standard normal random number z can be easily transformed into a Gaussian
random number x with average \mu and standard deviation \sigma using the following
transformation:

x = \mu + \sigma z \,.   (4.38)

More efficient generators for Gaussian random numbers exist. For instance, the
so-called Ziggurat algorithm is described in [11].
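A minimal Python sketch of Eqs. (4.34)–(4.38) (the function name is ours):

```python
import math
import random

def box_muller(mu=0.0, sigma=1.0, rng=random.random):
    """Generate a pair of independent Gaussian numbers with average mu and
    standard deviation sigma via Eqs. (4.34)-(4.38)."""
    r = math.sqrt(-2.0 * math.log(1.0 - rng()))   # radial part, Eq. (4.34)
    phi = 2.0 * math.pi * rng()                   # uniform angle, Eq. (4.35)
    z1 = r * math.cos(phi)                        # Eq. (4.36)
    z2 = r * math.sin(phi)                        # Eq. (4.37)
    return mu + sigma * z1, mu + sigma * z2       # Eq. (4.38)
```

Since r is extracted in [0, 1[, the argument 1 - r_1 of the logarithm never vanishes, as in Example 4.14.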

4.6 Monte Carlo Sampling

In case the cumulative distribution of a PDF cannot be easily computed and inverted,
either analytically or numerically, other methods allow generating random num-
bers according to the desired PDF with reasonably good CPU performance.

Fig. 4.2 Visualization of the hit-or-miss Monte Carlo method: points are extracted uniformly in the rectangle [a, b[ \times [0, m[; points below the curve f(x) are 'hits', points above it are 'misses'

4.6.1 Hit-or-Miss Monte Carlo

A rather general-purpose and simple random number generator is the hit-or-miss
Monte Carlo. It assumes a PDF f(x) defined in an interval x \in [a, b[, not necessarily
normalized.^2 The maximum value m of f(x), or at least a value m that is known
to be greater than or equal to the maximum of f(x), must be known. The situation is
represented in Fig. 4.2.
The method proceeds according to the following steps:
• first, a uniform random number x is extracted in the interval [a, b[, and f = f(x)
is computed;
• then, a random number r is extracted uniformly in [0, m[. If r > f ('miss'), the
extraction of x is repeated until r < f ('hit'); in that case, x is accepted as the
extracted value.
The probability distribution of the accepted values of x is equal to the initial PDF
f(x), up to a normalization factor, by construction.
The method accepts a fraction of the extractions equal to the ratio of the area under
the curve f(x) to the area of the rectangle that contains f, which may be problematic
if this ratio is particularly small.

^2 I.e. \int_a^b f(x)\, dx may be different from one.

In other words, the method has an efficiency (i.e. a fraction of accepted values
of x) equal to:

\varepsilon = \frac{\int_a^b f(x)\, dx}{(b - a) \cdot m} \,,   (4.39)

which may lead to a suboptimal use of computing power, in particular if the shape
of f(x) is very peaked.
The hit-or-miss Monte Carlo can also be applied to multidimensional cases with no
conceptual difference: first, a multidimensional point \vec{x} = (x_1, \cdots, x_n) is extracted;
then, \vec{x} is accepted or rejected according to a random extraction r \in [0, m[, compared
with f(x_1, \cdots, x_n).
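The two steps above can be sketched in Python as follows (an illustrative example; the function name is ours):

```python
import random

def hit_or_miss(f, a, b, m, rng=random.random):
    """Extract x in [a, b[ distributed according to f (not necessarily
    normalized), with f(x) <= m over the whole interval."""
    while True:
        x = a + (b - a) * rng()   # uniform in [a, b[, Eq. (4.14)
        r = m * rng()             # uniform in [0, m[
        if r < f(x):              # 'hit': accept x
            return x              # otherwise 'miss': extract again
```

For instance, with f(x) = x on [0, 1[ and m = 1, the accepted values follow the normalized PDF 2x.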

4.6.2 Importance Sampling

If the function f(x) is very peaked, the efficiency \varepsilon of the hit-or-miss method
(Eq. (4.39)) may be very low. The algorithm may be adapted in order to improve the
efficiency by identifying, in a preliminary stage of the algorithm, a partition of the
interval [a, b[ such that in each subinterval the function f(x) has a smaller variation
than in the overall range. In each subinterval, the maximum of f(x) is estimated,
as sketched in Fig. 4.3. The modified algorithm proceeds as follows: first of all, a
subinterval of the partition is randomly chosen with a probability proportional to its
area (see Sect. 4.4); then, the hit-or-miss approach is followed in the corresponding
rectangle. This approach is often called importance sampling.
A possible variation of this method is to use, instead of the aforementioned
partition, an 'envelope' of the function f(x), i.e. a function g(x) that is always
greater than or equal to f(x):

g(x) \geq f(x) \,, \quad \forall\, x \in [a, b[ \,,   (4.40)

and for which a convenient method to extract x according to the normalized distri-
bution g(x) is known. A concrete case of this method is presented in Example 4.16.
It is evident that the efficiency of importance sampling may be significantly
larger than that of the 'plain' hit-or-miss Monte Carlo if the partition or envelope is
properly chosen.

Fig. 4.3 Variation of the hit-or-miss Monte Carlo using importance sampling

Example 4.16 Combining Different Monte Carlo Techniques


We want to find an algorithm to generate a random variable x distributed
according to the PDF:

f(x) = C e^{-\lambda x} \cos^2 kx \,,   (4.41)

where C is a normalization constant, and \lambda and k are two known parameters.
f(x) features an oscillating term \cos^2 kx damped by an exponential term
e^{-\lambda x}.
As 'envelope', the function C e^{-\lambda x} can be taken and, as a first step, a
random number x is generated according to this exponential distribution
(see Example 4.14). Then, a hit-or-miss technique is applied, accepting or
rejecting x with a probability proportional to \cos^2 kx. The resulting probability
distribution, given the two independent processes, is the product of the
exponential envelope times the cosine-squared oscillating term. In summary,
the algorithm may proceed as follows:
1. generate r uniformly in [0, 1[;
2. compute x = -\log(1 - r)/\lambda;
3. generate s uniformly in [0, 1[;
4. if s > \cos^2 kx, repeat the extraction at point 1, else return x.
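The four steps above can be sketched in Python as follows (an illustrative example; the function name is ours):

```python
import math
import random

def sample_damped_oscillation(lam, k, rng=random.random):
    """Extract x with PDF proportional to exp(-lam*x) * cos(k*x)**2,
    following the four steps listed above."""
    while True:
        r = rng()                            # step 1
        x = -math.log(1.0 - r) / lam         # step 2: exponential envelope
        s = rng()                            # step 3
        if s <= math.cos(k * x) ** 2:        # step 4: hit-or-miss acceptance
            return x
```

Setting k = 0 reduces the cosine factor to one and recovers the pure exponential of Example 4.14, which provides a simple check.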

4.7 Numerical Integration with Monte Carlo Methods

Monte Carlo methods are often used as numerical techniques in order to compute
integrals. The hit-or-miss method described in Sect. 4.6.1, for instance, estimates the
integral \int_a^b f(x)\, dx from the fraction of accepted hits \hat{n} over the total number of
extractions N:

I = \int_a^b f(x)\, dx \simeq \hat{I} = (b - a) \cdot m \cdot \frac{\hat{n}}{N} \,.   (4.42)

With this approach, \hat{n} follows a binomial distribution. If \hat{n} is not too close to either
0 or N, Eq. (5.11) gives an approximate error on \hat{I}:^3

\delta\hat{I} = (b - a) \cdot m\, \sqrt{\frac{(\hat{n}/N)\,(1 - \hat{n}/N)}{N}} \,.   (4.43)

The error on \hat{I} from Eq. (4.43) decreases as 1/\sqrt{N}. This result also holds if the hit-or-
miss method is applied to a multidimensional integration, regardless of the number
of dimensions d of the problem. Other numerical methods, not based on random
number extractions, may suffer from severe computing-time penalties as the number
of dimensions d increases. This makes Monte Carlo methods advantageous in cases
with a large number of dimensions.
In the case of hit-or-miss Monte Carlo, anyway, numerical problems may
arise in the algorithm that finds the maximum value of the input function in
the multidimensional range of interest. Also, partitioning the multidimensional
integration range in an optimal way, in the case of importance sampling, may be
a non-trivial problem.
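The one-dimensional hit-or-miss estimate, together with its binomial uncertainty, can be sketched in Python as follows (an illustrative example; the function name is ours):

```python
import math
import random

def mc_integral(f, a, b, m, n, rng=random.random):
    """Hit-or-miss estimate of the integral of f over [a, b[ together with
    its binomial uncertainty; m must bound f from above on [a, b[."""
    hits = sum(1 for _ in range(n) if m * rng() < f(a + (b - a) * rng()))
    eff = hits / n                                     # accepted fraction
    integral = (b - a) * m * eff
    error = (b - a) * m * math.sqrt(eff * (1.0 - eff) / n)
    return integral, error
```

For f(x) = x^2 on [0, 1[ with m = 1, the estimate converges to the true value 1/3, with an uncertainty shrinking as 1/\sqrt{N}.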

4.8 Markov Chain Monte Carlo

The Monte Carlo methods considered so far in the previous sections are based on
sequences of uncorrelated pseudorandom numbers that follow a given probability
distribution. There are classes of algorithms that sample some probability
distributions more efficiently by producing sequences of correlated pseudorandom
numbers, i.e. each entry in the sequence depends on the previous ones.
A sequence of random variables \vec{x}_0, \cdots, \vec{x}_n is a Markov chain if the probability
distributions obey:

f_n(\vec{x}_{n+1}; \vec{x}_0, \cdots, \vec{x}_n) = f_n(\vec{x}_{n+1}; \vec{x}_n) \,,   (4.44)

i.e. if f_n(\vec{x}_{n+1}) only depends on the previously extracted value \vec{x}_n, which corresponds
to a 'loss of memory'. The Markov chain is said to be homogeneous if f_n(\vec{x}_{n+1}; \vec{x}_n) =
f(\vec{x}_{n+1}; \vec{x}_n) does not depend on n.
One example of a Markov chain is the Metropolis–Hastings algorithm [12, 13]
described below. Imagine we want to sample a PDF f(\vec{x}), starting from an initial
point \vec{x} = \vec{x}_0. A second point \vec{x} is generated according to a PDF q(\vec{x}; \vec{x}_0), called the
proposal distribution, which depends on \vec{x}_0. \vec{x} is accepted or not according to the
Hastings test ratio:

r = \min\left(1,\ \frac{f(\vec{x})\, q(\vec{x}_0; \vec{x})}{f(\vec{x}_0)\, q(\vec{x}; \vec{x}_0)}\right) \,,   (4.45)

^3 See Sect. 5.9; a more rigorous approach is presented in Sect. 7.3.

i.e. the point \vec{x} is accepted as the new point \vec{x}_1 if a uniformly generated value u is less
than or equal to r; otherwise, the generation of \vec{x} is repeated. Once the generated point
is accepted, a new generation restarts from \vec{x}_1, as above, in order to generate \vec{x}_2,
and so on. The process is repeated until the desired number of points has been
generated.
If a finite set of initial values is discarded, the rest of the values in the sequence can
be proven to follow the desired PDF f(\vec{x}), and each value with non-null probability
will eventually be reached, within an arbitrary precision, after a sufficiently large
number of extractions (ergodicity).
Usually, it is convenient to choose a symmetric proposal distribution,
such that q(\vec{x}; \vec{x}_0) = q(\vec{x}_0; \vec{x}), so that Eq. (4.45) simplifies to the so-called
Metropolis–Hastings ratio, which does not depend on q:

r = \min\left(1,\ \frac{f(\vec{x})}{f(\vec{x}_0)}\right) \,.   (4.46)

Fig. 4.4 Monte Carlo generation with the Metropolis–Hastings method for the sum of two two-
dimensional Gaussian functions with relative weights equal to 0.45 and 0.55. The PDF is shown
as a red-to-black color map in the background. The generated points are connected with a line in
order to show the generated sequence. The first 20 generated points are shown in purple, all the
subsequent ones are in blue. The proposal function is a two-dimensional Gaussian with \sigma = 0.5.
The total number of generated points is 100 (top left), 1000 (top right), 2000 (bottom left) and
5000 (bottom right)

A typical proposal choice may be a multidimensional Gaussian centered around \vec{x}_0
with a fixed standard deviation.
The number of initial extractions that needs to be discarded is not easy to predict,
and may depend on the choice of q. Empirical tests may be needed to check that the
sequence reaches a converging state.
Figure 4.4 shows an example of application of the Metropolis–Hastings Monte
Carlo.
The Metropolis–Hastings method is the core of Markov chain Monte Carlo
(MCMC) techniques. This technique is very powerful for computing posterior prob-
ability densities in Bayesian inference (see Sect. 3.5), and is used, for instance, in
the implementation of the BAT software toolkit [14].
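A one-dimensional Metropolis–Hastings sampler with a symmetric Gaussian proposal, so that the ratio of Eq. (4.46) applies, can be sketched in Python as follows (an illustrative example; the function name and the burn-in length are ours):

```python
import math
import random

def metropolis(f, x0, sigma, n_samples, n_burn=1000, rng=random):
    """Sample a one-dimensional PDF f (up to normalization, with f > 0 on
    the explored range) using the Metropolis-Hastings algorithm with a
    symmetric Gaussian proposal of width sigma."""
    x = x0
    chain = []
    for i in range(n_burn + n_samples):
        x_new = rng.gauss(x, sigma)                    # symmetric proposal
        if rng.random() <= min(1.0, f(x_new) / f(x)):  # Eq. (4.46)
            x = x_new                                  # accept the new point
        if i >= n_burn:                                # discard initial values
            chain.append(x)
    return chain
```

Note that consecutive entries of the returned chain are correlated, as discussed above, so statistical uncertainties estimated from the chain must account for this correlation.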

References

1. May, R.M.: Simple mathematical models with very complicated dynamics. Nature 261, 459–467
(1976)
2. Logistic map. Public domain image. https://commons.wikimedia.org/wiki/File:LogisticMap_
BifurcationDiagram.png (2011)
3. The Open Group: The Single UNIX® Specification, Version 2. http://www.unix.org (1997)
4. The GSL project: GNU operating system – GSL – GNU scientific library. http://www.gnu.org/
software/gsl/ (1996–2011)
5. Lüscher, M.: A portable high-quality random number generator for lattice field theory
simulations. Comput. Phys. Commun. 79, 100–110 (1994)
6. James, F.: RANLUX: a Fortran implementation of the high-quality pseudorandom number
generator of Lüscher. Comput. Phys. Commun. 79, 111–114 (1994)
7. L'Ecuyer, P.: Maximally equidistributed combined Tausworthe generators. Math. Comput. 65,
203–213 (1996)
8. Matsumoto, M., Nishimura, T.: Mersenne twister: a 623-dimensionally equidistributed uniform
pseudorandom number generator. ACM Trans. Model. Comput. Simul. 8, 3–30 (1998)
9. Marsaglia, G., Tsang, W.W., Wang, J.: Fast generation of discrete random variables. J. Stat.
Softw. 11, 3 (2004)
10. Box, G.E.P., Muller, M.: A note on the generation of random normal deviates. Ann. Math. Stat.
29, 610–611 (1958)
11. Marsaglia, G., Tsang, W.: The Ziggurat method for generating random variables. J. Stat. Softw.
5, 8 (2000)
12. Metropolis, N., et al.: Equations of state calculations by fast computing machines. J. Chem.
Phys. 21(6), 1087–1092 (1953)
13. Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their application.
Biometrika 57, 97–109 (1970)
14. Caldwell, A., Kollar, D., Kröninger, K.: BAT – The Bayesian Analysis Toolkit. Comput. Phys.
Commun. 180, 2197–2209 (2009). https://wwwold.mppmu.mpg.de/bat/
Chapter 5
Parameter Estimate

5.1 Introduction

This chapter describes how to determine unknown parameters of some probability
distribution by sampling the values of random variables that follow that distribu-
tion. In physics, this procedure is applied when measuring parameters from
an experimental data sample. A typical case is an experiment at a particle collider
recording multiple collision events for further analysis.
The problem of parameter estimate was already discussed in Sect. 3.5 for the
Bayesian approach. The concepts are generalized, and the frequentist approach
discussed, in the following.

5.2 Inference

The process of determining an estimated value \hat{\theta} and the corresponding uncertainty
\delta of some unknown parameter \theta from experimental data is also called inference.
The presence of a finite uncertainty reflects the statistical fluctuations of the
data sample, due to the intrinsic (theoretical) and experimental (due to detector
effects) randomness of our observable quantities. This is depicted in the diagram
in Fig. 5.1. The smaller the fluctuations of the data (i.e. a distribution that
concentrates a large probability in a small interval), the smaller the uncertainty
in the determination of the unknown parameters. Ideally, if the data sample
exhibited no fluctuation and our detector had a perfect response, an exact
knowledge of the unknown parameter would be possible. This case never occurs
in real experiments, and every real-life measurement is affected by some level of
uncertainty.


Fig. 5.1 Relation between probability and inference. Probability: the theory model determines the distribution of the data, which fluctuate according to the randomness of the process. Inference: from the data, the model parameters are determined, with uncertainties due to fluctuations in the finite data sample

5.3 Parameters of Interest

The theory provides a probability model that predicts the distribution of the
observable quantities. Some theory parameters are unknown, and the measurement
of those parameters of interest is the goal of our experiment.
There are cases where the parameter of interest is a calibration constant, which
is not strictly related to a physics theory, but to a model assumed to describe the
detector response we are interested in.

5.4 Nuisance Parameters

The distribution of experimental data is the result of the combination of a theoretical
model and the experimental detector response: the detector's finite
resolution, miscalibrations, the presence of background, etc. The detector response
itself can be described by a probability model that depends on unknown parameters.
These additional unknown parameters, called nuisance parameters, arise in this way
in the problem and appear together with the parameters of interest.
For instance, when determining the yield of a signal peak, often other parameters
need to be determined from data, such as the experimental resolution that affects
the peak width, the detector efficiencies that need to be corrected for in order to
determine the signal production yield from the measured area under the peak, or
other additional parameters needed to determine the shapes and the amounts of the
possible backgrounds, and so on.

5.5 Measurements and Their Uncertainties

Our data sample consists of measured values of the observable quantities, which are,
in turn, a sampling of the PDF determined by a combination of theoretical and
instrumental effects.
We can determine, or estimate, the value of unknown parameters (parameters of
interest or nuisance parameters) using the data collected by our experiment. The
estimate is not exactly equal to the true value of the parameter, but provides an
approximate knowledge of the true value, within some uncertainty. As the result of the
measurement of a parameter \theta, one quotes the estimated value \hat{\theta} and its uncertainty
\delta, usually using the notation already introduced in Sect. 3.5.2:

\theta = \hat{\theta} \pm \delta \,.   (5.1)

The estimate \hat{\theta} is often also called the central value.^1 The interval [\hat{\theta} - \delta,\, \hat{\theta} + \delta] is
referred to as the uncertainty interval. The meaning of the uncertainty interval is different
in the frequentist and in the Bayesian approaches. In both approaches, a probability
level needs to be specified in order to determine the size of the uncertainty. When
not otherwise specified, by convention, a 68.27% probability level is assumed,
corresponding to the area under a Gaussian distribution in a \pm 1\sigma interval. Other
choices are 90% or 95% probability levels, usually adopted when quoting upper or
lower limits.
In some cases, asymmetric positive and negative uncertainties are taken, and the
result is quoted as:

\theta = \hat{\theta}\,^{+\delta_+}_{-\delta_-} \,,   (5.2)

corresponding to the uncertainty interval [\hat{\theta} - \delta_-,\, \hat{\theta} + \delta_+].

5.5.1 Statistical and Systematic Uncertainties

Nuisance parameters can usually be determined from experimental data samples. In
some cases, dedicated data samples may be needed (e.g. data from test beams in
order to determine the calibration constants of a detector, cosmic-ray runs to determine
alignment constants, etc.), or dedicated simulation programs. The uncertainty in the
determination of nuisance parameters translates into uncertainties on the estimate of
the parameters of interest (see for instance Sect. 10.10).

^1 The name 'central value' is frequently used in physics, but it may sometimes be ambiguous in
statistics, where it could be confused with the median, mean or mode.

Uncertainties due to the propagation of the imperfect knowledge of nuisance param-
eters produce systematic uncertainties in the final measurement. Sometimes,
separate contributions to the systematic uncertainty due to individual sources (i.e.
individual nuisance parameters) are quoted, together with the overall measurement
uncertainty.
Uncertainties related to the determination of the parameters of interest that purely
reflect fluctuations in the data, regardless of possible uncertainties of nuisance
parameters, are called statistical uncertainties.

5.6 Frequentist vs Bayesian Inference

As introduced in Sect. 1.2, two main complementary statistical approaches exist in
the literature and correspond to two different interpretations of the uncertainty interval,
the central value and its corresponding uncertainty.
• Frequentist approach: for a large fraction, usually taken as 68.27%, of repeated
experiments, the unknown true value of \theta is contained in the quoted confidence
interval [\hat{\theta} - \delta,\, \hat{\theta} + \delta]. The fraction is intended in the limit of an infinitely
large number of repetitions of the experiment, and \hat{\theta} and \delta may vary from one
experiment to another, being the result of a measurement in each experiment.
• Bayesian approach: one's degree of belief that the unknown parameter is
contained in the quoted credible interval [\hat{\theta} - \delta,\, \hat{\theta} + \delta] can be quantified
with a 68.27% probability.
Bayesian inference was already discussed in Sect. 3.5.
In the frequentist approach, the property of the estimated interval to contain the
true value in 68.27% of the experiments is called coverage. The probability level,
usually taken as 68.27%, is called the confidence level. Interval estimates that have a
larger or smaller probability of containing the true value, compared to the desired
confidence level, are said to overcover or undercover, respectively.
With both the frequentist and the Bayesian approach, there are some degrees
of arbitrariness in the choice of uncertainty intervals (central interval, extreme
intervals, smallest-length interval, etc.), as already discussed in Sect. 3.5.2 for the
Bayesian approach. Chapter 7 will discuss the corresponding implications for the
frequentist approach.

5.7 Estimators

The estimate of an unknown parameter is a mathematical procedure to determine a
central value of that parameter as a function of the observed data sample.
In general, the function of the data sample that returns the estimate of a parameter
is called an estimator. Estimators can be defined in practice by more or less complex
mathematical procedures or numerical algorithms. We are interested in estimators
with 'good' statistical properties; the properties that characterize good
estimators will be discussed in Sect. 5.8.

Example 5.17 A Very Simple Estimator in a Gaussian Case


As a first and extremely simplified example, let us assume a Gaussian
distribution whose standard deviation \sigma is known (e.g. the resolution of
our apparatus) and whose average \mu is the unknown parameter of interest.
Consider a data sample consisting of a single measurement x distributed
according to the Gaussian distribution under consideration. As an estimator of
\mu, the function \hat{\mu} that returns the single measured value x can be taken:

\hat{\mu}(x) = x \,.   (5.3)

If the experiment is repeated many times (ideally an infinite number of
times), different values of \hat{\mu} = x are obtained, distributed according
to the original Gaussian.
In 68.27% of the experiments, in the limit of an infinite number of
experiments, the fixed and unknown true value \mu lies within the confidence
interval [\hat{\mu} - \sigma,\, \hat{\mu} + \sigma], i.e. \mu - \sigma < \hat{\mu} < \mu + \sigma, or \mu \in [\hat{\mu} - \sigma,\, \hat{\mu} + \sigma]; in
the remaining 31.73% of the cases, \mu lies outside the same interval.
This expresses the coverage of the interval [\hat{\mu} - \sigma,\, \hat{\mu} + \sigma] at the 68.27%
confidence level.
The estimate \mu = \hat{\mu} \pm \sigma can be quoted in this sense. \pm\sigma is the error or
uncertainty assigned to the measurement \hat{\mu}, with the frequentist meaning
defined in Sect. 5.6.
In realistic cases, experimental data samples contain more information
than a single measurement, and more complex PDF models than a simple
Gaussian are required. The definition of an estimator may in general require
complex mathematics and, in many cases, computer algorithms.

5.8 Properties of Estimators

Different estimators may have different statistical properties that make one or another estimator more suitable for a specific problem. Some of the main properties of estimators are presented below. Section 5.10 will introduce maximum likelihood estimators, which have good properties in terms of most of these indicators.

5.8.1 Consistency

An estimator is said to be consistent if it converges, in probability, to the true unknown parameter value as the number of measurements $n$ tends to infinity. That is:

$$\forall\, \varepsilon > 0 \quad \lim_{n\to\infty} P\left(\,\left|\hat\theta_n - \theta\right| < \varepsilon\right) = 1 \,. \tag{5.4}$$

5.8.2 Bias

The bias of an estimator is the expected value of the deviation of the parameter estimate from the corresponding true value of that parameter:

$$b(\hat\theta\,) = \left\langle \hat\theta - \theta \right\rangle = \left\langle \hat\theta \right\rangle - \theta \,. \tag{5.5}$$

5.8.3 Minimum Variance Bound and Efficiency

The variance $V[\hat\theta\,]$ of any consistent estimator is subject to a lower bound, due to Cramér [1] and Rao [2], which is given by:

$$V[\hat\theta\,] \ge V_{\mathrm{CR}}(\hat\theta\,) = \frac{\left(1 + \dfrac{\partial b(\hat\theta\,)}{\partial\theta}\right)^{2}}{\left\langle\left(\dfrac{\partial\log L(x_1,\dots,x_n;\theta)}{\partial\theta}\right)^{2}\right\rangle} \,, \tag{5.6}$$

where $b(\hat\theta\,)$ is the bias of the estimator (Eq. (5.5)) and the denominator is the Fisher information, already defined in Sect. 3.7.
The ratio of the Cramér–Rao bound to the estimator's variance is called the estimator's efficiency:

$$\varepsilon(\hat\theta\,) = \frac{V_{\mathrm{CR}}(\hat\theta\,)}{V[\hat\theta\,]} \,. \tag{5.7}$$

Any consistent estimator $\hat\theta$ has efficiency $\varepsilon(\hat\theta\,)$ lower than or equal to one, due to the Cramér–Rao bound.

Example 5.18 Estimators with Variance Below the Cramér–Rao Bound Are Not Consistent

It is possible to find estimators that have variance lower than the Cramér–Rao bound, but this implies that they are not consistent.
For instance, an estimator of an unknown parameter that gives a constant
value (say zero) as an estimate of the parameter, regardless of the data
sample, has zero variance, but it is of course not consistent.
An estimator of this kind is clearly not very useful in practice.

5.8.4 Robust Estimators

The good properties of some estimators may be spoiled if the real distribution of a data sample deviates from the assumed PDF model.
Entries in the data sample that introduce visible deviations from the theoretical PDF, such as data in extreme tails of the PDF where no entry is expected, are called outliers.
If data may deviate from the nominal PDF model, an important property of an estimator is to have a limited sensitivity to the presence of outliers. This property, which can be quantified more precisely, is in general called robustness.
An example of a robust estimator of the central value of a symmetric distribution from a sample $x_1,\dots,x_N$ is the median $\tilde{x}$, defined in Eq. (1.23):

$$\tilde{x} = \begin{cases} x_{(N+1)/2} & \text{if } N \text{ is odd} \,, \\ \frac{1}{2}\left(x_{N/2} + x_{N/2+1}\right) & \text{if } N \text{ is even} \,. \end{cases} \tag{5.8}$$

Clearly, the presence of outliers in the left or right tails of the distribution does not change significantly the value of the median, as long as the sample is dominated by measurements in the 'core' of the distribution. Conversely, the usual average (Eq. (1.13)) may be shifted from the true value, the more so the broader the outlier distribution is compared with the 'core' part of the distribution.
An average computed by removing from the sample a given fraction $f$ of the data in the rightmost and leftmost tails is called a trimmed average and is also less sensitive to the presence of outliers.
It is convenient to define the breakdown point as the maximum fraction of incorrect measurements (i.e. outliers) above which the estimate may grow arbitrarily large in absolute value. In particular, a trimmed average that removes a fraction $f$ of the events can be shown to have a breakdown point equal to $f$, while the median has a breakdown point of 0.5. The mean of a distribution, instead, has a breakdown point of zero.
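These breakdown properties can be illustrated with a short numerical sketch (Python with NumPy; the sample, the outlier values, and the trimming fraction are invented for illustration and are not from the text):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
core = rng.normal(0.0, 1.0, size=95)   # 'core' of the distribution, true center 0
outliers = np.full(5, 50.0)            # 5% outliers far in the right tail
sample = np.concatenate([core, outliers])

mean = sample.mean()                   # shifted away from 0 by the outliers
median = np.median(sample)             # robust: breakdown point 0.5

def trimmed_mean(x, f):
    """Average after removing a fraction f of entries in each tail."""
    x = np.sort(x)
    k = int(len(x) * f)
    return x[k:len(x) - k].mean()

trim = trimmed_mean(sample, 0.05)      # robust as long as the outlier fraction <= f

print(f"mean = {mean:.2f}, median = {median:.2f}, trimmed = {trim:.2f}")
```

With a 5% contamination, the mean is pulled well away from zero, while the median and the trimmed average (with $f$ at least as large as the outlier fraction) stay close to the true center.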

A more detailed treatment of robust estimators is beyond the purpose of this text.
Reference [3] contains a more extensive discussion about robust estimators.

5.9 Binomial Distribution for Efficiency Estimate

The efficiency $\varepsilon$ of a device is defined as the probability that the device gives a positive signal when a process of interest occurs. Particle detectors are examples of such devices: they produce a signal when a particle interacts with them, but they may fail in a fraction of the cases. The distribution of the number of positive signals $n$, when $N$ processes of interest have occurred, is given by a binomial distribution (Eq. (1.39)) with parameter $p = \varepsilon$.
A typical problem is the estimate of the efficiency $\varepsilon$ of the device. A pragmatic way to estimate the efficiency consists in performing a large number $N$ of samplings of the process of interest, counting the number of times the device gives a positive signal (i.e. it has been efficient). For a particle detector, the data acquisition time should be sufficiently long to have a large number of particles traversing the detector.
Assume that the result of a real experiment gives a measured value of $n$ equal to $\hat{n}$; an estimate of the true efficiency $\varepsilon$ is then given by:

$$\hat\varepsilon = \frac{\hat{n}}{N} \,. \tag{5.9}$$

The variance of $\hat{n}$, from Eq. (1.41), is equal to $N\varepsilon(1-\varepsilon)$; hence the standard deviation of $\hat\varepsilon = \hat{n}/N$, using Eq. (1.22), is given by:

$$\sigma_{\hat\varepsilon} = \sqrt{\mathrm{Var}\!\left[\frac{\hat{n}}{N}\right]} = \sqrt{\frac{\varepsilon(1-\varepsilon)}{N}} \,. \tag{5.10}$$

Equation (5.10) is not directly usable, since $\varepsilon$, the true efficiency value, is unknown. However, if $N$ is sufficiently large, $\hat\varepsilon$ is close to the true efficiency $\varepsilon$ as a consequence of the law of large numbers (see Sect. 1.17). By simply replacing $\varepsilon$ with $\hat\varepsilon$, the following approximate expression for the uncertainty on $\hat\varepsilon$ is obtained:

$$\sigma_{\hat\varepsilon} \simeq \sqrt{\frac{\hat\varepsilon(1-\hat\varepsilon)}{N}} \,. \tag{5.11}$$

The above formula is just an approximation; in particular, it leads to an error $\sigma_{\hat\varepsilon} = 0$ in cases where $\hat\varepsilon = 0$ or $\hat\varepsilon = 1$, i.e. for $\hat{n} = 0$ or $\hat{n} = N$, respectively. Section 7.3 will present how to overcome these problems of Eq. (5.11) with a more rigorous treatment.
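As an illustration of Eqs. (5.9)–(5.11), the following sketch (Python with NumPy; the true efficiency and the sample size are invented) simulates the counting experiment and computes the approximate uncertainty:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
eps_true = 0.9   # true efficiency (unknown in a real experiment)
N = 10000        # number of sampled processes, e.g. particles crossing the detector

# The number of positive signals follows a binomial distribution with p = eps_true
n_hat = rng.binomial(N, eps_true)

eps_hat = n_hat / N                                 # Eq. (5.9)
sigma_eps = np.sqrt(eps_hat * (1.0 - eps_hat) / N)  # Eq. (5.11), approximate

print(f"estimated efficiency: {eps_hat:.4f} +/- {sigma_eps:.4f}")
```

The estimate comes out close to the true value 0.9, with an uncertainty of about $\sqrt{0.9 \cdot 0.1 / 10000} \simeq 0.003$.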

5.10 Maximum Likelihood Method

The most frequently adopted estimation method is based on the construction of the combined probability distribution of all measurements in our data sample, called the likelihood function, which was already introduced in Sect. 3.4. The estimate of the parameters we want to determine is obtained by finding the parameter set that corresponds to the maximum value of the likelihood function. This approach gives this technique the name maximum likelihood method.
The procedure is also called best fit because it determines the parameters for which the theoretical PDF model best fits the experimental data sample.
Maximum likelihood fits are very frequently used because of their very good statistical properties according to the indicators discussed in Sect. 5.8.
The estimator discussed in Example 5.17 is a very simple example of the application of the maximum likelihood method. A Gaussian PDF with unknown average $\mu$ and known standard deviation $\sigma$ was assumed, and the estimate $\hat\mu$ was the value of a single measurement $x$ following the given Gaussian PDF. Indeed, $x$ is the value of $\mu$ that maximizes the likelihood function.

5.10.1 Likelihood Function

The likelihood function is the function that, for given values of the unknown parameters, returns the value of the PDF evaluated at the observed data sample. If the measured values of $n$ random variables are $x_1,\dots,x_n$ and our PDF model depends on $m$ unknown parameters $\theta_1,\dots,\theta_m$, the likelihood function is:

$$L(x_1,\dots,x_n;\theta_1,\dots,\theta_m) = f(x_1,\dots,x_n;\theta_1,\dots,\theta_m) \,, \tag{5.12}$$

where $f$ is the joint PDF of the random variables $x_1,\dots,x_n$. As already anticipated in Sect. 3.4, the notation $L(x_1,\dots,x_n \,|\, \theta_1,\dots,\theta_m)$ is also used, resembling the notation adopted for conditional probability (see Sect. 1.9).
The maximum likelihood estimator of the unknown parameters $\theta_1,\dots,\theta_m$ is the function that returns the values of the parameters $\hat\theta_1,\dots,\hat\theta_m$ for which the likelihood function, evaluated at the measured sample $x_1,\dots,x_n$, is maximum. If the maximum is not unique, the determination of the maximum likelihood estimate is ambiguous.
If we have $N$ repeated measurements, each consisting of the $n$ values of the random variables $x_1,\dots,x_n$, the likelihood function is the probability density corresponding to the total sample $\vec{x} = \{(x_{11},\dots,x_{1n}),\dots,(x_{N1},\dots,x_{Nn})\}$. If the observations are independent of each other, the likelihood function of the sample consisting of the $N$ events² recorded by our experiment can be written as the product of the PDFs corresponding to the measurement of each single event:

$$L(\vec{x}\,;\vec\theta\,) = \prod_{i=1}^{N} f(x_{i1},\dots,x_{in};\theta_1,\dots,\theta_m) \,. \tag{5.13}$$

Often the logarithm of the likelihood function is computed, so that the product of many terms appearing in the likelihood definition is transformed into a sum of logarithms. The logarithm of the likelihood function in Eq. (5.13) is:

$$\log L(\vec{x}\,;\vec\theta\,) = \sum_{i=1}^{N} \log f(x_{i1},\dots,x_{in};\theta_1,\dots,\theta_m) \,. \tag{5.14}$$

5.10.1.1 Numerical Implementations: MINUIT

The maximization of the likelihood function $L$, or the equivalent, but often more convenient, minimization of $-\log L$, can be performed analytically only in the simplest cases. Most realistic cases require numerical methods implemented as computer algorithms. The software MINUIT [4] has been one of the most widely used minimization tools in the field of high-energy and nuclear physics since the 1970s.
The minimization is based on the steepest descent direction in the parameter space, which is determined from a numerical evaluation of the gradient of (the logarithm of) the likelihood function.
MINUIT has been reimplemented in C++ from the original Fortran version and is available in the ROOT software toolkit [5].
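As a generic illustration of numerical $-\log L$ minimization (MINUIT itself is used from ROOT, or from Python through packages such as iminuit), here is a minimal sketch using SciPy's general-purpose minimizer on a Gaussian sample; the data, starting point, and the choice of the Nelder–Mead simplex method are illustrative assumptions, not the MINUIT algorithm itself:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=7)
data = rng.normal(5.0, 2.0, size=1000)   # sample with true mu = 5, sigma = 2

def nll(params):
    """Negative log likelihood of a Gaussian model, up to constant terms."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf                    # forbid the unphysical region
    return np.sum(0.5 * ((data - mu) / sigma) ** 2 + np.log(sigma))

# Numerical minimization; a derivative-free simplex method is used here,
# while MINUIT's MIGRAD uses gradient-based steepest-descent steps
result = minimize(nll, x0=[0.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

The fitted values reproduce the generated $\mu$ and $\sigma$ within the statistical uncertainty of the sample.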

5.10.2 Extended Likelihood Function

If the number of recorded events $N$ is also a random variable that follows a distribution $P(N;\theta_1,\dots,\theta_m)$, which may also depend on the $m$ unknown parameters, the extended likelihood function may be defined as:

$$L = P(N;\theta_1,\dots,\theta_m)\,\prod_{i=1}^{N} f(x_{i1},\dots,x_{in};\theta_1,\dots,\theta_m) \,. \tag{5.15}$$

² In physics the word event is often used with a different meaning with respect to statistics: it refers to a collection of measurements of observable quantities $\vec{x} = (x_1,\dots,x_n)$ corresponding to a physical phenomenon, like a collision of particles at an accelerator, or the interaction of a particle or a shower of particles from cosmic rays in a detector. Measurements performed at different events are usually uncorrelated, and each sequence of variables taken from $N$ different events, $x_{1i},\dots,x_{Ni}$, $i = 1,\dots,n$, can be considered a sampling of independent and identically distributed (IID) random variables, as defined in Sect. 4.2.

The extended likelihood function exploits the number of recorded events $N$ as information to determine the parameter estimate, in addition to the data sample $\vec{x}$. In almost all cases in physics, $P(N;\theta_1,\dots,\theta_m)$ is a Poisson distribution whose average $\nu$ may depend on the $m$ unknown parameters, and Eq. (5.15) becomes:

$$L = \frac{e^{-\nu(\theta_1,\dots,\theta_m)}\,\nu(\theta_1,\dots,\theta_m)^N}{N!}\,\prod_{i=1}^{N} f(x_{i1},\dots,x_{in};\theta_1,\dots,\theta_m) \,. \tag{5.16}$$

Consider the case where the PDF $f$ is a linear combination of two PDFs, one for 'signal', $f_s$, and one for 'background', $f_b$, whose individual yields, $s$ and $b$, are unknown. The extended likelihood function can be written as:

$$L(x_i; s, b, \vec\theta\,) = \frac{(s+b)^N e^{-(s+b)}}{N!}\,\prod_{i=1}^{N}\left[w_s f_s(x_i;\vec\theta\,) + w_b f_b(x_i;\vec\theta\,)\right] \,, \tag{5.17}$$

where the fractions of signal and background $w_s$ and $w_b$ are:

$$w_s = \frac{s}{s+b} \,, \tag{5.18}$$

$$w_b = \frac{b}{s+b} \,. \tag{5.19}$$

Note that $w_s + w_b = 1$, hence $f = w_s f_s + w_b f_b$ is normalized, assuming that $f_s$ and $f_b$ are normalized. Replacing Eqs. (5.18) and (5.19) in (5.17), we have:

$$L(x_i; s, b, \vec\theta\,) = \frac{(s+b)^N e^{-(s+b)}}{N!}\,\prod_{i=1}^{N}\frac{s f_s(x_i;\vec\theta\,) + b f_b(x_i;\vec\theta\,)}{s+b} \tag{5.20}$$

$$= \frac{e^{-(s+b)}}{N!}\,\prod_{i=1}^{N}\left[s f_s(x_i;\vec\theta\,) + b f_b(x_i;\vec\theta\,)\right] \,. \tag{5.21}$$

The logarithm of the likelihood function gives a more convenient expression:

$$-\log L(x_i; s, b, \vec\theta\,) = s + b - \sum_{i=1}^{N}\log\left[s f_s(x_i;\vec\theta\,) + b f_b(x_i;\vec\theta\,)\right] + \log N! \,. \tag{5.22}$$

Fig. 5.2 Example of unbinned extended maximum likelihood fit of a simulated dataset ($m$, in GeV, on the horizontal axis; events per 0.01 GeV on the vertical axis). The fit curve is superimposed on the data points (black dots with error bars) and shown as a solid blue line.

The last term, $\log N!$, is constant with respect to the unknown parameters and can be omitted when minimizing the function $-\log L(x_i; s, b, \vec\theta\,)$.
An example of application of the extended maximum likelihood method from Eq. (5.22) is the two-component fit shown in Fig. 5.2. The assumed PDF is modeled as the sum of a Gaussian component ('signal') and an exponential component ('background'). The points with the error bars represent the data sample, which is randomly extracted according to the assumed PDF. Data are shown as a binned histogram for convenience, but the individual values of the random variable $m$ are used in the likelihood function. The unknown parameters that model the signal and background PDFs, determined by the fit according to the likelihood function in Eq. (5.22), are the mean $\mu$ and standard deviation $\sigma$ of the Gaussian component and the decay parameter $\lambda$ of the exponential component. They are determined simultaneously with the numbers of signal and background events, $s$ and $b$, in a single minimization of the negative logarithm of the likelihood function.
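A fit of this kind can be sketched by minimizing Eq. (5.22) numerically (Python with NumPy/SciPy; the generated dataset, the starting values, and the choice of minimizer are illustrative assumptions and do not reproduce the exact sample of Fig. 5.2):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=3)
# Simulated sample: 400 signal events (narrow Gaussian peak)
# on top of 1600 background events (exponential)
m = np.concatenate([rng.normal(3.1, 0.05, size=400),
                    rng.exponential(3.0, size=1600)])

def nll(params):
    """Extended negative log likelihood, Eq. (5.22), dropping the constant log N! term."""
    s, b, mu, sigma, lam = params
    if s <= 0 or b <= 0 or sigma <= 0 or lam <= 0:
        return np.inf
    f_s = np.exp(-0.5 * ((m - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    f_b = lam * np.exp(-lam * m)   # exponential PDF on [0, infinity)
    return s + b - np.sum(np.log(s * f_s + b * f_b))

start = [300.0, 1500.0, 3.0, 0.1, 0.5]   # s, b, mu, sigma, lambda
res = minimize(nll, start, method="Nelder-Mead",
               options={"maxiter": 20000, "fatol": 1e-8, "xatol": 1e-8})
s_hat, b_hat, mu_hat, sigma_hat, lam_hat = res.x
print(f"s = {s_hat:.0f}, b = {b_hat:.0f}, mu = {mu_hat:.3f}, "
      f"sigma = {sigma_hat:.3f}, lambda = {lam_hat:.3f}")
```

All five parameters, including the signal and background yields, are determined in a single minimization, as described in the text.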

5.10.3 Gaussian Likelihood Functions

For $N$ measurements of a variable $x$ distributed according to a Gaussian with average $\mu$ and standard deviation $\sigma$, twice the negative logarithm of the likelihood function can be written as:

$$-2\log L(x_i;\mu,\sigma^2) = \sum_{i=1}^{N}\frac{(x_i-\mu)^2}{\sigma^2} + N\left(\log 2\pi + 2\log\sigma\right) \,. \tag{5.23}$$

The minimization of $-2\log L(x_i;\mu,\sigma^2)$ can be performed analytically by finding the zeros of the first derivatives of $-2\log L$ with respect to $\mu$ and $\sigma^2$. The following maximum likelihood estimates for $\mu$ and $\sigma^2$ can be obtained³:

$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \,, \tag{5.24}$$

$$\widehat{\sigma^2} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat\mu\right)^2 \,. \tag{5.25}$$

The maximum likelihood estimate $\widehat{\sigma^2}$ is affected by a bias (see Sect. 5.8.2), in the sense that its average value deviates from the true $\sigma^2$. The bias, anyway, decreases as $N \to \infty$. A way to correct for the bias present in Eq. (5.25) is discussed in Example 5.20.

5.11 Errors with the Maximum Likelihood Method

Once the estimate $\hat\theta$ of a parameter $\theta$ is determined using the maximum likelihood method, a confidence interval needs to be determined. The required coverage (see Sect. 5.6) is, in most cases, equal to 68.27%, corresponding to $1\sigma$.
Two approximate methods to determine parameter uncertainties are presented in the following sections. In both cases, the coverage is only approximately ensured. Chapter 7 will discuss a more rigorous treatment of uncertainty intervals that ensures the proper coverage.

5.11.1 Second Derivatives Matrix

In the limit of a very large number of measurements, a Gaussian model may be justified in order to determine an estimate of uncertainty intervals, but in realistic cases the obtained results may only be an approximation, and deviations from the exact coverage may occur.
Assuming as PDF model an $n$-dimensional Gaussian (Eq. (2.108)), it is easy to demonstrate that the $n$-dimensional covariance matrix $C$ of the PDF may be obtained as the inverse of the matrix of second-order partial derivatives of the negative logarithm of the likelihood function,⁴ which can be written as:

$$C^{-1}_{ij} = -\,\frac{\partial^2\log L(x_1,\dots,x_n;\theta_1,\dots,\theta_m)}{\partial\theta_i\,\partial\theta_j} \,. \tag{5.26}$$

This covariance matrix gives an $n$-dimensional elliptic confidence contour having the correct coverage only if the PDF model is exactly Gaussian.

³ Note that we will use the notation $\widehat{\sigma^2}$ and not $\hat\sigma^2$, since we consider $\sigma^2$ as the parameter of interest, rather than $\sigma$.
Consider the Gaussian likelihood case in Eq. (5.23), seen in Sect. 5.10.3. The second derivative of $-\log L$ with respect to $\mu$ is:

$$\frac{1}{\sigma_{\hat\mu}^2} = \frac{\partial^2(-\log L)}{\partial\mu^2} = \frac{N}{\sigma^2} \,, \tag{5.27}$$

which gives the following error on the estimated average $\hat\mu$:

$$\sigma_{\hat\mu} = \frac{\sigma}{\sqrt{N}} \,. \tag{5.28}$$

This expression coincides with the result obtained by evaluating the standard deviation of the average from Eq. (5.24) using the general formulae from Eqs. (1.15) and (1.16).

5.11.2 Likelihood Scan

Another frequently used method to determine the uncertainty on maximum likelihood estimates is to consider a scan of $-2\log L$ around the minimum value, $-2\log L_{\max}$, corresponding to the parameter set that maximizes $L$:

$$L_{\max} = L(\vec{x}\,;\hat{\vec\theta}\,) \,. \tag{5.29}$$

An interval corresponding to an increase of $-2\log L$ by one unit with respect to its minimum value can be determined,⁵ as graphically shown in Fig. 5.3 for the case of a single parameter $\theta$. The interval determined by the $-2\log L$ scan may lead to asymmetric errors, as evident in Fig. 5.3, which follow the non-local behavior of the likelihood function. For more than one parameter, the error contour corresponds to the set of parameter values $\vec\theta$ such that:

$$-2\log L(\vec\theta\,) = -2\log L_{\max} + 1 \,. \tag{5.30}$$

⁴ For users of the program MINUIT, this estimate corresponds to a call to the method MIGRAD/HESSE.
⁵ For MINUIT users, this procedure corresponds to a call to the method MINOS.

Fig. 5.3 Scan of $-2\log L$ as a function of a parameter $\theta$. The error interval $[\hat\theta_-, \hat\theta_+]$ is determined from the excursion of $-2\log L$ around the minimum, at $\hat\theta$, finding the values for which $-2\log L$ increases by one unit.

Fig. 5.4 Two-dimensional contour plot showing the $1\sigma$ uncertainty ellipse for the fit parameters $s$ and $b$ (numbers of signal and background events) corresponding to the fit in Fig. 5.2. The contour corresponds to the equation $-2\log L(\vec{x}\,; s, b, \vec\theta\,) = -2\log L_{\max} + 1$.

For instance, the two-dimensional contour plot shown in Fig. 5.4 shows the approximate ellipse for the sample considered in Fig. 5.2, corresponding to the points for which $-2\log L(\vec{x}\,; s, b, \vec\theta\,)$ in Eq. (5.22) increases by one unit with respect to its minimum, shown as the central point.
This method leads to errors identical to those from Eq. (5.26) in the Gaussian case, in which $-2\log L$ has a parabolic shape, but in general an uncertainty interval given by Eq. (5.30) may differ from the uncertainties determined from Eq. (5.26). The coverage is usually improved by using the $-2\log L$ scan rather than the parabolic approximation of Eq. (5.26).
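A scan of this kind can be sketched numerically for a single parameter (Python with NumPy; the exponential model, the sample size, and the scan range are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=11)
t = rng.exponential(scale=2.0, size=200)   # exponential sample, true lambda = 0.5

def m2logL(lam):
    """-2 log L for an exponential PDF f(t) = lambda * exp(-lambda t)."""
    return -2.0 * (len(t) * np.log(lam) - lam * t.sum())

lam_hat = 1.0 / t.mean()                   # analytical maximum likelihood estimate
scan = np.linspace(0.5 * lam_hat, 1.5 * lam_hat, 2001)
delta = m2logL(scan) - m2logL(lam_hat)     # excursion around the minimum

# Interval where -2 log L stays within one unit of its minimum
# (approximate 68.27% CL), possibly asymmetric around lam_hat
inside = scan[delta <= 1.0]
lam_lo, lam_up = inside.min(), inside.max()
err_minus, err_plus = lam_hat - lam_lo, lam_up - lam_hat
print(f"lambda = {lam_hat:.3f}  -{err_minus:.3f} / +{err_plus:.3f}")
```

For this model the scan gives a slightly larger positive than negative error, reflecting the asymmetry of the likelihood function around its maximum.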

Contours corresponding to $Z$ standard deviations can be determined similarly to Eq. (5.30) by requiring:

$$-2\log L(\vec\theta\,) = -2\log L_{\max} + Z^2 \,. \tag{5.31}$$

Example 5.19 Maximum Likelihood Estimate for an Exponential Distribution

Assume an exponential PDF with parameter $\lambda$:

$$f(t) = \lambda\, e^{-\lambda t} \,. \tag{5.32}$$

$\lambda$ is the inverse of the average lifetime, $\tau = 1/\lambda$. Given $N$ measurements of $t$: $t_1,\dots,t_N$, the likelihood function can be written as the product $f(t_1)\cdots f(t_N)$:

$$L(t_1,\dots,t_N;\lambda) = \prod_{i=1}^{N}\lambda\, e^{-\lambda t_i} = \lambda^N e^{-\lambda\sum_{i=1}^{N} t_i} \,. \tag{5.33}$$

The analytical maximization of the likelihood function gives the maximum likelihood estimate of $\lambda$:

$$\hat\lambda = \left(\frac{1}{N}\sum_{i=1}^{N} t_i\right)^{-1} \,, \tag{5.34}$$

with uncertainty, from Eq. (5.26), uncorrected for a possible bias, equal to:

$$\sigma_{\hat\lambda} = \hat\lambda/\sqrt{N} \,. \tag{5.35}$$

The demonstration of this result is left as an exercise to the reader.
The same example using Bayesian inference was discussed in Example 3.12.
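The results of Eqs. (5.34) and (5.35) can be checked with repeated pseudo-experiments (Python with NumPy; the chosen $\lambda$, the sample size, and the number of toys are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=5)
lam_true, N, n_toys = 2.0, 500, 2000

# Generate many pseudo-experiments and apply the ML estimator of Eq. (5.34)
t = rng.exponential(scale=1.0 / lam_true, size=(n_toys, N))
lam_hat = 1.0 / t.mean(axis=1)

# The spread of the estimates should match sigma = lambda / sqrt(N), Eq. (5.35)
sigma_expected = lam_true / np.sqrt(N)
sigma_observed = lam_hat.std()
print(f"expected {sigma_expected:.4f}, observed {sigma_observed:.4f}")
```

The observed spread of the estimates agrees with Eq. (5.35) within the statistical precision of the toy sample; the small residual bias of $\hat\lambda$ also decreases with $N$, as stated in Sect. 5.11.3.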

5.11.3 Properties of Maximum Likelihood Estimators

Below, a number of properties of maximum likelihood estimators are listed, according to the definitions given in Sect. 5.8:
• Maximum likelihood estimators are consistent.
• Maximum likelihood estimators may have a bias, but the bias tends to zero as the number of measurements $N$ tends to infinity.
• Maximum likelihood estimators have efficiencies (compared to the Cramér–Rao bound, from Eq. (5.7)) that asymptotically, for a large number of measurements, tend to one. Hence, maximum likelihood estimators asymptotically have the lowest possible variance compared to any other consistent estimator.
• Finally, maximum likelihood estimators are invariant under reparameterizations: if a maximum of the likelihood is found in terms of some parameters, the transformed parameters in a new parameterization also maximize the likelihood function.

Example 5.20 Bias of the Maximum Likelihood Estimate of a Gaussian Variance

The maximum likelihood estimate of the variance $\sigma^2$ of a Gaussian distribution, from Eq. (5.25), is given by:

$$\widehat{\sigma^2} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat\mu\right)^2 \,. \tag{5.36}$$

The bias, defined in Eq. (5.5), is the difference between the expected value of $\widehat{\sigma^2}$ and the true value of $\sigma^2$. It is easy to show analytically that the expected value of $\widehat{\sigma^2}$ is:

$$\left\langle\widehat{\sigma^2}\right\rangle = \frac{N-1}{N}\,\sigma^2 \,, \tag{5.37}$$

where $\sigma^2$ is the true variance. Hence, the maximum likelihood estimate $\widehat{\sigma^2}$ tends to underestimate the variance, with a bias given by:

$$b\!\left(\widehat{\sigma^2}\right) = \left\langle\widehat{\sigma^2}\right\rangle - \sigma^2 = \left(\frac{N-1}{N} - 1\right)\sigma^2 = -\frac{\sigma^2}{N} \,. \tag{5.38}$$

$b(\widehat{\sigma^2})$ decreases with $N$, which is a general property of maximum likelihood estimates. For this specific case, an unbiased estimate can be obtained by multiplying the maximum likelihood estimate by a correction factor $N/(N-1)$, which gives:

$$\widehat{\sigma^2}_{\text{unbiased}} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \hat\mu\right)^2 \,. \tag{5.39}$$
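The bias of Eq. (5.37) and the corrected estimate of Eq. (5.39) can be verified numerically (Python with NumPy; the true $\sigma$, the sample size, and the number of pseudo-experiments are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=9)
sigma_true, N, n_toys = 3.0, 10, 100000

x = rng.normal(0.0, sigma_true, size=(n_toys, N))
mu_hat = x.mean(axis=1, keepdims=True)

var_ml = ((x - mu_hat) ** 2).sum(axis=1) / N              # Eq. (5.36), biased
var_unbiased = ((x - mu_hat) ** 2).sum(axis=1) / (N - 1)  # Eq. (5.39)

# The expected value of the ML estimate is (N-1)/N * sigma^2, Eq. (5.37)
print(var_ml.mean(), (N - 1) / N * sigma_true**2, var_unbiased.mean())
```

The average of the ML estimates clusters around $(N-1)/N\,\sigma^2 = 8.1$ rather than the true $\sigma^2 = 9$, while the corrected estimator is unbiased.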

5.12 Minimum $\chi^2$ and Least-Squares Methods

Consider a number $n$ of measurements $(y_1\pm\sigma_1,\dots,y_n\pm\sigma_n)$, where each measurement $y_i\pm\sigma_i$ corresponds to a value $x_i$ of a variable $x$. Assume we have a model for the dependence of $y$ on the variable $x$ given by a function $f$:

$$y = f(x;\vec\theta\,) \,, \tag{5.40}$$

where $\vec\theta = (\theta_1,\dots,\theta_m)$ is a set of unknown parameters. If the measurements $y_i$ are distributed around the values $f(x_i;\vec\theta\,)$ according to a Gaussian distribution with standard deviation equal to $\sigma_i$, the likelihood function for this problem can be written as the product of $n$ Gaussian PDFs:

$$L(\vec{y}\,;\vec\theta\,) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left[-\frac{\left(y_i - f(x_i;\vec\theta\,)\right)^2}{2\sigma_i^2}\right] \,, \tag{5.41}$$

where the notation $\vec{y} = (y_1,\dots,y_n)$ was introduced.
Maximizing $L(\vec{y}\,;\vec\theta\,)$ is equivalent to minimizing $-2\log L(\vec{y}\,;\vec\theta\,)$:

$$-2\log L(\vec{y}\,;\vec\theta\,) = \sum_{i=1}^{n}\frac{\left(y_i - f(x_i;\vec\theta\,)\right)^2}{\sigma_i^2} + \sum_{i=1}^{n}\log 2\pi\sigma_i^2 \,. \tag{5.42}$$

The last term does not depend on the parameters $\vec\theta$ if the uncertainties $\sigma_i$ are known and fixed; hence it is a constant that can be dropped when performing the minimization. The first term to be minimized in Eq. (5.42) is a $\chi^2$ variable (see Sect. 2.9):

$$\chi^2(\vec\theta\,) = \sum_{i=1}^{n}\frac{\left(y_i - f(x_i;\vec\theta\,)\right)^2}{\sigma_i^2} \,. \tag{5.43}$$

The terms:

$$r_i = y_i - \hat{y}_i = y_i - f(x_i;\hat{\vec\theta}\,) \,, \tag{5.44}$$

evaluated at the fit values $\hat{\vec\theta}$ of the parameters $\vec\theta$, are called residuals.
An example of a fit performed with the minimum $\chi^2$ method is shown in Fig. 5.5. If the data are distributed according to the assumed model, the residuals are randomly distributed around zero, according to the data uncertainties.

Fig. 5.5 Example of minimum $\chi^2$ fit of a computer-generated dataset. The points with the error bars are used to fit a function model of the type $y = f(x) = A\,x\,e^{-Bx}$, where $A$ and $B$ are unknown parameters determined by the fit. The fit curve is superimposed as a solid blue line. Residuals are shown in the bottom section of the plot.

In case the uncertainties $\sigma_i$ are all equal, it is possible to minimize the expression:

$$S = \sum_{i=1}^{n}\left(y_i - f(x_i;\vec\theta\,)\right)^2 \,. \tag{5.45}$$

This minimization is called the least squares method.

5.12.1 Linear Regression

In the simplest case of a linear function, the minimum $\chi^2$ problem can be solved analytically. The function $f$ can be written as:

$$y = f(x; a, b) = a + bx \,, \tag{5.46}$$

$a$ and $b$ being free parameters, and the $\chi^2$ becomes:

$$\chi^2(a,b) = \sum_{i=1}^{n}\frac{(y_i - a - bx_i)^2}{\sigma_i^2} \,. \tag{5.47}$$

Let us introduce the weights $w_i$, conveniently defined as:

$$w_i = \frac{1/\sigma_i^2}{1/\sigma^2} \,, \tag{5.48}$$

with $1/\sigma^2 = \sum_{i=1}^{n} 1/\sigma_i^2$, so that $\sum_{i=1}^{n} w_i = 1$. If the errors are all equal, the weights are all equal to each other.
The analytical minimization is achieved by imposing $\partial\chi^2(a,b)/\partial a = 0$ and $\partial\chi^2(a,b)/\partial b = 0$, which give:

$$\sum_{i=1}^{n} w_i y_i = a + b\sum_{i=1}^{n} w_i x_i \,, \tag{5.49}$$

$$\sum_{i=1}^{n} w_i x_i y_i = a\sum_{i=1}^{n} w_i x_i + b\sum_{i=1}^{n} w_i x_i^2 \,. \tag{5.50}$$

In matrix form, the system of the two Eqs. (5.49) and (5.50) becomes:

$$\begin{pmatrix}\sum_{i=1}^{n} w_i y_i \\ \sum_{i=1}^{n} w_i x_i y_i\end{pmatrix} = \begin{pmatrix}1 & \sum_{i=1}^{n} w_i x_i \\ \sum_{i=1}^{n} w_i x_i & \sum_{i=1}^{n} w_i x_i^2\end{pmatrix}\begin{pmatrix}a \\ b\end{pmatrix} \,, \tag{5.51}$$

which can easily be inverted. The solution can be written as follows:

$$\hat{b} = \frac{\mathrm{cov}(x,y)}{V[x]} \,, \tag{5.52}$$

$$\hat{a} = \langle y\rangle - \hat{b}\,\langle x\rangle \,, \tag{5.53}$$

where the terms that appear in Eqs. (5.52) and (5.53), and have to be computed in order to determine $\hat{a}$ and $\hat{b}$, are:

$$\langle x\rangle = \sum_{i=1}^{n} w_i x_i \,, \tag{5.54}$$

$$\langle y\rangle = \sum_{i=1}^{n} w_i y_i \,, \tag{5.55}$$
iD1

$$V[x] = \langle x^2\rangle - \langle x\rangle^2 = \sum_{i=1}^{n} w_i x_i^2 - \left(\sum_{i=1}^{n} w_i x_i\right)^2 \,, \tag{5.56}$$

$$\mathrm{cov}(x,y) = \langle xy\rangle - \langle x\rangle\langle y\rangle = \sum_{i=1}^{n} w_i x_i y_i - \sum_{i=1}^{n} w_i x_i\sum_{i=1}^{n} w_i y_i \,. \tag{5.57}$$

The uncertainties on the estimates $\hat{a}$ and $\hat{b}$ can be determined, as described in Sect. 5.11.1, from the matrix of second derivatives of $-\log L = \frac{1}{2}\chi^2$:

$$\frac{1}{\sigma_{\hat{a}}^2} = -\frac{\partial^2\log L}{\partial a^2} = \frac{1}{2}\frac{\partial^2\chi^2}{\partial a^2} \,, \tag{5.58}$$

$$\frac{1}{\sigma_{\hat{b}}^2} = \frac{1}{2}\frac{\partial^2\chi^2}{\partial b^2} \,, \tag{5.59}$$

and are, with the covariance term:

$$\sigma_{\hat{a}} = \sigma \,, \tag{5.60}$$

$$\sigma_{\hat{b}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} w_i x_i^2}} \,, \tag{5.61}$$

$$\mathrm{cov}(\hat{a},\hat{b}) = \frac{\sigma^2\sum_{i=1}^{n} w_i x_i}{\left(\sum_{i=1}^{n} w_i x_i\right)^2 - \sum_{i=1}^{n} w_i x_i^2} \,. \tag{5.62}$$

A coefficient that is not widely used in physics, but appears rather frequently in linear regressions performed by commercial software, is the coefficient of determination, or $R^2$, defined as:

$$R^2 = \frac{\sum_{i=1}^{N}\left(\hat{y}_i - \langle y\rangle\right)^2}{\sum_{i=1}^{N}\left(y_i - \langle y\rangle\right)^2} \,, \tag{5.63}$$

where $\hat{y}_i = f(x_i;\hat{a},\hat{b}) = \hat{a} + \hat{b}x_i$ are the fit values of the individual measurements. $R^2$ may take values between 0 and 1 and is often expressed as a percentage. $R^2 = 1$ corresponds to measurements perfectly aligned along the fitted regression line, indicating that the regression line accounts for all of the variation of the measurements as a function of $x$, while $R^2 = 0$ corresponds to a perfectly horizontal regression line, indicating that the measurements are insensitive to the variable $x$.
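The closed-form solution of Eqs. (5.48)–(5.57) can be sketched directly (Python with NumPy; the data points and uncertainties are invented for illustration):

```python
import numpy as np

# Measurements y_i +/- sigma_i at points x_i; a = 1, b = 2 would be the exact line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
sig = np.array([0.1, 0.1, 0.2, 0.1, 0.2])

w = (1.0 / sig**2) / np.sum(1.0 / sig**2)    # Eq. (5.48): normalized weights, sum to 1

x_avg, y_avg = np.sum(w * x), np.sum(w * y)  # Eqs. (5.54), (5.55)
var_x = np.sum(w * x**2) - x_avg**2          # Eq. (5.56)
cov_xy = np.sum(w * x * y) - x_avg * y_avg   # Eq. (5.57)

b_hat = cov_xy / var_x                       # Eq. (5.52)
a_hat = y_avg - b_hat * x_avg                # Eq. (5.53)
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```

The estimates satisfy the normal equation (5.49) exactly by construction and recover the underlying line within the stated uncertainties.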

5.12.2 Goodness of Fit and p-Value

One advantage of the minimum $\chi^2$ method is that the expected distribution of $\hat\chi^2$, the minimum $\chi^2$ value, is known and is given by the $\chi^2$ distribution in Eq. (2.33), with a number of degrees of freedom equal to the number of measurements $n$ minus the number of fit parameters $m$.
The p-value is defined as $P(\chi^2 \ge \hat\chi^2)$, the probability that a $\chi^2$ greater than or equal to the value $\hat\chi^2$ is obtained from a random fit according to the assumed model. If the data follow the assumed Gaussian distributions, the p-value is expected to be a random variable uniformly distributed from 0 to 1, from a general property of cumulative distributions discussed in Sect. 2.5.
Obtaining a small p-value of the fit could be a symptom of a poor description of the data by the theoretical model $y = f(x;\vec\theta\,)$. For this reason, the minimum $\chi^2$ value can be used as a measure of the goodness of the fit. However, setting a threshold, say requiring a p-value greater than 0.05, to determine whether a fit can be considered acceptable or not will always discard on average 5% of the cases, even if the PDF model correctly describes the data, due to the possibility of statistical fluctuations.
Note also that the p-value cannot be considered as the probability of the fit hypothesis to be true. Such a probability would only have a meaning in the Bayesian approach (see Chap. 3) and, in that case, it would require a different type of evaluation.
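The p-value of a minimum $\chi^2$ fit can be computed from the survival function of the $\chi^2$ distribution (Python with SciPy; the minimum $\chi^2$ value and the numbers of points and parameters below are invented):

```python
from scipy.stats import chi2

chi2_min = 13.2   # hypothetical minimum chi-square value from a fit
n_points = 12     # number of measurements
n_params = 2      # number of fit parameters
ndof = n_points - n_params

# p-value = P(chi2 >= chi2_min) for ndof degrees of freedom
p_value = chi2.sf(chi2_min, df=ndof)
print(f"p-value = {p_value:.3f}")
```

A p-value of this size (around 0.2) would indicate a fit quality compatible with statistical fluctuations under the assumed model.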
Unlike in minimum $\chi^2$ fits, in general, for maximum likelihood fits the value of $-2\log L$ at the maximum of the likelihood function does not provide a measure of the goodness of the fit. It is possible in some cases to obtain a goodness-of-fit measure by finding the ratio of the likelihood functions evaluated in two different hypotheses. Wilks' theorem (see Sect. 9.8) ensures that a likelihood ratio, under some conditions that hold in particular circumstances, is asymptotically distributed as a $\chi^2$ for a large number of repeated measurements. A more extensive discussion about the relation between the likelihood function, the ratio of likelihood functions, and the $\chi^2$ can be found in [6], and will also be presented in Sect. 5.13.2 for what concerns binned fits.

5.13 Binned Data Samples

The maximum likelihood method discussed in Sect. 5.10 performs parameter estimates using the complete set of information present in our measurement sample. For repeated measurements of a single variable $x$, this is given by a dataset $(x_1,\dots,x_n)$. In case of a very large number of measurements $n$, computing the likelihood function may become impractical from the numerical point of view, and the implementation could require intensive computing power. Machine precision may also become an issue.
For this reason, the parameter estimate is frequently performed using a summary of the sample's information, obtained by binning the distribution of the random variable (or variables) of interest and using as information the number of entries in each single bin: $(n_1,\dots,n_N)$, where the number of bins $N$ is typically much smaller than the number of measurements $n$.
In practice, a histogram of the experimental distribution is built in one or more variables. If the sample is composed of independent extractions from a given random distribution, the number of entries in each bin follows a Poisson distribution. The expected number of entries in each bin can be determined from the theoretical distribution and depends on the unknown parameters one wants to estimate.

5.13.1 Minimum $\chi^2$ Method for Binned Histograms

In case of a sufficiently large number of entries in each bin, the Poisson distribution describing the number of entries in a bin can be approximated by a Gaussian with variance equal to the expected number of entries in that bin (see Sect. 2.12). In this case, the expression for $-2\log L$ becomes:

$$-2\log L = \sum_{i=1}^{N}\frac{\left(n_i - \mu_i(\theta_1,\dots,\theta_m)\right)^2}{\mu_i(\theta_1,\dots,\theta_m)} + N\log 2\pi + \sum_{i=1}^{N}\log n_i \,, \tag{5.64}$$

where:

$$\mu_i(\theta_1,\dots,\theta_m) = \int_{x_i^{\mathrm{lo}}}^{x_i^{\mathrm{up}}} f(x;\theta_1,\dots,\theta_m)\,\mathrm{d}x \tag{5.65}$$

and $[x_i^{\mathrm{lo}}, x_i^{\mathrm{up}}]$ is the interval corresponding to the $i$-th bin. If the binning is sufficiently fine, $\mu_i$ can be replaced by:

$$\mu_i(\theta_1,\dots,\theta_m) \simeq f(x_i;\theta_1,\dots,\theta_m)\,\delta x_i \,, \tag{5.66}$$

where $x_i = (x_i^{\mathrm{up}} + x_i^{\mathrm{lo}})/2$ is the center of the $i$-th bin and $\delta x_i = x_i^{\mathrm{up}} - x_i^{\mathrm{lo}}$ is the bin's width.
The quantity defined in Eq. (5.64), dropping the last two constant terms, is called Pearson's $\chi^2$:

$$\chi^2_{\mathrm{P}} = \sum_{i=1}^{N}\frac{\left(n_i - \mu_i(\theta_1,\dots,\theta_m)\right)^2}{\mu_i(\theta_1,\dots,\theta_m)} \,. \tag{5.67}$$

It may be more convenient to replace the expected number of entries with the observed number of entries. This gives the so-called Neyman's $\chi^2$:

$$\chi^2_{\mathrm{N}} = \sum_{i=1}^{N}\frac{\left(n_i - \mu_i(\theta_1,\dots,\theta_m)\right)^2}{n_i} \,. \tag{5.68}$$

The value of the $\chi^2$ at the minimum can be used, as discussed in Sect. 5.12.2, as a measure of the goodness of the fit, where in this case the number of degrees of freedom is equal to the number of bins $N$ minus the number of fit parameters $m$.

5.13.2 Binned Poissonian Fits

The Gaussian approximation assumed in Sect. 5.13.1 does not hold when the
number of entries per bin is small. A Poissonian model, also valid for small number
of entries, should be applied in those cases. The negative log likelihood function
that replaces Eq. (5.64) is:

$$
-2\log L = -2\log \prod_{i=1}^{N} \mathrm{Pois}(n_i;\,\mu_i(\theta_1,\cdots,\theta_m)) \qquad (5.69)
$$
$$
= -2\log \prod_{i=1}^{N} \frac{e^{-\mu_i(\theta_1,\cdots,\theta_m)}\,\mu_i(\theta_1,\cdots,\theta_m)^{n_i}}{n_i!}\,. \qquad (5.70)
$$
Using the approach proposed in [6], the likelihood function can be divided by its
maximum value, which does not depend on the unknown parameters and hence does
not change the fit result. The denominator can be obtained by replacing $\mu_i$ with $n_i$,
obtaining the following negative log likelihood ratio:
$$
\chi^2_\lambda = -2\log \prod_{i=1}^{N} \frac{L(n_i;\,\mu_i(\theta_1,\cdots,\theta_m))}{L(n_i;\,n_i)}
= -2\log \prod_{i=1}^{N} \frac{e^{-\mu_i}\,\mu_i^{n_i}/n_i!}{e^{-n_i}\,n_i^{n_i}/n_i!} \qquad (5.71)
$$
$$
= 2\sum_{i=1}^{N} \left[\,\mu_i(\theta_1,\cdots,\theta_m) - n_i + n_i\log\frac{n_i}{\mu_i(\theta_1,\cdots,\theta_m)}\,\right]\,. \qquad (5.72)
$$

From Wilks' theorem (see Sect. 9.8), if the model is correct, the distribution of the
minimum value of $\chi^2_\lambda$ can be asymptotically approximated with a $\chi^2$ distribution
(Eq. (2.33)) with a number of degrees of freedom equal to the number of bins minus
the number of fit parameters. $\chi^2_\lambda$ can therefore be used to determine a p-value (see
Sect. 5.12.2) that provides a measure of the goodness of the fit.
If the number of measurements is not sufficiently large, the distribution of $\chi^2_\lambda$ for
the specific problem may deviate from a $\chi^2$ distribution, but it can still be determined
by generating a sufficiently large number of Monte Carlo pseudo-experiments that
reproduce the theoretical PDF, and the p-value can be computed accordingly. This
technique is often called toy Monte Carlo.
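As an illustrative sketch of a binned Poissonian fit (not from the original text: the sample, binning, and starting values are invented, and NumPy/SciPy are assumed available), the likelihood ratio of Eq. (5.72) can be minimized numerically for a toy exponential spectrum:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Toy histogram: 500 entries from an exponential PDF with decay parameter 1.5
data = rng.exponential(scale=1.0 / 1.5, size=500)
edges = np.linspace(0.0, 3.0, 31)
n = np.histogram(data, bins=edges)[0]

def chi2_lambda(params):
    """Baker-Cousins chi2 of Eq. (5.72); mu_i is the yield times the
    integral of the exponential PDF in each bin, as in Eq. (5.65)."""
    nu, lam = params
    if nu <= 0 or lam <= 0:
        return 1e12                     # keep the minimizer in the valid region
    mu = nu * np.diff(1.0 - np.exp(-lam * edges))
    safe_n = np.where(n > 0, n, 1.0)    # n_i log(n_i/mu_i) -> 0 for empty bins
    return 2.0 * np.sum(mu - n + n * np.log(safe_n / mu))

res = minimize(chi2_lambda, x0=[400.0, 1.0], method="Nelder-Mead")
nu_hat, lam_hat = res.x                 # fitted yield and decay parameter
```

The minimum value `res.fun` can then be compared with a $\chi^2$ distribution with $30-2$ degrees of freedom, or with its toy-Monte-Carlo distribution, to extract a p-value.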

5.14 Error Propagation

Given the measured values of the parameters $\theta_1,\cdots,\theta_m$ provided by an inference
procedure, in some cases it may be necessary to evaluate a new set of parameters,
$\eta_1,\cdots,\eta_k$, determined as functions of the measured ones. The uncertainty on the
original parameters propagates to the uncertainty on the new parameter set.
The best option to determine the uncertainties on the new parameters would be
to reparameterize the likelihood problem using the new set of parameters and to
perform again the maximum likelihood fit in terms of the new parameters, which
would directly provide estimates for $\eta_1,\cdots,\eta_k$ with their uncertainties. This is not
always possible, in particular when the details of the likelihood problem are not
available, for instance when retrieving a measurement from a published paper.
In those cases, the simplest procedure may be to perform a local linear approximation
of the function that transforms the measured parameters into the new ones.
If the errors are sufficiently small, projecting them on the new variables, using the
assumed linear approximation, leads to a sufficiently accurate result. In general,
the covariance matrix $H_{ij}$ of the transformed parameters can be obtained from the
covariance matrix $\Theta_{pq}$ of the original parameters as follows:
$$
H_{ij} = \sum_{p,\,q} \frac{\partial\eta_i}{\partial\theta_p}\,\frac{\partial\eta_j}{\partial\theta_q}\,\Theta_{pq}\,, \qquad (5.73)
$$
or, in matrix form:
$$
H = A^{\mathrm{T}}\,\Theta\,A\,, \qquad (5.74)
$$
where:
$$
A_{pi} = \frac{\partial\eta_i}{\partial\theta_p}\,. \qquad (5.75)
$$

This procedure is visualized in the simplest case of a one-dimensional transformation
in Fig. 5.6.
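A minimal numerical sketch of Eq. (5.74) follows (an illustration, with a hypothetical transformation and invented numerical values; the Jacobian of Eq. (5.75) is estimated by central finite differences):

```python
import numpy as np

def propagate(f, theta, cov, eps=1e-6):
    """Linear error propagation, Eq. (5.74): H = A^T Theta A,
    with A_pi = d eta_i / d theta_p estimated by central differences."""
    theta = np.asarray(theta, dtype=float)
    m = theta.size
    k = np.atleast_1d(f(theta)).size
    A = np.empty((m, k))
    for p in range(m):
        step = np.zeros(m)
        step[p] = eps
        A[p] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return A.T @ cov @ A

# Toy transformation (eta_1, eta_2) = (x*y, x/y) with uncorrelated x and y
f = lambda t: np.array([t[0] * t[1], t[0] / t[1]])
cov_theta = np.diag([0.1**2, 0.2**2])   # sigma_x = 0.1, sigma_y = 0.2
H = propagate(f, [2.0, 4.0], cov_theta)
```

The diagonal elements of `H` reproduce the product and ratio rules of Eq. (5.84) in the uncorrelated case.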

5.14.1 Simple Cases of Error Propagation

Imagine we rescale a variable $x$ by a constant $a$:
$$
y = a\,x\,. \qquad (5.76)
$$

Fig. 5.6 Plot of a transformation of variable $\eta = \eta(\theta)$, and visualization of the
procedure of error propagation using a local linear approximation

The corresponding uncertainty squared, applying Eq. (5.73), is:
$$
\sigma_y^2 = \left(\frac{\mathrm{d}y}{\mathrm{d}x}\right)^{\!2} \sigma_x^2 = a^2\,\sigma_x^2\,, \qquad (5.77)
$$
hence:
$$
\sigma_{ax} = |a|\,\sigma_x\,. \qquad (5.78)
$$

For a variable $z$ that is a function of two variables $x$ and $y$, in general, also
considering a possible correlation term, Eq. (5.73) can be written as:
$$
\sigma_z^2 = \left(\frac{\partial z}{\partial x}\right)^{\!2}\sigma_x^2 + \left(\frac{\partial z}{\partial y}\right)^{\!2}\sigma_y^2
+ 2\,\frac{\partial z}{\partial x}\frac{\partial z}{\partial y}\,\mathrm{cov}(x,y)\,, \qquad (5.79)
$$
which, for $z = x + y$, gives:
$$
\sigma_{x+y} = \sqrt{\sigma_x^2 + \sigma_y^2 + 2\rho\,\sigma_x\sigma_y}\,. \qquad (5.80)
$$
For what concerns the product $z = x\,y$, the relative uncertainties should instead be
added in quadrature, plus a possible correlation term:
$$
\frac{\sigma_{xy}}{xy} = \sqrt{\left(\frac{\sigma_x}{x}\right)^{\!2} + \left(\frac{\sigma_y}{y}\right)^{\!2} + \frac{2\rho\,\sigma_x\sigma_y}{xy}}\,. \qquad (5.81)
$$

If $x$ and $y$ are uncorrelated, Eq. (5.79) simplifies to:
$$
\sigma_z^2 = \left(\frac{\partial z}{\partial x}\right)^{\!2}\sigma_x^2 + \left(\frac{\partial z}{\partial y}\right)^{\!2}\sigma_y^2\,, \qquad (5.82)
$$
which, for the sum or difference of two uncorrelated variables, gives:
$$
\sigma_{x+y} = \sigma_{x-y} = \sqrt{\sigma_x^2 + \sigma_y^2}\,, \qquad (5.83)
$$
and for the product and ratio:
$$
\frac{\sigma_{xy}}{xy} = \frac{\sigma_{x/y}}{x/y} = \sqrt{\left(\frac{\sigma_x}{x}\right)^{\!2} + \left(\frac{\sigma_y}{y}\right)^{\!2}}\,. \qquad (5.84)
$$
In case of a power law $y = x^\alpha$, the error propagates as:
$$
\frac{\sigma_{x^\alpha}}{x^\alpha} = |\alpha|\,\frac{\sigma_x}{x}\,. \qquad (5.85)
$$
The error on $y = \log x$ is equal to the relative error on $x$:
$$
\sigma_{\log x} = \frac{\sigma_x}{x}\,. \qquad (5.86)
$$

5.15 Treatment of Asymmetric Errors

In Sect. 5.11 we observed that maximum likelihood fits may lead to asymmetric
errors. The propagation of asymmetric errors and the combination of more measurements
having asymmetric errors may require special care. If we have two
measurements, $x = \hat{x}\,^{+\sigma_x^+}_{-\sigma_x^-}$ and $y = \hat{y}\,^{+\sigma_y^+}_{-\sigma_y^-}$, the naïve extension of the sum in
quadrature of errors, derived in Eq. (5.83), would lead to the (incorrect!) sum in
quadrature of the positive and negative errors:
$$
x + y = (\hat{x} + \hat{y})\,^{+\sqrt{(\sigma_x^+)^2 + (\sigma_y^+)^2}}_{-\sqrt{(\sigma_x^-)^2 + (\sigma_y^-)^2}}\,. \qquad (5.87)
$$
Though sometimes Eq. (5.87) has been applied in real cases, it has no statistical
motivation. One reason why Eq. (5.87) is incorrect may be found in the central limit
theorem. Uncertainties are related to the standard deviation of the distribution of
a sample and, in the case of an asymmetric (skew) distribution, asymmetric errors
may be related to the skewness (Eq. (1.27)) of the distribution. Adding more random
variables, each characterized by an asymmetric PDF, should lead to a resulting
PDF that approaches a Gaussian more closely than the original PDFs, hence it should
have more symmetric errors. From Eq. (5.87), instead, the error asymmetry would
never decrease by adding more and more measurements all with the same error
asymmetry.
One statistically correct way to propagate asymmetric errors on quantities (say
$\vec{x}\,'$) that are expressed as functions of some original parameters (say $\vec{x}$) is to
reformulate the fit problem in terms of the new parameters and to perform again the
fit and error evaluation for the derived quantities $\vec{x}\,'$. This approach is sometimes
not feasible when measurements with asymmetric errors are the result of a previous
measurement (e.g. from a publication) that does not specify the complete underlying
likelihood model. In those cases, the treatment of asymmetric errors requires
some assumptions on the underlying PDF model which is missing in the available
documentation.
Discussions about how to treat asymmetric errors can be found in [7-9] using
a frequentist approach. D'Agostini [10] also discusses this subject using the
Bayesian approach, reporting the method presented in Sect. 3.11 and demonstrating
that potential problems (e.g. biases) are present with naïve approaches to error
combination.
In the following Sect. 5.15.1, the derivation from [9] is briefly presented
as an example, in order to demonstrate peculiar features of propagation and
combination of asymmetric uncertainties.

5.15.1 Asymmetric Error Combination with a Linear Model

The following demonstrates how to treat the propagation of an asymmetric
uncertainty that arises from a non-linear dependence on a nuisance parameter (e.g.
some source of systematic uncertainty). Imagine that the uncertainty on a parameter
$x'$ arises from a non-linear dependence on another parameter $x$ (i.e. $x' = f(x)$) which
has a symmetric uncertainty $\sigma$.
Figure 5.7 shows a simple case where a random variable $x$, distributed according
to a Gaussian PDF, is transformed into a variable $x'$ through a piece-wise linear
transformation, leading to an asymmetric PDF. The two straight-line sections,
with different slopes, join with continuity at the central value of the original
PDF. This leads to a resulting PDF of the transformed variable which is piece-wise
Gaussian: two half-Gaussians, each corresponding to a 50% probability, have
different standard deviation parameters, $\sigma'_+$ and $\sigma'_-$. Such a PDF is also called
a bifurcated Gaussian in some literature.
If we have the original measurement $x = \hat{x} \pm \sigma$, this transformation leads to
a resulting measurement of the transformed variable, $x' = \hat{x}'\,^{+\sigma'_+}_{-\sigma'_-}$, where $\sigma'_+$ and $\sigma'_-$

Fig. 5.7 Transformation of a variable $x$ into a variable $x'$ through a piece-wise
linear transformation characterized by two different slopes. If $x$ follows a Gaussian
PDF with standard deviation $\sigma$, $x'$ follows a bifurcated Gaussian, made of two
Gaussian halves having different standard deviation parameters, $\sigma'_+$ and $\sigma'_-$

depend on $\sigma$ through factors equal to the two different slopes:
$$
\sigma'_+ = \sigma\,s_+\,, \qquad (5.88)
$$
$$
\sigma'_- = \sigma\,s_-\,, \qquad (5.89)
$$
as evident in Fig. 5.7.
One consequence of the transformed PDF shape is that the average value of the
transformed variable, $\langle x'\rangle$, is different from the most likely value, $\hat{x}'$. In particular, the
average value of $x'$ can be computed as:
$$
\langle x'\rangle = \hat{x}' + \frac{1}{\sqrt{2\pi}}\left(\sigma'_+ - \sigma'_-\right)\,. \qquad (5.90)
$$
While the average value of the sum of two variables is equal to the sum of the
individual average values (Eq. (1.20)), this is not the case for the most likely value of
the sum of the two variables. Using a naïve error treatment, like the one in Eq. (5.87),
could even lead, for this reason, to a bias in the estimate of combined values, as
evident from Eq. (5.90).

In the assumed case of a piece-wise linear transformation, in addition to
Eq. (5.90), the corresponding expression for the variance can also be considered:
$$
\mathrm{Var}[x'] = \left(\frac{\sigma'_+ + \sigma'_-}{2}\right)^{\!2} + \left(\frac{\sigma'_+ - \sigma'_-}{2}\right)^{\!2}\left(1 - \frac{2}{\pi}\right)\,, \qquad (5.91)
$$
as well as for the unnormalized skewness, defined in Eq. (1.28):
$$
\gamma[x'] = \frac{1}{\sqrt{2\pi}}\left[\,2\left(\sigma_+'^{\,3} - \sigma_-'^{\,3}\right)
- \frac{3}{2}\left(\sigma'_+ - \sigma'_-\right)\left(\sigma_+'^{\,2} + \sigma_-'^{\,2}\right)
+ \frac{1}{\pi}\left(\sigma'_+ - \sigma'_-\right)^3\,\right]\,. \qquad (5.92)
$$
Equations (5.90), (5.91) and (5.92) allow one to transform a measurement $\hat{x}'$ and its two
asymmetric error components $\sigma'_+$ and $\sigma'_-$ into three other quantities: $\langle x'\rangle$, $\mathrm{Var}[x']$ and
$\gamma[x']$. The advantage of this transformation is that the average, the variance and the
unnormalized skewness add linearly when adding independent random variables, and this allows
an easier combination of uncertainties.
If we have two measurements affected by asymmetric errors, say $\hat{x}_1\,^{+\sigma_{1+}}_{-\sigma_{1-}}$ and
$\hat{x}_2\,^{+\sigma_{2+}}_{-\sigma_{2-}}$, the average, variance and unnormalized skewness can be computed, assuming
an underlying piece-wise linear model, for the sum of the two corresponding
random variables $x_1$ and $x_2$ as:
$$
\langle x_1 + x_2\rangle = \langle x_1\rangle + \langle x_2\rangle\,, \qquad (5.93)
$$
$$
\mathrm{Var}[x_1 + x_2] = \mathrm{Var}[x_1] + \mathrm{Var}[x_2]\,, \qquad (5.94)
$$
$$
\gamma[x_1 + x_2] = \gamma[x_1] + \gamma[x_2]\,. \qquad (5.95)
$$
The above average, variance and unnormalized skewness can be computed individually
from $\hat{x}_1$ and $\hat{x}_2$ and their corresponding asymmetric uncertainties, again from
Eqs. (5.90), (5.91) and (5.92). Using numerical techniques, the relation between
$\langle x_1 + x_2\rangle$, $\mathrm{Var}[x_1 + x_2]$ and $\gamma[x_1 + x_2]$ on one side, and $\hat{x}_{1+2}$, $\sigma^+_{1+2}$ and $\sigma^-_{1+2}$ on the
other, can be inverted to obtain the correct estimate for $x_{1+2} = x_1 + x_2$ and its
corresponding asymmetric uncertainty components: $\hat{x}_{1+2}\,^{+\sigma^+_{1+2}}_{-\sigma^-_{1+2}}$.
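The moment-based combination described above can be sketched as follows (an illustration under the piece-wise linear model; the numerical values and helper names are invented, and SciPy's root finder is used for the inversion step):

```python
import numpy as np
from scipy.optimize import fsolve

S2PI = np.sqrt(2.0 * np.pi)

def moments(xhat, sp, sm):
    """Mean, variance and unnormalized skewness of a bifurcated Gaussian
    with mode xhat, sigma'_+ = sp and sigma'_- = sm (Eqs. 5.90-5.92)."""
    mean = xhat + (sp - sm) / S2PI
    var = ((sp + sm) / 2.0) ** 2 + ((sp - sm) / 2.0) ** 2 * (1.0 - 2.0 / np.pi)
    skew = (2.0 * (sp**3 - sm**3)
            - 1.5 * (sp - sm) * (sp**2 + sm**2)
            + (sp - sm) ** 3 / np.pi) / S2PI
    return np.array([mean, var, skew])

def combine(m1, m2):
    """Sum two measurements (xhat, sigma+, sigma-) by adding their moments
    (Eqs. 5.93-5.95) and inverting Eqs. (5.90)-(5.92) numerically."""
    target = moments(*m1) + moments(*m2)
    start = [target[0], np.sqrt(target[1]), np.sqrt(target[1])]
    sol = fsolve(lambda p: moments(p[0], abs(p[1]), abs(p[2])) - target, start)
    return sol[0], abs(sol[1]), abs(sol[2])

xhat12, sp12, sm12 = combine((1.0, 0.5, 0.5), (2.0, 0.5, 0.5))
```

For two symmetric inputs the procedure reduces, as it must, to the usual sum in quadrature; with asymmetric inputs the combined errors come out less asymmetric than the individual ones, as argued from the central limit theorem.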
Barlow [9] also considers the case of a parabolic dependence and obtains a
procedure to estimate $\hat{x}_{1+2}\,^{+\sigma^+_{1+2}}_{-\sigma^-_{1+2}}$ with this second model.
Any estimate of the sum of two measurements affected by asymmetric errors
requires an assumption about the underlying PDF model, and the results may be more
or less sensitive to the assumed model, depending on the case.

References

1. Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)
2. Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–89 (1945)
3. Eadie, W., Drijard, D., James, F., Roos, M., Sadoulet, B.: Statistical Methods in Experimental Physics. North Holland, London (1971)
4. James, F., Roos, M.: MINUIT: Function minimization and error analysis. CERN Computer Centre Program Library, Geneva. Long Write-up No. D506 (1989)
5. Brun, R., Rademakers, F.: ROOT—an object oriented data analysis framework. In: Proceedings AIHENP'96 Workshop, Lausanne (1996). Nucl. Instrum. Methods A389, 81–86 (1997). http://root.cern.ch/
6. Baker, S., Cousins, R.: Clarification of the use of chi-square and likelihood functions in fits to histograms. Nucl. Instrum. Methods A221, 437–442 (1984)
7. Barlow, R.: Asymmetric errors. In: Proceedings of PHYSTAT2003, SLAC, Stanford (2003). http://www.slac.stanford.edu/econf/C030908/
8. Barlow, R.: Asymmetric statistical errors. arXiv:physics/0406120v1 (2004)
9. Barlow, R.: Asymmetric systematic errors. arXiv:physics/0306138 (2003)
10. D'Agostini, G.: Asymmetric uncertainties: sources, treatment and potential dangers. arXiv:physics/0403086 (2004)
Chapter 6
Combining Measurements

6.1 Introduction

The problem of combining two or more measurements of the same unknown


quantity  can be addressed in general by building a likelihood function that
combines two or more data samples. If the measurements are independent, the
combined likelihood function is given by the product of the individual likelihood
functions and depends on the unknown parameters present in each of them,
including the parameter of interest  and possibly nuisance parameters, at least some
of which could be in common among different measurements. The minimization of
the combined likelihood function provides an estimate of the parameter of interest
 that takes into account all the individual data samples.
This approach requires that the likelihood function is available for each individual
measurement, and it is usually adopted to extract information from multiple data
samples within the same experiment, as will be discussed in Sect. 6.2. However, this
strategy cannot always be pursued, either because of the intrinsic complexity of the
problem, or because the original data samples and/or the probability models are not
available and only the final individual results are known, as when combining results
taken from publications.
In case a Gaussian approximation is sufficiently accurate, as assumed in
Sect. 5.13.1, the minimum $\chi^2$ method can be used to perform a combination of
measurements taking into account their uncertainties and correlations. This will be
discussed in Sect. 6.3 and the following sections.

6.2 Simultaneous Fits and Control Regions

Consider the model discussed in Sect. 5.10.2 with a signal peak around $m = 3.1$ GeV,
reported in Fig. 5.2. The two regions with $m < 3$ GeV and $m > 3.2$ GeV can be
considered as control regions, while the region with $3.04\ \mathrm{GeV} < m < 3.18\ \mathrm{GeV}$ can
© Springer International Publishing AG 2017 129


L. Lista, Statistical Methods for Data Analysis in Particle Physics,
Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0_6

be taken as the signal region. The background yield in the signal region under the signal
peak can be determined from the observed number of events in the control regions,
which contain a negligible amount of signal, interpolated to the signal region by
applying a scale factor given by the ratio of the signal-region to control-region areas
expected from the predicted background distribution (an exponential distribution, in
the considered case). This background constraint is already effectively applied when
performing the fit described in Sect. 5.10.2, where the background shape parameter,
as well as the signal and background yields, are determined directly from data.
The problem of a simultaneous fit can be formalized in general as follows, also
accounting for a possible signal contamination in the control regions. Consider
two data samples, $\vec{x} = (x_1,\cdots,x_h)$ and $\vec{y} = (y_1,\cdots,y_k)$. The likelihood functions
$L_x(\vec{x}\,;\vec{\theta}\,)$ and $L_y(\vec{y}\,;\vec{\theta}\,)$ for the two individual data samples depend on a parameter
set $\vec{\theta} = (\theta_1,\cdots,\theta_m)$. In particular, $L_x$ and $L_y$ may individually depend on a subset
of the comprehensive parameter set $\vec{\theta}$. Some of the parameters determine the signal
and the background yields to be constrained. The combined likelihood function that
comprises both data samples can be written as:
$$
L_{x,y}(\vec{x},\vec{y}\,;\vec{\theta}\,) = L_x(\vec{x}\,;\vec{\theta}\,)\,L_y(\vec{y}\,;\vec{\theta}\,)\,. \qquad (6.1)
$$

Equation (6.1) can be maximized in order to fit the parameters $\vec{\theta}$ simultaneously
from the two data samples $\vec{x}$ and $\vec{y}$. The generalization to more than two data sets is
straightforward.
Imagine now that an experiment wants to measure the production cross section
of a rare signal affected by a large background, based on the different shapes of
the distributions of an observable variable $x$ in different physical processes. One of
the easiest cases is when the distribution of $x$ has a peaked shape for the signal
and is smoother for the backgrounds, like in Fig. 5.2, but more difficult situations
are often present in realistic cases. Imagine also that the background yield is not
predicted with good precision. In order to measure the signal yield, one may
define data regions enriched in background events with a negligible, or anyway
small, contribution from the signal, and use those regions in order to determine the
background yield from data, without relying on theory predictions.
For instance, imagine we want to measure the production cross section of events
with a single top quark at the Large Hadron Collider. This signal is affected by
background due to top-quark pair production. The selection of single-top events
relies on the presence of one jet identified as produced by a $b$ quark (the top quark
decays into a $b$ quark and a $W$ boson). A control sample can be defined by requiring
the presence of two $b$ quarks instead of the one required for the signal sample. With
this requirement, the control sample contains a very small contribution from
single-top events and is dominated by the top-pair background.
A simultaneous fit using both the signal and control samples allows one to determine
both the yield of the single-top signal and that of the top-pair background. The ratios
of yields in the signal and control regions, for both signal and background, can be
taken as constants whose values are given by simulation, in order to constrain the
signal and background yields in the control region from the fitted values of the
yields in the signal region, which may be free parameters in the fit. In this way, the
background yield is mainly determined from the control sample distribution,
while the signal yield is mainly determined from the distribution in the signal
region.
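A simultaneous fit of this kind can be sketched, in its simplest counting-experiment form, as follows (an illustration, not the single-top analysis itself: the transfer factors and observed counts are invented for the example):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical transfer factors (from simulation) and observed counts
f_sig = np.array([0.9, 0.1])    # fraction of signal in (signal, control) region
f_bkg = np.array([0.2, 0.8])    # fraction of background in (signal, control) region
observed = np.array([95, 310])  # observed counts in (signal, control) region

def nll(params):
    """Combined negative log likelihood: the product of two Poisson
    likelihoods, one per region, as in Eq. (6.1) (constant terms dropped)."""
    s, b = params
    if s < 0 or b < 0:
        return 1e12
    mu = s * f_sig + b * f_bkg
    return np.sum(mu - observed * np.log(mu))

res = minimize(nll, x0=[50.0, 300.0], method="Nelder-Mead")
s_hat, b_hat = res.x
```

With two free parameters and two regions the fit reproduces the observed counts exactly; with more bins or regions the same likelihood structure constrains the background mainly through the control region.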

6.3 Weighted Average

Imagine we have two measurements of the same quantity $x$, which are:
$$
x = \hat{x}_1 \pm \sigma_1\,, \qquad (6.2)
$$
$$
x = \hat{x}_2 \pm \sigma_2\,. \qquad (6.3)
$$
Assuming a Gaussian distribution for $\hat{x}_1$ and $\hat{x}_2$ and no correlation between the two
measurements, the following $\chi^2$ can be built:
$$
\chi^2 = \frac{(x - \hat{x}_1)^2}{\sigma_1^2} + \frac{(x - \hat{x}_2)^2}{\sigma_2^2}\,. \qquad (6.4)
$$
The value $x = \hat{x}$ that minimizes the $\chi^2$ can be found by imposing:
$$
0 = \left.\frac{\partial\chi^2}{\partial x}\right|_{x=\hat{x}} = 2\,\frac{\hat{x} - \hat{x}_1}{\sigma_1^2} + 2\,\frac{\hat{x} - \hat{x}_2}{\sigma_2^2}\,, \qquad (6.5)
$$
which gives:
$$
\hat{x} = \frac{\dfrac{\hat{x}_1}{\sigma_1^2} + \dfrac{\hat{x}_2}{\sigma_2^2}}{\dfrac{1}{\sigma_1^2} + \dfrac{1}{\sigma_2^2}}\,. \qquad (6.6)
$$
The variance of the estimate $\hat{x}$ can be computed using Eq. (5.26):
$$
\frac{1}{\sigma_{\hat{x}}^2} = -\frac{\partial^2\log L}{\partial x^2} = \frac{1}{2}\,\frac{\partial^2\chi^2}{\partial x^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\,. \qquad (6.7)
$$
Equation (6.6) is called the weighted average, and can be generalized for $N$ measurements as:
$$
\hat{x} = \sum_{i=1}^{N} w_i\,\hat{x}_i\,, \qquad (6.8)
$$

where:
$$
w_i = \frac{1/\sigma_i^2}{1/\sigma^2}\,, \qquad (6.9)
$$
with $1/\sigma^2 = \sum_{i=1}^{N} 1/\sigma_i^2$, in order to ensure that $\sum_{i=1}^{N} w_i = 1$. The error on $\hat{x}$ is
given by:
$$
\sigma_{\hat{x}} = \sigma = \frac{1}{\sqrt{\sum_{i=1}^{N} 1/\sigma_i^2}}\,. \qquad (6.10)
$$
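Equations (6.8)-(6.10) translate directly into a few lines of Python (a sketch with invented input values, assuming NumPy):

```python
import numpy as np

def weighted_average(values, errors):
    """Weighted average of uncorrelated measurements, Eqs. (6.8)-(6.10)."""
    w = 1.0 / np.asarray(errors, dtype=float) ** 2   # un-normalized weights 1/sigma_i^2
    xhat = np.sum(w * np.asarray(values, dtype=float)) / np.sum(w)
    return xhat, 1.0 / np.sqrt(np.sum(w))

xhat, err = weighted_average([10.2, 9.8], [0.5, 1.0])
```

The more precise measurement dominates: here it carries weight $4/5$ of the combination.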

6.4 $\chi^2$ in $n$ Dimensions

The $\chi^2$ defined in Eq. (6.4) can be generalized to the case of $n$ measurements
$\vec{\hat{x}} = (\hat{x}_1,\cdots,\hat{x}_n)$ of the parameters $\vec{\mu} = (\mu_1,\cdots,\mu_n)$ which, in turn, depend on
another parameter set $\vec{\theta} = (\theta_1,\cdots,\theta_m)$: $\mu_i = \mu_i(\theta_1,\cdots,\theta_m)$. If the covariance
matrix of the measurements $(\hat{x}_1,\cdots,\hat{x}_n)$ is $\sigma$, with elements $\sigma_{ij}$, the $\chi^2$ can be
written as follows:
$$
\chi^2 = \sum_{i,\,j=1}^{n} (\hat{x}_i - \mu_i)\,\sigma^{-1}_{ij}\,(\hat{x}_j - \mu_j)
= (\vec{\hat{x}} - \vec{\mu}\,)^{\mathrm{T}}\,\sigma^{-1}\,(\vec{\hat{x}} - \vec{\mu}\,) \qquad (6.11)
$$
$$
= (\hat{x}_1 - \mu_1,\,\cdots,\,\hat{x}_n - \mu_n)
\begin{pmatrix}
\sigma_{11} & \cdots & \sigma_{1n} \\
\vdots & \ddots & \vdots \\
\sigma_{n1} & \cdots & \sigma_{nn}
\end{pmatrix}^{\!-1}
\begin{pmatrix}
\hat{x}_1 - \mu_1 \\ \vdots \\ \hat{x}_n - \mu_n
\end{pmatrix}\,. \qquad (6.12)
$$
The $\chi^2$ expression can be minimized in order to determine the estimates $\hat{\theta}_1,\cdots,\hat{\theta}_m$
of the unknown parameters $\theta_1,\cdots,\theta_m$, with the proper error matrix.
Examples of application of this method are the combinations of electroweak
measurements performed with data taken at the electron-positron collider LEP [1]
at CERN and with the precision measurements performed at the SLC collider at
SLAC [2, 3], in the context of the LEP Electroweak Working Group [4] and the
GFitter Group [5]. The effect of radiative corrections that depend on the top-quark
and Higgs-boson masses is taken into account in the predictions, indicated above
in the relation $\mu_i = \mu_i(\theta_1,\cdots,\theta_m)$. The combination of precision electroweak
measurements with this approach allowed indirect estimates of the top-quark
mass before its discovery at the Tevatron, and a less precise estimate of the Higgs-boson
mass before the beginning of the LHC data taking, where eventually the Higgs boson
was discovered. In both cases, the indirect estimates were in agreement with the
measured masses. The global fit performed by the GFitter group is shown in Fig. 6.1
for the prediction of the masses of the W boson and the top quark.

Fig. 6.1 Contours at the 68% and 95% confidence level in the plane $(m_W, m_t)$, mass of the $W$
boson vs mass of the top quark, from global electroweak fits including (blue) or excluding (gray)
the measurement of the Higgs boson mass $m_H$. The direct measurements of $m_W$ and $m_t$ (green) are
shown as comparison. The fit was performed by the GFitter group [5], and the figure is from [3]
(Open Access)

6.5 The Best Linear Unbiased Estimator

Let us consider the case of two measurements of the same quantity $x$:
$$
x = \hat{x}_1 \pm \sigma_1\,, \qquad (6.13)
$$
$$
x = \hat{x}_2 \pm \sigma_2\,, \qquad (6.14)
$$
which have a correlation coefficient $\rho$. The case discussed in Sect. 6.3 was a special
case with $\rho = 0$. Taking into account the correlation term, the $\chi^2$ can be written
using the covariance matrix form in Eq. (2.91):
$$
\chi^2 = (x - \hat{x}_1,\ x - \hat{x}_2)
\begin{pmatrix}
\sigma_1^2 & \rho\,\sigma_1\sigma_2 \\
\rho\,\sigma_1\sigma_2 & \sigma_2^2
\end{pmatrix}^{\!-1}
\begin{pmatrix}
x - \hat{x}_1 \\ x - \hat{x}_2
\end{pmatrix}\,. \qquad (6.15)
$$
The same minimization used to obtain Eq. (6.6) now gives:
$$
\hat{x} = \frac{\hat{x}_1\,(\sigma_2^2 - \rho\,\sigma_1\sigma_2) + \hat{x}_2\,(\sigma_1^2 - \rho\,\sigma_1\sigma_2)}{\sigma_1^2 - 2\rho\,\sigma_1\sigma_2 + \sigma_2^2}\,, \qquad (6.16)
$$

with the following uncertainty:
$$
\sigma_{\hat{x}}^2 = \frac{\sigma_1^2\,\sigma_2^2\,(1 - \rho^2)}{\sigma_1^2 - 2\rho\,\sigma_1\sigma_2 + \sigma_2^2}\,. \qquad (6.17)
$$

The general case of more than two measurements proceeds in a similar way,
and the minimization of the $\chi^2$ is equivalent to finding the best linear unbiased
estimate (BLUE), i.e. the unbiased linear combination of the measurements $\vec{\hat{x}} =
(\hat{x}_1,\cdots,\hat{x}_N)$, with known covariance matrix $\sigma = (\sigma_{ij})$, that gives the lowest
variance.
Introducing a set of weights $\vec{w} = (w_1,\cdots,w_N)$, the linear combination can be
written as:
$$
\hat{x} = \sum_{i=1}^{N} w_i\,\hat{x}_i = \vec{\hat{x}}\cdot\vec{w}\,. \qquad (6.18)
$$
The condition of a null bias is $\langle\hat{x}\rangle = x$, where $x$ is the unknown true value. If the
individual measurements are unbiased, $x = \langle\hat{x}_i\rangle$ for all $i$, and hence $\sum_{i=1}^{N} w_i = 1$. The
variance of $\hat{x}$ can be expressed as:
$$
\sigma_{\hat{x}}^2 = \vec{w}^{\,\mathrm{T}}\,\sigma\,\vec{w}\,. \qquad (6.19)
$$
It can be shown [6] that the weights that minimize the variance $\sigma_{\hat{x}}^2$ in Eq. (6.19) are
given by the following expression:
$$
\vec{w} = \frac{\sigma^{-1}\,\vec{u}}{\vec{u}^{\,\mathrm{T}}\,\sigma^{-1}\,\vec{u}}\,, \qquad (6.20)
$$
where $\vec{u}$ is the vector having all elements equal to unity: $\vec{u} = (1,\cdots,1)$.
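Equation (6.20) can be sketched numerically as follows (an illustration with invented inputs, assuming NumPy); the chosen correlation deliberately makes the second weight negative, so the combined value lies outside the interval spanned by the two central values:

```python
import numpy as np

def blue(xhat, cov):
    """BLUE combination, Eq. (6.20): w = cov^-1 u / (u^T cov^-1 u)."""
    u = np.ones(len(xhat))
    w = np.linalg.solve(cov, u)   # cov^-1 u
    w /= u @ w                    # normalize so that the weights sum to 1
    return w @ xhat, np.sqrt(w @ cov @ w), w

# Hypothetical pair of strongly correlated measurements
x1, s1, x2, s2, rho = 10.0, 1.0, 12.0, 2.0, 0.9
cov = np.array([[s1**2, rho * s1 * s2],
                [rho * s1 * s2, s2**2]])
val, err, w = blue(np.array([x1, x2]), cov)   # w[1] < 0, val below both inputs
```

The result reproduces Eqs. (6.16) and (6.17); note that for $\rho = 1$ with $\sigma_1 = \sigma_2$ the covariance matrix becomes singular, consistently with Example 6.21 below.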
The interpretation of Eq. (6.16) becomes more intuitive [7] if we introduce the
common error, defined as:
$$
\sigma_C = \sqrt{\rho\,\sigma_1\sigma_2}\,. \qquad (6.21)
$$
Imagine, for instance, that the two measurements are affected by a common
uncertainty, like the knowledge of the integrated luminosity in case of a cross-section
measurement, while the other uncertainties are uncorrelated. In that case, the two
measurements can be written as:
$$
x = \hat{x}_1 \pm \sigma'_1 \pm \sigma_C\,, \qquad (6.22)
$$
$$
x = \hat{x}_2 \pm \sigma'_2 \pm \sigma_C\,, \qquad (6.23)
$$
where $\sigma_1'^{\,2} = \sigma_1^2 - \sigma_C^2$ and $\sigma_2'^{\,2} = \sigma_2^2 - \sigma_C^2$. This is clearly possible only if $\sigma_C \le \sigma_1$
and $\sigma_C \le \sigma_2$, which is equivalent to requiring that the weights $w_1$ and $w_2$ in the BLUE

combination reported in Eq. (6.16) are both positive. Equation (6.16), in that case,
can also be written as:
$$
\hat{x} = \frac{\dfrac{\hat{x}_1}{\sigma_1^2 - \sigma_C^2} + \dfrac{\hat{x}_2}{\sigma_2^2 - \sigma_C^2}}{\dfrac{1}{\sigma_1^2 - \sigma_C^2} + \dfrac{1}{\sigma_2^2 - \sigma_C^2}}\,, \qquad (6.24)
$$
with a variance:
$$
\sigma_{\hat{x}}^2 = \frac{1}{\dfrac{1}{\sigma_1^2 - \sigma_C^2} + \dfrac{1}{\sigma_2^2 - \sigma_C^2}} + \sigma_C^2\,. \qquad (6.25)
$$
Equation (6.24) is equivalent to the weighted average in Eq. (6.6), where the errors
$\sigma'_1$ and $\sigma'_2$ are used to determine the weights instead of $\sigma_1$ and $\sigma_2$. Equation (6.25)
shows that the uncertainty contribution $\sigma_C$ has to be added in quadrature to
the expression that gives the error for the ordinary weighted average in case of no
correlation (Eqs. (6.7) and (6.10)).
More generally, as evident from Eq. (6.16), weights in the BLUE method can
also become negative. This may lead to counter-intuitive results. In particular, a
combination of two measurements may lie outside the range delimited by the
two central values. Also, when combining two measurements with $\rho = \sigma_1/\sigma_2$, the
weight of the second measurement is zero, hence the combined central value is not
influenced by $\hat{x}_2$. Conversely, if $\rho = \sigma_2/\sigma_1$, the central value is not influenced by $\hat{x}_1$.

Example 6.21 Reusing Multiple Times the Same Measurement Does
Not Improve a Combination
Assume we have a single measurement, $\hat{x} \pm \sigma$, and we want to use it twice
to determine again the same quantity. Ignoring the correlation
coefficient $\rho = 1$, one would expect to reduce the uncertainty $\sigma$ 'for free'
by a factor $\sqrt{2}$, which clearly would not make sense.
The correct use of Eqs. (6.16) and (6.17) leads, when $\rho = 1$, to $\hat{x} \pm \sigma$, i.e.,
as expected, no precision is gained by using the same measurement twice in
a combination.

6.5.1 Quantifying the Importance of Individual Measurements

The BLUE method, as well as the standard weighted average, provides weights of
the individual measurements that enter the combination. BLUE weights, however, can
become negative, and this does not allow the use of the weights to quantify the
'importance' of each individual measurement in a combination.

One approach sometimes adopted in the literature consists in quoting as relative
importance (RI) a quantity proportional to the absolute value of the corresponding
weight $w_i$, which is usually defined as:
$$
\mathrm{RI}_i = \frac{|w_i|}{\displaystyle\sum_{i=1}^{N} |w_i|}\,, \qquad (6.26)
$$
in order to obey the normalization condition $\sum_{i=1}^{N}\mathrm{RI}_i = 1$. The RI was quoted, for
instance, in combinations of top-quark mass measurements at the Tevatron and at
the LHC [8, 9].
The questionability of the use of RIs was raised [10] because it violates the
combination principle: in the case of three measurements, say $\hat{x}$, $\hat{y}_1$ and $\hat{y}_2$, the RI of the
measurement $\hat{x}$ changes depending on whether the three measurements are combined
all together, or $\hat{y}_1$ and $\hat{y}_2$ are first combined into $\hat{y}$, and then the measurement $\hat{x}$ is
combined with the partial combination $\hat{y}$.
Proposed alternatives to the RI are based on the Fisher information, defined in Eq. (3.52),
which also sets a lower bound to the variance of an estimator (see Sect. 5.8.3). For
a single parameter, the Fisher information is given by:
$$
J = J_{11} = \vec{u}^{\,\mathrm{T}}\,\sigma^{-1}\,\vec{u} = \frac{1}{\sigma_{\hat{x}}^2}\,. \qquad (6.27)
$$
Two quantities are proposed in [10] to replace the RI: the intrinsic information weight
(IIW), defined as:
$$
\mathrm{IIW}_i = \frac{1/\sigma_i^2}{1/\sigma_{\hat{x}}^2} = \frac{1/\sigma_i^2}{J}\,, \qquad (6.28)
$$
and the marginal information weight (MIW), defined as follows:
$$
\mathrm{MIW}_i = \frac{\Delta J_i}{J} = \frac{J - J_{\{1,\cdots,N\}\setminus\{i\}}}{J}\,, \qquad (6.29)
$$
i.e. the relative difference between the Fisher information of the combination and the
Fisher information of the combination obtained by excluding the $i$th measurement.
Both IIW and MIW do not obey a normalization condition. For the IIW, the quantity
$\mathrm{IIW}_{\mathrm{corr}}$ can be defined such that:
$$
\sum_{i=1}^{N}\mathrm{IIW}_i + \mathrm{IIW}_{\mathrm{corr}} = 1\,. \qquad (6.30)
$$

Table 6.1 Properties of different indicators of a measurement's importance within a BLUE
combination

  Weight type                     Definition             ≥ 0   Sums to 1   Consistent with partial combination
  BLUE coefficient                w_i                    ✗     ✓           ✓
  Relative importance             |w_i| / Σ_i |w_i|      ✓     ✓           ✗
  Intrinsic information weight    IIW_i                  ✓     ✗           ✓
  Marginal information weight     MIW_i                  ✓     ✗           ✓

$\mathrm{IIW}_{\mathrm{corr}}$ represents the weight assigned to the correlation interplay, not assignable to
any individual measurement, and is given by:
$$
\mathrm{IIW}_{\mathrm{corr}} = \frac{1/\sigma_{\hat{x}}^2 - \sum_{i=1}^{N} 1/\sigma_i^2}{1/\sigma_{\hat{x}}^2}
= \frac{J - \sum_{i=1}^{N} 1/\sigma_i^2}{J}\,. \qquad (6.31)
$$
The properties of the BLUE weights, the RI, and the two proposed alternatives to the
RI are reported in Table 6.1.
The quantities IIW and MIW are reported, instead of the RI, in papers presenting the
combination of LHC and Tevatron measurements related to top-quark physics [11–13].
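The information weights of Eqs. (6.28) and (6.29) can be sketched as follows (an illustration assuming NumPy; the function names and the example covariance matrix are invented):

```python
import numpy as np

def fisher(cov):
    """Fisher information of the combination, Eq. (6.27): J = u^T cov^-1 u."""
    u = np.ones(cov.shape[0])
    return u @ np.linalg.inv(cov) @ u

def iiw_miw(cov):
    """Intrinsic (Eq. 6.28) and marginal (Eq. 6.29) information weights."""
    J = fisher(cov)
    iiw = (1.0 / np.diag(cov)) / J
    n = cov.shape[0]
    # MIW_i: drop row and column i, recompute the Fisher information
    miw = np.array([(J - fisher(np.delete(np.delete(cov, i, 0), i, 1))) / J
                    for i in range(n)])
    return iiw, miw

iiw, miw = iiw_miw(np.diag([1.0, 4.0]))
```

For uncorrelated measurements, as in this example, IIW and MIW coincide with the ordinary weighted-average weights; they differ once off-diagonal covariance terms are introduced.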

6.5.2 Negative Weights

Negative weights may arise in BLUE combinations in the presence of large correlations.
Figure 6.2 shows the dependence of the BLUE coefficient $w_2$ and of the ratio of
uncertainties $\sigma_{\hat{x}}^2/\sigma_1^2$, from Eqs. (6.16) and (6.17), as a function of the correlation $\rho$ for different
values of $\sigma_2/\sigma_1$ in the combination of two measurements. The uncertainty ratio
$\sigma_{\hat{x}}^2/\sigma_1^2$ increases as a function of $\rho$ for $\rho < \sigma_1/\sigma_2$; for $\rho = \sigma_1/\sigma_2$ it reaches a
maximum, which also corresponds to $w_2 = 0$ and $\mathrm{MIW}_2 = 0$. For $\rho > \sigma_1/\sigma_2$, $w_2$
becomes negative and $\sigma_{\hat{x}}^2/\sigma_1^2$ decreases for increasing $\rho$. The dependence on $\rho$ may
also be very steep, depending on $\sigma_2/\sigma_1$. The case in which $w_2 = 0$ should not be
interpreted as the measurement $\hat{x}_2$ not being used in the combination.
In cases where the correlation coefficient $\rho$ is not known with good precision,
the assumption $\rho = 1$ is a conservative choice only if the uncorrelated contributions
are dominant in the total uncertainty. As seen above, in the case of negative weights, $\rho$
should be accurately determined, because the uncertainty may strongly depend on
$\rho$, and assuming $\rho = 1$ may result in underestimating the combination uncertainty.

Fig. 6.2 BLUE coefficient for the second measurement, $w_2$ (top), and combined BLUE variance
$\sigma_{\hat{x}}^2$ divided by $\sigma_1^2$ (bottom), as a function of the correlation $\rho$ between the two measurements,
for various fixed values of the ratio $\sigma_2/\sigma_1$

Consider two measurements whose total uncertainties are given by the combination
of uncorrelated contributions, $\sigma_1(\mathrm{unc})$ and $\sigma_2(\mathrm{unc})$, and correlated contributions,
$\sigma_1(\mathrm{cor})$ and $\sigma_2(\mathrm{cor})$, respectively:
$$
\sigma_1^2 = \sigma_1(\mathrm{cor})^2 + \sigma_1(\mathrm{unc})^2\,, \qquad (6.32)
$$
$$
\sigma_2^2 = \sigma_2(\mathrm{cor})^2 + \sigma_2(\mathrm{unc})^2\,. \qquad (6.33)
$$

Fig. 6.3 The most conservative value of an unknown correlation coefficient $\rho(\mathrm{cor})$ between
the uncertainties $\sigma_1(\mathrm{cor})$ and $\sigma_2(\mathrm{cor})$ as a function of $\sigma_1(\mathrm{cor})/\sigma_1$, for different possible
values of $\sigma_2(\mathrm{cor})/\sigma_1(\mathrm{cor}) \ge 1$

Assume $\rho(\mathrm{cor})$ is the correlation coefficient of the correlated terms. The most
conservative value of $\rho(\mathrm{cor})$, i.e. the value that maximizes the total uncertainty, can
be demonstrated [10] to be equal to 1 only for $\sigma_2(\mathrm{cor})/\sigma_1(\mathrm{cor}) < (\sigma_1/\sigma_1(\mathrm{cor}))^2$,
where it has been assumed that $\sigma_1(\mathrm{cor}) \le \sigma_2(\mathrm{cor})$. The most conservative choice of
$\rho(\mathrm{cor})$, for values of $\sigma_2(\mathrm{cor})/\sigma_1(\mathrm{cor})$ larger than $(\sigma_1/\sigma_1(\mathrm{cor}))^2$, is:
$$
\rho(\mathrm{cor})_{\mathrm{cons}} = \frac{\sigma_1^2}{\sigma_1(\mathrm{cor})\,\sigma_2(\mathrm{cor})}
= \frac{\sigma_1(\mathrm{cor})/\sigma_2(\mathrm{cor})}{(\sigma_1(\mathrm{cor})/\sigma_1)^2} < 1\,. \qquad (6.34)
$$
Figure 6.3 shows the most conservative value of $\rho(\mathrm{cor})$ as a function of $\sigma_1(\mathrm{cor})/\sigma_1$
for different possible values of the ratio $\sigma_1(\mathrm{cor})/\sigma_2(\mathrm{cor})$.
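The case distinction of Eq. (6.34) can be written as a small helper (a sketch; the function name and example values are invented):

```python
def rho_cons(s1, s1_cor, s2_cor):
    """Most conservative correlation of the correlated error components,
    Eq. (6.34); assumes sigma1(cor) <= sigma2(cor) and sigma1(cor) <= sigma1."""
    if s2_cor / s1_cor < (s1 / s1_cor) ** 2:
        return 1.0                       # rho(cor) = 1 is the conservative choice
    return s1**2 / (s1_cor * s2_cor)     # otherwise a value below 1 maximizes it
```

For instance, with $\sigma_1 = 1$, $\sigma_1(\mathrm{cor}) = 0.8$ and $\sigma_2(\mathrm{cor}) = 2$, the most conservative correlation is below unity.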

6.5.3 Iterative Application of the BLUE Method

The BLUE method is unbiased by construction, assuming that the true uncertainties
and their correlations are known. However, it can be proven that BLUE combinations
may exhibit a bias if uncertainty estimates are used in place of the true ones, and
in particular if the uncertainty estimates depend on the measured values. For instance,
when contributions to the total uncertainty are known as relative uncertainties, the
actual uncertainty estimates are obtained as the product of the relative uncertainties

times the measured central values. An iterative application of the BLUE method can
be implemented in order to mitigate such a bias.
L. Lyons et al. remarked in [14] the limitations of the BLUE method in the
combination of lifetime measurements where uncertainty estimates $\hat{\sigma}_i$ of the true
unknown uncertainties $\sigma_i$ were used, and those estimates had a dependence on the
measured lifetime. They also demonstrated that the application of the BLUE method
violates, in that case, the combination principle: if the set of measurements is split
into a number of subsets and then the combination is first performed in each subset
and finally all subset combinations are combined into a single grand combination,
the obtained result differs from the single combination of all individual results of
the entire set.
Reference [14] recommends applying iteratively the BLUE method, rescaling at
each iteration the uncertainty estimates according to the central value obtained with
the BLUE method in the previous iteration, until the sequence converges to a stable
result. It was also proven that the bias of the BLUE estimate is reduced compared
to the standard application of the BLUE method.
A more extensive study of the iterative application of the BLUE method is also
available in [15].
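The iterative rescaling can be sketched as follows. This is a minimal NumPy sketch with invented inputs: the two measurements and their relative uncertainties are hypothetical numbers, and correlations between the measurements are neglected for simplicity.

```python
import numpy as np

def blue_weights(cov):
    """BLUE weights for a set of measurements with covariance matrix cov."""
    u = np.ones(cov.shape[0])
    cinv = np.linalg.inv(cov)
    return cinv @ u / (u @ cinv @ u)

def iterative_blue(x, rel_err, n_iter=10):
    """Combine measurements whose uncertainties are known only as relative
    uncertainties, rescaling them with the combined value at each iteration."""
    x = np.asarray(x, dtype=float)
    sigma = rel_err * x                  # standard BLUE: scale by measured values
    for _ in range(n_iter):
        w = blue_weights(np.diag(sigma ** 2))
        x_comb = w @ x
        sigma = rel_err * x_comb         # rescale with the combined central value
    return x_comb

# Two uncorrelated measurements with 10% and 20% relative uncertainty
# (illustrative numbers): the iteration converges to 10.4.
print(iterative_blue([10.0, 12.0], np.array([0.10, 0.20])))
```

With uncorrelated inputs the BLUE weights depend only on the ratio of the uncertainties, so the iteration converges after a couple of steps; the cases studied in [14, 15] involve correlated measurements, where the bias reduction is less trivial.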

References

1. The ALEPH, DELPHI, L3, OPAL Collaborations, the LEP Electroweak Working Group:
Electroweak measurements in electron-positron collisions at W-boson-pair energies at LEP.
Phys. Rep. 532, 119 (2013)
2. The ALEPH, DELPHI, L3, OPAL, SLD Collaborations, the LEP Electroweak Working Group,
the SLD Electroweak and Heavy Flavour Groups: Precision electroweak measurements on the
Z resonance. Phys. Rep. 427, 257 (2006)
3. The GFitter Group, Baak, M., et al.: The global electroweak fit at NNLO and prospects for the
LHC and ILC. Eur. Phys. J. C 74, 3046 (2014)
4. The LEP Electroweak Working Group. http://lepewwg.web.cern.ch/LEPEWWG/
5. A Generic Fitter Project for HEP Model Testing. http://project-gfitter.web.cern.ch/project-
gfitter/
6. Lyons, L., Gibaut, D., Clifford, P.: How to combine correlated estimates of a single physical
quantity. Nucl. Inst. Methods A270, 110–117 (1988)
7. Greenlee, H.: Combining CDF and D0 physics results. Fermilab Workshop on Confidence
Limits (2000)
8. The CDF and D0 Collaborations: Combination of CDF and D0 results on the mass of the top
quark using up to 5.8 fb⁻¹ of data. FERMILAB-TM-2504-E, CDF-NOTE-10549, D0-NOTE-
6222 (2011)
9. The ATLAS and CMS Collaborations: Combination of ATLAS and CMS results on the mass of
the top quark using up to 4.9 fb⁻¹ of data. ATLAS-CONF-2012-095, CMS-PAS-TOP-12-001
(2012)
10. Valassi, A., Chierici, R.: Information and treatment of unknown correlations in the combination
of measurements using the BLUE method. Eur. Phys. J. C 74, 2717 (2014)
11. ATLAS and CMS Collaborations: Combination of ATLAS and CMS results on the mass of
the top quark using up to 4.9 fb⁻¹ of √s = 7 TeV LHC data. ATLAS-CONF-2013-102,
CMS-PAS-TOP-13-005 (2013)

12. ATLAS and CMS Collaborations: Combination of ATLAS and CMS ttbar charge asymmetry
measurements using LHC proton-proton collisions at √s = 7 TeV. ATLAS-CONF-2014-012,
CMS-PAS-TOP-14-006 (2014)
13. ATLAS, CMS, CDF and D0 Collaborations: First combination of Tevatron and LHC
measurements of the top-quark mass. ATLAS-CONF-2014-008, CDF-NOTE-11071, CMS-
PAS-TOP-13-014, D0-NOTE-6416, FERMILAB-TM-2582, arXiv:1403.4427 (2014)
14. Lyons, L., Martin, A.J., Saxon, D.H.: On the determination of the B lifetime by combining the
results of different experiments. Phys. Rev. D41, 982–985 (1990)
15. Lista, L.: The bias of the unbiased estimator: a study of the iterative application of the BLUE
method. Nucl. Inst. Methods A764, 82–93 (2014) and corr. ibid. A773, 87–96 (2015)
Chapter 7
Confidence Intervals

7.1 Introduction

Section 5.11 presented two approximate methods to determine uncertainties of
maximum likelihood estimates. Basically, either the negative log likelihood function
is approximated by a parabola at the minimum, corresponding to a Gaussian PDF
approximation, or the excursion of the negative log likelihood around the minimum
is considered in order to obtain possibly asymmetric uncertainties. Neither of those
methods guarantees an exact coverage of the uncertainty interval. In many cases,
the provided level of approximation is sufficient, but for measurements with
PDF models that exhibit large deviations from the Gaussian approximation, the
uncertainty determined with those approximate methods may not be sufficiently
accurate.

7.2 Neyman Confidence Intervals

A more rigorous and general treatment of confidence intervals under the frequentist
approach is due to Neyman [4] and is discussed in the following in the
simplest case of a single parameter.
Let us consider a variable x distributed according to a PDF that depends on an
unknown parameter θ. We have in mind that x could be the value of an estimator of
the parameter θ. Neyman's procedure to determine confidence intervals proceeds in
two steps, sketched in Fig. 7.1:
1. the construction of a confidence belt;
2. the inversion of the confidence belt to determine the confidence interval.

© Springer International Publishing AG 2017
L. Lista, Statistical Methods for Data Analysis in Particle Physics,
Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0_7
[Figure: the confidence belt in the (x, θ) plane, with the interval [x^lo(θ0), x^up(θ0)] at a fixed θ0 (left), and the interval [θ^lo(x0), θ^up(x0)] obtained at the observed value x0 (right)]

Fig. 7.1 Graphical illustration of the Neyman belt construction (left) and inversion (right)

7.2.1 Construction of the Confidence Belt

In the first step, the confidence belt is determined by scanning the parameter space,
varying θ within its allowed range. For each fixed value of the parameter θ = θ0,
the corresponding PDF, which describes the distribution of x, f(x | θ0), is known.
According to the PDF f(x | θ0), an interval [x^lo(θ0), x^up(θ0)] is determined whose
corresponding probability is equal to the specified confidence level, defined as CL =
1 − α, and usually equal to 68.27% (1σ), 90% or 95%:

    1 − α = ∫_{x^lo(θ0)}^{x^up(θ0)} f(x | θ0) dx .   (7.1)

Neyman's construction of the confidence belt is graphically illustrated in Fig. 7.1,
left.
Equation (7.1) can be satisfied exactly for a continuous random variable x. In case
of a discrete variable, instead, it is usually difficult to find an interval that corresponds
exactly to the desired confidence level, and the interval is constructed in
order to correspond to a probability at least equal to the desired confidence level
(overcoverage).
The choice of x^lo(θ0) and x^up(θ0) still has some arbitrariness, since there are
different possible intervals having the same probability, according to the condition
in Eq. (7.1). The choice of the interval is often called ordering rule. This arbitrariness
was already encountered in Sect. 3.5.2 when discussing Bayesian credible
intervals.
For instance, one could choose an interval centered around the central value x̄ of x
corresponding to θ0, i.e. an interval:

    [x^lo(θ0), x^up(θ0)] = [x̄(θ0) − δ, x̄(θ0) + δ] ,   (7.2)
[Figure: three sketches of f(x|θ0), showing a central interval with tail probabilities α/2 on each side, and two fully asymmetric intervals with tail probability α on one side only]

Fig. 7.2 Three possible choices of ordering rule: central interval (top) and fully asymmetric
intervals (bottom left, right)

where δ is determined in order to ensure the coverage condition in Eq. (7.1).
Alternatively, one could choose the interval with equal areas of the PDF tails at the
two extreme sides, i.e. such that:

    ∫_{−∞}^{x^lo(θ0)} f(x | θ0) dx = α/2   and   ∫_{x^up(θ0)}^{+∞} f(x | θ0) dx = α/2 .   (7.3)

Other possibilities consist in choosing the interval having the smallest size, or
fully asymmetric intervals on either side: [x^lo(θ0), +∞[ or ]−∞, x^up(θ0)]. More
options are also considered in the literature. Figure 7.2 shows three of the possible
cases described above.
A special ordering rule was introduced by Feldman and Cousins based on a
likelihood-ratio criterion and will be discussed in Sect. 7.5.
Given a choice of the ordering rule, the intervals [x^lo(θ), x^up(θ)], for all possible
values of θ, define the Neyman belt in the plane (x, θ), as shown in Fig. 7.1.

7.2.2 Inversion of the Confidence Belt

In the second step of the Neyman procedure, given a measurement x = x0, the
confidence interval for θ is determined by inverting the Neyman belt (Fig. 7.1, right):
two extreme values θ^lo(x0) and θ^up(x0) are determined as the intersections of the
vertical line at x = x0 with the two boundary curves of the belt.
The interval [θ^lo(x0), θ^up(x0)] has, by construction, a coverage equal to the
desired confidence level, 1 − α. This means that, if θ is equal to the true value
θ^true, extracting x = x0 randomly according to the PDF f(x | θ^true), θ^true will be
included in the determined confidence interval [θ^lo(x0), θ^up(x0)] in a fraction 1 − α
of the cases, in the limit of a very large number of extractions.
In case of a discrete random variable, the confidence belt provides at least the
desired confidence level, and the interval [θ^lo(x0), θ^up(x0)] contains the true value in
a fraction ≥ 1 − α of the cases, in the limit of large numbers, i.e. it will overcover.

Example 7.22 Neyman Belt: Gaussian Case

The Neyman belt for the parameter μ of a Gaussian distribution with σ = 1
is shown in Fig. 7.3 for a 68.27% confidence level with the choice of a
central interval. The inversion of the belt is straightforward and gives the
usual result μ = x ± σ, as in Ex. 5.17.

[Figure: a straight-line diagonal belt in the (x, μ) plane]

Fig. 7.3 Neyman belt for the parameter μ of a Gaussian with σ = 1 at the 68.27%
confidence level
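The two steps of the procedure can be sketched numerically for this Gaussian example. The following is a pure-Python sketch; the scan range and the grid step are arbitrary choices.

```python
import math

def central_halfwidth(cl):
    """Half-width delta of a central interval of a unit Gaussian:
    erf(delta / sqrt(2)) = CL, solved by bisection."""
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if math.erf(mid / math.sqrt(2.0)) < cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def invert_belt(x0, cl=0.6827):
    """Confidence interval for mu: all mu whose belt slice
    [mu - delta, mu + delta] contains the observed x0."""
    delta = central_halfwidth(cl)
    accepted = [0.001 * k for k in range(-10000, 10001)
                if 0.001 * k - delta <= x0 <= 0.001 * k + delta]
    return min(accepted), max(accepted)

print(invert_belt(1.5))   # close to (0.5, 2.5), i.e. mu = x0 +/- sigma
```

The scan over μ is of course unnecessary in this analytically invertible case; the same structure, however, carries over to belts that can only be inverted numerically.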

7.3 Binomial Intervals

The binomial distribution was introduced in Sect. 5.9, Eq. (1.39):

    P(n; N, p) = N! / (n! (N − n)!) p^n (1 − p)^(N−n) .   (7.4)

Given an extracted value of n, n = n̂, the parameter p can be estimated as:

    p̂ = n̂ / N .   (7.5)

An approximate estimate of the uncertainty on p̂ was obtained in Sect. 5.9 by
replacing p with its estimate p̂ in the variance expression from Eq. (1.41):

    δp̂ ≈ √( p̂ (1 − p̂) / N ) .   (7.6)

This may be justified by the law of large numbers, because for N → ∞ p̂ and p
coincide. However, Eq. (7.6) gives a null error in case either n̂ = 0 or n̂ = N, i.e. for
p̂ = 0 or 1, respectively.
A solution to the problem of determining the correct confidence interval for a
binomial distribution is due to Clopper and Pearson [1] and allows one to determine
the interval [p^lo, p^up] that gives at least the correct coverage 1 − α. The extremes
p^lo and p^up of the interval should be taken such that:

    P(n ≥ n̂ | N, p^lo) = Σ_{n=n̂}^{N} N! / (n! (N − n)!) (p^lo)^n (1 − p^lo)^(N−n) = α/2 ,   (7.7)

    P(n ≤ n̂ | N, p^up) = Σ_{n=0}^{n̂} N! / (n! (N − n)!) (p^up)^n (1 − p^up)^(N−n) = α/2 .   (7.8)

This corresponds to the Neyman inversion described in Sect. 7.2 applied in a
discrete case. The corresponding Neyman belt is shown in Fig. 7.4 for the case
N = 10.
The presence of a discrete variable does not allow a continuous variation of the
discrete intervals [n^lo(p0), n^up(p0)] for an assumed parameter value p = p0. In
the Neyman construction, one has to choose an interval [n^lo, n^up], consistently
with the adopted ordering rule, that has at least the desired coverage. In this way,
the confidence interval [p^lo, p^up] determined by the inversion procedure for the
parameter p could overcover, i.e. it may have a corresponding probability greater
than the desired confidence level 1 − α. In this sense, the interval is said to be
conservative.
[Figure: a staircase-shaped belt in the (n, p) plane]

Fig. 7.4 Neyman belt for the parameter p of a binomial distribution with N = 10 at the 68.27%
confidence level

The coverage of Clopper–Pearson intervals as a function of p is shown in Fig. 7.5
for the cases N = 10 and N = 100. The coverage is, by construction, always greater
than the nominal 68.27%, and has a 'ripple' structure, due to the discrete nature of
the problem, that gets progressively damped as N increases. This effect is also
discussed in [2].
Another case of a discrete application of the Neyman belt inversion in a
Poissonian problem can be found in Sect. 10.7.

Example 7.23 Application of the Clopper–Pearson Method

As an exercise, we compute with the Clopper–Pearson method the 90%
confidence level interval for a measurement n̂ = N = 10, i.e. equal to
the maximum possible outcome for n.
We need to determine the values p^lo and p^up such that Eqs. (7.7) and (7.8)
hold. Considering that α = 0.10, those give:

    P(n ≥ N | N, p^lo) = N!/(N! 0!) (p^lo)^N (1 − p^lo)^0 = (p^lo)^N = 0.05 ,
    P(n ≤ N | N, p^up) = 1 .
[Figure: two panels of P(coverage) versus p, oscillating above the nominal 0.6827 level]

Fig. 7.5 Coverage of Clopper–Pearson intervals as a function of p for the cases N = 10 (top) and
N = 100 (bottom)

So, for p^up we should consider the largest allowed value p^up = 1.0, since
the probability P(n ≤ N | N, p^up) is equal to one. The first equation can be
inverted and gives:

    p^lo = exp[ log(0.05) / N ] .

For N = 10 we have p^lo = 0.741, and the confidence interval is [0.74,
1.00]. Instead, the approximate expression in Eq. (7.6) gives an interval of
null size, which is a clear sign of a pathology.
Symmetrically, the Clopper–Pearson evaluation of the confidence interval
for n̂ = 0 gives [0.00, 0.26] for N = 10.
Note that discrete intervals may overcover. In particular, if the true value is
p = 1, n̂ is always equal to N, which gives the confidence interval [0.74,
1.00] that contains p with 100% probability, while it was supposed to cover
with 90% probability.
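The tail conditions of Eqs. (7.7) and (7.8) can also be inverted numerically for arbitrary n̂. The following is a minimal pure-Python sketch; the bisection approach is just one possible implementation.

```python
from math import comb

def tail_geq(n_hat, N, p):
    """P(n >= n_hat | N, p), the left-hand side of Eq. (7.7)."""
    return sum(comb(N, n) * p**n * (1.0 - p)**(N - n) for n in range(n_hat, N + 1))

def tail_leq(n_hat, N, p):
    """P(n <= n_hat | N, p), the left-hand side of Eq. (7.8)."""
    return sum(comb(N, n) * p**n * (1.0 - p)**(N - n) for n in range(0, n_hat + 1))

def bisect(f, target, increasing):
    """Solve f(p) = target for p in [0, 1], f being monotonic in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(n_hat, N, cl=0.90):
    alpha = 1.0 - cl
    # P(n >= n_hat) grows with p, while P(n <= n_hat) decreases with p.
    p_lo = 0.0 if n_hat == 0 else bisect(lambda p: tail_geq(n_hat, N, p), alpha / 2, True)
    p_up = 1.0 if n_hat == N else bisect(lambda p: tail_leq(n_hat, N, p), alpha / 2, False)
    return p_lo, p_up

print(clopper_pearson(10, 10))   # ~ (0.741, 1.00)
print(clopper_pearson(0, 10))    # ~ (0.00, 0.259)
```

For n̂ = N the upper tail condition cannot be satisfied, so p^up = 1 is taken, reproducing the [0.74, 1.00] interval derived above; symmetrically, n̂ = 0 reproduces [0.00, 0.26].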

7.4 The Flip-Flopping Problem

In order to determine confidence intervals, a consistent choice of the ordering rule
has to be adopted. Feldman and Cousins demonstrated [3] that the ordering rule
must not depend on the outcome of the measurement, otherwise the quoted
confidence interval or upper limit may not correspond to the correct confidence
level, i.e. it may not respect the coverage.
In some cases, experimenters searching for a rare signal decide to quote their
final result in different possible ways, switching from a central interval to an upper
limit depending on the outcome of the measurement. A typical choice is to:
• quote an upper limit if the measured signal yield is not greater than at least three
times its uncertainty;
• instead, quote the central value with its uncertainty if the measured signal exceeds
three times its uncertainty.
This '3σ' significance criterion will be discussed in more detail later; see
Sect. 10.2 for a more general definition of significance level.
This problem is sometimes referred to in the literature as flip-flopping, and can be
illustrated with a simple example [3]. Imagine a model where a random variable x
follows a Gaussian distribution with a fixed and known standard deviation σ and an
unknown average μ which is bound to be greater than or equal to zero. This is the case
of a signal yield or cross section measurement. For simplicity, we can take σ = 1,
without loss of generality.
The quoted central value must always be greater than or equal to zero, given the
assumed constraint. Imagine one decides to quote, as measured value for μ, zero
if the significance, defined as x/σ, is less than 3. The measured value x is quoted
otherwise:

    μ̂(x) = x   if x/σ ≥ 3 ,
    μ̂(x) = 0   if x/σ < 3 .   (7.9)
[Figure: the quoted μ versus the measured x, with belt edges μ = x ± 1.645 and μ < x + 1.282]

Fig. 7.6 Illustration of the flip-flopping problem. The plot shows the quoted central value of μ as
a function of the measured x (dashed line), and the 90% confidence interval corresponding to the
choice of quoting a central interval for x/σ ≥ 3 and an upper limit for x/σ < 3. The coverage
decreases from 90 to 85% for values of μ corresponding to the horizontal lines with arrows

If x/σ ≥ 3, the quoted confidence interval, given the measurement x, has a central
value with a symmetric error equal to ±σ at the 68.27% confidence level (CL), or
±1.645σ at 90% CL. Instead, if x/σ < 3, the confidence interval is [0, μ^up] with
an upper limit μ^up = x + 1.282σ at 90% CL, given the corresponding area under a
Gaussian PDF.
In summary, the quoted confidence interval at 90% CL is:

    [μ^lo, μ^up] = [x − 1.645σ, x + 1.645σ]   if x/σ ≥ 3 ,
    [μ^lo, μ^up] = [0, x + 1.282σ]            if x/σ < 3 .   (7.10)

The situation is shown in Fig. 7.6.
The choice to switch from a central interval to a fully asymmetric interval, which
gives an upper limit, based on the observation of x, produces an incorrect coverage.
Looking at Fig. 7.6, depending on the value of μ, the coverage can be determined
as the probability corresponding to the interval [x^lo, x^up] obtained by crossing the
confidence belt with a horizontal line. One may have cases where the coverage
decreases from 90 to 85%, which is lower than the desired confidence level, as
indicated by the lines with arrows in Fig. 7.6.
The next Sect. 7.5 presents the method due to Feldman and Cousins to consistently
preserve the coverage for this example without incurring the flip-flopping problem.
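The coverage loss can be reproduced with a simple toy experiment. This is an illustrative sketch with σ = 1; the value μ = 2 is one of the values for which the quoted intervals of Eq. (7.10) undercover.

```python
import random

def flip_flop_interval(x, sigma=1.0):
    """Quoted 90% CL interval of Eq. (7.10)."""
    if x / sigma >= 3.0:
        return x - 1.645 * sigma, x + 1.645 * sigma
    return 0.0, x + 1.282 * sigma

def coverage(mu_true, n_toys=200000, seed=1):
    """Fraction of toy experiments whose quoted interval contains mu_true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_toys):
        x = rng.gauss(mu_true, 1.0)
        lo, up = flip_flop_interval(x)
        if lo <= mu_true <= up:
            hits += 1
    return hits / n_toys

print(coverage(2.0))   # noticeably below the nominal 0.90 (about 0.85)
print(coverage(5.0))   # back to ~0.90 far from the switching point
```

The drop to about 85% for μ around 2 matches the loss visible in Fig. 7.6, while for large μ the central intervals alone are quoted and the nominal coverage is restored.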

7.5 The Unified Feldman–Cousins Approach

In order to avoid the flip-flopping problem and to ensure the correct coverage, the
ordering rule proposed by Feldman and Cousins [3] provides a Neyman confidence
belt, following the procedure described in Sect. 7.2, that smoothly changes from a
central or quasi-central interval to an upper limit in the case of a low observed signal
yield.
The proposed ordering rule is based on a likelihood ratio whose properties will
be further discussed in Sect. 9.5. Given a value θ0 of the unknown parameter θ, the
chosen interval of the variable x used for the Neyman belt construction is defined
by the ratio of two PDFs of x, one under the hypothesis that θ is equal to the
considered fixed value θ0, the other under the hypothesis that θ is equal to the
maximum likelihood estimate θ̂(x) corresponding to the given measurement
x. The likelihood ratio must be greater than a constant kα whose value depends on
the chosen confidence level 1 − α. In a formula:

    λ(x | θ0) = f(x | θ0) / f(x | θ̂(x)) > kα .   (7.11)

The constant kα should be taken such that the integral of the PDF over the confidence
interval Rα is equal to 1 − α:

    ∫_{Rα(θ0)} f(x | θ0) dx = 1 − α .   (7.12)

The confidence interval Rα for a given value θ = θ0 is defined by Eq. (7.11):

    Rα(θ0) = { x : λ(x | θ0) > kα } .   (7.13)

The case is illustrated in Fig. 7.7.

[Figure: f(x|θ0) with the acceptance region where the likelihood ratio exceeds kα, holding probability 1 − α]

Fig. 7.7 Ordering rule in the Feldman–Cousins approach, based on the likelihood ratio
λ(x | θ0) = f(x | θ0) / f(x | θ̂(x))

Feldman and Cousins computed the confidence interval for the simple Gaussian
case discussed in Sect. 7.4. The value μ = μ̂(x) that maximizes the likelihood
x
function, given x, under the constraint μ ≥ 0, is:

    μ̂(x) = max(x, 0) .   (7.14)

The PDF for x, using the maximum likelihood estimate for μ, becomes:

    f(x | μ̂(x)) = 1/√(2π)                if x ≥ 0 ,
                = (1/√(2π)) e^(−x²/2)    if x < 0 .   (7.15)

The likelihood ratio in Eq. (7.11) can be written in this case as:

    λ(x | μ) = f(x | μ) / f(x | μ̂(x)) = exp(−(x − μ)²/2)   if x ≥ 0 ,
                                       = exp(xμ − μ²/2)    if x < 0 .   (7.16)

The interval [x^lo(μ0), x^up(μ0)], for a given μ = μ0, can be found numerically
using the condition λ(x | μ0) > kα and imposing the desired confidence level 1 − α,
according to Eq. (7.12).
The result is shown in Fig. 7.8, and can be compared with Fig. 7.6. Using the
Feldman–Cousins approach, for large values of x, the usual symmetric confidence
interval is obtained. As x moves towards lower values, the interval becomes more
and more asymmetric, and at some point it becomes fully asymmetric (i.e. [0, μ^up]),
determining an upper limit μ^up. For negative values of x, the result is
always an upper limit, avoiding unphysical cases corresponding to negative values
of μ. Negative values of μ would not be excluded using a Neyman belt construction

[Figure: belt showing symmetric errors at large x, asymmetric errors at intermediate x, and an upper limit at small or negative x]

Fig. 7.8 Neyman confidence belt constructed using the Feldman–Cousins ordering

with a central interval, like the one shown in Fig. 7.3. As seen, this approach
smoothly changes from a central interval to an upper limit, while correctly ensuring
the required coverage (90% in this case).
More applications of the Feldman–Cousins approach will be presented in
Chap. 10. The application of the Feldman–Cousins method requires, in most
cases, a numerical treatment, even for simple PDF models, like the Gaussian case
discussed above. The reason is that the inversion of the integral in Eq. (7.12) is
required. The inversion usually proceeds with a scan of the parameter space and,
in case of complex models, it may be very CPU intensive. For this practical reason,
other methods are often preferred to Feldman–Cousins for complex cases.
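The numerical construction can be sketched by ranking grid points of x by the likelihood ratio of Eq. (7.16) and accepting them in decreasing order until they hold the desired probability. This is a simple grid-based sketch; the grid range and spacing are arbitrary choices.

```python
import math

def fc_interval(mu0, cl=0.90, x_min=-10.0, x_max=10.0, n=20000):
    """Feldman-Cousins acceptance interval in x for a fixed mu0 >= 0."""
    dx = (x_max - x_min) / n
    xs = [x_min + (i + 0.5) * dx for i in range(n)]

    def pdf(x):
        return math.exp(-0.5 * (x - mu0) ** 2) / math.sqrt(2.0 * math.pi)

    def lam(x):  # likelihood ratio of Eq. (7.16), with mu_hat = max(x, 0)
        if x >= 0.0:
            return math.exp(-0.5 * (x - mu0) ** 2)
        return math.exp(x * mu0 - 0.5 * mu0 ** 2)

    # Accept grid points in decreasing order of lambda until they hold 1 - alpha.
    prob, accepted = 0.0, []
    for x in sorted(xs, key=lam, reverse=True):
        accepted.append(x)
        prob += pdf(x) * dx
        if prob >= cl:
            break
    return min(accepted), max(accepted)

print(fc_interval(3.0))   # ~ (1.355, 4.645): the usual central interval
print(fc_interval(0.5))   # extends to x < 0: smoothly approaching an upper limit
```

Far from the physical boundary the acceptance region is the usual central interval, while for μ0 close to zero it extends into negative x, which after belt inversion produces the smooth transition to an upper limit seen in Fig. 7.8.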

References

1. Clopper, C.J., Pearson, E.: The use of confidence or fiducial limits illustrated in the case of the
binomial. Biometrika 26, 404–413 (1934)
2. Cousins, R., Hymes, K.E., Tucker, J.: Frequentist evaluation of intervals estimated for a binomial
parameter and for the ratio of Poisson means. Nucl. Instrum. Meth. A612, 388–398 (2010)
3. Feldman, G., Cousins, R.: Unified approach to the classical statistical analysis of small signals.
Phys. Rev. D57, 3873–3889 (1998)
4. Neyman, J.: Outline of a theory of statistical estimation based on the classical theory of
probability. Philos. Trans. R. Soc. Lond. A Math. Phys. Sci. 236, 333–380 (1937)
Chapter 8
Convolution and Unfolding

8.1 Introduction

This section discusses two related problems: how to incorporate realistic
detector effects, such as resolution, efficiency, and background, into a probability
model, and how to remove those experimental effects from an observed distribution
in order to recover the original distribution. This second problem is known as
unfolding.
Unfolding is not always necessary in order to compare an experimental
distribution with the expectation from theory, since data can be compared with a
realistic prediction that takes into account experimental effects. In some cases,
however, it is desirable to produce a distribution that can be compared among
different experiments, each introducing different experimental effects. For those
cases, unfolding observed distributions may be a necessary task.

8.2 Convolution

Detector effects distort the theoretical distribution of an observable quantity. Let us
denote by y the true value of the quantity and by x the corresponding measured quantity.
The theoretical distribution of y is given by a PDF g(y) and, for a given true value
y, the PDF for x depends on y and is r(x; y), which is called the response function or
kernel function.
The PDF that describes the distribution of the measured value x, taking into
account both the original theoretical distribution and the detector response, is given
by the convolution of the two PDFs g and r, defined as follows:

    f(x) = ∫ g(y) r(x; y) dy .   (8.1)


Note that convolution may also be applied if the original distribution g(y) is not
normalized to unity, in which case the integral of g(y) may represent the overall number
of random extractions of the variable y. Convolution of non-normalized distributions
will be more relevant in the discrete case, discussed in Sect. 8.2.2.
The convolution of a theoretical distribution with a finite resolution response is
also called smearing, which typically broadens the peaking structures present in
the original distribution. Examples of the effect of PDF convolution are shown in
Fig. 8.1.

8.2.1 Convolution and Fourier Transform

In many cases the response function r only depends on the difference x − y, and
can be described by a function of one variable: r(x; y) = r(x − y). In those cases, the
notation f = g ⊗ r is also sometimes used to indicate the convolution of g and r.
Convolutions have interesting properties under Fourier transform. In particular,
let us define as usual the Fourier transform of g as:

    ĝ(k) = ∫_{−∞}^{+∞} g(y) e^(iky) dy ,   (8.2)

and conversely the inverse transform as:

    g(y) = (1/2π) ∫_{−∞}^{+∞} ĝ(k) e^(−iky) dk .   (8.3)

It is possible to demonstrate that the Fourier transform of the convolution of two
PDFs g(y) and r(x − y) is given by the product of the Fourier transforms of the two
PDFs, i.e.:

    (g ⊗ r)^ = ĝ · r̂ .   (8.4)

Conversely, the Fourier transform of the product of two PDFs is equal to the
convolution of the two Fourier transforms, i.e.:

    (g · r)^ = ĝ ⊗ r̂ .   (8.5)

This property allows implementations of numerical convolutions based on the
fast Fourier transform (FFT) algorithm [3].
The simplest and most common model for detector resolution is a Gaussian PDF:

    r(x − y) = 1/(√(2π) σ) e^(−(x−y)²/2σ²) .   (8.6)
[Figure: two panels showing g(y), the scaled-down kernel, and the convolution f(x)]

Fig. 8.1 Examples of convolution of two PDFs g(y) (solid light blue lines) with Gaussian
kernel functions r(x; y) (dashed black lines). Top: g(y) is the superposition of three Breit–
Wigner functions (see Sect. 2.13.1); bottom: g(y) is a piece-wise constant PDF. The result of the
convolution, f(x), is the solid dark blue line. The kernel function is scaled down by a factor 2 and
20 in the top and bottom plots, respectively, for display convenience

The Fourier transform of a Gaussian PDF can be computed analytically in order to
implement convolution algorithms based on FFT:

    r̂(k) = e^(ikμ) e^(−σ²k²/2) ,   (8.7)

for a Gaussian with mean μ and standard deviation σ (μ = 0 for the kernel in Eq. (8.6)).
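The use of the convolution theorem, Eq. (8.4), for numerical smearing can be sketched with NumPy. This is an illustrative example: the box-shaped g(y) and the kernel width are arbitrary choices, and the grid is chosen wide enough that the wrap-around of the discrete transform is negligible.

```python
import numpy as np

n, sigma = 256, 0.5
x = np.linspace(-10.0, 10.0, n, endpoint=False)
dx = x[1] - x[0]

g = np.where(np.abs(x - 2.0) < 1.0, 0.5, 0.0)   # box-shaped g(y)
# Gaussian kernel r(x - y), centred at x = 0 (grid index n/2):
r = np.exp(-0.5 * (x / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Convolution theorem, Eq. (8.4): multiply the transforms, then invert.
f = np.real(np.fft.ifft(np.fft.fft(g) * np.fft.fft(r))) * dx
f = np.fft.fftshift(f)   # undo the circular shift due to the kernel centre

print(g.sum() * dx, f.sum() * dx)   # normalisation is preserved
```

The sharp edges of the box are smeared into Gaussian-like shoulders, while the normalization and the mean of the distribution are preserved, as expected for a convolution with a unit-area symmetric kernel.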

8.2.2 Discrete Convolution and Response Matrix

In many cases, data samples are available in the form of histograms. For instance,
consider a sample randomly extracted according to a continuous distribution f(x).
The sample is, for convenience, stored into N intervals (bins) of given size of the
variable x, and the content of the bins (the number of entries) is (n1, ..., nN). Each
ni is distributed according to a Poissonian with expected value:

    νi = ⟨ni⟩ = ∫_{xi^lo}^{xi^up} f(x) dx ,   (8.8)

xi^lo and xi^up being the edges of the i-th bin. Consider that the values of x are derived
from original values of y, which are unaffected by the experimental effects and are
distributed according to a theoretical distribution g(y). The range of possible values
of y can be divided into M bins, where M is not necessarily equal to N. Each of
those bins j has an expected number of entries equal to:

    μj = ∫_{yj^lo}^{yj^up} g(y) dy ,   (8.9)

yj^lo and yj^up being the edges of the j-th bin. For such discrete cases, the effect of a
realistic detector response is similar to Eq. (8.1), and can be written as:

    νi = Σ_{j=1}^{M} Rij μj .   (8.10)

The N × M matrix Rij is called the response matrix, and is responsible for the bin
migration from the original histogram of values of y to the observed histogram of
values of x.
The case of a kernel function that depends on the difference x − y in the continuous
case corresponds to a response matrix that only depends on the index difference i − j
in the discrete case.
An example of the effect of bin migration is shown in Fig. 8.2, where, for
simplicity of representation, M = N has been chosen.

8.2.3 Efficiency and Background

The convolution presented in Eq. (8.1) represents the transformation of g(y) into
f(x), where both g and f are normalized to unity. In Eq. (8.10), instead, the
histograms of the expected values (ν1, ..., νN) and (μ1, ..., μM) do not obey
a normalization condition, since their content does not represent a probability
distribution, but the expected yield in each bin. In those cases, in addition to
[Figure: the response matrix Rij and the original versus folded distributions]

Fig. 8.2 Effect of bin migration. Top: response matrix Rij; bottom: the original distribution μj
(light blue line) with superimposed the convoluted distribution νi = Σj Rij μj (dark blue line)

the aforementioned detector response, relevant experimental effects may also be
efficiency and background. A typical effect of histogram distortion purely due to
efficiency can be represented by a diagonal migration matrix, if N = M is assumed:

    νi = Σ_{j=1}^{M} εj δij μj = εi μi .   (8.11)

A background contribution can be taken into account by adding an offset bi to the
content of each bin:

    νi = μi + bi .   (8.12)

A realistic prediction of an experimental distribution from a theory prediction
that takes into account detector response, including efficiency and background, can
be obtained by combining the individual effects in Eqs. (8.10), (8.11) and (8.12):

    νi = Σ_{j=1}^{M} εj Rij μj + bi .   (8.13)

The efficiency term εj can be dropped from Eq. (8.13) and incorporated in the
definition of the response matrix Rij, giving up the normalization condition:

    Σ_{i=1}^{N} Rij = 1 ,   (8.14)

which would preserve the histogram normalization in Eq. (8.10). Absorbing the
efficiency term, Eq. (8.13) becomes:

    νi = Σ_{j=1}^{M} Rij μj + bi ,   (8.15)

where the efficiency is given by:

    Σ_{i=1}^{N} Rij = εj .   (8.16)

In matrix form, with ν, μ and b regarded as column vectors, Eq. (8.15) becomes:

    ν = R μ + b .   (8.17)
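Equations (8.15)–(8.17) can be illustrated with a small toy example, in which the efficiencies are absorbed into the response matrix as column sums. All numbers here are invented for illustration.

```python
import numpy as np

# Toy 3-bin response matrix with migrations between adjacent bins; its column
# sums are the efficiencies eps_j (illustrative numbers).
R = np.array([[0.72, 0.10, 0.00],
              [0.08, 0.70, 0.10],
              [0.00, 0.10, 0.75]])
mu = np.array([100.0, 200.0, 150.0])    # hypothetical true yields
b = np.array([5.0, 8.0, 6.0])           # hypothetical background yields

eps = R.sum(axis=0)                     # Eq. (8.16): eps_j = sum_i R_ij
nu = R @ mu + b                         # Eq. (8.17): expected observed yields

print(eps)   # efficiencies 0.80, 0.90, 0.85
print(nu)    # folded yields 97, 171, 138.5
```

Each column j of R distributes the true yield μj over the observed bins; the part of the column sum missing from unity is the inefficiency, and the background is added on top of the folded signal.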

8.3 Unfolding by Inversion of the Response Matrix

Unfolding the response matrix from an experimental distribution n = (n1, ..., nN)
consists in finding an estimate μ̂ = (μ̂1, ..., μ̂M) of the original distribution
μ = (μ1, ..., μM).
The most straightforward way is given by the inversion of Eq. (8.17), but this
approach leads to results that are not usable in practice, as shown in the following.
The best estimate of μ can be determined with the maximum likelihood method.
The probability distribution for each ni is a Poissonian with expected value νi, hence
the likelihood function can be written as:

    L(n; μ) = Π_{i=1}^{N} Pois(ni; νi(μ)) = Π_{i=1}^{N} Pois(ni; Σ_{j=1}^{M} Rij μj + bi) .   (8.18)

The maximum likelihood solution leads [6] to the inversion of Eq. (8.17). Assuming
for simplicity N = M, the estimate μ̂ that maximizes the likelihood function is:

    μ̂ = R⁻¹ (n − b) .   (8.19)

The covariance matrix Uij of the estimated expected bin entries


O j is given by:
 T
U D R1 V R1 ; (8.20)

where the matrix V is the covariance matrix of the measurements ni . Due to the
Poisson distribution, V is a diagonal matrix with elements on the diagonal equal to
ni :

Vij D ıij j : (8.21)

It is also possible to demonstrate that this estimate is unbiased and it has the smallest
possible variance (see Sect. 5.8.3). Results are shown in Fig. 8.3 on a Monte Carlo
example.
The response matrix was taken from Fig. 8.2 (top), and a histogram containing the observations \vec{n} was generated according to the convoluted distribution from Fig. 8.2 (bottom) with 10,000 entries. This histogram, generated by Monte Carlo, is shown in Fig. 8.3 (top), superimposed on the original convoluted distribution. The resulting estimates \hat{\mu}_j are shown in Fig. 8.3 (bottom), with uncertainty bars computed according to Eq. (8.20). The plot shows very large oscillations from one bin to the next, of the order of the uncertainty bars, which are very large as well.

The numerical values of the off-diagonal terms of the matrix U from Eq. (8.20) show that each bin \hat{\mu}_j is very strongly anticorrelated with its adjacent bins \hat{\mu}_{j+1} and \hat{\mu}_{j-1}, the correlation coefficient being very close to -1. Statistical fluctuations in the generated numbers of entries n_i, which are bin-by-bin independent, turn into high-frequency bin-by-bin oscillations of the resulting estimates \hat{\mu}_j, due to the very large anticorrelation, rather than, as one may expect, reflecting into uncertainties of the estimates \hat{\mu}_j.

In order to overcome the problems observed with the simple maximum likelihood solution of the unfolding, other techniques have been introduced to regularize the observed oscillations. The maximum likelihood solution has no bias, and it provides the smallest possible variance (see Sect. 5.11.3). For this reason, in order to reduce the large observed variance and the bin-by-bin correlations, regularization methods must unavoidably introduce some bias in the unfolded estimates.
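The matrix-inversion estimate of Eq. (8.19) and its covariance from Eq. (8.20) can be reproduced in a few lines of NumPy. The following is an illustrative sketch, not code from the book; the 3-bin response matrix and true yields are arbitrary choices:

```python
import numpy as np

# Example response matrix R (N = M = 3): column j gives the
# probabilities for an event in true bin j to be observed in bin i.
R = np.array([[0.8, 0.1, 0.0],
              [0.2, 0.8, 0.2],
              [0.0, 0.1, 0.8]])

mu_true = np.array([1000.0, 2000.0, 1500.0])
b = np.zeros(3)                      # no background in this example

rng = np.random.default_rng(seed=1)
n = rng.poisson(R @ mu_true + b)     # observed bin contents

# Maximum likelihood estimate, Eq. (8.19): mu_hat = R^-1 (n - b)
R_inv = np.linalg.inv(R)
mu_hat = R_inv @ (n - b)

# Covariance, Eq. (8.20): U = R^-1 V (R^-1)^T, with V estimated as diag(n)
V = np.diag(n.astype(float))
U = R_inv @ V @ R_inv.T

# Correlation between adjacent bins: typically large and negative
rho_01 = U[0, 1] / np.sqrt(U[0, 0] * U[1, 1])
```

Averaging `mu_hat` over many pseudo-experiments illustrates the unbiasedness, while the negative `rho_01` reflects the strong bin-to-bin anticorrelation discussed above.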
162 8 Convolution and Unfolding

[Figure 8.3]
Fig. 8.3 Top: Monte Carlo generated histogram with 10,000 entries randomly extracted according to the convoluted distribution \vec{\nu} in Fig. 8.2 (bottom), superimposed on the original convoluted distribution; bottom: the result of response matrix inversion, or maximum likelihood estimate \hat{\vec{\mu}} of the unconvoluted distribution, with error bars. Note that the vertical scale on the bottom plot extends far above the range of the top plot

8.4 Bin-by-Bin Correction Factors

A solution to the unfolding problem, which also turns out to have disadvantages, consists in performing a correction to the yields n_i observed in each bin equal to the ratio of the expected values before and after unfolding in each bin (assuming N = M), which are assumed to be known:

\hat{\mu}_i = \frac{ \mu_i^{\mathrm{est}} }{ \nu_i^{\mathrm{est}} } ( n_i - b_i ) .   (8.22)

\mu_i^{\mathrm{est}} and \nu_i^{\mathrm{est}} are usually estimates of the true expected yields determined from simulation. The estimate \nu_i^{\mathrm{est}} does not include the background contribution b_i.

This method produces results that, by construction, resemble the estimated expectation; hence, a simple visual inspection of unfolded distributions may not show any apparent disagreement with the expectation. This approach, however, has the serious drawback that it introduces a bias that drives the estimates \hat{\mu}_i towards the estimated expectations \mu_i^{\mathrm{est}} [6]. The bias, in fact, can be determined to be:

\langle \hat{\mu}_i \rangle - \mu_i = \left( \frac{ \mu_i^{\mathrm{est}} }{ \nu_i^{\mathrm{est}} } - \frac{ \mu_i }{ \nu_i - b_i } \right) ( \nu_i - b_i ) .   (8.23)

Note that the expectation \nu_i includes the background contribution b_i; hence the term \nu_i - b_i is the expected yield due to signal only.

Due to Eq. (8.23), any comparison of unfolded data with the simulation prediction obtained in this way will be biased towards good agreement, with the risk of hiding real discrepancies.
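The correction of Eq. (8.22) and its bias, Eq. (8.23), can be illustrated numerically. The sketch below is hypothetical; all the yields are invented numbers chosen only to make the two expressions comparable:

```python
import numpy as np

# Hypothetical expectations for a single bin, taken from 'simulation'
mu_est, nu_est = 900.0, 1100.0   # assumed true/convoluted yields (simulation)
b = 100.0                        # expected background in the bin
nu = 1300.0                      # actual expectation of the observed yield n
mu = 1000.0                      # actual true expected yield

# Bin-by-bin estimate, Eq. (8.22), averaged over Poisson fluctuations:
# <mu_hat> = (mu_est / nu_est) * (<n> - b), with <n> = nu
mean_mu_hat = (mu_est / nu_est) * (nu - b)

# Bias from Eq. (8.23): (mu_est/nu_est - mu/(nu - b)) * (nu - b)
bias = (mu_est / nu_est - mu / (nu - b)) * (nu - b)

# The two expressions are consistent: <mu_hat> - mu equals the bias
assert abs((mean_mu_hat - mu) - bias) < 1e-9
```

With these numbers the simulation underestimates the true yield, and the unfolded estimate is correspondingly pulled towards the simulated expectation.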

8.5 Regularized Unfolding

Acceptable unfolding methods should reduce the large variance which occurs in maximum likelihood estimates, given by the simple matrix inversion discussed in Sect. 8.3. This has the unavoidable cost of introducing a bias in the estimate, since, as seen above, the only unbiased estimator with the smallest possible variance is the maximum likelihood one. A tradeoff between smallest variance and smallest bias is achieved by imposing an additional condition on the smoothness of the set of values to be estimated, \vec{\mu}, which is quantified by a function S(\vec{\mu}). The definition of S may vary according to the specific implementation. Such methods are called regularized unfolding.

In practice, instead of minimizing \varphi = -2 \log L (which is a \chi^2, for a Gaussian case), the function to be minimized is:

\varphi(\vec{\mu}) + \tau^2 S(\vec{\mu}) = \tilde{\varphi}(\vec{\mu}) .   (8.24)

The parameter \tau that appears in Eq. (8.24) is called regularization strength. The case \tau = 0 is equivalent to a maximum likelihood estimate, which has been already considered, and exhibits large bin-by-bin oscillations and large variance of the estimates. The other extreme, when \tau is very large, provides an estimate \hat{\vec{\mu}} that is extremely smooth, in the sense that it minimizes S, but is insensitive to the observed data \vec{n}.

The regularization procedure should find an optimal value of \tau in order to achieve a solution \hat{\vec{\mu}} that is sufficiently close to the minimum of -2 \log L and, at the same time, ensures a sufficiently small S, i.e. a sufficient smoothness.

Another issue that should be considered is that, in general, the total observed yield n = \sum_{i=1}^{N} n_i is not necessarily equal to the sum of the estimates for the expected values, \hat{\nu} = \sum_{i=1}^{N} \hat{\nu}_i. \hat{\vec{\nu}} can be determined as:

\hat{\vec{\nu}} = R \hat{\vec{\mu}} + \vec{b} ,   (8.25)

where \hat{\vec{\mu}} minimizes \tilde{\varphi} in Eq. (8.24). In order to impose the conservation of the overall normalization after unfolding, an extra term can be included in the minimization problem which contains a Lagrange multiplier \lambda:

\tilde{\varphi}(\vec{\mu}) = \varphi(\vec{\mu}) + \tau^2 S(\vec{\mu}) + \lambda \left( \sum_{i=1}^{N} n_i - \sum_{i=1}^{N} \nu_i \right) .   (8.26)

The choice of the regularization function S(\vec{\mu}) defines the different regularization methods available in the literature and used for different problems.

8.5.1 Tikhonov Regularization

One of the most popular regularization techniques [15, 18, 19] adopts the regularization function given by the following expression:

S(\vec{\mu}) = ( L \vec{\mu} )^T ( L \vec{\mu} ) ,   (8.27)

where L is a matrix with M columns and a number K of rows that could also be different from M.

In the simplest implementation, L is taken as the unit matrix, and Eq. (8.27) becomes:

S(\vec{\mu}) = \sum_{j=1}^{M} \mu_j^2 .   (8.28)

The term \tau^2 S(\vec{\mu}) in the minimization just damps the cases with very large deviations of \mu_j from zero.

Another commonly adopted L matrix is the following:

L = \begin{pmatrix}
-1 &  1 &  0 & \cdots &  0 &  0 &  0 \\
 1 & -2 &  1 & \cdots &  0 &  0 &  0 \\
 0 &  1 & -2 & \cdots &  0 &  0 &  0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
 0 &  0 &  0 & \cdots & -2 &  1 &  0 \\
 0 &  0 &  0 & \cdots &  1 & -2 &  1 \\
 0 &  0 &  0 & \cdots &  0 &  1 & -1
\end{pmatrix} .   (8.29)

The reason for this choice is that the derivatives of a function, approximated by a discrete histogram h, can be computed from finite differences, divided by the bin size \Delta. The first two derivatives can be approximated, in this way, as:

h'_i = \frac{ h_i - h_{i-1} }{ \Delta } ,   (8.30)

h''_i = \frac{ h_{i-1} - 2 h_i + h_{i+1} }{ \Delta^2 } .   (8.31)

The matrix L in Eq. (8.29) determines the approximate second derivative of the histogram \vec{\mu}, taken as approximation of the original function g(y). According to this approximation, the function S in Eq. (8.27), with the choice of L in Eq. (8.29), is approximately equal to the integral of the second derivative of g squared:

S \simeq \int \left( \frac{ \mathrm{d}^2 g(y) }{ \mathrm{d}y^2 } \right)^2 \mathrm{d}y .   (8.33)

This regularization choice damps large second-derivative terms, which correspond to the large high-frequency oscillations of the unfolded distribution \hat{\vec{\mu}} typical of the maximum likelihood solution.

Note that, due to the matrix structure of the problem, a solution can be found only if N > M.
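In the Gaussian approximation, where \varphi(\vec{\mu}) is a \chi^2, the minimum of Eq. (8.24) has the closed form \hat{\vec{\mu}} = (R^T V^{-1} R + \tau^2 L^T L)^{-1} R^T V^{-1} (\vec{n} - \vec{b}). The NumPy sketch below is illustrative only; the complete algorithm, including normalization constraints, is implemented in TUnfold (Sect. 8.8):

```python
import numpy as np

def second_derivative_matrix(m: int) -> np.ndarray:
    """Curvature matrix L of Eq. (8.29), built from finite differences."""
    L = np.zeros((m, m))
    for i in range(m):
        L[i, i] = -2.0
        if i > 0:
            L[i, i - 1] = 1.0
        if i < m - 1:
            L[i, i + 1] = 1.0
    L[0, 0] = L[-1, -1] = -1.0   # one-sided differences at the edges
    return L

def tikhonov_unfold(R, n, b, V, tau):
    """Minimize (n-b-R mu)^T V^-1 (n-b-R mu) + tau^2 |L mu|^2 in closed form."""
    m = R.shape[1]
    L = second_derivative_matrix(m)
    Vinv = np.linalg.inv(V)
    A = R.T @ Vinv @ R + tau**2 * (L.T @ L)
    return np.linalg.solve(A, R.T @ Vinv @ (n - b))
```

For \tau \to 0 this reduces to the matrix-inversion solution of Sect. 8.3; increasing \tau trades variance for smoothness, as discussed above.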

8.5.1.1 L-Curve Scan

The method usually adopted to determine the optimal value of the regularization strength parameter \tau is a scan of the two-dimensional curve L defined by the following coordinates [9]:

L_x = \log \varphi( \hat{\vec{\mu}}(\tau) ) ,   (8.34)
L_y = \log S( \hat{\vec{\mu}}(\tau) ) .   (8.35)

L_x measures how well \hat{\vec{\mu}}(\tau) agrees with the model, while L_y measures how well \hat{\vec{\mu}}(\tau) matches the regularization condition. Note that neither \varphi nor S explicitly depends on \tau, but their values at the minimum of \tilde{\varphi}(\vec{\mu}) depend on the chosen value of \tau.

\tau = 0 corresponds to a minimum value of L_x and a maximum of L_y, while, for increasing values of \tau, L_x increases and L_y decreases. Usually, the L_y vs L_x curve exhibits a kink that makes the curve L-shaped. This motivates the name L curve for this method. The kink, which corresponds to the point of maximum curvature of the L curve, is taken as the optimal value.
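An L-curve scan can be sketched as follows. This is a self-contained toy in the Gaussian approximation, with an invented 3-bin response matrix; it only computes the (L_x, L_y) points, leaving the location of the maximum-curvature kink to the criterion of [9]:

```python
import numpy as np

# Toy setup: square response matrix, Gaussian approximation of phi
R = np.array([[0.8, 0.2, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.2, 0.8]])
n = np.array([900.0, 1400.0, 1100.0])
V = np.diag(n)
L = np.array([[-1.0, 1.0, 0.0],      # curvature matrix of Eq. (8.29)
              [1.0, -2.0, 1.0],
              [0.0, 1.0, -1.0]])

def solve(tau):
    """Regularized chi^2 minimum for a given regularization strength tau."""
    Vinv = np.linalg.inv(V)
    A = R.T @ Vinv @ R + tau**2 * (L.T @ L)
    return np.linalg.solve(A, R.T @ Vinv @ n)

Lx, Ly = [], []
taus = np.logspace(-3, 1, 50)
for tau in taus:
    mu = solve(tau)
    r = n - R @ mu
    phi = float(r @ np.linalg.inv(V) @ r)    # chi^2 term of Eq. (8.24)
    S = float((L @ mu) @ (L @ mu))           # Eq. (8.27)
    Lx.append(np.log(phi + 1e-12))           # Eq. (8.34), guarded for phi -> 0
    Ly.append(np.log(S + 1e-12))             # Eq. (8.35)
```

Plotting `Ly` versus `Lx` reproduces the characteristic L shape: `Lx` grows and `Ly` falls as `tau` increases, as described above.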
An example of application of the Tikhonov regularization is shown in Figs. 8.4, 8.5 and 8.6. Figure 8.4 is similar to Fig. 8.2 and shows the assumed response matrix and the original unfolded and convoluted distributions. In this case, we assumed M = 25 bins in the original distribution and N = 60 bins in the convoluted (detector-level) distribution. Figure 8.5 shows a randomly extracted data sample and the corresponding L-curve scan, with the optimal point corresponding to the chosen value of \tau. Figure 8.6 shows the unfolded distribution superimposed on the original one.

8.6 Iterative Unfolding

A method to address the unfolding problem by subsequent iterations was proposed initially in the '80s [11, 13, 14, 17], then reproposed in the '90s [7] under the name iterative Bayesian unfolding. The method proceeds iteratively, computing at every iteration a new estimate of the unfolded distribution \vec{\mu}^{(l+1)} from the distribution \vec{\mu}^{(l)} obtained at the previous iteration, according to the following equation:

\mu_j^{(l+1)} = \frac{ \mu_j^{(l)} }{ \varepsilon_j } \sum_{i=1}^{N} \frac{ R_{ij}\, n_i }{ \sum_{k=1}^{M} R_{ik}\, \mu_k^{(l)} } = \sum_{i=1}^{N} n_i\, M_{ij}^{(l)} .   (8.36)

Above, \varepsilon_j is given by Eq. (8.16). Usually, the simulation prediction \mu_j^{(0)} = \mu_j^{\mathrm{est}} is taken as initial solution, motivated as prior choice according to a Bayesian interpretation.

Equation (8.36) can be demonstrated to converge to the maximum likelihood solution, but a very large number of iterations is needed before it approaches the asymptotic limit. The method can be stopped after a finite number I of iterations: \hat{\mu}_j = \mu_j^{(I)}. I acts here as regularization parameter, similarly to \tau for the regularized unfolding in Sect. 8.5.
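Equation (8.36) maps directly onto array operations. Below is a minimal, illustrative NumPy sketch (not the book's code); it assumes the response matrix is stored with R[i, j] = P(observed bin i | true bin j), so that the column sums give the efficiencies \varepsilon_j:

```python
import numpy as np

def iterative_unfold(R, n, mu0, iterations):
    """Iterative ('Bayesian') unfolding, Eq. (8.36).

    R[i, j] = P(observed bin i | true bin j); the column sums are the
    efficiencies eps_j (they may be below 1 if events can be lost).
    """
    eps = R.sum(axis=0)            # efficiency, Eq. (8.16)
    mu = np.array(mu0, dtype=float)
    for _ in range(iterations):
        folded = R @ mu            # expected observed spectrum
        # mu_j <- (mu_j / eps_j) * sum_i R_ij * n_i / folded_i
        mu = (mu / eps) * (R.T @ (n / folded))
    return mu
```

A small number of iterations keeps the result close to the prior `mu0`, while many iterations approach the maximum likelihood solution, with its oscillations, as discussed below.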
Equation (8.36) can be motivated using Bayes' theorem as follows. Given a number of 'causes' C_j, j = 1, \cdots, M, the 'effects' E_i, i = 1, \cdots, N, can be

[Figure 8.4]
Fig. 8.4 Top: response matrix R_{ij}; bottom: the original distribution \mu_j is shown in light blue, while the convoluted distribution \nu_i = \sum_j R_{ij} \mu_j is shown in dark blue

[Figure 8.5]
Fig. 8.5 Top: data sample (solid histogram with error bars) randomly generated according to the convoluted distribution (dashed histogram, superimposed); bottom: L-curve scan (black line). The optimal L-curve point is shown as a blue cross

[Figure 8.6]
Fig. 8.6 Unfolded distribution using the Tikhonov regularization, shown as points with error bars, superimposed on the original distribution, shown as a dashed line

related to the causes using Bayes' theorem, which can be expressed with the following formula [7]:

P( C_j \mid E_i ) = \frac{ P( E_i \mid C_j )\, P_0( C_j ) }{ \sum_{k=1}^{M} P( E_i \mid C_k )\, P_0( C_k ) } .   (8.37)

In Eq. (8.37) a cause C_j corresponds to an event generated in the jth bin of the original variable y histogram, and an effect E_i corresponds to an event being observed, after detector effects, in the ith bin of the observable variable x histogram. P( E_i \mid C_j ) has the role of the response matrix R_{ij}, and P_0( C_j ) can be rewritten as P_0( C_j ) = \mu_j^{(0)} / n_{\mathrm{obs}}, where n_{\mathrm{obs}} = \sum_{i=1}^{N} n_i. This allows to rewrite Eq. (8.37) as:

P( C_j \mid E_i ) = \frac{ R_{ij}\, \mu_j^{(0)} }{ \sum_{k=1}^{M} R_{ik}\, \mu_k^{(0)} } .   (8.38)

The observation of an experimental distribution \vec{n} = ( n_1, \cdots, n_N ) is interpreted as the number of occurrences of each of the effects E_i: n_i = n( E_i ). The expected number of events \mu_j^{(1)} assigned to each cause C_j can be determined, taking also into account a term due to finite efficiency, as:

\mu_j^{(1)} = \hat{n}( C_j ) = \sum_{i=1}^{N} \frac{ P( C_j \mid E_i ) }{ \varepsilon_j }\, n( E_i ) = \frac{ \mu_j^{(0)} }{ \varepsilon_j } \sum_{i=1}^{N} \frac{ R_{ij}\, n_i }{ \sum_{k=1}^{M} R_{ik}\, \mu_k^{(0)} } .   (8.39)

Equation (8.39) can be applied iteratively starting from an initial condition \vec{\mu}^{(0)} due to the prior choice in Bayes' theorem, and the general iteration formula can be derived as anticipated in Eq. (8.36).

The evaluation of the covariance matrix may proceed using error propagation in Eq. (5.73). The covariance matrix U^{(l)} of the unfolded estimates \vec{\mu}^{(l)} at a given iteration is given by:

U_{pq}^{(l)} = \sum_{i,j} \frac{ \partial \mu_p^{(l)} }{ \partial n_i } \frac{ \partial \mu_q^{(l)} }{ \partial n_j }\, V_{ij} ,   (8.40)

where V is the covariance matrix of the measured distribution \vec{n}, usually due to independent Poissonian fluctuations, and can be estimated as:

\hat{V}_{ij} = \delta_{ij}\, n_i .   (8.41)

The derivative term in Eq. (8.40) is, at the first iteration (l = 1), equal to [7]:

\frac{ \partial \hat{\mu}_j^{(1)} }{ \partial n_i } = M_{ij}^{(0)} .   (8.42)

For the following iterations, the derivative can be estimated again iteratively according to the following formula [1]^1:

\frac{ \partial \hat{\mu}_j^{(l+1)} }{ \partial n_i } = M_{ij}^{(l)} + \frac{ \mu_j^{(l+1)} }{ \mu_j^{(l)} } \frac{ \partial \mu_j^{(l)} }{ \partial n_i } - \sum_{p,q} \frac{ n_p\, \varepsilon_q }{ \mu_q^{(l)} }\, M_{pj}^{(l)}\, M_{pq}^{(l)} \frac{ \partial \mu_q^{(l)} }{ \partial n_i } .   (8.43)

Figure 8.7 shows the result of iterative unfolding with 100 iterations, which appears relatively smooth, for the same distribution considered in Fig. 8.6. Figure 8.8 shows unfolded distributions after 1000, 10,000 and 100,000 iterations. An oscillating structure, as for the maximum likelihood solution, slowly begins to emerge as the number of iterations increases.

Regularization is achieved in the iterative unfolding method by stopping the procedure after a number I of iterations. I should not be too large, otherwise the solution starts to exhibit oscillations, as seen in Fig. 8.8. On the other hand, if I is small, the solution is biased towards the initial values, like with bin-by-bin corrections (see Sect. 8.4). A prescription for the choice of the regularization parameter I, for iterative unfolding, is not as obvious as for the Tikhonov regularization (see Sect. 8.5.1).

^1 The original derivation in [7] only reported the first term in Eq. (8.43), similarly to the case with l = 0, as in Eq. (8.42). The complete formula was reported in [1]. A revised and improved version of the method proposed in [7] was also later proposed in [8].

[Figure 8.7]
Fig. 8.7 Unfolded distribution using the iterative unfolding, shown as points with error bars, superimposed on the original distribution, shown as a dashed line. 100 iterations have been used

8.6.1 Treatment of Background

In case a background subtraction is needed (b_i \neq 0), two approaches can be used to modify Eq. (8.36). The first is to subtract the background in the numerator:

\mu_j^{(l+1)} = \frac{ \mu_j^{(l)} }{ \varepsilon_j } \sum_{i=1}^{N} \frac{ R_{ij} ( n_i - b_i ) }{ \sum_{k=1}^{M} R_{ik}\, \mu_k^{(l)} } ;   (8.44)

the second is to add it to the folded term in the denominator:

\mu_j^{(l+1)} = \frac{ \mu_j^{(l)} }{ \varepsilon_j } \sum_{i=1}^{N} \frac{ R_{ij}\, n_i }{ \sum_{k=1}^{M} R_{ik}\, \mu_k^{(l)} + b_i } .   (8.45)

The latter choice ensures nonnegative yields: \mu_j^{(l)} \geq 0.
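The two variants amount to a one-line change in each iteration step. A minimal sketch, using the same conventions as the iteration of Eq. (8.36):

```python
import numpy as np

def unfold_step_subtract(R, n, b, mu, eps):
    """One iteration with the background subtracted in the numerator, Eq. (8.44)."""
    folded = R @ mu
    return (mu / eps) * (R.T @ ((n - b) / folded))

def unfold_step_fold_bkg(R, n, b, mu, eps):
    """One iteration with the background added to the folded term, Eq. (8.45).

    Keeps the estimates nonnegative as long as mu and n are nonnegative.
    """
    folded = R @ mu + b
    return (mu / eps) * (R.T @ (n / folded))
```

When a bin fluctuates below its expected background (n_i < b_i), the first variant can produce negative estimates, while the second cannot.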

8.7 Other Unfolding Methods

The list of unfolding methods presented here is not exhaustive, and other unfolding methods are used for physics applications. Among those, the Singular Value Decomposition, or SVD [10], shape-constrained unfolding [12] and a fully Bayesian unfolding method [5] deserve to be mentioned.

[Figure 8.8]
Fig. 8.8 Unfolded distribution as in Fig. 8.7 using the iterative unfolding with 1000 (top), 10,000 (middle) and 100,000 (bottom) iterations

8.8 Software Implementations

An implementation of the Tikhonov regularized unfolding is available with the package TUNFOLD [16], released as part of the ROOT framework [4].

Moreover, the package ROOUNFOLD [2] contains implementations of and interfaces to multiple unfolding methods in the context of the ROOT framework [4]. This package has not yet become, to date, part of the ROOT release.

8.9 Unfolding in More Dimensions

In case of distributions in more than one dimension, the methods described above can be used considering that in the response matrix R_{ij} the indices i and j may each represent a single bin in multiple dimensions. Once this mapping is done, unfolding proceeds similarly to what was described in the previous sections. One caveat is the choice of the matrix L for regularized unfolding, see for instance Eq. (8.29), which may become nonobvious in multiple dimensions, since the L matrix should represent an approximation of a derivative by finite differences. See [16] for a more detailed discussion of this case.
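The mapping of multi-dimensional bins onto a single index can be done, for instance, with row-major ordering. A minimal sketch, with arbitrary bin counts:

```python
import numpy as np

# Map a two-dimensional binning onto a single index so that the usual
# response-matrix formalism applies (illustrative bin counts).
n_bins_x, n_bins_y = 4, 5

def flat_index(ix, iy):
    """Single bin index for the 2D bin (ix, iy), row-major ordering."""
    return ix * n_bins_y + iy

# Example: a 2D histogram of expected yields flattened to a vector
mu_2d = np.arange(n_bins_x * n_bins_y, dtype=float).reshape(n_bins_x, n_bins_y)
mu_flat = mu_2d.ravel()   # same ordering as flat_index
```

With this convention, the response matrix simply connects flattened detector-level bins to flattened particle-level bins, and the one-dimensional machinery applies unchanged.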

References

1. Adye, T.: Corrected error calculation for iterative Bayesian unfolding. http://hepunx.rl.ac.uk/~
adye/software/unfold/bayes_errors.pdf (2011)
2. Adye, T., Claridge, R., Tackmann, K., Wilson, F.: RooUnfold: ROOT unfolding framework.
http://hepunx.rl.ac.uk/~adye/software/unfold/RooUnfold.html (2012)
3. Bracewell, R.: The Fourier Transform and Its Applications, 3rd. edn. McGraw-Hill, New York
(1999)
4. Brun, R., Rademakers, F.: ROOT—an object oriented data analysis framework. Proceedings
AIHENP96 Workshop, Lausanne (1996). Nucl. Instrum. Methods Phys. Res. A 389, 81–86
(1997) http://root.cern.ch/
5. Choudalakis, G.: Fully Bayesian unfolding. arXiv:1201.4612 (2012)
6. Cowan, G.: Statistical Data Analysis. Clarendon Press, Oxford (1998)
7. D’Agostini, G.: A multidimensional unfolding method based on Bayes’ theorem. Nucl.
Instrum. Methods Phys. Res. A 362, 487–498 (1995)
8. D’Agostini, G.: Improved iterative Bayesian unfolding. arXiv:1010.0632 (2010)
9. Hansen, P.C.: The L-curve and Its use in the numerical treatment of inverse problems.
In: Johnston, P. (ed.) Computational Inverse Problems in Electrocardiology. WIT Press,
Southampton (2000)
10. Hoecker, A., Kartvelishvili, V.: SVD approach to data unfolding. Nucl. Instrum. Methods A
372, 469–481 (1996)
11. Kondor, A.: Method of converging weights—an iterative procedure for solving Fredholm’s
integral equations of the first kind. Nucl. Instrum. Methods Phys. Res. 216, 177 (1983)
12. Kuusela, M., Stark, P.B.: Shape-constrained uncertainty quantification in unfolding steeply
falling elementary particle spectra. arXiv:1512.00905 (2015)

13. Mülthei, H.N., Schorr, B.: On an iterative method for the unfolding of spectra. Nucl. Instrum.
Methods Phys. Res. A 257, 371 (1983)
14. Mülthei, H.N., Schorr, B.: On an iterative method for a class of integral equations of the first
kind. Math. Methods Appl. Sci. 9, 137 (1987)
15. Phillips, D. L.: A technique for the numerical solution of certain integral equations of the first
kind. J. ACM 9, 84 (1962)
16. Schmitt, S.: TUnfold: an algorithm for correcting migration effects in high energy physics. J.
Instrum. 7, T10003 (2012)
17. Shepp, L., Vardi, Y.: Maximum likelihood reconstruction for emission tomography. IEEE
Trans. Med. Imaging 1, 113–122 (1982)
18. Tikhonov, A. N.: On the solution of improperly posed problems and the method of regulariza-
tion. Sov. Math. 5, 1035 (1963)
19. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. John Wiley, New York (1977)
Chapter 9
Hypothesis Tests

9.1 Introduction

A key task in most physics measurements is to discriminate between two or more hypotheses on the basis of the observed experimental data.
A typical example is the determination of a particle type in a detector with
particle identification capabilities, based, for instance, on the depth of penetration in
an iron absorber, the energy released in scintillator crystals or information provided
by a Cherenkov detector.
Another example is to determine whether a sample of events is composed of background only or contains a mixture of background plus signal events. This may allow ascertaining the presence of a new signal, which may lead to a discovery.
This problem in statistics is known as hypothesis test, and methods have been
developed to assign an observation, which consists of the measurements of specific
discriminating variables, to one of two or more hypothetical models, considering the
predicted probability distributions of the observed quantities under the considered
hypotheses.

9.2 Test Statistic

In statistical literature, when two hypotheses are present, these are usually called
null hypothesis, H0 , and alternative hypothesis, H1 .
Assume that the observed data sample consists of measurements of the variables \vec{x} = ( x_1, \cdots, x_n ), randomly distributed according to some probability density function f(\vec{x}), which is in general different under the hypotheses H_0 and H_1: f(\vec{x}) = f(\vec{x} \mid H_0) or f(\vec{x}) = f(\vec{x} \mid H_1), according to which of H_0 or H_1 is true.
The goal of a test is to determine whether the observed data sample better agrees with H_0 or rather with H_1. Instead of using all the n available variables, ( x_1, \cdots, x_n ),

© Springer International Publishing AG 2017
L. Lista, Statistical Methods for Data Analysis in Particle Physics,
Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0_9

[Figure 9.1]
Fig. 9.1 Probability distribution functions for a discriminating variable t which has two different PDFs for the signal (blue) and background (red) hypotheses under test

whose PDF f(\vec{x}) may have complex features, usually a test proceeds by determining the value of a function of the measured sample \vec{x}, t = t(\vec{x}), which summarizes the information contained in the data sample. t is called test statistic, and its PDF, under each of the considered hypotheses, can be derived from the PDF f(\vec{x}) of the observable quantities \vec{x}.

A simple example of test statistic is a single variable x which has discriminating power between two hypotheses, say signal = 'muon' versus background = 'pion', as shown in Fig. 9.1, where t = x. A good separation of the two cases can be achieved if the PDFs of x under the hypotheses H_1 = signal and H_0 = background are appreciably different. Note that this is a conventional choice, and opposite choices are also adopted in some literature.

If the observed value of the discriminating variable x is \hat{x}, the simplest test statistic can be defined as the measured value itself:

\hat{t} = t(\hat{x}) = \hat{x} .   (9.1)

A selection requirement (in physics jargon sometimes called cut) can be defined by identifying a particle as a muon if \hat{t} \leq t_{\mathrm{cut}} or as a pion if \hat{t} > t_{\mathrm{cut}}, where the value t_{\mathrm{cut}} is chosen a priori; the direction of the required inequality is motivated, in this specific example, by the fact that background tends to have larger values of x than signal.

Not all real muons are correctly identified as muons according to this selection criterion, just as not all real pions are correctly identified as pions. The expected fraction of selected signal particles (muons) is usually called signal selection efficiency, and the expected fraction of selected background particles (pions) is called misidentification probability.

Misidentified particles constitute a background to positively identified signal particles. Applying the required selection, in this case t \leq t_{\mathrm{cut}}, to a data sample containing different detected particles, the sample of selected particles will be enriched in signal and will have a reduced fraction of background, compared to the original unselected sample. This is true if we assume that the selection efficiency

is larger than the misidentification probability, as achieved with the selection in Fig. 9.1, thanks to the separation of the two PDFs.
This case was also discussed using Bayes' theorem in Example 3.9.

9.3 Type I and Type II Errors

Statistical literature defines the significance level \alpha as the probability to reject the hypothesis H_0 if H_0 is true. Rejecting H_0 if it is true is called error of the first kind, or type-I error. In our example, this means selecting a particle as a muon when it is a pion. The background misidentification probability corresponds to \alpha.

The probability \beta to reject the hypothesis H_1 if it is true (error of the second kind, or type-II error) is equal to one minus the signal efficiency, i.e. the probability to incorrectly identify a muon as a pion, in our example.

By varying the value of the selection cut t_{\mathrm{cut}}, different values of the selection efficiency ( 1 - \beta ) and of the misidentification probability ( \alpha ) are determined. A typical curve representing the signal efficiency versus the misidentification probability obtained by varying the selection requirement is shown in Fig. 9.2, which is also called receiver operating characteristic, or ROC curve.

A good selection should have a low misidentification probability corresponding to a large selection efficiency. But clearly the background rejection can't be perfect (i.e. the misidentification probability can't drop to zero) if the distributions f(x \mid H_0) and f(x \mid H_1) overlap, as in Fig. 9.1.
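The ROC curve can be traced by scanning t_{\mathrm{cut}} over two toy samples. The sketch below is illustrative, with arbitrary Gaussian PDFs; following the convention above, signal peaks at lower values of t than background:

```python
import numpy as np

# Toy example: signal and background described by two Gaussians in the
# discriminating variable t (means and widths are arbitrary choices).
rng = np.random.default_rng(seed=3)
t_sig = rng.normal(0.0, 1.0, size=100_000)   # signal: lower values of t
t_bkg = rng.normal(2.0, 1.0, size=100_000)   # background: higher values

# Scan the cut t <= t_cut: efficiency 1 - beta and mis-id probability alpha
cuts = np.linspace(-3.0, 5.0, 200)
eff = np.array([(t_sig <= c).mean() for c in cuts])     # 1 - beta
misid = np.array([(t_bkg <= c).mean() for c in cuts])   # alpha

# With overlapping PDFs the selection keeps more signal than background,
# but the mis-id probability never drops to zero at nonzero efficiency
assert (eff >= misid).all()
```

Plotting `eff` versus `misid` gives the ROC curve of Fig. 9.2; tightening the cut moves the working point towards lower misidentification probability at the price of lower efficiency.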

[Figure 9.2]
Fig. 9.2 Signal efficiency versus background misidentification probability (receiver operating characteristic, or ROC curve)

[Figure 9.3]
Fig. 9.3 Examples of two-dimensional selections of a signal (blue dots) against a background (red dots). A linear cut is chosen in the left plot, while a box cut is chosen in the right plot

More complex examples of cut-based selections involve multiple variables, where selection requirements in multiple dimensions can be defined as regions in the discriminating-variable space. Events are accepted as 'signal' or as 'background' depending on whether they fall inside or outside the selection region. Finding an optimal selection in multiple dimensions is usually not a trivial task.

Two simple examples of selections with very different performances in terms of efficiency and misidentification probability are shown in Fig. 9.3.

The problem of classifying data according to the value of multiple variables is called multivariate analysis (MVA) and will be discussed in the following sections.

9.4 Fisher’s Linear Discriminant

A test statistic that allows discriminating two samples in more dimensions using a linear combination of n discriminating variables is due to Fisher [1]. The optimal separation between one-dimensional projections of two random variable sets is achieved by maximizing the squared difference of the means of the two PDFs when projected on a given axis \vec{w}, divided by the corresponding variances, added in quadrature.

The Fisher discriminant can be defined as:

J(\vec{w}) = \frac{ ( \mu_1 - \mu_2 )^2 }{ \sigma_1^2 + \sigma_2^2 } ,   (9.2)

where \mu_1 and \mu_2 are the averages and \sigma_1 and \sigma_2 are the standard deviations of the two PDF projections, which depend on the projection direction \vec{w}.

An example of Fisher projections along two different possible projection lines is shown in Fig. 9.4 for a two-dimensional case. Different levels of separation of

[Figure 9.4]
Fig. 9.4 Example of projections of two-dimensional distributions (plots on the top left and right) along two different lines. The red and blue distributions are projected on the black dashed lines; the normal to that line is shown as a green line with an arrow. The bottom plots show the one-dimensional projections of the corresponding top plots

the two distributions can be achieved by changing the projection line. The goal of the Fisher discriminant is to find the projection direction \vec{w} that achieves the optimal separation.

The projected average difference is:

\mu_1 - \mu_2 = \vec{w}^{\,T} ( \vec{m}_1 - \vec{m}_2 ) ,   (9.3)

where \vec{m}_1 and \vec{m}_2 are the n-dimensional averages of the two samples. The square of Eq. (9.3) gives the numerator in Eq. (9.2):

( \mu_1 - \mu_2 )^2 = \vec{w}^{\,T} ( \vec{m}_1 - \vec{m}_2 ) ( \vec{m}_1 - \vec{m}_2 )^T \vec{w} = \vec{w}^{\,T} S_B\, \vec{w} ,   (9.4)

where the between-classes scatter matrix S_B is defined as:

S_B = ( \vec{m}_1 - \vec{m}_2 ) ( \vec{m}_1 - \vec{m}_2 )^T .   (9.5)

The projections along \vec{w} of the two n \times n covariance matrices S_1 and S_2 give the variances:

\sigma_1^2 = \vec{w}^{\,T} S_1\, \vec{w} ,   (9.6)
\sigma_2^2 = \vec{w}^{\,T} S_2\, \vec{w} ,   (9.7)

whose sum is the denominator in Eq. (9.2):

\sigma_1^2 + \sigma_2^2 = \vec{w}^{\,T} ( S_1 + S_2 )\, \vec{w} = \vec{w}^{\,T} S_W\, \vec{w} ,   (9.8)

where the within-classes scatter matrix S_W is defined as:

S_W = S_1 + S_2 .   (9.9)

The Fisher discriminant in Eq. (9.2) can be written as:

J(\vec{w}) = \frac{ \left[ \vec{w}^{\,T} ( \vec{m}_1 - \vec{m}_2 ) \right]^2 }{ \vec{w}^{\,T} ( S_1 + S_2 )\, \vec{w} } = \frac{ \vec{w}^{\,T} S_B\, \vec{w} }{ \vec{w}^{\,T} S_W\, \vec{w} } .   (9.10)

The problem of finding the vector \vec{w} that maximizes J(\vec{w}) can be solved by computing the derivatives of J(\vec{w}) with respect to the components w_i of \vec{w}, or equivalently by solving the following eigenvalue equation:

S_B\, \vec{w} = \lambda\, S_W\, \vec{w} ,   (9.11)

i.e.:

S_W^{-1} S_B\, \vec{w} = \lambda\, \vec{w} ,   (9.12)

which leads to the solution:

\vec{w} = S_W^{-1} S_B\, ( \vec{m}_1 - \vec{m}_2 ) .   (9.13)

A practical way to find Fisher's discriminant is to provide two training samples with a sufficiently large number of entries in order to represent approximately the two PDFs. Training samples can be generated with a Monte Carlo according to the two known PDFs. The averages and covariance matrices determined from the training samples can be used with Eq. (9.13) in order to find the direction \vec{w} that maximizes Fisher's discriminant.
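A minimal NumPy sketch of this training procedure follows; it is illustrative only, with arbitrary Gaussian training samples. Since S_B (\vec{m}_1 - \vec{m}_2) is proportional to \vec{m}_1 - \vec{m}_2, the direction of Eq. (9.13) can be computed as S_W^{-1} (\vec{m}_1 - \vec{m}_2); the overall scale of \vec{w} is irrelevant for the separation:

```python
import numpy as np

def fisher_direction(sample1, sample2):
    """Fisher projection direction, proportional to Eq. (9.13).

    sample1, sample2: arrays of shape (n_events, n_variables).
    """
    m1 = sample1.mean(axis=0)
    m2 = sample2.mean(axis=0)
    S_W = np.cov(sample1, rowvar=False) + np.cov(sample2, rowvar=False)
    return np.linalg.solve(S_W, m1 - m2)

# Toy training samples: two correlated 2D Gaussians
rng = np.random.default_rng(seed=7)
cov = [[1.0, 0.6], [0.6, 1.0]]
s1 = rng.multivariate_normal([1.0, 0.0], cov, size=50_000)
s2 = rng.multivariate_normal([-1.0, 0.0], cov, size=50_000)

w = fisher_direction(s1, s2)
t1, t2 = s1 @ w, s2 @ w          # projected one-dimensional samples
```

Evaluating Eq. (9.2) on the projections `t1`, `t2` gives a larger J than the projection on either coordinate axis, illustrating the gain from using the correlations through S_W.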

9.5 The Neyman–Pearson Lemma

The performance of a selection criterion can be considered optimal if it achieves the smallest misidentification probability for a desired value of the selection efficiency. According to the Neyman–Pearson lemma [2], the optimal test statistic, in this sense, is given by the ratio of the likelihood functions L(\vec{x} \mid H_1) and L(\vec{x} \mid H_0) evaluated for the observed data sample \vec{x} under the two hypotheses H_1 and H_0:

\lambda(\vec{x}) = \frac{ L(\vec{x} \mid H_1) }{ L(\vec{x} \mid H_0) } .   (9.14)

The test is optimal in the sense that, for a fixed background misidentification probability \alpha, the selection that corresponds to the largest possible signal selection efficiency 1 - \beta is given by:

\lambda(\vec{x}) = \frac{ L(\vec{x} \mid H_1) }{ L(\vec{x} \mid H_0) } \geq k_\alpha ,   (9.15)

where, by varying the value of the 'cut' k_\alpha, the required value of \alpha may be achieved. This corresponds to choosing a point on the ROC curve (see Fig. 9.2) such that \alpha corresponds to the required misidentification probability.

The Neyman–Pearson lemma provides the selection that achieves the optimal performance only if the joint multidimensional PDFs that characterize the problem are known. In many realistic cases, however, it is not easy to determine the correct model for multidimensional PDFs, and approximate solutions may be adopted. Numerical methods and algorithms exist to find selections in the variable space whose performance, in terms of efficiency and misidentification probability, is close to the optimal limit given by the Neyman–Pearson lemma. Among those approximate methods, machine-learning algorithms, such as artificial neural networks and boosted decision trees, are widely used in high energy physics, and will be introduced in Sect. 9.10.
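As a minimal illustration of Eq. (9.14), not taken from the text, consider two fully specified Gaussian hypotheses in one dimension. For equal widths the likelihood ratio is monotonic in x, so the Neyman–Pearson cut \lambda(x) \geq k_\alpha reduces to the simple cut on x discussed in Sect. 9.2:

```python
import numpy as np

def gauss_pdf(x, mean, sigma):
    """Normal PDF, used as the likelihood of a single observation x."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def likelihood_ratio(x):
    """lambda(x) = L(x | H1) / L(x | H0) for H1: N(1, 1) and H0: N(0, 1)."""
    return gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)

# For equal-width Gaussians, lambda(x) = exp(x - 1/2) increases with x,
# so a cut on lambda is equivalent to a cut on x itself.
x = np.linspace(-5.0, 5.0, 101)
lam = likelihood_ratio(x)
assert (np.diff(lam) > 0).all()
```

When the two PDFs have different widths the ratio is no longer monotonic, and the Neyman–Pearson selection is no longer a single-sided cut on x.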

9.6 Projective Likelihood Ratio Discriminant

If the variables x1 ;    ; xn that characterize our problem are independent, the


likelihood function can be factorized into the product of one-dimensional marginal
PDFs:
Qn
L.x1 ;    ; xn j H1 / fi .xi j H1 /
.x1 ;    ; xn / D D QiD1
n : (9.16)
L.x1 ;    ; xn j H0 / iD1 fi .xi j H0 /
182 9 Hypothesis Tests

If this factorization holds, optimal performances are achieved, according to the


Neyman–Pearson lemma.
Even if it is not possible to factorize the PDFs into the product of one-
dimensional marginal PDFs, i.e. if the variables are not independent, the test statistic
inspired by Eq. (9.16) can be used as discriminant using the marginal PDFs fi for the
individual variables. This is called projective likelihood ratio:
Qn
fi .xi j H1 /
.x1 ;    ; xn / D QiD1
n : (9.17)
iD1 fi .xi j H0 /

If the PDFs cannot be exactly factorized, however, the test statistic defined in
Eq. (9.17) differs from the exact likelihood ratio in Eq. (9.14) and yields worse
performance in terms of $\alpha$ and $\beta$ than the best possible performance
guaranteed by the Neyman–Pearson lemma.
In some cases, nonetheless, the simplicity of this method can justify its application in
spite of the suboptimal performance. The marginal PDFs $f_i$ can be obtained from
Monte Carlo training samples with a large number of entries: histograms of the
distributions of the individual variables $x_i$ reproduce, to a good approximation,
the marginal PDFs.
Some numerical implementations apply the projective likelihood-ratio discriminant
after a suitable rotation in variable space that reduces or eliminates the
correlation among variables by diagonalizing the covariance matrix. This
improves the performance of the method but does not necessarily reach the
optimal performance of the Neyman–Pearson lemma: uncorrelated variables are
not necessarily independent (see for instance Example 2.7), and the PDF can be
factorized only for independent variables (see Sect. 2.15.1).
The alternative, approximating the full multidimensional PDF as a binned histogram in
multiple dimensions, may instead become intractable, since the size of the training
samples should increase as the number of bins to the power of the number of dimensions.
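As a concrete illustration, the projective likelihood ratio of Eq. (9.17) can be sketched in a few lines of Python, with the marginal PDFs approximated by density-normalized histograms of training samples. The Gaussian toy samples and all function names below are illustrative assumptions, not part of the text:

```python
import numpy as np

def fit_marginals(sample, bins=40):
    """Approximate the marginal PDF of each variable by a
    density-normalized histogram of the training sample."""
    return [np.histogram(sample[:, j], bins=bins, density=True)
            for j in range(sample.shape[1])]

def projective_likelihood_ratio(x, marginals_h1, marginals_h0, eps=1e-12):
    """Product over variables of f_i(x_i|H1) / f_i(x_i|H0), Eq. (9.17)."""
    ratio = 1.0
    for j, ((p1, e1), (p0, e0)) in enumerate(zip(marginals_h1, marginals_h0)):
        b1 = np.clip(np.searchsorted(e1, x[j]) - 1, 0, len(p1) - 1)
        b0 = np.clip(np.searchsorted(e0, x[j]) - 1, 0, len(p0) - 1)
        ratio *= (p1[b1] + eps) / (p0[b0] + eps)
    return ratio

# Toy training samples: two independent Gaussian variables per hypothesis
rng = np.random.default_rng(1)
sig = rng.normal(+1.0, 1.0, size=(20000, 2))   # H1 (signal) training sample
bkg = rng.normal(-1.0, 1.0, size=(20000, 2))   # H0 (background) training sample
m1, m0 = fit_marginals(sig), fit_marginals(bkg)

# A signal-like point gives a ratio well above 1, a background-like one below
print(projective_likelihood_ratio(np.array([+1.0, +1.0]), m1, m0))
print(projective_likelihood_ratio(np.array([-1.0, -1.0]), m1, m0))
```

A cut on this ratio (or on its logarithm) then implements the selection, exactly as for any other test statistic.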

9.7 Kolmogorov–Smirnov Test

A test due to Kolmogorov [3], Smirnov [4] and Chakravarti [5] can be used to assess
the hypothesis that a data sample is compatible with a given distribution. Consider a
sample $(x_1,\dots,x_n)$, ordered by increasing values of $x$, that has to be compared with
a distribution $f(x)$, assumed to be a continuous function. The discrete cumulative
distribution of the sample can be defined as:

$$
F_n(x) = \frac{1}{n}\sum_{i=1}^{n}\theta(x - x_i)\,, \qquad (9.18)
$$

where the step function $\theta$ is defined as:

$$
\theta(x) = \begin{cases} 1 & \text{if } x \ge 0\,,\\ 0 & \text{if } x < 0\,. \end{cases} \qquad (9.19)
$$

The distribution $F_n(x)$ can be compared with the cumulative distribution of $f(x)$,
defined as:

$$
F(x) = \int_{-\infty}^{x} f(x')\,dx'\,. \qquad (9.20)
$$

The maximum distance between the two cumulative distributions $F_n(x)$ and $F(x)$
is used to quantify the agreement of the data sample $(x_1,\dots,x_n)$ with $f(x)$:

$$
D_n = \sup_x \,\lvert F_n(x) - F(x)\rvert\,. \qquad (9.21)
$$

The definitions of $F_n(x)$, $F(x)$ and $D_n$ are visualized in Fig. 9.5.
For large $n$, $D_n$ converges to zero in probability. The distribution of the test
statistic $K = \sqrt{n}\,D_n$, in the hypothesis that the sample $(x_1,\dots,x_n)$ is distributed
according to $f(x)$, does not depend on $f(x)$. The probability that $K$ is less than or
equal to a given value $k$ is given by Marsaglia et al. [6], and it defines the Kolmogorov
distribution:

$$
P(K \le k) = 1 - 2\sum_{i=1}^{\infty}(-1)^{i-1}e^{-2i^2k^2}
= \frac{\sqrt{2\pi}}{k}\sum_{i=1}^{\infty}e^{-(2i-1)^2\pi^2/8k^2}\,. \qquad (9.22)
$$
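As a numerical check (an illustration, not from the text), the two series in Eq. (9.22) can be evaluated directly and compared with the Kolmogorov survival function available in SciPy:

```python
import numpy as np
from scipy.special import kolmogorov  # survival function P(K > k)

def kolmogorov_cdf(k, terms=100):
    """P(K <= k) from the alternating series in Eq. (9.22)."""
    i = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (i - 1) * np.exp(-2.0 * i**2 * k**2))

def kolmogorov_cdf_alt(k, terms=100):
    """P(K <= k) from the second form of Eq. (9.22)."""
    i = np.arange(1, terms + 1)
    return np.sqrt(2.0 * np.pi) / k * np.sum(
        np.exp(-(2.0 * i - 1.0) ** 2 * np.pi**2 / (8.0 * k**2)))

# Both series and the SciPy implementation agree to numerical precision
for k in (0.5, 1.0, 1.5):
    print(k, kolmogorov_cdf(k), kolmogorov_cdf_alt(k), 1.0 - kolmogorov(k))
```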

Fig. 9.5 Graphical representation of the Kolmogorov–Smirnov test: the empirical
cumulative distribution $F_n(x)$, stepping up at $x_1, x_2,\dots,x_n$, is compared with the
cumulative distribution $F(x)$, and $D_n$ marks their maximum distance



It is important to notice that the Kolmogorov–Smirnov test is a non-parametric test:
if some parameters of the distribution $f(x)$ are determined (i.e., fitted) from the data
sample $(x_1,\dots,x_n)$, then the test cannot be applied in this form.
A pragmatic solution to this problem, in case parameters of $f(x)$ have been
estimated from data, is to still use the test statistic $K$, but without relying on the
expression in Eq. (9.22). In those cases, the distribution of $K$ for the specific problem
should be determined empirically with Monte Carlo. This is implemented, for
instance, as an optional method in the Kolmogorov–Smirnov test provided by the ROOT
framework [7].
The Kolmogorov–Smirnov test can also be used to compare two samples, say
$(x_1,\dots,x_n)$ and $(y_1,\dots,y_m)$, and assess the hypothesis that both come from the
same distribution. In this case, the maximum distance:

$$
D_{n,m} = \sup_x\,\lvert F_n(x) - F_m(x)\rvert \qquad (9.23)
$$

asymptotically converges to zero if $n$ and $m$ are sufficiently large, and the following
test statistic asymptotically follows the Kolmogorov distribution according to Eq. (9.22):

$$
\sqrt{\frac{nm}{n+m}}\,D_{n,m}\,. \qquad (9.24)
$$

Alternative tests to Kolmogorov–Smirnov are due to Stephens [8], Anderson and
Darling [9], and Cramér [10] and von Mises [11].
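In practice, both the one-sample and the two-sample tests are available in common statistical libraries. A minimal sketch with SciPy follows; the Gaussian toy data are an assumption for illustration, and note that the reference distribution in the one-sample call must be fully specified, not fitted to the same data:

```python
import numpy as np
from scipy.stats import kstest, ks_2samp

rng = np.random.default_rng(42)
x = rng.normal(size=500)

# One-sample test, Eq. (9.21): compare against a fully specified f(x)
d, p = kstest(x, 'norm')
print(f"D_n = {d:.3f}, p-value = {p:.3f}")

# Two-sample test, Eqs. (9.23)-(9.24)
y = rng.normal(size=800)
d2, p2 = ks_2samp(x, y)
print(f"D_n,m = {d2:.3f}, p-value = {p2:.3f}")
```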

9.8 Wilks’ Theorem

When a large number of measurements is available, Wilks' theorem provides
an approximate asymptotic expression for a test statistic based on a likelihood ratio
inspired by the Neyman–Pearson lemma (Eq. (9.14)).
Assume that two hypotheses $H_1$ and $H_0$ can be defined in terms of a set of
parameters $\vec\theta = (\theta_1,\dots,\theta_m)$ that appear in the definition of the likelihood function.
The condition that $H_1$ is true can be expressed as $\vec\theta \in \Theta_1$, while the condition that
$H_0$ is true can be expressed as $\vec\theta \in \Theta_0$. Let us also assume that $H_0$ and $H_1$ are nested
hypotheses, i.e., $\Theta_0 \subseteq \Theta_1$.
Given a data sample made of independent measurements $(\vec x_1,\dots,\vec x_N)$, Wilks'
theorem [12] ensures, assuming some regularity conditions of the likelihood
function, that the quantity:

$$
\chi^2_r = -2\log\frac{\displaystyle\sup_{\vec\theta\in\Theta_0}\prod_{i=1}^{N}L(\vec x_i;\vec\theta\,)}{\displaystyle\sup_{\vec\theta\in\Theta_1}\prod_{i=1}^{N}L(\vec x_i;\vec\theta\,)} \qquad (9.25)
$$

has a distribution that can be approximated, for $N \to \infty$, and if $H_0$ is true, with a $\chi^2$
distribution having a number of degrees of freedom equal to the difference between
the dimensionality of the set $\Theta_1$ and the dimensionality of the set $\Theta_0$.
As a more specific example, assume that $\mu$ is the only parameter of interest
and the remaining parameters, $\vec\theta = (\theta_1,\dots,\theta_m)$, are $m$ nuisance parameters (see
Sect. 5.4). For instance, $\mu$ could be the ratio of a signal cross section to its theoretical
value (signal strength, see also Sect. 9.9).
Taking as $H_0$ the hypothesis $\mu = \mu_0$, while $H_1$ is the hypothesis that $\mu$ may have
any possible value greater than or equal to zero, Wilks' theorem ensures that:

$$
\chi^2_r(\mu_0) = -2\log\frac{\displaystyle\sup_{\vec\theta}\prod_{i=1}^{N}L(\vec x_i;\mu_0,\vec\theta\,)}{\displaystyle\sup_{\mu,\vec\theta}\prod_{i=1}^{N}L(\vec x_i;\mu,\vec\theta\,)} \qquad (9.26)
$$

is asymptotically distributed as a $\chi^2$ with one degree of freedom.


The denominator in Eq. (9.26) is the likelihood function evaluated at the parameter
values $\mu = \hat\mu$ and $\vec\theta = \hat{\vec\theta}$ that maximize it:

$$
\sup_{\mu,\vec\theta}\prod_{i=1}^{N}L(\vec x_i;\mu,\vec\theta\,) = \prod_{i=1}^{N}L(\vec x_i;\hat\mu,\hat{\vec\theta}\,)\,.
$$

In the numerator, only the nuisance parameters $\vec\theta$ are fit, and $\mu$ is fixed to the
constant value $\mu = \mu_0$. Taking as $\vec\theta = \hat{\hat{\vec\theta}}(\mu_0)$ the values of $\vec\theta$ that maximize the
likelihood function for a fixed $\mu = \mu_0$, Eq. (9.26) can be written as:

$$
\chi^2_r(\mu_0) = -2\log\frac{L(\vec x \mid \mu_0, \hat{\hat{\vec\theta}}(\mu_0))}{L(\vec x \mid \hat\mu, \hat{\vec\theta}\,)}\,. \qquad (9.27)
$$

This test statistic is known as the profile likelihood, and its application to the
determination of upper limits will be discussed in Sect. 10.11.
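A quick numerical illustration of Wilks' theorem (a toy sketch, not from the text): for a Gaussian sample with known $\sigma$ and the nested hypotheses $\mu = \mu_0$ versus free $\mu$, the test statistic of Eq. (9.25) has the closed form $N(\bar x - \mu_0)^2/\sigma^2$, and under $H_0$ it should follow a $\chi^2$ distribution with one degree of freedom:

```python
import numpy as np
from scipy.stats import chi2

# Toy check of Wilks' theorem for a Gaussian mean (sigma known, mu0 = 0):
# -2 log lambda reduces to N * xbar^2 / sigma^2 in this simple case.
rng = np.random.default_rng(7)
n, sigma, n_toys = 100, 1.0, 5000
t = np.array([n * rng.normal(0.0, sigma, n).mean() ** 2 / sigma**2
              for _ in range(n_toys)])

# Under H0 the test statistic should follow a chi2 with 1 degree of freedom,
# so about half of the toys should fall below the chi2(1) median:
median = chi2.ppf(0.5, df=1)
print("fraction below chi2(1) median:", np.mean(t < median))  # ~0.5
```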

9.9 Likelihood Ratio in the Search for a New Signal

In the previous section the likelihood function was considered for a set of independent
observations $(\vec x_1,\dots,\vec x_N)$ with parameters $\vec\theta$:

$$
L(\vec x_1,\dots,\vec x_N;\vec\theta\,) = \prod_{i=1}^{N} f(\vec x_i;\vec\theta\,)\,. \qquad (9.28)
$$

Two hypotheses $H_1$ and $H_0$ are represented as two possible sets of values $\Theta_1$ and
$\Theta_0$ of the parameters $\vec\theta$.
Usually the number of events $N$ can also be used as information, introducing the
extended likelihood function (see Sect. 5.10.2):

$$
L(\vec x_1,\dots,\vec x_N;\vec\theta\,) = \frac{e^{-\nu(\vec\theta)}\,\nu(\vec\theta\,)^N}{N!}\prod_{i=1}^{N} f(\vec x_i;\vec\theta\,)\,, \qquad (9.29)
$$

where in the Poissonian term the expected number of events $\nu$ may also depend on
the parameters $\vec\theta$: $\nu = \nu(\vec\theta\,)$.
Typically, we want to discriminate between two hypotheses: $H_1$ represents the
presence of both signal and background, i.e., $\nu = \mu s + b$, while $H_0$ represents
the presence of only background events in our sample, i.e., $\nu = b$, or equivalently
$\mu = 0$. Above, the multiplier $\mu$, called signal strength, typical of many data analyses
performed at the Large Hadron Collider, was introduced, assuming that the expected
signal yield from theory is $s$. All possible values of the expected signal yield are
obtained by varying $\mu$, with $\mu = 1$ corresponding to the theory prediction.
The PDF $f(\vec x_i;\vec\theta\,)$ can be written as the superposition of two components, one PDF
for the signal and another for the background, weighted by the expected signal and
background fractions, respectively:

$$
f(\vec x;\vec\theta\,) = \frac{\mu s}{\mu s + b}\,f_s(\vec x;\vec\theta\,) + \frac{b}{\mu s + b}\,f_b(\vec x;\vec\theta\,)\,. \qquad (9.30)
$$

In this case, the extended likelihood function in Eq. (9.29) becomes similar to
Eq. (5.27):

$$
L_{s+b}(\vec x_1,\dots,\vec x_N;\mu,\vec\theta\,) = \frac{e^{-(\mu s(\vec\theta)+b(\vec\theta))}}{N!}\prod_{i=1}^{N}\left[\mu s\,f_s(\vec x_i;\vec\theta\,) + b\,f_b(\vec x_i;\vec\theta\,)\right]. \qquad (9.31)
$$

Note that $s$ and $b$ may also depend on the unknown parameters $\vec\theta$:

$$
s = s(\vec\theta\,)\,, \qquad (9.32)
$$
$$
b = b(\vec\theta\,)\,. \qquad (9.33)
$$

For instance, in a search for the Higgs boson, the theoretical cross section may
depend on the Higgs boson's mass. Also the PDF for the signal $f_s$, which represents
a resonance peak, depends on the Higgs boson's mass.

Under the hypothesis $H_0$ (i.e., $\mu = 0$, background only), the likelihood function
can be written as:

$$
L_b(\vec x_1,\dots,\vec x_N;\vec\theta\,) = \frac{e^{-b(\vec\theta)}}{N!}\prod_{i=1}^{N} b\,f_b(\vec x_i;\vec\theta\,)\,. \qquad (9.34)
$$

The term $1/N!$ cancels when taking the likelihood ratio in Eq. (9.14), which
becomes:

$$
\lambda(\mu,\vec\theta\,) = \frac{L_{s+b}(\vec x_1,\dots,\vec x_N;\mu,\vec\theta\,)}{L_b(\vec x_1,\dots,\vec x_N;\vec\theta\,)}
= \frac{e^{-(\mu s(\vec\theta)+b(\vec\theta))}}{e^{-b(\vec\theta)}}\prod_{i=1}^{N}\frac{\mu s\,f_s(\vec x_i;\vec\theta\,) + b\,f_b(\vec x_i;\vec\theta\,)}{b\,f_b(\vec x_i;\vec\theta\,)}
= e^{-\mu s(\vec\theta)}\prod_{i=1}^{N}\left(\frac{\mu s\,f_s(\vec x_i;\vec\theta\,)}{b\,f_b(\vec x_i;\vec\theta\,)} + 1\right). \qquad (9.35)
$$

The negative logarithm of the likelihood ratio is:

$$
-\log\lambda(\mu,\vec\theta\,) = \mu s(\vec\theta\,) - \sum_{i=1}^{N}\log\left(\frac{\mu s\,f_s(\vec x_i;\vec\theta\,)}{b\,f_b(\vec x_i;\vec\theta\,)} + 1\right). \qquad (9.36)
$$

In the case of a simple event counting experiment, the likelihood function only
accounts for the Poissonian probability term, which depends only on the number
of observed events $N$, and the dependence on the parameters $\vec\theta$ only appears in the
expected signal and background yields. The likelihood ratio that defines the test
statistic is, in that case:

$$
\lambda(\vec\theta\,) = \frac{L_{s+b}(N;\vec\theta\,)}{L_b(N;\vec\theta\,)}\,, \qquad (9.37)
$$

where $L_{s+b}$ and $L_b$ are Poissonian probabilities for $N$ corresponding to expected
averages equal to $\mu s + b$ and $b$, respectively. $\lambda(\vec\theta\,)$ can be written as:

$$
\lambda(\vec\theta\,) = \frac{e^{-(\mu s(\vec\theta)+b(\vec\theta))}\,\dfrac{(\mu s(\vec\theta\,)+b(\vec\theta\,))^N}{N!}}{e^{-b(\vec\theta)}\,\dfrac{b(\vec\theta\,)^N}{N!}}
= e^{-\mu s(\vec\theta)}\left(\frac{\mu s(\vec\theta\,)}{b(\vec\theta\,)} + 1\right)^{N}. \qquad (9.38)
$$

The negative logarithm of the above expression is:

$$
-\log\lambda(\vec\theta\,) = \mu s(\vec\theta\,) - N\log\left(\frac{\mu s(\vec\theta\,)}{b(\vec\theta\,)} + 1\right), \qquad (9.39)
$$

which is a simplified version of Eq. (9.36), where the terms $f_s$ and $f_b$ have been
dropped.
Equations (9.36) and (9.39) are used to determine upper limits in searches for a
new signal, as will be discussed in Chap. 10.
Note that the hypotheses $b$ and $s+b$ are nested ($b$ is a particular case of $s+b$ with
$\mu = 0$), but the likelihood ratios that define the test statistics in Eqs. (9.36)
and (9.39) have $s+b$ in the numerator and $b$ in the denominator, which is the
inverse of the convention used for the likelihood ratio introduced in Eq. (9.25). Wilks'
theorem applies to those cases as well, with an extra minus sign in the definition
of the test statistic.
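The counting-experiment test statistic of Eq. (9.39) is simple enough to evaluate directly; a small sketch follows, where the numerical yields are illustrative assumptions:

```python
import numpy as np

def neg_log_lambda(N, s, b, mu=1.0):
    """-log lambda for a counting experiment, Eq. (9.39)."""
    return mu * s - N * np.log(mu * s / b + 1.0)

# Hypothetical yields: s = 5 expected signal, b = 10 expected background.
# An excess over the background-only expectation favors the s+b hypothesis,
# corresponding to a negative value of -log lambda (i.e. lambda > 1):
print(neg_log_lambda(N=15, s=5.0, b=10.0))   # < 0: s+b favored
print(neg_log_lambda(N=8,  s=5.0, b=10.0))   # > 0: background favored
```

For $\mu = 0$ the statistic vanishes identically, consistent with the nesting of the two hypotheses.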

9.10 Multivariate Discrimination with Machine Learning

The Neyman–Pearson lemma, introduced in Sect. 9.5, sets an upper limit to
the performance of any possible multivariate discriminant, in the sense that a
discrimination based on the likelihood ratio test statistic achieves the lowest possible
misidentification probability $\alpha$ for a fixed value of the signal efficiency $\varepsilon = 1 - \beta$.
The exact evaluation of the likelihood function in the two hypotheses $H_0$ and $H_1$
is not always possible, and approximate methods have been developed in order to
approach the performance of an ideal selection based on the likelihood ratio, hence
approaching the limit set by the Neyman–Pearson lemma.
The most powerful approximate methods are implemented by means of computer
algorithms: the algorithm receives as input a set of discriminating variables, each
of which individually does not provide optimal selection power. The
algorithm computes an output that combines the input variables. The discriminant
output value is taken as a test statistic and is then adopted to perform the signal
selection, which is implemented as a cut on the value of the discriminant, as shown
in Fig. 9.1.
In machine-learning algorithms, the computation of the output value, given the
input variables, is based on a number of parameters, which can often be very large.
The choice of the parameter values is a key task of the algorithm, since an optimal
choice allows achieving the best possible performance.
The usual strategy consists in tuning the discriminant parameters by providing as
input to the algorithm large datasets distributed according to either the $H_0$ hypothesis
or $H_1$. By comparing the discriminant output to the true origin of the dataset, the
parameters are modified. This process is called training, and algorithms that use
such training samples are called supervised machine-learning algorithms.

Conversely, algorithms that group and interpret data according to the observed
distribution of the input data alone are called unsupervised learning.
In the following, examples of supervised learning will be discussed.
In many cases of supervised learning, training samples are provided by computer
simulations of the expected distributions in either the $H_0$ or the $H_1$ hypothesis. If
control samples in data are available with very good purity, those data samples can
be used as training samples, avoiding uncertainties due to simulation modeling. Each
entry in a data sample is often called an observation in the machine-learning literature;
in particle physics it may typically represent a collision event or an observed particle
of a given type.

9.10.1 Overtraining

A potential general problem with the training of machine-learning algorithms is the
possibility that the algorithm exploits artifacts of the necessarily finite training
samples that are not representative of the actually expected distributions.
This may occur, in particular, if the size of the training sample given as
input is not very large. The effect of overtraining is illustrated in Fig. 9.6:
in practice, an irregular selection may pick or skip individual observations in the
training categories.
The presence of overtraining may be spotted by looking at the discriminant
distribution and its performance in terms of efficiency and background rejection
evaluated on the training sample, and comparing with another independently generated
test sample. If the discriminant distributions for the training and test samples are not
in good agreement (for instance, a Kolmogorov–Smirnov test may be used to check

Fig. 9.6 Illustration of overtraining of a machine-learning algorithm in two dimensions. Blue
points are signal training observations, red points are background. The multivariate selection is
represented as a dashed curve. Left: smooth selection based on a regular algorithm output; right:
selection based on an overtrained algorithm that picks up artifacts due to statistical fluctuations of
the finite-size training samples

the agreement, see Sect. 9.7), and the performance evaluated on the training sample is
significantly better than that evaluated on the test sample, this may indicate the
presence of overtraining; hence the training of the algorithm and its performance
may not be optimal when applied to real data.
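The comparison described above can be automated; a hypothetical sketch follows, where beta-distributed toy scores stand in for a real discriminant evaluated on the training and test samples:

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical discriminant scores evaluated on the training sample and on an
# independent test sample (beta-distributed toys stand in for a real classifier)
rng = np.random.default_rng(3)
scores_train = rng.beta(5.0, 2.0, size=2000)   # looks slightly too good
scores_test = rng.beta(4.0, 2.0, size=2000)    # what actually generalizes

d, p = ks_2samp(scores_train, scores_test)
print(f"KS distance D = {d:.3f}, p-value = {p:.2g}")
# A small p-value signals disagreement between the training and test score
# distributions: a warning sign of overtraining.
```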

9.11 Artificial Neural Networks

Artificial neural networks have been designed in an attempt to mimic with computers
a simplified functional model of neuron cells in animals. The discriminant output
is computed by combining the response of multiple nodes, each representing a single
neuron cell. Nodes are arranged into layers, as shown in Fig. 9.7. Input variable
values $(x_1,\dots,x_p)$ are passed to a first input layer, whose output is passed as input
to the next layer, and so on. Finally, the last output layer is usually constituted by
a single node that provides the discriminant output. More than one output node
can be used to encode more than two possible outcomes of the discrimination, as
could be the case for the identification of hand-written characters. Intermediate layers
between the input and the output layers are called hidden layers. Such a structure is
also called a feedforward multilayer perceptron.
The input layer may apply linear transformations to the input variables in order
to adjust their range to a standard interval, typically $[0,1]$. The output of each node
is computed as a weighted average of the input variables, with weights that are subject

Fig. 9.7 Structure of a multilayer artificial neural network: input variables $(x_1,\dots,x_p)$
enter the input layer, hidden layers apply weights $w^{(n)}_{kj}$, and the output layer provides
the discriminant output $y$



to optimization via training. The weighted average is then filtered by an activation
function $\varphi$, and the output of the $k$th node of the $n$th layer is given by:

$$
y_k^{(n)}(\vec x\,) = \varphi\left(\sum_{j=1}^{p^{(n)}} w_{kj}^{(n)}\, x_j\right). \qquad (9.40)
$$

The activation function limits the output to the range $[0,1]$. A typical choice for $\varphi$ is
a sigmoid function:

$$
\varphi(\zeta) = \frac{1}{1 + e^{-\zeta}}\,. \qquad (9.41)
$$

Other choices, such as an arctangent, are also sometimes adopted.
In some cases, a bias or threshold $w_0$ is also added as an extra parameter to each
layer $n$:

$$
y_k^{(n)}(\vec x\,) = \varphi\left(w_0^{(n)} + \sum_{j=1}^{p^{(n)}} w_{kj}^{(n)}\, x_j\right). \qquad (9.42)
$$

The training of the multilayer perceptron is achieved by minimizing the so-called
loss function, defined as the sum, over an $N$-observation training dataset $\vec x_i$,
$i = 1,\dots,N$, of the squared differences between the network's output $y(\vec x_i)$ and the true
classification $y_i^{\text{true}}$:

$$
L(w) = \sum_{i=1}^{N}\left(y_i^{\text{true}} - y(\vec x_i)\right)^2. \qquad (9.43)
$$

$y_i^{\text{true}}$ is usually equal to 1 for signal ($H_1$ is true) and 0 for background ($H_0$ is true).
The loss function depends on the choice of the network weights $w$, which are not
explicitly reported in Eq. (9.43), but appear in $y(\vec x_i)$.
A popular algorithm to optimize the weights in order to minimize the loss
function consists in iteratively modifying the weights after each training observation,
or after a bunch of training observations. Weights are typically randomly initialized
in order to break the symmetry among different nodes. The minimization usually
proceeds via the so-called stochastic gradient descent [13], which modifies the weights
at each iteration according to the following formula:

$$
w_{ij}^{(n)} \to w_{ij}^{(n)} - \eta\,\frac{\partial L(w)}{\partial w_{ij}^{(n)}}\,. \qquad (9.44)
$$

The parameter $\eta$ is called the learning rate and controls how large the change to the
parameters $w$ is at each iteration. This method is also called back propagation [14],

since the value computed as output drives the changes to the last node's weights, which,
in turn, drive the changes to the weights in the previous layer, and so on.
A more extensive introduction to artificial neural networks can be found in [15].
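The pieces above (weighted sums, sigmoid activation, squared loss, stochastic gradient descent with back propagation, Eqs. (9.40)–(9.44)) fit together in a minimal sketch of a one-hidden-layer perceptron. The toy two-dimensional dataset, layer sizes and learning rate below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):                       # activation function, Eq. (9.41)
    return 1.0 / (1.0 + np.exp(-z))

# Toy 2D problem: signal around (+1, +1), background around (-1, -1)
X = np.vstack([rng.normal(+1.0, 1.0, (500, 2)),
               rng.normal(-1.0, 1.0, (500, 2))])
y = np.hstack([np.ones(500), np.zeros(500)])   # y_true: 1 signal, 0 background

# One hidden layer with 8 nodes; weights randomly initialized to break symmetry
W1 = rng.normal(0.0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros(1)
eta = 0.1                              # learning rate

for epoch in range(40):                # stochastic gradient descent, Eq. (9.44)
    for i in rng.permutation(len(X)):
        h = sigmoid(X[i] @ W1 + b1)                # hidden layer output
        out = sigmoid(h @ W2 + b2)[0]              # network output
        # gradient of the squared loss (y_true - y)^2, back propagated
        delta_out = 2.0 * (out - y[i]) * out * (1.0 - out)
        delta_h = delta_out * W2[:, 0] * h * (1.0 - h)
        W2[:, 0] -= eta * delta_out * h; b2 -= eta * delta_out
        W1 -= eta * np.outer(X[i], delta_h); b1 -= eta * delta_h

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)[:, 0] > 0.5
print("training accuracy:", np.mean(pred == y))
```

Real implementations (e.g. in TMVA) add mini-batches, regularization and convergence monitoring, but the update rule is the same.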

9.11.1 Deep Learning

It has been demonstrated that an artificial neural network with a single hidden
layer may approximate any analytical function within any desired accuracy,
provided that the number of neurons is sufficiently large and the activation
function satisfies certain regularity conditions [16]. In practice, however, the number
of nodes required to achieve the desired approximation may be very large, and
performance limitations may be present in artificial neural networks with
a reasonably manageable number of nodes. Those performance limitations made
boosted decision trees (see Sect. 9.12) a preferred choice over artificial neural
networks in several data analyses in particle physics experiments.
The number of nodes required for optimal performance is smaller when more hidden
layers are added than for a single hidden layer [17], but adding more hidden
layers can make the training process harder to converge: it becomes more
frequent to meet conditions where the output of the activation function is close
to 0 or 1, which corresponds to a gradient of the loss function close to zero,
making the learning by stochastic gradient descent slower.
Those limitations have recently been overcome by techniques that use several
hidden layers, possibly with relatively large numbers of nodes per layer, and
optimized training algorithms that make the treatment of complex networks feasible
with modern computing technologies, even for cases that were intractable with
traditional algorithms. Those techniques are called deep learning [18] and have
recently become popular for advanced applications like image classification or face
recognition.
One of the first applications of an artificial neural network with several layers and a
large number of nodes was implemented to classify handwritten digits based on a
large number of training images [19]. The adopted network had $28^2 = 784$ input
variables (one per pixel of a $28\times 28$ grayscale image) and 10 outputs in order to
encode the 10 digits from 0 to 9. Five hidden layers were used, with 2500, 2000, 1500,
1000 and 500 nodes, respectively. The training method was a standard stochastic
gradient descent (see Sect. 9.11), and intensive computing power was used.
In high-energy physics, deep learning techniques allow achieving optimal performance,
exceeding shallow neural networks and boosted decision trees. Moreover,
deep learning provides another very convenient feature: it is possible to use
very basic quantities directly, like particle four-momenta, as input to the network,
instead of working out combinations that are suitable for a specific analysis. For
instance, in a search for a resonance, the invariant mass is a key variable, its
distribution being peaked for the signal and flatter for the background. More complex
kinematic variables are optimal for other searches for new particles, typically for
those with missing energy due to escaping undetected particles, as for Supersymmetry.
While shallow neural networks may not achieve reasonable performance in most
cases if those specific kinematic variables are not used as input, it has been shown
that the same performance can be achieved with deep learning whether the input
variables are complex variable combinations or four-momenta directly [20]. For those
cases, a typical number of nodes per layer may be several hundred, with a number of
hidden layers of the order of 5.
This capability of machine-learning algorithms may lead to optimal performance
with a reduced human effort in the data processing steps preliminary to
the training of the multivariate analysis.
Optimizations of the training procedure may be applied; these include a
modification of the learning rate parameter $\eta$, which may be decreased, for instance,
exponentially with the number of iterations [20].
Another application of deep learning in particle physics is the classification of
substructures in hadronic jets, which allows identifying jets produced in a $W \to qq'$
decay and separating them from jets due to quarks and gluons [21].

9.11.2 Convolutional Neural Networks

More advanced and optimized neural network structures, called convolutional
neural networks (CNN or ConvNet) [22], have been developed mainly for image and
sound processing. The number of inputs can be as large as the number of pixels in
an image.²
Nodes are usually arranged in more dimensions in order to adapt to the
dimensional structure of the input data: for instance, the three dimensions typically
corresponding to the width, height and color channel of an image.
Not all possible node connections are allowed in the network, reducing in this
way the total number of weights to be optimized: nodes in a layer have
connections with only some of the nodes in the previous layer. A fully connected
network would otherwise become intractable for a very large number of inputs, as
for images with tens of millions of pixels.
Moreover, translational symmetry along the image in two dimensions, or in one
dimension (time) for sound, is also used to reduce the number of parameters. This
corresponds, in image analysis, to the inspection of local features, e.g., edges, that
can be displaced at different positions in an image and are logically arranged into
subsequent more complex structures, as opposed to a single global image analysis
that would consider the interconnections of all possible information contained in
individual pixels.
The overall network architecture is potentially similar to the logical structure of
how image processing may proceed in the brain's visual cortex, which is able to
perform vision using a limited number of neurons.

² Multiplied by three for color images, since each of the three color channels in a pixel is separately
encoded.
Several small network units (filters), each aiming at identifying a specific
feature, are applied to the input at all possible positions, exploiting the translational
symmetry. For image recognition, a filter corresponds to a 'window' spanning a small
number of pixels. The extent of the filter is called the local receptive field. Each
pixel in the window is connected to a neuron, whose output is the output of the filter,
which evaluates a weighted sum of the pixel values, as in standard neural networks
(a convolution, which gives the name to this neural network architecture).
The resulting filter output evaluated at each position is stored in a layer called a
feature map. Each filter unit generates in this way a feature map, which is stored
as a two-dimensional array. The construction of a feature map is shown in Fig. 9.8
for a two-dimensional case and in Fig. 9.9 for a three-dimensional case.
Usually there are several filters, and the set of all feature maps is stored as a three-
dimensional array, the third dimension being given by the number of applied filter units.
The feature map array produced at this stage can be used as input to subsequent
network layers.
A feature map can be reduced in size by applying a subsampling (pooling).
A reduced feature map can be processed in a subsequent network layer having a
smaller number of nodes.

Fig. 9.8 Construction of a feature map for a 3 × 3 filter applied on a 10 × 10 pixel image at different
positions. The resulting feature map will have size 8 × 8 × 1. For simplicity, a grayscale image has
been assumed (one color code instead of three), reducing the dimensionality of the example

Fig. 9.9 Construction of a feature map for a 3 × 3 × 3 filter applied on a 10 × 10 × 3 pixel image
at different positions. The resulting feature map will have size 8 × 8 × 1
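The feature-map construction of Figs. 9.8 and 9.9, together with pooling and a rectified linear activation (cf. Eq. (9.45)), can be sketched in a few lines; the averaging filter and the random toy image below are illustrative assumptions:

```python
import numpy as np

def feature_map(image, filt):
    """Valid 2D convolution of one filter over a grayscale image,
    producing a single feature map (cf. Fig. 9.8)."""
    H, W = image.shape
    h, w = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + h, c:c + w] * filt)
    return out

def max_pool(fmap, k=2):
    """Subsampling (pooling): keep the maximum of each k x k patch."""
    H, W = fmap.shape
    return fmap[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))

relu = lambda z: np.maximum(0.0, z)   # rectified linear unit, Eq. (9.45)

img = np.random.default_rng(0).random((10, 10))   # toy 10 x 10 grayscale image
fm = relu(feature_map(img, np.ones((3, 3)) / 9.0))
print(fm.shape, max_pool(fm).shape)   # (8, 8) (4, 4)
```

A 10 × 10 image and a 3 × 3 filter give an 8 × 8 feature map, as in Fig. 9.8; 2 × 2 pooling then halves each dimension.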

In order to improve the learning process with respect to standard neural
networks, sigmoid activation functions are replaced by piecewise-linear functions.
This avoids gradients that may approach zero, hence impairing the learning process.
A commonly used function is:

$$
\varphi(\zeta) = \max(0,\zeta)\,. \qquad (9.45)
$$

The output of this activation function applied to each entry of a previous layer
is called a rectified linear unit layer (ReLU). A smoothed version of ReLU is the
Softplus function [23]:

$$
\varphi(\zeta) = \ln(1 + e^{\zeta})\,. \qquad (9.46)
$$

Sequences of layer structures implementing the three processing steps described
above (convolution, pooling, rectified linear units) can be repeated several times.
All the outputs of the last layer of the sequence are finally fed into a smaller
fully connected network, whose outputs perform the final classification. A possible
simplified structure of a complete convolutional neural network is shown in
Fig. 9.10. The training, i.e., the weights optimization, can be performed by stochastic
gradient descent or other more performant algorithms.
Convolutional neural networks have a wide range of applications in artificial
intelligence (AI), but they also have recent applications in particle physics. For
instance, in neutrino event analysis [24], the modular structure of neutrino detectors
is exploited, as for images, to classify the different possible neutrino interactions that
may occur at any point of the detector, automatically identifying the features as
patterns in the set of detector signals.

Fig. 9.10 Simplified possible structure of a convolutional neural network [25]: convolution and
subsampling steps alternate, producing successive feature maps, and the last feature maps are fed
into a fully connected network that computes the output

9.12 Boosted Decision Trees

A decision tree is a sequence of selection cuts that are applied in a specified order
on a given variable dataset. Each cut splits the sample into nodes, each of which
corresponds to a given number of observations classified as signal or as background.
A node may be further split by the application of the subsequent cut in the tree.
Nodes in which either signal or background is largely dominant are classified
as leaves, and no further selection is applied. A node may also be classified as a leaf,
and the selection path stopped, in case too few observations per node remain, or
in case the total number of identified nodes is too large; different criteria have
been proposed and applied in real implementations.
Each branch of a tree represents one sequence of cuts. An example of a decision
tree is shown in Fig. 9.11. Along the decision tree, the same variable may appear
multiple times, depending on the depth of the tree, each time with a different applied
cut, possibly even with different inequality directions.
Selection cuts can be tuned in order to achieve the best split level in each node
according to some metric. One possible optimization consists in maximizing for
each node the gain in Gini index achieved after a splitting. The Gini index can be
defined as:

$$
G = P\,(1 - P)\,, \qquad (9.47)
$$

where $P$ is the purity of the node, i.e., the fraction of signal observations. $G$ is equal
to zero for nodes containing only signal or only background observations.
Alternatives to the Gini index are also used as the metric to be optimized. One
example is the so-called cross entropy, equal to:

$$
E = -\left(P\log P + (1 - P)\log(1 - P)\right). \qquad (9.48)
$$



Fig. 9.11 An example of a decision tree. Each node, represented as an ellipse, contains a different
number of signal (left number) and background (right number) observations. The relative amounts
of signal and background in each node are represented as blue and red areas, respectively. Applied
requirements are represented as rectangular boxes

The gain due to the splitting of a node $A$ into the nodes $B_1$ and $B_2$, which depends
on the chosen cut, is given by:

$$
\Delta I = I(A) - I(B_1) - I(B_2)\,, \qquad (9.49)
$$

where $I$ denotes the adopted metric ($G$ or $E$, in case of the Gini index or cross
entropy introduced above). By varying the cut, the optimal gain may be achieved.
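A small worked example of Eqs. (9.47) and (9.49) follows. The node populations are illustrative assumptions; note also that, as is common in practice, the child indices are weighted here by their share of observations, a variant of the plain difference in Eq. (9.49):

```python
def gini(n_sig, n_bkg):
    """Gini index G = P (1 - P), Eq. (9.47); zero for a pure node."""
    n = n_sig + n_bkg
    if n == 0:
        return 0.0
    p = n_sig / n   # purity: fraction of signal observations
    return p * (1.0 - p)

def gain(parent, left, right):
    """Gain of a split A -> (B1, B2), cf. Eq. (9.49), with the child
    indices weighted by their share of observations (a common variant)."""
    n = sum(parent)
    return (gini(*parent)
            - sum(left) / n * gini(*left)
            - sum(right) / n * gini(*right))

# Splitting a mixed node (40 signal, 40 background) into two purer ones:
print(gain((40, 40), (30, 10), (10, 30)))   # 0.0625
# A perfect split into pure nodes recovers the full parent index:
print(gain((40, 40), (40, 0), (0, 40)))     # 0.25
```

Scanning this gain over candidate cut values on each input variable is exactly the per-node optimization described in the text.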
A single decision tree can be optimized as described above, but its performance
is usually far from the optimal limit set by the Neyman–Pearson lemma. A significant
improvement can be achieved by combining multiple decision trees into a
decision forest.
The random forest algorithm [26] consists of 'growing' many decision trees from
replicas of the training samples obtained by randomly resampling the input data. No
minimum size is required for leaf nodes. The final score of the algorithm is given
by an unweighted average of the predictions (zero or one) of the individual trees.

The boosting procedure instead iteratively adds a new tree to the forest, obtained by
performing the selection optimization after the training observations have been
reweighted according to the score given by the classifier at the previous iteration.
The collection obtained at the end of the iterative procedure is called boosted
decision trees (BDT) [27].
The boosting algorithm usually proceeds through the following steps:
• Training observations are reweighted using the previous iteration's classifier
result.
• A new tree is built and optimized using the reweighted observations as a training
sample.
• A score is given to each tree.
• The output of the final BDT classifier is the weighted average over all trees in the
forest:

$$
y(\vec x\,) = \sum_{i=1}^{N_{\text{trees}}} w_i\, C^{(i)}(\vec x\,)\,. \qquad (9.50)
$$

Among the boosting algorithms, the most popular is adaptive boosting [28].
With this approach, only observations misclassified in the previous iteration are
reweighted, by a weight which depends on the fraction $f$ of misclassified observations
in the previous tree, usually equal to:

$$
w = \frac{1 - f}{f}\,, \qquad f = \frac{N_{\text{misclassified}}}{N_{\text{tot}}}\,. \qquad (9.51)
$$

The misclassification fraction $f$ is also used to compute the weights in the combination
of the individual decision tree outputs:

$$
y(\vec x\,) = \sum_{i=1}^{N_{\text{trees}}} \log\left(\frac{1 - f^{(i)}}{f^{(i)}}\right) C^{(i)}(\vec x\,)\,. \qquad (9.52)
$$

With adaptive boosting, observations misclassified at a given iteration acquire
more importance, and the tree added at the next iteration will be more efficient at
classifying those observations correctly.
Variations of this algorithm with more boosting options are available in the literature.
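The combination of Eqs. (9.51) and (9.52) can be made concrete with a toy forest. The tree outputs and misclassification fractions below are illustrative assumptions, and the ±1 output convention is one common choice:

```python
import numpy as np

def bdt_output(tree_outputs, f):
    """Weighted combination of individual tree outputs, Eq. (9.52),
    with per-tree weights log((1 - f_i) / f_i) from Eq. (9.51)."""
    alpha = np.log((1.0 - f) / f)
    return float(np.dot(alpha, tree_outputs))

# Hypothetical forest of three trees voting on one observation
# (+1 for signal, -1 for background) with their training
# misclassification fractions f_i:
votes = np.array([+1.0, +1.0, -1.0])
f = np.array([0.20, 0.30, 0.45])
print(bdt_output(votes, f))   # > 0: the observation is classified as signal
```

Trees with small misclassification fractions dominate the sum, so the nearly random third tree ($f^{(3)} = 0.45$) barely affects the result.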

9.13 Multivariate Analysis Implementations

The software tool TMVA [29] implements the most frequently adopted multivariate
analysis methods in high-energy physics, including Fisher's linear discriminant,
projective likelihood analysis, artificial neural networks and boosted decision trees.
The tool is distributed with the ROOT [7] framework and provides a common
interface for the different multivariate methods. Each method is configurable, and
all parameters can be tuned, though a series of default values, reasonably valid in
the majority of applications, is provided.
CAFFE [30] is a software framework that implements deep learning algorithms
and was used in [24]. Versions of ROOT from 6.08 on also provide a deep learning
implementation in the TMVA library.

Example 9.24 Comparison of Multivariate Discriminators


The performances of different multivariate discriminators can be evaluated and compared using Monte Carlo pseudo-samples. The most striking features appear when the signal and background distributions have significant overlap and are not obviously separable with linear cuts.
The TMVA [29] package has been used in this example as the implementation of the different multivariate analysis methods, which have been trained using two-dimensional datasets generated with Monte Carlo. The distributions are shown in Fig. 9.12. Independent datasets have been used to test the performance of each method.
Four methods have been tested, using the default configuration provided by
the TMVA package:
• Fisher linear discriminant
• Projective likelihood discriminant
• Artificial neural network
• Boosted decision trees
The distributions of the four discriminants for the generated test samples are shown in Fig. 9.13. The distribution of the observations selected as signal and background from the test samples is also shown in Fig. 9.14.

The Fisher linear discriminant divides the sample space using a straight line. The orientation of the line is optimized, but the achieved selection is clearly suboptimal. The projective likelihood method achieves a better separation, though not as good as the artificial neural network or the boosted decision trees.
Figure 9.15 shows the ROC curve (receiver operating characteristic) for the four considered methods, as produced by the graphical user interface provided by TMVA.

Fig. 9.12 Distribution in two dimensions for signal (blue) and background (red) test observations


Fig. 9.13 Distribution of different multivariate discriminators, trained on samples independent of, but distributed according to, the same PDF as the ones shown in Fig. 9.12, for signal (blue) and background (red): Fisher linear discriminant (top, left), projective likelihood discriminant (top, right), artificial neural network (bottom, left) and boosted decision trees (bottom, right)


Fig. 9.14 Distribution of test observations as in Fig. 9.12 showing observations selected
as signal (blue) and background (red) by the different multivariate discriminants: Fisher
linear discriminant (top, left), projective likelihood discriminant (top, right), artificial
neural network (bottom, left) and boosted decision trees (bottom, right)


Fig. 9.15 ROC curves (background rejection versus signal efficiency) for the four multivariate discriminators: BDT, neural network, projective likelihood and Fisher. Fisher and projective likelihood exhibit suboptimal performances, while, for this example, the performance curves of the artificial neural network and of the boosted decision trees completely overlap

References

1. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–
188 (1936)
2. Neyman, J., Pearson, E.: On the problem of the most efficient tests of statistical hypotheses.
Philos. Trans. R. Soc. Lond. Ser. A 231, 289–337 (1933)
3. Kolmogorov, A.: Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital.
Attuari 4, 83–91 (1933)
4. Smirnov, N.: Table for estimating the goodness of fit of empirical distributions. Ann. Math.
Stat. 19, 279–281 (1948)
5. Chakravarti, I.M., Laha, R.G., Roy, J.: Handbook of Methods of Applied Statistics, vol. I.
Wiley, New York (1967)
6. Marsaglia, G., Tsang, W.W., Wang, J.: Evaluating Kolmogorov's distribution. J. Stat. Softw. 8,
1–4 (2003)
7. Brun, R., Rademakers, F.: ROOT—an object oriented data analysis framework. Proceedings
AIHENP96 Workshop, Lausanne (1996). Nucl. Instrum. Methods A 389 81–86 (1997). http://
root.cern.ch/
8. Stephens, M.A.: EDF statistics for goodness of fit and some comparisons. J. Am. Stat. Assoc.
69, 730–737 (1974)
9. Anderson, T.W., Darling, D.A.: Asymptotic theory of certain “goodness-of-fit” criteria based
on stochastic processes. Ann. Math. Stat. 23, 193–212 (1952)
10. Cramér, H.: On the composition of elementary errors. Scand. Actuar. J. 1928(1), 13–74 (1928)

11. von Mises, R.E.: Wahrscheinlichkeit, Statistik und Wahrheit. Julius Springer, Vienna (1928)
12. Wilks, S.: The large-sample distribution of the likelihood ratio for testing composite hypothe-
ses. Ann. Math. Stat. 9, 60–62 (1938)
13. LeCun, Y., Bottou, L., Orr, G.B., Müller, K.R.: Neural Networks: Tricks of the Trade. Springer,
Berlin/Heidelberg (1998)
14. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating
errors. Nature 323, 533–536 (1986)
15. Peterson, C., Rögnvaldsson, T.S.: An introduction to artificial neural networks. LU-TP-91-23.
14th CERN School of Computing, Ystad (1991)
16. Mhaskar, H.N.: Neural networks for optimal approximation of smooth and analytic functions.
Neural Comput. 8(1), 164–177 (1996)
17. Reed, R., Marks, R.: Neural Smithing: Supervised Learning in Feedforward Artificial Neural
Networks. A Bradford book. MIT Press, Cambridge (1999)
18. Le Cun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
19. Cireşan, D.C., et al.: Deep, big, simple neural nets for handwritten digit recognition. Neural
Comput. 22, 3207–3220 (2010)
20. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics
with deep learning. Nature Commun. 5, 4308 (2014)
21. Baldi, P., Bauer, K., Eng, C., Sadowski, P., Whiteson, D.: Jet substructure classification in
high-energy physics with deep neural networks. Phys. Rev. D 93, 094034 (2016)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. Adv. Neural Inf. Proces. Syst. 25, 1097–1105 (2012)
23. Dugas, C., Bengio, Y., Bélisle, F., Nadeau, C., Garcia, R.: Incorporating second-order
functional knowledge for better option pricing. In: Proceedings of NIPS’2000: Advances in
Neural Information Processing Systems (2001)
24. Aurisano, A., et al.: A convolutional neural network neutrino event classifier. J. Instrum. 11,
P09001 (2016)
25. Photo by Angela Sorrentino: http://angelasorrentino.awardspace.com/ (2007)
26. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001). http://www.stat.berkeley.edu/~
breiman/RandomForests/
27. Roe, B.P., Yang, H.J., Zhu, J., Liu, Y., Stancu, I., McGregor, G.: Boosted decision trees as an
alternative to artificial neural networks for particle identification. Nucl. Instrum. Methods A
543, 577–584 (2005)
28. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an appli-
cation to boosting. In: Proceedings of EuroCOLT’94: European Conference on Computational
Learning Theory (1994)
29. Hoecker, A., et al.: TMVA – Toolkit for Multivariate Data Analysis. PoS ACAT 040,
arXiv:physics/0703039 (2007)
30. Jia, Y., et al.: Convolutional architecture for fast feature embedding. arXiv:1408.5093 (2014)
Chapter 10
Discoveries and Upper Limits

10.1 Searches for New Phenomena: Discovery and Upper Limits

The goal of many experiments is to search for new physical phenomena. If an experiment provides a convincing measurement of a new signal, the result should be published and claimed as a discovery. If the outcome is not sufficiently convincing, in many cases it is nonetheless interesting to quote, as the result of the search for the new phenomenon, an upper limit to the yield of the new signal. From upper limits to the signal yield, it is often possible to indirectly derive limits on the properties of the new signal. Limits can be set on the mass of a new particle or on coupling constants; more in general, it is possible to exclude regions of the parameter space of a new theory that influences the signal yield.
In order to give a quantitative measure of how 'convincing' the result of an experiment is, there are different possible approaches. Using the Bayesian approach
(Chap. 3), the posterior probability, given the experiment’s measurement and a
subjective prior, can quantify the degree of belief that the new signal hypothesis
is true. In particular, when comparing two hypotheses, in this case the presence or
absence of signal, the Bayes factor (see Sect. 3.6) can be used to quantify how strong
the evidence for a new signal is against the background-only hypothesis.
In the frequentist approach the significance level is introduced (see Sect. 10.2),
which measures the probability that, in the case of presence of background only,
a statistical fluctuation in data may produce by chance a measurement of the new
signal at least as ‘intense’ as what is observed in data.
The interpretation of a discovery in the Bayesian and frequentist approaches are
very different, as will be discussed in the following.
The determination of upper limits is, in many cases, a complex task and
the computation frequently requires numerical algorithms. Several methods are
adopted in high-energy physics and are documented in the literature to determine
upper limits.

© Springer International Publishing AG 2017
L. Lista, Statistical Methods for Data Analysis in Particle Physics,
Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0_10

This chapter will introduce the concept of significance and will present the most
popular methods to set upper limits, discussing their main benefits and limitations.
The so-called modified frequentist approach will be introduced, which is a popular
method in high-energy physics that is neither a purely frequentist nor a Bayesian
method.

10.2 Claiming a Discovery

10.2.1 p-Values

Given an observed data sample, claiming the discovery of a new signal requires
determining that the sample is sufficiently inconsistent with the hypothesis that
only background is present in the data. A test statistic t can be used to measure the
inconsistency of the observation in the hypothesis of the presence of background
only, typically assumed as a null hypothesis, H0 .
The probability p that the considered test statistic t assumes a value greater than or equal to the observed one in the case of a pure background fluctuation is called p-value, as already introduced in Sect. 5.12.2. Here it is implicitly assumed that, by convention, large values of t correspond to a more signal-like sample.
The p-value has, by construction (see Sect. 2.5), a uniform distribution between
0 and 1 for the background-only hypothesis H0 and tends to have small values in the
presence of the signal (hypothesis H1 ).
If the number of observed events is adopted as test statistic (event counting experiment), the p-value can be determined as the probability to count a number of events equal to or greater than the observed one, assuming the presence of no signal and the expected background level. In this case, the test statistic and the p-value may assume discrete values. In general, the distribution of the p-value is only approximately uniform for a discrete distribution.

Example 10.25 p-Value for a Poissonian Counting


Figure 10.1 shows a Poisson distribution corresponding to an expected number of (background-only) events equal to 4.5. In case the observed number of events is 8, the p-value is equal to the probability to observe 8 or more events, i.e. it is given by:

$p = P(n \ge 8) = \sum_{n=8}^{\infty} \mathrm{Pois}(n;\, 4.5) = 1 - e^{-4.5} \sum_{n=0}^{7} \frac{4.5^n}{n!}\,.$

Performing the computation explicitly, a p-value of 0.087 can be determined.
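The explicit computation above can be reproduced with a few lines of Python (a sketch of the sum in the example, using only the standard library):

```python
import math

b, n_obs = 4.5, 8

# p-value: probability to observe n_obs or more events from a
# Poisson distribution with mean b.
p = 1.0 - math.exp(-b) * sum(b**n / math.factorial(n) for n in range(n_obs))
print(round(p, 3))  # 0.087
```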


Fig. 10.1 Poisson distribution for a null signal and an expected background of b = 4.5. The probability corresponding to n ≥ 8 (light blue area) is 0.087 and is equal to the p-value, assuming the event counting as test statistic

10.2.2 Significance Level

Instead of quoting a p-value, it is often preferred to report the equivalent number of standard deviations that correspond to an area equal to the p-value under the rightmost tail of a normal distribution. So, one quotes a 'Z' significance corresponding to a given p-value using the following transformation (see Eq. (2.31)):

$p = \int_Z^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, \mathrm{d}x = 1 - \Phi(Z) = \Phi(-Z) = \frac{1}{2}\left[1 - \mathrm{erf}\!\left(\frac{Z}{\sqrt{2}}\right)\right].$   (10.1)

By convention, in the literature, one claims 'evidence' of the signal under investigation if the significance is at least 3σ (Z = 3), which corresponds to a probability of background fluctuation of 1.35 × 10⁻³. One claims the 'observation' of a signal (discovery) in case the significance is at least 5σ (Z = 5), corresponding to a p-value of 2.87 × 10⁻⁷.
Table 10.1 shows a number of typical significance values expressed as ‘Z’ and
their corresponding p-values.
Table 10.1 Significance expressed as 'Z' (in σ) and corresponding p-value in a number of typical cases

  Z (σ)    p
  1.00     1.59 × 10⁻¹
  1.28     1.00 × 10⁻¹
  1.64     5.00 × 10⁻²
  2.00     2.28 × 10⁻²
  2.32     1.00 × 10⁻²
  3.00     1.35 × 10⁻³
  3.09     1.00 × 10⁻³
  3.71     1.00 × 10⁻⁴
  4.00     3.17 × 10⁻⁵
  5.00     2.87 × 10⁻⁷
  6.00     9.87 × 10⁻¹⁰
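Equation (10.1) can be evaluated directly with the error function; the short sketch below (plain Python, standard library only) reproduces the entries of Table 10.1:

```python
import math

def p_value(z):
    """Right tail of a standard normal above z (Eq. 10.1)."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

for z in (1.0, 2.0, 3.0, 5.0):
    # e.g. Z = 3 gives p close to 1.35e-3, Z = 5 close to 2.87e-7
    print(f"Z = {z}:  p = {p_value(z):.3g}")
```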

10.2.3 Significance and Discovery

Determining the significance is only part of the process that leads to a discovery in the scientific method. The following sentence from a document written by the American Statistical Association (ASA) is worth reporting:
The p-value was never intended to be a substitute for scientific reasoning. Well-reasoned
statistical arguments contain much more than the value of a single number and whether that
number exceeds an arbitrary threshold. The ASA statement is intended to steer research into
a ‘post p < 0:05 era’ [1].

This was also remarked by the physics community and the following statement was
reported by Cowan et al.:
It should be emphasized that in an actual scientific context, rejecting the background-only
hypothesis in a statistical sense is only part of discovering a new phenomenon. One’s degree
of belief that a new process is present will depend in general on other factors as well, such
as the plausibility of the new signal hypothesis and the degree to which it can describe the
data [2].

In order to evaluate the ‘plausibility of a new signal’ and other factors that give
confidence in a discovery, the physicist’s judgment cannot, of course, be replaced
by the statistical evaluation of the p-value only. In this sense, we can say that a
Bayesian interpretation of the very final result is somehow implicitly assumed, even
when reporting a frequentist significance level.

10.2.4 Significance for Poissonian Counting Experiments

In a counting experiment, the number of observed events is the only considered


information. The selected event sample contains a mixture of n events due to signal
and background processes and the expected total number of events is s C b, where s
and b are the expected numbers of signal and background events, respectively.

Assuming the expected background b is known (e.g.: it could be estimated from


theory or from a control data sample with negligible uncertainty), the main unknown
parameter of the problem is s and the likelihood function is:
$L(n;\, s, b) = \frac{(s+b)^n}{n!}\, e^{-(s+b)}\,.$   (10.2)

The number of observed events n must be compared with the expected number of background events b in the null hypothesis (s = 0). If b is sufficiently large, the distribution of n can be approximated with a Gaussian with average b and standard deviation equal to √b. An excess in data, quantified as s = n − b, should be compared with the expected standard deviation √b, and the significance can be approximately evaluated with the popular formula:

$Z = \frac{s}{\sqrt{b}}\,.$   (10.3)

In case the expected background yield b comes from an estimate which has a non-negligible uncertainty σ_b, Eq. (10.3) can be modified as follows:

$Z = \frac{s}{\sqrt{b + \sigma_b^2}}\,.$   (10.4)

A better approximation than Eq. (10.3), valid even for b well below unity, is given by Cowan et al. [2]:

$Z = \sqrt{2\left[(s+b)\log\left(1+\frac{s}{b}\right) - s\right]}\,.$   (10.5)

Equation (10.3) may be obtained from Eq. (10.5) for s ≪ b.
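The difference between the naive estimate of Eq. (10.3) and the improved approximation of Eq. (10.5) can be appreciated numerically with a short sketch (Python, standard library only; the function names are ours):

```python
import math

def z_naive(s, b):
    """Eq. (10.3): Z = s / sqrt(b)."""
    return s / math.sqrt(b)

def z_improved(s, b):
    """Eq. (10.5), the approximation from Cowan et al. [2]."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

# For s much smaller than b the two estimates agree...
print(z_naive(1.0, 100.0), z_improved(1.0, 100.0))  # both close to 0.1
# ...while for small b the naive formula overestimates the significance.
print(z_naive(5.0, 1.0), z_improved(5.0, 1.0))      # 5.0 vs about 3.39
```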

10.2.5 Significance with Likelihood Ratio

In Sect. 9.9 test statistics based on likelihood ratios were introduced. In particular, a
test statistic suitable for searches for a new signal is:

E
LsCb .Ex1 ;    ; ExN I
; /
E D
.
; / : (10.6)
E
Lb .Ex1 ;    ; ExN I /
A minimum of 2 log .
/ at
D
O indicates the possible presence of a signal
having a signal strength equal to
.
O
The advantage of the likelihood ratio as test statistic is that H0 , assumed in
the denominator, can be taken as a special case of H1 , assumed in the numerator,
with
D 0. This represents a case of nested hypotheses, hence Wilks’ theorem
can be applied, assuming the likelihood function is sufficiently regular. Note that
in Eq. (10.6) numerator and denominator are inverted with respect to Eqs. (9.25)
and (9.26).
According to Wilks' theorem, the distribution of 2 log λ(μ̂) can be approximated by a χ² distribution with one degree of freedom. The minus sign with respect to Eq. (9.26) is dropped because, as noted above, numerator and denominator are inverted in Eq. (10.6). In particular, the square root of 2 log λ(μ̂) at the minimum gives an approximate estimate of the significance level Z:

$Z = \sqrt{2 \log \lambda(\hat{\mu})}\,.$   (10.7)

This significance Z is also called local significance, in the sense that it corresponds to a fixed set of values of the parameters θ⃗. In case one or more of the parameters θ⃗ are estimated from data, the local significance at the fixed values of the measured parameters may be affected by the look elsewhere effect (see Sect. 10.13).
If the background PDF does not depend on θ⃗ (for instance, if we have a single parameter θ equal to the mass of an unknown particle, which only affects the signal distribution), $L_b(\vec{x};\, \vec{\theta})$ also does not depend on θ⃗, and the likelihood ratio λ(μ, θ⃗) is equal, up to a multiplicative factor, to the likelihood function $L_{s+b}(\vec{x};\, \mu, \vec{\theta})$. Hence, the maximum likelihood estimates of μ and θ⃗, μ̂ and θ̂⃗, can also be determined by minimizing −2 log λ(μ, θ⃗), which is equal to −2 log L_{s+b} up to a constant, and the error matrix, or the uncertainty contours on θ⃗, can be determined as usual for maximum likelihood estimates (see Sect. 5.11).

10.2.6 Significance Evaluation with Toy Monte Carlo

An accurate estimate of the significance corresponding to the test statistic −2 log λ can be achieved by generating a large number of Monte Carlo pseudo-experiments assuming the presence of no signal (μ = 0), which gives to a good approximation the expected distribution of −2 log λ.
The p-value is equal to the probability that the test statistic −2 log λ is less than or equal to the observed value −2 log λ̂:¹

$p = P_b(-2\log\lambda \le -2\log\hat{\lambda})\,,$   (10.8)

which is in turn equal to the fraction of generated pseudo-experiments for which −2 log λ ≤ −2 log λ̂.
Note that, in order to determine large significance values with sufficient precision, very large Monte Carlo samples are required. The p-value corresponding to a 5σ evidence, for instance, is equal to 2.87 × 10⁻⁷ (Table 10.1), hence samples as large as 10⁹ may be needed.

¹ Given the definition of λ in Eq. (10.6), signal-like cases tend to have L_{s+b} > L_b, hence λ > 1, which implies log λ > 0 and −2 log λ < 0; vice versa, background-like cases tend to have −2 log λ > 0. For this reason, small values of −2 log λ correspond to more signal-like cases. Other choices of test statistic may have the opposite convention.
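The toy Monte Carlo procedure can be sketched for a simple counting experiment, where the test statistic reduces to the event count itself; the Poisson generator below uses Knuth's algorithm and only the standard library (a didactic sketch, not an optimized implementation, and the function names are ours):

```python
import math
import random

def poisson_toy(mean, rng):
    """Sample a Poisson-distributed count (Knuth's algorithm)."""
    limit, k, prod = math.exp(-mean), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def toy_p_value(n_obs, b, n_toys, seed=42):
    """Fraction of background-only pseudo-experiments at least as
    signal-like as the observation (here: count >= n_obs)."""
    rng = random.Random(seed)
    n_extreme = sum(poisson_toy(b, rng) >= n_obs for _ in range(n_toys))
    return n_extreme / n_toys

# Example 10.25 revisited: b = 4.5, 8 events observed.
print(toy_p_value(8, 4.5, 100_000))  # close to the exact value 0.087
```

For large significances the fraction of extreme toys becomes tiny, which is exactly why samples of order 10⁹ are needed for a 5σ claim.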

10.3 Excluding a Signal Hypothesis

For the purpose of excluding a signal hypothesis, the requirement applied in terms of p-value is usually much milder than for a discovery. Instead of requiring a p-value of 2.87 × 10⁻⁷ or less (the '5σ' criterion), upper limits for a signal exclusion are set by requiring p < 0.05, corresponding to a 95% confidence level (CL), or p < 0.10, corresponding to a 90% CL.
For signal exclusion, p indicates the probability of a signal underfluctuation, i.e. the null hypothesis and the alternative hypothesis are inverted, when testing p-values, with respect to the case of a discovery.

10.4 Combined Measurements and Likelihood Ratio

The combination of the likelihood ratios of several independent measurements can be performed by multiplying the likelihood functions of the individual channels in order to produce a combined likelihood function (see also Sect. 6.2).
Assume that a first measurement with strong sensitivity to the signal is combined with a second measurement that has low sensitivity; the combined test statistic is given by the product of the likelihood ratios of the two measurements. Since for the second measurement the s + b and b hypotheses give similar values of the likelihood function, given its low sensitivity to the signal, the likelihood ratio of the second measurement is close to one. Hence, the combined test statistic (the product of the two) is not expected to differ much from the one given by the first measurement alone, and the combined sensitivity will not be worsened by the presence of the second measurement, with respect to the case in which only the first measurement is used.

10.5 Definitions of Upper Limit

In the frequentist approach, the procedure to set an upper limit is a special case of confidence interval determination (see Sect. 7.2), typically applied to the unknown signal yield s or, alternatively, to the signal strength μ.
In order to determine an upper limit instead of the usual central interval, the choice of the interval corresponding to the desired confidence level 1 − α (90% or 95%, usually) is fully asymmetric, [0, s^up[, which translates into an upper limit quoted as:

s < s^up at 95% CL (or 90% CL).

In the Bayesian approach, the interpretation of the upper limit s^up is that the credible interval [0, s^up[ corresponds to a posterior probability equal to the confidence level 1 − α.

10.6 Bayesian Approach

The Bayesian posterior PDF for a signal yield s,² assuming a prior π(s), is given by:

$P(s \mid \vec{x}) = \frac{L(\vec{x};\, s)\, \pi(s)}{\int_0^{\infty} L(\vec{x};\, s')\, \pi(s')\, \mathrm{d}s'}\,.$   (10.9)

The upper limit s^up can be computed by requiring that the posterior probability corresponding to the interval [0, s^up[ is equal to the specified confidence level CL, or equivalently that the probability corresponding to [s^up, ∞[ is equal to α = 1 − CL:

$\alpha = \int_{s^{\mathrm{up}}}^{\infty} P(s \mid \vec{x})\, \mathrm{d}s = \frac{\int_{s^{\mathrm{up}}}^{\infty} L(\vec{x};\, s)\, \pi(s)\, \mathrm{d}s}{\int_0^{\infty} L(\vec{x};\, s)\, \pi(s)\, \mathrm{d}s}\,.$   (10.10)

Apart from the technical aspects related to the computation of the integrals and the already mentioned subjectiveness in the choice of the prior π(s) (see Sect. 3.7), the above expression poses no particular fundamental problem.

10.6.1 Bayesian Upper Limits for Poissonian Counting

In the simplest case of negligible background, b = 0, and assuming a uniform prior, π(s) = const., the posterior PDF for s has the same expression as the Poissonian probability itself, as was demonstrated in Example 3.11:

$P(s \mid n) = \frac{s^n\, e^{-s}}{n!}\,.$   (10.11)

In case no event is observed, i.e. n = 0, we have:

$P(s \mid 0) = e^{-s}\,,$   (10.12)

and:

$\alpha = \int_{s^{\mathrm{up}}}^{\infty} e^{-s}\, \mathrm{d}s = e^{-s^{\mathrm{up}}}\,,$   (10.13)

which gives:

$s^{\mathrm{up}} = -\log\alpha\,.$   (10.14)
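Equation (10.14) directly gives the familiar limits for zero observed events; a one-line check (assuming Python):

```python
import math

# Eq. (10.14): upper limit for n = 0, negligible background, uniform prior.
for cl in (0.90, 0.95):
    alpha = 1.0 - cl
    print(f"{cl:.0%} CL:  s < {-math.log(alpha):.2f}")
# 90% CL gives s < 2.30, 95% CL gives s < 3.00
```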

² The same approach could be equivalently formulated in terms of the signal strength μ.

For α = 0.05 (95% CL) and α = 0.10 (90% CL), the following upper limits can be set:

s < 3.00 at 95% CL,   (10.15)
s < 2.30 at 90% CL.   (10.16)

The general case with expected background b ≠ 0 was treated by Helene [3], and Eq. (10.10) becomes:

$\alpha = e^{-s^{\mathrm{up}}}\, \frac{\sum_{m=0}^{n} (s^{\mathrm{up}} + b)^m / m!}{\sum_{m=0}^{n} b^m / m!}\,.$   (10.17)

The above expression can be inverted numerically in order to determine s^up for given α, n and b. In case of no background (b = 0), Eq. (10.17) becomes:

$\alpha = e^{-s^{\mathrm{up}}} \sum_{m=0}^{n} \frac{(s^{\mathrm{up}})^m}{m!}\,,$   (10.18)

which again gives Eq. (10.13) for n = 0.
The corresponding upper limits in case of negligible background for different numbers of observed events n are reported in Table 10.2. For different numbers of observed events n and different expected background b, the upper limits from Eq. (10.17) at 90% and 95% CL are shown in Fig. 10.2.
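Equation (10.17) can be inverted numerically, e.g. by bisection; the sketch below (plain Python, hypothetical function names) reproduces the entries of Table 10.2 for b = 0 as well as limits for non-zero expected background:

```python
import math

def helene_alpha(s_up, n, b):
    """Right-hand side of Eq. (10.17)."""
    num = sum((s_up + b)**m / math.factorial(m) for m in range(n + 1))
    den = sum(b**m / math.factorial(m) for m in range(n + 1))
    return math.exp(-s_up) * num / den

def bayesian_upper_limit(n, b, alpha):
    """Solve helene_alpha(s, n, b) = alpha for s by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if helene_alpha(mid, n, b) > alpha:  # alpha decreases with s
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(bayesian_upper_limit(0, 0.0, 0.10))  # close to 2.30 (Table 10.2)
print(bayesian_upper_limit(5, 0.0, 0.05))  # close to 10.51 (Table 10.2)
print(bayesian_upper_limit(5, 3.0, 0.05))  # lower: part of the count is background
```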

Table 10.2 Upper limits in presence of negligible background evaluated under the Bayesian approach for different numbers of observed events n

  n     s^up (1 − α = 90%)   s^up (1 − α = 95%)
  0      2.30                 3.00
  1      3.89                 4.74
  2      5.32                 6.30
  3      6.68                 7.75
  4      7.99                 9.15
  5      9.27                10.51
  6     10.53                11.84
  7     11.77                13.15
  8     12.99                14.43
  9     14.21                15.71
  10    15.41                16.96

Fig. 10.2 Upper limits at the 90% CL (top) and 95% CL (bottom) to the signal yield s for a Poissonian process using the Bayesian approach, as a function of the expected background b and for a number of observed events n from n = 0 to n = 10

10.6.2 Limitations of the Bayesian Approach

The derivation of Bayesian upper limits presented above assumes a uniform prior π(s) = const. on the expected signal yield. Assuming a different prior distribution would result in different upper limits. In general, there is no unique criterion to choose a specific prior PDF that models the complete lack of knowledge about a variable, in this case the signal yield, as already discussed in Sect. 3.7.
In searches for new signals, the signal yield may be related to other parameters of the theory (e.g. the mass of unknown particles, or specific coupling constants). In that case, should one choose a uniform prior for the signal yield or a uniform prior for the theory parameters? As already said, no unique prescription can be derived from first principles.
A possible approach is to choose several priors that reasonably model one's ignorance about the unknown parameters and to verify that the obtained upper limits are not too sensitive to the choice of the prior.

10.7 Frequentist Upper Limits

Frequentist upper limits can be computed by inverting the Neyman belt (see Sect. 7.2) for a parameter θ with fully asymmetric intervals for the observed quantity x, as illustrated in Fig. 10.3, which is the equivalent of Fig. 7.1 when adopting a fully asymmetric interval. In particular, assuming that the Neyman belt is monotonically increasing, the choice of intervals ]x^lo(θ₀), +∞[ for x as a function of θ₀ leads to a confidence interval [0, θ^up(x₀)[ for θ, given a measurement x₀ (Fig. 10.3), which corresponds to the upper limit:

θ < θ^up(x₀).   (10.19)

Most frequently, the parameter θ is a signal yield s or a signal strength μ.

Fig. 10.3 Graphical illustration of the Neyman belt construction for upper limit determination: the intervals ]x^lo(θ₀), +∞[ for x determine the interval [0, θ^up(x₀)[ for θ

10.7.1 Frequentist Upper Limits for Counting Experiments

Similarly to Sect. 10.6.1, the case of a counting experiment with negligible background is analyzed first. The probability to observe n events with an expectation s is given by a Poisson distribution:

$P(n;\, s) = \frac{e^{-s}\, s^n}{n!}\,.$   (10.20)

An upper limit to the expected signal yield s can be set using n as test statistic and excluding the values of s for which the probability (p-value) to observe n events or less is below α = 1 − CL. For n = 0 we have:

$p = P(0;\, s) = e^{-s}$   (10.21)

and the condition p > α gives:

$p = e^{-s} > \alpha\,,$   (10.22)

or, equivalently:

$s < -\log\alpha = s^{\mathrm{up}}\,.$   (10.23)

For α = 0.05 or α = 0.1, the upper limits are:

s < 3.00 at 95% CL,   (10.24)
s < 2.30 at 90% CL.   (10.25)

Those results accidentally coincide with the ones obtained under the Bayesian approach (Eqs. (10.15) and (10.16)). The numerical coincidence of upper limits computed under the Bayesian and frequentist approaches for the simple but common case of a counting experiment may lead to confusion. There is no intrinsic reason for which limits evaluated under the two approaches should coincide, and in general, with very few exceptions like this case, Bayesian and frequentist limits do not coincide numerically. Moreover, regardless of their numerical value, the interpretation of Bayesian and frequentist limits is very different, as already discussed several times.
Note that if the true value is s = 0, then the interval [0, s^up[ covers the true value with 100% probability, instead of the required 90% or 95%, similarly to what was observed in Example 7.23. The extreme overcoverage is due to the discrete nature of the counting problem and may appear as a counterintuitive feature.
of the counting problem, and may appear as a counterintuitive feature.

10.7.2 Frequentist Limits in Case of Discrete Variables

When constructing the Neyman belt for a discrete variable n, as in the Poissonian case, it is not always possible to find an interval {n^lo, …, n^up} that has exactly the desired coverage, because of the intrinsic discreteness of the problem. This issue was already introduced in Sect. 7.3, when discussing binomial intervals. For discrete cases, it is possible to take the smallest interval which has a probability greater than or equal to the desired confidence level. Upper limits determined in those cases are conservative, i.e. the procedure ensures that the probability content of the confidence belt is greater than or equal to 1 − α (overcoverage).
Figure 10.4 shows an example of a Poisson distribution corresponding to the case with s = 4 and b = 0. Using a fully asymmetric interval as ordering rule, the interval {2, 3, …} of the discrete variable n corresponds to a probability P(n ≥ 2) = 1 − P(0) − P(1) = 0.9084 and is the smallest interval which has a probability greater than or equal to the desired confidence level of 0.90: the interval {3, 4, …} would have a probability P(n ≥ 3) less than 90%, while enlarging the interval to {1, 2, 3, …} would produce a probability P(n ≥ 1) even larger than P(n ≥ 2).
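The numbers quoted above can be verified directly (a small Python sketch, standard library only):

```python
import math

def pois(n, mean):
    """Poisson probability P(n; mean)."""
    return math.exp(-mean) * mean**n / math.factorial(n)

s = 4.0  # expected signal, with b = 0
p_below = pois(0, s) + pois(1, s)   # P(n <= 1), about 0.092
p_interval = 1.0 - p_below          # P(n >= 2), about 0.908
print(round(p_below, 3), round(p_interval, 4))
# {2, 3, ...} is the smallest interval with probability >= 0.90:
# dropping n = 2 as well would leave less than 90%.
print(p_interval - pois(2, s))
```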

Fig. 10.4 Poisson distribution for s = 4 and b = 0, with P(n ≤ 1) = 0.092 and P(n ≥ 2) = 0.908. The white bins show the smallest possible fully asymmetric confidence interval ({2, 3, 4, …} in this case) that gives at least the required coverage of 90%

If n events are observed, the upper limit s^up is given by the inversion of the Neyman belt, which corresponds to:

$s^{\mathrm{up}} = \inf\left\{ s : \sum_{m=0}^{n} P(m;\, s) < \alpha \right\}\,.$   (10.26)

The simplest case with n = 0 gives the result shown in Sect. 10.7.1.
Consider a case with non-negligible background, b ≠ 0, and take again n as the test statistic. Even the assumption s = 0, or possibly unphysical values s < 0, could be excluded if data have large underfluctuations, which are improbable but possible, according to a Poisson distribution with expected number of events b, or b − |s| for a negative s, such that the p-value is less than the required 0.1 or 0.05. The possibility to exclude parameter regions where the experiment should be insensitive (s = 0), or even unphysical regions (s < 0), is rather unpleasant to a physicist.
From the pure frequentist point of view, moreover, this result potentially suffers from the flip-flopping problem (see Sect. 7.4): if we decide a priori to quote an upper limit as our final result, Neyman's construction with a fully asymmetric interval leads to the correct coverage, but if we choose to switch from fully asymmetric to central intervals in case a significant signal is observed, this produces an incorrect coverage.

10.7.3 Feldman–Cousins Unified Approach

The Feldman–Cousins approach, introduced in Sect. 7.5, provides a continuous
transition from central intervals to upper limits, avoiding the flip-flopping problem.
Moreover, it ensures that no unphysical parameter value (s < 0) is excluded.
In the Poissonian counting case, the 90% confidence belt obtained with the
Feldman–Cousins approach is shown in Fig. 10.5 for b = 3. The results in the case
of no background (b = 0) are reported in Table 10.3 for different values of the
number of observed events n. Figure 10.6 shows the value of the 90% CL upper
limit computed using the Feldman–Cousins approach as a function of the expected
background b for different values of n.
Comparing Table 10.3 with Table 10.2, which reports the Bayesian results,
Feldman–Cousins upper limits are in general numerically larger than Bayesian
limits, unlike the case considered in Sect. 10.7.1 of upper limits from a fully
asymmetric Neyman belt, where frequentist and Bayesian upper limits coincide. In
particular, for n = 0, the 90% CL upper limits are 2.30 (Bayesian) and 2.44
(Feldman–Cousins), and the 95% CL upper limits are 3.00 and 3.09, respectively.
Anyway, as remarked before, the numerical comparison of those upper limits should
not suggest any common interpretation: frequentist and Bayesian limits have very
different meanings.
A peculiar feature of Feldman–Cousins upper limits is that, for n = 0, a larger
expected background b corresponds to a more stringent, i.e. lower, upper limit, as

Fig. 10.5 Confidence belt at the 90% CL for a Poissonian process using the Feldman–Cousins
approach for b = 3

Table 10.3 Upper and lower limits in presence of negligible background (b = 0) with the
Feldman–Cousins approach

          1 − α = 90%         1 − α = 95%
  n       s_lo     s_up       s_lo     s_up
  0       0.00     2.44       0.00     3.09
  1       0.11     4.36       0.05     5.14
  2       0.53     5.91       0.36     6.72
  3       1.10     7.42       0.82     8.25
  4       1.47     8.60       1.37     9.76
  5       1.84     9.99       1.84    11.26
  6       2.21    11.47       2.21    12.75
  7       3.56    12.53       2.58    13.81
  8       3.96    13.99       2.94    15.29
  9       4.36    15.30       4.36    16.77
 10       5.50    16.50       4.75    17.82

can be seen in Fig. 10.6 (lowest curve). This feature is absent in Bayesian limits, which
do not depend on the expected background b for n = 0 (see Fig. 10.2).
This dependence of upper limits on the expected amount of background is
somewhat counterintuitive: imagine two experiments (say A and B) performing
a search for a rare signal. Both experiments are designed to achieve a very low
background level, but A can reduce the background level more than B, say b = 0.01
and b = 0.1 expected events for A and B, respectively. If both experiments observe
zero events, which is for both the most likely outcome, the experiment that achieves

Fig. 10.6 Upper limits at 90% CL on the signal s for a Poissonian process using the Feldman–
Cousins method, as a function of the expected background b and for a number of observed events n
from 0 to 10

the most stringent limit is the one with the larger expected background (B in this
case), i.e. the one which has the worse expected performance.
The Particle Data Group published in their review the following statement about
the interpretation of frequentist upper limits, in particular concerning the
difficulty of interpreting a more stringent limit for an experiment with a worse expected
background, in case no event is observed:
The intervals constructed according to the unified [Feldman–Cousins] procedure for a
Poisson variable n consisting of signal and background have the property that for n = 0
observed events, the upper limit decreases for increasing expected background. This is
counter-intuitive since it is known that if n = 0 for the experiment in question, then
no background was observed, and therefore one may argue that the expected background
should not be relevant. The extent to which one should regard this feature as a drawback is
a subject of some controversy [4].

This feature of frequentist limits, as well as the possibility to exclude parameter
values to which the experiment is not sensitive, as remarked at the end of the
previous section, are often considered unpleasant by physicists. The reason is that
human intuition tends to interpret upper limits and, more in general, confidence
intervals as corresponding to (Bayesian) probabilities of the signal hypothesis, even
when they are determined under the frequentist approach.
A modification of the pure frequentist approach that produces upper limits
with more intuitive features is discussed in the following Sect. 10.8.

Example 10.26 Can Frequentist and Bayesian Upper Limits Be ‘Unified’?

The coincidence of Bayesian and frequentist upper limits in the simplest
event counting case motivated an attempt by Zech [5] to reconcile
the two approaches, namely the limits obtained by Helene in [3] and the
frequentist approach.
Consider the superposition of two Poissonian processes having s and b
expected numbers of events from signal and background, respectively. Using
Eq. (2.55), the probability distribution for the total observed number of
events n can be written as:

$$
P(n;\, s, b) = \sum_{n_s = 0}^{n} P(n_s;\, s)\, P(n_b = n - n_s;\, b)\,, \tag{10.27}
$$

where P represents a Poissonian distribution.


Zech proposed to modify the background term of the sum in Eq. (10.27),
P(n_b; b), to take into account that the observation of n events puts a
constraint on the possible values of n_b, which can only range from 0 to n. In
this way, P(n_b; b) was replaced with:

$$
P'(n_b;\, b) = P(n_b;\, b) \Big/ \sum_{n_b' = 0}^{n} P(n_b';\, b)\,. \tag{10.28}
$$

This modification leads to the same result obtained by Helene in Eq. (10.17),
which apparently indicates a possible convergence of the Bayesian and frequentist
approaches.
This approach was later criticized by Highland and Cousins [6], who
demonstrated that the modification introduced by Eq. (10.28) produces an
incorrect coverage, and Zech himself admitted the nonrigorous application
of the frequentist approach [7].
This attempt could not provide a way to reconcile the Bayesian and
frequentist approaches, which, as said, have completely different interpretations.
Anyway, Zech’s intuition anticipated the formulation of the modified
frequentist approach that will be discussed in Sect. 10.8, which is nowadays
widely used in high-energy physics.

10.8 Modified Frequentist Approach: The CLs Method

The concerns about frequentist limits discussed at the end of Sects. 10.7.2 and 10.7.3
have been addressed with the definition of a procedure that was adopted for the first
time for the combination of the results obtained by the four LEP experiments, Aleph,
Delphi, Opal, and L3, in the search for the Higgs boson [8].

The approach consists of a modification of the pure frequentist approach with
the introduction of a conservative corrective factor to the p-value that cures the
aforementioned counterintuitive peculiarities. In particular, it avoids the possibility
to exclude, purely due to statistical fluctuations, parameter regions to which the
experiment is not sensitive, and, if zero events are observed, a higher expected
background does not correspond to a more stringent limit, as it does with the
Feldman–Cousins approach.
The so-called modified frequentist approach is illustrated in the following
using the test statistic adopted in the original proposal, introduced in Sect. 9.9,
which is the ratio of the likelihood functions evaluated under two different hypotheses:
the presence of signal plus background (H1, corresponding to the likelihood
function L_{s+b}), and the presence of background only (H0, corresponding to the
likelihood function L_b):

$$
\lambda(\vec{\theta}) = \frac{L_{s+b}(\vec{x};\, \vec{\theta})}{L_b(\vec{x};\, \vec{\theta})}\,. \tag{10.29}
$$
Different test statistics have been applied after the original formulation; in
particular, it is now common to use the profile likelihood (Eq. (9.27)). The method
described in the following is valid for any test statistic.
The likelihood ratio in Eq. (10.29) can also be written introducing the signal
strength μ separately from the other parameters of interest θ, as in Eq. (9.35):

$$
\lambda(\mu, \vec{\theta}) = e^{-\mu\, s(\vec{\theta})} \prod_{i=1}^{N}
\left( \frac{\mu\, s(\vec{\theta})\, f_s(\vec{x}_i;\, \vec{\theta})}{b(\vec{\theta})\, f_b(\vec{x}_i;\, \vec{\theta})} + 1 \right)\,, \tag{10.30}
$$

where the functions f_s and f_b are the PDFs for signal and background of the variables
x. The negative logarithm of the test statistic is given in Eq. (9.36), also reported
below:

$$
-\log \lambda(\mu, \vec{\theta}) = \mu\, s(\vec{\theta}) - \sum_{i=1}^{N} \log
\left( \frac{\mu\, s(\vec{\theta})\, f_s(\vec{x}_i;\, \vec{\theta})}{b(\vec{\theta})\, f_b(\vec{x}_i;\, \vec{\theta})} + 1 \right)\,. \tag{10.31}
$$

In order to quote an upper limit using the frequentist approach, the distribution
of the test statistic λ (or equivalently −2 log λ) in the hypothesis of signal plus
background has to be known, and the p-value corresponding to the observed value
λ̂ of the test statistic, denoted below as p_{s+b}, has to be determined as a function
of the parameters of interest μ and θ.
The proposed modification to the purely frequentist approach consists in finding
two p-values, corresponding to the H1 and H0 hypotheses (below, for simplicity
of notation, the set of parameters θ also includes μ, which is omitted):

$$
p_{s+b}(\vec{\theta}) = P_{s+b}(\lambda(\vec{\theta}) \le \hat{\lambda})\,, \tag{10.32}
$$

$$
p_b(\vec{\theta}) = P_b(\lambda(\vec{\theta}) \ge \hat{\lambda})\,. \tag{10.33}
$$

From those two probabilities, the following quantity can be derived [26]:

$$
\mathrm{CL}_s(\vec{\theta}) = \frac{p_{s+b}(\vec{\theta})}{1 - p_b(\vec{\theta})}\,. \tag{10.34}
$$

Upper limits are determined by excluding the range of the parameters of interest for
which CL_s(θ) is lower than α = 1 − CL, where the confidence level CL is typically
95% or 90%. For this reason, the modified frequentist approach is often referred to as the
CLs method.
In most cases, the probabilities P_{s+b} and P_b in Eqs. (10.32) and (10.33)
are not trivial to obtain analytically and are determined numerically using pseudo-experiments
generated by Monte Carlo. An example of the outcome of this
numerical approach is shown in Fig. 10.7.
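As an illustration, for a simple counting experiment the toy Monte Carlo procedure can be sketched as follows; the values of s, b, the number of pseudo-experiments, and the convention that equalities are counted in 1 − p_b (so that CLs reduces to a ratio of Poisson CDFs) are assumptions for this sketch:

```python
import math
import random

def toy_cls(s, b, n_obs, n_toys=200_000, seed=1):
    """CLs for a counting experiment from pseudo-experiments.

    The test statistic -2 ln(L_{s+b}/L_b) is monotonic in the count n,
    so the p-values can be read directly off the toy counts."""
    rng = random.Random(seed)

    def pois(mean):
        # Knuth's multiplication method, adequate for small means
        limit, k, p = math.exp(-mean), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    toys_sb = [pois(s + b) for _ in range(n_toys)]
    toys_b = [pois(b) for _ in range(n_toys)]
    p_sb = sum(n <= n_obs for n in toys_sb) / n_toys          # p_{s+b}
    one_minus_pb = sum(n <= n_obs for n in toys_b) / n_toys   # 1 - p_b
    return p_sb / one_minus_pb

# Hypothetical search: b = 3 expected background events, n_obs = 3 observed;
# s = 5 would be excluded at 95% CL if the returned CLs were below 0.05
print(toy_cls(5.0, 3.0, 3))
```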
The modified frequentist approach does not provide the desired coverage from
the frequentist point of view, but it does not suffer from the counterintuitive features
of frequentist upper limits, and has convenient statistical properties:
• It is conservative from the frequentist point of view: since 1 − p_b ≤ 1, we
have CL_s(θ) ≥ p_{s+b}(θ). Hence, the provided intervals overcover, and CLs
limits are less stringent than purely frequentist ones.


Fig. 10.7 Example of evaluation of CLs from pseudo-experiments. The distribution of the test
statistic −2 log λ is shown in blue assuming the signal-plus-background hypothesis and in red
assuming the background-only hypothesis. The black line shows the value of the test statistic
measured in data, and the shaded areas represent p_{s+b} (blue) and p_b (red). CLs is determined
as p_{s+b}/(1 − p_b)

• Unlike upper limits obtained using the Feldman–Cousins approach, if no event is
observed, CLs upper limits do not depend on the expected amount of background.
This feature is shared with Bayesian upper limits.
If the distributions of the test statistic λ (or equivalently −2 log λ) for the two
hypotheses H0 (b) and H1 (s + b) are well separated (as in Fig. 10.8, top), in case H1
is true, then p_b has a large chance to be very small; consequently 1 − p_b ≃ 1 and
CLs ≃ p_{s+b}. In this case, the CLs limit is almost identical to the purely
frequentist upper limit based on p_{s+b}.
If the two distributions have a large overlap (as in Fig. 10.8, bottom), this is an
indication that the experiment has low sensitivity to the signal. In this case, if p_b
is large due to a statistical fluctuation, then the denominator 1 − p_b in Eq. (10.34)
is small, preventing CLs from becoming too small; this protects against rejecting
hypotheses in cases where the experiment has poor sensitivity.

Fig. 10.8 Illustration of the application of the CLs method in case of well-separated distributions
of the test statistic −2 log λ for the s + b and b hypotheses (top) and in case of largely overlapping
distributions (bottom), where the experiment has poor sensitivity to the signal

For a simple Poissonian counting experiment with expected signal s and background
b, using the likelihood ratio from Eq. (9.39), it is possible to demonstrate
analytically that the CLs approach leads to a result identical to the Bayesian one
(Eq. (10.17)), which is also identical to the result of the method proposed by
Zech [5], discussed in Example 10.26.
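This identity gives, for the counting case, a simple numerical recipe: the CLs upper limit is the value of s for which the ratio of the two Poisson cumulative distributions equals α. A minimal sketch (the bisection bounds and the choice of CL are illustrative):

```python
import math

def pois_cdf(n, mean):
    """P(m <= n; mean) for a Poisson distribution."""
    return sum(math.exp(-mean) * mean**m / math.factorial(m) for m in range(n + 1))

def cls_counting(s, b, n_obs):
    """CLs for a Poissonian counting experiment; identical to the Bayesian
    (Helene) result with a uniform prior."""
    return pois_cdf(n_obs, s + b) / pois_cdf(n_obs, b)

def cls_upper_limit(b, n_obs, alpha=0.05):
    """Solve CLs(s) = alpha for s by bisection (CLs decreases with s)."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cls_counting(mid, b, n_obs) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# With n_obs = 0 the 95% CL limit is -ln(0.05) = 3.00, independent of b,
# as for the Bayesian limit
print(round(cls_upper_limit(0.0, 0), 2))  # 3.0
print(round(cls_upper_limit(3.0, 0), 2))  # 3.0
```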
In general, in many realistic applications, CLs upper limits are numerically
similar to Bayesian upper limits computed assuming a uniform prior, but of course the
Bayesian interpretation of upper limits cannot be applied to limits obtained using
the CLs approach.
On the other hand, the fundamental interpretation of limits obtained using the
CLs method is not obvious: it matches neither the frequentist nor the Bayesian
approach.

10.9 Presenting Upper Limits: The Brazil Plot

Under some hypothesis, typically the background-only one, the upper limit is a
random variable that depends on the observed data sample, and its distribution can
be predicted using Monte Carlo.
When presenting an upper limit as the result of a data analysis, it is often
useful to report, together with the observed upper limit, the expected value
of the limit and possibly its excursion interval, quantified by the percentiles that
correspond to ±1σ and ±2σ. These bands are conventionally colored in green and
yellow, respectively, which gives this kind of presentation the jargon name of
Brazil plot.
A typical example is shown in Fig. 10.9, which reports the observed and expected
limits on the signal strength μ = σ/σ_SM for Standard Model Higgs boson
production, with the ±1σ and ±2σ bands, as a function of the Higgs boson mass,
obtained with the 2011 and 2012 LHC data, reported by the ATLAS experiment [9].
The observed upper limit reasonably fluctuates within the expected band but
exceeds the +2σ band around a mass value of about 125 GeV, corresponding to
the presently measured value of the Higgs boson mass. This indicates a deviation from
the background-only hypothesis assumed in the computation of the expected limit.
In some cases, expected limits are presented assuming a nominal signal yield.
Those cases are sometimes called, in jargon, signal-injected expected limits.

Fig. 10.9 Example of upper limit reported as a Brazil plot in the context of the search for the Higgs
boson at the LHC by the ATLAS experiment. The expected limit on the signal strength μ = σ/σ_SM
is shown as a dashed line, surrounded by the ±1σ (green) and ±2σ (yellow) bands. The observed
limit is shown as a solid line. All mass values corresponding to a limit below μ = 1 (dashed
horizontal line) are excluded at the 95% confidence level. The plot is from [9] (open access)

10.10 Nuisance Parameters and Systematic Uncertainties

The test statistic may contain some parameters that are not of direct interest for our
measurement but are nuisance parameters needed to model the PDF of our data
sample, as discussed in Sect. 5.4.
In the following, the parameter set is split into two subsets: the parameters of interest,
θ = (θ1, …, θh), and the nuisance parameters, ν = (ν1, …, νl). For instance, if we
are only interested in the measurement of the signal strength μ, we have θ = (μ); if
we want to measure, instead, both the signal strength and a new particle’s mass m,
we have θ = (μ, m).

10.10.1 Nuisance Parameters with the Bayesian Approach

The treatment of nuisance parameters is well defined under the Bayesian approach
and was already discussed in Sect. 3.5. The posterior joint probability distribution
for all the unknown parameters is (Eq. (3.33)):

$$
P(\vec{\theta}, \vec{\nu} \mid \vec{x}) =
\frac{L(\vec{x};\, \vec{\theta}, \vec{\nu})\, \pi(\vec{\theta}, \vec{\nu})}
{\int L(\vec{x};\, \vec{\theta}^{\,\prime}, \vec{\nu}^{\,\prime})\, \pi(\vec{\theta}^{\,\prime}, \vec{\nu}^{\,\prime})\, \mathrm{d}^h\theta'\, \mathrm{d}^l\nu'}\,, \tag{10.35}
$$

where L(x; θ, ν) is, as usual, the likelihood function, and π(θ, ν) is the prior
distribution of the unknown parameters.
The probability distribution for the parameters of interest θ can be obtained as the
marginal PDF, integrating the joint PDF over all nuisance parameters:

$$
P(\vec{\theta} \mid \vec{x}) = \int P(\vec{\theta}, \vec{\nu} \mid \vec{x})\, \mathrm{d}^l\nu =
\frac{\int L(\vec{x};\, \vec{\theta}, \vec{\nu})\, \pi(\vec{\theta}, \vec{\nu})\, \mathrm{d}^l\nu}
{\int L(\vec{x};\, \vec{\theta}^{\,\prime}, \vec{\nu}^{\,\prime})\, \pi(\vec{\theta}^{\,\prime}, \vec{\nu}^{\,\prime})\, \mathrm{d}^h\theta'\, \mathrm{d}^l\nu'}\,. \tag{10.36}
$$

The only difficulty that may arise is the numerical integration in multiple
dimensions. A particularly performant class of algorithms in those cases is based
on Markov chain Monte Carlo [10], introduced in Sect. 4.8.
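As an illustration of the marginalization, the sketch below applies a simple Metropolis–Hastings algorithm to a counting experiment with a Gaussian-constrained background; all numerical inputs (n = 5, b′ = 2.0 ± 0.5, proposal widths, number of iterations) are arbitrary assumptions for this sketch:

```python
import math
import random

def log_posterior(s, b, n=5, b0=2.0, sigma_b=0.5):
    """log of Pois(n; s+b) x Gauss(b0; b, sigma_b), with a flat prior on s >= 0."""
    if s < 0 or b <= 0:
        return -math.inf
    nu = s + b
    return n * math.log(nu) - nu - 0.5 * ((b - b0) / sigma_b) ** 2

rng = random.Random(42)
s, b = 3.0, 2.0
samples_s = []
for i in range(200_000):
    s_new = s + rng.gauss(0.0, 0.5)
    b_new = b + rng.gauss(0.0, 0.2)
    # Metropolis acceptance test on the log posterior
    if math.log(rng.random()) < log_posterior(s_new, b_new) - log_posterior(s, b):
        s, b = s_new, b_new
    if i > 10_000:           # discard burn-in
        samples_s.append(s)  # keeping only s marginalizes over b

samples_s.sort()
ul90 = samples_s[int(0.90 * len(samples_s))]
print(ul90)  # 90% credible upper limit on s from the marginal posterior
```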

10.10.2 Hybrid Treatment of Nuisance Parameters

The treatment of nuisance parameters under the frequentist approach is more
difficult to perform rigorously with the test statistic L_{s+b}/L_b (Eq. (10.6)). Cousins
and Highland [11] proposed to adopt the same approach used in the Bayesian
treatment and to determine approximate likelihood functions by integrating Eqs. (9.31)
and (9.34) over the nuisance parameters:

$$
L_{s+b}(\vec{x}_1, \cdots, \vec{x}_N \mid \mu, \vec{\theta}) =
\int \frac{1}{N!}\, e^{-(\mu\, s(\vec{\theta}, \vec{\nu}) + b(\vec{\theta}, \vec{\nu}))}
\prod_{i=1}^{N} \left[ \mu\, s(\vec{\theta}, \vec{\nu})\, f_s(\vec{x}_i;\, \vec{\theta}, \vec{\nu})
+ b(\vec{\theta}, \vec{\nu})\, f_b(\vec{x}_i;\, \vec{\theta}, \vec{\nu}) \right] \mathrm{d}^l\nu\,,
\tag{10.37}
$$

$$
L_b(\vec{x}_1, \cdots, \vec{x}_N \mid \vec{\theta}) =
\int \frac{1}{N!}\, e^{-b(\vec{\theta}, \vec{\nu})}
\prod_{i=1}^{N} b(\vec{\theta}, \vec{\nu})\, f_b(\vec{x}_i;\, \vec{\theta}, \vec{\nu})\, \mathrm{d}^l\nu\,.
\tag{10.38}
$$

This so-called hybrid Bayesian–frequentist approach does not ensure exact
frequentist coverage [12]. It has been proven on simple models that the results are
numerically close to Bayesian limits computed assuming a uniform prior [13].
Likelihood functions determined with the hybrid approach were used in the
combined search for the Higgs boson at LEP [8] in conjunction with the modified
frequentist approach (see Sect. 10.8).

10.10.3 Event Counting Uncertainties

For an event counting problem, if the number of background events is known
with some uncertainty, the PDF of the background estimate b′ can be modeled as
a function of the true unknown expected background b, P(b′; b). The likelihood
functions, which depend on the parameter of interest s and the unknown nuisance
parameter b, can be written as:

$$
L_{s+b}(n, b';\, s, b) = \frac{e^{-(s+b)}\, (s+b)^n}{n!}\; P(b';\, b)\,, \tag{10.39}
$$

$$
L_b(n, b';\, b) = \frac{e^{-b}\, b^n}{n!}\; P(b';\, b)\,. \tag{10.40}
$$

In order to eliminate the dependence on the nuisance parameter b, the hybrid
likelihoods, using the Cousins–Highland approach, can be written as:

$$
L_{s+b}(n, b';\, s) = \int_0^\infty \frac{e^{-(s+b)}\, (s+b)^n}{n!}\; P(b';\, b)\, \mathrm{d}b\,, \tag{10.41}
$$

$$
L_b(n, b') = \int_0^\infty \frac{e^{-b}\, b^n}{n!}\; P(b';\, b)\, \mathrm{d}b\,. \tag{10.42}
$$

Assuming, as a simplified case, that P(b′; b) is a Gaussian function, the integration
can be performed analytically [14]. This approximation may be valid only if the
uncertainties on the background prediction are small. Otherwise, the PDF P(b′; b) may
extend to unphysical negative values of b, which are included in the integration
range.
In order to avoid such cases, the use of distributions whose range is limited
to positive values is often preferred. For instance, a log normal distribution (see
Sect. 2.10) is usually preferred to a plain Gaussian.
For such more complex cases, the integration should proceed numerically, with
potential computing performance penalties.
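A numerical integration of Eq. (10.41) for the counting case can be sketched as follows; here P(b′; b) is taken, for illustration, as a Gaussian truncated at b = 0, and the grid parameters are arbitrary choices:

```python
import math

def hybrid_lik_sb(n, s, b_prime, sigma_b, db=0.01, b_max=30.0):
    """Numerical integration of Eq. (10.41) with a Gaussian P(b'; b)
    truncated at b = 0 (midpoint rule on a uniform grid)."""
    total, norm = 0.0, 0.0
    b = db / 2
    while b < b_max:
        weight = math.exp(-0.5 * ((b_prime - b) / sigma_b) ** 2)
        total += math.exp(-(s + b)) * (s + b) ** n / math.factorial(n) * weight * db
        norm += weight * db  # normalization of the truncated Gaussian
        b += db
    return total / norm

# For sigma_b -> 0 the hybrid likelihood tends to the plain Poisson likelihood
plain = math.exp(-(1.0 + 2.0)) * (1.0 + 2.0) ** 4 / math.factorial(4)
print(round(hybrid_lik_sb(4, 1.0, 2.0, 0.01), 6), round(plain, 6))
```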

10.11 Upper Limits Using the Profile Likelihood

A test statistic that accounts for nuisance parameters and avoids the hybrid
Bayesian approach is the profile likelihood, defined in Eq. (9.27):

$$
\lambda(\mu) = \frac{L(\vec{x} \mid \mu, \hat{\hat{\vec{\nu}}}(\mu))}{L(\vec{x} \mid \hat{\mu}, \hat{\vec{\nu}})}\,, \tag{10.43}
$$

where μ̂ and ν̂ are the best fit values of μ and ν corresponding to the observed data
sample, and the double-hat value in the numerator is the best fit value of ν obtained
for a fixed value of μ. All parameters are treated as nuisance parameters, with the
exception of μ, which is the only parameter of interest in this case. A convenient
test statistic is:

$$
t_\mu = -2 \log \lambda(\mu)\,. \tag{10.44}
$$

A scan of t_μ as a function of μ reveals a minimum at the value μ = μ̂. The
minimum value t_μ(μ̂) is equal to zero by construction. An uncertainty interval for
μ can be determined, as discussed in Sect. 5.11.2, from the excursion of t_μ around
the minimum μ̂: the intersections of the curve with the straight line corresponding to
t_μ = 1 give the interval extremes.
The profile likelihood is introduced in order to satisfy the conditions required
by Wilks’ theorem (see Sect. 9.8), according to which, if μ corresponds to the true
value, then t_μ follows a χ² distribution with one degree of freedom.
Usually the addition of nuisance parameters broadens the shape of the profile
likelihood as a function of the parameter of interest μ, compared with the case
where nuisance parameters are not added. As a consequence, the uncertainty on
μ increases when nuisance parameters, which usually model sources of systematic
uncertainty, are included in the test statistic.
Compared with the Cousins–Highland hybrid method, the profile likelihood is
statistically more sound from the frequentist point of view. In addition, no numerical
integration is needed; integration is usually a more CPU-intensive task than the
minimizations required for the profile likelihood evaluation.
Given that the profile likelihood is based on a likelihood ratio, according to
the Neyman–Pearson lemma (see Sect. 9.5) it has optimal performance for what
concerns the separation of the two hypotheses assumed in the numerator and in the
denominator of Eq. (10.43).
The test statistic t_μ can be used to compute p-values corresponding to the various
hypotheses on μ, in order to determine upper limits or significances. Those p-values
can in general be computed by generating sufficiently large numbers of Monte Carlo
pseudo-samples but, in many cases, asymptotic approximations allow a much faster
evaluation, as will be discussed in Sect. 10.12.5.

10.12 Variations of the Profile-Likelihood Test Statistic

Different variations of the profile likelihood definition have been adopted for various
data analysis cases. A review of the most popular test statistics is presented in [2],
where approximate formulae, valid in the asymptotic limit of a large number of
measurements, are provided in order to simplify the computation. The main examples
are reported in the following.

10.12.1 Test Statistic for Positive Signal Strength

In order to enforce the condition μ ≥ 0, since a signal yield cannot have negative
values, the test statistic t_μ = −2 log λ(μ) in Eq. (10.44) can be modified as follows:

$$
\tilde{t}_\mu = -2 \log \tilde{\lambda}(\mu) =
\begin{cases}
-2 \log \dfrac{L(\vec{x} \mid \mu, \hat{\hat{\vec{\nu}}}(\mu))}{L(\vec{x} \mid \hat{\mu}, \hat{\vec{\nu}})} & \hat{\mu} \ge 0\,, \\[2ex]
-2 \log \dfrac{L(\vec{x} \mid \mu, \hat{\hat{\vec{\nu}}}(\mu))}{L(\vec{x} \mid 0, \hat{\hat{\vec{\nu}}}(0))} & \hat{\mu} < 0\,.
\end{cases}
\tag{10.45}
$$

In practice, the estimate of μ is replaced with zero if the best fit value μ̂ is negative,
which may occur in case of a downward fluctuation in data.

10.12.2 Test Statistic for Discovery

In order to assess the presence of a new signal, the hypothesis of a positive signal
strength μ is tested against the hypothesis μ = 0. This is done using the test statistic
t_μ = −2 log λ(μ) evaluated for μ = 0. The test statistic t_0 = −2 log λ(0), anyway,
may reject the hypothesis μ = 0 in case a downward fluctuation in data results
in a negative best-fit value μ̂. A modification of t_0 has been proposed that is
only sensitive to an excess in data that produces a positive value of μ̂ [2]:

$$
q_0 =
\begin{cases}
-2 \log \lambda(0) & \hat{\mu} \ge 0\,, \\
0 & \hat{\mu} < 0\,.
\end{cases}
\tag{10.46}
$$

The p-value corresponding to the test statistic q_0 can be evaluated using Monte Carlo
pseudo-samples that simulate only background events. The distribution of q_0 has a
Dirac delta component δ(q_0), i.e. a ‘spike’ at q_0 = 0, corresponding to all the
cases which give a negative μ̂.

10.12.3 Test Statistic for Upper Limits

Similarly to the definition of q_0, one may not want to consider upward fluctuations
in data as grounds to exclude a given value of μ in case the best fit value μ̂ is greater
than the assumed value of μ. In order to avoid those cases, the following modification
of t_μ has been proposed:

$$
q_\mu =
\begin{cases}
-2 \log \lambda(\mu) & \hat{\mu} \le \mu\,, \\
0 & \hat{\mu} > \mu\,.
\end{cases}
\tag{10.47}
$$

The distribution of q_μ presents a spike at q_μ = 0 corresponding to those cases which
give μ̂ > μ.

10.12.4 Higgs Test Statistic

Both cases considered for Eqs. (10.45) and (10.47) are taken into account in the test
statistic adopted for the Higgs search at the LHC:

$$
\tilde{q}_\mu =
\begin{cases}
-2 \log \dfrac{L(\vec{x} \mid \mu, \hat{\hat{\vec{\nu}}}(\mu))}{L(\vec{x} \mid 0, \hat{\hat{\vec{\nu}}}(0))} & \hat{\mu} < 0\,, \\[2ex]
-2 \log \dfrac{L(\vec{x} \mid \mu, \hat{\hat{\vec{\nu}}}(\mu))}{L(\vec{x} \mid \hat{\mu}, \hat{\vec{\nu}}(\hat{\mu}))} & 0 \le \hat{\mu} \le \mu\,, \\[2ex]
0 & \hat{\mu} > \mu\,.
\end{cases}
\tag{10.48}
$$

A null value replaces μ̂ in the denominator of the profile likelihood for the cases
where μ̂ < 0, as for t̃_μ, in order to protect against unphysical values of the signal
strength. In order not to spoil upper limit performance, as for q_μ, the cases
of upward fluctuations in data are not considered as evidence against the assumed
signal hypothesis, and, if μ̂ > μ, the test statistic is set to zero.

10.12.5 Asymptotic Approximations

For the test statistics presented in the previous sections, asymptotic approximations
have been computed and are discussed extensively in [2], using Wilks’ theorem and
approximate formulae due to Wald [15].
For instance, the asymptotic approximation for the significance when using the
test statistic for discovery q_0 (Eq. (10.46)) is:

$$
Z_0 \simeq \sqrt{q_0}\,. \tag{10.49}
$$
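For a counting experiment with known background b, the likelihood ratio in q_0 can be evaluated in closed form, q_0 = 2 [n ln(n/b) − (n − b)] for n > b (and 0 otherwise), so Eq. (10.49) gives the significance directly. A minimal sketch with illustrative inputs:

```python
import math
from statistics import NormalDist

def z0_counting(n, b):
    """Asymptotic discovery significance Z0 = sqrt(q0) for a counting
    experiment with known background b (Eqs. 10.46 and 10.49)."""
    if n <= b:
        return 0.0  # q0 = 0 for downward fluctuations
    q0 = 2.0 * (n * math.log(n / b) - (n - b))
    return math.sqrt(q0)

# Compare with the simple approximation s/sqrt(b) for n = 10, b = 5
print(round(z0_counting(10, 5.0), 2))       # 1.97
print(round((10 - 5) / math.sqrt(5.0), 2))  # 2.24
# One-sided p-value corresponding to Z0
print(round(1 - NormalDist().cdf(z0_counting(10, 5.0)), 3))  # 0.025
```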

10.12.6 Asimov Datasets

Several asymptotic approximations can be computed in terms of the so-called
Asimov dataset, defined as follows:

   We define the Asimov dataset such that when one uses it to evaluate the estimators for all
   parameters, one obtains the true parameter values [2].

In practice, the values of the random variables present in the dataset are set to their
respective expected values. In particular, all variables that represent yields in the
data sample (e.g. all entries in a binned histogram) are replaced with their expected
values, which may also be noninteger.
The use of a single representative dataset in asymptotic formulae avoids the
generation of typically very large sets of Monte Carlo pseudo-experiments, reducing
the computation time significantly.

While in the past Asimov datasets were used as a pragmatic and CPU-efficient
solution, a mathematical motivation was given in the evaluation of the asymptotic
formulae provided in [2].
The asymptotic approximation for the distribution of q̃_μ (Eq. (10.48)), for
instance, can be computed in terms of the Asimov dataset as follows:

$$
f(\tilde{q}_\mu \mid \mu) = \frac{1}{2}\, \delta(\tilde{q}_\mu) +
\begin{cases}
\dfrac{1}{2}\, \dfrac{1}{\sqrt{2\pi}}\, \dfrac{1}{\sqrt{\tilde{q}_\mu}}\, e^{-\tilde{q}_\mu/2}
& 0 < \tilde{q}_\mu \le \mu^2/\sigma^2\,, \\[2ex]
\dfrac{1}{\sqrt{2\pi}\,(2\mu/\sigma)}
\exp\left[ -\dfrac{1}{2}\, \dfrac{(\tilde{q}_\mu + \mu^2/\sigma^2)^2}{(2\mu/\sigma)^2} \right]
& \tilde{q}_\mu > \mu^2/\sigma^2\,,
\end{cases}
\tag{10.50}
$$

where δ(q̃_μ) is a Dirac delta function, modeling the cases in which the test statistic
is set to zero, and σ² = μ²/q̃_{μ,A} depends on the term q̃_{μ,A}, which is the value of
the profile likelihood test statistic −2 log λ evaluated at the Asimov dataset, setting the
nuisance parameters at their nominal values [16].
From Eq. (10.50), the median significance for the hypothesis of background only
can be written as the square root of the test statistic evaluated at the Asimov dataset:

$$
\mathrm{med}[\, Z_\mu \mid 0 \,] = \sqrt{\tilde{q}_{\mu,A}}\,. \tag{10.51}
$$

For a comprehensive treatment of asymptotic approximations, again, refer to [2].
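As an illustration, for a counting experiment the Asimov dataset of the discovery case amounts to setting the observed count to its expectation, n = s + b; evaluating q_0 there yields the median discovery significance reported in [2]:

```python
import math

def asimov_z0(s, b):
    """Median discovery significance from the Asimov dataset n = s + b:
    Z_A = sqrt(2 [ (s + b) ln(1 + s/b) - s ])."""
    n_asimov = s + b  # Asimov dataset: observed count set to its expectation
    q0_asimov = 2.0 * (n_asimov * math.log(n_asimov / b) - (n_asimov - b))
    return math.sqrt(q0_asimov)

# For s << b this tends to the familiar approximation s / sqrt(b)
print(round(asimov_z0(10.0, 100.0), 3))   # 0.984
print(round(10.0 / math.sqrt(100.0), 3))  # 1.0
print(round(asimov_z0(5.0, 1.0), 2))      # low-background case: larger difference
```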
For practical applications, asymptotic formulae are implemented in the RooStats
library [17], released within the ROOT [18] framework. A common software
interface to most of the commonly used statistical methods allows easily
switching from one test statistic to another and performing computations using either
asymptotic formulae or Monte Carlo pseudo-experiments. Bayesian as well as
frequentist methods are both implemented and can be easily compared.

Example 10.27 Bump Hunting with the L_{s+b}/L_b Test Statistic
A classic ‘bump-hunting’ case is studied in order to determine the signal
significance, using as test statistic the ratio of the likelihood functions in the
two hypotheses s + b and b. This test statistic was traditionally used by
experiments at LEP and at the Tevatron. The systematic uncertainty on the
expected background yield will be added in the following Example 10.28,
and finally the profile-likelihood test statistic will be used in Example 10.29
for a similar case.
The data model
A data sample, generated with Monte Carlo, is compared with two
hypotheses:
1. background only, assuming an exponential distribution;
2. background plus signal, with a Gaussian signal on top of the exponential
background.
The expected distributions in the two hypotheses are shown in Fig. 10.10,
superimposed on the data histogram, which is divided into N = 40 bins.


Fig. 10.10 Monte Carlo generated data sample superimposed on an exponential background
model (top) and on an exponential background model plus a Gaussian signal (bottom)


Likelihood function
The likelihood function for a binned distribution is the product of
Poisson distributions for the numbers of entries observed in each bin,
n⃗ = (n_1, …, n_N):

$$
L(\vec{n} \mid \vec{s}, \vec{b};\, \mu, \beta) = \prod_{i=1}^{N} \mathrm{Pois}(n_i \mid \beta\, b_i + \mu\, s_i)\,, \tag{10.52}
$$

where the expected distributions for signal and background are modeled as
s⃗ = (s_1, …, s_N) and b⃗ = (b_1, …, b_N), respectively, and are determined
from the expected binned signal and background distributions.
The normalizations of s⃗ and b⃗ are given by theory expectations, and
variations of the normalization scales are modeled with the extra parameters
μ and β for signal and background, respectively. The parameter of interest
is the signal strength μ; β has the same role as μ for the background yield
and is, in this case, a nuisance parameter.
For the moment, β = 1 will be assumed, which corresponds to a negligible
uncertainty on the expected background yield.
Test statistic
The test statistic based on the likelihood ratio L_{s+b}/L_b can be written as:

$$
q = -2 \log \left( \frac{L(\vec{n} \mid \vec{s}, \vec{b};\, \mu, \beta = 1)}{L(\vec{n} \mid \vec{s}, \vec{b};\, \mu = 0, \beta = 1)} \right)\,. \tag{10.53}
$$

The test statistic q in Eq. (10.53) is equal to the profile likelihood
−2 log λ(μ) in Eq. (10.43) up to a constant term, since, given that β is fixed,
no nuisance parameter is present:

$$
q = -2 \log \lambda(\mu) + 2 \log \lambda(0)\,. \tag{10.54}
$$

Significance evaluation
The distributions of the test statistic q for the background-only and for the
signal-plus-background hypotheses are shown in Fig. 10.11, where 100,000
pseudo-experiments have been generated in each hypothesis. The observed
value of q is closer to the bulk of the s C b distribution than to the b
distribution. The p-value corresponding to the background-only hypothesis
can be determined as the fraction of pseudo-experiments having a value of
q lower than the one observed in data.


In the distributions shown in Fig. 10.11, 375 out of 100,000 toy samples have
a value of q below the one corresponding to our data sample; hence the p-value
is 0.375%. Considering a binomial uncertainty, the p-value is determined
with an uncertainty of about 0.02%. A significance Z = 2.7 is determined from the
p-value using Eq. (10.1).
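The conversion from the toy fraction to a significance, and the binomial uncertainty on the p-value, can be reproduced with the standard library (a sketch mirroring Eq. (10.1)):

```python
import math
from statistics import NormalDist

p = 375 / 100_000                         # fraction of toys below the observed q
z = NormalDist().inv_cdf(1.0 - p)         # one-sided significance, Eq. (10.1)
dp = math.sqrt(p * (1.0 - p) / 100_000)   # binomial uncertainty on the p-value

print(round(z, 2))   # 2.67
print(round(dp, 5))  # 0.00019
```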

Fig. 10.11 Distribution of the test statistic $q$ for the background-only hypothesis (blue) and for the signal-plus-background hypothesis (red), shown as number of toys versus $q$ on a logarithmic scale. The value determined with the presented data sample (black arrow) is superimposed. p-values can be determined from the shaded areas of the two PDFs

The significance can also be approximately determined from the scan of the test statistic as a function of the parameter of interest $\mu$ in Fig. 10.12. The minimum value of $q$ is reached for $\mu = \hat{\mu} = 1.24$ and can be used to determine the significance in the asymptotic approximation. If the null hypothesis ($\mu = 0$, assumed in the denominator of Eq. (10.53)) is true, then Wilks' theorem holds, giving the approximate expression:

$$ Z \simeq \sqrt{-q_{\min}} = 2.7 \,, \qquad (10.55) $$

in agreement with the estimate obtained with the toy generation.

Considering the range of $\mu$ where $q$ exceeds the minimum value by not more than one unit, the uncertainty interval for $\mu$ can be determined as $\hat{\mu} = 1.24^{+0.49}_{-0.48}$, reflecting the very small asymmetry of the test statistic curve.
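The one-unit rule used above for the interval $\hat{\mu} = 1.24^{+0.49}_{-0.48}$ can be sketched as a grid scan; the parabola below is an illustrative stand-in for the real test statistic curve, not the curve from the example:

```python
import numpy as np

def scan_minimum_and_interval(q_values, mu_grid):
    # Best-fit mu at the minimum of the scan, and the interval where the
    # test statistic exceeds its minimum by no more than one unit.
    i = int(np.argmin(q_values))
    inside = mu_grid[q_values <= q_values[i] + 1.0]
    return mu_grid[i], inside.min(), inside.max()

mu_grid = np.linspace(0.0, 3.0, 3001)
q_values = ((mu_grid - 1.24) / 0.485) ** 2 - 7.29   # toy parabola, minimum at 1.24
mu_hat, lo, hi = scan_minimum_and_interval(q_values, mu_grid)
```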

236 10 Discoveries and Upper Limits

Note that in Fig. 10.12 the test statistic is zero for $\mu = 0$ and reaches a negative minimum for $\mu = \hat{\mu}$, while the profile likelihood (Eq. (10.43)) has a minimum value of zero.

Fig. 10.12 Scan of the test statistic $q$ as a function of the parameter of interest $\mu$

Example 10.28 Adding Systematic Uncertainty with the $L_{s+b}/L_b$ Approach

Let us modify Example 10.27 assuming the background normalization is known with a 10% uncertainty, corresponding to an estimate of the nuisance parameter $\beta$ of $\beta' \pm \delta\beta = 1.0 \pm 0.1$. The extreme cases where $\beta = 0.9$ or $\beta = 1.1$ are shown in Fig. 10.13, superimposed on the data histogram.
Assuming a generic distribution $P(\beta' \mid \beta)$ for the estimated background yield $\beta'$, given the true value $\beta$, the test statistic in Eq. (10.53) can be modified as follows to incorporate the effect of the uncertainty on $\beta$:

$$ q = -2 \ln \left( \frac{\displaystyle \sup_{-\infty < \beta < +\infty} L(\vec{n} \mid \vec{s}, \vec{b}, \mu, \beta, \beta')}{\displaystyle \sup_{-\infty < \beta < +\infty} L(\vec{n} \mid \vec{s}, \vec{b}, \mu = 0, \beta, \beta')} \right) \,, \qquad (10.56) $$


where the likelihood function is:

$$ L(\vec{n} \mid \vec{s}, \vec{b}, \mu, \beta, \beta') = \prod_{i=1}^{N} \mathrm{Pois}(n_i \mid \beta b_i + \mu s_i) \, P(\beta' \mid \beta) \,. \qquad (10.57) $$

Fig. 10.13 Toy data sample superimposed on an exponential background model (top) and on an exponential background model plus a Gaussian signal (bottom), adding a 10% uncertainty to the background normalization


Typical choices for $P(\beta' \mid \beta)$ are a Gaussian distribution, with the inconvenience that it could also lead to unphysical negative values of $\beta'$, or a log normal distribution (see Sect. 2.10), which constrains $\beta'$ to be positive.
The numerical evaluation can also be simplified assuming a uniform distribution of $\beta'$ within the given uncertainty interval³:

$$ q = -2 \ln \left( \frac{\displaystyle \sup_{0.9 \le \beta \le 1.1} L(\vec{n} \mid \vec{s}, \vec{b}, \mu, \beta)}{\displaystyle \sup_{0.9 \le \beta \le 1.1} L(\vec{n} \mid \vec{s}, \vec{b}, \mu = 0, \beta)} \right) \,. \qquad (10.58) $$

The scan of this test statistic is shown in Fig. 10.14. Compared with the case where no uncertainty was included, the shape of the test statistic curve is now broader, and the minimum is less deep, resulting in a larger uncertainty, $\mu = 1.40^{+0.61}_{-0.60}$, and a smaller significance, $Z = 2.3\sigma$.
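A numerical sketch of the bounded profiling in Eq. (10.58), illustrative only, using a simple grid scan over $\beta$ rather than a proper minimizer:

```python
import numpy as np

def q_profiled(n, s, b, mu, beta_range=(0.9, 1.1), n_scan=201):
    # Eq. (10.58): maximize the log-likelihood over beta in a bounded interval,
    # separately for the numerator (given mu) and the denominator (mu = 0).
    betas = np.linspace(*beta_range, n_scan)
    def sup_log_l(mu_value):
        lam = np.outer(betas, b) + mu_value * s     # (n_scan, n_bins) expected yields
        return np.max(np.sum(n * np.log(lam) - lam, axis=1))
    return -2.0 * (sup_log_l(mu) - sup_log_l(0.0))
```

A grid scan is the simplest choice for a bounded one-dimensional nuisance parameter; in practice a numerical minimizer over $\beta$ would be used instead.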

Fig. 10.14 Scan of the test statistic $q$ as a function of the parameter of interest $\mu$, including the systematic uncertainty on $\beta$ (red) compared with the case with no uncertainty (blue)

³ Remember that half the range of a uniform distribution is larger than the corresponding standard deviation by a factor $\sqrt{3}$, so $\beta' \pm \delta\beta = 1.0 \pm 0.1$ does not represent a $\pm 1\sigma$ interval but a $\pm\sqrt{3}\,\sigma$ interval. See Sect. 2.7, Eq. (2.28).
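The footnote's $\sqrt{3}$ relation between the half-range of a uniform distribution and its standard deviation can be checked with a quick, purely illustrative sampling experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_prime = rng.uniform(0.9, 1.1, size=1_000_000)  # beta' uniform in [0.9, 1.1]
half_range = 0.1
# The sample standard deviation approaches half_range / sqrt(3) ~ 0.0577
sigma_sample = beta_prime.std()
sigma_expected = half_range / np.sqrt(3.0)
```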

Example 10.29 Bump Hunting with Profile Likelihood


Similarly to Example 10.27, a pseudo-sample is randomly extracted according to a Gaussian signal with yield $s = 40$ centered at a value $m = 125$ GeV on top of an exponential background with yield $b = 100$, as shown in Fig. 10.15.

Fig. 10.15 Example of pseudo-experiment generated with a Gaussian signal on top of an exponential background (events versus $m$ in GeV). The assumed distribution for the background is shown as a red dashed line, while the distribution for signal plus background is shown as a blue solid line

This exercise was implemented using the ROOSTATS library [17] within the ROOT [18] framework.
In this case, the problem is treated with an unbinned likelihood function, and the signal yield $s$ is fitted from data.
For simplicity, all parameters in the model are fixed, i.e., are considered as constants known with negligible uncertainty, except the background yield, which is assumed to be known with some uncertainty $\sigma_\beta$, modeled with a log normal distribution.


The likelihood function for a single measurement $m$, according to this model, only depends on two parameters, $s$ and $\beta$, and has the following expression:

$$ L(m; s, \beta) = L_0(m; s, b' = b\,e^{\beta}) \, L_\beta(\beta; \sigma_\beta) \,, \qquad (10.59) $$

where:

$$ L_0(m; s, b') = \frac{e^{-(s+b')}}{n!} \left( s \, \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-(m-\mu)^2/2\sigma^2} + b' \, \lambda\, e^{-\lambda m} \right) \,, \qquad (10.60) $$

$$ L_\beta(\beta; \sigma_\beta) = \frac{1}{\sqrt{2\pi}\,\sigma_\beta} \, e^{-\beta^2/2\sigma_\beta^2} \,. \qquad (10.61) $$

$b$ is the background estimate, and $b'$ is the true value. For a set of measurements $\vec{m} = (m_1, \dots, m_N)$, the likelihood function can be written as:

$$ L(\vec{m}; s, \beta) = \prod_{i=1}^{N} L(m_i; s, \beta) \,. \qquad (10.62) $$
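A model of this kind can be sketched as an unbinned extended negative log-likelihood. The sketch below is illustrative, not the text's ROOSTATS implementation; the signal width (2 GeV) and the exponential slope (0.01) are assumed values, while the peak position 125 and the yields follow the example:

```python
import math

# Illustrative constants: Gaussian signal at mass 125 with width 2,
# exponential background with slope 0.01, nominal background yield 100.
MU_M, SIGMA_M, LAM, B_NOMINAL = 125.0, 2.0, 0.01, 100.0

def negative_log_likelihood(masses, s, beta, sigma_beta=0.3):
    # Unbinned extended NLL in the spirit of Eqs. (10.59)-(10.62): the background
    # yield is scaled as b' = b * exp(beta), and beta has a Gaussian constraint.
    b_prime = B_NOMINAL * math.exp(beta)
    nll = s + b_prime                           # extended (Poisson normalization) term
    for m in masses:
        signal = s * math.exp(-0.5 * ((m - MU_M) / SIGMA_M) ** 2) \
                 / (SIGMA_M * math.sqrt(2.0 * math.pi))
        background = b_prime * LAM * math.exp(-LAM * m)
        nll -= math.log(signal + background)
    nll += 0.5 * (beta / sigma_beta) ** 2       # constraint term on beta
    return nll
```

Minimizing this function over $s$ and $\beta$, and scanning it in $s$ with $\beta$ profiled, reproduces the kind of curves shown in Fig. 10.16.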

As test statistic, the profile likelihood $\lambda$ in Eq. (10.43) is considered. The scan of $-\ln \lambda(s)$ is shown in Fig. 10.16.⁴
The profile likelihood was first evaluated assuming $\sigma_\beta = 0$ (no uncertainty on $b$, blue curve), then assuming $\sigma_\beta = 0.3$ (red curve). The minimum value of $-\ln \lambda(s)$, unlike Fig. 10.14, is equal to zero.
Adding the uncertainty on $\beta$ (red curve), the curve is broadened, with a corresponding increase of the uncertainty on the estimate of $s$, which can be determined by the intersection of the curve with a horizontal line corresponding to $-\ln \lambda(s) = 0.5$ (green line).
The significance of the observed signal can be determined using Wilks' theorem. Assuming $\mu = 0$ (null hypothesis), the quantity $q_0 = -2 \ln \lambda(0)$ can be approximated with a $\chi^2$ with one degree of freedom, and the significance can be evaluated within the asymptotic approximation as:

$$ Z \simeq \sqrt{q_0} \,. \qquad (10.63) $$


$q_0$ is twice the intercept of the curve in Fig. 10.16 with the vertical axis, which gives $Z \simeq \sqrt{2 \times 6.66} = 3.66$ in case of no uncertainty on $b$, and $Z \simeq \sqrt{2 \times 3.93} = 2.81$ adding the uncertainty on $b$. The effect of the uncertainty on the background yield, in this example, reduces the significance below the '3$\sigma$ evidence' level.

Fig. 10.16 Negative logarithm of the profile likelihood as a function of the signal yield $s$. The blue curve is computed assuming a negligible uncertainty on the background yield, while the red curve is computed assuming a 30% uncertainty. The intersection of the curves with the green line at $-\ln \lambda(s) = 0.5$ determines the uncertainty intervals on $s$

⁴ $-\ln\lambda$ is the default visualization choice provided by the ROOSTATS library in ROOT, which differs by a factor of 2 with respect to the choice of $-2\ln\lambda$ adopted elsewhere.

10.13 The Look Elsewhere Effect

Many searches for new physical phenomena look for a peak in a distribution, typically a reconstructed particle's mass. In some cases, the location of the peak is known, as in searches for rare decays of a known particle, such as $B_s \to \mu^+\mu^-$. But this is not the case in the search for new particles, like the Higgs boson discovered at the LHC, whose mass is not predicted by theory.
If an excess in data, compared with the background expectation, is found at some mass value, the excess could be interpreted as a possible signal of a new resonance at the observed mass. However, the peak could be produced either by the presence of a real new signal or by a background fluctuation.
One way to compute the significance of the new signal is to use the p-value corresponding to the measured test statistic $q$, assuming a fixed value $m_0$ of the resonance mass $m$. In this case, the significance is called local significance.
Given the PDF $f(q \mid m, \mu)$ of the adopted test statistic $q$, the local p-value is:

$$ p(m_0) = \int_{q_{\mathrm{obs}}(m_0)}^{\infty} f(q \mid m_0, \mu = 0) \, \mathrm{d}q \,. \qquad (10.64) $$

$p(m_0)$ gives the probability that a background fluctuation at a fixed value of the mass $m_0$ results in a value of $q$ greater than or equal to the observed value $q_{\mathrm{obs}}(m_0)$.
The probability of a background fluctuation at any mass value in the range of interest is called global p-value and is in general larger than the local p-value. So, the local p-value is an underestimate, if interpreted as a global p-value.
In general, the reduction of significance, when moving from a local to a global evaluation because one or more parameters of interest are determined from data, is called the look elsewhere effect.
More in general, when an experiment is looking for a signal where one or more parameters of interest $\vec{\theta}$⁵ are unknown (e.g., both the mass and the width, or other properties of a new particle), in the presence of an excess in data with respect to the background expectation, the unknown parameter(s) can be determined from the data sample itself. The local p-value of the excess is:

$$ p(\vec{\theta}_0) = \int_{q_{\mathrm{obs}}(\vec{\theta}_0)}^{\infty} f(q \mid \vec{\theta}_0, \mu = 0) \, \mathrm{d}q \,, \qquad (10.65) $$

which would be an underestimate, if interpreted as a global p-value.

⁵ Here, as in Sect. 3.5.1, we denote the parameters of interest as $\vec{\theta}$. Other possible nuisance parameters are dropped for simplicity of notation. The signal strength parameter $\mu$ is written explicitly and is not included in the set of parameters $\vec{\theta}$.

The global p-value can be computed using, as test statistic, the largest value of the estimator over the entire parameter range:

$$ q_{\mathrm{glob}} = \sup_{\theta_i^{\min} < \theta_i < \theta_i^{\max},\; i = 1, \dots, m} q(\vec{\theta}, \mu = 0) = q(\hat{\vec{\theta}}, \mu = 0) \,, \qquad (10.66) $$

where $\hat{\vec{\theta}}$ denotes the set of parameters of interest that maximizes $q(\vec{\theta}, \mu = 0)$.
The global p-value can be determined from the distribution of the test statistic $q_{\mathrm{glob}}$ assuming background only, given the observed value $q_{\mathrm{obs}}^{\mathrm{glob}}$:

$$ p_{\mathrm{glob}} = \int_{q_{\mathrm{obs}}^{\mathrm{glob}}}^{\infty} f(q_{\mathrm{glob}} \mid \mu = 0) \, \mathrm{d}q_{\mathrm{glob}} \,. \qquad (10.67) $$

Even if the test statistic $q$ is derived, as usual, from a likelihood ratio, in this case Wilks' theorem cannot be applied, because the values of the parameters $\vec{\theta}$ are undefined for $\mu = 0$. Consider, for instance, a search for a resonance: in case of background only ($\mu = 0$), the test statistic would no longer depend on the resonance mass $m$. In this case, the two hypotheses assumed at the numerator and the denominator of the likelihood ratio considered in Wilks' theorem are not nested [19].
The distribution of $q_{\mathrm{glob}}$ from Eq. (10.66) can be computed with Monte Carlo samples. Large significance values, corresponding to very low p-values, require considerable sizes of the pseudo-samples, which demand large CPU time.

10.13.1 Trial Factors

An approximate way to determine the global significance, taking into account the look elsewhere effect, is reported in [20], relying on the asymptotic behavior of likelihood-ratio estimators.
The correction factor $f$ that needs to be applied to the local significance in order to obtain the global significance is called the trial factor.
The trial factor is related to the peak width, which may be dominated by the experimental resolution, if the intrinsic width is small. Empirical evaluations, when the mass is determined from data, give a factor $f$ typically proportional to the ratio of the search range to the peak width, times the local significance [21].
For a single parameter $m$, the global test statistic is (Eq. (10.66)):

$$ q_{\mathrm{glob}} = q(\hat{m}, \mu = 0) \,. \qquad (10.68) $$

It is possible to demonstrate [22] that the probability that the test statistic $q_{\mathrm{glob}}$ is greater than a given value $u$, used to determine the global p-value, is bounded by the

Fig. 10.17 Visual illustration of up-crossings of the curve $q(m)$, computed to determine $\langle N_{u_0} \rangle$. In this example, the number of up-crossings is $N_u = 3$

following inequality:

$$ p_{\mathrm{glob}} = P(q(\hat{m}, \mu = 0) > u) \le P(\chi^2 > u) + \langle N_u \rangle \,. \qquad (10.69) $$

The term $P(\chi^2 > u)$, related to the local p-value, is a cumulative $\chi^2$ distribution that comes from the asymptotic approximation, as a $\chi^2$ with one degree of freedom, of the local test statistic:

$$ q_{\mathrm{loc}}(m) = q(m, \mu = 0) \,. \qquad (10.70) $$

A test statistic based on the profile likelihood, $q = t_\mu$ (Eq. (10.44)) with $\mu = 0$, has been assumed in Eq. (10.69); in case of a test statistic for discovery $q_0$ (see Sect. 10.12.2), the term $P(\chi^2 > u)$ acquires an extra factor $1/2$. The inequality in Eq. (10.69) may be considered an equality, asymptotically.
The term $\langle N_u \rangle$ in Eq. (10.69) is the average number of up-crossings, i.e., the expected number of times the local test statistic curve $q_{\mathrm{loc}}(m)$ crosses a horizontal line at a given level $q = u$ with a positive derivative. An example of the evaluation of the number of up-crossings for a specific curve is visualized in Fig. 10.17. $\langle N_u \rangle$ can be evaluated using Monte Carlo as an average value over many samples.
The value of $\langle N_u \rangle$ could be very small, depending on the level $u$, and in those cases very large Monte Carlo samples would be required for a precise numerical evaluation. Fortunately, a scaling law allows one to extrapolate a value $\langle N_{u_0} \rangle$ evaluated at a different level $u_0$ to the desired level $u$:

$$ \langle N_u \rangle = \langle N_{u_0} \rangle \, e^{-(u - u_0)/2} \,. \qquad (10.71) $$

One can evaluate $\langle N_{u_0} \rangle$ by generating a not too large number of pseudo-experiments; then $\langle N_u \rangle$ can be determined using Eq. (10.71), preserving a good numerical precision.
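Counting up-crossings on a sampled curve and applying the scaling law of Eq. (10.71) can be sketched with two small helper functions (illustrative, operating on a curve evaluated on a grid of mass points):

```python
import math

def count_up_crossings(q_curve, u):
    # Number of times the sampled curve crosses the level u with a positive
    # derivative (from below to above), as illustrated in Fig. 10.17.
    return sum(1 for a, b in zip(q_curve, q_curve[1:]) if a < u <= b)

def scaled_up_crossings(n_u0, u0, u):
    # Scaling law of Eq. (10.71): <N_u> = <N_u0> exp(-(u - u0)/2)
    return n_u0 * math.exp(-(u - u0) / 2.0)
```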

In practice, one can move from local to global p-values using the following asymptotically approximated relation:

$$ p_{\mathrm{glob}} \simeq p_{\mathrm{loc}} + \langle N_{u_0} \rangle \, e^{-(u - u_0)/2} \,. \qquad (10.72) $$

Example 10.30 Simplified Look Elsewhere Calculation

An approximate evaluation of the look elsewhere effect may not even require the use of Monte Carlo, as shown in the following example by Gross [23].
Figure 10.18 shows the local p-value for the Higgs boson search at the LHC performed by ATLAS [9]. The p-value has its minimum close to the mass $m_H = 125$ GeV and corresponds to a local significance of about $5\sigma$, according to the red scale at the right of the plot.
Instead of generating Monte Carlo samples, an estimate of $\langle N_{u_0} \rangle$ can be obtained using the single observed test statistic curve as a function of $m_H$ and counting the number of up-crossings. As test statistic curve, we can take the p-value curve in Fig. 10.18 and express it as the equivalent significance level squared, $Z^2$.

Fig. 10.18 Local p-value as a function of the Higgs boson mass in the search for the Higgs boson at the LHC performed by ATLAS, combining the 2011 ($\sqrt{s} = 7$ TeV, 4.6–4.8 fb⁻¹) and 2012 ($\sqrt{s} = 8$ TeV, 5.8–5.9 fb⁻¹) data. The solid line shows the observed p-value, while the dashed line shows the median expected p-value according to the prediction for a Standard Model Higgs boson corresponding to a given mass value $m_H$. The plot is from [9] (open source)


As a convenient level, $u_0 = 0$ can be taken, corresponding to the red $0\sigma$ line, or equivalently to a p-value $p_0 = 0.5$. The number of times the black solid curve crosses the red dashed $0\sigma$ line with a positive derivative is equal to $N_0 = 9$, so we can determine the approximate estimate:

$$ \langle N_0 \rangle = 9 \pm 3 \,. \qquad (10.73) $$

$\langle N_u \rangle$ for $u \simeq 5^2$, corresponding to the minimum p-value, can be determined from $\langle N_0 \rangle$ using the scaling law in Eq. (10.71):

$$ \langle N_{5^2} \rangle = \langle N_0 \rangle \, e^{-(5^2 - 0)/2} = (9 \pm 3) \, e^{-25/2} \simeq (3 \pm 1) \times 10^{-5} \,. \qquad (10.74) $$

The local p-value, corresponding to $5\sigma$, is about $3 \times 10^{-7}$. From Eq. (10.72), the global p-value is, approximately:

$$ p_{\mathrm{glob}} \simeq 3 \times 10^{-7} + 3 \times 10^{-5} \simeq 3 \times 10^{-5} \,, \qquad (10.75) $$

which corresponds to a global significance of about $4\sigma$, to be compared with the local $5\sigma$. The trial factor is, within a 30% accuracy:

$$ f = \frac{p_{\mathrm{glob}}}{p_{\mathrm{loc}}} \simeq \frac{3 \times 10^{-5}}{3 \times 10^{-7}} = 100 \,. \qquad (10.76) $$
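The arithmetic of this example can be reproduced directly; `statistics.NormalDist` stands in for the p-value-to-significance conversion of Eq. (10.1):

```python
import math
from statistics import NormalDist

nd = NormalDist()
z_local = 5.0
p_local = 1.0 - nd.cdf(z_local)                 # local p-value of a 5 sigma excess, ~3e-7
n_u = 9.0 * math.exp(-(z_local ** 2) / 2.0)     # <N_u>: N_0 = 9 scaled from u0 = 0 to u = 25
p_global = p_local + n_u                        # Eq. (10.72), ~3e-5
z_global = nd.inv_cdf(1.0 - p_global)           # global significance, ~4 sigma
trial_factor = p_global / p_local               # ~100 within the quoted ~30% accuracy
```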

10.13.2 Look Elsewhere Effect in More Dimensions

In some cases, more than one parameter is determined from data. For instance, an experiment may measure both the mass and the width of a new resonance, neither of which is predicted by theory.
In more dimensions, the look elsewhere correction proceeds in a way similar to the one-dimensional case, but the test statistic now depends on more than one parameter.
Equation (10.69) and the scaling law in Eq. (10.71) are written in terms of the average number of up-crossings, which is only meaningful in one dimension. The generalization to more dimensions can be obtained by replacing the number of up-crossings with the Euler characteristic, which is equal to the number of disconnected components minus the number of 'holes' in the multidimensional sets of the parameter space defined by $q_{\mathrm{loc}}(\vec{\theta}) > u$ [24].
Examples of sets with different values of the Euler characteristic are shown in Fig. 10.19.

Fig. 10.19 Examples of sets with different values of the Euler characteristic $\varphi$, defined as the number of disconnected components minus the number of 'holes'. Top left: $\varphi = 1$ (one component, no holes); top right: $\varphi = 0$ (one component, one hole); bottom left: $\varphi = 2$ (two components, no holes); bottom right: $\varphi = 1$ (two components, one hole)

The expected value of the Euler characteristic $\langle \varphi(u) \rangle$ for a multidimensional random field has a dependence on the level $u$ that depends on the dimensionality $D$ as follows:

$$ \langle \varphi(u) \rangle = \sum_{d=0}^{D} N_d \, \rho_d(u) \,, \qquad (10.77) $$

where the functions $\rho_0(u), \dots, \rho_D(u)$ are characteristic of the specific random field.
For a $\chi^2$ field with $D = 1$, Eq. (10.77) gives the scaling law in Eq. (10.71). For a two-dimensional $\chi^2$ field, which is, for instance, the case when measuring from data both the mass and the width of a new resonance, Eq. (10.77) becomes:

$$ \langle \varphi(u) \rangle = \left( N_1 + N_2 \sqrt{u} \right) e^{-u/2} \,. \qquad (10.78) $$

The expected value $\langle \varphi(u) \rangle$ can be determined, typically with Monte Carlo, at two values of $u$, $u = u_1$ and $u = u_2$. Once $\langle \varphi(u_1) \rangle$ and $\langle \varphi(u_2) \rangle$ are determined, $N_1$ and $N_2$ can be found by inverting the system of two equations given by Eq. (10.78) for the two values $\langle \varphi(u_1) \rangle$ and $\langle \varphi(u_2) \rangle$.
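Inverting Eq. (10.78) for $N_1$ and $N_2$, given two measured values of the expected Euler characteristic, is a small linear-algebra exercise (illustrative sketch):

```python
import math

def euler_expectation(u, n1, n2):
    # Eq. (10.78): <phi(u)> = (N1 + N2 sqrt(u)) exp(-u/2)
    return (n1 + n2 * math.sqrt(u)) * math.exp(-u / 2.0)

def solve_n1_n2(u1, phi1, u2, phi2):
    # Two linear equations in N1, N2: N1 + N2 sqrt(u_i) = phi_i exp(u_i/2)
    a1 = phi1 * math.exp(u1 / 2.0)
    a2 = phi2 * math.exp(u2 / 2.0)
    n2 = (a2 - a1) / (math.sqrt(u2) - math.sqrt(u1))
    return a1 - n2 * math.sqrt(u1), n2
```

With $N_1$ and $N_2$ in hand, Eq. (10.78) can then be evaluated at the (typically much higher) level $u$ relevant for the observed excess.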

The magnitude of the look elsewhere effect in more dimensions may be important. A search for a new resonance decaying into two photons at the Large Hadron Collider by the ATLAS and CMS collaborations attracted considerable attention at the end of 2015 because of an excess corresponding to a resonance mass of about 750 GeV. The ATLAS collaboration quoted a local significance of $3.9\sigma$, but the look elsewhere effect, due to the measurement of both the mass and the width of the resonance from data, reduced the global significance to $2.1\sigma$ [25].

References

1. Wasserstein, R.L., Lazar, N.A.: The ASA’s statement on p-values: context, process, and
purpose. Am. Stat. 70, 129–133 (2016)
2. Cowan, G., Cranmer, K., Gross, E., Vitells, O.: Asymptotic formulae for likelihood-based tests
of new physics. Eur. Phys. J. C 71, 1554 (2011)
3. Helene, O.: Upper limit of peak area. Nucl. Instrum. Methods A 212, 319 (1983)
4. Amsler, C., et al.: The review of particle physics. Phys. Lett. B 667, 1 (2008)
5. Zech, G.: Upper limits in experiments with background or measurement errors. Nucl. Instrum.
Methods A 277, 608 (1989)
6. Highland, V., Cousins, R.: Comment on "Upper limits in experiments with background or measurement errors" [Nucl. Instrum. Methods A 277, 608–610 (1989)]. Nucl. Instrum. Methods A 398, 429 (1997)
7. Zech, G.: Reply to the comment on "Upper limits in experiments with background or measurement errors" [Nucl. Instrum. Methods A 277, 608–610 (1989)]. Nucl. Instrum. Methods A 398, 431 (1997)
8. Abbiendi, G., et al.: Search for the standard model Higgs boson at LEP. Phys. Lett. B 565,
61–75 (2003)
9. ATLAS Collaboration: Observation of an excess of events in the search for the standard model Higgs boson with the ATLAS detector at the LHC. ATLAS-CONF-2012-093 (2012). http://cds.cern.ch/record/1460439
10. Berg, B.: Markov Chain Monte Carlo Simulations and Their Statistical Analysis. World
Scientific, Singapore (2004)
11. Cousins, R., Highland, V.: Incorporating systematic uncertainties into an upper limit. Nucl.
Instrum. Methods A 320, 331–335 (1992)
12. Zhukov, V., Bonsch, M.: Multichannel number counting experiments. In: Proceedings of
PHYSTAT2011 (2011)
13. Blocker, C.: Interval estimation in the presence of nuisance parameters: 2. Cousins and Highland method. CDF/MEMO/STATISTICS/PUBLIC/7539 (2006). https://www-cdf.fnal.gov/physics/statistics/notes/cdf7539_ch_limits_v2.ps
14. Lista, L.: Including Gaussian uncertainty on the background estimate for upper limit calcula-
tions using Poissonian sampling. Nucl. Instrum. Methods A 517, 360 (2004)
15. Wald, A.: Tests of statistical hypotheses concerning several parameters when the number of
observations is large. Trans. Am. Math. Soc. 54, 426–482 (1943)
16. Asimov, I.: Franchise. In: Asimov, I. (ed.) The Complete Stories, vol. 1. Broadway Books, New
York (1990)
17. Grégory Schott for the RooStats Team: RooStats for searches. In: Proceedings of PHYSTAT2011 (2011). https://twiki.cern.ch/twiki/bin/view/RooStats
18. Brun, R., Rademakers, F.: ROOT—an object oriented data analysis framework. In: Proceedings AIHENP'96 Workshop, Lausanne (1996). Nucl. Instrum. Methods A 389, 81–86 (1997). http://root.cern.ch/

19. Ranucci, G.: The profile likelihood ratio and the look elsewhere effect in high energy physics.
Nucl. Instrum. Methods A 661, 77–85 (2012)
20. Gross, E., Vitells, O.: Trial factors for the look elsewhere effect in high energy physics. Eur.
Phys. J. C 70, 525 (2010)
21. Gross, E., Vitells, O.: Statistical Issues Relevant to Significance of Discovery Claims
(10w5068), Banff, Alberta, 11–16 July 2010
22. Davies, R.: Hypothesis testing when a nuisance parameter is present only under the alternative.
Biometrika 74, 33 (1987)
23. Gross, E.: Proceedings of the European School of High Energy Physics (2015)
24. Vitells, O., Gross, E.: Estimating the significance of a signal in a multi-dimensional search. Astropart. Phys. 35, 230–234 (2011)
25. ATLAS Collaboration: Search for resonances in diphoton events at $\sqrt{s} = 13$ TeV with the ATLAS detector. J. High Energy Phys. 09, 001 (2016)
26. Read, A.: Modified frequentist analysis of search results (the CLs method). In: Proceedings of
the 1st Workshop on Confidence Limits, CERN (2000)
Index

˛, see significance level deep learning, 192


ˇ2 , see kurtosis Asimov dataset, 231
, see unnormalized skewness asymmetric errors, 99, 110
1 , see skewness combination of, 123
2 , see excess asymptotic formulae for test statistics,
", see efficiency 231

. see Gaussian average value or signal average value


strength continuous case, 27
, see correlation coefficient discrete case, 12
 , see standard deviation or Gaussian standard in Bayesian inference, 69
deviation
 , see lifetime
˚, see Gaussian cumulative distribution back propagation, neural network, 191
2 background
distribution, 32 dependence of Feldman–Cousins upper
method, 114 limits, 218, 220
binned case, 119 determination from control regions,
in multiple dimensions, 132 129
random variable, 32, 114, 120 fluctuation for significance level, 205
Baker–Cousins, 120 in convolution and unfolding, 160
Neyman’s, 119 modeling in extended likelihood, 107
Pearson’s, 119 modeling with Argus function, 43
, see sample space rejection in hypothesis test, 176
3 evidence, 207 treatment in iterative unfolding, 171
5 observation, 207 uncertainty in significance evaluation, 209
uncertainty in test statistic, 227, 236
Baker–Cousins 2 , 120
activation function, 191 Bayes factor, 73
adaptive boosting, 198 Bayes’ theorem, 59
AI, artificial intelligence, 195 learning process, 67
alternative hypothesis, 175 Bayesian
Anderson–Darling test, 184 inference, 68
Argus function, 43 probability, 59, 64
artificial intelligence, 195 visual derivation, 60
artificial neural network, 181, 190 unfolding, 166

© Springer International Publishing AG 2017
L. Lista, Statistical Methods for Data Analysis in Particle Physics, Lecture Notes in Physics 941, DOI 10.1007/978-3-319-62840-0

BDT, see boosted decision trees Clopper–Pearson binomial interval, 147


Bernoulli CLs method, 221
probability distribution, 17 CNN, see convolutional neural network
random process, 16 coefficient of determination R2 , 117
random variable, 17 combination
Bertrand’s paradox, 7 of measurements, 129
best linear unbiased estimator, 133 principle, 136, 140
conservative correlation assumption, 137 conditional
intrinsic information weight, 136 distribution, 53
iterative application, 139 probability, 9
marginal information weight, 136 confidence
negative weights, 135, 137 interval, 100, 109, 143
relative importance, 136 level, 100
beta distribution, 83 conservative
bias, 102 CLs method, 221, 223
in maximum likelihood estimators, 113 correlation assumption, BLUE method, 137
bifurcated Gaussian, 124 interval, 147
bimodal distribution, 14 limit, 217
bin migration, 158 consistency of an estimator, 102
binned Poissonian fit, 120 control
binning, 118 region, 129
in convolution, 158 sample, 130
binomial convergence in probability, 22
coefficient, 18 ConvNet, see convolutional neural network
interval, 147 convolution, 155
probability distribution, 18, 147 convolutional neural network, 194
Poissonian limit, 40 Fourier transform, 156
random process, 17 convolutional neural network, 193
random variable, 18 feature map, 194
BLUE, see best linear unbiased estimator local receptive fields, 194
boosted decision trees, 181 correlation coefficient, 14
adaptive boosting, 198 counting experiment, 208, 212, 216, 227
boosting, 198 Cousins–Highlands method, 227
cross entropy, 196 covariance, 14
decision forest, 197 matrix, 14
Gini index, 196 coverage, 100
leaf, 196 Cramér–Rao bound, 102
node, 196 Cramér–von Mises test, 184
boosting, boosted decision trees, 198 credible interval, 70
Box–Muller transformation, 89 cross entropy, decision tree, 196
Brazil plot, 225 Crystal Ball function, 44
breakdown point, robust estimator, 103 cumulative distribution, 28
Breit–Wigner cut, 176
non-relativistic distribution, 41
relativistic distribution, 42
data sample, 99
decision
Cauchy distribution, 41 forest, 197
central tree, 196
interval, 70 deep learning, artificial neural network, 192
limit theorem, 46 degree of belief, 65
value, 99 dices, 4–6, 16, 21
chaotic regime, 82 differential probability, 26
classical probability, 4 discovery, 205, 207, 208

distribution, see probability distribution flat (uniform) distribution, 6, 30


dogma, extreme Bayesian prior, 66 flip-flopping, 150
drand48 function from C standard library, 84 forest, boosted decision trees, 197
Fourier transform of PDF convolution,
156
efficiency frequentist
hit-or-miss Monte Carlo, 90 inference, 100
of a detector, 10, 158 probability, 3, 22
estimate, 104 full width at half maximum, 31, 41
of an estimator, 102 fully asymmetric interval, 70
elementary event, 4, 6, 9 FWHM, full width at half maximum,
equiprobability, 4, 6, 25 31
ergodicity, 94
error
of a measurement gamma function, 33, 71
Bayesian approach, 70 Gaussian
frequentist approach, 99 average value,
, 31
of the first kind, 177 bifurcated, 124
of the second kind, 177 contours in two dimensions, 55
propagation cumulative distribution, 31
Bayesian case, 79 distribution, 31
frequentist case, 121 in more dimensions, 54
simple cases, 121 intervals, 32, 58
estimate, 68, 97, 99, 100 likelihood function, 108
estimator, 100 random number generator, 89
efficiency, 102 central limit theorem, 88
maximum likelihood, 105 standard deviation,  , 31
properties, 101 generator, see pseudorandom number generator
robust, 103 Gini index, decision tree, 196
Euler characteristic, 246 global significance level, 242
event, 2 goodness of fit, 33, 118, 120
counting experiment, 187, 206 gsl_rng_rand function from GSL library,
elementary, 4, 6, 9 84
in physics, 105
in statistics, 2
independent, 10 Hastings ratio, 93
evidence histogram, 119
3 significance level, 207 convolution, 158
Bayes factor, 73 in Asimov dataset, 231
excess, 15 PDF approximation, 182
exclusion, 211 hit-or-miss Monte Carlo, 90
expected value. see average value homogeneous Markov chain, 93
exponential distribution, 34 Hui’s triangle, 18, 23
random number generator, 87 hybrid frequentist approach, 227
extended likelihood function, 106, 186 hypothesis test, 175

fast Fourier transform, 156 IID, independent identically distributed


feature map, convolutional neural network, 194 random variables, 82
feedforward multilayer perceptron, 190 IIW, intrinsic information weight, BLUE
Feldman–Cousins unified intervals, 152 method, 136
FFT. see fast Fourier transform importance sampling, 91
Fisher information, 75, 102, 136 improper prior distribution, 76
Fisher’s linear discriminant, 178 incomplete Gamma function, 41

independent linear regression, 115


and identically distributed random local
variables, 82, 106 receptive fields, 194
events, 10 significance level, 210, 242
random variables, 50 log normal distribution, 33
inference, 97 logistic map, 82
Bayesian, 68 look elsewhere effect, 210, 242
intersubjective probability, 75 in more dimensions, 246
intrinsic information weight, BLUE method, Lorentz distribution, 41
136 loss function, 191
invariant prior, see Jeffreys’ prior lower limit, 70
iterative unfolding, 166 lrand48 function from C standard library, 84

Jeffreys’ prior, 75 machine learning, 188


joint probability distribution, 49 observation, 189
supervised, 188
kernel function, see response function unsupervised, 188
Kolmogorov distribution, 183 marginal
Kolmogorov–Smirnov test, 182 distribution, 49
kurtosis, 15 information weight, BLUE method, 136
coefficient, 15 Markov chain, 93
homogeneous, 93
Monte Carlo, 69, 93
L’Ecuyer pseudorandom number generator, 84 maximum likelihood
L-curve, 165 estimator, 105
Lüscher pseudorandom number generator, 84 bias, 113
Landau distribution, 46 properties, 112
large numbers, law of, 21 method, 69, 105
law uncertainty, 109
of large numbers, 21 MC, see Monte Carlo
of total probability, 11 MCMC, Markov chain Monte Carlo, 93
leaf, decision tree, 196 median, 14, 28, 103
learning Mersenne-Twistor pseudorandom number
process in Bayesian probability, 67 generator, 84
rate parameter, artificial neural network, Metropolis–Hastings
191 algorithm, 93
least squares method, 114 proposal distribution, 93
lifetime, 35, 39 ratio, 95
Bayesian inference, 76 minimum
Jeffreys prior, 77 2 method, see 2 method
maximum likelihood estimate, 112 variance bound, 102
measurement combination, 140 M INUIT, 106, 110
likelihood misidentification probability, 176
function, 67, 105 MIW, marginal information weight, BLUE
extended, 106, 186 method, 136
Gaussian, 108 mode, 14, 28
in Bayesian probability, 67 modified frequentist approach, 221
ratio Monte Carlo method, 6, 46, 69, 81
discriminant, 181 hit-or-miss, 90
in search for new signals, 185, 209 numerical integration, 92
projective discriminant, 182 sampling, 89
test statistic in Neyman–Pearson multilayer perceptron, 190
lemma, 181 multimodal distribution, 14, 28

multinomial distribution, 20
multivariate analysis, 178
MVA, multivariate analysis, 178

negative weights, 135, 137
nested hypotheses, see Wilks' theorem
neural network, see artificial neural network
Neyman
  confidence belt
    binomial case, 147
    construction, 144
    Feldman–Cousins, 152, 218
    Gaussian case, 146
    inversion, 146
  confidence intervals, 215
Neyman's χ², 119
Neyman–Pearson lemma, 181
node, decision tree, 196
normal
  distribution, see Gaussian distribution
  random variable, 31
normalization condition, 9, 26
nuisance parameter, 69, 98, 226, 227
null hypothesis, 175

observation in machine learning, 189
observation, 5
  significance level, 207
odds
  posterior, 64, 73
  prior, 73
ordering rule, 144
outlier, 103
overcoverage, 100, 144, 146, 147, 217

p-value, 118, 206
parameter
  estimate, 100
    Bayesian, 68
  nuisance, 226, 227
  of interest, 69, 98
Pascal's triangle, 23
PDF, see probability distribution function
Pearson's χ², 119
percentile, 28
period of a pseudorandom number generator, 84
POI, parameter of interest, 69, 98
Poisson distribution, 35
  Gaussian limit, 40
Poissonian, see Poisson distribution
  random variable, 35
pooling, convolutional neural network, 194
posterior
  odds, 64, 73
  probability, 60, 65, 67
prior
  odds, 73
  probability, 60, 65
    distribution, 67
    distribution, improper, 76
    distribution, uniform, 71, 74
    subjective choice, 74
    uninformative, 69, 75
probability, 2
  axiomatic definition, 8
  Bayesian, 3, 4
  classical, 4
  density, 25
  dice rolls, 5
  distribution, 9, 25
    χ², 32
    Bernoulli, 17
    beta, 83
    bimodal, 14
    binomial, 18
    Breit–Wigner, non-relativistic, 41
    Breit–Wigner, relativistic, 42
    cumulative, 28
    exponential, 34
    Gaussian, 31
    Gaussian, in more dimensions, 54
    joint, 49
    Landau, 46
    log normal, 33
    Lorentz, 41
    marginal, 49
    multimodal, 14, 28
    multinomial, 20
    normal, 31
    Poissonian, 35
    standard normal, 31
    uniform, 6, 30
  distribution function, 26
    in more dimensions, 49
  frequentist, 3, 22
  posterior, 60, 65, 67
  prior (see prior probability)
  theory, 2
profile likelihood, 185, 228
projective likelihood ratio discriminant, 182
pseudorandom number, 81
  generator, 82
    drand48 function, 84
    gsl_rng_rand function, 84
    lrand48 function, 84

    exponential, 87
    from cumulative inversion, 86
    Gaussian, Box–Muller, 89
    Gaussian, central limit theorem, 88
    L'Ecuyer, 84
    Lüscher, 84
    Mersenne-Twister, 84
    period, 84
    RANLUX, 84
    seed, 84
    uniform, 84
    uniform on a sphere, 87
purity, 64, 196

quantile, 28

R², coefficient of determination, 117
RANLUX pseudorandom number generator, 84
random
  number, see pseudorandom number
  number generator, see pseudorandom number generator
  process, 2
  variable, 4
    χ², 32
    Bernoulli, 17
    binomial, 18
    exponential, 34
    Gaussian, 31
    independent, 50
    log normal, 33
    normal, 31
    Poissonian, 35
    standard normal, 31
    uncorrelated, 14
    uniform, 6, 30
Random forest, 197
rate parameter, 35
receiver operating characteristic, 177, 200
rectified linear units, convolutional neural network, 195
reference analysis, 76
reference prior, 76
regularization strength, 164
regularized unfolding, 163
relative importance, BLUE method, 136
religious belief, Bayesian extreme probability, 66
ReLU, rectified linear units, 195
repeatable experiment, 2
residuals, 114
resonance, 41
response
  function, 155
    Gaussian case, 157
  matrix, 158
RMS, see root mean square
robust estimator, 103
ROC curve, receiver operating characteristic, 177
ROOT, 106, 173
root mean square, 13
RooUnfold, 173

sample space, 8
seed, pseudorandom number generator, 84
selection, 176
  efficiency, 176
shortest interval, 70
sigmoid function, 191
signal
  exclusion, 211
  region, 129
  strength, 185, 186
signal-injected expected limit, 225
significance level, 177, 205, 207–210
simultaneous fit, 130
singular value decomposition, 171
skewness, 14
  unnormalized, 15, 126
smearing, 156
sources of systematic uncertainty, 100
standard
  deviation, 13
    continuous case, 27
  normal
    distribution, 31
    random variable, 31
statistical uncertainty, 100
subjective probability, see Bayesian probability
supervised machine learning, 188
SVD, see singular value decomposition
symmetric interval, 70
systematic uncertainty, 99, 226, 227
  sources, 100

Tartaglia's triangle, 23
test
  sample, 189
  statistic, 176
    for discovery, 230
    for Higgs boson search, 231
    for positive signal strength, 230
    for upper limits, 230

Tikhonov regularization, 164
total probability, law of, 11
toy Monte Carlo, 120
training, 188
  sample, 180, 182, 188
transformation of variables, 15, 29, 121
  Bayesian posterior, 79
trial factor, 243
trimmed average, 103
TUnfold, 173
type-I error, 177
type-II error, 177

uncertainty, 68, 70, 97, 99
  interval, 99
    Bayesian, 70
    frequentist, 100
  with maximum likelihood method, 109
uncorrelated random variables, 14
undercoverage, 100
underfluctuation, 211
unfolding, 155
  L curve, 165
  Bayesian, 166
  bin-to-bin correction factors, 163
  in more dimensions, 173
  iterative, 166
  regularization strength, 164
  regularized, 163
  response matrix inversion, 160
  singular value decomposition, 171
  Tikhonov regularization, 164
unified intervals, Feldman–Cousins, 152, 218
uniform
  distribution, 6, 30
  random number generator, 84
uninformative prior, 69, 75
unknown parameter, 97
unnormalized skewness, 15, 126
unsupervised machine learning, 188
upcrossing, 244
upper limit, 70, 211, 215

variance
  continuous case, 27
  discrete case, 13

weighted average, 131
Wilks' theorem, 120, 184
  nested hypotheses, 184

Z, see significance level
