
2022 ASIA EUROPE PACIFIC SCHOOL OF HIGH-ENERGY PHYSICS

Practical Statistics for Particle Physicists

Nicolas Berger (LAPP Annecy), Lecture 1

Statistics are everywhere

“There are three types of lies: lies, damned lies, and statistics.” – Benjamin Disraeli

Credits: mattbuck / wikimedia; StatLab

And Physics ?

“If your experiment needs statistics, you ought to have done a better experiment.” – E. Rutherford

Introduction

Statistical methods play a critical role in many areas of physics

Higgs discovery : “We have 5σ” !

Phys. Lett. B 716 (2012) 1-29

Introduction

Sometimes difficult to distinguish a bona fide discovery from a background fluctuation…

New Physics ? 3.9σ ? 2.1σ ?

A few months later...

JHEP 09 (2016) 1

Introduction
Precision measurements are another window into BSM effects
→ How to compute (and interpret) measurement intervals
→ How to model systematic uncertainties ?
→ How to get the smallest achievable uncertainties ?

Image credits: CERN courier, LHCb


Lecture Plan

Statistics basic concepts (Today)
• [Basic ingredients (PDFs, etc.)]
• Statistical Modeling (PDFs for particle physics measurements)
• Parameter estimation (maximum likelihood, least-squares, …)

Computing statistical results (Tomorrow)
• Model testing (χ² tests, hypothesis testing, p-values, …)
• Discovery testing
• Confidence intervals
• Upper limits

Systematics and further topics (Saturday)
• Systematics and profiling
• [Bayesian techniques]

Disclaimer: the examples and methods covered in the lectures will be biased towards LHC techniques (generally close to the state of the art anyway).

The class will be based on both lectures and hands-on tutorials.



Randomness in High-Energy Physics

(More details in other lectures!)

Experimental data is produced by incredibly complex processes:
• Hard scattering
• PDFs, Parton shower, Pileup
• Decays
• Detector response
• Reconstruction

Image Credits: S. Höche, SLAC-PUB-16160

Randomness is involved in all stages:
→ Classical randomness: detector response
→ Quantum effects in particle production, decay

Measurement Errors: Energy measurement

Example: measuring the energy of a photon (γ) in a calorimeter: the energy deposit is recorded by the calorimeter readout.

Perfect case vs. real life: in real life, one must also measure the leakage behind the calorimeter and the leakage into neighboring cells.

Cannot predict the measured value for a given event
⇒ Random process ⇒ Need a probabilistic description

Quantum Randomness: H→ZZ*→4l

Phys. Rev. D 91, 012006
http://www.phdcomics.com/comics/archive.php?comicid=1489

Rare process: expect 1 signal event every ~6 days

“Will I get an event today ?” → only a probabilistic answer

Statistical Modeling

Probability Distributions

Probabilistic treatment of possible outcomes ⇒ Probability Distribution

Example: two-coin toss, counting the number of heads (i = 0, 1, 2)
→ Fractions of events in each bin i converge to a limit Pi

Probability distribution : { Pi } for i = 0, 1, 2

Properties:
• Pi ≥ 0
• Σi Pi = 1

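Not part of the original slides: a minimal numpy sketch of the two-coin example (assuming a fair coin), checking that the per-bin fractions converge to {P0, P1, P2} = {1/4, 1/2, 1/4}.

import numpy as np

rng = np.random.default_rng(seed=1)

# Two-coin toss: count the number of heads (0, 1 or 2) in each trial
n_trials = 100_000
heads = rng.integers(0, 2, size=(n_trials, 2)).sum(axis=1)

# Fractions of events in each bin converge to P_i = {1/4, 1/2, 1/4}
fractions = np.bincount(heads, minlength=3) / n_trials
print("observed fractions:", fractions)        # ~ [0.25, 0.50, 0.25]
print("sum of fractions  :", fractions.sum())  # exactly 1 by construction
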
Continuous Variables: PDFs

Continuous variable: can consider per-bin probabilities pi, i = 1 .. nbins

Bin size → 0 : Probability distribution function P(x)
→ High PDF value ⇒ high chance to get a measurement there
→ P(x) ≥ 0, ∫ P(x) dx = 1

Generalizes to multiple variables:
→ P(x, y) ≥ 0, ∫ P(x, y) dx dy = 1

(Figure: contours of P(x, y) in the (x, y) plane)

PDF Properties: Mean

E(X) = ⟨X⟩ : mean of X – the expected outcome on average over many measurements

⟨X⟩ = Σi xi Pi   or   ⟨X⟩ = ∫ x P(x) dx

→ Property of the PDF.

For measurements x1 ... xn, one can compute the sample mean:

x̄ = (1/n) Σi xi

→ Property of the sample
→ Approximates the PDF mean.

PDF Properties: (Co)variance

Variance of X:   Var(X) = ⟨ (X − ⟨X⟩)² ⟩
→ Average square of the deviation from the mean
→ RMS(X) = √Var(X) = σX : standard deviation

Can be approximated by the sample variance:

σ̂² = 1/(n−1) Σi (xi − x̄)²

Covariance of X and Y:   Cov(X, Y) = ⟨ (X − ⟨X⟩)(Y − ⟨Y⟩) ⟩
→ Large if variations of X and Y are “synchronized”
(Figures: scatter plots with Cov(x, y) > 0, Cov(x, y) < 0 and Cov(X, Y) = 0)

Correlation coefficient:   ρ = Cov(X, Y) / √(Var(X) Var(Y)),   with −1 ≤ ρ ≤ 1

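Not from the slides: a small numpy illustration of the sample estimates above (sample mean, sample variance with the 1/(n−1) convention, covariance and correlation coefficient), using made-up correlated toy data.

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y is partially correlated with x
x = rng.normal(loc=10.0, scale=2.0, size=10_000)
y = 0.5 * x + rng.normal(scale=1.0, size=x.size)

x_mean = x.mean()                    # sample mean, approximates <X>
x_var  = x.var(ddof=1)               # sample variance, 1/(n-1) convention
cov_xy = np.cov(x, y)[0, 1]          # sample covariance Cov(X, Y)
rho    = np.corrcoef(x, y)[0, 1]     # correlation coefficient, -1 <= rho <= 1

print(f"mean = {x_mean:.2f}, variance = {x_var:.2f}  (true values: 10, 4)")
print(f"Cov(x, y) = {cov_xy:.2f}, rho = {rho:.2f}")
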
“Linear” vs. “non-linear” correlations

For non-Gaussian cases, the correlation coefficient ρ is not the whole story:

tan 2α = 2ρσ1σ2 / (σ1² − σ2²)   (tilt angle α of the correlation ellipse)

(Figure source: Wikipedia)

In particular, variables can still be correlated even when ρ = 0 : “non-linear” correlations.

Some vocabulary...

X, Y... are Random Variables (continuous or discrete), a.k.a. observables:
→ X can take any value x, with probability P(X=x).
→ P(X=x) is the PDF of X, a.k.a. the Statistical Model.
→ The observed data is one value xobs of X, drawn from P(X=x).

Gaussian PDF

Gaussian distribution:

G(x; X0, σ) = 1/(σ√(2π)) · exp( −(x − X0)²/(2σ²) )

→ Mean : X0
→ Variance : σ² (⇒ RMS = σ)

Generalization to N dimensions:

G(x; X0, C) = 1/[(2π)^N |C|]^(1/2) · exp( −½ (x − X0)ᵀ C⁻¹ (x − X0) )

→ Mean : X0
→ Covariance matrix :

C = [ Var(X1)       Cov(X1, X2) ] = [ σ1²     ρσ1σ2 ]
    [ Cov(X2, X1)   Var(X2)     ]   [ ρσ1σ2   σ2²   ]

→ Tilt angle α of the ellipse axes: tan 2α = 2ρσ1σ2 / (σ1² − σ2²)

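Not part of the lecture material: a short numpy check of the 2-dimensional case, assuming example values (σ1, σ2, ρ) = (2, 1, 0.6). It builds C, draws correlated samples, and verifies that the tilt angle from tan 2α = 2ρσ1σ2/(σ1² − σ2²) matches the direction of the leading eigenvector of C.

import numpy as np

sigma1, sigma2, rho = 2.0, 1.0, 0.6
C = np.array([[sigma1**2,             rho * sigma1 * sigma2],
              [rho * sigma1 * sigma2, sigma2**2            ]])

rng = np.random.default_rng(42)
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=100_000)
print("sample covariance matrix:\n", np.cov(x, rowvar=False))   # ~ C

# Tilt angle of the ellipse axes
alpha = 0.5 * np.arctan2(2 * rho * sigma1 * sigma2, sigma1**2 - sigma2**2)

# Direction of the eigenvector with the largest eigenvalue of C
eigval, eigvec = np.linalg.eigh(C)
v = eigvec[:, np.argmax(eigval)]
print(np.degrees(alpha), np.degrees(np.arctan2(v[1], v[0])))     # same angle (mod 180 deg)
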
Central Limit Theorem

For an observable X with any(*) distribution, one has

x̄ = (1/n) Σi=1..n xi  ~  G( ⟨X⟩, σX/√n )   for n → ∞

(*) assuming σX < ∞ and other regularity conditions.

What this means:
• The average of many measurements is always Gaussian, whatever the distribution for a single measurement
• The mean of the Gaussian is the average of the single measurements
• The RMS of the Gaussian decreases as 1/√n : smaller fluctuations when averaging over many measurements

Another version:   Σi=1..n xi  ~  G( n⟨X⟩, √n σX )   for n → ∞
→ The mean scales like n, but the RMS only like √n

Central Limit Theorem in action

Draw events from a parabolic distribution (e.g. a decay cos θ* distribution) and histogram the sample mean

x̄ = (1/n) Σi=1..n xi   for increasing values of n.

→ The distribution of x̄ becomes Gaussian, although it is very non-Gaussian originally
→ The distribution becomes narrower as expected (as 1/√n)

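Not in the original slides: a minimal numpy version of this demo, assuming the parabolic PDF P(x) = (3/2) x² on [−1, 1]. Samples are drawn by inverting the CDF F(x) = (x³ + 1)/2, and the spread of x̄ is checked against the expected σX/√n with σX = √(3/5).

import numpy as np

rng = np.random.default_rng(3)

def sample_parabolic(size):
    """Draw from P(x) = 3/2 * x^2 on [-1, 1] by inverting the CDF F(x) = (x^3 + 1)/2."""
    u = rng.uniform(size=size)
    return np.cbrt(2.0 * u - 1.0)

n_toys = 20_000
for n in (1, 10, 100, 1000):
    xbar = sample_parabolic((n_toys, n)).mean(axis=1)   # one sample mean per pseudo-experiment
    expected_rms = np.sqrt(0.6 / n)                     # sigma_X / sqrt(n), with sigma_X^2 = <X^2> = 3/5
    print(f"n = {n:4d}:  RMS(xbar) = {xbar.std():.4f}   expected = {expected_rms:.4f}")
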
Gaussian Quantiles

Consider z = (x − X0)/σ, the “pull” of x
→ G(x; X0, σ) depends only on z, with z ~ G(z; 0, 1)

Probability P(|x − X0| > Zσ) to be away from the mean:

Z = 1 : 0.317
Z = 2 : 0.045
Z = 3 : 0.003
Z = 4 : 6 × 10⁻⁵
Z = 5 : 6 × 10⁻⁷

Cumulative Distribution Function (CDF) of the Gaussian:

Φ(z) = ∫−∞..z G(u; 0, 1) du

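Not part of the lecture material: the table above can be reproduced with scipy, where norm.sf gives the one-sided tail probability (so the two-sided P(|x − X0| > Zσ) is 2·norm.sf(Z)) and norm.isf converts a tail probability back into a Z value.

from scipy.stats import norm

# Two-sided tail probability P(|x - X0| > Z*sigma) for Z = 1..5
for Z in range(1, 6):
    print(f"Z = {Z}:  P = {2 * norm.sf(Z):.2e}")
# -> 3.17e-01, 4.55e-02, 2.70e-03, 6.33e-05, 5.73e-07

# Inverse direction: the Z value corresponding to a given two-sided probability
print("Z =", norm.isf(2.7e-3 / 2))   # ~ 3
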
Chi-squared

For multiple independent Gaussian variables xi, define

χ² = Σi=1..n ( (xi − xi⁰) / σi )²

→ Measures the global distance from a reference point (x1⁰ .... xn⁰)
→ Its distribution depends on n

(Figures: χ² = 1 and χ² = 4 contours in the (x1, x2) plane; χ²/n distributions for different n)

Rule of thumb: χ²/n should be ≲ 1

Histogram Chi-squared

Histogram χ² with respect to a reference shape:
• Assume an independent Gaussian distribution in each bin
• Degrees of freedom = (number of bins) – (number of fit parameters)

BLUE histogram vs. flat reference :    χ² = 12.9, p(χ²=12.9, n=10) = 23%
RED histogram vs. flat reference :     χ² = 38.8, p(χ²=38.8, n=10) = 0.003% ✘
RED histogram vs. correct reference :  χ² = 9.5,  p(χ²=9.5,  n=10) = 49% ✔

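Not from the slides: a minimal scipy sketch of this kind of check. The helper computes the histogram χ² from per-bin Gaussian pulls, and chi2.sf gives the p-value; the last lines reproduce the p-values quoted above from the χ² values and n = 10.

import numpy as np
from scipy.stats import chi2

def hist_chi2(observed, expected, sigma, n_fit_params=0):
    """Histogram chi2 vs. a reference shape, assuming independent Gaussian bins."""
    pulls = (np.asarray(observed) - np.asarray(expected)) / np.asarray(sigma)
    chi2_val = np.sum(pulls**2)
    ndof = len(pulls) - n_fit_params
    return chi2_val, ndof, chi2.sf(chi2_val, ndof)   # chi2, degrees of freedom, p-value

# p-values for the chi2 values quoted on the slide, with n = 10 degrees of freedom
for c in (12.9, 38.8, 9.5):
    print(f"chi2 = {c:5.1f}, n = 10  ->  p = {chi2.sf(c, 10):.2%}")
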
Statistical Modeling

Example 1: Z counting    (Phys. Lett. B 759 (2016) 601)

Measure the cross-section (event rate) of the Z→ee process:

σfid = (ndata − Nbkg) / (Cfid · L)

with ndata = 35000 ± 187, Nbkg = 175 ± 8, Cfid = 0.552 ± 0.006, L = (81 ± 2) pb⁻¹

σfid = 0.781 ± 0.004 (stat) ± 0.018 (syst) nb

→ (stat) : fluctuations in the data counts
→ (syst) : other uncertainties (assumptions, parameter values)

“Single-bin counting” : the only data input is ndata.

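Not part of the lecture: a back-of-the-envelope check of these numbers in Python, assuming the values map as above (ndata = 35000, Nbkg = 175, Cfid = 0.552, L = 81 pb⁻¹) and that the statistical uncertainty comes from the √ndata fluctuation of the data count alone.

import numpy as np

n_data, n_bkg = 35000.0, 175.0      # observed count and expected background
C_fid, lumi   = 0.552, 81.0         # efficiency/acceptance correction, luminosity in pb^-1

sigma_fid = (n_data - n_bkg) / (C_fid * lumi)    # fiducial cross-section in pb
stat_err  = np.sqrt(n_data) / (C_fid * lumi)     # Poisson fluctuation of n_data only

print(f"sigma_fid = {sigma_fid / 1000:.3f} +- {stat_err / 1000:.3f} (stat) nb")
# ~ 0.779 +- 0.004 nb, close to the quoted 0.781 +- 0.004 (stat) nb
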
Example 2: ttH→bb    (arXiv:2111.06712)

Event counting in different regions: multiple-bin counting

Lots of information available
→ Potentially higher sensitivity
→ How to make optimal use of it ?

Example 3: unbinned modeling    (ATLAS-CONF-2017-045)

All modeling done using continuous distributions:

Ptotal(mγγ) = S/(S+B) · Psignal(mγγ; mH) + B/(S+B) · Pbkg(mγγ)

How to count

Common situation: produce many events N, select a (very) small fraction P
→ In principle, a binomial process
→ In practice P ≪ 1, N ≫ 1 ⇒ Poisson approximation
→ i.e. a very rare process, but very many trials, so one still expects to see good events

Poisson distribution:   P(n; λ) = e^−λ λⁿ/n!   with λ = NP

since for N ≫ 1 and n ≪ N,   (1 − P)^(N−n) = (1 − λ/N)^(N−n) ~ e^−λ

→ Mean = λ
→ Variance = λ
→ RMS = σ = √λ

For a counting measurement: σ = √N, i.e. a count of N events fluctuates by about √N (since N ≈ λ).

Central limit theorem: becomes Gaussian for large λ :

P(λ) → G(λ, √λ)   as λ → ∞

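Not from the slides: a quick scipy/numpy illustration of the Poisson regime, assuming example values N = 10⁶ and P = 5×10⁻⁶ (so λ = NP = 5). The binomial and Poisson probabilities are numerically very close, and a Poisson sample has variance ≈ mean ≈ λ.

import numpy as np
from scipy.stats import binom, poisson

N, P = 1_000_000, 5e-6       # very many trials, tiny selection probability
lam = N * P                  # lambda = N*P = 5

n = np.arange(0, 16)
print("max |binomial - Poisson| pmf difference:",
      np.max(np.abs(binom.pmf(n, N, P) - poisson.pmf(n, lam))))

# Sample check of the moments: mean ~ variance ~ lambda, RMS ~ sqrt(lambda)
counts = np.random.default_rng(7).poisson(lam, size=100_000)
print("mean =", counts.mean(), " variance =", counts.var(ddof=1), " lambda =", lam)
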
Statistical Model for Counting

Observable: the number of events n

Typically both Signal and Background are present:

P(n; S, B) = e^−(S+B) (S+B)ⁿ / n!

S : # of events from the signal process
B : # of events from the background process(es)

The model has parameters S and B.
→ B can be known a priori or not (S usually not...)
→ Example: assume B is known, use the measured n to find out about S.

Multiple counting bins

Count in bins of a variable ⇒ histogram n1 ... nN (N : number of bins).

P({ni}; S, B) = ∏i=1..N e^−(S fS,i + B fB,i) · (S fS,i + B fB,i)^ni / ni!

→ Poisson distribution in each bin
→ fS,i, fB,i : per-bin fractions (= shapes) of Signal and Background

The shapes f are typically obtained from simulated events (Monte Carlo):
→ HEP: generally good modeling from simulation, although some uncertainties need to be accounted for.
→ It is also not always possible to generate sufficiently large MC samples: MC stat fluctuations can create artefacts, especially for S ≪ B.

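A minimal sketch (not from the lecture) of this binned model in Python: given per-bin signal and background shapes fS,i and fB,i and yields S and B, it returns log P({ni}; S, B) as a sum of per-bin Poisson terms. The 4-bin shapes and counts are made-up toy values.

import numpy as np
from scipy.stats import poisson

def binned_loglike(n_obs, S, B, f_sig, f_bkg):
    """log P({n_i}; S, B) = sum_i log Pois(n_i; S*f_sig_i + B*f_bkg_i)."""
    expected = S * np.asarray(f_sig) + B * np.asarray(f_bkg)
    return np.sum(poisson.logpmf(np.asarray(n_obs), expected))

# Toy example: 4 bins, a peaked signal shape on top of a flat background
f_sig = np.array([0.1, 0.4, 0.4, 0.1])       # sums to 1
f_bkg = np.array([0.25, 0.25, 0.25, 0.25])   # sums to 1
n_obs = np.array([27, 45, 38, 22])

print(binned_loglike(n_obs, S=20.0, B=100.0, f_sig=f_sig, f_bkg=f_bkg))
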
Model Parameters

The model typically includes:

• Parameters of interest (POIs) : what we want to measure
  → S, mW, …

• Nuisance parameters (NPs) : other parameters needed to define the model
  → Background levels (B)
  → For binned data, the shapes fsig,i , fbkg,i

NPs must be either:
→ Known a priori (within uncertainties), or
→ Constrained by the data

Takeaways

Random data must be described using a statistical model:

• Counting
  Observable : n
  Likelihood : Poisson
  P(n; S, B) = e^−(S+B) (S+B)ⁿ / n!

• Binned shape analysis
  Observable : ni, i = 1 .. Nbins
  Likelihood : Poisson product
  P({ni}; S, B) = ∏i=1..Nbins e^−(S fi^sig + B fi^bkg) · (S fi^sig + B fi^bkg)^ni / ni!

• Unbinned shape analysis
  Observable : mi, i = 1 .. nevts
  Likelihood : Extended unbinned likelihood
  P({mi}; S, B) = e^−(S+B)/nevts! · ∏i=1..nevts [ S Psig(mi) + B Pbkg(mi) ]

The model includes parameters of interest (POIs) but also nuisance parameters (NPs).

Next step: use the model to obtain information on the POIs.

Maximum Likelihood Estimation

What a PDF is for

The model describes the distribution of the observable: P(data; parameters)
⇒ Possible outcomes of the experiment, for given parameter values

Can draw random events according to the PDF : generate pseudo-data

Example: generating from P(λ = 5) → 2, 5, 3, 7, 4, 9, …
→ Each entry = a separate “experiment”

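Not in the slides: generating this kind of Poisson pseudo-data with numpy, each entry playing the role of a separate “experiment”.

import numpy as np

rng = np.random.default_rng(123)

# 10 pseudo-experiments drawn from P(n; lambda = 5)
pseudo_data = rng.poisson(lam=5.0, size=10)
print(pseudo_data)    # a sequence of counts such as 2, 5, 3, 7, 4, 9, ... (seed-dependent)
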
What a PDF is also for: Likelihood

The model describes the distribution of the observable: P(data; parameters)
⇒ Possible outcomes of the experiment, for given parameter values

We want the other direction: use the data to get information on the parameters
→ Example: observe n = 2, estimate λ = ?

Likelihood: L(parameters) = P(data; parameters)
→ the same as the PDF, but seen as a function of the parameters

Maximum Likelihood Estimation

To estimate a parameter μ, find the value μ̂ that maximizes L(μ).

Maximum Likelihood Estimator (MLE) μ̂ :   μ̂ = arg max L(μ)

Example: for an observed value n = 5, compare P(n; S) for S = 0.5, 5, 20 → L(S; n=5) is maximal at Ŝ = 5.

MLE: the value of μ for which this data was most likely to occur
→ The MLE is a function of the data – itself an observable
→ No guarantee it is the true value (the data may be “unlikely”), but a sensible estimate

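Not from the slides: a minimal scan of the Poisson likelihood for an observed count n = 5, confirming numerically that the maximum is at Ŝ = n = 5.

import numpy as np
from scipy.stats import poisson

n_obs = 5
S_grid = np.linspace(0.1, 20.0, 2000)

# Likelihood: the Poisson probability of the observed n, viewed as a function of S
L = poisson.pmf(n_obs, S_grid)
print("S_hat =", S_grid[np.argmax(L)])   # ~ 5, the MLE for a single Poisson count
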
Gaussian case

Best-fit of the Gaussian PDF mean to the observed data.

Multiple Gaussian bins

−2 log Likelihood:

λ(μ) = −2 log L(μ) = Σi=1..Nbins ( (ni − yi(μ)) / σi )²

Maximum likelihood ⇔ Minimum χ² ⇔ Least-squares minimization

However, one typically needs to perform a non-linear minimization in other cases.

HEP practice:
• MINUIT (C++ library within ROOT, numerical gradient descent)
• scipy.minimize – using NumPy/TensorFlow/PyTorch/... backends
→ Many algorithms – gradient-based, etc.

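Not part of the lecture material: a small sketch of this least-squares case with scipy, minimizing λ(μ) = Σ((ni − yi(μ))/σi)² numerically. The linear model yi(μ) = μ·fi and all the numbers are assumptions made up for illustration.

import numpy as np
from scipy.optimize import minimize

# Toy binned data with Gaussian uncertainties sigma_i and model y_i(mu) = mu * f_i
f     = np.array([1.0, 2.0, 3.0, 4.0])
sigma = np.array([1.0, 1.2, 1.5, 2.0])
n_obs = np.array([2.1, 3.8, 6.5, 8.3])

def neg2_log_L(params):
    mu = params[0]
    return np.sum(((n_obs - mu * f) / sigma) ** 2)   # -2 log L = chi2 for Gaussian bins

result = minimize(neg2_log_L, x0=[1.0])              # numerical (gradient-based) minimization
print("mu_hat =", result.x[0], "  min chi2 =", result.fun)
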
Hands-ons
Each statistics lecture comes with “hands-on” exercises.
The hands-on session will be based on Jupyter notebooks built using the
numpy/scipy/pyplot stack.

If you have a computer, please install anaconda before the start of the class.
This provides a consistent installation of python, JupyterLab, etc.
→ Alternatively, you can also install JupyterLab as a standalone package.
→ Another solution is to run the notebooks on the public jupyter servers at
mybinder.org. This will probably be slower but avoids a local install.

No hands-on today, but have a look after the course.


Please be prepared to run the hands-ons during lectures 2 and 3 !
Links to resources
The hands-on resources for each lecture are listed below:

Lecture 1 notebook [solutions] binder [solutions] Today


Lecture 2 notebook binder

Lecture 3 notebook binder

● Use the notebook links if you have a local install: save the notebook
locally and open it with your JupyterLab installation.
● Use the binder links to use public servers: the links will open the
notebooks in a remote server session in your browser.

Notebooks with solutions to the exercises will be posted after the lectures.
Please let me know in case of technical issues running the notebooks!
Extra Slides

Error Bars
Strictly speaking, the uncertainty is given by the model :
→ Bin central value ~ mean of the bin PDF
→ Bin uncertainty ~ RMS of the bin PDF
The data is just what it is, a simple observed point.

⇒ One should in principle show the error bar on the prediction.


→ In practice, the usual convention is to have error bars on the data points.

Rare Processes ?

HEP : we almost always use Poisson distributions. Why ?

ATLAS :
• Event rate ~ 1 GHz   (L ~ 10³⁴ cm⁻²s⁻¹ ~ 10 nb⁻¹/s, σtot ~ 10⁸ nb)
• Trigger rate ~ 1 kHz   (Higgs rate ~ 0.1 Hz)

⇒ p ~ 10⁻⁶ ≪ 1   (pH→γγ ~ 10⁻¹³)

A day of data: N ~ 10¹⁴ ≫ 1

⇒ Poisson regime! Similarly true in many other physics situations.

(Large N is a design requirement, to get a not-too-small λ = Np...)

Credit: W.J. Stirling, private communication

Unbinned Shape Analysis

Observable: a set of values m1 ... mn, one per event
→ Describe the shape of the distribution of m
→ Deduce the probability to observe m1 ... mn

H→γγ-inspired example:
• Gaussian signal (peak at mH, width σ) :   Psignal(m) = G(m; mH, σ)
• Exponential background (slope α) :   Pbkg(m) = α e^−αm
• Expected yields : S, B

⇒ Total PDF for a single event:

Ptotal(m) = S/(S+B) · G(m; mH, σ) + B/(S+B) · α e^−αm

⇒ Total PDF for a dataset (probability to observe n events × probability to observe each value mi):

P({mi}i=1…n) = e^−(S+B) (S+B)ⁿ/n! · ∏i=1..n [ S/(S+B) · G(mi; mH, σ) + B/(S+B) · α e^−αmi ]

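Not from the slides: a sketch of this extended unbinned −2 log L in Python, using scipy.stats for the Gaussian signal and exponential background PDFs. The parameter values (mH, σ, α) and the toy “events” are made up for illustration, and the exponential is taken on [0, ∞) rather than on a restricted mass window.

import numpy as np
from scipy.stats import norm, expon

def neg2_log_L(S, B, m, mH=125.0, width=2.0, alpha=0.02):
    """Extended unbinned -2 log L (dropping the constant log(n!) term):
       log L = -(S+B) + sum_i log( S*Psig(m_i) + B*Pbkg(m_i) )."""
    p_sig = norm.pdf(m, loc=mH, scale=width)    # Gaussian signal G(m; mH, sigma)
    p_bkg = expon.pdf(m, scale=1.0 / alpha)     # exponential background alpha * exp(-alpha*m)
    return -2.0 * (-(S + B) + np.sum(np.log(S * p_sig + B * p_bkg)))

# Toy "diphoton masses" (GeV): a few events near 125 plus background-like ones
m_events = np.array([110.5, 124.8, 125.3, 126.1, 140.2, 152.7])
print(neg2_log_L(S=3.0, B=3.0, m=m_events))
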
Poisson Example

Assume a Poisson distribution with B = 0 :   P(n; S) = e^−S Sⁿ/n!

Say we observe n = 5, and want to infer information on the parameter S
→ Try different values of S for the fixed data value n = 5
→ Varying parameter, fixed data: likelihood

L(S; n=5) = e^−S S⁵/5!

Reading off P(n=5; S) for each value of S gives L(S; n=5), the likelihood of S for n = 5:
→ S = 0.5 : low likelihood
→ S = 5 : high likelihood
→ S = 20 : low likelihood

MLEs in Shape Analyses

Binned shape analysis:

L(S; {ni}) = P({ni}; S) = ∏i=1..N Pois(ni; S fi + Bi)

Maximize the global L(S) (each bin may prefer a different S).

In practice it is easier to minimize

λPois(S) = −2 log L(S) = −2 Σi=1..N log Pois(ni; S fi + Bi)    (needs a computer...)

In the Gaussian limit,

λGaus(S) = Σi=1..N −2 log G(ni; S fi + Bi, σi) = Σi=1..N ( (ni − (S fi + Bi)) / σi )²    (the χ² formula!)

→ Gaussian MLE (min χ² or min λGaus) : best-fit value in a χ² (least-squares) fit
→ Poisson MLE (min λPois) : best-fit value in a likelihood fit (in ROOT, fit option “L”)
→ In RooFit, λPois ⇒ RooAbsPdf::fitTo(), λGaus ⇒ RooAbsPdf::chi2FitTo()

In both cases, MLE ⇔ Best Fit

H→γγ

L(S, B; {mi}) = e^−(S+B) ∏i=1..nevts [ S Psig(mi) + B Pbkg(mi) ]

How to estimate the MLE Ŝ of S ?
→ Perform a (likelihood) best-fit of the model to the data
⇒ the fit result for S is the desired Ŝ.

In particle physics, one often uses the MINUIT minimizer within ROOT.

ATLAS-CONF-2017-045

MLE Properties

• Asymptotically Gaussian and unbiased:

  P(μ̂) ∝ exp( −(μ̂ − μ*)²/(2σμ̂²) )   and   ⟨μ̂⟩ = μ*   for n → ∞

  where σμ̂ is the standard deviation of the distribution of μ̂ for large enough datasets.

• Asymptotically efficient: σμ̂ is the lowest possible value (in the limit n → ∞) among consistent estimators.
  → The MLE captures all the available information in the data.

• Also consistent: μ̂ converges to the true value μ* for large n.

• Log-likelihood: can also minimize λ = −2 log L
  → Usually more efficient numerically
  → For a Gaussian L, λ is parabolic
  → Can drop multiplicative constants in L (additive constants in λ)

Extra: Fisher Information

Fisher Information:

I(μ) = ⟨ (∂ log L(μ)/∂μ)² ⟩ = − ⟨ ∂² log L(μ)/∂μ² ⟩

→ Measures the amount of information available in the measurement of μ.

Gaussian case:
• Gaussian likelihood: I(μ) = 1/σGauss²  → smaller σGauss ⇒ more information.
• For a Gaussian estimator μ̃ : P(μ̃) ∝ exp( −(μ̃ − μ*)²/(2σμ̃²) ); for the MLE, Var(μ̂) = σμ̂².

Cramér-Rao bound: Var(μ̃) ≥ 1/I(μ) for any estimator μ̃
→ Gaussian case: Var(μ̃) = σμ̃² ≥ σGauss²
→ One cannot be more precise than allowed by the information in the measurement.

Efficient estimators reach the bound: e.g. the MLE in the large-dataset limit.

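Not in the slides: a minimal numerical check of the Cramér-Rao bound for the simple Poisson counting model. For a single count n ~ Pois(λ), I(λ) = 1/λ, so the bound on Var(λ̂) is λ; the MLE λ̂ = n has exactly this variance and saturates the bound.

import numpy as np

rng = np.random.default_rng(2)
lam = 9.0

# Fisher information for one Poisson observation: I(lam) = E[n] / lam^2 = 1 / lam
fisher_info = 1.0 / lam
cramer_rao_bound = 1.0 / fisher_info      # = lam

# The MLE for a single count is lam_hat = n; its variance is Var(n) = lam
n = rng.poisson(lam, size=500_000)
print("Var(lam_hat) =", n.var(ddof=1), "  Cramer-Rao bound =", cramer_rao_bound)
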
Some Examples

High-mass X→γγ search (JHEP 09 (2016) 1) : 3.9σ
Higgs discovery (Phys. Lett. B 716 (2012) 1-29) : p0 = 1.8 × 10⁻⁹ ⇔ Z = 5.9σ
