
2022 ASIA EUROPE PACIFIC SCHOOL OF HIGH-ENERGY PHYSICS

Practical Statistics for Particle Physicists

Nicolas Berger (LAPP Annecy), Lecture 1

Statistics are everywhere

“There are three types of lies: lies, damned lies, and statistics.” – Benjamin Disraeli

Credits: mattbuck / wikimedia; StatLab

And Physics ?

“If your experiment needs statistics, you ought to have done a better experiment.” – E. Rutherford

Introduction

Statistical methods play a critical role in many areas of physics

Higgs discovery : “We have 5σ” !

Phys. Lett. B 716 (2012) 1-29

Introduction

Sometimes difficult to distinguish a bona fide discovery from a background fluctuation…

New Physics ? 3.9σ ? 2.1σ ?

A few months later...

JHEP 09 (2016) 1

Introduction
Precision measurements are another window into BSM effects
→ How to compute (and interpret) measurement intervals
→ How to model systematic uncertainties ?
→ How to get the smallest achievable uncertainties ?

Image credits: CERN courier, LHCb


Lecture Plan

Statistics basic concepts (Today)
• [Basic ingredients (PDFs, etc.)]
• Statistical Modeling (PDFs for particle physics measurements)
• Parameter estimation (maximum likelihood, least-squares, …)

Computing statistical results (Tomorrow)
• Model testing (χ² tests, hypothesis testing, p-values, …)
• Discovery testing
• Confidence intervals
• Upper limits

Systematics and further topics (Saturday)
• Systematics and profiling
• [Bayesian techniques]

Disclaimer: the examples and methods covered in the lectures will be biased towards LHC techniques (generally close to the state of the art anyway).

The class will be based on both lectures and hands-on tutorials.



Randomness in High-Energy Physics

(More details in other lectures!)

Experimental data is produced by incredibly complex processes:
• Hard scattering
• PDFs, Parton shower, Pileup
• Decays
• Detector response
• Reconstruction

Image Credits: S. Höche, SLAC-PUB-16160

Randomness is involved in all stages:
→ Classical randomness: detector response
→ Quantum effects in particle production, decay

Measurement Errors: Energy measurement

Example: measuring the energy of a photon (γ) in a calorimeter: the energy deposit is recorded by the calorimeter readout.

Perfect case vs. real life: in real life, one must also measure the leakage behind the calorimeter and the leakage into neighboring cells.

Cannot predict the measured value for a given event
⇒ Random process ⇒ Need a probabilistic description

Quantum Randomness: H→ZZ*→4l

Phys. Rev. D 91, 012006
http://www.phdcomics.com/comics/archive.php?comicid=1489

Rare process: expect 1 signal event every ~6 days

“Will I get an event today ?” → only a probabilistic answer

Statistical Modeling

Probability Distributions

Probabilistic treatment of possible outcomes ⇒ Probability Distribution

Example: two-coin toss, counting the number of heads (i = 0, 1, 2)
→ Fractions of events in each bin i converge to a limit Pi

Probability distribution : { Pi } for i = 0, 1, 2

Properties:
• Pi ≥ 0
• Σi Pi = 1

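Not part of the original slides: a minimal numpy sketch of the two-coin example (assuming a fair coin), checking that the per-bin fractions converge to {P0, P1, P2} = {1/4, 1/2, 1/4}.

import numpy as np

rng = np.random.default_rng(seed=1)

# Two-coin toss: count the number of heads (0, 1 or 2) in each trial
n_trials = 100_000
heads = rng.integers(0, 2, size=(n_trials, 2)).sum(axis=1)

# Fractions of events in each bin converge to P_i = {1/4, 1/2, 1/4}
fractions = np.bincount(heads, minlength=3) / n_trials
print("observed fractions:", fractions)        # ~ [0.25, 0.50, 0.25]
print("sum of fractions  :", fractions.sum())  # exactly 1 by construction
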
Continuous Variables: PDFs

Continuous variable: can consider per-bin probabilities pi, i = 1 .. nbins

Bin size → 0 : Probability distribution function P(x)
→ High PDF value ⇒ high chance to get a measurement there
→ P(x) ≥ 0, ∫ P(x) dx = 1

Generalizes to multiple variables:
→ P(x, y) ≥ 0, ∫ P(x, y) dx dy = 1

(Figure: contours of P(x, y) in the (x, y) plane)

PDF Properties: Mean

E(X) = ⟨X⟩ : mean of X – the expected outcome on average over many measurements

⟨X⟩ = Σi xi Pi   or   ⟨X⟩ = ∫ x P(x) dx

→ Property of the PDF.

For measurements x1 ... xn, one can compute the sample mean:

x̄ = (1/n) Σi xi

→ Property of the sample
→ Approximates the PDF mean.

PDF Properties: (Co)variance

Variance of X:   Var(X) = ⟨ (X − ⟨X⟩)² ⟩
→ Average square of the deviation from the mean
→ RMS(X) = √Var(X) = σX : standard deviation

Can be approximated by the sample variance:

σ̂² = 1/(n−1) Σi (xi − x̄)²

Covariance of X and Y:   Cov(X, Y) = ⟨ (X − ⟨X⟩)(Y − ⟨Y⟩) ⟩
→ Large if variations of X and Y are “synchronized”
(Figures: scatter plots with Cov(x, y) > 0, Cov(x, y) < 0 and Cov(X, Y) = 0)

Correlation coefficient:   ρ = Cov(X, Y) / √(Var(X) Var(Y)),   with −1 ≤ ρ ≤ 1

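Not from the slides: a small numpy illustration of the sample estimates above (sample mean, sample variance with the 1/(n−1) convention, covariance and correlation coefficient), using made-up correlated toy data.

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y is partially correlated with x
x = rng.normal(loc=10.0, scale=2.0, size=10_000)
y = 0.5 * x + rng.normal(scale=1.0, size=x.size)

x_mean = x.mean()                    # sample mean, approximates <X>
x_var  = x.var(ddof=1)               # sample variance, 1/(n-1) convention
cov_xy = np.cov(x, y)[0, 1]          # sample covariance Cov(X, Y)
rho    = np.corrcoef(x, y)[0, 1]     # correlation coefficient, -1 <= rho <= 1

print(f"mean = {x_mean:.2f}, variance = {x_var:.2f}  (true values: 10, 4)")
print(f"Cov(x, y) = {cov_xy:.2f}, rho = {rho:.2f}")
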
“Linear” vs. “non-linear” correlations

For non-Gaussian cases, the correlation coefficient ρ is not the whole story:

tan 2α = 2ρσ1σ2 / (σ1² − σ2²)   (tilt angle α of the correlation ellipse)

(Figure source: Wikipedia)

In particular, variables can still be correlated even when ρ = 0 : “non-linear” correlations.

Some vocabulary...

X, Y... are Random Variables (continuous or discrete), a.k.a. observables:
→ X can take any value x, with probability P(X=x).
→ P(X=x) is the PDF of X, a.k.a. the Statistical Model.
→ The observed data is one value xobs of X, drawn from P(X=x).

Gaussian PDF

Gaussian distribution:

G(x; X0, σ) = 1/(σ√(2π)) · exp( −(x − X0)²/(2σ²) )

→ Mean : X0
→ Variance : σ² (⇒ RMS = σ)

Generalization to N dimensions:

G(x; X0, C) = 1/[(2π)^N |C|]^(1/2) · exp( −½ (x − X0)ᵀ C⁻¹ (x − X0) )

→ Mean : X0
→ Covariance matrix :

C = [ Var(X1)       Cov(X1, X2) ] = [ σ1²     ρσ1σ2 ]
    [ Cov(X2, X1)   Var(X2)     ]   [ ρσ1σ2   σ2²   ]

→ Tilt angle α of the ellipse axes: tan 2α = 2ρσ1σ2 / (σ1² − σ2²)

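Not part of the lecture material: a short numpy check of the 2-dimensional case, assuming example values (σ1, σ2, ρ) = (2, 1, 0.6). It builds C, draws correlated samples, and verifies that the tilt angle from tan 2α = 2ρσ1σ2/(σ1² − σ2²) matches the direction of the leading eigenvector of C.

import numpy as np

sigma1, sigma2, rho = 2.0, 1.0, 0.6
C = np.array([[sigma1**2,             rho * sigma1 * sigma2],
              [rho * sigma1 * sigma2, sigma2**2            ]])

rng = np.random.default_rng(42)
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=100_000)
print("sample covariance matrix:\n", np.cov(x, rowvar=False))   # ~ C

# Tilt angle of the ellipse axes
alpha = 0.5 * np.arctan2(2 * rho * sigma1 * sigma2, sigma1**2 - sigma2**2)

# Direction of the eigenvector with the largest eigenvalue of C
eigval, eigvec = np.linalg.eigh(C)
v = eigvec[:, np.argmax(eigval)]
print(np.degrees(alpha), np.degrees(np.arctan2(v[1], v[0])))     # same angle (mod 180 deg)
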
Central Limit Theorem

For an observable X with any(*) distribution, one has

x̄ = (1/n) Σi=1..n xi  ~  G( ⟨X⟩, σX/√n )   for n → ∞

(*) assuming σX < ∞ and other regularity conditions.

What this means:
• The average of many measurements is always Gaussian, whatever the distribution for a single measurement
• The mean of the Gaussian is the average of the single measurements
• The RMS of the Gaussian decreases as 1/√n : smaller fluctuations when averaging over many measurements

Another version:   Σi=1..n xi  ~  G( n⟨X⟩, √n σX )   for n → ∞
→ The mean scales like n, but the RMS only like √n

Central Limit Theorem in action

Draw events from a parabolic distribution (e.g. a decay cos θ* distribution) and histogram the sample mean

x̄ = (1/n) Σi=1..n xi   for increasing values of n.

→ The distribution of x̄ becomes Gaussian, although it is very non-Gaussian originally
→ The distribution becomes narrower as expected (as 1/√n)

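Not in the original slides: a minimal numpy version of this demo, assuming the parabolic PDF P(x) = (3/2) x² on [−1, 1]. Samples are drawn by inverting the CDF F(x) = (x³ + 1)/2, and the spread of x̄ is checked against the expected σX/√n with σX = √(3/5).

import numpy as np

rng = np.random.default_rng(3)

def sample_parabolic(size):
    """Draw from P(x) = 3/2 * x^2 on [-1, 1] by inverting the CDF F(x) = (x^3 + 1)/2."""
    u = rng.uniform(size=size)
    return np.cbrt(2.0 * u - 1.0)

n_toys = 20_000
for n in (1, 10, 100, 1000):
    xbar = sample_parabolic((n_toys, n)).mean(axis=1)   # one sample mean per pseudo-experiment
    expected_rms = np.sqrt(0.6 / n)                     # sigma_X / sqrt(n), with sigma_X^2 = <X^2> = 3/5
    print(f"n = {n:4d}:  RMS(xbar) = {xbar.std():.4f}   expected = {expected_rms:.4f}")
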
Gaussian Quantiles

Consider z = (x − X0)/σ, the “pull” of x
→ G(x; X0, σ) depends only on z, with z ~ G(z; 0, 1)

Probability P(|x − X0| > Zσ) to be away from the mean:

Z = 1 : 0.317
Z = 2 : 0.045
Z = 3 : 0.003
Z = 4 : 6 × 10⁻⁵
Z = 5 : 6 × 10⁻⁷

Cumulative Distribution Function (CDF) of the Gaussian:

Φ(z) = ∫−∞..z G(u; 0, 1) du

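Not part of the lecture material: the table above can be reproduced with scipy, where norm.sf gives the one-sided tail probability (so the two-sided P(|x − X0| > Zσ) is 2·norm.sf(Z)) and norm.isf converts a tail probability back into a Z value.

from scipy.stats import norm

# Two-sided tail probability P(|x - X0| > Z*sigma) for Z = 1..5
for Z in range(1, 6):
    print(f"Z = {Z}:  P = {2 * norm.sf(Z):.2e}")
# -> 3.17e-01, 4.55e-02, 2.70e-03, 6.33e-05, 5.73e-07

# Inverse direction: the Z value corresponding to a given two-sided probability
print("Z =", norm.isf(2.7e-3 / 2))   # ~ 3
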
Chi-squared

For multiple independent Gaussian variables xi, define

χ² = Σi=1..n ( (xi − xi⁰) / σi )²

→ Measures the global distance from a reference point (x1⁰ .... xn⁰)
→ Its distribution depends on n

(Figures: χ² = 1 and χ² = 4 contours in the (x1, x2) plane; χ²/n distributions for different n)

Rule of thumb: χ²/n should be ≲ 1

Histogram Chi-squared

Histogram χ² with respect to a reference shape:
• Assume an independent Gaussian distribution in each bin
• Degrees of freedom = (number of bins) – (number of fit parameters)

BLUE histogram vs. flat reference :    χ² = 12.9, p(χ²=12.9, n=10) = 23%
RED histogram vs. flat reference :     χ² = 38.8, p(χ²=38.8, n=10) = 0.003% ✘
RED histogram vs. correct reference :  χ² = 9.5,  p(χ²=9.5,  n=10) = 49% ✔

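Not from the slides: a minimal scipy sketch of this kind of check. The helper computes the histogram χ² from per-bin Gaussian pulls, and chi2.sf gives the p-value; the last lines reproduce the p-values quoted above from the χ² values and n = 10.

import numpy as np
from scipy.stats import chi2

def hist_chi2(observed, expected, sigma, n_fit_params=0):
    """Histogram chi2 vs. a reference shape, assuming independent Gaussian bins."""
    pulls = (np.asarray(observed) - np.asarray(expected)) / np.asarray(sigma)
    chi2_val = np.sum(pulls**2)
    ndof = len(pulls) - n_fit_params
    return chi2_val, ndof, chi2.sf(chi2_val, ndof)   # chi2, degrees of freedom, p-value

# p-values for the chi2 values quoted on the slide, with n = 10 degrees of freedom
for c in (12.9, 38.8, 9.5):
    print(f"chi2 = {c:5.1f}, n = 10  ->  p = {chi2.sf(c, 10):.2%}")
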
Statistical Modeling

Example 1: Z counting    (Phys. Lett. B 759 (2016) 601)

Measure the cross-section (event rate) of the Z→ee process:

σfid = (ndata − Nbkg) / (Cfid · L)

with ndata = 35000 ± 187, Nbkg = 175 ± 8, Cfid = 0.552 ± 0.006, L = (81 ± 2) pb⁻¹

σfid = 0.781 ± 0.004 (stat) ± 0.018 (syst) nb

→ (stat) : fluctuations in the data counts
→ (syst) : other uncertainties (assumptions, parameter values)

“Single-bin counting” : the only data input is ndata.

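Not part of the lecture: a back-of-the-envelope check of these numbers in Python, assuming the values map as above (ndata = 35000, Nbkg = 175, Cfid = 0.552, L = 81 pb⁻¹) and that the statistical uncertainty comes from the √ndata fluctuation of the data count alone.

import numpy as np

n_data, n_bkg = 35000.0, 175.0      # observed count and expected background
C_fid, lumi   = 0.552, 81.0         # efficiency/acceptance correction, luminosity in pb^-1

sigma_fid = (n_data - n_bkg) / (C_fid * lumi)    # fiducial cross-section in pb
stat_err  = np.sqrt(n_data) / (C_fid * lumi)     # Poisson fluctuation of n_data only

print(f"sigma_fid = {sigma_fid / 1000:.3f} +- {stat_err / 1000:.3f} (stat) nb")
# ~ 0.779 +- 0.004 nb, close to the quoted 0.781 +- 0.004 (stat) nb
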
Example 2: ttH→bb    (arXiv:2111.06712)

Event counting in different regions: multiple-bin counting

Lots of information available
→ Potentially higher sensitivity
→ How to make optimal use of it ?

Example 3: unbinned modeling    (ATLAS-CONF-2017-045)

All modeling done using continuous distributions:

Ptotal(mγγ) = S/(S+B) · Psignal(mγγ; mH) + B/(S+B) · Pbkg(mγγ)

How to count

Common situation: produce many events N, select a (very) small fraction P
→ In principle, a binomial process
→ In practice P ≪ 1, N ≫ 1 ⇒ Poisson approximation
→ i.e. a very rare process, but very many trials, so one still expects to see good events

Poisson distribution:   P(n; λ) = e^−λ λⁿ/n!   with λ = NP

since for N ≫ 1 and n ≪ N,   (1 − P)^(N−n) = (1 − λ/N)^(N−n) ~ e^−λ

→ Mean = λ
→ Variance = λ
→ RMS = σ = √λ

For a counting measurement: σ = √N, i.e. a count of N events fluctuates by about √N (since N ≈ λ).

Central limit theorem: becomes Gaussian for large λ :

P(λ) → G(λ, √λ)   as λ → ∞

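Not from the slides: a quick scipy/numpy illustration of the Poisson regime, assuming example values N = 10⁶ and P = 5×10⁻⁶ (so λ = NP = 5). The binomial and Poisson probabilities are numerically very close, and a Poisson sample has variance ≈ mean ≈ λ.

import numpy as np
from scipy.stats import binom, poisson

N, P = 1_000_000, 5e-6       # very many trials, tiny selection probability
lam = N * P                  # lambda = N*P = 5

n = np.arange(0, 16)
print("max |binomial - Poisson| pmf difference:",
      np.max(np.abs(binom.pmf(n, N, P) - poisson.pmf(n, lam))))

# Sample check of the moments: mean ~ variance ~ lambda, RMS ~ sqrt(lambda)
counts = np.random.default_rng(7).poisson(lam, size=100_000)
print("mean =", counts.mean(), " variance =", counts.var(ddof=1), " lambda =", lam)
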
Statistical Model for Counting

Observable: the number of events n

Typically both Signal and Background are present:

P(n; S, B) = e^−(S+B) (S+B)ⁿ / n!

S : # of events from the signal process
B : # of events from the background process(es)

The model has parameters S and B.
→ B can be known a priori or not (S usually not...)
→ Example: assume B is known, use the measured n to find out about S.

Multiple counting bins

Count in bins of a variable ⇒ histogram n1 ... nN (N : number of bins).

P({ni}; S, B) = ∏i=1..N e^−(S fS,i + B fB,i) · (S fS,i + B fB,i)^ni / ni!

→ Poisson distribution in each bin
→ fS,i, fB,i : per-bin fractions (= shapes) of Signal and Background

The shapes f are typically obtained from simulated events (Monte Carlo):
→ HEP: generally good modeling from simulation, although some uncertainties need to be accounted for.
→ It is also not always possible to generate sufficiently large MC samples: MC stat fluctuations can create artefacts, especially for S ≪ B.

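A minimal sketch (not from the lecture) of this binned model in Python: given per-bin signal and background shapes fS,i and fB,i and yields S and B, it returns log P({ni}; S, B) as a sum of per-bin Poisson terms. The 4-bin shapes and counts are made-up toy values.

import numpy as np
from scipy.stats import poisson

def binned_loglike(n_obs, S, B, f_sig, f_bkg):
    """log P({n_i}; S, B) = sum_i log Pois(n_i; S*f_sig_i + B*f_bkg_i)."""
    expected = S * np.asarray(f_sig) + B * np.asarray(f_bkg)
    return np.sum(poisson.logpmf(np.asarray(n_obs), expected))

# Toy example: 4 bins, a peaked signal shape on top of a flat background
f_sig = np.array([0.1, 0.4, 0.4, 0.1])       # sums to 1
f_bkg = np.array([0.25, 0.25, 0.25, 0.25])   # sums to 1
n_obs = np.array([27, 45, 38, 22])

print(binned_loglike(n_obs, S=20.0, B=100.0, f_sig=f_sig, f_bkg=f_bkg))
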
Model Parameters

The model typically includes:

• Parameters of interest (POIs) : what we want to measure
  → S, mW, …

• Nuisance parameters (NPs) : other parameters needed to define the model
  → Background levels (B)
  → For binned data, the shapes fsig,i , fbkg,i

NPs must be either:
→ Known a priori (within uncertainties), or
→ Constrained by the data

Takeaways

Random data must be described using a statistical model:

• Counting
  Observable : n
  Likelihood : Poisson
  P(n; S, B) = e^−(S+B) (S+B)ⁿ / n!

• Binned shape analysis
  Observable : ni, i = 1 .. Nbins
  Likelihood : Poisson product
  P({ni}; S, B) = ∏i=1..Nbins e^−(S fi^sig + B fi^bkg) · (S fi^sig + B fi^bkg)^ni / ni!

• Unbinned shape analysis
  Observable : mi, i = 1 .. nevts
  Likelihood : Extended unbinned likelihood
  P({mi}; S, B) = e^−(S+B)/nevts! · ∏i=1..nevts [ S Psig(mi) + B Pbkg(mi) ]

The model includes parameters of interest (POIs) but also nuisance parameters (NPs).

Next step: use the model to obtain information on the POIs.

Maximum Likelihood Estimation

What a PDF is for

The model describes the distribution of the observable: P(data; parameters)
⇒ Possible outcomes of the experiment, for given parameter values

Can draw random events according to the PDF : generate pseudo-data

Example: generating from P(λ = 5) → 2, 5, 3, 7, 4, 9, …
→ Each entry = a separate “experiment”

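Not in the slides: generating this kind of Poisson pseudo-data with numpy, each entry playing the role of a separate “experiment”.

import numpy as np

rng = np.random.default_rng(123)

# 10 pseudo-experiments drawn from P(n; lambda = 5)
pseudo_data = rng.poisson(lam=5.0, size=10)
print(pseudo_data)    # a sequence of counts such as 2, 5, 3, 7, 4, 9, ... (seed-dependent)
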
What a PDF is also for: Likelihood

The model describes the distribution of the observable: P(data; parameters)
⇒ Possible outcomes of the experiment, for given parameter values

We want the other direction: use the data to get information on the parameters
→ Example: observe n = 2, estimate λ = ?

Likelihood: L(parameters) = P(data; parameters)
→ the same as the PDF, but seen as a function of the parameters

Maximum Likelihood Estimation

To estimate a parameter μ, find the value μ̂ that maximizes L(μ).

Maximum Likelihood Estimator (MLE) μ̂ :   μ̂ = arg max L(μ)

Example: for an observed value n = 5, compare P(n; S) for S = 0.5, 5, 20 → L(S; n=5) is maximal at Ŝ = 5.

MLE: the value of μ for which this data was most likely to occur
→ The MLE is a function of the data – itself an observable
→ No guarantee it is the true value (the data may be “unlikely”), but a sensible estimate

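Not from the slides: a minimal scan of the Poisson likelihood for an observed count n = 5, confirming numerically that the maximum is at Ŝ = n = 5.

import numpy as np
from scipy.stats import poisson

n_obs = 5
S_grid = np.linspace(0.1, 20.0, 2000)

# Likelihood: the Poisson probability of the observed n, viewed as a function of S
L = poisson.pmf(n_obs, S_grid)
print("S_hat =", S_grid[np.argmax(L)])   # ~ 5, the MLE for a single Poisson count
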
Gaussian case

Best-fit of the Gaussian PDF mean to the observed data.

Multiple Gaussian bins

−2 log Likelihood:

λ(μ) = −2 log L(μ) = Σi=1..Nbins ( (ni − yi(μ)) / σi )²

Maximum likelihood ⇔ Minimum χ² ⇔ Least-squares minimization

However, one typically needs to perform a non-linear minimization in other cases.

HEP practice:
• MINUIT (C++ library within ROOT, numerical gradient descent)
• scipy.minimize – using NumPy/TensorFlow/PyTorch/... backends
→ Many algorithms – gradient-based, etc.

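Not part of the lecture material: a small sketch of this least-squares case with scipy, minimizing λ(μ) = Σ((ni − yi(μ))/σi)² numerically. The linear model yi(μ) = μ·fi and all the numbers are assumptions made up for illustration.

import numpy as np
from scipy.optimize import minimize

# Toy binned data with Gaussian uncertainties sigma_i and model y_i(mu) = mu * f_i
f     = np.array([1.0, 2.0, 3.0, 4.0])
sigma = np.array([1.0, 1.2, 1.5, 2.0])
n_obs = np.array([2.1, 3.8, 6.5, 8.3])

def neg2_log_L(params):
    mu = params[0]
    return np.sum(((n_obs - mu * f) / sigma) ** 2)   # -2 log L = chi2 for Gaussian bins

result = minimize(neg2_log_L, x0=[1.0])              # numerical (gradient-based) minimization
print("mu_hat =", result.x[0], "  min chi2 =", result.fun)
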
Hands-ons
Each statistics lecture comes with “hands-on” exercises.
The hands-on session will be based on Jupyter notebooks built using the
numpy/scipy/pyplot stack.

If you have a computer, please install anaconda before the start of the class.
This provides a consistent installation of python, JupyterLab, etc.
→ Alternatively, you can also install JupyterLab as a standalone package.
→ Another solution is to run the notebooks on the public jupyter servers at
mybinder.org. This will probably be slower but avoids a local install.

No hands-on today, but have a look after the course.


Please be prepared to run the hands-ons during lectures 2 and 3 !
Links to resources
The hands-on resources for each lecture are listed below:

Lecture 1 notebook [solutions] binder [solutions] Today


Lecture 2 notebook binder

Lecture 3 notebook binder

● Use the notebook links if you have a local install: save the notebook
locally and open it with your JupyterLab installation.
● Use the binder links to use public servers: the links will open the
notebooks in a remote server session in your browser.

Notebooks with solutions to the exercises will be posted after the lectures.
Please let me know in case of technical issues running the notebooks!
Extra Slides

Error Bars
Strictly speaking, the uncertainty is given by the model :
→ Bin central value ~ mean of the bin PDF
→ Bin uncertainty ~ RMS of the bin PDF
The data is just what it is, a simple observed point.

⇒ One should in principle show the error bar on the prediction.


→ In practice, the usual convention is to have error bars on the data points.

Rare Processes ?

HEP : we almost always use Poisson distributions. Why ?

ATLAS :
• Event rate ~ 1 GHz   (L ~ 10³⁴ cm⁻²s⁻¹ ~ 10 nb⁻¹/s, σtot ~ 10⁸ nb)
• Trigger rate ~ 1 kHz   (Higgs rate ~ 0.1 Hz)

⇒ p ~ 10⁻⁶ ≪ 1   (pH→γγ ~ 10⁻¹³)

A day of data: N ~ 10¹⁴ ≫ 1

⇒ Poisson regime! Similarly true in many other physics situations.

(Large N is a design requirement, to get a not-too-small λ = Np...)

Credit: W.J. Stirling, private communication

Unbinned Shape Analysis

Observable: a set of values m1 ... mn, one per event
→ Describe the shape of the distribution of m
→ Deduce the probability to observe m1 ... mn

H→γγ-inspired example:
• Gaussian signal (peak at mH, width σ) :   Psignal(m) = G(m; mH, σ)
• Exponential background (slope α) :   Pbkg(m) = α e^−αm
• Expected yields : S, B

⇒ Total PDF for a single event:

Ptotal(m) = S/(S+B) · G(m; mH, σ) + B/(S+B) · α e^−αm

⇒ Total PDF for a dataset (probability to observe n events × probability to observe each value mi):

P({mi}i=1…n) = e^−(S+B) (S+B)ⁿ/n! · ∏i=1..n [ S/(S+B) · G(mi; mH, σ) + B/(S+B) · α e^−αmi ]

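Not from the slides: a sketch of this extended unbinned −2 log L in Python, using scipy.stats for the Gaussian signal and exponential background PDFs. The parameter values (mH, σ, α) and the toy “events” are made up for illustration, and the exponential is taken on [0, ∞) rather than on a restricted mass window.

import numpy as np
from scipy.stats import norm, expon

def neg2_log_L(S, B, m, mH=125.0, width=2.0, alpha=0.02):
    """Extended unbinned -2 log L (dropping the constant log(n!) term):
       log L = -(S+B) + sum_i log( S*Psig(m_i) + B*Pbkg(m_i) )."""
    p_sig = norm.pdf(m, loc=mH, scale=width)    # Gaussian signal G(m; mH, sigma)
    p_bkg = expon.pdf(m, scale=1.0 / alpha)     # exponential background alpha * exp(-alpha*m)
    return -2.0 * (-(S + B) + np.sum(np.log(S * p_sig + B * p_bkg)))

# Toy "diphoton masses" (GeV): a few events near 125 plus background-like ones
m_events = np.array([110.5, 124.8, 125.3, 126.1, 140.2, 152.7])
print(neg2_log_L(S=3.0, B=3.0, m=m_events))
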
Poisson Example

Assume a Poisson distribution with B = 0 :   P(n; S) = e^−S Sⁿ/n!

Say we observe n = 5, and want to infer information on the parameter S
→ Try different values of S for the fixed data value n = 5
→ Varying parameter, fixed data: likelihood

L(S; n=5) = e^−S S⁵/5!

Reading off P(n=5; S) for each value of S gives L(S; n=5), the likelihood of S for n = 5:
→ S = 0.5 : low likelihood
→ S = 5 : high likelihood
→ S = 20 : low likelihood

MLEs in Shape Analyses

Binned shape analysis:

L(S; {ni}) = P({ni}; S) = ∏i=1..N Pois(ni; S fi + Bi)

Maximize the global L(S) (each bin may prefer a different S).

In practice it is easier to minimize

λPois(S) = −2 log L(S) = −2 Σi=1..N log Pois(ni; S fi + Bi)    (needs a computer...)

In the Gaussian limit,

λGaus(S) = Σi=1..N −2 log G(ni; S fi + Bi, σi) = Σi=1..N ( (ni − (S fi + Bi)) / σi )²    (the χ² formula!)

→ Gaussian MLE (min χ² or min λGaus) : best-fit value in a χ² (least-squares) fit
→ Poisson MLE (min λPois) : best-fit value in a likelihood fit (in ROOT, fit option “L”)
→ In RooFit, λPois ⇒ RooAbsPdf::fitTo(), λGaus ⇒ RooAbsPdf::chi2FitTo()

In both cases, MLE ⇔ Best Fit

H→γγ

L(S, B; {mi}) = e^−(S+B) ∏i=1..nevts [ S Psig(mi) + B Pbkg(mi) ]

How to estimate the MLE Ŝ of S ?
→ Perform a (likelihood) best-fit of the model to the data
⇒ the fit result for S is the desired Ŝ.

In particle physics, one often uses the MINUIT minimizer within ROOT.

ATLAS-CONF-2017-045

MLE Properties

• Asymptotically Gaussian and unbiased:

  P(μ̂) ∝ exp( −(μ̂ − μ*)²/(2σμ̂²) )   and   ⟨μ̂⟩ = μ*   for n → ∞

  where σμ̂ is the standard deviation of the distribution of μ̂ for large enough datasets.

• Asymptotically efficient: σμ̂ is the lowest possible value (in the limit n → ∞) among consistent estimators.
  → The MLE captures all the available information in the data.

• Also consistent: μ̂ converges to the true value μ* for large n.

• Log-likelihood: can also minimize λ = −2 log L
  → Usually more efficient numerically
  → For a Gaussian L, λ is parabolic
  → Can drop multiplicative constants in L (additive constants in λ)

Extra: Fisher Information

Fisher Information:

I(μ) = ⟨ (∂ log L(μ)/∂μ)² ⟩ = − ⟨ ∂² log L(μ)/∂μ² ⟩

→ Measures the amount of information available in the measurement of μ.

Gaussian case:
• Gaussian likelihood: I(μ) = 1/σGauss²  → smaller σGauss ⇒ more information.
• For a Gaussian estimator μ̃ : P(μ̃) ∝ exp( −(μ̃ − μ*)²/(2σμ̃²) ); for the MLE, Var(μ̂) = σμ̂².

Cramér-Rao bound: Var(μ̃) ≥ 1/I(μ) for any estimator μ̃
→ Gaussian case: Var(μ̃) = σμ̃² ≥ σGauss²
→ One cannot be more precise than allowed by the information in the measurement.

Efficient estimators reach the bound: e.g. the MLE in the large-dataset limit.

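Not in the slides: a minimal numerical check of the Cramér-Rao bound for the simple Poisson counting model. For a single count n ~ Pois(λ), I(λ) = 1/λ, so the bound on Var(λ̂) is λ; the MLE λ̂ = n has exactly this variance and saturates the bound.

import numpy as np

rng = np.random.default_rng(2)
lam = 9.0

# Fisher information for one Poisson observation: I(lam) = E[n] / lam^2 = 1 / lam
fisher_info = 1.0 / lam
cramer_rao_bound = 1.0 / fisher_info      # = lam

# The MLE for a single count is lam_hat = n; its variance is Var(n) = lam
n = rng.poisson(lam, size=500_000)
print("Var(lam_hat) =", n.var(ddof=1), "  Cramer-Rao bound =", cramer_rao_bound)
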
Some Examples

High-mass X→γγ search (JHEP 09 (2016) 1) : 3.9σ
Higgs discovery (Phys. Lett. B 716 (2012) 1-29) : p0 = 1.8 × 10⁻⁹ ⇔ Z = 5.9σ
