
Probability and Statistics


Basic concepts
Florian RUPPIN
Université Grenoble Alpes / LPSC
ruppin@lpsc.in2p3.fr
Course content: Benoit Clément
2 Bibliography

Kendall’s Advanced theory of statistics, Hodder Arnold Pub.


volume 1: Distribution theory, A. Stuart and K. Ord
volume 2a: Classical Inference and the Linear Model, A. Stuart, K. Ord, S. Arnold
volume 2b: Bayesian inference, A. O’Hagan, J. Forster

The Review of Particle Physics, K. Nakamura et al., J. Phys. G 37, 075021 (2010) (+Booklet)
Data Analysis: A Bayesian Tutorial, D. Sivia and J. Skilling, Oxford Science Publication
Statistical Data Analysis, Glen Cowan, Oxford Science Publication

Analyse statistique des données expérimentales, K. Protassov, EDP sciences


Probabilités, analyse des données et statistiques, G. Saporta, Technip
Analyse de données en sciences expérimentales, B. Clément, Dunod



3 Contents

First lecture: Probability theory - Sample and population

SAMPLE
• Finite size
• Selected through a random process
  e.g. the result of a measurement

POPULATION
• Potentially infinite size
  e.g. all possible results

Characterization of the sample, the population and the sampling process


4 Contents

Second lecture: Statistical inference


(Diagram)
Physics parameters θ   →   POPULATION, described by f(x; θ)
POPULATION   — Experiment →   SAMPLE {xi} (finite size)
SAMPLE   — Inference →   Physics parameters θ

Using the sample to estimate the characteristics of the population


5 Random process
• Random process (“measurement” or “experiment”):
Process whose outcome cannot be predicted with certainty.

• Described by:

  Universe: Ω = set of all possible outcomes

  Event: logical condition on an outcome
    - either true or false
    - an event splits the universe into 2 subsets

• An event A will be identified with the subset of Ω for which A is true.


6 Probability

• Probability function P defined by (Kolmogorov, 1933):

    P : {Events} → [0, 1]
        A ↦ P(A)

• Properties:

    P(Ω) = 1
    P(A or B) = P(A) + P(B)   if (A and B) = ∅

• Interpretation:

  - Frequentist approach: if we repeat the random process a great number of times n,
    and count the number of times nA the outcome satisfies event A, then the ratio

        lim (n → +∞)  nA / n  =  P(A)

    defines the probability.

  - Bayesian interpretation: a probability is a measure of the credibility
    associated with the event.
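As an illustration of the frequentist limit (a sketch, not part of the original slides), here is a minimal Python example using a fair die as a hypothetical random process, with event A = "the die shows 6":

```python
# Frequentist interpretation: estimate P(A) by the observed frequency nA/n.
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n)   # n repetitions of the random process
    n_A = np.count_nonzero(rolls == 6)   # number of outcomes satisfying A
    print(f"n = {n:>9d}   nA/n = {n_A / n:.4f}   (P(A) = 1/6 ≈ 0.1667)")
```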


7 Logical relations

• Event “not A” is associated with the complement Ā of A:

    P(Ā) = 1 − P(A)
    P(∅) = 1 − P(Ω) = 0

• Event “A and B” is associated with the intersection of the subsets A and B.

• Event “A or B” is associated with the union of the subsets A and B:

    P(A or B) = P(A) + P(B) − P(A and B)


8 Conditional probability
• Event B known to be true → restriction of the universe to Ω' = B

  Definition of a new probability function on this universe, the conditional probability:

    P(A|B) = “probability of A given B”

    P(A|B) = P(A and B) / P(B)

• The definition of the conditional probability leads to:

    P(A and B) = P(A|B)·P(B) = P(B|A)·P(A)

  Relation between P(A|B) and P(B|A), the Bayes theorem:

    P(B|A) = P(A|B)·P(B) / P(A)        → major role in Bayesian inference
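A short numerical illustration of Bayes' theorem (not from the original slides; the probabilities below are invented for the example):

```python
# Hypothetical events: B = "signal event", A = "detector fires".
P_B = 0.01             # prior probability P(B)
P_A_given_B = 0.95     # P(A|B)
P_A_given_notB = 0.05  # P(A|not B)

# Total probability: P(A) = P(A|B)·P(B) + P(A|not B)·P(not B)
P_A = P_A_given_B * P_B + P_A_given_notB * (1 - P_B)

# Bayes theorem: P(B|A) = P(A|B)·P(B) / P(A)
P_B_given_A = P_A_given_B * P_B / P_A
print(P_B_given_A)     # ≈ 0.16: observing A makes B far more credible than the prior
```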


9 Incompatibility and Independence

• Two incompatible events cannot be true simultaneously: P(A and B) = 0

    ⇒ P(A or B) = P(A) + P(B)

• Two events are independent if the realization of one is not linked in any way to
  the realization of the other: P(A|B) = P(A) and P(B|A) = P(B)

    ⇒ P(A and B) = P(A)·P(B)


10 Random variable
• When the outcome of the random process is a number (real or integer),
we associate to the random process a random variable X .

• Each realization of the process leads to a particular result: X=x


x is a realization of X

• For a discrete variable:

    Probability law: p(x) = P(X = x)

• For a real variable: P(X = x) = 0

    Cumulative density function: F(x) = P(X < x)

    dF = F(x + dx) − F(x) = P(X < x + dx) − P(X < x)
       = P(X < x or x < X < x + dx) − P(X < x)
       = P(X < x) + P(x < X < x + dx) − P(X < x)
       = P(x < X < x + dx) = f(x) dx

    Probability density function (pdf): f(x) = dF/dx
11 Density function
Probability density function f(x) and cumulative density function F(x).

By construction:

    ∫_{−∞}^{+∞} f(x) dx = P(Ω) = 1
    F(−∞) = P(∅) = 0,   F(+∞) = P(Ω) = 1
    F(a) = ∫_{−∞}^{a} f(x) dx
    P(a < X < b) = F(b) − F(a) = ∫_{a}^{b} f(x) dx

Note - Discrete variables can also be described by a probability density function
using Dirac distributions:

    f(x) = Σ_i p(i) δ(x − i)    with    Σ_i p(i) = 1


12 Moments
• For any function g(x), the expectation of g is:

    E[g(X)] = ∫ g(x) f(x) dx        (mean value of g)

• Moments µk are the expectations of X^k:

    0th moment: µ0 = 1  (pdf normalization)
    1st moment: µ1 = µ  (mean)

    X' = X − µ1 is called a central variable
    2nd central moment: µ'2 = σ²  (variance)

• Characteristic function:

    φ(t) = E[e^{ixt}] = ∫ f(x) e^{ixt} dx = FT⁻¹[f]

    Taylor expansion:   φ(t) = Σ_k ∫ (itx)^k / k!  f(x) dx = Σ_k (it)^k / k!  µk

    µk = i^{−k} (d^k φ / dt^k) |_{t=0}

    The pdf is entirely defined by its moments.
    The characteristic function is a useful tool for demonstrations.


13 Sample PDF

• A sample is obtained from a random drawing within a population,


described by a probability density function.

• We’re going to discuss how to characterize, independently from one another:

a population
a sample

• To this end, it is useful to consider a sample as a finite set from which one
  can randomly draw elements, with equiprobability.
  We can then associate to this process a probability density,
  the empirical density or sample density:

    f_sample(x) = (1/n) Σ_i δ(x − xi)

  This density will be useful to translate properties of distributions to a finite sample.


14 Characterizing a distribution

How to reduce a distribution / sample to a finite number of values?

• Measure of location:

Reducing the distribution to one central value


Result

• Measure of dispersion:

Spread of the distribution around the central value

Uncertainty / Error

• Frequency table / histogram (for a finite sample)



15 Location and dispersion
Mean value: sum (integral) of all possible values weighted by the probability of occurrence

    Population:        µ = x̄ = ∫_{−∞}^{+∞} x f(x) dx
    Sample (size n):   µ = x̄ = (1/n) Σ_{i=1}^{n} xi

Standard deviation (σ) and variance (v = σ²): mean value of the squared deviation from the mean

    Population:        v = σ² = ∫ (x − µ)² f(x) dx
    Sample (size n):   v = σ² = (1/n) Σ_{i=1}^{n} (xi − µ)²

Koenig's theorem:

    σ² = ∫ x² f(x) dx + µ² ∫ f(x) dx − 2µ ∫ x f(x) dx = ⟨x²⟩ − µ² = ⟨x²⟩ − x̄²
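A minimal numerical sketch (not on the original slide) of the empirical mean, the empirical variance and Koenig's theorem, on a hypothetical exponential sample:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)   # hypothetical sample

mu = x.mean()                                 # µ = (1/n) Σ xi
v = np.mean((x - mu) ** 2)                    # v = σ² = (1/n) Σ (xi − µ)²
v_koenig = np.mean(x ** 2) - mu ** 2          # Koenig: σ² = ⟨x²⟩ − x̄²
print(mu, v, v_koenig)                        # the two variance values agree
```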


16 Discrete distributions
• Binomial distribution: randomly choosing K objects within a finite set of n,
  with a fixed drawing probability p
  (example shown on the slide: n = 10, p = 0.65)

    Variable   : K
    Parameters : n, p
    Law        : P(k; n, p) = n! / (k!(n−k)!) · p^k (1−p)^{n−k}
    Mean       : np
    Variance   : np(1−p)

• Poisson distribution: limit of the binomial when n → +∞, p → 0, np = λ
  Counting events with fixed probability per time/space unit.
  (example shown on the slide: λ = 6.5)

    Variable   : K
    Parameters : λ
    Law        : P(k; λ) = e^{−λ} λ^k / k!
    Mean       : λ
    Variance   : λ
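A small cross-check of these means and variances with scipy.stats, reusing the example values from the slide (a sketch, not part of the lecture):

```python
from scipy import stats

n, p = 10, 0.65
lam = 6.5

print(stats.binom.stats(n, p, moments="mv"))    # (6.5, 2.275) -> np and np(1−p)
print(stats.poisson.stats(lam, moments="mv"))   # (6.5, 6.5)   -> λ and λ
```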


17 Real distributions
• Uniform distribution: equiprobability over a finite range [a, b]

    Parameters : a, b
    Law        : f(x; a, b) = 1/(b−a)   if a < x < b
    Mean       : µ = (a+b)/2
    Variance   : v = σ² = (b−a)²/12

• Normal distribution: limit of many processes

    Parameters : µ, σ
    Law        : f(x; µ, σ) = 1/(σ√(2π)) · e^{−(x−µ)²/(2σ²)}
    (the FWHM of the curve is 2√(2 ln 2) σ ≈ 2.35 σ)

• Chi-square distribution: sum of the squares of n reduced normal variables

    Variable   : C = Σ_{k=1}^{n} ((Xk − µ_{Xk}) / σ_{Xk})²
    Parameters : n
    Law        : f(C; n) = C^{n/2 − 1} e^{−C/2} / (2^{n/2} Γ(n/2))
    Mean       : n        Variance : 2n


18 Convergence

Binomial distribution
    P(k; n, p) = n!/(k!(n−k)!) · p^k (1−p)^{n−k},    µ = np,  σ = √(np(1−p))

    → Poisson distribution (p small, k ≪ n, λ = np):
        P(k; λ) = e^{−λ} λ^k / k!,    µ = σ² = λ

    → Normal distribution (n ≳ 50):
        f(x; µ, σ) = 1/(σ√(2π)) e^{−(x−µ)²/(2σ²)}

Poisson distribution → Normal distribution for λ ≳ 25

Chi-square distribution
    f(C; n) = C^{n/2−1} e^{−C/2} / (2^{n/2} Γ(n/2)),    µ = n,  σ = √(2n)

    → Normal distribution for n ≳ 30


19 Multidimensional PDF
• Random variables can be generalized to random vectors:

    X⃗ = (X1, X2, ..., Xn)

• The probability density function becomes:

    f(x⃗) dx⃗ = f(x1, x2, ..., xn) dx1 dx2 ... dxn
             = P(x1 < X1 < x1+dx1 and x2 < X2 < x2+dx2 ... and xn < Xn < xn+dxn)

    and    P(a < X < b and c < Y < d) = ∫_a^b dx ∫_c^d dy f(x, y)

• Marginal density: probability of only one of the components

    fX(x) dx = P(x < X < x+dx and −∞ < Y < +∞) = ∫ (f(x, y) dx) dy

    fX(x) = ∫ f(x, y) dy


20 Multidimensional PDF
• For a fixed value of Y = y0:

    f(x|y0) dx = “probability of x < X < x+dx knowing that Y = y0”

  is a conditional density for X. It is proportional to f(x, y):

    f(x|y) ∝ f(x, y)    with    ∫ f(x|y) dx = 1

    Therefore:   f(x|y) = f(x, y) / ∫ f(x, y) dx = f(x, y) / fY(y)

• The two random variables X and Y are independent if all events of the form
  x < X < x+dx are independent from y < Y < y+dy:

    f(x|y) = fX(x)  and  f(y|x) = fY(y),    hence    f(x, y) = fX(x)·fY(y)

• For probability density functions, Bayes' theorem becomes:

    f(y|x) = f(x|y) fY(y) / fX(x) = f(x|y) fY(y) / ∫ f(x|y) fY(y) dy


21 Covariance and correlation
• A random vector (X, Y) can be treated as 2 separate variables → marginal densities
  → mean and standard deviation for each variable: µX, µY, σX, σY

• These quantities do not take into account correlations between the variables
  (the slide shows scatter plots for ρ = −0.5, ρ = 0 and ρ = 0.9).

• Generalized measure of dispersion: covariance of X and Y

    Population:  Cov(X, Y) = ∫∫ (x − µX)(y − µY) f(x, y) dx dy = ρ σX σY = µXY − µX µY
    Sample:      Cov(X, Y) = (1/n) Σ_{i=1}^{n} (xi − µX)(yi − µY)

• Correlation:   ρ = Cov(X, Y) / (σX σY)        Uncorrelated variables: ρ = 0

  Independent ⇒ Uncorrelated (the converse does not hold: the slide shows a ρ = 0
  distribution whose variables are clearly not independent)
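A short sketch of the sample covariance and correlation formulas on hypothetical data (illustration only, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = 0.9 * x + rng.normal(scale=0.5, size=5000)   # built to be correlated

cov = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X,Y) = (1/n) Σ (xi − µX)(yi − µY)
rho = cov / (x.std() * y.std())                  # ρ = Cov(X,Y) / (σX σY)
print(cov, rho)
print(np.corrcoef(x, y)[0, 1])                   # same correlation from numpy directly
```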


22 Decorrelation
• Covariance matrix for n variables Xi:   Σij = Cov(Xi, Xj)

    Σ = | σ1²         ρ12 σ1 σ2   ...   ρ1n σ1 σn |
        | ρ12 σ1 σ2   σ2²         ...   ρ2n σ2 σn |
        | ...         ...         ...   ...       |
        | ρ1n σ1 σn   ρ2n σ2 σn   ...   σn²       |

• For uncorrelated variables, Σ is diagonal.

• The matrix is real and symmetric: Σ can be diagonalized
  → definition of n new uncorrelated variables Yi

    Σ' = diag(σ'1², σ'2², ..., σ'n²) = B⁻¹ Σ B    with    Y = BX

    The σ'i² are the eigenvalues of Σ
    B contains the orthonormal eigenvectors

• The Yi are the principal components. Sorted from the largest σ' to the smallest,
  they allow dimensional reduction.
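A sketch of this decorrelation on a hypothetical 2-variable sample, diagonalizing the covariance matrix with numpy (not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
data = np.vstack([x + rng.normal(scale=0.3, size=2000),
                  2 * x + rng.normal(scale=0.3, size=2000)])   # two correlated variables

sigma = np.cov(data)                       # covariance matrix Σ
eigval, eigvec = np.linalg.eigh(sigma)     # real symmetric -> orthonormal eigenvectors
y = eigvec.T @ (data - data.mean(axis=1, keepdims=True))   # principal components

print(np.round(np.cov(y), 3))              # ~diagonal: the Yi are uncorrelated
print(eigval)                              # their variances σ'i² (the eigenvalues of Σ)
```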


23 Regression
• Measure of location:
    A point: (µX, µY)
    A curve: the line which is the closest to the points → linear regression

• Minimizing the dispersion between the line “y = ax + b” and the distribution.
  Let:

    w(a, b) = ∫∫ (y − ax − b)² f(x, y) dx dy = (1/n) Σ_i (yi − a·xi − b)²

    ∂w/∂a = 0 = ∫∫ x (y − ax − b) f(x, y) dx dy
    ∂w/∂b = 0 = ∫∫ (y − ax − b) f(x, y) dx dy

    ⇒   a (σX² + µX²) + b µX = ρ σX σY + µX µY
        a µX + b = µY

    ⇒   a = ρ σY / σX
        b = µY − ρ (σY / σX) µX

  Fully correlated: ρ = 1;  fully anti-correlated: ρ = −1.  Then Y = aX + b.
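A quick check of these regression formulas on a hypothetical sample, compared with a direct least-squares fit (illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=5000)
y = 0.7 * x + 0.3 + rng.normal(scale=0.5, size=5000)

rho = np.corrcoef(x, y)[0, 1]
a = rho * y.std() / x.std()         # a = ρ σY / σX
b = y.mean() - a * x.mean()         # b = µY − a µX
print(a, b)                         # ≈ 0.7 and 0.3
print(np.polyfit(x, y, 1))          # same line from a direct least-squares fit
```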
24 Multidimensional PDFs
• Multinomial distribution: randomly choosing K1, K2, ..., KS objects within a finite set of n,
  with a fixed drawing probability for each category p1, p2, ..., pS,
  with Σ Ki = n and Σ pi = 1

    Parameters : n, p1, p2, ..., pS
    Law        : P(k⃗; n, p⃗) = n! / (k1! k2! ... kS!) · p1^{k1} p2^{k2} ... pS^{kS}
    Mean       : µi = n pi
    Variance   : σi² = n pi (1 − pi),      Cov(Ki, Kj) = −n pi pj

  Note: the variables are not independent. The binomial corresponds to S = 2 but has
  only one independent variable.

• Multinormal distribution:

    Parameters : µ⃗, Σ
    Law        : f(x⃗; µ⃗, Σ) = 1 / √((2π)^n |Σ|) · e^{−½ (x⃗ − µ⃗)ᵀ Σ⁻¹ (x⃗ − µ⃗)}

    If uncorrelated:   f(x⃗; µ⃗, Σ) = Π_i 1/(σi √(2π)) e^{−(xi − µi)²/(2σi²)}

    → for normal variables:  Independent ⇔ Uncorrelated


25 Sum of random variables
• The sum of several random variables is a new random variable S:

    S = Σ_{i=1}^{n} Xi

• Assuming the mean and variance of each variable exist:

  Mean value of S:

    µS = ∫ (Σ_i xi) f(x1, ..., xn) dx1...dxn = Σ_i ∫ xi fXi(xi) dxi = Σ_i µi

    The mean is an additive quantity.

  Variance of S:

    σS² = ∫ (Σ_i (xi − µXi))² f(x1, ..., xn) dx1...dxn
        = Σ_i σXi² + 2 Σ_i Σ_{j<i} Cov(Xi, Xj)

    For uncorrelated variables, the variance is an additive quantity:   σS² = Σ_{i=1}^{n} σXi²
    → used for error combinations


26 Sum of random variables
• Probability density function of S: fS(s)

• Using the characteristic function:

    φS(t) = ∫ fS(s) e^{ist} ds = ∫ fX⃗(x⃗) e^{it Σ xi} dx⃗

  For independent variables:

    φS(t) = Π_k ∫ fXk(xk) e^{it xk} dxk = Π_k φXk(t)

  The characteristic function factorizes.

• The PDF is the Fourier transform of the characteristic function, therefore:

    fS = fX1 ∗ fX2 ∗ ... ∗ fXn

  The PDF of the sum of random variables is the convolution of the individual PDFs.

    Sum of Normal variables              → Normal
    Sum of Poisson variables (λ1 and λ2) → Poisson with λ = λ1 + λ2
    Sum of Chi-2 variables (n1 and n2)   → Chi-2 with n = n1 + n2
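A small numerical illustration (not from the slide) that the sum of two independent Poisson variables behaves as a Poisson variable with λ = λ1 + λ2:

```python
import numpy as np

rng = np.random.default_rng(5)
lam1, lam2 = 2.0, 4.5
s = rng.poisson(lam1, size=100_000) + rng.poisson(lam2, size=100_000)

print(s.mean(), s.var())   # both ≈ λ1 + λ2 = 6.5, as expected for a Poisson variable
```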


27 Sum of independent variables
• Weak law of large numbers

  A sample of size n is a realization of n independent variables with the same
  distribution (mean µ, variance σ²).

  The sample mean is a realization of   M = S/n = (1/n) Σ Xi

    Mean value of M:  µM = µ        Variance of M:  σM² = σ²/n

• Central limit theorem

  For n independent random variables of mean µi and variance σi², consider the
  sum of the reduced variables:

    C = (1/√n) Σ_i (Xi − µi) / σi

  The PDF of C converges to a reduced normal distribution:

    fC(c)  →  (1/√(2π)) e^{−c²/2}    as n → +∞

  The sum of many random fluctuations is normally distributed.
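A sketch of the central limit theorem at work, using uniform variables as hypothetical ingredients (histogramming c would reproduce figures like those on the next slide):

```python
import numpy as np

rng = np.random.default_rng(6)

def standardized_sum(n, size=100_000):
    """C = (1/√n) Σ (Xi − µi)/σi for n uniform variables (µ = 1/2, σ² = 1/12)."""
    x = rng.uniform(size=(size, n))
    return (x - 0.5).sum(axis=1) / np.sqrt(n / 12)

for n in (1, 2, 5, 30):
    c = standardized_sum(n)
    print(n, round(c.mean(), 3), round(c.std(), 3))   # mean → 0, std → 1, shape → Gaussian
```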


28 Central limit theorem

(Figure: distributions of X1, (X1 + X2)/√2, (X1 + X2 + X3)/√3 and
(X1 + X2 + X3 + X4 + X5)/√5, each compared to a Gaussian — the agreement
improves as more variables are summed.)


29 Dispersion and uncertainty

• Any measurement (or combination of measurements) is a realization of a random variable.

    Measured value: θ
    True value: θ0

• The uncertainty quantifies the difference between θ and θ0:
  → measure of dispersion

    Postulate:   δθ = α σθ        (absolute error, always positive)

• Usually one differentiates:

    Statistical errors: due to the measurement PDF
    Systematic errors or bias: fixed but unknown deviation (equipment, assumptions, …)
    Systematic errors can be seen as statistical errors in a set of similar experiments.


30 Error sources
Observation error: εO        Scaling error: εS        Position error: εP

    Measured value:   θ = θ0 + εO + εS + εP

• Each εi is a realization of a random variable of mean 0 and variance σi².
  For uncorrelated error sources:

    δO = α σO
    δS = α σS     ⇒    δtot² = (α σtot)² = α² (σO² + σS² + σP²) = δO² + δS² + δP²
    δP = α σP

• Choice for α:
    Many sources of error → central limit theorem → normal distribution
    α = 1 gives (approximately) a 68% confidence interval
    α = 2 gives a 95% confidence interval
31 Error propagation
• Measure: x ± δx
• Compute: f(x) → δf ?

  Assuming small errors and using the Taylor expansion (f'(x) = df/dx):

    f(x + δx) = f(x) + (df/dx) δx + ½ (d²f/dx²) δx²
    f(x − δx) = f(x) − (df/dx) δx + ½ (d²f/dx²) δx²

    δf = ½ |f(x + δx) − f(x − δx)| = |df/dx| δx
32 Error propagation
• Measure: x ± δx, y ± δy,   zm = f(xm, ym)
• Compute: f(x, y, ...) → δf ?

  Method: treat the effect of each variable as a separate error source
  (slice the surface z = f(x, y) at fixed ym: ∂f/∂x = df(x, ym)/dx)

    δxf = |∂f/∂x| δx    and    δyf = |∂f/∂y| δy

  Then:

    δf² = δxf² + δyf² + 2 ρxy δxf δyf
        = (∂f/∂x)² δx² + (∂f/∂y)² δy² + 2 ρxy (∂f/∂x)(∂f/∂y) δx δy

  In general:

    δf² = Σ_i (∂f/∂xi)² δxi² + 2 Σ_i Σ_{j<i} ρxixj (∂f/∂xi)(∂f/∂xj) δxi δxj

    Uncorrelated:     δf² = Σ_i (∂f/∂xi)² δxi²
    Correlated:       δf = |∂f/∂x| δx + |∂f/∂y| δy
    Anticorrelated:   δf = | (∂f/∂x) δx − (∂f/∂y) δy |
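A numerical sketch of uncorrelated error propagation for a hypothetical f(x, y) = x·y, cross-checked with a simple Monte Carlo (illustration, not from the slide):

```python
import numpy as np

x, dx = 3.0, 0.1
y, dy = 5.0, 0.2

# Linear propagation: δf² = (∂f/∂x)² δx² + (∂f/∂y)² δy², with ∂f/∂x = y and ∂f/∂y = x
df = np.sqrt((y * dx) ** 2 + (x * dy) ** 2)

# Monte Carlo check: sample x and y and look at the spread of f = x·y
rng = np.random.default_rng(7)
f_samples = rng.normal(x, dx, 100_000) * rng.normal(y, dy, 100_000)
print(df, f_samples.std())    # both ≈ 0.78
```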


33 Parametric estimation

• Estimating a parameter θ from a finite sample {xi}

• Statistic: a function S = f({xi})

  Any statistic can be considered as an estimator of θ.
  To be a good estimator it needs to satisfy:

    Consistency: limit of the estimator for an infinite sample
    Bias: difference between the estimator and the true value
    Efficiency: speed of convergence
    Robustness: sensitivity to statistical fluctuations

• A good estimator should at least be consistent and asymptotically unbiased.

• Efficient / unbiased / robust often contradict each other
  → need to make a choice for a given situation


34 Bias and consistency
• As the sample is a set of realizations of random variables (or of one vector variable),
  so is the estimator:

    θ̂ is a realization of Θ̂
    It has a mean, a variance, …, and a probability density function.

• Bias: characterizes the mean value of the estimator

    b(θ̂) = E[Θ̂ − θ0] = µ_Θ̂ − θ0

    Unbiased estimator:        b(θ̂) = 0
    Asymptotically unbiased:   b(θ̂) → 0   as n → +∞

• Consistency: formally   P(|θ̂ − θ| > ε) → 0  as n → +∞, for all ε
  (equivalently P(|θ̂ − θ| < ε) → 1).
  In practice: an asymptotically unbiased estimator with σ_Θ̂ → 0 as n → +∞ is consistent.

(Figure: examples of a biased, an asymptotically unbiased and an unbiased estimator.)


35 Efficiency

• For any unbiased estimator of θ, the variance cannot be smaller than the Cramér-Rao bound:

    σθ̂² ≥ 1 / E[(∂lnL/∂θ)²] = −1 / E[∂²lnL/∂θ²]

• The efficiency of a convergent estimator is given by its variance.

  An efficient estimator reaches the Cramér-Rao bound (at least asymptotically)
  → minimal variance estimator

• The minimal variance estimator will often be biased, but asymptotically unbiased.


36 Empirical estimator

• The sample mean is a good estimator of the population mean
  (weak law of large numbers: convergent, unbiased):

    µ̂ = (1/n) Σ xi        µ_µ̂ = E[µ̂] = µ        σµ̂² = E[(µ̂ − µ)²] = σ²/n

• Sample variance as an estimator of the population variance:

    ŝ² = (1/n) Σ_i (xi − µ̂)² = (1/n) Σ_i (xi − µ)² − (µ̂ − µ)²

    E[ŝ²] = σ² − σµ̂² = σ² − σ²/n = (n−1)/n · σ²    → biased, asymptotically unbiased

  Unbiased variance estimator:

    σ̂² = 1/(n−1) Σ_i (xi − µ̂)²

  Variance of the estimator (convergence):   σ²_{σ̂²} → 2σ⁴/n   as n → +∞
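A quick numerical check of the (n−1)/n bias on small hypothetical samples (illustration only):

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma2 = 5, 4.0                          # small sample size, true variance σ² = 4

samples = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))
s2_biased = samples.var(axis=1, ddof=0)     # ŝ² = (1/n) Σ (xi − µ̂)²
s2_unbiased = samples.var(axis=1, ddof=1)   # σ̂² = 1/(n−1) Σ (xi − µ̂)²

print(s2_biased.mean())     # ≈ (n−1)/n · σ² = 3.2
print(s2_unbiased.mean())   # ≈ σ² = 4.0
```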


37 Errors on estimators

Uncertainty ↔ estimator standard deviation

• Use an estimator of the standard deviation:   σ̂ = √(σ̂²)   (biased!)

• Mean:       µ̂ = (1/n) Σ xi,                  δµ̂ = σµ̂ = √(σ̂²/n)

• Variance:   σ̂² = 1/(n−1) Σ_i (xi − µ̂)²,      σ²_{σ̂²} ≈ 2σ⁴/n   ⇒   δσ̂² = σ̂² √(2/n)

• Central limit theorem ⇒ the empirical estimators of mean and variance are
  normally distributed for large enough samples:

    µ̂ ± δµ̂  and  σ̂ ± δσ̂   define 68% confidence intervals


38 Likelihood function
Generic function k(x, θ)
    x: random variable(s)
    θ: parameter(s)

Fix θ = θ0 (true value)                      Fix x = u (one realization of the random variable)
→ probability density function               → likelihood function

    f(x; θ) = k(x, θ0)                           L(θ) = k(u, θ)
    ∫ f(x; θ) dx = 1                             ∫ L(θ) dθ = ???

    for Bayesians: f(x|θ) = f(x; θ)              for Bayesians: f(θ|x) = L(θ) / ∫ L(θ) dθ

For a sample: n independent realizations of the same variable X:

    L(θ) = Π_i k(xi, θ) = Π_i f(xi; θ)
39 Maximum likelihood
• Consider a sample of measurements {xi}.
  The analytical form of the density is known and depends on several unknown parameters θ.

  For example: event counting follows a Poisson distribution with a parameter λi(θ)
  depending on the physics:

    L(θ) = Π_i e^{−λi(θ)} λi(θ)^{xi} / xi!

• An estimator of the parameters θ is given by the position of the maximum of the
  likelihood function
  → parameter values which maximize the probability to get the observed results:

    ∂L/∂θ |_{θ=θ̂} = 0

  Note: system of equations for several parameters
  Note: minimizing −lnL often simplifies the expression
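A minimal sketch of a maximum-likelihood fit (hypothetical counts, a single Poisson parameter λ), minimizing −lnL numerically; for a Poisson sample the MLE is simply the sample mean, which serves as a cross-check:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(9)
counts = rng.poisson(3.2, size=200)               # observed sample {xi}

def neg_log_likelihood(lam):
    return -np.sum(poisson.logpmf(counts, lam))   # −lnL(λ) = −Σ ln f(xi; λ)

result = minimize_scalar(neg_log_likelihood, bounds=(0.1, 20.0), method="bounded")
print(result.x, counts.mean())                    # both give the same λ̂
```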


40 Properties of MLE
• Mostly asymptotic properties: valid for large samples, often assumed in any
  case for lack of better information

    Asymptotically unbiased
    Asymptotically efficient (reaches the Cramér-Rao bound)
    Asymptotically normally distributed

  Multinormal law with covariance given by a generalization of the Cramér-Rao bound
  (nθ = number of parameters):

    f(θ⃗̂; θ⃗, Σ) = 1/√((2π)^{nθ} |Σ|) · e^{−½ (θ⃗̂ − θ⃗)ᵀ Σ⁻¹ (θ⃗̂ − θ⃗)}
    with   (Σ⁻¹)ij = E[ ∂lnL/∂θi · ∂lnL/∂θj ]

• Goodness of fit: the value of −2 lnL(θ̂) is Chi-2 distributed with
  ndf = sample size − number of parameters

    p-value = ∫_{−2lnL(θ̂)}^{+∞} fχ²(x; ndf) dx        → probability of getting a worse agreement


41 Errors on MLE

    f(θ⃗̂; θ⃗, Σ) = 1/√((2π)^{nθ} |Σ|) · e^{−½ (θ⃗̂ − θ⃗)ᵀ Σ⁻¹ (θ⃗̂ − θ⃗)}
    with   (Σ⁻¹)ij = E[ ∂lnL/∂θi · ∂lnL/∂θj ]

• Errors on the parameters are given by the covariance matrix.

• For one parameter, the 68% confidence interval is:

    δθ = σθ̂ = √( −1 / (∂²lnL/∂θ²) |_{θ=θ̂} )

  (only one realization of the estimator: empirical mean of 1 value)

• More generally:

    ΔlnL = lnL(θ) − lnL(θ̂) = −½ Σ_{i,j} (Σ⁻¹)ij (θi − θ̂i)(θj − θ̂j) + O(θ³)

  Confidence contours are defined by the equation:

    −2 ΔlnL = β(nθ, α)    with    α = ∫_0^{β} fχ²(x; nθ) dx

  Values of −ΔlnL for different numbers of parameters nθ and confidence levels α:

    α (%)     nθ = 1    nθ = 2    nθ = 3
    68.3      0.5       1.15      1.76
    95.4      2         3.09      4.01
    99.7      4.5       5.92      7.08


42 Least squares
• Set of measurements (xi, yi) with uncertainties δyi on the yi.
  Theoretical law:  y = f(x, θ)

• Naive approach: use regression

    w(θ) = Σ_i (yi − f(xi, θ))²        ∂w/∂θi = 0

• Reweight each term by its associated error:

    K²(θ) = Σ_i ( (yi − f(xi, θ)) / δyi )²        ∂K²/∂θi = 0

• Maximum likelihood: assume that each yi is normally distributed with a mean
  equal to f(xi, θ) and a standard deviation given by δyi.
  The likelihood is then:

    L(θ) = Π_i 1/(δyi √(2π)) · e^{−½ ((yi − f(xi, θ))/δyi)²}

    ∂L/∂θ = 0   ⇔   −2 ∂lnL/∂θ = ∂K²/∂θ = 0

  The least squares (or Chi-2) fit is the maximum likelihood estimator for Gaussian errors.

• Generic case with correlations:

    K²(θ⃗) = (y⃗ − f⃗(x, θ⃗))ᵀ Σ⁻¹ (y⃗ − f⃗(x, θ⃗))
43 Example: fitting a line
• For f(x) = ax:    K²(a) = A a² − 2 B a + C = −2 lnL

    A = Σ xi²/δyi²,     B = Σ xi yi/δyi²,     C = Σ yi²/δyi²

    ∂K²/∂a = 2 A a − 2 B = 0    ⇒    â = B/A

    ∂²K²/∂a² = 2 A = 2/σa²      ⇒    δâ = σa = 1/√A
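A direct implementation of this closed-form weighted fit on a few hypothetical data points (values invented for the illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # roughly y = 2x
dy = np.array([0.2, 0.2, 0.3, 0.3, 0.4])    # uncertainties δy on y

A = np.sum(x**2 / dy**2)
B = np.sum(x * y / dy**2)

a_hat = B / A                                # â = B/A
da = 1.0 / np.sqrt(A)                        # δâ = 1/√A
chi2 = np.sum(((y - a_hat * x) / dy) ** 2)   # K²(â), to judge the goodness of fit
print(a_hat, da, chi2)
```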


44 Example: fitting a line
• For f(x) = ax + b:

    K²(a, b) = A a² + B b² + 2 C a b − 2 D a − 2 E b + F = −2 lnL

    A = Σ xi²/δyi²,    B = Σ 1/δyi²,     C = Σ xi/δyi²,
    D = Σ xi yi/δyi²,  E = Σ yi/δyi²,    F = Σ yi²/δyi²

    ∂K²/∂a = 2 A a + 2 C b − 2 D = 0
    ∂K²/∂b = 2 C a + 2 B b − 2 E = 0
        ⇒    â = (BD − CE)/(AB − C²),     b̂ = (AE − CD)/(AB − C²)

    ∂²K²/∂a² = 2 A = 2 (Σ⁻¹)11
    ∂²K²/∂b² = 2 B = 2 (Σ⁻¹)22          Σ⁻¹ = | A  C |       Σ = 1/(AB − C²) |  B  −C |
    ∂²K²/∂a∂b = 2 C = 2 (Σ⁻¹)12                | C  B |                       | −C   A |

    δâ = σa = √( B/(AB − C²) ),     δb̂ = σb = √( A/(AB − C²) )


45 Example: fitting a line

• Two-dimensional error contours on a (slope) and b (intercept):
  the slide shows the contours at 68.3%, 95.4% and 99.7% confidence level
  around the fitted point (â, b̂).


46 Non-parametric estimation

• Directly estimating the probability density function

    → likelihood ratio discriminant
    → separating power of variables
    → data / Monte Carlo agreement

• Frequency table: for a sample {xi}, i = 1...n

    1. Define successive intervals (bins) Ck = [ak, ak+1[
    2. Count the number of events nk in Ck

• Histogram: graphical representation of the frequency table,   h(x) = nk if x ∈ Ck


47 Histogram
N/Z for stable heavy nuclei (raw values; the slide also shows the corresponding
frequency table — bin, number of entries, frequency — and the histogram built from it):

1.321, 1.357, 1.392, 1.410, 1.428, 1.446, 1.464, 1.421,


1.438, 1.344, 1.379, 1.413, 1.448, 1.389, 1.366, 1.383,
1.400, 1.416, 1.433, 1.466, 1.500, 1.322, 1.370, 1.387,
1.403, 1.419, 1.451, 1.483, 1.396, 1.428, 1.375, 1.406,
1.421, 1.437, 1.453, 1.468, 1.500, 1.446, 1.363, 1.393,
1.424, 1.439, 1.454, 1.469, 1.484, 1.462, 1.382, 1.411,
1.441, 1.455, 1.470, 1.500, 1.449, 1.400, 1.428, 1.442,
1.457, 1.471, 1.485, 1.514, 1.464, 1.478, 1.416, 1.444,
1.458, 1.472, 1.486, 1.500, 1.465, 1.479, 1.432, 1.459,
1.472, 1.486, 1.513, 1.466, 1.493, 1.421, 1.447, 1.460,
1.473, 1.486, 1.500, 1.526, 1.480, 1.506, 1.435, 1.461,
1.487, 1.500, 1.512, 1.538, 1.493, 1.450, 1.475, 1.500,
1.512, 1.525, 1.550, 1.506, 1.530, 1.487, 1.512, 1.524,
1.536, 1.518, 1.577, 1.554, 1.586, 1.586



48 Histogram as PDF estimator
• Statistical description: the nk are multinomial random variables

    Parameters:   n = Σ_k nk,      pk = P(x ∈ Ck) = ∫_{Ck} fX(x) dx

    µ_{nk} = n pk        σ²_{nk} = n pk (1 − pk) ≈ µ_{nk}   (pk ≪ 1)
    Cov(nk, nr) = −n pk pr ≈ 0   (pk ≪ 1)

  For a large sample:                        For small classes (width δ):
    lim_{n→+∞} nk/n = µ_{nk}/n = pk            pk = ∫_{Ck} fX(x) dx ≈ δ f(xc)   ⇒   lim_{δ→0} pk/δ = f(x)

  Finally:    f(x) = lim_{n→+∞, δ→0}  h(x) / (n δ)

• The histogram is an estimator of the probability density function.

• Each bin can be described by a Poisson density.
  The 1σ error on nk is then:   δnk = √(σ̂²_{nk}) = √(µ̂_{nk}) = √nk
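A small sketch of the histogram used as a density estimator, with the √nk Poisson error per bin (hypothetical Gaussian sample, not the slide's data):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.normal(size=2000)                       # hypothetical sample

counts, edges = np.histogram(x, bins=20)
width = np.diff(edges)                          # bin width δ
f_hat = counts / (len(x) * width)               # f(x) ≈ h(x)/(nδ)
f_err = np.sqrt(counts) / (len(x) * width)      # δnk = √nk, propagated to the density

print(np.column_stack([edges[:-1], f_hat, f_err])[:5])
```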


49 Confidence interval

• For a random variable, a confidence interval with confidence level α
  is any interval [a, b] such that:

    P(X ∈ [a, b]) = ∫_a^b fX(x) dx = α        → probability of finding a realization inside the interval

• Generalization of the concept of uncertainty:
  an interval that contains the true value with a given probability.

• For Bayesians: the posterior density is the probability density of the true value.
  It can be used to estimate an interval:   P(θ ∈ [a, b]) = α

• No such thing for a frequentist: the interval itself becomes the random variable
  → [a, b] is a realization of [A, B]

    P(A < θ and B > θ) = α    independently of θ


50 Confidence interval
• Mean centered, symmetric interval [µ − a, µ + a]:

    ∫_{µ−a}^{µ+a} f(x) dx = α

• Mean centered, probability symmetric interval [a, b]:

    ∫_a^µ f(x) dx = ∫_µ^b f(x) dx = α/2

• Highest probability density (HPD) interval [a, b]:

    ∫_a^b f(x) dx = α
    f(x) > f(y)   for x ∈ [a, b] and y ∉ [a, b]


51 Confidence Belt
• To build a frequentist interval for an estimator θ̂ of θ:

  1. Make pseudo-experiments for several values of θ and compute the estimator θ̂
     for each (Monte Carlo sampling of the estimator PDF).

  2. For each θ, determine Ξ(θ) and Ω(θ) such that:
       θ̂ < Ξ(θ) for a fraction (1 − α)/2 of the pseudo-experiments
       θ̂ > Ω(θ) for a fraction (1 − α)/2 of the pseudo-experiments
     These 2 curves are the confidence belt for a confidence level α.

  3. Invert these functions. The interval [Ω⁻¹(θ̂), Ξ⁻¹(θ̂)] satisfies:

       P( Ω⁻¹(θ̂) < θ < Ξ⁻¹(θ̂) ) = 1 − P( Ξ⁻¹(θ̂) < θ ) − P( Ω⁻¹(θ̂) > θ )
                                 = 1 − P( θ̂ < Ξ(θ) ) − P( θ̂ > Ω(θ) ) = α

  (Figure: confidence belt for a Poisson parameter estimated with the empirical
  mean of 3 realizations, 68% CL.)
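A sketch of steps 1-2 by Monte Carlo for the case shown in the figure (Poisson parameter estimated by the mean of 3 counts); step 3 amounts to reading this belt at the observed θ̂:

```python
import numpy as np

rng = np.random.default_rng(11)
alpha = 0.683
theta_grid = np.linspace(0.5, 10.0, 20)          # true values θ to scan

for theta in theta_grid[:5]:
    pseudo = rng.poisson(theta, size=(50_000, 3)).mean(axis=1)   # θ̂ of each pseudo-experiment
    xi, omega = np.quantile(pseudo, [(1 - alpha) / 2, 1 - (1 - alpha) / 2])
    print(f"θ = {theta:.2f}   belt: Ξ(θ) = {xi:.2f}, Ω(θ) = {omega:.2f}")
```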


52 Dealing with systematics

• The variance of the estimator only measures the statistical uncertainty.

• Often, we will have to deal with parameters whose value is known with limited precision
  → systematic uncertainties

• The likelihood function becomes:

    L(θ, ν)    with    ν = ν0 ± δν    or    ν0 (+δν+, −δν−)

  The parameters ν are nuisance parameters.


53 Bayesian inference

• In Bayesian statistics, nuisance parameters are dealt with by assigning them a prior π(ν).

• Usually a multinormal law is used, with mean ν0 and covariance matrix estimated
  from the quoted uncertainties δν (+ correlations if needed):

    f(θ, ν|x) = f(x|θ, ν) π(θ) π(ν) / ∫∫ f(x|θ, ν) π(θ) π(ν) dθ dν

• The final posterior distribution is obtained by marginalization over the nuisance parameters:

    f(θ|x) = ∫ f(θ, ν|x) dν = ∫ f(x|θ, ν) π(θ) π(ν) dν / ∫∫ f(x|θ, ν) π(θ) π(ν) dθ dν


54 Profile Likelihood
• There is no true frequentist way to add systematic effects. Popular method of the day: profiling

• Deal with nuisance parameters as realizations of random variables and extend the likelihood:

    L(θ, ν)  →  L'(θ, ν) = L(θ, ν) G(ν)

• G(ν) is the likelihood of the new parameters (identical to a prior).

• For each value of θ, maximize the likelihood with respect to the nuisance parameters:
  → profile likelihood PL(θ)

• PL(θ) has the same asymptotic statistical properties as the regular likelihood.


55 Statistical tests

• Statistical tests aim at:

Checking the compatibility of a dataset {xi } with a given distribution

  Checking the compatibility of two datasets {xi}, {yi}: do they come from
  the same distribution?

  Comparing different hypotheses: background vs. signal + background

• In every case:

Build a statistic that quantifies the agreement with the hypothesis


Convert it into a probability of compatibility/incompatibility: p-value



56 Pearson test
• Test for binned data: use the Poisson limit of the histogram

  Sort the sample into k bins Ci, containing ni entries each.
  Compute the probability of each class:   pi = ∫_{Ci} f(x) dx

  For each bin, the test statistic compares the deviation of the observation
  from the expected (Poisson) mean to the theoretical (Poisson) variance:

    χ² = Σ_{bins i} (ni − n pi)² / (n pi)

• χ² follows (asymptotically) a Chi-2 law with k − 1 degrees of freedom
  (one constraint: Σ ni = n)

• p-value: probability of doing worse:    p-value = ∫_{χ²}^{+∞} fχ²(x; k−1) dx

  For a “good” agreement:   χ²/(k−1) ~ 1
  More precisely:   χ² ≈ (k−1) ± √(2(k−1))    (1σ interval, ~68% CL)
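A compact sketch of the Pearson test with scipy, for a hypothetical sample tested against a standard normal density:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
x = rng.normal(size=1000)                        # hypothetical sample
n = len(x)

edges = np.linspace(-4.0, 4.0, 11)               # k = 10 bins
ni, _ = np.histogram(x, bins=edges)
pi = np.diff(stats.norm.cdf(edges))              # pi = ∫_Ci f(x) dx

chi2 = np.sum((ni - n * pi) ** 2 / (n * pi))
p_value = stats.chi2.sf(chi2, df=len(ni) - 1)    # probability of doing worse
print(chi2, p_value)
```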


57 Kolmogorov-Smirnov test

• Test for unbinned data: compare the sample cumulative density function to the tested one.

• Sample PDF (ordered sample):

    fS(x) = (1/n) Σ_i δ(x − xi)

    FS(x) = 0      for x < x0
          = k/n    for xk ≤ x < xk+1
          = 1      for x > xn

• The Kolmogorov statistic is the largest deviation:

    Dn = sup_x |FS(x) − F(x)|

• The test distribution has been computed by Kolmogorov:

    P(Dn > z/√n) = 2 Σ_r (−1)^{r−1} e^{−2 r² z²}

  [0, δ] defines a confidence interval for Dn:

    δ = 0.9584/√n for 68.3% CL        δ = 1.3754/√n for 95.4% CL
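A sketch of the same test done with scipy's one-sample KS test, against the exponential law used in the example on the next slide (the sample below is generated for the illustration, not the slide's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
lam = 0.4
sample = rng.exponential(scale=1 / lam, size=118)      # hypothetical data, mean 1/λ = 2.5

# F(x) of the tested law; scipy parametrizes the exponential with scale = 1/λ
result = stats.kstest(sample, stats.expon(scale=1 / lam).cdf)
print(result.statistic, result.pvalue)                 # Dn and its p-value
```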


58 Example
• Test compatibility with an exponential law:   f(x) = λ e^{−λx},   λ = 0.4
0.008, 0.036, 0.112, 0.115, 0.133, 0.178, 0.189, 0.238, 0.274, 0.323, 0.364, 0.386, 0.406, 0.409, 0.418, 0.421, 0.423, 0.455,
0.459, 0.496, 0.519, 0.522, 0.534, 0.582, 0.606, 0.624, 0.649, 0.687, 0.689, 0.764, 0.768, 0.774, 0.825, 0.843, 0.921, 0.987,
0.992, 1.003, 1.004, 1.015, 1.034, 1.064, 1.112, 1.159, 1.163, 1.208, 1.253, 1.287, 1.317, 1.320, 1.333, 1.412, 1.421, 1.438,
1.574, 1.719, 1.769, 1.830, 1.853, 1.930, 2.041, 2.053, 2.119, 2.146, 2.167, 2.237, 2.243, 2.249, 2.318, 2.325, 2.349, 2.372,
2.465, 2.497, 2.553, 2.562, 2.616, 2.739, 2.851, 3.029, 3.327, 3.335, 3.390, 3.447, 3.473, 3.568, 3.627, 3.718, 3.720, 3.814,
3.854, 3.929, 4.038, 4.065, 4.089, 4.177, 4.357, 4.403, 4.514, 4.771, 4.809, 4.827, 5.086, 5.191, 5.928, 5.952, 5.968, 6.222,
6.556, 6.670, 7.673, 8.071, 8.165, 8.181, 8.383, 8.557, 8.606, 9.032, 10.482, 14.174

    Dn = 0.069
    p-value = 0.0617
    1σ interval: [0, 0.0875]
