
Probability and Statistics


Basic concepts
Florian RUPPIN
Université Grenoble Alpes / LPSC
ruppin@lpsc.in2p3.fr
Course content: Benoit Clément
2 Bibliography

Kendall’s Advanced theory of statistics, Hodder Arnold Pub.


volume 1: Distribution theory, A. Stuart and K. Ord
volume 2a: Classical Inference and the Linear Model, A. Stuart, K. Ord, S. Arnold
volume 2b: Bayesian inference, A. O’Hagan, J. Forster

The Review of Particle Physics, K. Nakamura et al., J. Phys. G 37, 075021 (2010) (+Booklet)
Data Analysis: A Bayesian Tutorial, D. Sivia and J. Skilling, Oxford Science Publication
Statistical Data Analysis, Glen Cowan, Oxford Science Publication

Analyse statistique des données expérimentales, K. Protassov, EDP sciences


Probabilités, analyse des données et statistiques, G. Saporta, Technip
Analyse de données en sciences expérimentales, B. Clément, Dunod



3 Contents

First lecture: Probability theory - Sample and population

SAMPLE
• Finite size
• Selected through a random process
  e.g. the result of a measurement

POPULATION
• Potentially infinite size
  e.g. all possible results

Characterization of the sample, the population and the sampling process


4 Contents

Second lecture: Statistical inference


(Diagram)
Physics parameters θ   →   POPULATION, described by f(x; θ)
POPULATION   — Experiment →   SAMPLE {xi} (finite size)
SAMPLE   — Inference →   Physics parameters θ

Using the sample to estimate the characteristics of the population


5 Random process
• Random process (“measurement” or “experiment”):
Process whose outcome cannot be predicted with certainty.

• Described by:

  Universe: Ω = set of all possible outcomes

  Event: logical condition on an outcome
    - either true or false
    - an event splits the universe into 2 subsets

• An event A will be identified with the subset of Ω for which A is true.


6 Probability

• Probability function P defined by (Kolmogorov, 1933):

    P : {Events} → [0, 1]
        A ↦ P(A)

• Properties:

    P(Ω) = 1
    P(A or B) = P(A) + P(B)   if (A and B) = ∅

• Interpretation:

  - Frequentist approach: if we repeat the random process a great number of times n,
    and count the number of times nA the outcome satisfies event A, then the ratio

        lim (n → +∞)  nA / n  =  P(A)

    defines the probability.

  - Bayesian interpretation: a probability is a measure of the credibility
    associated with the event.
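As an illustration of the frequentist limit (a sketch, not part of the original slides), here is a minimal Python example using a fair die as a hypothetical random process, with event A = "the die shows 6":

```python
# Frequentist interpretation: estimate P(A) by the observed frequency nA/n.
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n)   # n repetitions of the random process
    n_A = np.count_nonzero(rolls == 6)   # number of outcomes satisfying A
    print(f"n = {n:>9d}   nA/n = {n_A / n:.4f}   (P(A) = 1/6 ≈ 0.1667)")
```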


7 Logical relations

• Event “not A” is associated with the complement Ā of A:

    P(Ā) = 1 − P(A)
    P(∅) = 1 − P(Ω) = 0

• Event “A and B” is associated with the intersection of the subsets A and B.

• Event “A or B” is associated with the union of the subsets A and B:

    P(A or B) = P(A) + P(B) − P(A and B)


8 Conditional probability
• Event B known to be true → restriction of the universe to Ω' = B

  Definition of a new probability function on this universe, the conditional probability:

    P(A|B) = “probability of A given B”

    P(A|B) = P(A and B) / P(B)

• The definition of the conditional probability leads to:

    P(A and B) = P(A|B)·P(B) = P(B|A)·P(A)

  Relation between P(A|B) and P(B|A), the Bayes theorem:

    P(B|A) = P(A|B)·P(B) / P(A)        → major role in Bayesian inference
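A short numerical illustration of Bayes' theorem (not from the original slides; the probabilities below are invented for the example):

```python
# Hypothetical events: B = "signal event", A = "detector fires".
P_B = 0.01             # prior probability P(B)
P_A_given_B = 0.95     # P(A|B)
P_A_given_notB = 0.05  # P(A|not B)

# Total probability: P(A) = P(A|B)·P(B) + P(A|not B)·P(not B)
P_A = P_A_given_B * P_B + P_A_given_notB * (1 - P_B)

# Bayes theorem: P(B|A) = P(A|B)·P(B) / P(A)
P_B_given_A = P_A_given_B * P_B / P_A
print(P_B_given_A)     # ≈ 0.16: observing A makes B far more credible than the prior
```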


9 Incompatibility and Independence

• Two incompatible events cannot be true simultaneously: P(A and B) = 0

    ⇒ P(A or B) = P(A) + P(B)

• Two events are independent if the realization of one is not linked in any way to
  the realization of the other: P(A|B) = P(A) and P(B|A) = P(B)

    ⇒ P(A and B) = P(A)·P(B)


10 Random variable
• When the outcome of the random process is a number (real or integer),
we associate to the random process a random variable X .

• Each realization of the process leads to a particular result: X=x


x is a realization of X

• For a discrete variable:

    Probability law: p(x) = P(X = x)

• For a real variable: P(X = x) = 0

    Cumulative density function: F(x) = P(X < x)

    dF = F(x + dx) − F(x) = P(X < x + dx) − P(X < x)
       = P(X < x or x < X < x + dx) − P(X < x)
       = P(X < x) + P(x < X < x + dx) − P(X < x)
       = P(x < X < x + dx) = f(x) dx

    Probability density function (pdf): f(x) = dF/dx
11 Density function
Probability density function f(x) and cumulative density function F(x).

By construction:

    ∫_{−∞}^{+∞} f(x) dx = P(Ω) = 1
    F(−∞) = P(∅) = 0,   F(+∞) = P(Ω) = 1
    F(a) = ∫_{−∞}^{a} f(x) dx
    P(a < X < b) = F(b) − F(a) = ∫_{a}^{b} f(x) dx

Note - Discrete variables can also be described by a probability density function
using Dirac distributions:

    f(x) = Σ_i p(i) δ(x − i)    with    Σ_i p(i) = 1


12 Moments
• For any function g(x), the expectation of g is:

    E[g(X)] = ∫ g(x) f(x) dx        (mean value of g)

• Moments µk are the expectations of X^k:

    0th moment: µ0 = 1  (pdf normalization)
    1st moment: µ1 = µ  (mean)

    X' = X − µ1 is called a central variable
    2nd central moment: µ'2 = σ²  (variance)

• Characteristic function:

    φ(t) = E[e^{ixt}] = ∫ f(x) e^{ixt} dx = FT⁻¹[f]

    Taylor expansion:   φ(t) = Σ_k ∫ (itx)^k / k!  f(x) dx = Σ_k (it)^k / k!  µk

    µk = i^{−k} (d^k φ / dt^k) |_{t=0}

    The pdf is entirely defined by its moments.
    The characteristic function is a useful tool for demonstrations.


13 Sample PDF

• A sample is obtained from a random drawing within a population,


described by a probability density function.

• We’re going to discuss how to characterize, independently from one another:

a population
a sample

• To this end, it is useful to consider a sample as a finite set from which one
  can randomly draw elements, with equiprobability.
  We can then associate to this process a probability density,
  the empirical density or sample density:

    f_sample(x) = (1/n) Σ_i δ(x − xi)

  This density will be useful to translate properties of distributions to a finite sample.


14 Characterizing a distribution

How to reduce a distribution / sample to a finite number of values?

• Measure of location:

Reducing the distribution to one central value


Result

• Measure of dispersion:

Spread of the distribution around the central value

Uncertainty / Error

• Frequency table / histogram (for a finite sample)



15 Location and dispersion
Mean value: sum (integral) of all possible values weighted by the probability of occurrence

    Population:        µ = x̄ = ∫_{−∞}^{+∞} x f(x) dx
    Sample (size n):   µ = x̄ = (1/n) Σ_{i=1}^{n} xi

Standard deviation (σ) and variance (v = σ²): mean value of the squared deviation from the mean

    Population:        v = σ² = ∫ (x − µ)² f(x) dx
    Sample (size n):   v = σ² = (1/n) Σ_{i=1}^{n} (xi − µ)²

Koenig's theorem:

    σ² = ∫ x² f(x) dx + µ² ∫ f(x) dx − 2µ ∫ x f(x) dx = ⟨x²⟩ − µ² = ⟨x²⟩ − x̄²
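A minimal numerical sketch (not on the original slide) of the empirical mean, the empirical variance and Koenig's theorem, on a hypothetical exponential sample:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)   # hypothetical sample

mu = x.mean()                                 # µ = (1/n) Σ xi
v = np.mean((x - mu) ** 2)                    # v = σ² = (1/n) Σ (xi − µ)²
v_koenig = np.mean(x ** 2) - mu ** 2          # Koenig: σ² = ⟨x²⟩ − x̄²
print(mu, v, v_koenig)                        # the two variance values agree
```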


16 Discrete distributions
• Binomial distribution: randomly choosing K objects within a finite set of n,
  with a fixed drawing probability p
  (example shown on the slide: n = 10, p = 0.65)

    Variable   : K
    Parameters : n, p
    Law        : P(k; n, p) = n! / (k!(n−k)!) · p^k (1−p)^{n−k}
    Mean       : np
    Variance   : np(1−p)

• Poisson distribution: limit of the binomial when n → +∞, p → 0, np = λ
  Counting events with fixed probability per time/space unit.
  (example shown on the slide: λ = 6.5)

    Variable   : K
    Parameters : λ
    Law        : P(k; λ) = e^{−λ} λ^k / k!
    Mean       : λ
    Variance   : λ
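A small cross-check of these means and variances with scipy.stats, reusing the example values from the slide (a sketch, not part of the lecture):

```python
from scipy import stats

n, p = 10, 0.65
lam = 6.5

print(stats.binom.stats(n, p, moments="mv"))    # (6.5, 2.275) -> np and np(1−p)
print(stats.poisson.stats(lam, moments="mv"))   # (6.5, 6.5)   -> λ and λ
```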


17 Real distributions
• Uniform distribution: equiprobability over a finite range [a, b]

    Parameters : a, b
    Law        : f(x; a, b) = 1/(b−a)   if a < x < b
    Mean       : µ = (a+b)/2
    Variance   : v = σ² = (b−a)²/12

• Normal distribution: limit of many processes

    Parameters : µ, σ
    Law        : f(x; µ, σ) = 1/(σ√(2π)) · e^{−(x−µ)²/(2σ²)}
    (the FWHM of the curve is 2√(2 ln 2) σ ≈ 2.35 σ)

• Chi-square distribution: sum of the squares of n reduced normal variables

    Variable   : C = Σ_{k=1}^{n} ((Xk − µ_{Xk}) / σ_{Xk})²
    Parameters : n
    Law        : f(C; n) = C^{n/2 − 1} e^{−C/2} / (2^{n/2} Γ(n/2))
    Mean       : n        Variance : 2n


18 Convergence

Binomial distribution
    P(k; n, p) = n!/(k!(n−k)!) · p^k (1−p)^{n−k},    µ = np,  σ = √(np(1−p))

    → Poisson distribution (p small, k ≪ n, λ = np):
        P(k; λ) = e^{−λ} λ^k / k!,    µ = σ² = λ

    → Normal distribution (n ≳ 50):
        f(x; µ, σ) = 1/(σ√(2π)) e^{−(x−µ)²/(2σ²)}

Poisson distribution → Normal distribution for λ ≳ 25

Chi-square distribution
    f(C; n) = C^{n/2−1} e^{−C/2} / (2^{n/2} Γ(n/2)),    µ = n,  σ = √(2n)

    → Normal distribution for n ≳ 30


19 Multidimensional PDF
• Random variables can be generalized to random vectors:

    X⃗ = (X1, X2, ..., Xn)

• The probability density function becomes:

    f(x⃗) dx⃗ = f(x1, x2, ..., xn) dx1 dx2 ... dxn
             = P(x1 < X1 < x1+dx1 and x2 < X2 < x2+dx2 ... and xn < Xn < xn+dxn)

    and    P(a < X < b and c < Y < d) = ∫_a^b dx ∫_c^d dy f(x, y)

• Marginal density: probability of only one of the components

    fX(x) dx = P(x < X < x+dx and −∞ < Y < +∞) = ∫ (f(x, y) dx) dy

    fX(x) = ∫ f(x, y) dy


20 Multidimensional PDF
• For a fixed value of Y = y0:

    f(x|y0) dx = “probability of x < X < x+dx knowing that Y = y0”

  is a conditional density for X. It is proportional to f(x, y):

    f(x|y) ∝ f(x, y)    with    ∫ f(x|y) dx = 1

    Therefore:   f(x|y) = f(x, y) / ∫ f(x, y) dx = f(x, y) / fY(y)

• The two random variables X and Y are independent if all events of the form
  x < X < x+dx are independent from y < Y < y+dy:

    f(x|y) = fX(x)  and  f(y|x) = fY(y),    hence    f(x, y) = fX(x)·fY(y)

• For probability density functions, Bayes' theorem becomes:

    f(y|x) = f(x|y) fY(y) / fX(x) = f(x|y) fY(y) / ∫ f(x|y) fY(y) dy


21 Covariance and correlation
• A random vector (X, Y) can be treated as 2 separate variables → marginal densities
  → mean and standard deviation for each variable: µX, µY, σX, σY

• These quantities do not take into account correlations between the variables
  (the slide shows scatter plots for ρ = −0.5, ρ = 0 and ρ = 0.9).

• Generalized measure of dispersion: covariance of X and Y

    Population:  Cov(X, Y) = ∫∫ (x − µX)(y − µY) f(x, y) dx dy = ρ σX σY = µXY − µX µY
    Sample:      Cov(X, Y) = (1/n) Σ_{i=1}^{n} (xi − µX)(yi − µY)

• Correlation:   ρ = Cov(X, Y) / (σX σY)        Uncorrelated variables: ρ = 0

  Independent ⇒ Uncorrelated (the converse does not hold: the slide shows a ρ = 0
  distribution whose variables are clearly not independent)
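A short sketch of the sample covariance and correlation formulas on hypothetical data (illustration only, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = 0.9 * x + rng.normal(scale=0.5, size=5000)   # built to be correlated

cov = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X,Y) = (1/n) Σ (xi − µX)(yi − µY)
rho = cov / (x.std() * y.std())                  # ρ = Cov(X,Y) / (σX σY)
print(cov, rho)
print(np.corrcoef(x, y)[0, 1])                   # same correlation from numpy directly
```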


22 Decorrelation
• Covariance matrix for n variables Xi:   Σij = Cov(Xi, Xj)

    Σ = | σ1²         ρ12 σ1 σ2   ...   ρ1n σ1 σn |
        | ρ12 σ1 σ2   σ2²         ...   ρ2n σ2 σn |
        | ...         ...         ...   ...       |
        | ρ1n σ1 σn   ρ2n σ2 σn   ...   σn²       |

• For uncorrelated variables, Σ is diagonal.

• The matrix is real and symmetric: Σ can be diagonalized
  → definition of n new uncorrelated variables Yi

    Σ' = diag(σ'1², σ'2², ..., σ'n²) = B⁻¹ Σ B    with    Y = BX

    The σ'i² are the eigenvalues of Σ
    B contains the orthonormal eigenvectors

• The Yi are the principal components. Sorted from the largest σ' to the smallest,
  they allow dimensional reduction.
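A sketch of this decorrelation on a hypothetical 2-variable sample, diagonalizing the covariance matrix with numpy (not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
data = np.vstack([x + rng.normal(scale=0.3, size=2000),
                  2 * x + rng.normal(scale=0.3, size=2000)])   # two correlated variables

sigma = np.cov(data)                       # covariance matrix Σ
eigval, eigvec = np.linalg.eigh(sigma)     # real symmetric -> orthonormal eigenvectors
y = eigvec.T @ (data - data.mean(axis=1, keepdims=True))   # principal components

print(np.round(np.cov(y), 3))              # ~diagonal: the Yi are uncorrelated
print(eigval)                              # their variances σ'i² (the eigenvalues of Σ)
```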


23 Regression
• Measure of location:
    A point: (µX, µY)
    A curve: the line which is the closest to the points → linear regression

• Minimizing the dispersion between the line “y = ax + b” and the distribution.
  Let:

    w(a, b) = ∫∫ (y − ax − b)² f(x, y) dx dy = (1/n) Σ_i (yi − a·xi − b)²

    ∂w/∂a = 0 = ∫∫ x (y − ax − b) f(x, y) dx dy
    ∂w/∂b = 0 = ∫∫ (y − ax − b) f(x, y) dx dy

    ⇒   a (σX² + µX²) + b µX = ρ σX σY + µX µY
        a µX + b = µY

    ⇒   a = ρ σY / σX
        b = µY − ρ (σY / σX) µX

  Fully correlated: ρ = 1;  fully anti-correlated: ρ = −1.  Then Y = aX + b.
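A quick check of these regression formulas on a hypothetical sample, compared with a direct least-squares fit (illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=5000)
y = 0.7 * x + 0.3 + rng.normal(scale=0.5, size=5000)

rho = np.corrcoef(x, y)[0, 1]
a = rho * y.std() / x.std()         # a = ρ σY / σX
b = y.mean() - a * x.mean()         # b = µY − a µX
print(a, b)                         # ≈ 0.7 and 0.3
print(np.polyfit(x, y, 1))          # same line from a direct least-squares fit
```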
24 Multidimensional PDFs
• Multinomial distribution: randomly choosing K1, K2, ..., KS objects within a finite set of n,
  with a fixed drawing probability for each category p1, p2, ..., pS,
  with Σ Ki = n and Σ pi = 1

    Parameters : n, p1, p2, ..., pS
    Law        : P(k⃗; n, p⃗) = n! / (k1! k2! ... kS!) · p1^{k1} p2^{k2} ... pS^{kS}
    Mean       : µi = n pi
    Variance   : σi² = n pi (1 − pi),      Cov(Ki, Kj) = −n pi pj

  Note: the variables are not independent. The binomial corresponds to S = 2 but has
  only one independent variable.

• Multinormal distribution:

    Parameters : µ⃗, Σ
    Law        : f(x⃗; µ⃗, Σ) = 1 / √((2π)^n |Σ|) · e^{−½ (x⃗ − µ⃗)ᵀ Σ⁻¹ (x⃗ − µ⃗)}

    If uncorrelated:   f(x⃗; µ⃗, Σ) = Π_i 1/(σi √(2π)) e^{−(xi − µi)²/(2σi²)}

    → for normal variables:  Independent ⇔ Uncorrelated


25 Sum of random variables
• The sum of several random variables is a new random variable S:

    S = Σ_{i=1}^{n} Xi

• Assuming the mean and variance of each variable exist:

  Mean value of S:

    µS = ∫ (Σ_i xi) f(x1, ..., xn) dx1...dxn = Σ_i ∫ xi fXi(xi) dxi = Σ_i µi

    The mean is an additive quantity.

  Variance of S:

    σS² = ∫ (Σ_i (xi − µXi))² f(x1, ..., xn) dx1...dxn
        = Σ_i σXi² + 2 Σ_i Σ_{j<i} Cov(Xi, Xj)

    For uncorrelated variables, the variance is an additive quantity:   σS² = Σ_{i=1}^{n} σXi²
    → used for error combinations


26 Sum of random variables
• Probability density function of S: fS(s)

• Using the characteristic function:

    φS(t) = ∫ fS(s) e^{ist} ds = ∫ fX⃗(x⃗) e^{it Σ xi} dx⃗

  For independent variables:

    φS(t) = Π_k ∫ fXk(xk) e^{it xk} dxk = Π_k φXk(t)

  The characteristic function factorizes.

• The PDF is the Fourier transform of the characteristic function, therefore:

    fS = fX1 ∗ fX2 ∗ ... ∗ fXn

  The PDF of the sum of random variables is the convolution of the individual PDFs.

    Sum of Normal variables              → Normal
    Sum of Poisson variables (λ1 and λ2) → Poisson with λ = λ1 + λ2
    Sum of Chi-2 variables (n1 and n2)   → Chi-2 with n = n1 + n2
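A small numerical illustration (not from the slide) that the sum of two independent Poisson variables behaves as a Poisson variable with λ = λ1 + λ2:

```python
import numpy as np

rng = np.random.default_rng(5)
lam1, lam2 = 2.0, 4.5
s = rng.poisson(lam1, size=100_000) + rng.poisson(lam2, size=100_000)

print(s.mean(), s.var())   # both ≈ λ1 + λ2 = 6.5, as expected for a Poisson variable
```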


27 Sum of independent variables
• Weak law of large numbers

  A sample of size n is a realization of n independent variables with the same
  distribution (mean µ, variance σ²).

  The sample mean is a realization of   M = S/n = (1/n) Σ Xi

    Mean value of M:  µM = µ        Variance of M:  σM² = σ²/n

• Central limit theorem

  For n independent random variables of mean µi and variance σi², consider the
  sum of the reduced variables:

    C = (1/√n) Σ_i (Xi − µi) / σi

  The PDF of C converges to a reduced normal distribution:

    fC(c)  →  (1/√(2π)) e^{−c²/2}    as n → +∞

  The sum of many random fluctuations is normally distributed.
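A sketch of the central limit theorem at work, using uniform variables as hypothetical ingredients (histogramming c would reproduce figures like those on the next slide):

```python
import numpy as np

rng = np.random.default_rng(6)

def standardized_sum(n, size=100_000):
    """C = (1/√n) Σ (Xi − µi)/σi for n uniform variables (µ = 1/2, σ² = 1/12)."""
    x = rng.uniform(size=(size, n))
    return (x - 0.5).sum(axis=1) / np.sqrt(n / 12)

for n in (1, 2, 5, 30):
    c = standardized_sum(n)
    print(n, round(c.mean(), 3), round(c.std(), 3))   # mean → 0, std → 1, shape → Gaussian
```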


28 Central limit theorem

(Figure: distributions of X1, (X1 + X2)/√2, (X1 + X2 + X3)/√3 and
(X1 + X2 + X3 + X4 + X5)/√5, each compared to a Gaussian — the agreement
improves as more variables are summed.)


29 Dispersion and uncertainty

• Any measurement (or combination of measurements) is a realization of a random variable.

    Measured value: θ
    True value: θ0

• The uncertainty quantifies the difference between θ and θ0:
  → measure of dispersion

    Postulate:   δθ = α σθ        (absolute error, always positive)

• Usually one differentiates:

    Statistical errors: due to the measurement PDF
    Systematic errors or bias: fixed but unknown deviation (equipment, assumptions, …)
    Systematic errors can be seen as statistical errors in a set of similar experiments.


30 Error sources
Observation error: εO        Scaling error: εS        Position error: εP

    Measured value:   θ = θ0 + εO + εS + εP

• Each εi is a realization of a random variable of mean 0 and variance σi².
  For uncorrelated error sources:

    δO = α σO
    δS = α σS     ⇒    δtot² = (α σtot)² = α² (σO² + σS² + σP²) = δO² + δS² + δP²
    δP = α σP

• Choice for α:
    Many sources of error → central limit theorem → normal distribution
    α = 1 gives (approximately) a 68% confidence interval
    α = 2 gives a 95% confidence interval
31 Error propagation
• Measure: x ± δx
• Compute: f(x) → δf ?

  Assuming small errors and using the Taylor expansion (f'(x) = df/dx):

    f(x + δx) = f(x) + (df/dx) δx + ½ (d²f/dx²) δx²
    f(x − δx) = f(x) − (df/dx) δx + ½ (d²f/dx²) δx²

    δf = ½ |f(x + δx) − f(x − δx)| = |df/dx| δx
32 Error propagation
• Measure: x ± δx, y ± δy,   zm = f(xm, ym)
• Compute: f(x, y, ...) → δf ?

  Method: treat the effect of each variable as a separate error source
  (slice the surface z = f(x, y) at fixed ym: ∂f/∂x = df(x, ym)/dx)

    δxf = |∂f/∂x| δx    and    δyf = |∂f/∂y| δy

  Then:

    δf² = δxf² + δyf² + 2 ρxy δxf δyf
        = (∂f/∂x)² δx² + (∂f/∂y)² δy² + 2 ρxy (∂f/∂x)(∂f/∂y) δx δy

  In general:

    δf² = Σ_i (∂f/∂xi)² δxi² + 2 Σ_i Σ_{j<i} ρxixj (∂f/∂xi)(∂f/∂xj) δxi δxj

    Uncorrelated:     δf² = Σ_i (∂f/∂xi)² δxi²
    Correlated:       δf = |∂f/∂x| δx + |∂f/∂y| δy
    Anticorrelated:   δf = | (∂f/∂x) δx − (∂f/∂y) δy |
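A numerical sketch of uncorrelated error propagation for a hypothetical f(x, y) = x·y, cross-checked with a simple Monte Carlo (illustration, not from the slide):

```python
import numpy as np

x, dx = 3.0, 0.1
y, dy = 5.0, 0.2

# Linear propagation: δf² = (∂f/∂x)² δx² + (∂f/∂y)² δy², with ∂f/∂x = y and ∂f/∂y = x
df = np.sqrt((y * dx) ** 2 + (x * dy) ** 2)

# Monte Carlo check: sample x and y and look at the spread of f = x·y
rng = np.random.default_rng(7)
f_samples = rng.normal(x, dx, 100_000) * rng.normal(y, dy, 100_000)
print(df, f_samples.std())    # both ≈ 0.78
```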


33 Parametric estimation

• Estimating a parameter θ from a finite sample {xi}

• Statistic: a function S = f({xi})

  Any statistic can be considered as an estimator of θ.
  To be a good estimator it needs to satisfy:

    Consistency: limit of the estimator for an infinite sample
    Bias: difference between the estimator and the true value
    Efficiency: speed of convergence
    Robustness: sensitivity to statistical fluctuations

• A good estimator should at least be consistent and asymptotically unbiased.

• Efficient / unbiased / robust often contradict each other
  → need to make a choice for a given situation


34 Bias and consistency
• As the sample is a set of realizations of random variables (or of one vector variable),
  so is the estimator:

    θ̂ is a realization of Θ̂
    It has a mean, a variance, …, and a probability density function.

• Bias: characterizes the mean value of the estimator

    b(θ̂) = E[Θ̂ − θ0] = µ_Θ̂ − θ0

    Unbiased estimator:        b(θ̂) = 0
    Asymptotically unbiased:   b(θ̂) → 0   as n → +∞

• Consistency: formally   P(|θ̂ − θ| > ε) → 0  as n → +∞, for all ε
  (equivalently P(|θ̂ − θ| < ε) → 1).
  In practice: an asymptotically unbiased estimator with σ_Θ̂ → 0 as n → +∞ is consistent.

(Figure: examples of a biased, an asymptotically unbiased and an unbiased estimator.)


35 Efficiency

• For any unbiased estimator of θ, the variance cannot be smaller than the Cramér-Rao bound:

    σθ̂² ≥ 1 / E[(∂lnL/∂θ)²] = −1 / E[∂²lnL/∂θ²]

• The efficiency of a convergent estimator is given by its variance.

  An efficient estimator reaches the Cramér-Rao bound (at least asymptotically)
  → minimal variance estimator

• The minimal variance estimator will often be biased, but asymptotically unbiased.


36 Empirical estimator

• The sample mean is a good estimator of the population mean
  (weak law of large numbers: convergent, unbiased):

    µ̂ = (1/n) Σ xi        µ_µ̂ = E[µ̂] = µ        σµ̂² = E[(µ̂ − µ)²] = σ²/n

• Sample variance as an estimator of the population variance:

    ŝ² = (1/n) Σ_i (xi − µ̂)² = (1/n) Σ_i (xi − µ)² − (µ̂ − µ)²

    E[ŝ²] = σ² − σµ̂² = σ² − σ²/n = (n−1)/n · σ²    → biased, asymptotically unbiased

  Unbiased variance estimator:

    σ̂² = 1/(n−1) Σ_i (xi − µ̂)²

  Variance of the estimator (convergence):   σ²_{σ̂²} → 2σ⁴/n   as n → +∞
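A quick numerical check of the (n−1)/n bias on small hypothetical samples (illustration only):

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma2 = 5, 4.0                          # small sample size, true variance σ² = 4

samples = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))
s2_biased = samples.var(axis=1, ddof=0)     # ŝ² = (1/n) Σ (xi − µ̂)²
s2_unbiased = samples.var(axis=1, ddof=1)   # σ̂² = 1/(n−1) Σ (xi − µ̂)²

print(s2_biased.mean())     # ≈ (n−1)/n · σ² = 3.2
print(s2_unbiased.mean())   # ≈ σ² = 4.0
```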


37 Errors on estimators

Uncertainty ↔ estimator standard deviation

• Use an estimator of the standard deviation:   σ̂ = √(σ̂²)   (biased!)

• Mean:       µ̂ = (1/n) Σ xi,                  δµ̂ = σµ̂ = √(σ̂²/n)

• Variance:   σ̂² = 1/(n−1) Σ_i (xi − µ̂)²,      σ²_{σ̂²} ≈ 2σ⁴/n   ⇒   δσ̂² = σ̂² √(2/n)

• Central limit theorem ⇒ the empirical estimators of mean and variance are
  normally distributed for large enough samples:

    µ̂ ± δµ̂  and  σ̂ ± δσ̂   define 68% confidence intervals


38 Likelihood function
Generic function k(x, θ)
    x: random variable(s)
    θ: parameter(s)

Fix θ = θ0 (true value)                      Fix x = u (one realization of the random variable)
→ probability density function               → likelihood function

    f(x; θ) = k(x, θ0)                           L(θ) = k(u, θ)
    ∫ f(x; θ) dx = 1                             ∫ L(θ) dθ = ???

    for Bayesians: f(x|θ) = f(x; θ)              for Bayesians: f(θ|x) = L(θ) / ∫ L(θ) dθ

For a sample: n independent realizations of the same variable X:

    L(θ) = Π_i k(xi, θ) = Π_i f(xi; θ)
39 Maximum likelihood
• Consider a sample of measurements {xi}.
  The analytical form of the density is known and depends on several unknown parameters θ.

  For example: event counting follows a Poisson distribution with a parameter λi(θ)
  depending on the physics:

    L(θ) = Π_i e^{−λi(θ)} λi(θ)^{xi} / xi!

• An estimator of the parameters θ is given by the position of the maximum of the
  likelihood function
  → parameter values which maximize the probability to get the observed results:

    ∂L/∂θ |_{θ=θ̂} = 0

  Note: system of equations for several parameters
  Note: minimizing −lnL often simplifies the expression
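A minimal sketch of a maximum-likelihood fit (hypothetical counts, a single Poisson parameter λ), minimizing −lnL numerically; for a Poisson sample the MLE is simply the sample mean, which serves as a cross-check:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(9)
counts = rng.poisson(3.2, size=200)               # observed sample {xi}

def neg_log_likelihood(lam):
    return -np.sum(poisson.logpmf(counts, lam))   # −lnL(λ) = −Σ ln f(xi; λ)

result = minimize_scalar(neg_log_likelihood, bounds=(0.1, 20.0), method="bounded")
print(result.x, counts.mean())                    # both give the same λ̂
```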


40 Properties of MLE
• Mostly asymptotic properties: valid for large samples, often assumed in any
  case for lack of better information

    Asymptotically unbiased
    Asymptotically efficient (reaches the Cramér-Rao bound)
    Asymptotically normally distributed

  Multinormal law with covariance given by a generalization of the Cramér-Rao bound
  (nθ = number of parameters):

    f(θ⃗̂; θ⃗, Σ) = 1/√((2π)^{nθ} |Σ|) · e^{−½ (θ⃗̂ − θ⃗)ᵀ Σ⁻¹ (θ⃗̂ − θ⃗)}
    with   (Σ⁻¹)ij = E[ ∂lnL/∂θi · ∂lnL/∂θj ]

• Goodness of fit: the value of −2 lnL(θ̂) is Chi-2 distributed with
  ndf = sample size − number of parameters

    p-value = ∫_{−2lnL(θ̂)}^{+∞} fχ²(x; ndf) dx        → probability of getting a worse agreement


41 Errors on MLE

    f(θ⃗̂; θ⃗, Σ) = 1/√((2π)^{nθ} |Σ|) · e^{−½ (θ⃗̂ − θ⃗)ᵀ Σ⁻¹ (θ⃗̂ − θ⃗)}
    with   (Σ⁻¹)ij = E[ ∂lnL/∂θi · ∂lnL/∂θj ]

• Errors on the parameters are given by the covariance matrix.

• For one parameter, the 68% confidence interval is:

    δθ = σθ̂ = √( −1 / (∂²lnL/∂θ²) |_{θ=θ̂} )

  (only one realization of the estimator: empirical mean of 1 value)

• More generally:

    ΔlnL = lnL(θ) − lnL(θ̂) = −½ Σ_{i,j} (Σ⁻¹)ij (θi − θ̂i)(θj − θ̂j) + O(θ³)

  Confidence contours are defined by the equation:

    −2 ΔlnL = β(nθ, α)    with    α = ∫_0^{β} fχ²(x; nθ) dx

  Values of −ΔlnL for different numbers of parameters nθ and confidence levels α:

    α (%)     nθ = 1    nθ = 2    nθ = 3
    68.3      0.5       1.15      1.76
    95.4      2         3.09      4.01
    99.7      4.5       5.92      7.08


42 Least squares
• Set of measurements (xi, yi) with uncertainties δyi on the yi.
  Theoretical law:  y = f(x, θ)

• Naive approach: use regression

    w(θ) = Σ_i (yi − f(xi, θ))²        ∂w/∂θi = 0

• Reweight each term by its associated error:

    K²(θ) = Σ_i ( (yi − f(xi, θ)) / δyi )²        ∂K²/∂θi = 0

• Maximum likelihood: assume that each yi is normally distributed with a mean
  equal to f(xi, θ) and a standard deviation given by δyi.
  The likelihood is then:

    L(θ) = Π_i 1/(δyi √(2π)) · e^{−½ ((yi − f(xi, θ))/δyi)²}

    ∂L/∂θ = 0   ⇔   −2 ∂lnL/∂θ = ∂K²/∂θ = 0

  The least squares (or Chi-2) fit is the maximum likelihood estimator for Gaussian errors.

• Generic case with correlations:

    K²(θ⃗) = (y⃗ − f⃗(x, θ⃗))ᵀ Σ⁻¹ (y⃗ − f⃗(x, θ⃗))
43 Example: fitting a line
• For f(x) = ax:    K²(a) = A a² − 2 B a + C = −2 lnL

    A = Σ xi²/δyi²,     B = Σ xi yi/δyi²,     C = Σ yi²/δyi²

    ∂K²/∂a = 2 A a − 2 B = 0    ⇒    â = B/A

    ∂²K²/∂a² = 2 A = 2/σa²      ⇒    δâ = σa = 1/√A
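A direct implementation of this closed-form weighted fit on a few hypothetical data points (values invented for the illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # roughly y = 2x
dy = np.array([0.2, 0.2, 0.3, 0.3, 0.4])    # uncertainties δy on y

A = np.sum(x**2 / dy**2)
B = np.sum(x * y / dy**2)

a_hat = B / A                                # â = B/A
da = 1.0 / np.sqrt(A)                        # δâ = 1/√A
chi2 = np.sum(((y - a_hat * x) / dy) ** 2)   # K²(â), to judge the goodness of fit
print(a_hat, da, chi2)
```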


44 Example: fitting a line
• For f(x) = ax + b:

    K²(a, b) = A a² + B b² + 2 C a b − 2 D a − 2 E b + F = −2 lnL

    A = Σ xi²/δyi²,    B = Σ 1/δyi²,     C = Σ xi/δyi²,
    D = Σ xi yi/δyi²,  E = Σ yi/δyi²,    F = Σ yi²/δyi²

    ∂K²/∂a = 2 A a + 2 C b − 2 D = 0
    ∂K²/∂b = 2 C a + 2 B b − 2 E = 0
        ⇒    â = (BD − CE)/(AB − C²),     b̂ = (AE − CD)/(AB − C²)

    ∂²K²/∂a² = 2 A = 2 (Σ⁻¹)11
    ∂²K²/∂b² = 2 B = 2 (Σ⁻¹)22          Σ⁻¹ = | A  C |       Σ = 1/(AB − C²) |  B  −C |
    ∂²K²/∂a∂b = 2 C = 2 (Σ⁻¹)12                | C  B |                       | −C   A |

    δâ = σa = √( B/(AB − C²) ),     δb̂ = σb = √( A/(AB − C²) )


45 Example: fitting a line

• Two-dimensional error contours on a (slope) and b (intercept):
  the slide shows the contours at 68.3%, 95.4% and 99.7% confidence level
  around the fitted point (â, b̂).


46 Non-parametric estimation

• Directly estimating the probability density function

    → likelihood ratio discriminant
    → separating power of variables
    → data / Monte Carlo agreement

• Frequency table: for a sample {xi}, i = 1...n

    1. Define successive intervals (bins) Ck = [ak, ak+1[
    2. Count the number of events nk in Ck

• Histogram: graphical representation of the frequency table,   h(x) = nk if x ∈ Ck


47 Histogram
N/Z for stable heavy nuclei (raw values; the slide also shows the corresponding
frequency table — bin, number of entries, frequency — and the histogram built from it):

1.321, 1.357, 1.392, 1.410, 1.428, 1.446, 1.464, 1.421,


1.438, 1.344, 1.379, 1.413, 1.448, 1.389, 1.366, 1.383,
1.400, 1.416, 1.433, 1.466, 1.500, 1.322, 1.370, 1.387,
1.403, 1.419, 1.451, 1.483, 1.396, 1.428, 1.375, 1.406,
1.421, 1.437, 1.453, 1.468, 1.500, 1.446, 1.363, 1.393,
1.424, 1.439, 1.454, 1.469, 1.484, 1.462, 1.382, 1.411,
1.441, 1.455, 1.470, 1.500, 1.449, 1.400, 1.428, 1.442,
1.457, 1.471, 1.485, 1.514, 1.464, 1.478, 1.416, 1.444,
1.458, 1.472, 1.486, 1.500, 1.465, 1.479, 1.432, 1.459,
1.472, 1.486, 1.513, 1.466, 1.493, 1.421, 1.447, 1.460,
1.473, 1.486, 1.500, 1.526, 1.480, 1.506, 1.435, 1.461,
1.487, 1.500, 1.512, 1.538, 1.493, 1.450, 1.475, 1.500,
1.512, 1.525, 1.550, 1.506, 1.530, 1.487, 1.512, 1.524,
1.536, 1.518, 1.577, 1.554, 1.586, 1.586



48 Histogram as PDF estimator
• Statistical description: the nk are multinomial random variables

    Parameters:   n = Σ_k nk,      pk = P(x ∈ Ck) = ∫_{Ck} fX(x) dx

    µ_{nk} = n pk        σ²_{nk} = n pk (1 − pk) ≈ µ_{nk}   (pk ≪ 1)
    Cov(nk, nr) = −n pk pr ≈ 0   (pk ≪ 1)

  For a large sample:                        For small classes (width δ):
    lim_{n→+∞} nk/n = µ_{nk}/n = pk            pk = ∫_{Ck} fX(x) dx ≈ δ f(xc)   ⇒   lim_{δ→0} pk/δ = f(x)

  Finally:    f(x) = lim_{n→+∞, δ→0}  h(x) / (n δ)

• The histogram is an estimator of the probability density function.

• Each bin can be described by a Poisson density.
  The 1σ error on nk is then:   δnk = √(σ̂²_{nk}) = √(µ̂_{nk}) = √nk
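A small sketch of the histogram used as a density estimator, with the √nk Poisson error per bin (hypothetical Gaussian sample, not the slide's data):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.normal(size=2000)                       # hypothetical sample

counts, edges = np.histogram(x, bins=20)
width = np.diff(edges)                          # bin width δ
f_hat = counts / (len(x) * width)               # f(x) ≈ h(x)/(nδ)
f_err = np.sqrt(counts) / (len(x) * width)      # δnk = √nk, propagated to the density

print(np.column_stack([edges[:-1], f_hat, f_err])[:5])
```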


49 Confidence interval

• For a random variable, a confidence interval with confidence level α
  is any interval [a, b] such that:

    P(X ∈ [a, b]) = ∫_a^b fX(x) dx = α        → probability of finding a realization inside the interval

• Generalization of the concept of uncertainty:
  an interval that contains the true value with a given probability.

• For Bayesians: the posterior density is the probability density of the true value.
  It can be used to estimate an interval:   P(θ ∈ [a, b]) = α

• No such thing for a frequentist: the interval itself becomes the random variable
  → [a, b] is a realization of [A, B]

    P(A < θ and B > θ) = α    independently of θ


50 Confidence interval
• Mean centered, symmetric interval [µ − a, µ + a]:

    ∫_{µ−a}^{µ+a} f(x) dx = α

• Mean centered, probability symmetric interval [a, b]:

    ∫_a^µ f(x) dx = ∫_µ^b f(x) dx = α/2

• Highest probability density (HPD) interval [a, b]:

    ∫_a^b f(x) dx = α
    f(x) > f(y)   for x ∈ [a, b] and y ∉ [a, b]


51 Confidence Belt
• To build a frequentist interval for an estimator θ̂ of θ:

  1. Make pseudo-experiments for several values of θ and compute the estimator θ̂
     for each (Monte Carlo sampling of the estimator PDF).

  2. For each θ, determine Ξ(θ) and Ω(θ) such that:
       θ̂ < Ξ(θ) for a fraction (1 − α)/2 of the pseudo-experiments
       θ̂ > Ω(θ) for a fraction (1 − α)/2 of the pseudo-experiments
     These 2 curves are the confidence belt for a confidence level α.

  3. Invert these functions. The interval [Ω⁻¹(θ̂), Ξ⁻¹(θ̂)] satisfies:

       P( Ω⁻¹(θ̂) < θ < Ξ⁻¹(θ̂) ) = 1 − P( Ξ⁻¹(θ̂) < θ ) − P( Ω⁻¹(θ̂) > θ )
                                 = 1 − P( θ̂ < Ξ(θ) ) − P( θ̂ > Ω(θ) ) = α

  (Figure: confidence belt for a Poisson parameter estimated with the empirical
  mean of 3 realizations, 68% CL.)
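A sketch of steps 1-2 by Monte Carlo for the case shown in the figure (Poisson parameter estimated by the mean of 3 counts); step 3 amounts to reading this belt at the observed θ̂:

```python
import numpy as np

rng = np.random.default_rng(11)
alpha = 0.683
theta_grid = np.linspace(0.5, 10.0, 20)          # true values θ to scan

for theta in theta_grid[:5]:
    pseudo = rng.poisson(theta, size=(50_000, 3)).mean(axis=1)   # θ̂ of each pseudo-experiment
    xi, omega = np.quantile(pseudo, [(1 - alpha) / 2, 1 - (1 - alpha) / 2])
    print(f"θ = {theta:.2f}   belt: Ξ(θ) = {xi:.2f}, Ω(θ) = {omega:.2f}")
```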


52 Dealing with systematics

• The variance of the estimator only measures the statistical uncertainty.

• Often, we will have to deal with parameters whose value is known with limited precision
  → systematic uncertainties

• The likelihood function becomes:

    L(θ, ν)    with    ν = ν0 ± δν    or    ν0 (+δν+, −δν−)

  The parameters ν are nuisance parameters.


53 Bayesian inference

• In Bayesian statistics, nuisance parameters are dealt with by assigning them a prior π(ν).

• Usually a multinormal law is used, with mean ν0 and covariance matrix estimated
  from the quoted uncertainties δν (+ correlations if needed):

    f(θ, ν|x) = f(x|θ, ν) π(θ) π(ν) / ∫∫ f(x|θ, ν) π(θ) π(ν) dθ dν

• The final posterior distribution is obtained by marginalization over the nuisance parameters:

    f(θ|x) = ∫ f(θ, ν|x) dν = ∫ f(x|θ, ν) π(θ) π(ν) dν / ∫∫ f(x|θ, ν) π(θ) π(ν) dθ dν


54 Profile Likelihood
• There is no true frequentist way to add systematic effects. Popular method of the day: profiling

• Deal with nuisance parameters as realizations of random variables and extend the likelihood:

    L(θ, ν)  →  L'(θ, ν) = L(θ, ν) G(ν)

• G(ν) is the likelihood of the new parameters (identical to a prior).

• For each value of θ, maximize the likelihood with respect to the nuisance parameters:
  → profile likelihood PL(θ)

• PL(θ) has the same asymptotic statistical properties as the regular likelihood.


55 Statistical tests

• Statistical tests aim at:

Checking the compatibility of a dataset {xi } with a given distribution

  Checking the compatibility of two datasets {xi}, {yi}: do they come from
  the same distribution?

  Comparing different hypotheses: background vs. signal + background

• In every case:

Build a statistic that quantifies the agreement with the hypothesis


Convert it into a probability of compatibility/incompatibility: p-value



56 Pearson test
• Test for binned data: use the Poisson limit of the histogram

  Sort the sample into k bins Ci, containing ni entries each.
  Compute the probability of each class:   pi = ∫_{Ci} f(x) dx

  For each bin, the test statistic compares the deviation of the observation
  from the expected (Poisson) mean to the theoretical (Poisson) variance:

    χ² = Σ_{bins i} (ni − n pi)² / (n pi)

• χ² follows (asymptotically) a Chi-2 law with k − 1 degrees of freedom
  (one constraint: Σ ni = n)

• p-value: probability of doing worse:    p-value = ∫_{χ²}^{+∞} fχ²(x; k−1) dx

  For a “good” agreement:   χ²/(k−1) ~ 1
  More precisely:   χ² ≈ (k−1) ± √(2(k−1))    (1σ interval, ~68% CL)
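A compact sketch of the Pearson test with scipy, for a hypothetical sample tested against a standard normal density:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
x = rng.normal(size=1000)                        # hypothetical sample
n = len(x)

edges = np.linspace(-4.0, 4.0, 11)               # k = 10 bins
ni, _ = np.histogram(x, bins=edges)
pi = np.diff(stats.norm.cdf(edges))              # pi = ∫_Ci f(x) dx

chi2 = np.sum((ni - n * pi) ** 2 / (n * pi))
p_value = stats.chi2.sf(chi2, df=len(ni) - 1)    # probability of doing worse
print(chi2, p_value)
```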


57 Kolmogorov-Smirnov test

• Test for unbinned data: compare the sample cumulative density function to the tested one.

• Sample PDF (ordered sample):

    fS(x) = (1/n) Σ_i δ(x − xi)

    FS(x) = 0      for x < x0
          = k/n    for xk ≤ x < xk+1
          = 1      for x > xn

• The Kolmogorov statistic is the largest deviation:

    Dn = sup_x |FS(x) − F(x)|

• The test distribution has been computed by Kolmogorov:

    P(Dn > z/√n) = 2 Σ_r (−1)^{r−1} e^{−2 r² z²}

  [0, δ] defines a confidence interval for Dn:

    δ = 0.9584/√n for 68.3% CL        δ = 1.3754/√n for 95.4% CL
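A sketch of the same test done with scipy's one-sample KS test, against the exponential law used in the example on the next slide (the sample below is generated for the illustration, not the slide's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
lam = 0.4
sample = rng.exponential(scale=1 / lam, size=118)      # hypothetical data, mean 1/λ = 2.5

# F(x) of the tested law; scipy parametrizes the exponential with scale = 1/λ
result = stats.kstest(sample, stats.expon(scale=1 / lam).cdf)
print(result.statistic, result.pvalue)                 # Dn and its p-value
```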


58 Example
• Test compatibility with an exponential law:   f(x) = λ e^{−λx},   λ = 0.4
0.008, 0.036, 0.112, 0.115, 0.133, 0.178, 0.189, 0.238, 0.274, 0.323, 0.364, 0.386, 0.406, 0.409, 0.418, 0.421, 0.423, 0.455,
0.459, 0.496, 0.519, 0.522, 0.534, 0.582, 0.606, 0.624, 0.649, 0.687, 0.689, 0.764, 0.768, 0.774, 0.825, 0.843, 0.921, 0.987,
0.992, 1.003, 1.004, 1.015, 1.034, 1.064, 1.112, 1.159, 1.163, 1.208, 1.253, 1.287, 1.317, 1.320, 1.333, 1.412, 1.421, 1.438,
1.574, 1.719, 1.769, 1.830, 1.853, 1.930, 2.041, 2.053, 2.119, 2.146, 2.167, 2.237, 2.243, 2.249, 2.318, 2.325, 2.349, 2.372,
2.465, 2.497, 2.553, 2.562, 2.616, 2.739, 2.851, 3.029, 3.327, 3.335, 3.390, 3.447, 3.473, 3.568, 3.627, 3.718, 3.720, 3.814,
3.854, 3.929, 4.038, 4.065, 4.089, 4.177, 4.357, 4.403, 4.514, 4.771, 4.809, 4.827, 5.086, 5.191, 5.928, 5.952, 5.968, 6.222,
6.556, 6.670, 7.673, 8.071, 8.165, 8.181, 8.383, 8.557, 8.606, 9.032, 10.482, 14.174

    Dn = 0.069
    p-value = 0.0617
    1σ interval: [0, 0.0875]
