Capstone-Cheatsheet Statistics 1
by Blechturm

1 Statistical models
A statistical model is a pair (E, {Pθ}θ∈Θ).
E is a sample space for X, i.e. a set that contains all possible outcomes of X.
{Pθ}θ∈Θ is a family of probability distributions on E.
Θ is a parameter set, i.e. a set consisting of some possible values of θ.
θ is the true parameter and is unknown.
In a parametric model we assume that Θ ⊂ R^d for some d ≥ 1.

1.1 Identifiability
The parameter is identifiable if
θ ≠ θ′ ⇒ Pθ ≠ Pθ′, equivalently Pθ = Pθ′ ⇒ θ = θ′.
A model is well specified if ∃θ s.t. P = Pθ.

2 Estimators
A statistic is any measurable function calculated from the data (X̄n, max(Xi), etc.).
An estimator θ̂n of θ is any statistic which does not depend on θ.
Estimators are random variables, since they depend on the data (realizations of random variables).
An estimator θ̂n is weakly consistent if θ̂n →(P) θ as n → ∞; if the convergence is almost sure, it is strongly consistent.
Asymptotic normality of an estimator:
√n (θ̂n − θ) →(d) N(0, σ²)
σ² is called the asymptotic variance of the estimator θ̂n. For the sample mean it equals the variance of a single Xi. If the estimator is a function of the sample mean, the Delta Method is needed to compute the asymptotic variance. Asymptotic variance ≠ variance of an estimator.
Bias of an estimator: Bias(θ̂n) = E[θ̂n] − θ
Quadratic risk of an estimator: R(θ̂n) = E[(θ̂n − θ)²] = Bias² + Variance

3 LLN and CLT
Let X1, ..., Xn ∼ iid Pµ with E(Xi) = µ and Var(Xi) = σ² for all i = 1, 2, ..., n, and let X̄n = (1/n) Σᵢ Xi.
Law of large numbers:
X̄n →(P, a.s.) µ
(1/n) Σᵢ g(Xi) →(P, a.s.) E[g(X)]
Central Limit Theorem for the mean:
√n (X̄n − µ)/σ →(d) N(0, 1)
√n (X̄n − µ) →(d) N(0, σ²)
Central Limit Theorem for sums:
(Σᵢ Xi − nµ)/√(nσ²) →(d) N(0, 1), i.e. Σᵢ Xi is approximately N(nµ, nσ²) for large n.
Expectation of the mean: E[X̄n] = (1/n) E[X1 + X2 + ... + Xn] = µ
Variance of the mean: Var(X̄n) = (1/n²) Var(X1 + X2 + ... + Xn) = σ²/n
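A quick way to see the CLT and the asymptotic normality of the sample mean in action is a small simulation. This is a minimal sketch, not part of the original cheatsheet; the Exp(2) model, sample size and replication count are arbitrary assumptions.

```python
# Minimal simulation sketch (not from the cheatsheet): checks that the standardized
# sample mean sqrt(n)*(Xbar - mu)/sigma is approximately N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 500, 20_000
mu, sigma = 1 / lam, 1 / lam                 # Exp(lam): E[X] = 1/lam, sd = 1/lam

samples = rng.exponential(scale=1 / lam, size=(reps, n))
xbar = samples.mean(axis=1)                  # sample mean of each replication
z = np.sqrt(n) * (xbar - mu) / sigma         # standardized sample means

print("mean of z (should be ~0):", z.mean())
print("var of z  (should be ~1):", z.var())
print("P(|z| > 1.96) (should be ~0.05):", np.mean(np.abs(z) > 1.96))
```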
4 Quantiles of a Distribution
Let α ∈ (0, 1). The quantile of order 1 − α of a random variable X is the number qα such that:
P(X ≤ qα) = 1 − α
P(X ≥ qα) = α
FX(qα) = 1 − α
qα = FX⁻¹(1 − α)
If the distribution is standard normal, X ∼ N(0, 1): P(|X| > qα/2) = 2(1 − Φ(qα/2)) = α.
Standardization: Z = (X − µ)/σ ∼ N(0, 1). Use standardization if a Gaussian X ∼ N(µ, σ²) has unknown mean and variance to get the quantiles from Z-tables (standard normal tables):
P(X ≤ t) = P(Z ≤ (t − µ)/σ) = Φ((t − µ)/σ)

5 Confidence intervals
Confidence intervals follow the form:
(statistic) ± (critical value) · (estimated standard deviation of the statistic)
Let (E, (Pθ)θ∈Θ) be a statistical model based on observations X1, ..., Xn and assume Θ ⊆ R. Let α ∈ (0, 1).
Non-asymptotic confidence interval of level 1 − α for θ: any random interval I, depending on the sample X1, ..., Xn but not on θ, such that:
Pθ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ
Confidence interval of asymptotic level 1 − α for θ: any random interval I whose boundaries do not depend on θ and such that:
lim_{n→∞} Pθ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ

5.1 Two-sided asymptotic CI
Let X1, ..., Xn = X̃ with X̃ ∼ iid Pθ. A two-sided CI is a function of X̃ giving an upper and a lower bound in which the estimated parameter lies, I = [l(X̃), u(X̃)], with a certain probability P(θ ∈ I) ≥ 1 − α and conversely P(θ ∉ I) ≤ α.
Since the estimator is a r.v. depending on X̃, it has a variance Var(θ̂n) and a mean E[θ̂n]. Since the CLT is valid for every distribution, standardizing the distribution and massaging the expression yields an asymptotic CI:
I = [θ̂n − qα/2 √(Var(Xi)/n), θ̂n + qα/2 √(Var(Xi)/n)]
This expression depends on the true variance Var(Xi) of the r.v.s, so the variance has to be estimated. Three possible methods: plug-in (use the sample mean or empirical variance), solve (solve the quadratic inequality), conservative (use the theoretical maximum of the variance).

5.2 Sample Mean and Sample Variance
Let X1, ..., Xn ∼ iid Pµ, where E(Xi) = µ and Var(Xi) = σ² for all i = 1, 2, ..., n.
Sample mean: X̄n = (1/n) Σᵢ Xi
Sample variance: Sn = (1/n) Σᵢ (Xi − X̄n)² = (1/n) Σᵢ Xi² − X̄n²
Unbiased estimator of the sample variance: S̃n = (1/(n−1)) Σᵢ (Xi − X̄n)² = n/(n−1) · Sn

5.3 Delta Method
Used to find the asymptotic CI when the estimator is a function of the mean. The goal is to find an expression that converges, via the CLT, for a function of the mean.
Let Zn be a sequence of r.v.s such that √n (Zn − θ) →(d) N(0, σ²) and let g: R → R be continuously differentiable at θ. Then:
√n (g(Zn) − g(θ)) →(d) N(0, g′(θ)² σ²)
Example: let X1, ..., Xn ∼ iid Exp(λ) with λ > 0 and let X̄n = (1/n) Σᵢ Xi denote the sample mean. By the CLT, √n (X̄n − 1/λ) →(d) N(0, σ²) for some σ² that depends on λ. Setting g: R → R, x ↦ 1/x, the Delta method gives:
√n (g(X̄n) − g(1/λ)) →(d) N(0, g′(E[X])² Var(X)) = N(0, λ⁴ · 1/λ²) = N(0, λ²)
Pivot: let Tn be a function of the random samples X1, ..., Xn and θ. If the distribution of g(Tn) is the same for all θ, then g(Tn) is called a pivotal quantity or pivot. Example: let X be a random variable with mean µ and variance σ², and X1, ..., Xn iid samples of X. Then (X̄n − µ)/σ is a pivot, with θ = (µ, σ²)ᵀ being the parameter vector (not the same set of parameters that we use to define a statistical model).
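The plug-in method from 5.1 fits in a few lines of code. A minimal sketch, assuming Bernoulli data; the true p, the sample size and the level α are illustrative choices, not from the cheatsheet.

```python
# Minimal sketch of the two-sided asymptotic CI with the plug-in variance estimate:
# I = [p_hat -/+ q_{alpha/2} * sqrt(p_hat*(1 - p_hat)/n)]
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p_true, n, alpha = 0.3, 400, 0.05

x = rng.binomial(1, p_true, size=n)
p_hat = x.mean()                                # estimator = sample mean
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)       # plug-in estimate of the std. dev.
q = norm.ppf(1 - alpha / 2)                     # q_{alpha/2} of N(0, 1)

ci = (p_hat - q * se_hat, p_hat + q * se_hat)
print("95% asymptotic CI for p:", ci)
```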
6 Asymptotic Hypothesis tests
Two hypotheses (Θ0 and Θ1 disjoint subsets of Θ): H0: θ ∈ Θ0 vs. H1: θ ∈ Θ1. The goal is to reject H0 using a test statistic.
A hypothesis test has the form
ψ = 1{Tn ≥ c}
for some test statistic Tn and threshold c ∈ R; the rejection region is Rψ = {Tn > c}. For an asymptotic test, build a normalized Tn using the parameters from H0, then use the asymptotic distribution of Tn to get to probabilities; the threshold c is usually a Gaussian quantile such as qα/2.
Two-sided test: H1: θ ≠ θ0, ψ = 1{|Tn| > qα/2}. The statistic is symmetric about zero and the acceptance region is an interval: ψ = 1{|Tn| − c > 0}.
One-sided tests: H1: θ > θ0 uses ψ = 1(Tn > qα); H1: θ < θ0 uses ψ = 1(Tn < −qα).
Type 1 error: the test rejects the null hypothesis (ψ = 1) although H0 is actually true. Its probability is also known as the level of the test.
Type 2 error: the test does not reject the null hypothesis (ψ = 0) although the alternative hypothesis H1 is true.
Power of the test: πψ = inf_{θ∈Θ1} (1 − βψ(θ)), where βψ(θ) is the probability of a type 2 error and the infimum is taken over the alternative (worst case).
A test ψ has level α if αψ(θ) ≤ α, ∀θ ∈ Θ0, and asymptotic level α if lim_{n→∞} Pθ(ψ = 1) ≤ α, ∀θ ∈ Θ0.
Example: let X1, ..., Xn ∼ iid Ber(p*). Question: is p* = 1/2?
H0: p* = 1/2; H1: p* ≠ 1/2
For asymptotic level α we first standardize the estimated parameter p̂ = X̄n:
Tn = √n |X̄n − 0.5| / √(0.5(1 − 0.5))
ψn = 1(Tn > qα/2)
where qα/2 denotes the qα/2 quantile of a standard Gaussian and α is determined by the required level of ψ. Note the absolute value in Tn for this two-sided test.

6.1 P-Value
The (asymptotic) p-value of a test ψα is the smallest (asymptotic) level α at which ψα rejects H0. It is random since it depends on the sample. It can also be interpreted as the probability of seeing the realized test statistic Tn given the null hypothesis.
If the p-value ≤ α, H0 is rejected by ψα at (asymptotic) level α. The smaller the p-value, the more confidently one can reject H0.
Left-tailed p-values: pvalue = P(X ≤ x | H0) = P(Z < Tn,θ0(Xⁿ)) = Φ(Tn,θ0(Xⁿ))
Right-tailed p-values: pvalue = P(X ≥ x | H0)
Two-sided p-values: pvalue = 2 min{P(X ≤ x | H0), P(X ≥ x | H0)}; with Z ∼ N(0, 1), P(|Z| > |Tn,θ0(Xⁿ)|) = 2(1 − Φ(|Tn|)).

6.2 Comparisons of two proportions
Let X1, ..., Xn ∼ iid Bern(px) and Y1, ..., Yn ∼ iid Bern(py), with X independent of Y. Let p̂x = (1/n) Σᵢ Xi and p̂y = (1/n) Σᵢ Yi.
H0: px = py; H1: px ≠ py
To get the asymptotic variance, use the multivariate Delta method. Consider p̂x − p̂y = g(p̂x, p̂y) with g(x, y) = x − y; then
√n (g(p̂x, p̂y) − g(px, py)) →(d) N(0, ∇g(px, py)ᵀ Σ ∇g(px, py)) = N(0, px(1 − px) + py(1 − py))
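A hedged sketch of the two-proportion test above, using the plug-in variance estimate and a two-sided p-value; the simulated proportions and sample size are assumptions for illustration.

```python
# Minimal sketch of the two-proportion comparison from 6.2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 1_000
x = rng.binomial(1, 0.52, size=n)        # X_i ~ Bern(p_x)
y = rng.binomial(1, 0.47, size=n)        # Y_i ~ Bern(p_y)

px_hat, py_hat = x.mean(), y.mean()
var_hat = px_hat * (1 - px_hat) + py_hat * (1 - py_hat)   # plug-in asymptotic variance
tn = np.sqrt(n) * (px_hat - py_hat) / np.sqrt(var_hat)

pvalue = 2 * (1 - norm.cdf(abs(tn)))     # two-sided p-value
print("Tn =", tn, " p-value =", pvalue)
```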
7 Non-asymptotic Hypothesis tests
Hypothesis tests for small samples (they work on large samples too); the data must be Gaussian.

7.1 Chi squared
The χ²_d distribution with d degrees of freedom is the distribution of Z1² + Z2² + ... + Zd², where Z1, ..., Zd ∼ iid N(0, 1).
If V ∼ χ²_d:
E[V] = E[Z1²] + E[Z2²] + ... + E[Zd²] = d
Var(V) = Var(Z1²) + Var(Z2²) + ... + Var(Zd²) = 2d
Cochran's Theorem: if X1, ..., Xn ∼ iid N(µ, σ²), then the sample mean X̄n and the sample variance Sn are independent, and the sum of squares of n variables follows a chi-squared distribution with (n − 1) degrees of freedom:
nSn/σ² ∼ χ²_{n−1}
With the formula for the unbiased sample variance:
(n − 1)S̃n/σ² ∼ χ²_{n−1}

7.2 Student's T Test
Student's T distribution with d degrees of freedom: t_d := Z/√(V/d), where Z ∼ N(0, 1) and V ∼ χ²_d are independent.
Student's T test (one sample, two-sided): let X1, ..., Xn ∼ iid N(µ, σ²) and suppose we want to test H0: µ = µ0 = 0 vs. H1: µ ≠ 0. The test statistic follows a Student's T distribution:
Tn = √n (X̄n − µ0)/√S̃n = [√n (X̄n − µ0)/σ] / √(S̃n/σ²) ∼ t_{n−1}
This works because under H0 the numerator is N(0, 1), the denominator satisfies S̃n/σ² ∼ χ²_{n−1}/(n − 1), and the two are independent by Cochran's Theorem.
Student's T test at level α: ψα = 1{|Tn| > qα/2(t_{n−1})}
One-sided version: ψα = 1{Tn > qα(t_{n−1})}
Student's T test (two samples, two-sided): let X1, ..., Xn ∼ iid N(µX, σX²) and Y1, ..., Ym ∼ iid N(µY, σY²), and suppose we want to test H0: µX = µY vs. H1: µX ≠ µY. Test statistic under H0:
Tn,m = (X̄n − Ȳm) / √(σ̂X²/n + σ̂Y²/m)
When the samples have different sizes, we need to find the degrees of freedom N of the Student's T distribution Tn,m ∼ t_N with the Welch-Satterthwaite formula:
N = (σ̂X²/n + σ̂Y²/m)² / (σ̂X⁴/(n²(n−1)) + σ̂Y⁴/(m²(m−1))) ≥ min(n, m)
N should be rounded down.

7.3 Wald's Test
Squared distance of θ̂n^MLE from the hypothesized θ0, using the Fisher information I(θ̂n^MLE) as the metric.
Let X1, ..., Xn ∼ iid Pθ* for some true parameter θ* ∈ R^d, with maximum likelihood estimator θ̂n^MLE for θ*. Test H0: θ* = θ0 vs. H1: θ* ≠ θ0 (the cheatsheet takes θ0 = 0). Under H0, the asymptotic normality of the MLE θ̂n^MLE implies that
n ‖I(θ0)^{1/2} (θ̂n^MLE − θ0)‖² →(d) χ²_d
Test statistic:
Tn = n (θ̂n^MLE − θ0)ᵀ I(θ̂n^MLE) (θ̂n^MLE − θ0) →(d) χ²_d
Wald test of level α: ψα = 1{Tn > qα(χ²_d)}

7.4 Likelihood Ratio Test
Parameter space Θ ⊆ R^d; H0 states that the parameters θ_{r+1} through θ_d have the fixed values θ^c_{r+1} through θ^c_d, leaving the other r unspecified. That is:
H0: (θ_{r+1}, ..., θ_d)ᵀ = (θ^c_{r+1}, ..., θ^c_d)ᵀ
Construct two estimators:
θ̂n^MLE = argmax_{θ∈Θ} ℓn(θ)
θ̂n^c = argmax_{θ∈Θ0} ℓn(θ)
Test statistic:
Tn = 2(ℓ(X1, ..., Xn | θ̂n^MLE) − ℓ(X1, ..., Xn | θ̂n^c))
Wilks' Theorem: under H0, if the MLE conditions are satisfied,
Tn →(d) χ²_{d−r}
Likelihood ratio test at level α: ψα = 1{Tn > qα(χ²_{d−r})}

7.5 Implicit Testing
Todo

7.6 Goodness of Fit, Discrete Distributions
Let X1, ..., Xn be iid samples from a categorical distribution. Test H0: p = p0 against H1: p ≠ p0. Example: against the uniform distribution p0 = (1/K, ..., 1/K)ᵀ.
Test statistic:
Tn = n Σ_{k=1}^K (p̂k − p_k⁰)²/p_k⁰ →(d) χ²_{K−1}
Test at level α: ψα = 1{Tn > qα(χ²_{K−1})}

7.7 Kolmogorov-Smirnov test
7.8 Kolmogorov-Lilliefors test
7.9 QQ plots
Heavier tails: below > above the diagonal.
Lighter tails: above > below the diagonal.
Right-skewed: above > below > above the diagonal.
Left-skewed: below > above > below the diagonal.

8 Distances between distributions
8.1 Total variation distance
The total variation distance TV between the probability measures P and Q with sample space E is defined as:
TV(P, Q) = max_{A⊂E} |P(A) − Q(A)|
Calculation with pmfs/pdfs f and g:
TV(P, Q) = (1/2) Σ_{x∈E} |f(x) − g(x)| (discrete); (1/2) ∫_E |f(x) − g(x)| dx (continuous)
Symmetry: TV(P, Q) = TV(Q, P)
Nonnegative: TV(P, Q) ≥ 0
Definite: TV(P, Q) = 0 ⟺ P = Q
Triangle inequality: TV(P, V) ≤ TV(P, Q) + TV(Q, V)
If the supports of P and Q are disjoint: TV(P, Q) = 1
TV between a continuous and a discrete r.v.: TV(P, Q) = 1
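The one-sample t-test from 7.2 can be computed by hand and cross-checked against scipy. A minimal sketch; the simulated Gaussian data is an assumption, not an example from the cheatsheet.

```python
# One-sample t-test: Tn = sqrt(n)*(Xbar - mu0)/sqrt(S~n), compared with scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=0.2, scale=1.0, size=30)   # X_i ~ N(mu, sigma^2); H0: mu = 0
n, mu0 = len(x), 0.0

s2_tilde = x.var(ddof=1)                      # unbiased sample variance S~n
tn = np.sqrt(n) * (x.mean() - mu0) / np.sqrt(s2_tilde)
pval = 2 * (1 - stats.t.cdf(abs(tn), df=n - 1))

print("by hand:", tn, pval)
print("scipy  :", stats.ttest_1samp(x, popmean=mu0))
```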
8.2 KL divergence
The KL divergence (aka relative entropy) between probability measures P and Q with common sample space E and pmf/pdf functions p and q is defined as:
KL(P, Q) = Σ_{x∈E} p(x) ln(p(x)/q(x)) (discrete); ∫_E p(x) ln(p(x)/q(x)) dx (continuous)
The KL divergence is not a distance measure! Always sum over the support of P!
Asymmetric in general: KL(P, Q) ≠ KL(Q, P)
Nonnegative: KL(P, Q) ≥ 0
Definite: if P = Q then KL(P, Q) = 0
Does not satisfy the triangle inequality in general: KL(P, V) ≰ KL(P, Q) + KL(Q, V)
Estimator of the KL divergence:
KL(Pθ*, Pθ) = Eθ*[ln(pθ*(X)/pθ(X))]
KL̂(Pθ*, Pθ) = const − (1/n) Σᵢ log pθ(Xi)

9 Maximum likelihood estimation
Let (E, (Pθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. random variables X1, X2, ..., Xn. Assume that there exists θ* ∈ Θ such that Xi ∼ Pθ*.
The likelihood of the model is the product over the n samples of the pdf/pmf:
Ln(X1, X2, ..., Xn, θ) = ∏ᵢ pθ(Xi) if E is discrete; ∏ᵢ fθ(Xi) if E is continuous.
The maximum likelihood estimator is the (unique) θ that minimizes KL̂(Pθ*, Pθ) over the parameter space. (The minimizer of the KL divergence is unique because KL is strictly convex in the space of distributions once Pθ* is fixed.)
θ̂n^MLE = argmin_{θ∈Θ} KL̂n(Pθ*, Pθ) = argmax_{θ∈Θ} ln ∏ᵢ pθ(Xi) = argmax_{θ∈Θ} Σᵢ ln pθ(Xi)
Since taking derivatives of products is hard but easy for sums, and exp() is very common in pdfs, we usually take the log of the likelihood function before maximizing it:
ℓ(X1, X2, ..., Xn, θ) = ln(Ln(X1, X2, ..., Xn, θ)) = Σᵢ ln(Li(Xi, θ))
Cookbook: set up the likelihood function and take its log. Take the partial derivative(s) of the loglikelihood with respect to the parameter(s), set them to zero and solve for the parameter(s).
If an indicator function in the pdf/pmf does not depend on the parameter, it can be ignored. If it depends on the parameter it cannot be ignored, because it creates a discontinuity in the loglikelihood; the maximum/minimum of the Xi is then the maximum likelihood estimator.

9.1 Fisher Information
The Fisher information is the covariance matrix of the gradient of the loglikelihood function. It is equal to the negative expectation of the Hessian of the loglikelihood function and captures the negative of the expected curvature of the loglikelihood.
Let θ ∈ Θ ⊂ R^d and let (E, {Pθ}θ∈Θ) be a statistical model. Let fθ(x) be the pdf of the distribution Pθ. The Fisher information of the statistical model is:
I(θ) = Cov(∇ℓ(θ)) = E[∇ℓ(θ) ∇ℓ(θ)ᵀ] − E[∇ℓ(θ)] E[∇ℓ(θ)]ᵀ = −E[Hℓ(θ)]
where ℓ(θ) = ln fθ(X). If ∇ℓ(θ) ∈ R^d, then I(θ) is a d × d matrix. The definition is the same when the distribution has a pmf pθ(x), with the expectation taken with respect to the pmf.
Formula for the calculation of the Fisher information of X: let (R, {Pθ}θ∈R) denote a continuous statistical model and let fθ(x) be the pdf of Pθ, twice differentiable as a function of θ. Then:
I(θ) = ∫_{−∞}^{∞} (∂fθ(x)/∂θ)² / fθ(x) dx
Models with one parameter (e.g. Bernoulli):
I(θ) = Var(ℓ′(θ)) = −E(ℓ″(θ))
It is usually easier to use the second derivative.
Models with multiple parameters (e.g. Gaussians):
I(θ) = −E[Hℓ(θ)]
Cookbook:
• Find the loglikelihood.
• Take the second derivative (the Hessian if multivariate).
• Massage the second derivative or Hessian (isolate functions of Xi) so it can be used with −E(ℓ″(θ)) or −E[Hℓ(θ)].
• Find the expectation of the functions of Xi and substitute them back into the Hessian or second derivative. Be extra careful to substitute the right power back.
• Don't forget the minus sign!
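The MLE cookbook can be checked numerically by minimizing the negative loglikelihood. A minimal sketch assuming an exponential model; the true λ and the optimizer bounds are arbitrary choices.

```python
# Numerical MLE vs. closed form for Exp(lambda): l_n(lambda) = n*ln(lambda) - lambda*sum(x),
# with closed-form maximizer lambda_hat = 1 / Xbar.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.exponential(scale=1 / 2.5, size=1_000)   # X_i ~ Exp(lambda = 2.5)

def neg_loglik(lam):
    # negative loglikelihood -l_n(lambda)
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
print("numerical MLE:", res.x)
print("closed form  :", 1 / x.mean())
```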
9.2 Asymptotic normality of the maximum likelihood estimator
Under certain conditions the MLE is asymptotically normal and consistent. This applies even if the MLE is not the sample average.
Let the true parameter be θ* ∈ Θ. Necessary assumptions:
• The parameter is identifiable;
• For all θ ∈ Θ, the support of Pθ does not depend on θ (this fails e.g. for Unif(0, θ));
• θ* is not on the boundary of Θ;
• The Fisher information I(θ) is invertible in a neighborhood of θ*;
• A few more technical conditions.
The asymptotic variance of the MLE is the inverse of the Fisher information:
√n (θ̂n^MLE − θ*) →(d) N_d(0, I(θ*)⁻¹)

10 Method of Moments
Let X1, ..., Xn ∼ iid Pθ* associated with the model (E, {Pθ}θ∈Θ), with E ⊆ R and Θ ⊆ R^d, for some d ≥ 1.
Population moments: mk(θ) = Eθ[X1^k], 1 ≤ k ≤ d
Empirical moments: m̂k = (1/n) Σᵢ Xi^k
Convergence of the empirical moments:
m̂k →(P, a.s.) mk
(m̂1, ..., m̂d) →(P, a.s.) (m1, ..., md)
The MOM estimator uses the moments map M, which sends the parameters of a model to the moments of its distribution:
M: Θ → R^d, θ ↦ (m1(θ), m2(θ), ..., md(θ))
This map is invertible (i.e. it gives a system of equations that can be solved for the true parameter vector θ*):
M⁻¹(m1(θ*), m2(θ*), ..., md(θ*)) = θ*
Recipe: find as many moments as there are parameters, set up the system of equations, solve for the parameters, and plug in the empirical moments:
θ̂n^MM = M⁻¹((1/n) Σᵢ Xi, (1/n) Σᵢ Xi², ..., (1/n) Σᵢ Xi^d)
Assuming M⁻¹ is continuously differentiable at M(θ), the asymptotic variance of the MOM estimator is given by:
√n (θ̂n^MM − θ) →(d) N(0, Γ(θ))
with
Γ(θ) = [∂M⁻¹/∂θ (M(θ))]ᵀ Σ(θ) [∂M⁻¹/∂θ (M(θ))] = ∇θ(M⁻¹)ᵀ Σ ∇θ(M⁻¹)
where Σ(θ) is the covariance matrix of the random vector of the moments (X1¹, X1², ..., X1^d).
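A minimal method-of-moments sketch, assuming a Gaussian model N(µ, σ²): m1 = µ and m2 = µ² + σ², so inverting the moments map gives µ̂ = m̂1 and σ̂² = m̂2 − m̂1².

```python
# Method of moments for N(mu, sigma^2), using the first two empirical moments.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=1.5, scale=2.0, size=5_000)

m1_hat = np.mean(x)          # empirical first moment
m2_hat = np.mean(x**2)       # empirical second moment

mu_hat = m1_hat
sigma2_hat = m2_hat - m1_hat**2
print("MOM estimates:", mu_hat, sigma2_hat)   # roughly (1.5, 4.0)
```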
11 Bayesian Statistics
Bayesian inference conceptually amounts to weighting the likelihood Ln(θ) by prior knowledge we might have on θ. Given a statistical model, we technically model our parameter θ as if it were a random variable, and we define the prior distribution (PDF) π(θ).
Let X1, ..., Xn be the data. We write Ln(X1, ..., Xn | θ) for the joint probability distribution of X1, ..., Xn conditioned on θ, where θ ∼ π. This is exactly the likelihood from the frequentist approach.

11.1 Bayes' formula
The posterior distribution verifies:
∀θ ∈ Θ, π(θ | X1, ..., Xn) ∝ π(θ) Ln(X1, ..., Xn | θ)
The constant is the normalization factor which ensures the result is a proper distribution and does not depend on θ:
π(θ | X1, ..., Xn) = π(θ) Ln(X1, ..., Xn | θ) / ∫_Θ π(θ′) Ln(X1, ..., Xn | θ′) dθ′
We can often use an improper prior, i.e. a prior that is not a proper probability distribution (its integral diverges), and still get a proper posterior. For example, the improper prior π(θ) = 1 on Θ gives the likelihood as the posterior.

11.2 Jeffreys Prior
πJ(θ) ∝ √(det I(θ))
where I(θ) is the Fisher information. This prior is invariant under reparameterization, which means that if we have η = φ(θ), then the same construction gives a probability distribution for η verifying:
π̃J(η) ∝ √(det Ĩ(η))
The change of parameter follows the formula:
π̃J(η) = det(∇φ⁻¹(η)) πJ(φ⁻¹(η))

11.3 Bayesian confidence region
Let α ∈ (0, 1). A *Bayesian confidence region with level α* is a random subset R ⊂ Θ, depending on X1, ..., Xn (and on the prior π), such that:
P[θ ∈ R | X1, ..., Xn] ≥ 1 − α
Bayesian confidence regions and confidence intervals are distinct notions. The Bayesian framework can be used to estimate the true underlying parameter; in that case it is used to build a new class of estimators, based on the posterior distribution.

11.4 Bayes estimator
Posterior mean: θ̂^(π) = ∫_Θ θ π(θ | X1, ..., Xn) dθ
Maximum a posteriori estimator (MAP): θ̂^MAP_(π) = argmax_{θ∈Θ} π(θ | X1, ..., Xn)
The MAP is equivalent to the MLE if the prior is uniform.
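The posterior formula in 11.1 can be evaluated on a grid when no closed form is at hand. A sketch assuming Bernoulli data and a flat prior (the grid resolution and the data are illustrative); it also prints the posterior mean and MAP from 11.4.

```python
# Grid approximation of pi(theta | X) ∝ pi(theta) * L_n(X | theta) for Bernoulli data.
import numpy as np

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.7, size=50)                 # X_i ~ Bern(theta)
k, n = x.sum(), len(x)

theta = np.linspace(0.001, 0.999, 2_000)          # grid over Theta = (0, 1)
prior = np.ones_like(theta)                       # flat prior pi(theta) = 1
loglik = k * np.log(theta) + (n - k) * np.log(1 - theta)
post = prior * np.exp(loglik - loglik.max())      # unnormalized posterior
post /= np.trapz(post, theta)                     # normalize on the grid

print("posterior mean:", np.trapz(theta * post, theta))
print("MAP           :", theta[np.argmax(post)])  # equals the MLE for a flat prior
```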
12 OLS
Given two random variables X and Y, how can we predict the values of Y given X? Consider (X1, Y1), ..., (Xn, Yn) ∼ iid P, where P is an unknown joint distribution. P can be described entirely by:
g(x) = ∫ f(x, y) dy
h(y | X = x) = f(x, y)/g(x)
where f is the joint PDF, g the marginal density of X and h the conditional density. What we are interested in is h(Y | X).
Regression function: for a partial description, we can consider instead the conditional expectation of Y given X = x:
x ↦ f(x) = E[Y | X = x] = ∫ y h(y | x) dy
We can also consider different descriptions of the distribution, like the median, quantiles or the variance.
Linear regression: trying to fit an arbitrary function to E[Y | X = x] is a nonparametric problem; therefore, we restrict the problem to the tractable one of linear functions:
f: x ↦ a + bx
Theoretical linear regression: let X, Y be two random variables with two moments and V[X] > 0. The theoretical linear regression of Y on X is the line a* + b*x where
(a*, b*) = argmin_{(a,b)∈R²} E[(Y − a − bX)²]
which gives:
b* = Cov(X, Y)/V[X], a* = E[Y] − b* E[X]
Noise: we model the noise of Y around the regression line by a random variable ε = Y − a* − b*X, which satisfies:
E[ε] = 0, Cov(X, ε) = 0
**Statistical inference**: we have to estimate a* and b* from the data. We have n random pairs (X1, Y1), ..., (Xn, Yn) ∼ iid (X, Y) such that:
Yi = a* + b*Xi + εi
The Least Squares Estimator (LSE) of (a*, b*) is the minimizer of the squared sum:
(ân, b̂n) = argmin_{(a,b)∈R²} Σᵢ (Yi − a − bXi)²
The estimators are given by:
b̂n = (X̄Y̅ − X̄ Ȳ)/(X̄² − (X̄)²), ân = Ȳ − b̂n X̄
The multivariate regression is given by:
Yi = Σ_{j=1}^p Xi^(j) βj* + εi = Xiᵀ β* + εi, with Xi of size p × 1 and β* of size p × 1.
We can assume that Xi^(1) = 1 for the intercept.
• If β* = (a*, b*)ᵀ, then β1* = a* is the intercept.
• εi is the noise, satisfying Cov(Xi, εi) = 0.
The multivariate Least Squares Estimator (LSE) of β* is the minimizer of the sum of squared errors:
β̂ = argmin_{β∈R^p} Σᵢ (Yi − Xiᵀβ)²
Matrix form: let Y = (Y1, ..., Yn)ᵀ ∈ R^n, ε = (ε1, ..., εn)ᵀ and
X = (X1ᵀ; ...; Xnᵀ) ∈ R^{n×p}
X is called the **design matrix**. The regression is given by:
Y = Xβ* + ε
and the LSE by:
β̂ = argmin_{β∈R^p} ‖Y − Xβ‖₂²
Suppose n ≥ p and rank(X) = p. If we write
F(β) = ‖Y − Xβ‖₂² = (Y − Xβ)ᵀ(Y − Xβ)
then:
∇F(β) = −2Xᵀ(Y − Xβ)
Least squares estimator: setting ∇F(β) = 0 gives the expression of β̂:
β̂ = (XᵀX)⁻¹XᵀY
**Geometric interpretation**: Xβ̂ is the orthogonal projection of Y onto the subspace spanned by the columns of X:
Xβ̂ = PY, where P = X(XᵀX)⁻¹Xᵀ is the projector.
**Statistical inference**: let us suppose that
• the design matrix X is deterministic and rank(X) = p,
• the model is **homoscedastic**: ε1, ..., εn are i.i.d.,
• the noise is Gaussian: ε ∼ N_n(0, σ²I_n).
We therefore have: Y ∼ N_n(Xβ*, σ²I_n).
Properties of the LSE:
β̂ ∼ N_p(β*, σ²(XᵀX)⁻¹)
The quadratic risk of β̂ is given by:
E[‖β̂ − β*‖₂²] = σ² Tr((XᵀX)⁻¹)
The prediction error is given by:
E[‖Y − Xβ̂‖₂²] = σ²(n − p)
The unbiased estimator of σ² is:
σ̂² = ‖Y − Xβ̂‖₂²/(n − p) = (1/(n − p)) Σᵢ ε̂i²
By **Cochran's Theorem**:
(n − p)σ̂²/σ² ∼ χ²_{n−p}, β̂ ⊥ σ̂²
**Significance test**: test H0: βj = 0 against H1: βj ≠ 0. Let
γj = ((XᵀX)⁻¹)_{jj} > 0
then:
(β̂j − βj)/√(σ̂² γj) ∼ t_{n−p}
We can define the test statistic for our test:
Tn^(j) = β̂j/√(σ̂² γj)
The test with non-asymptotic level α is given by:
ψα^(j) = 1{|Tn^(j)| > qα/2(t_{n−p})}
**Bonferroni's test**: if we want to test the significance of multiple coefficients at the same time, we cannot use the same level α for each of them; we must use a stricter test for each. Let S ⊆ {1, ..., p} and consider
H0: ∀j ∈ S, βj = 0 vs. H1: ∃j ∈ S, βj ≠ 0
The *Bonferroni test* with significance level α is given by:
ψα^(S) = max_{j∈S} ψ^(j)_{α/K}, where K = |S|.
The rejection region is therefore the union of the individual rejection regions:
Rα^(S) = ∪_{j∈S} R^(j)_{α/K}
This test has non-asymptotic level at most α:
P_{H0}(Rα^(S)) ≤ Σ_{j∈S} P_{H0}(R^(j)_{α/K}) = α
This test also works for implicit testing (for example β1 ≥ β2).

13 Generalized Linear Models
We relax the assumption that the regression function µ(x) = E[Y | X = x] is linear. Instead, we assume that g ∘ µ is linear, for some function g:
g(µ(x)) = xᵀβ
The function g is assumed to be known and is referred to as the link function. It maps the domain of the dependent variable to the entire real line:
• it has to be strictly increasing,
• it has to be continuously differentiable,
• its range is all of R.

13.1 The Exponential Family
A family of distributions {Pθ: θ ∈ Θ}, where the parameter space Θ ⊂ R^k is k-dimensional, is called a k-parameter exponential family on R^q if the pmf or pdf fθ: R^q → R of Pθ can be written in the form:
fθ(y) = h(y) exp(η(θ) · T(y) − B(θ))
where
η(θ) = (η1(θ), ..., ηk(θ))ᵀ: R^k → R^k
T(y) = (T1(y), ..., Tk(y))ᵀ: R^q → R^k
B(θ): R^k → R
h(y): R^q → R
For k = 1 it reduces to:
fθ(y) = h(y) exp(η(θ) T(y) − B(θ))

14 Expectation
E[X] = ∫_{−∞}^{+∞} x fX(x) dx
E[g(X)] = ∫_{−∞}^{+∞} g(x) fX(x) dx
E[X | Y = y] = ∫_{−∞}^{+∞} x fX|Y(x | y) dx
The integration limits only have to cover the support of the pdf. Discrete r.v.s work the same way, with sums and pmfs.
Total expectation theorem:
E[X] = ∫_{−∞}^{+∞} fY(y) E[X | Y = y] dy
Law of iterated expectation: E[Y] = E[E[Y | X]]
Expectation of a constant a: E[a] = a
Product of independent r.v.s X and Y: E[X·Y] = E[X]·E[Y]
Product of dependent r.v.s X and Y: in general E[X·Y] ≠ E[X]·E[Y]; E[X·Y] = E[E[X·Y | Y]] = E[Y·E[X | Y]]
Linearity of expectation, for given scalars a and c: E[aX + cY] = aE[X] + cE[Y]
If the variance of X is known: E[X²] = Var(X) + (E[X])²
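A minimal numpy sketch of the matrix-form LSE and the per-coefficient significance test from section 12; the simulated design matrix, coefficients and noise level are assumptions for illustration.

```python
# beta_hat = (X'X)^-1 X'Y, sigma2_hat = ||Y - X beta_hat||^2 / (n - p),
# and the t-test for H0: beta_j = 0.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix, col 1 = intercept
beta_true = np.array([1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)                 # Y = X beta* + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)             # unbiased estimate of sigma^2

j = 2                                            # test H0: beta_j = 0
t_stat = beta_hat[j] / np.sqrt(sigma2_hat * XtX_inv[j, j])
pval = 2 * (1 - t.cdf(abs(t_stat), df=n - p))
print("beta_hat:", beta_hat)
print(f"T^({j}) = {t_stat:.3f}, p-value = {pval:.3f}")
```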
 
15 Variance
Variance is the expected squared distance from the mean.
Var(X) = E[(X − E(X))²] = E[X²] − (E[X])²
Variance of a product with a constant a: Var(aX) = a² Var(X)
Variance of the sum of two dependent r.v.s: Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Variance of the sum/difference of two independent r.v.s:
Var(X + Y) = Var(X) + Var(Y)
Var(X − Y) = Var(X) + Var(Y)

16 Covariance
The covariance is a measure of how much the values of two correlated random variables determine each other.
Cov(X, Y) = E[(X − µX)(Y − µY)] = E[XY] − E[X]E[Y] = E[X(Y − µY)]
Possible notations: Cov(X, Y) = σ(X, Y) = σ_{X,Y}
Covariance is commutative: Cov(X, Y) = Cov(Y, X)
The covariance of a r.v. with itself is its variance: Cov(X, X) = E[(X − µX)²] = Var(X)
Useful properties:
Cov(aX + h, bY + c) = ab Cov(X, Y)
Cov(X, X + Y) = Var(X) + Cov(X, Y)
Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z)
If Cov(X, Y) = 0, we say that X and Y are uncorrelated. If X and Y are independent, their covariance is zero. The converse is not always true; it is only true if X and Y form a Gaussian vector, i.e. any linear combination αX + βY is Gaussian for all (α, β) ∈ R² without {0, 0}.

17 Correlation coefficient
ρ(X, Y) = Cov(X, Y)/√(Var(X) Var(Y))

18 Important probability distributions
Bernoulli
Parameter p ∈ [0, 1], discrete.
px(k) = p if k = 1; (1 − p) if k = 0
E[X] = p, Var(X) = p(1 − p)
Likelihood (n trials): Ln(X1, ..., Xn, p) = p^{Σᵢ Xi} (1 − p)^{n − Σᵢ Xi}
Loglikelihood (n trials): ℓn(p) = ln(p) Σᵢ Xi + (n − Σᵢ Xi) ln(1 − p)
MLE: p̂^MLE = (1/n) Σᵢ Xi
Fisher information: I(p) = 1/(p(1 − p))
Canonical exponential form: fθ(y) = exp(yθ − ln(1 + e^θ)), with b(θ) = ln(1 + e^θ), c(y, φ) = 0, θ = ln(p/(1 − p)), φ = 1.

Multinomial
Parameters n > 0 and p1, ..., pr.
px(x) = n!/(x1! ··· xr!) · p1^{x1} ··· pr^{xr}
E[Xi] = n pi, Var(Xi) = n pi(1 − pi)
Likelihood: ∏_{j=1}^r pj^{Tj}, where Tj = Σᵢ 1(Xi = j) counts how often outcome j is seen in n trials.
Loglikelihood: ℓn = Σ_{j=1}^r Tj ln pj

Poisson
Parameter λ, discrete; approximates the binomial PMF when n is large, p is small and λ = np.
px(k) = exp(−λ) λ^k/k! for k = 0, 1, ...
E[X] = λ, Var(X) = λ
Likelihood: Ln(x1, ..., xn, λ) = (∏ᵢ λ^{xi}/xi!) e^{−nλ}
Loglikelihood: ℓn(λ) = −nλ + log(λ) Σᵢ xi − log(∏ᵢ xi!)
MLE: λ̂^MLE = (1/n) Σᵢ Xi
Fisher information: I(λ) = 1/λ
Canonical exponential form: fθ(y) = exp(yθ − e^θ − ln y!), with b(θ) = e^θ, c(y, φ) = −ln y!, θ = ln λ, φ = 1.
Poisson process, k arrivals in t slots:
px(k, t) = P(Nt = k) = e^{−λt} (λt)^k/k!
E[Nt] = λt, Var(Nt) = λt

Binomial
Parameters p and n, discrete. Describes the number of successes in n independent Bernoulli trials.
px(k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, ..., n
E[X] = np, Var(X) = np(1 − p)
Likelihood (n observations from Bin(K, θ)): Ln(X1, ..., Xn, θ) = ∏ᵢ (K choose Xi) · θ^{Σᵢ Xi} (1 − θ)^{nK − Σᵢ Xi}
Loglikelihood: ℓn(θ) = C + Σᵢ Xi log θ + (nK − Σᵢ Xi) log(1 − θ)
Fisher information (per Bernoulli trial): I(p) = 1/(p(1 − p))
Canonical exponential form: fp(y) = exp(y(ln(p) − ln(1 − p)) + n ln(1 − p) + ln(n choose y)), with θ = ln(p/(1 − p)), −b(θ) = n ln(1 − p), c(y, φ) = ln(n choose y).

Exponential
Parameter λ, continuous.
fx(x) = λ exp(−λx) for x ≥ 0; 0 otherwise
Fx(x) = 1 − exp(−λx) for x ≥ 0; 0 otherwise
P(X > a) = exp(−λa)
E[X] = 1/λ, E[X²] = 2/λ², Var(X) = 1/λ²
Likelihood: L(X1, ..., Xn; λ) = λ^n exp(−λ Σᵢ Xi)
Loglikelihood: ℓn(λ) = n ln(λ) − λ Σᵢ Xi
MLE: λ̂^MLE = n/Σᵢ Xi
Fisher information: I(λ) = 1/λ²
Canonical exponential form: fθ(y) = exp(yθ − (−ln(−θ))), with b(θ) = −ln(−θ), c(y, φ) = 0, θ = −λ = −1/µ, φ = 1.

Geometric
Number of trials T up to (and including) the first success.
pT(t) = p(1 − p)^{t−1}, t = 1, 2, ...
E[T] = 1/p, Var(T) = (1 − p)/p²

Pascal
The negative binomial or Pascal distribution is a generalization of the geometric distribution. It relates to the random experiment of repeated independent trials until observing k successes, i.e. the time of the kth arrival.
Yk = T1 + ... + Tk, with Ti ∼ iid Geometric(p)
E[Yk] = k/p, Var(Yk) = k(1 − p)/p²
pYk(t) = (t−1 choose k−1) p^k (1 − p)^{t−k}, t = k, k + 1, ...

Uniform
Parameters a and b, continuous.
fx(x) = 1/(b − a) if a < x < b; 0 otherwise
Fx(x) = 0 for x ≤ a; (x − a)/(b − a) for x ∈ [a, b); 1 for x ≥ b
E[X] = (a + b)/2, Var(X) = (b − a)²/12
Likelihood (for Unif(0, b)): L(x1, ..., xn; b) = (1/b^n) · 1(maxᵢ xi ≤ b)

Shifted Exponential
Parameters λ and a ∈ R, continuous.
fx(x) = λ exp(−λ(x − a)) for x ≥ a; 0 otherwise
Fx(x) = 1 − exp(−λ(x − a)) for x ≥ a; 0 otherwise
E[X] = a + 1/λ, Var(X) = 1/λ²
Likelihood: L(X1, ..., Xn; λ, a) = λ^n exp(−λ Σᵢ (Xi − a)) · 1(minᵢ Xi ≥ a)
Loglikelihood: ℓ(λ, a) = n ln λ − λ Σᵢ Xi + nλa
MLE: â^MLE = minᵢ Xi, λ̂^MLE = 1/(X̄n − â)

Cauchy
Parameter m, continuous.
fm(x) = (1/π) · 1/(1 + (x − m)²)
E[X] and Var(X) are not defined!
med(X) = m: P(X > m) = P(X < m) = 1/2 = ∫_m^∞ (1/π) · 1/(1 + (x − m)²) dx

Chi squared
The χ²_d distribution with d degrees of freedom is the distribution of Z1² + Z2² + ... + Zd², where Z1, ..., Zd ∼ iid N(0, 1).
If V ∼ χ²_d: E[V] = d, Var(V) = 2d

Student's T Distribution
Tn := Z/√(V/n), where Z ∼ N(0, 1) and V ∼ χ²_n are independent.

Univariate Gaussians
Parameters µ and σ² > 0, continuous.
f(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))
E[X] = µ, Var(X) = σ²
CDF of the standard Gaussian: Φ(z) = ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx
Likelihood: L(x1, ..., xn; µ, σ²) = 1/(σ√(2π))^n · exp(−(1/(2σ²)) Σᵢ (xi − µ)²)
Loglikelihood: ℓn(µ, σ²) = −n log(σ√(2π)) − (1/(2σ²)) Σᵢ (Xi − µ)²
MLE: µ̂^MLE = X̄n, σ̂²^MLE = (1/n) Σᵢ (Xi − X̄n)²
Fisher information: I(µ, σ²) = [[1/σ², 0], [0, 1/(2σ⁴)]]
Gaussians are invariant under affine transformations: aX + b ∼ N(aµ + b, a²σ²)
Sum/difference of independent Gaussians: let X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) be independent; then X + Y ∼ N(µX + µY, σX² + σY²) and U = X − Y ∼ N(µX − µY, σX² + σY²).
If X ∼ N(0, σ²), then −X ∼ N(0, σ²).
Symmetry: P(|X| > x) = 2P(X > x)
Standardization: Z = (X − µ)/σ ∼ N(0, 1), so P(X ≤ t) = P(Z ≤ (t − µ)/σ)
Higher moments: E[X²] = µ² + σ², E[X³] = µ³ + 3µσ², E[X⁴] = µ⁴ + 6µ²σ² + 3σ⁴

18.1 Useful to know
18.1.1 Min of iid exponential r.v.s
Let X1, ..., Xn be i.i.d. Exp(λ) random variables. Distribution of minᵢ(Xi):
P(minᵢ(Xi) ≤ t) = 1 − P(minᵢ(Xi) ≥ t) = 1 − P(X1 ≥ t) P(X2 ≥ t) ··· P(Xn ≥ t) = 1 − (1 − FX(t))^n = 1 − e^{−nλt}
Differentiating with respect to t gives the pdf of minᵢ(Xi):
fmin(t) = (nλ) e^{−(nλ)t}
i.e. minᵢ(Xi) ∼ Exp(nλ).
18.1.2 Counting committees
"Out of 2n people, we want to choose a committee of n people, one of whom will be its chair. In how many different ways can this be done?"
n · (2n choose n) = 2n · (2n−1 choose n−1)
"In a group of 2n people, consisting of n boys and n girls, we want to select a committee of n people. In how many ways can this be done?"
(2n choose n) = Σ_{i=0}^{n} (n choose i)(n choose n−i)
"How many subsets does a set with 2n elements have?"
2^{2n} = Σ_{k=0}^{2n} (2n choose k)
"Out of n people, we want to form a committee consisting of a chair and other members. We allow the committee size to be any integer in the range 1, 2, ..., n. How many choices do we have in selecting a committee-chair combination?"
n·2^{n−1} = Σ_{i=1}^{n} i · (n choose i)

18.2 Finding joint PDFs
fX,Y(x, y) = fX(x) fY|X(y | x)

19 Random Vectors
A random vector X = (X^(1), ..., X^(d))ᵀ of dimension d × 1 is a vector-valued function from a probability space Ω to R^d:
X: Ω → R^d, ω ↦ (X^(1)(ω), ..., X^(d)(ω))ᵀ
where each X^(k) is a (scalar) random variable on Ω.
PDF of X: the joint distribution of its components X^(1), ..., X^(d).
CDF of X: R^d → [0, 1], x ↦ P(X^(1) ≤ x^(1), ..., X^(d) ≤ x^(d)).
A sequence X1, X2, ... of random vectors converges in probability to X if and only if each component sequence X1^(k), X2^(k), ... converges in probability to X^(k).
Expectation of a random vector: the elementwise expectation. For X of dimension d × 1:
E[X] = (E[X^(1)], ..., E[X^(d)])ᵀ
The expectation of a random matrix is the expected value of each of its elements. Let X = {Xij} be an n × p random matrix; then E[X] is the n × p matrix of the numbers E[Xij] (if they exist).
Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants:
E[X + Y] = E[X] + E[Y]
E[AXB] = A E[X] B
Covariance Matrix
Let X be a random vector of dimension d × 1 with expectation µX. Matrix outer products!
Σ = Cov(X) = E[(X − µX)(X − µX)ᵀ]
Σ is a d × d matrix with entries σij: a table of the pairwise covariances of the elements of the random vector. Its diagonal elements are the variances of the elements, the off-diagonal elements are the covariances. Note that the covariance is commutative, e.g. σ12 = σ21.
Alternative forms: Σ = E[XXᵀ] − E[X]E[X]ᵀ = E[XXᵀ] − µX µXᵀ
Let A and B be conformable matrices of constants:
Cov(AX + B) = Cov(AX) = A Cov(X) Aᵀ = AΣAᵀ
Every covariance matrix is positive semi-definite: Σ ⪰ 0.
Gaussian Random Vectors
A random vector X = (X^(1), ..., X^(d))ᵀ is a Gaussian vector, or multivariate Gaussian or normal variable, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a "Gaussian" variable with zero variance), i.e. if αᵀX is (univariate) Gaussian or constant for any constant non-zero vector α ∈ R^d.
Multivariate Gaussians
The distribution of X, the d-dimensional Gaussian or normal distribution, is completely specified by the mean vector µ = E[X] = (E[X^(1)], ..., E[X^(d)])ᵀ and the d × d covariance matrix Σ. If Σ is invertible, then the pdf of X is:
fX(x) = 1/√((2π)^d det(Σ)) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)), x ∈ R^d
where det(Σ) is the determinant of Σ, which is positive when Σ is invertible.
If µ = 0 and Σ is the identity matrix, then X is called a standard normal random vector.
If the covariance matrix Σ is diagonal, the pdf factors into pdfs of univariate Gaussians, and hence the components are independent.
The linear transform of a Gaussian X ∼ Nd(µ, Σ) with conformable matrices A and B is a Gaussian:
AX + B ∼ Nd(Aµ + B, AΣAᵀ)
Multivariate CLT
Let X1, ..., Xn ∈ R^d be independent copies of a random vector X such that E[X] = µ (d × 1 vector of expectations) and Cov(X) = Σ:
√n (X̄n − µ) →(d) N(0, Σ)
√n Σ^{−1/2} (X̄n − µ) →(d) N(0, I_d)
where Σ^{−1/2} is the d × d matrix such that Σ^{−1/2} Σ^{−1/2} = Σ⁻¹ and I_d is the identity matrix.
Multivariate Delta Method
If √n (Zn − θ) →(d) N(0, Σ) and g is continuously differentiable at θ, then
√n (g(Zn) − g(θ)) →(d) N(0, ∇g(θ)ᵀ Σ ∇g(θ))
(as used in 6.2 for the comparison of two proportions).

20 Algebra
Absolute value inequalities:
|f(x)| < a ⇒ −a < f(x) < a
|f(x)| > a ⇒ f(x) > a or f(x) < −a
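A simulation sketch (not from the cheatsheet) of two facts from section 19: Cov(AX + B) = AΣAᵀ and the linear transform of a Gaussian vector; the chosen µ, Σ, A and B are arbitrary.

```python
# Empirically check Cov(AX + B) = A Sigma A^T and E[AX + B] = A mu + B.
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
B = np.array([5.0, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of X ~ N(mu, Sigma)
Y = X @ A.T + B                                        # Y = AX + B for each draw

print("empirical Cov(AX+B):\n", np.cov(Y, rowvar=False))
print("A Sigma A^T:\n", A @ Sigma @ A.T)
print("empirical mean:", Y.mean(axis=0), " vs A mu + B:", A @ mu + B)
```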
21 Matrix algebra
‖Ax‖² = (Ax)ᵀ(Ax) = xᵀAᵀAx
22 Calculus
Differentiation under the integral sign:
d/dx ∫_{a(x)}^{b(x)} f(x, t) dt = f(x, b(x)) b′(x) − f(x, a(x)) a′(x) + ∫_{a(x)}^{b(x)} ∂f/∂x (x, t) dt
Concavity in 1 dimension
If g: I → R is twice differentiable in the interval I:
concave: if and only if g″(x) ≤ 0 for all x ∈ I
strictly concave: if g″(x) < 0 for all x ∈ I
convex: if and only if g″(x) ≥ 0 for all x ∈ I
strictly convex: if g″(x) > 0 for all x ∈ I

Multivariate Calculus
The gradient ∇f of a twice differentiable function f: R^d → R is defined as:
∇f: R^d → R^d, θ = (θ1, ..., θd)ᵀ ↦ (∂f/∂θ1, ..., ∂f/∂θd)ᵀ
Hessian
The Hessian of f is the symmetric matrix of second partial derivatives of f:
Hf(θ) = ∇²f(θ) = [∂²f/∂θi ∂θj (θ)]_{i,j=1,...,d} ∈ R^{d×d}
A symmetric (real-valued) d × d matrix A is:
Positive semi-definite: xᵀAx ≥ 0 for all x ∈ R^d.
Positive definite: xᵀAx > 0 for all non-zero vectors x ∈ R^d.
Negative semi-definite (resp. negative definite): xᵀAx ≤ 0 (resp. < 0) for all x ∈ R^d \ {0}.
Positive (or negative) definiteness implies positive (or negative) semi-definiteness.
If the Hessian is positive definite at a, then f attains a local minimum at a (convex).
If the Hessian is negative definite at a, then f attains a local maximum at a (concave).
If the Hessian has both positive and negative eigenvalues, then a is a saddle point for f.
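A small sketch, assuming a numerical central-difference Hessian: classify a critical point by the signs of the Hessian's eigenvalues, here for f(x, y) = x² − y², which has a saddle at the origin.

```python
# Approximate the Hessian numerically and inspect its eigenvalues.
import numpy as np

def hessian(f, theta, eps=1e-5):
    """Central-difference approximation of the Hessian of f at theta."""
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * eps**2)
    return H

f = lambda v: v[0]**2 - v[1]**2            # saddle at the origin
eig = np.linalg.eigvalsh(hessian(f, np.zeros(2)))
print("eigenvalues:", eig)                 # one positive, one negative -> saddle point
```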
