ORF 524: Statistical Modeling – J. Fan
X_i = µ + ε_i
(i) ε_i is independent of µ.
Drug A:  random outcomes X_1, ..., X_m;  realizations x_1, ..., x_m
Drug B:  random outcomes Y_1, ..., Y_n;  realizations y_1, ..., y_n
— In Example 1:
P_θ(x) = \binom{Nθ}{x} \binom{N − Nθ}{n − x} \Big/ \binom{N}{n},
where Θ = {0, 1/N, ..., N/N} or [0, 1].
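As a quick numerical illustration, the sketch below evaluates this hypergeometric likelihood for a few values of θ using scipy; the lot size N, sample size n, and observed count x are illustrative choices rather than values from the notes.

```python
# Sketch: evaluating P_theta(x) in Example 1 for a few values of theta.
# N (lot size), n (sample size), and x (observed defectives) are illustrative choices.
from scipy.stats import hypergeom

N, n, x = 100, 10, 2          # lot size, sample size, observed defectives
for theta in [0.05, 0.10, 0.20]:
    D = int(N * theta)        # number of defectives in the lot, N*theta
    # hypergeom.pmf(x, M, n, N) uses M = population size, n = #defectives, N = draws
    p = hypergeom.pmf(x, N, D, n)
    print(f"theta = {theta:.2f}:  P_theta(X = {x}) = {p:.4f}")
```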
— In Example 2:
P_θ(x) = \prod_{i=1}^n σ^{-1} φ((x_i − µ)/σ),
where φ(·) is the normal density and Θ = {(µ, σ): µ > 0, σ > 0}.
— In Example 3:
P_θ(x) = \prod_{i=1}^m σ_A^{-1} φ((x_i − µ_A)/σ_A) \prod_{i=1}^n σ_B^{-1} φ((y_i − µ_B)/σ_B),
where φ(·) is the normal density and Θ = {(µ_A, µ_B, σ_A, σ_B): µ_A, µ_B, σ_A, σ_B > 0}.
assuming that {ε_i, i = 1, ..., n} are i.i.d. random variables with density f. Then P_θ(x) = \prod_{i=1}^n f(x_i − µ) with θ = (µ, f). Since no form of f has been imposed, i.e. f has not been parameterized, the parameter space Θ is called nonparametric or semiparametric.
Basic assumption: Throughout this class, we will assume that
(ii) Discrete variables: all P_θ are discrete with frequency functions p(x, θ). Further, there exists a set {x_1, x_2, ...} such that \sum_{i=1}^∞ p(x_i, θ) = 1, where the points x_i do not depend on θ.
We wish to study the association between Y and X_1, ..., X_p. How do we predict Y based on X? Is there any gender discrimination? (Note: the data x in the general formulation now include all {(x_{i1}, ..., x_{ip}, y_i)}_{i=1}^n.) Write
Y = µ(X_1, ..., X_5) + ε,
where ε is the part that cannot be explained by X. Thus the parameter space is Θ = {(β_0, β_1, ..., β_5, G)}.
Modeling: The data are thought of as a realization from (Y, X_1, ..., X_5) with the relationship between X and Y described above.
As this example shows, the model is a convenient assumption made by data analysts. Indeed, statistical models are frequently useful fictions. There are trade-offs in the choice of a statistical model:
larger model ⇒ smaller model bias, but larger estimation variance.
The decision also depends on the available sample size n.
Statistics: functions of the data only, e.g.
\bar{X} = (X_1 + ··· + X_n)/n,   X_1,   \sqrt{X_1^2 + X_2^2 + X_3^2 + 3},
but
X_1 + σ,   \bar{X} + µ
are not.
— defective rate is 1%
P(θ = i/N) = π_i,   i = 1, 2, ..., N.
This serves as a prior distribution. The defective rate θ_0 of the current lot is thought of as a realization from π(θ). Given θ_0,
P(X = x | θ_0) = \binom{Nθ_0}{x} \binom{N − Nθ_0}{n − x} \Big/ \binom{N}{n}.
Basic elements of Bayesian models
(iii) Given θ, the observed data x are a realization of pθ . The joint density of (θ, X)
is π(θ)p(x|θ).
(iv) The goal of the Bayesian analysis is to modify the prior of θ after observing x:
π(θ | X = x) =
  π(θ)p(x|θ) \Big/ \int π(θ)p(x|θ)\, dθ,      θ continuous,
  π(θ)p(x|θ) \Big/ \sum_θ π(θ)p(x|θ),         θ discrete,
e.g. summarizing the distribution by posterior mean, median and SD, etc.
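A minimal sketch of the discrete branch of this formula, with an illustrative three-point prior and a binomial likelihood (both chosen here purely for illustration):

```python
# Sketch: the discrete case of pi(theta | x) = pi(theta) p(x|theta) / sum_theta pi(theta) p(x|theta).
# The support, prior weights, and binomial likelihood below are illustrative choices.
import numpy as np
from scipy.stats import binom

thetas = np.array([0.05, 0.10, 0.20])      # possible parameter values
prior  = np.array([0.30, 0.50, 0.20])      # pi(theta), sums to one
n, x   = 20, 4                             # observed data: 4 successes in 20 trials

likelihood = binom.pmf(x, n, thetas)       # p(x | theta) for each theta
posterior  = prior * likelihood
posterior /= posterior.sum()               # normalize by sum_theta pi(theta) p(x|theta)

post_mean = np.sum(thetas * posterior)     # posterior mean, one way to summarize
print(posterior, post_mean)
```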
Example 5 (Quality inspection). Suppose that, from past experience, the defective rate is about 10%. Suppose that a lot consists of 100 products, whose qualities are independent of each other.
Now suppose that n = 19 products are sampled and x = 10 are found defective. Then
π(θ_i | X = 10) = \frac{P(θ = θ_i, X = 10)}{P(X = 10)} = \frac{π(θ_i)\, P(X = 10 | θ = θ_i)}{\sum_j π(θ_j)\, P(X = 10 | θ = θ_j)}.
(Here 100θ − X is the number of defectives left after the 19 draws, which has a Binomial(81, 0.1) distribution.) Compared with the prior probability
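The sketch below carries out this posterior computation numerically, assuming (as the text suggests) a Binomial(100, 0.1) prior on the number of defectives 100θ and hypergeometric sampling of the n = 19 items; the exact prior form is an assumption made for illustration.

```python
# Sketch of the posterior in Example 5: 100*theta ~ Binomial(100, 0.1) a priori
# (10% historical defective rate, independent items), and given theta the count X
# among n = 19 sampled items is hypergeometric.  Numbers follow the text above.
import numpy as np
from scipy.stats import binom, hypergeom

N, n, x = 100, 19, 10
D = np.arange(N + 1)                       # possible numbers of defectives, D = 100*theta
prior = binom.pmf(D, N, 0.1)               # pi(theta_i)
like  = hypergeom.pmf(x, N, D, n)          # P(X = 10 | theta = theta_i)
post  = prior * like
post /= post.sum()                         # pi(theta_i | X = 10)

print("posterior mean of theta:", np.sum(D / N * post))
print("P(theta >= 0.2 | X = 10):", post[D >= 20].sum())
```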
Example 6. Suppose that X_1, ..., X_n are i.i.d. Bernoulli(θ) random variables and θ has a prior distribution π(θ). Then
π(θ | x) = \frac{π(θ)\, θ^{\sum_{i=1}^n x_i} (1 − θ)^{\,n − \sum_{i=1}^n x_i}}{\int_0^1 π(t)\, t^{\sum_{i=1}^n x_i} (1 − t)^{\,n − \sum_{i=1}^n x_i}\, dt}.
Figure 1.8: Beta distributions with shape parameters: Left panel: (4, 10), (5, 2), (2, 5), (.7, 3); right panel: (5, 5), (2, 2), (1, 1), (0.5, 0.5)
If the prior is θ ~ Beta(s, r), then
π(θ | x) ∝ θ^{\,s + \sum x_i − 1} (1 − θ)^{\,n − \sum x_i + r − 1},   i.e.   θ | x ~ Beta\big(s + \sum x_i,\; n − \sum x_i + r\big).
Thus,
E(θ | x) = \frac{s + \sum_{i=1}^n x_i}{n + s + r} = \frac{\sum_{i=1}^n x_i + 1}{n + 2} when s = r = 1, and ≈ n^{-1} \sum_{i=1}^n x_i when n is large.
Conjugate prior: Note that the prior and the posterior in this example belong to the same family. Such a prior is called a "conjugate prior"; it was introduced to facilitate computation.
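A short sketch of this conjugate update on simulated Bernoulli data; the true θ, the sample size, and the choice s = r = 1 are illustrative.

```python
# Sketch of the conjugate Beta update: prior Beta(s, r), data x_1..x_n Bernoulli(theta),
# posterior Beta(s + sum x_i, r + n - sum x_i).  The data and (s, r) are illustrative.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
theta_true, n = 0.3, 50
x = rng.binomial(1, theta_true, size=n)    # simulated Bernoulli data

s, r = 1.0, 1.0                            # uniform prior, Beta(1, 1)
s_post, r_post = s + x.sum(), r + n - x.sum()

post_mean = s_post / (s_post + r_post)     # (s + sum x_i) / (n + s + r)
print("posterior:", f"Beta({s_post}, {r_post})")
print("posterior mean:", post_mean, " sample mean:", x.mean())
print("95% credible interval:", beta.ppf([0.025, 0.975], s_post, r_post))
```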
1.3 Sufficiency
Purpose:
1. to simplify the probability structure (a sufficient statistic is less obscure than the whole data set);
2. to understand whether the reduction entails any loss of information;
3. to provide useful technical tools.
Example 7 (continued). The conditional distribution of X given \sum_{i=1}^n X_i is
P_θ\{X = x \,|\, \sum_{i=1}^n X_i = s\}
  = 0   if \sum x_i ≠ s,
  = \frac{P_θ(X = x, \sum_{i=1}^n X_i = s)}{P_θ(\sum_{i=1}^n X_i = s)} = \frac{θ^s (1 − θ)^{n−s}}{\binom{n}{s} θ^s (1 − θ)^{n−s}} = \frac{1}{\binom{n}{s}}   otherwise.
Obviously, this conditional distribution is independent of θ. Thus, \sum_{i=1}^n X_i is sufficient.
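A small simulation can illustrate this: conditioning on \sum X_i = s, the distribution of the full vector X is the same for different values of θ (each arrangement has probability 1/\binom{n}{s}). The settings below are illustrative.

```python
# Sketch: simulate Bernoulli(theta) vectors, condition on sum = s, and check that the
# conditional distribution of the full vector looks the same for different theta
# (each arrangement should have probability 1 / C(n, s)).  Settings are illustrative.
import numpy as np
from math import comb

def conditional_freqs(theta, n=4, s=2, reps=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, theta, size=(reps, n))
    kept = x[x.sum(axis=1) == s]                     # condition on sum X_i = s
    _, counts = np.unique(kept, axis=0, return_counts=True)
    return counts / len(kept)

for theta in (0.2, 0.7):
    freqs = conditional_freqs(theta)
    print(f"theta = {theta}: conditional frequencies ~ {np.round(freqs, 3)}")
print("theoretical value 1/C(4,2) =", 1 / comb(4, 2))
```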
Conversely,
Then,
p(x, y, θ) ∝ \prod_{i=1}^n σ^{-1} \exp\Big(−\frac{(Y_i − α − βX_i)^2}{2σ^2}\Big) f(X_i)
  = \prod_{i=1}^n f(X_i)\, \exp\Big(−n\log σ − \frac{1}{2σ^2} \sum_{i=1}^n [Y_i − α − βX_i]^2\Big)
  = \prod_{i=1}^n f(X_i)\, \exp\Big(−n\log σ − \frac{1}{2σ^2} \sum_{i=1}^n (α + βX_i)^2\Big) \times \exp\Big(−\frac{1}{2σ^2}\Big[\sum_{i=1}^n Y_i^2 − 2α \sum_{i=1}^n Y_i − 2β \sum_{i=1}^n X_iY_i\Big]\Big),
where f(·) is the density function of X. Thus,
T = \Big(\sum_{i=1}^n Y_i^2,\; \sum_{i=1}^n Y_i,\; \sum_{i=1}^n X_iY_i,\; \sum_{i=1}^n X_i,\; \sum_{i=1}^n X_i^2\Big)
is a sufficient statistic. This is equivalent to the fact that
T^* = (\bar X, \bar Y, \hat σ_X^2, \hat σ_Y^2, r)
is a sufficient statistic.
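As an illustration, the sketch below computes T and T^* on simulated data and checks that T can be recovered from T^* (together with n); the simulated regression model is an illustrative choice.

```python
# Sketch: the sufficient statistic T and the equivalent summary T* = (Xbar, Ybar,
# sigma^2_X, sigma^2_Y, r) for the regression example; the simulated data are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)

T = (np.sum(Y**2), np.sum(Y), np.sum(X * Y), np.sum(X), np.sum(X**2))

Xbar, Ybar = X.mean(), Y.mean()
s2X, s2Y = X.var(), Y.var()                  # 1/n * sum (X_i - Xbar)^2, etc.
r = np.corrcoef(X, Y)[0, 1]                  # sample correlation
T_star = (Xbar, Ybar, s2X, s2Y, r)

# T can be recovered from T* (together with n), e.g. sum X_i Y_i:
sum_XY = n * (r * np.sqrt(s2X * s2Y) + Xbar * Ybar)
print(np.allclose(sum_XY, T[2]))             # True: the two summaries carry the same information
```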
Sufficiency Principle: Suppose that T(X) is sufficient. For any decision rule δ(X), we can find a decision rule δ*(T(X)), depending on T(X) and δ(X), such that
R(θ, δ) = R(θ, δ*) for all θ,
where R(θ, δ) = E_θ ℓ(θ, δ(X)) is the expected loss, i.e. the risk function. Namely, restricting attention to decision rules based on sufficient statistics is good enough for making statistical decisions.
Proof. For better understanding, let us first assume that ℓ(θ, a) is convex in a. Then, let δ*(T) = E{δ(X) | T(X)}. By Jensen's inequality,
ℓ(θ, δ*(T)) = ℓ(θ, E{δ(X) | T}) ≤ E{ℓ(θ, δ(X)) | T},
and taking expectations of both sides gives R(θ, δ*) ≤ R(θ, δ).
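The following sketch illustrates this Rao-Blackwell-type step under squared-error loss for Bernoulli data, comparing δ(X) = X_1 with δ*(T) = E(X_1 | \sum X_i) = \bar X; the parameter values are illustrative.

```python
# Illustration of the Jensen step under squared-error loss: start from delta(X) = X_1
# and Rao-Blackwellize to delta*(T) = E(X_1 | sum X_i) = Xbar.  Monte Carlo settings
# are illustrative; the risk of delta* is never larger than that of delta.
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 10, 200_000
x = rng.binomial(1, theta, size=(reps, n))

delta      = x[:, 0]              # uses only the first observation
delta_star = x.mean(axis=1)       # E(X_1 | T) = T/n = Xbar

def risk(d):
    return np.mean((d - theta) ** 2)

print("R(theta, delta)  ~", risk(delta))        # roughly theta*(1-theta)
print("R(theta, delta*) ~", risk(delta_star))   # roughly theta*(1-theta)/n
```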
Example 11. Suppose X_1, X_2, ..., X_n ~ i.i.d. N(µ, σ^2), e.g. measurements of temperature.

  data (in °C)                 data (in °F / unnamed scale)
  x_1                          ax_1 + b
  x_2                          ax_2 + b
  ...                          ...
  x_n                          ax_n + b
  µ̂:  T(x_1, x_2, ..., x_n)    T(ax_1 + b, ax_2 + b, ..., ax_n + b)

Estimate of µ: T(X_1, X_2, ..., X_n) in °C corresponds to aT(X_1, X_2, ..., X_n) + b in °F.
Hope: T(ax_1 + b, ax_2 + b, ..., ax_n + b) = aT(x_1, x_2, ..., x_n) + b.
Equivariance: Such an estimator is called equivariant under linear transformations.
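A quick numerical check of this property: the sample mean satisfies T(ax + b) = aT(x) + b, while a statistic such as the mean of squares does not. The data and constants a, b below are illustrative.

```python
# Quick check of equivariance under x -> a*x + b: the sample mean satisfies
# T(a x + b) = a T(x) + b, while e.g. the mean of squares does not.
# The data and the constants a, b are illustrative.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=20.0, scale=2.0, size=50)     # temperatures in "degrees C"
a, b = 9 / 5, 32.0                               # Celsius -> Fahrenheit

T = np.mean
print(np.isclose(T(a * x + b), a * T(x) + b))    # True: the mean is equivariant

S = lambda z: np.mean(z ** 2)
print(np.isclose(S(a * x + b), a * S(x) + b))    # False: mean of squares is not
```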
If we are interested in σ, we hope instead that T(ax_1 + b, ..., ax_n + b) = aT(x_1, ..., x_n).
Consider estimators of µ of the form T(\bar X, S). Location equivariance gives T(\bar X, S) = \bar X + T^*(S), and scale equivariance gives
a\bar X + T^*(aS) = a[\bar X + T^*(S)]
  ⟹ T^*(aS) = aT^*(S)
  ⟹ T^*(S) = S\,T^*(1).
The risk of such an estimator is
E[\bar X + T^*(S) − µ]^2 = T^*(1)^2 (ES)^2 + T^*(1)^2 \mathrm{var}(S) + σ^2/n.
Theorem 2 (Kolmogorov) If T(X) is sufficient for θ, then for any prior π(θ), the conditional (posterior) distribution of θ given X depends on X only through T(X); in particular,
E(g(θ) | T) = E(g(θ) | X).
This implies that, given T(X), X and θ are independent.
By setting c(θ) = η, the exponential family can be written in the canonical form
p(x, η) = exp(ηT(x) + d_0(η) + S(x)),
where d_0(η) = d(c^{-1}(η)) when c(θ) is one-to-one. Here
η — canonical (natural) parameter, and
c(·) — canonical link.
Examples of canonical link functions:
  Normal:    c(θ) = θ                 (identity)
  Binomial:  c(θ) = log(θ/(1 − θ))    (logit)
  Poisson:   c(θ) = log θ             (logarithm)
Regeneration properties:
Moreover, E T(X) = −d_0'(η) and var(T(X)) = −d_0''(η). (The function d_0 is concave.)
Similarly, for this example,
η = −θ^k,   d_0(η) = n log θ^k = n log(−η).
Hence,
\sum_{i=1}^n X_i^k — the natural sufficient statistic,
E\Big(\sum_{i=1}^n X_i^k\Big) = −\frac{n}{η} = \frac{n}{θ^k},
\mathrm{var}\Big(\sum_{i=1}^n X_i^k\Big) = \frac{n}{η^2} = \frac{n}{θ^{2k}}.
Direct computation of these moments is more complicated.
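As a sanity check, the sketch below verifies these two moment formulas by Monte Carlo, assuming (for illustration) that the X_i follow a Weibull-type density k θ^k x^{k−1} e^{−(θx)^k}, for which T = \sum X_i^k is the natural statistic.

```python
# Monte Carlo check of the moment formulas above: with X_i Weibull (shape k, scale 1/theta),
# the natural statistic T = sum X_i^k has E T = n/theta^k and var T = n/theta^(2k).
# Sample sizes and parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
theta, k, n, reps = 1.5, 2.0, 25, 100_000

X = rng.weibull(k, size=(reps, n)) / theta       # density k*theta^k x^(k-1) exp(-(theta x)^k)
T = np.sum(X ** k, axis=1)                       # natural sufficient statistic

print("E T   : simulated", T.mean(), " theory", n / theta**k)
print("var T : simulated", T.var(),  " theory", n / theta**(2 * k))
```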
The k-parameter case
A family of distributions {P_θ : θ ∈ Θ} is said to be a k-parameter exponential family if its joint density admits the form
p(x, θ) = exp\Big(\sum_{i=1}^k C_i(θ)T_i(x) + d(θ) + S(x)\Big) = exp\Big(\sum_{i=1}^k η_i T_i(x) + d_0(η) + S(x)\Big).
P(X_i = j) = p_j = \prod_{ℓ=1}^k p_ℓ^{I(j=ℓ)}.
Figure 1.12: Multinomial trial. Each outcome is a k-dimensional unit vector, indicating which category is observed.
\prod_{i=1}^n P(x_i, p) = \prod_{i=1}^n \prod_{ℓ=1}^k p_ℓ^{I(x_i=ℓ)} = \prod_{ℓ=1}^k p_ℓ^{n_ℓ}.
n_ℓ = \sum_{i=1}^n I(x_i = ℓ) — # of times observing ℓ.
Setting α_ℓ = log(p_ℓ/p_k) for ℓ = 1, ..., k − 1,
  ⟹ p_k = \frac{1}{1 + \sum_{j=1}^{k−1} e^{α_j}}.
Hence,
p(x, p) = exp\Big\{\sum_{ℓ=1}^{k−1} n_ℓ α_ℓ − n \log\Big(1 + \sum_{j=1}^{k−1} e^{α_j}\Big)\Big\}.
The variance and covariance matrix of (n_1, ..., n_k) can easily be computed.
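The sketch below computes the counts n_ℓ and the canonical parameters α_ℓ = log(p_ℓ/p_k) for an illustrative choice of cell probabilities, and checks that the canonical form above reproduces \prod_ℓ p_ℓ^{n_ℓ}.

```python
# Sketch: counts n_l, canonical parameters alpha_l = log(p_l / p_k), and a check that
# exp(sum n_l alpha_l - n log(1 + sum e^alpha_j)) equals prod p_l^n_l.
# The cell probabilities and sample size are illustrative.
import numpy as np

rng = np.random.default_rng(5)
p = np.array([0.2, 0.5, 0.3])                   # k = 3 categories
n = 40
counts = rng.multinomial(n, p)                  # (n_1, ..., n_k)

alpha = np.log(p[:-1] / p[-1])                  # canonical parameters, length k-1
log_lik_canonical = counts[:-1] @ alpha - n * np.log(1 + np.exp(alpha).sum())
log_lik_direct    = np.sum(counts * np.log(p))  # log of prod p_l^{n_l}

print(np.isclose(log_lik_canonical, log_lik_direct))   # True
```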
Other Examples: — multivariate normal distributions;
— Dirichlet distribution (multivariate β-distribution), with density
c\, x_1^{β_1−1} \cdots x_p^{β_p−1} (1 − x_1 − \cdots − x_p)^{β_{p+1}−1}.