Igor Prünster
Bocconi University
Exchangeability & Prediction
For the Dirichlet process one has
\[
  P[K_n = k] = |s(n,k)|\,\frac{\theta^k}{(\theta)_n},
  \qquad
  E[K_n] = \sum_{i=1}^{n} \frac{\theta}{\theta + i - 1},
\]
where $|s(n,k)|$ is the signless Stirling number of the first kind and $(\theta)_n$ the ascending factorial.
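These formulas are easy to check numerically. A minimal sketch (function names are illustrative, not from the slides): the exact mean is compared with a Monte Carlo estimate obtained by sequential sampling, using the fact that under the DP a new value appears at step $i+1$ with probability $\theta/(\theta+i)$.

```python
import random

def expected_clusters(theta, n):
    """Exact E[K_n] = sum_{i=1}^n theta/(theta + i - 1) for the Dirichlet process."""
    return sum(theta / (theta + i - 1) for i in range(1, n + 1))

def sample_Kn(theta, n, rng):
    """Draw K_n sequentially: a new value appears at step i+1 w.p. theta/(theta + i)."""
    return sum(rng.random() < theta / (theta + i) for i in range(n))

theta, n = 1.0, 1000
rng = random.Random(0)
mc = sum(sample_Kn(theta, n, rng) for _ in range(2000)) / 2000
print(expected_clusters(theta, n), mc)  # both of order log n (harmonic number for theta = 1)
```

For $\theta = 1$ the exact value is the harmonic number $H_n \approx \log n + 0.577$, making the logarithmic growth of $K_n$ explicit.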
\[
  K_m^{(n)} := K_{m+n} - K_n,
\]
the number of new distinct values generated by an additional sample of size $m$.
\[
  \frac{K_n}{\log n} \xrightarrow{\ \text{a.s.}\ } \theta \qquad (n \to \infty)
\]
Remark: Ideally one would like a model with a flexible growth rate that depends on the model parameters (e.g. $n^\sigma$ with parameter $\sigma \in (0,1)$). Instead, the DP displays logarithmic growth, and this behaviour is unaffected by the sample $X^{(n)}$ ⇒ highly restrictive and inappropriate in applications [e.g. linguistics (Teh, 2006), certain bipartite graphs (Caron, 2012), network models (Caron & Fox, 2017) and species sampling].
Can such a model really be considered nonparametric? What is responsible for such restrictive behaviour?
It all boils down to $P[X_{n+1} = \text{``new''} \mid X^{(n)}] = \theta/(\theta+n)$, which is independent of the sample's composition!
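The point can be made concrete in one line (names illustrative): the DP discovery probability is a function of $\theta$ and $n$ alone, so two samples of the same size with very different numbers of distinct values receive the same estimate.

```python
def dp_new_prob(theta, n):
    """DP probability that observation n+1 is new: depends only on theta and n,
    never on the number of distinct values K_n actually observed."""
    return theta / (theta + n)

# a sample of size 100 with 90 distinct values and one with 3 distinct values
# get exactly the same discovery probability:
print(dp_new_prob(5.0, 100))  # 5/105 in both cases
```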
Gibbs–type priors & the Pitman–Yor process
Fundamental Characterization:
Based on (∗), discrete $\tilde P$ can be classified into 3 categories [De Blasi et al., 2015]:
(a) $P[X_{n+1} = \text{``new''} \mid X^{(n)}] = f(n, \text{model parameters})$
⟺ depends on $n$, but not on $K_n$ and $N_n = (N_1, \dots, N_{K_n})$
⟺ Dirichlet process;
(b) $P[X_{n+1} = \text{``new''} \mid X^{(n)}] = f(n, K_n, \text{model parameters})$
⟺ depends on $n$ and $K_n$, but not on $N_n = (N_1, \dots, N_{K_n})$
⟺ Gibbs–type priors [Gnedin & Pitman, 2006];
(c) $P[X_{n+1} = \text{``new''} \mid X^{(n)}] = f(n, K_n, N_n, \text{model parameters})$
⟺ depends on all the information conveyed by the sample, i.e. $n$, $K_n$ and $N_n = (N_1, \dots, N_{K_n})$
⟺ serious tractability issues.
=⇒ Let’s go for the intermediate case which exhibits a richer predictive
structure than the DP!
Gibbs–type priors
\[
  P[X_{n+1} \in \cdot \mid X^{(n)}]
  = \underbrace{\frac{V_{n+1,K_n+1}}{V_{n,K_n}}}_{P[X_{n+1}=\text{``new''}\,\mid\,X^{(n)}]}
    \underbrace{P^*(\,\cdot\,)}_{\text{prior guess}}
  + \underbrace{\left(1 - \frac{V_{n+1,K_n+1}}{V_{n,K_n}}\right)}_{P[X_{n+1}=\text{``old''}\,\mid\,X^{(n)}]}
    \underbrace{\sum_{i=1}^{K_n} \frac{N_i - \sigma}{n - \sigma K_n}\,\delta_{X_i^*}(\,\cdot\,)}_{\text{weighted empirical measure}}
\]
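The predictive weights can be computed directly from the $V_{n,k}$'s. A short sketch under the Pitman–Yor specification $V_{n,k} = \prod_{i=0}^{k-1}(\theta + i\sigma)/(\theta)_n$ (function names are illustrative):

```python
import math

def py_V(n, k, sigma, theta):
    """Pitman-Yor Gibbs-type weights: prod_{i=0}^{k-1}(theta + i*sigma) / (theta)_n."""
    return math.prod(theta + i * sigma for i in range(k)) / math.prod(theta + i for i in range(n))

def gibbs_new_prob(n, k, V):
    """P[X_{n+1} = "new" | X^(n)] = V_{n+1,k+1} / V_{n,k} for a Gibbs-type prior."""
    return V(n + 1, k + 1) / V(n, k)

sigma, theta = 0.5, 1.0
p = gibbs_new_prob(10, 4, lambda n, k: py_V(n, k, sigma, theta))
print(p, (theta + sigma * 4) / (theta + 10))  # for PY the ratio simplifies to (theta + sigma*k)/(theta + n)
```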
where β > 0, σ ∈ (0, 1) and Γ(x, a) denotes the incomplete gamma function. For σ = 1/2 it reduces to the normalized inverse-Gaussian process (N–IG) [Lijoi, Mena & P., 2005, 2007b].
⇒ In the following we restrict attention to the PY process, but most results carry over (with minor modifications) to general Gibbs–type priors.
▶ One has
\[
  \frac{K_n}{n^\sigma} \xrightarrow{\ \text{a.s.}\ } Y_{\theta/\sigma} \qquad (n \to \infty),
\]
where $Y_q$, with $q \ge 0$, is a suitably generalized Mittag–Leffler random variable [Pitman, 2006].
▶ Conditional on $X^{(n)}$ featuring $K_n = j$ distinct values, as the size of the additional sample $m$ diverges,
\[
  \frac{K_m^{(n)}}{m^\sigma} \,\Big|\, X^{(n)} \xrightarrow{\ \text{a.s.}\ } Z_{n,j} \qquad (m \to \infty),
\]
where $Z_{n,j} \overset{d}{=} B_{j+\theta/\sigma,\,n/\sigma-j}\, Y_{(\theta+n)/\sigma}$, with $B_{a,b}$ a beta random variable independent of $Y_q$ [Favaro, Lijoi, Mena & P., 2009].
⇒ The DP's restrictive logarithmic growth of $K_n$ has been overcome by allowing a richer predictive structure: the growth rate now depends on the model parameter $\sigma$.
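The $n^\sigma$ growth can be seen by simulating the PY sequence via its predictive distribution, tracking only cluster sizes (a rough sketch; names illustrative):

```python
import random

def py_Kn(sigma, theta, n, rng):
    """Sample K_n from a PY(sigma, theta) sequence: at step i+1 a new cluster
    appears w.p. (theta + sigma*k)/(theta + i); otherwise cluster j is chosen
    w.p. proportional to N_j - sigma."""
    counts = []
    for i in range(n):
        k = len(counts)
        if rng.random() < (theta + sigma * k) / (theta + i):
            counts.append(1)
        else:
            r = rng.random() * (i - sigma * k)
            acc = 0.0
            for j, c in enumerate(counts):
                acc += c - sigma
                if r < acc:
                    counts[j] += 1
                    break
            else:  # guard against floating-point rounding at the boundary
                counts[-1] += 1
    return len(counts)

rng = random.Random(1)
sigma, theta = 0.5, 1.0
for n in (1000, 4000):
    print(n, py_Kn(sigma, theta, n, rng) / n ** sigma)  # ratio stabilizes around a random limit
```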
Species sampling
Here $m_r$ denotes the number of species appearing exactly $r$ times: e.g. $m_1 = 1$, $m_2 = 2$, $m_5 = 1$ means that 1 species appears once, 2 appear twice and 1 five times.
One–step Prediction
▶ Discovery probability estimation: given a basic sample $X^{(n)}$, estimate the probability of discovering at the $(n+1)$-th sampling step either a new species or an "old" species with frequency $r$.
▶ Turing estimator [Good, 1953; Mao & Lindsay, 2002]: the probability of discovering at the $(n+1)$-th step a new species is estimated by
\[
  \frac{m_1}{n},
\]
and that of a species with frequency $r$ in $X^{(n)}$ by
\[
  (r+1)\,\frac{m_{r+1}}{n}.
\]
▶ Problem: $m_{r+1}$ is used to estimate the probability of discovering a species with frequency $r$ ⇒ counterintuitive! It should be based on $m_r$.
E.g. if $m_5 = 10$ and $m_6 = 0$, the estimated probability of detecting a species with frequency 5 would be 0 ⇒ the problem is bypassed by the use of "smoothing functions" but, in I.J. Good's words, it seems like an "adhockery"!
Origin of the problem? In a frequentist nonparametric setup there is no natural quantity to use for estimating the probability of discovering a new species, and so $m_1$ is used. Hence, the discovery probability of species with frequency 1 uses $m_2$, and so on.
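The Turing estimators, and the counterintuitive role of $m_{r+1}$, fit in a few lines (an illustrative sketch, not from the slides):

```python
from collections import Counter

def frequency_counts(sample):
    """m_r = number of species observed exactly r times in the sample."""
    return Counter(Counter(sample).values())

def turing_new(m, n):
    """Turing estimate of the probability that draw n+1 is a new species: m_1/n."""
    return m.get(1, 0) / n

def turing_freq(m, n, r):
    """Turing estimate for discovering a species of frequency r: (r+1) m_{r+1} / n.
    Note the use of m_{r+1}, not m_r -- the counterintuitive feature noted above."""
    return (r + 1) * m.get(r + 1, 0) / n

sample = list("aabbbccccddddd")  # 4 species with frequencies 2, 3, 4, 5; n = 14
m, n = frequency_counts(sample), len(sample)
print(turing_new(m, n))          # 0.0: no singletons, so "new" gets probability 0
print(turing_freq(m, n, 5))      # 0.0 although a species of frequency 5 exists (m_5 = 1)
```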
\[
  P[X_{n+1} = \text{``new''} \mid X^{(n)}] = \frac{\theta + \sigma j}{\theta + n}.
\]
\[
  P[X_{n+m+1} = \text{``new''} \mid X^{(n)}] = \frac{\theta + k\sigma}{\theta + n}\,\frac{(\theta + n + \sigma)_m}{(\theta + n + 1)_m}.
\]
[Figure] EST data from Naegleria gruberi aerobic and anaerobic cDNA libraries with basic sample $n \cong 950$: Good–Toulmin (GT), DP and PY process estimators of the probability of discovering a new gene at the $(n+m+1)$-th sampling step, for $m = 1, \dots, 2000$.
[Figure] EST data from Naegleria gruberi aerobic and anaerobic cDNA libraries with basic sample $n \cong 950$: Good–Toulmin (GT), DP and PY process estimators of the number of new genes to be observed in an additional sample of size $m = 1, \dots, 2000$.
Frequentist Consistency
If the data generating distribution $P_0$ is discrete, then
\[
  P[X_{n+1} \in \cdot \mid X^{(n)}]
  = \underbrace{\frac{\theta + \sigma\kappa_n}{\theta + n}}_{\to\,0} P^*(\,\cdot\,)
  + \underbrace{\frac{n - \sigma\kappa_n}{\theta + n}}_{\to\,1}
    \sum_{i=1}^{\kappa_n} \underbrace{\frac{N_i - \sigma}{n - \sigma\kappa_n}}_{\sim\,N_i/n} \delta_{X_i^*}(\,\cdot\,)
  \;\longrightarrow\; P_0(\,\cdot\,).
\]
If instead $P_0$ is diffuse, all observations are distinct ($\kappa_n = n$, $N_i = 1$) and
\[
  P[X_{n+1} \in \cdot \mid X^{(n)}]
  = \underbrace{\frac{\theta + \sigma n}{\theta + n}}_{\to\,\sigma} P^*(\,\cdot\,)
  + \underbrace{\frac{n - \sigma n}{\theta + n}}_{\to\,1-\sigma}
    \sum_{i=1}^{n} \underbrace{\frac{1 - \sigma}{n - \sigma n}}_{=\,1/n} \delta_{X_i}(\,\cdot\,)
  \;\longrightarrow\; \sigma P^*(\,\cdot\,) + (1 - \sigma) P_0(\,\cdot\,).
\]
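The limiting weights are easy to verify numerically; in the diffuse case ($\kappa_n = n$) the mass on the prior guess $P^*$ tends to $\sigma$ (a quick sketch; names illustrative):

```python
def py_predictive_weights(theta, sigma, n, k):
    """Mass the PY predictive puts on the prior guess P* and on the
    (weighted) empirical part, given n observations with k distinct values."""
    w_star = (theta + sigma * k) / (theta + n)
    return w_star, 1.0 - w_star

theta, sigma = 1.0, 0.25
for n in (10, 100, 10000):
    # diffuse P0: every observation is distinct, so k = n
    print(n, py_predictive_weights(theta, sigma, n, n))  # weights approach (sigma, 1 - sigma)
```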
Moreover, in both cases $\mathrm{Var}[\tilde P \mid X^{(n)}] \xrightarrow{\ n \to \infty\ } 0$ a.s.–$P_0^\infty$, and hence:
▶ The PY process is consistent for discrete data generating P0 (e.g. species
sampling problems);
▶ The PY process is inconsistent for diffuse data generating P0 .
Similar results hold for general Gibbs–type priors [De Blasi, Lijoi and P, 2013].
Should one worry about the inconsistency result in the case of diffuse P0 ? NO!
Discrete nonparametric priors are designed to model discrete distributions and
should not be used to model data from continuous distributions. In fact, as the
sample size n diverges:
▶ P0 generates (Xn )n≥1 containing no ties with probability 1;
▶ a discrete de Finetti measure generates (Xn )n≥1 containing no ties with
probability 0
=⇒ model and data generating mechanism are incompatible!
⇒ There are no good or bad priors, but rather models which are suitable/compatible/designed for a certain phenomenon and not for others!
Some References
• Good & Toulmin (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43, 45–63.
• Good (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40, 237–264.
• Korwar & Hollander (1973). Contributions to the theory of Dirichlet processes. Ann. Probab. 1, 705–711.
• Lijoi, Mena & Prünster (2007a). Bayesian nonparametric estimation of the probability of discovering new species. Biometrika 94, 769–786.
• Lijoi, Mena & Prünster (2007b). Controlling the reinforcement in Bayesian nonparametric mixture models. J. Roy. Statist. Soc. Ser. B 69, 715–740.
• Lo (1991). A characterization of the Dirichlet process. Statist. Probab. Lett. 12, 185–187.
• Mao (2004). Prediction of the conditional probability of discovering a new class. J. Am. Statist. Assoc. 99, 1108–1118.
• Mao & Lindsay (2002). A Poisson model for the coverage problem with a genomic application. Biometrika 89, 669–681.
• Pitman & Yor (1997). The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25, 855–900.
• Pitman (2006). Combinatorial Stochastic Processes. Lecture Notes in Math., vol. 1875, Springer, Berlin.
• Regazzini (1978). Intorno ad alcune questioni relative alla definizione del premio secondo la teoria della credibilità. Giorn. Istit. Ital. Attuari 41, 77–89.
• Teh (2006). A hierarchical Bayesian language model based on Pitman–Yor processes. Coling/ACL 2006, 985–992.