
Machine Learning II

Part I: An overview of Bayesian Nonparametrics


Predictive approach to BNP

Igor Prünster
Bocconi University

Exchangeability & Prediction


de Finetti’s representation theorem

A sequence of X–valued observations (X_n)_{n≥1} is exchangeable if and only if

    P[X_1 ∈ A_1, . . . , X_n ∈ A_n] = ∫_P ∏_{i=1}^n P(A_i) Q(dP)

for any n ≥ 1, where P is the space of probability measures on X.

Prediction within de Finetti's framework:

▶ Singular (or one–step) prediction

      P[X_{n+1} ∈ A | X^(n)] = ∫_P P(A) Q(dP | X^(n)),

  where the left–hand side is the predictive distribution and Q( · | X^(n)) is the
  posterior distribution; throughout we set X^(n) := (X_1, . . . , X_n).

▶ m–step prediction

      P[X_{n+1} ∈ A_1, . . . , X_{n+m} ∈ A_m | X^(n)] = ∫_P ∏_{i=1}^m P(A_i) Q(dP | X^(n)).


Two key properties of the Dirichlet process

Recall the following key properties of the Dirichlet process:


▶ For the number of distinct values (species, agents, clusters, components,
  etc.) K_n generated by a sample X^(n) we have

      P[K_n = k] = θ^k |s(n, k)| / (θ)_n,        E[K_n] = ∑_{i=1}^n θ/(θ + i − 1),

  with (θ)_n := θ(θ + 1) · · · (θ + n − 1) and |s(n, k)| the signless Stirling
  number of the first kind.
▶ The predictive distributions are of the form

      P[X_{n+1} ∈ · | X^(n)] = θ/(θ + n) · P*( · ) + n/(θ + n) · (1/n) ∑_{i=1}^{K_n} N_i δ_{X_i*}( · ),

  where θ/(θ + n) = P[X_{n+1} = “new” | X^(n)], n/(θ + n) = P[X_{n+1} = “old” | X^(n)],
  P* is the prior guess and (1/n) ∑_{i=1}^{K_n} N_i δ_{X_i*} is the empirical measure.
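To make the predictive rule concrete, here is a minimal Python sketch (not part of the original slides) that simulates a sample from a DP via the sequential predictive scheme above (the Blackwell–MacQueen urn) and compares the simulated number of distinct values with the exact formula for E[K_n]; the standard normal base measure P* and the value θ = 5 are arbitrary illustrative choices.

```python
import numpy as np

def dp_urn_sample(n, theta, base_sampler, rng):
    """Sample X_1,...,X_n sequentially from the DP predictive (Blackwell-MacQueen urn)."""
    values, counts = [], []                      # distinct values X_i* and frequencies N_i
    for m in range(n):                           # m = current sample size
        if rng.random() < theta / (theta + m):   # "new" with probability theta/(theta+m)
            values.append(base_sampler(rng))
            counts.append(1)
        else:                                    # "old" value X_j* with probability N_j/(theta+m)
            j = rng.choice(len(values), p=np.array(counts) / m)
            counts[j] += 1
    return values, counts

theta, n = 5.0, 1000
rng = np.random.default_rng(0)
k_sim = [len(dp_urn_sample(n, theta, lambda r: r.normal(), rng)[0]) for _ in range(100)]
e_kn = sum(theta / (theta + i) for i in range(n))   # E[K_n] = sum_{i=1}^n theta/(theta+i-1)
print(np.mean(k_sim), e_kn)                         # the two should be close
```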


m–step prediction for the Dirichlet process

Given a basic sample X^(n), prediction of various features of an additional
sample X^(m) := (X_{n+1}, . . . , X_{n+m}) is of interest. In particular,

    K_m^(n) := K_{n+m} − K_n

is the number of new species to be recorded in X^(m), given X^(n).


In the DP case [Favaro, P & Walker, 2011] one has

    P[K_m^(n) = j | X^(n)] = θ^j (θ)_n |s(m, j, n)| / (θ)_{n+m},        E[K_m^(n) | X^(n)] = ∑_{i=1}^m θ/(θ + n + i − 1),

with |s(n, k, r)| denoting the non–central version of the signless Stirling number.


=⇒ The only sample information affecting the distribution of K_m^(n) | X^(n) is the
sample size n; it depends neither on K_n nor on N_1, . . . , N_{K_n}!
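A one–line consequence of the formula above, worth making explicit in code (a sketch, with arbitrary numbers): the DP prediction of the number of new species uses only n, m and θ.

```python
def expected_new_species_dp(n, m, theta):
    """E[K_m^(n) | X^(n)] for the Dirichlet process: depends only on n, m and theta."""
    return sum(theta / (theta + n + i) for i in range(m))   # i = 0,...,m-1 plays the role of i-1

# Two samples of the same size n, however different their K_n, yield the same prediction.
print(expected_new_species_dp(n=100, m=50, theta=5.0))
```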


Since P[X_{n+1} = “new” | X^(n)] = θ/(θ + n) > 0 for any sample size n, clearly K_n will
diverge as the sample size n → ∞. But what is its growth rate?
▶ In the unconditional case, by Korwar and Hollander (1973), one has

      K_n / log n  →  θ   a.s.   (n → ∞).

▶ In the conditional case, given a fixed sample X^(n), one has

      K_m^(n) / log m | X^(n)  →  θ   a.s.   (m → ∞).

Remark: Ideally one would like a model with a flexible growth rate which
depends on the model parameters (e.g. n^σ with parameter σ ∈ (0, 1)). Instead,
the DP displays a logarithmic growth and such a behaviour is unaffected by the
sample X^(n) =⇒ highly restrictive and inappropriate in applications [e.g.
linguistics (Teh, 2006), certain bipartite graphs (Caron, 2012), network models
(Caron & Fox, 2017) and species sampling].
Can such a model really be considered nonparametric? What is responsible for
such a restrictive behaviour?

It all boils down to P[X_{n+1} = “new” | X^(n)] = θ/(θ + n)!
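A quick numerical check of the logarithmic rate (only the exact formula for E[K_n] from the earlier slide is used; θ = 5 is an arbitrary choice):

```python
import math

theta = 5.0
for n in (10**2, 10**3, 10**4, 10**5):
    e_kn = sum(theta / (theta + i) for i in range(n))   # E[K_n] = sum_{i=1}^n theta/(theta+i-1)
    print(n, round(e_kn / math.log(n), 3))              # the ratio slowly approaches theta
```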
Gibbs–type priors & the Pitman–Yor process

Probability of discovering a new species


As seen, a key quantity in the analysis of discrete nonparametric priors is the
probability of discovering a new species

P[Xn+1 = “new” | X (n) ]. (∗)

Fundamental Characterization:
Based on (∗), discrete P̃'s are classified into 3 categories [De Blasi et al., 2015]:
(a) P[Xn+1 = “new” | X (n) ] = f (n, model parameters)
⇐⇒ depends on n but not on Kn and Nn = (N1 , . . . , NKn )
⇐⇒ Dirichlet process;
(b) P[Xn+1 = “new” | X (n) ] = f (n, Kn , model parameters)
⇐⇒ depends on n and Kn but not on Nn = (N1 , . . . , NKn )
⇐⇒ Gibbs–type priors [Gnedin & Pitman, 2006];
(c) P[Xn+1 = “new” | X (n) ] = f (n, Kn , Nn , model parameters)
⇐⇒ depends on all information conveyed by the sample i.e. n, Kn and
Nn = (N1 , . . . , NKn )
⇐⇒ serious tractability issues.
=⇒ Let’s go for the intermediate case which exhibits a richer predictive
structure than the DP!

Gibbs–type priors

Q is a Gibbs–type prior of order σ ∈ (−∞, 1) if and only if it gives rise to
predictive distributions of the form

    P[X_{n+1} ∈ · | X^(n)]
      = V_{n+1,K_n+1}/V_{n,K_n} · P*( · )
        + (1 − V_{n+1,K_n+1}/V_{n,K_n}) · ∑_{i=1}^{K_n} (N_i − σ) δ_{X_i*}( · ) / (n − σK_n),

i.e. the prior guess P* weighted by P[X_{n+1} = “new” | X^(n)] plus a weighted
empirical measure weighted by P[X_{n+1} = “old” | X^(n)], where P* is diffuse and
{V_{n,j} : n ≥ 1, 1 ≤ j ≤ n} is a set of weights which satisfy the recursion

    V_{n,j} = (n − jσ) V_{n+1,j} + V_{n+1,j+1}.      (♢)

=⇒ A Gibbs–type prior is completely characterized by the choice of P*, of σ < 1 and
of the set of weights V_{n,j}.
=⇒ Crucially, P[X_{n+1} = “new” | X^(n)] now depends on both the sample size n
and the number of distinct values K_n in the sample.
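As an illustration (a sketch with the V's left abstract, not code from the slides), the one–step Gibbs–type predictive can be written generically once a routine computing V_{n,j} is supplied; the function below also checks the identity implied by the recursion (♢). The usage example plugs in the standard Dirichlet–process weights V_{n,j} = θ^j/(θ)_n, the σ = 0 case.

```python
from math import prod

def gibbs_predictive(n, k_n, counts, sigma, V):
    """One-step Gibbs-type predictive.

    V(n, j) is any routine returning the weight V_{n,j}; counts = (N_1,...,N_{K_n}), sum(counts) = n.
    Returns P[X_{n+1} = "new"] and the probabilities of tying each observed distinct value.
    """
    p_new = V(n + 1, k_n + 1) / V(n, k_n)
    # the recursion (diamond) is equivalent to: 1 - p_new = V_{n+1,K_n} (n - sigma*K_n) / V_{n,K_n}
    assert abs((1.0 - p_new) - V(n + 1, k_n) * (n - sigma * k_n) / V(n, k_n)) < 1e-10
    tie_probs = [(1.0 - p_new) * (N_i - sigma) / (n - sigma * k_n) for N_i in counts]
    return p_new, tie_probs

# usage with the Dirichlet-process weights V_{n,j} = theta^j / (theta)_n  (the sigma = 0 case)
theta = 2.0
V_dp = lambda n, j: theta ** j / prod(theta + i for i in range(n))
print(gibbs_predictive(n=4, k_n=2, counts=[3, 1], sigma=0.0, V=V_dp))   # P[new] = theta/(theta+4)
```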


The Pitman-Yor process & the NGG process


By defining, for σ ∈ [0, 1) and θ > −σ, or σ < 0 and θ = r|σ| with r ∈ N,

    V_{n,k} = ∏_{i=1}^{k−1} (θ + iσ) / (θ + 1)_{n−1},

one obtains the two parameter Poisson–Dirichlet process, aka Pitman–Yor (PY)
process [Pitman & Yor, 1997], which yields

    P[X_{n+1} ∈ · | X^(n)] = (θ + σK_n)/(θ + n) · P*( · )
                            + (n − σK_n)/(θ + n) · ∑_{i=1}^{K_n} (N_i − σ) δ_{X_i*}( · ) / (n − σK_n).

=⇒ if σ = 0, the PY process reduces to the Dirichlet process and (θ + σK_n)/(θ + n)
reduces to θ/(θ + n).
The normalized generalized gamma process (NGG) arises with

    V_{n,k} = (e^β σ^{k−1} / Γ(n)) ∑_{i=0}^{n−1} (n−1 choose i) (−1)^i β^{i/σ} Γ(k − i/σ; β),

where β > 0, σ ∈ (0, 1) and Γ(x; a) denotes the incomplete gamma function. If
σ = 1/2 it reduces to the normalized inverse Gaussian process (N–IG) [Lijoi,
Mena & P, 2005 & 2007b].
=⇒ In the following we restrict attention to the PY process but most results
carry over (with minor modifications) to general Gibbs–type priors
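A short self-contained check (a sketch; θ = 1, σ = 0.5 are arbitrary) that builds the PY weights, verifies the recursion (♢) numerically and recovers P[X_{n+1} = “new” | X^(n)] = (θ + σK_n)/(θ + n):

```python
from math import prod

def V_py(n, k, theta, sigma):
    """Pitman-Yor weights: prod_{i=1}^{k-1}(theta + i*sigma) / (theta + 1)_{n-1}."""
    num = prod(theta + i * sigma for i in range(1, k))
    den = prod(theta + 1 + j for j in range(n - 1))          # rising factorial (theta+1)_{n-1}
    return num / den

theta, sigma = 1.0, 0.5
# the recursion V_{n,j} = (n - j*sigma) V_{n+1,j} + V_{n+1,j+1} holds exactly
for n in range(2, 6):
    for j in range(1, n + 1):
        lhs = V_py(n, j, theta, sigma)
        rhs = (n - j * sigma) * V_py(n + 1, j, theta, sigma) + V_py(n + 1, j + 1, theta, sigma)
        assert abs(lhs - rhs) < 1e-12

# probability of a new value given n observations with K_n = k distinct ones
n, k = 10, 4
print(V_py(n + 1, k + 1, theta, sigma) / V_py(n, k, theta, sigma), (theta + sigma * k) / (theta + n))
```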

A closer look at the predictive structure


The Gibbs structure allows one to view the predictive distribution as the result of two steps:
(1) X_{n+1} is a new species with probability (θ + σK_n)/(θ + n), or an “old” one
    in {X_1*, . . . , X_{K_n}*} with probability (n − σK_n)/(θ + n).
    =⇒ depends on n and K_n but not on the frequencies N_n = (N_1, . . . , N_{K_n})
    =⇒ P[X_{n+1} = “new” | X^(n)] is monotonically increasing or decreasing in
    K_n according to whether σ > 0 or σ < 0, respectively.
(2) (i) Given X_{n+1} is new, it is sampled independently from P*.
    (ii) Given X_{n+1} is a tie, it coincides with X_i* with probability (N_i − σ)/(n − σK_n).
    =⇒ A reinforcement mechanism driven by σ takes place among the “old”
    values. For instance, if N_1 = 2 and N_2 = 1 then

        P[X_{n+1} = X_1* | X^(n)] / P[X_{n+1} = X_2* | X^(n)] = (2 − σ)/(1 − σ),

    which is < 2 if σ < 0, = 2 if σ = 0 and > 2 if σ > 0; e.g. it equals 1.5 for
    σ = −1, 2 for σ = 0 and 3 for σ = 0.5.

The number of clusters generated by the PY process


The (prior) distribution of the number of clusters [Pitman, 2006] is given by

    P(K_n = j) = ∏_{i=1}^{j−1} (θ + iσ) / ((θ + 1)_{n−1} σ^j) · C(n, j; σ),

with C(n, j; σ) = (1/j!) ∑_{i=0}^{j} (−1)^i (j choose i) (−iσ)_n a generalized factorial
coefficient.

In general, the dependence of the distribution of Kn on the prior parameters is:


▶ σ controls the “ flatness ” (or variability) of the (prior) distribution of Kn .
▶ θ controls the location of the (prior) distribution of K_n.

Recall that K_m^(n) is the number of new species to be recorded in the additional
sample of size m, given X^(n) featuring K_n = j distinct values. One has that
    P[K_m^(n) = k | X^(n)] = (θ + 1)_{n−1} ∏_{i=j}^{j+k−1} (θ + iσ) / ((θ + 1)_{n+m−1} σ^k) · C(m, k; σ, −n + jσ),

and also that the expected number of new species is

    E[K_m^(n) | X^(n)] = (j + θ/σ) [ (θ + n + σ)_m / (θ + n)_m − 1 ].
See Lijoi, Mena & P (2007a) and Favaro, Lijoi, Mena & P (2009).
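The following sketch is a direct, unoptimized transcription of the displayed formulas (the generalized factorial coefficients are computed via the alternating sum, which becomes numerically delicate for large n; the parameter values are arbitrary):

```python
from math import comb, prod

def rising(x, n):
    """Rising factorial (x)_n = x (x+1) ... (x+n-1)."""
    return prod(x + i for i in range(n))

def C_gen(n, j, sigma):
    """Generalized factorial coefficient C(n, j; sigma)."""
    s = sum((-1) ** i * comb(j, i) * rising(-i * sigma, n) for i in range(j + 1))
    return s / prod(range(1, j + 1))                 # divide by j!

def py_prior_Kn(n, j, theta, sigma):
    """P(K_n = j) under a PY(theta, sigma) prior."""
    num = prod(theta + i * sigma for i in range(1, j))
    return num / (rising(theta + 1, n - 1) * sigma ** j) * C_gen(n, j, sigma)

def py_expected_new(n, j, m, theta, sigma):
    """E[K_m^(n) | X^(n)] given K_n = j observed distinct species."""
    return (j + theta / sigma) * (rising(theta + n + sigma, m) / rising(theta + n, m) - 1.0)

theta, sigma, n = 1.0, 0.5, 20
print(sum(py_prior_Kn(n, j, theta, sigma) for j in range(1, n + 1)))   # ~1, up to rounding
print(py_expected_new(n=50, j=10, m=100, theta=theta, sigma=sigma))
```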

Prior distribution of the number of clusters as σ varies

[Figure: prior distributions of the number of clusters K_n corresponding to the PY
process with n = 50, θ = 1 and σ = 0.2, 0.3, . . . , 0.8.]


Asymptotics for the number of clusters in the PY case


▶ If σ < 0 and θ = r|σ|, then K_n → r a.s. as n → ∞. Also, conditionally on
  X^(n) featuring K_n = j distinct values, K_m^(n) | X^(n) → r − j a.s. as m → ∞.

▶ If σ ∈ (0, 1), one has

      K_n / n^σ  →  Y_{θ/σ}   a.s.   (n → ∞),

  where Y_q, with q ≥ 0, is a suitably generalized Mittag–Leffler random
  variable [Pitman, 2006].
▶ Conditional on X^(n) featuring K_n = j distinct values, as the size m of the
  additional sample diverges,

      K_m^(n) / m^σ | X^(n)  →  Z_{n,j}   a.s.   (m → ∞),

  where Z_{n,j} =_d B_{j+θ/σ, n/σ−j} · Y_{(θ+n)/σ}, with B_{a,b} a beta random variable
  independent of Y_q [Favaro, Lijoi, Mena & P, 2009].
=⇒ In the DP case K_n grows logarithmically: this restriction has been overcome
by allowing a richer predictive structure, and the growth rate now depends on the
model parameter σ.
Species sampling

Data structure in species sampling problems

▶ X^(n) = basic sample of draws from a population containing different
  species (plants, genes, animals, ...). Information:
  ⋄ sample size n and number of distinct species in the sample K_n;
  ⋄ a collection of frequencies N = (N_1, . . . , N_{K_n}) s.t. ∑_{i=1}^{K_n} N_i = n;
  ⋄ the labels (names) X_i* of the distinct species, for i = 1, . . . , K_n.

▶ The information provided by N can also be coded by M := (M_1, . . . , M_n) with

      M_i = number of species in the sample X^(n) having frequency i.

  Note that ∑_{i=1}^n M_i = K_n and ∑_{i=1}^n i M_i = n.

▶ Example: Consider a basic sample such that
  ⋄ n = 10 with j = 4 distinct species and frequencies (n_1, n_2, n_3, n_4) = (2, 5, 2, 1);
  ⋄ equivalently we can code this information as

        (m_1, m_2, . . . , m_10) = (1, 2, 0, 0, 1, . . . , 0),

    meaning that 1 species appears once, 2 appear twice and 1 appears five times.
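The re-coding from frequencies to counts-of-counts is used constantly in what follows; a throwaway helper (the function name is mine, not from the slides) reproduces the example:

```python
from collections import Counter

def counts_of_counts(freqs, n):
    """Map species frequencies (N_1,...,N_k) to M = (M_1,...,M_n), M_i = #species with frequency i."""
    c = Counter(freqs)
    return [c.get(i, 0) for i in range(1, n + 1)]

m = counts_of_counts([2, 5, 2, 1], n=10)
print(m)                                                  # [1, 2, 0, 0, 1, 0, 0, 0, 0, 0]
assert sum(m) == 4 and sum(i * mi for i, mi in enumerate(m, start=1)) == 10
```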


One–step Prediction
▶ Discovery probability estimation: Given a basic sample X (n) , estimate the
probability of discovering at the (n+1)–th sampling step either a new
species or an “ old ” species with frequency r .
▶ Turing estimator [Good, 1953; Mao & Lindsay, 2002]: the probability of
  discovering a new species at the (n+1)–th step is estimated by

      m_1 / n

  and that of a species with frequency r in X^(n) by

      (r + 1) m_{r+1} / n.
▶ Problem: m_{r+1} is used to estimate the probability of discovering a species
  with frequency r =⇒ counterintuitive! It should be based on m_r.
  E.g. if m_5 = 10 and m_6 = 0, the estimated probability of detecting a species
  with frequency 5 would be 0 =⇒ the problem is bypassed by the use of “smoothing
  functions” but, in I.J. Good’s words, it seems like an “adhockery”!
Origin of the problem? In a frequentist nonparametric setup there is no
natural quantity to use for estimating the probability of discovering a new
species and so m1 is used. Hence, the discovery probability of species with
frequency 1 uses m2 and so on.
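A literal transcription of the Turing estimator, applied to the toy data of the previous slide, makes the counterintuitive behaviour visible (the numbers are purely illustrative):

```python
def turing_new(m, n):
    """Turing estimate of the probability of a new species at step n+1: m_1 / n."""
    return m[0] / n

def turing_freq(m, n, r):
    """Turing estimate of the probability of a species with frequency r: (r+1) m_{r+1} / n."""
    return (r + 1) * m[r] / n if r < len(m) else 0.0      # m is 0-indexed, so m[r] = m_{r+1}

m, n = [1, 2, 0, 0, 1, 0, 0, 0, 0, 0], 10
print(turing_new(m, n))            # 0.1
print(turing_freq(m, n, r=4))      # uses m_5 = 1  ->  0.5
print(turing_freq(m, n, r=5))      # uses m_6 = 0  ->  0.0, even though a frequency-5 species exists
```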

BNP approach to discovery probab. [Lijoi, Mena & P, 2007a]

A key advantage of the Bayesian nonparametric approach is that the


predictive structure includes a positive probability of discovering a new
species, value, cluster, agent etc.
Assume the data (Xn )n≥1 are exchangeable with a PY de Finetti measure.
BNP analog of the Turing estimator: given a basic sample X^(n) featuring K_n = j
distinct species with m_1, . . . , m_n s.t. ∑_{i=1}^n i m_i = n:
▶ the probability of discovering a new species is

      P[X_{n+1} = “new” | X^(n)] = (θ + σj)/(θ + n);

▶ the probability of detecting a species with frequency r in X^(n) is

      P[X_{n+1} = “species with frequency r” | X^(n)] = m_r (r − σ)/(θ + n).

=⇒ Probability of sampling a species with frequency r depends, in agreement


with intuition, on mr and also on Kn = j.
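The BNP analogs, again as a minimal sketch on the same toy data (θ = 1, σ = 0.5 are arbitrary, not estimated values):

```python
def bnp_new(n, j, theta, sigma):
    """PY-based probability of a new species at step n+1: (theta + sigma*j) / (theta + n)."""
    return (theta + sigma * j) / (theta + n)

def bnp_freq(m, n, r, theta, sigma):
    """PY-based probability of a species with frequency r at step n+1: m_r (r - sigma) / (theta + n)."""
    return m[r - 1] * (r - sigma) / (theta + n)            # m is 0-indexed, so m[r-1] = m_r

m, n, j = [1, 2, 0, 0, 1, 0, 0, 0, 0, 0], 10, 4
theta, sigma = 1.0, 0.5
print(bnp_new(n, j, theta, sigma))                         # (1 + 0.5*4) / 11
print(bnp_freq(m, n, 5, theta, sigma))                     # based on m_5, as intuition suggests
# the "new" mass and the masses over r = 1,...,n sum to one
assert abs(bnp_new(n, j, theta, sigma)
           + sum(bnp_freq(m, n, r, theta, sigma) for r in range(1, n + 1)) - 1) < 1e-12
```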


More discovery probability estimation problems


Conditionally on a basic sample X (n) , estimate the probability of discovering at
the (n+m+1)–th step either a new species or an “ old ” species with frequency
r without observing the additional sample X (m) := (Xn+1 , . . . , Xn+m ).
Remark. From such an estimator one immediately obtains:
▶ the discovery probability for rare species i.e. the probability of discovering
a species which is either new or has frequency at most τ at the
(n+m+1)–th step =⇒ rare species estimation
▶ an optimal additional sample size: sampling is stopped once the
probability of sampling new or rare species is below a certain threshold
▶ the sample coverage, i.e. the proportion of species in the population
detected with a sample of size n + m.
Frequentist estimators:
▶ Good–Toulmin estimator [Good & Toulmin, 1956; Mao, 2004]: estimator
for the probability of discovering a new species at (n+m+1)–th step.
=⇒ unstable if the size of the additional unobserved sample m is larger
than n (estimated probability becomes either < 0 or > 1).
▶ No frequentist nonparametric estimator for the probability of discovering a
species with frequency r at (n+m+1)–th sampling step is available.

▶ BNP analog of the Good–Toulmin estimator [Favaro, Lijoi, Mena & P,
  2009]: estimator for the probability of discovering a new species at the
  (n+m+1)–th step, with K_n = k,

      P[X_{n+m+1} = “new” | X^(n)] = (θ + kσ)/(θ + n) · (θ + n + σ)_m/(θ + n + 1)_m.

▶ BNP estimator for the probability of discovering a species with frequency
  r at the (n+m+1)–th sampling step [Favaro, Lijoi and P, 2012]:

      P[X_{n+m+1} = “species with frequency r” | X^(n)]
        = ∑_{i=1}^{r} m_i (i − σ)_{r+1−i} (m choose r−i) (θ + n − i + σ)_{m−r+i} / (θ + n)_{m+1}
          + (1 − σ)_r (m choose r) (θ + kσ)(θ + n + σ)_{m−r} / (θ + n)_{m+1}.
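A literal implementation of these two estimators (the rising-factorial helper is repeated so the snippet stands alone; the demo values are arbitrary). For m = 0 both reduce to the one–step formulas of the earlier slide, which is a handy sanity check.

```python
from math import comb, prod

def rising(x, k):
    """Rising factorial (x)_k."""
    return prod(x + i for i in range(k))

def bnp_good_toulmin(n, k, m, theta, sigma):
    """P[X_{n+m+1} = 'new' | X^(n)] with K_n = k observed distinct species."""
    return (theta + k * sigma) / (theta + n) * rising(theta + n + sigma, m) / rising(theta + n + 1, m)

def bnp_freq_r(mvec, n, k, m, r, theta, sigma):
    """P[X_{n+m+1} = 'species with frequency r' | X^(n)]; mvec[i-1] = m_i."""
    den = rising(theta + n, m + 1)
    old = sum(mvec[i - 1] * rising(i - sigma, r + 1 - i) * comb(m, r - i)
              * rising(theta + n - i + sigma, m - r + i) for i in range(1, r + 1)) / den
    new = rising(1 - sigma, r) * comb(m, r) * (theta + k * sigma) * rising(theta + n + sigma, m - r) / den
    return old + new

mvec, n, k = [1, 2, 0, 0, 1, 0, 0, 0, 0, 0], 10, 4
theta, sigma = 1.0, 0.5
print(bnp_good_toulmin(n, k, m=20, theta=theta, sigma=sigma))
print(bnp_freq_r(mvec, n, k, m=20, r=3, theta=theta, sigma=sigma))
```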


Discovery probability at the (n + m + 1)–th sampling step.


[Figure: probability of discovering a new species (y–axis) against the size of the
additional sample (x–axis); PY, Good–Toulmin (GT) and DP estimators for the
anaerobic and aerobic libraries.]

EST data from Naegleria gruberi aerobic and anaerobic cDNA libraries with basic
sample n ≈ 950: Good–Toulmin (GT), DP process and PY process estimators of the
probability of discovering a new gene at the (n + m + 1)–th sampling step for
m = 1, . . . , 2000.

Expected number of new genes in an additional sample of size m

[Figure: expected number of new genes (y–axis) against the size of the additional
sample (x–axis); PY, Good–Toulmin (GT) and DP estimators for the aerobic and
anaerobic libraries.]

EST data from Naegleria gruberi aerobic and anaerobic cDNA libraries with basic
sample n ≈ 950: Good–Toulmin (GT), DP process and PY process estimators of the
number of new genes to be observed in an additional sample of size
m = 1, . . . , 2000.

Some remarks on BNP models for species sampling problems

▶ BNP estimators based on Gibbs–type priors are available for the aforementioned
  and other quantities of interest in species sampling problems.
▶ BNP models correspond to large probabilistic models in which all objects
  of potential interest are modeled jointly and coherently, thus leading to
  intuitive predictive structures
  =⇒ this avoids ad–hoc procedures and incoherencies sometimes connected
  with frequentist nonparametric procedures.
▶ Gibbs–type priors with σ > 0 (recall that they assume an infinite number
  of species) are ideally suited for populations with a large unknown number
  of species =⇒ the typical case in Genomics.
▶ In Ecology the “∞” assumption is often too strong =⇒ Gibbs–type priors with
  σ < 0 (surprising heuristic by–product: by combining Gibbs-type priors
  with σ > 0 and σ < 0 it is possible to identify situations in which frequentist
  estimators work).

Frequentist Consistency

Frequentist Posterior Consistency

“ What if ” or frequentist approach to consistency [Diaconis and Freedman,


1986]: What happens if the data are not exchangeable but i.i.d. from a “true”
P0 ? Does the posterior Q( · |X (n) ) accumulate around P0 as the sample size
increases?
Definition. Q is weakly consistent at P_0 if, for every weak neighbourhood A_ε of P_0,

    Q(A_ε | X^(n)) → 1   a.s.–P_0^∞   as n → ∞,

where P_0^∞ denotes the infinite product measure.
The typical proof strategy for discrete nonparametric priors consists in showing that
▶ E[P̃ | X^(n)] → P_0 a.s.–P_0^∞ as n → ∞ (the key step consists in showing that
  P[X_{n+1} = “new” | X^(n)] → 0 a.s.–P_0^∞);
▶ Var[P̃ | X^(n)] → 0 a.s.–P_0^∞ as n → ∞, by finding a suitable bound on the variance.


Consistency of the PY process


Denote by κ_n the number of distinct values generated by the “true” P_0 in a
sample of size n. We distinguish two cases of P_0:
▶ P_0 is discrete (with finitely or infinitely many support points) =⇒ κ_n/n → 0 as n → ∞.
  The predictive distribution given a sample generated by P_0 is then

      P[X_{n+1} ∈ · | X^(n)] = (θ + σκ_n)/(θ + n) · P*( · )
                              + (n − σκ_n)/(θ + n) · ∑_{i=1}^{κ_n} (N_i − σ)/(n − σκ_n) δ_{X_i*}( · )
                              → P_0( · ),

  since the first weight → 0, the second → 1 and (N_i − σ)/(n − σκ_n) ∼ N_i/n.

▶ P_0 is diffuse (i.e. P_0({x}) = 0 for all x ∈ X) =⇒ κ_n = n and N_i = 1 for all i. Then

      P[X_{n+1} ∈ · | X^(n)] = (θ + σn)/(θ + n) · P*( · )
                              + (n − σn)/(θ + n) · ∑_{i=1}^{n} (1 − σ)/(n − σn) δ_{X_i*}( · )
                              → σ P*( · ) + (1 − σ) P_0( · ),

  since the first weight → σ, the second → 1 − σ and (1 − σ)/(n − σn) = 1/n.
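A two–line numerical illustration of the diffuse case (θ = 1, σ = 0.5 arbitrary): since every observation is distinct, κ_n = n and the predictive mass on “new” tends to σ rather than to 0, which is exactly where the inconsistency comes from.

```python
theta, sigma = 1.0, 0.5
for n in (10, 100, 1000, 10000):
    # diffuse P_0: kappa_n = n, so P[new | X^(n)] = (theta + sigma*n) / (theta + n)
    print(n, (theta + sigma * n) / (theta + n))   # approaches sigma = 0.5, not 0
```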


Moreover, in both cases we have Var[P̃ | X^(n)] → 0 a.s.–P_0^∞ as n → ∞, and hence:
▶ The PY process is consistent for discrete data generating P0 (e.g. species
sampling problems);
▶ The PY process is inconsistent for diffuse data generating P0 .
Similar results hold for general Gibbs–type priors [De Blasi, Lijoi and P, 2013].
Should one worry about the inconsistency result in the case of diffuse P0 ? NO!
Discrete nonparametric priors are designed to model discrete distributions and
should not be used to model data from continuous distributions. In fact, as the
sample size n diverges:
▶ P0 generates (Xn )n≥1 containing no ties with probability 1;
▶ a discrete de Finetti measure generates (Xn )n≥1 containing no ties with
probability 0
=⇒ model and data generating mechanism are incompatible!
=⇒ There are no good or bad priors, but rather models which are
suitable/compatible/designed for a certain phenomenon and not for others!

Some References


• Antoniak (1974). Mixtures of Dirichlet processes with applications to Bayesian


nonparametric problems. Ann. Statist. 2, 1152–1174.
• De Blasi, Favaro, Lijoi, Mena, Prünster & Ruggiero (2015). Are Gibbs-type priors
the most natural generalization of the Dirichlet process? IEEE TPAMI 37, 212-229.
• De Blasi, Lijoi, & Prünster (2013). An asymptotic analysis of a class of discrete
nonparametric priors. Statist. Sinica 23, 1299–1322.
• Caron (2012). Bayesian nonparametric models for bipartite graphs. NIPS’2012.
• Caron & Fox (2017), Sparse graphs using exchangeable random measures. J. R.
Stat. Soc. B 79, 1295-1366.
• Diaconis & Freedman (1986). On the consistency of Bayes estimates. Ann. Statist.
14, 1-26.
• Favaro, Lijoi, Mena & Prünster (2009). Bayesian nonparametric inference for species
variety with a two parameter Poisson-Dirichlet process prior. J. Roy. Statist. Soc. Ser.
B 71, 993–1008.
• Favaro, Lijoi & Prünster (2012). A new estimator of the discovery probability.
Biometrics 68, 1188-96.
• Favaro, Prünster & Walker (2011). On a class of random probability measures with
general predictive structure. Scand. J. Statist. 38, 359–376.
• Ferguson (1973). A Bayesian analysis of some nonparametric problems. Ann.
Statist. 1, 209-30.
• Gnedin & Pitman (2006). Exchangeable Gibbs partitions and Stirling triangles. J.
Math. Sci. (N.Y.) 138, 5674-85.


• Good & Toulmin (1956). The number of new species, and the increase in
population coverage, when a sample is increased. Biometrika 43, 45-63.
• Good (1953). The population frequencies of species and the estimation of
population parameters. Biometrika 40, 237-64.
• Korwar & Hollander (1973). Contribution to the theory of Dirichlet processes.
Ann. Probab. 1, 705-11.
• Lijoi, Mena & Prünster (2007a). Bayesian nonparametric estimation of the
probability of discovering new species. Biometrika 94, 769–786.
• Lijoi, Mena & Prünster (2007b). Controlling the reinforcement in Bayesian
nonparametric mixture models. J. Roy. Statist. Soc. Ser. B 69, 715–740.
• Lo (1991). A characterization of the Dirichlet process. Statist. Probab. Lett. 12,
185–187.
• Mao (2004). Prediction of the conditional probability of discovering a new class. J.
Am. Statist. Assoc. 99, 1108-18.
• Mao & Lindsay (2002). A Poisson model for the coverage problem with a genomic
application. Biometrika 89, 669-81.
• Pitman & Yor (1997). The two-parameter Poisson-Dirichlet distribution derived
from a stable subordinator. Ann. Probab. 25, 855–900.
• Pitman (2006). Combinatorial Stochastic Processes. Lecture Notes in Math.,
vol.1875, Springer, Berlin.
• Regazzini (1978). Intorno ad alcune questioni relative alla definizione del premio
secondo la teoria della credibilità. Giorn. Istit. Ital. Attuari, 41, 77–89.
• Teh (2006). A Hierarchical Bayesian Language Model based on Pitman-Yor
Processes. Coling/ACL 2006, 985-92.

