October 7, 2018
Abstract
We introduce a directed random contingency table model, where the entries are in (0, 1), independent
and beta-distributed with parameters depending on their row and column indices. We find sufficient
statistics and, based on them, give an algorithm to find the MLE of the parameters, together with a
convergence proof. Then we extend the model to the multiclass scenario, where for fixed k and l,
we are looking for k so-called out-clusters of the rows and l in-clusters of the columns so that the
parameters of the beta-distributed entries also depend on their cluster memberships. To find the clusters
and estimate the parameters, we use an iteration that alternately finds the cluster memberships and then,
fixing the memberships, estimates the parameters within the blocks separately. This resembles the
EM algorithm for mixtures of exponential-family distributions. The algorithm is applicable, e.g., to
microarrays.
Keywords: exponential family, ML estimation, EM algorithm, digamma function
MSC2010 : 62F10, 62B05.
1 Introduction
In [7] we introduced a weighted digraph model, where the directed edges had beta-distributed edge-weights
(the parameters depend on the indices of the entries). Now we extend this model to rectangular
arrays of nonnegative entries and estimate the parameters. Then we extend the model to the multiclass
scenario by applying the theory of exponential mixtures and using the ideas of [5] and those of the EM
algorithm.
More explanation is needed, BOLLA will probably do it at the end.
∗ jmala@math.bme.hu
† fatma@math.bme.hu
‡ ahmed@math.bme.hu
§ marib@math.bme.hu
2 A random contingency table model with beta-distributed entries
Let W = (w_ij) be an n × m contingency table of entries transformed into the (0, 1) interval. Our model
is the following: w_ij obeys a beta-distribution with parameters a_i > 0 and b_j > 0. The parameters are
collected in a = (a_1, . . . , a_n) and b = (b_1, . . . , b_m), or briefly, in (a, b). Here a_i can be thought of as the
potential of row i to send messages out, and b_j as the resistance of column j to receive messages in, while
w_ij is the weight of the i → j message.
The likelihood function is factorized as
$$
\begin{aligned}
L_{a,b}(W) &= \prod_{i=1}^{n} \prod_{j=1}^{m} \frac{\Gamma(a_i + b_j)}{\Gamma(a_i)\Gamma(b_j)}\, w_{ij}^{a_i - 1} (1 - w_{ij})^{b_j - 1} \\
&= C(a, b) \prod_{i=1}^{n} \prod_{j=1}^{m} \exp[(a_i - 1) \ln w_{ij} + (b_j - 1) \ln(1 - w_{ij})] \\
&= \exp\Bigl[ \sum_{i=1}^{n} (a_i - 1) \sum_{j=1}^{m} \ln w_{ij} + \sum_{j=1}^{m} (b_j - 1) \sum_{i=1}^{n} \ln(1 - w_{ij}) - Z(a, b) \Bigr],
\end{aligned}
$$
where C(a, b) is the normalizing constant, and Z(a, b) = −ln C(a, b) is the log-partition (cumulant)
function. Since the likelihood function depends on W only through the row-sums s_i of the n × m
matrix U = U(W) of general entry ln w_ij and the column-sums z_j of the n × m matrix V = V(W) of
general entry ln(1 − w_ij), by the Neyman–Fisher factorization theorem, the row-sums of U and the
column-sums of V are sufficient statistics for the parameters.
$$
s_i := \sum_{j=1}^{m} \ln w_{ij} \quad (i = 1, \dots, n); \qquad
z_j := \sum_{i=1}^{n} \ln(1 - w_{ij}) \quad (j = 1, \dots, m).
$$
With them,
$$
\begin{aligned}
\frac{\partial \ln L_{a,b}(W)}{\partial a_i} &= \sum_{s=1}^{m} \frac{\Gamma'(a_i + b_s)}{\Gamma(a_i + b_s)} - m\,\frac{\Gamma'(a_i)}{\Gamma(a_i)} + s_i = 0, \quad i = 1, \dots, n \\
\frac{\partial \ln L_{a,b}(W)}{\partial b_i} &= \sum_{s=1}^{n} \frac{\Gamma'(a_s + b_i)}{\Gamma(a_s + b_i)} - n\,\frac{\Gamma'(b_i)}{\Gamma(b_i)} + z_i = 0, \quad i = 1, \dots, m.
\end{aligned}
\tag{1}
$$
To use a fixed point iteration, we now write the system of likelihood equations in the form θ = f(θ),
where θ = (a, b), as follows:
$$
\begin{aligned}
a_i &= \psi^{-1}\Bigl[ \frac{1}{m}\, s_i + \frac{1}{m} \sum_{s=1}^{m} \psi(a_i + b_s) \Bigr] =: g_i(a, b), \quad i = 1, \dots, n \\
b_i &= \psi^{-1}\Bigl[ \frac{1}{n}\, z_i + \frac{1}{n} \sum_{s=1}^{n} \psi(a_s + b_i) \Bigr] =: h_i(a, b), \quad i = 1, \dots, m.
\end{aligned}
\tag{2}
$$
Here ψ(x) = ∂ ln Γ(x)/∂x = Γ'(x)/Γ(x) for x > 0 is the digamma function, and the g_i's resp. h_j's are the coordinate
functions of g : R^{n+m} → R^n resp. of h : R^{n+m} → R^m. Let f = (g, h). Then, starting with θ^(0), we use
the iteration θ^(t+1) = f(θ^(t)).
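The fixed point iteration (2) can be sketched in a few lines. The following Python sketch uses only the standard library, with a hand-rolled digamma/trigamma (recurrence plus asymptotic series) and a Newton-based ψ⁻¹; the names `fit_beta_table` and `inv_digamma` are ours, not from the paper, and the stopping rule is a simple sup-norm tolerance rather than the paper's convergence analysis.

```python
import math
import random

def digamma(x):
    # psi(x) = Gamma'(x)/Gamma(x) via the recurrence psi(x) = psi(x + 1) - 1/x
    # and an asymptotic series that is accurate once x >= 10
    r = 0.0
    while x < 10.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    # psi'(x) via psi'(x) = psi'(x + 1) + 1/x^2 and an asymptotic series
    r = 0.0
    while x < 10.0:
        r += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return r + inv * (1.0 + inv * (0.5 + inv * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))))

def inv_digamma(y):
    # Newton iteration for psi^{-1}(y), with the usual piecewise initialization
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(25):
        x -= (digamma(x) - y) / trigamma(x)
        if x <= 0.0:          # safeguard: psi is only defined for x > 0
            x = 1e-8
    return x

def fit_beta_table(W, tol=1e-10, max_iter=5000):
    # fixed point iteration theta^{(t+1)} = f(theta^{(t)}) of equation (2):
    # a_i = psi^{-1}(s_i/m + (1/m) sum_s psi(a_i + b_s)), symmetrically for b
    n, m = len(W), len(W[0])
    s = [sum(math.log(W[i][j]) for j in range(m)) for i in range(n)]        # row-sums of ln w_ij
    z = [sum(math.log(1.0 - W[i][j]) for i in range(n)) for j in range(m)]  # column-sums of ln(1 - w_ij)
    a, b = [1.0] * n, [1.0] * m
    for _ in range(max_iter):
        na = [inv_digamma(s[i] / m + sum(digamma(a[i] + bs) for bs in b) / m) for i in range(n)]
        nb = [inv_digamma(z[j] / n + sum(digamma(ai + b[j]) for ai in a) / n) for j in range(m)]
        delta = max(abs(x - y) for x, y in zip(a + b, na + nb))
        a, b = na, nb
        if delta < tol:
            break
    return a, b

# toy data: a 6 x 8 table with Beta(2, 3)-distributed entries
random.seed(7)
W = [[random.betavariate(2.0, 3.0) for _ in range(8)] for _ in range(6)]
a_hat, b_hat = fit_beta_table(W)
```

At a fixed point the likelihood equations (1) hold, since (2) is just (1) rearranged through ψ⁻¹.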
Later we will use the following trivial upper bound for the sum of the row- and column-sums (of the U
and V matrices):
$$
\sum_{i=1}^{n} s_i + \sum_{j=1}^{m} z_j
= \sum_{i=1}^{n} \sum_{j=1}^{m} \ln w_{ij} + \sum_{j=1}^{m} \sum_{i=1}^{n} \ln(1 - w_{ij})
= \sum_{i,j} \ln[w_{ij}(1 - w_{ij})] \le -2 \ln 2\, nm
\tag{3}
$$
due to rearranging the terms and the relation w_ij(1 − w_ij) ≤ 1/4 for w_ij ∈ (0, 1), with equality if and
only if w_ij = 1/2.
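The bound (3) is easy to check numerically; a small sanity check in Python (standard library only, with arbitrary uniform test data of our choosing):

```python
import math
import random

# sanity check of (3): sum_i s_i + sum_j z_j <= -2 ln 2 * nm,
# because each term ln[w(1 - w)] is at most ln(1/4) = -2 ln 2
random.seed(0)
n, m = 5, 7
W = [[random.random() for _ in range(m)] for _ in range(n)]
total = sum(math.log(w * (1.0 - w)) for row in W for w in row)
bound = -2.0 * math.log(2.0) * n * m
assert total <= bound
```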
Lemma 1 The function ψ(2x) − ψ(x), x ∈ (0, ∞), is decreasing and its range is (ln 2, ∞).

Proof. It is easily seen by the identity ψ(2x) = (1/2)ψ(x) + (1/2)ψ(x + 1/2) + ln 2, which can be found in [1].
Proposition 1 Let
$$
M := \max\Bigl\{ \max_{i \in \{1,\dots,n\}} \Bigl(-\frac{s_i}{m}\Bigr),\ \max_{j \in \{1,\dots,m\}} \Bigl(-\frac{z_j}{n}\Bigr) \Bigr\}
\tag{4}
$$
and let ε > 0 be the (only) solution of the equation ψ(2x) − ψ(x) = M. Then (â, b̂) ≥ ε1, where 1 ∈ R^{n+m}
is the all-1's vector, and the inequality between vectors is understood componentwise.

Proof. In view of (3) we have that M ≥ ln 2. Since equality in (3) is attained with probability 0, we
have that M > ln 2 with probability 1. Therefore, by Lemma 1, there exists an ε, with probability 1,
such that ψ(2ε) − ψ(ε) = M.

Without loss of generality we can assume that
$$
\hat{a}_{i_0} = \min\Bigl\{ \min_{i \in \{1,\dots,n\}} \hat{a}_i,\ \min_{j \in \{1,\dots,m\}} \hat{b}_j \Bigr\}.
$$
Proposition 2 With the solution ε of ψ(2x) − ψ(x) = M of (4), we have f(ε1) ≥ ε1.

Proof. Since all coordinates of ε1 equal ε, for i = 1, . . . , n,
$$
g_i(\varepsilon \mathbf{1}) = \psi^{-1}\Bigl( \psi(2\varepsilon) + \frac{s_i}{m} \Bigr) \ge \psi^{-1}(\psi(2\varepsilon) - M) = \varepsilon,
$$
where we used that s_i/m ≥ −M, that ψ⁻¹ is increasing, and that ψ(2ε) − M = ψ(ε). Likewise,
$$
h_i(\varepsilon \mathbf{1}) = \psi^{-1}\Bigl( \psi(2\varepsilon) + \frac{z_i}{n} \Bigr) \ge \psi^{-1}(\psi(2\varepsilon) - M) = \varepsilon.
$$
It is also clear that we have the following.
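By Lemma 1 the map x ↦ ψ(2x) − ψ(x) is decreasing with range (ln 2, ∞), so for any M > ln 2 the ε of Proposition 1 is the unique root of ψ(2x) − ψ(x) = M and can be found, e.g., by bisection. A minimal sketch (standard library only; `eps_from_M` is our name, not the paper's):

```python
import math

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series for x >= 10
    r = 0.0
    while x < 10.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def eps_from_M(M, tol=1e-12):
    # unique root of psi(2x) - psi(x) = M for M > ln 2 (Lemma 1), by bisection
    assert M > math.log(2.0)
    f = lambda x: digamma(2.0 * x) - digamma(x) - M
    lo, hi = 1e-12, 1.0
    while f(hi) > 0.0:       # f is decreasing: enlarge hi until f(hi) <= 0
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:     # root lies to the right of mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, since ψ(2) = ψ(1) + 1, the root for M = 1 is exactly x = 1, which makes a convenient check.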
3 The multiclass model

In the several clusters case, we put blocks of Section 2 together. Here the statistics are sufficient
only within the blocks.

Given the integers 1 ≤ k ≤ n and 1 ≤ l ≤ m, we are looking for a k-partition, in other words, clusters
R_1, . . . , R_k of the rows, and clusters C_1, . . . , C_l of the columns such that the row and column items are assigned
to the clusters independently, and, given the cluster memberships, the message sent by row
u ∈ R_i to v ∈ C_j has weight w_uv ∼ Beta(a_uj, b_vi); further, all these assignments are done independently.
The parameters are the n × l matrix A and the m × k matrix B, where the jth column of A contains
the parameters a_uj in the block u ∈ R_i, for i = 1, . . . , k; j = 1, . . . , l. Likewise, the ith column of B
contains the parameters b_vi in the block v ∈ C_j, for j = 1, . . . , l; i = 1, . . . , k. Here a_uj can be thought
of as the potential of row-item u of cluster R_i to send messages out to C_j, and b_vi as the resistance of column-item
v of cluster C_j to receive messages in from R_i.
This is a mixture of exponential-family distributions, and as the mixing can be supervised by two
multinomially distributed random variables (responsible for the memberships), the general theory of
mixtures (see, e.g., [16]) and the iteration of the EM algorithm can be used to estimate the parameters.
With the terminology of the EM algorithm, W is the incomplete data specification. If the missing
memberships were known, we would be able to write the complete log-likelihood in the following form:
$$
\sum_{i=1}^{k} \sum_{j=1}^{l} \sum_{u \in R_i} \sum_{v \in C_j}
\Bigl[ \ln \frac{\Gamma(a_{uj} + b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})}
+ (a_{uj} - 1) \ln w_{uv} + (b_{vi} - 1) \ln(1 - w_{uv}) \Bigr].
\tag{5}
$$
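Given the memberships, (5) can be evaluated directly with `math.lgamma`. A minimal sketch, with toy data and hypothetical parameter values of our choosing (the function name is ours):

```python
import math

def complete_log_likelihood(W, row_cl, col_cl, A, B):
    # equation (5): sum of log Beta(a_uj, b_vi) densities over the blocks R_i x C_j;
    # row_cl[u] = cluster index i of row u, col_cl[v] = cluster index j of column v;
    # A is n x l with A[u][j] = a_uj, B is m x k with B[v][i] = b_vi
    total = 0.0
    for u, i in enumerate(row_cl):
        for v, j in enumerate(col_cl):
            a, b, w = A[u][j], B[v][i], W[u][v]
            total += (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                      + (a - 1.0) * math.log(w) + (b - 1.0) * math.log(1.0 - w))
    return total

# toy 4 x 4 table with k = l = 2 and hypothetical parameter values
W = [[0.2, 0.4, 0.7, 0.1], [0.3, 0.5, 0.6, 0.2],
     [0.8, 0.6, 0.3, 0.9], [0.7, 0.5, 0.4, 0.8]]
A = [[2.0, 1.0], [2.0, 1.0], [1.0, 2.0], [1.0, 2.0]]   # n x l
B = [[1.5, 2.5], [1.5, 2.5], [2.5, 1.5], [2.5, 1.5]]   # m x k
ll = complete_log_likelihood(W, [0, 0, 1, 1], [0, 1, 1, 0], A, B)
```

A useful check: with all parameters equal to 1, every entry is Uniform(0, 1) and (5) vanishes.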
• Maximization step within the blocks: We update the estimates of the parameters A^(t), B^(t) within
the kl blocks separately. As for the block R_i^(t) × C_j^(t), we use the algorithm of Section 2 to find the
estimates a_uj^(t) for u ∈ R_i^(t) and b_vi^(t) for v ∈ C_j^(t). As each row u and column v uniquely corresponds
to exactly one row- and column-cluster, respectively, in this way the R_i^(t) × C_j^(t) parameter blocks
for i = 1, . . . , k, j = 1, . . . , l fill in the A^(t), B^(t) parameter matrices.

• Relocation step between the blocks: Given the new estimates of the parameters A^(t), B^(t),
we relocate u into the row-cluster R_{i*} and v into the column-cluster C_{j*} for which the contribution
of w_uv to the overall likelihood (5) is maximal. We do it separately for the rows and the
columns.
Relocation of the rows: For each u (u = 1, . . . , n), take the maximum of the following over i (i = 1, . . . , k):
$$
\sum_{v=1}^{m} \sum_{j=1}^{l} c_{vj}
\Bigl[ \ln \frac{\Gamma(a_{uj} + b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})}
+ (a_{uj} - 1) \ln w_{uv} + (b_{vi} - 1) \ln(1 - w_{uv}) \Bigr].
$$
If it is maximal for i*, then we relocate u into the row-cluster R_{i*}.
This is a discrete maximization; break ties arbitrarily.
If we fix i, j, then we maximize the inner double summation, i.e., we find the ML estimate of the parameters in the
R_i × C_j block (maximization step) as in (5). Reordering the summation as
$$
\sum_{u=1}^{n} \sum_{i=1}^{k} \sum_{v=1}^{m} \sum_{j=1}^{l} c_{vj}
\Bigl[ \ln \frac{\Gamma(a_{uj} + b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})}
+ (a_{uj} - 1) \ln w_{uv} + (b_{vi} - 1) \ln(1 - w_{uv}) \Bigr],
\tag{7}
$$
for any fixed row u, we maximize the inner double summation for i = 1, . . . , k. If it is maximal for i*,
then we relocate u into the row-cluster R_{i*}. Then we do the relocation for the columns:
$$
\sum_{v=1}^{m} \sum_{j=1}^{l} \sum_{u=1}^{n} \sum_{i=1}^{k} r_{ui}
\Bigl[ \ln \frac{\Gamma(a_{uj} + b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})}
+ (a_{uj} - 1) \ln w_{uv} + (b_{vi} - 1) \ln(1 - w_{uv}) \Bigr].
\tag{8}
$$
For any fixed column v, we maximize the inner double summation for j = 1, . . . , l. If it is maximal for
j*, then we relocate v into the column-cluster C_{j*}.
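One relocation sweep as described above can be sketched as follows: with the parameters held fixed, each row moves to the out-cluster maximizing its part of (7), then each column moves to the in-cluster maximizing its part of (8). The function names and the toy parameter values are ours; ties are broken by the lowest index.

```python
import math

def block_term(a, b, w):
    # one summand of (5): the log Beta(a, b) density at w
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(w) + (b - 1.0) * math.log(1.0 - w))

def loglik(W, row_cl, col_cl, A, B):
    # the complete log-likelihood (5) under cluster labels row_cl, col_cl
    return sum(block_term(A[u][col_cl[v]], B[v][row_cl[u]], W[u][v])
               for u in range(len(W)) for v in range(len(W[0])))

def relocate(W, row_cl, col_cl, A, B, k, l):
    # one sweep: each row u moves to the i maximizing its part of (7),
    # then each column v moves to the j maximizing its part of (8)
    n, m = len(W), len(W[0])
    new_rows = [max(range(k), key=lambda i: sum(block_term(A[u][col_cl[v]], B[v][i], W[u][v])
                                                for v in range(m)))
                for u in range(n)]
    new_cols = [max(range(l), key=lambda j: sum(block_term(A[u][j], B[v][new_rows[u]], W[u][v])
                                                for u in range(n)))
                for v in range(m)]
    return new_rows, new_cols

# toy relocation with hypothetical parameters
W = [[0.2, 0.4, 0.7, 0.1], [0.3, 0.5, 0.6, 0.2],
     [0.8, 0.6, 0.3, 0.9], [0.7, 0.5, 0.4, 0.8]]
A = [[2.0, 1.0], [2.0, 1.0], [1.0, 2.0], [1.0, 2.0]]
B = [[1.5, 2.5], [1.5, 2.5], [2.5, 1.5], [2.5, 1.5]]
rows0, cols0 = [0, 1, 0, 1], [1, 0, 1, 0]
rows1, cols1 = relocate(W, rows0, cols0, A, B, k=2, l=2)
```

Since the row step maximizes (5) row by row with the columns fixed, and the column step does the same with the new rows fixed, a sweep can never decrease (5).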
This idea comes from collaborative filtering.
4 Applications
We can apply the algorithm to a microarray W, where the rows correspond to the genes and the columns to the
conditions, whereas w_uv is the expression level of gene u under condition v. We are looking for clusters
of the genes and of the conditions simultaneously, such that the genes of cluster i affect the conditions
of cluster j according to the model of Section 2.
The model is also applicable to digraphs with possible loops when n = m. In this case, the row- and
column-clusters are out- and in-clusters of the vertices.
First application: for the approx. 1000 × 30 genetic data. The 4 starting clusters of the rows
and 2-4 starting clusters of the columns will be found by Fatma.
We need the following matrices. The parameters will be collected in the n × l matrix A and the m × k
matrix B, where the jth column of A contains the parameters a_uj for u ∈ R_i, i = 1, . . . , k; j = 1, . . . , l.
Likewise, the ith column of B contains the parameters b_vi for v ∈ C_j, j = 1, . . . , l; i = 1, . . . , k.
Define the row membership matrix R as the n × k matrix of 0-1 entries: r_ui = 1 if row u belongs to
R_i, otherwise 0. So the sum of the entries in each row is 1.
Define the column membership matrix C as the m × l matrix of 0-1 entries: c_vj = 1 if column v
belongs to C_j, otherwise 0. So the sum of the entries in each row is 1.
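The membership matrices R and C can be built from label vectors with a small helper (the function name is ours):

```python
def membership_matrix(labels, num_clusters):
    # 0-1 membership matrix: row u carries a single 1 in column labels[u],
    # so every row sums to 1 by construction
    return [[1 if c == labels[u] else 0 for c in range(num_clusters)]
            for u in range(len(labels))]

# R is n x k for the rows, C is m x l for the columns (toy sizes)
R = membership_matrix([0, 0, 1, 2], 3)
C = membership_matrix([1, 0], 2)
```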
Start with some initial clustering, that is, with membership matrices.
In the maximization step: for i = 1, . . . , k and j = 1, . . . , l, consider the submatrix of W with entries w_uv
such that r_ui = 1 and c_vj = 1. For this |R_i| × |C_j| matrix, run either the Newton algorithm of (1) or the
fixed point iteration (2); I call it the 1RD (1-class rectangular directed) algorithm. Let the starting parameter
values all be ε, where ε is defined in Proposition 1.
You obtain parameters a_u's and b_v's. Put them into the matrices A and B as follows: the a_u's for u
with r_ui = 1 go into the jth column of A, and the b_v's for v with c_vj = 1 go into the ith column of B.
Relocation for the rows: For each u, take the maximum over i (i = 1, . . . , k) of the inner double summation
in (7). If it is maximal for i*, then we relocate u into the row-cluster R_{i*}, i.e., r_{ui*} := 1. Do it for u = 1, . . . , n,
separately.
Relocation for the columns: For each v, take the maximum of the following over j (j = 1, . . . , l):
$$
\sum_{u=1}^{n} \sum_{i=1}^{k} r_{ui}
\Bigl[ \ln \frac{\Gamma(a_{uj} + b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})}
+ (a_{uj} - 1) \ln w_{uv} + (b_{vi} - 1) \ln(1 - w_{uv}) \Bigr].
$$
If it is maximal for j*, then we relocate v into the column-cluster C_{j*}, i.e., c_{vj*} := 1. Do it for
v = 1, . . . , m, separately.
This is a discrete maximization; break ties arbitrarily. In this way, the membership matrices are
filled in (zeros stay zeros).
Then start the maximization step again with the new membership matrices (after resetting A and B to zero
everywhere), and iterate until convergence.
Second application: For the electrical consumption data.
References
[1] Abramowitz, M., Stegun, I. A., eds., Handbook of Mathematical Functions with Formulas, Graphs,
and Mathematical Tables (10th ed.). New York: Dover (1972), pp. 258–259.

[2] Alzer, H., Wells, J., Inequalities for the polygamma functions, SIAM J. Math. Anal. 29 (6) (1998),
1459–1466.

[3] Bernardo, J. M., Psi (digamma) function. Algorithm AS 103, Applied Statistics 25 (1976), 315–317.

[5] Bolla, M., Elbanna, A., Estimating parameters of a probabilistic heterogeneous block model via the
EM algorithm, Journal of Probability and Statistics (2015), Article ID 657965.

[6] Bolla, M., Elbanna, A., Matrix and discrepancy view of generalized random and quasirandom graphs,
Special Matrices 4 (1) (2016), 31–45.

[7] Bolla, M., Elbanna, A., Mala, J., Estimating parameters of a directed weighted graph model with
beta-distributed edge-weights.

[8] Chatterjee, S., Diaconis, P., Sly, A., Random graphs with a given degree sequence, Ann. Appl.
Probab. 21 (2011), 1400–1435.

[9] Dempster, A. P., Laird, N. M., Rubin, D. B., Maximum likelihood from incomplete data via the EM
algorithm, J. R. Statist. Soc. B 39 (1977), 1–38.

[12] Hofmann, T., Puzicha, J., Latent class models for collaborative filtering. In Proc. 16th International
Joint Conference on Artificial Intelligence (IJCAI 99) (ed. Dean, T.), Vol. 2, pp. 688–693. Morgan
Kaufmann Publishers Inc., San Francisco CA (1999).

[14] Ungar, L. H., Foster, D. P., A formal statistical approach to collaborative filtering. In Proc.
Conference on Automated Learning and Discovery (CONALD 98) (1998).

[15] Yan, T., Leng, C., Zhu, J., Asymptotics in directed exponential random graph models with an
increasing bi-degree sequence, Ann. Stat. 44 (2016), 31–57.

[16] Wainwright, M. J., Jordan, M. I., Graphical models, exponential families, and variational inference,
Foundations and Trends in Machine Learning 1 (1–2) (2008), 1–305.