
Parameter estimation for random contingency tables and for
biclassified blockmodels of them

József Mala∗, Fatma Abdelkhalek†, Ahmed Elbanna‡, and Marianna Bolla§

Institute of Mathematics, Budapest University of Technology and Economics

Institute of Mathematics, Eötvös Loránd University

October 7, 2018

Abstract

We introduce a directed random contingency table model, where the entries are in (0, 1), independent
and beta-distributed with parameters depending on their row and column indices. We find sufficient
statistics and, based on them, give an algorithm to find the MLE of the parameters, together with
a convergence proof. Then we extend the model to the multiclass scenario, where for fixed k and l,
we are looking for k so-called out-clusters of the rows and l in-clusters of the columns so that the
parameters of the beta-distributed entries also depend on their cluster memberships. To find the clusters
and estimate the parameters, we use an iteration that alternately finds the cluster memberships and then,
fixing the memberships, estimates the parameters within the blocks separately. This resembles the
EM algorithm for mixtures of exponential-family distributions. The algorithm is applicable, e.g., to
microarrays.
Keywords: exponential family, ML estimation, EM algorithm, digamma function
MSC2010: 62F10, 62B05.

1 Introduction

In [7] we introduced a weighted digraph model, where the directed edges had beta-distributed edge-weights
(the parameters depend on the indices of the entries). Now we extend this model to rectangular
arrays of nonnegative entries and estimate the parameters. Then we extend the model to the multiclass
scenario by applying the theory of exponential mixtures and using the ideas of [5] and those of the EM
algorithm.
More explanation is needed, BOLLA will probably do it at the end.
∗ jmala@math.bme.hu
† fatma@math.bme.hu
‡ ahmed@math.bme.hu
§ marib@math.bme.hu

2 A random contingency table model with beta-distributed entries

Let W = (w_{ij}) be an n × m contingency table of entries transformed into the (0, 1) interval. Our model
is the following: w_{ij} obeys a beta-distribution with parameters a_i > 0 and b_j > 0. The parameters are
collected in a = (a_1, . . . , a_n) and b = (b_1, . . . , b_m), or briefly, in (a, b). Here a_i can be thought of as the
potential of row i to send messages out, and b_j as the resistance of column j to receive messages in, while
w_{ij} is the weight of the i → j message.
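For concreteness, such a table can be simulated as follows; this is a minimal sketch (not from the paper), NumPy is assumed, and the particular dimensions and parameter ranges are arbitrary.

import numpy as np

rng = np.random.default_rng(42)
n, m = 5, 4
a = rng.uniform(0.5, 3.0, size=n)   # row "potentials" a_1, ..., a_n > 0
b = rng.uniform(0.5, 3.0, size=m)   # column "resistances" b_1, ..., b_m > 0

# w_ij ~ Beta(a_i, b_j), independently for all i, j
W = rng.beta(a[:, None], b[None, :])   # shape (n, m), entries strictly in (0, 1)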
The likelihood function is factorized as
$$
\begin{aligned}
L_{a,b}(W) &= \prod_{i=1}^{n}\prod_{j=1}^{m} \frac{\Gamma(a_i+b_j)}{\Gamma(a_i)\Gamma(b_j)}\, w_{ij}^{a_i-1}(1-w_{ij})^{b_j-1} \\
&= C(a,b)\prod_{i=1}^{n}\prod_{j=1}^{m} \exp\big[(a_i-1)\ln w_{ij} + (b_j-1)\ln(1-w_{ij})\big] \\
&= \exp\Big[\sum_{i=1}^{n}(a_i-1)\sum_{j=1}^{m}\ln w_{ij} + \sum_{j=1}^{m}(b_j-1)\sum_{i=1}^{n}\ln(1-w_{ij}) - Z(a,b)\Big],
\end{aligned}
$$

where C(a, b) is the normalizing constant, and Z(a, b) = − ln C(a, b) is the log-partition (cumulant)
function. Since the likelihood function depends on W only through the row-sums s_i of the n × m
matrix U = U(W) with general entry ln w_{ij} and the column-sums z_j of the n × m matrix V = V(W) with
general entry ln(1 − w_{ij}), by the Neyman–Fisher factorization theorem, the row-sums of U and the column-sums
of V are sufficient statistics for the parameters:
$$
s_i := \sum_{j=1}^{m} \ln w_{ij} \quad (i = 1, \dots, n); \qquad z_j := \sum_{i=1}^{n} \ln(1 - w_{ij}) \quad (j = 1, \dots, m).
$$

With them, the likelihood equations become
$$
\begin{aligned}
\frac{\partial \ln L_{a,b}(W)}{\partial a_i} &= \sum_{s=1}^{m}\frac{\Gamma'(a_i+b_s)}{\Gamma(a_i+b_s)} - m\,\frac{\Gamma'(a_i)}{\Gamma(a_i)} + s_i = 0, \quad i = 1, \dots, n, \\
\frac{\partial \ln L_{a,b}(W)}{\partial b_i} &= \sum_{s=1}^{n}\frac{\Gamma'(a_s+b_i)}{\Gamma(a_s+b_i)} - n\,\frac{\Gamma'(b_i)}{\Gamma(b_i)} + z_i = 0, \quad i = 1, \dots, m.
\end{aligned} \tag{1}
$$

To use a fixed point iteration, we now write the system of likelihood equations in the form θ = f(θ),
where θ = (a, b), as follows:
$$
\begin{aligned}
a_i &= \psi^{-1}\Big[\frac{1}{m} s_i + \frac{1}{m}\sum_{s=1}^{m}\psi(a_i+b_s)\Big] =: g_i(a,b), \quad i = 1, \dots, n, \\
b_i &= \psi^{-1}\Big[\frac{1}{n} z_i + \frac{1}{n}\sum_{s=1}^{n}\psi(a_s+b_i)\Big] =: h_i(a,b), \quad i = 1, \dots, m.
\end{aligned} \tag{2}
$$
Here ψ(x) = d ln Γ(x)/dx = Γ'(x)/Γ(x) for x > 0 is the digamma function, and the g_i's resp. h_j's are the coordinate
functions of g : R^{n+m} → R^n resp. of h : R^{n+m} → R^m. Let f = (g, h). Then, starting with θ^{(0)}, we use
the successive approximation θ^{(t)} := f(θ^{(t−1)}) for t = 1, 2, . . . until convergence.
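A possible implementation of this fixed point iteration is sketched below; it is an illustration rather than the authors' code, SciPy is assumed, and the inverse digamma ψ^{-1} is computed by Newton's method with a standard starting value (our own choice). The all-ones starting value is also just a default; Proposition 1 below provides a principled starting value ε1.

import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, iters=20):
    """Invert the digamma function numerically (Newton's method)."""
    # Standard starting value, then Newton steps x <- x - (psi(x) - y)/psi'(x).
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y + np.euler_gamma))
    for _ in range(iters):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x

def fit_beta_table(W, a0=None, b0=None, tol=1e-8, max_iter=1000):
    """Fixed point iteration (2) for the MLE of (a, b) in the Beta(a_i, b_j) table model."""
    n, m = W.shape
    s = np.log(W).sum(axis=1)       # sufficient statistics s_i (row-sums of U)
    z = np.log1p(-W).sum(axis=0)    # sufficient statistics z_j (column-sums of V)
    a = np.ones(n) if a0 is None else np.asarray(a0, dtype=float).copy()
    b = np.ones(m) if b0 is None else np.asarray(b0, dtype=float).copy()
    for _ in range(max_iter):
        psi_ab = digamma(a[:, None] + b[None, :])           # psi(a_i + b_j), n x m
        a_new = inv_digamma((s + psi_ab.sum(axis=1)) / m)    # g_i(a, b)
        b_new = inv_digamma((z + psi_ab.sum(axis=0)) / n)    # h_j(a, b)
        if max(np.abs(a_new - a).max(), np.abs(b_new - b).max()) < tol:
            return a_new, b_new
        a, b = a_new, b_new
    return a, b

# Example usage: a_hat, b_hat = fit_beta_table(W)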

Later we will use the following trivial upper bound for the sum of the row- and column-sums (of the U
and V matrices):
$$
\sum_{i=1}^{n} s_i + \sum_{j=1}^{m} z_j = \sum_{i=1}^{n}\sum_{j=1}^{m}\ln w_{ij} + \sum_{j=1}^{m}\sum_{i=1}^{n}\ln(1-w_{ij}) = \sum_{i,j}\ln\big[w_{ij}(1-w_{ij})\big] \le -2\ln 2\; nm, \tag{3}
$$
due to rearranging the terms and the relation w_{ij}(1 − w_{ij}) ≤ 1/4 for w_{ij} ∈ (0, 1), with equality if and
only if w_{ij} = 1/2.

Lemma 1 The function ψ(2x) − ψ(x), x ∈ (0, ∞), is decreasing and its range is (ln 2, ∞).

Proof. It is easily seen by the identity ψ(2x) = (1/2)ψ(x) + (1/2)ψ(x + 1/2) + ln 2, which can be found in [1]. □

Proposition 1 Let
$$
M := \max\Big\{\max_{i\in\{1,\dots,n\}}\Big(-\frac{s_i}{m}\Big),\ \max_{j\in\{1,\dots,m\}}\Big(-\frac{z_j}{n}\Big)\Big\} \tag{4}
$$
and let ε > 0 be the (only) solution of the equation ψ(2x) − ψ(x) = M. Then (â, b̂) ≥ ε1, where 1 ∈ R^{n+m}
is the all 1's vector, and the inequality between vectors is understood componentwise.

Proof. In view of (3) we have that M ≥ ln 2. Since equality in (3) is attained with probability 0, we
have M > ln 2 with probability 1. Therefore, by Lemma 1, there exists an ε, with probability 1,
such that ψ(2ε) − ψ(ε) = M.
Without loss of generality we can assume that
$$
\hat a_{i_0} = \min\Big\{\min_{i\in\{1,\dots,n\}} \hat a_i,\ \min_{j\in\{1,\dots,m\}} \hat b_j\Big\}.
$$

Then, by the ML equation, the monotonicity of ψ, and Lemma 1, we get
$$
mM \ge -s_{i_0} = \sum_{j=1}^{m}\psi(\hat a_{i_0} + \hat b_j) - m\,\psi(\hat a_{i_0}) \ge m\big[\psi(2\hat a_{i_0}) - \psi(\hat a_{i_0})\big].
$$
Therefore, â_{i_0} ≥ ε, whence â_i ≥ ε for every i = 1, . . . , n and b̂_j ≥ ε for every j = 1, . . . , m. □
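For illustration, M of (4) and the corresponding ε can be computed numerically as follows (a sketch assuming SciPy; the bracketing interval of the root finder is a heuristic choice of ours).

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def lower_bound_eps(W):
    """Compute M of (4) and return the root eps of psi(2x) - psi(x) = M."""
    n, m = W.shape
    s = np.log(W).sum(axis=1)
    z = np.log1p(-W).sum(axis=0)
    M = max((-s / m).max(), (-z / n).max())   # M > ln 2 with probability 1
    # psi(2x) - psi(x) decreases from +infinity to ln 2, so the root is bracketed
    return brentq(lambda x: digamma(2 * x) - digamma(x) - M, 1e-12, 1e12)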

Proposition 2 With the solution ε of ψ(2x) − ψ(x) = M of (4), we have f(ε1) ≥ ε1.

Proof.
$$
g_i(\varepsilon\mathbf 1) = \psi^{-1}\Big(\psi(2\varepsilon) + \frac{s_i}{m}\Big) \ge \psi^{-1}\big(\psi(2\varepsilon) - M\big) = \varepsilon.
$$
Likewise,
$$
h_j(\varepsilon\mathbf 1) = \psi^{-1}\Big(\psi(2\varepsilon) + \frac{z_j}{n}\Big) \ge \psi^{-1}\big(\psi(2\varepsilon) - M\big) = \varepsilon. \qquad \square
$$
It is also clear that we have the following.

Proposition 3 If (a, b) ≥ (x, y) > 0, then f(a, b) ≥ f(x, y).

Starting the iteration from θ^{(0)} = ε1, Propositions 2 and 3 imply that the sequence (θ^{(t)}) is coordinate-wise increasing.
Moreover, (θ^{(t)}) is bounded from above by (â, b̂), due to Propositions 1 and 3. Therefore, the
convergence of (θ^{(t)}) follows, and by the continuity of f, the limit is clearly a fixed point of f. However,
in view of Section 2, the solution of the ML equation is a fixed point of f, and it cannot be other than the
unique solution (â, b̂) guaranteed by Theorem ??.
JOSKA, this is still incomplete: a convergence theorem is needed here, similar to the one in the previous
paper. You may already have written it, I just cannot find it. Please put it here.

3 The multiclass model

In the case of several clusters, we put blocks of the type of Section 2 together. Here the statistics are sufficient
only within the blocks.
Given the integers 1 ≤ k ≤ n and 1 ≤ l ≤ m, we are looking for a k-partition, in other words, clusters
R_1, . . . , R_k of the rows, and an l-partition, clusters C_1, . . . , C_l, of the columns such that the row and column items are assigned
to the clusters independently and, given the cluster memberships, the message sent by row
u ∈ R_i to column v ∈ C_j has weight w_{uv} ∼ Beta(a_{uj}, b_{vi}); further, all these assignments are made independently.
The parameters are collected in the n × l matrix A and the m × k matrix B, where the jth column of A contains
the parameters a_{uj} for u ∈ R_i, i = 1, . . . , k; j = 1, . . . , l. Likewise, the ith column of B
contains the parameters b_{vi} for v ∈ C_j, j = 1, . . . , l; i = 1, . . . , k. Here a_{uj} can be thought
of as the potential of row-item u of cluster R_i to send messages out to C_j, and b_{vi} as the resistance of column-item
v of cluster C_j to receive messages in from R_i.
This is a mixture of exponential-family distributions, and as the mixing can be governed by two
multinomially distributed random variables (responsible for the memberships), the general theory of
mixtures (see, e.g., [16]) and the iteration of the EM algorithm can be used to estimate the parameters.
With the terminology of the EM algorithm, W is the incomplete data specification. If the missing
memberships were known, we would be able to write the complete log-likelihood in the following form:
$$
\sum_{i=1}^{k}\sum_{j=1}^{l}\sum_{u\in R_i}\sum_{v\in C_j}\Big[\ln\frac{\Gamma(a_{uj}+b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})} + (a_{uj}-1)\ln w_{uv} + (b_{vi}-1)\ln(1-w_{uv})\Big]. \tag{5}
$$
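As an illustration (not part of the paper), the complete log-likelihood (5) can be evaluated for a given clustering as follows; the helper name and the 0-based integer label encoding of the memberships are our own conventions.

import numpy as np
from scipy.special import gammaln

def complete_log_likelihood(W, A, B, row_labels, col_labels):
    """Evaluate (5): A is n x l with A[u, j] = a_uj, B is m x k with B[v, i] = b_vi."""
    n, m = W.shape
    total = 0.0
    for u in range(n):
        i = row_labels[u]                 # cluster R_i containing row u
        for v in range(m):
            j = col_labels[v]             # cluster C_j containing column v
            a, b = A[u, j], B[v, i]
            total += (gammaln(a + b) - gammaln(a) - gammaln(b)
                      + (a - 1) * np.log(W[u, v])
                      + (b - 1) * np.log1p(-W[u, v]))
    return total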

Starting with an initial clustering R_1^{(0)}, . . . , R_k^{(0)} of the rows and C_1^{(0)}, . . . , C_l^{(0)} of the columns, the t-th
step of the iteration is as follows (t = 1, 2, . . . ).

• Maximization step within the blocks: We update the estimates of the parameters A^{(t)}, B^{(t)} within
the kl blocks, separately. For the block R_i^{(t−1)} × C_j^{(t−1)}, we use the algorithm of Section 2 to find the
estimates a_{uj}^{(t)} for u ∈ R_i^{(t−1)} and b_{vi}^{(t)} for v ∈ C_j^{(t−1)}. As each row u and column v uniquely corresponds
to exactly one row- and column-cluster, respectively, these parameter blocks
for i = 1, . . . , k, j = 1, . . . , l fill in the A^{(t)}, B^{(t)} parameter matrices.

• Relocation step between the blocks: Given the new estimates of the parameters A^{(t)}, B^{(t)},
we relocate u into the row-cluster R_{i*} and v into the column-cluster C_{j*} for which the contribution
of the corresponding terms of the overall likelihood (5) is maximal. As this cannot easily be done jointly,
we do it separately for the rows and for the columns, using the 0-1 membership indicators r_{ui} and c_{vj}
defined below (see also the code sketch after this list).

Relocation of the rows: For each u (u = 1, . . . , n) take the maximum of the following over i (i = 1, . . . , k):
$$
\sum_{v=1}^{m}\sum_{j=1}^{l} c_{vj}\Big[\ln\frac{\Gamma(a_{uj}+b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})} + (a_{uj}-1)\ln w_{uv} + (b_{vi}-1)\ln(1-w_{uv})\Big].
$$
If it is maximal for i*, then we relocate u into the row-cluster R_{i*}.

Relocation of the columns: For each v (v = 1, . . . , m) take the maximum of the following over j (j = 1, . . . , l):
$$
\sum_{u=1}^{n}\sum_{i=1}^{k} r_{ui}\Big[\ln\frac{\Gamma(a_{uj}+b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})} + (a_{uj}-1)\ln w_{uv} + (b_{vi}-1)\ln(1-w_{uv})\Big].
$$
If it is maximal for j*, then we relocate v into the column-cluster C_{j*}.

This is a discrete maximization. Break ties arbitrarily.

In this way, we get a new clustering R_1^{(t)}, . . . , R_k^{(t)} of the rows and C_1^{(t)}, . . . , C_l^{(t)} of the columns, with
which we go back to the maximization step.
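The relocation step can be sketched as follows; this is our own illustration in which A, B and the 0-based label arrays follow the conventions of the earlier sketches, and the column relocation uses the freshly updated row labels, which is one possible reading of the text.

import numpy as np
from scipy.special import gammaln

def relocate(W, A, B, row_labels, col_labels, k, l):
    """One relocation sweep: rows first, then columns (with the updated row labels)."""
    n, m = W.shape
    lw, l1w = np.log(W), np.log1p(-W)

    new_rows = row_labels.copy()
    for u in range(n):
        scores = np.empty(k)
        for i in range(k):                       # candidate row-cluster R_i
            a = A[u, col_labels]                 # a_{u, j(v)} for each column v
            b = B[:, i]                          # b_{v, i} for each column v
            scores[i] = np.sum(gammaln(a + b) - gammaln(a) - gammaln(b)
                               + (a - 1) * lw[u] + (b - 1) * l1w[u])
        new_rows[u] = np.argmax(scores)          # ties broken arbitrarily (first argmax)

    new_cols = col_labels.copy()
    for v in range(m):
        scores = np.empty(l)
        for j in range(l):                       # candidate column-cluster C_j
            a = A[:, j]                          # a_{u, j} for each row u
            b = B[v, new_rows]                   # b_{v, i(u)} for each row u
            scores[j] = np.sum(gammaln(a + b) - gammaln(a) - gammaln(b)
                               + (a - 1) * lw[:, v] + (b - 1) * l1w[:, v])
        new_cols[v] = np.argmax(scores)

    return new_rows, new_cols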
As both steps increase the likelihood, and the likelihood is bounded from above (by the maximum, over the
finitely many possible clusterings, of the sums of the within-block MLE values), the iteration must converge
to a local maximum of it. A good starting clustering, obtained for example by spectral biclustering (see, e.g., [4]),
helps a lot. In fact, the EM algorithm of [9] works with
the indicator vectors of the clusters and finds the probabilities of the objects belonging to the clusters,
hence it gives rise to a fuzzy clustering. We rather apply the collaborative filtering idea of [12, 14],
where the objects are relocated between the clusters so as to increase the overall likelihood in each step. So
we do not really need the membership vectors; only the relocation rule has to be defined.
We can write the overall likelihood by means of membership vectors. Let the n × k matrix R = (r_{ui})
contain the membership vectors of the rows, i.e., r_{ui} = 1 if u ∈ R_i and r_{ui'} = 0 for i' ≠ i. Likewise, let
the m × l matrix C = (c_{vj}) contain the membership vectors of the columns, i.e., c_{vj} = 1 if v ∈ C_j and
c_{vj'} = 0 for j' ≠ j.
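For illustration (our own convention, not from the paper), such membership matrices can be built from integer label arrays as follows.

import numpy as np

def membership_matrix(labels, num_clusters):
    """0-1 membership matrix with exactly one 1 per row (R is n x k, C is m x l)."""
    M = np.zeros((len(labels), num_clusters))
    M[np.arange(len(labels)), labels] = 1.0
    return M

# e.g. R = membership_matrix(row_labels, k); C = membership_matrix(col_labels, l)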
Given W, k, and l, the complete log-likelihood is
$$
\sum_{i=1}^{k}\sum_{j=1}^{l}\sum_{u=1}^{n}\sum_{v=1}^{m} r_{ui}\, c_{vj}\Big[\ln\frac{\Gamma(a_{uj}+b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})} + (a_{uj}-1)\ln w_{uv} + (b_{vi}-1)\ln(1-w_{uv})\Big]. \tag{6}
$$
If we fix i, j, then we maximize the inner double sum, i.e., we find the ML estimate of the parameters in the
R_i × C_j block (maximization step), as in (5). Reordering the summation as
$$
\sum_{u=1}^{n}\sum_{i=1}^{k}\sum_{v=1}^{m}\sum_{j=1}^{l} c_{vj}\Big[\ln\frac{\Gamma(a_{uj}+b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})} + (a_{uj}-1)\ln w_{uv} + (b_{vi}-1)\ln(1-w_{uv})\Big], \tag{7}
$$

for any fixed row u, we maximize the inner double summation over i = 1, . . . , k. If it is maximal for i*,
then we relocate u into the row-cluster R_{i*}. Then we do the relocation for the columns:
$$
\sum_{v=1}^{m}\sum_{j=1}^{l}\sum_{u=1}^{n}\sum_{i=1}^{k} r_{ui}\Big[\ln\frac{\Gamma(a_{uj}+b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})} + (a_{uj}-1)\ln w_{uv} + (b_{vi}-1)\ln(1-w_{uv})\Big]. \tag{8}
$$

For any fixed column v, we maximize the inner double summation over j = 1, . . . , l. If it is maximal for
j*, then we relocate v into the column-cluster C_{j*}.
This idea comes from collaborative filtering.

4 Applications

We can apply the algorithm to a microarray W, where rows correspond to the genes and columns to the
conditions, whereas w_{uv} is the expression level of gene u under condition v. We are looking for clusters
of the genes and of the conditions simultaneously, such that the genes of cluster i affect the conditions
of cluster j according to the model of Section 2.
The model is also applicable to digraphs with possible loops when n = m. In this case, the row- and
column-clusters are out- and in-clusters of the vertices.

First application: the approximately 1000 × 30 genetic data set. The 4 starting clusters of the rows
and the 2–4 starting clusters of the columns will be found by Fatma.
We need the following matrices. The parameters will be collected in the n × l matrix A and the m × k
matrix B, where the jth column of A contains the parameters a_{uj} for u ∈ R_i, i = 1, . . . , k; j = 1, . . . , l.
Likewise, the ith column of B contains the parameters b_{vi} for v ∈ C_j, j = 1, . . . , l; i = 1, . . . , k.
Define the row membership matrix R as the n × k matrix of 0–1 entries: r_{ui} = 1 if row u belongs to
R_i, and 0 otherwise. So the sum of the entries in each row of R is 1.
Define the column membership matrix C as the m × l matrix of 0–1 entries: c_{vj} = 1 if column v
belongs to C_j, and 0 otherwise. So the sum of the entries in each row of C is 1.
Start with some initial clustering, that is, with membership matrices.
In the maximization step: for i = 1, . . . , k and j = 1, . . . , l consider the submatrix of W with entries w_{uv}
such that r_{ui} = 1 and c_{vj} = 1. For this |R_i| × |C_j| matrix run either the Newton algorithm of (1) or the
fixed point iteration (2); we call this the 1RD (1-class rectangular directed) algorithm. The starting parameter
values are all ε, where ε is defined in Proposition 1.
You obtain parameters a_u and b_v. Put them into the matrices A and B as follows:

• In the j th column of A: auj := au if rui = 1.


• In the ith column of B : bvi := bv if cvj = 1.

In this way, the A and B matrices are completely filled in.


Set the R and C matrices to zero everywhere. In the relocation step:
Relocation of the rows: For each u, take the maximum of the following over i (i = 1, . . . , k):
$$
\sum_{v=1}^{m}\sum_{j=1}^{l} c_{vj}\Big[\ln\frac{\Gamma(a_{uj}+b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})} + (a_{uj}-1)\ln w_{uv} + (b_{vi}-1)\ln(1-w_{uv})\Big].
$$
If it is maximal for i*, then we relocate u into the row-cluster R_{i*}, i.e., r_{ui*} := 1. Do it for u = 1, . . . , n,
separately.
Relocation of the columns: For each v, take the maximum of the following over j (j = 1, . . . , l):
$$
\sum_{u=1}^{n}\sum_{i=1}^{k} r_{ui}\Big[\ln\frac{\Gamma(a_{uj}+b_{vi})}{\Gamma(a_{uj})\Gamma(b_{vi})} + (a_{uj}-1)\ln w_{uv} + (b_{vi}-1)\ln(1-w_{uv})\Big].
$$
If it is maximal for j*, then we relocate v into the column-cluster C_{j*}, i.e., c_{vj*} := 1. Do it for
v = 1, . . . , m, separately.
This is a discrete maximization. Break ties arbitrarily. In this way, the membership matrices are
filled in (zeros stay zeros).
Start the maximization step again with the new membership matrices, after you have set A and B to zero
everywhere, and iterate until convergence.
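Putting the pieces together, a possible driver loop is sketched below; it reuses the hypothetical helpers fit_beta_table, lower_bound_eps, and relocate from the earlier sketches, and the stopping rule (stop when the clustering no longer changes, or after n_iter sweeps) is our own choice.

import numpy as np

def biclustering_em(W, k, l, row_labels, col_labels, n_iter=50):
    """Alternate the blockwise 1RD maximization and the relocation step."""
    n, m = W.shape
    for _ in range(n_iter):
        A = np.ones((n, l))              # n x l, A[u, j] = a_{uj}
        B = np.ones((m, k))              # m x k, B[v, i] = b_{vi}
        # Maximization step: run the 1RD algorithm within each R_i x C_j block.
        for i in range(k):
            rows = np.where(row_labels == i)[0]
            for j in range(l):
                cols = np.where(col_labels == j)[0]
                if rows.size == 0 or cols.size == 0:
                    continue             # empty block: leave default parameters
                sub = W[np.ix_(rows, cols)]
                eps = lower_bound_eps(sub)                 # starting value from Proposition 1
                a_blk, b_blk = fit_beta_table(sub,
                                              a0=np.full(rows.size, eps),
                                              b0=np.full(cols.size, eps))
                A[rows, j] = a_blk       # fill the j-th column of A for u in R_i
                B[cols, i] = b_blk       # fill the i-th column of B for v in C_j
        # Relocation step: move each row and column to its best-fitting cluster.
        new_rows, new_cols = relocate(W, A, B, row_labels, col_labels, k, l)
        if (np.array_equal(new_rows, row_labels)
                and np.array_equal(new_cols, col_labels)):
            break                        # clustering stabilized
        row_labels, col_labels = new_rows, new_cols
    return row_labels, col_labels, A, B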
Second application: For the electrical consumption data.

References

[1] Abramowitz, M., Stegun, I. A., eds., Handbook of Mathematical Functions with Formulas, Graphs,
and Mathematical Tables (10th ed.). New York: Dover (1972), pp. 258-259.

[2] Alzer, H., Wells, J., Inequalities for the polygamma functions, SIAM J. Math. Anal. 29 (6) (1998),
1459-1466.

[3] Bernardo, J. M., Psi (digamma) function. Algorithm AS 103, Applied Statistics 25 (1976), 315-317.

[4] Bolla, M., Spectral clustering and biclustering. Wiley (2013).

[5] Bolla, M., Elbanna, A., Estimating parameters of a probabilistic heterogeneous block model via the
EM algorithm, Journal of Probability and Statistics (2015), Article ID 657965.

[6] Bolla, M., Elbanna, A., Matrix and discrepancy view of generalized random and quasirandom graphs,
Special Matrices 4, Issue 1 (2016), 31-45.
[7] Bolla, M., Elbanna, A., Mala, J., Estimating parameters of a directed weighted graph model with
beta-distributed edge-weights.

[8] Chatterjee, S., Diaconis, P. and Sly, A., Random graphs with a given degree sequence, Ann. Stat. 21
(2010), 1400-1435.

[9] Dempster, A. P., Laird, N. M. and Rubin, D. B., Maximum likelihood from incomplete data via the EM
algorithm, J. R. Statist. Soc. B 39 (1977), 1-38.

[10] Grasmair, M., Fixed point iterations, https://wiki.math.ntnu.no/_media/ma2501/2014v/fixedpoint.pdf

[11] Hillar, C. J., Wibisono, A., Maximum entropy distributions on graphs, arXiv:1301.3321v2 (2013).

[12] Hofmann, T. and Puzicha, J., Latent class models for collaborative filtering. In Proc. 16th Interna-
tional Joint Congress on Artificial Intelligence (IJCAI 99) (ed. Dean T), Vol. 2, pp. 688-693. Morgan
Kaufmann Publications Inc., San Francisco CA (1999).

[13] Lauritzen, S. L., Graphical Models. Oxford Univ. Press (1995).

[14] Ungar, L. H. and Foster, D. P., A Formal Statistical Approach to Collaborative Filtering. In Proc.
Conference on Automated Learning and Discovery (CONALD 98) (1998).
[15] Yan, T., Leng, C., Zhu, J., Asymptotics in directed exponential random graph models with an
increasing bi-degree sequence, Ann. Stat. 44 (2016), 31-57.

[16] Wainwright, M. J., Jordan, M. I., Graphical models, exponential families, and variational inference,
Foundations and Trends in Machine Learning 1 (1-2) (2008), 1-305.
