Professional Documents
Culture Documents
1205
WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China
x8
x11 3. CONCLUSION
In this paper, we have investigated the problem of name
Figure 1. The graphical structure of a HMRF disambiguation. We have proposed a generalized probabilistic
model to the problem. We have explored a dynamic approach for
2.3 EM Algorithm estimating the person number K. Experiments show that the
Three tasks are executed by Expectation Maximization (EM): proposed method significantly outperforms the baseline methods.
learning parameters of the distance measure, re-assignment of
each paper to a cluster, and update of cluster centroid u(i).
4. REFERENCES
We define the distance function D(xi, xj) as follows [1]:
[1] S. Basu, M. Bilenko, and R. J. Mooney. A Probabilistic
T
x Ax j
i
Framework for Semi-Supervised Clustering. In Proc. of
D( xi , x j ) = 1 − , where || xi ||A = xiT Axi (6) SIGKDD’2004, pp. 59-68, Seattle, USA, August 2004
|| xi ||A || x j ||A
[2] M. Ester, R. Ge, B.J. Gao, Z. Hu, and B. Ben-Moshe. Joint
here A is a parameter matrix. Cluster Analysis of Attribute Data and Relationship Data: the
The EM process can be summarized as follows: in the E-step, Connected K-center Problem. In Proc. of SDM’2006.
given the current cluster centroid, every paper xi is re-assigned to [3] J. Hammersley and P. Clifford. Markov Fields on Finite
the cluster by maximizing p(yi|xi). In the M-step, the cluster Graphs and Lattices. Unpublished manuscript. 1971.
centers are re-estimated based on the assignments to minimize the [4] J. Tang, D. Zhang, and L. Yao. Social network extraction of
objective function L; and the parameter matrix is updated to academic researchers. Proc. of ICDM’2007. pp. 292-301
increase the objective function.
[5] D. Zhang, J. Tang, J. Li, and K. Wang. A constraint-based
2.4 Estimation of K probabilistic framework for name disambiguation. Proc. of
Our proposed strategy (see Algorithm 1) is to start by setting K as CIKM’2007. pp. 1019-1022
1 and use the BIC score to measure whether to split the current
cluster. The algorithm runs iteratively. In each iteration, we try to
1206