Abstract—Mining interrelated data among multiple types of objects or entities is important in many real-world applications. Despite extensive study on fuzzy clustering of vector space data, very limited exploration has been made on fuzzy clustering of relational data involving several object types. In this paper, we propose a new fuzzy approach for clustering multi-type relational data (FC-MR), which simultaneously clusters different types of objects. In FC-MR, an object is assigned a large membership to a cluster if its related objects in this cluster have high rankings. In each cluster, an object tends to have a high ranking if its related objects have large memberships in this cluster. The FC-MR approach is formulated to deal with multi-type relational data with various structures. The objective function of FC-MR is locally optimized by an efficient iterative algorithm which updates the fuzzy membership matrix and the ranking matrix of one type at a time while keeping those of the other types fixed. We also discuss simplified versions of FC-MR for multi-type relational data with two special structures, namely the star structure and the extended star structure. Experimental studies are conducted on benchmark document datasets to illustrate how the proposed approach can be applied flexibly under different scenarios in real-world applications. The experimental results demonstrate the feasibility and effectiveness of the new approach compared with existing ones.

Index Terms—Fuzzy clustering, relational data, multi-type, document clustering, multi-way clustering.

J.-P. Mei and L. Chen are with the Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Republic of Singapore (e-mail: meij0002@e.ntu.edu.sg; elhchen@ntu.edu.sg).

I. INTRODUCTION

Clustering has been a fundamental and efficient tool for data analysis by grouping similar objects into clusters. Compared with hard clustering, fuzzy clustering, which allows overlaps among clusters, is able to provide a more accurate and natural description of the underlying structure of real-world data. Like k-means, most existing studies on fuzzy clustering, including the well-known fuzzy c-means (FCM) [1] and some recently proposed approaches such as [2], [3], deal with vector-based data, in which each object is represented as a vector in some feature space. For example, in document clustering, a document may be represented as a vector where each feature or dimension of the vector is a distinctive word. Other than vector-represented data, clustering based on pairwise relations has also been studied for a long time, e.g., hierarchical clustering [4], k-medoid clustering [5], and fuzzy relational clustering [6]. Generally, a pairwise relation can be described by similarities or dissimilarities between each pair of objects from a given dataset. In this paper, we only consider similarity-type relations, which means that the larger the value of the relationship between two objects is, the more similar the two objects are or the more strongly the two objects are associated. In computer science, relational data can also be modeled as a graph, in which a node or vertex corresponds to an object and the weight of the edge connecting two nodes is the similarity between the two objects. Thus, a graph can be constructed based on a relation matrix, and clustering of relational data is then equivalent to partitioning the corresponding graph. Since this type of relation, e.g., the document-document relation, exists among objects of the same type, it is referred to as homogeneous relational data.

Different from traditional clustering approaches, which generate clusters of objects of the same type based on the vector representation or the pairwise relation, co-clustering approaches simultaneously cluster the rows and columns of a data matrix [7]-[12]. For example, in document co-clustering based on the document-word co-occurrence matrix, both document clusters and word clusters are produced. Co-clustering approaches were initially proposed for handling high-dimensional data such as text documents and microarray data, where the effectiveness of traditional distance-based clustering approaches degrades due to the "curse of dimensionality". In [7], co-clustering is treated as bi-partite graph partitioning by calculating the Singular Value Decomposition (SVD) of the data matrix. A bi-partite graph consists of two types of nodes, and edges only exist between nodes of different types. Since homogeneous relational data corresponds to a graph consisting of nodes of the same type, a data matrix such as the document-term matrix, which corresponds to a bi-partite graph, can be treated as bi-type heterogeneous relational data, i.e., a relation between two different object types, as illustrated in Fig. 1. From this point of view, co-clustering is actually bi-type heterogeneous relational data clustering [13], or two-way clustering, as it produces clusters for two object types simultaneously [14].

However, the data in real-world data mining applications may contain relations involving more than two types. For example, for a dataset recording information about published research papers, other than knowing the set of words each paper contains, we may also know the authors of each paper, the names of the conferences or journals where the papers are published, and the references of each paper. Thus, this dataset may be referred to as four-type relational data, which consists of four types of objects or entities, namely paper, term, author and venue (e.g., conference, journal), together with four relations: the paper-term relation, the paper-author relation, the paper-venue relation and the paper-paper relation. The first three relations characterize each individual paper with respect to different
Copyright (c) 2011 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
• letters in bold lowercase denote vectors, e.g., x.
• letters in calligraphic font denote sets, e.g., X, and |X| represents the size of set X.
• bold 1 is a vector of proper length, of which all elements are 1s; 1 is likewise a matrix of proper size, of which all entries are 1s.

... gives the sum of the rankings of each of its related objects of all the m types in cluster f, weighted by the strength of the relationships, and

M^{\mu}_{if} = \sum_{\nu=1}^{m} \sum_{j=1}^{n_\nu} r^{\nu\mu}_{ji} u^{\nu}_{jf}    (3)
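In matrix form, (3) says that the score matrix of type \mu is \sum_\nu (R^{\nu\mu})^T U^\nu. A minimal NumPy sketch with hypothetical sizes (two object types, three clusters; all names and values here are illustrative, not from the paper's datasets):

```python
import numpy as np

m, k = 2, 3                 # number of object types and clusters (toy sizes)
n = [5, 4]                  # n_nu: number of objects of each type
rng = np.random.default_rng(1)

# R[nu][mu]: relation matrix between types nu and mu, shape (n_nu, n_mu).
R = [[rng.random((n[nu], n[mu])) for mu in range(m)] for nu in range(m)]
# U[nu]: fuzzy membership matrix of type nu, shape (n_nu, k), rows sum to 1.
U = [rng.dirichlet(np.ones(k), size=n[nu]) for nu in range(m)]

def M(mu):
    # M^mu_{if} = sum_nu sum_j r^{nu,mu}_{ji} u^nu_{jf}
    return sum(R[nu][mu].T @ U[nu] for nu in range(m))

print(M(0).shape)  # (5, 3): one row per object of type mu, one column per cluster
```

The matrix product replaces the double sum over types and related objects, which is what makes the update cheap for sparse relation matrices.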
B. Solution

We use the method of Lagrange multipliers to derive local solutions for the constrained optimization problem formulated above. With vectors \gamma_\mu, \lambda_\nu, and matrices \alpha^\mu_{i,c} \in \alpha_\mu, \beta^\nu_{i,c} \in \beta_\nu being Lagrange multipliers, the Lagrangian is formed as

L = J + \sum_{\mu=1}^{m} \gamma_\mu^T (U_\mu 1 - 1) + \sum_{\nu=1}^{m} \lambda_\nu^T (V_\nu^T 1 - 1) + \sum_{\mu=1}^{m} 1^T (\alpha_\mu \odot U_\mu) 1 + \sum_{\nu=1}^{m} 1^T (\beta_\nu \odot V_\nu) 1    (10)

According to the KKT (Karush-Kuhn-Tucker) conditions,

\partial L / \partial u^\mu_{ic} = 0    (11)
\alpha^\mu_{i,c} u^\mu_{ic} = 0    (12)

With U_\mu = (u^\mu_1, u^\mu_2, \ldots, u^\mu_{n_\mu})^T and V_\mu = (v^\mu_1, v^\mu_2, \ldots, v^\mu_k), we get the membership matrix and ranking matrix of the objects in X^\mu.

The first term of (14) decides the membership distribution of each object x^\mu_i over the k clusters, while the second term is a normalization term that ensures the summation constraint is satisfied. Similarly, the first term in (18) decides the distribution of ranking values among the objects in X^\mu in each cluster c, which is normalized by the second term so that the sum of the ranking values of objects of the same type in each cluster is 1.

The last problem left is to decide K^{\mu+}_i and N^{\mu+}_c. According to the discussions in [24] and [21], it can be proved that if c \in K^{\mu+}_i, then {\forall f \in K^{\mu+}_i | f : g^\mu_{if} > g^\mu_{ic}}, and if j \in N^{\mu+}_c, then {\forall p \in N^{\mu+}_c | p : h^\mu_{pc} > h^\mu_{jc}}. Based on this, K^{\mu+}_i and
The complete procedure of the FC-MR algorithm is given in Algorithm 1. The most costly step of updating U^\mu (or V^\mu) is the calculation of G^\mu in (23) (or H^\mu in (27)). If for each \mu the relation matrix R^{\mu\nu} exists for all \nu, both of these two steps have a time complexity of O(n_\mu n_{max} k), where k is the number of clusters and n_{max} = max_\nu n_\nu is the largest of the m object set sizes. When the relation matrices are sparse, this complexity reduces to O(e^\mu_{max} k), where e^\mu_{max} = max_\nu e_{\mu\nu} and e_{\mu\nu} is the number of nonzero entries of R^{\mu\nu}. Assuming that the algorithm converges after l iterations, the total time complexity of the algorithm is O(l m n_{max}^2 k) for dense relation matrices, or O(l m e_{max} k) in the sparse case, with e_{max} = max_{\mu,\nu} e_{\mu\nu}, where m is the number of object types.

In the previous section, we presented the FC-MR algorithm, which is applicable to various structures of relational data. Now, we discuss a special case where the relational data forms a star structure. Relational data of this structure consists of one central type and several attribute types, and only central-attribute relations are considered. For m-type star-structured relational data, we assume \mu = 1 is the central type and \mu \in {2, 3, \ldots, m} are the m - 1 attribute types. The relation between the central type and the (\mu - 1)th attribute type is recorded by a matrix R^{1\mu}. The objective function of FC-MR for star-structured relational data reduces to

J_{star} = J_{star1}(U_1, {V_\mu}_{\mu=2}^{m}) + J_{star2}(V_1, {U_\mu}_{\mu=2}^{m})    (30)
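The one-type-at-a-time scheme described above can be sketched structurally as follows. The functions `update_membership` and `update_ranking` below are placeholders for the actual update rules (23) and (27), which are not reproduced here; simple aggregate-and-renormalize steps are used purely to show the control flow and the per-step cost:

```python
import numpy as np

m, k = 3, 2                 # toy sizes: three object types, two clusters
n = [6, 5, 4]
rng = np.random.default_rng(2)

# Assume, for illustration, a relation matrix exists for every type pair.
R = {(mu, nu): rng.random((n[mu], n[nu]))
     for mu in range(m) for nu in range(m)}
U = [rng.dirichlet(np.ones(k), size=n[mu]) for mu in range(m)]  # memberships
V = [np.full((n[mu], k), 1.0 / n[mu]) for mu in range(m)]       # rankings

def update_membership(mu):
    # Placeholder for (23): aggregate rankings of related objects,
    # then renormalize each row so memberships sum to 1.
    G = sum(R[(mu, nu)] @ V[nu] for nu in range(m))
    return G / G.sum(axis=1, keepdims=True)

def update_ranking(mu):
    # Placeholder for (27): aggregate memberships of related objects,
    # then renormalize each column so per-cluster rankings sum to 1.
    H = sum(R[(nu, mu)].T @ U[nu] for nu in range(m))
    return H / H.sum(axis=0, keepdims=True)

for iteration in range(10):       # the l iterations until convergence
    for mu in range(m):           # update one type at a time,
        U[mu] = update_membership(mu)   # keeping the other types fixed
        V[mu] = update_ranking(mu)
```

Each inner step is a dense (n_mu x n_nu)(n_nu x k) product, matching the O(n_mu n_max k) cost stated above.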
J_{star1} = \sum_{c=1}^{k} \sum_{j=1}^{n_2} \sum_{i=1}^{n_1} u_{ci} r_{ij} v_{cj} - T_u \sum_{c=1}^{k} \sum_{i=1}^{n_1} u_{ci}^2 - T_v \sum_{c=1}^{k} \sum_{j=1}^{n_2} v_{cj}^2    (39)

which is the objective function of the fuzzy co-clustering approach called FCoDok [11]. This means that FCoDok is a special case of the proposed FC-MR with m = 2, which handles bi-type star-structured relational data and produces

C. FC-MR for Homogeneous Relational Data

In the case when only the matrix R_{11}, which records the relationships between each pair of objects of the central type, is known, FC-MR reduces to homogeneous relational data clustering. The objective function becomes

J_{hom} = Tr(U_1^T R_{11} V_1) - \frac{\phi_1}{2} ||U_1||_F^2 - \frac{\theta_1}{2} ||V_1||_F^2.    (43)

If a dissimilarity matrix D is defined as

D = r_{max} 1 - R_{11}    (44)
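The conversion in (44) from the similarity-type relation R_{11} to a dissimilarity matrix D can be sketched as follows, on a hypothetical symmetric toy matrix (bold 1 is the all-ones matrix):

```python
import numpy as np

# Hypothetical symmetric similarity relation among 4 central-type objects.
R11 = np.array([[1.0, 0.8, 0.1, 0.0],
                [0.8, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.9],
                [0.0, 0.1, 0.9, 1.0]])

r_max = R11.max()
# D = r_max * 1 - R11: the most similar pairs become the least dissimilar.
D = r_max * np.ones_like(R11) - R11

print(D[0, 0], D[0, 3])  # 0.0 1.0: zero on the diagonal, r_max for unrelated pairs
```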
B. Experiments on 20Newsgroups Data

In this experiment, FC-MR handles tri-type star-structured relational data.

1) Data: The original data of 20newsgroups¹ contains 18828 non-duplicated documents, which are categorized into 20 topics. As in [13] and [16], we generate three subsets consisting of different subtopics, as listed in Table II. For example, dataset TM1 consists of clusters C1 and C2, where C1 is the rec.sport cluster, which contains two subtopics or categories on baseball and hockey, and C2 is the talk.politics cluster, which has three subtopics on guns, mideast and misc.

TABLE II
STRUCTURE OF NEWSGROUP DATA SETS

TM1  C1: {rec.sport.baseball, rec.sport.hockey}
     C2: {talk.politics.guns, talk.politics.mideast, talk.politics.misc}
TM2  C1: {comp.graphics, comp.os.ms-windows.misc}
     C2: {rec.autos, rec.motorcycles}
     C3: {sci.crypt, sci.electronics}
TM3  C1: {comp.sys.ibm.pc.hardware, comp.sys.mac.hardware}
     C2: {rec.autos, rec.motorcycles}
     C3: {sci.med, sci.space}
     C4: {talk.politics.guns, talk.politics.mideast}
Each of the three datasets is formed by randomly selecting 100 documents from each of the chosen subtopics.

¹http://people.csail.mit.edu/jrennie/20Newsgroups/

2) Relations Derived:
• Document-Word Relation: For each dataset, we use the rainbow toolkit [28] to get the document-word co-occurrence relation. Stop-words have been removed and the 2000 words with the largest information gain are kept. Non-text documents are skipped by rainbow. After preprocessing, documents consisting of fewer than two words are removed. Finally, we have 497, 598 and 794 documents for TM1, TM2 and TM3, respectively. The tf-idf weighting [29] is used to weight the words in each document.
• Document-Category Relation: Other than the document-word relation, which captures the document content through statistical information in terms of word frequency, we also generate another relation matrix, document-category, to indicate which subtopic each document belongs to. Each subtopic is a category, and the relation matrix is a binary matrix whose entry is 1 if the document is from the corresponding subtopic and 0 otherwise.

These two relations, i.e., the document-word relation represented as R_{12} and the document-category relation represented as R_{13}, form a tri-type star structure, where the central type is document and the two attribute types are word and category. In this experiment, we use the normalized relation matrices, i.e., R_{12} = D_r^{-0.5} R_{12} D_c^{-0.5}, where D_r and D_c are two diagonal matrices: each main diagonal entry of D_r is the sum of the corresponding row of R_{12}, and each main diagonal entry of D_c is the sum of the corresponding column of R_{12}. Similar normalization is performed on R_{13}.

For these three 20newsgroups datasets, the document-word relation alone may be used for document clustering, as it provides the information that documents containing similar words should be labeled in the same cluster. The document-category relation, although it clearly shows which documents are in the same subtopic, provides no indication of which subtopics are related to the same topic or cluster. Therefore, the document-category relation alone should not be used for the clustering process. A better clustering result may be produced based on a joint analysis of these two relations.

3) FC-MR vs. HFCM and FCoDok: To see whether FC-MR can make use of the additional document-category relation to achieve any improvement in the clustering results, we first compare the document clusters generated by HFCM [30] and FCoDok [11] based on R_{12} with those generated by FC-MR based on both R_{12} and R_{13}. HFCM is a modified fuzzy c-means clustering based on cosine distance, and FCoDok is a fuzzy co-clustering approach. In our experiment, the number of clusters k is set to be equal to the real number of classes for each dataset. For FC-MR, we follow the guideline given earlier to set the parameters. For the other approaches, we tried different values of the parameters on a grid and chose those that give the best results. In this experiment, we find the following settings allow each approach to perform well on the three datasets: T_u = 0.001, T_v = 1 for FCoDok, m = 1.02 for HFCM,
and \phi_1 = 0.01, \theta_2 = 1, \theta_3 = 0.03 for FC-MR. The weights of the two matrices, \beta_{12} and \beta_{13}, are set to be equal in FC-MR. For all fuzzy approaches, a truncated partitioning is obtained by assigning each document to the cluster with the largest membership.

Table III to Table V give the best clustering results that each approach can achieve among 30 trials with random initializations on the three datasets, respectively. From these contingency tables, it can be seen that with a proper initialization, FC-MR is able to label all the documents correctly on all three datasets, while FCoDok and HFCM mis-cluster different numbers of documents on different datasets. The numbers of documents mislabeled by FCoDok and HFCM on TM1 and TM2 are close, but more documents are labeled correctly by FCoDok than by HFCM on TM3. Although FCoDok and HFCM make use of the same document-word matrix, HFCM treats each document as a vector and uses cosine similarity to measure the closeness of two documents, while FCoDok treats document and word as two different object types, and documents are clustered based on the ranking of words. There is no explicit similarity measure defined in FCoDok or FC-MR.

To make a more detailed comparison, we plot u_1 = {u_{1,1}, u_{2,1}, \ldots, u_{500,1}}, the membership values of the documents in C1 of TM1 produced by HFCM and FC-MR, in Fig. 4. According to the summation constraints, the memberships of these objects in the other cluster follow u_2 = 1 - u_1. For TM1, the ground truth partitioning is that the first 198 documents are in C1: rec.sport and the rest are in C2: talk.politics. It can be seen from Fig. 4a that in HFCM, for the 21st document, u_{21,1} < u_{21,2}, although the ground truth label of this document is C1. At the same time, six documents (labeled by circles) which should be in C2 are assigned larger memberships in C1. These results may indicate that the representation of these documents in terms of word frequency is not sufficient for assigning them to the right clusters. The document-category relation helps to clarify the uncertainty or correct the misleading information in the document-word relation. For example, according to the document-category relation, it is known that the 21st document should possibly be grouped together with the other 99 documents, as they belong to the same category. This is confirmed in Fig. 4b: when the document-category relation is further incorporated, the previously mis-clustered documents are assigned reasonable memberships in the two clusters, consistent with the ground truth. Although from the document-category matrix it can be told which documents are in the same subtopic, this relation alone delivers no information on the hierarchical structure of the subtopics, i.e., which subtopics should be combined into higher-level clusters. Therefore, clustering based only on the document-category relation does not produce any meaningful result, which turns out to be a random combination of those subgroups into the specified number of clusters.

4) Comparison of Accuracy and NMI: Other than the two fuzzy approaches HFCM and FCoDok, we also compare FC-MR with two existing multi-type relational data clustering approaches:
• SRC [16]: Multi-type relational clustering based on eigendecomposition.
• NMF-H [18]: A non-negative matrix factorization approach for star-structured heterogeneous relational data.
In this experiment, the results of HFCM and FCoDok are obtained based on X, the data matrix formed by combining R_{12} and R_{13}, i.e., X = [R_{12} R_{13}].

Two external metrics, Accuracy and Normalized Mutual Information (NMI) [31], are used to evaluate the clustering results; they measure the degree of agreement between the results produced by a clustering algorithm and the ground truth. If we refer to classes as the ground truth and clusters as the results generated by a clustering algorithm, the NMI score is calculated with the following formula:

NMI = \frac{\sum_{c=1}^{k} \sum_{j=1}^{f} n^j_c \log(\frac{n \cdot n^j_c}{n_c \cdot n^j})}{\sqrt{(\sum_{c=1}^{k} n_c \log \frac{n_c}{n})(\sum_{j=1}^{f} n^j \log \frac{n^j}{n})}}    (47)

where n is the total number of documents, n_c and n^j are the numbers of documents in the cth cluster and the jth class, respectively, and n^j_c is the number of common documents in class j and cluster c. In our experiments, the number of clusters k is set to be equal to the number of classes f.

Another metric, Accuracy, is calculated as below after obtaining a one-to-one matching between the k clusters and the k classes:

Accuracy = \frac{1}{n} \sum_{c=1}^{k} n^p_c    (48)

where n^p_c is the number of common objects in the cth cluster and its matched class p. Higher Accuracy and NMI indicate more consistency between the algorithm-generated clusters and the ground truth classes, and thus a better clustering result. Both Accuracy and NMI equal 1 only when the partitioning produced by an algorithm is identical to the ground truth classes.

For each dataset, each approach is run for 30 trials. To avoid bad initializations, with which the word-category relation may dominate the clustering process and produce random combinations of subtopics, we first run HFCM with random initializations based only on the document-word relation, and then use the document fuzzy memberships produced as the initial document partitioning in all five approaches. Table VI shows the mean and standard deviation of the Accuracy (%) and NMI (%) values over the 30 trials.

TABLE VI
COMPARISON OF ACCURACY AND NMI ON NEWSGROUP DATA

Accuracy
             TM1              TM2              TM3
HFCM      97.32 ± 10.21    89.49 ± 14.12    86.30 ± 14.76
FCoDok    97.32 ± 10.21    89.49 ± 14.12    86.71 ± 14.23
NMF       97.26 ± 10.20    86.30 ± 9.38     87.53 ± 13.88
SRC      100.00 ± 0.00     75.88 ± 13.96    77.57 ± 22.86
FC-MR    100.00 ± 0.00     99.16 ± 4.61     95.52 ± 11.78

NMI
             TM1              TM2              TM3
HFCM      97.32 ± 24.89    89.49 ± 18.57    86.30 ± 14.34
FCoDok    97.32 ± 24.89    89.49 ± 18.57    86.71 ± 14.54
NMF       97.26 ± 24.80    86.30 ± 13.83    87.53 ± 12.40
SRC      100.00 ± 0.00     72.06 ± 3.82     85.85 ± 15.87
FC-MR    100.00 ± 0.00     99.16 ± 7.23     95.52 ± 12.03
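The two metrics defined in (47) and (48) can be sketched as follows; the one-to-one matching for Accuracy is done here by brute force over permutations, which is a stand-in adequate for small k (the toy labels are illustrative only):

```python
import math
from collections import Counter
from itertools import permutations

def nmi(labels_true, labels_pred):
    """NMI as in (47): mutual information between clusters and classes,
    normalized by the geometric mean of the two entropy terms."""
    n = len(labels_true)
    classes = Counter(labels_true)     # n^j: class sizes
    clusters = Counter(labels_pred)    # n_c: cluster sizes
    joint = Counter(zip(labels_pred, labels_true))  # n^j_c
    mi = sum(njc * math.log(n * njc / (clusters[c] * classes[j]))
             for (c, j), njc in joint.items())
    h_clusters = sum(nc * math.log(nc / n) for nc in clusters.values())
    h_classes = sum(nj * math.log(nj / n) for nj in classes.values())
    return mi / math.sqrt(h_clusters * h_classes)

def accuracy(labels_true, labels_pred):
    """Accuracy as in (48): best one-to-one matching between the k
    clusters and the k classes (brute force; fine for small k)."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    joint = Counter(zip(labels_pred, labels_true))
    best = max(sum(joint[(c, p)] for c, p in zip(clusters, perm))
               for perm in permutations(classes))
    return best / len(labels_true)

# Toy ground truth and clustering; cluster ids are arbitrary labels.
true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
print(accuracy(true, pred))  # 0.8333...: 5 of the 6 documents match
```

For larger k, scipy.optimize.linear_sum_assignment (the Hungarian algorithm) would replace the brute-force matching.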
TABLE III
CONTINGENCY TABLE OF TM1

TABLE IV
CONTINGENCY TABLE OF TM2

TABLE V
CONTINGENCY TABLE OF TM3

Fig. 4. Fuzzy memberships of documents with respect to C1 of TM1 by HFCM and FC-MR. The (horizontal) x-axis denotes the document id (i = 1, \ldots, 500) and the (vertical) y-axis shows the value of the membership u_{i1}. In (a), small circles label objects that are clustered incorrectly; the annotated point in (a) is i = 21, with u_{i1} = 0.3886.
It can be seen that FC-MR gives the highest Accuracy and NMI values among the five approaches on all three datasets. The performances of HFCM, FCoDok and NMF-H are very close. Although SRC gives the same best results as FC-MR on TM1, its results on the other two datasets are even worse than those of the other three approaches. These results show that FC-MR achieves a significant improvement in document clustering compared with existing vector-based fuzzy clustering and fuzzy co-clustering with combined relations, and also performs much better than the nonnegative matrix factorization based and spectral clustering based multi-type relational data clustering approaches.

5) Ranking of Words and Categories: Together with the document clusters, we also obtain the word rankings and category rankings. The ranking values of each category in each cluster are shown in the three matrices given in (49) to (51) for TM1, TM2 and TM3, respectively, where each row indicates a category and each column indicates a cluster or topic. The top 10 words of each cluster of the three datasets are also listed in Table VII. As attribute types, category and word can be used for description or interpretation of the document clusters. We can see that the key categories and key words which have large ranking values provide meaningful information about what topic each document cluster is possibly related to. If we take C1 of TM1 for example, from its associated key categories, named baseball and hockey, and key words, such as game, hockey, team, players, baseball, we may guess that documents of this
cluster are about sports, possibly on hockey and baseball. Such information can be used as a short summary of the cluster to give a quick sense of the whole document cluster without looking through each of the complete long documents.

TABLE VII
TOP 10 WORDS IN EACH CLUSTER OF THREE NEWSGROUP DATASETS GENERATED BY FC-MR

                     rec.sport   talk.politics
baseball               0.4327        0
hockey                 0.5673        0
guns                   0             0.3067        (49)
mideast                0             0.3506
misc                   0             0.3428

                       comp        rec         sci
graphics              0.4716        0           0
os.ms-windows.misc    0.5284        0           0
autos                  0           0.4579       0           (50)
motorcycles            0           0.5421       0
crypt                  0            0          0.7018
electronics            0            0          0.2982

                     comp.sys      rec         sci      talk.politics
ibm.pc.hardware       0.5265        0           0            0
mac.hardware          0.4735        0           0            0
autos                  0           0.3749       0            0
motorcycles            0           0.6251       0            0           (51)
med                    0            0          0.4056        0
space                  0            0          0.5944        0
guns                   0            0           0           0.3530
mideast                0            0           0           0.6470

C. Experiments on Cora Data Sets

In this experiment, we compare the clustering accuracy of FC-MR with both fuzzy and non-fuzzy approaches on five Cora datasets. FC-MR treats each dataset as bi-type relational data with an extended star structure.

1) Data: The Cora data [23] contains the abstracts and references of computer science papers published in the conferences and journals of different research areas, such as artificial intelligence, information retrieval and hardware. A typical sample record is shown in Fig. 5. Five datasets, each corresponding to a research area in computer science, are used in our experiment. We use the processed data previously used in [32] and [33]. A summary of each of them is given in Table VIII.

2) Relations Derived:
• Paper-paper relation: The value of the relationship between two papers is 1 if either of them is in the reference list of the other, and 2 if the two papers cite each other.
• Paper-term relation: Each entry of the paper-term matrix is the number of occurrences of the term in the abstract of the corresponding paper.

We use R_{11} to denote the paper-paper citation relation and R_{12} the paper-term relation. For these Cora datasets, both the paper-paper relation and the paper-term relation may be used alone for document clustering. However, the content contained in the abstracts of the Cora data is much less complete than that of the 20newsgroups data, and some papers in the Cora collection are recorded without any abstract. According to [32], the five datasets we use are preprocessed by removing only papers without references. In other words, a paper without an abstract but with at least one reference is kept during preprocessing. Based on this knowledge, here we let the citation relation have a higher weight than the paper-term relation.

3) Algorithms and Settings: Other than the two fuzzy approaches HFCM and FCoDok and the two multi-type relational data clustering approaches, we also compare FC-MR with another state-of-the-art approach, iTopicModel [27], a topic modeling approach which considers both the text and the structure information of documents. The results of HFCM and FCoDok are obtained based on X, the data matrix formed by combining R_{11} and R_{12}, i.e., X = [\beta_{11} R_{11}  \beta_{12} R_{12}]. In SRC and NMF-H, R_{11} is treated as a relation between two different types, as these two approaches only consider such relations. The iTopicModel approach establishes generative models by making use of both the content information R_{12} and the structure information R_{11}. For FC-MR, each dataset with the paper-paper relation and the paper-term relation forms a bi-type extended star structure. In this experiment, we find that each of the approaches performs well with the following settings: \phi_1 = 0.01, \theta_1 = \theta_2 = 1 for FC-MR, T_u = 0.01, T_v = 0.1 for FCoDok, and m = 1.02 for HFCM. The weights \beta_{11} = 0.8 and \beta_{12} = 0.2 of the two relation matrices are used for all five datasets.

4) Comparisons of Accuracy and NMI: Each approach is run for 30 trials on each dataset with random initializations. Table IX shows the means and standard deviations of the Accuracy (%) and NMI (%) values over the 30 trials.
URL: http://pertsserver.cs.uiuc.edu/papers/HaLi94a.ps
...
Reference: [2] <author> K.G. Shin and Y.C. Chang. </author> <title> Load sharing in distributed real-time systems with state-change broadcasts. </title> <journal> IEEE Transactions on Software Engineering, </journal> <volume> 38(8) </volume> <pages> 1124-1142, </pages> <month> August </month> <year> 1989. </year>
References-found: 13

Fig. 5. A sample record of extracted information of a paper in the Cora data. Each record usually consists of three parts: the header part includes the URL, Title, Author and some other related information like Address, Affiliation and Email; the second part is the content of the abstract; the third part is the list of formatted references.
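From records like the one in Fig. 5, the paper-paper relation defined above (1 for a one-way citation, 2 for a mutual citation) can be built as in this sketch; the paper ids and reference lists are hypothetical:

```python
# Hypothetical paper ids and their reference lists (references to papers
# outside the collection are assumed to have been filtered out already).
papers = ["p0", "p1", "p2", "p3"]
refs = {
    "p0": ["p1", "p2"],
    "p1": ["p0"],        # p0 and p1 cite each other -> entry 2
    "p2": [],
    "p3": ["p2"],
}

index = {p: i for i, p in enumerate(papers)}
n = len(papers)
R11 = [[0] * n for _ in range(n)]

for p, cited in refs.items():
    for q in cited:
        i, j = index[p], index[q]
        # Symmetric relation: 1 if either paper cites the other,
        # 2 if both do.
        R11[i][j] += 1
        R11[j][i] += 1

print(R11[0][1], R11[0][2], R11[0][3])  # 2 1 0
```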
TABLE VIII
CORA DATA SETS
From this table, it is clearly seen that FC-MR achieves significant improvements over the other five approaches on all five datasets with respect to both Accuracy and NMI. The overall performance of the other five approaches is close. For some datasets, the performances of HFCM and FCoDok based on the concatenated data are even slightly better than those of the other three approaches. Since SRC and NMF-H only handle heterogeneous relations, the homogeneous paper-paper relation cannot be used effectively. Among these five approaches, SRC performs better than the other four on ML, but its results on OS are the worst of all. The performance of iTopicModel on these five datasets is not as good as expected. It is also observed that, compared with the 20newsgroups datasets, the Accuracy and NMI values for the Cora datasets are much lower. This is mainly because the clusters of the 20Newsgroups datasets are well separated, since each cluster corresponds to a topic and all the topics are very different, e.g., sport and politics, while the clusters of the Cora datasets overlap more, as each cluster is a subfield and all subfields belong to the same research field. In addition, for each document in a 20Newsgroups dataset, we make sure that it contains at least two words excluding stop words, while quite a few documents in the Cora data have no recorded abstract, which means these documents contain no words. Both the overlaps among clusters and the poorer content information make the Cora datasets more challenging for clustering.
TABLE IX
COMPARISON OF ACCURACY AND NMI ON CORA DATA

Accuracy
Method        DS            HA            ML            OS            PL
HFCM          39.17 ± 2.84  37.33 ± 2.88  49.25 ± 3.77  44.99 ± 2.91  35.65 ± 1.91
FCoDok        37.15 ± 2.54  40.94 ± 3.34  47.10 ± 4.88  49.64 ± 1.87  32.12 ± 2.23
NMF-H         34.07 ± 2.66  40.08 ± 5.57  40.85 ± 5.03  46.83 ± 5.02  30.56 ± 2.86
SRC           34.92 ± 2.33  41.41 ± 2.19  52.21 ± 3.29  44.79 ± 6.56  30.84 ± 1.58
iTopicModel   29.85 ± 3.00  34.38 ± 4.17  39.97 ± 4.66  50.81 ± 6.65  33.30 ± 2.66
FC-MR         44.39 ± 3.53  46.64 ± 3.49  66.68 ± 2.65  63.32 ± 4.33  42.18 ± 1.98

NMI
Method        DS            HA            ML            OS            PL
HFCM          28.11 ± 2.68  32.55 ± 3.30  34.26 ± 2.50  23.97 ± 2.12  28.30 ± 1.70
FCoDok        25.06 ± 1.80  35.72 ± 2.88  27.95 ± 4.11  26.43 ± 2.93  21.92 ± 1.01
NMF-H         22.76 ± 1.99  26.84 ± 4.01  19.90 ± 3.54  19.01 ± 2.73  16.44 ± 1.64
SRC           22.25 ± 1.51  32.29 ± 1.94  36.01 ± 2.08  15.30 ± 2.28  21.46 ± 0.91
iTopicModel   17.08 ± 2.36  16.11 ± 3.19  21.71 ± 2.95  16.59 ± 4.16  17.92 ± 1.75
FC-MR         38.24 ± 1.84  42.38 ± 3.27  49.75 ± 1.83  35.39 ± 1.97  31.92 ± 1.30
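As a concrete illustration of the NMI measure reported above, the following sketch computes Normalized Mutual Information between a clustering result and the ground-truth labels, following the definition of Strehl and Ghosh [31], NMI = I(X;Y) / sqrt(H(X)H(Y)). This is an illustrative implementation, not the exact code used in the experiments; the function name and signature are our own.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized Mutual Information between two label assignments,
    NMI = I(X;Y) / sqrt(H(X) * H(Y)), in [0, 1]."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))  # contingency counts n_xy
    px = Counter(labels_true)                       # cluster sizes in truth
    py = Counter(labels_pred)                       # cluster sizes in result

    # Mutual information I(X;Y) = sum_xy p(x,y) log( p(x,y) / (p(x)p(y)) )
    mi = sum((nxy / n) * math.log((nxy * n) / (px[x] * py[y]))
             for (x, y), nxy in joint.items())

    # Marginal entropies H(X), H(Y)
    hx = -sum((c / n) * math.log(c / n) for c in px.values())
    hy = -sum((c / n) * math.log(c / n) for c in py.values())

    if hx == 0.0 or hy == 0.0:  # degenerate single-cluster case
        return 0.0
    return mi / math.sqrt(hx * hy)
```

Note that NMI is invariant to a permutation of cluster labels, so, unlike Accuracy, it requires no matching step between predicted clusters and true classes.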
[9] B. Long, Z. Zhang, and P. S. Yu, "Co-clustering by block value decomposition," in Proc. KDD, 2005.
[10] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha, "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation," Journal of Machine Learning Research, vol. 8, pp. 1919–1986, 2007.
[11] K. Kummamuru, A. Dhawale, and R. Krishnapuram, "Fuzzy co-clustering of documents and keywords," in Proc. 12th IEEE Int. Conf. Fuzzy Systems, 2003.
[12] W.-C. Tjhi and L. Chen, "Dual fuzzy-possibilistic coclustering for categorization of documents," IEEE Trans. Fuzzy Syst., vol. 17, pp. 532–543, 2009.
[13] B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma, "Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering," in Proc. KDD, 2005.
[14] R. Bekkerman, R. El-Yaniv, and A. McCallum, "Multi-way distributional clustering via pairwise interactions," in Proc. ICML, 2005.
[15] B. Long, X. Wu, Z. Zhang, and P. S. Yu, "Unsupervised learning on k-partite graphs," in Proc. KDD, 2006.
[16] B. Long, Z. Zhang, X. Wu, and P. S. Yu, "Spectral clustering for multi-type relational data," in Proc. 23rd Int. Conf. Machine Learning, 2006, pp. 585–592.
[17] A. Banerjee, S. Basu, and S. Merugu, "Multi-way clustering on relation graphs," in Proc. SIAM Int. Conf. Data Mining, 2007.
[18] Y. Chen, L. Wang, and M. Dong, "Non-negative matrix factorization for semisupervised heterogeneous data coclustering," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1459–1474, 2010.
[19] B. Long, Z. Zhang, and P. S. Yu, "A probabilistic framework for relational clustering," in Proc. KDD, 2007.
[20] Y. Sun, Y. Yu, and J. Han, "Ranking-based clustering of heterogeneous information networks with star network schema," in Proc. KDD, 2009.
[21] J.-P. Mei and L. Chen, "Fuzzy clustering with weighted medoids for relational data," Pattern Recognition, vol. 43, pp. 1964–1974, 2010.
[22] K. Lang, "NewsWeeder: learning to filter netnews," in Proc. 12th Int. Conf. Machine Learning, 1995, pp. 331–339.
[23] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the construction of internet portals with machine learning," Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.
[24] S. Miyamoto and K. Umayahara, "Fuzzy clustering by quadratic regularization," in Proc. IEEE Int. Conf. Fuzzy Systems, 1998.
[25] Z. Guo, S. Zhu, Y. Chi, Z. M. Zhang, and Y. Gong, "A latent topic model for linked documents," in Proc. SIGIR, 2009.
[26] Q. Mei, D. Cai, D. Zhang, and C. Zhai, "Topic modeling with network regularization," in Proc. WWW, 2008.
[27] Y. Sun, J. Han, J. Gao, and Y. Yu, "iTopicModel: information network-integrated topic modeling," in Proc. ICDM, 2009.
[28] A. K. McCallum, "Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering," 1996. [Online]. Available: http://www.cs.cmu.edu/~mccallum/bow/
[29] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.
[30] M. E. S. Mendes and L. Sacks, "Evaluating fuzzy clustering for relevance-based information access," in Proc. IEEE Int. Conf. Fuzzy Systems, 2003.
[31] A. Strehl and J. Ghosh, "Cluster ensembles—a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, vol. 3, pp. 583–617, 2002.
[32] S. Zhu, K. Yu, Y. Chi, and Y. Gong, "Combining content and link for classification using matrix factorization," in Proc. SIGIR, 2007.
[33] D. Zhang, F. Wang, C. Zhang, and T. Li, "Multi-view local learning," in Proc. AAAI, 2008.

Jian-Ping Mei received her B.Eng. degree from the School of Electronic and Information Engineering at Ningbo University, China, in 2005, and her M.Eng. degree from the School of Information Science and Electronic Engineering at Zhejiang University, China, in 2007. She is currently working towards the Ph.D. degree at Nanyang Technological University, Singapore. Her research interests include machine learning algorithms and applications to Web mining and bioinformatics.

Lihui Chen received the B.Eng. degree in Computer Science & Engineering from Zhejiang University, China, and the Ph.D. degree in Computational Science from the University of St. Andrews, UK. Currently she is an Associate Professor in the Division of Information Engineering at Nanyang Technological University, Singapore. Her research interests include machine learning algorithms and applications, data mining and Web intelligence. She has published more than seventy refereed papers in international journals and conferences in these areas. She is a senior member of the IEEE, and a member of the IEEE Computational Intelligence Society.