
The Congruity Between Linkage-Based Factors and Content-Based Clusters—An Experimental Study Using Multiple Document Corpora

Tsung Teng Chen


Graduate Institute of Information Management, National Taipei University, 151 University Road, San Shia
District, New Taipei City 23741, Taiwan. E-mail: misttc@mail.ntpu.edu.tw

Intellectual Structure (IS) is a bibliometric method that is widely applied in knowledge domain analysis and in science mapping. An intellectual structure consists of clusters of related documents ascribed to individual factors. Documents ascribed to a factor are generally associated with a common research theme. As such, the contents of documents ascribed to a factor are theorized to be similar to each other. This study shows that link-based relatedness implies content-based similarity. The intellectual structures of two research domains were derived from data sets retrieved from the Microsoft Academic Search database. The collection of documents ascribed to a factor is referred to as a factor-based document cluster, with which the content-based document clusters are compared. All documents in an intellectual structure are re-clustered based on their content similarity, which is derived from the cosine of their vector form encoded in documents’ term frequency inverse document frequency (TF-IDF) weighted terms. The factor-based document clusters are then compared with the content-based clusters for congruity. We used the Rand index and kappa coefficient to check the congruity between the factor-based and content-based document clusters. The kappa coefficient indicates that there is fair to moderate agreement between the clusters derived from these two different bases.

Received August 30, 2013; revised August 15, 2014; accepted August 18, 2014
© 2015 ASIS&T • Published online 27 March 2015 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.23413
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 67(3):610–619, 2016

Introduction

Intellectual structure is a popular bibliometric method that has been applied to many studies in various research domains (Culnan, 1987; Hirschheim, Klein, & Lyytinen, 1996; Hoffman & Holbrook, 1993; Locke & Perera, 2001; Pilkington & Meredith, 2009; Pratt, Hauser, & Sugimoto, 2012; Rorissa & Yuan, 2012). The foundation of the intellectual structure building method is citation and cocitation analysis. The citation analysis method was pioneered by Garfield et al. (Garfield, Sher, & Torpie, 1964). It is used to analyze documents’ citation relationships. The citation relationships may be abstracted into a citation network represented by a graph. A citation network reveals the citation relationships (links) between articles (nodes). It may also expose important nodes (articles) via the structure of the network. The concept of cocitation was pioneered by Small as an analytic tool for the interpretation of the results of literature analysis (Small, 1973). Cocitation is an induced relationship derived from the act of citing; two articles are cocited if there exists another article that cites both of them. Cocitation is an undirected relationship, whereas citation is a directed one. The intellectual structure of a research domain is derived from the cocitation network built from the literature of the domain. Cocitation analysis uses bibliographic data to derive documents’ relatedness strengths, which are usually represented by the cocitation counts between documents. A cluster of cocited documents is considered to represent the knowledge base of a specialty: the key concepts, methods, or experiments that researchers build on (Small, 1977). Similar to citation relationships, the cocitation relationships also can be abstracted into a network represented by a graph. A cocitation relationship is represented by an undirected edge, whereas citation is represented by a directed arc. Graph-based relationships, such as
citation and cocitation, are conveniently represented in a matrix form. A “1” in the matrix’s cell indicates the presence of a relationship; a “0” in a cell denotes the absence of a relationship between the nodes abstracted in the corresponding column and row of the matrix. If we take the matrix form of the citation relationship, then the matrix form of the cocitation relationship can be derived by multiplying the citation matrix by its own transpose. The values in the diagonal of the cocitation matrix are computed by summing the three highest values of each row of the cocitation matrix and dividing by two, thereby indicating in a general way the relative importance of a particular document within the field (White & Griffith, 1981). The resultant cocitation matrix is subjected to factor analysis that combines sets of correlated variables (cocited documents) into factors (research themes). As such, hundreds of documents can be reduced to tens of factors. Each factor is regarded as a surrogate for the set of documents ascribed to it. Documents ascribed to a factor usually share a common research theme or topic. The clusters of documents surrogated by their corresponding factors are the foundation of the intellectual structure building methodology. The set of documents ascribed to a factor is derived from the cocitation linkages existing among these documents. The strength of the relatedness relationship between two documents is proportional to the number of cocitation edges existing between them. The research question is “Does the relatedness of literatures imply content similarity?” If we can provide evidence that link-derived relatedness between documents implies that they have similar content, it will provide the link-based intellectual structure building methodology with further support. Although text-based science mapping methods and their variants have been studied (Janssens, Leta, Glänzel, & De Moor, 2006; Janssens, Glänzel, & Moor, 2008), intellectual structure and science mapping have predominantly utilized link-based citation approaches for their computational efficiency over text-based approaches.

Related Studies

Intellectual structure is conventionally derived from cocitation relationships between documents. Cocitation-based analysis has been the prevalent approach since the 1970s. However, there has been a resurgence of interest in the bibliographic coupling-based approach that has inspired researchers to compare the merits of these two approaches (Boyack & Klavans, 2010). As we described previously, the clusters of documents ascribed to their corresponding factors are derived from the cocitation linkages between them. The relatedness of documents can be inferred from their cocitation linkage as well as their content similarity (Klavans & Boyack, 2006; Yazdani & Popescu-Belis, 2013). As such, documents may be clustered based on content similarity instead of link relations. However, we could find only one study in the literature that applied content-based clustering to build the intellectual structure of archival studies (Kim & Lee, 2008). In that study, 432 articles were clustered based on their cosine similarity, which is computed from the term frequency-inverse document frequency (TF-IDF) weighted terms’ vector of each document. The text-mining content-based approach is held to provide comparable information to the traditional link-based approach. Science mapping using the bibliographic coupling approach was compared to the content-based approach (Ahlgren & Jarneving, 2008). In that study, a collection of articles in the information retrieval field was clustered based on their bibliographic coupling as well as content similarity. The resultant agreement between clusters derived from the link-based and content-based clustering methods was low. The content-based approach is shown to provide supplementary information to the traditional link-based approach. A study that compared the text-based approach, the citation-based bibliographic approach, and a combined approach that involved nine methods in total was carried out (Ahlgren & Colliander, 2009). One text-based method and one combined method among the nine methods were found to be largely in agreement with the ground truth subjective classification given by a domain expert in the study. A variant of the combined bibliographic and text-based approach that adjusts the cocitation weights (normalized frequencies) based on proximity between reference pairs’ positions in full text found the textual coherence of a cocitation cluster increased by 9–20% over the traditional bibliographic only approach (Boyack, Small, & Klavans, 2013).

Data and Methods

We retrieved two citation data sets from the online citation database Microsoft Academic Search (Carlson, 2006) in 2013. Citation data sets were collected by querying the database with the key phrases “text mining application” on February 21 and “citation analysis” on March 29. Both queries limited the search to literature published after 2010. The queries returned 298 seed papers for “text mining application” and 210 seed papers for “citation analysis.” Our citation expansion scheme then took these seed set papers as the initial literature collection, and retrieved papers that cite or are cited by papers in the initial seed set. This two-way literature search is recommended by journal editors for writing literature reviews (Webster & Watson, 2002). The full citation graph was built by connecting all reference links between articles collected by the two-way citation expansion method. The basic statistics of the two data sets are listed in Table 1.

TABLE 1. The basic statistics of the two data sets.

Name\Attributes           Seed count   Paper count   Citation arcs count   # of nodes in factors
Text Mining Application   298          3,428         3,870                 159
Citation Analysis         210          3,272         3,731                 215

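The derivation of the cocitation matrix outlined in the Introduction (multiplying the citation matrix by its transpose, then filling the diagonal) can be illustrated with a minimal Python sketch. It is not the code used in the study. The function name `cocitation_matrix`, the orientation of the citation matrix (row i cites column j), and the reading of the diagonal rule as "sum of the three highest off-diagonal values in the row, halved" are assumptions made for illustration.

```python
def cocitation_matrix(citations):
    """Build a cocitation matrix from a square 0/1 citation matrix.

    Assumption: citations[i][j] == 1 means paper i cites paper j.
    """
    n = len(citations)
    # Entry (a, b) of C^T C counts the papers that cite both a and b.
    coc = [[sum(citations[k][a] * citations[k][b] for k in range(n))
            for b in range(n)]
           for a in range(n)]
    # Diagonal convention attributed in the text to White & Griffith (1981),
    # read here as: add the three highest off-diagonal values of the row
    # and divide the sum by two.
    for a in range(n):
        off_diagonal = sorted((coc[a][b] for b in range(n) if b != a),
                              reverse=True)
        coc[a][a] = sum(off_diagonal[:3]) / 2
    return coc


# Toy example (hypothetical data): papers 0 and 1 both cite papers 2 and 3,
# and paper 2 cites paper 3.
toy_citations = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
toy_cocitations = cocitation_matrix(toy_citations)
```

In the toy data, papers 2 and 3 are cocited by papers 0 and 1, so their off-diagonal cocitation count is 2. The resulting matrix is symmetric, which matches the description of cocitation as an undirected relationship.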
Citation-Based Approach

The citation graphs built from the citation data sets are pruned by using two threshold values: the citation threshold is set to 2 and the cocitation threshold is set to 1. The threshold values were selected to filter out the less-cited papers (through the citation threshold), and augment the structure (via the cocitation threshold). Pruning with thresholds has been applied in previous studies (Lee & Chen, 2012). The pruned cocitation graphs contain 159 and 215 articles, respectively. The cocitation graph is then encoded in a matrix format and input to the factor analysis procedure. The top 20 factors are selected as the main research themes. In the intellectual structure-deriving process, papers with a loading over 0.6 on a factor are ascribed to the factor to facilitate content analysis. The loading value denotes the degree of dependence (relatedness) a paper has on the factor to which it is ascribed. A loading value of 1 indicates a paper is completely correlated with (represented by) the factor. A loading over 0.4 is considered significant (Stevens, 2012). As such, papers ascribed to a factor are treated as the constituent nodes of a document cluster. The factor-based clustering is one of several steps to be carried out for the intellectual structure building, which is computerized using Intellectual Structurer (Chen, 2012). Taking the data set “Text Mining Application” as an illustrative example, 20 factors are generated by the intellectual structure building process. Each factor includes 1 to 20 articles ascribed to the factor. The resultant factor-based clustering is tabulated in Table 2. The detailed information about articles ascribed to each factor is listed in Appendix A. Some articles are not ascribed to a factor due to their low factor loading. We conveniently assigned these articles to factor 0, which usually includes interdisciplinary articles that are significantly related to several factors. Articles ascribed to factor 0 are treated as nodes that belong to the zero-labeled factor cluster. There are 21 factors in the “Text Mining Application” data set (factor 1 to 20 + factor 0), and 20 factors in the “Citation Analysis” data set (factor 1 to 19 + factor 0). The factor loadings of documents ascribed to factor 20 in the “Citation Analysis” data set are all below the 0.6 threshold. Therefore, only 19 factors with one or more ascribed documents are left in the factor-based clusters.

TABLE 2. The number of documents ascribed to each factor in the intellectual structures derived from the two data sets.

            Text Mining Application              Citation Analysis
Factor #    Paper count   Paper with full text   Paper count   Paper with full text
0           14            11                     23            14
1           20            18                     35            16
2           17            16                     26            9
3           18            15                     16            6
4           10            9                      15            7
5           12            10                     16            9
6           10            8                      14            12
7           9             9                      10            6
8           7             6                      9             5
9           7             5                      8             3
10          8             6                      9             3
11          5             5                      7             2
12          6             6                      5             2
13          3             1                      6             4
14          2             1                      4             4
15          2             2                      4             2
16          2             1                      3             2
17          2             1                      1             1
18          2             1                      2             2
19          2             1                      2             2
20          1             1
Total       159           133 (83.6%)            215           111 (51.6%)

Text-Based Approach

The text-based clustering approach is kept constant across the two data sets. Therefore, we use one data set to illustrate the clustering process. The “Text Mining Application” data set includes 159 articles assigned to 21 factors, which are reclustered according to their content similarity. We followed the commonly applied text mining procedure, in which meaningful terms are extracted from the abstract and full paper (if available) by tokenizing text content and filtering out stop words. The remaining meaningful terms are further condensed to their stems by the Porter stemmer (Porter, 1997). Each article is then represented by a feature vector using the TF-IDF (Salton & McGill, 1986) weighting scheme on meaningful terms. The cosine similarity, which equates to the cosine of the angle between the documents’ vectors, is used as the similarity measure between two articles.

We applied the K-Means (MacQueen, 1967) clustering method to recluster articles in the intellectual structures (159 and 215 articles, respectively). Each document is represented by its feature vector in an N-dimensional space, where N is the number of meaningful terms used to represent the document corpus. The K-Means algorithm then randomly chooses k documents as the initial centers of the clusters. The remaining documents are then each assigned to the center they are most similar to. After the first iteration, a new center for each cluster is computed by averaging the feature vectors of all documents assigned to it. The process of assigning documents and recomputing centers is iterated until the process converges, which is usually signaled by centers that no longer change. As we described previously, there are 21 and 20 factor-based clusters in their respective intellectual structures derived from the two data sets. Matching the number of factor-based clusters, the K parameter of the K-Means method is naturally set to 21 for the “Text Mining Application” data set, and 20 for the “Citation Analysis” data set. There are 21 document clusters generated for the “Text Mining Application” data set, and 20 document clusters generated for the “Citation Analysis” data set. The processing steps are further delineated in Figure 1.

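The text-based pipeline described in the Text-Based Approach subsection (tokenizing, stop-word filtering, TF-IDF weighting, cosine similarity, and K-Means) can be illustrated with the following minimal Python sketch. It is a sketch under stated assumptions, not the software used in the study: the stop-word list is a tiny placeholder, Porter stemming is omitted, the TF-IDF variant (raw term frequency times the natural log of N over document frequency) is one common choice, and all function names are hypothetical.

```python
import math
import random
import re
from collections import Counter

# Illustrative stop-word list only; the study's actual list is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one TF-IDF vector per document over a shared vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Raw term frequency times log inverse document frequency.
        vectors.append([tf[t] * math.log(n / doc_freq[t]) for t in vocab])
    return vectors


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(vectors, k, seed=0, max_iter=100):
    """Cluster by cosine similarity; returns one cluster label per vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]  # random initial centers
    labels = None
    for _ in range(max_iter):
        # Assign each document to its most similar center.
        new_labels = [max(range(k), key=lambda c: cosine(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Recompute each non-empty cluster's center as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (hypothetical): two topics, two documents each.
toy_docs = [
    "citation network graph",
    "citation network graph",
    "gene protein cell",
    "gene protein cell",
]
toy_vectors = tfidf_vectors(toy_docs)
toy_labels = kmeans(toy_vectors, k=2)
```

For the study's data, k would be 21 or 20 to match the number of factor-based clusters. Because the initial centers are random, different seeds can give different partitions on real corpora; the fixed `seed` argument only makes a single run repeatable.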
TABLE 3. Matched factors and clusters sorted in descending order of their Jaccard index.

Text Mining Application                   Citation Analysis
Cluster #   Factor #   Jaccard index      Cluster #   Factor #   Jaccard index
13          6          0.8182             16          16         0.50
18          10         0.625              5           5          0.470
8           5          0.4117             1           13         0.40
5           7          0.375              11          8          0.3636
3           12         0.375              19          4          0.3333
16          11         0.3333             18          9          0.3333
10          1          0.3076             6           15         0.3333
17          4          0.2666             3           10         0.30
15          18         0.25               12          1          0.225
0           3          0.2068             17          3          0.2187
11          14         0.2                13          2          0.2058
20          0          0.1904             2           6          0.2
6           8          0.1875             0           0          0.1290
4           9          0.1764             15          12         0.0909
12          2          0.12               10          7          0.0666
–           –          –                  8           11         0.0625

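The pairs in Table 3 come from a greedy pairing on a Jaccard matching matrix, and the overall congruity is measured with the Rand Index, as the Results section explains. The following minimal Python sketch illustrates both computations. It is not the author's implementation: the function names are hypothetical, ties are broken by taking the first maximum found, and the kappa statistic is left out because the text does not spell out how unmatched clusters were labeled.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two document-id collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matching_matrix(factors, clusters):
    """M[i][j] = Jaccard index between factor cluster i and content cluster j."""
    return [[jaccard(f, c) for c in clusters] for f in factors]


def greedy_max_match(M):
    """Repeatedly take the largest remaining entry, then drop its row and column.

    Stops when no rows or columns remain, or when every remaining entry is 0.
    Returns a list of (factor_index, cluster_index, jaccard_value) tuples.
    """
    rows = list(range(len(M)))
    cols = list(range(len(M[0]))) if M else []
    pairs = []
    while rows and cols:
        a, b = max(((i, j) for i in rows for j in cols),
                   key=lambda ij: M[ij[0]][ij[1]])
        if M[a][b] == 0:
            break
        pairs.append((a, b, M[a][b]))
        rows.remove(a)
        cols.remove(b)
    return pairs


def rand_index(labels_a, labels_b):
    """Share of document pairs on which two partitions agree (Rand, 1971)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total if total else 1.0


# Toy example (hypothetical document ids).
toy_factors = [{1, 2, 3}, {4, 5}, {6}]
toy_clusters = [{4, 5, 6}, {1, 2}, {7}]
toy_M = matching_matrix(toy_factors, toy_clusters)
toy_pairs = greedy_max_match(toy_M)
toy_rand = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

In the toy example, the third factor and the third cluster share no documents with any remaining partner, so they stay unmatched, which mirrors the dash row for the “Text Mining Application” data set in Table 3.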
FIG. 1. The steps in the congruity checking procedure. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Results

The document clusters generated by the cluster analysis are compared against the factor-based clusters derived from the factor analysis step in the intellectual structure building process. We applied the Rand Index (Rand, 1971) to measure the congruity between the linkage-derived factors (documents ascribed to a factor) and the content-based document clusters. The Rand Index between the two groups of clusters is shown in Table 4. The Rand Index is above 0.85, which indicates a high congruity between the factor-based clusters and the content-based clusters. The Rand Index gives us an overall consistency metric between two groups of clusters. However, the index does not show the individual matching pairs from the two groups of clusters. The individually matched cluster pairs afford us the information necessary to examine whether the two matching clusters are congruent. To facilitate the pairing process, we first build a matching matrix where the rows are the factor-based clusters and the columns are the content-based clusters. Taking the “Text Mining Application” data set as the illustrative example, the matching matrix is a 21 by 21 square matrix M as shown in Appendix B. The matrix’s entry M[i,j] holds the Jaccard similarity index (Hamers et al., 1989) between the document cluster ascribed to factor i and the content-derived document cluster j. The Maximum-Match-Measure (Meilă & Heckerman, 2001) method is then applied to find the best-matching pairs, each of which takes one cluster from the factor-derived clusters and one from the content-based ones. The pairing method is a greedy searching process that iteratively finds the maximum entry M[a, b] in the matching matrix M. Once a maximum entry M[a, b] is located, the entries in the a-th row and b-th column are deleted from M. The pairing process repeats this step until the matrix M has size 0, or all remaining entries in M are 0. The factors and clusters that are matched with the max-matched pairing computation are listed in Table 3.

We may treat the link-based factoring and the content-based clustering as distinct judgments made by two experts. To assess the agreement between judgments made by the two, we applied the kappa statistic (Cohen, 1960), which is a measurement often used in psychological research to assess the agreement of judgments made by two observers. The kappa statistic operates on singletons and requires that the labels of the clusters for matching the partitions be known, whereas the Rand Index is based on pairs of observations that appear in the same cluster in each of the partitions. Our pairing method is equivalent to labeling the matching pairs with one label. It has been shown that the adjusted Rand Index is equivalent to Cohen’s kappa, although their inputs are quite different (Warrens, 2008). The kappa statistics computed from the matching pairs of the “Text Mining Application” and “Citation Analysis” data sets are shown in Table 4. The kappa values indicate there is fair to moderate agreement (Viera & Garrett, 2005) between the factor-centered document clusters and content-derived document clusters.

TABLE 4. The Rand Index and kappa statistic between factors and clusters.

Data set                  Rand index   Kappa   # of Factors/Clusters   # of matched pairs
Text Mining Application   0.8703       0.418   21/21                   15
Citation Analysis         0.8549       0.371   20/20                   16

Unified Model Checking

The factor-centered clusters and content-based clusters are derived from different forms of multivariate analysis (factor analysis and K-Means, respectively) that introduce an
additional influence that may affect the result. To exclude the possible influence introduced by disparate analyses, we applied the same factor analysis procedure to the relatedness measurement between documents and to their similarity metrics. The relatedness measurement is represented by the same cocitation matrix we used in the analysis shown in Figure 1, whereas the documents’ similarity metric is a symmetric matrix that has the cosine similarity between documents as its entries. The processing steps are delineated in Figure 2, which shows additional processing steps as blocks connected by the orange-colored arrowed line. The congruity check is carried out by applying the same procedure to compare the linkage-derived factors (denoted by Factors in Figures 1 and 2) with the content-derived factors (denoted by Factors/D in Figure 2). The Rand Index and kappa statistics between the two groups of factors are listed in Table 5. To our surprise, both the Rand Index and the kappa statistics did not increase but declined moderately. We consider these unexpected results in the Discussion section. Similarly, K-Means is applied to the cocitation matrix to derive clusters (denoted by Clusters/C) and compute the congruity statistics as shown in Table 6. The additional processing steps are shown as blocks connected by the green arrowed line.

FIG. 2. Additional procedures for congruity checking between factors (check 2) and clusters (check 3) are added to the procedure depicted in Figure 1. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

TABLE 5. The Rand Index and kappa statistic between two groups of factors.

Data set                  Rand index   Kappa   # of Factors/Clusters   # of matched pairs
Text Mining Application   0.7315       0.321   21/21                   15
Citation Analysis         0.7209       0.222   20/20                   15

TABLE 6. The Rand Index and kappa statistic between two groups of clusters.

Data set                  Rand index   Kappa   # of Factors/Clusters   # of matched pairs
Text Mining Application   0.8724       0.363   21/21                   17
Citation Analysis         0.8689       0.367   20/20                   19

Discussion

The citation linkage-based analytical methodologies are mature and widely applied; they are dominant in research domain analysis as well as science mapping. Nowadays, widely available digitized documents, coupled with affordable computing power and mass storage, have paved the way for content-based approaches. However, the computational complexities of the content-based analytical methods still render them impracticable for analyzing a large document corpus. We provide empirical evidence that the widely used linkage-based analytical methods do bind documents with similar content together. The implication that relatedness implies similarity provides additional support for the link-based methodologies popular in the field of bibliographical study. The kappa statistic of the “Text Mining Application” data set is higher than that of the “Citation Analysis” data set. The higher kappa value may have resulted from the higher percentage of full text available for the content-based similarity analysis in the “Text Mining Application” data set (83.6%) than in the “Citation Analysis” data set (51.6%). The content-based clustering was derived from the terms appearing in the title, abstract, and full text (if available). Our experiment gives equal weight to all terms in a document regardless of their positioning. A scheme that gives different weights to terms found in different sections of a paper, such as the title or abstract, may produce better content-based clustering.

We followed the approach used by White and Griffith (1981), who used the raw cocitation frequencies. Input data based on normalized cocitation frequencies, such as the frequencies discussed by Klavans and Boyack (2006), might have resulted in better linkage-based factor clusters. Improved content-derived clusters coupled with better factor-ascribed clusters might result in a higher kappa value. In contrast to the existing studies, our study processes the full text (if available) to compute the feature vector in the hope of accomplishing a better result. In comparison to merely processing text in the title and abstract, the full-text approach is relatively computationally expensive and may have only marginally improved the documents’ clustering. In addition, amassing the full text of all
literature in a document corpus is unlikely. A tailored data set with a complete full-text collection of literature may produce a better result. Alternatively, we can have a baseline study for the content-based clustering that verifies whether a document corpus using just the title and abstract is clustered similarly to a full-text document corpus. With this baseline study, we can use document corpora with only the title and abstract in place of the full-text corpora to compute the feature vectors and thus simplify the text processing step.

The Rand Index has been criticized for not correcting for chance: random agreement between two partitions can yield a positive index value. To rectify this shortcoming of the Rand Index, the adjusted Rand Index (Hubert & Arabie, 1985) was proposed, which has an expected value of zero for independent clusterings. We did not adopt the adjusted Rand Index because it assumes that clusterings must have a fixed number of elements in each cluster (Vinh, Epps, & Bailey, 2009). Since the number of elements in each factor is decided by the factor analysis that sequentially extracts factors with a descending number of ascribed elements, it does not warrant the application of the adjusted Rand Index. Similarly, the number of elements in each cluster derived by K-Means varies greatly.

The Rand Index as well as the kappa statistics declined moderately when factor analysis was applied to both measurements of the same data sets. We think this may be due to the similarity metric being implicitly treated as a relatedness measurement when we applied the factor analysis to the cosine similarity matrix. The implicit equivalence between the similarity and relatedness measurement in our study is like adopting an unproven proposition in the process of proving the proposition itself. Factor analysis is designed to identify factors that explain most of the common variance (Kline, 2013), which is very different from how K-Means works: K-Means tries to group the most similar elements into clusters. The differences between the two multivariate analyses may also negatively affect the congruity statistics. However, the same rationale cannot explain the higher Rand Index value in the case of applying K-Means clustering analysis. The idea of adopting one single multivariate analysis to preclude the possible treatment effect and thus raise the congruity statistics seems to be weakly supported by the results of the K-Means analysis. It also implies that the same relatedness and similarity measurements give a more consistent clustering by some multivariate analyses than others. The fact that the relatedness and similarity metrics produce a congruent clustering indicates the two measurements are correlated. We could have better science mapping by substituting the factor analysis method with other clustering methods such as K-Means. Since factor analysis not only derives factors, but also generates complementary statistics such as factor loadings and a correlation matrix, it could be replaced by the soft K-Means method exemplified by the expectation maximization (MacKay, 2005) algorithm, which also generates clustering-related statistics.

To facilitate the calculation of the kappa statistic, we used the maximum-match measure to find the best matched pairs. Nevertheless, as we can see from Tables 4 and 5, about 25% of the factors and clusters remain unmatched. We have explored alternative ways to match pairs fully by using the aggregated cosine similarity metrics, which are calculated from the many-to-many cosine similarity between each element in the matching clusters. However, the kappa coefficient and Rand Index are calculated based on the number of identical elements in (matched) pairs, which renders our contextual matching inoperable. Their calculation merely takes into account identical papers in (matched) pairs, but does not count papers with similar contents. Beyond counting identical papers in matched pairs, a congruity measurement that takes into account the contextual and semantic similarities between documents may better reflect their true agreement.

Conclusion

We provide empirical evidence that shows the relatedness relationship denoted by documents’ cocitation linkage is consistent with the similarity relationship derived from the content of these documents. Relatedness between documents is inferred from the linkages between them, whereas the content of documents determines their similarity. Our study shows moderate congruence between linkage-derived (factor-ascribed) document clusters and content-decided document clusters, which suggests that relatedness implies similarity. The congruity between the linkage-based factors and content-based clusters affords novel support for traditional citation-based bibliographical analytical methods. The efficacy of citation-based bibliometric methods, such as science mapping based on cocitation and bibliographic coupling, may be due to the effective binding of similar documents through bibliographical linkage analysis.

Acknowledgments

This work was partially supported by NSC project 102-2410-H-305-070. I would like to thank my research assistant Wei-Chieh Wang, who has contributed extensively to programming. I would also like to express my special thanks to several of my research assistants: Hsing-Yi Huang, Yu-Jie Zheng, and Meng-Hsiu Tsai for their effort in data analysis. I am also thankful for the insightful comments and suggestions from the anonymous reviewers of the manuscript.

References

Ahlgren, P., & Colliander, C. (2009). Document–document similarity approaches and science mapping: Experimental comparison of five approaches. Journal of Informetrics, 3(1), 49–63.
Ahlgren, P., & Jarneving, B. (2008). Bibliographic coupling, common abstract stems and clustering: A comparison of two document-document similarity approaches in the context of science mapping. Scientometrics, 76(2), 273–290.

Boyack, K.W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404.
Boyack, K.W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767.
Carlson, S. (2006). Challenging Google, Microsoft unveils a search tool for scholarly articles. Chronicle of Higher Education, 52(33), 43.
Chen, T.T. (2012). The development and empirical study of a literature review aiding system. Scientometrics, 92(1), 105–116.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Culnan, M.J. (1987). Mapping the intellectual structure of MIS, 1980–1985: A co-citation analysis. MIS Quarterly, 11(3), 341–353.
Garfield, E., Sher, I.H., & Torpie, R.J. (1964). The use of citation data in writing the history of science. Philadelphia: Institute for Scientific Information.
Hamers, L., Hemeryck, Y., Herweyers, G., Janssen, M., Keters, H., Rousseau, R., & Vanhoutte, A. (1989). Similarity measures in scientometric research: The Jaccard index versus Salton’s cosine formula. Information Processing & Management, 25(3), 315–318.
Hirschheim, R., Klein, H.K., & Lyytinen, K. (1996). Exploring the intellectual structures of information systems development: A social action theoretic analysis. Accounting, Management and Information Technologies, 6(1–2), 1–64.
Hoffman, D.L., & Holbrook, M.B. (1993). The intellectual structure of consumer research: A bibliometric study of author cocitations in the first 15 years of the Journal of Consumer Research. Journal of Consumer Research, 19(4), 505–517.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Janssens, F., Leta, J., Glänzel, W., & De Moor, B. (2006). Towards mapping library and information science. Information Processing & Management, 42(6), 1614–1642.
Janssens, F., Glänzel, W., & Moor, B. (2008). A hybrid mapping of information science. Scientometrics, 75(3), 607–631.
Kim, H., & Lee, J.Y. (2008). Exploring the emerging intellectual structure of archival studies using text mining: 2001–2004. Journal of Information Science, 34(3), 356–369.
Klavans, R., & Boyack, K.W. (2006). Identifying a better measure of relatedness for mapping science. Journal of the American Society for Information Science and Technology, 57(2), 251–263.
Kline, R. (2013). Exploratory and confirmatory factor analysis. In Y. Petscher & C. Schatschneider (Eds.), Applied quantitative analysis in the social sciences (pp. 171–207). New York: Routledge.
Lee, M.R., & Chen, T.T. (2012). Revealing research themes and trends in knowledge management: From 1995 to 2010. Knowledge-Based Systems, 28, 47–58.
Locke, J., & Perera, H. (2001). The intellectual structure of international accounting in the early 1990s. The International Journal of Accounting, 36(2), 223–249.
MacKay, D.J.C. (2005). Information theory, inference, and learning algorithms (4th ed.). Cambridge, UK: Cambridge University Press.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Paper presented at the Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 281–297). Berkeley, Calif.: University of California Press.
Meilă, M., & Heckerman, D. (2001). An experimental comparison of model-based clustering methods. Machine Learning, 42(1–2), 9–29.
Pilkington, A., & Meredith, J. (2009). The evolution of the intellectual structure of operations management—1980–2006: A citation/co-citation analysis. Journal of Operations Management, 27(3), 185–202.
Porter, M.F. (1997). An algorithm for suffix stripping. In J. Karen Sparck & W. Peter (Eds.), Readings in information retrieval (pp. 313–316). San Francisco: Morgan Kaufmann Publishers Inc.
Pratt, J., Hauser, K., & Sugimoto, C. (2012). Defining the intellectual structure of information systems and related college of business disciplines: A bibliometric analysis. Scientometrics, 93(2), 279–304.
Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
Rorissa, A., & Yuan, X. (2012). Visualizing and mapping the intellectual structure of information retrieval. Information Processing & Management, 48(1), 120–135.
Salton, G., & McGill, M.J. (1986). Introduction to modern information retrieval. New York: McGraw-Hill, Inc.
Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4), 265–269.
Small, H. (1977). A co-citation model of a scientific specialty: A longitudinal study of collagen research. Social Studies of Science, 7(2), 139–166.
Stevens, J.P. (2012). Applied multivariate statistics for the social sciences. New York: Routledge Academic.
Viera, A.J., & Garrett, J.M. (2005). Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5), 360–363.
Vinh, N.X., Epps, J., & Bailey, J. (2009). Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In Paper presented at the Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1073–1080). Montreal, QC, Canada: ACM.
Warrens, M. (2008). On the equivalence of Cohen’s Kappa and the Hubert-Arabie adjusted Rand index. Journal of Classification, 25(2), 177–183.
Webster, J., & Watson, R.T. (2002). Analyzing the past to prepare for the future: Writing a literature review. Management Information Systems Quarterly, 26(2), xiii–xxiii.
White, H.D., & Griffith, B.C. (1981). Author cocitation: A literature measure of intellectual structure. Journal of the American Society for Information Science, 32(3), 163–171.
Yazdani, M., & Popescu-Belis, A. (2013). Computing text semantic relatedness using the contents and links of a hypertext encyclopedia. Artificial Intelligence, 194(0), 176–202.

Appendix A. Titles of the documents in the “Data Mining Application” data set and the factors they are ascribed to.
Doc # Title Factor Full text

51 The Description Logic Handbook: Theory, Implementation, and Applications 0


52 The ENZYME database in 2000 0 Y
53 Using process diagrams for the graphical representation of biological networks 0
54 Word association norms, mutual information, and lexicography 0 Y
55 WordNet: An Electronic Lexical Database 0
56 APT: Arabic Part-of-speech Tagger 0 Y
57 Data clustering: a review 0 Y
58 Emergence of scaling in random networks 0 Y
59 Improving the prediction of pharmacogenes using text-derived drug-gene relationships 0 Y
60 Indexing by Latent Semantic Analysis 0 Y
61 information retrieval 0 Y
62 Latent Dirichlet allocation 0 Y
63 Maximum likelihood from incomplete data via the EM algorithm 0 Y
64 Mining the Biomedical Literature in the Genomic Era: An Overview 0 Y
65 A gene network for navigating the literature 1 Y
66 A simple approach for protein name identification: prospects and limits 1 Y
67 Comparative assessment of large-scale data sets of protein-protein interactions 1 Y
68 Comparative experiments on learning information extractors for proteins and their interactions 1 Y
69 Development of human protein reference database as an initial platform for approaching systems biology in humans 1
70 Discovering patterns to extract protein-protein interactions from the literature: Part II 1 Y
71 Extraction of protein interaction information from unstructured text using a context-free grammar 1 Y
72 finding the evidence for protein interactions from PubMed abstracts 1 Y
73 Improving the performance of dictionary-based approaches in protein name recognition 1 Y
74 IntAct—open source resource for molecular interaction data 1 Y
75 Introducing meta services for biomedical information extraction 1 Y
76 Optimizing syntax patterns for discovering protein-protein interactions 1 Y
77 Overview of the protein-protein interaction annotation extraction task of BioCreative II 1 Y
78 PreBIND and Textomy—mining the biomedical literature for protein interactions using a support vector machine 1 Y
79 BioRAT: extracting biological information from full length papers 1 Y
80 Subsequence Kernels for Relation Extraction 1 Y
81 Svmlight: Support Vector Machine 1
82 Towards a proteome scale map of the human protein interaction network 1 Y
83 Unsupervised Models for Named Entity Classification 1 Y
84 Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions 1 Y
85 A coefficient of agreement for nominal scales 2 Y
86 AliBaba: PubMed as a graph 2 Y
87 BioRAT: extracting biological information from full length papers 2 Y
88 Building an abbreviation dictionary using a term recognition approach 2 Y
89 Disambiguating the species of biomedical named entities using natural language parsers 2 Y
90 Distinguishing the species of biomedical named entities for term identification 2 Y
91 Distribution of information in biomedical abstracts and full-text publications 2 Y
92 Entrez Gene: gene-centered information at NCBI 2 Y
93 GenBank and PubMed: How connected are they? 2 Y
94 Gene name ambiguity of eukaryotic nomenclatures 2 Y
95 Information extraction from full text scientific articles: Where are the keywords? 2 Y
96 Interspecies normalization of gene mentions with GNAT 2 Y
97 Introduction to Automata Theory, Languages and Computation 2
98 Is searching full text more effective than searching abstracts? 2 Y
99 ProMiner: rule based protein and gene entity recognition 2 Y
100 Text processing through Web services: calling Whatizit 2 Y
101 The Universal Protein Resource (UniProt) 2 Y
1 Text mining and ontologies in biomedicine: Making sense of raw text 3 Y
2 Text-mining and information-retrieval services for molecular biology 3 Y
3 The next generation of literature analysis: Integration of genomic analysis into text mining 3 Y
4 The Text Mining Handbook—Advanced Approaches in Analyzing Unstructured Data 3
5 Untangling Text Data Mining 3 Y
6 A literature network of human genes for high-throughput analysis of gene expression 3 Y
7 A survey of current work in biomedical text mining 3 Y
8 A text-mining analysis of the human phenome 3 Y
9 Assessment of disease named entity recognition on a corpus of annotated sentences 3 Y
10 Data Mining: Practical Machine Learning Tools and Techniques 3
11 Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge 3 Y
12 Extracting interactions between proteins from the literature 3 Y
13 Fast Algorithms for Mining Association Rules 3 Y
14 Frontiers of biomedical text mining: current progress 3 Y


15 GENIA corpus—a semantically annotated corpus for bio text mining 3 Y


16 Genomics and natural language processing 3 Y
17 Literature mining for the biologist: from information retrieval to biological discovery 3
18 Text mining and its potential applications in systems biology 3 Y
19 A Rich Feature Vector for Protein-Protein Interaction Extraction from Multiple Corpora 4 Y
20 All paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning 4 Y
21 BioInfer: a corpus for information extraction in the biomedical domain 4 Y
22 Combining Multiple Layers of Syntactic Information for Protein Interaction Extraction 4 Y
23 Comparative analysis of five protein-protein interaction corpora 4 Y
24 Evaluating contributions of natural language parsers to protein-protein interaction extraction 4 Y
25 Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature 4 Y
26 Extracting Protein-Protein Interaction Information from Biomedical Text with SVM 4
27 Incorporating rich background knowledge for gene named entity classification and recognition 4 Y
28 TREC 2006 Genomics Track Overview 4 Y
29 A Linguistic Analysis of How People Describe Software Problems 5 Y
30 An algorithm for suffix stripping 5
31 An approach to detecting duplicate bug reports using natural language and execution information 5 Y
32 An empirical study of the naive Bayes classifier 5 Y
33 Automated Severity Assessment of Software Defect Reports 5 Y
34 Automatic bug triage using text categorization 5
35 Detection of Duplicate Defect Reports Using Natural Language Processing 5 Y
36 Guidelines for conducting and reporting case study research in software engineering 5 Y
37 Improving bug triage with bug tossing graphs 5 Y
38 Towards a simplification of the bug report form in eclipse 5 Y
39 What makes a good bug report? 5 Y
40 Who should fix this bug? 5 Y
41 Algorithms and applications for approximate nonnegative matrix factorization 6 Y
42 Algorithms for Non-negative Matrix Factorization 6 Y
43 Bioconductor: open software development for computational biology and bioinformatics 6 Y
44 Learning Spatially Localized, Parts-Based Representation 6 Y
45 Learning the parts of objects by non-negative matrix factorization 6
46 Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational Biology 6 Y
47 Non-negative matrix factorization with sparseness constraints 6 Y
48 Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values 6
49 Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis 6 Y
50 SVD based initialization: A head start for nonnegative matrix factorization 6 Y
103 A computational system to select candidate genes for complex human traits 7 Y
104 Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining 7 Y
105 Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data 7 Y
106 Extraction of Gene-Disease Relations from Medline Using Domain Dictionaries and Machine Learning 7 Y
107 G2D: a tool for mining genes associated with disease 7 Y
108 Gene prioritization through genomic data fusion 7 Y
109 Integration of text- and data-mining using ontologies successfully selects disease gene candidates 7 Y
110 POCUS: mining genomic sequence annotation to predict disease genes 7 Y
111 SUSPECTS: enabling fast and effective prioritization of positional candidates 7 Y
117 Collective dynamics of “small world” networks 8 Y
118 Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome 8 Y
119 Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks 8 Y
120 Evaluation of different biological data and computational classification methods for use in protein interaction prediction 8
121 Gene ontology: tool for the unification of biology 8 Y
122 Global topological features of cancer proteins in the human interactome 8 Y
123 The human disease network 8 Y
132 Co-clustering documents and words using bipartite spectral graph partitioning 9 Y
133 Combining Labeled and Unlabeled Data with Co-Training 9 Y
134 Distributional clustering of words for text classification 9 Y
135 Information-theoretic co-clustering 9 Y
136 NewsWeeder: Learning to Filter Netnews 9
137 Spectral clustering for multi-type relational data 9 Y
138 Transductive Inference for Text Classification using Support Vector Machines 9
124 A comparative study on feature selection in text categorization 10
125 A Tutorial on Support Vector Machines for Pattern Recognition 10 Y
126 Extracting Product Features and Opinions from Reviews 10 Y
127 Latent Dirichlet Allocation 10
128 Mining the peanut gallery: opinion extraction and semantic classification of product reviews 10 Y
129 Opinion observer: analyzing and comparing opinions on the Web 10 Y
130 SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining 10 Y


131 Thumbs up?: Sentiment Classification using Machine Learning Techniques 10 Y


112 A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology 11 Y
113 BioThesaurus: a web-based thesaurus of protein and gene names 11 Y
114 Extracting human protein interactions from MEDLINE using a full-sentence parser 11 Y
115 Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis driven science in the post genomic era 11 Y
116 The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models 11 Y
143 A web-based kernel function for measuring the similarity of short text snippets 12 Y
144 Connections between the lines: augmenting social networks with text 12 Y
145 Finding scientific topics 12 Y
146 Joint latent topic models for text and citations 12 Y
147 TwitterRank: finding topic-sensitive influential twitterers 12 Y
148 Why we twitter: understanding microblogging usage and communities 12 Y
149 A vector space model for automatic indexing 13 Y
150 Machine learning in automated text categorization 13
151 Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language 13
139 Corpus-based and Knowledge-based Measures of Text Semantic Similarity 14 Y
140 Introduction to Data Mining 14
141 BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition 15 Y
142 DrugBank: a comprehensive resource for in silico drug discovery and exploration 15 Y
156 Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses 16
157 Foundations of Statistical Natural Language Processing 16 Y
158 A Brief Survey of Text Mining 17 Y
159 A new readability yardstick 17
152 Generative model-based document clustering: a comparative study 18 Y
153 Introduction to Modern Information Retrieval 18
154 Database resources of the National Center for Biotechnology Information 19 Y
155 Protein promiscuity and its implications for biotechnology 19
102 A Comparison of String Distance Metrics for Name-Matching Tasks 20 Y

Appendix B. Matching matrix between factors and clusters.


Cluster
Factor 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

0 0.059 0 0 0.048 0 0 0 0.1 0.029 0 0.056 0 0.063 0 0 0 0 0.133 0.042 0.15 0.053
1 0 0.217 0 0 0 0.037 0.276 0.12 0.024 0 0 0.031 0 0 0 0 0 0 0 0.036 0
2 0 0 0 0 0 0.136 0.03 0 0.182 0 0 0.071 0 0.056 0.087 0 0 0 0.037 0.04 0
3 0 0.083 0 0.13 0 0.04 0.029 0 0.111 0.037 0.045 0.069 0 0 0.13 0 0 0 0 0 0
4 0 0 0 0 0 0.059 0.038 0.059 0 0 0 0.353 0 0 0 0 0 0 0.05 0 0
5 0.143 0 0 0 0.667 0.053 0 0 0 0 0 0 0 0 0 0 0 0 0 0.05 0
6 0.077 0 0 0 0 0 0 0 0 0.667 0 0 0 0 0 0 0 0 0 0.056 0
7 0 0 0 0 0 0 0 0 0.348 0 0.077 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0.263 0 0.036 0 0 0 0 0 0 0 0 0.111 0 0 0
9 0 0 0 0.071 0 0 0 0 0 0 0.091 0.053 0 0 0 0 0 0 0.2 0.067 0
10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.375 0 0 0.118 0.063 0.167
11 0 0 0 0.083 0 0 0.048 0 0 0 0.111 0 0 0 0.182 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0.077 0 0 0 0 0 0 0 0 0.333 0 0 0 0.333
13 0 0 0 0 0 0 0 0 0 0.083 0 0 0 0 0.1 0 0 0 0.077 0 0
14 0 0 0 0.111 0 0 0 0 0 0 0 0.071 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0.111 0 0 0 0 0 0 0.25 0 0 0 0 0 0 0 0
16 0 0.111 0 0 0 0 0 0 0 0 0 0 0 0.333 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0.111 0 0 0 0 0 0 0 0 0 0 0.083 0 0
18 0 0 0 0.111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.083 0 0
19 0 0 0 0 0 0 0 0 0.043 0 0 0 0.25 0 0 0 0 0 0 0 0
20 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
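
The appendix does not state how each cell of the matching matrix is computed. The following is a minimal sketch, assuming each cell is the Jaccard index (Hamers et al., 1989) between the set of documents ascribed to a factor and the set of documents assigned to a content-based cluster. The tabulated values are consistent with that reading: for example, the single-document factor 20 matches cluster 2 with a value of 1. The function names and the toy labels below are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def group_by_label(labels):
    """Map each label to the set of document ids that carry it."""
    groups = defaultdict(set)
    for doc_id, label in labels.items():
        groups[label].add(doc_id)
    return dict(groups)


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matching_matrix(factor_of, cluster_of):
    """Return {factor: {cluster: Jaccard index}} for every factor/cluster pair."""
    factors = group_by_label(factor_of)
    clusters = group_by_label(cluster_of)
    return {
        f: {c: round(jaccard(f_docs, c_docs), 3)
            for c, c_docs in sorted(clusters.items())}
        for f, f_docs in sorted(factors.items())
    }


# Toy labels (hypothetical): document id -> factor, document id -> cluster
factor_of = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2}
cluster_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 2}

matrix = matching_matrix(factor_of, cluster_of)
for factor, row in matrix.items():
    # One printed row per factor, columns ordered by cluster label
    print(factor, [row[c] for c in sorted(row)])
```

Under this assumption, a row with a single large value indicates a factor whose documents fall mostly into one content-based cluster, while a row of small values spread over several columns indicates a factor that the content-based clustering splits apart.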
