JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 67(3):610–619, 2016
citation and cocitation, are conveniently represented in a matrix form. A “1” in a cell of the matrix indicates the presence of a relationship; a “0” denotes the absence of a relationship between the nodes abstracted in the corresponding column and row of the matrix. If we take the matrix form of the citation relationship, then the matrix form of the cocitation relationship can be derived by multiplying the citation matrix with its own transpose. The values in the diagonal of the cocitation matrix are computed by taking the three highest values of each row from the cocitation matrix divided by two, thereby indicating in a general way the relative importance of a particular literature within the field (White & Griffith, 1981). The resultant cocitation matrix is subjected to factor analysis that combines sets of correlated variables (cocited documents) into factors (research themes). As such, hundreds of documents can be reduced to tens of factors. Each factor is regarded as a surrogate for the set of documents ascribed to it. Documents ascribed to a factor usually share a common research theme or topic. The clusters of documents surrogated by their corresponding factors are the foundation of the intellectual structure building methodology. The set of documents ascribed to a factor is derived from the cocitation linkages existing among these documents. The strength of the relatedness relationship between two documents is proportional to the number of cocitation edges existing between them.
The research question is “Does the relatedness of literatures imply content similarity?” If we can provide evidence that link-derived relatedness between documents implies that they have similar content, it will provide the link-based intellectual structure building methodology with further support. Although text-based science mapping methods and their variants have been studied (Janssens, Leta, Glänzel, & De Moor, 2006; Janssens, Glänzel, & Moor, 2008), intellectual structure and science mapping have predominantly utilized link-based citation approaches for their computational efficiency over text-based approaches.

Related Studies
Intellectual structure is conventionally derived from cocitation relationships between documents. Cocitation-based analysis has been the prevalent approach since the 1970s. However, there has been a resurgence of interest in the bibliographic coupling-based approach that has inspired researchers to compare the merits of these two approaches (Boyack & Klavans, 2010). As we described previously, the clusters of documents ascribed to their corresponding factors are derived from the cocitation linkages between them. The relatedness of literatures can be inferred from their cocitation linkage as well as their content similarity (Klavans & Boyack, 2006; Yazdani & Popescu-Belis, 2013). As such, documents may be clustered based on content similarity instead of link relations. However, we could find only one study in the literature that applied content-based clustering to build the intellectual structure of archival studies (Kim & Lee, 2008). In that study, 432 articles were clustered based on their cosine similarity, which is computed from the term frequency-inverse document frequency (TF-IDF) weighted term vector of each document. The text-mining content-based approach is held to provide comparable information to the traditional link-based approach. Science mapping using the bibliographic coupling approach was compared to the content-based approach (Ahlgren & Jarneving, 2008). In that study, a collection of articles on the information retrieval field was clustered based on their bibliographic coupling as well as content similarity. The resultant agreement between clusters derived from the link-based and content-based clustering methods was low. The content-based approach is shown to provide supplementary information to the traditional link-based approach. A study that compared the text-based approach, the citation-based bibliographic approach, and a combined approach that involved nine methods in total was carried out (Ahlgren & Colliander, 2009). One text-based method and one combined method among the nine methods were found to be largely in agreement with the ground truth subjective classification given by a domain expert in the study. A variant of the combined bibliographic and text-based approach that adjusts the cocitation weights (normalized frequencies) based on proximity between reference pairs’ positions in full text found the textual coherence of a cocitation cluster increased by 9–20% over the traditional bibliographic only approach (Boyack, Small, & Klavans, 2013).

Data and Methods
We retrieved two citation data sets from the online citation database Microsoft Academic Search (Carlson, 2006) in 2013. Citation data sets were collected by querying the database with the key phrases “text mining application” on February 21 and “citation analysis” on March 29. Both queries limited the search to literature published after 2010. The queries returned 298 seed papers for “text mining application” and 210 seed papers for “citation analysis.” Our citation expansion scheme then took these seed papers as the initial literature collection, and retrieved papers that cite or are cited by literature in the initial seed set. This two-way literature search is recommended by journal editors for writing literature reviews (Webster & Watson, 2002). The full citation graph was built by connecting all reference links between articles collected by the two-way citation expansion method. The basic statistics of the two data sets are listed in Table 1.

TABLE 1. The basic statistics of the two data sets.
Name\Attributes           Seed count   Paper count   Citation arcs count   # of nodes in factors
Text Mining Application   298          3,428         3,870                 159
Citation Analysis         210          3,272         3,731                 215
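The two-way citation expansion and the cocitation counting described above can be sketched as follows. This is only an illustrative sketch, not the code used in the study: the function names, the dictionary representation of the citation graph, and the toy paper ids are all assumptions. Counting the papers that cite both members of a pair gives the same numbers as the off-diagonal entries of the product of the transposed citation matrix with the citation matrix, assuming rows stand for citing papers and columns for cited papers.

```python
from collections import Counter
from itertools import combinations


def expand_two_way(seeds, cites):
    """Return the seeds plus every paper that cites, or is cited by, a seed.

    `cites` maps a citing paper id to the list of paper ids it references.
    """
    seeds = set(seeds)
    collected = set(seeds)
    for citing, references in cites.items():
        for cited in references:
            if citing in seeds or cited in seeds:
                collected.update((citing, cited))
    return collected


def cocitation_counts(papers, cites):
    """Count how many collected papers cite both members of each pair."""
    counts = Counter()
    for citing in papers:
        # Keep only references that are inside the collected set.
        refs = sorted(r for r in set(cites.get(citing, [])) if r in papers)
        for a, b in combinations(refs, 2):
            counts[(a, b)] += 1
    return counts


# Toy citation data (hypothetical ids, not taken from the study).
cites = {
    "A": ["S1", "X"],
    "S1": ["B", "C"],
    "D": ["S1", "B", "C"],
    "E": ["F"],
}
papers = expand_two_way(["S1"], cites)
pairs = cocitation_counts(papers, cites)
```

In this toy graph the expansion keeps S1, its two references, and the two papers that cite it; B and C end up cocited twice (by S1 and by D).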
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, March 2016
DOI: 10.1002/asi
TABLE 3. Matched factors and clusters sorted in descending order of their Jaccard index.
Cluster #   Factor #   Jaccard index      Cluster #   Factor #   Jaccard index
13          6          0.8182             16          16         0.50
18          10         0.625              5           5          0.470
8           5          0.4117             1           13         0.40
5           7          0.375              11          8          0.3636
3           12         0.375              19          4          0.3333
16          11         0.3333             18          9          0.3333
10          1          0.3076             6           15         0.3333
17          4          0.2666             3           10         0.30
15          18         0.25               12          1          0.225
0           3          0.2068             17          3          0.2187
11          14         0.2                13          2          0.2058
20          0          0.1904             2           6          0.2
6           8          0.1875             0           0          0.1290
4           9          0.1764             15          12         0.0909
12          2          0.12               10          7          0.0666
–           –          –                  8           11         0.0625
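A list of matched pairs like the one in Table 3 can be produced by a greedy maximum-match over a matrix of Jaccard indices, as the Results section describes. The sketch below is a minimal illustration under assumptions: the function name, the nested-list matrix, and the toy values are hypothetical, and ties are broken by taking the first maximum in row-major order, which the paper does not specify.

```python
def greedy_max_match(matrix):
    """Greedily pair rows with columns by the largest remaining entry.

    Returns (row, column, value) triples in the order they were picked,
    which is also descending order of value.
    """
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0]))) if matrix else []
    pairs = []
    while rows and cols:
        value, a, b = max(
            ((matrix[i][j], i, j) for i in rows for j in cols),
            key=lambda entry: entry[0],
        )
        if value == 0:
            break  # only zero overlaps remain, so the rest stay unmatched
        pairs.append((a, b, value))
        rows.remove(a)  # drop the a-th row ...
        cols.remove(b)  # ... and the b-th column from further consideration
    return pairs


# Toy 3 x 3 Jaccard matrix (made-up values).
M = [
    [0.5, 0.2, 0.0],
    [0.1, 0.8, 0.0],
    [0.0, 0.0, 0.0],
]
matched = greedy_max_match(M)
# matched == [(1, 1, 0.8), (0, 0, 0.5)]; row 2 and column 2 stay unmatched
```

Because each pick is the maximum of what remains, the pairs come out already sorted by decreasing Jaccard index, and any row or column left over corresponds to an unmatched factor or cluster.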
FIG. 1. The steps in the congruity checking procedure. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Results
The document clusters generated by the cluster analysis are compared against the factor-based clusters derived from the factor analysis step in the intellectual structure building process. We applied the Rand Index (Rand, 1971) to measure the congruity between the linkage-derived factors (documents ascribed to a factor) and the content-based document clusters. The Rand Index between the two groups of clusters is shown in Table 4. The Rand Index is above 0.85, which indicates a high congruity between the factor-based clusters and the content-based clusters. The Rand Index gives us an overall consistency metric between two groups of clusters. However, the index does not show the individual matching pairs from the two groups of clusters. The individually matched cluster pairs afford us the information necessary to examine whether the two matching clusters are congruent. To facilitate the pairing process, we first build a matching matrix where the rows are the factor-based clusters and the columns are the content-based clusters. Taking the “Text Mining Application” data set as the illustrative example, the matching matrix is a 21 by 21 square matrix M as shown in Appendix B. The matrix’s entry M[i,j] holds the Jaccard similarity index (Hamers et al., 1989) between the document cluster ascribed to factor i and the content-derived document cluster j. The Maximum-Match-Measure (Meilă & Heckerman, 2001) method is then applied to find the best-matching pairs, each consisting of one factor-derived cluster and one content-based cluster. The pairing method is a greedy searching process that iteratively finds the maximum entry M[a, b] in the matching matrix M. Once a maximum entry M[a, b] is located, the a-th row and the b-th column are deleted from M. The pairing process repeats this step until the matrix M has size 0, or all the entries in M have the value 0. The factors and clusters that are matched with the max-matched pairing computation are listed in Table 3.
We may treat the link-based factoring and the content-based clustering as distinct judgments made by two experts. To assess the agreement between judgments made by the two, we applied the kappa statistic (Cohen, 1960), which is a measurement often used in psychological research to assess the agreement of judgments made by two observers. The kappa statistic operates on singletons and requires that the labels of the clusters for matching the partitions be known, whereas the Rand Index is based on pairs of observations that appear in the same cluster in each of the partitions. Our pairing method is equivalent to labeling the matching pairs with one label. It has been shown that the adjusted Rand Index is equivalent to Cohen’s kappa, although their inputs are quite different (Warrens, 2008). The kappa statistics computed from the matching pairs of the “Text Mining Application” and “Citation Analysis” data sets are shown in Table 4. The kappa values indicate there is fair to moderate agreement (Viera & Garrett, 2005) between the factor-centered document clusters and content-derived document clusters.

TABLE 4. The Rand Index and kappa statistic between factors and clusters.
Data set                  Rand index   Kappa   # of Factors/Clusters   # of matched pairs
Text Mining Application   0.8703       0.418   21/21                   15
Citation Analysis         0.8549       0.371   20/20                   16

Unified Model Checking
The factor-centered clusters and content-based clusters are derived from different forms of multivariate analysis (factor analysis and K-Means, respectively) that introduce an
TABLE 6. The Rand Index and kappa statistic between two groups of clusters.
Data set   Rand index   Kappa   # of Factors/Clusters   # of matched pairs
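The two statistics reported in these tables can be computed along the following lines. This is a hedged sketch rather than the study's own procedure: it assumes both partitions label exactly the same documents in the same order, that cluster labels have already been aligned (for example by the matched pairs) before kappa is computed, and that chance agreement is below 1. The function names and toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair agrees when both partitions put it in the same cluster,
    or both put it in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if the two labelings were independent.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labelings of four documents (made-up values).
ri = rand_index([0, 0, 1, 1], [0, 0, 1, 2])      # 5 of 6 pairs agree
kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])  # observed 0.75, expected 0.5
```

The Rand Index needs no label alignment because it only looks at pairs of documents, whereas kappa compares labels item by item, which is why the matched pairs are needed first.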
literature in a document corpus is unlikely. A tailored data set with a complete full-text collection of literature may produce a better result. Alternatively, a baseline study for the content-based clustering could verify that a document corpus using just the title and abstract is clustered similarly to a full-text document corpus. With this baseline study, we can use document corpora with only the title and abstract in place of the full-text corpora to compute the feature vectors and thus simplify the text processing step.
The Rand Index has been criticized for not correcting for chance: it may take a positive value purely because of random agreement between two partitions. To rectify this shortcoming of the Rand Index, the adjusted Rand Index (Hubert & Arabie, 1985) was proposed, which has an expected value of zero for independent clusterings. We did not adopt the adjusted Rand Index because it assumes that clusterings must have a fixed number of elements in each cluster (Vinh, Epps, & Bailey, 2009). Since the number of elements in each factor is decided by the factor analysis, which sequentially extracts factors with a descending number of ascribed elements, the application of the adjusted Rand Index is not warranted. Similarly, the number of elements in each cluster derived by K-Means varies greatly.
The Rand Index as well as the kappa statistics declined moderately when factor analysis was applied to the same data sets. We think this may be due to the similarity metric being implicitly treated as a relatedness measurement when we applied the factor analysis to the cosine similarity matrix. The implicit equivalence between the similarity and relatedness measurement in our study is like adopting an unproven proposition in the process of proving the proposition itself. Factor analysis is designed to identify factors that explain most of the common variance (Kline, 2013), which is very different from how K-Means works: K-Means tries to group the most similar elements into clusters. The differences between the two multivariate analyses may also negatively affect the congruity statistics. However, the same rationale cannot explain the higher Rand Index value in the case of applying K-Means clustering analysis. The idea of adopting one single multivariate analysis to preclude the possible treatment effect and thus raise the congruity statistics seems to be weakly supported by the results of the K-Means analysis. It also implies that the same relatedness and similarity measurements give a more consistent clustering under some multivariate analyses than others. The fact that the relatedness and similarity metrics produce a congruent clustering indicates the two measurements are correlated. We could have better science mapping by substituting the factor analysis method with other clustering methods such as K-Means. Since factor analysis not only derives factors, but also generates complementary statistics such as factor loadings and a correlation matrix, it could be replaced by the soft K-Means method exemplified by the expectation maximization (MacKay, 2005) algorithm, which also generates clustering-related statistics.
To facilitate the calculation of the kappa statistic, we used the maximum-match measure to find the best matched pairs. Nevertheless, as we can see from Tables 4 and 5, about 25% of the factors and clusters are still left unmatched. We have explored alternative ways to match pairs fully by using an aggregated cosine similarity metric, which is calculated from the many-to-many cosine similarity between the elements of the matching clusters. However, the kappa coefficient and Rand Index are calculated based on the number of identical elements in (matched) pairs, which renders our contextual matching inoperable. Their calculation merely takes into account identical papers in (matched) pairs, but does not count papers with similar contents. Beyond counting identical papers in paired clusters, a congruity measurement that takes into account the contextual and semantic similarities between literatures may better reflect their true agreement.

Conclusion
We provide empirical evidence that shows the relatedness relationship denoted by documents’ cocitation linkage is consistent with the similarity relationship derived from the content of these documents. Relatedness between documents is inferred by the linkages between them, whereas the content of documents determines their similarity. Our study shows moderate congruence between linkage-derived (factor-ascribed) document clusters and content-decided document clusters, which suggests that relatedness implies similarity. The congruity between the linkage-based factors and content-based clusters affords novel support for traditional citation-based bibliographical analytical methods. The efficacy of citation-based bibliometric methods, such as science mapping based on cocitation and bibliographic coupling, may be due to the effective binding of similar documents through bibliographical linkage analysis.

Acknowledgments
This work was partially supported by NSC project 102-2410-H-305-070. I would like to thank my research assistant Wei-Chieh Wang, who has contributed extensively to programming. I would also like to express my special thanks to several of my research assistants: Hsing-Yi Huang, Yu-Jie Zheng, and Meng-Hsiu Tsai for their effort in data analysis. I am also thankful for the insightful comments and suggestions from the anonymous reviewers of the manuscript.

References
Ahlgren, P., & Colliander, C. (2009). Document–document similarity approaches and science mapping: Experimental comparison of five approaches. Journal of Informetrics, 3(1), 49–63.
Ahlgren, P., & Jarneving, B. (2008). Bibliographic coupling, common abstract stems and clustering: A comparison of two document-document similarity approaches in the context of science mapping. Scientometrics, 76(2), 273–290.
Appendix A. Titles of the documents in the “Text Mining Application” data set and the factors they are ascribed to.
Doc #   Title   Factor   Full text
0 0.059 0 0 0.048 0 0 0 0.1 0.029 0 0.056 0 0.063 0 0 0 0 0.133 0.042 0.15 0.053
1 0 0.217 0 0 0 0.037 0.276 0.12 0.024 0 0 0.031 0 0 0 0 0 0 0 0.036 0
2 0 0 0 0 0 0.136 0.03 0 0.182 0 0 0.071 0 0.056 0.087 0 0 0 0.037 0.04 0
3 0 0.083 0 0.13 0 0.04 0.029 0 0.111 0.037 0.045 0.069 0 0 0.13 0 0 0 0 0 0
4 0 0 0 0 0 0.059 0.038 0.059 0 0 0 0.353 0 0 0 0 0 0 0.05 0 0
5 0.143 0 0 0 0.667 0.053 0 0 0 0 0 0 0 0 0 0 0 0 0 0.05 0
6 0.077 0 0 0 0 0 0 0 0 0.667 0 0 0 0 0 0 0 0 0 0.056 0
7 0 0 0 0 0 0 0 0 0.348 0 0.077 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0.263 0 0.036 0 0 0 0 0 0 0 0 0.111 0 0 0
9 0 0 0 0.071 0 0 0 0 0 0 0.091 0.053 0 0 0 0 0 0 0.2 0.067 0
10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.375 0 0 0.118 0.063 0.167
11 0 0 0 0.083 0 0 0.048 0 0 0 0.111 0 0 0 0.182 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0.077 0 0 0 0 0 0 0 0 0.333 0 0 0 0.333
13 0 0 0 0 0 0 0 0 0 0.083 0 0 0 0 0.1 0 0 0 0.077 0 0
14 0 0 0 0.111 0 0 0 0 0 0 0 0.071 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0.111 0 0 0 0 0 0 0.25 0 0 0 0 0 0 0 0
16 0 0.111 0 0 0 0 0 0 0 0 0 0 0 0.333 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0.111 0 0 0 0 0 0 0 0 0 0 0.083 0 0
18 0 0 0 0.111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.083 0 0
19 0 0 0 0 0 0 0 0 0.043 0 0 0 0.25 0 0 0 0 0 0 0 0
20 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
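A matrix of this kind, with one row per factor-based cluster and one column per content-based cluster as the Results section describes for M, can be built by computing the Jaccard index for every row and column pair. The sketch below is illustrative only: the function names and the toy clusters are assumptions, and the document ids are made up rather than taken from the data sets.

```python
def jaccard(a, b):
    """Jaccard index of two collections: size of intersection over union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matching_matrix(factor_clusters, content_clusters):
    """Entry [i][j] is the Jaccard index of factor cluster i and content cluster j."""
    return [[jaccard(f, c) for c in content_clusters] for f in factor_clusters]


# Toy clusters of document ids (hypothetical).
factor_clusters = [{1, 2, 3}, {4, 5}]
content_clusters = [{1, 2}, {3, 4, 5}]
M = matching_matrix(factor_clusters, content_clusters)
# M[0][0] = 2/3, M[0][1] = 1/5, M[1][0] = 0, M[1][1] = 2/3
```

Such a matrix is generally not symmetric, since entry [i][j] compares a factor-based cluster with a content-based cluster, while entry [j][i] compares a different pair.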