[Figure 1 plots omitted: panel (c) shows density curves, CR scores, and TR scores together with each method's selection criterion (elbow, top 33%, difference in shell sizes).]

Figure 1: (a) Graph-of-words (W = 8) for document #1512 of Hulth2003 decomposed with k-core (non-(nouns and adjectives) in italic). (b) Keywords extracted by each proposed and baseline method (human-assigned keywords in bold). (c) Selection criterion of each method except main (to which it does not apply).
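One of the selection criteria shown in panel (c) of Figure 1 is an "elbow" on a curve. A generic way to locate an elbow, sketched here as an illustration only (the authors' exact criterion may differ), is to pick the point farthest from the chord joining the curve's endpoints:

```python
def elbow_index(values):
    """Generic elbow heuristic: return the index of the point farthest
    from the straight line joining the first and last points of the
    curve (values are assumed sorted, e.g. decreasing scores)."""
    n = len(values)
    if n < 3:
        return 0
    x1, y1 = 0.0, float(values[0])
    x2, y2 = float(n - 1), float(values[-1])
    denom = ((y2 - y1) ** 2 + (x2 - x1) ** 2) ** 0.5

    def dist(i, v):
        # perpendicular distance from point (i, v) to the chord
        return abs((y2 - y1) * i - (x2 - x1) * v + x2 * y1 - y2 * x1) / denom

    return max(range(n), key=lambda i: dist(i, values[i]))
```

For a curve that drops steeply and then flattens, such as [10, 6, 3, 1.5, 1.2, 1.0], the heuristic returns index 2, the point where the decrease levels off.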
[Figure 2 images omitted: (a) shows nodes labeled * and ** inside the nested 1-core, 2-core, and 3-core.]

Figure 2: (a) k-core decomposition illustrative example. Note that while nodes * and ** have the same degree (3), node ** is a more influential spreader, as it lies in a higher core than node *. (b) k-core versus K-truss. The main K-truss subgraph can be considered as the core of the main core.
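The peeling process behind the k-core decomposition of Figure 2(a) can be sketched in plain Python (an illustrative implementation; the experiments in Section 6.3 use the R igraph package and a C++ K-truss implementation):

```python
from collections import defaultdict

def core_numbers(edges):
    """Unweighted k-core decomposition by iterative peeling: for
    k = 1, 2, ..., repeatedly delete every node whose remaining degree
    is below k; a node's core number is the last k it survived."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    core, alive, k = {}, set(adj), 0
    while alive:
        k += 1
        changed = True
        while changed:
            changed = False
            for u in list(alive):
                if len(adj[u] & alive) < k:
                    core[u] = k - 1  # u was peeled at level k
                    alive.remove(u)
                    changed = True
    return core

# triangle {1, 2, 3} with a pendant node 4
core = core_numbers([(1, 2), (2, 3), (1, 3), (3, 4)])
```

On this toy graph the triangle nodes get core number 2 and the pendant node gets 1, mirroring how a higher core identifies the more cohesive part of the graph.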
prestige-based ranking in the Web and social networks (among other things), PageRank may thus not be ideal for keyword extraction. Indeed, a fundamental difference when dealing with text is the paramount importance of cohesiveness: keywords need not only to have many important connections but also to form dense substructures with these connections. Actually, most of the time, keywords are n-grams (Rousseau and Vazirgiannis, 2015). Therefore, we hypothesize that keywords are more likely to be found among the influential spreaders of a graph-of-words – as extracted by degeneracy-based methods – rather than among the nodes ranking high on eigenvector-related centrality measures.

Topical vs. network coherence. Note that, unlike a body of work that tackled the tasks of keyword extraction and document summarization from a topical coherence perspective (Celikyilmaz and Hakkani-Tür, 2011; Chen et al., 2012; Christensen et al., 2013), we deal here with network coherence, a purely graph-theoretic notion orthogonal to topical coherence.

Graph degeneracy. The aforementioned limitation of TR motivated the use of graph degeneracy to not only extract central nodes, but also nodes forming dense subgraphs. More precisely, (Rousseau and Vazirgiannis, 2015) applied both unweighted and weighted k-core decomposition on graphs-of-words and retained the members of the main cores as keywords. Best results were obtained in the weighted case, with small main cores yielding good precision but low recall, and significantly outperforming TR. As expected, (Rousseau and Vazirgiannis, 2015) observed that cores exhibited the desirable property of containing “long-distance n-grams”.

In addition to superior quantitative performance, another advantage of degeneracy-based techniques (compared to TR, which extracts a constant percentage of nodes) is adaptability. Indeed, the size of the main core (and more generally, of every level in the hierarchy) depends on the structure of the graph-of-words, which by nature is uniquely tailored to the document at hand. Consequently, the distribution of the number of extracted keywords matches more closely that of the human-assigned keywords (Rousseau and Vazirgiannis, 2015).

Limitations. Nevertheless, while (Rousseau and Vazirgiannis, 2015) made great strides, their approach suffers from the following limitations: (1) k-core is good but not best at capturing cohesiveness; (2) retaining only the main core (or truss) is suboptimal, as one cannot expect all the gold-standard keywords to be found within a unique subgraph – actually, many valuable keywords live in lower levels of the hierarchy (see Figure 1); and (3) the coarseness of the k-core decomposition implies working at a high granularity level (selecting or discarding a large group of words at a time), which diminishes the flexibility of the extraction process and negatively impacts performance.

Algorithm 1: dens method
Input: core (or truss) decomposition of G
Output: set of keywords
1 D ← empty vector of length nlevels
2 for n ← 1 to nlevels do …
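Only the first lines of Algorithm 1 survive in this copy; they show a per-level vector D filled while looping over the nlevels hierarchy levels. A hedged Python sketch of a density-based level criterion, assuming D[n] stores the density of the n-th level's subgraph (an assumption, not the authors' exact procedure):

```python
def density(n_nodes, n_edges):
    """Density of an undirected graph: |E| / (|V| * (|V| - 1) / 2)."""
    return 0.0 if n_nodes < 2 else 2.0 * n_edges / (n_nodes * (n_nodes - 1))

def densest_level(levels):
    """levels: one (n_nodes, n_edges) pair per hierarchy level, e.g. for
    the 1-core, 2-core, ... subgraphs. Fills a vector D with per-level
    densities (as the surviving lines of Algorithm 1 suggest) and returns
    the index of the densest level. The selection rule is an assumption,
    since the body of the algorithm is truncated here."""
    D = [density(n, m) for n, m in levels]
    return max(range(len(D)), key=D.__getitem__)
```

For instance, with level subgraphs of 10, 6, and 4 nodes (12, 10, and 6 edges), the innermost, fully connected level is the densest and would be selected.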
…tions of document size and manually assigned keywords for each dataset.

The Hulth2003¹ (Hulth, 2003) dataset contains abstracts drawn from the Inspec database of physics and engineering papers. Following our baselines, we used the 500 documents in the validation set and the “uncontrolled” keywords assigned by human annotators. The mean document size is 120 words and on average, 21 keywords (in terms of unigrams) are available for each document.

Figure 3: Basic dataset statistics. [plots omitted]
We also used the training set of Marujo2012¹, containing 450 web news stories of about 440 words on average, covering 10 different topics from art and culture to business, sport, and technology (Marujo et al., 2012). For each story, the keyphrases assigned by at least 9 out of 10 Amazon Mechanical Turkers are provided as gold standard. After splitting the keyphrases into unigrams, this makes for an average of 68 keywords per document, which is much higher than for the two other datasets, even the one comprising long documents (Semeval, see next).

The Semeval² dataset (Kim et al., 2010) offers parsed scientific papers collected from the ACM Digital Library. More precisely, we used the 100 articles in the test set and the corresponding author-and-reader-assigned keyphrases. Each document is approximately 1,860 words in length and is associated with about 24 keywords.

Notes. In Marujo2012, the keywords were assigned in an extractive manner, but many of them are verbs. In the two other datasets, keywords were freely chosen by human coders in an abstractive way and, as such, some of them are not present in the original text. On these datasets, reaching perfect recall is therefore impossible for our methods (and the baselines), which are all extractive by definition.

6.3 Implementation

Before constructing the graphs-of-words and passing them to the keyword extraction methods, we performed the following pre-processing steps:

Stopwords removal. Stopwords from the SMART information retrieval system³ were discarded.

Part-of-Speech tagging and screening using the openNLP (Hornik, 2015) R (R Core Team, 2015) implementation of the Apache OpenNLP Maxent POS tagger. Then, following (Mihalcea and Tarau, 2004), only nouns and adjectives were kept. For Marujo2012, as many gold-standard keywords are verbs, this step was skipped (note that we did experiment with and without POS-based screening but got better results in the second case).

Stemming with the R SnowballC package (Bouchet-Valat, 2014) (Porter’s stemmer). Gold-standard keywords were also stemmed.

After pre-processing, graphs-of-words (as described in Section 2) were constructed for each document and various window sizes (from 3, increasing by 1, until a plateau in scores was reached). We used the R igraph package (Csardi and Nepusz, 2006) to write the graph-building and weighted k-core implementation code. For K-truss, we used the C++ implementation offered by (Wang and Cheng, 2012).

Finally, for TRP and CRP, we retained the top 33% keywords on Hulth2003 and Marujo2012 (short and medium-size documents), whereas on Semeval (long documents), we retained the top 15 keywords. This is consistent with our baselines. Indeed, the number of manually assigned keywords increases with document size up to a certain point, and stabilizes afterwards.

The code of the implementation and the datasets can be found here⁴.

¹ https://github.com/snkim/AutomaticKeyphraseExtraction
² https://github.com/boudinfl/centrality_measures_ijcnlp13/tree/master/data
³ http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
⁴ https://github.com/Tixierae/EMNLP_2016
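The graph-building step described above can be sketched in plain Python (illustrative only; the experiments used the R igraph package, and the exact window semantics, here W consecutive tokens, is an assumption):

```python
from collections import Counter

def graph_of_words(tokens, window):
    """Weighted graph-of-words: distinct terms are nodes, and an edge
    links two terms that co-occur within `window` consecutive tokens;
    edge weights count co-occurrences."""
    weights = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                weights[frozenset((tokens[i], tokens[j]))] += 1
    return weights

g = graph_of_words("graph of words graph mining keyword extraction".split(),
                   window=3)
# 9 distinct edges; "graph" and "of" co-occur twice, so that edge has weight 2
```

The resulting weighted edge list can then be fed to any core (or truss) decomposition routine.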
[Figure omitted: distributions of k-core numbers (x-axis: 3 to 14) and truss numbers.]

          precision  recall  F1-score
dens      48.79      72.78   56.09*
inf       48.96      72.19   55.98*
CRE       65.33      37.90   44.11
[other rows lost in extraction]

Table 1: Hulth2003, K-truss, W = 11. *statistical significance at p < 0.001 with respect to all baselines. † baseline systems.
[Figure omitted: precision (top row), recall (middle row), and F1-score (bottom row) versus window size, for k-core and K-truss on Hulth2003, Marujo2012, and Semeval.]
          precision  recall  F1-score
dens      8.44       79.45   15.06
inf       17.70      65.53   26.68
CRP       49.67      32.88   38.98*
CRE       25.82      58.80   34.86
main†     25.73      49.61   32.83
TRP†      47.93      31.74   37.64
TRE†      33.87      46.08   37.55

Table 3: Semeval, K-truss, W = 20. *statistical significance at p < 0.001 w.r.t. main. † baseline systems.

(main). On Semeval, which features larger pieces of text, our CRP technique improves on TRP, the best performing baseline, by more than 1%, although the difference is not statistically significant. However, CRP is head and shoulders above main, with an absolute gain of 6%. This suggests that converting the cohesiveness information captured by degeneracy into ranks may be valuable for large documents. Finally, the poor performance of the dens and inf methods on Semeval (Table 3) might be explained by the fact that these methods are only capable of selecting an entire batch of nodes (i.e., a subgraph-of-words) at a time. This lack of flexibility seems to become a handicap on long documents, for which the graphs-of-words, and thus the subgraphs corresponding to the k-core (or truss) hierarchy levels, are very large. This analysis is consistent with the observation that, conversely, approaches that work at a finer granularity level (node level) prove superior on long documents, such as our proposed CRP method, which reaches best performance on Semeval.

8 Conclusion and Future Work

Our results provide empirical evidence that spreading influence may be a better “keywordness” metric than eigenvector (or random walk)-based criteria. Our CRP method is currently very basic, and leveraging edge weights/direction or combining it with other scores could yield better results. Also, more meaningful edge weights could be obtained by merging local co-occurrence statistics with external semantic knowledge offered by pre-trained word embeddings (Wang et al., 2014). The direct use of density-based objective functions could also prove valuable.
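As a concrete illustration of "converting the cohesiveness information captured by degeneracy into ranks", here is a hedged sketch of a CRP-style percentage criterion. It assumes a node's CR score is the sum of its neighbors' core numbers (the neighborhood coreness of Bae and Kim, 2014) and that the top p% of nodes are kept; both are assumptions for illustration, not necessarily the paper's exact definitions:

```python
def crp_keywords(adj, core, top_percent):
    """Rank nodes by the sum of their neighbors' core numbers
    (neighborhood coreness) and keep the top_percent best-ranked ones.
    Hypothetical scoring: a sketch of a CRP-style criterion, not the
    paper's exact method."""
    scores = {u: sum(core[v] for v in nbrs) for u, nbrs in adj.items()}
    k = max(1, round(len(scores) * top_percent / 100))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Because scoring happens at the node level, any fraction of nodes can be retained, which is exactly the flexibility that batch methods like dens and inf lack on long documents.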
References

Joonhyun Bae and Sangwook Kim. 2014. Identifying and ranking influential spreaders in complex networks by neighborhood coreness. Physica A: Statistical Mechanics and its Applications, 395:549–559.

Vladimir Batagelj and Matjaž Zaveršnik. 2002. Generalized cores. arXiv preprint cs/0202039.

Milan Bouchet-Valat. 2014. SnowballC: Snowball stemmers based on the C libstemmer UTF-8 library. R package version 0.5.1.

Asli Celikyilmaz and Dilek Hakkani-Tür. 2011. Discovery of topically coherent sentences for extractive summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 491–499. Association for Computational Linguistics.

Yun-Nung Chen, Yu Huang, Hung-Yi Lee, and Lin-Shan Lee. 2012. Unsupervised two-stage keyword extraction from spoken documents by topic coherence and support vector machine. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5041–5044. IEEE.

Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2013. Towards coherent multi-document summarization. In HLT-NAACL, pages 1163–1173. Citeseer.

Jonathan Cohen. 2008. Trusses: Cohesive subgraphs for social network analysis. National Security Agency Technical Report, page 16.

Gabor Csardi and Tamas Nepusz. 2006. The igraph software package for complex network research. InterJournal, Complex Systems:1695.

Vince Grolmusz. 2015. A note on the PageRank of undirected graphs. Information Processing Letters, 115(6):633–634.

Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Kurt Hornik. 2015. openNLP: Apache OpenNLP Tools Interface. R package version 0.2-5.

Anette Hulth. 2003. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 216–223. Association for Computational Linguistics.

Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. 2010. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26. Association for Computational Linguistics.

Maksim Kitsak, Lazaros K Gallos, Shlomo Havlin, Fredrik Liljeros, Lev Muchnik, H Eugene Stanley, and Hernán A Makse. 2010. Identification of influential spreaders in complex networks. Nature Physics, 6(11):888–893.

Fragkiskos D Malliaros and Konstantinos Skianis. 2015. Graph-based term weighting for text categorization. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 1473–1479. ACM.

Fragkiskos D Malliaros, Apostolos N Papadopoulos, and Michalis Vazirgiannis. 2016a. Core decomposition in graphs: concepts, algorithms and applications. In Proceedings of the 19th International Conference on Extending Database Technology (EDBT), pages 720–721.

Fragkiskos D Malliaros, Maria-Evgenia G Rossi, and Michalis Vazirgiannis. 2016b. Locating influential nodes in complex networks. Scientific Reports, 6:19307.

Luis Marujo, Anatole Gershman, Jaime Carbonell, Robert Frederking, and João P. Neto. 2012. Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. In Proceedings of LREC 2012. ELRA.

Polykarpos Meladianos, Giannis Nikolentzos, François Rousseau, Yannis Stavrakas, and Michalis Vazirgiannis. 2015. Degeneracy-based real-time sub-event detection in Twitter stream. In Ninth International AAAI Conference on Web and Social Media (ICWSM).

Rada Mihalcea and Paul Tarau. 2004. TextRank: bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

R Core Team. 2015. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

François Rousseau and Michalis Vazirgiannis. 2013. Graph-of-word and TW-IDF: new approach to ad hoc IR. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM), pages 59–68. ACM.

François Rousseau and Michalis Vazirgiannis. 2015. Main core retention on graph-of-words for single-document keyword extraction. In Advances in Information Retrieval, pages 382–393. Springer.

François Rousseau, Emmanouil Kiagias, and Michalis Vazirgiannis. 2015. Text categorization as a graph classification problem. In ACL, volume 15, page 107.

Stephen B Seidman. 1983. Network structure and minimum degree. Social Networks, 5(3):269–287.

Jia Wang and James Cheng. 2012. Truss decomposition in massive networks. Proceedings of the VLDB Endowment, 5(9):812–823.

Rui Wang, Wei Liu, and Chris McDonald. 2014. Corpus-independent generic keyphrase extraction using word embedding vectors. In Software Engineering Research Conference, page 39.