
Cognitive Computation (2022) 14:852–874

https://doi.org/10.1007/s12559-021-09979-7

HAKE: an Unsupervised Approach to Automatic Keyphrase Extraction for Multiple Domains

Zakariae Alami Merrouni¹ · Bouchra Frikh¹ · Brahim Ouhbi²

Received: 13 October 2020 / Accepted: 26 November 2021 / Published online: 21 January 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
Keyphrases capture the main content of a free-text document. The task of automatic keyphrase extraction (AKPE) plays a significant role in retrieving and summarizing valuable information from documents across different domains. Various techniques have been proposed for this task. However, supervised AKPE requires large annotated data and depends on the tested domain. An alternative solution is to consider a new domain-independent method that can be applied to several domains (such as medical or social). In this paper, we tackle keyphrase extraction from single documents with HAKE, a novel unsupervised method that takes full advantage of mining linguistic, statistical, structural, and semantic text features simultaneously to select the most relevant keyphrases in a text. HAKE achieves higher F-scores than the unsupervised state-of-the-art systems on standard datasets and is suitable for real-time processing of large amounts of Web and text data across different domains. With HAKE, we also explicitly increase coverage and diversity among the selected keyphrases by introducing a novel technique (based on a parse tree approach, part-of-speech tagging, and filtering) for candidate keyphrase identification and extraction. This technique allows us to generate a comprehensive and meaningful list of candidate keyphrases and to reduce the candidate set's size without increasing the computational complexity. HAKE's effectiveness is compared to twelve state-of-the-art and recent unsupervised approaches, as well as to some other supervised approaches. Experimental analysis is conducted to validate the proposed method using five of the top available benchmark corpora from different domains, and shows that HAKE significantly outperforms both the existing unsupervised and supervised methods. Our method does not require training on a particular set of documents, nor does it depend on external corpora, dictionaries, domain, or text size. Our experiments confirm that HAKE's candidate selection model and its ranking model are effective.

Keywords  Automatic keyphrase extraction · Unsupervised machine learning · Feature selection

* Correspondence: Zakariae Alami Merrouni, zakariae.alamimerrouni@usmba.ac.ma · Bouchra Frikh, bfrikh@yahoo.com · Brahim Ouhbi, ouhbib@yahoo.co.uk

1 LIASSE Lab, National School of Applied Sciences (ENSA), Sidi Mohamed Ben Abdellah University, B.P. 72, Route d'imouzer, Fez, Morocco
2 Mathematical Modeling & Computer Laboratory (LM2I), National Higher School of Arts and Crafts (ENSAM), Moulay Ismail University (UMI), Marjane II, B.P. 4024, Meknes, Morocco

Introduction

With the rapid growth of the textual data scale and its complicated content, information on the data sources is emerging exponentially. How to efficiently seek and manage information becomes an important research matter. Keyphrases or keywords characterize and capture the main content of a large text data collection. They provide a solution to help organize, manage, and retrieve these documents. Manually extracting these keyphrases is difficult and very time-consuming. Therefore, automatic keyphrase extraction ("AKPE") methods serve as an efficient and practical alternative. The AKPE task interacts with many real-world domains. For instance, in the medical literature, research articles, clinical trial reports, medical case reports, and medical news available on the web are important sources to help clinicians in patient care. The pervasion of the huge amount of medical information through the WWW has created a growing need for the development of techniques for text summarization, automatic indexing, clustering/classification, etc. [1].


Applications of the AKPE include, but are not limited to, the following: document retrieval [2, 3], text summarization [4, 5, 6], document clustering [7, 8, 9], opinion mining [10] and sentiment analysis [11], web mining [12, 13], recommender systems [14], and searching, learning, and building ontologies [15, 16, 17, 18]. However, despite the huge literature dealing with keyphrase extraction, recent studies have shown that these methods' accuracy remains moderate [19]. Several factors contribute to this. Difficulties in defining relevance, as well as the language, sizes, and domains of the input documents, are among the most important challenges. Other issues include extracting keywords that are only statistically relevant while missing the semantic context of some keywords, as well as the evaluation metric that uses the exact match restriction. Several existing supervised approaches use different features and machine learners that classify keywords as relevant or non-relevant. Whereas these models have some advantages, they require a long training process and large manually annotated data. They depend on the pre-determination of features before running the system. They use offline analysis and require a considerable amount of computing time for training. As a result, these features cannot be modified or learned during the system process. It is difficult to list all the features associated with a specific domain because it is necessary to exploit the text's textual, statistical, and external information [19]. On the other hand, unsupervised models perform without any prior knowledge or manually labelled data, and they can be easily applied to documents across different domains.

AKPE includes three crucial steps. First, candidate keyphrase identification: all the phrases of a document are filtered to identify candidate terms that could be keyphrases. Second, feature selection: a set of features (properties) is used to discriminate keyphrases and their relevance from other terms. Third, keyphrase ranking: these features are combined using machine learning models (supervised or unsupervised) to give a synthetic score to each keyphrase. This score is used to rank the candidate keyphrases, and the k top-ranked keyphrases are output as the keyphrases we look for. In every proposed method, identifying and selecting candidate keyphrases is an essential task, and thus the robustness and effectiveness of a method depend primarily on the underlying candidate keyphrase extraction technique. However, there are certain limitations of existing approaches that make it difficult to obtain a suitable and comprehensive list of candidate phrases. An intuitive idea is that keyphrase(s) must contain keyword(s) [20]. Here, keywords mean the most important single words in the document. They can be used as the basis for measuring candidate keyphrases. Weighting words first, instead of phrases, offers many advantages. Words are usually much easier to extract, match, and weight, especially in documents where many phrases may not be statistically frequent [21, 22]. Weighting words reduces the over-generation errors of AKPE by ensuring that keyphrases are selected according to their component words. The weight of each unique word is counted only once [23].

Motivated by these facts, we present in this paper "HAKE," a novel unsupervised automatic keyphrase extraction method, to identify and extract relevant keywords and keyphrases from text documents. It deals with documents on complex and multi-domain topics. Precisely, our work mainly improves the first two steps of the AKPE process by adopting a twofold approach. First, we identify and form comprehensive and syntactically correct candidate keyphrases using a novel method that relies on a parsing-based technique and part-of-speech tagging. Second, several features are calculated and combined to identify relevant document keywords using a novel weighting model named "KWHFSM" (Keyword Hybrid Features Selection Model). Precisely, this model calculates the weights of single keywords that are used later to rank the candidate keyphrases generated in the first step. It considers three kinds of features: the statistical feature, namely the CHIR measure (an improved version of the chi-square statistic); the structural feature, the TPW (term position weight); and the semantic feature, the SIM measure (a mutual information-based method). Finally, another lightweight model called "KPHFSM" (Keyphrase Hybrid Features Selection Weighting Model) is built to rank and select the final list of top keyphrases.

Hence, our paper proposes a new AKPE technique with the following contributions:

– We propose a novel technique (based on a parse tree approach, part-of-speech tagging, and filtering) for candidate keyphrase identification and extraction. This technique allows us to generate a comprehensive and meaningful list of candidate keyphrases and to reduce the candidate set's size without increasing the computational complexity.
– We propose an unsupervised approach that does not require training on a particular set of documents, nor does it depend on external corpora, dictionaries, domain, or text size.
– We provide a system with a flexible architecture that can be integrated in any IR framework.

To prove HAKE's effectiveness, it is compared to twelve state-of-the-art and most recent unsupervised approaches and to some other supervised methods. Experimental analysis is conducted to validate the proposed method using five of the top available benchmark corpora from different domains. HAKE significantly outperforms both the compared existing unsupervised and supervised methods.

Our paper is organized as follows: The "Related Works" section provides a brief background and related work on the task of AKPE. The "Methods" section introduces the HAKE system, which generates candidate phrases using a novel method, calculates their several features, and finally ranks relevant keyphrases. The "Results" section presents and discusses the evaluation experiments and results, and the "Conclusion and Perspectives" section concludes our paper.

Related Works

Typically, the automatic keyphrase extraction task is performed in three main steps: (1) identifying a set of appropriate phrases as candidate keyphrases, (2) using features to discriminate keyphrases from other terms, and (3) extracting final keyphrases from the set of candidates by means of a supervised or unsupervised approach. In what follows, the related works of each step are discussed.

Candidate Keyphrase Identification Methods

The first step, "candidate phrase identification" or "candidate phrase generation," aims to identify a set of words serving as candidate keyphrases. This step reduces the number of false-positive candidates while maintaining the true-positive ones. The majority of methods fall into two categories: (i) n-gram based, (ii) part-of-speech (POS) sequence-based, or both. The first category was adopted in pioneering systems [24, 25]. The input text is separated according to phrase boundaries and to a limited length (bigram, trigram, ...) [25, 26, 27]. All subsequences of these initial phrases up to a limit length are taken as candidate phrases; then, some rules are applied to these candidates to filter meaningless subsequences [28, 29, 30]. However, n-grams are most likely to be grammatically incorrect and do not always capture complete information. The alternative solution to this problem is the use of specific linguistic patterns based on part-of-speech (POS) tags or noun phrases [31, 32, 33]. It allows words with finite part-of-speech tags (e.g., nouns, adjectives, verbs) to be candidate keyphrases [34, 31, 35]. Wan and Xiao [28] developed a noun phrase approach based on linguistic patterns of POS tags to select candidate keyphrases. Hulth [31] compares three different phrase identification methods and concludes that extracting noun phrase chunks and phrases with a finite POS tag sequence gives better results than using n-grams. Grineva et al. [32] use n-grams that appear in Wikipedia article titles as candidates. Haddoud and Abdeddaïm [33] identified five existing POS tag sequence definitions of a noun phrase and proposed a new one that reveals a noteworthy improvement over the other filters. It selects phrases that consist of zero or more adjectives followed by one or more nouns. Various heuristic rules obtained from training sets or thesauri have also been used to identify and extract candidate keyphrases for long textual entities [36, 37, 38].


Table 1  Some prominent systems for keyphrase extraction

Technique/contributor | Approach | Features
KEA System [39, 26] | Supervised | Term first-occurrence position, tf-idf
GenEx Algorithm [25] | Supervised | Term first-occurrence position, tf-idf, phrase length
TextRank Algorithm [40] | Unsupervised | POS, distance between word occurrences
CorePhrase Algorithm [7] | Unsupervised | DF-average weight, phrase frequency, average phrase depth
SingleRank Algorithm [41] | Unsupervised | POS tagging, neighborhood, level saliency score
Maui Algorithm [42] | Supervised | tf-idf, position of first occurrence, keyphraseness, phrase length, spread, node degree
KP-Miner System [37] | Unsupervised (weighting model) | tf-idf, first occurrence position, boosting factor for weight bias
Core Word Expansion Algorithm [43] | Supervised; unsupervised | Inverse document frequency (idf), n-gram
CommunityCluster Algorithm [36] | Unsupervised | N-gram
Keycluster Algorithm [44] | Unsupervised | Co-occurrence-based term relatedness (idf), n-gram
RAKE [45] | Unsupervised | Term frequency; term degree (the number of different terms that it co-occurs with); ratio of degree to frequency
Sarkar et al. Algorithm [1] | Supervised | Noun phrase, phrase frequency, phrase links to other phrases, idf, phrase position, phrase length and word length
CeKE Algorithm [46] | Supervised | tf-idf; relativePos, POS; inCited, inCiting; citation tf-idf, first position, tf-idf-Over, firstPosUnder, keyphraseness
KeyphraseDS System [47] | Unsupervised | Syntactic features (POS and relation), correlation features (MutInfoPre, MutInfoNext), semantic relatedness
KeyEx Method [48] | Supervised | tf-idf; POS; first occurrence position; statistical pattern features (pattern support, pattern length, number of closed patterns, PF-IDF value)
TSAKE [49] | Unsupervised | N-gram, co-occurrence graph
YAKE! [30] | Unsupervised | Word position, word frequency, word relatedness to context, and word occurrence in different sentences

Candidate Keyphrases Scoring and Ranking Approaches

This step of the automatic keyphrase extractor aims to score the candidate keyphrases. We highlight both supervised and unsupervised approaches for this purpose (see Table 1).

In unsupervised approaches, the AKPE task is considered as a ranking problem and is performed without prior knowledge. Most unsupervised keyphrase extraction techniques can be broadly divided into statistical, graph-based, and embedding-based techniques [33, 50]. Rabby et al. [51] used nominal statistical knowledge to select keyphrases. Among the statistical techniques, Matsuo and Ishizuka [51] consider words relevant if they co-occur in the document more frequently than if they were randomly distributed in the document. They used the χ² values to determine the biases between expected and observed frequencies. Another feature that has shown significant improvements in different keyword extraction systems is the CHIR statistic [18, 19]. It is basically based on the well-known chi-square (χ²) statistic measure, which determines whether a term and a category are statistically independent [51]. Despite its simplicity, ranking based on tf-idf and co-occurrence measures has been shown to work very well in practice [52]. Chen et al. [52] use three new features (phrase frequency, average term frequency, and an Okapi feature that involves a tf part, an idf part, and a document length part) considering web pages' structure. Wang and Peng [53] present mark-up features to extract keyphrases from web pages. Nguyen and Kan [54] consider that keyphrases are distributed non-uniformly in different sections of a scientific paper. They propose section occurrence vectors ("in title" and "in a section occurrence") favoring sections like introduction and related work. Kumar and Srinathan [55] use the position of phrases in a sentence and the position of a sentence in the document. Berend and Farkas [56] present features that consider the document, corpus, and knowledge-based levels. They also use external knowledge sources (such as Wikipedia) to achieve further improvements in performance.

Adar and Datta [57] extract keyphrases by mining abbreviations from scientific literature and composing those keyphrases into semantically meaningful hierarchies to build a domain-specific keyphrase database. El-Beltagy and Rafea [37] proposed the KP-Miner system, an unsupervised non-learning ranking-based system for keyphrase extraction, which relies on a three-step process: candidate keyphrase selection, candidate weight calculation, and finally keyphrase refinement. To determine the importance of a keyphrase, this system assigns scores to phrases based on a modified version of tf-idf and two boosting factors (position of the first occurrence, and word length, which increases the chance of longer phrases being selected). As the closest prior research to our proposal, especially in the second AKPE step, Florescu and Caragea [58] proposed a new scheme for scoring phrases that calculates the final score using the average of the scores of the individual words, weighted by the frequency of the phrase in the document. Tomokiyo and Hurst [59] propose an approach that extracts keyphrases based on statistical language models using the point-wise KL-divergence between multiple language models. Most recently, Campos et al. proposed YAKE! [30], an unsupervised system that is based on a lightweight technique. It takes five features into consideration, namely casing, word position, word frequency, word relatedness to context, and word occurrence in different sentences, to calculate a keyphrase's weight. However, because it generates candidate keyphrases using n-grams, its computational complexity increases linearly with respect to the n-grams. Consequently, a large number of keyphrases are generated, which burdens the ranking procedure. Rabby et al. [60] proposed "TeKET," a new unsupervised tree-based keyphrase extraction technique, which is domain and language independent. It employs limited statistical knowledge, and no training data are required. The CorePhrase [7] system constructs a list of keyphrase candidates by comparing every pair of documents. It scores each candidate according to some criteria and ranks keyphrases according to their scores. Liu et al. [44] involve an unsupervised clustering technique (henceforth KeyCluster) which assigns an importance score to a word, and then picks the top-ranked words as keywords by clustering semantically similar candidates using Wikipedia and co-occurrence-based statistics. In the KeyphraseDS system [47], keyphrases are extracted through a conditional random field (CRF)-based model exploiting various features. Boudin [23] proposes a weighting model applied to both supervised and unsupervised word weighting functions.


In their inference model, the sets of candidates are weighted according to their component words. He used different weighting functions, such as tf-idf, TextRank, and a logistic regression model, to assign importance weights to the words in the document.

Graph-based methods offer an alternative solution to purely statistical methods by building a graph from an input document and ranking its nodes according to their importance. The TextRank algorithm [40] is one of the most representative graph-based models. Later, several variations of TextRank were proposed, such as SingleRank [41] and ExpandRank [41], the latter being enriched by terms of the k-nearest neighboring documents. An expanded version of this neighborhood approach was proposed in the CiteTextRank algorithm [46], which extracts keyphrases from research articles. TopicRank [61] extracts noun phrases that represent the main topics of a document and clusters them into topics. Later, Sterckx et al. proposed TopicalPageRank [62], which runs TextRank multiple times for a document, once for each of its topics generated by Latent Dirichlet Allocation (LDA) [63]. Consequently, TPR ensures that the extracted keyphrases cover the main topics of the document. The final score of a candidate is computed as the sum of its scores for each topic, weighted by the probability that a given topic is in the document. Rose et al. [45] presented RAKE, a domain- and language-independent, unsupervised, single-document approach. It considers term sequences (delimited by stopwords or phrase delimiters) as candidate keywords and builds a matrix of term co-occurrences on this basis. Term scores are then determined for each candidate keyword as the sum of its member word scores. Florescu and Caragea [64] proposed an unsupervised graph-based model called PositionRank, which incorporates information from all positions of a word's occurrences and its frequency in a document to compute a biased PageRank score for each word. Boudin proposed the MultipartiteRank technique [65], which utilizes a multipartite graph.

With recent advancements in deep learning techniques applied to NLP, words are represented as dense real-valued vectors, popularly known as word embeddings. These representations have been shown to equal or outperform other methods [66]. The embedding vectors are supposed to preserve the syntactic and semantic similarities between words. Some of the most popular approaches for training word embeddings are Word2Vec [41], Key2Vec [27], GloVe [67], and FastText [68]. Such local representations of words and keyphrases are able to accurately capture their semantics in the context of the document they are part of, and therefore can help in improving keyphrase extraction quality. Papagiannopoulou and Tsoumakas [69] present RVA, a novel unsupervised method for keyphrase extraction, whose main innovation is the use of local word embeddings (in particular GloVe vectors). Bennani-Smires et al. [70] tackle keyphrase extraction from single documents with EmbedRank, a novel unsupervised method that leverages sentence embeddings. Sun et al. present SIFRank [71], a new baseline for unsupervised keyphrase extraction based on a pre-trained language model.

In the supervised approaches, keyphrase extraction is considered as a binary classification task, where each candidate keyphrase in a document is determined to be a keyphrase or not [24, 25]. Supervised methods have mainly focused on (1) selecting features that measure candidate keyphrases, and (2) running learning models to discriminate keyphrases from non-keyphrases. These models must be trained on manually annotated keyphrases (which may sometimes not be practical, and is time consuming). The used features can be classified as follows: statistical [72, 44, 73, 1, 20, 24, 25], linguistic [26, 31, 32, 35, 43, 47], structural [48, 55, 57], semantic [74, 75, 76, 77, 78], or hybrid ones [79, 80, 81]. Different learning models are used in the supervised systems, such as SVM [82, 83], maximum entropy [84], bagged C4.5 [85], Naïve Bayes [86, 87, 88], bagged decision trees [89], logistic regression [90], and neural networks [91].

A more thorough discussion and in-depth review of research on supervised and unsupervised AKPE systems, including a comparative study, can be found in a recent survey [19]. As mentioned in [19], using different features turns out to be very useful for the AKPE task. Indeed, the best-performing systems [37, 31, 81] have combined at least two types of features to extract relevant keyphrases. Hence, any AKPE system's success depends mainly on the candidate keyphrase selection technique used, and on the pertinence, the number, and the variety of the used features [19].

Methods

In this section, we describe HAKE, our unsupervised hybrid automatic keyphrase extraction method, whose input is a collection of text documents and whose output is a set of relevant keyphrases for every single document. Generally, our proposed method is intuitively based on the following assumptions: (1) a selected candidate keyphrase should be comprehensive and syntactically correct, (2) a keyphrase should be relevant if it contains one or many relevant keywords, and (3) a keyword should be relevant if it is selected and ranked with a high weight. Thus, we adjust a keyphrase's weight by its keywords' salience scores.


The HAKE system consists of four main steps: (1) text pre-processing and candidate keyphrase identification, (2) feature extraction, (3) computing the keyword scores, and (4) computing the candidate keyphrase scores and ranking. The first step pre-processes the input documents, then identifies and forms comprehensive and syntactically correct candidate keyphrases using a novel method that relies on a parsing-based technique and part-of-speech tagging. Several feature selection criteria (statistical, structural, semantic) are calculated to select relevant individual keywords in the second step. Then, in the third step, these selected feature values are combined using a novel lightweight model, "KWHFSM" (KeyWord Hybrid Features Selection Model), that gives a synthetic score to each keyword (see the "Features Extraction" section) to reflect the importance of the term. In the fourth step, the keyword scores or weights are gathered using a new scheme to obtain scores for the candidate keyphrases identified in the first step, based on their relevancy. Finally, this last weighting model, called "KPHFSM" (the Keyphrase Hybrid Features Selection Model), is performed to rank the keyphrases. The candidate keyphrases with the highest KPHFSM scores are ranked as the more relevant keyphrases. Algorithm 1 and Fig. 1 describe the steps of the HAKE system. Further details about each step are presented in the following sub-sections.

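As a rough orientation before the detailed sub-sections, the control flow of these four steps can be sketched as follows. This is a toy, self-contained skeleton, not a released implementation: the helper computations inside it are illustrative placeholders for the candidate identification technique and the KWHFSM/KPHFSM models defined below.

```python
# A toy skeleton of the four HAKE steps; the in-line helpers are
# placeholders, so only the control flow mirrors the description above.
def hake(document, top_k=15):
    # Step 1: pre-processing and candidate keyphrase identification.
    candidates = [p.strip() for p in document.split(".") if p.strip()]
    # Step 2: feature extraction per single keyword (CHIR, SIM, TPW).
    features = {w: (1.0, 1.0, 1.0) for w in set(document.lower().split())}
    # Step 3: keyword scoring (placeholder for the KWHFSM model).
    keyword_score = {w: sum(f) / len(f) for w, f in features.items()}
    # Step 4: keyphrase scoring (placeholder for KPHFSM) and ranking.
    def phrase_score(kp):
        words = kp.lower().split()
        return sum(keyword_score.get(w, 0.0) for w in words) / len(words)
    return sorted(candidates, key=phrase_score, reverse=True)[:top_k]

print(hake("Grid computing offers power. A grid solves large problems."))
```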

Fig. 1  Schematic diagram of HAKE system for extracting keyphrases

Identification of Candidate Keyphrases

In this subsection, we present our method for identifying candidate keyphrases. Generally, phrases have a stronger ability to represent text topics than simple words. However, phrases have a complex grammatical structure. Hence, the first problem we have to deal with is phrase identification, because it is hard to let the algorithm know what a phrase is if phrases do not have clear boundaries and structure. Some previous works include all possible n-grams, but the number of resulting candidates can be quite large. Some other systems use syntactically motivated boundaries, such as NP-chunks and POS tag sequences, which decrease the number of candidate phrases. Our method aims at obtaining a refined candidate set that benefits from the best identification techniques. To achieve this goal, we propose a novel technique (based on a parse tree approach, part-of-speech tagging, and filtering) to generate a comprehensive and meaningful list of candidate keyphrases.

Documents Pre-processing

Identifying the candidate terms of each document requires pre-processing the text. Text pre-processing (TP) aims to regularize the input files and set phrase boundaries for word sequences. The way this classical step is done can significantly affect the accuracy of the keyphrase extraction method. Thus, it is important to give some details on how we have implemented it. Each document goes through the following pre-processing stages (a minimal code sketch of the pipeline is given at the end of this subsection):

– Personalized tokenizing: for each sentence, we tokenize it by eliminating all marks, punctuation, brackets, numbers, and special characters, along with lower-casing all words. For example, given the noisy sentence "text mining Inst~ance;,;:, ~are; fea=ture~s", the result will be "text mining instance are features".
– Stop word removal: removing stop words saves time and feature space. They include a large number of prepositions, pronouns, and conjunctions in sentences. These words need to be removed before we analyze the text, so that the frequently used words are mainly the words relevant to the context and not common words used in the text. In particular, we exclude from the stop word list some words that may occur as part of keyphrases (e.g., "in," "of," "for," "and"). The reason for keeping such conjunctions and prepositions is that their absence may impact the readability of the final result through grammatical mistakes.
– Lemmatization: term variation can affect a term's frequency. Lemmatization reduces inflectional forms and derivationally related forms of a word to a common base form, so they can be analyzed as a single item identified by the word's lemma. Unlike stemming, lemmatization depends on correctly identifying the envisioned part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. We use the Stanford CoreNLP library for lemmatizing our sentences [92].

To recognize terms of equivalent meaning, stemming is also applied during the pre-processing step in many AKPE systems. However, in our system, we decided to stem in the later evaluation step (see the "Results" section). We aim to keep the non-stemmed original input texts, from which the final keyphrases can be picked up, to avoid stemming grammatical complications.
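The following is a minimal sketch of the three stages, assuming NLTK (with its stopword and WordNet data downloaded) as a stand-in for the Stanford CoreNLP pipeline used in the paper; the reduced stop list is illustrative.

```python
# A minimal sketch of the pre-processing stages; NLTK is a stand-in
# for the Stanford CoreNLP tools named in the text.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

KEPT = {"in", "of", "for", "and"}               # kept: may occur inside
STOP = set(stopwords.words("english")) - KEPT   # keyphrases
LEMMATIZER = WordNetLemmatizer()

def tokenize(sentence):
    # Drop punctuation, digits, and special characters; lower-case.
    tokens = (re.sub(r"[^a-z]", "", raw) for raw in sentence.lower().split())
    return [t for t in tokens if t]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOP]

def lemmatize(tokens):
    return [LEMMATIZER.lemmatize(t) for t in tokens]

noisy = "text mining Inst~ance;,;:, ~are; fea=ture~s"
print(tokenize(noisy))  # ['text', 'mining', 'instance', 'are', 'features']
```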

Candidate Keyphrases Identification Technique

Since the proposed method does not know what a phrase is, it has to use a set of conditions to identify and generate candidate keyphrases. In this phase, we propose a novel technique for candidate keyphrase extraction based on sentence parsing and POS tagging. The generated sentences are then analyzed to identify potential candidate keywords, which are the topical terms that represent the document text in its entirety. In subsequent stages, linguistically relevant candidate keyphrases are selected from this set. Figure 2 shows the schematic diagram of our proposed technique. The steps of the proposed technique are as follows:

Sentence Parse Tree Generation

Sentence splitting is a preliminary step that is commonly performed on texts before further processing; it is called sentence boundary detection or sentence segmentation. Namely, it is the process of dividing up the input text into sentences. We split the input document into a list of sentences. This list is then passed through the sentence parser, which generates and returns a parse tree for each sentence. We employ the Stanford Parser [92] for this aim. By creating a parse tree for each sentence (see the example in Fig. 3), we can depict the sentence's full picture. It shows the overall structure of the sentence and its relationships to its different parts. For instance, the parse tree of the lengthy sentence "grid computing has become a real alternative to traditional parallel computing, a grid provides much computational power, and thus offers the possibility to solve very large problems" is given in Fig. 3. The list of tree structures is then passed to the next step, candidate keyphrase extraction and filtering, to obtain meaningful and comprehensive candidate keyphrases.

Fig. 2  Schematic diagram of the proposed technique for candidate keyphrases identification


Fig. 3  Example of parse tree of a sentence
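A minimal sketch of this step is shown below, using NLTK's client for a Stanford CoreNLP server (assumed to be running locally on port 9000) as one way to call the Stanford Parser from Python; it parses the example sentence of Fig. 3 and lists the noun phrase subtrees that the next step works on.

```python
# Parse tree generation via a Stanford CoreNLP server (assumed local).
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url="http://localhost:9000")
sentence = ("grid computing has become a real alternative "
            "to traditional parallel computing")

tree = next(parser.raw_parse(sentence))   # constituency parse tree
tree.pretty_print()

# Cut the tree into noun phrase subtrees, the parts retained by the
# extraction and filtering step.
for np in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(np.leaves()))
```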

Candidate Keyphrases Extraction and Filtering

After parsing the document, this step aims to generate syntactically correct and comprehensive candidate keyphrases. To do that, we extract the meaningful parts from the sentence tree structures. We cut the tree into subtrees for each sentence. Then, we branch out noun phrases via the filtering process. In the filtering process, the generated list of candidate keyphrases is filtered by certain heuristics. As a result, a candidate can be included in the final list of candidates if it meets the following heuristic rules (a code sketch of this filter is given at the end of this subsection):

1. Part-of-speech (POS) tags and noun phrase chunks are considered the most commonly used linguistic features that capture the linguistic properties of keyphrases [31, 81, 19]. Indeed, selecting well-formed, linguistically informative candidates limits and filters the keyphrase set. Consequently, we bound the computational complexity of the next feature selection step. To achieve that, we placed a phrase-type restriction to identify linguistically correct candidate keyphrases [31, 32, 78, 29, 35, 25, 33]. We only keep candidate keyphrases that are considered noun phrases. The definition of a noun phrase can differ from one method or language to another. To select proper and accurate candidate keyphrases, we have proposed a new POS tag sequence definition of all possible noun phrase forms. We use the Stanford Parser [92] and the Stanford POS Tagger [93] for these aims. Our proposed POS tag definition of noun phrases satisfies the following regular expression rule: (NN | NNS | NNP | NNPS | VBN | JJ | JJS | RB)* (NN | NNS | NNP | NNPS | VBG)+. Unlike other definitions, our structure of noun phrases can comprise an adverbial noun (tag RB), such as "double experience" (RB NN), and a verb in the present participle (tag VBG), such as "virtual desktop conferencing" (JJ NN VBG), where the VBG tag can be at the beginning, in the middle, or at the end of the noun phrase. Adverbial nouns are sometimes referred to as adverbial objectives; they hold a position normally occupied by a verb's direct object and modify the verb with an aspect of time, distance, weight, age, or monetary value. Adverbs, in general, can interact with the noun phrase. They can change the context and the meaning of a candidate keyphrase, especially in scientific domains, where authors are more precise and specific in explaining some situations.
2. The candidate keyphrase does not start or end with a stop word.
3. The length of a candidate keyphrase is up to n extracted tokens (i.e., unigrams, bigrams, trigrams, four-grams) without a specified limit (usually with n <= 5 at maximum). Nevertheless, the length parameter can be fine-tuned from one dataset to another, depending on the size of the documents and the dataset's knowledge domain.

After filtering the candidate keyphrases, some of them may still contain redundant phrases, or one phrase may be part of another one. Therefore, we perform a redundancy removal that eliminates the overlap between phrases. This removal uses the following conditions: (1) if two candidate keyphrases match each other, then eliminate one of them; (2) if a noun phrase is part of another noun phrase, then eliminate the first one.

Figure 4 shows a sample sentence and the candidate keyphrases identified from it. Sometimes, some candidate keyphrases generated using the technique mentioned above may be less meaningful to human readers. After the feature engineering step, these kinds of candidate keyphrases are filtered out. For this purpose, a condition is applied: a threshold is chosen on the phrase weight (the phrase weighting scheme is presented in the next subsection).
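A sketch of this filter is given below, using NLTK's RegexpParser as a stand-in for the Stanford tools; the chunk grammar transcribes the regular expression rule above, and the stop list is illustrative (NLTK's tokenizer and POS-tagger models must be downloaded beforehand).

```python
# Candidate extraction and filtering with the paper's POS pattern.
import nltk

GRAMMAR = r"NP: {<NN|NNS|NNP|NNPS|VBN|JJ|JJS|RB>*<NN|NNS|NNP|NNPS|VBG>+}"
CHUNKER = nltk.RegexpParser(GRAMMAR)
STOP = {"the", "a", "an", "of", "in", "and", "for"}   # illustrative

def candidate_keyphrases(sentence, max_len=5):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    phrases = []
    for np in CHUNKER.parse(tagged).subtrees(filter=lambda t: t.label() == "NP"):
        words = [w.lower() for w, _ in np.leaves()]
        # Rules 2 and 3: no stop word at either edge, at most max_len tokens.
        if words[0] in STOP or words[-1] in STOP or len(words) > max_len:
            continue
        phrases.append(" ".join(words))
    # Redundancy removal: drop duplicates and phrases nested in others.
    kept = []
    for p in sorted(set(phrases), key=len, reverse=True):
        if not any(p in q for q in kept):
            kept.append(p)
    return kept
```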

Fig. 4  Identified candidate keyphrases from a simple sentence

Features Extraction

After the text pre-processing and candidate keyphrase identification stages, we obtain a list of the document's candidates. The success of an AKPE system depends on its capability to extract the best terms that describe the document content. This step aims at extracting a subset containing the most important single terms. Our method uses a feature weight calculation; the feature weight is an index of phrase importance. The axiomatic idea is that keyphrases should comprise keywords. Here, keywords mean the most relevant single words in the document. However, by looking at the majority of documents, it can be observed that phrases are much less frequent than single terms within the same document [37]. To overcome this issue, our method extracts and selects single relevant keywords that are used later to rank the candidate keyphrases. More precisely, our keyword set is based on a keyword weighting scheme. The synthetic weight of a relevant keyword is calculated using the hybrid keyword feature selection model "KWHFSM." Then, the resulting keyword weights are gathered to obtain scores for the candidate keyphrases. We employ a hybrid analysis based on co-occurrence, structural, and semantic features. Consequently, the used features are divided into three groups: statistical features, semantic features, and structural features. Detailed explanations of each type of feature are given below.

Statistical Features Measure

In the information retrieval and text mining fields, the $\chi^2$ statistic is regularly used to measure the degree of dependency between a term $w$ and a specific category $c$. Text categorization is the task of classifying a document into predefined categories based on the contents of the document. To categorize the documents, we first build clusters by performing the k-means algorithm to get initial clusters and centroids. During the clustering process, we perform the feature selection method using the current clusters and their centroids to estimate the relevancy of each term to the data set.

The $\chi^2$ term-category independence test (the category being the documents' cluster that includes the most relevant keywords) has been widely used for feature selection [18, 53] and is defined as follows:

$$\chi^2_{w,c} = \sum_{i \in \{w,\bar{w}\}} \sum_{j \in \{c,\bar{c}\}} \frac{(O(i,j) - E(i,j))^2}{E(i,j)} \qquad (1)$$

where $O(i,j)$ is the observed and $E(i,j)$ the expected frequency of documents that are in the category $c$ and contain the term $w$. However, using only the $\chi^2$ statistic can eliminate terms that are fairly relevant to a category, keep irrelevant and redundant terms, and fail to provide enough information about the relationship between the selected terms and the corresponding categories. To address this problem, Li et al. [53] define a new criterion for the relevancy of a term $w$ to a category $c$: the term $w$ should have a strong positive dependency on the category $c$. First, to evaluate whether the dependency between a term $w$ and a category $c$ is positive or negative, the $R_{w,c}$ measure is defined as follows:

$$R_{w,c} = \frac{O(w,c)}{E(w,c)} \qquad (2)$$

which can be interpreted with probabilities as follows:

$$R_{w,c} = \frac{p(w,c)\,p(\neg w,\neg c) - p(w,\neg c)\,p(\neg w,c)}{p(w)\,p(c)} + 1 \qquad (3)$$

where $R_{w,c}$ is the ratio between $O(w,c)$ and $E(w,c)$. $R_{w,c}$ should be close to 1 if there is no dependency between the term $w$ and the category $c$ (i.e., $\chi^2_{w,c}$ is not statistically significant); $R_{w,c}$ should be larger than 1 if there is a positive dependency, which means that the observed frequency is larger than the expected frequency; and $R_{w,c}$ should be smaller than 1 if there is a negative dependency.
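Before these two quantities are combined into CHIR below, both can be computed from four document counts; a minimal sketch (variable names are illustrative, and nonzero expected frequencies are assumed):

```python
# Chi-square statistic (Eq. 1) and R ratio (Eq. 2) from document counts.
def chi_square_and_r(n_wc, n_w, n_c, n):
    """n_wc: docs in cluster c containing w; n_w: docs containing w;
    n_c: docs in cluster c; n: total number of documents."""
    chi2 = 0.0
    for has_w in (True, False):
        for in_c in (True, False):
            observed = (n_wc if has_w and in_c else
                        n_w - n_wc if has_w else
                        n_c - n_wc if in_c else
                        n - n_w - n_c + n_wc)
            expected = ((n_w if has_w else n - n_w) *
                        (n_c if in_c else n - n_c)) / n
            chi2 += (observed - expected) ** 2 / expected
    r = n_wc / (n_w * n_c / n)   # Eq. (2): O(w, c) / E(w, c)
    return chi2, r

# R > 1 signals the positive dependency required by the CHIR measure.
print(chi_square_and_r(n_wc=30, n_w=40, n_c=50, n=200))
```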

In order to calculate the feature goodness of a term $w$ in a corpus with $k$ categories, Li et al. [53] combine the $\chi^2_{w,c}$ statistic (1) and the $R_{w,c}$ measure (2) and propose a new measure called CHIR, defined as follows:

$$r\chi^2(w) = \sum_{j=1}^{k} p(R_{w,c_j})\,\chi^2_{w,c_j} \quad \text{with } R_{w,c_j} > 1 \qquad (4)$$

where $p(R_{w,c_j})$ is the weight of the chi-square statistic $\chi^2_{w,c_j}$ in the corpus in terms of $R_{w,c_j}$. It is defined as follows:

$$p(R_{w,c_j}) = \frac{R_{w,c_j}}{\sum_{j=1}^{k} R_{w,c_j}} \quad \text{with } R_{w,c_j} > 1 \qquad (5)$$

This new term-goodness measure, $r\chi^2(w)$, is the weighted sum of the $\chi^2_{w,c_j}$ statistics when there is a positive dependency between the term $w$ and the category $c_j$; a larger $r\chi^2(w)$, or CHIR measure, value indicates that the term is more relevant.

Semantic Feature Measure

To avoid problems like synonymy and polysemy, and to increase the semantic aspect of the selected terms in a specified context, semantic feature selection methods measure the importance of terms based on their semantic content, using measures such as pointwise mutual information, information gain, or the log-likelihood ratio. One of the most used semantic measures is mutual information. Mutual information (SIM) is widely used in information theory. It was proposed as a measure of word association and reflects the strength of the relationship between words by comparing their actual co-occurrence probability with the probability that would be expected by chance. Mutual information, in general, represents the relative change in the likelihood of observing one word when another is present (the amount of information that it provides) [83]. It is based on the fact that two words are considered similar if their mutual information with all the words in the vocabulary is nearly the same [83].

Dagan et al. [83] have defined a semantic similarity measure between two terms $w_1$ and $w_2$ as follows:

$$sim(w_1, w_2) = \frac{1}{2|V|} \sum_{i=1}^{|V|} \left( \frac{\min(I(z_i,w_1), I(z_i,w_2))}{\max(I(z_i,w_1), I(z_i,w_2))} + \frac{\min(I(w_1,z_i), I(w_2,z_i))}{\max(I(w_1,z_i), I(w_2,z_i))} \right) \qquad (6)$$

where $V$ is the vocabulary and $I(z_i, w_1)$ is the mutual information between the terms $z_i$ and $w_1$. $I(z_i, w_1)$ is evaluated using the following formula:

$$I(z_i, w_1) = P_d(z_i, w_1) \log\left( \frac{P_d(z_i, w_1)}{P(z_i)\,P(w_1)} \right) \qquad (7)$$

where $d$ represents the size of a sliding window, $P_d(z_i, w_1)$ is the probability of succession of $z_i$ and $w_1$ in a window of $(d+1)$ words, and $P(z_i)$ is the a priori probability of the term $z_i$. This probability can be estimated by the ratio of the number of times that $z_i$ is followed by $w_1$ within the window to the cardinality of the vocabulary.

The similarity between a term $w$ and a document centroid $d$ is defined in [94] as the average of the similarities between the term $w$ and the $x$ terms of the document centroid. This measure is given by:

$$SIM(w, d) = \frac{\sum_{j=1}^{x} sim(w, w_j)}{\sum_{j=1}^{x}\sum_{i=1}^{x} sim(w_j, w_i)} \qquad (8)$$

To determine the semantic relevance of a term $w$ in a corpus of $k$ clusters, for each cluster we calculate the weighted sum of its similarities with the document centroid $dcen_j$ of each cluster $c_j$ using the following formula:

$$SIM(w) = \sum_{j=1}^{k} P(I(w, dcen_j))\,SIM(w, dcen_j) \qquad (9)$$

where $P(I(w, dcen_j))$ is the weight of the similarity between the term $w$ and the document centroid $dcen_j$, and $I(w, dcen_j)$ is the mutual information between $w$ and $dcen_j$. Consider the contingency table of a term $w$ and a centroid $d$, where $A$ is the number of times $w$ and $d$ co-occur (i.e., $w$ occurs in documents that belong to the cluster whose centroid is $d$), $B$ is the number of times $w$ occurs without $d$, $C$ is the number of times $d$ occurs without $w$, and $N$ is the total number of documents. The mutual information criterion between a term $w$ and a document centroid $dcen_j$ is defined by:

$$I(w, dcen_j) = P(w, dcen_j) \log\left( \frac{P(w, dcen_j)}{P(w)\,P(dcen_j)} \right) \qquad (10)$$

If there is a strong association between $w$ and $dcen_j$, then the joint probability $P(w, dcen_j)$ will be larger than $P(w)P(dcen_j)$; consequently, $I(w, dcen_j) > 0$. If $w$ and $dcen_j$ are in complementary distribution, then $P(w, dcen_j)$ will be less than $P(w)P(dcen_j)$; hence, $I(w, dcen_j) < 0$. In the case of a poor association between $w$ and $dcen_j$, $P(w, dcen_j) \approx P(w)P(dcen_j)$; consequently, $I(w, dcen_j) \approx 0$. The weight $P(I(w, dcen_j))$ is defined as follows:

$$P(I(w, dcen_j)) = \frac{I(w, dcen_j)}{\sum_{i=1}^{k} I(w, dcen_i)} \quad \text{with } I(w, dcen_j) > 0 \qquad (11)$$

A high $SIM(w)$, or SIM measure, value of a term indicates that this term is semantically relevant.
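A minimal sketch of the mutual information criterion (Eq. 10) and the resulting weights (Eq. 11) is given below; the probabilities are estimated from the contingency counts A, B, C, and N defined above, and the example figures are illustrative.

```python
# Mutual information between a term and a cluster centroid (Eq. 10),
# and the Eq. (11) weights over all cluster centroids.
from math import log

def mutual_information(a, b, c, n):
    """a: docs where w and d co-occur; b: w without d; c: d without w;
    n: total number of documents."""
    p_wd = a / n
    p_w, p_d = (a + b) / n, (a + c) / n
    return p_wd * log(p_wd / (p_w * p_d)) if p_wd else 0.0

def sim_weights(counts):
    """counts: one (A, B, C, N) tuple per cluster -> Eq. (11) weights."""
    mi = [max(mutual_information(*t), 0.0) for t in counts]  # keep I > 0
    total = sum(mi)
    return [m / total if total else 0.0 for m in mi]

# Two clusters: the term is strongly associated with the first one.
print(sim_weights([(30, 10, 20, 200), (5, 35, 95, 200)]))
```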

Structural Feature Measure

Position features are related to the position of a given word or phrase in the text. The word position in a document plays a significant role in linguistic and statistical approaches. Using a word's position as a feature has many advantages for selecting keywords or keyphrases and for increasing their precision and recall, especially in newspaper articles and scientific papers [12, 49, 26, 95]. For example, in a corpus of scientific research papers, each article comprises a consistent sequential structure (title, abstract, keywords, full text, etc.) whose parts cannot all have the same importance level. Thus, the relevance of a given word varies from one part to another. In such valuable positions in a document, words carry more important information than in ordinary paragraphs. Statistics show that the title, the abstract, and the first paragraph usually highlight the topic and cover the whole text's core ideas.

To calculate the importance and relevance of a term's position, along with its frequency in each part of a document, we propose a measure called the term position weight "TPW" (for single documents and multi-documents). To do that, let $d$ be a document composed of $n$ parts $p_x$ and $z$ terms $t_y$. The considered parts are, for example, the title, the abstract, and the full text.

• The document part weight:

We give each part $p_x$ of a document $d$ a weight $w_x$; this weight reflects the relevancy of the part in the document, with $\sum_{x=1}^{n} w_x = 1$. Notice that the title and abstract have the larger weights.

• The term relevancy within a part of a document:

The relevancy of each term $t_y$ from $d$ in a part $p_x$ is defined as $tpw_{p_x}(t_y)$ and calculated as follows:

$$tpw_{p_x}(t_y) = tf_{p_x}(t_y) \times w_x \qquad (12)$$

where $tf_{p_x}(t_y)$ is the frequency of the term $t_y$ in the part $p_x$.

• The term relevancy in a document:

Given a term $t_y$, its relevancy in the document $d$ is defined as $tpw_d(t_y)$, calculated as follows:

$$tpw_d(t_y) = \sum_{x=1}^{n} tpw_{p_x}(t_y) \qquad (13)$$

• The term relevancy in the whole dataset:

The following equation is used to calculate the weight of a term $t_y$ in the dataset:

$$tpw(t_y) = \frac{\sum_{d=1}^{n} tpw_d(t_y)}{n} \qquad (14)$$

where $n$ is the number of occurrences of $t_y$ in different documents of the dataset.

A high $tpw(t_y)$ value of a term indicates that this term is structurally relevant.

Computing Keyword Score

The Keyword Hybrid Feature Selection Model

After computing the score of each feature, in this step we determine the final keyword score. To compute this score, we gather all the feature weights into a unique score $s(w)$. The gathering process goes through our novel hybrid term weighting model "KW-HFSM" (the Keyword Hybrid Feature Selection Model), which can reduce the feature space by selecting terms that are statistically, semantically, and structurally pertinent. We define the score of a term $w$ as a weighted sum of its statistical measure $chir(w)$, semantic measure $sim(w)$, and structural measure $tpw(t_y)$. The global measure of term goodness $s(w)$ is defined as follows:

$$s(w) = \alpha \cdot tpw(t_y) + \beta \cdot chir(w) + (1 - \alpha - \beta) \cdot sim(w) \qquad (15)$$

where $\alpha$ and $\beta$ are weighting parameters between 0 and 1, with $\alpha + \beta \le 1$. To select the $p$ most pertinent terms, we conduct the three following steps: (1) calculate the hybrid measure $s(w)$ for each term in the dataset, (2) sort the terms in descending order of their scores, and (3) finally select the top $p$ terms from the sorted list. A threshold $\delta$ is set to filter terms with a low $s(w)$ value. In our system, the value of $\delta$ is set to 0.25. As we will see in the next section, this is an important step in our process, because the single term scores are used to calculate the scores of the candidate keyphrases.

Computing Keyphrase Score and Ranking

The Keyphrase Hybrid Feature Selection Weighting Model

After identifying candidate keyphrases (in the candidate keyphrase identification stage) and calculating the features of single terms, in this subsection we describe how we use the feature values to rank all the candidate phrases. In this step, some authors choose a separate linear weighting coefficient per used feature [55]. Other authors use the supervised machine learning approach as a solution [20, 24, 31]. Although this method is useful, it involves the additional steps of preparing the training set and of the learning process. Our approach, based on statistical features in conjunction with the semantic and structural features, is used to rank the candidate phrases. Similar schemes have been shown to yield performances comparable to the supervised approach [32, 37]. According to the type of the feature, a higher CHIR value indicates higher statistical importance. This also applies to the SIM feature and the TPW feature.

Keyphrases are built from keywords, so the higher the weights of the keywords that build a keyphrase, the higher the weight of the keyphrase. We define the weight of a keyphrase as follows:

$$s(kp) = s(w_1, w_2, \ldots, w_n) = \frac{\sum_{i=1}^{n} s(w_i)}{n} \qquad (16)$$

where $n$ is the number of words within a candidate keyphrase. Consequently, the top-k keyphrases are ranked by their $s(kp)$ score. The final output of our system is a sorted list of non-overlapping top-k keyphrases.
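A short sketch ties the scoring pieces together: the structural TPW measure (Eqs. 12-13), the KWHFSM keyword model (Eq. 15), and the KPHFSM keyphrase model (Eq. 16). The part weights and feature values below are illustrative, not the paper's tuned settings.

```python
# TPW per document (Eqs. 12-13) and the two weighting models.
PART_WEIGHTS = {"title": 0.5, "abstract": 0.3, "fulltext": 0.2}  # sum to 1

def tpw_document(term, parts):
    # Frequency in each part, weighted by the part's relevance.
    return sum(tokens.count(term) * PART_WEIGHTS[name]
               for name, tokens in parts.items())

def kwhfsm(tpw, chir, sim, alpha=0.2, beta=0.2):
    # Eq. (15): hybrid keyword score; requires alpha + beta <= 1.
    return alpha * tpw + beta * chir + (1 - alpha - beta) * sim

def kphfsm(phrase_words, keyword_scores):
    # Eq. (16): keyphrase score as the mean of its keyword scores.
    return sum(keyword_scores.get(w, 0.0)
               for w in phrase_words) / len(phrase_words)

doc = {"title": ["grid", "computing"],
       "abstract": ["grid", "computing", "solves", "large", "problems"],
       "fulltext": ["a", "grid", "provides", "computational", "power"]}
s = {w: kwhfsm(tpw_document(w, doc), chir=1.0, sim=1.0)
     for w in {"grid", "computing"}}
print(kphfsm(["grid", "computing"], s))
```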

Results

In this section, we discuss our experimental setup and the evaluation results of the proposed keyphrase extraction method. The experiments aim first to explore how the key feature parameters can influence HAKE's performance; then, to make the results more convincing, we compare them with some existing representative methods.

Datasets

The primary corpus employed for evaluating the proposed technique and other similar techniques is the well-known "SemEval-2010" dataset published by the ACL SemEval workshop [21]. This dataset contains 244 scientific conference and workshop articles collected from the ACM Digital Library, covering four research areas: (1) Distributed Systems; (2) Distributed Artificial Intelligence-Multiagent Systems; (3) Social and Behavioral Sciences-Economics; and (4) Information Search and Retrieval. The 244 articles are divided into three sets: 144 for the training set, 40 for the trial set (trial documents), and 100 for the test set. The average length of an article is between six and eight pages (see Table 2). For the collected papers, three sets of answers (gold standard keyphrases) were provided for each paper: author-assigned keyphrases, reader-assigned keyphrases, and finally a combination of the two previous sets. All reader-assigned keyphrases were extracted from the papers, whereas some of the original author-assigned keyphrases do not occur in the documents' content [21]. The evaluation ran on the combined set, which was the one whose final results were taken into consideration by the competition organizers. Since our system is unsupervised, the experiments are run only on the test set.

The main benefit of using the SemEval-2010 corpus is that our results can be compared to those of the 25 AKPE systems, including those that participated in the challenge, and other recent works (such as TSAKE [30], the KG-KE Model [49], TeKET [60], and YAKE! [99]). Notice that on the SemEval-2010 dataset, the maximum recall achievable over the combined keyphrase set was approximately 75% [100].

Nguyen2007 [35] is a dataset composed of 211 scientific conference papers with 5201 tokens per document on average. The original author-assigned keyphrases were hidden to avoid bias. Standard gold keyphrases were manually assigned by student volunteers, which resulted in 11.33 gold keyphrases per document.

Theses100 [96] is a recent benchmark dataset. It is composed of 100 master's and Ph.D. theses from the University of Waikato, New Zealand. The dataset covers different domains, ranging from computer science and economics to psychology, philosophy, history, chemistry, and others. All documents are in plain text, and the average length of these documents is approximately 7000 words. For comparison, the standard gold keyphrases provided along with the dataset have been taken into account. Note that the Theses100 dataset has 47.6% absent gold keys (as stated in Table 2); for this reason, we did not expect to obtain high scores when testing on this particular dataset.

The Wiki20 [97] dataset contains 20 English technical research reports covering different subjects of computer science. Fifteen teams assigned keyphrases to each report using Wikipedia article titles as the candidate vocabulary. Each team assigned 5.7 keywords on average, and each document has 35.5 gold keyphrases on average. As for the Theses100 dataset, we did not expect to obtain high scores when testing on this particular dataset, due to its absent gold keys (51.8%).

Finally, the largest dataset in terms of documents, Krapivin2009 [98], was published by ACM in the period 2003-2005. It comprises 2304 full papers from the computer science domain, with 8040 tokens per paper on average. The papers were downloaded from the CiteSeerX Autonomous Digital Library, and each one has keyphrases assigned by the authors and verified by the reviewers, resulting in 6.34 gold keyphrases per document.

Table 2  Detailed statistics about the evaluated datasets

Dataset | Number of documents | #Gold keys (per document) | Absent gold keys | Average document size (in words)
SemEval-2010 [21] | 243 | 4002 (16.47) | 11.3% | 8332
Nguyen2007 [35] | 209 | 2369 (11.33) | 17.8% | 5201
Theses100 [96] | 100 | 767 (7.67) | 47.6% | 4728
Wiki20 [97] | 20 | 730 (36.50) | 51.8% | 6177
Krapivin2009 [98] | 2304 | 14599 (6.34) | 15.3% | 8040


Baseline Methods

We compare our approach against some unsupervised and supervised state-of-the-art algorithms. The open-source Python Keyphrase Extraction (PKE) toolkit [101] is utilized. The unsupervised systems comprise statistical methods such as TF-IDF [26], KP-Miner [37], RAKE [45], and YAKE! [30], and graph-based methods such as TextRank [40], SingleRank [41], ExpandRank [41], TopicalPageRank [62], PositionRank [64], TopicRank [61], and MultipartiteRank [65]. For the graph-based methods, we employed the default parameters as set in the corresponding papers. Namely, we set the co-occurrence window size for TextRank to 2 and for ExpandRank to 10, as these values have yielded the best results on their evaluation datasets. For ExpandRank, we use the five nearest document neighbors for each processed document. The co-occurrence window size is set to 10 for SingleRank, PositionRank, and TopicalPageRank, and the minimum similarity for clustering is set to 0.74 with the linkage method for both TopicRank and MultipartiteRank. Among the statistical methods, RAKE needs no parameters. For the YAKE! system, the maximum keyword length is set to 3, the minimum keyword length to 1, and the minimum number of keyword occurrences to 2. A full description of the parameters can be found in the corresponding papers. We also compared our approach against the supervised KEA (www.nzdl.org/Kea/) [102] system, which is trained through fivefold cross-validation using each corresponding dataset's documents. More comparisons were made against other supervised and unsupervised approaches, but only on the "SemEval-2010" dataset, since their implementations were not accessible, though the findings reported in their papers on that dataset were available.

Evaluation Metrics

Traditionally, automatic keyphrase extraction systems have been assessed using the proportion of top-N candidates that exactly match the gold standard keyphrases [24, 25, 103]. For example, on the SemEval-2010/Task-5 dataset (organized by the SemEval workshop), there was a competition for extracting keyphrases from scientific articles. The participants were asked to extract the top 5, 10, and 15 keyphrases for each article in the test document set. The evaluation was carried out by matching the extracted keyphrase sets against the answer sets (the combined set). We followed the same procedure given by the SemEval-2010 challenge [21] to evaluate our system against other systems, where an exact match is counted if the stem of an automatically extracted keyphrase exactly matches the stem of a keyphrase in the answer set (the combined set). For the stemming process, all keyphrases are stemmed using the English Porter stemmer [104]. The exact matching is strict and enables us to compare our results with previous systems.

Specifically, we used the following evaluation metrics. The precision, which represents the proportion of the extracted keyphrases that match the manually assigned ones, is given by Eq. (17):

$$Precision = \frac{TP}{TP + FP} \qquad (17)$$

The recall, which is the proportion of the manually assigned keyphrases that are extracted by the keyphrase extraction system, is given by Eq. (18):

$$Recall = \frac{TP}{TP + FN} \qquad (18)$$

where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.

The F1-score is defined as follows:

$$F1\text{-}Score = 2 \times \frac{precision \times recall}{precision + recall} \qquad (19)$$

Given the precision and recall results of each document, we calculate the micro-average of the precision, recall, and F-score over all documents. In the micro-average method, we sum up the individual true positives, false positives, and false negatives of our system over the different sets. Using this method, the precision and recall are defined as follows:

$$\text{Micro-average precision} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FP_i} \qquad (20)$$

$$\text{Micro-average recall} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FN_i} \qquad (21)$$

The micro-average F1-score is simply the harmonic mean of Eqs. (20) and (21).


Fig. 5  The F1 scores of the top 5, 10, and 15 keyphrases while varying the weighting parameters on the SemEval2010 dataset

The parameter selection is related directly to the importance of each feature. For this purpose, we conduct different analyses to understand the characteristics of the features CHIR, SIM, and TPW, and how they behave and contribute to the final keyword weight S(w) when combined in the single keyword weighting model (KW-HFSM). Moreover, we want to know their direct influence on the proposed method's performance over the different evaluated datasets. Therefore, to estimate the importance of each feature regarding its contribution to the weights of the keywords S(w) in the KW-HFSM weighting model, we conducted an ablation/relaxation study using a backward-like total exclusion/relaxation strategy.

In this strategy, each feature is individually removed, strengthened, or relaxed in the weighting model. The best configuration was then selected and improved by varying the parameter values. First, we conducted experiments with different values of the weighting parameters α and β on the SemEval-2010 dataset (evaluated on the gold combined set), which is characterized by its large texts. Concretely, we set the weighting parameters α and β ∈ {0, 0.2, 0.5, 1}. These values were applied to the KW-HFSM weighting model. A decrease in accuracy indicates how much the removed features impact the overall accuracy (see Fig. 5). The number of keyphrases proposed by HAKE is set at @5, @10, and @15 for this first experiment (following the challenge evaluation procedure). The acquired results are shown in Table 3. The highest F1 value is 36.49% for the top-15 keyphrases with (α = 0.2) and (β = 0.2) (that is, including the TPW feature together with the CHIR and SIM measures), followed by 34.44% with (α = 0) and (β = 0.5) (that is, using just the CHIR and SIM measures), and 28.89% with (α = 0.5) and (β = 0.5) (that is, disregarding the semantic measure). The lowest performance is 10.49% for the top-15 keyphrases with (α = 1) and (β = 0) (that is, eliminating both the statistical and the semantic measures). From the results, it is evident that the F1 value increases as the weight of the SIM measure increases. To gain more insight into the usefulness and importance of each feature of HAKE, we conducted the same experiments on the other datasets: large text datasets (Krapivin2009 and Wiki20), as well as datasets with different domains (Nguyen2007) (see Fig. 6). Figure 6 shows the effectiveness (F@10, P@10, and R@10 scores) of HAKE while changing the weighting parameter values. As shown in this figure, we observe that the measures CHIR and SIM present competitive results compared to their elimination, as the rates of F1, precision, and recall decrease otherwise. Indeed, it is apparent that the removal of the semantic feature ((α = 0.5) and (β = 0.5)) negatively impacts the results in all the baseline datasets.

Table 3  The performance of the HAKE system while changing the weighting parameters

                           Top 5 keyphrases          Top 10 keyphrases         Top 15 keyphrases
Parameter values           P        R       F1       P        R       F1       P        R       F1
α=0;   β=0.5; 1-α-β=0.5    48.10%   21.10%  29.33%   40.10%   30.90%  34.90%   34.20%   34.70%  34.44%
α=0.2; β=0.2; 1-α-β=0.6    50.80%   23.60%  32.23%   43.90%   32.90%  36.78%   36.20%   36.80%  36.49%
α=0.5; β=0.5; 1-α-β=0      42.90%   14.40%  21.56%   35.80%   22.60%  27.70%   29.10%   28.70%  28.89%
α=1;   β=0;   1-α-β=0      24.60%   1.90%   3.52%    18.30%   6.10%   9.15%    10.40%   10.60%  10.49%
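Reading Table 3 together with the ablation descriptions above, α can be interpreted as the weight of the structural TPW feature, β as the weight of the statistical CHIR feature, and 1 − α − β as the weight of the semantic SIM measure. The following minimal sketch shows how such a linear combination ranks keywords; the mapping of parameters to features follows the ablation text, while the feature scores and keyword names are hypothetical, and each feature is assumed to be pre-computed and normalized.

def keyword_weight(tpw, chir, sim, alpha=0.2, beta=0.2):
    """Hybrid keyword weight S(w) of the KW-HFSM model (a sketch):
    a convex combination of the structural (TPW), statistical (CHIR),
    and semantic (SIM) feature scores of a keyword w."""
    assert 0.0 <= alpha + beta <= 1.0
    return alpha * tpw + beta * chir + (1.0 - alpha - beta) * sim

# Hypothetical pre-computed, normalized feature scores for three candidates.
features = {
    'keyphrase':  {'tpw': 0.9, 'chir': 0.7, 'sim': 0.8},
    'extraction': {'tpw': 0.6, 'chir': 0.8, 'sim': 0.7},
    'baseline':   {'tpw': 0.2, 'chir': 0.4, 'sim': 0.3},
}

# Setting alpha=0 and beta=0.5 reproduces the "CHIR + SIM only" ablation row.
ranked = sorted(features, key=lambda w: keyword_weight(**features[w]), reverse=True)
print(ranked)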


Fig. 6  The F@10, P@10, and R@10 scores of HAKE while varying the weighting parameters on the other evaluated datasets (Nguyen2007, Wiki20, and Krapivin2009)

This provides strong proof that the semantic measure is an efficient way to boost the accuracy of the automated extraction of keywords and to sustain higher accuracy, precision, and recall rates, as keywords become more related to the document's topic despite the disparity in content domains. Nevertheless, a remarkable performance drop can be seen when discarding both the CHIR feature and the SIM feature and using just the TPW ((α = 1; β = 0)). By disregarding CHIR, we observe that removing this statistical feature tends to eliminate keywords with high relevance in the text. Thus, removing these keywords spoils the similarity computation and, consequently, the quality of the extracted keywords. Another feature that enhances the relevancy of a term is its relative position, captured by TPW. With this feature, we favored words that appear at the beginning of the document (title, abstract) while still considering other positions in the document (introduction, middle text, conclusion, etc.). This is particularly evident for scientific texts, which tend to concentrate a high degree of important keywords at the top of the text. It is apparent from Fig. 6 that the F1 score shows a noticeable increasing trend when this feature is taken into account. For example, in the SemEval2010 dataset, an increase in the F1 score of 5.95% is reported for the top 15 keyphrases (compared to the second-best score of the top 15 keyphrases, where TPW is excluded). Likewise, in the Nguyen2007, Wiki20, and Krapivin2009 datasets, increases in the F1 score of 10.46%, 31.13%, and 17.58%, respectively, are reported, whereas the best score of the top 15 keyphrases without using the TPW feature in the same dataset is 31.4%.

Table 4  Performances of the HAKE system with different noun phrase filters according to precision (P), recall (R), and F1-score (F)

          Top 5 keyphrases          Top 10 keyphrases         Top 15 keyphrases
Filters   P        R       F        P        R       F        P        R       F
Filter 1  50.80%   23.60%  32.23%   42.20%   32.60%  36.78%   36.20%   36.80%  36.49%
Filter 2  44.80%   15.30%  22.80%   36.20%   24.70%  29.40%   28.30%   28.90%  28.60%
Filter 3  45.00%   15.40%  22.90%   33.50%   22.90%  27.20%   26.70%   27.30%  27.00%
Filter 4  42.20%   14.40%  21.50%   33.60%   22.90%  27.30%   26.30%   26.90%  26.60%
Filter 5  43.20%   14.70%  22.00%   32.70%   22.30%  26.50%   25.80%   26.40%  26.10%
Filter 6  37.80%   12.90%  19.20%   29.60%   20.20%  24.00%   23.80%   24.40%  24.10%


Analogously, the same point explains that the combination of features ((α = 0.2; β = 0.2)), while favoring the semantic feature, generates good keyword sets and improves the accuracy rate, especially for a reduced number of keywords. Likewise, the combination of feature sets presents a similar behavior for most of the datasets. Overall, it can be concluded that CHIR and SIM are the most important features for extracting the relevant keywords, certainly without ablating the TPW feature. Given the analysis results, the hybrid model improves the accuracy rates across all the evaluated datasets.
Linguistic Filter Comparison

To evaluate our system using different noun phrase filters, and to measure the contribution of our candidate keyphrase identification method to the overall system performance, we report in Table 4 the results obtained by our proposed filter compared with five other adequate filters from the literature, evaluated on the SemEval2010 dataset. The considered filters are the following:

– Filter 1: Our proposed POS tag definition of noun phrases, corresponding to the POS tag sequence: (NN | NNS | NNP | NNPS | VBN | JJ | JJS | RB)* (NN | NNS | NNP | NNPS | VBG)+ (see the chunker sketch following this list).
– Filter 2: (NN | NNS | NNP | NNPS | JJ | VBN | NN IN | NNS IN)* (NN | NNS | NNP | NNPS | VBG) [31].
– Filter 3: (NN | NNS | NNP | NNPS | JJ)* (NN | NNS | NNP | NNPS | VBG) [105, 25].
– Filter 4: NBAR IN NBAR, where NBAR = (NN | NNS | NNP | NNPS | JJ | JJR | JJS)* (NN | NNS | NNP | NNPS) [106, 107].
– Filter 5: (JJ)* (NN | NNS | NNP)+ [44].
– Filter 6: (NN | NNS | NNP | NNPS | JJ | JJR | JJS)* (NN | NNS | NNP | NNPS) [35].
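As an illustration of how such a POS filter can be applied, the sketch below implements Filter 1 as an NLTK chunking grammar; the grammar string mirrors the tag sequence above, while the tokenizer, tagger, and sample sentence are assumptions of the example rather than the exact pipeline used by HAKE.

import nltk

# Filter 1 as a chunk grammar: optional modifiers (nouns, past participles,
# adjectives, adverbs) followed by at least one noun or gerund head.
# Assumes the required NLTK models (punkt, POS tagger) are downloaded.
GRAMMAR = 'NP: {<NN|NNS|NNP|NNPS|VBN|JJ|JJS|RB>*<NN|NNS|NNP|NNPS|VBG>+}'
chunker = nltk.RegexpParser(GRAMMAR)

def candidate_phrases(text):
    """Return the noun phrases accepted by Filter 1 (a sketch)."""
    candidates = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(lambda t: t.label() == 'NP'):
            candidates.append(' '.join(word for word, _ in subtree.leaves()))
    return candidates

print(candidate_phrases('Unsupervised keyphrase extraction selects candidate phrases.'))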
According to these experiments, we can see that our proposed filter (Filter 1) yields the best results for the top 5, 10, and 15 keyphrases on the SemEval2010 dataset.

Overall Performance

In order to evaluate the efficiency of HAKE, we carried out experiments over five different datasets and compared HAKE's performance against state-of-the-art baselines (13 unsupervised and one supervised) and other methods (evaluated on the SemEval2010 dataset only). Table 5 lists the results for the top ten keyphrases (F1@10), which is regularly used for this kind of evaluation [30], with α = 0.2 and β = 0.2 as the weighting parameters and n ≤ 5 as the maximum number of words contained in a keyphrase. These parameters achieved the best results in the parameter weighting selection experiments.

Table 5  The performance of HAKE using (α = 0.2), (β = 0.2), and n ≤ 5 compared with baselines (F1@10)

Method                           Nguyen2007   SemEval2010   Theses100   Krapivin2009   Wiki20
HAKE                             41.2%        37.6%         19.8%       32.0%          19.2%
TF.IDF (statistical)             22.5%        17.7%         12.6%       15.0%          10.3%
KP-Miner (statistical)           15.6%        31.4%         26.1%       22.7%          15.8%
RAKE (statistical)               0.9%         0.2%          0.3%        0.2%           0.2%
YAKE (statistical)               25.6%        21.1%         16.2%       17.0%          11.1%
TextRank (graph-based)           16.7%        14.9%         7.4%        12.1%          5.8%
SingleRank (graph-based)         15.8%        12.9%         3.8%        9.7%           6.0%
TopicRank (graph-based)          17.3%        19.5%         10.6%       13.8%          11.4%
TopicalPageRank (graph-based)    14.8%        12.5%         5.9%        10.4%          8.3%
PositionRank (graph-based)       14.3%        13.3%         5.9%        10.4%          6.9%
ExpandRank (graph-based)         19.0%        19.9%         9.1%        14.6%          11.7%
MultipartiteRank (graph-based)   5.1%         3.0%          1.3%        2.7%           2.5%
KEA (supervised, binary)         22.1%        21.5%         13.4%       17.1%          10.4%


Fig. 7  Effectiveness of HAKE on large texts (Krapivin2009, SemEval2010, and Wiki20 datasets) and on different domains (Nguyen2007 dataset): F@10, P@10, R@10

By looking at the baseline competing systems in Table 5, we notice that our method outperforms all the baseline systems (statistical and state-of-the-art graph-based) and ranks first over the top 10 keyphrases across the different datasets, text sizes, and domains. The best F1@10 score obtained by the HAKE system is 41.2%, for the Nguyen2007 dataset; in contrast, the best score of the best-performing competing system on the same dataset is 31.4%. Another remarkable point is that HAKE outperforms the supervised method KEA without any training on particular data. The lowest results occur with the Theses100 and Wiki20 datasets, which is expected because they have the most absent keywords. As can be seen from Table 5 and Fig. 7, the second-best result was obtained with KP-Miner on the Nguyen2007, SemEval2010, Krapivin2009, and Theses100 datasets, and with YAKE! on the Wiki20 dataset, both statistical-based approaches. However, RAKE, which is considered one of the best-known state-of-the-art statistical approaches, showed deficient performance compared to all the other methods. We conducted a more detailed analysis that allows us to conclude that HAKE outperforms the state-of-the-art methods both on large texts (as is the case for the Krapivin2009, SemEval2010, and Wiki20 datasets) and on different domains (as is the case for the Nguyen2007 dataset) (see Fig. 7).

In Table 6, we tested our method following the challenge evaluation procedure of the SemEval-2010/Task 5 workshop. Each system's performance is given for the top 5, 10, and 15 keyphrases. The results (including their references) of all the participants are reported in Kim et al. (2013) and ranked by F1-score. Hence, we compared our results with the top-ranked methods (supervised and unsupervised) (Kim et al., 2013), including more recent ones that were tested on the same dataset and followed the same evaluation procedure. By looking at the competing systems in Table 6, we notice that several of them adopt or partly adopt supervised learning approaches. A few systems among the top-ranked ones, such as the Core Word Expansion Algorithm [20] and KX FBK [31], are based on unsupervised approaches. As pointed out in the results, our method outperforms all the existing systems (supervised and unsupervised) and ranks first in accuracy over the top 5, 10, and 15 keyphrases. The F1-score achieved by HAKE on the assigned keyphrases (for the top 15 keyphrases) is 36.49%, whereas the F1-score of the best-performing system (which is also an unsupervised approach) on the same dataset is 35.1%. Hence, for 15 keyphrases, HAKE improves the F1-score at a rate of 3.96% compared to TSAKE [49] (a relative improvement of (36.49 − 35.10)/35.10 ≈ 3.96%), and for 10 keyphrases, it increases at a rate of 25.1% compared to the F1-score of the DPM-index [108] (which is a supervised approach).


Table 6  The performance of HAKE using (α = 0.2), (β = 0.2), and n ≤ 5 compared with 23 other methods according to precision (P), recall (R), and F1-score (F) on the SemEval2010 dataset

                                Top 5 keyphrases          Top 10 keyphrases         Top 15 keyphrases
System                          P        R       F        P        R       F        P        R       F
HAKE                            50.80%   23.60%  32.23%   43.90%   32.90%  37.60%   36.20%   36.80%  36.49%
TSAKE                           —        —       —        —        —       —        35.10%   35.20%  35.10%
DPM-index                       44.80%   15.30%  22.80%   36.20%   24.70%  29.40%   28.30%   28.90%  28.60%
HUMB                            39.00%   13.30%  19.80%   32.00%   21.80%  26.00%   27.20%   27.80%  27.50%
Core Word Expansion Algorithm   —        —       —        —        —       —        26.20%   26.80%  26.00%
WINGNUS                         40.20%   13.70%  20.50%   30.50%   20.80%  24.70%   24.90%   25.50%  25.20%
KP-MINER                        36.00%   12.30%  18.30%   28.60%   19.50%  23.20%   24.90%   25.50%  25.20%
SZTERGAK                        34.20%   11.70%  17.40%   28.50%   19.40%  23.10%   24.80%   25.40%  25.10%
ICL                             34.40%   11.70%  17.50%   29.20%   19.90%  23.70%   24.60%   25.20%  24.90%
SEERLAB                         39.00%   13.30%  19.80%   29.70%   20.30%  24.10%   24.10%   24.60%  24.30%
KX-FBK                          34.20%   11.70%  17.40%   27.00%   18.40%  21.90%   23.60%   24.20%  23.90%
DERIUNLP                        27.40%   9.40%   13.90%   23.00%   15.70%  18.70%   22.00%   22.50%  22.30%
Maui                            35.00%   11.90%  17.80%   25.20%   17.20%  20.40%   20.30%   20.80%  20.60%
DFKI                            29.20%   10.00%  14.90%   23.30%   15.90%  18.90%   20.30%   20.70%  20.50%
DP-Seg                          33.30%   10.50%  16.00%   23.70%   16.90%  19.70%   19.20%   21.30%  20.20%
BUAP                            13.60%   4.60%   6.90%    17.60%   12.00%  14.30%   19.00%   19.40%  19.20%
SJTULTLAB                       30.20%   10.30%  15.40%   22.70%   15.50%  18.40%   18.40%   18.80%  18.60%
UNICE                           27.40%   9.40%   13.90%   22.40%   15.30%  18.20%   18.30%   18.80%  18.50%
UNPMC                           18.00%   6.10%   9.20%    19.00%   13.00%  15.40%   18.10%   18.60%  18.30%
U CSE                           28.40%   9.70%   14.50%   21.50%   14.70%  17.40%   17.80%   18.20%  18.00%
LIKEY                           29.20%   10.00%  14.90%   21.10%   14.40%  17.10%   16.30%   16.70%  16.50%
UvT                             24.80%   8.50%   12.60%   18.60%   12.70%  15.10%   14.60%   14.90%  14.80%
POLYU                           15.60%   5.30%   7.90%    14.60%   10.00%  11.80%   13.90%   14.20%  14.00%
UKP                             9.40%    3.20%   4.80%    5.90%    4.00%   4.80%    5.30%    5.40%   5.30%

Notice that, in contrast to our system, some systems, such as the HUMB system, use different external knowledge features to improve their performance; these knowledge bases (HAL, GRISP, and GROBID/TEI) are specific to scientific papers.

The main findings confirm that the generality and robustness of the parameters and features improve the keyphrases' relevancy while calculating the weight of candidate phrases. We can consider these parameters as fixed parameters that do not require any training data or learning phase. On the one hand, the candidate selection method and the hybrid model, which shape a well-chosen feature set (statistical, semantic, and structural), can perform in both domain-independent and domain-dependent corpora. On the other hand, the two parameters α and β are used to control the effect of the features on the final result. Concerning these parameters, we conclude that they are not sensitive to the test documents' domain, but they may be sensitive to their types. For example, when applied to web pages or unstructured documents (such as free text documents), the structural feature (TPW) becomes less reliable, so the value of α may need to be adjusted to reduce its effect; otherwise, the system accuracy decreases. To overcome this problem, we are considering proposing parameter learning models that run experiments on different types of documents, leading to more corpus-adapted parameter settings and improving system performance. Our new algorithm for obtaining a refined candidate set benefits applications that require high efficiency, for example, keyphrase extraction for interactive query refinement. In such a real-time application, especially at a large search scale (e.g., the whole web), a fast keyphrase extraction model is required, as time savings during this phase directly reduce the search response time. Finally, we prove that our unsupervised approach, HAKE, can achieve accuracy comparable to state-of-the-art and supervised methods without any human-labeled training sets or any external knowledge data.

Conclusion and Perspectives

In this paper, we have presented HAKE, a novel effective technique that can automatically extract keyphrases from a given document. HAKE is an unsupervised approach; it does not require any human-labelled training sets or any external knowledge data.


This makes it more flexible and easy to use, and it can be adapted across different domains. Moreover, our solution scales to any document length. To be more efficient, we first introduce a new method based on a parse tree approach, part-of-speech tagging, and filtering to cope with the difficulty of identifying potential candidate phrases. Then, during the feature selection step, we propose a hybrid model that incorporates three types of features (structural, statistical, and semantic) to select relevant individual keywords, used later to rank relevant keyphrases. The results obtained are encouraging, and our new method for candidate phrase identification has proved its efficiency.

Additionally, we provided evidence that scoring keyphrases according to their keywords' scores improves keyphrase extraction performance, and that considering the semantic relationships between features improves its accuracy. Our experiments confirm that HAKE achieves higher performance than state-of-the-art supervised and unsupervised methods on a large number of text documents belonging to five different datasets. In the future, we plan to work on a new version of HAKE based on a word embedding approach.
conference on computing & communication technologies,
research, innovation, and vision for the future (RIVF). 2015;
pp. 123–126. IEEE.
Declarations  16. El Idrissi O, Frikh B, Ouhbi B. HCHIRSIMEX: an extended
method for domain ontology learning based on conditional
Ethics approval  This article does not contain any studies with human mutual information. In: Third IEEE international colloquium in
participants or animals performed by any of the authors. information science and technology (CIST); 2014. pp. 91–95.
17. Fortuna B, Grobelnik M, Mladeni’c D. Semi-automatic data-
driven ontology construction system. In: Proceedings of the 9th
Conflict of Interest  The authors declare no competing interests. international multi-conference information society; 2006, pp.
223–226.
18. Frikh B, Djaanfar AS, Ouhbi B. A new methodology for domain
References

1. Sarkar K. A hybrid approach to extract keyphrases from medical documents. Int J Comput Appl. 2013;63(18):14–19.
2. Gutwin C, Paynter G, Witten I, Nevill-Manning C, Frank E. Improving browsing in digital libraries with keyphrase indexes. Decis Support Syst. 1999;27(1–2):81–104.
3. Jones S, Staveley MS. PHRASIER: a system for interactive document retrieval using keyphrases. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval. 1999. pp. 160–167. ACM.
4. D'Avanzo E, Magnini B. A keyphrase-based approach to summarization: the LAKE system at DUC-2005. In: Proceedings of DUC. 2005.
5. Zha H. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In: Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval. 2002. pp. 113–120. ACM.
6. Zhang Y, Zincir-Heywood N, Milios E. World Wide Web site summarization. Web Intelligence and Agent Systems: An International Journal. 2004;2(1):39–53.
7. Hammouda KM, Matute DN, Kamel MS. COREPHRASE: keyphrase extraction for document clustering. In: International workshop on machine learning and data mining in pattern recognition. 2005. pp. 265–274.
8. Han J, Kim T, Choi J. Web document clustering by using automatic keyphrase extraction. In: 2007 IEEE/WIC/ACM international conferences on web intelligence and intelligent agent technology - workshops. 2007. pp. 56–59. IEEE.
9. Hulth A, Megyesi BB. A study on automatically extracted keywords in text categorization. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2006. pp. 537–544.
10. Berend G. Opinion expression mining by exploiting keyphrase extraction. In: Proceedings of the 5th international joint conference on natural language processing. Asian Federation of Natural Language Processing. 2011.
11. Dashtipour K, Gogate M, Cambria E, Hussain A. A novel context-aware multimodal framework for Persian sentiment analysis. Neurocomputing. 2021.
12. Chen M, Sun JT, Zeng HJ, Lam KY. A practical system of keyphrase extraction for web pages. In: Proceedings of the 14th ACM international conference on information and knowledge management. 2005. pp. 277–278. ACM.
13. Turney PD. Coherent keyphrase extraction via web mining. CoRR, arXiv preprint cs/0308033. 2003.
14. Ferrara F, Pudota N, Tasso C. A keyphrase-based paper recommender system. In: Italian research conference on digital libraries. 2011. pp. 14–25.
15. Do N, Ho L. Domain-specific keyphrase extraction and near-duplicate article detection based on ontology. In: International conference on computing & communication technologies, research, innovation, and vision for the future (RIVF). 2015. pp. 123–126. IEEE.
16. El Idrissi O, Frikh B, Ouhbi B. HCHIRSIMEX: an extended method for domain ontology learning based on conditional mutual information. In: Third IEEE international colloquium in information science and technology (CIST). 2014. pp. 91–95.
17. Fortuna B, Grobelnik M, Mladenić D. Semi-automatic data-driven ontology construction system. In: Proceedings of the 9th international multi-conference information society. 2006. pp. 223–226.
18. Frikh B, Djaanfar AS, Ouhbi B. A new methodology for domain ontology construction from the Web. Int J Artif Intell Tools. 2011;20(06):1157–70.
19. Merrouni ZA, Frikh B, Ouhbi B. Automatic keyphrase extraction: a survey and trends. Journal of Intelligent Information Systems. 2019. pp. 1–34. Springer.
20. You W, Fontaine D, Barthès JP. An automatic keyphrase extraction system for scientific documents. Knowl Inf Syst. 2013;34(3):691–724.
21. Kim SN, Medelyan O, Kan MY, Baldwin T. SEMEVAL-2010 Task 5: automatic keyphrase extraction from scientific articles. In: Proceedings of the 5th international workshop on semantic evaluation. Association for Computational Linguistics. 2010. pp. 21–26.
22. Liu Z, Huang W, Zheng Y, Sun M. Automatic keyphrase extraction via topic decomposition. In: Proceedings of the 2010 conference on empirical methods in natural language processing. Association for Computational Linguistics. 2010. pp. 366–376.
23. Boudin F. Reducing over-generation errors for automatic keyphrase extraction using integer linear programming. In: ACL 2015 workshop on novel computational approaches to keyphrase extraction. 2015.
24. Frank E, Paynter GW, Witten IH, Gutwin C, Nevill-Manning CG. Domain-specific keyphrase extraction. In: Proceedings of the sixteenth international joint conference on artificial intelligence, IJCAI '99. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. 1999. pp. 668–673. http://dl.acm.org/citation.cfm?id=646307.687591.
25. Turney PD. Learning algorithms for keyphrase extraction. Inf Retrieval. 2000;2(4):303–36.


26. Witten IH, Paynter GW, Frank E, Gutwin C, Nevill-Manning CG. KEA: practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on digital libraries. 1999. pp. 254–255. ACM.
27. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. 2013. pp. 3111–3119.
28. Huang C, Tian Y, Zhou Z, Ling CX, Huang T. Keyphrase extraction using semantic networks structure analysis. In: Sixth international conference on data mining (ICDM'06). 2006. pp. 275–284. IEEE.
29. Liu F, Pennell D, Liu F, Liu Y. Unsupervised approaches for automatic keyword extraction using meeting transcripts. In: Proceedings of human language technologies: the 2009 annual conference of the North American chapter of the Association for Computational Linguistics. Association for Computational Linguistics. 2009. pp. 620–628.
30. Campos R, Mangaravite V, Pasquali A, Jorge A, Nunes C, Jatowt A. YAKE! Keyword extraction from single documents using multiple local features. Inf Sci. 2020;509:257–89.
31. Haddoud M, Abdeddaïm S. Accurate keyphrase extraction by discriminating overlapping phrases. J Inf Sci. 2014;40(4):488–500.
32. Hulth A. Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on empirical methods in natural language processing. Association for Computational Linguistics. 2003. pp. 216–223.
33. Wan X, Xiao J. Single document keyphrase extraction using neighborhood knowledge. In: AAAI. 2008. vol. 8, pp. 855–860.
34. Barker K, Cornacchia N. Using noun phrase heads to extract document keyphrases. In: Conference of the Canadian society for computational studies of intelligence. 2000. pp. 40–52. Springer.
35. Nguyen TD, Kan MY. Keyphrase extraction in scientific publications. In: International conference on Asian digital libraries. 2007. pp. 317–326. Springer.
36. Grineva M, Grinev M, Lizorkin D. Extracting key terms from noisy and multitheme documents. In: Proceedings of the 18th international conference on World Wide Web. 2009. pp. 661–670.
37. El-Beltagy SR, Rafea A. KP-MINER: a keyphrase extraction system for English and Arabic documents. Inf Syst. 2009;34(1):132–44.
38. Newman D, Koilada N, Lau JH, Baldwin T. Bayesian text segmentation for index term identification and keyphrase extraction. Proceedings of COLING. 2012;2012:2077–92.
39. Medelyan O, Witten IH. Thesaurus based automatic keyphrase indexing. In: Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries. 2006. pp. 296–297. ACM.
40. Mihalcea R, Tarau P. TEXTRANK: bringing order into text. In: Proceedings of the 2004 conference on empirical methods in natural language processing. 2004.
41. Mahata D, Kuriakose J, Shah R, Zimmermann R. Key2vec: automatic ranked keyphrase extraction from scientific articles using phrase embeddings. In: Proceedings of the 2018 conference of the North American chapter of the Association for Computational Linguistics: human language technologies. 2018. volume 2 (short papers), pp. 634–639.
42. Medelyan O, Frank E, Witten IH. Human-competitive tagging using automatic keyphrase extraction. In: Proceedings of the 2009 conference on empirical methods in natural language processing. 2009. pp. 1318–1327.
43. You W, Fontaine D, Barthès JP. Automatic keyphrase extraction with a refined candidate set. In: Proceedings of the 2009 IEEE/WIC/ACM international joint conference on web intelligence and intelligent agent technology. IEEE Computer Society. 2009. volume 01, pp. 576–579.
44. Liu Z, Li P, Zheng Y, Sun M. Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 conference on empirical methods in natural language processing. Association for Computational Linguistics. 2009. volume 1, pp. 257–266.
45. Rose S, Engel D, Cramer N, Cowley W. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory. 2010;1:1–20.
46. Gollapalli SD, Caragea C. Extracting keyphrases from research papers using citation networks. In: AAAI. 2014. pp. 1629–1635.
47. Yang S, Lu W, Yang D, Li X, Wu C, Wei B. KEYPHRASEDS: automatic generation of survey by exploiting keyphrase information. Neurocomputing. 2017;224:58–70.
48. Xie F, Wu X, Zhu X. Efficient sequential pattern mining with wildcards for keyphrase extraction. Knowl-Based Syst. 2017;115:27–39.
49. Rafiei-Asl J, Nickabadi A. TSAKE: a topical and structural automatic keyphrase extractor. Appl Soft Comput. 2017;58:620–30.
50. Danesh S, Sumner T, Martin JH. SGRANK: combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction. In: Proceedings of the fourth joint conference on lexical and computational semantics. 2015. pp. 117–126.
51. Rabby G, Azad S, Mahmud M, Zamli KZ, Rahman MM. A flexible keyphrase extraction technique for academic literature. Procedia Computer Science. 2018;135:553–63.
52. Matsuo Y, Ishizuka M. Keyword extraction from a single document using word co-occurrence statistical information. Int J Artif Intell Tools. 2004;13(01):157–69.
53. Li Y, Luo C, Chung SM. Text clustering with feature selection by using statistical data. IEEE Trans Knowl Data Eng. 2008;20(5):641–52.
54. Wang J, Peng H. Keyphrases extraction from web document by the least squares support vector machine. In: The 2005 IEEE/WIC/ACM international conference on web intelligence (WI'05). 2005. pp. 293–296. IEEE.
55. Kumar N, Srinathan K. Automatic keyphrase extraction from scientific documents using n-gram filtration technique. In: Proceedings of the eighth ACM symposium on document engineering. 2008. pp. 199–208. ACM.
56. Berend G, Farkas R. SZTERGAK: feature engineering for keyphrase extraction. In: Proceedings of the 5th international workshop on semantic evaluation. Association for Computational Linguistics. 2010. pp. 186–189.
57. Adar E, Datta S. Building a scientific concept hierarchy database (SCHBase). In: Proceedings of the 53rd annual meeting of the Association for Computational Linguistics and the 7th international joint conference on natural language processing (volume 1: long papers). 2015. vol. 1, pp. 606–615.
58. Florescu C, Caragea C. A new scheme for scoring phrases in unsupervised keyphrase extraction. In: Proceedings of the 39th European conference on information retrieval (ECIR'17), Aberdeen, Scotland, April 9–13. 2017. pp. 477–483.
59. Tomokiyo T, Hurst M. A language model approach to keyphrase extraction. In: Proceedings of the ACL 2003 workshop on multiword expressions: analysis, acquisition and treatment. Association for Computational Linguistics. 2003. volume 18, pp. 33–40.
60. Rabby G, Azad S, Mahmud M, Zamli KZ, Rahman MM. TeKET: a tree-based unsupervised keyphrase extraction technique. Cogn Comput. 2020. pp. 1–23.
61. Bougouin A, Boudin F, Daille B. TopicRank: graph-based topic ranking for keyphrase extraction. In: Proc IJCNLP. 2013. pp. 543–551.
62. Sterckx L, Demeester T, Deleu J, Develder C. Topical word importance for fast keyphrase extraction. In: Proceedings of the 24th international conference on World Wide Web. 2015. pp. 121–122.


63. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.
64. Florescu C, Caragea C. PositionRank: an unsupervised approach to keyphrase extraction from scholarly documents. In: Proc ACL. 2017. pp. 1105–1115.
65. Boudin F. Unsupervised keyphrase extraction with multipartite graphs. In: Proc NAACL: human language technologies. 2018. pp. 667–672.
66. Baroni M, Dinu G, Kruszewski G. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: ACL (1). 2014. pp. 238–247.
67. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014. pp. 1532–1543.
68. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. 2017;5:135–46.
69. Papagiannopoulou E, Tsoumakas G. Local word vectors guiding keyphrase extraction. Inf Process Manage. 2018;54(6):888–902.
70. Bennani-Smires K, Musat C, Hossmann A, et al. Simple unsupervised keyphrase extraction using sentence embeddings. In: Proceedings of the 22nd conference on computational natural language learning. 2018. pp. 221–229.
71. Sun Y, Qiu H, Zheng Y, Wang Z, Zhang C. SIFRank: a new baseline for unsupervised keyphrase extraction based on pre-trained language model. IEEE Access. 2020;8:10896–906.
72. Cohen JD. Highlights: language- and domain-independent automatic indexing terms for abstracting. J Am Soc Inf Sci. 1995;46(3):162–74.
73. Nguyen TD, Luong MT. WINGNUS: keyphrase extraction utilizing document logical structure. In: Proceedings of the 5th international workshop on semantic evaluation. Association for Computational Linguistics. 2010. pp. 166–169.
74. Ong TH, Chen H. Updateable PAT-tree approach to Chinese key phrase extraction using mutual information: a linguistic foundation for knowledge management. 1999.
75. Ramos J, et al. Using TF-IDF to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning. Piscataway, NJ. 2003. vol. 242, pp. 133–142.
76. Barzilay R, Elhadad M. Using lexical chains for text summarization. Advances in Automatic Text Summarization. 1999. pp. 111–121.
77. Krapivin M, Autayeu A, Marchese M, Blanzieri E, Segata N. Keyphrases extraction from scientific documents: improving machine learning approaches with natural language processing. In: International conference on Asian digital libraries. 2010. pp. 102–111.
78. Krapivin M, Marchese M, Yadrantsau A, Liang Y. Unsupervised key-phrases extraction from scientific papers using domain and linguistic knowledge. In: 2008 third international conference on digital information management. 2008. pp. 105–112. IEEE.
79. Le TTN, Le Nguyen M, Shimazu A. Unsupervised keyphrase extraction: introducing new kinds of words to keyphrases. In: Australasian joint conference on artificial intelligence. 2016. pp. 665–671. Springer.
80. Salton G, Singhal A, Mitra M, Buckley C. Automatic text structuring and summarization. Inf Process Manage. 1997;33(2):193–207.
81. Lopez P, Romary L. HUMB: automatic key term extraction from scientific articles in GROBID. In: Proceedings of the 5th international workshop on semantic evaluation. Association for Computational Linguistics. 2010. pp. 248–251.
82. Chua S, Kulathuramaiyer N. Semantic feature selection using WordNet. In: IEEE/WIC/ACM international conference on web intelligence (WI'04). 2004. pp. 166–172.
83. Dagan I, Marcus S, Markovitch S. Contextual word similarity and estimation from sparse data. In: Proceedings of the 31st annual meeting on Association for Computational Linguistics. Association for Computational Linguistics. 1993. pp. 164–171.
84. Kelleher D, Luz S. Automatic hypertext keyphrase detection. In: IJCAI. 2005;5:1608–9.
85. Li CH, Park SC. Combination of modified BPNN algorithms and an efficient feature selection method for text categorization. Inf Process Manage. 2009;45(3):329–40.
86. Song W, Liang JZ, He XL, Chen P. Taking advantage of improved resource allocating network and latent semantic feature selection approach for automated text categorization. Appl Soft Comput. 2014;21:210–20.
87. Frantzi KT, Ananiadou S, Tsujii J. The C-VALUE/NC-VALUE method of automatic recognition for multi-word terms. In: International conference on theory and practice of digital libraries. 1998. pp. 585–604.
88. Jiang X, Hu Y, Li H. A ranking approach to keyphrase extraction. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR '09. ACM, New York, NY, USA. 2009. pp. 756–757. https://doi.org/10.1145/1571941.1572113.
89. Zhang K, Xu H, Tang J, Li J. Keyword extraction using support vector machine. In: International conference on web-age information management. 2006. pp. 85–96. Springer.
90. Yih WT, Goodman J, Carvalho VR. Finding advertising keywords on web pages. In: Proceedings of the 15th international conference on World Wide Web, WWW '06. ACM, New York, NY, USA. 2006. pp. 213–222. https://doi.org/10.1145/1135777.1135813.
91. Sarkar K, Nasipuri M, Ghose S. A new approach to keyphrase extraction using neural networks. CoRR abs/1004.3274. 2010. http://arxiv.org/abs/1004.3274.
92. De Marneffe MC, MacCartney B, Manning CD, et al. Generating typed dependency parses from phrase structure parses. In: LREC. 2006. vol. 6, pp. 449–454.
93. Toutanova K, Klein D, Manning CD, Singer Y. Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on human language technology. Association for Computational Linguistics. 2003. volume 1, pp. 173–180.
94. Sotoca JM, Pla F. Supervised feature selection by clustering using conditional mutual information-based distances. Pattern Recogn. 2010;43(6):2068–81.
95. Ding Z, Zhang Q, Huang X. Keyphrase extraction from online news using binary integer programming. In: Proceedings of the 5th international joint conference on natural language processing. 2011. pp. 165–173.
96. University of Waikato NZ. Datasets of automatic keyphrase extraction. https://github.com/LIAAD/KeywordExtractorDatasets#theses. 2019.
97. Medelyan O, Witten IH, Milne D. Topic indexing with Wikipedia. In: Proceedings of the AAAI WikiAI workshop. 2008. vol. 1, pp. 19–24.
98. Krapivin M, Autaeu A, Marchese M. Large dataset for keyphrases extraction. University of Trento. Tech Report # DISI-09-055. 2009.
99. Chen W, Chan HP, Li P, Bing L, King I. An integrated approach for keyphrase generation via exploring the power of retrieval and extraction. In: NAACL-HLT (1). 2019.


100. Kim SN, Medelyan O, Kan MY, Baldwin T. Automatic keyphrase extraction from scientific articles. Lang Resour Eval. 2013;47(3):723–42.
101. Boudin F. PKE: an open-source Python-based keyphrase extraction toolkit. In: Proc COLING. 2016. pp. 69–73.
102. Jones KS. A statistical interpretation of term specificity and its application in retrieval. J Document. 1972;28(1):11–21.
103. Zesch T, Gurevych I. Approximate matching for evaluating keyphrase extraction. In: Proceedings of the international conference RANLP. 2009. pp. 484–489.
104. Porter MF. An algorithm for suffix stripping. Program. 1980;14(3):130–137.
105. Pal T, Banka H, Mitra P, Das B. Linguistic knowledge based supervised key-phrase extraction. In: Proceedings of national conference on future trends in information & communication technology & applications, Bhubaneswar, India. 2011.
106. Kim SN, Baldwin T, Kan MY. Evaluating n-gram based evaluation metrics for automatic keyphrase extraction. In: Proceedings of the 23rd international conference on computational linguistics. Association for Computational Linguistics. 2010. pp. 572–580.
107. Kim SN, Kan MY. Re-examining automatic keyphrase extraction approaches in scientific articles. In: Proceedings of the workshop on multiword expressions: identification, interpretation, disambiguation and applications. Association for Computational Linguistics. 2009. pp. 9–16.
108. Pianta E, Tonelli S. KX: a flexible system for keyphrase extraction. In: Proceedings of the 5th international workshop on semantic evaluation. Association for Computational Linguistics. 2010. pp. 170–173.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
