
Weekly Status Report of iRisk Cyber Literature Project

David An
Current Week: Apr. 4

I Activities Accomplished for Current Week

• Developed methods to group and identify keywords in the dataset

– We used three different approaches (two libraries, one developed in-house) to extract keywords from the text: YAKE (Yet Another Keyword Extractor), KeyBERT (transformer-based keyword extraction), and our own implementations of MSS (Max Sum Similarity) and MMR (Maximal Marginal Relevance).
– YAKE! is a lightweight keyword extraction system. One of its biggest benefits is that it does not need to be trained on a particular set of documents; instead, it relies on statistical features extracted from individual documents.
– YAKE! works by first pre-processing the text. It then generates candidates based on specified hyperparameters and scores them on five different traits: Casing, Positional, Frequency, Relatedness, and Difference. Finally, it post-processes the scored keywords by removing duplicates and returns the ranked list.
– Next, we used KeyBERT, which is a transformer-based keyword extraction system. According to the creator, KeyBERT finds the keywords by utilizing BERT:
– "First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document."
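The quoted pipeline can be sketched in a few lines of Python. This is only a minimal illustration: it substitutes toy bag-of-words count vectors for BERT embeddings, and the `embed`, `cosine`, and `top_keywords` names and the sample inputs are ours, not part of KeyBERT.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy "embedding": a bag-of-words count vector (stands in for BERT here).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_keywords(doc, candidates, n=3):
    # Rank candidate phrases by cosine similarity to the whole document.
    doc_vec = embed(doc)
    scored = [(cosine(embed(c), doc_vec), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:n]]

doc = "cyber security risks confront businesses and cyber security practitioners"
candidates = ["cyber security", "business travel", "security practitioners"]
print(top_keywords(doc, candidates))  # 'cyber security' ranks first
```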

• Developed our own methods that don't rely on external libraries.

– For this, the sample used was an abstract of a text about cybersecurity risks and defenses in commercial business practices.
– We then developed two different approaches while exploring methods that don't rely on libraries.
– The first method uses cosine similarity between the document embedding and the candidate word embeddings, sorting by top similarities. On a sample, we get the keywords: ['security businesses demonstrates', 'cyber security functions', 'cyber security practitioners', 'security events businesses', 'cyber security businesses'].
– The next method employs diversification to generate better keywords: Max Sum Similarity (MSS). Starting from the candidates most similar to the document, it returns the n keywords with the lowest pairwise cosine similarity, i.e., the n terms that are the most different from each other. Using the same summary, we get ['businesses researchers', 'businesses confronted evolving', 'data gathered indepth', 'evolving pervasive threats', 'cyber security western'].
– Finally, we used Maximal Marginal Relevance (MMR). This iterative method first chooses the one candidate keyword with the greatest similarity to the document. Then, we iteratively choose each next keyword based on which is most similar to the document while being least similar to what was already chosen. In addition, we can choose how diverse we want the words to be with a "diversification factor". The formula to calculate the metric is shown below:

MMR = argmax_{D_i ∈ R∖S} [λ · Sim(D_i, Q) − (1 − λ) · max_{D_j ∈ S} Sim(D_i, D_j)]

where Q is the document, R the set of candidate keywords, S the keywords selected so far, and λ trades off relevance against diversity.

In this case, with a diversity factor of 0, we get the keywords ['cyber security businesses', 'security events businesses', 'cyber security practitioners', 'cyber security functions', 'security businesses demonstrates']. These keywords seem relevant to the article itself. With a diversity factor of 0.2, we get ['cyber security businesses', 'businesses confronted evolving', 'security events businesses', 'offer businesses researchers', 'leaders work political']. As the diversity factor increases, MMR returns increasingly diverse phrases.
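The MMR selection loop described above can be sketched as follows. This is a minimal illustration, not our production code: it uses toy word-count vectors in place of real embeddings, and all function names and sample inputs are hypothetical.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy word-count vector standing in for a real embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr(doc, candidates, n=3, lam=0.7):
    # Maximal Marginal Relevance: greedily pick the candidate that is
    # relevant to the document but dissimilar to what was already chosen.
    doc_vec = embed(doc)
    vecs = {c: embed(c) for c in candidates}
    # Start with the single most document-similar candidate.
    selected = [max(candidates, key=lambda c: cosine(vecs[c], doc_vec))]
    remaining = [c for c in candidates if c not in selected]
    while remaining and len(selected) < n:
        def score(c):
            relevance = cosine(vecs[c], doc_vec)
            redundancy = max(cosine(vecs[c], vecs[s]) for s in selected)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

doc = "cyber security threats confront evolving businesses"
cands = ["cyber security", "security businesses", "evolving threats", "cyber threats"]
print(mmr(doc, cands, n=3, lam=0.7))
```

With a higher λ the relevance term dominates; lowering λ penalizes overlap with already-selected keywords, which is the "diversification factor" effect noted above.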

• In addition to finding and developing different methods for keyword extraction, we ran YAKE and KeyBERT on our whole dataset and generated keywords as well as frequencies for each word.

• Read Cybersecurity Literature Review for potential information on topics.

• Continued working on the poster presentation and poster content.

II Activities Planned for Next Week

• Now that we have keywords, should we run other algorithms to generate more keywords, or leave it at the package with YAKE and KeyBERT?

• Since we have various keywords, can we add something similar to a "pooling layer" to get one overarching keyword to tag each article with? Or do we want to keep the different variations?

• Begin modeling and grouping the different keywords since we have a frequency list. What if we employed a bag-of-words model to group them, or just used LDA analysis?

• Finally, continue to figure out how to get keywords when all of the text is in one contiguous string. When we ran YAKE on all of the summaries combined, the extracted keywords were: [('cyber security', 9.52906140291883e-06), ('data security', 1.0825401029788618e-05), ('data', 1.9307831122194226e-05), ('security', 2.364159500948794e-05), ('machine learning', 2.433732175248527e-05)], with the numbers being each keyword's relevance score (the lower, the better). However, KeyBERT seems to time out at 19 hours. Is there a possible way to circumvent this? One hypothesized method is running it on every document entry and applying some form of pooling to the final words.
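That hypothesized per-document-plus-pooling approach can be sketched as below. The `extract_keywords` function is a deliberately trivial stand-in (in practice it would be a YAKE or KeyBERT call on a single summary); all names and sample documents here are illustrative assumptions.

```python
from collections import Counter

def extract_keywords(text, n=3):
    # Stand-in extractor (hypothetical): keeps the most frequent
    # longer words; a real run would call YAKE or KeyBERT here.
    words = [w for w in text.lower().split() if len(w) > 4]
    return [w for w, _ in Counter(words).most_common(n)]

def pooled_keywords(documents, n=5):
    # Run the extractor per document, then pool keyword counts across
    # documents, so the corpus is never processed as one giant string.
    pool = Counter()
    for doc in documents:
        pool.update(extract_keywords(doc))
    return [w for w, _ in pool.most_common(n)]

docs = [
    "cyber security threats facing modern businesses",
    "machine learning models for cyber security defense",
    "security practices adopted by small businesses",
]
print(pooled_keywords(docs, n=3))  # 'security' pools to the top
```

Count pooling is only one option; a max- or mean-over-scores pooling would also fit the "pooling layer" idea raised above.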

• Finalize poster presentation and get a draft ready.

III Other Topics

• Any tips on making a good poster presentation, and how can one make it stand out?
