David An
Current Week: Apr. 4
– For this, the sample used was an abstract of a text about cybersecurity
risks and defenses in commercial business practices.
– We then developed two different approaches while also exploring methods
that don't rely on libraries.
– The first method uses cosine similarity between the document embedding
and candidate word embeddings, sorting candidates by their similarity to
the document. On a sample, we get the keywords: ['security businesses
demonstrates', 'cyber security functions', 'cyber security practitioners',
'security events businesses', 'cyber security businesses'].
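A minimal sketch of this ranking step, using small hand-written vectors in place of real document and candidate embeddings (the phrases and numbers here are illustrative, not from our data):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_keywords(doc_vec, candidates, n=2):
    # Rank candidate phrases by cosine similarity to the document
    # embedding and keep the top n.
    ranked = sorted(candidates, key=lambda p: cosine(candidates[p], doc_vec),
                    reverse=True)
    return ranked[:n]

doc = [1.0, 0.2, 0.1]                      # stand-in document embedding
cands = {                                  # stand-in candidate embeddings
    "cyber security":   [0.9, 0.3, 0.1],
    "data privacy":     [0.5, 0.5, 0.5],
    "office furniture": [0.0, 0.1, 0.9],
}
print(top_keywords(doc, cands))            # most document-like phrases first
```

With real sentence embeddings, the structure is the same; only the vectors change.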
– The next method employs diversification to generate better samples, using
a technique known as Maximum Sum Similarity. It keeps candidates that are
relevant to the document, then returns the n keywords with the smallest
pairwise cosine similarity to one another; essentially, we choose the n
candidate keywords that are most different from each other. Using the same
summary, we get ['businesses researchers', 'businesses confronted evolving',
'data gathered indepth', 'evolving pervasive threats', 'cyber security
western'].
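A sketch of this Maximum Sum Similarity selection, again with toy embeddings (the phrases, vectors, and the parameters n and k are illustrative assumptions):

```python
import itertools
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def max_sum_similarity(doc_vec, candidates, n=2, k=3):
    # Step 1: keep the k candidates most similar to the document.
    top_k = sorted(candidates, key=lambda p: cosine(candidates[p], doc_vec),
                   reverse=True)[:k]
    # Step 2: among those, pick the n-subset whose members have the
    # smallest summed pairwise similarity (i.e. are most different
    # from one another).
    best, best_score = None, float("inf")
    for combo in itertools.combinations(top_k, n):
        score = sum(cosine(candidates[a], candidates[b])
                    for a, b in itertools.combinations(combo, 2))
        if score < best_score:
            best, best_score = combo, score
    return list(best)

doc = [1.0, 0.0, 0.0]
cands = {
    "cyber security": [0.9, 0.1, 0.0],
    "cyber attacks":  [0.9, 0.15, 0.0],  # near-duplicate of the above
    "data breach":    [0.2, 0.9, 0.1],
    "cat videos":     [0.0, 0.0, 1.0],   # irrelevant, dropped in step 1
}
print(max_sum_similarity(doc, cands))
```

The near-duplicate phrase is filtered out in step 2 because pairing it with "cyber security" would maximize, not minimize, the pairwise similarity.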
– Finally, we used Maximal Marginal Relevance (MMR). This iterative method
first chooses the one candidate keyword with the greatest similarity to the
text. Then, we iteratively choose keywords based on which is most similar
to the text while being least similar to what was already chosen. In
addition, we can control how diverse we want the words to be with a
"diversification factor" λ. With our convention that λ = 0 means pure
relevance, the metric is calculated as:

MMR = argmax_{C_i ∈ C∖S} [ (1 − λ) · sim(C_i, doc) − λ · max_{C_j ∈ S} sim(C_i, C_j) ]

where C is the candidate set and S is the set of keywords already selected.
In this case, with a diversity (λ) of 0, we get the keywords ['cyber
security businesses', 'security events businesses', 'cyber security
practitioners', 'cyber security functions', 'security businesses
demonstrates']. These keywords seem relevant to the article itself. With a
diversity of 0.2, we get ['cyber security businesses', 'businesses
confronted evolving', 'security events businesses', 'offer businesses
researchers', 'leaders work political']. As diversity increases, MMR
returns increasingly diverse phrases.
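The iteration above can be sketched as follows, with toy embeddings standing in for real ones and `diversity` playing the role of λ (phrases and vectors are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr(doc_vec, candidates, n=3, diversity=0.0):
    # diversity is the λ from the notes: 0 → rank purely by relevance
    # to the document; higher values penalize redundancy with the
    # keywords already selected.
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < n:
        def score(p):
            relevance = cosine(candidates[p], doc_vec)
            redundancy = max((cosine(candidates[p], candidates[s])
                              for s in selected), default=0.0)
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

doc = [1.0, 0.0, 0.0]
cands = {
    "cyber security": [0.9, 0.1, 0.0],
    "cyber attacks":  [0.9, 0.15, 0.0],  # nearly identical to the above
    "data breach":    [0.2, 0.9, 0.1],
}
print(mmr(doc, cands, n=3, diversity=0.0))  # pure relevance ordering
print(mmr(doc, cands, n=3, diversity=0.8))  # redundant phrase demoted
```

At λ = 0 the near-duplicate phrase ranks second by relevance alone; at a high λ it is pushed behind the more distinct phrase, mirroring the behavior we observed on the summaries.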
• Since we have various keywords, can we add something similar to a
"pooling layer" to get one overarching keyword to tag each article with?
Or do we want to keep the different variations?
• Begin modeling and grouping the different keywords, since we have a
frequency list. What if we employed a bag-of-words model to group them, or
just used LDA analysis?
• Finally, continue to figure out how to get keywords when all of the text
is in one contiguous string. When we used YAKE on all of the summaries
combined, the extracted keywords were: [('cyber security',
9.52906140291883e-06), ('data security', 1.0825401029788618e-05), ('data',
1.9307831122194226e-05), ('security', 2.364159500948794e-05), ('machine
learning', 2.433732175248527e-05)], where the numbers indicate how relevant
each keyword is; the lower, the better. However, BERT seems to time out
after 19 hours. Is there a possible way to circumvent this? One
hypothesized method is running it on every document entry and applying some
form of pooling on the final words.
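One possible shape for that hypothesized per-document-then-pool approach, with the per-entry extractor abstracted away (the keyword lists below are made-up stand-ins for whatever the extractor returns per entry):

```python
from collections import Counter

def pool_keywords(per_doc_keywords, top_n=3):
    # per_doc_keywords: one keyword list per document entry, produced by
    # running the extractor on each entry separately (avoiding one giant
    # string). Pooling step: keep the keywords that recur most often
    # across entries.
    counts = Counter(kw for doc in per_doc_keywords for kw in doc)
    return [kw for kw, _ in counts.most_common(top_n)]

docs = [["cyber security", "data security"],
        ["cyber security", "machine learning"],
        ["data security", "cyber security"]]
print(pool_keywords(docs))
```

Frequency is the simplest pooling rule; weighting by each keyword's per-document score would be a natural refinement.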
• Any tips on making a good poster presentation and how one can make it
stand out?