
Covid-19-Literature-Clustering

NAME: SAMRINA SIDDIQUI


ROLL NO: DS-104
MS- DATA SCIENCE (WEEKEND)
WHAT IS COVID-19 LITERATURE CLUSTERING?
• With over 57,000 research papers that could help fight COVID-19, a text
mining tool is helping scientists navigate the data.

• Scientists are essentially dealing with a deluge of information. Analyzing this
much data is a daunting task, and drawing actionable insights from it would be
monumental for any human.

• Maksim built the tool "Covid-19-Literature-Clustering", which clusters articles
by topic and, using a technique called dimensionality reduction, plots the
similarity of the publications. The data clusters collect and
organize information on topics such as the science of airborne viruses and
the use of tools like face masks. This literature clustering helps
ensure that scientists aren't simply redoing previous research but are instead
building upon the previous work of others.
WHAT IS TEXT MINING?
• Text mining is the process of analyzing a large data set to uncover themes.
According to Techopedia, "This requires sophisticated analytical tools that
process text in order to glean specific keywords or key data points." Other
names for text mining are text analytics or text data mining.

• Text mining is used by data scientists and others to filter through large groups
of text that would take far too long for humans to organize. The process is
similar to data mining but focuses solely on text material.

• Text analytics utilizes artificial intelligence and natural language processing to
discover commonalities between a number of sources. It runs algorithms to
categorize data and uses analytical models to identify patterns.
COVID-19 LITERATURE CLUSTERING PLOT
• [This slide shows the interactive scatter plot of the clustered publications.]
WHAT IS t-SNE?
• t-Distributed Stochastic Neighbor Embedding (t-SNE) is an
unsupervised, non-linear technique primarily used for data
exploration and visualizing high-dimensional data. In simpler
terms, t-SNE gives you a feel or intuition of how the data is
arranged in a high-dimensional space.
• t-SNE differs from PCA by preserving only small pairwise
distances or local similarities, whereas PCA is concerned with
preserving large pairwise distances to maximize variance.
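The PCA vs t-SNE contrast above can be seen in a minimal sketch: both reduce 50-dimensional points to 2-D, but t-SNE does so non-linearly, preserving local neighborhoods. The data here is synthetic (two well-separated blobs), invented purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Toy high-dimensional data: two well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 50)),
    rng.normal(5.0, 0.1, size=(20, 50)),
])

# PCA: linear projection that preserves global variance (large distances)
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: non-linear embedding that preserves local neighborhoods;
# perplexity must be smaller than the number of samples
X_tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # both (40, 2)
```

In the project, the same idea is applied to the vectorized body text of each paper, producing the 2-D coordinates behind the scatter plot.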
EXAMPLES
• The plot can be used to quickly find many publications on a similar topic. For
example, to find information on masks and their effectiveness, we can
either try to find a cluster with a keyword such as "mask" or search
for the term directly and try to identify which cluster it most closely relates to.

• If you search for "mask", a dark green cluster (#12) is well represented. A
quick scan of the titles indicates that these publications mainly address
different masks and their uses for healthcare workers.

• Searching for a single key term may cause you to inadvertently filter out
highly similar papers that use different phrasing. After an initial examination,
clear the search term and explore the whole cluster by adjusting the slider. In
this case, we would set the slider to 12 and look for more interesting papers
in the entire cluster.
CONCLUSION
• This project attempted to cluster published literature on COVID-19 and
reduce the dimensionality of the dataset for visualization purposes. This has
allowed for an interactive scatter plot of papers related to COVID-19, in which
material of similar theme is grouped together.

• Grouping the literature in this way allows professionals to quickly find
material related to a central topic. Instead of having to manually search for
related work, every publication is connected to a larger topic cluster.

• The clustering of the data was done through k-means on a pre-processed,
vectorized version of the literature’s body text. As k-means simply splits the
data into clusters, topic modeling through LDA was performed to identify
keywords. This gave the topics that were prevalent in each of the clusters.
Both the clusters and keywords are found through unsupervised learning
models and can be useful in revealing patterns that humans may not have
even thought about.
