
JOURNAL OF CRITICAL REVIEWS

ISSN- 2394-5125 VOL 7, ISSUE 06, 2020

QUANTIFYING AND FIGURING OUT POTENTIAL TRENDING CLUSTERS/TOPICS IN A GIVEN CORPUS
Gowriprasad Kuruba Bedala 1, Dr. Balasaraswathi 2

1 Department of Electronics and Communication Engineering, Saveetha School of Engineering, Chennai
2 Assistant Professor, Department of Electronics and Communication Engineering, Saveetha School of Engineering, SIMATS

1 bgowriprasad@gmail.com, 2 balameau2005@gmail.com

Abstract
In the present world, information is power. The current race is not about acquiring a piece of information, but about how quickly it is acquired. The age of waiting for the daily copy of the newspaper in the morning was superseded by media houses with the onset of television, which was in turn superseded by the Internet and its live updates delivered through smartphones. Twitter is currently the primary source of information, thanks to a vast user base who post tweets related to almost everything. Ideally, if we could keep track of all the tweets, a news event could potentially be captured even before it hits the media houses.

The primary objective of the project was to collect tweets and tweet parameters over a live stream using Twitter developer access, to cluster the data using an appropriate conversion of tweet text into vectors, and to understand the correlation between trending topics and their sentiment, if it exists, and whether it has a considerable impact on the analysis and clustering technique. The project succeeded on multiple fronts, including quantifying the choice of a sentiment analyzer for sentiment detection, highlighting the importance of text pre-processing, and figuring out potential trending clusters/topics in a given corpus.

Keywords: Information, Internet, Twitter, Tweets, Clustering, Vectors, Sentiment, Sentiment analyzer, Trending clusters

Introduction
Twitter has seen a steady rise in popularity as a micro-blogging service ever since its inception in 2006 [16]. Its massive user base and the sheer volume of data posted every second have opened several avenues for data scientists. Any significant event in the world is associated with several tweets addressing various aspects of that event. Such a series of tweets gives a multi-perspective view of a single matter, which enables a data scientist to accurately draw all required details regarding the event, such as points of highlight, location, topics of significant interest within the event, and so on. Twitter has a proprietary algorithm that shows its users the top subjects being discussed by the majority of people in a specific location, referred to as "trending topics." The algorithm takes into account the volume of tweets that address a particular subject (characterized by hashtags), the area of interest, and other factors that are not disclosed by Twitter. The trending topics are tailored for each user based on these variables and the user's interests. In a way, this personalization limits the usefulness of trending topics for consumers such as a low-end media house, or a company that keeps changing its field of interest over time, thereby making the algorithm futile for them.

Literature Survey
Twitter has been used as a source of primary information for specific events, such as detecting and tracking natural disasters and the response to them, capturing highlights of sporting events, evaluating product likeability in the market, and so on. Much research has also addressed advancements in NLP (Natural Language Processing), specifically Sentiment Analysis or Opinion Mining. The initial assumption was that research works that depend on sentiment analyzers are inherently and heavily biased by the choice of algorithm. The most relevant research titles on using tweet data to detect and track events were in the sphere of disaster management, followed by sporting events.

Sl. No | Title | Journal/Conference | Information Included
1 | Sakaki, Takeshi; Okazaki, Makoto; Matsuo, Yutaka. "Earthquake Shakes Twitter Users: Real-Time Event Detection by Social Sensors" [2] | Proceedings of the 19th International Conference on World Wide Web, WWW '10, pp. 851-860, 2010 | Use of tweets as social sensors, information diffusion characteristics, and event detection using Twitter
2 | Yu, Yang; Wang, Xiao. "World Cup 2014 in the Twitter World: A big data analysis of sentiments in U.S. sports fans' tweets" [3] | Computers in Human Behavior, Vol. 48, pp. 392-400, 2015 | Analysis of emotions in tweets
3 | Aiello, Luca Maria; Petkos, Georgios; Martin, Carlos; Corney, David; Papadopoulos, Symeon; Skraba, Ryan; Göker, Ayse; Kompatsiaris, Ioannis; Jaimes, Alejandro. "Sensing Trending Topics in Twitter" [4] | IEEE Transactions on Multimedia, Vol. 15, No. 6, pp. 1268-1282, October 2013 | Basic flow of tweet analysis and techniques for topic detection
4 | Strehl, Alexander; Ghosh, Joydeep; Mooney, Raymond. "Impact of Similarity Measures on Web-page Clustering" [5] | Workshop on Artificial Intelligence for Web Search (AAAI 2000), 2001 | Study of different measures of similarity for clustering
5 | Murtagh, Fionn; Downs, Geoff; Contreras, Pedro. "Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding" [6] | SIAM J. Scientific Computing, Vol. 30, pp. 707-730, 2008 | Detailed insight into hierarchical clustering
6 | Feldman, R. "Techniques and Applications for Sentiment Analysis" [7] | Communications of the ACM, Vol. 56, pp. 82-89, April 2013 | Problems specific to the field of sentiment analysis
7 | Hutto, C.J.; Gilbert, Eric. "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text" [8] | Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM, 2014 | Efficiency of VADER in analyzing sentiment of social media
8 | Pak, Alexander; Paroubek, Patrick. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining" [9] | Proceedings of LREC, Vol. 10, 2010 | Custom Naïve Bayes classifier on a tweet dataset
9 | Gonçalves, Polyanna; Araújo, Matheus; Ribeiro, Filipe; Benevenuto, Fabrício; Gonçalves, Marcos. "A Benchmark Comparison of State-of-the-Practice Sentiment Analysis Methods" [10] | EPJ Data Science, Vol. 5, 2016 | Detailed benchmark between Sentiment140, SentiStrength and VADER
10 | Zhou, Xujuan; Tao, Xiaohui; Yong, Jianming; Yang, Zhenyu. "Sentiment analysis on tweets for social events" [11] | Proceedings of the 2013 IEEE 17th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pp. 557-562, 2013 | Tweet Sentiment Analysis Model (TSAM)
11 | Thelwall, Mike; Buckley, Kevan; Paltoglou, Georgios; Cai, Di; Kappas, Arvid. "Sentiment Strength Detection in Short Informal Text" [12] | Journal of the American Society for Information Science and Technology, Vol. 61, pp. 2544-2558, 2010 | Details of the SentiStrength classifier
12 | Pedregosa, Fabian; Varoquaux, Gaël; Gramfort, Alexandre; Michel, Vincent; Thirion, Bertrand; Grisel, Olivier; Blondel, Mathieu; Prettenhofer, Peter; Weiss, Ron; Dubourg, Vincent; Vanderplas, Jake; Passos, Alexandre; Cournapeau, David; Brucher, Matthieu; Perrot, Matthieu; Duchesnay, Edouard; Louppe, Gilles. "Scikit-learn: Machine Learning in Python" [13] | Journal of Machine Learning Research, Vol. 12, pp. 2825-2830, 2011 | Details of scikit-learn
13 | Go, Alec; Bhayani, Richa; Huang, Lei. "Twitter sentiment classification using distant supervision" [14] | Processing, Vol. 150, 2009 | Building a classifier for tweets

EXPERIMENTAL CORPORA
Twitter Developer Access and the Twitter API

Primary importance was given to collecting a dataset to start the analysis. For this purpose, the developer access offered by Twitter was used. For academic/research use, Twitter offers Standard API access, which is restricted in certain aspects, such as the rate and volume at which data can be transferred through the API, the retrieval of Tweet insights, and advanced filtering capabilities. However, since the project focuses on simple tweet characteristics, the Standard API was sufficient.

Tweepy

Python 3 was used throughout the project due to the extensive library support available. Tweepy is one such library: a wrapper around the Twitter API with several classes and functions that enable ease of access. The necessary credentials to be passed in include the API key and the secret keys, which can be generated from the developer portal on Twitter. Once the credentials are verified, the API is ready to be used. Every tweet collected is in the form of a Status object, which holds the complete details regarding the tweet and the user who posted it, such as the timestamp, Tweet ID, User ID, user screen name, the date the account was created, and other such details.
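As an illustration, below is a minimal sketch of such a collection setup using the Tweepy 3.x streaming interface. The credential values and the tracked keyword are placeholders, and the exact fields stored by the project may differ.

```python
import tweepy

# Placeholder credentials generated from the Twitter developer portal
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)


class TweetCollector(tweepy.StreamListener):
    """Receives one Status object per tweet on the live stream."""

    def on_status(self, status):
        record = {
            "tweet_id": status.id_str,
            "user_id": status.user.id_str,
            "screen_name": status.user.screen_name,
            "account_created": status.user.created_at,
            "timestamp": status.created_at,
            "text": status.text,
        }
        print(record)  # in practice, append to the collected dataset

    def on_error(self, status_code):
        return False   # stop streaming on errors such as rate limiting


stream = tweepy.Stream(auth=api.auth, listener=TweetCollector())
stream.filter(track=["covid"], languages=["en"])  # placeholder keyword
```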

We compared and evaluated four widely used sentiment tools: VADER, SentiStrength, TextBlob, and Sentiment140. The tools were not in consensus.

COVID-19 Dataset – Support

The function of the supporting dataset was to help validate the potential trends at the end of clustering and sentiment analysis. For this purpose, a parallel collection code was run while collecting the primary dataset, focusing on collecting all tweets posted on verified Twitter handles of English media sources. The list was based on the most popular media sources with a considerable user following. A total of 99 accounts were shortlisted for this purpose, and the collection code was tweaked to track all 99 accounts instead of any specific keywords.

1992
JOURNAL OF CRITICAL REVIEWS
ISSN- 2394-5125 VOL 7, ISSUE 06, 2020

TF-IDF

TF-IDF, or Term Frequency-Inverse Document Frequency, refers to a statistical measure used to identify and rank the relevance of a term in a document or collection. Different variations of the measure exist and are used in applications such as search engines to rank results. The term frequency and the inverse document frequency are multiplied to obtain the TF-IDF value; the IDF factor down-weights terms that appear in many documents and is commonly computed as idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t. An example of a TF-IDF matrix generated for a set of four texts is sketched below.

tf-idf(t, d) = tf(t, d) * idf(t)
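As a sketch of such a matrix, the following example computes TF-IDF values for four short illustrative texts using scikit-learn's TfidfVectorizer (which, note, applies a smoothed variant of the IDF formula above by default):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Four illustrative texts (not taken from the project's corpus)
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be friends",
    "the mat and the log are old",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)  # sparse 4 x |vocabulary| matrix

# Render the matrix with one row per text and one column per term
# (get_feature_names_out() in scikit-learn >= 1.0)
table = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
print(table.round(2))
```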

Term Frequency

The Term Frequency (TF) measures the frequency of a particular term in a document. There are multiple ways to obtain
the term frequency. The most common technique is to assign the term frequency directly as the count of occurrences in
the document.

Clustering

Clustering refers to the technique of grouping similar objects in a given set. It was an essential segment of the project, since the resulting clusters would be the potential trending topics. Numerous clustering algorithms are available, working on different principles. Since there is no limit on the number of trends in a given corpus, the project required an algorithm that does not need a specific number of clusters to be pre-defined. The most viable option was Hierarchical Clustering.

For clustering, the first task was to group tweets according to the time they were posted. The majority of the implementation was done on Google Colab, as it offers better performance than a local instance, providing a maximum of 25 GB of RAM on the GPU instance.

The COVID-19 dataset was used for the clustering. The algorithm was first tested and fine-tuned on a smaller scale using the Diet dataset, which consisted of 29702 tweets. The first task in the intended flow, as shown in Figure 6.1, was to group the tweets according to a specific time window and then pre-process the tweet text for each group. The resulting pre-processed version of the tweets was fed into a vectorizer, the TfidfVectorizer from the scikit-learn library, which converts each tweet into an n-dimensional TF-IDF vector. For the project, all 2- and 3-grams were considered to form the features. To further improve the vectors, two extra parameters, the minimum and maximum document frequency, were also included. The minimum document frequency ensured that a term (an n-gram in this case) was only included in the final list of features if it recurred in at least k documents. The maximum document frequency, on the other hand, ensured that only terms relevant to the topic of the tweet were retained; this was achieved by eliminating all terms that occur in more than a specific number of tweets in the collection.
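A minimal sketch of this vectorizer configuration follows. The min_df and max_df thresholds shown are illustrative assumptions, since the exact values used in the project are not stated, and the tweet texts here are toy stand-ins for one time window:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the pre-processed tweets of one 10-minute window
preprocessed_tweets = [
    "lockdown extended two weeks",
    "lockdown extended across the state",
    "vaccine trial shows promise",
    "vaccine trial enters phase two",
    "lockdown extended two weeks officials say",
    "officials say vaccine trial safe",
]

vectorizer = TfidfVectorizer(
    ngram_range=(2, 3),  # all 2- and 3-grams form the features
    min_df=2,            # keep an n-gram only if it occurs in >= 2 tweets
    max_df=0.8,          # drop n-grams occurring in > 80% of the tweets
)
tfidf_vectors = vectorizer.fit_transform(preprocessed_tweets)
print(vectorizer.get_feature_names())
```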

The TF-IDF vector obtained for each tweet was then passed into the cosine similarity function, resulting in an n x n matrix (where n is the number of tweets in the collection), which is, in turn, used to obtain the pairwise similarity and distance between tweets. The finalized time window was 10 minutes, because the Colab instance has a limit on the computational resources it offers: increasing the window size increases the number of tweets, which sharply increases the number of n-grams and, thereby, the dimension of the vectors. The pairwise cosine distance matrix (computed from the similarity matrix) was then used as the metric for hierarchical agglomerative clustering. All three linkage criteria (single, average, and complete) were used to generate the linkage matrices. Each linkage matrix was then used to compute the height of the slice, or threshold, that would produce the optimum number of clusters. The final results, consisting of each tweet ID and its corresponding cluster label, were saved in pickle format.

Effect of Text Pre-Processing

To understand the effect of pre-processing on sentiment analysis and on vectorization for clustering, the experiments were repeated with the raw tweet text replaced by its pre-processed equivalent. Important metrics, such as run time and n-gram counts, were recorded in each case. A sketch of the pre-processing pipeline is given below.
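The paper does not enumerate the individual pre-processing steps; the sketch below assumes a typical tweet-cleaning pipeline (lowercasing, removal of URLs and mentions, stop-word removal, and optional Porter stemming, with the stem flag switched off later for topic detection):

```python
import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Requires: nltk.download("stopwords")
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()


def preprocess(tweet, stem=True):
    """Clean one tweet; stem=False keeps topics human-readable."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag words
    text = re.sub(r"[^a-z\s]", " ", text)      # remove punctuation/digits
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    if stem:
        tokens = [STEMMER.stem(w) for w in tokens]
    return " ".join(tokens)


print(preprocess("Lockdown extended!! Details at https://t.co/x #COVID19"))
```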

Training Data

To train the classifier, the Postgame 2012 dataset was used. The original dataset consisted of 290879 tweet IDs in total, collected during NFL regular-season games in 2012. However, since only the tweet IDs were made public by the authors, we had to query Twitter again to retrieve the tweet texts. We could not download the text for all the tweet IDs, as some of the tweets had already been removed or the accounts were no longer public. The remaining dataset consisted of 100996 tweets in total. Labelling the dataset was the next task. This was accomplished by using the sentiment tools that had more than 50% agreement. The polarity of each tweet according to the tools was individually determined, and a majority system (sketched below) was used to assign the final polarity. If the tools were in complete disagreement, the tweet was discarded from the training set. This left 86278 labelled tweets.
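A minimal sketch of the majority-vote labelling, assuming each tweet carries one polarity string per tool:

```python
from collections import Counter


def majority_label(polarities):
    """Return the polarity chosen by a strict majority of the sentiment
    tools, or None (tweet discarded) when no strict majority exists."""
    label, votes = Counter(polarities).most_common(1)[0]
    return label if votes > len(polarities) / 2 else None


print(majority_label(["positive", "positive", "negative"]))  # positive
print(majority_label(["positive", "neutral", "negative"]))   # None -> discard
```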

Topic Detection

The next task was to assign a suitable topic to each cluster. For this purpose, the TF-IDF metric was used again. Once the tweets in the clusters were finalized, the tweet texts in each cluster were pre-processed, only this time the stemming stage was skipped: topics are ideally human-readable terms, and stemming the words would result in hazy topics being assigned to the clusters. The pre-processed tweets were then passed into the TfidfVectorizer, where the dictionary/list of features is populated with the n-grams of all the tweets. The topic for the cluster is assigned based on the n-grams with the highest TF-IDF values in the list of features, with the top 10 terms considered for each cluster, as sketched below.
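A sketch of the topic-assignment step follows; summing each n-gram's TF-IDF score over the tweets of the cluster is an assumption about how the "highest TF-IDF value" was aggregated:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_topic(cluster_tweets, top_k=10):
    """Return the top_k n-grams (by aggregated TF-IDF) as the cluster topic.

    cluster_tweets: the cluster's pre-processed, unstemmed tweet texts.
    """
    vectorizer = TfidfVectorizer(ngram_range=(2, 3))
    matrix = vectorizer.fit_transform(cluster_tweets)
    scores = np.asarray(matrix.sum(axis=0)).ravel()  # one score per n-gram
    terms = np.array(vectorizer.get_feature_names())
    return terms[np.argsort(scores)[::-1][:top_k]].tolist()
```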

Manual Validation

To validate the potential trending clusters generated, a manual validation technique was employed. For this purpose, the news tweets (the support dataset) were grouped into the same time intervals as the primary dataset, and text pre-processing without the final stemming step was applied to the news tweets in each interval. The idea was to compare the topics of each cluster against the news tweets; if a match was found, the cluster was confirmed to be a trending topic.
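The matching rule itself is not spelled out in the paper; the sketch below assumes that any overlap between a cluster's topic n-grams and the n-grams of the news tweets in the same window counts as a confirmation:

```python
def is_confirmed_trend(cluster_topics, news_ngrams):
    """True if any of the cluster's topic n-grams also appears among the
    n-grams extracted from news tweets of the same time interval."""
    return bool(set(cluster_topics) & set(news_ngrams))


print(is_confirmed_trend(["lockdown extended", "vaccine trial"],
                         {"vaccine trial", "market update"}))  # True
```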

Net Sentiment Rate

To correlate sentiment with trending topics, it was necessary to compute the sentiment of each cluster as a whole. The Net Sentiment Rate, or NSR, was computed for all the clusters across time slots using an appropriate sentiment tool. The choice of sentiment tool was based on the earlier results obtained from the NFL dataset; however, to further support the result, four different tools were used to compute the NSR value.
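The paper does not reproduce the NSR formula; a common definition, assumed here, is the difference between the numbers of positive and negative tweets in a cluster, normalized by the total number of tweets:

NSR = (positive tweets - negative tweets) / total tweets

Under this definition NSR ranges from -1 (entirely negative) to +1 (entirely positive), with values near 0 indicating a predominantly neutral or mixed cluster.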

Results
For clustering, the COVID-19 dataset was used. Hierarchical agglomerative clustering was applied with all three variants of linkage: single, complete, and average. The dataset was first grouped into tweets within 10-minute intervals: the first group consisted of tweets posted between 14:24 and 14:34, the second of tweets between 14:34 and 14:44, and so on, giving a total of 28 groups. The entire set was manually checked for consistency within clusters. The elementary check of a clustering algorithm is to ensure that all re-tweets fall under the same cluster; this was easily achieved, and multiple groups of re-tweets were found in the dataset. The clusters were also assigned topics by selecting the top ten highest-ranking n-grams using TF-IDF as a metric. The clusters are referred to as potential trending topics, with the subject being the top n-grams.

Trending cluster plot – Complete linkage criteria



Trending clusters scatterplot – SentiStrength

Trending clusters scatterplot – Sentiment140

Trending clusters scatterplot – TextBlob

Trending clusters scatterplot – VADER


The results from the NSR plots were contrary to the initial assumption that trending topics are predominantly topics with an extreme polarity. In this particular case, the COVID situation, being a global pandemic, was trending even though the NSR values were very close to neutral. The pattern extracted from the plots was that the trending topics were all tightly packed towards the neutral section of the NSR range.

Even though VADER and SentiStrength had the highest agreement on the COVID dataset (60.58%), TextBlob and SentiStrength were observed to be the most accurate for classifying a cluster as trending or otherwise, owing to the clear trending boundary achieved with these tools. The exact polarities of the trending tweets are not in agreement, which may pose challenges for applications centered around sentiment analysis. A simple linear classifier could be used to accurately predict all trending clusters using a specific linkage criterion and sentiment analysis; this indicates that sentiment can be used as a feature to classify trending clusters given a list of possible trends. Sentiment140 was comparatively the worst choice, as it classified a majority of the clusters as neutral due to its training bias.

Conclusion
The initial conclusion that VADER, SentiStrength, and TextBlob are the most effective choices for sentiment analysis had to be revised further along the course of the project due to the drastic change in agreement observed on the COVID dataset: the tools had an agreement of 51.8% on the NFL dataset, but on the COVID dataset the agreement figure was reduced to 33.88%. This suggested that the choice of sentiment tool is heavily dependent on the dataset or the domain of the application. The effect of text pre-processing on estimating the polarity of a tweet and on clustering was also quantified.

The choice of clustering algorithm was justified and executed successfully, resulting in a set of clusters with appropriate topics. However, a clear correlation between the confirmed trending topics and their sentiment was not obtained. In the case of the COVID dataset, the NSR plots suggested that clusters that were predominantly neutral were the most trending, in direct contradiction to the initial assumption that extreme polarities are required to generate a trending topic. This can, however, be attributed to the dataset (COVID) being predominantly neutral, and also to the choice of tool: VADER and Sentiment140 classified all trending clusters quite close to the neutral section, whereas TextBlob classified all trending clusters as slightly positive and SentiStrength classified them as slightly negative. It was also noted that the NSR plots of the various clusters generated could be used to train a simple linear classifier to accurately predict trending topics from a list of clusters, as there exists a clear differentiator between the two classes. A more detailed study of the subject, comprising datasets from multiple domains, would be required to arrive at a conclusive result on the relationship between sentiment and the emergence of trending topics.

References
[1] Palomino, Marco A.; Varma, Aditya Padmanabhan; Bedala, Gowriprasad Kuruba; Connelly, Aidan. "Investigating the Lack of Consensus among Sentiment Analysis Tools". Human Language Technology: Challenges for Computer Science and Linguistics, Springer International Publishing, 2020 (in press).

[2] Sakaki, Takeshi; Okazaki, Makoto; Matsuo, Yutaka. "Earthquake Shakes Twitter Users: Real-Time Event Detection by Social Sensors". Proceedings of the 19th International Conference on World Wide Web, WWW '10, 2010, pp. 851-860.

[3] Yu, Yang; Wang, Xiao. "World Cup 2014 in the Twitter World: A big data analysis of sentiments in U.S. sports fans' tweets". Computers in Human Behavior, Vol. 48, 2015, pp. 392-400.

[4] Aiello, Luca Maria; Petkos, Georgios; Martín Dancausa, Carlos; Corney, David; Papadopoulos, Symeon; Skraba, Ryan; Göker, Ayse; Kompatsiaris, Ioannis; Jaimes, Alejandro. "Sensing Trending Topics in Twitter". IEEE Transactions on Multimedia, Vol. 15, No. 6, 2013, pp. 1268-1282.

[5] Strehl, Alexander; Ghosh, Joydeep; Mooney, Raymond. "Impact of Similarity Measures on Web-page Clustering". Workshop on Artificial Intelligence for Web Search (AAAI 2000), 2001.

[6] Murtagh, Fionn; Downs, Geoff; Contreras, Pedro. "Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding". SIAM J. Scientific Computing, Vol. 30, 2008, pp. 707-730.

[7] Feldman, Ronen. "Techniques and Applications for Sentiment Analysis". Communications of the ACM, Vol. 56, 2013, pp. 82-89.

[8] Hutto, C.J.; Gilbert, Eric. "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text". Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM, 2014.

[9] Pak, Alexander; Paroubek, Patrick. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining". Proceedings of LREC, Vol. 10, 2010.

[10] Ribeiro, F.N.; Araújo, M.; Gonçalves, P.; Gonçalves, M.A.; Benevenuto, F. "SentiBench - a benchmark comparison of state-of-the-practice sentiment analysis methods". EPJ Data Science, Vol. 5, 2016.

[11] Zhou, X.; Tao, X.; Yong, J.; Yang, Z. "Sentiment analysis on tweets for social events". Proceedings of the 2013 IEEE 17th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Whistler, BC, 2013, pp. 557-562.

[12] Thelwall, Mike; Buckley, Kevan; Paltoglou, Georgios; Cai, Di; Kappas, Arvid. "Sentiment Strength Detection in Short Informal Text". Journal of the American Society for Information Science and Technology, Vol. 61, 2010, pp. 2544-2558.

[13] Pedregosa, Fabian; Varoquaux, Gaël; Gramfort, Alexandre; Michel, Vincent; Thirion, Bertrand; Grisel, Olivier; Blondel, Mathieu; Prettenhofer, Peter; Weiss, Ron; Dubourg, Vincent; Vanderplas, Jake; Passos, Alexandre; Cournapeau, David; Brucher, Matthieu; Perrot, Matthieu; Duchesnay, Edouard; Louppe, Gilles. "Scikit-learn: Machine Learning in Python". Journal of Machine Learning Research, Vol. 12, 2011, pp. 2825-2830.

[14] Go, Alec; Bhayani, Richa; Huang, Lei. "Twitter sentiment classification using distant supervision". Processing, Vol. 150, 2009.

[15] Sinha, Shiladitya; Dyer, Chris; Gimpel, Kevin; Smith, Noah A. "Predicting the NFL using Twitter". ECML/PKDD 2013 Workshop on Machine Learning and Data Mining for Sports Analytics, 2013.

[16] Lingeshwari, S. "Provisioning of efficient authentication technique for implementing in large scale networks (PEAT)". International Journal of MC Square Scientific Research, Vol. 6, No. 1, 2014, pp. 34-42.
