Abstract—Analyzing public information from social networking sites could produce exciting results and insights on the public opinion of almost any product, service, or behaviour. One of the most effective and accurate public sentiment indicators is social network data mining, as many users tend to express their opinions online. The internet's advanced technology has increased activity in blogging, tagging, posting, and online social networking. As a result, people are growing interested in mining these vast data resources to analyze opinions. Sentiment analysis is one of the computational techniques for studying opinions, sentiments, and the subjectivity of texts. In this paper, the methodology for determining these public opinions is discussed. A sentiment analysis program is developed to create a platform for social network analysis. This paper also discusses the sentiment analysis design, data gathering, training, and visualization using Python libraries. Finally, a platform is designed so that other users can search the sentiment results of particular topics of interest. A total of 3000 Reddit posts and 3000 Twitter posts have been gathered, cleaned, analyzed, and visualized in this research. The analysis produced accuracies of 83% and 77% for Twitter and Reddit data, respectively. Moreover, the GUI platform has been built using the Tkinter library.

Keywords—Social network analysis, sentiment analysis, Twitter, Reddit, Python

I. INTRODUCTION

Over the past decade, social networking sites have grown and evolved into a powerful platform for communicating with people and for acquiring and spreading information in areas such as business, politics, entertainment, the latest trends in food and fashion, and education. Social network sites are believed to be very popular because they give users the opportunity to receive, create, and share opinions, feelings, interests, pictures, and videos in public instantly. The overwhelming growth of social network usage has produced an enormous accumulation of data on these sites in many formats, such as text, videos, and pictures. These data fall into two categories: structured data, such as relationships between people, and unstructured data, such as textual content in the social networks.

Social network analysis (SNA) is generally defined as mapping and measuring the relationships and flows between people, groups, organizations, computers, or other information-processing entities. SNA is frequently used in examining individual and social group structures and behaviours (breaking down into components, clustering, determining the relations) and in the analysis of large data sets (media follow-up, academic publication analysis, genetic research). Researchers have utilized various data mining techniques during SNA.

Sentiment analysis detects a text's contextual polarity and determines whether a particular text is positive, negative, or neutral. It is also called opinion mining, as it tries to understand individuals' attitudes [1]. Two main techniques for sentiment analysis are machine learning-based techniques and lexicon-based techniques.

Sentiment analysis using the machine learning approach requires training and testing sets for classification. An automatic classifier uses the training set to learn the features and distinguishing characteristics of a document. Then the test set is used to validate the model by seeing how well the classifier performs. Machine learning techniques such as Naive Bayes (NB), Maximum Entropy (ME), and Support Vector Machines (SVM) have been used for sentiment analysis [2]. In the lexicon-based approach, classification is done by comparing a given text's features with a sentiment lexicon. The sentiment lexicon contains lists of words and expressions used to express people's subjective feelings and opinions. Some of the lexicon methods are the baseline approach, stemming, and Part-of-Speech (PoS) tagging [3]. As for deep learning, it uses multiple layers of nonlinear processing units for feature extraction and transformation. The lower layers learn simple features, while higher layers learn more complex features derived from lower-layer features.

II. LITERATURE REVIEW

Social networks arise from interactions between individuals, groups, organizations, and other related systems that are represented as nodes. These nodes are related to each other through certain types of interdependencies, such as similar personal values, visions, ideas, and multiple other aspects of human relationships [4]. A social network is a medium or system that allows people to interact with each other, and these interactions are represented as edges or ties. When connections between the nodes and edges are identified, a social network structure is created. Most individuals use the social network to communicate with people who are already in their extended social network instead of searching to meet new individuals [5].

The growth in users and user-generated content on websites, social network platforms, and other online platforms results in abundant information available on the internet. Therefore, social network data is being measured and analyzed in different areas to obtain information on current
Authorized licensed use limited to: UNIVERSITAS GADJAH MADA. Downloaded on May 09,2021 at 10:42:21 UTC from IEEE Xplore. Restrictions apply.
The 8th International Conference on Cyber and IT Service Management (CITSM 2020)
On Virtual, October 23-24, 2020
issues, trends, future trends, and other information. This measurement of social network data and structure is called SNA, which is the research or study of social structure. It studies the relationships among people (nodes) and their interactions (edges), and uses graph-analytic techniques to explore the characteristics of any social network [6].

Social network analysts approach networks in two ways. The first approach is called the egocentric network. The egocentric network shows the number of relationships an actor has with other individual actors in any particular environment. It also considers what kind of ties they have with each other and what kind of information they give to and receive from each other in their network. This approach is used when the population is huge or the population's boundaries are hard to define. Data obtained from this network can be used to guide new clients or to adapt information services to clients' behaviours.

The other approach is known as the whole network, which describes how the members of an environment maintain their ties with all other members of the same environment. This approach requires all the members of a particular environment to respond to all other members of the environment. Therefore, there is a limited number of actors that one can include for it to be a reasonable study. Nevertheless, groups of individuals that engage in similar activities can be identified [7].

A. Application Programming Interface (API) Comparison

With the development of technology, much information can be obtained by simply typing into a search engine. Some of the social networking sites that we use daily provide much information that can be used for social network analysis. Data mining should be applied to gather data from the social network sites intended for research use. The developers of some of these social media, such as Facebook, Twitter, and Reddit, provide users with an Application Programming Interface (API) that permits researchers to access some information from the website.

An API is a communication medium between a client and a server. It helps developers extract data from one location to another by providing functions that copy these files. The usage of an API varies depending on the programming language applied. The services and instructions on API use are usually described in the API documentation of the respective social network. The data needs to be cleaned first by removing any words that do not add value to the analytics, such as emoticons and special characters.

The social media APIs discussed in this section are the Facebook API, Twitter API, Instagram API, and Reddit API, chosen for their popularity. The latest Facebook API version has split into several functions, with the Facebook Graph API as the primary API and others, such as the Facebook Marketing API, as extensions of it. All the APIs investigated use the RESTful protocol, which only supports limited-time connections and a limited number of API calls per day. The data responses retrieved are in JavaScript Object Notation (JSON) format. The differences between the APIs are summarized in Table I. More details can be found in the respective documentation, such as the Facebook Graph API documentation, Twitter Docs, and the Reddit API documentation.

TABLE I. COMPARISON OF SOCIAL MEDIA FEATURES

Feature         Facebook                Twitter                  Instagram             Reddit
Tools           Graph API Explorer      Tweepy                   Graph API Explorer    Reddit API Wrapper
Latest version  V4.0                    V3.8.0                   v4.0                  V6.4.1
Public API      REST API                REST API                 REST API              REST API
Language        Java                    Python                   Java                  Python
Restriction     Business accounts       Rate limits are on a     Business accounts     Limited to 3600 API
                only; limited to 3000   per-user, 15-minute      only; limited to 200  calls per hour.
                API calls per hour.     basis; limited to 720    API calls per hour.
                                        API calls per hour.
Capabilities    Create, update, and     Get a user's tweets,     Retrieve photos       Retrieve all posts,
                delete objects per      followers, followed      with a given          comments, and votes
                HTTP request on any     people, and specific     hashtag.              on a specific
                nodes.                  hashtags.                subreddit.

B. Related Works

Sentiment analysis is a vast area of study, which requires research to be done on related works. There are various approaches to sentiment analysis. In [8], sentiment analysis was performed on movie reviews. The authors referred to their previous work applying the Support Vector Machine method and improved its accuracy by focusing the text categorization techniques on the subjective portion of a document. They proposed a graph-cut-based subjectivity detector that produces extracts of the original reviews. The corpus-based approach was investigated and validated in [9] by combining domain-specific word embeddings with a propagation framework that could induce accurate domain-specific sentiment lexicons. They recreated known sentiment lexicons in different fields: standard English, Twitter, and finance. The authors set up baselines for comparison with all the different approaches.

In addition, the study in [10] made an empirical comparison between SVM and Artificial Neural Networks (ANN) for document-level sentiment classification. The experiment was conducted on four datasets. The authors successfully demonstrated that ANN produced results competitive with SVMs in most cases. In [11], the authors created an attention-based LSTM network for cross-language sentiment classification at the document level. The model consists of bi-LSTMs for bilingual representation, and each LSTM is structured hierarchically with four layers. In this setting, it helps refine sentiment classification performance by constructively adapting sentiment information from English, a resource-rich language, to Chinese, a resource-poor language.
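The data-gathering flow described in the API comparison above — call a REST API, receive JSON records, and store them for analysis — can be sketched as follows. The PRAW call shown in the comments needs real credentials, and the subreddit name and field names are illustrative assumptions, not values from this paper:

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Serialize fetched posts (a list of dicts) into CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# With PRAW the fetch would look roughly like this (requires credentials):
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...",
#                        user_agent="...")
#   rows = [{"title": s.title, "score": s.score}
#           for s in reddit.subreddit("Coronavirus").hot(limit=100)]

# Offline demonstration with stand-in records:
sample = [{"title": "post one", "score": 10},
          {"title": "post two", "score": 3}]
print(rows_to_csv(sample, ["title", "score"]))
```

Tweepy follows the same shape for Twitter: authenticate, iterate over the API's result objects, and write the fields of interest to a .csv file.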
From Table II, it can be concluded that the techniques producing the highest accuracy are the corpus-based methodology, followed by the Artificial Neural Network, at 93.3% and 90.3%, respectively. However, these methods rely heavily on the emotional dictionary and are less stable when dealing with noisy terms, despite having a fast learning curve. These noisy terms might cause the accuracy to be lower than expected.

The attention-based LSTM requires many parameters and takes a long time to analyze data. Since a limited amount of time was available for this research, this methodology is not used despite its accuracy of 84%. Moreover, the data analyzed might not have widely separated occurrences, which also argues against this methodology.

The Support Vector Machine produces a good accuracy of 86% and supports feature learning and parameter optimization. However, it requires large datasets and is time-consuming. Therefore, the Naïve Bayes methodology is used, as it also produces a good accuracy of 83.06% and delivers results quickly with high precision and recall.

III. DESIGN AND IMPLEMENTATION

Fig. 1 shows the general flowchart of social network analysis. After gathering data from Reddit or Twitter using specific keywords, we need to perform data pre-processing.

Fig. 1. General Flowchart of Social Network Analysis

The cleaning process includes making text lower case, removing punctuation, removing numerical values, and tokenizing the text. The raw data is cleaned to ensure that it makes sense before going to the tokenization process. Finally, the data can be classified and visualized.

A. Software Design

In this research, Twitter and Reddit are the social networking sites analyzed, chosen for their rich data and ease of access. Many people post on these platforms; the number of monthly users is approximately 330 million, which further shows that these platforms hold rich information from their vast number of users.

PRAW and Tweepy are documented Python libraries for accessing the Reddit and Twitter APIs, respectively. Neither library is highly complex to use for obtaining data from the APIs. Both are open-source, and the software is redistributable. Therefore, both of these Python libraries are selected, as they have excellent documentation prepared by the developers of each website.

The text classifier used follows a supervised machine learning paradigm: the classifier needs to be trained on labelled training data before being applied to the actual classification task. Naïve Bayes is a statistical classifier that can be adapted to sentiment classification problems. The model works with bag-of-words (BOW) feature extraction, which ignores the position of words. Various training sets are available, such as Movie Reviews, Twitter, and others. The Twitter dataset can be used for both Twitter and Reddit.

Natural Language Processing (NLP) is a field of computer science and linguistics that deals with the interaction between computers and human languages. This approach efficiently uses the publicly available SentiWordNet library, which provides sentiment polarity values for every term occurring in a document. In this library, each term is associated with three numerical scores: objective, positive, and negative. These three scores are computed by combining results produced by classifiers. WordNet has a
large lexical database in English and is publicly available for download free of charge. Some of the major tasks that help extract sentiment using NLP are extracting the part of the sentence that reflects the sentiment and understanding the sentence structure. NLP also has different tools that help with processing the textual data.

B. Data Gathering

Preparing or collecting raw data is the initial step in any data mining work. The process is very flexible, and it depends on the specific subject that the user is interested in researching. Firstly, raw data is gathered on both platforms by scraping data using a specified keyword. In this paper, we used coronavirus or COVID-19 as the keyword. This keyword was chosen because it was the most popular search topic at the time of this research. Fig. 2 and Fig. 3 show the trending topics and communities on Twitter and Reddit, respectively.

To gather data from Twitter, an app needs to be created on the Twitter developer website to obtain keys and token management from the website. Then, the program code for the data extraction is written by the user using the Twitter APIs. The results generated are converted into .csv files.

The process uses the Natural Language Toolkit (NLTK), a library mainly focused on NLP and deep learning. NLTK is believed to be an excellent tool for working with computational linguistics in Python. The library has a huge number of corpora sources as well as text processing libraries for classification, tokenization, stemming, and other features that help with cleaning the data.

C. Data Cleaning

Data cleaning usually involves removing all special characters, null values, or any other words that do not add value to the analytics results. It also deals with duplicate data and other outliers. There are several standard techniques (pre-processing techniques) for cleaning text data. The cleaning process includes making text lower case, removing punctuation, removing numerical values, removing symbols, tokenizing text, and others. The raw data is cleaned to ensure that it makes sense before going to the tokenization process, as shown in Fig. 4.
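The cleaning steps listed above (lower-casing, removing punctuation, numerical values, and symbols, then tokenizing) can be sketched as a small pipeline. A plain `str.split()` stands in for NLTK's `word_tokenize` so the sketch runs without corpus downloads; the regular expressions are illustrative, not the paper's exact code:

```python
import re
import string

def clean_text(text):
    """Apply the standard pre-processing steps to one raw post."""
    text = text.lower()                       # make text lower case
    text = re.sub(r"\d+", "", text)           # remove numerical values
    text = text.translate(                    # remove punctuation/symbols
        str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

def tokenize(text):
    # With NLTK this would be nltk.word_tokenize(text)
    return text.split()

print(tokenize(clean_text("COVID-19 cases RISING!! #stayhome")))
# → ['covid', 'cases', 'rising', 'stayhome']
```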
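After cleaning, the text can be classified. A minimal sketch of the supervised Naïve Bayes bag-of-words classifier described in the software design, using NLTK; the four-sentence labelled training set below is invented for illustration (the paper trains on a Twitter dataset):

```python
import nltk

def bow_features(tokens):
    # Bag-of-words features: word presence only, position ignored
    return {word: True for word in tokens}

# Hypothetical labelled training data, stand-in for a real corpus
train = [
    (bow_features("i love this great movie".split()), "pos"),
    (bow_features("what a wonderful happy day".split()), "pos"),
    (bow_features("i hate this terrible movie".split()), "neg"),
    (bow_features("what an awful sad day".split()), "neg"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(bow_features("a great happy movie".split())))
```

A real run would replace `train` with featurized, labelled tweets and evaluate on a held-out test set, as described in the introduction.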
The algorithm performs the polarity check: it calculates the polarity and sentiment, then displays whether the result is positive or negative. The classifier also reports sentiment subjectivity. The subjectivity of the text lies in the range 0.0 to 1.0, where 0.0 classifies the text as objective and 1.0 as very subjective. Thus, it can be further calculated and visualized in a pie chart, as shown in Table IV.
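The polarity check above can be illustrated with a tiny lexicon-based scorer that averages the scores of opinion words found in a text. The six-entry lexicon is invented for illustration; the paper's platform relies on a trained Naïve Bayes classifier and resources such as SentiWordNet:

```python
# Hypothetical opinion lexicon: word -> polarity score in [-1.0, 1.0]
LEXICON = {"good": 1.0, "great": 1.0, "happy": 0.8,
           "bad": -1.0, "terrible": -1.0, "sad": -0.8}

def polarity(tokens):
    """Average the scores of the opinion words present; 0.0 if none."""
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def label(tokens):
    p = polarity(tokens)
    return "positive" if p > 0 else "negative" if p < 0 else "neutral"

print(label("the movie was great but the ending was sad".split()))
# → positive  (average of 1.0 and -0.8 is above zero)
```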
From the pie chart, we can conclude that both platforms show negative sentiment towards the chosen keyword topic. There is an abundance of reasons that could cause these negative sentiments. We can make predictions by looking at the most frequent words used, which can be done with the word cloud function. An example of the word cloud can be seen in Fig. 7.

Fig. 7. Sample word cloud of Reddit sentiment analysis

Next, the user can insert the keyword and the number of data items they want to analyze. The output will then be represented as pie-chart visualizations and a word cloud according to the keyword that has been entered.

V. CONCLUSIONS AND FUTURE WORKS

In this paper, the PRAW and GetOldTweets3 libraries have been used to gather and clean the data. A sentiment analysis platform was developed using Tkinter. This platform is supported by the Naïve Bayes algorithm, which produced a good accuracy of 83% and 77% for Twitter and Reddit, respectively. The classification report also shows that all the tests produce an accuracy of more than 70%, which indicates an acceptable accuracy level. This accuracy level was achieved by carefully experimenting with the sentiment analysis process, from data gathering up to data classification. Future works include obtaining other social media data, different visualizations, and different sentiment analyses.