
Social Network Analysis using Python Data Mining

1st Teddy Surya Gunawan
Electrical and Computer Engineering Department
International Islamic University Malaysia
Kuala Lumpur, Malaysia
tsgunawan@iium.edu.my

2nd Nur Aleah Jehan Abdullah
Electrical and Computer Engineering Department
International Islamic University Malaysia
Kuala Lumpur, Malaysia
aleahjehan1996@gmail.com

3rd Mira Kartiwi
Information Systems Department
International Islamic University Malaysia
Kuala Lumpur, Malaysia
mira@iium.edu.my

4th Eko Ihsanto
Electrical Engineering Department
Universitas Mercu Buana
Jakarta, Indonesia
eko.ihsanto@mercubuana.ac.id

Abstract—Analyzing public information from social networking sites can produce exciting results and insights into public opinion on almost any product, service, or behaviour. One of the most effective and accurate indicators of public sentiment is social network data mining, as many users tend to express their opinions online. Advances in internet technology have increased activity in blogging, tagging, posting, and online social networking. As a result, people are growing interested in mining these vast data resources to analyze opinions. Sentiment analysis is a computational technique for studying opinion, sentiment, and the subjectivity of texts. In this paper, the methodology for determining these public opinions is discussed. A sentiment analysis program is developed to create a platform for social network analysis. This paper also discusses the sentiment analysis design and the gathering, training, and visualization of data using Python libraries. Finally, a platform is designed so that other users can search the sentiment results of particular topics of interest. A total of 3000 Reddit posts and 3000 tweets have been gathered, cleaned, analyzed, and visualized in this research. The analysis produced good accuracies of 83% and 77% for the Twitter and Reddit data, respectively. Moreover, the GUI platform has been built using the Tkinter library.

Keywords—Social network analysis, sentiment analysis, Twitter, Reddit, Python

I. INTRODUCTION

Over the past decade, social networking sites have grown and evolved into a powerful platform for communicating with people and for acquiring and spreading information in areas such as business, politics, entertainment, the latest trends in food and fashion, and education. Social network sites are believed to be popular because they give users the opportunity to receive, create, and share opinions, feelings, interests, pictures, and videos in public instantly. The overwhelming growth of social network usage has produced an enormous accumulation of data on these sites in many formats, such as text, videos, and pictures. These data fall into two categories: structured data, such as relationships between people, and unstructured data, such as the textual content in the social networks.

Social network analysis (SNA) is generally defined as mapping and measuring the relationships and flows between people, groups, organizations, computers, or other information-processing entities. SNA is frequently used in examining individual and social group structures and behaviours (breaking down into components, clustering, determining relations) and in the analysis of large data sets (media follow-up, academic publication analysis, genetic research). Researchers have utilized various data mining techniques during SNA.

Sentiment analysis detects a text's contextual polarity and determines whether a particular text is positive, negative, or neutral. It is also called opinion mining, as it tries to understand individuals' attitudes [1]. The two main techniques for sentiment analysis are machine learning-based techniques and lexicon-based techniques.

Sentiment analysis using the machine learning approach requires training and testing sets for classification. An automatic classifier uses the training set to learn the features and distinguishing characteristics of a document. The test set is then used to validate the model by seeing how well the classifier performs. Machine learning techniques such as Naive Bayes (NB), Maximum Entropy (ME), and Support Vector Machines (SVM) have been used for sentiment analysis [2]. In the lexicon-based approach, classification is done by comparing a given text's features against a sentiment lexicon, which contains lists of words and expressions used to express people's subjective feelings and opinions. Some of the lexicon methods are the baseline approach, stemming, and Part-of-Speech (PoS) tagging [3]. Deep learning, in turn, uses multiple layers of nonlinear processing units for feature extraction and transformation; the lower layers learn simple features, while higher layers learn more complex features derived from lower-layer features.

II. LITERATURE REVIEW

Social networks arise from interactions between individuals, groups, organizations, and other related systems, which are represented as nodes. These nodes are related to each other through certain types of interdependencies, such as shared personal values, visions, ideas, and multiple other aspects of human relationships [4]. A social network is a medium or system that allows people to interact with each other, and these interactions are represented as edges or ties. When the connections between the nodes and edges are identified, a social network structure is created. Most individuals use social networks to communicate with people who are already in their extended social network rather than searching to meet new individuals [5].

The growth in users and user-generated content on websites, social network platforms, and other online platforms results in abundant information being available on the internet. Therefore, social network data is being measured and analyzed in different areas to obtain information on current issues, trends, future trends, and other matters.


This measurement of social network data and structure is called SNA, the research or study of social structure. It studies the relationships among people (nodes) and their interactions (edges), and uses graph-analytic techniques to explore the characteristics of any social network [6].

Social network analysts approach networks in two ways. The first approach is called the egocentric network. The egocentric network shows the number of relationships an actor has with other individual actors in a particular environment. It also considers what kinds of ties they have with each other and what kinds of information they give to and receive from each other in their network. This approach is used when the population is huge or the population's boundaries are hard to define. Data obtained from this network can be used to guide new clients or to adapt information services to clients' behaviours.

The other approach is known as the whole network, which describes how the members of an environment maintain their ties with all other members in the same environment. This approach requires all the members of a particular environment to respond to all other members; therefore, only a limited number of actors can be included for the study to remain reasonable. Nevertheless, groups of individuals that engage in similar activities can be identified [7].

A. Application Programming Interface (API) Comparison

With the development of technology, much information can be obtained by simply typing a query into a search engine. Some of the social networking sites we use daily provide much information that can be used for social network analysis. Data mining should be applied to gather data from the social network sites intended for research use. The developers of some of these social media, such as Facebook, Twitter, and Reddit, provide users with an Application Programming Interface (API) that permits researchers to access some information from the website.

An API is a communication medium between a client and a server. It helps developers extract data from one location to another by providing functions that copy these files. The usage of an API varies depending on the programming language applied. The services and instructions for using an API are usually described in the API documentation of the respective social network. The data needs to be cleaned first by removing any words that do not add value to the analysis, such as emoticons and special characters.

The social media APIs discussed in this section are the Facebook API, Twitter API, Instagram API, and Reddit API, chosen for their popularity. The latest Facebook API version has split into several functions, with the Facebook Graph API as the primary API and others, such as the Facebook Marketing API, as extensions of it. All the APIs investigated use the RESTful protocol, which only supports limited-time connections and a limited number of API calls per day. The data responses retrieved are in JavaScript Object Notation (JSON) format. A summary of the differences between the APIs is given in Table I. More details can be found in the respective documentation, such as the Facebook Graph API documentation, Twitter Docs, and the Reddit API documentation.
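
As a simple illustration of the shared REST/JSON pattern summarized in Table I, the sketch below queries Reddit's public JSON search endpoint directly with the requests library. This is a minimal, illustrative example, not part of the paper's pipeline; production use should go through each platform's official client and respect the rate limits listed in Table I.

    import requests

    # Minimal REST/JSON sketch: query Reddit's public search endpoint.
    # A descriptive User-Agent is needed, as Reddit throttles generic ones.
    resp = requests.get(
        "https://www.reddit.com/r/all/search.json",
        params={"q": "coronavirus", "limit": 5},
        headers={"User-Agent": "sna-research-script/0.1"},
    )
    data = resp.json()  # responses arrive as JSON, as noted above
    for child in data["data"]["children"]:
        print(child["data"]["title"])
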
TABLE I. COMPARISON OF SOCIAL MEDIA FEATURES

Feature | Facebook | Twitter | Instagram | Reddit
Tools | Graph API Explorer | Tweepy | Graph API Explorer | Reddit API Wrapper
Latest version | V4.0 | V3.8.0 | V4.0 | V6.4.1
API | REST API | Public REST API | REST API | REST API
Language | Java | Python | Java | Python
Restriction | Business accounts only; limited to 3000 API calls per hour | Rate limits on a per-user, 15-minute basis; limited to 720 API calls per hour | Business accounts only; limited to 200 API calls per hour | Limited to 3600 API calls per hour
Capabilities | Create, update, and delete objects per HTTP request on specific nodes | Get a user's tweets, followers, followed people, and hashtags | Retrieve photos with a given hashtag | Retrieve all posts, comments, and votes on any subreddit

B. Related Works

Sentiment analysis is a vast area of study, and related works must be reviewed, as there are various approaches to sentiment analysis. In [8], sentiment analysis was performed on movie reviews. The authors built on their previous work applying the Support Vector Machine method and improved its accuracy by focusing text categorization techniques on the subjective portion of a document; they proposed a graph-cut-based subjectivity detector that extracts the subjective portion of the original reviews. The corpus-based approach was investigated and validated in [9] by combining domain-specific word embeddings with a propagation framework that can induce accurate domain-specific sentiment lexicons. The authors recreated known sentiment lexicons in different fields, namely standard English, Twitter, and finance, and set up baselines for comparison with all the different approaches.

Other than that, the study in [10] made an empirical comparison between SVM and Artificial Neural Networks (ANN) for document-level sentiment classification. The experiment was conducted on four datasets, and the authors demonstrated that ANN produced results competitive with SVMs in most cases. In [11], the authors created an attention-based LSTM network for cross-language sentiment classification at the document level. The model consists of bi-LSTMs for bilingual representation, and each LSTM is structured hierarchically with four layers. In this setting, the model refines sentiment classification performance by constructively adapting sentiment information from English, a rich language resource, to Chinese, a poorer one.


TABLE II. SUMMARY OF RELATED WORKS

Paper | Title | Methodology | Accuracy | Advantages | Disadvantages
[8] | A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts | Support Vector Machine | 86.4% | Supports feature learning and parameter optimization for best results | Large data requirement; works on a single domain
[9] | Inducing domain-specific sentiment lexicons from unlabeled corpora | Corpus-based | 93.3% | Can help find domain-specific opinion words and their orientations | Relies excessively on an emotional dictionary
[10] | Document-level sentiment classification: An empirical comparison between SVM and ANN | Artificial Neural Network | 90.3% | Faster classification process at running time | Less stable when dealing with noisy terms
[11] | Attention-based LSTM network for cross-lingual sentiment classification | Attention-based LSTM | 84.0% | Can quickly learn to distinguish between two or more widely separated occurrences of a particular element | Has more parameters, which makes datasets take longer to train

From Table II, it can be concluded that the techniques producing the highest accuracy are the corpus-based methodology followed by the Artificial Neural Network, at 93.3% and 90.3%, respectively. However, these methods rely too heavily on an emotional dictionary and are less stable when dealing with noisy terms, despite their fast learning curves. These noisy terms might cause the accuracy to be lower than expected.

The attention-based LSTM requires many parameters and takes longer to analyze data. Since a limited amount of time was available for this research and the method requires much time, it was not used despite its accuracy of 84%. Moreover, the data analyzed might not contain widely separated occurrences, which also argued against this methodology.

Other than that, the Support Vector Machine produces a good accuracy of 86% and supports feature learning and parameter optimization. However, it requires large datasets and is time-consuming. Therefore, the Naïve Bayes methodology is used here, as it has been shown to produce a good accuracy of 83.06% and to deliver results quickly with high precision and recall.

III. DESIGN AND IMPLEMENTATION

Fig. 1 shows the general flowchart of social network analysis. After gathering data from Reddit or Twitter using specific keywords, we need to perform data pre-processing. The cleaning process includes making the text all lower case, removing punctuation, removing numerical values, and tokenizing the text. The raw data is cleaned to ensure that it makes sense before going into the tokenization process. Finally, the data can be classified and visualized.

Fig. 1. General Flowchart of Social Network Analysis

A. Software Design

In this research, Twitter and Reddit are the social networking sites analyzed, chosen for their rich data and ease of access. Many people post on these platforms; the number of people who use them per month is approximately 330 million, which further shows that these platforms hold rich information from their vast number of users.

PRAW and Tweepy are Python libraries for accessing the Reddit and Twitter APIs, respectively. Neither library is highly complex to use for obtaining data from the APIs, both are open-source with redistributable software, and both have excellent documentation prepared by their developers, which is why they were selected.

The text classifier used for machine learning follows a supervised machine learning paradigm: the classifier needs to be trained on labelled training data before being applied to the actual classification task. Naïve Bayes is a statistical classifier that can be adapted to sentiment classification problems. The model works with bag-of-words (BoW) feature extraction, which ignores the position of words. Various training sets are available, such as Movie Reviews, Twitter, and others; a Twitter dataset can be used for both Twitter and Reddit.
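
As a rough sketch of this supervised setup, the snippet below trains NLTK's Naïve Bayes classifier on bag-of-words features. It uses the NLTK movie-reviews corpus as a stand-in for the training sets mentioned above; the feature function and the 90/10 split are illustrative choices, not the paper's exact configuration.

    import random
    import nltk
    from nltk.corpus import movie_reviews

    # nltk.download('movie_reviews')  # one-time corpus download

    def bow_features(words):
        # Bag-of-words: record the presence of each word, ignore position
        return {w: True for w in words}

    docs = [(bow_features(movie_reviews.words(fid)), label)
            for label in movie_reviews.categories()
            for fid in movie_reviews.fileids(label)]
    random.shuffle(docs)
    split = int(0.9 * len(docs))
    train_set, test_set = docs[:split], docs[split:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print("accuracy:", nltk.classify.accuracy(classifier, test_set))
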
Natural Language Processing (NLP) is a field of computer science and linguistics that deals with the interaction between computers and human languages. This approach makes efficient use of the publicly available SentiWordNet library, which provides sentiment polarity values for every term occurring in a document. In this library, each term is associated with three numerical scores for objectivity, positivity, and negativity, computed by combining the results produced by a set of classifiers. WordNet, a large lexical database of English, is also publicly available for download free of charge. Some of the major tasks in extracting sentiment with NLP are extracting the part of the sentence that reflects the sentiment and understanding the sentence structure. NLP also offers different tools that help with processing the textual data.
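
SentiWordNet's per-term scores can be read through NLTK's corpus interface. The lookup below is a minimal sketch; the word "good" and the adjective sense selection are illustrative, and a real pipeline would first disambiguate which sense of the word is being used.

    from nltk.corpus import sentiwordnet as swn

    # nltk.download('sentiwordnet'); nltk.download('wordnet')  # one-time downloads

    # Take the first adjective ('a') sense of "good" and read its three scores
    sense = list(swn.senti_synsets('good', 'a'))[0]
    print(sense.pos_score(), sense.neg_score(), sense.obj_score())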

B. Data Gathering

Preparing or collecting raw data is the initial step of any data mining work. The process is very flexible and depends on the specific subject the user is interested in researching. First, raw data is gathered on both platforms by scraping with a specified keyword. In this paper, we used coronavirus or COVID-19 as the keyword, chosen because it was the most searched topic at the time of this research. Figs. 2 and 3 show the trending topics and communities on Twitter and Reddit, respectively.

Fig. 2. Top 10 topics on Twitter

Fig. 3. Top 10 Growing Communities on Reddit

In this research, the process uses web scraping, and an API is the preferred way to collect data in a structured form from a social media platform, since these websites provide APIs for scraping data. First, the user needs a Reddit account and must create an application on the Reddit developer page. A client ID and client secret are then needed to access Reddit's API as a script application. The user also needs a user agent, a unique identifier that helps Reddit determine the source of a network request.
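
A minimal PRAW sketch of this Reddit setup is shown below. The credential placeholders and the choice to search r/all for post titles are illustrative; the actual query and fields depend on the study.

    import csv
    import praw

    # Credentials come from the app created on the Reddit developer page
    reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                         client_secret="YOUR_CLIENT_SECRET",
                         user_agent="sna-research-script by u/your_username")

    # Search r/all for the keyword and keep the post titles
    posts = [[p.title] for p in
             reddit.subreddit("all").search("coronavirus", limit=100)]

    with open("reddit_raw.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(posts)  # store raw data as .csv, as described above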

For Twitter, data is extracted from posted tweets using a Python scraping library called GetOldTweets3, published at https://github.com/Mottl/GetOldTweets3. This library makes use of Tweepy, one of the libraries developed for streaming the Twitter APIs. To extract data from Twitter, an app needs to be created on the Twitter developer website to obtain the keys and tokens. The user then writes the extraction code against the Twitter APIs, and the generated results are converted into .csv files.
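
For the Tweepy route, a hedged sketch of the credential setup and keyword search is given below (Tweepy 3.x API; the placeholders stand in for the keys and tokens from the developer site).

    import csv
    import tweepy

    # Keys and tokens from the app created on the Twitter developer website
    auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Collect English tweets matching the keyword, skipping retweets
    tweets = [[t.full_text] for t in
              tweepy.Cursor(api.search, q="coronavirus -filter:retweets",
                            lang="en", tweet_mode="extended").items(100)]

    with open("twitter_raw.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(tweets)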


The process uses the Natural Language Toolkit (NLTK), a library focused mainly on NLP and deep learning. NLTK is regarded as an excellent tool for working with computational linguistics in Python. The library has a huge number of corpus sources as well as text processing utilities for classification, tokenization, stemming, and other features that help with cleaning the data.

C. Data Cleaning

Data cleaning usually involves removing all special characters, null values, and any other words that do not add value to the analysis. It also deals with duplicated data and other outliers. There are several standard (pre-processing) techniques for cleaning text. The cleaning process includes making the text all lower case, removing punctuation, removing numerical values, removing symbols, tokenizing the text, and others. The raw data is cleaned to ensure that it makes sense before going into the tokenization process, as shown in Fig. 4.

Fig. 4. Data cleaning process
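
The cleaning steps listed above map naturally onto a small NLTK helper. The function below is a plausible sketch of that pipeline, assuming English stop words; the exact ordering and stop-word list in the paper's implementation are not specified.

    import re
    import string
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # nltk.download('punkt'); nltk.download('stopwords')  # one-time downloads
    STOP_WORDS = set(stopwords.words('english'))

    def clean(text):
        text = text.lower()                        # lower-case everything
        text = re.sub(r'\d+', '', text)            # remove numerical values
        text = text.translate(                     # remove punctuation/symbols
            str.maketrans('', '', string.punctuation))
        tokens = word_tokenize(text)               # tokenize
        return [t for t in tokens if t not in STOP_WORDS]

    print(clean("THIS IS AWESOME to help explain to children and teens!"))
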
D. Data Classification and Visualization

After extraction and cleaning, the data is classified using Python machine learning tools, namely TextBlob and Scikit-learn (sklearn). These tools group the extracted text into three sentiment polarities: positive, negative, and neutral. The polarity score is a float in the range -1.0 to 1.0, where anything greater than 0 indicates a positive text and anything below 0 indicates a negative text. The sentiment results obtained from machine learning are then visualized with Matplotlib, a Python 2D plotting library built on the numerical extension NumPy. It provides a platform for embedding plots into applications and makes results easier to understand.
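
A minimal TextBlob sketch of this scoring step is shown below. The thresholding follows the rule stated above (greater than 0 is positive, below 0 is negative); treating exactly 0 as negative mirrors the labels in Table IV, but that tie-breaking choice is our reading rather than something the paper spells out.

    from textblob import TextBlob

    def classify(text):
        s = TextBlob(text).sentiment     # named tuple: (polarity, subjectivity)
        label = "Positive" if s.polarity > 0 else "Negative"
        return s.polarity, s.subjectivity, label

    print(classify("Well done, this is the result of the efforts of @WHO."))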

The flowchart for data classification is shown in Fig. 5: data is extracted from the chosen social media platform and stored in .csv files, then processed and compared against the sentiment database. Both datasets are run through the Naïve Bayes algorithm, which performs the polarity check: it calculates the polarity and sentiment and then displays whether the result is positive or negative.

Fig. 5. Data classification process

IV. RESULTS AND DISCUSSION

A. Datasets and Preprocessing

The text taken from Reddit and Twitter is sourced from Reddit community pages and from users' tweets. Only English-language texts are extracted, and they are stripped of symbols, icons, and everything else that does not help the analysis process. A total of 3000 Reddit posts and 3000 tweets are collected. These data are kept in a CSV file so that if any problems occur during data analysis, the process does not need to be repeated from the start. The original text and the cleaned text are kept together so the difference between them can be seen. Examples of the original data and the cleaned data are shown in Table III.

TABLE III. SAMPLE OF ORIGINAL AND PRE-PROCESSED TEXT

Original Text: "People are going to get it - but if you can avoid it being all at once, it helps tremendously. THIS IS AWESOME to help explain to children and teens, especially. Thank you!!"
Pre-processed Text: ['people', 'going', 'get', 'avoid', 'helps', 'tremendously', 'awesome', 'help', 'explain', 'children', 'teens', 'especially', 'thank']

Original Text: "People just don't understand. Seriously, I'm not even joking. Some people think it's overblown and should be dying down now. People just don't want to obey and stay in their homes. No sense of control. Very selfish and dangerous."
Pre-processed Text: ['people', 'understand', 'seriously', 'even', 'joking', 'people', 'think', 'overblown', 'dying', 'people', 'dont', 'want', 'obey', 'stay', 'homes', 'sense', 'control', 'selfish', 'dangerous']

Original Text: "Unfortunately, we have a lot of ignorant and selfish people on this planet. So, it's probably going to be around for a while."
Pre-processed Text: ['unfortunately', 'lot', 'ignorant', 'selfish', 'people', 'planet', 'probably', 'going', 'around', 'awhile']

As shown in Table III, the pre-processed data differs from the original data because it has run through the data cleaning process. All the words have been lowercased; for example, 'AWESOME' is changed to 'awesome', and words such as 'it', 'is', and 'this' have been removed as stop words. All the words are also tokenized, in contrast to the original text, which was in full sentences.

B. Datasets and Sentiment Polarity

The extraction of tweets and Reddit comments is done using the respective APIs. The sentiment analysis then displays each text's sentiment polarity and its sentiment subjectivity. The subjectivity of the text classifier lies in the range 0.0 to 1.0, where 0.0 classifies the text as objective and 1.0 as very subjective. The results can then be further calculated and visualized in a pie chart; samples are shown in Table IV.

TABLE IV. SAMPLES OF TEXT AND ITS SENTIMENT ANALYSIS

Sample Text: "Well done this is the result of the efforts and the commitment of @WHO and its partners."
Polarity: Sentiment(polarity=0.3, subjectivity=0.27), Positive

Sample Text: "Thanks to the Chinese and WHO @DrTedros for the novel #coronavirus (2019-nCoV). Keep up the good work."
Polarity: Sentiment(polarity=0.0, subjectivity=0.5), Negative

Sample Text: "Everybody I talk to about this in my area is in the "whatever" camp, which infuriates me. Yeah, whatever if you get it, but what about when you visit your parents/grandparents/coworkers, then they get infected, and it spreads further?"
Polarity: Sentiment(polarity=-0.322, subjectivity=0.6), Negative

Sample Text: "Coronavirus medic claims China is lying as "100,000 people are infected" --Maybe these morons should stop eating strange animals...importing them into places they are not normally found!!"
Polarity: Sentiment(polarity=0.0425, subjectivity=0.32), Positive

Sample Text: "Well, put! This is what I've been trying to say to people and am labelled either a "just a flu bro" type or a doomer. The reality is in between! The impact on the medical system will determine the outcome for so many patients. With functioning medical support, this is not good, but manageable. Overwhelm the system, and it becomes a major problem (1918). Trying to mitigate the effects while not causing a mass panic is pragmatic."
Polarity: Sentiment(polarity=0.0, subjectivity=0.0), Negative

From Table IV, we can see that the data is analyzed and classified based on polarity and subjectivity. Polarity is a floating-point number in the range -1 to 1, where values nearing 1 define a positive statement and values nearing -1 a negative statement. Subjectivity defines whether the statement is a fact or an opinion of the person making it. Like polarity, it is a floating-point number, but it ranges from 0 to 1: values nearing 0 indicate that the statement is factual, while values nearing 1 indicate that it is mostly personal opinion.

C. Data Visualization

The results can be analyzed more easily with a pie chart, a simple graph here consisting of two segments, positive and negative, from which readers can immediately see the percentage of data in each class. This shortens the time needed for data analysis and future work. The data visualization is shown in Fig. 6.

Fig. 6. Samples of sentiment analysis in the pie chart: (a) Reddit Sentiment Analysis, (b) Twitter Sentiment Analysis
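
A minimal Matplotlib sketch of such a two-segment pie chart follows; the percentages are placeholders, not the paper's measured values.

    import matplotlib.pyplot as plt

    labels = ["Positive", "Negative"]
    sizes = [43.0, 57.0]   # placeholder percentages, not the reported results

    plt.pie(sizes, labels=labels, autopct="%1.1f%%", startangle=90)
    plt.axis("equal")      # draw the pie as a circle
    plt.title("Reddit Sentiment Analysis")
    plt.show()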


From the pie charts, we can conclude that both platforms show negative sentiment towards the chosen keyword topic. There is an abundance of reasons that could cause this negative sentiment; we can investigate by looking at the most frequently used words, which can be done with the word cloud function. An example of a word cloud is shown in Fig. 7.

Fig. 7. Sample word cloud of Reddit sentiment analysis
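
The paper does not name its word-cloud implementation; one plausible sketch uses the third-party wordcloud package over the cleaned tokens from the cleaning step (cleaned_tokens here is a hypothetical sample list).

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cleaned_tokens = ["coronavirus", "people", "china", "cases"]  # hypothetical sample
    wc = WordCloud(width=800, height=400,
                   background_color="white").generate(" ".join(cleaned_tokens))

    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")        # hide the axes; only the cloud is of interest
    plt.show()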

D. Data Accuracy Measurement

The datasets are labelled as positive and negative. Sentiment accuracy testing is done with the Scikit-learn library's sklearn.metrics.accuracy_score to verify the model's performance. It is a built-in scoring function into which we feed our data and which outputs the appropriate metric. Upon checking, the model produces an accuracy score of 83.6% for Twitter and 77% for Reddit. The performance evaluation is described in Table V.

TABLE V. PERFORMANCE EVALUATION

Class | Precision | Recall | F1-score | Support
Positive | 0.79 | 0.75 | 0.77 | 1389
Negative | 0.73 | 0.77 | 0.75 | 1204

The report produces the main classification metrics on a per-class basis, positive and negative. The precision indicates that 79% of the positive predictions are correct, while 73% of the negative predictions are correct. The recall indicates that the model finds 77% of the negative class across the whole set. The F1-score is the harmonic mean of precision and recall. Lastly, the support shows that there are 1389 and 1204 occurrences of the respective classes in our dataset, a balanced dataset.
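
These numbers correspond to sklearn's standard evaluation helpers. The sketch below shows the calls involved, with tiny illustrative label lists; the real evaluation of course runs over the labelled test split.

    from sklearn.metrics import accuracy_score, classification_report

    # Illustrative gold labels and model predictions
    y_true = ["pos", "neg", "pos", "neg", "pos", "neg"]
    y_pred = ["pos", "neg", "neg", "neg", "pos", "pos"]

    print(accuracy_score(y_true, y_pred))         # overall accuracy
    print(classification_report(y_true, y_pred))  # precision, recall, F1, support
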
E. GUI Application

Fig. 8 shows the GUI application. The user first chooses the platform they want to analyze. Next, the user inserts the keyword and the number of data items to analyze. The output is then presented as pie-chart visualizations and a word cloud for the entered keyword.

Fig. 8. GUI for Sentiment Analysis
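
A skeleton of such a Tkinter front end might look as follows. The widget layout is illustrative, and run_analysis is a hypothetical hook into the scraping and classification pipeline described earlier.

    import tkinter as tk

    def run_analysis(platform, keyword, count):
        # Hypothetical hook: scrape `count` items for `keyword` from
        # `platform`, classify them, and show the pie chart / word cloud.
        print(platform, keyword, count)

    root = tk.Tk()
    root.title("Sentiment Analysis")

    platform = tk.StringVar(value="Twitter")
    tk.OptionMenu(root, platform, "Twitter", "Reddit").pack()  # choose platform

    keyword = tk.Entry(root); keyword.pack()                   # keyword to search
    count = tk.Entry(root); count.pack()                       # number of items

    tk.Button(root, text="Analyze",
              command=lambda: run_analysis(platform.get(),
                                           keyword.get(),
                                           int(count.get()))).pack()
    root.mainloop()
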
V. CONCLUSIONS AND FUTURE WORKS

In this paper, the PRAW and GetOldTweets3 libraries have been used to gather and clean the data, and a sentiment analysis platform was developed using Tkinter. The platform is backed by the Naïve Bayes algorithm, which produced good accuracies of 83% and 77% for Twitter and Reddit, respectively. The classification report also shows that all tests produce an accuracy above 70%, an acceptable level. This accuracy was achieved by carefully tuning the sentiment analysis process, from data gathering through data classification. Future work includes obtaining other social media data, different visualizations, and different sentiment analyses.

ACKNOWLEDGEMENT

The researchers would like to acknowledge the Malaysian Ministry of Higher Education (MOHE) for financially funding this research through the Fundamental Research Grant Scheme (FRGS) FRGS17-012-0578.

REFERENCES

[1] M. Devika, C. Sunitha, and A. Ganesh, "Sentiment analysis: a comparative study on different approaches," Procedia Computer Science, vol. 87, pp. 44-49, 2016.
[2] S. Vohra and J. Teraiya, "A comparative study of sentiment analysis techniques," Journal Jikrce, vol. 2, no. 2, pp. 313-317, 2013.
[3] C. Bhadane, H. Dalal, and H. Doshi, "Sentiment analysis: Measuring opinions," Procedia Computer Science, vol. 45, pp. 808-814, 2015.
[4] O. Serrat, "Social network analysis," in Knowledge Solutions: Springer, 2017, pp. 39-43.
[5] D. M. Boyd and N. B. Ellison, "Social network sites: Definition, history, and scholarship," Journal of Computer-Mediated Communication, vol. 13, no. 1, pp. 210-230, 2007.
[6] J. F. Sánchez-Rada and C. A. Iglesias, "Social context in sentiment analysis: Formal definition, overview of current trends and framework for comparison," Information Fusion, vol. 52, pp. 344-356, 2019.
[7] C. Haythornthwaite, "Social network analysis: An approach and technique for the study of information exchange," Library & Information Science Research, vol. 18, no. 4, pp. 323-342, 1996.
[8] B. Pang and L. Lee, "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts," arXiv preprint cs/0409058, 2004.
[9] W. L. Hamilton, K. Clark, J. Leskovec, and D. Jurafsky, "Inducing domain-specific sentiment lexicons from unlabeled corpora," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, p. 595.
[10] R. Moraes, J. F. Valiati, and W. P. G. Neto, "Document-level sentiment classification: An empirical comparison between SVM and ANN," Expert Systems with Applications, vol. 40, no. 2, pp. 621-633, 2013.
[11] X. Zhou, X. Wan, and J. Xiao, "Attention-based LSTM network for cross-lingual sentiment classification," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 247-256.

