
2021 3rd East Indonesia Conference on Computer and Information Technology (EIConCIT)

April 09-11, 2021, ISTTS Surabaya, Indonesia

Sentiment Analysis on Covid19 Vaccines in Indonesia: From The Perspective of Sinovac and Pfizer
2021 3rd East Indonesia Conference on Computer and Information Technology (EIConCIT) | 978-1-6654-0514-0/20/$31.00 ©2021 IEEE | DOI: 10.1109/EIConCIT50028.2021.9431852

1st Deden Ade Nurdeni
Master of Information Technology
University of Indonesia
Jakarta, Indonesia
deden.ade@ui.ac.id

2nd Indra Budi
Master of Information Technology
University of Indonesia
Jakarta, Indonesia
indra.budi@cs.ui.ac.id

3rd Aris Budi Santoso
Master of Information Technology
University of Indonesia
Jakarta, Indonesia
aris.budi@ui.ac.id

Abstract—The Covid-19 pandemic that hit the world, including Indonesia, has had a significant impact. Casualties, economic downturn, extreme poverty, and major changes in education are still happening today. The presence of the Covid-19 vaccine is a new hope for mankind to end this pandemic situation. The emergence of two types of vaccines in Indonesia, Sinovac and Pfizer, has led to different reactions in Indonesian society. This study performs a sentiment analysis of the two types of vaccines on the Twitter platform. Data from October until November 2020 was crawled and processed to capture citizen opinion. The dataset was split into two parts: a Sinovac dataset and a Pfizer dataset. Both datasets were labeled manually into three classes: positive, negative, and neutral. The results show that, for Sinovac, 77% of tweets are positive, 19% are negative, and 4% are neutral. For Pfizer, the results were 81%, 17%, and 3% for positive, negative, and neutral, respectively. In terms of model performance evaluation, with 10-fold cross-validation, the highest average accuracy in the Sinovac dataset is achieved by the Support Vector Machine with 85% accuracy. Furthermore, the Support Vector Machine classifier has a superior accuracy value of 78% in the Pfizer dataset compared to other classifiers.

Keywords—Covid-19, vaccines, social media, Twitter, text mining, sentiment analysis, machine learning, classification

I. Introduction

The novel coronavirus disease, commonly called COVID-19, has spread all over the world, including Indonesia. The virus transmits rapidly and can cause death. The total number of cases in Indonesia, as noted by WHO, was 803,340 confirmed cases as of 10 January 2021 [1]. The pandemic situation affects all sectors, especially the economic sector, because many regions apply lockdowns. The lockdown has an impact on all human activity.

Many researchers have developed vaccines to overcome this situation. The two most famous vaccines in Indonesia are Sinovac and Pfizer. Sinovac is produced in China, and citizens still doubted the testing. However, Sinovac has been ordered by the Indonesian government and came to Indonesia on 12 December 2020 [2]. Another type of vaccine, Pfizer, is produced in Germany and seems to be more trusted.

Many Indonesian people convey their opinions about vaccines through social media. Internet users in Indonesia in 2019-2020 amounted to 73.7% of Indonesia's population, of which 51.5% is used for activities on social media [3]. Social media is used because it provides compatibility, interactivity, and cost-effectiveness [4]. Social network applications like Twitter have become prominent not only for advertisement but also for sharing ideas and individual opinion-making [5].

Sentiment analysis can be applied to citizens' perceptions on social media by analyzing how people express their opinions on various social media topics. Sentiment analysis is used because, when making decisions, it is common to look for the public view. Therefore, this study performs a sentiment analysis of Indonesian society's opinions on these two types of vaccines in terms of positive, negative, or neutral sentiment. The results can contribute to stakeholders such as vaccine suppliers, hospitals, and decision-makers related to this issue.

Section II discusses several related works, while Section III explains the research methodology. Section IV describes the results and discussion, and the last section presents the conclusion and recommendations.

II. Literature Study

A. Sentiment Analysis

According to Dawot and Ibrahim [6], social networking on the Web 2.0 platform is described as a community of internet-based apps that enables users to connect, collaborate, and exchange ideas, knowledge, views, experiences, and global relationships. Twitter is one of the websites typically used to express people's thoughts on social media. It can be utilized as a data bank for sentiment, to obtain information and capture people's actual emotions [7].

Sentiment analysis is a sub-part of Natural Language Processing (NLP) that focuses on determining the feelings expressed in a text. Sentiment analysis is also known as opinion mining, which is a process of understanding, extracting, and processing textual data automatically to obtain the sentiment information in a sentence [8]. The fundamental concept of sentiment analysis is to detect the polarity of texts, sentences, and messages. The polarity of emotion is split into three: positive, negative, and neutral. According to Lunando and Purwarianti [9], sentiment analysis can be constructed with the following steps:

1) Pre-processing: The goal of pre-processing is to minimize the vocabulary of words contained in the text message. People prefer to use informal rather than formal words in Indonesian social media [9]. Below are the pre-processing tasks used in this research.

978-1-6654-0514-0/21/$31.00 ©2021 IEEE

Authorized licensed use limited to: UNIVERSIDAD POLITECNICA SALESIANA. Downloaded on May 18,2021 at 23:52:39 UTC from IEEE Xplore. Restrictions apply.
a) Case Folding: Case folding changes all characters in the dataset to lowercase [10]. The purpose of case folding is to give all letters the same form.

b) Document Filtering: This process removes words and punctuation marks that are deemed not to affect the sentence if removed [11]. Examples in a tweet are URLs, mentions, and hashtags (#).

c) Tokenization: Tokenization is the process of dividing text originating from a sentence or paragraph into certain parts [12].

d) Stopword Removal: In general, conjunctions do not carry useful information but are inserted for grammatical purposes. The same holds for words such as on, in, to, and that, and for question terms such as who, when, and how [13]. Such words, called stopwords, can be removed before further processing.

e) Normalization: If a word does not follow the proper spelling, this procedure corrects it. An example is the conversion of the word "bgs" into the word "bagus" in the Indonesian language.

f) Stemming: Stemming, which means eliminating the affixes of a word, is the stage of discovering the root word. Stemming algorithms translate a word into a standard morphological representation (known as a stem) [14].

2) Feature Extraction: In this research, we use Part of Speech (POS) tagging to classify the type of each word in the tweet datasets. POS tagging in the Indonesian language assigns every word in a sentence to one of five main parts of speech: adjective (JJ), noun (NN), verb (VB), adverb (ADV), and function words [15].

3) Classification: According to [16], sentiment classification is usually divided into two classes, namely Positive and Negative. However, in this study, we add one class, namely Neutral, to capture the Twitter sentiment. Classification of opinion is essentially a matter of text classification. This research uses several machine learning algorithms: Naive Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF). These algorithms were selected because they have demonstrated high accuracy in multiple text classification tasks [17].

4) Model Evaluation: Accuracy, precision, recall, and F1 score are usually used to evaluate the performance of classification models [10]. To get these evaluation scores, we need to build a confusion matrix to obtain True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). TP indicates that the model correctly predicts positive cases as positive, FP means that it incorrectly predicts negative cases as positive, TN shows that the model correctly predicts negative cases as negative, and FN means that the model misclassifies positive cases as negative. The evaluation measures can be calculated by equations (1), (2), (3), and (4).

Accuracy (a) = (TP + TN) / (TP + TN + FP + FN)    (1)

Precision (p) = TP / (TP + FP)    (2)

Recall (r) = TP / (TP + FN)    (3)

F1 Score = 2pr / (p + r)    (4)

B. Related Study

Studies on sentiment analysis typically mine opinions in social media [18]. Several previous studies of sentiment analysis in Indonesia were reviewed, especially those on Indonesian people's sentiment regarding the pandemic situation that affected life in Indonesia. These studies examine Indonesian opinion regarding the pandemic situation at the beginning of the pandemic [14], mine public opinion about the lockdown policy of the Indonesian government [19], analyse sentiment regarding the use of online transportation [20], and mine the perception of commuter line passengers during the pandemic in Indonesia [21].

Finding out public opinion regarding the pandemic situation in Indonesia is the primary objective of [14]. That research scraped tweet data during March 2020 using the keywords Corona and Covid, which produced 31,003 tweets. Predictive models were derived using the Support Vector Machine, Random Forest, and Naive Bayes algorithms. The Random Forest-based model provides the maximum accuracy of 89%, followed by 87% for Support Vector Machine and 68% for Naive Bayes.

The research in [19] analyses Indonesian public opinion regarding the lockdown policy adopted in response to the pandemic. It shows that the lockdown topic attracted interest, as 79,502 active users formed 133,209 networks from April 2020 to May 2020. In addition, the results show 14.8% positive sentiment, 17.5% negative sentiment, and 67.67% non-categorized words. A weakness of this research is that it does not explain how the tweets were classified into the mentioned sentiments.

Two Twitter datasets from different periods, before and during the pandemic, are analysed in [20] to capture the sentiment of Indonesian online transportation users in both periods. The opinions are categorized as positive, negative, and normal using the Support Vector Machine algorithm. The results show that the accuracy was higher in the normal period than in the pandemic period.

The research in [21] mines public transport users' perception, especially of Commuter Line transport, during the pandemic era. Naive Bayes and Decision Tree algorithms were used to classify 340 cleaned tweets as positive, negative, or neutral. With an accuracy of 73.59%, Naive Bayes outperformed the Decision Tree, and positive sentiment dominated the other two sentiments. The comparison between the four previous studies and the authors' research is shown in Table I.

III. Methodology

This study uses Python as the programming language, together with several text mining libraries that are explained below. There were six main steps in this study: 1) data preparation, including data crawling and labelling, 2) pre-processing, 3) feature selection, 4) modelling, 5) label prediction, and 6) results, including analysing and reporting. These steps can be seen in Fig. 1.
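
As a small worked illustration of equations (1) to (4), the sketch below computes the four measures from confusion-matrix counts. The function name `evaluation_scores` and the counts are hypothetical and chosen only for illustration; they are not taken from the study. The equations are stated for a single positive class, so for the three-class setting used here the counts would presumably be taken per class and then averaged, and the paper does not say which averaging was used.

```python
def evaluation_scores(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0                 # equation (1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0               # equation (2)
    recall = tp / (tp + fn) if (tp + fn) else 0.0                  # equation (3)
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0  # equation (4)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one class, not results from the study.
scores = evaluation_scores(tp=40, fp=10, tn=45, fn=5)
```

With these counts the accuracy is 85/100 and the precision is 40/50; the guards against empty denominators are a design choice of this sketch so that a class with no predictions returns 0 instead of raising an error.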

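
The pre-processing steps a) to f) listed in Section II-A could be chained roughly as in the sketch below. This is not the authors' code: the regular expressions, the function name `preprocess`, and the `stem` parameter are assumptions made for illustration, and the two small dictionaries contain only a few entries that appear in the paper's Tables III to V. Stemming defaults to the identity function so that the sketch runs without third-party libraries.

```python
import re

# Illustrative subsets only; the study's stop word list and dictionary are larger.
STOPWORDS = {"yg", "dg", "dgn", "sdh", "tdk", "amp", "rt"}
NORMALIZATION = {"blm": "belum", "hsl": "hasil", "moga": "semoga"}


def preprocess(tweet, stem=lambda word: word):
    """Return the cleaned token list for one tweet."""
    text = tweet.lower()                                  # a) case folding
    text = re.sub(r"https?://\S+|@\w+|#\w+", " ", text)   # b) drop URLs, mentions, hashtags
    text = re.sub(r"[^a-z\s]", " ", text)                 #    drop digits, punctuation, non-ASCII
    tokens = text.split()                                 # c) tokenization on whitespace
    tokens = [t for t in tokens if len(t) > 1]            #    drop single characters
    tokens = [t for t in tokens if t not in STOPWORDS]    # d) stopword removal
    tokens = [NORMALIZATION.get(t, t) for t in tokens]    # e) normalization
    return [stem(t) for t in tokens]                      # f) stemming (identity by default)


# Hypothetical tweet assembled from words that appear in the paper's examples.
example = preprocess("Sinovac blm membuka laporan hsl yg RT @user https://t.co/abc #vaksin 123")
```

To apply real Indonesian stemming, a stemmer created with Sastrawi's `StemmerFactory().create_stemmer()` could be passed as `stem=stemmer.stem`. Note that this sketch replaces punctuation with a space, whereas the paper's own example joins the words around a removed full stop.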
TABLE I. Comparison Between the Previous Studies

Study           Topic                                  Techniques
[14]            Opinion mining on the pandemic         Support Vector Machine, Random Forest,
                situation                              and Naive Bayes algorithms
[19]            Opinion on lockdown program during     Not explained
                the pandemic
[20]            Sentiment on online transportation     Support Vector Machine algorithm
                before and within the pandemic
[21]            Perception on Commuter Line            Naive Bayes and Decision Tree algorithms
                transport at the pandemic time
Author's Study  Sentiment on Covid19 vaccines          Part of Speech (PoS) tagger for the
                                                       Indonesian language from the NLP-ID
                                                       library, and Naive Bayes, Support
                                                       Vector Machine, and Random Forest
                                                       algorithms

Fig. 1. Research Methodology

A. Data Preparation

The data retrieved for this study is a sample. The query for crawling data on Twitter was “vaksin (covid or corona)” in Bahasa Indonesia. Only three variables were retrieved: “created add”, the date the tweet was posted; “screen names”, the username that posted the tweet; and “text”, the text of the tweet. Search-tweet-api and Tweepy are the Python libraries used for crawling data from November until December 2020.

After retrieving the raw data of 76,708 tweets, duplicates and retweets were removed, and the remainder was filtered by two terms: @vaksin sinovac and @vaksin pfizer. There were 3,242 tweets used in this study, with 1,760 and 1,482 tweets for Sinovac and Pfizer, respectively. Since this research aims to gain information on the positive, negative, or neutral sentiment for each vaccine, the researchers manually labelled some of the tweets for each vaccine to build a classification model. Table II describes the distribution of the tweets.

TABLE II. Tweets Distribution

Vaccine Type    Number of All Tweets    Labeled Tweets
Sinovac         1,760                   661
Pfizer          1,482                   463
Total           3,242                   1,124

B. Pre-Processing

Text pre-processing includes six steps: 1) case folding, 2) document filtering, 3) tokenizing, 4) stop word removal, 5) normalization, and 6) stemming. NLTK and Sastrawi were used in this study for pre-processing.

• Case folding means removing case sensitivity by converting the text to lowercase or uppercase. In this study, lowercase was applied.
• Document filtering removes special characters, mentions, links, hashtags, URLs, non-ASCII characters, numbers, punctuation, whitespace, and single characters.
• Tokenizing means chopping the text into words. The token list is then used for further processing.
• Stop word removal means that all words that seem to be irrelevant or potentially irrelevant are removed. Three stop word sources were applied in this study: first, manually added stop words, as can be seen in Table III; second, the TALA stop words [8]; and last, the NLTK Indonesian corpus.
• Normalization in this research means standardizing words based on the Indonesian Dictionary, or KBBI (Kamus Besar Bahasa Indonesia). KBBI is the official dictionary of the Indonesian language, containing all standard Indonesian words along with their meanings. The researchers built a normalization dictionary consisting of several incomplete words and their normalized forms. Table IV shows examples of entries in the dictionary.
• Stemming means converting words into standard words. In this phase, the researchers used the Sastrawi library [22] to reduce inputs to Indonesian base words.

Table V shows the changes that occur in a tweet with positive sentiment after pre-processing was carried out.

TABLE III. Manual Stopwords

Stopwords
"yg", "dg", "rt", "dgn", "ny", "d", 'klo', 'kalo', 'amp', 'biar', 'bikin', 'bilang', 'gak', 'ga', 'krn', 'nya', 'nih', 'sih', 'si', 'tau', 'tdk', 'tuh', 'utk', 'ya', 'jd', 'jgn', 'sdh', 'aja', 'n', 't', 'nyg', 'hehe', 'pen', 'u', 'nan', 'loh', 'rt', '&amp', 'yah'

TABLE IV. Normalization Dictionary Example

Original    Normalized based on KBBI
hsl         hasil
moga        semoga
narsum      narasumber

TABLE V. Data Pre-processing

Process            Data
Original Tweet     Dr Ines Atmosukarto mengingatkan jika sampai saat ini Sinovac blm
                   membuka laporan hasil sementara (interim result) dari vaksin yg
                   diproduksinya. Berbeda dgn Pfizer & Moderna yg sdh melakukannya.Knp
                   kita tdk gunakan vaksin yg sdh punya interim result?
                   https://t.co/CVRwE21OKn
Case Folding       dr ines atmosukarto mengingatkan jika sampai saat ini sinovac blm
                   membuka laporan hasil sementara (interim result) dari vaksin yg
                   diproduksinya. berbeda dgn pfizer & moderna yg sdh melakukannya.knp
                   kita tdk gunakan vaksin yg sdh punya interim result?
                   https://t.co/cvrwe21okn

TABLE V. Data Pre-processing (continued)

Process            Data
Document           dr ines atmosukarto mengingatkan jika sampai saat ini sinovac blm
Filtering          membuka laporan hasil sementara interim result dari vaksin yg
                   diproduksinya berbeda dgn pfizer amp moderna yg sdh melakukannyaknp
                   kita tdk gunakan vaksin yg sdh punya interim result
Tokenizing         'dr', 'ines', 'atmosukarto', 'mengingatkan', 'jika', 'sampai', 'saat',
                   'ini', 'sinovac', 'blm', 'membuka', 'laporan', 'hasil', 'sementara',
                   'interim', 'result', 'dari', 'vaksin', 'yg', 'diproduksinya',
                   'berbeda', 'dgn', 'pfizer', 'amp', 'moderna', 'yg', 'sdh',
                   'melakukannyaknp', 'kita', 'tdk', 'gunakan', 'vaksin', 'yg', 'sdh',
                   'punya', 'interim', 'result'
Stopword           'dr', 'ines', 'atmosukarto', 'sinovac', 'blm', 'membuka', 'laporan',
Removal            'hasil', 'interim', 'result', 'vaksin', 'diproduksinya', 'berbeda',
                   'pfizer', 'moderna', 'melakukannyaknp', 'vaksin', 'interim', 'result'
Normalization      'dr', 'ines', 'atmosukarto', 'sinovac', 'belum', 'membuka', 'laporan',
                   'hasil', 'interim', 'result', 'vaksin', 'diproduksinya', 'berbeda',
                   'pfizer', 'moderna', 'melakukannyaknp', 'vaksin', 'interim', 'result'
Stemming           'dr', 'ines', 'atmosukarto', 'sinovac', 'belum', 'buka', 'lapor',
                   'hasil', 'interim', 'result', 'vaksin', 'produksi', 'beda', 'pfizer',
                   'moderna', 'melakukannyaknp', 'vaksin', 'interim', 'result'

C. Feature Selection

After the pre-processing stage, the data was processed to select the feature sets. The Indonesian-language PoS (Part of Speech) tagger from the NLP-ID library [23] was applied to differentiate word types such as noun, verb, adjective, correlative conjunction, and other types of words. For this study, we use four types of words for the feature sets: adjective, adverb, noun, and verb.

Examples of feature words are shown in Table VI. ‘vaksin’, ‘sinovac’, ‘dosis’, and ‘masyarakat’ are examples of nouns; ‘uji’, ‘lapor’, ‘sedia’, and ‘bekerjasama’ are examples of verbs; and ‘optimis’, ‘khawatir’, ‘rawan’, ‘jelas’, and ‘efektif’ are examples of adjectives. The frequency of appearance of these words in the documents was calculated. Afterward, the 8000 most frequent terms were chosen as feature words.

TABLE VI. Type of Words for Feature Sets

Type of Word      Words
Noun (NN)         ‘vaksin’, ‘sinovac’, ‘dosis’, ‘masyarakat’
Verb (VB)         ‘uji’, ‘lapor’, ‘sedia’, ‘bekerjasama’
Adjective (JJ)    ‘optimis’, ‘khawatir’, ‘rawan’, ‘jelas’, ‘efektif’
Adverb (ADV)      ‘depan’, ‘sangat’, ‘hanya’, ‘boleh’

D. Modeling

To build the classifier model, k-fold cross-validation is used on the dataset, so that model evaluation is repeated k times. The three models (Naive Bayes, Support Vector Machine, and Random Forest) were trained on the training dataset and then applied to the testing dataset to calculate the accuracy. The accuracy of all models is shown in the next section.

E. Label Prediction

Only 661 and 463 tweets from the Sinovac and Pfizer datasets, respectively, were manually labelled. Next, the model selected in the previous stage was used to predict the labels of the remaining tweets. The prediction is shown in the next section.

IV. Result and Discussion

This section shows the accuracy results of the sentiment analysis using three different algorithms on two different datasets. Table VII shows the accuracy, precision, recall, and F1 score of the NB classifier on the Sinovac and Pfizer datasets. Table VIII and Table IX show the evaluation scores of the SVM and RF classifiers, respectively. The testing was conducted on the labelled tweets in the two datasets, Sinovac and Pfizer, with various k-fold values (5, 10, 15, and 20), as can be seen in Tables VII to IX. In both datasets, the k value affects the accuracy of each classifier to some degree, and higher k values tend to give higher accuracy. This is because a small k allows the modelling process to be influenced by data errors [24].

In the Sinovac dataset, k=20 was chosen as the best k value for the NB classifier since it gives a stable and best accuracy of 0.76399, or 0.76 (76%), with 0.76 (76%) precision, 0.76 (76%) recall, and 0.70 (70%) F1 score. The SVM classifier gives the best performance at k=10 because it has the highest accuracy there. For the RF model, the best k value is k=15 because it produces 0.81 (81%) accuracy, 0.80 (80%) precision, 0.81 (81%) recall, and 0.78 (78%) F1 score, which is on average higher than for the other k values.

TABLE VII. Score Evaluation Naive Bayes

          NB - Sinovac                             NB - Pfizer
k-Fold    Accuracy  Precision  Recall  F1 Score    Accuracy  Precision  Recall  F1 Score
5         0.76      0.77       0.76    0.69        0.68      0.69       0.68    0.58
10        0.76      0.76       0.76    0.70        0.69      0.67       0.69    0.60
15        0.76      0.74       0.76    0.70        0.69      0.63       0.69    0.59
20        0.76      0.76       0.76    0.70        0.69      0.61       0.69    0.59

TABLE VIII. Score Evaluation Support Vector Machine

          SVM - Sinovac                            SVM - Pfizer
k-Fold    Accuracy  Precision  Recall  F1 Score    Accuracy  Precision  Recall  F1 Score
5         0.84      0.83       0.84    0.82        0.74      0.73       0.74    0.72
10        0.85      0.82       0.85    0.83        0.76      0.76       0.76    0.75
15        0.84      0.82       0.84    0.82        0.78      0.77       0.78    0.76
20        0.84      0.82       0.84    0.82        0.77      0.76       0.77    0.75

TABLE IX. Score Evaluation Random Forest

          RF - Sinovac                             RF - Pfizer
k-Fold    Accuracy  Precision  Recall  F1 Score    Accuracy  Precision  Recall  F1 Score
5         0.80      0.80       0.80    0.77        0.73      0.75       0.73    0.68
10        0.81      0.79       0.81    0.78        0.73      0.75       0.73    0.68
15        0.81      0.80       0.81    0.78        0.74      0.75       0.74    0.69
20        0.81      0.79       0.81    0.78        0.75      0.77       0.75    0.70
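
The scores in Tables VII to IX come from k-fold cross-validation with k = 5, 10, 15, and 20. The paper does not list its evaluation code, so the following is only a minimal sketch of the mechanics: the data is split into k folds, a model is fitted on k-1 folds and scored on the held-out fold, and the k accuracies are averaged. All names (`k_fold_indices`, `cross_validate`, `majority_baseline`) and the toy data are hypothetical, and the baseline model is a stand-in, not one of the study's three classifiers.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size (no shuffling)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(samples, labels, k, train_fn):
    """Return the mean held-out accuracy over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), k):
        held_out = set(test_idx)
        train_x = [samples[i] for i in range(len(samples)) if i not in held_out]
        train_y = [labels[i] for i in range(len(samples)) if i not in held_out]
        predict = train_fn(train_x, train_y)            # fit on the other k-1 folds
        correct = sum(predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))      # score on the held-out fold
    return sum(accuracies) / k


def majority_baseline(train_x, train_y):
    """Stand-in model: always predicts the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda sample: most_common


# Hypothetical toy data: 10 tweets, 8 labelled positive and 2 negative.
tweets = [f"tweet {i}" for i in range(10)]
labels = ["pos"] * 8 + ["neg"] * 2
mean_accuracy = cross_validate(tweets, labels, k=5, train_fn=majority_baseline)
```

In this toy run, four folds contain only positive tweets and one contains only negative tweets, which shows why unshuffled folds can be misleading on ordered data. In practice, a library such as scikit-learn provides `KFold` and `cross_val_score`, which could replace the hand-written helpers here.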

The optimal k value in the Pfizer dataset was decided as follows. The NB model provides a maximum accuracy value of 0.69 (69%) and the highest F1 score of 0.60 (60%) at k=10. At k=15 in the SVM model, all evaluation scores are at their highest, with 0.78 (78%) accuracy, 0.77 (77%) precision, 0.78 (78%) recall, and 0.76 (76%) F1 score. For the RF classifier, the best score is at k=20 because all the evaluation values are higher than for the other k values.

The difference in the accuracy of the three algorithms in each dataset is not significant. Based on the tables above and the best accuracy selection described above, the average accuracy value in the Sinovac dataset is (0.76 + 0.85 + 0.81) / 3 = 0.81 (81%), while for the Pfizer dataset it is (0.69 + 0.78 + 0.75) / 3 = 0.74 (74%). This is consistent with previous research which states that the three algorithms are indeed suitable for text classification [18].

As explained in the previous section, the classifier model was applied to all datasets to label each tweet. In the Sinovac dataset, the SVM classifier has the highest accuracy value when compared to the other two classifiers, namely 0.85 (85%). SVM also gives the best precision, recall, and F1 score values: 0.82 (82%), 0.85 (85%), and 0.83 (83%), respectively. Meanwhile, the NB classifier has the lowest performance, with an accuracy of 76% on the Sinovac dataset. In the Pfizer dataset, SVM is also superior, with 0.78 (78%) accuracy, 0.78 (78%) recall, and 0.76 (76%) F1 score compared to the other models. Therefore, we use the SVM model to predict the labels of the remaining unlabelled data in the Sinovac and Pfizer datasets.

Fig. 2 displays the distribution of positive, negative, and neutral sentiment by percentage in the Sinovac and Pfizer datasets. In general, positive sentiment dominates the perceptions about vaccines. Less than a quarter of the data shows negative sentiment, and not more than 5% is neither positive nor negative (neutral). The SVM model, with the optimal k-fold value of k=10 on the Sinovac dataset and k=15 on the Pfizer dataset, classified 77% and 81% of tweets as positive, 19% and 17% as negative, and 4% and 5% as neutral for the Sinovac and Pfizer datasets, respectively.

Fig. 2. Sentiment Percentage

Given the number of tweets classified (Table II), the Sinovac dataset is larger than the Pfizer dataset, with 1,760 tweets for Sinovac and 1,482 for Pfizer, but Pfizer has a higher percentage of positive sentiment, at 81% compared to 77% for Sinovac. On the other hand, the negative sentiment on Sinovac is slightly greater, at 19% compared to 17% for Pfizer.

V. Conclusion and Recommendations

Based on the results, it can be concluded that both vaccines are perceived positively, with 77% and 81% positive sentiment for Sinovac and Pfizer, respectively. Pfizer seems to receive a more positive opinion from Indonesian people than Sinovac. This might be because Pfizer has announced the highest accuracy for its vaccine. In the experiment on the manually labelled data, the highest accuracy in the Sinovac dataset was 0.85 (85%), from the Support Vector Machine model, and the highest accuracy in the Pfizer dataset was 0.78 (78%), also from the Support Vector Machine model. The accuracy might be improved by inviting an expert in this field to label the data manually. The accuracy could also be increased by re-evaluating the pre-processing techniques, for example by revising the stop word list and creating a more complete normalization dictionary. Another way to increase performance is to apply an oversampling or undersampling technique, such as SMOTE, to handle imbalanced datasets. Additionally, the parameters of each classification model can be tuned to increase classifier performance.

References

[1] WHO, “Indonesia: WHO Coronavirus Disease (COVID-19) Dashboard,” 2021.
[2] Kompas.Com, “Recognizing Sinovac Vaccine That Has Arrived in Indonesia,” 2020.
[3] A. W. Irawan, A. Yusufianto, D. Agustina, and R. Dean, “Internet Survey Report,” 2020.
[4] S. Ainin, F. Parveen, S. Moghavvemi, N. I. Jaafar, and N. L. M. Shuib, “Factors influencing the use of social media by SMEs and its performance outcomes,” Ind. Manag. Data Syst., vol. 115, no. 3, pp. 570-588, Apr. 2015, doi: 10.1108/IMDS-07-2014-0205.
[5] K. Garcia and L. Berton, “Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA,” Appl. Soft Comput., vol. 101, p. 107057, Mar. 2021, doi: 10.1016/j.asoc.2020.107057.
[6] N. I. Md Dawot and R. Ibrahim, “A review of features and functional building blocks of social media,” in 2014 8th Malaysian Software Engineering Conference, MySEC 2014, Dec. 2014, pp. 177-182, doi: 10.1109/MySec.2014.6986010.
[7] M. Vadivukarassi, N. Puviarasan, and P. Aruna, “Sentimental Analysis of Tweets Using Naive Bayes Algorithm,” World Appl. Sci. J., vol. 35, no. 1, pp. 54-59, 2017, doi: 10.5829/idosi.wasj.2017.54.59.
[8] Z. F. Tala, “The impact of stemming on information retrieval in Bahasa Indonesia,” Proc. CLIN, the Netherlands, 2003.
[9] E. Lunando and A. Purwarianti, “Indonesian social media sentiment analysis with sarcasm detection,” in 2013 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2013, 2013, pp. 195-198, doi: 10.1109/ICACSIS.2013.6761575.
[10] H. J. Kaur and R. Kumar, “Sentiment analysis from social media in crisis situations,” in International Conference on Computing, Communication and Automation, ICCCA 2015, Jul. 2015, pp. 251-256, doi: 10.1109/CCAA.2015.7148383.
[11] P. Gamallo and M. Garcia, “Citius: A NaiveBayes Strategy for Sentiment Analysis on English Tweets,” Proc. 8TH Int. Work. Semant. Eval. (SEMEVAL 2014), pp. 171-175.
[12] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to information retrieval, vol. 39. Cambridge University Press Cambridge, 2008.

[13] M. Kanakaraj and R. M. R. Guddeti, “NLP based sentiment analysis on Twitter data using ensemble classifiers,” Aug. 2015, doi: 10.1109/ICSCN.2015.7219856.
[14] F. Binsar and T. Mauritsius, “Mining of Social Media on Covid-19 Big Data Infodemic in Indonesia,” J. Comput. Sci., vol. 16, no. 11, pp. 1598-1609, 2020, doi: 10.3844/JCSSP.2020.1598.1609.
[15] D. Munandar, E. Suryawati, D. Riswantini, A. F. Abka, R. Wijayanti, and A. Arisal, “POS-tagging for non-english tweets: An automatic approach: (Study in Bahasa Indonesia),” in Proceedings - 2017 1st International Conference on Informatics and Computational Sciences, ICICoS 2017, Oct. 2018, vol. 2018-Janua, pp. 219-224, doi: 10.1109/ICICOS.2017.8276365.
[16] B. Liu, “Sentiment analysis and opinion mining,” Synth. Lect. Hum. Lang. Technol., vol. 5, no. 1, pp. 1-184, May 2012, doi: 10.2200/S00416ED1V01Y201204HLT016.
[17] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment Classification using Machine Learning Techniques,” May 2002.
[18] D. Ramachandran and R. Parvathi, “Analysis of Twitter Specific Preprocessing Technique for Tweets,” in Procedia Computer Science, Jan. 2019, vol. 165, pp. 245-251, doi: 10.1016/j.procs.2020.01.083.
[19] C. Suratnoaji, Nurhadi, and I. D. Arianto, “Public opinion on lockdown (PSBB) policy in overcoming covid-19 pandemic in indonesia: Analysis based on big data twitter,” Asian J. Public Opin. Res., vol. 8, no. 3, pp. 393-406, 2020, doi: 10.15206/ajpor.2020.8.3.393.
[20] J. H. Jaman, R. Abdulrohman, A. Suharso, N. Sulistiowati, and I. P. Dewi, “Sentiment analysis on utilizing online transportation of indonesian customers using tweets in the normal era and the pandemic covid-19 era with support vector machine,” Adv. Sci. Technol. Eng. Syst., vol. 5, no. 5, pp. 389-394, 2020, doi: 10.25046/AJ050549.
[21] I. C. Sari and Y. Ruldeviyani, “Sentiment Analysis of the Covid-19 Virus Infection in Indonesian Public Transportation on Twitter Data: A Case Study of Commuter Line Passengers,” in 2020 International Workshop on Big Data and Information Security, IWBIS 2020, 2020, pp. 23-28, doi: 10.1109/IWBIS50925.2020.9255531.
[22] Pypi.Org, “Sastrawi PyPI,” 2021.
[23] Pypi.Org, “NLP-ID,” 2021.
[24] S. K. Lidya, O. S. Sitompul, and S. Efendi, “Sentiment Analysis on Indonesian Text Using Support Vector Machine (SVM) and K-Nearest Neighbor (K-NN),” Seminar Nasional Teknologi Informasi dan Komunikasi. 2015.
