Tri Okta Priasni - 92219042 - Proposal

MASTER THESIS PROPOSAL
Tri Okta Priasni (92219042)

Digital Tourism
Analysis of Opinion Mining Using Support Vector Machine, Naive Bayes, and Decision Tree Algorithms (Case Study: Ancol’s Tourist Attraction)
Abstract: Sentiment Analysis, also known as Opinion Mining, is one of the popular areas of study in Natural Language Processing. This study classifies sentiments into two classes, positive and negative. Opinion mining is widely used in business marketing, political analysis, and feedback review. In this research, opinion mining is used to analyze, review, and classify customer feedback on Ancol’s tourist attraction. However, different algorithms give different classification results for the same opinions. Thus, this research compares three algorithms, Support Vector Machine, Naive Bayes, and Decision Tree, to see which one provides the highest accuracy. The phases of Sentiment Analysis are: 1) Collecting Data; 2) Pre-processing; 3) Training; 4) Testing and Classification; 5) Accuracy Measurement; 6) Conclusion and Review. The dataset is collected from Twitter using the Twitter Crawling library in RapidMiner, for a total of 1670 opinions regarding Ancol’s tourist attraction. The pre-processing phase includes tokenization, stop word removal, and stemming. The training and testing phases classify the opinions into positive and negative sentiments. The accuracy measurement phase compares the three algorithms using confusion matrix calculations (precision, recall, and F-measure), which provide the final scores of the three classification algorithms.
Keywords: Decision Tree, Naive Bayes, Opinion Mining, Sentiment Analysis, Support Vector Machine
1. Introduction
1.1. Background
The growth of the digital era has had a major influence on patterns of daily activity [1]. Easy access to information via the internet with everyday gadgets such as smartphones and computers allows people to get information faster. Social media trends have developed along with this digital growth. Internet users can hold discussions online and share their posts via social media. According to [2], social media users tend to discuss topics that are trending among other social media users. This shows that social media can have a considerable influence on the creation of trends in society.
According to the kominfo.go.id page [3], Indonesia has a total of 63 million registered internet users. Of these, 19.5 million are Twitter users, and Indonesia ranks fifth in the world by number of Twitter users. Twitter is one of the social networks most used by Indonesians. Twitter users can express themselves through short posts (tweets), originally limited to 140 characters. Twitter was founded in March 2006 by Jack Dorsey and launched in July of the same year. The concept of Twitter was initiated as a micro-blogging forum for internet users, where users can promote their businesses, discuss current hot topics, build brand reputation, and promote other people's networks [4]. The implementation of this concept relies on a system of friends (followers) who can promote each other from one friend to another.
The transition of activities to the digital era has led to a lifestyle that depends on the internet, in this case social media. One real example is marketing, which has started to take advantage of social media. Tourism is one of the industries that has gained many benefits from social media marketing. Tourism businesses have started to use social media as a marketing strategy to advertise their products. According to [5], digital marketing plays a critical role in the success of tourism businesses. By reaching out to and engaging with new audiences every day, it can transform the way businesses reach users. Apart from serving as a marketing medium, social media such as Twitter can be used to monitor feedback from people who have visited a tourist destination [6].
Twitter stores a large amount of data in which its users express opinions about an object. Text mining is a method that can be used to process unstructured data such as text uploaded on Twitter [7]. Text containing user opinions is collected and then grouped into negative and positive reviews; this method is known as Sentiment Analysis, or Opinion Mining. The main objective of opinion mining is to detect the sentiment of a text, through the computational study of the feelings and subjectivity it expresses [8]. Opinion mining can make it easier for the tourism industry to classify the opinions of visitors to its tourist attractions, for example in order to address the problems raised in opinions classified as negative.
A considerable amount of previous research on Opinion Mining in the tourism industry has been done. A study of opinions about resorts [9] makes a methodological contribution by demonstrating how public opinion can be understood using the opinion mining method. Research on Sentiment Analysis of the tourism industry in Yogyakarta using Lexicon [6] states that the places taken as samples received a positive final sentiment assessment, with the majority of the public satisfied with their experiences at tourist attractions in Yogyakarta. Other Opinion Mining studies, which use Support Vector Machine [10] and Convolutional Neural Network [11], also report high positive values above 80%.
Each Opinion Mining algorithm has its own advantages and disadvantages. Reviews are classified after being identified and separated into positive and negative opinions, taking into account the weight of each word in the review. Thus, each opinion mining algorithm produces a different accuracy score. The higher the accuracy score, the more effective the method is for grouping opinions.
Because accuracy varies between opinion mining methods, this paper focuses on comparing three opinion mining methods, using Twitter users’ posts about the tourist destinations of PT. Pembangunan Jaya Ancol, Tbk. as the object of study. The three opinion mining algorithms whose accuracy will be compared are Support Vector Machine (SVM), Naive Bayes (NB), and Decision Tree (DT).
1.2. Problem Statement
The final classification of the collected reviews determines the measured accuracy of the method used. The variety of Opinion Mining methods raises the question, "Which algorithm produces the best accuracy value?" When three popular opinion mining methods are implemented, a more specific question arises: "Which algorithm is the best among the Support Vector Machine, Naive Bayes, and Decision Tree methods for conducting sentiment analysis of customer satisfaction, using online reviews of Ancol’s tourist attractions as the object of study?"
1.3. Scope of Research
This research has several limitations:
1. Customer reviews are obtained from Twitter and relate to Ancol’s tourist attractions
2. The data amount to 2000 collected Indonesian tweets
3. The classification algorithms are Support Vector Machine, Naive Bayes, and Decision Tree
4. The review data will be processed using the RapidMiner application
5. The Twitter API library is used to filter data collected randomly from Twitter, with specific conditions applied
6. The accuracy score will be measured using the Confusion Matrix (Precision, Recall, and F-Measure)
1.4. Research Objectives
The purpose of this study is to obtain an analytical evaluation by comparing three algorithms using Precision, Recall, and F-measure calculations. The final result will conclude which algorithm is better for sentiment analysis.
2. Literature Review
Sentiment Analysis is a Natural Language Processing method that is now widely used in areas such as awareness of customer products and services, election campaigning, health care monitoring, and social event planning [13]. Sentiment Analysis is also known as Opinion Mining, and can be defined as the art of gathering and studying people's opinions on a particular object, usually expressed on social media such as Twitter, Facebook, and online blogs [14, 15]. The method focuses on classifying opinions as negative or positive, and the result of this classification is used as a decision-making tool. Data consisting of opinions are collected on a certain topic (such as customer feedback) and then processed through several phases, until the final result determines whether the analyzed opinions express approval or disapproval of the topic discussed. For this reason, the method is quite popular as a basis for decision making.
Sentiment Analysis is best known as a two-class classification, in which opinions are labelled “positive” or “negative”. However, several studies have added “neutral” as a third class label [16]. The analysis can be carried out at several levels: document level, sentence level, word/term level, and aspect level [17]. The case study of Ancol’s tourist attractions is one example of how Sentiment Analysis is used to determine customer satisfaction. Data collected from the social media platform Twitter will be analyzed to find out how many visitors are satisfied and how many are dissatisfied. Once the opinions are classified into these two categories, the information is used for decision making by the authorities.
This method is all the more interesting to explore given the large demand for it in decision making, political issues, news analysis, marketing analysis, and so on. However, there are several obstacles to applying Sentiment Analysis. Writing style can make the expression of an opinion ambiguous. For example, the word "unpredictable" can be either positive or negative. In a sentence like "Wow, what an unpredictable story!", the opinion can be classified as positive. However, in a sentence like "She is so unpredictable, I cannot tell whether she is joking or not.", it can be categorized as negative [18]. Classification therefore requires an accurate reading of the context in which the word is used.
To achieve proper accuracy [19], Sentiment Analysis must identify the correct expression of an opinion. Next, it must construct an opinion lexicon (words such as good, excellent, great, well done, etc.). The polarity of an opinion depends heavily on the importance of the aspects connected to the main domain. Consider the use of the conjunction "but" in the sentence "This cake is delicious but the color is bugging me." An opinion like this contains both a positive and a negative part. When analyzing such a sentence, the context of its main domain must be considered; here the main domain is cake. People mostly judge a cake’s quality by its taste, with appearance as a secondary concern. Therefore, the final classification of that sentence should be positive.
2.1. Text Pre-processing
The first step of data collection is to decide on the most suitable data source [20]. Data sources include raw material such as web pages, documents, personal texts, survey responses, etc. Pre-processing can be considered the initial stage of text mining, which turns the text into structured data ready for processing [21]. In text pre-processing, words are gathered as the input units of classification models. This phase includes several stages: tokenizing, stop word removal, and stemming.
2.1.1. Tokenizing
Tokenizing is the text pre-processing stage that breaks textual content into tokens such as words, terms, symbols, or other meaningful elements [22]. This stage operates at the word level, where a tokenizer applies heuristics such as: 1) Punctuation and whitespace are not included in tokens; 2) All contiguous strings of alphabetic characters (like a, I, etc.) and numbers are part of one token; 3) Tokens are separated by whitespace characters (such as a space or line break) or by punctuation characters.
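The three heuristics can be illustrated with a minimal Python sketch. It assumes that a token is any unbroken run of letters and digits and that tokens are lowercased; both choices are illustrative assumptions, and the tokenizer used in RapidMiner may behave differently.

```python
import re

# Assumption: a token is a maximal run of letters and/or digits.
# Punctuation, underscores, and whitespace only act as separators.
TOKEN_PATTERN = re.compile(r"[^\W_]+")

def tokenize(text):
    """Split text into lowercase tokens, dropping punctuation and whitespace."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]

print(tokenize("Wow, Ancol is fun! Visited 2 times."))
```

Lowercasing is not one of the listed heuristics; it is included here only because later stages usually compare tokens case-insensitively.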
2.1.2. Stop Word Removal
Stop words are words commonly used in natural language whose meaning contributes little semantic value to the context of a sentence [23]. To simplify text pre-processing, these words are removed. Stop words can be grouped into word lists [24] that contain: 1) Determiners (such as a, an, another); 2) Coordinating conjunctions (such as for, and, nor, but, etc.); 3) Prepositions (such as in, under, towards, before). These examples apply to English; stop words differ according to the language used. A library containing a collection of stop words already exists for Indonesian. One example is the list proposed by Tala in 2003 [25], which is now widely used as research material for text mining in Indonesian.
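As a rough illustration of this stage, the sketch below filters a token list against a stop word set. The six-word set is a hypothetical stand-in chosen for the example; it is not the Tala list [25], which is far larger.

```python
# Hypothetical, deliberately tiny stop word set for illustration only;
# it is NOT the Tala list [25].
STOP_WORDS = {"yang", "di", "ke", "dan", "ini", "itu"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, keeping their order."""
    return [token for token in tokens if token not in stop_words]

print(remove_stop_words(["wahana", "di", "ancol", "ini", "seru"]))
```

Using a set rather than a list keeps each membership check constant-time, which matters once the stop word collection grows to its real size.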
2.1.3. Stemming
Stemming reduces variant word forms to their base forms, including derivational variants in the context of the application [26]. The algorithm removes affixes such as prefixes and suffixes, and the process differs between languages depending on how words are formed [27].
Based on [28], stemming algorithms include:
• Lovins Stemmer: this algorithm uses a table of 294 endings, 29 conditions, and 35 transformation rules. It removes at most one suffix from a word,
• Snowball: this algorithm consists of only five steps, each of which applies rules and conditions. When a rule's condition is satisfied, the suffix is removed,
• Iterated Lovins Stemmer: this algorithm is an iterated version of the Lovins Stemmer, which stems the word until it no longer changes,
• Null Stemmer: this algorithm performs no stemming and returns the word unchanged.
Indonesian also has a stemmer library called Sastrawi [29], whose source code is published openly on GitHub.
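The difference between removing at most one suffix and stemming repeatedly can be sketched in a few lines of Python. This is a toy under stated assumptions: the ending list and the minimum stem length are hypothetical, the endings are English, and none of the conditions or transformation rules of the real Lovins table are reproduced. It is not the Sastrawi algorithm.

```python
# Hypothetical, tiny ending list for illustration; the real Lovins table
# has 294 endings plus conditions and transformation rules, omitted here.
ENDINGS = ("ations", "ation", "ing", "ed", "s")
MIN_STEM_LENGTH = 3  # assumption: never leave a stem shorter than this

def strip_one_suffix(word):
    """Remove at most one ending, preferring the longest one that matches."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM_LENGTH:
            return word[: -len(ending)]
    return word

def strip_until_stable(word):
    """Iterated variant: repeat the single-suffix step until nothing changes."""
    while True:
        stemmed = strip_one_suffix(word)
        if stemmed == word:
            return word
        word = stemmed

print(strip_one_suffix("bookings"), strip_until_stable("bookings"))
```

The single-pass function stops after one removal, while the iterated function keeps going, which is why the two can return different stems for the same word.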
2.2. TF-IDF Term Weighting
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that shows how relevant keywords are to the specific documents that contain them, so that those documents can be identified or categorized [30]. TF-IDF is used to extract useful features and naturally works on distinct terms. However, a single term can be ambiguous when separate terms are used for indexing: it can carry numerous connotations, and a distinct term can be too broad [31]. According to [32], an improved TF-IDF algorithm improves classification capability. The improved algorithm focuses on raising classification accuracy and disregards computational efficiency in the classification process.
Term Frequency (TF) is the number of occurrences of a specific word in a document, and words with a high TF value are considered important in that document. Document Frequency (DF) is the number of documents in the collection in which a specific word appears; words with a high DF value are not considered important, so IDF (the inverse of DF) is used as a measure of the importance of a word across all documents. The higher the IDF value, the rarer the word is across the documents, and hence the more important the word [33].
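A small worked example may make the weighting concrete. The sketch below assumes one common variant, tf = raw count of the term in the document and idf = log(N / df) with N the number of documents; other variants (smoothed or normalized) exist, and the one used by RapidMiner is not specified here. The three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = raw count in the document, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in term_freq.items()}
        )
    return weights

# Invented toy documents, already tokenized.
docs = [["ancol", "seru"], ["ancol", "mahal"], ["ancol", "seru", "seru"]]
for row in tf_idf(docs):
    print(row)
```

Because "ancol" appears in every document, its IDF is log(3/3) = 0 and it receives zero weight, while the rarer "mahal" receives the highest per-occurrence weight.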
2.3. Support Vector Machine
Support Vector Machine (SVM) is a classifier defined by a separating hyperplane, which assigns labelled training data to output categories [34]. The steps of this method are stated in [35] as follows:
1. For a binary classifier, y denotes the class labels and x the features, with parameters w (the normal to the hyperplane) and b (the bias), as stated in formula (1).
$$f(x) = w^{T} x + b \tag{1}$$
2. SVM is then represented by a separating hyperplane f(x) that geometrically bisects the data space into two distinct regions, thereby classifying the input data space into two categories, as explained in Fig 2.
3. The function f(x) denotes the hyperplane that classifies the data set, and the two regions created by the hyperplane correspond to the two categories of data under the two class labels.
4. Let the class labels assigned to the data vectors for supervised classification be denoted by $y_i$, which is +1 for one category of data vectors and -1 for the other category, as stated in formula (2).
$$w^{T} x_i + b \geq +1 \ \text{ for } y_i = +1, \qquad w^{T} x_i + b \leq -1 \ \text{ for } y_i = -1 \tag{2}$$
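Formula (1) and the two constraints of formula (2) can be checked numerically with a short Python sketch. The weight vector and bias below are hypothetical values picked by hand; no training is performed, so this only illustrates how a fixed hyperplane labels points.

```python
def decision_value(w, x, b):
    """Compute f(x) = w^T x + b for plain Python lists."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b

def predict(w, x, b):
    """Label a vector +1 or -1 by the side of the hyperplane it falls on."""
    return 1 if decision_value(w, x, b) >= 0 else -1

def satisfies_margin(w, x, y, b):
    """Check formula (2), written compactly as y * (w^T x + b) >= 1."""
    return y * decision_value(w, x, b) >= 1

# Hypothetical weights and bias, chosen by hand rather than learned.
w, b = [2.0, -1.0], -0.5
print(predict(w, [1.0, 0.0], b), satisfies_margin(w, [1.0, 0.0], 1, b))
```

A point can be classified correctly and still violate the margin, for example x = [0.5, 0.0], where f(x) = 0.5 is positive but smaller than 1.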
2.4. Naïve Bayes Classifier
Naïve Bayes is a method that predicts the probability that a given set of words belongs to a particular class, and it is well known for its simplicity in both the training and classification steps [36]. This classifier [37] chooses the most probable class $V_{nb}$ given the attributes $a_1, a_2, a_3, \ldots, a_n$, as stated in formulas (3) and (4) below:
$$V_{nb} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j) \tag{3}$$
The estimate of $P(a_i \mid v_j)$ uses the m-estimate:
$$P(a_i \mid v_j) = \frac{n_c + m p}{n + m} \tag{4}$$
Explanation:
$n$ : the number of training examples for which $v = v_j$
$n_c$ : the number of examples for which $v = v_j$ and $a = a_i$
$p$ : a priori estimate for $P(a_i \mid v_j)$
$m$ : an equivalent sample size
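The two formulas can be sketched for word-based text classification as follows. Several choices here are assumptions made for the example and are not fixed by the text above: each word position is treated as one attribute, so n counts the word positions seen in a class and n_c counts the occurrences of a given word in that class; m is set to the vocabulary size and p to 1/m (which reduces to add-one smoothing); and products are computed as sums of logarithms to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tokens, label). Returns the counts needed later."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocabulary = {token for tokens, _ in examples for token in tokens}
    return class_counts, word_counts, vocabulary

def classify(tokens, class_counts, word_counts, vocabulary):
    """Return argmax_v P(v) * prod_i P(a_i | v), computed in log space."""
    total = sum(class_counts.values())
    m = len(vocabulary)  # assumption: equivalent sample size = vocabulary size
    p = 1 / m            # assumption: uniform prior estimate
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        n = sum(word_counts[label].values())  # word positions in this class
        score = math.log(count / total)
        for token in tokens:
            n_c = word_counts[label][token]
            score += math.log((n_c + m * p) / (n + m))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data.
training = [
    (["seru", "bagus"], "positive"),
    (["bagus", "bersih"], "positive"),
    (["mahal", "kotor"], "negative"),
    (["kotor", "macet"], "negative"),
]
model = train(training)
print(classify(["bagus", "seru"], *model))
```

The m-estimate matters for words never seen in a class: without it, a single unseen word would make the whole product in formula (3) equal to zero.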
2.5. Decision Tree
The Decision Tree (DT) method maps a set of data by applying a divide-and-conquer approach to the problem; the tree is composed of nodes, represented by circles, and branches, represented by segments connecting the nodes [38]. DT is a classifier in which each internal node of the tree represents a condition on a feature of the model, each branch is an outcome of the preceding condition, and each leaf gives the class predicted by the algorithm [39]. Based on [40], Random Forest is a Decision Tree-based algorithm that fits many classification trees to a data set and then combines the predictions of all the trees, where each tree depends on the value of a separately sampled random vector. The vector used to train each tree is obtained by randomly selecting instances; to determine the class, every tree outputs a prediction, and the class with the most votes is selected as the final result.
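How a tree turns conditions into a class, and how several trees vote, can be sketched as below. The tree is written by hand with hypothetical word-presence conditions; no tree induction or random sampling is implemented, so this shows only the prediction side of the method.

```python
from collections import Counter

# A hand-written toy tree, not one learned from data. Internal nodes test
# whether a (hypothetical) word is present; leaves hold the predicted class.
TOY_TREE = {
    "word": "kotor",
    "present": {"label": "negative"},
    "absent": {
        "word": "seru",
        "present": {"label": "positive"},
        "absent": {"label": "negative"},
    },
}

def predict_tree(node, tokens):
    """Follow one branch per internal node until a leaf is reached."""
    while "label" not in node:
        node = node["present"] if node["word"] in tokens else node["absent"]
    return node["label"]

def majority_vote(trees, tokens):
    """Random-forest-style aggregation: the most voted class wins."""
    votes = Counter(predict_tree(tree, tokens) for tree in trees)
    return votes.most_common(1)[0][0]

print(predict_tree(TOY_TREE, ["ancol", "seru"]))
```

In a real forest each tree would be trained on its own random sample of the instances; here the voting step is shown with whatever trees are passed in.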
2.6. Measurement Parameters
A confusion matrix record is needed to keep track of the progress of the classification phase. Confusion matrices are used as the key decision criterion for classification. In [41], taking the weighted average and dropping the values from the initial epoch (k = 0) aims to mitigate the random effect of neural network initialization. In this study, precision, recall, and F-measure will be used for evaluation, as they are considered suitable for classification problems [42]; the underlying confusion matrix is shown in table 2.1.
Table 2.1 Confusion Matrix
                   Predicted positive    Predicted negative
Actual positive    TP                    FN
Actual negative    FP                    TN
Source: Menarianti, 2015
Further explanation:
• True Positive (TP): positive tuple classified as positive class
• False Positive (FP): negative tuple classified as positive class
• True Negative (TN): negative tuple classified as negative class
• False Negative (FN): positive tuple classified as negative class

The table above is used to calculate Accuracy, Precision, Recall, and F-Measure. Accuracy is calculated using formula (5).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
Precision is the number of correctly classified positive samples divided by the total number of samples classified as positive, as stated in formula (6).
$$p = \frac{TP}{TP + FP} \tag{6}$$
Recall is the number of correctly classified positive samples divided by the total number of positive samples in the testing set, as stated in formula (7).
$$r = \frac{TP}{TP + FN} \tag{7}$$
The last is the F-measure, which is the harmonic mean of precision and recall and can be calculated using formula (8).
$$F_m = 2 \cdot \frac{p \cdot r}{p + r} \tag{8}$$
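Formulas (5) to (8) translate directly into code. The sketch below counts the four confusion matrix cells from two label lists and then applies the formulas; the label strings, the choice to return 0.0 when a denominator is zero, and the five example labels are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, FP, TN, FN by comparing actual and predicted labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def safe_div(numerator, denominator):
    """Assumption: report 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0

def evaluate(tp, fp, tn, fn):
    """Apply formulas (5)-(8) to the four confusion matrix cells."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": safe_div(2 * precision * recall, precision + recall),
    }

# Invented labels for five test opinions.
actual = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
print(evaluate(*confusion_counts(actual, predicted)))
```

Running the same two functions on the predictions of each of the three classifiers gives directly comparable scores.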
2.7. Related Research

This research draws on previous research on similar topics, as summarized in table 2.2.
Table 2.2 Related Research
No | Name | Title | Method | Dataset | Result
1 | Philander, K. and Zhong, Y. Y. [9] | Twitter Sentiment Analysis: Capturing Sentiment from Integrated Resort Tweets | Sentiment Lexicon | Twitter | Average score is positive, and the reliability test gave a score of α = 0.68 for the weekly average
2 | Hermanto, D. T., et al. [6] | Twitter Social Media Sentiment Analysis in Tourist Destinations Using Algorithms Naive Bayes Classifier | Naive Bayes Classifier | Twitter | The positive Vmap value is more significant than the negative one
3 | Kuhamanee, T., et al. [10] | Sentiment Analysis of Foreign Tourists to Bangkok Using Data Mining through Online Social Network | Decision Tree, Support Vector Machine, Artificial Neural Network | Twitter | SVM and ANN provide the highest accuracy, and the results mostly consist of positive sentiments
4 | Martín, C. A., et al. [11] | Using Deep Learning to Predict Sentiments: Case Study in Tourism | Convolutional Neural Networks, Long Short-term Memory Networks | Tripadvisor Review | LSTM outperformed CNN
Research by [9] examined the usefulness of sentiment analysis for hospitality operators. Using a sentiment lexicon algorithm on sentiment data retrieved from Twitter, it obtained reasonably effective sentiment scores when compared with Tripadvisor's rankings. Taking a wider range of considerations into account, the sentiment gathered from Twitter cannot represent the entire target market. However, Twitter can be used as a complementary tool to the traditional market surveys conducted by hospitality operators. Studies [6, 10, 11] also analyzed sentiment about tourist destinations. Two of them [6, 10] analyzed Twitter data with classifiers that are also compared in this study: Naive Bayes in [6], and Decision Tree and Support Vector Machine in [10]. In [6], Naive Bayes showed a positive Vmap value, while [10] compared more than one classifier algorithm, and Support Vector Machine, together with Artificial Neural Network, gave the highest accuracy.
3. Research Methodology
3.1. Research Method
The research methodology used in this study consists of eight stages: 1) Problem Identification; 2) Case Study; 3) Data Collecting; 4) Data Processing; 5) Sentiment Training; 6) Testing and Classification; 7) Accuracy Measurement; 8) Conclusion and Review. The stages of the research methodology are shown in Figure 3.1.
Figure 3.1 Research Methodology
The following is a further explanation of the stages of the research methodology:
1. Data on sentiment in the tourism sector are collected by crawling the social media platform Twitter. The data will consist of 1670 opinions regarding Ancol’s tourist attraction. The tool used to collect and process the data is RapidMiner Studio. Data will be collected using the library called “twitter crawling”, in the form of tweets in HTML format. The HTML output will then be converted into CSV format so that it can be read by a database system.
2. The data processing stage starts with document filtering and tokenization, with the aim of filtering the sentiment data. Stop words are then removed, which reduces the meaningless words in the tweets. The last step is stemming, in which words are reduced to their base forms based on the data dictionary used.
3. The sentiment training stage processes data that have been divided into training data and testing data. Using the text mining classification algorithms available in RapidMiner Studio, the data in the training group are classified into two categories of sentiment, positive or negative. The classification algorithms used in this study are Support Vector Machine, Naive Bayes, and Decision Tree. These three algorithms were selected because of their good accuracy results [43, 44]. This stage produces the trained model, which has gone through the classification process using the three algorithms. The output of this stage is then tested in the data testing stage.
4. The testing and classification stage uses the apply model method, in which the data set is classified based on the pattern of the trained model. This stage produces the classification of the testing data into positive and negative sentiments.
5. The accuracy measurement stage aims to validate and evaluate the performance of the three classification algorithms. Using the Confusion Matrix method, model performance is measured with Precision, Recall, and F-Measure calculations. The final result of this stage is the percentage for each calculation for each of the three classification algorithms.
6. The last stage is drawing conclusions and writing suggestions. Conclusions are drawn from the analysis of the stages that have been carried out, with primary reference to the percentages from the accuracy measurement stage. The suggestions cover lessons learned and future work for research on similar topics.
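The division into training data and testing data mentioned in the training stage can be sketched as a reproducible shuffle followed by a cut. The 30% test fraction and the seed value are assumptions; the proposal does not state the split ratio, and RapidMiner provides its own operators for this step.

```python
import random

def split_train_test(examples, test_fraction=0.3, seed=42):
    """Shuffle labelled examples reproducibly and split them into two lists.

    test_fraction and seed are assumptions; the proposal does not fix them.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Invented placeholder data: ten labelled "tweets".
examples = [(f"tweet {i}", "positive" if i % 2 else "negative") for i in range(10)]
train_set, test_set = split_train_test(examples)
print(len(train_set), len(test_set))
```

Fixing the seed means all three classifiers are trained and tested on exactly the same partitions, so differences in their scores come from the algorithms rather than from the split.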
3.2. Research Schedule

The estimated time for this project is described in table 3.1.
Table 3.1 Time Table
Period covered: October (weeks 3-4), November (weeks 1-4), December (weeks 1-4)
No | Activity
1 | Data Collecting
2 | Data Processing
3 | Sentiment Training
4 | Testing and Classification
5 | Accuracy Measurement
6 | Conclusion and Review
Bibliography
[1] Puschmann, C. and Bastos, M. (2015). How digital are the digital humanities? An
analysis of two scholarly blogging platforms. PLoS ONE, 10 (2), pp.1–15.
[Online]. Available at: doi:10.1371/journal.pone.0115035.
[2] Adelani, D. I. et al. (2020). Estimating community feedback effect on topic
choice in social media with predictive modeling. EPJ Data Science, 9 (1), pp.1–23.
[Online]. Available at: doi:10.1140/epjds/s13688-020-00243-w.
[3] Kominfo (2020). Pengguna Internet di Indonesia 63 Juta Orang. [Online].
Available at: https://kominfo.go.id/index.php/content/detail/3415/kominfo+%3A+pengguna+internet+di+indonesia+63+juta+orang/0/berita_satker
[Accessed on October 9, 2020].
[4] Zukhrufillah, I. (2018). Gejala Media Sosial Twitter Sebagai Media Sosial
Alternatif. Al-I’lam: Jurnal Komunikasi dan Penyiaran Islam, 1 (2), p.102.
[Online]. Available at: doi:10.31764/jail.v1i2.235.
[5] Kaur, G. (2017). The Importance of Digital Marketing in the Tourism Industry.
International Journal of Research-Granthaalayah, 5 (6), p.72. [Online].
Available at: https://doi.org/10.5281/zenodo.815854.
[6] Hermanto, D. T. et al. (2018). Twitter Social Media Sentiment Analysis in
Tourist Destinations Using Algorithms Naive Bayes Classifier. Journal of
Physics: Conference Series, 1140 (1), p.1. [Online]. Available at:
doi:10.1088/1742-6596/1140/1/012037.
[7] Oh, T.-J. and -, A. (2017). New and Fast Emerging Advance Structure of Text
Mining from Unstructured Data. Bonfring International Journal of Industrial
Engineering and Management Science, 7 (2), pp.13–16. [Online]. Available at:
doi:10.9756/bijiems.8325.
[8] Giachanou, A. and Crestani, F. (2016). Like it or not: A survey of Twitter
sentiment analysis methods. ACM Computing Surveys, 49 (2). [Online].
Available at: doi:10.1145/2938640.
[9] Philander, K. and Zhong, Y. Y. (2016). Twitter sentiment analysis: Capturing
sentiment from integrated resort tweets. International Journal of Hospitality
Management, 55 (May), pp.16–24. [Online]. Available at:
doi:10.1016/j.ijhm.2016.02.001.
[10] Kuhamanee, T. et al. (2017). Sentiment analysis of foreign tourists to Bangkok
using data mining through online social network. IEEE 15th International
Conference on Industrial Informatics (INDIN), Emden, 2017, pp.1068–1073.
[Online]. Available at: doi:10.1109/INDIN.2017.8104921.
[11] Martín, C. A. et al. (2018). Using deep learning to predict sentiments: Case study
in tourism. Complexity, 2018. [Online]. Available at: doi:10.1155/2018/7408431.
[12] Khan, M. T. et al. (2016). Sentiment analysis and the complex natural language.
Complex Adaptive Systems Modeling, 4 (1), pp.1–19. [Online]. Available at:
doi:10.1186/s40294-016-0016-9.
[13] Singh, J., Singh, G. and Singh, R. (2017). Optimization of sentiment analysis
using machine learning classifiers. Human-centric Computing and Information
Sciences, 7 (1). [Online]. Available at: doi:10.1186/s13673-017-0116-3.
[14] Fang, X. and Zhan, J. (2015). Sentiment analysis using product review data.
Journal of Big Data, 2 (1), pp.1–14. [Online]. Available at:
doi:10.1186/s40537-015-0015-2.
[15] Yi, S. and Liu, X. (2020). Machine learning based customer sentiment analysis
for recommending shoppers, shops based on customers’ review. Complex &
Intelligent Systems, 6 (3), pp.621–634. [Online]. Available at:
doi:10.1007/s40747-020-00155-2.
[16] Parlar, T., Özel, S. A. and Song, F. (2018). QER: a new feature selection method
for sentiment analysis. Human-centric Computing and Information Sciences, 8
(1), pp.1–19. [Online]. Available at: doi:10.1186/s13673-018-0135-8.
[17] Hussein, D. M. E. D. M. (2018). A survey on sentiment analysis challenges.
Journal of King Saud University - Engineering Sciences, 30 (4), pp.330–338.
[Online]. Available at: doi:10.1016/j.jksues.2016.04.002.
[18] Ghosh, M. and Sanyal, G. (2018). An ensemble approach to stabilize the features
for multi-domain sentiment analysis using supervised machine learning. Journal
of Big Data, 5 (1). [Online]. Available at: doi:10.1186/s40537-018-0152-5.
[19] Agarwal, B. et al. (2015). Sentiment analysis using common-sense and context
information. Computational Intelligence and Neuroscience, 2015, pp.1–10.
[Online]. Available at: doi:10.1155/2015/715730.
[20] Kobayashi, V. B. et al. (2018). Text Mining in Organizational Research. SAGE
Journals, 21 (3), pp.733–765. [Online]. Available at:
doi:10.1177/1094428117722619.
[21] Haryanto, D. J., Muflikhah, L. and Fauzi, M. A. (2018). Analisis Sentimen
Review Barang Berbahasa Indonesia Dengan Metode Support Vector Machine
Dan Query Expansion. Jurnal Pengembangan Teknologi Informasi dan Ilmu
Komputer (J-PTIIK) Universitas Brawijaya, 2 (9), pp.2909–2916.
[22] S, V. and R, J. (2016). Text Mining: open Source Tokenization Tools – An
Analysis. Advanced Computational Intelligence: An International Journal
(ACII), 3 (1), pp.37–47. [Online]. Available at: doi:10.5121/acii.2016.3104.
[23] K., J. and R., J. (2016). Stop-Word Removal Algorithm and its Implementation
for Sanskrit Language. International Journal of Computer Applications, 150 (2),
pp.15–17. [Online]. Available at: doi:10.5120/ijca2016911462.
[24] Kaur, J. and Kaur Buttar, P. (2018). A Systematic Review on Stopword Removal
Algorithms. International Journal on Future Revolution in Computer Science &
Communication Engineering, (April), pp.207–210. [Online]. Available at:
http://www.ijfrcsce.org.
[25] Tala, F. Z. (2003). A Study of Stemming Effects on Information Retrieval in
Bahasa Indonesia. M.Sc. Thesis. Master of Logic Project. Institute for Logic,
Language and Computation. Universiteit van Amsterdam, The Netherlands.
[26] Singh, J. and Gupta, V. (2016). Text Stemming: Approaches, Applications, and
Challenges. ACM Computing Surveys, 49 (3), pp.1–46. [Online]. Available at:
doi:10.1145/2975608.
[27] Hidayatullah, A. F., Ratnasari, C. I. and Wisnugroho, S. (2016). Analysis of
Stemming Influence on Indonesian Tweet Classification. Telkomnika
(Telecommunication Computing Electronics and Control), 14 (2), pp.665–673.
[Online]. Available at: doi:10.12928/telkomnika.v14i2.3113.
[28] Bounabi, M., Moutaouakil, K. El and Satori, K. (2017). A comparison of text
classification methods method of weighted terms selected by different stemming
techniques. ACM International Conference Proceeding Series, Part F1294, pp.1–9.
[Online]. Available at: doi:10.1145/3090354.3090398.
[29] GitHub (2016). Sastrawi. [Online]. Available at:
https://github.com/sastrawi/sastrawi/blob/master/tests/SastrawiFunctionalTest/Stemmer/StemmerTest.php#L49
[Accessed on October 11, 2020].
[30] Qaiser, S. and Ali, R. (2018). Text Mining: Use of TF-IDF to Examine the
Relevance of Words to Documents. International Journal of Computer
Applications, 181 (1), pp.25–29. [Online]. Available at:
doi:10.5120/ijca2018917395.
[31] Dalaorao, G. A. and Journal, I. (2020). Applying Modified TF-IDF with
Collocation in Classifying Disaster-Related Tweets. International Journal of
Advanced Trends in Computer Science and Engineering, 9 (1), pp.28–33.
[32] Fan, H. and Qin, Y. (2018). Research on Text Classification Based on Improved
TF-IDF Algorithm. International Conference on Network, Communication
Computer Engineering (NCCE 2018), 147, pp.501–506. [Online]. Available at:
doi:10.2991/ncce-18.2018.79.
[33] Kim, S. W. and Gil, J. M. (2019). Research paper classification systems based on
TF-IDF and LDA schemes. Human-centric Computing and Information Sciences,
9 (1). [Online]. Available at: doi:10.1186/s13673-019-0192-7.
[34] Fatima, S. (2017). Text Document categorization using support vector machine.
International Research Journal of Engineering and Technology (IRJET), 4 (2),
pp.141–147. [Online]. Available at: https://irjet.net/archives/V4/i2/IRJET-
V4I227.pdf.
[35] Al Amrani, Y., Lazaar, M. and El Kadiri, K. E. (2018). Random forest and
support vector machine based hybrid approach to sentiment analysis. Procedia
Computer Science, 127, pp.511–520. [Online]. Available at:
doi:10.1016/j.procs.2018.01.150.
[36] Vadivukarassi, M., Puviarasan, N. and Aruna, P. (2017). Sentimental Analysis of
Tweets Using Naive Bayes Algorithm. World Applied Sciences Journal, 35 (1),
pp.54–59. [Online]. Available at: doi:10.5829/idosi.wasj.2017.54.59.
[37] Rasjid, Z. E. and Setiawan, R. (2017). Performance Comparison and
Optimization of Text Document Classification using k-NN and Naïve Bayes
Classification Techniques. Procedia Computer Science, 116, pp.107–112.
[Online]. Available at: doi:10.1016/j.procs.2017.10.017.
[38] Garrido-Cantos, R. et al. (2013). Low-complexity transcoding algorithm from
H.264/AVC to SVC using data mining. Eurasip Journal on Advances in Signal
Processing, 2013 (1), pp.1–24. [Online]. Available at: doi:10.1186/1687-6180-
2013-82.
[39] Spanos, G., Angelis, L. and Toloudis, D. (2017). Assessment of vulnerability
severity using text mining. ACM International Conference Proceeding Series,
Part F1325. [Online]. Available at: doi:10.1145/3139367.3139390.
[40] Asha Kiranmai, S. and Jaya Laxmi, A. (2018). Data mining for classification of
power quality problems using WEKA and the effect of attributes on classification
accuracy. Protection and Control of Modern Power Systems, 3 (1). [Online].
Available at: doi:10.1186/s41601-018-0103-3.
[41] Ahrens, L., Ahrens, J. and Schotten, H. D. (2019). A machine-learning phase
classification scheme for anomaly detection in signals with periodic
characteristics. Eurasip Journal on Advances in Signal Processing, 2019 (1).
[Online]. Available at: doi:10.1186/s13634-019-0619-3.
[42] Menarianti, I. (2015). Klasifikasi data mining dalam menentukan pemberian
kredit bagi nasabah koperasi. Jurnal Ilmiah Teknosains, 1 (1), pp.1–10. [Online].
Available at: http://e-jurnal.upgrismg.ac.id/index.php/JITEK/article/view/836.
[43] Prananda, A. R. and Thalib, I. (2020). Sentiment Analysis for Customer Review:
Case Study of GO-JEK Expansion. Journal of Information Systems Engineering
and Business Intelligence, 6 (1), pp.1–8. [Online]. Available at:
doi:10.20473/jisebi.6.1.1-8.
[44] Spanos, G., Angelis, L. and Toloudis, D. (2017). Assessment of vulnerability
severity using text mining. ACM International Conference Proceeding Series,
Part F1325. [Online]. Available at: doi:10.1145/3139367.3139390.
