Lexicon-Based Features on Naive Bayes
Modification for Classification of Chinese Film
1st S Sunarti 2nd Irawan Dwi Wahyono 3rd Hari Putranto
Department of German Department of Engineering Department of Engineering
Universitas Negeri Malang Universitas Negeri Malang Universitas Negeri Malang
Malang, Indonesia Malang, Indonesia Malang, Indonesia
[Link]@[Link] [Link]@[Link] [Link]@[Link]
4th Djoko Saryono 5th Herri Akhmad Bukhori 6th Tiksno Widyatmoko
Department of Indonesian Department of German Department of German
Universitas Negeri Malang Universitas Negeri Malang Universitas Negeri Malang
Malang, Indonesia Malang, Indonesia Malang, Indonesia
[Link]@[Link] [Link]@[Link] [Link]@[Link]
Abstract—There are many Chinese movies on the internet for learning Chinese, and many of them are on YouTube. These educational films receive both negative and positive comments. To find a good movie for learning Chinese, the positive and negative comment ratings need to be classified so that teachers can decide whether to use a video. In addition, reviewing comments is an evolution of Chinese film ratings. The evaluation in the comments covers storytelling, content, models, visual effects, and more. The reviews contain criticisms and comments that express feelings about a movie used for Chinese language learning, and commentators help learners compare a movie's mood with the positive or negative emotion groups. This research uses Naive Bayes classification with lexicon-based features for sentiment analysis of the comments. The classification process considers the occurrence of emotionally loaded words in a comment and the resulting score for the positive or negative emotion class. Based on the test results, feature selection in the form of stop word removal achieves accuracy, precision, and recall of 0.91, 0.87, and 0.98, respectively.

Keywords—Lexicon Feature; Naive Bayes; Chinese Language

I. INTRODUCTION

Nowadays, almost all learning uses video, whether online or offline. Teachers use online platforms such as YouTube to upload their videos [1-2]. YouTube videos usually receive many comments from viewers, and there are two kinds: positive comments and negative comments. These comments are important for the teacher or lecturer to classify whether a video is good, and videos for Chinese language learning are no exception. A Chinese language learning video on a YouTube channel may be useful or useless for learning [3-5]. Clearly, the teacher or lecturer needs to classify the video based on the viewers' comments.

Previous research on comments applied the Naive Bayes method to Twitter opinion data about several online marketplace sites in Indonesia; the tests found that the resulting accuracy and precision were not significant [5-6]. Other research implemented the Naive Bayes algorithm on film data from social media, and the test results showed an accuracy value still below 93%. In another study, sentiment analysis was performed by combining a rule-based classifier with ensemble features and machine learning methods, and the tests showed that lexicon-based features can play an important role in sentiment analysis. Subsequent research used the Naive Bayes classifier method and a lexicon-based approach for sentiment analysis of hotel reviews [6-8]. From the tests carried out, the average accuracy is below 80%, the average precision is below 90%, and the average f-measure is below 90% [8-9].

However, there are many ways to classify comments in text mining. A text-mining algorithm can improve classification accuracy because comments usually carry the reviewer's emotion, so a comment may be negative or positive [9-13]. All comments are first processed with text mining to discard useless words. Another function of text mining is to produce clean comments that can be used for weighting in classification, so that another algorithm can decide whether a comment is good or not, negative or positive. Clearly, text mining is needed to clean the words before classification with Naïve Bayes in order to improve accuracy, precision, and recall, so that the teacher knows which videos can be used in online learning.

This study uses the Naïve Bayes method and lexicon-based features for comment analysis of Chinese learning videos on YouTube channels. The comment analysis process goes through several stages, starting from text preprocessing, calculating term weighting values, and Naïve Bayes classification with comment weighting based on the Indonesian dictionary. This research aims to classify positive and negative comments in order to determine the quality of a good learning video.

II. METHOD

A. Text Mining

Text mining is a procedure for finding patterns or information that is not initially visible in certain documents or text sources and that will later be used as useful information for certain purposes [13]. Text mining can be applied to problems such as sentiment analysis, document grouping, document classification, information extraction, information retrieval, and web mining.

B. Sentiment Analysis

Sentiment analysis is a field of science that analyzes opinions, sentiments, evaluations, judgments, attitudes, and emotions towards an entity such as products, services, organizations, individuals, problems, events, and topics (Liu, 2012). Sentiment analysis focuses on opinions expressing positive or negative sentiment.

C. Text Preprocessing

Text preprocessing is the initial stage in the text mining process to obtain data that is ready to be processed [1]. The order of text preprocessing is: stemming, tokenizing, and filtering.
The text preprocessing stage produces a collection of words that is used as an index [12-14].

1) Stemming

Stemming changes a word into its base form. Each affix (prefix and suffix) is removed to obtain a base word so that the text mining process is more optimal. In this study, the Python Literary Stemmer library was used.

2) Tokenizing

Tokenizing is the process of breaking a document up into words. In this process, whitespace characters are removed because they do not affect text preprocessing.

3) Filtering

Filtering is the stage of filtering the words produced by tokenizing. This study uses the stoplist method. A stoplist is a collection of unimportant words that the bag-of-words approach can eliminate. The stopword list used is the Tala stopword list with the addition of YouTube emoticon keywords.
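As an illustration, the sketch below applies the three preprocessing steps in the order described above (stemming, tokenizing, filtering). It assumes the Sastrawi package for Indonesian stemming and a small illustrative stoplist; the exact stemmer library, the full Tala stopword list, and the emoticon keywords used in this study are not reproduced here.

# Minimal text-preprocessing sketch: stemming, tokenizing, filtering.
# Assumes the Sastrawi package (pip install Sastrawi) as the Indonesian
# stemmer; the stoplist below is a tiny illustrative subset, not the
# full Tala stopword list with YouTube emoticon keywords.
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

STOPLIST = {"yang", "dan", "di", "ini", "itu"}   # illustrative only

stemmer = StemmerFactory().create_stemmer()

def preprocess(comment: str) -> list[str]:
    stemmed = stemmer.stem(comment.lower())          # stemming: remove affixes
    tokens = re.findall(r"[a-z]+", stemmed)          # tokenizing: split on non-letters
    return [t for t in tokens if t not in STOPLIST]  # filtering: drop stoplist words

print(preprocess("Videonya sangat membantu dan bagus untuk belajar"))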
D. Term Weighting

Term weighting converts the index or preprocessed word features into numeric data by assigning a value or weight to each word. The results of the term weighting process can be used in the classification process. The method used is raw term frequency weighting, in which the weight of a word in a document is its number of occurrences, as shown in Equation (1) [14]:

w_{t,d} = tf_{t,d}    (1)

where w_{t,d} is the weight of word t in document d and tf_{t,d} is the number of occurrences of word t in document d.
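Raw term frequency weighting per Equation (1) is simply a count of each word in a document. A minimal sketch, assuming tokenized input such as the output of the preprocessing step above:

# Raw term frequency weighting (Equation 1): the weight of word t in
# document d is its occurrence count in that document.
from collections import Counter

def term_weights(tokens: list[str]) -> dict[str, int]:
    return dict(Counter(tokens))

print(term_weights(["bagus", "bagus", "jelas", "bagus"]))
# {'bagus': 3, 'jelas': 1}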
E. Classification

Classification assigns data whose class is unknown based on data whose class has been determined previously. The classification process is divided into two stages: learning and classification. In the learning step, a classification model is built using an algorithm that studies data whose class has been determined previously (training data). Because the class is known beforehand, this stage is called supervised learning. The model from the learning step is then used in the classification step to predict data whose class is unknown [15-16].

F. Naïve Bayes Classifier

A naive Bayes classifier is a statistical classification method based on Bayes' theorem that classifies data into predetermined classes [15-16]. It is called 'naive' because the value of one attribute is assumed to have no effect on and not to depend on the value of any other attribute, an assumption called conditional independence. The Naïve Bayes classifier shows high accuracy and speed when applied to large data compared to decision trees and some neural network classification algorithms. Equation (2) shows the general Naive Bayes classifier calculation [15-16]:

P(c|w) = \frac{P(w|c) \cdot P(c)}{P(w)}    (2)

where P(c|w) is the posterior, the probability of class c given word w; P(w|c) is the likelihood, the probability of word w given class c; P(c) is the prior, the probability of occurrence of class c; and P(w) is the evidence, the probability of occurrence of word w.

In Equation (2), the likelihood, prior, and evidence values must be calculated. The likelihood is calculated with the multinomial model, which considers the number of occurrences of each word in the document, as shown in Equation (3) [15-16]:

P(w|c) = \frac{count(w, c)}{count(c)}    (3)

where count(w, c) is the number of occurrences of word w in class c and count(c) is the number of occurrences of all words in class c.

A problem often found with the multinomial model is that a word that never appears in a class yields a probability of zero, called the zero-frequency problem [15-16]. This problem is solved by Laplace smoothing, which adds a value of 1 to the numerator so that every word is considered to occur at least once, and adds the number of unique words to the denominator. The Laplace smoothing calculation can be seen in Equation (4):

P(w|c) = \frac{count(w, c) + 1}{count(c) + |V|}    (4)

where |V| is the number of unique words. The prior is calculated as shown in Equation (5) [15-16]:

P(c) = \frac{N_c}{N}    (5)

where N_c is the number of documents of class c in the training data and N is the total number of documents in the training data. The evidence is calculated as shown in Equation (6) [15-16]:

P(w) = \frac{count(w)}{count(W)}    (6)

where count(w) is the number of occurrences of word w and count(W) is the total number of word occurrences in the entire document collection.
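The sketch below trains a multinomial Naïve Bayes model as in Equations (3)-(6): Laplace-smoothed likelihoods, class priors, and word evidence computed from token counts. The function and variable names are illustrative, not taken from the paper.

# Multinomial Naive Bayes training per Equations (3)-(6):
# likelihood with Laplace smoothing, prior, and evidence.
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs, label in {'positive', 'negative'}."""
    class_docs = Counter()               # N_c: documents per class
    class_words = defaultdict(Counter)   # count(w, c): word counts per class
    all_words = Counter()                # count(w): word counts overall
    for tokens, label in docs:
        class_docs[label] += 1
        class_words[label].update(tokens)
        all_words.update(tokens)

    vocab = set(all_words)               # |V|: unique words
    n_docs = sum(class_docs.values())    # N: total training documents
    total_words = sum(all_words.values())

    prior = {c: class_docs[c] / n_docs for c in class_docs}   # Eq. (5)

    def likelihood(word, c):                                  # Eq. (4)
        return (class_words[c][word] + 1) / (sum(class_words[c].values()) + len(vocab))

    def evidence(word):                                       # Eq. (6)
        return all_words[word] / total_words

    return prior, likelihood, evidence

prior, likelihood, evidence = train_naive_bayes([
    (["bagus", "jelas"], "positive"),
    (["buruk", "membosankan"], "negative"),
])
print(prior["positive"], likelihood("bagus", "positive"))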
G. Lexicon Based Features

Lexicon-based features are features or words that have been weighted based on a dictionary or lexicon. Weighting is done for each word that carries a positive or negative sentiment. The purpose of using lexicon-based features is to determine the sentiment orientation.

The weighting value is the sentiment value, calculated by dividing the sum of the positive or negative values of a word by the total of its positive and negative values. The positive and negative totals are given in Equation (7) and Equation (8) [15-16]:

total_{pos}(w) = \sum pos(w)    (7)

total_{neg}(w) = \sum neg(w)    (8)

The total of the positive and negative values of a word is given in Equation (9):

total(w) = total_{pos}(w) + total_{neg}(w)    (9)

The sentiment value of a word in the positive and negative classes is given in Equation (10) and Equation (11) [15-16]:
score_{pos}(w) = \frac{total_{pos}(w)}{total(w)}    (10)

score_{neg}(w) = \frac{total_{neg}(w)}{total(w)}    (11)

The sentiment value is a real number in the interval 0 to 1. If the sentiment value is close to 1, the word has a more positive sentiment; if it is close to 0, the word has a more negative sentiment. The sentiment value is integrated into the posterior calculation for each positive and negative class, as shown in Equation (12) [15-16]:

P(c|w) = \frac{(P(w|c) \cdot score_c(w)) \cdot P(c)}{P(w)}    (12)

where score_c(w) is the sentiment value of word w for class c.
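A sketch of the lexicon weighting in Equations (7)-(12): sentiment scores are derived from the positive and negative counts of a word in a dictionary and folded into the per-word likelihood when computing the posterior. The tiny lexicon and the function names are illustrative assumptions, not the study's actual Indonesian dictionary.

# Lexicon-based sentiment scores (Equations 7-11) and their use in the
# modified posterior (Equation 12). The lexicon below is illustrative only.
LEXICON = {          # word: (positive count, negative count) in the dictionary
    "bagus": (9, 1),
    "buruk": (1, 8),
    "jelas": (6, 2),
}

def sentiment_score(word, cls):
    pos, neg = LEXICON.get(word, (1, 1))    # unknown words treated as neutral
    total = pos + neg                        # Eq. (9)
    return (pos if cls == "positive" else neg) / total   # Eq. (10) / Eq. (11)

def posterior(tokens, cls, prior, likelihood, evidence):
    # Eq. (12): each word's likelihood is multiplied by its sentiment score.
    score = prior[cls]
    for w in tokens:
        ev = evidence(w) or 1.0              # guard against unseen words
        score *= likelihood(w, cls) * sentiment_score(w, cls) / ev
    return score

Combined with the prior, likelihood, and evidence functions from the Naïve Bayes sketch above, the class with the larger posterior is selected as the prediction.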
H. Evaluation

The evaluation aims to assess whether the results of the sentiment analysis system match the actual results. The evaluation methods used are evaluation of unranked retrieval using the confusion matrix and evaluation of relevance using the kappa measure.

1) Confusion Matrix

The confusion matrix is a table used to analyze how accurately a classification method predicts the class of the data [17]. In this study, the data are classified into positive and negative classes (binary classification). Table 1 shows the confusion matrix.

TABLE I. CONFUSION MATRIX

Class     | Classified as Positive Class | Classified as Negative Class
Positive  | True Positive                | False Negative
Negative  | False Positive               | True Negative

Based on Table 1, accuracy, precision, recall, and f-measure can be calculated [18].

Accuracy is the percentage of test data whose class is correctly classified by the system according to its original class, as shown in Equation (13) [17]:

accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (13)

Precision is the agreement between the data class produced by the system and the actual class, as shown in Equation (14) [18]:

precision = \frac{TP}{TP + FP}    (14)

Recall is the proportion of data of the actual class that the system successfully retrieves, as shown in Equation (15) [18]:

recall = \frac{TP}{TP + FN}    (15)

F-measure is a measure of the reciprocal relationship between precision and recall; it combines precision and recall into a single evaluation measure, as shown in Equation (16) [18]:

F\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall}    (16)
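The four measures in Equations (13)-(16) follow directly from the confusion matrix counts; for example:

# Accuracy, precision, recall, and f-measure (Equations 13-16)
# computed from confusion-matrix counts.
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Example with hypothetical counts for 20 test comments (not the study's data):
print(evaluate(tp=9, tn=8, fp=2, fn=1))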
2) Evaluation of Relevance

Relevance evaluation is carried out to assess whether the documents used as the dataset are relevant, by calculating the kappa measure. The kappa measure is used to measure agreement between raters [19]. The kappa measure is calculated as shown in Equation (17):

\kappa = \frac{P(A) - P(E)}{1 - P(E)}    (17)

where P(A) is the proportion of cases in which the raters agree, and P(E) is the proportion of cases in which the raters are expected to agree by chance.

Kappa equals 1 if the raters always agree, 0 if the raters agree only as often as expected by chance, and is negative if the raters agree less often than chance. A kappa value above 0.8 indicates good agreement, a kappa value between 0.67 and 0.8 indicates moderate agreement, and a kappa value below 0.67 indicates tentative agreement.
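A minimal sketch of the kappa calculation in Equation (17), assuming two raters who each label the same set of documents as positive or negative:

# Cohen's kappa (Equation 17) for two raters with categorical labels.
from collections import Counter

def kappa(rater_a, rater_b):
    n = len(rater_a)
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n         # observed agreement P(A)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)  # chance agreement P(E)
    return (p_a - p_e) / (1 - p_e)

print(kappa(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"]))  # 0.5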
I. Design and Implementation

The process that occurs in the system is divided into two phases: the training phase and the testing phase. Figure 1 shows the stages of the training phase.

Fig. 1. Steps of the training phase

Based on Figure 1, the process in the training phase begins with input in the form of training data. The training data then goes through text preprocessing, in which stemming, tokenizing, and filtering are carried out. The results of text preprocessing are then weighted using the raw term frequency weighting method. The weighted words are then used to train the Naïve Bayes method, namely to calculate the prior, likelihood, and evidence values. These values are used as input in the testing phase.
The stages of the testing phase can be seen in Figure 2.

Fig. 2. Testing phase

Based on Figure 2, the process in the testing phase begins with the input of test data, the Indonesian dictionary, and the prior, likelihood, and evidence values. The initial step is text preprocessing, in which stemming, tokenizing, and filtering are carried out. Next, the lexicon-based features are weighted based on the words available in the Indonesian dictionary by calculating their sentiment values. The Naïve Bayes classification process is then carried out by calculating the posterior value as the output of the testing phase. The highest posterior value determines the class of the test data.
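Putting the pieces together, the testing phase amounts to preprocessing a comment and choosing the class with the highest lexicon-weighted posterior. The sketch below is a self-contained, compressed version of the earlier steps; the model values and lexicon entries are illustrative placeholders, not the study's actual data, and the evidence term is omitted because it is identical for both classes.

# Testing-phase sketch: pick the class with the highest posterior (Equation 12).
# Model values here are illustrative placeholders only.
PRIOR = {"positive": 0.5, "negative": 0.5}
LIKELIHOOD = {("bagus", "positive"): 0.05, ("bagus", "negative"): 0.01}
SENTIMENT = {("bagus", "positive"): 0.9, ("bagus", "negative"): 0.1}

def classify(tokens):
    best_class, best_score = None, -1.0
    for cls in PRIOR:
        score = PRIOR[cls]
        for w in tokens:
            score *= LIKELIHOOD.get((w, cls), 1e-6) * SENTIMENT.get((w, cls), 0.5)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

print(classify(["bagus"]))   # -> 'positive'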
The application was built based on the diagrams in Figure 1 and Figure 2. It is a web-based application, as shown in Figure 3. There are two buttons and two channel options. One button loads the language database, and the other selects a channel on YouTube. The options determine whether the Naïve Bayes algorithm is used and whether the lexicon is used to improve accuracy.

Fig. 3. Web-based application

III. ANALYSIS AND DISCUSSION

The evaluation was carried out in two scenarios. The first scenario was classification using the Naïve Bayes method with lexicon-based feature weighting. The second scenario was classification using the Naïve Bayes method without lexicon-based feature weighting. The test uses 150 data divided into 130 training data, consisting of 65 positive and 65 negative comments, and 20 test data, consisting of 10 positive and 10 negative comments.

The results of testing with weighted lexicon-based features are shown in Table 2.

TABLE II. RESULTS OF TESTING WITH WEIGHTED LEXICON-BASED FEATURES

Item      | Value
Accuracy  | 0.91
Precision | 0.87
Recall    | 0.98
F-Measure | 0.921

The results of testing without weighted lexicon-based features are shown in Table 3.

TABLE III. RESULTS OF TESTING WITHOUT WEIGHTED LEXICON-BASED FEATURES

Item      | Value
Accuracy  | 0.87
Precision | 0.80
Recall    | 0.87
F-Measure | 0.82

Based on Table 2 and Table 3, testing the Naïve Bayes method with lexicon-based features for film opinion sentiment analysis resulted in an accuracy of 0.91, precision of 0.87, recall of 0.98, and f-measure of 0.921. A word can have a positive sentiment value that is much greater than its negative sentiment value in a document whose class is negative, and vice versa; a very large difference between the positive and negative sentiment values can therefore result in misclassification. Meanwhile, the Naïve Bayes method without lexicon-based features yields an accuracy of 0.87, precision of 0.80, recall of 0.87, and f-measure of 0.82, which is worse than the result with lexicon-based features. Thus, the Naïve Bayes method with lexicon-based features produces better accuracy, precision, recall, and f-measure values than the Naïve Bayes method without lexicon-based feature weighting.

IV. CONCLUSION

This study uses the Naïve Bayes method and lexicon-based features for comment analysis of Chinese learning videos on YouTube channels. The comment analysis process goes through several stages, starting from text preprocessing, calculating term weighting values, and Naïve Bayes classification with comment weighting based on the Indonesian dictionary. The test results show that positive and negative comments are classified more accurately using the Naïve Bayes method with lexicon-based features than using Naïve Bayes without them.
REFERENCES
[1] Alkaff, M., Baskara, A. R., & Wicaksono, Y. H. (2020, November). Sentiment Analysis of Indonesian Movie Trailer on YouTube Using Delta TF-IDF and SVM. In 2020 Fifth International Conference on Informatics and Computing (ICIC) (pp. 1-5). IEEE.
[2] Gu, J. (2020, July). Evaluation Model of Classroom Teaching Quality of Chinese as a Foreign Language Based on Deep Learning. In 2020 International Conference on Virtual Reality and Intelligent Systems (ICVRIS) (pp. 903-906). IEEE.
[3] Dai, J., & Qin, T. (2021, June). Blended Teaching with Mobile Applications in Smart Classroom: A New Method for Chinese Language Teaching for International Medical Students in the Post Epidemic Era. In 2021 2nd International Conference on Artificial Intelligence and Education (ICAIE) (pp. 720-723). IEEE.
[4] Lau, K. L. (2021, August). Using E-learning Activities to Support Classical Chinese Learning in the Out-of-class Context. In 2021 International Symposium on Educational Technology (ISET) (pp. 65-68). IEEE.
[5] Wang, Y., Wang, Y., Chen, W., & Lin, Y. (2020, October). Natural Language Classification Algorithm of Comments Based on Bayesian Chasing-Clustering Model. In 2020 19th International Symposium on Distributed Computing and Applications for Business Engineering and Science (DCABES) (pp. 5-11). IEEE.
[6] Setiowati, Y., & Setyorini, F. (2018, November). Service Extraction and Sentiment Analysis to Indicate Hotel Service Quality in Yogyakarta based on User Opinion. In 2018 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI) (pp. 427-432). IEEE.
[7] Yang, F., Zhu, J., Wang, X., Wu, X., Tang, Y., & Luo, L. (2018, May). A Multi-model Fusion Framework based on Deep Learning for Sentiment Classification. In 2018 IEEE 22nd International Conference on Computer Supported Cooperative Work in Design (CSCWD) (pp. 433-437). IEEE.
[8] Ekawijana, A., & Heryono, H. (2016, April). Composite Naive Bayes Clasification and semantic method to enhance sentiment accuracy score. In 2016 4th International Conference on Cyber and IT Service Management (pp. 1-4). IEEE.
[9] Rohman, M. A., Khairani, D., Hulliyah, K., Riswandi, P., & Lakoni, I. (2021, September). Systematic Literature Review on Methods Used in Classification and Fake News Detection in Indonesian. In 2021 9th International Conference on Cyber and IT Service Management (CITSM) (pp. 1-4). IEEE.
[10] Turdjai, A. A., & Mutijarsa, K. (2016, August). Simulation of marketplace customer satisfaction analysis based on machine learning algorithms. In 2016 International Seminar on Application for Technology of Information and Communication (ISemantic) (pp. 157-162). IEEE.
[11] Gusić, J., & Šimić, D. Using Text Mining to Extract Information from Students' Lab Assignments. In 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO) (pp. 591-594). IEEE.
[12] Siddiqui, T., Amer, A. Y. A., & Khan, N. A. (2019, November). Criminal activity detection in social network by text mining: comprehensive analysis. In 2019 4th International Conference on Information Systems and Computer Networks (ISCON) (pp. 224-229). IEEE.
[13] Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2021). Deep learning-based text classification: a comprehensive review. ACM Computing Surveys (CSUR), 54(3), 1-40.
[14] Kovtun, D. (2021, September). Assessment of Congruence of Unstructured Data Using Text Mining Technology. In 2021 IEEE 23rd Conference on Business Informatics (CBI) (Vol. 2, pp. 163-166). IEEE.
[15] Damaratih, D. A. (2021, October). Sentiment Analysis of Online Lecture Opinions on Twitter Social Media Using Naive Bayes Classifier. In 2021 International Conference on Computer Science, Information Technology, and Electrical Engineering (ICOMITEE) (pp. 24-28). IEEE.
[16] Surya, P. P., & Subbulakshmi, B. (2019, March). Sentimental analysis using Naive Bayes classifier. In 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN) (pp. 1-5). IEEE.
[17] Zomahoun, D. E. (2019, November). A Semantic Collaborative Clustering Approach Based on Confusion Matrix. In 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS) (pp. 688-692). IEEE.
[18] Chaudhry, S., & Dhawan, S. (2019, February). A Novel Clustering Based Mechanism For Community Detection Using Artificial Intelligence. In 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon) (pp. 574-579). IEEE.
[19] Hasnain, M., Pasha, M. F., Ghani, I., Imran, M., Alzahrani, M. Y., & Budiarto, R. (2020). Evaluating trust prediction and confusion matrix measures for web services ranking. IEEE Access, 8, 90847-90861.