Classification of Twitter Contents using Chi-Square and K-Nearest Neighbour Algorithm

Yunita Dwi Setiyaningrum
Magister Teknik Informatika, Universitas Dian Nuswantoro
Semarang, Indonesia
yunitadwisetiyaningrum@gmail.com

Annisa Fitrianingtiyas Herdajanti
Magister Teknik Informatika, Universitas Dian Nuswantoro
Semarang, Indonesia
annisafh02@gmail.com

Catur Supriyanto
Magister Teknik Informatika, Universitas Dian Nuswantoro
Semarang, Indonesia
catur.supriyanto@dsn.dinus.ac.id

Muljono
Magister Teknik Informatika, Universitas Dian Nuswantoro
Semarang, Indonesia
muljono@dsn.dinus.ac.id

Abstract— A survey conducted by Hootsuite in 2019 found 3.484 billion active social media users, an increase of 15% over 2018. One necessary action is to anticipate the negative things that circulate quickly among the public, such as fake news (hoaxes) and reduced ethics and courtesy, referred to as negative content. Among the various types of social media, such as Facebook, Twitter, and Instagram, Twitter, a social media network service, is used by 52% of social media users. In this study we used a dataset obtained from UCI in 2014 entitled "Twitter Data set for Arabic Sentiment Analysis" [7]. Twitter content is categorized using the Chi-Square algorithm and KNN as the method: the chi-square method is an algorithm for selecting features, and for the classification process the authors use the KNN algorithm. This study concluded that the KNN algorithm with chi-square feature selection obtains its best accuracy, 65.00%, at the nearest neighbor distance K = 5, and that, comparing the methods with and without feature selection on this dataset, the method with feature selection has the advantage of being more efficient in the calculation process.

Keywords— Information Retrieval, Feature Selection, Chi-Square, Classification, K-Nearest Neighbour

I. INTRODUCTION

According to the trend and social media data survey conducted by Hootsuite in 2019, 42% of the world's population are social media users with smartphones, and 3.484 billion social media users are still active. The percentage of active social media users has increased by 15% from 2018 [1]. With this, the development of social knowledge is needed as a basis for information technology development, so that the growth of information technology can also encourage economic growth and improve the competitiveness of a nation. One necessary action is to anticipate the negative things that circulate quickly in the community, such as fake news (hoaxes) and the reduced ethics and courtesy caused by the growth of this information technology.

Among the various types of social media, such as Facebook, Twitter, and Instagram, Twitter, used by 52% of social media users [1], is a social media network service that serves its users with text-based messages of up to 280 characters.

This research aims to classify Twitter content categories using the Chi-Square algorithm and KNN as the method. The chi-square method is an algorithm for selecting features, and for the classification process the authors use the KNN algorithm.

II. RELATED RESEARCH

Shah and Patel conducted a review of feature selection and feature extraction algorithms, including the chi-square method, and describe KNN text classification as the process of assigning documents to categories that have been established in advance, with feature selection and feature extraction serving as a preprocessing stage for text classification [2].

In addition, research conducted by Bayu Yudha Pratama and Riyanarto Sarno classified the text of social media posts to predict personality from text written by Twitter users. Of the three classification algorithms used, the conclusion was that Naïve Bayes is superior to KNN and SVM [3].

Research by S. Brindha and her team compared text classification algorithms using Naïve Bayes, Decision Tree, KNN and SVM. From these results, it was concluded that the percentage obtained by KNN occupies a middle value, not too high, while the largest percentage was obtained by the Decision Tree algorithm [4].

Finally, Bo Tang and his team studied automatic text classification using class-specific features together with conventional features. Merging these two kinds of features has the advantage that many feature selection criteria can be easily implemented [5].
III. STUDY OF LITERATURE

A. Chi-Square

Chi-Square is one of the feature selection algorithms. It is used to reduce the set of features that are not relevant to the classification process. This feature selection uses statistical theory to test the independence of a word from its category. In the algorithm there are two events, the occurrence of the feature and the occurrence of the category, and the resulting score of each word is sorted from the highest value [6]. The chi-square statistic is generally defined as

\chi^2 = \sum \frac{(O - E)^2}{E}    (1)

which, in terms of the document counts of the term-class contingency table, can be written for a term t and a class Z as

\chi^2(t, Z) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}    (2)

Information:
N : the number of documents
A : the number of documents in class Z that contain the word
B : the number of documents outside class Z that contain the word
C : the number of documents in class Z that do not contain the word
D : the number of documents outside class Z that do not contain the word
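To make equation (2) concrete, the following is a minimal Python sketch of the chi-square score for a single term; the document counts in the example are hypothetical and not taken from the paper's dataset.

```python
# Minimal sketch of the chi-square score from equation (2).
# The term/class document counts below are hypothetical examples.

def chi_square(N, A, B, C, D):
    """Chi-square score of a term for a class, given document counts:
    A: documents in the class containing the term
    B: documents outside the class containing the term
    C: documents in the class not containing the term
    D: documents outside the class not containing the term
    N: total number of documents (A + B + C + D)
    """
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator if denominator else 0.0

# Example: a term appearing in 40 of 500 positive documents
# and in 10 of 500 negative documents.
A, B = 40, 10
C, D = 500 - A, 500 - B
print(chi_square(A + B + C + D, A, B, C, D))  # ~18.9
```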
After obtaining the feature selection values, the results of the chi-square calculations above are sorted. The formula of Bayes' theorem is as follows:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}    (3)

After sorting the results of the chi-square calculation, the feature selection takes the largest 50% of the sorted data, which is then represented as the input needed for the classification process.
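As an illustration of this sorting and truncation step, here is a small sketch assuming the chi-square scores have already been computed as a mapping from terms to scores; the terms and scores shown are hypothetical.

```python
# Keep the top 50% of features by chi-square score (hypothetical scores).
scores = {"term_a": 18.9, "term_b": 0.4, "term_c": 7.2, "term_d": 2.1}

ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
selected = ranked[: len(ranked) // 2]                  # largest 50%
print(selected)  # ['term_a', 'term_c']
```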
B. K-NN (K-Nearest Neighbour)

This algorithm is used in many applications in the fields of data mining, image processing, and others. The stages of the KNN algorithm are:
1. Determine the parameter K, the total number of nearest neighbors.
2. Calculate the distance between the testing data and all data contained in the training data, using the Euclidean distance:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (4)

3. Sort the distances, then determine the nearest neighbors based on the minimum distance up to the value K.
4. Determine the category of each of the K nearest neighbors.
5. Use the simple majority category of the nearest neighbors as the predicted value of the new data.
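The five steps above can be written as a compact sketch; the two-dimensional vectors and class labels below are hypothetical placeholders for the TF-IDF document vectors used later in the paper.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    """Steps 1-5 of the K-NN algorithm with Euclidean distance (eq. 4)."""
    # Step 2: distance from the query to every training vector.
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, query)))
             for vec in train]
    # Step 3: sort and keep the K nearest neighbours.
    nearest = sorted(range(len(train)), key=lambda i: dists[i])[:k]
    # Steps 4-5: simple majority vote over the neighbours' categories.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical 2-D "document vectors" and their classes.
train = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.7), (0.1, 0.9)]
labels = ["positive", "positive", "negative", "negative"]
print(knn_predict(train, labels, (0.85, 0.2), k=3))  # -> 'positive'
```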


IV. RESEARCH METHOD

A. Dataset

In this study we used a dataset obtained from UCI (UC Irvine Machine Learning Repository) in 2014 with the title "Twitter Data set for Arabic Sentiment Analysis" [7]. From a study entitled "Arabic Sentiment Analysis: Lexicon-based and Corpus-based", the dataset is divided into two categories: 1,000 positive content items and 1,000 negative content items [8]. From this dataset we use 500 documents for training data and 500 for testing data.

The following is one part of the dataset from the two categories (positive and negative):

TABLE I. PART OF DOCUMENT

Document | Class
ΕϮѧѧϤϟ΍ ϪѧѧϤΣέ αΎѧѧϨϟ΍ ϊѧѧϣ | Positive
ϩΪѧѧѧѧΣϭ ϪѧѧѧѧϴϠϋ ϞѧѧѧѧϛϮΗϭ ͿΎѧѧѧѧΑ Ϧѧѧѧѧυ΃ ϦѧѧѧѧδΣ΍ | Negative

After initializing the data, preprocessing is done, including stop word removal and tokenization. Term weighting (TF-IDF) is then applied.

TABLE II. PRE-PROCESSING DOCUMENT

D1 | Class
ϞѧѧѧѧѧϛϮΗϭ ͿΎѧѧѧѧѧΑ Ϧѧѧѧѧѧυ΃ ϦѧѧѧѧѧδΣ΍ Ρϭ ϪѧѧѧѧѧѧϴϠϋϩΩ | Negative

From the table above, pre-processing produces the term frequencies below:

TABLE III. FREQUENCY RESULT OF DOCUMENT-1

Term | Tf
ϦѧѧѧδΣ΍ | 1
Ϧυ΃ | 1
ͿΎѧѧѧѧѧѧѧѧѧѧΑ | 1
ϮѧѧѧѧΗϭ | 1
Ϟѧѧϛ | 1
ϪѧѧѧѧѧѧϴϠϋ | 1
ϩΪΣϭ | 1

TABLE IV. IDF RESULT OF DOCUMENT-1

Term | Tf | IDF
ϦѧѧѧδΣ΍ | 1 | log(n/Tf)
Ϧυ΃ | 1 | log(n/Tf)
ϞѧѧѧѧѧѧѧϟΎΑϩ | 1 | log(n/Tf)
ϮѧѧѧѧΗϭ | 1 | log(n/Tf)
Ϟѧѧϛ | 1 | log(n/Tf)
ϪѧѧѧѧѧѧϴϠϋ | 1 | log(n/Tf)
ϩΪΣϭ | 1 | log(n/Tf)
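As a rough illustration of the weighting shown in Tables III and IV, here is a minimal sketch computing tf-idf with idf = log(N/df), which the log(n/Tf) column above appears to denote; the English tokens are hypothetical stand-ins for the Arabic terms.

```python
import math
from collections import Counter

# Hypothetical tokenised documents standing in for the Arabic tweets.
docs = [["good", "trust", "god"], ["death", "mercy", "people"]]

N = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """TF-IDF weight of each term in one document: tf * log(N / df)."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for doc in docs:
    print(tf_idf(doc))
```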

B. Methodology

The object of this study is to classify positive and negative types of Twitter content, as explained by the flowchart below:

Fig. 1. Framework of Research Methods

Figure 1 gives a brief outline of the process of this research, which consists of pre-processing, the feature selection process, and the classification process. The dataset of 1,000 documents, divided into the two categories positive and negative, is tokenized and pre-processed (stop word removal, stemming) first. Below is one example of the pre-processing process:

TABLE V. EXAMPLE OF PRE-PROCESSING DOCUMENT

D1 | Class
ϪѧѧϤΣέ αΎѧѧϨϟ΍ ϊѧѧϣ ΕϮѧѧϤϟ΍ | Positive

From the table above, pre-processing produces the following:

TABLE VI. THE RESULT OF PRE-PROCESSING DOCUMENT-1

Term | Frequency
ΕϮѧѧѧϤϟ΍ | 1
ϊϣ | 1
αΎѧѧѧѧѧϨϟ΍ | 1
ϪϤΣέ | 1

After pre-processing as shown above, TF results are obtained for both categories of 1,000 documents. Once the TF results are obtained, the next step is to calculate the feature selection using the chi-square algorithm, which calculates with the formulas of equations (1) and (2) above.

C. Evaluation

This evaluation or testing stage is done to determine the performance of the K-NN algorithm under each method (with feature selection and without feature selection), reported as the accuracy of each method. The evaluation on K-NN uses three parameters: recall, precision and accuracy. The following are the formulas for precision, recall, and accuracy:

\text{Precision} = \frac{TP}{TP + FP}    (5)

\text{Recall} = \frac{TP}{TP + FN}    (6)

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (7)
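Equations (5)-(7) can be computed directly from the confusion-matrix counts, as in this sketch; the counts below are hypothetical and merely chosen so that the accuracy comes out near the paper's headline figure.

```python
# Precision, recall and accuracy from confusion-matrix counts
# (equations (5)-(7)); the counts below are hypothetical.
TP, FP, FN, TN = 320, 180, 170, 330

precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"precision={precision:.2%} recall={recall:.2%} accuracy={accuracy:.2%}")
```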

V. RESULT

A. Feature Selection with the Chi-Square Algorithm

From the calculation process described above, the following feature selection results are obtained:

TABLE VII. THE RESULT OF FEATURE SELECTION

K | Recall | Precision | Accuracy
1 | 63.65% | 62.15% | 63.65%
3 | 64.40% | 63.80% | 64.40%
5 | 65.00% | 64.25% | 65.00%
7 | 63.65% | 62.15% | 63.65%
9 | 63.60% | 62.40% | 63.60%


The sorted results above are then taken according to the needs of the method, with feature selection or without it. When using feature selection, the authors take only the largest 50% of the sorted data, whereas without feature selection all of the available data is used in the classification process.

B. Classification with the K-NN Algorithm
The dataset is divided into two categories, positive and negative, each with 1,000 content items; of these, 1,000 documents are used for training data and 500 for testing data. The K-NN process is run separately for each nearest neighbor distance, and the resulting accuracy is compared:
• with the nearest neighbor distance K = 1
• with the nearest neighbor distance K = 3
• with the nearest neighbor distance K = 5
• with the nearest neighbor distance K = 7
• with the nearest neighbor distance K = 9
The following are the accuracy results of the K-NN algorithm with the nearest neighbor distances (K = 1, 3, 5, 7, 9):
TABLE VIII. RESULT OF K-NN ACCURACY

K | Without Feature Selection (%) | With Feature Selection (%)
1 | 60.80 | 63.65
3 | 63.50 | 64.40
5 | 63.15 | 65.00
7 | 62.90 | 63.65
9 | 61.35 | 63.60

From Table VIII and Figure 2 it can be concluded that the K-NN accuracy with feature selection peaks at the nearest neighbor distance K = 5, reaching 65.00% against 63.15% without feature selection.

Fig. 2. The percentage of the KNN calculation
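For readers who want to reproduce this with/without comparison, the following sketch uses scikit-learn as an assumed toolkit (the paper does not state its implementation); train_texts, train_y, test_texts and test_y are hypothetical placeholders for the prepared dataset.

```python
# Sketch of the with/without chi-square comparison using scikit-learn
# (an assumed toolkit; the paper does not state its implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def evaluate(train_texts, train_y, test_texts, test_y, n_neighbors, select=False):
    vec = TfidfVectorizer()
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    if select:  # keep the top 50% of features by chi-square score
        sel = SelectKBest(chi2, k=X_train.shape[1] // 2)
        X_train = sel.fit_transform(X_train, train_y)
        X_test = sel.transform(X_test)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, metric="euclidean")
    knn.fit(X_train, train_y)
    return accuracy_score(test_y, knn.predict(X_test))

# for k in (1, 3, 5, 7, 9):
#     print(k, evaluate(train_texts, train_y, test_texts, test_y, k, select=True))
```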
VI. CONCLUSION

From Figure 2 above it can be concluded that the KNN algorithm with chi-square feature selection obtains its greatest accuracy, 65.00%, at the nearest neighbor distance K = 5. Compared with the run without feature selection on the same dataset, the method with feature selection also has the advantage of being more efficient in the calculation process, since only half of the features take part in the distance calculations.

REFERENCES

[1] A. D. Riyanto, "Hootsuite (We are Social): Indonesian Digital Report 2019." [Online]. Available: https://andi.link/hootsuite-we-are-social-indonesian-digital-report-2019/. [Accessed: 26-May-2019].

[2] F. P. Shah and V. Patel, "A review on feature selection and feature extraction for text classification," Proc. 2016 IEEE Int. Conf. Wirel. Commun. Signal Process. Networking (WiSPNET 2016), pp. 2264–2268, 2016.

[3] B. Y. Pratama and R. Sarno, "Personality classification based on Twitter text using Naive Bayes, KNN and SVM," Proc. 2015 Int. Conf. Data Softw. Eng. (ICODSE 2015), pp. 170–174, 2016.

[4] A. Rawat and A. Choubey, "A Survey on Classification Techniques in Internet Environment," Int. Conf. Adv. Comput. Commun. Syst., vol. 2, no. 3, pp. 436–443, 2016.

[5] B. Tang, H. He, P. M. Baggenstoss, and S. Kay, "A Bayesian Classification Approach Using Class-Specific Features for Text Categorization," IEEE Trans. Knowl. Data Eng., vol. 28, no. 6, pp. 1602–1606, 2016.

[6] A. Wibowo Haryanto, E. Kholid Mawardi, and Muljono, "Influence of Word Normalization and Chi-Squared Feature Selection on Support Vector Machine (SVM) Text Classification," Proc. 2018 Int. Semin. Appl. Technol. Inf. Commun. (iSemantic 2018), pp. 229–233, 2018.

[7] N. A. Abdulla, "Twitter Data set for Arabic Sentiment Analysis Data Set," UCI Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Twitter+Data+set+for+Arabic+Sentiment+Analysis. [Accessed: 26-May-2019].

[8] N. Al-Twairesh, H. Al-Khalifa, A. Al-Salman, and Y. Al-Ohali, "AraSenTi-Tweet: A Corpus for Arabic Sentiment Analysis of Saudi Tweets," Procedia Comput. Sci., vol. 117, pp. 63–72, 2017.
