Abstract: Twitter is one of the most popular social media platforms today. However, Twitter has several problems that negatively affect its users, one of which is spam. Our approach differs from previous research in three respects: it targets Indonesian-language Twitter, it crawls user and tweet data automatically, and it adds new features. We use two feature dimensions, i.e., user-based and tweet-based. In this paper, we detect Indonesian spammers on Twitter using four classification algorithms, namely Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Regression (Logit), and J48. The results show better accuracy than the existing approaches. The highest accuracy, 93.67%, is achieved using Logistic Regression (Logit).

Keywords: spammer detection, Indonesian, Twitter, spam

I. INTRODUCTION
Twitter is a social media site based on microblogging (small blogs), where users send and read tweets (posts on Twitter) of at most 140 characters [1], [2], [3]. Twitter was first launched on July 13, 2006 [4]. The existing tweeting mechanisms on Twitter allow spam to spread and create opportunities for an influx of spam [1], [2]. As opposed to other social networks, a user can post tweets on another user's page without permission, since a user does not have to be one of another user's followers to post tweets on that user's profile. Accounts that send spam are called spammers. Because of this mechanism, spammers can send many malicious messages without permission [5]. Spammers attach tweets to popular topics or hashtags discussed on Twitter and include links that are unrelated to the tweet theme. Other types of spam are tweets that repeat the same content and tweets that mention large numbers of random users [1]. Spammers can send many spam tweets, which cause many problems. In business, for example, spam has a negative effect in three major ways: it contributes to a loss of productivity and profit, it poses legal risks, and it carries various malware threats.1 Because spam is very disturbing, we need a system for detecting spammers.

This paper presents a methodology for detecting Indonesian spammers, crawling user and tweet data automatically, and finding the best new features to identify Twitter spammers in Indonesia. The rest of the paper is organized as follows. Section II discusses issues related to spammer detection techniques. Section III describes the system model approach to detecting Indonesian spammers on Twitter. Section IV provides the experimental results, followed by the conclusion in Section V.

II. RELATED WORKS
In Benevenuto et al. [1], the features used to classify spammers also have two dimensions, i.e., user-based features and tweet-based features. User-based features are features that can be seen directly on the Twitter user's profile. They include the number of followers, the number of followings, and the number of tweets. Tweet-based features are numerical values that can be derived from the tweets issued by the user. They include the number of URLs, the number of mentions, the number of hashtags, and the number of spam words. That research does not consider how to handle spam in languages other than English. The algorithm used is Support Vector Machine (SVM), with an accuracy of 87.6%.

Stringhini et al. [6] proposed an approach based on user-based and tweet-based features, such as the number of followers (the number of users who accepted the request), the number of followings (the number of friend requests sent), the ratio of the number of followings to the number of followers, the ratio of the number of tweets containing URLs to the number of tweets each user sent, the number of tweets each user sent, and whether an account likely used a list of names to pick its friends. The Random Forest classifier used in that research achieves an accuracy of 90.93%.

McCord and Chuah [2] present two methods based on user-based and tweet-based features to facilitate spam detection. The features used in the proposed method are the distribution of tweets over a 24-hour period, the number of URLs, the number of mentions, the number of retweets, and the number of hashtags. The classification algorithms used in that research are Random Forest, Decision Tree, Naive Bayes, and k-NN, with the highest accuracy of 95.7%. Mateen et al. [7] presented another technique for spam detection on Twitter. Three methods are used for spam detection, i.e., user-based, tweet-based, and graph-based. The features include the number of followers, the number of followings, the age of the account, FF ratio, reputation, the total number of tweets, hashtag ratio, mentions ratio, URLs ratio, tweet frequency, the number of spam words, the comparison of out-degree to in-degree, and betweenness centrality. The URLs ratio is defined as the ratio of duplicate URLs to the number of distinct URLs in tweets and the total number of tweets. We propose a different formula for this URLs ratio. The classification algorithms used in this research are

1 https://www.topsec.com/it-security-news-and-info/what-is-the-real-impact-of-spam-on-business
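The user-based and tweet-based features surveyed in Section II can be illustrated with a minimal sketch. This is not the paper's crawler or feature extractor: the function names, the regular expressions, the zero-follower guard, and the sample tweets are all assumptions made for illustration. The two spam phrases are taken from the samples in Table I.

```python
import re

# Hypothetical patterns; a production tokenizer would need to be more careful
# (for example, "#" inside a URL fragment would be counted as a hashtag here).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def user_features(followers, followings, tweets):
    """User-based features, assuming the counts were read from a profile."""
    return {
        "followers": followers,
        "followings": followings,
        "tweets": tweets,
        # Ratio of followings to followers; max(..., 1) avoids division by
        # zero and is an assumption, not something the paper specifies.
        "ff_ratio": followings / max(followers, 1),
    }


def tweet_features(tweets, spam_phrases):
    """Tweet-based counts summed over all of one user's tweets."""
    lowered = [t.lower() for t in tweets]
    return {
        "urls": sum(len(URL_RE.findall(t)) for t in tweets),
        "mentions": sum(len(MENTION_RE.findall(t)) for t in tweets),
        "hashtags": sum(len(HASHTAG_RE.findall(t)) for t in tweets),
        # Case-insensitive substring count of each spam phrase.
        "spam_words": sum(t.count(p) for t in lowered for p in spam_phrases),
    }


# Made-up sample data for illustration only.
sample_tweets = [
    "Pinjaman dana proses cepat! http://example.com #promo @budi",
    "Selamat pagi @ani #senin",
]
sample_phrases = ["pinjaman dana", "proses cepat"]

print(user_features(followers=50, followings=200, tweets=len(sample_tweets)))
print(tweet_features(sample_tweets, sample_phrases))
```

Substring counting is the simplest choice but it double counts overlapping phrases (for example "pinjaman" inside "pinjaman dana"), so a real phrase list would need a matching rule for such cases.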
E. Classification Algorithm
As illustrated in Fig. 1, the algorithm is used to create the spammer detection model during the training phase. The spammer detection model is then used to classify the authors of new tweets as spammers or non-spammers, using the same algorithm that was used to create the classification model.

IV. EXPERIMENTS AND RESULTS
The experiments have three objectives: first, to compare our approach with previous research; second, to determine the influence of adding the proposed new features; and third, to determine the influence of the two feature dimensions used. Our experiments split the data into training and testing sets with an 80:20 ratio. Each cell in Tables II, III, IV and V describes the accuracy averaged from five samples taken randomly from the database.

A. Data Set and Labelling
The Twitter data is crawled automatically with the application that we developed in previous research [17]. The crawling results are Twitter data with the thirteen features already mentioned above, of which six are user-based and seven are tweet-based.

We also take a new approach to spam detection by collecting two corpora of spam words or phrases: 200 spam words or phrases in English and 100 in Indonesian. The two corpora were developed based on Web Marketing Today, Spam Assassin, Andrea O'Neill, and MailChimp.2,3

The reference data used in the system were collected from 1 to 6 June 2016. The data cover 300 users with user-based features (followers, following, FF_ratio, reputation, tweet, and like), each manually labeled as spammer or non-spammer. The users are from Indonesia: 150 are labeled as spammers and 150 as non-spammers.

TABLE I. SAMPLES OF INDONESIAN SPAM-WORDS
No Indonesian Spam-words
1 kredit dp
2 paket kredit
3 cicilan ringan
4 dp ringan
5 cash/kredit
6 dana tunai
7 proses cepat
8 dana cepat
9 pinjaman uang
10 pinjaman dana
11 pinjaman
12 gadai

PHP programming language. The tweet data are saved in a tweet corpus. Next, the reference data, both user data and tweet data, are combined and stored in a MySQL database.

B. Comparison of Accuracy with Previous Research
Our results are compared to those from two previous studies, i.e., the research conducted by Benevenuto in 2010 [1] and Mateen in 2017 [7]. A comparison of the features is shown in Table II.

TABLE II. FEATURES CHECKLIST
No Feature Benevenuto (2010) Mateen (2017) The Proposed
1 #Followers √ √ √
2 #Following √ √ √
3 Ratio_FF √ √ √
4 Reputation √ √
5 #Tweet √ √ √
6 #Like* √
7 #URL √ √ √
8 URL_Ratio* √
9 #Mention √ √ √
10 Mention_Ratio √ √
11 #Hashtag √ √ √
12 Hashtag_Ratio √ √
13 #Spam_words_Int √ √
14 #Spam_words_Indo* √
15 Total_spam* √
* new features

As shown in Table III, the proposed technique achieves better accuracy than the previous research for all of the classifiers used. The highest accuracy of 93.67% is achieved by using the Logistic Regression classifier.

TABLE III. COMPARISON TO THE PREVIOUS RESEARCH
Accuracy (%)
Classifier  Benevenuto (2010)  Mateen (2017)  The Proposed
NB          64.33              75.00          84.33
SVM         90.33              90.33          91.00
Logit       93.33              93.33          93.67
J48         90.67              90.33          91.67

C. Influence of Adding New Features and Influential Features

1) Influence of Adding New Features
In Table IV, adding the proposed new features increases the accuracy for almost all of the four classifiers; the exception is one cell of the SVM classifier, where the accuracy decreases by 0.36%. The highest accuracy improvement, 16.06%, was achieved with the Naive Bayes classifier.

TABLE IV. INFLUENCE OF ADDING NEW FEATURES
accuracy decreases from the original accuracy, while a positive percentage (+) means the reverse: the accuracy increases compared to the original accuracy. The baseline describes the results without the new additional features.

Of the four new proposed features, the third and fourth features, Spam_words_Indo and Total_spam, have more influence than the other features. The highest accuracy improvement, +25.23%, was achieved with the Naive Bayes classifier. The best combinations of two or more new proposed features are baseline + {spam_words_indo, total_spam} and baseline + {#like, spam_words_indo, total_spam}.

Regarding the influence of adding new features on the four classifiers we use, the Logistic Regression and J48 classifiers improve over the baseline, while the Naive Bayes and SVM classifiers give mixed results.
TABLE V. INFLUENTIAL NEW FEATURES
Accuracy (%)
NB SVM Logit J48
Baseline 72.67 91.33 93.33 90.00
Baseline + 1 67.33 (-7.34) 91.33 (+0.00) 93.33 (+0.00) 90.67 (+0.74)
Baseline + 2 75.00 (+3.21) 90.67 (-0.73) 93.33 (+0.00) 90.33 (+0.37)
Baseline + 3 91.00 (+25.23) 92.00 (+0.73) 93.67 (+0.36) 91.33 (+1.48)
Baseline + 4 89.67 (+23.39) 92.00 (+0.73) 93.67 (+0.36) 90.67 (+0.74)
Baseline + 1,2 68.00 (-6.42) 91.00 (-0.36) 93.33 (+0.00) 90.33 (+0.37)
Baseline + 1,3 65.67 (-9.63) 92.00 (+0.73) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,4 64.67 (-11.01) 92.00 (+0.73) 93.67 (+0.36) 91.00 (+1.11)
Baseline + 2,3 85.33 (+17.43) 91.67 (+0.36) 93.67 (+0.36) 91.33 (+1.48)
Baseline + 2,4 90.67 (+24.77) 91.33 (+0.00) 93.67 (+0.36) 91.00 (+1.11)
Baseline + 3,4 82.67 (+13.76) 92.00 (+0.73) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,2,3 68.33 (-5.96) 91.00 (-0.36) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,2,4 67.33 (-7.34) 91.33 (+0.00) 93.67 (+0.36) 91.00 (+1.11)
Baseline + 2,3,4 79.67 (+9.63) 91.33 (+0.00) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,3,4 82.33 (+13.30) 92.00 (+0.73) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,2,3,4 84.33 (+16.06) 91.00 (-0.36) 93.67 (+0.36) 91.67 (+1.85)
Information: 1 = #Likes, 2 = URLs_ratio, 3 = #Spam_words_Indo, 4 = Total_spam
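The parenthesized values in Table V appear to be the percentage change relative to the baseline accuracy. That reading is inferred from the numbers rather than stated explicitly in the text. A minimal sketch of the calculation, using the baseline row and the "Baseline + 1,2,3,4" row of Table V (the function name is an assumption):

```python
def relative_change(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Accuracy values copied from Table V.
baseline = {"NB": 72.67, "SVM": 91.33, "Logit": 93.33, "J48": 90.00}
all_new = {"NB": 84.33, "SVM": 91.00, "Logit": 93.67, "J48": 91.67}

for name in baseline:
    change = relative_change(baseline[name], all_new[name])
    print(f"{name}: {change:+.2f}%")
```

Recomputing from the rounded accuracies in the table reproduces the published percentages to within about 0.02 points. The small differences are consistent with the authors computing the change from unrounded averages.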
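The evaluation protocol described at the start of Section IV (an 80:20 split of training to testing data, with accuracy averaged over five random samples) could be sketched as follows. This assumes each of the five samples is a fresh random 80:20 split, which the text does not spell out. The threshold classifier and the synthetic data are toy stand-ins for illustration only; they are not one of the four classifiers used in the paper, and all names are hypothetical.

```python
import random
from statistics import mean


def holdout_accuracy(rows, labels, fit, rng, test_fraction=0.2):
    """One random split: train on 80%, report accuracy (%) on the other 20%."""
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    predict = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    hits = sum(predict(rows[i]) == labels[i] for i in test_idx)
    return 100.0 * hits / n_test


def averaged_accuracy(rows, labels, fit, repeats=5, seed=0):
    """Average the hold-out accuracy over `repeats` random splits."""
    rng = random.Random(seed)
    return mean(holdout_accuracy(rows, labels, fit, rng) for _ in range(repeats))


def fit_threshold(train_rows, train_labels):
    """Toy stand-in classifier: cut one numeric feature at the midpoint
    between the two class means (1 = spammer, 0 = non-spammer)."""
    spam = [r for r, y in zip(train_rows, train_labels) if y == 1]
    ham = [r for r, y in zip(train_rows, train_labels) if y == 0]
    cut = (mean(spam) + mean(ham)) / 2
    return lambda r: 1 if r > cut else 0


# Synthetic, perfectly separable data: one feature value per user.
rows = [i % 5 for i in range(20)] + [6 + i % 5 for i in range(20)]
labels = [0] * 20 + [1] * 20

print(averaged_accuracy(rows, labels, fit_threshold))
```

Passing the classifier in as a `fit` function keeps the splitting and averaging logic independent of the model, so the same loop could wrap any of the four classifiers.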
Web Intell. Intell. Agent Technol. (WI-IAT), 2010 IEEE/WIC/ACM Int. Conf., vol. 1, 2010.
[5] C. Meda, F. Bisio, P. Gastaldo, and R. Z. Diten, “Machine Learning Techniques Applied To Twitter Spammers Detection,” Recent Adv. Electr. Electron. Eng., pp. 177–182, 2014.
[6] G. Stringhini, C. Kruegel, and G. Vigna, “Detecting Spammers on Social Networks,” ACSAC ’10 Dec. 6-10, 2010, Austin, Texas USA, 2010.
[7] M. Mateen, M. Aleem, M. A. Iqbal, and M. A. Islam, “A Hybrid Approach for Spam Detection for Twitter,” 14th Int. Bhurban Conf. Appl. Sci. Technol., pp. 466–471, 2017.
[8] E. B. Setiawan, D. H. Widyantoro, and K. Surendro, “Feature Expansion using Word Embedding for Tweet Topic Classification,” Telecommun. Syst. Serv. Appl. (TSSA), 2016 10th Int. Conf. Denpasar, Indones., 2016.
[9] K. Lee, J. Caverlee, and S. Webb, “Uncovering social spammers: social honeypots + machine learning,” SIGIR’10, July 19–23, 2010, Geneva, Switzerland. Copyr., no. i, pp. 435–442, 2010.
[10] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs,” 2009.
[11] S. Lee and J. Kim, “Warning bird: A near real-time detection system for suspicious URLs in twitter stream,” IEEE Trans. Dependable Secur. Comput., vol. 10, pp. 183–195, 2013.
[12] V. Vishwarupe, M. Bedekar, M. Pande, and A. Hiwale, “Intelligent Twitter Spam Detection: A Hybrid Approach,” Smart Trends Syst. Secur. Sustain., pp. 189–197, 2017.
[13] C. Grier, K. Thomas, V. Paxson, and M. Zhang, “@spam: The Underground on 140 Characters or Less,” Proc. 17th ACM Conf. Comput. Commun. Secur., pp. 27–37, 2010.
[14] J. Martinez-Romo and L. Araujo, “Detecting malicious tweets in trending topics using a statistical analysis of language,” Expert Syst. Appl., vol. 40, no. 8, pp. 2992–3000, 2013.
[15] A. H. Wang, “Don’t follow me: Spam detection in Twitter,” Proc. Int. Conf. Secur. Cryptogr., vol. 2010, pp. 1–10, 2010.
[16] B. Wang, A. Zubiaga, M. Liakata, and R. Procter, “Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter,” Proc. 5th Work. Mak. Sense Microposts, Florence, Italy, pp. 10–16, 2015.
[17] J. Eka Sembodo, E. Budi Setiawan, and Z. Abdurahman Baizal, “Data Crawling Otomatis pada Twitter,” Indosc 2016, no. August, pp. 11–16, 2016.