
2018 6th International Conference on Information and Communication Technology (ICoICT)

Detecting Indonesian Spammer on Twitter


Erwin B. Setiawan, Dwi H. Widyantoro, Kridanto Surendro
School of Electrical Engineering and Informatics, Institut Teknologi Bandung
Jl Ganesha No 10, Bandung, Indonesia
Email: erwinbudisetiawan@telkomuniversity.ac.id, dwi@stei.itb.ac.id, endro@stei.itb.ac.id

Abstract— Twitter is one of the most popular social media platforms today. However, Twitter has several problems that negatively affect its users, one of which is spam. We introduce an approach that differs from previous research in three respects: its scope is Indonesian-language Twitter, user and tweet data are crawled automatically, and new features are added. We use two feature dimensions, i.e., user-based and tweet-based. In this paper, we detect Indonesian spammers on Twitter using four classification algorithms, namely Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Regression (Logit), and J48. The results show better accuracy than existing approaches. The highest accuracy, 93.67%, is achieved using Logistic Regression (Logit).

Keywords—spammer detection, Indonesian, twitter, spam

I. INTRODUCTION
Twitter is a social media site based on microblogging (small blogs), where users send and read tweets (posts on Twitter) of at most 140 characters [1], [2], [3]. Twitter was first launched on July 13, 2006 [4]. The mechanisms by which tweets spread on Twitter create opportunities for an influx of spam [1], [2]. As opposed to other social networks, a user can post tweets on another user's page without permission, since a user does not have to be one of another user's followers to post tweets on that user's profile. Accounts that send spam are called spammers. Because of this mechanism, spammers can send many malicious messages without permission [5]. Spammers attach links unrelated to the tweet's theme to tweets about popular topics or hashtags discussed on Twitter; other types of spam are tweets with identical content and tweets that mention large numbers of other users at random [1]. By sending many spam tweets, spammers can cause many problems. In business, for example, spam has a negative effect in three major ways: it contributes to a loss of productivity and profit, it poses legal risks, and it carries various malware threats.¹ Because spam is so disruptive, a system for detecting spammers is needed.

This paper presents a methodology for detecting Indonesian spammers, automatically crawling user and tweet data, and finding the best new features to identify Twitter spammers in Indonesia. The rest of the paper is organized as follows. Section II discusses issues related to spammer detection techniques. Section III describes the system model and the approach to detecting Indonesian spammers on Twitter. Section IV provides the experimental results, followed by the conclusion in Section V.

¹ https://www.topsec.com/it-security-news-and-info/what-is-the-real-impact-of-spam-on-business

II. RELATED WORKS
In Benevenuto et al. [1], the features used to classify spammers have two dimensions, i.e., user-based features and tweet-based features. User-based features are features that can be seen directly on the Twitter user's profile. They include the number of followers, the number of followings, and the number of tweets. Tweet-based features are numerical values that can be derived from the tweets issued by the user. They include the number of URLs, the number of mentions, the number of hashtags, and the number of spam words. That research does not address how to handle spam in languages other than English. The algorithm used is Support Vector Machine (SVM), with an accuracy of 87.6%.

Stringhini et al. [6] proposed an approach based on user-based and tweet-based features, such as the number of followers (the number of users who accepted the request), the number of followings (the number of friend requests sent), the ratio of the number of followings to the number of followers, the ratio of the number of tweets containing URLs to the number of tweets each user sent, the number of tweets each user sent, and whether an account likely used a list of names to pick its friends. The classifier used in that research is Random Forest, with an accuracy of 90.93%.

McCord and Chuah [2] present two methods based on user-based and tweet-based features to facilitate spam detection. The features used in their proposed method are the distribution of tweets over a 24-hour period, the number of URLs, the number of mentions, the number of retweets, and the number of hashtags. The classification algorithms used in that research are Random Forest, Decision Tree, Naive Bayes, and k-NN, with the highest accuracy being 95.7%.

Mateen et al. [7] presented another technique for spam detection on Twitter. It uses three kinds of features, i.e., user-based, tweet-based, and graph-based. The features include the number of followers, the number of followings, age of account, FF Ratio, Reputation, the total number of tweets, hashtag ratio, mentions ratio, URLs ratio, tweet frequency, the number of spam words, the ratio of out-degree to in-degree, and betweenness centrality. Their URLs Ratio is defined as the ratio between duplicate URLs and the number of distinct URLs in tweets and the sum of tweets. We propose a different formula for this URLs Ratio. The classification algorithms used in that research are

ISBN: 978-1-5386-4571-0 (c) 2018 IEEE 259


Decorate, Naive Bayes, and J48, with an improvement in precision of up to 97.6%.

In this paper, we introduce an approach that differs from previous research in its scope (Indonesian-language Twitter accounts), in crawling user data and tweet data automatically, and in the addition of four new features, namely spam_words_Indo, total_spam, #like, and URLs_Ratio. To the best of our knowledge, spammer detection has not been explored for Bahasa Indonesia.

III. SPAMMER DETECTION MODEL AND THE PROPOSED TECHNIQUE
The spammer detection model is shown in Fig. 1. The dataset is divided into two parts, i.e., training data and testing data. Both parts go through pre-processing and feature extraction. The feature extraction results for the training data are the input to the spammer detection modeling process, and the resulting model is used to predict the testing data. The final output is the spammer detection class, which is expected to have excellent accuracy.

Fig. 1. Spammer detection model

Details of each step in Fig. 1 are described in the following sections.

A. Crawling Data
Crawling aims to obtain the data used as reference data by the system. The reference data take the form of user data and tweet data. User data are obtained from the information displayed on the user profile. Tweet data are obtained from the tweets, and their attributes, taken from each user. The Twitter data are the combined user data and tweet data, with a class label assigned manually to each user.

B. Training Data and Testing Data
The Twitter data are separated into training data and testing data. The training data are analyzed to obtain the model, whereas the testing data are used to determine the accuracy of the generated model.

C. Pre-Processing Data
The pre-processing stage is intended to produce reference data that are ready to be processed in the spammer detection modeling stage. The Twitter data are pre-processed with steps such as case folding, tokenization, stopword removal, and stemming. We pre-process the Twitter data automatically with the application that we developed in previous research [8]. At this stage, data discretization is also performed because the values are spread over too wide a range, especially for the numerical data.

D. Feature Extraction
Features are attributes used as a reference to determine the classification class. Based on [1], we developed a spammer detection method based on user-based and tweet-based features.

The features used in this paper summarize the most popular features in previous work. We collected features from 13 different papers, some of which have been discussed above, i.e., [1], [2], [5–15]. For feature extraction, we use the features that appear in three or more of these papers.

The user features used in this research are the following five:
1. #Followers, the number of people/accounts following a user.
2. #Followings, the number of users followed by an account.
3. FF Ratio, the ratio of the number of followings to the number of followers of a user account [7]. Mathematically it can be represented as
$$\mathrm{FF\_Ratio} = \frac{\#\mathrm{Followings}}{\#\mathrm{Followers}}.$$
4. Reputation, the ratio of the number of followers to the sum of followings and followers [7]. Mathematically it can be represented as
$$\mathrm{Reputation} = \frac{\#\mathrm{Followers}}{\#\mathrm{Followings} + \#\mathrm{Followers}}.$$
5. #Tweet, the number of tweets sent by a user.

The tweet features used in this research are the following six:
1. #URL, the number of URLs in the recent tweets taken for each user.
2. #Mention, the number of mentions in the recent tweets taken for each user.
3. #Hashtag, the number of hashtags in the recent tweets taken for each user.
4. #Spam_words_Int, the number of international spam words in the recent tweets taken for each user.
5. Hashtags_Ratio, the ratio of the number of hashtags (#Hashtags) to the number of recent tweets taken for each user.
6. Mentions_Ratio, the ratio of the number of mentions (#Mentions) to the number of recent tweets taken for each user.

We propose four new features, one user-based and three tweet-based:
1. #Like, the number of tweets liked by the user.
2. URLs_Ratio, the ratio of the number of URLs (#URLs) to the number of recent tweets taken for each user.
3. Spam_words_Indo, the number of Indonesian spam words in the recent tweets taken for each user.
4. Total_spam, the sum of Spam_words_Int and Spam_words_Indo.
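The feature definitions of Section III.D can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper reports PHP and MySQL tooling). The function names, the regular expressions for URLs, mentions, and hashtags, and the two tiny spam lexicons are assumptions made for illustration only.

```python
import re

# Hypothetical mini-lexicons for illustration only; the paper's corpora hold
# 200 English and 100 Indonesian words or phrases.
SPAM_WORDS_INT = ["free", "click here"]
SPAM_WORDS_INDO = ["kredit dp", "pinjaman uang", "dana tunai"]


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def count_phrases(text, phrases):
    """Count whole-word, case-insensitive occurrences of each phrase in text."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )


def extract_features(followers, followings, n_tweets, likes, tweets):
    """Build the user-based and tweet-based features for one account.

    `tweets` is the list of recent tweet texts taken for the user.
    """
    n_recent = len(tweets)
    urls = sum(len(re.findall(r"https?://\S+", t)) for t in tweets)
    mentions = sum(len(re.findall(r"@\w+", t)) for t in tweets)
    hashtags = sum(len(re.findall(r"#\w+", t)) for t in tweets)
    spam_int = sum(count_phrases(t, SPAM_WORDS_INT) for t in tweets)
    spam_indo = sum(count_phrases(t, SPAM_WORDS_INDO) for t in tweets)
    return {
        # User-based features
        "followers": followers,
        "followings": followings,
        "ff_ratio": safe_ratio(followings, followers),
        "reputation": safe_ratio(followers, followings + followers),
        "tweet": n_tweets,
        "like": likes,
        # Tweet-based features
        "url": urls,
        "mention": mentions,
        "hashtag": hashtags,
        "spam_words_int": spam_int,
        "hashtags_ratio": safe_ratio(hashtags, n_recent),
        "mentions_ratio": safe_ratio(mentions, n_recent),
        "urls_ratio": safe_ratio(urls, n_recent),
        "spam_words_indo": spam_indo,
        "total_spam": spam_int + spam_indo,
    }


example = extract_features(
    followers=50,
    followings=200,
    n_tweets=1000,
    likes=3,
    tweets=[
        "Pinjaman uang proses mudah http://a.example/x #promo @budi",
        "Click here for FREE stuff #promo #deal",
    ],
)
```

Returning 0.0 for a zero denominator is a design choice of this sketch, made so that accounts with no followers or no recent tweets still yield a numeric feature vector. The paper does not state how such cases are handled.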

E. Classification Algorithm
As illustrated in Fig. 1, the classification algorithm is used to create the spammer detection model during the training phase. The spammer detection model is then used to detect spammers in new tweets, using the same algorithm that was used to create the classification model.

IV. EXPERIMENTS AND RESULTS

The experiments have three objectives: first, to compare our results with previous research; second, to measure the influence of adding the proposed new features; and third, to measure the influence of the two feature dimensions. Our experiments split the data into training and testing data with a ratio of 80:20. Each cell in Tables III, IV, V, and VI is the accuracy averaged over five samples taken randomly from the database.

A. Data Set and Labelling
The Twitter data are crawled automatically with the application that we developed in previous research [17]. The crawling results are Twitter data with the thirteen features already mentioned above, which are six user-based features and seven tweet-based features.

We also take a new approach related to spam detection: we collected two corpora of spam words or phrases, i.e., 200 English and 100 Indonesian words or phrases. The two corpora were developed based on Web Marketing Today, Spam Assassin, Andrea O'Neill, and MailChimp.²,³

The reference data used in the system were collected from 1 to 6 June 2016. They comprise 300 users with user-based features (followers, following, FF_ratio, reputation, tweet, and like), each manually labeled with a spammer class. The Indonesian user data consist of 150 users labeled as spammers and 150 users labeled as non-spammers.

TABLE I. SAMPLES OF INDONESIAN SPAM-WORDS
No  Indonesian Spam-words
1   kredit dp
2   paket kredit
3   cicilan ringan
4   dp ringan
5   cash/kredit
6   dana tunai
7   proses cepat
8   dana cepat
9   pinjaman uang
10  pinjaman dana
11  pinjaman
12  gadai

The tweet data are Indonesian tweets with tweet-based features (URL, mention, hashtag, international spam words, and Indonesian spam words), totaling 30,961 tweets. Table I provides 12 samples of Indonesian spam words. The tweets were collected from each of the 300 users using the Twitter API, implemented in the PHP programming language. The tweet data are saved in a tweet corpus. Next, the reference data, both user data and tweet data, are combined and stored in a MySQL database.

² https://emailmarketing.comm100.com/email-marketing-ebook/spam-words.aspx
³ http://www.mequoda.com/articles/audience-development/subject-line-spam-trigger-words/

B. Comparison of Accuracy with Previous Research

Our results are compared to those from two previous studies, i.e., the research conducted by Benevenuto in 2010 [1] and by Mateen in 2017 [7]. The comparison of features is shown in Table II.

TABLE II. FEATURES CHECKLIST
No Feature Benevenuto (2010) Mateen (2017) The Proposed
1 #Followers √ √ √
2 #Following √ √ √
3 Ratio_FF √ √ √
4 Reputation √ √
5 #Tweet √ √ √
6 #Like* √
7 #URL √ √ √
8 URL_Ratio* √
9 #Mention √ √ √
10 Mention_Ratio √ √
11 #Hashtag √ √ √
12 Hashtag_Ratio √ √
13 #Spam_words_Int √ √
14 #Spam_words_Indo* √
15 Total_spam* √
* new features

As shown in Table III, the proposed technique achieves better accuracy than the previous research for all classifiers used. The highest accuracy, 93.67%, is achieved using the Logistic Regression classifier.

TABLE III. COMPARISON TO THE PREVIOUS RESEARCH
                        Accuracy (%)
Classifier  Benevenuto (2010)  Mateen (2017)  The Proposed
NB          64.33              75.00          84.33
SVM         90.33              90.33          91.00
Logit       93.33              93.33          93.67
J48         90.67              90.33          91.67

C. Influence of Adding New Features and Influential Features

1) Influence of Adding New Features

In Table IV, the addition of the new proposed features increases accuracy for three of the four classifiers; the exception is SVM, whose accuracy decreases by 0.36%. The highest improvement in accuracy, 16.06%, is achieved using the Naive Bayes classifier.

TABLE IV. INFLUENCE OF ADDING NEW FEATURES
                                     Accuracy (%)
                             NB      SVM     Logit   J48
Without Adding New Features  72.67   91.33   93.33   90.00
Adding New Features          84.33   91.00   93.67   91.67
% Improvement                16.06%  -0.36%  0.36%   1.85%

2) Influential Features

Table V shows the influence of each new feature on spammer detection on Twitter. A negative percentage (-) means that adding the feature decreases accuracy relative to the baseline, while a positive percentage (+) means that accuracy increases relative to the baseline. The baseline is the result without the new features.

Of the four new proposed features, the third and fourth, Spam_words_Indo and Total_spam, have more influence than the others; the highest improvement in accuracy, +25.23%, is achieved with the Naive Bayes classifier. The best combinations of two or more new features are baseline + {spam_words_indo, total_spam} and baseline + {#like, spam_words_indo, total_spam}.

Across the four classifiers we use, adding new features improves the performance of the Logistic Regression and J48 classifiers over the baseline, while the Naive Bayes and SVM classifiers give mixed results.
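Every accuracy in the result tables is reported as an average over five random samples with an 80:20 split between training and testing data. One plausible reading of that protocol is a repeated random hold-out, sketched below in Python. The function names, the fixed seed, and the one-feature threshold classifier (a stand-in for NB, SVM, Logit, or J48) are assumptions for illustration, not the authors' code.

```python
import random


def repeated_holdout_accuracy(samples, labels, train_fn,
                              test_fraction=0.2, repeats=5, seed=0):
    """Average accuracy (%) over `repeats` random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, round(len(samples) * test_fraction))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        # train_fn returns a predict(sample) -> label function.
        predict = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(100.0 * correct / n_test)
    return sum(accuracies) / len(accuracies)


def train_threshold(train_x, train_y):
    """Toy one-feature classifier: threshold halfway between class means.

    Assumes label 1 (spammer) has the larger mean feature value and that
    both classes occur in the training data.
    """
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


# Synthetic, clearly separable data: ten non-spammers and ten spammers.
xs = list(range(10)) + [100 + i for i in range(10)]
ys = [0] * 10 + [1] * 10
mean_accuracy = repeated_holdout_accuracy(xs, ys, train_threshold)
```

Passing the training routine as `train_fn` keeps the splitting and averaging logic independent of the classifier, so the same loop could wrap any of the four algorithms compared in the paper.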
TABLE V. INFLUENTIAL NEW FEATURES
Accuracy (%)
NB SVM Logit J48
Baseline 72.67 91.33 93.33 90.00
Baseline + 1 67.33 (-7.34) 91.33 (+0.00) 93.33 (+0.00) 90.67 (+0.74)
Baseline + 2 75.00 (+3.21) 90.67 (-0.73) 93.33 (+0.00) 90.33 (+0.37)
Baseline + 3 91.00 (+25.23) 92.00 (+0.73) 93.67 (+0.36) 91.33 (+1.48)
Baseline + 4 89.67 (+23.39) 92.00 (+0.73) 93.67 (+0.36) 90.67 (+0.74)
Baseline + 1,2 68.00 (-6.42) 91.00 (-0.36) 93.33 (+0.00) 90.33 (+0.37)
Baseline + 1,3 65.67 (-9.63) 92.00 (+0.73) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,4 64.67 (-11.01) 92.00 (+0.73) 93.67 (+0.36) 91.00 (+1.11)
Baseline + 2,3 85.33 (+17.43) 91.67 (+0.36) 93.67 (+0.36) 91.33 (+1.48)
Baseline + 2,4 90.67 (+24.77) 91.33 (+0.00) 93.67 (+0.36) 91.00 (+1.11)
Baseline + 3,4 82.67 (+13.76) 92.00 (+0.73) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,2,3 68.33 (-5.96) 91.00 (-0.36) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,2,4 67.33 (-7.34) 91.33 (+0.00) 93.67 (+0.36) 91.00 (+1.11)
Baseline + 2,3,4 79.67 (+9.63) 91.33 (+0.00) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,3,4 82.33 (+13.30) 92.00 (+0.73) 93.67 (+0.36) 91.67 (+1.85)
Baseline + 1,2,3,4 84.33 (+16.06) 91.00 (-0.36) 93.67 (+0.36) 91.67 (+1.85)
Information: 1 = #Likes, 2 = URLs_ratio, 3 = #Spam_words_Indo, 4 = Total_spam
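The percentages reported alongside the accuracies appear to be relative changes with respect to the baseline accuracy, not differences in percentage points. The short Python sketch below recomputes them from the rounded accuracies without and with the four new features. The function name is an assumption, and because the inputs are already rounded, the results agree with the reported values only to within a few hundredths.

```python
def relative_improvement(baseline, new):
    """Relative change in accuracy, as a percentage of the baseline."""
    return 100.0 * (new - baseline) / baseline


# Accuracies (%) without and with all four new features, as reported.
baseline = {"NB": 72.67, "SVM": 91.33, "Logit": 93.33, "J48": 90.00}
with_new = {"NB": 84.33, "SVM": 91.00, "Logit": 93.67, "J48": 91.67}

improvement = {
    name: round(relative_improvement(baseline[name], with_new[name]), 2)
    for name in baseline
}

for name, value in improvement.items():
    print(f"{name}: {value:+.2f}%")
```

For example, the Naive Bayes entry is 100 × (84.33 − 72.67) / 72.67, which is about 16%, whereas the plain difference would be 11.66 percentage points.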

D. Feature Dimensions
In the next test scenario, we conducted an experiment to see which of the two feature dimensions, i.e., the user-based dimension or the tweet-based dimension, provides the best accuracy for spammer detection. Table VI compares the accuracy of the two dimensions across four classifiers. In all conditions, the tweet-based dimension gives better accuracy than the user-based dimension. The highest accuracy, 93.00%, is achieved on the tweet-based dimension using the Logistic Regression (Logit) classifier.

TABLE VI. FEATURE DIMENSIONS
                     Accuracy (%)
Dimension    NB     SVM    Logit  J48
User-Based   63.33  75.33  87.67  87.00
Tweet-Based  78.67  91.33  93.00  91.33

From the results of this scenario, we conclude that spammer detection can rely on the tweet-based feature dimension. To detect Indonesian spammers on Twitter, it is advisable to look first at the tweet-based features rather than the user-based features.

V. CONCLUSION AND FUTURE WORK
We have proposed an approach to detecting Indonesian spammers on Twitter that adds four new features. The proposed technique improves on the accuracy of the previous studies for all classifiers used. The highest accuracy, 93.67%, is achieved using the Logistic Regression classifier. The addition of the new proposed features increases accuracy for three of the four classifiers; the exception is SVM, whose accuracy decreases by 0.36%. The highest improvement in accuracy, 16.06%, is achieved using the Naive Bayes classifier.

Spam_words_Indo (Indonesian spam words and phrases) and Total_spam have more influence than the other features; the highest improvement in accuracy, +25.23%, is achieved using the Naive Bayes classifier. The best combinations of two or more new features are baseline + {spam_words_indo, total_spam} and baseline + {#like, spam_words_indo, total_spam}. In all conditions, the tweet-based feature dimension gives better accuracy than the user-based feature dimension.

Future research should use larger datasets and expand the corpus of spam words and phrases, especially those in the Indonesian language.

ACKNOWLEDGMENT
The authors would like to thank BPPDN RISTEKDIKTI for its support of this research.

REFERENCES
[1] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, “Detecting spammers on twitter,” Collab. Electron. Messag. anti-abuse spam Conf., vol. 6, p. 12, 2010.
[2] M. McCord and M. Chuah, “Spam detection on twitter using traditional classifiers,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 6906 LNCS, pp. 175–186, 2011.
[3] E. B. Setiawan, D. H. Widyantoro, and K. Surendro, “Feature Expansion using Word Embedding for Tweet Topic Classification,” Telecommun. Syst. Serv. Appl., pp. 1–5, 2016.
[4] S. Asur and B. a. Huberman, “Predicting the Future with Social Media,” Web Intell. Intell. Agent Technol. (WI-IAT), 2010 IEEE/WIC/ACM Int. Conf., vol. 1, 2010.
[5] C. Meda, F. Bisio, P. Gastaldo, and R. Z. Diten, “Machine Learning Techniques Applied To Twitter Spammers Detection,” Recent Adv. Electr. Electron. Eng., pp. 177–182, 2014.
[6] G. Stringhini, C. Kruegel, and G. Vigna, “Detecting Spammers on Social Networks,” ACSAC ’10 Dec. 6-10, 2010, Austin, Texas USA, 2010.
[7] M. Mateen, M. Aleem, M. A. Iqbal, and M. A. Islam, “A Hybrid Approach for Spam Detection for Twitter,” 14th Int. Bhurban Conf. Appl. Sci. Technol., pp. 466–471, 2017.
[8] E. B. Setiawan, D. H. Widyantoro, and K. Surendro, “Feature Expansion using Word Embedding for Tweet Topic Classification,” Telecommun. Syst. Serv. Appl. (TSSA), 2016 10th Int. Conf. Denpasar, Indones., 2016.
[9] K. Lee, J. Caverlee, and S. Webb, “Uncovering social spammers: social honeypots+ machine learning,” SIGIR’10, July 19–23, 2010, Geneva, Switzerland. Copyr., no. i, pp. 435–442, 2010.
[10] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Beyond Blacklists : Learning to Detect Malicious Web Sites from Suspicious URLs,” 2009.
[11] S. Lee and J. Kim, “Warning bird: A near real-time detection system for suspicious URLs in twitter stream,” IEEE Trans. Dependable Secur. Comput., vol. 10, pp. 183–195, 2013.
[12] V. Vishwarupe, M. Bedekar, M. Pande, and A. Hiwale, “Intelligent Twitter Spam Detection : A Hybrid Approach,” Smart Trends Syst. Secur. Sustain., pp. 189–197, 2017.
[13] C. Grier, K. Thomas, V. Paxson, and M. Zhang, “@ spam : The Underground on 140 Characters or Less Categories and Subject Descriptors,” Proc. 17th ACM Conf. Comput. Commun. Secur., pp. 27–37, 2010.
[14] J. Martinez-romo and L. Araujo, “Expert Systems with Applications Detecting malicious tweets in trending topics using a statistical analysis of language,” Expert Syst. Appl., vol. 40, no. 8, pp. 2992–3000, 2013.
[15] A. H. Wang, “Don’t follow me: Spam detection in Twitter,” Proc. Int. Conf. Secur. Cryptogr., vol. 2010, pp. 1–10, 2010.
[16] B. Wang, A. Zubiaga, M. Liakata, and R. Procter, “Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter,” Proc. 5th Work. Mak. Sense Microposts, Florence, Italy, pp. 10–16, 2015.
[17] J. Eka Sembodo, E. Budi Setiawan, and Z. Abdurahman Baizal, “Data Crawling Otomatis pada Twitter,” Indosc 2016, no. August, pp. 11–16, 2016.
