You are on page 1of 11

2018 International Conference on Information, Electronic and Communication Engineering (IECE 2018)

ISBN: 978-1-60595-585-8

SMS Phishing Detection Using Oversampling and Feature


Optimization Method
Tong WU1, Kang-feng ZHENG1, Chun-hua WU1,* and Xiu-juan WANG2
1
School of Cyberspace Security, Beijing University of Posts and Telecommunications,
Beijing, China
2
School of Computer, Beijing University of Technology, Beijing, China
*Corresponding author

Keywords: Phishing detection, Mobile security, Feature optimization, Text classification.

Abstract. In this paper, a new SMS phishing detection method using oversampling and feature
optimization technology is proposed to improve SMS phishing detection accuracy. Three types
features are presented including token features, topic features and Linguistic Inquiry and Word
Count (LIWC) features. One of the existing oversampling methods called Adaptive Synthetic
Sampling Approach is applied in this paper since it has good performance. Then, Binary Particle
Swarm Optimization(BPSO) algorithm is used to analyze the three types features and select the
optimal combination of all the features. Finally, the detection results are achieved by Random
Forest classification algorithm. Experimental results show oversampling method and feature
optimization method improve the accuracy of SMS phishing detection. The best accuracy of the
proposed method is 99.01% with an average of 86.6 features. The results demonstrate that the
proposed method has a promising performance for SMS phishing detection.

Introduction
Phishing is a criminal mechanism employing both social engineering and technical subterfuge to
steal consumers’ personal identity data and financial account credentials. According to the latest
reports of Anti-Phishing Working Group (APWG), the total number of phish detected in 1Q 2018
was 263,538 which was up 46 percent from the 180,577 observed in 4Q 2017[1]. In the last few
years, phishing attack grows rapidly and has become one of the most serious threats to global
Internet security [2]. Phishing detection with high validity and robustness has already been an
important task to protect Internet security.
The development of mobile network leads to an increasing trend of launching new phishing
attacks through emerging technologies such as mobile and social media [3,4,5]. The prevalent use
of social media provides fertile ground for phishing attacks due to increasing sharing of personal
information but little awareness and action of protecting the information [6]. Phishing attacks
increasingly focus on mobile users because of its large numbers and high probability of success [7].
The large numbers of mobile users are not matched with high levels of security awareness, and it is
a matter of time before online threats such as phishing become a reality on mobile devices [8].
Many different approaches for fighting SMS phishing have been proposed, such as blacklist
approach, content-based approach, etc. Blacklist approach sets a blacklist on mobile phone based on
user experience which lacks inflexible to handle new phishing. Content-based approach is a
promising approach for its ability of discerning phishing messages automatically [9]. The success of
machine learning techniques in text categorization provides feasibility for content-based phishing
detection methods. SMS phishing detection is a typical classification problem [10] in which the
goal is to assign a test data one of the predefined classes (phishing, legitimate). Machine learning
algorithms make use of the features extracted from phishing and legitimate messages to find
patterns among them [11,12]. The detection process becomes reliable because the decisions are
made based on rules discovered from historical data by learning algorithm [13].
From the state-of-art of existing studies, most researchers focus on improving SMS phishing
detection accuracy using various traditional machine learning algorithms. Few of them consider the
235
characteristics of SMS phishing detection. For instance, most SMS messages in real environment
are normal text without harm information and few phishing messages exist in our daily life. The
phenomenon leads to imbalanced data problem, which means that there are lots of legitimate
messages but few phishing messages to be learned using machine learning methods. On the other
hand, SMS phishing messages are short text and common text modeling methods in Natural
Language Processing (NLP) cannot interpret short messages primely. It is limited to improve SMS
phishing detection accuracy without dealing with the existing problems.
In this paper, a new SMS phishing detection system is presented considering an approach for
imbalanced data problem, a new feature framework and a feature optimization method. The
approach for imbalanced data problem is aimed at handling with the huge quantity difference
between normal SMS messages and phishing messages in real environment. Adaptive Synthetic
Sampling Approach (ADASYN) is used in this paper for its good performance. New features
framework is proposed through combining optimal features in previous researches including token
features, topic features and LIWC features. And then, Binary Particle Swarm Optimization (BPSO)
algorithm is applied for feature optimization combining with Random Forest (RF) classification
algorithm. The SMS phishing detection system proposed in this paper completes SMS phishing
detection on the basis of previous studies. Finally, the system improves the SMS phishing detection
accuracy considering the characteristics of SMS phishing detection.

Related Work
There are two types of researches for SMS phishing detection in recent years, one is study of
detection algorithm and the other is development of detection system. The study of detection
algorithm is aimed at improving detection accuracy through various statistical learning methods and
machine learning algorithms. The detection system is designed to apply in mobile phone so as to
defense SMS phishing. Two types of features are generally used in previous researches including
non-content features and content features. Non-content features refer to elements such as message
size, time stamp and so on [14]. Content feature is a research hotspot for its promising accuracy
such as special characters, function words and etc.
Most of extant researches focus on the study of detection algorithm. Several studies
[15,16,17,18,19] compare various well-known algorithms such as Naive Bayes (NB), Decision Tree,
Support Vector Machines (SVM), Random Forest (RF), etc. According to their experimental results,
SVM and RF show the best performance in SMS phishing detection. In addition, some other
researches work on developing SMS phishing detection applications. For example, Reference [20],
[21] and [22] prove their proposed solution using Android platform.
The majority of existing studies apply content features in their researches, for instance, character
and word features [23,24], Natural Language Processing(NLP) features [25,26,27], etc. Hidalgo et
al. [15] use four types of word-based attributes including sequences of alpha-numeric characters,
lowercased words, character bi-grams and tri-grams and word bi-grams in their study. Karami and
Zhou [17] propose three types features including SMS Specific Features, Linguistic Inquiry and
Word Count features and LDA topic features. Non-content features are also widely used in SMS
phishing detection. For example, Jae Woong Joo et al.[28] distinguishes normal text message and
phishing message through monitoring log, time stamp, URL and etc. Xu et al.[14] propose three
types non-content features including static features, temporal features and network features using
statistical method.

Methodology
Considering SMS phishing messages are mainly short text and have a relatively low number
compared with legitimate messages, new features for short text and oversampling method for
imbalanced data are applied to SMS phishing detection. In this section, a new framework is
presented as Figure 1 for SMS phishing detection combining of feature extraction, oversampling,
feature optimization and classification.
236
Random Forest
Training Dataset Feature Extraction ADASYN BPSO
Model

Token Features Phishing


SMS DB
Topic Features

Testing Dataset LIWC Features Testing Random Forest


legitimate
Features Classfication

Figure 1. Framework of SMS phishing detection.


Framework for SMS phishing detection. The framework for SMS phishing detection is
composed of 7 steps: Data input, Data preprocessing, Feature extraction, Oversampling, Feature
optimization, Classification and Result output. Data preprocessing mainly focuses on the division of
training and test dataset according to different splits. Three types features called token features,
topic features and LIWC features are extracted in feature extraction step which will be introduced in
detail in the following sections. One of the existing oversampling methods for dealing with the
imbalanced samples problem called Adaptive Synthetic Sampling Approach(ADASYN) [29,30] is
adopted to balance phishing samples and legitimate samples. Oversampling is very important for
SMS phishing detection due to the imbalanced SMS data in real world. Binary Particle Swarm
Optimization(BPSO) algorithm is applied to feature optimization step in order to extract optimal
features and reduce feature dimension. Finally, Random Forest algorithm is used to classify whether
it is phishing message or legitimate message.
Dataset. A publicly available dataset collected by Almeida et al.[16] is used in this paper. The
dataset is published in UCI dataset as a commonly used standard test datasets for machine learning
study. The dataset is composed of 5,574 messages which are non-encoded in English. Specially, it
contains 747 phishing messages and 4,827 legitimate messages. This dataset is widely used in SMS
phishing detection.
SMS phishing detection features. Three types features are presented in this paper including
token features, topic features and LIWC features. Token features are collected from two parts, one
is message structure and the other is function word. Focusing on the characteristics of short text, the
biterm topic model(BTM) [31] for short texts is used in this paper. Linguistic Inquiry and Word
Count(LIWC) features are derived from literature [17] for its good performance.
(1)Token features. Token features are proposed consisting of structure features and function word
features through summarizing related literatures. Structure features are characteristics of characters,
words and sentences in SMS messages from the point of view of natural language processing [32].
In this paper, structure features are comprised of the number of characters [22], the number of up
characters [22], the number of ‘!’, and the number of words. Function word features are selected
through presenting the top 50 one-gram, two-gram and three-gram word lists of all the phishing
messages in the datasets. Some chosen words and phrases such as ‘to’, ‘a’, ‘you have’, ’this is’ and
so on have no significance for SMS phishing detection. For this reason, the noun, verb, adjective
words and phrases containing the words are selected as the initial elements of function words. On
the other side, URL features are also selected such as ‘www’, ‘ur’, ‘uk’ because these are important
features for phishing detection according to previous researches [16,22]. Table 1 shows the initial
elements of the one-gram, two-gram and three-gram. Finally, function words list is defined through
choosing words that appear most frequently in the initial elements lists. There are 32 token features
in total including 4 structure features and 28 function word features in this paper shown as Table 2.

237
Table 1. Initial elements of function words.

Types Words
Call, free, now, stop, txt, text, ur, reply, mobile, www, claim, uk, only, cash, please,
One-gram
urgent
call now, call from, please call, your mobile, have won, to claim, co uk, won a, account
Two-gram statement, call identifier, code expires, identifier code, or cash, private your, statement
for, your account, a prize, win a
account statement for, call identifier code, have won a, identifier code expires, private
Three-gram your account, statement for shows, you have won, your account statement, thanks for
your, attempt to contact, are awarded with
Table 2. Token features list.
Features Type Features
number of character, number of up character, number of !, number of
Structure features
word
Call, account, won, free, now, prize, www, reply, please, urgent, cash, txt,
Function words features claim, ur, mobile, stop, text, uk, statement, identifier, expires, code,
private, thanks, award, contact, only, win

(2)Topic features. A topic model is a statistical model used to find the latent semantic structure in
a series of documents which is widely used in natural language processing. Conventional topic
models (e.g. LDA and PLSA) are used in Email phishing detection most frequently [3]. As a kind of
short text, SMS messages differ from Emails and the other types of long texts. Directly applying
those topic models on SMS phishing detection does not work well on account of the severe data
sparsity in short documents [31]. Biterm topic model(BTM) models the word co-occurrence
patterns and uses the aggregated patterns in the whole corpus for learning topics to solve the
problem of sparse word co-occurrence patterns at document-level [31]. The topic probability
distribution of each message obtained by BTM is used as topic features in this paper. Ten-fold cross
validation method is adopted to determine the most appropriate number of topic and finally, 50
topics is selected through experiments using Random Forest algorithm.
(3)LIWC features. LIWC features are acquired through Linguistic Inquiry and Word Count
(LIWC) tool which is a transparent text analysis program that counts words in psychologically
meaningful categories [33]. LIWC can analyze text content quantitatively and calculate the
percentage of different words categories in the text such as causal words, emotional words,
cognitive words and the other psychological parts. LIWC is widely used in the field of natural
language processing and psychology. Literature [17] firstly uses LIWC features in SMS phishing
detection and achieves encouraging performance. Ninety-three LIWC features are used in this
paper.
Oversampling method for imbalanced data. Mobile security reports in China indicates there
are nearly 159.53 billion SMS messages in the first quarter of 2017, among which 0.18 billion
messages are phishing messages. There are very few phishing samples compared to normal text
messages. The huge discrepancy in the data suggests that SMS phishing detection is an imbalanced
sample problem. Imbalanced samples can lead to a series of problems. The information contained in
minority class is very limited and it is difficult to excavate rules internally. Many classification
algorithms use the divide and conquer method, and few rules of minority class result in low
classification accuracy. On the other side, improper induction bias systems tend to classify samples
as the majority class when uncertainty exists. Therefore, SMS phishing detection rate will be
relatively low using traditional method without dealing with the problem of imbalanced samples.
There are many methods for handling class imbalance, such as weighted loss function method,
under sampling method, oversampling method, etc. [34]. Weighted loss function method is to set a
weight for the loss function so that the loss of discriminant errors for minority class is greater than
that of the majority class. Under sampling method is to improve the classification performance of
the minority class by reducing the majority class samples. The disadvantage of under sampling
method is that some important information of the majority class is lost. Oversampling method is to

238
eliminate or reduce the imbalance of data by adding some samples of the minority class.
Oversampling method leads to overfitting sometimes. Synthetic minority oversampling technique
(SMOTE) is the most common used oversampling method for handling class imbalance. This paper
uses Adaptive Synthetic Sampling Approach(ADASYN) which is an extension of SMOTE for its
good performance.
ADASYN algorithm is proposed to overcome the limitation of SMOTE algorithm which is
SMOTE increases the occurrence of overlapping between classes because it generates the same
number of synthetic data samples for each original minority example without considering neighbor
examples [35]. The key idea of the ADASYN algorithm is to use a density distribution as a criterion
to automatically decide the number of synthetic samples that need to be generated for each minority
example by adaptively changing the weights of different minority examples to compensate for the
skewed distributions [29,30].

Figure 2. Samples distribution before and after using oversampling method.


Figure 2 shows the samples distribution before and after using ADASYN oversampling method
through random sampling analysis. Before using ADASYN method, a big difference exists between
the numbers of two classes samples. ADASYN method generates several samples belonging to the
minority class so as to balance the number of two classes. The synthetic phishing messages are in
accordance with the distribution of existing phishing samples. The two scatter grams present
oversampling method solves the problem of imbalanced data. More minority samples improve the
availability of training model for minority class which contribute to enhance the accuracy for
phishing detection.
Feature optimization method based on BPSO algorithm. The particle swarm optimization
algorithm (PSO), originally is proposed by Kennedy and Eberhart in 1995 [36], which has been
used increasingly as an important algorithm among these population based search algorithms on
solving problems in many fields, especially engineering and computer science. PSO is a
bio-inspired algorithm enlightened by crowd behaviors, such as the birds in a flock and the fishes in
a school cooperate while searching for food. The development of PSO is the mechanism by which
each particle adjust their position by interacting information from particle’s past best position (pbest)
and the global best position in all populations or neighbors (gbest). At each iteration, the pbest and
gbest are updated if a better or more dominating solution is found. This process continues iteratively,
until some specific termination criteria are met and a final candidate solution is obtained. However,
initially PSO is designed for continuous optimization problems, the discrete optimization problem is
challenge in field of engineering and scientific applications.
The binary version of PSO (BPSO) is proposed by Kennedy and Eberhart in 1997 [37] for
discrete optimization problems. Suppose that the search space is S = {0,1} for D dimension, and the
objective function f is to be maximized, i.e., max f(x), then the i-th particle of the swarm can be
described as vector𝑥𝑖 = (𝑥𝑖1 , 𝑥𝑖2 … 𝑥𝑖𝐷 ), 𝑥𝑖𝑑 ∈ {0,1},d = 1,2,. . . ,D. The velocity of particle can be
described as vector𝑉𝑖 = (𝑉𝑖1 , 𝑉𝑖2 … 𝑉𝑖𝐷 ), 𝑉𝑖𝑑 ∈ {−𝑉𝑚𝑎𝑥 , 𝑉𝑚𝑎𝑥 }, d = 1,2,. . . ,D and 𝑉𝑚𝑎𝑥 is the
maximum velocity. The particle best position of the i-th is denoted as 𝑃𝑏𝑒𝑠𝑡𝑖 = (𝑝𝑖1 , 𝑝𝑖2 … 𝑝𝑖𝐷 ),

239
𝑝𝑖𝑑 ∈ {0,1}, d = 1,2,. . . ,D. the global best position define as 𝑔𝑏𝑒𝑠𝑡 . The position and velocity of
particle i are updated using the following equations:

vijk 1  wvijk  c1r1 ( pbestijk  xijk )  c2 r2 ( gbestijk  xijk ) (1)

1 if U (0,1)  sigm(vij )
xijk +1   (2)
0 otherwise
where j = 1,2. . .D; i = 1,2. . .N, and N is the size of the swarm; c1 and c2 are constants, called
cognitive and social scaling parameters respectively; r1 and r2 are random numbers, uniformly
distributed in [0, 1]. U(0,1) is a symbol for uniformly distributed random number between 0 and 1.
sigm(vij) is a sigmoid limiting transformation. The traditional transfer function in BPSO is presented
as equation (3) and (4).

1
T (vij )   vij (3)
1 e

0 if rand  T (vij )

xij   (4)
1 if rand  T (vij )

Experiments and Results Analysis


In this section, several experiments are developed and experimental results are listed. Three types
feature in this paper are analyzed through BPSO feature optimization method. In addition, the SMS
phishing detection results are compared with previous researches.
Experimental results. Three experiments are designed to demonstrate the effectiveness and
feasibility of the method proposed in this paper. Firstly, random forest algorithm is used to classify
phishing and legitimate messages for its good performance [37]. Then, ADASYN data
preprocessing method for imbalanced data is adopted to SMS phishing detection method. Finally,
BPSO algorithm is taken for feature optimization in order to increase the rapidity and validity of the
proposed system. The dataset introduced previously is used in the experiments. Seven
measurements including accuracy (ACC), True Positives Rate(TPR), False Positives Rate(FPR),
precision, recall, F1-measure and Matthews Correlation Coefficient (MCC) are listed for evaluating
the performance of SMS phishing detection. In order to compare detection results with previous
studies, different splits between training and testing datasets are applied in this paper such as 3:7,
5:1, 4:1, and 3:1 as well as 10-fold cross validation. The experimental parameters are assigned as
follows: the trees number of random forest algorithm is t=50; the initial population number of
BPSO algorithm is s=20; the maximum iteration number is I=100. The termination condition of the
BPSO algorithm is that the evolution generation is equal to the maximum iteration number. All the
results are the average of 10 times repeated experiments. All experiments in this paper are
conducted in Matlab 2015b.
Table 3 illustrates that ADASYN method and BPSO method improve the SMS phishing detection
accuracy in different degrees regardless of different splits between training and testing datasets. For
the first part results obtained by RF algorithm, the best accuracy 98.47% is achieved with the splits
of 4:1. For the second part with ADASYN method, the best accuracy 98.73% is achieved with the
splits of 5:1. For the third part with BPSO method, the best accuracy 99.01% is also achieved with
the splits of 5:1. In general, the method proposed in this paper increases phishing detection accuracy
compared with previous research [16] as the baseline which is 97.64% with the splits of 3:7.
Specifically, the proposed method significantly improves TPR and Recall values which indicates the
number of well-classified messages detected by the proposed method increase by a large margin.
The F1-Measure becomes closer to 1 which is a unified value of Precision and Recall. MCC is

240
generally regarded as a balanced measure of the quality of binary classifications and +1 represents a
perfect prediction. The method in this paper improves MCC value which makes the detection
system close to a perfect prediction system.

Table 3. Experimental results.

Splits Method ACC(%) TPR FPR Precision Recall F1-Measure MCC


RF 98.38173 0.94172 0.05395 0.98917 0.94172 0.96360 0.93167
10-fold
cross ADASYN-RF 98.57920 0.95698 0.04353 0.98154 0.95698 0.96860 0.93742
validation ADASYN-PSO-RF 98.74418 0.96413 0.03553 0.98194 0.96413 0.97263 0.94708
RF 97.82676 0.92312 0.08013 0.98243 0.92312 0.94993 0.90257
3:7 ADASYN-RF 97.99077 0.93769 0.05322 0.97615 0.93769 0.95568 0.91779
ADASYN-PSO-RF 98.14967 0.94575 0.05473 0.97323 0.94575 0.95883 0.91802
RF 98.22785 0.93519 0.05221 0.99024 0.93519 0.96046 0.92951
5:1 ADASYN-RF 98.73418 0.95953 0.03980 0.98579 0.95953 0.97204 0.94558
ADASYN-PSO-RF 99.00844 0.97073 0.03296 0.98543 0.97073 0.97785 0.95362
RF 98.47397 0.94409 0.05188 0.99065 0.94409 0.96574 0.93539
4:1 ADASYN-RF 98.58169 0.95792 0.04884 0.98013 0.95792 0.96851 0.93699
ADASYN-PSO-RF 98.67145 0.96298 0.04110 0.97961 0.96298 0.97087 0.94277
RF 98.37877 0.94191 0.05603 0.98850 0.94191 0.96356 0.93047
3:1 ADASYN-RF 98.50789 0.95247 0.04703 0.98214 0.95247 0.96659 0.93275
ADASYN-PSO-RF 98.63702 0.96260 0.04243 0.97832 0.96260 0.97009 0.94080

In order to show the effect of the proposed method on phishing messages detection, Table 4
presents the classfication results of phishing messages. According to the listed data, the proposed
method improves TPR and Recall values and balances the detection of minority and majority class.
The method solves the problem of imbalanced data and enhances the performance of SMS phishing
detection. At the same time, the changes of FPR and Precision illustrate the increase of the number
of misclassified legitimate messages. However, F1-Measure and MCC rise closer to 1 which shows
the effectiveness of the proposed method.
Table 4. Classification results of phishing messages
Splits Method TPR FPR Precision Recall F1-Measure MCC
10-fold RF 0.88407 0.00075 0.99464 0.88407 0.93574 0.92885
cross ADASYN-RF 0.91793 0.00356 0.97557 0.91793 0.94546 0.93814
validation ADASYN-PSO-RF 0.93137 0.00381 0.97414 0.93137 0.95197 0.94526
RF 0.84742 0.00148 0.98886 0.84742 0.91258 0.90383
3:7 ADASYN-RF 0.87954 0.00456 0.96789 0.87954 0.92127 0.91137
ADASYN-PSO-RF 0.89713 0.00545 0.96238 0.89713 0.92847 0.91870
RF 0.87087 0.00049 0.99630 0.87087 0.92914 0.92193
5:1 ADASYN-RF 0.92126 0.00244 0.98339 0.92126 0.95110 0.94464
ADASYN-PSO-RF 0.94488 0.00292 0.98053 0.94488 0.96226 0.95685
RF 0.88859 0.00041 0.99701 0.88859 0.93960 0.93298
4:1 ADASYN-RF 0.91812 0.00373 0.97457 0.91812 0.94535 0.93786
ADASYN-PSO-RF 0.92886 0.00435 0.97111 0.92886 0.94916 0.94207
RF 0.88449 0.00083 0.99396 0.88449 0.93599 0.92884
3:1 ADASYN-RF 0.90909 0.00315 0.97821 0.90909 0.94225 0.93461
ADASYN-PSO-RF 0.92834 0.00464 0.96895 0.92834 0.94794 0.94057

In order to indicate different performance of three types of features, the results for different types
are shown in Table 5 using ADASYN-PSO-RF method with 10-fold cross validation. The three

241
types of features present promising performance for SMS phishing detection. LIWC features
display the best accuracy of 98.13% following by the accuracy of Topic features. However, the best
Recall and F1 value are obtained by Topic features. This indicates that the hidden topic probability
distribution of messages has a good effect on SMS phishing detection.

Table 5. Experimental results for single type features.

Features ACC(%) TPR FPR Precision Recall F1-Measure MCC


Token features 97.36289 0.92700 0.06477 0.95805 0.92700 0.94148 0.88840
Topic features 98.09843 0.95562 0.04707 0.96065 0.95562 0.95791 0.91319
LIWC features 98.13427 0.94430 0.05772 0.97224 0.94430 0.95734 0.91385

Feature optimization results analysis. BPSO feature optimization algorithm is used to reduce
feature dimension and improve computational efficiency in this paper. Table 6 shows the feature
optimization results with 10-fold cross validation method. The original features number is 175 and
the average selected features number is 86.6 which is nearly half of the original number. The final
SMS phishing detection accuracy has some improvement with half of the original features
according to Table 3. For the three types features, the average selected features number is 13.4, 25.9
and 47.3. Thinking about the performance of different types features, each type of features is chosen
in a different proportion. The highest chosen proportion 51.8% is presented by topic features and
the second is LIWC features. This is consistent with the classification results in Table 5. Topic
features and LIWC features have better performance, therefore, more associated features are chosen
by features optimization algorithm. In comparison, Token features get relatively low accuracy so
that a small number of features are selected. In general, BPSO feature optimization algorithm
selects the optimal combination of features and show promising performance for SMS phishing
detection.
Table 6. Features optimization results.
Original Feature Average Selected
Type Proportion
Number Number
All features 175 86.6 49.49%
Token features 32 13.4 41.88%
Topic features 50 25.9 51.80%
LIWC features 93 47.3 50.86%

Method comparsion. In order to illustrate the effectiveness of the proposed method, three
related studies with the same dataset are listed in Table 7 for comparison. Almeida et al.[16] firstly
propose the dataset and they finally achieve an accuracy of 97.64% with 30% samples for training
and 70% for testing through comparing different classification methods. With the same number of
training and testing samples, the proposed method in this paper achieves an accuracy of 98.15%
which is higher than 97.64%. Karami and Zhou [17] obtain an accuracy of 99.21% with 480
features. This paper focuses on new token features and new topic features for short text in
conjunction with oversampling method and feature optimization method. Finally, the proposed
method achieves 99.01% accuracy only using an average of 86.6 features. And the new proposed
token features and new topic features have better performance of 97.36% and 98.10% than the same
type features in [17] which are 94.69% and 97.99%. Enaitz et al. [38] combine personality
recognition and sentiment analysis techniques to analyze Short Message Services (SMS) texts and
the best result reaches to a 99.01% which is as same as the best result in this paper. By comparison,
the method proposed in this paper is effective and well performing.

242
Table 7. Results of three previous study.
Reference Results Details
[16] 97.64% With 30% samples for training
[17] 99.21% Using 480 features
[38] 99.01% Combine personality recognition and sentiment analysis

Conclusion
In this paper, a SMS phishing detection framework using oversampling and feature optimization
method is presented through handling with imbalanced data problem and analyzing different types
features. Three types of features are proposed including 32 token features, 50 topic features and 93
LIWC features. An oversampling method called Adaptive Synthetic Sampling (ADASYN) is
applied to solve the problem of imbalanced data existing in SMS phishing detection. Then, this
paper employs Binary Particle Swarm Optimization(BPSO) algorithm for reducing feature
dimension and analyzing the optimal combination of features. Finally, three experiments are
developed using Random Forest classification algorithm to evaluate the performance of the
proposed method. Experimental results show that ADASYN and BPSO methods improve SMS
phishing detection accuracy with three types of features. The highest accuracy is 99.01% with
one-sixth samples as testing data and five-sixth samples as training data. For single type features,
LIWC features and topic features express better performance up to 98.13%. Through BPSO feature
optimization method, nearly half of original features are selected with better accuracy. LIWC
features and topic features have big chosen proportion compared with token features. In general, the
method proposed in this paper has a promising performance for SMS phishing detection.

Acknowledgement
This work is supported by the National Science and Technology Major Project under Grant
no.2017YFB0802800 and the National Natural Science Foundation of China under Grant
no.61602052.

References
[1] Phishing Activity Trends Report 1st Quarter 2018[R]. 2018. Available from:
http://docs.apwg.org/reports/apwg_trends_report_q1_2018.pdf
[2] Gupta B B, Tewari A, Jain A K, et al. Fighting against phishing attacks: state of the art and
future challenges[J]. Neural Computing and Applications, 2017, 28(12): 3629-3654.
[3] Aleroud A, Zhou L. Phishing environments, techniques, and countermeasures: A survey[J].
Computers & Security, 2017, 68: 160-196.
[4] Egele M, Stringhini G, Kruegel C, et al. Compa: Detecting compromised accounts on social
networks[C]//NDSS. 2013.
[5] Marforio C, Masti R J, Soriente C, et al. Personalized Security Indicators to Detect Application
Phishing Attacks in Mobile Platforms[J]. Computer Science, 2015, 06(3):206-212.
[6] Borsack R, Lifson M. Wire B. The Truth About Social Media Identity Theft: Perception Versus
Reality[J]. Business Wire, 2010, 21.
[7] Lemos R. Phishing attacks increasingly focus on social networks, studies show; 2014. Available
from:http://www.eweek.com/security/phishing-attacks-increasingly-focus-on-social-networks-studi
es-show.html.
[8] Kessem L. Rogue mobile apps, phishing, malware and fraud[J]. Speaking of Security, 2012.

243
[9] Islam M R, Abawajy J, Warren M. Multi-tier phishing email classification with an impact of
classifier rescheduling[C]//Pervasive Systems, Algorithms, and Networks (ISPAN), 2009 10th
International Symposium on. IEEE, 2009: 789-793.
[10] Abdelhamid N, Ayesh A, Thabtah F. Associative classification mining for website phishing
classification[C]//Proceedings on the International Conference on Artificial Intelligence (ICAI). The
Steering Committee of The World Congress in Computer Science, Computer Engineering and
Applied Computing (WorldComp), 2013: 1.
[11] Costa G, Ortale R, Ritacco E. X-class: Associative classification of xml documents by
structure[J]. ACM Transactions on Information Systems (TOIS), 2013, 31(1): 3.
[12] Thabtah F, Cowling P, Peng Y. MCAR: multi-class classification based on association
rule[C]//Computer Systems and Applications, 2005. The 3rd ACS/IEEE International Conference
on. IEEE, 2005: 33.
[13] Abdelhamid N, Ayesh A, Thabtah F. Phishing detection based Associative Classification data
mining[J]. Expert Systems with Applications, 2014, 41(13):5948-5959.
[14] Xu Q, Xiang E W, Yang Q, et al. Sms spam detection using noncontent features[J]. IEEE
Intelligent Systems, 2012, 27(6): 44-51.
[15] Gómez Hidalgo J M, Bringas G C, Sánz E P, et al. Content based SMS spam
filtering[C]//Proceedings of the 2006 ACM symposium on Document engineering. ACM, 2006:
107-114.
[16] Almeida T A, Yamakami A. Contributions to the study of SMS spam filtering: new collection
and results[C]// Proceedings of the 11th ACM symposium on Document engineering. ACM,
2011:259-262.
[17] Karami A, Zhou L. Exploiting latent content based features for the detection of static sms
spams[J]. Proceedings of the Association for Information Science and Technology, 2014, 51(1): 1-4.
[18] Mathew K, Issac B. Intelligent spam classification for mobile text message[C]//Computer
Science and Network Technology (ICCSNT), 2011 International Conference on. IEEE, 2011, 1:
101-105.
[19] Zainal K, Sulaiman N F, Jali M Z. An analysis of various algorithms for text spam
classification and clustering using Rapid Miner and Weka[J]. International Journal of Computer
Science and Information Security, 2015, 13(3): 66.
[20] Balubaid, M. A., Manzoor, U., Zafar, B., Qureshi, A., Ghani, N. Ontology Based SMS
Controller for Smart Phones. International Journal of Advanced Computer Science and Applications,
2015, 6(1): 133–139.
[21] Sethi, G., Bhootna, V. SMS Spam Filtering Application Using Android. International Journal of
Computer Science and Information Technologies (IJCSIT), 2014, 5(3): 4624–4626.
[22] Uysal A K, Gunal S, Ergin S, et al. The Impact of Feature Extraction and Selection on SMS
Spam Filtering[J]. Elektronika Ir Elektrotechnika, 2013, 19(5):67-72.
[23] Uysal A K, Gunal S, Ergin S, et al. A novel framework for SMS spam filtering[C]//
International Symposium on Innovations in Intelligent Systems and Applications. IEEE, 2012:1-4.
[24] Almeida T A, Hidalgo J M G, Silva T P. Towards SMS Spam Filtering: Results under a New
Dataset[J]. International Journal of Information Security Science, 2013, 2(1):1-18.

244
[25] Yeboah-Boateng E O, Amanor P M. Phishing, S MiShing & Vishing: an assessment of threats
against mobile devices[J]. Journal of Emerging Trends in Computing and Information Sciences,
2014, 5(4): 297-307.
[26] Warade S J, Tijare P A, Sawalkar S N. An approach for SMS spam detection[J]. Int. J. Res.
Advent Technol, 2014, 2(12): 8-11.
[27] Junaid M B, Farooq M. Using evolutionary learning classifiers to do MobileSpam (SMS)
filtering[C]//Proceedings of the 13th annual conference on Genetic and evolutionary computation.
ACM, 2011: 1795-1802.
[28] Joo J W, Moon S Y, Singh S, et al. S-Detector: an enhanced security model for detecting
Smishing attack for mobile computing[J]. Telecommunications Systems, 2017, 66(1):1-10.
[29] He H, Bai Y, Garcia E A, et al. ADASYN: Adaptive synthetic sampling approach for
imbalanced learning[C]//Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on
Computational Intelligence). IEEE International Joint Conference on. IEEE, 2008: 1322-1328.
[30] He H, Garcia E A. Learning from imbalanced data[J]. IEEE Transactions on knowledge and
data engineering, 2009, 21(9): 1263-1284.
[31] Yan X, Guo J, Lan Y, et al. A biterm topic model for short texts[C]//Proceedings of the 22nd
international conference on World Wide Web. ACM, 2013: 1445-1456.
[32] Cheng N, Chandramouli R, Subbalakshmi K P. Author gender identification from text[J].
Digital Investigation, 2011, 8(1): 78-88.
[33] Tausczik Y R, Pennebaker J W. The Psychological Meaning of Words: LIWC and
Computerized Text Analysis Methods[J]. Journal of Language & Social Psychology, 2010,
29(1):24-54.
[34] More A. Survey of resampling techniques for improving classification performance in
unbalanced datasets[J]. 2016.
[35] Wang B X, Japkowicz N. Imbalanced data set learning with synthetic samples[C]//Proc. IRIS
Machine Learning Workshop. 2004, 19.
[36] Eberhart R, Kennedy J. A new optimizer using particle swarm theory[C]//Micro Machine and
Human Science, 1995. MHS '95., Proceedings of the Sixth International Symposium on. IEEE,
1995: 39-43.
[37] Kennedy J, Eberhart R C. A discrete binary version of the particle swarm
algorithm[C]//Systems, Man, and Cybernetics, 1997. Computational Cybernetics and Simulation.,
1997 IEEE International Conference on. IEEE, 1997, 5: 4104-4108.
[38] Enaitz Ezpeleta, Iñaki Garitano, Urko Zurutuza, et al. Short Messages Spam Filtering
Combining Personality Recognition and Sentiment Analysis[J]. International Journal of Uncertainty,
Fuzziness and Knowledge-Based Systems, 2017, 25(Suppl. 2):175-189.

245

You might also like