
A Method to Measure the Efficiency of Phishing Emails Detection Features

Melad Mohamed Al-Daeef
Faculty of Science and Technology
Universiti Sains Islam Malaysia (USIM)
Nilai, Malaysia
Email: meladmohalda@gmail.com

Nurlida Basir and Madihah Mohd Saudi
Faculty of Science and Technology
Universiti Sains Islam Malaysia (USIM)
Nilai, Malaysia
Email: {nurlida, madihah}@usim.edu.my

Abstract— Phishing is a threat in which users are sent fake emails that urge them to click a link (URL) that takes them to a phisher's website. At that site, users' account information could be lost. Many technical and non-technical solutions have been proposed to fight phishing attacks. To stop such attacks, it is important to select the correct feature(s) to detect phishing emails. Thus, the current work presents a new method for selecting the more efficient feature in detecting phishing emails. The best features can be extracted from the email's body (content) part. Keywords and URLs are known features that can be extracted from the email's body part. These two features are highly relevant to the three general aspects of email: the email's sender, the email's content, and the email's receiver. In this work, three effectiveness criteria were derived from these aspects of email. These criteria were used to evaluate the efficiency of the Keywords and URLs features in detecting phishing emails by measuring their Effectiveness Metric (EM) values. The experimental results obtained from analyzing more than 8000 ham (legitimate) and phishing emails from two different datasets show that relying upon the URLs feature in detecting phishing emails will predominantly give more precise results than relying upon the Keywords feature in such a task.

Keywords— phishing, Keywords feature, URLs feature, ham emails, phishing emails, effectiveness metric.

I. INTRODUCTION

Phishing is an attack that makes Internet users reveal their personal information to an unauthorised party. Most phishing attacks start when users receive fake emails asking them to click a URL (link) to update their account information. Once clicked, this URL delivers the user to a fake website where he/she will most probably lose control over his/her account information. According to the Anti-Phishing Working Group report, the number of URLs used to host phishing attacks increased from 164,023 in the first quarter of 2012 to 175,229 in the second quarter of the same year [1].

To detect phishing emails, it is important to choose the right detection feature(s). Among the various available anti-phishing solutions, a considerable number of features have been suggested to best classify ham (legitimate) and phishing emails. However, in many cases, these features are inappropriately chosen. This is because they are selected based on the author's intuition about their effectiveness in the email classification process [2].

This work presents a method to choose the most efficient feature in detecting phishing emails. The importance of the selected feature is determined by calculating its Effectiveness Metric (EM) value based on three criteria, which are derived from and related to three general aspects of email. These three aspects are the email's sender, the email's content, and the email's receiver.

The rest of this paper is organized as follows. In Section II, we provide a background on feature selection propositions. Section III describes the feature selection process used in this work. This is followed by the process of calculating the features' EM values (Section IV). In Section V, we discuss the most important results obtained in the current work. Finally, the conclusion and suggestions for future work are presented in Section VI.

II. BACKGROUND ON FEATURE SELECTION PROPOSITIONS

An email generally consists of two parts, a header and a body. The email's header is a set of structured fields such as from, to, subject, and routing information. The email's body is the actual content of the email, which is the part users are most concerned with and deal with. The main features for detecting phishing emails can be extracted from these two parts of the email. Such features are identified in [3] and presented in Table I. A detailed description of these features can be found in many studies such as [2],[4].

TABLE I. Main categories of phishing emails detection features

Email's Part                                      Feature / Set of Features
features extracted from email's header            subject-based features
                                                  sender-based features
                                                  behaviour-based features
features extracted from email's body (content)    URL-based features
                                                  keyword-based features
                                                  form-based features
                                                  script-based features
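As a rough illustration of the header/body split described in Section II, the following minimal sketch uses Python's standard `email` package. The sample message, the function name `split_header_and_body`, and the choice to keep only `text/plain` parts are assumptions made for illustration; they are not taken from the tooling used in this study.

```python
from email import message_from_string

# Hypothetical sample message; the address and URL are placeholders.
RAW_EMAIL = """\
From: service@example.com
To: user@example.org
Subject: Account update

Please click http://192.0.2.1/login to update your account.
"""


def split_header_and_body(raw: str) -> tuple[dict[str, str], str]:
    """Return (header fields, body text) for a raw RFC 822 style message."""
    msg = message_from_string(raw)
    # Header part: structured fields such as From, To, Subject.
    headers = {name.lower(): value for name, value in msg.items()}
    # Body part: for multipart mail, keep only the plain-text parts.
    if msg.is_multipart():
        parts = [
            part.get_payload()
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        ]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()
    return headers, body


headers, body = split_header_and_body(RAW_EMAIL)
print(headers["subject"])
print(body)
```

Header-based features (subject, sender) would then be computed from `headers`, and body-based features (keywords, URLs) from `body`.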

978-1-4799-4441-5/14/$31.00 ©2014 IEEE


Table II shows the 9 features that appeared in the top 10 ranks amongst 40 features across the three ham, spam, and phishing datasets used in [2]. Table II also shows the email part to which each of these 9 features belongs. Eight of these 9 features belong to the email's body part. This reflects the significance of the body-based features in distinguishing between ham and phishing emails. Another notable point in Table II is that 4 of the 8 features that belong to the email's body part are URL-based features, which highlights the importance of URL-based features.

TABLE II. Nine features that appeared in the top ten ranks of information gain in [2]

The Feature          Email's Part From Which the Feature Was Extracted
body_richness        although they belong to different categories, these
body_noCharacters    features are extracted from email's body (content) part.
body_noWords
body_html
URL_noLinks
URL_maxNoPeriods
URL_noExtLinks
URL_noDomains
subj_richness        extracted from email's subject part.

Based on an analysis of common properties of phishing emails, the researchers in [5] chose 18 keywords as features for email classification. These keywords were associated with emails that conveyed a sense of threat. They have been used repeatedly in the literature; for example, 14 of them were used in [6]. These 18 keywords were used in this study as well and are listed in Table IV.

III. FEATURE SELECTION PROCESS

This section describes the process of calculating the EM values of the Keywords and URLs features. The EM values of these two features were calculated in order to compare their efficiency in detecting phishing emails. Since the email's body is the part that users are most concerned with and pay attention to, the features extracted from this part of the email are assumed to have higher importance in detecting phishing attempts than the features extracted from the email's header part, and many of the cues that influence the user's decision about the email(s) in question can be found in the email's body part [7]. The Body_no_FunctionWords feature (used in [2], and called the Keywords feature in this study) is a content-based feature that is not listed in Table II above. However, this feature showed its importance in the experiment conducted in [2]: it was ranked as the 1st, 16th, and 13th best amongst 40 features in three combinations of the three datasets analyzed in that experiment.

In this work, we have focused on the Keywords and the URLs features, which are extracted from the email's body part, because these two features have considerable importance, as shown in Table II. The Keywords feature was used to count occurrences of the selected 18 keywords in the two types of analyzed emails, whereas the URLs feature was used to count the presence and absence of fake URL indications in these emails.

A. Feature's Effectiveness Criteria

By considering the email's sender, the email's content, and the email's receiver, we have derived three effectiveness criteria, which are used to calculate the EM values of the Keywords and URLs features and hence to compare their efficiency in detecting phishing emails. Each of these three criteria is given 1/3 of the effectiveness weight (effectiveness/3). Table III shows these effectiveness criteria and the aspect of the email to which each criterion relates.

TABLE III. Features' effectiveness criteria

Criterion No.   The Criterion                                          Email's Aspect   Effectiveness Weight
1               the phisher must employ the evaluated feature to       sender           1/3
                perform its attack.
2               the evaluated feature can be relied upon to make a     content          1/3
                correct decision about the email(s) in question.
3               the evaluated feature must have strong relevance       receiver         1/3
                to the user.

Criterion No. 2 in Table III means that evaluating the feature should not lead to a False Positive (FP) or False Negative (FN) decision. There are four types of decisions, which are defined as follows:
• True Positive (TP): The email is correctly classified as ham.
• True Negative (TN): The email is correctly classified as phishing.
• False Positive (FP): The email is incorrectly classified as ham.
• False Negative (FN): The email is incorrectly classified as phishing.

B. Analyzed Datasets

One publicly available dataset from the phishing corpus of 4450 phishing emails [8] and one publicly available dataset from the TREC corpus of about 25000 ham and 52000 spam emails [9] were analyzed to measure the efficiency of the Keywords and URLs features in detecting phishing emails and then to compare their efficiency against each other. In this experiment, all 4450 emails in the phishing corpus and the same number of ham emails from the TREC corpus were analyzed. Since this experiment was concerned with the appearance frequency of each feature in the analyzed emails from the two selected datasets, we randomly chose 4450 ham emails to match the number of emails in the phishing dataset. Spam emails are outside the scope of this experiment.
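The random selection of ham emails that matches the size of the phishing set can be sketched as follows. This is only an illustration: the study does not state how the random choice was made, so the function name `balance_ham_to_phishing`, the fixed seed, and the placeholder identifiers are all assumptions.

```python
import random


def balance_ham_to_phishing(ham_ids, phishing_ids, seed=0):
    """Randomly pick as many ham emails as there are phishing emails.

    The seed is an assumption added for reproducibility.
    """
    if len(ham_ids) < len(phishing_ids):
        raise ValueError("not enough ham emails to match the phishing set")
    rng = random.Random(seed)
    # Sampling without replacement, so no ham email is analyzed twice.
    return rng.sample(list(ham_ids), k=len(phishing_ids))


# Placeholder identifiers sized like the corpora described in the text.
ham_ids = [f"ham-{i}" for i in range(25000)]
phishing_ids = [f"phish-{i}" for i in range(4450)]

ham_sample = balance_ham_to_phishing(ham_ids, phishing_ids)
print(len(ham_sample))
```

Sampling equal-sized sets keeps the raw occurrence counts of the two email types directly comparable.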
C. Results of Datasets Analysis

This section describes the process of analyzing the emails in the two selected datasets. This process was conducted in two phases. In the first phase, the emails in the two datasets were analyzed for occurrences of the selected keywords in the emails' content, whereas in the second phase, these emails were analyzed for occurrences of fake URL indications in their content. In this experiment, the EditPad Pro 7 trial version [10] was used to analyze more than 8000 ham and phishing emails. EditPad Pro is a text editor with a powerful regular expression (regex) engine, which is mainly used for pattern matching on strings.

1) Results of analyzing emails for keywords occurrences

Before the emails were analyzed for occurrences of the selected keywords, all URLs and email addresses were removed from all analyzed emails, to avoid counting any selected keyword that appears inside these URLs and/or email addresses. Table IV shows the appearance frequency of the selected keywords in each set of analyzed ham and phishing emails.

TABLE IV. Appearance frequency of selected keywords in the analyzed ham and phishing emails

Keyword          Ham      Phishing
account          569      19044
access           682      4540
bank             780      5834
credit           863      2063
click            616      3451
identify         306      65
inconvenience    20       1081
information      1621     6514
limited          441      1820
login            103      1762
minute           192      392
password         162      1603
recently         214      1040
risk             727      78
social           81       122
security         459      3949
service          2436     3689
suspend          44       1407
Total            10316    58454

2) Results of analyzing emails for fake URLs indications occurrences

Fake URLs are the phishers' instrument for performing phishing attacks. Most phishing scenarios start by sending fake URLs to the potential victims (users) to deliver them to the intended forged website. Due to the importance of URLs in performing phishing attacks, the emails in the two types of datasets were analyzed to count the occurrences of fake URL indications in the content of these emails. As a pre-analysis step, we reviewed three techniques that are used to produce fake URLs. A URL is considered to carry a fake URL indication if it was produced by using any of these techniques. These three and other techniques for making fake URLs have been investigated in many studies such as [3],[5],[11],[12],[13].

• Hiding the actual (destination) URL from the user is one of the techniques for making fake URLs. This trick can be performed by using the window.status property. The following code is an example from a phisher's eBay email that uses this technique:

  <A onmouseover="window.status='https://www.eBay.com/cgibin/webscr?cmd=_login-run'; return true" onmouseout="window.status='https://www.eBay.com/cgibin/webscr?cmd=_login-run'" href="http://chaseupgrade.com/news/webscr.dll">https://www.eBay.com/cgibin/webscr?cmd=_login-run</A>

  When the user hovers the mouse over this link, the status bar will show the following address:
  https://www.eBay.com/cgibin/webscr?cmd=_login-run,
  however, once clicked, this link will take the user to the following site:
  http://chaseupgrade.com/news/webscr.dll.

• If the URL contains an IP address instead of the website's name, this is an indication of a suspicious URL. For example, the URL
  http://212.33.67.194/.citibank/accountexpirycheck.net
  seems to take the user to the website of Citibank, whereas in fact it takes the user to another, fraudulent website.

• Using hexadecimal character codes is another technique for making fake URLs or links. In the following link,
  http://www.fastvisa.com%00@:%32%34%2e%37%36%2e%38%39%2e%36%34:38,
  only www.fastvisa.com is displayed in the address bar, whereas the browser window displays the site at "@:%32%34%2e%37%36…", which is the fraudulent website's IP address hidden in hexadecimal character codes.

Experimental results in Table V revealed that there is a significant difference in the appearance frequency of fake URL indications between the two types of analyzed emails. Moreover, the total number of fake URL indications (4365) is almost equal to the number of emails in the phishing dataset (4450). This gives evidence that phishers cannot avoid employing fake URLs to perform phishing attacks. On the other hand, the results show that fake URL indications are almost nonexistent in the analyzed ham emails. When the 24 links that contain IP addresses in the analyzed ham emails were checked, they were not found to indicate fake URLs, since most of them did not contain any other domain names, whereas others were found to link to image files, for example. Examples of URLs containing IP addresses found in the analyzed ham emails are:
http://64.214.225.20/aopa/rdr.asp?id=7D11018113343241F,
http://208.185.40.7/charts/images/newsAVWIX.jpg.

TABLE V. Appearance frequencies of fake URLs' indications in analyzed emails

Fake URLs' Indication                    In Ham Emails    In Phishing Emails
hiding the actual link from the user     0                353
the link contains IP address             24               3054
using hexadecimal character codes        0                958
Total                                    24               4365

IV. CALCULATING FEATURES' EFFECTIVENESS METRIC VALUES

The experimental results in Tables IV and V were used to calculate the EM values of the Keywords and the URLs features against the effectiveness criteria listed in Table III. Formula (1) was used to calculate the EM values of the Keywords and URLs features.

\[ EM(f) = \sum_{i} W_i C_i \tag{1} \]

where EM(f) is the effectiveness metric value that the feature f has gained, W_i is the weight of criterion i, and C_i is 1 if criterion i is met by the feature, or 0 otherwise. This formula was used in [14] to calculate the total number of excluded security requirements that put the system at risk of possible attacks. Before calculating the EM value of each feature, we examined each feature against each criterion listed in Table III to see which criteria are met by the feature and which are not; Table VI shows the outcome.

A. Calculating the EM Value of Keywords Feature

The Keywords feature has met criterion No. 1 because the phisher must employ this feature to urge the targeted victims to visit the intended fake website; thus, this feature has gained 1/3 of the effectiveness weight against criterion No. 1. Because the selected keywords occur frequently in both types of analyzed emails, as shown in Table IV, it is unreasonable to rely upon the Keywords feature as a basis for distinguishing between ham and phishing emails; thus, the Keywords feature has gained 0/3 of the effectiveness weight against criterion No. 2. Since users must read the email's content, the Keywords feature is strongly bound to the user; therefore, this feature has gained another 1/3 of the effectiveness weight against criterion No. 3. Table VI shows that the Keywords feature has met only 2 of the 3 effectiveness criteria listed in Table III.

Using formula (1) to calculate the EM value of the Keywords feature as follows:

\[ EM(\text{Keywords}) = \left(\frac{1}{3} \times 1\right) + \left(\frac{1}{3} \times 0\right) + \left(\frac{1}{3} \times 1\right) = \frac{1}{3} + 0 + \frac{1}{3} = \frac{2}{3} \]

we found that the Keywords feature has gained only 2/3 of the effectiveness weight against the effectiveness criteria.

B. Calculating the EM Value of URLs Feature

Since fake URL indications have a prominent presence in the analyzed phishing emails compared to their minor presence in the analyzed ham emails, as shown in Table V, this feature has gained 1/3 of the effectiveness weight against criterion No. 1 (listed in Table III). This prominent presence of fake URL indications in the analyzed phishing emails makes this feature reliable enough to produce a correct decision about the email(s) in question; therefore, the URLs feature has gained another 1/3 of the effectiveness weight against criterion No. 2 as well. Due to the users' "I want it now" attitude [15],[16], most users find it easy and convenient to click a URL in the email's message instead of typing the correct address of the intended website in the address bar. Because of this attitude, this feature has considerable relevance to the user; therefore, the URLs feature has also gained another 1/3 of the effectiveness weight against criterion No. 3. Table VI shows that the URLs feature has met all three criteria listed in Table III.

Using formula (1) to calculate the EM value of the URLs feature as follows:

\[ EM(\text{URLs}) = \left(\frac{1}{3} \times 1\right) + \left(\frac{1}{3} \times 1\right) + \left(\frac{1}{3} \times 1\right) = \frac{1}{3} + \frac{1}{3} + \frac{1}{3} = \frac{3}{3} = 1 \]

we found that the URLs feature has gained 3/3 (i.e. 1) of the effectiveness weight against the effectiveness criteria.

TABLE VI. Features' EM values

Feature Name    Criterion 1    Criterion 2    Criterion 3    Effectiveness Weight
Keywords        met            not met        met            2/3
URLs            met            met            met            3/3

V. EXPERIMENT RESULT ANALYSIS

This section presents a brief analysis of the results in Sections III and IV. The appearance frequency of the selected keywords in Table IV shows that each of these keywords has a considerable frequency in both the analyzed ham and phishing emails, although most of them occur less often in the ham emails than in the phishing ones. Based on that, relying upon the Keywords feature to distinguish between phishing and ham emails will produce a high rate of FP and FN results.
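The two mechanical steps of the method, flagging the three fake URL indications of Section III and evaluating formula (1) of Section IV, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the regular expressions below are illustrative guesses, not the EditPad Pro patterns used in the experiment (which are not given), and the names `fake_url_indications` and `effectiveness_metric` are hypothetical. Whether a criterion is met remains a judgment made by the analyst; the code only performs the weighted sum.

```python
import re
from fractions import Fraction

# Assumed patterns, one per indication listed in Section III.
# 1) The real target is hidden by overwriting the status bar text.
HIDDEN_LINK = re.compile(r"window\.status\s*=", re.IGNORECASE)
# 2) The URL host is a dotted-decimal IP address instead of a name.
IP_HOST = re.compile(
    r"https?://(?:\d{1,3}\.){3}\d{1,3}(?=[:/?#\s\"'<>]|$)", re.IGNORECASE
)
# 3) Hexadecimal (percent-encoded) characters before the first "/",
#    i.e. inside the host part of the URL.
HEX_CODES = re.compile(r"https?://[^/\s\"'<>]*%[0-9a-f]{2}", re.IGNORECASE)


def fake_url_indications(email_body: str) -> dict[str, bool]:
    """Report which of the three indications appear in an email body."""
    return {
        "hidden_link": bool(HIDDEN_LINK.search(email_body)),
        "ip_address": bool(IP_HOST.search(email_body)),
        "hex_codes": bool(HEX_CODES.search(email_body)),
    }


def effectiveness_metric(criteria_met, weights=None) -> Fraction:
    """Formula (1): EM(f) = sum of W_i * C_i, with C_i in {0, 1}.

    Equal weights (1/3 each for three criteria) are used by default.
    """
    if weights is None:
        weights = [Fraction(1, len(criteria_met))] * len(criteria_met)
    return sum(
        (w * int(c) for w, c in zip(weights, criteria_met)), Fraction(0)
    )


# Criteria outcomes as judged in Section IV (sender, content, receiver).
em_keywords = effectiveness_metric([True, False, True])
em_urls = effectiveness_metric([True, True, True])
print(em_keywords, em_urls)

# URL examples taken from the text.
citibank = "http://212.33.67.194/.citibank/accountexpirycheck.net"
fastvisa = "http://www.fastvisa.com%00@:%32%34%2e%37%36%2e%38%39%2e%36%34:38"
print(fake_url_indications(citibank))
print(fake_url_indications(fastvisa))
```

Note that a pattern match is only an indication: as observed for the 24 IP-address links in the ham emails, a flagged URL may still turn out to be harmless on inspection.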
Regarding the fake URL indications, Table V shows that these indications occur far more frequently in the phishing emails than in the ham ones. This highlights the role that the URLs feature can play in distinguishing between ham and phishing emails. It also points out the importance of the URLs feature for the phisher in performing its attack(s).

Section IV shows the results of calculating the EM values of the Keywords and URLs features. The results show that the URLs feature has a higher EM value than the Keywords feature, 3/3 (i.e. 1) and 2/3 respectively, as Table VI also shows. This finding gives evidence that relying upon the URLs feature in detecting phishing emails will give much more accurate results than relying upon the Keywords feature in such a task.

VI. CONCLUSION AND FUTURE WORK

In this paper, a new method to calculate the Effectiveness Metric (EM) values of email classification features was proposed. The EM values of the Keywords and URLs features were calculated after examining these features against three criteria derived from the three general aspects of email. These aspects are the email's sender, the email's content, and the email's receiver (Section III). Our experimental results show that the URLs feature has a higher EM value than the Keywords feature, as they gained EM values of 3/3 and 2/3 respectively. The URLs feature has gained 3/3 of the effectiveness weight since it has met all three effectiveness criteria listed in Table III, whereas the Keywords feature has gained only 2/3 of the scale since it has met only two of these effectiveness criteria. Based on these results, it can be concluded that the URLs feature is more reliable in detecting phishing emails than the Keywords feature if applied in a proper phishing detection technique.

The method proposed in this study can open the way for further research on evaluating the efficiency of phishing email detection features. Although this study was limited to evaluating only the Keywords and the URLs features, the proposed method can be applied in future experiments to evaluate other types and categories of phishing email detection features. Another limitation of this work is that the results were based on the evaluation of only two features. However, a future experiment can be carried out to evaluate more than two features, either individually or in groups based on their categories.

REFERENCES
[1] http://docs.apwg.org/reports/apwg_trends_report_q2_2012.pdf
[2] Toolan, F. & Carthy, J. Feature selection for spam and phishing detection. In: eCrime Researchers Summit (eCrime), 2010. IEEE, 1-12.
[3] Hamid, I. R. A. & Abawajy, J. 2011. Hybrid feature selection for phishing email detection. Algorithms and architectures for parallel processing. Springer.
[4] Khonji, M., Iraqi, Y. & Jones, A. 2012. Enhancing Phishing E-Mail Classifiers: A Lexical URL Analysis Approach. International Journal for Information Security Research (IJISR), 2.
[5] Chandrasekaran, M., Narayanan, K. & Upadhyaya, S. Phishing email detection based on structural properties. In: NYS Cyber Security Conference, 2006. 1-7.
[6] Sanchez, F. & Duan, Z. A sender-centric approach to detecting phishing emails. In: ASE/IEEE International Conference on Cyber Security, Washington DC, USA, 2012.
[7] Kumaraguru, P., et al. School of phish: a real-world evaluation of anti-phishing training. In: Proceedings of the 5th Symposium on Usable Privacy and Security, 2009. ACM.
[8] Phishing Corpus. http://monkey.org/~jose/wiki/doku.php. Accessed 24 Apr 2013.
[9] Cormack, G. V. & Lynam, T. R. TREC 2005 Spam Track Overview. In: TREC, 2005.
[10] EditPad Pro 7 trial version is available at http://www.editpadpro.com/
[11] Salem, O., Hossain, A. & Kamala, M. Awareness Program and AI based Tool to Reduce Risk of Phishing Attacks. In: Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, 2010. IEEE.
[12] Fette, I., Sadeh, N. & Tomasic, A. Learning to detect phishing emails. In: Proceedings of the 16th international conference on World Wide Web, 2007. ACM, 649-656.
[13] Suriya, R., Saravanan, K. & Thangavelu, A. An integrated approach to detect phishing mail attacks: a case study. In: Proceedings of the 2nd international conference on Security of information and networks, 2009. ACM.
[14] Abdulrazeg, A. A., Norwawi, N. M. & Basir, N. Security metrics to improve misuse case model. In: Cyber Security, Cyber Warfare and Digital Forensic (CyberSec), 2012 International Conference on, 2012. IEEE.
[15] Ferguson, A. J. 2005. Fostering e-mail security awareness: The West Point carronade. EDUCAUSE Quarterly, 28, 54-57.
[16] Kumaraguru, P., et al. Teaching Johnny not to fall for phish. ACM Transactions on Internet Technology (TOIT), 2010. 10(2): p. 7.
