Bagui, S., Nandi, D., Bagui, S., & White, R. J. (2021). Machine learning and deep learning for phishing email
classification using one-hot encoding. Journal of Computer Science, 17, 610–623.
https://doi.org/10.3844/jcssp.2021.610.623
Document Version: Published (Version of record)
Repository homepage:
https://uwf-flvc-researchmanagement.esploro.exlibrisgroup.com/esploro/?institution=01FALSC_UWF
CC BY V4.0
© 2021 Sikha Bagui, Debarghya Nandi, Subhash Bagui and Robert Jamie White. This open access article is
distributed under a Creative Commons Attribution (CC-BY) 4.0 license.
Journal of Computer Science
Sikha Bagui et al. / Journal of Computer Science 2021, 17 (7): 610.623
DOI: 10.3844/jcssp.2021.610.623
in nature. Static features are not robust enough to handle new and emerging phishing patterns. Fraudsters are not static in their activities and are constantly changing their mode of operation to stay undetected. This motivates researchers to seek effective techniques that can handle both known and emerging fraudulent patterns, hence the focus on machine learning algorithms (Akinyelu and Adewumi, 2014).

The novelty of this study is in applying deep semantic analysis with Deep Learning (DL) and Machine Learning (ML) to capture inherent characteristics of email text to classify emails as phishing or non-phishing. Until very recently, little work has been devoted to semantic analysis in phishing detection (Zhang et al., 2017), let alone phishing email detection. Representation of text is a significant task in Natural Language Processing (NLP) and in recent years DL has been widely used in various NLP tasks like topic classification, sentiment analysis and language translation (Zhang et al., 2017; LeCun et al., 2015). Word representations in NLP can be categorized into four classes: one-hot representations, distributional representations, clustering based word representations and distributed representations (Zhang et al., 2017; Lai et al., 2016). This study focuses on one-hot representations.

Hence in this study, semantic analysis using one-hot encoding was used to preprocess the data, before using DL and ML techniques to classify emails. A comparison of the parameters and hyperparameters was performed for the DL models, Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM). The results of the ML algorithms, Naïve Bayes, SVM and Decision Tree, and the DL algorithms, CNN and LSTM, are presented in terms of accuracy and computation time. A comparison was also performed using CNN with Word Embedding, demonstrating the usefulness of semantic analysis using Word Embedding versus one-hot encoding.

The rest of this study is organized as follows. Section 2 presents the related works. Section 3 presents an overview of the methodology used. Sections 4 and 5 present the DL and ML algorithms respectively and Sections 6 and 7 present the results and conclusions respectively.

Related Works

Since phishing emails are becoming a significant threat, there are several recent works on phishing email detection. Yang et al. (2019) looked at phishing email detection based on 18 hybrid features including email-header structure, email-URL information and email-script function. SVM was used for classification and a classification accuracy of 95% was reached. Rawal et al. (2017) compared various classifiers, SVM, Random Forest, Logistic Regression, Naïve Bayes and Voted Perceptron, to classify emails as phishing or ham and achieved a maximum accuracy of 99.87%. Yasin and Abuhasan (2016) proposed an intelligent classification model of phishing email detection using a concept of phishing term weighting, which evaluates the weights of phishing terms in each email. Text stemming and WordNet ontology were also used. They achieved an accuracy of 99.1% using Random Forest and 98.4% using J48. Rastenis et al. (2021) presented a solution based on the email message body using text classification, classifying emails into spam and phishing emails. Fang et al. (2019) presented a recurrent CNN model with multilevel vectors and proposed a new phishing email detection model, Themis. Themis was used to model emails at the email header, email body, character level and word level simultaneously. They reached a very high classification accuracy of 99.84%. Verma et al. (2020) looked at the classification of phishing emails using NLP.

The rest of this section focuses on works related to other techniques used in detecting phishing emails. These works can be divided into four broad categories: (i) works that used blacklist phishing detection methods; (ii) works that used heuristic phishing detection methods; (iii) works that used visual phishing detection methods; and (iv) works that used machine learning approaches.

Works that used Blacklist-based Phishing Detection Methods

The blacklist approach depends on building and maintaining a list of phishing websites. Though blacklist-based methods are easy to implement, they have the obvious deficiency of not being able to detect zero-hour phishing attacks, that is, attacks that have never happened before (Zhang et al., 2017).

Works that used Heuristic Phishing Detection Methods

Heuristic phishing detection identifies unknown web pages using features extracted from phishing attacks, but heuristic phishing rules are difficult to update because the rules are derived from statistical features or manual summaries (Zhang et al., 2017).

Yu et al. (2009) and Zhang et al. (2007) developed heuristic-based phishing detection systems achieving relatively low False Positive (FP) and False Negative (FN) rates. Prakash et al. (2010) used a combination of blacklist and heuristic methods to achieve low FP and FN rates.

Works that used Visual Similarity Methods

Cao et al. (2008) and Bergholz et al. (2010) made use of features involving processing of images (for example, a company logo, copyright and other visual elements) to determine visual similarity, but this leads to higher run time and space needs.
Works that used Machine Learning Based Phishing Detection Methods

The key point of ML based phishing detection methods is to capture inherent patterns of phishing sites, hence the use of features. Basnet et al. (2008) used structural features in phishing emails, like the HTML format, IP-based address, age of domain name, number of domains, number of sub-domains, presence of JavaScript, presence of form tags, number of links, URL based image source, matching domains and keywords, to classify emails using ML methods like SVM, neural networks, Self Organizing Maps (SOMs), K-means and ROC curves. Using k-means clustering, they achieved an accuracy of around 90%.

Chandrasekaran et al. (2006) extracted a total of 25 structural features for the classification of emails using SVM. They achieved 95% accuracy. Sarju and Thomas (2014) also used structural properties to detect spam emails, using Naïve Bayes, Adaboost and Random Forest for classification and comparison. Using a majority voting algorithm (using J48, SVM and IB1), they achieved a high accuracy of 99.8%. Bergholz et al. (2008) determined a feature set and used Random Forest and SVM to measure accuracy of detection. Ma et al. (2009b) extracted 43 features, used six different classifiers and found Random Forest outperforming the rest. Akinyelu and Adewumi (2014) also used a group of 15 features to classify emails using Random Forest. Fette et al. (2007) made use of ML techniques to achieve very low False Positive and False Negative rates.

Blum et al. (2010), Ma et al. (2009a; 2009b) and Feroz and Mengel (2014; 2015) proposed URL-based phishing detection methods, claiming accuracy of over 90%, but such measures turn out to be unstable since URLs can be manipulated easily. Xiang et al. (2011) proposed an improved URL-based method, but this method was overly dependent on third-party services.

Zhang et al. (2017) applied semantic analysis in phishing detection. This study focused on phishing websites rather than phishing emails and extracted various semantic features through word2vec to better describe the features of phishing sites. The authors achieved very low error rates and though semantic features analyzed with ML algorithms do a good job of analyzing phishing websites, little attention had been paid to semantic analysis before this study.

Eckhardt and Bagui (2021) used deep neural networks, Long Short Term Memory (LSTM) and Convolutional Neural Networks (CNN) for phishing email classification. Though this study performed comparably to our method, it did not take into account the processing time.

Though many of the above works have achieved high accuracy, none of them have analyzed the run time component, which we also do in this study. So, the main contributions of this study are:

• Determining if one-hot encoding is a useful pre-processing measure for phishing email classification
• Determining if Word Embedding is useful in phishing email classification (though this was only done with CNN)
• Using one-hot encoding, determining if ML or DL models are better in phishing email classification
• Determining which parameters and hyperparameters would give better results in DL
• Determining which ML or DL models perform the best in terms of computation time using one-hot encoding

Methodology

Data

For this study, an email dataset, collected from App River, a company headquartered in Pensacola, Florida, USA, that offers secure cloud-based cybersecurity solutions, was used. This dataset contained 18,366 labelled emails, of which 3,416 were phishing emails and 14,950 were regular emails. The emails were collected across US industries: insurance, law, medical, hotel, school, banking and real estate marketing. Each of these emails contained a subject as well as body text and the number of users that the email was sent to. This study focuses on analyzing the text content of the emails and classifying them as phishing emails or not. 70% of the data was used for training and the rest for testing.

Data Preprocessing: One-hot Encoding

ML algorithms cannot operate directly on textual data. Data has to be numeric. Hence, in this study, for data preprocessing, the email text was encoded as one-hot vectors. One-hot encoding uses a sparse vector in which one element is set to 1 and all other elements are set to 0 and is a commonly used technique to represent strings that have a finite set of values. Using one-hot encoding, high cardinality will lead to high dimensional feature vectors. But since one-hot encoding is simple, it is a widely-used encoding method (Davis, 2010; Cerda et al., 2018). One-hot encoding is effective for sentences or tweets that do not contain many repeated elements and is usually applied to models that have good smoothing properties. One-hot encoding is commonly used in neural networks, whose activation functions require input to be in the discrete range of [0,1] or [-1,1] (O’REILLY, 2021).

A one-hot vector is a 1 × N matrix (vector) consisting of 0s in all cells of the vector with the exception of a single 1 in a cell used to uniquely identify a word. One-hot encoding allows for more expressive representation of categorical data. A sample word range of [good, good, bad] would be represented in 3 such encodings [0, 0, 1], as shown in Table 1.
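The one-hot representation described in the Data Preprocessing section can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the helper names `build_vocab` and `one_hot`, and the choice of alphabetical index order, are assumptions made only for illustration.

```python
def build_vocab(tokens):
    """Map each distinct token to an index (sorted order is an arbitrary choice)."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def one_hot(token, vocab):
    """Return a 1 x N list with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


tokens = ["good", "good", "bad"]
vocab = build_vocab(tokens)                    # {'bad': 0, 'good': 1}
encoded = [one_hot(t, vocab) for t in tokens]
print(encoded)                                 # [[0, 1], [0, 1], [1, 0]]
```

The vector length N equals the vocabulary size, which is why a high-cardinality vocabulary leads to high dimensional feature vectors.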
CNN with Word Embedding

CNN was also applied with Word Embedding. Word Embedding is a technique where individual words are represented as sparse vectors based on the context window and then mapped to a low dimensional vector space. Due to its inability to represent idiomatic phrases, vector representations of individual words are used to make the model more expressive. Continuous Bag of Words (CBOW) is a popular form of Word Embedding. Using Word Embedding, words with similar meaning will have similar representation. To understand the semantic representation of words in the email text, that is, the representation of the context of the word, CBOW was used. A context may be a single word or a group of words based on the context window. Each word, represented in the form of one-hot encoding, can have multiple data points as context, as shown in Table 2. In Table 2, “Sample” in the text “This is a sample CBOW example” would have two contexts with a 1-word context window. Another example demonstrating semantic representation would be, in the sentence “Taj Mahal in India is one of the seven wonders of the world,” “Taj Mahal” has a high probability of having a semantic similarity with ‘India’ rather than any other country, because they usually appear in the same context. This allows it to distinguish words with different meanings, for example ‘crane’ (a bird; an instrument used to lift heavy objects; to crane your neck etc.)

Architecture of our CNN with Word Embedding Model

The CNN architecture with Word Embedding used in this study is represented in Fig. 4. This consists of an input layer, an embedding layer with hidden nodes and finally a dense layer with a single hidden node. A sigmoid activation function is used in the final layer.

Machine Learning Models

The machine learning algorithms, Naïve Bayes, SVM and Decision Tree, that have been previously used in text classification, were used for comparison with the DL methods. These algorithms were implemented using Weka (https://www.cs.waikato.ac.nz/ml/weka/).

Naïve Bayes

The Naïve Bayes classifier assumes class-conditional independence. That is, the value of an attribute for a given class is independent of the values of the other attributes (Han et al., 2011). This simplifies the computations involved and, being relatively robust, easy to implement, fast and accurate, Naïve Bayes classifiers have been used in many different fields including text classification (Zhang and Gao, 2011; Gupta and Lehal, 2009) and sentiment analysis (Vadivukarassi et al., 2017). Vadivukarassi et al. (2017) used an auxiliary feature to reclassify the text space for classification.
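The independence assumption behind Naïve Bayes can be made concrete with a small sketch. This is a hand-written multinomial variant with Laplace smoothing, not the Weka implementation used in the study; the function names `train_nb` and `predict_nb` and the toy emails are assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate log priors and Laplace-smoothed per-class word log-likelihoods."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    model = {}
    for y, n_docs in class_counts.items():
        total = sum(word_counts[y].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            w: math.log((word_counts[y][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
        model[y] = (log_prior, log_like)
    return model


def predict_nb(model, doc):
    """Add per-word log-likelihoods independently; unseen words are skipped."""
    scores = {
        y: log_prior + sum(log_like[w] for w in doc if w in log_like)
        for y, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)


# Toy, made-up training data (not from the paper's dataset).
docs = [
    ["verify", "your", "account", "now"],
    ["account", "suspended", "click", "link"],
    ["meeting", "agenda", "for", "monday"],
    ["lunch", "on", "monday"],
]
labels = ["phishing", "phishing", "regular", "regular"]
model = train_nb(docs, labels)
print(predict_nb(model, ["click", "to", "verify", "account"]))  # phishing
print(predict_nb(model, ["monday", "meeting"]))                 # regular
```

Because each word contributes its own additive log term, the class score is computed in a single pass over the email, which is the computational simplification the independence assumption buys.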
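The context-window idea from the CBOW discussion (the word "sample" having two contexts with a 1-word window) can be sketched as follows. The helper name `context_pairs` is hypothetical and the snippet only enumerates contexts; it does not train an embedding.

```python
def context_pairs(tokens, window=1):
    """Return (target, context_words) for each position in the token list."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs


tokens = "this is a sample cbow example".split()
pairs = context_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "<-", context)
# The entry for "sample" is ('sample', ['a', 'cbow']): two context words.
```

Widening `window` gives each target more context words, at the cost of more training pairs per sentence.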
$$\left[\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\; 1 - y_i\left(\vec{w}\cdot\vec{x}_i - b\right)\right)\right] + \lambda\lVert\vec{w}\rVert^{2}$$

Many works have used SVM for text categorization (Joachims, 1998; Shin and Paek, 2018; Pilászy, 2005; Fatima and Srinivasu, 2017; Tong and Koller, 2001). SVM was used in this project because of the following SVM features (Joachims, 1998), which are applicable to our work:

(i) SVM can handle high dimensional input space
(ii) SVM does not need to remove irrelevant features through feature selection - it can use all the features, since in this case most of the features will contain some information
(iii) SVM works well with sparse vectors
(iv) Most text categorization problems are linearly separable

Decision Tree

Decision Tree is a popular ML tool that uses a tree-like structure to represent all the chance events and their possible outcomes in different conditions. The advantages of decision trees are fast speed, relatively high accuracy and an easily understood classification model. But decision trees have a problem with high dimensional data, leading to low algorithm efficiency. Decision tree calculations are based on Information Gain: the attribute with the highest information gain is at the root of the tree, the attribute with the next highest information gain is at the next level of the tree and so forth. The leaves of the tree are the final decisions of the classifier (Han et al., 2011).

convolution layers. Accuracy gradually improved with the increase in the number of feature maps.

For CNN with Word Embedding

The number of embedding nodes played a primary role in this accuracy.

All DL models were compiled using the Adam Optimizer and a loss function of Binary Cross Entropy. Though ReLU has been the most effective activation function because it does not suffer from the vanishing gradient problem, sigmoid was used for the final Dense layer. It performed well for the binary data.

Long Short Term Memory Results

To determine the performance of the LSTM model, the number of hidden nodes was varied and accuracy was determined with and without dropouts. Dropout refers to removing a random selection of units (hidden as well as visible) in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. Dropouts prevent overfitting. In the simplest case, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks. For input units, however, the optimal probability of retention is usually closer to 1 than to 0.5 (Srivastava et al., 2014). The LSTM experimental results are presented in Table 3.
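The retention rule for dropout (each unit kept with a fixed probability p, independently of the others) can be sketched in a few lines. This is an illustration using only the standard library, not the framework layer used in the experiments; the function name `dropout` and the rescaling of kept units by 1/p ("inverted dropout") are assumptions of this sketch.

```python
import random


def dropout(activations, p_retain=0.5, rng=None):
    """Keep each unit with probability p_retain, independently of the others.

    Kept units are scaled by 1 / p_retain (an assumption of this sketch) so
    that the expected activation is unchanged during training.
    """
    rng = rng or random.Random()
    return [a / p_retain if rng.random() < p_retain else 0.0 for a in activations]


rng = random.Random(0)                     # fixed seed for repeatability
hidden = [1.0] * 10_000
dropped = dropout(hidden, p_retain=0.5, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
print(round(kept_fraction, 2))             # expected to be close to 0.5
```

Setting `p_retain` closer to 1 drops fewer units and therefore regularizes less, which matches the guidance for input units.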
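The Information Gain criterion from the Decision Tree section can be worked through on a toy example. The sketch below uses Shannon entropy and made-up binary features; the function names and the data are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the attribute at index attr."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up binary features per email: (contains_link, has_greeting)
rows = [(1, 0), (1, 1), (0, 1), (0, 0)]
labels = ["phishing", "phishing", "regular", "regular"]
gains = [information_gain(rows, labels, a) for a in range(2)]
root = max(range(2), key=gains.__getitem__)
print(gains, root)   # [1.0, 0.0] 0
```

Here the first attribute separates the classes perfectly, so it has the highest gain and would be placed at the root of the tree.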
The highest accuracy was achieved without dropout at 95.7% and with dropout at 95.91%, at 128 hidden nodes. The lowest computational time, however, was at 16 hidden nodes, at 193.7 ms. LSTM generally performed better with dropouts.

Convolutional Neural Network Results

Proper estimation of hyperparameters is the key to optimizing a CNN model. Selection of hyperparameters like context window, filter size, number of feature maps and pooling strategies is dependent on the nature of the dataset. Tables 4-7 present experimentation results with various context window sizes, filter sizes, embedding window sizes and pooling sizes, using the baseline model.

Baseline Model used
One-hot-size = 100
Filter Size = 5
Pool Size = 2
Activation Function = ReLU
Feature Maps = 32

Base Model Structure used
Embedding Layer -- Conv1D -- MaxPool1D -- Conv1D -- MaxPool1D -- Flatten -- Dense (1)

A stride of one was used. Stride is how much the filter is shifted at each step. If the stride is 1, consecutive filter applications overlap. A larger stride leads to fewer applications of the filter and a smaller output size. From Tables 4-7, the following can be observed:

• The filter size of 7 produces the best accuracy
• Increasing the size of context window improves accuracy, but increasing it beyond a certain limit tends to increase computation time
• Increasing the pool size up to a size of 4 increases the accuracy, after which the accuracy drops.

Hence, for the final runs, varying the feature map sizes, a filter size of 7, context window of 100, embedding window of 80 and pooling size of 4 were used. The computation time and accuracy (with and without dropouts) were determined for various feature map sizes, as presented in Table 8.

The lowest computation time of 46.9 ms was with feature map size of 4 with 149 parameters. The highest accuracy, both without and with dropouts, at 94.98% and 95.97% respectively, was at feature map size of 64 with 17,729 parameters. Using CNN with one-hot encoding, the accuracy did not generally improve with dropouts.

Convolutional Neural Networks with Word Embedding

Table 9 presents the accuracy of varying the embedding nodes, with and without dropouts, when CNN was performed with Word Embedding.

The lowest computation time of 45.7 ms was achieved with 4 embedding nodes and 721 parameters. The highest accuracy without dropout was 96.1% with 32 embedding nodes and 5,761 parameters and the highest accuracy with dropout was 96.34% with 64 nodes and 11,521 parameters. For CNN with Word Embedding, the accuracy generally improved slightly with dropouts.

Comparison of the Machine Learners as well as Deep Learners

Figure 5 graphically presents a comparison of the accuracy of the various models used for classification, Naïve Bayes, SVM, Decision Tree, LSTM, CNN and CNN with Word Embedding. One-hot encoding was used with the Naïve Bayes, SVM, Decision Tree, LSTM and CNN models. CNN with one-hot encoding achieved the highest classification accuracy of all the models using one-hot encoding, with an accuracy of 95.97%. However, CNN with Word Embedding performed better than CNN with one-hot encoding, with the highest overall accuracy of 96.34%.

Figure 6 presents the computation time taken for the various models. One-hot encoding was used to run Naïve Bayes, SVM, Decision Tree, LSTM and CNN. Naïve Bayes had the lowest computation time at 17.7 ms and LSTM had the highest computation time at 696.1 ms and, in this case, CNN with Word Embedding did not perform particularly better.

Table 4: Accuracy by context window
Context window   Accuracy (%)
50               95.739
70               96.27
90               96.43
110              96.048
150              96.44
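The effect of stride and pooling on sequence length, discussed with the base model structure in the Convolutional Neural Network Results, can be traced with simple arithmetic. The sketch below assumes no padding, a pooling stride equal to the pool size and an input length of 100; none of these settings is stated in the text, and the function names are hypothetical.

```python
def out_len(n, size, stride):
    """Output length of a 1-D sliding window with no padding."""
    return (n - size) // stride + 1


def trace(n, filter_size=5, pool_size=2, conv_stride=1):
    """Trace the sequence length through Conv1D -> MaxPool1D -> Conv1D -> MaxPool1D."""
    lengths = [n]
    for _ in range(2):
        n = out_len(n, filter_size, conv_stride)   # convolution
        lengths.append(n)
        n = out_len(n, pool_size, pool_size)       # pooling, stride = pool size
        lengths.append(n)
    return lengths


print(trace(100))                  # [100, 96, 48, 44, 22]
print(trace(100, conv_stride=2))   # [100, 48, 24, 10, 5]
```

With a stride of 1 the filter is applied at nearly every position, while a stride of 2 roughly halves the number of applications and the output length at each convolution.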
Table 8: Computation time and accuracy by feature map size with and without dropouts
Feature maps   Parameters   Computation time (ms)   Accuracy w/o dropout (%)   Accuracy with dropout (%)
4              149          46.9                    85.25                      81.6
8              425          52.1                    88.65                      86.74
16             1361         52.8                    92.02                      91.27
32             4769         59.4                    92.53                      93.718
64             17729        75.4                    94.98                      95.97

Table 9: Computation time and accuracy by embedding nodes with and without dropouts
Embedding nodes   Parameters   Computation time (ms)   Accuracy w/o dropout (%)   Accuracy with dropout (%)
4                 721          45.7                    94.75                      95.41
8                 1441         49.8                    95.5                       95.68
16                2881         53.5                    95.99                      96.048
32                5761         73.7                    96.1                       96.23
64                11521        102.9                   96.01                      96.34
[Figure 5: accuracy (%) by model; x-axis: Naïve Bayes, SVM, Decision Tree, LSTM, CNN, Word Embedding; recovered bar label: 75.72]
[Figure 6: computation time (ms) by model; x-axis: Naïve Bayes, SVM, Decision Tree, LSTM, CNN, Word Embedding; recovered bar labels: 17.7, 18.2, 73.7, 75.4, 102.9, 696.1]
Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919. https://arxiv.org/abs/1707.02919
APWG. (2021). Phishing activity trends reports. https://apwg.org/
Basnet, R., Mukkamala, S., & Sung, A. H. (2008). Detection of phishing attacks: A machine learning approach. In Soft computing applications in industry (pp. 373-383). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77465-5_19
Behdad, M., Barone, L., Bennamoun, M., & French, T. (2012). Nature-inspired techniques in the context of fraud detection. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 42(6), 1273-1290.
Bergholz, A., Chang, J. H., Paass, G., Reichartz, F., & Strobel, S. (2008, August). Improved Phishing Detection using Model-Based Features. In CEAS.
Bergholz, A., De Beer, J., Glahn, S., Moens, M. F., Paaß, G., & Strobel, S. (2010). New filtering approaches for phishing email. Journal of computer security, 18(1), 7-35. https://doi.org/10.3233/JCS-2010-0371
Blum, A., Wardman, B., Solorio, T., & Warner, G. (2010, October). Lexical feature based phishing URL detection using online learning. In Proceedings of the 3rd ACM Workshop on Artificial Intelligence and Security (pp. 54-60). https://doi.org/10.1145/1866423.1866434
Cao, Y., Han, W., & Le, Y. (2008, October). Anti-phishing based on automated individual white-list. In Proceedings of the 4th ACM workshop on Digital identity management (pp. 51-60). https://doi.org/10.1145/1456424.1456434
Cerda, P., Varoquaux, G., & Kégl, B. (2018). Similarity encoding for learning with dirty categorical variables. Machine Learning, 107(8), 1477-1494. https://doi.org/10.1007/s10994-018-5724-2
Chandrasekaran, M., Narayanan, K., & Upadhyaya, S. (2006, June). Phishing email detection based on structural properties. In NYS cyber security conference (Vol. 3). https://www.albany.edu/wwwres/conf/iasymposium/proceedings/2006/chandrasekaran.pdf
Choong, A. C. H., & Lee, N. K. (2017, November). Evaluation of convolutionary neural networks modeling of DNA sequences using ordinal versus one-hot encoding method. In 2017 International Conference on Computer and Drone Applications (IConDA) (pp. 60-65). IEEE. https://doi.org/10.1109/ICONDA.2017.8270400
Danny, P. (2020). What is Phishing? Everything you need to know to protect yourself from scam emails and more. https://www.zdnet.com/article/what-is-phishing-how-to-protect-yourself-from-scam-emails-and-more
Davis, M. J. (2010). Contrast coding in multiple regression analysis: Strengths, weaknesses and utility of popular coding structures. Journal of Data Science, 8(1), 61-73. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.464.4179&rep=rep1&type=pdf
Eckhardt, R. & Bagui, S. (2021). Convolutional Neural Networks and Long Short Term Memory for Phishing Email Classification. International Journal of Computer Science and Information Security, 19(5), 27-35. https://doi.org/10.5281/zenodo.4898110
Fang, Y., Zhang, C., Huang, C., Liu, L., & Yang, Y. (2019). Phishing email detection using improved RCNN model with multilevel v https://doi.org/10.1109/ACCESS.2019.2913705
Fatima, S., & Srinivasu, B. (2017). Text Document categorization using support vector machine. International Research Journal of Engineering and Technology (IRJET), 4(2), 141-147.
Feroz, M. N., & Mengel, S. (2014, October). Examination of data, rule generation and detection of phishing URLs using online logistic regression. In 2014 IEEE International Conference on Big Data (Big Data) (pp. 241-250). IEEE. https://doi.org/10.1109/BigData.2014.7004239
Feroz, M. N., & Mengel, S. (2015, June). Phishing URL detection using URL ranking. In 2015 IEEE International Congress on Big Data (pp. 635-638). IEEE. https://doi.org/10.1109/BigDataCongress.2015.97
Fette, I., Sadeh, N., & Tomasic, A. (2007, May). Learning to detect phishing emails. In Proceedings of the 16th international conference on World Wide Web (pp. 649-656). https://doi.org/10.1145/1242572.1242660
Gao, J., Pantel, P., Gamon, M., He, X., & Deng, L. (2019). Modeling interestingness with deep neural networks. https://doi.org/10.3115/v1/D14-1002
Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28, pp. 2222-2232. https://doi.org/10.1109/TNNLS.2016.2582924
Gupta, V., & Lehal, G. S. (2009). A survey of text mining techniques and applications. Journal of emerging technologies in web intelligence, 1(1), 60-76. https://doi.org/10.4304/jetwi.1.1.60-76
Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques. Elsevier. ISBN-10: 0123814804.
He, L., Lee, K., Lewis, M., & Zettlemoyer, L. (2017, July). Deep semantic role labeling: What works and what’s next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 473-483). https://doi.org/10.18653/v1/P17-1044
Joachims, T. (1998, April). Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning (pp. 137-142). Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026683
Johnson, R., & Zhang, T. (2014). Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058. https://doi.org/10.3115/v1/N15-1011
Johnson, R., & Zhang, T. (2015). Semi-supervised convolutional neural networks for text categorization via region embedding. Advances in neural information processing systems, 28, 919. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4831869/
Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Process (EMNLP 2014), pp. 1746-1751. https://doi.org/10.3115/v1/D14-1181
Lai, S., Liu, K., He, S., & Zhao, J. (2016). How to generate a good word embedding. IEEE Intelligent Systems, 31(6), 5-14. https://doi.org/10.1109/MIS.2016.45
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. https://doi.org/10.1038/nature14539
Linzen, T., Dupoux, E., & Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4, 521-535. https://doi.org/10.1162/tacl_a_00115
Ma, J., Saul, L. K., Savage, S., & Voelker, G. M. (2009a, June). Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1245-1254). https://dl.acm.org/doi/abs/10.1145/1557019.1557153
Ma, L., Ofoghi, B., Watters, P., & Brown, S. (2009b, July). Detecting phishing emails using hybrid features. In 2009 Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing (pp. 493-497). IEEE. https://doi.org/10.1109/UIC-ATC.2009.103
O’REILLY. (2021). Applied Text Analysis with Python. Chapter 4. Text Vectorization and Transformation Pipelines. One-Hot Encoding. https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html#atap_ch04_one_hot_encoding
Otte, S., Liwicki, M., & Krechel, D. (2014, July). Investigating long short-term memory networks for various pattern recognition problems. In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 484-497). Springer, Cham. https://doi.org/10.1007/978-3-319-08979-9_37
Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., ... & Ward, R. (2016). Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(4), 694-707. https://doi.org/10.1109/TASLP.2016.2520371
Pilászy, I. (2005, November). Text categorization and support vector machines. In Proceedings of the 6th international symposium of Hungarian researchers on computational intelligence. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.8812&rep=rep1&type=pdf
Prakash, P., Kumar, M., Kompella, R. R., & Gupta, M. (2010, March). Phishnet: predictive blacklisting to detect phishing attacks. In 2010 Proceedings IEEE INFOCOM (pp. 1-5). IEEE. https://doi.org/10.1109/INFCOM.2010.5462216
Rastenis, J., Ramanauskaitė, S., Suzdalev, I., Tunaitytė, K., Janulevičius, J., & Čenys, A. (2021). Multi-Language Spam/Phishing Classification by Email Body Text: Toward Automated Security Incident Investigation. Electronics, 10(6), 668. https://doi.org/10.3390/electronics10060668
Rawal, S., Rawal, B., Shaheen, A., & Malik, S. (2017). Phishing detection in e-mails using machine learning. International Journal of Applied Information Systems, 12(7), 12-24. https://doi.org/10.5120/ijais2017451713
Sak, H., Senior, A. W., & Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43905.pdf
Sarju, S., & Thomas, R. (2014). Spam Email Detection using Structural Features. International Journal of Computer Applications, 89(3). https://doi.org/10.5120/15485-4265
Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014, November). A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management (pp. 101-110). https://doi.org/10.1145/2661829.2661935
Shin, H., & Paek, J. (2018). Automatic task classification via support vector machine and crowdsourcing. Mobile Information Systems, 2018. https://doi.org/10.1155/2018/6920679
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958. https://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf?utm_campaign=buffer&utm_content=buffer79b43&utm_medium=social&utm_source=twitter.com
Srivastava, P. (2017). Essentials of Deep Learning: Introduction to Long Short Term Memory. https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/
Thakur, H., & Kaur, S. (2016). A Survey Paper On Phishing Detection. International Journal of Advanced Research in Computer Science, 7(4). http://www.ijarcs.info/index.php/Ijarcs/article/view/2706
Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov), 45-66. https://www.jmlr.org/papers/volume2/tong01a/tong01a.pdf
Vadivukarassi, M., Puviarasan, N., & Aruna, P. (2017). Sentimental analysis of tweets using Naive Bayes algorithm. World Applied Sciences Journal, 35(1), 54-59. https://www.academia.edu/download/55083588/7.pdf
Verma, P., Goyal, A., & Gigras, Y. (2020). Email phishing: Text classification using natural language processing. Computer Science and Information Technologies, 1(1), 1-12. https://doi.org/10.11591/csit.v1i1.p1-12
Xiang, G., Hong, J., Rose, C. P., & Cranor, L. (2011). Cantina+ a feature-rich machine learning framework for detecting phishing web sites. ACM Transactions on Information and System Security (TISSEC), 14(2), 1-28. https://doi.org/10.1145/2019599.2019606
Yang, Z., Qiao, C., Kan, W., & Qiu, J. (2019, April). Phishing Email Detection Based on Hybrid Features. In IOP Conference Series: Earth and Environmental Science (Vol. 252, No. 4, p. 042051). IOP Publishing. https://doi.org/10.1088/1755-1315/252/4/042051
Yasin, A., & Abuhasan, A. (2016). An intelligent classification model for phishing email detection. arXiv preprint arXiv:1608.02196. https://doi.org/10.5121/ijnsa.2016.8405
Yu, W. D., Nargundkar, S., & Tiruthani, N. (2009, July). Phishcatch-a phishing detection tool. In Proceedings of the 2009 33rd Annual IEEE International Computer Software and Applications Conference-Volume 02 (pp. 451-456). https://doi.org/10.1109/COMPSAC.2009.175
Zhang, W., & Gao, F. (2011). An improvement to naive bayes for text classification. Procedia Engineering, 15, 2160-2164. https://doi.org/10.1016/j.proeng.2011.08.404
Zhang, X., Zeng, Y., Jin, X. B., Yan, Z. W., & Geng, G. G. (2017, December). Boosting the phishing detection performance by semantic analysis. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 1063-1070). IEEE. https://doi.org/10.1109/BigData.2017.8258030
Zhang, Y., & Wallace, B. (2015). A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820. https://arxiv.org/abs/1510.03820
Zhang, Y., Hong, J. I., & Cranor, L. F. (2007, May). Cantina: A content-based approach to detecting phishing web sites. In Proceedings of the 16th international conference on World Wide Web (pp. 639-648). https://doi.org/10.1145/1242572.1242659