
A NEW ENSEMBLE MODEL FOR PHISHING DETECTION BASED ON HYBRID CUMULATIVE FEATURE SELECTION
Md. Sirajum Munir Prince
sirajummunirprince@gmail.com

Asib Hasan
asibhasan.cse@gmail.com

Faisal Muhammad Shah


faisal505@yahoo.com

Ahsanullah University of Science and Technology


Dhaka, Bangladesh

A New Ensemble Model for Phishing Detection Based on Cumulative Feature Selection (4/3/2021)
ABSTRACT
This research proposes a majority-vote-based ensemble phishing detection system. It combines five machine learning classifiers with a hybrid feature selection method, applied as pre-processing, that uses five search techniques. The procedure is as follows:

1. Feature selection by wrapper approach using five search techniques:
• Principal Component Analysis (PCA)
• Pearson Correlation Coefficient (PCC)
• Chi Square
• Gain Ratio
• Information Gain
2. Building majority-vote-based ensemble classification models, using five base classifiers:
• Naive Bayes (NB)
• Support Vector Machine (SVM)
• Decision Tree (C4.5, JRip, PART)
• K-Nearest Neighbor (k-NN)
• Random Forest (RF)
3. Classification of test instances by majority vote over the base classifiers.
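
Two of the filters listed in step 1, Information Gain and Gain Ratio, score a feature by how much it reduces uncertainty about the class label. The following is a minimal sketch in plain Python, assuming discrete feature values. The toy data, feature names, and function names are illustrative assumptions and are not taken from the slides.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(feature, labels):
    """H(labels) minus the expected entropy of labels after splitting on feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalised by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 0, 0, 0]
has_ip_in_url = [1, 1, 1, 0, 0, 0]   # perfectly predicts the label
url_is_long = [1, 0, 1, 0, 1, 0]     # only weakly related to the label

scores = {
    "has_ip_in_url": information_gain(has_ip_in_url, labels),
    "url_is_long": information_gain(url_is_long, labels),
}
# Features ordered from most to least informative.
ranking = sorted(scores, key=scores.get, reverse=True)
```

A ranking like this is what a filter would hand to the next stage; the other listed filters would produce their own rankings from different scoring functions.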

OBJECTIVE

• Develop knowledge of the machine learning approaches applied in phishing detection systems.

• Build a hybrid feature selection methodology.

• Build an ensemble classification framework.

WORK FLOW DIAGRAM

Dataset → Feature Selection → Classification → Output

METHODOLOGY

PHASE - 01
Dataset → Filter Model 1, Filter Model 2, Filter Model 3, ..., Filter Model N

Filter Model 1   Filter Model 2   Filter Model 3   ...   Filter Model N
FS1,1            FS2,1            FS3,1            ...   FSn,1
FS1,2            FS2,2            FS3,2            ...   FSn,2
...              ...              ...              ...   ...
FS1,n            FS2,n            FS3,n            ...   FSn,n

All feature subsets → Phase - 02
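
One way to read the Phase - 01 diagram, stated here as an assumption and not as the authors' exact procedure: each filter model ranks the features, and FSi,k is the subset holding filter i's top k features, so the subsets grow cumulatively. A minimal sketch with hypothetical feature names and scores:

```python
def cumulative_subsets(scores):
    """Return [top-1, top-2, ..., top-n] feature lists for one filter's scores.

    `scores` maps feature name -> relevance score from a single filter model.
    Assumption: a higher score means a more relevant feature.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ranked[:k] for k in range(1, len(ranked) + 1)]


# Hypothetical scores from two filter models over the same three features.
filter_scores = {
    "chi_square": {"f_a": 9.1, "f_b": 2.3, "f_c": 5.0},
    "info_gain": {"f_a": 0.40, "f_b": 0.55, "f_c": 0.10},
}

# subsets[filter_name][k - 1] plays the role of FS(i, k) in the diagram.
subsets = {name: cumulative_subsets(s) for name, s in filter_scores.items()}
```

Each subset could then be evaluated by the classifiers in Phase - 02 to find the subset size at which accuracy peaks.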

PHASE - 02
Phase - 01 → Classifier 1, Classifier 2, Classifier 3, ..., Classifier N

Classifiers → Majority Voting on Reduced Feature Set Classifier
Classifiers → Majority Voting on Full Feature Set Classifier

Both majority votes → Result

MAJORITY VOTING

• One of the simplest and most intuitive ensemble combination techniques.

• Applied when the ensemble method could not otherwise make a stable prediction.

• The final result is reached through a series of votes.
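
Hard majority voting can be sketched in a few lines of plain Python. The classifier outputs below are hypothetical, and the tie-breaking rule is an arbitrary choice made for this sketch; the slides do not say how ties are handled.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers for one sample.

    Ties are broken in favour of the label that appears first in `predictions`
    (an arbitrary choice made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of five base classifiers for three test pages
# (1 = phishing, 0 = legitimate); each row is one classifier.
classifier_outputs = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
]

# Transpose so that each tuple holds every classifier's vote for one sample.
final = [majority_vote(votes) for votes in zip(*classifier_outputs)]
```

With an odd number of voters and two classes, as in this toy setup, a tie cannot occur, which is one practical reason to prefer an odd number of base classifiers.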

DATASET

Phishing Dataset for Machine Learning
Number of Instances: 10000
Number of Attributes: 48
Data Set Characteristics: Multivariate
Attribute Characteristics: Integer
Associated Tasks: Classification
Missing Attribute Values: None
Phishing webpage sources: PhishTank, OpenPhish
Legitimate webpage sources: Alexa, Common Crawl

SIMULATION RESULT ANALYSIS

Assessment of Top Feature Subsets

[Figure panels: Chi Square; Information Gain]

Assessment of Top Feature Subsets

[Figure panels: Gain Ratio; PCA]

Assessment of Top Feature Subsets

[Figure panels: PCC; Random Forest]

ROC Curve

• Receiver Operating Characteristic

• False Positive Rate on the x-axis, True Positive Rate on the y-axis

• An area under the curve closer to 1 indicates better performance

[Figure: ROC Curve]
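
The curve and the area under it can be computed directly from labels and classifier scores. The sketch below is plain Python with hypothetical data; it assumes both classes are present and that a higher score means "more likely phishing". It is not the tooling used to draw the figure on this slide.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points obtained by lowering the decision threshold.

    `labels` are 1 (positive) or 0 (negative); `scores` are higher for samples
    the model considers more likely positive. Samples with equal scores are
    moved across the threshold together.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ordered = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ordered):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        if i + 1 == len(ordered) or ordered[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for four pages (1 = phishing, 0 = legitimate).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
area = auc(points)
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of phishing pages above legitimate ones.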
Performance Comparison Between Top-n Feature Subset and Full Feature Subset
Model Name               Number of Features   Accuracy (%)
Random Forest            32                   98.36
Random Forest            48                   98.27
Support Vector Machine   44                   94.01
Support Vector Machine   48                   93.97
Naïve Bayes              41                   85.78
Naïve Bayes              48                   85.26
C4.5                     42                   97.53
C4.5                     48                   91.11
JRip                     48                   97.35
PART                     41                   97.59
PART                     48                   97.48
KNN                      31                   96.42
KNN                      35                   95.23
KNN                      48                   95.26
PDCFS                    48                   98.24
Average Runtime for Classification Per Sample

Model Name                        Runtime (sec)
Chi-Square                        2.24
Gain Ratio                        1.85
Information Gain                  2.29
Pearson Correlation Coefficient   2.32
Principal Components Analysis     2.5
PDCFS                             20.21
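
The slides do not state how the per-sample runtime was measured. One common approach, shown here only as a sketch, is to time a batch of predictions with a wall-clock timer and divide by the number of samples. The `toy_predict` function is a hypothetical stand-in for a trained classifier.

```python
import time


def average_runtime_per_sample(predict, samples):
    """Average wall-clock seconds that `predict` takes per sample."""
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    elapsed = time.perf_counter() - start
    return elapsed / len(samples)


# Hypothetical stand-in for a trained classifier: flags a page as phishing (1)
# when more than half of its binary features are set.
def toy_predict(features):
    return 1 if sum(features) > len(features) / 2 else 0


samples = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
avg_seconds = average_runtime_per_sample(toy_predict, samples)
```

An ensemble that queries several base classifiers per sample will naturally take longer than any single model under this kind of measurement.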

Performance Benchmarking

Model Name      Number of Features   Accuracy (%)
HEFS [4]        48                   96.17
HEFS [4]        10                   94.60
FACA [2]        30                   92.40
Random Forest   32                   98.36
PDCFS           48                   98.24

CONCLUSION

• Obtain the best accuracy using the fewest features

• Reduce the run-time

• Work on real-time data

KEY REFERENCES

1. APWG, “Phishing activity trends reports.” Accessed on: September 8, 2020. [Online]. Available: https://apwg.org
2. W. Hadi, F. Aburub, and S. Alhawari, “A new fast associative classification algorithm for detecting phishing websites,” Applied Soft Computing, vol. 48, pp. 729–734, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1568494616303970
3. J. Benesty, J. Chen, Y. Huang, and I. Cohen, “Pearson correlation coefficient,” in Noise Reduction in Speech Processing, Springer Topics in Signal Processing, vol. 2, 2009.
4. K. L. Chiew, C. L. Tan, K. Wong, K. S. Yong, and W. K. Tiong, “A new hybrid ensemble feature selection framework for machine learning-based phishing detection system,” Information Sciences, vol. 484, pp. 153–166, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025519300763
5. S. A. Manaf, N. Mustapha, N. Sulaiman, N. A. Husin, M. N. S. Zainudin, and H. Z. M. Shafri, “Majority voting of ensemble classifiers to improve shoreline extraction of medium resolution satellite images,” 2017.
6. M. X. Rodriguez-Alvarez and V. Inacio, “ROCnReg: An R package for receiver operating characteristic curve inference with and without covariate information,” 2020.
7. K. H. Zou, A. J. O'Malley, and L. Mauri, “Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models,” Circulation, vol. 115, no. 5, pp. 654–657, 2007.

Thank You
