Professional Documents
Culture Documents
Architecture Document: International School Capstone Project Proposal CMU-CS450 Capstone Project I
Architecture Document: International School Capstone Project Proposal CMU-CS450 Capstone Project I
Architecture Document
Version: 1.1
Date:
Approved by
REVISION HISTORY
2. Project Report:
Project Overview
Project Name: Apply Machine Learning In Detecting Phishing Websites
Team Member:
Member Role
Khai , Tran Dinh Team Leader
Duc, Nguyen Tan Team Member
Thuy, Pham Thanh Team Member
II. DESCRIBE THE SYSTEM:
This figure presents the architecture of the proposed approach, which consists of
the feature extraction phase, machine learning classifier construction, and training, and
phishing prediction phase. The feature extraction phase takes mixed data as input from
where features are extracted from the URL, Web document properties, and web
behaviour attributes.
In the Feature extraction, we use some features to detect phishing websites like
Address Bar based Features, Abnormal Based Features, HTML and JavaScript-based
Features, Domain-based Features. The Random Forest classifier phase used the
extracted features to train the ML algorithms. In the end, the phishing prediction phase
is used to determine the accuracy of the proposed model.
We have been able to update this feature rule by using a method based on
frequency and thus improving upon its accuracy.
1.3. Using URL Shortening Services “TinyURL”
URL shortening is a method on the “World Wide Web” in which a URL may
be made considerably smaller in length and still lead to the required webpage.
This is accomplished by means of an “HTTP Redirect” on a domain name that
is short, which links to the webpage that has a long URL. For example, the URL
“http://portal.hud.ac.uk/” can be shortened to “bit.ly/19DXSk4”.
TinyURL → Phishing
Rule: IF{
Otherwise → Legitimate
1.8. HTTPS (Hyper Text Transfer Protocol with Secure Sockets Layer)
The existence of HTTPS is very important in giving the impression of
website legitimacy, but this is clearly not enough. The authors in (Mohammad,
Thabtah and McCluskey 2012) (Mohammad, Thabtah and McCluskey 2013)
suggest checking the certificate assigned with HTTPS including the extent of the
trust certificate issuer, and the certificate age. Certificate Authorities that are
consistently listed among the top trustworthy names include: “GeoTrust,
GoDaddy, Network Solutions, Thawte, Comodo, Doster and VeriSign”.
Furthermore, by testing out our datasets, we find that the minimum age of a
reputable certificate is two years.
Rule:IF
Use https and Issuer Is Trusted and Age of Certificate ≥ 1 Years → Legitimate
{ Using https and Issuer Is Not Trusted → Suspicious
Otherwise → Phishing
1.9. Domain Registration Length
Based on the fact that a phishing website lives for a short period of time, we
believe that trustworthy domains are regularly paid for several years in advance.
In our dataset, we find that the longest fraudulent domains have been used for
one year only.
Domains Expires on ≤ 1 years → Phishing
Rule: IF{
Otherwise → Legitimate
1.10. Favicon
A favicon is a graphic image (icon) associated with a specific webpage.
Many existing user agents such as graphical browsers and newsreaders show
favicon as a visual reminder of the website identity in the address bar. If the
favicon is loaded from a domain other than that shown in the address bar, then
the webpage is likely to be considered a Phishing attempt.
Favicon Loaded From External Domain → Phishing
Rule: IF{
Otherwise → Legitimate
1.12. The Existence of “HTTPS” Token in the Domain Part of the URL
The phishers may add the “HTTPS” token to the domain part of a URL in
order to trick users.
For example, http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/.
Using HTTP Token in Domain Part of The URL → Phishing
Rule: IF{
Otherwise → Legitimate
In our system, I use the machine learning algorithm random forest application
to release the website that must be fake or not rely on the random algorithm from
which we mount our device. Here I go to the internet of the system to catch the
packet and see if it has to be fake or if it is spoofing our system, I will report to
the firewall and immediately block the sending of the packet, and if It's safe to be
saved in our data and I have been updated on regular basic.
VII. REFERENCES:
[1] DTU Journal of Science & Technology 03(40) (2020) 16-25 - Automating the
update of packet filtering rules for open source firewalls:
https://duytan.edu.vn/upload/file/3.-BAI-3-(Nguyen-Kim-Tuan)-79.pdf
[2] Randomforest: https://prezi.com/905bwnaa7dva/how-random-forest-
algorithm-
works/?utm_campaign=share&utm_medium=copy&fbclid=IwAR2oGFibYSOtx
mJ1OdUGzMJdrWS3aEeOIyWq6PCI8p4hoTCIL2jPp40w-JM
[3] A COMPLETE GUIDE TO THE RANDOM FOREST ALGORITHM:
https://builtin.com/data-science/random-forest-algorithm
[4] HOW THE RANDOM FOREST ALGORITHM WORKS IN MACHINE
LEARNING:
https://dataaspirant.com/random-forest-algorithm-machine-learing/
[5] List of system quality attributes :
https://en.wikipedia.org/wiki/List_of_system_quality_attributes
[6] Software Architecture Quality Attributes:
https://dzone.com/articles/software-architecture-quality-attributes