Architecture Document: International School Capstone Project Proposal CMU-CS450 Capstone Project I

International School
Capstone Project Proposal

CMU-CS450 Capstone Project I
Architecture Document
Version: 1.1
Date:
APPLY MACHINE LEARNING IN DETECTING PHISHING WEBSITES

Submitted by
Khai, Tran Dinh

Duc, Nguyen Tan
Thuy, Pham Thanh
Approved by
MSc. Tuan, Nguyen Kim

PROJECT INFORMATION
Project Acronym AMLIDPW

Project Title Apply Machine Learning In Detecting Phishing Websites
Start Date 12/08/2020 End Date 05/12/2020
Lead Institution International School, Duytan University
Project Mentor MSc. Nguyen Kim Tuan
Scrum Master / Khai, Tran Dinh
Project Leader & Email: khaitran9499@gmail.com
contact details
Tel: 0707375015
Team members Name Email Tel
1 Duc, Nguyen Tan nguyentanduc259@gmail.com 0935684175
2 Thuy, Pham Thanh btpham99@gmail.com 0905562806
REVISION HISTORY
Version Date Comments Author Approval

1.0 16/08/2020 Initial Release C1NE.03
1.1 22/08/2020 Update References C1NE.03
1.2 15/9/2020 Update References C1NE.03
1.3 20/10/2020 Update C1NE.03
Table of Contents
I. INTRODUCTION:...................................................................................................... 5
1. Project Overview: ....................................................................................................... 5
2. Project Report: ............................................................................................................ 5
II. DESCRIBE THE SYSTEM: ..................................................................................... 6
III. FEATURE EXTRACTION DESCRIPTION: ......................................................... 6
1. Address Bar based Features: ...................................................................................... 6
1.1. Using the IP Address: .............................................................................................. 6
1.2. Long URL to Hide the Suspicious Part ................................................................... 7
1.3. Using URL Shortening Services “TinyURL” ......................................................... 7
1.4. URL’s having “@” Symbol ..................................................................................... 7
1.5. Redirecting using “//” .............................................................................................. 7
1.6. Adding Prefix or Suffix Separated by (-) to the Domain ........................................ 8
1.7. Sub Domain and Multi Sub Domains ...................................................................... 8
1.8. HTTPS (Hyper Text Transfer Protocol with Secure Sockets Layer) ...................... 8
1.9. Domain Registration Length ................................................................................... 9
1.10. Favicon .................................................................................................................. 9
1.11. Using Non-Standard Port : .................................................................................... 9
2. Abnormal Based Features: ....................................................................................... 10
2.1. Request URL: ........................................................................................................ 10
2.2. URL of Anchor: ..................................................................................................... 10
2.3. Links in <Meta>, <Script> and <Link> tags:........................................................ 10
2.4. Server Form Handler (SFH): ................................................................................. 11
2.5. Submitting Information to Email........................................................................... 11
2.6. Abnormal URL ...................................................................................................... 11
3. HTML and JavaScript based Features:..................................................................... 11
3.1. Website Forwarding: ............................................................................................. 11
3.3. Disabling Right Click: ........................................................................................... 12
3.4. Using Pop-up Window .......................................................................................... 12
3.5. IFrame Redirection ................................................................................................ 12
4. Domain based Features:............................................................................................ 12
4.3. Website Traffic: ..................................................................................................... 13
4.4. PageRank: .............................................................................................................. 13
4.5. Google Index: ........................................................................................................ 13
4.7. Statistical-Reports Based Feature: ......................................................................... 14
IV. DATASET DESCRIPTION:.................................................................................. 14
V. RANDOM FOREST ALGORITHM....................................................................... 15
VI. QUALITY ATTRIBUTES ..................................................................................... 18
1. Availability: .............................................................................................................. 18
2. Usability : ................................................................................................................. 18
3. Performance : ............................................................................................................ 18
4. Confidentiality: ......................................................................................................... 18
5.Overall Architecture: ................................................................................................. 18
VII. REFERENCES: .................................................................................................... 20
I. INTRODUCTION:
1. Project Overview:
There are a number of users who purchase products online and make payment
through various websites. Many users unwittingly click phishing domains every day
and every hour. The attackers are targeting both the users and the companies. They
are seeking the credentials of consumers in places where consumers may least expect
it. Malicious URL, a.k.a. a malicious website, is a common and serious threat to
cyber security. The project aims to provide a comprehensive survey and a structural
understanding of Malicious URL Detection techniques using machine learning. We
present the formal formulation of Malicious URL Detection as a machine learning
task and categorize and review the contributions of literature studies that addresses
different dimensions of this problem (feature representation, algorithm design, etc.).
Detecting tool phishing is a web-based criminal act. Phishing sites lure sensitive
information from naive online users by camouflaging themselves as trustworthy
entities. Phishing is considered an annoying threat in the field of electronic
commerce. Due to the short lifespan of phishing webpages and the rapid
advancement of phishing techniques, maintaining blacklists, white-lists or
employing solely heuristics-based approaches are not particularly effective. The
impact of phishing can be largely mitigated by adopting a suitable combination of all
these techniques. In this study, the characteristics of legitimate and phishing
webpages were investigated in depth, and based on this analysis, we proposed
heuristics to extract 15 features from such webpages. These heuristic results were fed
as an input to a trained machine learning algorithm to detect phishing sites. Before
applying heuristics to the web pages, we used two preliminary screening modules in
this system.
2. Project Report:
Project Overview
 Project Name: Apply Machine Learning In Detecting Phishing Websites
 Team Member:
Member Role
Khai , Tran Dinh Team Leader
Duc, Nguyen Tan Team Member
Thuy, Pham Thanh Team Member
II. DESCRIBE THE SYSTEM:
Figure 1:Detect Phishing System
This figure presents the architecture of the proposed approach, which consists of
the feature extraction phase, machine learning classifier construction, and training, and
phishing prediction phase. The feature extraction phase takes mixed data as input from
where features are extracted from the URL, Web document properties, and web
behaviour attributes.
In the Feature extraction, we use some features to detect phishing websites like
Address Bar based Features, Abnormal Based Features, HTML and JavaScript-based
Features, Domain-based Features. The Random Forest classifier phase used the
extracted features to train the ML algorithms. In the end, the phishing prediction phase
is used to determine the accuracy of the proposed model.
III. FEATURE EXTRACTION DESCRIPTION:

Webpages possess a wide variety of properties. These properties can be used to
distinguish a legitimate website from the phishing one. To list out the properties of the
website, the rule of the Extraction Framework is used. This algorithm takes a URL as
input and lists out 30 distinct features of a webpage that is used to determine its
authenticity. These results are listed out and then fed to the Classifier Model for further
processing. The 30 distinct features are listed below:
1. Address Bar based Features:

1.1. Using the IP Address:
If an IP address is used as an alternative of the domain name in the URL,
such as “http://125.98.3.123/fake.html”, users can be sure that someone is trying
to steal their personal information. Sometimes, the IP address is even
transformed into hexadecimal code as shown in the following link
“http://0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html”.
If The Domain Part has an IP Address → Phishing

Rule: IF{
Otherwise → Legitimate
1.2. Long URL to Hide the Suspicious Part

Phishers can use long URL to hide the doubtful part in the address bar. For
example:
http://federmacedoadv.com.br/3f/aze/ab51e2e319e51502f416dbe46b773a5e/?c
md=_home&dispatch=11004d58f5b74f8dc1e7c2e8dd4105e811004d58f5
b74f8dc1e7c2e8dd4105e8@phishing.website.html
To ensure accuracy of our study, we calculated the length of URLs in the dataset
and produced an average URL length. The results showed that if the length of
the URL is greater than or equal 54 characters then the URL classified as
phishing. By reviewing our dataset we were able to find 1220 URLs lengths
equals to 54 or more which constitute 48.8% of the total dataset size.
URL length < 54 → feature = Legitimate

Rule: IF{ else if URL length ≥ 54 and ≤ 75 → feature = Suspicious
otherwise → feature = Phishing
We have been able to update this feature rule by using a method based on
frequency and thus improving upon its accuracy.
1.3. Using URL Shortening Services “TinyURL”
URL shortening is a method on the “World Wide Web” in which a URL may
be made considerably smaller in length and still lead to the required webpage.
This is accomplished by means of an “HTTP Redirect” on a domain name that
is short, which links to the webpage that has a long URL. For example, the URL
“http://portal.hud.ac.uk/” can be shortened to “bit.ly/19DXSk4”.
TinyURL → Phishing
Rule: IF{
1.4. URL’s having “@” Symbol

Using “@” symbol in the URL leads the browser to ignore everything
preceding the “@” symbol and the real address often follows the “@” symbol.
Url Having @ Symbol → Phishing
Rule: IF {
1.5. Redirecting using “//”
The existence of “//” within the URL path means that the user will be
redirected to another website. An example of such URL’s is:
“http://www.legitimate.com//http://www.phishing.com”. We examin the
location where the “//” appears. We find that if the URL starts with “HTTP”, that
means the “//” should appear in the sixth position. However, if the URL employs
“HTTPS” then the “//” should appear in seventh position.
Rule:
ThePosition of the Last Occurrence of "//" in the URL > 7 → Phishing
IF {
1.6. Adding Prefix or Suffix Separated by (-) to the Domain

The dash symbol is rarely used in legitimate URLs. Phishers tend to add
prefixes or suffixes separated by (-) to the domain name so that users feel that
they are dealing with a legitimate webpage. For example http://www.Confirme-
paypal.com/.
Domain Name Part Includes (−) Symbol → Phishing
Rule: IF {
1.7. Sub Domain and Multi Sub Domains

Let us assume we have the following link: http://www.hud.ac.uk/students/.
A domain name might include the country-code top-level domains (ccTLD),
which in our example is “uk”. The “ac” part is shorthand for “academic”, the
combined “ac.uk” is called a second-level domain (SLD) and “hud” is the actual
name of the domain. To produce a rule for extracting this feature, we firstly have
to omit the (www.) from the URL which is in fact a sub domain in itself. Then,
we have to remove the (ccTLD) if it exists. Finally, we count the remaining dots.
If the number of dots is greater than one, then the URL is classified as
“Suspicious” since it has one sub domain. However, if the dots are greater than
two, it is classified as “Phishing” since it will have multiple sub domains.
Otherwise, if the URL has no sub domains, we will assign “Legitimate” to the
feature.
Dots In Domain Part = 1 → Legitimate
Rule: IF {Dots In Domain Part = 2 → Suspicious
Otherwise → Phishing
1.8. HTTPS (Hyper Text Transfer Protocol with Secure Sockets Layer)
The existence of HTTPS is very important in giving the impression of
website legitimacy, but this is clearly not enough. The authors in (Mohammad,
Thabtah and McCluskey 2012) (Mohammad, Thabtah and McCluskey 2013)
suggest checking the certificate assigned with HTTPS including the extent of the
trust certificate issuer, and the certificate age. Certificate Authorities that are
consistently listed among the top trustworthy names include: “GeoTrust,
GoDaddy, Network Solutions, Thawte, Comodo, Doster and VeriSign”.
Furthermore, by testing out our datasets, we find that the minimum age of a
reputable certificate is two years.
Rule:IF
Use https and Issuer Is Trusted and Age of Certificate ≥ 1 Years → Legitimate
{ Using https and Issuer Is Not Trusted → Suspicious
1.9. Domain Registration Length
Based on the fact that a phishing website lives for a short period of time, we
believe that trustworthy domains are regularly paid for several years in advance.
In our dataset, we find that the longest fraudulent domains have been used for
one year only.
Domains Expires on ≤ 1 years → Phishing
Rule: IF{
1.10. Favicon
A favicon is a graphic image (icon) associated with a specific webpage.
Many existing user agents such as graphical browsers and newsreaders show
favicon as a visual reminder of the website identity in the address bar. If the
favicon is loaded from a domain other than that shown in the address bar, then
the webpage is likely to be considered a Phishing attempt.
Favicon Loaded From External Domain → Phishing
Rule: IF{
1.11. Using Non-Standard Port :

This feature is useful in validating if a particular service (e.g. HTTP) is up
or down on a specific server. In the aim of controlling intrusions, it is much better
to merely open ports that you need. Several firewalls, Proxy and Network
Address Translation (NAT) servers will, by default, block all or most of the ports
and only open the ones selected. If all ports are open, phishers can run almost
any service they want and as a result, user information is threatened. The most
important ports and their preferred status are shown in Table 1.
Port # is of the Preffered Status → Phishing
Rule: IF{
Table 1 Common ports to be checked
Meaning Preferred
PORT Service
Status
21 FTP Transfer files from one host to another Close
22 SSH Secure File Transfer Protocol Close
provide a bidirectional interactive text-oriented Close
23 Telnet
communication
80 HTTP Hyper test transfer protocol Open
443 HTTPS Hypertext transfer protocol secured Open
445 SMB Providing shared access to files, printers, serial ports Close
Store and retrieve data as requested by other software Close
1433 MSSQL
applications
1521 ORACLE Access oracle database from web. Close
3306 MySQL Access MySQL database from web. Close
Remote allow remote access and remote collaboration Close
3389
Desktop
1.12. The Existence of “HTTPS” Token in the Domain Part of the URL
The phishers may add the “HTTPS” token to the domain part of a URL in
order to trick users.
For example, http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/.
Using HTTP Token in Domain Part of The URL → Phishing
Rule: IF{
2. Abnormal Based Features:

2.1. Request URL:
Request URL examines whether the external objects contained within a
webpage such as images, videos and sounds are loaded from another domain. In
legitimate webpages, the webpage address and most of objects embedded within
the webpage are sharing the same domain.
% of Request URL < 22% → Legitimate
Rule: IF {%of Request URL ≥ 22% and 61% → Suspicious
Otherwise → feature = Phishing
2.2. URL of Anchor:

An anchor is an element defined by the <a> tag. This feature is treated
exactly as “Request URL”. However, for this feature we examine:
1. If the <a> tags and the website have different domain names. This is
similar to request URL feature.
2. If the anchor does not link to any webpage, e.g.:
A. <a href=“#”>
B. <a href=“#content”>
C. <a href=“#skip”>
D. <a href=“JavaScript ::void(0)”>
% of URL Of Anchor < 31% → Legitimate
Rule: IF{% of URL Of Anchor ≥ 31% And ≤ 67% → Suspicious
2.3. Links in <Meta>, <Script> and <Link> tags:

Given that our investigation covers all angles likely to be used in the
webpage source code, we find that it is common for legitimate websites to use
<Meta> tags to offer metadata about the HTML document; <Script> tags to
create a client side script; and <Link> tags to retrieve other web resources. It is
expected that these tags are linked to the same domain of the webpage.
Rule:IF
% of Links in " < Meta > ", " < Script > " and " < Link>" < 17% → Legitimate
{% of Links in < Meta > ", " < Script > " and " < Link>" ≥ 17% And ≤ 81% → Suspicious
2.4. Server Form Handler (SFH):

SFHs that contain an empty string or “about:blank” are considered doubtful
because an action should be taken upon the submitted information. In addition,
if the domain name in SFHs is different from the domain name of the webpage,
this reveals that the webpage is suspicious because the submitted information is
rarely handled by external domains.
SFH is "about: blank" Or Is Empty → Phishing
Rule: IF{ SFH Refers To A Different Domain → Suspicious
2.5. Submitting Information to Email

Web form allows a user to submit his personal information that is directed
to a server for processing. A phisher might redirect the user’s information to his
personal email. To that end, a server-side script language might be used such as
“mail()” function in PHP. One more client-side function that might be used for
this purpose is the “mailto:” function.
Rule:
IF
Using "mail()" or "mailto:" Function to Submit User Information → Phishing
{
2.6. Abnormal URL

This feature can be extracted from WHOIS database. For a legitimate
website, identity is typically part of its URL.
The Host Name Is Not Included In URL → Phishing
Rule: IF {
3. HTML and JavaScript based Features:
3.1. Website Forwarding:
The fine line that distinguishes phishing websites from legitimate ones is
how many times a website has been redirected. In our dataset, we find that
legitimate websites have been redirected one time max. On the other hand,
phishing websites containing this feature have been redirected at least 4 times.
ofRedirect Page ≤ 1 → Legitimate
Rule: IF {of Redirect Page ≥ 2 And < 4 → Suspicious
3.2. Status Bar Customization:
Phishers may use JavaScript to show a fake URL in the status bar to users.
To extract this feature, we must dig-out the webpage source code, particularly
the “onMouseOver” event, and check if it makes any changes on the status bar.
onMouseOver Changes Status Bar → Phishing
Rule: IF{
It Does′t Change Status Bar → Legitimate
3.3. Disabling Right Click:

Phishers use JavaScript to disable the right-click function, so that users
cannot view and save the webpage source code. This feature is treated exactly as
“Using onMouseOver to hide the Link”. Nonetheless, for this feature, we will
search for event “event.button==2” in the webpage source code and check if the
right click is disabled.
Right Click Disabled → Phishing
Rule: IF{
3.4. Using Pop-up Window

It is unusual to find a legitimate website asking users to submit their personal
information through a pop-up window. On the other hand, this feature has been
used in some legitimate websites and its main goal is to warn users about
fraudulent activities or broadcast a welcome announcement, though no personal
information was asked to be filled in through these pop-up windows.
Popoup Window Contains Text Fields → Phishing
Rule: IF {
3.5. IFrame Redirection

IFrame is an HTML tag used to display an additional webpage into one that
is currently shown. Phishers can make use of the “iframe” tag and make it
invisible i.e. without frame borders. In this regard, phishers make use of the
“frameBorder” attribute which causes the browser to render a visual delineation.
Using iframe → Phishing
Rule: IF {
4. Domain based Features:

4.1. Age of Domain
This feature can be extracted from WHOIS database (Whois 2005). Most
phishing websites live for a short period of time. By reviewing our dataset, we
find that the minimum age of the legitimate domain is 6 months.
Age Of Domain ≥ 6 months → Legitimate
Rule: IF {
4.2. DNS Record
For phishing websites, either the claimed identity is not recognized by the
WHOIS database (Whois 2005) or no records founded for the hostname (Pan
and Ding 2006). If the DNS record is empty or not found then the website is
classified as “Phishing”, otherwise it is classified as “Legitimate”.
no DNS Record For The Domain → Phishing
Rule: IF{
4.3. Website Traffic:

This feature measures the popularity of the website by determining the
number of visitors and the number of pages they visit. However, since phishing
websites live for a short period of time, they may not be recognized by the Alexa
database (Alexa the Web Information Company., 1996). By reviewing our
dataset, we find that in worst scenarios, legitimate websites ranked among the
top 100,000. Furthermore, if the domain has no traffic or is not recognized by
the Alexa database, it is classified as “Phishing”. Otherwise, it is classified as
“Suspicious”.
Website Rank < 100,000 → Legitimate
Rule: IF{ Website Rank > 100,000 → Suspicious
Otherwise → Phish
4.4. PageRank:
PageRank is a value ranging from “0” to “1”. PageRank aims to measure
how important a webpage is on the Internet. The greater the PageRank value the
more important the webpage. In our datasets, we find that about 95% of phishing
webpages have no PageRank. Moreover, we find that the remaining 5% of
phishing webpages may reach a PageRank value up to “0.2”.
PageRank < 0.2 → Phishing
Rule: IF{
4.5. Google Index:

This feature examines whether a website is in Google’s index or not. When
a site is indexed by Google, it is displayed on search results (Webmaster
resources, 2014). Usually, phishing webpages are merely accessible for a short
period and as a result, many phishing webpages may not be found on the Google
index.
Webpage Indexed by Google → Legitimate
Rule: IF{
4.6. Number of Links Pointing to Page
The number of links pointing to the webpage indicates its legitimacy level, even
if some links are of the same domain (Dean, 2014). In our datasets and due to its
short life span, we find that 98% of phishing dataset items have no links pointing
to them. On the other hand, legitimate websites have at least 2 external links
pointing to them.
Of Link Pointing to The Webpage = 0 → Phishing
Rule: IF{Of Link Pointing to The Webpage > 0 and ≤ 2 → Suspicious
4.7. Statistical-Reports Based Feature:

Several parties such as PhishTank (PhishTank Stats, 2010-2012), and
StopBadware (StopBadware, 2010-2012) formulate numerous statistical reports
on phishing websites at every given period of time; some are monthly and others
are quarterly. In our research, we used 2 forms of the top ten statistics from
PhishTank: “Top 10 Domains” and “Top 10 IPs” according to statistical-reports
published in the last three years, starting in January2010 to November 2012.
Whereas for “StopBadware”, we used “Top 50” IP addresses.
Rule:
Host Belongs to Top Phishing IPs or Top Phishing Domains → Phishing
IF{
IV. DATASET DESCRIPTION:

The collection consists of 11,055 records comprising of both Legitimate and
Illegitimate websites. The exact count of both the category present in the dataset is
shown in Figure2. Every tuple in the dataset possesses 30 different characteristics that
a website will have. These characteristics will be considered as the independent
variables for training the model. Based on these features one dependent variable or
target function is defined, which defines the authenticity of the website. The dataset is
limited to 11,055 tuples so as to reduce the impact of overfitting on the performance
of the model.
This data is in CSV Format. The relation name is phishing and has 30 attributes.
There are 11055 instances in the data set. These collected features hold the categorical
values, Legitimate, Suspicious, and Phishy, these values have been replaced with
numerical values 1,0 and -1 respectively.
Figure 1:Overview of Dataset.
V. RANDOM FOREST ALGORITHM

Random forest is an example of Ensemble learning. Random forest is a collection
for n number of decision trees, where every decision tree produces different outputs for
the same input. Here the majority of the outputs from n decision trees are considered as
the output of the model. This model is trained on a dataset consisting of 11,055 tuples.
To create a more effective model, we have used GridSearchCV to find the optimal
parameters for the model. For the afore mentioned dataset, the best parameters produced
by the function is listed in Table 2. The model is trained using K fold validation
technique. Several experiments were performed on different number of K folds ranging
from 2 to 12 on the dataset. The dataset set split into 5 K-folds produced the optimal
result.
S.No Attributes Values
1. Bootstrap True
2. ccp_alpha 0.0
3. class_weight None
4. criterion Gini
5. max_depth None
6. max_features Log2
7. max_leaf_nodes None
8. max_samples None
9. min_impurity_decrease 0.0
10. min_impurity_split None
11. min_samples_leaf 1
12. min_samples_split 2
13. min_weight_fraction_leaf 0.0
14. n_estimators 100
15. n_jobs -1
16. oob_score False
17. random_state None
18. verbose 0
19. warm_start False
Table2: A list of optimal values for the classifier.
Based on the diverse dataset which is built and trained using the Random Forest
Classification model, the trained model exhibits priority over the 30 features which
highly contributes towards the identification of the website’s category i.e. Legitimate
or Phishing. This relative importance of these features as produced by the trained
model is depicted in Figure 3 more precisely.
Figure 3:Model-based feature importance of 30 features.
Upon training the Random Forest Classification Model with the dataset along with
the classifier parametric feature values mentioned in Table 1, it has produced more
effective outcome. The Performance of the model is analyzed through the confusion
matrix produced over the test and predicted values. The more detailed information on
the confusion matrix is represented in the Figure 4 below for better and easy analysis.
Figure 4:Confusion Matrix Produced by the Trained Model.
VI. QUALITY ATTRIBUTES

1. Availability:
The system operates 24/7 continuously and stably. There is no time off.
2. Usability :
The system has an easy-to-use intuitive interface.
3. Performance :
The system will notify the firewall if the website is fake or if the URL is
secure, the firewall will be passed.
The maximum delay to answer questions is less than 5 seconds.
4. Confidentiality:
System hardware is protected inside the core of the international training
office..For users only need to use our software and we can manage the sites that
absorb it and learn from our algorithm.
5.Overall Architecture:
The architecture of the Metadirectory that has been finalized is shown in
Figure 5. The relationship of each component and how the components are
connected are important design criteria. Therefore, key design points can be shown
most efficiently through component and connector view.
Figure 5: Network Diagram.
In our system, I use the machine learning algorithm random forest application
to release the website that must be fake or not rely on the random algorithm from
which we mount our device. Here I go to the internet of the system to catch the
packet and see if it has to be fake or if it is spoofing our system, I will report to
the firewall and immediately block the sending of the packet, and if It's safe to be
saved in our data and I have been updated on regular basic.
VII. REFERENCES:
[1] DTU Journal of Science & Technology 03(40) (2020) 16-25 - Automating the
update of packet filtering rules for open source firewalls:
https://duytan.edu.vn/upload/file/3.-BAI-3-(Nguyen-Kim-Tuan)-79.pdf
[2] Randomforest: https://prezi.com/905bwnaa7dva/how-random-forest-
algorithm-
works/?utm_campaign=share&utm_medium=copy&fbclid=IwAR2oGFibYSOtx
mJ1OdUGzMJdrWS3aEeOIyWq6PCI8p4hoTCIL2jPp40w-JM
[3] A COMPLETE GUIDE TO THE RANDOM FOREST ALGORITHM:
https://builtin.com/data-science/random-forest-algorithm
[4] HOW THE RANDOM FOREST ALGORITHM WORKS IN MACHINE
LEARNING:
https://dataaspirant.com/random-forest-algorithm-machine-learing/
[5] List of system quality attributes :
https://en.wikipedia.org/wiki/List_of_system_quality_attributes
[6] Software Architecture Quality Attributes:
https://dzone.com/articles/software-architecture-quality-attributes

Architecture Document: International School Capstone Project Proposal CMU-CS450 Capstone Project I

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Architecture Document: International School Capstone Project Proposal CMU-CS450 Capstone Project I

Uploaded by

Copyright:

Available Formats

International School

Capstone Project Proposal

APPLY MACHINE LEARNING IN DETECTING PHISHING WEBSITES

Khai, Tran Dinh

MSc. Tuan, Nguyen Kim

Project Acronym AMLIDPW

Version Date Comments Author Approval

Figure 1:Detect Phishing System

III. FEATURE EXTRACTION DESCRIPTION:

1. Address Bar based Features:

If The Domain Part has an IP Address → Phishing

1.2. Long URL to Hide the Suspicious Part

URL length < 54 → feature = Legitimate

1.4. URL’s having “@” Symbol

1.6. Adding Prefix or Suffix Separated by (-) to the Domain

1.7. Sub Domain and Multi Sub Domains

1.11. Using Non-Standard Port :

2. Abnormal Based Features:

2.2. URL of Anchor:

2.3. Links in <Meta>, <Script> and <Link> tags:

2.4. Server Form Handler (SFH):

2.5. Submitting Information to Email

2.6. Abnormal URL

3.3. Disabling Right Click:

3.4. Using Pop-up Window

3.5. IFrame Redirection

4. Domain based Features:

4.3. Website Traffic:

4.5. Google Index:

4.7. Statistical-Reports Based Feature:

IV. DATASET DESCRIPTION:

V. RANDOM FOREST ALGORITHM

VI. QUALITY ATTRIBUTES

Figure 5: Network Diagram.

You might also like