Using
Machine Learning
INDEX
CONTENTS
1. INTRODUCTION
1.1 Overview
1.2 Purpose
2. LITERATURE SURVEY
2.1 Existing System
2.2 Proposed System
3. THEORETICAL ANALYSIS
3.1 Block Diagram
3.2 Software / Hardware
4. EXPERIMENTAL INVESTIGATIONS
5. FLOW CHART
6. RESULT
8. APPLICATIONS
9. CONCLUSION
11. BIBLIOGRAPHY
12. APPENDIX
INTRODUCTION
1.1 Overview :
The attacker fools the user with social engineering techniques
delivered through channels such as SMS, voice calls, email,
websites and malware.
2. LITERATURE SURVEY
The work consists of host-based, page-based and lexical feature
extraction from collected URLs, followed by analysis. The first
step is the collection of phishing and benign URLs. Host-based,
popularity-based and lexical feature extraction is then applied
to form a database of feature values. The database is mined for
knowledge using different machine learning algorithms.
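As a minimal sketch of the lexical part of this pipeline, the snippet below turns each URL into a row of feature values. The function name, the five features and the sample URLs are illustrative assumptions, not the feature set used in this work; host-based and popularity-based features would need external lookups and are left out.

```python
import re
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Return a few illustrative lexical features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": "@" in url,
        # True when the host is a dotted IPv4 address instead of a name.
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
    }


# Hypothetical sample URLs; a real study would load the collected
# phishing and benign URLs here.
urls = ["https://www.example.com/login", "http://192.168.0.1/secure@update"]
feature_table = [lexical_features(u) for u in urls]
for url, row in zip(urls, feature_table):
    print(url, row)
```

Each dictionary corresponds to one row of the feature database that the learning algorithms are later trained on.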
Windows 10
Anaconda
Spyder
Jupyter
Python IDLE
Visual Studio Code
4. EXPERIMENTAL INVESTIGATIONS
When you tag a face in a Facebook photo, it is AI that is
running behind the scenes and identifying faces in a picture.
Face tagging is now omnipresent in several applications that
display pictures with human faces. Why just human faces?
There are several applications that detect objects such as cats,
dogs, bottles, cars, etc. We have autonomous cars running on
our roads that detect objects in real time to steer the car. When
you travel, you use Google Directions to learn the real-time
traffic situation and follow the best path suggested by Google at
that point in time. This is yet another real-time application of
AI. Google Translate, which we typically use while visiting
foreign countries, is a further example.
You can imagine the complexity involved in developing a
route-planning application such as Google Directions: there are
multiple paths to your destination, and the application has to
judge the traffic situation on every possible path to give you a
travel-time estimate for each. Besides, Google Directions covers
the entire globe. Undoubtedly, many AI and machine learning
techniques are in use under the hood of such applications.
Classification
Clustering
Probability Theories
Decision Trees
Very soon, the available data became so large that the
conventional techniques developed so far failed to analyze it
and provide predictions.
The machine now learns on its own, using the high computing
power and huge memory resources that are available today.
Machine Learning :
Supervised Learning :
For example, given a set of 100 students, you may want to divide
them into three groups based on their heights: short, medium
and tall.
Now, when a new student comes in, you will put him in an
appropriate group by measuring his height.
When the machine learns how the groups are formed, it will be
able to classify any unknown new student correctly.
Once again, you would use the test data to verify that the
machine has learned your technique of classification before
putting the developed model in production.
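The height example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the heights are made-up values in centimetres, and a nearest-centroid rule stands in for whatever classifier is actually chosen. The model is fitted on labelled training data, checked against separate test data, and then used on a new student.

```python
from statistics import mean

# Hypothetical labelled heights in centimetres (not data from this study).
train = [(150, "short"), (155, "short"), (165, "medium"),
         (170, "medium"), (182, "tall"), (188, "tall")]
test = [(152, "short"), (168, "medium"), (185, "tall")]


def fit(samples):
    """Learn one mean height (centroid) per group from labelled samples."""
    groups = {}
    for height, label in samples:
        groups.setdefault(label, []).append(height)
    return {label: mean(heights) for label, heights in groups.items()}


def predict(centroids, height):
    """Assign a height to the group whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - height))


centroids = fit(train)

# Verify the learned grouping on held-out test data before using it.
correct = sum(predict(centroids, h) == label for h, label in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
print("new student (175 cm):", predict(centroids, 175))
```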
Unsupervised Learning :
Reinforcement Learning :
Decision Tree:
Random Forest :
Naive Bayes :
How you can learn a naive Bayes model from training data.
How to best prepare your data for the naive Bayes algorithm.
Where to go for more information on naive Bayes.
Representation Used:
Logistic Regression :
Support Vector Machine :
Support vectors are the data points closest to the hyperplane;
they influence its position and orientation. Using these support
vectors, we maximize the margin of the classifier. Deleting a
support vector will change the position of the hyperplane. These
are the points that help us build our SVM.
The SVM model can be built from scratch using the NumPy library,
or by calling the related functions of the scikit-learn library.
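To make the idea concrete, here is a minimal from-scratch sketch of a linear SVM trained by sub-gradient descent on the regularised hinge loss. It is one common way to fit a linear SVM, not necessarily the method used in this project; plain Python lists stand in for NumPy arrays to keep it dependency-free, and the toy data and hyperparameters (lr, lam, epochs) are assumptions chosen for illustration.

```python
# Toy, linearly separable 2-D data (hypothetical); labels are -1 or +1.
X = [(-1.0, -1.0), (-2.0, -1.0), (-1.0, -2.0),
     (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]


def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Sub-gradient descent on the L2-regularised hinge loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in zip(X, y):
            margin = label * (w[0] * x1 + w[1] * x2 + b)
            # The regularisation term shrinks w slightly on every step.
            w = [w[0] * (1 - lr * lam), w[1] * (1 - lr * lam)]
            if margin < 1:
                # The point lies inside the margin or on the wrong side:
                # move the hyperplane away from it.
                w = [w[0] + lr * label * x1, w[1] + lr * label * x2]
                b += lr * label
    return w, b


def predict(w, b, point):
    """Classify a point by the side of the hyperplane it falls on."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score >= 0 else -1


w, b = train_linear_svm(X, y)
predictions = [predict(w, b, p) for p in X]
# The points with the smallest functional margin are the ones closest
# to the hyperplane, i.e. the candidates for support vectors.
margins = [label * (w[0] * p[0] + w[1] * p[1] + b) for p, label in zip(X, y)]
print("w =", w, "b =", b)
print("predictions:", predictions)
print("margins:", [round(m, 2) for m in margins])
```

In this toy set the points nearest the boundary end up with the smallest margins, which is why removing them would move the hyperplane while removing the others would not.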
K-Means :
Basic Algorithms
ADABOOST:
weight(xi) = 1/n
Outliers: Outliers will force the ensemble down the rabbit hole
of working hard to correct for cases that are unrealistic. These
could be removed from the training dataset.
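The effect of a hard case on the ensemble can be seen in a single boosting round. The sketch below starts from the uniform weights given above and applies the standard discrete AdaBoost update; the labels, the weak-learner predictions and the choice of this particular update rule are assumptions made for illustration.

```python
import math

# Hypothetical labels and weak-learner predictions; the last sample plays
# the role of an outlier that the weak learner gets wrong.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]

n = len(y_true)
weights = [1.0 / n] * n  # weight(xi) = 1/n

# Weighted error of the weak learner.
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
# Importance of the learner in the standard discrete AdaBoost formulation.
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weight of misclassified samples, lower the rest, renormalise.
updated = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
total = sum(updated)
weights = [w / total for w in updated]

print("error:", error, "alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in weights])
```

After one round the misclassified sample carries half of the total weight, so the next weak learner is pushed to concentrate on it. If that sample is an unrealistic outlier, this effort is wasted.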
Features Description :
Index: It is just like a serial number.
http://www.hud.ac.uk/students/.
A domain name might include a country-code top-level domain
(ccTLD), which in our example is “uk”. The “ac” part is
shorthand for “academic”, the combined “ac.uk” is called a
second-level domain (SLD), and “hud” is the actual name of the
domain. To produce a rule for extracting this feature, we first
have to omit the (www.) from the URL, which is in fact a
subdomain in itself. Then, we have to remove the (ccTLD)
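The two steps described so far can be sketched as follows. The function name is hypothetical, and treating any final two-letter label as a ccTLD is a simplifying assumption; a real implementation would check against a list of country codes.

```python
from urllib.parse import urlparse


def strip_www_and_cctld(url: str) -> str:
    """Drop a leading 'www.' and a trailing two-letter ccTLD from the host."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    labels = host.split(".")
    # Assumption: a final label of exactly two letters is a country code.
    if len(labels) > 1 and len(labels[-1]) == 2 and labels[-1].isalpha():
        labels = labels[:-1]
    return ".".join(labels)


print(strip_www_and_cctld("http://www.hud.ac.uk/students/"))  # prints: hud.ac
```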
Once we have raw data for phishing and legitimate sites, the
next step is to process these data and extract meaningful
information from them to detect fraudulent domains. The dataset
used for machine learning must actually consist of these
features. So, we must process the raw data collected from Alexa,
PhishTank or other data sources and create a new dataset to
train our system with machine learning algorithms. The features
should be selected according to our needs and purposes, and
their values should be calculated for every site.
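A minimal sketch of this step, assuming two raw lists of URLs are already available: every URL is converted to feature values, given a label, and written out as one row of a new dataset. The URLs, the two features and the label encoding (0 = legitimate, 1 = phishing) are placeholders, not the ones used in this work.

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical raw inputs; real lists would come from sources such as
# Alexa or PhishTank.
legitimate_urls = ["https://www.example.com/", "https://www.example.org/about"]
phishing_urls = ["http://203.0.113.5/login/verify",
                 "http://example.com.account-update.test/signin"]


def features(url):
    """Two illustrative feature values per URL."""
    host = urlparse(url).hostname or ""
    return [len(url), host.count(".")]


rows = [features(u) + [0] for u in legitimate_urls]  # label 0 = legitimate
rows += [features(u) + [1] for u in phishing_urls]   # label 1 = phishing

# Write the labelled dataset as CSV (to memory here; a file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["url_length", "num_dots_in_host", "label"])
writer.writerows(rows)
print(buffer.getvalue())
```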
5. RESULT
Name of the Model | Train and Test Set | Accuracy Score | Precision Score | Avg ROC AUC Score | Recall Score
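For reference, the accuracy, precision and recall scores named in the table header can be computed from true and predicted labels as sketched below. The label vectors are made-up values for illustration only and are not results from this study.

```python
# Hypothetical labels and predictions (1 = phishing, 0 = legitimate).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # phishing caught
tn = sum(t == 0 and p == 0 for t, p in pairs)  # legitimate passed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # legitimate flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # phishing missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```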
7.1 ADVANTAGES:
With the help of this system, users can also purchase products
online without hesitation.
7.2 DISADVANTAGES:
The system does not use content from the body of the email. It
is susceptible to short-lived phishing domains. Users do not pay
attention to warnings. Not all email clients are browser-based.
8. APPLICATIONS
9. CONCLUSION
The main objective of this study is to help the users to
differentiate between the phishing and legitimate URLs by
inspecting the URLs based on particular unique characteristics.
This research demonstrates the capability to recognize fake web
pages based on their URLs. In order to protect the victim from
phishing attacks, educational awareness programs must be
conducted.
The most important way to protect users from phishing attacks
is education and awareness. Internet users must be aware of the
security tips given by experts.
Many features of this work can still be improved to address
other issues. The heuristics can be further developed to detect
phishing attacks in the presence of embedded objects like Flash.
Identity extraction is an important operation, and it was
improved with an Optical Character Recognition (OCR) system to
extract the text and images.
11. BIBLIOGRAPHY
https://www.ijert.org/detection-of-url-based-phishing-attacks-using-machine-learning
https://www.ijeat.org/wp-content/uploads/papers/v8i2s/B11031282S18.pdf
https://towardsdatascience.com/phishing-domain-detection-with-ml-5be9c99293e5
https://www.hindawi.com/journals/scn/2017/5421046/
http://phishtank.com
https://blogger.com
https://www.alexa.com/
https://en.wikipedia.org/wiki/Phishing
Almomani, Ammar, et al. "Evolving fuzzy neural network for
phishing emails detection." Journal of Computer Science 8.7
(2012): 1099.
Sananse, Bhagyashree E., and Tanuja K. Sarode. "Phishing URL
Detection: A Machine Learning and Web Signals, Controls and
Computation (EPSCICON), 2012 International Conference on.
IEEE, 2012
Jain, A. K., & Gupta, B. B. (2016, March). Comparative analysis
of features based machine learning approaches for phishing
detection. In 2016 3rd International Conference on Computing
for Sustainable Global Development (INDIACom) (pp. 2125-2130).
IEEE.
APPENDIX
Logistic Regression
K-nearest Classifier
Decision Tree
Random Forest
SVM
AdaBoost
Naïve Bayes