
URL REPUTATION

Dataset Name: Malicious and Benign Websites

Goal: To examine URLs by detecting whether they are malicious.
Dataset Source: Christian Urcuqui, https://www.kaggle.com/xwolf12/malicious-and-benign-websites
Data Type: Multivariate
Task: Classification
Area: Computer Science
Format Type: Matrix
Missing Values: Yes
Instances: 1778
Features: 21
 URL: the anonymous identification of the URL analyzed in the study

 URL_LENGTH: the number of characters in the URL

 NUMBER_SPECIAL_CHARACTERS: the number of special characters identified in the URL, such as “/”, “%”, “#”, “&”, “.”, “=”

 CHARSET: a categorical value; its meaning is the character encoding standard (also called character set)

 SERVER: a categorical value; its meaning is the operating system of the server, obtained from the packet response

 CONTENT_LENGTH: the content size from the HTTP header

 WHOIS_COUNTRY: a categorical variable; its values are the countries obtained from the server response (specifically, our script used the Whois API)

 WHOIS_STATEPRO: a categorical variable; its values are the states obtained from the server response (specifically, our script used the Whois API)

 WHOIS_REGDATE: Whois provides the server registration date, so this variable holds date values with the format DD/MM/YYYY HH:MM

 WHOIS_UPDATED_DATE: the last update date of the analyzed server, obtained through Whois

 TCP_CONVERSATION_EXCHANGE: the number of TCP packets exchanged between the server and our honeypot client

 DIST_REMOTE_TCP_PORT: the number of distinct remote TCP ports detected

 REMOTE_IPS: the total number of IPs connected to the honeypot

 APP_BYTES: the number of bytes transferred

 SOURCE_APP_PACKETS: packets sent from the honeypot to the server

 REMOTE_APP_PACKETS: packets received from the server

 APP_PACKETS: the total number of IP packets generated during the communication between the honeypot and the server

 DNS_QUERY_TIMES: the number of DNS packets generated during the communication between the honeypot and the server

 TYPE: a categorical variable whose values represent the type of web page analyzed; specifically, 1 is for malicious websites and 0 is for benign websites
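
Since the dataset has missing values and several categorical columns, some preprocessing is needed before any classifier can be fitted. A minimal sketch with pandas follows; the local file name dataset.csv is an assumption, and the column names follow the feature list above.

import pandas as pd

# Load the Kaggle CSV (the local file name "dataset.csv" is an assumption).
df = pd.read_csv("dataset.csv")

# Inspect missingness per column before deciding how to handle it.
print(df.isnull().sum())

# Simple handling: fill numeric gaps with the median, then drop any rows
# that still contain missing values in the remaining columns.
df["CONTENT_LENGTH"] = df["CONTENT_LENGTH"].fillna(df["CONTENT_LENGTH"].median())
df = df.dropna()

# Encode the categorical columns as integer codes for the classifiers below.
for col in ["CHARSET", "SERVER", "WHOIS_COUNTRY", "WHOIS_STATEPRO"]:
    df[col] = df[col].astype("category").cat.codes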

IMPLEMENTATION OF LOGISTIC REGRESSION
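
A minimal sketch of this step with scikit-learn, assuming the preprocessed frame df from the loading sketch above; the 70/30 split shown here is illustrative, not necessarily the report's exact split.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Features: everything except the identifier, the raw date strings and the label.
X = df.drop(columns=["URL", "WHOIS_REGDATE", "WHOIS_UPDATED_DATE", "TYPE"])
y = df["TYPE"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)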

CLASSIFICATION REPORT AND CONFUSION MATRIX
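
Both come from scikit-learn's standard metrics utilities; a sketch for the logistic regression predictions:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = log_reg.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # rows: actual class, columns: predicted class
print(classification_report(y_test, y_pred))  # precision, recall and F1 per class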


IMPLEMENTATION OF RANDOM FOREST CLASSIFIER
FEATURE IMPORTANCE

The random forest algorithm randomly selects observations and features to build several decision
trees and then averages their results; the relative importance of each feature to the prediction
can be measured from these trees, as the sketch below shows.
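
A sketch of this step with scikit-learn's RandomForestClassifier, reusing the split from the logistic regression section; the number of trees (100) is an illustrative choice.

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Mean decrease in impurity, averaged over all trees in the forest.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))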
CLASSIFICATION REPORT AND CONFUSION MATRIX
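The same confusion_matrix and classification_report utilities shown in the logistic regression section are applied here to the random forest's predictions, rf.predict(X_test).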
IMPLEMENTING SUPPORT VECTOR MACHINE (SVM)
We use a Support Vector Machine (SVM), a supervised machine learning model, to classify the
maliciousness of the data; since the data is multivariate, the model must account for the
correlations among the variables. The training data is refined before being used to fit the
model. We classify the given information into the possible outputs by considering a separating
hyperplane. SVM solves such problems by introducing additional features derived from the data
points in a plot.
In the SVM classifier it is easy to place a linear hyperplane between these two classes. But a
natural question arises: do we need to add such features manually to obtain a hyperplane? No;
the SVM algorithm has a technique called the kernel trick. An SVM kernel is a function that
takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it
converts a non-separable problem into a separable one. It is mostly useful in non-linear
separation problems. Simply put, it performs some extremely complex data transformations, then
finds the way to separate the data based on the labels or outputs you have defined, as the
small demonstration below shows.
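
A small self-contained illustration of the kernel trick on synthetic data (scikit-learn's make_circles, not the URL dataset): a linear kernel fails on concentric circles, while the RBF kernel separates them by implicitly mapping the points into a higher-dimensional space.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X_demo, y_demo = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X_demo, y_demo)
rbf = SVC(kernel="rbf").fit(X_demo, y_demo)

# The linear hyperplane cannot separate the rings; the RBF kernel can.
print("linear accuracy:", linear.score(X_demo, y_demo))
print("rbf accuracy:   ", rbf.score(X_demo, y_demo))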
SVM is implemented with libraries imported from scikit-learn. Use the linear SVM kernel if you
have a large number of features (>1000), because it is more likely that the data is linearly
separable in a high-dimensional space. You can also use RBF, but do not forget to
cross-validate its parameters to avoid over-fitting.
Gamma: the kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. The higher the value of gamma,
the more exactly the model tries to fit the training data, which raises the generalization
error and causes over-fitting.

from sklearn import svm

svm_model = svm.SVC(kernel='rbf', random_state=0, gamma=3, C=1.0)
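Assuming the train/test split from the earlier sections, the classifier is then fitted with svm_model.fit(X_train, y_train) and evaluated on the held-out test data.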


It works really well with a clear margin of separation and is effective in high dimensional
spaces. It is effective in cases where the number of dimensions is greater than the number of
samples. It uses a subset of training points in the decision function (called support vectors),
so it is also memory efficient.

We extract the required features and split them into training and testing data. 93.5% of the
data is used for training and the rest for testing; we then build our SVM model from scratch
using the numpy library.
α = 0.0001 is the learning rate, and the regularization parameter λ is set to 1/epochs; the
regularizing value therefore decreases as the number of epochs increases.
We then clip the weight vector, as the test data contains only 10 data points. We extract the
features from the test data and predict the values. We obtain the predictions, compare them
with the actual values, and print the accuracy of our model. A sketch of this loop follows.
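
A minimal numpy sketch of the from-scratch training loop described above; the epoch count and the placeholder array names are assumptions, while α = 0.0001 and λ = 1/epochs follow the text.

import numpy as np

def train_svm_sgd(X, y, epochs=1000, alpha=0.0001):
    """Linear SVM trained by stochastic gradient descent on the hinge loss.

    y must be in {-1, +1}; lambda is set to 1/epochs, as described above.
    """
    w = np.zeros(X.shape[1])
    lam = 1.0 / epochs
    for epoch in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) < 1:
                # Margin violated: hinge-loss gradient plus regularization.
                w += alpha * (yi * xi - 2 * lam * w)
            else:
                # Correct side of the margin: only the regularization term.
                w += alpha * (-2 * lam * w)
    return w

# Prediction and accuracy on the test features (placeholder names):
# w = train_svm_sgd(X_train_np, y_train_pm1)
# y_hat = np.sign(X_test_np @ w)
# print("accuracy:", np.mean(y_hat == y_test_pm1))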
THE HEATMAP OF SVM
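
A sketch of such a heatmap with seaborn, assuming it visualizes the SVM confusion matrix (the use of seaborn and the fitted svm_model are assumptions):

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Confusion matrix of the SVM predictions on the held-out test data.
cm = confusion_matrix(y_test, svm_model.predict(X_test))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["benign", "malicious"],
            yticklabels=["benign", "malicious"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()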
CLASSIFICATION REPORT AND CONFUSION MATRIX OF SVM

CONCLUSION
We studied URL reputation by determining the maliciousness of URLs. The experimental results
show that Random Forest performed better than the other classification techniques used. Random
Forest works well for unbalanced data (unequal numbers of instances for the different classes)
because its predictions are not driven by skewed statistics such as the mean, and it is
comparatively resistant to over-fitting in classification problems. Logistic Regression can be
used if speed is the criterion, whereas in the case of categorical data Random Forest will
perform better. However, misclassified data could be used more effectively by updating the
weights and classifying again, so that predictions rest on more data, because more data leads
to better prediction. Finally, the models appear to perform similarly, with performance
influenced more by the choice of dataset than by the choice of model.
