URL REPUTATION PROJECT
SERVER: a categorical feature whose value is the operating system of the server, obtained
from the packet response.
The random forest algorithm randomly samples observations and features to build several
decision trees, then aggregates their predictions; it also measures the relative importance of
each feature to the prediction.
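The random-forest step can be sketched as follows. This is a minimal illustration assuming scikit-learn; the feature matrix here is a synthetic stand-in for the extracted URL features, not the project's actual dataset.

```python
# Minimal random-forest sketch, assuming scikit-learn.
# X and y are placeholders for the extracted URL features and labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 5))                    # stand-in for the URL features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # stand-in malicious/benign labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample of observations and a random
# subset of features; predictions are aggregated across the trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
# Relative importance of each feature to the prediction:
print("feature importances:", clf.feature_importances_)
```

The `feature_importances_` attribute gives the measured contribution of each feature, which is what the averaging described above is weighted against.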
CLASSIFICATION REPORT AND CONFUSION MATRIX
IMPLEMENTING SUPPORT VECTOR MACHINE (SVM)
We use a Support Vector Machine (SVM), a supervised learning model, to classify the
maliciousness of the data; since the data is multivariate, the model must also account for the
correlation among the variables.
The training data is refined and used in the same form as for the naïve Bayes model.
SVM classifies the data into the possible output classes by finding a separating hyperplane.
When the classes are not separable as plotted, SVM handles the problem by introducing
additional features (dimensions) for the data points.
In the SVM classifier, it is easy to find a linear hyperplane between these two classes. But a
natural question arises: do we need to add such features manually to obtain a hyperplane?
No, the SVM algorithm has a technique called the kernel trick. An SVM kernel is a function
that takes a low-dimensional input space and transforms it into a higher-dimensional space,
i.e. it converts a non-separable problem into a separable one. It is mostly useful for
non-linear separation problems. Simply put, the kernel performs some extremely complex
data transformations and then finds a way to separate the data based on the labels or outputs
you have defined.
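The kernel trick can be seen on a small synthetic example. This sketch assumes scikit-learn; the concentric-circles data is illustrative only, not the project's URL dataset.

```python
# Kernel-trick illustration, assuming scikit-learn.
# Concentric circles are not linearly separable in the original 2-D space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear hyperplane cannot separate concentric circles...
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# ...but the RBF kernel implicitly maps the points into a higher-dimensional
# space where a separating hyperplane exists.
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear kernel accuracy:", round(linear_acc, 2))
print("RBF kernel accuracy:", round(rbf_acc, 2))
```

The RBF kernel achieves a much higher accuracy on this data, which is exactly the non-separable-to-separable conversion described above.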
SVM is implemented with imported libraries. Use the linear SVM kernel if you have a large
number of features (>1000), because the data is then more likely to be linearly separable in
the high-dimensional space. You can also use the RBF kernel, but do not forget to
cross-validate its parameters to avoid over-fitting.
Gamma: kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. The higher the value of gamma,
the more exactly the model tries to fit the training data set, which increases the
generalization error and causes an over-fitting problem.
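The cross-validation of gamma mentioned above can be sketched as follows. This assumes scikit-learn's `GridSearchCV`; the data is a synthetic stand-in for the extracted URL features, and the gamma grid is illustrative.

```python
# Sketch of cross-validating the RBF gamma parameter, assuming scikit-learn.
# The synthetic data stands in for the extracted URL features.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A large gamma fits the training set too exactly (over-fitting);
# 5-fold cross-validation selects a value that generalizes.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"gamma": [0.001, 0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)

print("best gamma:", grid.best_params_["gamma"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```

Scoring each candidate gamma on held-out folds, rather than on the training set, is what prevents the "exact fit to the training data" failure mode described above.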
We extract the required features and split them into training and testing data: 93.5% of the
data is used for training and the remaining 6.5% for testing. We build our SVM model using
the numpy library.
α (0.0001) is the learning rate, and the regularization parameter λ is set to 1/epochs.
Therefore, the regularization strength decreases as the number of epochs increases.
We then clip the weights to match the test data, which contains only 10 data points. We
extract the features from the test data, obtain the predictions, compare them with the actual
values, and print the accuracy of our model.
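The numpy training loop described above can be sketched as follows: hinge-loss gradient descent with learning rate α = 0.0001 and regularization λ = 1/epochs. This is a minimal sketch under those assumptions, with synthetic data standing in for the extracted URL features; the exact update rule used in the project may differ.

```python
# Minimal numpy SVM sketch: hinge-loss gradient descent with
# alpha = 0.0001 and lambda = 1 / epochs, as described above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {-1, +1}

epochs = 10000
alpha = 0.0001               # learning rate
lam = 1.0 / epochs           # regularization weakens as epochs grow

w = np.zeros(X.shape[1])
for _ in range(epochs):
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) < 1:           # inside margin or misclassified
            w += alpha * (yi * xi - 2 * lam * w)
        else:                                 # correct side: regularize only
            w += alpha * (-2 * lam * w)

preds = np.sign(X @ w)
print("training accuracy:", np.mean(preds == y))
```

Because λ = 1/epochs, running more epochs shrinks the regularization term, matching the behaviour noted above.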
THE HEATMAP OF SVM
CLASSIFICATION REPORT AND CONFUSION MATRIX OF SVM
CONCLUSION
We studied URL reputation by determining URL maliciousness. The experimental results
show that Random Forest performed better than the other classification techniques used. The
Random Forest classifier works well on unbalanced data (unequal numbers of instances for
the different classes) because it does not rely on skew-sensitive statistics such as the mean.
Random Forest is also far less prone to over-fitting than a single decision tree in
classification problems. Logistic Regression can be used if speed is the main criterion,
whereas Random Forest performs better on categorical data. However, misclassified data
could be used more effectively by updating the weights and classifying those points again,
since predictions improve with more data. Finally, the models appear to perform similarly
across the datasets, with performance influenced more by the choice of dataset than by the
model selection.