Predicting Forest Fire Damage with SVM


Build a Machine Learning Model that can predict the burned area of forest fires using meteorological and

other data.

Support Vector Machines (SVM).


Support Vectors

Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a
data set.

Hyperplane

Imagine a hyperplane as a line that linearly separates and classifies a set of data.

The farther away from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the
correct side of it.

The dimensionality of the hyperplane depends on the number of features in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a
2-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of each class.

So when new test data is added, whichever side of the hyperplane it lands on decides the class we assign to it.

In brief, we can define the hyperplane as follows: there can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary for classifying the data
points. This best boundary is known as the hyperplane of the SVM.

How to find the right hyperplane

To find the right hyperplane, it is important to set the margin appropriately.
The distance between the hyperplane and the nearest data point from either set is known as the margin.
The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
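As a concrete illustration of support vectors, the hyperplane and the margin, here is a minimal sketch (assuming scikit-learn is available, and using a small synthetic two-feature dataset rather than the forest fire data) that fits a linear SVM and reads these quantities back from the fitted model:

# Illustrative sketch on synthetic data (not the forest fire dataset)
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=0)  # two clusters of points
clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

w = clf.coef_[0]                       # normal vector of the separating hyperplane w.x + b = 0
b = clf.intercept_[0]                  # hyperplane offset
margin_width = 2 / np.linalg.norm(w)   # width of the maximum margin
print("support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b, "margin width =", margin_width)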

SVM

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems; it is mostly used for classification.
An SVM model is basically a representation of the different classes separated by a hyperplane in a multidimensional space.
It can solve linear and non-linear problems and works well for many practical problems.
The idea of SVM is simple: the algorithm creates a line or a hyperplane which separates the data into classes.
The hyperplane is generated in an iterative manner by SVM so that the classification error is minimized.
The goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH).

Two types of SVM

Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by a single straight line, it is termed linearly separable data, and the classifier
used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by a straight line, it is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier (illustrated in the sketch below).
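To make the linear/non-linear distinction concrete, here is a short sketch on synthetic ring-shaped data (an illustrative aside, not the forest fire dataset): a linear SVM cannot separate the two rings, while an RBF SVM can.

# Illustrative sketch: linearly vs non-linearly separable data (synthetic rings)
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X_rings, y_rings = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
linear_acc = SVC(kernel="linear").fit(X_rings, y_rings).score(X_rings, y_rings)  # a straight line cannot separate rings
rbf_acc = SVC(kernel="rbf").fit(X_rings, y_rings).score(X_rings, y_rings)        # the RBF kernel separates them easily
print("linear SVM accuracy:", linear_acc, " RBF SVM accuracy:", rbf_acc)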

For documentation on SVM, see the scikit-learn documentation.

Applications of SVM

SVMs are helpful in text and hypertext categorization, as their application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings. Some methods for shallow
semantic parsing are based on support vector machines.
Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of
relevance feedback. This is also true for image segmentation systems.
Classification of satellite data such as SAR data can be done using supervised SVM.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other sciences. SVMs have been used to classify proteins with up to 90% of the compounds classified correctly. Permutation tests based on SVM weights
have been suggested as a mechanism for interpreting SVM models, and support-vector machine weights have also been used to interpret SVM models in the past. Post-hoc interpretation of support-vector machine models,
in order to identify the features used by the model to make predictions, is a relatively new area of research with special significance in the biological sciences.

In [4]: #importing the libraries


# importing numpy for numerical operations
import numpy as np
# importing pandas for data manipulation operations
import pandas as pd
# importing pyplot on matplotlib for visualisation purpose
import matplotlib.pyplot as plt
%matplotlib inline
#importing seaborn for advanced visualization
import seaborn as sns

In [5]: #importing the data set


forest=pd.read_csv("forestfires.csv")

Forest Fire Damage Prediction

The data we have is about forest fires; let's try to predict the damage caused by a given fire.
We will use a classification model to make our prediction.
To download the dataset, click here.

Attribute information:

month - month of the year: "jan" to "dec"

day - day of the week: "mon" to "sun"


FFMC - FFMC index from the FWI system: 18.7 to 96.20
DMC - DMC index from the FWI system: 1.1 to 291.3
DC - DC index from the FWI system: 7.9 to 860.6
ISI - ISI index from the FWI system: 0.0 to 56.10
temp - temperature in Celsius degrees: 2.2 to 33.30
RH - relative humidity in %: 15.0 to 100
wind - wind speed in km/h: 0.40 to 9.40
rain - outside rain in mm/m2 : 0.0 to 6.4
area - the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, so it may make sense to model it with a logarithm transform; a sketch of this follows below).
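Because the area variable is so skewed, one common option (sketched below as an aside; it is not applied in the cells that follow, which instead bin area into a size category) is a log(1 + x) transform of the target:

# Optional sketch: log-transform the skewed 'area' target (not used later in this notebook)
import numpy as np
log_area = np.log1p(forest['area'])   # log(1 + area) compresses the long right tail towards 0
print(log_area.describe())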
let's go

In [6]: #looking into the dataset


forest

Out[6]: month day FFMC DMC DC ISI temp RH wind rain ... monthfeb monthjan monthjul monthjun monthmar monthmay monthnov monthoct monthsep size_category

0 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 ... 0 0 0 0 1 0 0 0 0 small

1 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 ... 0 0 0 0 0 0 0 1 0 small

2 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 ... 0 0 0 0 0 0 0 1 0 small

3 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 ... 0 0 0 0 1 0 0 0 0 small

4 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 ... 0 0 0 0 1 0 0 0 0 small

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

512 aug sun 81.6 56.7 665.6 1.9 27.8 32 2.7 0.0 ... 0 0 0 0 0 0 0 0 0 large

513 aug sun 81.6 56.7 665.6 1.9 21.9 71 5.8 0.0 ... 0 0 0 0 0 0 0 0 0 large

514 aug sun 81.6 56.7 665.6 1.9 21.2 70 6.7 0.0 ... 0 0 0 0 0 0 0 0 0 large

515 aug sat 94.4 146.0 614.7 11.3 25.6 42 4.0 0.0 ... 0 0 0 0 0 0 0 0 0 small

516 nov tue 79.5 3.0 106.7 1.1 11.8 31 4.5 0.0 ... 0 0 0 0 0 0 1 0 0 small

517 rows × 31 columns

In [7]: # to know the pair wise correlation


forest.corr()

Out[7]: 28 × 28 pairwise correlation matrix over FFMC, DMC, DC, ISI, temp, RH, wind, rain, area and the day/month dummy columns (the rightmost columns of the display are cut off in this export). Among the stronger correlations visible: DMC-DC ≈ 0.68, FFMC-ISI ≈ 0.53, temp-RH ≈ -0.53, and DC-monthmar ≈ -0.65.

28 rows × 28 columns

In [8]: #returns the number of missing values in the data set


forest.isnull().sum()

Out[8]:
month 0
day 0
FFMC 0
DMC 0
DC 0
ISI 0
temp 0
RH 0
wind 0
rain 0
area 0
dayfri 0
daymon 0
daysat 0
daysun 0
daythu 0
daytue 0
daywed 0
monthapr 0
monthaug 0
monthdec 0
monthfeb 0
monthjan 0
monthjul 0
monthjun 0
monthmar 0
monthmay 0
monthnov 0
monthoct 0
monthsep 0
size_category 0
dtype: int64

From the above output we can clearly see that there are no null values.

In [9]: # to find duplicates


forest[forest.duplicated()].shape

Out[9]: (8, 31)

In [38]: # viewing the data with the duplicates removed
# note: the result is not assigned back, so `forest` itself still contains all 517 rows


forest.drop_duplicates()

Out[38]: month day FFMC DMC DC ISI temp RH wind rain ... monthfeb monthjan monthjul monthjun monthmar monthmay monthnov monthoct monthsep size_category

0 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 ... 0 0 0 0 1 0 0 0 0 small

1 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 ... 0 0 0 0 0 0 0 1 0 small

2 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 ... 0 0 0 0 0 0 0 1 0 small

3 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 ... 0 0 0 0 1 0 0 0 0 small

4 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 ... 0 0 0 0 1 0 0 0 0 small

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

512 aug sun 81.6 56.7 665.6 1.9 27.8 32 2.7 0.0 ... 0 0 0 0 0 0 0 0 0 large

513 aug sun 81.6 56.7 665.6 1.9 21.9 71 5.8 0.0 ... 0 0 0 0 0 0 0 0 0 large

514 aug sun 81.6 56.7 665.6 1.9 21.2 70 6.7 0.0 ... 0 0 0 0 0 0 0 0 0 large

515 aug sat 94.4 146.0 614.7 11.3 25.6 42 4.0 0.0 ... 0 0 0 0 0 0 0 0 0 small

516 nov tue 79.5 3.0 106.7 1.1 11.8 31 4.5 0.0 ... 0 0 0 0 0 0 1 0 0 small

509 rows × 31 columns

From the output above we can see that dropping duplicates leaves 509 of the 517 rows; because the result was not assigned back, the full dataframe is still used in the following steps.
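If we actually wanted to discard the 8 duplicated rows, the result would have to be assigned back (or dropped in place), roughly as sketched below; this notebook instead keeps working with the original dataframe.

# Sketch: actually removing the duplicates (not what the notebook does above)
forest = forest.drop_duplicates().reset_index(drop=True)   # keeps the 509 unique rows
print(forest.shape)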

In [11]: forest.dropna() # drops rows containing missing values; since there are none, all 517 rows are returned

Out[11]: month day FFMC DMC DC ISI temp RH wind rain ... monthfeb monthjan monthjul monthjun monthmar monthmay monthnov monthoct monthsep size_category

0 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 ... 0 0 0 0 1 0 0 0 0 small

1 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 ... 0 0 0 0 0 0 0 1 0 small

2 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 ... 0 0 0 0 0 0 0 1 0 small

3 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 ... 0 0 0 0 1 0 0 0 0 small

4 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 ... 0 0 0 0 1 0 0 0 0 small

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

512 aug sun 81.6 56.7 665.6 1.9 27.8 32 2.7 0.0 ... 0 0 0 0 0 0 0 0 0 large

513 aug sun 81.6 56.7 665.6 1.9 21.9 71 5.8 0.0 ... 0 0 0 0 0 0 0 0 0 large

514 aug sun 81.6 56.7 665.6 1.9 21.2 70 6.7 0.0 ... 0 0 0 0 0 0 0 0 0 large

515 aug sat 94.4 146.0 614.7 11.3 25.6 42 4.0 0.0 ... 0 0 0 0 0 0 0 0 0 small

516 nov tue 79.5 3.0 106.7 1.1 11.8 31 4.5 0.0 ... 0 0 0 0 0 0 1 0 0 small

517 rows × 31 columns

In [12]: #creating the new dataframe 'fr'


fr=pd.DataFrame(forest)

We can observe that we have successfully created a new dataframe from the data held in the forest dataframe.
Creating a new dataframe lets us hold the data in two dataframes, which helps with model building and the other steps.

In [13]: # selecting the weather columns and the target by position: temp, RH, wind, rain, area


FR=fr.iloc[:,6:11]

In [14]: #exploring the new dataframe


FR.head()

Out[14]: temp RH wind rain area

0 8.2 51 6.7 0.0 0.0

1 18.0 33 0.9 0.0 0.0

2 14.6 33 1.3 0.0 0.0

3 8.3 97 4.0 0.2 0.0

4 11.4 99 1.8 0.0 0.0

head() shows the first 5 rows of the dataframe.

In [15]: FR.tail()

Out[15]: temp RH wind rain area

512 27.8 32 2.7 0.0 6.44

513 21.9 71 5.8 0.0 54.29

514 21.2 70 6.7 0.0 11.16

515 25.6 42 4.0 0.0 0.00

516 11.8 31 4.5 0.0 0.00

tail() shows the last 5 rows of the dataframe.

In [16]: # labelling fires by burned area: >= 5 ha as 'large', < 5 ha as 'small'


row_indexes=FR[FR['area']>=5].index

FR.loc[row_indexes,'Area']="large"

row_indexes=FR[FR['area']<5].index

FR.loc[row_indexes,'Area']="small"

FR.head()

Out[16]: temp RH wind rain area Area

0 8.2 51 6.7 0.0 0.0 small

1 18.0 33 0.9 0.0 0.0 small

2 14.6 33 1.3 0.0 0.0 small

3 8.3 97 4.0 0.2 0.0 small

4 11.4 99 1.8 0.0 0.0 small
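As an aside, the same labelling can be written more compactly with numpy; the sketch below produces an identical 'Area' column.

# Equivalent, more compact labelling using numpy
import numpy as np
FR['Area'] = np.where(FR['area'] >= 5, 'large', 'small')   # 'large' if burned area >= 5 ha, else 'small'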

In [17]: # dropping the numeric 'area' column now that the categorical 'Area' label has been created


FR=FR.drop('area', axis=1)

In [18]: FR.head()

Out[18]: temp RH wind rain Area

0 8.2 51 6.7 0.0 small

1 18.0 33 0.9 0.0 small

2 14.6 33 1.3 0.0 small

3 8.3 97 4.0 0.2 small

4 11.4 99 1.8 0.0 small

We can clearly see the changes applied to the dataset as we proceed.

In [19]: FR.corr()

Out[19]: temp RH wind rain

temp 1.000000 -0.527390 -0.227116 0.069491

RH -0.527390 1.000000 0.069410 0.099751

wind -0.227116 0.069410 1.000000 0.061119

rain 0.069491 0.099751 0.061119 1.000000

In [20]: # generate a scatter plot matrix of the selected features


sns.pairplot(FR)

Out[20]: <seaborn.axisgrid.PairGrid at 0x151154a0b20>

From the above plot we can examine how the parameters relate to one another.
As temperature increases, wind speed tends to decrease slightly.
Rainfall is close to zero for almost all observations, regardless of temperature.

In [21]: #using heat map to find correlation


sns.heatmap(FR.corr(), annot=True)

Out[21]: <AxesSubplot:>

In the heatmap, the diagonal elements are all 1 because each variable is perfectly correlated with itself.
The strongest relationship is between temperature and relative humidity: as temperature increases, RH decreases (correlation ≈ -0.53).
Various other insights can be read from the heatmap on closer inspection.

In [22]: #to find the univariate distribution


#dist plots are drawn on top of matplotlib by seaborn
sns.distplot(FR['temp'])

C:\Users\shyam\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Out[22]: <AxesSubplot:xlabel='temp', ylabel='Density'>

The density rises with temperature up to a peak and then falls again at higher temperatures, i.e. the temperature distribution is roughly unimodal.

In [23]: #importing sklearn


from sklearn.model_selection import train_test_split
import sklearn.svm as svm #importing support vector machines from sklearn

Test and Train data

Typically, when you separate a data set into a training set and a testing set, most of the data is used for training and a smaller portion is used for testing.
Train data: training data is necessary to teach an ML algorithm.
Test data: test data helps you to validate the progress of the algorithm's training and adjust or optimize it for improved results.

In [24]: X= FR.iloc[:,0:4]
y= FR.iloc[:,4]
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3)
#splitting the data into test and train

Here test_size=0.3 means that 30% of the data is held out for testing (and 70% is used for training).
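Note that the split above uses no random_state, so re-running the notebook produces a different partition and slightly different results. Below is a sketch of a reproducible, class-stratified alternative (the seed value 42 is an arbitrary choice):

# Sketch: reproducible, stratified split (alternative to the split above)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)   # 30% test, same class ratio in both sets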

In [25]: #looking into the train data


X_train

Out[25]: temp RH wind rain

275 5.1 61 4.9 0.0

322 16.8 28 4.0 0.0

390 7.5 71 6.3 0.0

234 17.7 25 3.1 0.0

282 4.2 51 4.0 0.0

... ... ... ... ...

375 15.4 57 4.9 0.0

423 22.3 48 4.0 0.0

33 17.7 39 3.6 0.0

339 20.4 41 1.8 0.0

304 11.3 94 4.9 0.0

361 rows × 4 columns

Training data is used to fit the model.

Here we are feeding a portion of our dataset to the model so it can learn the relationship between the features and the outcome.

In [26]: #looking into test data


X_test

Out[26]: temp RH wind rain

469 13.7 33 9.4 0.0

108 20.3 45 3.1 0.0

185 17.6 46 3.1 0.0

298 19.6 43 4.9 0.0

397 24.3 33 3.6 0.0

... ... ... ... ...

122 22.5 42 5.4 0.0

111 18.8 18 4.5 0.0

126 9.0 49 2.2 0.0

100 19.8 39 5.4 0.0

71 17.7 37 3.6 0.0

156 rows × 4 columns

In [27]: y_train

Out[27]:
275 large
322 small
390 large
234 large
282 small
...
375 large
423 small
33 small
339 small
304 small
Name: Area, Length: 361, dtype: object

In [28]: y_test

Out[28]:
469 large
108 small
185 large
298 small
397 small
...
122 small
111 small
126 small
100 small
71 small
Name: Area, Length: 156, dtype: object

In [29]: # importing the support vector classifier (SVC) from sklearn for the next operations
# SVC is well suited to small and medium-sized datasets such as this one
from sklearn.svm import SVC

Kernel
- Kernelized algorithms

The kernel helps determine the shape of the hyperplane (hyperplanes are decision boundaries that help classify the data points) and the decision boundary. We can set the value of the kernel parameter in the SVM code.
Kernelized algorithms are used in classification problems.
SVM (Support Vector Machines) and kernel PCA (Principal Component Analysis) support kernel operations.

In the SVM classifier, it is easy to have a linear hyperplane between two classes. But another question arises: do we need to add such features manually to obtain a hyperplane? No, the SVM
algorithm has a technique called the kernel trick. The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable
one. It is mostly useful in non-linear separation problems. Simply put, it performs some fairly complex data transformations and then finds a way to separate the data based on the labels or outputs you have defined.

Support vector machines are a tool which best serves the purpose of separating two classes. They are a kernel-based algorithm.

A kernel is a function that transforms the input data into a high-dimensional space where the problem can be solved.
A kernel function can be either linear or non-linear. Kernel methods are a class of algorithms for pattern analysis.
The primary function of the kernel is to take data as input and transform it into the required form of output.
In statistics, a "kernel" is the mapping function that represents 2-dimensional data in a 3-dimensional space.
A support vector machine uses the kernel trick to transform the data into a higher dimension and then tries to find an optimal hyperplane between the possible outputs.
Using a linear classifier on kernel-transformed data to solve a non-linear problem is what is known as the "kernel trick".
Kernels are used in statistics and mathematics generally, but they are most widely used in support vector machines (a comparison of the kernels available in sklearn's SVC is sketched below).
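Before fitting the individual models below, a compact way to compare kernels is cross-validation. The sketch below is an aside that reuses the X_train and y_train defined earlier and scores each kernel with 5-fold cross-validation:

# Sketch: compare the four kernels with 5-fold cross-validation on the training data
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    scores = cross_val_score(SVC(kernel=kernel), X_train, y_train, cv=5)
    print(kernel, "mean CV accuracy:", scores.mean().round(3))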

In [39]: # A kernel function takes data as input and transforms it into the required form for processing
# Kernel = rbf
model_rbf = SVC(kernel = "rbf")
model_rbf.fit(X_train,y_train)
pred_test_rbf = model_rbf.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,pred_test_rbf ))

np.mean(pred_test_rbf==y_test)

[[  0  40]
 [  0 116]]
Out[39]: 0.7435897435897436

The kernel is the function used to perform the mathematical transformation inside the support vector machine; it maps the input data according to the chosen kernel function.

The Radial Basis Function (RBF) kernel is the default kernel used within sklearn's SVM classification algorithm.
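The confusion matrix above also shows that this model predicts 'small' for every test sample, so the 0.74 accuracy simply mirrors the class imbalance. Two common remedies are feature scaling and class weighting; a sketch is below (an aside, and results will vary with the random split).

# Sketch: scale the features and weight the minority class before fitting the RBF SVM
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

balanced_rbf = make_pipeline(StandardScaler(),
                             SVC(kernel="rbf", class_weight="balanced"))
balanced_rbf.fit(X_train, y_train)
print(confusion_matrix(y_test, balanced_rbf.predict(X_test)))   # typically no longer predicts only 'small'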

Classification report

A classification report is a performance evaluation summary in machine learning.

Metrics Definition

Precision - Precision is defined as the ratio of true positives to the sum of true and false positives.

Recall - Recall is defined as the ratio of true positives to the sum of true positives and false negatives.
F1 Score - The F1 score is the weighted harmonic mean of precision and recall. The closer the value of the F1 score is to 1.0, the better the expected performance of the model.
Support - Support is the number of actual occurrences of the class in the dataset. It doesn't vary between models; it simply contextualizes the evaluation (these definitions are checked numerically in the sketch below).
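Using the RBF confusion matrix printed above ([[0, 40], [0, 116]], rows = actual, columns = predicted, label order large/small), these definitions can be checked by hand:

# Sketch: precision/recall/F1 for the 'small' class, computed from the matrix above
tp, fp, fn = 116, 40, 0                               # true positives, false positives, false negatives for 'small'
precision = tp / (tp + fp)                            # 116 / 156 ≈ 0.74
recall = tp / (tp + fn)                               # 116 / 116 = 1.00
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.85
print(round(precision, 2), round(recall, 2), round(f1, 2))

These values match the 'small' row of the classification report below.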

In [31]: print(classification_report(y_test,pred_test_rbf))

precision recall f1-score support

large 0.00 0.00 0.00 40


small 0.74 1.00 0.85 116

accuracy 0.74 156


macro avg 0.37 0.50 0.43 156
weighted avg 0.55 0.74 0.63 156

C:\Users\shyam\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
(warning repeated three times)

Linear kernel

The linear kernel is used when the data is linearly separable, that is, it can be separated using a single line. It is one of the most common kernels. It is mostly used when there are a large number of features in a
dataset; text classification is a typical example, since each word becomes a new feature, so the linear kernel is commonly used there.

In [32]: # Kernel = linear


# Note: the gamma parameter is not used by the linear kernel; it applies to the rbf, poly and sigmoid kernels
# (for those kernels, increasing gamma makes the model more prone to overfitting)
model_linear = SVC(kernel = "linear")
model_linear.fit(X_train,y_train)
pred_test_linear = model_linear.predict(X_test)
print(confusion_matrix(y_test,pred_test_linear ))

np.mean(pred_test_linear==y_test)

[[  0  40]
 [  0 116]]
Out[32]: 0.7435897435897436

In [33]: print(classification_report(y_test,pred_test_linear))

precision recall f1-score support

large 0.00 0.00 0.00 40


small 0.74 1.00 0.85 116

accuracy 0.74 156


macro avg 0.37 0.50 0.43 156
weighted avg 0.55 0.74 0.63 156

C:\Users\shyam\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
(warning repeated three times)

Sigmoid kernel

This kernel makes the SVM equivalent to a two-layer perceptron neural network; the same tanh function is used as the activation function for artificial neurons.

Sigmoid kernel: K(X, Y) = tanh(γ·XᵀY + r), which is similar to the sigmoid function used in logistic regression.

In [40]: # Kernel = sigmoid


model_sigmoid = SVC(kernel = "sigmoid") # applying the sigmoid kernel
model_sigmoid.fit(X_train,y_train)
pred_test_sigmoid = model_sigmoid.predict(X_test) # predicting on the test data
print(confusion_matrix(y_test,pred_test_sigmoid )) # printing the confusion matrix

np.mean(pred_test_sigmoid==y_test)

[[ 8 32]
 [37 79]]
Out[40]: 0.5576923076923077

The equation used here is K(X, Y) = tanh(γ·XᵀY + r), as given above.

In [35]: print(classification_report(y_test,pred_test_sigmoid))

precision recall f1-score support

large 0.18 0.20 0.19 40


small 0.71 0.68 0.70 116

accuracy 0.56 156


macro avg 0.44 0.44 0.44 156
weighted avg 0.57 0.56 0.57 156

Polynomial kernel

The polynomial kernel looks not only at the given features of the input samples to determine their similarity, but also at combinations of these features.
The feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blow-up in the number of parameters to be learned (see the numerical check below).
When the input features are binary-valued (booleans), the derived features correspond to logical conjunctions of the input features.
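To make the feature-space equivalence concrete, the small sketch below (an illustrative aside with two hand-picked 2-dimensional points) checks numerically that the degree-2 polynomial kernel value (a·b + 1)² equals the dot product of explicitly expanded polynomial features:

# Sketch: (a.b + 1)^2 equals the dot product of explicit degree-2 polynomial features
import numpy as np

def phi(v):
    x1, x2 = v
    # explicit feature map for the degree-2 polynomial kernel with gamma = 1, coef0 = 1
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
kernel_value = (a @ b + 1) ** 2       # the kernel trick: phi is never formed explicitly
explicit_value = phi(a) @ phi(b)      # the same number, computed in the explicit feature space
print(kernel_value, explicit_value)   # both 144.0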

In [41]: model_poly = SVC(kernel = "poly",C=2) # applying the polynomial kernel with C=2


model_poly.fit(X_train,y_train) # fitting the model on the training data
pred_test_poly = model_poly.predict(X_test)
print(confusion_matrix(y_test,pred_test_poly )) # printing the confusion matrix

np.mean(pred_test_poly==y_test)

[[  0  40]
 [  0 116]]
Out[41]: 0.7435897435897436

Formula for the polynomial kernel (as used by sklearn): K(X, Y) = (γ·XᵀY + r)^d, where d is the degree.

In [37]: print(classification_report(y_test,pred_test_poly))

precision recall f1-score support

large 0.00 0.00 0.00 40


small 0.74 1.00 0.85 116

accuracy 0.74 156


macro avg 0.37 0.50 0.43 156
weighted avg 0.55 0.74 0.63 156

C:\Users\shyam\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1248: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
(warning repeated three times)

Pros and Cons of SVM


Pros:

1. It works well in high-dimensional spaces.
2. It performs well even when the number of features exceeds the number of training samples.
3. It works best when the classes are separable (not overlapping).
4. The hyperplane is determined only by the support vectors, so the impact of outliers is limited.
5. SVM is well suited to extreme-case binary classification.

Cons:

1. For larger datasets, it requires a large amount of time to train.
2. It does not perform as well when the classes overlap.
3. Selecting appropriate hyperparameters for the SVM, so that it generalizes well, can be difficult (a grid-search sketch follows below).
4. Selecting the appropriate kernel function can be tricky sometimes.
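Since hyperparameter selection is listed as a drawback, a standard remedy is a cross-validated grid search. The sketch below is an illustrative aside over C, gamma and the kernel (the candidate values are arbitrary choices, not tuned for this dataset):

# Sketch: tune C, gamma and the kernel with a cross-validated grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1], "kernel": ["rbf", "poly", "sigmoid"]}
search = GridSearchCV(SVC(class_weight="balanced"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)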

Thank you
