
PART A

(PART A: TO BE REFERRED BY STUDENTS)

Experiment No. 3
A.1 Aim:
To implement Support Vector Machine.

A.2 Prerequisite:
Python Basic Concepts

A.3 Outcome:
Students will be able to implement a Support Vector Machine.

A.4 Theory:

Machine Learning, a subset of Artificial Intelligence (AI), plays a dominant role in our daily
lives. Data science engineers and developers working in various domains widely use machine
learning algorithms to make their tasks simpler and life easier.

The objective of the support vector machine algorithm is to find a hyperplane in an
N-dimensional space (N being the number of features) that distinctly classifies the data points.
To separate two classes of data points, there are many possible hyperplanes that could be
chosen. Our objective is to find the plane with the maximum margin, i.e., the maximum
distance between data points of both classes. Maximizing the margin provides some
reinforcement so that future data points can be classified with more confidence.

Hyperplanes are decision boundaries that help classify the data points. Data points falling
on either side of the hyperplane can be attributed to different classes. The dimension of the
hyperplane depends on the number of features: if the number of input features is 2, the
hyperplane is just a line; if the number of input features is 3, the hyperplane becomes a
two-dimensional plane. It becomes difficult to visualize when the number of features exceeds 3.
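
As a quick illustration of the separating hyperplane described above, the short sketch below
(a minimal example; the synthetic make_blobs data and all variable names are assumptions, not
part of this experiment) fits a linear SVM with scikit-learn and prints the learned hyperplane
parameters and the number of support vectors.

# Minimal sketch: a linear SVM on synthetic two-feature data (illustrative only)
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, one per class
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# A linear kernel finds the maximum-margin hyperplane w . x + b = 0
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print("Hyperplane coefficients (w):", clf.coef_)
print("Intercept (b):", clf.intercept_)
print("Support vectors per class:", clf.n_support_)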

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes using a single straight line, the data is termed linearly separable, and the
classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot
be classified using a straight line, the data is termed non-linear, and the classifier used is
called a Non-linear SVM classifier.

The SVM algorithm is implemented with a kernel that transforms the input data space into the
required form. SVM uses a technique called the kernel trick, in which the kernel takes a
low-dimensional input space and transforms it into a higher-dimensional space. In simple
words, the kernel converts a non-separable problem into a separable problem by adding more
dimensions to it. This makes SVM more powerful, flexible and accurate. The following are
some of the types of kernels used by SVM.
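
To make the effect of the kernel trick concrete, the short sketch below (illustrative only;
the make_circles dataset and the chosen parameter values are assumptions) compares a
linear-kernel SVM with an RBF-kernel SVM on data that no straight line can separate.

# Illustrative sketch: linear vs. RBF kernel on non-linearly separable data
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Concentric circles cannot be separated by a straight line
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

for kernel in ['linear', 'rbf']:
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(kernel, "kernel accuracy:", round(acc, 3))

On such data the RBF kernel typically reaches near-perfect accuracy while the linear kernel
stays close to chance, which is exactly the gap the kernel trick is meant to close.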
Linear Kernel
It can be used as a dot product between any two observations. The formula of the linear
kernel is as below −
K(x, xi) = sum(x * xi)

From the above formula, we can see that the kernel between two vectors x and xi is the sum
of the products of their corresponding input values.
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or nonlinear
input spaces. Following is the formula for the polynomial kernel −
K(X, Xi) = (1 + sum(X * Xi))^d
Here d is the degree of the polynomial, which we need to specify manually in the learning
algorithm.
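
The sketch below (the example vectors are made up purely for illustration) evaluates both
kernel formulas directly with NumPy so they can be compared against the definitions above.

# Illustrative sketch: evaluating the linear and polynomial kernels by hand
import numpy as np

x = np.array([1.0, 2.0, 3.0])
xi = np.array([0.5, 1.0, 1.5])

linear_kernel = np.sum(x * xi)                  # K(x, xi) = sum(x * xi)
d = 2
polynomial_kernel = (1 + np.sum(x * xi)) ** d   # K(X, Xi) = (1 + sum(X * Xi))^d

print("Linear kernel:", linear_kernel)
print("Polynomial kernel (d = 2):", polynomial_kernel)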

Pros and Cons of SVM Classifiers

Pros of SVM classifiers


SVM classifiers offer great accuracy and work well in high-dimensional spaces. They use only
a subset of the training points (the support vectors) and therefore require very little memory.
Cons of SVM classifiers
They have a high training time and hence, in practice, are not suitable for large datasets.
Another disadvantage is that SVM classifiers do not work well with overlapping classes.

PART B
(PART B : TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments within two hours of the practical. The
soft copy must be uploaded on Blackboard or emailed to the concerned lab in-charge faculty at
the end of the practical in case there is no Blackboard access available.)

Roll No.: B24    Name: Sakshi Bhaskar Tupsundar


Class: BE-Comps    Batch: B2
Date of Experiment: 10-10-2023    Date of Submission: 12-10-2023
Grade:
B.1 Software Code written by student:

import numpy as np
from google.colab import drive
import csv
import pandas as pd
import seaborn as sns

df = pd.read_csv('/content/survey lung cancer.csv')
data = df  # alias used by the later cells that refer to 'data'

df.shape
df.isnull().sum()
df.head()

from sklearn import preprocessing

# The LabelEncoder object converts categorical word labels into integers.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in the 'GENDER' and 'LUNG_CANCER' columns.
df['GENDER'] = label_encoder.fit_transform(df['GENDER'])
df['GENDER'].unique()
df['LUNG_CANCER'] = label_encoder.fit_transform(df['LUNG_CANCER'])
df['LUNG_CANCER'].unique()

df.head()

import matplotlib.pyplot as plt

plt.figure(figsize=(14, 8))
plt.suptitle("Lung Disease Prediction")
ax = plt.gca()
df.boxplot(ax=ax)  # boxplots of each column, used to spot outliers

# Removing outliers using the IQR rule
columns_to_check = ['LUNG_CANCER']

# Step 1: Calculate the first quartile (Q1), third quartile (Q3),
# and IQR for each column
Q1 = data[columns_to_check].quantile(0.25)
Q3 = data[columns_to_check].quantile(0.75)
IQR = Q3 - Q1

# Step 2: Define the outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Step 3: Identify outliers for each column
outliers = {}
for column_name in columns_to_check:
    outliers[column_name] = data[(data[column_name] < lower_bound[column_name]) |
                                 (data[column_name] > upper_bound[column_name])]

# Step 4: Remove the outliers
data_cleaned = data.copy()
for column_name in columns_to_check:
    data_cleaned = data_cleaned[
        (data_cleaned[column_name] >= lower_bound[column_name]) &
        (data_cleaned[column_name] <= upper_bound[column_name])]

Applying SVM model before outlier removal

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X = data.drop('LUNG_CANCER', axis=1)
y = data['LUNG_CANCER']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

from sklearn.preprocessing import StandardScaler

# Standardize the features before fitting the SVM
st_x = StandardScaler()
X_train = st_x.fit_transform(X_train)
X_test = st_x.transform(X_test)

svm_clf = SVC()
svm_clf.fit(X_train, y_train)

y_pred = svm_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Applying SVM model after outlier removal

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Split the data into features and labels
X = data.drop('LUNG_CANCER', axis=1)  # Adjust as needed
y = data['LUNG_CANCER']

# Initialize an empty list to store selected features
selected_features = []
best_accuracy = 0.0

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

while len(selected_features) < X.shape[1]:  # Repeat until all features are selected
    # Find the feature that improves the model the most
    best_feature = None
    best_feature_accuracy = 0.0

    for feature in X.columns:
        if feature not in selected_features:
            # Create a new feature set by adding the current feature
            current_features = selected_features + [feature]

            # Train an SVM classifier on the current feature set
            svm = SVC()
            svm.fit(X_train[current_features], y_train)

            # Make predictions on the test set
            y_pred = svm.predict(X_test[current_features])

            # Calculate accuracy
            accuracy = accuracy_score(y_test, y_pred)

            # Check if this feature improves accuracy
            if accuracy > best_feature_accuracy:
                best_feature_accuracy = accuracy
                best_feature = feature

    # Add the best feature to the selected features list
    selected_features.append(best_feature)
    best_accuracy = best_feature_accuracy

    # Print the selected feature and its accuracy
    print(f"Selected Feature: {best_feature}, Accuracy: {best_accuracy:.4f}")

print("Forward selection complete.")
print("Selected Features:", selected_features)

Applying SVM model after feature selection process

from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Build the train/test split using only the selected features
X_train_after, X_test_after, y_train_after, y_test_after = train_test_split(
    X[selected_features], y, test_size=0.2, random_state=42)

# Create an SVM model with the 'rbf' kernel
clf = svm.SVC(kernel='rbf')

# Fit the SVM model to the training data
clf.fit(X_train_after, y_train_after)

# Make predictions on the test data
y_pred = clf.predict(X_test_after)

# Calculate accuracy on the test set
accuracy = accuracy_score(y_test_after, y_pred)
print("Testing Accuracy:", accuracy)

# Perform cross-validation and print the cross-validation scores
cv_scores = cross_val_score(clf, X_train_after, y_train_after, cv=5)  # change the number of folds (cv) as needed
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Score:", cv_scores.mean())
print(classification_report(y_test_after, y_pred))
Hyperparameter tuning for SVM

import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from scipy.stats import loguniform

# Use the encoded features and the LUNG_CANCER target
X = data.drop('LUNG_CANCER', axis=1)
y = data['LUNG_CANCER']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Define the model
svm_model = SVC()

# Define the hyperparameter distributions to sample from
param_dist = {
    'C': loguniform(1e-2, 1e2),
    'gamma': loguniform(1e-4, 1e0),
    'kernel': ['linear', 'rbf']
}

# Perform randomized search with cross-validation
random_search = RandomizedSearchCV(estimator=svm_model,
                                   param_distributions=param_dist,
                                   n_iter=10, cv=5, scoring='accuracy',
                                   random_state=42)
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the model on the test set using the best hyperparameters
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on Test Set:", accuracy)

B.2 Input and Output:

SVM Model Scores
Training Accuracy score: 0.89475
Testing Accuracy score: 0.86
ROC_AUC score: 0.951477
CV score: 0.756

SVM Model (Feature Selection) Scores
Training Accuracy score: 0.84963
Testing Accuracy score: 0.9575
ROC_AUC score: 0.9153
CV score: 0.9425

Hyperparameter Tuning Scores (SVM Model)
Accuracy score: 0.91935483870
ROC_AUC score: 0.55
CV score: 0.95967741
B.3 Observations and learning:
o The SVM classifier with an RBF kernel demonstrated strong predictive capabilities, achieving a
high accuracy rate and effectively classifying data points into their respective classes.
o Support Vector Machines are powerful classifiers that can be applied to a wide range of
classification problems.
o Evaluating the performance of an SVM model through metrics like accuracy, precision, recall,
and the confusion matrix helps in understanding its strengths and weaknesses.
o SVMs with RBF kernels are suitable for complex datasets with non-linear relationships, but
hyperparameter tuning and feature selection are crucial for optimizing their performance.

B.4 Conclusion:
In this experiment, we successfully implemented a Support Vector Machine (SVM) classifier with an
RBF kernel on a given dataset.

B.5 Question of Curiosity


Q1. What is a support vector machine (SVM)?

Ans: A support vector machine (SVM) is a type of supervised learning algorithm used in
machine learning to solve classification and regression tasks; SVMs are particularly good at
solving binary classification problems, which require classifying the elements of a data set into
two groups.

The aim of a support vector machine algorithm is to find the best possible line, or decision
boundary, that separates the data points of different data classes. This boundary is called a
hyperplane when working in high-dimensional feature spaces. The idea is to maximize the
margin, which is the distance between the hyperplane and the closest data points of each
category, thus making it easy to distinguish data classes.
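
Stated a little more formally (a standard formulation added here for reference, not part of
the original answer): for a hyperplane w . x + b = 0 and labels yi in {+1, -1}, the margin
equals 2 / ||w||, so maximizing the margin amounts to minimizing (1/2) * ||w||^2 subject to
yi * (w . xi + b) >= 1 for every training point (xi, yi).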
