
CHAPTER 4

DESIGN AND DEVELOPMENT

4.1 Overview

In this project we used the following Machine Learning algorithms:


• KNN
• SVM
• Random Forest
• AdaBoost
• XGBoost

We implemented the algorithms listed above with the help of Scikit-learn. Scikit-learn is a
free machine learning library for the Python programming language. Scikit-learn is
designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Let us now look at the module-wise development.

Module 1: Importing the required libraries and the dataset

We first import the required libraries, as listed below:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import make_scorer
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb
Importing the dataset
For this project we used the "Pima Indians Diabetes Data Set",
which was taken from the UCI Machine Learning Repository.
The dataset has 9 attributes and 768 instances. The instances were selected
under specific constraints: all patients are females at least 21 years old of
Pima Indian heritage, and a patient is labelled diabetic if the 2-hour post-load
plasma glucose was at least 200 mg/dl.
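
A minimal sketch of loading the data with pandas is given below; the file name
diabetes.csv and its location are assumptions and may differ from the actual project setup.

# Load the Pima Indians Diabetes dataset (the file name "diabetes.csv" is an assumption)
dataset = pd.read_csv("diabetes.csv")
print(dataset.shape)    # expected: (768, 9)
print(dataset.head())   # first rows: 8 predictor attributes plus the Outcome column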

Module 2: Pre-processing (feature selection) and splitting into training and testing datasets
Pre-processing refers to the transformations applied to our data before
feeding it to the algorithm.
Data pre-processing is a technique used to convert raw data into a clean
data set. In other words, whenever data is gathered from different sources it is
collected in a raw format that is not feasible for analysis.

Need for data pre-processing

To achieve better results from the applied model in Machine Learning
projects, the data has to be in a proper format. Some Machine Learning models
need information in a specific format; for example, the Random Forest algorithm
does not support null values, so to run it the null values have to be handled
in the original raw data set.
Another aspect is that the data set should be formatted in such a way that
more than one Machine Learning or Deep Learning algorithm can be executed on the
same data set, and the best of them can be chosen.
Splitting the dataset (a sketch of this module follows the list below)
1. Training dataset - the data set that contains the values of both the
predictor attributes and the class label; it is used to fit a suitable
classifier.
2. Testing dataset - contains new data which will be classified by the
model that was trained on the training dataset.
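
The following is a minimal sketch of this module; the column names, the median
imputation of zero values, and the 80/20 split ratio are assumptions, not details
fixed by this report.

from sklearn.model_selection import train_test_split

# Separate the predictor attributes from the class label
X = dataset.drop("Outcome", axis=1)
y = dataset["Outcome"]              # 1 = diabetic, 0 = healthy

# In this dataset a zero in these columns usually means "missing";
# replacing zeros with the column median is one possible pre-processing choice
for col in ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]:
    X[col] = X[col].replace(0, X[col].median())

# Split into training and testing datasets (80/20 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)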

Module 3: Implementing KNN, SVM, XGBoost, AdaBoost and Random Forest using the Scikit-learn library
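
A minimal sketch of this module is shown below; the hyper-parameters are left at
the library defaults, which the actual project may well have tuned.

# Instantiate the five classifiers and fit each one on the training data
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "XGBoost": xgb.XGBClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)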

Module 4: Prediction
"Prediction" refers to the output of an algorithm after it has
been trained on a historical dataset and applied to new data, forecasting the
likelihood of a particular outcome, such as whether a patient has the disease or not.
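
As a sketch, predictions for the unseen test data can be obtained from each fitted
model as follows (variable names continue from the earlier sketches):

# Predict the class label (diabetic or healthy) for every instance in the test set
predictions = {name: model.predict(X_test) for name, model in models.items()}
print(predictions["Random Forest"][:10])   # e.g. an array of 0/1 labels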

Module 5: Evaluation measures

Evaluation measures for classification techniques:

TP (True Positive): the number of people who actually suffer from 'diabetes' among
those who were diagnosed 'diabetic'.

TN (True Negative): the number of people who are 'healthy' among those who were
diagnosed 'healthy'.

FP (False Positive): the number of people who are actually 'healthy' but were
diagnosed as 'diabetic'.

FN (False Negative): the number of people who are actually 'diabetic' but were
diagnosed as 'healthy'.
The performance of classification is measured with the following criteria:
sensitivity, specificity and accuracy, each of which should be as high a
percentage as possible.
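
A sketch of computing these measures from the confusion matrix is given below;
it continues from the predictions dictionary of the previous sketch.

# Derive accuracy, sensitivity and specificity from the confusion matrix
results = {}
for name, y_pred in predictions.items():
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    results[name] = {
        "Accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "Sensitivity": tp / (tp + fn),   # true positive rate
        "Specificity": tn / (tn + fp),   # true negative rate
    }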

Module 6: Comparing all the algorithms

After obtaining the evaluation measures, we have each algorithm's values for the
different measures such as accuracy, sensitivity and specificity. Based on these
evaluation measures we compare the algorithms in a graph (line chart) and finally
come up with the best classifier.
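
One possible way to draw the comparison as a line chart with matplotlib is sketched
below (the chart title and styling are assumptions):

# Plot accuracy, sensitivity and specificity for every classifier as a line chart
names = list(results.keys())
for measure in ["Accuracy", "Sensitivity", "Specificity"]:
    plt.plot(names, [results[n][measure] for n in names], marker="o", label=measure)
plt.ylabel("Score")
plt.legend()
plt.title("Comparison of classifiers")
plt.show()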

Module 7: Final prediction
After finding the best algorithm, we give the input to the prediction model and
obtain the corresponding output.
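
As an illustration only, a single new patient could be passed to the best model as
below; the choice of Random Forest and all feature values shown are hypothetical.

# Predict the outcome for one new patient (model choice and values are illustrative)
best_model = models["Random Forest"]
new_patient = pd.DataFrame([{
    "Pregnancies": 2, "Glucose": 150, "BloodPressure": 70,
    "SkinThickness": 30, "Insulin": 100, "BMI": 32.5,
    "DiabetesPedigreeFunction": 0.5, "Age": 45,
}])
print(best_model.predict(new_patient))   # output: 1 = diabetic, 0 = healthy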

The following architecture diagram describes the entire project work flow:

[Architecture diagram] Upload Dataset → Data Pre-Processing → Train Dataset
(AdaBoost, SVM, XGBoost, Random Forest, KNN) → Test Dataset (AdaBoost, SVM,
XGBoost, Random Forest, KNN) → Compare the individual values of Accuracy,
Sensitivity and Specificity for each algorithm → Final Prediction → Conclusion
