
CHAPTER -1

ABOUT THE COMPANY

Pragyan AI is one of the pioneers in the data science and AI domain and provides internships,
training, and FDP programs across India. The company has been selected as one of the Top 100
Deep Tech startups and has been incubated at several places.

The company is not only in the training space; it also does technology consulting,
collaborative deep research in AI, and AI product development. It has been selected for
Elevate 100 by the Karnataka Government and has received an EIR grant from the central
government. It has trained over 1000 students, conducted several AI hackathons, and presented
tutorials at several AI summits.

Pragyan AI aims to transform Tier 2/3/4 colleges into the best colleges for DS & AI programs
in their localities by training students in data science and AI over 3 years in online mode.
The company promises students 100% jobs at 2X salary. Students pay a nominal amount initially
and 80% as an ISA.

Pragyan AI also helps colleges with admissions, with grooming their faculty, and a lot more.
It also helps companies get 100% good hires through a try-and-hire approach.
CHAPTER – 2

ABSTRACT
Diabetes is a chronic disease that affects millions worldwide, leading to serious health
complications if not managed properly. Early detection and classification of diabetes are
crucial for effective treatment and management. Machine learning (ML) algorithms have
shown promise in accurately classifying diabetic disease based on patient data.

This paper presents a comprehensive review of ML techniques used for diabetic disease
classification. It discusses the challenges in accurately diagnosing diabetes and the
importance of early detection. Various ML algorithms, including support vector machines,
decision trees, random forests, and neural networks, are explored for their effectiveness in
classifying diabetic disease.

The paper also discusses the importance of feature selection and data preprocessing in
improving the performance of ML models. Furthermore, it highlights the role of big data and
data mining techniques in identifying patterns and trends in diabetic disease data.

Classification of diabetes is typically based on criteria established by organizations such as
the American Diabetes Association (ADA) or the World Health Organization (WHO). These
criteria take into account factors such as fasting blood glucose levels, oral glucose tolerance
test results, and HbA1c levels.

Machine learning (ML) algorithms can be used to classify diabetes based on these criteria and
other relevant data, such as demographic information, medical history, and lifestyle factors.
ML models can help healthcare professionals make more accurate and personalized treatment
decisions for individuals with diabetes.
Chapter -3 Introduction to Machine Learning using
Diabetic Classification
Machine learning (ML) is a branch of artificial intelligence that focuses on the development
of algorithms and models that can learn from and make predictions or decisions based on
data. One important application of ML is in healthcare, where it can be used to improve
diagnostics and treatment decisions. In this report, we will explore the basics of ML using
diabetic classification as an example.

Diabetic classification involves categorizing individuals into different groups based on their
diabetic status. This can include distinguishing between type 1 and type 2 diabetes,
identifying individuals at risk of developing diabetes, and predicting the progression of the
disease. ML can be used to build models that can automatically classify individuals based on
various features such as age, gender, body mass index (BMI), and blood glucose levels.

The first step in building an ML model for diabetic classification is to collect and preprocess
the data. This involves gathering data from various sources such as electronic health records,
medical tests, and patient surveys. The data may need to be cleaned to remove errors and
inconsistencies and transformed into a format suitable for ML algorithms.

Feature selection is an important step in ML model building, where the most relevant features
are selected to train the model. In the case of diabetic classification, features such as age,
BMI, and blood glucose levels may be important predictors of diabetic status. Feature
selection helps to improve the performance of the ML model and reduce overfitting.

Once the data is prepared and features are selected, ML models can be built using algorithms
such as decision trees, support vector machines, or neural networks. These models are trained
on a subset of the data and evaluated using another subset to assess their performance.
Evaluation metrics such as accuracy, precision, recall, and F1-score are used to measure the
performance of the model.
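
To make the four evaluation metrics concrete, here is a minimal sketch that computes them by hand from true and predicted labels. The label lists are hypothetical (1 = diabetic, 0 = non-diabetic), and the helper name classification_metrics is introduced only for this illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

m = classification_metrics(y_true, y_pred)
print(m)
```

In practice the same numbers would normally come from sklearn.metrics; writing them out once shows that precision and recall differ only in whether false positives or false negatives appear in the denominator.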
CHAPTER -4 Introduction to Diabetes Prediction Using
Machine Learning
Diabetes is a chronic disease that affects millions of people worldwide. Early detection and
management of diabetes are crucial to prevent serious complications. Machine learning (ML)
techniques can be utilized to predict the likelihood of an individual developing diabetes based
on their characteristics and risk factors. This report provides an overview of using ML for
diabetes prediction.

**1. Importance of Diabetes Prediction**

Early detection of diabetes can lead to timely interventions, such as lifestyle changes or
medication, which can help prevent or delay the onset of the disease. ML models can analyze
various factors, such as age, gender, family history, body mass index (BMI), and blood
glucose levels, to predict the risk of diabetes in individuals.

**2. Data Collection and Preprocessing**

To build an ML model for diabetes prediction, relevant data must be collected and
preprocessed. This data may include demographic information, medical history, lifestyle
factors, and biomarkers related to diabetes. Data preprocessing involves cleaning the data,
handling missing values, and transforming it into a format suitable for ML algorithms.
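
As a rough illustration of handling missing values, the sketch below uses a small hypothetical table in which a 0 in Glucose or BMI is assumed to mean "not measured". That reading of zeros is an assumption made for this example, and median filling is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical records; a 0 in Glucose or BMI stands for "not measured".
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
    "Age": [50, 31, 32, 21, 33],
})

cols_with_hidden_missing = ["Glucose", "BMI"]

# Step 1: turn the placeholder zeros into real missing values (NaN).
df_clean = df.copy()
df_clean[cols_with_hidden_missing] = (
    df_clean[cols_with_hidden_missing].replace(0, np.nan)
)

# Step 2: fill each gap with the median of its own column.
df_clean[cols_with_hidden_missing] = df_clean[cols_with_hidden_missing].fillna(
    df_clean[cols_with_hidden_missing].median()
)

print(df_clean)
```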

**3. Feature Selection**

Feature selection is an essential step in ML model building, where the most relevant features
are selected to train the model. Features that have a significant impact on diabetes prediction,
such as BMI, age, and blood pressure, are selected to improve the model's performance and
interpretability.

**4. Model Building and Evaluation**


ML models, such as logistic regression, decision trees, random forests, or support vector
machines, can be used to predict the likelihood of diabetes in individuals. These models are
trained on a subset of the data and evaluated using metrics such as accuracy, precision, recall,
and F1-score to assess their performance.
CHAPTER 5
Methodology of Diabetes Prediction
Using Machine Learning

Predicting diabetes using machine learning involves several key steps, from data collection
and preprocessing to model building and evaluation. Here's a detailed methodology:

1. Data Collection:

- Gather relevant data sources such as electronic health records, medical history, and
demographic information.

- Include features such as age, gender, BMI, blood pressure, cholesterol levels, family
history of diabetes, and lifestyle factors.

2. Data Preprocessing:

- Clean the data to remove errors, outliers, and missing values.

- Normalize or standardize numerical features to ensure they are on a similar scale.

- Encode categorical variables using techniques like one-hot encoding.

3. Feature Selection:

- Use statistical tests (e.g., chi-square test) or feature importance techniques (e.g., random
forest feature importance) to select the most relevant features.

- Remove redundant or irrelevant features that may negatively impact model performance.

4. Data Splitting:

- Split the dataset into training, validation, and test sets to evaluate the model's
performance.

- Typically, use 70-80% of the data for training, 10-15% for validation, and the
remaining 10-15% for testing.

5. Model Selection and Training:


- Choose appropriate machine learning algorithms for classification, such as logistic
regression, decision trees, random forests, or support vector machines.

- Train multiple models using the training dataset and tune hyperparameters using the
validation dataset to improve performance.

6. Model Evaluation:

- Evaluate the trained models using the test dataset to assess their performance.

- Use metrics such as accuracy, precision, recall, F1-score, and ROC-AUC to evaluate the
models' performance.

7. Model Interpretation:

- Interpret the model's predictions to understand which features are most important in
predicting diabetes.

- Use visualization techniques like feature importance plots or SHAP (SHapley Additive
exPlanations) values to interpret the model.

8. Deployment and Monitoring:

- Deploy the trained model into a production environment for real-time predictions.

- Monitor the model's performance over time and retrain it periodically with new data to
maintain its accuracy.

9. Ethical Considerations:
- Ensure that the model is fair and unbiased by addressing issues such as algorithmic bias
and fairness in predictions.
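
Steps 2, 4, 5, and 6 of the methodology can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the data is synthetic (generated with make_classification, not patient data), logistic regression stands in for whichever classifier is chosen, and the candidate values of C are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a patient table: 8 numeric features, binary outcome.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=0)

# Step 4: 70% train, 15% validation, 15% test, stratified to keep class ratios.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

# Steps 2 and 5: the scaler lives inside the pipeline, so it is fitted on the
# training data only; C is chosen by F1 on the validation set.
best_C, best_val_f1, best_model = None, -1.0, None
for C in (0.01, 0.1, 1.0, 10.0):
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    val_f1 = f1_score(y_val, model.predict(X_val))
    if val_f1 > best_val_f1:
        best_C, best_val_f1, best_model = C, val_f1, model

# Step 6: report once on the untouched test set.
test_pred = best_model.predict(X_test)
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print("best C:", best_C)
print("test accuracy:", accuracy_score(y_test, test_pred))
print("test F1:", f1_score(y_test, test_pred))
print("test ROC-AUC:", test_auc)
```

Keeping the test set out of the tuning loop is the point of the three-way split: the validation set absorbs the optimism that comes from choosing hyperparameters.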
Chapter 6

Implementation
1. Data analysis: Here one will get to know how the data analysis part is done in a
data science life cycle.

2. Exploratory data analysis: EDA is one of the most important steps in the data science
project life cycle, and here one will need to know how to make inferences from
the visualizations and data analysis.

3. Model building: Here we will be using 4 ML models and then choosing the
best-performing model.

4. Saving model: Saving the best model using pickle to make predictions on real
data.

Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

from mlxtend.plotting import plot_decision_regions
import missingno as msno
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

Here we read the dataset, which is in CSV format:

diabetes_df = pd.read_csv('diabetes.csv')

diabetes_df.head()

Output:

Exploratory Data Analysis (EDA)


Information about the dataset

To know more about the dataset, we use describe(); the trailing .T transposes the summary so that each feature appears as a row.

diabetes_df.describe().T
Data Visualization

Plotting the data distribution plots before removing null values

p = diabetes_df.hist(figsize = (20,20))

Output:

Inference: Here we have seen the distribution of each feature, whether it is dependent or
independent data. One question that may strike us is why we need to see the distribution
of the data at all.

Plotting the distributions after removing the NAN values.

p = diabetes_df_copy.hist(figsize = (20,20))


Inference: Here we again use the hist plot to see the distribution of the dataset, but this time
to see what changes once the null values are removed. The difference is clearly visible: for
example, in the Age column there is a spike in the range of 50 to 100 after removal of the null
values, which is quite logical as well.

Plotting Null Count Analysis Plot

p = msno.bar(diabetes_df)

Output:

Inference: In the above graph we can also clearly see that there are no null values in the
dataset.

Inference: From the above visualization it is clearly visible that our dataset is
imbalanced: the number of patients who are diabetic is half the number of patients who are
non-diabetic.
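
The imbalance can also be checked numerically rather than read off a plot. The sketch below counts the classes in a short, hypothetical Outcome column (1 = diabetic, 0 = non-diabetic) chosen so that the diabetic class is half the size of the non-diabetic one; the values are illustrative and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical Outcome column: 1 = diabetic, 0 = non-diabetic.
outcome = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

counts = Counter(outcome)
total = len(outcome)
for label in sorted(counts):
    share = counts[label] / total
    print(f"class {label}: {counts[label]} rows ({share:.0%})")

# Ratio of minority to majority class; 1.0 would mean perfectly balanced.
imbalance_ratio = min(counts.values()) / max(counts.values())
print("minority/majority ratio:", imbalance_ratio)
```

With a ratio this far from 1.0, accuracy alone can look better than the model really is, which is one reason precision, recall, and F1-score are reported alongside it.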

Output:

Inference: That is how a distplot can be helpful: one can see the distribution of
the data, and with the help of a boxplot one can also see the outliers in that column,
along with other information that can be derived from the box-and-whiskers plot.

Output:

Scaling the Data


Before scaling the data, let's have a look at it:

diabetes_df_copy.head()

Output:

After standard scaling:

sc_X = StandardScaler()
X = pd.DataFrame(
    sc_X.fit_transform(diabetes_df_copy.drop(["Outcome"], axis=1)),
    columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
             'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
X.head()

Output:

That is how our dataset looks once it is scaled: every value is now on the same scale,
which will help our ML model give a better result.
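
Standard scaling replaces each value x with z = (x - mean) / std, computed per column. The following minimal sketch reproduces that by hand with NumPy on two hypothetical features that start out on very different scales; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical values for two features on very different scales.
glucose = np.array([148.0, 85.0, 183.0, 89.0, 137.0])
pedigree = np.array([0.627, 0.351, 0.672, 0.167, 2.288])
X_raw = np.column_stack([glucose, pedigree])

# z = (x - mean) / std, per column. The population standard deviation
# (ddof=0) is used, which is also what StandardScaler uses.
mean = X_raw.mean(axis=0)
std = X_raw.std(axis=0)
X_scaled = (X_raw - mean) / std

print(X_scaled.round(3))
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds:", X_scaled.std(axis=0).round(6))
```

After the transformation both columns have mean 0 and standard deviation 1, so neither feature dominates a distance-based or gradient-based model merely because of its units.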

Output:

0 1
1 0
2 1
3 0
4 1
763 0
764 1
765 0
Name: Outcome, Length: 768, dtype: int64

Model Building
Splitting the dataset

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

Now we will split the data into training and testing data using the train_test_split function

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=7)


Random Forest
Building the model using RandomForest

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)

# Predictions on the training data itself
rfc_train = rfc.predict(X_train)
from sklearn import metrics

print("Accuracy_Score =", format(metrics.accuracy_score(y_train, rfc_train)))

Output: Accuracy = 1.0

Getting the accuracy score for Random Forest

from sklearn import metrics

predictions = rfc.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, predictions)))

Output:
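
A training accuracy of 1.0 only shows that the forest can reproduce labels it has already seen, so the test score is the one that matters. As a hedged illustration, the sketch below trains a forest with the same n_estimators and split settings on synthetic data with some label noise (an assumption of this example, not the diabetes data) and prints both scores side by side.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with label noise (flip_y); not the diabetes data.
X_demo, y_demo = make_classification(n_samples=768, n_features=8,
                                     n_informative=5, flip_y=0.1,
                                     random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7)

forest = RandomForestClassifier(n_estimators=200, random_state=7)
forest.fit(X_tr, y_tr)

# Accuracy on data the forest has seen versus data it has not.
train_acc = accuracy_score(y_tr, forest.predict(X_tr))
test_acc = accuracy_score(y_te, forest.predict(X_te))
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The gap between the two numbers is a rough measure of overfitting; comparing models on training accuracy alone would favor whichever model memorizes best.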

Decision Tree

Building the model using DecisionTree


from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

Getting the accuracy score for Decision Tree

from sklearn import metrics

predictions = dtree.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, predictions)))

Output:

Accuracy Score = 0.7322834645669292


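Getting the accuracy score for the XgBoost classifier

from sklearn import metrics

# xgb_model is the XGBoost classifier; it is assumed to have been fitted
# on X_train, y_train before this point.
xgb_pred = xgb_model.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, xgb_pred)))

Output: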

Accuracy Score = 0.7401574803149606


Classification report and confusion matrix of the XgBoost classifier

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, xgb_pred))
print(classification_report(y_test, xgb_pred))

Output:

Support Vector Machine (SVM)


Building the model using Support Vector Machine (SVM)

from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

Prediction from support vector machine model on the testing data

svc_pred = svc_model.predict(X_test)

Classification report and confusion matrix of the SVM classifier


from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, svc_pred))
print(classification_report(y_test, svc_pred))

Output:

Feature Importance
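
Tree ensembles such as a random forest expose a per-feature importance score through the feature_importances_ attribute. The sketch below shows how such a ranking could be printed. It is fitted on synthetic data that only borrows the dataset's column names, so the ranking it prints says nothing about real patients.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Synthetic data: the columns are random and merely reuse the names above.
X_syn, y_syn = make_classification(n_samples=500, n_features=8,
                                   n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_syn, y_syn)

# feature_importances_ holds impurity-based importances that sum to 1.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<26}{score:.3f}")
```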
Saving the Model

import pickle

# First we use the dumps() function to serialize the model with pickle
saved_model = pickle.dumps(rfc)

# Then we load that saved model back
rfc_from_pickle = pickle.loads(saved_model)
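
The in-memory round trip above can be extended to a file on disk, which is what later predictions on real data would need. A minimal sketch, assuming a small forest trained on synthetic data and a hypothetical file name diabetes_rf.pkl in a temporary directory:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, used only to have something to save.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_syn, y_syn)

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "diabetes_rf.pkl")

# dump() writes the pickled model to an open binary file.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# load() reads it back into a new object.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = bool((restored.predict(X_syn) == model.predict(X_syn)).all())
print("predictions identical after reload:", same)
```

Pickle files should only be loaded from trusted sources, and a model is best reloaded with the same scikit-learn version that saved it.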

Conclusion
Using these patient records, we were able to build a machine learning model (random forest
performed best) to predict whether or not the patients in the dataset have diabetes. Along
the way, we were also able to draw some insights from the data through data analysis and
visualization.
