
GlucoSight

Is it possible to predict diabetes to start treatment on time?

Team 4
Sri Emkay, Iris Wang, Kevin Wang, Natalia Shipkova, Gino Arevalo
Problem. Impact. Business Model
Problem
• 8.5 M: Adults aged 18 years or older who had diabetes but were undiagnosed, or ~23% of adults with diabetes (nih)
• 96 M: US adults, over a third, have prediabetes, and more than 8 in 10 of them don't know they have it (cdc)
• 90%: Prediabetes and type 2 diabetes are preventable. About 9 in 10 cases in the U.S. can be avoided by making lifestyle changes

Business Model / Goal
Predicting whether a person has diabetes based on certain health characteristics. Early prediction is crucial for patients, healthcare providers, and insurance companies to prevent and manage the disease effectively.

GlucoSight: two-sided platform
Users
• General public takes a quick questionnaire
• Instant processing of the pre-test
Healthcare Providers
• Further investigation and on-time treatment

We are building a platform that enables us to tap into an abundant supply of prevention and treatment services and reduce the marginal cost of servicing users once the prediction model is trained
Data & Model introduction

Source of data

• Data: Kaggle Diabetes Health Indicators Dataset
• The 2015 CSV of the dataset available on Kaggle was used:
  70,692 survey responses
  50-50 split of respondents (no/yes diabetes)
  21 feature variables
  Target variable: 0 = no diabetes, 1 = diabetes

Model introduction

Decision Tree
• Popular ML algorithm used for classification and regression tasks
• It recursively splits data into subsets based on feature values and makes predictions in leaf nodes based on the majority class or the average target value.
• Handles non-linear relationships, works with categorical and numerical data

XGBoost
• It is an optimized implementation of the gradient boosting algorithm, which is an ensemble learning technique that combines multiple weak models (usually decision trees) to create a stronger model.
• High predictive accuracy and computational efficiency; one of the most popular ML algorithms for structured data. It works for both classification and regression problems
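The idea of combining weak trees into a stronger model can be sketched in a few lines. The following is a minimal illustration of gradient boosting with squared-error loss and one-split trees on a single feature; it is not XGBoost itself, which adds regularization and other refinements. The BMI values, labels, round count, and learning rate are made-up toy assumptions, not taken from the Kaggle dataset.

```python
# Minimal gradient-boosting sketch (squared-error loss, depth-1 trees, one feature).
# Illustrative only; all data and settings below are toy assumptions.

def fit_stump(xs, residuals):
    """Pick the threshold that best splits the residuals into two constant leaves.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value: right leaf would be empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]  # (threshold, left leaf value, right leaf value)


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds a damped correction."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy data (hypothetical): BMI values and a 0/1 diabetes label.
bmi = [20, 22, 25, 28, 31, 35, 38, 41]
diabetes = [0, 0, 0, 0, 1, 1, 1, 1]
model = boost(bmi, diabetes)
```

Calling `boosted_predict(model, 24)` gives a score near 0 and `boosted_predict(model, 36)` a score near 1 on this toy data, because each round shrinks the remaining residual.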
The Decision Tree Model
Dataset
We use all variables from Kaggle
• Split the data 80/20.
• <10% of the data has heart disease, stroke, and heavy alcohol consumption.
• >70% of the data exercises and eats fruit and veggies.

Highest info gain
The top 3 features:
• High blood pressure
• BMI
• High cholesterol
Females with better general health are less likely to get diabetes.

Confusion Matrix
• 73.82% accuracy
• ROC AUC at 0.8123
We might consider accepting a higher false positive rate in exchange for a higher true positive rate, since it can be risky if the patient is unaware of a diabetes condition.
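As a rough illustration of how features can be ranked by information gain, the sketch below computes the entropy reduction from splitting a binary label on a binary feature. The two feature columns and the labels are made-up toy values, not rows from the dataset, and the deck does not state how its own ranking was computed.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(feature, labels):
    """Entropy reduction from splitting binary labels on a binary feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


# Toy data (hypothetical): two binary features and a 0/1 diabetes label.
high_bp = [1, 1, 1, 1, 0, 0, 0, 0]
fruits = [1, 0, 1, 0, 1, 0, 1, 0]
diabetes = [1, 1, 1, 0, 0, 0, 0, 1]

gain_high_bp = information_gain(high_bp, diabetes)
gain_fruits = information_gain(fruits, diabetes)
```

On this toy data `gain_high_bp` is positive while `gain_fruits` is zero, so a tree would prefer to split on the blood pressure column first.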
Feature Selection for XGBoost
Features we will remove
• Fruits, AnyHealthcare, NoDocbcCost, and Sex are the least correlated with diabetes.
• In the chi-squared test, the scores for these features are very low.

Features used for model


• HighBP, HighChol, BMI, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Veggies, MentHlth, HvyAlcoholConsump, GenHlth, PhysHlth, Age, Education, Income, and DiffWalk have a significant correlation with diabetes.
Chi Squared Test Scores
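For reference, one common form of the chi-squared score is the Pearson statistic on a contingency table of feature value versus label. The sketch below computes it for a binary feature and a binary target; the deck does not say which implementation produced its scores, so this is an assumed formulation, and the data are made-up toy values.

```python
def chi_squared(feature, labels):
    """Pearson chi-squared statistic for a binary feature vs. a binary label."""
    n = len(labels)
    # Observed counts for each (feature value, label) cell of the 2x2 table.
    observed = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for x, y in zip(feature, labels):
        observed[(x, y)] += 1
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            row_total = observed[(a, 0)] + observed[(a, 1)]
            col_total = observed[(0, b)] + observed[(1, b)]
            expected = row_total * col_total / n  # count expected under independence
            if expected > 0:
                stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


# Toy data (hypothetical): two binary features and a 0/1 diabetes label.
high_bp = [1, 1, 1, 1, 0, 0, 0, 0]
fruits = [1, 0, 1, 0, 1, 0, 1, 0]
diabetes = [1, 1, 1, 0, 0, 0, 0, 1]

score_high_bp = chi_squared(high_bp, diabetes)
score_fruits = chi_squared(fruits, diabetes)
```

A score near zero, as for the toy `fruits` column here, means the feature looks independent of the label and is a candidate for removal.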
The XGBoost model reaches 88% accuracy; MentHlth, GenHlth, and Education give the highest reduction in the loss function
Dataset: 70/30 Split, Healthier group
We use feature selection based on correlation analysis to run XGBoost only on the important variables from the Kaggle dataset and split the data 70/30. The dataset is a relatively healthy group, where <10% of the data has heart disease, stroke, and heavy alcohol consumption.

Confusion Matrix: 88% Accuracy, ROC AUC = 0.92
The confusion matrix shows a decent 88% accuracy. Although the ROC AUC is 0.92, we might consider accepting a higher false positive rate in exchange for a higher true positive rate, since it can be risky if the patient is unaware of a diabetes condition.
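ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way by comparing every positive/negative pair; the labels and scores are made-up toy values, not model output from the deck.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. Assumes both classes are present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.6, 0.8, 0.9]

auc = roc_auc(labels, scores)
```

Because the measure depends only on how cases are ranked, it does not change when the decision threshold moves, which is why accuracy and ROC AUC are reported separately.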
Model - Insights

High blood pressure, BMI, and cholesterol are the most significant predictors of diabetes
Focus on educating users about the importance of acknowledging and managing these factors to reduce their risk of developing diabetes. This can be achieved by partnering with firms that offer prevention and treatment services for these conditions

Regular exercise and a healthy diet are key preventative measures against diabetes
Emphasize the importance of regular exercise and a healthy diet as key drivers in preventing diabetes. We can partner with firms that offer personalized nutrition plans, healthy foods, or exercise programs, as well as provide users with educational content on the benefits of a healthy lifestyle

False positives are more acceptable than false negatives
Not detecting diabetes can be severe. It would be more acceptable to have false positives (predicting diabetes when the user does not have it) than false negatives (failing to predict diabetes when the user has it). Sensitivity over specificity: to minimize false negatives and ensure that users receive timely and appropriate care.
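One way to favor sensitivity over specificity is to lower the probability threshold at which a user is flagged. The sketch below compares two thresholds on made-up toy scores; the thresholds 0.5 and 0.3 are illustrative assumptions, not values used by the team.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are flagged positive.

    Assumes both classes are present in labels.
    """
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): true labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.6, 0.8, 0.9]

sens_default, spec_default = sensitivity_specificity(labels, scores, 0.5)
sens_low, spec_low = sensitivity_specificity(labels, scores, 0.3)
```

On this toy data the lower threshold catches every positive case (`sens_low` is 1.0) at the cost of more false alarms (`spec_low` drops below `spec_default`), which is the trade the slide argues for.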
How the model can be improved

Collect more data


• We can explore other years of CDC data, data from other public health agencies, or related datasets from academic resources, then preprocess the data and run EDA to build a more comprehensive and diverse dataset, which may lead to improved prediction accuracy

Feature selection and feature engineering


• Currently, we calculate the correlation between each feature and the target variable, then drop the low-correlation features
• We can use Lasso regression, feature importance, etc. to choose features that have more predictive power, which can improve model performance
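The current correlation-based filter can be sketched as follows. This is a minimal version under stated assumptions: Pearson correlation is used, the 0.1 cutoff is a hypothetical value, and the columns are made-up toy data rather than the real survey responses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, min_abs_corr=0.1):
    """Keep features whose absolute correlation with the target meets the cutoff.

    min_abs_corr is a hypothetical cutoff chosen for illustration.
    """
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= min_abs_corr]


# Toy data (hypothetical): two binary feature columns and a 0/1 target.
columns = {
    "HighBP": [1, 1, 1, 1, 0, 0, 0, 0],
    "Fruits": [1, 0, 1, 0, 1, 0, 1, 0],
}
target = [1, 1, 1, 0, 0, 0, 0, 1]

kept = select_features(columns, target)
```

A filter like this looks at one feature at a time, so it can miss features that only help in combination; that limitation is one reason to try model-based selection such as Lasso or feature importance.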

Fine-tuning the models of both Decision Tree and XGBoost


• For both the decision tree and the XGBoost model, we currently use the default parameters. Tuning the hyperparameters can significantly impact a model's accuracy. We can experiment with different hyperparameter values using techniques like grid search, randomized search, or Bayesian optimization to find the combination that improves the model's accuracy.
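Grid search, the simplest of these techniques, tries every combination of candidate values and keeps the best-scoring one. The sketch below shows the search loop only; `toy_validation_score` is a hypothetical stand-in for "train the model with these settings and return its validation accuracy", and the parameter names and candidate values are illustrative assumptions.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    return (0.9
            - 0.01 * abs(params["max_depth"] - 4)
            - 0.5 * abs(params["learning_rate"] - 0.1))


# Illustrative candidate values (assumptions, not the team's settings).
param_grid = {"max_depth": [2, 4, 6], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(param_grid, toy_validation_score)
```

The number of combinations grows multiplicatively with each added hyperparameter, which is the usual motivation for randomized search or Bayesian optimization on larger grids.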
DEMO

THANK YOU
