DM Report

Report
Group Members
Noman Ishaq
Arslan Rao
Shair Ali
Health Insurance Cost Prediction Project Report
1. Introduction
The purpose of this report is to present the findings and results of the Health Insurance Cost
Prediction project. The project aimed to develop a predictive model that could estimate the cost
of health insurance for individuals based on a set of relevant factors. Accurate cost prediction
can assist insurance providers in determining appropriate premiums and policy rates for their
customers, leading to better risk assessment and improved financial planning.
2. Data Collection and Preprocessing
The project began with the collection of a comprehensive dataset consisting of historical health
insurance data. The dataset contained information such as age, gender, body mass index (BMI),
number of children, smoking habit, region, and the corresponding insurance charges. The data
was obtained from a variety of sources, including insurance companies, government records,
and public datasets.
To prepare the data for analysis, several preprocessing steps were performed: handling missing
values, removing duplicates, encoding categorical variables, and normalizing numerical
features. The dataset was then divided into training and testing sets in a 70:30 ratio.
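The preprocessing steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the report does not name the tools it used, the sample rows and region labels are invented, and dropping incomplete rows is only one possible way of handling missing values.

```python
import random

COLUMNS = ["age", "gender", "bmi", "children", "smoker", "region", "charges"]

# Invented rows for illustration only; they are not taken from the project's dataset.
RAW = [
    (19, "female", 27.9, 0, "yes", "south", 16900.0),
    (18, "male", 33.8, 1, "no", "east", 1700.0),
    (28, "male", 33.0, 3, "no", "east", 4400.0),
    (33, "male", 22.7, 0, "no", "west", 22000.0),
    (32, "male", 28.9, 0, "no", "west", 3900.0),
    (31, "female", 25.7, 0, "no", "east", 3800.0),
    (46, "female", 33.4, 1, "no", "east", 8200.0),
    (37, "female", 27.7, 3, "no", "west", 7300.0),
    (37, "male", 29.8, 2, "no", "north", 6400.0),
    (60, "female", 25.8, 0, "no", "west", 28900.0),
    (60, "female", 25.8, 0, "no", "west", 28900.0),  # exact duplicate
    (25, "male", None, 0, "yes", "north", 15000.0),  # missing BMI
]
rows = [dict(zip(COLUMNS, values)) for values in RAW]


def drop_incomplete_and_duplicates(rows):
    """Drop rows with a missing value, then drop exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        if any(value is None for value in row.values()):
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1.0 if row[column] == category else 0.0
        encoded.append(new_row)
    return encoded


def min_max_scale(rows, column):
    """Rescale a numeric column to the range [0, 1]."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return [{**row, column: (row[column] - low) / span} for row in rows]


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle reproducibly and split into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


cleaned = drop_incomplete_and_duplicates(rows)
for column in ("gender", "smoker", "region"):
    cleaned = one_hot(cleaned, column)
for column in ("age", "bmi", "children"):
    cleaned = min_max_scale(cleaned, column)
train, test = train_test_split(cleaned)  # 70:30 split
```

The sketch scales before splitting, in the order the report lists the steps. A common alternative is to compute the scaling parameters on the training set only, so that no information from the test set reaches the model.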
3. Exploratory Data Analysis (EDA)
Before building the predictive model, an exploratory data analysis was conducted to gain
insights into the dataset. The EDA involved visualizing the distribution of variables, analyzing
correlations, and identifying any notable patterns or outliers. This step helped in understanding
the relationships between different variables and their potential impact on insurance costs.
Key findings from the EDA:

- Age and BMI showed a positive correlation with insurance charges.
- Smokers tended to have higher insurance costs compared to non-smokers.
- The number of children and region exhibited some influence on insurance charges.
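The kinds of checks behind these findings can be sketched as below: a Pearson correlation between a numeric feature and the charges, and a comparison of mean charges across groups. The rows are invented for illustration and the helper names are hypothetical; the report does not say how its EDA was implemented.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean_by_group(rows, group_key, value_key):
    """Average of value_key for each distinct value of group_key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


# Invented rows for illustration only.
rows = [
    {"age": 20, "bmi": 22.0, "smoker": "no", "charges": 2000.0},
    {"age": 30, "bmi": 26.0, "smoker": "no", "charges": 4000.0},
    {"age": 40, "bmi": 30.0, "smoker": "yes", "charges": 24000.0},
    {"age": 50, "bmi": 28.0, "smoker": "no", "charges": 9000.0},
    {"age": 60, "bmi": 33.0, "smoker": "yes", "charges": 40000.0},
]

charges = [row["charges"] for row in rows]
age_corr = pearson([row["age"] for row in rows], charges)
bmi_corr = pearson([row["bmi"] for row in rows], charges)
by_smoker = mean_by_group(rows, "smoker", "charges")

print(f"age vs charges: {age_corr:.2f}")
print(f"bmi vs charges: {bmi_corr:.2f}")
print(f"mean charges by smoking habit: {by_smoker}")
```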
4. Model Selection and Training
Based on the nature of the problem (regression) and the available dataset, several regression
models were evaluated, including Linear Regression, Decision Trees, Random Forest, and
Gradient Boosting. Performance metrics such as Mean Absolute Error (MAE), Mean Squared
Error (MSE), and R-squared were used to assess the models' predictive capabilities.
After comparing the models, it was determined that Gradient Boosting Regression yielded the
best results in terms of accuracy and generalization. The model was selected for further
refinement and evaluation.
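A comparison of this kind might look like the sketch below. It assumes scikit-learn and NumPy, which the report does not explicitly name, and it runs on synthetic data generated inside the script, so the scores it prints say nothing about the project's actual results. Default hyperparameters are used throughout.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
charges = (
    250 * age
    + 300 * bmi
    + 20000 * smoker
    + 500 * smoker * (bmi - 30)
    + rng.normal(0, 2000, n)
)
X = np.column_stack([age, bmi, smoker])

# 70:30 split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.3, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit every candidate on the training set and score it on the held-out set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, predictions),
        "MSE": mean_squared_error(y_test, predictions),
        "R2": r2_score(y_test, predictions),
    }

for name, scores in results.items():
    print(f"{name:18s} MAE={scores['MAE']:.0f} MSE={scores['MSE']:.0f} R2={scores['R2']:.3f}")
```

Scoring every candidate on the same held-out split keeps the comparison fair. Cross-validation would give a less split-dependent ranking at the cost of more training runs.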
5. Model Evaluation
The selected Gradient Boosting Regression model was evaluated using the testing dataset. The
evaluation metrics used were MAE, MSE, and R-squared. The model achieved satisfactory
results, with an MAE of X, MSE of Y, and R-squared of Z. These metrics indicated that the model
could effectively predict health insurance costs based on the given features.
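For reference, the three metrics can be computed from their standard definitions as in the sketch below. The function names and the sample values are illustrative and are not the project's figures.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented values for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MAE={mae:.2f} MSE={mse:.2f} R-squared={r2:.3f}")
```

MAE is expressed in the same units as the insurance charges, which makes it the easiest of the three to interpret. MSE is in squared units, and R-squared is unitless with a maximum of 1.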
6. Feature Importance
The model's feature importance was analyzed to determine the factors that contributed most
significantly to insurance costs. The analysis revealed that age, BMI, smoking habit, and region
were the most influential features, while the number of children had a relatively lower impact
on cost prediction.
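The report does not say which importance measure was used. One model-agnostic option is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below illustrates that technique with a hypothetical stand-in prediction function and invented data, so its output does not reproduce the ranking reported above.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, features, repeats=5, seed=0):
    """Mean increase in MSE after shuffling each feature's values across rows."""
    rng = random.Random(seed)
    baseline = mse(targets, [predict(row) for row in rows])
    importances = {}
    for feature in features:
        increases = []
        for _ in range(repeats):
            shuffled = [row[feature] for row in rows]
            rng.shuffle(shuffled)
            permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
            permuted_error = mse(targets, [predict(row) for row in permuted])
            increases.append(permuted_error - baseline)
        importances[feature] = sum(increases) / repeats
    return importances


def stand_in_predict(row):
    """Hypothetical stand-in for a trained model; it ignores 'children' by design."""
    return 250 * row["age"] + 300 * row["bmi"] + 20000 * row["smoker"]


# Invented data for illustration only.
data_rng = random.Random(1)
rows = [
    {
        "age": data_rng.randint(18, 64),
        "bmi": data_rng.uniform(18, 40),
        "smoker": data_rng.randint(0, 1),
        "children": data_rng.randint(0, 4),
    }
    for _ in range(50)
]
targets = [stand_in_predict(row) + data_rng.gauss(0, 2000) for row in rows]

importances = permutation_importance(
    stand_in_predict, rows, targets, ["age", "bmi", "smoker", "children"]
)
for feature, score in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{feature:9s} {score:14.1f}")
```

A feature the model never uses, such as `children` in this stand-in, scores exactly zero because shuffling it leaves every prediction unchanged.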
7. Deployment and Integration
To make the model accessible and usable by insurance providers, a user-friendly interface was
developed. The interface allowed users to input the relevant information (age, gender, BMI,
smoking habit, number of children, and region) and obtain an estimated health insurance cost
based on the trained model.
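The core of such an interface, taking the six inputs, validating them, and passing them to the model, could be sketched as follows. Every name here is hypothetical: the report does not describe the interface's implementation, the region labels and validation limits are invented, and `stand_in_model` is a placeholder for the trained Gradient Boosting model.

```python
from dataclasses import dataclass

# Hypothetical region labels; the report does not list the actual values.
REGIONS = ("north", "south", "east", "west")


@dataclass(frozen=True)
class Applicant:
    age: int
    gender: str
    bmi: float
    smoker: bool
    children: int
    region: str


def validate(applicant):
    """Reject inputs outside plausible ranges before they reach the model."""
    if not 0 < applicant.age < 120:
        raise ValueError("age must be between 1 and 119")
    if applicant.gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    if not 10.0 <= applicant.bmi <= 80.0:
        raise ValueError("bmi must be between 10 and 80")
    if applicant.children < 0:
        raise ValueError("children cannot be negative")
    if applicant.region not in REGIONS:
        raise ValueError(f"region must be one of {REGIONS}")


def to_features(applicant):
    """Encode the applicant as a numeric feature vector."""
    return [
        float(applicant.age),
        1.0 if applicant.gender == "male" else 0.0,
        applicant.bmi,
        1.0 if applicant.smoker else 0.0,
        float(applicant.children),
    ] + [1.0 if applicant.region == region else 0.0 for region in REGIONS]


def estimate_cost(applicant, model_predict):
    """Validate, encode, and return the model's estimate rounded to cents."""
    validate(applicant)
    return round(model_predict(to_features(applicant)), 2)


def stand_in_model(features):
    """Placeholder for the trained model; the coefficients are invented."""
    age, bmi, smoker = features[0], features[2], features[3]
    return 250 * age + 300 * bmi + 20000 * smoker


applicant = Applicant(age=40, gender="female", bmi=25.0, smoker=True, children=2, region="north")
estimate = estimate_cost(applicant, stand_in_model)
print(f"Estimated annual cost: {estimate}")
```

Passing the prediction function in as an argument keeps the input handling independent of the model, so the model can be retrained or replaced without touching the interface code.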
The model was integrated into the insurance company's existing infrastructure with scalability
in mind. Extensive testing and quality assurance were undertaken to validate the model's
performance in a production environment.
8. Conclusion
The Health Insurance Cost Prediction project successfully developed a predictive model that
could accurately estimate health insurance costs for individuals. The Gradient Boosting
Regression model exhibited superior performance, outperforming other regression models in
terms of accuracy. The model's feature importance analysis highlighted age, BMI, smoking habit,
and region as key factors influencing insurance costs.
