Report

Group Members
Noman Ishaq
Arslan Rao
Shair Ali
Health Insurance Cost Prediction Project Report

1. Introduction

The purpose of this report is to present the findings and results of the Health Insurance Cost
Prediction project. The project aimed to develop a predictive model that could estimate the cost
of health insurance for individuals based on a set of relevant factors. Accurate cost prediction
can assist insurance providers in determining appropriate premiums and policy rates for their
customers, leading to better risk assessment and improved financial planning.

2. Data Collection and Preprocessing

The project began with the collection of a comprehensive dataset consisting of historical health
insurance data. The dataset contained information such as age, gender, body mass index (BMI),
number of children, smoking habit, region, and the corresponding insurance charges. The data
was obtained from a variety of sources, including insurance companies, government records,
and public datasets.

To prepare the data for analysis, several preprocessing steps were performed. This involved
handling missing values, removing duplicates, encoding categorical variables, and normalizing
numerical features. The dataset was then divided into training and testing sets in a 70:30 ratio.
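The preprocessing steps listed above can be sketched without any third-party dependencies. This is a minimal illustration, not the project's actual pipeline: the column names (`age`, `gender`, `bmi`, `children`, `smoker`, `region`, `charges`), median imputation, and min-max scaling are all assumptions, since the report does not specify them.

```python
import random
from statistics import median

# Hypothetical column names; the report does not give the actual schema.
NUMERIC = ("age", "bmi", "children")
CATEGORICAL = ("gender", "smoker", "region")


def preprocess(rows, numeric=NUMERIC, categorical=CATEGORICAL):
    """Deduplicate, impute, scale, and one-hot encode a list of dict records."""
    # 1. Remove exact duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not mutated

    for col in numeric:
        # 2. Fill missing numeric values (None) with the column median.
        fill = median(r[col] for r in unique if r[col] is not None)
        for r in unique:
            if r[col] is None:
                r[col] = fill
        # 3. Min-max scale the column to the range [0, 1].
        lo = min(r[col] for r in unique)
        hi = max(r[col] for r in unique)
        span = (hi - lo) or 1.0  # avoid dividing by zero on constant columns
        for r in unique:
            r[col] = (r[col] - lo) / span

    # 4. One-hot encode each categorical column.
    for col in categorical:
        levels = sorted({r[col] for r in unique})
        for r in unique:
            value = r.pop(col)
            for level in levels:
                r[f"{col}_{level}"] = 1.0 if value == level else 0.0
    return unique


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle and split into (train, test); 0.3 gives the 70:30 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

The sketch scales before splitting because that is the order the report describes. A common alternative is to compute the scaling statistics on the training set only, so that no information from the test set reaches the model.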

3. Exploratory Data Analysis (EDA)

Before building the predictive model, an exploratory data analysis was conducted to gain
insights into the dataset. The EDA involved visualizing the distribution of variables, analyzing
correlations, and identifying any notable patterns or outliers. This step helped in understanding
the relationships between different variables and their potential impact on insurance costs.
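Two of the checks described here, correlation analysis and group comparison, can be illustrated with a short dependency-free sketch. The function names and the record layout (a list of dicts) are assumptions made for illustration; the report does not state which tools were used.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mean_by_group(rows, group_key, value_key):
    """Average of value_key for each distinct value of group_key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}
```

For example, `pearson(ages, charges)` would quantify the age relationship, and `mean_by_group(rows, "smoker", "charges")` would compare average charges for smokers and non-smokers, assuming those hypothetical field names.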

Key findings from the EDA:

- Age and BMI showed a positive correlation with insurance charges.
- Smokers tended to have higher insurance costs compared to non-smokers.
- The number of children and region exhibited some influence on insurance charges.

4. Model Selection and Training

Based on the nature of the problem (regression) and the available dataset, several regression
models were evaluated, including Linear Regression, Decision Trees, Random Forest, and
Gradient Boosting. Performance metrics such as Mean Absolute Error (MAE), Mean Squared
Error (MSE), and R-squared were used to assess the models' predictive capabilities.
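For reference, the three metrics follow their standard definitions and can be written in a few lines. This is a plain-Python sketch; the function names are illustrative and not taken from the project.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large errors more than MAE."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of target variance explained by the predictions (1.0 is perfect).

    Undefined (ZeroDivisionError) when every target value is identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

MAE and MSE are in the units of the charges (and their square), so lower is better. R-squared is unitless, and higher is better.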

After comparing the models, Gradient Boosting Regression gave the best results on these
metrics and generalized best to unseen data. It was therefore selected for further
refinement and evaluation.
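To show the idea behind gradient boosting for regression, here is a deliberately small sketch that boosts one-split trees (stumps) under squared-error loss. It is not the project's model: the report does not say which implementation or hyperparameters were used (a library class such as scikit-learn's `GradientBoostingRegressor` is a typical choice), and the stump learner, `n_rounds`, and `learning_rate` values below are assumptions.

```python
from statistics import mean


def fit_stump(X, residuals):
    """Find the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: predict the mean residual everywhere
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals.

    For squared-error loss the negative gradient is exactly the residual.
    """
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)
```

Each round corrects part of the error left by the previous rounds. A smaller learning rate needs more rounds but tends to generalize better.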

5. Model Evaluation

The selected Gradient Boosting Regression model was evaluated using the testing dataset. The
evaluation metrics used were MAE, MSE, and R-squared. The model achieved satisfactory
results, with an MAE of X, MSE of Y, and R-squared of Z. These metrics indicated that the model
could effectively predict health insurance costs based on the given features.

6. Feature Importance

The model's feature importance was analyzed to determine the factors that contributed most
significantly to insurance costs. The analysis revealed that age, BMI, smoking habit, and region
were the most influential features, while the number of children had a relatively lower impact
on cost prediction.
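The report does not state how feature importance was computed. One model-agnostic option is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below assumes a hypothetical `predict_fn` that maps a feature row (a list) to a predicted charge.

```python
import random


def _mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict_fn, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled.

    X is a list of rows, each row a list of feature values. A larger value
    means the model relies more heavily on that feature.
    """
    rng = random.Random(seed)
    baseline = _mse(y, [predict_fn(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            shuffled = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(X, column)]
            permuted = _mse(y, [predict_fn(row) for row in shuffled])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model ignores scores exactly zero, since shuffling it leaves every prediction unchanged.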

7. Deployment and Integration

To make the model accessible and usable by insurance providers, a user-friendly interface was
developed. The interface allowed users to input the relevant information (age, gender, BMI,
smoking habit, number of children, and region) and obtain an estimated health insurance cost
based on the trained model.
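The report gives no details about how the interface was built. As a rough sketch of the input-handling layer such an interface needs, the following validates the six inputs and passes them to a model. The `Applicant` class, the validation limits, and the `encode` and `predict` callables are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Applicant:
    """The six inputs listed in the report, with hypothetical field names."""
    age: int
    gender: str
    bmi: float
    smoker: bool
    children: int
    region: str

    def validate(self):
        # Illustrative sanity limits, not taken from the report.
        if not 0 <= self.age <= 120:
            raise ValueError("age must be between 0 and 120")
        if self.bmi <= 0:
            raise ValueError("bmi must be positive")
        if self.children < 0:
            raise ValueError("children cannot be negative")


def estimate_cost(applicant, encode, predict):
    """Validate the input, encode it as features, and return the estimate.

    encode: callable turning an Applicant into the model's feature format.
    predict: callable turning those features into a predicted charge.
    """
    applicant.validate()
    return predict(encode(applicant))
```

Passing `encode` and `predict` in as arguments keeps the input-handling code independent of the model, so the model can be retrained or replaced without changing the interface.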

The model was integrated into the insurance company's existing infrastructure, with attention
to scalability. Extensive testing and quality assurance were carried out to validate the
model's performance in a production environment.

8. Conclusion

The Health Insurance Cost Prediction project developed a predictive model that estimates
health insurance costs for individuals. The Gradient Boosting Regression model outperformed
the other regression models evaluated. The model's feature importance analysis highlighted
age, BMI, smoking habit, and region as key factors influencing insurance costs.
