
Introduction and Background:

Medical insurance costs are a significant factor that affects the health and well-being of
individuals and society (El-Sayed, El-Bakry, & El-Sayed, 2021; Pronk, Goodman, O'Connor, &
Martinson, 1999). Predicting medical insurance costs can help individuals plan their budgets,
insurers design their policies, and policymakers evaluate the impact of health interventions.
However, predicting medical insurance costs is not a simple task, as there are many factors that
influence them, such as smoking status, age, gender, BMI, region, and number of dependents
(El-Sayed et al., 2021; Leutzinger et al., 2000). Therefore, it is important to analyze these factors
and develop models to predict medical insurance costs.
The dataset used in this project is available on Kaggle, an online platform for data science and
machine learning competitions (Choi, 2018). The dataset contains information about the medical
insurance costs of individuals and their demographic and lifestyle characteristics. The dataset
was originally taken from the Machine Learning with R dataset repository on GitHub.
This project's objective is to predict the medical insurance costs of individuals based on
their demographic and lifestyle characteristics. To achieve this objective, we will use linear
regression, an algorithm used to model the relationship between two or more features or
variables (Taloba, El-Aziz, Rasha, Alshanbari, & El-Bagoury, 2022). Linear regression examines
a dependent variable and one or more independent variables. The dependent variable is the one
that we want to predict or explain, while the independent variables are the ones that we use to
make the prediction or explanation. In this case, the dependent variable is medical insurance
costs (charges), while the independent variables are BMI, age, sex, children, smoking status,
and region.

Source of Data:

The dataset is hosted on Kaggle, an online platform for data science and machine learning
competitions (Choi, 2018), and was originally taken from the Machine Learning with R dataset
repository on GitHub. It contains 1338 observations (rows) and 7 variables (columns).

How the Data was Collected:

The data was collected through a survey of individuals who have health insurance (Choi, 2018).
The survey collected information about the medical insurance costs of individuals and their
demographic and lifestyle characteristics. The data was then compiled and made available for
analysis by researchers and data scientists.

Description of Dataset:

The dataset used in this project contains the following variables:


Variable   Description
Age        Age of the primary beneficiary.
Sex        Gender of the client.
BMI        Body mass index, calculated as weight in kg / (height in m)^2.
Children   Number of minors (dependents) covered by the insurance.
Smoker     Smoking status (Yes/No).
Charges    Medical costs billed by health insurance for each individual.
Region     Residential area of the beneficiary in the US (Northwest, Southeast, Southwest,
           Northeast).

The dataset has 1338 observations (rows) and 7 variables (columns). The dataset includes
information about individuals' health insurance costs and their demographic and lifestyle
characteristics, such as age, gender, BMI, number of dependents, smoking status, and region of
residence. The dependent variable in this dataset is medical insurance costs (charges), while the
independent variables are age, sex, smoker, bmi, children, and region.

Description of Variables Used:

The variables used to predict medical insurance costs are as follows:


• Age: Age of the primary beneficiary. Age is an important factor in predicting medical
insurance costs, as older individuals may require more medical attention than younger
individuals.
• Sex: Gender of the insurance contractor. Gender may also influence medical insurance
costs, as certain medical conditions may be more common in one gender than the other.
• BMI: Body mass index, a measure of body mass relative to height, calculated as
weight (kg) / square of height (meters) (Choi, 2018). BMI is used to assess a person's health
status, and individuals with higher BMIs may be more likely to have health problems, which
could lead to higher medical insurance costs (El-Sayed et al., 2021).
• Children: The number of dependents under the age of 18 years who are covered by health
insurance (Choi, 2018). This variable provides information about the family size and the
number of individuals who are dependent on the primary insured person for health care
coverage.
• Smoker: A binary variable that takes the value 'Yes' if the person is a smoker, and 'No'
if the person is a non-smoker (Choi, 2018). Smoking is a major risk factor for many
health problems, and it is one of the factors that insurance companies consider when
calculating premiums. Smokers are generally charged higher premiums due to the
increased risk of developing health problems (Pronk et al., 1999).
• Region: The geographic location of the beneficiary's residential area in the United States,
a categorical variable that can take one of four values: Northeast, Southeast, Southwest, or
Northwest (Choi, 2018). This variable can be used to identify location-specific patterns in
the data, such as differences in medical costs or the prevalence of certain health conditions
(Leutzinger et al., 2000).
• Charges: The medical costs that were billed by health insurance for each individual
(Choi, 2018). Medical costs can vary widely depending on the type of treatment or procedure
received, the location of the medical facility, and the individual's health status
(Sushmita et al., 2015). This variable is a critical component of the dataset as it provides
information about the amount of medical expenditure, which is one of the key factors that
insurance companies use to determine premiums.
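
The BMI formula quoted in the list above can be written as a small helper. This is only a sketch: the function name and the sample weight and height are illustrative assumptions, not values from the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall (not a row from the dataset).
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2))
```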

Methodology

This project employs a range of data analytical techniques, including Exploratory Data Analysis
(EDA), statistical analysis, clustering, and multiple linear regression, to answer the research
questions and solve the identified problems.
EDA is the first step in the analysis and a crucial part of any data analysis project (Taloba et
al., 2022). It explores the data and its characteristics using techniques such as summary
statistics, histograms, scatter plots, box plots, and correlation matrices before any statistical
analysis or modelling is applied. Performing EDA helps identify missing or invalid data,
outliers, and other anomalies that need to be addressed before proceeding with further analysis.
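
As a minimal sketch of the outlier check that a box plot performs, the snippet below applies the common 1.5 x IQR rule using only the standard library. The charge values are made up for illustration and are not taken from the dataset, and the function name is hypothetical.

```python
from statistics import mean, median, quantiles


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical charges in dollars (illustrative only).
charges = [1200.0, 2500.0, 4700.0, 8200.0, 9400.0, 11000.0, 13500.0, 60000.0]

outliers = iqr_outliers(charges)
print("mean:", mean(charges), "median:", median(charges))
print("outliers:", outliers)
```

A mean that sits well above the median, as in this made-up sample, is the pattern one would expect from a right-skewed variable.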
Correlation analysis is a common statistical method used to gauge the strength and direction of
the relationship between variables (El-Sayed et al., 2021). It provides insight into how
variables are related and helps to
identify the strongest predictors of a target variable. Hypothesis testing is another statistical
technique that tests the significance of the difference between two or more groups or variables
(Sushmita et al., 2015). The results of statistical analysis can help us to identify any trends or
patterns that were not apparent through EDA.
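
The two statistics described above can be sketched with the standard library alone. The snippet computes a Pearson correlation coefficient and a two-sample t statistic with pooled variance, which is the form that scipy.stats.ttest_ind uses by default. All sample values and function names are hypothetical.

```python
import math
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


# Hypothetical samples (illustrative only, not rows from the dataset).
ages = [20, 30, 40, 50, 60]
charges = [2000, 4000, 6500, 8000, 11000]
smoker_charges = [30000, 32000, 35000, 41000]
non_smoker_charges = [5000, 7000, 8000, 12000]

r = pearson_r(ages, charges)
t = pooled_t_statistic(smoker_charges, non_smoker_charges)
print("r:", round(r, 3), "t:", round(t, 3))
```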
Clustering is a technique used to group similar data points into clusters or groups (Yedla,
Pathakota, & Srinivasa, 2010). The main objective is to find homogeneous groups of individuals
with similar characteristics (Sushmita et al., 2015). Clustering techniques use various algorithms
to group data points into clusters based on the similarity of their attributes (Yedla et al., 2010).
K-means clustering is a commonly used algorithm that partitions data points into k clusters by
minimizing the within-cluster sum of squares (WCSS) (Uyanık & Güler, 2013). The WCSS is the sum
of the squared distances between each data point and the centroid of its assigned cluster
(Uyanık & Güler, 2013). Under the elbow method, the optimal number of clusters is the point
beyond which adding another cluster no longer reduces the WCSS substantially
(Uyanık & Güler, 2013). Clustering can help to identify any hidden patterns or
relationships between variables and to segment individuals into distinct groups (Sushmita et al.,
2015).
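
To make the WCSS and the elbow method concrete, here is a small pure-Python sketch of Lloyd's k-means algorithm that reports the WCSS for several values of k. It is an illustration under stated assumptions, not the code used in this project: the points are invented (age, BMI) pairs and the function names are hypothetical. In scikit-learn, the same quantity is exposed as the inertia_ attribute of a fitted KMeans object.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, wcss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss


# Hypothetical (age, BMI) pairs forming three loose groups.
points = [(20, 22), (22, 24), (21, 23), (40, 30), (42, 31), (41, 29),
          (60, 35), (62, 36), (61, 34)]

wcss_by_k = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

Plotting these WCSS values against k and looking for the point where the curve flattens is the elbow method described above.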
Multiple linear regression is a statistical technique that models the connection between a
dependent variable and several independent variables (Taloba et al., 2022). It involves fitting a
linear equation to the data, where the dependent variable is a function of the independent
variables (Taloba et al., 2022). The coefficients in the equation represent the contribution of each
independent variable to the dependent variable (Taloba et al., 2022). Multiple linear regression
allows us to predict the value of the dependent variable for a given set of independent variables
(Taloba et al., 2022). It is a useful technique for analyzing complex relationships between
variables and for making predictions (El-Sayed, El-Bakry, & El-Sayed, 2021).
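
Written out, the model described above takes the following general form. The notation is conventional and assumed here rather than taken from the cited sources: beta for the coefficients, epsilon for the error term, p for the number of predictors, and n for the number of observations. The second expression is the ordinary least squares criterion, which is what scikit-learn's LinearRegression fits.

```latex
% Multiple linear regression model for observation i with p predictors
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i

% Ordinary least squares chooses the coefficients that minimise the
% sum of squared residuals over the n observations
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
  \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2
```

In this project, y_i would be the charges of individual i and the x_{ij} would be that individual's encoded predictor values.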
In summary, to answer the research questions and solve the problems posed in this project, we
will use a combination of data analytical techniques such as EDA, statistical analysis (Harrison
et al., 2020), clustering (Yedla et al., 2010), and multiple linear regression (Taloba et al., 2022).
Each of these techniques provides valuable insights into the data and helps to identify any
patterns or relationships between variables (Data et al., 2016). By using these techniques, we can
gain a better understanding of the data and make informed decisions based on the results.

Results and Discussions

The exploratory data analysis (EDA) revealed that the mean age of the individuals in the dataset
is 39 years with a standard deviation of 14 years. The mean BMI is 30.66 with a standard
deviation of 6.10. The average number of children is 1.09. The average medical cost billed by the
insurance company is $13,270 with a standard deviation of $12,110. The minimum medical cost
is $1,121 while the maximum medical cost is $63,770.


The distribution of charges shows a right-skewed distribution with a long tail, which indicates
that some individuals have much higher medical costs than others. The box plots for charges by
smoker status and region reveal that smokers, on average, have higher medical costs than
non-smokers, and individuals from the southeast region, on average, have higher medical costs
than individuals from other regions.


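The correlation matrix shows that age and BMI are positively correlated with medical costs.
This observation is further supported by the multiple linear regression results, in which the
coefficients for age, BMI, and the number of children are all positive. The smoker coefficient
is by far the largest: with the other predictors held fixed, smokers are predicted to be charged
about $23,604 more than non-smokers. The regional pattern changes once the other predictors are
controlled for. All three regional dummy coefficients are negative relative to the omitted
baseline region (northeast, the first category alphabetically), and the southeast coefficient is
the most negative (about -$914), so the higher raw charges seen for the southeast in the box
plots are not attributed to region itself by this model.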

The t-test result shows that the difference in medical costs between smokers and non-smokers is
statistically significant (t = 46.66, p < 0.001).

Statistical Analysis:
---------------------
t-statistic: 46.66492117272371
p-value: 8.271435842179102e-283

The K-means clustering algorithm was used to identify groups of individuals with similar
characteristics. The algorithm clustered the individuals into three groups based on age, BMI, and
the number of children. Group 0 consists of individuals who are generally younger and have
lower BMIs and fewer children. Group 1 consists of individuals who are older, have higher
BMIs, and more children. Group 2 consists of individuals who are in between the other two
groups in terms of age, BMI, and the number of children.


Multiple Linear Regression:
----------------------------
Coefficients: [ 253.71289612 335.92334322 436.72852645
23604.09678761 -259.87804549 -913.5869549 -761.7475303 ]

Intercept: -11834.604223900491
R-squared: 0.7999608872686219
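
To show how the printed coefficients translate into a prediction, the sketch below plugs rounded values into the linear equation. The pairing of each coefficient with a feature name is an assumption based on the column order that pd.get_dummies(..., drop_first=True) would produce for the appendix code, with northeast as the omitted baseline region. The two individuals are hypothetical.

```python
# Intercept and coefficients copied from the printed output, rounded to 2 decimals.
# ASSUMPTION: the order matches age, bmi, children, smoker_yes, then the three
# regional dummies, with northeast as the dropped baseline category.
INTERCEPT = -11834.60
COEFFICIENTS = {
    "age": 253.71,
    "bmi": 335.92,
    "children": 436.73,
    "smoker_yes": 23604.10,
    "region_northwest": -259.88,
    "region_southeast": -913.59,
    "region_southwest": -761.75,
}


def predict_charges(features):
    """Linear prediction: intercept plus coefficient times feature value.

    Features missing from the dict are treated as 0 (e.g. a non-smoker
    living in the baseline region).
    """
    return INTERCEPT + sum(
        COEFFICIENTS[name] * features.get(name, 0.0) for name in COEFFICIENTS
    )


# Hypothetical individuals who differ only in smoking status.
non_smoker = {"age": 40, "bmi": 30.0, "children": 2}
smoker = {**non_smoker, "smoker_yes": 1}

print(round(predict_charges(non_smoker), 2))
print(round(predict_charges(smoker), 2))
```

Because the model is additive, the gap between the two predictions equals the smoker coefficient regardless of the other feature values.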

In conclusion, the results of the analysis address the research question, which was to identify
factors associated with medical costs. Age, BMI, number of children, and smoker status are all
positively associated with medical costs, and charges also differ by region. The analysis also
reveals that smokers, on average, have significantly higher medical costs than non-smokers.
Furthermore, the K-means clustering algorithm identified three groups of individuals with
similar characteristics based on age, BMI, and the number of children.

Summary, challenges and limitations:

The analysis of the medical insurance dataset has provided some valuable insights into the
factors that contribute to the medical insurance charges. The exploratory data analysis revealed
that the distribution of the charges is positively skewed, with a mean of $13,270 and a standard
deviation of $12,110. It was also observed that smokers have significantly higher charges than
non-smokers, and the charges vary across different regions. The correlation matrix, which covers
only the numeric variables, showed positive correlations between charges and the numeric
predictors, most notably age and BMI.
Cluster analysis was used to group the observations into three clusters based on age, BMI, and
number of children. The clustering revealed that the three clusters have different age ranges and
BMI values, with cluster 0 having the highest average age and BMI values, while cluster 1 has
the lowest. The charges also vary across the clusters, with cluster 2 having the highest average
charges and cluster 1 having the lowest. These clusters can be used to segment the market and
develop targeted marketing strategies.
Multiple linear regression was used to predict the charges based on age, BMI, number of
children, smoking status, and region. The model had an R-squared value of 0.80 on the held-out
test set, which indicates that about 80% of the variation in the test charges is explained by
the independent variables in the model. The coefficients indicate that age, BMI, number of
children, and smoking status all have a positive effect on the charges, with smoking status
having by far the largest coefficient (about 23,604). The coefficients of the regional dummy
variables are all negative relative to the omitted baseline region (northeast), which indicates
that, with the other predictors held fixed, predicted charges are highest in the northeast and
lowest in the southeast.
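
The R-squared value discussed above follows the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares, which is also what sklearn.metrics.r2_score computes. A minimal sketch with hypothetical actual and predicted charges:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical test-set charges and model predictions (illustrative only).
actual = [3000.0, 8000.0, 12000.0, 40000.0]
predicted = [5000.0, 7000.0, 15000.0, 36000.0]

print(round(r_squared(actual, predicted), 3))
```

A model that always predicts the mean of the actual values scores 0 under this definition, and a perfect model scores 1.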
One limitation of this analysis is that it only considers a limited number of variables that may
impact the medical insurance charges. Other factors such as pre-existing medical conditions,
lifestyle habits, and occupation may also have an impact on the charges but were not included in
the analysis. Another limitation is that the analysis assumes a linear relationship between the
independent variables and the charges, which may not be accurate in reality. Non-linear
relationships and interactions between the variables may also impact the charges.
In conclusion, the analysis of the medical insurance dataset has provided valuable insights into
the factors that impact the charges. The analysis can be used by insurance companies to segment
the market, develop targeted marketing strategies, and set prices based on risk. However, the
analysis is limited by the number of variables considered and the assumption of a linear
relationship between the variables and the charges. Further research is needed to explore the
impact of other variables and the presence of non-linear relationships and interactions on the
charges.

Appendix:

Python code:

# Import necessary libraries


import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load the dataset


df = pd.read_csv('insurance.csv')

# Exploratory Data Analysis


# Descriptive statistics
print(df.describe())

# Distribution of charges
sns.displot(df['charges'])
plt.title('Distribution of Charges')
plt.xlabel('Charges')
plt.ylabel('Count')
plt.show()

# Distribution of charges by smoker status


sns.boxplot(x='smoker', y='charges', data=df)
plt.title('Distribution of Charges by Smoker Status')
plt.xlabel('Smoker')
plt.ylabel('Charges')
plt.show()

# Distribution of charges by region


sns.boxplot(x='region', y='charges', data=df)
plt.title('Distribution of Charges by Region')
plt.xlabel('Region')
plt.ylabel('Charges')
plt.show()

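# Correlation matrix (numeric columns only: sex, smoker and region are strings,
# and recent pandas versions raise an error if they are passed to corr())
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')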
plt.title('Correlation Matrix')
plt.show()

# Statistical Analysis
# t-test for smoker and non-smoker charges
smoker_charges = df[df['smoker'] == 'yes']['charges']
non_smoker_charges = df[df['smoker'] == 'no']['charges']
t, p = stats.ttest_ind(smoker_charges, non_smoker_charges)
print('\nStatistical Analysis:')
print('---------------------')
print('t-statistic:', t)
print('p-value:', p)

# Clustering
# Data preparation
X = df[['age', 'bmi', 'children']]
# KMeans algorithm
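# n_init is set explicitly so the result does not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)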
# Add cluster labels to dataframe
df['cluster'] = kmeans.labels_

# Plotting clusters
sns.scatterplot(data=df, x='age', y='bmi', hue='cluster', palette='Set1')
plt.title('Clustering')
plt.xlabel('Age')
plt.ylabel('BMI')
plt.show()

# Multiple Linear Regression


# Data preparation
X = df[['age', 'bmi', 'children', 'smoker', 'region']]
X = pd.get_dummies(X, drop_first=True)
y = df['charges']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Model training
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Model testing
y_pred = regressor.predict(X_test)
# Model evaluation
print('\nMultiple Linear Regression:')
print('----------------------------')
print('Coefficients:', regressor.coef_)
print('Intercept:', regressor.intercept_)
print('R-squared:', r2_score(y_test, y_pred))

References:

1. Choi, M.I.R.I. (2018). Medical cost personal datasets [Data set]. Kaggle.
   https://www.kaggle.com/mirichoi0218/insurance
2. El-Sayed, A., El-Bakry, H., & El-Sayed, M. (2021). A computational intelligence approach for
   predicting medical insurance cost. Mathematical Problems in Engineering, 2021, Article
   1162553. https://doi.org/10.1155/2021/1162553
3. Pronk, N. P., Goodman, M. J., O'Connor, P. J., & Martinson, B. C. (1999). Relationship
   between modifiable health risks and short-term health care charges. JAMA, 282(23),
   2235-2239.
4. Leutzinger, J. A., Ozminkowski, R. J., Dunn, R. L., Goetzel, R. Z., Richling, D. E.,
   Stewart, M., & Whitmer, R. W. (2000). Projecting future medical care costs using four
   scenarios of lifestyle risk rates. American Journal of Health Promotion, 15(1), 35-44.
5. Sushmita, S., Newman, S., Marquardt, J., Ram, P., Prasad, V., Cock, M. D., & Teredesai,
   A. (2015, May). Population cost prediction on public healthcare datasets. In Proceedings
   of the 5th international conference on digital health 2015 (pp. 87-94).
6. Taloba, A. I., El-Aziz, A., Rasha, M., Alshanbari, H. M., & El-Bagoury, A. A. H. (2022).
   Estimation and prediction of hospitalization and medical care costs using regression in
   machine learning. Journal of Healthcare Engineering, 2022.
7. Data, M. C., Komorowski, M., Marshall, D. C., Salciccioli, J. D., & Crutain, Y. (2016).
   Exploratory data analysis. Secondary analysis of electronic health records, 185-203.
8. Harrison, A. J., McErlain-Naylor, S. A., Bradshaw, E. J., Dai, B., Nunome, H., Hughes,
   G. T., ... & Fong, D. T. (2020). Recommendations for statistical analysis involving null
   hypothesis significance testing. Sports Biomechanics, 19(5), 561-568.
9. Yedla, M., Pathakota, S. R., & Srinivasa, T. M. (2010). Enhancing K-means clustering
   algorithm with improved initial center. International Journal of Computer Science and
   Information Technologies, 1(2), 121-125.
10. Uyanık, G. K., & Güler, N. (2013). A study on multiple linear regression
    analysis. Procedia-Social and Behavioral Sciences, 106, 234-240.
