Medical insurance costs are a significant factor that affects the health and well-being of
individuals and society (El-Sayed, El-Bakry, & El-Sayed, 2021; Pronk, Goodman, O'Connor, &
Martinson, 1999). Predicting medical insurance costs can help individuals plan their budgets,
insurers design their policies, and policymakers evaluate the impact of health interventions.
However, predicting medical insurance costs is not a simple task, as there are many factors that
influence them, such as smoking status, age, gender, BMI, region, and number of dependents
(El-Sayed et al., 2021; Leutzinger et al., 2000). Therefore, it is important to analyze these factors
and develop models to predict medical insurance costs.
The dataset used in this project is available on Kaggle, an online platform for data science and
machine learning competitions (Choi, 2018). The dataset contains information about the medical
insurance costs of individuals and their demographic and lifestyle characteristics. The dataset
was originally taken from the Machine Learning with R dataset repository on GitHub.
This project’s objective is to predict the medical insurance costs of individuals based on
their demographic and lifestyle characteristics. To achieve this objective, we will use linear
regression, an algorithm used to predict or visualize the relationship between two or more
features or variables (Taloba, El-Aziz, Rasha, Alshanbari, & El-Bagoury, 2022). Linear
regression examines a dependent variable and one or more independent variables. The dependent
variable is the one that we want to predict or explain, while the independent variables are the
ones that we use to make the prediction or explanation. In this case, the dependent variable is
medical insurance costs (charges), while the independent variables are BMI, age, sex, number of
children, smoking status, and region.
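Sex, smoking status, and region are categorical, so they need a numeric encoding before they can enter a linear model. The following is a minimal sketch of one way to do that, assuming the column names of the Kaggle file (age, sex, bmi, children, smoker, region, charges). The rows are made up for illustration and `build_design_matrix` is a hypothetical helper, not part of the original project code.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT values from the real dataset.
sample = pd.DataFrame({
    "age": [23, 45, 61, 34],
    "sex": ["female", "male", "female", "male"],
    "bmi": [24.5, 31.2, 28.9, 35.0],
    "children": [0, 2, 1, 3],
    "smoker": ["no", "yes", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [2500.0, 39000.0, 14000.0, 41000.0],
})

def build_design_matrix(df):
    """Split df into one-hot encoded predictors X and the target y (charges)."""
    y = df["charges"]
    X = pd.get_dummies(
        df.drop(columns="charges"),
        columns=["sex", "smoker", "region"],
        drop_first=True,  # drop one level per category to avoid perfect collinearity
        dtype=float,
    )
    return X, y

X, y = build_design_matrix(sample)
print(X.columns.tolist())
```

Dropping the first level of each category makes that level the baseline against which the remaining dummy coefficients are read.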
Source of Data:
As noted above, the dataset is available on Kaggle and was originally taken from the Machine
Learning with R dataset repository on GitHub. It contains 1338 observations (rows) and 7
variables (columns).
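A quick structural check after loading can confirm that the file matches this description. The sketch below assumes the file was saved locally as `insurance.csv` and that its columns carry the names listed in `EXPECTED_COLUMNS`. Both are assumptions, and `check_insurance_frame` is a hypothetical helper.

```python
import pandas as pd

# Column names as they are assumed to appear in the Kaggle file.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

def check_insurance_frame(df):
    """Return (n_rows, n_cols) after confirming the expected columns are present."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df.shape

# Typical use, assuming the file was saved locally as 'insurance.csv':
#   df = pd.read_csv("insurance.csv")
#   print(check_insurance_frame(df))  # the text above reports 1338 rows and 7 columns
```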
How the Data was Collected:
The data was collected through a survey of individuals who have health insurance (Choi, 2018).
The survey collected information about the medical insurance costs of individuals and their
demographic and lifestyle characteristics. The data was then compiled and made available for
analysis by researchers and data scientists.
Description of Dataset:
Methodology
This project employs a range of data analytical techniques, including Exploratory Data Analysis
(EDA), statistical analysis, clustering, and multiple linear regression, to answer research
questions and solve identified problems. EDA serves as the initial step in data analysis, enabling
the exploration of data and its characteristics using various techniques such as summary
statistics, histograms, scatter plots, box plots, and correlation matrices. This technique helps
identify any missing or invalid data, outliers, or other anomalies that need to be addressed before
proceeding with further analysis.
EDA is a crucial step in any data analysis project (Taloba et al., 2022) because it builds an
understanding of the data and its characteristics before any statistical analysis or modelling
techniques are applied.
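The checks described above (missing values, summary statistics, outliers) can be sketched for a single numeric column as follows. The 1.5 × IQR rule used to flag outliers is an assumption chosen for illustration, since the text does not say which outlier criterion was applied, and the demo values are made up.

```python
import pandas as pd

def eda_summary(df, column):
    """Summarise one numeric column: missing values, basic stats, IQR outliers."""
    s = df[column]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = s[(s < lower) | (s > upper)]
    return {
        "missing": int(s.isna().sum()),
        "mean": s.mean(),
        "std": s.std(),
        "n_outliers": int(outliers.count()),
    }

# Made-up values for illustration; 60000 is planted as an obvious outlier.
demo = pd.DataFrame({"charges": [1200.0, 2300.0, 3100.0, 4000.0, 5200.0, None, 60000.0]})
summary = eda_summary(demo, "charges")
print(summary)
```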
Correlation analysis is a common statistical method used to gauge the relationship between
variables (El-Sayed et al., 2021). It provides insight into how variables are related and helps to
identify the strongest predictors of a target variable. Hypothesis testing is another statistical
technique that tests the significance of the difference between two or more groups or variables
(Sushmita et al., 2015). The results of statistical analysis can help us to identify any trends or
patterns that were not apparent through EDA.
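To illustrate how correlation can rank candidate predictors, the sketch below computes Pearson's coefficient directly and sorts features by its absolute value. The data are synthetic and the function names are hypothetical. Note that Pearson correlation only captures linear association.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def rank_predictors(features, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: pearson_r(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Synthetic data: 'strong' drives the target, 'weak' is unrelated noise.
rng = np.random.default_rng(0)
strong = rng.normal(size=200)
weak = rng.normal(size=200)
target = 3.0 * strong + rng.normal(scale=0.5, size=200)
ranking = rank_predictors({"strong": strong, "weak": weak}, target)
print(ranking)
```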
Clustering is a technique used to group similar data points into clusters or groups (Yedla,
Pathakota, & Srinivasa, 2010). The main objective is to find homogeneous groups of individuals
with similar characteristics (Sushmita et al., 2015). Clustering techniques use various algorithms
to group data points into clusters based on the similarity of their attributes (Yedla et al., 2010).
K-means clustering is a commonly used algorithm that partitions data points into k clusters by
minimizing the within-cluster sum of squares (WCSS) (Uyanık & Güler, 2013). The WCSS is the sum
of the squared distances between each data point and the centroid of its cluster (Uyanık &
Güler, 2013). The optimal number
of clusters is where the rate of change of the WCSS decreases significantly, as indicated by the
elbow method (Uyanık & Güler, 2013). Clustering can help to identify any hidden patterns or
relationships between variables and to segment individuals into distinct groups (Sushmita et al.,
2015).
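The WCSS and the elbow method can be sketched with a small NumPy implementation of Lloyd's algorithm. This is an illustrative stand-in, not the project's code (the appendix uses scikit-learn's `KMeans`, which reports the same quantity as `inertia_`). The three-blob data are synthetic and all function names are hypothetical.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centroids; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Move each centroid to the mean of its points; keep it in place if it has none.
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(X, centroids)

def wcss(X, centroids, labels):
    """Within-cluster sum of squares: squared distance of each point to its centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

# Synthetic data: three blobs, so the curve should flatten near k = 3
# when the random initialisation is reasonable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
curve = {k: wcss(X, *kmeans(X, k)) for k in range(1, 7)}
for k, value in curve.items():
    print(k, round(value, 1))
```

A single random initialisation can settle in a poor local optimum, which is why library implementations usually repeat the fit from several starting points and keep the best result.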
Multiple linear regression is a statistical technique that models the connection between a
dependent variable and several independent variables (Taloba et al., 2022). It involves fitting a
linear equation to the data, where the dependent variable is a function of the independent
variables (Taloba et al., 2022). The coefficients in the equation represent the contribution of each
independent variable to the dependent variable (Taloba et al., 2022). Multiple linear regression
allows us to predict the value of the dependent variable for a given set of independent variables
(Taloba et al., 2022). It is a useful technique for analyzing complex relationships between
variables and for making predictions (El-Sayed, El-Bakry, & El-Sayed, 2021).
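In equation form the model is y ≈ b0 + b1·x1 + ... + bp·xp, where the coefficients are chosen to minimise the squared prediction error. A minimal sketch of that fit using NumPy's least-squares solver is given below. The data are synthetic with known coefficients (not the insurance data), and `fit_ols` and `predict` are hypothetical helper names.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y = b0 + X @ b; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])  # column of ones carries the intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(beta[0]), beta[1:]

def predict(X, intercept, coefs):
    """Predicted values for the rows of X under the fitted linear model."""
    return intercept + X @ coefs

# Synthetic data generated as y = 5 + 2*x1 - 1*x2 + small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
intercept, coefs = fit_ols(X, y)
print(intercept, coefs)
```

Because the generating coefficients are known here, the fitted values can be compared against them directly, which is not possible with real data.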
In summary, to answer the research questions and solve the problems posed in this project, we
will use a combination of data analytical techniques such as EDA, statistical analysis (Harrison
et al., 2020), clustering (Yedla et al., 2010), and multiple linear regression (Taloba et al., 2022).
Each of these techniques provides valuable insights into the data and helps to identify any
patterns or relationships between variables (Data et al., 2016). By using these techniques, we can
gain a better understanding of the data and make informed decisions based on the results.
The exploratory data analysis (EDA) revealed that the mean age of the individuals in the dataset
is 39 years with a standard deviation of 14 years. The mean BMI is 30.66 with a standard
deviation of 6.10. The average number of children is 1.09. The average medical cost billed by the
insurance company is $13,270 with a standard deviation of $12,110. The minimum medical cost
[...] that some individuals have much higher medical costs than others. The box plots for
charges by smoker status and region reveal that smokers, on average, have higher medical costs
than non-smokers, and individuals from the southeast region, on average, have higher medical
costs than individuals from other regions. [...] medical costs. This observation is further
supported by the multiple linear regression results, which reveal that age, BMI, and the number
of children are positively associated with medical costs. Smokers, on average, have
significantly higher medical costs than non-smokers. Moreover, individuals from the southeast
region have significantly higher medical costs than [...]
The t-test result shows that the difference in medical costs between smokers and non-smokers is
statistically significant:
Statistical Analysis:
---------------------
t-statistic: 46.66492117272371
p-value: 8.271435842179102e-283
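The appendix calls `stats.ttest_ind` with its default settings, which pool the variance of the two groups. As a hedged robustness check, Welch's variant drops the equal-variance assumption. The sketch below computes Welch's statistic and degrees of freedom by hand on made-up numbers. With SciPy the equivalent call is `stats.ttest_ind(a, b, equal_var=False)`. `welch_t` is a hypothetical helper name.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / len(a)  # squared standard error of mean(a)
    vb = b.var(ddof=1) / len(b)  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return float(t), float(dof)

# Made-up samples for illustration only.
t_stat, dof = welch_t([1, 2, 3, 4], [2, 3, 4, 5])
print(t_stat, dof)
```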
The K-means clustering algorithm was used to identify groups of individuals with similar
characteristics. The algorithm clustered the individuals into three groups based on age, BMI, and
the number of children. Group 0 consists of individuals who are generally younger and have
lower BMIs and fewer children. Group 1 consists of individuals who are older, have higher
BMIs, and more children. Group 2 consists of individuals who are in between the other two
groups.
Intercept: -11834.604223900491
R-squared: 0.7999608872686219
In conclusion, the results of the analysis address the research question, which was to identify
factors associated with medical costs. Age, BMI, number of children, smoker status, and region
are all associated with medical costs. The analysis also reveals that
smokers, on average, have significantly higher medical costs than non-smokers. Furthermore, the
K-means clustering algorithm identified three groups of individuals with similar characteristics
based on age, BMI, and the number of children.
The analysis of the medical insurance dataset has provided some valuable insights into the
factors that contribute to the medical insurance charges. The exploratory data analysis revealed
that the distribution of the charges is positively skewed, with a mean of $13,270 and a standard
deviation of $12,110. It was also observed that smokers have significantly higher charges than
non-smokers, and the charges vary across different regions. The correlation matrix showed that
age and smoking are highly correlated with charges, while children and BMI have a moderate
correlation.
Cluster analysis was used to group the observations into three clusters based on age, BMI, and
number of children. The clustering revealed that the three clusters have different age ranges and
BMI values, with cluster 0 having the highest average age and BMI values, while cluster 1 has
the lowest. The charges also vary across the clusters, with cluster 2 having the highest average
charges and cluster 1 having the lowest. These clusters can be used to segment the market and
develop targeted marketing strategies.
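Cluster comparisons of this kind can be produced by averaging each variable within a cluster. The sketch below assumes a `cluster` column holding the K-means labels, as created in the appendix. The rows are made up and `profile_clusters` is a hypothetical helper.

```python
import pandas as pd

def profile_clusters(df, cluster_col="cluster"):
    """Mean of the profiled columns, plus group size, for each cluster label."""
    cols = ["age", "bmi", "children", "charges"]
    profile = df.groupby(cluster_col)[cols].mean()
    profile["size"] = df.groupby(cluster_col).size()
    return profile

# Made-up rows for illustration only; these are NOT values from the real dataset.
demo = pd.DataFrame({
    "age": [20, 22, 60, 62],
    "bmi": [22.0, 24.0, 31.0, 33.0],
    "children": [0, 1, 2, 3],
    "charges": [2000.0, 3000.0, 14000.0, 16000.0],
    "cluster": [0, 0, 1, 1],
})
profile = profile_clusters(demo)
print(profile)
```

Charges were not used to form the clusters, so differences in mean charges between clusters describe the groups rather than validate them.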
Multiple linear regression was used to predict the charges based on age, BMI, number of
children, smoking status, and region. The model had an R-squared value of 0.80, which indicates
that 80% of the variation in the charges can be explained by the independent variables in the
model. The coefficients of the independent variables indicate that age, BMI, the number of
children, and smoking status all have a positive impact on the charges. The
coefficients of the dummy variables for the region indicate that the charges are highest in the
southeast region and lowest in the southwest region.
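The R-squared value quoted above is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that calculation, with made-up numbers and a hypothetical helper name:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

# Made-up observations and predictions for illustration only.
score = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(round(score, 2))  # 0.98
```

A model that always predicts the mean of the observations scores 0, and a perfect model scores 1.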
One limitation of this analysis is that it only considers a limited number of variables that may
impact the medical insurance charges. Other factors such as pre-existing medical conditions,
lifestyle habits, and occupation may also have an impact on the charges but were not included in
the analysis. Another limitation is that the analysis assumes a linear relationship between the
independent variables and the charges, which may not be accurate in reality. Non-linear
relationships and interactions between the variables may also impact the charges.
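To make the interaction limitation concrete, the sketch below compares an additive linear model with one that adds a BMI × smoker product term. The data are synthetic and constructed so that BMI matters only for smokers. The example shows the mechanism only and is not a finding about the insurance dataset. `ols_r2` is a hypothetical helper.

```python
import numpy as np

def ols_r2(X, y):
    """Fit ordinary least squares with an intercept and return the in-sample R-squared."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    centered = y - y.mean()
    return float(1.0 - (resid @ resid) / (centered @ centered))

rng = np.random.default_rng(2)
n = 500
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n).astype(float)
# Synthetic target: BMI only matters for smokers, an interaction by construction.
y = 1000 + 800 * smoker * (bmi - 30) + rng.normal(scale=500, size=n)

additive = np.column_stack([bmi, smoker])
with_interaction = np.column_stack([bmi, smoker, bmi * smoker])
r2_add = ols_r2(additive, y)
r2_int = ols_r2(with_interaction, y)
print(r2_add, r2_int)
```

Adding a term never lowers in-sample R-squared, so a comparison on real data should use held-out data or an adjusted measure.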
In conclusion, the analysis of the medical insurance dataset has provided valuable insights into
the factors that impact the charges. The analysis can be used by insurance companies to segment
the market, develop targeted marketing strategies, and set prices based on risk. However, the
analysis is limited by the number of variables considered and the assumption of a linear
relationship between the variables and the charges. Further research is needed to explore the
impact of other variables and the presence of non-linear relationships and interactions on the
charges.
Appendix:
Python code:
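# Imports needed by the snippets below
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.cluster import KMeans
# Load the data (assumes the Kaggle file was saved locally as 'insurance.csv')
df = pd.read_csv('insurance.csv')
# Distribution of charges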
sns.displot(df['charges'])
plt.title('Distribution of Charges')
plt.xlabel('Charges')
plt.ylabel('Count')
plt.show()
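# Correlation matrix (numeric columns only; string columns such as 'sex' cannot be correlated)
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')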
plt.title('Correlation Matrix')
plt.show()
# Statistical Analysis
# t-test for smoker and non-smoker charges
smoker_charges = df[df['smoker'] == 'yes']['charges']
non_smoker_charges = df[df['smoker'] == 'no']['charges']
t, p = stats.ttest_ind(smoker_charges, non_smoker_charges)
print('\nStatistical Analysis:')
print('---------------------')
print('t-statistic:', t)
print('p-value:', p)
# Clustering
# Data preparation
X = df[['age', 'bmi', 'children']]
# KMeans algorithm
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Add cluster labels to dataframe
df['cluster'] = kmeans.labels_
# Plotting clusters
sns.scatterplot(data=df, x='age', y='bmi', hue='cluster', palette='Set1')
plt.title('Clustering')
plt.xlabel('Age')
plt.ylabel('BMI')
plt.show()
References:
1. Medical Cost Personal Datasets | Kaggle Choi, M.I.R.I. (2018). Medical cost personal
A., El-Bakry, H., & El-Sayed, M. (2021). A computational intelligence approach for
1162553. https://doi.org/10.1155/2021/1162553
3. Pronk, N. P., Goodman, M. J., O'Connor, P. J., & Martinson, B. C. (1999). Relationship
between modifiable health risks and short-term health care charges. JAMA, 282(23), 2235-2239.
4. Leutzinger, J. A., Ozminkowski, R. J., Dunn, R. L., Goetzel, R. Z., Richling, D. E.,
Stewart, M., & Whitmer, R. W. (2000). Projecting future medical care costs using four
6. Taloba, A. I., El-Aziz, A., Rasha, M., Alshanbari, H. M., & El-Bagoury, A. A. H. (2022).
Estimation and prediction of hospitalization and medical care costs using regression in
7. Data, M. C., Komorowski, M., Marshall, D. C., Salciccioli, J. D., & Crutain, Y. (2016).
8. Harrison, A. J., McErlain-Naylor, S. A., Bradshaw, E. J., Dai, B., Nunome, H., Hughes,
G. T., ... & Fong, D. T. (2020). Recommendations for statistical analysis involving null
9. Yedla, M., Pathakota, S. R., & Srinivasa, T. M. (2010). Enhancing K-means clustering
10. Uyanık, G. K., & Güler, N. (2013). A study on multiple linear regression