
Understanding the Data

Objective:
The goal is to predict the variable ‘charges’ from the following predictors: age, sex, BMI, children,
smoker and region. Age and BMI are continuous variables; sex, smoker and region are categorical
variables.
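
Since a regression model works on numbers, the categorical predictors need to be encoded before fitting. The following is a minimal sketch using only the Python standard library. The level names (for example `yes`/`no` for smoker), the lowercase column names, the helper `encode_row` and the sample record are all assumptions made for illustration and should be checked against the actual data.

```python
# Hypothetical category levels; verify them against the real data set.
LEVELS = {
    "sex": ["female", "male"],
    "smoker": ["no", "yes"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
}
# Predictors treated as plain numbers (assumed column names).
NUMERIC = ["age", "bmi", "children"]


def encode_row(row):
    """Turn one record (a dict of strings) into a list of numeric features.

    Numeric predictors are cast to float. Each categorical predictor is
    dummy-coded: the first level is the baseline and gets no column, and
    every other level gets a 0/1 indicator.
    """
    features = [float(row[name]) for name in NUMERIC]
    for name, levels in LEVELS.items():
        if row[name] not in levels:
            raise ValueError(f"unexpected level {row[name]!r} for {name}")
        features.extend(1.0 if row[name] == level else 0.0 for level in levels[1:])
    return features


# Made-up example record, for illustration only.
example = {"age": "40", "sex": "male", "bmi": "27.5", "children": "2",
           "smoker": "no", "region": "southeast"}
print(encode_row(example))
```

Dropping one level per categorical variable avoids redundant columns when the model also includes an intercept.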

Information about the data:


As we can see, we are working with a data set of only 1338 observations and 7 variables. The variable of
most interest here is charges, which is what we will try to predict.

The data set revolves around “Medical cost for Insurance”, and from it we are required to predict the
cost of premium/charges.

As background: to make a profit, an insurance company must collect more in premiums than it pays out
to insured persons. For this reason, insurance companies invest a lot of time, effort, and money in
creating models that accurately predict health care costs. In this kernel, we will try to build as
accurate a model as possible while keeping everything simple.
Regression analysis fits a predictive model to the data, which can then be used to predict an outcome
variable from one or more independent predictor variables. In simple regression the outcome variable is
predicted from a single predictor variable, while in multiple regression it is predicted from several
predictor variables.
In its linear form, the model summarizes the data with a straight line, and the method of least squares is
used to find the line of best fit, that is, the line that minimizes the sum of squared differences between the
observed and the predicted values.
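
To make the least-squares idea concrete, here is a minimal sketch of simple regression with one predictor, written with the Python standard library only. The function name `fit_line` and the numbers in the example are made up for illustration; they are not values from the insurance data, and this is not necessarily how the model in this kernel is fitted.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up numbers standing in for one predictor and the outcome.
ages = [20, 30, 40, 50]
charges = [2500.0, 4500.0, 6500.0, 8500.0]
intercept, slope = fit_line(ages, charges)
print(intercept, slope)
```

With several predictors the same criterion applies, but the coefficients are usually obtained with a linear algebra routine from a statistics library rather than by hand.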
Here we explore a data set on the cost of treatment for different patients. The cost of treatment
depends on many factors: diagnosis, type of clinic, city of residence, age and so on. We have no data on
the patients' diagnoses, but we have other information that can help us draw conclusions about the
patients' health and practice regression analysis.

Source of data:
Our main source of data is Kaggle, a subsidiary of Google LLC and an online community of data
scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and
build models in a web-based data-science environment, work with other data scientists and machine learning
engineers, and enter competitions to solve data science challenges.
