
Q2

Before we begin analyzing the data and creating our models, we will import the necessary
libraries. Tidyverse is a library bundle; we will use its ggplot2 and dplyr functions often.
Hmisc is a mathematical library used in our model design, WVPlots provides more advanced
graphs, e1071 supplies our statistical and probabilistic algorithms, and caret is for
classification and regression training.
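A minimal sketch of the corresponding imports (assuming the packages are already installed; the exact order in the original script may differ):

    # Load the libraries described above
    library(tidyverse)  # bundle that includes ggplot2 and dplyr
    library(Hmisc)      # descriptive / mathematical helpers for model design
    library(WVPlots)    # more advanced graphs such as gain curves
    library(e1071)      # statistical and probabilistic algorithms
    library(caret)      # classification and regression training utilities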

We will use the str() and summary() functions to view and understand the data. The data
includes 1338 observations and 7 variables.
We will begin cleaning the data by checking for missing values: combining the is.na() and
colSums() functions lists the number of empty values per column. We observe that there are
5 missing values in the ‘bmi’ column and 9 in ‘children’. Using the complete.cases()
function, we can see more precisely which rows contain the missing values.
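A minimal sketch of this inspection step (the file name "insurance.csv" and the data-frame name insurance are assumptions, not taken from the original script):

    # Read the data (file and object names are assumed) and inspect it
    insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
    str(insurance)        # 1338 observations of 7 variables
    summary(insurance)

    # Count the missing values in each column
    colSums(is.na(insurance))

    # Show the rows that contain at least one missing value
    insurance[!complete.cases(insurance), ]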

Regarding the missing values in the ‘bmi’ column, instead of removing the affected rows, we
will aim for a clean dataset by imputing the missing values with the column average. In the
first part of the code, where we create list_na and average_missing, we compute the average
and store it in a variable. In the second part, we use this average to replace the missing BMI
values. However, we will only manipulate ‘bmi’, as the values in ‘children’ are strictly
discrete counts (0, 1, 2, 3).
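A rough sketch of the imputation, using the list_na and average_missing names mentioned above (the exact code in the original may differ):

    # First part: find the columns with missing values and store their means
    list_na <- colnames(insurance)[colSums(is.na(insurance)) > 0]
    average_missing <- sapply(insurance[list_na], mean, na.rm = TRUE)

    # Second part: replace only the missing 'bmi' values with the stored mean
    insurance <- insurance %>%
      mutate(bmi = ifelse(is.na(bmi), average_missing["bmi"], bmi))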

For the missing values in ‘children’, we will simply remove the affected rows using the
na.omit() function.
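For example:

    # Drop the remaining rows with missing values (now only in 'children')
    insurance <- na.omit(insurance)

    # Confirm that no empty values remain in any column
    colSums(is.na(insurance))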

The results show 0 across the board, meaning that there are no more empty values in the
dataset.

Missing values are considered attribute noise. (Sáez, 2021)

Using sum() and duplicated(), we can discover any duplicate rows. We observe that rows
196 and 582 are duplicates. Using the inverse of the duplicated() function, we keep only the
unique rows. The resulting count of 0 shows that there are no longer any duplicated rows.
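A short sketch of the duplicate check and removal:

    # Count and inspect the duplicated rows
    sum(duplicated(insurance))
    insurance[duplicated(insurance), ]

    # Keep only the rows that are not duplicates
    insurance <- insurance[!duplicated(insurance), ]
    sum(duplicated(insurance))   # now prints 0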

Duplicate rows are considered class noise. (Sáez, 2021)


Lastly, to be safe, we will search for any inconsistencies in the columns ‘sex’, ‘smoker’ and
‘region’. The results come up clean: every value within the respective columns is
consistent.
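One way to run this check is to list the distinct values of each categorical column, for example:

    # Any typo or inconsistent label would show up as an extra level
    unique(insurance$sex)
    unique(insurance$smoker)
    unique(insurance$region)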

Our first part of pre-processing has been completed. We will now move on to resolving the
outliers.

Inconsistencies are considered attribute noise. (Sáez, 2021)

Q1a

We will start by checking for outliers in the ‘premium’ attribute. The outlier count is 139,
which is considered small. Using the filter() function, we can isolate and display the outliers,
then remove them, which reduces the size of the dataset from 1328 to 1189 rows. The 7
variables remain untouched, as expected.
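A rough sketch of this step, assuming the usual 1.5 * IQR whisker rule used by boxplot.stats() (the exact filtering rule in the original code may differ):

    # Identify the 'premium' values that fall outside the boxplot whiskers
    premium_out <- boxplot.stats(insurance$premium)$out
    length(premium_out)                       # 139 outliers, as reported above

    # Display the outlying rows, then drop them with dplyr::filter()
    insurance %>% filter(premium %in% premium_out)
    insurance <- insurance %>% filter(!premium %in% premium_out)
    dim(insurance)                            # 1189 rows, 7 variables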

We can observe the change from the removal of the points that lie apart from the main
cluster, indicated by the red circle added to the plot.

These outliers are considered “point outliers” as they lie far from the rest of the
distribution, that is, the main cluster. (Santoyo, 2017)
Before removing outliers (premium)

After removing outliers (premium)


We will repeat the process with the ‘bmi’ column. Again, there are 13 point outliers which
we will remove, reducing the size of the dataset to 1176.

Before removing outliers (bmi)


After removing outliers (bmi)

We will then check for outliers in ‘age’ and ‘children’ and see that there is nothing to
remove, as indicated by the absence of points outside the whiskers.

The size of the dataset has been reduced from 1328 to 1176, which should provide higher
classification accuracy and a slightly lower model-building time. (Sáez, 2021)

In addition, comparing the summary at the start with the summary of the cleaned data shows
that the dataset has retained its core information, indicating that the pre-processing did not
corrupt it.

Q1b
Next, we will perform an Exploratory Data Analysis by exploring the data distribution to
determine the relationships between attributes and hence identify the dependent and
independent attributes.

Histogram for age


Bar chart for sex, separated by gender
Histogram for BMI
Bar chart for children
Bar chart for smoker
Bar chart for region
Histogram for premium
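A sketch of how these charts could be produced with ggplot2 (the bin widths are illustrative choices, not taken from the original code):

    ggplot(insurance, aes(x = age)) + geom_histogram(binwidth = 5)
    ggplot(insurance, aes(x = sex, fill = sex)) + geom_bar()
    ggplot(insurance, aes(x = bmi)) + geom_histogram(binwidth = 2)
    ggplot(insurance, aes(x = factor(children))) + geom_bar()
    ggplot(insurance, aes(x = smoker)) + geom_bar()
    ggplot(insurance, aes(x = region)) + geom_bar()
    ggplot(insurance, aes(x = premium)) + geom_histogram(binwidth = 2000)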

From an observation of the graphs, we can see that the ‘sex’ and ‘region’ attributes have
close to equal distributions. Furthermore, we can conclude that the number of people who
smoke is significantly lower than the number who refrain from smoking. Additionally, we
observe that the majority of people have charges of less than $20,000.

Our model's dependent variable is ‘premium’, which measures the medical costs an individual
has to pay for their insurance plan per year. It is also our target variable, as it is what we
are measuring and predicting.
The correlations between the attributes in the matrix are not considered strong. However,
there are some visible relationships, such as that between ‘age’ and ‘bmi’. Although the
correlation is weakly positive, it is still meaningful: as a person grows older, their body mass
tends to increase. Other relationships are those between ‘age’ and ‘premium’, ‘bmi’ and
‘premium’, and ‘children’ and ‘premium’. The association here is that as age, body mass and
number of children increase, the expected charge for the insurance premium rises.

However, among all the numerical attributes, ‘age’ has the highest correlation with
‘premium’.
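A minimal sketch of the correlation matrix behind these observations:

    # Correlations between the numerical attributes
    numeric_cols <- insurance[, c("age", "bmi", "children", "premium")]
    cor(numeric_cols)

    # Optional pairwise scatter plots of the same attributes
    pairs(numeric_cols)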

To tease out clearer relationships, especially those involving our target (dependent) variable,
we will use boxplots to compare ‘premium’ against the other categorical attributes.
Boxplot for smoker against premium
Boxplot for sex against premium
Boxplot for region against premium
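A sketch of how these boxplots could be drawn with ggplot2:

    ggplot(insurance, aes(x = smoker, y = premium)) + geom_boxplot()
    ggplot(insurance, aes(x = sex, y = premium)) + geom_boxplot()
    ggplot(insurance, aes(x = region, y = premium)) + geom_boxplot()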

Observing the boxplots of ‘premium’ for each categorical attribute, we can conclude that
‘smoker’ has a stronger association with ‘premium’ than ‘sex’ and ‘region’ do.
Additionally, using the Chi-Square Test of Independence, we see that the p-values are higher
than 0.05; failing to reject the null hypothesis, we conclude that the attributes smoker, region
and sex are independent of one another.
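A sketch of the Chi-Square Tests of Independence (the exact pairs tested in the original are assumed here):

    chisq.test(table(insurance$smoker, insurance$sex))
    chisq.test(table(insurance$smoker, insurance$region))
    chisq.test(table(insurance$sex, insurance$region))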

In conclusion, the independent variables are age, sex, bmi, children, smoker and region, while
the dependent variable is premium.

Among the numerical variables, ‘age’ influences the ‘premium’ attribute the most, while the
categorical variable ‘smoker’ comes in second place.
Q3

We will now prepare the dataset for modelling by splitting it into training and testing sets.
We took 20 percent of the data for testing and the other 80 percent for training the model.
After splitting, we created a formula that uses all the other variables to predict the label
variable. The predictive algorithm used in this model is Logistic Regression, fitted with the
glm() function. Logistic Regression is a classification algorithm, typically applied to
categorical outcomes. (Brownlee, 2019)

The formula takes all the variables into account as predictors.
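A rough sketch of the split and the model fit, assuming caret's createDataPartition() for the split and an illustrative seed:

    # 80/20 train/test split (the seed value is illustrative)
    set.seed(123)
    train_index <- createDataPartition(insurance$premium, p = 0.8, list = FALSE)
    train_set <- insurance[train_index, ]
    test_set  <- insurance[-train_index, ]

    # Formula comparing every other variable with the label variable 'premium'
    model_formula <- premium ~ age + sex + bmi + children + smoker + region

    # Fit the model with glm()
    glm_model <- glm(model_formula, data = train_set)
    summary(glm_model)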

The summary() output will help us evaluate the model's performance. Firstly, by observing
the residuals, we can see statistics for the errors in our predictions. A maximum error of
24029.6 means that our model under-predicted expenses by nearly $25,000 for at least one
observation. However, the majority of predictions were between $1961.5 over the true value
and $568 under the true value, which, taking into account the volatile nature of the medical
field, is an acceptable result. Secondly, the residual deviance, which here plays the role of
the residual standard error, is low, which means there are only small differences between
actual and predicted values for this model. (DataSorcerer, 2018)

Lastly, we will apply the model to the new client, Smith. Smith's predicted health care
premium charge is $16759.65. The displayed results indicate that the model is functioning as
expected, with no errors present.
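A minimal sketch of this prediction step; the attribute values for Smith below are placeholders, as the real values come from the assignment brief:

    # Hypothetical profile for the new client (placeholder values)
    smith <- data.frame(age = 40, sex = "male", bmi = 30,
                        children = 2, smoker = "no", region = "southeast")

    # Predicted yearly premium charge for Smith
    predict(glm_model, newdata = smith)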
Q4

We will now design a Linear Regression model, which is a regression algorithm, for this
section. (Brownlee, 2019) Linear Regression is chosen because we have already determined
that relationships exist between the variables and now want to observe how strong the
relationship is between ‘premium’ and ‘age’. Additionally, we want to estimate the value of
the dependent variable at given values of the independent variables. It is well suited to
prediction, as in our case we want to predict the ‘premium’ value based on the values of the
other variables. (Lund & Lund, 2018)
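A minimal sketch of the Linear Regression fit, reusing the same formula and training data as above:

    # Fit the Linear Regression model with lm()
    lm_model <- lm(model_formula, data = train_set)
    summary(lm_model)   # residual standard error and adjusted R-squared

    # Premium prediction for the same new client
    predict(lm_model, newdata = smith)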

Performance indicators such as the residual standard error show that there is only a small
difference between actual and predicted values. The adjusted R-squared of 57.5% means that
the model explains nearly 57.5% of the variation in the dependent variable. A very low
R-squared value would be concerning, but this one is still reasonably good.
In applying our model to the new client, Smith, we observe that there are no errors and that
Smith's predicted premium charge is $16759.65.
Q5

To evaluate the performance of both models, we will first save the R-squared value, even
though we might not necessarily use it for the Logistic Regression. The second step is to
predict on the test set, then calculate the residuals and, lastly, the Root Mean Squared
Error.

We will perform the same operations for the Linear Regression model.

Then we will proceed to calculate the RMSE for both models. The RMSE is the standard
deviation of the residuals, that is, of the prediction errors. The performance of the two
models appears equal. The value of 3706.93 is lower than most, which indicates that both
models are suitable for this analysis.
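A minimal sketch of this evaluation, assuming a simple hand-rolled RMSE helper rather than any particular package function:

    # Predict on the held-out test set
    glm_pred <- predict(glm_model, newdata = test_set)
    lm_pred  <- predict(lm_model,  newdata = test_set)

    # Root Mean Squared Error of the residuals for each model
    rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
    rmse(test_set$premium, glm_pred)
    rmse(test_set$premium, lm_pred)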

Lastly, we will use a graph that plots the cumulative gain curve of a sort order. That is, by
targeting a percentage of the total number of cases, it displays the percentage of the overall
number of cases in a given category that is “gained”, i.e. the accuracy of the prediction.
(IBM, 2016)

We can also observe that the errors in the model are close to zero, indicating that the model
predicts well. Just as with an ROC curve, being above the diagonal reference line means
more accurate predicted values.
Gain Curve Plot to indicate accuracy
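A sketch of how such a plot can be produced with WVPlots::GainCurvePlot() (the exact plot call in the original is not shown):

    # Cumulative gain curve: how well the predictions sort the test cases
    test_set$pred <- predict(lm_model, newdata = test_set)
    GainCurvePlot(test_set, "pred", "premium", "Gain curve for the premium model")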
