
APPLICATION OF MIXTURE DISTRIBUTION IN

AUTOMOBILE INSURANCE

PROJECT for STATISTICAL COMPUTING – I


SEMESTER – I

MASTER OF SCIENCE
IN
APPLIED STATISTICS & ANALYTICS

BY

1. NEHA DOSHI (A013)


2. RUCHIRA SAKALLE (A031)
3. PRANAV BAKSHI (A004)
4. JYOTIKA VASAMSETTY (A040)
5. AAMINA SHEIKH (A033)

UNDER THE SUPERVISION OF

Prof. PRASHANT DHAMALE

SUNANDAN DIVATIA SCHOOL OF SCIENCE


SVKM’S Narsee Monjee Institute of Management Studies
(Deemed-To-Be University)
V. L. Mehta Road, Vile-Parle (West), Mumbai - 400056.

October 2019
Abstract

In this paper, we model the insurance claims of an automobile insurance company. We use
mixture models since the claims come from heterogeneous sources; mixture distributions are a
useful way to describe how subsets of a variable can be distributed differently. The K components
of the mixture are assumed to be from the same parametric family, and K-means clustering is used
for initialization. We then make use of the expectation-maximization (EM) algorithm to estimate
the parameters, namely the means, standard deviations and mixing probabilities. Using the
estimated posterior probabilities, we fit our model and obtain the mixing proportions of the fitted
components. We use the Bayesian Information Criterion (BIC) to validate our assumption about
the number of clusters. Finally, we compare the results obtained from the EM algorithm and from
K-means clustering and comment on the performance of the two.
Introduction

Automobile insurance is a policy that allows vehicle owners to reduce the costs
associated with an auto accident. Instead of paying out of pocket at the time of an accident,
policyholders pay annual premiums to the insurance company; the company then pays all or
most of the costs associated with the accident or other damage caused to the vehicle.

Auto insurance premiums vary depending on age, gender, years of driving experience, accident
and moving violation history, and other factors. A poor driving record or the desire for complete
coverage will lead to higher premiums.

In exchange for paying a premium, the insurance company agrees to pay your losses as
outlined in your policy.

Coverage includes:

i. Property – damage to or theft of your car

ii. Liability – legal responsibility to others for bodily injury or property damage

iii. Medical – costs of treating injuries, rehabilitation and sometimes lost wages and funeral
expenses.

What is an insurance claim?

Insurance claims protect people from financial ruin after an accident or disaster. At its most
simple, the definition of an insurance claim is "a formal request for money after a major loss."

Types of Auto Accident Claims:

• Loss of value to vehicle
• Personal injury
• Injury by uninsured or underinsured defendant
• Coverage for rental car during repair period
• Reimbursement for property repairs
In this paper, we model insurance claims of an Automobile Insurance company and
estimate the parameters using expectation-maximization algorithm on the data set.

The data set of the insurance company seems to follow a mixture distribution.

Objectives

1. To implement the Expectation-Maximization (EM) algorithm to fit a finite mixture
distribution to the Injury claims of automobile insurance.

2. To understand the technique of the EM algorithm and make a comparative study between
the EM algorithm and clustering.

3. To apply the results obtained on other features and be able to share some business insights
on the same.
Methodology

Following are the concepts and techniques applied in our project:

Mixture Distribution:

A mixture distribution is the probability distribution of a random variable that is derived
from a collection of other random variables: first, a random variable is selected by chance from the
collection according to given probabilities of selection, and then the value of the selected random
variable is realized.

In this project, we model the data in terms of a mixture of several components, where
each component has a simple parametric form (such as a Gaussian). In other words, we assume
each data point belongs to one of the components, and we try to infer
the distribution for each component separately.

In order to represent this mathematically, we formulate the model in terms of latent


variables, usually denoted z. These are variables which are never observed, and where we don't
know the correct values in advance. They are roughly analogous to hidden units, in that the learning
algorithm needs to figure out what they should represent, without a human specifying it by hand.
Variables which are always observed, or even sometimes observed, are referred to as observables.

In general, a mixture model assumes the data are generated by the following process:

First we sample z, and then we sample the observables x from a distribution which depends on z,
i.e.

$$p(z, x) = p(z)\, p(x \mid z)$$

In mixture models, p(z) is always a multinomial distribution. 𝑝(𝑥 | 𝑧) can take a variety
of parametric forms, but we'll assume it's a Gaussian distribution. We refer to such a model as a
mixture of Gaussians. In general,

$$z \sim \mathrm{Multinomial}(\pi)$$

$$x \mid z = k \sim \mathcal{N}(\mu_k, \sigma_k)$$


Here, 𝜋 is a vector of probabilities (i.e. nonnegative values which sum to 1) known as the mixing
proportions.
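To make this generative process concrete, it can be simulated in R as follows. This is only an
illustrative sketch; the parameter values below are made up and are not estimates from our data.

set.seed(42)
n <- 1000
prop  <- c(0.2, 0.4, 0.4)        # mixing proportions (must sum to 1)
mu    <- c(500, 6000, 12500)     # component means
sigma <- c(400, 1500, 2900)      # component standard deviations

z <- sample(1:3, size = n, replace = TRUE, prob = prop)   # z ~ Multinomial(prop)
x <- rnorm(n, mean = mu[z], sd = sigma[z])                # x | z = k ~ N(mu_k, sigma_k)
plot(density(x), main = "Simulated 3-component Gaussian mixture")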

Fitting a Gaussian Mixture Model (GMM) using Expectation-Maximization:

EM-Algorithm:

The Expectation-Maximization (EM) algorithm is a way to find maximum-likelihood


estimates for model parameters when your data is incomplete, has missing data points, or has
unobserved (hidden) latent variables. It is an iterative way to approximate the maximum likelihood
function. While maximum likelihood estimation can find the “best fit” model for a set of data, it
doesn’t work particularly well for incomplete data sets. The more complex EM algorithm can find
model parameters even if you have missing data. It works by choosing random values for the
missing data points, and using those guesses to estimate a second set of data. The new values are
used to create a better guess for the first set, and the process continues until the algorithm
converges on a fixed point.

Now, during the analysis we observed that our data related to injury claims follows
multiple normal distributions, so we try to fit a mixture distribution to the data.

Since we are trying to fit more than one distribution (say, three), we will have to estimate
three sets of parameters: a mean, a standard deviation and a mixing probability for each
component. We estimate these parameters using the EM algorithm.

The EM algorithm consists of 3 major steps:

• Initialization
• Expectation (E-step)
• Maximization (M-step)

We initialize the values of the parameters based on our prior knowledge or we can use other
techniques. In our project we use K-means clustering for initialization.

Initialization: K-Means Clustering:

This algorithm categorizes the items into k groups of similarity. To calculate that similarity, we
will use the Euclidean distance as measurement.
The algorithm works as follows:

1. First we initialize k points, called means, randomly.

2. We categorize each item to its closest mean and we update that mean's coordinates, which
are the averages of the items categorized to that mean so far.
3. We repeat the process for a given number of iterations and at the end, we have our clusters.
The "points" mentioned above are called means, because they hold the mean values of the items
categorized to them. To initialize these means, we have several options. An intuitive method is to
initialize the means at random items in the data set. Another method is to initialize the means at
random values between the boundaries of the data set (if for a feature x the items have values in
[0, 3], we will initialize the means with values for x in [0, 3]).

Based on our assumption, we get three clusters with their respective parameter values
calculated as:

µr : Mean of rth cluster

σr : Standard deviation of rth cluster

πr : Probability of an observation falling into rth cluster

where r = 1,2,3
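A minimal sketch of this initialization, assuming the injury claims are stored in a vector x (the
full version with per-cluster variables appears in the Appendix):

mem   <- kmeans(x, centers = 3)$cluster        # hard cluster labels 1, 2, 3 for each claim
mu    <- tapply(x, mem, mean)                  # cluster means -> initial mu_r
sigma <- tapply(x, mem, sd)                    # cluster standard deviations -> initial sigma_r
prop  <- as.numeric(table(mem)) / length(mem)  # cluster shares -> initial pi_r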

Expectation (E-step):

Now that we have the initial parameters of our GMM, we have to determine the probability
that a data point $x_j$ belongs to component r. This is the expectation step (E-step), where we
calculate the "expected values" of the posterior probabilities for each data point.

Mathematically this can be determined as follows:

The complete data log likelihood can be expressed as:


$$l(x_j, u_j; \mu, \sigma, \pi) = \sum_{r=1}^{3} I(u_j = r)\,[\log \pi_r + \log f_r(x_j; \mu_r, \sigma_r)]$$

where $I(u_j = r)$ indicates to which subpopulation each $x_j$ belongs.


In this step, the conditional probability of the subpopulation to which each $x_j$ belongs must be
computed first:

$$\Pr(u_j = r \mid x_j; \mu, \sigma, \pi) = \frac{\pi_r f_r(x_j; \mu_r, \sigma_r)}{\sum_{r'=1}^{3} \pi_{r'} f_{r'}(x_j; \mu_{r'}, \sigma_{r'})}$$

Then the expected log likelihood based on a random sample can be expressed as:
$$Q = \sum_{j=1}^{n} \sum_{r=1}^{3} \Pr(u_j = r \mid x_j; \mu, \sigma, \pi)\,[\log \pi_r + \log f_r(x_j; \mu_r, \sigma_r)]$$

Maximization: Re-estimate the Component Parameters (M-step)


Now that we have posterior probabilities, we can re-estimate our component parameters.
In the M-step, the maximizing values for $\mu_r$, $\sigma_r$ and $\pi_r$ are calculated:

$$\pi_r' = \frac{\sum_{j=1}^{n} \Pr(u_j = r \mid x_j; \mu, \sigma, \pi)}{n}$$

$$\mu_r' = \frac{\sum_{j=1}^{n} x_j \Pr(u_j = r \mid x_j; \mu, \sigma, \pi)}{\sum_{j=1}^{n} \Pr(u_j = r \mid x_j; \mu, \sigma, \pi)}$$

$$\sigma_r' = \sqrt{\frac{\sum_{j=1}^{n} (x_j - \mu_r')^2 \Pr(u_j = r \mid x_j; \mu, \sigma, \pi)}{\sum_{j=1}^{n} \Pr(u_j = r \mid x_j; \mu, \sigma, \pi)}}$$

Summarizing the steps, we can state that:

After initialization, we perform the E-step, where we calculate the expected log likelihood
(the Q function), and then we estimate the parameters that maximize the Q function in the
M-step.

We iterate between the E-step and the M-step until the likelihood converges, i.e. until the
increase Q[k] - Q[k-1] falls below a small tolerance for some k. Once the likelihood converges,
we obtain the optimal parameter values for the groups created, along with their mixing
proportions.
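For concreteness, a single E/M iteration can be coded in R as below. This is only a sketch,
reusing the vectors prop, mu and sigma from the initialization sketch above; the full loop with
its convergence check appears in the Appendix.

# E-step: weighted component densities and posterior probabilities (an n x 3 matrix)
dens <- sapply(1:3, function(r) prop[r] * dnorm(x, mu[r], sigma[r]))
post <- dens / rowSums(dens)
Q_new <- sum(log(rowSums(dens)))   # observed-data log likelihood, used for the convergence check

# M-step: weighted re-estimates of the component parameters
prop  <- colSums(post) / length(x)
mu    <- colSums(post * x) / colSums(post)
sigma <- sqrt(colSums(post * outer(x, mu, "-")^2) / colSums(post))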

Classification of original variables into the new groups created:


After getting the mixing proportions and the parameter estimates, we obtain the weighted
component density (posterior probability) in each group for every value of the original variable.
Hence, if we create 3 groups, we have 3 probabilities for each value of the original variable x, and
the group corresponding to the maximum probability is the group to which x belongs.
In this way, each observation has three probabilities, and it is allocated to the group with the
maximum probability, as sketched below.
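In R, this allocation reduces to a row-wise argmax over the posterior matrix; a sketch using
post from the E-step code above:

group <- apply(post, 1, which.max)   # component with the highest posterior for each observation
table(group)                         # size of each created group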

Comparison of groups created by EM algorithm with respect to those created by clustering:


We were interested in knowing which of the two groupings, the one created using the EM
algorithm or the one created using k-means clustering, fits the model better, i.e. which model
separates the groups more evidently. This can be judged by comparing the between-groups sum
of squares of the two models: the model with the higher between-groups sum of squares is the
better one. The kmeans() function in R reports this quantity directly for the k-means model. For
the groups created using EM, an ANOVA model can be fit by treating the groups as treatments
and the original variable as the observations; the between-groups sum of squares can then be
read from the summary of the ANOVA model, as sketched below.
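A sketch of this comparison, assuming group holds the EM allocation from the previous sketch:

model_c <- kmeans(x, centers = 3)
model_c$betweenss                    # between-groups sum of squares for k-means

model_e <- aov(x ~ factor(group))    # EM groups as treatments, claims as observations
summary(model_e)                     # the treatment Sum Sq is the between-groups SS for EM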

Mclust() function in R – an alternative approach for fitting mixture distribution:


We validated our manually created model and its assumptions using the Mclust() function from
the mclust package in R. Mclust() implements model-based clustering based on parameterized
finite Gaussian mixture models: the models are estimated by the EM algorithm, initialized by
hierarchical model-based agglomerative clustering, and the optimal model is then selected
according to BIC.
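A minimal sketch of this approach, again assuming the claims are in a vector x (the call we
actually used appears in the Appendix):

library(mclust)
gmm <- Mclust(x, G = 1:9)    # fit mixtures with 1 to 9 components; best model chosen by BIC
summary(gmm)                 # selected model, number of components and mixing proportions
plot(gmm, what = "BIC")      # BIC against the number of components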
Description of Data and variables

The dataset for our project is an Automobile Insurance data from All State Company.
This data has 1000 data points and 20 variables, the description of which is provided below:
Numeric Variables              Categorical Variables
months_as_customer             policy_state
age                            policy_csl
policy_number                  policy_deductable
policy_annual_premium          insured_sex
number_of_vehicles_involved    insured_education_level
total_claim_amount             insured_occupation
injury_claim                   insured_hobbies
property_claim                 insured_relationship
vehicle_claim                  incident_type
                               incident_severity
                               auto_make

Exploratory data analysis:


Mentioned below are some of the characteristics and descriptive statistics of a few variables in
the data.
Summary statistics of all the numerical variables:

                          Min       Q1    Median     Mean       Q3       Max
months_as_customer          0    115.8     199.5    204.0    276.2     479.0
age                      19.0     32.0      38.0    38.95     44.0      64.0
policy_number          100804   335980    533135   546239   759100    999435
policy_deductable         500      500      1000     1136     2000      2000
policy_annual_premium   433.3   1089.6    1257.2   1256.4   1415.7    2047.6
number_of_vehicles        1.0      1.0       1.0    1.839      3.0       4.0
total_claim_amount        100    41813     58055    52762    70593    114920
injury_claim                0     4295      6775     7433    11305     21450
property_claim              0     4445      6750     7400    10885     23670
vehicle_claim              70    30293     42100    37929    50823     79560
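These statistics can be reproduced with a one-line call of the following form, assuming the data
frame is named Auto as in the Appendix:

summary(Auto[, sapply(Auto, is.numeric)])   # six-number summaries of all numeric columns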
Visualizations of some of the categorical variables:

[Figure: State-wise distribution of the policyholders]

[Figure: Occupation-wise distribution of gender]

[Figure: Distribution of incident type among the policyholders]


Analysis:
To fit a mixture distribution to the variable "Injury Claim", we first look at the density
plot of this variable.

[Figure: Density of Injury claims (kernel density estimate; N = 1000, bandwidth = 1103)]

From the density plot, it can be seen that the density of Injury claims does not have a
bell-shaped curve. Hence, there is a strong indication that the observations do not follow a
normal distribution.
Normality checks:
We have used 2 methods for testing normality of the claims:
1. The Shapiro-Wilk W test (under H0, the data follow a normal distribution)
2. QQ-plots
From the Shapiro-Wilk test, we got a p-value of 4.275e-14, which is less than 0.05.
Hence, we reject H0 at the 5% level of significance and conclude that the Injury claims do not
follow a normal distribution.
Output from the QQ-plot:

[Figure: QQ-plot of the Injury claims, plotting the sorted claim values (0 to 20000) against the
index seq(1:750)]

As the plot does not conform to a straight line, we can say that the Injury claims do not
follow a normal distribution.
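A sketch of standard versions of both checks in R, with x denoting the vector of injury claims:

shapiro.test(x)          # Shapiro-Wilk W test; H0: the claims follow a normal distribution
qqnorm(x)                # sample quantiles against theoretical normal quantiles
qqline(x, col = "red")   # reference line; systematic departures indicate non-normality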
EM Algorithm:
It is reasonable to say that this mixture distribution is composed of 3 Gaussian distributions. We
use K-means clustering to get the initial values of µr, σr and πr using the kmeans() function in R.
Usage of K-means to get values of mean, standard deviation and probabilities:

                     Cluster 1   Cluster 2   Cluster 3
Mean                   6646.21     1016.97     13619.6
Standard Deviation     1503.44     1008.75      2260.4
Probability            0.45067     0.23733       0.312

Now, the density of the mixture distribution can be expressed as:

$$f(x; \mu, \sigma, \pi) = \sum_{r=1}^{3} \pi_r f_r(x; \mu_r, \sigma_r)$$
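This density is straightforward to evaluate in R; a sketch using length-3 parameter vectors
prop, mu and sigma as in the earlier sketches:

# Density of the 3-component Gaussian mixture at the points in x
dmix <- function(x, prop, mu, sigma) {
  dens <- sapply(1:3, function(r) prop[r] * dnorm(x, mu[r], sigma[r]))
  if (is.matrix(dens)) rowSums(dens) else sum(dens)
}
curve(dmix(x, prop, mu, sigma), from = 0, to = 22000)   # plot of the mixture density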
Algorithm:

• First, we initialise the Q function with the previously computed mean, standard deviation and
probability values of the three clusters.
• Then we run a while loop until the change |Q[k] - Q[k-1]| falls below a small tolerance
(1e-6 in our code) for some k, where the initial value of k is 2. Inside this loop we perform the
E-step and the M-step.

while (abs(Q[k] - Q[k-1]) >= 1e-6)
{
  # E-step
  # Compute the weighted component densities and posterior probabilities for all three clusters.

  # M-step
  # Using the posterior probabilities, re-estimate the mean, standard deviation and probability
  # values.

  k = k + 1
}

After computation of EM algorithm, we get the following output:


                       Y1         Y2         Y3
Mean                  636.952   6006.111   12644.76
Standard Deviation    398.198   1442.068    2908.82
Probability           0.20225   0.411278   0.386475

Hence, using the EM algorithm in R, we obtained the above estimates for µ, σ and π
respectively.
Visualizations of the fitted values:
Consider the following plot of the original Injury claims:

After fitting the model, we get 3 weighted component density values for each value of x, our
original observation. Hence, we overlay the 3 estimated components on the original variable
using the following:
x1 <- seq(from=range(x)[1], to=range(x)[2], length.out=750)  # grid over the observed range
y1 <- p1 * dnorm(x1, mean=mu1, sd=sigma1)  # weighted density of component 1
y2 <- p2 * dnorm(x1, mean=mu2, sd=sigma2)  # weighted density of component 2
y3 <- p3 * dnorm(x1, mean=mu3, sd=sigma3)  # weighted density of component 3

We now show these values on the original density plot to clearly classify the groups created.
[Figure: Plot of original claims with fitted values: kernel density curve ("kernal") with the three
fitted component curves fitted1, fitted2 and fitted3]

Here, "kernal" is the kernel density plot of the original data and the 3 curves depict the 3
Gaussian distributions fitted.

Density of the mixture distribution:


We used the mixnorm() function in R from the RBesT package to get a plot of the density curve
of the mixture distribution with its classification plots. This can be represented below:

Here, it can be seen that the mixing proportions are:


20.2% in group 1
41.1% in group 2
38.6% in group 3
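For reference, the mixture object behind this plot can be constructed as below; this sketch plugs
in our EM estimates and mirrors the exact call given in the Appendix (sigma = 5 is RBesT's
reference scale argument):

library(RBesT)
nm <- mixnorm(y1 = c(0.2022465,  636.9515,  398.1974),   # c(weight, mean, sd) per component
              y2 = c(0.4112852, 6006.1392, 1442.0882),
              y3 = c(0.3864683, 12644.8631, 2908.7576),
              sigma = 5)
plot(nm)   # density curve of the fitted mixture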
Allocation of the original variables to respective groups:
After the classification using EM algorithm, we allocate the original variables into the respective
groups with the following steps:
• We created a data frame y which has the variable sort_x and the 3 probability values for each
sort_x_i in terms of y1, y2, y3.
• The column "sort_x" is the injury claims arranged in increasing order of magnitude.
• Next, we created a loop which calculated the maximum probability value for each sort_x_i and
stored it in New_x.
• These values of New_x were mapped to the groups they come from, and the group labels were
stored in New_R.
The final output of the cluster made using EM algorithm (New_R) and cluster created using K-
Means are shown below:

Here, the New_R column represents the clusters created using the EM algorithm and the Cluster
column represents the clusters created using k-means.
Characteristics of the newly created clusters:
Given below is the summary of the newly created clusters:

             Min     Q1   Median    Mean     Q3     Max
Cluster 1   6000   7270    10540   10712  13520   21450
Cluster 2      0      0      360     286    480     570
Cluster 3    580    985     4185    3394   5255    5980

[Figure: Boxplot of the 3 clusters with respect to the injury claims]

Comparison between groups created and existing cluster groups:


For comparison between the groups created using the EM algorithm and those created by
k-means clustering, we compute the between-groups sum of squares for both models; the model
with the higher between-groups sum of squares is the better one.
Given below is the summary of the anova model:
summary(model_e)
Df Sum Sq Mean Sq F value Pr(>F)
y$New_R 2 16490633480 6.003e+09 677.6 <2e-16 ***
Residuals 747 6.617e+09 8.859e+06
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here, the p-value is less than 0.05, which means that the groups created using the EM algorithm
are significantly separated. Also, the between-groups sum of squares is 16490633480.
Sum of squares for k-means:
model_c$betweenss
[1] 12005592000

From the above outputs, it can be stated that the between-groups sum of squares of the EM model
is greater than that of the k-means approach.
Hence, the EM algorithm proves to be a better approach for grouping the claims than k-means
clustering.

Alternative approach of fitting mixture distribution:


As described in the Methodology, we validated our manually created model and its assumptions
using the Mclust() function from the mclust package, which fits parameterized finite Gaussian
mixture models by EM and selects the optimal model according to BIC.
We stored our model using mclust in “gmm”, which gave us the following summary and outputs:
[Figure: BIC values (about -14900 to -14500) for model types E and V plotted against the
number of components, 1 to 9]

The above plot shows the BIC values generated. The optimum number of components is read
from the V line: as the V curve falls off after the 3rd component, the optimum number of groups
to be created is 3. Hence, our assumption of fitting 3 Gaussian distributions to the original data
was satisfied.
Below is the plot of the classified groups using the Mclust() function:
[Figure: Classification plot from Mclust(): observations x (0 to 20000) assigned to groups
1, 2 and 3]
Application of the groups created on other variables using ggplots
The following plots give us some insights which can be derived from our model:

Age wise classification of the policy holders on the basis of premium, with the groups as the
clusters:
[Figure: policy_annual_premium (500 to 2000) plotted against age (20 to 60), coloured by the
EM group y$New_R (1, 2, 3)]

Interpretation:
The groups created are independent of the age of the policyholder and the annual premium.
Classification of the groups with respect to Injury claim and Property claim:

[Figure: property_claim plotted against injury_claim (both 0 to 20000), coloured by the EM
group y$New_R (1, 2, 3)]

Interpretation:
Here, we can conclude that the injury claim and the property claim are positively correlated.
Also, the classification of groups for the property claim is more or less the same as that of the
injury claim.
Classification of the groups for total claim amount and injury claim with respect to the
annual premium:

[Figure: Two panels of property_claim plotted against injury_claim and against
total_claim_amount, coloured by the EM group y$New_R (1, 2, 3)]
Interpretation:
Here, it can be seen that the mixing proportion of the groups is approximately the same for the
total claim amount but differs for the injury claim, keeping the annual premium constant.
Plots of the other categories with respect to the individual groups:
For group 2:

[Figures: For group 2, Injury Claim plotted against Annual Premium, coloured by Gender
(FEMALE, MALE), by Incident type (Multi-vehicle Collision, Parked Car, Single Vehicle
Collision, Vehicle Theft) and by policy state (IL, IN, OH)]

Interpretation:

From the 3 plots, it can be clearly seen that group 2 is independent of Gender and Policy state.
However, the plot of the incident type reveals that vehicle theft and parked car incidents have a
different claim amount, regardless of the premium, as compared to the remaining 2 incident types.

For group 1:

[Figures: For group 1, Injury Claim plotted against Annual Premium, coloured by Gender
(FEMALE, MALE), by Incident type (Multi-vehicle Collision, Single Vehicle Collision) and
by policy state (IL, IN, OH)]

Interpretation:

From the 3 plots, it can be clearly seen that group 1 is independent of Gender and Policy state.
However, the plot of the incident type reveals that only single-vehicle collision and multi-vehicle
collision exist for the policyholders in group 1. This means that there is a significant difference in
the incident type across the groups.
For group 3:

[Figures: For group 3, Injury Claim plotted against Annual Premium, coloured by Gender
(FEMALE, MALE), by Incident type (Multi-vehicle Collision, Parked Car, Single Vehicle
Collision, Vehicle Theft) and by policy state (IL, IN, OH)]

Interpretation:

Similar to group 2, it can be clearly seen that group 3 is independent of Gender and Policy state.
However, the plot of the incident type reveals that vehicle theft and parked car incidents have a
different claim amount, regardless of the premium, as compared to the remaining 2 incident types.
Conclusion
From the corresponding analysis and outputs of this research, the following conclusions can be
made:

The injury claims do not follow a normal distribution. A finite mixture distribution of 3 Gaussian
components can be fitted to the injury claims, which divides the data into 3 homogeneous groups
or clusters.

The groups classify the original injury claims precisely, as the clusters created using the EM
algorithm proved to be significant.

The mixing proportions created using these groups comprise the following:

20.2% in group 1
41.1% in group 2
38.6% in group 3
These mixing proportions remained consistent across the Mclust() function, the mixnorm()
function, and the manually coded EM loop, which validates the results obtained.

The comparison between the EM algorithm and clustering showed that the grouping obtained
using the EM algorithm is better than that of k-means. We concluded this using the between-groups
sum of squares, which was higher for the EM model than for clustering.

The ggplots created gave us many insights regarding which variables are significant or
independent, as well as the bifurcation of various categorical variables according to the groups
made, and vice versa. These visualizations can be further used to understand hidden patterns in
the data with regard to its other features.

Limitations and Future scope


In this research, we considered only the injury claim for fitting purposes, whereas similar work
can be carried out on other claims such as the property claim and vehicle claim.

We chose to fit a Gaussian mixture model to this data, but other types of mixture distributions
exist.

Our EM implementation did not carry out an explicit differentiation step for the maximization.
Also, in the expectation step we applied conditional probabilities to estimate the pi values
(probability values), which is a different approach from the one usually presented for the EM
algorithm.
Appendix
Presented below are all the codes used for our project:

Auto<-read.csv("Data.csv", header = TRUE, sep = ",")


#View(Auto)
plot(density(Auto$injury_claim), main = "Density of Injury claims", col = "blue", lwd = 4)

x <- Auto$injury_claim
View(Auto)

#Dividing the data into train and test


set.seed(123)
train_sample<- sample(1:nrow(Auto), 0.75*nrow(Auto), replace = FALSE )
Training_Data<- Auto[train_sample,]
Testing_Data<- Auto[-train_sample,]

x <- Training_Data$injury_claim
plot(density(x))

#Usage of K-means to get values of mean, variance and probabilities


mem <- kmeans(x,3)$cluster
mu1 <- mean(x[mem==1])
mu1
mu2 <- mean(x[mem==2])
mu2
mu3 <- mean(x[mem==3])
mu3

sigma1 <- sd(x[mem==1])


sigma2 <- sd(x[mem==2])
sigma3 <- sd(x[mem==3])

pi1 <- sum(mem==1)/length(mem)


pi1
pi2 <- sum(mem==2)/length(mem)
pi2
pi3 <- sum(mem==3)/length(mem)
pi3
#modified sum only considers finite values
sum.finite <- function(x) {
sum(x[is.finite(x)])
}

Q <- 0

# Starting value of expected value of the log likelihood


Q[2] <- sum.finite(log(pi1)+log(dnorm(x, mu1, sigma1))) + sum.finite(log(pi2)+log(dnorm(x,
mu2, sigma2))) + sum.finite(log(pi3)+log(dnorm(x, mu3, sigma3)))
k <- 2

while (abs(Q[k]-Q[k-1])>=1e-6) {
# E step
comp1 <- pi1 * dnorm(x, mu1, sigma1)
comp2 <- pi2 * dnorm(x, mu2, sigma2)
comp3 <- pi3 * dnorm(x, mu3, sigma3)
comp.sum <- comp1 + comp2 + comp3

p1 <- comp1/comp.sum
p2 <- comp2/comp.sum
p3 <- comp3/comp.sum

# M step

pi1 <- sum.finite(p1) / length(x)


pi2 <- sum.finite(p2) / length(x)
pi3 <- sum.finite(p3) / length(x)

mu1 <- sum.finite(p1 * x) / sum.finite(p1)


mu2 <- sum.finite(p2 * x) / sum.finite(p2)
mu3 <- sum.finite(p3 * x) / sum.finite(p3)

sigma1 <- sqrt(sum.finite(p1 * (x-mu1)^2) / sum.finite(p1))


sigma2 <- sqrt(sum.finite(p2 * (x-mu2)^2) / sum.finite(p2))
sigma3 <- sqrt(sum.finite(p3 * (x-mu3)^2) / sum.finite(p3))

p1 <- pi1
p2 <- pi2
p3 <- pi3
k <- k + 1

Q[k] <- sum(log(comp.sum))


}

#install.packages("mixtools")
## We get the estimates of the parameters using the inbuilt function normalmixEM()
library(mixtools)
gm<-normalmixEM(x,k=3)
gm$mu
gm$sigma
gm$lambda # posterior probabilities

#Visualization
hist(x, prob=T, breaks=32, xlim=c(range(x)[1], range(x)[2]), main='')
lines(density(x), col="black", lwd=2)

x1 <- seq(from=range(x)[1], to=range(x)[2], length.out=750)


y1 <- p1 * dnorm(x1, mean=mu1, sd=sigma1)
y2 <- p2 * dnorm(x1, mean=mu2, sd=sigma2)
y3 <- p3 * dnorm(x1, mean=mu3, sd=sigma3)

y <- data.frame(x,y1, y2, y3)


#View(y)

lines(x1, y1, col="red", lwd=2)


lines(x1, y2, col="blue", lwd=2)
lines(x1, y3, col="purple", lwd=2)

legend('topright', col=c("black", 'red',"blue","purple"), lwd=2, legend=c("kernal",


"fitted1","fitted2","fitted3"))

##Alternative Approach of depicting the density of the mixture distributions


#install.packages("RBesT")
library(RBesT)
nm <- mixnorm(y1=c( 0.2022465,636.9515 , 398.1974), y2=c(0.4112852,6006.1392 ,
1442.0882), y3 = c(0.3864683,12644.8631,2908.7576), sigma = 5)
print(nm)
summary(nm)
plot(nm)

## Allocation of the original variables to respective groups


sort_x <- sort(x , decreasing = FALSE)
y <- data.frame(sort_x,y1, y2, y3)

for (j in 1:NROW(y))
{
  # highest weighted density among the three components for observation j
  y$New_x[j] <- max(y1[j], y2[j], y3[j])
}

for (k in 1:NROW(y))
{
  # label each observation with the component whose weighted density is the maximum
  if (y$New_x[k] == y1[k])
  {
    y$New_R[k] <- "1"
  } else if (y$New_x[k] == y2[k])
  {
    y$New_R[k] <- "2"
  } else
  {
    y$New_R[k] <- "3"
  }
}

y$Cluster <- mem


View(y)

##Characteristics of the new created clusters
# Note: New_Data, New_y1, New_y2 and New_y3 are created in the ggplot section below

summary(New_Data[New_y1,]$injury_claim)
summary(New_Data[New_y2,]$injury_claim)
summary(New_Data[New_y3,]$injury_claim)

boxplot(New_Data$injury_claim ~ New_Data$Created_Cluster, col = c("red", "blue","green"),


main = "Boxplot of the 3 clusters with respect to the injury claims")
#Comparison between groups created and existing cluster groups

model_c <- kmeans(x, 3)


model_c$betweenss

model_e <- aov(y$sort_x ~ y$New_R, data = y)


summary(model_e)
model_e

##Assumption testing:

#install.packages("mclust")
library(mclust)

gmm <- Mclust(x)


summary(gmm)

gmm$classification
gmm$BIC

plot(gmm)

##Application of the groups created on other variables using ggplots:
library(ggplot2)   # needed for the ggplot() calls below

View(Training_Data)

New_Data <- Training_Data[order(Training_Data$injury_claim),]


New_Data$Created_Cluster <- y$New_R
View(New_Data)

New_y1 <- which(y$New_R == 1)


New_y2 <- which(y$New_R == 2)
New_y3 <- which(y$New_R == 3)

c <- ggplot(New_Data[New_y2,], aes(New_Data[New_y2,]$policy_annual_premium,


New_Data[New_y2,]$injury_claim), xlab = "policy annual premium")
c + xlab("Annual Premium") + ylab("Injury Claim")
c + geom_point(aes(colour = New_Data[New_y2,]$insured_sex)) + xlab("Annual Premium") +
ylab("Injury Claim")
c + geom_point(aes(colour = New_Data[New_y2,]$insured_education_level)) + xlab("Annual
Premium") + ylab("Injury Claim")
c + geom_point(aes(colour = New_Data[New_y2,]$incident_type)) + xlab("Annual Premium") +
ylab("Injury Claim")
c + geom_point(aes(colour = New_Data[New_y2,]$policy_state))
d <- ggplot(New_Data[New_y1,], aes(New_Data[New_y1,]$policy_annual_premium,
New_Data[New_y1,]$injury_claim))
d + geom_point(aes(colour = New_Data[New_y1,]$insured_sex)) + xlab("Annual Premium") +
ylab("Injury Claim")
d + geom_point(aes(colour = New_Data[New_y1,]$insured_education_level)) + xlab("Annual
Premium") + ylab("Injury Claim")
d + geom_point(aes(colour = New_Data[New_y1,]$incident_type)) + xlab("Annual Premium")
+ ylab("Injury Claim")
d + geom_point(aes(colour = New_Data[New_y1,]$policy_state)) + xlab("Annual Premium") +
ylab("Injury Claim")

e <- ggplot(Training_Data[New_y3,], aes(Training_Data[New_y3,]$policy_annual_premium,


Training_Data[New_y3,]$injury_claim))
e + geom_point(aes(colour = Training_Data[New_y3,]$insured_sex)) + xlab("Annual
Premium") + ylab("Injury Claim")
e + geom_point(aes(colour = Training_Data[New_y3,]$insured_education_level)) +
xlab("Annual Premium") + ylab("Injury Claim")
e + geom_point(aes(colour = Training_Data[New_y3,]$incident_type)) + xlab("Annual
Premium") + ylab("Injury Claim")
e + geom_point(aes(colour = Training_Data[New_y3,]$policy_state)) + xlab("Annual
Premium") + ylab("Injury Claim")

##With respect to the categories in the data

p <- ggplot(New_Data, aes(age, policy_annual_premium))


p + geom_point(aes(colour = y$New_R), size = 2)

p <- ggplot(New_Data, aes(injury_claim, property_claim))


p + geom_point(aes(colour = y$New_R), size = 2)

## To be included
e <- ggplot(New_Data, aes(x = injury_claim, property_claim))
e + geom_boxplot(aes(fill = y$New_R))

c <- ggplot(New_Data, aes(total_claim_amount,property_claim))


c + geom_boxplot(aes(fill = y$New_R))

gender <- ggplot(New_Data, aes(insured_sex, injury_claim))  # use columns of New_Data, not Training_Data


gender + geom_boxplot(aes(fill = y$New_R))
