Application of Mixture Distributions in Automobile Insurance
MASTER OF SCIENCE
IN
APPLIED STATISTICS & ANALYTICS
BY
October 2019
Abstract
In this paper, we model the insurance claims of an automobile insurance company. We use
mixture models since the claims come from heterogeneous sources. Mixture distributions are a
useful way to show how variables can be differently distributed. The K components of the mixture
are assumed to be from the same parametric family, and K-means clustering is used for
initialization. We then make use of the expectation-maximization (EM) algorithm to estimate the
parameters, namely the means, standard deviations and mixing probabilities. Using the estimated
probabilities, we fit our model and obtain the mixing proportions, which measure the share of each
component in the fitted model. We use the Bayesian Information Criterion (BIC) to validate whether
our assumption regarding the number of clusters is true. Finally, we compare the results obtained
from the EM algorithm and K-means clustering, and based on the results we comment on the
performance of the two.
Introduction
Automobile insurance is a policy that allows vehicle owners to diminish the costs
associated with an auto accident. Rather than paying out of pocket at the time of an accident,
policyholders pay annual premiums to an insurance company; the company then pays all or
most of the costs associated with the accident or other damage caused to the vehicle.
Auto insurance premiums vary depending on age, gender, years of driving experience, accident
and moving violation history, and other factors. A poor driving record or the desire for complete
coverage will lead to higher premiums.
In exchange for paying a premium, the insurance company agrees to pay your losses as
outlined in your policy.
Coverage includes:
i. Liability – legal responsibility to others for bodily injury or property damage
ii. Medical – costs of treating injuries, rehabilitation and sometimes lost wages and funeral
expenses.
Insurance claims protect people from financial ruin after an accident or disaster. At its
simplest, the definition of an insurance claim is "a formal request for money after a major loss."
The data set of the insurance company seems to follow a mixture distribution.
Objectives
1. To fit a finite mixture distribution to the injury claims of an automobile insurance company
2. To estimate the parameters of the mixture using the EM algorithm and to compare the resulting
groups with those from K-means clustering
3. To apply the results obtained on other features and be able to share some business insights
on the same
Methodology
Mixture Distribution:
In this project, we model the data in terms of a mixture of several components, where
each component has a simple parametric form (such as a Gaussian). In other words, we assume
each data point belongs to one of the components, and we try to infer
the distribution for each component separately.
In general, a mixture model assumes the data are generated by the following process:
first we sample $z$, and then we sample the observables $x$ from a distribution which depends on $z$,
i.e.

$$p(z, x) = p(z)\,p(x \mid z)$$

In mixture models, $p(z)$ is always a multinomial distribution. $p(x \mid z)$ can take a variety
of parametric forms, but we'll assume it is a Gaussian distribution. We refer to such a model as a
mixture of Gaussians. In general,

$$z \sim \mathrm{Multinomial}(\pi), \qquad x \mid z = r \sim N(\mu_r, \sigma_r^2)$$
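To make this generative process concrete, here is a minimal R sketch that simulates from a three-component Gaussian mixture; the parameter values below are arbitrary illustrative numbers, not the fitted ones:

set.seed(1)
n     <- 1000
props <- c(0.45, 0.24, 0.31)   # mixing proportions pi (illustrative)
mus   <- c(6600, 1000, 13600)  # component means (illustrative)
sds   <- c(1500, 1000, 2260)   # component standard deviations (illustrative)
z     <- sample(1:3, n, replace = TRUE, prob = props)  # sample the latent label z
x_sim <- rnorm(n, mean = mus[z], sd = sds[z])          # sample x given z
plot(density(x_sim))           # the density is multimodal, as expected for a mixture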
EM-Algorithm:
Now, during the analysis we observed that our data on injury claims follows
multiple normal distributions, so we try to fit a mixture distribution to the data.
Since we are fitting more than one distribution, say three distributions, we have to estimate
three sets of parameters (a mean, a standard deviation and a mixing probability for each
component). We estimate these parameters using the EM algorithm, which proceeds in three
phases:
1. Initialization
2. Expectation (E-step)
3. Maximization (M-step)
We initialize the values of the parameters based on our prior knowledge, or we can use other
techniques. In our project we use K-means clustering for initialization.
This algorithm categorizes the items into k groups by similarity, and we use the Euclidean
distance as the measure of similarity.
The algorithm works as follows: starting from k initial centroids, each point is assigned to its
nearest centroid, each centroid is recomputed as the mean of the points assigned to it, and these
two steps are repeated until the assignments no longer change.
Based on our assumption, we get three clusters with their respective parameter values
calculated as:

$$\mu_r = \bar{x}_r, \qquad \sigma_r = s_r, \qquad \pi_r = \frac{n_r}{n}, \qquad \text{where } r = 1, 2, 3$$

and $\bar{x}_r$, $s_r$ and $n_r$ denote the mean, standard deviation and size of cluster $r$.
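A minimal sketch of this initialization in R, assuming x holds the injury claims (kmeans() is a base R function):

km    <- kmeans(x, centers = 3)                      # three clusters of claims
mu    <- tapply(x, km$cluster, mean)                 # initial component means
sigma <- tapply(x, km$cluster, sd)                   # initial standard deviations
pr    <- as.numeric(table(km$cluster)) / length(x)   # initial mixing probabilities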
Expectation (E-step):
Now that we have the initial parameters of our GMM, we have to determine the
probability that a data point (xi) belongs to component r. This is the expectation step (E-step)
of MLE, where we calculate the "expected values" of the posterior probabilities for each data
point.
$$\Pr(u_j = r \mid x; \mu, \sigma, \pi) = \frac{\pi_r f_r(x; \mu, \sigma, \pi)}{\sum_{r=1}^{3} \pi_r f_r(x; \mu, \sigma, \pi)}$$

is computed first. Then the expected log likelihood based on a random sample can be expressed as:

$$Q = \sum_{j=1}^{n} \sum_{r=1}^{3} \Pr(u_j = r \mid x_j; \mu, \sigma, \pi)\,\log\!\left[\pi_r f_r(x_j; \mu, \sigma, \pi)\right]$$

Maximization (M-step):
In the M-step, the parameters are updated so as to maximize $Q$; for example, the updated mean
of component $r$ is

$$\mu_r' = \frac{\sum_{j=1}^{n} x_j \Pr(u_j = r \mid x; \mu, \sigma, \pi)}{\sum_{j=1}^{n} \Pr(u_j = r \mid x; \mu, \sigma, \pi)}$$
After initialization, we perform the E-step, the expectation step where we calculate the
expected log likelihood (the Q function), and then we estimate the parameters that maximize the Q
function in the M-step.
We iterate between the E-step and the M-step until the likelihood converges, i.e. until
|Q[k] − Q[k−1]| falls below a small tolerance for some k. Once the likelihood converges, we get
optimal values of the parameters for the groups created, along with their mixing proportions.
The dataset for our project is automobile insurance data from the All State Company.
This data has 1000 data points and 20 variables, which are described below:
Numeric Variables              Categorical Variables
months_as_customer             policy_state
age                            policy_csl
policy_number                  policy_deductable
policy_annual_premium          insured_sex
number_of_vehicles_involved    insured_education_level
total_claim_amount             insured_occupation
injury_claim                   insured_hobbies
property_claim                 insured_relationship
vehicle_claim                  incident_type
                               incident_severity
                               auto_make
[Figure: density plot of the injury claims]
After looking at the density plot, it can be seen that the density of the injury claims does not
have a bell-shaped curve. Hence, there is a high chance that the observations do not follow a
normal distribution.
Normality checks:
We have used 2 methods for testing normality of the claims:
1. Shapiro-Wilk's W test (under H0, the data follow a normal distribution)
2. QQ-plots
From the Shapiro-Wilk test, we got a p-value of 4.275e-14, which is less than 0.05.
Hence, we reject H0 at the 5% level of significance and conclude that the injury claims do not
follow a normal distribution.
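Both checks are one-liners in R (shapiro.test, qqnorm and qqline are base R functions; x is assumed to hold the injury claims):

shapiro.test(x)   # Shapiro-Wilk W test; a small p-value means we reject normality
qqnorm(x)         # sample quantiles against the normal quantiles
qqline(x)         # reference line; departures from it indicate non-normality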
Output from the QQ-plot:
[Figure: QQ-plot of the injury claims]
As the points do not conform to a straight line, we can say that the injury claims do not
follow a normal distribution.
EM Algorithm:
It is reasonable to say that this mixture distribution is composed of 3 Gaussian distributions. We
use K-means clustering to get the initial values of µr, σr and πr using the kmeans() function in R.
Usage of K-means to get the values of mean, standard deviation and probability:

                     Cluster 1   Cluster 2   Cluster 3
Mean                 6646.21     1016.97     13619.6
Standard Deviation   1503.44     1008.75     2260.4
Probability          0.45067     0.23733     0.312
The mixture density to be fitted is then

$$f(x; \mu, \sigma, \pi) = \sum_{r=1}^{3} \pi_r f_r(x; \mu, \sigma, \pi)$$
Algorithm:
First, we initialise the Q function with the previously computed mean, standard deviation and
probability values of the three clusters.
Then we run a while loop until |Q[k] − Q[k−1]| falls below a small tolerance (10^-6) for some k,
where the initial value of k is 2. Inside this loop we perform the E-step and the M-step, and at
the end of each iteration we set k = k + 1.
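A compact, runnable skeleton of this loop, assuming mu, sigma and pr are the length-3 vectors produced by the K-means initialization sketch above (the fully written-out version appears in the Appendix):

k <- 2
Q <- c(0, 1)                      # dummy starting values so Q[k] - Q[k-1] is defined
while (abs(Q[k] - Q[k-1]) >= 1e-6) {
  dens <- sapply(1:3, function(r) pr[r] * dnorm(x, mu[r], sigma[r]))
  post <- dens / rowSums(dens)    # E-step: posterior probability of each component
  pr    <- colMeans(post)                      # M-step: mixing proportions
  mu    <- colSums(post * x) / colSums(post)   # M-step: component means
  sigma <- sqrt(colSums(post * outer(x, mu, "-")^2) / colSums(post))  # M-step: sds
  k <- k + 1
  Q[k] <- sum(log(rowSums(dens))) # log likelihood used for the convergence check
}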
Hence, using the EM algorithm in R, we obtain the estimates of µ, σ and π for the three
components.
Visualizations of the fitted values:
Consider the following plot of the original injury claims:
[Figure: density plot of the original injury claims]
After fitting the model, we get 3 probability values for each value of x, our original
observations. Hence, we fit the 3 estimated components over the original variable using the following:
x1 <- seq(from=range(x)[1], to=range(x)[2], length.out=750)  # grid over the data range
y1 <- p1 * dnorm(x1, mean=mu1, sd=sigma1)  # weighted density of component 1
y2 <- p2 * dnorm(x1, mean=mu2, sd=sigma2)  # weighted density of component 2
y3 <- p3 * dnorm(x1, mean=mu3, sd=sigma3)  # weighted density of component 3
We now show these values on the original density plot to clearly classify the groups created.
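A minimal sketch of this overlay (the colours and legend labels are assumed to match the figure):

plot(density(x), main = "Plot of original claims with fitted values", lwd = 2)
lines(x1, y1, col = "red")     # fitted component 1
lines(x1, y2, col = "green")   # fitted component 2
lines(x1, y3, col = "blue")    # fitted component 3
legend("topright", legend = c("kernel", "fitted1", "fitted2", "fitted3"),
       col = c("black", "red", "green", "blue"), lty = 1)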
[Figure: plot of original claims with fitted values – kernel density of the data with the three fitted curves (fitted1, fitted2, fitted3)]
Here, the kernel curve represents the density plot of the original data and the 3 fitted curves
depict the 3 Gaussian distributions.
Here, New_R denotes the clusters created using the EM algorithm, and the Cluster column
represents the clusters created using k-means.
Characteristics of the newly created clusters:
Given below are the summary statistics of the newly created clusters:

Clusters    Min    Q1     Median   Mean    Q3     Max
Cluster 1   6000   7270   10540    10712   13520  21450
Cluster 2   0      0      360      286     480    570
Cluster 3   580    985    4185     3394    5255   5980
Here, the p-value is less than 0.05, which means that the groups created using the EM algorithm
differ significantly. Also, the between-groups sum of squares is 16490633480.
Sum of squares for k-means:
model_c$betweenss
[1] 12005592000
From the above outputs, it can be seen that the between-groups sum of squares of the EM model
is greater than that of the k-means approach.
Hence, the EM algorithm proves to be a better approach for grouping the claims than k-means
clustering.
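The between-groups sum of squares of the EM grouping can be computed directly; a minimal sketch, assuming y$New_R holds the EM cluster labels for the claims in x:

grand <- mean(x)
# sum over the groups of (group size) x (group mean - grand mean)^2
betweenss_em <- sum(tapply(x, y$New_R,
                           function(g) length(g) * (mean(g) - grand)^2))
betweenss_em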
[Figure: BIC values of the fitted models (E and V) against the number of components (1–9)]
The above plot shows the BIC values generated. Here, the optimum number of components
depends on the V line (the varying-variance model in mclust): as the V line decreases rapidly
after the 3rd component, the optimum number of groups to be created is 3. Hence, our assumption
of fitting 3 Gaussian distributions to the original data was satisfied.
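This model selection can be reproduced with the mclust package; a minimal sketch, with x again holding the injury claims (Mclust() selects the number of components by BIC):

library(mclust)
gmm <- Mclust(x, G = 1:9)    # fit mixtures with 1 to 9 components
plot(gmm, what = "BIC")      # BIC against the number of components (E and V models)
gmm$G                        # the optimal number of components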
Below is the plot of the classified groups using the Mclust() function:
[Figure: classification plot of the injury claims (x) into the 3 groups from Mclust()]
Application of the groups created on other variables using ggplots
The following plots give us some insights that can be derived from our model.
Age-wise classification of the policyholders on the basis of premium, with the groups as the
clusters:
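A minimal sketch of how this plot can be produced with ggplot2 (assuming the data frame y carries age, policy_annual_premium and the EM labels New_R):

library(ggplot2)
ggplot(y, aes(x = age, y = policy_annual_premium, colour = factor(New_R))) +
  geom_point() +
  labs(colour = "Cluster")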
[Figure: scatter plot of policy_annual_premium against age, coloured by cluster (y$New_R = 1, 2, 3)]
Interpretation:
The groups created are independent of the age of the policyholder and the annual premium.
Classification of the groups with respect to Injury claim and Property claim:
[Figure: scatter plot of property_claim against injury_claim, coloured by cluster (y$New_R = 1, 2, 3)]
Interpretation:
Here, we can conclude that the injury claim and property claim are positively correlated. Also, the
classification of groups for the property claim is more or less the same as that of the injury claim.
Classification of the groups for total claim amount and injury claim with respect to the
annual premium:
[Figure: classification of the groups for total claim amount and injury claim with respect to the annual premium, coloured by cluster (y$New_R)]
For group 2:
[Figure: injury claims of group 2 plotted against insured_sex (gender: FEMALE, MALE), incident_type, and policy_state (IL, IN, OH)]
Interpretation:
From the 3 plots, it can be clearly seen that group 2 is independent of gender or policy state.
However, the plot of the incident type reveals that vehicle theft and parked car have a different
claim amount, regardless of the premium, as compared to the remaining 2 incident types.
For group 1:
[Figure: injury claims of group 1 plotted against gender, incident_type, and policy_state (IL, IN, OH)]
Interpretation:
From the 3 plots, it can be clearly seen that group 1 is independent of gender or policy state.
However, the plot of the incident type reveals that only single-vehicle collision and multi-vehicle
collision exist for the policyholders in group 1. This means that there is a significant difference
in the incident type across the groups.
For group 3:
[Figure: injury claims of group 3 plotted against gender (FEMALE, MALE), incident_type, and policy_state (IL, IN, OH)]
Interpretation:
Similar to group 2, it can be clearly seen that group 3 is independent of gender or policy state.
However, the plot of the incident type reveals that vehicle theft and parked car have a different
claim amount, regardless of the premium, as compared to the remaining 2 incident types.
Conclusion
From the corresponding analysis and outputs of this research, the following conclusions can be
drawn:
The injury claims do not follow a normal distribution. A finite mixture distribution of 3 Gaussian
components can be fitted to the injury claims, dividing the data into 3 homogeneous groups or
clusters.
The groups classify the original injury claims precisely, since the clusters created using the
EM algorithm proved to be significant.
The mixing proportions created using these groups are as follows:
20.2% in group 1
41.1% in group 2
38.6% in group 3
These mixing proportions remained consistent across the Mclust() function, the normalmixEM()
function, and the manually coded clusters in the while loop, which allowed our results to be
validated.
The comparison between the EM algorithm and k-means clustering showed that the results obtained
using the EM algorithm are better than those of clustering. We concluded this using the between-
groups sum of squares, which was higher for the EM model than for clustering.
The ggplots created gave us many insights as to which variables are significant or independent,
as well as how the various categorical variables split across the groups made, and vice versa.
These visualizations can be further used to understand the hidden patterns in the data with
regard to other features.
We chose to fit a Gaussian mixture model to this data, but other types of mixture distributions
exist.
The formula constructed for the EM model did not incorporate an explicit differentiation step for
the maximization step. Also, in the expectation step, we applied conditional probabilities to
estimate the π values (probability values), which is a different approach from the one usually
used in the EM algorithm.
References
https://www.datanovia.com/en/blog/ggplot-legend-title-position-and-labels
https://stats.stackexchange.com/questions/372477/comparing-k-means-and-expectation-maximization-
on-the-dataset-generated-when-d
https://www.sheffield.ac.uk/polopoly_fs/1.579191!/file/stcp-karadimitriou-normalR.pdf
https://www.researchgate.net/post/What_is_the_best_criteria_for_GMM_model_selection
https://towardsdatascience.com/mixture-modelling-from-scratch-in-r-5ab7bfc83eef
Appendix
Presented below is all the code used for our project:
## Load the data and extract the injury claims
x <- Auto$injury_claim
View(Auto)
x <- Training_Data$injury_claim   # training subset used for the analysis
plot(density(x))                  # kernel density plot of the injury claims
## EM algorithm: mu1..mu3, sigma1..sigma3, pi1..pi3 are the initial values from k-means
k <- 2
Q <- c(0, 1)   # dummy starting values so that Q[k] - Q[k-1] is defined
while (abs(Q[k] - Q[k-1]) >= 1e-6) {
  # E step: posterior probability of each component for every observation
  comp1 <- pi1 * dnorm(x, mu1, sigma1)
  comp2 <- pi2 * dnorm(x, mu2, sigma2)
  comp3 <- pi3 * dnorm(x, mu3, sigma3)
  comp.sum <- comp1 + comp2 + comp3
  p1 <- comp1 / comp.sum
  p2 <- comp2 / comp.sum
  p3 <- comp3 / comp.sum
  # M step: update the mixing proportions, means and standard deviations
  pi1 <- mean(p1); pi2 <- mean(p2); pi3 <- mean(p3)
  mu1 <- sum(p1 * x) / sum(p1)
  mu2 <- sum(p2 * x) / sum(p2)
  mu3 <- sum(p3 * x) / sum(p3)
  sigma1 <- sqrt(sum(p1 * (x - mu1)^2) / sum(p1))
  sigma2 <- sqrt(sum(p2 * (x - mu2)^2) / sum(p2))
  sigma3 <- sqrt(sum(p3 * (x - mu3)^2) / sum(p3))
  # log likelihood under the current parameters, for the convergence check
  k <- k + 1
  Q[k] <- sum(log(comp.sum))
}
#install.packages("mixtools")
## We get the estimates of the parameters using the inbuilt function normalmixEM()
library(mixtools)
gm <- normalmixEM(x, k = 3)
gm$mu       # component means
gm$sigma    # component standard deviations
gm$lambda   # mixing proportions
#Visualization
hist(x, prob=T, breaks=32, xlim=c(range(x)[1], range(x)[2]), main='')
lines(density(x), col="black", lwd=2)
## Assign each observation to the component with the highest weighted density
for (j in 1:NROW(y)) {
  y$New_x[j] <- max(y1[j], y2[j], y3[j])
}
## Label the assigned component in New_R
for (k in 1:NROW(y)) {
  if (y$New_x[k] == y1[k]) {
    y$New_R[k] <- "1"
  } else if (y$New_x[k] == y2[k]) {
    y$New_R[k] <- "2"
  } else {
    y$New_R[k] <- "3"
  }
}
summary(New_Data[New_y1,]$injury_claim)
summary(New_Data[New_y2,]$injury_claim)
summary(New_Data[New_y3,]$injury_claim)
## Assumption testing: number of components via BIC
#install.packages("mclust")
library(mclust)
gmm <- Mclust(x)      # fits Gaussian mixtures and selects the number of components by BIC
gmm$classification    # cluster labels
gmm$BIC               # BIC values across models and numbers of components
plot(gmm)
View(Training_Data)
## Boxplots of property claim against injury claim, filled by the EM clusters
library(ggplot2)
e <- ggplot(New_Data, aes(x = injury_claim, y = property_claim))
e + geom_boxplot(aes(fill = y$New_R))