

The Evolution of the Motor Insurance Industry

By

Luke Johnston - 1704156

University of Essex
Department of Mathematical Sciences

MA830-6-AU

Supervisor: Spyridon Vrontos

March 2020

Word Count: 13,860

Abstract

This project discusses how car insurance pricing systems have changed, adapted and been
innovated over the last century, since motor vehicles were invented and insurance became
mandatory in the 1930s. Surges in technology in the late 20th century allowed information to be
shared at a much faster rate. This, together with the use of online data collection, gave the
industry an opportunity to develop far more advanced pricing systems.

As well as this, I have included a more practical section in which I analyse an insurance claim
dataset, ‘dataCar’, from the R package ‘insuranceData’. This section focuses on using R to model
the data so that it is easier to interpret. The models produce regression coefficients for a number
of variables in relation to the number of claims. I will analyse and interpret the results to
conclude whether certain regressors show statistically significant correlations, indicating that
individuals with these specific characteristics may be at greater risk of a claim.

Finally, I will go on to discuss the potential future implications of technology for the motor
insurance industry. In the last decade, telematics technology, in conjunction with recent
breakthroughs in artificial intelligence (AI), has been at the forefront of a market looking to reach
new heights and create optimal systems for the industry, producing a socio-economic uplift for
both insurance firms and consumers.

I would like to thank my Capstone Project supervisor, Dr Spyridon Vrontos, for his guidance and
support throughout the project.

Contents

1 Introduction
  1.1 Factors Affecting Car Insurance Premiums
  1.2 Auto Insurance Statistics in the UK
2 History of the Motor Insurance Industry
  2.1 Early Motor Insurance
  2.2 Stagnant Market
  2.3 Innovation and Invention
3 Telematics
  3.1 How does telematics work?
  3.2 PAYD to PHYD
    3.2.1 Benefits of Telematics - Consumers
    3.2.2 Disadvantages of Telematics - Consumers
    3.2.3 Benefits of Telematics - Insurance Companies
    3.2.4 Disadvantages of Telematics - Insurance Companies
4 Car Insurance Pricing
  4.1 Bonus-Malus System (BMS)
  4.2 Generalised Linear Models for Modelling Claim Count Data
  4.3 GLM Framework and Application
    4.3.1 Poisson Regression
    4.3.2 Quasi-Poisson Regression
    4.3.3 Negative-Binomial Regression
    4.3.4 Hurdle Regression
    4.3.5 Zero-Inflated Regression
  4.4 Comparison
  4.5 Other Comparisons
  4.6 Results and Conclusions
5 What Does the Future Hold for the Industry
  5.1 Telematics and the Car Insurance Market
    5.1.1 Real-Time Telematics Data
  5.2 Autonomous Cars
    5.2.1 Autonomous Vehicles and the Car Insurance Market
    5.2.2 Advantages and Disadvantages of Autonomous Vehicles
  5.3 Potential For Insurers
6 Conclusion
7 References
8 Appendix
1. Introduction
The fundamental role of insurance is to act as a risk-transfer mechanism. It provides cover to an
insured person for financial loss or damage beyond their control, in exchange for a premium paid
monthly or annually. [1] An insurance firm uses its own systems of highly complex algorithms to
calculate premiums from a selection of factors. Every consumer’s policy is considered separately,
as everyone has a different risk exposure value associated with them. Exposure is calculated on
the basis of variables such as location. This creates a ‘fair’ system in which each policyholder
pays a unique tariff corresponding to their level of exposure to risk. [2] The insurance company
will then pay all or most of the costs associated with an accident or other vehicle damage,
depending on the coverage criteria for that policy.

Until recent years, auto insurance premiums have been determined by classical characteristic
variables such as age, gender, years of driving experience, value of vehicle, and many other
factors. Premiums can also be increased by a poor driving record, through what is known as a
Bonus-Malus System (BMS), or by requesting more coverage. [3] Conversely, you can reduce your
premiums by agreeing to take on more risk, which means increasing your deductible. I will
discuss this later in the project, along with other factors used as rating variables.

Most countries have different rules and regulations when it comes to auto insurance, but drivers
in the UK and US must have vehicle insurance cover; it is a compulsory requirement. Driving
without insurance could result in a revoked licence, a fine, or even time in prison.

1.1 Factors Affecting Car Insurance Premiums

Age
• Data shows that young drivers are more likely to be involved in accidents.
• Insurance costs should drop noticeably when a driver reaches around 21 years old, provided
  they have not been involved in an accident.

Driving Experience
• The more experience you have, the cheaper your car insurance premium.
• Points on your licence for speeding will result in a higher premium the following year.

Previous Claim History
• Insurers use data on previous claims to calculate your premium.
• Insurers have developed systems which reward or penalise policyholders depending on the
  number of claims they make in a year. This is known as a No-Claims Discount (NCD) or, as
  stated earlier, a Bonus-Malus System (BMS).

Vehicle Driven
• The make, model, age, security, value and size of your car all affect the price of your insurance.
• e.g. sports cars are more likely to be involved in accidents, so they carry higher risk.
• Repairing powerful cars is a long and expensive process, adding to the cost of a premium.

Location
• Rural and urban areas will have different premium costs.
• Where the car is parked (road or garage) will also affect the price.

Miles Driven Annually
• Some insurance companies will ask for the number of miles driven in the previous year as an
  indicator of the level of risk they may be exposed to.

Marital Status
• Single people are seen as less stable than their married counterparts. If you get married, you
  could see your premiums decrease right away.

Other Drivers
• More than one driver may be approved to drive the car, and each of them affects the cost of
  the policy.

Table 1: Traditional basic determinants of car insurance premium valuation

1.2 Auto Insurance Statistics in the UK

2020 is predicted to see a loss in the auto insurance industry, with lower premiums relative to
higher average claim costs. This would be a sharp U-turn from the previous two consecutive
profitable years, which included the industry's most profitable period to date. [4] The average
premium cost has increased over the last few years, but at a slower rate, as insurance
comparison websites make it easy to minimise your insurance costs. Moreover, the average claim
cost has also increased. This is a result of repair costs being generally higher than they used to
be, due to the increased amount of technology in areas of the car (e.g. electric wing mirrors,
heated seats). As well as costing more, repairs also take longer, so insurance companies are
paying for claimants' hire cars over longer periods.

Another reason for higher average claim amounts is that the average personal injury claim amount
was at a record high in Q2 of 2016 and Q4 of 2017, with both quarters averaging a personal injury
claim amount of approximately £10,800. The cause of this is the rising number of whiplash claims
reported. [5]

Figure 1: Average Car Insurance Policy Cost by Age Group, 2018

[9] https://www.finder.com/uk/car-insurance-statistics

Figure 2: Average Claim, Average Premium and Claims Frequency - Motor Insurance by Age, 2018

[8] https://www.abi.org.uk/products-and-issues/choosing-the-right-insurance/motor-insurance/age-and-motor-insurance/

Another problem for motorists and insurers is the number of uninsured drivers on the road.
However, in 2018 the percentage of uninsured drivers was at its lowest since 2012. A total of
79,713 motorists were caught uninsured in 2017, a 33% decrease from 2016. [6] A lower number
of uninsured drivers benefits both insurance companies and consumers. If an uninsured driver is
involved in an accident, they will be unable to compensate victims for any losses incurred. The
victim must therefore claim through their own insurance (provided they have fully comprehensive
cover), which also affects their no-claims discount. An estimated 130 people are killed by
uninsured or untraced drivers annually in the UK, and government figures put the yearly
economic losses from such incidents at more than £1.8 billion. [7]

In the following chapters, I will explain in further depth how the industry has adapted to an ever-
changing environment through the early 1900s, two World Wars, the post-WWII technology boom
and the fast-evolving technology of the 21st century. Advanced technology poses a vast number
of opportunities for the industry to progress and become more efficient for both consumers and
insurers. However, a number of hurdles must be overcome to reach these heights.

2. History of the Motor Insurance Industry


This section will take a walk through the last century to discuss how car insurance adapted to
the rapidly changing environment of the early and late 1900s and the technological age of the
21st century.

2.1 Early Motor Insurance

When motor vehicles were introduced as a new mode of transport in Germany in the late 1800s,
they were very basic, slow-moving and much more temperamental than today's cars. There was
a greater chance of the car blowing up than of it causing damage to a third party. Premiums
were extremely high and only the rich could afford coverage. During the experimental phase of
motor insurance in the early 1900s, most insurance companies struggled and many failed to
provide adequate cover. [10] The underwriters of early insurance policies were part of large
insurance businesses, but lacked sufficient knowledge and data to produce accurate
calculations. Insurers would pool the driver statistics they had to create a market minimum
premium.

The biggest problem with early motor insurance was that victims of a road traffic accident (RTA)
were unable to obtain compensation for damages if the other driver had decided not to take out
any motor insurance. The introduction of the Road Traffic Act 1930 made it compulsory for
anyone who owned a motor vehicle to be insured, which led to a major leap in the motor
insurance industry. The legislation also abolished speed limits for vehicles carrying fewer than
seven people and made it an offence to drive dangerously. [11]

The postwar technology boom and globalisation created opportunities for UK insurers, with the
second half of the 20th century seeing a huge transformation period for the UK insurance market.
[12]

2.2 Stagnant Market

The years following this new legislation saw a number of reformed versions of the Road Traffic
Act. The Road Traffic Act 1988 (legislation for EU motor directives) added a number of regulations
with the aim of creating more protection for drivers and to all those at risk of RTAs. It would also
help tackle the problem of uninsured drivers on the roads. The standardised system of motor
insurance cover came in three main forms:

Third Party
• The mandatory minimum coverage you must have to drive your car on the roads.
• You are protected from claims against you, but not from damage to your own vehicle.
• Generally the cheapest.

Third Party, Fire and Theft
• The most common type of policy, which insures your vehicle against fire and theft, but no
  other kind of damage.

Fully Comprehensive (First Party)
• Covers accidents that are your fault, that happen without the involvement of another vehicle,
  or where the other driver is uninsured.

Table 2: Types of motor insurance coverage available to purchase

The insurance industry was in a slump prior to the invention of the internet in 1983. Early
software was slow and temperamental, but it was still a platform (alongside the telephone)
through which drivers could find insurance companies to cover them. Insurance companies also
used brokers to communicate through these platforms with both private and commercial
customers.

2.3 Innovation and Invention

The early 21st century saw a huge upturn in the potential of technology in the industry. In 2002,
insurance aggregator websites were created for motorists to compare insurance quotes between
companies, allowing them to find the deal that best suits them. This meant the industry could
move away from the traditional broker system, in which the insurance broker would assist with
both the premium quote and the transaction with the consumer. The role of insurance
aggregators is simply to compile a list of premium quotes from a selection of insurance
companies; they will not help with transacting the policy. Comparison websites countered the
slowly rising average premium price, since consumers could 'shop around' for the lowest
price. [13]

2006 was a very significant year for the motor insurance industry. Technological advancements
meant basic telematics ('black box') devices could be developed and installed. This was the first
step in the transition from a Pay-As-You-Drive (PAYD) system to Pay-How-You-Drive (PHYD). The
first versions of the telematics device were very basic, mainly tracking the location of the vehicle
and the number of miles travelled annually. Prior to this, customers would report their own
annual mileage to the insurer, which created an opportunity for fraud through false mileage
readings. Take-up was initially very low, as people were sceptical that the motivation behind the
technology was to track the whereabouts of vehicles for other reasons, and the device and
installation were relatively expensive on top of the premium.

Telematics has been the biggest step forward in the motor insurance industry. Judging each
driver specifically by how they drive means policies are more personally tailored. The standard
rating variables, which are still used today, provide a potentially inaccurate valuation. For
example, young drivers are automatically charged a premium much higher than that of their
older counterparts, which seems unfair to a young driver who does not drive dangerously or
recklessly. Telematics has therefore emerged as a money-saving technology for young drivers.

3. Telematics
As the motor insurance industry became stagnant, it was clear that the traditional rating
variables listed in Section 1 had to be reassessed. Using these rating variables causes inaccurate
and inconsistent premium valuations between insurers for the same individual, since insurers
also use their own claim data to adjust premiums. Because of this large divergence, it is now
standard for consumers to shop around for the best available option.

One option which has only recently become available is telematics technology. Telematics is a
recent innovation that has refuelled a motor insurance market which had previously reached the
mature stage of its life cycle.

3.1 How does telematics work?

Telematics is a method of monitoring a vehicle's usage: not just when it is being used, but how
the car is being driven. By combining a GPS system with on-board diagnostics, it is possible to
record and map exactly where a car is and the speed at which it is travelling. By adding a
communication system over a 3G network, data can be transferred quickly between the vehicle
and a central management hub. The device installed in the car communicates score ratings for a
variety of usage-based factors to the insurance company, which analyses the individual driving
data and tailors an insurance premium according to the ratings from the previous year or
month. [14]
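As an illustration of how such usage-based ratings might feed into a premium, the R sketch below converts a set of hypothetical component scores into a discount. The component names, weights and 15% cap are assumptions chosen for illustration only, not any insurer's actual formula.

```r
# Illustrative only: a toy usage-based discount, not a real insurer's formula.
# Each component score is assumed to be normalised to [0, 1], where 1 = safest.
usage_discount <- function(speed, braking, cornering, night_share,
                           weights = c(0.4, 0.3, 0.2, 0.1),
                           max_discount = 0.15) {
  overall <- sum(weights * c(speed, braking, cornering, 1 - night_share))
  max_discount * overall   # fraction knocked off next term's premium
}

# A careful driver with 20% of mileage driven at night:
usage_discount(speed = 0.9, braking = 0.85, cornering = 0.8, night_share = 0.2)
# -> 0.12825, i.e. roughly a 12.8% discount
```

In practice each insurer weights and caps such scores differently; the point is only that a small number of telematics variables can be combined into a single premium adjustment.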

There are four types of telematics device on the market:

Dongle
Advantages:
• Self-installed device purchased by the insurer and given to the policyholder (usually free for
  the customer)
• Low installation/administrative costs
• High accuracy data
Disadvantages:
• Not available for use in older vehicles (pre-1996)
• Short life expectancy (12-18 months) leads to frequent replacements, again incurring costs

Black Box
Advantages:
• Most accurate and comprehensive data
• Equipped with its own sensors to monitor driving behaviours
Disadvantages:
• Lowest cost-effectiveness ratio
• Highest installation/administrative costs
• Cannot be transferred to a different vehicle
• Must be professionally installed

Embedded Telematics Equipment
Advantages:
• Installed during the manufacturing phase of the car, so no installation costs for the insurer
• Data collected is highly accurate and reliable
Disadvantages:
• Not all makes of car offer this type of telematics device
• The method is not yet widely standardised, so it may become confusing for insurance
  companies as each manufacturer may offer a slightly differentiated version

Smartphone
Advantages:
• Almost no incremental costs for insurers or customers
• High use of technology connects the smartphone to the sensors in the vehicle
• Large storage capacity
• Access to high-speed internet allows for fast data transmission
Disadvantages:
• Reliability is an issue, as well as accuracy
• Limited availability of data for vehicles with smartphones implemented

Table 3: Types of telematics devices currently available to install in vehicles

[15] Evolution of Insurance: A Telematics-Based Personal Auto Insurance Study, by Yuanjing Yao

3.2 PAYD to PHYD

Definition of terms:

PAYD (Pay-As-You-Drive)
• A device is installed in the vehicle to send mileage data to the insurance company.
• The premium is calculated using generalised linear regression models.

PHYD (Pay-How-You-Drive)
• A device in the vehicle sends driving-style data to the insurance company.
• The premium evolves with the driver's risk rating.

Table 4: Definition of terms Pay-As-You-Drive and Pay-How-You-Drive

[16] Telematics System in Usage Based Motor Insurance, by Siniša Husnjaka, Dragan Perakovića, Ivan Forenbachera, Marijan Mumdzievb

Insurers have made it their responsibility to attempt to make roads and drivers safer. Drivers can
be rewarded with lower premiums if they can prove to the insurer that they represent a lower
risk, so driving within the parameters specified by the insurance company makes installing the
device very worthwhile. The device can also be used as a learning tool: drivers can learn from
the analysis they receive of their driving and save money on their insurance at the same time.
Customers who volunteer to install the device in their car generally see a 5-15% discount upon
installation, and a further discount in the following term if their driving scores stay within the
constraints of their policy. However, there is no guarantee that the driver will save money, as it
depends entirely on how they drive. The variables that a telematics device records are listed in
the following table:

Traditional                 Telematics

Age                         Max/Average Speed Travelled
Driving Experience          Acceleration
Previous Claim History      Braking
Vehicle Driven              Handling/Cornering
Location                    Location (Latitude/Longitude)
Miles Driven Annually       Distance Travelled (in Miles)
Marital Status              Number of Journeys
Other Drivers               Journey Time/Time of Day
                            Road Type
                            G-Force (Impact Detection)

Table 5: Comparison of rating variables between traditional and telematics

[17] https://www.confused.com/car-insurance/black-box/telematics-faqs

3.2.1 Benefits of Telematics - Consumers

• For consumers, installing a telematics device can result in flexible premium prices, lower on
average than those of a consumer with the same characteristics but no telematics device. With
traditional rating variables, the only way to see your premiums reduced is to build up a no-
claims bonus, as you cannot change characteristics such as gender. With telematics, reducing
annual mileage, driving more carefully and avoiding rush hours are all effective ways of lowering
premiums upon renewal.

• Drivers can learn from the data they receive on their driving. It will increase their safety
awareness and lead to safer driving.

• If drivers are aware of their driving behaviours, it can result in more efficient driving, with
improvements in fuel consumption from more controlled acceleration and speed. Speeding, hard
braking and rapid acceleration can increase fuel consumption by as much as 40%. [18]

• Insurers sometimes use telematics device data to help settle claims or determine fault after an
accident. It can show whether a driver was speeding at the time of impact, which is useful if the
other driver is disputing the claim. The GPS tracker can also protect against theft.

Financial incentives are the most effective in promoting driving safety. In a survey conducted by
the Insurance Research Council (IRC), over half of the 1,135 participating drivers who had
installed the telematics device admitted that they made changes to their driving behaviours. 36%
of the respondents said they made ‘small’ changes and 18% said they made significant changes
in how they drove. [19]

Figure 3: Results of a survey on whether drivers with telematics change their behaviour

[19] https://www.insurancejournal.com/news/national/2015/11/18/389327.htm

3.2.2 Disadvantages of Telematics - Consumers

• Although first time buyers will nearly always get an up front discount, bad drivers will end up
paying higher premiums upon renewal if they drive outside the specified parameters (bad for
that specific driver, but may be seen as a benefit to other drivers)

• If your annual mileage is high as a result of long commutes to work, the telematics device will
not be worthwhile as premiums will increase if you surpass your annual mileage quota.

• Some telematics policies can have curfew parameters, primarily for younger drivers (although
this may be helpful to parents of the policyholder if they were the ones who purchased the
insurance).

• Safety of data is a concern for consumers, as there is fear over the privacy of this tracking
data. This type of data is usually protected by only minimal security software, making it relatively
easy to hack. [20]

• If there is more than one driver insured on the car then this can affect the driver scores, as most
policies won’t be able to distinguish between drivers. Consequently, premiums will be higher
when the policy is renewed.

• Drivers may unintentionally adapt their driving behaviour in a way which is more dangerous, but
gives them a better score to avoid premium penalties.

3.2.3 Benefits of Telematics - Insurance Companies

• Collection of data is much easier and is all automatically collected in a database for all
policyholders.

• An insurance company that offers policies with telematics devices may attract a wider market of
consumers.

3.2.4 Disadvantages of Telematics - Insurance Companies

• We previously looked at the four different types of telematics device in this section. For
insurers, no single device type makes it possible to optimise installation/administrative costs,
accuracy of data and compatibility all at once.

• Whilst dongle, black box and embedded telematics all read data directly from the sensor in the
vehicle, the new smartphone technology captures the data from the Global Navigation Satellite
System. The data “is accurate, but subject to frequent occurrences of undetected outliers as
well as irregularities in the data acquisition rate”. [21]

• The smartphone's network connection may affect the availability of the data: in a very rural
area with little satellite coverage, the smartphone may not be able to send the data to the
insurance company's database.

• There is a potential for fraud when using a smartphone telematics system. The driver could
simply leave their smartphone at home if they know they will be driving recklessly.

To conclude, there are still many shortfalls when it comes to telematics, which is why many
consumers remain reluctant to purchase this type of policy. However, for many insurance
companies this is a prime opportunity to build a client base for the future. As telematics
technology becomes more advanced, insurers will want to implement some form of telematics
on each policy, so that data collection, and therefore premium valuations, can be more accurate.

4. Car Insurance Pricing


As we previously discussed, motor insurance is generally split into first-party and third-party
coverage. First-party coverage protects you and your property when the accident is your fault.
Third-party coverage gives protection in the event that the vehicle owner causes harm to another
party, who can then recover the cost from the policyholder. [28] In the UK, third-party liability
coverage is required by law for a car to be allowed on public roads.

4.1 Bonus-Malus System (BMS)

Costing individual policies is a process that has been refined over many years in an attempt to
create an optimal system for pricing insurance premiums. A basic and traditional pricing method
is a BMS. It combines an individual's characteristics and past accident data to adjust that
person's premiums over time. Policyholders are penalised with a premium increase (malus) if
they are responsible for one or more claims in a single period, and rewarded with a discounted
premium (bonus) in the next period if they make no claims. [30] It is standard for a BMS to be
integrated with risk classification and experience rating, derived as a function of the following
components: claim frequency, claim severity and significant individual characteristics. This
incorporates both a priori and a posteriori classification criteria. [31]
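The mechanics of a bonus-malus ladder can be sketched in R. The number of classes, the transition rule and the premium relativities below are hypothetical, chosen only to illustrate how claim-free years earn a bonus and claims incur a malus:

```r
# Hypothetical 6-class bonus-malus ladder (not any specific national system).
# Class 1 carries the maximum bonus, class 6 the maximum malus. A claim-free
# year moves the policyholder down one class; each claim moves them up two.
bms_next_class <- function(class, claims, n_classes = 6, malus_step = 2) {
  new_class <- if (claims == 0) class - 1 else class + malus_step * claims
  min(max(new_class, 1), n_classes)   # clamp to the ladder
}

relativity <- c(0.60, 0.75, 0.90, 1.00, 1.25, 1.50)  # premium multipliers

cls <- bms_next_class(class = 4, claims = 1)  # one at-fault claim: class 4 -> 6
relativity[cls]                               # pays 150% of the base premium
```

Real systems differ in the number of classes, step sizes and relativities, but all follow this reward/penalty structure.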

The losses incurred from claims are then modelled using a family of regression models known as
Generalised Linear Models (GLMs).

4.2 Generalised Linear Models for Modelling Claim Count Data

This section analyses the number of claims in a dataset of one-year vehicle insurance policies
taken out in 2004 or 2005. There are 67,856 policies, of which 4,624 had one or more claims. I
will model the data using Generalised Linear Models (GLMs) in R. Our aim is to model annual
claim frequency as a function of the given risk factors.
In this case, we have the following variables:

veh_value   Vehicle Value: in $10,000s

exposure    Exposure: how exposed each individual is to risk (0-1)

clm         Occurrence of Claim: (0 = no, 1 = yes)

numclaims   Number of Claims

claimcst0   Claim Amount: 0 if no claim

veh_body    Vehicle Body: coded as BUS (Bus), CONVT (Convertible), COUPE (Coupe),
            HBACK (Hatchback), HDTOP (Hardtop), MCARA (Modified Car), MIBUS (Minibus),
            PANVN (Panel Van), RDSTR (Roadster), SEDAN (Saloon), STNWG (Station Wagon),
            TRUCK (Truck), UTE (Utility)

veh_age     Vehicle Age: a factor with levels 1 (youngest), 2, 3, 4

gender      Gender: a factor with levels F, M

area        Area: a factor with levels A, B, C, D, E, F

agecat      Age Category: 1 (youngest), 2, 3, 4, 5, 6

Table 6: Variables used in the dataset ‘dataCar’

Dataset: dataCar

> head(dataCar)

  veh_value  exposure clm numclaims claimcst0 veh_body veh_age gender area agecat
1      1.06 0.3039014   0         0         0    HBACK       3      F    C      2
2      1.03 0.6488706   0         0         0    HBACK       2      F    A      4
3      3.26 0.5694730   0         0         0      UTE       2      F    E      2
4      4.14 0.3175907   0         0         0    STNWG       2      F    D      2
5      0.72 0.6488706   0         0         0    HBACK       4      F    C      2
6      2.01 0.8542094   0         0         0    HDTOP       3      M    C      4

Table 7: First 6 rows of the 67,856 observations, giving a clearer idea of the data to be analysed.

It is helpful to see a basic summary of each variable in the dataset. Some of the data can also
be represented graphically, as shown by the output that follows.

> summary(dataCar)

Vehicle Value ($10,000s):  Min 0.00     | 1st Qu. 1.01     | Median 1.50     | Mean 1.777    | 3rd Qu. 2.15     | Max 34.56
Exposure:                  Min 0.002738 | 1st Qu. 0.219028 | Median 0.446270 | Mean 0.468651 | 3rd Qu. 0.709103 | Max 0.999316
Claim (0 = No, 1 = Yes):   Min 0.0      | 1st Qu. 0.0      | Median 0.0      | Mean 0.06814  | 3rd Qu. 0.0      | Max 1.0
Number of Claims:          Min 0.0      | 1st Qu. 0.0      | Median 0.0      | Mean 0.07276  | 3rd Qu. 0.0      | Max 4.0
Claim Amount:              Min 200.0    | 1st Qu. 353.8    | Median 761.6    | Mean 2014.4   | 3rd Qu. 2091.4   | Max 55922.1
Vehicle Age:               Min 1.0      | 1st Qu. 2.0      | Median 3.0      | Mean 2.674    | 3rd Qu. 4.0      | Max 4.0
Age Category:              Min 1.0      | 1st Qu. 2.0      | Median 3.0      | Mean 3.485    | 3rd Qu. 5.0      | Max 6.0
Vehicle Body:              SEDAN 22,233 | HBACK 18,915 | STNWG 16,261 | UTE 4,586 | TRUCK 1,750 | HDTOP 1,579 | OTHER 2,532
Gender:                    F 38,603 | M 29,253
Area:                      A 16,312 | B 13,341 | C 20,540 | D 8,173 | E 5,912 | F 3,578

> table(dataCar$numclaims)

     0      1      2      3      4
63,232  4,333    271     18      2

We can see from the output for clm and numclaims that the data contains a very large number of zeros: the distribution of both variables is heavily concentrated at zero, far below their means. This is also visible in the claims table, which shows 63,232 zero-claim observations.
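The zero share implied by these counts can be checked directly; a minimal sketch in Python, with the frequencies copied from the claims table above:

```python
# Claim-count frequencies from table(dataCar$numclaims)
counts = {0: 63232, 1: 4333, 2: 271, 3: 18, 4: 2}

n = sum(counts.values())      # total number of policies
zero_share = counts[0] / n    # proportion of policies with no claims

print(n, round(zero_share, 3))  # → 67856 0.932
```

Roughly 93% of policies record no claim, which is why zero-heavy count models are worth considering for this data.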

> hist(dataCar$numclaims, xlab="No. Claims", ylab="Total", col="grey", main="Annual Claims")

> plot(dataCar$claimcst0, xlab="Policy Number", ylab="Claim Amount", main="Claim Amount")

Here I have used R to graphically depict the number of annual claims and the claim amounts. We can again see the heavily weighted zero counts: most claim amounts are 0, which produces the distribution of points seen above.

> boxplot(dataCar$veh_value, main="Box Plot: Vehicle Value", xlab="Vehicle Value", horizontal = TRUE)

> boxplot(dataCar$exposure, main="Box Plot: Exposure", xlab="Exposure", horizontal = TRUE)

> boxplot(dataCar$claimcst0, main="Box Plot: Claim Amount", xlab="Claims Amount", horizontal = TRUE)

> boxplot(dataCar$veh_age, main="Box Plot: Vehicle Age", xlab="Vehicle Age", horizontal = TRUE)

The box plot of vehicle values shows that the middle 50% of policyholders spend roughly between $10,000 and $21,500 on their vehicle. The anomalous value of around $345,000 may be the latest (electric) Roadster. Exposure is distributed relatively evenly, with a mean value close to 0.5.
> with(dataCar, table(veh_body, numclaims))

> with(dataCar, table(agecat, numclaims))

> with(dataCar, table(gender, numclaims))

> with(dataCar, table(area, numclaims))

> top13_veh <- table(dataCar$veh_body)[order(-table(dataCar$veh_body))][1:13]

> ggplot(data=subset(dataCar, clm==1 & veh_body %in% names(top13_veh)),
    aes(x=veh_body, y=claimcst0, color=veh_body)) + geom_boxplot() + theme_bw()

           numclaims
veh_body       0     1    2   3   4
  BUS         39     8    1   0   0
  CONVT       78     3    0   0   0
  COUPE      712    61    7   0   0
  HBACK    17651  1202   58   4   0
  HDTOP     1449   124    6   0   0
  MCARA      113    13    1   0   0
  MIBUS      674    41    2   0   0
  PANVN      690    57    4   1   0
  RDSTR       25     1    1   0   0
  SEDAN    20757  1361  108   7   0
  STNWG    15088  1105   63   3   2
  TRUCK     1630   112    6   2   0
  UTE       4326   245   14   1   0

        numclaims
agecat      0     1   2  3  4
  1      5246   468  27  1  0
  2     11943   869  58  5  0
  3     14654  1044  63  5  1
  4     15085  1027  73  4  0
  5     10122   583  29  1  1
  6      6182   342  21  2  0

        numclaims
gender      0     1    2  3  4
  F     35955  2477  160  9  2
  M     27277  1856  111  9  0

      numclaims
area      0     1   2  3  4
  A   15227   996  82  7  0
  B   12376   916  43  5  1
  C   19128  1332  79  1  0
  D    7677   469  26  1  0
  E    5526   363  20  2  1
  F    3298   257  21  2  0

4.3 GLM Framework and Application

It is now standard industry practice to use GLMs for non-life insurance pricing, particularly for the non-normal data typically encountered by insurance analysts. This section will introduce a different way of viewing GLMs: instead of viewing them as models for the full likelihood, we regard them as regression models for the mean, with estimating functions derived from a particular family of distributions used to fit the model. [22]

GLMs broaden the framework of linear regression models from the normal distribution to the exponential family. This makes it easier to model count data, such as claim frequencies. When modelling the data, the variable being explained is called the “dependent” (or response) variable, while the variables doing the explaining are the “explanatory” variables. We use GLMs because they allow us to relate the response to a linear combination of several explanatory variables. The fitted model helps explain the connection between the response and the explanatory variables.

In the simplest case, a linear regression models how the mean of a response variable, Yi, depends on a set of explanatory variables, Xi. This gives us:

Yi = β0 + βXi + ϵi and E(Yi) = β0 + βXi

I have used R to compute the models. The main arguments of the glm() function in R are:

glm(formula, data, subset, na.action, weights, offset, family = gaussian, start = NULL, control =
glm.control(…), model = TRUE, y = TRUE, x = FALSE, ...)

In non-life insurance, claim frequency is conventionally estimated with generalised linear modelling techniques under an a priori Poisson specification. The classical Poisson and Negative Binomial (NB) models are described in a GLM framework and implemented in R by the glm() function (Chambers and Hastie 1992) in the stats package; the glm.nb() function can be found in the MASS package (Venables and Ripley 2002). The hurdle and zero-inflated extensions of these models are provided by the functions hurdle() and zeroinfl() in the pscl package (Jackman 2008). [23]

Type             Distribution        Method   Description

GLM              Poisson             ML       Poisson regression: classical GLM,
                                              estimated by maximum likelihood (ML)

                                     Quasi    Quasi-Poisson regression: same mean
                                              function, estimated by quasi-ML (QML)
                                              or equivalently generalised estimating
                                              equations (GEE); inference adjusted via
                                              an estimated dispersion parameter

                 Negative Binomial   ML       NB regression: extended GLM, estimated
                                              by ML including an additional shape
                                              parameter

Zero-Augmented   Hurdle              ML       Hurdle Poisson regression

                 Zero-Inflated       ML       Zero-inflated Poisson regression

Table 8: Overview of discussed count regression models. All GLMs use the same log-linear mean function but make different assumptions about the remaining likelihood. The zero-augmented models extend the mean function by modifying (typically, increasing) the likelihood of zero counts.

[23] Regression Models for Count Data in R - Journal of Statistical Software, July 2008, Volume 27, Issue 8

4.3.1 Poisson Regression

As previously stated, we start with the simplest distribution for modelling the count data. The Poisson distribution has the probability mass function

P(Y = k) = e^(−λt) (λt)^k / k!

(Figure 4: probability of seeing k events in time t, given λ events occurring per unit time.)

Poisson regression is used to model response (dependent) variables that are count data. It tells us which explanatory variables have a statistically significant impact on the response variable. It is best suited to events that are rare and random, as these follow a Poisson distribution, in contrast to common events, which tend to be distributed normally and symmetrically. For Poisson regression using GLMs we have:

E(Y) = β0 + β1X1 + β2X2 + … + βkXk = Xβ

[24] https://towardsdatascience.com/an-illustrated-guide-to-the-poisson-regression-model-50cccba15958

In a Poisson regression model, the event counts Y are assumed to be Poisson distributed, so the probability of observing Y is a function of the event-rate vector λ. When applying this to motor insurance, the count analysis of claim frequency is not limited to a fixed number of independent trials. Instead, the emphasis is on the risk exposure: the number of observations n grows large while n × p (the expected number of “successes”) remains finite. In this setting, the Poisson distribution is the appropriate statistical model for assessing the probability of 0, 1, 2, ... risk occurrences.
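To make this concrete, the Poisson probabilities of 0, 1, 2, … claims can be computed at a rate near the sample mean of numclaims; a minimal sketch (the rate value is illustrative, not a fitted quantity):

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 0.073  # near the sample mean number of claims per policy
probs = [round(poisson_pmf(k, lam), 4) for k in range(4)]
print(probs)  # → [0.9296, 0.0679, 0.0025, 0.0001]
```

At this rate the model already places about 93% of its mass on zero, broadly in line with the zero share observed in the data.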

The Poisson regression model fits the observed number of claims Y to a regression matrix X using a link function that expresses the rate vector λ as a function of:

1) the regression coefficients β,

2) the regression matrix X.
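In the Poisson GLM the link is the log, so the rate is obtained by exponentiating the linear predictor. A minimal sketch (the coefficient values here are hypothetical, chosen only to mirror the scale of the fitted models later in this section):

```python
import math

beta = [-2.4, 1.8]   # hypothetical [intercept, exposure] coefficients
x = [1.0, 0.5]       # design row: intercept term, exposure = 0.5

eta = sum(b * xi for b, xi in zip(beta, x))  # linear predictor eta = x . beta
lam = math.exp(eta)                          # inverse log link: lambda = exp(eta)
print(round(lam, 4))  # → 0.2231
```

Under these made-up coefficients, a policy with exposure 0.5 would have an expected claim count of about 0.22.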

The following figure shows the structure and components of the Poisson regression model.

Figure 5: Poisson Regression Model Components [24]

This formula, once fitted to the dataset, will assist in predicting the event count Y corresponding to an observed row of input regressors X.

Figure 6: Poisson Regression Formula for Count Data [24]
In R, this can be specified in the glm() call simply by setting family = poisson. Here I have fitted the basic Poisson regression model to capture the relationship between the number of claims and all regressors.

> model1 <- glm(numclaims ~ veh_value + exposure + veh_body + veh_age + gender + area + agecat,
family=poisson, data = dataCar)

> summary(model1)

When we compute this model we obtain a number of variables which are not statistically significant, as seen below. The coefficient values are those that maximise the likelihood of the observed counts; this is Maximum Likelihood Estimation (MLE).

Call:
glm(formula = numclaims ~ veh_value + exposure + veh_body + veh_age + gender + area + agecat, family = poisson, data = dataCar)

Deviance Residuals:
Min 1Q Median 3Q Max

-0.9052 -0.4273 -0.3323 -0.2679 4.5394

Coefficients:
Estimate Std. Error Z value Pr( | z | )

(Intercept) -2.275 0.327 -7.393 0.000 ***

veh_value 0.029 0.017 1.556 0.120

exposure 1.802 0.051 35.247 0.000 ***

veh_bodyCONVT -1.753 0.668 -2.604 0.009 **

veh_bodyCOUPE -0.559 0.338 -1.644 0.100

veh_bodyHBACK -0.983 0.319 -3.100 0.002 **

veh_bodyHDTOP -0.850 0.328 -2.590 0.010 **

veh_bodyMCARA -0.399 0.410 -0.977 0.329

veh_bodyMIBUS -1.028 0.350 -2.916 0.004 **

veh_bodyPANVN -0.880 0.339 -2.641 0.008 **

veh_bodyRDSTR -0.622 0.660 -0.925 0.355

veh_bodySEDAN -0.934 0.318 -2.949 0.003 **

veh_bodySTNWG -0.934 0.318 -2.920 0.004 **

veh_bodyTRUCK -0.982 0.328 -3.008 0.003 **

veh_bodyUTE -1.140 0.322 -3.543 0.000 ***

veh_age2 0.070 0.045 1.567 0.117

veh_age3 -0.039 0.048 -0.799 0.424

veh_age4 -0.098 0.057 -1.721 0.085 .

genderM -0.024 0.030 -0.796 0.426

areaB 0.054 0.043 1.257 0.209

areaC 0.004 0.039 0.112 0.911

areaD -0.109 0.053 -2.065 0.039 *

areaE -0.027 0.058 -0.463 0.643

areaF 0.070 0.066 1.064 0.287

agecat2 -0.175 0.054 -3.234 0.001 **

agecat3 -0.225 0.053 -4.254 0.000 ***

agecat4 -0.251 0.053 -4.757 0.000 ***

agecat5 -0.469 0.059 -7.936 0.000 ***

agecat6 -0.456 0.068 -6.749 0.000 ***

In the last column, each value is given a significance code (0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1) which indicates how confident we can be that the coefficient has an impact on the dependent variable. Coefficients without one of these symbols are not statistically significant, so we remove them to obtain a more parsimonious model.

Therefore, we re-run the model, this time excluding the non-significant factors.

> fm_pois <- glm(numclaims ~ exposure + veh_body + agecat, family = poisson, data = dataCar)

> summary(fm_pois)

This simpler model lets us obtain the coefficient estimates along with associated partial Wald
(Chi-Squared) tests. The Wald test gives us the significance rating for each variable, indicating if
that explanatory variable adds anything to the model.
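The z value printed in the summary is simply the estimate divided by its standard error. Recomputing it for the exposure row from the rounded figures above (so the result is only approximate):

```python
est, se = 1.798, 0.051   # exposure estimate and std. error, as printed (rounded)
z = est / se             # Wald z statistic
print(round(z, 2))       # → 35.25
```

The summary prints 35.325 because it works from the unrounded estimate and standard error; the small discrepancy here comes purely from rounding the inputs.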

Deviance Residuals:
Min 1Q Median 3Q Max

-0.9102 -0.4279 -0.3329 -0.2688 4.5081

Coefficients:

Estimate Std. Error Z value Pr( | z | )

(Intercept) -2.416 0.320 -7.554 0.000 ***

exposure 1.798 0.051 35.325 0.000 ***

veh_bodyCONVT -1.515 0.658 -2.302 0.021 *

veh_bodyCOUPE -0.543 0.337 -1.613 0.107

veh_bodyHBACK -0.953 0.317 -3.002 0.003 ***

veh_bodyHDTOP -0.825 0.328 -2.518 0.012 *

veh_bodyMCARA -0.366 0.408 -0.897 0.370

veh_bodyMIBUS -1.031 0.350 -2.947 0.003 **

veh_bodyPANVN -0.884 0.339 -2.609 0.009 **

veh_bodyRDSTR -0.439 0.658 -0.667 0.505

veh_bodySEDAN -0.906 0.317 -2.855 0.004 **

veh_bodySTNWG -0.873 0.318 -2.750 0.006 **

veh_bodyTRUCK -0.968 0.328 -2.949 0.003 **

veh_bodyUTE -1.122 0.322 -3.483 0.000 ***

agecat2 -0.180 0.054 -3.330 0.001 **

agecat3 -0.235 0.053 -4.444 0.000 ***

agecat4 -0.262 0.053 -4.979 0.000 ***

agecat5 -0.479 0.059 -8.119 0.000 ***

agecat6 -0.480 0.067 -7.124 0.000 ***

From the analysis it is clear that over-dispersion is present in this data set. Therefore we re-compute the Wald tests using sandwich standard errors (via the sandwich and lmtest packages):

> coeftest(fm_pois, vcov = sandwich)

z test of coefficients:

Estimate Std. Error Z value Pr( | z | )

(Intercept) -2.416 0.303 -7.977 0.000 ***

exposure 1.798 0.048 37.503 0.000 ***

veh_bodyCONVT -1.515 0.640 -2.366 0.018 *

veh_bodyCOUPE -0.543 0.323 -1.684 0.092 .

veh_bodyHBACK -0.953 0.301 -3.166 0.002 **

veh_bodyHDTOP -0.825 0.312 -2.648 0.008 **

veh_bodyMCARA -0.366 0.394 -0.930 0.353

veh_bodyMIBUS -1.031 0.335 -3.077 0.002 **

veh_bodyPANVN -0.884 0.325 -2.716 0.007 **

veh_bodyRDSTR -0.439 0.763 -0.576 0.565

veh_bodySEDAN -0.906 0.301 -3.011 0.003 **

veh_bodySTNWG -0.873 0.301 -2.898 0.004 **

veh_bodyTRUCK -0.968 0.313 -3.089 0.002 **

veh_bodyUTE -1.122 0.306 -3.664 0.000 ***

agecat2 -0.180 0.055 -3.298 0.001 **

agecat3 -0.235 0.053 -4.400 0.000 ***

agecat4 -0.262 0.053 -4.942 0.000 ***

agecat5 -0.479 0.060 -8.047 0.000 ***

agecat6 -0.480 0.069 -7.006 0.000 ***

The exponentiated value of each coefficient is the multiplicative factor applied to the estimated number of claims when that variable increases by one unit. For categorical (factor) variables, the exponentiated coefficient is the multiplicative factor relative to the base level of that variable. exp(Intercept) is the baseline rate, and all other estimates are relative to it.

> exp(coef(fm_pois))

(Intercept) exposure veh_bodyCONVT veh_bodyCOUPE veh_bodyHBACK

0.089 6.040 0.220 0.581 0.386

veh_bodyHDTOP veh_bodyMCARA veh_bodyMIBUS veh_bodyPANVN veh_bodyRDSTR

0.438 0.693 0.357 0.413 0.645

veh_bodySEDAN veh_bodySTNWG veh_bodyTRUCK veh_bodyUTE agecat2

0.404 0.418 0.380 0.326 0.835

agecat3 agecat4 agecat5 agecat6

0.791 0.769 0.619 0.619

Interpretation:

The “baseline” average number of claims is 0.089. The other exponentiated coefficients are interpreted multiplicatively. A one unit increase in exposure multiplies the expected number of claims by 6.04. However, as exposure is measured between 0 and 1, a full one unit increase is impossible, although the multiplicative interpretation still applies proportionally. Relative to the base level (BUS), a convertible (CONVT) multiplies the expected claim count by 0.220, compared with a roadster (RDSTR) at 0.645. For the age categories, the exponentiated value decreases as the age category increases, showing that as age increases, the expected claim frequency decreases.
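The multiplicative reading can be checked numerically; for instance, the expected claim count for an agecat-5 policyholder at the baseline of the other factors, using values copied from the exp(coef) output above:

```python
base_rate = 0.089      # exp(Intercept): baseline expected claim count
rr_agecat5 = 0.619     # exp(coef) for agecat5: rate ratio vs age category 1

# Rate ratios scale the baseline rate multiplicatively
expected = base_rate * rr_agecat5
print(round(expected, 4))  # → 0.0551
```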

4.3.2 Quasi-Poisson

The Quasi-Poisson regression is a generalised version of the Poisson regression, used primarily when modelling an over-dispersed count variable. [36] The quasipoisson family accounts for the greater variance in the data. In the standard Poisson model the dispersion parameter is fixed at 1: the Poisson model assumes that the variance is equal to the mean, an assumption which is not always true. Therefore, we re-fit the data using a Poisson GLM (with the same mean and variance functions) but leave the dispersion parameter unrestricted. The quasi-Poisson model is effective because it assumes the variance is proportional to the mean.

Thus, we can estimate the value of the dispersion parameter from the data, rather than fixing it at 1. Computing this in R gives the same coefficient estimates as the standard Poisson model, but the inference is adjusted for over-dispersion. Consequently, both approaches (quasi-Poisson and sandwich-adjusted Poisson standard errors) take the estimating-function view of the Poisson model and do not correspond directly to models with fully specified likelihoods.
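The unrestricted dispersion parameter is typically estimated from the Pearson statistic, φ̂ = Σ (y − μ̂)²/μ̂ divided by the residual degrees of freedom; φ̂ > 1 signals over-dispersion. A toy sketch (the data here are invented purely to show the calculation):

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson estimate of the quasi-Poisson dispersion parameter."""
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return pearson / (len(y) - n_params)

y  = [0, 0, 1, 0, 2, 0, 0, 1, 0, 3]                      # toy observed counts
mu = [0.2, 0.3, 0.5, 0.4, 0.6, 0.2, 0.3, 0.5, 0.4, 0.6]  # toy fitted means
phi = pearson_dispersion(y, mu, n_params=2)
print(round(phi, 3))  # → 1.958
```

Here the two large counts inflate the Pearson statistic, so the estimated dispersion comes out well above 1 for this toy data.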

In R, the Quasi-Poisson model with this estimated dispersion parameter is fitted with the glm()
function, by setting family = quasipoisson.

> model2 <- glm(numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat,
family=quasipoisson, data=dataCar)

> summary(model2)

However, again we see variables which are not statistically significant. We compute another
simpler model for the quasi-poisson.

> fm_qpois <- glm(numclaims ~ exposure + veh_body + agecat, family = quasipoisson, data = dataCar)

> summary(fm_qpois)

Call:
glm(formula = numclaims ~ exposure + veh_body + agecat, family = quasipoisson, data = dataCar)

Deviance Residuals:
Min 1Q Median 3Q Max

-0.9083 -0.4276 -0.3325 -0.2686 4.5284

Coefficients:
Estimate Std. Error t value Pr( | t | )

(Intercept) -2.416 0.322 -7.501 0.000 ***

exposure 1.798 0.051 35.076 0.000 ***

veh_bodyCONVT -1.515 0.663 -2.286 0.022 *

veh_bodyCOUPE -0.543 0.339 -1.601 0.109

veh_bodyHBACK -0.953 0.320 -2.981 0.003 **

veh_bodyHDTOP -0.825 0.330 -2.501 0.012 *

veh_bodyMCARA -0.366 0.411 -0.891 0.373

veh_bodyMIBUS -1.031 0.352 -2.927 0.003 **

veh_bodyPANVN -0.884 0.341 -2.591 0.010 **

veh_bodyRDSTR -0.439 0.663 -0.662 0.508

veh_bodySEDAN -0.906 0.320 -2.835 0.005 **

veh_bodySTNWG -0.873 0.320 -2.730 0.006 **

veh_bodyTRUCK -0.968 0.331 -2.928 0.003 **

veh_bodyUTE -1.122 0.324 -3.459 0.001 **

agecat2 -0.180 0.055 -3.307 0.001 **

agecat3 -0.235 0.053 -4.413 0.000 ***

agecat4 -0.262 0.053 -4.944 0.000 ***

agecat5 -0.479 0.059 -8.062 0.000 ***

agecat6 -0.480 0.068 -7.074 0.000 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AIC: n/a
BIC: n/a

(Dispersion parameter for quasipoisson family taken to be 1.014223)
Number of Fisher Scoring iterations: 6

Null deviance: 26768 on 67855 degrees of freedom
Residual deviance: 25341 on 67833 degrees of freedom

The point estimates have the same interpretation as in the regular Poisson regression, and the coefficient estimates are identical; however, the standard errors are inflated to reflect the extra variability. The dispersion parameter is estimated as 1.014, indicating the conditional variance is slightly larger than the conditional mean, hence the (mild) over-dispersion.

4.3.3 Negative-Binomial Regression

A third way of modelling over-dispersed count data is to assume a Negative Binomial (NB) distribution. NB regression is another generalisation of the traditional Poisson model in which the restrictive assumption that the variance equals the mean is again loosened. If the conditional distribution of the outcome variable is over-dispersed, the confidence intervals from an ordinary Poisson model are likely to be too narrow; the NB model corrects for this.

The traditional NB regression model is based on a distribution mixture of both Poisson and
Gamma. This formulation is most commonly used as it allows us to model the Poisson’s
heterogeneity using a gamma distribution. Thus, the negative binomial distribution is derived as a
gamma mixture of Poisson random variables.

The NB model can be considered as a generalisation of Poisson regression considering the mean
structure is the same as Poisson regression and it has an extra parameter to model the over-
dispersion. [25]
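The gamma-mixture construction implies the familiar NB moment relationship: if λ follows a gamma distribution with mean μ and shape θ, and Y | λ is Poisson(λ), then E[Y] = μ and, by the law of total variance, Var[Y] = μ + μ²/θ, so the variance always exceeds the mean. A quick sketch using values in the spirit of the fitted model below (θ ≈ 2.13):

```python
def nb_mean_var(mu, theta):
    """Mean and variance of a Poisson-Gamma (negative binomial) mixture."""
    return mu, mu + mu ** 2 / theta

mean, var = nb_mean_var(mu=0.073, theta=2.13)
print(round(mean, 4), round(var, 4))  # → 0.073 0.0755
```

The smaller θ is, the further the variance sits above the mean; as θ → ∞ the NB collapses back to the Poisson.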

> model3 <- glm.nb(numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat,
data=dataCar)

> summary(model3)

The simpler version of this model is:

> fm_nb <- glm.nb(numclaims ~ exposure + veh_body + agecat, data=dataCar)

> summary(fm_nb)

Call:
glm.nb(formula = numclaims ~ exposure + veh_body + agecat, data = dataCar, init.theta = 2.196516743, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max

-0.8659 -0.4234 -0.3301 -0.2670 4.0842

Coefficients:

Estimate Std. Error Z value Pr( | z | )

(Intercept) -2.430 0.339 -7.162 0.000 ***

exposure 1.807 0.052 34.803 0.000 ***

veh_bodyCONVT -1.502 0.673 -2.231 0.026 *

veh_bodyCOUPE -0.531 0.356 -1.492 0.136

veh_bodyHBACK -0.941 0.337 -2.792 0.005 **

veh_bodyHDTOP -0.814 0.347 -2.345 0.019 *

veh_bodyMCARA -0.355 0.429 -0.826 0.409

veh_bodyMIBUS -1.022 0.369 -2.772 0.006 **

veh_bodyPANVN -0.874 0.358 -2.439 0.015 *

veh_bodyRDSTR -0.439 0.687 -0.639 0.523

veh_bodySEDAN -0.895 0.337 -2.656 0.008 **

veh_bodySTNWG -0.862 0.337 -2.556 0.011 *

veh_bodyTRUCK -0.959 0.348 -2.758 0.006 **

veh_bodyUTE -1.111 0.341 -3.255 0.001 **

agecat2 -0.184 0.055 -3.312 0.001 **

agecat3 -0.237 0.054 -4.377 0.000 ***

agecat4 -0.265 0.054 -4.904 0.000 ***

agecat5 -0.482 0.060 -7.988 0.000 ***

agecat6 -0.484 0.069 -7.028 0.000 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AIC: 34772

BIC: 34955

(Dispersion parameter for Negative Binomial(2.1343) family taken to be 1)
Number of Fisher Scoring iterations: 1

Null deviance: 24702 on 67855 degrees of freedom
Residual deviance: 23318 on 67837 degrees of freedom

Theta: 2.134
Std. Err.: 0.379
2 x log-likelihood: -34732.347

Again we take the exponentiated coefficient of each significant variable and interpret the results.

> exp(coef(fm_nb))

(Intercept) exposure veh_bodyCONVT veh_bodyCOUPE veh_bodyHBACK

0.088 6.094 0.223 0.588 0.390

veh_bodyHDTOP veh_bodyMCARA veh_bodyMIBUS veh_bodyPANVN veh_bodyRDSTR

0.443 0.701 0.360 0.417 0.645

veh_bodySEDAN veh_bodySTNWG veh_bodyTRUCK veh_bodyUTE agecat2

0.409 0.422 0.383 0.329 0.832

agecat3 agecat4 agecat5 agecat6

0.789 0.768 0.618 0.616

Interpretation:

Again we have a baseline average number of claims, 0.088, very similar to that of the ML Poisson model. We also interpret the other exponentiated coefficients multiplicatively, as before. The exponentiated values of the NB model are all very similar to those of the Poisson model, although most of the vehicle types in the NB model give a slightly larger factor (e.g. exp(CONVT) was 0.220 for the Poisson, whereas for the NB we get 0.223). The same pattern is repeated for the age categories as in the Poisson model.

NB regression is not always suitable for data with a large concentration of zeros. It can capture more zeros than the Poisson model, but potentially still not enough.

4.3.4 Hurdle Regression

The hurdle model comprises two parts. The first is a truncated Poisson model, where truncated means that we fit only the positive counts. The second is a binary logit model, which models whether an event occurs at all (0 or 1). The idea is that positive counts occur once a threshold is crossed (the hurdle is cleared); if the hurdle is not cleared, we observe a count of 0. [26]

Because the zero and positive components are separated, they can also be specified and estimated separately. The model is set up with separate predictors for each component (i.e. one set of predictors for the zero probability and a different set for the positive counts).
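Under this two-part structure the unconditional mean combines both components: E[Y] = P(Y > 0) · λ / (1 − e^(−λ)), the second factor being the mean of a zero-truncated Poisson. A sketch with purely illustrative numbers:

```python
import math

def hurdle_mean(p_positive, lam):
    """Unconditional mean of a Poisson hurdle model:
    P(clearing the hurdle) times the zero-truncated Poisson mean."""
    return p_positive * lam / (1.0 - math.exp(-lam))

m = hurdle_mean(p_positive=0.068, lam=0.08)  # illustrative values only
print(round(m, 4))  # → 0.0708
```

Because the truncated-Poisson mean λ/(1 − e^(−λ)) always exceeds 1, the unconditional mean is always at least the hurdle-clearing probability.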

> model4 <- hurdle(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat,
data=dataCar, dist = "poisson")

> summary(model4)

We compute the simple hurdle model using a two part input:

> fm_hurdle <- hurdle(numclaims ~ exposure + area | exposure + veh_body + agecat, data = dataCar, dist = "poisson")

> summary(fm_hurdle)

Call:
hurdle(formula = numclaims ~ exposure + area | exposure + veh_body + agecat, data = dataCar, dist = "poisson")

Pearson Residuals:
Min 1Q Median 3Q Max

-0.6557 -0.2971 -0.2319 -0.1871 13.4715

Count model coefficients (truncated poisson with log link):

Estimate Std. Error Z value Pr( | z | )

(Intercept) -2.828 0.200 -14.124 0.000 ***

exposure 1.583 0.240 6.600 0.000 ***

areaB -0.399 0.166 -2.403 0.016 *

areaC -0.409 0.149 -2.684 0.007 **

areaD -0.415 0.212 -1.956 0.051 .

areaE -0.175 0.215 -0.813 0.416

areaF -0.004 0.221 -0.017 0.986

Zero hurdle model coefficients (binomial with logit link):

Estimate Std. Error Z value Pr( | z | )

(Intercept) -2.282 0.380 -6.004 0.000 ***

exposure 1.859 0.055 33.922 0.000 ***

veh_bodyCONVT -1.603 0.702 -2.283 0.022 *

veh_bodyCOUPE -0.658 0.398 -1.654 0.098 .

veh_bodyHBACK -1.053 0.378 -2.786 0.005 **

veh_bodyHDTOP -0.900 0.388 -2.318 0.020 *

veh_bodyMCARA -0.416 0.474 -0.877 0.380

veh_bodyMIBUS -1.132 0.409 -2.766 0.006 **

veh_bodyPANVN -1.009 0.400 -2.523 0.012 *

veh_bodyRDSTR -0.883 0.832 -1.062 0.288

veh_bodySEDAN -1.033 0.378 -2.734 0.006 **

veh_bodySTNWG -0.977 0.378 -2.583 0.010 **

veh_bodyTRUCK -1.097 0.389 -2.821 0.005 **

veh_bodyUTE -1.245 0.382 -3.258 0.001 ***

agecat2 -0.214 0.059 -3.628 0.000 ***

agecat3 -0.267 0.057 -4.648 0.000 ***

agecat4 -0.302 0.057 -5.262 0.000 ***

agecat5 -0.517 0.064 -8.107 0.000 ***

agecat6 -0.530 0.073 -7.283 0.000 ***


---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC: 34771
BIC: n/a
Theta: count = 1.9022
Number of iterations in BFGS optimisation: 14
Log-likelihood: -1.736e+04 on 26 Df

> data.frame(exp(coef(fm_hurdle)))

Count_model Zero_hurdle_model

(Intercept) 0.059 0.102

exposure 4.869 6.420

veh_bodyCONVT 0.201

veh_bodyCOUPE 0.518

veh_bodyHBACK 0.349

veh_bodyHDTOP 0.407

veh_bodyMCARA 0.659

veh_bodyMIBUS 0.323

veh_bodyPANVN 0.365

veh_bodyRDSTR 0.413

veh_bodySEDAN 0.356

veh_bodySTNWG 0.377

veh_bodyTRUCK 0.334

veh_bodyUTE 0.288

areaB 0.671

areaC 0.670

areaD 0.660

areaE 0.840

areaF 0.996

agecat2 0.808

agecat3 0.766

agecat4 0.740

agecat5 0.597

agecat6 0.589

Interpretation:

With the model separated into two components, each part is interpreted differently. The count part models the number of claims, whereas the zero-hurdle part models the probability of making any claim (0 or 1).

(Count model) The baseline average for the count part of the hurdle model is 0.059. An increase in exposure again drastically increases the expected number of claims, multiplying it by 4.869. The only other variable in this part of the model is the area of residence. Area F appears to be a much higher-risk area than the others, with a factor of 0.996 that barely reduces the expected number of claims, whereas area B multiplies it by 0.671.

(Zero-hurdle model) The baseline odds of making a claim are 0.102. These odds are multiplied by 6.42 for a one unit increase in exposure. All vehicle types reduce the odds by factors of roughly 0.2 to 0.66 relative to the base level.

> fm_hurdle <- hurdle(numclaims ~ exposure + area | exposure + veh_body + agecat, data =
dataCar, dist = "poisson")

> model4 <- hurdle(numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat,
data = dataCar, dist = "poisson")

> lrtest(fm_hurdle, model4)

Likelihood ratio test

Model 1: numclaims ~ exposure + area | exposure + veh_body + agecat

Model 2: numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat

#Df LogLik Df Chisq Pr(>Chisq)

1 26 -17360

2 56 -17327 30 64.342 0.000 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I have performed a likelihood ratio test comparing the simplified hurdle model with the model containing all regressors, to determine whether the fuller model produces a better fit. The test is highly significant, indicating the full model does improve the likelihood; nevertheless, we continue to use the simplified hurdle model for parsimony and ease of interpretation.
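The statistic behind this table is twice the difference in log-likelihoods, referred to a chi-squared distribution whose degrees of freedom equal the difference in parameter counts. Recomputing from the rounded log-likelihoods printed above (hence the small discrepancy with the printed 64.342):

```python
ll_reduced, ll_full = -17360.0, -17327.0   # rounded logLik values from lrtest
lr_stat = 2 * (ll_full - ll_reduced)       # Wilks likelihood ratio statistic
df = 56 - 26                               # difference in #Df between the models
print(lr_stat, df)  # → 66.0 30
```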

4.3.5 Zero-Inflated Regression

The Zero-Inflated Poisson (ZIP) Regression model is another two-part model used for count data
that exhibit over-dispersion and excess zeros. However, this time both parts predict zero counts
so that zero counts can arise from either source - the zero component distribution or the count
distribution. The ZIP model fits the two separate regression models simultaneously. One is a logit
model that models the probability of being eligible for a non-zero count.  The other models the
size of that count. [27]

The subpopulation of individuals who are ‘not at risk’ of making a claim is modelled in the zero
part, whereas the subpopulation of policyholders who are ‘at risk’ is modelled in the count part. In
this latter part, it is still possible to observe zero values (due to the usual Poisson distribution).
Thus, the ZIP  model has two parts, a poisson count model and the logit model for predicting
excess zeros. [28]
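The ZIP probability mass function mixes a point mass at zero (with probability π) with an ordinary Poisson(λ) count. A minimal sketch with illustrative values of π and λ:

```python
import math

def zip_pmf(k, pi, lam):
    """P(Y = k) under a zero-inflated Poisson: a structural zero with
    probability pi, otherwise an ordinary Poisson(lam) count."""
    pois = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi + (1 - pi) * pois if k == 0 else (1 - pi) * pois

p0 = zip_pmf(0, pi=0.3, lam=0.25)
print(round(p0, 4))  # → 0.8452
```

The zero probability (0.8452 here) always exceeds the plain Poisson value (e^(−0.25) ≈ 0.7788), which is exactly the extra zero mass the model is designed to supply.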

> model5 <- zeroinfl(numclaims ~ veh_value + exposure + veh_body + veh_age + gender + area +
agecat, data=dataCar, dist = "poisson")

> summary(model5)

The simple zero-inflated model with the non-significant variables removed:

> fm_zeroinfl <- zeroinfl(numclaims ~ exposure + agecat | exposure + veh_age, data = dataCar, dist = "poisson")

Call:
zeroinfl(formula = numclaims ~ exposure + agecat | exposure + veh_age, data = dataCar, dist = "poisson")

Pearson Residuals:
Min 1Q Median 3Q Max

-0.4226 -0.3135 -0.2461 -0.1722 13.5300

Count model coefficients (poisson with log link):

Estimate Std. Error Z value Pr( | z | )

(Intercept) -2.378 0.156 -15.282 0.000 ***

exposure 0.679 0.153 4.437 0.000 ***

agecat2 -0.165 0.055 -3.018 0.003 **

agecat3 -0.217 0.053 -4.096 0.000 ***

agecat4 -0.248 0.053 -4.665 0.000 ***

agecat5 -0.460 0.059 -7.743 0.000 ***

agecat6 -0.458 0.068 -6.780 0.000 ***

Zero-inflation model coefficients (binomial with logit link):


Estimate Std. Error Z value Pr( | z | )

(Intercept) 1.39315 0.18521 7.522 0.000 ***

exposure -4.91725 0.59534 -8.260 0.000 ***

veh_age2 -0.36459 0.15884 -2.295 0.022 *

veh_age3 -0.09823 0.14641 -0.671 0.502

veh_age4 -0.09614 0.16724 -0.575 0.565

---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC: 34710

BIC: n/a

Log-likelihood: -1.734e+04 on 12 Df

Number of iterations in BFGS optimisation: 42

Here we can see that in the count part of the zero-inflated Poisson regression model only exposure and the age-category levels are significant. The zero-inflation part of the model shows fewer significant variables, with only exposure and level 2 of vehicle age being significant.

> data.frame(exp(coef(fm_zeroinfl)))

Count_model Zero_inflation_model

(Intercept) 0.093 4.028

exposure 1.973 0.007

veh_age2 0.694

veh_age3 0.906

veh_age4 0.908

agecat2 0.848

agecat3 0.805

agecat4 0.781

agecat5 0.632

agecat6 0.633

Interpretation:

(Count model) The baseline number of claims among the ‘at risk’ subpopulation is 0.093. A one unit increase in exposure multiplies the expected number of claims by 1.973. The only other significant variables in the count part were the age categories: moving through the age levels (young to old), the reduction in the expected number of claims grows larger, with the oldest group's factor being 0.633.

(Zero-inflation model) The baseline odds of being a structural zero (i.e. ‘not at risk’ of claiming) are 4.028. These odds are multiplied by 0.007 for a one unit increase in exposure, so greater exposure sharply reduces the chance of belonging to the zero group. For the first time, vehicle age has become a significant factor in the zero-inflated part. Older vehicles show very little change in these odds (factors of 0.906 and 0.908), whereas vehicles in age class 2 multiply them by 0.694.

> fm_zinb0 <- zeroinfl(formula = numclaims ~ exposure + agecat | exposure + veh_age, data = dataCar, dist = "poisson")

> fm_zinb <- zeroinfl(formula = numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat, data = dataCar, dist = "poisson")

> lrtest(fm_zinb0, fm_zinb)

Likelihood ratio test

Model 1: numclaims ~ exposure + agecat | exposure + veh_age

Model 2: numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat

#Df LogLik Df Chisq Pr(>Chisq)

1 12 -17344

2 56 -17292 44 102.93 0.000 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I have performed another likelihood ratio test, comparing the zero-inflated model with all regressors against the one containing only the significant regressors. Again the test is highly significant, indicating the fuller model improves the likelihood; we nevertheless retain the simplified zero-inflated model, removing the non-significant variables for parsimony.

4.4 Comparison

Now that we have fitted these five models, it is worthwhile understanding what they have in
common and any major differences in the outcomes of the coefficients.

> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nb)

> sapply(fm, function(x) coef(x)[1:19])

> fm2 <- list("Hurdle" = fm_hurdle)

> sapply(fm2, function(x) coef(x)[1:26])

> fm3 <- list("ZIP" = fm_zeroinfl)

> sapply(fm3, function(x) coef(x)[1:12])

            ML-Pois   Quasi-Poisson   NB      Hurdle(Count)   Hurdle(Zero)   ZIP(Count)   ZIP(Zero)

(Intercept) -2.416 -2.416 0.339 -2.828 -2.282 -2.378 1.393

exposure 1.798 1.798 0.052 1.583 1.859 0.679 -4.917

CONVT -1.515 -1.515 0.673 -1.603

COUPE -0.543 -0.543 0.356 -0.658

HBACK -0.953 -0.953 0.337 -1.053

HDTOP -0.825 -0.825 0.347 -0.900

MCARA -0.366 -0.366 0.429 -0.416

MIBUS -1.031 -1.031 0.369 -1.132

PANVN -0.884 -0.884 0.358 -1.009

RDSTR -0.439 -0.439 0.687 -0.883

SEDAN -0.906 -0.906 0.337 -1.033

STNWG -0.873 -0.873 0.337 -0.977

TRUCK -0.968 -0.968 0.348 -1.097

UTE -1.122 -1.122 0.341 -1.245

veh_age2 -0.365

veh_age3 -0.098

veh_age4 -0.096

areaB -0.399

areaC -0.400

areaD -0.415

areaE -0.175

areaF -0.004

agecat2 -0.180 -0.180 0.055 -0.214 -0.165

agecat3 -0.235 -0.235 0.054 -0.267 -0.217

agecat4 -0.262 -0.262 0.054 -0.302 -0.248

agecat5 -0.479 -0.479 0.060 -0.517 -0.460

agecat6 -0.480 -0.480 0.069 -0.530 -0.458

Observing this table makes it much easier to see the similarities and differences between the models. The Poisson and quasi-Poisson models have identical coefficient values, and each coefficient changes only very slightly when we move to the NB model: the likelihood-based models show little variation. When we move on to the zero-augmented models, however, the differences become more apparent. The hurdle model deviates most from the likelihood-based models, this time producing values slightly lower than those of the NB and Poisson models.

> cbind("ML-Pois" = sqrt(diag(vcov(model1))), sapply(fm[-1], function(x) sqrt(diag(vcov(x)))[1:27]))

            ML-Pois   Quasi-Poisson   NB      Hurdle(Count)   Hurdle(Zero)   ZIP(Count)   ZIP(Zero)

(Intercept) 0.320 0.322 -2.430 0.200 0.380 0.156 0.185

exposure 0.051 0.051 1.807 0.240 0.055 0.153 0.595

CONVT 0.658 0.663 -1.502 0.702

COUPE 0.337 0.339 -0.531 0.398

HBACK 0.317 0.320 -0.941 0.378

HDTOP 0.328 0.330 -0.814 0.388

MCARA 0.408 0.411 -0.355 0.474

MIBUS 0.350 0.352 -1.022 0.409

PANVN 0.339 0.341 -0.874 0.400

RDSTR 0.658 0.663 -0.439 0.832

SEDAN 0.317 0.320 -0.895 0.378

STNWG 0.318 0.320 -0.862 0.378

TRUCK 0.328 0.331 -0.959 0.389

UTE 0.322 0.324 -1.111 0.382

veh_age2 0.159

veh_age3 0.146

veh_age4 0.167

areaB 0.166

areaC 0.149

areaD 0.212

areaE 0.215

areaF 0.221

agecat2 0.054 0.055 -0.184 0.059 0.055

agecat3 0.053 0.053 -0.237 0.057 0.053

agecat4 0.053 0.053 -0.265 0.057 0.053

agecat5 0.059 0.059 -0.482 0.064 0.059

agecat6 0.067 0.068 -0.484 0.073 0.068

The standard errors give a different comparison. For the count part, the SEs are very similar across all models (apart from NB). The small values indicate that the coefficients are estimated precisely.

The difference becomes more apparent when we consider the full likelihood as well as the mean:

> rbind(logLik = sapply(fm, function(x) round(logLik(x), digits = 0)), Df = sapply(fm, function(x) attr(logLik(x), "df")))

ML-Pois Quasi-Poisson NB Hurdle ZIP

loglik -17338 NA -17366 -17327 -17292

Df 19 19 20 30 44

We can see here that the log-likelihood of the NB model is lower than that of the other models, while the quasi-Poisson model has no likelihood value associated with it. The Poisson model slightly improves on the NB fit, as does the hurdle model. The zero-inflated model has the highest log-likelihood, suggesting it fits this data best. However, we cannot use this alone to determine the best model.

It is also helpful for us to know the expected number of zeros in each likelihood-based model. We
do this by computing:

> round(c("Obs" = sum(dataCar$numclaims < 1), "ML-Pois" = sum(dpois(0, fitted(fm_pois))), "NB" = sum(dnbinom(0, mu = fitted(fm_nb), size = fm_nb$theta)), "Hurdle" = sum(predict(fm_hurdle, type = "prob")[,1]), "ZIP" = sum(predict(fm_zeroinfl, type = "prob")[,1])))

Obs ML-Pois NB Hurdle ZIP

63232 63147 63238 63232 63188

We use this result to compare each model's expected number of zeros with the observed number of zero counts. By construction, the expected number of zero counts in the hurdle model matches the observed number exactly. Both the ML-Poisson and ZIP models under-predict the number of zeros, whilst the NB model comes very close, slightly over-predicting it.

The following table summarises all the significant estimated coefficient regression values, their
standard errors and the remaining comparisons between the models we looked at in 4.4.

Type          GLM                               Zero-Augmented
Distribution  Poisson   Poisson    NB           Hurdle(Count)  Hurdle(Zero)  ZIP(Count)   ZIP(Zero)
Method        ML        Quasi      ML           ML             ML            ML           ML
Model         fm_pois   fm_qpois   fm_nb        fm_hurdle      fm_hurdle     fm_zeroinfl  fm_zeroinfl

(Intercept) -2.416 -2.416 0.339 -2.828 -2.282 -2.378 1.393


0.320 0.322 -2.430 0.200 0.380 0.156 0.185

exposure 1.798 1.798 0.052 1.583 1.859 0.679 -4.917


0.051 0.051 1.807 0.240 0.055 0.153 0.595

CONVT -1.515 -1.515 0.673 -1.603


0.658 0.663 -1.502 0.702

COUPE -0.543 -0.543 0.356 -0.658


0.337 0.339 -0.531 0.398

HBACK -0.953 -0.953 0.337 -1.053


0.317 0.320 -0.941 0.378

HDTOP -0.825 -0.825 0.347 -0.900


0.328 0.330 -0.814 0.388

MCARA -0.366 -0.366 0.429 -0.416


0.408 0.411 -0.355 0.474

MIBUS -1.031 -1.031 0.369 -1.132


0.350 0.352 -1.022 0.409
PANVN -0.884 -0.884 0.358 -1.009
0.339 0.341 -0.874 0.400
RDSTR -0.439 -0.439 0.687 -0.883
0.658 0.663 -0.439 0.832
SEDAN -0.906 -0.906 0.337 -1.033
0.317 0.320 -0.895 0.378
STNWG -0.873 -0.873 0.337 -0.977
0.318 0.320 -0.862 0.378
TRUCK -0.968 -0.968 0.348 -1.097
0.328 0.331 -0.959 0.389
UTE -1.122 -1.122 0.341 -1.245
0.322 0.324 -1.111 0.382
veh_age2 -0.365
0.159
veh_age3 -0.098

0.146

veh_age4 -0.096
0.167

areaB -0.399
0.166
areaC -0.400
0.149
areaD -0.415
0.212
areaE -0.175
0.215
areaF -0.004
0.221
agecat2 -0.180 -0.180 0.055 -0.214
0.054 0.055 -0.184 0.055
agecat3 -0.235 -0.235 0.054 -0.267
0.053 0.053 -0.237 0.053
agecat4 -0.262 -0.262 0.054 -0.302
0.053 0.053 -0.265 0.053
agecat5 -0.479 -0.479 0.060 -0.517
0.059 0.059 -0.482 0.059
agecat6 -0.480 -0.480 0.069 -0.530
0.067 0.068 -0.484 0.068

no. parameters   19   19   20   30   44

log L -17338 n/a -17366 -17327 -17292

AIC 34813 n/a 34772 34771 34710

BIC 34987 n/a 34955 n/a n/a

Expected number of zero counts   63147   n/a   63238   63232   63188

Table 9: Summary of fitted count regression models for dataCar dataset: coefficient estimates from count model, zero-inflation
model (both with standard errors in brackets), number of estimated parameters, maximised log-likelihood, AIC, BIC and expected
number of zeros (sum of fitted densities evaluated at zero). The observed number of zeros is 63,232 in 67,856 observations.

[23] Regression Models for Count Data in R - Journal of Statistical Software, July 2008, Volume 27, Issue 8.

4.5 Other Comparisons

Deviance Residuals: a measure of how much each observation deviates from the fitted mean.

• The Poisson model has a median deviance residual close to zero, which means the model is not biased in a particular direction.

• The quasi-Poisson model has deviance residual values identical to those of the ML-Poisson model.

• The NB model improves on this slightly, with a median value slightly closer to zero.

These low deviance residual values indicate that the model is a good fit for the data.

Pearson Residuals: obtained by normalising each residual by the square root of the estimated variance.

• The hurdle model has the best (smallest) median Pearson residual, with a value of -0.2.

• The zero-inflated model shows slightly more deviation than the hurdle model, but less than the count models.
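As a concrete check of these definitions, both residual types can be computed by hand for a Poisson GLM and compared with R's residuals() function (synthetic data here; the models in the text would be treated identically):

```r
# Deviance and Pearson residuals computed by hand for a Poisson GLM.
set.seed(7)
x   <- rnorm(300)
y   <- rpois(300, exp(0.3 + 0.4 * x))
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

# Pearson: residual normalised by the square root of the fitted variance (= mean)
r_pearson <- (y - mu) / sqrt(mu)

# Deviance: signed square root of each observation's deviance contribution.
# pmax() guards against tiny negative values from floating-point rounding.
dev_i <- pmax(2 * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu)), 0)
r_dev <- sign(y - mu) * sqrt(dev_i)

all.equal(r_pearson, residuals(fit, type = "pearson"),  check.attributes = FALSE)
all.equal(r_dev,     residuals(fit, type = "deviance"), check.attributes = FALSE)
```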

Dispersion Parameter: indicates whether the distribution is wide or narrow. The glm function can use a dispersion parameter to model the variability.
• For the Poisson model, the dispersion parameter is fixed at 1.

• For the quasi-Poisson model, this value is instead estimated from the data, to account for the quasi-likelihood estimation.
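The quasi-Poisson dispersion estimate can be reproduced directly: it is the Pearson chi-square statistic divided by the residual degrees of freedom. A minimal sketch on simulated (genuinely Poisson, so equidispersed) data:

```r
# Estimating the dispersion parameter as used by quasi-Poisson:
# Pearson chi-square divided by the residual degrees of freedom.
set.seed(11)
x <- rnorm(400)
y <- rpois(400, exp(0.2 + 0.3 * x))
fit  <- glm(y ~ x, family = poisson)
qfit <- glm(y ~ x, family = quasipoisson)

phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi                          # close to 1 here, as the data are genuinely Poisson
summary(qfit)$dispersion     # the same estimate reported by the quasi-Poisson fit
```

A value of phi well above 1 would signal the overdispersion discussed for the dataCar claim counts.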

AIC: The Akaike information criterion (AIC) is a data-based measure of the relative quality of a model.

• The AICs of all models are relatively similar. A low AIC value indicates a good fit with low complexity.

• The Poisson model shows a worse (higher) value than the other models, indicating it is not as good a fit.

BIC: Adding more parameters can always increase the likelihood, but this may result in overfitting. The BIC addresses this by introducing a penalty term for the number of parameters in the model.
• We can only obtain BIC values for the two main likelihood models, and they follow the same pattern as the AIC values.

Here the BIC is higher than the AIC; for either criterion, the lower the value, the better the model. The ZIP gives the lowest AIC, and the NB model gives the lowest BIC.

[29]
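Both criteria follow directly from the maximised log-likelihood. A small base-R check on a synthetic Poisson GLM (not one of the fitted models above) confirms the definitions used here:

```r
# Verifying AIC = -2*logL + 2k and BIC = -2*logL + k*log(n) on a toy model.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.2 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)

k  <- attr(logLik(fit), "df")     # number of estimated parameters
ll <- as.numeric(logLik(fit))     # maximised log-likelihood

aic_manual <- -2 * ll + 2 * k
bic_manual <- -2 * ll + k * log(n)

all.equal(aic_manual, AIC(fit))
all.equal(bic_manual, BIC(fit))
```

Since log(n) exceeds 2 for any dataset with more than seven observations, the BIC penalty is the larger one here, which is why the BIC values in the table sit above the AIC values.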

4.6 Results and Conclusions

The conclusions we can draw from these results are that the zero-augmented models are a better fit for this dataset, due to the overdispersion of the data. Breaking the model down into separate components allows us to interpret two different quantities: the probability of making a claim (0/1) and the average number of claims (given at least one claim is made). If we were to base our findings on the Poisson model alone, the conclusions about claim counts would rest not only on the observations of policyholders who made a claim, but also on those who did not. Consequently, this would lead to inaccurate insurance premium valuations if used in practice. Analysing the hurdle part of the hurdle model on its own gives us a better indication of which variables are statistically significant regressors for claim counts. The AIC values help confirm that the Poisson model is not a good fit, with the zero-augmented models giving the lowest values.

We can also conclude that, for the data we have analysed, gender is not a statistically significant predictor of claim frequency in any of the models. The most significant variables across all models are the vehicle body type and the age category. We were able to conclude from the results that being in a higher age bracket reduces the average number of claims. As well as this, we were able to show that utility vehicles are the most significant vehicle body. COUPE, MCARA and RDSTR were generally non-significant factors consistently throughout the model testing.

5. What Does the Future Hold for the Motor Insurance Industry
There are many factors which need to be taken into account when discussing in which direction
the industry will progress, technology being the obvious and main one. The rate of innovation
driven by technological advances in computing and wireless data sharing is growing
exponentially.

Unfortunately for the insurance industry, as cars become more technologically advanced, the cost of repair has also increased. The effect on insurers' profits, and the resulting rise in premiums, could be mitigated by an entirely new way of insuring drivers.

5.1 Telematics and the Car Insurance Market

Telematics insurance is growing worldwide. In 2013, the Usage-Based Insurance (UBI) market
was less than 1% globally. However, in 2016 UBI had over 14 million subscribers accounting for a
32% market growth. [34] A large part of this growth is driven by younger consumers more likely to
accept and trust that the technology used is reliable and beneficial.

As shown in Figure 7, policyholders between the ages of 17 and 20 with telematics installed in their vehicle account for over 53% of total telematics policyholders. Moreover, younger drivers are more likely to adapt their driving behaviours in exchange for a discounted premium, according to a Willis Towers Watson survey in 2015. [37]

Figure 7: Telematics Policyholders by Age

[35] https://www.postonline.co.uk/technology/3867331/telematics-watch-the-state-of-play-for-2019

5.1.1 Real-Time Telematics Data

It seems clear that the future of telematics lies in the use of smartphone technology. The security features of smartphones already facilitate wireless financial transactions and online banking. It therefore appears feasible that an advanced UBI insurance model is already technologically possible, and it would become more advanced and more accurate as 5G wireless becomes widespread.

Real-time data combined with artificial intelligence (AI) and machine learning (ML) could take UBI to a whole new level of accuracy and flexibility. For a consumer, the cost could fall to virtually zero while their vehicle is parked in a locked garage in a safe area, and rise to a relatively high premium commensurate with aggressive driving or with a vehicle being left in a vulnerable place. A comprehensive real-time UBI model could financially incentivise policyholders to drive more carefully and less frequently. This would reduce emissions and accident casualty rates, benefitting both insurer and consumer. Other resulting benefits could even include savings for the NHS, with fewer accidents and the health benefits of opting to walk a short journey rather than driving.
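To make the idea concrete, here is a purely hypothetical sketch of such a real-time pricing rule. Every input, weight and base rate below is invented for illustration only; a real UBI tariff would be calibrated from claims data rather than chosen by hand:

```r
# Hypothetical daily UBI premium rule -- all weights and rates are invented.
ubi_premium_per_day <- function(miles, harsh_brakes, night_miles,
                                parked_secure, base = 0.50) {
  # A vehicle parked securely all day costs almost nothing to insure
  if (miles == 0 && parked_secure) return(0.05)
  usage_charge <- 0.04 * miles + 0.25 * harsh_brakes + 0.02 * night_miles
  base + usage_charge
}

ubi_premium_per_day(miles = 0,  harsh_brakes = 0, night_miles = 0,
                    parked_secure = TRUE)    # near-zero day
ubi_premium_per_day(miles = 30, harsh_brakes = 4, night_miles = 10,
                    parked_secure = FALSE)   # aggressive-driving day
```

The point of the sketch is the shape of the incentive, not the numbers: careful, infrequent driving maps directly onto a lower daily charge.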

5.2 Autonomous Cars

Artificial intelligence (AI) has been in the headlines in recent years as the next big breakthrough in the industry. Many cars all over the world are now being manufactured with the latest AI technology, with machine learning systems being implemented at a high rate globally. [32] AI, combined with radars, cameras, sensors and input from Google Street View, has helped to develop self-driving (autonomous) vehicles which can navigate through traffic and handle complex situations.

An autonomous vehicle is one that is able to operate itself and perform necessary functions without any human intervention, responding to external conditions much as a human driver would. [33] This development is fairly recent, with only a few companies developing and testing autonomous cars, such as Audi, BMW, Ford, Google and Tesla. As each developer may have slightly different systems, the National Highway Traffic Safety Administration (NHTSA) lays out six levels of automation:

Figure 8: Levels of automation in autonomous cars

[32] https://searchenterpriseai.techtarget.com/definition/driverless-car

Since 2000, automation features such as cruise control and antilock brakes have been added incrementally, for convenience and to improve safety. Better technology allowed advanced safety features to be added by about 2010, when features like collision warnings and blind-spot detection became available in vehicles. It was not until 2016 that cars started moving towards partial autonomy, with features like lane guidance and self-parking. Things have progressed very quickly since then, moving from level 1 in 2010 to level 5 by 2020. [32]

5.2.1 Autonomous Vehicles and the Car Insurance Market

With the autonomous vehicle market growing year on year, 5G will have a major influence on
making cars smarter and safer. 5G will be an essential feature in progressing the industry, as
eventually a 5G ‘wireless mesh’ will enable the instant transfer of data from a car to a hub and
more importantly from car to car.

However, governments may in future decide that the only safe way a fully autonomous private-use vehicle system could work is if all autonomous vehicles were essentially the same. Having vehicles with very different capabilities sharing the same road space would make an acceptably safe system much harder to achieve. If all vehicles were fully autonomous and identical, it would probably mean the end of private vehicle ownership and of the whole basis of private car insurance in its current form. Individuals using fully autonomous vehicles would be no different from passengers currently using public transport, making conventional car insurance obsolete. There would most likely be a need for all users of autonomous vehicles to hold some form of indemnity policy covering the misuse of a vehicle, so there may even be a larger, more profitable market for insurers in the future.

5.2.2 Advantages and Disadvantages of Autonomous Vehicles

Advantages:

• A level 5 autonomous driving system will be just as safe (if not safer) than a human driver.
• Could significantly reduce the number of crashes.
• Economic benefits will be huge, with lower monetary losses from accident and injury claims.
• Smoother-flowing traffic.
• Allows individuals with physical limitations to drive.
• The driver of a fully autonomous car will be able to carry out activities or rest during a long commute.
• Improved driver safety.
• Improved fuel efficiency.

Disadvantages:

• May be disconcerting for passengers.
• Deciding who is liable if a self-driving car crashes.
• Drivers may become over-reliant on the auto-pilot technology.
• Cannot yet make rational decisions, such as where to move if an emergency vehicle needs to pass.
• The technology is not always 100% accurate.
• The vehicle may swerve unnecessarily if it detects objects on the road.
• Roads may need to be adapted to make the environment more self-driving friendly.
• Concerns that the software used to operate autonomous vehicles can be hacked.

5.3 Potential for Insurers

It is imperative that insurers keep up with the latest technology to continue to be competitive in
the industry. One of the main barriers to overcome for insurers is not necessarily the method of
valuing premiums, but the way in which data is collected. Data collection using AI and ML could
prove to be the technological breakthrough the industry needs to thrive.

A problem for insurers will almost certainly be current data protection laws. Driving behaviour data will be the property of the insurer, and the sharing of that data will be essential to create a competitive insurance market. There is research into a possible centralised agent to collect all telematics data (similar to credit scoring agencies), which would allow insurers to use each individual's past driving data to create a baseline for a new policy.

Advanced telematics certainly has the potential to encourage people to drive more safely, or even
drive less overall. Surveys have shown that drivers are more willing to change their driving
behaviour for a financial incentive rather than the threat of a punishment.

Current methods of assessment are too limited in scope and diversity to provide a particularly accurate measure of risk. This is a problem that could be solved by implementing advanced telematics and UBI: an insurance model that reduces accident and injury claims, thus increasing profits and reducing premiums, while also reducing the impact on the environment and society in general.

6. Conclusion
The motor insurance industry has seen substantial reform in line with a monumental shift in the
use of technology. It is essential to keep up with the volatile environment which can see rapid
changes to processes and technologies used throughout the industry, both in the motor vehicles
themselves and in the insurance market. It is the only way for insurance companies to remain
competitive in an industry which progresses so quickly.

Telematics is the clear next step, whether through the gradual implementation of new and improved devices, or through an optimal PHYD system that can be installed in every new car manufactured. Then, hopefully in the near future, cars can be interconnected and devices can link with smartphones over a fast 5G network, providing real-time data to assess driving behaviour. Beyond this, autonomous cars will see widespread use in the more distant future, as their production and maintenance costs fall.

6.1 Dataset Methodology

Running the initial Poisson model for all variables from the dataset gave coefficient estimate
values higher than anticipated, as well as a number of variables which were not statistically
significant. Therefore, we removed these regressors and ran the model again. The output this time
saw lower coefficient estimates and generally lower standard error values. This trend was
consistent for all the models throughout Section 4. As a result, I have only presented the output
for the final four models having already removed the non-significant variables.

Looking at Table 9, which compares the coefficient estimates of all regressors, we can see that the zero-inflated model shares very few significant regressors with the other models. Because of this, it is hard to use this model as a reliable way of determining which variables are significant for claim counts. Although its AIC and log-likelihood values are better than those of the other models, they therefore cannot be taken at face value. However, the zero-count part of the model is comparable to the Poisson, quasi-Poisson and NB models. The zero-inflated model suggests, more so than the others, that a decrease in age category increases the chances of making a claim.

We then look at the other zero-augmented model, the hurdle model, to see whether it fits this dataset better than the previous GLMs. Firstly, comparing the coefficient estimates, the count part of the hurdle model produces values indicating stronger correlations between all significant variables and the number of claims. Furthermore, an easy comparison is between the log-likelihood and AIC values, both of which indicate a better fit than the other models. A likely explanation is that the Poisson model assumes the mean and variance are equal, which we can clearly see from the data is not the case, due to the overdispersion. In addition, Table 9 shows that the expected number of zeros in the Poisson, quasi-Poisson and NB models does not match the true number of observed zeros. The hurdle model matches its expected number with the observed number by construction, again indicating a better fit.

Although the hurdle and zero-inflated models give us similar results, the hurdle model is easier to interpret and gives a clearer understanding of the counts, because the zero counts are modelled separately from the claim counts. The zero-inflated model, by contrast, mixes the zero counts into both parts of the model, meaning we cannot generalise its expected number of zero counts to the general population.

6.2 Main Conclusions From the Data

Building on the conclusions made in Section 4.6, we were able to remove the variable vehicle value because it was not statistically significant. At first this may come as a surprise, as one might expect more expensive, and often faster, vehicles to correlate with insurance claims. However, as these models were run only for claim count and not claim severity, this is a plausible outcome for this variable. If claim severity were modelled as well, it is highly likely that vehicle value would become significant.

Another non-significant variable is gender. This conclusion supports the EU regulation introducing gender-neutral car insurance premium valuations, which was put in place in an attempt to reduce the premium gap between men and women. [39] However, as this data was collected in 2004/05, before the new EU regulations came into force, it seems that gender has not been a significant determining factor even long before 2012. It would therefore seem logical for many more countries to follow the same procedure. Contrary to this, there was actually a spike in the insurance premium gap between men and women following the new regulation, which saw the gap increase almost five-fold, from men paying on average £21 more than women to £101 more. [40] An explanation for this contradictory statistic is perhaps that insurance companies adopted new explanatory variables correlated with gender, such as occupation, which more accurately reflected individual risk. This suggests, however, that gender may still be significant, which leads us to the limitations of this project.

6.3 Limitations and Remedy

One problem with this dataset, which may affect the conclusion above, is that some of the variables may be correlated with each other. For example, gender is likely correlated with both vehicle type and vehicle value, which would result in multicollinearity. Vehicle value was removed from the models as it was not significant. However, including gender alongside all vehicle types may have resulted in underestimated coefficient values for the vehicle types; gender may therefore have been significant after all, and our failure to reject a false null hypothesis would be a type II error. Including gender and dropping vehicle types would offer an interesting comparison which may shed light on this limitation. The presence of multicollinearity could also potentially explain the high standard errors in the first Poisson model, which includes all regressors. It may therefore be necessary to test for multicollinearity on datasets such as this one in the future.
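Such a multicollinearity check can be carried out with variance inflation factors, computable in base R by regressing each predictor on the others. The sketch below uses synthetic data; with the real dataset, the model's predictors would take the place of x1, x2 and x3:

```r
# Variance inflation factors in base R: VIF_j = 1 / (1 - R^2_j),
# where R^2_j comes from regressing predictor j on the remaining predictors.
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)  # strongly correlated with x1
x3 <- rnorm(n)                                 # independent
X  <- data.frame(x1, x2, x3)

vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)   # x1 and x2 exceed the common rule-of-thumb cutoff of 5
```

Variables with inflated VIFs are the ones whose coefficient standard errors would be blown up in the full model, matching the pattern suspected above.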

Telematics is still a relatively new implementation, which makes telematics data hard to access. In addition, many telematics options are still yet to be put on the market. Consequently, running similar models on telematics data is not possible without access to large insurance databases, which are often kept confidential because of the car-tracking devices and data protection laws. By teaming up with insurance companies, and using ID numbers and location codes instead of policyholder names and locations, telematics data could be analysed in a similar way. Another way of combating this limitation would be to use telematics vehicle-driver simulation software. This would entail running unlimited possibilities of driving patterns and behaviours around real-life roads in a simulated environment. The data collected from these simulations could then be compared with real-life situations as they occur.

6.4 Future Research

Finally, to conclude, there are many ways in which this analysis could be built upon with more time and resources. Extending the models to account for claim severity would allow premium valuation estimates to be calculated from combinations of different variables. However, it would be essential to test for multicollinearity to ensure the model output is accurate, so that valuations are not under- or over-estimated.
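One natural form for such an extension is a Gamma GLM with a log link for claim severity, fitted only to policies with a positive claim amount. The sketch below uses simulated data; the variable names merely echo the dataset, and none of the numbers are estimates from it:

```r
# Hedged sketch: claim severity via a Gamma GLM with log link.
# Fully simulated data -- the names echo dataCar but nothing is estimated from it.
set.seed(5)
n        <- 800
agecat   <- factor(sample(1:6, n, replace = TRUE))
shape    <- 2
mu       <- exp(7 - 0.1 * as.numeric(agecat))   # mean severity falls with age group
claimcst <- rgamma(n, shape = shape, rate = shape / mu)  # positive claim amounts

sev_fit <- glm(claimcst ~ agecat, family = Gamma(link = "log"))
exp(coef(sev_fit))   # multiplicative effects on the expected claim cost
```

Combined with a frequency model, the fitted severities would give expected annual claim cost per policyholder, the quantity a premium valuation ultimately needs.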

Furthermore, as telematics becomes more widely accepted, gathering large databases of telematics data will help refine specific PHYD valuations for individual policyholders. The more data that is collected and analysed, the more accurate the valuations will be. The best and quickest way of doing this is to run simulations for each current and new telematics device introduced on the market.

7. References
[1] https://www.iii.org/article/what-auto-insurance

[2] Auto Insurance Premium Calculation Using Generalized Linear Models. Mihaela David, Faculty of Economics and Business Administration, Alexandru Ioan Cuza University of Iasi, Iasi 700505, Romania.

[3] https://www.investopedia.com/terms/a/auto-insurance.asp

[4] https://www.ft.com/content/786a2204-8de2-11e9-a24d-b42f641eca37

[5] https://www.abi.org.uk/news/news-articles/2018/03/average-motor-insurance-claim-at-a-record-level-says-the-abi/

[6] https://www.statista.com/statistics/830750/estimated-percentage-of-uninsured-drivers-united-kingdom/

[7] https://www.itij.com/latest/news/epidemic-uninsured-drivers-uk

[8] https://www.abi.org.uk/products-and-issues/choosing-the-right-insurance/motor-insurance/age-and-motor-insurance/

[9] https://www.finder.com/uk/car-insurance-statistics

[10] https://www.unicominsurance.com/insurance-news/history-of-uk-car-insurance/

[11] https://www.motorcontinental.co.uk/information/evolution-of-car-insurance/

[12] https://www.swissre.com/dam/jcr:e8613a56-8c89-4500-9b1a-34031b904817/150Y_Markt_Broschuere_UK_EN.pdf

[13] http://www.iaea-online.org/media/content/oral-examinations/07.-motor-insurance-background.pdf

[14] https://www.techradar.com/uk/news/car-tech/telematics-what-you-need-to-know-1087104

[15] Evolution of Insurance: A Telematics-Based Personal Auto Insurance Study

Yuanjing Yao

[16] Telematics System in Usage Based Motor Insurance. Siniša Husnjak, Dragan Peraković, Ivan Forenbacher (University of Zagreb, Faculty of Transport and Traffic Sciences, Vukelićeva 4, 10000 Zagreb, Republic of Croatia); Marijan Mumdziev (Insurance Consulting, Ruthnergasse 12-14/31, 1210 Vienna, Austria).

[17] https://www.confused.com/car-insurance/black-box/telematics-faqs

[18] https://learn.eartheasy.com/guides/fuel-efficient-driving/

[19] https://www.insurancejournal.com/news/national/2015/11/18/389327.htm

[20] https://www.wired.com/2016/03/thousands-trucks-buses-ambulances-may-open-hackers/

[21] Insurance Telematics: Opportunities and Challenges with the Smartphone Solution

Peter Handel, Isaac Skog, Johan Wahlstrom, Farid Bonawiede, Richard Welch, Jens Ohlsson and
Martin Ohlsson

[22] Statistical Rethinking: A Bayesian Course with Examples in R and Stan

Richard McElreath, November 9, 2015

[23] Journal of Statistical Software: Regression Models for Count Data in R

July 2008, Volume 27, Issue 8.

[24] https://towardsdatascience.com/an-illustrated-guide-to-the-poisson-regression-model-50cccba15958

[25] http://www.karlin.mff.cuni.cz/~pesta/NMFM404/NB.html

[26] https://data.library.virginia.edu/getting-started-with-hurdle-models/

[27] https://www.theanalysisfactor.com/zero-inflated-poisson-models-for-count-outcomes/

[28] Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems. Michel Denuit (Institut de Statistique, Université Catholique de Louvain, Belgium); Xavier Maréchal (Reacfin, spin-off of the Université Catholique de Louvain, Belgium); Sandra Pitrebois (Secura, Belgium); Jean-François Walhin (Fortis, Belgium).

[29] Generalised Linear Models in Vehicle Insurance. Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, Volume 62, Number 2, 2014. http://dx.doi.org/10.11118/actaun201462020383

[30] Optimal Bonus-Malus Systems Using Finite Mixture Models. George Tzougas (Department of Statistics, Athens University of Economics and Business); Spyridon Vrontos (Department of Mathematical Sciences, University of Essex); Nicholas Frangos (Department of Statistics, Athens University of Economics and Business).

[31] Generalised Bonus-Malus Systems with a Frequency and a Severity Component on an Individual Basis in Automobile Insurance. Rahim Mahmoudvand and Hossein Hassani.

[32] https://searchenterpriseai.techtarget.com/definition/driverless-car

[33] https://www.twi-global.com/technical-knowledge/faqs/what-is-an-autonomous-vehicle

[34] http://www.naic.org/cipr_topics/topic_usage_based_insurance.htm

[35] https://www.postonline.co.uk/technology/3867331/telematics-watch-the-state-of-play-for-2019

[36] Generalized Linear Models for Insurance Data

Piet de Jong and Gillian Z. Heller

[37] Advanced Analytics and the Future: Insurers Boldly Explore New Frontiers

2017/2018 P&C Insurance Advanced Analytics Survey Report (U.S.)

[38] https://www.machinedesign.com/mechanical-motion-systems/article/21837467/could-5g-be-the-missing-puzzle-piece-for-selfdriving-cars

[39] https://ec.europa.eu/commission/presscorner/detail/en/IP_12_1430

[40] https://www.theguardian.com/money/blog/2017/jan/14/eu-gender-ruling-car-insurance-inequality-worse

8. Appendix

1.1 Factors Affecting Car Insurance Premiums

Table 1: Traditional basic determinants of car insurance premium valuation

1.2 Auto Insurance Statistics in the UK

Figure 1: Average Car Insurance Policy Cost by Age Group, 2018


Figure 2: Average Claim, Average Premium and Claims Frequency - Motor Insurance by Age, 2018

2.2 Stagnant Market

Table 2: Types of motor insurance coverage available to purchase

3.1 How does telematics work?

Table 3: Types of telematics devices currently available to install in vehicles

3.2 PAYD to PHYD

Table 4: Definition of terms Pay-As-You-Drive and Pay-How-You-Drive


Table 5: Comparison of rating variables between traditional and telematics

3.2.1 Benefits of Telematics - Consumers

Figure 3: Results of a survey on whether drivers with telematics change their behaviour


4.2 Generalised Linear Models for Modelling Claim Count Data

Package: insuranceData
Type: Package
Title: A Collection of Insurance Datasets Useful in Risk
Classification in Non-life Insurance.
Version: 1.0
Date: 2014-09-04
Author: Alicja Wolny--Dominiak and Michal Trzesiok
Maintainer: Alicja Wolny--Dominiak <alicja.wolny-dominiak@ue.katowice.pl>
Description: Insurance datasets, which are often used in
claims severity and claims frequency modelling. It helps
testing new regression models in those problems, such as GLM,
GLMM, HGLM, non-linear mixed models etc. Most of the data
sets are applied in the project "Mixed models in ratemaking"
supported by grant NN 111461540 from Polish National Science
Center.
License: GPL-2

Packaged: 2014-09-04 10:38:28 UTC; Woali
Depends: R (>= 2.10)
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2014-09-04 13:46:39
Built: R 3.5.0; ; 2018-04-23 10:09:51 UTC; unix

Details: dataset "dataCar"


Source: http://www.acst.mq.edu.au/GLMsforInsuranceData
References: De Jong P., Heller G.Z. (2008), Generalized
linear models for insurance data, Cambridge University Press
Table 6: Variables used in the dataset ‘dataCar’
Table 7: First 6 rows of the 67,856 observations, giving a clearer idea of the data to be analysed.

> head(dataCar)

> summary(dataCar)

> table(dataCar$numclaims)

> hist(dataCar$numclaims, xlab="No. Claims", ylab="Total", col = "grey", main = "Annual Claims")

> plot(dataCar$claimcst0, xlab="Policy Number", ylab="Claim Amount", main = "Claim Amount")

> boxplot(dataCar$veh_value, main="Box Plot: Vehicle Value", xlab="Vehicle Value", horizontal = TRUE)

> boxplot(dataCar$exposure, main="Box Plot: Exposure", xlab="Exposure", horizontal = TRUE)

> boxplot(dataCar$claimcst0, main="Box Plot: Claim Amount", xlab="Claims Amount", horizontal = TRUE)

> boxplot(dataCar$veh_age, main="Box Plot: Vehicle Age", xlab="Vehicle Age", horizontal = TRUE)

> table(dataCar$veh_body, dataCar$numclaims)

> table(dataCar$agecat, dataCar$numclaims)

> table(dataCar$gender, dataCar$numclaims)

> table(dataCar$area, dataCar$numclaims)

> top13_veh <- table(dataCar$veh_body)[order(-table(dataCar$veh_body))][1:13]

> ggplot(data=subset(dataCar, clm==1 & veh_body %in% names(top13_veh)), aes(x=veh_body, y=claimcst0, color=veh_body)) + geom_boxplot() + theme_bw()

4.3 GLM Framework and Application

Table 8: Overview of discussed count regression models. All GLMs use the same log-linear mean
function but make different assumptions about the remaining likelihood. The zero-augmented
models extend the mean function by modifying (typically, increasing) the likelihood of zero counts.
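The zero modification described in the Table 8 caption can be sketched as follows, writing μ for the Poisson mean, π for the zero-inflation probability and p₀ for the hurdle's zero mass (this notation is assumed here rather than taken from the table):

```latex
% Zero-inflated Poisson (ZIP): a point mass at zero mixed with a Poisson
P(Y=0) = \pi + (1-\pi)\,e^{-\mu}, \qquad
P(Y=k) = (1-\pi)\,\frac{e^{-\mu}\mu^{k}}{k!}, \quad k \ge 1

% Hurdle Poisson: a binary model for zeros plus a zero-truncated Poisson
P(Y=0) = p_0, \qquad
P(Y=k) = (1-p_0)\,\frac{e^{-\mu}\mu^{k}/k!}{1 - e^{-\mu}}, \quad k \ge 1
```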

4.3.1 Poisson Regression

Figure 4: Probability of seeing k events in time t, given λ events occurring per unit time


Figure 5: Poisson Regression Model Components
Figure 6: Poisson Regression Formula for Count Data
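Since Figures 4-6 are reproduced as images, the underlying formulas can be restated in standard Poisson regression notation (the symbols below are conventional choices, not copied from the figures):

```latex
% Figure 4: probability of k events in time t, given rate \lambda per unit time
P(k \text{ events in } t) = \frac{(\lambda t)^{k}\, e^{-\lambda t}}{k!}

% Figures 5-6: Poisson regression for count data, log link
Y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mathbb{E}[Y_i \mid x_i] = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
```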
> model1 <- glm(numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat,
family=poisson, data = dataCar)

> summary(model1)

> fm_pois <- glm(numclaims ~ exposure + veh_body + agecat, family = poisson, data = dataCar)

> summary(fm_pois)

> coeftest(fm_pois, vcov = sandwich)

> exp(coef(fm_pois))

4.3.2 Quasi-Poisson

> model2 <- glm(numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat,
family=quasipoisson, data=dataCar)

> summary(model2)

> fm_qpois <- glm(numclaims ~ exposure + veh_body + agecat, family = quasipoisson, data = dataCar)

> summary(fm_qpois)

4.3.3 Negative-Binomial Regression

> model3 <- glm.nb(numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat,
data=dataCar)

> summary(model3)

> fm_nb <- glm.nb(numclaims ~ exposure + veh_body + area + agecat, data=dataCar)

> summary(fm_nb)

> exp(coef(fm_nb))

4.3.4 Hurdle Regression

> model4 <- hurdle(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat,
data=dataCar, dist = "poisson")

> summary(model4)

> fm_hurdle <- hurdle(numclaims ~ exposure + area | exposure + veh_body + agecat, data=dataCar, dist = "poisson")

> summary(fm_hurdle)

> data.frame(exp(coef(fm_hurdle)))

> fm_hurdle <- hurdle(numclaims ~ exposure + area | exposure + veh_body + agecat, data =
dataCar, dist = "poisson")

> model4 <- hurdle(numclaims ~ veh_value + exposure + veh_body + veh_age + area + agecat,
data = dataCar, dist = "poisson")

> lrtest(fm_hurdle, model4)

4.4 Comparison

> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nb)

> sapply(fm, function(x) coef(x)[1:19])

> fm2 <- list("Hurdle" = fm_hurdle)

> sapply(fm2, function(x) coef(x)[1:26])

> fm3 <- list("ZIP" = fm_zeroinfl)

> sapply(fm3, function(x) coef(x)[1:12])

> cbind("ML-Pois" = sqrt(diag(vcov(fm_pois))), sapply(fm[-1], function(x) sqrt(diag(vcov(x)))[1:27]))

> rbind(logLik = sapply(fm, function(x) round(logLik(x), digits = 0)), Df = sapply(fm, function(x) attr(logLik(x), "df")))

> round(c("Obs" = sum(dataCar$numclaims < 1), "ML-Pois" = sum(dpois(0, fitted(fm_pois))), "NB" = sum(dnbinom(0, mu = fitted(fm_nb), size = fm_nb$theta)), "Hurdle" = sum(predict(fm_hurdle, type = "prob")[,1]), "ZIP" = sum(predict(fm_zeroinfl, type = "prob")[,1])))

Table 9: Summary of fitted count regression models for the dataCar dataset: coefficient estimates from count model, zero-inflation model (both with standard errors in brackets), number of estimated parameters, maximised log-likelihood, AIC, BIC and expected number of zeros (sum of fitted densities evaluated at zero). The observed number of zeros is 63,232 in 67,856 observations.

5.1 Telematics and the Car Insurance Market

Figure 7: Telematics Policyholders by Age

5.2 Autonomous Cars

Figure 8: Levels of automation in autonomous cars

