SQL Business Report

Great Learning Project- Insurance Risk

Sahaj Manocha
Contents
Solution 1
Solution 2
Solution 3.1
Solution 3.2
Solution 4.1
Solution 4.2
Solution 5
Solution 6
Solution 7.1
Solution 7.2
Solution 8
Solution 9.1
Solution 9.2
Solution 10
Solution 11
Solution 12
Solution 13.1
Solution 13.2
Solution 14
Solution 15
Solution 16
Solution 17
Auto Insurance Claims – Risk Assessment
The third-party data provided has been analysed to assess the underlying factors and
extract the essential insights that will enable the teams to make better business decisions.
The following insights and inferences were gathered from the data:

Q1. Write a query to calculate what % of the customers have made a claim in the
current exposure period [i.e., in the given dataset]

Query-
select count(IDpol) as customer_count, 
       sum(case when ClaimNb> 0  then 1 else 0 end) as total_no_of_claims,
       (sum(case when ClaimNb> 0 then 1 else 0 end)/count(IDpol)*100) as per_claim
from auto_insurance_risk;

Solution 1-

In the given dataset, the percentage of customers who made a claim in the current exposure period is
5.0235%.
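
The calculation can be checked on a small scale. The sketch below is not part of the original MySQL analysis: it assumes Python's built-in sqlite3 module and a handful of made-up rows, and shows that a policy with several claims still counts as a single claiming customer.

```python
import sqlite3

# Made-up (IDpol, ClaimNb) rows for illustration; not taken from the real dataset.
rows = [(1, 0), (2, 1), (3, 0), (4, 2), (5, 0), (6, 0), (7, 0), (8, 0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auto_insurance_risk (IDpol INTEGER, ClaimNb INTEGER)")
conn.executemany("INSERT INTO auto_insurance_risk VALUES (?, ?)", rows)

# Multiplying by 100.0 first avoids integer division in SQLite.
customer_count, claimants, per_claim = conn.execute(
    """
    SELECT COUNT(IDpol),
           SUM(CASE WHEN ClaimNb > 0 THEN 1 ELSE 0 END),
           100.0 * SUM(CASE WHEN ClaimNb > 0 THEN 1 ELSE 0 END) / COUNT(IDpol)
    FROM auto_insurance_risk
    """
).fetchone()

print(customer_count, claimants, per_claim)  # 8 2 25.0
```

Policy 4 has two claims but is counted once, so the result is the share of customers who claimed, not the number of claims per customer.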

Q2.1. Create a new column as 'claim_flag' in the table 'auto_insurance_risk' as integer datatype.

Query-
alter table auto_insurance_risk add column claim_flag int;
select * from auto_insurance_risk;

Q2.2. Set the value to 1 when ClaimNb is greater than 0 and set the value to 0 otherwise.

Query-
update auto_insurance_risk 
set claim_flag = case when ClaimNb>0 then 1 else 0 end;
Solution 2.1 and 2.2
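
As a rough illustration of the two steps, the following sketch runs the same ALTER TABLE and UPDATE statements against an in-memory SQLite database from Python. The three rows are invented for the example, and SQLite stands in for the MySQL server used in the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auto_insurance_risk (IDpol INTEGER, ClaimNb INTEGER)")
# Made-up rows: one policy without claims, two with claims.
conn.executemany(
    "INSERT INTO auto_insurance_risk VALUES (?, ?)", [(1, 0), (2, 1), (3, 3)]
)

# Step 1: add the new integer column (it starts out as NULL for every row).
conn.execute("ALTER TABLE auto_insurance_risk ADD COLUMN claim_flag INT")

# Step 2: set the flag to 1 when ClaimNb > 0 and to 0 otherwise.
conn.execute(
    "UPDATE auto_insurance_risk "
    "SET claim_flag = CASE WHEN ClaimNb > 0 THEN 1 ELSE 0 END"
)

flags = conn.execute(
    "SELECT IDpol, ClaimNb, claim_flag FROM auto_insurance_risk ORDER BY IDpol"
).fetchall()
print(flags)  # [(1, 0, 0), (2, 1, 1), (3, 3, 1)]
```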

Q3.1 What is the average exposure period for those who have claimed?

Query-
select claim_flag ,avg(Exposure) as Avg_exposure_period
from auto_insurance_risk
where claim_flag=1;

Solution 3.1-

The average exposure period for the policy holders who have made a claim is 0.642.

Q3.2.  What do you infer from the result?


Query-
select claim_flag ,avg(Exposure) as Avg_exposure_period
from auto_insurance_risk
group by claim_flag;

Solution 3.2-

The inference we can draw from the above data is that the average exposure is higher for
the customers who made a claim than for the policy holders who did not.
Q4.1 If we create an exposure bucket where buckets are like below, what is the % of
total claims by these buckets?
#Buckets are => E1 = 0 to 0.25, E2 = 0.26 to 0.5, E3 = 0.51 to 0.75, E4 > 0.75, You need to consider
ClaimNb field to get the total claim count.

Query-
select
    case
        when Exposure <= 0.25 then 'E1'
        when Exposure <= 0.5 then 'E2'
        when Exposure <= 0.75 then 'E3'
        else 'E4'
    end as exposure_bucket,
    sum(ClaimNb) as total_claims,
    count(IDpol) as total_customer,
    (sum(ClaimNb) / count(IDpol)) * 100 as percentage_claim
from auto_insurance_risk
group by exposure_bucket
order by percentage_claim;

Solution 4.1-

Q4.2 What do you infer from the summary?

Solution 4.2
From the output of 4.1 it can be inferred that the claim percentage increases with the exposure
period: it is highest for the E4 exposure bucket and lowest for the E1 bucket.
Therefore, the pricing of the policy should take the exposure factor into account.
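
The bucketing logic can be tried out on toy data. The sketch below assumes Python's sqlite3 module and twelve invented (Exposure, ClaimNb) rows, chosen so that the claim percentage rises from E1 to E4; the real figures come from the report's MySQL output, not from this example.

```python
import sqlite3

# Made-up (Exposure, ClaimNb) rows for illustration only.
rows = [
    (0.05, 0), (0.10, 0), (0.20, 1), (0.25, 0),  # fall into E1
    (0.30, 0), (0.50, 1),                        # fall into E2
    (0.55, 0), (0.60, 1), (0.70, 0), (0.75, 2),  # fall into E3
    (0.80, 1), (1.00, 1),                        # fall into E4
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auto_insurance_risk (Exposure REAL, ClaimNb INTEGER)")
conn.executemany("INSERT INTO auto_insurance_risk VALUES (?, ?)", rows)

# CASE returns the first branch that matches, so each upper bound is inclusive.
query = """
SELECT CASE
           WHEN Exposure <= 0.25 THEN 'E1'
           WHEN Exposure <= 0.5  THEN 'E2'
           WHEN Exposure <= 0.75 THEN 'E3'
           ELSE 'E4'
       END AS exposure_bucket,
       100.0 * SUM(ClaimNb) / COUNT(*) AS percentage_claim
FROM auto_insurance_risk
GROUP BY exposure_bucket
ORDER BY exposure_bucket
"""
claim_rate = dict(conn.execute(query).fetchall())
print(claim_rate)  # {'E1': 25.0, 'E2': 50.0, 'E3': 75.0, 'E4': 100.0}
```

Because ClaimNb is summed, a policy with two claims contributes two to the numerator, so the figure is claims per 100 policies in the bucket.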

Q5. Which area has the highest number of average claims? Show the data in percentage
w.r.t. the number of policies in corresponding Area.

Query-
select Area,
    avg(ClaimNb) as avg_claim,
    (sum(ClaimNb) / count(IDpol)) * 100 as Per_of_policies_area
from auto_insurance_risk
group by Area
order by avg(ClaimNb);
Solution 5-
As per the data retrieved, Area F has the highest average number of claims per policy,
at 6.29%.

Q6. If we use these exposure bucket along with Area i.e. group Area and Exposure
Buckets together and look at the claim rate, an interesting pattern could be seen in the
data. What is that?
Query-
select Area,
    case
        when Exposure <= 0.25 then 'E1'
        when Exposure <= 0.5 then 'E2'
        when Exposure <= 0.75 then 'E3'
        else 'E4'
    end as exposure_bucket,
    sum(ClaimNb) as total_claims,
    count(IDpol) as total_customer,
    (sum(ClaimNb) / count(IDpol)) * 100 as percentage_claim
from auto_insurance_risk
group by Area, exposure_bucket
order by percentage_claim;

Solution 6-
The interesting pattern in the retrieved data is that, within each area, the claim
percentage rises as the exposure bucket increases.
For instance, for Area A the claim percentage is 6.14 for the E4 bucket but only 2.9 for the
E1 bucket, and a similar pattern is observed for the other areas as well.

Q7.1 If we look at average Vehicle Age for those who claimed vs those who didn't claim,
what do you see in the summary?

Query-
select avg(VehAge), claim_flag 
from auto_insurance_risk
group by claim_flag;

Solution 7.1-

As the above output shows, the average age of the vehicles for which claims were made was 6.5,
which is 0.5 less than for the vehicles on which no claim was made.

Q7.2 Now if we calculate the average Vehicle Age for those who claimed and group
them by Area, what do you see in the summary? Any pattern you see in the data?

Query-
select area, avg(VehAge), claim_flag
from auto_insurance_risk
    where claim_flag =1
group by Area;

Solution 7.2-

From the above output it can be inferred that, among the policy holders who made claims,
the average vehicle age is highest in Area A; hence the pricing should cover this factor,
especially in Area A.
The lowest average is in Area F, which shows that claims in Area F tend to be made in the
early years of the vehicle. It also suggests that the accident rate in Area A is much lower
than in Area F.

Q8. If we calculate the average vehicle age by exposure bucket (as mentioned above), 
we see an interesting trend between those who claimed vs those who didn't. What is
that?

Query-
select
    case
        when Exposure <= 0.25 then 'E1'
        when Exposure <= 0.5 then 'E2'
        when Exposure <= 0.75 then 'E3'
        else 'E4'
    end as exposure_bucket,
    claim_flag,
    avg(VehAge) as avg_VehAge
from auto_insurance_risk
group by exposure_bucket, claim_flag;

Solution 8 -

The trend we observe, both for the policy holders who claimed and for those who did not, is
that the average vehicle age increases with the exposure period.

Hence the average vehicle age is highest for the exposure period greater than 0.75 (E4), for
both the claimed and the not claimed.

We also observe that the difference in average vehicle age between the claimed and the
non-claimed policy holders is greatest for the E1 bucket. For instance, the average age is 4.9
for claimed and 6.36 for non-claimed; hence it could be concluded that newer vehicles are at a
higher risk among the lower-exposure policy holders.
Q9.1 Create a Claim_Ct flag on the ClaimNb field as below, and take average of the
BonusMalus by Claim_Ct.

Query-
select
    case
        when ClaimNb = 1 then '1 Claim'
        when ClaimNb > 1 then 'MT 1 Claims'
        else 'No Claims'
    end as Claim_Ct,
    avg(BonusMalus)
from auto_insurance_risk
group by Claim_Ct;

Solution 9.1 -

Q9.2 What is the inference from the summary?

Solution 9.2
The inference we can draw from this summary is that the average BonusMalus is highest for the
'MT 1 Claims' group, where MT 1 stands for more than 1 claim in the exposure period. A higher
BonusMalus means a smaller discount (or a penalty) on the premium, so the policy holders who
claim more frequently receive less discount on the premium payment of the policy.
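
To make the grouping concrete, here is a small sketch of the Claim_Ct logic, assuming Python's sqlite3 module and six invented (ClaimNb, BonusMalus) rows. The BonusMalus values are placeholders picked for easy arithmetic, not figures from the dataset.

```python
import sqlite3

# Made-up (ClaimNb, BonusMalus) rows for illustration only.
rows = [(0, 50), (0, 60), (1, 70), (1, 90), (2, 100), (3, 120)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auto_insurance_risk (ClaimNb INTEGER, BonusMalus INTEGER)")
conn.executemany("INSERT INTO auto_insurance_risk VALUES (?, ?)", rows)

# Three labels: exactly one claim, more than one claim, and everything else.
query = """
SELECT CASE
           WHEN ClaimNb = 1 THEN '1 Claim'
           WHEN ClaimNb > 1 THEN 'MT 1 Claims'
           ELSE 'No Claims'
       END AS Claim_Ct,
       AVG(BonusMalus) AS avg_bonus_malus
FROM auto_insurance_risk
GROUP BY Claim_Ct
"""
avg_by_claim_ct = dict(conn.execute(query).fetchall())
print(avg_by_claim_ct)
```

With these rows the averages are 55.0 for 'No Claims', 80.0 for '1 Claim' and 110.0 for 'MT 1 Claims'.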

Q10. Using the same Claim_Ct logic created above, if we aggregate the Density column
(take average) by Claim_Ct, what inference can we make from the summary data?

Query-
select 
case when ClaimNb= 1 then '1 Claim'
    when ClaimNb >1 then 'MT 1 Claims'
    else 'No Claims'
    end as Claim_Ct, 
    avg(Density) 
From auto_insurance_risk
Group by Claim_Ct;
Solution 10

From the above summary we can infer that the average population density is much higher for the
policy holders who made a claim, indicating that dense areas are more prone to vehicle
accidents and hence to higher claims. Among those who claimed, the policy holders with more
than 1 claim have an even higher average population density.

Q11. Which Vehicle Brand & Vehicle Gas combination have the highest number of
Average Claims (use ClaimNb field for aggregation)?

Query-
Select VehBrand, VehGas, 
avg(ClaimNb) 
from auto_insurance_risk
group by VehBrand, VehGas
order by avg(ClaimNb) desc
limit 1;

Solution 11

It could be concluded that the B12 brand which uses Regular gas has the highest number of average
claims hence the company should charge the premium as per the risk of this vehicle type.

Q12. List the Top 5 Regions & Exposure [use the buckets created above] Combination
from Claim Rate's perspective.

Query-
select
    case
        when Exposure <= 0.25 then 'E1'
        when Exposure <= 0.5 then 'E2'
        when Exposure <= 0.75 then 'E3'
        else 'E4'
    end as exposure_bucket,
    region,
    sum(Claim_flag),
    count(Idpol),
    (sum(Claim_flag) / count(IDpol)) * 100 as Claim_rate
from auto_insurance_risk
group by exposure_bucket, region
order by Claim_rate desc
limit 5;

Solution 12

The above table lists the top 5 Region and Exposure bucket combinations from the claim rate's
perspective.

Q13.1 Are there any cases of illegal driving i.e., underaged folks driving and committing
accidents?

Query-

select count(*) as illegal_driving
from auto_insurance_risk
where DrivAge<18;

Solution 13.1
The output showed 0 as the result, hence there are no cases of illegal driving in the given area
of study.

Q13.2 Create a bucket on DrivAge and then take average of BonusMalus by this Age Group
Category. What do you infer from the summary?
DrivAge=18 then 1-Beginner, DrivAge<=30 then 2-Junior, DrivAge<=45 then 3-Middle
Age, DrivAge<=60 then 4-Mid-Senior, DrivAge>60 then 5-Senior

Query-

select
    case
        when DrivAge = 18 then '1-Beginner'
        when DrivAge <= 30 then '2-Junior'
        when DrivAge <= 45 then '3-Middle Age'
        when DrivAge <= 60 then '4-Mid-Senior'
        else '5-Senior'
    end as Drive_Age_bucket,
    avg(BonusMalus)
from auto_insurance_risk
group by Drive_Age_bucket;
Solution 13.2

It can be inferred from the summary that the average BonusMalus is highest in the Beginner
category, that is, when the age of the driver is 18. This means that beginners carry a higher risk
and make more claims, hence the discount given to them is lower than that given to the older
categories of drivers.

The closer the value is to 100, the more the policy holder is penalized for making more claims.
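
The age buckets depend on the order of the CASE branches: the first matching branch wins, so 'DrivAge <= 30' only catches ages 19 to 30 because 18 has already been assigned. The sketch below, which assumes Python's sqlite3 module and a hypothetical 'drivers' table holding only boundary ages, shows where each boundary lands.

```python
import sqlite3

# Boundary ages chosen to exercise every branch; not real data.
ages = [18, 19, 30, 31, 45, 46, 60, 61]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drivers (DrivAge INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO drivers VALUES (?)", [(a,) for a in ages])

query = """
SELECT DrivAge,
       CASE
           WHEN DrivAge = 18  THEN '1-Beginner'
           WHEN DrivAge <= 30 THEN '2-Junior'
           WHEN DrivAge <= 45 THEN '3-Middle Age'
           WHEN DrivAge <= 60 THEN '4-Mid-Senior'
           ELSE '5-Senior'
       END AS Drive_Age_bucket
FROM drivers
ORDER BY DrivAge
"""
bucket_of = dict(conn.execute(query).fetchall())
for age, bucket in bucket_of.items():
    print(age, bucket)
```

One side effect of this ordering is that an age below 18 would be labelled '2-Junior'; that does not matter here because the check in Q13.1 found no such rows.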

Q14. Mention one major difference between unique constraint and primary key?

Solution 14

The major difference between a unique constraint and a primary key lies in the handling of NULL
values.

A column declared as the primary key of a table does not accept NULL values, because the
primary key is what identifies each record in the table uniquely. A column under a unique
constraint, by contrast, can contain NULL values as well.

Another difference is that a table can have only one primary key, but more than 1 unique
constraint can be added to the same table.
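
A minimal sketch of these rules, assuming Python's sqlite3 module and a hypothetical 'policy_holder' table. One caveat: SQLite, unlike MySQL, does not automatically reject NULL in a non-integer primary key, so NOT NULL is spelled out here to mirror MySQL's behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE policy_holder (        -- hypothetical table for illustration
        id    TEXT NOT NULL PRIMARY KEY, -- one primary key per table
        email TEXT UNIQUE,               -- first unique constraint
        phone TEXT UNIQUE                -- second unique constraint
    )
    """
)

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO policy_holder VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert(("p1", None, None)),        # accepted: NULL allowed under UNIQUE
    try_insert(("p2", None, None)),        # accepted: a second NULL is fine too
    try_insert(("p3", "a@x.test", None)),  # accepted
    try_insert(("p4", "a@x.test", None)),  # rejected: duplicate email
    try_insert(("p1", "b@x.test", None)),  # rejected: duplicate primary key
    try_insert((None, "c@x.test", None)),  # rejected: NULL primary key
]
print(results)  # [True, True, True, False, False, False]
```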

Q15. If there are 5 records in table A and 10 records in table B and we cross-join these
two tables, how many records will be there in the result set?

Solution 15

The CROSS JOIN pairs each row of the first table with each row of the second table. This join
type is also known as a Cartesian join.

Hence, in the above question, each of the 5 records in Table A is combined with each of the 10
records in Table B; therefore the cross join will contain 50 records (5*10).
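
The count is easy to confirm with two throwaway tables. The sketch below assumes Python's sqlite3 module; the table names 'a' and 'b' and their contents are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (y INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(5)])   # 5 rows
conn.executemany("INSERT INTO b VALUES (?)", [(j,) for j in range(10)])  # 10 rows

# Every row of a is paired with every row of b.
cross_count = conn.execute("SELECT COUNT(*) FROM a CROSS JOIN b").fetchone()[0]
print(cross_count)  # 50
```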
Q16. What is the difference between inner join and left outer join?

Solution 16

An Inner Join only shows a record if there is a matching record on the other (right) side of the
join, that is, in the other table. In simple words, it shows only the intersection of the two tables.

An Inner Join never produces NULL values for unmatched rows, as it shows only those records for
which a match is found between the two tables being joined.

A Left Outer Join returns every record of the left-hand table, even if there are no matching
rows on the other (right) side of the join.

In a Left Outer Join, NULL values are displayed in the right-table columns for the records that
do not have a match in the right table.
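
The difference shows up as soon as one left-hand row has no partner. The following sketch assumes Python's sqlite3 module and two hypothetical tables, 'customers' and 'claims', in which customer 2 has no claim.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")           # hypothetical
conn.execute("CREATE TABLE claims (customer_id INTEGER, amount INTEGER)")  # hypothetical
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
conn.executemany("INSERT INTO claims VALUES (?, ?)", [(1, 100), (1, 200), (3, 50)])

inner_rows = conn.execute(
    """
    SELECT c.id, cl.amount
    FROM customers c
    INNER JOIN claims cl ON c.id = cl.customer_id
    ORDER BY c.id, cl.amount
    """
).fetchall()

left_rows = conn.execute(
    """
    SELECT c.id, cl.amount
    FROM customers c
    LEFT OUTER JOIN claims cl ON c.id = cl.customer_id
    ORDER BY c.id, cl.amount
    """
).fetchall()

print(inner_rows)  # [(1, 100), (1, 200), (3, 50)]
print(left_rows)   # [(1, 100), (1, 200), (2, None), (3, 50)]
```

Customer 2 disappears from the inner join but is kept by the left outer join, with None (SQL NULL) in the column that comes from the right-hand table.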

Q17. Consider a scenario where Table A has 5 records and Table B has 5 records. Now
while inner joining Table A and Table B, there is one duplicate on the joining column in
Table B (i.e., Table A has 5 unique records, but Table B has 4 unique values and one
redundant value). What will be record count of the output?

Solution 17
In the above scenario the output will contain 5 records. The duplicated value in Table B
produces two output rows, and the one record in Table A that has no match in Table B is not
displayed.

We can see the above through the following query and its output-

create table t1 (
id int,
name varchar(20));
insert into t1 values (1, 'sahaj'),
(2, 'yash') ,
(3, 'mahak'),
(4, 'sanjeet'),
(5, 'sangeeta');

create table t3 (
id int,
department varchar(20));

insert into t3 values (1, 'a'),
(2, 'b') ,
(2, 'c'),
(4, 'd'),
(5, 'f');

select t1.id, t1.name from t1
inner join t3 on
t1.id=t3.id;

The output will be as follows-

Inner Join
Output

Here the ID that is duplicated in Table B (ID 2) appears twice in the output, and ID 3 has no
match in Table B, so it is not displayed.
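
As a quick check, the same two tables can be loaded into an in-memory SQLite database from Python. SQLite is an assumption of this sketch; the report itself uses MySQL, but the join semantics are the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (id INT, name VARCHAR(20));
    INSERT INTO t1 VALUES (1, 'sahaj'), (2, 'yash'), (3, 'mahak'),
                          (4, 'sanjeet'), (5, 'sangeeta');

    CREATE TABLE t3 (id INT, department VARCHAR(20));
    INSERT INTO t3 VALUES (1, 'a'), (2, 'b'), (2, 'c'), (4, 'd'), (5, 'f');
    """
)

joined = conn.execute(
    """
    SELECT t1.id, t1.name
    FROM t1
    INNER JOIN t3 ON t1.id = t3.id
    ORDER BY t1.id
    """
).fetchall()

for row in joined:
    print(row)
print(len(joined))  # 5
```

The five rows are ids 1, 2, 2, 4 and 5: 'yash' is listed twice because id 2 occurs twice in t3, and 'mahak' (id 3) is absent.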

Q18. What is the difference between WHERE clause and HAVING clause?

WHERE CLAUSE                                               | HAVING CLAUSE
-----------------------------------------------------------+--------------------------------------------------
Used to filter records from the table based on a           | Used to filter records from the groups based on
particular condition.                                      | a specific condition.
Can be used without the 'GROUP BY' clause.                 | Normally used together with the 'GROUP BY' clause.
It can't contain aggregate functions.                      | It can contain aggregate functions.
It is applied before the 'GROUP BY' clause, if one is      | It is applied after the 'GROUP BY' clause.
present.                                                   |
It can be used with the 'SELECT', 'UPDATE', and 'DELETE'   | It can only be used with the 'SELECT' statement.
statements.                                                |
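
The two clauses can be compared on the same toy table. The sketch below assumes Python's sqlite3 module and seven invented (Area, ClaimNb) rows: WHERE discards individual rows before grouping, while HAVING discards whole groups after aggregation.

```python
import sqlite3

# Made-up (Area, ClaimNb) rows for illustration only.
rows = [("A", 0), ("A", 1), ("A", 2), ("B", 0), ("B", 0), ("B", 1), ("C", 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auto_insurance_risk (Area TEXT, ClaimNb INTEGER)")
conn.executemany("INSERT INTO auto_insurance_risk VALUES (?, ?)", rows)

# WHERE: keep only rows with at least one claim, then count rows per area.
where_result = conn.execute(
    """
    SELECT Area, COUNT(*)
    FROM auto_insurance_risk
    WHERE ClaimNb > 0
    GROUP BY Area
    ORDER BY Area
    """
).fetchall()

# HAVING: aggregate first, then keep only areas whose total is 3 or more.
having_result = conn.execute(
    """
    SELECT Area, SUM(ClaimNb)
    FROM auto_insurance_risk
    GROUP BY Area
    HAVING SUM(ClaimNb) >= 3
    ORDER BY Area
    """
).fetchall()

print(where_result)   # [('A', 2), ('B', 1), ('C', 1)]
print(having_result)  # [('A', 3), ('C', 3)]
```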
