Documentation RNS Models Auto BR 2020-08-13 Translated

RNS Models for Cars (former Robbery/Theft Model)
13-Aug-2020
Brazil
Technical Area
Advanced Analytics
Classification: Internal Use

Document change control
Concept | Description
Name of the initiative | RNS Models for Cars (former Robbery/Theft Model)
Country | Brazil
Products | Automais On-Line (215) and Automais Gold On-Line (204)
Project identifier | N/A
Methodology | N/A
Product Owner | Auto Product Manager
Team | Technical: Market Pricing and Business Intelligence (BI) – Advanced Analytics
GitLab | N/A
Documentation (Notebook) | N/A
Personal data | ☒
Data source | TronWeb, Registro Nacional de Sinistro (RNS), Neurotech
Version History
Version | Date | Summary of changes | Author
1.0 | 02-Aug-2020 | First version | Frederico Poleto
1.1 | 13-Aug-2020 | Included the correct file in Attachment 5. | Frederico Poleto
Authorizations
Name Date
Preparation
Revision
Approval
Distribution
This document is the property of MAPFRE and is exclusively for internal use or any of the MAPFRE Group Entities (complete
list on page www.MAPFRE.com). It may not be reproduced in whole or in part, nor may it be transmitted in any form,
whether electronic, mechanical, by photocopy, recording, reproduction or other means without express authorization for
this purpose.
Technical Area: Market Pricing and BI – Advanced Analytics
Contents
1 Introduction
2 Current scenario
3 Objectives
4 Economic impact
5 Modeling dataset
5.1 Targets/response variables built from MAPFRE policies
5.2 Features/explanatory variables built from RNS data
6 Descriptive analyses for model candidate variables
7 Model development
8 Model validation
9 Model usage
10 Modeling code and computational aspects
11 Model limitations and potential enhancements for next versions

1 Introduction
Beyond the credit bureau scores, for which MAPFRE Brazil has used Serasa Experian for several
years, other providers offer scores and data attributes to support underwriting and pricing.
Among these alternatives, Neurotech has become a popular provider of scores for the insurance
industry in recent years.
CNseg is a Brazilian association that operates nationwide, bringing together the federations that
represent companies operating in general insurance, private pension and life insurance,
supplementary health, and capitalization securities. CNseg maintains a national registry of claims
known as RNS (Registro Nacional de Sinistros), to which insurance companies are required to send
all their claims. For a small fee, companies can receive a monthly flat file with most of the
industry's claims to help the underwriting assessment: robbery/theft and indemnified total losses,
as well as claims from tax IDs and vehicle IDs that had 3 or more claims. For an additional fee per
query, the information can be checked online.
2 Current scenario
MAPFRE tested Neurotech scores in Sep’2017 and started using them for underwriting in 2018. The
Actuarial and Market Pricing teams are currently reviewing the pricing models to include these
scores as well.
MAPFRE has been using RNS in the underwriting process by creating manual rules to detect fraud
and by manually reviewing new policies that had 3 or more past claims in RNS. The Actuarial and
Market Pricing teams are evaluating the inclusion of two RNS-based variables in the pricing models:
the total quantity of indemnified claims in the previous 5 years for the vehicle and for third
parties of policy owners (e.g., property damage).
3 Objectives
When benchmarking against the industry, we learned that Neurotech scores could lose predictive
power in the presence of RNS data. This project therefore emerged to develop internal RNS scores
that could support the underwriting and pricing processes, and perhaps also save the money
currently paid for Neurotech scores.
As Neurotech scores are considered to have their strongest association with robbery/theft claims,
the model was initially called the robbery/theft model. However, RNS data can be used to generate
scores for different targets, not only robbery/theft, and that is exactly what has been
accomplished. Therefore, we decided to rename the model and call it RNS models.
In this spirit, we have developed RNS scores for 6 targets so far, predicting claim frequencies
for different types of events:

1. Comprehensive coverage (partial loss + total loss + robbery/theft, a.k.a. “casco” in Brazil)
2. Total loss coverage
3. Partial loss coverage
4. Robbery/theft coverage
5. Property damage liability
6. Bodily injury liability
The most general one, comprehensive coverage, could be used for underwriting, and the
coverage-specific ones could be used by the Actuarial and Market Pricing teams in the individual
pricing models. In the future, we might also create scores for the severity of these types of events.
4 Economic impact
The Technical BI team conducted a back-testing analysis of RNS scores for underwriting on 2017
exposures. A similar analysis was conducted for Neurotech scores in Sep’2017. Table 1 shows some of
the figures: RNS scores can reduce the loss ratio much more than Neurotech scores, with an even
smaller share of policy rejections.
To translate the loss ratios into financial figures, consider the cases highlighted in bold (new
insurance on the Gold product): rejecting the 4.87% highest-risk exposures using RNS Casco scores
would avoid BRL 8.6 MM in casco claim costs while giving up BRL 4.4 MM in casco premium, a net gain
of BRL 4.2 MM; by comparison, rejecting the 6.18% highest-risk exposures using Neurotech scores
would avoid BRL 5.9 MM in casco claim costs while giving up BRL 5.7 MM in casco premium, a net gain
of BRL 0.2 MM.
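The arithmetic behind this comparison can be reproduced with a minimal sketch. The function names are hypothetical, the avoided-claims and lost-premium amounts are the rounded BRL MM figures quoted in the text, and the indemnities and premiums are the "New / Gold" rows of Table 1. Because the inputs are rounded, the loss ratio reductions computed this way should land close to, but not exactly on, the 6.22% and 1.85% reported in the table.

```python
# Figures in BRL millions. Function and variable names are illustrative only.

def net_gain(avoided_claims, lost_premium):
    """Net result of rejecting a slice of exposure: claims avoided minus premium given up."""
    return avoided_claims - lost_premium


def loss_ratio_drop(indemnities, premium, avoided_claims, lost_premium):
    """Reduction of the loss ratio, in percentage points, after the rejection."""
    before = indemnities / premium
    after = (indemnities - avoided_claims) / (premium - lost_premium)
    return (before - after) * 100


# New insurance, Gold product, RNS Casco score (rejecting 4.87% of exposure).
rns_gain = net_gain(avoided_claims=8.6, lost_premium=4.4)
rns_drop = loss_ratio_drop(66.131706, 89.746822, 8.6, 4.4)

# New insurance, Gold product, Neurotech score (rejecting 6.18% of exposure).
neuro_gain = net_gain(avoided_claims=5.9, lost_premium=5.7)
neuro_drop = loss_ratio_drop(68.089872, 89.806727, 5.9, 5.7)

print(f"RNS:       net gain {rns_gain:.1f} MM, loss ratio drop {rns_drop:.2f} pts")
print(f"Neurotech: net gain {neuro_gain:.1f} MM, loss ratio drop {neuro_drop:.2f} pts")
```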
Table 1: Claim costs, policy premiums, loss ratios, and simulations of exposure reductions of approximately 6%.
Insurance Type | Product | Indemnities Casco (BRL) | Premium Casco (BRL) | Loss Ratio Casco | % Reduction Exposure | % Reduction Loss Ratio | Score | Results Dataset
All | On-line | 1,013,301,512 | 1,510,723,261 | 67.07% | 5.77% | 6.55% | RNS Casco | Dec'2017
All | Gold | 209,627,941 | 298,151,546 | 70.31% | 5.14% | 5.45% | RNS Casco | Dec'2017
New | On-line | 411,421,292 | 529,788,939 | 77.66% | 5.68% | 6.23% | RNS Casco | Dec'2017
New | Gold | 66,131,706 | 89,746,822 | 73.69% | 4.87% | 6.22% | RNS Casco | Dec'2017
New | Gold | 68,089,872 | 89,806,727 | 75.82% | 6.18% | 1.85% | Neurotech | Apr'2017
These preliminary results illustrate the substantial benefits of using RNS scores. RNS data is
available monthly for only BRL 2,200/month, a fee that is already being paid. Hence, there is no
reason not to use these scores.
Additional discussions are ongoing with the Auto Product Manager and the Actuarial and Market
Pricing teams on how and when to start using RNS scores in underwriting and pricing, and on whether
it is worth discontinuing Neurotech scores or using both together. Neurotech scores currently cost
about BRL 1.9 MM/year, and the contract is renewed annually in September.

5 Modeling dataset
We split the construction of the modeling dataset into two parts: the targets/response variables
built from MAPFRE policies and the features/explanatory variables built from RNS data.
5.1 Targets/response variables built from MAPFRE policies

As the RNS models use RNS data to predict essentially the same targets modelled by the Actuarial
team, our teams agreed to use the same datasets, with the exposures and claim frequencies, that the
Actuarial team was using to update the models for the On-line and Gold products for the different
types of claims.
So far MAPFRE uses actuarial and pricing statistical models only for casco (comprehensive
coverage) and property material damage, but after the Convergência Project it will be possible to
break casco down into the different types of events; hence, the Actuarial team is building
models for partial loss, total loss, and robbery/theft.
The SAS code shared by the Actuarial team to extract the datasets for Gold and On-line modeling is
included in Attachment 1 and Attachment 2, respectively. We unified the variable names after
analyzing the underlying code and concepts, and the resulting merged dataset forms the base of
the modeling dataset to be linked to the RNS set of tables. The data dictionaries for the Actuarial
datasets and the variable mappings can be found in Attachment 3.
Attachment 1: Main SAS code used by Actuarial Team to build modeling dataset for Gold (as of Apr’2020).
Attachment 2: Main SAS code used by Actuarial Team to build modeling dataset for On-line (as of Jun’2020).
Attachment 3: Data dictionaries for Actuarial datasets and variables mappings.
It is worth clarifying that each policy appears on several rows in the Actuarial dataset, according
to the following:
• Endorsements, when there are changes in policy characteristics.
• Modeling period, according to the definitions used by the Actuarial team at the time of data
extraction.
• Claims, where each claim is included in one row.
For each row of the policy, the corresponding exposure/duration of the policy is computed.
The Neurotech scores were available only for the last period of the modeling dataset (2019), so we
decided to hold out this period for out-of-time (OOT) validation and comparison purposes, while
developing the RNS models with the other periods (2016-2018). The characteristics of the datasets
for both products and for the merged dataset are shown in Table 2.
Note that the frequency of bodily damage claims is very small, so the model for this event is
expected to be weaker. Also, the modeling periods for the Gold and On-line products are offset by
2 months, but we do not expect this to affect the results considerably because the time difference
is small. Likewise, Gold has one more year of modeling period than On-line, but given that the
exposure on Gold is much smaller than on On-line, this is not expected to generate severe biases in
the model. Finally, we would like RNS scores to capture the past claims behavior in the industry of
the company/individual, whatever car insurance product is being considered, so we do not use the
product flag in the RNS model.
For generating RNS scores in production, we will want to use all information available in RNS at
that moment. For model development and testing, however, we want to replicate the RNS data that
would have been available at the moment when the RNS score would have been generated. Therefore, to
be conservative, we decided to link the policy to RNS data only at origination, which means that we
do not fetch new RNS data when there are endorsements or claims. Also, according to the Fraud
Prevention team, RNS data is updated monthly by around day 07, so if the policy start date falls on
day 08 or later, we consider all RNS data up to the previous month to be available; otherwise, we
consider all RNS data up to the month before the previous one to be available. We implemented this
logic by computing an “RNS date” reference for each tax ID.
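The cut-off rule above can be sketched as follows. This is only an illustration: the function name is hypothetical, and representing the “RNS date” as the last day of the most recent month assumed to be available is an assumption, since the document does not state how the reference date is encoded.

```python
from datetime import date, timedelta

# Assumption: from day 08 onwards the previous month's RNS file is already loaded.
RNS_REFRESH_DAY = 8


def rns_reference_date(policy_start: date) -> date:
    """Last day of the most recent month assumed available in RNS at policy start."""
    months_back = 1 if policy_start.day >= RNS_REFRESH_DAY else 2
    first_of_month = policy_start.replace(day=1)
    # Step back to the first day of earlier months when more than one month is needed.
    for _ in range(months_back - 1):
        first_of_month = (first_of_month - timedelta(days=1)).replace(day=1)
    # The day before the first of that month is the last day of the available month.
    return first_of_month - timedelta(days=1)


print(rns_reference_date(date(2019, 3, 15)))  # start on day 08 or later
print(rns_reference_date(date(2019, 3, 5)))   # start before day 08
```

Only claims with a registration date on or before this reference date would then be linked to the policy, which mimics what could have been known at origination.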
Beyond the tax ID (CPF/CNPJ) of the insurance owner, the only other information used to build the
model was:
• Tax ID type, as the quantity of claims and the behavior differ between individuals and
companies;
• Zip code, for RNS version 2 scores only.
This information was obtained from the first occurrence of the policy found in the dataset, which
means we do not update the RNS scores if the tax ID, tax ID type, or zip code changed after the
policy started.

Table 2: Descriptive statistics of the Actuarial datasets.
Product | Modeling Period | Start | End | Number of Rows | Casco Exposure | Casco Nº | Casco Freq. | Partial Loss Nº | Partial Loss Freq. | Total Loss Nº | Total Loss Freq. | Robbery/Theft Nº | Robbery/Theft Freq. | Liability Exposure | Property Damage Nº | Property Damage Freq. | Bodily Damage Nº | Bodily Damage Freq.
Gold | 2016 | 2015-12 | 2016-12 | 287,656 | 135,588 | 9,716 | 7.2% | 7,552 | 5.6% | 1,415 | 1.0% | 749 | 0.55% | 135,190 | 6,893 | 5.1% | 59 | 0.04%
Gold | 2017 | 2016-12 | 2017-12 | 226,975 | 105,057 | 6,979 | 6.6% | 5,524 | 5.3% | 845 | 0.8% | 610 | 0.58% | 104,600 | 5,029 | 4.8% | 56 | 0.05%
Gold | 2018 | 2017-12 | 2018-12 | 227,087 | 104,709 | 6,199 | 5.9% | 4,886 | 4.7% | 773 | 0.7% | 540 | 0.52% | 103,750 | 4,579 | 4.4% | 62 | 0.06%
Gold | 2019 | 2018-12 | 2019-12 | 228,228 | 107,990 | 6,092 | 5.6% | 4,746 | 4.4% | 860 | 0.8% | 486 | 0.45% | 106,753 | 4,763 | 4.5% | 77 | 0.07%
On-line | 2017 | 2017-02 | 2018-02 | 2,885,090 | 1,368,677 | 90,582 | 6.6% | 70,556 | 5.2% | 11,972 | 0.9% | 8,054 | 0.59% | 1,368,114 | 57,817 | 4.2% | 380 | 0.03%
On-line | 2018 | 2018-02 | 2019-02 | 2,584,173 | 1,230,026 | 71,888 | 5.8% | 55,096 | 4.5% | 9,609 | 0.8% | 7,183 | 0.58% | 1,222,216 | 47,825 | 3.9% | 447 | 0.04%
On-line | 2019 | 2019-02 | 2020-02 | 2,272,981 | 1,075,797 | 58,419 | 5.4% | 42,448 | 4.0% | 10,061 | 0.9% | 5,910 | 0.55% | 1,062,619 | 42,066 | 4.0% | 422 | 0.04%
Total | 2016 | 2015-12 | 2016-12 | 287,656 | 135,588 | 9,716 | 7.2% | 7,552 | 5.6% | 1,415 | 1.0% | 749 | 0.55% | 135,190 | 6,893 | 5.1% | 59 | 0.04%
Total | 2017 | 2016-12 | 2018-02 | 3,112,065 | 1,473,734 | 97,561 | 6.6% | 76,080 | 5.2% | 12,817 | 0.9% | 8,664 | 0.59% | 1,472,714 | 62,846 | 4.3% | 436 | 0.03%
Total | 2018 | 2017-12 | 2019-02 | 2,811,260 | 1,334,735 | 78,087 | 5.9% | 59,982 | 4.5% | 10,382 | 0.8% | 7,723 | 0.58% | 1,325,966 | 52,404 | 4.0% | 509 | 0.04%
Total | 2019 | 2018-12 | 2020-02 | 2,501,209 | 1,183,787 | 64,511 | 5.5% | 47,194 | 4.0% | 10,921 | 0.9% | 6,396 | 0.54% | 1,169,372 | 46,829 | 4.0% | 499 | 0.04%
Total | Total | 2015-12 | 2020-02 | 8,712,190 | 4,127,844 | 249,875 | 6.1% | 190,808 | 4.6% | 35,535 | 0.9% | 23,532 | 0.57% | 4,103,242 | 168,972 | 4.1% | 1,503 | 0.04%

5.2 Features/explanatory variables built from RNS data

RNS contains 4 tables with different types of claims and structures:
• MII: claims with indemnified status and classified as full indemnity.
• MCPF: claims from tax IDs that had 3 or more claims.
• MCH: claims from vehicle IDs (chassis) that had 3 or more claims.
• MRF: claims classified as nature total robbery or total theft.
Our primary interest is to check whether policy owners had claims in the past, linking by tax ID.
In the future, we might also want to extract information on past claims of the cars via the vehicle
ID and/or the plate of the vehicle being insured.
All tables contain the tax ID of the insurance owner of the claims. MII also contains up to 3 tax
IDs of involved parties, in columns that clearly identify this information. MCPF and MCH, on the
other hand, have one column that explicitly holds the tax ID of the insurance owner and another
generic tax ID column with a qualifier indicating whether it belongs to the insurance owner or to
an involved party in the claim. We analyzed when the tax IDs were invalid (i.e., blank or
“00000000000000”) and when the tax ID columns were equal; then we merged all claims of all valid
tax IDs from the 4 tables, adjusting the structure and information where needed. For example, in
MCPF and MCH, when the tax ID differed from the tax ID of the insurance owner and the type of
insurance was recorded as casco, we changed it to liability, since a casco claim for a party other
than the insurance owner did not make sense.
We noticed that, even within each table, some claims had more than one record, with claim number
and other information that might or might not be equal, but seemingly related to the same event,
possibly just adding to or adjusting previous reports of the occurrence. The duplication became
even higher when we merged all tables. Hence, after merging the 4 datasets appropriately and
adjusting the different structures, we deduplicated by the combination of tax ID, vehicle plate,
date of occurrence, and type of claim (casco/third-party liability) to avoid overcounting the same
events. The retained claim record was chosen by prioritizing, in order: source table
(MRF>MII>MCPF>MCH), most recent date of registration in RNS, lower level of claim status
(indemnified>closed>declined>pending), and larger claim number.
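The deduplication rule can be illustrated with a small sketch. The record layout (dictionary keys), the status labels, and the function names are hypothetical; the real RNS tables use their own column names and numeric codes. Statuses outside the four listed in the text are ranked last here, which is an assumption.

```python
from datetime import date

# Lower rank means higher priority, following the order described in the text.
TABLE_RANK = {"MRF": 0, "MII": 1, "MCPF": 2, "MCH": 3}
STATUS_RANK = {"indemnified": 0, "closed": 1, "declined": 2, "pending": 3}


def priority(claim):
    """Sort key: the smallest tuple identifies the record to keep."""
    return (
        TABLE_RANK[claim["table"]],
        -claim["registered"].toordinal(),      # most recent registration first
        STATUS_RANK.get(claim["status"], 99),  # unlisted statuses go last (assumption)
        -claim["claim_number"],                # larger claim number first
    )


def deduplicate(claims):
    """Keep one record per (tax ID, plate, occurrence date, type of claim)."""
    best = {}
    for claim in claims:
        key = (claim["tax_id"], claim["plate"], claim["occurred"], claim["claim_type"])
        if key not in best or priority(claim) < priority(best[key]):
            best[key] = claim
    return list(best.values())


claims = [
    {"tax_id": "111", "plate": "ABC1234", "occurred": date(2017, 5, 2), "claim_type": "casco",
     "table": "MCPF", "registered": date(2017, 8, 1), "status": "indemnified", "claim_number": 20},
    {"tax_id": "111", "plate": "ABC1234", "occurred": date(2017, 5, 2), "claim_type": "casco",
     "table": "MRF", "registered": date(2017, 6, 1), "status": "pending", "claim_number": 10},
    {"tax_id": "111", "plate": "ABC1234", "occurred": date(2018, 1, 9), "claim_type": "liability",
     "table": "MII", "registered": date(2018, 2, 1), "status": "closed", "claim_number": 30},
]
unique_claims = deduplicate(claims)
```

In this toy input the first two records describe the same event, and the MRF record is kept because the source table is the first criterion, even though the MCPF record was registered later.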
We consider the following less-to-more granular subsets for creating features:
1. Overall, i.e., no filter (1 total category)
2. Type of Claim (2 categories)
3. Nature (12 categories)
4. Indemnity Type (11 categories)
5. Claim Status (5 categories)
6. Type of Claim x Nature (14 categories)
7. Type of Claim x Indemnity Type (18 categories)
8. Type of Claim x Claim Status (10 categories)
9. Type of Claim x Nature x Indemnity Type (50 categories)
10. Type of Claim x Indemnity Type x Claim Status (45 categories)
11. Type of Claim x Nature x Claim Status (45 categories)
12. Type of Claim x Nature x Indemnity Type x Claim Status (101 categories)
For each combination of variables, we counted the number of claims found in the merged RNS
dataset; the numbers of categories listed in parentheses are those with 1,000 or more occurrences.
This was done to avoid adding to the Actuarial dataset, which already had a very large number of
rows, columns/features with low potential to help discrimination. We show this for items 6, 7, and
8 in Table 3, Table 4, and Table 5, respectively, where the grey cells have 1,000 or more
occurrences.
For all 12 items, we counted the number of claims for the tax ID, resulting in 314 features. For
items 1 to 5, which account for 31 of these features, we also computed the following 4 quantities:
number of distinct plates, number of distinct insurance companies, and minimum and maximum date
difference for the claims found for the tax ID in RNS. These date differences were computed as the
difference between the “RNS date” for the tax ID and the occurrence date of each claim found for
that tax ID, so the minimum date difference captures the time since the most recent claim of the
tax ID, while the maximum date difference captures the time since the oldest claim. As a result,
438 features were created (314 + 31 × 4).
The data dictionaries and the analyses performed on the table structures, as well as the counts
per variable, are included in Attachment 4.
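A short sketch can confirm the feature counts and illustrate the quantities computed per subset. The category counts come from the list of 12 items above; the record keys, the function name, the use of days as the unit of the date differences, and the handling of tax IDs without claims are assumptions made for illustration.

```python
from datetime import date

# Number of retained categories for items 1 to 12, as listed in the text.
CATEGORIES_PER_ITEM = [1, 2, 12, 11, 5, 14, 18, 10, 50, 45, 45, 101]

count_features = sum(CATEGORIES_PER_ITEM)          # one claim count per category
extra_features = sum(CATEGORIES_PER_ITEM[:5]) * 4  # 4 extra quantities for items 1 to 5
total_features = count_features + extra_features


def subset_features(claims, rns_date):
    """Quantities for one tax ID within one subset (e.g. all collision claims)."""
    if not claims:
        # Assumption: no claims gives zero counts and undefined date differences.
        return {"n_claims": 0, "n_plates": 0, "n_insurers": 0,
                "min_days": None, "max_days": None}
    gaps = [(rns_date - c["occurred"]).days for c in claims]
    return {
        "n_claims": len(claims),
        "n_plates": len({c["plate"] for c in claims}),
        "n_insurers": len({c["insurer"] for c in claims}),
        "min_days": min(gaps),  # time since the most recent claim
        "max_days": max(gaps),  # time since the oldest claim
    }


example = subset_features(
    [
        {"plate": "ABC1234", "insurer": "A", "occurred": date(2017, 1, 31)},
        {"plate": "XYZ9876", "insurer": "A", "occurred": date(2018, 12, 31)},
    ],
    rns_date=date(2019, 1, 31),
)
```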
Attachment 4: Data dictionaries for RNS datasets, analysis performed, and statistics.
Table 3: Cross table of nature and type of claim.
Nature | Nature code | Type of Claim: Vehicle (531) | Type of Claim: Liability (553) | Total
Not Classified | 0 | 0 | 611,893 | 611,893
Main Insured without claim (without damage) | 1 | 1,547,963 | 0 | 1,547,963
Collision | 10 | 10,388,416 | 2,644,551 | 13,032,967
Rollover | 12 | 52,792 | 116 | 52,908
Fire | 20 | 48,667 | 129 | 48,796
Total Robbery | 30 | 1,544,210 | 131 | 1,544,341
Total Theft | 31 | 838,764 | 21 | 838,785
Harmful Acts | 40 | 19,832 | 0 | 19,832
Flood/Inundation | 50 | 107,748 | 0 | 107,748
Robbery/Theft of accessories/equipment | 51 | 7,853 | 0 | 7,853
Partial Theft (vehicle part) | 52 | 19,595 | 0 | 19,595
Not Classified | 999 | 50,109 | 6,726 | 56,835
Total | | 14,625,949 | 3,263,567 | 17,889,516

Table 4: Cross table of indemnity type and type of claim.
Indemnity Type | Indemnity type code | Type of Claim: Vehicle (531) | Type of Claim: Liability (553) | Total
Not Classified | 0 | 2,417,392 | 129,670 | 2,547,062
Unrecoverable Full Indemnity | 1 | 211,647 | 71,678 | 283,325
Recoverable Full Indemnity | 2 | 1,282,547 | 349,241 | 1,631,788
Partial Indemnity | 3 | 8,578,133 | 2,696,084 | 11,274,217
Full Indemnity for Robbery/Theft not recovered | 4 | 1,614,894 | 150 | 1,615,044
Partial Indemnity for Robbery/Theft | 5 | 332,413 | 1 | 332,414
Compensation for People | 6 | 3,790 | 5,163 | 8,953
Compensation to Others | 7 | 3,739 | 4,731 | 8,470
Unrecoverable Full Indemnity for Robbery/Theft Rec. | 8 | 23,030 | 0 | 23,030
Full Indemnity Recoverable for Robbery/Theft Rec. | 9 | 69,565 | 0 | 69,565
Legacy of the previous version | 99 | 88,799 | 6,378 | 95,177
NA | | 0 | 471 | 471
Total | | 14,625,949 | 3,263,567 | 17,889,516
Table 5: Cross table of claim status and type of claim.
Claim Status | Claim status code | Type of Claim: Vehicle (531) | Type of Claim: Liability (553) | Total
Not Classified | 0 | 0 | 3 | 3
Indemnified | 1 | 9,979,332 | 2,818,810 | 12,798,142
Closed | 2 | 4,083,630 | 289,778 | 4,373,408
Declined | 3 | 28,435 | 10,037 | 38,472
Pending | 4 | 503,169 | 143,177 | 646,346
Legacy of the previous version | 88 | 31,374 | 1,762 | 33,136
Canceled | 99 | 9 | 0 | 9
Total | | 14,625,949 | 3,263,567 | 17,889,516
6 Descriptive analyses for model candidate variables

Descriptive statistics for the RNS features are included in Attachment 5.
Attachment 5: Data dictionary and statistics for RNS features.

7 Model development
We fit eXtreme Gradient Boosting (XGBoost) tree models with the R xgboost package for each of the 6
targets of interest. Version 1 of the RNS models used only the RNS data and the tax ID type
indicator, while version 2 additionally had 109 zip dummies created from the 1st and 2nd digits of
the zip code.
The importance gains of the models are shown in Table 6: beyond the importance gains for tax ID
type and the sum for the zip dummies, we list the RNS features with an importance gain of 0.03 or
larger in at least one model, in descending order of the row average of the feature. The full list
of importance gains for all features is included in Attachment 6. Some remarks follow:
• Casco has very similar results to partial loss, which makes sense as most casco claims come from
this type of claim. Both are also very similar to property liability, which is expected to be
related. For all 3 of these models, the number of insurers with previous partial indemnity claims
is the most important feature, followed by the total number of previous claims and then by the
number of plates with previous partial indemnity claims.
• For total loss, beyond the total number of claims, the most important attributes are the
minimum time to collision claim and the minimum time to any claim.
• For robbery/theft, it is intriguing that the most important feature is the number of indemnified
vehicle claims with full indemnity for robbery/theft not recovered (and, although much less
important, the minimum and maximum time to a full indemnity for robbery/theft not recovered claim,
which are closely related, are also among the top 5). One potential reason is that robbery/theft is
strongly related to high-risk regions, so people who live in these places are more subject to
recurrent events of this nature. This could be why zip is so much more important for robbery/theft
than for the other events, and could explain why the importance of this feature drops substantially
when zip is included. However, the feature remains very important after including zip, which
suggests that the zip of overnight/residence is not enough to explain this risk; other reasons
could be the places where the car is driven during the day, fraud, or the use of public/private
parking, for example. Other important features are the number of partial indemnity claims and the
number of claims.
• For bodily liability, the top 5 most important features in the v1 model are 1) number of
plates with claims on partial indemnity, 2) maximum time to fire claim, 3) maximum time to
collision claim, 4) minimum time to indemnified claim, and 5) number of collision claims.
Of these, only feature 2) of v1 remains among the top 5 in the v2 model, where it becomes 3); the
others in v2 are 1) number of vehicle claims, 2) minimum time to liability claim, 4) minimum time
to claim, and 5) maximum time to partial indemnity claim. The large changes in the most important
features from v1 to v2 might be due to the fact that this type of event has far fewer occurrences
than the others.

Attachment 6: Importance gains for all features.

Table 6: Importance gains on XGBoost model for RNS features that had importance gain values of 0.03 or larger, as well as tax ID type and sum of importance gains for zip dummies.
Feature | Casco v1 | Casco v2 | Partial Loss v1 | Partial Loss v2 | Total Loss v1 | Total Loss v2 | Robbery/Theft v1 | Robbery/Theft v2 | Property Liability v1 | Property Liability v2 | Bodily Liability v1 | Bodily Liability v2 | Avg. | Max.
# insurers with partial indemnity claims | 0.483 | 0.467 | 0.349 | 0.365 | 0.000 | 0.001 | 0.001 | 0.000 | 0.566 | 0.547 | 0.000 | 0.035 | 0.253 | 0.566
# claims | 0.242 | 0.232 | 0.220 | 0.205 | 0.179 | 0.170 | 0.049 | 0.025 | 0.172 | 0.169 | 0.000 | 0.000 | 0.151 | 0.242
# plates with claims on partial indemnity | 0.067 | 0.055 | 0.151 | 0.142 | 0.001 | 0.002 | 0.006 | 0.003 | 0.042 | 0.041 | 0.137 | 0.033 | 0.059 | 0.151
Minimum time to collision claim | 0.015 | 0.015 | 0.012 | 0.015 | 0.209 | 0.170 | 0.005 | 0.004 | 0.007 | 0.010 | 0.015 | 0.022 | 0.043 | 0.209
# claims on vehicle with full indemnity for robbery/theft not recovered and indemnified | 0.000 | 0.000 | 0.000 | 0.000 | 0.004 | 0.004 | 0.257 | 0.133 | 0.000 | 0.000 | 0.000 | 0.000 | 0.036 | 0.257
Minimum time to claim | 0.003 | 0.003 | 0.002 | 0.002 | 0.115 | 0.142 | 0.005 | 0.008 | 0.003 | 0.003 | 0.029 | 0.040 | 0.029 | 0.142
# partial indemnity claims | 0.001 | 0.001 | 0.058 | 0.005 | 0.027 | 0.032 | 0.084 | 0.041 | 0.002 | 0.002 | 0.002 | 0.007 | 0.023 | 0.084
Min. time to full indemnity for robbery/theft not recovered claim | 0.001 | 0.000 | 0.003 | 0.001 | 0.008 | 0.007 | 0.090 | 0.069 | 0.001 | 0.001 | 0.043 | 0.001 | 0.020 | 0.090
Maximum time to partial indemnity claim | 0.007 | 0.004 | 0.009 | 0.002 | 0.030 | 0.027 | 0.018 | 0.009 | 0.019 | 0.018 | 0.005 | 0.037 | 0.013 | 0.037
Maximum time to collision claim | 0.007 | 0.005 | 0.006 | 0.006 | 0.009 | 0.007 | 0.009 | 0.012 | 0.003 | 0.005 | 0.063 | 0.028 | 0.012 | 0.063
Max. time to full indemnity for robbery/theft not recovered claim | 0.001 | 0.000 | 0.003 | 0.001 | 0.004 | 0.004 | 0.032 | 0.062 | 0.001 | 0.001 | 0.020 | 0.000 | 0.012 | 0.062
Minimum time to indemnified claim | 0.010 | 0.009 | 0.005 | 0.006 | 0.014 | 0.011 | 0.005 | 0.003 | 0.004 | 0.004 | 0.055 | 0.021 | 0.012 | 0.055
Maximum time to fire claim | 0.001 | 0.000 | 0.001 | 0.001 | 0.005 | 0.004 | 0.003 | 0.001 | 0.001 | 0.001 | 0.103 | 0.045 | 0.011 | 0.103
# collision claims | 0.003 | 0.002 | 0.005 | 0.004 | 0.007 | 0.008 | 0.018 | 0.013 | 0.002 | 0.002 | 0.052 | 0.001 | 0.010 | 0.052
# vehicle claims | 0.016 | 0.015 | 0.016 | 0.011 | 0.013 | 0.011 | 0.014 | 0.004 | 0.008 | 0.007 | 0.000 | 0.054 | 0.010 | 0.054
Maximum time to closed claim | 0.003 | 0.003 | 0.004 | 0.003 | 0.013 | 0.011 | 0.012 | 0.008 | 0.003 | 0.002 | 0.041 | 0.021 | 0.009 | 0.041
Maximum time to claim | 0.003 | 0.003 | 0.002 | 0.002 | 0.007 | 0.006 | 0.010 | 0.005 | 0.003 | 0.002 | 0.043 | 0.026 | 0.008 | 0.043
Minimum time to liability claim | 0.002 | 0.001 | 0.001 | 0.001 | 0.012 | 0.011 | 0.011 | 0.007 | 0.002 | 0.001 | 0.024 | 0.047 | 0.007 | 0.047
Minimum time to total robbery claim | 0.001 | 0.001 | 0.001 | 0.002 | 0.006 | 0.004 | 0.031 | 0.016 | 0.001 | 0.001 | 0.002 | 0.000 | 0.006 | 0.031
Maximum time to liability claim | 0.002 | 0.002 | 0.001 | 0.001 | 0.017 | 0.015 | 0.009 | 0.007 | 0.001 | 0.001 | 0.003 | 0.032 | 0.005 | 0.032
Maximum time to rollover claim | 0.001 | 0.000 | 0.001 | 0.000 | 0.003 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.031 | 0.011 | 0.004 | 0.031
Minimum time to rollover claim | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | 0.032 | 0.016 | 0.003 | 0.032
# claims on indemnified | 0.001 | 0.001 | 0.000 | 0.000 | 0.006 | 0.005 | 0.002 | 0.002 | 0.000 | 0.000 | 0.001 | 0.034 | 0.002 | 0.034
# closed claims | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.012 | 0.030 | 0.002 | 0.030
Minimum time to main insured without claim (without damage) | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.003 | 0.002 | 0.001 | 0.001 | 0.003 | 0.031 | 0.001 | 0.031
Tax ID type | 0.012 | 0.010 | 0.009 | 0.007 | 0.024 | 0.023 | 0.006 | 0.003 | 0.016 | 0.015 | 0.000 | 0.000 | 0.011 | 0.024
Zip dummies | - | 0.065 | - | 0.103 | - | 0.079 | - | 0.385 | - | 0.050 | - | 0.099 | 0.136 | 0.385

8 Model validation
For discrimination, we generally prefer the accuracy ratio (AR/Gini), which summarizes the
cumulative accuracy profile (CAP), or cumulative gain chart, and has a one-to-one relationship with
the area under the receiver operating characteristic curve (AR = 2*AUROC - 1). For claim frequency
models, lifts are also especially useful since they resemble the estimated relativities in the
Actuarial models. In Attachment 7, we show the cumulative gain charts and cumulative lift charts
for each of the events, comparing the RNS scores v1 and v2 built specifically for that event
against the Serasa bureau score, income estimates from Serasa, and 4 versions of Neurotech scores,
for the 2019 modeling period. Table 7 shows the Gini and the top decile lift, i.e., the ratio of
the claim frequency in the top 10% of scores to the overall claim frequency, again for the 2019
modeling period; this table includes the RNS scores for all events. Attachment 8 also includes
these metrics for development (modeling periods 2018 and earlier) and for out-of-time validation
(modeling period 2019), separating results for the On-line and Gold products.
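The two metrics can be sketched in a few lines. This is a simplified illustration, not the validation code used for the report: it assumes one binary claim indicator per policy with no exposure weighting, a score where higher means riskier, and ties between a claim and a non-claim counted as half a concordant pair. All function names are hypothetical.

```python
def auroc(scores, events):
    """Probability that a random claim outranks a random non-claim (ties count 0.5)."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(scores, events):
    """Accuracy ratio, using the identity AR = 2*AUROC - 1."""
    return 2 * auroc(scores, events) - 1


def top_decile_lift(scores, events):
    """Claim frequency in the 10% highest scores divided by the overall frequency."""
    ranked = sorted(zip(scores, events), key=lambda pair: pair[0], reverse=True)
    k = max(1, len(ranked) // 10)
    top_rate = sum(e for _, e in ranked[:k]) / k
    overall_rate = sum(events) / len(events)
    return top_rate / overall_rate


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
events = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(gini(scores, events), top_decile_lift(scores, events))
```

The pairwise loop is quadratic in the number of policies, so a rank-based computation would be needed on datasets of the size described in Table 2.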
Attachment 7: Cumulative accuracy profile/gain chart and cumulative lift by decile on the different claim events for 2019
modeling period.
Table 7: Gini and top decile lift measures for scores on the different claim events for 2019 modeling period.
Score | Casco Gini | Casco Lift | Partial Loss Gini | Partial Loss Lift | Total Loss Gini | Total Loss Lift | Robbery/Th. Gini | Robbery/Th. Lift | Property Liab. Gini | Property Liab. Lift | Bodily Liab. Gini | Bodily Liab. Lift
Serasa | 4.2% | 1.19 | 2.6% | 1.07 | 7.6% | 1.35 | 18.1% | 1.78 | 4.4% | 1.19 | 1.1% | 0.93
Income/Renda | 1.7% | 1.17 | 5.9% | 1.14 | 11.4% | 1.51 | 25.2% | 2.23 | 5.5% | 1.10 | 5.9% | 1.56
Neuro RF 2.1 | 0.5% | 1.03 | 3.3% | 1.09 | 1.8% | 1.10 | 29.5% | 2.24 | 6.8% | 1.17 | 14.5% | 1.48
Neuro RF 3.0 | 0.7% | 1.13 | 4.4% | 1.02 | 9.6% | 1.44 | 47.4% | 3.32 | 4.0% | 1.10 | 12.6% | 1.38
Neuro Sinistro | 6.6% | 1.19 | 6.7% | 1.19 | 3.0% | 1.17 | 10.9% | 1.21 | 13.5% | 1.35 | 0.8% | 0.98
x.cas | 46.7% | 1.66 | 47.8% | 1.75 | 43.6% | 1.51 | 36.5% | 1.25 | 48.3% | 1.84 | 46.7% | 1.56
x.cas2 | 26.6% | 1.68 | 28.7% | 1.79 | 20.3% | 1.51 | 9.9% | 1.14 | 27.1% | 1.83 | 28.1% | 1.60
x.pparcial | 46.2% | 1.63 | 47.9% | 1.76 | 42.4% | 1.44 | 29.4% | 1.03 | 47.9% | 1.82 | 46.5% | 1.58
x.pparcial2 | 23.9% | 1.66 | 27.4% | 1.80 | 14.6% | 1.43 | -5.3% | 0.96 | 25.4% | 1.81 | 28.7% | 1.56
x.ptotal | 45.7% | 1.61 | 45.9% | 1.61 | 44.9% | 1.61 | 43.1% | 1.63 | 47.5% | 1.73 | 46.1% | 1.41
x.ptotal2 | 30.3% | 1.62 | 29.4% | 1.61 | 32.6% | 1.61 | 35.4% | 1.67 | 31.4% | 1.73 | 32.0% | 1.52
x.roubfur | 50.8% | 1.35 | 50.0% | 1.18 | 51.5% | 1.40 | 57.7% | 2.54 | 52.1% | 1.37 | 50.4% | 1.02
x.roubfur2 | 24.9% | 1.12 | 22.3% | 0.78 | 28.8% | 1.38 | 49.2% | 3.25 | 27.7% | 1.07 | 20.8% | 0.72
x.rcdm | 46.6% | 1.58 | 47.4% | 1.66 | 43.6% | 1.42 | 39.5% | 1.21 | 49.6% | 1.91 | 47.8% | 1.46
x.rcdm2 | 13.2% | 1.61 | 15.2% | 1.71 | 6.5% | 1.47 | -1.8% | 1.13 | 22.1% | 1.92 | 17.7% | 1.52
x.rcdc | 62.9% | 1.09 | 63.5% | 1.14 | 60.3% | 0.97 | 57.7% | 0.93 | 65.4% | 1.18 | 62.7% | 1.25
x.rcdc2 | 46.0% | 1.24 | 46.7% | 1.34 | 43.9% | 1.18 | 40.2% | 0.60 | 48.9% | 1.52 | 50.7% | 2.30

Attachment 8: Gini and top decile lift measures for scores on the different claim events for development and out-of-time
validation, separating also by On-line and Gold products.
It is worth noting that the Serasa and Neurotech 2.1, 3.0 and Sinistro scores have an inverse
relationship with claim frequencies, i.e., the higher the score, the lower the claim frequency. On
the other hand, the RNS scores at this point and the Neurotech Severity scores have a direct
relationship. When implementing RNS scores, we will transform them into integers ranging from 0 to
1,000 with an inverse relationship, since this is the behavior most familiar to the teams in Brazil
that use Serasa and Neurotech scores.
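The document does not specify the transformation, so the following is only one possible sketch: a linear min-max rescaling of the raw model output, flipped so that higher risk gives a lower score. The function name, the bounds `low` and `high`, and the choice of a linear mapping are all assumptions; a rank-based or calibrated mapping would be equally compatible with the text.

```python
def to_inverse_score(raw, low, high):
    """Map a raw risk output in [low, high] to an integer in 0..1000, higher = safer."""
    clipped = min(max(raw, low), high)         # guard against values outside the bounds
    scaled = (high - clipped) / (high - low)   # 1.0 for the lowest risk, 0.0 for the highest
    return int(round(1000 * scaled))


# Hypothetical raw outputs on a 0..1 scale.
for raw in (0.0, 0.25, 1.0):
    print(raw, to_inverse_score(raw, low=0.0, high=1.0))
```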
It is interesting to note that the income has an inverse relationship when it comes to robbery/theft
and total loss. On the other hand, the relationship is direct, though very weak, for partial loss,
property liability, and bodily liability. For casco, given the combined different relationships, it is sort
of U-shaped, as the lifts are higher for lower and higher scores, and the lifts are lower in the middle.
Moving back to our main interest, the RNS scores: when we compare the cumulative gain charts and
the Gini coefficients, we can see that the Gini becomes artificially inflated, because about 67% of
the policies have tax IDs that have never appeared in RNS data and, hence, receive tied scores.
Adding zip decreases the ties, of course, though it does not resolve them completely. Given this
drawback of the Gini, the most appropriate way of comparing the scores is to analyze the cumulative
gain charts and the cumulative lift charts/measures. There we can see that the RNS scores appear to
be far superior to the Neurotech scores for casco, partial loss, property liability, and bodily
liability. They are still slightly superior for total loss (where income seems to be even slightly
stronger). On the other hand, the Neurotech scores are slightly superior to the RNS scores for
robbery/theft.
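To make the lift measure concrete, the sketch below shows one possible way to compute a cumulative lift that does not depend on the arbitrary ordering of tied scores. It assumes a direct score (higher score, higher claim frequency) and a 0/1 claim indicator per policy. The function name and the tie handling (spreading each tie group's claims evenly over its policies) are assumptions for illustration, not necessarily the calculation behind Attachment 8.

```python
def top_fraction_lift(scores, claims, fraction=0.1):
    """Lift of the top `fraction` of policies ranked by descending score.

    Tied policies share the average claim rate of their tie group, so the
    result is the same whatever order the tied policies are listed in.
    """
    n = len(scores)
    overall = sum(claims) / n
    if overall == 0:
        raise ValueError("lift is undefined when there are no claims")
    # Count policies and claims per distinct score value.
    groups = {}
    for s, c in zip(scores, claims):
        cnt, tot = groups.get(s, (0, 0.0))
        groups[s] = (cnt + 1, tot + c)
    k = fraction * n                 # size of the top segment (may be fractional)
    taken, claim_mass = 0.0, 0.0
    for s in sorted(groups, reverse=True):
        cnt, tot = groups[s]
        take = min(cnt, k - taken)   # take only part of a tie group if needed
        claim_mass += take * tot / cnt
        taken += take
        if taken >= k:
            break
    return (claim_mass / k) / overall
```

For example, with ten policies of which eight share the same lowest score, `top_fraction_lift(scores, claims, 0.5)` takes three of the eight tied policies at their group's average claim rate instead of picking three of them arbitrarily.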
It is worth commenting that Neurotech requests the zip code and the vehicle type together with the
tax ID when scores are generated. From the beginning, MAPFRE opted to send just the zip code.
It is not clear what type of information Neurotech uses in its models. In theory, it should not
have access to RNS data. There were rumors that, when Neurotech was developing its scores in the
past, one or more insurers could have shared information so that custom scores could be developed
for them, and it is not clear whether these data could also have been used in its off-the-shelf scores.
9 Model usage
RNS scores v2 have fewer ties and better predictive power when evaluated individually, but this
predictive power might not be leveraged as much in the Actuarial and Market Pricing models, since
they already account for zip/Brazilian regions. This suggests RNS scores v1 could be more
orthogonal to the current internal models than v2 and, therefore, there is a chance that both would
end up generating the same gain in those models, although the predictive behavior of RNS data
could, indeed, vary per region.
Also, if the RNS scores include zip, the only way to implement them would be on Atenea, with an API
that generates the scores at the time of the quote/subscription. On the other hand, RNS v1 allows a
much simpler implementation: every month, after new RNS data is obtained in MAPFRE, we can
generate RNS scores for all tax IDs that have ever appeared in RNS, and IT can load this
simple look-up table into our systems. This is the selected strategy at this point.
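The look-up strategy can be illustrated with a minimal sketch. The names below and the default score given to tax IDs that never appeared in RNS are hypothetical. The document states only that such tax IDs receive tied scores, not which value they receive.

```python
DEFAULT_SCORE = 500  # hypothetical score for tax IDs never seen in RNS


def build_lookup(scored_rows):
    """Build a {tax_id: score} table from the monthly (tax_id, score) pairs."""
    return {tax_id: score for tax_id, score in scored_rows}


def score_for(lookup, tax_id, default=DEFAULT_SCORE):
    """Return the stored score, or the default for an unseen tax ID."""
    return lookup.get(tax_id, default)


# Illustrative data only; these are not real tax IDs or scores.
monthly_scores = [("TAX-A", 120), ("TAX-B", 870)]
lookup = build_lookup(monthly_scores)
```

Because scoring then reduces to a key look-up, no model needs to run at quote time, which is what makes v1 simpler to deploy than a zip-dependent score.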
It is worth commenting that, in initial discussions, the Actuarial and Market Pricing teams
preferred that the RNS models not be updated monthly/frequently in the machine learning spirit;
instead, they think it is best to sync the update of the RNS models with the updates of their
internal models, using the same datasets for both purposes.
10 Modeling code and computational aspects

All computations were performed in RStudio 1.2.1335 using Microsoft R Open 3.5.3 with Intel MKL
for parallel mathematical computing (using 4 cores of an Intel Core i7-8550U CPU @ 1.80GHz and
32 GB RAM). The whole R code used is included in Attachment 9, which also lists all packages and
indicates how long the most time-consuming pieces of code take to run.
For example, joining the MAPFRE policies (8 million rows) to the merged RNS dataset (18 million
claims) and then aggregating the information into the 400+ features took about 40 hours.
Attachment 9: R code used for the model development and testing.
11 Model limitations and potential enhancements for next versions

During the model development, we noticed some potential limitations and research opportunities
to be attempted to enhance the model in the future:
• As indicated earlier in the document, in this first version of the model we focused on
linking RNS data via tax ID, as this is expected to be the strongest information. However,
mimicking the process using the chassis number and/or plate of the vehicle could potentially
also help discrimination, so it should be tested.
• The tax ID type was used because companies have many more claims than individuals. Serasa
and other data providers collect the public information of shareholders/partners of
companies. It would be very interesting to test whether the past claim behavior in RNS of
these individuals could help to better discriminate the companies. Similarly, some bureaus
provide family associations (e.g., spouse and children), or could relate individuals who lived
at the same address and/or shared the same telephone number. The same type of test
could be performed.
• At this first stage, the RNS models focused on predicting claim frequencies; however,
nothing prevents us from extending them and testing whether RNS data could also help
predict the severities of the events.
• As is probably evident from the size of the sections, we had to spend a lot of time on data
preparation and, given the timeframe we had to finish the project, it was not possible to
explore in depth the relationships of the features to the targets. We think this should be
done, together with work on model selection strategies, in the next review of the model, and
this may open other ideas for dealing with the features. We also faced memory limitations on
the laptop; a potentially better strategy would be to not create the less granular features,
which would make room for creating the more granular features also for number of distinct
plates, distinct insurers, and minimum/maximum time difference.
