
Introduction to Bootstrapping

James Guszcza, FCAS, MAAA

CAS Predictive Modeling Seminar

Chicago

September, 2005

Actuaries compute point estimates of statistics all the time.

Loss ratio/claim frequency for a population

Outstanding losses

Correlation between variables

GLM parameter estimates

A point estimate tells us what the data indicates.

But how can we measure our confidence in this indication?

More Concisely

Point estimate says: what do you think?

Variability of the point estimate says: how sure are you?

Traditional approaches

Credibility theory

Use distributional assumptions to construct confidence intervals

The statistician Bradley Efron made an ingenious suggestion.

Most (sometimes all) of what we know about the true probability distribution comes from the data.

So let's treat the data as a proxy for the true distribution.

We draw multiple samples from this proxy and compute the statistic of interest on each of the resulting pseudo-datasets.

Philosophy

"... very little in the way of modeling, assumptions, or analysis, and can be applied in an automatic way to any situation, no matter how complicated."

"An important theme is the substitution of raw computing power for theoretical analysis."

--Efron and Gong 1983

Bootstrapping fits naturally into the data mining paradigm.

Theoretical Picture

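Any actual sample of data was drawn from the unknown true distribution ("the true distribution in the sky").

We use the actual data to make inferences about the true parameters ($\mu$).

Each row below is a sample that might have been drawn, with its sample mean:

Sample 1: $Y^1_1, Y^1_2, \ldots, Y^1_k \rightarrow \bar{Y}^1$

Sample 2: $Y^2_1, Y^2_2, \ldots, Y^2_k \rightarrow \bar{Y}^2$

Sample 3: $Y^3_1, Y^3_2, \ldots, Y^3_k \rightarrow \bar{Y}^3$

Sample N: $Y^N_1, Y^N_2, \ldots, Y^N_k \rightarrow \bar{Y}^N$

The distribution of our estimator ($\bar{Y}$) depends on both the true distribution and the size ($k$) of our sample.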

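Treat the actual sample as a proxy for the true distribution.

The actual sample: $Y_1, Y_2, \ldots, Y_k$

Sample with replacement from your actual distribution N times.

Compute the statistic of interest on each re-sample.

Re-sample 1: $Y^{*1}_1, Y^{*1}_2, \ldots, Y^{*1}_k \rightarrow \bar{Y}^{*1}$

Re-sample 2: $Y^{*2}_1, Y^{*2}_2, \ldots, Y^{*2}_k \rightarrow \bar{Y}^{*2}$

Re-sample 3: $Y^{*3}_1, Y^{*3}_2, \ldots, Y^{*3}_k \rightarrow \bar{Y}^{*3}$

Re-sample N: $Y^{*N}_1, Y^{*N}_2, \ldots, Y^{*N}_k \rightarrow \bar{Y}^{*N}$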

$(1 - 1/500)^{500} \approx 1/e \approx .368$ is the probability that any one of the original data points won't appear at all if we sample with replacement 500 times.

So any data point is included with probability $\approx .632$.
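The .368 and .632 figures can be checked with a quick simulation. This is only a sketch: the seed, the number of trials (2000), and the variable names are illustrative choices that do not come from the slides.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 500

# Exact probability that a given point is never drawn in n draws with replacement
p.missing <- (1 - 1/n)^n

# Simulation: fraction of the n original indices absent from each resample
frac.missing <- replicate(2000, {
  idx <- sample(n, n, replace = TRUE)
  1 - length(unique(idx)) / n
})

c(exact = p.missing, limit = exp(-1), simulated = mean(frac.missing))
```

The three numbers should sit close together, and one minus any of them gives the share of original points that a typical resample contains.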

The empirical distribution serves as a proxy for the true population in the sky.

Each resample simulates the process of taking a sample from the true distribution.

Graph on left: Y-bar calculated from an infinite number of samples from the true distribution.

Graph on right: {Y*-bar} calculated in each of 1000 resamples from the empirical distribution.

Analogy: $\mu : \bar{Y} :: \bar{Y} : \bar{Y}^*$

[Figure: left panel plots phi.ybar against ybar; right panel shows the distribution of y.star.bar.]

Summary

The empirical distribution (your data) serves as a proxy for the true distribution.

Resampling means (repeatedly) sampling

with replacement.

Resampling the data is analogous to the

process of drawing the data from the true

distribution.

We can resample multiple times

Compute the statistic of interest T on each resample

We get an estimate of the distribution of T.
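The recipe in this summary can be sketched as a small reusable function. The helper name `bootstrap`, the exponential test data, and the choice of the median as the statistic T are all assumptions made for illustration; none of them come from the slides.

```r
# Hypothetical helper: apply `statistic` to R resamples (with replacement) of x
bootstrap <- function(x, statistic, R = 1000) {
  replicate(R, statistic(sample(x, length(x), replace = TRUE)))
}

set.seed(2)                       # arbitrary seed
x <- rexp(200, rate = 1/10)       # made-up skewed data with mean 10
t.star <- bootstrap(x, median)    # bootstrap distribution of T = median

sd(t.star)                        # bootstrap standard error of T
quantile(t.star, c(.025, .975))   # percentile interval for T
```

Any function that maps a data vector to a single number can be passed as `statistic`, which is what makes the approach "automatic".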

Motivating Example

Let's look at a simple case where we all know the answer in advance.

Pull 500 draws from the n(5000,100) dist.

The sample mean ≈ 5000 is a point estimate of the true mean $\mu$.

But how sure are we of this estimate?

From theory, we know that:

$$\mathrm{s.d.}(\bar{X}) = \sigma / \sqrt{N} = 100 / \sqrt{500} \approx 4.47$$

Look at summary statistics, histogram, probability density estimate, QQ-plot.

Looks pretty normal.

| statistic | raw data |
|---|---|
| #obs | 500 |
| mean | 4995.79 |
| sd | 98.78 |
| 2.5%ile | 4812.30 |
| 97.5%ile | 5195.58 |

[Figure: histogram and density estimate of the n(5000,100) data, with a normal Q-Q plot.]

Now let's use resampling to estimate the s.d. of the sample mean (≈ 4.47).

In each resample, some of the original data points will appear more than once; others won't appear at all.

Resampling

Sample with replacement 500 data points from the original dataset S

Call this $S^*_1$

Now do this 999 more times!

$S^*_1, S^*_2, \ldots, S^*_{1000}$

Compute X-bar on each of these 1000 samples.

R Code

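# Draw 500 observations from a normal distribution with mean 5000 and s.d. 100
norm.data <- rnorm(500, mean = 5000, sd = 100)

# Resample `data` with replacement R times; record the mean and s.d. of each resample
boots <- function(data, R) {
  b.avg <- numeric(R)
  b.sd <- numeric(R)
  for (b in 1:R) {
    ystar <- sample(data, length(data), replace = TRUE)
    b.avg[b] <- mean(ystar)
    b.sd[b] <- sd(ystar)
  }
  list(b.avg = b.avg, b.sd = b.sd)
}

res <- boots(norm.data, 1000)
b.avg <- res$b.avg   # bootstrap means
b.sd <- res$b.sd     # bootstrap standard deviations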

Results

From theory we know that X-bar ~ n(5000, 4.47)

Bootstrapping estimates this pretty well!

And we get an estimate of the whole distribution, not just a confidence interval.

| statistic | raw data | X-bar (theory) | X-bar (bootstrap) |
|---|---|---|---|
| #obs | 500 | 1,000 | 1,000 |
| mean | 4995.79 | 5000.00 | 4995.98 |
| sd | 98.78 | 4.47 | 4.43 |
| 2.5%ile | 4705.08 | 4991.23 | 4987.60 |
| 97.5%ile | 5259.27 | 5008.77 | 5004.82 |

[Figure: histogram of the bootstrap X-bar values and normal Q-Q plot.]

Confidence Intervals

Percentile method: just take the desired percentiles of the bootstrap histogram. More reliable in cases of asymmetric bootstrap histograms.

Standard error method: take the point estimate plus or minus 2 bootstrap standard deviations.

> mean(norm.data) - 2 * sd(b.avg)
[1] 4986.926
> mean(norm.data) + 2 * sd(b.avg)
[1] 5004.661
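The two interval calculations can be put side by side. The sketch below regenerates its own data so that it runs on its own; the seed and the names `ci.pct` and `ci.se` are illustrative, so the numbers will not match the slide output exactly.

```r
set.seed(3)                                       # arbitrary seed
norm.data <- rnorm(500, mean = 5000, sd = 100)
b.avg <- replicate(1000, mean(sample(norm.data, replace = TRUE)))

# Percentile method: read the interval off the bootstrap distribution
ci.pct <- quantile(b.avg, probs = c(.025, .975))

# Standard error method: point estimate +/- 2 bootstrap standard deviations
ci.se <- mean(norm.data) + c(-2, 2) * sd(b.avg)

rbind(percentile = ci.pct, std.error = ci.se)
```

For a symmetric bootstrap histogram like this one the two rows should nearly coincide; they separate when the histogram is skewed.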


And a Bonus

We also compute the standard deviation of each pseudo-dataset.

This enables us to estimate the correlation between the mean and s.d.

The normal distribution is not skewed, so the mean and s.d. are uncorrelated.

Our bootstrapping experiment confirms this.
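A minimal sketch of that check, assuming freshly simulated n(5000,100) data and an arbitrary seed. The matrix name `stats` is illustrative and not taken from the slides.

```r
set.seed(4)                                       # arbitrary seed
norm.data <- rnorm(500, mean = 5000, sd = 100)

# Each column of `stats` holds the (mean, sd) of one resample
stats <- replicate(1000, {
  ystar <- sample(norm.data, replace = TRUE)
  c(avg = mean(ystar), sd = sd(ystar))
})

# Correlation between resample mean and resample s.d.; near zero for normal data
cor(stats["avg", ], stats["sd", ])
```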

[Figure: scatterplot of sample.sd against sample.mean for the bootstrap resamples.]

So far, bootstrapping has confirmed a result we know to be true from theory.

Often in the real world we either don't know the true distributional properties of a random variable, or are too busy to find out.

This is when bootstrapping really comes in handy.

Severity Data

2700 size-of-loss data points.

Let's bootstrap the sample mean & 75th %ile.

Gamma? Lognormal? Don't need to know.

[Figure: density of the severity distribution.]
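The severity example can be sketched as follows. The actual 2700 loss amounts are not available here, so the code substitutes made-up lognormal draws; the parameters, the seed, and the name `boot.stats` are assumptions. The bootstrap itself never uses the fact that the stand-in data are lognormal.

```r
set.seed(5)                                       # arbitrary seed
# Stand-in for the 2700 size-of-loss points (the real data are not available here)
severity <- rlnorm(2700, meanlog = 7.5, sdlog = 1)

# Mean and 75th percentile of each resample, one column per resample
boot.stats <- replicate(1000, {
  s <- sample(severity, replace = TRUE)
  c(mean = mean(s), q75 = unname(quantile(s, .75)))
})

apply(boot.stats, 1, sd)                                # bootstrap standard errors
apply(boot.stats, 1, quantile, probs = c(.025, .975))   # percentile intervals
```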

[Figure: bootstrap histograms and normal Q-Q plots for the two statistics.]

So far, bootstrapping suggests that many sample statistics (even average severity!) are approximately normally distributed.

But this breaks down if our statistic is not a smooth function of the data.

Often in loss reserving we want to focus our attention way out in the tail.

The 90th %ile is an example.

[Figure: bootstrap histogram and normal Q-Q plot for the 90th %ile.]

Again we compute the sample average and s.d. on each pseudo-dataset.

This time (as one would expect) the variance is a function of the mean.

[Figure: scatterplot of sample.sd against sample.mean for the bootstrap resamples.]

Credit on a scale of 1-100

1 is worst; 100 is best

Age, credit are linearly related

See plot

$R^2 \approx .08$; $\rho \approx .28$

Older people tend to have better credit

What is the confidence interval around $\rho$?

[Figure: Plot of Age vs Credit.]

$\rho \approx .28$

$\mathrm{s.d.}(\rho) \approx .028$

> quantile(boot.avg,probs=c(.025,.975))
     2.5%     97.5%
0.2247719 0.3334889
> rho - 2*sd(boot.avg); rho + 2*sd(boot.avg)
0.2250254 0.3354617
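When bootstrapping a correlation, the (age, credit) pairs have to be resampled together, which is easiest to do by resampling row indices. The sketch below uses made-up data: the sample size, the age range, and the slope and noise level are assumptions picked only to give a weak positive relationship.

```r
set.seed(6)                                       # arbitrary seed
# Made-up stand-in data: age and a 1-100 credit score, weakly positively related
n <- 700
age <- runif(n, 18, 90)
credit <- pmin(100, pmax(1, 35 + 0.3 * age + rnorm(n, sd = 20)))

rho <- cor(age, credit)

# Resample (age, credit) pairs together by resampling row indices
boot.rho <- replicate(1000, {
  i <- sample(n, n, replace = TRUE)
  cor(age[i], credit[i])
})

quantile(boot.rho, probs = c(.025, .975))         # percentile interval
rho + c(-2, 2) * sd(boot.rho)                     # +/- 2 s.d. interval
```

Resampling `age` and `credit` independently would destroy the relationship being measured, so the shared index vector `i` is the important design choice.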

[Figure: bootstrap histogram of the correlation coefficient and normal Q-Q plot.]

1300 zip-code level data points

Variables: population density, median #vehicles/HH

$R^2 \approx .50$; $\rho \approx -.70$

[Figure: Median #Vehicles vs Pop Density (veh against density), with loess line and regression line.]

This time the bootstrap distribution of $\rho$ is more skewed.

$\rho \approx -.70$

95% conf interval: (-.75, -.67)

Not symmetric around $\rho$

The effect becomes more pronounced the larger the magnitude of $\rho$.

[Figure: bootstrap histogram of the correlation coefficient and normal Q-Q plot.]

Total loss ratio of a segment of business is our favorite point estimate.

Its variability depends on many things:

Size of book

Loss distribution

Accuracy of rating plan

Consistency of underwriting

How could we hope to write down its true probability distribution?

Severity dist from previous example

LR = .79

Claim frequency = .08

Let's build confidence intervals around these two point estimates.

We will resample the data 500 times

Compute total LR and freq on each sample

Plot the histogram

LR ≈ .79

s.d.(LR) ≈ .05

conf interval ≈ ±0.1

Confidence interval calculations disagree a bit:

> quantile(boot.avg,probs=c(.025,.975))
     2.5%     97.5%
0.6974607 0.8829664
> lr - 2*sd(boot.avg); lr + 2*sd(boot.avg)
0.6897653 0.8888983
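A sketch of the loss ratio bootstrap, resampling whole policies so that each policy's premium and loss stay together. The policy data here are invented: the premium, the claim probability, the lognormal parameters, and the helper name `lr.freq` are all assumptions, so the output will not reproduce the slide's .79.

```r
set.seed(7)                                       # arbitrary seed
# Made-up policy-level data: premium, and a loss that is zero unless a claim occurs
n <- 5000
premium <- rep(1000, n)
has.claim <- rbinom(n, 1, 0.08)
loss <- has.claim * rlnorm(n, meanlog = 8.5, sdlog = 1.3)

# Total loss ratio and claim frequency for the policies in rows i
lr.freq <- function(i) {
  c(lr = sum(loss[i]) / sum(premium[i]), freq = mean(has.claim[i]))
}

point <- lr.freq(seq_len(n))                      # point estimates on the full book
boot <- replicate(500, lr.freq(sample(n, n, replace = TRUE)))

point
apply(boot, 1, sd)                                # bootstrap s.d. of LR and freq
apply(boot, 1, quantile, probs = c(.025, .975))   # percentile intervals
```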

[Figure: histogram of bootstrap total LR and normal Q-Q plot.]

How does this affect the variability of LR?

Again re-sample 500 times

Skewness, variance increase considerably

LR: .79 → .78

s.d.(LR): .05 → .13

[Figure: histogram of bootstrap total LR and normal Q-Q plot.]

Distribution of Capped LR

Capped loss ratios are more robust statistics

Remove leverage of a few large data points

Here we cap policy-level losses at $30,000

Closer to frequency

s.d. cut in half! .05 → .025
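The effect of capping can be sketched by bootstrapping the capped and uncapped loss ratios on the same resamples. The policy data are made up (premium, claim probability, and lognormal parameters are assumptions), so the size of the reduction will differ from the slide's.

```r
set.seed(8)                                       # arbitrary seed
n <- 5000
premium <- rep(1000, n)
loss <- rbinom(n, 1, 0.08) * rlnorm(n, meanlog = 8.5, sdlog = 1.3)
capped <- pmin(loss, 30000)                       # cap policy-level losses at $30,000

# One column of resampled row indices per bootstrap replicate
idx <- replicate(500, sample(n, n, replace = TRUE))
lr.uncapped <- apply(idx, 2, function(i) sum(loss[i]) / sum(premium[i]))
lr.capped   <- apply(idx, 2, function(i) sum(capped[i]) / sum(premium[i]))

c(sd.uncapped = sd(lr.uncapped), sd.capped = sd(lr.capped))
```

Using the same index matrix for both statistics means any difference in spread comes from the cap rather than from resampling noise.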

[Figure: bootstrap histogram of capped LR and normal Q-Q plot.]

freq ≈ .08

s.d.(freq) ≈ .0017

Confidence interval calculations match very well:

> quantile(boot.avg,probs=c(.025,.975))
      2.5%      97.5%
0.07734336 0.08391072
> lr - 2*sd(boot.avg); lr + 2*sd(boot.avg)
0.07719618 0.08388898

[Figure: bootstrap histogram of claim frequency and normal Q-Q plot.]

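Split the book into two sub-segments: {clean drivers, other}

LRtot = .79

LRclean = .58, so the loss ratio relativity is LRRclean = .58/.79 - 1 = -27%

LRother = .84, so LRRother = .84/.79 - 1 = +6%

Clean drivers appear to have a lower loss ratio than non-clean drivers

How sure are we of this indication?

Let's use bootstrapping.

Resample the data 500 times.

On each resample compute LRc*, LRo*, (LRc* - LRo*), (LRc* / LRo*)

What is the average difference in loss ratios?

What percent of the time is the difference in loss ratios greater than x%?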
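The segment comparison can be sketched as below. The book of business is invented: the share of clean drivers, the two claim probabilities, the severity parameters, and the 10-point threshold standing in for "x%" are all assumptions, as is the helper name `seg.lr`.

```r
set.seed(9)                                       # arbitrary seed
# Made-up policy data with a clean-driver flag; clean drivers get a lower claim frequency
n <- 5000
clean <- rbinom(n, 1, 0.3) == 1
premium <- rep(1000, n)
loss <- rbinom(n, 1, ifelse(clean, 0.06, 0.09)) * rlnorm(n, meanlog = 8.5, sdlog = 1.3)

# Segment loss ratios, their difference, and their ratio for the policies in rows i
seg.lr <- function(i) {
  cl <- clean[i]
  lr.c <- sum(loss[i][cl]) / sum(premium[i][cl])
  lr.o <- sum(loss[i][!cl]) / sum(premium[i][!cl])
  c(lr.c = lr.c, lr.o = lr.o, diff = lr.c - lr.o, ratio = lr.c / lr.o)
}

boot <- replicate(500, seg.lr(sample(n, n, replace = TRUE)))

mean(boot["diff", ])             # average difference in loss ratios
mean(boot["diff", ] < -0.10)     # share of resamples where clean is lower by more than 10 points
```

Because each resample draws from the whole book, the number of clean drivers varies from resample to resample, which is part of the variability being measured.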

[Figures: bootstrap histograms and normal Q-Q plots for the segment loss ratios and relativities, for LRR_other - LRR_clean, and for LRR_other / LRR_clean.]

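A topic of current interest in the actuarial community is reserve variability.

Bootstrapping offers one way to approach this problem.

A reserve is a point estimate of outstanding losses.

We would like an estimate of the distribution of these o/s losses.

Here we resample the policies themselves rather than model residuals.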

Bootstrapping Reserves

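Sample with replacement all policies in S (same size as S).

Compute the reserve estimates on each sample.

The example uses simulated losses from a lognormal distribution with parameters $\mu = 8$; $\sigma = 1.3$.

Losses develop as $L_{i+j} = L_i \cdot (\mathrm{link} + \varepsilon)$.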

Bootstrapping Reserves

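The reserve estimates from the resamples give an estimate of the distribution of outstanding losses.

We obtained it by resampling the original dataset S of claims.

Note: this bootstrapping method differs from other analyses which bootstrap the residuals of a model.

Those methods rely on the assumption that the model is correct.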
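One possible way to carry out this policy-level reserve bootstrap is sketched below. The slides give only the lognormal parameters and the development rule; everything else here is an assumption made for illustration: the number of policies, four accident years with a triangular observation pattern, the link ratios, the noise level, and the use of a volume-weighted chain-ladder projection as the "favorite reserve estimate".

```r
set.seed(10)                                    # arbitrary seed
n.pol <- 2000                                   # number of policies in S (made up)
n.age <- 4                                      # development ages = accident years (made up)
link  <- c(1.5, 1.2, 1.1)                       # made-up age-to-age link ratios
ay    <- sample(1:n.age, n.pol, replace = TRUE) # accident year of each policy

# Cumulative loss by policy (row) and development age (column)
cum <- matrix(NA_real_, n.pol, n.age)
cum[, 1] <- rlnorm(n.pol, meanlog = 8, sdlog = 1.3)
for (j in 1:(n.age - 1)) {
  cum[, j + 1] <- cum[, j] * (link[j] + rnorm(n.pol, sd = 0.1))
}
cum[col(cum) > n.age - ay + 1] <- NA            # hide ages not yet observed

# Chain-ladder reserve for the policies in rows i
reserve <- function(i) {
  tri <- rowsum(cum[i, , drop = FALSE], ay[i])  # aggregate policies into a triangle
  latest <- tri[cbind(1:n.age, n.age:1)]        # latest observed diagonal
  for (j in 1:(n.age - 1)) {
    obs <- !is.na(tri[, j + 1])
    f <- sum(tri[obs, j + 1]) / sum(tri[obs, j])  # volume-weighted link ratio
    tri[!obs, j + 1] <- tri[!obs, j] * f          # project unobserved cells
  }
  sum(tri[, n.age]) - sum(latest)               # projected ultimate minus latest
}

boot.res <- replicate(500, reserve(sample(n.pol, n.pol, replace = TRUE)))

c(mean = mean(boot.res), sd = sd(boot.res))
quantile(boot.res, probs = c(.025, .975))
```

Each resample rebuilds the triangle from resampled policies and re-estimates the link ratios, so no residual model is fitted at any point.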

[Figure: histogram of the bootstrapped distribution of outstanding losses. Dotted line: kernel density estimate of the distribution. Pink line: superimposed normal.]

Mean: $21.751M

Median: $21.746M

$\sigma$: $0.982M ($\sigma/\mu \approx 4.5\%$)

The distribution of outstanding losses appears normal.

95% confidence interval: (19.8M, 23.7M), which roughly agrees with $21.75M ± 2$\sigma$.

The bootstrap suggests that the distribution of o/s losses is approximately normal.

Remember this is just simulated data!

Real-life results have been consistent with these results.

[Figure: histogram of bootstrapped outstanding losses and normal Q-Q plot.]

References

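Bootstrap Methods and their Application
--Davison and Hinkley

An Introduction to the Bootstrap
--Efron and Tibshirani

A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation
--Efron and Gong, American Statistician 1983

Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy
--Efron and Tibshirani, Statistical Science 1986

Applications of Resampling Methods in Actuarial Practice
--Derrig, Ostaszewski, Rempala, PCAS 2000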
