Mar 23, 2016

Introduction to Bootstrapping

James Guszcza, FCAS, MAAA

CAS Predictive Modeling Seminar

Chicago

September, 2005

statistics all the time.

Loss ratio/claim frequency for a population

Outstanding Losses

Correlation between variables

GLM parameter estimates

indicates.

But how can we measure our confidence in

this indication?

More Concisely

what do you think?

Variability of the point estimate says:

how sure are you?

Traditional approaches

Credibility theory

Use distributional assumptions to construct

confidence intervals

made an ingenious suggestion.

Most (sometimes all) of what we know about

the true probability distribution comes from

the data.

So let's treat the data as a proxy for the true

distribution.

We draw multiple samples from this proxy

of the resulting pseudo-datasets.

Philosophy

way of modeling, assumptions, or analysis,

and can be applied in an automatic way to

any situation, no matter how complicated.

An important theme is the substitution of

raw computing power for theoretical analysis

--Efron and Gong 1983

mining paradigm.

Theoretical Picture

was drawn from the unknown

true distribution

The true

distribution

in the sky

make inferences about the

true parameters ()

sample that might have

been

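Sample 1: $Y^1_1, Y^1_2, \ldots, Y^1_k \rightarrow \bar{Y}^1$

Sample 2: $Y^2_1, Y^2_2, \ldots, Y^2_k \rightarrow \bar{Y}^2$

Sample 3: $Y^3_1, Y^3_2, \ldots, Y^3_k \rightarrow \bar{Y}^3$

Sample N: $Y^N_1, Y^N_2, \ldots, Y^N_k \rightarrow \bar{Y}^N$

distribution and the size (k) of our sample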

as a proxy for the true

distribution.

The actual

sample

your actual distribution N

times.

$Y_1, Y_2, \ldots, Y_k$

interest on each re-sample.

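Re-sample 1: $Y^{*1}_1, Y^{*1}_2, \ldots, Y^{*1}_k \rightarrow \bar{Y}^{*1}$

Re-sample 2: $Y^{*2}_1, Y^{*2}_2, \ldots, Y^{*2}_k \rightarrow \bar{Y}^{*2}$

Re-sample 3: $Y^{*3}_1, Y^{*3}_2, \ldots, Y^{*3}_k \rightarrow \bar{Y}^{*3}$

Re-sample N: $Y^{*N}_1, Y^{*N}_2, \ldots, Y^{*N}_k \rightarrow \bar{Y}^{*N}$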

$(1 - 1/500)^{500} \approx 1/e \approx .368$ is the probability

that any one of the original data points won't

appear at all if we sample with replacement 500

times.

So any data point is included with probability ≈ .632
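
The .368 / .632 figures can be checked numerically. The sketch below computes the closed form and then estimates, by simulation, the share of the 500 original points that appear in a resample. The seed and the variable names are illustrative assumptions, not part of the original slides.

```r
# Closed form: chance that a given point is never drawn in n draws with replacement
n <- 500
p.missing <- (1 - 1/n)^n            # close to exp(-1)

# Simulation check: fraction of distinct original indices in each resample
set.seed(1)                          # arbitrary seed, for reproducibility only
frac.included <- replicate(1000,
  length(unique(sample(n, n, replace=TRUE))) / n)

c(p.missing=p.missing, exp.minus.1=exp(-1), mean.included=mean(frac.included))
```

The simulated share of included points should sit close to 1 - 1/e.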

true population in the sky.

Each resample simulates the process of taking

a sample from the true distribution.

Graph on left: Y-bar calculated from a number of

samples from the true distribution.

Graph on right: {Y*-bar} calculated in each of 1000 resamples from the empirical distribution.

Analogy: $\mu : \bar{Y} :: \bar{Y} : \bar{Y}^*$

[Figure: two panels. Left: distribution of ybar (vertical axis phi.ybar). Right: distribution of y.star.bar.]

Summary

serves as a proxy to the true distribution.

Resampling means (repeatedly) sampling

with replacement.

Resampling the data is analogous to the

process of drawing the data from the true

distribution.

We can resample multiple times

Compute the statistic of interest T on each resample

We get an estimate of the distribution of T.
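
The summary above can be condensed into a few lines of R. This is a minimal sketch, assuming a numeric data vector and a statistic that takes such a vector; the function name `bootstrap`, the simulated data `y`, and the choice of the median as T are hypothetical and only for illustration.

```r
# Generic nonparametric bootstrap: apply 'statistic' to R resamples of 'data'
bootstrap <- function(data, statistic, R = 1000) {
  replicate(R, statistic(sample(data, length(data), replace = TRUE)))
}

set.seed(1)                          # arbitrary seed
y <- rexp(200, rate = 1/100)         # made-up skewed data with true mean 100
t.star <- bootstrap(y, median)       # bootstrap distribution of T = median

sd(t.star)                                   # bootstrap estimate of s.d.(T)
quantile(t.star, probs = c(.025, .975))      # percentiles of the distribution of T
```

Any other statistic (mean, percentile, loss ratio) can be passed in place of `median` without changing the resampling step.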

Motivating Example

where we all know the answer

in advance.

Pull 500 draws from the

n(5000,100) dist.

The sample mean ≈ 5000

is a point estimate of the

true mean μ.

But how sure are we of this

estimate?

From theory, we know that:

$\mathrm{s.d.}(\bar{X}) = \sigma / \sqrt{N} = 100 / \sqrt{500} \approx 4.47$

raw data

statistic   value
#obs        500
mean        4995.79
sd          98.78
2.5%ile     4812.30
97.5%ile    5195.58

Look at summary statistics,

histogram, probability density

estimate, QQ-plot.

looks pretty normal

[Figure: histogram, density estimate, and Normal Q-Q plot of the n(5000,100) data]

Now let's use resampling to estimate the

s.d. of the sample mean (4.47)

once; others won't appear at all.

Resampling

Sample with

replacement 500 data

points from the

original dataset S

Call this S*1

Now do this 999

more times!

S*1, S*2, ..., S*1000

Compute X-bar on

each of these 1000

samples.

R Code

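norm.data <- rnorm(500, mean=5000, sd=100)

# Draw R bootstrap resamples of 'data'; store the mean and s.d. of each
# resample in the global vectors b.avg and b.sd (used on later slides).
boots <- function(data, R){
  b.avg <<- numeric(R); b.sd <<- numeric(R)
  for(b in 1:R) {
    ystar <- sample(data, length(data), replace=TRUE)  # sample with replacement
    b.avg[b] <<- mean(ystar)
    b.sd[b] <<- sd(ystar)
  }
}

boots(norm.data, 1000)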

Results

X-bar ~ n(5000, 4.47)

Bootstrapping estimates this

pretty well!

And we get an estimate of

the whole distribution, not

just a confidence interval.

statistic   raw data   X-bar (theory)   X-bar (bootstrap)
#obs        500        1,000            1,000
mean        4995.79    5000.00          4995.98
sd          98.78      4.47             4.43
2.5%ile     4705.08    4991.23          4987.60
97.5%ile    5259.27    5008.77          5004.82

[Figure: plots of the bootstrap distribution of X-bar]

Interval

Percentile method

Just take the desired percentiles of the

bootstrap histogram.

More reliable in cases of asymmetric bootstrap

histograms.

> mean(norm.data) - 2 * sd(b.avg)
[1] 4986.926
> mean(norm.data) + 2 * sd(b.avg)
[1] 5004.661
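
The two interval constructions can be put side by side. The following is a sketch under the same setup as the motivating example (500 draws from n(5000,100), 1000 resamples); the seed and the names `ci.pct` and `ci.norm` are assumptions made here for illustration.

```r
set.seed(1)                                   # arbitrary seed
norm.data <- rnorm(500, mean=5000, sd=100)

# Bootstrap distribution of the sample mean
b.avg <- replicate(1000,
  mean(sample(norm.data, length(norm.data), replace=TRUE)))

# Percentile method: read the interval off the bootstrap histogram
ci.pct <- quantile(b.avg, probs=c(.025, .975))

# Normal approximation: point estimate +/- 2 bootstrap standard errors
ci.norm <- mean(norm.data) + c(-2, 2) * sd(b.avg)

ci.pct; ci.norm
```

For a symmetric bootstrap histogram like this one the two intervals should nearly coincide; they drift apart when the histogram is skewed.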


And a Bonus


deviation of each pseudo-dataset.

This enables us to estimate the correlation between the

mean and s.d.

Normal distribution is not skew, so mean and s.d. are

uncorrelated.

Our bootstrapping experiment confirms this.
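
A sketch of how that check might be run: keep both the mean and the s.d. of every pseudo-dataset, then correlate them. The seed and the local vectors below are assumptions; the slides' own `boots()` function stores the same two quantities.

```r
set.seed(1)                                   # arbitrary seed
norm.data <- rnorm(500, mean=5000, sd=100)

R <- 1000
b.avg <- numeric(R); b.sd <- numeric(R)
for (b in 1:R) {
  ystar <- sample(norm.data, length(norm.data), replace=TRUE)
  b.avg[b] <- mean(ystar)                     # mean of pseudo-dataset b
  b.sd[b]  <- sd(ystar)                       # s.d. of pseudo-dataset b
}

cor(b.avg, b.sd)                              # expected to be near zero
```

The value will not be exactly zero: it reflects whatever small skewness the particular sample of 500 happens to have.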

[Figure: scatterplot of sample.sd vs. sample.mean]

result we know to be true from theory.

Often in the real world we either don't know

the true distributional properties of a

random variable

or are too busy to find out.

This is when bootstrapping really comes in

handy.

Severity Data

2700 size-of-loss data points.

mean & 75th %ile.

Gamma? Lognormal? Don't need to know.

[Figure: severity distribution (density plot)]

[Figure: bootstrap results, with Normal Q-Q Plot]

statistics (even average severity!) are approximately normally

distributed.

But this breaks down if our statistic is not a smooth function of

the data

Often in loss reserving we want to focus our attention way

out in the tail

90th %ile is an example.
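
Bootstrapping a tail percentile uses exactly the same resampling step as the mean. The sketch below assumes simulated lognormal severities as a stand-in for the 2700 size-of-loss points (the original data are not available); the lognormal parameters and the names `sev`, `boot.q90` and `ci.q90` are hypothetical.

```r
set.seed(1)                                        # arbitrary seed
sev <- rlnorm(2700, meanlog=7, sdlog=1.2)          # hypothetical size-of-loss data

# 90th percentile computed on each of 1000 resamples
boot.q90 <- replicate(1000,
  unname(quantile(sample(sev, length(sev), replace=TRUE), .90)))

sd(boot.q90)                                       # bootstrap s.d. of the 90th %ile
ci.q90 <- quantile(boot.q90, probs=c(.025, .975))  # percentile interval
ci.q90
```

Because a percentile depends on only a few order statistics, comparing `ci.q90` with a symmetric "estimate plus or minus 2 s.d." interval is a quick way to see how far the bootstrap distribution is from normal.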

[Figure: bootstrap histogram and Normal Q-Q Plot]

sample average and s.d. on each pseudo-dataset.

This time (as one would expect) the variance is a function

of the mean.

[Figure: scatterplot of sample.sd vs. sample.mean]

Credit on a scale of 1-100

1 is worst; 100 is best

Age, credit are linearly related

See plot

$R^2 \approx .08$; $\rho \approx .28$

Older people tend to have better credit

What is the confidence interval around ρ?

[Figure: Plot of Age vs Credit]

ρ ≈ .28

s.d.(ρ) ≈ .028

> quantile(boot.avg,probs=c(.025,.975))
     2.5%     97.5%
0.2247719 0.3334889
> rho - 2*sd(boot.avg); rho + 2*sd(boot.avg)
0.2250254 0.3354617
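
The slides do not show how `boot.avg` was built for the correlation, so the following is only a plausible sketch: resample (age, credit) pairs together and recompute the correlation each time. The simulated `age` and `credit` vectors are made up to give a correlation in the same neighbourhood; only `rho` and `boot.avg` are names taken from the output above.

```r
set.seed(1)                                        # arbitrary seed
n <- 1000
age    <- runif(n, 18, 90)                         # hypothetical ages
credit <- pmin(100, pmax(1, 40 + 0.25*age + rnorm(n, sd=18)))  # hypothetical 1-100 score

rho <- cor(age, credit)

boot.avg <- replicate(1000, {
  i <- sample(n, n, replace=TRUE)   # resample row indices so pairs stay together
  cor(age[i], credit[i])
})

quantile(boot.avg, probs=c(.025,.975))
rho - 2*sd(boot.avg); rho + 2*sd(boot.avg)
```

Resampling the two columns independently would destroy the dependence being measured, which is why the index vector `i` is shared.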

[Figure: bootstrap histogram of the correlation, with Normal Q-Q Plot]

1300 zip-code level data points

Variables: population density, median #vehicles/HH

$R^2 \approx .50$; $\rho \approx -.70$

[Figure: Median #Vehicles vs Pop Density (veh against density), with loess line and regression line]

more skew.

ρ ≈ -.70

95% conf interval: (-.75, -.67)

Not symmetric around ρ

Effect becomes more pronounced the higher the

value of ρ.

[Figure: bootstrap distribution of the correlation]

Total loss ratio of a segment of business is

our favorite point estimate.

Its variability depends on many things:

Size of book

Loss distribution

Accuracy of rating plan

Consistency of underwriting

probability distribution?

Severity dist from previous example

LR = .79

Claim frequency = .08

two point estimates.

We will resample the data 500 times

Compute total LR and freq on each sample

Plot the histogram

LR ≈ .79

s.d.(LR) ≈ .05

conf interval ≈ ±0.1

Confidence interval calculations disagree a bit:

> quantile(boot.avg,probs=c(.025,.975))
     2.5%     97.5%
0.6974607 0.8829664
> lr - 2*sd(boot.avg); lr + 2*sd(boot.avg)
0.6897653 0.8888983
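
A minimal sketch of the resampling behind these numbers, assuming policy-level data with one premium and one loss amount per policy. The book below is simulated (premiums, an 8% claim frequency, lognormal severities are all hypothetical), so the output will not reproduce the figures on the slide.

```r
set.seed(1)                                        # arbitrary seed
n <- 5000
premium   <- runif(n, 500, 1500)                   # hypothetical premiums
has.claim <- rbinom(n, 1, 0.08)                    # hypothetical claim indicator
loss      <- has.claim * rlnorm(n, meanlog=8.5, sdlog=1.2)

lr <- sum(loss) / sum(premium)                     # total loss ratio of the book

boot.lr <- numeric(500); boot.freq <- numeric(500)
for (b in 1:500) {
  i <- sample(n, n, replace=TRUE)                  # resample whole policies
  boot.lr[b]   <- sum(loss[i]) / sum(premium[i])   # total LR on resample b
  boot.freq[b] <- mean(has.claim[i])               # claim frequency on resample b
}

quantile(boot.lr, probs=c(.025,.975))
lr - 2*sd(boot.lr); lr + 2*sd(boot.lr)
```

Note that the total LR is a ratio of sums, so it is recomputed from the resampled losses and premiums rather than averaged over policy-level ratios.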

[Figure: histogram of bootstrap total LR, with Normal Q-Q Plot]

How does this affect the variability of LR?

Again re-sample 500 times

Skewness, variance increase considerably

LR: .79 → .78

s.d.(LR): .05 → .13

[Figure: histogram of bootstrap total LR, with Normal Q-Q Plot]

Distribution of Capped LR

statistics

Remove leverage of a few large data points

Here we cap policy-level losses at $30,000

Closer to frequency

s.d. cut in half! .05 → .025
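
The effect of capping can be illustrated by bootstrapping the same book twice, once with raw losses and once with losses capped at $30,000. The data below are simulated under assumed parameters (flat premium, 8% frequency, heavy-tailed lognormal severity), and `boot.lr.sd` is a hypothetical helper, so the sizes of the two standard deviations are illustrative only.

```r
set.seed(1)                                        # arbitrary seed
n <- 5000
premium <- rep(1000, n)                            # hypothetical flat premium
loss    <- rbinom(n, 1, 0.08) * rlnorm(n, meanlog=8.5, sdlog=1.5)
capped  <- pmin(loss, 30000)                       # cap policy-level losses at $30,000

# Bootstrap s.d. of the total loss ratio for a given loss vector
boot.lr.sd <- function(l, R=500) {
  sd(replicate(R, {
    i <- sample(n, n, replace=TRUE)
    sum(l[i]) / sum(premium[i])
  }))
}

sd.uncapped <- boot.lr.sd(loss)
sd.capped   <- boot.lr.sd(capped)
c(uncapped=sd.uncapped, capped=sd.capped)
```

Capping lowers the loss ratio itself as well as its variability, so the capped figure is a more stable but biased-low measure.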

[Figure: bootstrap histogram of capped LR, with Normal Q-Q Plot]

freq ≈ .08

s.d.(freq) ≈ .0017

Confidence interval calculations match very well:

> quantile(boot.avg,probs=c(.025,.975))
      2.5%      97.5%
0.07734336 0.08391072
> lr - 2*sd(boot.avg); lr + 2*sd(boot.avg)
0.07719618 0.08388898

[Figure: bootstrap histogram of claim frequency, with Normal Q-Q Plot]

sub-segments: {clean drivers, other}

LRtot = .79

LRclean = .58

LRRclean = -27%

LRother = .84

LRRother = +6%

than non-clean drivers

How sure are we of this indication?

Let's use bootstrapping.

500 times.

LRc*, LRo*, (LRc*- LRo*), (LRc* / LRo*)

What is the average difference in loss ratios?

What percent of the time is the difference in

loss ratios greater than x%?
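
Both questions can be answered directly from the bootstrap replicates. The sketch below assumes a simulated book with a clean/other flag per policy; the segment mix, frequencies, severities, and the 10-point threshold standing in for x% are all hypothetical, and `lr.seg` is a helper introduced here.

```r
set.seed(1)                                        # arbitrary seed
n <- 5000
clean   <- rbinom(n, 1, 0.3) == 1                  # hypothetical clean-driver flag
premium <- rep(1000, n)                            # hypothetical flat premium
freq    <- ifelse(clean, 0.06, 0.09)               # assumed segment frequencies
loss    <- rbinom(n, 1, freq) * rlnorm(n, meanlog=8.5, sdlog=1.2)

# Loss ratio of one segment within the resample defined by indices i
lr.seg <- function(i, seg) sum(loss[i][seg[i]]) / sum(premium[i][seg[i]])

boot.diff <- replicate(500, {
  i <- sample(n, n, replace=TRUE)                  # resample policies from the whole book
  lr.seg(i, !clean) - lr.seg(i, clean)             # LRo* - LRc* on this resample
})

mean(boot.diff)                                    # average difference in loss ratios
mean(boot.diff > 0.10)                             # share of resamples with difference > 10 points
```

The same replicates can be reused for the ratio LRc* / LRo* by changing the last line inside `replicate`.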

[Figures: bootstrap histograms with Normal Q-Q Plots for the segment loss ratios and relativities, including LRR_other - LRR_clean and LRR_other / LRR_clean]

community is reserve variability

problem.

outstanding losses.

of this o/s losses.

residuals.

Bootstrapping Reserves

Sample with replacement all

policies in S

Same size as S

sample

estimates

lognormal distribution with parameters

μ=8; σ=1.3

$L_{i+j} = L_i \cdot (\mathrm{link} + \varepsilon)$

details.

Bootstrapping Reserves

estimate of the distribution of outstanding losses

original dataset S of claims.

Note: this bootstrapping method differs from

other analyses which bootstrap the residuals

of a model.

model is correct.

[Figure: bootstrapped distribution of outstanding losses. Dotted line: kernel density estimate of the distribution. Pink line: superimposed normal.]

Mean: $21.751M

Median: $21.746M

σ: $0.982M

σ/μ ≈ 4.5%

outstanding losses

appears normal.

95% confidence interval: (19.8M, 23.7M)

roughly agrees with $21.75M ± 2σ

distribution of o/s losses is approximately normal.

Remember this is just simulated data!

Real-life results have been consistent with these results.

[Figure: plots of the bootstrapped distribution of outstanding losses]

References

--Davison and Hinkley

--Efron and Tibshirani

--Efron and Gong

American Statistician 1983

-- Efron and Tibshirani

Statistical Science 1986

Practice

-- Derrig, Ostaszewski, Rempala

PCAS 2000

