# Introduction to Bootstrapping

James Guszcza, FCAS, MAAA
CAS Predictive Modeling Seminar
Chicago, September 2005

© Deloitte Consulting, 2005

## Actuaries compute point estimates of statistics all the time

- Loss ratio/claim frequency for a population
- Outstanding losses
- Correlation between variables
- GLM parameter estimates

## A point estimate tells us what the data indicates

- But how can we measure our confidence in this indication?

## More Concisely

- Point estimate says: what do you think?
- Variability of the point estimate says: how sure are you?
- Credibility theory: use distributional assumptions to construct confidence intervals.

## In the late '70s the statistician Brad Efron proposed a computer-intensive alternative: the bootstrap

- Most (sometimes all) of what we know about the true probability distribution comes from the data.
- So let's treat the data as a proxy for the true distribution.
- We draw multiple samples from this proxy, and compute the statistic of interest on each of the resulting pseudo-datasets.

## Philosophy

"[Bootstrapping] requires very little in the way of modeling, assumptions, or analysis, and can be applied in an automatic way to any situation, no matter how complicated. An important theme is the substitution of raw computing power for theoretical analysis."

--Efron and Gong, 1983

## The Basic Idea

## Theoretical Picture

- Any actual sample of data was drawn from the unknown true distribution $\Phi$ ("the true distribution in the sky").
- We use the actual data to estimate the true parameters ($\theta$).
- Each green oval is a sample that *might have been* drawn from $\Phi$:

$$\text{Sample 1: } Y_{1,1}, Y_{1,2}, \dots, Y_{1,k} \;\mapsto\; \bar{Y}_1$$
$$\text{Sample 2: } Y_{2,1}, Y_{2,2}, \dots, Y_{2,k} \;\mapsto\; \bar{Y}_2$$
$$\vdots$$
$$\text{Sample N: } Y_{N,1}, Y_{N,2}, \dots, Y_{N,k} \;\mapsto\; \bar{Y}_N$$

- The distribution of our estimator $\bar{Y}$ depends on both the true distribution $\Phi$ and the size ($k$) of our sample.

## Treat the actual distribution as a proxy for the true distribution

- The actual sample: $Y_1, Y_2, \dots, Y_k$. Re-sample from it, with replacement, N times.
- Compute the statistic of interest on each re-sample:

$$\text{Re-sample 1: } Y^*_{1,1}, Y^*_{1,2}, \dots, Y^*_{1,k} \;\mapsto\; \bar{Y}^*_1$$
$$\text{Re-sample 2: } Y^*_{2,1}, Y^*_{2,2}, \dots, Y^*_{2,k} \;\mapsto\; \bar{Y}^*_2$$
$$\vdots$$
$$\text{Re-sample N: } Y^*_{N,1}, Y^*_{N,2}, \dots, Y^*_{N,k} \;\mapsto\; \bar{Y}^*_N$$

## In fact, there is a chance of

$$\left(1 - \tfrac{1}{500}\right)^{500} \approx \tfrac{1}{e} \approx .368$$

that any one of the original data points won't appear at all if we sample with replacement 500 times.

$\Rightarrow$ any data point is included with probability $\approx .632$
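This is easy to verify numerically in R:

```r
(1 - 1/500)^500   # 0.3675
exp(-1)           # 0.3679
```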

## Intuitively, we treat the original sample as the "true population in the sky"

- Each re-sample simulates the process of taking a sample from the true distribution.

## Theoretical vs. Empirical

- Graph on left: $\bar{y}$ calculated from a number of samples from the true distribution.
- Graph on right: $\{\bar{y}^*\}$ calculated in each of 1000 re-samples from the empirical distribution.
- Analogy: $\Phi : \bar{Y} \;::\; \hat{\Phi} : \bar{Y}^*$

[Figure: left panel, density phi.ybar of ybar (x ≈ 70-120); right panel, histogram of y.star.bar from the 1000 re-samples (x ≈ 98.5-101).]

## Summary

- The empirical distribution (your data) serves as a proxy for the true distribution.
- Re-sampling means (repeatedly) sampling with replacement.
- Re-sampling the data is analogous to the process of drawing the data from the true distribution.
- We can re-sample multiple times and compute the statistic of interest T on each re-sample.
- The result is an estimate of the distribution of T.


## Motivating Example

## Let's look at a simple case where we all know the answer

- Pull 500 draws from the n(5000, 100) distribution.
- The sample mean ($\approx 5000$) is a point estimate of the true mean $\mu$.
- But how sure are we of this estimate?
- From theory, we know that:

$$s.d.(\bar{X}) = \frac{\sigma}{\sqrt{N}} = \frac{100}{\sqrt{500}} \approx 4.47$$

Raw data:

| statistic | value   |
|-----------|---------|
| #obs      | 500     |
| mean      | 4995.79 |
| sd        | 98.78   |
| 2.5%ile   | 4812.30 |
| 97.5%ile  | 5195.58 |

## 500 draws from n(5000, 100)

- Look at summary statistics, histogram, probability density estimate, Q-Q plot.
- Looks pretty normal.

[Figure: histogram, density estimate, and normal Q-Q plot of the n(5000, 100) data (x ≈ 4700-5300); the summary table above is repeated alongside.]

## Sampling With Replacement

- Now let's use re-sampling to estimate the s.d. of the sample mean (4.47).
- Some of the original data points will appear more than once; others won't appear at all.


## Resampling

- Sample with replacement 500 data points from the original dataset S. Call this $S^*_1$.
- Now do this 999 more times! $S^*_1, S^*_2, \dots, S^*_{1000}$
- Compute $\bar{X}$ on each of these 1000 samples.


## R Code

```r
# Original sample: 500 draws from n(5000, 100)
norm.data <- rnorm(500, mean = 5000, sd = 100)

# Bootstrap: R re-samples (with replacement, same size as the data);
# the mean and s.d. of each re-sample are stored in b.avg and b.sd
boots <- function(data, R) {
  b.avg <<- c(); b.sd <<- c()
  for (b in 1:R) {
    ystar <- sample(data, length(data), replace = TRUE)
    b.avg <<- c(b.avg, mean(ystar))
    b.sd  <<- c(b.sd,  sd(ystar))
  }
}

boots(norm.data, 1000)   # 1000 bootstrap re-samples
```
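The stored vectors b.avg and b.sd now hold the bootstrap replicates; the bootstrap column of the comparison table on the next slide can be read off directly:

```r
mean(b.avg); sd(b.avg)           # bootstrap mean and s.d. of X-bar
quantile(b.avg, c(.025, .975))   # bootstrap 2.5 / 97.5 percentiles
```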

## Results

- From theory we know that $\bar{X} \sim n(5000,\ 4.47)$.
- Bootstrapping estimates this pretty well!
- And we get an estimate of the whole distribution, not just a confidence interval.

Raw data:

| statistic | value   |
|-----------|---------|
| #obs      | 500     |
| mean      | 4995.79 |
| sd        | 98.78   |
| 2.5%ile   | 4705.08 |
| 97.5%ile  | 5259.27 |

| $\bar{X}$ | theory  | bootstrap |
|-----------|---------|-----------|
| # samples | 1,000   | 1,000     |
| mean      | 5000.00 | 4995.98   |
| s.d.      | 4.47    | 4.43      |
| 2.5%ile   | 4991.23 | 4987.60   |
| 97.5%ile  | 5008.77 | 5004.82   |

[Figure: histogram and normal Q-Q plot of the 1000 bootstrap values of $\bar{X}$ (x ≈ 4985-5010).]

## Confidence Intervals

- Normal approximation: $\bar{X} \pm 2 \times$ (bootstrap dist s.d.)
- Percentile method: just take the desired percentiles of the bootstrap histogram. This is more reliable in cases of asymmetric bootstrap histograms.

```r
mean(norm.data) - 2 * sd(b.avg)   # 4986.926
mean(norm.data) + 2 * sd(b.avg)   # 5004.661
```
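The percentile-method interval comes straight off the same vector:

```r
quantile(b.avg, probs = c(.025, .975))   # percentile-method 95% interval
```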


## And a Bonus

- Note that we can calculate both the mean and standard deviation of each pseudo-dataset.
- This enables us to estimate the correlation between the mean and s.d.
- The normal distribution is not skewed, so the mean and s.d. are uncorrelated.
- Our bootstrapping experiment confirms this, as the check below shows.

[Figure: scatterplot of sample.sd (≈90-110) vs sample.mean (≈4985-5010) across the re-samples; no relationship apparent.]
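With the b.avg and b.sd vectors from the earlier R code, this check is one line:

```r
cor(b.avg, b.sd)   # approximately 0 for the normal sample
```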

## We've seen that bootstrapping replicates a result we know to be true from theory

- Often in the real world we either don't know the true distributional properties of a random variable, or are too busy to find out.
- This is when bootstrapping really comes in handy.


## Severity Data

- 2700 size-of-loss data points.
- Let's estimate the distributions of the sample mean & 75th percentile.
- Gamma? Lognormal? We don't need to know.

[Figure: severity distribution, density estimate over losses ≈ 0-50,000.]

[Figures: bootstrap distributions of the sample mean and the 75th percentile (histograms, density estimates, and normal Q-Q plots; values ≈ 2800-3400).]

## So far so good: bootstrapping shows that many of our sample statistics (even average severity!) are approximately normally distributed

- But this breaks down if our statistic is not a smooth function of the data.
- Often in loss reserving we want to focus our attention way out in the tail; the 90th percentile is an example.

[Figure: bootstrap distribution of the 90th percentile (x ≈ 7000-9000); the histogram and normal Q-Q plot depart visibly from normality.]

## As with the normal example, we can calculate both the sample average and s.d. on each pseudo-dataset

- This time (as one would expect) the variance is a function of the mean.

[Figure: scatterplot of sample.sd (≈5000-6000) vs sample.mean (≈2800-3400); a clear positive relationship.]

## Bootstrapping a Correlation Coefficient #1

- Credit on a scale of 1-100; 1 is worst, 100 is best.
- Age and credit are linearly related (see plot).
- $R^2 \approx .08 \Rightarrow \rho \approx .28$
- Older people tend to have better credit.
- What is the confidence interval around $\rho$?
- Bootstrap result (sketch below): $\hat{\rho} \approx .28$; $s.d.(\hat{\rho}) \approx .028$

[Figure: plot of age (≈20-100) vs credit (≈20-80).]
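A minimal sketch of this computation, assuming paired vectors age and credit (hypothetical names); note that each (age, credit) pair must be re-sampled as a unit to preserve the dependence between the two variables:

```r
rho <- cor(age, credit)                        # point estimate of the correlation
boot.avg <- replicate(1000, {
  idx <- sample(length(age), replace = TRUE)   # re-sample row indices
  cor(age[idx], credit[idx])                   # correlation on the pseudo-dataset
})
sd(boot.avg)                                   # bootstrap s.d. of rho-hat
```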

## Both confidence interval calculations agree fairly well

```r
> quantile(boot.avg, probs = c(.025, .975))
     2.5%     97.5%
0.2247719 0.3334889
> rho - 2*sd(boot.avg); rho + 2*sd(boot.avg)
[1] 0.2250254
[1] 0.3354617
```

[Figure: histogram and normal Q-Q plot of the bootstrapped correlations (≈0.20-0.35).]

## Let's try a different example

- 1300 zip-code level data points.
- Variables: population density, median #vehicles/HH.
- $R^2 \approx .50$; $\rho \approx -.70$

[Figure: median #vehicles vs population density (≈0-30,000), with loess and regression lines.]

## Bootstrapping a Correlation Coefficient #2

- This time the bootstrap distribution is noticeably more skewed.
- $\hat{\rho} \approx -.70$; 95% conf interval: (-.75, -.67).
- The interval is not symmetric around $\hat{\rho}$.
- The effect becomes more pronounced the larger the magnitude of $\rho$.

[Figure: histogram and normal Q-Q plot of the bootstrapped correlations (≈ -0.75 to -0.65).]

## Now for what we've all been waiting for

- The total loss ratio of a segment of business is our favorite point estimate.
- Its variability depends on many things:
  - Size of book
  - Loss distribution
  - Accuracy of rating plan
  - Consistency of underwriting
- How could we hope to write down the true probability distribution?

## 50,000 insurance policies

- Severity dist from the previous example.
- LR = .79
- Claim frequency = .08
- Let's build confidence intervals around these two point estimates.
- We will re-sample the data 500 times, compute total LR and frequency on each sample, and plot the histograms (a sketch of this loop follows).
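A minimal sketch of the loop, assuming a data frame policies with hypothetical columns premium, loss, and claim.count:

```r
n <- nrow(policies)   # 50,000
boot.lr <- boot.freq <- numeric(500)
for (b in 1:500) {
  s <- policies[sample(n, n, replace = TRUE), ]   # re-sample whole policies
  boot.lr[b]   <- sum(s$loss) / sum(s$premium)    # total loss ratio
  boot.freq[b] <- sum(s$claim.count) / n          # claim frequency
}
hist(boot.lr); hist(boot.freq)
```

Re-sampling whole policies (rather than losses alone) keeps each policy's premium, losses, and claim counts together.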

## A little skew, but somewhat close to normal

- LR ≈ .79; s.d.(LR) ≈ .05; conf interval ≈ ±.1
- Confidence interval calculations disagree a bit:

```r
> quantile(boot.avg, probs = c(.025, .975))
     2.5%     97.5%
0.6974607 0.8829664
> lr - 2*sd(boot.avg); lr + 2*sd(boot.avg)
[1] 0.6897653
[1] 0.8888983
```

[Figure: histogram and normal Q-Q plot of the bootstrap total LR (≈0.7-1.0).]

## Let's take a sub-sample of 10,000 policies

- How does this affect the variability of LR?
- Again re-sample 500 times.
- Skewness and variance increase considerably:
  - LR: .79 → .78
  - s.d.(LR): .05 → .13

[Figure: histogram and normal Q-Q plot of the bootstrap total LR on the 10,000-policy sub-sample (≈0.6-1.4); visibly skewed.]


## Distribution of Capped LR

- Capped LR is analogous to the trimmed mean from robust statistics: it removes the leverage of a few large data points.
- Here we cap policy-level losses at \$30,000; this affects 50 out of 2700 claims (see the capping sketch below).
- The capped LR is closer to a frequency statistic: its distribution is less skewed and close to normal.
- The s.d. is cut in half! .05 → .025

[Figure: histogram and normal Q-Q plot of the bootstrap capped LR (≈0.55-0.70).]
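Under the same hypothetical policies data frame as before, capping is a one-line change to the loss-ratio line inside the earlier bootstrap loop (with boot.capped.lr initialized like boot.lr):

```r
# cap each policy's losses at $30,000 before computing the loss ratio
boot.capped.lr[b] <- sum(pmin(s$loss, 30000)) / sum(s$premium)
```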

## Much less variance than LR; very close to normal

- freq ≈ .08; s.d.(freq) ≈ .0017
- Confidence interval calculations match very well:

```r
> quantile(boot.avg, probs = c(.025, .975))
      2.5%      97.5%
0.07734336 0.08391072
> lr - 2*sd(boot.avg); lr + 2*sd(boot.avg)   # here 'lr' holds the frequency point estimate
[1] 0.07719618
[1] 0.08388898
```

[Figure: histogram and normal Q-Q plot of the bootstrap claim frequency (≈0.074-0.086).]

## Example: divide our 50,000 policies into two sub-segments: {clean drivers, other}

- $LR_{tot} = .79$
- $LR_{clean} = .58 \Rightarrow LRR_{clean} = -27\%$
- $LR_{other} = .84 \Rightarrow LRR_{other} = +6\%$

## Clean drivers appear to have a roughly 30% lower LR than non-clean drivers

- How sure are we of this indication?
- Let's use bootstrapping: re-sample the data 500 times.

## At each iteration, calculate $LR^*_c$, $LR^*_o$, $(LR^*_c - LR^*_o)$, and $(LR^*_c / LR^*_o)$

## Analyze the resulting empirical distributions

- What is the average difference in loss ratios?
- What percent of the time is the difference in loss ratios greater than x%?
- A sketch of this loop is given below.
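A hedged sketch of the iteration, again assuming a data frame policies with hypothetical columns premium, loss, and a logical clean indicator:

```r
n <- nrow(policies)
lr.diff <- lr.ratio <- numeric(500)
for (b in 1:500) {
  s <- policies[sample(n, n, replace = TRUE), ]              # re-sample whole policies
  lr.c <- sum(s$loss[s$clean])  / sum(s$premium[s$clean])    # LRc*
  lr.o <- sum(s$loss[!s$clean]) / sum(s$premium[!s$clean])   # LRo*
  lr.diff[b]  <- lr.c - lr.o
  lr.ratio[b] <- lr.c / lr.o
}
mean(lr.diff)           # average difference in loss ratios
mean(lr.diff < -0.10)   # share of re-samples where the clean LR is 10+ points lower
```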

[Figures: bootstrap histograms and normal Q-Q plots of $LR^*_{clean}$ and $LR^*_{other}$.]

## Distribution of LRR Differences

[Figure: histogram and normal Q-Q plot of LRR_other - LRR_clean (≈ -0.1 to 0.6).]

## Distribution of LRR Ratios

[Figure: histogram and normal Q-Q plot of LRR_other / LRR_clean (≈ 1.0 to 2.5).]

## A major issue in the loss reserving community is the reserve variability problem

- We want the predictive variance of our estimate of outstanding losses.
- It is hard to find an analytic formula for the variability of these o/s losses.
- A common approach is to bootstrap a model's residuals; here we will instead bootstrap the claims themselves.


## Bootstrapping Reserves

- S = database of 5000 claims.
- Sample with replacement all policies in S; the re-sample is the same size as S. Call this $S^*_1$.
- Repeat 500 times, computing the reserve estimate on each re-sample.
- This yields 500 reserve estimates (a sketch follows).
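A hedged sketch of this procedure; claims is the database S, and reserve.estimate stands in for whatever triangle-building and development calculation produces the o/s loss estimate (a hypothetical helper, not defined here):

```r
boot.reserve <- replicate(500, {
  s.star <- claims[sample(nrow(claims), replace = TRUE), ]   # one re-sample S*
  reserve.estimate(s.star)   # reserve estimate on the pseudo-database
})
quantile(boot.reserve, c(.025, .975))   # e.g. a 95% interval
```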

## Each of the 5000 claims was drawn from a lognormal distribution with parameters $\mu = 8$, $\sigma = 1.3$

- Loss development patterns are built in: $L_{i+1} = L_i \times (\text{link} + \varepsilon)$.
- A sketch of this simulation follows; further details omitted.
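An illustrative sketch of the simulation; the link ratios and noise s.d. below are assumed values for illustration (not the presentation's actual parameters), and the lognormal draws are taken here to be the age-1 losses:

```r
set.seed(1)
n.claims <- 5000
L <- matrix(0, nrow = n.claims, ncol = 5)
L[, 1] <- rlnorm(n.claims, meanlog = 8, sdlog = 1.3)   # age-1 losses
links <- c(1.8, 1.3, 1.1, 1.05)                        # hypothetical age-to-age factors
for (j in seq_along(links)) {
  eps <- rnorm(n.claims, 0, 0.05)                      # noise around each link
  L[, j + 1] <- L[, j] * (links[j] + eps)              # L(i+1) = L(i) * (link + eps)
}
```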


## These 500 reserve estimates constitute an estimate of the distribution of outstanding losses

- Notice that we did this by re-sampling our original dataset S of claims.
- Note: this bootstrapping method differs from other analyses, which bootstrap the residuals of a model.
- Those methods rely on the assumption that your model is correct.

[Figure: bootstrapped distribution of outstanding losses (x ≈ 19,000-25,000). Dotted line: kernel density estimate of the distribution. Pink line: superimposed normal. The 95% confidence interval is marked.]

## The simulated distribution of outstanding losses appears normal

- Mean: \$21.751M; median: \$21.746M
- $\sigma$: \$0.982M; $\sigma/\mu \approx 4.5\%$
- Note: the 2.5 and 97.5 percentiles of the bootstrap distribution, (\$19.8M, \$23.7M), roughly agree with \$21.75M $\pm 2\sigma$.

## We can examine a Q-Q plot to verify that the distribution of o/s losses is approximately normal

- However, the tails are somewhat heavier than normal.
- Remember, this is just simulated data! Real-life results have been consistent with these findings.

[Figure: histogram and normal Q-Q plot of the bootstrapped outstanding losses (x ≈ 19,000-25,000).]

## References

- Davison and Hinkley, *Bootstrap Methods and their Application*.
- Efron and Tibshirani, *An Introduction to the Bootstrap*.
- Efron and Gong, "A Leisurely Look at the Bootstrap," *American Statistician*, 1983.
- Efron and Tibshirani, "Bootstrap Methods for Standard Errors," *Statistical Science*, 1986.
- Derrig, Ostaszewski, and Rempala, "Applications of Resampling Methods in Actuarial Practice," *PCAS*, 2000.