# An introduction to applied geostatistics Part 4 – Model validation; simulation Overheads

D G Rossiter Department of Earth Systems Analysis International Institute for Geo-information Science & Earth Observation (ITC) <http://www.itc.nl/personal/rossiter> July 9, 2005

AN INTRODUCTION TO APPLIED GEOSTATISTICS

## Model validation

With any predictive method, we would like to know how good it is. This is model validation.

• cf. model calibration, when we are building the model

The basic idea is to compare model predictions with reality. Two main methods:

1. Separate validation dataset
2. Cross-validation using calibration dataset

D G R OSSITER

## Independent validation

Simple measures of validity:

• Root mean squared error (RMSE) of the residuals: the actual vs. estimate (from the model) in the validation dataset; lower is better:

$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \right]^{1/2}$$

• Bias or mean error (ME) of estimated vs. actual mean of the validation dataset; should be zero (0):

$$\mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
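As a minimal sketch of these two measures in base R: the vectors `y` and `y_hat` are made-up values, not data from the course, and the residual is taken as predicted minus actual, which is an assumed sign convention.

```r
# Hypothetical validation set: y = observed values, y_hat = model predictions
y     <- c(2.0, 3.5, 1.0, 4.0)
y_hat <- c(2.5, 3.0, 1.5, 3.0)

# Residuals, here taken as predicted minus actual (assumed convention)
res <- y_hat - y

rmse <- sqrt(mean(res^2))  # root mean squared error: lower is better
me   <- mean(res)          # mean error (bias): should be close to 0

print(c(RMSE = rmse, ME = me))
```

A non-zero ME together with a small RMSE would suggest a systematic over- or under-prediction rather than large scatter.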

## Cross-validation

If we don't have an independent data set to evaluate a model, we can use the same sample points that were used to estimate the model to validate that same model.

This seems a bit dubious, but with enough points, the effect of the removed point on the model (which was estimated using that point) is minor.

N.b. this is not legitimate for non-geostatistical models, because there is no theory of spatial correlation.

## Cross-validation procedure

1. Compute experimental variogram with all sample points; model it
2. For each point
   (a) Remove the point from the sample set
   (b) Predict at that point using the other points and the modelled variogram
3. Summarize the deviations of the model from the actual point

Then models can be compared by their summary statistics, and also by looking at individual predictions of interest.

## Summary statistics for cross-validation

• Root Mean Square Error (RMSE): lower is better; computed as for independent validation

• Bias or mean error (ME): should be 0; computed as for independent validation

• Mean Squared Deviation Ratio (MSDR) of residuals with kriging variance: should be 1

$$\mathrm{MSDR} = \frac{1}{N} \sum_{i=1}^{N} \frac{\{z(x_i) - \hat{z}(x_i)\}^2}{\hat{\sigma}^2(x_i)}$$
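The leave-one-out loop and its summary statistics can be sketched in base R, without gstat. This is an illustration under stated assumptions, not the gstat implementation: the sample is random, the exponential variogram parameters are invented (so step 1 of the procedure is taken as already done), and `gamma_exp` and `ok_predict` are hypothetical helper names.

```r
set.seed(1)

# Hypothetical sample: 30 points with random coordinates and values
n <- 30
pts <- data.frame(x = runif(n, 0, 100), y = runif(n, 0, 100))
pts$z <- rnorm(n, mean = 5, sd = 1)

# Step 1 is assumed done: an exponential variogram model with
# nugget c0, partial sill c1 and range parameter a (made-up values)
gamma_exp <- function(h, c0 = 0.1, c1 = 0.9, a = 30) {
  ifelse(h == 0, 0, c0 + c1 * (1 - exp(-h / a)))
}

# Ordinary kriging prediction and kriging variance at (x0, y0) from data d
ok_predict <- function(d, x0, y0) {
  m  <- nrow(d)
  H  <- as.matrix(dist(d[, c("x", "y")]))        # distances among data points
  h0 <- sqrt((d$x - x0)^2 + (d$y - y0)^2)         # distances to the target
  A  <- rbind(cbind(gamma_exp(H), 1), c(rep(1, m), 0))
  sol <- solve(A, c(gamma_exp(h0), 1))            # kriging weights + Lagrange multiplier
  lambda <- sol[1:m]
  mu <- sol[m + 1]
  c(pred = sum(lambda * d$z), var = sum(lambda * gamma_exp(h0)) + mu)
}

# Step 2: remove each point in turn and predict it from the others
cv <- t(sapply(seq_len(n), function(i) {
  ok_predict(pts[-i, ], pts$x[i], pts$y[i])
}))

# Step 3: summarize the deviations (residual = observed - predicted)
residual <- pts$z - cv[, "pred"]
me   <- mean(residual)
rmse <- sqrt(mean(residual^2))
msdr <- mean(residual^2 / cv[, "var"])
print(c(ME = me, RMSE = rmse, MSDR = msdr))
```

Because the made-up values carry no real spatial structure, the statistics themselves are not meaningful here; the point is only the order of operations.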

## Cross-validation in gstat

    > # leave-one-out cross-validation
    > kcv <- krige.cv(log(cadmium) ~ 1, ~x+y, meuse, model=m2)
    > str(kcv)
    'data.frame': 155 obs. of 8 variables:
     $ x        : num 181072 181025 181165 181298 181307 ...
     $ y        : num 333611 333558 333537 333484 333330 ...
     $ var1.pred: num ...
     $ var1.var : num ...
     $ observed : Named num ...
      ..- attr(*, "names")= chr "1" "2" "3" "4" ...
     $ residual : num ...
     $ zscore   : num ...
     $ fold     : int 1 2 3 4 5 6 7 8 9 10 ...
    > # measures of goodness: ME = 0, MSE low, MSDR = 1
    > mean(kcv$residual); mean(kcv$residual**2)
    [1] 7.996923e-05
    [1] 0.8644106
    > mean((kcv$residual)**2/kcv$var1.var)
    [1] 1.129047
    > library(MASS)  # provides truehist()
    > truehist(kcv$residual)
    > # some residuals are very large, show their locations
    > bubble(kcv, z="residual", fill=F)

## Residuals from cross-validation and their location

[Figure: histogram of the cross-validation residuals (horizontal axis kcv\$residual, from about −3 to 2), and bubble plot of the residuals by location (axes x and y).]

## Spatial simulation

Simulation is the process or result of representing what reality might look like, given a model.

In geostatistics, this reality is usually a spatial distribution (map).

## What is stochastic simulation?

• "Simulation" is a general term for studying a system without physically implementing it.

• "Stochastic" simulation means that there is a random component to the simulation model: quantified uncertainty is included so that each simulation is different.

• Non-spatial example: planning the number and timing of clerks in a new branch bank; customer behaviour (arrival times, transaction length) is stochastic and represented by probability distributions.

• Reference for spatial simulation: Goovaerts, P., 1997. Geostatistics for natural resources evaluation. Applied Geostatistics Series. Oxford University Press, New York. Chapter 8.

## Why spatial simulation?

• Recall: the theory of regionalized variables assumes that the values we observe come from some random process; in the simplest case, with one expected value (first-order stationarity) with a spatially-correlated error that is the same over the whole area (second-order stationarity).

• So we'd like to see "alternative realities"; that is, spatial patterns that, by this theory, could have occurred in some "parallel universe".

• In addition, kriging maps are unrealistically smooth, especially in areas with low sampling density.
  * Even if there is a high nugget effect in the variogram, this variability is not reflected in adjacent prediction points, since they are estimated from almost the same data.

## When must simulation be used?

Goovaerts: "Smooth interpolated maps should not be used for applications sensitive to the presence of extreme values and their patterns of continuity." (p. 370)

Example: ground water travel time depends on sequences of large or small values ("critical paths"), not just on individual values.

## Local uncertainty vs. spatial uncertainty

• Recall: kriging prediction also provides a prediction error; this is the BLUP and its error for each prediction location separately.

• So, at each prediction location we obtain a probability distribution of the prediction, a measure of its uncertainty. This is fine for evaluating each prediction individually.

• But, it is not valid to evaluate the set of predictions! Errors are by definition spatially-correlated (as shown by the fitted variogram model), so we can't simulate the error in a field by simulating the error in each point separately.

• Spatial uncertainty is a representation of the error over the entire field of prediction locations at the same time.
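The difference between drawing errors point by point and drawing them jointly can be illustrated in base R. This is a sketch under assumptions: the transect, the exponential covariance and its parameters are invented, and the joint draw uses a Cholesky factorisation rather than any method named in these notes.

```r
set.seed(42)

# Hypothetical transect of 300 prediction locations, 1 unit apart
s <- 1:300
D <- as.matrix(dist(s))

# Assumed error model: exponential covariance, sill 1, range parameter 20
C <- exp(-D / 20)

# (a) Errors drawn separately at each location: ignores spatial correlation
e_indep <- rnorm(length(s), mean = 0, sd = 1)

# (b) Errors drawn jointly from the covariance model
L <- t(chol(C))                         # lower-triangular factor, C = L %*% t(L)
e_joint <- as.vector(L %*% rnorm(length(s)))

# Correlation between errors at neighbouring locations (lag 1)
lag1 <- function(e) cor(e[-1], e[-length(e)])
r_indep <- lag1(e_indep)
r_joint <- lag1(e_joint)
print(c(independent = r_indep, joint = r_joint))
```

Both error sets have the same variance at each point, but only the joint draw shows the neighbour-to-neighbour correlation that the covariance model implies.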

## Practical applications of spatial simulation

• If the distribution of the target variable(s) over the study area is to be used as input to a model, then the uncertainty is represented by a number of simulations.

• Procedure:
  1. Simulate a "large" number of realizations of the spatial field
  2. Run the model on each simulation
  3. Summarize the output of the different model runs

• The statistics of the output give a direct measure of the uncertainty of the model in the light of the sample and the model of spatial variability.
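The three-step procedure might look as follows in base R. Everything specific here is an assumption made for illustration: the 1-D transect, the covariance parameters, and the stand-in "model" `longest_run` (the longest run of consecutive cells above a threshold, a crude analogue of a critical path). The realisations are unconditional for brevity; in practice they would be conditioned on the sample.

```r
set.seed(7)

# Assumed spatial model on a hypothetical 1-D transect of 200 cells
s <- 1:200
C <- exp(-as.matrix(dist(s)) / 15)      # exponential covariance, sill 1
L <- t(chol(C))                         # C = L %*% t(L)
beta <- 0                               # assumed field mean

# Hypothetical model: longest run of consecutive cells above a threshold
longest_run <- function(field, threshold = 1) {
  r <- rle(field > threshold)
  if (any(r$values)) max(r$lengths[r$values]) else 0
}

# 1. Simulate a "large" number of realisations (one per column)
nsim <- 500
sims <- beta + L %*% matrix(rnorm(length(s) * nsim), ncol = nsim)

# 2. Run the model on each realisation
out <- apply(sims, 2, longest_run)

# 3. Summarize the output of the different model runs
print(summary(out))
print(quantile(out, c(0.05, 0.5, 0.95)))
```

The spread of `out` across realisations is the uncertainty of the model output that a single smooth kriging map could not provide.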

## Unconditional simulation

In unconditional simulation, we simulate the field with no reference to the actual sample, i.e. the data we have. (It's only one realisation, no more valid than any other.)

This is mainly to visualise a random field as modelled by a variogram, not for prediction.

## What is preserved in unconditional simulation?

1. Mean over field
2. Covariance structure

Data points are not predicted exactly.
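A base-R sketch of what "preserved" means here, assuming a small hypothetical grid, a known mean `beta` and an exponential covariance with invented parameters. It uses a Cholesky factorisation of the covariance matrix, which is only practical for small grids and is not necessarily how gstat simulates.

```r
set.seed(123)

# Hypothetical 15 x 15 prediction grid
grid <- expand.grid(x = 1:15, y = 1:15)
D <- as.matrix(dist(grid))

# Assumed model: known mean beta; exponential covariance, sill 1, range parameter 3
beta <- 0.56
C <- exp(-D / 3)
L <- t(chol(C))                         # C = L %*% t(L)

# Each column is one unconditional realisation: mean plus correlated Gaussian noise
nsim <- 200
sims <- beta + L %*% matrix(rnorm(nrow(grid) * nsim), ncol = nsim)

# 1. Mean over the field, averaged over realisations, is close to beta
print(mean(sims))

# 2. Covariance between two adjacent cells is close to the model value
emp_cov <- cov(sims[1, ], sims[2, ])
print(c(empirical = emp_cov, model = C[1, 2]))
```

Any single realisation can have a field mean noticeably different from `beta`; the agreement holds on average over many realisations.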

## Unconditional simulation in gstat

The krige function allows a number of simulations nsim to be specified.

For unconditional simulation, specify no data (data=NULL); instead use the dummy=TRUE option.

Since there is no data with which to estimate the mean, it must be specified as the beta parameter.

    > x <- krige(log(cadmium) ~ 1, ~ x + y, data = NULL, newdata = meuse.grid,
    +     model = m2, nmax = 20, beta=mean(log(cadmium)), nsim = 5, dummy = TRUE)
    [using unconditional gaussian simulation]
    > levelplot(z ~ x + y | name, map.to.lev(x, z=c(3:7)), aspect = mapasp(x),
    +     main = "five unconditional realisations of a correlated Gaussian field")


## Conditional simulation

This simulates the field, while respecting the sample.

So the simulated maps look more like the best (kriging) prediction, but usually much more spatially-variable (depending on the magnitude of the nugget).

These are inputs into spatially-explicit models, e.g. hydrology.

## What is preserved in conditional simulation?

1. Mean over field
2. Covariance structure
3. Observed data (points are predicted exactly)
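One way to see that observed data are reproduced exactly is "conditioning by kriging": an unconditional realisation is corrected by the simple-kriging interpolation of its mismatch at the data locations. The sketch below assumes a 1-D transect, three made-up observations, a known mean and a nugget-free exponential covariance; it is an illustration of the idea, not the algorithm gstat uses.

```r
set.seed(99)

# Hypothetical data: three observations on a 1-D transect
s_dat <- c(10.5, 40.5, 75.5)
z_dat <- c(1.8, -0.4, 0.9)

# Simulation locations: the data locations followed by a regular grid
s_all <- c(s_dat, 1:100)
d_idx <- seq_along(s_dat)

# Assumed model: known mean beta; exponential covariance, sill 1, no nugget
beta <- 0.5
C <- exp(-as.matrix(dist(s_all)) / 15)
L <- t(chol(C))                         # C = L %*% t(L)

# Simple-kriging weights of the data for every location (one row per location)
W <- C[, d_idx] %*% solve(C[d_idx, d_idx])

cond_sim <- function() {
  # Unconditional realisation at all locations
  us <- beta + as.vector(L %*% rnorm(length(s_all)))
  # Add the kriged difference between the data and the realisation at the data points
  us + as.vector(W %*% (z_dat - us[d_idx]))
}

sims <- replicate(6, cond_sim())        # one column per realisation

# Largest deviation from the observations at the data locations (numerically zero)
max_dev <- max(abs(sims[d_idx, ] - z_dat))
print(max_dev)
```

Away from the data locations the six columns differ from each other, which is the spatial variability that a single kriging map smooths out.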

## Conditional simulation in gstat

Here the data must be named, so the dummy=TRUE option is not used.

The beta parameter may be given (usually as estimated from the data); if not it is estimated by GLS.

    > mean(log(cadmium))
    [1] 0.5610659
    > sims <- krige(log(cadmium) ~ 1, ~ x + y, meuse, meuse.grid,
    +     model = m2, nmax = 64, nsim = 6, beta=mean(log(cadmium)))
    [using conditional gaussian simulation]
    > levelplot(z ~ x + y | name, map.to.lev(sims, z=c(3:8)), aspect = mapasp(sims),
    +     main = "six conditional realisations of a correlated Gaussian field")


## Indicator simulation

(See notes "Indicator Kriging")

Indicator variables can also be simulated. Here the result is a 0/1 variable: indicator false/true. This is unlike IK where the result is a probability of a 1 (indicator is true).

In gstat, the target value must be an indicator, and the indicator=TRUE argument must be included. The mean is estimated by the proportion of true indicators in the sample.

    threshold <- 4
    indicator <- (cadmium >= threshold)
    vi <- variogram(indicator ~ 1, ~x+y, meuse)
    mi.f <- fit.variogram(vi, vgm(0.09, "Gau", 500, 0.11))
    # simulate with the indicator variogram model fitted above
    sims <- krige(indicator ~ 1, ~ x + y, meuse, meuse.grid,
        model = mi.f, indicator=TRUE,
        nsim=6, nmax=64, beta=sum(indicator)/length(indicator))
    levelplot(z ~ x + y | name, map.to.lev(sims, z=c(3:8)), aspect = mapasp(sims),
        main = "Six conditional realisations of an indicator variable")
