
MODELS OF SOFT DATA IN GEOSTATISTICS AND THEIR APPLICATION IN

ENVIRONMENTAL AND HEALTH MAPPING

Seung-Jae Lee

A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in
partial fulfillment of the requirements for the degree of Doctor of Philosophy in the
Department of Environmental Sciences and Engineering, School of Public Health.

Chapel Hill
2005

Approved by

__________________________________
Advisor: Marc L. Serre

__________________________________
Reader: George Christakos

__________________________________
Reader: Douglas Crawford-Brown

__________________________________
Reader: Michael Flynn

__________________________________
Reader: Michael Symons

__________________________________
Reader: Karin Yeatts
© 2005
Seung-Jae Lee
ALL RIGHTS RESERVED

ABSTRACT
SEUNG-JAE LEE: Models Of Soft Data In Geostatistics And Their Application In
Environmental And Health Mapping
(Under the direction of Marc L. Serre)

Spatiotemporal Geostatistics provides an efficient method for mapping a variable of interest by interpolating it at unsampled spatiotemporal locations from sparse measured values. The simple kriging and co-kriging methods of classical Geostatistics have been applied to a wide variety of environmental mapping problems, though these linear estimation methods have well-known limitations (Gaussian assumptions, restriction to exact measurements, etc.). More recently, the Bayesian Maximum Entropy (BME) method of modern Geostatistics has provided a rigorous mathematical framework that overcomes these limitations and, in particular, provides an efficient way to assimilate uncertain data expressed as soft data. The rigorous assimilation of soft data is especially attractive because it allows data from multiple sources to be integrated according to their uncertainty. However, while integrating data from multiple sources is becoming an important research topic, the development of models for soft data is still an emerging field in environmental and health applications. This dissertation is part of this emerging field. Its goal is to advance the development of models for soft data describing the uncertainty associated with existing environmental and health processes, to integrate these soft data in a BME mapping analysis, and to test the resulting increase in mapping accuracy in real case studies. Three types of data uncertainty are emphasized: uncertainty from measurement errors, uncertainty from stochastic empirical laws between primary and secondary variables, and uncertainty arising from the data observation scale. Each model of soft data is validated using synthetic simulations as well as real case studies, which include the analysis of the uncertainty associated with arsenic measurement errors, the arsenic-pH empirical law, and the observation scale of childhood asthma prevalence data. Validation analyses show that, for each of these case studies, the model developed for the soft data leads to a substantial gain in mapping accuracy over methods that do not account for data uncertainty. Consequently, the models of soft data developed here can be applied in a variety of real exposure and health mapping situations to provide highly informative maps that will be useful to environmental and public health scientists.

ACKNOWLEDGMENTS

I would like to express my sincere thanks to my advisor, Dr. Marc L. Serre, who patiently guided me throughout the entire period of my Ph.D. work. He introduced me to a challenging research area and consistently encouraged me to do my best, while always being willing to provide high-quality advice.

I am also everlastingly grateful to my Ph.D. committee: Dr. Marc L. Serre, Dr. George Christakos, Dr. Douglas Crawford-Brown, Dr. Michael Flynn, Dr. Michael Symons, and Dr. Karin Yeatts. During their service on my committee they gave their time without reluctance and generously guided me with their valuable expertise.

I would like to acknowledge the U.S. Geological Survey, Dr. Karin Yeatts, and Dr. Stephen Peters for providing their datasets. To these technical benefactors, I remain sincerely grateful. I am also indebted to the Rotary Foundation for its financial support during my first academic year at the University of North Carolina at Chapel Hill (UNC-CH). In addition, I was fortunate to be appointed as a graduate research assistant for all my academic years at the Department of Environmental Sciences and Engineering, UNC-CH, for which I am thankful.

Lastly, I would like to express my deepest thanks to my wife, Miyoung Shim, who has been there for me throughout my studies and earnestly offered the solid motivation I needed to complete my Ph.D. degree. I would also like to heartily thank my grandmother Bun-Nam Kim, my parents Hyung-Jik Lee and Eun-Hee Park, and my sisters Ji-Young, Ji-Yoon, and Sun-Young for their endless support and attention whenever needed.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter

I. Introduction

II. A measurement error model for mapping groundwater arsenic: Case study using three datasets in New England
    2.1. Background
    2.2. The uncertainty associated with arsenic data
        2.2.1. Sources of measurement errors
        2.2.2. Arsenic measurement techniques and their associated analytical errors
    2.3. Theory
        2.3.1. The knowledge bases characterizing a contaminant spatial random field
        2.3.2. Proposed model for arsenic measurement error
        2.3.3. Modeling the covariance function
        2.3.4. The BME method for spatial estimation
        2.3.5. Step by step summary of the approach
        2.3.6. Cross validation procedure
    2.4. Application of the model
        2.4.1. The arsenic datasets
        2.4.2. Mean trend
        2.4.3. Covariance analysis and verification of the measurement error parameters
        2.4.4. The BME mapping results
        2.4.5. Cross validation results
    2.5. Conclusions

III. BME mapping using empirical laws with secondary spatial data: A farewell to co-kriging?
    3.1. Background
    3.2. Method description
        3.2.1. Spatial Random Field (SRF) representation and physical knowledge bases
        3.2.2. Empirical law and cross-correlation of related spatial fields
        3.2.3. Deriving the conditional PDF f_S(χ|ψ) that describes the empirical law
            3.2.3.1. Non parametric approach
            3.2.3.2. Parametric approach
                3.2.3.2.1. Parametric polynomial of order 1
                3.2.3.2.2. Parametric polynomial of order 2
        3.2.4. BME processing of hard and soft data
        3.2.5. Generating related synthetic fields with stochastic empirical relationships
        3.2.6. Step by step description of the simple kriging, co-kriging, and BME approaches
        3.2.7. Cross validation procedure
    3.3. Results
        3.3.1. Synthetic case study
            3.3.1.1. Realization of related spatial fields
            3.3.1.2. Covariance and cross-covariance between fields
            3.3.1.3. Conditional PDF f_S(χ|ψ) describing the empirical relationship
            3.3.1.4. Assessment of mapping accuracy
            3.3.1.5. Cross validation results as a function of the curvature of the empirical law
            3.3.1.6. Cross validation results as a function of the correlation between logAs and pH
        3.3.2. Application to the real case study: Mapping arsenic in New England using soil pH
            3.3.2.1. New England datasets for arsenic and pH
            3.3.2.2. logAs-pH empirical law
            3.3.2.3. Mean trend and spatial variability of groundwater arsenic in New England
            3.3.2.4. BME estimation of groundwater arsenic across New England
            3.3.2.5. Non-attainment areas
            3.3.2.6. Cross validation results between simple kriging, co-kriging and BME
    3.4. Conclusions

IV. A geostatistical mapping framework integrating data obtained at different temporal or spatial observation scale
    4.1. Background
    4.2. Space/time observation scale: A general conceptual framework
        4.2.1. A review of BME mapping method
        4.2.2. Conceptual framework for the uncertainty associated with the observation scale
    4.3. Temporal observation scale: Mathematical formulation and synthetic case study
        4.3.1. Mathematical formulation
            4.3.1.1. Non-stationary temporal random field
            4.3.1.2. Stationary temporal random field
        4.3.2. Synthetic case study
            4.3.2.1. Synthetic verification of the uncertainty model for temporal observation scale
            4.3.2.2. Quantifying the improvement in mapping accuracy resulting from the integration of temporal observation scale uncertainty
    4.4. Spatial observation scale: Mathematical formulation and synthetic case study
        4.4.1. Mathematical formulation
            4.4.1.1. Non-homogeneous spatial random field
            4.4.1.2. Homogeneous spatial random field
        4.4.2. Synthetic case study
            4.4.2.1. Synthetic verification of the uncertainty model for spatial observation scale
            4.4.2.2. Quantifying the improvement in mapping accuracy resulting from the integration of spatial observation scale uncertainty
    4.5. Mapping the childhood asthma prevalence across North Carolina using data collected at different spatial observation scales
        4.5.1. Introduction
        4.5.2. Theory
            4.5.2.1. A review of the BME method for the mapping analysis of the childhood asthma prevalence
            4.5.2.2. Conceptual framework for the uncertainty associated with the observation scale of the childhood asthma prevalence
            4.5.2.3. Quantifying the improvement in mapping accuracy of childhood asthma prevalence resulting from the integration of spatial observation scale uncertainty
        4.5.3. Data
            4.5.3.1. The North Carolina School Asthma Survey database
            4.5.3.2. The county-level database of Medicaid-enrolled children suffering from asthma
        4.5.4. Results
            4.5.4.1. Trends and variability in the spatial distribution of local scale asthma prevalence among children
            4.5.4.2. Maps of the childhood asthma prevalence obtained using data collected at different observation scales
            4.5.4.3. Cross-validation results
            4.5.4.4. Validation results
        4.5.5. Conclusions

V. Concluding remarks

Appendix A: Derivation of empirical relationships and their associated uncertainty
    A.1. A quick overview of the multivariate linear regression model
    A.2. Parametric polynomial of order 1
    A.3. Parametric polynomial of order 2

Appendix B: A simulator to generate realizations of two spatial random fields (logAs and pH) related in terms of a quadratic empirical law

Appendix C: Derivation of σ_Y^2(t′,t) accounting for different observation time scales
    C.1. Non-stationary temporal random field case
    C.2. Stationary covariance
    C.3. Stationary exponential covariance case

Appendix D: Derivation of σ_Y^2(s′,s) accounting for different observation scales in two-dimensional (2-D) space
    D.1. Non-homogeneous 2-D spatial random field case
    D.2. Homogeneous 2-D SRF
    D.3. Application of homogeneous exponential covariance model

Appendix E: Some notes regarding the first and second arsenic datasets
    E.1. The first arsenic dataset
    E.2. The second arsenic dataset

References

LIST OF TABLES

Table 2.1: The number of above and below detects, the mean value and detection limit, and σ_o and k (Eq. 2.6) for each dataset

Table 2.2: Comparison of the values of σ_logε^2 estimated using (a) the covariance analysis and (b) the measurement error model

Table 2.3: Specifications of each of the four methods compared in the cross validation analysis

Table 2.4: Change in MSE from classical methods (i.e. methods 1 and 3) to the proposed methods (i.e. methods 2 and 4). A negative change means reduction in MSE, indicating an improvement in mapping accuracy

Table 3.1: Cross validation results for case 1

Table 4.1: Description of three estimation methods compared in the validation procedure

Table 4.2: MSE_ave calculated by averaging the validation results obtained over 20 realizations

Table 4.3: MSE_ave calculated by averaging the validation results obtained over 20 realizations

Table 4.4: Cross-validation results showing the cross-validation MSE for methods 1, 2 and 3, and the change in cross-validation MSE between method 1 and method 3, as well as between method 2 and method 3

Table 4.5: Validation results obtained when selecting a random validation set consisting of 30% of the NCSAS data. The table shows the validation MSE obtained for methods 1, 2 and 3, and the change in validation MSE between method 1 and method 3, as well as between method 2 and method 3

LIST OF FIGURES

Figure 2.1: Plot of (a) σ_Z and (b) σ_ε as a function of Z_m for σ_o = 1 µg/L and k = 3/10

Figure 2.2: The plain line depicts the expected value E[Z] as a function of Z_m for σ_o = 1 µg/L and k = 3/10. The detection limit DL = 3σ_o = 3 µg/L is shown with the vertical dashed line. The soft PDFs describing Z when Z_m = 4 µg/L, 6 µg/L and 8 µg/L are shown in dotted lines

Figure 2.3: Measured arsenic concentrations above detection limit shown with marker size proportional to observed values for (a) dataset 1, (b) dataset 2, (c) dataset 3. The locations of all measurements below and above detection limit are shown in (d)

Figure 2.4: Distribution of the mean trend of total arsenic concentration m_Y(s) across New England groundwater

Figure 2.5: Covariance model obtained using (a) all the data X_data corresponding to the three combined datasets, and (b) only above detects for the three combined datasets

Figure 2.6: Plot of the σ_logε^2 values predicted by the measurement error model versus the σ_logε^2 values obtained from the covariance analysis

Figure 2.7: Map of the BME median estimate of total arsenic in the groundwater of New England

Figure 2.8: Map of the variance of the BME posterior PDF for X(s) normalized by the variance σ_X^2. This map provides an assessment of the mapping uncertainty associated with Figure 2.7

Figure 3.1: The circles represent a subset of the data published by Sanchez et al. (2003) showing the solubility and release of log-As as a function of pH for a given soil sample contaminated with arsenic at a pesticide manufacturing site

Figure 3.2: Realization of (a) logAs(s) and (b) pH(s) obtained with our simulator using a_1 = 1.7, a_2 = 0.5 and σ_A^2 = 0.32. Asterisks in (a) and triangles in (b) are the randomly selected points used as data in the cross-validation procedure. The scatter plot of all collocated simulated logAs-pH values is shown in (c), where the plain line is the theoretical E[logAs|pH] obtained from Eq. (3.31)

Figure 3.3: Covariance and cross-covariance for the logAs(s) and pH(s) synthetic fields shown in Figure 3.2. Experimental covariance values are shown with dots, while the corresponding covariance models are shown with plain lines

Figure 3.4: The dots in (a), (b) and (c) are identical. They show the collocated measurements for the realization of logAs(s) and pH(s) shown in Figure 3.2(c). The dashed lines show µ_1(ψ) = E[logAs|pH] obtained using (a) non-parametric prediction, (b) parametric prediction with polynomial of order 1, and (c) parametric prediction with polynomial of order 2. The corresponding µ_2(ψ) are shown in (d) with different line types. The soft data obtained from µ_1(ψ) and µ_2(ψ) are shown in thick lines in (a), (b) and (c)

Figure 3.5: The simulated field of logAs(s) shown in map (a) is an identical reproduction of Figure 3.2(a) that is interpreted as the truth. The stars are the locations of the logAs hard data used by estimation method 1 (simple kriging) to produce map (b). Using these logAs hard data as well as the secondary pH data shown in Figure 3.2(b), we obtain map (c) with method 2 (co-kriging), and map (d) with method 3 (BME)

Figure 3.6: (a) Curves representing the empirical law E[logAs|pH] between collocated logAs and pH for the realizations of Table 3.1 (i.e. obtained with a_2 varying from 0 to 0.6 by increments of 0.1). (b) Curves showing the improvement in MSE reduction i∆ as a function of a_2, when the BME soft data are generated using the non parametric (plain line), the polynomial of order 1 (dotted line), and the polynomial of order 2 (dashed line) approaches

Figure 3.7: Realizations of related logAs(s) and pH(s) fields were obtained using our simulator with σ_A^2 varying from 0.08 to 0.35. The linear empirical law E[logAs|pH] for each of these realizations is shown in (a). The corresponding improvement in MSE reduction i∆ is shown in (b) as a function of σ_A^2

Figure 3.8: (a) Map of the location of the groundwater arsenic samples from wells with measurements above detection limit. The circles have a size proportional to the arsenic level recorded. (b) Map of the location of soil pH measurements shown with color indicating the recorded value according to the color scale

Figure 3.9: Scatter plot of 139 collocated logAs and pH measurements in New England. The dot-dashed line shows µ_1(ψ) = E[logAs|pH] obtained using second order polynomial regression. The dotted line shows a curve of similar shape obtained by Sanchez et al. (2003). The soft PDFs shown with plain lines are the BME soft data generated using µ_1(ψ) (and µ_2(ψ), not shown here)

Figure 3.10: (a) Mean trend of groundwater log-arsenic in New England, and (b) covariance function of its residual

Figure 3.11: (a) Map of the BME estimate of groundwater arsenic (µg/L) across New England, and (b) map of the length of the 68% BME confidence interval (µg/L) expressing the associated mapping uncertainty

Figure 3.12: BME map of the probability that the groundwater arsenic concentration across New England is in non-attainment of the drinking water standard of 10 µg/L for arsenic

Figure 3.13: Maps of the concentration of arsenic in the groundwater of New England obtained using (a) method 1 (simple kriging), (b) method 2 (co-kriging), and (c) method 3 (our proposed BME method)

Figure 4.1: Plot of σ_Y/σ_X as a function of T/a_t for different values of (t-t′)/T. Markers indicate synthetic estimates obtained from multiple random realizations (Eq. 4.21), while lines show the value predicted from theory (Eq. 4.19)

Figure 4.2: Plot showing one of the generated realizations of the TRF X(t). The simulated values χ_true are shown with a dotted line, the χ_hard data are represented by circles, and the ζ_hard data are represented by crosses. Four observation time scales of the ζ_hard data are shown with horizontal bars, and the corresponding conditional PDFs are shown with bell-shaped curves

Figure 4.3: Plots showing the simulated truth χ_true with a dotted line, the χ_hard data with circles, and the ζ_hard data with crosses. Additionally, lines show the estimated profiles obtained using (a) method 1, (b) method 2, and (c) method 3 (BME)

Figure 4.4: Plot of σ_Y/σ_X as a function of R/a_r for different values of |s-s′|/R. Markers indicate synthetic estimates obtained from multiple random realizations (Eq. 4.34), while lines show the value predicted from theory (Eq. 4.32)

Figure 4.5: Contoured map showing one of the generated realizations of the SRF X(s), along with the location of the χ_hard data points (stars), and the ζ_hard data points (triangles). The circular averaging domains for three of the ζ_hard data points are shown with a radius equal to their spatial observation scales

Figure 4.6: Maps of the simulated truth (a), compared to maps obtained with (b) method 1 using χ_hard as hard data, (c) method 2 using both χ_hard and ζ_hard as hard data, and (d) method 3 corresponding to our proposed BME method accounting for the effect of observation scale

Figure 4.7: Map showing (a) the data on asthma symptoms prevalence among high school children (age 13-14) reported in the NCSAS database for most of NC schools, and (b) the county level asthma prevalence data extracted from the database of Medicaid-enrolled children age 0-14 years who suffered from asthma. The prevalence is expressed as a fraction (i.e. average childhood asthma cases per 1 child) according to the color bar next to each map

Figure 4.8: (a) Map of the local scale mean trend m_X(s) of childhood asthma prevalence (fraction of prevalent asthma cases), and (b) plot of the covariance of the mean trend-removed local scale childhood asthma prevalence SRF X′(s)

Figure 4.9: Maps of the BME mean estimate of children asthmatic symptom prevalence (average number of cases per 1 child) observed at the school spatial scale across North Carolina. These maps were obtained using (a) method 1, (b) method 2, and (c) method 3

Figure 4.10: Maps of the BME posterior variance ([average asthma counts per 1 child]^2) obtained with (a) method 1 and (b) method 3, which provide an assessment of the uncertainty associated with the BME mean estimate maps shown in Figure 4.9 (a) and (c), respectively

I. Introduction

The spatiotemporal geostatistical framework provides an essential tool to interpolate monitored data of a variable of interest and obtain an estimate at unsampled space/time points where neither direct measurements nor a workable physical model is available. This tool provides a cost-effective method to investigate the distribution of variables of interest across space and time in our environment. Important applications include exposure mapping of environmental contaminants (e.g. groundwater contamination and air pollutants), and the spatiotemporal estimation of a variety of health outcomes (such as the prevalence of asthma symptoms). The geostatistical framework provides a stochastic approach to spatiotemporal modeling that has been widely used to address the inherent randomness of natural processes and their high variability across space and time.

There have been considerable attempts in classical Geostatistics to address these issues in terms of kriging estimators (Olea, 1999; Journel and Huijbregts, 1978; Armstrong, 1998). However, the linear kriging estimators of classical Geostatistics were primarily developed to account for exact measurements, and they have considerable, well-documented limitations (e.g. restriction to linear estimation, Gaussian assumptions, etc.) (Goovaerts, 1997; Christakos, 2000). As a result of these limitations, the linear kriging methods lack the theoretical underpinnings and practical flexibility needed to incorporate information about the errors and uncertainty associated with the monitored data. On the other hand, the Bayesian Maximum Entropy (BME) method of modern Geostatistics, developed in the last decade, provides a powerful mathematical framework for processing a wide variety of knowledge bases that are beyond the scope of classical kriging methods (Christakos, 1990, 1992, 2000b; Christakos et al., 2002). In particular, BME rigorously processes exact measurements (hard data) as well as data with associated error (soft data), leading to estimates that are more accurate than those of the linear kriging methods, which lack the ability to rigorously process soft data, as demonstrated in several case studies (Christakos et al., 2000a; 2001; Serre et al., 1999a, 2002, 2003; Choi et al., 2003).

As a result of these studies, the importance of accounting for the wide variety of environmental and health soft data available has been increasingly recognized. This variety arises from the increasing number of data sources that may not have been available in the past (new analytical measurement techniques, increased access to relevant secondary data, remote sensing and satellite technologies, measurements performed at different spatial and temporal scales, etc.). The uncertainty from each of these data sources needs to be assessed and properly modeled by means of a soft probability density function (PDF) that can be processed by the BME method. However, the investigation of data uncertainty and the development of a framework to obtain the relevant soft PDF characterizing existing environmental and health data is still an emerging field. This dissertation is part of this emerging field. Its goal is to advance the development of models that rigorously account for the uncertainty associated with existing environmental and health data, and to test these models in real case studies. Hence this dissertation is dedicated to models of soft data in Geostatistics, and their application in environmental and health mapping.

Let us consider the different sources of uncertainty associated with environmental and health data. A primary source of uncertainty for environmental data is measurement error. Environmental monitoring data are usually available as datasets (e.g. a USGS dataset of groundwater arsenic concentrations collected in New Hampshire prior to 1999) that each have a homogeneous measurement error. The measurement error associated with a particular dataset is then the aggregate of the errors coming from the analytical method used across the dataset, the sampling procedure followed (i.e. collecting, storing, and transporting samples), and the creation of the database and retrieval of information from that database. A second source of uncertainty comes from the emergence of secondary variables used to map a primary variable for which data are sparse. For example, the data for groundwater arsenic concentration collected at wells may often be sparse over a region of interest, while measurements of soil pH in the same region may be more abundant. Another example of emerging secondary data is remote sensing observations obtained from an aircraft or a satellite. In all these cases, the secondary variable is linked to the primary variable by means of a stochastic empirical law, which can be used to generate soft data for the primary variable on the basis of the measurements available for the secondary variable. A third important source of uncertainty associated with environmental and health data is the temporal or spatial scale at which the measurement is made. For example, asthma prevalence may be measured at a specific school, or it may be measured over a much wider area such as a county. Similarly, the concentration of particulate matter (an air pollutant that may be a contributing cause of asthma in children) may be collected as an hourly average or as a daily average. In all these cases, the mixing of data obtained at different spatial or temporal scales is a source of uncertainty for the space/time estimation of the variable at some scale of interest. The three sources of uncertainty described here (i.e. measurement error, secondary variable, and observation scale) illustrate the fact that not accounting properly for the uncertainty associated with the data might lead to inaccurate geostatistical estimates. This motivates the need to develop, for each source of uncertainty, a framework that generates the relevant soft data which, once rigorously processed with the BME method, will lead to increased mapping accuracy of the geostatistical estimate.
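To make the observation-scale effect concrete, the following is a minimal numerical sketch and not the model developed in this dissertation. It assumes a hypothetical hourly pollutant series generated by a stationary AR(1) process (the function name `simulate_ar1` and the values of `phi` and `sigma` are illustrative assumptions), and it shows that a daily average has a smaller variance than the hourly values and differs from any given hourly value by a non-negligible error.

```python
import numpy as np

def simulate_ar1(n, phi, sigma, rng):
    """Simulate a zero-mean stationary AR(1) series with marginal std `sigma`.

    Hypothetical stand-in for an hourly pollutant record (assumption, not from the text).
    """
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma)
    innovation_sd = sigma * np.sqrt(1.0 - phi**2)  # keeps the marginal variance constant
    for i in range(1, n):
        x[i] = phi * x[i - 1] + rng.normal(0.0, innovation_sd)
    return x

rng = np.random.default_rng(0)
n_days = 2000
hourly = simulate_ar1(24 * n_days, phi=0.8, sigma=1.0, rng=rng)

# Observation at the daily scale: average of the 24 hourly values of each day
by_day = hourly.reshape(n_days, 24)
daily = by_day.mean(axis=1)

var_hourly = hourly.var()
var_daily = daily.var()

# Error made when a daily average is used as if it were the value at hour 12
scale_error = daily - by_day[:, 12]
var_scale_error = scale_error.var()

print(f"variance of hourly values:  {var_hourly:.3f}")
print(f"variance of daily averages: {var_daily:.3f}")
print(f"variance of (daily average - hourly value): {var_scale_error:.3f}")
```

Under these assumptions, treating a daily average as an exact hourly measurement ignores an error whose variance is comparable to that of the process itself, which is the kind of uncertainty a soft PDF is meant to carry.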

This dissertation consists of an introduction (Chapter 1), three main chapters (Chapters 2, 3 and 4), and concluding remarks (Chapter 5). Each of Chapters 2, 3 and 4 treats a different type of data uncertainty, and leads to an independent real case study, as described next.

In Chapter 2 we consider the measurement error associated with three groundwater arsenic datasets collected in New England. Each dataset is characterized by its own analytical and sampling error; therefore the varying levels of measurement error between datasets should be investigated. The goal of this chapter is to develop a measurement error model that incorporates the varying uncertainty between datasets and generates the relevant soft data for the BME mapping method. These soft data will improve the mapping accuracy of groundwater arsenic, and will facilitate the incorporation of new datasets as they become available. This chapter is organized as follows. First, an appropriate measurement error model for arsenic data is developed. The model can characterize a varying measurement error variance according to the measurement error parameters assumed. The assumed model is then verified by comparing the measurement error variance from the model with that obtained from the covariance analysis of the experimental data. Once an appropriate measurement error variance is obtained, the uncertainty from the measurement error is rigorously processed in the BME method in terms of probabilistic soft data. Finally, this approach is validated using groundwater arsenic data in New England. Results from the cross-validation analysis indicate that the proposed framework for measurement error leads to a substantial increase in mapping accuracy compared to that obtained when the measurement error is ignored.

Chapter 3 deals with the uncertainty associated with secondary variables when the

relationship between primary and secondary variables may be modeled using stochastic

empirical laws. This chapter illustrates the framework developed using groundwater arsenic

as the primary variable and soil pH as the secondary variable. The traditional approach to

integrate secondary data when mapping a primary variable is co-kriging, which uses the

cross-correlation between the primary and secondary variables. The approach we propose

instead is to use the stochastic empirical law between collocated groundwater arsenic and soil

pH to generate soft data of the primary variable. This is done in terms of the conditional PDF

of groundwater arsenic given a collocated measured value for soil pH. We present three

straightforward approaches to derive this conditional PDF, which include a non-parametric

approach, and a parametric approach with polynomials of order 1 and 2. The conditional PDF

is then used to generate soft data of groundwater arsenic for each soil pH measurement.

These soft data are rigorously processed by the BME method, resulting in arsenic exposure

maps, together with maps of the associated mapping estimation error. The mapping accuracy

of the BME method accounting for non-linear empirical laws is investigated using synthetic

case studies where a variety of empirical laws between groundwater arsenic and soil pH are

explored. In all cases the BME method results in an outstanding improvement in mapping

accuracy over the co-kriging method of classical Geostatistics. As a result, this chapter

suggests a shift of the multivariate mapping paradigm from co-kriging to the BME method

when dealing with secondary variables related to the primary variable through a variety of

empirical laws. We finally apply the framework developed to a real case study integrating

soil pH data to improve the mapping accuracy of groundwater arsenic in the New England

area.

Lastly, in chapter 4 we consider uncertainty arising from the mixing of environmental or

health data measured at different spatial or temporal scales. The importance of the scale

effect must be recognized since a variable displays different physical properties depending on

the spatial or temporal scale at which it is observed. In this chapter we mathematically

derive the conditional PDF of a variable at the local scale given an observation of that

variable at a larger scale. Using this framework, it is possible to generate soft data for the

local scale on the basis of data observed at different temporal or spatial scales. This allows

the efficient integration of data observed at a variety of temporal or spatial scales, and

increases the mapping accuracy of the map obtained for the scale of interest. Mathematical

formulations are derived in the one-dimensional temporal case, and in the two-dimensional

spatial case. In each case (temporal and spatial), we validate the framework by comparing

the observation scale uncertainty predicted theoretically from the mathematical formulation,

with that inferred from multiple random realizations of a synthetic case study. Additionally

we use the synthetic case studies to quantify the gain in mapping accuracy achieved when the

BME mapping method rigorously accounts for observation scale uncertainty, compared to

classical approaches not accounting for the observation scale effect. Finally we apply the

developed framework to a real case study involving the estimation of asthma prevalence in

North Carolina. We find that in all cases the developed framework adequately describes the

uncertainty associated with the observation scale, which leads to realistic soft PDFs for the

observation scale uncertainty that are rigorously assimilated by the BME method, and results

in a substantial improvement in mapping accuracy over classical mapping methods that

ignore the scale effect.

In conclusion, this dissertation emphasizes the development of a “soft” geostatistical

framework to account for a variety of sources of uncertainty in environmental and health data.

This framework will lead to the incorporation of environmental and health data from multiple

sources, which will improve the mapping accuracy of exposure mapping of environmental

toxics, and the space/time assessment of human health outcomes.

II. A measurement error model for mapping groundwater arsenic: Case
study using three datasets in New England

2.1. Background

Over the past decade, arsenic in groundwater has become a major public health concern

because of its high toxicity and the fact that it may be naturally found at high levels in the

subsurface. Naturally occurring arsenic appears in igneous and sedimentary rocks, in soils

usually originating from sedimentary rocks, and even in the air due to volcanic explosions

and forest/grass fires (Bhattacharya et al., 2004; EPA, 2000; Hinkle et al., 1999). High levels

of naturally occurring arsenic in the groundwater are detected in certain geologic formations

such as volcanic deposit weathering, sulfide mineral deposits in bedrock aquifer, and iron

oxide rich sedimentary deposits (Welch et al., 2000; EPA, 2000). In addition, anthropogenic

contamination of the groundwater due to human activities is categorized as another source of

arsenic. These activities include the use of wood preservatives and agricultural products (e.g.

pesticides, herbicides, insecticides, and defoliants), and industrial activities (e.g. batteries,

fossil fuel burning, paper production, glass and cement manufacturing) (Welch et al.,

2000; EPA, 2000; Hinkle et al., 1999).

The human health effects associated with the ingestion of arsenic in the drinking water

include both cancer and non-cancer adverse health effects (NRC, 2001; Abernathy et al.,

1999). Chronic exposure to inorganic arsenic has been shown to cause several cancers

(NRC 1999; Karagas et al., 1998) including that of the skin (Bates et al., 1995), lung (Bates
et al., 1995; Hopenhayn-Rich et al., 1998), kidney (Hopenhayn-Rich et al., 1998), and

bladder (Karagas et al., 2004), as well as a variety of non-carcinogenic illnesses such as

cardiovascular disease, diabetes (Karagas et al., 1998), changes in the color of the skin, and

hyperkeratosis (NRC, 1999). The US standard of 50µg/L for arsenic in the drinking water

that had been used since 1975 was revised by the U.S. Environmental Protection Agency

(U.S. EPA) in 2001, resulting in a stricter new standard of 10µg/L in order to address the

increasing threat of groundwater arsenic to the human population.

The spatial distribution of naturally occurring arsenic in New England groundwater has

become a significant public health issue because of the relatively high arsenic concentrations

observed at private and public wells. Naturally occurring arsenic arising from mineral

deposits in the aquifer is the main source of arsenic across the New England groundwater.

For example in the state of New Hampshire, part of the New England region analyzed in our

work, anthropogenic sources of arsenic are negligible, whereas the weathering of bedrock

materials is a continuing source of groundwater arsenic (Peters et al., 1999). Drinking water

from privately used bedrock wells is not publicly regulated and has often contained arsenic

concentrations at levels of public human health concern (e.g. level in excess of the 10µg/L

standard). Elevated bladder cancer mortality has been observed in northern New England

including Maine, New Hampshire, and Vermont, and on-going work is investigating whether

exposure to high levels of arsenic in private wells is the probable source of these high bladder

cancer rates (Colt et al., 2002). As a result, it is important to map the levels of arsenic in the

groundwater of New England in order to assess human exposure to arsenic in the drinking

water.

Mapping of arsenic in the groundwater of New England involves using arsenic

monitoring data collected, analyzed, and stored by different agencies or organizations.

However, because of the different sampling procedures and analytical methods used to obtain

these arsenic monitoring data, the measurement uncertainty might vary widely between

available arsenic datasets. The goal of this work is to develop and implement an analysis

framework that integrates information about measurement errors associated with the arsenic

sampling data, which can then be used to construct accurate maps describing the distribution

of naturally occurring arsenic in the groundwater of New England, and the associated

mapping uncertainty.

The available arsenic datasets considered in this work come from three different sources,

so that each dataset is characterized by its own measurement analysis method, sampling

method, and detection limit. We present an overview of the different kinds of errors

contributing to the data uncertainty of arsenic concentration, and we propose a framework to

model this data uncertainty. This framework consists of a model for the measurement error

that is used to assess the data uncertainty, and provides a way to validate that assessment of

data uncertainty at the covariance analysis stage.

The analysis of data uncertainty leads to the generation of a covariance model and soft

data for arsenic, which provides information that is efficiently processed by the Bayesian

Maximum Entropy (BME) method of modern Geostatistics and its numerical implementation,

BMElib (Christakos, 1990, 2000b; Serre et al., 1998; Serre and Christakos, 1999a; Christakos

et al., 2002). The BME method is able to rigorously integrate the soft data characterizing the

measurement errors, and results in an increase of mapping accuracy over a classical approach

lacking the ability to account for data uncertainty. While the simple kriging approach of

classical Geostatistics cannot rigorously assimilate the uncertain information available due to

its limitations (i.e. linear estimator and Gaussian assumption), the BME method is able to

account for the combined effect of the high natural variability of geology and the varying

levels of measurement errors between datasets. Our work shows that the implementation of

the proposed approach results in a Mean Square Error (MSE) reduction in the real case as

well as synthetic case studies when compared to the direct approach not accounting for data

uncertainty.

2.2. The uncertainty associated with arsenic data

2.2.1. Sources of measurement errors

The measurements of environmental contaminants are usually associated with a variety of

errors that include mistakes, systematic errors, and accidental errors. While errors caused by

mistakes and systematic errors can be reduced by the training of personnel and the calibration

of instruments, accidental errors still remain in measured values. Therefore, when dealing

with a specific dataset one must be aware that the measurement uncertainty consists of the

aggregation of all types of errors, including some that cannot be completely

eliminated. Hence, in order to assess the measurement uncertainty associated with the

arsenic datasets we have access to, it is necessary to investigate the sources of errors involved

during the whole process leading to the creation of the dataset. These sources of errors

include errors in collecting, saving and transporting samples, analytical measurement errors

involved in the technique used to measure arsenic, and errors associated with the creation of

the database and retrieval of information from that database. Even though it is hard to

completely assess all the kinds of errors associated with our datasets, it might be plausible to

broadly assess uncertainty arising from the measurement techniques (i.e. analytical error) and

sampling procedures (i.e. sampling error) based on information about the datasets. We now

survey the different arsenic measurement techniques and their analytical errors.

2.2.2. Arsenic measurement techniques and their associated analytical errors

High arsenic concentration found in the groundwater may come from anthropogenic sources

or from naturally occurring material. Historically the main anthropogenic sources have

included agricultural products (e.g. arsenical pesticides), wood preservatives and industrial

waste. However previous studies seem to indicate that anthropogenic sources are not the

main contributor to the arsenic found in New England groundwater, and instead point to the

possibility of natural bedrock being the main arsenic source for this region of the US (EPA

report from USEPA region 1 office, 1981; Peters et al., 1999).

Arsenic can be found in many different forms in the groundwater depending on physico-

chemical conditions of the environment (electro-negativity, pH, etc.), and the processes

involved (oxidation-reduction, biological and bacterial processes, etc.). Arsenate (HAsO₄²⁻)

with a valence state As(V) is an anion prevalent in aerobic surface waters, while arsenite

(H₃AsO₃ or H₂AsO₃⁻) is a reduced form with valence III that is one of the primary species

found in the groundwater, and is considerably more mobile and toxic than arsenate (Schnoor,

1996). Additionally in New England, arsenic in its geological occurrence may also be found

as arsenopyrite (FeAsS), orpiment (As₂S₃) and realgar (AsS) (Peters et al., 1999).

Methylation by bacteria may also produce organic arsenicals such as methylarsenic acid,

dimethylarsenic acid and trimethylarsenic acid (Braman and Foreback, 1973; Schnoor, 1996),

however these have not been reported widely for New England groundwater.

The Safe Drinking Water Act requires EPA to revise the existing 50 µg/L Maximum

Contaminant Level (MCL) for arsenic in drinking water. On January 22, 2001 EPA

adopted a new standard and public water systems must comply with the new 10 µg/L

standard beginning January 23, 2006. Because toxicity varies with the species of arsenic

present in the water, the standard and methods measuring arsenic should be able to

differentiate between arsenic species. However because the EPA standard regulates total

arsenic present in water, most methods available measure only total arsenic, and we will

therefore restrict our attention to these methods.

The analytical techniques used to measure total arsenic concentration have considerably

improved over the years. Earlier techniques such as the “Gutzeit” method developed over

100 years ago are colorimetric methods that do not require sophisticated equipment and can

be implemented in the field; however they are not precise and have high detection limits. As

modern techniques have developed, from flame atomic absorption (FAA), to graphite furnace

atomic absorption (GFAA), to inductively coupled plasma-atomic emission spectrometry

(ICP-AES) and inductively coupled plasma-mass spectrometry (ICP-MS), the detection limit

for arsenic has continuously decreased from over 50 µg/L in the past century to about 1 µg/L

in the past decade, while new methods combining Hydride Generation and ICP-MS are now

providing detection limits of 0.01 µg/L and below.

The precision of an analytical technique may be defined as the ratio of the standard

deviation of the measurement error σΖ over the arsenic concentration Z of the sample, e.g. a

precision of 10% would mean that σΖ is equal to 0.1 Z. In general this ratio increases (i.e. the precision worsens) as

the arsenic concentration decreases, and the detection limit is defined as the smallest arsenic level that can

be determined with acceptable precision. Hence each analytical technique may be

characterized by its detection limit, its precision for a concentration close to the detection

limit, and its precision for a concentration several times higher than the detection limit.

The colorimetric methods take advantage of the formation of volatile arsine (AsH₃) gas to

separate the arsenic from other possible interferences in the sample matrix (Melamed, 2004).

This process is called hydride generation (HG). The arsine gas is then brought in contact

with a color-reacting reagent, and the operator reads the arsenic concentration by comparing

the color obtained with a color scale. Colorimetric methods include variants of the Gutzeit

method used extensively for the Taiwan studies in the 1960s (see review of these studies in

Guo et al., 1994), as well as more modern field test variants developed recently in response

to the Bangladesh studies (Kinniburgh and Kosmus, 2002). Because of the importance of the

Taiwan data collected in the 1960s, Greschonig and Irgolic (1997) re-investigated a method

they name the “mercuric-bromide-stain” method, which they believe was similar to methods

used in the 1960s. This method generates arsine gas by reduction using zinc and

hydrochloric acid, and then uses a solution of mercuric-bromide that reacts with the arsine

gas and turns yellow to brown with increasing arsenic concentration. Greschonig and Irgolic

(1997) determined that the mercuric-bromide-stain method had a detection limit exceeding

50 µg/L, with a precision as high as 64% near the detection limit, and about 21% for

arsenic concentrations of 200 µg/L. Due to these poor performances, several improvements

to the colorimetric methods were achieved during the Bangladesh crisis in the 1990s

(replacement of zinc metal with sodium borohydride, etc.), leading to the development of

colorimetric field kit technology with a much enhanced detection limit and precision. For

instance Kinniburgh and Kosmus (2002) report that their PeCo75 method (a hand held

“Arsenator”) has a detection limit of 4.2 µg/L (3 times s0=1.4 µg/L) and a precision of 14%

at concentrations much greater than the detection limit (k=0.14).

While colorimetric methods are useful for field test applications, fixed analytical

techniques using atomic spectrometry provide better precision and accuracy. The basic

procedure for analytical atomic spectroscopy generally consists in the formation of arsenic

atoms from the sample matrix, followed by excitation of these arsenic atoms using some

energy source, and finally photon emissions from the excited atoms which are quantified to

yield the concentration of arsenic. One of these techniques, which is widely used due to its

relative affordability and good precision, is graphite furnace atomic absorption (GFAA). The

atomization is obtained by introducing the arsenic sample into a graphite tube at high

temperature (Beaty et al., 1993). The resulting cloud of arsenic atoms absorbs light at

wavelengths corresponding to the specific excitation energy states of the arsenic atom. The

quantity of interest is then the absorbance of light at these wavelengths, which provides a

quantitative measure of arsenic concentration in the sample analyzed. The expected

analytical error for GFAA measurements ranges from 3-5% for concentrations greater than

10 times detection limit to 20-40% near the detection limit (Keller et al., 1996).

Another category of techniques in atomic spectrometry are those using an inductively

coupled argon plasma at high temperature as the atomization and excitation source. These

techniques include both inductively coupled plasma-atomic emission spectrometry (ICP-AES)

and inductively coupled plasma-mass spectrometry (ICP-MS). In ICP-AES, the plasma is

used to produce thermally excited arsenic atoms that emit light at characteristic wavelengths.

The emitted light is diffracted by wavelengths and amplified to yield an intensity

measurement that can be converted to a quantitative estimate of arsenic concentration by

comparison with calibration standards. In ICP-MS, the inductively coupled argon plasma is

again used as the excitation source, however there is enough energy in the plasma to also

remove an electron from the arsenic atoms and create positively charged arsenic ions

(Thomas, 2003). These ions are transported to the mass spectrometer where they are

separated from other elements according to their mass to charge ratios, and analyzed at the

high sensitivity afforded by mass spectrometry. However in the case of arsenic, any chloride

present in the sample will form ArCl⁺ in the argon plasma, which will then interfere in the

mass spectrometry analysis with arsenic. Indeed ArCl⁺ has the same mass as ⁷⁵As⁺ (atomic

mass of 75), so that ArCl⁺ ions may be counted as arsenic and cause arsenic readings to be biased

high. As a result ICP-MS without any pre-processing of the sample is limited to a detection

limit of about 1 to 5 µg/L and precision similar to that of ICP-AES. The typical analytical

error associated with the ICP-MS technique ranges from ±4-6% at concentrations

greater than 10 times the detection limit to ±20-50% at concentrations near the detection

limit (Keller et al., 1996). The United States Geological Survey (USGS) central laboratory

also indicates that the measurement error associated with ICP-MS is about 15%, while a

study by Manninen (undated) suggests that ICP-MS leads to 20% measurement uncertainty

in water.

Coupling online hydride generation with inductively coupled plasma-mass spectrometry

(HG-ICPMS) eliminates the interference between ArCl⁺ and ⁷⁵As⁺, which allows operation

of the mass spectrometer at low mass resolution (M/∆M=300), thus maximizing signal

intensities (Klaue and Blum, 1999; Peters et al., 1999). As a result while the detection limit

for ICPMS routinely exceeds 1 µg/L, that of online HG-ICPMS is as low as 0.01 µg/L

(Klaue and Blum, 1999; Peters et al., 1999). Similarly the HG-ICPMS method results in a

2000-fold increase in sensitivity (Klaue and Blum, 1999).

By way of summary, analytical techniques for total arsenic have improved over time.

Using information about the analytical technique used for available arsenic monitoring data,

it is plausible to infer some range for the detection limit and the analytical error of the data.

Combining the analytical error with sampling error will provide an assessment of the

uncertainty associated with the data.

2.3. Theory

2.3.1 The knowledge bases characterizing a contaminant spatial random field

The distribution across space of a contaminant is modeled in terms of the spatial random

field (SRF) X(s), where s is the spatial coordinate. The SRF models the distribution of the

contaminant across space in terms of a collection of plausible field realizations χ(s). The

uncertainty characterizing the SRF at points s and s’ is expressed in terms of the probability

density function (PDF) f(χ,χ’; s,s’) characterizing the different plausible realizations χ and

χ’ at these points, i.e.

f(χ,χ’; s,s’) dχ dχ’ = Prob[χ<X(s)<χ+dχ and χ’<X(s’)<χ’+dχ’], (2.1)

where Prob[.] is the probability operator. The mean trend mx(s)=E[X(s)], where E[.] is the

expectation operator fully defined in terms of the PDF of X(s), characterizes systematic

trends in the distribution of the contaminant across space. The covariance cx(s,s’)=E[(X(s)-

mx(s))(X(s’)-mx(s’))] describes spatial correlation and contaminant dependencies between

pairs of points. The mean trend and covariance function provide the foundation of the

general knowledge base available for the contaminant of interest.

While the general knowledge base describes the general characteristics of the

contaminant field X(s), we also usually have measured values at specific site locations.

When a sampled value χhard is an exact measurement of the contaminant process X(shard) at

point shard, we model that value as a hard datum, i.e.

Prob[ X(shard) =χhard]=1. (2.2)

However measurements are seldom exact and they therefore often need to be treated as soft

information. For example in the case of a measurement below detection limit at point ssoft, all

that is known is that the contaminant level is below the detection limit DL, i.e.

Prob[0< X(ssoft) < DL]=1. (2.3)

As can be seen from Eq. (2.3), this soft datum is of the interval type. More generally soft

data can be expressed in terms of a soft PDF fS describing the uncertainty associated with the

measurement, i.e.

$\mathrm{Prob}[X(s_{soft}) < u] = \int_{-\infty}^{u} f_S(\chi_{soft}) \, d\chi_{soft}$. (2.4)

For example in the case of normally distributed measurement errors, the soft PDF fS is

Gaussian.
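
As a minimal sketch of the three kinds of site-specific data in Eqs. (2.2)-(2.4), the Python fragment below represents a hard datum, an interval soft datum, and a Gaussian soft PDF, and evaluates Prob[X(ssoft) < u] for each. The class names, the uniform density assumed inside the interval (Eq. 2.3 only states that the value lies in the interval), and the numerical values are illustrative assumptions; they are not taken from BMElib.

```python
import math
from dataclasses import dataclass


@dataclass
class HardDatum:
    """Exact measurement, Eq. (2.2): all probability sits at `value`."""
    value: float

    def cdf(self, u: float) -> float:
        return 1.0 if u > self.value else 0.0


@dataclass
class IntervalSoftDatum:
    """Interval soft datum, Eq. (2.3), e.g. a below-detect in (0, DL)."""
    lower: float
    upper: float

    def cdf(self, u: float) -> float:
        # Assumption: uniform density inside the interval.
        if u <= self.lower:
            return 0.0
        if u >= self.upper:
            return 1.0
        return (u - self.lower) / (self.upper - self.lower)


@dataclass
class GaussianSoftDatum:
    """Probabilistic soft datum, Eq. (2.4), with a Gaussian soft PDF."""
    mean: float
    std: float

    def cdf(self, u: float) -> float:
        z = (u - self.mean) / (self.std * math.sqrt(2.0))
        return 0.5 * (1.0 + math.erf(z))


# Hypothetical values, for illustration only.
data = [HardDatum(5.0), IntervalSoftDatum(0.0, 3.0), GaussianSoftDatum(2.0, 0.5)]
for d in data:
    print(type(d).__name__, d.cdf(2.0))
```

All three objects expose the same `cdf` method, which mirrors the fact that hard and interval data are special cases of the general soft PDF of Eq. (2.4).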

The hard data (Eq. 2.2) and soft data (Eqs. 2.3 and 2.4) provide a site-specific knowledge

base, which, together with the general knowledge base available for the arsenic field, provide

the type of information that is efficiently processed with the BME mapping method.

However before proceeding with the BME method, we need to introduce a model for the

measurement error of arsenic data.

2.3.2 Proposed model for arsenic measurement error

Let Z(s) be a SRF representing the distribution of groundwater arsenic across space. At a

given sampling point s we denote the measured value of arsenic as Zm. Z(s) and Zm

are related through a measurement error relationship. In the case of arsenic, an appropriate

model for the measurement error is provided by the following relationship

Z =ε Zm. (2.5)

As can be seen from Eq. (2.5), ε is a multiplicative error term. This multiplicative error is an

unknown random quantity. As a result, for any given measurement, the arsenic concentration

Z is a random variable that is a function of the measured value Zm and the random

multiplicative error term ε.

The work of Kinniburgh and Kosmus (2002) shows that an appropriate model for the

standard deviation σZ of the random arsenic concentration Z given a measured value Zm is

given by the following relationship

σZ | Zm = σo + k Zm. (2.6)

Eq. (2.6) expresses that the measurement error standard deviation σZ increases linearly with

the measurement value Zm, with an intercept value of σo for Zm =0, as illustrated in Figure

2.1(a). Since the arsenic concentration can only take positive values, it is consistent to

assume that for a given Zm , the random variable Z is log-normally distributed with mean Zm

and variance σZ² = (σo + k Zm)², which is mathematically denoted as follows

Z | Zm ~ logN(Zm, (σo + k Zm)²). (2.7)


Figure 2.1: Plot of (a) σZ and (b) σε as a function of Zm for σo =1µg/L and k=3/10.

From Eq. (2.5) we have ε = Z / Zm so that since Z is log normally distributed for a given Zm,

then ε is also log normally distributed given Zm, with expected value E[ε|Zm]=E[Z|Zm]/Zm=1

and variance σε² = (σo/Zm + k)² given Zm, which is mathematically denoted as follows

ε | Zm ~ logN(1, (σo/Zm + k)²). (2.8)

The multiplicative error has an expected value of one, which means that on average the

multiplicative error is unbiased. In Figure 2.1(b) we show an illustrative plot of the error

standard deviation σε as a function of Zm. As can be seen on that plot, σε is

approximately equal to k for large measurement values Zm; however it increases

rapidly for small measurement values Zm. This behavior appropriately captures the fact

mentioned earlier that arsenic analytical measurement techniques (e.g. ICP-MS) have a

small relative error for large arsenic concentration, but this relative error increases with

decreasing concentration. This means that there is a detection limit DL below which the

measurement error is too large to be acceptable. A typical threshold for the detection limit is

3 times σo, i.e.

DL = 3 σo. (2.9)

The detection limit is shown with a vertical dashed line in Figure 2.1. As can be seen from

Figure 2.1(b), this detection limit provides an adequate cutoff to differentiate measurements

above the detection limit, with a σε approximately equal to k, from measurements below the detection limit, with a much

larger σε.
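
The behavior described by Eqs. (2.6), (2.8) and (2.9) can be checked numerically with a short sketch such as the one below. The function names are our own, and the parameter values σo = 1 µg/L and k = 3/10 are those used for illustration in Figure 2.1.

```python
def sigma_z(zm: float, sigma_o: float, k: float) -> float:
    """Eq. (2.6): standard deviation of Z given the measured value zm."""
    return sigma_o + k * zm


def sigma_eps(zm: float, sigma_o: float, k: float) -> float:
    """Standard deviation of the multiplicative error given zm, Eq. (2.8)."""
    if zm <= 0.0:
        raise ValueError("the measured value must be positive")
    return sigma_o / zm + k


def detection_limit(sigma_o: float) -> float:
    """Eq. (2.9): detection limit taken as three times sigma_o."""
    return 3.0 * sigma_o


SIGMA_O = 1.0  # µg/L, illustrative value from Figure 2.1
K = 0.3        # illustrative value from Figure 2.1

print("DL =", detection_limit(SIGMA_O), "µg/L")
for zm in (1.0, 3.0, 10.0, 100.0):
    print(f"Zm={zm:6.1f}  sigma_Z={sigma_z(zm, SIGMA_O, K):6.2f}  "
          f"sigma_eps={sigma_eps(zm, SIGMA_O, K):5.3f}")
```

For increasing Zm the printed σε values decrease toward k, while σZ grows linearly, which is the pattern sketched in the two panels of Figure 2.1.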

This proposed model for the measurement error of arsenic (Eqs. 2.5-2.9) provides the

framework necessary to generate the soft data needed for a BME mapping analysis. In order

to use this framework, one has to obtain arsenic measurement data and assess for each datum

Zm the parameters σo and k characterizing its measurement error. Then, if the value is below

the detection limit, we use a soft datum of interval type (Eq. 2.3). On the other hand, if the

measured value is above the detection limit, we can construct a soft PDF using the log

normal distribution of Eq. (2.7). For illustration purposes we show in Figure 2.2 the soft

PDFs obtained for Zm =4µg/L, 6µg/L and 8µg/L with parameters σo =1µg/L and k=3/10.

Figure 2.2: The solid line depicts the expected value E[Z] as a function of Zm for σo=1µg/L
and k=3/10. The detection limit DL=3σo =3µg/L is shown with the vertical dashed line. The
soft PDFs describing Z when Zm=4µg/L, 6µg/L and 8µg/L are shown in dotted lines.
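
The rule for generating soft data from a measured value can be sketched as follows. The conversion from the mean and standard deviation of Z to the parameters of log Z uses the standard log-normal moment relations; the function names and the tuple layout returned by `soft_datum` are illustrative assumptions, and treating a value exactly equal to DL as a detect is our own choice.

```python
import math


def lognormal_params(mean: float, std: float) -> tuple:
    """Return (mu, sigma) of log Z for a log-normal Z with given mean and std."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)


def soft_pdf(z: float, zm: float, sigma_o: float, k: float) -> float:
    """Log-normal soft PDF of Eq. (2.7), evaluated at concentration z > 0."""
    mu, sigma = lognormal_params(zm, sigma_o + k * zm)
    coeff = 1.0 / (z * sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((math.log(z) - mu) ** 2) / (2.0 * sigma ** 2))


def soft_datum(zm: float, sigma_o: float, k: float) -> tuple:
    """Interval soft datum below DL = 3*sigma_o, log-normal soft PDF otherwise."""
    dl = 3.0 * sigma_o
    if zm < dl:
        return ("interval", 0.0, dl)          # Eq. (2.3)
    mu, sigma = lognormal_params(zm, sigma_o + k * zm)
    return ("lognormal", mu, sigma)           # Eq. (2.7)


# Measured values used in Figure 2.2, plus one below-detect for contrast.
for zm in (2.0, 4.0, 6.0, 8.0):
    print(zm, soft_datum(zm, sigma_o=1.0, k=0.3))
```

By construction the log-normal parameters reproduce a mean of Zm and a standard deviation of σo + k Zm, so the soft PDF stays consistent with Eqs. (2.6) and (2.7).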

As described earlier in detail, the measurement error is the combination of several kinds

of errors, including errors arising from the analytical measurement technique and the

sampling procedure used, as well as errors associated with data entry and retrieval. In some

cases the data are available in a set of different databases that are each fairly homogeneous in

terms of the analytical technique, sampling procedure, and data management used. From the

information available for the database, it may often be possible to derive the detection limit

DL for the dataset, as well as a typical standard deviation σZO corresponding to a measured value

ZmO several times the detection limit (e.g. for ZmO=20µg/L we may have σZO=7µg/L). Then

from Eq. (2.9) and (2.6) we obtain the parameter values σo = DL/3 and k= (σZO - σo)/ ZmO.
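
As a worked example of this parameter assessment, the sketch below applies σo = DL/3 and k = (σZO - σo)/ZmO to the reference values quoted above (ZmO = 20 µg/L, σZO = 7 µg/L), together with a detection limit of 3 µg/L that we assume here for illustration. The function name is hypothetical.

```python
def error_parameters(dl: float, zm_ref: float, sigma_z_ref: float) -> tuple:
    """Derive (sigma_o, k) from the detection limit and one reference point.

    dl          -- detection limit of the dataset (µg/L)
    zm_ref      -- measured value several times the detection limit (µg/L)
    sigma_z_ref -- typical error standard deviation at zm_ref (µg/L)
    """
    sigma_o = dl / 3.0                       # from Eq. (2.9)
    k = (sigma_z_ref - sigma_o) / zm_ref     # from Eq. (2.6)
    return sigma_o, k


# Assumed detection limit of 3 µg/L; reference point taken from the text.
sigma_o, k = error_parameters(dl=3.0, zm_ref=20.0, sigma_z_ref=7.0)
print(f"sigma_o = {sigma_o:.2f} µg/L, k = {k:.2f}")
```

With these inputs the derived parameters are σo = 1 µg/L and k = 0.3, i.e. the values used in Figures 2.1 and 2.2.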

2.3.3 Modeling the covariance function

Let Y(s) be the log-transformed arsenic field, Y(s)=log Z(s). This field is modeled as the sum

of a mean trend mY(s) obtained from general information about the log arsenic field, and a

residual field X(s), as follows

Y(s) = mY(s) + X(s) (2.10)

The deterministic function mY(s) is selected such that the residual field X(s) is homogeneous

over space, so that its covariance is only a function of the spatial distance r=||s-s’|| between

points s and s’, i.e. cX(s,s’)=E[(X(s)-mX(s))( X(s’)-mX(s’))]= cX(r=||s-s’||).

By taking the log-transform of Eq. (2.5) at location s and rearranging we have logZm(s)=

logZ(s) - logε(s), which after substituting for X(s) leads to

Xm(s) = X(s) – log ε(s), (2.11)

where Xm(s)= logZm(s) - mY(s) is the SRF representing the distribution across space of

measured log-transformed mean trend removed arsenic concentrations. Eq. (2.11) provides

important insights as it shows that the measured Xm(s) field results from the linear

combination of the SRFs X(s) and logε(s). As a result Xm(s) is also a SRF, and we expect

that its spatial variability is the aggregate of the spatial variability of the SRFs X(s) and

logε(s).

It is appropriate to assume that the multiplicative measurement error is homogeneous and

not autocorrelated over space, so that it has a pure nugget covariance function, i.e.

clogε(r) = σlogε² δ(r), where δ(r) is a delta function equal to 1 for r=0 and 0 otherwise. Assuming that the SRFs X(s) and logε(s) are

independent, we obtain the following equation for the covariance cXm(r) of the SRF Xm(s)

cXm(r) = cX(r) + σlogε² δ(r). (2.12)

Furthermore, by evaluating Eq. (2.12) at r=0 we obtain

σXm² = σX² + σlogε². (2.13)

Eq. (2.13) is simply the mathematical expression of the fact that the variability observed

in Xm(s) is the sum of the variability in X(s) and logε(s). Furthermore Eq. (2.12) indicates

that the covariance cXm(r) obtained in practice using Xm values will have a nugget component

equal to σlogε² that is due to measurement error, and a component cX(r) that is the covariance

associated with the true arsenic field. The covariance cX(r) does not itself have any nugget effect

because the true arsenic concentration in groundwater is believed to be a continuous

process at very short scale due to the diffusivity of arsenic in the aqueous phase.
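
The additivity of variances in Eq. (2.13) can be illustrated with a small Monte Carlo sketch. The simulation below is not part of the original analysis: it assumes a Gaussian residual X with zero mean, a log-normal multiplicative error with unit mean, and independence between the two, and it ignores spatial structure since only the variance at r=0 is of interest. The function name, sample size, and parameter values are arbitrary choices.

```python
import random


def simulated_variance_xm(sigma_x: float, sigma_log_eps: float,
                          n: int = 100_000, seed: int = 0) -> float:
    """Sample variance of Xm = X - log(eps), cf. Eq. (2.11)."""
    rng = random.Random(seed)
    # For a log-normal eps with E[eps] = 1, log(eps) has mean -sigma^2 / 2.
    mu_log_eps = -0.5 * sigma_log_eps ** 2
    xm = []
    for _ in range(n):
        x = rng.gauss(0.0, sigma_x)                    # true residual field value
        log_eps = rng.gauss(mu_log_eps, sigma_log_eps)  # log of the error term
        xm.append(x - log_eps)
    mean = sum(xm) / n
    return sum((v - mean) ** 2 for v in xm) / n


sigma_x, sigma_log_eps = 1.0, 0.5
expected = sigma_x ** 2 + sigma_log_eps ** 2           # Eq. (2.13)
observed = simulated_variance_xm(sigma_x, sigma_log_eps)
print(f"expected {expected:.3f}, simulated {observed:.3f}")
```

The simulated variance of Xm should agree with σX² + σlogε² up to sampling noise, which is the relation exploited when the nugget of cXm(r) is read as the measurement error variance.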

As a result, when modeling the experimental cXm(r) obtained from arsenic measurements,

one simply measures its nugget component and uses that value as an assessment of σlogε2

characterizing the measurement error, while the remaining component free of nugget effect

provides the assessment of the covariance cX(r) characterizing the arsenic field.

This has important implications in the context of arsenic mapping when one uses the

measurement model proposed in Eqs. (2.5)-(2.9). In that case, using Eq. (2.8) and from the

property of log normal distribution, we obtain that for a given Zm

$$\sigma_{\log\varepsilon}^2 = \log\left(1 + \left(\sigma_o/Z_m + k\right)^2\right) \qquad (2.14)$$

Hence for a given dataset of arsenic measurements, we can calculate the average value of

log(1 + (σo/Zm + k)^2) across the dataset, and obtain a second assessment of σlogε2

characterizing the measurement error for that dataset.
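
The averaging of Eq. (2.14) over a dataset can be sketched in a few lines of Python. This is only an illustration: the function names and the example concentrations are hypothetical, and the parameter values are placeholders rather than validated estimates.

```python
import math

def sigma2_log_eps(z_m, sigma_o, k):
    """Eq. (2.14): variance of log(eps) for one measured value z_m."""
    cv = sigma_o / z_m + k  # relative precision sigma_Z / Z_m
    return math.log(1.0 + cv ** 2)

def mean_sigma2_log_eps(z_values, sigma_o, k):
    """Average of Eq. (2.14) over the measurements of one dataset."""
    return sum(sigma2_log_eps(z, sigma_o, k) for z in z_values) / len(z_values)

# Hypothetical measured concentrations (ug/L), for illustration only.
z_example = [2.0, 5.0, 20.0, 80.0]
print(mean_sigma2_log_eps(z_example, sigma_o=0.333, k=0.233))
```

Because σo/Zm grows as Zm decreases, measurements close to the detection limit contribute the largest terms to this average.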

An interesting implication in practice is that we can test the measurement error

parameters σo and k for a given dataset by comparing the σlogε2 obtained from the

measurement error model (i.e. Eq. 2.14) with the value obtained from the covariance analysis

(i.e. the nugget component of cXm(r) as expressed in Eq. 2.12). This is especially useful when

dealing with datasets that have different measurement errors. In that case each dataset can be

analyzed separately to verify that the σlogε2 value estimated with the measurement error

model matches that obtained from covariance analysis so as to validate its measurement

parameters σo and k.

Once the measurement parameters σo and k have been validated for each dataset, all the

datasets available may be combined into a single master dataset, which is used to derive the

covariance model cX(r) that is part of the general knowledge base processed using the BME

method.

2.3.4 The BME method for spatial estimation

The spatial BME mapping approach provides a powerful conceptual framework to rigorously

process the general knowledge bases consisting of the mean trend and covariance function of

the SRF X(s), and the site specific knowledge base comprising the hard and soft data. The

BME conceptual framework (Christakos 1990; 2000b; Serre and Christakos, 1999a;

Christakos et al., 2002) distinguishes between three main stages of knowledge processing

that lead to the calculation of a posterior PDF providing a full stochastic assessment of the

contaminant level at any estimation point of interest. These three main stages of the BME

framework are as follows

(i) At the structural stage, BME generates the prior PDF fG providing an initial

probability distribution across space and time based on the general knowledge base (mean

trend and covariance of X(s)).

(ii) At the specificatory stage, the site-specific knowledge available is organized into hard

and soft data and expressed in terms of suitable operators.

(iii) At the integration stage, the initial solution fG of stage (i) is enriched by assimilating

the site-specific knowledge of stage (ii). This final solution provides the posterior PDF fK (χk,

sk) for the contaminant level at each estimation point sk of interest.

In this work we use the BMElib numerical implementation of the BME method. While a

detailed treatment of the BMElib numerical implementation is available elsewhere (Serre,

1999b; Serre and Christakos, 1999a; Christakos 2000b; Christakos et al. 2002), we

summarize here the main numerical steps of the analysis. Since the general knowledge base

considered at the structural stage of the analysis consists only of the mean trend and

covariance (statistical moments up to order 2 only), the prior PDF fG obtained at the

structural stage is multivariate Gaussian, i.e.

fG (χmap, smap) = φ (χmap ; mmap, cmap ) (2.15)

where χmap is the vector of values taken by the SRF of interest at the mapping points, smap

are the spatial coordinates of these mapping points, mmap is the vector of mean trend values

provided by the mean trend model at the mapping points, cmap is the matrix of covariances

provided by the covariance model for all pairs of mapping points, and φ (.) is the multivariate

Gaussian PDF (see Serre, 1999b, for the detailed mathematical equations).

The mapping points smap include both the data points sdata and the estimation point sk, i.e.

smap=(sdata, sk). At the specificatory stage the data points are organized into hard and soft data

points, i.e. sdata=(shard, ssoft), and the corresponding site specific knowledge is defined using

Eq. 2.2 for the hard data χhard at shard, and Eqs. 2.3 and 2.4 for the soft data fS(χsoft) at ssoft,

so that we have smap=(shard, ssoft, sk) and χmap=(χhard, χsoft, χk). Then at the integration stage

BMElib calculates the posterior PDF at any estimation point sk using the following Bayesian

conditionalization rule

$$f_K(\chi_k, s_k) = A^{-1} \int d\chi_{\mathrm{soft}}\, f_S(\chi_{\mathrm{soft}})\, f_G(\chi_{\mathrm{map}}) \qquad (2.16)$$

where A is a normalization parameter.
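
To make Eq. (2.16) concrete, the following NumPy sketch evaluates the integral numerically for the simplest possible configuration. It is not the BMElib implementation: it assumes no hard data, a single soft datum with a Gaussian soft PDF, and a zero-mean, unit-variance Gaussian prior at both points. The function names and all numerical values are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def bme_posterior_1soft(chi_k, cov_sk, soft_mean, soft_var,
                        n_grid=2001, half_width=10.0):
    """Posterior PDF at one estimation point given one Gaussian soft datum.

    Assumes a zero-mean, unit-variance prior at both points, with covariance
    cov_sk between the soft data point and the estimation point.
    """
    chi_s = np.linspace(-half_width, half_width, n_grid)
    d_s = chi_s[1] - chi_s[0]
    f_s = gaussian_pdf(chi_s, soft_mean, soft_var)       # soft PDF f_S
    # Bivariate Gaussian prior f_G factored as f(chi_s) * f(chi_k | chi_s).
    prior_s = gaussian_pdf(chi_s, 0.0, 1.0)
    cond = gaussian_pdf(chi_k[:, None], cov_sk * chi_s[None, :],
                        1.0 - cov_sk ** 2)
    unnorm = (f_s * prior_s * cond).sum(axis=1) * d_s     # integral over chi_soft
    d_k = chi_k[1] - chi_k[0]
    return unnorm / (unnorm.sum() * d_k)                  # division by A

chi_k = np.linspace(-6.0, 6.0, 1201)
post = bme_posterior_1soft(chi_k, cov_sk=0.6, soft_mean=1.0, soft_var=0.5)
d_k = chi_k[1] - chi_k[0]
post_mean = (chi_k * post).sum() * d_k
post_var = ((chi_k - post_mean) ** 2 * post).sum() * d_k
print(post_mean, post_var)
```

In this all-Gaussian special case the posterior is itself Gaussian, with mean c·m/(1+v) = 0.4 and variance c²·v/(1+v) + 1 − c² = 0.76 for the values above, which gives a convenient check on the numerical integration. With interval soft data the same integral applies, but the posterior is in general no longer Gaussian.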

The posterior PDF provides a complete stochastic characterization of X(s), from which

we obtain any estimate of interest (e.g., the posterior PDF mode, BMEmode, which provides

the most likely value at the estimation point; the posterior PDF mean, BMEmean, which

minimizes the mean square estimation error; or the posterior PDF median, BMEmedian), as

well as an assessment of the uncertainty associated with that estimate (e.g., the variance of

the BME posterior PDF, or the BME confidence interval as defined in Serre and Christakos,

1999a). By obtaining the BME posterior PDF at the nodes of an estimation grid covering the

mapping region of interest, we are able to construct a map representing the distribution of

contaminant at unmonitored points across space. This map integrates soft data points with

varying levels of measurement uncertainty, as is the case for arsenic monitoring data with

varying levels of measurement error.

2.3.5 Step by step summary of the approach

By way of summary, the steps of the analysis are as follows:

1) Obtain different datasets of measurements of the arsenic field Z(s) and determine for

each dataset the σo and k values (Eq. 2.6) characterizing its measurement uncertainty.

2) Log-transform to obtain the data for the field Y(s)=log(Z(s)), and model its mean

trend mY(s).

3) Model the covariance of the residual field X(s)=Y(s)-mY(s) for each dataset separately,

and compare the σlogε2 obtained from the covariance analysis (i.e. the nugget

component of cXm(r) as expressed in Eq. 2.12) with the σlogε2 corresponding to the

σo and k measurement error model (Eq. 2.14). In case of disagreement revise the

values σo and k and go back to step 1, otherwise accept these values and obtain the

covariance model cX(r) of the combined datasets.

4) Construct the soft data for the log-transformed mean trend removed residual field X(s).

The measurements below detection limit (Eq. 2.3) are used to generate interval soft

data for X(s). The measurements above detection limit are treated as probabilistic

data with Gaussian PDF. The variance of the Gaussian soft PDF is σlogε2 calculated

from Eq. (2.14). The mean of the Gaussian soft PDF is log(Zm)-mY-σlogε2/2, where Zm

and mY are the measured total arsenic concentration and the log-mean trend at the data

point, respectively.

5) Process the covariance cX (r) and soft data for X(s) to calculate the BME posterior

PDF fK (χk, sk) (Eq. 2.16) for X(s) at the nodes sk of an estimation grid covering the

mapping area of interest. Obtain from the BME posterior PDF the median estimate of

X(s), XBMEmedian, and back-transform it to estimate the median estimate of Z(s),

ZBMEmedian=exp(XBMEmedian+mY), where mY is the mean trend value at the estimation

point.
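
Steps 4 and 5 above can be sketched as follows. The helper names are hypothetical, and the interval for below-detect measurements assumes that the true concentration lies between zero and the detection limit, so that the lower bound for X is minus infinity; Eq. (2.3) should be consulted for the exact definition used in this work.

```python
import math

def gaussian_soft_datum(z_m, m_y, sigma_o, k):
    """Step 4, above-detect case: (mean, variance) of the Gaussian soft PDF for X."""
    var = math.log(1.0 + (sigma_o / z_m + k) ** 2)   # Eq. (2.14)
    mean = math.log(z_m) - m_y - var / 2.0
    return mean, var

def interval_soft_datum(detection_limit, m_y):
    """Step 4, below-detect case: interval for X, assuming 0 < Z <= detection limit."""
    return -math.inf, math.log(detection_limit) - m_y

def back_transform_median(x_bme_median, m_y):
    """Step 5: median estimate of Z from the BME median of X."""
    return math.exp(x_bme_median + m_y)

# Hypothetical values: a 20 ug/L measurement where the log-mean trend is 1.5.
print(gaussian_soft_datum(z_m=20.0, m_y=1.5, sigma_o=0.333, k=0.233))
print(interval_soft_datum(detection_limit=1.0, m_y=1.5))
print(back_transform_median(x_bme_median=0.2, m_y=1.5))
```

The median is used for the back-transformation because it is preserved under the monotonic exponential transform, which is not true of the mean.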

2.3.6 Cross validation procedure

In order to assess the improvement provided by the proposed framework, we compare the

performance of the BME method accounting for data uncertainty, with that of various

alternate approaches not accounting for data uncertainty. We therefore need a procedure to

calculate the performance of an estimation method. In general the performance is calculated

as the mean square error (MSE) between a set of n predicted values Xi* at points pi, and the set

of true values Xi at these points, as follows

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(X_i^* - X_i\right)^2. \qquad (2.17)$$

The smaller the MSE, the more accurate the method. In the case of a cross validation, we

remove one measured value Xm,i at a time, re-estimate it using neighboring non-collocated

well data to obtain Xi*, and start the process over again for each of n points. This procedure

leads to the generation of n predicted Xi* and measured Xm,i values, from which the MSE

performance can be calculated. However an obvious flaw with this usual approach for our

problem is that the measured value Xm,i is not equal to the true arsenic value Xi, and therefore

should not be used as reference for comparison.
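
The leave-one-out procedure and Eq. (2.17) can be sketched as below. The estimator is a deliberately simple inverse distance weighting stand-in, not the BME estimator, and the coordinates and values are synthetic; only the structure of the loop is meant to carry over.

```python
import numpy as np

def mse(predicted, reference):
    """Eq. (2.17): mean square error between predictions and reference values."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean((predicted - reference) ** 2))

def idw_estimate(coords, values, target, power=2.0):
    """Stand-in estimator (inverse distance weighting), NOT the BME estimator."""
    dist = np.linalg.norm(coords - target, axis=1)
    weights = 1.0 / np.maximum(dist, 1e-12) ** power
    return float(np.sum(weights * values) / np.sum(weights))

def leave_one_out(coords, values, estimator=idw_estimate):
    """Remove each point in turn and re-estimate it from the remaining points."""
    n = len(values)
    predictions = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        predictions[i] = estimator(coords[keep], values[keep], coords[i])
    return predictions

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 100.0, size=(50, 2))   # hypothetical well locations (km)
x_measured = np.sin(coords[:, 0] / 20.0) + 0.1 * rng.standard_normal(50)
x_star = leave_one_out(coords, x_measured)
print(mse(x_star, x_measured))
```

Note that the MSE printed here is computed against the measured values, which is exactly the limitation discussed above.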

We address this issue using two alternative cases. In the so called real case, we take

advantage of the fact that some of the datasets we work with have a measurement error that is

smaller than the remaining data included in the analysis. Therefore we select only these

datasets as the basis of the cross validation analysis and calculation of the MSE. This

provides an approximate measure of performance that will tend toward the null hypothesis

(because there is still some random error in the dataset used for validation); however, it is as

representative as possible of the real world (hence its name, the “real” case).

The other case involves simulating a field of values for the Xi and Xm,i that reproduces the

statistical properties of the arsenic field and its measured values. Then the same procedure

as described above is conducted to obtain the Xi*, but when it comes time to calculate the

MSE, we are in a position to use the simulated truth Xi instead of the Xm,i. This results in a

better assessment of the real performance improvement of the proposed approach and allows

for some correction away from the null hypothesis; however, it corresponds to a simulated

world (hence its name, the “simulated” case).
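
A minimal sketch of how such a simulated truth and its measured counterpart could be generated is given below, using Eq. (2.11). It assumes a one-dimensional transect, a single exponential covariance structure, and log ε distributed as N(−σlogε2/2, σlogε2) so that ε has mean 1 (consistent with the soft data mean given in step 4 of Section 2.3.5). The function name and all parameter values are hypothetical.

```python
import numpy as np

def simulate_truth_and_measurements(coords, sill, corr_range, sigma2_log_eps, seed=0):
    """Simulate true X and measured Xm = X - log(eps) (Eq. 2.11) at 1-D locations."""
    rng = np.random.default_rng(seed)
    dist = np.abs(coords[:, None] - coords[None, :])
    cov = sill * np.exp(-3.0 * dist / corr_range)     # exponential model (assumed)
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(len(coords)))
    x_true = chol @ rng.standard_normal(len(coords))
    # log(eps) ~ N(-sigma^2/2, sigma^2), so that eps has mean 1 (assumed here).
    log_eps = rng.normal(-sigma2_log_eps / 2.0, np.sqrt(sigma2_log_eps), len(coords))
    return x_true, x_true - log_eps

coords = np.linspace(0.0, 200.0, 400)                 # hypothetical transect (km)
x_true, x_meas = simulate_truth_and_measurements(coords, 0.8, 7.0, 0.3)
print(np.var(x_meas - x_true))
```

The printed variance of the difference between measured and true values should be close to the σlogε2 used in the simulation, and the simulated truth x_true is then available as the reference in Eq. (2.17).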

2.4. Application of the model

2.4.1 The arsenic datasets

In order to illustrate the measurement error framework presented in this work, we purposely

choose three datasets of groundwater total arsenic measurements collected in New England

such that each dataset has a distinct measurement error from the other two datasets. Our goal

is to show that the measurement error for each dataset can be characterized separately from

the others, and rigorously integrated in the BME mapping analysis. Each dataset includes

only the most recent analysis of total arsenic available for each well. Total arsenic in New

England groundwater is believed to be primarily from natural sources, and therefore does not

change drastically over time. Hence each total arsenic analysis provides a value of total

arsenic today with a data uncertainty that increases for older measurements, as arsenic levels

may have changed slightly over time. In addition the data uncertainty includes analytical

error and sampling error depending on the analytical method and sampling procedure used to

measure total arsenic in each dataset. We describe here briefly each of the three datasets, and

we characterize for each dataset the corresponding measurement error in terms of σo and k

defined in Eq. (2.6). The number of data above and below detection limit is listed in Table

2.1 for each of the datasets, as well as the mean value of the dataset, the analytical detection

limit, and σo and k.

Table 2.1: The number of above and below detects, the mean value and detection limit, and
σo and k (Eq. 2.6) for each dataset.
             Number of data     Number of data     Mean      Detection       σo        k
             above detection    below detection    (µg/L)    limit (µg/L)    (µg/L)
             limit              limit
Dataset 1    219                389                12.3      1               0.333     0.233
Dataset 2    155                623                15.2      3               1.000     0.300
Dataset 3    121                144                72.1      5               1.667     0.616

The first dataset had the smallest relative measurement error, with σo=0.333 µg/L and

k=0.233. The detection limit is DL=3σo=1 µg/L, and the precision σZ/Zm for a sample with a

typical measured concentration of Zm=20 µg/L is calculated using Eq. (2.6) as follows:

σZ/Zm=σo/Zm+k=0.333/20+0.233≈25%. As explained earlier, the precision encapsulates all

sources of errors (analytical, sampling and database management errors) contributing to the

data uncertainty of that dataset. This dataset consists of 219 measurements above detection

limit and 389 measurements below detection limit. These measurements were collected

throughout New England, as shown in Figure 2.3(a). The dataset was retrieved from the

USGS National Water Information System (NWIS) in 2001, and is a subset of 20,043 arsenic

samples collected from potable water over the entire U.S. from 1973 to 2001 (Focazio et al.,

2000; USGS 2001). The arsenic analyses were performed using USGS approved analytical

methods including ICP-MS. The detection limit of 1 µg/L was reported for samples

measured below detection limit, and good USGS sampling procedure and database

management practice were followed uniformly for this dataset, which contributed to the

lower data uncertainty of this dataset compared to the other two datasets (see Appendix E).

The second dataset had a slightly higher relative measurement error, with σo=1 µg/L and

k=0.300, corresponding to a detection limit of DL=3 µg/L, and a precision σZ/Zm = 35% for a

typical measured concentration of Zm=20 µg/L. This dataset was obtained from USGS

Water-Resources Investigations Report 99-4162 published in 1999 (Ayotte et al., 1999). The

USGS compiled this dataset by collecting the existing arsenic data from states in New

England (i.e. Maine, New Hampshire, Massachusetts, and Rhode Island, see Figure 2.3b) that

were using laboratory-analysis methods and sample collection procedures in accordance with

Federal standards. The detection limits reported for measurements below detection limit in

this database ranged from 1 µg/L to 5 µg/L, so that 3 µg/L was used as the representative

detection limit across that dataset for measurements not reporting the detection limit. This

leads to the choice of σo=1 µg/L for this dataset which is 3 times larger than the σo used in

dataset 1. Though federally approved, the analytical measurement techniques used in this

dataset may have varied from state to state, and furthermore the dataset was published in

1999, two years earlier than dataset 1. This leads to the selection of k=0.300, which is

slightly larger than for dataset 1 (see Appendix E).

The third dataset is probably the most interesting, as it combines data collected by

homeowners in New Hampshire (90% of the data in dataset 3), and data collected at certain

wells by the New Hampshire district office of the USGS (10% of the data in dataset 3). The

locations of these data points are shown in Figure 2.3(c). This is a typical situation where a

dataset provides very valuable information, but with a high associated data uncertainty. As

shown in Table 2.1, the measurement error parameters for dataset 3 were σo=1.667 µg/L and

k=0.616, which is much higher than for datasets 1 and 2. This is due to the fact that the

homeowners sampled their own well at the tap following instructions sent by mail, resulting

in a higher sampling error than for the other two datasets collected by trained technicians.

Furthermore the analytical method used to analyze the water samples mailed to the New

Hampshire Department of Environmental Services (NHDES) laboratory in Concord was

graphite furnace atomic absorption spectrometry (GFAA) with a detection limit of 5 µg/L. Reports of

elevated arsenic concentrations were investigated by the New Hampshire district office of the

USGS and analyzed using ICP-OES, resulting in the remaining 10% of the data in dataset 3.

Overall the data uncertainty of dataset 3 is characterized by the high values for σo and k, so

that it can be rigorously integrated with the other two datasets in the BME analysis.

Figure 2.3: Measured arsenic concentrations above detection limit shown with marker size
proportional to observed values for (a) dataset 1, (b) dataset 2, (c) dataset 3. The locations of
all measurements below and above detection limit are shown in (d).

2.4.2 Mean trend

The arsenic data Zdata from the three datasets combined were obtained by using the measured

values above detection limit, and half the detection limit for values recorded as below detect.

The log-transformed data were given by Ydata = log(Zdata). We then obtained the mean trend

function mY(s) defined in Eq. (2.10) by smoothing the log-arsenic data Ydata using a Gaussian

kernel smoothing function of BMElib (Christakos et al., 2002). The mean trend model mY(s)

that we obtained is shown in Figure 2.4. This mean trend model represents the systematic

trend in the spatial distribution of arsenic across New England.
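
The data preparation and smoothing described above can be sketched as follows. This is a generic Gaussian-kernel weighted average written with NumPy, offered as a stand-in for the BMElib smoothing function rather than a reproduction of it; the function name, the bandwidth, and the sample values are all hypothetical.

```python
import numpy as np

def kernel_mean_trend(data_coords, y_data, query_coords, bandwidth):
    """Gaussian-kernel weighted average of y_data at each query location."""
    diff = query_coords[:, None, :] - data_coords[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    weights = np.exp(-0.5 * dist2 / bandwidth ** 2)
    return weights @ y_data / weights.sum(axis=1)

# Hypothetical measurements: value (ug/L), below-detect flag, detection limit (ug/L).
z_measured = np.array([12.0, 3.0, 1.0, 40.0])
below_detect = np.array([False, False, True, False])
detection_limit = np.array([1.0, 1.0, 1.0, 5.0])

# Below detects are replaced by half the detection limit before the log-transform.
z_data = np.where(below_detect, detection_limit / 2.0, z_measured)
y_data = np.log(z_data)

coords = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])  # km
m_y = kernel_mean_trend(coords, y_data, coords, bandwidth=15.0)
print(m_y)
```

A larger bandwidth yields a smoother trend and leaves more of the variability to the residual field X(s).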

Figure 2.4: Distribution of the mean trend of total arsenic concentration mY(s) across New
England groundwater.

2.4.3 Covariance analysis and verification of the measurement error parameters

The key step of the work presented here is the covariance analysis allowing verification of

the measurement error parameters σo and k for each dataset. Using the log-transformed data

Ydata and the mean trend model mY(s) described in the preceding section, we obtain the data

for the residual field Xdata=Ydata- mYdata. This measured data is used to calculate the covariance

cXm(r) using all the data available or any subset of it. For illustration purposes we show in

Figure 2.5(a) the covariance obtained using all the data Xdata available for the three combined

datasets and in Figure 2.5(b) the covariance obtained using the Xdata corresponding only to

the measurements above detection limit for the three combined datasets. We then report in

Table 2.2 the nugget component σlogε2 obtained from this covariance analysis (i.e. σlogε2

=0.389 is the nugget component of the covariance of Figure 2.5(a) obtained for the combined

datasets, and σlogε2 =0.221 is obtained from Figure 2.5(b) using only the above detects of the

combined datasets). We also report in Table 2.2 the σlogε2 predicted with the measurement

error model by averaging Eq. (2.14) over the data used in the analysis. As can be seen in

Table 2.2, the measurement error model predicts correctly that the covariance nugget of

Figure 2.5(b) should be smaller than that of Figure 2.5(a) because the exclusion of the below

detects removed data with high measurement errors, as explained by Eq. (2.14).

Figure 2.5: Covariance model obtained using (a) all the data Xdata corresponding to the three
combined datasets, and (b) only above detects for the three combined datasets.

Table 2.2: Comparison of the values of σlogε2 estimated using (a) the covariance analysis and
(b) the measurement error model.
                                           Estimated σlogε2 using:
Dataset used                               (a) Covariance analysis    (b) Measurement error model
Datasets combined                          0.389                      0.443
Datasets combined without below detects    0.221                      0.240
Dataset 1                                  0.478                      0.443
Dataset 2                                  0.344                      0.349
Dataset 3                                  0.730                      0.722

As described above, we then proceed by analyzing each dataset separately in order to

validate its measurement error parameters σo and k. The results obtained are reported in

Table 2.2, and as can be seen from that table there is an excellent fit between the σlogε2

obtained from the covariance analysis using each dataset separately, and the σlogε2 predicted

by the measurement error model for each of these datasets. This excellent fit is further

illustrated in Figure 2.6 plotting the Table 2.2 results, i.e. showing the plot of the σlogε2

values predicted by the measurement error model versus the σlogε2 values obtained from the

covariance analysis. The corresponding regression statistic is R²=0.972. This excellent fit is

the first of its kind for the Geostatistical analysis of total arsenic, and has important

implications for the assessment of arsenic in the groundwater of New England and the United

States. It demonstrates that using the measurement error model proposed in this work, the

uncertainty in total arsenic monitoring data can be rigorously assessed and validated at the

covariance analysis stage, thereby providing the foundation for an accurate estimation of the

distribution of groundwater arsenic across space. The remainder of this work builds on this

foundation to map arsenic across New England using our three datasets with very different

levels of measurement errors.
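
The agreement between the two columns of Table 2.2 can be checked with a short calculation. The sketch below assumes that all five rows of the table enter the regression, which the text does not state explicitly, and takes the squared Pearson correlation as the regression statistic; the variable names are hypothetical.

```python
import numpy as np

# Values transcribed from Table 2.2.
sigma2_covariance = np.array([0.389, 0.221, 0.478, 0.344, 0.730])
sigma2_model = np.array([0.443, 0.240, 0.443, 0.349, 0.722])

r = np.corrcoef(sigma2_covariance, sigma2_model)[0, 1]
r_squared = r ** 2
print(round(r_squared, 3))
```

Under these assumptions the result should come out close to the R² value reported above.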

Figure 2.6: Plot of the σlogε2 values predicted by the measurement error model versus the
σlogε2 values obtained from the covariance analysis.

Now that we have validated that the nugget component of the covariance cXm(r) is due to

the variance σ logε2 associated with measurement error, we can remove this component from

cXm(r) using Eq. (2.12). We obtain the experimental covariance cX (r) = cXm(r)-σlogε2δ(r) for

the SRF X(s) associated with the true total arsenic concentration in the groundwater of New

England. We fit to this experimental covariance the following covariance model

cX(r) = c1 exp( -3r / ar1 ) + c2 exp( -3r / ar2 ), (2.18)

where c1=0.7 σX2, ar1=7 km, c2=0.3 σX2, ar2=40 km and σX2 is obtained from Eq. (2.13). We

show this covariance model using a solid line in Figures 2.5(a) and 2.5(b), and as can be seen

in these figures, our model fits the experimental covariance estimates very well. This model

indicates that about 70% of the spatial variability of total arsenic in New England

groundwater has a short spatial range of about 7 km, while the remaining 30% of variability

has a longer range of about 40 km. These findings are in agreement with findings of Serre et

al. (2003) for arsenic in Bangladesh groundwater, where the range was found to vary

between 2 and 57 km.
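
For readers who wish to evaluate the fitted model, Eqs. (2.18) and (2.12) can be written as the following NumPy sketch. The function names are hypothetical, σX2 is left as an input, and the Dirac delta of Eq. (2.12) is treated numerically as an indicator at r=0.

```python
import numpy as np

def cov_x(r, sigma2_x, c_frac=(0.7, 0.3), ranges_km=(7.0, 40.0)):
    """Eq. (2.18): nested exponential covariance of the true field X(s)."""
    r = np.asarray(r, dtype=float)
    return sum(f * sigma2_x * np.exp(-3.0 * r / a)
               for f, a in zip(c_frac, ranges_km))

def cov_xm(r, sigma2_x, sigma2_log_eps):
    """Eq. (2.12): covariance of the measured field, with a nugget at r = 0."""
    r = np.asarray(r, dtype=float)
    return cov_x(r, sigma2_x) + sigma2_log_eps * (r == 0.0)

lags_km = np.array([0.0, 7.0, 40.0])
print(cov_x(lags_km, sigma2_x=1.0))
print(cov_xm(lags_km, sigma2_x=1.0, sigma2_log_eps=0.4))
```

With the factor of 3 in the exponent, each structure has decayed to exp(-3), roughly 5% of its sill, at a lag equal to its range, which is why 7 km and 40 km can be read as practical ranges.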

2.4.4 The BME mapping results

The general knowledge base considered is the covariance model of Eq. (2.18), while the site-

specific knowledge base consists of the soft data obtained using the measurement error

model according to the methodology presented in the theory section of this paper. As

explained above, the soft data rigorously represents the measurements below and above

detection limit by accounting for the detection limit and precision of each dataset.

Using the BME method we process this knowledge base and calculate the BME posterior

PDF at the nodes of an estimation grid covering New England. From the BME posterior

PDF we obtain the BME median estimate ZBMEmedian of total arsenic, which we map in Figure

2.7. An assessment of the mapping uncertainty associated with the BME estimate of Figure

2.7 is provided by the variance of the BME posterior PDF for X(s) normalized by the

variance σX2, which we show on the map of Figure 2.8.

Figure 2.7: Map of the BME median estimate of total arsenic in the groundwater of New
England.

Figure 2.8: Map of the variance of the BME posterior PDF for X(s) normalized by the
variance σX2. This map provides an assessment of the mapping uncertainty associated with
Figure 2.7.

The maps of Figures 2.7 and 2.8 provide very valuable information for public health

officials dealing with the problem of groundwater arsenic in New England. Figure 2.8 shows

that the mapping uncertainty increases as we move away from the locations where samples

were collected. This map is useful to allocate monitoring resources in areas of high mapping

uncertainty. Furthermore we note that the mapping uncertainty does not drop to zero at the

sampling locations. This illustrates the fact that the maps presented here not only account for

the high natural spatial variability of arsenic geology, but also for the measurement errors

associated with the arsenic samples. Indeed the BME method takes into account the

uncertainty associated with the soft data by rigorously processing the measurement error

model presented in this work. Hence the BME estimate of arsenic presented in Figure 2.7 is

the best map of groundwater arsenic produced to date on the basis of the datasets used in this

work. This map provides the tools for public health officials to identify areas where the

arsenic concentration in the groundwater may exceed the new 10 µg/L standard beginning

January 23, 2006, which will help determine where additional treatment will be warranted to

remove arsenic from drinking water. Furthermore the work presented here provides an ideal

framework to add new monitoring data with presumably lower detection limit and better

precision as the analytical measurement techniques for arsenic and its speciation keep

improving in the future.

2.4.5 Cross validation results

As described in the theory section, in the “real” case we perform a cross validation using a

selected dataset with low measurement error as the reference dataset. This cross validation

allows us to compare the estimation method presented in this work against other estimation

methods by comparing their MSE (Eq. 2.17). The four methods that we will compare are

summarized in Table 2.3. Method 1 uses “hardened” data for the measurements above and

below detection limit, i.e. it treats the measured values above detection limit and the mid

point of the interval below detection limit as if they were exact measurements of arsenic.

Hence method 1 represents a classical approach not accounting for data uncertainty. On the

other hand method 2 uses the approach presented in this work, i.e. it uses the measurement

error model to rigorously account for the data uncertainty in the measurements above

detection limit, and it uses an interval soft data for the measurements below detection limit.

Note that the reference dataset is always treated as hard, since that is the dataset used to

calculate the MSE. Hence the reduction of MSE between method 1 and 2 will provide a

measure of the improvement in estimation accuracy attributed to rigorously accounting for

data uncertainty in the above and below detects. Additionally methods 3 and 4 are similar to

methods 1 and 2, respectively, except that measured values below detection limit are ignored.

This will allow us to assess the effect of the measurement error model alone.

Table 2.3: Specifications of each of the four methods compared in the cross validation
analysis.
            Data > Detection Limit                        Data < Detection Limit

Method 1    Measured value as hard                        (upper bound + lower bound)/2 as hard

Method 2    probabilistic soft data using measurement     [lower bound, upper bound]
            error model, except reference dataset         as interval soft data
            treated as hard data

Method 3    Measured value as hard                        ignored

Method 4    probabilistic soft data using measurement     ignored
            error model, except reference dataset
            treated as hard data

The dataset used as the reference for the cross validation was chosen

to be either dataset 1 or dataset 2. We first selected dataset 1 because it has the smallest σo and k

(see Table 2.1). We then selected dataset 2 as the reference because it has the smallest

average σlogε2 (see Table 2.2). Hence each of these datasets has relatively small measurement

error and can be used to represent the unknown true arsenic concentration. However dataset

3 has consistently higher measurement errors, so it cannot be selected as the reference dataset.

Using a selected dataset as reference, we calculated the cross validation MSE (Eq. 2.17)

for each of the methods described in Table 2.3, and we then calculated the percent change in

MSE from Method 1 to Method 2 as 100%*(MSE2 – MSE1)/MSE1, and from Method 3 to

Method 4 as 100%*(MSE4 – MSE3)/MSE3. The results obtained are shown in Table 2.4. As

can be seen from this table, there is a consistent decrease in MSE from method 1 to method 2

using either dataset 1 or 2 as the reference (validation) dataset. This indicates that the

proposed approach presented in this work leads to a consistent improvement in mapping

accuracy over a method not accounting for data uncertainty. The reduction in MSE is due to

the fact that our approach rigorously accounts for the data uncertainty associated both with

the above detects as well as the below detects. In order to consider only the effect of above

detects (i.e. ignoring the effect of below detects), we turn to methods 3 and 4. We see that

there is still a consistent decrease in MSE, indicating that rigorously accounting for the data

uncertainty of measurements above detection limit using the measurement error model

presented in this work leads to a consistent improvement of mapping accuracy in the real

case.
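
The percent change in MSE used for this comparison is simple enough to state as code. The function name and the MSE values below are hypothetical and serve only to illustrate the sign convention.

```python
def percent_change_mse(mse_classical, mse_proposed):
    """Percent change in MSE; negative values mean the proposed method is more accurate."""
    return 100.0 * (mse_proposed - mse_classical) / mse_classical

# Hypothetical MSE values, for illustration only.
print(percent_change_mse(mse_classical=0.50, mse_proposed=0.45))
```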

Table 2.4: Change in MSE from classical methods (i.e. methods 1 and 3) to the proposed
methods (i.e. methods 2 and 4). A negative change means reduction in MSE, indicating an
improvement in mapping accuracy.
                                                          Change in MSE from      Change in MSE from
                                                          Method 1 to Method 2    Method 3 to Method 4
“Real” case using dataset 1 as validation dataset         -8.21%                  -4.39%
“Real” case using dataset 2 as validation dataset         -14.65%                 -10.65%
“Simulated” case using dataset 1 as validation dataset    -67.4%                  -38.8%

While the cross-validation in the “real” case uses dataset 1 or 2 as the reference, these

datasets do not represent the true arsenic concentration. The true arsenic concentration is

actually unknown, and using “real” datasets introduces a random error in the cross validation

procedure. As discussed earlier, this leads to a tendency toward the null hypothesis, i.e. it

dampens the ability to measure the change in MSE between methods. We believe that this

means that the reductions in MSE reported for the “real” cases in Table 2.4 are lower bounds

of the true reduction in MSE, and we address this issue using the “simulated” cross validation

described in the theory section. The simulated dataset reproduces the statistical properties of

the log-arsenic data from the three combined datasets. For example, the variance (i.e. 0.7071

[log-µg/L]2) of the simulated log-arsenic data matches well with the variance (i.e. 0.7875

[log-µg/L]2) of the true log-arsenic. As previously represented in Figure 2.5(a) the true

variance is calculated after removing the nugget effect from the experimental variance using

the real data. As can be seen in Table 2.4, the cross validation result obtained confirms our

belief, as it shows that when we use a simulated truth as the basis for cross-validation, the

reduction in MSE from method 1 to method 2 is 67.4%, and the reduction from method 3 to

method 4 is 38.8%.

The decrease in MSE reported in Table 2.4 demonstrates that accounting for the data

uncertainty of both above and below detects leads to a very significant improvement in

mapping accuracy for arsenic in New England. A substantial part of the improvement in

mapping accuracy reported in our results comes from the mathematically rigorous analysis of

measurement errors using the measurement error model presented in this work.

2.5. Conclusions

Due to the increased recognition of the public health concern associated with arsenic and the

impending change in the federal standard limiting arsenic concentration in the drinking water

at 10 µg/L, it has become important to accurately map arsenic concentration across our

ground waters. A survey of analytical techniques and sampling procedures used to measure

arsenic shows that the detection limit and precision have improved drastically over time, and

will continue to do so in the near future. As a result we are faced with a situation where

historical groundwater arsenic datasets may provide very valuable information but have very

different levels of measurement errors. Furthermore new datasets may be collected in the

future with a much better precision than is now routinely achieved. The question raised is

then: how can we effectively and rigorously process datasets with varying levels of

uncertainty so as to accurately map arsenic across space?

We present in this work a model for the measurement error of total arsenic. We define

the measurement error as including all random and systematic errors, including analytical

errors, sampling errors, and data management errors. Our measurement error model is an

extension of the model by Kinniburgh and Kosmus (2002) that (a) provides a way to validate

the measurement error parameters at the covariance analysis stage, and (b) generates the

probabilistic distribution of errors in the form of a so-called soft PDF, which can then be

rigorously analyzed using the BME method of modern Geostatistics.

We applied the proposed framework using three historical datasets of total arsenic

measurements from samples collected in New England. Two of the datasets were obtained

from the USGS and have a relatively low measurement error. The third dataset included

samples collected and mailed by homeowners, leading to higher sampling error. We were

able to assess the measurement error parameters for our model for each dataset based on

information available about the analytical techniques and sampling procedures used. We

then validated the resulting measurement error model at the covariance analysis stage, and

this study is the first of its kind demonstrating that this is feasible in the geostatistical

mapping analysis of groundwater arsenic. Hence the measurement error model provides the

foundation to generate soft data rigorously accounting for the data uncertainty associated

with each dataset. Using the soft data generated we obtained a map showing the distribution

of arsenic across the groundwater of New England. Finally our cross validation analysis

demonstrated that the rigorous processing of measurement errors using the approach

presented in this paper leads to a substantial improvement of mapping accuracy over methods

not accounting for the difference in measurement error between the datasets available.

This work has important implications for public health officials needing to identify areas

where the arsenic concentration in the groundwater may exceed the new 10 µg/L standard for

drinking water beginning January 23, 2006. Using the approach presented here they will be

able to assess the measurement error of historical datasets and rigorously process them to

accurately map the distribution of arsenic in their ground water. Furthermore the work

presented here provides an ideal framework to add new monitoring data with presumably

lower detection limit and better precision as the analytical measurement techniques for

arsenic and its speciation keep improving in the future. While the maps presented here

provide the best assessment to date of total arsenic in New England ground waters on the

basis of the three historical datasets obtained for this work, future work will look into

improving further this map by incorporating new data for New England as these become

available.

III. BME mapping using empirical laws with secondary spatial data: A farewell to co-kriging?

3.1. Background

Environmental mapping studies are concerned with the spatial estimation of an

environmental contaminant at some unsampled locations. By mapping the estimated values

obtained at the nodes of a regular grid, we obtain a realistic representation of the spatial

distribution of the environmental contaminant of interest, which we refer to as the primary

variable. When dealing with error-free measurements (hard data) of the primary variable

available at some set of sampling locations, a traditional approach is to use the simple kriging

(SK) method of classical Geostatistics (Journel and Huijbregts, 1978; Olea, 1999; Armstrong,

1998; Isaaks and Srivastava, 1989). Assuming without loss of generality that the mean trend

of the spatial field representing the primary variable is known (or can be estimated using a

parameterized spatial regression model or some spatial smoothing operator), SK is simply the

best linear unbiased estimator (BLUE) using the hard data available for the primary variable,

i.e. it is the linear combination of the error-free measurements of the primary variable that is

unbiased and minimizes the estimation error variance at the estimation point (Stein, 1999;

Christakos, 2000b).

In many environmental applications, secondary spatial fields related to the primary field

provide additional information that is useful to map the primary variable. For example, the

soil pH is a secondary spatial field providing useful information to map the concentration of
groundwater arsenic over space. The traditional extension of SK to account for secondary

spatial field data has been the simple co-kriging method (Goovaerts, 1997; Wackernagel,

1995). Simple co-kriging is a BLUE approach that extends SK by integrating secondary hard

data using the statistical cross correlation between the primary and secondary variables.

However, the stochastic empirical law describing the relationship between the primary

and secondary variables is often complex and information rich. This stochastic empirical law

may be denoted as the conditional Probability Density Function (PDF) fS(χ|ψ) of the primary

variable x given an error free measured value ψ for the collocated secondary variable y. This

conditional PDF has multiple statistical moments, each of which may vary non-linearly with

the measured secondary variable. Hence, a stochastic empirical law may in general be

described as a vector of non-linear relationships. While co-kriging accounts for the cross

correlation coefficient summarizing the relationship between the primary and secondary

variables, it does not have any formal mechanism to process the multiple nonlinear aspects of

a realistic stochastic empirical law. As a result, unless a remarkable cross correlation is

detected, co-kriging does not guarantee a substantial increase in mapping accuracy, as has

been found in previous works, such as that of Welhan and Merrick (2003) investigating the

estimation of groundwater arsenic using specific conductance as the secondary variable.

In this work, we investigate the mapping accuracy of a Bayesian Maximum Entropy

(BME) approach that formally accounts for the stochastic empirical law between the primary

and secondary variables. We describe some straightforward non-parametric and parametric

approaches to model the multiple non-linear aspects of the stochastic empirical law. Then, in

order to compare the kriging, co-kriging, and BME methods, we develop a method to

simulate a useful class of spatially related synthetic random fields. Using these related

synthetic fields, we demonstrate that because the BME approach formally accounts for the

empirical law between the primary and secondary variables, it leads to a substantial

improvement in mapping accuracy over the co-kriging method which only accounts for the

cross-correlation between primary and secondary variables. Finally we demonstrate the

applicability of our BME approach in the mapping estimation of arsenic in New England

using soil pH as the secondary spatial field.

The remainder of this paper is organized as follows. In the methods section we present the

framework for spatial random fields, the non-parametric and parametric approaches used to

model the stochastic empirical law, the BME estimation method, and finally a procedure to

generate the spatially related synthetic fields used for cross validation purposes. Then in the

results section we present first the synthetic case study, followed by the real-world

application to mapping groundwater arsenic in New England using soil pH data as the

secondary variable.

3.2. Method description

3.2.1. Spatial Random Field (SRF) representation and physical knowledge bases

We denote by X(s) the SRF (Christakos, 1992) representing the spatial distribution of the

primary variable X at the spatial location s, where s=[s1, s2] for a two-dimensional space. The

set of mapping points of interest is denoted as smap={si}, where i=1, 2,…, n. The vector of

random variables xmap=[X(s1), X(s2),…, X(sn)] represents the SRF at the mapping points smap.

A possible realization of xmap is denoted as the vector of realized values χmap = [χ1,…, χn].

The randomness associated with xmap is then represented by the set {χmap} of all possible

realizations. Randomness is fully characterized by the multivariate PDF f(χmap) describing

the probability associated with any given realization χmap, i.e.

f (χmap) dχ = Prob[χ1 < x1 <χ1+dχ1,…, χn < xn < χn+dχn] (3.1)

where Prob[.] is the probability operator. An important operator on xmap is the stochastic

expectation operator E[.] of some known function g(xmap) of xmap, which is defined as the

expected value of g(xmap) obtained as follows

E[g(xmap)] = ∫dχmap g(χmap) f(χmap). (3.2)

The physical knowledge base K describing the contaminant SRF consists in the union of

general knowledge G characterizing the spatial trend and variability of the environmental

processes at play, and site specific knowledge S including the monitoring data available for

the specific site at hand. The spatial trend of the primary environmental variable is modeled

by the mean trend function mX(s) of the SRF X(s) defined as

mX(s) = E[X(s)]. (3.3)

This mean trend function characterizes the systematic trends and spatial structures of the

primary variable. The spatial variability is characterized by the covariance function cX(s,s’)

of the SRF X(s) between points s and s’ defined as

cX(s,s’) = E[ (X(s)-mX(s)) (X(s’)-mX(s’)) ] (3.4)

The covariance function quantifies the amount of co-variability for the primary variable

taken at a pair of points s and s’, which provides a measure of the spatial dependencies and

autocorrelations in the field representing the primary variable. While mX(s) and cX(s,s’)

constitute the general knowledge G, the site-specific knowledge S consists in the actual data

available at a set of specific data points sdata={si}, where i=1, 2,…, m. This data often includes

hard data χhard regarded as exact measurements of the primary variable at the points

shard={si}, where i=1, 2,…, mh, i.e.

Prob[ X(shard) =χhard] = 1. (3.5)

In many environmental applications we also consider a set of points ssoft={si}, i= mh+1, …,

mh+ms=m, where some so-called soft data is available, but has quantifiable associated

uncertainty. A soft datum may be of the interval type (Christakos et al., 2001; Christakos

and Serre, 2000a). For example when measurements of the primary variable at points ssoft are

below detection limit, we have

Prob[0< X(ssoft) < Detection Limit] = 1 (3.6)

More generally soft data is of the probabilistic type (Christakos et al., 2001; Christakos and

Serre, 2000a; Serre et al., 2005) when the uncertainty in the soft data can be quantified by

means of a soft PDF fS such that

$$\mathrm{Prob}[X(s_{soft}) < u] = \int_{-\infty}^{u} d\chi_{soft}\, f_S(\chi_{soft}). \tag{3.7}$$

We describe next the field representing the secondary variable, and in the following section

we present some straightforward approaches to derive soft data (Eq. 3.7) for the primary

variable on the basis of exact measurements of the secondary variable.

3.2.2. Empirical law and cross-correlation of related spatial fields

In a wide range of environmental mapping applications there exists an SRF Y(s) for a

secondary variable Y that is related to the primary variable X through some empirical law.

As an example, the consideration of an empirical law describing the association between

groundwater arsenic (As) concentration and the soil pH is motivated by the work of Sanchez

et al. (2003). Their study considered a soil contaminated with As and analyzed the As

solubility as a function of pH levels (a representative subset of their points is shown with

circles in Figure 3.1). A curve fitted to the experimental data (shown with a plain line in

Figure 3.1) indicates a clear non-linear relationship between log-As and pH due to the

dependency of arsenic solubility with pH. This evidence supports the existence of empirical

laws describing the relationship between log-As and pH, and has been confirmed by studies

at different geological sites. Peters et al. (1999) observed that As-levels in New England

groundwater are affected by pH-levels since the As-concentration varies with anion exchange

and co-precipitation with iron and manganese oxyhydroxides. Similarly the study of arsenic

in eastern New England by Ayotte et al. (2003) suggests that the high levels of As occur

where elevated pH-values exist due to the geological properties of the bedrock aquifer (i.e.

presence of calcite, ion exchange etc.).

Figure 3.1: The circles represent a subset of the data published by Sanchez et al. (2003)
showing the solubility and release of log-As as a function of pH for a given soil sample
contaminated with arsenic at a pesticide manufacturing site.

The stochastic empirical law between the collocated random variables x=X(s) and y=Y(s)

provides one way to model the spatial relationship between the SRFs X(s) and Y(s). This

empirical law is expressed by the conditional soft PDF fS(χ|ψ) of the primary variable x given

an error free measured value ψ for the collocated secondary variable y. The conditional PDF

provides a complete stochastic description of the relationship by means of its various

statistical moments, each of which may vary non-linearly with the measured value for y. In

practice, it is convenient to model the conditional PDF using an adequate statistical

distribution φ of x given a set of coefficients µ=[µ1, µ2, ..., µm], each of which is a function of

the measured secondary variable ψ, i.e.

fS(χ |ψ) = φ(χ ; µ(ψ)). (3.8)

A common example for φ is the Gaussian PDF with only two parameters µ = (µ1, µ2) where,

µ1(ψ) = E[x|ψ] (3.9)

is the expected value of x given an error free measured value ψ for the collocated y, and

µ2(ψ) = Ε[(x-µ1(ψ))²|ψ] (3.10)

is the variance of x given ψ. Hence the spatial relationship between X(s) and Y(s) may be

modeled through the empirical law fS(χ |ψ)= φ(χ ; µ(ψ)), which consists in obtaining the

vectorial non-linear relationship µ(ψ) = (µ1(ψ), µ2(ψ)).
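
As a minimal illustration of Eq. (3.8), the following Python sketch evaluates a Gaussian conditional PDF for a given measured value ψ. The function name `gaussian_soft_pdf` and the quadratic mean and variance laws passed to it are hypothetical placeholders chosen for illustration; they are not values estimated in this work.

```python
import numpy as np

def gaussian_soft_pdf(chi, psi, mu1, mu2):
    """Evaluate f_S(chi | psi) = phi(chi; mu(psi)) for a Gaussian phi (Eq. 3.8).

    mu1 and mu2 are callables returning the conditional mean (Eq. 3.9)
    and the conditional variance (Eq. 3.10) of x for a measured value psi.
    """
    mean = mu1(psi)
    var = mu2(psi)
    return np.exp(-0.5 * (chi - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical empirical law: quadratic mean, variance growing away from psi = 7.
mu1 = lambda psi: 0.5 + 0.2 * (psi - 7.0) ** 2
mu2 = lambda psi: 0.3 + 0.05 * abs(psi - 7.0)

chi_grid = np.linspace(-5.0, 8.0, 2001)
pdf = gaussian_soft_pdf(chi_grid, 8.0, mu1, mu2)

# Trapezoid rule: the soft PDF should integrate to one over the grid.
area = np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(chi_grid))
```

Any other two-parameter distribution φ could be substituted in the same way, provided its parameters are expressed as functions of ψ.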

The cross-covariance cXY(s,s’) also quantifies the connection between two related spatial

fields. It is an extension of the covariance function (Eq. 3.4) defined as

cXY(s,s’) = E[ (X(s)-mX(s)) (Y(s’)-mY(s’)) ]. (3.11)

The cross-covariance function measures spatial dependencies and correlations between the

two spatial fields, and from it we obtain the dimensionless correlation coefficient ρXY at some

locations s as

ρXY = cXY(s,s) / (σX(s) σY(s)), (3.12)

where σX(s) is the standard deviation of the primary variable at s, and likewise σY(s) is the

standard deviation for the secondary variable.

However, cXY(s,s’) and ρXY only provide a global statistical description of the relationship

between the two spatial fields that fails to account for any non-linearity, whereas the

empirical law offers a complete description of the non-linear aspect of the relationship

between the two fields in terms of the vectorial function µ(ψ). In the following section we

present three straightforward approaches to model µ(ψ) from collocated measurements, first

using a non-parametric approach, and then using a parametric approach with polynomials of

order 1 and 2.
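
The limitation of ρXY noted above can be illustrated with a small numerical experiment. The Python sketch below uses synthetic collocated values (an assumption made purely for illustration, not data from this work) in which x depends on y only through a quadratic term. The correlation coefficient of Eq. (3.12) is then close to zero even though the conditional mean of x clearly changes with y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic collocated values (illustrative only): y is symmetric about 7
# and x follows a purely quadratic law in y plus independent noise.
y = rng.normal(7.0, 1.0, size=20000)
x = (y - 7.0) ** 2 + rng.normal(0.0, 0.2, size=y.size)

# Collocated correlation coefficient, Eq. (3.12), from sample moments.
c_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho_xy = c_xy / (x.std() * y.std())

# Conditional means of x in two classes of y expose the dependence
# that the single number rho_xy does not capture.
mean_near = x[np.abs(y - 7.0) < 0.5].mean()
mean_far = x[np.abs(y - 7.0) > 1.5].mean()
```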

3.2.3. Deriving the conditional PDF fS(χ|ψ) that describes the empirical law

We denote in this section by χ=[χ1, χ2,…, χN]ᵀ and ψ=[ψ1, ψ2,…, ψN]ᵀ the column vectors

of exact measurements of the primary and secondary variables X and Y, respectively, at

locations {s1, s2,…, sN} where collocated measurements of X and Y are available. Note that in

general N<n, since the N points with collocated (X,Y) measurements are a subset of the n

mapping points. We also denote by χ̄ and ψ̄ the arithmetic average of the elements in the

χ and ψ vectors, respectively.

3.2.3.1. Non parametric approach

In many cases, the empirical relationship between the primary variable x=X(s) and collocated

secondary variable y=Y(s) does not have a known functional form. A non-parametric

approach is then useful to model µ(ψ)= (µ1(ψ), µ2(ψ)), where ψ is an exact measured value

for y; µ1(ψ) = E[x|ψ] (Eq. 3.9) is the conditional expected value of x given y=ψ; and µ2(ψ)

= Ε[(x-µ1(ψ))²|ψ] (Eq. 3.10) is the conditional variance of x given y=ψ . To achieve this

objective within the non-parametric approach we first partition the collocated observations

χ and ψ into a set of disjoint classes χ(k) and ψ(k), k=1,…,K, subject to ψk < ψ(k) < ψk+1, i.e.

ψ(k) is the subset of ψ that belongs to the interval [ψk, ψk+1]. Then each class has, under an

ergodicity assumption, its own expected value µ1 of x given that ψk < y < ψk+1:

$$\mu_1(\bar{\psi}^{(k)}) \cong E[x \mid \psi_k < y < \psi_{k+1}] \cong \bar{\chi}^{(k)} \tag{3.13}$$

where ψ̄(k) is approximately equal to the midpoint between ψk and ψk+1, and χ̄(k) is the

arithmetic average of the corresponding vector χ(k). Similarly we obtain µ2 for each class as

$$\mu_2(\bar{\psi}^{(k)}) \cong E[(x - \mu_1(\bar{\psi}^{(k)}))^2 \mid \psi_k < y < \psi_{k+1}] \cong \overline{\left(\chi^{(k)} - \mu_1(\bar{\psi}^{(k)})\right)^2}. \tag{3.14}$$

Finally the set of values {ψ̄(k), µ1(ψ̄(k)), µ2(ψ̄(k))}, k=1,…,K, provides a discretized form of

the µ(ψ) relationships.
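
The class-wise estimates of Eqs. (3.13) and (3.14) can be sketched in Python as follows. The function name `nonparametric_law`, the half-open class intervals, and the small data set are assumptions introduced for illustration only.

```python
import numpy as np

def nonparametric_law(chi, psi, edges):
    """Discretized mu(psi) from collocated data (Eqs. 3.13-3.14).

    chi, psi : collocated measurements of the primary and secondary variables
    edges    : class limits psi_1 < ... < psi_{K+1}
    Returns the class averages of psi, and the class means and variances of chi.
    Classes are taken as half-open intervals [psi_k, psi_{k+1}) (an assumption);
    classes without data are returned as NaN.
    """
    chi, psi, edges = (np.asarray(a, dtype=float) for a in (chi, psi, edges))
    K = len(edges) - 1
    psi_bar = np.full(K, np.nan)
    mu1 = np.full(K, np.nan)
    mu2 = np.full(K, np.nan)
    for k in range(K):
        in_class = (psi >= edges[k]) & (psi < edges[k + 1])
        if in_class.any():
            psi_bar[k] = psi[in_class].mean()
            mu1[k] = chi[in_class].mean()                     # Eq. (3.13)
            mu2[k] = np.mean((chi[in_class] - mu1[k]) ** 2)   # Eq. (3.14)
    return psi_bar, mu1, mu2

# Hypothetical collocated measurements.
psi_obs = np.array([5.1, 5.4, 5.8, 6.2, 6.6, 6.9, 7.3, 7.7])
chi_obs = np.array([1.0, 3.0, 2.0, 0.0, 1.0, 2.0, 4.0, 6.0])
psi_bar, mu1, mu2 = nonparametric_law(chi_obs, psi_obs, edges=[5.0, 6.0, 7.0, 8.0])
```

In practice the number of classes K trades resolution of the µ(ψ) curve against the number of observations available in each class.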

3.2.3.2. Parametric approach

3.2.3.2.1. Parametric polynomial of order 1

In some cases the empirical law between x and y is known to be linear. In this case a

parametric approach used to obtain µ(ψ) consists in using the following polynomial model of

order 1

xi = β0 + β1yi +εi 1≤ i ≤ N , (3.15)

where β0 and β1 are regression coefficients, εi is an unobservable random error, and xi=X(si)

and yi=Y(si) are random variables for X and Y, respectively, at collocated measurement point

si. This equation can also be given in matrix/vector notation as

x = Dβ + ε (3.16)

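where

$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}, \quad D = \begin{bmatrix} 1 & y_1 \\ 1 & y_2 \\ \vdots & \vdots \\ 1 & y_N \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \text{and} \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{bmatrix}.$$

D is known as the design matrix.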

Using standard regression theory, the expected value µ1(ψ) of x given ψ is simply given

by the estimator β̂0 + β̂1ψ, where β̂0 and β̂1 are ordinary least squares estimates of β0 and β1,

respectively, and the variance µ2(ψ) of x given ψ is the square of the prediction standard

error, PSE(ψ), so that µ1(ψ) = β̂0 + β̂1ψ and µ2(ψ) = PSE(ψ)². The estimate β̂ = [β̂0 β̂1]ᵀ for β

is given by the following equation (see Appendix A)

β̂ = (ΔᵀΔ)⁻¹(Δᵀχ), (3.17)

where Δ is obtained by substituting each random variable yi in the design matrix D with its

observed value ψi. Expanding Eq. (3.17) we get

$$\hat{\beta}_0 = \bar{\chi} - \hat{\beta}_1 \bar{\psi} \quad \text{and} \quad \hat{\beta}_1 = \frac{\sum_{i=1}^{N} (\psi_i - \bar{\psi})(\chi_i - \bar{\chi})}{\sum_{i=1}^{N} (\psi_i - \bar{\psi})^2}.$$

Furthermore (see Appendix A for details) the prediction standard error PSE(ψ) is estimated

using the following equation

$$\mathrm{PSE}(\psi) = \hat{\sigma}_X \left[ \frac{1}{N} + \frac{(\psi - \bar{\psi})^2}{\sum_{j=1}^{N} (\psi_j - \bar{\psi})^2} + 1 \right]^{0.5} \tag{3.18}$$

where σ̂X² is calculated using the following unbiased variance estimator

$$\hat{\sigma}_X^2 = \frac{1}{N-2} \sum_{i=1}^{N} (\chi_i - \hat{\chi}_i)^2. \tag{3.19}$$
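
A minimal Python sketch of the order 1 estimators is given next, assuming the closed-form least squares expressions and the prediction standard error stated above. The function name `linear_law` and the four data values are hypothetical and serve only to make the arithmetic concrete.

```python
import numpy as np

def linear_law(chi, psi):
    """Return callables mu1(psi0), mu2(psi0) for the order 1 model (Eqs. 3.15-3.19)."""
    chi = np.asarray(chi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    N = chi.size
    psi_bar, chi_bar = psi.mean(), chi.mean()
    sxx = np.sum((psi - psi_bar) ** 2)
    b1 = np.sum((psi - psi_bar) * (chi - chi_bar)) / sxx   # slope estimate
    b0 = chi_bar - b1 * psi_bar                            # intercept estimate
    resid = chi - (b0 + b1 * psi)                          # chi_i minus fitted value
    sigma2 = np.sum(resid ** 2) / (N - 2)                  # Eq. (3.19)

    def mu1(psi0):
        return b0 + b1 * psi0

    def mu2(psi0):
        # Square of the prediction standard error, Eq. (3.18)
        return sigma2 * (1.0 / N + (psi0 - psi_bar) ** 2 / sxx + 1.0)

    return mu1, mu2, (b0, b1)

# Hypothetical collocated measurements.
psi_obs = np.array([5.0, 6.0, 7.0, 8.0])
chi_obs = np.array([1.0, 2.0, 2.0, 4.0])
mu1, mu2, (b0, b1) = linear_law(chi_obs, psi_obs)
```

For these four points the slope is 0.9 and the intercept is -3.6, and the conditional variance µ2 is smallest at ψ equal to the average of the measured ψ values, as Eq. (3.18) implies.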

3.2.3.2.2. Parametric polynomial of order 2

In many instances the empirical law between x and y may be found to follow a quadratic

curve. In these cases we can easily extend the parametric approach presented above to

consider a polynomial model of order 2, i.e.

xi = β0 + β1yi + β2yi² + εi (3.20)

where β2 is an additional coefficient characterizing the curvature of the empirical law. Eq.

(3.20) can be recast into Eq. (3.16), x = Dβ + ε , by defining a new design matrix D with an

additional column and a new vector β as follows

$$D = \begin{bmatrix} 1 & y_1 & y_1^2 \\ 1 & y_2 & y_2^2 \\ \vdots & \vdots & \vdots \\ 1 & y_N & y_N^2 \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}. \tag{3.21}$$

The estimator for µ(ψ) is then given by µ1(ψ) = β̂0 + β̂1ψ + β̂2ψ² and µ2(ψ) = PSE(ψ)². The

estimator β̂ = [β̂0 β̂1 β̂2]ᵀ for β is given by Eq. 3.17, i.e. β̂ = (ΔᵀΔ)⁻¹(Δᵀχ), but with the

difference that Δ is obtained by substituting each random variable yi in the new design matrix

D (see Eq. 3.21) with its observed value ψi. In other words Δ now has one additional column

with elements ψi². Finally PSE(ψ) is obtained by the equation

$$\mathrm{PSE}(\psi) = \hat{\sigma}_X \sqrt{\delta^T (\Delta^T \Delta)^{-1} \delta + 1}, \tag{3.22}$$

where δ = [1 ψ ψ²]ᵀ.
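
The matrix form of the estimators lends itself to a short Python sketch that covers both polynomial orders. The function name `polynomial_law` is hypothetical, and the residual variance is computed here with N minus the number of coefficients as divisor, which reduces to the N-2 of Eq. (3.19) for order 1; using that divisor for order 2 is an assumption of this sketch.

```python
import numpy as np

def polynomial_law(chi, psi, order=2):
    """Matrix-form estimators of mu1 and mu2 (Eqs. 3.17 and 3.22)."""
    chi = np.asarray(chi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    N = chi.size
    # Observed design matrix Delta with columns 1, psi, psi^2, ...
    Delta = np.vander(psi, order + 1, increasing=True)
    DtD_inv = np.linalg.inv(Delta.T @ Delta)
    beta = DtD_inv @ (Delta.T @ chi)                      # Eq. (3.17)
    resid = chi - Delta @ beta
    sigma2 = resid @ resid / (N - (order + 1))            # assumed divisor, see lead-in

    def mu1(psi0):
        delta = psi0 ** np.arange(order + 1)              # [1, psi0, psi0^2, ...]
        return float(delta @ beta)

    def mu2(psi0):
        delta = psi0 ** np.arange(order + 1)
        return float(sigma2 * (delta @ DtD_inv @ delta + 1.0))   # PSE squared, Eq. (3.22)

    return mu1, mu2, beta

# Hypothetical data lying exactly on chi = 2 - psi + 0.5 psi^2.
psi_q = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
chi_q = 2.0 - psi_q + 0.5 * psi_q ** 2
mu1_q, mu2_q, beta_q = polynomial_law(chi_q, psi_q, order=2)
```

With `order=1` the same function reproduces the order 1 expressions, since δᵀ(ΔᵀΔ)⁻¹δ then equals 1/N plus the squared deviation of ψ from its average divided by the sum of squared deviations.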

By way of summary, in this section we reviewed some non-parametric and parametric

approaches to estimate the relationships µ(ψ)=(µ1(ψ), µ2(ψ)) characterizing the empirical law

between collocated x and y. The estimation of µ(ψ) was obtained on the basis of data at N

points where (X,Y) measurements were collocated. However the µ(ψ) relationships are valid

for the larger set of n mapping points, which also include ms points where only Y

measurements were collected. At each of these points, we construct a soft datum for the

primary variable X on the basis of the measured value ψ for the secondary variable using the

soft PDF fS(χ|ψ) = φ(χ ; µ(ψ)). Therefore the soft data for X consist in the soft PDF fs(χsoft)

characterizing the primary variable X at the ms soft data points where only the secondary

variable Y was measured. At the remaining mh data points, exact measurements of the

primary variable X are available, which constitute the hard data χhard. Hence the site-specific

knowledge base consists in the hard data χhard at mh points where at least the primary variable

was measured, and the soft PDF fs(χsoft) at the additional ms points where only the secondary

variable was measured. In the following section we review how the BME method processes

these hard and soft data.

3.2.4. BME processing of hard and soft data

The BME method has three main stages of knowledge processing, which are the (i)

structural, (ii) specificatory, and (iii) integration stages.

At the structural stage of the BME analysis, a prior PDF fG(χmap) characterizing the SRF

X(s) at the mapping points smap is constructed by maximizing expected information based on

the general knowledge base G available. When G only includes knowledge of the vector of

expected values mmap at the mapping points, and the matrix cmap of covariance between any

pairs of mapping points, then the prior PDF fG(χmap) is given by

fG(χmap) = φ(χmap ; mmap, cmap), (3.23)

where φ (χmap; mmap, cmap) is the multivariate Gaussian PDF with mean mmap and covariance

matrix cmap.

At the specificatory stage of the analysis, the site-specific knowledge is organized in hard

and soft data. The n mapping points include the estimation point and the data points, i.e.

smap=(sk , sdata). The data points consist in mh points shard where hard data χhard about the

primary variable X is directly collected, and ms points ssoft where only the secondary variable

Y is measured, so that smap=(sk , shard, ssoft) and χmap=(χk , χhard, χsoft). The soft PDF fS(χsoft) is

given by the product of the conditional PDFs fS(χ|ψ) = φ(χ ; µ(ψ)) of x given each of the ms

measured ψ values.

At the integration stage, a Bayesian conditionalization rule (Christakos 1990, 2000b;

Serre and Christakos, 1999) is used to update the prior PDF given the site-specific

knowledge available and yields the BME posterior PDF

fK (χk) = A-1 ∫ dχsoft fs(χsoft) fG(χk, χhard, χsoft) (3.24)

where A is a normalization coefficient. This posterior PDF provides a complete stochastic

description of the primary environmental variable of interest at any estimation point.

Specifically, the BME posterior PDF provides the flexibility to choose any estimator desired

(i.e. BME mode, BME mean, and BME estimate at various percentiles), as well as an

assessment of the estimation error by means of the BME posterior variance or the BME

confidence set (Serre and Christakos, 1999).
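
To make Eq. (3.24) concrete, the following Python sketch evaluates the posterior PDF by brute-force numerical integration for the smallest non-trivial case: one estimation point, one hard datum and one soft datum with a Gaussian soft PDF. All names, grids and numerical values are hypothetical, and the grid summation is only an illustrative device; it is not the numerical implementation used in this work.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Univariate Gaussian PDF."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mvn_pdf(points, mean, cov):
    """Multivariate Gaussian PDF evaluated along the last axis of `points`."""
    d = points - mean
    quad = np.einsum("...i,ij,...j->...", d, np.linalg.inv(cov), d)
    norm = np.sqrt((2.0 * np.pi) ** len(mean) * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def bme_posterior(chi_k, chi_s, chi_hard, soft_mean, soft_var, m_map, c_map):
    """Posterior PDF f_K on the grid chi_k (Eq. 3.24), one hard and one soft datum.

    m_map and c_map are ordered as (estimation point, hard point, soft point).
    chi_k and chi_s are uniform grids for the estimation and soft values.
    """
    Kg, Sg = np.meshgrid(chi_k, chi_s, indexing="ij")
    points = np.stack([Kg, np.full_like(Kg, chi_hard), Sg], axis=-1)
    integrand = gauss_pdf(Sg, soft_mean, soft_var) * mvn_pdf(points, m_map, c_map)
    unnormalized = integrand.sum(axis=1) * (chi_s[1] - chi_s[0])   # integral over chi_soft
    A = unnormalized.sum() * (chi_k[1] - chi_k[0])                 # normalization coefficient
    return unnormalized / A

# Hypothetical prior moments (Eq. 3.23) and data, chosen only for illustration.
m_map = np.zeros(3)
c_map = np.array([[1.0, 0.6, 0.5],
                  [0.6, 1.0, 0.3],
                  [0.5, 0.3, 1.0]])
chi_k = np.linspace(-6.0, 6.0, 601)
chi_s = np.linspace(-6.0, 6.0, 601)
f_K = bme_posterior(chi_k, chi_s, chi_hard=1.0, soft_mean=-0.5, soft_var=0.4,
                    m_map=m_map, c_map=c_map)

dk = chi_k[1] - chi_k[0]
bme_mean = np.sum(chi_k * f_K) * dk                      # BME mean estimate
bme_var = np.sum((chi_k - bme_mean) ** 2 * f_K) * dk     # BME posterior variance
```

In this all-Gaussian special case the result can be checked against a closed form, because a Gaussian soft PDF acts like a hard datum at its mean with an added error variance. For non-Gaussian soft PDFs no such shortcut exists and the integral must be evaluated numerically.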

3.2.5. Generating related synthetic fields with stochastic empirical relationships

We aim to generate realizations for the groundwater log-arsenic SRF logAs(s) and soil pH

SRF pH(s) with prescribed statistical properties reproducing those found in the field, and

with a quadratic empirical relationship E[logAs|pH] at collocated point s similar to those

documented in previous studies (e.g. Fig. 3.1).

Let’s consider three independent, homogeneous, normally distributed SRFs A(s), B(s),

and C(s). Realizations of such fields can easily be generated using geostatistical simulation

techniques (Christakos, 1992; Christakos et al., 2002) such that the realization of A(s), B(s),

and C(s) have user-defined means µA, µB, and µC, and variances σA2, σB2, and σC2, and with a

covariance range similar to that of soil pH and log-arsenic found in the field. We then

construct the fields for logAs(s) and pH(s) using the following equations

pH(s) = A(s) + B(s) (3.25)

logAs(s) = a1A(s) + a2A(s)2 + C(s), (3.26)

where a1 and a2, together with µA, µB, µC, σA2, σB2, and σC2, are the parameters of our

algorithm to generate logAs(s) and pH(s). Let’s now describe how to choose these parameters

in order to obtain realizations of logAs(s) and pH(s) with known statistical properties and a

quadratic empirical relationship E[logAs|pH] at collocated point s.

The means µlogAs and µpH, and variances σlogAs2 and σpH2 of the logAs(s) and pH(s) SRFs

are inputs to our algorithm (they are known for a specific geologic mapping situation, or can

be estimated from some monitoring dataset). Using these values, we calculate the parameters

µB, σB2, µC and σC2 with the following equations obtained from Eq. (3.25) and (3.26) (see

Appendix B for more details)

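$$\mu_B = \mu_{pH} - \mu_A \tag{3.27}$$

$$\sigma_B^2 = \sigma_{pH}^2 - \sigma_A^2 \tag{3.28}$$

$$\mu_C = \mu_{logAs} - a_1 \mu_A - a_2 \mu_A^2 - a_2 \sigma_A^2 \tag{3.29}$$

$$\sigma_C^2 = \sigma_{logAs}^2 - a_1^2 \sigma_A^2 - 2 a_2^2 \sigma_A^4 - 4 a_2^2 \mu_A^2 \sigma_A^2 - 4 a_1 a_2 \mu_A \sigma_A^2. \tag{3.30}$$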
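
The parameter calculation can be sketched in Python as follows. Here µB is taken as µpH minus µA, which is what taking the expectation of Eq. (3.25) gives. The function name `simulator_parameters` and all numerical values are hypothetical, and the check draws independent collocated values only, so spatial correlation is deliberately left out.

```python
import numpy as np

def simulator_parameters(mu_logAs, var_logAs, mu_pH, var_pH, a1, a2, mu_A, var_A):
    """Parameters of B(s) and C(s) from the target moments of pH(s) and logAs(s)."""
    mu_B = mu_pH - mu_A             # from E[pH] = mu_A + mu_B, Eq. (3.25)
    var_B = var_pH - var_A          # Eq. (3.28)
    mu_C = mu_logAs - a1 * mu_A - a2 * mu_A**2 - a2 * var_A                # Eq. (3.29)
    var_C = (var_logAs - a1**2 * var_A - 2 * a2**2 * var_A**2
             - 4 * a2**2 * mu_A**2 * var_A - 4 * a1 * a2 * mu_A * var_A)   # Eq. (3.30)
    if var_B <= 0 or var_C <= 0:
        raise ValueError("target moments are not attainable with these a1, a2, var_A")
    return mu_B, var_B, mu_C, var_C

# Hypothetical target moments and simulator settings.
targets = dict(mu_logAs=0.5, var_logAs=0.8, mu_pH=7.0, var_pH=1.0)
mu_A, var_A, a1, a2 = 0.0, 0.6, 0.3, 0.4
mu_B, var_B, mu_C, var_C = simulator_parameters(a1=a1, a2=a2, mu_A=mu_A,
                                                var_A=var_A, **targets)

# Collocated draws only: a check of the moments, not a spatial simulation.
rng = np.random.default_rng(1)
n = 200_000
A = rng.normal(mu_A, np.sqrt(var_A), n)
B = rng.normal(mu_B, np.sqrt(var_B), n)
C = rng.normal(mu_C, np.sqrt(var_C), n)
pH = A + B                          # Eq. (3.25)
logAs = a1 * A + a2 * A**2 + C      # Eq. (3.26)
```

The guard on the variances reflects that not every combination of a1, a2 and σA² is compatible with the target variances: the variance left for B(s) and C(s) must remain positive.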

We now have only four parameters remaining, i.e. a1, a2, µA, and σA2, which need to be

set according to the quadratic relationship desired for the empirical law E[logAs|pH].

Substituting A(s)=pH(s)-B(s) into Eq. (3.26), taking the expected value of logAs given pH at

collocated point s, and using properties of the normal distribution, we obtain after some

manipulations (see Appendix B for more details)

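$$E[\mathrm{logAs} \mid pH] = b_0 + b_1 (pH - \mu_{pH}) + b_2 (pH - \mu_{pH})^2 \tag{3.31}$$

where $b_0 = \mu_{logAs} - a_2 \sigma_A^4 / \sigma_{pH}^2$, $b_1 = a_1 \sigma_A^2 / \sigma_{pH}^2 + 2 a_2 \mu_A \sigma_A^2 / \sigma_{pH}^2$, and $b_2 = a_2 \sigma_A^4 / \sigma_{pH}^4$. As can be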

seen from Eq. (3.31), the empirical law is of quadratic form, which fulfills the objective we

had set for our simulation algorithm defined by Eqs. (3.25) and (3.26). We note that the

parameter µA does not have any effect on the empirical law (Eq. 3.31), so without loss of

generality we can use µA=0. The parameters a1 and a2 are coefficients that primarily define

the shape of the empirical law E[logAs|pH]. In addition the parameter σA2 not only defines

the shape of the empirical law, but also the amount of co-variability between logAs and pH.

In other words, an increase in σA2 leads to a larger cross-correlation between collocated logAs

and pH, and consequently a smaller variance of logAs given pH, σ[logAs|pH]2.
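
The coefficients of Eq. (3.31) can be checked numerically with a short Python sketch. The function name `empirical_law_coefficients` and the parameter values are hypothetical. The variance of C is set to an arbitrary value because it does not enter Eq. (3.31), and the draws are independent collocated values without spatial correlation.

```python
import numpy as np

def empirical_law_coefficients(mu_logAs, var_pH, a1, a2, mu_A, var_A):
    """Coefficients b0, b1, b2 of the quadratic empirical law, Eq. (3.31)."""
    ratio = var_A / var_pH
    b0 = mu_logAs - a2 * var_A**2 / var_pH
    b1 = a1 * ratio + 2.0 * a2 * mu_A * ratio
    b2 = a2 * ratio**2
    return b0, b1, b2

# Hypothetical settings.
mu_logAs, mu_pH, var_pH = 0.5, 7.0, 1.0
a1, a2, mu_A, var_A = 0.3, 0.4, 0.0, 0.6
b0, b1, b2 = empirical_law_coefficients(mu_logAs, var_pH, a1, a2, mu_A, var_A)

# Monte Carlo check with collocated draws from Eqs. (3.25) and (3.26).
rng = np.random.default_rng(2)
n = 200_000
A = rng.normal(mu_A, np.sqrt(var_A), n)
B = rng.normal(mu_pH - mu_A, np.sqrt(var_pH - var_A), n)
mu_C = mu_logAs - a1 * mu_A - a2 * mu_A**2 - a2 * var_A   # Eq. (3.29)
C = rng.normal(mu_C, np.sqrt(0.5), n)                     # arbitrary variance for C
pH = A + B
logAs = a1 * A + a2 * A**2 + C

# Least squares quadratic in (pH - mu_pH); polyfit returns the highest power first.
fit_b2, fit_b1, fit_b0 = np.polyfit(pH - mu_pH, logAs, 2)
```

Since b2 scales with the square of σA²/σpH², the curvature seen in the empirical law is always smaller in magnitude than the curvature a2 imposed on A(s) whenever B(s) has non-zero variance.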

Hence, by way of summary, we find that the simulator defined by Eqs. (3.25-3.30)

generates a useful class of spatially related synthetic random fields. This simulator allows us to

generate realizations of two SRFs logAs(s) and pH(s) with prescribed statistical properties

reproducing those observed in the field, and with a quadratic empirical relationship

E[logAs|pH] at collocated point s defined by the parameters a1, a2 and σA2. Increasing a2 for

a selected value of a1 and σA2 will allow exploring empirical laws with increasing curvature,

while increasing σA2 will allow exploring fields with increasing cross-correlations between

logAs(s) and pH(s). These synthetic fields can then be used in a cross validation analysis to

compare the mapping accuracy associated with the kriging, co-kriging, and the proposed

BME approach. In the next section we provide a step-by-step description of the kriging, co-

kriging and BME approaches, and in the following section we present the cross validation

procedure.

3.2.6. Step by step description of the simple kriging, co-kriging, and BME approaches

The three estimation methods considered here are the simple kriging method labeled as

method 1, the co-kriging method labeled as method 2, and the BME method labeled as

method 3. Each method uses a subset of the synthetic datasets as measured data. We define

logAs as a primary variable X, and pH as a secondary variable Y.

We first specify the general knowledge base available by modeling the statistical

moments up to second order (mean and covariance) of the SRFs X(s) and Y(s). We obtain

models for mean trend (i.e. mX(s) using Eq. (3.3), and similarly mY(s) for Y) and covariance

models (i.e. cX(s,s’) using Eq. (3.4) and similarly cY(s,s’) for Y) for each variable. We

additionally obtain the model for the cross-covariance cXY(s,s’) between X and Y (i.e. Eq.

3.11).

The site specific knowledge base consists in the data for X and Y. We denote as χhard the

column vector of exact measurements of X at mXh points sh(X)={s1(X), s2(X),…, smXh(X)} where

at least X was measured. We define as ψhard the column vector of exact measurements of Y at

mYh points sh(Y)={s1(Y), s2(Y),…, smYh(Y)} where only Y was measured. Each method then

selects a subset of knowledge bases available to proceed with the estimation step.

In simple kriging (i.e. method 1), the estimator χ̂k(1) of X at the estimation point sk(X) is a

linear combination of only χhard given by

$$\hat{\chi}_k^{(1)} = m_k^{(X)} + \lambda^{(1)T} (\chi_{hard} - m_{hard}^{(X)}), \tag{3.32}$$

where λ(1) is a column vector of simple kriging weights, mk(X) = mX(sk(X)) is the mean trend of

X at the estimation point sk(X), and mhard(X) = mX(sh(X)) is a column vector of mean trend values

for X at its hard data points sh(X). The vector of simple kriging weights is given by (Olea,

1999)

$$\lambda^{(1)T} = c_{k,Xh}\, c_{Xh,Xh}^{-1}, \tag{3.33}$$

where ck,Xh = cX(sk(X), sh(X)) is a row vector of covariance for X between the estimation point

sk(X) and hard data points sh(X), and cXh,Xh = cX(sh(X), sh(X)) is a mXh by mXh matrix of covariances

for X between the hard data points sh(X).
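
Eqs. (3.32) and (3.33) translate into a few lines of Python. The sketch below assumes an isotropic exponential covariance model, which is a choice made here for illustration and is not prescribed by the text. The function names, data locations and values are hypothetical, and the returned variance is the standard simple kriging variance, which is not written out above.

```python
import numpy as np

def exp_cov(p, q, sill, corr_range):
    """Assumed isotropic exponential covariance between point sets p and q."""
    h = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return sill * np.exp(-3.0 * h / corr_range)

def simple_kriging(s_k, s_hard, chi_hard, mean_k, mean_hard, cov):
    """Simple kriging estimate at s_k from hard data (Eqs. 3.32-3.33)."""
    c_k = cov(s_k[None, :], s_hard)[0]       # row vector c_{k,Xh}
    C = cov(s_hard, s_hard)                  # matrix c_{Xh,Xh}
    lam = np.linalg.solve(C, c_k)            # weights; C is symmetric (Eq. 3.33)
    est = mean_k + lam @ (chi_hard - mean_hard)               # Eq. (3.32)
    var = cov(s_k[None, :], s_k[None, :])[0, 0] - lam @ c_k   # kriging variance
    return est, var, lam

# Hypothetical hard data with a constant mean trend of 1.0.
s_hard = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
chi_hard = np.array([1.2, 0.4, 2.0])
cov = lambda p, q: exp_cov(p, q, sill=1.0, corr_range=10.0)
est, var, lam = simple_kriging(np.array([1.0, 1.0]), s_hard, chi_hard,
                               1.0, np.full(3, 1.0), cov)
```

Solving the linear system rather than forming the inverse of the covariance matrix is a numerical convenience only; the weights are those of Eq. (3.33).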

The traditional extension of simple kriging to account for secondary spatial field data is

co-kriging (i.e. method 2). The co-kriging estimator χ̂k(2) is also a linear combination of

data including χhard and ψhard, i.e.

$$\hat{\chi}_k^{(2)} = m_k^{(X)} + \lambda^{(2)T} (Z_{data} - m_{data}), \tag{3.34}$$

where $Z_{data} = \begin{bmatrix} \chi_{hard} \\ \psi_{hard} \end{bmatrix}$, $m_{data} = \begin{bmatrix} m_{hard}^{(X)} \\ m_{hard}^{(Y)} \end{bmatrix}$, and mhard(Y) = mY(sh(Y)) is a column vector of mean

trend values for Y at its hard data points sh(Y). The vector of co-kriging weights is given by

$$\lambda^{(2)T} = c_{k,Zd}\, c_{Zd,Zd}^{-1}, \tag{3.35}$$

where ck,Zd = [cX(sk(X), sh(X)) cXY(sk(X), sh(Y))] is a row vector of covariance/cross-covariance

between the estimation point and data points, cXY(sk(X), sh(Y)) is a row vector of cross covariance

between the estimation point sk(X) and Y hard data points sh(Y), and

$$c_{Zd,Zd} = \begin{bmatrix} c_{Xh,Xh} & c_{Xh,Yh} \\ c_{Yh,Xh} & c_{Yh,Yh} \end{bmatrix}, \tag{3.36}$$

where cXh,Yh = cXY(sh(X), sh(Y)) is a mXh by mYh matrix of cross covariance between hard data

points sh(X) and sh(Y), and cYh,Yh = cY(sh(Y), sh(Y)) is a mYh by mYh matrix of covariance for Y

between its hard data points sh(Y).
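
The co-kriging system of Eqs. (3.34) to (3.36) can be sketched in Python under an assumed intrinsic (proportional) covariance model, in which the covariances of X and Y and their cross-covariance all share one exponential correlation function. This model, the function names and the numerical values are assumptions made for illustration; the block cYh,Xh is taken as the transpose of cXh,Yh.

```python
import numpy as np

def corr(p, q, corr_range=10.0):
    """Assumed exponential correlation between point sets p and q."""
    h = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return np.exp(-3.0 * h / corr_range)

def cokriging(s_k, s_xh, chi_hard, s_yh, psi_hard, m_k, m_xh, m_yh,
              sd_x, sd_y, rho_xy):
    """Simple co-kriging estimate of X at s_k (Eqs. 3.34-3.36)."""
    c_x = lambda p, q: sd_x**2 * corr(p, q)
    c_y = lambda p, q: sd_y**2 * corr(p, q)
    c_xy = lambda p, q: rho_xy * sd_x * sd_y * corr(p, q)
    s_k = s_k[None, :]
    # Row vector of covariances and cross-covariances, c_{k,Zd}
    c_kZ = np.concatenate([c_x(s_k, s_xh)[0], c_xy(s_k, s_yh)[0]])
    c_XY = c_xy(s_xh, s_yh)
    C = np.block([[c_x(s_xh, s_xh), c_XY],
                  [c_XY.T, c_y(s_yh, s_yh)]])                   # Eq. (3.36)
    lam = np.linalg.solve(C, c_kZ)                              # Eq. (3.35)
    z = np.concatenate([chi_hard - m_xh, psi_hard - m_yh])      # Z_data - m_data
    return m_k + lam @ z, lam                                   # Eq. (3.34)

# Hypothetical data: three X points and two Y-only points.
s_xh = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
chi_hard = np.array([1.2, 0.4, 2.0])
s_yh = np.array([[1.0, 2.0], [3.0, 3.0]])
psi_hard = np.array([7.4, 6.1])
s_k = np.array([1.0, 1.0])
est_ck, lam_ck = cokriging(s_k, s_xh, chi_hard, s_yh, psi_hard,
                           1.0, np.full(3, 1.0), np.full(2, 7.0),
                           sd_x=1.0, sd_y=0.8, rho_xy=0.7)
```

When `rho_xy` is set to zero the weights on the secondary data vanish and the estimate reduces to simple kriging on the primary data alone, which is a useful sanity check on the block structure.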

Finally the BME mapping method (i.e. method 3) incorporates a set of probabilistic soft

data χsoft to account for the stochastic empirical relationship. As explained earlier, to generate

χsoft we start by modeling the stochastic empirical law using data at collocated measurement

points, and we then obtain the soft data at every location of sh(Y) where only ψhard is available.

Each soft datum is expressed by the conditional PDF fS(χsoft|ψhard) = φ(χsoft ; µ(ψhard)) (i.e.

Eqs. 3.8-3.10) of x given each of the measured ψ values. At the structural stage, BME

processes the mean vector mmap and the covariance matrix cmap to construct the multivariate

prior PDF (i.e. Eq. 3.23). The mean vector mmap represents the trend of the primary variable

at mapping points, and it is expressed as follow

mk ( X ) 
 (X ) 
m map = m hard  , (3.37)
msoft (Y ) 

where msoft(X) = mX(sh(Y)) is a set of mean values of X at soft data locations sh(Y). The

covariance matrix cmap at the mapping points is expanded as

$$ c_{map} = \begin{bmatrix} c_{Xh,Xh} & c_{Xh,Xs} & c_{Xh,k} \\ c_{Xs,Xh} & c_{Xs,Xs} & c_{Xs,k} \\ c_{k,Xh} & c_{k,Xs} & c_{k,k} \end{bmatrix}, \tag{3.38} $$

where ck,k = cX(sk(X), sk(X)) is a scalar representing the variance for X at the estimation points

sk(X), cXs,Xh = cX(sh(Y), sh(X)) is a mYh by mXh matrix of covariance for X between its soft data

points sh(Y) and hard data points sh(X), and ck,Xs = cX(sk(X), sh(Y)) is a row vector of covariance

for X between the estimation point sk(X) and soft data points sh(Y). Finally at the integration

stage BME updates the prior PDF (Eq. 3.23) by using a Bayesian conditionalization on χhard

and χsoft in order to obtain the BME posterior PDF fK (χk) at the estimation point sk(X) (Eq.

3.24).

3.2.7. Cross validation procedure

The result from the cross validation procedure offers a useful criterion to compare the

mapping accuracy between estimation methods. The synthetic random fields generated by

the simulator defined in Eqs. (3.25-3.30) provide mXh realizations Xi for the primary variable

at points sh(X), and mYh realizations Yi for the secondary variable at points sh(Y). These

simulated values are interpreted as the truth. The cross validation procedure removes one

true value Xi at a time, and re-estimates it using only data in its neighborhood to obtain the

cross-validated estimate Xi*. The mean square error (MSE) for the cross-validation estimates

is then defined as

$$ MSE = \frac{1}{m_{Xh}} \sum_{i=1}^{m_{Xh}} \left( X_i^* - X_i \right)^2. \tag{3.39} $$
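The leave-one-out procedure and the MSE of Eq. (3.39) can be sketched as follows. The estimator used here is a hypothetical stand-in (inverse-distance weighting on one-dimensional locations), chosen only to keep the example short; in the study the re-estimation is done with methods 1 to 3. The names `idw_estimate` and `loo_mse` are illustrative.

```python
def idw_estimate(s, pts, vals, radius):
    """Stand-in estimator: inverse-distance weighting of neighbours within radius."""
    near = [(abs(p - s), v) for p, v in zip(pts, vals) if abs(p - s) <= radius]
    if not near:
        # No neighbour available: fall back to the mean of the remaining data.
        return sum(vals) / len(vals)
    weights = [1.0 / (d + 1e-12) for d, _ in near]
    return sum(w * v for w, (_, v) in zip(weights, near)) / sum(weights)

def loo_mse(pts, vals, estimator, radius):
    """Remove one true value at a time, re-estimate it, and return the MSE."""
    sq_errors = []
    for i in range(len(pts)):
        rest_pts = pts[:i] + pts[i + 1:]
        rest_vals = vals[:i] + vals[i + 1:]
        estimate = estimator(pts[i], rest_pts, rest_vals, radius)
        sq_errors.append((estimate - vals[i]) ** 2)
    return sum(sq_errors) / len(sq_errors)

# Hypothetical data at three locations (lists, at least two points).
print(loo_mse([0.0, 1.0, 2.0], [1.0, 2.0, 3.0], idw_estimate, 1.5))
```

Any estimator with the same call signature can be passed to `loo_mse`, so the same loop could be used to compare several mapping methods on identical data.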

The percent MSE change $r_{MSE}^{1,2}$ between method 1 and method 2 is given by

$$ r_{MSE}^{1,2} = \frac{MSE_2 - MSE_1}{MSE_1} \times 100. \tag{3.40} $$

Similarly $r_{MSE}^{1,3}$ is given by the corresponding equation for method 1 and method 3. Since both methods 2 and 3 use data from the primary and secondary variables, we expect them to provide more accurate cross-validation estimates than method 1, which only uses data from the primary variable. Therefore we expect that both $r_{MSE}^{1,2}$ and $r_{MSE}^{1,3}$ will be negative values,

signifying a reduction in MSE.

To compare the efficiency between method 2 and 3 in using the secondary data, we

define the improvement in MSE reduction, $i_\Delta$, as

$$ i_\Delta = \frac{r_{MSE}^{1,3} - r_{MSE}^{1,2}}{r_{MSE}^{1,2}} \times 100. \tag{3.41} $$

i∆ measures the percent improvement in the reduction of MSE afforded by BME versus co-

kriging. A value of i∆ =10% would mean that the reduction in MSE from kriging to BME is

10% greater than the reduction in MSE from kriging to co-kriging, or in other words that

BME is 10% more efficient than co-kriging at integrating the secondary data.
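A short sketch of Eqs. (3.40) and (3.41) is given below. The function names and the MSE values in the example are hypothetical and are only meant to show how the two quantities relate.

```python
def percent_mse_change(mse_ref, mse_new):
    """Percent MSE change from a reference method to a new method (Eq. 3.40)."""
    return (mse_new - mse_ref) / mse_ref * 100.0

def improvement_in_mse_reduction(mse1, mse2, mse3):
    """Improvement in MSE reduction of method 3 over method 2 (Eq. 3.41)."""
    r12 = percent_mse_change(mse1, mse2)
    r13 = percent_mse_change(mse1, mse3)
    return (r13 - r12) / r12 * 100.0

# Hypothetical MSE values for methods 1, 2 and 3.
r12 = percent_mse_change(1.0, 0.9)   # about -10 %
r13 = percent_mse_change(1.0, 0.5)   # -50 %
print(r12, r13, improvement_in_mse_reduction(1.0, 0.9, 0.5))
```

Note that the ratio in Eq. (3.41) simplifies to (MSE3 - MSE2)/(MSE2 - MSE1) × 100, so the improvement is sensitive to small differences between MSE1 and MSE2 and is undefined when the two are equal.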

3.3. Results

3.3.1. Synthetic case study

As discussed in section 3.2.5, our simulator generates realizations of the fields logAs(s) and pH(s) such that their statistical properties correspond to those found at a site of interest (i.e.

New England), and such that collocated logAs and pH values are related by a quadratic

empirical law controlled by the parameters a1, a2 and σA2. The quadratic shape of the

synthetic empirical law we generate reproduces the logAs-pH empirical law documented in

previous studies (Sanchez et al., 2003; Ayotte et al., 2003). In this synthetic case study we

investigate the percent improvement i ∆ in the reduction of MSE afforded by BME versus

co-kriging for two scenarios labeled as case 1 and case 2. In case 1 we explore quadratic

empirical laws with increasing curvature by increasing a2 from 0 to 0.6 for fixed a1 and σA2

set to a1 = 1.7 and σA2 = 0.32. In case 2 we explore quadratic empirical laws with increasing

co-variability between logAs and pH by increasing σA2 from 0.08 to 0.35 for fixed a1 and a2

set to a1 = 0.7 and a2 = 0. In the following sections we first present in detail the results obtained for a single realization of case 1 with a2=0.5, and we then provide the results

for case 1 describing the effect of the curvature (i.e. non-linearity) of the empirical law,

followed by the results for case 2 describing the effect of the co-variability between logAs

and pH (i.e. correlation of the empirical law).

3.3.1.1. Realization of related spatial fields

Using our simulator with a1=1.7, σA2= 0.32, and a2=0.5, we successfully generate the

realization for the primary spatial random field logAs(s) shown in Figure 3.2(a) and the

realization for the related secondary variable pH(s) shown in Figure 3.2(b). These simulated

fields are usually interpreted as the truth, from which the measured data are randomly

selected. Each asterisk in Figure 3.2(a) indicates the selected measured data for logAs, and

similarly each triangle in Figure 3.2(b) denotes the pH measured data.

The scatter plot of all collocated simulated values for logAs and pH is shown in Figure

3.2(c). This scatter plot shows that our simulator is able to generate a stochastic empirical

law with a realistic non-linear shape in good agreement with that found in previous studies

(e.g. Sanchez et al., 2003). Furthermore the theoretical formula obtained in Eq. (3.31) for

the empirical law (shown as a plain line labeled as the “true E[logAs|pH]” in Figure 3.2c) is

in perfect agreement with the simulated collocated data.


Figure 3.2: Realization of (a) logAs(s) and (b) pH(s) obtained with our simulator using
a1=1.7, a2=0.5 and σA2= 0.32. Asterisks in (a) and triangles in (b) are the randomly selected
points used as data in the cross-validation procedure. The scatter plot of all collocated
simulated logAs-pH values is shown in (c), where the plain line is the theoretical
E[logAs|pH] obtained from Eq. (3.31).

Now that we have obtained a realization of the logAs and pH fields with realistic properties, we proceed with its analysis, which consists of the analysis of its covariance/cross-covariance,

followed by the analysis of the empirical relationship, and ends with the results of the cross-

validation procedure.

3.3.1.2. Covariance and cross-covariance between fields

We estimate the experimental values of the covariance for logAs(s) and pH(s) using Eq. (3.4),

and we then fit a covariance model to these experimental covariance values. Similarly we

estimate and model the cross-covariance between logAs(s) and pH(s) using Eq. (3.11). For

illustration purposes, we show in Figure 3.3 the covariance and cross covariance values

obtained for the realization of logAs(s) and pH(s) shown in Figure 3.2. The experimental

covariance values are shown with dots, while the covariance and cross covariance models are

shown with a plain line. These models are given by

c(r) = c0 exp (-3r/ar), (3.42)

where c(r) is an exponential function of the spatial lag r between a pair of points, c0 is the

covariance sill, and ar is the covariance range. Each covariance model has the same range (ar = 7 km) but a different sill value (i.e. c0=1.0441 for the covariance of logAs, c0=0.5514 for the cross-covariance between logAs and pH, and c0=0.3801 for the covariance of pH). These

covariance models represent a spatial autocorrelation of logAs(s) and pH(s) that is

comparable to that found in the real case study for New England presented later.
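The exponential model of Eq. (3.42) can be evaluated with the short sketch below, using the three sill values (1.0441, 0.5514 and 0.3801) and the common 7 km range quoted above. The function name is illustrative.

```python
import math

def exponential_covariance(r, sill, cov_range):
    """Exponential covariance model of Eq. (3.42): c(r) = c0 * exp(-3 r / ar)."""
    return sill * math.exp(-3.0 * r / cov_range)

COV_RANGE_KM = 7.0
SILLS = {"logAs": 1.0441, "logAs-pH": 0.5514, "pH": 0.3801}

for name, sill in SILLS.items():
    # Covariance at zero lag (the sill) and at a lag equal to the range.
    print(name, exponential_covariance(0.0, sill, COV_RANGE_KM),
          exponential_covariance(COV_RANGE_KM, sill, COV_RANGE_KM))
```

With the factor of 3 in the exponent, the covariance at a lag equal to the range has dropped to exp(-3), i.e. about 5% of the sill, which is why ar is read as the distance beyond which values are practically uncorrelated.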

Figure 3.3: Covariance and cross-covariance for the logAs(s) and pH(s) synthetic fields shown
in Figure 3.2. Experimental covariance values are shown with dots, while the corresponding
covariance models are shown with a plain line.

3.3.1.3. Conditional PDF fS(χ|ψ) describing the empirical relationship

The stochastic empirical law relating logAs and pH can be expressed by the conditional PDF

fS(χ|ψ) of the primary variable logAs given an error free measured value ψ for the collocated

secondary variable pH. This conditional PDF is modeled in terms of a Gaussian PDF with

mean µ1(ψ) and variance µ2(ψ), so that the vectorial relationship µ(ψ)=[µ1(ψ) µ2(ψ)]

summarizes the non-linear aspects of the stochastic empirical law. The three straightforward

approaches described earlier in the methods section to obtain µ(ψ) are (1) the non-parametric

prediction, (2) the parametric prediction with polynomial of order 1, and (3) the parametric

prediction with polynomial of order 2.
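To make the construction of a Gaussian soft datum concrete, the sketch below estimates µ1(ψ) and µ2(ψ) from collocated pairs and evaluates the resulting PDF. It uses a simple moving-window average in ψ as a stand-in for the non-parametric prediction; this window estimator, the function names and the data values are assumptions made for illustration and may differ from the estimator described in the methods section.

```python
import math

def conditional_moments(psi, pairs, half_width):
    """Moving-window estimates of mu1 = E[chi|psi] and mu2 = Var[chi|psi].

    pairs is a list of collocated (psi_i, chi_i) measurements.
    """
    window = [chi for (p, chi) in pairs if abs(p - psi) <= half_width]
    if len(window) < 2:
        raise ValueError("not enough collocated pairs near psi")
    mu1 = sum(window) / len(window)
    mu2 = sum((c - mu1) ** 2 for c in window) / (len(window) - 1)
    return mu1, mu2

def gaussian_soft_pdf(chi, mu1, mu2):
    """Gaussian conditional PDF with mean mu1 and variance mu2 (mu2 > 0)."""
    return math.exp(-0.5 * (chi - mu1) ** 2 / mu2) / math.sqrt(2.0 * math.pi * mu2)

# Hypothetical collocated (pH, logAs) pairs.
pairs = [(6.0, 1.0), (6.1, 2.0), (6.2, 3.0), (8.0, 10.0)]
mu1, mu2 = conditional_moments(6.1, pairs, 0.15)
print(mu1, mu2, gaussian_soft_pdf(mu1, mu1, mu2))
```

The parametric alternatives replace the window average by a first or second order polynomial fitted to the same pairs; the Gaussian PDF step is unchanged.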

For illustration purposes, we show in Figure 3.4 the vectorial µ(ψ) relationship and

corresponding conditional PDFs fS(χ|ψ) obtained for the realization of logAs(s) and pH(s)

shown in Figure 3.2. The first order moment µ1(ψ)= E[logAs|pH] estimated using the non-

parametric prediction, the parametric prediction with polynomial of order 1, and the

parametric prediction with polynomial of order 2 are shown with a dashed line in Figures

3.4(a), 3.4(b) and 3.4(c), respectively. The second order moment µ2(ψ) is shown in Figure

3.4(d) for all three approaches. Conditional PDFs fS(χ|ψ) corresponding to the various µ(ψ)

obtained are shown with thick lines in Figures 3.4(a), 3.4(b) and 3.4(c).

We find that the non-parametric approach and the parametric approach with polynomial

of order 2 are very successful in producing conditional PDFs fS(χ|ψ) that capture well the

stochastic empirical relationship between logAs and pH. The parametric approach with

polynomial of order 1 is not as successful because of the non-linearity of the empirical law,

however this approach would work well when the empirical law is known to be linear.

Hence Figure 3.4 shows that the three straightforward approaches presented in the methods

section to obtain the conditional PDFs fS(χ|ψ) from collocated measurements are easy to

implement in practice, and the best approach will depend on the data available, and on the

type of the empirical law under consideration.


Figure 3.4: The dots in (a), (b) and (c) are identical. They show the collocated measurements
for the realization of logAs(s) and pH(s) shown in Figure 3.2(c). The dashed lines show
µ1(ψ)= E[logAs|pH] obtained using (a) non-parametric prediction, (b) parametric prediction
with polynomial of order 1, and (c) parametric prediction with polynomial of order 2. The
corresponding µ2(ψ) are shown in (d) with different line types. The soft data obtained from
µ1(ψ) and µ2(ψ) are shown in thick lines in (a), (b) and (c).

3.3.1.4. Assessment of mapping accuracy

Mapping accuracy is first assessed visually by comparing in Figure 3.5 the simulated field of

logAs(s) representing the truth, with the estimated maps obtained using methods 1 to 3. To

facilitate the visual comparison, Figure 3.5(a) is an identical reproduction of the simulated

field logAs(s) shown in Figure 3.2(a). The stars denote the location of the hard data for logAs.

Using this hard data with method 1 (simple kriging) we obtain the estimated map shown in

Figure 3.5(b). As can be seen from that map, method 1 does not provide a good estimate of

the truth because the hard data available for logAs is sparse. For example this estimated map

completely fails to predict the presence of a highly contaminated area in the lower left corner

of the map because of the lack of hard data for logAs in that area.

Methods 2 and 3 on the other hand use hard data for the secondary variable pH (see

Figure 3.2b) in addition to the hard data for arsenic. The map obtained with method 2 (co-

kriging) is shown in Figure 3.5(c). We see a small improvement over method 1, but

important contaminated areas such as that in the lower left corner are still completely missing.

On the other hand the map obtained with method 3 (BME) is a drastic improvement over

method 1. For instance method 3 predicts accurately the presence of the highly contaminated

area in the lower left. Because they use additional information coming from the secondary variable, both methods 2 and 3 were expected to be more accurate than method 1, as is indeed the case. What stands out, however, is how much better BME is than co-kriging at processing the secondary data.


Figure 3.5: The simulated field of logAs(s) shown in map (a) is an identical reproduction of
Figure 3.2(a) that is interpreted as the truth. The stars are the locations of the logAs hard data
used by estimation method 1 (simple kriging) to produce map (b). Using this logAs hard data

as well as secondary pH data shown in Figure 3.2(b), we obtain map (c) with method 2 (co-
kriging), and map (d) with method 3 (BME).

We now turn to cross validation in order to quantitatively assess the superiority of BME

over co-kriging in processing the secondary data. As explained earlier in the methods section,

the MSE (Eq. 3.39) of cross validation estimates provides a measure of estimation error. We

find that MSE1 (the MSE for method 1) is equal to 1.09, while MSE2=1.01 for method 2, and

MSE3=0.49 for method 3. It is worth noting that even though methods 2 and 3 use the same pH data, the percent MSE reduction from method 1 to 3, $r_{MSE}^{1,3}$ = -55.1% (Eq. 3.40), is a drastic improvement over the MSE reduction from method 1 to 2, $r_{MSE}^{1,2}$ = -6.9%. In fact the

improvement in MSE reduction (Eq. 3.41) is i∆=703.5%, which is outstanding. In other

words, BME is 703.5% more efficient than co-kriging at integrating the secondary data.

This result demonstrates that BME is substantially more accurate than co-kriging for a

realization of logAs(s) and pH(s) (Figure 3.2) obtained with a1=1.7, σA2= 0.32, and a2=0.5. In

the following two sections, we explore whether this result holds when we change the

curvature of the empirical law, and when we change the correlation between primary and

secondary variables.

3.3.1.5. Cross validation results as a function of the curvature of the empirical law

Table 3.1 summarizes the cross validation results obtained in case 1, where we consider

realizations of logAs(s) and pH(s) generated by our simulator with a1=1.7 and σA2= 0.32, and

with a2 varying from 0 to 0.6 in increments of 0.1. The curve representing the empirical law

between collocated logAs and pH is shown in Figure 3.6(a) for each of these realizations. As

can be seen from that figure, the empirical law is linear (i.e. zero curvature) for a2=0, and the

curvature of the empirical law increases monotonically with a2, reaching maximum curvature

for a2=0.6.

Table 3.1: Cross validation results for case 1.


Columns: MSE1, MSE2 and MSE3 are the MSE for method 1, method 2, and method 3 with non-parametric regression, respectively; r(1,2) = (MSE2 − MSE1)/MSE1 × 100 is the MSE reduction from method 1 to method 2 (%); r(1,3) = (MSE3 − MSE1)/MSE1 × 100 is the MSE reduction from method 1 to method 3 (%); i∆ = (r(1,3) − r(1,2))/r(1,2) × 100 is the improvement in MSE reduction (%).

a2     MSE1    MSE2    MSE3    r(1,2) (%)    r(1,3) (%)    i∆ (%)
0      1.03    0.96    0.48    -7.4          -53.9         623.3
0.1    1.04    0.97    0.48    -7.3          -54.2         640.3
0.2    1.05    0.98    0.48    -7.2          -54.5         657.4
0.3    1.06    0.99    0.48    -7.1          -54.8         674.3
0.4    1.08    1.00    0.48    -7.0          -55.0         690.1
0.5    1.09    1.01    0.49    -6.9          -55.1         703.5
0.6    1.10    1.02    0.49    -6.8          -55.0         711.5

The results shown in Table 3.1 include the mean square errors MSE1, MSE2 and MSE3

for methods 1, 2 and 3, respectively, the percent MSE changes $r_{MSE}^{1,2}$ and $r_{MSE}^{1,3}$ between

methods 1 and 2, and methods 1 and 3, respectively, and the improvement in MSE reduction

i∆ from co-kriging to BME (where the BME soft data are obtained using the non-parametric

approach). We note that the realization discussed in detail in the preceding sections is listed

in Table 3.1 on the line corresponding to a2=0.5 with an improvement in MSE reduction

i∆=703.5%. We see clearly from Table 3.1 and Figure 3.6(b) that i∆ increases as the

curvature of the empirical law increases. This makes physical sense, since the BME

approach fully accounts for the non-linear aspects of the empirical law, whereas co-kriging

only accounts for the cross-correlation between logAs and pH. However it is very interesting

to note that even for linear empirical laws (i.e. a2=0), BME is still 623.3% more efficient than

co-kriging at integrating the secondary data. These results show that the BME approach presented in this work drastically outperforms co-kriging regardless of the curvature of the empirical law.

Furthermore we show in Figure 3.6(b) the improvement of MSE reduction i∆ obtained

when the soft BME data is generated using the non parametric, the polynomial of order 1,

and the polynomial of order 2 approaches. These curves confirm what one would expect on physical grounds. If one knows a priori that the empirical law is quadratic, then the second order polynomial approach gives the best results. When that is not the case, the non-parametric approach works well provided there are sufficient collocated data, while the first order polynomial approach works well when numerical cost is an issue.


Figure 3.6: (a) Curves representing the empirical law E[logAs|pH] between collocated logAs
and pH for the realizations of Table 3.1 (i.e. obtained with a2 varying from 0 to 0.6 in
increments of 0.1). (b) Curves showing the improvement in MSE reduction i∆ as a function of
a2, when the BME soft data is generated using the non parametric (plain line), the polynomial
of order 1 (dotted line), and the polynomial of order 2 (dashed line) approaches.

Previous arsenic studies have shown that co-kriging is especially disappointing when the

correlation between the primary and secondary variable is weak (e.g. Welhan and Merrick,

2003). Therefore we investigate next the cross validation results as a function of the

correlation between primary and secondary variables.

3.3.1.6. Cross validation results as a function of the correlation between logAs and pH

We now focus on case 2 of the synthetic case study, which explores how the cross validation

results change as a function of the correlation between collocated logAs and pH

measurements. Realizations of the logAs and pH fields are generated using a1 = 0.7 and a2 =

0 (i.e. linear empirical laws), and with σA2 varying from 0.08 to 0.35. The curve representing

the empirical law between collocated logAs and pH is shown in Figure 3.7(a) for each of

these realizations. As σA2 increases, the co-variability between logAs and pH increases,

leading to larger correlation between logAs and pH, and to linear empirical laws with steeper

slopes, as can be seen in Figure 3.7(a).

For each of the realizations depicted in Figure 3.7(a) we obtain cross validation estimates

using methods 1 to 3, and we show in Figure 3.7(b) the resulting improvement of MSE

reduction i∆ as a function of σA2, which is a measure of the correlation between logAs and pH.

As can be seen from that figure, BME is significantly more accurate than co-kriging whatever the correlation between logAs and pH (i.e. whatever the value of σA2). In this case we do

not see a difference whether the BME soft data is generated using the non-parametric, the

first order polynomial, or the second order polynomial approaches because the empirical law

is linear. What is particularly notable is that while BME is at least 600% more efficient than co-kriging at integrating secondary data when the correlation between logAs and pH is strong (i.e. for large σA2), the advantage of BME over co-kriging is even more pronounced when the correlation between logAs and pH is weak, with the improvement in MSE reduction i∆ reaching as much as 2000%. This indicates that BME may provide a good alternative to co-kriging for mapping arsenic when the correlation between the primary and secondary variables is weak.


Figure 3.7: Realizations of related logAs(s) and pH(s) fields were obtained using our
simulator with σA2 varying from 0.08 to 0.35. The linear empirical law E[logAs|pH] for each
of these realizations is shown in (a). The corresponding improvement in MSE reduction i∆ is
shown in (b) as a function of σA2.

By way of summary, this synthetic case study demonstrates that when mapping arsenic,

the BME approach presented in this work is drastically more efficient at incorporating

secondary data than co-kriging. BME is substantially more accurate than co-kriging when

the empirical law is linear and there is a strong correlation between arsenic and the secondary

variable. Furthermore the improvement in mapping accuracy is even more drastic when considering non-linear empirical laws, or secondary variables that are weakly correlated with

arsenic. It is therefore valuable to apply this proposed method in a real case study. In the

next section, we provide a comprehensive real case study considering the mapping of

groundwater arsenic in the New England region using soil pH as the secondary variable.

3.3.2. Application to the real case study: Mapping arsenic in New England using soil pH

3.3.2.1. New England datasets for arsenic and pH

Measurements of groundwater arsenic concentrations sampled at wells located in New

England were obtained from datasets provided by the U.S. Geological Survey (USGS) and

the New Hampshire Department of Environmental Services (NHDES). These samples

resulted in 495 measurements above detection limits treated as hard data, and 1156

measurements below detection limit treated as interval soft data ranging between 0 and the

detection limit. The locations of the arsenic hard data (i.e. above detect measurements) are

shown in Figure 3.8(a) with circles having a size proportional to the recorded value.

The data for the secondary variable consist of exact measurements (hard data) of soil pH obtained from a dataset provided by the USGS. The 915 soil pH samples available were collected in the states of New Hampshire (NH), Maine (ME), and Connecticut (CT), at the locations shown by the circles of Figure 3.8(b). The color of the circles corresponds to the

soil pH value recorded, according to the color scale shown next to the map.


Figure 3.8: (a) Map of the location of the groundwater arsenic samples from wells with
measurements above detection limit. The circles have a size proportional to the arsenic level
recorded. (b) Map of the location of soil pH-measurements shown with color indicating the
recorded value according to the color scale.

3.3.2.2. logAs-pH empirical law

The non-linear stochastic empirical law relating the primary and secondary variables is

modeled by processing the 139 collocated measurements of logAs and pH shown on the

scatter plot of Figure 3.9. Using the second order polynomial approach described in the

methods section, we obtain the µ1(ψ)=E[logAs|pH] function shown with a dot-dashed line in

Figure 3.9. The equation for E[logAs|pH] is given by

$$ E[\log As \mid pH] = 4.6538 - 0.8355\,pH + 0.0695\,pH^2. \tag{3.43} $$

The curve representing this equation has a shape that is consistent with the logAs-pH curve

obtained by Sanchez et al. (2003), shown with a dotted line in Figure 3.9. We also obtain

µ2(ψ) (not shown here), which together with µ1(ψ) provides the vectorial function

µ(ψ)=[µ1(ψ),µ2(ψ)] describing the non-linear aspects of the empirical law. From µ(ψ) we

generate the BME soft data consisting of the conditional PDF fS(χ|ψ) for logAs given a

measured value ψ of soil pH. Examples of these soft data are shown with a plain line in

Figure 3.9.
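As a quick check of the shape of Eq. (3.43), the sketch below evaluates the fitted mean for a few pH values and locates the vertex of the parabola. The function and constant names are illustrative; only the three coefficients come from the text.

```python
# Coefficients of Eq. (3.43): E[logAs|pH] = B0 + B1*pH + B2*pH^2
B0, B1, B2 = 4.6538, -0.8355, 0.0695

def mean_log_as_given_ph(ph):
    """Conditional mean mu1(psi) = E[logAs|pH] from Eq. (3.43)."""
    return B0 + B1 * ph + B2 * ph ** 2

# pH at which the fitted mean is smallest (vertex of the parabola).
PH_AT_MINIMUM = -B1 / (2.0 * B2)

for ph in (5.0, 6.0, 7.0, 8.0):
    print(ph, mean_log_as_given_ph(ph))
print("minimum near pH", PH_AT_MINIMUM)
```

Since the quadratic coefficient is positive, the fitted mean of logAs decreases with pH up to about pH 6 and increases beyond it. Whether the lower branch is meaningful depends on the range of pH values covered by the 139 collocated measurements.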

Figure 3.9: Scatter plot of 139 collocated logAs and pH measurements in New England. The
dot-dashed line shows µ1(ψ)=E[logAs|pH] obtained using second order polynomial
regression. The dotted line shows a curve of similar shape obtained by Sanchez et al. (2003).
The soft PDFs shown with plain line are the BME soft data generated using µ1(ψ) (and µ2(ψ)
not shown here).

3.3.2.3. Mean trend and spatial variability of groundwater arsenic in New England

We obtain a model for the mean trend (Eq. 3.3) of groundwater log-arsenic using a moving

window average of the arsenic data. This mean trend, shown in Figure 3.10(a), characterizes

the systematic trends and spatial structures of the logAs(s) SRF. By removing this mean trend from the log-arsenic data, we obtain a residual field that is homogeneous (i.e. with a

constant mean over space and a covariance that is only a function of the spatial lag between

pairs of points).

Using Eq. (3.4), we obtain experimental values of the covariance of the residual logAs(s)

field, and we then fit a covariance model to these experimental covariance values. The

experimental values of the covariance for the residual logAs(s) field and the corresponding

covariance model are shown in Figure 3.10(b). The equation of the covariance model is

given by

$$ c_{logAs}(r) = c_{01} \exp\left( \frac{-3r}{a_{r1}} \right) + c_{02} \exp\left( \frac{-3r}{a_{r2}} \right), \tag{3.44} $$

where c01= 0.57× σlogAs2, c02=0.43× σlogAs2, σlogAs2= 1.623 (log-µg/L)2, ar1 = 3.0 km, and ar2=

79.5 km. This covariance model characterizing the spatial autocorrelation of groundwater

arsenic is in good agreement with what has been reported in previous studies. For example,

the covariance range for the spatial distribution of groundwater arsenic in Bangladesh was

reported to vary from 2 to 57 km by Serre et al. (2003), and from 9.2 to 24.1 km by Yu et al.

(2003).
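The nested model of Eq. (3.44) can be evaluated as in the following sketch, which uses the parameter values quoted above. The names are illustrative.

```python
import math

SIGMA2_LOGAS = 1.623             # total variance, (log-ug/L)^2
C01 = 0.57 * SIGMA2_LOGAS        # sill of the short-range structure
C02 = 0.43 * SIGMA2_LOGAS        # sill of the long-range structure
AR1_KM, AR2_KM = 3.0, 79.5       # covariance ranges in km

def cov_log_as(r_km):
    """Nested exponential covariance of the residual logAs field (Eq. 3.44)."""
    return (C01 * math.exp(-3.0 * r_km / AR1_KM)
            + C02 * math.exp(-3.0 * r_km / AR2_KM))

for r in (0.0, 3.0, 20.0, 79.5):
    print(r, cov_log_as(r))
```

At zero lag the two sills add up to the total variance. Beyond a few kilometres the short-range structure has essentially vanished and the covariance is carried by the 79.5 km structure alone.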


Figure 3.10: (a) Mean trend of groundwater log-arsenic in New England, and (b) covariance
function of its residual.

3.3.2.4. BME estimation of groundwater arsenic across New England

The arsenic mean trend and covariance models, together with the hard and soft interval data

obtained from direct arsenic measurements, and the soft data obtained from pH

measurements using the conditional PDF fS(logAs|pH) (i.e. Figure 3.9), constitute an

informative knowledge base for groundwater arsenic in New England. Given this knowledge

base, the BME method (Eqs. 3.23-3.24) provides the most accurate estimator of groundwater

arsenic across the New England region, as well as a comprehensive assessment of the

associated mapping uncertainty.

The map of the BME estimate of groundwater arsenic obtained in this case study is

shown in Figure 3.11(a). This map is useful to identify areas where levels of groundwater

arsenic may be high. For example areas with groundwater arsenic in excess of 20 µg/L are

found in southern New Hampshire. Previous studies point to natural bedrock as being the main source of groundwater arsenic in this area (EPA report from USEPA region 1 office, 1981; Peters et al., 1999). This is an area where our dataset had the densest spatial coverage for both arsenic and soil pH (see Figure 3.8). Other parts of our study area had sparse monitoring of arsenic and pH, leading to greater mapping uncertainty associated with the BME estimates. The mapping uncertainty is quantified using the BME 68% confidence interval

(CI) (Serre and Christakos, 1999). The BME 68% CI is the smallest interval of arsenic

concentration that has a 68% chance of containing the true arsenic concentration. We show a

map of the length of the BME 68% CI in Figure 3.11(b). As can be seen from that map,

areas with denser monitoring data such as southern New Hampshire have a better mapping

accuracy (i.e. smaller length of the BME 68% CI) than areas with sparse monitoring data.
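The notion of the smallest interval containing 68% of the probability can be illustrated numerically. The sketch below assumes that the posterior PDF is unimodal and has been evaluated on a uniform grid; it is a generic highest-density calculation, not the implementation used to produce the maps, and the standard normal PDF in the example is a hypothetical posterior.

```python
import math

def smallest_interval(grid, pdf, mass=0.68):
    """Smallest interval of a uniform grid holding at least `mass` probability.

    Assumes a unimodal PDF, so the highest-density cells form one interval.
    """
    step = grid[1] - grid[0]
    # Visit grid cells from highest to lowest density.
    order = sorted(range(len(grid)), key=lambda i: pdf[i], reverse=True)
    total, chosen = 0.0, []
    for i in order:
        chosen.append(i)
        total += pdf[i] * step
        if total >= mass:
            break
    return grid[min(chosen)], grid[max(chosen)]

# Hypothetical posterior: a standard normal PDF on a uniform grid.
grid = [-5.0 + 0.01 * i for i in range(1001)]
pdf = [math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) for x in grid]
lower, upper = smallest_interval(grid, pdf)
print(lower, upper, upper - lower)
```

For a Gaussian posterior the interval is close to one standard deviation on either side of the mean. For a skewed posterior the interval is asymmetric about the mode, which is why its length, rather than a standard deviation, is mapped.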


Figure 3.11: (a) Map of the BME estimate of groundwater arsenic (µg/L) across New
England, and (b) map of the length of the 68% BME confidence interval (µg/L) expressing
the associated mapping uncertainty.

3.3.2.5. Non-attainment areas

State regulators and the drinking water industry are concerned with assessing where the

groundwater may have an arsenic concentration in excess of the 10 µg/L federal standard for

drinking water. Using the BME method, we are able to accurately assess the probability of

non-attainment of the standard at a given spatial location s given the arsenic and pH data

available in the neighborhood of s. This probability of non-attainment is given by

$$ \mathrm{Prob}[\text{Non-Attainment}] = \mathrm{Prob}[\mathrm{Arsenic} > 10\ \mu g/L], \tag{3.45} $$

where $\mathrm{Prob}[\mathrm{Arsenic} > 10\ \mu g/L] = \int_{\log(10\ \mu g/L)}^{\infty} d\chi_k \, f_K(\chi_k)$, and fK(χk) is the BME posterior PDF for groundwater log-arsenic obtained at s. We can then categorize areas according to their probability of non-attainment of the standard, as follows: Areas will be Highly Likely in Non-Attainment for Prob[Non-Attainment]>0.9, Likely in Non-Attainment for 0.5<Prob[Non-Attainment]<0.9, Near Non-Attainment for 0.1<Prob[Non-Attainment]<0.5, and Highly Likely in Attainment for Prob[Non-Attainment]<0.1.
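The classification can be sketched as follows for the special case of a Gaussian posterior PDF in log-arsenic. This is an assumption made only to obtain a closed-form integral; the BME posterior is in general non-Gaussian and would be integrated numerically. The sketch also assumes natural logarithms and assigns the boundary probabilities 0.9, 0.5 and 0.1, which the text leaves open, to the lower category. Function names and example values are hypothetical.

```python
import math

def prob_non_attainment(post_mean, post_std, standard_ug_per_l=10.0):
    """P[logAs > log(standard)] for a Gaussian posterior in log-arsenic (Eq. 3.45)."""
    z = (math.log(standard_ug_per_l) - post_mean) / (post_std * math.sqrt(2.0))
    return 0.5 * math.erfc(z)

def attainment_category(p):
    """Map a probability of non-attainment to the four categories of the text."""
    if p > 0.9:
        return "Highly Likely in Non-Attainment"
    if p > 0.5:
        return "Likely in Non-Attainment"
    if p > 0.1:
        return "Near Non-Attainment"
    return "Highly Likely in Attainment"

# Hypothetical posterior means (natural log of ug/L) with unit standard deviation.
for mean in (0.0, 2.0, 3.0, 5.0):
    p = prob_non_attainment(mean, 1.0)
    print(mean, round(p, 3), attainment_category(p))
```

Because the categories depend on the whole posterior PDF rather than on the estimate alone, two locations with the same estimated concentration can fall in different categories if their mapping uncertainties differ.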

Using this probabilistic criterion of non-attainment we obtain the map shown in Figure

3.12. This map is very useful as it provides the most accurate delineation of non-attainment

areas given the arsenic and pH data available, and it uses shades of grey to assess the

probability of non-attainment of the standard. The categories of non-attainment are, from

the darkest shade of grey to the lightest shade of grey: Highly Likely in Non-Attainment;

Likely in Non-Attainment; Near Non-Attainment; and Highly Likely in Attainment.

Figure 3.12: BME map of the probability that the groundwater arsenic concentration across
New England is in non-attainment of the drinking water standard of 10 µg/L for arsenic.

3.3.2.6. Cross validation results between simple kriging, co-kriging and BME

We explore cross validation errors for method 1 (i.e. simple kriging), method 2 (i.e. co-

kriging), and method 3 (i.e. BME) using the real data available. Cross validation errors for

each method are obtained for all the logAs hard data points, and the corresponding cross validation MSEs are MSE1 = 6.11, MSE2 = 6.57, and MSE3 = 2.30 for the

simple kriging, co-kriging, and proposed BME methods, respectively. Hence, quite

surprisingly, we find in this real-case study that even though co-kriging processes the

additional information provided by the rich dataset on the secondary pH variable, its mapping

accuracy is worse than that of simple kriging, which ignores entirely the pH data. This

illustrates the fact that the co-kriging method performs poorly in the absence of a strong

cross-correlation between arsenic and the secondary variable, as reported in Welhan and

Merrick’s (2003) study of groundwater arsenic using conductance as the secondary variable.

On the other hand BME drastically outperforms both simple kriging and co-kriging. Indeed the percent MSE change between simple kriging and BME is $r_{MSE}^{1,3}$ = -62.3%, while the percent MSE change between co-kriging and BME is $r_{MSE}^{2,3}$ = -65.0%, which represents a dramatic gain in

mapping accuracy. This result further supports that by explicitly modeling and processing

the empirical law between arsenic and its secondary variable, our proposed BME approach is

much more efficient than the classical co-kriging method of multivariate Geostatistics at

integrating the secondary data.

Finally we illustrate the impact of our work in the assessment of groundwater arsenic

across New England by showing in Figure 3.13 the maps obtained using simple kriging, co-

kriging, and our proposed BME method. We can see that the simple kriging map (Figure

3.13a) is similar to the co-kriging map (Figure 3.13b). In other words, co-kriging fails to

incorporate the secondary pH data in a way that would update the arsenic map obtained

without the pH data. On the other hand the BME map not only has a cross validation MSE 65% lower than that of the co-kriging map, it also results in a meaningful update of the arsenic map. In fact one can

see that the BME map results in an increase in estimated level of groundwater arsenic over a

substantial area of New England. This results in a substantial increase in the territory

assessed as being in Near Non Attainment of the 10µg/L drinking water standard, which has

important health risk, water treatment, and water resources management implications.


Figure 3.13: Maps of the concentration of arsenic in the groundwater of New England
obtained using (a) method 1 (simple kriging), (b) method 2 (co-kriging), and (c) method 3
(our proposed BME method).

3.4. Conclusions

The multivariate co-kriging method of classical Geostatistics has been a traditional approach

to improve the mapping accuracy of a primary variable of interest by integrating data about a

related secondary variable. However co-kriging only accounts for the cross correlation

coefficient summarizing the relationship between the primary and secondary variables. On

the other hand the BME approach developed in this work rigorously processes the multiple

nonlinear aspects of a realistic stochastic empirical law that fully describes the relationship

between the primary and secondary variables. Insight to validate the proposed BME method was

gained by means of a synthetic case study involving simulated maps of groundwater arsenic

and soil pH successfully generated by a simulator developed for this work. This simulator

allowed generating realizations of two SRFs logAs(s) and pH(s) with prescribed statistical

properties reproducing those observed in New England, and with a wide variety of empirical

laws reproducing those reported in previous studies. The synthetic case study was consistent

in demonstrating that when mapping arsenic, the proposed BME approach is drastically more

efficient at incorporating the secondary pH data than co-kriging. Once validated, the BME

method was applied to a real case study considering the mapping analysis of groundwater

arsenic in New England using soil pH as the secondary variable.

Our proposed approach is very effective at assimilating a stochastic empirical law by

generating appropriate probabilistic soft data for the primary variable on the basis of the

secondary data available. This procedure was implemented by modeling the conditional PDF

of logAs given a collocated measured value ψ for soil pH. This conditional PDF was set to a

known statistical distribution (e.g. Gaussian) parameterized on ψ, and we presented three

straightforward approaches to obtain the vectorial parameter function µ(ψ) on the basis of the

collocated logAs and pH measurements available. We were thus able to generate logAs soft

data given any pH measurements, and these soft data were rigorously processed by the BME

method together with error free measurements of logAs to finally produce arsenic exposure

maps as well as maps of the associated estimation error.

Several conclusions can be drawn from the synthetic and real case studies of groundwater

arsenic and soil pH in New England, as follows:

• The simulator developed in this work was successful at generating realizations of
primary and secondary SRFs that have prescribed statistical moments up to order two,

and with collocated values following a quadratic empirical law with curvature and co-

variability controlled by the parameters of the simulator. Using this simulator we

obtained realizations of groundwater arsenic and soil pH with mean and covariance

reproducing that found in New England, while having empirical laws with varying

curvature and cross correlation between collocated measurements for the primary and

secondary variables. This simulator is general and will be useful to investigate any

environmental contaminant and its associated secondary data.

• The synthetic case study confirmed that the implementation of the three straightforward

approaches described in this paper to obtain the conditional PDF was easy to use and

therefore provided successful ways to model the stochastic empirical law. The non-

parametric approach is the most general and it is useful when no prior information about

the empirical law is available, however it is the most demanding in terms of the amount

of collocated data necessary. When a relatively small number of collocated

measurements are available, then the parametric approach offers a useful alternative. In

that case if the empirical law is known to be linear, then the parametric approach using a

polynomial of first order can be used, otherwise the second order polynomial approach

offers a good tradeoff for quadratic empirical laws.

• The synthetic case study clearly demonstrates that the BME approach presented in this

work is drastically more efficient at incorporating secondary data than co-kriging. The

improvement in MSE reduction when mapping groundwater arsenic in New England

using soil pH secondary data indicates that BME is consistently at least 600% more

efficient than co-kriging at incorporating the secondary data. Furthermore the

improvement of BME over co-kriging is more drastic when the empirical law is nonlinear,
or when the cross-correlation between the primary and secondary variables is weak.

These results indicate that the proposed BME method should provide a useful alternative

to co-kriging in a wide variety of environmental mapping problems where co-kriging is

not efficient at integrating secondary data.

• The real case study presented provides the most accurate exposure map obtained to date

for groundwater arsenic in New England on the basis of the arsenic and soil pH data

available to the authors. Using this groundwater arsenic exposure map, we produce a

probabilistic map of non-attainment of the 10µg/L drinking water standard for arsenic

that is of key importance for state regulators, public health scientists, and the drinking

water industry. Future work will expand the current case study by incorporating new

groundwater arsenic data that are currently being collected in New England.

The numerical work and complexity of co-kriging and the proposed BME method are

similar. Co-kriging requires an extra step to model the cross-covariance between primary and

secondary variables. The computational cost and complexity of this step are saved in the

proposed BME method, and replaced with modeling the empirical law, which is shown in

this work to be relatively straightforward. However while both methods are easy to

implement, this work demonstrates that because the proposed BME approach formally

accounts for the empirical law between the primary and secondary variables, it leads to a

substantial improvement in mapping accuracy over the co-kriging method which only

accounts for the cross-correlation between primary and secondary variables. As a result, this

work suggests a shift of the multivariate mapping paradigm from co-kriging to the proposed

BME method when dealing with secondary variables related to the primary variable through

a variety of empirical laws.

IV. A geostatistical mapping framework integrating data obtained at
different temporal or spatial observation scales

4.1. Background

In many environmental and health mapping applications, traditional geostatistical
approaches have played a significant role in estimating a variable of interest at unsampled

locations (Warner et al., 2003; Lai, 2004; Krivoruchko and Gotway, 2004). Measured values

are usually sparsely located over space and time due to the difficulty and cost of obtaining

data. In some cases, the data for the same variable of interest may have been collected at

different temporal or spatial observation scales. For example the U.S. Environmental

Protection Agency (U.S. EPA) collects monitoring data for the criteria air pollutants both at

the hourly and daily observation scales. In this case, mixing hourly and daily data may

alleviate the problem of the sparsity of the data available; however this essentially disregards

the effect of scale on the estimation results. Another example using health outcome data is asthma

prevalence among children, which is sometimes measured at specific schools, as well as

being routinely reported at much larger observation scales such as that of counties. In this

example as well we see that the scale effect must be recognized since a variable displays

different physical properties depending on the spatial or temporal scale at which it is

observed.

The importance of accounting for the scale effect was already investigated in previous

works, such as that of Choi et al. (2003) where they demonstrated the usefulness of the

multiscale approach through a downscaling procedure. In this chapter we mathematically


derive the conditional PDF of a variable at the local scale given an observation of that

variable at a larger scale. Once this framework is developed, it is possible to generate soft

data for the local scale on the basis of data observed at different scales. This approach makes
it possible to mix data observed at a variety of scales efficiently, and increases the accuracy of
the map obtained for the scale of interest. Our framework is first formulated in the

one-dimensional temporal case, corresponding for example to the mixing of PM data

observed at different temporal scales (e.g. the mixing of hourly and daily PM readings). We

then extend the formulation of the framework to the two dimensional spatial case, and we

apply that formulation to a real case study. The real case study considered is the mapping

analysis of local scale asthma symptoms prevalence among children in North Carolina using

data obtained at the school spatial scale, and data obtained at the county spatial scale.

In the following sections, we first lay out the conceptual framework to model the

uncertainty associated with the observation scale, and we obtain mathematical formulations

for one-dimensional (temporal) and two-dimensional (spatial) observation scale uncertainty.

In each case (temporal and spatial), we validate the framework by comparing the observation

scale uncertainty predicted theoretically from the mathematical formulation, with that

inferred from multiple random realizations of a synthetic case study. Additionally we use the

synthetic case studies to quantify the gain in mapping accuracy achieved when the BME

mapping method rigorously accounts for observation scale uncertainty, compared to classical

approaches not accounting for the observation scale effect. Finally we apply the developed

framework to a real case study involving the estimation of asthma prevalence in North

Carolina. We find that in all cases the developed framework adequately describes the

uncertainty associated with the observation scale, which leads to realistic soft PDFs for the

observation scale uncertainty that are rigorously assimilated by the BME method, and results

in a substantial improvement in mapping accuracy over classical mapping methods that

ignore the scale effect.

4.2. Space/time observation scale: A general conceptual framework

4.2.1. A review of BME mapping method

We define X(p) as a space/time random field (S/TRF) (Christakos, 1992) representing an

environmental or health variable X of interest at space/time location p=[s, t], where s=[s1,…,

sd] is the spatial location in a d-dimensional spatial domain, and t is time. When restricting

our attention to a set of n mapping points pmap=[p1, p2,…, pn], the S/TRF reduces to a vector

of random variables xmap=[X(p1), X(p2),…, X(pn)]. The randomness of the S/TRF at the

mapping points pmap is defined by the set of possible realizations χmap =[χ1, χ2 , …, χn] of the

random vector xmap. The probability of a given realization χmap is calculated from the

multivariate probability density function (PDF) fX(.) of the S/TRF X(p) as follows

Prob[χ1 < x1 <χ1+dχ1,…, χn < xn < χn+dχn] = fX(χmap) dχ (4.1)

where Prob[.] is a probability operator. Hence the multivariate PDF fX(.) provides a complete

stochastic description of the S/TRF X(p) at the mapping points pmap.

At the structural stage of BME analysis we use a maximum entropy information

processing rule (Christakos, 2000) to obtain the multivariate PDF of X(p) on the basis of its

mean trend

mX(p) = E[X(p)], (4.2)

and covariance function

cX(p, p’) = E[ (X(p)-mX(p)) (X(p’)-mX(p’)) ], (4.3)

where E[.] is a stochastic expectation operator. Eqs. (4.2) and (4.3) constitute a general

knowledge base G from which the structural PDF obtained by maximizing entropy is

(Christakos, 2000)

fG(χmap) = φ(χmap; mmap, cmap), (4.4)

where φ (.) is the multivariate Gaussian PDF with mean vector mmap and covariance matrix

cmap calculated from Eqs. (4.2) and (4.3), respectively. The subscript G in Eq. (4.4)

emphasizes that the structural PDF fG was obtained on the basis of the general knowledge G

only. This structural PDF will serve as the prior PDF for the Bayesian updating performed at

the integration stage of the BME analysis.

At the specificatory stage of the BME analysis we assess and statistically describe the

data available at specific space/time locations. Hard data correspond to exact measured values

χhard obtained at points phard defined such that

Prob[ X(phard) =χhard] = 1. (4.5)

On the other hand, the soft data at points psoft correspond to measurements with an associated

uncertainty that can be characterized statistically by the so-called soft PDF fS(χsoft) defined as

(Christakos et al., 2001; Christakos and Serre, 2000a; Serre et al., 2005)

$$\mathrm{Prob}[X(p_{soft}) < u] = \int_{-\infty}^{u} d\chi_{soft}\, f_S(\chi_{soft}). \tag{4.6}$$

At the integration stage of the BME analysis, a Bayesian conditionalization information

processing rule is applied to update the prior PDF with the site-specific knowledge base S,

which yields the posterior PDF fK(χk) describing xk=X(pk) at any estimation point pk

(Christakos, 2000)

$$f_K(\chi_k) = A^{-1} \int d\chi_{soft}\, f_S(\chi_{soft})\, f_G(\chi_k, \chi_{hard}, \chi_{soft}), \tag{4.7}$$

where A is a normalization coefficient. The posterior PDF provides a full stochastic

assessment of xk, from which we can obtain an appropriate estimated value (such as the

expected value of the posterior PDF), as well as an assessment of the associated estimation

uncertainty (such as the variance of the posterior PDF).
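The integration stage of Eq. (4.7) can be illustrated with a minimal numerical sketch. The setup below is an assumption made for illustration only and is not taken from the case studies: a zero-mean, unit-variance Gaussian prior at one estimation point and one soft data point with correlation `rho`, no hard data, and a simple rectangle rule on a regular grid. The function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Univariate Gaussian density phi(x; mean, variance)."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def prior_pdf(chi_k, chi_s, rho):
    """Bivariate Gaussian prior f_G (zero means, unit variances, correlation rho)."""
    # Written as the marginal of chi_s times the conditional of chi_k given chi_s.
    return gaussian_pdf(chi_s, 0.0, 1.0) * gaussian_pdf(chi_k, rho * chi_s, 1.0 - rho ** 2)


def bme_posterior(soft_pdf, rho, grid):
    """Posterior f_K evaluated on `grid` by rectangle-rule integration over chi_soft."""
    h = grid[1] - grid[0]
    unnormalized = [
        sum(soft_pdf(chi_s) * prior_pdf(chi_k, chi_s, rho) for chi_s in grid) * h
        for chi_k in grid
    ]
    norm = sum(unnormalized) * h          # plays the role of the coefficient A
    return [value / norm for value in unnormalized]


def soft_pdf(chi):
    """Assumed Gaussian soft PDF f_S with mean 1.0 and variance 0.5."""
    return gaussian_pdf(chi, 1.0, 0.5)


grid = [-6.0 + 0.05 * i for i in range(241)]          # regular grid from -6 to 6
step = grid[1] - grid[0]
posterior = bme_posterior(soft_pdf, 0.8, grid)
post_mean = sum(x * f for x, f in zip(grid, posterior)) * step
post_var = sum((x - post_mean) ** 2 * f for x, f in zip(grid, posterior)) * step
```

Because the soft PDF is Gaussian in this sketch, the posterior moments can be checked against the closed-form Gaussian result (mean 0.8 × 2/3 and variance 0.36 + 0.64/3); for a non-Gaussian soft PDF only the numerical integration would be available.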

In the following sections we describe a framework to rigorously account for the data

uncertainty from the different observation scales in time or two-dimensional space. Thanks to

this developed framework we can model this type of data uncertainty in terms of probabilistic

soft data which can be systematically processed in the BME mapping method.

4.2.2. Conceptual framework for the uncertainty associated with the observation scale

Let X(p) be a space/time random field (S/TRF) representing an environmental or health

variable of interest. In general we say that X(p) represents the variable of interest at the local

scale in order to differentiate it from its observed value averaged over some space/time

domain V(p). The average of the S/TRF X(p) over the space/time domain V(p) is defined as

the S/TRF Z(p) given by the equation

$$Z(p) = \frac{1}{\|V(p)\|} \int_{V(p)} du\, X(u). \tag{4.8}$$

Example 4.1: X(p) is the instantaneous particulate matter (PM) concentration at p=(s,t), while Z(p)
is its daily average. Then V(p) is the time interval of duration T=24 hours centered at p=(s,t),
i.e. V(p)={s, u} such that u ∈ [t-T/2, t+T/2], and $Z(p) = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, X(s,u)$.

Example 4.2: X(p) is the risk (i.e. probability) that a child at p=(s,t) has experienced asthma

symptoms in its lifetime, while Z(p) is the asthma symptoms prevalence observed among the

children of a specific county. Then V(p) is the surface area of the county centered at p=(s,t),

i.e. V(p)={u, t} such that u=[u1, u2] ∈ As, where As is the geographical extent of the county
centered at s, and $Z(p) = \frac{1}{\|A_s\|}\int_{u \in A_s} du\, X(u,t)$.

In order to analyze the relationship between the local scale S/TRF X(p) and the V–scale

S/TRF Z(p), we define the random field Y(p’,p) as

Y(p’,p) = X(p’)-Z(p). (4.9)

Eq. (4.9) can also be written as X(p’)=Z(p)+Y(p’,p), indicating that when assessing X(p’),

Y(p’,p) acts as an additive error term to the value Z(p) observed at scale V. It follows that the

conditional PDF of X(p’) given an observed value ζ for Z(p) is

fS(χs| ζ) = fY (χs-ζ ), (4.10)

where fY is the PDF for Y(p’,p).

Let’s now consider the class of S/TRFs X(p) that are normally distributed. Then due to

the properties of the multivariate Gaussian distribution, Z(p) is normally distributed (since

according to Eq. (4.8) it can be written as an infinite sum of normally distributed variables),

and consequently Y(p’,p) is also normally distributed (since according to Eq. (4.9) it can be

written as the sum of two normally distributed variables). It follows that under the

assumption that X(p) is normally distributed, then the PDF for Y(p’,p) is given by

$f_Y(\psi)=\phi(\psi; m_Y, \sigma_Y^2)$, where φ(.) is the Gaussian distribution completely defined by its mean
$m_Y = E[Y(p',p)]$ and variance $\sigma_Y^2$. Inserting $f_Y(\psi)=\phi(\psi; E[Y(p',p)], \sigma_Y^2)$ in Eq. (4.10), we obtain
after a change of variable

$$f_S(\chi_s|\zeta) = \phi(\chi_s; E[Y(p',p)] + \zeta, \sigma_Y^2). \tag{4.11}$$

Eq. (4.11) provides a probabilistic soft datum for the local scale X at point p’ given a V-

scale observed value at point p. This soft datum is rigorously processed by the BME method,

which makes it possible to accurately account for observations of X(p) at any space/time scale V. The

problem then becomes that of obtaining E[Y(p’,p)] and σY2 for different space/time scales V

of interest. In the following sections, we first consider the one-dimensional temporal case

where X is only a function of time, i.e. X(t) is a temporal random field, and Z(t) is the average

of X(t) over a time period T (e.g. the hourly or daily average). We then extend the work to

the two-dimensional spatial case where X is only a function of space, i.e. X(s) is a spatial

random field, and Z(s) is the average of X(s) over a spatial domain (e.g. Z(s) is the average of

X(s) over a county).
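As a minimal sketch of how Eq. (4.11) turns a V-scale observation into a Gaussian soft datum, the snippet below simply shifts the mean of Y by the observed value ζ and keeps the variance of Y. The function names and the numerical values are hypothetical and chosen only for illustration; obtaining E[Y] and σY2 for a given observation scale is the subject of the sections that follow.

```python
import math


def soft_datum(zeta, mean_y, var_y):
    """Mean and variance of the Gaussian soft PDF f_S(chi | zeta) of Eq. (4.11)."""
    return zeta + mean_y, var_y


def gaussian_pdf(x, mean, variance):
    """Univariate Gaussian density phi(x; mean, variance)."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


# Hypothetical inputs: observed V-scale value 12.0, E[Y] = 0.5, sigma_Y^2 = 4.0.
mean_s, var_s = soft_datum(zeta=12.0, mean_y=0.5, var_y=4.0)

# Eq. (4.10): the soft PDF at chi equals the PDF of Y evaluated at chi - zeta.
f_s_at_13 = gaussian_pdf(13.0, mean_s, var_s)
f_y_at_1 = gaussian_pdf(13.0 - 12.0, 0.5, 4.0)
```

The pair `(mean_s, var_s)` is all that a Gaussian soft datum needs to carry into the BME integration stage.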

The linear kriging method of classical Geostatistics simply combines observed values of

X(p’) and Z(p) to estimate X(p’) at unsampled locations without special differentiation of the

scale effects. By contrast our proposed BME mapping method uses Eq. (4.11) to generate

soft data for X(p’) from the observations obtained at various space/time scales. We
compare the mapping accuracy of the BME and classical methods across a wide
variety of case studies.

4.3. Temporal observation scale: Mathematical formulation and synthetic case study

4.3.1 Mathematical formulation

4.3.1.1. Non-stationary temporal random field

As the most general case of a temporal random field (TRF), we consider the non-stationary

TRF X(t) with mean mX(t)=E[X(t)] at time t, and with covariance cX(t,t’)= E[(X(t)-

mX(t))(X(t’)- mX(t’))] between time t and t’. In general, a non-stationary TRF does not have a

constant mean over time, and its covariance cannot be expressed solely as a function of the

temporal lag τ=|t-t’|.

In the case of TRFs, the averaging domain V becomes the time interval V(t) = [t-T/2 ,

t+T/2] of duration T centered at time t, and the V-scale observation of X is

$Z(t) = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, X(u)$. Then Eq. (4.9) is written as

Y(t’,t) = X(t’)- Z(t) (4.12)

where t indicates the mid-point of the time interval V(t), and t’ denotes any possible time

within V(t).

We then derive the expected value of Y(t’,t) to be (see Appendix C for details)

$$E[Y(t',t)] = m_X(t') - \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, m_X(u), \tag{4.13}$$

and its variance (see Appendix C for details)

$$\sigma_Y^2(t',t) = \sigma_X^2 + \{m_X(t')\}^2 - \frac{2}{T}\int_{t-T/2}^{t+T/2} du\, \{c_X(t',u) + m_X(t')\,m_X(u)\} + \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, \{c_X(u,u') + m_X(u)\,m_X(u')\} - \left\{m_X(t') - \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, m_X(u)\right\}^2. \tag{4.14}$$

Eqs. (4.13) and (4.14) have been obtained without making assumptions about the

stationarity of X(t), and they therefore apply to a wide variety of non-stationary TRFs. When

the averaging time scale T is small relative to the fluctuations of the mean trend, we can

further simplify Eqs. (4.13) and (4.14) by linearizing mX(t), i.e. we use the approximation

mX(t)=m0+m1t for t ∈ T. In that case the expected value of Y(t’,t) reduces to (see Appendix C

for details)

E[Y(t’,t)] = m1(t’-t), (4.15)

and its variance is given by (see Appendix C for details)

$$\sigma_Y^2(t',t) = \sigma_X^2 - \frac{2}{T}\int_{t-T/2}^{t+T/2} du\, c_X(t',u) + \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, c_X(u,u'). \tag{4.16}$$

Eq. (4.14) (or 4.16 for linearized mean trend) is of substantial importance for environmental and health

Geostatistics. Indeed σY2(t’,t) quantifies the data uncertainty as a function of the scale at

which the variable of interest is observed. This equation can numerically be calculated for

any non-stationary TRF whatever its mean trend or covariance function may be.
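The numerical evaluation mentioned above can be sketched as follows for Eq. (4.16). This is only an illustrative implementation under stated assumptions: a midpoint quadrature rule with `n` nodes (our choice, not prescribed by the text), a user-supplied covariance function `cov(t1, t2)`, and σX2 taken as `cov(t_prime, t_prime)`. All names are hypothetical.

```python
import math


def soft_variance_numeric(cov, t_prime, t, T, n=200):
    """Midpoint-rule evaluation of Eq. (4.16) for a covariance function cov(t1, t2)."""
    h = T / n
    nodes = [t - T / 2 + (k + 0.5) * h for k in range(n)]
    var_x = cov(t_prime, t_prime)                          # sigma_X^2 at time t'
    cross = sum(cov(t_prime, u) for u in nodes) * h        # integral of c_X(t', u) du
    double = sum(cov(u, v) for u in nodes for v in nodes) * h * h
    return var_x - 2.0 / T * cross + double / T ** 2


def exponential_cov(t1, t2, var_x=1.0, a_t=1.0):
    """Exponential covariance with unit variance and unit range (illustrative values)."""
    return var_x * math.exp(-3.0 * abs(t1 - t2) / a_t)


# Soft data variance at the mid-point of an averaging interval of length T = a_t.
value = soft_variance_numeric(exponential_cov, t_prime=0.0, t=0.0, T=1.0)
```

Any other covariance function of two times, stationary or not, can be passed in place of `exponential_cov`; the quadrature cost grows with the square of `n` because of the double integral.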

We turn now to the case of stationary TRFs in order to further simplify this equation, and

gain more physical intuition about it.

4.3.1.2. Stationary temporal random field

Stationary TRFs have a constant mean trend, i.e. mX(t)=m0, and a covariance between time t

and t’ that can be expressed in terms of the temporal lag τ=|t-t’|, i.e. cX(t,t’)= cX(τ=|t-t’|).

In order to provide more general results, we first consider the case of a linearized non-

stationary mean trend with stationary covariance cX(τ). By substituting cX(t,t’) with cX(|t-t’|)

in Eq. (4.16), we obtain (see Appendix C for more details),

$$\sigma_Y^2(|t'-t|) = \sigma_X^2 - \frac{2}{T}\left[\int_{-T/2}^{t'-t} du\, c_X(t'-u-t) + \int_{t'-t}^{T/2} du\, c_X(-t'+u+t)\right] + \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, c_X(|u-u'|). \tag{4.17}$$

We now consider the case of stationary TRFs with constant mean trend, which are

obtained by setting m1 to zero. We note that because m1 does not appear in Eq. (4.17), then

this equation remains unchanged for stationary TRFs. This is an important finding, stating

that Eq. (4.17) is valid for any TRF with stationary covariance, as long as its mean trend can

be linearized in the time interval T.

While Eq. (4.17) can be numerically calculated for any stationary covariance model, we

will now consider the particular case where the covariance function can be expressed as the
sum of n exponential covariance models, i.e. $c_X(t,t') = c_X(\tau=|t-t'|) = \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(-3|t-t'|/a_{ti}\right)$,

where σXi2 and ati are the variance and temporal range, respectively, of the i-th covariance

model. Note that while nested covariance models represent a large class of useful S/TRFs,

other covariance models can just as easily be examined. In this case σY2(|t’-t|) is given by

(see Appendix C for more details),

$$\sigma_Y^2(|t'-t|) = \sum_{i=1}^{n} \sigma_{Xi}^2 - 2\sum_{i=1}^{n} \frac{a_{ti}\,\sigma_{Xi}^2}{3T}\left[2 - \exp\left(\frac{-3(t'+T/2-t)}{a_{ti}}\right) - \exp\left(\frac{-3(-t'+t+T/2)}{a_{ti}}\right)\right] + \sum_{i=1}^{n} \frac{a_{ti}\,\sigma_{Xi}^2}{3T^2}\left[2T - \frac{2}{3}a_{ti} + \frac{2}{3}a_{ti}\exp\left(\frac{-3T}{a_{ti}}\right)\right]. \tag{4.18}$$

This equation is useful as it provides an algebraic equation for σY2 that is very efficient to

calculate numerically. In the case of a single structure (i.e. n=1) we write for simplicity
the covariance model as $c_X(\tau=|t-t'|) = \sigma_X^2 \exp\left(-3|t-t'|/a_t\right)$ (i.e. we let σX2 and at be

the variance and temporal range of the TRF X(t), respectively). In this case, Eq. (4.18)

further reduces to

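$$\frac{\sigma_Y^2(|t'-t|)}{\sigma_X^2} = 1 - \frac{2}{3}\,\frac{1}{T/a_t}\left[2 - \exp\left(3\frac{t-t'}{a_t} - \frac{3}{2}\frac{T}{a_t}\right) - \exp\left(-3\frac{t-t'}{a_t} - \frac{3}{2}\frac{T}{a_t}\right)\right] + \frac{1}{3}\,\frac{1}{T/a_t}\left[2 - \frac{2}{3}\,\frac{1}{T/a_t} + \frac{2}{3}\,\frac{1}{T/a_t}\exp\left(-3\frac{T}{a_t}\right)\right]. \tag{4.19}$$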
3 T at  3 at 3 at  at 

Eq. (4.19) is conceptually very meaningful for the physical understanding of the connection

between uncertainty and observation scale. We see that the equation is expressed in terms of

three non-dimensional groupings, which are $\sigma_Y^2(|t'-t|)/\sigma_X^2$, (t-t’)/at and T/at. As illustrated
later in the case studies, for a given (t-t’)/at, we find that $\sigma_Y^2(|t'-t|)/\sigma_X^2$ increases from zero
to one as T/at increases from zero to infinity. In other words, when the observation scale T is

very small relative to the covariance range at, then the observed value Z(t) at scale T is very

informative for the assessment of X(t’) (i.e. the corresponding soft data has a small variance

σY2(|t’-t|) ). As T increases, the T–scale observed value Z(t) becomes less and less

informative for X(t’) (i.e. σY2(|t’-t|) increases), until a point where Z(t) becomes irrelevant for

the estimation of X(t’) (i.e. σY2(|t’-t|) reaches the variance σX2 of the TRF X(t)).

By way of example, this result simply expresses the fact that, when estimating

instantaneous PM concentration, then a 1-hour average PM concentration is more

informative than, say, a weekly average of PM concentration. Furthermore, Eq. (4.19)
makes it possible to integrate both 1-hour and weekly average measurements by assigning a different

variance σY2 to each of these measurements according to their observation scale.
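The closed-form expression of Eq. (4.18), of which Eq. (4.19) is the single-structure case, can be coded directly. The sketch below is illustrative only: the function name, the argument layout (a list of variance and range pairs) and the 24-hour covariance range used in the example are assumptions, not values from this work.

```python
import math


def soft_variance(T, lag, structures):
    """Soft data variance sigma_Y^2(|t'-t|) from the closed form of Eq. (4.18).

    T          : observation (averaging) time scale
    lag        : t' - t, which must satisfy |lag| <= T/2
    structures : list of (variance, range) pairs, one per exponential structure
    """
    if abs(lag) > T / 2:
        raise ValueError("t' must lie within the averaging interval")
    total = 0.0
    for var_i, a_i in structures:
        cross = (2.0 - math.exp(-3.0 * (lag + T / 2) / a_i)
                 - math.exp(-3.0 * (-lag + T / 2) / a_i))
        average = 2.0 * T - (2.0 / 3.0) * a_i + (2.0 / 3.0) * a_i * math.exp(-3.0 * T / a_i)
        total += (var_i
                  - 2.0 * a_i * var_i / (3.0 * T) * cross
                  + a_i * var_i / (3.0 * T ** 2) * average)
    return total


# Hypothetical example: unit variance, covariance range of 24 hours, t' = t.
hourly = soft_variance(T=1.0, lag=0.0, structures=[(1.0, 24.0)])
weekly = soft_variance(T=168.0, lag=0.0, structures=[(1.0, 24.0)])
```

With these assumed values the hourly average receives a much smaller soft data variance than the weekly average, which is the behaviour described in the text.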

Usually when generating soft data we will use t’=t. The equation for the soft data

variance is then simply obtained by setting (t-t’)/at =0 in Eq. (4.19), which leads to

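$$\frac{\sigma_Y^2}{\sigma_X^2} = 1 - \frac{2}{3}\,\frac{1}{T/a_t}\left[2 - 2\exp\left(-\frac{3}{2}\frac{T}{a_t}\right)\right] + \frac{1}{3}\,\frac{1}{T/a_t}\left[2 - \frac{2}{3}\,\frac{1}{T/a_t} + \frac{2}{3}\,\frac{1}{T/a_t}\exp\left(-3\frac{T}{a_t}\right)\right]. \tag{4.20}$$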

4.3.2 Synthetic case study

4.3.2.1. Synthetic verification of the uncertainty model for temporal observation scale

We verify the conceptual framework presented by comparing the observation scale

uncertainty predicted theoretically in Eq. (4.19), with that inferred from multiple random

realizations of a TRF X(t) with exponential covariance function $c_X(\tau) = \sigma_X^2 \exp(-3\tau/a_t)$. The
procedure consists in using M random realizations of X(t’) and $Z(t) = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, X(u)$ to infer

a statistical estimate of σY2(|t’-t|), and comparing that synthetic estimate with the value

predicted theoretically by Eq. (4.19). Using classical simulation methods of Geostatistics

(Christakos, 1992), we obtain M realizations χ(k) =[χ1(k), χ2(k),…, χn(k)], k=1,…M, of the

random vector x=[X(t1), X(t2),…, X(tn)] representing the TRF X(t) discretized over a fine

temporal discretization grid t=[t1, t2,…, tn]. We choose an observation scale T of interest, a

time ti ∈ t such that t1+T/2≤ti≤ tn-T/2, and we obtain the M realizations Ζi(k), k=1,…M, of

$Z(t_i) = \frac{1}{T}\int_{t_i-T/2}^{t_i+T/2} du\, X(u)$ by numerically integrating each realization χ(k) over a time period T

centered at ti. We then choose a time tj ∈ t such that |ti-tj|<T/2 and we obtain the M

realizations χj(k), k=1,…M, of X(tj) by selecting the proper element in each χ(k) realization.

This procedure results in the generation of M random realizations {Ζi(k), χj(k)} , k=1,…M, for

the random values {Z(ti), X(tj)}. From these realizations, we can finally easily infer the

expected value and variance of Y(tj,ti)=X(tj)-Z(ti). A statistical estimator for the expected

value E[Y(tj,ti)] of Y(tj,ti) is $\hat{E}[Y(t_j,t_i)] = \frac{1}{M}\sum_{k=1}^{M}\left(\chi_j^{(k)} - Z_i^{(k)}\right)$, while a statistical estimator for
its variance σY2(|tj-ti|) is

$$\hat{\sigma}_Y^2(|t_j-t_i|) = \frac{1}{M}\sum_{k=1}^{M}\left(\chi_j^{(k)} - Z_i^{(k)} - \hat{E}[Y(t_j,t_i)]\right)^2. \tag{4.21}$$

Using this synthetic simulation approach, we obtain the synthetic estimate $\hat{\sigma}_Y^2/\sigma_X^2$ as a
function of T/at for various choices of (t-t’)/at , and compare this synthetic estimate with the

theoretical σY2/σX2 value obtained from Eq. (4.19). This procedure provides a way to verify

whether our conceptual framework leads through Eq. (4.19) to a correct assessment of the

uncertainty (i.e. the variance σY2) associated with the observation scale T.
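The verification procedure can be sketched in a few lines for the case t’=t. Two simplifications are assumed here and are not the approach cited in the text: realizations are generated with a first-order autoregressive recursion, which reproduces an exponential covariance exactly on a regular grid, and Z is obtained with a trapezoidal rule. The function names, grid step, seed and number of realizations are illustrative choices.

```python
import math
import random


def simulate_exponential_trf(n, dt, var_x, a_t, rng):
    """One zero-mean Gaussian realization with covariance var_x * exp(-3 * tau / a_t)."""
    rho = math.exp(-3.0 * dt / a_t)                  # correlation between neighbouring times
    sd = math.sqrt(var_x)
    x = [sd * rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(rho * x[-1] + sd * math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
    return x


def synthetic_soft_variance(M, T, a_t, var_x=1.0, dt=0.02, seed=0):
    """Estimate sigma_Y^2 at t' = t from M realizations, following Eq. (4.21)."""
    rng = random.Random(seed)
    n = int(round(T / dt)) + 1                       # grid covering [t - T/2, t + T/2]
    mid = n // 2                                     # index of the mid-point t' = t
    y = []
    for _ in range(M):
        x = simulate_exponential_trf(n, dt, var_x, a_t, rng)
        z = (sum(x) - 0.5 * (x[0] + x[-1])) * dt / T     # trapezoidal average over T
        y.append(x[mid] - z)
    mean_y = sum(y) / M
    return sum((v - mean_y) ** 2 for v in y) / M


def theoretical_ratio(T_over_a):
    """sigma_Y^2 / sigma_X^2 at t' = t from the closed form of Eq. (4.20)."""
    x = T_over_a
    return (1.0 - 2.0 / (3.0 * x) * (2.0 - 2.0 * math.exp(-1.5 * x))
            + 1.0 / (3.0 * x) * (2.0 - 2.0 / (3.0 * x) + 2.0 / (3.0 * x) * math.exp(-3.0 * x)))


estimate = synthetic_soft_variance(M=4000, T=1.0, a_t=1.0)
theory = theoretical_ratio(1.0)
```

Repeating the comparison of `estimate` and `theory` over a range of T/at values gives the kind of marker-versus-line comparison described in the text; the agreement is only expected up to sampling error, which shrinks as M grows.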

Figure 4.1 shows the plot of σY2/σX2 as a function of T/at for selected values of (t-t’)/T.

The synthetic estimates $\hat{\sigma}_Y^2/\sigma_X^2$ obtained from multiple random realizations (Eq. 4.21) are

shown with markers, while the corresponding σY2/σX2 value predicted from theory (Eq. 4.19)

is shown with lines. The good agreement between theory and synthetic estimates provides

support that our conceptual framework adequately models the uncertainty associated with

temporal observation scale.

Figure 4.1: Plot of σY2/σX2 as a function of T/at for different values of (t-t’)/T. Markers
indicate the synthetic estimates obtained from multiple random realizations (Eq. 4.21), while lines
show the values predicted from theory (Eq. 4.19).

Figure 4.1 additionally provides some useful insights about the effect of temporal

observation scale. As defined earlier, σX2 is the variance of the TRF X(t), at is the temporal

range of its exponential covariance function, and σY2 is the variance of the conditional PDF

for the random variable X(t’) given a measured value for Z(t) obtained at time t with an

observation scale T. Hence a ratio σY2/σX2 less than one means that the Z(t) measured value

is informative for the random variable X(t’). Figure 4.1 indicates that the Z(t) measured

value is informative only when the observation scale T is small relative to the temporal

covariance range at. Hence this plot can be useful in determining a cut-off for the observation

scale. For example according to this plot, data measured at an observation scale T greater

than 3 times the temporal covariance range at have little information, and could be

disregarded. Furthermore Figure 4.1 shows that a measured value Z(t) is most informative

for X(t’) when t’=t. Physically this will mean that when constructing the soft datum for X at

some t’ given a measured value for Z at time t, a judicious choice is to select t’=t, i.e. to

construct the X soft datum at the mid-point of the interval T over which Z is observed.

4.3.2.2. Quantifying the improvement in mapping accuracy resulting from the integration of
temporal observation scale uncertainty

A validation procedure using synthetic random fields provides an excellent tool to quantify

the gain in mapping accuracy that our proposed approach provides over an approach not

accounting for the observation scale effect. As described above, synthetic random fields are

easily generated using classical simulation methods of Geostatistics (Christakos, 1992), such

that each realization of a TRF X(t) has a prescribed mean trend and covariance function

corresponding to an environmental or health variable of interest. Typically a realization of

the TRF X(t) consists of values χtrue =[χ1… χn] of X(tj) for a dense grid of times tj = j δt ,

j=1,…, n, with a small time interval δt. Then we obtain the values ζtrue =[ζ1… ζn] for
$Z(t_j) = \frac{1}{T_j}\int_{t_j-T_j/2}^{t_j+T_j/2} du\, X(u)$, j=1,…, n, by numerically integrating the simulated χtrue values over a

different observation scale Tj at each time tj, j=1,…, n. Hence χtrue represents the (synthetic)

truth for the field of interest observed at the local scale, while ζtrue represents the truth

observed at different observation scales Tj, j=1,…, n. We then randomly divide the truth χtrue

into a validation set χval and a data set χhard, so that χtrue=χval U χhard. Similarly we randomly

select a data set ζhard out of ζtrue.

The validation procedure consists in using only the data χhard and ζhard to obtain estimates

χval* of the local scale TRF X(t) for the validation times at which the truth χval is known. The

estimation errors are then obtained as the difference εval*=χval-χval* between true and

estimated values. Finally we obtain the mean square error (MSE) by averaging the squared

estimation errors. When interested in two different estimation methods (labeled as method 1

and 2), we obtain the MSE for each method (i.e. MSE1 and MSE2), and we quantify the

change in mapping accuracy by calculating the percent MSE change $r_{MSE}^{1,2}$ between method 1
and method 2 as $r_{MSE}^{1,2} = (MSE_2 - MSE_1)/MSE_1 \times 100$.
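The two validation statistics defined above are straightforward to compute. The following sketch uses hypothetical function names and made-up validation values purely for illustration.

```python
def mean_square_error(true_values, estimates):
    """Average of the squared estimation errors (true minus estimated values)."""
    if len(true_values) != len(estimates):
        raise ValueError("true_values and estimates must have the same length")
    errors = [x - x_hat for x, x_hat in zip(true_values, estimates)]
    return sum(e * e for e in errors) / len(errors)


def percent_mse_change(mse_1, mse_2):
    """Percent MSE change from method 1 to method 2: (MSE2 - MSE1) / MSE1 x 100."""
    return (mse_2 - mse_1) / mse_1 * 100.0


# Made-up validation values, for illustration only.
chi_val = [1.0, 2.0, 3.0, 4.0]
est_method_1 = [1.5, 2.5, 2.5, 3.5]        # every error has magnitude 0.5
est_method_2 = [1.25, 2.25, 2.75, 3.75]    # every error has magnitude 0.25

mse_1 = mean_square_error(chi_val, est_method_1)
mse_2 = mean_square_error(chi_val, est_method_2)
r_mse = percent_mse_change(mse_1, mse_2)   # negative: method 2 reduces the MSE
```

Halving every estimation error divides the MSE by four, so the percent MSE change in this toy example is -75%.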

The estimation methods that we compare are the BME approach accounting for the

observation scale of the data, as presented in this work, and the simple kriging method of

classical Geostatistics. Our proposed BME approach generates a conditional PDF fS(χs| ζ)

(Eqs. 4.11, 4.15 and 4.20) for each observed value of the vector ζhard and its corresponding

observation scale. The collection of the conditional PDFs constitutes the soft data χsoft,
which BME rigorously processes together with the hard data χhard directly observed at the

local scale. By contrast, the simple kriging estimates are obtained using either χhard only, or

both χhard and ζhard, as hard data, i.e. without accounting for the observation scale
uncertainty of the ζhard data. The percent MSE change $r_{MSE}^{SK,BME}$ then quantifies the percent
change from the SK method MSE to the BME method MSE. A negative $r_{MSE}^{SK,BME}$ means that
BME reduces the MSE (i.e. that BME is more accurate than SK), and the magnitude of a
negative $r_{MSE}^{SK,BME}$ quantifies the gain in mapping accuracy of BME over SK.

Using the Geostatistical simulation method based on a Cholesky decomposition of the

covariance matrix (Christakos et al., 2002), we generate 20 realizations of the TRF X(t) with

the following prescribed covariance function

\[ c_X(\tau = |t'-t|) = c_{01}\exp\left(\frac{-3\tau}{a_{t1}}\right) + c_{02}\exp\left(\frac{-3\tau}{a_{t2}}\right), \tag{4.22} \]

where $c_{01} = 0.7 \times \sigma_X^2$, $c_{02} = 0.3 \times \sigma_X^2$, $\sigma_X^2 = 4$, $a_{t1} = 50$, and $a_{t2} = 250$. Each realization consists of

the vector χtrue =[χ1… χn] simulating the value of the TRF X(t) at times tj ∈ t=[0, 1, …, 500].

We select from this time grid 8 time coordinates where the χhard data is sampled at the local

scale, and 37 time coordinates where the ζhard data is measured at varying observation time

scales Tj, j=1,…,39. Finally, on the basis of the ζhard data and the associated observation time

scales, we construct the conditional PDFs fS(χs| ζ) that constitute the soft data χsoft for our

proposed BME estimation approach.
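The simulation step described above can be sketched as follows. This is only an illustration under stated assumptions: we assume a zero-mean Gaussian random field, we use NumPy rather than the software actually used in this work, and the function names, the random seed and the small diagonal `jitter` term are hypothetical choices added for numerical stability.

```python
import numpy as np


def nested_exponential_cov(tau, c01=2.8, c02=1.2, at1=50.0, at2=250.0):
    """Covariance of Eq. (4.22), with c01 = 0.7 * 4 and c02 = 0.3 * 4 (sigma_X^2 = 4)."""
    tau = np.abs(tau)
    return c01 * np.exp(-3.0 * tau / at1) + c02 * np.exp(-3.0 * tau / at2)


def simulate_realizations(times, n_real=20, seed=0, jitter=1e-10):
    """Zero-mean Gaussian realizations from the Cholesky factor of the covariance matrix."""
    times = np.asarray(times, dtype=float)
    lags = times[:, None] - times[None, :]
    cov = nested_exponential_cov(lags)
    cov[np.diag_indices_from(cov)] += jitter  # assumption: tiny nugget for stability
    chol = np.linalg.cholesky(cov)            # cov = chol @ chol.T
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((len(times), n_real))
    return (chol @ white).T                   # shape (n_real, n_times)


times = np.arange(0, 501)
chi_true = simulate_realizations(times)       # 20 realizations on t = 0, 1, ..., 500
```

Multiplying uncorrelated standard normal draws by the lower Cholesky factor yields vectors whose covariance matrix equals the prescribed one.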

For illustration purposes, one of the generated realizations χtrue is shown with a dotted line

in Figure 4.2, along with the χhard data represented by circles, and the ζhard data represented

by crosses. As explained above, the ζhard data is obtained by numerical integration of χtrue

over each of the observation time scales Tj, j=1,…,39. We show four of these observation

time scales using horizontal bars in Figure 4.2, and we show the corresponding conditional

PDF fS(χs| ζ) with a bell shape curve. As can be seen from the figure, for the small

observation time scales we have a conditional PDF with high information content (i.e. the bell

shape curve is peaked), while for the large observation time scales we have an (almost)

non-informative conditional PDF (i.e. the bell shape curve is flat). This provides an illustration of

the scale effect captured by our conceptual framework (e.g. Eq. 4.20).
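To make the scale effect concrete, the sketch below averages a series over a short and a long time window. It is a simplification under our own assumptions: the window is centered on the observation time, the numerical integration is a plain arithmetic mean over a unit-spaced grid, and the window is clipped at the grid edges. The function name and the toy signal are hypothetical.

```python
def block_average(values, center, half_width):
    """Mean of `values` over grid indices center - half_width .. center + half_width."""
    lo = max(0, center - half_width)
    hi = min(len(values) - 1, center + half_width)
    window = values[lo:hi + 1]
    return sum(window) / len(window)


# Toy signal that fluctuates strongly at the local scale
signal = [0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0]
short_scale = block_average(signal, center=3, half_width=0)  # equals the local value
long_scale = block_average(signal, center=3, half_width=3)   # smooths out local detail
```

The short window reproduces the local value exactly, while the long window returns a smoothed value that says much less about the local value at the center.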

Figure 4.2: Plot showing one of the generated realizations of the TRF X(t). The simulated
values χtrue are shown with a dotted line, the χhard data are represented by circles, and the
ζhard data are represented by crosses. Four observation time scales of the ζhard data are shown
with horizontal bars, and the corresponding conditional PDFs are shown with bell shape
curves.

The validation procedure described in section 4.3.2.1 allows us to compare three

estimation methods, which are summarized in Table 4.1. Methods 1 and 2 represent two

attempts to process the data available using the traditional simple kriging method of classical

Geostatistics, which ignores the effect of observation scale. Method 1 only processes χhard,

i.e. it entirely ignores the ζhard data. Method 2 treats both χhard and ζhard as hard data, i.e.

ignores the uncertainty arising from the observation scale of the ζhard data. On the other hand

method 3 corresponds to our proposed approach, which processes the χhard and χsoft data,

thereby rigorously accounting for the uncertainty associated with the various time scales at

which ζhard is observed.

Table 4.1: Description of three estimation methods compared in the validation procedure.
            Local scale X data    Large scale Z data
Method 1    χhard                 ignored
Method 2    χhard                 ζhard
Method 3    χhard                 χsoft

Using the validation procedure, we obtain an MSEave which is a validation MSE averaged

over 20 realizations for each of the estimation methods considered. As shown in Table 4.2, we

can see from these results that method 2 (simple kriging II) has an MSEave that is only

slightly smaller than that of method 1 (simple kriging I). This means that even though simple

kriging II did process the additional information provided by the data observed at various

time scales, the gain in mapping accuracy was modest because the scale effect was ignored.

On the other hand we see that the MSEave of method 3 (BME) is substantially smaller than

that of either method 1 or 2. In fact BME results in a 50.2% MSEave reduction when

compared to method 1, or a 46.7% MSEave reduction when compared to method 2. These

results demonstrate that our proposed approach provides a sound conceptual framework to

model the effect of observation scale, and may in some cases result in a drastic gain of

mapping accuracy over estimation methods that ignore the scale effect.

Table 4.2: MSEave calculated by averaging the validation results obtained over 20 realizations.
                  Method 1             Method 2              Method 3
                  (simple kriging I)   (simple kriging II)   (BME)
MSEave            2.1474               2.0048                1.0689
$r_{MSE}^{1,3}$   -50.2%
$r_{MSE}^{2,3}$   -46.7%

Further insights are gained by visually inspecting the validation estimates obtained for the

realization of Figure 4.2. Figure 4.3 shows the simulated truth χtrue with a dotted line, and the

χhard and ζhard data with markers. Additionally the estimated profiles obtained with methods 1,

2 and 3 are shown with lines in Figure 4.3(a), 4.3(b), and 4.3(c), respectively. As can be

seen from the figure, the estimated profile for method 1 goes through the χhard data, but the

mapping accuracy is poor because the ζhard data is entirely ignored. On the other extreme the

estimated profile for method 2 goes through both χhard and ζhard data. While this results in a

modest gain in mapping accuracy, the estimated profile suffers from the fact that the

observation scale of the ζhard data was ignored. Finally, the estimated profile for method 3

(BME) goes through the χhard data, and considers each ζhard datum according to its

observation scale, such that observations at shorter time scales are given more weight than

observations at larger time scales. As can be seen from the figure, this results in an estimated

profile that provides a much more accurate representation of the truth.


Figure 4.3: Plots showing the simulated truth χtrue with a dotted line, the χhard data with
circles, and the ζhard data with crosses. Additionally, lines show the estimated profiles
obtained using (a) method 1, (b) method 2, and (c) method 3 (BME).

4.4. Spatial observation scale: Mathematical formulation and synthetic
case study

4.4.1 Mathematical formulation

4.4.1.1 Non-homogeneous spatial random field

We extend in this section the one dimensional temporal framework to consider two-

dimensional spatial random fields (SRF). Similarly to the one-dimensional case, our aim is to

derive a mathematical formulation for σY2 in the case of SRFs. The local scale SRF X(s)

represents the spatial distribution of the variable X of interest at the spatial location s, where

s=[s1,s2] represents a geographical location. Then Z(s) is defined as an average of X(s) over

the 2-dimensional space domain As (i.e. area), i.e.

Z(s) =∫u ∈ As duX(u) / || As||, (4.23)

where As is a geographical area with centroid s. Following the previous development we

define a new spatial random field Y(s’,s) as,

Y(s’,s) = X(s’)- Z(s). (4.24)
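Eqs. (4.23) and (4.24) can be illustrated on a gridded realization as follows. This is a sketch under our own assumptions: the averaging domain is a circle, the integral is approximated by the arithmetic mean of the grid cells whose centers fall inside that circle, and distances are expressed in grid-cell units. The function names are hypothetical.

```python
def disk_average(grid, row, col, radius):
    """Mean of the grid cells whose centers lie within `radius` of cell (row, col)."""
    total, count = 0.0, 0
    reach = int(radius)
    for i in range(max(0, row - reach), min(len(grid), row + reach + 1)):
        for j in range(max(0, col - reach), min(len(grid[0]), col + reach + 1)):
            if (i - row) ** 2 + (j - col) ** 2 <= radius ** 2:
                total += grid[i][j]
                count += 1
    return total / count


def local_minus_average(grid, point, centroid, radius):
    """One realized value of Y(s', s) = X(s') - Z(s), with s' = point and s = centroid."""
    z_value = disk_average(grid, centroid[0], centroid[1], radius)
    return grid[point[0]][point[1]] - z_value


# Toy field increasing linearly along both axes
grid = [[float(i + j) for j in range(5)] for i in range(5)]
z_center = disk_average(grid, 2, 2, 1.0)
y_offset = local_minus_average(grid, (3, 3), (2, 2), 1.0)
```

For this linear toy field the disk average equals the value at the centroid, so Y vanishes at s' = s and is nonzero away from the centroid.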

In the most general case, non-homogeneous SRFs are characterized by a spatially varying

mean trend function mX(s)=E[X(s)], and a covariance function cX(s, s’) that cannot be

expressed solely as a function of the spatial lag, |s-s’|. In this case we can mathematically

derive starting from Eqs. (4.23) and (4.24) the expected value of Y(s’,s) (see details in

Appendix D), i.e.

E[Y(s’,s)] = mX(s’) – || As||-1∫u ∈ As du mX(u), (4.25)

and its variance (see details in Appendix D), i.e.

σY2(s’, s) = E[X2(s’)] – 2 E[X(s’)Z(s)] + E[Z2(s)] – { mX(s’) – || As||-1∫u∈ As du mX(u)}2,

(4.26)

where

E[X2(s’)] = σX2(s’) + {mX(s’)}2,

E[X(s’)Z(s)] = || As||-1∫u ∈ As du {cX(s’,u) + mX(s’) mX(u)},

E[Z2(s)] = || As||-2∫u ∈ As du ∫u’ ∈ As du’{cX(u,u’) + mX(u) mX(u’)}.

4.4.1.2 Homogeneous spatial random field

Let us consider some special cases where Eq. (4.26) may be simplified. First if we assume

that E[X(s’)] = 0 (e.g., for a mean trend removed SRF) then we have E[Y(s’,s)]=0, and the

first term in the right hand side (RHS) of Eq. (4.26) (i.e. E[X2(s’)]) is equal to σX2.

Furthermore, the second RHS term in Eq. (4.26) reduces to

E[X(s’)Z(s)] = || As||-1∫u ∈ As du cX(s’,u). (4.27)

Assuming a homogeneous spatial covariance, i.e. cX(s’, s) = cX(|s’-s|), we further expand Eq.

(4.27) (see Appendix D for more details) as

E[X(s’)Z(s)] = || As||-1∫r ∈ A(0) dr cX(|r-(s’- s)|), (4.28)

where A(0) is the 2-D spatial averaging domain centered at the origin (i.e. with a centroid

located at 0). This equation can numerically be integrated for any shape of the averaging

domain A(0). However a reasonable approximation of the averaging domain A(0) is a circle

of same area as As, i.e. with a radius R such that πR2=|| As||. Assuming that A(0) is a circle

of radius R, we get (see more details in Appendix D),

\[ E[X(s')Z(s)] = (\pi R^2)^{-1} \int_0^R dr \int_0^{2\pi} d\theta \, r \, c_X\left(\sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}\right), \tag{4.29} \]

where s=[s1 s2] and s’=[s1’ s2’].

Using a similar development for the third RHS term of Eq. (4.26) we obtain (see

details in Appendix D)

\[ E[Z^2(s)] = (\pi R^2)^{-2} \int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha \, 2\pi \, r \, r' \, c_X\left(\sqrt{r^2 + r'^2 - 2rr'\cos\alpha}\right). \tag{4.30} \]

Eqs. (4.26), (4.29) and (4.30) provide a formula for σY2(s’, s) that is valid for any

homogeneous covariance model. Let’s now assume that the covariance model is the

superposition of n exponential functions, so that the covariance model can be expressed as

$c_X(|s-s'|) = \sum_{i=1}^{n} \sigma_{Xi}^2 \exp(-3|s-s'|/a_{ri})$, where $\sigma_{Xi}^2$ and $a_{ri}$ are the variance and spatial range of

each exponential covariance function, respectively. In this case we have

\[ \sigma_Y^2(|s'-s|) = \sigma_X^2 - 2(\pi R^2)^{-1} \int_0^R dr \int_0^{2\pi} d\theta \, r \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(\frac{-3 d_1(r,s,s',\theta)}{a_{ri}}\right) + (\pi R^2)^{-2} \int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha \, 2\pi \, r \, r' \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(\frac{-3 d_2(r,r',\alpha)}{a_{ri}}\right), \tag{4.31} \]

where $d_1(r,s,s',\theta) = \sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}$ and

$d_2(r,r',\alpha) = \sqrt{r^2 + r'^2 - 2rr'\cos\alpha}$.

Eq. (4.31) is valid for the superposition of any number of exponential models. In the case

of a single exponential covariance model, i.e. n=1, the covariance function is written as

$c_X(|s-s'|) = \sigma_X^2 \exp(-3|s-s'|/a_r)$, and Eq. (4.31) reduces to

\[ \sigma_Y^2(|s'-s|) = \sigma_X^2 - 2(\pi R^2)^{-1} \int_0^R dr \int_0^{2\pi} d\theta \, r \, \sigma_X^2 \exp\left(\frac{-3 d_1(r,s,s',\theta)}{a_r}\right) + (\pi R^2)^{-2} \int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha \, 2\pi \, r \, r' \, \sigma_X^2 \exp\left(\frac{-3 d_2(r,r',\alpha)}{a_r}\right), \tag{4.32} \]

where σX2 is the variance of the SRF X(s), and ar is its spatial covariance range. Usually we

seek an X soft datum at the centroid of the Z hard data (i.e. s’ = s, so that the X soft datum is

located at the center of the circular averaging area As). In this case Eq. (4.32) is further

reduced by setting s=s’, i.e.

\[ \sigma_Y^2 = \sigma_X^2 - 4R^{-2} \int_0^R dr \, r \, \sigma_X^2 \exp(-3r/a_r) + (\pi R^2)^{-2} \int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha \, 2\pi \, r \, r' \, \sigma_X^2 \exp\left(\frac{-3 d_2(r,r',\alpha)}{a_r}\right). \tag{4.33} \]

As can be seen from this equation, the variance σY2 describing the uncertainty associated with

the observation scale of a 2-D circular averaging domain is a function of the variance and

spatial range of the SRF X(s), as well as the radius R of the averaging spatial domain

characterizing the observation scale.
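Eq. (4.33) has to be evaluated numerically. The sketch below does so with a simple midpoint quadrature, which is our own choice and not necessarily the integration scheme used in this work. It returns the ratio σY2/σX2 as a function of R/ar, with lengths scaled so that ar = 1. The function name and the number of quadrature nodes `n` are hypothetical.

```python
import math


def sigma_y2_ratio(radius_over_range, n=30):
    """sigma_Y^2 / sigma_X^2 from Eq. (4.33) (case s' = s), by midpoint quadrature.

    Lengths are scaled so that a_r = 1 and R = radius_over_range.
    """
    R = radius_over_range
    dr = R / n
    da = 2.0 * math.pi / n
    radii = [(k + 0.5) * dr for k in range(n)]
    angles = [(k + 0.5) * da for k in range(n)]

    # Second term: -4 R^-2 * integral over r of r * exp(-3 r / a_r)
    term2 = -4.0 / R ** 2 * sum(r * math.exp(-3.0 * r) for r in radii) * dr

    # Third term: (pi R^2)^-2 * triple integral of 2 pi r r' exp(-3 d2 / a_r)
    total = 0.0
    for r in radii:
        for rp in radii:
            for a in angles:
                d2 = math.sqrt(max(0.0, r * r + rp * rp - 2.0 * r * rp * math.cos(a)))
                total += 2.0 * math.pi * r * rp * math.exp(-3.0 * d2)
    term3 = total * dr * dr * da / (math.pi * R ** 2) ** 2

    return 1.0 + term2 + term3


small_scale = sigma_y2_ratio(0.05)   # R much smaller than the covariance range
medium_scale = sigma_y2_ratio(1.0)
large_scale = sigma_y2_ratio(10.0)   # R much larger than the covariance range
```

The ratio is close to 0 when R is small relative to ar and approaches 1 when R is large, which is the scale effect the equation is meant to capture.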

4.4.2 Synthetic case study

4.4.2.1 Synthetic verification of the uncertainty model for spatial observation scale

By extending the procedure of section 4.3.2.1 to the spatial case, we generate multiple

random realizations of Y(s’, s) =X(s’)-Z(s), from which we obtain synthetic estimates of

σY2(|s’-s|) that can be used to verify the value predicted theoretically from Eq. (4.32). The

procedure consists in generating realizations of the SRF X(s) on a fine spatial grid, choosing

a radius R of interest for the observation scale (i.e. the radius of the circular averaging spatial

domain As), and obtaining the realizations of the SRF Z(s) = ∫ u ∈ As duX(u)/|| As|| by

numerical integration of each of the X(s) realizations. We then choose from the spatial grid

two spatial locations sj and si separated by a distance |si- sj| <R/2 of interest, and we select for

each realization k=1,…,M the realized value χj(k) for X(sj), and the realized value Ζi(k) for Z(si).

This procedure results in the generation of M random realizations {Ζi(k), χj(k)}, k=1,…M,

from which we obtain the M random realized values ψji(k)=χj(k)−Ζi(k), k=1,…M, for the

random variable Y(sj,si)= X(sj)− Z(si). The synthetic estimate of σY2(|sj-si|) is finally obtained

by the estimator

\[ \hat{\sigma}_Y^2(|s_j - s_i|) = \frac{1}{M} \sum_{k=1}^{M} \left(\psi_{ji}^{(k)} - \hat{E}[Y(s_j,s_i)]\right)^2, \tag{4.34} \]

where $\hat{E}[Y(s_j,s_i)] = \frac{1}{M}\sum_{k=1}^{M} \psi_{ji}^{(k)}$. Using this procedure, we obtain synthetic estimates $\hat{\sigma}_Y^2$

for various values of R and |sj-si|, which we may compare with the theoretical value σY2

predicted by Eq. (4.32). Agreement between the synthetic estimate and theoretical value

provides verification that the conceptual framework proposed provides an adequate model

for the uncertainty associated with spatial observation scale.
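The estimator of Eq. (4.34) can be sketched in a few lines. The function name is hypothetical, and the input lists stand for the M realized values of X(sj) and Z(si), respectively.

```python
def synthetic_variance(chi_j, zeta_i):
    """Eq. (4.34): variance of psi = chi_j - zeta_i over M realizations (1/M normalization)."""
    if len(chi_j) != len(zeta_i) or not chi_j:
        raise ValueError("inputs must be non-empty and of equal length")
    psi = [x - z for x, z in zip(chi_j, zeta_i)]
    mean_psi = sum(psi) / len(psi)  # estimate of E[Y(s_j, s_i)]
    return sum((p - mean_psi) ** 2 for p in psi) / len(psi)


# Toy check with M = 4 realizations
estimate = synthetic_variance([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
```

Note that the estimator divides by M, as in Eq. (4.34), rather than by M - 1.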

Figure 4.4 shows the plot of σY2/σX2 as a function of R/ar for selected values of |s-s’|/R.

The synthetic estimates $\hat{\sigma}_Y^2/\sigma_X^2$ obtained from multiple random realizations (Eq. 4.34) are

shown with markers, while the corresponding σY2/σX2 value predicted from theory (Eq. 4.32)

is shown with lines. Similarly to the one-dimensional temporal case, a judicious choice to

construct the X soft datum at some location s’ given a measured value for Z at s is to select

s’=s. As shown in Figure 4.4 there is a good agreement between theory and synthetic estimates

when |s-s’|/R=0, which provides support that our conceptual framework adequately models

the uncertainty associated with spatial observation scale. When |s-s’|/R>0 (i.e. |s-s’|=0.4R)

the theoretical values are slightly higher than the synthetic estimates. This may

be due to the numerical work associated with the calculation of the mathematical formulation

of σY2/σX2, which is computationally more complex for |s-s’|/R>0 (Eq. 4.32) than for |s-

s’|/R=0 (Eq. 4.33).

Figure 4.4: Plot of σY2/σX2 as a function of R/ar for different values of |s- s’|/R. Markers
indicate synthetic estimates obtained from multiple random realizations (Eq. 4.34), while lines
show the values predicted from theory (Eq. 4.32).

As previously noted, the relationship between σY2/σX2 and R/ar shown in Figure 4.4

provides useful insights about the effect of the spatial scale at which observations are made.

As clearly indicated by Figure 4.4, when the spatial observation scale R is very small relative

to the covariance range ar of the local scale SRF X (i.e. when R is smaller than about 0.2 ar)

then an observation made at that spatial scale at point s (i.e. a measured value for Z(s)=∫u ∈ As duX(u)

/ || As||) is highly informative for assessing the process at the local scale at point s’ (i.e. for

assessing X(s’)), provided that s’ is close to s ( i.e. provided that |s- s’|<0.4 R or much less).

4.4.2.2 Quantifying the improvement in mapping accuracy resulting from the integration of
spatial observation scale uncertainty

Validation procedures provide the tools needed to quantify the gain in mapping accuracy that

our proposed approach provides over an approach not accounting for the effect of spatial

observation scale. One validation procedure consists in using a synthetic SRF, while another

consists in using data from a real case study.

In the synthetic validation procedure, we use classical Geostatistical simulation

techniques to generate a realization χtrue =[χ1… χn] of the SRF X(s) observed at the nodes si,

i=1,…, n, of a fine resolution spatial grid. Then, for each node si, we numerically integrate

χtrue over a circular spatial observation domain Asi of radius Ri to obtain the realized value ζi

for the random variable Z(si) =∫u ∈ Asi duX(u) / || Asi||. This results in the generation of the

realization ζtrue =[ζ1… ζn] of the SRF Z(s) observed at observation scales Ri, i=1,…, n.

Hence χtrue represents the (synthetic) truth for the field of interest observed at the local scale,

while ζtrue represents the truth observed at a variety of observation scales. We then randomly

divide the truth χtrue into a validation set χval and a data set χhard, so that χtrue=χval U χhard.

Similarly we randomly select a data set ζhard out of ζtrue. The advantage of the synthetic

validation procedure is that we can select a large n so as to have high statistical power, and

that we can choose arbitrarily any observation scale Ri of interest.

On the other hand, in the real case study, χtrue and ζtrue are obtained from available data

measured at the local scale, and at some observation scale R, respectively. However the

validation procedure from real-case study data suffers from many limitations, including the

fact that n is limited by the number of data available (which may limit statistical power), the

unavoidable measurement errors that introduce an uncontrollable noise between the data

available and the actual truth, and the lack of mechanism to select different observation

scales other than that for which data is available. Notwithstanding these

limitations, we randomly select χhard and χval from χtrue subject to χtrue=χval U χhard, and we

usually obtain ζhard by selecting all of the data ζtrue.

The validation procedure consists in using only the data χhard and ζhard to obtain estimates

χval* of the local scale SRF X(s) at the validation point locations where the truth χval is known.

The validation estimation errors are then simply obtained as the difference εval*=χval-χval*

between true and estimated values, and their mean square error (MSE) provides a measure of

the estimation error of the estimation method used to obtain χval*.
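The random division of the truth into a data set and a validation set can be sketched as follows. The function name, the seed, and the use of index lists are our own illustrative assumptions.

```python
import random


def split_truth(n_points, n_hard, seed=0):
    """Randomly partition indices 0..n_points-1 into hard-data and validation indices."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    hard = sorted(indices[:n_hard])
    val = sorted(indices[n_hard:])
    return hard, val


hard_idx, val_idx = split_truth(n_points=10, n_hard=3)
```

By construction the two index sets are disjoint and their union covers every grid node, which mirrors the condition χtrue=χval U χhard.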

In this study we compare the mapping accuracy of 3 different mapping methods. Method

1 consists in the simple kriging (SK) method of classical Geostatistics using only χhard as

hard data. Method 2 also consists in the SK method, but using both χhard and ζhard as hard

data (i.e. ignoring the observation-scale uncertainty of the ζhard data). Finally method 3

consists in the BME method proposed in this work, which uses χhard as hard data, and uses

ζhard and the corresponding observation scale R to generate some soft data χsoft in terms of the

conditional PDF fS(χsoft | ζhard, R) (Eqs. 4.11 and 4.33). As a result, method 3 fully

accounts for the observation scale effect, and we compare it to the two extreme classical

approaches not accounting for observation scale: method 1, which ignores ζhard entirely, and

method 2, which treats it as if it were hard data (i.e. as if the observation scale was not

introducing any uncertainty).

It should be noted that the so called cross-validation procedure is a slight modification of

the validation procedure that is widely used in practice, so we will also use this procedure to

compare method 1, 2 and 3 in the real case study. In the cross validation procedure, the ζhard

data remains unchanged, while the validation data χxval corresponds to the whole dataset χtrue

available, i.e. χxval =χtrue. Then cross validation estimates χxval* are obtained by excluding in

turn each validation point, and re-estimating it from the surrounding data. The cross-

validation MSE is finally obtained on the basis of the cross-validation errors εxval*=χxval-χxval*.

Hence the cross-validation procedure provides an additional metric to compare methods 1, 2

and 3.
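The cross-validation loop can be sketched as below. The estimator passed to the loop is a deliberately trivial stand-in (the plain average of the remaining data) used only to make the sketch runnable; it is not simple kriging or BME, and all names are hypothetical. In an actual application the estimator would also have access to the unchanged ζhard data.

```python
def cross_validation_mse(locations, values, estimator):
    """Leave-one-out: re-estimate each point from all other points, then average squared errors."""
    squared_errors = []
    for k in range(len(values)):
        other_locs = locations[:k] + locations[k + 1:]
        other_vals = values[:k] + values[k + 1:]
        estimate = estimator(locations[k], other_locs, other_vals)
        squared_errors.append((values[k] - estimate) ** 2)
    return sum(squared_errors) / len(squared_errors)


def mean_of_others(target, other_locs, other_vals):
    """Hypothetical stand-in estimator (NOT kriging or BME): average of the other data."""
    return sum(other_vals) / len(other_vals)


xval_mse = cross_validation_mse([(0, 0), (1, 0), (2, 0)], [1.0, 2.0, 3.0], mean_of_others)
```

Passing the estimator as a function keeps the loop identical for methods 1, 2 and 3, so that only the estimation step differs between them.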

Using the Geostatistical simulation method based on a Cholesky decomposition of the

covariance matrix (Christakos et al., 2002), we generate 20 realizations of the SRF X(s) with

the following prescribed covariance function

\[ c_X(l = |s'-s|) = \sigma_X^2 \exp\left(\frac{-3l}{a_r}\right), \tag{4.35} \]

where the variance of X(s) is $\sigma_X^2 = 0.006$, and $a_r = 10$. Each realization consists of the vector χtrue

=[χ1… χn] simulating the value of the SRF X(s) at the nodes of a dense spatial grid. We

select from this simulated truth a subset of data χhard representing local scale measurements

of X(s). We additionally obtain from χtrue a set ζhard of observations at varying spatial scales

Rj, j=1,…,n . Each ζhard datum is obtained by numerically integrating the truth χtrue over a

circular averaging domain of radius equal to its spatial observation scale Rj. Finally, on the

basis of the ζhard data and the associated observation spatial scales, we construct the

conditional PDFs fS(χs| ζ) that constitute the soft data χsoft for our proposed BME estimation

approach.

For illustration purposes, one of the generated realizations χtrue is shown in the contoured

map of Figure 4.5, along with the χhard data points represented by stars, and the ζhard data

points represented by triangles. For three of the ζhard data points, we show the corresponding

circular averaging domain with a radius equal to their spatial observation scales. As

illustrated in Figure 4.5, the observation spatial scale is not constant across the ζhard data,

which corresponds to a realistic situation where data might be obtained at varying

observation scales (e.g. for data collected at different administrative aggregation levels, such

as zip code, counties, etc.).

Figure 4.5: Contoured map showing one of the generated realizations of the SRF X(s), along
with the location of the χhard data points (stars), and the ζhard data points (triangles). The
circular averaging domains for three of the ζhard data points are shown with a radius equal to
their spatial observation scales.

Using the validation procedure, we obtain an MSEave which is the validation MSE

averaged over 20 realizations for the three estimation methods considered, i.e. method 1

(simple kriging I), method 2 (simple kriging II), and method 3 (BME). As shown in Table 4.3, we can

see from these results that method 2 (simple kriging II) has an MSEave that is smaller than that

of method 1 (simple kriging I). This is explained by the fact that simple kriging II processes

the additional information provided by the data observed at various spatial scales, resulting in

a gain in mapping accuracy. However simple kriging II does not account for the effect of

observation scale. On the other hand we see that the MSEave of our proposed BME method

(method 3), which rigorously accounts for the scale effect, is substantially smaller than that

of either method 1 or 2. In fact BME results in a 41.8% MSEave reduction when compared to

method 1, or a 30.2% MSEave reduction when compared to method 2. These results

demonstrate that our proposed approach leads on the average to a substantial gain of mapping

accuracy over estimation methods that ignore the scale effect.

Table 4.3: MSEave calculated by averaging the validation results obtained over 20 realizations.
                  Method 1             Method 2              Method 3
                  (simple kriging I)   (simple kriging II)   (BME)
MSEave            0.001255             0.001046              0.000730
$r_{MSE}^{1,3}$   -41.8%
$r_{MSE}^{2,3}$   -30.2%

The validation results presented so far were obtained for the mapping situation depicted

in Fig. 4.5 where the ζhard data points correspond to 8% of the χtrue grid points, which

corresponds to a realistic mapping situation. In order to obtain a visual comparison between

the estimation methods, we now consider a mapping situation where the ζhard data points

correspond to 45% of the χtrue grid points. The simulated truth is shown in Figure 4.6(a),

while the estimates obtained with method 1, method 2, and method 3 are shown in Figure

4.6(b), 4.6(c) and 4.6(d), respectively. As can be seen from these maps, method 1 captures

the dominant features of the spatial distribution of X(s) thanks to the information provided by the

χhard data; however, the mapping accuracy is poor because the ζhard data is entirely ignored.

The estimation method 2 provides another extreme by processing both χhard and ζhard as hard

data, thereby ignoring the uncertainty associated with the observation scale of the ζhard data.

This results in a map (Figure 4.6c) with a lot more fine resolution details, but of poor

mapping accuracy, as is apparent by comparing this map with the simulated truth. Finally,

the map of our proposed approach (method 3) shown in Figure 4.6(d) provides a much more

accurate representation of the truth, as can be seen by comparing it with the simulated truth.

In fact for this realization of the simulated truth, our proposed BME method results in a

77.6% MSE reduction when compared to method 1, or a 70.4% MSE reduction when

compared to method 2. This demonstrates that while our proposed method results on average

in a substantial gain of mapping accuracy over classical approaches, the gain in mapping

accuracy can be drastic for some specific mapping situations.


Figure 4.6: Maps of the simulated truth (a), compared to maps obtained with (b) method 1
using χhard as hard data, (c) method 2 using both χhard and ζhard as hard data, and (d) method 3
corresponding to our proposed BME method accounting for the effect of observation scale.

Next, we consider the spatial estimation of asthma symptom prevalence among the children

of North Carolina. This case study involves the development and implementation of the

mathematical framework outlined above, and its application to a real case study in North

Carolina. The data used in this real case study are the combination of two datasets, each

collected at a different spatial scale.

4.5. Mapping the childhood asthma prevalence across North Carolina
using data collected at different spatial observation scales

4.5.1. Introduction

Asthma is an inflammatory disease characterized by symptoms that include wheezing,

coughing, breathlessness, and chest tightness (Clark et al., 1999; Lane and Edwards, 2003). It

is known as the most common chronic childhood disease (Zmirou et al., 2004; Lewis et al.,

2005; Freeman et al., 2003; Gergen et al., 1988). Approximately 12.7% of all children (Lane

and Edwards, 2003), and about 10 million children of age under 16 (Clark et al., 1999) in

the United States suffer from current asthma symptoms. The estimated cost of treating

asthma in children younger than 18 years of age is $3.2 million per year (Weiss et al., 2000).

Some risk factors responsible for exacerbating asthma symptoms in children include

tobacco smoke, dust mite and cockroach allergens, pet dander, and household molds (Sturm

et al., 2004).

The association between air pollution exposure (e.g. PM, O3, SO2, and NO2) and

asthma prevalence has been extensively investigated (EPA Criteria pollutant document;

Clark et al., 1999; Lewis et al., 2005). While air pollutants have clearly been associated with

exacerbations of asthma (including increased symptoms, Emergency Room (ER) visits,

hospitalizations, and medication use), the association of air pollutants and increased asthma

incidence is less clear (Clark et al., 1999). However, a recent study showed an association

between asthma incidence and children exercising in high ozone areas (McConnell et al.,

2002). Furthermore, a study by Zmirou et al., 2000 investigating the association between

traffic related air pollutants and incidence of children asthmatic symptoms suggests that air

pollutants might be a potential contributor to increasing asthma prevalence in children.

African-Americans and Hispanic-Americans have a higher susceptibility to develop asthma than other

populations (Freeman et al., 2003). Individuals who regularly experience asthma symptoms

(Clark et al., 1999) and individuals who smoke (Sturm et al., 2004) are also regarded as

susceptible population groups for adverse asthma health effects.

White et al. (1994) suspect that the increase in asthma symptoms is attributable to

air pollution, and the analysis of this association is still

an emerging field. This naturally leads to the need to map the distribution of asthma

prevalence across space. Indeed highly informative asthma maps provide invaluable spatial

information that allows epidemiologists to better understand risk factors that may cause

asthma, such as air pollutants, and help identify susceptible subpopulations, such as

individuals with particular pre-existing health conditions and/or with specific smoking

behavior and socioeconomic characteristics, etc. Additionally better asthma maps are helpful

for public health intervention by not only identifying areas of high prevalence where to target

health treatment facilities for susceptible populations, but also in identifying areas where to

focus efforts on abating suspected causal agents that can be controlled.

Geostatistics provides epidemiologists with an essential spatial estimation tool that accounts for

the inherent high spatial variability of asthma prevalence, and the map it produces provides a

graphical representation of reality that is extremely useful for health research. However, few

studies on mapping asthma have been found, and existing works are mainly limited to an

exploratory visualization of existing asthma prevalence data obtained at a single observation

scale (Hernandez et al., 2000; Oyana and Lwebuga-Mukasa, 2004).

There are a variety of data sources providing asthma prevalence data that can be used in a

mapping analysis. The asthma data can be collected in a number of ways, including random

telephone surveys, questionnaire-based surveys, hospital discharge records, Medicaid claims,

etc. However what is notable is the spatial aggregation scale, or observation scale, at which

the data is reported, which may vary considerably from one data source to another.

One important reason for the difference in observation scale between data sources is that

some data sources may have confidentiality requirements that only allow them to release data

aggregated over a large spatial scale (e.g. county level) in order to protect the privacy of the

individuals who provided their health information. For example the childhood asthma

Medicaid claim data analyzed by Buescher et al. (1999) is aggregated at the county level,

which is a large spatial observation scale providing a strong protection of individual privacy

and preventing deductive disclosure. Medicaid claims provide a cost-effective source of

information because they are derived from a health system that is already in place.

However, it is not clear how well Medicaid claims data support the

estimation of asthma prevalence at a fine spatial scale. Another source of information is the

asthma data obtained from a one time school asthma surveillance project (the North Carolina

School Asthma Survey, or NCSAS), which had high quality asthma prevalence data on a fine

spatial resolution. The NCSAS database provides good quality asthma prevalence estimates

for the majority of middle schools in North Carolina, which corresponds to an observation

scale that is much smaller than that of the Medicaid data reported at the county level. As a

result, our goal is to perform an accurate mapping analysis of asthma symptom prevalence

that rigorously accounts for the high natural variability of asthma prevalence across space,

while also efficiently integrating data collected at different observation scales. Integrating

large observation scale data to obtain good estimates of asthma prevalence at a fine spatial

resolution would lead to some substantial cost savings in North Carolina because it will

enable state health departments to efficiently use data from existing systems such as

Medicaid, which would reduce the need to conduct additional costly surveillance of asthma.

Our aim in this work is to develop a conceptual mapping framework that integrates

asthma data obtained at different spatial observation scales, and to apply this framework to

improve the accuracy of maps of the childhood asthma prevalence. The framework we

develop is a novel application of the Bayesian Maximum Entropy (BME) theory of modern

Geostatistics, where we formally account for the uncertainty associated with the various

spatial observation scales corresponding to the prevalence data available. Insight is gained by

comparing the map we produce with classical maps obtained by using only data at one

observation scale, or by disregarding the scale effect. We find that by formally accounting

for the observation scale of asthma prevalence data, the map we obtain is substantially more

accurate than classical maps, leading to a more realistic representation of the spatial

distribution of the asthma prevalence among children across North Carolina, which will be

useful for epidemiologists and public health officials to plan targeted intervention efforts.

4.5.2. Theory

4.5.2.1. A review of the BME method for the mapping analysis of the childhood asthma
prevalence

In this work, the variable we are dealing with is the prevalence of asthma among children.

What we usually measure is the prevalence of the cardinal symptom of asthma (wheezing)

among children; however we will assume that the asthma symptom selected provides an

adequate observable outcome to measure the prevalence of asthma among children, and we

will refer to it as the childhood asthma prevalence. This prevalence is distributed across a

two-dimensional spatial domain, and it is defined as the count of children found to have the

asthma symptom of interest divided by the number of children surveyed over some spatial

region As (i.e. area), where the subscript s=[s1,s2] is the spatial location of the centroid for As.

The spatial region As over which the prevalence is observed has a spatial scale R

corresponding to the radius of a circle of the same surface area as As, i.e. R = (As/π)^0.5 is the

spatial observation scale of the prevalence.
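
As a brief illustration of this definition, the following Python sketch computes the observation scale R from a surface area. The function name is ours and is not part of any BME software; the example area of 1363.9 km² is the average county land area quoted in Section 4.5.3.2.

```python
import math


def observation_scale_radius(area_km2: float) -> float:
    """Radius R of the circle that has the same surface area as the region."""
    if area_km2 < 0:
        raise ValueError("area must be non-negative")
    return math.sqrt(area_km2 / math.pi)


# Average land area of a North Carolina county, as quoted in Section 4.5.3.2.
r_county = observation_scale_radius(1363.9)
print(f"R = {r_county:.1f} km")  # prints "R = 20.8 km"
```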

We define X(s) as a spatial random field (SRF) (Christakos, 1992) representing the

childhood asthma prevalence at the local scale, i.e. observed at an infinitely small spatial

scale. When restricting our attention to a set of n mapping spatial points smap=[s1, s2,…, sn],

the SRF reduces to a vector of random variables xmap=[X(s1), X(s2),…, X(sn)]. The SRF

describes the uncertainty and variability of the spatial distribution of the local scale

prevalence by means of an ensemble of realizations χmap =[χ1, χ2 , …, χn] of the random

vector xmap. The probability of a given realization χmap is calculated from the multivariate

probability density function (PDF) fX(.) of the SRF X(s) as follows

Prob[χ1 < x1 <χ1+dχ1,…, χn < xn < χn+dχn] = fX(χmap) dχ (4.36)

where Prob[.] is a probability operator. Hence the multivariate PDF fX(.) provides a complete

stochastic description of the SRF X(s) at the mapping points smap.

At the structural stage of BME analysis we use a maximum entropy information

processing rule (Christakos 2000) to obtain the multivariate PDF of X(s) on the basis of its

mean trend characterizing systematic trends in X(s)

mX(s) = E[X(s)], (4.37)

and covariance function characterizing spatial correlation between any pairs of points in X(s)

cX(s, s’) = E[ (X(s)-mX(s)) (X(s’)-mX(s’)) ], (4.38)

where E[.] is a stochastic expectation operator. Eqs. (4.37) and (4.38) constitute a general

knowledge base G from which the structural PDF obtained by maximizing entropy is

(Christakos, 2000)

fG(χmap) = φ(χmap; mmap, cmap), (4.39)

where φ (.) is the multivariate Gaussian PDF with mean vector mmap and covariance matrix

cmap calculated at the mapping points from Eqs. (4.37) and (4.38), respectively. This

structural PDF will serve as the prior PDF for the Bayesian updating performed at the

integration stage of the BME analysis.
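
To make Eqs. (4.37)-(4.39) concrete, the sketch below assembles the mean vector and covariance matrix at a few mapping points and evaluates the log of the Gaussian structural PDF. It is only an illustration under stated assumptions: the point coordinates, the zero mean trend, and the single exponential covariance with its parameters are hypothetical, and the helper names are ours.

```python
import numpy as np


def exponential_covariance(dist, sill, spatial_range):
    """Exponential covariance sill * exp(-3 r / range) (assumed model)."""
    return sill * np.exp(-3.0 * np.asarray(dist, dtype=float) / spatial_range)


def structural_moments(points, mean_fn, cov_fn):
    """Mean vector m_map and covariance matrix c_map at the mapping points."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return mean_fn(points), cov_fn(dist)


def gaussian_logpdf(chi, mean, cov):
    """Log of the multivariate Gaussian PDF phi(chi; mean, cov)."""
    resid = np.asarray(chi, dtype=float) - mean
    chol = np.linalg.cholesky(cov)      # cov = chol @ chol.T
    z = np.linalg.solve(chol, resid)    # whitened residuals
    half_log_det = np.log(np.diag(chol)).sum()
    return -0.5 * z @ z - half_log_det - 0.5 * resid.size * np.log(2.0 * np.pi)


# Hypothetical mapping points (km) and hypothetical model parameters.
s_map = [[0.0, 0.0], [10.0, 0.0], [0.0, 50.0]]
m_map, c_map = structural_moments(
    s_map,
    mean_fn=lambda pts: np.zeros(len(pts)),
    cov_fn=lambda d: exponential_covariance(d, sill=1.0, spatial_range=90.0),
)
print(gaussian_logpdf([0.1, -0.2, 0.3], m_map, c_map))
```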

At the specificatory stage of the BME analysis we assess and statistically describe the

data available for the childhood asthma prevalence. Hard data corresponds to exact measured

prevalence values χhard obtained at spatial points shard defined such that

Prob[ X(shard) =χhard] = 1. (4.40)

On the other hand, the soft data at spatial points ssoft correspond to observed values with an

associated uncertainty that can be characterized statistically by the so-called soft PDF fS(χsoft)

defined as (Christakos et al., 2001; Christakos and Serre, 2000a; Serre et al., 2005)

\[
\mathrm{Prob}[X(s_{\mathrm{soft}}) < u] = \int_{-\infty}^{u} d\chi_{\mathrm{soft}}\, f_S(\chi_{\mathrm{soft}}). \tag{4.41}
\]

At the integration stage of the BME analysis, a Bayesian conditionalization information

processing rule is applied to update the prior PDF with the site-specific knowledge base S,

which yields the posterior PDF fK(χk) describing the childhood asthma prevalence xk=X(sk)

at any estimation point sk (Christakos, 2000)

fK(χk) = A^-1 ∫ dχsoft fS(χsoft) fG(χk, χhard, χsoft), (4.42)

where A is a normalization coefficient. The posterior PDF provides a full stochastic

assessment of xk, from which we can obtain an appropriate estimated prevalence (such as the

expected value of the posterior PDF), as well as an assessment of the associated prevalence

uncertainty (such as the variance of the posterior PDF).
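
In general the integral of Eq. (4.42) has to be evaluated numerically. In the special case where every soft PDF is Gaussian, however, it has a closed form, because a Gaussian soft datum with mean μ and variance v acts like a noisy observation μ of X with error variance v. The sketch below covers that special case only. It works with mean trend removed residuals, and the function and argument names are ours, not those of an existing BME library.

```python
import numpy as np


def gaussian_bme_posterior(c_kk, c_kd, c_dd, data_resid, soft_var):
    """Posterior mean and variance of the residual at one estimation point.

    c_kk       : prior variance at the estimation point (scalar)
    c_kd       : prior covariances between estimation and data points, shape (n,)
    c_dd       : prior covariance matrix among the data points, shape (n, n)
    data_resid : hard values and soft means, mean trend removed, shape (n,)
    soft_var   : variance of each soft PDF, 0 for hard data, shape (n,)
    """
    c_kd = np.asarray(c_kd, dtype=float)
    # Gaussian soft data add their variance to the diagonal of the data covariance.
    c_obs = np.asarray(c_dd, dtype=float) + np.diag(np.asarray(soft_var, dtype=float))
    weights = np.linalg.solve(c_obs, c_kd)
    post_mean = float(weights @ np.asarray(data_resid, dtype=float))
    post_var = float(c_kk - weights @ c_kd)
    return post_mean, post_var


# Estimation point co-located with a single datum of residual value 0.2 (hypothetical).
print(gaussian_bme_posterior(1.0, [1.0], [[1.0]], [0.2], [0.0]))  # datum treated as hard
print(gaussian_bme_posterior(1.0, [1.0], [[1.0]], [0.2], [1.0]))  # datum treated as soft
```

In this toy case the hard datum is reproduced exactly with zero posterior variance, whereas the soft datum only halves the prior variance and pulls the estimate halfway toward the observed value.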

4.5.2.2. Conceptual framework for the uncertainty associated with the observation scale of
the childhood asthma prevalence

We define the observed value of X(s) over the observation region As as the SRF Z(s) given by

the following equation

Z(s) = (1/||As||) ∫_{u ∈ As} du X(u). (4.43)

In other words Z(s) corresponds to an observation of X(s) at a spatial scale R=(As/π)^0.5. In

order to analyze the relationship between the local scale SRF X(s) and the As–scale SRF Z(s),

we define the random field Y(s) as

Y(s) = X(s)-Z(s). (4.44)

Eq. (4.44) can also be written as X(s)=Z(s)+Y(s), indicating that when assessing X(s), Y(s)

acts as an additive error term to the value Z(s) observed at scale As. It follows that the

conditional PDF of X(s) given an observed value ζ for Z(s) is

fS(χs| ζ) = fY (χs-ζ ), (4.45)

where fY is the PDF for Y(s). Assuming that the local scale prevalence SRF X(s) is

normally distributed, we obtain that Z(s) and Y(s) are also

normally distributed (Lee, 2005, p. 104). Then the PDF for Y(s) is given by

fY(ψ)=φ(ψ;mY,σY²), where φ(.) is the Gaussian distribution completely defined by its mean

mY= E[Y(s)] and variance σY². Inserting fY(ψ)=φ(ψ;E[Y(s)],σY²) in Eq. (4.45), we obtain after

a change of variable

fS(χs| ζ) = φ(χs; E[Y(s)]+ζ, σY²). (4.46)

Eq. (4.46) provides a probabilistic soft datum for the local scale X at point s given an As-scale

observed value for the prevalence at point s. This soft datum for the local scale prevalence is

rigorously processed by the BME method for the mapping analysis of local scale prevalence,

which constitutes our proposed approach to integrate prevalence data at any spatial

observation scale in the mapping estimation of local scale prevalence. The problem then

becomes that of obtaining E[Y(s)] and σY² for different spatial observation scales As of

interest.

We consider the class of homogeneous SRFs X(s) with a zero mean trend and a

covariance model corresponding to the superposition of n exponential functions. This class

of SRFs provides without loss of generality a good representation of the spatial distribution

of the local scale childhood asthma prevalence. We mathematically derive the expected

value of Y(s) as E[Y(s)]=0 and its variance (Lee, 2005, pp 120-124) as

\[
\sigma_Y^2 = \sigma_X^2 - 4R^{-2} \int_0^R dr\, r \sum_{i=1}^{n} \sigma_{Xi}^2 \exp(-3r/a_{ri}) + (\pi R^2)^{-2} \int_0^R \int_0^R \int_0^{2\pi} dr\, dr'\, d\alpha\; 2\pi\, r\, r' \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\!\left(\frac{-3\, d_2(r, r', \alpha)}{a_{ri}}\right), \tag{4.47}
\]

where d2(r, r’,α) = (r² + r’² − 2rr’ cos α)^0.5, σXi² and ari are the variance and spatial range,

respectively, of each exponential covariance function, σX² is the variance of the SRF X(s),

and R is the observation spatial scale obtained as R=(As/π)^0.5. As can be seen from this

equation, the variance σY² describing the uncertainty associated with the observation scale of a

2-D circular averaging domain is a function of the variance and spatial ranges of the SRF

X(s), as well as the radius R of the averaging spatial domain characterizing the observation scale.

The linear kriging method of classical Geostatistics simply combines observed values of

X(s) and Z(s) to estimate X(s) at unsampled locations without any consideration of the scale

effects. By contrast our proposed BME mapping method uses Eq. (4.47) to generate soft data

for X(s) from observations obtained at various spatial scales.
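
The following sketch shows one way Eq. (4.47) could be evaluated numerically and turned into the Gaussian soft datum of Eq. (4.46). It is an illustration rather than the implementation used in this work: the middle term is integrated in closed form, the triple integral is approximated with a midpoint rule whose grid sizes are arbitrary choices of ours, and σX² is taken as the sum of the nested variances. The covariance parameters are those reported with Eq. (4.48), the radius is the county scale of Section 4.5.3.2, and the county value ζ is hypothetical.

```python
import numpy as np


def scale_error_variance(radius, sills, ranges, n_r=60, n_alpha=180):
    """Approximate sigma_Y^2 of Eq. (4.47) for a circular averaging domain."""
    sills = np.asarray(sills, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    sigma_x2 = sills.sum()  # variance of X(s) for nested exponential structures

    # Middle term: int_0^R r exp(-k r) dr = [1 - exp(-k R)(1 + k R)] / k^2, k = 3/a.
    k = 3.0 / ranges
    radial = (1.0 - np.exp(-k * radius) * (1.0 + k * radius)) / k ** 2
    cross_term = 4.0 / radius ** 2 * (sills * radial).sum()

    # Last term: midpoint rule on a regular (r, r', alpha) grid.
    dr = radius / n_r
    dalpha = 2.0 * np.pi / n_alpha
    r = (np.arange(n_r) + 0.5) * dr
    alpha = (np.arange(n_alpha) + 0.5) * dalpha
    r1, r2, a = np.meshgrid(r, r, alpha, indexing="ij")
    dist = np.sqrt(np.maximum(r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * np.cos(a), 0.0))
    cov = sum(c * np.exp(-3.0 * dist / rng) for c, rng in zip(sills, ranges))
    integral = (2.0 * np.pi * r1 * r2 * cov).sum() * dr * dr * dalpha
    var_z = integral / (np.pi * radius ** 2) ** 2

    return sigma_x2 - cross_term + var_z


sigma_x2 = 0.0055
sigma_y2 = scale_error_variance(20.8, [0.9 * sigma_x2, 0.1 * sigma_x2], [89.6, 448.0])

# Soft datum of Eq. (4.46) with E[Y(s)] = 0: Gaussian with mean zeta and variance sigma_Y^2.
zeta = 0.10  # hypothetical county-scale prevalence value
soft_mean, soft_var = zeta + 0.0, sigma_y2
print(soft_mean, soft_var)
```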

4.5.2.3. Quantifying the improvement in the mapping accuracy of the childhood asthma
prevalence resulting from the integration of spatial observation scale uncertainty

Validation procedures provide the tools needed to quantify the gain in mapping accuracy that

our proposed approach provides over an approach not accounting for the effect of spatial

observation scale when mapping the childhood asthma prevalence. Let χtrue and ζtrue denote

the available data measured at the local scale, and at some observation scale R, respectively.

We randomly select χhard and χval from χtrue subject to χtrue=χval ∪ χhard, and we obtain ζhard by

typically selecting all of the data ζtrue. The validation procedure consists in using only the data

χhard and ζhard to obtain estimates χval* of the local scale SRF X(s) at the validation point

locations where the truth χval is known. The validation estimation errors are then simply

obtained as the difference εval*=χval-χval* between true and estimated values, and their mean

square error (MSE) provides a measure of the estimation error of the estimation method used

to obtain χval*.
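
The validation steps described above can be sketched as follows. The helper names, the fixed random seed, and the toy numbers are ours and purely illustrative; any estimation method can supply the estimates χval*.

```python
import numpy as np


def split_hard_validation(n_data, n_val, seed=0):
    """Randomly split data indices into a hard set and a validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_data)
    return idx[n_val:], idx[:n_val]  # (hard indices, validation indices)


def mean_square_error(true_values, estimates):
    """MSE of the validation errors, true value minus estimated value."""
    err = np.asarray(true_values, dtype=float) - np.asarray(estimates, dtype=float)
    return float(np.mean(err ** 2))


hard_idx, val_idx = split_hard_validation(n_data=10, n_val=3)
print(sorted(hard_idx.tolist()), sorted(val_idx.tolist()))
print(mean_square_error([1.0, 2.0], [1.0, 4.0]))  # toy values
```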

In this study we compare the mapping accuracy of 3 different mapping methods. Method

1 consists in the simple kriging (SK) method of classical Geostatistics using only χhard as

hard data. Method 2 also consists in the SK method, but using both χhard and ζhard as hard

data (i.e. ignoring the observation scale uncertainty of the ζhard data). Finally method 3

consists in the BME method proposed in this work, which uses χhard as hard data, and uses

ζhard and the corresponding observation scale R to generate some soft data χsoft in terms of the

conditional PDF fS(χsoft | ζhard, R) (Eqs. 4.46 and 4.47). As a result our Method 3 fully

accounts for the observation scale effect, which is compared to the two extreme classical

approaches not accounting for observation scale: Method 1 which ignores ζhard entirely, and

method 2 which treats it as if it was hard data (i.e. as if the observation scale was not

introducing any uncertainty).

It should be noted that the so called cross-validation procedure is a slight modification of

the validation procedure that is widely used in practice, so we will also use this procedure to

compare methods 1, 2 and 3. In the cross validation procedure, the ζhard data remains

unchanged, while the validation data χxval corresponds to the whole dataset χtrue available, i.e.

χxval =χtrue. Then cross validation estimates χxval* are obtained by excluding in turn each

validation point, and re-estimating it from the surrounding data. The cross-validation MSE

are finally obtained on the basis of the cross-validation errors εxval*=χxval-χxval*. Hence the cross-

validation procedure provides an additional metric to compare methods 1, 2 and 3.
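
A minimal sketch of the cross-validation loop is given below for the simplest case of method 1, i.e. simple kriging of mean trend removed hard data only. The function names, coordinates, and covariance parameters are hypothetical; handling the ζhard data (methods 2 and 3) would require extending the estimator and is not shown.

```python
import numpy as np


def pairwise_distance(a, b):
    """Euclidean distances between two sets of points, shape (len(a), len(b))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))


def simple_kriging(target, coords, residuals, cov_fn):
    """Simple kriging estimate of a zero-mean residual field at one target point."""
    c_dd = cov_fn(pairwise_distance(coords, coords))
    c_kd = cov_fn(pairwise_distance(target[None, :], coords))[0]
    weights = np.linalg.solve(c_dd, c_kd)
    return float(weights @ residuals)


def loo_cross_validation_mse(coords, residuals, cov_fn):
    """Remove each datum in turn, re-estimate it from the others, return the MSE."""
    coords = np.asarray(coords, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    errors = []
    for i in range(len(residuals)):
        keep = np.arange(len(residuals)) != i
        estimate = simple_kriging(coords[i], coords[keep], residuals[keep], cov_fn)
        errors.append(residuals[i] - estimate)
    return float(np.mean(np.square(errors)))


# Hypothetical residual data (km coordinates) and exponential covariance.
cov = lambda d: 0.0055 * np.exp(-3.0 * d / 89.6)
xy = [[0.0, 0.0], [15.0, 5.0], [40.0, 10.0], [5.0, 60.0]]
z = [0.02, 0.01, -0.03, 0.00]
print(loo_cross_validation_mse(xy, z, cov))
```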

4.5.3. Data

We have obtained two datasets with data on the childhood asthma prevalence across North

Carolina. The first dataset was based on a middle school based survey using questionnaires,

while the second dataset used Medicaid claim data. Another source of asthma data is

available from the North Carolina Behavioral Risk Factor Surveillance System (BRFSS)

collected using a random telephone survey. However this state-level asthma dataset includes

no stratification by age of children, and was therefore not used in this work.

4.5.3.1. The North Carolina School Asthma Survey database

The first dataset consists in children asthma health outcomes collected as a part of the North

Carolina School Asthma Survey (NCSAS) (Yeatts et al., 2004; Sturm et al., 2004). The

NCSAS is a collaborative program between the North Carolina Department of Health and

Human Services, the North Carolina Department of Public Instruction, and the Department of

Epidemiology in the University of North Carolina at Chapel Hill. This survey collected

information on the breathing status of students enrolled in public 7th and 8th grades (i.e. age

of 13-14) in the 1999-2000 academic school year. 565 public middle schools (for a total of

192,248 enrolled students) were asked to participate in the survey, leading to the

participation of 499 schools in the survey. We obtained data from approximately 128,556

students (i.e. 66.9% of the student population) in 493 schools (i.e. 87.3% of the school

population) regarding the prevalence of asthma symptoms among the children of North

Carolina.

The NCSAS questionnaire included internationally standardized and validated questions

from the International Survey of Asthma and Allergies in Childhood (ISAAC) consisting of

written and video types of questions. While the NCSAS provides several relevant asthma

variables for each student, the variable we used, named “current wheezing symptom”, which

characterizes the occurrence of asthma, was recorded as a value of 1 for children who said

“yes” to any one of four video questions describing 1) wheezing during the day, 2) wheezing

induced by exercise, 3) wheezing at night, or 4) a severe wheezing attack. Using this variable,

we calculated for each of 493 schools the asthma prevalence among children by dividing the

number of children who answered yes by the total number of students surveyed in that school.

For illustration purposes, we show in Figure 4.7(a) a graduated color plot of the childhood

asthma prevalence data obtained from this dataset.
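
The school-level calculation, together with a check of the participation percentages quoted above, can be sketched as follows. The per-school counts in the example are hypothetical; only the enrollment and participation totals come from the text.

```python
def school_prevalence(n_yes: int, n_surveyed: int) -> float:
    """Fraction of surveyed students reporting the current wheezing symptom."""
    if n_surveyed <= 0:
        raise ValueError("at least one student must be surveyed")
    if not 0 <= n_yes <= n_surveyed:
        raise ValueError("n_yes must lie between 0 and n_surveyed")
    return n_yes / n_surveyed


# Hypothetical school: 42 "yes" answers among 300 surveyed students.
print(school_prevalence(42, 300))

# Participation figures quoted in the text.
student_rate = 100.0 * 128556 / 192248
school_rate = 100.0 * 493 / 565
print(round(student_rate, 1), round(school_rate, 1))  # prints "66.9 87.3"
```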

Because of the almost-exhaustive nature and the good data quality of the NCSAS dataset,

the data it provides on the prevalence of asthma symptoms among children enrolled in public

7th and 8th grades in North Carolina can reasonably be considered exact measurements of

the childhood asthma prevalence. Furthermore, the observation scale for this prevalence

data corresponds to that of middle schools, which have a very small geographical extent

relative to that of, for example, a county. Indeed half of the average distance between

schools in North Carolina and their closest neighbor is approximately 3 kilometers (km), so

that for the average school the maximum distance that children travel to go to school is

on the order of 3 km. Since the children population is generally clustered around schools, the

median travel distance to school must be much less than its maximum of 3 km, in the order

of a fraction of the kilometer scale. If we add the fact that children do spend a portion of

their day on the premises of the school itself, we can safely conclude the NCSAS data

obtained at the school observation scale can reasonably be conceptualized as providing exact

measurements of the childhood asthma prevalence observed at the local scale, i.e. this dataset

provides hard data for the SRF X(s).

4.5.3.2. The county-level database of Medicaid-enrolled children suffering from asthma

Buescher et al. (1999) published a document including data on Medicaid claims due to asthma

in North Carolina during the state fiscal year 1997-1998. The number of childhood asthma

cases in each county was recorded by counting the Medicaid-enrolled children of age 0 to 14

who suffered from asthma. According to the study report, the Medicaid-enrolled children

suffering from asthma were identified on the basis of paid Medicaid claims with a diagnosis

of asthma as well as with prescription drugs used for treating asthma. They then obtained the

fraction of Medicaid-enrolled children suffering from asthma for each of the 100 counties in

North Carolina by dividing the number of Medicaid-enrolled children with asthma claims by

the total number of Medicaid-enrolled children claims in each county. The location we assign

for each of these fractions is the centroid of the county for which the fraction is calculated,

and we show these data in Figure 4.7(b) using a graduated color plot.

The average land area for counties in North Carolina is 1363.9 km², which corresponds to

a radius of about 20.8 km if we assume that counties can be approximated with circles of the same

surface area. This spatial scale of about 20.8 km is substantially larger than that of the

NCSAS data collected at the school level, which as discussed above is believed to be on the

order of a fraction of the kilometer scale. This statement is also strengthened by the fact that

most children live close to their school, with few children living far from their school,

whereas Medicaid-enrolled children can be assumed to have a much more uniform spatial

distribution across the whole county. Therefore we define the fraction of Medicaid-enrolled

children with asthma in a particular county as a measurement of the SRF Z(s) observed at the

county spatial scale. In other words we conceptualize the Medicaid data shown in Figure

4.7(b) as being observations of the local scale childhood asthma prevalence (the NCSAS data

shown in Figure 4.7a) averaged at the county spatial scale. Indeed, as can be seen from

Figure 4.7, the Medicaid data are smoother than the NCSAS data, which is consistent with

our hypothesis that one corresponds to the aggregation of the other at a larger spatial scale.

However a limitation of the Medicaid dataset for the inference of the childhood asthma

prevalence is that the Medicaid-enrolled children population is only a subgroup of the total

children population, and biases may exist at the local scale. Furthermore the Medicaid data

was obtained in 1997-1998 while the NCSAS was obtained in 1999-2000. Nevertheless we

hypothesize that the local-scale deviations in asthma prevalence between the Medicaid and

NCSAS datasets average out at the county spatial scale. As will be shown in our cross

validation results, when accounting for the scale effect then the Medicaid data does improve

the estimation of the asthma prevalence reported in the NCSAS dataset, which confirms our

hypothesis that the Medicaid data provides an adequate measurement of NCSAS asthma

prevalence aggregated at the county scale.

As a result our aim is now to estimate the spatial distribution of the (local scale)

childhood asthma prevalence X(s) using the NCSAS dataset providing exact measurements

of X(s) at the location of 493 schools in North Carolina, and the Medicaid dataset providing

(almost) exact measurements of the county-scale Z(s) at the centroid of 100 counties across

North Carolina.


Figure 4.7: Map showing (a) the data on asthma symptoms prevalence among middle school
children (age 13-14) reported in the NCSAS database for most of NC schools, and (b) the
county level asthma prevalence data extracted from the database of Medicaid-enrolled
children age 0-14 years who suffered from asthma. The prevalence is expressed as a fraction
(i.e. average childhood asthma cases per 1 child) according to the color bar next to each map.

4.5.4. Results

4.5.4.1 Trends and variability in the spatial distribution of local scale asthma prevalence
among children

The SRF X(s) represents the distribution across space of the prevalence of asthma among

children observed at the local scale. Its mean trend function mX(s) (Eq. 4.37) provides a

model for the systematic trends and consistent spatial structures of the childhood asthma

prevalence across space, while its covariance function cX(s,s’) (Eq. 4.38) describes the

inherent spatial variability of the childhood asthma prevalence.

As discussed in the data section, it is reasonable to use each NCSAS datum as an exact

measurement of the local scale childhood asthma prevalence for the spatial location of each

school in North Carolina (Figure 4.7a). Hence we obtain the local scale mean trend function

mX(s) using a moving window average of the NCSAS data with an exponentially decaying

filter. This leads to the mean trend function shown in Figure 4.8(a). As can be

seen from this figure, the mean trend of asthma prevalence among children in North Carolina

has a slightly higher prevalence along the eastern coast of North Carolina, and it decreases

almost linearly from East to West. This mean trend function can be linearized within each

county, and as a result it is valid at the county observation scale as well. In other words, the

trend shown in Figure 4.8(a) is the mean trend of the local scale asthma prevalence field, as

well as the asthma prevalence field observed at the county spatial scale, i.e. mZ(s)=mX(s). A

useful implication is that the framework presented in the theory section to integrate data

obtained at different spatial observation scales is valid not only for the X(s) and Z(s) SRFs,

but also for the mean trend removed residual fields X’(s)=X(s)-mX(s) and Z’(s)=Z(s)-mZ(s)

(since mZ(s)=mX(s)). We will therefore apply our framework for the integration of data

observed at different spatial scales to the residual fields X’(s) and Z’(s).
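
The moving window average with an exponentially decaying filter mentioned above can be sketched as below. The filter parameters used in this work are not reproduced here, so the decay length, window radius, coordinates, and prevalence values in the example are all hypothetical, as is the function name. Note that a target with no data inside its window yields a division by zero (NaN) in this simple version.

```python
import numpy as np


def exponential_window_mean(targets, coords, values, decay_length, window_radius):
    """Weighted moving average with weights exp(-distance / decay_length)."""
    targets = np.atleast_2d(np.asarray(targets, dtype=float))
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    diff = targets[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Data outside the moving window receive zero weight.
    weights = np.exp(-dist / decay_length) * (dist <= window_radius)
    return (weights @ values) / weights.sum(axis=1)


# Hypothetical school locations (km) and prevalence values.
schools = np.array([[0.0, 0.0], [10.0, 0.0], [80.0, 20.0]])
prevalence = np.array([0.10, 0.30, 0.20])

trend_at_schools = exponential_window_mean(schools, schools, prevalence, 50.0, 150.0)
residuals = prevalence - trend_at_schools  # X'(s) = X(s) - mX(s)
print(trend_at_schools, residuals)
```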


Figure 4.8: (a) Map of the local scale mean trend mX(s) of the childhood asthma prevalence
(fraction of prevalent asthma cases), and (b) plot of the covariance of the mean trend-
removed local scale childhood asthma prevalence SRF X’(s).

Experimental values for the covariance of the residual field X’(s) were estimated from

residual prevalence data obtained by subtracting the mean trend mX(s) (Figure 4.8a) from the

NCSAS prevalence data (Figure 4.7a). We then fit to these experimental covariance values

the following covariance model

\[
c_X(r = |s' - s|) = c_{01} \exp\!\left(\frac{-3r}{a_{r1}}\right) + c_{02} \exp\!\left(\frac{-3r}{a_{r2}}\right), \tag{4.48}
\]

where c01 = 0.9 × σX², c02 = 0.1 × σX², σX² = 0.0055 (average number of asthma cases per 1

child)², ar1 = 89.6 km, and ar2 = 448 km. As can be seen from Figure 4.8(b), there is a good fit

between the covariance model of Eq. (4.48) and the experimental covariance values obtained

from the residual data observed at the local scale. The covariance model indicates that about

90 percent of the variability of the local scale childhood asthma prevalence has a spatial

range (e.g. spatial clustering) of 89.6 km, while the remaining 10 percent of variability has a

much larger spatial range (clustering) of 448 km. This interesting finding indicates that the

prevalence of asthma among children observed at a small scale (i.e. at the spatial scale

corresponding to the children population serviced by a middle school) has a spatial distribution

that is not random; instead, it is spatially organized in the nesting of spatial structures

(clustering) of two sizes, one of about 89.6 km in size explaining 90 percent of the overall

asthma prevalence variability, and the other of about 448 km in size explaining 10 percent of

the variability. The explanation for this spatial organization of local scale asthma prevalence

may be manifold, and provides the basis for hypothesis generation that may be tested in

future works. The first possible explanation of spatial clustering of the childhood asthma

prevalence may be that it is a result of the observation scale at which the prevalence is

observed. However the NCSAS asthma prevalence data is observed at the spatial scale of the

children served by a single school, and conceivably a majority of the children served by one

school live in a radius that is much smaller than 89.6 km, so that this rather small observation

scale alone cannot explain the larger spatial scales of spatial clustering identified in the

covariance analysis. An additional explanation that then naturally arises is that the

prevalence of asthma among children is influenced by underlying factors that are themselves

organized in space. One such factor may be the characteristics of the children population (i.e.

ethnic make-up, socio-economic status, dietary habits, proportion of children with higher

asthmatic susceptibility, etc.) that may themselves have a spatial structure corresponding to

the 89.6 km spatial scale. Another factor may be the exposure to environmental pollutants

suspected to cause asthma, such as airborne particulate matter, ozone and lead, which may

have spatial ranges in excess of 448 km (e.g. Christakos and Serre, 2000a).
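
For reference, the nested covariance model of Eq. (4.48) with the fitted parameters reported above can be evaluated with the short sketch below. The function name and the lag values are ours.

```python
import numpy as np

SIGMA_X2 = 0.0055                            # variance of the residual field X'(s)
SILLS = (0.9 * SIGMA_X2, 0.1 * SIGMA_X2)     # c01, c02
RANGES_KM = (89.6, 448.0)                    # ar1, ar2


def nested_exponential_covariance(r, sills=SILLS, ranges=RANGES_KM):
    """Covariance of Eq. (4.48) at spatial lag r (km)."""
    r = np.asarray(r, dtype=float)
    return sum(c * np.exp(-3.0 * r / a) for c, a in zip(sills, ranges))


lags = np.array([0.0, 20.8, 89.6, 448.0])
print(nested_exponential_covariance(lags))
```

At a lag equal to its range, each exponential structure has decayed to exp(-3), about 5% of its own variance, which is why ar1 and ar2 can be read as the distances beyond which each structure contributes almost no correlation.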

The mean trend function and covariance model provide the general knowledge base

processed at the prior stage of the BME analysis. Next we present the asthma prevalence

maps obtained at the posterior stage of the BME analysis by integrating asthma prevalence

data obtained at different observation scales.

4.5.4.2 Maps of the childhood asthma prevalence obtained using data collected at different
observation scales

We obtain maps describing the spatial distribution of the childhood asthma prevalence across

North Carolina using three estimation methods. Each estimation method uses the same

general knowledge base consisting in the mean trend function and covariance model

presented above. This general knowledge base is processed at the structural stage of the

BME analysis and leads to a prior PDF characterizing the general characteristics (systematic

trends, spatial variability) of the spatial distribution of the childhood asthma prevalence

observed at the local scale (i.e. at the spatial scale of middle schools). Then at the integration

stage of the BME analysis, each method uses a Bayesian conditionalization knowledge

processing rule to update the prior PDF by considering a different site specific knowledge

base, leading to different maps of the estimated childhood asthma prevalence across North

Carolina.

The first estimation method (method 1) considers the NCSAS data as hard (exact)

measurements of the childhood asthma prevalence observed at the local scale. This

estimation method ignores entirely the Medicaid childhood asthma prevalence data collected

at the county observation scale. Using this restricted site specific knowledge base, we update

the prior PDF at each node of a regular estimation grid covering the state of North Carolina.

We thereby obtain a BME posterior PDF at each of these estimation points, from which we

select the expected value as the so-called BME mean estimate, and the variance as an

assessment of the associated mapping uncertainty. The map of the BME mean estimate for

method 1 is shown in Figure 4.9(a), and the map of the associated uncertainty is shown in

Figure 4.10(a). As can be seen from these figures, the map obtained interpolates the NCSAS

data over all non-surveyed areas of North Carolina, with a mapping uncertainty that is zero at

the spatial location of each of the NCSAS middle schools, and increases away from these

surveyed locations. We note that because the site specific knowledge base is restricted to

only include hard data, the BME estimate of method 1 reduces to the simple kriging

estimator of classical Geostatistics. Hence method 1 corresponds to the simple kriging

method accounting only for data obtained at the local scale, and we can compare this baseline

method against other methods that attempt to integrate the additional information provided

by the Medicaid childhood asthma prevalence data available at the county observation scale.


Figure 4.9: Maps of the BME mean estimate of children asthmatic symptom prevalence
(average number of cases per 1 child) observed at the school spatial scale across North
Carolina. These maps were obtained using (a) method 1, (b) method 2, and (c) method 3.

In the second estimation method (method 2), we consider both the NCSAS and Medicaid

data as if they were exact measurements (hard data) of the childhood asthma prevalence

observed at the local scale. In other words this estimation method corresponds to using the

simple kriging estimator on the combined NCSAS and Medicaid data without recognizing

that these data were obtained at different observation scales. By ignoring the scale effect for

the Medicaid data, method 2 underestimates the uncertainty associated with the large

observation scale of that dataset. The map of BME mean estimate obtained from method 2 is

shown in Figure 4.9(b). As can be seen from this figure, the map integrates more details in

the spatial distribution of the childhood asthma prevalence because the combined dataset is

larger, leading to a spatial estimate that is quite different than that obtained with method 1.

The substantial difference between the maps of method 1 and method 2 is the main point we

are making here. Whether the map of method 2 is any more accurate than that obtained with

method 1 is an issue that we will address later in the cross-validation section. Suffice it to say

that method 2 wrongly assumes that the scale effect of the Medicaid data can be ignored,

leading to the erroneous belief that the uncertainty associated with the map of method 2 is

zero at the centroid of each county where the Medicaid data points are reported. As a result,

method 2 is unable to provide a correct assessment of the uncertainty associated with its

spatial estimate shown in Figure 4.9(b).


Figure 4.10: Maps of the BME posterior variance ([average asthma counts per 1 child]²)
obtained with (a) method 1 and (b) method 3, which provides an assessment of the
uncertainty associated with the BME mean estimate maps shown in Figure 4.9 (a) and (c),
respectively.

On the other hand method 3 corresponds to our proposed approach which accounts for

the scale effect by formally processing the uncertainty associated with the observation scale

of the Medicaid data. As explained in the theory section, we have developed for this method

a mathematical formulation for the error variance (Eq. 4.47) resulting from the spatial scale

at which the Medicaid data is observed. Using our proposed framework, the NCSAS data is

processed as hard data, while the Medicaid data is used to generate soft data with an

uncertainty calculated as a function of the corresponding observation scale. The map of the

BME mean estimate for method 3 is shown in Figure 4.9(c), and the map of the associated

uncertainty is shown in Figure 4.10(b). As can be seen from these figures, method 3

integrates both datasets, extracting all the information provided by the NCSAS data obtained

at the local scale, and using the Medicaid data as an approximate guess of the local scale

childhood asthma prevalence away from the NCSAS data points. The resulting map has

more spatial details than the map of method 1, yet it is smoother than the map of method 2.

The map of the associated mapping uncertainty shows that the uncertainty is zero at the

NCSAS middle school locations, that it is small but non zero at the centroid of counties for

which the Medicaid data is available, and that it increases away from these points. Both

these features result in a more realistic representation of the local scale childhood asthma

prevalence than that obtained from either method 1 or 2.

The results presented so far illustrate that by formally accounting for the scale effect of

the childhood asthma prevalence data, our proposed framework (method 3) generates a map

describing the spatial distribution of the childhood asthma prevalence that is substantially

different and more realistic than maps obtained using methods not accounting for the scale

effect. We now investigate whether this more realistic map is also substantially more

accurate than the maps of methods 1 or 2.

4.5.4.3 Cross-validation results

We use a cross validation procedure to compare the accuracy of the maps obtained using

estimation methods 1, 2 and 3 in terms of their cross validation mean square error (MSE).

Each datum of the NCSAS dataset representing an exact measurement of the childhood

asthma prevalence observed at the spatial scale of middle schools is removed from the data, and

re-estimated on the basis of the remaining NCSAS and Medicaid data. The cross validation

error is then simply obtained by subtracting from each cross-validation estimate the exact

measurement that was set aside. Using this procedure we obtain cross-validation errors for

each estimation method, from which the cross-validation MSE is calculated. The results of

this cross validation procedure are shown in Table 4.4. As can be seen from this table,

somewhat surprisingly, method 2 does not provide any improvement of mapping accuracy

over method 1. In fact the MSE for method 2 is slightly higher than that of method 1. This

result provides a striking illustration of what may happen when one attempts to mix in data

obtained at different observation scales without consideration of the scale effect, as is the

case for the naïve approach used in method 2. Indeed, even though method 2 seems to

provide more spatial details about the distribution of the asthma prevalence among children

across North Carolina, these details are actually erroneous because they do not account for

the uncertainty associated with the large observation scale of the Medicaid data. On the other

hand our proposed BME approach (method 3) has an MSE that is substantially smaller than

that of either method 1 or method 2. The sound conceptual framework we have developed in

this work to integrate data obtained at different observation scales leads to a 10.2% decrease

in cross-validation MSE relative to method 1, and an 11.6% decrease relative to method 2.

This demonstrates that our proposed approach leads to a map of the childhood asthma

prevalence across North Carolina that is more realistic and more accurate than those obtained

by methods that do not account for the scale effect.
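The leave-one-out procedure described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: it handles hard data only, the data are synthetic, and the estimator `idw_estimate` is a hypothetical inverse-distance-weighting placeholder standing in for simple kriging or BME, neither of which is implemented here.

```python
import numpy as np


def idw_estimate(train_xy, train_z, target_xy, power=2.0):
    """Placeholder estimator (inverse-distance weighting); NOT kriging or BME."""
    dist = np.linalg.norm(train_xy - target_xy, axis=1)
    if np.any(dist == 0.0):
        # A training point coincides with the target: return its value.
        return float(train_z[dist == 0.0].mean())
    weights = 1.0 / dist**power
    return float(np.sum(weights * train_z) / np.sum(weights))


def loo_cross_validation_mse(xy, z, estimator):
    """Set each datum aside in turn, re-estimate it, and return the MSE."""
    n = len(z)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        z_hat = estimator(xy[keep], z[keep], xy[i])
        errors[i] = z_hat - z[i]  # estimate minus the value set aside
    return float(np.mean(errors**2))


rng = np.random.default_rng(0)
xy = rng.uniform(0.0, 1.0, size=(50, 2))                # synthetic locations
z = np.sin(3.0 * xy[:, 0]) + 0.1 * rng.normal(size=50)  # synthetic values
mse = loo_cross_validation_mse(xy, z, idw_estimate)
print(f"leave-one-out MSE: {mse:.4f}")
```

In the actual analysis the soft (Medicaid) data would remain available to the estimator at every step; only the hard NCSAS datum being tested is set aside.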

Table 4.4: Cross-validation results showing the cross-validation MSE for methods 1, 2 and 3,
and the change in cross-validation MSE between method 1 and method 3, as well as between
method 2 and method 3.
              Method 1              Method 2               Method 3
              (simple kriging I)    (simple kriging II)    (BME)
MSE           0.040638              0.041293               0.036490
rMSE (1,3)                                                 -10.206%
rMSE (2,3)                                                 -11.630%

The cross validation procedure compares the accuracy of the estimation methods when

one data point is removed at a time. This comparison quantifies the gain in accuracy for the

current mapping situation, i.e. we can say that the childhood asthma prevalence map

produced in this work (the method 3 map of Figure 4.9c) has a cross-validation MSE at least 10% lower than that of

maps that may have been produced to date using the traditional approach of method 1 or

method 2. Another comparison that is often used in practice to compare estimation methods

is a validation procedure, which compares the mapping accuracy under other mapping

situations by removing several data points at once. We present next the validation results for

a selected mapping situation of interest.

4.5.4.4 Validation results

The validation procedure that we implement consists in removing 30% of the NCSAS data at

once, and re-estimating the childhood asthma prevalence for these points using the remaining

NCSAS data as well as the Medicaid data. We then subtract from these validation estimates

the exact measured values that were set aside, thereby obtaining validation errors from which

we obtain the validation MSE. The validation MSE obtained using this procedure for

estimation methods 1, 2 and 3 are shown in Table 4.5. As can be seen from this table, when

removing 30% of the NCSAS data, method 2 is slightly more accurate than method 1, and,

more importantly, our proposed BME approach (method 3) is at least 20% more accurate

than either method 1 or method 2. This means that our proposed method provides a powerful

conceptual framework to integrate data obtained at different observation scales for a wide

range of mapping situations.

Table 4.5: Validation results obtained when selecting a random validation set consisting of
30% of the NCSAS data. The table shows the validation MSE obtained for methods 1, 2 and
3, and the change in validation MSE between method 1 and method 3, as well as between
method 2 and method 3.
              Method 1              Method 2               Method 3
              (simple kriging I)    (simple kriging II)    (BME)
MSE           0.0098939             0.0096997              0.0076670
rMSE (1,3)                                                 -22.508%
rMSE (2,3)                                                 -20.957%
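The relative changes reported in Tables 4.4 and 4.5 can be reproduced from the tabulated MSE values. The sketch below assumes that rMSE between method k and method 3 is defined as (MSE3 - MSEk)/MSEk expressed in percent; that definition is not stated explicitly in the text, but it agrees with the tabulated figures to within rounding.

```python
def relative_mse_change(mse_ref, mse_new):
    """Percent change in MSE when moving from the reference method to the new one."""
    return 100.0 * (mse_new - mse_ref) / mse_ref


# MSE values copied from Table 4.4 (cross-validation) and Table 4.5 (validation).
cross_validation = {"method1": 0.040638, "method2": 0.041293, "method3": 0.036490}
validation = {"method1": 0.0098939, "method2": 0.0096997, "method3": 0.0076670}

results = {}
for label, mse in (("cross-validation", cross_validation), ("validation", validation)):
    for ref in ("method1", "method2"):
        results[(label, ref)] = relative_mse_change(mse[ref], mse["method3"])
        print(f"{label}: {ref} -> method3: {results[(label, ref)]:.3f}%")
```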

4.5.5. Conclusions

Asthma is an adverse health condition of emerging concern for children. Maps showing the

spatial distribution of the asthma prevalence among children are vital to better understand

what may cause the disease and to improve its public health response in order to protect the

health of children. However mapping the childhood asthma prevalence is complicated by the

fact that data is often available at a variety of spatial scales. This is particularly the case

because several data sources have confidentiality requirements that only allow release of

information aggregated over spatial scales that are sufficiently large to ensure the privacy of

the individuals who provided their health information.

We develop in this work a rigorous mathematical framework to map the spatial

distribution of the childhood asthma prevalence by integrating data collected at different

spatial observation scales, and we apply this framework to a real case study in North Carolina

using two datasets obtained at two substantially different observation scales. We constructed

our first dataset of the childhood asthma prevalence using the North Carolina School Asthma

Survey data that was collected as part of a previous study of one of the co-authors (Yeatts et

al., 2004; Sturm et al., 2004). By aggregating the NCSAS data at the high school spatial

scale using good quality information on the prevalence of asthma symptoms among 7th and 8th

graders, we obtained a dataset that can essentially be treated as exact measurements of the

childhood asthma prevalence observed at the local scale for each of 493 high-schools which

participated in the NCSAS study. While this first dataset provides a rich set of point

measurements, it is inherently providing a sparse spatial coverage of North Carolina. Hence

we also included in the mapping analysis a second dataset consisting of the childhood asthma

prevalence calculated on the basis of Medicaid-claims aggregated at the county spatial scale

(Buescher et al., 1999). While this dataset presents some limitations due to biases connected

with the Medicaid-enrolled children population, we hypothesized that local errors in the

Medicaid data may average out at the county spatial scale, so that this dataset provides useful

information as long as the scale effect is adequately accounted for.

The conceptual framework we develop in this work provides a rigorous mathematical

formulation for the uncertainty associated with the spatial scale at which asthma prevalence

data are observed. Using this framework, the NCSAS data is processed as hard data, while

the Medicaid children data is used to generate soft data with an uncertainty corresponding to

the county spatial scale at which this data is reported. These combined hard and soft data are

then rigorously processed using the Bayesian Maximum Entropy method of modern

Geostatistics, leading to an accurate estimation of the spatial distribution of the childhood

asthma prevalence across North Carolina.

We find that the map we obtain is substantially more realistic and accurate than the

classical map obtained by ignoring entirely the county level data, or the classical map

obtained by integrating the county level data without consideration of its observation scale.

Results from our cross-validation analysis indicate that the childhood asthma prevalence

map we generate for North Carolina has a mapping error variance that is at least 10%

smaller than that of the classical maps obtained when ignoring the scale effect. Furthermore

a validation analysis indicates that under other mapping situations the drop in mapping

estimation error can be in excess of 20% over the classical approaches not accounting for the

scale effect. This means that our proposed method provides a powerful conceptual

framework to integrate data obtained at different observation scales for a wide range of

asthma mapping situations.

This work provides a methodological advance that will lead to an improved assessment

of the spatial distribution of the asthma prevalence among children nationwide, and by

applying this new method we obtain the most accurate map created to date for the spatial

distribution of the childhood asthma prevalence across North Carolina. These contributions

will be very useful to improve our understanding of possible associations between asthma

and causal risk factors such as air pollutants, and will be critical to improve asthma public

health intervention for children nationwide. Furthermore by demonstrating how existing

sources of asthma data such as Medicaid claims can be used to obtain good estimates of the

childhood asthma prevalence at a fine spatial resolution, this work will reduce the need of

costly programs dedicated to asthma surveillance, so that state health departments’ limited

resources can be more efficiently used for public health interventions and reduction of

childhood asthma morbidity.

V. CONCLUDING REMARKS

The linear kriging methods of classical Geostatistics (i.e. simple kriging, co-kriging, etc.)

have gained considerable popularity in environmental mapping applications to estimate an

environmental contaminant variable of interest at unsampled locations. However, these

estimation methods have considerable well documented limitations (i.e. linear estimation,

Gaussian assumptions, exact measurements, etc.), and as a result they lack the theoretical

underpinnings and practical flexibility needed to incorporate the wide variety of knowledge

bases available in modern environmental and health mapping applications, which include

information about the uncertainty associated with the data available.

On the other hand the powerful BME mapping method of modern spatiotemporal

Geostatistics is a non-linear estimation method that overcomes the limitations of the classical

Geostatistics by comprehensively assimilating a wide variety of physical knowledge bases,

including data uncertainty. The data uncertainty prevalent in environmental and health

applications has been recognized as critical information that needs to be formally modeled in

order to increase the accuracy of estimated maps. In this work we investigate three important

types of uncertainty for environmental and health processes, and we develop the framework

to account for these types of uncertainty in terms of relevant soft PDF. The models of soft

data we generate are then used in real world case studies, resulting in three environmental

and health mapping applications. In each mapping application, the data uncertainty is
successfully identified and expressed in terms of the proper soft data model, and rigorously

processed using the powerful BME mapping method.

In the first mapping situation considered the data uncertainty originates from varying

levels of measurement errors in the analysis of groundwater arsenic contamination in New

England. We develop a measurement error model specifically for arsenic analyses, and we

successfully validate this measurement error model by comparing the uncertainty it predicts

with that obtained from a covariance analysis. The measurement error model then allows us

to obtain probabilistic soft data describing adequately the uncertainty associated with the

measurement error of three arsenic datasets. As a result, we are able to apply the BME

estimation method to account for the varying levels of measurement error between these

three datasets, and obtain accurate maps of the spatial distribution of arsenic in the ground

waters of New England. A synthetic case study as well as the real case study show that the

proposed BME approach results in a substantial improvement of mapping accuracy over

classical Geostatistical methods that do not properly account for measurement error.

Furthermore the work presented in this first mapping application will provide an ideal

framework to add new monitoring data with presumably lower detection limit and better

precision as the analytical measurement techniques for arsenic and its speciation keep

improving in the future.

The source of data uncertainty we consider in the second mapping application comes

from the emergence of secondary variables used to map a primary variable for which data is

sparse. Starting with a synthetic case study, we generate realizations of two related SRFs

reproducing the statistical properties of New England groundwater arsenic, and soil pH,

respectively, using a new simulator developed as part of this work. We then implement some

straightforward regression approaches to model the empirical law between groundwater

arsenic and soil pH, from which we obtain a model for the conditional PDF of the

groundwater arsenic primary variable given a collocated measurement of the soil pH

secondary variable. This conditional PDF is efficiently processed in terms of soft data by the

BME estimation method, resulting in realistic maps of groundwater arsenic that rigorously

incorporate the information provided by the soil pH secondary variable. This work

demonstrates that because the proposed BME approach formally accounts for the empirical

law between the primary and secondary variables, it leads to a drastic improvement in

mapping accuracy over the co-kriging method which only accounts for the cross-correlation

between primary and secondary variables. As a result, this work suggests a shift of the

multivariate mapping paradigm from co-kriging to the proposed BME method when dealing

with secondary variables related to the primary variable through a variety of empirical laws.

In the third mapping application we develop a rigorous mathematical framework to map

the spatial distribution of childhood asthma prevalence by integrating data collected at

different spatial observation scales, and we apply this framework to a real case study in North

Carolina using two datasets obtained at two substantially different observation scales. The

mathematical framework we develop consists in deriving the conditional PDF of a variable at

the local scale given an observation of that variable at a larger scale. Once this framework is

developed, it is possible to generate soft data for the local scale variable on the basis of data

observed at different temporal or spatial scales. This approach makes it possible to efficiently mix data

observed at a variety of scales, and increases the accuracy of the map obtained for

the scale of interest. Our developed framework is formulated in the one-dimensional

temporal case, and then extended to the two dimensional spatial case, before being applied to

the North Carolina childhood asthma prevalence real case study. We find that the map that

we obtain is substantially more realistic and accurate than maps obtained without

consideration of observation scale. Results from our cross-validation analysis indicate that

the childhood asthma prevalence map we generate for North Carolina has a mapping error

variance that is at least 10% smaller than that of classical maps obtained when ignoring

the scale effect. Furthermore a validation analysis indicates that under other mapping

situations the drop in mapping estimation error can be in excess of 20% over the classical

approaches not accounting for the scale effect. This means that our proposed method

provides a powerful conceptual framework to integrate data obtained at different observation

scales for a wide range of asthma mapping situations.

In this dissertation models for soft Geostatistical data have been developed to account for

three important types of data uncertainty that are relevant to environmental and health

spatiotemporal processes. The subsequent integration of these soft data models using the

rigorous mathematical estimation framework provided by the BME mapping method leads to

substantial improvements in mapping accuracy over classical methods that do not properly

account for data uncertainty. Thus these models of soft data can be applied in a variety of real

exposure and health mapping situations to provide highly informative maps that will be

useful for environmental scientists, epidemiologists, public health officials, and state

regulators.

Appendix A: Derivation of empirical relationships and their associated
uncertainty

A.1. A quick overview of the multivariate linear regression model

Let’s consider the multivariate linear regression model expressed as

$$x_i = \sum_{j=1}^{p} y_{ij}\beta_j + \varepsilon_i, \quad 1 \le i \le N \tag{A.1}$$

where xi are the response variables, yij are explanatory variables, βj are regression parameters,

εi are unobservable random errors, N is the number of observations, and p is the number of

regression parameters.

This model usually includes major assumptions (i.e. normality, homoscedasticity, and

mutual independence between response variables) leading to the normal distribution of the

estimator for the expected value (i.e. $\sum_{j=1}^{p} y_{ij}\beta_j$) and variance (i.e. $\sigma_X^2$). The regression

coefficients are estimated by setting up an objective function equal to the mean prediction

square error

$$\mathrm{MPSE} = \sum_{i=1}^{N}\left(x_i - \sum_{j=1}^{p} y_{ij}\beta_j\right)^2 \tag{A.2}$$

to be minimized with respect to the $\beta_j$. We obtain the estimators $\hat{\beta}_k$ for $\beta_k$, k=1,…,p, by

setting $\partial\mathrm{MPSE}/\partial\hat{\beta}_k = 0$, k=1,…,p, which leads to the following normal equations

$$\sum_{i=1}^{N} y_{ik}\left(x_i - \sum_{j=1}^{p} y_{ij}\hat{\beta}_j\right) = 0 \tag{A.3}$$

where k=1, 2, 3, …, p.

Eq. (A.3) may be written in matrix form as

$$D^{T}x = D^{T}D\,\hat{\beta} \tag{A.4}$$

where D is an (N × p) design matrix with elements $y_{ij}$, x is an (N × 1) vector with elements $x_i$, and

$\hat{\beta}$ is a (p × 1) vector with elements $\hat{\beta}_j$.

If $(D^{T}D)^{-1}$ exists, the regression coefficients are given by

$$\hat{\beta} = (\Delta^{T}\Delta)^{-1}(\Delta^{T}\chi) \tag{A.5}$$

and the covariance matrix for $\hat{\beta}$ is estimated as

$$\mathrm{cov}(\hat{\beta}) = \sigma_X^2(\Delta^{T}\Delta)^{-1} \tag{A.6}$$

where Δ is obtained by substituting each random variable $y_{ij}$ in the design matrix D with its

observed value $\psi_{ij}$, and χ is a vector of observed values for x.

Once $\hat{\beta}$ has been calculated, the response variables (i.e. $x_i$) are evaluated using

$$\hat{x}_i = \sum_{j=1}^{p}\psi_{ij}\hat{\beta}_j, \tag{A.7}$$

where i=1,…,N. In addition, the unbiased common estimate for $\sigma_X^2$ is calculated from the

average squared vertical distance between the fitted and the observed values, i.e.

$$\hat{\sigma}_X^2 = \frac{\sum_{i=1}^{N}(\chi_i - \hat{\chi}_i)^2}{N-p} = \frac{\sum_{i=1}^{N}\left(\chi_i - \sum_{j=1}^{p}\psi_{ij}\hat{\beta}_j\right)^2}{N-p}. \tag{A.8}$$
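As an illustration of Eqs. (A.5), (A.6) and (A.8), the following sketch solves the normal equations with NumPy on synthetic data. The function name `fit_linear_model` and the synthetic values are assumptions made for this example; the covariance matrix is evaluated with the estimate of σX² from Eq. (A.8) substituted for the unknown true value.

```python
import numpy as np


def fit_linear_model(design, chi):
    """Return beta_hat (A.5), sigma2_hat (A.8) and cov(beta_hat) (A.6)."""
    n_obs, p = design.shape
    gram = design.T @ design                             # Delta^T Delta
    beta_hat = np.linalg.solve(gram, design.T @ chi)     # Eq. (A.5)
    residuals = chi - design @ beta_hat                  # chi_i - chi_hat_i
    sigma2_hat = residuals @ residuals / (n_obs - p)     # Eq. (A.8)
    cov_beta = sigma2_hat * np.linalg.inv(gram)          # Eq. (A.6), estimated variance
    return beta_hat, sigma2_hat, cov_beta


rng = np.random.default_rng(1)
psi = rng.uniform(4.0, 9.0, size=200)                    # synthetic explanatory values
design = np.column_stack([np.ones_like(psi), psi])       # intercept and slope columns
chi = 1.5 - 0.3 * psi + rng.normal(scale=0.2, size=psi.size)
beta_hat, sigma2_hat, cov_beta = fit_linear_model(design, chi)
print(beta_hat, sigma2_hat)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience; the result is the same estimator as in Eq. (A.5).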

A.2. Parametric polynomial of order 1

A univariate parametric polynomial of order 1 corresponds to a linear regression model (Eq.

A.1) with p=2 regression parameters (an intercept and a slope), i.e. it corresponds to

$$x_i = \beta_0 + \beta_1 y_i + \varepsilon_i. \tag{A.9}$$

Expanding the normal equations (Eq. A.5) in the case of p=2, we obtain

$$\hat{\beta} = (\Delta^{T}\Delta)^{-1}(\Delta^{T}\chi) = \begin{pmatrix}\hat{\beta}_0\\ \hat{\beta}_1\end{pmatrix} = \begin{pmatrix} N & \sum_{i=1}^{N}\psi_i\\ \sum_{i=1}^{N}\psi_i & \sum_{i=1}^{N}\psi_i^2\end{pmatrix}^{-1}\begin{pmatrix}\sum_{i=1}^{N}\chi_i\\ \sum_{i=1}^{N}\psi_i\chi_i\end{pmatrix} \tag{A.10}$$

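In other words, the regression coefficients are obtained as follows

$$\hat{\beta}_0 = \bar{\chi} - \hat{\beta}_1\bar{\psi} \tag{A.11}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{N}(\psi_i-\bar{\psi})(\chi_i-\bar{\chi})}{\sum_{i=1}^{N}(\psi_i-\bar{\psi})^2} \tag{A.12}$$

where the bar denotes the arithmetic average operator.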

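Expanding Eq. (A.6), we obtain after mathematical manipulations the following

equations for the variance $\hat{\sigma}_{\beta_0}^2$ of $\hat{\beta}_0$ and the variance $\hat{\sigma}_{\beta_1}^2$ of $\hat{\beta}_1$

$$\hat{\sigma}_{\beta_0}^2 = \sigma_X^2\left(\frac{1}{N} + \frac{\bar{\psi}^2}{\sum_{j=1}^{N}(\psi_j-\bar{\psi})^2}\right) \tag{A.13}$$

$$\hat{\sigma}_{\beta_1}^2 = \frac{\sigma_X^2}{\sum_{i=1}^{N}(\psi_i-\bar{\psi})^2}. \tag{A.14}$$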
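Then, the variance for the fitted values is

$$\hat{\sigma}_{\chi_i}^2 = \hat{\sigma}_{\beta_0}^2 + \hat{\sigma}_{\beta_1}^2\psi_i^2 + 2\psi_i\,c_{\hat{\beta}_0,\hat{\beta}_1} \tag{A.15}$$

where $c_{\hat{\beta}_0,\hat{\beta}_1}$ indicates the covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$ that can be expanded as

$$c_{\hat{\beta}_0,\hat{\beta}_1} = c_{\hat{\beta}_1,\,\bar{\chi}-\hat{\beta}_1\bar{\psi}} = c_{\hat{\beta}_1,\bar{\chi}} - \bar{\psi}\,c_{\hat{\beta}_1,\hat{\beta}_1}. \tag{A.16}$$

Since $c_{\hat{\beta}_1,\bar{\chi}} = 0$, Eq. (A.16) finally reduces to

$$c_{\hat{\beta}_0,\hat{\beta}_1} = \frac{-\bar{\psi}\,\sigma_X^2}{\sum_{i=1}^{N}(\psi_i-\bar{\psi})^2} \tag{A.17}$$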

By substituting Eq. (A.17) into Eq. (A.15), we obtain $\hat{\sigma}_{\chi_i}^2$, i.e.

$$\hat{\sigma}_{\chi_i}^2 = \sigma_X^2\left(\frac{1}{N} + \frac{(\psi_i-\bar{\psi})^2}{\sum_{j=1}^{N}(\psi_j-\bar{\psi})^2}\right). \tag{A.18}$$

The standard error (SE) for $\hat{\chi}_i$ is obtained by first obtaining the estimate of $\sigma_X^2$ using Eq.

(A.8), and then taking the square root, i.e.,

$$\text{SE for } \hat{\chi}_i = \hat{\sigma}_X\left(\frac{1}{N} + \frac{(\psi_i-\bar{\psi})^2}{\sum_{j=1}^{N}(\psi_j-\bar{\psi})^2}\right)^{0.5}. \tag{A.19}$$

However, in the case of predicting the result of a single experiment, it is more appropriate to

use the prediction standard error (PSE) rather than the SE for $\hat{\chi}_i$. In this case an additional

term (i.e. $\hat{\sigma}_X$) is included to account for the randomness associated with a single experiment,

so that the PSE for $\hat{\chi}_i$ is given by

$$\text{PSE for } \hat{\chi}_i = \hat{\sigma}_X\left(\frac{1}{N} + \frac{(\psi_i-\bar{\psi})^2}{\sum_{j=1}^{N}(\psi_j-\bar{\psi})^2} + 1\right)^{0.5}. \tag{A.20}$$

Finally, the conditional PDF fS(χi|ψi) characterizing the empirical law for xi given a measured

value ψi for yi is normally distributed with mean $\hat{\chi}_i = \hat{\beta}_0 + \hat{\beta}_1\psi_i$ and a standard deviation equal to the

PSE for $\hat{\chi}_i$, i.e.

$$f_S(\chi_i|\psi_i) = N\left(\hat{\beta}_0 + \hat{\beta}_1\psi_i,\ (\text{PSE for } \hat{\chi}_i)^2\right). \tag{A.21}$$
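The order-1 results can be collected into a short numerical sketch that returns the mean, SE and PSE of the fitted law at a new value of ψ. The function names and the synthetic data are assumptions made for illustration, and the PSE is treated here as a standard deviation of the soft PDF, which is an interpretation of Eq. (A.21) rather than something the text spells out.

```python
import numpy as np


def fit_order1(psi, chi):
    """Closed-form fit of chi = beta0 + beta1 * psi (Eqs. A.11, A.12, A.8)."""
    n = psi.size
    psi_bar, chi_bar = psi.mean(), chi.mean()
    s_psi = np.sum((psi - psi_bar) ** 2)
    beta1 = np.sum((psi - psi_bar) * (chi - chi_bar)) / s_psi
    beta0 = chi_bar - beta1 * psi_bar
    resid = chi - (beta0 + beta1 * psi)
    sigma_hat = np.sqrt(np.sum(resid**2) / (n - 2))  # Eq. (A.8) with p = 2
    return beta0, beta1, sigma_hat, psi_bar, s_psi, n


def soft_pdf_order1(psi_new, fit):
    """Mean, SE (A.19) and PSE (A.20) of the fitted law at psi_new."""
    beta0, beta1, sigma_hat, psi_bar, s_psi, n = fit
    mean = beta0 + beta1 * psi_new
    spread = 1.0 / n + (psi_new - psi_bar) ** 2 / s_psi
    se = sigma_hat * np.sqrt(spread)
    pse = sigma_hat * np.sqrt(spread + 1.0)
    return mean, se, pse


rng = np.random.default_rng(2)
psi = rng.uniform(4.0, 9.0, size=100)                     # synthetic secondary data
chi = 2.0 - 0.25 * psi + rng.normal(scale=0.3, size=100)  # synthetic primary data
fit = fit_order1(psi, chi)
mean, se, pse = soft_pdf_order1(6.5, fit)
print(mean, se, pse)
```

The PSE always exceeds the SE because it adds the residual scatter of a single observation to the uncertainty of the fitted line.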

A.3. Parametric polynomial of order 2

The case of parametric polynomial regression with order 2 corresponds to

$$x_i = \beta_0 + \beta_1 y_i + \beta_2 y_i^2 + \varepsilon_i \tag{A.22}$$

$$\hat{x}_i = \hat{\beta}_0 + \hat{\beta}_1 y_i + \hat{\beta}_2 y_i^2. \tag{A.23}$$

In this case the normal equations (Eq. A.5) for the regression parameters can be expanded as

$$\hat{\beta} = (\Delta^{T}\Delta)^{-1}(\Delta^{T}\chi) = \begin{pmatrix}\hat{\beta}_0\\ \hat{\beta}_1\\ \hat{\beta}_2\end{pmatrix} = \begin{pmatrix} N & \sum_{i=1}^{N}\psi_i & \sum_{i=1}^{N}\psi_i^2\\ \sum_{i=1}^{N}\psi_i & \sum_{i=1}^{N}\psi_i^2 & \sum_{i=1}^{N}\psi_i^3\\ \sum_{i=1}^{N}\psi_i^2 & \sum_{i=1}^{N}\psi_i^3 & \sum_{i=1}^{N}\psi_i^4\end{pmatrix}^{-1}\begin{pmatrix}\sum_{i=1}^{N}\chi_i\\ \sum_{i=1}^{N}\psi_i\chi_i\\ \sum_{i=1}^{N}\psi_i^2\chi_i\end{pmatrix}. \tag{A.24}$$

The PSE for $\hat{\chi}_i$ is then of the following form

$$\text{PSE for } \hat{\chi}_i = \hat{\sigma}_X\sqrt{\delta_i^{T}(\Delta^{T}\Delta)^{-1}\delta_i + 1}, \tag{A.25}$$

where $\delta_i = (1,\ \psi_i,\ \psi_i^2)^{T}$ and $\hat{\sigma}_X$ is obtained from Eq. (A.8).

Finally, the conditional PDF fS(χi|ψi) describing the empirical law is given by the

following normal distribution

$$f_S(\chi_i|\psi_i) = N\left(\hat{\beta}_0 + \hat{\beta}_1\psi_i + \hat{\beta}_2\psi_i^2,\ (\text{PSE for } \hat{\chi}_i)^2\right). \tag{A.26}$$
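A matrix-form sketch of the order-2 case follows. It assumes that the PSE of Eq. (A.25) carries a square root, by analogy with Eq. (A.20); the function name `soft_pdf_order2` and the synthetic data are hypothetical.

```python
import numpy as np


def soft_pdf_order2(psi, chi, psi_new):
    """Mean and PSE of the quadratic empirical law at psi_new (Eqs. A.24, A.25)."""
    design = np.column_stack([np.ones_like(psi), psi, psi**2])  # rows (1, psi_i, psi_i^2)
    gram_inv = np.linalg.inv(design.T @ design)
    beta_hat = gram_inv @ (design.T @ chi)                      # Eq. (A.24)
    resid = chi - design @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (psi.size - 3))         # Eq. (A.8) with p = 3
    delta = np.array([1.0, psi_new, psi_new**2])                # delta_i of Eq. (A.25)
    mean = float(delta @ beta_hat)
    pse = float(sigma_hat * np.sqrt(delta @ gram_inv @ delta + 1.0))
    return mean, pse, float(sigma_hat)


rng = np.random.default_rng(3)
psi = rng.uniform(4.0, 9.0, size=150)                            # synthetic secondary data
chi = -3.0 + 1.2 * psi - 0.1 * psi**2 + rng.normal(scale=0.2, size=150)
mean, pse, sigma_hat = soft_pdf_order2(psi, chi, 6.5)
print(mean, pse)
```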

Appendix B: A simulator to generate realizations of two spatial random
fields (logAs and pH) related in terms of a quadratic empirical law

We aim to generate realizations for the groundwater log-arsenic SRF logAs(s) and soil pH

SRF pH(s) with prescribed statistical properties reproducing those found in the field, and

with a quadratic empirical relationship E[logAs|pH] at collocated point s similar to those

documented in previous studies (e.g. Fig. 3.1).

Let’s consider three independent, homogeneous, normally distributed SRFs A(s), B(s),

and C(s).

$$A(s) \sim N(\mu_A, \sigma_A^2) \tag{B.1}$$

$$B(s) \sim N(\mu_B, \sigma_B^2) \tag{B.2}$$

$$C(s) \sim N(\mu_C, \sigma_C^2). \tag{B.3}$$

Realizations of such fields can easily be generated using geostatistical simulation techniques

(Christakos, 1992; Christakos et al., 2002) such that the realizations of A(s), B(s), and C(s)

have their user-defined means µA, µB, and µC, and variances σA², σB², and σC², and with a

covariance range similar to that of soil pH and log-arsenic found in the field.

We then construct the fields for logAs(s) and pH(s) using the following equations

$$\mathrm{pH}(s) = A(s) + B(s) \tag{B.4}$$

$$\mathrm{logAs}(s) = a_1A(s) + a_2A(s)^2 + C(s), \tag{B.5}$$

where a1 and a2, together with µA, µB, µC, σA2, σB2, and σC2, are the parameters of our

algorithm to generate logAs(s) and pH(s). Let’s now describe how to choose these parameters

in order to obtain realizations of logAs(s) and pH(s) with known statistical properties and a

quadratic empirical relationship E[logAs|pH] at collocated point s.

By substituting Eq. (B.4) into Eq. (B.5) we obtain the following relationship between the

two collocated random variables logAs and pH

$$\mathrm{logAs} = a_1\mathrm{pH} - a_1B + a_2\mathrm{pH}^2 - 2a_2\,\mathrm{pH}\,B + a_2B^2 + C. \tag{B.6}$$

The statistical moments of pH and logAs are obtained from Eq. (B.4) and (B.5) as

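$$\mu_{\mathrm{pH}} = \mu_A + \mu_B \tag{B.7}$$

$$\mu_{\mathrm{logAs}} = a_1\mu_A + a_2\sigma_A^2 + a_2\{\mu_A\}^2 + \mu_C \tag{B.8}$$

$$\sigma_{\mathrm{pH}}^2 = \sigma_A^2 + \sigma_B^2 \tag{B.9}$$

$$\sigma_{\mathrm{logAs}}^2 = a_1^2\sigma_A^2 + 2a_2^2\{\sigma_A^2\}^2 + 4a_2^2\{\mu_A\}^2\sigma_A^2 + 4a_1a_2\mu_A\sigma_A^2 + \sigma_C^2 \tag{B.10}$$

where Eq. (B.10) is obtained by using the following two properties of the Gaussian variable

A expressing the covariance $c_{A,A^2}$ between A and A², and the variance $\sigma_{A^2}^2$ of A²

$$c_{A,A^2} = 2\mu_A\sigma_A^2 \tag{B.11}$$

$$\sigma_{A^2}^2 = 2\{\sigma_A^2\}^2 + 4\{\mu_A\}^2\sigma_A^2. \tag{B.12}$$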
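The moment formulas (B.7) to (B.10) can be checked against a simple Monte Carlo experiment. The sketch below draws independent values of A, B and C at a single point, so it tests only the one-point statistics and ignores the spatial correlation that the actual simulator reproduces. All parameter values are hypothetical and chosen purely for illustration.

```python
import numpy as np


def theoretical_moments(a1, a2, mu_a, mu_b, mu_c, var_a, var_b, var_c):
    """Single-point moments of pH and logAs from Eqs. (B.7)-(B.10)."""
    mu_ph = mu_a + mu_b
    mu_logas = a1 * mu_a + a2 * var_a + a2 * mu_a**2 + mu_c
    var_ph = var_a + var_b
    var_logas = (a1**2 * var_a + 2.0 * a2**2 * var_a**2
                 + 4.0 * a2**2 * mu_a**2 * var_a
                 + 4.0 * a1 * a2 * mu_a * var_a + var_c)
    return mu_ph, mu_logas, var_ph, var_logas


# Illustrative parameter values (hypothetical, not taken from the case study).
p = dict(a1=0.5, a2=-0.2, mu_a=1.0, mu_b=5.5, mu_c=0.3,
         var_a=0.4, var_b=0.2, var_c=0.1)

rng = np.random.default_rng(42)
n = 400_000
A = rng.normal(p["mu_a"], np.sqrt(p["var_a"]), n)
B = rng.normal(p["mu_b"], np.sqrt(p["var_b"]), n)
C = rng.normal(p["mu_c"], np.sqrt(p["var_c"]), n)
pH = A + B                                  # Eq. (B.4)
logAs = p["a1"] * A + p["a2"] * A**2 + C    # Eq. (B.5)

theory = theoretical_moments(**p)
sample = (pH.mean(), logAs.mean(), pH.var(), logAs.var())
for name, s, t in zip(("mean pH", "mean logAs", "var pH", "var logAs"), sample, theory):
    print(f"{name}: sample {s:.4f}, theory {t:.4f}")
```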

Taking the expected value of logAs (Eq. B.6) for a given pH value, we have

$$E[\mathrm{logAs}|\mathrm{pH}] = a_1\mathrm{pH} - a_1E[B|\mathrm{pH}] + a_2\mathrm{pH}^2 - 2a_2\,\mathrm{pH}\,E[B|\mathrm{pH}] + a_2E[B^2|\mathrm{pH}] + E[C]. \tag{B.13}$$

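We see from Eq. (B.4) that since A and B are normally distributed, then pH is also normally

distributed. Multiplying Eq. (B.4) by B and taking the expected value we obtain after some

mathematical manipulations that $c_{\mathrm{pH},B} = \sigma_B^2$. Assuming that pH and B have a joint

distribution that is approximately multi-Gaussian, we have

$$E[B|\mathrm{pH}] = \mu_B + c_{B,\mathrm{pH}}\,c_{\mathrm{pH},\mathrm{pH}}^{-1}(\mathrm{pH}-\mu_{\mathrm{pH}}) = \mu_B + \frac{\sigma_B^2}{\sigma_{\mathrm{pH}}^2}(\mathrm{pH}-\mu_{\mathrm{pH}}) \tag{B.14}$$

$$\sigma_{B|\mathrm{pH}}^2 = \sigma_B^2 - c_{B,\mathrm{pH}}\,c_{\mathrm{pH},\mathrm{pH}}^{-1}\,c_{\mathrm{pH},B} = \sigma_B^2 - \frac{\{\sigma_B^2\}^2}{\sigma_{\mathrm{pH}}^2} \tag{B.15}$$

Then the expected value of B² given pH, $E[B^2|\mathrm{pH}] = \sigma_{B|\mathrm{pH}}^2 + \{E[B|\mathrm{pH}]\}^2$, is easily obtained

using Eq. (B.14) and (B.15), leading to the following expression

$$E[B^2|\mathrm{pH}] = \sigma_B^2 - \frac{\{\sigma_B^2\}^2}{\sigma_{\mathrm{pH}}^2} + \{\mu_B\}^2 + 2\mu_B\frac{\sigma_B^2}{\sigma_{\mathrm{pH}}^2}(\mathrm{pH}-\mu_{\mathrm{pH}}) + \frac{\{\sigma_B^2\}^2}{\{\sigma_{\mathrm{pH}}^2\}^2}(\mathrm{pH}-\mu_{\mathrm{pH}})^2. \tag{B.16}$$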

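Substituting Eq. (B.14) and (B.16) into (B.13) gives the equation for E[logAs|pH], i.e.

$$\begin{aligned}
E[\mathrm{logAs}|\mathrm{pH}] ={}& a_1(\mathrm{pH}-\mu_{\mathrm{pH}}) + a_1\mu_{\mathrm{pH}} - a_1E[B|\mathrm{pH}] + a_2(\mathrm{pH}-\mu_{\mathrm{pH}})^2 - a_2\mu_{\mathrm{pH}}^2 + 2a_2\mu_{\mathrm{pH}}(\mathrm{pH}-\mu_{\mathrm{pH}}) \\
& + 2a_2\mu_{\mathrm{pH}}^2 - 2a_2E[B|\mathrm{pH}](\mathrm{pH}-\mu_{\mathrm{pH}}) - 2a_2E[B|\mathrm{pH}]\mu_{\mathrm{pH}} + a_2E[B^2|\mathrm{pH}] + E[C] \\
={}& a_1(\mathrm{pH}-\mu_{\mathrm{pH}}) + a_1\mu_{\mathrm{pH}} - a_1(\mu_{\mathrm{pH}}-\mu_A) - a_1\frac{\sigma_{\mathrm{pH}}^2-\sigma_A^2}{\sigma_{\mathrm{pH}}^2}(\mathrm{pH}-\mu_{\mathrm{pH}}) + a_2(\mathrm{pH}-\mu_{\mathrm{pH}})^2 - a_2\mu_{\mathrm{pH}}^2 \\
& + 2a_2\mu_{\mathrm{pH}}(\mathrm{pH}-\mu_{\mathrm{pH}}) + 2a_2\mu_{\mathrm{pH}}^2 - 2a_2(\mu_{\mathrm{pH}}-\mu_A)(\mathrm{pH}-\mu_{\mathrm{pH}}) - 2a_2\frac{\sigma_{\mathrm{pH}}^2-\sigma_A^2}{\sigma_{\mathrm{pH}}^2}(\mathrm{pH}-\mu_{\mathrm{pH}})^2 \\
& - 2a_2(\mu_{\mathrm{pH}}-\mu_A)\mu_{\mathrm{pH}} - 2a_2\frac{\sigma_{\mathrm{pH}}^2-\sigma_A^2}{\sigma_{\mathrm{pH}}^2}\mu_{\mathrm{pH}}(\mathrm{pH}-\mu_{\mathrm{pH}}) + a_2(\sigma_{\mathrm{pH}}^2-\sigma_A^2) - a_2\frac{(\sigma_{\mathrm{pH}}^2-\sigma_A^2)^2}{\sigma_{\mathrm{pH}}^2} \\
& + a_2(\mu_{\mathrm{pH}}-\mu_A)^2 + 2a_2(\mu_{\mathrm{pH}}-\mu_A)\frac{\sigma_{\mathrm{pH}}^2-\sigma_A^2}{\sigma_{\mathrm{pH}}^2}(\mathrm{pH}-\mu_{\mathrm{pH}}) + a_2\frac{(\sigma_{\mathrm{pH}}^2-\sigma_A^2)^2}{\sigma_{\mathrm{pH}}^4}(\mathrm{pH}-\mu_{\mathrm{pH}})^2 + \mu_C \\
={}& a_1\mu_A - a_2\mu_{\mathrm{pH}}^2 + 2a_2\mu_A\mu_{\mathrm{pH}} + a_2(\sigma_{\mathrm{pH}}^2-\sigma_A^2) - a_2\sigma_{\mathrm{pH}}^2 + 2a_2\sigma_A^2 - a_2\frac{\sigma_A^4}{\sigma_{\mathrm{pH}}^2} + a_2\mu_{\mathrm{pH}}^2 - 2a_2\mu_A\mu_{\mathrm{pH}} + a_2\{\mu_A\}^2 \\
& + \Big\{a_1 - a_1 + a_1\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2} + 2a_2\mu_{\mathrm{pH}} - 2a_2\mu_{\mathrm{pH}} + 2a_2\mu_A - 2a_2\mu_{\mathrm{pH}} + 2a_2\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2}\mu_{\mathrm{pH}} \\
& \quad + 2a_2\Big(\mu_{\mathrm{pH}} - \mu_{\mathrm{pH}}\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2} - \mu_A + \mu_A\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2}\Big)\Big\}(\mathrm{pH}-\mu_{\mathrm{pH}}) \\
& + \Big(a_2 - 2a_2 + 2a_2\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2} + a_2 - 2a_2\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2} + a_2\frac{\sigma_A^4}{\sigma_{\mathrm{pH}}^4}\Big)(\mathrm{pH}-\mu_{\mathrm{pH}})^2 + \mu_C \\
={}& a_1\mu_A + a_2\Big(\sigma_A^2 - \frac{\sigma_A^4}{\sigma_{\mathrm{pH}}^2} + \{\mu_A\}^2\Big) + \Big(a_1\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2} + 2a_2\mu_A\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2}\Big)(\mathrm{pH}-\mu_{\mathrm{pH}}) + a_2\frac{\sigma_A^4}{\sigma_{\mathrm{pH}}^4}(\mathrm{pH}-\mu_{\mathrm{pH}})^2 + \mu_C.
\end{aligned} \tag{B.17}$$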

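We further simplify Eq. (B.17) leading to the following equation for E[logAs|pH], i.e.

$$E[\mathrm{logAs}|\mathrm{pH}] = \mu_{\mathrm{logAs}} - a_2\frac{\{\sigma_A^2\}^2}{\sigma_{\mathrm{pH}}^2} + \Big(a_1\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2} + 2a_2\mu_A\frac{\sigma_A^2}{\sigma_{\mathrm{pH}}^2}\Big)(\mathrm{pH}-\mu_{\mathrm{pH}}) + a_2\frac{\{\sigma_A^2\}^2}{\{\sigma_{\mathrm{pH}}^2\}^2}(\mathrm{pH}-\mu_{\mathrm{pH}})^2. \tag{B.18}$$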
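Eq. (B.18) can be compared with a direct Monte Carlo estimate of E[logAs|pH] obtained by averaging simulated logAs values whose pH falls in a narrow bin. The sketch below uses the same kind of single-point, spatially uncorrelated draws as a simplifying assumption; the parameter values, the bin half-width and the helper `binned_mean` are hypothetical.

```python
import numpy as np


def conditional_mean_logas(ph, a1, a2, mu_a, mu_b, mu_c, var_a, var_b):
    """E[logAs | pH] from Eq. (B.18)."""
    mu_ph = mu_a + mu_b
    var_ph = var_a + var_b
    mu_logas = a1 * mu_a + a2 * var_a + a2 * mu_a**2 + mu_c   # Eq. (B.8)
    ratio = var_a / var_ph
    dev = ph - mu_ph
    return (mu_logas - a2 * var_a**2 / var_ph
            + (a1 + 2.0 * a2 * mu_a) * ratio * dev
            + a2 * ratio**2 * dev**2)


# Illustrative parameter values (hypothetical, not taken from the case study).
p = dict(a1=0.5, a2=-0.2, mu_a=1.0, mu_b=5.5, mu_c=0.3, var_a=0.4, var_b=0.2)
var_c = 0.1

rng = np.random.default_rng(7)
n = 500_000
A = rng.normal(p["mu_a"], np.sqrt(p["var_a"]), n)
B = rng.normal(p["mu_b"], np.sqrt(p["var_b"]), n)
C = rng.normal(p["mu_c"], np.sqrt(var_c), n)
pH = A + B
logAs = p["a1"] * A + p["a2"] * A**2 + C


def binned_mean(center, half_width=0.02):
    """Average of simulated logAs over samples with pH close to center."""
    in_bin = np.abs(pH - center) < half_width
    return float(logAs[in_bin].mean())


for ph0 in (6.0, 6.5, 7.2):
    print(ph0, binned_mean(ph0), conditional_mean_logas(ph0, **p))
```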

Appendix C: Derivation of σY2(t’,t) accounting for different observation
time scales

C.1. Non-stationary temporal random field case

Let X(t) be a non-stationary temporal random field (TRF), so that its mean trend mX(t)=E[X(t)]

is not a constant, and its covariance model cannot generally be expressed solely as a function

of temporal lag, τ=|t-t’|, i.e.

$$c_X(t,t') \neq c_X(\tau=|t-t'|), \quad\text{and}\quad m_X(t) \neq m_0. \tag{C.1}$$

Z(t) is defined as the average of X(t) over the time duration T centered at time t

$$Z(t) = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,X(u). \tag{C.2}$$

Taking the expected value of Eq. (C.2) gives

$$E[Z(t)] = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,E[X(u)] = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,m_X(u). \tag{C.3}$$

We define a new temporal random field Y(t’,t) as

$$Y(t',t) = X(t') - Z(t) \tag{C.4}$$

where t indicates the mid-point of the time domain T(t)=[t-T/2, t+T/2], and t’ denotes any

possible time within T(t). Then we derive its expected value

$$E[Y(t',t)] = E[X(t') - Z(t)] = m_X(t') - \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,m_X(u), \tag{C.5}$$

and variance

$$\sigma_Y^2(t',t) = E[Y^2(t',t)] - \{E[Y(t',t)]\}^2 = E[X^2(t')] - 2E[X(t')Z(t)] + E[Z^2(t)] - \Big\{m_X(t') - \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,m_X(u)\Big\}^2, \tag{C.6}$$

where

$$E[X^2(t')] = \sigma_X^2(t') + \{E[X(t')]\}^2 = \sigma_X^2(t') + \{m_X(t')\}^2, \tag{C.7}$$

$$\begin{aligned}
E[Z^2(t)] = E[Z(t)Z(t)] &= E\Big[\frac{1}{T^2}\int_{t-T/2}^{t+T/2} du\int_{t-T/2}^{t+T/2} du'\,X(u)X(u')\Big] \\
&= \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du\int_{t-T/2}^{t+T/2} du'\,E[X(u)X(u')] \\
&= \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du\int_{t-T/2}^{t+T/2} du'\,\{c_X(u,u') + m_X(u)m_X(u')\},
\end{aligned} \tag{C.8}$$

and,

$$\begin{aligned}
E[X(t')Z(t)] &= E\Big[X(t')\frac{1}{T}\int_{t-T/2}^{t+T/2} du\,X(u)\Big] = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,E[X(t')X(u)] \\
&= \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,\{c_X(t',u) + m_X(t')m_X(u)\}.
\end{aligned} \tag{C.9}$$

Assuming a linearized mean trend mX(t)=m0+m1t, Eq. (C.3) reduces to

$$E[Z(t)] = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,m_X(u) = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,(m_0+m_1u) = m_0 + m_1t, \tag{C.10}$$

so that the expected value of Y(t’,t) can be expressed as

E[Y(t’,t)] = E[X(t’)] – E[Z(t)] = m1(t’-t). (C.11)

Similarly for linearized mean trend the variance of Y(t’,t) reduces to

σY2(t’,t) = E[X2(t’)] – 2 E[X(t’)Z(t)] + E[Z2(t)] – [m1(t’-t)]2, (C.12)

where

E[X2(t’)] = σX2(t’) + { m0+m1t’}2, (C.13)

$$\begin{aligned}
E[Z^2(t)] &= \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du\int_{t-T/2}^{t+T/2} du'\,\{c_X(u,u') + (m_0+m_1u)(m_0+m_1u')\} \\
&= \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du\int_{t-T/2}^{t+T/2} du'\,c_X(u,u') + (m_1t+m_0)^2,
\end{aligned} \tag{C.14}$$

and

$$\begin{aligned}
E[X(t')Z(t)] &= \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,\{c_X(t',u) + (m_0+m_1t')(m_0+m_1u)\} \\
&= \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,c_X(t',u) + (m_0^2 + m_0m_1t + m_0m_1t' + m_1^2tt').
\end{aligned} \tag{C.15}$$

Once we substitute Eqs. (C.13), (C.14) and (C.15) in Eq. (C.12), the mean trend terms cancel out and we obtain the variance of Y(t’,t),

i.e.

$$\sigma_Y^2(t',t) = \sigma_X^2(t') - \frac{2}{T}\int_{t-T/2}^{t+T/2} du\,c_X(t',u) + \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du\int_{t-T/2}^{t+T/2} du'\,c_X(u,u'). \tag{C.16}$$
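Eq. (C.16) holds for any covariance function, so it can be evaluated by simple numerical quadrature. The sketch below uses a midpoint rule; the function names, the number of nodes, and the exponential covariance with variance 1 and range 10 are assumptions chosen only to illustrate the calculation.

```python
import numpy as np


def sigma_y2_numeric(t_prime, t, T, cov, n=1000):
    """Midpoint-rule evaluation of Eq. (C.16) for a covariance function cov(u, v)."""
    u = t - T / 2.0 + (np.arange(n) + 0.5) * T / n   # nodes in [t - T/2, t + T/2]
    var_x = cov(t_prime, t_prime)                    # sigma_X^2(t')
    single = np.mean(cov(t_prime, u))                # (1/T) * integral of c_X(t', u) du
    double = np.mean(cov(u[:, None], u[None, :]))    # (1/T^2) * double integral of c_X(u, u')
    return float(var_x - 2.0 * single + double)


def exponential_cov(u, v, variance=1.0, a_t=10.0):
    """Stationary exponential covariance, used here only as an example."""
    return variance * np.exp(-3.0 * np.abs(u - v) / a_t)


value = sigma_y2_numeric(t_prime=0.0, t=0.0, T=5.0, cov=exponential_cov)
print(f"sigma_Y^2 at the mid-point: {value:.5f}")
```

For a covariance that is constant in time the three terms cancel exactly, which reflects the fact that averaging a perfectly correlated process loses no information about its local value.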

C.2. Stationary covariance

We now consider the case where the TRF X(t) has a stationary covariance and a non-

stationary linearized mean trend, so that

$$c_X(t,t') = c_X(\tau=|t-t'|) \quad\text{and}\quad m_X(t)=m_0+m_1t. \tag{C.17}$$

As previously derived, in this case E[Y(t’,t)] = m1(t’-t), σY2(t’,t) = E[X2(t’)] – E[2X(t’)Z(t)] +

E[Z2(t)] – [m1(t’-t)]2, and E[X2(t’)] = σX2 + { m0+m1t’}2. However, due to the stationary

covariance assumption we can reduce further the expressions for E [Z2(t)] and E[X(t’)Z(t)] in

σY2(t’,t). First E [Z2(t)] reduces to

$$\begin{aligned}
E[Z^2(t)] &= \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du\int_{t-T/2}^{t+T/2} du'\,\{c_X(u,u') + (m_0+m_1u)(m_0+m_1u')\} \\
&= \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du\int_{t-T/2}^{t+T/2} du'\,c_X(|u-u'|) + (m_1t+m_0)^2.
\end{aligned} \tag{C.18}$$

Similarly E[X(t’)Z(t)] reduces to

$$\begin{aligned}
E[X(t')Z(t)] &= \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,\{c_X(t',u) + E[X(t')]E[X(u)]\} \\
&= \frac{1}{T}\int_{t-T/2}^{t+T/2} du\,\{c_X(|t'-u|) + (m_0+m_1t')(m_0+m_1u)\}.
\end{aligned}$$

Defining the change of variable w = u – t, we further obtain

$$E[X(t')Z(t)] = \frac{1}{T}\int_{-T/2}^{T/2} dw\,\{c_X(|t'-w-t|) + (m_0+m_1t')(m_0+m_1w+m_1t)\}.$$

Or, renaming the integration variable w back to u, the equation is simply

$$\begin{aligned}
E[X(t')Z(t)] &= \frac{1}{T}\int_{-T/2}^{T/2} du\,\{c_X(|t'-u-t|) + (m_0+m_1t')(m_0+m_1u+m_1t)\} \\
&= \frac{1}{T}\Big[\int_{-T/2}^{t'-t} du\,c_X(t'-u-t) + \int_{t'-t}^{T/2} du\,c_X(-t'+u+t)\Big] + (m_1t+m_0)(m_1t'+m_0).
\end{aligned} \tag{C.19}$$

C.3. Stationary exponential covariance case

Let’s now assume that the stationary covariance model is the superposition of n exponential

functions, so that the covariance and mean trend of the TRF X(t) are

$$c_X(t,t') = c_X(\tau=|t-t'|) = \sum_{i=1}^{n}\sigma_{Xi}^2\exp\Big(\frac{-3|t-t'|}{a_{ti}}\Big) \quad\text{and}\quad m_X(t)=m_0+m_1t, \tag{C.20}$$

where ati and σXi2 are temporal range and variance in each exponential function respectively.

In this case the expressions for E[Z2(t)] can be expanded as follows. Defining the change of

variables w’ = u’ – t and w = u – t for Eq. (C.18), we get

$$E[Z^2(t)] = \frac{1}{T^2}\int_{-T/2}^{T/2} dw\int_{-T/2}^{T/2} dw'\,c_X(|w-w'|) + (m_1t+m_0)^2.$$

Reverting back to u’ = w’ and u = w we have

$$E[Z^2(t)] = \frac{1}{T^2}\int_{-T/2}^{T/2} du\int_{-T/2}^{T/2} du'\,c_X(|u-u'|) + (m_1t+m_0)^2.$$

Applying the stationary covariance model which is the superposition of n exponential functions

we obtain

$$\begin{aligned}
E[Z^2(t)] &= \frac{1}{T^2}\int_{-T/2}^{T/2} du\int_{-T/2}^{T/2} du'\sum_{i=1}^{n}\sigma_{Xi}^2\exp\Big(\frac{-3|u-u'|}{a_{ti}}\Big) + (m_1t+m_0)^2 \\
&= \frac{1}{T^2}\int_{-T/2}^{T/2} du\Big[\int_{-T/2}^{u} du'\sum_{i=1}^{n}\sigma_{Xi}^2\exp\Big(\frac{-3(u-u')}{a_{ti}}\Big) + \int_{u}^{T/2} du'\sum_{i=1}^{n}\sigma_{Xi}^2\exp\Big(\frac{-3(u'-u)}{a_{ti}}\Big)\Big] + (m_1t+m_0)^2 \\
&= \sum_{i=1}^{n}\frac{a_{ti}\sigma_{Xi}^2}{3T^2}\Big[2T - \frac{2}{3}a_{ti} + \frac{2}{3}a_{ti}\exp\Big(\frac{-3T}{a_{ti}}\Big)\Big] + (m_1t+m_0)^2.
\end{aligned} \tag{C.21}$$

Similarly

$$\begin{aligned}
E[X(t')Z(t)] &= \frac{1}{T}\int_{-T/2}^{T/2} du\,\{c_X(|t'-u-t|) + (m_0+m_1t')(m_0+m_1u+m_1t)\} \\
&= \frac{1}{T}\int_{-T/2}^{T/2} du\,\Big\{\sum_{i=1}^{n}\sigma_{Xi}^2\exp\Big(-3\frac{|t'-u-t|}{a_{ti}}\Big) + (m_0+m_1t')(m_0+m_1u+m_1t)\Big\} \\
&= \sum_{i=1}^{n}\frac{a_{ti}\sigma_{Xi}^2}{3T}\Big[2 - \exp\Big(\frac{-3(t'+T/2-t)}{a_{ti}}\Big) - \exp\Big(\frac{-3(-t'+t+T/2)}{a_{ti}}\Big)\Big] + (m_1t+m_0)(m_1t'+m_0).
\end{aligned} \tag{C.22}$$

Using Eqs. (C.21) and (C.22) we can write the variance of Y(t’,t) as

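$$
\begin{aligned}
\sigma_Y^2(|t'-t|) = {} & \sum_{i=1}^{n}\sigma_{Xi}^2 - 2\sum_{i=1}^{n}\frac{a_{ti}\,\sigma_{Xi}^2}{3T}\left[2 - \exp\!\left(\frac{-3(t'+T/2-t)}{a_{ti}}\right) - \exp\!\left(\frac{-3(-t'+t+T/2)}{a_{ti}}\right)\right] \\
& + \sum_{i=1}^{n}\frac{a_{ti}\,\sigma_{Xi}^2}{3T^2}\left[2T - \frac{2}{3}a_{ti} + \frac{2}{3}a_{ti}\exp\!\left(\frac{-3T}{a_{ti}}\right)\right].
\end{aligned} \tag{C.23}
$$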

Eq. (C.23) is valid for the superposition of n exponential models. For a single exponential

covariance function (n = 1) it simplifies to

$$
\begin{aligned}
\sigma_Y^2(|t'-t|) = {} & \sigma_X^2 - 2\,\frac{a_t\,\sigma_X^2}{3T}\left[2 - \exp\!\left(\frac{-3(t'+T/2-t)}{a_t}\right) - \exp\!\left(\frac{-3(-t'+t+T/2)}{a_t}\right)\right] \\
& + \frac{a_t\,\sigma_X^2}{3T^2}\left[2T - \frac{2}{3}a_t + \frac{2}{3}a_t\exp\!\left(\frac{-3T}{a_t}\right)\right],
\end{aligned} \tag{C.24}
$$

where $\sigma_X^2$ is the variance of the TRF X(t), and $a_t$ is its temporal covariance range. Eq. (C.24)

can be expressed in terms of the non-dimensional groupings of variables $\sigma_Y^2(|t'-t|)/\sigma_X^2$,

$(t-t')/a_t$, and $T/a_t$, i.e.

$$
\begin{aligned}
\frac{\sigma_Y^2(|t'-t|)}{\sigma_X^2} = {} & 1 - \frac{2}{3}\,\frac{1}{T/a_t}\left[2 - \exp\!\left(3\,\frac{t-t'}{a_t} - \frac{3}{2}\,\frac{T}{a_t}\right) - \exp\!\left(-3\,\frac{t-t'}{a_t} - \frac{3}{2}\,\frac{T}{a_t}\right)\right] \\
& + \frac{1}{3}\,\frac{1}{T/a_t}\left[2 - \frac{2}{3}\,\frac{1}{T/a_t} + \frac{2}{3}\,\frac{1}{T/a_t}\exp\!\left(-3\,\frac{T}{a_t}\right)\right].
\end{aligned} \tag{C.25}
$$

Usually when generating soft data we will use t’=t. The equation for the soft data variance is

then simply obtained by setting (t-t’)/at =0 in Eq. (C.25), which leads to

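$$
\frac{\sigma_Y^2}{\sigma_X^2} = 1 - \frac{2}{3}\,\frac{1}{T/a_t}\left[2 - 2\exp\!\left(-\frac{3}{2}\,\frac{T}{a_t}\right)\right] + \frac{1}{3}\,\frac{1}{T/a_t}\left[2 - \frac{2}{3}\,\frac{1}{T/a_t} + \frac{2}{3}\,\frac{1}{T/a_t}\exp\!\left(-3\,\frac{T}{a_t}\right)\right]. \tag{C.26}
$$
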
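As an illustration of how Eqs. (C.25) and (C.26) might be evaluated in practice, the following minimal sketch computes the variance ratio as a function of the two non-dimensional groupings. The function name and argument names are hypothetical. The restriction |t - t'| <= T/2 is an assumption read off the integration limits of Eq. (C.19) rather than a condition stated explicitly in the text.

```python
import math

def soft_variance_ratio(T_over_a, lag_over_a=0.0):
    """sigma_Y^2 / sigma_X^2 from Eq. (C.25).

    T_over_a   : T / a_t, averaging window relative to the temporal range
    lag_over_a : (t - t') / a_t; the default 0 gives Eq. (C.26)
    """
    rho = T_over_a
    if rho <= 0.0:
        raise ValueError("T / a_t must be positive")
    if abs(lag_over_a) > rho / 2.0:
        # Assumed validity range: t' lies inside the averaging window.
        raise ValueError("expected |t - t'| <= T / 2")
    cross = (2.0 / (3.0 * rho)) * (
        2.0
        - math.exp(3.0 * lag_over_a - 1.5 * rho)
        - math.exp(-3.0 * lag_over_a - 1.5 * rho)
    )
    block = (1.0 / (3.0 * rho)) * (
        2.0 - 2.0 / (3.0 * rho) + (2.0 / (3.0 * rho)) * math.exp(-3.0 * rho)
    )
    return 1.0 - cross + block

if __name__ == "__main__":
    for rho in (0.01, 0.1, 1.0, 10.0, 1000.0):
        print(rho, soft_variance_ratio(rho))
```

The ratio tends to 0 when the averaging window is short compared with the temporal range (the average Z then behaves like the point value X) and tends to 1 when the window is much longer than the range.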
Appendix D: Derivation of $\sigma_Y^2(\mathbf{s}',\mathbf{s})$ accounting for different observation
scales in two-dimensional (2-D) space

D.1. Non-homogeneous 2-D spatial random field case

We now extend the framework of Appendix C by considering a 2-D spatial random field (SRF).

In the most general case, a non-homogeneous SRF is characterized by a spatially varying

mean trend function $m_X(\mathbf{s}) = E[X(\mathbf{s})]$ and a covariance function $c_X(\mathbf{s},\mathbf{s}')$ that cannot be

expressed solely as a function of the spatial lag $|\mathbf{s}-\mathbf{s}'|$, i.e.

$$
c_X(\mathbf{s},\mathbf{s}') \neq c_X(|\mathbf{s}-\mathbf{s}'|) \quad\text{and}\quad m_X(\mathbf{s}) \neq m_0. \tag{D.1}
$$

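$Z(\mathbf{s})$ is defined as the average of X over the 2-D spatial domain $A_s$ centered at $\mathbf{s}$,

whose surface area is $\|A_s\|$:

$$
Z(\mathbf{s}) = \frac{1}{\|A_s\|}\int_{\mathbf{u}\in A_s} d\mathbf{u}\, X(\mathbf{u}). \tag{D.2}
$$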

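For example, the 2-D spatial domain $A_s$ may correspond to the geographical extent of the

county that has its centroid located at $\mathbf{s}$. We then define the SRF $Y(\mathbf{s}',\mathbf{s}) = X(\mathbf{s}') - Z(\mathbf{s})$ and

derive its expected value as

$$
E[Y(\mathbf{s}',\mathbf{s})] = E[X(\mathbf{s}') - Z(\mathbf{s})] = m_X(\mathbf{s}') - \|A_s\|^{-1}\int_{\mathbf{u}\in A_s} d\mathbf{u}\, m_X(\mathbf{u}), \tag{D.3}
$$

and variance as

$$
\begin{aligned}
\sigma_Y^2(\mathbf{s}',\mathbf{s}) &= E[Y^2(\mathbf{s}',\mathbf{s})] - \{E[Y(\mathbf{s}',\mathbf{s})]\}^2 \\
&= E[X^2(\mathbf{s}')] - 2E[X(\mathbf{s}')Z(\mathbf{s})] + E[Z^2(\mathbf{s})] - \left\{m_X(\mathbf{s}') - \|A_s\|^{-1}\int_{\mathbf{u}\in A_s} d\mathbf{u}\, m_X(\mathbf{u})\right\}^2,
\end{aligned} \tag{D.4}
$$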

where

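$$
E[X^2(\mathbf{s}')] = \sigma_X^2(\mathbf{s}') + \{E[X(\mathbf{s}')]\}^2 = \sigma_X^2(\mathbf{s}') + \{m_X(\mathbf{s}')\}^2, \tag{D.5}
$$

$$
\begin{aligned}
E[X(\mathbf{s}')Z(\mathbf{s})] &= \|A_s\|^{-1} E\!\left[X(\mathbf{s}')\int_{\mathbf{u}\in A_s} d\mathbf{u}\, X(\mathbf{u})\right] = \|A_s\|^{-1}\int_{\mathbf{u}\in A_s} d\mathbf{u}\, E[X(\mathbf{s}')X(\mathbf{u})] \\
&= \|A_s\|^{-1}\int_{\mathbf{u}\in A_s} d\mathbf{u}\,\{c_X(\mathbf{s}',\mathbf{u}) + m_X(\mathbf{s}')\,m_X(\mathbf{u})\},
\end{aligned} \tag{D.6}
$$

$$
\begin{aligned}
E[Z^2(\mathbf{s})] &= E[Z(\mathbf{s})Z(\mathbf{s})] = \|A_s\|^{-2}\int_{\mathbf{u}\in A_s} d\mathbf{u}\int_{\mathbf{u}'\in A_s} d\mathbf{u}'\, E[X(\mathbf{u})X(\mathbf{u}')] \\
&= \|A_s\|^{-2}\int_{\mathbf{u}\in A_s} d\mathbf{u}\int_{\mathbf{u}'\in A_s} d\mathbf{u}'\,\{c_X(\mathbf{u},\mathbf{u}') + m_X(\mathbf{u})\,m_X(\mathbf{u}')\}.
\end{aligned} \tag{D.7}
$$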

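Substituting Eqs. (D.5), (D.6), and (D.7) into (D.4) yields the following expression for the

variance accounting for the uncertainty associated with the 2-D observation scale

$$
\begin{aligned}
\sigma_Y^2(\mathbf{s}',\mathbf{s}) = {} & \sigma_X^2(\mathbf{s}') + \{m_X(\mathbf{s}')\}^2 - 2\|A_s\|^{-1}\int_{\mathbf{u}\in A_s} d\mathbf{u}\,\{c_X(\mathbf{s}',\mathbf{u}) + m_X(\mathbf{s}')\,m_X(\mathbf{u})\} \\
& + \|A_s\|^{-2}\int_{\mathbf{u}\in A_s} d\mathbf{u}\int_{\mathbf{u}'\in A_s} d\mathbf{u}'\,\{c_X(\mathbf{u},\mathbf{u}') + m_X(\mathbf{u})\,m_X(\mathbf{u}')\} \\
& - \left\{m_X(\mathbf{s}') - \|A_s\|^{-1}\int_{\mathbf{u}\in A_s} d\mathbf{u}\, m_X(\mathbf{u})\right\}^2.
\end{aligned} \tag{D.8}
$$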

D.2. Homogeneous 2-D SRF

Let us now consider the special case of a homogeneous SRF with a zero mean trend, i.e.

$$
c_X(\mathbf{s},\mathbf{s}') = c_X(|\mathbf{s}-\mathbf{s}'|) \quad\text{and}\quad m_X(\mathbf{s}) = 0. \tag{D.9}
$$

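Because $m_X(\mathbf{s})$ is now equal to 0, the expected value in Eq. (D.3) vanishes. Therefore

Eq. (D.4) simplifies to

$$
\sigma_Y^2(|\mathbf{s}'-\mathbf{s}|) = E[X^2(\mathbf{s}')] - 2E[X(\mathbf{s}')Z(\mathbf{s})] + E[Z^2(\mathbf{s})], \tag{D.10}
$$

where

$$
E[X^2(\mathbf{s}')] = \sigma_X^2(\mathbf{s}'), \tag{D.11}
$$

$$
E[X(\mathbf{s}')Z(\mathbf{s})] = \|A_s\|^{-1}\int_{\mathbf{u}\in A_s} d\mathbf{u}\, c_X(\mathbf{s}',\mathbf{u}), \tag{D.12}
$$

$$
E[Z^2(\mathbf{s})] = \|A_s\|^{-2}\int_{\mathbf{u}\in A_s} d\mathbf{u}\int_{\mathbf{u}'\in A_s} d\mathbf{u}'\, c_X(\mathbf{u},\mathbf{u}'). \tag{D.13}
$$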

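Using the property $c_X(\mathbf{s},\mathbf{s}') = c_X(|\mathbf{s}-\mathbf{s}'|)$ of homogeneous covariance models, we further

expand Eq. (D.12) as

$$
E[X(\mathbf{s}')Z(\mathbf{s})] = \|A_s\|^{-1}\int_{\mathbf{u}\in A_s} d\mathbf{u}\, c_X(|\mathbf{u}-\mathbf{s}'|). \tag{D.14}
$$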

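We replace the integration variable $\mathbf{u}$ with a new integration variable $\mathbf{r}$ defined as $\mathbf{r} = \mathbf{u} - \mathbf{s}$ (see

Fig. D.1). The integration domain for $\mathbf{r}$ corresponds to

$$
\mathbf{u}\in A_s \;\Leftrightarrow\; \mathbf{r}+\mathbf{s}\in A_s \;\Leftrightarrow\; \mathbf{r}\in A(\mathbf{s}-\mathbf{s}) \;\Leftrightarrow\; \mathbf{r}\in A(\mathbf{0}),
$$

where $A(\mathbf{0})$ is the 2-D spatial averaging domain centered at the origin (i.e. with a centroid

located at $\mathbf{0}$). Performing the change of variable in (D.14) results in

$$
E[X(\mathbf{s}')Z(\mathbf{s})] = \|A_s\|^{-1}\int_{\mathbf{r}\in A(\mathbf{0})} d\mathbf{r}\, c_X(|\mathbf{r}-(\mathbf{s}'-\mathbf{s})|). \tag{D.15}
$$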

This equation can be integrated numerically for any shape of the averaging domain $A(\mathbf{0})$.

However, a reasonable approximation of the averaging domain $A(\mathbf{0})$ is a circle with the same area as

$A_s$, i.e. with a radius R such that $\pi R^2 = \|A_s\|$. In this case it is convenient to replace the Cartesian

integration variable $\mathbf{r} = [r_1, r_2]$ with the polar coordinates $(r,\theta)$ defined as (see Fig. D.1)

$r_1 = r\cos\theta$ and $r_2 = r\sin\theta$. Performing this change of coordinate system for a circular spatial

domain $A(\mathbf{0})$ of radius R leads to

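$$
E[X(\mathbf{s}')Z(\mathbf{s})] = (\pi R^2)^{-1}\int_0^R dr\int_0^{2\pi} d\theta\; r\, c_X(|\mathbf{r}-(\mathbf{s}'-\mathbf{s})|) = (\pi R^2)^{-1}\int_0^R dr\int_0^{2\pi} d\theta\; r\, c_X(|\mathbf{l}|), \tag{D.17}
$$

where $|\mathbf{l}| = \sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}$.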
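The polar-coordinate integral in Eq. (D.17) is straightforward to evaluate numerically. The following is a minimal sketch, assuming a zero-mean homogeneous SRF with a single exponential covariance $c_X(h) = \sigma_X^2\exp(-3h/a_r)$; the function name, the midpoint rule, and the grid sizes are illustrative choices rather than the procedure used in this work.

```python
import math

def cross_moment_disc(s, s_prime, R, sigma2, a_r, n_r=200, n_theta=200):
    """Midpoint-rule estimate of E[X(s')Z(s)] in Eq. (D.17).

    s, s_prime : (x, y) tuples for the disc centre and the point s'
    R          : radius of the circular averaging domain
    sigma2, a_r: variance and spatial range of the assumed exponential model
    """
    dr = R / n_r
    dtheta = 2.0 * math.pi / n_theta
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * dr
        for k in range(n_theta):
            theta = (k + 0.5) * dtheta
            # |l| as defined below Eq. (D.17)
            l = math.hypot(s[0] - s_prime[0] + r * math.cos(theta),
                           s[1] - s_prime[1] + r * math.sin(theta))
            total += r * sigma2 * math.exp(-3.0 * l / a_r)
    return total * dr * dtheta / (math.pi * R**2)

if __name__ == "__main__":
    # Hypothetical values: unit disc, unit variance, unit range.
    print(cross_moment_disc((0.0, 0.0), (0.0, 0.0), 1.0, 1.0, 1.0))
    print(cross_moment_disc((0.0, 0.0), (0.5, 0.0), 1.0, 1.0, 1.0))
```

When s' coincides with the centre s, the θ integral is trivial and the result can be compared with the one-dimensional integral $2R^{-2}\int_0^R dr\, r\,\sigma_X^2\exp(-3r/a_r)$, which has an elementary closed form.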

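Figure D.1: A 2-D circular spatial domain used to solve for $E[X(\mathbf{s}')Z(\mathbf{s})]$. (Sketch in the
$(u_1, u_2)$ plane showing the centre $\mathbf{s}$, the point $\mathbf{s}'$, the integration point $\mathbf{u}$ at polar
coordinates $(r,\theta)$ from $\mathbf{s}$, the distance $\mathbf{l}$ from $\mathbf{s}'$ to $\mathbf{u}$, and the radius R.)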

We now consider the third term in the right hand side of Eq. (D.10). Under the homogeneous

assumption Eq. (D.13) reduces to

$$
E[Z^2(\mathbf{s})] = \|A_s\|^{-2}\int_{\mathbf{u}\in A_s} d\mathbf{u}\int_{\mathbf{u}'\in A_s} d\mathbf{u}'\, c_X(|\mathbf{u}-\mathbf{u}'|). \tag{D.18}
$$

As in the derivation of $E[X(\mathbf{s}')Z(\mathbf{s})]$, using polar integration coordinates for a

circular averaging domain $A_s$ of radius R we get

$$
E[Z^2(\mathbf{s})] = (\pi R^2)^{-2}\int_0^R dr\int_0^{2\pi} d\theta\int_0^R dr'\int_0^{2\pi} d\theta'\; r\, r'\, c_X(|\mathbf{r}-\mathbf{r}'|), \tag{D.19}
$$

where $|\mathbf{r}-\mathbf{r}'| = \sqrt{(r\cos\theta - r'\cos\theta')^2 + (r\sin\theta - r'\sin\theta')^2} = \sqrt{r^2 + r'^2 - 2rr'\cos(\theta'-\theta)}$.

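Keeping r, r' and θ unchanged and defining the new variable α = θ' - θ, we further obtain

$$
\begin{aligned}
E[Z^2(\mathbf{s})] &= (\pi R^2)^{-2}\int_0^R dr\int_0^{2\pi} d\theta\int_0^R dr'\int_0^{2\pi} d\alpha\; r\, r'\, c_X\!\left(\sqrt{r^2 + r'^2 - 2rr'\cos\alpha}\right) \\
&= (\pi R^2)^{-2}\int_0^R dr\int_0^R dr'\int_0^{2\pi} d\alpha\; 2\pi\, r\, r'\, c_X\!\left(\sqrt{r^2 + r'^2 - 2rr'\cos\alpha}\right).
\end{aligned} \tag{D.20}
$$

The limits of the α integral can be kept as 0 to 2π because the integrand is 2π-periodic in α, and

the θ integral contributes the factor 2π because the integrand no longer depends on θ.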

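Figure D.2: A 2-D circular spatial domain used to solve for $E[Z^2(\mathbf{s})]$. (Sketch in the
$(u_1, u_2)$ plane showing the centre $\mathbf{s} = \mathbf{s}'$, the two integration points $\mathbf{u}$ and $\mathbf{u}'$ at polar
coordinates $(r,\theta)$ and $(r',\theta')$, the distance $|\mathbf{u}-\mathbf{u}'|$, and the radius R.)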

Consequently, substituting Eqs (D.11), (D.17), and (D.20) into (D.10) leads to

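$$
\begin{aligned}
\sigma_Y^2(|\mathbf{s}'-\mathbf{s}|) = {} & \sigma_X^2(\mathbf{s}') - 2(\pi R^2)^{-1}\int_0^R dr\int_0^{2\pi} d\theta\; r\, c_X\!\left(\sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}\right) \\
& + (\pi R^2)^{-2}\int_0^R dr\int_0^R dr'\int_0^{2\pi} d\alpha\; 2\pi\, r\, r'\, c_X\!\left(\sqrt{r^2 + r'^2 - 2rr'\cos\alpha}\right).
\end{aligned} \tag{D.21}
$$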

D.3. Application of homogeneous exponential covariance model

Let’s now assume that the homogeneous covariance model is the superposition of n

exponential functions, so that the covariance model can be expressed as,

$$
c_X(|\mathbf{s}-\mathbf{s}'|) = \sum_{i=1}^{n}\sigma_{Xi}^2 \exp\!\left(\frac{-3|\mathbf{s}-\mathbf{s}'|}{a_{ri}}\right), \tag{D.22}
$$

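where $\sigma_{Xi}^2$ and $a_{ri}$ are the variance and spatial range of each exponential covariance function,

respectively. Using the superposition of n exponential models leads to the following equation

for $\sigma_Y^2(|\mathbf{s}'-\mathbf{s}|)$

$$
\begin{aligned}
\sigma_Y^2(|\mathbf{s}'-\mathbf{s}|) = {} & \sigma_X^2 - 2(\pi R^2)^{-1}\int_0^R dr\int_0^{2\pi} d\theta\; r \sum_{i=1}^{n}\sigma_{Xi}^2 \exp\!\left(\frac{-3\,d_1(r,\mathbf{s},\mathbf{s}',\theta)}{a_{ri}}\right) \\
& + (\pi R^2)^{-2}\int_0^R dr\int_0^R dr'\int_0^{2\pi} d\alpha\; 2\pi\, r\, r' \sum_{i=1}^{n}\sigma_{Xi}^2 \exp\!\left(\frac{-3\,d_2(r,r',\alpha)}{a_{ri}}\right),
\end{aligned} \tag{D.23}
$$

where $\sigma_X^2 = \sum_{i=1}^{n}\sigma_{Xi}^2$ is the total variance, $d_1(r,\mathbf{s},\mathbf{s}',\theta) = \sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}$ and

$d_2(r,r',\alpha) = \sqrt{r^2 + r'^2 - 2rr'\cos\alpha}$.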

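This equation is valid for the superposition of any number of exponential models. In the case

of a single exponential covariance model, i.e. n = 1, the covariance function is written as

$c_X(|\mathbf{s}-\mathbf{s}'|) = \sigma_X^2\exp(-3|\mathbf{s}-\mathbf{s}'|/a_r)$, and Eq. (D.23) reduces to

$$
\begin{aligned}
\sigma_Y^2(|\mathbf{s}'-\mathbf{s}|) = {} & \sigma_X^2 - 2(\pi R^2)^{-1}\int_0^R dr\int_0^{2\pi} d\theta\; r\,\sigma_X^2 \exp\!\left(\frac{-3\,d_1(r,\mathbf{s},\mathbf{s}',\theta)}{a_r}\right) \\
& + (\pi R^2)^{-2}\int_0^R dr\int_0^R dr'\int_0^{2\pi} d\alpha\; 2\pi\, r\, r'\,\sigma_X^2 \exp\!\left(\frac{-3\,d_2(r,r',\alpha)}{a_r}\right),
\end{aligned} \tag{D.24}
$$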

where $\sigma_X^2$ is the variance of the SRF $X(\mathbf{s})$, and $a_r$ is its spatial covariance range. Usually we

seek an X soft datum at the centroid of the Z hard data (i.e. $\mathbf{s} = \mathbf{s}'$, so that the X soft datum is

located at the center of the circular averaging area $A_s$). In this case Eq. (D.24) is further

reduced by setting $\mathbf{s} = \mathbf{s}'$, i.e.

$$
\begin{aligned}
\sigma_Y^2 = {} & \sigma_X^2 - 4R^{-2}\int_0^R dr\; r\,\sigma_X^2 \exp(-3r/a_r) \\
& + (\pi R^2)^{-2}\int_0^R dr\int_0^R dr'\int_0^{2\pi} d\alpha\; 2\pi\, r\, r'\,\sigma_X^2 \exp\!\left(\frac{-3\,d_2(r,r',\alpha)}{a_r}\right).
\end{aligned} \tag{D.25}
$$

As can be seen from this equation, the variance $\sigma_Y^2$ describing the uncertainty associated with

the observation scale of a 2-D circular averaging domain is a function of the variance and

spatial range of the SRF $X(\mathbf{s})$, as well as the radius R of the averaging spatial domain

characterizing the observation scale.
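Unlike the temporal case, Eq. (D.25) has no simple closed form for the triple integral, so in practice it has to be evaluated numerically. The following is a minimal sketch using a plain midpoint rule; the function name, the grid size n, and the choice of quadrature are assumptions made for illustration only. The result is divided by $\sigma_X^2$, so it depends only on R and $a_r$.

```python
import math

def sigma_y2_ratio_2d(R, a_r, n=40):
    """Midpoint-rule evaluation of Eq. (D.25) divided by sigma_X^2."""
    h = R / n
    radii = [(i + 0.5) * h for i in range(n)]
    dalpha = 2.0 * math.pi / n
    cos_alpha = [math.cos((k + 0.5) * dalpha) for k in range(n)]

    # Second term of (D.25): 4 R^-2 * integral of r exp(-3 r / a_r) over [0, R]
    cross = sum(r * math.exp(-3.0 * r / a_r) for r in radii) * h
    cross_term = 4.0 * cross / R**2

    # Third term of (D.25): triple integral over r, r' and alpha
    block = 0.0
    for r in radii:
        for rp in radii:
            for c in cos_alpha:
                # d_2(r, r', alpha); max() guards against tiny negative round-off
                d2 = math.sqrt(max(r * r + rp * rp - 2.0 * r * rp * c, 0.0))
                block += r * rp * math.exp(-3.0 * d2 / a_r)
    block *= 2.0 * math.pi * h * h * dalpha
    block_term = block / (math.pi * R**2) ** 2

    return 1.0 - cross_term + block_term

if __name__ == "__main__":
    # Hypothetical radii, expressed in units of the spatial range a_r = 1.
    for R in (0.01, 0.1, 1.0, 20.0):
        print(R, sigma_y2_ratio_2d(R, 1.0))
```

The ratio is close to 0 when R is much smaller than the range and approaches 1 when R is much larger, mirroring the behaviour of Eq. (C.26) in the temporal case. A finer grid or an adaptive quadrature routine would be preferable when high accuracy is required.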

APPENDIX E: Some notes regarding the first and second arsenic datasets

E.1. The first arsenic dataset

• Retrieved from the United States Geological Survey (USGS) National Water Information
System (NWIS) in 2001.

• Updated from USGS Water-Resources Investigations Report 99-4279 (Focazio et al., 2000).

• Dataset name: arsenic_nov2001.txt, which is publicly available from the USGS website.

• Time duration: 1973-2001.

• A subset of 20,043 arsenic measurements only covering New England.

• Consistent and accurate arsenic sampling procedures (i.e. collecting, saving, and
transporting samples) maintained by USGS.

• USGS developed the field protocols in the early 1990s.

• The previous sampling methods were tested by the USGS Office of Water Quality.

• A representative analytical method used is Inductively Coupled Plasma Mass
Spectrometry (ICP-MS), which was one of the latest available methods at the time
that the dataset was generated.

E.2. The second arsenic dataset

• Product of Water-Resources Investigation Report 99-4162 (Ayotte et al., 1999) by the
USGS National Water-Quality Assessment (NAWQA).

• Dataset generated for the purpose of monitoring compliance with the Federal Safe
Drinking Water Act.

• Constructed by assembling arsenic measurements from the New England states of Maine (ME),
New Hampshire (NH), Massachusetts (MA), and Rhode Island (RI).

• The states use different detection limits (1 µg/L for ME, and 5 µg/L for NH, MA, and RI),
so a conservative detection limit of 5 µg/L is used for the whole dataset.

• Some data above the detection limit were lost because of the increased reporting level
applied across all of New England.

• The report by Ayotte et al. (1999) provides no information on the analytical techniques used.

• Each state maintains its own safe drinking-water program in good agreement with
Federal standards.

• Sampling procedure and analytical methods were set at the State level.
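The effect of applying a common 5 µg/L detection limit to the second dataset can be sketched as follows. This is only an illustration under stated assumptions: the record layout (state, concentration) and the sample values are hypothetical, and the function is not the procedure used to build the dataset.

```python
COMMON_DETECTION_LIMIT = 5.0  # µg/L, the conservative limit described in the notes above

def harmonize(records, common_dl=COMMON_DETECTION_LIMIT):
    """Flag every measurement below the common detection limit as censored.

    records : iterable of (state, concentration in µg/L) pairs (hypothetical layout)
    Returns a list of (state, value, is_censored). Censored values are reported
    at the common limit, meaning "somewhere below common_dl".
    """
    out = []
    for state, value in records:
        if value < common_dl:
            out.append((state, common_dl, True))
        else:
            out.append((state, value, False))
    return out

if __name__ == "__main__":
    # Made-up example values, not taken from the dataset.
    sample = [("ME", 2.3), ("NH", 7.1), ("ME", 5.0)]
    print(harmonize(sample))
```

In this sketch a Maine measurement of 2.3 µg/L, which lies above that state's own 1 µg/L limit, becomes a censored value once the common limit is applied. This is the kind of information loss described in the notes above.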

References

Abernathy, C. O., Y.-P. Liu, D. Longfellow, H. V. Aposhian, B. Beck, B. Fowler, R. Goyer,
R. Menzer, T. Rossman, C. Thompson, and M. Waalkes, 1999. Arsenic: Health Effects,
Mechanisms of Actions, and Research Issues, Environmental Health Perspectives, Vol.
107, No. 7, pp. 593-597.

Armstrong, M., 1998. Basic Linear Geostatistics, Springer, Berlin, 153 p.

Ayotte, J.D., M.G. Nielsen, G.R. Robinson, Jr., and R.B. Moore, 1999, Relation of Arsenic,
Iron, and Manganese in Ground Water to Aquifer Type, Bedrock Lithogeochemistry, and
Land Use in the New England Coastal Basins, Water-Resources Investigations Report
99-4162.

Bates, M. N., A. H. Smith, and K. P. Cantor, 1995. Case-Control Study of Bladder Cancer
and Arsenic in Drinking Water, American Journal of Epidemiology, Vol. 141, No. 6, pp.
523-529.

Beaty, Richard D. and Jack D. Kerber, 1993. Concepts, Instrumentation and Techniques in
Atomic Absorption Spectrophotometry, Second edition, The Perkin-Elmer Corporation,
Norwalk, CT.

Bhattacharya, P., A. H. Welch, K. M. Ahmed, G. Jacks, and R. Naidu, 2004. Arsenic in
Groundwater of Sedimentary Aquifers, Applied Geochemistry, 19, pp. 163-167.

Braman, R.S., and Foreback, C.C., 1973. Methylated Forms of Arsenic in the Environment,
Science, Vol. 182, pp. 1247-1249.

Buescher, P., and K. Jones-Vessey, 1999. Childhood Asthma in North Carolina, A Special
Report Series by the State Center for Health Statistics, No. 113.

Choi, K.-M., M. L. Serre, and G. Christakos, 2003. Efficient Mapping of California Mortality
Fields at Different Spatial Scales, Journal of Exposure Analysis and Environmental
Epidemiology, 13, pp. 120-133.

Christakos, G., 1990. A Bayesian/Maximum-Entropy View to the Spatial Estimation
Problem, Mathematical Geology, Vol. 22, No. 7, pp. 763-777.

Christakos, G., 1992. Random Field Models in Earth Sciences, Dover Publications, INC.,
Mineola, NY, 474 p.

Christakos, G., and M. L. Serre, 2000a. BME Analysis of Spatiotemporal Particulate Matter
Distribution in North Carolina, Atmospheric Environment, 34, pp. 3393-3406.

Christakos, G., 2000b. Modern Spatiotemporal Geostatistics, Oxford University Press, 288 p.

Christakos, G., M. L. Serre, and J. L. Kovitz, 2001. BME Representation of Particulate
Matter Distributions in the State of California on the Basis of Uncertain Measurements,
Journal of Geophysical Research, Vol. 106, No. D9, pp. 9717-9731.

Christakos G., P. Bogaert and M. L. Serre, 2002. Advanced functions of temporal GIS,
Springer-Verlag, New York, N.Y., 264 p.

Clark, N. M., R. W. Brown, E. Parker, T. G. Robins, D. G. Remick Jr, M. A. Philber, G. J.
Keeler, and B. A. Israel, 1999. Childhood Asthma, Environmental Health Perspectives,
Vol. 107, S3, pp 421-429.

Colt, J.S., D. Baris, S.F. Clark, J.D. Ayotte, M. Ward, J.R. Nuckols, K.P. Cantor, D.T.
Silverman, and M. Karagas, 2002. Sampling Private Wells at Past Home to Estimate
Arsenic Exposure: A Methodologic Study in New England, Journal of Exposure Analysis
and Environmental Epidemiology, 12, pp. 329-334.

Environmental Protection Agency (U.S. EPA) report, 1981. Investigation of Arsenic Sources
in Groundwater, Environmental Protection Agency; U.S. GPO: Washington, DC.

Environmental Protection Agency (U.S. EPA), 2000. Arsenic Occurrence in Public Drinking
Water Supplies, EPA-815-R-00-023, December.

Focazio, M. J., A. H. Welch, S. A. Watkins, D. R. Helsel, and M. A. Horn, 2000. A
Retrospective Analysis on the Occurrence of Arsenic in Ground-Water Resources of the
United States and Limitations in Drinking-Water-Supply Characterizations, Water-Resources
Investigations Report 99-4279, United States Geological Survey.

Freeman, N. C.G., D. Schneider, and P. Mcgarvey, 2003. Household Exposure Factors,
Asthma, and School Absenteeism in a Predominantly Hispanic Community, Journal of
Exposure Analysis and Environmental Epidemiology, Vol. 13, pp 169-176.

Geological Survey (USGS), 2001. Available from
http://water.usgs.gov/nawqa/trace/data/arsenic_nov2001.txt.

Gergen, P. J., D. I. Mullally, and R. Evans III, 1988. National Survey of Prevalence of
Asthma Among Children in the United States, 1976 to 1980, Pediatrics, Vol. 81, No. 1,
pp 1-7.

Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation, Oxford University Press,
New York, 483 p.

Greschonig, H. and K.J. Irgolic, 1997. The Mercuric-Bromide-Stain and the Natelson
Method for the Determination of Arsenic: Implications for Assessment from Exposure to
Arsenic in Taiwan. pp. 17-31 in "Arsenic: Exposure and Health Effects." Edited by C.O.
Abernathy, R.L. Calderon and W.R. Chappell, Chapman & Hall, London.

Guo, H.R., C.J. Chen and H.L. Greene, 1994. Arsenic in Drinking Water and Cancers: a
Brief Descriptive Review of Taiwan Studies, in Arsenic Exposure and Health (eds W.R.
Chappell, C.O. Abernathy, and C.R. Cothern), Sciences and Technology Letters,
Northwood, pp. 129-138.

Hernandez, A., J. Von Behren, R. Kreutzer, and B. McLaughlin, 2000. California County
Asthma Hospitalization Chart Book, California Department of Health Services,
Environmental Health Investigations Branch.

Hinkle, S. R., and D. J. Polette, 1999. Arsenic in Ground Water of the Willamette Basin,
Water-Resources Investigation Report 98-4205, United States Geological Survey.

Hopenhayn-Rich, C., M. L. Biggs, and A. H. Smith, 1998. Lung and Kidney Cancer
Mortality Associated with Arsenic in Drinking Water in Cordoba, Argentina,
International Journal of Epidemiology, Vol. 27, pp. 561-569.

Isaaks, E. H. and R.M. Srivastava, 1989. Applied Geostatistics, Oxford University Press,
New York, 561 p.

Journel, A., and C. J. Huijbregts, 1978. Mining Geostatistics, Academic Press, London, U.K.,
600 p.

Karagas, M. R., T.D. Tosteson, J. Blum, J. Steven Morris, J.A. Baron, and B. Klaue, 1998.
Design of an Epidemiologic Study of Drinking Water Arsenic Exposure and Skin and
Bladder Cancer Risk in a U.S. Population, Environmental Health Perspectives, 106, pp.
1047-1050.

Karagas, M. R., T. D. Tosteson, J. S. Morris, E. Demidenko, L. A. Mott, J. Heaney, and A.
Schned, 2004. Incidence of Transitional Cell Carcinoma of the Bladder and Arsenic
Exposure in New England, Cancer Causes and Control, Vol. 15, pp. 465-472.

Keller J.M., J. M. Giaquinto and A. M. Meeks, 1996. Characterization of the MVST
Waste Tanks Located at ORNL, ORNL/TM-13357, Chemical and Analytical
Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN.

Kinniburgh, D.G., and W. Kosmus, 2002. Arsenic Contamination in Groundwater: Some
Analytical Considerations, Talanta, 58, pp. 165-180.

Klaue, B and J.D. Blum, 1999. Trace Analyses of Arsenic in Drinking Water by Inductively
Coupled Plasma Mass Spectrometry: High Resolution Versus Hydride Generation,
Analytical Chemistry, Vol. 71, No. 7, pp. 1408-1414.

Krivoruchko, K., and C.A. Gotway, 2004. Creating Exposure Maps Using Kriging, Public
Health GIS News and Information, Vol. 56, pp 11-16.

Lai, D., 2004. Geostatistical Analysis of Chinese Cancer Mortality: Variogram, Kriging and
Beyond, Journal of Data Science, Vol. 2, No. 2, pp 177-193.

Lane, W. G., M. C. Edwards, 2003. Asthma in Maryland 2003, Maryland Asthma Control
Program, Family Health Administration, 410-767-6713.

Lewis, T. C., T. G. Robins, J. T. Dvonch, G. J. Keeler, F. Y. Yip, G. B. Mentz, X. Lin, E. A.
Parker, B. A. Israel, L. Gonzalez, and Y. Hill, 2005. Air Pollution-Associated Changes in
Lung Function among Asthmatic Children in Detroit, Environmental Health Perspectives,
Vol. 113, No. 8, pp 1068-1075.

Manninen, P., Presentation of the Implementation of Use of Reference Materials in an
Application-Arsenic by ICP-MS, Consulting Engineers Paavo Ristola Ltd.
(www.vtt.fi/pro/eurachsf/manninen.pdf).

McConnell, R., K. Berhane, F. Gilliland, S. J. London, T. Islam, W. J. Gauderman, E. Avol,
H. G. Margolis, and J. M. Peters, 2002. Asthma in Exercising Children Exposed to Ozone:
A Cohort Study, The Lancet, Vol. 359, pp. 386-391.

Melamed, D., 2004. Monitoring Arsenic in the Environment: A Review of Science and
Technologies for Field Measurements and Sensors, EPA 542/R-04/002, U.S. EPA,
Washington, DC.

National Research Council (NRC), 1999. Arsenic in Drinking Water, National Academy
Press, Washington, DC.

National Research Council (NRC), 2001. Arsenic in Drinking water, National Academy
Press, Washington, DC.

Olea, R., 1999. Geostatistics for Engineers and Earth Scientists, Kluwer Academic Publishers,
Boston, 303 p.

Oyana, T. J., J. S. Lwebuga-Mukasa, 2004. Spatial Relationships Among Asthma Prevalence,
Health Care Utilization, and Pollution Sources in Neighborhoods of Buffalo, New York,
Journal of Environmental Health, Vol. 66, No. 8, pp. 25-37.

Peters, S.C., J.D. Blum, B. Klaue, and M.R. Karagas, 1999. Arsenic Occurrence in New
Hampshire Drinking Water, Environmental Science and Technology, Vol. 33, No. 9, pp.
1328-1333.

Sanchez, F., A.C. Garrabrants, C. Vandecasteele, P. Moszkowicz, and D.S. Kosson, 2003.
Environmental Assessment of Waste Matrices Contaminated with Arsenic, Journal of
Hazardous Materials, B96, pp 229-257.

Schnoor, J. L., 1996. Environmental Modeling: Fate and Transport of Pollutants in Water,
Air, and Soil, John Wiley & Sons, INC.

Serre, M. L., P. Bogaert and G. Christakos, 1998. Latest Computational Results in
Spatiotemporal Prediction Using the Bayesian Maximum Entropy Method, in A.
Buccianti, G. Nardi and R. Potenza, editors, Proceedings of IAMG '99 - Fifth Annual
Conference of the International Association for Mathematical Geology, 1, 117-122, De
Frede Editore, Napoli.

Serre, M. L., and G. Christakos, 1999a. Modern Geostatistics: Computational BME in the
Light of Uncertain Physical Knowledge--The Equus Beds Study, Stochastic
Environmental Research and Risk Assessment, Vol. 13, No. 1, pp 1-26.

Serre, M. L., 1999b. Environmental Spatiotemporal Mapping and Groundwater Flow
Modeling using the BME and ST methods, Ph.D. Dissertation, Depart. of Environmental
Sciences & Engineering, University of North Carolina at Chapel Hill, NC, USA, 236 p.

Serre, M.L., A. Kolovos, G. Christakos, and K. Modis, 2003. An Application of the
Holistochastic Human Exposure Methodology to Naturally Occurring Arsenic in
Bangladesh Drinking Water, Risk Analysis, Vol. 23, No. 3, pp. 515-528.

Stein, M.L., 1999. Interpolation of Spatial Data: Some Theory for Kriging, Springer-Verlag,
New York, 264 p.

Sturm, J. J, K Yeatts, and D Loomis, 2004. Effects of Tobacco Smoke Exposure on Asthma
Prevalence and Medical Care Use in North Carolina Middle School Children, American
Journal of Public Health, Vol. 94, No.2, pp 308-313.

Thomas, Robert , 2003. Practical Guide to ICP-MS, Marcel Dekker, 336 p.

Wackernagel, H., 1995. Multivariate Geostatistics: An Introduction with Applications,
Springer-Verlag, Berlin, 256 p.

Warner, K.L., Angel Martin Jr., and Terri L. Arnold, 2003. Arsenic in Illinois Ground Water-
Community and Private Supplies, United States Geological Survey Water-Resources
Investigation Report 03-4103. http://il.water.usgs.gov/pubs/wrir03_4103.pdf.

Weiss, K.B., S. D. Sullivan, C. S. Lytle, 2000. Trends in the Cost for Asthma in the United
States, 1985-1999, Journal of Allergy and Clinical Immunology, Vol. 106, pp. 493-499.

Welch, A. H., D. B. Westjohn, D. R. Helsel, and R. B. Wanty, 2000. Arsenic in Ground
Water of the United States: Occurrence and Geochemistry, Ground Water, Vol. 38, No. 4,
pp. 589-604.

Welhan, J., and M. Merrick, 2003. Statewide Network Data Analysis and Kriging Project-
Final Report, Idaho Geological Survey.
http://www.idwr.state.id.us/hydrologic/info/statewide/IGS_Kriging_Project-
Final_Report.pdf.

Yeatts, K.B., M.L. Serre, S.-J. Lee, 2004. Spatial Distribution of Wheezing Prevalence and
Air Pollution across North Carolina, Sixteenth Conference of the International Society for
Environmental Epidemiology, New York City, NY, USA, August 1-4.

Yu, Winston H., C. M. Harvey, and C. F. Harvey, 2003. Arsenic in Groundwater in
Bangladesh: A Geostatistical and Epidemiological Framework for Evaluating Health
Effects and Potential Remedies, Water Resources Research, Vol. 39, No. 6, 1146,
doi:10.1029/2002WR001327.

Zmirou, D., S. Gauvin, I. Pin, I. Momas, F. Sahraoui, J Just, Y Le Moullec, F. Bremont, S.
Cassadou, P. Reungoat, M. Albertini, N. Lauvergne, M. Chiron, A. Labbe, Vesta
investigators, 2004. Traffic Related Air Pollution and Incidence of Childhood Asthma:
Results of the Vesta Case-Control Study, Journal of Epidemiology and Community Health, 58,
pp 19-23.
