
MODELS OF SOFT DATA IN GEOSTATISTICS AND THEIR APPLICATION IN
ENVIRONMENTAL AND HEALTH MAPPING
Seung-Jae Lee
A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in
partial fulfillment of the requirements for the degree of Doctor of Philosophy in the
Department of Environmental Sciences and Engineering, School of Public Health.
Chapel Hill
2005
Approved by
__________________________________
Advisor: Marc L. Serre
__________________________________
Reader: George Christakos
__________________________________
Reader: Douglas Crawford-Brown
__________________________________
Reader: Michael Flynn
__________________________________
Reader: Michael Symons
__________________________________
Reader: Karin Yeatts
©
2005
Seung-Jae Lee
ALL RIGHTS RESERVED
ABSTRACT
SEUNG-JAE LEE: Models Of Soft Data In Geostatistics And Their Application In
Environmental And Health Mapping
(Under the direction of Marc L. Serre)
Spatiotemporal Geostatistics provides an efficient mapping estimation method to interpolate
a variable of interest at unsampled spatiotemporal locations based on sparse measured values.
The simple kriging and co-kriging methods of classical Geostatistics have been applied to a
wide variety of environmental mapping problems, though these linear estimation methods
have well known limitations (Gaussian assumptions, restriction to exact measurements, etc.).
More recently the Bayesian Maximum Entropy (BME) method of modern Geostatistics has
provided a rigorous mathematical framework that overcomes these limitations, and in
particular provides an efficient framework to assimilate data with uncertainty expressed in
terms of soft data. The rigorous assimilation of soft data is especially attractive because it
allows the integration of data from multiple sources in terms of their uncertainty. However
while integrating data from multiple sources is becoming an important research topic, the
development of models for soft data is still an emerging field in environmental and health
applications. This dissertation is part of this emerging field. Its goal is to advance the
development of models for soft data describing the uncertainty associated with existing
environmental and health processes, to integrate these soft data in a BME mapping analysis,
and to test the resulting increase in mapping accuracy in real case studies. In this dissertation
three types of data uncertainty are especially emphasized, i.e. uncertainty from measurement
errors, uncertainty from stochastic empirical laws between primary and secondary variables,
and uncertainty arising from the data observation scale. Each model of soft data is validated
using synthetic simulations as well as real case studies that include the analysis of the
uncertainty associated with arsenic measurement errors, the arsenic-pH empirical law, and
the observation scale of childhood asthma prevalence data. Validation analyses show that for
each of these case studies, the model developed for the soft data leads to a substantial gain in
mapping accuracy over methods not accounting for data uncertainty. Consequently the
models of soft data developed can be applied in a variety of real exposure and health
mapping situations to provide highly informative maps that will be useful to environmental
and public health scientists.
ACKNOWLEDGMENTS
I would like to express my sincere thanks to my advisor, Dr. Marc L. Serre, who patiently
guided me throughout the entire period of my Ph.D. work. He convincingly introduced me to a
challenging research area and consistently encouraged me to try my best, while always being
willing to provide high-quality advice.
I am also everlastingly grateful to my Ph.D. committee, Dr. Marc L. Serre, Dr. George
Christakos, Dr. Douglas Crawford-Brown, Dr. Michael Flynn, Dr. Michael Symons, and Dr.
Karin Yeatts. During their service on my Ph.D. committee they spent their time without
reluctance and generously guided me with their valuable expertise.
I would like to acknowledge the U.S. Geological Survey, Dr. Karin Yeatts, and Dr. Stephen
Peters for providing their datasets. To these technical benefactors, I remain sincerely grateful.
I am also indebted to the Rotary Foundation for its financial support during my first academic
year at the University of North Carolina at Chapel Hill (UNC-CH). In addition I was
fortunate to be appointed as a graduate research assistant for all my academic years at the
Department of Environmental Sciences and Engineering, UNC-CH, for which I am thankful.
Lastly I would like to express my deepest thanks to my wife, Miyoung Shim, who has been
available at all times during my studies and earnestly offered the solid motivation needed to
accomplish my Ph.D. degree. I would also like to heartily thank my grandmother Bun-Nam
Kim, parents Hyung-Jik Lee and Eun-Hee Park, and sisters Ji-Young, Ji-Yoon, and Sun-
Young for their endless support and attention whenever needed.
TABLE OF CONTENTS
Page
LIST OF TABLES ……………………………………………………………………...xiii
LIST OF FIGURES ……………………………………………………………………..xiv
Chapter
I. Introduction ………………………………………………………............................... 1
II. A measurement error model for mapping groundwater arsenic: Case

study using three datasets in New England ………………………………………….. 8
2.1. Background …………………………………………………………………….. 8
2.2. The uncertainty associated with arsenic data …………………………………. 11
2.2.1. Sources of measurement errors ……………………….......................... 11
2.2.2. Arsenic measurement techniques and their associated

analytical errors ……………………………………………………….. 12
2.3. Theory ………………………………………………………………………… 17
2.3.1. The knowledge bases characterizing a contaminant

spatial random field ……………………………………………………17
2.3.2. Proposed model for arsenic measurement error ……………………….19
2.3.3. Modeling the covariance function ……………………………………. 23
2.3.4. The BME method for spatial estimation ……………………………… 26
2.3.5. Step by step summary of the approach ……………………………….. 28
2.3.6. Cross validation procedure ………………………………………….... 29
2.4. Application of the model ……………………………………………………... 31
2.4.1. The arsenic datasets …………………………………………………... 31
2.4.2. Mean trend ……………………………………………………………. 35
2.4.3. Covariance analysis and verification of the measurement

error parameters ………………………………………………………. 36
2.4.4. The BME mapping results …………………………………………..... 40
2.4.5. Cross validation results ……………………………………………….. 42
2.5. Conclusions …………………………………………………………………… 46
III. BME mapping using empirical laws with secondary spatial data: A
farewell to co-kriging? ……………………………………………………………… 49
3.1. Background …………………………………………………………………… 49
3.2. Method description ………………………………………………………….... 51
3.2.1. Spatial Random Field (SRF) representation and physical

knowledge bases ……………………………………………………… 51
3.2.2. Empirical law and cross-correlation of related spatial fields ……......... 54
3.2.3. Deriving the conditional PDF fS(χ|ψ) that describes the

empirical law ………………………………………………………….. 57
3.2.3.1. Non parametric approach ………………………………….... 57
3.2.3.2. Parametric approach …………………………………………58
3.2.3.2.1. Parametric polynomial of order 1 ………………. 58
3.2.3.2.2. Parametric polynomial of order 2 ………………. 60
3.2.4. BME processing of hard and soft data ………………………………... 62
3.2.5. Generating related synthetic fields with stochastic empirical

relationships …………………………………………………………... 63
3.2.6. Step by step description of the simple kriging, co-kriging,

and BME approaches …………………………………………………. 66
3.2.7. Cross validation procedure …………………………………………... 70
3.3. Results ………………………………………………………………………… 71
3.3.1. Synthetic case study …………………………………………………... 71
3.3.1.1. Realization of related spatial fields …………………………. 72
3.3.1.2. Covariance and cross-covariance between fields ……………74
3.3.1.3. Conditional PDF fS(χ|ψ) describing the empirical

relationship ………………………………………………….. 75
3.3.1.4. Assessment of mapping accuracy …………………………... 77
3.3.1.5. Cross validation results as a function of the

curvature of the empirical law ……………………………… 79
3.3.1.6. Cross validation results as a function of the

correlation between logAs and pH ………………………….. 82
3.3.2. Application to the real case study: Mapping arsenic in

New England using soil pH …………………………………………... 84
3.3.2.1. New England datasets for arsenic and pH …………………. 84
3.3.2.2. logAs-pH empirical law …………………………………….. 85
3.3.2.3. Mean trend and spatial variability of groundwater

arsenic in New England ……………………………………...86
3.3.2.4. BME estimation of groundwater arsenic across

New England ………………………………………………...88
3.3.2.5. Non-attainment areas ……………………………………….. 90
3.3.2.6. Cross validation results between simple kriging,

co-kriging and BME ………………………………………... 91
3.4. Conclusions …………………………………………………………………… 93
IV. A geostatistical mapping framework integrating data obtained at

different temporal or spatial observation scale ……………………………………... 98
4.1. Background …………………………………………………………………… 98
4.2. Space/time observation scale: A general conceptual framework …………… 100
4.2.1. A review of BME mapping method …………………………………. 100
4.2.2. Conceptual framework for the uncertainty associated with

the observation scale ………………………………………………… 103
4.3. Temporal observation scale: Mathematical formulation and

synthetic case study …………………………………………………………..105
4.3.1. Mathematical formulation …………………………………………… 105
4.3.1.1. Non-stationary temporal random field …………………….. 105
4.3.1.2. Stationary temporal random field …………………………. 107
4.3.2. Synthetic case study …………………………………………………. 110
4.3.2.1. Synthetic verification of the uncertainty model

for temporal observation scale .............................................. 110
4.3.2.2. Quantifying the improvement in mapping

accuracy resulting from the integration of
temporal observation scale uncertainty ……………………. 113
4.4. Spatial observation scale: Mathematical formulation and synthetic

case study …………………………………………………………………… 120
4.4.1. Mathematical formulation …………………………………………… 120
4.4.1.1. Non-homogeneous spatial random field …………………... 120
4.4.1.2. Homogeneous spatial random field ………………………...121
4.4.2. Synthetic case study …………………………………………………. 124
4.4.2.1. Synthetic verification of the uncertainty model for

spatial observation scale …………………………………... 124
4.4.2.2. Quantifying the improvement in mapping accuracy

resulting from the integration of spatial observation
scale uncertainty ……………………………………………127
4.5. Mapping the childhood asthma prevalence across North Carolina
using data collected at different spatial observation scales …………………. 134
4.5.1. Introduction ………………………………………………………….. 134
4.5.2. Theory ……………………………………………………………….. 137
4.5.2.1. A review of the BME method for the mapping

analysis of the childhood asthma prevalence ……………… 137
4.5.2.2. Conceptual framework for the uncertainty associated

with the observation scale of the childhood asthma
prevalence …………………………………………………. 140
4.5.2.3. Quantifying the improvement in mapping accuracy

of childhood asthma prevalence resulting from the
integration of spatial observation scale uncertainty ……….. 143
4.5.3. Data ………………………………………………………………….. 144
4.5.3.1. The North Carolina School Asthma Survey database ……...145
4.5.3.2. The county-level database of Medicaid-enrolled

children suffering from asthma ……………………………. 146
4.5.4. Results ……………………………………………………………….. 149
4.5.4.1. Trends and variability in the spatial distribution of

local scale asthma prevalence among children ……………. 149
4.5.4.2. Maps of the childhood asthma prevalence obtained

using data collected at different observation scales ……….. 153
4.5.4.3. Cross-validation results ……………………………………. 158
4.5.4.4. Validation results ………………………………………….. 160
4.5.5. Conclusions ………………………………………………………….. 161
V. Concluding remarks ……………………………………………………………….. 165
Appendix A: Derivation of empirical relationships and their associated

uncertainty ………………………………………………………………………… 169
A.1. A quick overview of the multivariate linear regression model ……………… 169
A.2. Parametric polynomial of order 1 …………………………………………… 171
A.3. Parametric polynomial of order 2 …………………………………………… 174
Appendix B: A simulator to generate realizations of two spatial random

fields (logAs and pH) related in terms of a quadratic empirical law ……………… 177
Appendix C: Derivation of σY²(t’,t) accounting for different observation

time scales………………………………………………………………………….. 182
C.1. Non-stationary temporal random field case …………………………………. 182
C.2. Stationary covariance ………………………………………………………... 185
C.3. Stationary exponential covariance case ……………………………………... 187
Appendix D: Derivation of σY²(s’,s) accounting for different observation

scales in two-dimensional (2-D) space ……………………………………………. 191
D.1. Non-homogeneous 2-D spatial random field case …………………………... 191
D.2. Homogeneous 2-D SRF ……………………………………………………... 192
D.3. Application of homogeneous exponential covariance model ……………….. 197
Appendix E: Some notes regarding the first and second arsenic datasets …………….. 200
E.1. The first arsenic dataset ……………………………………………………... 200
E.2. The second arsenic dataset …………………………………………………... 200
References ……………………………………………………………………………... 202
LIST OF TABLES
Page
Table 2.1: The number of above and below detects, the mean value and
detection limit, and σo and k (Eq. 2.6) for each dataset ………………… 32
Table 2.2: Comparison of the values of σlogε² estimated using (a) the
covariance analysis and (b) the measurement error model ……………... 38
Table 2.3: Specifications of each of the four methods compared in the

cross validation analysis ………………………………………………... 43
Table 2.4: Change in MSE from classical methods (i.e. methods 1 and 3)
to the proposed methods (i.e. methods 2 and 4). A negative
change means reduction in MSE, indicating an improvement
in mapping accuracy ……………………………………………………. 45
Table 3.1: Cross validation results for case 1 ……………………………………… 80
Table 4.1: Description of three estimation methods compared in the

validation procedure ……………………………………………………117
Table 4.2: MSEave calculated by averaging the validation results

obtained over 20 realizations ………………………………………….. 118
Table 4.3: MSEave calculated by averaging the validation results

obtained over 20 realizations …………………………………………...131
Table 4.4: Cross-validation results showing the cross-validation MSE

for methods 1, 2 and 3, and the change in cross-validation
MSE between method 1 and method 3, as well as between
method 2 and method 3 ………………………………………………... 160
Table 4.5: Validation results obtained when selecting a random validation

set consisting of 30% of the NCSAS data. The table shows the
validation MSE obtained for methods 1, 2 and 3, and the
change in validation MSE between method 1 and method 3, as
well as between method 2 and method 3 ……………………………… 161
LIST OF FIGURES
Page
Figure 2.1: Plot of (a) σZ and (b) σε as a function of Zm for σo =1µg/L

and k=3/10 ……………………………………………………………….20
Figure 2.2: The plain line depicts the expected value E[Z] as a function
of Zm for σo=1µg/L and k=3/10. The detection limit DL=3σo
=3µg/L is shown with the vertical dashed line. The soft
PDFs describing Z when Zm=4µg/L, 6µg/L and 8µg/L are
shown in dotted lines …………………………….................................... 22
Figure 2.3: Measured arsenic concentrations above detection limit

shown with marker size proportional to observed values
for (a) dataset 1, (b) dataset 2, (c) dataset 3. The locations
of all measurements below and above detection limit are
shown in (d) …………………………………………………………….. 35
Figure 2.4: Distribution of the mean trend of total arsenic concentration

mY(s) across New England groundwater ………………………………... 36
Figure 2.5: Covariance model obtained using (a) all the data Xdata
corresponding to the three combined datasets, and (b) only
above detects for the three combined datasets ………………………….. 37
Figure 2.6: Plot of the σlogε² values predicted by the measurement error
model versus the σlogε² values obtained from the covariance
analysis ………………..............................................................................39
Figure 2.7: Map of the BME median estimate of total arsenic in the
groundwater of New England …………………………………………... 41
Figure 2.8: Map of the variance of the BME posterior PDF for X(s)
normalized by the variance σX². This map provides an
assessment of the mapping uncertainty associated with
Figure 2.7 ……………………………………………………………….. 41
Figure 3.1: The circles represent a subset of the data published by

Sanchez et al. (2003) showing the solubility and release of
log-As as a function of pH for a given soil sample
contaminated with arsenic in a pesticide manufacture site ……………... 55
Figure 3.2: Realization of (a) logAs(s) and (b) pH(s) obtained with our
simulator using a1=1.7, a2=0.5 and σA²=0.32. Asterisks in (a)
and triangles in (b) are the randomly selected points used as
data in the cross-validation procedure. The scatter plot of
all collocated simulated logAs-pH values is shown in (c),
where the plain line is the theoretical E[logAs|pH] obtained
from Eq. (3.31) ………..............................................................................73
Figure 3.3: Covariance and cross-covariance for the logAs(s) and pH(s)
synthetic fields shown in Figure 3.2. Experimental
covariance values are shown with dots, while the
corresponding covariance models are shown with plain line …………... 75
Figure 3.4: The dots in (a), (b) and (c) are identical. They show the
collocated measurements for the realization of logAs(s) and
pH(s) shown in Figure 3.2(c). The dashed lines show µ1(ψ)
= E[logAs|pH] obtained using (a) non-parametric prediction,
(b) parametric prediction with polynomial of order 1, and (c)
parametric prediction with polynomial of order 2. The
corresponding µ2(ψ) are shown in (d) with different line
types. The soft data obtained from µ1(ψ) and µ2(ψ) are
shown in thick lines in (a), (b) and (c) ………………………………….. 77
Figure 3.5: The simulated field of logAs(s) shown in map (a) is an

identical reproduction of Figure 3.2(a) that is interpreted as
the truth. The stars are the locations of the logAs hard data
used by estimation method 1 (simple kriging) to produce map
(b). Using this logAs hard data as well as secondary pH data
shown in Figure 3.2(b), we obtain map (c) with method 2
(co-kriging), and map (d) with method 3 (BME) ………………………. 78
Figure 3.6: (a) Curves representing the empirical law E[logAs|pH]

between collocated logAs and pH for the realizations of Table
3.1 (i.e. obtained with a2 varying from 0 to 0.6 in increments
of 0.1). (b) Curves showing the improvement in MSE
reduction i∆ as a function of a2, when the BME soft data is
generated using the non parametric (plain line), the
polynomial of order 1 (dotted line), and the polynomial of
order 2 (dashed line) approaches ……………………………………….. 81
Figure 3.7: Realizations of related logAs(s) and pH(s) fields were obtained
using our simulator with σA² varying from 0.08 to 0.35. The
linear empirical law E[logAs|pH] for each of these realizations
is shown in (a). The corresponding improvement in MSE
reduction i∆ is shown in (b) as a function of σA² ………………………... 83
Figure 3.8: (a) Map of the location of the groundwater arsenic samples
from wells with measurements above detection limit. The
circles have a size proportional to the arsenic level recorded.
(b) Map of the location of soil pH-measurements shown with
color indicating the recorded value according to the color
scale ……………………………………………………………………...85
Figure 3.9: Scatter plot of 139 collocated logAs and pH measurements in

New England. The dot-dashed line shows µ1(ψ)=E[logAs|pH]
obtained using second order polynomial regression. The dotted
line shows a curve of similar shape obtained by Sanchez et al.
(2003). The soft PDFs shown with plain line are the BME soft
data generated using µ1(ψ) (and µ2(ψ) not shown here) ………………... 86
Figure 3.10: (a) Mean trend of groundwater log-arsenic in New England,

and (b) covariance function of its residual ……………………………… 88
Figure 3.11: (a) Map of the BME estimate of groundwater arsenic (µg/L)
across New England, and (b) map of the length of the 68%
BME confidence interval (µg/L) expressing the associated
mapping uncertainty ……………………………………………………..89
Figure 3.12: BME map of the probability that the groundwater arsenic
concentration across New England is in non-attainment of
the drinking water standard of 10 µg/L for arsenic ……………………...91
Figure 3.13: Maps of the concentration of arsenic in the ground-water of

New-England obtained using (a) method 1 (simple kriging),
(b) method 2 (co-kriging), and (c) method 3 (our proposed
BME method). …………………………………………………………... 93
Figure 4.1: Plot of σY/σX as a function of T/at for different values of

(t-t’)/T. Markers indicate synthetic estimates obtained from
multiple random realizations (Eq. 4.21), while lines show
the value predicted from theory (Eq. 4.19) ……………………………. 112
Figure 4.2: Plot showing one of the generated realizations of the TRF
X(t). The simulated values χtrue are shown with a dotted line,
the χhard data are represented by circles, and the ζhard data
are represented by crosses. Four observation time scales of
the ζhard data are shown with horizontal bars, and the
corresponding conditional PDF are shown with bell shape
curves ………………………………………………………………….. 116
Figure 4.3: Plots showing the simulated truth χtrue with a dotted line,
the χhard data with circles, and the ζhard data with crosses.
Additionally, lines show the estimated profiles
obtained using (a) method 1, (b) method 2, and (c) method 3
(BME) …………………………………………………………………. 119
Figure 4.4: Plot of σY/σX as a function of R/ar for different values of

|s-s’|/R. Markers indicate synthetic estimates obtained from
multiple random realizations (Eq. 4.34), while lines show
the value predicted from theory (Eq. 4.32) ……………………………. 126
Figure 4.5: Contoured map showing one of the generated realizations of

the SRF X(s), along with the location of the χhard data points
(stars), and the ζhard data points (triangles). The circular
averaging domains for three of the ζhard data points are shown
with a radius equal to their spatial observation scales ………………… 130
Figure 4.6: Maps of the simulated truth (a), compared to maps obtained
with (b) method 1 using χhard as hard data, (c) method 2
using both χhard and ζhard as hard data, and (d) method 3
corresponding to our proposed BME method accounting for
the effect of observation scale ………………………………………….133
Figure 4.7: Map showing (a) the data on asthma symptoms prevalence
among high school children (age 13-14) reported in the
NCSAS database for most of NC schools, and (b) the county
level asthma prevalence data extracted from the database of
Medicaid-enrolled children age 0-14 years who suffered from
asthma. The prevalence is expressed as a fraction (i.e. average
childhood asthma cases per 1 child) according to the color bar
next to each map ………………………………………………………. 149
Figure 4.8: (a) Map of the local scale mean trend mX(s) of childhood asthma
prevalence (fraction of prevalent asthma cases), and (b) plot of
the covariance of the mean trend-removed local scale childhood
asthma prevalence SRF X’(s) ………………………………………….. 151
Figure 4.9: Maps of the BME mean estimate of children asthmatic symptom
prevalence (average number of cases per 1 child) observed at the
school spatial scale across North Carolina. These maps were
obtained using (a) method 1, (b) method 2, and (c) method 3 ………… 155
Figure 4.10: Maps of the BME posterior variance ([average asthma counts
per 1 child]²) obtained with (a) method 1 and (b) method 3,
which provides an assessment of the uncertainty associated
with the BME mean estimate maps shown in Figure 4.9 (a)
and (c), respectively …………………………………………………… 157
I. Introduction
The spatiotemporal geostatistical framework provides an essential tool to interpolate
monitored data of a variable of interest and obtain an estimate at unsampled space/time
points where neither direct measurements nor a workable physical model are available. This tool
provides a cost effective method to investigate the distribution of variables of interest across
space and time in our environment. Important applications of this tool include exposure
mapping of environmental contaminants (e.g. groundwater contamination and atmospheric
air pollutants), and the spatiotemporal estimation of a variety of health outcomes (such as
asthma symptoms prevalence, etc.). The geostatistical framework provides a stochastic
approach to spatiotemporal modeling that has been widely used to effectively address the
inherent randomness of natural processes and their associated high variability across space
and time.
There have been considerable attempts in classical Geostatistics to address these issues in
terms of kriging estimators (Olea, 1999; Journel and Huijbregts, 1978; Armstrong, 1998).
However the linear kriging estimators of classical Geostatistics were primarily developed to
account for exact measurements, and they have considerable well documented limitations (e.g.
restriction to linear estimation, Gaussian assumptions, etc.) (Goovaerts, 1997; Christakos, 2000).
As a result of these limitations, the linear kriging methods lack the theoretical underpinnings
and practical flexibility needed to incorporate information about the errors and uncertainty
associated with the monitored data. On the other hand, the Bayesian Maximum Entropy
(BME) method of modern Geostatistics developed in the last decade provides a powerful
mathematical framework for the processing of a wide variety of knowledge bases that are
beyond the scope of classical kriging methods (Christakos, 1990, 1992, 2000b; Christakos et
al., 2002). In particular BME rigorously processes exact measurements (hard data) as well as
data with associated error (soft data), leading to estimates that are more accurate than those of
the linear kriging methods, which lack the ability to rigorously process soft data, as demonstrated
in several case studies (Christakos et al., 2000a; 2001; Serre et al., 1999a, 2002, 2003; Choi
et al., 2003).
As a result of these studies, the importance of accounting for the wide
variety of environmental and health soft data available has been increasingly recognized. The
wide variety of soft data available arises from the increasing number of data sources that may
not have been available in the past (new analytical measurement techniques, increased access
to relevant secondary data, remote sensing and satellite technologies, measurements
performed at different spatial and temporal scales, etc.). The uncertainty from each of these
data sources needs to be assessed and properly modeled by means of a soft probability
density function (PDF) that can be processed by the BME method. However the investigation
of data uncertainty and the development of a framework to obtain the relevant soft PDF
characterizing existing environmental and health data is still an emerging field. This
dissertation is part of this emerging field. Its goal is to advance the development of models
that rigorously account for the uncertainty associated with existing environmental and health
data, and to test these models in real case studies. Hence this dissertation is dedicated to
models of soft data in Geostatistics, and their application in environmental and health
mapping.
Let us consider different sources of uncertainty associated with environmental and health
data. A primary source of uncertainty for environmental data is measurement error. Usually
environmental monitoring data are available from datasets (e.g. a USGS dataset of
groundwater arsenic concentrations collected in New Hampshire prior to 1999) that have a
homogeneous measurement error. The measurement error associated with a particular dataset
is then the aggregate of errors coming from the analytical method used across the dataset, the
sampling procedure followed (i.e. collecting, saving, transporting samples), and the creation
of the database and retrieval of information from that database. A second source of
uncertainty comes from the emergence of secondary variables used to map a primary variable
for which data is sparse. For example the data for groundwater arsenic concentration
collected at wells may often be sparse over a region of interest, while measurements of soil
pH in the same region may be more abundant. Another example of emerging secondary data
is remote sensing observations obtained from an aircraft or a satellite. In all these cases, the
secondary variable is linked to the primary variable by means of a stochastic empirical law,
which can be used to generate soft data for the primary variable on the basis of the
measurements available for the secondary variable. A third important source of uncertainty
associated with environmental and health data is the temporal or spatial scale at which the
measurement is made. For example asthma prevalence may be measured at a specific school,
or it may be measured over a much wider area such as a county. Similarly the concentration of
particulate matter (an air pollutant that may be a contributing cause to asthma in children)
may be collected as an hourly average, or as a daily average. In all these cases, the mixing of
data obtained at different spatial or temporal scales is a source of uncertainty for the
space/time estimation of the variable at some scale of interest. The three sources of
uncertainty described here (i.e. measurement error, secondary variable, and observation scale)
illustrate the fact that not accounting properly for the uncertainty associated with the data
might lead to inaccurate geostatistical estimates. This motivates the need to develop for each
source of uncertainty a framework that generates the relevant soft data, which once
rigorously processed with the BME method, will lead to increased mapping accuracy of the
geostatistical estimate.
This dissertation is organized as follows: an introduction (Chapter 1), three main
chapters (Chapters 2, 3 and 4), and a concluding chapter (Chapter 5). Each of Chapters 2, 3
and 4 treats a different type of data uncertainty and leads to an independent real case study,
as described next.
In chapter 2 we consider the measurement error associated with three groundwater
arsenic datasets collected in New England. Each dataset is characterized by its own
analytical and sampling error; therefore the varying levels of measurement error between
datasets should be investigated. The goal of this chapter is to develop a measurement error
model to incorporate the varying uncertainty between datasets, and generate the relevant soft
data for the BME mapping method. This soft data will improve the mapping accuracy of
groundwater arsenic, and will facilitate the incorporation of new datasets as they become
available. This chapter is organized as follows. First, an appropriate measurement error model
for arsenic data is developed. The model can characterize varying measurement error
variance according to the measurement error parameters assumed. The assumed model is
then verified by comparing measurement error variance from the model with that from the
covariance analysis using the experimental data. Once an appropriate measurement error
variance is obtained, the uncertainty from the measurement error is rigorously processed in
the BME method in terms of probabilistic soft data. Finally this approach is validated using
groundwater arsenic data in New England. Results from the cross-validation analysis
indicate that the proposed framework for measurement error leads to a substantial increase
in mapping accuracy compared to that obtained when the measurement error is ignored.
Chapter 3 deals with the uncertainty associated with secondary variables when the
relationship between primary and secondary variables may be modeled using stochastic
empirical laws. This chapter illustrates the framework developed using groundwater arsenic
as the primary variable and soil pH as the secondary variable. The traditional approach to
integrate secondary data when mapping a primary variable is co-kriging, which uses the
cross-correlation between the primary and secondary variables. The approach we propose
instead is to use the stochastic empirical law between collocated groundwater arsenic and soil
pH to generate soft data of the primary variable. This is done in terms of the conditional PDF
of groundwater arsenic given a collocated measured value for soil pH. We present three
straightforward approaches to derive this conditional PDF, which include a non-parametric
approach, and a parametric approach with polynomials of order 1 and 2. The conditional PDF
is then used to generate soft data of groundwater arsenic for each soil pH measurement.
These soft data are rigorously processed by the BME method, resulting in arsenic exposure
maps, together with maps of the associated mapping estimation error. The mapping accuracy
of the BME method accounting for non-linear empirical laws is investigated using synthetic
case studies where a variety of empirical laws between groundwater arsenic and soil pH are
explored. In all cases the BME method results in an outstanding improvement in mapping
accuracy over the co-kriging method of classical Geostatistics. As a result, this chapter
suggests a shift of the multivariate mapping paradigm from co-kriging to the BME method
when dealing with secondary variables related to the primary variable through a variety of
empirical laws. We finally apply the framework developed to a real case study integrating
soil pH data to improve the mapping accuracy of groundwater arsenic in the New England
area.
Lastly, in chapter 4 we consider uncertainty arising from the mixing of environmental or
health data measured at different spatial or temporal scales. The importance of the scale
effect must be recognized since a variable displays different physical properties depending on
the spatial or temporal scale at which it is observed. In this chapter we mathematically
derive the conditional PDF of a variable at the local scale given an observation of that
variable at a larger scale. Using this framework, it is possible to generate soft data for the
local scale on the basis of data observed at different temporal or spatial scales. This allows
the efficient integration of data observed at a variety of temporal or spatial scales, and
increases the mapping accuracy of the map obtained for the scale of interest. Mathematical
formulations are derived in the one-dimensional temporal case, and in the two-dimensional
spatial case. In each case (temporal and spatial), we validate the framework by comparing
the observation scale uncertainty predicted theoretically from the mathematical formulation,
with that inferred from multiple random realizations of a synthetic case study. Additionally
we use the synthetic case studies to quantify the gain in mapping accuracy achieved when the
BME mapping method rigorously accounts for observation scale uncertainty, compared to
classical approaches not accounting for the observation scale effect. Finally we apply the
developed framework to a real case study involving the estimation of asthma prevalence in
North Carolina. We find that in all cases the developed framework adequately describes the
uncertainty associated with the observation scale, which leads to realistic soft PDFs for the
observation scale uncertainty that are rigorously assimilated by the BME method, and results
in a substantial improvement in mapping accuracy over classical mapping methods that
ignore the scale effect.
In conclusion, this dissertation emphasizes the development of a “soft” geostatistical
framework to account for a variety of sources of uncertainty in environmental and health data.
This framework will lead to the incorporation of environmental and health data from multiple
sources, which will improve the mapping accuracy of exposure mapping of environmental
toxics, and the space/time assessment of human health outcomes.
II. A measurement error model for mapping groundwater arsenic: Case
study using three datasets in New England
2.1. Background
Arsenic in the groundwater has become over the past decade a major public health concern
because of its high toxicity and the fact that it may be naturally found at high levels in the
subsurface. Naturally occurring arsenic appears in igneous and sedimentary rocks, in soils
usually originating from sedimentary rocks, and even in the air due to volcanic explosions
and forest/grass fires (Bhattacharya et al., 2004; EPA, 2000; Hinkle et al., 1999). High levels
of naturally occurring arsenic in the groundwater are detected in certain geologic formations
such as volcanic deposit weathering, sulfide mineral deposits in bedrock aquifer, and iron
oxide rich sedimentary deposits (Welch et al., 2000; EPA, 2000). In addition, anthropogenic
contamination of the groundwater due to human activities is categorized as another source of
arsenic. These activities include the use of wood preservatives and agricultural products (e.g.
pesticides, herbicides, insecticides, and defoliants), and industrial activities (e.g. batteries,
fossil fuel burning, paper production, glass and cement manufacturing) (Welch et al.,
2000; EPA, 2000; Hinkle et al., 1999).
The human health effects associated with the ingestion of arsenic in the drinking water
include both cancer and non-cancer adverse health effects (NRC, 2001; Abernathy et al.,
1999). Chronic exposure to inorganic arsenic levels was shown to cause several cancers
(NRC 1999; Karagas et al., 1998) including that of the skin (Bates et al., 1995), lung (Bates
et al., 1995; Hopenhayn-Rich et al., 1998), kidney (Hopenhayn-Rich et al., 1998), and
bladder (Karagas et al., 2004), as well as a variety of non-carcinogenic illnesses such as
cardiovascular disease, diabetes (Karagas et al., 1998), changes in the color of the skin, and
hyperkeratosis (NRC, 1999). The US standard of 50µg/L for arsenic in the drinking water
that had been used since 1975 was revised by the U.S. Environmental Protection Agency
(U.S. EPA) in 2001, resulting in a stricter new standard of 10µg/L in order to address the
increasing threat of groundwater arsenic to the human population.
The spatial distribution of naturally occurring arsenic in New England groundwater has
become a significant public health issue because of the relatively high arsenic concentrations
observed at private and public wells. Naturally occurring arsenic arising from mineral
deposits in the aquifer is the main source of arsenic across the New England groundwater.
For example in the state of New Hampshire, part of the New England region analyzed in our
work, anthropogenic sources of arsenic are negligible, whereas the weathering of bedrock
materials is a continuing source of groundwater arsenic (Peters et al., 1999). Drinking water
from privately used bedrock wells is not publicly regulated and has often contained arsenic
concentrations at levels of public health concern (e.g. levels in excess of the 10µg/L
standard). Elevated bladder cancer mortality has been observed in northern New England
including Maine, New Hampshire, and Vermont, and on-going work is investigating whether
exposure to high levels of arsenic in private wells is the probable source of these high bladder
cancer rates (Colt et al., 2002). As a result, it is important to map the levels of arsenic in the
groundwater of New England in order to assess human exposure to arsenic in the drinking
water.
Mapping of arsenic in the groundwater of New England involves using arsenic
monitoring data collected, analyzed, and stored by different agencies or organizations.
However, because of the different sampling procedures and analytical methods used to obtain
these arsenic monitoring data, the measurement uncertainty might vary widely between
available arsenic datasets. The goal of this work is to develop and implement an analysis
framework that integrates information about measurement errors associated with the arsenic
sampling data, which can then be used to construct accurate maps describing the distribution
of naturally occurring arsenic in the groundwater of New England, and the associated
mapping uncertainty.
The available arsenic datasets considered in this work come from three different sources,
so that each dataset is characterized by its own measurement analysis method, sampling
method, and detection limit. We present an overview of the different kinds of errors
contributing to the data uncertainty of arsenic concentration, and we propose a framework to
model this data uncertainty. This framework consists of a model for the measurement error
that is used to assess the data uncertainty, and provides a way to validate that assessment of
data uncertainty at the covariance analysis stage.
The analysis of data uncertainty leads to the generation of a covariance model and soft
data for arsenic, which provides information that is efficiently processed by the Bayesian
Maximum Entropy (BME) method of modern Geostatistics and its numerical implementation,
BMElib (Christakos, 1990, 2000b; Serre et al., 1998; Serre and Christakos, 1999a; Christakos
et al., 2002). The BME method is able to rigorously integrate the soft data characterizing the
measurement errors, and results in an increase of mapping accuracy over a classical approach
lacking the ability to account for data uncertainty. While the simple kriging approach of
classical Geostatistics cannot rigorously assimilate the uncertain information available due to
its limitations (i.e. linear estimator and Gaussian assumption), the BME method is able to
account for the combined effect of the high natural variability of geology and the varying
levels of measurement errors between datasets. Our work shows that the implementation of
the proposed approach results in a Mean Square Error (MSE) reduction in the real case as
well as synthetic case studies when compared to the direct approach not accounting for data
uncertainty.
2.2. The uncertainty associated with arsenic data
2.2.1. Sources of measurement errors
The measurements of environmental contaminants are usually associated with a variety of
errors that include mistakes, systematic errors, and accidental errors. While errors caused by
mistakes and systematic errors can be reduced by the training of personnel and the calibration
of instruments, accidental errors still remain in measured values. Therefore, when dealing
with a specific dataset one must be aware that the measurement uncertainty consists of the
aggregation of all types of errors, including some that cannot be completely
eliminated. Hence, in order to assess the measurement uncertainty associated with the
arsenic datasets we have access to, it is necessary to investigate the sources of errors involved
during the whole process leading to the creation of the dataset. These sources of errors
include errors in collecting, saving and transporting samples, analytical measurement errors
involved in the technique used to measure arsenic, and errors associated with the creation of
the database and retrieval of information from that database. Even though it is hard to
completely assess all the kinds of errors associated with our datasets, it might be plausible to
broadly assess uncertainty arising from the measurement techniques (i.e. analytical error) and
sampling procedures (i.e. sampling error) based on information about the datasets. We now
survey the different arsenic measurement techniques and their analytical errors.
2.2.2. Arsenic measurement techniques and their associated analytical errors
High arsenic concentration found in the groundwater may come from anthropogenic sources
or from naturally occurring material. Historically the main anthropogenic sources have
included agricultural products (e.g. arsenical pesticides), wood preservatives and industrial
waste. However previous studies seem to indicate that anthropogenic sources are not the
main contributor to the arsenic found in New England groundwater, and instead point to the
possibility of natural bedrock being the main arsenic source for this region of the US (EPA
report from USEPA region 1 office, 1981; Peters et al., 1999).
Arsenic can be found in many different forms in the groundwater depending on physico-
chemical conditions of the environment (electro-negativity, pH, etc.), and the processes
involved (oxidation-reduction, biological and bacterial processes, etc.). Arsenate (HAsO42-)
with a valence state As(V) is an anion prevalent in aerobic surface waters, while arsenite
(H3AsO3 or H2AsO3-) is a reduced form with valence III that is one of the primary species
found in the groundwater, and is considerably more mobile and toxic than arsenate (Schnoor,
1996). Additionally in New England, arsenic in its geological occurrence may also be found
as arsenopyrite (FeAsS), orpiment (As2S3) and realgar (AsS) (Peters et al., 1999).
Methylation by bacteria may also produce organic arsenicals such as methylarsenic acid,
dimethylarsenic acid and trimethylarsenic acid (Braman and Foreback, 1973; Schnoor, 1996),
however these have not been reported widely for New England groundwater.
The Safe Drinking Water Act requires EPA to revise the existing 50 µg/L Maximum
Contaminant Level (MCL) for arsenic in drinking water. On January 22, 2001 EPA
adopted a new standard and public water systems must comply with the new 10 µg/L
standard beginning January 23, 2006. Because toxicity varies with the species of arsenic
present in the water, the standard and methods measuring arsenic should be able to
differentiate between arsenic species. However because the EPA standard regulates total
arsenic present in water, most methods available measure only total arsenic, and we will
therefore restrict our attention to these methods.
The analytical techniques used to measure total arsenic concentration have considerably
improved over the years. Earlier techniques such as the “Gutzeit” method developed over
100 years ago are colorimetric methods that do not require sophisticated equipment and can
be implemented in the field; however they are not precise and have high detection limits. As
modern techniques have developed, from flame atomic absorption (FAA), to graphite furnace
atomic absorption (GFAA), to inductively coupled plasma-atomic emission spectrometry
(ICP-AES) and inductively coupled plasma-mass spectrometry (ICP-MS), the detection limit
for arsenic has continuously decreased from over 50 µg/L in the past century to about 1 µg/L
in the past decade, while new methods combining Hydride Generation and ICP-MS are now
providing detection limits of 0.01 µg/L and below.
The precision of an analytical technique may be defined as the ratio of the standard
deviation of error measurement σΖ over the arsenic concentration Z of the sample, e.g. a
precision of 10% would mean that σΖ is equal to 0.1 Z. In general this ratio increases (i.e. the
relative error grows) as the arsenic concentration decreases, and the detection limit is defined
as the smallest arsenic level that can
be determined with acceptable precision. Hence each analytical technique may be
characterized by its detection limit, its precision for a concentration close to the detection
limit, and its precision for a concentration several times higher than the detection limit.
The colorimetric methods take advantage of the formation of volatile arsine (AsH3) gas to
separate the arsenic from other possible interferences in the sample matrix (Melamed, 2004).
This process is called hydride generation (HG). The arsine gas is then brought in contact
with a color-reacting reagent, and the operator reads the arsenic concentration by comparing
the color obtained with a color scale. Colorimetric methods include variants of the Gutzeit
method used extensively for the Taiwan studies in the 1960s (see review of these studies in
Guo et al., 1994), as well as more modern field test variants developed recently in response
to the Bangladesh studies (Kinniburgh and Kosmus, 2002). Because of the importance of the
Taiwan data collected in the 1960s, Greschonig and Irgolic (1997) re-investigated a method
they name the “mercuric-bromide-stain” method, which they believe was similar to methods
used in the 1960s. This method generates arsine gas by reduction using zinc and
hydrochloric acid, and then uses a solution of mercuric-bromide that reacts with the arsine
gas and turns yellow to brown with increasing arsenic concentration. Greschonig and Irgolic
(1997) determined that the mercuric-bromide-stain method had a detection limit exceeding
50 µg/L, with a precision as high as 64% near the detection limit, and about 21% from
arsenic concentrations of 200 µg/L. Due to these poor performances, several improvements
to the colorimetric methods were achieved during the Bangladesh crisis in the 1990s
(replacement of zinc metal with sodium borohydride, etc.), leading to the development of
colorimetric field kit technology with a much enhanced detection limit and precision. For
instance Kinniburgh and Kosmus (2002) report that their PeCo75 method (a hand held
“Arsenator”) has a detection limit of 4.2 µg/L (3 times s0=1.4 µg/L) and a precision of 14%
at concentrations much greater than the detection limit (k=0.14).
While colorimetric methods are useful for field test applications, fixed analytical
techniques using atomic spectrometry provide better precision and accuracy. The basic
procedure for analytical atomic spectroscopy generally consists of the formation of arsenic
atoms from the sample matrix, followed by excitation of these arsenic atoms using some
energy source, and finally photon emissions from the excited atoms which are quantified to
yield the concentration of arsenic. One of these techniques, which is widely used due to its
relative affordability and good precision, is graphite furnace atomic absorption (GFAA). The
atomization is obtained by introducing the arsenic sample into a graphite tube at high
temperature (Beaty et al., 1993). The resulting cloud of arsenic atoms absorbs light at
wavelengths corresponding to the specific excitation energy states of the arsenic atom. The
quantity of interest is then the absorbance of light at these wavelengths, which provides a
quantitative measure of arsenic concentration in the sample analyzed. The expected
analytical error for GFAA measurements ranges from 3-5% for concentrations greater than
10 times detection limit to 20-40% near the detection limit (Keller et al., 1996).
Another category of techniques in atomic spectrometry are those using an inductively
coupled argon plasma at high temperature as the atomization and excitation source. These
techniques include both inductively coupled plasma-atomic emission spectrometry (ICP-AES)
and inductively coupled plasma-mass spectrometry (ICP-MS). In ICP-AES, the plasma is
used to produce thermally excited arsenic atoms that emit light at characteristic wavelengths.
The emitted light is diffracted by wavelengths and amplified to yield an intensity
measurement that can be converted to a quantitative estimate of arsenic concentration by
comparison with calibration standards. In ICP-MS, the inductively coupled argon plasma is
again used as the excitation source, however there is enough energy in the plasma to also
remove an electron from the arsenic atoms and create positively charged arsenic ions
(Thomas, 2003). These ions are transported to the mass spectrometer where they are
separated from other elements according to their mass to charge ratios, and analyzed at the
high sensitivity afforded by mass spectrometry. However in the case of arsenic, any chloride
present in the sample will form ArCl+ in the argon plasma, which will then interfere in the
mass spectrometry analysis with arsenic. Indeed ArCl+ has the same mass as 75As+ (atomic
mass of 75), so that it may be counted as arsenic and cause arsenic readings to be biased
high. As a result ICP-MS without any pre-processing of the sample is limited to a detection
limit of about 1 to 5 µg/L and precision similar to that of ICP-AES. The typical analytical
error associated with the ICP-MS technique is distributed between ±4-6% at concentrations
greater than 10 times the detection limit and ±20-50% at concentrations near the detection
limit (Keller et al., 1996). The United States Geological Survey (USGS) central laboratory
also indicates that the measurement error associated with ICP-MS is about 15%, while a
study by Manninen (undated) suggests that ICP-MS leads to 20% measurement uncertainty
in water.
Coupling online hydride generation with inductively coupled plasma-mass spectrometry
(HG-ICPMS) eliminates the interference between ArCl+ and 75As+, which allows operation
of the mass spectrometer at low mass resolution (M/∆M=300), thus maximizing signal
intensities (Klaue and Blum, 1999; Peters et al., 1999). As a result while the detection limit
for ICPMS routinely exceeds 1 µg/L, that of online HG-ICPMS is as low as 0.01 µg/L
(Klaue and Blum, 1999; Peters et al., 1999). Similarly the HG-ICPMS method results in a
2000-fold increase in sensitivity (Klaue and Blum, 1999).
By way of summary, analytical techniques for total arsenic have improved over time.
Using information about the analytical technique used for available arsenic monitoring data,
it is plausible to infer some range for the detection limit and the analytical error of the data.
Combining the analytical error with sampling error will provide an assessment of the
uncertainty associated with the data.
2.3. Theory
2.3.1 The knowledge bases characterizing a contaminant spatial random field
The distribution across space of a contaminant is modeled in terms of the spatial random
field (SRF) X(s), where s is the spatial coordinate. The SRF models the distribution of the
contaminant across space in terms of a collection of plausible field realizations χ(s). The
uncertainty characterizing the SRF at points s and s’ is expressed in terms of the probability
density function (PDF) f(χ,χ’; s,s’) characterizing the different plausible realizations χ and
χ’ at these points, i.e.
f(χ, χ’; s, s’) dχ dχ’ = Prob[χ<X(s)<χ+dχ and χ’<X(s’)<χ’+dχ’], (2.1)
where Prob[.] is the probability operator. The mean trend mx(s)=E[X(s)], where E[.] is the
expectation operator fully defined in terms of the PDF of X(s), characterizes systematic
trends in the distribution of the contaminant across space. The covariance cx(s,s’)=E[(X(s)-
mx(s))(X(s’)-mx(s’))] describes spatial correlation and contaminant dependencies between
pairs of points. The mean trend and covariance function provide the foundation of the
general knowledge base available for the contaminant of interest.
While the general knowledge base describes the general characteristics of the
contaminant field X(s), we also usually have measured values at specific site locations.
When a sampled value χhard is an exact measurement of the contaminant process X(shard) at
point shard, we model that value as a hard datum, i.e.
Prob[ X(shard) =χhard]=1. (2.2)
However measurements are seldom exact and they therefore often need to be treated as soft
information. For example in the case of a measurement below detection limit at point ssoft, all
that is known is that the contaminant level is below the detection limit DL, i.e.
Prob[0< X(ssoft) < DL]=1. (2.3)
As can be seen from Eq. (2.3), this soft datum is of the interval type. More generally soft
data can be expressed in terms of a soft PDF fS describing the uncertainty associated with the
measurement, i.e.
Prob[X(ssoft) < u] = ∫_{−∞}^{u} fS(χsoft) dχsoft. (2.4)
For example in the case of normally distributed measurement errors, the soft PDF fS is
Gaussian.
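As an illustration of Eq. (2.4) in the Gaussian case, the following minimal Python sketch evaluates Prob[X(ssoft) < u] both in closed form and by direct numerical integration of the soft PDF. The function names, the truncation of the lower integration limit, and the numerical values used in the check are assumptions made for illustration only; they are not part of BMElib.

```python
import math

def gaussian_soft_pdf(chi, mean, std):
    """Gaussian soft PDF f_S(chi) for normally distributed measurement error."""
    z = (chi - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def prob_below(u, mean, std):
    """Closed-form value of Eq. (2.4), Prob[X(s_soft) < u], for a Gaussian f_S."""
    return 0.5 * (1.0 + math.erf((u - mean) / (std * math.sqrt(2.0))))

def prob_below_numeric(u, mean, std, n=20000):
    """Eq. (2.4) by trapezoidal integration of f_S.

    Assumption: the lower limit -infinity is truncated at mean - 10*std,
    where the Gaussian density is negligible.
    """
    lo = mean - 10.0 * std
    if u <= lo:
        return 0.0
    h = (u - lo) / n
    total = 0.5 * (gaussian_soft_pdf(lo, mean, std) + gaussian_soft_pdf(u, mean, std))
    for i in range(1, n):
        total += gaussian_soft_pdf(lo + i * h, mean, std)
    return total * h
```

The two functions should agree closely for any u, which provides a simple sanity check when a soft PDF other than the Gaussian is substituted and no closed form is available.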
The hard data (Eq. 2.2) and soft data (Eqs. 2.3 and 2.4) provide a site-specific knowledge
base, which, together with the general knowledge base available for the arsenic field, provide
the type of information that is efficiently processed with the BME mapping method.
However before proceeding with the BME method, we need to introduce a model for the
measurement error of arsenic data.
2.3.2 Proposed model for arsenic measurement error
Let Z(s) be a SRF representing the distribution of groundwater arsenic across space. At a
given sampling point s we denote the measurement value of arsenic as Zm. The Z(s) and Zm
are related through a measurement error relationship. In the case of arsenic, an appropriate
model for the measurement error is provided by the following relationship
Z =ε Zm. (2.5)
As can be seen from Eq. (2.5), ε is a multiplicative error term. This multiplicative error is an
unknown random quantity. As a result, for any given measurement, the arsenic concentration
Z is a random variable that is a function of the measured value Zm and the random
multiplicative error term ε.
The work of Kinniburgh and Kosmus (2002) shows that an appropriate model for the
standard deviation σZ of the random arsenic concentration Z given a measured value Zm is
given by the following relationship
σZ | Zm = σo + k Zm. (2.6)
Eq. (2.6) expresses that the measurement error standard deviation σZ increases linearly with
the measurement value Zm, with an intercept value of σo for Zm =0, as illustrated in Figure
2.1(a). Since the arsenic concentration can only take positive values, it is consistent to
assume that for a given Zm , the random variable Z is log-normally distributed with mean Zm
and variance σZ2=(σo+ k Zm)2, which is mathematically denoted as follows
Ζ | Zm ~ logN (Zm , (σo+ k Zm) 2 ) (2.7)
(a) (b)
Figure 2.1: Plot of (a) σZ and (b) σε as a function of Zm for σo =1µg/L and k=3/10.
From Eq. (2.5) we have ε = Z / Zm so that since Z is log normally distributed for a given Zm,
then ε is also log normally distributed given Zm, with expected value E[ε|Zm]=E[Z|Zm]/Zm=1
and variance σε2=(σo/Zm + k)2 given Zm, which is mathematically denoted as follows
ε | Zm ~ logN (1 , (σo/Zm + k)2 ). (2.8)
The multiplicative error has an expected value of one, which means that on the average that
multiplicative error is unbiased. In Figure 2.1(b) we show an illustrative plot of the error
standard deviation σε as a function of Zm. As can be seen on that plot, σε is
approximately equal to k for large measurement values Zm; however it increases
rapidly for small measurement values Zm. This behavior appropriately captures the fact
mentioned earlier that arsenic analytical measurement techniques (e.g. ICP-MS) have a
small relative error for large arsenic concentration, but this relative error increases with
decreasing concentration. This means that there is a detection limit DL below which the
measurement error is too large to be acceptable. A typical threshold for the detection limit is
3 times σo, i.e.
DL = 3 σo. (2.9)
The detection limit is shown with a vertical dashed line in Figure 2.1. As can be seen from
Figure 2.1(b), this detection limit provides an adequate cutoff to differentiate measurements
above the detection limit, with a σε approximately equal to k, from the below-detects, with a
much larger σε.
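The behavior of the model in Eqs. (2.6), (2.8) and (2.9) can be checked numerically with a short Python sketch. The function names are hypothetical, and the parameter values σo=1µg/L and k=3/10 are simply those used for Figure 2.1.

```python
def sigma_z(zm, sigma0, k):
    """Eq. (2.6): standard deviation of Z given Zm (same units as Zm)."""
    return sigma0 + k * zm

def sigma_eps(zm, sigma0, k):
    """Standard deviation of the multiplicative error, from Eq. (2.8) (dimensionless)."""
    return sigma0 / zm + k

def detection_limit(sigma0):
    """Eq. (2.9): detection limit taken as 3 times sigma0."""
    return 3.0 * sigma0

if __name__ == "__main__":
    sigma0, k = 1.0, 0.3  # illustrative values used for Figure 2.1
    dl = detection_limit(sigma0)
    # sigma_eps approaches k for large Zm and grows quickly as Zm decreases
    for zm in (dl, 10.0, 100.0, 1000.0):
        print(zm, sigma_z(zm, sigma0, k), sigma_eps(zm, sigma0, k))
```

With these values σε is about 0.63 at the detection limit and approaches k=0.3 as Zm grows, which is the pattern described for Figure 2.1(b).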
This proposed model for the measurement error of arsenic (Eqs. 2.5-2.9) provides the
framework necessary to generate the soft data needed for a BME mapping analysis. In order
to use this framework, one has to obtain arsenic measurement data and assess for each datum
Zm the parameters σo and k characterizing its measurement error. Then, if the value is below
the detection limit, we use a soft datum of interval type (Eq. 2.3). On the other hand, if the
measured value is above the detection limit, we can construct a soft PDF using the log
normal distribution of Eq. (2.7). For illustration purposes we show in Figure 2.2 the soft
PDFs obtained for Zm =4µg/L, 6µg/L and 8µg/L with parameters σo =1µg/L and k=3/10.
Figure 2.2: The plain line depicts the expected value E[Z] as a function of Zm for σo=1µg/L
and k=3/10. The detection limit DL=3σo =3µg/L is shown with the vertical dashed line. The
soft PDFs describing Z when Zm=4µg/L, 6µg/L and 8µg/L are shown in dotted lines.
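A soft PDF of the kind shown in Figure 2.2 can be sketched in Python as follows. The sketch assumes that logN(m, v) in Eq. (2.7) denotes a log-normal distribution with arithmetic mean m and arithmetic variance v, which is the reading consistent with Eq. (2.8); the function names are hypothetical.

```python
import math

def lognormal_params(zm, sigma0, k):
    """Log-scale parameters (mu, sigma) of Z | Zm under Eq. (2.7).

    Assumption: Zm is the arithmetic mean and (sigma0 + k*Zm)^2 the
    arithmetic variance of the log-normal variable Z.
    """
    std = sigma0 + k * zm
    var_log = math.log(1.0 + (std / zm) ** 2)
    mu_log = math.log(zm) - 0.5 * var_log
    return mu_log, math.sqrt(var_log)

def soft_pdf(z, zm, sigma0, k):
    """Log-normal soft PDF of the arsenic concentration z given a measured value zm."""
    if z <= 0.0:
        return 0.0  # concentrations can only take positive values
    mu, s = lognormal_params(zm, sigma0, k)
    return math.exp(-((math.log(z) - mu) ** 2) / (2.0 * s * s)) / (
        z * s * math.sqrt(2.0 * math.pi)
    )
```

For example, soft_pdf(z, 4.0, 1.0, 0.3) evaluated over a grid of z values gives a curve of the type drawn for Zm=4µg/L in Figure 2.2.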
As described earlier in detail, the measurement error is the combination of several kinds
of errors, including errors arising from the analytical measurement technique and the
sampling procedure used, as well as errors associated with data entry and retrieval. In some
cases the data are available in a set of different databases that are each fairly homogeneous in
terms of the analytical technique, sampling procedure, and data management used. From the
information available for the database, it may often be possible to derive the detection limit
DL for the dataset, as well as a typical standard deviation σZO corresponding to a measured value
ZmO several times the detection limit (e.g. for ZmO=20µg/L we may have σZO=7µg/L). Then
from Eqs. (2.9) and (2.6) we obtain the parameter values σo = DL/3 and k= (σZO - σo)/ ZmO.
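As a worked example of this parameter assessment, the short Python sketch below applies the two relations to the illustrative values quoted above (ZmO=20µg/L, σZO=7µg/L), together with an assumed detection limit of 3µg/L. The function name is hypothetical.

```python
def error_parameters(dl, zm_ref, sigma_ref):
    """Return (sigma0, k) from a detection limit and one reference point.

    dl        : detection limit DL, so that sigma0 = DL / 3 (Eq. 2.9)
    zm_ref    : reference measured value ZmO, several times DL
    sigma_ref : typical error standard deviation at ZmO
    """
    sigma0 = dl / 3.0
    k = (sigma_ref - sigma0) / zm_ref  # Eq. (2.6) solved for k
    return sigma0, k

# Assumed DL of 3 ug/L with the illustrative reference point from the text
sigma0, k = error_parameters(3.0, 20.0, 7.0)
```

With these inputs σo is 1µg/L and k is approximately 0.3, i.e. the parameter values used for Figures 2.1 and 2.2.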
2.3.3 Modeling the covariance function
Let Y(s) be the log-transformed arsenic field, Y(s)=log Z(s). This field is modeled as the sum
of a mean trend mY(s) obtained from general information about the log arsenic field, and a
residual field X(s), as follows
Y(s) = mY(s) + X(s) (2.10)
The deterministic function mY(s) is selected such that the residual field X(s) is homogeneous
over space, so that its covariance is only a function of the spatial distance r=||s-s’|| between
points s and s’, i.e. cX(s,s’)=E[(X(s)-mX(s))( X(s’)-mX(s’))]= cX(r=||s-s’||).
By taking the log-transform of Eq. (2.5) at location s and rearranging we have logZm(s)=
logZ(s) - logε(s), which after substituting for X(s) leads to
Xm(s) = X(s) – log ε(s), (2.11)
where Xm(s)= logZm(s) - mY(s) is the SRF representing the distribution across space of
measured, log-transformed, mean-trend-removed arsenic concentrations. Eq. (2.11) provides
important insights as it shows that the measured Xm(s) field results from the linear
combination of the SRFs X(s) and logε(s). As a result Xm(s) is also a SRF, and we expect
that its spatial variability is the aggregate of the spatial variability of the SRFs X(s) and
logε(s).
It is appropriate to assume that the multiplicative measurement error is homogeneous and
not autocorrelated over space, so that it has a pure nugget covariance function, i.e.
clogε(r)= σlogε2 δ(r), where δ(r) is the Dirac delta function. Assuming that the SRFs X(s) and logε(s) are
independent, we obtain the following equation for the covariance cXm(r) of the SRF Xm(s)
cXm(r) = cX(r)+ σlogε2 δ(r) (2.12)
Furthermore, by evaluating Eq. (2.12) at r = 0 we obtain
σXm2 = σX2+ σlogε2 (2.13)
Eq. (2.13) is simply the mathematical expression of the fact that the variability observed
in Xm(s) is the sum of the variability in X(s) and logε(s). Furthermore Eq. (2.12) indicates
that the covariance cXm(r) obtained in practice using Xm values will have a nugget component
equal to σlogε2 that is due to measurement error, and a component cX(r) that is the covariance
associated with the true arsenic field. The cX(r) does not itself have any nugget effect
because the true arsenic concentration in a groundwater is believed to be a continuous
process at very short scale due to the diffusivity of arsenic in the aqueous phase.
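The decomposition in Eqs. (2.12) and (2.13) can be illustrated with a small Python sketch. The exponential form chosen here for cX(r), the function names, and the parameter names are assumptions made for illustration; the text above does not prescribe a particular covariance model. The nugget term is implemented as a contribution that is present only at r = 0.

```python
import math

def c_x(r, sill, corr_range):
    """Nugget-free covariance of the true field X(s).

    Assumption: exponential model with variance `sill` and practical
    range `corr_range` (any valid covariance model could be used instead).
    """
    return sill * math.exp(-3.0 * r / corr_range)

def c_xm(r, sill, corr_range, nugget):
    """Eq. (2.12): covariance of the measured field Xm(s).

    `nugget` plays the role of the measurement error variance of log(eps)
    and contributes only at zero lag.
    """
    return c_x(r, sill, corr_range) + (nugget if r == 0 else 0.0)
```

At zero lag the function returns sill + nugget, which is Eq. (2.13); at any positive lag it returns the covariance of the true field only.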
As a result, when modeling the experimental cXm(r) obtained from arsenic measurements,
one simply measures its nugget component and uses that value as an assessment of σlogε2
characterizing the measurement error, while the remaining component free of nugget effect
provides the assessment of the covariance cX(r) characterizing the arsenic field.
This has important implications in the context of arsenic mapping when one uses the
measurement model proposed in Eqs. (2.5)-(2.9). In that case, using Eq. (2.8) and from the
property of log normal distribution, we obtain that for a given Zm
σ logε2 = log(1 + (σo/Zm + k)2) (2.14)
Hence for a given dataset of arsenic measurements, we can calculate the average value of
log(1 + (σo/Zm + k)2) across the dataset, and obtain a second assessment of σlogε2
characterizing the measurement error for that dataset.
An interesting implication in practice is that we can test the measurement error
parameters σo and k for a given dataset by comparing the σlogε2 obtained from the
measurement error model (i.e. Eq. 2.14) with the value obtained from the covariance analysis
(i.e. the nugget component of cXm(r) as expressed in Eq. 2.12). This is especially useful when
dealing with datasets that have different measurement errors. In that case each dataset can be
analyzed separately to verify that the σlogε2 value estimated with the measurement error
model matches that obtained from covariance analysis so as to validate its measurement
parameters σo and k.
Once the measurement parameters σo and k have been validated for each dataset, all the
available datasets may be combined into a single master dataset, which is used to derive the
covariance model cX(r) that is part of the general knowledge base processed using the BME
method.
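The comparison between the two assessments of σlogε2 can be sketched numerically. The following is a minimal illustration only, assuming NumPy is available; the function names, the sample concentrations and the nugget value are hypothetical placeholders and are not taken from the datasets analyzed in this work.

```python
import numpy as np

def sigma2_log_eps(z_m, sigma_o, k):
    """Eq. (2.14): variance of log(eps) for measured concentration(s) z_m."""
    z_m = np.asarray(z_m, dtype=float)
    return np.log(1.0 + (sigma_o / z_m + k) ** 2)

def model_nugget_for_dataset(z_m, sigma_o, k):
    """Average of Eq. (2.14) across all measured values of one dataset."""
    return float(np.mean(sigma2_log_eps(z_m, sigma_o, k)))

# Hypothetical measured concentrations (ug/L), for illustration only.
z_measured = np.array([2.0, 5.0, 12.0, 20.0, 45.0])
sigma_o, k = 0.333, 0.233          # parameters of the form used in Eq. (2.6)

model_nugget = model_nugget_for_dataset(z_measured, sigma_o, k)

# Hypothetical nugget read off an experimental covariance cXm(r).
covariance_nugget = 0.10

# A large relative difference would suggest revising sigma_o and k.
relative_difference = abs(model_nugget - covariance_nugget) / covariance_nugget
print(f"model: {model_nugget:.3f}, covariance: {covariance_nugget:.3f}, "
      f"relative difference: {relative_difference:.1%}")
```

Because Eq. (2.14) decreases with Zm, a dataset dominated by low concentrations is expected to show a larger nugget than one dominated by high concentrations, for the same σo and k.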
2.3.4 The BME method for spatial estimation
The spatial BME mapping approach provides a powerful conceptual framework to rigorously
process the general knowledge bases consisting of the mean trend and covariance function of
the SRF X(s), and the site specific knowledge base comprising the hard and soft data. The
BME conceptual framework (Christakos 1990; 2000b; Serre and Christakos, 1999a;
Christakos et al., 2002) distinguishes between three main stages of knowledge processing
that lead to the calculation of a posterior PDF providing a full stochastic assessment of the
contaminant level at any estimation point of interest. These three main stages of the BME
framework are as follows:
(i) At the structural stage, BME generates the prior PDF fG providing an initial
probability distribution across space and time based on the general knowledge base (mean
trend and covariance of X(s)).
(ii) At the specificatory stage, the site-specific knowledge available is organized into hard
and soft data and expressed in terms of suitable operators.
(iii) At the integration stage, the initial solution fG of stage (i) is enriched by assimilating
the site-specific knowledge of stage (ii). This final solution provides the posterior PDF fK (χk,
sk) for the contaminant level at each estimation point sk of interest.
In this work we use the BMElib numerical implementation of the BME method. While a
detailed treatment of the BMElib numerical implementation is available elsewhere (Serre,
1999b; Serre and Christakos, 1999a; Christakos 2000b; Christakos et al. 2002), we
summarize here the main numerical steps of the analysis. Since the general knowledge base
considered at the structural stage of the analysis consists only of the mean trend and
covariance (statistical moments up to order 2 only), the prior PDF fG obtained at the
structural stage is multivariate Gaussian, i.e.
fG (χmap, smap) = φ (χmap ; mmap, cmap ) (2.15)
where χmap is the vector of values taken by the SRF of interest at the mapping points, smap
are the spatial coordinates of these mapping points, mmap is the vector of mean trend values
provided by the mean trend model at the mapping points, cmap is a matrix of covariance
provided by the covariance model for all pairs of mapping points, and φ (.) is the multivariate
Gaussian PDF (see Serre, 1999b, for the detailed mathematical equations).
The mapping points smap include both the data points sdata and the estimation point sk, i.e.
smap=(sdata, sk). At the specificatory stage the data points are organized into hard and soft data
points, i.e. sdata=(shard, ssoft), and the corresponding site specific knowledge is defined using
Eq. 2.2 for the hard data χhard at shard, and Eqs. 2.3 and 2.4 for the soft data fS(χsoft) at ssoft,
so that we have smap=(shard, ssoft, sk) and χmap=(χhard, χsoft, χk). Then at the integration stage
BMElib calculates the posterior PDF at any estimation point sk using the following Bayesian
conditionalization rule
$f_K(\chi_k, s_k) = A^{-1} \int d\chi_{soft} \, f_S(\chi_{soft}) \, f_G(\chi_{map})$ (2.16)
where A is a normalization parameter.
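To make Eq. (2.16) concrete, the sketch below evaluates it by direct numerical integration for the simplest possible configuration: one soft datum with a Gaussian soft PDF, one estimation point, no hard data, and a zero-mean bivariate Gaussian prior. This is not the BMElib implementation; the function names and all numerical values are hypothetical, and NumPy is assumed.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian PDF."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def integrate(y, x):
    """Trapezoid rule along axis 0 of y over the 1-D grid x."""
    dx = np.diff(x).reshape((-1,) + (1,) * (y.ndim - 1))
    return np.sum(0.5 * (y[1:] + y[:-1]) * dx, axis=0)

def bme_posterior_one_soft(chi_k, soft_mean, soft_var, prior_var, prior_cov,
                           n_quad=2001):
    """Posterior f_K on the grid chi_k for a single Gaussian soft datum.

    The prior f_G is a zero-mean bivariate Gaussian over (chi_soft, chi_k)
    with common variance prior_var and covariance prior_cov.
    """
    half_width = 8.0 * np.sqrt(soft_var)   # f_S is negligible beyond this
    chi_soft = np.linspace(soft_mean - half_width, soft_mean + half_width, n_quad)
    S, K = np.meshgrid(chi_soft, chi_k, indexing="ij")
    det = prior_var ** 2 - prior_cov ** 2
    quad = (prior_var * S ** 2 - 2.0 * prior_cov * S * K + prior_var * K ** 2) / det
    f_G = np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(det))
    f_S = gaussian_pdf(chi_soft, soft_mean, soft_var)
    unnormalized = integrate(f_S[:, None] * f_G, chi_soft)  # integral over chi_soft
    A = integrate(unnormalized, chi_k)                      # normalization constant
    return unnormalized / A

chi_k = np.linspace(-6.0, 6.0, 1201)
f_K = bme_posterior_one_soft(chi_k, soft_mean=1.0, soft_var=0.25,
                             prior_var=1.0, prior_cov=0.6)
post_mean = integrate(chi_k * f_K, chi_k)
post_var = integrate((chi_k - post_mean) ** 2 * f_K, chi_k)
print(post_mean, post_var)
```

In this all-Gaussian special case the result can be checked against the closed form: the posterior mean is prior_cov/(prior_var+soft_var) times the soft mean, and the posterior variance is prior_var − prior_cov²/(prior_var+soft_var). The posterior variance does not vanish even for a collocated soft datum, since the soft variance enters the denominator.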
The posterior PDF provides a complete stochastic characterization of X(s), from which
we obtain any estimate of interest (e.g., the posterior PDF mode, BMEmode, which provides
the most likely value at the estimation point; or the posterior PDF mean, BMEmean, which
minimizes the mean square estimation error; or the posterior PDF median, BMEmedian), as
well as an assessment of the uncertainty associated with that estimate (e.g., the variance of
the BME posterior PDF, or the BME confidence interval as defined in Serre and Christakos,
1999a). By obtaining the BME posterior PDF at the nodes of an estimation grid covering the
mapping region of interest, we are able to construct a map representing the distribution of
contaminant at unmonitored points across space. This map integrates soft data points with
varying levels of measurement uncertainty, as is the case for arsenic monitoring data with
varying levels of measurement errors.
2.3.5 Step by step summary of the approach
By way of summary the steps of the analysis are as follows:
1) Obtain different datasets of measurements of the arsenic field Z(s) and determine for
each dataset the σo and k values (Eq. 2.6) characterizing its measurement uncertainty.
2) Log-transform to obtain the data for the field Y(s)=log(Z(s)), and model its mean
trend mY(s).
3) Model the covariance of the residual field X(s)=Y(s)-mY(s) for each dataset separately,
and compare the σlogε2 obtained from the covariance analysis (i.e. the nugget
component of cXm(r) as expressed in Eq. 2.12) with the σlogε2 corresponding to the
σo and k measurement error model (Eq. 2.14). In case of disagreement revise the
values σo and k and go back to step 1, otherwise accept these values and obtain the
covariance model cX(r) of the combined datasets.
4) Construct the soft data for the log-transformed mean trend removed residual field X(s).
The measurements below detection limit (Eq. 2.3) are used to generate interval soft
data for X(s). The measurements above detection limit are treated as probabilistic
data with Gaussian PDF. The variance of the Gaussian soft PDF is σlogε2 calculated
from Eq. (2.14). The mean of the Gaussian soft PDF is log(Zm)-mY-σlogε2/2, where Zm
and mY are the measured total arsenic concentration and the log-mean trend at the data
point, respectively.
5) Process the covariance cX (r) and soft data for X(s) to calculate the BME posterior
PDF fK (χk, sk) (Eq. 2.16) for X(s) at the nodes sk of an estimation grid covering the
mapping area of interest. Obtain from the BME posterior PDF the median estimate of
X(s), XBMEmedian, and back-transform it to estimate the median estimate of Z(s),
ZBMEmedian=exp(XBMEmedian+mY), where mY is the mean trend value at the estimation
point.
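Steps 4 and 5 above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the interval for a below-detect is assumed to run from zero concentration up to the detection limit (so that its lower bound is minus infinity in log space), and the numerical inputs are placeholders rather than values from the datasets.

```python
import math

def gaussian_soft_datum(z_m, m_y, sigma_o, k):
    """Mean and variance of the Gaussian soft PDF for X(s) (step 4).

    z_m is the measured concentration above detection limit and m_y the
    log-mean trend at the data point.
    """
    var = math.log(1.0 + (sigma_o / z_m + k) ** 2)   # Eq. (2.14)
    mean = math.log(z_m) - m_y - var / 2.0
    return mean, var

def interval_soft_datum(detection_limit, m_y):
    """Interval soft datum for X(s) from a below-detect (assumed Z in [0, DL])."""
    return float("-inf"), math.log(detection_limit) - m_y

def back_transform_median(x_bme_median, m_y):
    """Step 5: Z_BMEmedian = exp(X_BMEmedian + mY)."""
    return math.exp(x_bme_median + m_y)

# Placeholder inputs for illustration.
mean, var = gaussian_soft_datum(z_m=20.0, m_y=2.0, sigma_o=0.333, k=0.233)
lower, upper = interval_soft_datum(detection_limit=1.0, m_y=2.0)
z_median = back_transform_median(x_bme_median=0.0, m_y=math.log(10.0))
print(mean, var, lower, upper, z_median)
```

The median is used for the back-transformation because it is preserved under the monotonic exponential transform, whereas the mean is not.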
2.3.6 Cross validation procedure
In order to assess the improvement provided by the proposed framework, we compare the
performance of the BME method accounting for data uncertainty, with that of various
alternate approaches not accounting for data uncertainty. We therefore need a procedure to
calculate the performance of an estimation method. In general the performance is calculated
as the mean square error (MSE) between a set of n predicted values Xi* at points pi, and the set
of true values Xi at these points, as follows
$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( X_i^* - X_i \right)^2$. (2.17)
The smaller the MSE, the more accurate the method. In the case of a cross validation, we
remove one measured value Xm,i at a time, re-estimate it using neighboring non-collocated
well data to obtain Xi*, and start the process over again for each of n points. This procedure
leads to the generation of n predicted Xi* and measured Xm,i values, from which the MSE
performance can be calculated. However an obvious flaw with this usual approach for our
problem is that the measured value Xm,i is not equal to the true arsenic value Xi, and therefore
should not be used as reference for comparison.
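The leave-one-out procedure and Eq. (2.17) can be sketched as below. As a stand-in for the BME estimator, the sketch uses simple kriging with a single exponential covariance, which is an assumption made purely to keep the example short; the function names and the tiny four-point dataset are hypothetical. The reference values are passed separately from the measured values so that either the measured data or a simulated truth can be used in the MSE.

```python
import numpy as np

def exp_cov(r, sill, a_r):
    """Exponential covariance with sill and effective range a_r."""
    return sill * np.exp(-3.0 * r / a_r)

def simple_kriging(s_data, x_data, s_k, sill, a_r, nugget=0.0):
    """Simple kriging estimate of a zero-mean field at point s_k."""
    d = np.linalg.norm(s_data[:, None, :] - s_data[None, :, :], axis=-1)
    C = exp_cov(d, sill, a_r) + nugget * np.eye(len(s_data))
    c0 = exp_cov(np.linalg.norm(s_data - s_k, axis=1), sill, a_r)
    weights = np.linalg.solve(C, c0)
    return float(weights @ x_data)

def loo_mse(s, x_measured, x_reference, sill, a_r, nugget=0.0):
    """Leave-one-out cross validation MSE (Eq. 2.17).

    Each point i is removed, re-estimated from the remaining measured
    values, and compared with x_reference[i].
    """
    n = len(x_measured)
    x_star = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        x_star[i] = simple_kriging(s[keep], x_measured[keep], s[i],
                                   sill, a_r, nugget)
    return float(np.mean((x_star - x_reference) ** 2))

# Hypothetical coordinates (km) and residual values.
s = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_m = np.array([0.2, -0.1, 0.4, 0.0])
print(loo_mse(s, x_m, x_m, sill=1.0, a_r=7.0))
```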
We address this issue using two alternative cases. In the so called real case, we take
advantage of the fact that some of the datasets we work with have a measurement error that is
smaller than that of the remaining data included in the analysis. Therefore we select only these
datasets as the basis of the cross validation analysis and calculation of the MSE. This
provides an approximate measure of performance that will tend toward the null hypothesis
(because there is still some random error in the dataset used for validation); however, it is as
representative as possible of the real world (hence its name, the “real” case).
The other case involves simulating a field of values for the Xi and Xm,i that reproduces the
statistical properties of the arsenic field and its measured values. Then the same procedure
as described above is conducted to obtain the Xi*, but when it comes time to calculate the
MSE, we are in a position to use the simulated truth Xi instead of the Xm,i. This results in a
better assessment of the real performance improvement of the proposed approach and allows
for some correction away from the null hypothesis; however, it corresponds to a simulated
world (hence its name, the “simulated” case).
2.4. Application of the model
2.4.1 The arsenic datasets
In order to illustrate the measurement error framework presented in this work, we purposely
choose three datasets of groundwater total arsenic measurements collected in New England
such that each dataset has a distinct measurement error from the other two datasets. Our goal
is to show that the measurement error for each dataset can be characterized separately from
the others, and rigorously integrated in the BME mapping analysis. Each dataset includes
only the most recent analysis of total arsenic available for each well. Total arsenic in New
England groundwater is believed to be primarily from natural sources, and therefore does not
change drastically over time. Hence each total arsenic analysis provides a value of total
arsenic today with a data uncertainty that increases for older measurements, as arsenic levels
may have changed slightly over time. In addition the data uncertainty includes analytical
error and sampling error depending on the analytical method and sampling procedure used to
measure total arsenic in each dataset. We describe here briefly each of the three datasets, and
we characterize for each dataset the corresponding measurement error in terms of σo and k
defined in Eq. (2.6). The number of data above and below detection limit is listed in Table
2.1 for each of the datasets, as well as the mean value of the dataset, the analytical detection
limit, and σo and k.
Table 2.1: The number of above and below detects, the mean value and detection limit, and
σo and k (Eq. 2.6) for each dataset.
             Number of data     Number of data     Mean      Detection   σo        k
             above detection    below detection    (µg/L)    limit       (µg/L)
             limit              limit                        (µg/L)
Dataset 1    219                389                12.3      1           0.333     0.233
Dataset 2    155                623                15.2      3           1.000     0.300
Dataset 3    121                144                72.1      5           1.667     0.616
The first dataset had the smallest relative measurement error, with σo=0.333 µg/L and
k=0.233. The detection limit is DL=3σo=1 µg/L, and the precision σZ/Zm for a sample with a
typical measured concentration of Zm=20 µg/L is calculated using Eq. (2.6) as follows:
σZ/Zm=σo/Zm+k=0.333/20+0.233≈25%. As explained earlier, the precision encapsulates all
sources of errors (analytical, sampling and database management errors) contributing to the
data uncertainty of that dataset. This dataset consists of 219 measurements above detection
limit and 389 measurements below detection limit. These measurements were collected
throughout New England, as shown in Figure 2.3(a). The dataset was retrieved from the
USGS National Water Information System (NWIS) in 2001, and is a subset of 20,043 arsenic
samples collected from potable water over the entire U.S. from 1973 to 2001 (Focazio et al.,
2000; USGS 2001). The arsenic analyses were performed using USGS approved analytical
methods including ICP-MS. The detection limit of 1 µg/L was reported for samples
measured below detection limit, and good USGS sampling procedure and database
management practice were followed uniformly for this dataset, which contributed to the
lower data uncertainty of this dataset compared to the other two datasets (see Appendix E).
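The detection limit and precision calculations quoted for each dataset follow directly from σo and k. As a small illustrative sketch (the function names are hypothetical; the parameter values are those listed in Table 2.1, and Zm=20 µg/L is the typical concentration used in the text):

```python
def detection_limit(sigma_o):
    """Detection limit taken as three times sigma_o (DL = 3*sigma_o)."""
    return 3.0 * sigma_o

def relative_precision(z_m, sigma_o, k):
    """Precision sigma_Z/Z_m = sigma_o/Z_m + k, as used with Eq. (2.6)."""
    return sigma_o / z_m + k

# (sigma_o in ug/L, k) for each dataset, from Table 2.1.
datasets = {
    "Dataset 1": (0.333, 0.233),
    "Dataset 2": (1.000, 0.300),
    "Dataset 3": (1.667, 0.616),
}

z_typical = 20.0  # ug/L
for name, (sigma_o, k) in datasets.items():
    dl = detection_limit(sigma_o)
    precision = relative_precision(z_typical, sigma_o, k)
    print(f"{name}: DL = {dl:.1f} ug/L, precision at 20 ug/L = {precision:.0%}")
```

Since the σo/Zm term shrinks as the concentration grows, the relative precision approaches k for high concentrations and degrades rapidly near the detection limit.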
The second dataset had a slightly higher relative measurement error, with σo=1 µg/L and
k=0.300, corresponding to a detection limit of DL=3 µg/L, and a precision σZ/Zm = 35% for a
typical measured concentration of Zm=20 µg/L. This dataset was obtained from USGS
Water-Resources Investigations Report 99-4162 published in 1999 (Ayotte et al., 1999). The
USGS compiled this dataset by collecting the existing arsenic data from states in New
England (i.e. Maine, New Hampshire, Massachusetts, and Rhode Island, see Figure 2.3b) that
were using laboratory-analysis methods and sample collection procedures in accordance with
Federal standards. The detection limits reported for measurements below detection limit in
this database ranged from 1 µg/L to 5 µg/L, so that 3 µg/L was used as the representative
detection limit across that dataset for measurements not reporting the detection limit. This
leads to the choice of σo=1 µg/L for this dataset which is 3 times larger than the σo used in
dataset 1. Though federally approved, the analytical measurement techniques used in this
dataset may have varied from state to state, and furthermore the dataset was published in
1999, two years earlier than dataset 1. This leads to the selection of k=0.300, which is
slightly larger than for dataset 1 (see Appendix E).
The third dataset is probably the most interesting, as it combines data collected by
homeowners in New Hampshire (90% of the data in dataset 3), and data collected at certain
wells by the New Hampshire district office of the USGS (10% of the data in dataset 3). The
locations of these data points are shown in Figure 2.3(c). This is a typical situation where a
dataset provides very valuable information, but with a high associated data uncertainty. As
shown in Table 2.1, the measurement error parameters for dataset 3 were σo=1.667 µg/L and
k=0.616, which are much higher than for datasets 1 and 2. This is due to the fact that the
homeowners sampled their own well at the tap following instructions sent by mail, resulting
in a higher sampling error than for the other two datasets collected by trained technicians.
Furthermore the analytical method used to analyze the water samples mailed to the New
Hampshire Department of Environmental Services (NHDES) laboratory in Concord was
graphite furnace atomic absorption spectrometry (GFAA) with a detection limit of 5 µg/L. Reports of
elevated arsenic concentrations were investigated by the New Hampshire district office of the
USGS and analyzed using ICP-OES, resulting in the remaining 10% of the data in dataset 3.
Overall the data uncertainty of dataset 3 is characterized by the high values for σo and k, so
that it can be rigorously integrated with the other two datasets in the BME analysis.
Figure 2.3: Measured arsenic concentrations above detection limit shown with marker size
proportional to observed values for (a) dataset 1, (b) dataset 2, (c) dataset 3. The locations of
all measurements below and above detection limit are shown in (d).
2.4.2 Mean trend
The arsenic data Zdata from the three datasets combined were obtained by using the measured
values above detection limit, and half the detection limit for values recorded as below detect.
The log-transformed data were given by Ydata = log(Zdata). We then obtained the mean trend
function mY(s) defined in Eq. (2.10) by smoothing the log-arsenic data Ydata using a Gaussian
kernel smoothing function of BMElib (Christakos et al., 2002). The mean trend model mY(s)
that we obtained is shown in Figure 2.4. This mean trend model represents the systematic
trend in the spatial distribution of arsenic across New England.
Figure 2.4: Distribution of the mean trend of total arsenic concentration mY(s) across New
England groundwater.
2.4.3 Covariance analysis and verification of the measurement error parameters
The key step of the work presented here is the covariance analysis allowing verification of
the measurement error parameters σo and k for each dataset. Using the log-transformed data
Ydata and the mean trend model mY(s) described in the preceding section, we obtain the data
for the residual field Xdata=Ydata- mYdata. These measured data are used to calculate the covariance
cXm(r) using all the data available or any subset of them. For illustration purposes we show in
Figure 2.5(a) the covariance obtained using all the data Xdata available for the three combined
datasets and in Figure 2.5(b) the covariance obtained using the Xdata corresponding only to
the measurements above detection limit for the three combined datasets. We then report in
Table 2.2 the nugget component σlogε2 obtained from this covariance analysis (i.e. σlogε2
=0.389 is the nugget component of the covariance of Figure 2.5(a) obtained for the combined
datasets, and σlogε2 =0.221 is obtained from Figure 2.5(b) using only the above detects of the
combined datasets). We also report in Table 2.2 the σlogε2 predicted with the measurement
error model by averaging Eq. (2.14) over the data used in the analysis. As can be seen in
Table 2.2, the measurement error model predicts correctly that the covariance nugget of
Figure 2.5(b) should be smaller than that of Figure 2.5(a) because the exclusion of the below
detects removed data with high measurement errors, as explained by Eq. (2.14).
Figure 2.5: Covariance model obtained using (a) all the data Xdata corresponding to the three
combined datasets, and (b) only above detects for the three combined datasets.
Table 2.2: Comparison of the values of σlogε2 estimated using (a) the covariance analysis and
(b) the measurement error model.
                                            Estimated σlogε2 using:
Dataset used                                (a) Covariance    (b) Measurement
                                            analysis          error model
Datasets combined                           0.389             0.443
Datasets combined without below detects     0.221             0.240
Dataset 1                                   0.478             0.443
Dataset 2                                   0.344             0.349
Dataset 3                                   0.730             0.722
As described above, we then proceed by analyzing each dataset separately in order to
validate its measurement error parameters σo and k. The results obtained are reported in
Table 2.2, and as can be seen from that table there is an excellent fit between the σlogε2
obtained from the covariance analysis using each dataset separately, and the σlogε2 predicted
by the measurement error model for each of these datasets. This excellent fit is further
illustrated in Figure 2.6 plotting the Table 2.2 results, i.e. showing the plot of the σlogε2
values predicted by the measurement error model versus the σlogε2 values obtained from the
covariance analysis. The corresponding regression statistic is R2=0.972. This excellent fit is
the first of its kind for the geostatistical analysis of total arsenic, and has important
implications for the assessment of arsenic in the groundwater of New England and the United
States. It demonstrates that using the measurement error model proposed in this work, the
uncertainty in total arsenic monitoring data can be rigorously assessed and validated at the
covariance analysis stage, thereby providing the foundation for an accurate estimation of the
distribution of groundwater arsenic across space. The remainder of this work builds on this
foundation to map arsenic across New England using our three datasets with very different
levels of measurement errors.
Figure 2.6: Plot of the σlogε2 values predicted by the measurement error model versus the
σlogε2 values obtained from the covariance analysis.
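The agreement plotted in Figure 2.6 can be checked from the values listed in Table 2.2. The sketch below assumes that the reported regression statistic is the squared Pearson correlation computed over all five rows of the table; that interpretation is an assumption rather than something stated in the text, and NumPy is assumed to be available.

```python
import numpy as np

# Values of sigma^2_log(eps) from Table 2.2, in row order:
# combined, combined without below detects, dataset 1, dataset 2, dataset 3.
covariance_analysis = np.array([0.389, 0.221, 0.478, 0.344, 0.730])
error_model = np.array([0.443, 0.240, 0.443, 0.349, 0.722])

# Squared Pearson correlation between the two assessments.
r = np.corrcoef(covariance_analysis, error_model)[0, 1]
r_squared = r ** 2
print(f"R2 = {r_squared:.3f}")
```

Under that assumption the result should come out close to 0.97, consistent with the statistic quoted above.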
Now that we have validated that the nugget component of the covariance cXm(r) is due to
the variance σ logε2 associated with measurement error, we can remove this component from
cXm(r) using Eq. (2.12). We obtain the experimental covariance cX (r) = cXm(r)-σlogε2δ(r) for
the SRF X(s) associated with the true total arsenic concentration in the groundwater of New
England. We fit to this experimental covariance the following covariance model
cX(r) = c1 exp( -3r / ar1 ) + c2 exp( -3r / ar2 ), (2.18)
where c1=0.7 σX2, ar1=7 km, c2=0.3 σX2, ar2=40 km and σX2 is obtained from Eq. (2.13). We
show this covariance model using a solid line in Figure 2.5(a) and 2.5(b), and as can be seen
in these figures, our model fits the experimental covariance estimates very well. This model
indicates that about 70% of the spatial variability of total arsenic in New England
groundwater has a short spatial range of about 7 km, while the remaining 30% of variability
has a longer range of about 40 km. These findings are in agreement with findings of Serre et
al. (2003) for arsenic in Bangladesh groundwater, where the range was found to vary
between 2 and 57 km.
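For reference, the nested covariance model of Eq. (2.18) and the corresponding measured-field covariance of Eq. (2.12) can be written as a short sketch. The function names are hypothetical, the variance is left as a parameter (set to 1 in the example, i.e. a normalized covariance), and the nugget value is a placeholder; NumPy is assumed.

```python
import numpy as np

def cov_x(r, sigma2_x, a_r1=7.0, a_r2=40.0):
    """Eq. (2.18): nested exponential covariance of the true field X(s).

    r is the lag in km; 70% of the variance is carried by the short-range
    structure (a_r1) and 30% by the long-range structure (a_r2).
    """
    r = np.asarray(r, dtype=float)
    c1, c2 = 0.7 * sigma2_x, 0.3 * sigma2_x
    return c1 * np.exp(-3.0 * r / a_r1) + c2 * np.exp(-3.0 * r / a_r2)

def cov_xm(r, sigma2_x, nugget):
    """Eq. (2.12): covariance of the measured field, nugget added at r = 0."""
    r = np.asarray(r, dtype=float)
    return cov_x(r, sigma2_x) + nugget * (r == 0.0)

lags = np.array([0.0, 7.0, 40.0])
print(cov_x(lags, sigma2_x=1.0))               # normalized covariance
print(cov_xm(lags, sigma2_x=1.0, nugget=0.4))  # placeholder nugget
```

With the factor of 3 in the exponent, each structure has decayed to about 5% of its sill at a lag equal to its range, which is the usual convention for the effective range of an exponential model.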
2.4.4 The BME mapping results
The general knowledge base considered is the covariance model of Eq. (2.18), while the site-
specific knowledge base consists of the soft data obtained using the measurement error
model according to the methodology presented in the theory section of this paper. As
explained above, the soft data rigorously represents the measurements below and above
detection limit by accounting for the detection limit and precision of each dataset.
Using the BME method we process this knowledge base and calculate the BME posterior
PDF at the nodes of an estimation grid covering New England. From the BME posterior
PDF we obtain the BME median estimate ZBMEmedian of total arsenic, which we map in Figure
2.7. An assessment of the mapping uncertainty associated with the BME estimate of Figure
2.7 is provided by the variance of the BME posterior PDF for X(s) normalized by the
variance σX2, which we show on the map of Figure 2.8.
Figure 2.7: Map of the BME median estimate of total arsenic in the groundwater of New
England.
Figure 2.8: Map of the variance of the BME posterior PDF for X(s) normalized by the
variance σX2. This map provides an assessment of the mapping uncertainty associated with
Figure 2.7.
The maps of Figures 2.7 and 2.8 provide very valuable information for public health
officials dealing with the problem of groundwater arsenic in New England. Figure 2.8 shows
that the mapping uncertainty increases as we move away from the locations where samples
were collected. This map is useful to allocate monitoring resources in areas of high mapping
uncertainty. Furthermore we note that the mapping uncertainty does not drop to zero at the
sampling locations. This illustrates the fact that the maps presented here not only account for
the high natural spatial variability of arsenic geology, but also for the measurement errors
associated with the arsenic samples. Indeed the BME method takes into account the
uncertainty associated with the soft data by rigorously processing the measurement error
model presented in this work. Hence the BME estimate of arsenic presented in Figure 2.7 is
the best map of groundwater arsenic produced to date on the basis of the datasets used in this
work. This map provides the tools for public health officials to identify areas where the
arsenic concentration in the groundwater may exceed the new 10 µg/L standard beginning
January 23, 2006, which will help determine where additional treatment will be warranted to
remove arsenic from drinking water. Furthermore the work presented here provides an ideal
framework to add new monitoring data with presumably lower detection limit and better
precision as the analytical measurement techniques for arsenic and its speciation keep
improving in the future.
2.4.5 Cross validation results
As described in the theory section, in the “real” case we perform a cross validation using a
selected dataset with low measurement error as the reference dataset. This cross validation
allows us to compare the estimation method presented in this work against other estimation
methods by comparing their MSE (Eq. 2.17). The four methods that we will compare are
summarized in Table 2.3. Method 1 uses “hardened” data for the measurements above and
below detection limit, i.e. it treats the measured values above detection limit and the
midpoint of the interval below detection limit as if they were exact measurements of arsenic.
Hence method 1 represents a classical approach not accounting for data uncertainty. On the
other hand method 2 uses the approach presented in this work, i.e. it uses the measurement
error model to rigorously account for the data uncertainty in the measurements above
detection limit, and it uses an interval soft data for the measurements below detection limit.
Note that the reference dataset is always treated as hard, since that is the dataset used to
calculate the MSE. Hence the reduction of MSE between method 1 and 2 will provide a
measure of the improvement in estimation accuracy attributed to rigorously accounting for
data uncertainty in the above and below detects. Additionally methods 3 and 4 are similar to
methods 1 and 2, respectively, except that measured values below detection limit are ignored.
This will allow us to assess the effect of the measurement error model alone.
Table 2.3: Specifications of each of the four methods compared in the cross validation
analysis.
            Data > Detection Limit                       Data < Detection Limit
Method 1    Measured value as hard                       (upper bound + lower bound)/2 as hard
Method 2    Probabilistic soft data using measurement    [lower bound, upper bound]
            error model, except reference dataset        as interval soft data
            treated as hard data
Method 3    Measured value as hard                       Ignored
Method 4    Probabilistic soft data using measurement    Ignored
            error model, except reference dataset
            treated as hard data
The dataset used as reference for the cross validation was chosen
to be either dataset 1 or dataset 2. We first selected dataset 1 because it has the smallest σo and k
(see Table 2.1). We then selected dataset 2 as the reference because it has the smallest
average σlogε2 (see Table 2.2). Hence each of these datasets has relatively small measurement
error and can be used to represent the unknown true arsenic concentration. However dataset
3 has consistently higher measurement errors, so it cannot be selected as the reference dataset.
Using a selected dataset as reference, we calculated the cross validation MSE (Eq. 2.17)
for each of the methods described in Table 2.3, and we then calculated the percent change in
MSE from Method 1 to Method 2 as 100%*(MSE2 – MSE1)/MSE1, and from Method 3 to
Method 4 as 100%*(MSE4 – MSE3)/MSE3. The results obtained are shown in Table 2.4. As
can be seen from this table, there is a consistent decrease in MSE from method 1 to method 2
using either dataset 1 or 2 as the reference (validation) dataset. This indicates that the
proposed approach presented in this work leads to a consistent improvement in mapping
accuracy over a method not accounting for data uncertainty. The reduction in MSE is due to
the fact that our approach rigorously accounts for the data uncertainty associated both with
the above detects as well as the below detects. In order to consider only the effect of above
detects (i.e. ignoring the effect of below detects), we turn to methods 3 and 4. We see that
there is still a consistent decrease in MSE, indicating that rigorously accounting for the data
uncertainty of measurements above detection limit using the measurement error model
presented in this work leads to a consistent improvement of mapping accuracy in the real
case.
Table 2.4: Change in MSE from classical methods (i.e. methods 1 and 3) to the proposed
methods (i.e. methods 2 and 4). A negative change means reduction in MSE, indicating an
improvement in mapping accuracy.
                                                      Change in MSE from      Change in MSE from
                                                      Method 1 to Method 2    Method 3 to Method 4
“Real” case using dataset 1 as validation dataset     -8.21%                  -4.39%
“Real” case using dataset 2 as validation dataset
“Simulated” case                                      -67.4%                  -38.8%
While the cross-validation in the “real” case uses dataset 1 or 2 as the reference, these
datasets do not represent the true arsenic concentration. The true arsenic concentration is
actually unknown, and using “real” datasets introduces a random error in the cross validation
procedure. As discussed earlier, this leads to a tendency toward the null hypothesis, i.e. it
dampens the ability to measure the change in MSE between methods. We believe that this
means that the reductions in MSE reported for the “real” cases in Table 2.4 are lower bounds
of the true reduction in MSE, and we address this issue using the “simulated” cross validation
described in the theory section. The simulated dataset reproduces the statistical properties of
the log-arsenic data from the three combined datasets. For example, the variance (i.e. 0.7071
[log-µg/L]2) of the simulated log-arsenic data matches well with the variance (i.e. 0.7875
[log-µg/L]2) of the true log-arsenic. As previously represented in Figure 2.5(a) the true
variance is calculated after removing the nugget effect from the experimental variance using
the real data. As can be seen in Table 2.4, the cross validation result obtained confirms our
belief, as it shows that when we use a simulated truth as the basis for cross-validation, the
reduction in MSE from method 1 to method 2 is 67.4%, and the reduction from method 3 to
method 4 is 38.8%.
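The dampening effect of a noisy reference dataset on the percent change in MSE can be illustrated with a small numerical sketch. The MSE and noise values below are hypothetical and chosen only for illustration; the sketch further assumes that the random error in the reference values is zero-mean and independent of the estimation error, so that it simply adds its variance to each observed MSE.

```python
def percent_change(mse_new, mse_old):
    """Percent change in MSE, 100%*(MSE_new - MSE_old)/MSE_old."""
    return 100.0 * (mse_new - mse_old) / mse_old

# Hypothetical MSE values measured against the (unknown) truth.
mse_true_classical = 0.30
mse_true_proposed = 0.10

# Hypothetical variance of the random error in the reference dataset.
reference_noise_var = 0.40

# Under the independence assumption, the MSE observed against a noisy
# reference is the true MSE plus the reference noise variance.
mse_obs_classical = mse_true_classical + reference_noise_var
mse_obs_proposed = mse_true_proposed + reference_noise_var

change_true = percent_change(mse_true_proposed, mse_true_classical)
change_observed = percent_change(mse_obs_proposed, mse_obs_classical)
print(f"change against truth:           {change_true:.1f}%")
print(f"change against noisy reference: {change_observed:.1f}%")
```

In this illustration the same absolute improvement appears as a much smaller percent reduction when measured against the noisy reference, which is the sense in which the “real” case reductions act as lower bounds.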
The decrease in MSE reported in Table 2.4 demonstrates that accounting for the data
uncertainty of both above and below detects leads to a very significant improvement in
mapping accuracy for arsenic in New England. A substantial part of the improvement in
mapping accuracy reported in our results comes from the mathematically rigorous analysis of
measurement errors using the measurement error model presented in this work.
2.5. Conclusions
Due to the increased recognition of the public health concern associated with arsenic and the
impending change in the federal standard limiting arsenic concentration in the drinking water
at 10 µg/L, it has become important to accurately map arsenic concentration across our
ground waters. A survey of analytical techniques and sampling procedures used to measure
arsenic shows that the detection limit and precision have improved drastically over time, and
will continue to do so in the near future. As a result we are faced with a situation where
historical groundwater arsenic datasets may provide very valuable information but have very
different levels of measurement errors. Furthermore new datasets may be collected in the
future with a much better precision than is now routinely achieved. The question raised is
then: how can we effectively and rigorously process datasets with varying levels of
uncertainty so as to accurately map arsenic across space?
We present in this work a model for the measurement error of total arsenic. We define
the measurement error as including all random and systematic errors, including analytical
errors, sampling errors, and data management errors. Our measurement error model is an
extension of the model by Kinniburgh and Kosmus (2002) that (a) provides a way to validate
the measurement error parameters at the covariance analysis stage, and (b) generates the
probabilistic distribution of errors in the form of a so-called soft PDF, which can then be
rigorously analyzed using the BME method of modern Geostatistics.
We applied the proposed framework using three historical datasets of total arsenic
measurements from samples collected in New England. Two of the datasets were obtained
from the USGS and have a relatively low measurement error. The third dataset included
samples collected and mailed by homeowners, leading to higher sampling error. We were
able to assess the measurement error parameters for our model for each dataset based on
information available about the analytical techniques and sampling procedures used. We
then validated the resulting measurement error model at the covariance analysis stage, and
this study is the first of its kind demonstrating that this is feasible in the geostatistical
mapping analysis of groundwater arsenic. Hence the measurement error model provides the
foundation to generate soft data rigorously accounting for the data uncertainty associated
with each dataset. Using the soft data generated we obtained a map showing the distribution
of arsenic across the groundwater of New England. Finally our cross validation analysis
demonstrated that the rigorous processing of measurement errors using the approach
presented in this paper leads to a substantial improvement of mapping accuracy over methods
not accounting for the difference in measurement error between the datasets available.
This work has important implications for public health officials needing to identify areas
where the arsenic concentration in the groundwater may exceed the new 10 µg/L standard for
drinking water beginning January 23, 2006. Using the approach presented here they will be
able to assess the measurement error of historical datasets and rigorously process them to
accurately map the distribution of arsenic in their ground water. Furthermore the work
presented here provides an ideal framework to add new monitoring data with presumably
lower detection limit and better precision as the analytical measurement techniques for
arsenic and its speciation keep improving in the future. While the maps presented here
provide the best assessment to date of total arsenic in New England ground waters on the
basis of the three historical datasets obtained for this work, future work will look into
further improving this map by incorporating new data for New England as these become
available.
III. BME mapping using empirical laws with secondary spatial data: A
farewell to co-kriging?
3.1. Background
Environmental mapping studies are concerned with the spatial estimation of an
environmental contaminant at some unsampled locations. By mapping the estimated values
obtained at the nodes of a regular grid, we obtain a realistic representation of the spatial
distribution of the environmental contaminant of interest, which we refer to as the primary
variable. When dealing with error-free measurements (hard data) of the primary variable
available at some set of sampling locations, a traditional approach is to use the simple kriging
(SK) method of classical Geostatistics (Journel and Huijbregts, 1978; Olea, 1999; Armstrong,
1998; Isaaks and Srivastava, 1989). Assuming without loss of generality that the mean trend
of the spatial field representing the primary variable is known (or can be estimated using a
parameterized spatial regression model or some spatial smoothing operator), SK is simply the
best linear unbiased estimator (BLUE) using the hard data available for the primary variable,
i.e. it is the linear combination of the error-free measurements of the primary variable that is
unbiased and minimizes the estimation error variance at the estimation point (Stein, 1999;
Christakos, 2000b).
In many environmental applications, secondary spatial fields related to the primary field
provide additional information that is useful to map the primary variable. For example, the
soil pH is a secondary spatial field providing useful information to map the concentration of
groundwater arsenic over space. The traditional extension of SK to account for secondary
spatial field data has been the simple co-kriging method (Goovaerts, 1997; Wackernagel,
1995). Simple co-kriging is a BLUE approach that extends SK by integrating secondary hard
data using the statistical cross correlation between the primary and secondary variables.
However, the stochastic empirical law describing the relationship between the primary
and secondary variables is often complex and information rich. This stochastic empirical law
may be denoted as the conditional Probability Density Function (PDF) fS(χ|ψ) of the primary
variable x given an error free measured value ψ for the collocated secondary variable y. This
conditional PDF has multiple statistical moments, each of which may vary non-linearly with
the measured secondary variable. Hence, a stochastic empirical law may in general be
described as a vector of non-linear relationships. While co-kriging accounts for the cross
correlation coefficient summarizing the relationship between the primary and secondary
variables, it does not have any formal mechanism to process the multiple nonlinear aspects of
a realistic stochastic empirical law. As a result, unless a remarkable cross correlation is
detected, co-kriging does not guarantee a substantial increase in mapping accuracy, as has
been found in previous works, such as that of Welhan and Merrick (2003) investigating the
estimation of groundwater arsenic using specific conductance as the secondary variable.
In this work, we investigate the mapping accuracy of a Bayesian Maximum Entropy
(BME) approach that formally accounts for the stochastic empirical law between the primary
and secondary variables. We describe some straightforward non-parametric and parametric
approaches to model the multiple non-linear aspects of the stochastic empirical law. Then, in
order to compare the kriging, co-kriging, and BME methods, we develop a method to
simulate a useful class of spatially related synthetic random fields. Using these related
synthetic fields, we demonstrate that because the BME approach formally accounts for the
empirical law between the primary and secondary variables, it leads to a substantial
improvement in mapping accuracy over the co-kriging method which only accounts for the
cross-correlation between primary and secondary variables. Finally we demonstrate the
applicability of our BME approach in the mapping estimation of arsenic in New England
using soil pH as the secondary spatial field.
The remainder of this paper is organized as follows. In the methods section we present the
framework for spatial random fields, the non-parametric and parametric approaches used to
model the stochastic empirical law, the BME estimation method, and finally a procedure to
generate the spatially related synthetic fields used for cross validation purposes. Then in the
results section we present first the synthetic case study, followed by the real-world
application to mapping groundwater arsenic in New England using soil pH data as the
secondary variable.
3.2. Method description
3.2.1. Spatial Random Field (SRF) representation and physical knowledge bases
We denote by X(s) the SRF (Christakos, 1992) representing the spatial distribution of the
primary variable X at the spatial location s, where s=[s1, s2] for a two-dimensional space. The
set of mapping points of interest is denoted as smap={si}, where i=1, 2,…, n. The vector of
random variables xmap=[X(s1), X(s2),…, X(sn)] represents the SRF at the mapping points smap.
A possible realization of xmap is denoted as the vector of realized values χmap = [χ1,…, χn].
The randomness associated with xmap is then represented by the set {χmap} of all possible
realizations. Randomness is fully characterized by the multivariate PDF f(χmap) describing
the probability associated with any given realization χmap, i.e.
f (χmap) dχ = Prob[χ1 < x1 <χ1+dχ1,…, χn < xn < χn+dχn] (3.1)
where Prob[.] is the probability operator. An important operator on xmap is the stochastic
expectation operator E[.] of some known function g(xmap) of xmap, which is defined as the
expected value of g(xmap) obtained as follows
E[g(xmap)] = ∫dχmap g(χmap) f(χmap). (3.2)
The physical knowledge base K describing the contaminant SRF consists in the union of
general knowledge G characterizing the spatial trend and variability of the environmental
processes at play, and site specific knowledge S including the monitoring data available for
the specific site at hand. The spatial trend of the primary environmental variable is modeled
by the mean trend function mX(s) of the SRF X(s) defined as
mX(s) = E[X(s)]. (3.3)
This mean trend function characterizes the systematic trends and spatial structures of the
primary variable. The spatial variability is characterized by the covariance function cX(s,s’)
of the SRF X(s) between points s and s’, defined as
cX(s,s’) = E[ (X(s)-mX(s)) (X(s’)-mX(s’)) ] (3.4)
The covariance function quantifies the amount of co-variability for the primary variable
taken at a pair of points s and s’, which provides a measure of the spatial dependencies and
autocorrelations in the field representing the primary variable. While mX(s) and cX(s,s’)
constitute the general knowledge G, the site-specific knowledge S consists in the actual data
available at a set of specific data points sdata={si}where i=1, 2,…, m. This data often includes
hard data χhard regarded as exact measurements of the primary variable at the points
shard={si}where i=1, 2,…, mh, i.e.
Prob[ X(shard) =χhard] = 1. (3.5)
In many environmental applications we also consider a set of points ssoft={si}, i= mh+1, …,
mh+ms=m, where some so-called soft data is available, but has quantifiable associated
uncertainty. A soft datum may be of the interval type (Christakos et al., 2001; Christakos
and Serre, 2000a). For example when measurements of the primary variable at points ssoft are
below detection limit, we have
Prob[0< X(ssoft) < Detection Limit] = 1 (3.6)
More generally soft data is of the probabilistic type (Christakos et al., 2001; Christakos and
Serre, 2000a; Serre et al., 2005) when the uncertainty in the soft data can be quantified by
means of a soft PDF fS such that
$$\mathrm{Prob}[X(s_{soft}) < u] = \int_{-\infty}^{u} d\chi_{soft} \, f_S(\chi_{soft}). \tag{3.7}$$
We describe next the field representing the secondary variable, and in the following section
we present some straightforward approaches to derive soft data (Eq. 3.7) for the primary
variable on the basis of exact measurement of the secondary variable.
3.2.2. Empirical law and cross-correlation of related spatial fields
In a wide range of environmental mapping applications there exists an SRF Y(s) for a
secondary variable Y that is related to the primary variable X through some empirical law.
As an example, the consideration of an empirical law describing the association between
groundwater arsenic (As) concentration and the soil pH is motivated by the work of Sanchez
et al. (2003). Their study considered a soil contaminated with As and analyzed the As
solubility as a function of pH levels (a representative subset of their points is shown with
circles in Figure 3.1). A curve fitted to the experimental data (shown with a plain line in
Figure 3.1) indicates a clear non-linear relationship between log-As and pH due to the
dependency of arsenic solubility with pH. This evidence supports the existence of empirical
laws describing the relationship between log-As and pH, and has been confirmed by studies
at different geological sites. Peters et al. (1999) observed that As-levels in New England
groundwater are affected by pH-levels since the As-concentration varies with anion exchange
and co-precipitation with iron and manganese oxyhydroxides. Similarly the study of arsenic
in eastern New England by Ayotte et al. (2003) suggests that the high levels of As occur
where elevated pH-values exist due to the geological properties of the bedrock aquifer (i.e.
presence of calcite, ion exchange etc.).
Figure 3.1: The circles represent a subset of the data published by Sanchez et al. (2003)
showing the solubility and release of log-As as a function of pH for a given soil sample
contaminated with arsenic at a pesticide manufacturing site.
The stochastic empirical law between the collocated random variables x=X(s) and y=Y(s)
provides one way to model the spatial relationship between the SRF’s X(s) and Y(s). This
empirical law is expressed by the conditional soft PDF fS(χ|ψ) of the primary variable x given
an error free measured value ψ for the collocated secondary variable y. The conditional PDF
provides a complete stochastic description of the relationship by means of its various
statistical moments, each of which may vary non-linearly with the measured value for y. In
practice, it is convenient to model the conditional PDF using an adequate statistical
distribution φ of x given a set of coefficients µ=[µ1, µ2, ..., µm], each of which is a function of
the measured secondary variable ψ, i.e.
fS(χ |ψ) = φ(χ ; µ(ψ)). (3.8)
A common example for φ is the Gaussian PDF with only two parameters µ = (µ1, µ2) where,
µ1(ψ) = E[x|ψ] (3.9)
is the expected value of x given an error free measured value ψ for the collocated y, and
µ2(ψ) = E[(x-µ1(ψ))²|ψ] (3.10)
is the variance of x given ψ. Hence the spatial relationship between X(s) and Y(s) may be
modeled through the empirical law fS(χ |ψ)= φ(χ ; µ(ψ)), which consists in obtaining the
vectorial non-linear relationship µ(ψ) = (µ1(ψ), µ2(ψ)).
The cross-covariance cXY(s,s’) also quantifies the connection between two related spatial
fields. It is an extension of the covariance function (Eq. 3.4) defined as
cXY(s,s’) = E[ (X(s)-mX(s)) (Y(s’)-mY(s’)) ]. (3.11)
The cross-covariance function measures spatial dependencies and correlations between the
two spatial fields, and from it we obtain the dimensionless correlation coefficient ρXY at a
location s as
ρXY = cXY(s,s) / (σX(s) σY(s)), (3.12)
where σX(s) is the standard deviation of the primary variable at s, and likewise σY(s) is the
standard deviation for the secondary variable.
However, cXY(s,s’) and ρXY only provide a global statistical description of the relationship
between the two spatial fields that fails to account for any non-linearity, whereas the
empirical law offers a complete description of the non-linear aspect of the relationship
between the two fields in terms of the vectorial function µ(ψ). In the following section we
present three straightforward approaches to model µ(ψ) from collocated measurements, first
using a non-parametric approach, and then using a parametric approach with polynomials of
order 1 and 2.
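The limitation of the correlation coefficient can be illustrated with a minimal numerical sketch. It assumes a hypothetical noise-free quadratic law x = y² and secondary values symmetric about zero (neither is taken from the data used in this work); the function name is ours. The sample analogue of Eq. (3.12) is close to one for a linear law and close to zero for the quadratic law, even though x is fully determined by y in both cases.

```python
import numpy as np

def correlation_coefficient(x, y):
    """Sample analogue of Eq. (3.12) for collocated values of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c_xy = np.mean((x - x.mean()) * (y - y.mean()))  # collocated cross-covariance
    return c_xy / (x.std() * y.std())

# Hypothetical secondary values, symmetric about zero (illustration only)
y = np.linspace(-1.0, 1.0, 201)
x_linear = 1.0 + 2.0 * y   # linear law: well summarized by the correlation
x_quadratic = y ** 2       # quadratic law: x depends on y, yet is uncorrelated with it

rho_linear = correlation_coefficient(x_linear, y)
rho_quadratic = correlation_coefficient(x_quadratic, y)
print(f"linear law:    rho = {rho_linear:.3f}")
print(f"quadratic law: rho = {rho_quadratic:.3f}")
```

In this constructed example the conditional mean µ1(ψ) = ψ² carries all the information that the correlation coefficient misses.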
3.2.3. Deriving the conditional PDF fS(χ|ψ) that describes the empirical law
We denote in this section by χ=[χ1, χ2,…, χN]T and ψ=[ψ1, ψ2,…, ψN]T the column vectors
of exact measurements of the primary and secondary variables X and Y, respectively, at
locations {s1, s2,…, sN} where collocated measurements of X and Y are available. Note that in
general N<n, since the N points with collocated (X,Y) measurements are a subset of the n
mapping points. We also denote by χ̄ and ψ̄ the arithmetic averages of the elements in the
χ and ψ vectors, respectively.
3.2.3.1. Non parametric approach
In many cases, the empirical relationship between the primary variable x=X(s) and collocated
secondary variable y=Y(s) does not have a known functional form. A non-parametric
approach is then useful to model µ(ψ)= (µ1(ψ), µ2(ψ)), where ψ is an exact measured value
for y; µ1(ψ) = E[x|ψ] (Eq. 3.9) is the conditional expected value of x given y=ψ; and µ2(ψ)
= E[(x-µ1(ψ))²|ψ] (Eq. 3.10) is the conditional variance of x given y=ψ. To achieve this
objective within the non-parametric approach we first partition the collocated observations
χ and ψ into a set of disjoint classes χ(k) and ψ(k), k=1,…,K, subject to ψk < ψ(k) < ψk+1, i.e.
ψ(k) is the subset of ψ that belongs to the interval [ψk, ψk+1], and χ(k) is the corresponding
subset of χ. Then each class has, under an ergodicity assumption, its own expected value µ1
of x given that ψk < y < ψk+1:
$$\mu_1(\bar{\psi}^{(k)}) \cong E[x \mid \psi_k < y < \psi_{k+1}] \cong \bar{\chi}^{(k)} \tag{3.13}$$
where ψ̄(k) is approximately equal to the midpoint between ψk and ψk+1, and χ̄(k) is the
arithmetic average of the corresponding vector χ(k). Similarly we obtain µ2 for each class as
$$\mu_2(\bar{\psi}^{(k)}) \cong E[(x - \mu_1(\bar{\psi}^{(k)}))^2 \mid \psi_k < y < \psi_{k+1}] \cong \overline{(\chi^{(k)} - \mu_1(\bar{\psi}^{(k)}))^2}. \tag{3.14}$$
Finally the set of values {ψ̄(k), µ1(ψ̄(k)), µ2(ψ̄(k))}, k=1,…,K, provides a discretized form of
the µ(ψ) relationships.
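The class-based estimates of Eqs. (3.13) and (3.14) can be sketched as follows. This is a minimal illustration rather than the implementation used in this work. The function name, the half-open class intervals, the handling of empty classes and the small dataset are our assumptions; the class average of ψ is used as the representative value of each class.

```python
import numpy as np

def nonparametric_law(chi, psi, edges):
    """Discretized mu(psi) from collocated data, following Eqs. (3.13)-(3.14).

    chi, psi : collocated exact measurements of the primary and secondary variables
    edges    : class limits psi_1 < ... < psi_{K+1} (a user choice, assumed here)
    Returns the class averages of psi, the conditional means mu1 and variances mu2.
    """
    chi = np.asarray(chi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    psi_bar, mu1, mu2 = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_class = (psi >= lo) & (psi < hi)   # half-open classes keep them disjoint
        if not in_class.any():
            continue                          # assumption: empty classes are skipped
        chi_k = chi[in_class]
        psi_bar.append(psi[in_class].mean())
        mu1.append(chi_k.mean())                          # Eq. (3.13)
        mu2.append(np.mean((chi_k - chi_k.mean()) ** 2))  # Eq. (3.14)
    return np.array(psi_bar), np.array(mu1), np.array(mu2)

# Hypothetical collocated data (illustration only)
psi_obs = [5.1, 5.3, 6.2, 6.4, 6.6]
chi_obs = [1.0, 3.0, 2.0, 4.0, 6.0]
psi_bar, mu1, mu2 = nonparametric_law(chi_obs, psi_obs, edges=[5.0, 6.0, 7.0])
```

Values of µ1 and µ2 between class centers can then be obtained by interpolation, which is one possible design choice among others.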
3.2.3.2. Parametric approach
3.2.3.2.1. Parametric polynomial of order 1
In some cases the empirical law between x and y is known to be linear. In this case a
parametric approach used to obtain µ(ψ) consists in using the following polynomial model of
order 1
xi = β0 + β1yi +εi 1≤ i ≤ N , (3.15)
where β0 and β1 are regression coefficients, εi is an unobservable random error, and xi=X(si)
and yi=Y(si) are random variables for X and Y, respectively, at collocated measurement point
si. This equation can also be given in matrix/vector notation as
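x = Dβ + ε (3.16)
where
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}, \quad D = \begin{bmatrix} 1 & y_1 \\ 1 & y_2 \\ \vdots & \vdots \\ 1 & y_N \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{bmatrix}.$$
D is known as the design matrix.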
Using standard regression theory, the expected value µ1(ψ) of x given ψ is simply given
by the estimator β̂0 + β̂1ψ, where β̂0 and β̂1 are the ordinary least squares estimates of β0 and β1,
respectively, and the variance µ2(ψ) of x given ψ is the square of the prediction standard
error, PSE(ψ), so that µ1(ψ) = β̂0 + β̂1ψ and µ2(ψ) = PSE(ψ)². The estimate β̂ = [β̂0 β̂1]T for β
is given by the following equation (see Appendix A)
β̂ =(∆T∆)-1(∆Tχ), (3.17)
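where ∆ is obtained by substituting each random variable yi in the design matrix D with its
observed value ψi. Expanding Eq. (3.17) we get
$$\hat{\beta}_0 = \bar{\chi} - \hat{\beta}_1 \bar{\psi} \quad \text{and} \quad \hat{\beta}_1 = \frac{\sum_{i=1}^{N} (\psi_i - \bar{\psi})(\chi_i - \bar{\chi})}{\sum_{i=1}^{N} (\psi_i - \bar{\psi})^2}.$$
Furthermore (see Appendix A for details) the prediction standard error PSE(ψ) is estimated
using the following equation
$$PSE(\psi) = \hat{\sigma}_X \left[ \frac{1}{N} + \frac{(\psi - \bar{\psi})^2}{\sum_{j=1}^{N} (\psi_j - \bar{\psi})^2} + 1 \right]^{0.5} \tag{3.18}$$
where σ̂X² is calculated using the following unbiased variance estimator
$$\hat{\sigma}_X^2 = \frac{1}{N-2} \sum_{i=1}^{N} (\chi_i - \hat{\chi}_i)^2, \tag{3.19}$$
with χ̂i = β̂0 + β̂1ψi the fitted value at the i-th collocated point.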
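The order 1 estimators can be sketched in a few lines. This is an illustrative sketch of the closed-form expressions in Eqs. (3.17) to (3.19), not the code used in this work; the function name and the four data pairs are hypothetical.

```python
import numpy as np

def linear_law(chi, psi):
    """Order 1 empirical law: returns beta0, beta1 and callables mu1(psi), mu2(psi)."""
    chi = np.asarray(chi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    n = chi.size
    psi_bar, chi_bar = psi.mean(), chi.mean()
    s_psi = np.sum((psi - psi_bar) ** 2)
    beta1 = np.sum((psi - psi_bar) * (chi - chi_bar)) / s_psi
    beta0 = chi_bar - beta1 * psi_bar
    residuals = chi - (beta0 + beta1 * psi)
    sigma2 = np.sum(residuals ** 2) / (n - 2)        # Eq. (3.19)

    def mu1(p):
        return beta0 + beta1 * p                     # conditional mean of x given psi

    def mu2(p):
        # squared prediction standard error, Eq. (3.18)
        return sigma2 * (1.0 / n + (p - psi_bar) ** 2 / s_psi + 1.0)

    return beta0, beta1, mu1, mu2

# Hypothetical collocated measurements (illustration only)
psi_obs = np.array([1.0, 2.0, 3.0, 4.0])
chi_obs = np.array([2.1, 3.9, 6.2, 7.8])
beta0, beta1, mu1, mu2 = linear_law(chi_obs, psi_obs)
```

The soft datum at a point with measured ψ is then the Gaussian PDF with mean mu1(ψ) and variance mu2(ψ); note that mu2 grows as ψ moves away from the average of the collocated ψ values.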
3.2.3.2.2. Parametric polynomial of order 2
In many instances the empirical law between x and y may be found to follow a quadratic
curve. In these cases we can easily extend the parametric approach presented above to
consider a polynomial model of order 2, i.e.
xi = β0 + β1yi + β2yi² + εi (3.20)
where β2 is an additional coefficient characterizing the curvature of the empirical law. Eq.
(3.20) can be recast into Eq. (3.16), x = Dβ + ε , by defining a new design matrix D with an
additional column and a new vector β as follows
$$D = \begin{bmatrix} 1 & y_1 & y_1^2 \\ 1 & y_2 & y_2^2 \\ \vdots & \vdots & \vdots \\ 1 & y_N & y_N^2 \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}. \tag{3.21}$$
The estimator for µ(ψ) is then given by µ1(ψ) = β̂0 + β̂1ψ + β̂2ψ² and µ2(ψ) = PSE(ψ)². The
estimator β̂ = [β̂0 β̂1 β̂2]T for β is given by Eq. 3.17, i.e. β̂ =(∆T∆)-1(∆Tχ), but with the
difference that ∆ is obtained by substituting each random variable yi in the new design matrix
D (see Eq. 3.21) with its observed value ψi. In other words ∆ now has one additional column
with elements ψi². Finally PSE(ψ) is obtained by the equation
$$PSE(\psi) = \hat{\sigma}_X \sqrt{\delta^T (\Delta^T \Delta)^{-1} \delta + 1}, \tag{3.22}$$
where δ = [1 ψ ψ²]T.
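The matrix form of the estimator lends itself to a short sketch that covers any polynomial order. The function name and data are hypothetical. One point is our assumption: the residual variance is divided by N minus the number of coefficients, which reduces to the N−2 of Eq. (3.19) for order 1; the text does not spell out the divisor for order 2.

```python
import numpy as np

def polynomial_law(chi, psi, order=2):
    """Polynomial empirical law in matrix form (Eqs. 3.17, 3.21 and 3.22)."""
    chi = np.asarray(chi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    n_coef = order + 1
    Delta = np.vander(psi, n_coef, increasing=True)    # columns 1, psi, psi^2, ...
    DtD_inv = np.linalg.inv(Delta.T @ Delta)
    beta_hat = DtD_inv @ (Delta.T @ chi)               # Eq. (3.17)
    residuals = chi - Delta @ beta_hat
    sigma2 = residuals @ residuals / (chi.size - n_coef)   # assumed divisor N - p

    def mu1(p):
        delta = float(p) ** np.arange(n_coef)          # delta = [1, psi, psi^2]
        return delta @ beta_hat

    def mu2(p):
        delta = float(p) ** np.arange(n_coef)
        return sigma2 * (delta @ DtD_inv @ delta + 1.0)    # PSE(psi)^2, Eq. (3.22)

    return beta_hat, mu1, mu2

# Hypothetical collocated measurements following a noisy quadratic (illustration only)
psi_obs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
chi_obs = np.array([1.2, 5.8, 17.1, 33.9, 57.0])
beta_hat, mu1, mu2 = polynomial_law(chi_obs, psi_obs, order=2)
```

For larger or poorly scaled problems, solving the normal equations with `np.linalg.lstsq` rather than an explicit inverse would be the more robust choice; the explicit inverse is kept here because it appears in Eq. (3.22).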
By way of summary, in this section we reviewed some non-parametric and parametric
approaches to estimate the relationships µ(ψ)=(µ1(ψ), µ2(ψ)) characterizing the empirical law
between collocated x and y. The estimation of µ(ψ) was obtained on the basis of data at N
points where (X,Y) measurements were collocated. However the µ(ψ) relationships are valid
for the larger set of n mapping points, which also include ms points where only Y
measurements were collected. At each of these points, we construct a soft datum for the
primary variable X on the basis of the measured value ψ for the secondary variable using the
soft PDF fS(χ|ψ) = φ(χ ; µ(ψ)). Therefore the soft data for X consist in the soft PDF fs(χsoft)
characterizing the primary variable X at the ms soft data points where only the secondary
variable Y was measured. At the remaining mh data points, exact measurements of the
primary variable X are available, which constitute the hard data χhard. Hence the site-specific
knowledge base consists in the hard data χhard at mh points where at least the primary variable
was measured, and the soft PDF fs(χsoft) at the additional ms points where only the secondary
variable was measured. In the following section we review how the BME method processes
these hard and soft data.
3.2.4. BME processing of hard and soft data
The powerful BME method has three main stages of knowledge processing, which are the (i)
structural, (ii) specificatory, and (iii) integration stages.
At the structural stage of the BME analysis, a prior PDF fG(χmap) characterizing the SRF
X(s) at the mapping points smap is constructed by maximizing expected information based on
the general knowledge base G available. When G only includes knowledge of the vector of
expected values mmap at the mapping points, and the matrix cmap of covariance between any
pairs of mapping points, then the prior PDF fG(χmap) is given by
fG(χmap) = φ(χmap; mmap, cmap), (3.23)
where φ (χmap; mmap, cmap) is the multivariate Gaussian PDF with mean mmap and covariance
matrix cmap.
At the specificatory stage of the analysis, the site-specific knowledge is organized in hard
and soft data. The n mapping points include the estimation point and the data points, i.e.
smap=(sk , sdata). The data points consist in mh points shard where hard data χhard about the
primary variable X is directly collected, and ms points ssoft where only the secondary variable
Y is measured, so that smap=(sk , shard, ssoft) and χmap=(χk , χhard, χsoft). The soft PDF fS(χsoft) is
given by the product of the conditional PDFs fS(χ|ψ) = φ(χ ; µ(ψ)) of x given each of the ms
measured ψ values.
At the integration stage, a Bayesian conditionalization rule (Christakos 1990, 2000b;
Serre and Christakos, 1999) is used to update the prior PDF given the site-specific
knowledge available and yields the BME posterior PDF
$$f_K(\chi_k) = A^{-1} \int d\chi_{soft} \, f_S(\chi_{soft}) \, f_G(\chi_k, \chi_{hard}, \chi_{soft}) \tag{3.24}$$
where A is a normalization coefficient. This posterior PDF provides a complete stochastic
description of the primary environmental variable of interest at any estimation point.
Specifically, the BME posterior PDF provides the flexibility to choose any estimator desired
(e.g. the BME mode, the BME mean, or BME estimates at various percentiles), as well as an
assessment of the estimation error by means of the BME posterior variance or the BME
confidence set (Serre and Christakos, 1999).
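To make Eq. (3.24) concrete, the following sketch evaluates the posterior PDF numerically for the smallest non-trivial configuration: one estimation point, one hard datum and one Gaussian soft datum. It is a brute-force grid integration written for illustration, not the numerical scheme of any BME library. All function names, the mean vector, the covariance matrix and the data values are hypothetical, and the arrays are ordered as (k, hard, soft).

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mvn_pdf(points, mean, cov):
    """Multivariate Gaussian density at each point along the last axis of `points`."""
    diff = points - mean
    quad = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2.0 * np.pi) ** mean.size * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def bme_posterior(chi_k, chi_soft, chi_hard, soft_mean, soft_var, m_map, c_map):
    """Eq. (3.24) on uniform grids chi_k (estimation point) and chi_soft (soft point)."""
    K, S = np.meshgrid(chi_k, chi_soft, indexing="ij")
    points = np.stack([K, np.full_like(K, chi_hard), S], axis=-1)
    prior = mvn_pdf(points, m_map, c_map)                  # f_G(chi_k, chi_hard, chi_soft)
    integrand = prior * gaussian_pdf(S, soft_mean, soft_var)
    unnormalized = integrand.sum(axis=1) * (chi_soft[1] - chi_soft[0])
    area = unnormalized.sum() * (chi_k[1] - chi_k[0])      # normalization coefficient A
    return unnormalized / area

# Hypothetical general knowledge and data (illustration only)
m_map = np.array([0.0, 0.0, 0.0])
c_map = np.array([[1.0, 0.6, 0.4],
                  [0.6, 1.0, 0.3],
                  [0.4, 0.3, 1.0]])
chi_k = np.linspace(-6.0, 6.0, 601)
chi_soft = np.linspace(-8.0, 8.0, 801)
posterior = bme_posterior(chi_k, chi_soft, chi_hard=1.0, soft_mean=0.5,
                          soft_var=0.25, m_map=m_map, c_map=c_map)
bme_mean = np.sum(chi_k * posterior) * (chi_k[1] - chi_k[0])
```

With a Gaussian prior and a Gaussian soft PDF the posterior is itself Gaussian, so this case can be checked against a closed-form result; the grid approach becomes impractical as the number of soft data grows, which is why dedicated numerical integration is used in practice.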
3.2.5. Generating related synthetic fields with stochastic empirical relationships
We aim to generate realizations for the groundwater log-arsenic SRF logAs(s) and soil pH
SRF pH(s) with prescribed statistical properties reproducing those found in the field, and
with a quadratic empirical relationship E[logAs|pH] at collocated point s similar to those
documented in previous studies (e.g. Fig. 3.1).
Let’s consider three independent, homogeneous, normally distributed SRFs A(s), B(s),
and C(s). Realizations of such fields can easily be generated using geostatistical simulation
techniques (Christakos, 1992; Christakos et al., 2002) such that the realizations of A(s), B(s),
and C(s) have user-defined means µA, µB, and µC, and variances σA2, σB2, and σC2, and with a
covariance range similar to that of soil pH and log-arsenic found in the field. We then
construct the fields for logAs(s) and pH(s) using the following equations
pH(s) = A(s) + B(s) (3.25)
logAs(s) = a1A(s) + a2A(s)2 + C(s), (3.26)
where a1 and a2, together with µA, µB, µC, σA2, σB2, and σC2, are the parameters of our
algorithm to generate logAs(s) and pH(s). Let’s now describe how to choose these parameters
in order to obtain realizations of logAs(s) and pH(s) with known statistical properties and a
quadratic empirical relationship E[logAs|pH] at collocated point s.
The means µlogAs and µpH, and variances σlogAs2 and σpH2 of the logAs(s) and pH(s) SRFs
are inputs to our algorithm (they are known for a specific geologic mapping situation, or can
be estimated from some monitoring dataset). Using these values, we calculate the parameters
µB, σB2, µC and σC2 with the following equations obtained from Eq. (3.25) and (3.26) (see
Appendix B for more details)
µB = µpH - µA (3.27)
σB2 = σpH2 - σA2 (3.28)
µC = µlogAs -a1 µA - a 2 µA2 - a2 σA2 (3.29)
σC2 = σ logAs2 - a12 σA2 - 2 a22 σA4 - 4 a22 µA2 σA2 - 4 a1 a2 µA σA2. (3.30)
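The parameter relations and the construction of Eqs. (3.25) and (3.26) can be sketched as follows, taking µB = µpH − µA as implied by Eq. (3.25). This is an illustrative sketch only. The exponential covariance model, the Cholesky-based simulation, the function names, the coordinates and the pH and log-arsenic moments are our assumptions; only a1 = 1.7 and σA² = 0.32 are values quoted later in this chapter.

```python
import numpy as np

def derive_parameters(mu_pH, var_pH, mu_logAs, var_logAs, a1, a2, var_A, mu_A=0.0):
    """Means and variances of B(s) and C(s) from Eqs. (3.27)-(3.30)."""
    mu_B = mu_pH - mu_A
    var_B = var_pH - var_A
    mu_C = mu_logAs - a1 * mu_A - a2 * mu_A ** 2 - a2 * var_A
    var_C = (var_logAs - a1 ** 2 * var_A - 2 * a2 ** 2 * var_A ** 2
             - 4 * a2 ** 2 * mu_A ** 2 * var_A - 4 * a1 * a2 * mu_A * var_A)
    if var_B <= 0 or var_C <= 0:
        raise ValueError("chosen a1, a2, var_A leave no positive variance for B or C")
    return mu_B, var_B, mu_C, var_C

def gaussian_field(coords, mean, var, corr_range, rng):
    """One realization of a homogeneous Gaussian SRF (assumed exponential covariance)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cov = var * np.exp(-3.0 * dist / corr_range)
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(len(coords)))  # small jitter for stability
    return mean + chol @ rng.standard_normal(len(coords))

def simulate_fields(coords, mu_pH, var_pH, mu_logAs, var_logAs,
                    a1, a2, var_A, corr_range, rng, mu_A=0.0):
    mu_B, var_B, mu_C, var_C = derive_parameters(
        mu_pH, var_pH, mu_logAs, var_logAs, a1, a2, var_A, mu_A)
    A = gaussian_field(coords, mu_A, var_A, corr_range, rng)
    B = gaussian_field(coords, mu_B, var_B, corr_range, rng)
    C = gaussian_field(coords, mu_C, var_C, corr_range, rng)
    pH = A + B                          # Eq. (3.25)
    logAs = a1 * A + a2 * A ** 2 + C    # Eq. (3.26)
    return pH, logAs

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 100.0, size=(50, 2))   # hypothetical sampling locations
pH, logAs = simulate_fields(coords, mu_pH=6.5, var_pH=0.5, mu_logAs=0.0, var_logAs=1.5,
                            a1=1.7, a2=0.3, var_A=0.32, corr_range=30.0, rng=rng)
```

The check on the sign of the variances of B and C makes explicit that not every combination of a1, a2 and σA² is compatible with the target variances of pH and log-arsenic.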
We now have only four parameters remaining, i.e. a1, a2, µA, and σA2, which need to be
set according to the quadratic relationship desired for the empirical law E[logAs|pH].
Substituting A(s)=pH(s)-B(s) into Eq. (3.26), taking the expected value of logAs given pH at
collocated point s, and using properties of the normal distribution, we obtain after some
manipulations (see Appendix B for more details)
E[logAs|pH] = b0 + b1(pH-µpH) + b2(pH-µpH)2 (3.31)
where b0=µlogAs–a2σA4 /σpH2, b1= a1 σA2/σpH2+2a2 µA σA2 /σpH2, and b2 = a2σA4 /σpH4. As can be
seen from Eq. (3.31), the empirical law is of quadratic form, which fulfills the objective we
had set for our simulation algorithm defined by Eqs. (3.25) and (3.26). We note that the
parameter µA enters the empirical law (Eq. 3.31) only through b1, in the combination
a1+2a2µA, so that any effect of µA can be absorbed in the choice of a1, and without loss of
generality we can use µA=0. The parameters a1 and a2 are coefficients that primarily define
the shape of the empirical law E[logAs|pH]. In addition the parameter σA2 not only defines
the shape of the empirical law, but also the amount of co-variability between logAs and pH.
In other words, an increase in σA2 leads to a larger cross-correlation between collocated logAs
and pH, and consequently a smaller variance of logAs given pH, σ[logAs|pH]2.
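The coefficients of Eq. (3.31) can be traced through the following sketch, which uses only Eqs. (3.25), (3.26), (3.28) and (3.29) together with the independence and normality of A(s), B(s) and C(s). The shorthand r and d is introduced here for brevity and is not notation used elsewhere in this work.

```latex
\begin{align*}
r &= \sigma_A^2 / \sigma_{pH}^2, \qquad d = pH - \mu_{pH}, \\
E[A \mid pH] &= \mu_A + r\,d, \qquad
\mathrm{Var}[A \mid pH] = \sigma_A^2 - \sigma_A^4 / \sigma_{pH}^2, \\
E[\log As \mid pH] &= a_1 E[A \mid pH]
  + a_2 \left( E[A \mid pH]^2 + \mathrm{Var}[A \mid pH] \right) + \mu_C \\
&= \left( \mu_{\log As} - a_2 \sigma_A^4 / \sigma_{pH}^2 \right)
  + (a_1 + 2 a_2 \mu_A)\, r\, d + a_2 r^2 d^2 .
\end{align*}
```

The last line follows after substituting µC from Eq. (3.29), and matching powers of d gives b0, b1 and b2 as stated above.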
Hence, by way of summary, we find that the simulator defined by Eqs. (3.25-3.30)
generates a useful class of spatially related synthetic random fields. This simulator allows us to
generate realizations of two SRFs logAs(s) and pH(s) with prescribed statistical properties
reproducing those observed in the field, and with a quadratic empirical relationship
E[logAs|pH] at collocated point s defined by the parameters a1, a2 and σA2. Increasing a2 for
a selected value of a1 and σA2 will allow exploring empirical laws with increasing curvature,
while increasing σA2 will allow exploring fields with increasing cross-correlations between
logAs(s) and pH(s). These synthetic fields can then be used in a cross validation analysis to
compare the mapping accuracy associated with the kriging, co-kriging, and the proposed
BME approach. In the next section we provide a step-by-step description of the kriging, co-
kriging and BME approaches, and in the following section we present the cross validation
procedure.
3.2.6. Step by step description of the simple kriging, co-kriging, and BME approaches
The three estimation methods considered here are the simple kriging method labeled as
method 1, the co-kriging method labeled as method 2, and the BME method labeled as
method 3. Each method uses a subset of the synthetic datasets as measured data. We define
logAs as a primary variable X, and pH as a secondary variable Y.
We first specify the general knowledge base available by modeling the statistical
moments up to second order (mean and covariance) of the SRFs X(s) and Y(s). We obtain
models for mean trend (i.e. mX(s) using Eq. (3.3), and similarly mY(s) for Y) and covariance
models (i.e. cX(s,s’) using Eq. (3.4) and similarly cY(s,s’) for Y) for each variable. We
additionally obtain the model for the cross-covariance cXY(s,s’) between X and Y (i.e. Eq.
3.11).
The site specific knowledge base consists in the data for X and Y. We denote as χhard the
column vector of exact measurements of X at the mXh points sh(X)={s1(X), s2(X),…, s_{mXh}(X)}
where at least X was measured. We define as ψhard the column vector of exact measurements
of Y at the mYh points sh(Y)={s1(Y), s2(Y),…, s_{mYh}(Y)} where only Y was measured. Each
method then selects a subset of the knowledge bases available to proceed with the estimation step.
In simple kriging (i.e. method 1), the estimator χ̂k(1) of X at the estimation point sk(X) is a
linear combination of only χhard given by
$$\hat{\chi}_k^{(1)} = m_k^{(X)} + \lambda^{(1)T} (\chi_{hard} - m_{hard}^{(X)}), \tag{3.32}$$
where λ(1) is a column vector of simple kriging weights, mk(X) = mX(sk(X)) is the mean trend of
X at the estimation point sk(X), and mhard(X) = mX(sh(X)) is a column vector of mean trend values
for X at its hard data points sh(X). The vector of simple kriging weights is given by (Olea,
1999)
$$\lambda^{(1)T} = c_{k,Xh} \, c_{Xh,Xh}^{-1}, \tag{3.33}$$
where ck,Xh = cX(sk(X), sh(X)) is a row vector of covariance for X between the estimation point
sk(X) and hard data points sh(X), and cXh,Xh = cX(sh(X), sh(X)) is a mXh by mXh matrix of covariances
for X between the hard data points sh(X).
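Eqs. (3.32) and (3.33) translate directly into a few lines of linear algebra. The sketch below is illustrative only. The exponential covariance model, its sill and range, the function names, the data locations and the data values are hypothetical.

```python
import numpy as np

def exponential_cov(s1, s2, sill=1.0, corr_range=10.0):
    """Assumed covariance model c_X(s, s') = sill * exp(-3 |s - s'| / range)."""
    h = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)
    return sill * np.exp(-3.0 * h / corr_range)

def simple_kriging(s_k, s_hard, chi_hard, m_k, m_hard, cov=exponential_cov):
    """Simple kriging estimate at s_k from hard data, Eqs. (3.32)-(3.33)."""
    c_k_Xh = cov(s_k[None, :], s_hard)[0]       # covariances estimation point / data
    c_Xh_Xh = cov(s_hard, s_hard)               # covariances between data points
    lam = np.linalg.solve(c_Xh_Xh, c_k_Xh)      # weights; c_Xh_Xh is symmetric
    return m_k + lam @ (chi_hard - m_hard), lam

# Hypothetical hard data with a constant mean trend of 0.5 (illustration only)
s_hard = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
chi_hard = np.array([1.2, 0.4, -0.3])
m_hard = np.full(3, 0.5)

est_at_datum, _ = simple_kriging(s_hard[1], s_hard, chi_hard, 0.5, m_hard)
est_far_away, _ = simple_kriging(np.array([1000.0, 1000.0]), s_hard, chi_hard, 0.5, m_hard)
```

Two limiting cases serve as sanity checks: at a hard data point the estimate reproduces the datum, and far from all data it falls back to the mean trend.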
The traditional extension of simple kriging to account for secondary spatial field data is
co-kriging (i.e. method 2). The co-kriging estimator χ̂k(2) is also a linear combination of
data, including χhard and ψhard, i.e.
$$\hat{\chi}_k^{(2)} = m_k^{(X)} + \lambda^{(2)T} (Z_{data} - m_{data}), \tag{3.34}$$
where
$$Z_{data} = \begin{bmatrix} \chi_{hard} \\ \psi_{hard} \end{bmatrix}, \quad m_{data} = \begin{bmatrix} m_{hard}^{(X)} \\ m_{hard}^{(Y)} \end{bmatrix},$$
and mhard(Y) = mY(sh(Y)) is a column vector of mean trend values for Y at its hard data points
sh(Y). The vector of co-kriging weights is given by
$$\lambda^{(2)T} = c_{k,Zd} \, c_{Zd,Zd}^{-1}, \tag{3.35}$$
where ck,Zd = [cX(sk(X), sh(X)) cXY(sk(X), sh(Y))] is a row vector of covariance/cross-covariance
between the estimation point and data points, cXY(sk(X), sh(Y)) is a row vector of cross covariance
between the estimation point sk(X) and the Y hard data points sh(Y), and
$$c_{Zd,Zd} = \begin{bmatrix} c_{Xh,Xh} & c_{Xh,Yh} \\ c_{Yh,Xh} & c_{Yh,Yh} \end{bmatrix}, \tag{3.36}$$
where cXh,Yh = cXY(sh(X), sh(Y)) is a mXh by mYh matrix of cross covariance between hard data
points sh(X) and sh(Y), and cYh,Yh = cY(sh(Y), sh(Y)) is a mYh by mYh matrix of covariance for Y
between its hard data points sh(Y).
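The block structure of the co-kriging system can be sketched as follows, taking the covariance and cross-covariance matrices as given inputs. The function name and all numerical values are hypothetical, and the sketch uses the fact that cYh,Xh is the transpose of cXh,Yh.

```python
import numpy as np

def simple_cokriging(m_k, c_k_Xh, c_k_Yh, c_Xh_Xh, c_Xh_Yh, c_Yh_Yh,
                     chi_hard, psi_hard, m_hard_X, m_hard_Y):
    """Simple co-kriging estimate from Eqs. (3.34)-(3.36)."""
    c_k_Zd = np.concatenate([c_k_Xh, c_k_Yh])              # row vector of Eq. (3.35)
    c_Zd_Zd = np.block([[c_Xh_Xh, c_Xh_Yh],
                        [c_Xh_Yh.T, c_Yh_Yh]])             # Eq. (3.36)
    z_data = np.concatenate([chi_hard, psi_hard])
    m_data = np.concatenate([m_hard_X, m_hard_Y])
    lam = np.linalg.solve(c_Zd_Zd, c_k_Zd)                 # co-kriging weights
    return m_k + lam @ (z_data - m_data)                   # Eq. (3.34)

# Hypothetical configuration: two X data, one Y datum, no cross-covariance
estimate = simple_cokriging(
    m_k=0.0,
    c_k_Xh=np.array([0.7, 0.4]), c_k_Yh=np.zeros(1),
    c_Xh_Xh=np.array([[1.0, 0.5], [0.5, 1.0]]),
    c_Xh_Yh=np.zeros((2, 1)), c_Yh_Yh=np.array([[2.0]]),
    chi_hard=np.array([1.0, 2.0]), psi_hard=np.array([7.0]),
    m_hard_X=np.zeros(2), m_hard_Y=np.array([6.5]))
```

With all cross-covariances set to zero, as in this constructed case, the weight on the secondary datum vanishes and the estimate coincides with simple kriging, which shows that co-kriging draws on the secondary variable only through the cross-covariance.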
Finally the BME mapping method (i.e. method 3) incorporates a set of probabilistic soft
data χsoft to account for the stochastic empirical relationship. As explained earlier, to generate
χsoft we start by modeling the stochastic empirical law using data at collocated measurement
points, and we then obtain the soft data at every location of sh(Y) where only ψhard is available.
Each soft datum is expressed by the conditional PDF fS(χsoft|ψhard) = φ(χsoft ; µ(ψhard)) (e.g.
Eqs. 3.8-3.10) of x given each of the measured ψ values. At the structural stage, BME
processes the mean vector mmap and the covariance matrix cmap to construct the multivariate
prior PDF (i.e. Eq. 3.23). The mean vector mmap represents the trend of the primary variable
at mapping points, and it is expressed as follows
$$m_{map} = \begin{bmatrix} m_k^{(X)} \\ m_{hard}^{(X)} \\ m_{soft}^{(X)} \end{bmatrix}, \tag{3.37}$$
where msoft(X) = mX(sh(Y)) is the vector of mean values of X at the soft data locations sh(Y). The
covariance matrix cmap at the mapping points is expanded, in the same ordering as mmap, as
$$c_{map} = \begin{bmatrix} c_{k,k} & c_{k,Xh} & c_{k,Xs} \\ c_{Xh,k} & c_{Xh,Xh} & c_{Xh,Xs} \\ c_{Xs,k} & c_{Xs,Xh} & c_{Xs,Xs} \end{bmatrix}, \tag{3.38}$$
where ck,k = cX(sk(X), sk(X)) is a scalar representing the variance for X at the estimation point
sk(X), cXs,Xh = cX(sh(Y), sh(X)) is a mYh by mXh matrix of covariance for X between its soft data
points sh(Y) and hard data points sh(X), and ck,Xs = cX(sk(X), sh(Y)) is a row vector of covariance
for X between the estimation point sk(X) and soft data points sh(Y). Finally at the integration
stage BME updates the prior PDF (Eq. 3.23) by using a Bayesian conditionalization on χhard
and χsoft in order to obtain the BME posterior PDF fK (χk) at the estimation point sk(X) (Eq.
3.24).
3.2.7. Cross validation procedure
The result from the cross validation procedure offers a useful criterion to compare the
mapping accuracy between estimation methods. The synthetic random fields generated by
the simulator defined in Eqs. (3.25-3.30) provide mXh realizations Xi for the primary variable
at points sh(X), and mYh realizations Yi for the secondary variable at points sh(Y). These
simulated values are interpreted as the truth. The cross validation procedure removes one
true value Xi at a time, and re-estimates it using only data in its neighborhood to obtain the
cross-validation estimate Xi* . The mean square error (MSE) for the cross-validation estimates
is then defined as
$$MSE = \frac{1}{m_{Xh}} \sum_{i=1}^{m_{Xh}} (X_i^* - X_i)^2. \tag{3.39}$$
The percent MSE change $r_{MSE}^{1,2}$ between method 1 and method 2 is given by
$$r_{MSE}^{1,2} = \frac{MSE_2 - MSE_1}{MSE_1} \times 100. \tag{3.40}$$
Similarly $r_{MSE}^{1,3}$ is given using the corresponding equation for method 1 and method 3. Since
both method 2 and 3 use data from the primary and secondary variables, we expect them to
provide more accurate cross-validation estimates than method 1, which only uses data from
the primary variable. Therefore we expect that both $r_{MSE}^{1,2}$ and $r_{MSE}^{1,3}$ will be negative values,
signifying a reduction in MSE.
To compare the efficiency of methods 2 and 3 in using the secondary data, we
define the improvement in MSE reduction, i∆, as
$$i_{\Delta} = \frac{r_{MSE}^{1,3} - r_{MSE}^{1,2}}{r_{MSE}^{1,2}} \times 100 . \tag{3.41}$$
i∆ measures the percent improvement in the reduction of MSE afforded by BME versus co-
kriging. A value of i∆ =10% would mean that the reduction in MSE from kriging to BME is
10% greater than the reduction in MSE from kriging to co-kriging, or in other words that
BME is 10% more efficient than co-kriging at integrating the secondary data.
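The three cross validation measures of Eqs. (3.39) to (3.41) can be sketched in a few lines. The function names and the MSE values below are hypothetical and serve only to show how the quantities relate to each other.

```python
def mse(truth, estimates):
    """Eq. (3.39): mean square error of cross-validation estimates."""
    return sum((e - t) ** 2 for t, e in zip(truth, estimates)) / len(truth)

def percent_mse_change(mse_ref, mse_new):
    """Eq. (3.40): percent MSE change from a reference method to a new method."""
    return (mse_new - mse_ref) / mse_ref * 100.0

def improvement_in_mse_reduction(r_12, r_13):
    """Eq. (3.41): percent improvement of r_13 over r_12."""
    return (r_13 - r_12) / r_12 * 100.0

# Hypothetical MSE values for methods 1, 2 and 3 (not results from this work)
mse_1, mse_2, mse_3 = 1.00, 0.90, 0.50
r_12 = percent_mse_change(mse_1, mse_2)              # about -10 (%)
r_13 = percent_mse_change(mse_1, mse_3)              # about -50 (%)
i_delta = improvement_in_mse_reduction(r_12, r_13)   # about 400 (%)
```

Substituting Eq. (3.40) into Eq. (3.41) gives i∆ = (MSE3 − MSE2)/(MSE2 − MSE1) × 100, so i∆ grows quickly when the MSE reduction from method 1 to method 2 is small.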
3.3. Results
3.3.1. Synthetic case study
As discussed in section 3.2.5, our simulator generates realizations of the fields logAs(s)
and pH(s) such that their statistical properties correspond to those found at a site of interest (i.e.
New England), and such that collocated logAs and pH values are related by a quadratic
empirical law controlled by the parameters a1, a2 and σA2. The quadratic shape of the
synthetic empirical law we generate reproduces the logAs-pH empirical law documented in
previous studies (Sanchez et al., 2003; Ayotte et al., 2003). In this synthetic case study we
investigate the percent improvement i ∆ in the reduction of MSE afforded by BME versus
co-kriging for two scenarios labeled as case 1 and case 2. In case 1 we explore quadratic
empirical laws with increasing curvature by increasing a2 from 0 to 0.6 for fixed a1 and σA2
set to a1 = 1.7 and σA2 = 0.32. In case 2 we explore quadratic empirical laws with increasing
co-variability between logAs and pH by increasing σA2 from 0.08 to 0.35 for fixed a1 and a2
set to a1 = 0.7 and a2 = 0. In the following sections we first present in detail the results
obtained for a single realization of case 1 with a2=0.5, and we then provide the results
for case 1 describing the effect of the curvature (i.e. non-linearity) of the empirical law,
followed by the results for case 2 describing the effect of the co-variability between logAs
and pH (i.e. correlation of the empirical law).
3.3.1.1. Realization of related spatial fields
Using our simulator with a1=1.7, σA2= 0.32, and a2=0.5, we successfully generate the
realization for the primary spatial random field logAs(s) shown in Figure 3.2(a) and the
realization for the related secondary variable pH(s) shown in Figure 3.2(b). These simulated
fields are usually interpreted as the truth, from which the measured data are randomly
selected. Each asterisk in Figure 3.2(a) indicates the selected measured data for logAs, and
similarly each triangle in Figure 3.2(b) denotes the pH measured data.
The scatter plot of all collocated simulated values for logAs and pH is shown in Figure
3.2(c). This scatter plot shows that our simulator is able to generate a stochastic empirical
law with a realistic non-linear shape in good agreement with that found in previous studies
(e.g. Sanchez et al., 2003). Furthermore the theoretical formula obtained in Eq. (3.31) for
the empirical law (shown as a plain line labeled as the “true E[logAs|pH]” in Figure 3.2c) is
in perfect agreement with the simulated collocated data.
Figure 3.2: Realization of (a) logAs(s) and (b) pH(s) obtained with our simulator using
a1=1.7, a2=0.5 and σA2= 0.32. Asterisks in (a) and triangles in (b) are the randomly selected
points used as data in the cross-validation procedure. The scatter plot of all collocated
simulated logAs-pH values is shown in (c), where the plain line is the theoretical
E[logAs|pH] obtained from Eq. (3.31).
Now that we have obtained a realization of the logAs and pH fields with realistic properties,
we proceed with its analysis, which consists of the analysis of its covariance/cross-covariance,
followed by the analysis of the empirical relationship, and ends with the results of the
cross-validation procedure.
3.3.1.2. Covariance and cross-covariance between fields
We estimate the experimental values of the covariance for logAs(s) and pH(s) using Eq. (3.4),
and we then fit a covariance model to these experimental covariance values. Similarly we
estimate and model the cross-covariance between logAs(s) and pH(s) using Eq. (3.11). For
illustration purposes, we show in Figure 3.3 the covariance and cross covariance values
obtained for the realization of logAs(s) and pH(s) shown in Figure 3.2. The experimental
covariance values are shown with dots, while the covariance and cross covariance models are
shown with a plain line. These models are given by
$$c(r) = c_0 \exp(-3r/a_r), \tag{3.42}$$
where c(r) is an exponential function of the spatial lag r between a pair of points, c0 is the
covariance sill, and ar is the covariance range. Each covariance model has the same range
(7 km) but a different sill value (i.e. c0=1.0441 for the covariance of logAs, c0=0.5514 for the
cross-covariance between logAs and pH, and c0=0.3801 for the covariance of pH). These
covariance models represent a spatial autocorrelation of logAs(s) and pH(s) that is
comparable to that found in the real case study for New England presented later.
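The role of the factor 3 in Eq. (3.42) can be checked numerically: at a lag equal to the range, the modeled covariance has dropped to exp(-3), about 5% of the sill. The sketch below assumes the sill of 1.0441 and the range of 7 km quoted above for logAs; the function name is hypothetical.

```python
import math

def exponential_covariance(r, c0, ar):
    """Eq. (3.42): exponential covariance with sill c0 and range ar."""
    return c0 * math.exp(-3.0 * r / ar)

c0_logas, ar_km = 1.0441, 7.0  # sill and range taken from the text above
at_zero = exponential_covariance(0.0, c0_logas, ar_km)     # equals the sill
at_range = exponential_covariance(ar_km, c0_logas, ar_km)  # sill * exp(-3)
fraction = at_range / at_zero                               # exp(-3), about 0.05
```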
Figure 3.3: Covariance and cross-covariance for the logAs(s) and pH(s) synthetic fields shown
in Figure 3.2. Experimental covariance values are shown with dots, while the corresponding
covariance models are shown with a plain line.
3.3.1.3. Conditional PDF fS(χ|ψ) describing the empirical relationship
The stochastic empirical law relating logAs and pH can be expressed by the conditional PDF
fS(χ|ψ) of the primary variable logAs given an error free measured value ψ for the collocated
secondary variable pH. This conditional PDF is modeled in terms of a Gaussian PDF with
mean µ1(ψ) and variance µ2(ψ), so that the vectorial relationship µ(ψ)=[µ1(ψ) µ2(ψ)]
summarizes the non-linear aspects of the stochastic empirical law. The three straightforward
approaches described earlier in the methods section to obtain µ(ψ) are (1) the non-parametric
prediction, (2) the parametric prediction with polynomial of order 1, and (3) the parametric
prediction with polynomial of order 2.
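The three approaches can be sketched on synthetic collocated data as follows. The estimators in the methods section are not reproduced in this excerpt, so the details here are assumptions: the parametric versions use an ordinary least-squares polynomial for µ1(ψ) with a constant residual variance for µ2(ψ), and the non-parametric version uses a moving window in ψ. The quadratic law used to generate the data is hypothetical.

```python
import random

def polyfit(psi, chi, order):
    """Least-squares polynomial coefficients [b0, b1, ...], lowest order first."""
    n = order + 1
    # Normal equations A b = y, with A[j][k] = sum(psi^(j+k)), y[j] = sum(chi * psi^j)
    A = [[sum(p ** (j + k) for p in psi) for k in range(n)] for j in range(n)]
    y = [sum(c * p ** j for p, c in zip(psi, chi)) for j in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        y[col], y[piv] = y[piv], y[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= f * A[col][k]
            y[row] -= f * y[col]
    # Back substitution
    b = [0.0] * n
    for j in reversed(range(n)):
        b[j] = (y[j] - sum(A[j][k] * b[k] for k in range(j + 1, n))) / A[j][j]
    return b

def parametric_mu(psi, chi, order):
    """Return (mu1, mu2): polynomial mean and constant residual variance (assumed)."""
    b = polyfit(psi, chi, order)
    mu1 = lambda p: sum(bj * p ** j for j, bj in enumerate(b))
    resid_var = sum((c - mu1(p)) ** 2 for p, c in zip(psi, chi)) / len(psi)
    return mu1, (lambda p: resid_var)

def nonparametric_mu(psi, chi, half_width):
    """Return (mu1, mu2): moving-window mean and variance of chi around psi."""
    def moments(p):
        window = [c for q, c in zip(psi, chi) if abs(q - p) <= half_width]
        m = sum(window) / len(window)
        return m, sum((c - m) ** 2 for c in window) / len(window)
    return (lambda p: moments(p)[0]), (lambda p: moments(p)[1])

# Synthetic collocated data from a hypothetical quadratic law with Gaussian noise
random.seed(0)
psi = [random.uniform(4.0, 8.0) for _ in range(500)]
chi = [1.0 - 0.5 * p + 0.1 * p * p + random.gauss(0.0, 0.2) for p in psi]

mu1_lin, mu2_lin = parametric_mu(psi, chi, order=1)
mu1_quad, mu2_quad = parametric_mu(psi, chi, order=2)
mu1_np, mu2_np = nonparametric_mu(psi, chi, half_width=0.25)
```

Under these assumptions the soft datum at a measured ψ is the Gaussian PDF with mean mu1(ψ) and variance mu2(ψ). For a curved law the first order fit leaves a larger residual variance than the second order fit, which is one way to see why it is less suitable in that case.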
For illustration purposes, we show in Figure 3.4 the vectorial µ(ψ) relationship and
corresponding conditional PDFs fS(χ|ψ) obtained for the realization of logAs(s) and pH(s)
shown in Figure 3.2. The first order moment µ1(ψ)= E[logAs|pH] estimated using the non-
parametric prediction, the parametric prediction with polynomial of order 1, and the
parametric prediction with polynomial of order 2 are shown with a dashed line in Figures
3.4(a), 3.4(b) and 3.4(c), respectively. The second order moment µ2(ψ) is shown in Figure
3.4(d) for all three approaches. Conditional PDFs fS(χ|ψ) corresponding to the various µ(ψ)
obtained are shown with thick lines in Figures 3.4(a), 3.4(b) and 3.4(c).
We find that the non-parametric approach and the parametric approach with polynomial
of order 2 are very successful in producing conditional PDFs fS(χ|ψ) that capture well the
stochastic empirical relationship between logAs and pH. The parametric approach with
polynomial of order 1 is not as successful because of the non-linearity of the empirical law;
however, this approach would work well when the empirical law is known to be linear.
Hence Figure 3.4 shows that the three straightforward approaches presented in the methods
section to obtain the conditional PDFs fS(χ|ψ) from collocated measurements are easy to
implement in practice, and the best approach will depend on the data available, and on the
type of the empirical law under consideration.
Figure 3.4: The dots in (a), (b) and (c) are identical. They show the collocated measurements
for the realization of logAs(s) and pH(s) shown in Figure 3.2(c). The dashed lines show
µ1(ψ)= E[logAs|pH] obtained using (a) non-parametric prediction, (b) parametric prediction
with polynomial of order 1, and (c) parametric prediction with polynomial of order 2. The
corresponding µ2(ψ) are shown in (d) with different line types. The soft data obtained from
µ1(ψ) and µ2(ψ) are shown in thick lines in (a), (b) and (c).
3.3.1.4. Assessment of mapping accuracy
Mapping accuracy is first assessed visually by comparing in Figure 3.5 the simulated field of
logAs(s) representing the truth, with the estimated maps obtained using methods 1 to 3. To
facilitate the visual comparison, Figure 3.5(a) is an identical reproduction of the simulated
field logAs(s) shown in Figure 3.2(a). The stars denote the locations of the hard data for logAs.
Using these hard data with method 1 (simple kriging) we obtain the estimated map shown in
Figure 3.5(b). As can be seen from that map, method 1 does not provide a good estimate of
the truth because the hard data available for logAs are sparse. For example this estimated map
completely fails to predict the presence of a highly contaminated area in the lower left corner
of the map because of the lack of hard data for logAs in that area.
Methods 2 and 3 on the other hand use hard data for the secondary variable pH (see
Figure 3.2b) in addition to the hard data for arsenic. The map obtained with method 2
(co-kriging) is shown in Figure 3.5(c). We see a small improvement over method 1, but
important contaminated areas such as that in the lower left corner are still completely missing.
On the other hand the map obtained with method 3 (BME), shown in Figure 3.5(d), is a drastic
improvement over method 1. For instance method 3 accurately predicts the presence of the
highly contaminated area in the lower left corner. Because they use additional information
coming from the secondary variable, both methods 2 and 3 were expected to be more accurate
than method 1, as is indeed the case. However what is outstanding is the drastic superiority
of the BME method over co-kriging in processing the secondary data.
Figure 3.5: The simulated field of logAs(s) shown in map (a) is an identical reproduction of
Figure 3.2(a) that is interpreted as the truth. The stars are the locations of the logAs hard data
used by estimation method 1 (simple kriging) to produce map (b). Using these logAs hard data
as well as the secondary pH data shown in Figure 3.2(b), we obtain map (c) with method 2
(co-kriging), and map (d) with method 3 (BME).
We now turn to cross validation in order to quantitatively assess the superiority of BME
over co-kriging in processing the secondary data. As explained earlier in the methods section,
the MSE (Eq. 3.39) of cross validation estimates provides a measure of estimation error. We
find that MSE1 (the MSE for method 1) is equal to 1.09, while MSE2=1.01 for method 2, and
MSE3=0.49 for method 3. It is worth noting that even though methods 2 and 3 use the
same pH data, the percent MSE reduction from method 1 to 3, $r_{MSE}^{1,3}$ = -55.1% (Eq. 3.40), is a
drastic improvement over the MSE reduction from method 1 to 2, $r_{MSE}^{1,2}$ = -6.9%. In fact the
improvement in MSE reduction (Eq. 3.41) is i∆=703.5%, which is outstanding. In other
words, BME is 703.5% more efficient than co-kriging at integrating the secondary data.
This result demonstrates that BME is substantially more accurate than co-kriging for a
realization of logAs(s) and pH(s) (Figure 3.2) obtained with a1=1.7, σA2= 0.32, and a2=0.5. In
the following two sections, we explore whether this result holds when we change the
curvature of the empirical law, and when we change the correlation between primary and
secondary variables.
3.3.1.5. Cross validation results as a function of the curvature of the empirical law
Table 3.1 summarizes the cross validation results obtained in case 1, where we consider
realizations of logAs(s) and pH(s) generated by our simulator with a1=1.7 and σA2= 0.32, and
with a2 varying from 0 to 0.6 in increments of 0.1. The curve representing the empirical law
between collocated logAs and pH is shown in Figure 3.6(a) for each of these realizations. As
can be seen from that figure, the empirical law is linear (i.e. zero curvature) for a2=0, and the
curvature of the empirical law increases monotonically with a2, reaching maximum curvature
for a2=0.6.
Table 3.1: Cross validation results for case 1.

Columns:
MSE1, MSE2: MSE for method 1 and method 2.
MSE3: MSE for method 3 with non-parametric regression.
rMSE(1,2) = (MSE2 − MSE1)/MSE1 × 100: MSE reduction from method 1 to method 2 (%).
rMSE(1,3) = (MSE3 − MSE1)/MSE1 × 100: MSE reduction from method 1 to method 3 (%).
i∆ = (rMSE(1,3) − rMSE(1,2))/rMSE(1,2) × 100: improvement in MSE reduction (%).

a2     MSE1   MSE2   MSE3   rMSE(1,2) (%)   rMSE(1,3) (%)   i∆ (%)
0      1.03   0.96   0.48   -7.4            -53.9           623.3
0.1    1.04   0.97   0.48   -7.3            -54.2           640.3
0.2    1.05   0.98   0.48   -7.2            -54.5           657.4
0.3    1.06   0.99   0.48   -7.1            -54.8           674.3
0.4    1.08   1.00   0.48   -7.0            -55.0           690.1
0.5    1.09   1.01   0.49   -6.9            -55.1           703.5
0.6    1.10   1.02   0.49   -6.8            -55.0           711.5
The results shown in Table 3.1 include the mean square errors MSE1, MSE2 and MSE3
for methods 1, 2 and 3, respectively, the percent MSE changes $r_{MSE}^{1,2}$ and $r_{MSE}^{1,3}$ between
methods 1 and 2, and methods 1 and 3, respectively, and the improvement in MSE reduction
i∆ from co-kriging to BME (where the BME soft data are obtained using the non-parametric
approach). We note that the realization discussed in detail in the preceding sections is listed
in Table 3.1 on the line corresponding to a2=0.5 with an improvement in MSE reduction
i∆=703.5%. We see clearly from Table 3.1 and Figure 3.6(b) that i∆ increases as the
curvature of the empirical law increases. This makes physical sense, since the BME
approach fully accounts for the non-linear aspects of the empirical law, whereas co-kriging
only accounts for the cross-correlation between logAs and pH. However it is very interesting
to note that even for linear empirical laws (i.e. a2=0), BME is still 623.3% more efficient than
co-kriging at integrating the secondary data. These results show that the BME approach
presented in this work drastically outperforms co-kriging regardless of the curvature of the
empirical law.
Furthermore we show in Figure 3.6(b) the improvement in MSE reduction i∆ obtained
when the BME soft data are generated using the non-parametric, the polynomial of order 1,
and the polynomial of order 2 approaches. These curves confirm that if one knows a priori
that the empirical law is quadratic, then using the second order polynomial approach gives
the best results. When that is not the case, the non-parametric approach works well provided
there are sufficient collocated data, while the first order polynomial approach works well
when numerical cost is an issue.
Figure 3.6: (a) Curves representing the empirical law E[logAs|pH] between collocated logAs
and pH for the realizations of Table 3.1 (i.e. obtained with a2 varying from 0 to 0.6 in
increments of 0.1). (b) Curves showing the improvement in MSE reduction i∆ as a function of
a2, when the BME soft data are generated using the non-parametric (plain line), the polynomial
of order 1 (dotted line), and the polynomial of order 2 (dashed line) approaches.
Previous arsenic studies have shown that co-kriging is especially disappointing when the
correlation between the primary and secondary variable is weak (e.g. Welhan and Merrick,
2003). Therefore we investigate next the cross validation results as a function of the
correlation between primary and secondary variables.
3.3.1.6. Cross validation results as a function of the correlation between logAs and pH
We now focus on case 2 of the synthetic case study, which explores how the cross validation
results change as a function of the correlation between collocated logAs and pH
measurements. Realizations of the logAs and pH fields are generated using a1 = 0.7 and a2 =
0 (i.e. linear empirical laws), and with σA2 varying from 0.08 to 0.35. The curve representing
the empirical law between collocated logAs and pH is shown in Figure 3.7(a) for each of
these realizations. As σA2 increases, the co-variability between logAs and pH increases,
leading to larger correlation between logAs and pH, and to linear empirical laws with steeper
slopes, as can be seen in Figure 3.7(a).
For each of the realizations depicted in Figure 3.7(a) we obtain cross validation estimates
using methods 1 to 3, and we show in Figure 3.7(b) the resulting improvement in MSE
reduction i∆ as a function of σA2, which is a measure of the correlation between logAs and pH.
As can be seen from that figure, BME is significantly more accurate than co-kriging
regardless of the correlation between logAs and pH (i.e. regardless of σA2). In this case we do
not see a difference whether the BME soft data are generated using the non-parametric, the
first order polynomial, or the second order polynomial approach because the empirical law
is linear. What is extremely interesting to note is that while BME is at least 600% more
efficient than co-kriging at integrating secondary data when the correlation between logAs
and pH is strong (i.e. for large σA2), the outperformance of BME over co-kriging is even
more drastic when the correlation between logAs and pH is weak, with the improvement in
MSE reduction i∆ reaching as much as 2000%. This indicates that BME may provide a
good alternative to co-kriging for mapping arsenic when the correlation between the
primary and secondary variables is weak.
Figure 3.7: Realizations of related logAs(s) and pH(s) fields were obtained using our
simulator with σA2 varying from 0.08 to 0.35. The linear empirical law E[logAs|pH] for each
of these realizations is shown in (a). The corresponding improvement in MSE reduction i∆ is
shown in (b) as a function of σA2.
By way of summary, this synthetic case study demonstrates that when mapping arsenic,
the BME approach presented in this work is drastically more efficient at incorporating
secondary data than co-kriging. BME is substantially more accurate than co-kriging when
the empirical law is linear and there is a strong correlation between arsenic and the secondary
variable. Furthermore the improvement in mapping accuracy is even more drastic when
considering non-linear empirical laws, or secondary variables that are weakly correlated with
arsenic. It is therefore valuable to apply this proposed method in a real case study. In the
next section, we provide a comprehensive real case study considering the mapping of
groundwater arsenic in the New England region using soil pH as the secondary variable.
3.3.2. Application to the real case study: Mapping arsenic in New England using soil pH
3.3.2.1. New England datasets for arsenic and pH
Measurements of groundwater arsenic concentrations sampled at wells located in New
England were obtained from datasets provided by the U.S. Geological Survey (USGS) and
the New Hampshire Department of Environmental Services (NHDES). These samples
resulted in 495 measurements above detection limits treated as hard data, and 1156
measurements below detection limit treated as interval soft data ranging between 0 and the
detection limit. The locations of the arsenic hard data (i.e. measurements above the detection
limit) are shown in Figure 3.8(a) with circles having a size proportional to the recorded value.
The data for the secondary variable consist of exact measurements (hard data) of soil pH
obtained from a dataset provided by the USGS. The 915 soil pH samples
available were collected in the states of New Hampshire (NH), Maine (ME), and Connecticut
(CT), at the locations shown by the circles of Figure 3.8(b). The color of the circles corresponds
to the soil pH value recorded, according to the color scale shown next to the map.
Figure 3.8: (a) Map of the location of the groundwater arsenic samples from wells with
measurements above detection limit. The circles have a size proportional to the arsenic level
recorded. (b) Map of the location of soil pH-measurements shown with color indicating the
recorded value according to the color scale.
3.3.2.2. logAs-pH empirical law
The non-linear stochastic empirical law relating the primary and secondary variables is
modeled by processing the 139 collocated measurements of logAs and pH shown on the
scatter plot of Figure 3.9. Using the second order polynomial approach described in the
methods section, we obtain the µ1(ψ)=E[logAs|pH] function shown with a dot-dashed line in
Figure 3.9. The equation for E[logAs|pH] is given by
$$E[\log As \mid pH] = 4.6538 - 0.8355\,pH + 0.0695\,pH^{2}. \tag{3.43}$$
The curve representing this equation has a shape that is consistent with the logAs-pH curve
obtained by Sanchez et al. (2003), shown with a dotted line in Figure 3.9. We also obtain
µ2(ψ) (not shown here), which together with µ1(ψ) provides the vectorial function
µ(ψ)=[µ1(ψ),µ2(ψ)] describing the non-linear aspects of the empirical law. From µ(ψ) we
generate the BME soft data consisting of the conditional PDF fS(χ|ψ) for logAs given a
measured value ψ of soil pH. Examples of these soft data are shown with a plain line in
Figure 3.9.
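Taking the coefficients of Eq. (3.43) at face value, the fitted parabola can be evaluated directly. The short sketch below locates its minimum by setting the derivative to zero; the function name is hypothetical, and nothing is assumed here about the range of pH values actually sampled.

```python
def expected_log_as(ph):
    """Eq. (3.43): second order polynomial for E[logAs | pH]."""
    return 4.6538 - 0.8355 * ph + 0.0695 * ph ** 2

# Derivative -0.8355 + 2 * 0.0695 * pH vanishes at the vertex of the parabola
ph_min = 0.8355 / (2 * 0.0695)
log_as_min = expected_log_as(ph_min)
```

The vertex falls close to pH 6, so under this fitted law the expected logAs increases as soil pH moves away from about 6 in either direction.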
Figure 3.9: Scatter plot of 139 collocated logAs and pH measurements in New England. The
dot-dashed line shows µ1(ψ)=E[logAs|pH] obtained using second order polynomial
regression. The dotted line shows a curve of similar shape obtained by Sanchez et al. (2003).
The soft PDFs shown with a plain line are the BME soft data generated using µ1(ψ) (and µ2(ψ)
not shown here).
3.3.2.3. Mean trend and spatial variability of groundwater arsenic in New England
We obtain a model for the mean trend (Eq. 3.3) of groundwater log-arsenic using a moving
window average of the arsenic data. This mean trend, shown in Figure 3.10(a), characterizes
the systematic trends and spatial structures of the logAs(s) SRF. By removing this mean
trend from the log-arsenic data, we obtain a residual field that is homogeneous (i.e. with a
constant mean over space and a covariance that is only a function of the spatial lag between
pairs of points).
Using Eq. (3.4), we obtain experimental values of the covariance of the residual logAs(s)
field, and we then fit a covariance model to these experimental covariance values. The
experimental values of the covariance for the residual logAs(s) field and the corresponding
covariance model are shown in Figure 3.10(b). The equation of the covariance model is
given by
$$c_{logAs}(r) = c_{01} \exp\left(\frac{-3r}{a_{r1}}\right) + c_{02} \exp\left(\frac{-3r}{a_{r2}}\right) \tag{3.44}$$
where c01= 0.57× σlogAs2, c02=0.43× σlogAs2, σlogAs2= 1.623 (log-µg/L)2, ar1 = 3.0 km, and ar2=
79.5 km. This covariance model characterizing the spatial autocorrelation of groundwater
arsenic is in good agreement with what has been reported in previous studies. For example,
the covariance range for the spatial distribution of groundwater arsenic in Bangladesh was
reported to vary from 2 to 57 km by Serre et al. (2003), and from 9.2 to 24.1 km by Yu et al.
(2003).
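The nested model of Eq. (3.44) can be evaluated with the parameter values quoted above to see how the two structures share the variance. This is a minimal sketch; the constant and function names are hypothetical.

```python
import math

SIGMA2 = 1.623                            # variance of the residual field, (log-ug/L)^2
C01, C02 = 0.57 * SIGMA2, 0.43 * SIGMA2   # sills of the short and long range structures
AR1, AR2 = 3.0, 79.5                      # ranges in km

def cov_log_as(r):
    """Eq. (3.44): sum of two exponential covariance structures."""
    return C01 * math.exp(-3.0 * r / AR1) + C02 * math.exp(-3.0 * r / AR2)

def corr_log_as(r):
    """Correlation at lag r, i.e. covariance divided by the variance c(0)."""
    return cov_log_as(r) / cov_log_as(0.0)

corr_short = corr_log_as(AR1)   # correlation remaining at the short range
corr_long = corr_log_as(AR2)    # correlation remaining at the long range
```

At a lag of 3 km the short range structure has almost vanished and the correlation is roughly 0.41, carried mostly by the 79.5 km structure.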
Figure 3.10: (a) Mean trend of groundwater log-arsenic in New England, and (b) covariance
function of its residual.
3.3.2.4. BME estimation of groundwater arsenic across New England
The arsenic mean trend and covariance models, together with the hard and soft interval data
obtained from direct arsenic measurements, and the soft data obtained from pH
measurements using the conditional PDF fS(logAs|pH) (i.e. Figure 3.9), constitute an
informative knowledge base for groundwater arsenic in New England. Given this knowledge
base, the BME method (Eqs. 3.23-3.24) provides the most accurate estimator of groundwater
arsenic across the New England region, as well as a comprehensive assessment of the
associated mapping uncertainty.
The map of the BME estimate of groundwater arsenic obtained in this case study is
shown in Figure 3.11(a). This map is useful to identify areas where levels of groundwater
arsenic may be high. For example areas with groundwater arsenic in excess of 20 µg/L are
found in southern New Hampshire. Previous studies point to natural bedrock as being the
main source of groundwater arsenic in this area (EPA report from USEPA region 1 office, 1981;
Peters et al., 1999). This is an area where our dataset had the densest spatial coverage for
both arsenic and soil pH (see Figure 3.8). Other parts of our study area had sparse
monitoring of arsenic and pH, leading to greater mapping uncertainty associated with the BME
estimates. The mapping uncertainty is quantified using the BME 68% confidence interval
(CI) (Serre and Christakos, 1999). The BME 68% CI is the smallest interval of arsenic
concentration that has a 68% chance of containing the true arsenic concentration. We show a
map of the length of the BME 68% CI in Figure 3.11(b). As can be seen from that map,
areas with denser monitoring data such as southern New Hampshire have a better mapping
accuracy (i.e. smaller length of the BME 68% CI) than areas with sparse monitoring data.
Figure 3.11: (a) Map of the BME estimate of groundwater arsenic (µg/L) across New
England, and (b) map of the length of the 68% BME confidence interval (µg/L) expressing
the associated mapping uncertainty.
3.3.2.5. Non-attainment areas
State regulators and the drinking water industry are concerned with assessing where the
groundwater may have an arsenic concentration in excess of the 10 µg/L federal standard for
drinking water. Using the BME method, we are able to accurately assess the probability of
non-attainment of the standard at a given spatial location s given the arsenic and pH data
available in the neighborhood of s. This probability of non-attainment is given by
$$\mathrm{Prob}[\text{Non-Attainment}] = \mathrm{Prob}[\text{Arsenic} > 10\,\mu g/L], \tag{3.45}$$
where $\mathrm{Prob}[\text{Arsenic} > 10\,\mu g/L] = \int_{\log(10\,\mu g/L)}^{\infty} d\chi_k \, f_K(\chi_k)$, and fK(χk) is the BME posterior PDF for
groundwater log-arsenic obtained at s. We can then categorize areas according to their
probability of non-attainment of the standard, as follows: Areas will be Highly Likely in
Non-Attainment for Prob[Non-Attainment]>0.9, Likely in Non-Attainment for
0.5<Prob[Non-Attainment]<0.9, Near Non-Attainment for 0.1<Prob[Non-Attainment]<0.5, and
Highly Likely in Attainment for Prob[Non-Attainment]<0.1.
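The probability of Eq. (3.45) and the four categories can be sketched numerically for a posterior PDF tabulated on a grid. Several choices below are assumptions made for illustration: the posterior is taken to be Gaussian with a hypothetical mean and standard deviation (a BME posterior is generally not Gaussian), the logarithm is taken to be natural, and probabilities falling exactly on 0.1, 0.5 or 0.9, which the definition above leaves unassigned, are placed in the lower category.

```python
import math

def prob_exceed(chi, pdf, threshold):
    """Trapezoidal integral of a gridded PDF over chi >= threshold."""
    total = 0.0
    for i in range(len(chi) - 1):
        a, b = chi[i], chi[i + 1]
        if b <= threshold:
            continue  # cell lies entirely below the threshold
        fa, fb = pdf[i], pdf[i + 1]
        if a < threshold:
            # Clip the cell at the threshold using linear interpolation
            fa = fa + (fb - fa) * (threshold - a) / (b - a)
            a = threshold
        total += 0.5 * (fa + fb) * (b - a)
    return total

def category(p):
    """Map a probability of non-attainment to the four categories."""
    if p > 0.9:
        return "Highly Likely in Non-Attainment"
    if p > 0.5:
        return "Likely in Non-Attainment"
    if p > 0.1:
        return "Near Non-Attainment"
    return "Highly Likely in Attainment"

# Hypothetical Gaussian posterior for log-arsenic at one location
mean, sd = math.log(20.0), 0.5
chi = [mean - 6 * sd + i * (12 * sd / 2000) for i in range(2001)]
pdf = [math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
       for x in chi]

p_non_attainment = prob_exceed(chi, pdf, math.log(10.0))
label = category(p_non_attainment)
```

For this hypothetical posterior, centered on 20 µg/L, the probability of exceeding 10 µg/L is a little above 0.9, so the location falls in the Highly Likely in Non-Attainment category.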
Using this probabilistic criterion of non-attainment we obtain the map shown in Figure
3.12. This map is very useful as it provides the most accurate delineation of non-attainment
areas given the arsenic and pH data available, and it uses shades of grey to show the
probability of non-attainment of the standard. The categories of non-attainment are, from
the darkest shade of grey to the lightest shade of grey: Highly Likely in Non-Attainment;
Likely in Non-Attainment; Near Non-Attainment; and Highly Likely in Attainment.
Figure 3.12: BME map of the probability that the groundwater arsenic concentration across
New England is in non-attainment of the drinking water standard of 10 µg/L for arsenic.
3.3.2.6. Cross validation results between simple kriging, co-kriging and BME
We explore cross validation errors for method 1 (i.e. simple kriging), method 2 (i.e.
co-kriging), and method 3 (i.e. BME) using the real data available. Cross validation errors for
each method are obtained for all the logAs hard data points, and the corresponding cross
validation MSE values are MSE1 = 6.11, MSE2 = 6.57, and MSE3 = 2.30 for the
simple kriging, co-kriging, and proposed BME methods, respectively. Hence, quite
surprisingly, we find in this real case study that even though co-kriging processes the
additional information provided by the rich dataset on the secondary pH variable, its mapping
accuracy is worse than that of simple kriging, which ignores the pH data entirely. This
illustrates the fact that the co-kriging method performs poorly in the absence of a strong
cross-correlation between arsenic and the secondary variable, as reported in Welhan and
Merrick’s (2003) study of groundwater arsenic using conductance as the secondary variable.
On the other hand BME drastically outperforms both simple kriging and co-kriging. Indeed
the percent MSE change between simple kriging and BME is $r_{MSE}^{1,3}$ = -62.3%, while the percent MSE
change between co-kriging and BME is $r_{MSE}^{2,3}$ = -65.0%, which represent a dramatic gain in
mapping accuracy. This result further supports that by explicitly modeling and processing
the empirical law between arsenic and its secondary variable, our proposed BME approach is
much more efficient than the classical co-kriging method of multivariate Geostatistics at
integrating the secondary data.
Finally we illustrate the impact of our work in the assessment of groundwater arsenic
across New England by showing in Figure 3.13 the maps obtained using simple kriging,
co-kriging, and our proposed BME method. We can see that the simple kriging map (Figure
3.13a) is similar to the co-kriging map (Figure 3.13b). In other words, co-kriging fails to
incorporate the secondary pH data in a way that would update the arsenic map obtained
without the pH data. On the other hand the BME map (Figure 3.13c) not only has a cross
validation MSE 65% lower than that of the co-kriging map, it also results in a meaningful
update of the arsenic map. In fact one can see that the BME map results in an increase in
the estimated level of groundwater arsenic over a
substantial area of New England. This results in a substantial increase in the territory
assessed as being in Near Non-Attainment of the 10 µg/L drinking water standard, which has
important health risk, water treatment, and water resources management implications.
Figure 3.13: Maps of the concentration of arsenic in the groundwater of New England
obtained using (a) method 1 (simple kriging), (b) method 2 (co-kriging), and (c) method 3
(our proposed BME method).
3.4. Conclusions
The multivariate co-kriging method of classical Geostatistics has been a traditional approach
to improve the mapping accuracy of a primary variable of interest by integrating data about a
related secondary variable. However co-kriging only accounts for the cross correlation
coefficient summarizing the relationship between the primary and secondary variables. On
the other hand the BME approach developed in this work rigorously processes the multiple
nonlinear aspects of a realistic stochastic empirical law that fully describes the relationship
between primary and secondary variables. Insight to validate the proposed BME method was
gained by means of a synthetic case study involving simulated maps of groundwater arsenic
and soil pH generated by a simulator developed for this work. This simulator
allowed generating realizations of two SRFs logAs(s) and pH(s) with prescribed statistical
properties reproducing those observed in New England, and with a wide variety of empirical
laws reproducing those reported in previous studies. The synthetic case study was consistent
in demonstrating that when mapping arsenic, the proposed BME approach is drastically more
efficient at incorporating the secondary pH data than co-kriging. Once validated, the BME
method was applied to a real case study considering the mapping analysis of groundwater
arsenic in New England using soil pH as the secondary variable.
Our proposed approach is very effective at assimilating a stochastic empirical law by
generating appropriate probabilistic soft data for the primary variable on the basis of the
secondary data available. This procedure was implemented by modeling the conditional PDF
of logAs given a collocated measured value ψ for soil pH. This conditional PDF was set to a
known statistical distribution (e.g. Gaussian) parameterized on ψ, and we presented three
straightforward approaches to obtain the vectorial parameter function µ(ψ) on the basis of the
collocated logAs and pH measurements available. We were thus able to generate logAs soft
data given any pH measurements, and these soft data were rigorously processed by the BME
method together with error free measurements of logAs to finally produce arsenic exposure
maps as well as maps of the associated estimation error.
Several conclusions can be drawn from the synthetic and real case studies of groundwater
arsenic and soil pH in New England, as follows:
• The simulator developed in this work was successful at generating realizations of
primary and secondary SRFs that have prescribed statistical moments up to order two,
and with collocated values following a quadratic empirical law with curvature and co-
variability controlled by the parameters of the simulator. Using this simulator we
obtained realizations of groundwater arsenic and soil pH with mean and covariance
reproducing that found in New England, while having empirical laws with varying
quadrature and cross correlation between collocated measurements for the primary and
secondary variables. This simulator is general and will be useful to investigate any
environmental contaminant and its associated secondary data.
• The synthetic case study confirmed that the implementation of the three straightforward
approaches described in this paper to obtain the conditional PDF was easy to use and
therefore provided successful ways to model the stochastic empirical law. The non-
parametric approach is the most general and is useful when no prior information about
the empirical law is available; however, it is the most demanding in terms of the amount
of collocated data necessary. When a relatively small number of collocated
measurements are available, then the parametric approach offers a useful alternative. In
that case if the empirical law is known to be linear, then the parametric approach using a
polynomial of first order can be used, otherwise the second order polynomial approach
offers a good tradeoff for quadratic empirical laws.
• The synthetic case study clearly demonstrates that the BME approach presented in this
work is drastically more efficient at incorporating secondary data than co-kriging. The
improvement in MSE reduction when mapping groundwater arsenic in New England
using soil pH secondary data indicates that BME is consistently at least 600% more
efficient than co-kriging at incorporating the secondary data. Furthermore the
improvement of BME over co-kriging is more drastic when the empirical law is
nonlinear, or when the cross-correlation between the primary and secondary variables is weak.
These results indicate that the proposed BME method should provide a useful alternative
to co-kriging in a wide variety of environmental mapping problems where co-kriging is
not efficient at integrating secondary data.
• The real case study presented provides the most accurate exposure map obtained to date
for groundwater arsenic in New England on the basis of the arsenic and soil pH data
available to the authors. Using this groundwater arsenic exposure map, we produce a
probabilistic map of non-attainment of the 10µg/L drinking water standard for arsenic
that is of key importance for state regulators, public health scientists, and the drinking
water industry. Future work will expand the current case study by incorporating new
groundwater arsenic data that are currently being collected in New England.
The numerical work and complexity of co-kriging and the proposed BME method are
similar. Co-kriging requires an extra step to model the cross-covariance between primary and
secondary variables. The computational cost and complexity of this step are saved in the
proposed BME method, and replaced with modeling the empirical law, which is shown in
this work to be relatively straightforward. However while both methods are easy to
implement, this work demonstrates that because the proposed BME approach formally
accounts for the empirical law between the primary and secondary variables, it leads to a
substantial improvement in mapping accuracy over the co-kriging method which only
accounts for the cross-correlation between primary and secondary variables. As a result, this
work suggests a shift of the multivariate mapping paradigm from co-kriging to the proposed
BME method when dealing with secondary variables related to the primary variable through
a variety of empirical laws.
IV. A geostatistical mapping framework integrating data obtained at
different temporal or spatial observation scale
4.1. Background
In many environmental and health mapping applications, traditional geostatistical
approaches have played a significant role in estimating a variable of interest at unsampled
locations (Warner et al., 2003; Lai, 2004; Krivoruchko and Gotway, 2004). Measured values
are usually sparsely located over space and time due to the difficulty and cost of obtaining
data. In some cases, the data for the same variable of interest may have been collected at
different temporal or spatial observation scales. For example the U.S. Environmental
Protection Agency (U.S. EPA) collects monitoring data for the criteria air pollutants both at
the hourly and daily observation scales. In this case, mixing hourly and daily data may
alleviate the problem of the sparsity of the data available; however this essentially disregards
the scale effect of estimation results. Another example using health outcome data is asthma
prevalence among children, which is sometimes measured at specific schools, as well as
being routinely reported at much larger observation scales such as that of counties. In this
example as well we see that the scale effect must be recognized since a variable displays
different physical properties depending on the spatial or temporal scale at which it is
observed.
The importance of accounting for the scale effect was already investigated in previous
works, such as that of Choi et al. (2003) where they demonstrated the usefulness of the
multiscale approach through a downscaling procedure. In this chapter we mathematically

derive the conditional PDF of a variable at the local scale given an observation of that
variable at a larger scale. Once this framework is developed, it is possible to generate soft
data for the local scale on the basis of data observed at different scales. This approach makes
it possible to efficiently mix data observed at a variety of scales, and increases the accuracy of
the map obtained for the scale of interest. Our developed framework is formulated in the
one-dimensional temporal case, corresponding for example to the mixing of PM data
observed at different temporal scales (e.g. the mixing of hourly and daily PM readings). We
then extend the formulation of the framework to the two dimensional spatial case, and we
apply that formulation to a real case study. The real case study considered is the mapping
analysis of local scale asthma symptoms prevalence among children in North Carolina using
data obtained at the school spatial scale, and data obtained at the county spatial scale.
In the following sections, we first lay out the conceptual framework to model the
uncertainty associated with the observation scale, and we obtain mathematical formulations
for one-dimensional (temporal) and two-dimensional (spatial) observation scale uncertainty.
In each case (temporal and spatial), we validate the framework by comparing the observation
scale uncertainty predicted theoretically from the mathematical formulation, with that
inferred from multiple random realizations of a synthetic case study. Additionally we use the
synthetic case studies to quantify the gain in mapping accuracy achieved when the BME
mapping method rigorously accounts for observation scale uncertainty, compared to classical
approaches not accounting for the observation scale effect. Finally we apply the developed
framework to a real case study involving the estimation of asthma prevalence in North
Carolina. We find that in all cases the developed framework adequately describes the
uncertainty associated with the observation scale, which leads to realistic soft PDFs for the
observation scale uncertainty that are rigorously assimilated by the BME method, and results
in a substantial improvement in mapping accuracy over classical mapping methods that
ignore the scale effect.
4.2. Space/time observation scale: A general conceptual framework
4.2.1. A review of BME mapping method
We define X(p) as a space/time random field (S/TRF) (Christakos, 1992) representing an
environmental or health variable X of interest at space/time location p=[s, t], where s=[s1,…,
sd] is the spatial location in a d-dimensional spatial domain, and t is time. When restricting
our attention to a set of n mapping points pmap=[p1, p2,…, pn], the S/TRF reduces to a vector
of random variables xmap=[X(p1), X(p2),…, X(pn)]. The randomness of the S/TRF at the
mapping points pmap is defined by the set of possible realizations χmap =[χ1, χ2 , …, χn] of the
random vector xmap. The probability of a given realization χmap is calculated from the
multivariate probability density function (PDF) fX(.) of the S/TRF X(p) as follows
Prob[χ1 < x1 <χ1+dχ1,…, χn < xn < χn+dχn] = fX(χmap) dχ (4.1)
where Prob[.] is a probability operator. Hence the multivariate PDF fX(.) provides a complete
stochastic description of the S/TRF X(p) at the mapping points pmap.
At the structural stage of BME analysis we use a maximum entropy information
processing rule (Christakos, 2000) to obtain the multivariate PDF of X(p) on the basis of its
mean trend
mX(p) = E[X(p)], (4.2)
and covariance function
cX(p, p’) = E[ (X(p)-mX(p)) (X(p’)-mX(p’)) ], (4.3)
where E[.] is a stochastic expectation operator. Eqs. (4.2) and (4.3) constitute a general
knowledge base G from which the structural PDF obtained by maximizing entropy is
(Christakos, 2000)
$$f_G(\chi_{map}) = \phi(\chi_{map};\, m_{map},\, c_{map}), \qquad (4.4)$$
where φ (.) is the multivariate Gaussian PDF with mean vector mmap and covariance matrix
cmap calculated from Eqs. (4.2) and (4.3), respectively. The subscript G in Eq. (4.4)
emphasizes that the structural PDF fG was obtained on the basis of the general knowledge G
only. This structural PDF will serve as the prior PDF for the Bayesian updating performed at
the integration stage of the BME analysis.
At the specificatory stage of the BME analysis we assess and statistically describe the
data available at specific spatial locations. Hard data corresponds to exact measured values
χhard obtained at points phard defined such that
Prob[ X(phard) =χhard] = 1. (4.5)
On the other hand, the soft data at points psoft correspond to measurements with an associated
uncertainty that can be characterized statistically by the so-called soft PDF fS(χsoft) defined as
(Christakos et al., 2001; Christakos and Serre, 2000a; Serre et al., 2005)
$$\mathrm{Prob}[X(p_{soft}) < u] = \int_{-\infty}^{u} d\chi_{soft}\, f_S(\chi_{soft}). \qquad (4.6)$$
At the integration stage of the BME analysis, a Bayesian conditionalization information
processing rule is applied to update the prior PDF with the site-specific knowledge base S,
which yields the posterior PDF fK(χk) describing xk=X(pk) at any estimation point pk
(Christakos, 2000)
$$f_K(\chi_k) = A^{-1} \int d\chi_{soft}\, f_S(\chi_{soft})\, f_G(\chi_k, \chi_{hard}, \chi_{soft}), \qquad (4.7)$$
where A is a normalization coefficient. The posterior PDF provides a full stochastic
assessment of xk, from which we can obtain an appropriate estimated value (such as the
expected value of the posterior PDF), as well as an assessment of the associated estimation
uncertainty (such as the variance of the posterior PDF).
In the following sections we describe a framework to rigorously account for the data
uncertainty from the different observation scales in time or two-dimensional space. Thanks to
this developed framework we can model this type of data uncertainty in terms of probabilistic
soft data which can be systematically processed in the BME mapping method.
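
The integration stage of Eq. (4.7) can be illustrated with a minimal numerical sketch. It assumes a zero-mean Gaussian prior with an exponential covariance, one hard datum and one Gaussian soft datum; the point locations, parameter values and function names (`gaussian_pdf`, `bme_posterior`) are illustrative assumptions and are not taken from this work.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at each row of x."""
    d = np.atleast_2d(x) - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2.0 * np.pi) ** len(mean) * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum("ij,jk,ik->i", d, inv, d)) / norm

def bme_posterior(chi_k_grid, chi_hard, soft_mean, soft_var, mean, cov, n_soft=801):
    """Posterior f_K of Eq. (4.7) on a grid, for one hard and one Gaussian soft datum.

    Variable order in mean/cov: [estimation point, hard point, soft point].
    """
    # Integration grid for chi_soft; f_S is negligible beyond 8 standard deviations.
    half_width = 8.0 * np.sqrt(soft_var)
    chi_soft = np.linspace(soft_mean - half_width, soft_mean + half_width, n_soft)
    d_soft = chi_soft[1] - chi_soft[0]
    f_s = np.exp(-0.5 * (chi_soft - soft_mean) ** 2 / soft_var) / np.sqrt(2.0 * np.pi * soft_var)
    post = np.empty_like(chi_k_grid)
    for i, chi_k in enumerate(chi_k_grid):
        pts = np.column_stack([np.full(n_soft, chi_k), np.full(n_soft, chi_hard), chi_soft])
        post[i] = np.sum(f_s * gaussian_pdf(pts, mean, cov)) * d_soft
    d_k = chi_k_grid[1] - chi_k_grid[0]
    return post / (np.sum(post) * d_k)          # division by the normalization A

# Hypothetical 1-D setup: estimation point at 0, hard datum at -1, soft datum at 2.
locs = np.array([0.0, -1.0, 2.0])
cov = np.exp(-3.0 * np.abs(locs[:, None] - locs[None, :]) / 3.0)   # variance 1, range 3
mean = np.zeros(3)
grid = np.linspace(-6.0, 6.0, 601)
f_k = bme_posterior(grid, chi_hard=0.8, soft_mean=-0.2, soft_var=0.25, mean=mean, cov=cov)
step = grid[1] - grid[0]
post_mean = np.sum(grid * f_k) * step                       # BME mean estimate
post_var = np.sum((grid - post_mean) ** 2 * f_k) * step     # estimation error variance
```

In this all-Gaussian setting the posterior moments can be checked against standard Gaussian conditioning in which the soft variance is added to the diagonal entry of the soft point; the grid-based form is kept here because it carries over to non-Gaussian soft PDFs.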
4.2.2. Conceptual framework for the uncertainty associated with the observation scale
Let X(p) be a space/time random field (S/TRF) representing an environmental or health
variable of interest. In general we say that X(p) represents the variable of interest at the local
scale in order to differentiate it from its observed value averaged over some space/time
domain V(p). The average of the S/TRF X(p) over the space/time domain V(p) is defined as
the S/TRF Z(p) given by the equation
$$Z(p) = \frac{1}{\|V(p)\|} \int_{V(p)} du\, X(u). \qquad (4.8)$$
Example 4.1: X(p) is the instantaneous particulate matter (PM) concentration at p=(s,t), while
Z(p) is its daily average. Then V(p) is the time interval of duration T=24 hours centered at p=(s,t),
i.e. V(p)={s , u} such that u ∈ [t-T /2 , t+T /2], and $Z(p) = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, X(s,u)$.
Example 4.2: X(p) is the risk (i.e. probability) that a child at p=(s,t) has experienced asthma
symptoms in its lifetime, while Z(p) is the asthma symptoms prevalence observed among the
children of a specific county. Then V(p) is the surface area of the county centered at p=(s,t),
i.e. V(p)={u , t} such that u =[u1, u2]∈ As, where As is the geographical extent of the county
centered at s, and Z(p) =∫u ∈ As duX(u,t) / || As||.
In order to analyze the relationship between the local scale S/TRF X(p) and the V–scale
S/TRF Z(p), we define the random field Y(p’,p) as
Y(p’,p) = X(p’)-Z(p). (4.9)
Eq. (4.9) can also be written as X(p’)=Z(p)+Y(p’,p), indicating that when assessing X(p’),
Y(p’,p) acts as an additive error term to the value Z(p) observed at scale V. It follows that the
conditional PDF of X(p’) given an observed value ζ for Z(p) is
fS(χs| ζ) = fY (χs-ζ ), (4.10)
where fY is the PDF for Y(p’,p).
Let’s now consider the class of S/TRFs X(p) that are normally distributed. Then due to
the properties of the multivariate Gaussian distribution, Z(p) is normally distributed (since
according to Eq. (4.8) it can be written as an infinite sum of normally distributed variables),
and consequently Y(p’,p) is also normally distributed (since according to Eq. (4.9) it can be
written as the sum of two normally distributed variables). It follows that under the
assumption that X(p) is normally distributed, then the PDF for Y(p’,p) is given by
fY(ψ)=φ(ψ;mY,σY2), where φ(.) is the Gaussian distribution completely defined by its mean
mY= E[Y(p’,p)] and variance σY2. Inserting fY(ψ)=φ(ψ;E[Y(p’,p)],σY2) in Eq. (4.10), we obtain
after a change of variable
fS(χs| ζ) = φ (χs ; E[Y(p’,p)]+ ζ , σY2). (4.11)
Eq. (4.11) provides a probabilistic soft datum for the local scale X at point p’ given a V-scale
observed value at point p. This soft datum is rigorously processed by the BME method,
making it possible to account accurately for observations of X(p) at any space/time scale V. The
problem then becomes that of obtaining E[Y(p’,p)] and σY2 for different space/time scales V
of interest. In the following sections, we first consider the one-dimensional temporal case
where X is only a function of time, i.e. X(t) is a temporal random field, and Z(t) is the average
of X(t) over a time period T (e.g. the hourly or daily average). We then extend the work to
the two-dimensional spatial case where X is only a function of space, i.e. X(s) is a spatial
random field, and Z(s) is the average of X(s) over a spatial domain (e.g. Z(s) is the average of
X(s) over a county).
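
As a small illustration of Eq. (4.11), the sketch below builds the Gaussian soft datum from a V-scale observation ζ once E[Y(p’,p)] and σY2 are known. The function names and the numerical values are hypothetical.

```python
import math

def soft_datum(zeta, mean_y, var_y):
    """Mean and variance of the Gaussian soft PDF of Eq. (4.11)."""
    return zeta + mean_y, var_y

def soft_pdf(chi, zeta, mean_y, var_y):
    """f_S(chi | zeta) = phi(chi; E[Y] + zeta, sigma_Y^2)."""
    mu, var = soft_datum(zeta, mean_y, var_y)
    return math.exp(-0.5 * (chi - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical numbers: a V-scale observation of 12.0 with E[Y] = 0.5 and sigma_Y^2 = 4.0.
mu, var = soft_datum(12.0, 0.5, 4.0)
```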
The linear kriging method of classical Geostatistics simply combines observed values of
X(p’) and Z(p) to estimate X(p’) at unsampled locations without special differentiation of the
scale effects. By contrast our proposed BME mapping method uses Eq. (4.11) to generate
soft data for X(p’) from the observations obtained at various space/time scales. We
compare the mapping accuracy of the BME and classical methods across a wide
variety of case studies.
4.3. Temporal observation scale: Mathematical formulation and synthetic case study
4.3.1 Mathematical formulation
4.3.1.1. Non-stationary temporal random field
As the most general case of a temporal random field (TRF), we consider the non-stationary
TRF X(t) with mean mX(t)=E[X(t)] at time t, and with covariance cX(t,t’)= E[(X(t)-
mX(t))(X(t’)- mX(t’))] between time t and t’. In general, a non-stationary TRF does not have a
constant mean over time, and its covariance cannot be expressed solely as a function of the
temporal lag τ=|t-t’|.
In the case of TRFs, the averaging domain V becomes the time interval V(t) = [t-T/2 ,
t+T/2] of duration T centered at time t, and the V-scale observation of X is
$Z(t) = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, X(u)$. Then Eq. (4.9) is written as
Y(t’,t) = X(t’)- Z(t) (4.12)
where t indicates the mid-point of the time interval V(t), and t’ denotes any possible time
within V(t).
We then derive the expected value of Y(t’,t) to be (see Appendix C for details)
$$E[Y(t',t)] = m_X(t') - \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, m_X(u), \qquad (4.13)$$
and its variance (see Appendix C for details)
$$\sigma_Y^2(t',t) = \sigma_X^2 + \{m_X(t')\}^2 - \frac{2}{T}\int_{t-T/2}^{t+T/2} du\, \{c_X(t',u) + m_X(t')\, m_X(u)\} + \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, \{c_X(u,u') + m_X(u)\, m_X(u')\} - \Big\{m_X(t') - \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, m_X(u)\Big\}^2. \qquad (4.14)$$
Eqs. (4.13) and (4.14) have been obtained without making assumptions about the
stationarity of X(t), and they therefore apply to a wide variety of non-stationary TRFs. When
the averaging time scale T is small relative to the fluctuations of the mean trend, we can
further simplify Eqs. (4.13) and (4.14) by linearizing mX(t), i.e. we use the approximation
mX(t)=m0+m1t for t ∈ T. In that case the expected value of Y(t’,t) reduces to (see Appendix C
for details)
E[Y(t’,t)] = m1(t’-t), (4.15)
and its variance is given by (see Appendix C for details)
$$\sigma_Y^2(t',t) = \sigma_X^2 - \frac{2}{T}\int_{t-T/2}^{t+T/2} du\, c_X(t',u) + \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, c_X(u,u'). \qquad (4.16)$$
Eq. (4.14) (or 4.16 for a linearized mean trend) is of substantial importance for environmental and
health Geostatistics. Indeed σY2(t’,t) quantifies the data uncertainty as a function of the scale at
which the variable of interest is observed. This equation can numerically be calculated for
any non-stationary TRF whatever its mean trend or covariance function may be.
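
As a rough sketch of such a numerical calculation, the code below evaluates Eqs. (4.15) and (4.16) with a simple midpoint quadrature. The function names, the quadrature rule and both covariance models are assumptions made for illustration only; σX2 is taken as c_X(t’,t’).

```python
import numpy as np

def scale_moments(cov, t_prime, t, T, m1=0.0, n=400):
    """E[Y(t',t)] (Eq. 4.15) and sigma_Y^2(t',t) (Eq. 4.16) by midpoint quadrature.

    cov(a, b) must return c_X(a, b) and broadcast over NumPy arrays.
    """
    u = t - T / 2.0 + (np.arange(n) + 0.5) * T / n        # midpoints covering V(t)
    var_x = float(cov(t_prime, t_prime))                   # sigma_X^2 at t'
    cross = float(np.mean(cov(t_prime, u)))                # (1/T) * integral of c_X(t', u)
    block = float(np.mean(cov(u[:, None], u[None, :])))    # (1/T^2) * double integral
    return m1 * (t_prime - t), var_x - 2.0 * cross + block

def exp_cov(a, b, sigma2=1.0, a_t=10.0):
    """Stationary exponential covariance (illustrative parameters)."""
    return sigma2 * np.exp(-3.0 * np.abs(a - b) / a_t)

def modulated_cov(a, b):
    """Hypothetical non-stationary covariance: exponential model with a drifting scale."""
    scale = lambda s: 1.0 + 0.01 * s
    return scale(a) * scale(b) * exp_cov(a, b)

mean_y, var_y = scale_moments(exp_cov, t_prime=50.0, t=50.0, T=10.0)
mean_ns, var_ns = scale_moments(modulated_cov, t_prime=52.0, t=50.0, T=10.0, m1=0.1)
```

The number of quadrature nodes `n` trades accuracy for cost; the double integral is evaluated on an n by n grid.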
We turn now to the case of stationary TRFs in order to further simplify this equation, and
gain more physical intuition about it.
4.3.1.2. Stationary temporal random field
Stationary TRF have a constant mean trend, i.e. mX(t)=m0, and a covariance between time t
and t’ that can be expressed in terms of the temporal lag τ=|t-t’|, i.e. cX(t,t’)= cX(τ=|t-t’|).
In order to provide more general results, we first consider the case of a linearized non-
stationary mean trend with stationary covariance cX(τ). By substituting cX(t,t’) with cX(|t-t’|)
in Eq. (4.16), we obtain (see Appendix C for more details),
$$\sigma_Y^2(|t'-t|) = \sigma_X^2 - \frac{2}{T}\Big[\int_{-T/2}^{t'-t} du\, c_X(t'-u-t) + \int_{t'-t}^{T/2} du\, c_X(-t'+u+t)\Big] + \frac{1}{T^2}\int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, c_X(|u-u'|). \qquad (4.17)$$
We now consider the case of stationary TRFs with constant mean trend, which are
obtained by setting m1 to zero. We note that because m1 does not appear in Eq. (4.17), then
this equation remains unchanged for stationary TRFs. This is an important finding, stating
that Eq. (4.17) is valid for any TRF with stationary covariance, as long as its mean trend can
be linearized in the time interval T.
While Eq. (4.17) can be numerically calculated for any stationary covariance model, we
will now consider the particular case where the covariance function can be expressed as the
sum of n exponential covariance models, i.e.
$c_X(t,t') = c_X(\tau=|t-t'|) = \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\big(-3|t-t'|/a_{ti}\big)$,
where σXi2 and ati are the variance and temporal range, respectively, of the i-th covariance
model. Note that while nested covariance models represent a large class of useful S/TRFs,
other covariance models can just as easily be examined. In this case σY2(|t’-t|) is given by
(see Appendix C for more details),
$$\sigma_Y^2(|t'-t|) = \sum_{i=1}^{n}\sigma_{Xi}^2 - 2\sum_{i=1}^{n}\frac{a_{ti}\,\sigma_{Xi}^2}{3T}\Big[2 - \exp\Big(\frac{-3(t'+T/2-t)}{a_{ti}}\Big) - \exp\Big(\frac{-3(-t'+t+T/2)}{a_{ti}}\Big)\Big] + \sum_{i=1}^{n}\frac{a_{ti}\,\sigma_{Xi}^2}{3T^2}\Big[2T - \frac{2}{3}a_{ti} + \frac{2}{3}a_{ti}\exp\Big(\frac{-3T}{a_{ti}}\Big)\Big]. \qquad (4.18)$$
This equation is useful as it provides an algebraic equation for σY2 that is very efficient to
calculate numerically. In the case of a single structure (i.e. n=1) we write for simplicity
purposes the covariance model as $c_X(\tau=|t-t'|) = \sigma_X^2 \exp\big(-3|t-t'|/a_t\big)$ (i.e. we let σX2 and at be
the variance and temporal range of the TRF X(t), respectively). In this case, Eq. (4.18)
further reduces to
$$\frac{\sigma_Y^2(|t'-t|)}{\sigma_X^2} = 1 - \frac{2}{3}\frac{1}{T/a_t}\Big[2 - \exp\Big(3\frac{(t-t')}{a_t} - \frac{3}{2}\frac{T}{a_t}\Big) - \exp\Big(-3\frac{(t-t')}{a_t} - \frac{3}{2}\frac{T}{a_t}\Big)\Big] + \frac{1}{3}\frac{1}{T/a_t}\Big[2 - \frac{2}{3}\frac{1}{T/a_t} + \frac{2}{3}\frac{1}{T/a_t}\exp\Big(-3\frac{T}{a_t}\Big)\Big]. \qquad (4.19)$$
Eq. (4.19) is conceptually very meaningful for the physical understanding of the connection
between uncertainty and observation scale. We see that the equation is expressed in terms of
three non-dimensional groupings, which are σY2(|t’-t|)/σX2, (t-t’)/at and T/at. As illustrated
later in the case studies, for a given (t-t’)/at, we find that σY2(|t’-t|)/σX2 increases from zero
to one as T/at increases from zero to infinity. In other words, when the observation scale T is
very small relative to the covariance range at, then the observed value Z(t) at scale T is very
informative for the assessment of X(t’) (i.e. the corresponding soft data has a small variance
σY2(|t’-t|) ). As T increases, the T–scale observed value Z(t) becomes less and less
informative for X(t’) (i.e. σY2(|t’-t|) increases), until a point where Z(t) becomes irrelevant for
the estimation of X(t’) (i.e. σY2(|t’-t|) reaches the variance σX2of the TRF X(t)).
By way of example, this result simply expresses the fact that, when estimating
instantaneous PM concentration, then a 1-hour average PM concentration is more
informative than, say, a weekly average of PM concentration. Furthermore, Eq. (4.19)
makes it possible to integrate both 1-hour and weekly average measurements by assigning a different
variance σY2 to each of these measurements according to their observation scale.
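
The behaviour described above can be explored with a short sketch that evaluates the normalized variance of Eq. (4.19). The function name `variance_ratio` and the 48-hour temporal range used in the example are assumptions chosen only for illustration.

```python
import numpy as np

def variance_ratio(T_over_a, d_over_a=0.0):
    """sigma_Y^2 / sigma_X^2 from Eq. (4.19).

    T_over_a = T / a_t and d_over_a = (t - t') / a_t, with |t - t'| <= T / 2.
    """
    theta = np.asarray(T_over_a, dtype=float)
    d = np.asarray(d_over_a, dtype=float)
    cross = (2.0 / (3.0 * theta)) * (2.0 - np.exp(3.0 * d - 1.5 * theta)
                                     - np.exp(-3.0 * d - 1.5 * theta))
    block = (1.0 / (3.0 * theta)) * (2.0 - (2.0 / (3.0 * theta)) * (1.0 - np.exp(-3.0 * theta)))
    return 1.0 - cross + block

a_t = 48.0   # hypothetical temporal range, in hours
for label, T in [("1-hour", 1.0), ("daily", 24.0), ("weekly", 168.0)]:
    # Soft data variance relative to sigma_X^2 for each averaging period (t' = t).
    print(label, float(variance_ratio(T / a_t)))
```

Multiplying the returned ratio by σX2 gives the soft data variance to assign to a measurement observed at scale T.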
Usually when generating soft data we will use t’=t. The equation for the soft data
variance is then simply obtained by setting (t-t’)/at =0 in Eq. (4.19), which leads to
$$\frac{\sigma_Y^2}{\sigma_X^2} = 1 - \frac{2}{3}\frac{1}{T/a_t}\Big[2 - 2\exp\Big(-\frac{3}{2}\frac{T}{a_t}\Big)\Big] + \frac{1}{3}\frac{1}{T/a_t}\Big[2 - \frac{2}{3}\frac{1}{T/a_t} + \frac{2}{3}\frac{1}{T/a_t}\exp\Big(-3\frac{T}{a_t}\Big)\Big]. \qquad (4.20)$$
4.3.2 Synthetic case study
4.3.2.1. Synthetic verification of the uncertainty model for temporal observation scale
We verify the conceptual framework presented by comparing the observation scale
uncertainty predicted theoretically in Eq. (4.19) with that inferred from multiple random
realizations of a TRF X(t) with exponential covariance function
$c_X(\tau) = \sigma_X^2 \exp(-3\tau/a_t)$. The
procedure consists in using M random realizations of X(t’) and
$Z(t) = \frac{1}{T}\int_{t-T/2}^{t+T/2} du\, X(u)$ to infer
a statistical estimate of σY2(|t’-t|), and comparing that synthetic estimate with the value
predicted theoretically by Eq. (4.19). Using classical simulation methods of Geostatistics
(Christakos, 1992), we obtain M realizations χ(k) =[χ1(k), χ2(k),…, χn(k)], k=1,…M, of the
random vector x=[X(t1), X(t2),…, X(tn)] representing the TRF X(t) discretized over a fine
temporal discretization grid t=[t1, t2,…, tn]. We choose an observation scale T of interest, a
time ti ∈ t such that t1+T/2≤ti≤ tn-T/2, and we obtain the M realizations Ζi(k), k=1,…M, of
$Z(t_i) = \frac{1}{T}\int_{t_i-T/2}^{t_i+T/2} du\, X(u)$ by numerically integrating each realization χ(k) over a time period T
centered at ti. We then choose a time tj ∈ t such that |ti-tj|<T/2 and we obtain the M
realizations χj(k), k=1,…M, of X(tj) by selecting the proper element in each χ(k) realization.
This procedure results in the generation of M random realizations {Ζi(k), χj(k)} , k=1,…M, for
the random values {Z(ti), X(tj)}. From these realizations, we can finally easily infer the
expected value and variance of Y(tj,ti)=X(tj)-Z(ti). A statistical estimator for the expected
value E[Y(tj,ti)] of Y(tj,ti) is
$\hat{E}[Y(t_j,t_i)] = \frac{1}{M}\sum_{k=1}^{M}\big(\chi_j^{(k)} - Z_i^{(k)}\big)$, while a statistical estimator for
its variance σY2(|tj-ti|) is
$$\hat{\sigma}_Y^2(|t_j-t_i|) = \frac{1}{M}\sum_{k=1}^{M}\big(\chi_j^{(k)} - Z_i^{(k)} - \hat{E}[Y(t_j,t_i)]\big)^2. \qquad (4.21)$$
Using this synthetic simulation approach, we obtain the synthetic estimate $\hat{\sigma}_Y^2/\sigma_X^2$ as a
function of T/at for various choices of (t-t’)/at , and compare this synthetic estimate with the
theoretical σY2/σX2 value obtained from Eq. (4.19). This procedure provides a way to verify
whether our conceptual framework leads through Eq. (4.19) to a correct assessment of the
uncertainty (i.e. the variance σY2) associated with the observation scale T.
Figure 4.1 shows the plot of σY2/σX2 as a function of T/at for selected values of (t-t’)/T.
The synthetic estimates $\hat{\sigma}_Y^2/\sigma_X^2$ obtained from multiple random realizations (Eq. 4.21) are
shown with markers, while the corresponding σY2/σX2 value predicted from theory (Eq. 4.19)
is shown with lines. The good agreement between theory and synthetic estimates provides
support that our conceptual framework adequately models the uncertainty associated with
temporal observation scale.
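
The verification procedure described above can be sketched as follows. This is only an illustrative reimplementation under stated assumptions: the grid spacing, range, number of realizations, random seed and helper names are all chosen here for the example, and the realizations are drawn with a Cholesky factor of the exponential covariance matrix.

```python
import numpy as np

def theory_ratio(T, a_t, d=0.0):
    """sigma_Y^2 / sigma_X^2 from Eq. (4.19), with d = t - t'."""
    th = T / a_t
    cross = (2.0 / (3.0 * th)) * (2.0 - np.exp(3.0 * d / a_t - 1.5 * th)
                                  - np.exp(-3.0 * d / a_t - 1.5 * th))
    block = (1.0 / (3.0 * th)) * (2.0 - (2.0 / (3.0 * th)) * (1.0 - np.exp(-3.0 * th)))
    return 1.0 - cross + block

rng = np.random.default_rng(0)
sigma2, a_t, dt, M = 1.0, 10.0, 0.1, 10000          # illustrative values
tgrid = np.arange(401) * dt                          # fine grid t_1, ..., t_n
C = sigma2 * np.exp(-3.0 * np.abs(tgrid[:, None] - tgrid[None, :]) / a_t)
L = np.linalg.cholesky(C + 1e-10 * np.eye(tgrid.size))
X = L @ rng.standard_normal((tgrid.size, M))         # one realization per column

def synthetic_ratio(T, offset_steps=0):
    """Estimator (4.21) with t_i at the grid centre and t_j = t_i + offset_steps * dt."""
    i = tgrid.size // 2
    half = int(round(T / (2.0 * dt)))
    w = X[i - half:i + half + 1]
    Z = (0.5 * w[0] + w[1:-1].sum(axis=0) + 0.5 * w[-1]) * dt / T   # trapezoidal rule
    Y = X[i + offset_steps] - Z
    return np.mean((Y - Y.mean()) ** 2) / sigma2

for T in (2.0, 10.0, 30.0):
    # Columns: T / a_t, synthetic estimate, theoretical value.
    print(T / a_t, synthetic_ratio(T), theory_ratio(T, a_t))
```

The two printed ratios should agree up to Monte Carlo error, which shrinks as M grows.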
Figure 4.1: Plot of σY2/σX2 as a function of T/at for different values of (t-t’)/T. Markers
indicate synthetic estimates obtained from multiple random realizations (Eq. 4.21), while lines
show the values predicted from theory (Eq. 4.19).
Figure 4.1 additionally provides some useful insights about the effect of temporal
observation scale. As defined earlier, σX2 is the variance of the TRF X(t), at is the temporal
range of its exponential covariance function, and σY2 is the variance of the conditional PDF
for the random variable X(t’) given a measured value for Z(t) obtained at time t with an
observation scale T. Hence a ratio σY2/σX2 less than one means that the Z(t) measured value
is informative for the random variable X(t’). Figure 4.1 indicates that the Z(t) measured
value is informative only when the observation scale T is small relative to the temporal
covariance range at. Hence this plot can be useful in determining a cut-off for the observation
scale. For example according to this plot, data measured at an observation scale T greater
than 3 times the temporal covariance range at have little information, and could be
disregarded. Furthermore Figure 4.1 shows that a measured value Z(t) is most informative
for X(t’) when t’=t. Physically this will mean that when constructing the soft datum for X at
some t’ given a measured value for Z at time t, a judicious choice is to select t’=t, i.e. to
construct the X soft datum at the mid-point of the interval T over which Z is observed.
4.3.2.2. Quantifying the improvement in mapping accuracy resulting from the integration of
temporal observation scale uncertainty
A validation procedure using synthetic random fields provides an excellent tool to quantify
the gain in mapping accuracy that our proposed approach provides over an approach not
accounting for the observation scale effect. As described above, synthetic random fields are
easily generated using classical simulation methods of Geostatistics (Christakos, 1992), such
that each realization of a TRF X(t) has a prescribed mean trend and covariance function
corresponding to an environmental or health variable of interest. Typically a realization of
the TRF X(t) consists of values χtrue =[χ1… χn] of X(tj) for a dense grid of times tj = j δt ,
j=1,…, n, with a small time interval δt. Then we obtain the values ζtrue =[ζ1… ζn] for
$Z(t_j) = \frac{1}{T_j}\int_{t_j-T_j/2}^{t_j+T_j/2} du\, X(u)$, j=1,…, n, by numerically integrating the simulated χtrue values over a
different observation scale Tj at each time tj, j=1,…, n. Hence χtrue represents the (synthetic)
truth for the field of interest observed at the local scale, while ζtrue represents the truth
observed at different observation scales Tj, j=1,…, n. We then randomly divide the truth χtrue
into a validation set χval and a data set χhard, so that χtrue=χval U χhard. Similarly we randomly
select a data set ζhard out of ζtrue.
The validation procedure consists in using only the data χhard and ζhard to obtain estimates
χval* of the local scale TRF X(t) for the validation times at which the truth χval is known. The
estimation errors are then obtained as the difference εval*=χval-χval* between true and
estimated values. Finally we obtain the mean square error (MSE) by averaging the squared
estimation errors. When interested in two different estimation methods (labeled as method 1
and 2), we obtain the MSE for each method (i.e. MSE1 and MSE2), and we quantify the
change in mapping accuracy by calculating the percent MSE change $r_{MSE}^{1,2}$ between method 1
and method 2 as $r_{MSE}^{1,2} = (MSE_2 - MSE_1)/MSE_1 \times 100$.
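
For concreteness, the validation MSE and the percent MSE change between two methods can be computed as in the sketch below. The function names and the validation values are made up for the example.

```python
import numpy as np

def mse(chi_val, chi_est):
    """Mean square error between true and estimated validation values."""
    err = np.asarray(chi_val, dtype=float) - np.asarray(chi_est, dtype=float)
    return float(np.mean(err ** 2))

def percent_mse_change(mse_1, mse_2):
    """Percent MSE change from method 1 to method 2; negative means method 2 is more accurate."""
    return (mse_2 - mse_1) / mse_1 * 100.0

truth = np.array([1.0, 2.0, 3.0, 4.0])               # made-up validation values
est_1 = truth + np.array([1.0, -1.0, 1.0, -1.0])     # method 1 estimates, MSE = 1.0
est_2 = truth + np.array([0.5, -0.5, 0.5, -0.5])     # method 2 estimates, MSE = 0.25
r_12 = percent_mse_change(mse(truth, est_1), mse(truth, est_2))   # -75.0
```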
The estimation methods that we compare are the BME approach accounting for the
observation scale of the data, as presented in this work, and the simple kriging method of
classical Geostatistics. Our proposed BME approach generates a conditional PDF fS(χs| ζ)
(Eqs. 4.11, 4.15 and 4.20) for each observed value of the vector ζhard and its corresponding
observation scale. The collection of the conditional PDFs constitutes the soft data χsoft,
which BME rigorously processes together with the hard data χhard directly observed at the
local scale. By contrast, the simple kriging estimates are obtained using either χhard only, or
both χhard and ζhard, as hard data, i.e. without accounting for the observation scale
uncertainty of the ζhard data. The percent MSE change $r_{MSE}^{SK,BME}$ then quantifies the percent
change from the SK method MSE to the BME method MSE. A negative $r_{MSE}^{SK,BME}$ means that
BME reduces the MSE (i.e. that BME is more accurate than SK), and the magnitude of a
negative $r_{MSE}^{SK,BME}$ quantifies the gain in mapping accuracy of BME over SK.
Using the Geostatistical simulation method based on a Cholesky decomposition of the
covariance matrix (Christakos et al., 2002), we generate 20 realizations of the TRF X(t) with
the following prescribed covariance function
$$c_X(\tau=|t'-t|) = c_{01}\exp\Big(\frac{-3\tau}{a_{t1}}\Big) + c_{02}\exp\Big(\frac{-3\tau}{a_{t2}}\Big), \qquad (4.22)$$
where c01= 0.7 × σX2, c02=0.3 × σX2, σX2= 4, at1 = 50, and at2= 250. Each realization consists of
the vector χtrue =[χ1… χn] simulating the value of the TRF X(t) at times tj ∈ t=[0, 1, …, 500].
We select from this time grid 8 time coordinates where the χhard data is sampled at the local
scale, and 37 time coordinates where the ζhard data is measured at varying observation time
scales Tj, j=1,…,39. Finally, on the basis of the ζhard data and the associated observation time
scales, we construct the conditional PDFs fS(χs| ζ) that constitute the soft data χsoft for our
proposed BME estimation approach.
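
The data generation steps above can be sketched as follows, using the covariance parameters of Eq. (4.22). Several choices in the sketch are assumptions rather than details reported here: the zero mean, the random seed, the way sampling times are drawn, the set of observation scales Tj, and the use of 37 averaged observations.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0
c0 = np.array([0.7, 0.3]) * sigma2            # c01, c02 of Eq. (4.22)
a_t = np.array([50.0, 250.0])                 # a_t1, a_t2 of Eq. (4.22)
t = np.arange(501, dtype=float)               # time grid 0, 1, ..., 500

def cov(tau):
    """Nested exponential covariance of Eq. (4.22)."""
    tau = np.abs(np.asarray(tau, dtype=float))[..., None]
    return np.sum(c0 * np.exp(-3.0 * tau / a_t), axis=-1)

# Cholesky simulation of one realization (zero mean is an assumption of this sketch).
C = cov(t[:, None] - t[None, :])
L = np.linalg.cholesky(C + 1e-8 * np.eye(t.size))
chi_true = L @ rng.standard_normal(t.size)

def soft_variance(T):
    """sigma_Y^2 at t' = t for the nested model, from Eq. (4.18)."""
    cross = np.sum(a_t * c0 / (3.0 * T) * (2.0 - 2.0 * np.exp(-1.5 * T / a_t)))
    block = np.sum(a_t * c0 / (3.0 * T ** 2)
                   * (2.0 * T - (2.0 / 3.0) * a_t + (2.0 / 3.0) * a_t * np.exp(-3.0 * T / a_t)))
    return c0.sum() - 2.0 * cross + block

# Hypothetical sampling design: 8 local-scale points and 37 averaged observations,
# drawn from the middle of the grid so that every averaging window stays inside it.
idx = rng.choice(np.arange(100, 401), size=45, replace=False)
hard_idx, z_idx = idx[:8], idx[8:]
T_j = rng.choice(np.arange(2, 201, 2), size=z_idx.size)     # even durations up to 200
chi_hard = chi_true[hard_idx]
zeta_hard = np.array([chi_true[i - T // 2:i + T // 2 + 1].mean() for i, T in zip(z_idx, T_j)])

# Gaussian soft data of Eq. (4.11) with E[Y] = 0 (constant mean, t' = t).
soft_mean = zeta_hard
soft_var = np.array([soft_variance(float(T)) for T in T_j])
```

Each pair (`soft_mean[j]`, `soft_var[j]`) defines one conditional PDF; larger Tj values give larger variances and therefore flatter, less informative soft data.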
For illustration purposes, one of the generated realizations χtrue is shown with a dotted line
in Figure 4.2, along with the χhard data represented by circles, and the ζhard data represented
by crosses. As explained above, the ζhard data is obtained by numerical integration of χtrue
over each of the observation time scales Tj, j=1,…,39. We show four of these observation
time scales using horizontal bars in Figure 4.2, and we show the corresponding conditional
PDF fS(χs| ζ) with a bell-shaped curve. As can be seen from the figure, for the small
observation time scales we have a conditional PDF with high information content (i.e. the
bell-shaped curve is peaked), while for the large observation time scale, we have an (almost)
non-informative conditional PDF (i.e. the bell-shaped curve is flat). This provides an illustration of
the scale effect captured by our conceptual framework (e.g. Eq. 4.20).
Figure 4.2: Plot showing one of the generated realizations of the TRF X(t). The simulated
values χtrue are shown with a dotted line, the χhard data are represented by circles, and the
ζhard data are represented by crosses. Four observation time scales of the ζhard data are shown
with horizontal bars, and the corresponding conditional PDF are shown with bell shape
curves.
The validation procedure described in section 4.3.2.1 allows us to compare three
estimation methods, which are summarized in Table 4.1. Methods 1 and 2 represent two
attempts to process the data available using the traditional simple kriging method of classical
Geostatistics, which ignores the effect of observation scale. Method 1 only processes χhard,
i.e. it entirely ignores the ζhard data. Method 2 treats both χhard and ζhard as hard data, i.e.
ignores the uncertainty arising from the observation scale of the ζhard data. On the other hand,
method 3 corresponds to our proposed approach, which processes the χhard and χsoft data,
thereby rigorously accounting for the uncertainty associated with the various time scales at
which ζhard is observed.
Table 4.1: Description of the three estimation methods compared in the validation procedure.
            Local scale X data    Large scale Z data
Method 1    χhard                 ignored
Method 2    χhard                 ζhard
Method 3    χhard                 χsoft
Using the validation procedure, we obtain an MSEave, which is the validation MSE averaged
over 20 realizations, for each of the estimation methods considered. As shown in Table 4.2,
method 2 (simple kriging II) has an MSEave that is only
slightly smaller than that of method 1 (simple kriging I). This means that even though simple
kriging II did process the additional information provided by the data observed at various
time scales, the gain in mapping accuracy was modest because the scale effect was ignored.
On the other hand we see that the MSEave of method 3 (BME) is substantially smaller than
that of either method 1 or 2. In fact BME results in a 50.2% MSEave reduction when
compared to method 1, or a 46.7% MSEave reduction when compared to method 2. These
results demonstrate that our proposed approach provides a sound conceptual framework to
model the effect of observation scale, and may in some cases result in a drastic gain of
mapping accuracy over estimation methods that ignore the scale effect.
Table 4.2: MSEave calculated by averaging the validation results obtained over 20 realizations.
             Method 1              Method 2               Method 3
             (simple kriging I)    (simple kriging II)    (BME)
MSEave       2.1474                2.0048                 1.0689
rMSE 1,3 (method 3 relative to method 1): -50.2%
rMSE 2,3 (method 3 relative to method 2): -46.7%
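The relative MSE changes reported in Table 4.2 can be checked directly from the MSEave values. The sketch below assumes that rMSE is the percent change of the MSEave of method 3 relative to that of the reference method, which is the definition that reproduces the tabulated values; the function name `relative_mse_change` is hypothetical.

```python
def relative_mse_change(mse_ref, mse_new):
    """Percent change of mse_new relative to mse_ref (negative means a reduction)."""
    return 100.0 * (mse_new - mse_ref) / mse_ref

# MSEave values taken from Table 4.2.
mse_ave = {"method 1": 2.1474, "method 2": 2.0048, "method 3": 1.0689}

r13 = relative_mse_change(mse_ave["method 1"], mse_ave["method 3"])
r23 = relative_mse_change(mse_ave["method 2"], mse_ave["method 3"])
print(f"rMSE 1,3 = {r13:.1f}%, rMSE 2,3 = {r23:.1f}%")
```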
Further insights are gained by visually inspecting the validation estimates obtained for the
realization of Figure 4.2. Figure 4.3 shows the simulated truth χtrue with a dotted line, and the
χhard and ζhard data with markers. Additionally, the estimated profiles obtained with methods 1,
2 and 3 are shown with a line in Figures 4.3(a), 4.3(b), and 4.3(c), respectively. As can be
seen from the figure, the estimated profile for method 1 goes through the χhard data, but the
mapping accuracy is poor because the ζhard data is entirely ignored. On the other extreme the
estimated profile for method 2 goes through both χhard and ζhard data. While this results in a
modest gain in mapping accuracy, the estimated profile suffers from the fact that the
observation scale of the ζhard data was ignored. Finally, the estimated profile for method 3
(BME) goes through the χhard data, and considers each ζhard datum according to its
observation scale, such that observations at shorter time scales are given more weight than
observations at larger time scales. As can be seen from the figure, this results in an estimated
profile that provides a much more accurate representation of the truth.
Figure 4.3: Plots showing the simulated truth χtrue with a dotted line, the χhard data with
circles, and the ζhard data with crosses. Additionally, lines show the estimated profiles
obtained using (a) method 1, (b) method 2, and (c) method 3 (BME).
4.4. Spatial observation scale: Mathematical formulation and synthetic
case study
4.4.1 Mathematical formulation
4.4.1.1 Non-homogeneous spatial random field
We extend in this section the one dimensional temporal framework to consider two-
dimensional spatial random fields (SRF). Similarly to the one-dimensional case, our aim is to
derive a mathematical formulation for σY2 in the case of SRFs. The local scale SRF X(s)
represents the spatial distribution of the variable X of interest at the spatial location s, where
s=[s1,s2] represents a geographical location. Then Z(s) is defined as an average of X(s) over
the 2-dimensional space domain As (i.e. area), i.e.
$$ Z(s) = \frac{1}{\|A_s\|}\int_{u \in A_s} du\, X(u), \tag{4.23} $$
where As is a geographical area with centroid s. Following the previous development we
define a new spatial random field Y(s’,s) as,
Y(s’,s) = X(s’)- Z(s). (4.24)
In the most general case, non-homogeneous SRFs are characterized by a spatially varying
mean trend function mX(s)=E[X(s)], and a covariance function cX(s, s’) that cannot be
expressed solely as a function of the spatial lag, |s-s’|. In this case we can mathematically
derive starting from Eqs. (4.23) and (4.24) the expected value of Y(s’,s) (see details in
Appendix D), i.e.
$$ E[Y(s',s)] = m_X(s') - \|A_s\|^{-1}\int_{u \in A_s} du\, m_X(u), \tag{4.25} $$
and its variance (see details in Appendix D), i.e.
$$ \sigma_Y^2(s',s) = E[X^2(s')] - 2E[X(s')Z(s)] + E[Z^2(s)] - \left\{ m_X(s') - \|A_s\|^{-1}\int_{u \in A_s} du\, m_X(u) \right\}^2, \tag{4.26} $$
where
$$ E[X^2(s')] = \sigma_X^2(s') + \{m_X(s')\}^2, $$
$$ E[X(s')Z(s)] = \|A_s\|^{-1}\int_{u \in A_s} du\, \{c_X(s',u) + m_X(s')\, m_X(u)\}, $$
$$ E[Z^2(s)] = \|A_s\|^{-2}\int_{u \in A_s} du \int_{u' \in A_s} du'\, \{c_X(u,u') + m_X(u)\, m_X(u')\}. $$
4.4.1.2 Homogeneous spatial random field
Let us consider some special cases where Eq. (4.26) may be simplified. First if we assume
that E[X(s’)] = 0 (e.g., for a mean trend removed SRF) then we have E[Y(s’,s)]=0, and the
first term in the right hand side (RHS) of Eq. (4.26) (i.e. E[X2(s’)]) is equal to σX2.
Furthermore, the second RHS term in Eq. (4.26) reduces to
$$ E[X(s')Z(s)] = \|A_s\|^{-1}\int_{u \in A_s} du\, c_X(s',u). \tag{4.27} $$
Assuming a homogeneous spatial covariance, i.e. cX(s’, s) = cX(|s’-s|), we further expand Eq.
(4.27) (see Appendix D for more details) as
$$ E[X(s')Z(s)] = \|A_s\|^{-1}\int_{r \in A(0)} dr\, c_X(|r - (s'-s)|), \tag{4.28} $$
where A(0) is the 2-D spatial averaging domain centered at the origin (i.e. with a centroid
located at 0). This equation can be integrated numerically for any shape of the averaging
domain A(0). However, a reasonable approximation of the averaging domain A(0) is a circle
of the same area as As, i.e. with a radius R such that πR2=|| As||. Assuming that A(0) is a circle
of radius R, we get (see more details in Appendix D),
$$ E[X(s')Z(s)] = (\pi R^2)^{-1}\int_0^R dr \int_0^{2\pi} d\theta\; r\, c_X\!\left(\sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}\right), \tag{4.29} $$
where s=[s1 s2] and s’=[s1’ s2’].
Using a similar development for the third RHS term of Eq. (4.26) we obtain (see
details in Appendix D)
$$ E[Z^2(s)] = (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\; 2\pi r r'\, c_X\!\left(\sqrt{r^2 + r'^2 - 2rr'\cos\alpha}\right). \tag{4.30} $$
Eqs. (4.26), (4.29) and (4.30) provide a formula for σY2(s’, s) that is valid for any
homogeneous covariance model. Let us now assume that the covariance model is the
superposition of n exponential functions, so that the covariance model can be expressed as
$$ c_X(|s-s'|) = \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(-3|s-s'|/a_{ri}\right), $$
where σXi2 and ari are the variance and spatial range of
each exponential covariance function, respectively. In this case we have
$$ \begin{aligned} \sigma_Y^2(|s'-s|) = \sigma_X^2 &- 2(\pi R^2)^{-1}\int_0^R dr \int_0^{2\pi} d\theta\; r \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(-\frac{3\, d_1(r,s,s',\theta)}{a_{ri}}\right) \\ &+ (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\; 2\pi r r' \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(-\frac{3\, d_2(r,r',\alpha)}{a_{ri}}\right), \end{aligned} \tag{4.31} $$
where $d_1(r,s,s',\theta) = \sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}$ and
$d_2(r,r',\alpha) = \sqrt{r^2 + r'^2 - 2rr'\cos\alpha}$.
Eq. (4.31) is valid for the superposition of any number of exponential models. In the case
of a single exponential covariance model, i.e. n=1, the covariance function is written as
cX(|s-s’|)=σX2 exp(-3|s-s’|/ar), and Eq. (4.31) reduces to
$$ \begin{aligned} \sigma_Y^2(|s'-s|) = \sigma_X^2 &- 2(\pi R^2)^{-1}\int_0^R dr \int_0^{2\pi} d\theta\; r\, \sigma_X^2 \exp\left(-\frac{3\, d_1(r,s,s',\theta)}{a_r}\right) \\ &+ (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\; 2\pi r r'\, \sigma_X^2 \exp\left(-\frac{3\, d_2(r,r',\alpha)}{a_r}\right), \end{aligned} \tag{4.32} $$
where σX2 is the variance of the SRF X(s), and ar is its spatial covariance range. Usually we
seek an X soft datum at the centroid of the Z hard data (i.e. s’ = s, so that the X soft datum is
located at the center of the circular averaging area As). In this case Eq. (4.32) is further
reduced by setting s=s’, i.e.
$$ \sigma_Y^2 = \sigma_X^2 - 4R^{-2}\int_0^R dr\; r\, \sigma_X^2 \exp(-3r/a_r) + (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\; 2\pi r r'\, \sigma_X^2 \exp\left(-\frac{3\, d_2(r,r',\alpha)}{a_r}\right). \tag{4.33} $$
As can be seen from this equation, the variance σY2 describing the uncertainty associated with
the observation scale of a 2-D circular averaging domain is a function of the variance and
spatial range of the SRF X(s), as well as the radius R of the averaging spatial domain
characterizing the observation scale.
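Eq. (4.33) has no simple closed form for its triple integral, but it is straightforward to evaluate numerically. The following is a minimal sketch using a midpoint quadrature rule; the function name `variance_ratio_centroid` and the number of quadrature nodes `n` are assumptions made for illustration, not the numerical scheme used to produce the results of this chapter.

```python
import numpy as np

def variance_ratio_centroid(R, a_r, n=60):
    """Midpoint-rule evaluation of sigma_Y^2 / sigma_X^2 from Eq. (4.33), i.e. for s' = s."""
    # Midpoint nodes and cell widths for r, r' in [0, R] and alpha in [0, 2*pi].
    r = (np.arange(n) + 0.5) * R / n
    alpha = (np.arange(n) + 0.5) * 2.0 * np.pi / n
    dr = R / n
    dalpha = 2.0 * np.pi / n

    # Second RHS term: 4 R^-2 * int_0^R r exp(-3 r / a_r) dr.
    term2 = 4.0 / R**2 * np.sum(r * np.exp(-3.0 * r / a_r)) * dr

    # Third RHS term: triple integral over (r, r', alpha).
    rr = r[:, None, None]
    rp = r[None, :, None]
    al = alpha[None, None, :]
    # Distance d2 between two points of the disk; clip round-off below zero.
    d2 = np.sqrt(np.maximum(rr**2 + rp**2 - 2.0 * rr * rp * np.cos(al), 0.0))
    integrand = 2.0 * np.pi * rr * rp * np.exp(-3.0 * d2 / a_r)
    term3 = np.sum(integrand) * dr * dr * dalpha / (np.pi * R**2) ** 2

    # sigma_X^2 factors out of every term, so the ratio does not depend on it.
    return 1.0 - term2 + term3

ratio_small = variance_ratio_centroid(R=0.01, a_r=1.0)   # R much smaller than a_r
ratio_large = variance_ratio_centroid(R=100.0, a_r=1.0)  # R much larger than a_r
```

The two limiting cases behave as the formula suggests: the ratio tends to 0 when R is very small relative to ar (the averaged value is nearly as informative as a local measurement) and tends to 1 when R is very large (the averaged value carries almost no local information).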
4.4.2 Synthetic case study
4.4.2.1 Synthetic verification of the uncertainty model for spatial observation scale
By extending the procedure of section 4.3.2.1 to the spatial case, we generate multiple
random realizations of Y(s’, s) =X(s’)-Z(s), from which we obtain synthetic estimates of
σY2(|s’-s|) that can be used to verify the value predicted theoretically from Eq. (4.32). The
procedure consists in generating realizations of the SRF X(s) on a fine spatial grid, choosing
a radius R of interest for the observation scale (i.e. the radius of the circular averaging spatial
domain As), and obtaining the realizations of the SRF Z(s) = ∫ u ∈ As duX(u)/|| As|| by
numerical integration of each of the X(s) realizations. We then choose from the spatial grid
two spatial locations sj and si separated by a distance |si- sj| <R/2 of interest, and we select for
each realization k=1,…M the realized value χj(k) for X(sj), and the realized value Ζi(k) for Z(si).
This procedure results in the generation of M random realizations {Ζi(k), χj(k)}, k=1,…M,
from which we obtain the M random realized values ψji(k)=χj(k)−Ζi(k), k=1,…M, for the
random variable Y(sj,si)= X(sj)− Z(si). The synthetic estimate of σY2(|sj-si|) is finally obtained
by the estimator
$$ \hat{\sigma}_Y^2(|s_j - s_i|) = \frac{1}{M}\sum_{k=1}^{M}\left(\psi_{ji}^{(k)} - \hat{E}[Y(s_j,s_i)]\right)^2, \tag{4.34} $$
where $\hat{E}[Y(s_j,s_i)] = \frac{1}{M}\sum_{k=1}^{M}\psi_{ji}^{(k)}$. Using this procedure, we obtain synthetic estimates $\hat{\sigma}_Y^2$
for various values of R and |sj-si|, which we may compare with the theoretical value σY2
predicted by Eq. (4.32). Agreement between the synthetic estimate and theoretical value
provides verification that the conceptual framework proposed provides an adequate model
for the uncertainty associated with spatial observation scale.
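The estimator of Eq. (4.34) translates directly into code. The sketch below assumes that the M realized values χj(k) and Ζi(k) have already been extracted from the simulated realizations and stored in two arrays; the function name `synthetic_variance_estimate` is hypothetical.

```python
import numpy as np

def synthetic_variance_estimate(chi_j, zeta_i):
    """Estimator of Eq. (4.34) from M paired realizations of X(s_j) and Z(s_i)."""
    chi_j = np.asarray(chi_j, dtype=float)
    zeta_i = np.asarray(zeta_i, dtype=float)
    psi = chi_j - zeta_i              # realized values of Y(s_j, s_i)
    mean_psi = psi.mean()             # estimated mean of Y(s_j, s_i)
    # The estimator normalizes by M (not M - 1), as written in Eq. (4.34).
    return float(np.mean((psi - mean_psi) ** 2))

# Small hand-checkable example: psi = [1, 2, 3, 4], mean 2.5, variance 1.25.
example = synthetic_variance_estimate([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
```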
Figure 4.4 shows the plot of σY2/σX2 as a function of R/ar for selected values of |s-s’|/R.
The synthetic estimates σ̂Y2/σX2 obtained from multiple random realizations (Eq. 4.34) are
shown with markers, while the corresponding σY2/σX2 values predicted from theory (Eq. 4.32)
are shown with lines. Similarly to the one-dimensional temporal case, a judicious choice to
construct the X soft datum at some location s’ given a measured value for Z at s is to select
s’=s. As shown in Figure 4.4, there is good agreement between theory and synthetic estimates
when |s-s’|/R=0, which provides support that our conceptual framework adequately models
the uncertainty associated with spatial observation scale. When |s-s’|/R>0 (i.e. |s-s’|=0.4R)
the theoretical values are slightly overestimated relative to the synthetic estimates. This may
be due to the numerical work associated with the calculation of the mathematical formulation
of σY2/σX2, which is computationally more complex for |s-s’|/R>0 (Eq. 4.32) than for |s-
s’|/R=0 (Eq. 4.33).
Figure 4.4: Plot of σY2/σX2 as a function of R/ar for different values of |s- s’|/R. Markers
indicate synthetic estimates obtained from multiple random realizations (Eq. 4.34), while lines
show the values predicted from theory (Eq. 4.32).
As previously noted, the relationship between σY2/σX2 and R/ar shown in Figure 4.4
provides useful insights about the effect of the spatial scale at which observations are made.
As clearly indicated by Figure 4.4, when the spatial observation scale R is very small relative
to the covariance range ar of the local scale SRF X (i.e. when R is smaller than about 0.2 ar)
then an observation made at that spatial scale at point s (i.e. a measured value for Z(s)=∫u ∈ As duX(u)
/ || As||) is highly informative for assessing the process at the local scale at point s’ (i.e. for
assessing X(s’)), provided that s’ is close to s ( i.e. provided that |s- s’|<0.4 R or much less).
4.4.2.2 Quantifying the improvement in mapping accuracy resulting from the integration of
spatial observation scale uncertainty
Validation procedures provide the tools needed to quantify the gain in mapping accuracy that
our proposed approach provides over an approach not accounting for the effect of spatial
observation scale. One validation procedure consists in using a synthetic SRF, while another
consists in using data from a real case study.
In the synthetic validation procedure, we use classical Geostatistical simulation
techniques to generate a realization χtrue =[χ1… χn] of the SRF X(s) observed at the nodes si,
i=1,…, n, of a fine resolution spatial grid. Then, for each node si, we numerically integrate
χtrue over a circular spatial observation domain Asi of radius Ri to obtain the realized value ζi
for the random variable Z(si) =∫u ∈ Asi duX(u) / || Asi||. This results in the generation of the
realization ζtrue =[ζ1… ζn] of the SRF Z(s) observed at observation scales Ri, i=1,…, n.
Hence χtrue represents the (synthetic) truth for the field of interest observed at the local scale,
while ζtrue represents the truth observed at a variety of observation scales. We then randomly
divide the truth χtrue into a validation set χval and a data set χhard, so that χtrue=χval U χhard.
Similarly we randomly select a data set ζhard out of ζtrue. The advantage of the synthetic
validation procedure is that we can select a large n so as to have high statistical power, and
that we can choose arbitrarily any observation scale Ri of interest.
On the other hand, in the real case study, χtrue and ζtrue are obtained from available data
measured at the local scale, and at some observation scale R, respectively. However the
validation procedure from real-case study data suffers from many limitations, including the
fact that n is limited by the number of data available (which may limit statistical power), the
unavoidable measurement errors that introduce an uncontrollable noise between the data
available and the actual truth, and the lack of a mechanism to select observation
scales other than that for which data is available. Notwithstanding these
limitations, we randomly select χhard and χval from χtrue subject to χtrue=χval U χhard, and we
usually obtain ζhard by selecting all of the data ζtrue.
The validation procedure consists in using only the data χhard and ζhard to obtain estimates
χval* of the local scale SRF X(s) at the validation point locations where the truth χval is known.
The validation estimation errors are then simply obtained as the difference εval*=χval-χval*
between true and estimated values, and their mean square error (MSE) provides a measure of
the estimation error of the estimation method used to obtain χval*.
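The validation MSE described above is a one-line computation once the estimates are available. The following minimal sketch assumes the true and estimated values at the validation points are stored in two arrays of equal length; the function name `validation_mse` is hypothetical.

```python
import numpy as np

def validation_mse(chi_val, chi_val_est):
    """Mean square of the validation errors eps_val = chi_val - chi_val_est."""
    errors = np.asarray(chi_val, dtype=float) - np.asarray(chi_val_est, dtype=float)
    return float(np.mean(errors ** 2))

# Hand-checkable example: errors are [0, 0, -2], so the MSE is 4/3.
mse_example = validation_mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```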
In this study we compare the mapping accuracy of 3 different mapping methods. Method
1 consists in the simple kriging (SK) method of classical Geostatistics using only χhard as
hard data. Method 2 also consists in the SK method, but using both χhard and ζhard as hard
data (i.e. ignoring the observation scale uncertainty of the ζhard data). Finally method 3
consists in the BME method proposed in this work, which uses χhard as hard data, and uses
ζhard and the corresponding observation scale R to generate some soft data χsoft in terms of the
conditional PDF fS(χsoft | ζhard, R) (Eqs. 4.11 and 4.33). As a result, method 3 fully
accounts for the observation scale effect, and we compare it with the two extreme classical
approaches that do not account for observation scale: method 1, which ignores ζhard entirely, and
method 2, which treats it as if it were hard data (i.e. as if the observation scale did not
introduce any uncertainty).
It should be noted that the so-called cross-validation procedure, a slight modification of
the validation procedure, is widely used in practice, so we will also use this procedure to
compare methods 1, 2 and 3 in the real case study. In the cross-validation procedure, the ζhard
data remain unchanged, while the validation data χxval correspond to the whole dataset χtrue
available, i.e. χxval =χtrue. Cross-validation estimates χxval* are then obtained by excluding
each validation point in turn, and re-estimating it from the surrounding data. The cross-
validation MSE is finally obtained on the basis of the cross-validation errors εxval*=χxval-χxval*.
Hence the cross-validation procedure provides an additional metric to compare methods 1, 2
and 3.
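The leave-one-out logic of the cross-validation procedure can be sketched as follows for the simplest case of method 1 (simple kriging on the χhard data only). This is an illustration under stated assumptions: a zero-mean SRF, the single exponential covariance model exp(-3h/ar), and hypothetical function names (`exp_cov`, `simple_kriging`, `loo_cross_validation_mse`). It does not include the ζhard or χsoft data, which the full procedure keeps unchanged.

```python
import numpy as np

def exp_cov(h, sill, a_r):
    """Exponential covariance model, sill * exp(-3 h / a_r)."""
    return sill * np.exp(-3.0 * h / a_r)

def simple_kriging(s_data, x_data, s_est, sill, a_r, mean=0.0):
    """Simple kriging estimate at location s_est from data (s_data, x_data)."""
    d_dd = np.linalg.norm(s_data[:, None, :] - s_data[None, :, :], axis=-1)
    d_ed = np.linalg.norm(s_data - s_est, axis=-1)
    # Kriging weights solve C_dd w = c_ed.
    weights = np.linalg.solve(exp_cov(d_dd, sill, a_r), exp_cov(d_ed, sill, a_r))
    return mean + weights @ (x_data - mean)

def loo_cross_validation_mse(s_data, x_data, sill, a_r):
    """Exclude each point in turn, re-estimate it from the others, return the MSE."""
    n = len(x_data)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        estimate = simple_kriging(s_data[keep], x_data[keep], s_data[i], sill, a_r)
        errors[i] = x_data[i] - estimate
    return float(np.mean(errors ** 2))

# Hypothetical data: 15 random locations in a 10 x 10 domain.
rng = np.random.default_rng(1)
s_demo = rng.uniform(0.0, 10.0, size=(15, 2))
x_demo = rng.standard_normal(15)
mse_demo = loo_cross_validation_mse(s_demo, x_demo, sill=1.0, a_r=10.0)
```

Because simple kriging is an exact interpolator, the point being validated must be removed from the data before it is re-estimated; otherwise every cross-validation error would be zero.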
Using the Geostatistical simulation method based on a Cholesky decomposition of the
covariance matrix (Christakos et al., 2002), we generate 20 realizations of the SRF X(s) with
the following prescribed covariance function
$$ c_X(l = |s'-s|) = \sigma_X^2 \exp\left(-\frac{3l}{a_r}\right), \tag{4.35} $$
where the variance of X(s) is σX2= 0.006, and ar = 10. Each realization consists of the vector χtrue
=[χ1… χn] simulating the value of the SRF X(s) at the nodes of a dense spatial grid. We
select from this simulated truth a subset of data χhard representing local scale measurements
of X(s). We additionally obtain from χtrue a set ζhard of observations at varying spatial scales
Rj, j=1,…,n . Each ζhard datum is obtained by numerically integrating the truth χtrue over a
circular averaging domain of radius equal to its spatial observation scale Rj. Finally, on the
basis of the ζhard data and the associated observation spatial scales, we construct the
conditional PDFs fS(χs| ζ) that constitute the soft data χsoft for our proposed BME estimation
approach.
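The simulation steps described above can be sketched as follows: a Cholesky factor of the covariance matrix built from Eq. (4.35) turns independent standard normal draws into one correlated realization, and a ζ datum is obtained by averaging the realization over the grid nodes that fall inside a disk. The 20 x 20 grid, the random seed, the disk location and radius, and the function names are assumptions chosen for illustration; they are not the settings used in this study.

```python
import numpy as np

def simulate_srf_cholesky(coords, sill, a_r, rng, jitter=1e-10):
    """One zero-mean Gaussian realization with covariance sill * exp(-3 h / a_r)."""
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cov = sill * np.exp(-3.0 * h / a_r)
    # A small diagonal jitter guards against round-off in the factorization.
    lower = np.linalg.cholesky(cov + jitter * sill * np.eye(len(coords)))
    return lower @ rng.standard_normal(len(coords))

def disk_average(coords, values, center, radius):
    """Discrete approximation of Z(s): mean of the grid values inside the disk."""
    inside = np.linalg.norm(coords - center, axis=-1) <= radius
    return values[inside].mean()

rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(0.0, 20.0), np.arange(0.0, 20.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])      # 400 grid nodes

# Parameters of Eq. (4.35): sigma_X^2 = 0.006 and a_r = 10.
chi_true = simulate_srf_cholesky(coords, sill=0.006, a_r=10.0, rng=rng)

# One zeta datum observed at a (hypothetical) observation scale R = 3.
zeta = disk_average(coords, chi_true, center=np.array([10.0, 10.0]), radius=3.0)
```

Averaging over grid nodes is only a discrete approximation of the integral defining Z(s); its accuracy depends on how fine the grid is relative to the radius of the averaging domain.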
For illustration purposes, one of the generated realizations χtrue is shown in the contoured
map of Figure 4.5, along with the χhard data points represented by stars, and the ζhard data
points represented by triangles. For three of the ζhard data points, we show the corresponding
circular averaging domain with a radius equal to their spatial observation scales. As
illustrated in Figure 4.5, the observation spatial scale is not constant across the ζhard data,
which corresponds to a realistic situation where data might be obtained at varying
observation scales (e.g. for data collected at different administrative aggregation levels, such
as zip codes, counties, etc.).
Figure 4.5: Contoured map showing one of the generated realizations of the SRF X(s), along
with the location of the χhard data points (stars), and the ζhard data points (triangles). The
circular averaging domains for three of the ζhard data points are shown with a radius equal to
their spatial observation scales.
Using the validation procedure, we obtain an MSEave which is the validation MSE
averaged over 20 realizations for the three estimation methods considered, i.e. method 1
(simple kriging I), method 2 (simple kriging II), and method 3 (BME). As shown in Table 4.3,
method 2 (simple kriging II) has an MSEave that is smaller than that
of method 1 (simple kriging I). This is explained by the fact that simple kriging II processes
the additional information provided by the data observed at various spatial scales, resulting in
a gain in mapping accuracy. However simple kriging II does not account for the effect of
observation scale. On the other hand we see that the MSEave of our proposed BME method
(method 3), which rigorously accounts for the scale effect, is substantially smaller than that
of either method 1 or 2. In fact BME results in a 41.8% MSEave reduction when compared to
method 1, or a 30.2% MSEave reduction when compared to method 2. These results
demonstrate that our proposed approach leads on the average to a substantial gain of mapping
accuracy over estimation methods that ignore the scale effect.
Table 4.3: MSEave calculated by averaging the validation results obtained over 20 realizations.
             Method 1              Method 2               Method 3
             (simple kriging I)    (simple kriging II)    (BME)
MSEave       0.001255              0.001046               0.000730
rMSE 1,3 (method 3 relative to method 1): -41.8%
rMSE 2,3 (method 3 relative to method 2): -30.2%
The validation results presented so far were obtained for the mapping situation depicted
in Fig. 4.5 where the ζhard data points correspond to 8% of the χtrue grid points, which
corresponds to a realistic mapping situation. In order to obtain a visual comparison between
the estimation methods, we now consider a mapping situation where the ζhard data points
correspond to 45% of the χtrue grid points. The simulated truth is shown in Figure 4.6(a),
while the estimates obtained with method 1, method 2, and method 3 are shown in Figure
4.6(b), 4.6(c) and 4.6(d), respectively. As can be seen from these maps, method 1 captures
the dominant features of the spatial distribution of X(s) thanks to the information provided by the
χhard data; however, the mapping accuracy is poor because the ζhard data is entirely ignored.
The estimation method 2 provides another extreme by processing both χhard and ζhard as hard
data, thereby ignoring the uncertainty associated with the observation scale of the ζhard data.
This results in a map (Figure 4.6c) with many more fine resolution details, but of poor
mapping accuracy, as is apparent by comparing this map with the simulated truth. Finally,
the map of our proposed approach (method 3) shown in Figure 4.6(d) provides a much more
accurate representation of the truth, as can be seen by comparing it with the simulated truth.
In fact for this realization of the simulated truth, our proposed BME method results in a
77.6% MSE reduction when compared to method 1, or a 70.4% MSE reduction when
compared to method 2. This demonstrates that while our proposed method results on average
in a substantial gain of mapping accuracy over classical approaches, the gain in mapping
accuracy can be drastic for some specific mapping situations.
Figure 4.6: Maps of the simulated truth (a), compared to maps obtained with (b) method 1
using χhard as hard data, (c) method 2 using both χhard and ζhard as hard data, and (d) method 3
corresponding to our proposed BME method accounting for the effect of observation scale.
Next, we consider the spatial estimation of asthma symptom prevalence among the children
of North Carolina. This case study involves the development and implementation of the
mathematical framework outlined above, and its application to a real case study in North
Carolina. The data used in this real case study are the combination of two datasets, each
collected at a different spatial scale.
4.5. Mapping the childhood asthma prevalence across North Carolina
using data collected at different spatial observation scales
4.5.1. Introduction
Asthma is an inflammatory disease characterized by symptoms that include wheezing,
coughing, breathlessness, and chest tightness (Clark et al., 1999; Lane and Edwards, 2003). It
is known as the most common chronic childhood disease (Zmirou et al., 2004; Lewis et al.,
2005; Freeman et al., 2003; Gergen et al., 1988). Approximately 12.7% of all children (Lane
and Edwards, 2003), and about 10 million children under the age of 16 (Clark et al., 1999) in
the United States currently suffer from asthma symptoms. The estimated cost of treating
asthma in children younger than 18 years of age is $3.2 million per year (Weiss et al., 2000).
Some risk factors responsible for exacerbating asthma symptoms in children include
tobacco smoke, dust mite and cockroach allergens, pet dander, and household molds (Sturm
et al., 2004).
The association between air pollution exposure (e.g. PM, O3, SO2, and NO2) and
asthma prevalence has been extensively investigated (EPA Criteria pollutant document;
Clark et al., 1999; Lewis et al., 2005). While air pollutants have clearly been associated with
exacerbations of asthma (including increased symptoms, Emergency Room (ER) visits,
hospitalizations, and medication use), the association of air pollutants and increased asthma
incidence is less clear (Clark et al., 1999). However, a recent study showed an association
between asthma incidence and children exercising in high ozone areas (McConnell et al.,
2002). Furthermore, a study by Zmirou et al. (2000) investigating the association between
traffic related air pollutants and the incidence of asthmatic symptoms in children suggests that air
pollutants might be a potential contributor to increasing asthma prevalence in children.
Africans and Hispanic-Americans have a higher susceptibility to develop asthma than other
populations (Freeman et al., 2003). Individuals who regularly experience asthma symptoms
(Clark et al., 1999) and individuals who smoke (Sturm et al., 2004) are also regarded as
population groups susceptible to the adverse health effects of asthma.
In their work, White et al. (1994) suspect that the increase of asthma symptoms is
attributable to air pollution, and performing a reasonable analysis of their association is still
an emerging field. This naturally leads to the need to map the distribution of asthma
prevalence across space. Indeed highly informative asthma maps provide invaluable spatial
information that allows epidemiologists to better understand risk factors that may cause
asthma, such as air pollutants, and help identify susceptible subpopulations, such as
individuals with particular pre-existing health conditions and/or with specific smoking
behavior and socioeconomic characteristics, etc. Additionally, better asthma maps are helpful
for public health intervention, not only by identifying areas of high prevalence in which to target
health treatment facilities for susceptible populations, but also by identifying areas in which to
focus efforts on abating suspected causal agents that can be controlled.
Geostatistics provides epidemiologists with an essential spatial estimation tool that accounts for
the inherent high spatial variability of asthma prevalence, and the map it produces provides a
graphical representation of reality that is extremely useful for health research. However, few
studies on mapping asthma have been published, and existing works are mainly limited to an
exploratory visualization of existing asthma prevalence data obtained at a single observation
scale (Hernandez et al., 2000; Oyana and Lwebuga-Mukasa, 2004).
There are a variety of data sources providing asthma prevalence data that can be used in a
mapping analysis. The asthma data can be collected in a number of ways, including random
telephone surveys, questionnaire-based surveys, hospital discharge records, Medicaid claims,
etc. However what is notable is the spatial aggregation scale, or observation scale, at which
the data is reported, which may vary considerably from one data source to another.
One important reason for the difference in observation scale between data sources is that
some data sources may have confidentiality requirements that only allow them to release data
aggregated over large spatial scale (e.g. county level) in order to protect the privacy of the
individuals who provided their health information. For example the childhood asthma
Medicaid claim data analyzed by Buescher et al. (1999) is aggregated at the county level,
which is a large spatial observation scale providing a strong protection of individual privacy
and preventing deductive disclosure. Medicaid claims provide a cost effective source of
information. Claims data are cost effective because they are derived from a health system
that is already in place. However, it is not clear how well Medicaid claims data perform in the
estimation of asthma prevalence at a fine spatial scale. Another source of information is the
asthma data obtained from a one-time school asthma surveillance project (the North Carolina
School Asthma Survey, or NCSAS), which collected high quality asthma prevalence data at a fine
spatial resolution. The NCSAS database provides good quality asthma prevalence estimates
for the majority of middle schools in North Carolina, which corresponds to an observation
scale that is much smaller than that of the Medicaid data reported at the county level. As a
result, our goal is to perform an accurate mapping analysis of asthma symptom prevalence
that rigorously accounts for the high natural variability of asthma prevalence across space,
while also efficiently integrating data collected at different observation scales. Integrating
large observation scale data to obtain good estimates of asthma prevalence at a fine spatial
resolution would lead to substantial cost savings in North Carolina because it would
enable state health departments to efficiently use data from existing systems such as
Medicaid, which would reduce the need to conduct additional costly surveillance of asthma.
Our aim in this work is to develop a conceptual mapping framework that integrates
asthma data obtained at different spatial observation scales, and to apply this framework to
improve the accuracy of maps of the childhood asthma prevalence. The framework we
develop is a novel application of the Bayesian Maximum Entropy (BME) theory of modern
Geostatistics, where we formally account for the uncertainty associated with the various
spatial observation scales corresponding to the prevalence data available. Insight is gained by
comparing the map we produce with classical maps obtained by using only data at one
observation scale, or by disregarding the scale effect. We find that by formally accounting
for the observation scale of asthma prevalence data, the map we obtain is substantially more
accurate than classical maps, leading to a more realistic representation of the spatial
distribution of the asthma prevalence among children across North Carolina, which will be
useful for epidemiologists and public health officials to plan targeted intervention efforts.
4.5.2. Theory
4.5.2.1. A review of the BME method for the mapping analysis of the childhood asthma
prevalence
In this work, the variable we are dealing with is the prevalence of asthma among children.
What we usually measure is the prevalence of the cardinal symptom of asthma (wheezing)
among children; however we will assume that the asthma symptom selected provides an
adequate observable outcome to measure the prevalence of asthma among children, and we
will refer to it as the childhood asthma prevalence. This prevalence is distributed across a
two-dimensional spatial domain, and it is defined as the count of children found to have the
asthma symptom of interest divided by the number of children surveyed over some spatial
region As (i.e. area), where the subscript s=[s1,s2] is the spatial location of the centroid for As.
The spatial region As over which the prevalence is observed has a spatial scale R
corresponding to the radius of a circle of the same surface area as As, i.e. R = (||As||/π)^0.5 is the
spatial observation scale of the prevalence.
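As a small illustration of this definition, the observation scale of a reporting region can be computed from its surface area. The function name `observation_scale` is hypothetical, and the area is assumed to be expressed in the square of the length unit used for R.

```python
import math

def observation_scale(area):
    """Radius R of the circle with the same surface area as the region A_s."""
    return math.sqrt(area / math.pi)

# A region whose area equals pi has an observation scale of exactly 1.
r_unit = observation_scale(math.pi)
```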
We define X(s) as a spatial random field (SRF) (Christakos, 1992) representing the
childhood asthma prevalence at the local scale, i.e. observed at an infinitely small spatial
scale. When restricting our attention to a set of n mapping spatial points smap=[s1, s2,…, sn],
the SRF reduces to a vector of random variables xmap=[X(s1), X(s2),…, X(sn)]. The SRF
describes the uncertainty and variability of the spatial distribution of the local scale
prevalence by means of an ensemble of realizations χmap =[χ1, χ2 , …, χn] of the random
vector xmap. The probability of a given realization χmap is calculated from the multivariate
probability density function (PDF) fX(.) of the SRF X(s) as follows
Prob[χ1 < x1 <χ1+dχ1,…, χn < xn < χn+dχn] = fX(χmap) dχ (4.36)
where Prob[.] is a probability operator. Hence the multivariate PDF fX(.) provides a complete
stochastic description of the SRF X(s) at the mapping points smap.
At the structural stage of BME analysis we use a maximum entropy information
processing rule (Christakos 2000) to obtain the multivariate PDF of X(s) on the basis of its
mean trend characterizing systematic trends in X(s)
mX(s) = E[X(s)], (4.37)
and covariance function characterizing spatial correlation between any pairs of points in X(s)
cX(s, s’) = E[ (X(s)-mX(s)) (X(s’)-mX(s’)) ], (4.38)
where E[.] is a stochastic expectation operator. Eqs. (4.37) and (4.38) constitute a general
knowledge base G from which the structural PDF obtained by maximizing entropy is
(Christakos, 2000)
fG(χmap) = φ(χmap; mmap, cmap), (4.39)
where φ(.) is the multivariate Gaussian PDF with mean vector mmap and covariance matrix
cmap calculated at the mapping points from Eqs. (4.37) and (4.38), respectively. This
structural PDF will serve as the prior PDF for the Bayesian updating performed at the
integration stage of the BME analysis.
At the specificatory stage of the BME analysis we assess and statistically describe the
data available for the childhood asthma prevalence. Hard data corresponds to exact measured
prevalence values χhard obtained at spatial points shard defined such that
Prob[ X(shard) =χhard] = 1. (4.40)
On the other hand, the soft data at spatial points ssoft correspond to observed values with an
associated uncertainty that can be characterized statistically by the so-called soft PDF fS(χsoft)
defined as (Christakos et al., 2001; Christakos and Serre, 2000a; Serre et al., 2005)
$$\mathrm{Prob}[X(s_{soft}) < u] = \int_{-\infty}^{u} d\chi_{soft}\, f_S(\chi_{soft}). \qquad (4.41)$$
At the integration stage of the BME analysis, a Bayesian conditionalization information
processing rule is applied to update the prior PDF with the site-specific knowledge base S,
which yields the posterior PDF fK(χk) describing the childhood asthma prevalence xk=X(sk)
at any estimation point sk (Christakos, 2000)
fK(χk) = A^-1 ∫ dχsoft fS(χsoft) fG(χk, χhard, χsoft), (4.42)
where A is a normalization coefficient. The posterior PDF provides a full stochastic
assessment of xk, from which we can obtain an appropriate estimated prevalence (such as the
expected value of the posterior PDF), as well as an assessment of the associated prevalence
uncertainty (such as the variance of the posterior PDF).
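The integration stage of Eq. (4.42) can be sketched numerically for a toy case. The Python example below is an illustration under stated assumptions, not the implementation used in this work: it considers one estimation point and one soft datum, no hard data, a zero-mean bivariate Gaussian prior, and a Gaussian soft PDF, and it evaluates the integral on a regular grid. The function names and all numerical values are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian PDF."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def bme_posterior_one_soft(chi_k, chi_soft, prior_cov, soft_mean, soft_var):
    """Evaluate Eq. (4.42) on a grid: one soft datum, no hard data, zero prior mean."""
    prec = np.linalg.inv(np.asarray(prior_cov, float))
    K, S = np.meshgrid(chi_k, chi_soft, indexing="ij")
    # Unnormalised bivariate Gaussian prior f_G(chi_k, chi_soft)
    f_g = np.exp(-0.5 * (prec[0, 0] * K**2 + 2 * prec[0, 1] * K * S + prec[1, 1] * S**2))
    f_s = gaussian_pdf(chi_soft, soft_mean, soft_var)   # soft PDF f_S
    d_s = chi_soft[1] - chi_soft[0]
    d_k = chi_k[1] - chi_k[0]
    f_k = (f_g * f_s[None, :]).sum(axis=1) * d_s        # integrate over chi_soft
    return f_k / (f_k.sum() * d_k)                      # normalisation (role of A)

grid = np.linspace(-8.0, 8.0, 801)
f_k = bme_posterior_one_soft(grid, grid, [[1.0, 0.6], [0.6, 1.0]],
                             soft_mean=1.0, soft_var=0.5)
d_k = grid[1] - grid[0]
post_mean = np.sum(grid * f_k) * d_k                    # BME mean estimate
post_var = np.sum((grid - post_mean) ** 2 * f_k) * d_k  # posterior variance
print(post_mean, post_var)
```

For this Gaussian toy case the grid result can be checked against standard Gaussian conditioning, which gives a posterior mean of 0.6/(1+0.5)=0.4 and a posterior variance of 1-0.36/1.5=0.76.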
4.5.2.2. Conceptual framework for the uncertainty associated with the observation scale of
the childhood asthma prevalence
We define the observed value of X(s) over the observation region As as the SRF Z(s) given by
the following equation
Z(s) = ∫u∈As du X(u) / ||As||. (4.43)
In other words, Z(s) corresponds to an observation of X(s) at a spatial scale R=(As/π)^0.5. In
order to analyze the relationship between the local scale SRF X(s) and the As–scale SRF Z(s),
we define the random field Y(s) as
Y(s) = X(s)-Z(s). (4.44)
Eq. (4.44) can also be written as X(s)=Z(s)+Y(s), indicating that when assessing X(s), Y(s)
acts as an additive error term to the value Z(s) observed at scale As. It follows that the
conditional PDF of X(s) given an observed value ζ for Z(s) is
fS(χs| ζ) = fY (χs-ζ ), (4.45)
where fY is the PDF for Y(s). Assuming that the local scale prevalence SRF X(s) is
normally distributed, we obtain that Z(s) and Y(s) are also
normally distributed (Lee, 2005, p. 104). Then the PDF for Y(s) is given by
fY(ψ)=φ(ψ;mY,σY²), where φ(.) is the Gaussian PDF completely defined by its mean
mY=E[Y(s)] and variance σY². Inserting fY(ψ)=φ(ψ;E[Y(s)],σY²) in Eq. (4.45), we obtain after
a change of variable
fS(χs|ζ) = φ(χs; E[Y(s)]+ζ, σY²). (4.46)
Eq. (4.46) provides a probabilistic soft datum for the local scale X at point s given an As-scale
observed value for the prevalence at point s. This soft datum for the local scale prevalence is
rigorously processed by the BME method for the mapping analysis of local scale prevalence,
which constitutes our proposed approach to integrate prevalence data at any spatial
observation scale in the mapping estimation of local scale prevalence. The problem then
becomes that of obtaining E[Y(s)] and σY² for different spatial observation scales As of
interest.
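The construction of the soft datum in Eq. (4.46) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the county-scale value and error variance used in the example are made-up numbers, not values from this study.

```python
import math

def soft_datum_from_scale(zeta, var_y, mean_y=0.0):
    """Mean and variance of the Gaussian soft PDF of Eq. (4.46)."""
    if var_y <= 0.0:
        raise ValueError("var_y must be positive")
    return mean_y + zeta, var_y

def soft_pdf(chi, zeta, var_y, mean_y=0.0):
    """Evaluate f_S(chi | zeta) = phi(chi; E[Y] + zeta, var_y)."""
    mean, var = soft_datum_from_scale(zeta, var_y, mean_y)
    return math.exp(-0.5 * (chi - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Hypothetical county-scale observation and scale-error variance
zeta, var_y = 0.10, 0.0015
for chi in (0.05, 0.10, 0.15):
    print(chi, soft_pdf(chi, zeta, var_y))
```

The soft PDF is centered on the observed value ζ when E[Y(s)]=0, and it widens as the error variance σY² grows with the observation scale.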
We consider the class of homogeneous SRFs X(s) with a zero mean trend and a
covariance model corresponding to the superposition of n exponential functions. This class
of SRFs provides without loss of generality a good representation of the spatial distribution
of the local scale childhood asthma prevalence. We mathematically derive the expected
value of Y(s) as E[Y(s)]=0 and its variance (Lee, 2005, pp. 120-124) as
$$\sigma_Y^2 = \sigma_X^2 - 4R^{-2}\int_0^R dr\, r \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\!\left(\frac{-3r}{a_{ri}}\right) + (\pi R^2)^{-2}\int_0^R\!\int_0^R\!\int_0^{2\pi} dr\, dr'\, d\alpha\; 2\pi\, r\, r' \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\!\left(\frac{-3\, d_2(r,r',\alpha)}{a_{ri}}\right), \qquad (4.47)$$
where d2(r,r',α)=(r²+r'²−2rr'cosα)^0.5 is the distance between two points of the averaging
domain, σXi² and ari are the variance and spatial range,
respectively, of each exponential covariance function, σX² is the variance of the SRF X(s),
and R is the observation spatial scale obtained as R=(As/π)^0.5. As can be seen from this
equation, the variance σY2 describing the uncertainty associated with the observation scale of
a 2-D circular averaging domain is a function of the variance and spatial ranges of the SRF
X(s), as well as the radius R of the averaging spatial domain characterizing the observation scale.
The linear kriging method of classical Geostatistics simply combines observed values of
X(s) and Z(s) to estimate X(s) at unsampled locations without any consideration of the scale
effects. By contrast our proposed BME mapping method uses Eq. (4.47) to generate soft data
for X(s) from observations obtained at various spatial scales.
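Eq. (4.47) has no simple closed form, but it can be approximated numerically. The Python sketch below uses a midpoint quadrature rule and reads the three terms of the equation as Var[X], -2Cov[X(s),Z(s)] and Var[Z(s)] for a circular averaging domain. The function name, the quadrature rule and the number of nodes are assumptions of this sketch and not the implementation used in this work; the covariance parameters in the example are those reported with the covariance model in Section 4.5.4.1.

```python
import numpy as np

def scale_error_variance(R, sills, ranges, n=60):
    """Midpoint-rule approximation of Eq. (4.47) for a circular domain of radius R."""
    sills = np.asarray(sills, float)
    ranges = np.asarray(ranges, float)

    def cov(h):
        # Superposition of exponential covariances: sum_i sill_i * exp(-3 h / range_i)
        h = np.asarray(h, float)[..., None]
        return np.sum(sills * np.exp(-3.0 * h / ranges), axis=-1)

    dr = R / n
    r = (np.arange(n) + 0.5) * dr              # midpoints on [0, R]
    da = 2 * np.pi / n
    a = (np.arange(n) + 0.5) * da              # midpoints on [0, 2*pi]

    # Second term of Eq. (4.47): 4 R^-2 * int_0^R r c(r) dr
    cross = 4.0 / R**2 * np.sum(r * cov(r)) * dr

    # Third term: (pi R^2)^-2 * triple integral of 2 pi r r' c(d2(r, r', alpha))
    rr, rp, aa = np.meshgrid(r, r, a, indexing="ij")
    d2 = np.sqrt(np.maximum(rr**2 + rp**2 - 2 * rr * rp * np.cos(aa), 0.0))
    var_z = np.sum(2 * np.pi * rr * rp * cov(d2)) * dr * dr * da / (np.pi * R**2) ** 2

    return sills.sum() - cross + var_z

# County-scale radius and covariance parameters reported in this chapter
var_y = scale_error_variance(20.8, sills=[0.9 * 0.0055, 0.1 * 0.0055], ranges=[89.6, 448.0])
print(var_y)
```

A useful sanity check of the sketch is that the error variance vanishes when the spatial range is very large compared with R, because X(s) is then nearly constant over the averaging domain.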
4.5.2.3. Quantifying the improvement in the mapping accuracy of the childhood asthma
prevalence resulting from the integration of spatial observation scale uncertainty
Validation procedures provide the tools needed to quantify the gain in mapping accuracy that
our proposed approach provides over an approach not accounting for the effect of spatial
observation scale when mapping the childhood asthma prevalence. Let χtrue and ζtrue denote
the available data measured at the local scale, and at some observation scale R, respectively.
We randomly select χhard and χval from χtrue subject to χtrue=χval ∪ χhard, and we usually
obtain ζhard by selecting all of the data ζtrue. The validation procedure consists in using only the data
χhard and ζhard to obtain estimates χval* of the local scale SRF X(s) at the validation point
locations where the truth χval is known. The validation estimation errors are then simply
obtained as the difference εval*=χval-χval* between true and estimated values, and their mean
square error (MSE) provides a measure of the estimation error of the estimation method used
to obtain χval*.
In this study we compare the mapping accuracy of 3 different mapping methods. Method
1 consists in the simple kriging (SK) method of classical Geostatistics using only χhard as
hard data. Method 2 also consists in the SK method, but using both χhard and ζhard as hard
data (i.e. ignoring the observation scale uncertainty of the ζhard data). Finally, method 3
consists in the BME method proposed in this work, which uses χhard as hard data, and uses
ζhard and the corresponding observation scale R to generate some soft data χsoft in terms of the
conditional PDF fS(χsoft | ζhard, R) (Eqs. 4.46 and 4.47). As a result our Method 3 fully
accounts for the observation scale effect, which is compared to the two extreme classical
approaches not accounting for observation scale: Method 1 which ignores ζhard entirely, and
method 2 which treats it as if it was hard data (i.e. as if the observation scale was not
introducing any uncertainty).
It should be noted that the so-called cross-validation procedure is a slight modification of
the validation procedure that is widely used in practice, so we will also use this procedure to
compare methods 1, 2 and 3. In the cross-validation procedure, the ζhard data remain
unchanged, while the validation data χxval correspond to the whole available dataset χtrue, i.e.
χxval=χtrue. Cross-validation estimates χxval* are then obtained by excluding each
validation point in turn and re-estimating it from the surrounding data. The cross-validation MSE
is finally obtained on the basis of the cross-validation errors εxval*=χxval-χxval*. Hence the
cross-validation procedure provides an additional metric to compare methods 1, 2 and 3.
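The leave-one-out logic of the cross-validation procedure can be sketched generically in Python. In this sketch the estimator is passed in as a function, so that any of the three methods could be plugged in (with the ζhard data held fixed inside the estimator). The function names are hypothetical, and the estimator used in the example is a deliberately trivial stand-in rather than kriging or BME.

```python
import numpy as np

def cross_validation_mse(coords, values, estimator):
    """Leave-one-out MSE: remove each datum in turn and re-estimate it from the rest."""
    coords = np.asarray(coords, float)
    values = np.asarray(values, float)
    errors = np.empty(len(values))
    for i in range(len(values)):
        keep = np.arange(len(values)) != i
        estimate = estimator(coords[keep], values[keep], coords[i])
        errors[i] = values[i] - estimate       # true value minus estimate
    return float(np.mean(errors ** 2))

def mean_of_remaining(xy, z, xy0):
    """Hypothetical stand-in estimator: the average of the remaining data."""
    return float(np.mean(z))

coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
values = [0.0, 1.0, 2.0]
print(cross_validation_mse(coords, values, mean_of_remaining))
```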
4.5.3. Data
We have obtained two datasets with data on the childhood asthma prevalence across North
Carolina. The first dataset was based on a middle school based survey using questionnaires,
while the second dataset used Medicaid claim data. Another source of asthma data is
available from the North Carolina Behavioral Risk Factor Surveillance System (BRFSS)
collected using a random telephone survey. However this state-level asthma dataset includes
no stratification by age of children, and was therefore not used in this work.
4.5.3.1. The North Carolina School Asthma Survey database
The first dataset consists in children asthma health outcomes collected as a part of the North
Carolina School Asthma Survey (NCSAS) (Yeatts et al., 2004; Sturm et al., 2004). The
NCSAS is a collaborative program between the North Carolina Department of Health and
Human Services, the North Carolina Department of Public Instruction, and the Department of
Epidemiology in the University of North Carolina at Chapel Hill. This survey collected
information on the breathing status of students enrolled in public 7th and 8th grades (i.e. age
of 13-14) in the 1999-2000 academic school year. 565 public middle schools (for a total of
192,248 enrolled students) were asked to participate in the survey, leading to the
participation of 499 schools in the survey. We obtained data from approximately 128,556
students (i.e. 66.9% of the student population) in 493 schools (i.e. 87.3% of the school
population) regarding the prevalence of asthma symptoms among the children of North
Carolina.
The NCSAS questionnaire included internationally standardized and validated questions
from the International Survey of Asthma and Allergies in Childhood (ISAAC) consisting of
written and video types of questions. While the NCSAS provides several relevant asthma
variables for each student, the variable we used, named “current wheezing symptom”, which
characterizes the occurrence of asthma, was recorded as a value of 1 for children who said
“yes” to any one of four video questions describing 1) wheezing during the day, 2) wheezing
induced by exercise, 3) wheezing at night, or 4) a severe wheezing attack. Using this variable,
we calculated for each of the 493 schools the asthma prevalence among children by dividing the
number of children who answered yes by the total number of students surveyed in that school.
For illustration purposes, we show in Figure 4.7(a) a graduated color plot of the childhood
asthma prevalence data obtained from this dataset.
Because of the almost-exhaustive nature and the good data quality of the NCSAS dataset,
the data it provides on the prevalence of asthma symptoms among children enrolled in public
7th and 8th grades in North Carolina can reasonably be considered exact measurements of
the childhood asthma prevalence. Furthermore, the observation scale for this prevalence
data corresponds to that of middle schools, which have a very small geographical extent
relative to that of, for example, a county. Indeed half of the average distance between
schools in North Carolina and their closest neighbor is approximately 3 kilometers (km), so
that for the average school the maximum distance that children travel to go to school is
on the order of 3 km. Since the children population is generally clustered around schools, the
median travel distance to school must be much less than its maximum of 3 km, on the order
of a fraction of a kilometer. If we add the fact that children do spend a portion of
their day on the premises of the school itself, we can safely conclude that the NCSAS data
obtained at the school observation scale can reasonably be conceptualized as providing exact
measurements of the childhood asthma prevalence observed at the local scale, i.e. this dataset
provides hard data for the SRF X(s).
4.5.3.2. The county-level database of Medicaid-enrolled children suffering from asthma
Buescher et al. (1999) published a document including data on Medicaid claims due to asthma
in North Carolina during the state fiscal year 1997-1998. The number of childhood asthma
cases in each county was recorded by counting the Medicaid-enrolled children of age 0 to 14
who suffered from asthma. According to the study report, the Medicaid-enrolled children
suffering from asthma were identified on the basis of paid Medicaid claims with a diagnosis
of asthma as well as with prescription drugs used for treating asthma. They then obtained the
fraction of Medicaid-enrolled children suffering from asthma for each of the 100 counties in
North Carolina by dividing the number of Medicaid-enrolled children with asthma claims by
the total number of Medicaid-enrolled children in each county. The location we assign
for each of these fractions is the centroid of the county for which the fraction is calculated,
and we show these data in Figure 4.7(b) using a graduated color plot.
The average land area for counties in North Carolina is 1363.9 km², which corresponds to
a radius of about 20.8 km if we assume that counties can be approximated with circles of the same
surface area. This spatial scale of about 20.8 km is substantially larger than that of the
NCSAS data collected at the school level, which as discussed above is believed to be on the
order of a fraction of a kilometer. This statement is also strengthened by the fact that
most children live close to their school, with few children living far from their school,
whereas Medicaid-enrolled children can be assumed to have a much more uniform spatial
distribution across the whole county. Therefore we define the fraction of Medicaid-enrolled
children with asthma in a particular county as a measurement of the SRF Z(s) observed at the
county spatial scale. In other words we conceptualize the Medicaid data shown in Figure
4.7(b) as being observations of the local scale childhood asthma prevalence (the NCSAS data
shown in Figure 4.7a) averaged at the county spatial scale. Indeed, as can be seen from
Figure 4.7, the Medicaid data are smoother than the NCSAS data, which is consistent with
our hypothesis that one corresponds to the aggregation of the other at a larger spatial scale.
However a limitation of the Medicaid dataset for the inference of the childhood asthma
prevalence is that the Medicaid-enrolled children population is only a subgroup of the total
children population, and biases may exist at the local scale. Furthermore the Medicaid data
was obtained in 1997-1998 while the NCSAS was obtained in 1999-2000. Nevertheless we
hypothesize that the local-scale deviations in asthma prevalence between the Medicaid and
NCSAS datasets average out at the county spatial scale. As will be shown in our
cross-validation results, when the scale effect is accounted for, the Medicaid data do improve
the estimation of the asthma prevalence reported in the NCSAS dataset, which supports our
hypothesis that the Medicaid data provide an adequate measurement of NCSAS asthma
prevalence aggregated at the county scale.
As a result our aim is now to estimate the spatial distribution of the (local scale)
childhood asthma prevalence X(s) using the NCSAS dataset providing exact measurements
of X(s) at the location of 493 schools in North Carolina, and the Medicaid dataset providing
(almost) exact measurements of the county-scale Z(s) at the centroid of 100 counties across
North Carolina.
Figure 4.7: Maps showing (a) the data on asthma symptom prevalence among middle school
children (age 13-14) reported in the NCSAS database for most NC schools, and (b) the
county-level asthma prevalence data extracted from the database of Medicaid-enrolled
children age 0-14 years who suffered from asthma. The prevalence is expressed as a fraction
(i.e. average childhood asthma cases per child) according to the color bar next to each map.
4.5.4. Results
4.5.4.1. Trends and variability in the spatial distribution of local scale asthma prevalence
among children
The SRF X(s) represents the distribution across space of the prevalence of asthma among
children observed at the local scale. Its mean trend function mX(s) (Eq. 4.37) provides a
model for the systematic trends and consistent spatial structures of the childhood asthma
prevalence across space, while its covariance function cX(s,s’) (Eq. 4.38) describes the
inherent spatial variability of the childhood asthma prevalence.
As discussed in the data section, it is reasonable to use each NCSAS datum as an exact
measurement of the local scale childhood asthma prevalence for the spatial location of each
school in North Carolina (Figure 4.7a). Hence we obtain the local scale mean trend function
mX(s) using a moving window average of the NCSAS data with an exponentially decaying
filter. This leads to the mean trend function shown in Figure 4.8(a). As can be
seen from this figure, the mean trend of asthma prevalence among children in North Carolina
has a slightly higher prevalence along the eastern coast of North Carolina, and it decreases
almost linearly from East to West. This mean trend function can be linearized within each
county, and as a result it is valid at the county observation scale as well. In other words, the
trend shown in Figure 4.8(a) is the mean trend of the local scale asthma prevalence field, as
well as the asthma prevalence field observed at the county spatial scale, i.e. mZ(s)=mX(s). A
useful implication is that the framework presented in the theory section to integrate data
obtained at different spatial observation scales is valid not only for the X(s) and Z(s) SRFs,
but also for the mean trend removed residual fields X’(s)=X(s)-mX(s) and Z’(s)=Z(s)-mZ(s)
(since mZ(s)=mX(s)). We will therefore apply our framework for the integration of data
observed at different spatial scales to the residual fields X’(s) and Z’(s).
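A moving window average with an exponentially decaying filter can be sketched as a kernel-weighted mean. The example below is only a sketch under assumptions: the exact form of the filter and its range are not specified in the text, so the kernel exp(-3d/filter_range), the function name and the toy data are all hypothetical.

```python
import numpy as np

def exponential_filter_trend(xy_data, values, xy_est, filter_range):
    """Kernel-weighted average with weights exp(-3 d / filter_range) (assumed form)."""
    xy_data = np.atleast_2d(np.asarray(xy_data, float))
    xy_est = np.atleast_2d(np.asarray(xy_est, float))
    values = np.asarray(values, float)
    # Distances between every estimation point and every data point
    dist = np.linalg.norm(xy_est[:, None, :] - xy_data[None, :, :], axis=-1)
    weights = np.exp(-3.0 * dist / filter_range)
    return (weights @ values) / weights.sum(axis=1)

# Toy data: prevalence increasing from West (x=0) to East (x=100)
xy = [[0.0, 0.0], [50.0, 0.0], [100.0, 0.0]]
prev = [0.10, 0.15, 0.20]
trend = exponential_filter_trend(xy, prev, [[25.0, 0.0], [75.0, 0.0]], filter_range=100.0)
residuals = np.asarray(prev) - exponential_filter_trend(xy, prev, xy, filter_range=100.0)
print(trend, residuals)
```

The residuals computed in the last step play the role of the mean trend removed field X'(s) to which the covariance analysis is applied.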
Figure 4.8: (a) Map of the local scale mean trend mX(s) of the childhood asthma prevalence
(fraction of prevalent asthma cases), and (b) plot of the covariance of the mean
trend-removed local scale childhood asthma prevalence SRF X’(s).
Experimental values for the covariance of the residual field X’(s) were estimated from
residual prevalence data obtained by subtracting the mean trend mX(s) (Figure 4.8a) from the
NCSAS prevalence data (Figure 4.7a). We then fit to these experimental covariance values
the following covariance model
$$c_X(r=|s'-s|) = c_{01}\exp\!\left(\frac{-3r}{a_{r1}}\right) + c_{02}\exp\!\left(\frac{-3r}{a_{r2}}\right), \qquad (4.48)$$
where c01=0.9×σX², c02=0.1×σX², σX²=0.0055 (average number of asthma cases per
child)², ar1=89.6 km, and ar2=448 km. As can be seen from Figure 4.8(b), there is a good fit
between the covariance model of Eq. (4.48) and the experimental covariance values obtained
from the residual data observed at the local scale. The covariance model indicates that about
90 percent of the variability of the local scale childhood asthma prevalence has a spatial
range (e.g. spatial clustering) of 89.6 km, while the remaining 10 percent of the variability has a
much larger spatial range (clustering) of 448 km. This interesting finding indicates that the
prevalence of asthma among children observed at a small scale (i.e. at the spatial scale
corresponding to the children population served by a middle school) has a spatial distribution
that is not random; instead it is spatially organized as a nesting of spatial structures
(clustering) of two sizes, one of about 89.6 km in size explaining 90 percent of the overall
asthma prevalence variability, and the other of about 448 km in size explaining 10 percent of
the variability. The explanation for this spatial organization of local scale asthma prevalence
may be manifold, and provides the basis for hypothesis generation that may be tested in
future work. The first possible explanation of spatial clustering of the childhood asthma
future works. The first possible explanation of spatial clustering of the childhood asthma
prevalence may be that it is a result of the observation scale at which the prevalence is
observed. However the NCSAS asthma prevalence data is observed at the spatial scale of the
children served by a single school, and conceivably a majority of the children served by one
school live in a radius that is much smaller than 89.6 km, so that this rather small observation
scale alone cannot explain the larger spatial scales of spatial clustering identified in the
covariance analysis. An additional explanation that then naturally arises is that the
prevalence of asthma among children is influenced by underlying factors that are themselves
organized in space. One such factor may be the characteristics of the children population (e.g.
ethnic make-up, socio-economic status, dietary habits, proportion of children with higher
asthmatic susceptibility, etc.) that may themselves have a spatial structure corresponding to
the 89.6 km spatial scale. Another factor may be the exposure to environmental pollutants
suspected to cause asthma, such as airborne particulate matter, ozone and lead, which may
have spatial ranges in excess of 448 km (e.g. Christakos and Serre, 2000a).
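The nested covariance model of Eq. (4.48) is straightforward to evaluate. The short Python sketch below uses the fitted parameters reported above; the function and constant names are illustrative only.

```python
import math

SIGMA2_X = 0.0055                              # variance of the residual field X'(s)
SILLS = (0.9 * SIGMA2_X, 0.1 * SIGMA2_X)       # c01, c02
RANGES_KM = (89.6, 448.0)                      # ar1, ar2

def covariance(r_km):
    """Nested exponential covariance of Eq. (4.48) at lag r (in km)."""
    return sum(c * math.exp(-3.0 * r_km / a) for c, a in zip(SILLS, RANGES_KM))

for r in (0.0, 20.8, 89.6, 448.0):
    print(f"r = {r:6.1f} km   c_X(r) = {covariance(r):.6f}")
```

Because of the factor 3 in the exponent, each exponential structure has decayed to exp(-3), i.e. about 5 percent of its sill, at a lag equal to its range, which is why ar1 and ar2 can be read as the distances beyond which the corresponding correlation is negligible.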
The mean trend function and covariance model provide the general knowledge base
processed at the prior stage of the BME analysis. Next we present the asthma prevalence
maps obtained at the posterior stage of the BME analysis by integrating asthma prevalence
data obtained at different observation scales.
4.5.4.2. Maps of the childhood asthma prevalence obtained using data collected at different
observation scales
We obtain maps describing the spatial distribution of the childhood asthma prevalence across
North Carolina using three estimation methods. Each estimation method uses the same
general knowledge base consisting in the mean trend function and covariance model
presented above. This general knowledge base is processed at the structural stage of the
BME analysis and leads to a prior PDF characterizing the general characteristics (systematic
trends, spatial variability) of the spatial distribution of the childhood asthma prevalence
observed at the local scale (i.e. at the spatial scale of middle schools). Then at the integration
stage of the BME analysis, each method uses a Bayesian conditionalization knowledge
processing rule to update the prior PDF by considering a different site specific knowledge
base, leading to different maps of the estimated childhood asthma prevalence across North
Carolina.
The first estimation method (method 1) considers the NCSAS data as hard (exact)
measurements of the childhood asthma prevalence observed at the local scale. This
estimation method ignores entirely the Medicaid childhood asthma prevalence data collected
at the county observation scale. Using this restricted site specific knowledge base, we update
the prior PDF at each node of a regular estimation grid covering the state of North Carolina.
We thereby obtain a BME posterior PDF at each of these estimation points, from which we
select the expected value as the so-called BME mean estimate, and the variance as an
assessment of the associated mapping uncertainty. The map of the BME mean estimate for
method 1 is shown in Figure 4.9(a), and the map of the associated uncertainty is shown in
Figure 4.10(a). As can be seen from these figures, the map obtained interpolates the NCSAS
data over all non-surveyed areas of North Carolina, with a mapping uncertainty that is zero at
the spatial location of each of the NCSAS middle schools, and increases away from these
surveyed locations. We note that because the site specific knowledge base is restricted to
only include hard data, the BME estimate of method 1 reduces to the simple kriging
estimator of classical Geostatistics. Hence method 1 corresponds to the simple kriging
method accounting only for data obtained at the local scale, and we can compare this baseline
method against other methods that attempt to integrate the additional information provided
by the Medicaid childhood asthma prevalence data available at the county observation scale.
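Since method 1 reduces to simple kriging, its estimate and estimation variance can be sketched with a few lines of linear algebra. The Python example below is a minimal sketch for the zero-mean residual field, using the covariance model of Eq. (4.48); the function names, coordinates and residual values are hypothetical, and this is not the software used to produce the maps.

```python
import numpy as np

def covariance(h, sills=(0.00495, 0.00055), ranges=(89.6, 448.0)):
    """Nested exponential covariance of Eq. (4.48); h in km."""
    h = np.asarray(h, float)
    return sum(c * np.exp(-3.0 * h / a) for c, a in zip(sills, ranges))

def simple_kriging(xy_data, residuals, xy_est):
    """Simple kriging estimate and variance for zero-mean residuals (hard data only)."""
    xy_data = np.atleast_2d(np.asarray(xy_data, float))
    xy_est = np.atleast_2d(np.asarray(xy_est, float))
    residuals = np.asarray(residuals, float)
    d_dd = np.linalg.norm(xy_data[:, None, :] - xy_data[None, :, :], axis=-1)
    d_de = np.linalg.norm(xy_data[:, None, :] - xy_est[None, :, :], axis=-1)
    c_dd = covariance(d_dd)                    # data-to-data covariances
    c_de = covariance(d_de)                    # data-to-estimation-point covariances
    weights = np.linalg.solve(c_dd, c_de)      # kriging weights, one column per point
    estimate = weights.T @ residuals
    variance = covariance(0.0) - np.sum(weights * c_de, axis=0)
    return estimate, variance

# Hypothetical school locations (km) and residual prevalence values
schools = [[0.0, 0.0], [60.0, 0.0], [0.0, 80.0]]
resid = [0.02, -0.01, 0.03]
est, var = simple_kriging(schools, resid, [[0.0, 0.0], [30.0, 30.0]])
print(est, var)
```

In this sketch the first estimation point coincides with a data point, so the estimate reproduces the datum and the variance is zero up to rounding, in line with the behavior of the uncertainty map described above. The mean trend mX(s) would be added back to the estimated residuals to obtain prevalence estimates.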
Figure 4.9: Maps of the BME mean estimate of children asthmatic symptom prevalence
(average number of cases per child) observed at the school spatial scale across North
Carolina. These maps were obtained using (a) method 1, (b) method 2, and (c) method 3.
In the second estimation method (method 2), we consider both the NCSAS and Medicaid
data as if they were exact measurements (hard data) of the childhood asthma prevalence
observed at the local scale. In other words this estimation method corresponds to using the
simple kriging estimator on the combined NCSAS and Medicaid data without recognizing
that these data were obtained at different observation scales. By ignoring the scale effect for
the Medicaid data, method 2 underestimates the uncertainty associated with the large
observation scale of that dataset. The map of the BME mean estimate obtained from method 2 is
shown in Figure 4.9(b). As can be seen from this figure, the map integrates more details in
the spatial distribution of the childhood asthma prevalence because the combined dataset is
larger, leading to a spatial estimate that is quite different from that obtained with method 1.
The substantial difference between the maps of method 1 and method 2 is the main point we
are making here. Whether the map of method 2 is any more accurate than that obtained with
method 1 is an issue that we will address later in the cross-validation section. Suffice it to say
that method 2 wrongly assumes that the scale effect of the Medicaid data can be ignored,
leading to the erroneous belief that the uncertainty associated with the map of method 2 is
zero at the centroid of each county where a Medicaid data point is reported. As a result,
method 2 is unable to provide a correct assessment of the uncertainty associated with its
spatial estimate shown in Figure 4.9(b).
Figure 4.10: Maps of the BME posterior variance ([average asthma counts per child]²)
obtained with (a) method 1 and (b) method 3, which provide an assessment of the
uncertainty associated with the BME mean estimate maps shown in Figure 4.9 (a) and (c),
respectively.
On the other hand method 3 corresponds to our proposed approach which accounts for
the scale effect by formally processing the uncertainty associated with the observation scale
of the Medicaid data. As explained in the theory section, we have developed for this method
a mathematical formulation for the error variance (Eq. 4.47) resulting from the spatial scale
at which the Medicaid data is observed. Using our proposed framework, the NCSAS data is
processed as hard data, while the Medicaid data is used to generate soft data with an
uncertainty calculated as a function of the corresponding observation scale. The map of the
BME mean estimate for method 3 is shown in Figure 4.9(c), and the map of the associated
uncertainty is shown in Figure 4.10(b). As can be seen from these figures, method 3
integrates both datasets, extracting all the information provided by the NCSAS data obtained
at the local scale, and using the Medicaid data as an approximate guess of the local scale
childhood asthma prevalence away from the NCSAS data points. The resulting map has
more spatial details than the map of method 1, yet it is smoother than the map of method 2.
The map of the associated mapping uncertainty shows that the uncertainty is zero at the
NCSAS middle school locations, that it is small but nonzero at the centroid of counties for
which the Medicaid data is available, and that it increases away from these points. Both
these features result in a more realistic representation of the local scale childhood asthma
prevalence than that obtained from either method 1 or 2.
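The contrast between methods 2 and 3 can be illustrated with a small Gaussian sketch. When the prior is Gaussian and each soft datum is a Gaussian PDF centered on the observed value, as in Eq. (4.46) with E[Y(s)]=0, the integral of Eq. (4.42) has the same form as conditioning on an observation with an independent additive error, so the posterior mean and variance can be computed by adding the error variance to the diagonal of the data covariance matrix. The code below relies on that reading; it is a sketch, not the BME software used in this work, and the function names, coordinates, residual values and error variance are hypothetical.

```python
import numpy as np

def covariance(h, sills=(0.00495, 0.00055), ranges=(89.6, 448.0)):
    """Nested exponential covariance of Eq. (4.48); h in km."""
    h = np.asarray(h, float)
    return sum(c * np.exp(-3.0 * h / a) for c, a in zip(sills, ranges))

def gaussian_posterior(xy_data, residuals, error_var, xy_est):
    """Posterior mean and variance with hard data (error_var = 0) and Gaussian soft data."""
    xy_data = np.atleast_2d(np.asarray(xy_data, float))
    xy_est = np.atleast_2d(np.asarray(xy_est, float))
    d_dd = np.linalg.norm(xy_data[:, None, :] - xy_data[None, :, :], axis=-1)
    d_de = np.linalg.norm(xy_data[:, None, :] - xy_est[None, :, :], axis=-1)
    c_dd = covariance(d_dd) + np.diag(np.asarray(error_var, float))
    c_de = covariance(d_de)
    weights = np.linalg.solve(c_dd, c_de)
    mean = weights.T @ np.asarray(residuals, float)
    variance = covariance(0.0) - np.sum(weights * c_de, axis=0)
    return mean, variance

# One school (hard datum) at the origin and one county centroid 30 km away
xy = [[0.0, 0.0], [30.0, 0.0]]
resid = [0.02, -0.01]
targets = [[0.0, 0.0], [30.0, 0.0]]
mean2, var2 = gaussian_posterior(xy, resid, [0.0, 0.0], targets)      # method 2
mean3, var3 = gaussian_posterior(xy, resid, [0.0, 0.0015], targets)   # method 3
print(var2, var3)
```

With a zero error variance (method 2) the variance vanishes at the county centroid, whereas with a positive scale-error variance (method 3) it remains zero at the school and is small but nonzero at the centroid.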
The results presented so far illustrate that by formally accounting for the scale effect of
the childhood asthma prevalence data, our proposed framework (method 3) generates a map
describing the spatial distribution of the childhood asthma prevalence that is substantially
different and more realistic than maps obtained using methods not accounting for the scale
effect. We now investigate whether this more realistic map is also substantially more
accurate than the maps of methods 1 or 2.
4.5.4.3. Cross-validation results
We use a cross validation procedure to compare the accuracy of the maps obtained using
estimation methods 1, 2 and 3 in terms of their cross validation mean square error (MSE).
Each datum of the NCSAS dataset representing an exact measurement of the childhood
asthma prevalence observed at the spatial scale of middle schools is removed from the data, and
re-estimated on the basis of the remaining NCSAS and Medicaid data. The cross validation
error is then simply obtained by subtracting from each cross-validation estimate the exact
measurement that was set aside. Using this procedure we obtain cross-validation errors for
each estimation method, from which the cross-validation MSE is calculated. The results of
this cross validation procedure are shown in Table 4.4. As can be seen from this table,
somewhat surprisingly, method 2 does not provide any improvement of mapping accuracy
over method 1. In fact the MSE for method 2 is slightly higher than that of method 1. This
result provides a striking illustration of what may happen when one attempts to mix-in data
obtained at different observation scales without consideration of the scale effect, as is the
case for the naïve approach used in method 2. Indeed, even though method 2 seems to
provide more spatial details about the distribution of the asthma prevalence among children
across North Carolina, these details are actually erroneous because they do not account for
the uncertainty associated with the large observation scale of the Medicaid data. On the other
hand our proposed BME approach (method 3) has an MSE that is substantially smaller than
that of either method 1 or method 2. The sound conceptual framework we have developed in
this work to integrate data obtained at different observation scales leads to a 10.2% decrease
in cross-validation MSE relative to method 1, and an 11.6% decrease relative to method 2.
This demonstrates that our proposed approach leads to a map of the childhood asthma
prevalence across North Carolina that is more realistic and more accurate than those obtained
by methods that do not account for the scale effect.
Table 4.4: Cross-validation results showing the cross-validation MSE for methods 1, 2 and 3,
and the change in cross-validation MSE between method 1 and method 3, as well as between
method 2 and method 3.

                                      Method 1    Method 2    Method 3
MSE                                   0.040638    0.041293    0.036490
rMSE 1,3 (method 1 to method 3)       -10.206%
rMSE 2,3 (method 2 to method 3)       -11.630%
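The relative changes reported in Table 4.4 can be checked with a few lines of Python. The sketch assumes that rMSE denotes the relative change (MSE of method 3 minus MSE of the reference method, divided by the MSE of the reference method), which reproduces the tabulated percentages to within the rounding of the reported MSE values; the function name is hypothetical.

```python
def relative_mse_change(mse_ref, mse_new):
    """Relative change in MSE, in percent, of a new method with respect to a reference."""
    return 100.0 * (mse_new - mse_ref) / mse_ref

mse = {1: 0.040638, 2: 0.041293, 3: 0.036490}   # cross-validation MSE from Table 4.4
for j in (1, 2):
    print(f"rMSE {j},3 = {relative_mse_change(mse[j], mse[3]):.2f}%")
```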
The cross validation procedure compares the accuracy of the estimation methods when
one data point is removed at a time. This comparison quantifies the gain in accuracy for the
current mapping situation, i.e. we can say that the childhood asthma prevalence map
produced in this work (the method 3 map of Figure 4.9c) is at least 10% more accurate than
maps that may have been produced to date using the traditional approach of method 1 or
method 2. Another comparison that is often used in practice to compare estimation methods
is a validation procedure, which compares the mapping accuracy under other mapping
situations by removing several data points at once. We present next the validation results for
a selected mapping situation of interest.
4.5.4.4. Validation results
The validation procedure that we implement consists in removing 30% of the NCSAS data at
once, and re-estimating the childhood asthma prevalence for these points using the remaining
NCSAS data as well as the Medicaid data. We then subtract from these validation estimates
the exact measured values that were set aside, thereby obtaining validation errors from which
we obtain the validation MSE. The validation MSE obtained using this procedure for
estimation methods 1, 2 and 3 are shown in Table 4.5. As we can see from this table, when
removing 30% of the NCSAS data, method 2 is slightly more accurate than method 1, and,
more importantly, our proposed BME approach (method 3) is at least 20% more accurate
than either method 1 or method 2. This means that our proposed method provides a powerful
conceptual framework to integrate data obtained at different observation scales for a wide
range of mapping situations.
Table 4.5: Validation results obtained when selecting a random validation set consisting of
30% of the NCSAS data. The table shows the validation MSE obtained for methods 1, 2 and
3, and the change in validation MSE between method 1 and method 3, as well as between
method 2 and method 3.
                 Method 1     Method 2     Method 3
MSE              0.0098939    0.0096997    0.0076670
rMSE (1, 3)      -22.508%
rMSE (2, 3)      -20.957%
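The relative changes in MSE reported in Tables 4.4 and 4.5 can be checked with a few lines of arithmetic. The sketch below assumes that the change is defined as (MSE of method 3 minus MSE of the reference method) divided by the MSE of the reference method, which reproduces the tabulated percentages to within the rounding of the published MSE values; the function and variable names are illustrative.

```python
def relative_mse_change(mse_reference, mse_new):
    """Percent change in MSE when moving from a reference method to a new one."""
    return 100.0 * (mse_new - mse_reference) / mse_reference

# MSE values transcribed from Tables 4.4 and 4.5 (keys are method numbers).
cross_validation_mse = {1: 0.040638, 2: 0.041293, 3: 0.036490}
validation_mse = {1: 0.0098939, 2: 0.0096997, 3: 0.0076670}

def changes_relative_to_method_3(mse):
    """Relative MSE change between each classical method and method 3."""
    return {ref: relative_mse_change(mse[ref], mse[3]) for ref in (1, 2)}

cv_changes = changes_relative_to_method_3(cross_validation_mse)
val_changes = changes_relative_to_method_3(validation_mse)
```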
4.5.5. Conclusions
Asthma is an adverse health condition of emerging concern for children. Maps showing the
spatial distribution of the asthma prevalence among children are vital to better understand
what may cause the disease and to improve its public health response in order to protect the
health of children. However mapping the childhood asthma prevalence is complicated by the
fact that data is often available at a variety of spatial scales. This is particularly the case
because several data sources have confidentiality requirements that only allow release of
information aggregated over spatial scales that are sufficiently large to ensure the privacy of
the individuals who provided their health information.
We develop in this work a rigorous mathematical framework to map the spatial
distribution of the childhood asthma prevalence by integrating data collected at different
spatial observation scales, and we apply this framework to a real case study in North Carolina
using two datasets obtained at two substantially different observation scales. We constructed
our first dataset of the childhood asthma prevalence using the North Carolina School Asthma
Survey data that was collected as part of a previous study of one of the co-authors (Yeatts et
al., 2004; Sturm et al., 2004). By aggregating the NCSAS data at the high school spatial
scale using good quality information on the prevalence of asthma symptoms among 7th and 8th
graders, we obtained a dataset that can essentially be treated as exact measurements of the
childhood asthma prevalence observed at the local scale for each of the 493 high schools that
participated in the NCSAS study. While this first dataset provides a rich set of point
measurements, it inherently provides sparse spatial coverage of North Carolina. Hence
we also included in the mapping analysis a second dataset consisting of the childhood asthma
prevalence calculated on the basis of Medicaid-claims aggregated at the county spatial scale
(Buescher et al., 1999). While this dataset presents some limitations due to biases connected
with the Medicaid-enrolled children population, we hypothesized that local errors in the
Medicaid data may average out at the county spatial scale, so that this dataset provides useful
information as long as the scale effect is adequately accounted for.
The conceptual framework we develop in this work provides a rigorous mathematical
formulation for the uncertainty associated with the spatial scale at which asthma prevalence
data are observed. Using this framework, the NCSAS data is processed as hard data, while
the Medicaid children data is used to generate soft data with an uncertainty corresponding to
the county spatial scale at which this data is reported. These combined hard and soft data are
then rigorously processed using the Bayesian Maximum Entropy method of modern
Geostatistics, leading to an accurate estimation of the spatial distribution of the childhood
asthma prevalence across North Carolina.
We find that the map we obtain is substantially more realistic and accurate than the
classical map obtained by ignoring entirely the county level data, or the classical map
obtained by integrating the county level data without consideration of its observation scale.
Results from our cross-validation analysis indicate that the childhood asthma prevalence
map we generate for North Carolina has a mapping error variance that is a substantial 10%
smaller than that of the classical maps obtained when ignoring the scale effect. Furthermore
a validation analysis indicates that under other mapping situations the drop in mapping
estimation error can be in excess of 20% over the classical approaches not accounting for the
scale effect. This means that our proposed method provides a powerful conceptual
framework to integrate data obtained at different observation scales for a wide range of
asthma mapping situations.
This work provides a methodological advance that will lead to an improved assessment
of the spatial distribution of the asthma prevalence among children nationwide, and by
applying this new method we obtain the most accurate map created to date for the spatial
distribution of the childhood asthma prevalence across North Carolina. These contributions
will be very useful to improve our understanding of possible associations between asthma
and causal risk factors such as air pollutants, and will be critical to improve asthma public
health intervention for children nationwide. Furthermore by demonstrating how existing
sources of asthma data such as Medicaid claims can be used to obtain good estimates of the
childhood asthma prevalence at a fine spatial resolution, this work will reduce the need for
costly programs dedicated to asthma surveillance, so that state health departments’ limited
resources can be more efficiently used for public health interventions and reduction of
childhood asthma morbidity.
V. CONCLUDING REMARKS
The linear kriging methods of classical Geostatistics (e.g. simple kriging and co-kriging)
have gained considerable popularity in environmental mapping applications to estimate an
environmental contaminant variable of interest at unsampled locations. However, these
estimation methods have considerable well documented limitations (e.g. linear estimation,
Gaussian assumptions, and exact measurements), and as a result they lack the theoretical
underpinnings and practical flexibility needed to incorporate the wide variety of knowledge
bases available in modern environmental and health mapping applications, which include
information about the uncertainty associated with the data available.
On the other hand the powerful BME mapping method of modern spatiotemporal
Geostatistics is a non-linear estimation method that overcomes the limitations of the classical
Geostatistics by comprehensively assimilating a wide variety of physical knowledge bases,
including data uncertainty. The data uncertainty prevalent in environmental and health
applications has been recognized as critical information that needs to be formally modeled in
order to increase the accuracy of estimated maps. In this work we investigate three important
types of uncertainty for environmental and health processes, and we develop the framework
to account for these types of uncertainty in terms of relevant soft PDFs. The models of soft
data we generate are then used in real world case studies, resulting in three environmental
and health mapping applications. In each mapping application, the data uncertainty is
successfully identified and expressed in terms of the proper soft data model, and rigorously
processed using the powerful BME mapping method.
In the first mapping situation considered the data uncertainty originates from varying
levels of measurement errors in the analysis of groundwater arsenic contamination in New
England. We develop a measurement error model specifically for arsenic analyses, and we
successfully validate this measurement error model by comparing the uncertainty it predicts
with that obtained from a covariance analysis. The measurement error model then allows us
to obtain probabilistic soft data describing adequately the uncertainty associated with the
measurement error of three arsenic datasets. As a result, we are able to apply the BME
estimation method to account for the varying levels of measurement error between these
three datasets, and obtain accurate maps of the spatial distribution of arsenic in the
groundwater of New England. A synthetic case study as well as the real case study show that the
proposed BME approach results in a substantial improvement of mapping accuracy over
classical Geostatistical methods that do not properly account for measurement error.
Furthermore the work presented in this first mapping application will provide an ideal
framework to add new monitoring data with presumably lower detection limit and better
precision as the analytical measurement techniques for arsenic and its speciation keep
improving in the future.
The source of data uncertainty we consider in the second mapping application comes
from the emergence of secondary variables used to map a primary variable for which data is
sparse. Starting with a synthetic case study, we generate realizations of two related SRFs
reproducing the statistical properties of New England groundwater arsenic, and soil pH,
respectively, using a new simulator developed as part of this work. We then implement some
straightforward regression approaches to model the empirical law between groundwater
arsenic and soil pH, from which we obtain a model for the conditional PDF of the
groundwater arsenic primary variable given a collocated measurement of the soil pH
secondary variable. This conditional PDF is efficiently processed in terms of soft data by the
BME estimation method, resulting in realistic maps of groundwater arsenic that rigorously
incorporate the information provided by the soil pH secondary variable. This work
demonstrates that because the proposed BME approach formally accounts for the empirical
law between the primary and secondary variables, it leads to a drastic improvement in
mapping accuracy over the co-kriging method which only accounts for the cross-correlation
between primary and secondary variables. As a result, this work suggests a shift of the
multivariate mapping paradigm from co-kriging to the proposed BME method when dealing
with secondary variables related to the primary variable through a variety of empirical laws.
In the third mapping application we develop a rigorous mathematical framework to map
the spatial distribution of childhood asthma prevalence by integrating data collected at
different spatial observation scales, and we apply this framework to a real case study in North
Carolina using two datasets obtained at two substantially different observation scales. The
mathematical framework we develop consists in deriving the conditional PDF of a variable at
the local scale given an observation of that variable at a larger scale. Once this framework is
developed, it is possible to generate soft data for the local scale variable on the basis of data
observed at different temporal or spatial scales. This approach allows data observed at a
variety of scales to be mixed efficiently, and increases the accuracy of the map obtained for
the scale of interest. Our developed framework is formulated in the one-dimensional
temporal case, and then extended to the two dimensional spatial case, before being applied to
the North Carolina childhood asthma prevalence real case study. We find that the map that
we obtain is substantially more realistic and accurate than maps obtained without
consideration of observation scale. Results from our cross-validation analysis indicate that
the childhood asthma prevalence map we generate for North Carolina has a mapping error
variance that is a substantial 10% smaller than that of classical maps obtained when ignoring
the scale effect. Furthermore a validation analysis indicates that under other mapping
situations the drop in mapping estimation error can be in excess of 20% over the classical
approaches not accounting for the scale effect. This means that our proposed method
provides a powerful conceptual framework to integrate data obtained at different observation
scales for a wide range of asthma mapping situations.
In this dissertation models for soft Geostatistical data have been developed to account for
three important types of data uncertainty that are relevant to environmental and health
spatiotemporal processes. The subsequent integration of these soft data models using the
rigorous mathematical estimation framework provided by the BME mapping method leads to
substantial improvements in mapping accuracy over classical methods that do not properly
account for data uncertainty. Thus these models of soft data can be applied in a variety of real
exposure and health mapping situations to provide highly informative maps that will be
useful for environmental scientists, epidemiologists, public health officials, and state
regulators.
Appendix A: Derivation of empirical relationships and their associated
uncertainty
A.1. A quick overview of the multivariate linear regression model
Let’s consider the multivariate linear regression model expressed as
\[ x_i = \sum_{j=1}^{p} y_{ij}\,\beta_j + \varepsilon_i , \qquad 1 \le i \le N \tag{A.1} \]
where xi are the response variables, yij are explanatory variables, βj are regression parameters,
εi are unobservable random errors, N is the number of observations, and p is the number of
regression parameters.
This model usually includes major assumptions (i.e. normality, homoscedasticity, and
mutual independence between response variables) leading to the normal distribution of the
estimator for the expected value (i.e. $\sum_{j=1}^{p} y_{ij}\beta_j$) and variance (i.e. $\sigma_X^2$). The regression
coefficients are estimated by setting up an objective function equal to the mean prediction
square error
\[ \mathrm{MPSE} = \sum_{i=1}^{N} \Big( x_i - \sum_{j=1}^{p} y_{ij}\,\beta_j \Big)^{2} \tag{A.2} \]
to be minimized with respect to the $\beta_j$. We obtain the estimators $\hat\beta_k$ for $\beta_k$, k=1,…,p, by
setting $\partial\,\mathrm{MPSE} / \partial \hat\beta_k = 0$, k=1,…,p, which leads to the following normal equations
\[ \sum_{i=1}^{N} y_{ik} \Big( x_i - \sum_{j=1}^{p} y_{ij}\,\hat\beta_j \Big) = 0 \tag{A.3} \]
where k=1, 2, 3, …, p.
Eq. (A.3) may be written in matrix form as
\[ D^{T} x = D^{T} D\, \hat\beta \tag{A.4} \]
where D is an (N × p) design matrix with elements $y_{ij}$, x is an (N × 1) vector with elements $x_i$, and
$\hat\beta$ is a (p × 1) vector with elements $\hat\beta_j$.
When $(D^{T}D)^{-1}$ exists, the regression coefficients are given by
\[ \hat\beta = (\Delta^{T}\Delta)^{-1} (\Delta^{T}\chi) \tag{A.5} \]
and the covariance matrix for $\hat\beta$ is estimated as
\[ \mathrm{cov}(\hat\beta) = \sigma_X^2\, (\Delta^{T}\Delta)^{-1} \tag{A.6} \]
where ∆ is obtained by substituting each random variable yij in the design matrix D with its
observed value ψij, and χ is a vector of observed values for x.
Once $\hat\beta$ has been calculated, then the response variables (i.e. $x_i$) are evaluated using
\[ \hat{x}_i = \sum_{j=1}^{p} \psi_{ij}\,\hat\beta_j , \tag{A.7} \]
where i=1,…,N. In addition, the unbiased common estimate for $\sigma_X^2$ is calculated from
the sum of squared vertical distances between the fitted and the observed values, divided by
the N−p degrees of freedom, i.e.
\[ \hat\sigma_X^2 = \frac{\sum_{i=1}^{N} (\chi_i - \hat\chi_i)^2}{N-p} = \frac{\sum_{i=1}^{N} \Big( \chi_i - \sum_{j=1}^{p} \psi_{ij}\,\hat\beta_j \Big)^{2}}{N-p} . \tag{A.8} \]
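The estimation steps of Eqs. (A.5), (A.6) and (A.8) can be sketched numerically as follows. This is a minimal illustration using NumPy; the function name and the small noise-free dataset are assumptions made for the example and are not part of the original analysis.

```python
import numpy as np

def fit_normal_equations(delta, chi):
    """Solve the normal equations for the regression coefficients.

    delta : (N, p) matrix of observed explanatory values (the matrix Delta)
    chi   : (N,) vector of observed responses
    Returns beta_hat (A.5), sigma2_hat (A.8) and the covariance of beta_hat
    (A.6, with sigma_X^2 replaced by its estimate).
    """
    delta = np.asarray(delta, dtype=float)
    chi = np.asarray(chi, dtype=float)
    n_obs, n_par = delta.shape
    gram = delta.T @ delta                              # Delta^T Delta
    beta_hat = np.linalg.solve(gram, delta.T @ chi)     # Eq. (A.5)
    residuals = chi - delta @ beta_hat                  # chi_i - chi_hat_i
    sigma2_hat = residuals @ residuals / (n_obs - n_par)  # Eq. (A.8)
    cov_beta = sigma2_hat * np.linalg.inv(gram)         # Eq. (A.6)
    return beta_hat, sigma2_hat, cov_beta

# Illustrative noise-free data: chi = 1 + 2 * psi.
psi = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
chi = 1.0 + 2.0 * psi
delta = np.column_stack([np.ones_like(psi), psi])       # columns: intercept, psi
beta_hat, sigma2_hat, cov_beta = fit_normal_equations(delta, chi)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse for the coefficients; the inverse is only computed where Eq. (A.6) requires it.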
A.2. Parametric polynomial of order 1
A univariate parametric polynomial of order 1 corresponds to a linear regression model (Eq.
A.1) with p=2 regression parameters (an intercept and a slope), i.e. it corresponds to
\[ x_i = \beta_0 + \beta_1 y_i + \varepsilon_i . \tag{A.9} \]
Expanding the normal equations (Eq. A.5) in this case, we obtain
\[ \hat\beta = (\Delta^{T}\Delta)^{-1}(\Delta^{T}\chi) = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix} = \begin{bmatrix} N & \sum_{i=1}^{N} \psi_i \\ \sum_{i=1}^{N} \psi_i & \sum_{i=1}^{N} \psi_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i=1}^{N} \chi_i \\ \sum_{i=1}^{N} \psi_i \chi_i \end{bmatrix} \tag{A.10} \]
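In other words, the regression coefficients are obtained as follows
\[ \hat\beta_0 = \bar\chi - \hat\beta_1 \bar\psi \tag{A.11} \]
\[ \hat\beta_1 = \frac{\sum_{i=1}^{N} (\psi_i - \bar\psi)(\chi_i - \bar\chi)}{\sum_{i=1}^{N} (\psi_i - \bar\psi)^2} \tag{A.12} \]
where the bar denotes the arithmetic average operator.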
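Expanding Eq. (A.6), we obtain after mathematical manipulations the following
equations for the variance $\hat\sigma_{\beta_0}^2$ of $\hat\beta_0$ and the variance $\hat\sigma_{\beta_1}^2$ of $\hat\beta_1$
\[ \hat\sigma_{\beta_0}^2 = \sigma_X^2 \left( \frac{1}{N} + \frac{\bar\psi^{\,2}}{\sum_{j=1}^{N} (\psi_j - \bar\psi)^2} \right) \tag{A.13} \]
\[ \hat\sigma_{\beta_1}^2 = \frac{\sigma_X^2}{\sum_{i=1}^{N} (\psi_i - \bar\psi)^2} . \tag{A.14} \]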
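Then, the variance for the fitted values is
\[ \hat\sigma_{\chi_i}^2 = \hat\sigma_{\beta_0}^2 + \hat\sigma_{\beta_1}^2\, \psi_i^2 + 2\, \psi_i\, c_{\hat\beta_0,\hat\beta_1} \tag{A.15} \]
where $c_{\hat\beta_0,\hat\beta_1}$ indicates the covariance between $\hat\beta_0$ and $\hat\beta_1$ that can be expanded as
\[ c_{\hat\beta_0,\hat\beta_1} = c_{\hat\beta_1,\,\bar\chi - \hat\beta_1 \bar\psi} = c_{\hat\beta_1,\bar\chi} - \bar\psi\, c_{\hat\beta_1,\hat\beta_1} . \tag{A.16} \]
Since $c_{\hat\beta_1,\bar\chi} = 0$, Eq. (A.16) finally reduces to
\[ c_{\hat\beta_0,\hat\beta_1} = \frac{-\bar\psi\, \sigma_X^2}{\sum_{i=1}^{N} (\psi_i - \bar\psi)^2} \tag{A.17} \]
By substituting Eq. (A.17) into Eq. (A.15), we obtain $\hat\sigma_{\chi_i}^2$, i.e.
\[ \hat\sigma_{\chi_i}^2 = \sigma_X^2 \left( \frac{1}{N} + \frac{(\psi_i - \bar\psi)^2}{\sum_{j=1}^{N} (\psi_j - \bar\psi)^2} \right) . \tag{A.18} \]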
The standard error (SE) for $\hat\chi_i$ is obtained by first obtaining the estimate of $\sigma_X^2$ using Eq.
(A.8), and then taking the square root, i.e.,
\[ \text{SE for } \hat\chi_i = \hat\sigma_X \left( \frac{1}{N} + \frac{(\psi_i - \bar\psi)^2}{\sum_{j=1}^{N} (\psi_j - \bar\psi)^2} \right)^{0.5} . \tag{A.19} \]
However, in the case of predicting the result of a single experiment, it is more appropriate to
use the prediction standard error (PSE) rather than the SE for $\hat\chi_i$. In this case an additional
term (i.e. $\hat\sigma_X$) is included to account for the randomness associated with a single experiment,
so that the PSE for $\hat\chi_i$ is given by
\[ \text{PSE for } \hat\chi_i = \hat\sigma_X \left( \frac{1}{N} + \frac{(\psi_i - \bar\psi)^2}{\sum_{j=1}^{N} (\psi_j - \bar\psi)^2} + 1 \right)^{0.5} . \tag{A.20} \]
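Finally, the conditional PDF $f_S(\chi_i|\psi_i)$ characterizing the empirical law for $x_i$ given a measured
value $\psi_i$ for $y_i$ is normally distributed with mean $\hat\chi_i = \hat\beta_0 + \hat\beta_1 \psi_i$ and a standard deviation
equal to the PSE for $\hat\chi_i$ (i.e. a variance equal to its square), i.e.
\[ f_S(\chi_i|\psi_i) = N\!\left( \hat\beta_0 + \hat\beta_1 \psi_i ,\; (\text{PSE for } \hat\chi_i)^2 \right) . \tag{A.21} \]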
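As an illustration of Eqs. (A.11), (A.12), (A.8) and (A.20), the sketch below computes the mean and the PSE of the Gaussian soft PDF for a new measured value of the explanatory variable. The function name and the five data pairs are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def linear_soft_pdf(psi, chi, psi_new):
    """Mean and prediction standard error of the order-1 empirical law.

    psi, chi : observed explanatory and response values
    psi_new  : value of the explanatory variable at which soft data is needed
    """
    psi = np.asarray(psi, dtype=float)
    chi = np.asarray(chi, dtype=float)
    n = psi.size
    psi_bar, chi_bar = psi.mean(), chi.mean()
    s_psi = np.sum((psi - psi_bar) ** 2)
    beta1 = np.sum((psi - psi_bar) * (chi - chi_bar)) / s_psi     # Eq. (A.12)
    beta0 = chi_bar - beta1 * psi_bar                             # Eq. (A.11)
    residuals = chi - (beta0 + beta1 * psi)
    sigma_hat = np.sqrt(np.sum(residuals ** 2) / (n - 2))         # Eq. (A.8), p = 2
    mean = beta0 + beta1 * psi_new
    pse = sigma_hat * np.sqrt(1.0 / n + (psi_new - psi_bar) ** 2 / s_psi + 1.0)  # Eq. (A.20)
    return mean, pse

# Hypothetical data pairs used only for illustration.
psi_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
chi_obs = [2.1, 3.9, 6.2, 7.8, 10.1]
mean_center, pse_center = linear_soft_pdf(psi_obs, chi_obs, 3.0)
mean_edge, pse_edge = linear_soft_pdf(psi_obs, chi_obs, 5.0)
```

The PSE is smallest at the average of the observed explanatory values and grows as the new value moves away from it, which is the behaviour expected from the second term of Eq. (A.20).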
A.3. Parametric polynomial of order 2
The case of parametric polynomial regression with order 2 corresponds to
\[ x_i = \beta_0 + \beta_1 y_i + \beta_2 y_i^2 + \varepsilon_i \tag{A.22} \]
\[ \hat{x}_i = \hat\beta_0 + \hat\beta_1 y_i + \hat\beta_2 y_i^2 . \tag{A.23} \]
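In this case the normal equations (Eq. A.5) for the regression parameters can be expanded as
\[ \hat\beta = (\Delta^{T}\Delta)^{-1}(\Delta^{T}\chi) = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} N & \sum_{i=1}^{N} \psi_i & \sum_{i=1}^{N} \psi_i^2 \\ \sum_{i=1}^{N} \psi_i & \sum_{i=1}^{N} \psi_i^2 & \sum_{i=1}^{N} \psi_i^3 \\ \sum_{i=1}^{N} \psi_i^2 & \sum_{i=1}^{N} \psi_i^3 & \sum_{i=1}^{N} \psi_i^4 \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i=1}^{N} \chi_i \\ \sum_{i=1}^{N} \psi_i \chi_i \\ \sum_{i=1}^{N} \psi_i^2 \chi_i \end{bmatrix} . \tag{A.24} \]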
The PSE for $\hat\chi_i$ is then of the following form
\[ \text{PSE for } \hat\chi_i = \hat\sigma_X \sqrt{ \delta_i^{T} (\Delta^{T}\Delta)^{-1} \delta_i + 1 } , \tag{A.25} \]
where $\delta_i = [\,1,\; \psi_i,\; \psi_i^2\,]^{T}$ and $\hat\sigma_X$ is obtained from Eq. (A.8).
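Finally, the conditional PDF $f_S(\chi_i|\psi_i)$ describing the empirical law is given by the
following normal distribution, whose standard deviation is the PSE for $\hat\chi_i$
\[ f_S(\chi_i|\psi_i) = N\!\left( \hat\beta_0 + \hat\beta_1 \psi_i + \hat\beta_2 \psi_i^2 ,\; (\text{PSE for } \hat\chi_i)^2 \right) . \tag{A.26} \]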
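The order-2 case can be sketched in the same way, this time working directly with the matrix forms of Eqs. (A.24) and (A.25). The function name and the six data pairs are hypothetical; the PSE is computed with the square root shown in Eq. (A.25), so that it reduces to Eq. (A.20) when the quadratic term is dropped.

```python
import numpy as np

def quadratic_soft_pdf(psi, chi, psi_new):
    """Mean and prediction standard error of the order-2 empirical law."""
    psi = np.asarray(psi, dtype=float)
    chi = np.asarray(chi, dtype=float)
    delta = np.column_stack([np.ones_like(psi), psi, psi ** 2])   # design matrix Delta
    gram_inv = np.linalg.inv(delta.T @ delta)                     # (Delta^T Delta)^-1
    beta_hat = gram_inv @ (delta.T @ chi)                         # Eq. (A.24)
    residuals = chi - delta @ beta_hat
    sigma_hat = np.sqrt(residuals @ residuals / (psi.size - 3))   # Eq. (A.8), p = 3
    d = np.array([1.0, psi_new, psi_new ** 2])                    # vector delta_i
    mean = d @ beta_hat
    pse = sigma_hat * np.sqrt(d @ gram_inv @ d + 1.0)             # Eq. (A.25)
    return mean, pse

# Hypothetical data: a quadratic trend plus small fixed perturbations.
psi_obs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
chi_obs = 1.0 + 0.5 * psi_obs - 0.2 * psi_obs ** 2 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])
soft_mean, soft_pse = quadratic_soft_pdf(psi_obs, chi_obs, 2.5)
```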
Appendix B: A simulator to generate realizations of two spatial random
fields (logAs and pH) related in terms of a quadratic empirical law
We aim to generate realizations for the groundwater log-arsenic SRF logAs(s) and soil pH
SRF pH(s) with prescribed statistical properties reproducing those found in the field, and
with a quadratic empirical relationship E[logAs|pH] at collocated point s similar to those
documented in previous studies (e.g. Fig. 3.1).
Let’s consider three independent, homogeneous, normally distributed SRFs A(s), B(s),
and C(s).
A(s) ~ N(µA, σA2) (B.1)
B(s) ~ N(µB, σB2) (B.2)
C(s) ~ N(µC, σC2). (B.3)
Realizations of such fields can easily be generated using geostatistical simulation techniques
(Christakos, 1992; Christakos et al., 2002) such that the realizations of A(s), B(s), and C(s)
have their user-defined means µA, µB, and µC, and variances σA2, σB2, and σC2, and a
covariance range similar to that of soil pH and log-arsenic found in the field.
We then construct the fields for logAs(s) and pH(s) using the following equations
pH(s) = A(s) + B(s) (B.4)
logAs(s) = a1A(s) + a2A(s)2 + C(s), (B.5)
where a1 and a2, together with µA, µB, µC, σA2, σB2, and σC2, are the parameters of our
algorithm to generate logAs(s) and pH(s). Let’s now describe how to choose these parameters
in order to obtain realizations of logAs(s) and pH(s) with known statistical properties and a
quadratic empirical relationship E[logAs|pH] at collocated point s.
By substituting Eq. (B.4) into Eq. (B.5) we obtain the following relationship between the
two collocated random variables logAs and pH
logAs = a1 pH - a1B + a2 pH2 - 2 a2 pH B + a2B2 + C. (B.6)
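The statistical moments of pH and logAs are obtained from Eq. (B.4) and (B.5) as
\[ \mu_{pH} = \mu_A + \mu_B \tag{B.7} \]
\[ \mu_{logAs} = a_1 \mu_A + a_2 \sigma_A^2 + a_2 \mu_A^2 + \mu_C \tag{B.8} \]
\[ \sigma_{pH}^2 = \sigma_A^2 + \sigma_B^2 \tag{B.9} \]
\[ \sigma_{logAs}^2 = a_1^2 \sigma_A^2 + 2 a_2^2 (\sigma_A^2)^2 + 4 a_2^2 \mu_A^2 \sigma_A^2 + 4 a_1 a_2 \mu_A \sigma_A^2 + \sigma_C^2 \tag{B.10} \]
where Eq. (B.10) is obtained by using the following two properties of the Gaussian variable
A expressing the covariance $c_{A,A^2}$ between $A$ and $A^2$, and the variance $\sigma_{A^2}^2$ of $A^2$
\[ c_{A,A^2} = 2 \mu_A \sigma_A^2 \tag{B.11} \]
\[ \sigma_{A^2}^2 = 2 (\sigma_A^2)^2 + 4 \mu_A^2 \sigma_A^2 . \tag{B.12} \]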
Taking the expected value of logAs (Eq. B.6) for a given pH value, we have
\[ E[logAs \mid pH] = a_1\, pH - a_1 E[B \mid pH] + a_2\, pH^2 - 2 a_2\, pH\, E[B \mid pH] + a_2 E[B^2 \mid pH] + E[C] . \tag{B.13} \]
We see from Eq. (B.4) that since A and B are normally distributed, then pH is also normally
distributed. Multiplying Eq. (B.4) by B and taking the expected value we obtain, because A
and B are independent, that $c_{pH,B} = c_{A,B} + \sigma_B^2 = \sigma_B^2$. Assuming that pH and B have a joint
distribution that is approximately multi-Gaussian, we have
\[ E[B \mid pH] = \mu_B + c_{B,pH}\, c_{pH,pH}^{-1}\, (pH - \mu_{pH}) = \mu_B + \frac{\sigma_B^2}{\sigma_{pH}^2} (pH - \mu_{pH}) \tag{B.14} \]
\[ \sigma_{B|pH}^2 = \sigma_B^2 - c_{B,pH}\, c_{pH,pH}^{-1}\, c_{pH,B} = \sigma_B^2 - \frac{(\sigma_B^2)^2}{\sigma_{pH}^2} \tag{B.15} \]
Then the expected value of $B^2$ given pH, $E[B^2 \mid pH] = \sigma_{B|pH}^2 + \{E[B \mid pH]\}^2$, is easily obtained
using Eq. (B.14) and (B.15), leading to the following expression
\[ E[B^2 \mid pH] = \sigma_B^2 - \frac{(\sigma_B^2)^2}{\sigma_{pH}^2} + \mu_B^2 + 2 \mu_B \frac{\sigma_B^2}{\sigma_{pH}^2} (pH - \mu_{pH}) + \frac{(\sigma_B^2)^2}{(\sigma_{pH}^2)^2} (pH - \mu_{pH})^2 . \tag{B.16} \]
Substituting Eq. (B.14) and (B.16) into (B.13) gives the equation for E[logAs|pH], i.e.
E[logAs|pH] = a1(pH-µpH) + a1µpH – a1 E[B|pH] + a2 (pH-µpH )2 – a2µpH 2 + 2 a2 µpH (pH –
µpH) + 2a2µpH2 - 2a2 E[B|pH](pH-µpH) - 2a2 E[B|pH]µpH + a2 E[B2|pH] + E[C]
= a1(pH-µpH) + a1µpH – a1(µpH – µA) – a1(σpH2 – σA2)/σpH2(pH-µpH ) + a2(pH-µpH )2 - a2µpH2
+ 2a2µpH(pH-µpH) + 2a2µpH2 - 2a2(µpH - µA) (pH-µpH) – 2a2(σpH2 – σA2)/σpH2(pH-µpH )2 - 2
a2(µpH - µA) µpH – 2a2(σpH2 – σA2)/σpH2µpH (pH-µpH ) + a2(σpH2 – σA2) – a2(σpH2 – σA2)2/σpH2 +
a2(µpH - µA)2 + 2a2(µpH - µA)(σpH2 – σA2)(pH-µpH )/ σpH2 + a2(σpH2 – σA2)2(pH-µpH )2/σpH4 +
µC
= a1µA - a2µpH 2 + 2a2 µA µpH + a2(σpH2 – σA2) - a2σpH2 + 2a2σA2 - a2σA4/σpH2 + a2µpH 2 -
2a2 µA µpH + a2{µA }2 + { a1 - a1 + a1σA2/σpH2 + 2a2µpH - 2a2µpH + 2a2 µA - 2a2µpH +
2a2σA2/σpH2µpH + 2a2(µpH - µpHσA2/σpH2 - µA + µA σA2/σpH2)}( pH-µpH) + (a2 - 2a2 +
2a2σA2/σpH2 + a2 - 2a2σA2/σpH2 + a2σA4/σpH4)( pH-µpH)2 + µC
= a1 µA + a2(σA2 - σA4/σpH2 + {µA }2) + (a1σA2/σpH2 + 2a2 µA σA2/σpH2)( pH-µpH) +
a2σA4/σpH4( pH-µpH)2 + µC. (B.17)
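We further simplify Eq. (B.17) leading to the following equation for E[logAs|pH], i.e.
\[ E[logAs \mid pH] = \mu_{logAs} - a_2 \frac{(\sigma_A^2)^2}{\sigma_{pH}^2} + \left( a_1 \frac{\sigma_A^2}{\sigma_{pH}^2} + 2 a_2 \mu_A \frac{\sigma_A^2}{\sigma_{pH}^2} \right) (pH - \mu_{pH}) + a_2 \frac{(\sigma_A^2)^2}{(\sigma_{pH}^2)^2} (pH - \mu_{pH})^2 . \tag{B.18} \]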
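The moments of Eqs. (B.7) to (B.10) and the quadratic law of Eq. (B.18) can be checked with a simple Monte Carlo experiment. The sketch below is not the spatial simulator developed in this work: it draws independent Gaussian values of A, B and C at a single collocated point (no spatial correlation), and all parameter values are hypothetical. Because E[logAs|pH] is exactly quadratic in pH under these assumptions, a least-squares quadratic fit of logAs on (pH − µpH) should recover the three coefficients of Eq. (B.18).

```python
import numpy as np

def simulate_collocated(n, mu_a, mu_b, mu_c, var_a, var_b, var_c, a1, a2, seed=0):
    """Draw n collocated (pH, logAs) pairs from Eqs. (B.4) and (B.5)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(mu_a, np.sqrt(var_a), n)
    b = rng.normal(mu_b, np.sqrt(var_b), n)
    c = rng.normal(mu_c, np.sqrt(var_c), n)
    ph = a + b                               # Eq. (B.4)
    log_as = a1 * a + a2 * a ** 2 + c        # Eq. (B.5)
    return ph, log_as

def theoretical_law(mu_a, mu_b, mu_c, var_a, var_b, var_c, a1, a2):
    """Moments (B.7)-(B.10) and the coefficients of Eq. (B.18)."""
    mu_ph = mu_a + mu_b
    var_ph = var_a + var_b
    mu_log_as = a1 * mu_a + a2 * var_a + a2 * mu_a ** 2 + mu_c
    var_log_as = (a1 ** 2 * var_a + 2 * a2 ** 2 * var_a ** 2
                  + 4 * a2 ** 2 * mu_a ** 2 * var_a
                  + 4 * a1 * a2 * mu_a * var_a + var_c)
    c0 = mu_log_as - a2 * var_a ** 2 / var_ph          # constant term
    c1 = (a1 + 2 * a2 * mu_a) * var_a / var_ph         # coefficient of (pH - mu_pH)
    c2 = a2 * var_a ** 2 / var_ph ** 2                 # coefficient of (pH - mu_pH)^2
    return mu_ph, var_ph, mu_log_as, var_log_as, (c0, c1, c2)

# Hypothetical parameter values chosen only for illustration.
params = dict(mu_a=6.5, mu_b=0.0, mu_c=0.0, var_a=0.3, var_b=0.1, var_c=0.05,
              a1=-1.0, a2=0.1)
ph, log_as = simulate_collocated(200_000, **params)
mu_ph, var_ph, mu_log_as, var_log_as, coeffs = theoretical_law(**params)
# np.polyfit returns the highest power first, so reverse to get (c0, c1, c2).
fitted = np.polyfit(ph - mu_ph, log_as, 2)[::-1]
```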
Appendix C: Derivation of σY2(t’,t) accounting for different observation
time scales
C.1. Non-stationary temporal random field case
Let X(t) be a non-stationary temporal random field (TRF), so that its mean trend mX(t)=E[X(t)]
is not a constant, and its covariance model cannot generally be expressed solely as a function
of temporal lag, τ=|t-t’|, i.e.
\[ c_X(t,t') \neq c_X(\tau = |t-t'|), \quad \text{and} \quad m_X(t) \neq m_0 . \tag{C.1} \]
Z(t) is defined as the average of X(t) over the time duration T centered at time t
\[ Z(t) = \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, X(u) . \tag{C.2} \]
Taking the expected value of Eq. (C.2) gives
\[ E[Z(t)] = \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, E[X(u)] = \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, m_X(u) . \tag{C.3} \]
We define a new temporal random field Y(t’,t) as
\[ Y(t',t) = X(t') - Z(t) \tag{C.4} \]
where t indicates the mid-point of the time domain T(t)=[t-T/2, t+T/2], and t’ denotes any
possible time within T(t). Then we derive its expected value
\[ E[Y(t',t)] = E[X(t') - Z(t)] = m_X(t') - \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, m_X(u) , \tag{C.5} \]
and variance
\[ \sigma_Y^2(t',t) = E[Y^2(t',t)] - \{E[Y(t',t)]\}^2 = E[X^2(t')] - 2 E[X(t')Z(t)] + E[Z^2(t)] - \Big\{ m_X(t') - \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, m_X(u) \Big\}^2 , \tag{C.6} \]
where
\[ E[X^2(t')] = \sigma_X^2(t') + \{E[X(t')]\}^2 = \sigma_X^2(t') + \{m_X(t')\}^2 , \tag{C.7} \]
\[ \begin{aligned} E[Z^2(t)] = E[Z(t)Z(t)] &= E\Big[ \frac{1}{T^2} \int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, X(u) X(u') \Big] \\ &= \frac{1}{T^2} \int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, E[X(u)X(u')] \\ &= \frac{1}{T^2} \int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, \{ c_X(u,u') + m_X(u)\, m_X(u') \} , \end{aligned} \tag{C.8} \]
and,
\[ \begin{aligned} E[X(t')Z(t)] &= E\Big[ X(t')\, \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, X(u) \Big] = \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, E[X(t')X(u)] \\ &= \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, \{ c_X(t',u) + m_X(t')\, m_X(u) \} . \end{aligned} \tag{C.9} \]
Assuming a linearized mean trend mX(t)=m0+m1t, Eq. (C.3) reduces to
\[ E[Z(t)] = \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, m_X(u) = \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, (m_0 + m_1 u) = m_0 + m_1 t , \tag{C.10} \]
so that the expected value of Y(t’,t) can be expressed as
E[Y(t’,t)] = E[X(t’)] – E[Z(t)] = m1(t’-t). (C.11)
Similarly for linearized mean trend the variance of Y(t’,t) reduces to
σY2(t’,t) = E[X2(t’)] – 2 E[X(t’)Z(t)] + E[Z2(t)] – [m1(t’-t)]2, (C.12)
where
\[ E[X^2(t')] = \sigma_X^2(t') + \{ m_0 + m_1 t' \}^2 , \tag{C.13} \]
\[ \begin{aligned} E[Z^2(t)] &= \frac{1}{T^2} \int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, \{ c_X(u,u') + (m_0 + m_1 u)(m_0 + m_1 u') \} \\ &= \frac{1}{T^2} \int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, c_X(u,u') + (m_1 t + m_0)^2 , \end{aligned} \tag{C.14} \]
and
\[ \begin{aligned} E[X(t')Z(t)] &= \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, \{ c_X(t',u) + (m_0 + m_1 t')(m_0 + m_1 u) \} \\ &= \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, c_X(t',u) + (m_0^2 + m_0 m_1 t + m_0 m_1 t' + m_1^2\, t\, t') . \end{aligned} \tag{C.15} \]
Once we substitute Eqs. (C.13), (C.14) and (C.15) in Eq. (C.12), the terms involving the mean
trend cancel out and we obtain the variance of Y(t’,t), i.e.
\[ \sigma_Y^2(t',t) = \sigma_X^2(t') - \frac{2}{T} \int_{t-T/2}^{t+T/2} du\, c_X(t',u) + \frac{1}{T^2} \int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, c_X(u,u') . \tag{C.16} \]
C.2. Stationary covariance
We now consider the case where the TRF X(t) has a stationary covariance and a non-
stationary linearized mean trend, so that
\[ c_X(t,t') = c_X(\tau = |t-t'|) \quad \text{and} \quad m_X(t) = m_0 + m_1 t . \tag{C.17} \]
As previously derived, in this case E[Y(t’,t)] = m1(t’-t), σY2(t’,t) = E[X2(t’)] – 2E[X(t’)Z(t)] +
E[Z2(t)] – [m1(t’-t)]2, and E[X2(t’)] = σX2 + { m0+m1t’}2. However, due to the stationary
covariance assumption we can reduce further the expressions for E [Z2(t)] and E[X(t’)Z(t)] in
σY2(t’,t). First E [Z2(t)] reduces to
\[ \begin{aligned} E[Z^2(t)] &= \frac{1}{T^2} \int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, \{ c_X(u,u') + (m_0 + m_1 u)(m_0 + m_1 u') \} \\ &= \frac{1}{T^2} \int_{t-T/2}^{t+T/2} du \int_{t-T/2}^{t+T/2} du'\, c_X(|u-u'|) + (m_1 t + m_0)^2 . \end{aligned} \tag{C.18} \]
Similarly E[X(t’)Z(t)] reduces to
\[ \begin{aligned} E[X(t')Z(t)] &= \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, \{ c_X(t',u) + E[X(t')]\, E[X(u)] \} \\ &= \frac{1}{T} \int_{t-T/2}^{t+T/2} du\, \{ c_X(|t'-u|) + (m_0 + m_1 t')(m_0 + m_1 u) \} . \end{aligned} \]
Defining the change of variable w = u – t, we further obtain
\[ E[X(t')Z(t)] = \frac{1}{T} \int_{-T/2}^{T/2} dw\, \{ c_X(|t'-w-t|) + (m_0 + m_1 t')(m_0 + m_1 w + m_1 t) \} . \]
Or, renaming the dummy integration variable w as u, the equation is simply
\[ \begin{aligned} E[X(t')Z(t)] &= \frac{1}{T} \int_{-T/2}^{T/2} du\, \{ c_X(|t'-u-t|) + (m_0 + m_1 t')(m_0 + m_1 u + m_1 t) \} \\ &= \frac{1}{T} \left[ \int_{-T/2}^{t'-t} du\, c_X(t'-u-t) + \int_{t'-t}^{T/2} du\, c_X(-t'+u+t) \right] + (m_1 t + m_0)(m_1 t' + m_0) . \end{aligned} \tag{C.19} \]
C.3. Stationary exponential covariance case
Let’s now assume that the stationary covariance model is the superposition of n exponential
functions, so that the covariance and mean trend of the TRF X(t) are
\[ c_X(t,t') = c_X(\tau = |t-t'|) = \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\!\left( \frac{-3\,|t-t'|}{a_{ti}} \right) \quad \text{and} \quad m_X(t) = m_0 + m_1 t , \tag{C.20} \]
where $a_{ti}$ and $\sigma_{Xi}^2$ are the temporal range and variance of each exponential function, respectively.
In this case the expression for E[Z2(t)] can be expanded as follows. Defining the change of
variables w’ = u’ – t and w = u – t for Eq. (C.18), we get
\[ E[Z^2(t)] = \frac{1}{T^2} \int_{-T/2}^{T/2} dw \int_{-T/2}^{T/2} dw'\, c_X(|w-w'|) + (m_1 t + m_0)^2 . \]
Reverting back to u’ = w’ and u = w we have
\[ E[Z^2(t)] = \frac{1}{T^2} \int_{-T/2}^{T/2} du \int_{-T/2}^{T/2} du'\, c_X(|u-u'|) + (m_1 t + m_0)^2 . \]
Applying the stationary covariance model which is the superposition of n exponential functions
we obtain
\[ \begin{aligned} E[Z^2(t)] &= \frac{1}{T^2} \int_{-T/2}^{T/2} du \int_{-T/2}^{T/2} du' \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\!\left( \frac{-3\,|u-u'|}{a_{ti}} \right) + (m_1 t + m_0)^2 \\ &= \frac{1}{T^2} \int_{-T/2}^{T/2} du \left[ \int_{-T/2}^{u} du' \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\!\left( \frac{-3(u-u')}{a_{ti}} \right) + \int_{u}^{T/2} du' \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\!\left( \frac{-3(u'-u)}{a_{ti}} \right) \right] + (m_1 t + m_0)^2 \\ &= \sum_{i=1}^{n} \frac{a_{ti}\, \sigma_{Xi}^2}{3 T^2} \left[ 2T - \frac{2}{3} a_{ti} + \frac{2}{3} a_{ti} \exp\!\left( \frac{-3T}{a_{ti}} \right) \right] + (m_1 t + m_0)^2 . \end{aligned} \tag{C.21} \]
Similarly
\[ \begin{aligned} E[X(t')Z(t)] &= \frac{1}{T} \int_{-T/2}^{T/2} du\, \{ c_X(|t'-u-t|) + (m_0 + m_1 t')(m_0 + m_1 u + m_1 t) \} \\ &= \frac{1}{T} \int_{-T/2}^{T/2} du \left\{ \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\!\left( -3\, \frac{|t'-u-t|}{a_{ti}} \right) + (m_0 + m_1 t')(m_0 + m_1 u + m_1 t) \right\} \\ &= \sum_{i=1}^{n} \frac{a_{ti}\, \sigma_{Xi}^2}{3T} \left[ 2 - \exp\!\left( \frac{-3(t' + T/2 - t)}{a_{ti}} \right) - \exp\!\left( \frac{-3(-t' + t + T/2)}{a_{ti}} \right) \right] + (m_1 t + m_0)(m_1 t' + m_0) . \end{aligned} \tag{C.22} \]
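Using Eqs. (C.21) and (C.22) we can write the variance of Y(t’,t) as
\[ \begin{aligned} \sigma_Y^2(|t'-t|) = \sum_{i=1}^{n} \sigma_{Xi}^2 &- 2 \sum_{i=1}^{n} \frac{a_{ti}\, \sigma_{Xi}^2}{3T} \left[ 2 - \exp\!\left( \frac{-3(t' + T/2 - t)}{a_{ti}} \right) - \exp\!\left( \frac{-3(-t' + t + T/2)}{a_{ti}} \right) \right] \\ &+ \sum_{i=1}^{n} \frac{a_{ti}\, \sigma_{Xi}^2}{3 T^2} \left[ 2T - \frac{2}{3} a_{ti} + \frac{2}{3} a_{ti} \exp\!\left( \frac{-3T}{a_{ti}} \right) \right] \end{aligned} \tag{C.23} \]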
Eq. (C.23) is valid for the superposition of n exponential models, which can be simplified
into Eq. (C.24) for example when dealing with one exponential covariance function
\[ \begin{aligned} \sigma_Y^2(|t'-t|) = \sigma_X^2 &- 2\, \frac{a_t\, \sigma_X^2}{3T} \left[ 2 - \exp\!\left( \frac{-3(t' + T/2 - t)}{a_t} \right) - \exp\!\left( \frac{-3(-t' + t + T/2)}{a_t} \right) \right] \\ &+ \frac{a_t\, \sigma_X^2}{3 T^2} \left[ 2T - \frac{2}{3} a_t + \frac{2}{3} a_t \exp\!\left( \frac{-3T}{a_t} \right) \right] , \end{aligned} \tag{C.24} \]
where σX2 is the variance of the TRF X(t), and at is its temporal covariance range. Eq. (C.24)
can be expressed in terms of the non-dimensional groupings of variables $\sigma_Y^2(|t'-t|)/\sigma_X^2$,
$(t-t')/a_t$, and $T/a_t$, i.e.
\[ \begin{aligned} \frac{\sigma_Y^2(|t'-t|)}{\sigma_X^2} = 1 &- \frac{2}{3}\, \frac{1}{T/a_t} \left[ 2 - \exp\!\left( 3\, \frac{t-t'}{a_t} - \frac{3}{2}\, \frac{T}{a_t} \right) - \exp\!\left( -3\, \frac{t-t'}{a_t} - \frac{3}{2}\, \frac{T}{a_t} \right) \right] \\ &+ \frac{1}{3}\, \frac{1}{T/a_t} \left[ 2 - \frac{2}{3}\, \frac{1}{T/a_t} + \frac{2}{3}\, \frac{1}{T/a_t} \exp\!\left( -3\, \frac{T}{a_t} \right) \right] . \end{aligned} \tag{C.25} \]
Usually when generating soft data we will use t’=t. The equation for the soft data variance is
then simply obtained by setting (t-t’)/at =0 in Eq. (C.25), which leads to
\[ \frac{\sigma_Y^2}{\sigma_X^2} = 1 - \frac{2}{3}\, \frac{1}{T/a_t} \left[ 2 - 2 \exp\!\left( -\frac{3}{2}\, \frac{T}{a_t} \right) \right] + \frac{1}{3}\, \frac{1}{T/a_t} \left[ 2 - \frac{2}{3}\, \frac{1}{T/a_t} + \frac{2}{3}\, \frac{1}{T/a_t} \exp\!\left( -3\, \frac{T}{a_t} \right) \right] . \tag{C.26} \]
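The closed-form result for a single exponential covariance can be checked against a direct numerical evaluation of the integrals in Eq. (C.16). The sketch below is an illustration under stated assumptions: it works in non-dimensional units (σX2 = 1 and at = 1), it uses a simple midpoint rule rather than any particular integration library, and the function names are hypothetical.

```python
import numpy as np

def normalized_soft_variance(T_over_a, lag_over_a=0.0):
    """Closed form of Eq. (C.25); lag_over_a stands for (t - t')/a_t.
    With lag_over_a = 0 this is Eq. (C.26)."""
    x = float(T_over_a)
    d = float(lag_over_a)
    cross = 2.0 - np.exp(3.0 * d - 1.5 * x) - np.exp(-3.0 * d - 1.5 * x)
    block = 2.0 - (2.0 / (3.0 * x)) * (1.0 - np.exp(-3.0 * x))
    return 1.0 - (2.0 / (3.0 * x)) * cross + (1.0 / (3.0 * x)) * block

def normalized_soft_variance_numeric(T_over_a, lag_over_a=0.0, n=400):
    """Midpoint-rule evaluation of Eq. (C.16) with c_X(tau) = exp(-3|tau|),
    i.e. sigma_X^2 = 1 and a_t = 1, for an averaging window centered at t = 0."""
    x = float(T_over_a)
    u = -x / 2.0 + (np.arange(n) + 0.5) * x / n       # midpoints of the window
    t_prime = -float(lag_over_a)                      # t' - t
    cross = np.mean(np.exp(-3.0 * np.abs(t_prime - u)))               # (1/T) * single integral
    block = np.mean(np.exp(-3.0 * np.abs(u[:, None] - u[None, :])))   # (1/T^2) * double integral
    return 1.0 - 2.0 * cross + block

ratio_closed_form = normalized_soft_variance(1.0)
ratio_numeric = normalized_soft_variance_numeric(1.0)
```

As expected from the definition Y = X − Z, the ratio tends to 0 when the averaging duration T is much shorter than the covariance range and tends to 1 when T is much longer.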
Appendix D: Derivation of σY2(s’,s) accounting for different observation
scales in two-dimensional (2-D) space
D.1. Non-homogeneous 2-D spatial random field case
We now extend the framework in Appendix C by considering a 2-D spatial random field (SRF).
In the most general case, non-homogeneous SRFs are characterized by a spatially varying
mean trend function mX(s)=E[X(s)], and a covariance function cX(s, s’) that cannot be
expressed solely as a function of the spatial lag, |s-s’|, i.e.
\[ c_X(s,s') \neq c_X(|s-s'|), \quad \text{and} \quad m_X(s) \neq m_0 . \tag{D.1} \]
Z(s) is defined as the average of X(s) over the surface area As of a 2-D spatial domain
centered at s,
\[ Z(s) = \frac{1}{\|A_s\|} \int_{u \in A_s} du\, X(u) . \tag{D.2} \]
For example, the 2-D spatial domain As may correspond to the geographical extent of the
county that has its centroid located at s. We then define the SRF Y(s’,s) = X(s’) - Z(s) and
derive its expected value as
\[ E[Y(s',s)] = E[X(s') - Z(s)] = m_X(s') - \|A_s\|^{-1} \int_{u \in A_s} du\, m_X(u) , \tag{D.3} \]
and variance as
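\[ \sigma_Y^2(s',s) = E[Y^2(s',s)] - \{E[Y(s',s)]\}^2 = E[X^2(s')] - 2 E[X(s')Z(s)] + E[Z^2(s)] - \Big\{ m_X(s') - \|A_s\|^{-1} \int_{u \in A_s} du\, m_X(u) \Big\}^2 , \tag{D.4} \]
where
\[ E[X^2(s')] = \sigma_X^2(s') + \{E[X(s')]\}^2 = \sigma_X^2(s') + \{m_X(s')\}^2 , \tag{D.5} \]
\[ \begin{aligned} E[X(s')Z(s)] &= \|A_s\|^{-1} E\Big[ X(s') \int_{u \in A_s} du\, X(u) \Big] = \|A_s\|^{-1} \int_{u \in A_s} du\, E[X(s')X(u)] \\ &= \|A_s\|^{-1} \int_{u \in A_s} du\, \{ c_X(s',u) + m_X(s')\, m_X(u) \} \end{aligned} \tag{D.6} \]
\[ \begin{aligned} E[Z^2(s)] = E[Z(s)Z(s)] &= \|A_s\|^{-2} \int_{u \in A_s} du \int_{u' \in A_s} du'\, E[X(u)X(u')] \\ &= \|A_s\|^{-2} \int_{u \in A_s} du \int_{u' \in A_s} du'\, \{ c_X(u,u') + m_X(u)\, m_X(u') \} . \end{aligned} \tag{D.7} \]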
Substituting Eqs. (D.5), (D.6), and (D.7) into (D.4) yields the following mathematical
formula for the variance accounting for the uncertainty associated with the 2-D observation
scale
\[ \begin{aligned} \sigma_Y^2(s',s) = {} & \sigma_X^2(s') + \{m_X(s')\}^2 - 2 \|A_s\|^{-1} \int_{u \in A_s} du\, \{ c_X(s',u) + m_X(s')\, m_X(u) \} \\ & + \|A_s\|^{-2} \int_{u \in A_s} du \int_{u' \in A_s} du'\, \{ c_X(u,u') + m_X(u)\, m_X(u') \} - \Big\{ m_X(s') - \|A_s\|^{-1} \int_{u \in A_s} du\, m_X(u) \Big\}^2 . \end{aligned} \tag{D.8} \]
D.2. Homogeneous 2-D SRF
Let us now consider the special case of a homogeneous SRF with a zero mean trend, i.e.
$$c_X(s,s') = c_X(|s-s'|) \quad \text{and} \quad m_X(s) = 0. \tag{D.9}$$
Because $m_X(s)$ is now equal to 0, Eq. (D.3) reduces to 0. Therefore Eq. (D.4) simplifies to
$$\sigma_Y^2(|s'-s|) = E[X^2(s')] - 2E[X(s')Z(s)] + E[Z^2(s)], \tag{D.10}$$
where
$$E[X^2(s')] = \sigma_X^2(s'), \tag{D.11}$$
$$E[X(s')Z(s)] = \|A_s\|^{-1}\int_{u \in A_s} du\, c_X(s',u), \tag{D.12}$$
$$E[Z^2(s)] = \|A_s\|^{-2}\int_{u \in A_s} du \int_{u' \in A_s} du'\, c_X(u,u'). \tag{D.13}$$
Using the property $c_X(s,s') = c_X(|s-s'|)$ of homogeneous covariance models, we further expand Eq. (D.12) as
$$E[X(s')Z(s)] = \|A_s\|^{-1}\int_{u \in A_s} du\, c_X(|u-s'|). \tag{D.14}$$
We replace the integration variable $u$ with a new integration variable $\mathbf{r}$ defined as $\mathbf{r} = u - s$ (see Fig. D.1). The integration domain for $\mathbf{r}$ corresponds to
$$u \in A_s \Leftrightarrow \mathbf{r}+s \in A_s \Leftrightarrow \mathbf{r} \in A(s-s) \Leftrightarrow \mathbf{r} \in A(0),$$
where $A(0)$ is the 2-D spatial averaging domain centered at the origin (i.e. with its centroid located at 0). Performing the change of variable in Eq. (D.14) results in
$$E[X(s')Z(s)] = \|A_s\|^{-1}\int_{\mathbf{r} \in A(0)} d\mathbf{r}\, c_X(|\mathbf{r}-(s'-s)|). \tag{D.15}$$
This equation can be integrated numerically for any shape of the averaging domain $A(0)$. However, a reasonable approximation of the averaging domain $A(0)$ is a circle of the same area as $A_s$, i.e. with a radius $R$ such that $\pi R^2 = \|A_s\|$. In this case it is convenient to replace the Cartesian integration variable $\mathbf{r} = [r_1, r_2]$ with the polar coordinates $(r,\theta)$ defined by $r_1 = r\cos\theta$ and $r_2 = r\sin\theta$ (see Fig. D.1). Performing this change of coordinate system for a circular spatial domain $A(0)$ of radius $R$ leads to
$$E[X(s')Z(s)] = (\pi R^2)^{-1}\int_0^R dr \int_0^{2\pi} d\theta\, r\, c_X(|\mathbf{r}-(s'-s)|) = (\pi R^2)^{-1}\int_0^R dr \int_0^{2\pi} d\theta\, r\, c_X(|l|), \tag{D.17}$$
where $|l| = \sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}$.
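Figure D.1: A 2-D circular spatial domain used to solve for $E[X(s')Z(s)]$. [Figure not reproduced. Recoverable labels: axes $u_1$ and $u_2$; points $s = (s_1, s_2)$, $s' = (s_1', s_2')$, and $u$; vector $\mathbf{r}$ with polar angle $\theta$; lag $l$; circle radius $R$.]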
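The polar-coordinate integral in Eq. (D.17) is straightforward to approximate numerically. The following Python sketch uses a midpoint rule in $r$ and $\theta$. The function names `exp_cov` and `cross_moment`, the default grid sizes, and the choice of quadrature rule are assumptions made for illustration; they are not the numerical implementation used in this work.

```python
import math


def exp_cov(lag, sigma_x2=1.0, a_r=1.0):
    """Single exponential covariance c_X(h) = sigma_x2 * exp(-3 h / a_r)."""
    return sigma_x2 * math.exp(-3.0 * lag / a_r)


def cross_moment(cov, s, s_prime, R, n_r=200, n_theta=200):
    """Midpoint-rule approximation of E[X(s')Z(s)] in Eq. (D.17).

    cov      : homogeneous covariance as a function of the scalar lag
    s        : (s1, s2), center of the circular averaging domain
    s_prime  : (s1', s2'), location of the point value X(s')
    R        : radius of the circular averaging domain
    """
    h_r = R / n_r                      # radial step
    h_t = 2.0 * math.pi / n_theta      # angular step
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * h_r
        for j in range(n_theta):
            theta = (j + 0.5) * h_t
            # |l| as defined below Eq. (D.17)
            lag = math.hypot(s[0] - s_prime[0] + r * math.cos(theta),
                             s[1] - s_prime[1] + r * math.sin(theta))
            total += r * cov(lag)
    return total * h_r * h_t / (math.pi * R ** 2)


if __name__ == "__main__":
    # Point located at the center of the averaging circle (s = s')
    print(cross_moment(exp_cov, (0.0, 0.0), (0.0, 0.0), R=1.0))
    # Point located outside the averaging circle
    print(cross_moment(exp_cov, (0.0, 0.0), (2.0, 0.0), R=1.0))
```

For $s = s'$ the lag reduces to $r$, so the result can be compared with the one-dimensional integral $2R^{-2}\int_0^R dr\, r\, c_X(r)$, which provides a simple check on the quadrature.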
We now consider the third term on the right-hand side of Eq. (D.10). Under the homogeneity assumption, Eq. (D.13) reduces to
$$E[Z^2(s)] = \|A_s\|^{-2}\int_{u \in A_s} du \int_{u' \in A_s} du'\, c_X(|u-u'|). \tag{D.18}$$
Similarly to the derivation of $E[X(s')Z(s)]$, using a polar integration coordinate system for a circular averaging domain $A_s$ of radius $R$ we get
$$E[Z^2(s)] = (\pi R^2)^{-2}\int_0^R dr \int_0^{2\pi} d\theta \int_0^R dr' \int_0^{2\pi} d\theta'\, r\, r'\, c_X(|\mathbf{r}-\mathbf{r}'|), \tag{D.19}$$
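where $|\mathbf{r}-\mathbf{r}'| = \sqrt{(r\cos\theta - r'\cos\theta')^2 + (r\sin\theta - r'\sin\theta')^2} = \sqrt{r^2 + r'^2 - 2rr'\cos(\theta'-\theta)}$.
Defining the change of variables $r = r$, $r' = r'$, $\theta = \theta$ and $\alpha = \theta'-\theta$, and noting that the integrand is $2\pi$-periodic in $\alpha$ so that the $\alpha$ integration can still be taken over $[0, 2\pi]$, we further obtain
$$\begin{aligned}
E[Z^2(s)] &= (\pi R^2)^{-2}\int_0^R dr \int_0^{2\pi} d\theta \int_0^R dr' \int_0^{2\pi} d\alpha\, r\, r'\, c_X\!\left(\sqrt{r^2 + r'^2 - 2rr'\cos\alpha}\right) \\
&= (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\, 2\pi\, r\, r'\, c_X\!\left(\sqrt{r^2 + r'^2 - 2rr'\cos\alpha}\right).
\end{aligned} \tag{D.20}$$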
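Figure D.2: A 2-D circular spatial domain used to solve for $E[Z^2(s)]$. [Figure not reproduced. Recoverable labels: axes $u_1$ and $u_2$; points $u$ and $u'$ at polar coordinates $(r,\theta)$ and $(r',\theta')$ about the center $s = s'$; distance $|u-u'|$; circle radius $R$.]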
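Consequently, substituting Eqs. (D.11), (D.17), and (D.20) into Eq. (D.10) leads to
$$\begin{aligned}
\sigma_Y^2(|s'-s|) = \sigma_X^2(s') &- 2(\pi R^2)^{-1}\int_0^R dr \int_0^{2\pi} d\theta\, r\, c_X\!\left(\sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}\right) \\
&+ (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\, 2\pi\, r\, r'\, c_X\!\left(\sqrt{r^2 + r'^2 - 2rr'\cos\alpha}\right).
\end{aligned} \tag{D.21}$$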
D.3. Application of homogeneous exponential covariance model
Let us now assume that the homogeneous covariance model is the superposition of $n$ exponential functions, so that the covariance model can be expressed as
$$c_X(|s-s'|) = \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(-\frac{3|s-s'|}{a_{ri}}\right), \tag{D.22}$$
where $\sigma_{Xi}^2$ and $a_{ri}$ are the variance and spatial range of each exponential covariance function, respectively. Using the superposition of $n$ exponential models leads to the following equation for $\sigma_Y^2(|s'-s|)$:
$$\begin{aligned}
\sigma_Y^2(|s'-s|) = \sigma_X^2 &- 2(\pi R^2)^{-1}\int_0^R dr \int_0^{2\pi} d\theta\, r \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(-\frac{3\,d_1(r,s,s',\theta)}{a_{ri}}\right) \\
&+ (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\, 2\pi\, r\, r' \sum_{i=1}^{n} \sigma_{Xi}^2 \exp\left(-\frac{3\,d_2(r,r',\alpha)}{a_{ri}}\right),
\end{aligned} \tag{D.23}$$
where $d_1(r,s,s',\theta) = \sqrt{(s_1 - s_1' + r\cos\theta)^2 + (s_2 - s_2' + r\sin\theta)^2}$ and $d_2(r,r',\alpha) = \sqrt{r^2 + r'^2 - 2rr'\cos\alpha}$.
This equation is valid for the superposition of any number of exponential models. In the case of a single exponential covariance model, i.e. $n = 1$, the covariance function is written as $c_X(|s-s'|) = \sigma_X^2 \exp(-3|s-s'|/a_r)$, and Eq. (D.23) reduces to
$$\begin{aligned}
\sigma_Y^2(|s'-s|) = \sigma_X^2 &- 2(\pi R^2)^{-1}\int_0^R dr \int_0^{2\pi} d\theta\, r\, \sigma_X^2 \exp\left(-\frac{3\,d_1(r,s,s',\theta)}{a_r}\right) \\
&+ (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\, 2\pi\, r\, r'\, \sigma_X^2 \exp\left(-\frac{3\,d_2(r,r',\alpha)}{a_r}\right),
\end{aligned} \tag{D.24}$$
where $\sigma_X^2$ is the variance of the SRF $X(s)$, and $a_r$ is its spatial covariance range. Usually we seek an X soft datum at the centroid of the Z hard data (i.e. $s = s'$, so that the X soft datum is located at the center of the circular averaging area $A_s$). In this case $d_1 = r$, the $\theta$ integration contributes a factor of $2\pi$, and Eq. (D.24) is further reduced to
$$\sigma_Y^2 = \sigma_X^2 - 4R^{-2}\int_0^R dr\, r\, \sigma_X^2 \exp(-3r/a_r) + (\pi R^2)^{-2}\int_0^R dr \int_0^R dr' \int_0^{2\pi} d\alpha\, 2\pi\, r\, r'\, \sigma_X^2 \exp\left(-\frac{3\,d_2(r,r',\alpha)}{a_r}\right). \tag{D.25}$$
As can be seen from this equation, the variance $\sigma_Y^2$ describing the uncertainty associated with the observation scale of a 2-D circular averaging domain is a function of the variance and spatial range of the SRF $X(s)$, as well as of the radius $R$ of the averaging spatial domain characterizing the observation scale.
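To illustrate this dependence, Eq. (D.25) can be approximated with a simple quadrature rule. The sketch below is a minimal Python example under stated assumptions: the function name `sigma_y2_circle`, the number of quadrature nodes `n`, and the use of the midpoint rule are illustrative choices, not the numerical implementation used in this work.

```python
import math


def sigma_y2_circle(sigma_x2, a_r, R, n=60):
    """Midpoint-rule approximation of Eq. (D.25) for s = s'.

    sigma_x2 : variance of the SRF X(s)
    a_r      : spatial range of the exponential covariance
    R        : radius of the circular averaging domain
    n        : number of midpoint nodes per integration variable (assumption)
    """
    h = R / n
    radii = [(i + 0.5) * h for i in range(n)]

    # Second term: 4 R^-2 * int_0^R dr r sigma_x2 exp(-3 r / a_r)
    cross = sum(r * math.exp(-3.0 * r / a_r) for r in radii) * h
    cross *= 4.0 * sigma_x2 / R ** 2

    # Third term: (pi R^2)^-2 * triple integral of 2 pi r r' c_X(d2)
    h_a = 2.0 * math.pi / n
    cos_alpha = [math.cos((k + 0.5) * h_a) for k in range(n)]
    block = 0.0
    for r in radii:
        for rp in radii:
            base = r * r + rp * rp
            inner = 0.0
            for c in cos_alpha:
                # d2(r, r', alpha); max() guards against tiny negative round-off
                d2 = math.sqrt(max(base - 2.0 * r * rp * c, 0.0))
                inner += math.exp(-3.0 * d2 / a_r)
            block += r * rp * inner
    block *= h * h * h_a * 2.0 * math.pi * sigma_x2 / (math.pi * R ** 2) ** 2

    return sigma_x2 - cross + block


if __name__ == "__main__":
    # Variance of Y for averaging radii that are small and large relative to a_r
    for radius in (0.01, 1.0, 100.0):
        print(radius, sigma_y2_circle(1.0, 1.0, radius))
```

In this sketch the variance is close to zero when $R$ is much smaller than $a_r$ (the areal average is then nearly identical to the point value) and approaches $\sigma_X^2$ when $R$ is much larger than $a_r$.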
Appendix E: Some notes regarding the first and second arsenic datasets
E.1. The first arsenic dataset
• Retrieved from the United States Geological Survey (USGS) National Water Information System (NWIS) in 2001.
• Updated from USGS Water-Resources Investigations Report 99-4279 (Focazio et al., 2000).
• Dataset name: arsenic_nov2001.txt, which is publicly available from the USGS website.
• Time period: 1973-2001.
• A subset of 20,043 arsenic measurements covering only New England.
• Consistent and accurate arsenic sampling procedures (i.e. collecting, saving, and transporting samples) maintained by USGS.
• USGS developed the field protocols in the early 1990s.
• The previous sampling methods were tested by the USGS Office of Water Quality.
• A representative analytical method used is Inductively Coupled Plasma Mass Spectrometry (ICP-MS), which was one of the latest available methods at the time that the dataset was generated.
E.2. The second arsenic dataset
• Product of Water-Resources Investigations Report 99-4162 (Ayotte et al., 1999) by the USGS National Water-Quality Assessment (NAWQA).
• Dataset generated for the purpose of monitoring compliance with the Federal Safe Drinking Water Act.
• Constructed by assembling arsenic measurements from the New England states of Maine (ME), New Hampshire (NH), Massachusetts (MA), and Rhode Island (RI).
• The states use different detection limits (i.e. 1 µg/L for ME, and 5 µg/L for NH, MA, and RI), so a conservative detection limit of 5 µg/L is used for the whole dataset.
• Some data above the detection limit were lost due to the increased reporting level applied over the whole of New England.
• No information is available in the report by Ayotte et al. (1999) concerning the analytical techniques used.
• Each state maintains its own safe drinking-water program in good agreement with Federal standards.
• Sampling procedures and analytical methods were set at the state level.
References
Abernathy, C. O., Y.-P. Liu, D. Longfellow, H. V. Aposhian, B. Beck, B. Fowler, R. Goyer, R. Menzer, T. Rossman, C. Thompson, and M. Waalkes, 1999. Arsenic: Health Effects, Mechanisms of Actions, and Research Issues, Environmental Health Perspectives, Vol. 107, No. 7, pp. 593-597.
Armstrong, M., 1998. Basic Linear Geostatistics, Springer, Berlin, 153 p.
Ayotte, J.D., M.G. Nielsen, G.R. Robinson, Jr., and R.B. Moore, 1999. Relation of Arsenic, Iron, and Manganese in Ground Water to Aquifer Type, Bedrock Lithogeochemistry, and Land Use in the New England Coastal Basins, Water-Resources Investigations Report 99-4162.
Bates, M. N., A. H. Smith, and K. P. Cantor, 1995. Case-Control Study of Bladder Cancer and Arsenic in Drinking Water, American Journal of Epidemiology, Vol. 141, No. 6, pp. 523-529.
Beaty, Richard D., and Jack D. Kerber, 1993. Concepts, Instrumentation and Techniques in Atomic Absorption Spectrophotometry, Second edition, The Perkin-Elmer Corporation, Norwalk, CT.
Bhattacharya, P., A. H. Welch, K. M. Ahmed, G. Jacks, and R. Naidu, 2004. Arsenic in Groundwater of Sedimentary Aquifers, Applied Geochemistry, 19, pp. 163-167.
Braman, R.S., and C.C. Foreback, 1973. Methylated Forms of Arsenic in the Environment, Science, Vol. 182, pp. 1247-1249.
Buescher, P., and K. Jones-Vessey, 1999. Childhood Asthma in North Carolina, A Special Report Series by the State Center for Health Statistics, No. 113.
Choi, K.-M., M. L. Serre, and G. Christakos, 2003. Efficient Mapping of California Mortality Fields at Different Spatial Scales, Journal of Exposure Analysis and Environmental Epidemiology, 13, pp. 120-133.
Christakos, G., 1990. A Bayesian/Maximum-Entropy View to the Spatial Estimation Problem, Mathematical Geology, Vol. 22, No. 7, pp. 763-777.
Christakos, G., 1992. Random Field Models in Earth Sciences, Dover Publications, Inc., Mineola, NY, 474 p.
Christakos, G., and M. L. Serre, 2000a. BME Analysis of Spatiotemporal Particulate Matter Distribution in North Carolina, Atmospheric Environment, 34, pp. 3393-3406.
Christakos, G., 2000b. Modern Spatiotemporal Geostatistics, Oxford University Press, 288 p.
Christakos, G., M. L. Serre, and J. L. Kovitz, 2001. BME Representation of Particulate Matter Distributions in the State of California on the Basis of Uncertain Measurements, Journal of Geophysical Research, Vol. 106, No. D9, pp. 9717-9731.
Christakos, G., P. Bogaert, and M. L. Serre, 2002. Advanced Functions of Temporal GIS, Springer-Verlag, New York, N.Y., 264 p.
Clark, N. M., R. W. Brown, E. Parker, T. G. Robins, D. G. Remick Jr, M. A. Philber, G. J. Keeler, and B. A. Israel, 1999. Childhood Asthma, Environmental Health Perspectives, Vol. 107, S3, pp. 421-429.
Colt, J.S., D. Baris, S.F. Clark, J.D. Ayotte, M. Ward, J.R. Nuckols, K.P. Cantor, D.T. Silverman, and M. Karagas, 2002. Sampling Private Wells at Past Homes to Estimate Arsenic Exposure: A Methodologic Study in New England, Journal of Exposure Analysis and Environmental Epidemiology, 12, pp. 329-334.
Environmental Protection Agency (U.S. EPA) report, 1981. Investigation of Arsenic Sources in Groundwater, Environmental Protection Agency; U.S. GPO: Washington, DC.
Environmental Protection Agency (U.S. EPA), 2000. Arsenic Occurrence in Public Drinking Water Supplies, EPA-815-R-00-023, December.
Focazio, M. J., A. H. Welch, S. A. Watkins, D. R. Helsel, and M. A. Horn, 2000. A Retrospective Analysis on the Occurrence of Arsenic in Ground-Water Resources of the United States and Limitations in Drinking-Water-Supply Characterizations, Water-Resources Investigations Report 99-4279, United States Geological Survey.
Freeman, N. C. G., D. Schneider, and P. McGarvey, 2003. Household Exposure Factors, Asthma, and School Absenteeism in a Predominantly Hispanic Community, Journal of Exposure Analysis and Environmental Epidemiology, Vol. 13, pp. 169-176.
Geological Survey (USGS), 2001. Available from http://water.usgs.gov/nawqa/trace/data/arsenic_nov2001.txt.
Gergen, P. J., D. I. Mullally, and R. Evans III, 1988. National Survey of Prevalence of Asthma Among Children in the United States, 1976 to 1980, Pediatrics, Vol. 81, No. 1, pp. 1-7.
Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation, Oxford University Press, New York, 483 p.
Greschonig, H., and K.J. Irgolic, 1997. The Mercuric-Bromide-Stain and the Natelson Method for the Determination of Arsenic: Implications for Assessment from Exposure to Arsenic in Taiwan, pp. 17-31 in Arsenic: Exposure and Health Effects, edited by C.O. Abernathy, R.L. Calderon, and W.R. Chappell, Chapman & Hall, London.
Guo, H.R., C.J. Chen, and H.L. Greene, 1994. Arsenic in Drinking Water and Cancers: a Brief Descriptive Review of Taiwan Studies, in Arsenic Exposure and Health (eds W.R. Chappell, C.O. Abernathy, and C.R. Cothern), Science and Technology Letters, Northwood, pp. 129-138.
Hernandez, A., J. Von Behren, R. Kreutzer, and B. McLaughlin, 2000. California County Asthma Hospitalization Chart Book, California Department of Health Services, Environmental Health Investigations Branch.
Hinkle, S. R., and D. J. Polette, 1999. Arsenic in Ground Water of the Willamette Basin, Water-Resources Investigation Report 98-4205, United States Geological Survey.
Hopenhayn-Rich, C., M. L. Biggs, and A. H. Smith, 1998. Lung and Kidney Cancer Mortality Associated with Arsenic in Drinking Water in Cordoba, Argentina, International Journal of Epidemiology, Vol. 27, pp. 561-569.
Isaaks, E. H., and R.M. Srivastava, 1989. Applied Geostatistics, Oxford University Press, New York, 561 p.
Journel, A., and C. J. Huijbregts, 1978. Mining Geostatistics, Academic Press, London, U.K., 600 p.
Karagas, M. R., T.D. Tosteson, J. Blum, J. Steven Morris, J.A. Baron, and B. Klaue, 1998. Design of an Epidemiologic Study of Drinking Water Arsenic Exposure and Skin and Bladder Cancer Risk in a U.S. Population, Environmental Health Perspectives, 106, pp. 1047-1050.
Karagas, M. R., T. D. Tosteson, J. S. Morris, E. Demidenko, L. A. Mott, J. Heaney, and A. Schned, 2004. Incidence of Transitional Cell Carcinoma of the Bladder and Arsenic Exposure in New England, Cancer Causes and Control, Vol. 15, pp. 465-472.
Keller, J.M., J. M. Giaquinto, and A. M. Meeks, 1996. Characterization of the MVST Waste Tanks Located at ORNL, ORNL/TM-13357, Chemical and Analytical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN.
Kinniburgh, D.G., and W. Kosmus, 2002. Arsenic Contamination in Groundwater: Some Analytical Considerations, Talanta, 58, pp. 165-180.
Klaue, B., and J.D. Blum, 1999. Trace Analyses of Arsenic in Drinking Water by Inductively Coupled Plasma Mass Spectrometry: High Resolution Versus Hydride Generation, Analytical Chemistry, Vol. 71, No. 7, pp. 1408-1414.
Krivoruchko, K., and C.A. Gotway, 2004. Creating Exposure Maps Using Kriging, Public Health GIS News and Information, Vol. 56, pp. 11-16.
Lai, D., 2004. Geostatistical Analysis of Chinese Cancer Mortality: Variogram, Kriging and Beyond, Journal of Data Science, Vol. 2, No. 2, pp. 177-193.
Lane, W. G., and M. C. Edwards, 2003. Asthma in Maryland 2003, Maryland Asthma Control Program, Family Health Administration, 410-767-6713.
Lewis, T. C., T. G. Robins, J. T. Dvonch, G. J. Keeler, F. Y. Yip, G. B. Mentz, X. Lin, E. A. Parker, B. A. Israel, L. Gonzalez, and Y. Hill, 2005. Air Pollution-Associated Changes in Lung Function among Asthmatic Children in Detroit, Environmental Health Perspectives, Vol. 113, No. 8, pp. 1068-1075.
Manninen, P., Presentation of the Implementation of Use of Reference Materials in an Application-Arsenic by ICP-MS, Consulting Engineers Paavo Ristola Ltd. (www.vtt.fi/pro/eurachsf/manninen.pdf).
McConnell, R., K. Berhane, F. Gilliland, S. J. London, T. Islam, W. J. Gauderman, E. Avol, H. G. Margolis, and J. M. Peters, 2002. Asthma in Exercising Children Exposed to Ozone: A Cohort Study, The Lancet, Vol. 359, pp. 386-391.
Melamed, D., 2004. Monitoring Arsenic in the Environment: A Review of Science and Technologies for Field Measurements and Sensors, EPA 542/R-04/002, U.S. EPA, Washington, DC.
National Research Council (NRC), 1999. Arsenic in Drinking Water, National Academy Press, Washington, DC.
National Research Council (NRC), 2001. Arsenic in Drinking Water, National Academy Press, Washington, DC.
Olea, R., 1999. Geostatistics for Engineers and Earth Scientists, Kluwer Academic Publishers, Boston, 303 p.
Oyana, T. J., and J. S. Lwebuga-Mukasa, 2004. Spatial Relationships Among Asthma Prevalence, Health Care Utilization, and Pollution Sources in Neighborhoods of Buffalo, New York, Journal of Environmental Health, Vol. 66, No. 8, pp. 25-37.
Peters, S.C., J.D. Blum, B. Klaue, and M.R. Karagas, 1999. Arsenic Occurrence in New Hampshire Drinking Water, Environmental Science and Technology, Vol. 33, No. 9, pp. 1328-1333.
Sanchez, F., A.C. Garrabrants, C. Vandecasteele, P. Moszkowicz, and D.S. Kosson, 2003. Environmental Assessment of Waste Matrices Contaminated with Arsenic, Journal of Hazardous Materials, B96, pp. 229-257.
Schnoor, J. L., 1996. Environmental Modeling: Fate and Transport of Pollutants in Water, Air, and Soil, John Wiley & Sons, Inc.
Serre, M. L., P. Bogaert, and G. Christakos, 1998. Latest Computational Results in Spatiotemporal Prediction Using the Bayesian Maximum Entropy Method, in A. Buccianti, G. Nardi and R. Potenza, editors, Proceedings of IAMG '99 - Fifth Annual Conference of the International Association for Mathematical Geology, 1, 117-122, De Frede Editore, Napoli.
Serre, M. L., and G. Christakos, 1999a. Modern Geostatistics: Computational BME in the Light of Uncertain Physical Knowledge--The Equus Beds Study, Stochastic Environmental Research and Risk Assessment, Vol. 13, No. 1, pp. 1-26.
Serre, M. L., 1999b. Environmental Spatiotemporal Mapping and Groundwater Flow Modeling using the BME and ST methods, Ph.D. Dissertation, Department of Environmental Sciences & Engineering, University of North Carolina at Chapel Hill, NC, USA, 236 p.
Serre, M.L., A. Kolovos, G. Christakos, and K. Modis, 2003. An Application of the Holistochastic Human Exposure Methodology to Naturally Occurring Arsenic in Bangladesh Drinking Water, Risk Analysis, Vol. 23, No. 3, pp. 515-528.
Stein, M.L., 1999. Interpolation of Spatial Data: Some Theory for Kriging, Springer-Verlag, New York, 264 p.
Sturm, J. J., K. Yeatts, and D. Loomis, 2004. Effects of Tobacco Smoke Exposure on Asthma Prevalence and Medical Care Use in North Carolina Middle School Children, American Journal of Public Health, Vol. 94, No. 2, pp. 308-313.
Thomas, Robert, 2003. Practical Guide to ICP-MS, Marcel Dekker, 336 p.
Wackernagel, H., 1995. Multivariate Geostatistics: An Introduction with Applications, Springer-Verlag, Berlin, 256 p.
Warner, K.L., Angel Martin Jr., and Terri L. Arnold, 2003. Arsenic in Illinois Ground Water-Community and Private Supplies, United States Geological Survey Water-Resources Investigation Report 03-4103. http://il.water.usgs.gov/pubs/wrir03_4103.pdf.
Weiss, K.B., S. D. Sullivan, and C. S. Lytle, 2000. Trends in the Cost for Asthma in the United States, 1985-1999, Journal of Allergy and Clinical Immunology, Vol. 106, pp. 493-499.
Welch, A. H., D. B. Westjohn, D. R. Helsel, and R. B. Wanty, 2000. Arsenic in Ground Water of the United States: Occurrence and Geochemistry, Ground Water, Vol. 38, No. 4, pp. 589-604.
Welhan, J., and M. Merrick, 2003. Statewide Network Data Analysis and Kriging Project-Final Report, Idaho Geological Survey. http://www.idwr.state.id.us/hydrologic/info/statewide/IGS_Kriging_Project-Final_Report.pdf.
Yeatts, K.B., M.L. Serre, and S.-J. Lee, 2004. Spatial Distribution of Wheezing Prevalence and Air Pollution across North Carolina, Sixteenth Conference of the International Society for Environmental Epidemiology, New York City, NY, USA, August 1-4.
Yu, Winston H., C. M. Harvey, and C. F. Harvey, 2003. Arsenic in Groundwater in Bangladesh: A Geostatistical and Epidemiological Framework for Evaluating Health Effects and Potential Remedies, Water Resources Research, Vol. 39, No. 6, 1146, doi:10.1029/2002WR001327.
Zmirou, D., S. Gauvin, I. Pin, I. Momas, F. Sahraoui, J. Just, Y. Le Moullec, F. Bremont, S. Cassadou, P. Reungoat, M. Albertini, N. Lauvergne, M. Chiron, A. Labbe, Vesta investigators, 2004. Traffic Related Air Pollution and Incidence of Childhood Asthma: Results of the Vesta Case-Control Study, Journal of Epidemiology and Community Health, 58, pp. 19-23.