
Risk Analysis DOI: 10.1111/j.1539-6924.2012.01790.

A Bayesian Method to Mine Spatial Data Sets to Evaluate the Vulnerability of Human Beings to Catastrophic Risk

Lianfa Li,1,2,∗ Jinfeng Wang,1 Hareton Leung,2 and Sisi Zhao1

Vulnerability of human beings exposed to a catastrophic disaster is affected by multiple factors that include hazard intensity, environment, and individual characteristics. The traditional approach to vulnerability assessment, based on the aggregate-area method and unsupervised learning, cannot incorporate spatial information; thus, vulnerability can be only roughly assessed. In this article, we propose Bayesian network (BN) and spatial analysis techniques to mine spatial data sets to evaluate the vulnerability of human beings. In our approach, spatial analysis is leveraged to preprocess the data; for example, kernel density analysis (KDA) and accumulative road cost surface modeling (ARCSM) are employed to quantify the influence of geofeatures on vulnerability and relate such influence to spatial distance. The knowledge- and data-based BN provides a consistent platform to integrate a variety of factors, including those extracted by KDA and ARCSM, to model vulnerability uncertainty. We also consider the model's uncertainty and use the Bayesian model average and Occam's Window to average the multiple models obtained by our approach for robust prediction of risk and vulnerability. We compare our approach with other probabilistic models in a case study of seismic risk and conclude that our approach is a good means of mining spatial data sets to evaluate vulnerability.

KEY WORDS: Bayesian network; data mining; spatial analysis; vulnerability

1. INTRODUCTION

Vulnerability of human beings to catastrophic risk denotes the degree to which an individual is subject to the damage arising from a catastrophic disaster. A great degree of vulnerability will result in considerable damage to the individuals exposed to catastrophic hazards. In China, the last recorded event was the Wenchuan earthquake of 8 Mw on May 12, 2008, that resulted in casualties of about 300,000 people (more than 90,000 were dead or missing) and property loss of U.S. $20 billion.(1,2) Another catastrophic event was the tsunami occurring on December 26, 2004, beneath the Indian Ocean west of Sumatra, Indonesia, with the loss of more than 300,000 lives and the displacement of over 1,000,000 people.(3) These two catastrophic events, which caused a huge loss of lives and properties, resulted from the emergency nature of such catastrophic hazards; vulnerability from unpredictable natural disasters is the primary reason for such huge casualties and loss. It is undeniable that it is difficult to monitor and predict natural hazards precisely because of our limited knowledge of their mechanisms and underlying causes. However, given historical patterns of natural hazards, we can obtain useful information about

1 LREIS, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China.
2 Department of Computing, The Hong Kong Polytechnic University, Hong Kong.
∗ Address correspondence to Lianfa Li, LREIS, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China; lspatial@gmail.com.

0272-4332/12/0100-0001$22.00/1 © 2012 Society for Risk Analysis

vulnerability to catastrophic disasters using adaptive evaluation methods and increasing spatiotemporal data sets. But in the case of the two disasters detailed earlier, vulnerability was so poorly evaluated that people had no time to prepare for the sudden occurrence of these catastrophic disasters, which in turn caused huge casualties. If we had a precise evaluation of vulnerability to catastrophic risk, even given the occurrence of such a disaster, more loss could have been avoided and more lives could have been saved.

Conversely, vulnerability is affected by multiple uncertain factors that include hazard intensity, environment, and individual characteristics;(4) thus, evaluation of vulnerability presents ongoing challenges. Many existing approaches to vulnerability assessment are based on the aggregate-area method and unsupervised learning, that is, observations with assumed latent variables,(5−9) making these approaches unable to fully use spatial information. Vulnerability, therefore, is evaluated only roughly, for the following reasons:

• First, in the aggregate-area-based method, related factors are aggregated into census-designated statistical areas. The size of the aggregated area has a significant influence on the result of statistical analysis (different conclusions might be obtained where there are aggregated areas of different sizes).(10) Consequently, approaches based on the aggregate area cannot satisfy the requirement of specific location of vulnerability, such as rescue of human beings, accurate assessment, and planning for the disaster.
• Second, some existing approaches to vulnerability assessment, such as fuzzy comprehensive evaluation,(11,12) are based primarily on domain knowledge and anecdotal evidence gathered by local experts, in which the empirical data used in the model are difficult to validate. The result obtained thereby indicates simply where the vulnerability is high or low at the census-based statistical area but does not show the vulnerability of the landscape at a fine resolution.
• Third, in these approaches, techniques of spatial analysis have seldom been used, and this has made it difficult to incorporate spatial information. Spatial analysis can detect patterns of geofeatures or events occurring in geographic space according to their spatial or nonspatial attributes.(13,14) Techniques of spatial analysis such as kernel density analysis (KDA) and accumulative road cost surface modeling (ARCSM) can quantify such spatial patterns that are used with other quantitative factors in vulnerability assessment. This extends the set of predictive factors that can improve vulnerability prediction.

In this article, we propose a spatial data mining approach to modeling the vulnerability of human beings exposed to catastrophic hazards. Data mining is the process of extracting patterns from data. It is an important tool for transforming the data into useful information. In terms of vulnerability assessment, spatial data mining processes geospatial data to obtain the prediction models (structure and parameters) that relate predictive factors to risk and vulnerability. Technically, we use a Bayesian network (BN) to make the uncertainty analysis and computation of vulnerability. Our approach is based on a grid data set supported by a geographical information system (GIS). Using the GIS-supported grid data set, spatial analysis such as KDA and ARCSM can be applied to derive new predictive factors from the related geofeatures, obtain a table of all the cells in the grid data set (with each as a sample for uncertainty inference), amass knowledge from the BN to predict new cases, and visualize the results in GIS. In this article, we make the following contributions:

• Use spatial analysis techniques such as KDA and ARCSM to derive new predictive factors from related geofeatures that are used with other factors to predict vulnerability. The aggregate-area method does not consider the differences among individuals. Our method can preserve the original number of observations without the need to either aggregate or average them. Thus, it is less biased than the aggregate-area-based methods. Given the derivation of additional predictive factors, the set of predictors can be strengthened and the model's performance can be improved.
• Propose a data mining approach to vulnerability assessment. A BN is the major framework for integrating spatiotemporal data and domain knowledge from different sources. We generalize the generic framework of the BN and spatial modeling techniques for vulnerability assessment. Our approach may target

a more specific location of vulnerability since it is based on a fine scale. Also, our approach is able to distinguish among the different environmental and individual characteristics that contribute to the vulnerability of the exposed objects. In this regard, our approach should perform better than many existing approaches.
• Consider the uncertainties of predictive models in vulnerability assessment. We employ heuristic-learning algorithms to study local topologies based on the BN framework and then use a Bayesian model average (BMA) to get a robust prediction. We use several typical predictive models to assess vulnerability and examine their applicability.

The following outlines the structure of this article. Section 2 briefly describes the Bayesian framework for vulnerability assessment. Section 3 presents our modeling methods for vulnerability assessment. Section 4 gives an introduction to our study case and then presents the results. Section 5 discusses the implications of our approach. Section 6 presents our conclusions.

2. BAYESIAN FRAMEWORK FOR VULNERABILITY ASSESSMENT

This section mainly describes the factors involved (Section 2.1) and our Bayesian framework for vulnerability assessment (Section 2.2).

2.1. Factors Involved

There are two sets of factors involved in vulnerability analysis:

1. Factors related to exposure are directly related to the occurrence of natural disasters. For instance, movements of the earth's crust may cause earthquakes; extremely intense wind may cause typhoons; heavy rainfalls may cause floods or/and landslides.
2. Factors related to vulnerability refer to environmental and system-resistance factors. Environmental factors characterize the environment that breeds the disasters. They are able to mitigate or amplify the destructive power of a hazard. For instance, a good water-soil conservation capability can mitigate the destructive effect of a mudslide. System-resistance factors are the characteristics of an individual or system that mitigate against the damage of a natural disaster. For example, a house with a lightweight steel structure may better withstand the destructive power of an earthquake than one constructed of wood or brick; people living closer to a shelter may reduce their vulnerability to damage from a disaster; education about the causes and nature of tsunamis can increase citizens' vigilance against this type of disaster.

In our previous article's(15) table 1, we reported on exposure-related and vulnerability-related factors and the empirical modeling methods for five natural disasters: earthquake, flood, typhoon, mudslide, and avalanche.

2.2. BN of Vulnerability Assessment

A BN is a kind of directed acyclic graph with conditional probabilistic dependence:

B_S = G(V, E),   (1)

where B_S is the network structure; V is the set of random variables (r.v.); E ⊆ V × V is the set of directed edges that indicate the probabilistically conditional dependency relationships between r.v. nodes and satisfy the Markov property;(16) and

B_P = {γ_u : Ω_u × Ω_{π_u} → [0, 1] | u ∈ V}   (2)

is a set of assessment functions, where the state space, Ω_u, is the finite set of values of u; π_u is the set of parent nodes of u; if X is a set of variables, Ω_X is the Cartesian product of the state spaces of all the variables in X; and γ_u uniquely defines the joint probability distribution P(u | π_u) of u conditional on its parent set, π_u.

BN is based on the Bayesian inference principle of the a posteriori probability (that is, belief) of a hypothesis from the evidence. For vulnerability assessment, evidence derives from exposure- or vulnerability-related factors, and the hypothesis refers to the state of risk, expressed as damage states from low to high levels. Let t be such a hypothesis variable of damage risk; its state space Ω_r would have, say, seven states, Ω_r = {none, slight, light, moderate, heavy, major, destroyed}. In a specified BN, if some factors are known, the a posteriori probability or belief of the target variable t being in a certain state can be estimated by

Fig. 1. Generic Bayesian modeling framework for vulnerability assessment. [Figure: random nodes R (release scenario) and MP (modeling parameters of exposure) feed E (exposure intensity and derived hazards) in the exposure block; HC (human characteristics such as sex, income, age, residential environment, and health state) and ARCS (accumulated road cost surface) form the system-resistance block; E, ARCS, and HC feed HD (human damage) in the vulnerability block, with V (vulnerability) as a utility node. States of the target variable HD: none, slight, light, moderate, major, destroyed. Single lines denote links of a single variable; double lines denote links of (possibly) multiple variables.]

calculating the marginal probability:

Belief(x, t) = Σ_{u_i ∈ V, u_i ≠ t} p(u_1, u_2, . . . , t, . . . , u_n),   (3)

where x is an entity, namely, a cell of the grid surface that represents the persons exposed to the disaster; t is the value of a certain state of the damage, t ∈ Ω_r; and p(u_1, . . . , u_n) = Π_{u_i ∈ V} p(u_i | π_{u_i}) is the joint probability over V.

To construct a BN, we suggest the following four steps: (1) identification of the factors to quantify the problem framework; (2) establishment of the interdependencies between the r.v. nodes; (3) assignment of the states or quantities to nodes; and (4) assessment of the conditional probabilities.

Fig. 1 describes the BN framework for risk assessment of human beings. In this framework, R denotes the release scenario, MP the set of exposure-related factors, E the exposure intensity and its derived hazards (random nodes), HC human characteristics related to the vulnerability of human beings (random nodes), ARCS the accumulative road cost surface (a random node), HD the casualty damage caused to human beings (a random node), and V the vulnerability index (a utility node) that will be defined in Equation (13) of Section 3.6. HD is the target variable; we aim to estimate its probabilities in various states. In this framework, we use capitalized italicized letters to represent a univariate and capitalized roman-type letters to represent a set of multivariates. So, in this framework, R, ARCS, and HD each denote a univariate, but MP, E, and HC each denote a set of multiple variables. Also in this figure, the double line with an arrow represents the link from multiple random nodes to the arrowed node or set of nodes, whereas the single line represents the link from a single random node to the arrowed node or set of nodes.

We construct the framework according to domain knowledge,(5,9,11,17,18) that is, an extension of our previous framework in risk assessment for buildings.(15) In this framework, it is assumed that release scenarios (R) and related parameters (MP) of exposure have deterministic influences on the exposure intensity or derived hazards (E); human damage (HD) is assumed to be determined by exposure (E), the accumulated road cost surface (ARCS), and human characteristics (HC). For the set of multivariates (MP, E, HC), there may be interdependencies between these variables in addition to their dependencies described in Fig. 1. Similarly, we can construct their interdependencies according to domain knowledge or learning from the training data.
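Equation (3) amounts to factorizing the joint probability by the chain rule and summing out every variable that is neither the target nor fixed evidence. A minimal sketch, assuming a hypothetical three-node network E → HD ← HC with made-up CPT values (not the networks learned in this article):

```python
from itertools import product

# Hypothetical three-node network E -> HD <- HC; all CPT numbers are made up.
P_E  = {"high": 0.3, "low": 0.7}                      # exposure intensity
P_HC = {"weak": 0.4, "strong": 0.6}                   # system resistance
P_HD = {                                              # P(HD | E, HC)
    ("high", "weak"):   {"heavy": 0.8, "slight": 0.2},
    ("high", "strong"): {"heavy": 0.5, "slight": 0.5},
    ("low",  "weak"):   {"heavy": 0.3, "slight": 0.7},
    ("low",  "strong"): {"heavy": 0.1, "slight": 0.9},
}

def belief(hd_state, evidence=None):
    """Belief(HD = hd_state | evidence): factorize the joint by the chain rule
    and sum out every variable that is neither the target nor fixed evidence."""
    evidence = evidence or {}
    num = den = 0.0
    for e, hc, hd in product(P_E, P_HC, ("heavy", "slight")):
        if evidence.get("E", e) != e or evidence.get("HC", hc) != hc:
            continue
        p = P_E[e] * P_HC[hc] * P_HD[(e, hc)][hd]   # chain-rule factorization
        den += p
        if hd == hd_state:
            num += p
    return num / den

print(belief("heavy", {"E": "high"}))   # ≈ 0.62: high exposure raises damage belief
```

Exact enumeration like this is exponential in the number of variables; real networks with many nodes need the exact or approximate inference methods listed in Table I.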

Fig. 2. Data mining procedure for vulnerability assessment.

Once an initial BN framework is constructed, we can refine the interdependencies between local r.v.s in MP, E, and HC and estimate the conditional assessment parameters.

3. MINING TECHNIQUES FOR VULNERABILITY ASSESSMENT

3.1. Data Set Format and Mining Procedure

Fig. 2 shows the procedure for vulnerability assessment. Our techniques are based on the grid data set. To obtain the grid data set from multiple heterogeneous sources, we apply preprocessing steps, e.g., converting the vector data and resampling the grid into the target grid at the standardized resolution and projection. We perform these steps in a GIS environment (Fig. 2a), such as ARCGIS or SuperMap. The following sections describe the major techniques of the procedure.

In our approach, spatial analysis and BNs are used to make the vulnerability assessment. In the grid data set (Fig. 2b), each cell corresponds to an independent sample for training or a new unit that has predictive factors and the target variable. We can extract the table of multiple attributes (Fig. 2c) from the data set. The grid-based format enables us to collect a variety of data from different sources and to integrate it within a data mining system using geospatial techniques such as rasterization, resampling, reprojection, KDA, and ARCSM. Rasterization, resampling, and reprojection are traditional and relatively mature techniques, but KDA and ARCSM have only recently been used in modeling as new quantification techniques of spatial analysis. Thus, our description of methodology is focused on the KDA and ARCSM techniques of spatial analysis and on Bayesian modeling.

In the spatial data mining environment, we not only use GIS to manage and visualize spatial data but also employ spatial analysis techniques that may be loaded on GIS, such as KDA and ARCSM, to extract predictive factors from relevant geofeatures. Although KDA and ARCSM may be loaded on GIS, they are distinct from GIS basic functionality. As the first law of geography states: "Everything is related to everything else, but near things are more related than distant things."(19) Natural disasters, as geospatial events, occur in geographical areas, and they should also have such spatial characteristics; thus, Tobler's first law of geography is applicable to them. We use KDA and ARCSM in vulnerability assessment

that considers the influence of points, such as pollution sources, or polylines, such as an earthquake's active faults, on the surroundings; the influence gradually decreases with the increase of the distance from the sources. KDA models such influences and derives influence factors from the geofeatures.(20,21) ARCSM calculates the accumulative cost for each cell to reach its neighbor shelters, and this cost, determined by relevant geographical factors, has an important influence on the vulnerability of the individuals exposed to the hazards.(22−24)

3.2. Spatial Analysis Techniques

Spatial analysis is used to deal with spatial or nonspatial attributes of geographic features to identify generic spatial patterns. These techniques include but are not limited to local/global spatial autocorrelation, G-statistics, KDA, ARCSM, and the like. In vulnerability analysis, spatial analysis can be used to quantify the discrete or qualitative factors for use in combination with other quantitative factors. For example, spatial autocorrelation is employed to detect similarity or dissimilarity of the damage or loss from the disaster, and G-statistics is used to detect hot spots of high risk. In our approach, KDA and ARCSM are especially applicable for processing those qualitative spatial features such as rivers or faults and quantifying them to be used in combination with other analytic factors. Sections 3.2.1 and 3.2.2 mainly describe these two techniques. Spatial autocorrelation and G-statistics may be useful, but the factors derived from them cannot be used directly in vulnerability assessment. Thus, our methodology primarily uses KDA and ARCSM as quantification techniques.

3.2.1. Kernel Density Analysis

As a technique of spatial analysis, KDA transforms a sample of observations recorded as geographically referenced points or polylines into a continuous surface, quantifying the intensity of individual observations over space.(10,25) Points or polylines closer to the center of the target entity are weighted more heavily than those away from the entity, which embodies Tobler's first law of geography. The kernel weights vary within its "sphere of influence" according to their distance from the central point or polyline as the intensity is estimated: the surface value is highest at the location of the central target geofeature and diminishes over the distance from the geofeature. The density estimate at z using the KD function is:

Density(z) = (1/n) Σ_{i=1}^{n} Z_i · K_λ(z, Z_i),   (4)

where n is the number of sample units, z is any unit in the geographical area, Z_i is the value of the sample unit, and K_λ(z, Z_i) is the kernel density (KD) function. To get the KD function, we can use the normal function to simulate it.(15)

The bandwidth or search radius, λ, is affected by empirical knowledge and the goal. A large λ gives a more generalized surface over the entire study area, whereas a small λ means more localization over the area. Since our goal is to reflect the influence of relevant factors on damage caused to individuals exposed to a disaster, in practice λ can be set according to the biggest influence range of the relevant factors.

As a smoothing technique, KDA can derive continuous surface data from geofeatures. This facilitates the integration of quantitative factors in combination with the qualitative factors deriving from some geofeatures, such as points (e.g., pollution sources) and polylines (e.g., rivers or faults). Another advantage of KDA is that it avoids the drawback of the "aggregate" approach, within which the estimated average exposure in a particular region may serve as a reasonable surrogate for the actual exposure of individuals. Individual exposure levels cannot accurately be inferred from aggregated data; KDA helps to preserve the original number and intensity of observations without the need either to aggregate or average them.(15,26)

Conversely, although in general KDA is applicable for smoothing geofeatures, given certain geographical features such as gorges and chasms (which might result in noncontiguous geofeatures), it is prudent for us not to apply a continuity-based method such as KDA blindly. This problem is common, especially in the case of floods. Thus, if there are considerable unique features such as gorges, we need to identify them on the grid and assign the corresponding cells values different from other continuous surfaces.
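As an illustration of Eq. (4), the following sketch computes a kernel density surface with a normal (Gaussian) kernel over grid-cell centers; the coordinates, feature weights, and bandwidth are hypothetical values chosen only for illustration:

```python
import math

def kernel_density(cells, features, lam):
    """Eq. (4)-style KDA sketch with a Gaussian kernel and Euclidean distance.

    cells    -- list of (x, y) grid-cell centers at which to estimate density
    features -- list of ((x, y), weight) geofeature points (e.g., fault vertices)
    lam      -- bandwidth / search radius, set from the feature's influence range
    """
    out = []
    for (cx, cy) in cells:
        total = 0.0
        for (fx, fy), w in features:
            d2 = (cx - fx) ** 2 + (cy - fy) ** 2
            # Normal (Gaussian) kernel, as suggested for K_lambda in the text
            total += w * math.exp(-d2 / (2.0 * lam ** 2))
        out.append(total / len(features))
    return out

# Density along a transect away from a single hypothetical fault point:
# the value is highest at the feature and decays with distance (Tobler's law).
print(kernel_density([(0, 0), (1, 0), (2, 0)], [((0, 0), 1.0)], lam=1.0))
```

A larger `lam` flattens the surface toward a global average, while a smaller `lam` localizes the influence, matching the bandwidth discussion above.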

3.2.2. Accumulative Road Cost Surface Modeling

ARCSM is used to estimate the cost, i.e., the "difficulty" of reaching a shelter from a living place when a catastrophic event happens. The lower the cost, the more likely individuals vulnerable to damage can escape and avoid it, thus reducing their vulnerability. ARCSM consists of two models:

• Road cost surface model. This model evaluates the road cost of each cell of the grid based on given factors:

c(x) = e(x) · w_e + d(x) · w_d + a(x) · w_a + s(x) · w_s + Σ_{i=1}^{k} o_i(x) · w_{o_i},   (5)

where x denotes the cell; e(·) denotes the obstacle factor directly related to the exposure; d(·) the road surface density (e.g., the road area/the region area); a(·) road accessibility determined by distance to the roads; s(·) the slope; and o_i other factors related to the road cost. w_e, w_d, w_a, w_s, and w_{o_i}, respectively, denote the weights of the relevant factors. If a factor presents a more difficult obstacle for traffic, it is weighted more heavily.
• Accumulated cost surface model. Given the cost surface from Equation (5) and the optional serviceable shelters, the accumulated cost surface model estimates the least accumulated cost with which the neighborhood shelters of each cell could be reached:

A_c(x) = min Σ_{j=x}^{NS(x)} c(j),   (6)

where x denotes the cell; A_c(·) the accumulated cost; NS(x) the neighborhood shelters around x; and c(j) the road cost of cell j, summed along a path from x to a shelter. We can calculate A_c(·) by optimal operational algorithms.

Accumulated road cost surface modeling is especially useful for evaluating the damage that may result to human lives: less accumulated road cost means that vulnerable persons can more easily reach shelters and thus avoid potential harm. The indicator extracted using ARCSM can also be used with the factors extracted by KDA and other quantitative factors for assessing the vulnerability of human beings.

3.3. Discretization

Discretization is employed to transfer continuous variables to discrete variables to be used in BN, since BN contains only qualitative or discrete variables for modeling and inference. We obtain the discrete intervals according to domain knowledge or use a discretization algorithm to discretize continuous factors such as elevation, slope, and the KD of faults or rivers. Discretization is significant since it enables us to use quantitative factors along with qualitative or discrete factors to strengthen the model's predictability of vulnerability in BN.

If the domain knowledge for discrete intervals is unclear, the discretization algorithm can be used to make an automatic division. The algorithm is designed according to the "recursion" idea by Fulton et al. (1995)(27) and the minimal description length (MDL) stopping criteria in Fayyad and Irani's algorithm.(28) This discretization method was reported in our 2010 article;(15) readers can refer to that article for technical details.

3.4. Probabilistic Mining Models

This section briefly introduces the probabilistic models compared in this article.

3.4.1. Bayesian Network

BN involves different search algorithms for constructing the network topology. Table I lists the major methods for construction, inference, and prediction of BN.

3.4.1.1. Learning the network topology. Learning the BN topology requires establishing a score to measure the network's quality. There are three kinds of score measures that bear a close resemblance: the Bayesian approach, the information criterion, and the minimum description length. In this study, we used the Bayesian approach, which uses the a posteriori probability of the learned structure given the training instances as a quality measure. The Bayesian approach can achieve a good result as it is unaffected by the specific structure, unlike other measures.(29)

Once a Bayesian quality measure is selected, we apply an algorithm to search the space of network structures to find the network topology with a high quality score, Q(B_S, D). We can apply different heuristic or general-purpose search strategies, as listed in Table I. The heuristic algorithms include K2, hill climbing (HC), and TAN (tree augmented naïve (Bayes)); the general-purpose algorithms include Tabu, simulated annealing (SA), and genetic algorithm (GA).(29)

Table I. Methods for Construction, Inference, and Prediction of BN

Structure learning:
  - Domain-knowledge-based: construct BN according to domain or empirical knowledge.
  - Dependency-analysis-based: conditional independence (CI).(16)
  - Search-scoring-based:(29)
      quality measures — Bayesian approach, information criterion approach, and minimum description length approach;
      learning methods — heuristic search strategies (K2, hill climbing (HC), TAN, etc.) and general-purpose search strategies (Tabu, simulated annealing (SA), genetic algorithm (GA), etc.).
Parameter learning:
  - Domain-knowledge-based: reports, statistics, and experienced models.
  - Distribution-based: Dirichlet-based parameter estimator.
  - With missing data:(16) expectation maximization, Gibbs sampling.
Inference:
  - Exact inference: joint probability, naïve Bayesian, graph reduction, and polytree.(54)
  - Approximate inference: forward simulation, random simulation.(54)
Prediction:
  - Three types of reasoning: causal, diagnostic, and intercausal.(16)

Six search strategies used in our study are briefly described below:

• K2(30) adds arcs with a fixed topological ordering of variables. In our implementation of K2, the ordering in the data set is initially set as a naïve Bayes (NB) network in which the target class variable (loss or damage risk) is made the first in the ordering,(29) since we know little about the relationship between local variables of the training data set.
• HC(31) adds and deletes arcs with no fixed ordering of variables. This procedure is iterated until the highest value of the local Bayes score is obtained.
• In TAN Bayes,(32) the tree is formed by calculating the maximum-weight spanning tree using Chow and Liu's algorithm.(33)
• Tabu(29) search is an optimized variant of HC. This search algorithm applies a Markov blanket correction to the network structure after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node while the optimal network is acquired.
• SA(29) randomly generates a candidate network B′_S close to the current network B_S. It accepts the network if it is better than the current one. Otherwise, it accepts the candidate with the probability

e^{t_i · (Q(B′_S, D) − Q(B_S, D))},   (7)

where t_i is the temperature at iteration i. The temperature starts at t_0 and is slowly decreased with each iteration.
• GA.(34) With D as the set of BN structures for a fixed domain with n variables, and the alphabet S being {0,1}, a BN structure can be represented by an n × n connectivity matrix C, where its elements, c_ij, are

c_ij = 1 if i is a parent of j, and 0 otherwise.

In GA, we represent an individual of the population by the character string:

{c11 c21 . . . cn1 c12 c22 . . . cn2 . . . c1n c2n . . . cnn } (also the target/dependent variable (Y = risk level):
known as chromosomes). GA will search the
structure space to find the individual with p(Y) 
P(Y|X1 , . . . , Xn ) = P(Xi |Y). (9)
the best “genetic material” by cross-over and P(X) i
mutation operators.
NB assumes that within each class, the numeric
predictors are normally distributed. One can repre-
3.4.1.2. Learning the parameters. Once the op- sent such a distribution in terms of its mean (μ)
timal structure has been found, the parameters of the BN (the conditional probability table (CPT) for each node, BP) can be determined according to domain knowledge or can be learned from the database of instances. We used the simple Bayesian estimator, which assumes that the conditional probability of each r.v. node corresponding to its parent instantiation conforms to the Dirichlet distribution with local parameter independence, D(α1, . . ., αi, . . ., ατ), with αi being the hyperparameter for state i. This Bayesian estimator produces estimates of conditional probability directly from the data set, and it can be used directly in data-driven learning. The Bayesian estimator rather than BMA is used in this step because BMA is not yet part of the standard data analysis tool kit, as its implementation presents several difficulties.(35)

For the estimation method, the assumption of local parameter independence may not be realistic. This can result in slow and biased parameter learning. We use the classification tree to improve the analysis while avoiding this assumption.(16)

3.4.2. Other Probabilistic Models

3.4.2.1. Logistic regression (LR). LR is a technique of probability estimation based on maximum likelihood estimation. In this model, let Y be the risk level as the target/dependent variable (e.g., Y = 1 indicating "high risk" and Y = 0 "low risk"), and Xi (i = 1, 2, . . ., n) be the predictive factors. LR assumes that Y follows a Bernoulli distribution and that the link function relating Xi and Y is the logit or log-odds:

p(Y = 1|X, β) = μ = 1/(1 + e^(−Xβ)),   (8)

where X = (1, X1, X2, . . ., Xn), β = (β0, β1, β2, . . ., βn)^T, and β can be estimated by the maximum likelihood estimator that would make the observed data most likely.(36)

3.4.2.2. Naïve Bayes. NB assumes that the predictive factors are conditionally independent given the class; for each numeric attribute it estimates the mean and standard deviation (σ) and thus can estimate the probability of an observed value from such estimates.

3.4.2.3. Normalized Gaussian radial basis function (RBF) network. The normalized Gaussian RBF network uses the k-means clustering algorithm to provide the basis functions and learns either an LR (discrete class problems) or a linear regression (numeric class problems) on top of that. Symmetric multivariate Gaussians are fit to the data from each cluster. If the class is nominal, it uses the given number of clusters per class. It standardizes all numeric attributes to zero mean and unit variance.

3.4.2.4. Multilayer perceptron (MPer). MPer is a computational model that tries to simulate the structural and/or functional aspects of biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases, an artificial neural network (ANN) is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. In more practical terms, neural networks are nonlinear statistical data-modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.

3.5. Evaluation and Modeling Uncertainty

3.5.1. Evaluation Measures

We use four scalar measures, i.e., pd, balance, precision, and ROC area.

• Pd refers to the detection probability of high risk: it measures the proportion of correctly predicted positive instances among the actually positive ones. If a method achieves a higher pd, it can detect more positive instances (more cell units of high risk detected).
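The logit link of Equation (8) can be sketched in a few lines. This is a minimal NumPy illustration: the coefficient values in β are invented for the example, not estimates from this study.

```python
import numpy as np

def logistic_prob(X, beta):
    """p(Y = 1 | X, beta) per Equation (8): mu = 1 / (1 + exp(-X beta)).
    Each row of X carries a leading 1 for the intercept beta_0."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

# Hypothetical coefficients for two predictive factors plus an intercept.
beta = np.array([-1.0, 2.0, 0.5])
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])
p = logistic_prob(X, beta)   # per-row probabilities of "high risk"
```

In practice β would be fit by maximum likelihood with a standard statistics package rather than set by hand.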
10 Li et al.
• Balance between pd and pf (pf refers to the probability of false alarms; a good method should have a low pf):

balance = 1 − √(pf² + (1 − pd)²)/√2.   (10)

• Precision refers to the proportion of true positives among the instances predicted as positive, but it cannot measure the proportion of correctly predicted positive instances among the actually positive ones. Good precision does not always mean a good pd. A method with high precision but a low pd is less useful, since it cannot detect the more significant positive instances (fewer units of "high loss" risk detected).

• ROC area is the area between the horizontal axis and the receiver operating characteristic (ROC) curve; it gives a comprehensive scalar value representing the model's expected performance. The ROC area is between 0.5 and 1, where a value close to 0.5 is less precise, while a value close to 1.0 is more precise.

3.5.2. Uncertainty of Models

To mitigate the sampling bias and model uncertainties (also avoiding the overfitting problem), we use BMA and Occam's Window(35,37,38,39) to produce a robust prediction of the seismic risk.

Assume r to be the target variable of risk, D to be the training data set, and Mk to be the kth model of BN. We can then obtain the averaged value of the probability of the target variable being in a certain state using BMA:

p_r(r|D) = Σ_{k=1}^{K} p(r|Mk, D) p(Mk|D),   (11)

where K is the number of models selected and

p(Mk|D) = p(D|Mk) p(Mk) / Σ_{l=1}^{K} p(D|Ml) p(Ml)   (12)

is the weight, a Bayes factor expressed as a ratio of marginal likelihoods, or of posterior odds to prior odds, for the different models. We use the BN's inference algorithms (Table I) to obtain p(D|Mk) and assume that the prior probability of each model, p(Mk), is the same.

While BMA can average the predictions of the learning algorithms (models) (Table I), we can also use Occam's Window to select the qualified models and remove the poorly qualified ones, thus improving the computational efficiency. Occam's Window(35) has two principles: (1) if a model receives much less support (e.g., at a ratio of 20:1) than another model with maximum posterior probability, then it should be dropped; (2) complex models that receive less support than their simpler counterparts should be dropped.

We use six search algorithms, shown in Table I (bold typeface), to get the local structures of qualitative factors and use BMA and Occam's Window to average the qualified models, thus decreasing the models' bias and improving their robustness.

3.6. Vulnerability Assessment

Vulnerability assessment estimates the susceptibility of individuals exposed to natural hazards(9) and relates it to the potential degree of damage to those individuals. The higher the vulnerability, the more damage the individual may experience.

Table II. Different States of the Target Variable, Damage Factors, and Cost Coefficients

Damage State | Damage Factor Range (%) | Central Damage Factor (%) | Cost Coefficient
None | 0 | 0 | 0.1
Slight | 0–1 | 0.5 | 0.2
Light | 1–10 | 5 | 0.3
Moderate | 10–30 | 20 | 0.5
Major | 60–100 | 80 | 0.75
Destroyed | 100 | 90 | >0.9

There is no precise definition of vulnerability. In this study, we regard vulnerability as the comprehensive damage index. In the BN framework (Fig. 1), the target variable is the damage state of human beings, which is affected by multiple factors such as exposure intensity, physical or ecological environment, accumulative road cost, and individual characteristics. Using BN, we can integrate a variety of exposure- and vulnerability-related factors from different sources to estimate the probabilities that a human being is in a certain damage state. Each state (none, slight, light, moderate, major, or destroyed) of the target variable, HD, has a corresponding damage factor range, a central damage factor, and a cost coefficient (Table II). The damage factor is the fraction or percentage of the damage caused to an individual exposed to a natural disaster. The cost coefficients are used for calculating the vulnerability index. We assume that the cost coefficient is nonnegative,
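The BMA weighting of Equations (11) and (12), screened by the first principle of Occam's Window, can be sketched as follows. The marginal likelihoods and per-model predictions are made-up numbers, and equal model priors are assumed, as in the text.

```python
import numpy as np

def bma_predict(pred_probs, marg_liks, odds_cutoff=20.0):
    """Bayesian model averaging per Equations (11)-(12), preceded by an
    Occam's-Window screen: any model whose support falls below the best
    model's by more than `odds_cutoff` (the 20:1 ratio in the text) is dropped.
    Equal model priors p(M_k) are assumed, so the weights reduce to the
    normalized marginal likelihoods p(D | M_k)."""
    ml = np.asarray(marg_liks, dtype=float)
    keep = ml >= ml.max() / odds_cutoff        # Occam's Window, principle (1)
    w = np.where(keep, ml, 0.0)
    w = w / w.sum()                            # p(M_k | D), Equation (12)
    averaged = w @ np.asarray(pred_probs)      # Equation (11)
    return w, averaged

# Hypothetical: three candidate BNs predicting two damage states.
probs = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
weights, p_avg = bma_predict(probs, marg_liks=[0.5, 0.4, 0.001])
```

Here the third model's support falls far below the 20:1 cutoff relative to the best model, so it receives zero weight and the averaged prediction comes from the first two models only.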
Mining Spatial Data Sets for Evaluating Vulnerability 11
ranging from 0 to 1; a condition with a larger damage factor has a larger cost coefficient. Table II gives a general definition of the cost coefficient; users can adjust it according to the study goal and empirical knowledge.

Integrating the product of the probability of the vulnerable individual being in a certain damage state and the corresponding cost coefficient, we can obtain the vulnerability index of the individual as a utility node of the BN (the diamond node in Fig. 1):

Vul(x) = ∫_{t=Min}^{t=Max} Cost(t) Belief(x, t) dt ≈ Σ_{t=none}^{t=destroyed} Cost(t) Belief(x, t),   (13)

where x denotes a unit to be estimated, e.g., a cell in a grid data set that represents the exposed persons; t the damage state of the target variable, HD (t ∈ r = {none, slight, light, moderate, major, destroyed}); Cost the cost coefficients; and Belief(x, t) the likelihood or posterior probability of x being in damage state t.

4. THE STUDY CASE OF SEISMIC RISK

The 2008 Wenchuan earthquake is selected as our study case. In Section 4.1, we introduce the study region and goal. Section 4.2 describes the simulation of the peak ground acceleration (PGA) distribution under different scenarios. We then use different probability models to compute the risk probability and compare them in Section 4.3. Section 4.4 presents the vulnerability index produced with our robust approach. In Section 4.5, we perform uncertainty and sensitivity analysis.

4.1. Study Region and Goal

The study region of interest (ROI; Fig. 3a) is a rectangular region, located at Du Jiangyan, Sichuan province of China, between North Latitude 30°57′57.318″ and 31°1′12.768″ and between East Longitude 103°35′19.657″ and 103°41′7.6″. The study region is close to the catastrophic disaster of the May 12, 2008, Wenchuan earthquake.
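Equation (13), with the cost coefficients of Table II, reduces to a weighted sum over the damage states. A minimal sketch follows: the belief values are hypothetical posteriors for one grid cell, and the open-ended ">0.9" coefficient for "destroyed" is pinned at 0.9 for illustration.

```python
# Cost coefficients from Table II ("destroyed" is listed as ">0.9"; 0.9 is used here).
COST = {"none": 0.1, "slight": 0.2, "light": 0.3,
        "moderate": 0.5, "major": 0.75, "destroyed": 0.9}

def vulnerability_index(belief):
    """Equation (13): Vul(x) ~ sum over damage states t of Cost(t) * Belief(x, t)."""
    assert abs(sum(belief.values()) - 1.0) < 1e-9, "beliefs must sum to 1"
    return sum(COST[t] * p for t, p in belief.items())

# Hypothetical posterior damage distribution for one grid cell.
belief = {"none": 0.50, "slight": 0.20, "light": 0.15,
          "moderate": 0.10, "major": 0.04, "destroyed": 0.01}
vul = vulnerability_index(belief)
```

A cell whose belief mass sits entirely on "none" gets the floor value 0.1, and the index rises toward 0.9 as mass shifts to the severer states.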
Fig. 3. (a) The study region of interest (ROI) and (b) the ROI's background of seismicity.

Fig. 4. Steps of probabilistic seismic hazard analysis (PSHA).
The 2008 Wenchuan earthquake, also known as the Great Sichuan Earthquake, was a deadly earthquake that measured 8.0 Ms and 7.9 Mw, occurring at 14:28:01.42 CST (02:28:01.42 EDT) on May 12, 2008 in Sichuan province of China. The earthquake killed at least 68,000 people. This earthquake resulted from the interactive movements of the Indian and Eurasian plates in opposite directions. The seismicity of central and eastern Asia is the result of the northward movement of the Indian plate at a rate of 5 cm/year and its collision with Eurasia, resulting in the uplift of the Himalaya and Tibetan plateaus and associated earthquake activity. The earthquake occurred along the Longmenshan fault, a thrust structure along the border of the Indo-Australian and Eurasian plates. Seismic activities were concentrated on its mid-fracture (known as the Yingxiu-Beichuan fracture). The rupture lasted close to 120 seconds, with the majority of the energy released in the first 80 seconds. Starting from Wenchuan, the rupture propagated at an average speed of 3.1 km/sec at an azimuth of 49° towards the northeast, rupturing a total length of about 300 km. Maximum displacement amounted to 9 m.

Fig. 3(b) represents the seismicity context of the 2008 Wenchuan earthquake as described in the earlier paragraph, where the ROI is close to active faults that have a history of seismic activity. The specific goal of our study case is to use the historic seismic catalog to simulate, under the seismicity context (Fig. 3b), the probabilistic seismic hazard risk, i.e., PGA (peak ground acceleration) values in the ROI at two levels of exceedance probability, and then use the two release scenarios to conduct vulnerability analysis of the human beings in the ROI. In this study case, besides the simulation of the release scenario, we use critical covariates such as distances to rivers and the active faults, the accumulative road cost surface, and so on, to estimate the potential vulnerability of human beings exposed to such scenarios. The assessment output of vulnerability is informative for the relevant agencies or insurance companies to make suitable plans and preparedness measures against possible damage from future earthquakes.

4.2. Data Set

The data set is based on the grid format. Each cell of the grid corresponds to a certain number of human beings and their vulnerability. The factors involved were initially selected according to domain knowledge(40−43) and obtained from the National Geomatics Center of China, using the resampling technique. The data are acquired as follows:

• Factors related to exposure include release scenario (rs), magnitude (m), distance (d), landslide risk (lsr), liquefaction risk (lfr), and ground motion risk (pga). We obtained rs, m, d, and pga by exposure modeling according to the catalog of historical earthquakes and seismicity around this region (Fig. 3b). Also, lsr was quantified using the appropriate method(44,45) and relevant environmental factors that included slope, soil type, PGA, and KDs of rivers and active faults. Due to the lack of specific spatial data for this region, lfr and its relevant factors were not included in the data set.

• Factors related to system resistance include environmental variables and human characteristics.(42) Environmental variables include soil type (st), proximity to faults (kdf), proximity to rivers (kdr), accumulative road cost surface (cost), and slope (sl). We used the KD function described in Section 3.2.1 to quantify kdf and kdr and made a suitable classification of them (Table III). Human presence (hpr) is the primary characteristic included for human beings. Due to lack of relevant data, we didn't include other variables of human characteristics, such as average age and disaster knowledge. But if such factors are available, they should be employed.

• Each cell of the grid is 5 m × 5 m, giving an area of 25 m². This affords our data set a fine resolution for simulating the practical situation. The values of each cell's relevant factors are assigned this way: spatial data of vectors such as faults and rivers are converted into the grid surface using the KD method, and all the grids are resampled to the target grid of 5 m × 5 m. Such a high resolution means more specific and precise location of the vulnerability factors, which is beneficial in the decision-making process for early warning, mitigation, and prevention of disasters.

4.3. Hazard Analysis

Probabilistic seismic hazard analysis (PSHA)(42,46,47) was used in conducting the hazard analysis. The goal of PSHA is to quantify the
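The conversion of vector features to a 5 m grid surface via kernel density can be sketched as below. This is a simplified Gaussian-kernel stand-in for the KD step described above; the grid extent, bandwidth, and feature coordinates are assumptions, not the study's parameters.

```python
import numpy as np

def kernel_density_grid(points, shape=(40, 40), cell=5.0, bandwidth=20.0):
    """Rasterize vector features (sample points along a fault or river) onto a
    cell x cell metre grid with an unnormalized Gaussian kernel."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cx = (xs + 0.5) * cell   # cell-centre x coordinates in metres
    cy = (ys + 0.5) * cell
    kd = np.zeros(shape)
    for px, py in points:
        d2 = (cx - px) ** 2 + (cy - py) ** 2
        kd += np.exp(-d2 / (2.0 * bandwidth ** 2))
    return kd

kd = kernel_density_grid([(97.5, 97.5)])   # one feature point at a cell centre
```

The resulting surface peaks at the cell containing the feature and decays with distance, which is the behavior the kdf and kdr factors rely on.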
Table III. Classification of Variables and Their Descriptions in the BN of Vulnerability of Human Beings to Seismic Risk

Variable | Number of States | States or Intervals (unit) | Source of Prior Probability Distribution
Release scenario (rs) | 2 | 10% in 50 years; 10% in 100 years | Set according to the PGA exceedance probability in T years
Magnitude (m) | 6 | 0–5.0–5.5–6.5–7.0–7.5–∞ | Annual probabilities calculated using the Gutenberg-Richter magnitude recurrence relationship(55)
Distance (d) | 5 | 0–10–20–40–80–∞ (km) | Distance from the seismic sources to the site of interest
Ground motion (PGA) risk (pga) | 11 | 0–30–50–150–250–350–450–550–650–750–850–∞ (gal) | Deterministic relations, calculated as a function of magnitude, distance, and soil type by PSHA(42)
Soil type (st) | 5 | Unknown; hard rock; soft rock; medium soil; soft soil | Amplification factor for PGA and landslide risk: 1.0 (unknown, medium soil), 0.55 (hard rock), 0.7 (soft rock), 1.3 (soft soil)(40,43)
Close to faults? (kdf) | 6 | 0–1–2–3–4–5– | Quantified using the kernel density function (Section 3.2); assumes that closer to active faults means more risk of damage(20)
Close to rivers? (kdr) | 11 | 0–100–200–300–400–500–600–700–800–900–1000– | Quantified using the kernel density function (Section 3.2.1); assumes that closer to rivers means more risk of damage(20)
Slope (sl) | 9 | 0–5–10–15–20–25–30–35–40–90 | Slopes are assumed to cause mudslides that damage buildings
Landslide risk (lsr) | 3 | Safe or slight risk; moderate risk; high risk | Five factors, i.e., rivers, faults, soil, slope, and PGA, are responsible for landslides; modeled using the fuzzy method(44)
Liquefaction risk (lfr) | 2 | Ground amplification; liquefaction | Modeled using the method of earthquake engineering(40,43)
Human presence (hpr) | 2 | Yes/no | Assume that the prior probability p = the population within the cell/the area of each cell
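As an illustration of how a continuous factor maps onto the discrete states above, the PGA intervals of Table III can be applied with a simple digitize step (a sketch; `pga_state` is a hypothetical helper, not code from the study):

```python
import numpy as np

# PGA interval breakpoints (gal) from Table III; values above 850 fall in the last state.
PGA_BREAKS = [30, 50, 150, 250, 350, 450, 550, 650, 750, 850]

def pga_state(pga_gal):
    """0-based index of the Table III interval containing a simulated PGA value."""
    return int(np.digitize(pga_gal, PGA_BREAKS))

states = [pga_state(v) for v in (10.0, 100.0, 900.0)]
```

The same pattern applies to the other discretized factors (distance, slope, kernel density classes), each with its own breakpoints from Table III.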

probability of exceeding various ground-motion levels at a site, e.g., a cell in a grid data set, given all possible earthquakes. In our analysis, we used PGA to simulate the ground motion in PSHA. PGA is used to define lateral forces and shear stresses in the equivalent-static-force procedures of some building codes and in liquefaction analyses. It is a good indicator of ground motions.
Fig. 5. The surfaces of a 10% chance of exceedance of PGA within 50 years (a) and a 10% chance of exceedance of PGA within 100 years (b).
Our 2010 article(15) described the techniques of PSHA (Fig. 4) in the hazard analysis of earthquakes in detail. Readers can refer to that article and related documents for the technical details.

In the study case, we obtained the PGA maps for two levels of exceedance probability, i.e., a 10% chance of PGA exceedance within 50 years (Fig. 5a) and a 10% chance of PGA exceedance within 100 years (Fig. 5b). These translate, effectively, to recurrence periods of 475 and 950 years.

4.4. Bayesian Modeling

This section mainly describes the specification of the BN, which includes construction of the BN topology and extraction of the CPT parameters.

We constructed the BN according to domain knowledge of earthquake engineering(40,43,48,49,50) and the generic framework of Fig. 1. Fig. 6 presents the initial network topology.

As shown in Fig. 6, exposure-related factors include rs, m, d, pga, lfr, and lsr. Among these factors, m
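The translation from exceedance probability to recurrence period can be checked with a one-line calculation. This assumes independent years, a standard hazard-analysis relation rather than code from the article:

```python
def return_period(p_exceed, window_years):
    """Return period T implied by an exceedance probability over a time window,
    assuming independent years: p = 1 - (1 - 1/T) ** window_years, solved for T."""
    return 1.0 / (1.0 - (1.0 - p_exceed) ** (1.0 / window_years))

t50 = return_period(0.10, 50)     # 10% chance of exceedance in 50 years
t100 = return_period(0.10, 100)   # 10% chance of exceedance in 100 years
```

The two window choices give roughly 475- and 950-year return periods, matching the two scenarios used in the study case.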
Fig. 6. Bayesian network topology of seismic vulnerability. (Node groups: the target variable; indicators relevant to system resistance; indicators related to exposure; and the soil/liquefaction nodes not modeled in the study case.)
and d are modeling parameters of exposure, pga is the intensity indicator, and pga, lfr, and lsr are the three risk factors responsible for the vulnerability of human beings, hd. Thus, we have three causal links from the three risk factors pga, lfr, and lsr to the target variable, hd. Due to the lack of some soil data, such as soil profile and clay content, modeling of liquefaction risk was discarded. But these factors should be considered if such data are available. Therefore, the soil variables are given in Fig. 6 but are shown in dotted lines to indicate that they were not used in our case. Thus, in the test, by modeling pga and lsr, we simulated the different vulnerabilities of the ROI under two scenarios of different PGA.

In Fig. 6, the factors relevant to system resistance include environmental variables and characteristics of human beings. Environmental variables include st, kdf, kdr, sl, and cost. We used KD function (1) to quantify kdf and kdr and made a suitable classification of them (Table III). Characteristics of human beings include human presence (hpr), which was approximated with the map of human density.

The target variable of the Bayesian model is damage to human beings (hd), with a typology of six states, as described in Table III. Its corresponding damage factors are established according to earthquake engineering experience.(43,51)

Table III gives a brief description of the variables (factors) involved in our BN (Fig. 6) and the sources of the distributions of their prior probabilities.

Fig. 7(a) presents the surface of KD values of the roads in the ROI, and Fig. 7(b) presents the accumulative road cost surface, with selected hospitals and public parks as the shelter services (the "plus" symbols in Fig. 7).

4.5. Evaluation of Models and Vulnerability

Using the models presented in Section 3.4 to predict the damage probabilities, we compared the models (Tables IV and V) with the practical situation in the simulation of the return period of 950 years and used Equation (13) to compute the vulnerability index for each cell of the grid for the two simulations in the ROI (Fig. 8).

In total, we used 11 algorithms to predict the ROI's damage under the two scenarios and compared them with the practical situation under a scenario similar to the actual 2008 situation. Among the 11 algorithms were six BNs, corresponding respectively to the six search algorithms described in Section 3.4.1 (K2, HC, TAN, Tabu, SA, and GA). We also used BMA, which integrated the predictive values from the six BN algorithms, and four other algorithms: LR, NB, the RBF network, and MPer. (The four methods have been described in Section 3.4.2.) The predictors used in these models included the exposure-related factors (i.e., rs, m, d, pga, and lsr) and the system-resistance factors (i.e., st, kdf, kdr, sl, cost, and hpr).
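The kind of discrete-CPT inference such a network performs can be illustrated with a deliberately tiny sketch. The structure loosely follows the pga → hd ← hpr fragment of Fig. 6, but every probability below is invented for illustration; they are not the study's learned CPTs.

```python
# A toy discrete network: pga -> hd <- hpr, with hd collapsed to two states.
P_PGA = {"low": 0.7, "high": 0.3}
P_HPR = {"yes": 0.4, "no": 0.6}
P_HD = {  # P(hd | pga, hpr), invented numbers
    ("low", "yes"):  {"light": 0.90, "severe": 0.10},
    ("low", "no"):   {"light": 0.95, "severe": 0.05},
    ("high", "yes"): {"light": 0.40, "severe": 0.60},
    ("high", "no"):  {"light": 0.60, "severe": 0.40},
}

def posterior_hd(evidence):
    """P(hd | evidence) by enumeration over the unobserved parents."""
    score = {"light": 0.0, "severe": 0.0}
    for pga, p_pga in P_PGA.items():
        if evidence.get("pga", pga) != pga:
            continue
        for hpr, p_hpr in P_HPR.items():
            if evidence.get("hpr", hpr) != hpr:
                continue
            for hd, p_hd in P_HD[(pga, hpr)].items():
                score[hd] += p_pga * p_hpr * p_hd
    z = sum(score.values())
    return {hd: s / z for hd, s in score.items()}

post = posterior_hd({"pga": "high"})   # damage distribution given strong shaking
```

Entering partial evidence (here only pga) and marginalizing the rest is exactly the update-with-missing-data behavior the text attributes to BN; real engines replace the brute-force enumeration with junction-tree or sampling algorithms.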
Fig. 7. The graduated categories of the kernel density values of the rivers within the ROI (a) and the graduated categories of the accumulative road cost surface corresponding to shelter services within the ROI (the plus signs corresponding to the shelter services) (b).
Table IV. Comparison of the Models in Pd and Balance for the 950-Year Prediction, with the Practical Damage Survey as the Training Data

Model | Pd (states 1–6) | Balance (states 1–6)
BN (BMA) 0.88 0.79 0.84 0.91 0.82 0.91 0.88 0.85 0.88 0.93 0.85 0.87
BN (K2) 0.95 0.69 0.75 0.84 0.73 0.79 0.96 0.77 0.82 0.87 0.79 0.84
BN (HC) 0.82 0.68 0.74 0.83 0.74 0.80 0.87 0.77 0.81 0.87 0.81 0.85
BN (TAN) 0.52 0.87 0.89 0.89 0.83 0.83 0.68 0.90 0.92 0.92 0.86 0.87
BN (Tabu) 0.78 0.67 0.76 0.83 0.73 0.81 0.84 0.77 0.83 0.87 0.80 0.85
BN (SA) 0.21 0.84 0.82 0.91 0.90 0.84 0.45 0.88 0.87 0.93 0.91 0.88
BN(GA) 0.82 0.68 0.73 0.82 0.73 0.80 0.87 0.77 0.80 0.87 0.80 0.85
LR 0.69 0.51 0.30 0.69 0.79 0.77 0.78 0.65 0.50 0.78 0.79 0.82
NB 0.83 0.63 0.65 0.80 0.44 0.68 0.87 0.73 0.74 0.83 0.60 0.75
RBF 0.73 0.67 0.74 0.79 0.68 0.74 0.81 0.77 0.82 0.84 0.75 0.78
MPer 0.13 0.67 0.70 0.82 0.82 0.90 0.38 0.76 0.84 0.88 0.88 0.89

Note: In the No. of State row, 1: destroyed; 2: major; 3: moderate; 4: light; 5: slight; 6: none.
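The four measures behind Tables IV and V can be computed from one-vs-rest confusion counts for a given damage state; a minimal sketch (the counts are hypothetical, not drawn from the tables):

```python
import math

def measures(tp, fn, fp, tn):
    """pd, pf, precision, and balance (Equation (10)) from one-vs-rest counts
    for a single damage state."""
    pd = tp / (tp + fn)          # detection probability
    pf = fp / (fp + tn)          # false-alarm probability
    precision = tp / (tp + fp)
    balance = 1.0 - math.sqrt(pf ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
    return pd, pf, precision, balance

pd, pf, precision, balance = measures(tp=80, fn=20, fp=10, tn=90)
```

Balance is the normalized Euclidean distance from the ideal ROC point (pf = 0, pd = 1), so a method must keep false alarms low and detection high to score well on it.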
Table V. Comparison of the Models in Precision and ROC Area for the 950-Year Prediction, with the Practical Damage Survey as the Training Data

Model | Precision (states 1–6) | ROC Area (states 1–6)
BN (BMA) 0.62 0.75 0.78 0.76 0.88 0.90 1.00 0.98 0.98 0.98 0.96 0.94
BN (K2) 0.31 0.66 0.63 0.61 0.81 0.85 0.99 0.96 0.97 0.97 0.91 0.91
BN (HC) 0.26 0.66 0.63 0.60 0.83 0.86 0.99 0.96 0.97 0.97 0.91 0.90
BN (TAN) 0.52 0.69 0.77 0.74 0.87 0.90 0.98 0.98 0.99 0.98 0.96 0.94
BN (Tabu) 0.33 0.65 0.66 0.60 0.81 0.85 0.99 0.96 0.96 0.97 0.91 0.91
BN (SA) 0.62 0.79 0.82 0.81 0.86 0.91 0.99 0.98 0.98 0.98 0.96 0.95
BN (GA) 0.27 0.66 0.63 0.60 0.82 0.85 0.99 0.96 0.97 0.97 0.91 0.91
LR 0.47 0.70 0.57 0.57 0.69 0.81 0.99 0.92 0.91 0.93 0.90 0.91
NB 0.24 0.65 0.27 0.39 0.78 0.74 0.98 0.93 0.91 0.39 0.77 0.74
RBF 0.54 0.70 0.68 0.61 0.72 0.75 0.99 0.95 0.94 0.92 0.90 0.90
MPer 0.37 0.71 0.74 0.83 0.86 0.90 0.95 0.85 0.93 0.91 0.95 0.90
Note: In the No. of State row, 1: destroyed; 2: major; 3: moderate; 4: light; 5: slight; 6: none.
Table III shows the specific classification of these factors acquired by the discretization algorithm in Section 3.3 and the relevant methods used to determine their prior probability distributions. Our exploratory data analysis showed that the linear correlations between these factors ranged from 0.0032 to 0.379. Such slight linear correlations show that the multiple-collinearity problem in the prediction models could be decreased or avoided, even when using all the factors as predictors. As for the conditional probabilities or relevant parameters of each predictor, given their prior probabilities known according to Table III, they could be learned from the training data sets by the models according to the principles of maximum likelihood and minimum error. We developed these algorithms based on WEKA(52) and MatLab.(53)

In the simulation of the return period of 950 years, we compared the result with the ROI's damage situation estimated by aerial photos and practical surveys(1) of the Wenchuan earthquake of May 12, 2008. It was found that our simulation had a total prediction accuracy of 0.846, with a Kappa statistic of 0.735, acceptable for practical monitoring and forecasting. Table IV lists pd and balance for each damage level; Table V presents precision and ROC area for each damage level.

As shown in Tables IV and V, the BMA prediction of the BN achieved a good pd with good balance and a moderately good precision and ROC area for each damage level, unlike the other probabilistic models, which achieved a "good" performance only for some damage levels. Although the other
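The Kappa statistic quoted above is Cohen's chance-corrected agreement between the predicted and surveyed damage states; a minimal sketch of its computation (the confusion matrix is hypothetical):

```python
def cohens_kappa(cm):
    """Cohen's kappa from a square confusion matrix (rows: actual, cols: predicted):
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = sum(map(sum, cm))
    k = len(cm)
    p_obs = sum(cm[i][i] for i in range(k)) / n
    p_exp = sum(sum(cm[i]) * sum(row[i] for row in cm) for i in range(k)) / n ** 2
    return (p_obs - p_exp) / (1.0 - p_exp)

kappa = cohens_kappa([[20, 5], [10, 15]])   # toy two-state survey vs. model counts
```

Unlike raw accuracy, kappa discounts the agreement expected by chance from the marginal class frequencies, which is why it is reported alongside the 0.846 accuracy.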
Fig. 8. Vulnerability index estimated by the Bayesian method: (a) prediction of the vulnerability for the scenario of the recurrence period of 475 years (10% chance of exceedance of PGA within 50 years); (b) prediction for the scenario of the recurrence period of 950 years (10% chance within 100 years).
models can have a slightly better value on some performance measures for certain levels of damage (e.g., BN(SA)'s pd for level 4 in Table IV), they perform no better than our BMA prediction for the other levels. In total, the BMA prediction has an acceptable range for pd (0.79–0.91), balance (0.85–0.88), precision (0.62–0.90), and ROC area (0.94–1.0) for every level of damage. Thus, our BMA prediction has achieved a stable performance compared with the other single probabilistic models.

Due to BMA's robust prediction, we used it to do the vulnerability assessment. Fig. 8(a) presents the vulnerability index of human beings in the ROI under the scenario of a 10% chance of exceedance probability of PGA within 50 years; Fig. 8(b) presents the vulnerability index under the scenario of a 10% chance of exceedance probability of PGA within 100 years. The former corresponds to a recurrence period of 475 years and the latter to a recurrence period of 950 years, similar to the Wenchuan earthquake of May 12, 2008. Comparing the vulnerability indices of the 475- and 950-year recurrence periods, we found that the vulnerability for the 950-year period is considerably larger than that for the 475-year period. The spatial distribution of vulnerability for the 950-year period is also slightly different from that of the 475-year period (see Fig. 5a compared with Fig. 5b). This can be explained by the difference in PGA: the 950-year PGA is stronger than that for 475 years.

4.6. Sensitivity Analysis

Table VI presents the results of the sensitivity analysis: Shannon mutual information between the predictive variables and vulnerability, with Vul as the target variable. From this table, we can see that kdf and pga have the largest values, indicating the greatest influence on the vulnerability index. Following these are hpr, cost, sl, kdr, and st. As kdf, pga, hpr, and cost are the main epistemic uncertainty sources, this result suggests that decisionmakers should prioritize local data-collection efforts on these four factors rather than on the other factors listed in the BN.

Table VI. Shannon Mutual Information with the Vulnerability Index, Vul, as the Target Variable, and Other Predictive Factors Influencing Vul

Predictive Factors | Shannon Mutual Information
Close to faults (kdf) | 0.189
PGA risk (pga) | 0.135
Density of people (hpr) | 0.131
Accumulative road cost surface (cost) | 0.115
Slope (sl) | 0.102
Close to rivers? (kdr) | 0.080
Soil type (st) | 0.021

5. DISCUSSION

Probabilistic data mining models are a good means for estimating vulnerability since their probability outputs can be multiplied with the cost coefficients of the damage states (Table II) to obtain a vulnerability index. The performance of these models in predicting the probabilities of the damage states is significant for the estimation of the vulnerability index. In our study case, we compared probabilistic models such as BNs, NB, LR, the RBF network, and the neural network. As shown in Tables IV and V, these models were effective for prediction of certain levels of damage but performed poorly for other levels. But by using BMA to average the predictions of the BNs, the uncertainty of the models was decreased, and the prediction performed stably and moderately well. This validation illustrates that the averaged prediction of BN is an acceptable method for modeling and estimating the vulnerability of human beings to catastrophic risk.

As a means of probability inference, BN offers several specific advantages over other probabilistic models. BN supports a good platform for integrating information sources from multidisciplinary specialist fields and using them to make uncertainty inferences. We propose a generic Bayesian framework for modeling catastrophic risk (Fig. 1). As illustrated in our study of seismic risk, BN integrates the output from the probabilistic seismic hazard model designed by seismic experts, and the assessment of the environment and system resistance by architects and engineers, within a consistent modeling system. Also, BN can update the predictive values of risk given partial evidence, even with missing data. This offers decisionmakers good knowledge of the practical situation, or of some potential future scenarios, before taking action or making emergency plans.

Another advantage of our method is the use of spatial analysis and GIS to assess, visualize, and locate the vulnerability. We used the spatial analysis techniques of KDA and ARCSM to quantify the influence of geofeatures such as faults, rivers, and roads. It is assumed that the closer the individuals exposed are to the active faults or rivers, the higher is the vulnerability; the closer the individuals exposed are to the
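The Shannon mutual information used in Table VI can be computed directly from discretized samples; a minimal sketch (the sequences are toy data, and the score is in nats):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Shannon mutual information (nats) between two discrete sequences:
    sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) p(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

mi_informative = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])    # fully predictive
mi_uninformative = mutual_information([0, 1, 0, 1], [0, 0, 1, 1])  # independent
```

A factor that fully determines the target scores log(2) ≈ 0.693 nats here, while an independent factor scores 0, which is the scale on which the Table VI rankings should be read.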
rescue roads and/or shelter services, the lower is the vulnerability (because they have more opportunity to escape, thus decreasing vulnerability). This assumption is generally acceptable. With spatial-analysis techniques such as KDA and ARCSM, potential influences from many risk-related, critical geofeatures or indicators can be quantified and used in modeling in combination with other quantitative variables. Further, the use of GIS facilitates locating geographical variations of the vulnerability and their uncertainties. The use of both spatial analysis and GIS in our grid-based approach has a significant implication: the important risk-related factors are quantified for risk prediction, and the risk-prone sites can be spatially located at a fine scale. This enables decisionmakers to plan more precisely for shelter services before the disaster, more effectively allocate resources during the disaster, and make more informative risk maps after the disaster for future precaution and preparedness.

Further, our method can combine domain knowledge and learning from spatial data to improve the performance of risk and vulnerability assessment. Given enough representative training samples, we can understand the interdependency relationships between the variables in MP, E, and HC of the framework and use domain knowledge to adaptively refine the relationships revealed.

6. CONCLUSION

Given the uncertain nature of natural disaster events, we develop a generic Bayesian data mining approach to the vulnerability assessment of human beings exposed to catastrophic risk. This method is applicable to most natural disasters. In our method, BN provides an integrative platform combining a variety of information sources from different specialist fields and facilitates communication between different domain specialists, for example, experts in natural disasters, architects, engineers, and economists. BN also serves as a means of uncertainty analysis that implements the propagation of uncertainties from different sources. Therefore, it is suitable for vulnerability assessment of natural disasters influenced by complex multiple factors.

Use of spatial analysis and GIS makes possible the derivation of key risk-related quantitative predictive factors, such as the influences from active faults and rivers, and visualizes uncertainties in a spatial manner. Our study case of seismic risk illustrates that different ground motions, environmental conditions (e.g., slope and soil), and characteristics of human beings' behavior produce different vulnerability findings (Fig. 8). The estimated vulnerability is more objective and spatially located.

Our Bayesian mining approach is generic (Section 3), and it can be applied to other types of catastrophic risk, such as floods, typhoons, mudslides, and avalanches, among other catastrophes, by incorporating the relevant domain knowledge and research findings in the specification of the BN, as illustrated in our study case.

ACKNOWLEDGMENTS

This research is partially supported by grants 41171344/D010703 from the Natural Science Foundation of China, grant 2011AA120305-1 from the Hi-tech Research and Development Program of China's Ministry of Science and Technology (863), and grant 2012CB955503 (Research of Identification of Susceptible Population and Risk Regionalization for Climate Changes and Health) from the National Basic Research Program of China's Ministry of Science and Technology (973). We also thank the reviewers for their constructive suggestions and the editors for their careful checks and revisions, which further improved this article.

REFERENCES

1. Civil and Structural Groups of Tsinghua University XaJiUaBJU. Analysis on seismic damage of buildings in the Wenchuan earthquake. Journal of Building Structures (in Chinese), 2008; 29(4):1–9.
2. Paterson E, Re D, Wang Z. The 2008 Wenchuan Earthquake: Risk Management Lessons and Implications. Beijing: Risk Management Solutions, Inc., 2008.
3. Asian Development Bank. An Initial Assessment of the Impact of the Earthquake and Tsunami of December 26, 2004 on South and Southeast Asia. Metro Manila, Philippines, 2005.
4. Tamerius DJ, Wise KE, Uejio KC, McCoy LM, Comrie CA. Climate and human health: Synthesizing environmental complexity and uncertainty. Stochastic Environmental Research and Risk Assessment, 2007; 21(5):601–613.
5. Alexander D. Natural Disasters. New York: Chapman & Hall, 1993.
6. Arnold M, Chen RS, Deichmann U, et al., eds. Natural Disasters Hotspots: Case Studies. Disaster Risk Management Series. Washington, DC: International Bank for Reconstruction and Development/World Bank, 2006.
7. Jiang H, Eastman JR. Application of fuzzy measures in multicriteria evaluation in GIS. International Journal of Geographical Information Science, 2000; 14(2):173–184.
8. Shi P. Theory on disaster science and disaster dynamics. Natural Disasters (in Chinese), 2002; 11(3):1–9.
9. William JP, Arthur AA. Natural Hazard Risk Assessment and Public Policy. New York: Springer-Verlag, 1982.
10. Kloog I, Haim A, Portnov AB. Using kernel density functions as an urban analysis tool: Investigating the association between nightlight exposure and the incidence of breast cancer in Haifa, Israel. Computers, Environment and Urban Systems, 2009; 33:55–63.
11. Huang C. Risk Analysis of Natural Disasters. Beijing: Beijing Normal University Press, 2001.
12. Li L, Wang J, Wang C. Typhoon insurance pricing with spatial decision support tools. International Journal of Geographical Information Science, 2005; 19(3):363–384.
13. Anselin L. Spatial data analysis with GIS: An introduction to application in the social sciences. Technical Report 92-10. Systems Research, 1992; 321(8):1605–1609.
14. Goodchild M, Haining R, Wise S. Integrating GIS and spatial data analysis: Problems and possibilities. International Journal of Geographical Information Systems, 1992; (6):407–423.
15. Li L, Wang J, Leung H. Using spatial analysis and Bayesian network to model the vulnerability and make insurance pricing of catastrophic risk. International Journal of Geographical Information Science, 2010; 24(12):1759–1784.
16. Korb KB, Nicholson AE. Bayesian Artificial Intelligence. Boca Raton, FL: Chapman & Hall/CRC, 2004.
17. Amendola A, Ermoliev Y, Gitis VE, Koff G, Linnerooth-Bayger J. A systems approach to modeling catastrophic risk and insurability. Natural Hazards, 2000; 21:381–393.
18. Straub D. Natural hazards risk assessment using Bayesian networks. Pp. 2509–2516 in Proceedings of the 9th International Conference on Structural Safety and Reliability (ICOSSAR 05). Rome, Italy, 2005.
19. Tobler WR. Cellular Geography. Philosophy in Geography. Dordrecht: Reidel, 1979.
20. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag, 2001.
21. Miller HJ, Han J. Geographic Data Mining and Knowledge Discovery. London and New York: Taylor & Francis, 2001.
22. ESRI. ArcGIS Spatial Analyst: Advanced GIS Spatial Analysis Using Raster and Vector Data. New York: Environmental Systems Research Institute, 2001.
23. Torun A, Duzgun S. Using spatial data mining techniques to reveal vulnerability of people and places due to oil transportation and accidents: A case study of Istanbul Strait. Pp. 43–48 in International Archives of Photogrammetry, Remote Sensing, and Spatial Information Sciences (ISPRS), Technical Commission II Symposium. Vienna, 2006.
24. Varnakovida P, Messina PJ. Hospital Site Selection Analysis. Michigan: Michigan State University, 2006.
25. Silverman BW. Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall, 1986.
26. McCoy J, Johnston K. Using ArcGIS Spatial Analyst. Redlands: ESRI, 2001.
27. Fulton T, Kasif S, Salzberg S. Efficient algorithms for finding multi-way splits for decision trees. Pp. 244–251 in Proceedings of the Twelfth International Conference on Machine Learning. San Francisco, CA: Kaufmann, 1995.
28. Fayyad U, Irani K. Multiple-interval discretization of continuous-valued attributes for classification learning. Pp. 1022–1027 in Thirteenth International Joint Conference on Artificial Intelligence. San Mateo, CA: Kaufmann, 1993.
29. Bouckaert RR. Bayesian belief networks: From construction to inference. Dissertation, Universiteit Utrecht, 1995.
30. Cooper FG, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 1992; 9:309–347.
31. Buntine W. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 1996; 8:196–210.
32. Friedman N, Geiger D. Bayesian network classifiers. Machine Learning, 1997; 29:131–163.
33. Chow CK, Liu CN. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 1968; 14(3):462–467.
34. Larranaga P, Poza M, Yurramendi Y, Murga HR, Kuijpers C. Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996; 18(9):912–926.
35. Hoeting AJ, Madigan D, Raftery EA, Volinsky TC. Bayesian model averaging: A tutorial. Statistical Science, 1999; 14(4):382–417.
36. Hosmer D, Lemeshow S. Applied Logistic Regression, 2nd ed. John Wiley and Sons, 2000.
37. Cox AL. Risk Analysis: Foundations, Models and Methods. Norwell, MA: Springer, 2001.
38. Morales-Casique E, Neuman PS, Vesselinov VV. Maximum likelihood Bayesian averaging of airflow models in unsaturated fractured tuff using Occam and variance windows. Stochastic Environmental Research and Risk Assessment, 2010; 24(6):843–880.
39. Neuman PS. Maximum likelihood Bayesian averaging of uncertain model predictions. Stochastic Environmental Research and Risk Assessment, 2003; 17(5):291–305.
40. Bard P. Local effects of strong ground motion: Basic physical phenomena and estimation methods for microzoning studies. Laboratoire Central des Ponts et Chaussées and Observatoire de Grenoble.
41. Bayraktarli YY, Yazgan U, Dazio A, Faber HM. Capabilities of the Bayesian probabilistic networks approach for earthquake risk management. Pp. 1– in First European Conference on Earthquake Engineering and Seismology. Geneva, Switzerland, 2006.
42. Cornell CA. Engineering seismic risk analysis. Bulletin of the Seismological Society of America, 1968; (58):1583–1606.
43. Day RW. Geotechnical Earthquake Engineering Handbook. New York: McGraw-Hill, 2001.
44. Chen X, Qi W, Ye H. Fuzzy comprehensive study on seismic landslide hazard based on GIS. Acta Scientiarum Naturalium Universitatis Pekinensis, 2008; 44(3):434–438.
45. Jakob M. Morphometric and geotechnical controls of debris flow frequency and magnitude in southwestern British Columbia. Ph.D. thesis, University of British Columbia, Vancouver, 1996.
46. Algermissen ST, Perkins DM. A probabilistic estimate of maximum acceleration in rock in the contiguous United States. USGS, 1976.
47. McGuire RK. FRISK—A computer program for seismic risk analysis. Department of Interior, Geological Survey, 1978.
48. Bayraktarli YY, Ulfkjaer J, Yazgan U, Faber HM. On the application of Bayesian probabilistic networks for earthquake risk management. In the 9th International Conference on Structural Safety and Reliability (ICOSSAR 05). Rome, Italy, 2005.
49. Guo Z, Chen X. Strategies Against Earthquakes for Cities (in Chinese). Beijing: Earthquake Press, 1992.
50. Kramer SL. Geotechnical Earthquake Engineering. New Jersey: Prentice Hall, 1996.
51. Yi Z. Forecast Methods of Seismic Disaster and Loss. Beijing: Geology Publisher, 1995.
52. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
53. MathWorks. MATLAB 2007 User Manual. MathWorks, 2007.
54. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco: Morgan Kaufmann, 1988.
55. Gutenberg B, Richter CF. Frequency of earthquakes in California. Bulletin of the Seismological Society of America, 1944; (34):185–188.