Journal of Hydrology
journal homepage: www.elsevier.com/locate/jhydrol
Classifiers for the detection of flood-prone areas using remote sensed elevation data
Massimiliano Degiorgis, Giorgio Gnecco, Silvia Gorni, Giorgio Roth ⇑, Marcello Sanguineti,
Angela Celeste Taramasso
University of Genoa, Genoa, Italy
Article info

Article history: Received 31 May 2012; received in revised form 9 August 2012; accepted 3 September 2012; available online 11 September 2012. This manuscript was handled by Corrado Corradini, Editor-in-Chief, with the assistance of Magdeline Laba, Associate Editor.

Keywords: Flood hazard; Flood risk management; Receiver operating characteristics; Linear classifiers and support-vector machines; Parameter optimization; Shuttle radar topography mission

Summary

A technique is presented for the identification of the areas subject to flooding hazard. Starting from remote sensed elevation data and existing flood hazard maps (usually available for limited areas), the relationships between selected quantitative morphologic features and the flooding hazard are first identified and then used to extend the hazard information to the entire catchment. This is performed through techniques of pattern classification, such as linear classifiers based on quantitative morphologic features, and support vector machines with linear and Gaussian kernels. The experiment starts by discriminating between flood-prone areas and marginal hazard areas. Multiclass classifiers are subsequently used to graduate the hazard. Their designs amount to solving suitable optimization problems. Several performance measures are considered in comparing the different classifiers, such as the area under the receiver operating characteristics curve, and the sum of the false positive and false negative rates.

The procedure has been validated for the Tanaro basin, a tributary to the major Italian river, the Po. Results show a high reliability: the classifier properly identifies 93% of flood-prone areas, and only 14% of the areas subject to a marginal hazard are improperly assigned. An increase of this latter value up to 19% is detected when the same structure is applied for hazard graduation. Results derived from the application to different catchments seem to qualitatively indicate the ability of the classifier to perform well also outside the calibration region.

Pattern classification techniques should be considered when the identification of flood-prone areas and hazard grading is required for large regions (e.g., for civil protection or insurance purposes) or when a first identification is needed (e.g., to address further detailed flood-mapping activities).

© 2012 Elsevier B.V. All rights reserved.
⇑ Corresponding author. Address: University of Genoa, Via Montallegro 1, Genoa 16145, Italy. Tel.: +39 0103532486; fax: +39 0103532546.
E-mail address: giorgio.roth@unige.it (G. Roth).
0022-1694/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.jhydrol.2012.09.006
M. Degiorgis et al. / Journal of Hydrology 470–471 (2012) 302–315

1. Introduction

Flooding is one of the most significant natural risks. Its impact concerns almost all the components of global communities, independently of their geographic location and their social and economic structures. The mapping of the hazard component of the flooding risk is frequently identified as the basic element on which risk mitigation strategies should be developed. Consequently, many countries regulate hazard and risk mapping by law. In 1973 the Congress of the United States of America, through the Flood Disaster Protection Act (Pub. L. No. 93-234; 87 Stat. 975), recognized the relevance of the flooding hazard and called for the identification of floodplain areas and related hazard areas. More recently, in 2007, the European Commission, through the Flood Directive (2007/60/EC), required European Union member states to produce flood hazard and flood risk maps. While adverse effects on asset values, people and the environment should eventually be included in the analysis (de Moel et al., 2009; Ghizzoni et al., 2010, 2012), the still limited availability of flood-prone and hazard-grading maps requires an effort to be directed toward the completion of this knowledge.

Flood hazard maps constitute the result of a modeling chain that usually starts from the collection of historical information about past flood events, which allows for the first recognition of potentially hazardous reaches and river sections. A hydrologic analysis is then performed to define flow peak discharges and related hydrographs for assigned return periods. Those are the input for hydraulic flow propagation models, which allow for the description of water levels along the reaches under examination. While hydrologic analyses usually involve the availability of relevant discharge time series, the hydraulic analysis requires the knowledge of both river channel geometry and pertinent characteristics, such as the surface roughness and boundary conditions. At this stage, critical sections for the given return periods can be identified. They constitute the starting point of the modeling of the inundation process over the floodplains. A variety of different models can be assumed to represent the involved physical processes, and the choice of the right model depends upon both the
characteristics of the physical system under analysis and the accuracy of the expected results (Guzzetti et al., 2005; Horritt and Bates, 2002; Hunter et al., 2007).

The above outlined procedure is well established and able to accurately recognize flood-prone areas down to the scale of the single building. On the other hand, it is expensive and time consuming; moreover, it requires information not readily available for all areas. For all these reasons, even in developed countries, the complete mapping of flood-prone areas is far from being achieved.

The development and processing of Digital Elevation Models (DEMs) is a subject of increasing interest for a number of environmental disciplines. Consequently, the availability of new technologies to measure surface elevation (e.g., GPS, SAR, SAR interferometry, radar and laser altimetry) has made the application of DEM-based models more attractive. Moreover, in recent years the DEM-based automatic characterization of hydrological and morphological features (e.g. drainage area, stream channels, valley bottoms, and floodplain identification) has become a practice for hydrologists and geomorphologists, substituting time-consuming manual procedures (Bates et al., 2003; Dodov and Foufoula-Georgiou, 2005; Gallant and Dowling, 2003; Giannoni et al., 2005, 2008; Manfreda et al., 2011; Nardi et al., 2006, 2008; Noman et al., 2001). Among global elevation sources, in the following reference is made to HydroSHEDS (Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales; see Lehner et al. (2008) for technical information), based on the NASA Shuttle Radar Topography Mission (SRTM).

In the present contribution, a possible approach is delineated for flood-prone areas identification and hazard grading from Basin Authorities and remote sensed elevation datasets. For the purpose, a number of DEM-derived quantitative morphologic features were selected: local slope, contributing area, site elevation and distance from the potential source of flooding, and surface concavity. On the other hand, flood hazard maps provided by Basin Authorities, and available for limited portions of the basin area, complete the dataset for the calibration of relationships between the selected morphologic features and the flooding hazard. Once these relationships are identified and calibrated, the extension of the hazard information to the entire catchment can be performed. This is achieved through techniques of pattern classification such as the use of linear classifiers based on one or two of the morphologic features, and of Support Vector Machines (SVMs) with linear and Gaussian kernels (Franc and Hlávač, 2004; Vapnik, 1998). The experiments have been made first by discriminating between flood-prone areas and marginal hazard areas, then multiclass classifiers have been used for hazard graduation. For the purpose of the present paper, a marginal hazard level is introduced: it distinguishes areas that are subject to the flood hazard with a return time greater than that used to identify flood-prone areas.

This paper is organized as follows. Section 2 presents the Tanaro case-study area. Section 3 introduces the elevation dataset and the methods used to prune the drainage network, i.e. the source of the flood hazard. The marginal hazard concept is presented in Section 4 together with a simple procedure able to identify areas that are subject to this hazard level. Section 5 describes the selected morphological features, i.e. those for which the relation with the flooding hazard is to be evaluated. Performance measures are intensively used to identify the best classifier and to validate the procedure. They are introduced in Section 6. Linear classifiers and SVM with linear and Gaussian kernels are presented in Sections 7 and 8, respectively. A procedure for hazard graduation in recognized flood-prone areas is delineated in Section 9. Finally, Section 10 presents the application and validation of the procedure with reference to the Tanaro case study as well as the qualitative validation performed within portions of the Tevere, the Dora Baltea and the Quirra catchments (Italy). Section 11 is a brief concluding section. To make the paper self-contained, an appendix on basic results from Statistical Learning Theory is included.

2. The Tanaro basin case study

Fig. 1. Tanaro basin: location map (a) and hillshaded representation of the SRTM dataset (b).

Tanaro is a 276 km-long river in North-western Italy (Fig. 1). It rises in the Ligurian Alps, close to the border with France, and is the most significant right-side tributary to the Po River in terms of length, drainage area (partly Alpine and partly Apennine) and discharge. At its junction with the Po River, the Tanaro drains about 8000 km², 500 km² of which are in mountainous terrain. Major tributaries are the Stura di Demonte (contributing area 1430 km²), Alto Tanaro (1840 km²), Belbo (480 km²), Bormida (1640 km²) and Orba (840 km²). Morphological variability allows for the identification of three main zones with characteristic behaviors. The mountain part, with an average 6% slope, very steep
catchments and deep river beds; the mild part, with an average 1% slope, shallower river beds and mildly steep catchments; finally, the alluvial part, characterized by very low slope values. Unique among the Po right-side tributaries, the river has an Alpine origin. Nevertheless, the Ligurian Alps are not high enough, and are located too close to the sea, to allow for the formation of snow fields or glaciers large enough to provide a steady source of water during the dry summer season. Furthermore, the Alpine zone forms only a part of the basin drained by the Tanaro. The discharge is therefore subject to a great deal of variation, and the seasonal regime of the river is more typical of an Apennine torrent, with maximum discharges in spring and autumn and a very small flow rate in summer. The river is highly prone to flooding. During the last two centuries, the Tanaro basin was affected by floods on 136 occasions, the most devastating being that of November 1994, when the whole of the river valley was affected by severe flooding (Marchi et al., 1996; Luino, 2002). Due to these characteristics, the Tanaro was selected as case study for the design of risk scenarios for the flooding hazard (Ghizzoni et al., 2010).

Fig. 1 incorporates a hillshaded representation of the Tanaro basin elevation data. The DEM used in this study to describe the drainage basin surface is derived from the NASA SRTM mission with a 3 arc-seconds resolution. For the study area, this corresponds to a DEM grid size of about 85 × 85 m. While this resolution is far from allowing for the recognition of flood control structures (such as levees, dikes and weirs), it is accurate enough to describe the local terrain morphology (a discussion on the influence of DEM source and grid size on the delineation of flood-prone areas is provided in Manfreda et al. (2011)).

Fig. 2 depicts the Po Basin Authority results relevant to the present study (www.adbpo.it). Panel (a) shows the portion of the drainage network for which hydrologic and hydraulic studies were performed to define flood-prone areas and hazard graduation. Panel (b) shows the areas that are recognized as subject to possible floods produced by the reaches that constitute the network in Panel (a). Panel (c) shows, within flood-prone areas, the hazard graduation. The flooding hazard is graduated in three classes: high, medium, and low.

Fig. 2 makes clear that the work needed to complete the knowledge of the flooding hazard for the Tanaro basin is far from being finished. When analyzing the hazard for an element located outside recognized flood-prone areas, i.e., within the gray areas in Fig. 2c, this incomplete knowledge presents severe drawbacks. In fact, since many non-studied tributaries are present, each location in the catchment seems to be potentially subject to the flooding hazard.
Fig. 3. Reference drainage network of the Tanaro basin. Streams studied by the Po Basin Authority are depicted in dark-blue. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. Marginal hazard areas of the Tanaro basin (green) identified according to the Po Basin Authority hazard studies. Marginal hazard areas are recognized as those cells (i) directly drained by the network of Fig. 2a, (ii) not prone to floods according to Fig. 2c, and (iii) not flowing through the streams depicted in light-blue in Fig. 3. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
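The three conditions listed in the caption of Fig. 4 amount to a cell-wise logical AND over rasters. The following is a minimal sketch, assuming that three boolean grids aligned with the DEM are already available; the function and array names are hypothetical and are not taken from the paper.

```python
import numpy as np

def marginal_hazard_mask(drained_by_studied, flood_prone, via_unstudied):
    """Return True for cells that satisfy conditions (i), (ii) and (iii).

    All three inputs are hypothetical boolean rasters on the DEM grid:
      drained_by_studied : cell is directly drained by the studied network
      flood_prone        : cell is mapped as flood-prone
      via_unstudied      : cell flows through a non-studied stream
    """
    drained_by_studied = np.asarray(drained_by_studied, dtype=bool)
    flood_prone = np.asarray(flood_prone, dtype=bool)
    via_unstudied = np.asarray(via_unstudied, dtype=bool)
    return drained_by_studied & ~flood_prone & ~via_unstudied

# Toy 2 x 3 grid, for illustration only.
drained = np.array([[1, 1, 1], [1, 0, 1]], dtype=bool)
prone = np.array([[1, 0, 0], [0, 0, 0]], dtype=bool)
unstudied = np.array([[0, 0, 1], [0, 0, 0]], dtype=bool)
mask = marginal_hazard_mask(drained, prone, unstudied)
```

Cells for which the mask is False and that are not mapped as flood-prone remain unclassified in this sketch.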
Different methods allow for the identification of the drainage network from a DEM (see, e.g., Band, 1986; Giannoni et al., 2005; O'Callaghan and Mark, 1984; Rodriguez-Iturbe and Rinaldo, 1997; Roth et al., 1996; Tarboton et al., 1991). In the context of the present work, the drainage network is pruned through the procedure proposed by Giannoni et al. (2005), which takes into account contributing area, $A$, and local slope, $S$, in the form $AS^k$. On the basis of this area-slope criterion, a channel is expected to start from locations where the quantity $AS^k$ exceeds a threshold value. Once a channel is generated, its path to the outlet is identified by following the maximum slope direction.

This procedure produces a non-uniform drainage density, an important attribute when accurate recognition of the extension of the drainage system is essential. In this framework, the exponent $k$ of the threshold expression is substantially responsible for drainage density redistribution within the basin. In fact, this parameter, by assigning different importance to the slope, augments the influence of high slope values in steep mountain zones, and does the opposite in flat areas. The value $k = 1.7$ is assumed in the present work (for a complete discussion on this topic see Giannoni et al., 2005). The threshold above which a drainage is produced is here fixed to achieve, for the Tanaro catchment, the average drainage density value obtained by the Po Basin Authority, that is $AS^k \geq 5 \times 10^4\ \mathrm{m^2}$ and $D_d = 0.74\ \mathrm{m^{-1}}$. The resulting drainage network is depicted in Fig. 3, from which one can both note the significant increase in the extension of the network, and guess (e.g., from DEM shadows) the presence of small tributaries still not considered as potential hazard sources. Obviously, landscape dissection by surface transport processes starts well below the dimension here determined by the threshold value assumption (see, e.g., Montgomery and Dietrich, 1988). Nevertheless, the small size of un-dissected catchments, as well as their location in hilly and mountain areas, allows assuming that all main hazard sources are taken into account, at least for the aims of the present work. Moreover, a possible downsizing to the scales of surface dissection and channel generation processes will result in an all-pervading hazard source.

4. Marginal hazard identification

For the purpose of the present paper, a marginal hazard level is assigned to the areas that are subject to the flood hazard with a return time greater than that used to identify flood-prone areas. The hazard in such areas is less than low, and tends to zero. As a consequence, the flooding hazard for a given location can be set to marginal if (i) all the potential flood risk sources have been taken into account and (ii) for each single risk source the location is outside its flood-prone area.

The calibration of classifiers able to discriminate the hazard level needs a training set that includes elements that belong to all the classes: high, medium, low, and marginal. For this purpose, marginal hazard areas associated with the network of Fig. 2 should be identified.

Marginal hazard areas are here recognized as the ensemble of the DEM cells that are: (i) directly drained by the studied network of Fig. 2a; (ii) not recognized as prone to floods according to Fig. 2b; and (iii) not flowing through the streams depicted in light-blue in Fig. 3. Note that the last condition holds only for those locations not subject to be flooded from a non-studied stream.

Fig. 4 presents the results of this identification procedure. The map now includes four hazard classes (high, medium, low, and marginal), although unclassified areas are still present.

5. Flood-related basic morphologic features

Binary classifiers need a dataset of features on which calibration and predictions are to be performed. Basin surface topography and morphology are here represented by a DEM: an obvious ingredient of the dataset is therefore the location of the cell under exam (latitude and longitude). The hazard class (high, medium, low, or marginal) will be provided for calibration purposes where available. Other features should be related to the physical process under investigation, and available for the entire area under study, in this case the Tanaro catchment. In this work, simple morphologic features are taken into account, leaving their matching and weighting to the classifier structure. The selected features, specified for each DEM cell, are: distance from the nearest stream, $D$, elevation to the nearest stream, $H$, surface curvature, $\Delta H$ (defined as the Laplacian of the elevation), contributing area, $A$, and local slope, $S$, estimated as the maximum slope among the eight possible flow directions that connect the cell under exam to the adjacent cells. Contributing area and local slope are the main ingredients of the Topographic Index (TI) first introduced by Kirkby (1975) and recently modified to detect flood-prone areas from DEMs by Manfreda et al. (2011). In the present work, these two features will be considered separately, together, and mixed with all other features.

Two features are related to the cell location with respect to the nearest stream. The distance from the nearest stream is the length of the path, identified by following the maximum slope direction, which connects the cell under exam to the nearest element of the reference drainage network (Fig. 3). The elevation to the nearest stream is the difference between the elevation of the cell under exam and the elevation of the final point of the above-identified path. As suggested by intuition, large distance and elevation values
correspond to a lower flood hazard. Finally, the surface curvature is related to the ability to discriminate convex divides from concave valley bottoms, which are more prone to flooding.

6. Dataset and performance measures

As a result of the above, the available dataset is composed of 187,306 labeled data points pertaining to both the Tanaro basin flood-prone areas recognized by the Po Basin Authority, and to conterminous areas subject to a marginal flood hazard. Each data point contains latitude and longitude plus the following five features: $D$, $H$, $\Delta H$, $A$, $S$. Within flood-prone areas, the dataset also provides the hazard level: high, medium or low. For the sake of simplicity, we first ignore the hazard level, and divide data points into two classes: class 0 for marginal hazard data points, and class 1 for data points with high, medium or low hazard level. The data points belonging to class 0 are 131,785 (about 70% of the size of the dataset); the ones from class 1 are 55,521 (about 30% of the size of the dataset).

Since the dataset was necessarily sampled from a portion of the geographical region under investigation, we did not use latitude and longitude to train the classifiers. While these two features may be useful to classify data points that are in the neighborhood of some training sample, they may be misleading for the classification of data points coming from sub-regions where no training sample is available.

To investigate the dependence of simulation results on the size of the dataset, some simulations were performed on both the whole dataset and suitably chosen subsets. Indeed, the first option may be time-consuming for large datasets. To reduce simulation time, only a few training samples may be sufficient to evaluate empirical estimates of the same metrics computed using the whole dataset.

In some applications, because of the different importance assigned to the occurrence of each of the two events, achieving a small average misclassification error on one of the two classes may be more significant than obtaining a small average misclassification error on the other class. During the training phase, assigning two different weights to the average misclassification errors on the two classes may satisfy this need. However, sometimes the exact values of such weights are difficult to establish. A second way to satisfy this need is through the so-called Receiver Operating Characteristics (ROC) curve (see, e.g., Fawcett, 2006), which is defined in terms of false positive and true positive rates. We recall that, for any given binary classifier, the false positive rate, $r_{fp}$, is the probability that a sample coming from class 0 is erroneously classified as a sample coming from class 1, i.e. a marginal hazard area is classified as flood prone. Similarly, the false negative rate, $r_{fn}$, is the probability that a sample coming from class 1 is erroneously classified as a sample coming from class 0. The true positive rate, $r_{tp}$, and the true negative rate, $r_{tn}$, are simply obtained as $r_{tp} = 1 - r_{fn}$ and $r_{tn} = 1 - r_{fp}$. The ROC curve is defined as the set of pairs $(r_{fp}, r_{tp})$ obtained by varying the threshold of a binary classifier. The Area Under the ROC Curve (also known as AUC) is a common measure of performance, used to compare different kinds of binary classifiers. This measure does not require one to assign different weights to the average misclassification errors on the two classes, or to specify their a priori probabilities. In general, the area under the ROC curve is between 0 and 1. The larger the AUC, the better the classifier. AUC values greater than 0.5 correspond to classifiers performing better than chance. Indeed, a completely random classifier (one that is no better at recognizing true positives than flipping a coin) has an area under the ROC curve of 0.5. Instead, a classifier with 0 false positives and 0 false negatives has AUC = 1. In the following, $r_{fp}$, $r_{tp}$, and AUC are used as performance measures of the binary classifiers.

7. Linear binary classifiers

A pictorial representation of linear binary classifiers is provided in Fig. 5a. They are introduced in this section, while SVMs with linear and Gaussian kernels (Fig. 5b) are presented in Section 8. In the first case, classifiers use only one (Section 7.1) or two (Section 7.2) of the selected features presented in Section 5 whereas, in the second case, all five features are jointly used. To compare different classifiers, we use the performance measures introduced in Section 6. The respective ROC curves are first obtained by varying the classification threshold. Then, the corresponding AUCs are evaluated, and the binary classifier with the largest AUC value is finally selected.

7.1. Classifiers based on a single feature and ROC curves

We first considered linear binary classifiers based on a single feature, chosen among the following: $D$, $H$, $\Delta H$, $A$, and $S$. Data were then normalized in such a way that each normalized feature, obtained after translation and scaling of the original feature, lies between −1 and 1. Consequently, the threshold in the classifiers was also normalized.

Single feature classifiers have the advantage of being particularly simple and quick to train, since only the threshold has to be set. Fig. 6 shows the ROC curves associated with the five linear binary classifiers obtained by separately thresholding each of the five features, and varying the threshold. Table 1 shows the corresponding AUCs. For the classifiers based on $H$, $D$, and $S$, each data point was assigned to class 0 if the feature was above the threshold, and to class 1 if it was under the threshold. Instead, for the classifiers based on the features $\Delta H$ and $A$, each data point was assigned to class 0 if the corresponding feature was under the threshold, and to class 1 if it was above the threshold. The reason for such different choices is that these two different rules allow one to obtain ROC curves whose AUCs are greater than 0.5 for all the five classifiers.
Fig. 5. Pictorial representation of linear classifiers (a) and SVM classifiers (b). For graphical reasons, the case of two features $x_1$ and $x_2$ is considered. Note that in (a) a single-feature linear classifier corresponds to a horizontal or vertical separating line. In (b), again for graphical reasons, the case $H = \mathbb{R}^2$ is illustrated.
Fig. 6. Receiver Operating Characteristics (ROC) curve for the five selected features.
ROC curves are obtained by applying a threshold to one of the five features in the
dataset, and by varying the threshold.
Table 2
Empirical false positive rate, $r_{fp}$, true positive rate, $r_{tp}$, and the sum $r_{fp} + (1 - r_{tp})$ for the (approximately) optimal two-feature linear binary classifiers. The samples were obtained by sampling i.i.d. from the given data distribution (for n = 5000 and n = 20,000).

Pairs of features   $r_{fp}$ (n = 5000)   $r_{tp}$ (n = 5000)   $r_{fp} + (1 - r_{tp})$ (n = 5000)   $r_{fp}$ (n = 20,000)   $r_{tp}$ (n = 20,000)   $r_{fp} + (1 - r_{tp})$ (n = 20,000)
Table 3
False positive rate, $r_{fp}$, true positive rate, $r_{tp}$, the sum $r_{fp} + (1 - r_{tp})$, and the $\theta$ and $t$ parameters for the (approximately) optimal two-feature linear binary classifiers.

Pairs of features   $r_{fp}$   $r_{tp}$   $r_{fp} + (1 - r_{tp})$   $\theta$   $t$
H, ΔH               0.1636     0.9212     0.2424                    183°       0.794
H, S                0.1629     0.9340     0.2289                    171°       0.824
H, A                0.1732     0.9289     0.2443                    182°       0.846
H, D                0.1369     0.9283     0.2086                    186°       0.838
ΔH, S               0.5466     0.9200     0.6266                    248°       0.586
ΔH, A               0.7795     0.9528     0.8267                    16°        0.182
ΔH, D               0.2270     0.7789     0.4481                    275°       0.954
S, A                0.5920     0.9444     0.6477                    183°       0.870
S, D                0.1938     0.8521     0.3417                    219°       0.874
A, D                0.2403     0.7899     0.4504                    284°       0.790

Table 4
AUCs for the (approximately) optimal two-feature linear binary classifier for the choices $x_1 = H$ and $x_2 = D$, and for the linear binary classifiers obtained by thresholding one of the same two features.

        H and D   H        D
AUC     0.9575    0.9399   0.8628

and 20,000 times, respectively (in both cases, the same samples were used to evaluate $r_{fp}$ and $r_{tp}$, for each of the 10 kinds of classifiers). Instead, Table 3 shows the results obtained by computing the true $r_{fp}$ and $r_{tp}$ (i.e., the ones obtained using the whole dataset), and also the corresponding approximately optimal parameters $\theta$ and $t$.

The results of Tables 2 and 3 are quite similar, in accordance with Appendix A about Statistical Learning Theory. Note, however, that the samples used in Table 2 are less than 2.7% and 10.7% of the whole dataset, respectively. Moreover, Tables 2 and 3 show that even with 5000 i.i.d. samples one is able to obtain the optimal selection of $x_1$ and $x_2$ (i.e., $x_1 = H$ and $x_2 = D$).

Let $U(a_1 x_1 + a_2 x_2 - a_3)$ be an (approximately) optimal linear binary classifier and consider the choices $x_1 = H$ and $x_2 = D$. The classifier associates the pattern $(x_1, x_2)$ to class 0 if $a_1 x_1 + a_2 x_2 \leq a_3$, and to class 1 otherwise. Then, the ROC curve associated with such a classifier is obtained by varying only its threshold $a_3$ (while keeping $a_1$ and $a_2$ fixed at their optimal values), and plotting the resulting pairs $(r_{fp}, r_{tp})$. One would expect the resulting AUC to exceed 0.9399, the value obtained by thresholding $H$ alone. This is indeed the case, since in the simulation the obtained AUC was 0.9575. Fig. 8 compares the ROC curve associated with such a binary classifier and the ones associated with the linear binary classifiers obtained by thresholding one of the two features. Table 4 shows the corresponding AUCs.
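As an illustration of how a two-feature rule of this kind could be tuned, the sketch below performs a brute-force grid search that minimizes $r_{fp} + (1 - r_{tp})$. Several points are assumptions made only for this example: the parameterization $a_1 = \cos\theta$, $a_2 = \sin\theta$, $a_3 = t$ (the definitions of $\theta$ and $t$ are not given in this excerpt), the grid resolution, the function names, and the toy data. The grid search is a stand-in and is not claimed to be the optimization procedure used by the authors.

```python
import numpy as np

def normalize(x):
    """Translate and scale a feature so that it lies in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    return 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0

def rates(x1, x2, y, theta, t):
    """(r_fp, r_tp) of the rule: class 1 if cos(theta)*x1 + sin(theta)*x2 > t."""
    pred = np.cos(theta) * x1 + np.sin(theta) * x2 > t
    r_fp = float(np.mean(pred[y == 0]))  # class 0 cells predicted as class 1
    r_tp = float(np.mean(pred[y == 1]))  # class 1 cells predicted as class 1
    return r_fp, r_tp

def grid_search(x1, x2, y, n_theta=72, n_t=81):
    """Exhaustive search over (theta, t); returns (cost, theta, t)."""
    best = (np.inf, None, None)
    for theta in np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False):
        for t in np.linspace(-1.5, 1.5, n_t):
            r_fp, r_tp = rates(x1, x2, y, theta, t)
            cost = r_fp + (1.0 - r_tp)
            if cost < best[0]:
                best = (cost, theta, t)
    return best

# Toy, perfectly separable sample (hypothetical values of H and D in metres).
H = normalize([1, 2, 3, 4, 20, 30, 40, 50])
D = normalize([50, 150, 100, 200, 900, 400, 1500, 2000])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
best_cost, best_theta, best_t = grid_search(H, D, y)
```

With the direction fixed at `best_theta`, sweeping `t` alone traces the ROC curve of the two-feature classifier, in the same way as the threshold $a_3$ is varied in the text.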
8. Binary SVMs with linear and Gaussian kernel and ROC curves
Table 5
Weight vector, w, bias, b, and AUC for the binary SVM classifier with linear kernel obtained after training. The ROC curve associated with the binary SVM classifier was obtained
by varying the bias while fixing the weight vector. Two different AUCs are shown, associated to ROC curves whose rates rfp and rtp are evaluated on the validation set and on the
whole dataset.
$w_H$   $w_{\Delta H}$   $w_S$   $w_A$   $w_D$   $b$   AUC (validation set)   AUC (dataset)
Table 6
Number of support vectors and AUC for the binary SVM classifier with Gaussian
kernel obtained after training. The ROC curve associated with the binary SVM
classifier was obtained by varying the bias while fixing the weight vector. Two
different kinds of AUCs are shown, associated to ROC curves whose rates rfp and rtp are
evaluated on the validation set and on the whole dataset.
Table 7
Binary classification problems with classes [(marginal) vs. (high + medium + low)], [(low + marginal) vs. (high + medium)], and [(medium + low + marginal) vs. (high)], respectively: false positive rate, $r_{fp}$, true positive rate, $r_{tp}$, the sum $r_{fp} + (1 - r_{tp})$, the $\theta$ and $t$ parameters, and the AUC for the (approximately) optimal linear binary classifiers for the choices $x_1 = H$ and $x_2 = D$.

Classes                                  $r_{fp}$   $r_{tp}$   $r_{fp} + (1 - r_{tp})$   $\theta$   $t$     AUC
(Marginal) vs. (high + medium + low)     0.1369     0.9283     0.2086                    186°       0.838   0.9575
(Low + marginal) vs. (high + medium)     0.1590     0.9270     0.2319                    190°       0.850   0.9446
(Medium + low + marginal) vs. (high)     0.1967     0.9254     0.2713                    190°       0.862   0.9250
the ones from subclass 1, and a new class 1′ (high + medium) made up of all the data points from subclasses 2 and 3. Similarly, we defined a new class 0″ (medium + low + marginal) made up of all the data points from class 0 and the ones from subclasses 1 and 2, and a new class 1″ (high) made up of all the data points from subclass 3. In this way, it was possible to apply the same binary classification techniques used in the previous simulations to first separate classes 0′ and 1′, then classes 0″ and 1″.

In particular, we considered only linear binary classifiers based on the two normalized features $H$ and $D$, since this was the best architecture (in terms of both the performance and the simulation time) found in the previous simulations to separate classes 0 and 1. The obtained results are shown in Table 7, whereas Fig. 10 shows the obtained ROC curves.
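The regrouping of hazard levels into the three binary problems can be written compactly. The sketch below assumes, for illustration only, that the hazard level is stored as an integer code (0 = marginal, 1 = low, 2 = medium, 3 = high, matching the subclass numbering used above); the function name is hypothetical.

```python
import numpy as np

def nested_binary_labels(hazard):
    """Build the labels of the three binary problems from hazard codes.

    hazard: integer codes, assumed 0 = marginal, 1 = low, 2 = medium, 3 = high.
    Returns (y, y_prime, y_second):
      y        : class 0  vs. class 1   (marginal vs. high + medium + low)
      y_prime  : class 0' vs. class 1'  (low + marginal vs. high + medium)
      y_second : class 0'' vs. class 1'' (medium + low + marginal vs. high)
    """
    hazard = np.asarray(hazard)
    y = (hazard >= 1).astype(int)
    y_prime = (hazard >= 2).astype(int)
    y_second = (hazard >= 3).astype(int)
    return y, y_prime, y_second

y, y_prime, y_second = nested_binary_labels([0, 1, 2, 3])
```

Each label vector can then be paired with the same normalized features $H$ and $D$ to train one binary classifier per problem.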
Fig. 10. Hazard graduation. ROC curve for the binary [(low + marginal) vs.
(high + medium)] (a), and the [(medium + low + marginal) vs. (high)] (b) hazard
classifiers. The two-feature classifier of Fig. 8 is also reported for comparison
purposes. All classifiers are based on relative elevation and distance from the
nearest stream. AUC values are also reported.
Table 8
Performances of the three multiclass classifiers in terms of their identification of the
hazard level. The results are in %. Best performances are highlighted in bold.
Table 9
Performances of the three multiclass classifiers in terms of the extensions of the four kinds of sub-regions identified by the classifiers (high, medium, low, and marginal hazard). The results are also shown in % with respect to the total extension of the region investigated (around the river Tanaro). The best performances of each row (with respect to the data available from the Po Basin Authority, B.A.) are highlighted in bold.

Hazard    B.A. (river  Multiclass    Classifier 1   Multiclass    Classifier 2   Multiclass    Classifier 3
level     Tanaro)      classifier 1  vs. B.A. (%)   classifier 2  vs. B.A. (%)   classifier 3  vs. B.A. (%)
Marginal  70           63            −10.7          66            −6.3           66            −6.3
Low       9            5             −40.3          7             −20.5          7             −20.5
Medium    11           5             −54.7          7             −37.1          10            −5.6
High      10           27            +169.3         20            +101.8         17            +68.2
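The quantities reported in Table 9 can be computed along the following lines. This is a minimal sketch: the integer encoding of the hazard levels, the function names and the toy label arrays are hypothetical, and the relative difference 100 · (classifier − B.A.) / B.A. is an assumed definition of the "vs. B.A." columns.

```python
import numpy as np

def extent_percent(levels, n_levels=4):
    """Share (in %) of the cells assigned to each hazard level (0..n_levels-1)."""
    counts = np.bincount(np.asarray(levels).ravel(), minlength=n_levels)
    return 100.0 * counts / counts.sum()

def relative_difference(pred_pct, ref_pct):
    """Relative difference (in %) of the predicted extents w.r.t. the reference."""
    return 100.0 * (pred_pct - ref_pct) / ref_pct

# Toy label maps (0 = marginal, 1 = low, 2 = medium, 3 = high): assumed data.
reference = np.array([0] * 7 + [1, 2, 3])
predicted = np.array([0] * 6 + [1, 2, 3, 3])

ref_pct = extent_percent(reference)
pred_pct = extent_percent(predicted)
diff_pct = relative_difference(pred_pct, ref_pct)
print(ref_pct.tolist())   # [70.0, 10.0, 10.0, 10.0]
print(pred_pct.tolist())  # [60.0, 10.0, 10.0, 20.0]
```

Differences computed from rounded extents will generally not reproduce the tabulated values exactly, since the latter were presumably obtained from unrounded extents.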
M. Degiorgis et al. / Journal of Hydrology 470–471 (2012) 302–315 311
The final multiclass classifier merges the outputs of the three “inner” binary classifiers associated with the three pairs of classes in the following way:

(1) A data point is considered at marginal hazard level if it is associated by the three binary classifiers to the classes 0, 0′ and 0″, respectively.
(2) It is considered at low hazard level if it is associated to the classes 1, 0′ and 0″, respectively.
(3) It is considered at medium hazard level if it is associated to the classes 1, 1′ and 0″, respectively.
(4) It is considered at high hazard level if it is associated to the classes 1, 1′ and 1″, respectively.

Note that such a multiclass classifier is in general nonlinear, despite the fact that the inner binary classifiers are linear.
At this point, we observe that the three inner binary classifiers on which the multiclass classifier is based are obtained by giving equal weights to the error rates associated with the classes 0 and 1, 0′ and 1′, and 0″ and 1″, respectively. Indeed, each of them is obtained by minimizing rfp + (1 − rtp). A different criterion consists in assigning different weights to the error rates associated with the two classes of each of the three inner binary classifiers. For instance, giving more weight to one of the two classes may provide better performance of the binary classifier with respect to that class, without a significant decrease in performance with respect to the other class. Then, we defined three multiclass classifiers (each of which is based on three binary classifiers with classes 0 and 1, 0′ and 1′, and 0″ and 1″, respectively) in the following way:
(1) For Multiclass Classifier 1, all the three binary classifiers are obtained by minimizing rfp + (1 − rtp), as described above.
(2) For Multiclass Classifier 2, the binary classifier with classes 0 and 1 is obtained by minimizing 1.5rfp + (1 − rtp), the one with classes 0′ and 1′ by minimizing 2rfp + (1 − rtp), and the one with classes 0″ and 1″ by minimizing 2.5rfp + (1 − rtp).

Fig. 12. Tanaro basin. Composite of Po Basin Authority predictions [Fig. 2b] and hazard graduation identified according to the two-features classifiers based on relative elevation and distance from the nearest stream [Fig. 11b]. When available, Basin Authority predictions are preferred due to their higher reliability. Red, yellow, blue and green respectively indicate high, medium, low and marginal hazard areas. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 13. Flood hazard predictions for a portion of the Tevere (Central Italy) basin: Basin Authority data (a), marginal hazard areas (b), two-features classifiers based on relative
elevation and distance from the nearest stream (c), and composite picture (d). Red, yellow and blue respectively indicate high, medium, low hazard areas. Dark and light green
indicate marginal hazard areas as identified from Basin Authority data or by the binary classifier, respectively. (For interpretation of the references to color in this figure
legend, the reader is referred to the web version of this article.)
(3) For Multiclass Classifier 3, the binary classifier with classes 0 and 1 is obtained by minimizing 1.5rfp + (1 − rtp), the one with classes 0′ and 1′ by minimizing 2rfp + (1 − rtp), and the one with classes 0″ and 1″ by minimizing 3rfp + (1 − rtp).

One can observe that the different coefficients assigned to the error rates allow one to take into account the different numbers of training samples for the two classes of each of the inner binary classifiers, and to give more weight to the “more important” classes (e.g., one may want to give more weight to the error made when associating a marginal hazard level to an area with high, medium or low hazard level, with respect to the weight assigned to the error made in associating a high, medium, or low hazard level to an area with marginal hazard level). Finally, by varying the same coefficients, one can control the extensions of the sub-regions classified as at a high, medium, low, or marginal hazard level.
Tables 8 and 9 show the results obtained for the three multiclass classifiers described above. As shown by the tables, the best overall performances were obtained for Multiclass Classifier 3.

10. Application and results

The application of the above described procedure to the streams studied by the Po Basin Authority for the Tanaro catchment is presented in Fig. 11 with reference to the two-feature linear classifier based on relative elevation and distance from the nearest stream. Consequently, results are limited to the streams of Fig. 2a for which the flood-prone status and the hazard graduation are already available, and depicted in Fig. 2b and c.
One could notice that the overall extension and location of flood-prone areas (Fig. 2b vs. Fig. 11a) is well replicated by the classifier. As a result of the adopted calibration procedure, an overestimation error is envisaged. This is confirmed by Fig. 11a, which shows a small overestimation of flood-prone areas, consistent with the value rfp = 0.1369 which characterizes this classifier (see Table 3).
Fig. 11b shows the classifier results for hazard graduation. When compared with Fig. 2c, hazard overestimation is detected. Again, overestimation is definitely to be preferred over underestimation, and may be partially anticipated by the selected calibration procedure. Values of rfp increase up to rfp = 0.1967. This latter value is associated with the identification of high hazard areas. This may be partially due to the DEM grid size, which does not allow for the recognition of flood defense structures such as levees, dams and weirs. In fact, overestimation is easily detected in flat valley areas, where levees are more effective, and tends to vanish moving upstream toward hilly and mountainous areas.
Classifiers should be applied to predict flood-prone areas and hazard graduation in non-studied areas. Fig. 12 shows, for the entire Tanaro basin, a composite of the Po Basin Authority predictions and the results of the two-feature linear classifier based on relative elevation and distance from the nearest stream. Basin Authority predictions, resulting from intensive field surveys and from valuable hydrologic and hydraulic studies, are characterized by a higher reliability. Consequently, in Fig. 12 they are overlaid on the classifier results. Since the classifier now identifies flood-prone areas for the entire catchment, areas subject to a marginal hazard can also be depicted.
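The construction of such a composite can be sketched as follows: the outputs of the three inner binary classifiers are first merged into a hazard level according to rules (1)–(4), and Basin Authority data, where available, then take precedence over the classifier results. The integer encoding of the levels, the `NO_DATA` marker, the function names and the treatment of combinations not covered by rules (1)–(4) are assumptions made for illustration.

```python
import numpy as np

NO_DATA = -1  # hypothetical marker: no Basin Authority data / undefined level

def merge_binary_outputs(b, b_p, b_pp):
    """Merge the outputs of the classifiers (0 vs 1), (0' vs 1'), (0'' vs 1'')."""
    b, b_p, b_pp = (np.asarray(a, dtype=int) for a in (b, b_p, b_pp))
    # Combinations not listed in rules (1)-(4) are left as NO_DATA (assumption).
    level = np.full(b.shape, NO_DATA)
    level[(b == 0) & (b_p == 0) & (b_pp == 0)] = 0  # marginal
    level[(b == 1) & (b_p == 0) & (b_pp == 0)] = 1  # low
    level[(b == 1) & (b_p == 1) & (b_pp == 0)] = 2  # medium
    level[(b == 1) & (b_p == 1) & (b_pp == 1)] = 3  # high
    return level

def composite(ba_level, clf_level):
    """Prefer Basin Authority levels wherever they are available."""
    ba_level = np.asarray(ba_level)
    clf_level = np.asarray(clf_level)
    return np.where(ba_level != NO_DATA, ba_level, clf_level)

b = [0, 1, 1, 1, 1]
b_p = [0, 0, 1, 1, 0]
b_pp = [0, 0, 0, 1, 1]
clf_level = merge_binary_outputs(b, b_p, b_pp)
ba_level = np.array([NO_DATA, 2, NO_DATA, NO_DATA, 0])
composite_level = composite(ba_level, clf_level)
print(clf_level.tolist())        # [0, 1, 2, 3, -1]
print(composite_level.tolist())  # [0, 2, 2, 3, 0]
```

In the toy example, the last cell receives an inconsistent combination of binary outputs and is therefore resolved only because Basin Authority data are available there.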
Fig. 14. Flood hazard predictions for a portion of the Dora Baltea (Valle d’Aosta Region, Northwestern Italy) basin: Basin Authority data (a), marginal hazard areas (b), two-
features classifiers based on relative elevation and distance from the nearest stream (c), and composite picture (d). Red, yellow and blue respectively indicate high, medium,
low hazard areas. Dark and light green indicate marginal hazard areas as identified from Basin Authority data or by the binary classifier, respectively. (For interpretation of the
references to color in this figure legend, the reader is referred to the web version of this article.)
The linear binary classifier calibrated on the Tanaro basin data provided by the Po Basin Authority, and complemented by the marginal hazard identification procedure described in Section 4, can be applied also outside the Tanaro catchment. While quantitative measures of classifier performance are meaningless outside the calibration catchment, qualitative comparisons will shed some light on the possible extensive application of classifiers over wide areas, at least to obtain a preliminary description of the flooding hazard.
The two-feature linear classifier based on relative elevation and distance from the nearest stream, calibrated with the Tanaro data set, has therefore been applied to three different cases: the Tevere basin in Central Italy, the Dora Baltea basin in Northwestern Italy, and the Quirra basin in the Sardinia island, Italy. Results are presented in Figs. 13–15, respectively.
The three cases were selected for their specific characteristics. The Tevere River represents the case of a well-studied catchment: flood-prone areas and hazard graduation are available for almost the entire mainstream and for a number of tributaries. The Dora Baltea River is studied intermittently, along the mainstream only. The Quirra River is studied for its final reach only.
Results depicted in Figs. 13–15 show that the classifier is able to provide a good description of flood-prone areas and hazard graduation. In all cases, simulation results mimic well the Basin Authorities predictions, and composite pictures do not show abrupt changes at the interface between the two data sources. As commented with reference to the Tanaro basin results, a small overestimation is produced with respect to the prediction of the extension of flood-prone areas and to the identification of high hazard areas.

11. Conclusions

A simple linear binary classifier, based on two features related to the location of the site under exam with respect to the nearest hazard source, allows distinguishing flood-prone areas. The two best-performing features, selected among the five available, seem to confirm the intuition that an increasing distance from the risk sources corresponds to a lower hazard. The first feature is the length of the path that hydrologically connects the location under exam to the nearest element of the drainage network. The second feature is the difference in elevation between the cell under exam and the final point of the same path. The identification is performed with a high reliability: for the Tanaro case study, 93% of flood-prone areas are properly recognized by the classifier, and only 14% of the areas subject to a marginal hazard are improperly assigned.
The same structure can be applied for hazard graduation. While a negligible reduction in the performances of the resulting multiclass classifier is observed in terms of its ability to correctly recognize high hazard areas, an increase of false positives up to 19% is detected. This partially originates from the selected optimization procedure, whose main goal is the correct identification of flood-prone areas, and is partially due to the DEM resolution. This, in fact, seems high enough to describe the local terrain morphology, but far from allowing the recognition of flood control structures.
Results derived from the application to different catchments seem to qualitatively indicate the ability of the classifier to perform well also outside the calibration region.
Fig. 15. Flood hazard predictions for a portion of the Quirra (Sardinia, Italy) basin: Basin Authority data (a), marginal hazard areas (b), two-features classifiers based on
relative elevation and distance from the nearest stream (c), and composite picture (d). Red, yellow and blue respectively indicate high, medium, low hazard areas. Dark and
light green indicate marginal hazard areas as identified from Basin Authority data or by the binary classifier, respectively. (For interpretation of the references to color in this
figure legend, the reader is referred to the web version of this article.)
Pattern classification techniques should be taken into account when the completeness in the identification of flood-prone areas and hazard grading is required for large regions or when a first identification is desired. This may be the case (i) for specific applications, such as those related to the insurance market; (ii) for specific areas, for which the available information is limited; or (iii) whenever a cost to benefit ratio should address further detailed

of a binary classifier), a fixed nonlinear map φ: R^R → E, where E is a (usually infinite-dimensional) Euclidean space, a fixed regularization parameter C > 0, and a fixed parameter p ∈ {1, 2}, a binary SVM classifier with Lp-soft margin is obtained by solving the following optimization problem (see, e.g., Franc and Hlávač, 2004):
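For reference, the Lp-soft-margin SVM problem is usually written in the textbook form below. The symbols w, b, ξ_i, x_i and y_i ∈ {−1, +1} (weight vector, bias, slack variables, training inputs and labels) and n (number of training data) are assumed notation, which may differ from the notation adopted in the original formulation.

```latex
\min_{w \in E,\; b \in \mathbb{R},\; \xi \in \mathbb{R}^n}
  \left( \frac{1}{2}\,\|w\|^2 + C \sum_{i=1}^{n} \xi_i^{\,p} \right)
\quad \text{subject to} \quad
  y_i \left( \langle w, \varphi(x_i) \rangle + b \right) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0, \qquad i = 1, \dots, n.
```

The case p = 1 penalizes margin violations linearly, whereas p = 2 penalizes them quadratically; in both cases a larger C gives more weight to the training errors relative to the margin width.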
Hunter, N.M., Bates, P.D., Horritt, M.S., Wilson, M.D., 2007. Simple spatially-distributed models for predicting flood inundation: a review. Geomorphology 90, 208–225.
Kirkby, M.J., 1975. Hydrograph modelling strategies. Progress Phys. Hum. Geogr., 69–90.
Lehner, B., Verdin, K., Jarvis, A., 2008. New global hydrography derived from spaceborne elevation data. Eos. Trans. AGU 89, 93–94.
Luino, F., 2002. Flooding vulnerability of a town in the Tanaro basin: the case of Ceva (Piedmont–Northwest Italy). In: Thorndycraft, V.R., Benito, G., Barriendos, M., Llasat, M.C., (Eds.), Proc. PHEFRA Workshop “Paleofloods, Historical Data & Climatic Variability: Application in Flood Risk Assessment”. Barcelona, Spain, pp. 321–326.
Manfreda, S., Di Leo, M., Sole, A., 2011. Detection of flood-prone areas using digital elevation models. J. Hydrol. Eng. 16, 781–790.
Marchi, E., Roth, G., Siccardi, F., 1996. The Po: centuries of river training. Phys. Chem. Earth 20, 475–478.
Montgomery, D.R., Dietrich, W.E., 1988. Where do channels begin? Nature 336, 232–234.
Nardi, F., Grimaldi, S., Santini, M., Petroselli, A., Ubertini, L., 2008. Hydrogeomorphic properties of simulated drainage patterns using digital elevation models: the flat area issue. Hydrol. Sci. J. 53, 1176–1193.
Nardi, F., Vivoni, E.R., Grimaldi, S., 2006. Investigating a floodplain scaling relation using a hydrogeomorphic delineation method. Water Resour. Res. 42, W09409. http://dx.doi.org/10.1029/2005WR004155.
Noman, N.S., Nelson, E.J., Zundel, A.K., 2001. A review of automated flood plain delineation from digital terrain models. ASCE J. Water Resour. Plann. Manage. 127, 394–402.
O’Callaghan, J.F., Mark, D.M., 1984. The extraction of drainage networks from digital elevation data. Comput. Vision Graphics Image Proc. 28, 323–344.
Platt, C.J., 1998. Sequential minimal optimization: A fast algorithm for training support vector machines. doi: ftp://ftp.research.microsoft.com/pub/tr/tr-98-14.pdf.
Rakotomamonjy, A., 2004. Optimizing area under ROC curve with SVMs. ECAI, Valencia, Spain, 71–80.
Rodriguez-Iturbe, I., Rinaldo, A., 1997. Fractal River Basins: Chance and Self-organization. Cambridge University Press.
Roth, G., La Barbera, P., Greco, M., 1996. On the description of the basin effective drainage structure. J. Hydrol. 187, 119–135.
Santini, M., Grimaldi, S., Nardi, F., Petroselli, A., Rulli, M.C., 2009. Pre-processing algorithms and landslide modeling on remotely sensed DEMs. Geomorphology 113, 110–125.
Tarboton, D.G., Bras, R.L., Rodriguez-Iturbe, I., 1991. On the extraction of channel networks from digital elevation data. Hydrol. Process 5, 81–100.
Vapnik, V.N., 1998. Statistical Learning Theory. Wiley, New York.