
Coastal Engineering 51 (2004) 917 – 939

www.elsevier.com/locate/coastaleng

Evaluating the performance of morphological models


J. Sutherland*, A.H. Peet, R.L. Soulsby
HR Wallingford Ltd., Howbery Park, Wallingford, Oxfordshire, OX10 8BA, United Kingdom

Available online 26 August 2004

Abstract

Evaluating the performance of numerical models of coastal morphology against observations is an essential part of
establishing their credibility. In the past, this has usually been done by comparing predicted with observed behavior, and the
modeler making a subjective judgement of the goodness of fit. By contrast, meteorological models have been tested for many
years against observations using objective, scientifically rigorous methods. Drawing on the analogy between distributions of
atmospheric pressure (and similar scalars) and coastal bathymetry, the meteorological statistical methods are here adapted to
coastal applications. A set of criteria is proposed that identify the desirable attributes of statistical measures of the performance
of coastal morphological models. Various statistical parameters used in the meteorological field are described, dealing both with
cases in which the model outputs comprise a small number of categories (e.g., advance/equilibrium/retreat of shoreline), and
those in which they are continuous (e.g., bathymetry). Examples of the application of these methods are given, in terms of both
hypothetical illustrations, and real field data and model predictions. Following meteorological practice, it is shown that
measuring the skill of a model (i.e., its performance relative to a simple baseline predictor) is a more critical test than measuring
its absolute accuracy. The attributes of the different performance measures are compared with the proposed desirable criteria,
and those that match them best are selected. These are the LEPSOB test for categorical data, and the Brier Skill Score (BSS) for
continuous data. Routine use of these measures of performance would aid inter-comparability of models, and be a step toward
strengthening user confidence in the predictions of coastal numerical models.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Model validation; Morphological model; Skill; Accuracy; Bias; Correlation; Brier skill score; Linear error in probability space

* Corresponding author. Tel.: +44 1491 822311; fax: +44 1491 832233. E-mail address: jas@hrwallingford.co.uk (J. Sutherland).

0378-3839/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.coastaleng.2004.07.015

1. Introduction

Over the past 20 years, many models of coastal morphology have been developed. These cover a wide range of time-scales and processes, from equilibrium state predictors (Dean, 1977) to detailed numerical models incorporating a large number of physical processes (e.g., Zheng and Dean, 1997 or Nicholson et al., 1997). In the past, the predictions from each model have usually been qualitatively compared to data sets and analytical solutions of idealized cases. This comparison is typically subjective, and of the form "reasonably good agreement was found". Critical comments about model performance seem to be

missing from the literature, despite the variety of available models. During the early stages of model development, this form of assessment is arguably sufficient, as it can be obvious at a glance where further model improvement is required when a simple comparison of model prediction with data is made. However, a model that is to be used as an engineering tool requires a more objective assessment.

Coastal zone managers and engineers requiring answers to problems are not concerned with the details of the internal processes in the specific model they use. They are concerned with the fitness for purpose of the model and, perhaps most importantly, its credibility. Therefore, it is necessary to establish completely objective standard algorithms of model performance, which can be used to determine the bias, the accuracy and the skill of a model (defined below).

In this paper, the problem of evaluating morphological models is addressed by drawing on established methods used in meteorology, where predictions of pressure and rainfall are routinely evaluated against data. This is possible, as the spatial distribution of coastal bathymetry is directly analogous to the distribution of atmospheric pressure (or rainfall) in meteorology. First, we propose criteria for assessing the suitability of model performance statistics. Next, a number of methods for evaluating different kinds of meteorological models (Heidke, 1926; Willmott et al., 1985; Murphy and Epstein, 1989; Gandin and Murphy, 1992; Wilks, 1995; Livezey et al., 1996; Potts et al., 1996) are described in the context of coastal morphological modeling. These methods are then applied to both hypothetical and real cases of coastal morphological modeling. The degree to which the statistics match up to the criteria is assessed and the best measures, according to these criteria, are chosen.

1.1. Bias, accuracy and skill

The performance of any model can be assessed by calculating its bias, accuracy and skill.

Bias is a measure of the difference in the central tendencies of the predictions and the observations. The most common measure of the central tendency is the mean, although there are circumstances where using the median is appropriate. The mode or modal group (for binned data) could also be used, but as there seems no particular reason for doing so, it will not be considered further. Bias is sometimes referred to as reliability. In this definition a "reliable" model does not consistently over-predict or under-predict, but is not necessarily accurate.

Accuracy is a measure of the average size of the difference between a set of predictions and the corresponding observations (the average error), so even when there is no bias in a model it may still have a low accuracy. Accuracy can be represented in a dimensional or a non-dimensional (relative accuracy) manner.

Skill is a non-dimensional measure of the accuracy of a prediction relative to the accuracy of a baseline prediction, which could, for example, be an initial condition, an average value, a random choice (from the possible range of outcomes) or a simple empirical predictor (such as a Dean profile). Skill can be based on any measure of accuracy. The difference between accuracy and skill can be illustrated by the following hypothetical example. A model of bar movement predicted a small offshore movement, but the measurements showed a small onshore movement. The difference between the predicted and observed bar crest distance from the shore was still small, so the accuracy of the bar crest position was still relatively high, even though the model moved the bar in the wrong direction. However, introducing a baseline prediction (that the bar does not move) showed that the prediction was less accurate than the baseline prediction, so the model had a negative level of skill.

All skill scores possess an origin, scale and orientation. The origin is the result obtained if the model predicts the baseline prediction. The scale is the average score from perfect modeling, and a positive orientation implies that larger scores are more skillful. These should be chosen to have sensible values, such as origin = 0, scale = 1 and a positive orientation.

1.2. Desirable properties of a model performance statistic

This paper considers a number of model performance statistics that can be applied to morphodynamic modeling. The statistics are compared and particular ones are recommended for future use. The authors

consider that the following properties are desirable in a model performance statistic:

(1) The statistic should be relatively easy to understand. Therefore, it should have a simple and clear conceptual basis.
(2) The statistic should match up with expert opinion. It should generally return a good score when experts consider that the modeling has been a success.
(3) The statistic should appear "honest", in the sense that it is not easy to alter the results by altering the model set-up slightly. For example, the Brier Skill Score (BSS) (Section 3.5) is not altered by the addition of an inactive area to the domain of a morphodynamic model, whereas the correlation coefficient is. This makes the BSS more "honest".
(4) The statistic should be transferable from one model to another and from one data set to another. Therefore, there is a preference for non-dimensional quantities such as skill scores or relative errors rather than dimensional measures of accuracy, which may be expected to have lower values in, for example, a flume test than a field experiment.
(5) The statistic should take a range of values that makes it intuitively easy to assess the quality of the modeling from the value returned.
(6) The statistic should measure skill rather than accuracy. This follows partly from the aforementioned properties, all of which can be met by skill scores, and partly because skill scores can be made equitable (defined in Section 2.4).

1.3. Types of prediction

Predictions can be either categorical or probabilistic, and either ordered or unordered. A categorical prediction is used when a model predicts one of a finite number of possible outcomes or categories. For example, shoreline position changes could be divided into three categories: retreat, equilibrium and advance. A probabilistic prediction is used when a model predicts a time series of a discrete variable (e.g., shoreline position) or the values of a continuous variable (e.g., points along a beach profile). The number of possible outcomes is now infinite and so comparisons are made with values and not categories. The predictions from a probabilistic model can be expressed in categorical terms by sorting the output into bins. This allows the skill of probabilistic models to be compared to the results from simple categorical predictors. Moreover, by allowing the number of bins to increase until each bin is the same size as the resolution of the measurements, a categorical model can be applied to a probabilistic case.
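For illustration (this example is added here and is not part of the original paper; the 0.5 m thresholds are hypothetical), a continuous prediction can be binned into the three ordered shoreline categories in a few lines of Python:

import numpy as np

# Hypothetical thresholds (m): changes within +/-0.5 m count as equilibrium.
edges = np.array([-0.5, 0.5])
change = np.array([-1.2, 0.1, 2.3, -0.4, 0.8])  # predicted shoreline change (m)

# 0 = retreat, 1 = equilibrium, 2 = advance
category = np.digitize(change, edges)
print(category)                                 # [0 1 2 1 2]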
In an ordered variable, it is possible to determine not only whether the prediction is right or wrong, but also how badly wrong an incorrect prediction is. Skill assessments of ordered predictions have to take the size of the error into account. For example, probabilistic predictions or measurements of bed elevation, water level or current speed are ordered, as the values lie within a continuous range and it is easy to calculate the difference between observation and prediction. Categories can also be ordered and, in this case, a skill assessment that takes into account category error should be performed. Category error is a measure of how close an incorrect prediction was to the correct prediction. An example of ordered categories is the change in shoreline position, which can be split into three categories: retreat, equilibrium and advance. The order retreat, equilibrium, advance is correct as there is a logical progression from the first category to the last. With unordered categories it makes no difference which incorrect category is predicted; they are all equally wrong.

2. Categorical assessment

2.1. Categorical assessment ignoring category error

The results from comparing categorical predictions and observations can always be represented by one of four possible outcomes, which can be recorded in a

2×2 contingency table, as shown in Table 1. The variables in the table are:

H  Hits: the number of times the event was predicted and observed
M  Misses: the number of times the event was observed but not predicted
F  False alarms: the number of times the event was predicted but did not occur
C  Correct rejections: the number of times the event was not predicted and did not happen

Table 1
Possible outcomes from categorical prediction, shown in a contingency table

                                Prediction
                      Predicted   Not predicted   Total
Observation
  Occurred                H             M          H+M
  Did not occur           F             C          F+C
Total                    H+F           M+C           J

A number of statistics can be calculated to assess the performance of the modeling (Wilks, 1995). Here the bias, Bias = (H+F)/(H+M), is the number of times the event was predicted compared to the number of times the event was observed. The accuracy can be expressed by a number of measures. The Percentage Correct, PC = (H+C)/J, is the number of correct predictions as a fraction of the total number of forecasts, J = H+M+F+C. The False Alarm Rate, FAR = F/(H+F), represents the number of false predictions divided by the number of forecasts of the event, while the Probability of Detection, POD = H/(H+M), is the number of correct predictions of the event divided by the number of times the event was observed. The Threat Score, TS = H/(H+F+M), is the number of successful predictions of the event divided by the total number of forecasts minus the number of correct rejections. Note, however, that both Percentage Correct and Probability of Detection are referred to as the Hit Rate, so the latter term is ambiguous and should be avoided.

The skill of a categorical prediction can be determined (Heidke, 1926) using the Heidke Skill Score (HSS) given by:

\[ \mathrm{HSS} = \frac{PC - G}{1 - G} \tag{1} \]

where PC = Percentage Correct (expressed as a fraction) and G is the fraction of predictions of the correct categories (H and C) that would be expected from a random choice. G is the sum of the fraction of events that occurred and were predicted by chance, given by (H+F)(H+M)/J², and the fraction of events that did not occur and were predicted by chance not to occur, given by (M+C)(F+C)/J². The HSS is the percentage correct, corrected for the number expected to be correct by chance.

The advantages of using the HSS, rather than PC, are illustrated in the following example. Let a particular state occur for 125 out of 2500 observations (5%). Let 100 of these events be predicted and 25 not predicted, with there being 50 false alarms. The 2×2 contingency table is shown in Table 2. The resulting statistics are: PC = 0.97, FAR = 0.33, Probability of Detection = 0.80, Threat Score = 0.57 and HSS = 0.71. The HSS is as low as 0.71 because 90% of the forecasts would be correct by chance. If, instead, the events were never predicted (H = F = 0, M = 125, C = 2375), then PC = 0.95 (a reduction of only 0.02 on the previous case) whereas HSS = 0.0, as the predictive model has no skill.

Table 2
Example of categorical prediction

               Predicted   Not predicted
Observed          100            25
Not observed       50          2325
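As a check on the definitions above, the following Python sketch (added here for illustration; it is not part of the original paper) reproduces the statistics quoted for Table 2:

# Categorical verification statistics from a 2x2 contingency table.
# Counts follow Table 2: H hits, M misses, F false alarms, C correct rejections.
H, M, F, C = 100, 25, 50, 2325
J = H + M + F + C                    # total number of forecasts

bias = (H + F) / (H + M)             # events predicted / events observed
PC = (H + C) / J                     # Percentage Correct
FAR = F / (H + F)                    # False Alarm Rate
POD = H / (H + M)                    # Probability of Detection
TS = H / (H + F + M)                 # Threat Score

# Heidke Skill Score (Eq. (1)): PC corrected for chance agreement G
G = (H + F) * (H + M) / J**2 + (M + C) * (F + C) / J**2
HSS = (PC - G) / (1 - G)

print(PC, FAR, POD, TS, HSS)         # 0.97, 0.33, 0.80, 0.57, 0.71 (rounded)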

ries is met by the Wright and Short (1983) beach classification, which is:

Reflective → Terraced → Transverse Bar → Rhythmic Bar → Straight Bar → Dissipative.

This is ordinal and corresponds to a change in energy from low incident wave energy (Reflective) to high energy (Dissipative). In addition to the beach state, many processes occurring on the beach may also be categorized. Detailed morphological models, although making quantifiable predictions of parameters such as shoreline movement, can also be assessed using the categorical technique. This facilitates the comparison of detailed process models with simple models, such as predictor parameters. Some examples of parameters and processes that can be categorized, enabling such comparisons, are given in Table 3.

Table 3
Possible categories for morphological parameters and processes for use with skill scores

Parameter                  Category 1   Category 2   Category 3
Shoreline position         Advance      No change    Retreat
Nearshore erosion          Erode        No change    Deposit
Longshore transport        Left         None         Right
Berm height                Higher       No change    Lower
Bar crest depth            Deeper       No change    Shallower
Bar shape (trough/crest)   Flatter      No change    Accentuated
Bar migration              Onshore      None         Offshore
                           Left         None         Right
Longshore length scale     Increase     No change    Decrease

2.3. Skill score matrices for ordered categories

Consider the example of shoreline position change with the number of categories, N = 3, defined as retreat (R), equilibrium (E) and advance (A). A skill score can be assigned to each of the N² = 9 possible combinations of observation and prediction and these can be displayed in an N by N (3×3) skill score matrix. The skill scores from each event are then read from the skill score matrix and averaged to give the overall skill score.

As an example, assume that each category has an equal probability, p(x) = 1/N, so that p(R) = p(E) = p(A) = 1/3. Now consider an intuitive skill scoring system for all of the nine possible outcomes:

- If the correct category is predicted, then assign a skill of S = 1.
- If the correct category is missed by a single category, then assign S = 0.5.
- If the correct category is missed by two categories, then assign a skill of S = 0.

The skill score matrix is shown in Table 4.

Table 4
Inequitable skill scores for three categories with equal probabilities

                        Observation              Weighted
                    R        E        A          average
Prediction  R      1.0      0.5      0.0         0.50
            E      0.5      1.0      0.5         0.67
            A      0.0      0.5      1.0         0.50
Weighted average of all skill scores             0.56

The weighted average of all the skill scores, 0.56, is the score representative of a random choice. The average is weighted by the probability of occurrence of each category. Here the categories have equal weighting, so the weighted average is the same as the arithmetical average. This can be considered a baseline score against which to compare all other skill scores. If a model were always to predict shoreline retreat, then its average score would be 0.5, worse than a random guess. If a model was always able to predict the middle category, near equilibrium, then its average score would be 0.67, better than a random guess. This kind of scoring method is inequitable because it will favor the middle category and may encourage modelers to steer model development and hence predictions. Any scoring algorithm that does not give the same score for random choices and constant choices of any category is called inequitable. The alternative is to use an equitable skill score.
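This inequity is easy to verify numerically. The sketch below (an added illustration, not from the paper) computes the expected scores of constant and random predictors from the Table 4 matrix:

import numpy as np

# Skill score matrix of Table 4: rows = prediction (R, E, A), cols = observation.
S = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])
p = np.array([1/3, 1/3, 1/3])   # equal probabilities of occurrence

# Expected score of a constant forecast of each category (row averages,
# weighted by the probability of each observed category).
print(S @ p)                    # [0.5, 0.667, 0.5]: the middle category is favored

# Expected score of a random choice weighted by probability of occurrence.
print(p @ S @ p)                # 0.556: differs from the constant forecasts, so inequitable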
2.4. Principles of an equitable skill score

The principles of an equitable skill score are set out in Gandin and Murphy (1992) as:

(1) The scores assigned to correct predictions increase as the probability of occurrence decreases.

(2) The scoring matrix for N+1 categories becomes the matrix for N categories as the probability of one of the categories becomes zero.
(3) Scores assigned to incorrect forecasts are less than the scores assigned to correct forecasts.
(4) Scores assigned to large errors are less than the scores assigned to small errors.
(5) Constant forecasts of any event, or random choices, should be accorded the same level of skill.

The equitable nature is achieved because the extreme categories are awarded a higher level of skill for a correct prediction than the common categories. This means that a model is rewarded for correctly predicting a non-standard event, even when the categories have unequal probabilities. The baseline prediction, against which the accuracy of the model is assessed, is a random selection from the available categories, weighted by the probability of occurrence.

2.5. Linear error in probability space

One method of determining equitable skill scores for categorical assessment is to use the Linear Error in Probability Space (LEPS) method, described by Potts et al. (1996). The categories are placed in order (e.g., retreat, equilibrium, advance) and cumulative probability distributions of the forecasts, P_f, and observations, P_o, are calculated for all categories, which cover the whole range of probability space (i.e., P_f = 0 to 1, and P_o = 0 to 1). The LEPS score at each value of P_f and P_o is defined as:

\[ LS = 3\left(1 - |P_f - P_o| + P_f^2 - P_f + P_o^2 - P_o\right) - 1 \tag{2} \]

The skill score for a particular categorical outcome is found by integrating Eq. (2) over the ranges of P_f and P_o within that category. An example of a LEPS skill matrix for three categories (R, E and A, for Retreat, Equilibrium and Advance of shoreline) with equal probabilities is shown in Table 5. The LEPS score for prediction = R and observation = E is obtained by numerically integrating Eq. (2) from P_f = 0 to P_f = 1/3, and from P_o = 1/3 to P_o = 2/3. The average of this integration, SS = −0.11, is the skill score.

Table 5
Equitable linear error in probability space skill scores for three categories with equal probabilities

                        Observation              Weighted
                    R        E        A          average of rows
Prediction  R      0.89    −0.11    −0.78        0
            E     −0.11     0.22    −0.11        0
            A     −0.78    −0.11     0.89        0
Weighted average of all skill scores             0

The weighted average of the skill scores for each row and each column is zero, as is the weighted average of all the skill scores. The weighting is by probability of occurrence. In this example, the scores for correct predictions are positive, while the others are negative.
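A sketch of how such a matrix can be generated (added here for illustration; the midpoint integration scheme is one possible choice, not necessarily that of Potts et al.) is to average Eq. (2) over each cell of cumulative probability space:

import numpy as np

def leps(pf, po):
    # Eq. (2): pointwise LEPS score in cumulative probability space
    return 3.0 * (1.0 - np.abs(pf - po) + pf**2 - pf + po**2 - po) - 1.0

def leps_matrix(p_pred, p_obs, n=200):
    # Average Eq. (2) over each cell of probability space (midpoint rule).
    cf = np.concatenate(([0.0], np.cumsum(p_pred)))   # forecast cumulative bounds
    co = np.concatenate(([0.0], np.cumsum(p_obs)))    # observed cumulative bounds
    mid = lambda a, b: a + (np.arange(n) + 0.5) * (b - a) / n
    S = np.empty((len(p_pred), len(p_obs)))
    for i in range(len(p_pred)):
        for j in range(len(p_obs)):
            S[i, j] = leps(mid(cf[i], cf[i + 1])[:, None],
                           mid(co[j], co[j + 1])[None, :]).mean()
    return S

# Three equally probable categories reproduce Table 5:
print(leps_matrix([1/3, 1/3, 1/3], [1/3, 1/3, 1/3]).round(2))
# The unequal probabilities of the bar migration case below reproduce
# Table 6 to within 0.01:
print(leps_matrix([0.46, 0.33, 0.21], [0.46, 0.21, 0.33]).round(2))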
The baseline prediction for the LEPS skill score is either a random choice from the available categories or a constant choice of a single category. Both options give an average skill score of 0, which is therefore the origin. Modeling that produces a LEPS score above zero shows a greater level of skill than the baseline prediction. One weakness of the LEPS score is that the scale (the average score from perfect modeling) varies with the number of categories. The scale for N categories is 1 − 1/N. Hence, the scale for 2 categories is 1/2, but for 3 categories it is 2/3. It is possible to use a LEPS score normalized by the scale to make the normalized scale one. If this is done, it must be clearly stated. The orientation of LEPS is positive: the higher the score, the better the modeling.

2.5.1. LEPS score applied to a bar migration model

In most cases, the categories will not have equal probabilities of occurrence and the skill score matrices will not be symmetrical about the leading axis. Plant and Holman (2001) used a simple model of sandbar migration to predict the position of a sandbar at Katwijk (NL). Sutherland and Soulsby (2003) took their data and converted it into three categories of bar movement: retreat, equilibrium and advance. The categories had different probabilities of occurrence, so the forecast cumulative probability distribution was different to the observed one. The LEPS skill matrix for this case is shown in Table 6. The probability of occurrence of each category is also shown.

Table 6
Linear error in probability space skill scores for categories of bar movement

                                    Observed                              Weighted
                              Retreat     Equilibrium   Advance           average
                              p(R)=0.46   p(E)=0.21     p(A)=0.33         of rows
Predicted  Retreat p(R)=0.46     0.58       −0.21         −0.68           0
           Equilibrium p(E)=0.33 −0.34       0.28          0.29           0
           Advance p(A)=0.21    −0.75        0.01          1.03           0
Weighted average of all skill scores                                      0

The weighted average values of the rows and columns are still zero, however, when each skill score is weighted by the probability of occurrence. For example, the weighted average of the first row is (0.58×0.46) + (−0.21×0.21) + (−0.68×0.33) = 0.00. Similarly, the weighted average of the first column is (0.58×0.46) + (−0.34×0.33) + (−0.75×0.21) = 0.00. The probabilities of occurrence show that the model predicted fewer instances of onshore movement (advance) than were observed. However, it did correctly predict the probability of offshore movement (retreat). A higher skill score is awarded for the correct prediction of a relatively rare event (advance) than is awarded for the correct prediction of the most common event (retreat).

A skill score from Table 6 was assigned to each of the examples of bar movement from Plant and Holman's model, and the average LEPS skill score was found to be 0.19. This is a positive skill score, showing that the model has skill, compared to the baseline prediction, in predicting the category of bar movement. Recall that the scale (the average result from perfect modeling) is 2/3, which can be checked by calculating the weighted average of the leading diagonal of Table 6.

2.5.2. LEPSOB

In certain circumstances, there may be more observations than predictions. If it is felt that there are an insufficient number of predictions to form a reasonable cumulative probability distribution, the LEPS skill score matrix can be calculated using the observed cumulative probability distribution for both the observations and the predictions. The average LEPS skill score is then calculated as usual, by averaging the skill scores from each pair of observations and predictions. In this case, the resulting skill score should be referred to as LEPSOB, to denote the reliance on observations only. The LEPSOB skill score matrix for the bar migration case used above is shown in Table 7. The matrix is symmetrical about its leading axis. The average LEPSOB skill score for the bar migration model was again 0.19.

Table 7
Observational linear error in probability space (LEPSOB) skill scores for categories of bar movement

                              Observed                              Weighted
                        Retreat     Equilibrium   Advance           average
                        p(R)=0.46   p(E)=0.21     p(A)=0.33         of rows
Predicted  Retreat         0.59       −0.21         −0.68           0
           Equilibrium    −0.21        0.34          0.07           0
           Advance        −0.68        0.07          0.89           0
Weighted average of all skill scores                                0
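Assembling an average LEPS or LEPSOB skill score from a set of paired predictions and observations is then a table lookup. The sketch below (an added illustration using hypothetical category series) reuses leps_matrix() from the previous sketch:

import numpy as np

# Categories: 0 = R, 1 = E, 2 = A. These series are hypothetical.
obs = np.array([0, 0, 2, 1, 0, 2, 2, 0, 1, 0])
pred = np.array([0, 1, 2, 1, 0, 0, 2, 0, 1, 2])

p_obs = np.bincount(obs, minlength=3) / len(obs)
S = leps_matrix(p_obs, p_obs)    # LEPSOB: observed distribution on both axes
score = S[pred, obs].mean()      # average skill over all prediction/observation pairs
print(score)                     # positive: skill relative to the baseline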

3. Probabilistic assessment

Many models predict the values of parameters and variables, not just their category. It is therefore desirable to include as much detail about the observations and model predictions as possible in the skill assessment. The same probabilistic assessment techniques can be used both for time series of discrete variables, such as bar height and shoreline position, and for continuous variables, such as cross-shore bathymetry or an area map of bathymetry.

3.1. Bias

The basic equations for bias in the mean, Bias_a, and in the median, Bias_m, are:

\[ \mathrm{Bias}_a = \frac{1}{J}\sum_{j=1}^{J}\left(y_j - x_j\right) = \langle Y \rangle - \langle X \rangle \tag{3} \]

\[ \mathrm{Bias}_m = Y_M - X_M \tag{4} \]

where Y is a set of J forecasts or predictions, y_1, y_2, ..., y_J, and X is a set of J observations, x_1, x_2, ..., x_J, with the nth element of each set occurring at the same point in space and time. Y_M and X_M are the median values of Y and X, and the angle brackets denote the mean. Bias reveals the tendency toward under- or over-prediction, with a positive bias indicating that the model consistently over-predicts the observations.

3.2. Accuracy

The most common measures of accuracy are the Mean Absolute Error, MAE, and the Mean Square Error, MSE, defined as:

\[ \mathrm{MAE}(Y,X) = \frac{1}{J}\sum_{j=1}^{J}\left|y_j - x_j\right| = \langle |Y - X| \rangle \tag{5} \]

\[ \mathrm{MSE}(Y,X) = \frac{1}{J}\sum_{j=1}^{J}\left(y_j - x_j\right)^2 = \langle (Y - X)^2 \rangle \tag{6} \]

The above expressions assume that there is no need to use a weighted average. A general form of the expression for average error was given by Willmott et al. (1985), which includes weighting and the possible use of powers other than one and two to define a mean error statistic. There is no advantage to using powers other than one or two; therefore, only MAE, MSE and the root-mean-square error, RMSE, will be considered further. RMSE is often preferred to MSE as it has the same units as the measured variable. If necessary, Eqs. (5) and (6) can be extended to include weighting.

The choice between MAE and RMSE depends largely on the importance of outliers (points with particularly large differences between measured and predicted values) in the data. The presence of a few outliers will have a greater influence on RMSE than on MAE, as RMSE squares the differences. The RMSE is, therefore, greater than or equal to the MAE, so RMSE is a more conservative error predictor. The difference between RMSE and MAE depends on the shape of the distributions of Y and X and is reduced as Y and X tend toward symmetrical distributions. The circumstances of the modeling exercise will determine the importance that any outliers should be given and will influence the choice between MAE and RMSE.

The use of the modulus makes MAE nonanalytic and thus more difficult to manipulate mathematically than MSE. However, the MAE is equally applicable to vectors and scalars, and so is particularly useful for validating and evaluating hydrodynamic modeling (Sutherland et al., 2004). Moreover, the phrases "mean absolute error" and "root-mean-square error" contain within them the assumption that all errors are in the model prediction and the measurement is error-free. Calculations of best-fit lines also tend to assume that the data is error-free. This will never be the case in practice, and model evaluation should include a full discussion of the measurements and their inherent errors. Some consideration of measurement errors is given in Section 7.

3.3. Correlation

A widely used method of determining whether two series, X and Y, are related is the linear correlation coefficient, r_XY, given by:

\[ r_{XY} = \frac{\langle (Y - \langle Y \rangle)(X - \langle X \rangle) \rangle}{\sigma_X \sigma_Y} = \frac{s_{XY}}{\sigma_X \sigma_Y} \tag{7} \]

with σ_X the measured and σ_Y the predicted standard deviation, and s_XY the covariance between the observations and predictions. The value r_XY = 1 denotes complete positive correlation, while r_XY = 0 occurs when X and Y are uncorrelated. The coefficient r_XY is a measure of how strong a correlation is, but not how significant the correlation is, because the distributions of X and Y are not taken into account.
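These definitions translate directly into code. A minimal Python sketch (added for illustration), assuming the predictions y and observations x are numpy arrays on a common grid:

import numpy as np

def bias_mean(y, x):                   # Eq. (3)
    return np.mean(y - x)

def bias_median(y, x):                 # Eq. (4)
    return np.median(y) - np.median(x)

def mae(y, x):                         # Eq. (5)
    return np.mean(np.abs(y - x))

def mse(y, x):                         # Eq. (6)
    return np.mean((y - x)**2)

def rmse(y, x):
    return np.sqrt(mse(y, x))

def corr(y, x):                        # Eq. (7)
    return np.mean((y - y.mean()) * (x - x.mean())) / (y.std() * x.std())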

In evaluating the performance of a morphodynamic model, it is often useful to plot the measured and modeled anomalies. The anomaly is the difference between the final measurements (or predictions) and the baseline prediction of the final bathymetry. Let B be a set of N baseline predictions, b_1, b_2, ..., b_N, with the nth baseline prediction occurring at the same place and time as the nth value of the predictions and observations, Y and X. The modeled and measured anomalies are given by:

\[ Y' = Y - B \tag{8} \]

\[ X' = X - B \tag{9} \]

In many cases the baseline prediction for the final bathymetry used in morphological modeling is the initial bathymetry (i.e., it is assumed that the bed does not alter). In this case, the measured and modeled anomalies are simply the measured and modeled changes in bed level during the time being modeled. The anomalies commonly have a much smaller vertical scale than the measured and modeled bed, so plotting the anomalies on a larger axis is a very useful way of assessing where the modeling was good and where it was bad. It is also possible to calculate an anomaly correlation coefficient, r_X'Y', given by:

\[ r_{X'Y'} = \frac{\langle Y'X' \rangle}{\sigma_{Y'}\sigma_{X'}} \tag{10} \]

This correlation coefficient has a value of 1 for a perfect prediction, which is reduced by error in predicting the mean bed level and the position of erosion and deposition. No reduction in skill is made for amplitude error.

3.4. Skill

Skill scores can be based on Mean Absolute Error, Mean Square Error or Root Mean Square Error. The general form of skill score, SS, can be given by:

\[ SS = \frac{A(B,X) - A(Y,X)}{A(B,X) - A(W,X)} \tag{11} \]

where A(B,X) is the accuracy of the baseline prediction, A(Y,X) is the accuracy of the predictions and A(W,X) is the accuracy of a perfect set of predictions, with W a set of J perfect predictions. MSE, RMSE and MAE are suitable measures of accuracy for calculating skill.

The numerator in Eq. (11) is the improvement in accuracy of the model compared to the baseline prediction. The baseline can be an intuitively simple model, such as an initial condition or an average value, or it can be a simple empirical predictor (such as a Dean profile), for example. The denominator is the total possible improvement in accuracy, given by the difference between the accuracy of the baseline prediction and the result from perfect modeling (Murphy and Epstein, 1989). Commonly the accuracy of perfect modeling, represented by MAE(W,X), MSE(W,X) or RMSE(W,X), is taken to be zero. This simplifies the expression for skill.

3.5. Skill scores for probabilistic models

If the error associated with a perfect prediction, A(W,X) in Eq. (11), is taken to be zero, then three skill scores can be derived for probabilistic models by using mean square error (MSE), root mean square error (RMSE) and mean absolute error (MAE) as measures of accuracy. These are the BSS:

\[ \mathrm{BSS} = 1 - \frac{\mathrm{MSE}(Y,X)}{\mathrm{MSE}(B,X)} = 1 - \frac{\langle (Y-X)^2 \rangle}{\langle (B-X)^2 \rangle} \tag{12} \]

the root-mean-square skill score:

\[ \mathrm{RMSSS} = 1 - \frac{\mathrm{RMSE}(Y,X)}{\mathrm{RMSE}(B,X)} = 1 - \frac{\langle (Y-X)^2 \rangle^{1/2}}{\langle (B-X)^2 \rangle^{1/2}} \tag{13} \]

and the mean absolute skill score:

\[ \mathrm{MASS} = 1 - \frac{\mathrm{MAE}(Y,X)}{\mathrm{MAE}(B,X)} = 1 - \frac{\langle |Y-X| \rangle}{\langle |B-X| \rangle} \tag{14} \]

The BSS is commonly used in meteorology and has already been applied to the modeling of coastal morphodynamics by Brady and Sutherland (2001), Sutherland et al. (2001), Van Rijn et al. (2002), Van Rijn et al. (2003) and Sutherland and Soulsby (2003), all coming from the COAST3D project (Soulsby, 2001). Gallagher et al. (1998) used the RMSSS in a local sand transport model driven by measured flows.
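Under that simplification, the three skill scores follow directly from the accuracy measures sketched earlier (again an added illustration, reusing mae(), mse() and rmse() from the previous sketch):

# Skill scores of Eqs. (12)-(14): x observations, y predictions, b the
# baseline prediction (often the initial bathymetry), all on one grid.
def bss(y, x, b):
    return 1.0 - mse(y, x) / mse(b, x)     # Eq. (12), Brier Skill Score

def rmsss(y, x, b):
    return 1.0 - rmse(y, x) / rmse(b, x)   # Eq. (13)

def mass(y, x, b):
    return 1.0 - mae(y, x) / mae(b, x)     # Eq. (14)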

Perfect agreement gives a skill score of 1, whereas modeling the baseline condition gives a score of 0. If the model prediction is further away from the final measured condition than the baseline prediction, the skill score is negative. Note that these skill scores are unbounded at the lower limit. Therefore, they can be extremely sensitive to small changes when the denominator is low, in common with other non-dimensional skill scores derived from the ratio of two numbers. Therefore, large negative values can be obtained even from models which predict a small change (of the correct order of magnitude) when the measured change is very small. In these circumstances, different models of the same experiment can still be compared (as the same small denominator will be used) to get a ranking of relative merit. Note that when the denominator reduces to a similar size as the error in the measurements, the skill score becomes effectively meaningless.

3.6. Brier skill score

A particularly useful skill score in coastal engineering is the BSS (Eq. (12)), which includes contributions due to errors in predicting amplitude, phase and mean. This means that, for example, it is impossible to tell from the BSS alone whether a score less than 1 is due to a model moving a bar crest too little or too far in the cross-shore direction. Similarly, predicting too much or too little erosion may cause a score less than 1.

It can be useful to know which of the types of errors contributes the most to lowering the skill score. This can be determined by visually comparing the profiles, or by using the decomposition of the BSS proposed by Murphy and Epstein (1989). This shows that the BSS can be decomposed in terms of the anomalies in the prediction, Y' = Y − B, and the measurements, X' = X − B. The standard deviations of the predicted and observed anomalies are given by σ_Y' and σ_X', respectively. The Murphy–Epstein decomposition shows that the BSS can be written as:

\[ \mathrm{BSS} = \frac{a - b - c + e}{1 + e} \tag{15} \]

with the terms set out below:

\[ a = r_{Y'X'}^2 \qquad b = \left(r_{Y'X'} - \frac{\sigma_{Y'}}{\sigma_{X'}}\right)^2 \qquad c = \left(\frac{\langle Y' \rangle - \langle X' \rangle}{\sigma_{X'}}\right)^2 \qquad e = \left(\frac{\langle X' \rangle}{\sigma_{X'}}\right)^2 \tag{16} \]

Here,

a is a measure of phase error: the sand is moved to the wrong position. Perfect modeling of the phase gives a = 1.
b is a measure of amplitude error: the wrong volume of sand is moved. Perfect modeling of phase and amplitude gives b = 0.
c is a measure of the map mean error: the predicted average bed level is different from the measured one. Perfect modeling of the mean gives c = 0.
e is a normalization term, which is affected only by measured changes from the baseline prediction.

Comments on the decomposition can be found in Murphy and Epstein (1989) and Livezey et al. (1996). Note that a is the square of the anomaly correlation coefficient, and so must be between 0 and 1; terms b and c are positive or zero, while e is positive. e is the only term that does not depend on the forecast, being the square of the ratio of the mean observed anomaly to its standard deviation. c equals zero for unconditionally unbiased predicted anomalies, whereas b equals zero for conditionally unbiased predicted anomalies. If both unconditional and conditional bias can be removed and e is small, then BSS ≈ a, so that a becomes a measure of potential skill. Alternatively, if the mean measured anomaly is small relative to its standard deviation (e ≈ 0), the unconditional bias is small (c ≈ 0) and the standard deviations of the predicted and observed anomalies are approximately equal, then the BSS reduces to

\[ \mathrm{BSS} = r_{X'Y'}^2 - \left(r_{X'Y'} - 1\right)^2 = 2r_{X'Y'} - 1 \tag{17} \]

Under these circumstances, the BSS is positive for anomaly correlation coefficients greater than 0.5. Murphy and Epstein note that in atmospheric forecasting it is traditional to consider an anomaly correlation coefficient greater than 0.6 as representing a useful forecast, so that (under the previous set

of assumptions) a BSS greater than 0.2 represents a useful forecast. However, transferring this rule-of-thumb for atmospheric forecasting to morphodynamic modeling may be inappropriate, and a suitable level of BSS to represent a useful forecast in coastal morphodynamic modeling will emerge from experience.
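The decomposition is straightforward to compute. A sketch (added for illustration; it uses population statistics throughout, consistent with the angle-bracket means above, and the assembled value equals the BSS of Eq. (12)):

import numpy as np

def murphy_epstein(y, x, base):
    # Terms of Eq. (16) from the predicted (yp) and observed (xp) anomalies;
    # returns (a, b, c, e) and the BSS assembled as in Eq. (15).
    yp, xp = y - base, x - base
    r = np.mean((yp - yp.mean()) * (xp - xp.mean())) / (yp.std() * xp.std())
    a = r**2                                      # phase error term
    b = (r - yp.std() / xp.std())**2              # amplitude error term
    c = ((yp.mean() - xp.mean()) / xp.std())**2   # map mean error term
    e = (xp.mean() / xp.std())**2                 # normalization term
    return a, b, c, e, (a - b - c + e) / (1.0 + e)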
3.7. Indices of agreement

Willmott (1981) introduced an index of agreement, d_2, intended to act as a measure of the extent to which the predictions are error-free. This measure of relative accuracy was non-dimensionalized to ensure that it can be easily interpreted and used to compare between models. It is bounded by 1 and 0, with a value of 1 implying perfect agreement between predictions and observations. A value of zero implies complete disagreement. The concept was extended in Willmott et al. (1985) to include an index, d_1, that used mean absolute errors, as well as the original d_2, which used mean square errors. They are based on the assumption that the measurements are error-free (so the mean of the measurements is error-free). The predicted and measured deviations from the weighted mean of the observations are Y − ⟨X⟩ and X − ⟨X⟩, and the sum of the moduli of these terms gives the maximum potential error. The potential error and its square form the denominators of d_1 and d_2. No assumptions about the distributions of observed or predicted variables are made. The numerator is the mean absolute error or the mean square error. The indices are:

\[ d_1 = 1 - \frac{\langle |Y - X| \rangle}{\langle |Y - \langle X \rangle| + |X - \langle X \rangle| \rangle} \tag{18} \]

\[ d_2 = 1 - \frac{\langle (Y - X)^2 \rangle}{\langle \left( |Y - \langle X \rangle| + |X - \langle X \rangle| \right)^2 \rangle} \tag{19} \]

Alternative versions of the same concept are possible. One possibility is to subtract the median, rather than the mean, of the measured values in the denominator, as ⟨|X − X_M|⟩ is smaller than ⟨|X − ⟨X⟩|⟩. Indeed, it is possible to have alternative versions of all the measures of relative accuracy. In hydrodynamic modeling, it is common to non-dimensionalize the accuracy by dividing by the mean absolute value, or root mean square value, of the measured variable. Similar measures of relative accuracy could be derived for bathymetry by dividing the accuracy by the mean or standard deviation of the measured bed levels. However, neither the mean nor the standard deviation of the bed level has any great meaning. The mean bed elevation, in particular, is an unsatisfactory statistic to use, as its value depends on the datum used, and a relative accuracy that varies as the datum changes from chart datum to mean sea level cannot be used for intercomparisons between cases.

4. Comparison between probabilistic model performance statistics

The difference between the model performance statistics is demonstrated below using a hypothetical data set. In this case, a Dean (1977) equilibrium cross-shore profile was chosen as the initial profile. This took the form z = Dx^(2/3), with z the vertical distance of the bed below the onshore level (here set to 1 m), D = 0.1 (corresponding to a median grain diameter of about 0.2 mm) and x the cross-shore distance from the start of the Dean profile. A final 'measured' bed profile, X, was generated by adding an anomaly to the initial profile. The anomaly, X', was given by the sum of two sine waves:

\[ X' = -0.20\sin(kx) + 0.05\sin(2kx) \tag{20} \]

with k = π/24, representing a wavelength of 48 m. X' was given by Eq. (20) for 0 ≤ x ≤ 48 and X' = 0 elsewhere. A number of predicted final bathymetries were devised (see the sketch after this list):

Scenario 1: perfect modeling, Y = B + X'.
Scenario 2: initial profile plus half the measured anomaly, Y = B + 0.5X'.
Scenario 3: initial profile plus one and a half times the measured anomaly, Y = B + 1.5X'.
Scenario 4: initial profile plus the anomaly shifted 3.5 m offshore, Y = B + X'(x − 3.5).
Scenario 5: initial profile plus the anomaly plus a constant vertical offset of 0.055 m, Y = B + X' + 0.055.
Scenario 6: initial profile plus half the measured anomaly, but with the profile extended out to 70 m.

The initial and final bathymetries and the anomalies of Scenarios 1 to 5 are shown in Fig. 1. The measures of accuracy, relative accuracy, skill and correlation defined previously were calculated for each of the scenarios and the resulting values are shown in Table 8. The baseline prediction of the final bathymetry for the skill scores was the initial bed profile. The mean absolute error between the initial and final measured profiles, MAE(B,X), was 0.111 m (out to 55 m), while the root mean square difference was 0.136 m. Table 8 shows that perfect modeling (Scenario 1) gives bias and error of zero and skill and indices of agreement of 1, as expected.

Fig. 1. Cross-shore profiles and predicted anomalies for idealized Scenarios 1 to 5.



Table 8
Accuracy, skill and correlation from the idealized cross-shore profile modeling scenarios

Score       Scenario 1   Scenario 2   Scenario 3   Scenario 4   Scenario 5   Scenario 6
Bias          0.0000       0.0000       0.0000       0.0000       0.0550       0.0000
MAE(Y,X)      0.0000       0.0550       0.0550       0.0550       0.0550       0.0440
RMSE(Y,X)     0.0000       0.0680       0.0680       0.0660       0.0550       0.0600
MSE(Y,X)      0.0000       0.0046       0.0046       0.0044       0.0031       0.0036
r_XY          1.0000       0.9800       0.9800       0.9800       1.0000       0.9900
r_X'Y'        1.0000       1.0000       1.0000       0.8800       1.0000       1.0000
BSS           1.0000       0.7500       0.7500       0.7600       0.8300       0.7500
RMSSS         1.0000       0.5000       0.5000       0.5100       0.5900       0.5000
MASS          1.0000       0.5000       0.5000       0.5000       0.5000       0.5000
a             1.0000       1.0000       1.0000       0.7700       1.0000       1.0000
b             0.0000       0.2500       0.2500       0.0100       0.0000       0.2500
c             0.0000       0.0000       0.0000       0.0000       0.1700       0.0000
d_1           1.0000       0.8960       0.8910       0.8930       0.8900       0.9390
d_2           1.0000       0.9910       0.9900       0.9910       0.9940       0.9950

The mean absolute error in the modeling, MAE(Y,X), equals 0.055 m for both Scenarios 2 and 3. The skill scores and correlation coefficients were the same for both scenarios as well, showing that the statistics cannot differentiate between the two cases. The Murphy–Epstein decomposition of the BSS shows that the phase term a was 1, as the predicted anomaly was in phase with the measured one. The mean and normalization terms c and e are both zero, while the reduction in skill from a perfect score of 1 was due to amplitude errors (as term b was non-zero in both cases). The correlation coefficients had values of 0.98, reduced from one by the change in the standard deviation of the predicted profile.

In Scenario 4 the anomaly was shifted 3.5 m offshore, so that the mean absolute error, MAE(Y,X), was approximately the same as for Scenarios 2 and 3. The skill scores are essentially the same as before (small changes of ±0.01 were due to the finite discretization of the profiles in both the horizontal and vertical directions). The correlation coefficient was still 0.98, but the anomaly correlation coefficient was significantly reduced, to a value of 0.88. The decomposition of the BSS shows that this phase error was the main cause of the loss of skill in Scenario 4.

In Scenario 5, the constant vertical offset was chosen to give the same mean absolute error as in Scenarios 2 to 4. However, in this scenario the root mean square error was lower than before (0.055 rather than 0.068), so the BSS and the RMS Skill Score were higher than for Scenarios 2 to 4, but the Mean Absolute Skill Score was the same as before. The correlation coefficients were both 1, as they take no account of errors in predicting the mean value. The Murphy–Epstein decomposition of the BSS shows that the only reduction in skill from a perfect score of one was due to the error in the mean.

The indices of agreement gave similar values for Scenarios 2 to 5. Scenario 6 used the same data as Scenario 2, but with the profile extended out to 70 m using Dean's profile. In this case, the measures of accuracy (the mean absolute error, the root mean square error and the mean square error) all reduced, but the skill scores did not. The skill scores are thus a more "honest" measure, as the modeling was not improved by the addition of a morphologically inactive area.

4.1. General results of the model performance statistics

The MAE, MSE and RMSE all give different measures of the accuracy of the predictions. The use of the modulus makes the MAE nonanalytic and thus more difficult to manipulate mathematically than the RMSE (or MSE). However, the MAE is not as heavily influenced by outliers as the RMSE (Hedges, 2001) and is always lower than, or equal to, the RMSE. The average error (and hence accuracy) is probably best represented by the MAE, as it does not depend on the distribution of the errors. The MAE, MSE and RMSE are all reduced by the addition of a morphologically inactive area to the model domain

(compare the results from Scenario 6 and Scenario 2). Hence, the accuracy produced by a model can often be improved by extending the model domain offshore into deeper water, where there is less sediment transport. This is not a good way to judge the performance of a model.

The skill scores, however, give the same score when the morphologically inactive offshore area is introduced. The variation between the skill scores reflects the variations in the MAE, MSE and RMSE. The RMS Skill Score gives the same values as the Mean Absolute Skill Score, except for Scenario 5 (vertical offset), as the distribution of errors was different in this case. The BSS gave the highest values, as it is based on mean square errors. It will tend to give higher values for positive skill scores, but lower values for negative skill scores, than the Mean Absolute Skill Score. The ability to decompose the BSS into phase, amplitude and bias terms is definitely an advantage over the other skill scores (which have no equivalent to the Murphy–Epstein decomposition). This allows the relative importance of phase, amplitude and bias to be determined and may indicate the areas where a model most needs to be improved.

The correlation coefficient, r_XY, does not take into account errors in the mean value. Nor does it take into account the initial condition or any baseline prediction. Moreover, if Y = aX + b then r_XY = 1 for any values of a and b. Therefore, quite dissimilar sets of observations and predictions may be perfectly correlated. Moreover, if the changes in bathymetry are small compared with the range of depths in the initial bathymetry, then the correlation coefficient will be near to 1 for any plausible model, as the hypothetical scenarios showed. A large error may be masked by high standard deviations, or a low error may seem high if the standard deviations are low. Therefore, correlation coefficients are a poor measure of the performance of any model, not just of morphodynamic models.

Willmott's indices of agreement are probably unsuited to testing morphodynamic modeling, as they are based on variations from the measured mean, which is an arbitrary value that depends on the datum and has no significance for a beach. It also means that the denominator on the right-hand side is often large (as measured and predicted levels both tend to be on the same side of the measured mean). This caused the values in the idealized scenarios to be close to one, making it difficult to distinguish between scenarios.

The skill scores are sensitive to mean, phase and amplitude errors. They are unaltered by the inclusion of an additional area of zero activity. They clearly match the criteria for desirable model performance statistics set out in Section 1.2. Their strength lies partially in the inclusion of the baseline prediction (usually the initial condition) within the method. Alternative baseline predictions include an average profile (if enough data were available) and a Dean equilibrium profile. The baseline prediction should always be clearly stated when using a skill score. An additional benefit of using the BSS for morphological modeling is the Murphy–Epstein decomposition, which is a useful tool in determining the source of a low skill score.

5. Application of BSS to a coastal profile model

The BSS has been applied to assess the skill of the coastal profile model COSMOS at Egmond-aan-Zee (NL) on the North Sea coast (Brady and Sutherland, 2001; Sutherland and Soulsby, 2003). The modeling was carried out as part of the COAST3D project (Soulsby, 2001; Van Rijn et al., 2002; Van Rijn et al., 2003).

Egmond-aan-Zee is on a long, almost straight stretch of the Dutch coast. However, on direct inspection it is seen that there is a natural three-dimensionality to the bathymetry, produced by rip channels. These intersect a bar system that has an offshore bar, an inshore bar and an occasional swash zone bar as well. The period modeled during the COAST3D main experiment contained almost-continuous major storms with offshore significant wave heights, H_s, of up to 5 m. The storms caused substantial changes in morphology. Regular bathymetric surveys were performed in water depths up to about 6 m.

The bathymetric changes along a cross-shore profile were modeled using the coastal profile model COSMOS (Nairn and Southgate, 1993; Southgate and Nairn, 1993). COSMOS includes wave transformation by refraction, shoaling, Doppler shifting, bottom friction and wave breaking. It also includes wave setup, cross-shore and longshore sediment transport

using the energetics approach, and sea bed level changes due to cross-shore transport. COSMOS was driven using wave heights, periods and directions measured by a directional wave buoy approximately 5 km offshore, with water levels taken from a fixed pole on the outer bar. COSMOS assumes that the beach and the hydrodynamics are both uniform in the longshore direction. The measured profiles from seven parallel cross-shore transects were averaged at the start and end of each modeled period to provide a more longshore-uniform profile. A comparison between the results from five different profile models is provided by Van Rijn et al. (2003).

A 7-day period, from 24 October to 31 October 1998, at the height of the storm, was modeled (Brady and Sutherland, 2001). Fig. 2 shows the observed profiles at the start (24 October) and end (31 October) of the storm period, as well as the predicted profile at 31 October. The initial profile used in the modeling was the profile measured on 24 October. All distances along the cross-shore profiles were measured from the position of the wave buoy. The average of the hourly offshore measurements of root-mean-square wave height taken during these 7 days was 2.2 m, while the average measured spectral peak period was 8.5 s.

Fig. 2 also includes the BSS for each of three sections of the profile: the outer bar (from 4488 to 4660 m), the inner bar (from 4660 to 4870 m) and the swash zone (from 4870 to 5000 m). The baseline prediction of the final profile was the initial profile. The calculation of the BSS for the outer bar only extends out to 4488 m due to severe wave conditions limiting the extent of the bathymetric survey.

The model predicted the cross-shore evolution of the outer bar with a skill score of 0.64. The inner bar was predicted less well, but still had a positive skill score of 0.20. The skill score for the modeling of the swash zone was negative. The progression of skill scores can be explained as follows. It is easier to model the wave heights over the outer bar than the inner bar, as the waves at the outer bar are closer to the boundary conditions. It is therefore reasonable to expect that the changes in the cross-shore profile will be modeled more accurately over the outer bar than the inner. The model does not include swash-zone processes in its formulation, so its inability to model the swash zone is to be expected.

Moreover, the measured mean squared change in bed levels was very low (0.005 m²) in the swash zone. This value was similar to the modeled mean squared change (0.008 m²) and to the estimated mean squared measurement error. Therefore, the model was correct in modeling little change, but incorrect in predicting the details of the change. The swash zone provides an example of a case where the BSS is very sensitive to small changes in measurements and predictions, as the denominator is so small.
Fig. 2. Brier Skill Scores for outer, inner and swash zone bars for COSMOS modeling at Egmond. B=initial profile, X=measured final profile,
and Y=predicted final profile.
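Scores like these can be produced by evaluating the BSS over each section of the profile separately. A sketch (added for illustration; the profile arrays x, B, X and Y are assumed to share one cross-shore grid, and bss() is reused from the earlier sketch):

# Segment-wise BSS over named sections of a profile, as used for Fig. 2.
# Section limits are those quoted in the text.
sections = {"outer bar": (4488.0, 4660.0),
            "inner bar": (4660.0, 4870.0),
            "swash zone": (4870.0, 5000.0)}

for name, (x0, x1) in sections.items():
    m = (x >= x0) & (x < x1)              # select the points in this section
    print(name, bss(Y[m], X[m], B[m]))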

6. Application of BSS to a coastal area model

The BSS was applied to morphodynamic modeling at the mouth of the estuary at Teignmouth, situated in the southwestern United Kingdom (Sutherland et al., 2001; Sutherland et al., 2004; Sutherland and Soulsby, 2003). The modeling and the measurements used to assess the modeling were part of the COAST3D project (Soulsby, 2001; Whitehouse and Sutherland, 2001). A 14-day period that included relatively stormy weather was chosen for comparing predicted and observed morphological change. Two surveys from the COAST3D main experiment provided the initial and final bathymetries.

Teignmouth is located on the south coast of England, on the western side of Lyme Bay. The site comprises a tidal inlet adjacent to a beach. It was chosen as the UK COAST3D site because the bathymetry was irregular and three-dimensional and the site was known to be morphologically active (Whitehouse and Sutherland, 2001). The river Teign empties into the sea through the 6-km-long shallow estuary. The estuary channel is sinuous in the lower kilometer as it passes to the north of the Salty (a large inter-tidal area). A rocky headland (the Ness) is to the

Fig. 3. Bathymetry of Teignmouth at start of main experiment.



south of the estuary mouth, while a sand and shingle spit (Denn Point) is located on the north side. The beach to the north is about 2 km long, is groyned and is backed by a seawall. The exchange of water through the narrow estuary mouth causes high currents that maintain a deep scoured channel. A bathymetric chart of the area used for measurements in the COAST3D experiments is shown in Fig. 3. This also includes some instrument locations for the main experiment.

The mean tidal range is 1.7 m on neaps and 4.2 m on springs. The tide causes a large part of the estuary to dry and flood each tidal cycle. Outside the estuary, there are a number of morphologically active sandbanks, shown in Fig. 3: Pole Sand, Spratt Sand and East Pole Sand (marked by a box). The beach and nearshore bank system comprises mixed sediments and the sediment offshore is fine sand. The beach and sandbank systems at Teignmouth have been studied since the mid-nineteenth century, when Spratt (1856) documented cyclic movements of the bars. Further details of the site can be found in Whitehouse and Sutherland (2001).

The Teignmouth site was simulated using the coastal area modeling system PISCES (Chesher et al., 1993). The PISCES modules used in this study were the finite element flow module, TELEMAC-2D (Hervouet et al., 1994), the finite difference wave module, FDWAVE (Southgate and Goldberg, 1989), and the sand transport model, SANDFLOW. A local model was nested directly within a regional model. The regional model covered the whole of the English Channel and was driven by boundary conditions from the Tidal Flow Forum (Werner and Lynch, 1987). The local model included the whole estuary and extended approximately 7 km north and 5 km south of the estuary mouth, and between 3.5 and 4.5 km offshore.

The local model was run for the period between 07:00 on 8 November 1999 and 07:00 on 22 November 1999. Wave conditions were taken from measurements made at a buoy about 1.5 km
Fig. 4. Changes in bathymetry of East Pole Sand from measurements and three model runs. Reproduced from Sutherland and Soulsby (2003) by permission of World Scientific Publishing.

A uniform grain size was used throughout the model area. The sediment transport rate in the deep channel at the estuary mouth was set to zero to prevent unrealistically high sediment transport rates in this region, as the bed was known not to be of sand. The default bed roughness of 0.10 m was also applied throughout, and the bed was only updated at the end of the model run. Further details of the modeling can be found in Sutherland et al. (2001, 2004) and Sutherland and Soulsby (2003).

Here we concentrate on the bathymetric modeling of East Pole Sand, between 294 200 and 294 600 m East and between 72 200 and 72 500 m North (marked by a box in Fig. 3). Sand sizes of 0.25, 0.35 and 0.50 mm were modeled in turn to test the model's sensitivity to grain size. The measured and modeled changes in bathymetry are shown in Fig. 4 (from Sutherland and Soulsby, 2003). PISCES correctly predicted that there was erosion on the top and east of East Pole Sand and that there was some deposition on the western side. The Brier Skill Scores were calculated and are shown in Table 9. The Brier Skill Scores are all positive and increase with grain size.

Table 9
Decomposition of Brier Skill Score from coastal area modeling

d50 [mm]   BSS    Term α   Term β   Term γ   Term ε
0.25       0.15   0.38     0.20     0.04     0.01
0.35       0.29   0.38     0.07     0.03     0.01
0.50       0.34   0.38     0.01     0.03     0.01

One might expect that increasing the grain size further might lead to improvements in the model skill. However, Table 9 also includes the results of a Murphy–Epstein (1989) decomposition of the BSS (Eqs. (15) and (16)). This shows that the phase, mean error and normalization terms (α, γ and ε, respectively) remain essentially constant for all three grain sizes, but that the amplitude term, β, reduces from 0.20 to 0.01 as the grain size increases from 0.25 to 0.50 mm. The results show that increasing the grain size in the model reduces the transport rate, but does not alter the distribution of the sediment. Therefore, increasing the grain size further in the model will not produce significantly better results, as the amplitude error is already close to zero. Further improvements in model performance can thus only come from improving the modeling approach, possibly by including bed updating through the model run or by improving the wave–current interaction to reproduce the hydrodynamics around East Pole Sand during the storm.
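Eqs. (15) and (16) are given earlier in the paper; the sketch below is one rendering of the decomposition in terms of the anomalies X′ = X − B and Y′ = Y − B, written so that BSS = (α − β − γ + ε)/(1 + ε). The function is ours and assumes this anomaly-based form; only the term names come from the paper.

```python
import numpy as np

def murphy_epstein(b, x, y):
    """Murphy-Epstein terms for the BSS, computed from the anomalies
    X' = X - B and Y' = Y - B: alpha (phase), beta (amplitude),
    gamma (mean error) and eps (normalization)."""
    xp, yp = x - b, y - b
    r = np.corrcoef(xp, yp)[0, 1]                  # anomaly correlation
    sx, sy = xp.std(), yp.std()
    alpha = r ** 2                                  # phase error term
    beta = (r - sy / sx) ** 2                       # amplitude error term
    gamma = ((yp.mean() - xp.mean()) / sx) ** 2     # mean (bias) error term
    eps = (xp.mean() / sx) ** 2                     # normalization term
    return alpha, beta, gamma, eps
```

In this form a β that falls while α, γ and ε stay constant (the pattern in Table 9) raises the BSS toward a ceiling set by the phase term, which is why further grain-size tuning cannot help once β is near zero.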
7. Effect of measurement errors on BSS

Eq. (11) for skill shows that the denominator depends on the accuracy of a perfect prediction as well as the accuracy of the baseline prediction. Generally, the error of a perfect prediction is taken to be zero, giving Eqs. (12)–(14). However, this ignores the presence of measurement errors. All bathymetric surveys contain some errors in the measurements. Moreover, the results from different surveys will generally have to be interpolated onto the same grid for comparisons to take place, and the interpolation procedure will introduce further inaccuracies into the calculations of bias, accuracy and skill. To minimize the latter, the survey and data reduction processes used in the initial and final surveys should be compatible, as accurate as possible and include an estimate of their errors.

Two methods have been proposed to account for measurement errors. Although measurement errors are difficult to estimate, they will occur. Consequently, even in an inactive area, where no bed level changes take place in reality, there will be changes between the first and second surveys (B and X) at a point.

The first adjusted BSS is a refinement of that proposed in Sutherland et al. (2001). It comes from using mean square errors in all parts of Eq. (11). We assume that the initial measured bathymetry, B, is made up of the actual bathymetry B_A plus measurement error, δ_B, so B = B_A + δ_B. Similarly, let X = X_A + δ_X and the output from perfect modeling, W = B_A + δ_B + dW, with dW the modeled change in bed elevation due to perfect modeling. We assume that for perfect modeling dW = X_A − B_A, so that W = X_A + δ_B. It follows that MSE(W, X) = ⟨δ_X²⟩ + ⟨δ_B²⟩. We assume that surveys B and X are made the same way, so that ⟨δ_X²⟩ = ⟨δ_B²⟩ = ⟨δ²⟩. Substituting into Eq. (11) gives:
$$\mathrm{BSS}_P = \frac{\mathrm{MSE}(B,X) - \mathrm{MSE}(Y,X)}{\mathrm{MSE}(B,X) - 2\langle\delta^2\rangle} \qquad (21)$$

The assumptions above imply that MSE(B, X) = ⟨(B_A − X_A)²⟩ + 2⟨δ²⟩ ≥ 2⟨δ²⟩, so the denominator of Eq. (21) will be positive (or zero). Eq. (21), therefore, gives a higher magnitude of skill score than the standard form of the BSS (Eq. (12)). This means that positive skill scores will increase, but negative ones will decrease.
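Read literally, Eq. (21) is a one-line adjustment to the skill-score denominator. The sketch below is ours (names illustrative); with delta_rms = 0 it reduces to the standard BSS.

```python
import numpy as np

def bss_p(b, x, y, delta_rms):
    """BSS adjusted for the accuracy of perfect modeling, Eq. (21):
    (MSE(B,X) - MSE(Y,X)) / (MSE(B,X) - 2<delta^2>), where delta_rms
    is the rms measurement error of a single survey."""
    mse_baseline = np.mean((b - x) ** 2)
    mse_model = np.mean((y - x) ** 2)
    return (mse_baseline - mse_model) / (mse_baseline - 2.0 * delta_rms ** 2)
```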
Van Rijn et al. (2003) used a different approach. Their alternative method of calculating the BSS, adjusting it for the measurement error, is given by:

$$\mathrm{BSS}_{vR} = 1 - \frac{\left\langle\left(|Y - X| - \delta\right)^2\right\rangle}{\left\langle(B - X)^2\right\rangle} \qquad (22)$$

with |Y − X| − δ set to zero if |Y − X| < δ. In other words, the minimum difference between the computed value and the measured parameter plus or minus the measurement error is considered. When the computed value is within the error band of the measurements, the error (the numerator on the right-hand side of Eq. (22)) is set to zero. In this method, areas where the difference between modeled and measured final bathymetry is within the measurement error are treated as having no error. The measurement error always acts to reduce the numerator in Eq. (22), so maximizing the BSS value calculated.
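A corresponding sketch of Eq. (22), again with our naming and the clipping described above:

```python
import numpy as np

def bss_vr(b, x, y, delta):
    """Van Rijn et al. (2003) adjusted BSS, Eq. (22): prediction errors
    smaller than the measurement error delta are set to zero."""
    err = np.maximum(np.abs(y - x) - delta, 0.0)   # clip |Y - X| - delta at 0
    return 1.0 - np.mean(err ** 2) / np.mean((b - x) ** 2)
```

The clipping is why, for the same error estimate, this variant never scores lower than the unadjusted BSS on the same fields.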
Consider the measurements and modeling made at Teignmouth, described in the previous section. The majority of the observed bed levels were made from the survey vessel using DGPS and echo sounder. The estimated error (for depths less than 10 m) was 0.1 m for wave heights less than 0.2 m, the sea-state for the surveys (Whitehouse and Sutherland, 2001; Van Rijn et al., 2003). An alternative estimate of the noise in the measurements was made by calculating the root-mean-square difference between initial and final bed elevation measurements in a region that showed no noticeable changes between the initial and final surveys. The chosen region was between 294 700 and 295 000 m East and between 72 400 and 72 800 m North (Fig. 3). This gave a root-mean-square difference of 0.037 m. It was assumed that no sand transport took place in this region, so the difference was due to independent measurement errors in both surveys. It follows that ⟨δ²⟩^{1/2} = δ_rms = 0.037/√2 = 0.026 m. This root-mean-square difference is about a quarter of the quoted estimate of error of 0.1 m in the data report (Whitehouse and Sutherland, 2001). The situation is also complicated by the use of a surface interpolation package to produce sets of measured and predicted surface elevations on exactly the same grid. This was necessary, as the BSS requires all three surface elevations to be at the same point.
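The inactive-area estimate reduces to a one-line calculation. This helper is ours and assumes the two surveys carry independent, equal errors, so the rms of their difference is √2 times the single-survey error:

```python
import numpy as np

def delta_rms_from_inactive_area(z_initial, z_final):
    """rms measurement error of a single survey, estimated from a region
    assumed morphologically inactive: the initial-to-final rms difference
    combines two independent survey errors in quadrature."""
    rms_diff = np.sqrt(np.mean((z_final - z_initial) ** 2))
    return rms_diff / np.sqrt(2.0)

# With the 0.037 m rms difference quoted above, this returns 0.026 m.
```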
Using 0.026 m as the RMSE in Eqs. (21) and (22) gives the values for the adjusted Brier Skill Scores, BSS_P and BSS_vR, shown at the top of Table 10 for East Pole Sand, alongside the normal BSS and the root-mean-square difference between measurement and prediction, RMSE(X, Y). Taking the error into account improves each BSS_P by about 0.01. The Van Rijn method, BSS_vR, gives a larger improvement, doubling the BSS. When an RMSE of 0.1 m is used, the figures in the lower half of Table 10 are returned. Here the root-mean-square measurement error is a much larger fraction of the root-mean-square modeling error (the difference between final measurements and predictions) and the increases in the skill scores are therefore much larger. The BSS adjusted for the error in perfect modeling, BSS_P, gives increases in score of between 0.16 and 0.40. The Van Rijn method gives increases in score ranging from 0.37 to 0.44.

The Van Rijn et al. (2003) method (Eq. (22)) generally gives the larger improvement in skill score, for the same error, as any prediction lying within the error limit (set by the root-mean-square measurement error) is allocated an error of zero. Eq. (21) takes into account errors in perfect modeling and is compatible with the definition of skill (Eq. (11)), and so is considered the best method of including measurement error in the BSS.

Table 10
Brier skill score and adjusted Brier skill scores to account for measurement error

d50 [mm]   δ_rms [m]   BSS    BSS_P   BSS_vR   RMSE(X,Y) [m]
0.25       0.026       0.15   0.15    0.30     0.178
0.35       0.026       0.29   0.30    0.42     0.163
0.50       0.026       0.34   0.35    0.47     0.157
0.25       0.100       0.15   0.31    0.59     0.178
0.35       0.100       0.29   0.61    0.68     0.163
0.50       0.100       0.34   0.74    0.71     0.157

8. Qualification of skill scores

Skill scores provide an objective method of assessing the performance of morphological models. Any BSS higher than zero means that the modeling is more accurate than the baseline prediction. However, as skill scores are used more and more, it will become apparent what value of skill score is needed before a model result can be considered 'good'. As a step toward this, Van Rijn et al. (2003) proposed a classification table for their modification of the BSS, BSS_vR. This classification is shown in the first two columns of Table 11. In addition, the examples provided in the lower half of Table 10 (which assumed a mean square measurement error of 0.01 m²) were used to map points on the Van Rijn et al. (2003) classification onto equivalent points for the unmodified score, BSS, and the modified version, BSS_P, as shown in Table 11. The rest of the classification was completed by judgement, and by calculating test cases based on Table 10. This process of categorization is similar to meteorologists describing as 'useful' a BSS greater than 0.2 (Section 3.6).

Table 11
Proposed classification table for Brier skill score

                   BSS_vR     BSS        BSS_P
Excellent          1.0–0.8    1.0–0.5    1.0–0.8
Good               0.8–0.6    0.5–0.2    0.8–0.3
Reasonable/fair    0.6–0.3    0.2–0.1    0.3–0.15
Poor               0.3–0.0    0.1–0.0    0.15–0.0
Bad                <0.0       <0.0       <0.0

Table 11 illustrates the differences that might be expected between the different versions of the BSS. The mapping between versions depends on the relative sizes of the modeled changes, the measured changes and the measurement error. As such, the mappings between different versions of the skill score will alter with circumstances. Moreover, the ranges of BSS that form each category can be revised (ideally by an expert group) as more experience is gained with skill scores in realistic situations. Table 11 should, therefore, be regarded as provisional.
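The provisional Table 11 lookup is straightforward to automate. The helper below is ours; the category bounds come directly from the table, and scores falling exactly on a boundary are assigned to the higher class (a choice the table itself leaves open).

```python
def classify_bss(score, version="BSS"):
    """Map a skill score onto the provisional Table 11 qualifications."""
    lower_bounds = {                  # one tuple per column of Table 11
        "BSS_vR": (0.8, 0.6, 0.3, 0.0),
        "BSS":    (0.5, 0.2, 0.1, 0.0),
        "BSS_P":  (0.8, 0.3, 0.15, 0.0),
    }[version]
    for label, lo in zip(("Excellent", "Good", "Reasonable/fair", "Poor"),
                         lower_bounds):
        if score >= lo:
            return label
    return "Bad"                      # any negative score

print(classify_bss(0.34))             # 'Good': the 0.50 mm run of Table 9
```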
9. Conclusions

This paper has demonstrated how a number of statistical tests, commonly used in meteorology, can be applied to determine the quality of a morphodynamic model. Quality can be determined from bias, accuracy and skill. Bias is a measure of the difference in the central tendency (often the mean) and shows if the model underestimates or overestimates. Accuracy is a measure of the average difference between predictions and measurements, while skill is a non-dimensional measure of the accuracy of a prediction relative to the accuracy of a baseline prediction. The baseline prediction (of the final measured bathymetry) must be simple, and the initial bathymetry is a logical choice (i.e., if the model cannot perform better than doing no modeling, it has no skill). Alternative baselines include an average profile or an equilibrium profile (e.g., Dean, 1977).

In order to help choose an appropriate set of model performance statistics, the authors propose that the following criteria are desirable. The statistics should be easy to understand, agree with expert opinion, be 'honest', be transferable between data sets (so non-dimensional is better than dimensional), take intuitive values and show skill rather than accuracy.

The Linear Error in Probability Space (Potts et al., 1996) is well suited to ordered categories, according to the above criteria. When only two categories are possible, the simpler Heidke Skill Score (Heidke, 1926) can be used instead.

A probabilistic assessment has to be made when a continuous variable, such as bed elevation, is measured. The use of correlation coefficients to determine the quality of modeling can be misleading. Correlation coefficients represent the size, but not the significance, of the correlation between a set of measurements and a set of predictions. They do not represent bias, accuracy or skill, and have several weaknesses (as set out in Section 4.1). The most appropriate measure of bias is the difference in the means. Accuracy can be determined using three measures of average difference: the Mean Absolute Error, the Root Mean Square Error and the Mean Square Error. The average difference (and hence accuracy) is probably best represented by the Mean Absolute Error, as it does not depend on the distribution of the errors.

The measures of accuracy are all (apparently) improved by the addition of a morphologically inactive area to the model domain. Hence, the accuracy produced by a model can often be improved by extending the model domain offshore into deeper water where there is less sediment transport.
Therefore, measures of accuracy have to be judged in the context of the area modeled and are not necessarily easy to compare between applications.

The skill scores, however, still give the same score when the morphologically inactive offshore area is introduced, and are therefore more 'honest' than measures of accuracy. The variation between the skill scores reflects the variations in the mean absolute error, mean square error and root mean square error. The BSS will tend to give higher values for positive skill scores, but lower values for negative skill scores, than the Mean Absolute Skill Score. This could be considered an advantage of the BSS over the Mean Absolute Skill Score, even though the latter uses the Mean Absolute Error, which is probably the best representation of the accuracy. The ability to decompose the BSS into phase, amplitude and bias terms is definitely an advantage over the other skill scores. This allows the relative importance of phase, amplitude and bias to be determined and may indicate the areas where a model most needs to be improved. Measurement error can be taken into account by considering it to be the mean square difference between observations and predictions in perfect modeling. The BSS, therefore, is considered the best measure of performance for morphological models, based on the criteria set out above.

In addition, when evaluating model performance, it is good practice to answer the following questions. Was the model tuned against the data set used for validation? Was the evaluation performed as a blind test (i.e., did the modeler know the correct answer in advance)? What was the baseline prediction used in the calculation of any skill scores?

List of symbols
A        advance (category of shoreline movement)
a        constant
B        set of baseline predictions of final bathymetry
b        constant
BSS      Brier Skill Score
C        correct rejection (i.e., event predicted not to occur and did not occur)
D        constant in Dean equilibrium beach profile
dY       modeled change in bed elevation
E        equilibrium (category of shoreline movement)
F        false alarm (i.e., event predicted that did not occur)
FAR      False Alarm Rate
G        fraction of correct predictions expected from a random choice
H        Hit (i.e., event occurred as predicted)
HSS      Heidke Skill Score
J        total number of predictions or observations
k        wave number
LEPS     Linear Error in Probability Space
LS       LEPS score
M        Miss (i.e., event occurred but not predicted)
MAE      mean absolute error
MASS     mean absolute skill score
MSE      mean square error
N        number of categories
P        cumulative probability
PC       percentage correct
POD      probability of detection
R        retreat (category of shoreline movement)
RMSE     root mean square error
RMSSS    root mean square skill score
r_XY     correlation coefficient between sets X and Y
SS       skill score
s_XY     covariance between sets X and Y
TS       Threat Score
W        set of predictions from perfect modeling
X        set of measurements of final bathymetry
X_M      median value of the set of measurements
X′       measured anomaly (X − B)
x        horizontal distance
Y        set of predictions of final bathymetry
Y_M      median value of the set of predictions
Y′       predicted anomaly (Y − B)
z        vertical elevation
α        phase error in Murphy–Epstein decomposition
β        amplitude error in Murphy–Epstein decomposition
γ        mean error in Murphy–Epstein decomposition
δ        measurement error
ε        normalization term in Murphy–Epstein decomposition
σ        standard deviation
⟨X⟩      average of set X

Acknowledgements

This work was undertaken as part of the COAST3D project funded partly by the European Commission's research programme MAST under Contract Number MAS3-CT97-0086 and partly through national funding from the UK Ministry of Agriculture, Fisheries and Food under research project FD0803 and the UK Environment Agency's Research and Development project W5B017. The work on skill scores was partly funded through a NICOP grant 'Intermediate Scale Coastal Behaviour: Measurement, Modelling and Prediction' from the US Office of Naval Research, award N00014-97-1-0792. The authors are indebted to Dr. Michael Davey of the Hadley Centre at the UK Meteorological Office for giving an invited presentation at a NICOP workshop in 1997 on the methods used for quantifying prediction skill in the meteorological world, which formed the basis of the approach developed for the present paper.

References

Brady, A.J., Sutherland, J., 2001. COSMOS modelling of COAST3D Egmond main experiment. HR Wallingford Report TR 115.
Chesher, T.J., Wallace, H.M., Meadowcroft, I.C., Southgate, H.N., 1993. PISCES, a morphodynamic coastal area model. First Annual Report, Report SR 337. HR Wallingford, April 1993.
Dean, R.G., 1977. Equilibrium beach profiles: U.S. Atlantic and Gulf coasts. Ocean Engineering Report No. 12, Department of Civil Engineering, University of Delaware, Newark, DE.
Gallagher, E.L., Elgar, S., Guza, R.T., 1998. Observations of sand bar evolution on a natural beach. Journal of Geophysical Research 103 (C2), 3203–3215.
Gandin, L.S., Murphy, A.H., 1992. Equitable skill scores for categorical forecasts. Monthly Weather Review 120, 361–370.
Hedges, T.S., 2001. Regression and dimensional analysis in coastal engineering: some pitfalls. Proceedings of the Institution of Civil Engineers, Water and Maritime Engineering 148 (4), 219–225.
Heidke, P., 1926. Berechnung des Erfolges und der Güte der Windstärkevorhersagen im Sturmwarnungsdienst. Geografiska Annaler 8, 301–349 (in German).
Hervouet, J.-M., Hubert, J.-L., Jamin, J.-M., Lepeintre, F., Peltier, E., 1994. The computation of free surface flows with TELEMAC: an example of evolution towards hydroinformatics. Journal of Hydraulic Research 32 (extra issue), 45–64.
Livezey, R.E., Masutani, M., Ming, J., 1996. SST-forced seasonal simulation and prediction skill for versions of the NCEP/MRF Model. Bulletin of the American Meteorological Society 77 (3), 507–517.
Murphy, A.H., Epstein, E.S., 1989. Skill scores and correlation coefficients in model verification. Monthly Weather Review 117, 572–581.
Nairn, R.L., Southgate, H.N., 1993. Deterministic profile modelling of nearshore processes. Part 2. Sediment transport and beach profile development. Coastal Engineering 19, 57–96.
Nicholson, J., Broker, I., Roelvink, J.A., Price, D., Tanguy, J.M., Moreno, L., 1997. Intercomparison of coastal area morphodynamic models. Coastal Engineering 31 (1–4), 97–123.
Plant, N.G., Holman, R.A., 2001. Nearshore morphology characterisation based on a predictive model for sandbar migration. Proceedings, Coastal Dynamics '01. ASCE Press, Reston, VA, pp. 530–537.
Potts, J.M., Folland, C.K., Jolliffe, I.T., Sexton, D., 1996. Revised 'LEPS' scores for assessing climate model simulations and long-range forecasts. Journal of Climate 9, 34–53.
Soulsby, R.L., 2001. Sediment transport and morphodynamics on complex coastlines—the COAST3D project. Proceedings, Coastal Dynamics '01. ASCE Press, Reston, VA, pp. 92–101.
Southgate, H.N., Goldberg, D.G., 1989. An efficient computational model for wave refraction and diffraction using finite differences. HR Report 213.
Southgate, H.N., Nairn, R.L., 1993. Deterministic profile modelling of nearshore processes. Part 1. Waves and currents. Coastal Engineering 19, 27–56.
Spratt, T., 1856. An Investigation of the Movements of Teignmouth Bar. John Weale, London.
Sutherland, J., Soulsby, R.L., 2003. Use of model performance statistics in modelling coastal morphodynamics. Proceedings of the International Conference on Coastal Sediments 2003 (CD-ROM). World Scientific Publishing and East Meets West Productions, Corpus Christi, TX, USA. ISBN 981-238-422-7.
Sutherland, J., Hall, L.J., Chesher, T.J., 2001. Evaluation of the coastal area model PISCES at Teignmouth (UK). HR Wallingford Report TR 125.
Sutherland, J., Walstra, D.J.R., Chesher, T.J., Van Rijn, L.C., Southgate, H.N., 2004. Evaluation of coastal area modelling systems at an estuary mouth. Coastal Engineering 51, 119–142.
Van Rijn, L.C., Ruessink, B.G., Mulder, J.P.M. (Eds.), 2002. COAST3D—EGMOND: The Behaviour of a Straight Sandy Coast on the Time Scale of Storms and Seasons. Aqua Publications, Amsterdam.
Van Rijn, L.C., Walstra, D.J.R., Grasmeijer, B., Sutherland, J., Pan, S., Sierra, J.P., 2003. The predictability of cross-shore bed evolution of sandy beaches at the time scale of storms and seasons using process-based profile models. Coastal Engineering 47, 295–327.
Werner, F.E., Lynch, D.R., 1987. Field verification of wave equation tidal dynamics in the English Channel and Southern North Sea. Advances in Water Resources 10 (3), 115–130.
Whitehouse, R.J.S., Sutherland, J., 2001. COAST3D data report—3D experiment, Teignmouth, UK. HR Wallingford Report TR 119.
Wilks, D.S., 1995. Statistical Methods in the Atmospheric Sciences. Academic Press, San Diego. 467 pp.
Willmott, C.J., 1981. On the validation of models. Physical Geography 2 (2), 184–194.
Willmott, C.J., Ackleson, S.G., Davis, R.E., Feddema, J.J., Klink, K.M., Legates, D.R., Rowe, C.M., 1985. Statistics for the evaluation and comparison of models. Journal of Geophysical Research 90 (C5), 8995–9005.
Wright, L.D., Short, A.D., 1983. Morphodynamics of beaches and surf zones in Australia. In: Komar, P.D. (Ed.), Handbook of Coastal Processes and Erosion. CRC Press, Boca Raton, FL, USA, pp. 35–64.
Zheng, J., Dean, R.G., 1997. Numerical models and intercomparisons of beach profile evolution. Coastal Engineering 30 (3–4), 169–201.
