

Deep Learning Forecast Uncertainty for Precipitation over the Western United States

WEIMING HU,a MOHAMMADVAGHEF GHAZVINIAN,a WILLIAM E. CHAPMAN,b AGNIV SENGUPTA,a FRED MARTIN RALPH,a AND LUCA DELLE MONACHEa

a Center for Western Weather and Water Extremes, Scripps Institution of Oceanography, University of California, San Diego, San Diego, California
b National Center for Atmospheric Research, Boulder, Colorado

(Manuscript received 7 October 2022, in final form 23 February 2023, accepted 23 February 2023)

ABSTRACT: Reliably quantifying uncertainty in precipitation forecasts remains a critical challenge. This work examines the application of a deep learning (DL) architecture, Unet, for postprocessing deterministic numerical weather predictions of precipitation to improve their skill and for deriving forecast uncertainty. Daily accumulated 0–4-day precipitation forecasts are generated from a 34-yr reforecast based on the West Weather Research and Forecasting (West-WRF) mesoscale model, developed by the Center for Western Weather and Water Extremes. The Unet learns the distributional parameters associated with a censored, shifted gamma distribution. In addition, the DL framework is tested against state-of-the-art benchmark methods, including an analog ensemble, nonhomogeneous regression, and a mixed-type meta-Gaussian distribution. These methods are evaluated over four years of data and across the western United States. The Unet outperforms the benchmark methods at all lead times as measured by continuous ranked probability and Brier skill scores. The Unet also produces a reliable estimation of forecast uncertainty, as measured by binned spread–skill relationship diagrams. Additionally, the Unet has the best performance for extreme events (i.e., the 95th and 99th percentiles of the distribution) and, for these cases, its performance improves as more training data are available.

SIGNIFICANCE STATEMENT: Accurate precipitation forecasts are critical for social and economic sectors. They also play an important role in our daily activity planning. The objective of this research is to investigate how to use a deep learning architecture to postprocess high-resolution (4 km) precipitation forecasts and generate accurate and reliable forecasts with quantified uncertainty. The proposed approach performs well with extreme cases and its performance improves as more data are available in training.

KEYWORDS: Atmosphere; Forecast verification/skill; Probabilistic Quantitative Precipitation Forecasting (PQPF); Short-range prediction; Postprocessing; Machine learning

1. Introduction

Accurate precipitation forecasts are crucial for social and economic sectors. Quantitative precipitation forecasts (QPFs) of daily accumulated rainfall play an important role not only in water supply, flood risk, and drought mitigation, but also in guiding agricultural management and the operations of hydroelectric power plants (Theis et al. 2005). For example, accurate and spatially detailed QPFs are of particular interest and importance in California (Dettinger et al. 2011; Corringham et al. 2019). Statewide water management suffers from a distinct spatial mismatch: 75% of its rain and snow is received from the watersheds north of Sacramento, California, yet 80% of the demand comes from south of Sacramento. Furthermore, California experiences a uniquely high variability in precipitation (Dettinger et al. 2011) governed by the presence or absence of a relatively small number of large storms, typically landfalling atmospheric rivers (ARs) in the winter months (Dettinger and Cayan 2014; Oakley et al. 2018; Corringham et al. 2019). Reliable, accurate, and timely predictions of precipitation at weather time scales have the potential to inform operational decisions related to reservoir management and flood emergency response, e.g., through the Forecast Informed Reservoir Operations initiative that incorporates forecast information into water management decisions (Jasperse et al. 2020).

The current state-of-the-art method for forecasting precipitation events is numerical weather prediction (NWP), which is based on dynamical weather models. However, the accuracy of NWP can still be limited by errors in initial conditions, numerical approximations, incomplete understanding of underlying physical processes, and the chaotic nature of the atmosphere (Lorenz 1963; Gleick 2008). These errors from NWP will propagate to subsequent hydrological model simulations and affect the quality and uncertainty of the end products. In this context, machine learning (ML) methods can be designed to correct a portion of the errors that contaminate dynamical model estimates in a postprocessing framework.

Supplemental information related to this paper is available at the Journals Online website: https://doi.org/10.1175/MWR-D-22-0268.s1.

Corresponding author: Weiming Hu, weiminghu@ucsd.edu

DOI: 10.1175/MWR-D-22-0268.1

ML methods include traditional statistical methods, neural-based algorithms, and deep learning (DL) approaches (Goodfellow et al. 2016). Statistical postprocessing methods can be categorized into parametric and nonparametric methods depending on whether a distribution is considered a priori or not. More recently, Vannitsem et al. (2021) refer to them as "distribution-based assumptions" and "distribution-free assumptions" approaches, respectively. Nonparametric methods include weather analogs (Hamill and Whitaker 2006; Alessandrini et al. 2015; Hamill et al. 2015; Delle Monache et al. 2013; Scheuerer et al. 2020) and quantile regression (Bremnes 2004). These methods do not assume a prior distribution of the variable of interest; rather, the model structure is determined from data. In contrast, parametric methods typically assume a prior distribution of the predictand and try to estimate the associated distributional parameters. Examples include ensemble model output statistics (Wilks 2009; Scheuerer and Hamill 2015), the mixed-type meta-Gaussian distribution (Herr and Krzysztofowicz 2005; Wu et al. 2011), and Bayesian-based methods (Raftery et al. 2005; Wang et al. 2009).

More recently, neuron-based algorithms are becoming increasingly popular for postprocessing NWP. Rasp and Lerch (2018) demonstrated how neural networks (NNs) could be used for probabilistic postprocessing of ensemble forecasts within the distributional regression framework. For 2-m temperature forecasts, an NN was trained to learn the distributional parameters of a Gaussian predictive distribution. Taillardat et al. (2019) combined a random forest technique with a parametric distribution to calibrate rainfall ensemble forecasts and concluded that the hybrid approach produced the most skill improvements for forecasting heavy rainfall events. Ghazvinian et al. (2021, 2022) proposed a hybrid NN–nonhomogeneous regression scheme that uses an NN to learn the distributional parameters of a censored, shifted gamma distribution (CSGD). This approach provided a unified way to postprocess precipitation forecasts at multiple lead times and seasons. The advantage of an NN is its ability to reconstruct highly nonlinear functions and to explore a vast amount of data. Therefore, an NN is suitable as a postprocessing method to cope with the high nonlinearity and dimensionality in a weather system.

One disadvantage of an NN, however, is its limited ability to extract spatial features due to its fully connected layers. DL with convolutional neural networks (CNNs), hence, has been found promising for abstracting spatial information from high-resolution forecasts. It has been successfully applied to the statistical downscaling of temperature and precipitation over complex terrains (Pan et al. 2019; Sha et al. 2020a,b). Past work has shown its capability of discerning fine-grained spatial details. For precipitation forecasting, Li et al. (2022) proposed using a CNN to learn the distributional parameters of the CSGD, but the input to the CNN are predictors of a square patch of 7 × 7 grids centered at the grid to be predicted, rather than the entire domain. Chapman et al. (2019) first introduced denoising autoencoders for postprocessing deterministic forecasts of integrated vapor transport (IVT) and integrated water vapor (IWV), and the model takes predictors for the entire domain. This approach was later improved for probabilistic forecasts (Chapman et al. 2022), but the study domain was confined within a narrow band along the western coast and the spatial resolution was limited to 0.5°.

Inspired by the rich literature (Ronneberger et al. 2015; Chapman et al. 2019; Ghazvinian et al. 2021; Chapman et al. 2022; Ghazvinian et al. 2022; Han et al. 2022; Li et al. 2022; Badrinath et al. 2022), we propose to use Unet, a type of denoising autoencoder, to generate high-resolution, accurate, and reliable probabilistic quantitative precipitation forecasts (PQPFs) in this work. Both the input and output of Unet are high-resolution maps that cover the entire study domain, which avoids training and running separate models at each grid point. We aim to address three important questions in this work:

1) How can we adopt an Unet architecture for generating high-resolution precipitation forecasts?
2) How do we train such a network to characterize the forecast uncertainty?
3) What are the strengths and limitations of using a large model compared to other state-of-the-art benchmark methods?

The rest of the paper is organized as follows: section 2 describes observations and forecasts used in this study. Section 3 introduces the design of Unet and other benchmark postprocessing methods. Specifically, section 3b(1) proposes the solution for the first question on Unet architecture and section 3b(2) addresses the second question on model training. Section 4 exhibits results and demonstrates evidence of the strengths and limitations of Unet to address the third question. Finally, section 5 provides a summary of the work and additional discussion.

2. Data

a. High-resolution spatial climate data

Precipitation ground truth is collected from the Parameter Elevation Regression on Independent Slopes Model (PRISM) (Daly et al. 2002; Strachan and Daly 2017). PRISM provides a daily gridded precipitation dataset over the continental United States with a 4-km spatial resolution. It leverages multiple data sources including surface precipitation gauge networks as well as radar observations. Its ingestion of a digital elevation model (DEM) allows PRISM to accurately account for complex climate regimes associated with orography, rain shadows, temperature inversions, slope aspect, coastal proximity, and other factors. PRISM has been widely used in various precipitation studies, e.g., Lewis et al. (2017), Ishida et al. (2015), and Buban et al. (2020). It is suitable for this study as it provides a high-resolution gridded precipitation product with a continuous time series. Precipitation records from 1986 to 2019 have been analyzed in this study.

b. West-WRF reforecast

Reforecast products are forecasts from the same modeling system spanning multiple decades. One of their benefits is that they enable researchers to evaluate the predictability of historical high-impact precipitation events given a current model. Reforecast products can reveal important information about the model performance over a wide range of atmospheric and hydrological conditions, and they are often used


to improve current predictions and build postprocessing methods.

The West Weather Research and Forecasting (West-WRF) Model reforecast (Martin et al. 2018) is run for a total of 34 water years from 1986 to 2019. There are two integration domains, 9 and 3 km. The 3-km domain is one-way nested into the 9-km domain. A cumulus scheme is used in the 9-km domain but not in the 3-km domain. The radiation scheme is RRTMG as described in Iacono et al. (2008). Other physics schemes are described in Martin et al. (2018). An adaptive time step is used for all domains starting at 5dx and ranging between 1dx and 8dx, targeting a domain-wide vertical Courant–Friedrichs–Lewy criterion of 1.2.

In this study, all years refer to the water year starting from the December of the specified year and ending on the last day of March of the next year. For example, the water year 2019 refers to the period from 1 December 2019 to 31 March 2020. The 3-km domain is used in this work (see boundaries of the full model domains and the study domain in the supplemental material, Fig. 1). Forecasts are initialized at 0000 UTC for 120 h (5 days) into the future. However, the precipitation accumulation period of a "PRISM day" is 1200 to 1200 UTC. As a result, there are only forecasts for four overlapped lead days.

FIG. 1. Study domain and terrain elevation (color shading).

Figure 1 shows the study domain between [32.47°, 41.49°N] and [124.44°, 116.21°W]. This is the region of overlap between West-WRF and PRISM. Forecasts from the 3-km grid have been collected, and the predictors are shown in Table 1 as they have been previously found to be highly effective at capturing the state of the atmosphere for atmospheric rivers and precipitation events (Chapman et al. 2019, 2022). West-WRF is regridded to the PRISM grid using the nearest neighbor, prior to model training and data analysis.

3. Methods

a. Benchmark methods

1) CENSORED, SHIFTED GAMMA DISTRIBUTION

The CSGD heteroscedastic (nonhomogeneous) regression model was first introduced by Scheuerer and Hamill (2015) and is perhaps one of the most promising of the models that fall under the broad category of ensemble model output statistics (EMOS) (Gneiting et al. 2005). CSGD represents the right-skewed and mixed-type nature of the precipitation distribution by using a shifted gamma distribution, left censored at zero (Scheuerer and Hamill 2015; Baran and Nemoda 2016). The model explicitly addresses the heteroscedasticity in forecasts by prescribing nonlinear nonhomogeneous regression equations and links spatially smoothed ensemble statistics to the parameters of the predictive CSGD. In this work, we implement the simplest equations that use the forecast mean only:

$$\mu = \frac{\mu_{cl}}{\alpha_1}\,\log\!\left\{1 + [\exp(\alpha_1) - 1]\left(\alpha_2 + \alpha_3 \frac{f}{f_{cl}}\right)\right\}, \qquad \sigma = \alpha_4\,\sigma_{cl}\,\sqrt{\frac{\mu}{\mu_{cl}}}, \qquad (1)$$

where $f$ and $f_{cl}$ denote the raw ensemble mean forecasts and their climatological mean in training data, respectively. The training dataset is defined as all the data available excluding the test years. The shift parameter of the predictive distribution is fixed and estimated from the climatological CSGD.
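To make Eq. (1) concrete, the following is a minimal Python sketch of the CSGD regression step. The coefficient values are hypothetical placeholders; in the paper, the coefficients and the climatological parameters are fitted per grid point and lead time by minimizing the mean CRPS.

```python
import numpy as np

def csgd_regression(f, f_cl, mu_cl, sigma_cl, a1, a2, a3, a4):
    """Eq. (1): map a raw forecast mean f to the predictive CSGD mean
    and standard deviation, given climatological statistics."""
    mu = (mu_cl / a1) * np.log1p((np.exp(a1) - 1.0) * (a2 + a3 * f / f_cl))
    sigma = a4 * sigma_cl * np.sqrt(mu / mu_cl)
    return mu, sigma

# Hypothetical coefficients: in the paper, a1..a4 are fitted locally by
# minimizing the average CRPS over the training data.
mu, sigma = csgd_regression(f=12.0, f_cl=5.0, mu_cl=4.0, sigma_cl=3.0,
                            a1=0.5, a2=0.1, a3=1.0, a4=0.9)
```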

TABLE 1. Abbreviations and descriptions of weather variables used as predictors.

Predictor   Long name                                   Vertical placement
P           Pressure                                    Surface
T           Temperature                                 2 m above ground
Rh          Relative humidity                           2 m above ground
u/v         U/V component of wind                       10 m above ground, 500 mb
Z           Geopotential height                         500 mb
Precip      Daily accumulated precipitation             Surface
IVT         Integrated vapor transport                  Vertically integrated
IWV         Integrated water vapor                      Vertically integrated
IVT U/V     Integrated vapor transport U/V component    Vertically integrated
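The Table 1 predictors eventually form the 13 input channels of the Unet described in section 3b. The sketch below standardizes and stacks the predictor fields into one multichannel input; the variable names, array shapes, and random data are assumptions for illustration, not the paper's data pipeline.

```python
import numpy as np

# Hypothetical container: one (lat, lon) field per predictor in Table 1,
# already regridded to the 4-km PRISM grid (u/v at 10 m and 500 mb count
# as four separate channels).
predictors = {name: np.random.rand(128, 128).astype(np.float32)
              for name in ["P", "T", "Rh", "U10", "V10", "U500", "V500",
                           "Z", "Precip", "IVT", "IWV", "IVT_U", "IVT_V"]}

def standardize(x, mean, std):
    # Per-variable standardization, as done before entering the encoder.
    return (x - mean) / std

# Stack into a (channels, lat, lon) tensor: 13 input channels.
x = np.stack([standardize(v, v.mean(), v.std() + 1e-8)
              for v in predictors.values()])
assert x.shape[0] == 13
```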


Climatological CSGD refers to the CSGD model fitted using climatology data to be used as the reference forecast, following Scheuerer and Hamill (2015) and Ghazvinian et al. (2020, 2021). In this case, both the regression coefficients above and the climatological CSGD are identified locally by minimizing the average value of the continuous ranked probability score (CRPS) over training data (Scheuerer and Hamill 2015; Baran and Nemoda 2016) for the CSGD. We used forecast–observation pairs of all months over the training years for fitting the predictive distribution. The climatological CSGD is computed for each month using the data of that month and two surrounding months from the historical observations.

The CSGD and its variants have been shown to be able to generate calibrated and highly skillful PQPFs in a variety of hydrometeorological conditions and spatiotemporal scales, e.g., Ishida et al. (2015), Baran and Nemoda (2016), Zhang et al. (2017), Bellier et al. (2017), Scheuerer et al. (2017), Baran and Lerch (2018), Hamill and Scheuerer (2018), Taillardat et al. (2019), Scheuerer et al. (2020), Ghazvinian et al. (2020), and Lei et al. (2022). A notable limitation of EMOS schemes, CSGD included, is that their performance can depend on prescribed inflexible predictor–predictand relationships, as pointed out by Rasp and Lerch (2018) and Ghazvinian et al. (2021). Recent studies (Ghazvinian et al. 2020, 2021; Valdez et al. 2022) showed that the predictive CSGD (Scheuerer and Hamill 2015) tends to underestimate the probability of precipitation (PoP) due to its reliance on climatological shift parameters. This bias was found to be more evident at shorter lead times and to some extent directly affected the CSGD's overall predictive performance. In this study, separate CSGD regression models are trained for each pixel and each lead time.
2) MIXED-TYPE META-GAUSSIAN DISTRIBUTION

As an additional benchmark, we consider the mixed-type meta-Gaussian distribution (MMGD). The MMGD (Herr and Krzysztofowicz 2005; Wu et al. 2011) is a widely known Bayesian modeling-based statistical technique to generate a postprocessed ensemble of QPFs from a deterministic forecast. It was implemented by the U.S. National Weather Service (NWS) and is currently an integral part of the NWS Hydrologic Ensemble Forecast Service (HEFS) (Demargne et al. 2014; Brown et al. 2014). The model uses a mixed two-part process to compute the conditional cumulative distribution function (CDF) of observations given a real-time forecast. Denoting by $X$ and $Y$ the random variables of the QPF and ground truth, and by $x$ and $y$ their realized values, respectively, the predictive CDF is estimated separately for rainy and dry forecasted cases:

$$F_{Y|X}(y \mid x,\, x = 0) = a + (1 - a)\,G_Y(y),$$
$$F_{Y|X}(y \mid x,\, x > 0) = c(x) + [1 - c(x)]\,D_{Y|X}(y \mid x). \qquad (2)$$

Given a forecasted value, the CDFs above are composed of mass probabilities of observed precipitation being equal to zero [$a$ and $c(x)$] and continuous conditional CDFs [$G_Y(y)$ and $D_{Y|X}(y \mid x)$].

MMGD uses the meta-Gaussian distribution (Kelly and Krzysztofowicz 1997) to estimate the conditional distribution $D_{Y|X}(y \mid x)$, which relies on parametric normal quantile transformation (NQT) of positive observation–forecast pairs. See Wu et al. (2011), Ghazvinian et al. (2020), and Ghazvinian (2021) for the MMGD parameter estimation and derivation details. We estimate the MMGD parameters locally using the training sample pooled across all months (excluding the test period) and for each grid point separately. We model the marginal distributions using a gamma distribution as it provides the best fit to our data.

A number of past studies have conducted performance evaluations or comparisons of the MMGD with other schemes and for different regions within the continental United States. Examples include Wu et al. (2011), Brown et al. (2014), Zhang et al. (2017), Wu et al. (2018), Kim et al. (2018), and Ghazvinian et al. (2019, 2020, 2021). The results collectively indicate that while MMGD is a highly parsimonious mechanism and can preserve the skill in the raw forecast, it has a notable limitation: a tendency to under-forecast heavy-to-extreme precipitation amounts as a result of conditional biases and a limited capability to adequately capture forecast heteroscedasticity stemming from the NQT transformation, among other factors. This under-forecasting problem is consequential because the performance of ensemble streamflow forecasts from the U.S. NWS HEFS operations depends critically on the PQPF performance from the MMGD. In this study, separate MMGD regression models are trained for each pixel and each lead time.
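A minimal sketch of how the two branches of Eq. (2) compose a predictive CDF follows. The gamma marginal and the conditional distribution D are illustrative stand-ins with made-up parameters; the actual MMGD derives D from the NQT-based meta-Gaussian model cited above.

```python
from scipy import stats

def mmgd_cdf(y, x, a, c_of_x, G_y, D_y_given_x):
    """Mixed-type predictive CDF of Eq. (2).

    a            : P(obs = 0) given a zero forecast
    c_of_x       : P(obs = 0) given a positive forecast x
    G_y          : climatological CDF of positive observations
    D_y_given_x  : conditional CDF of obs given forecast (placeholder)
    """
    if x == 0:
        return a + (1 - a) * G_y(y)
    c = c_of_x(x)
    return c + (1 - c) * D_y_given_x(y, x)

# Illustrative components only; the true conditional uses the
# meta-Gaussian (NQT-based) form, not a gamma scaled by the forecast.
G = stats.gamma(a=0.6, scale=8.0).cdf
D = lambda y, x: stats.gamma(a=0.8, scale=0.7 * x).cdf(y)
p_le_5 = mmgd_cdf(5.0, 12.0, a=0.55, c_of_x=lambda x: 0.1,
                  G_y=G, D_y_given_x=D)
```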
3) ANALOG ENSEMBLE

The analog ensemble (AnEn) (Delle Monache et al. 2013; Hu et al. 2020) is a technique to generate forecast ensembles from deterministic predictions. It is different from the previous parametric methods where a prescribed distribution for forecasting precipitation is first assumed and then the algorithm fits the distributional parameters to the data. AnEn, however, relies on prediction similarity and historical observations to generate forecast ensembles.

AnEn assumes that, given a static NWP model, similar predictions are associated with similar error patterns, and by sampling from similar historical predictions, the error can be corrected using the associated historical observations. For example, to generate the AnEn for precipitation with a 24-h lead time at a single grid point, the following distance metric is calculated between the target prediction and all historical predictions with the same lead time and at the same location:

$$\|F_t, A_{t'}\| = \sum_{i=1}^{N} \frac{v_i}{s_i} \sqrt{\sum_{j=-\tilde{t}}^{\tilde{t}} \left(F_{i,t+j} - A_{i,t'+j}\right)^2}, \qquad (3)$$

where $\|F_t, A_{t'}\|$ represents the distance between the multivariate deterministic target forecast $F_t$ at time $t$ and an analog forecast $A_{t'}$ at a historical time $t'$. On the right is a Euclidean distance. $N$ is the number of weather variables; $v_i$ is the predefined weight associated with the variable $i$; and $s_i$ is the standard deviation, as a normalizing factor, for the variable $i$. The term $F_{i,t+j}$ is the target forecast at time $t + j$ for the variable $i$ and, similarly, $A_{i,t'+j}$ is the historical analog forecast at time $t' + j$ for the variable $i$. Finally, $\tilde{t}$ is the half size of the temporal window. Comparison within a temporal window is necessary to find more realistic weather analogs.

FIG. 2. Unet architecture for probabilistic precipitation forecasts.

In this study, $\tilde{t} = 1$ indicates that three forecast lead times (previous, current, next) are used to calculate the similarity metric. Once the similarity metric is calculated for all $t'$, the historical forecasts with the lowest distances are selected as analog forecasts and their associated observations comprise the AnEn members. This process is repeated independently at all grid points and for all lead times. The generated ensemble can then be used to build an empirical distribution and be converted to probabilistic forecasts to be consistent with the previous parametric methods.

The most important parameters of the AnEn are the predictor weights and the number of analog members. Both parameters are optimized using the training dataset (data excluding the testing period) and a constrained extensive grid search. For all experiments, 15 members are generated by the AnEn.

b. Proposed deep learning for forecast uncertainty

1) U-NET ARCHITECTURE

The proposed DL postprocessing model is built after the Unet (Ronneberger et al. 2015; Long et al. 2015) architecture. The architecture features a U-shape diagram with an encoder (left), a bottleneck (bottom), and a decoder (right), as shown in Fig. 2. Rectangles represent multidimensional tensors and arrows represent various operations. Spatial dimensions are denoted within square brackets and the number of features in each tensor is indicated atop rectangles.

To begin with, forecasts with 13 variables (listed in Table 1) are first standardized and then input into the encoder branch to be spatially compressed through repeated convolutions. Meanwhile, more features are generated by the convolutions that extract high-level spatial information. The output of the encoder branch is usually referred to as the bottleneck. It is then followed by a reconstruction stage by feeding the bottleneck features into a decoder branch. The decoder branch is composed of repeated deconvolution (or transpose convolution) and skip connections (Drozdzal et al. 2016). The decoder structure is indispensable for image-to-image problems because it expands the spatial domain from the lower-resolution representation and reconstructs the desired output dimension. As a result, the output of the decoder has the same spatial dimensions as the input. Each of the grid points has a distinct set of three variables that determine the CSGD at the location.

The skip connection, shown as gray arrows in Fig. 2, is an indispensable component of the Unet architecture which improves model training stability and helps preserve fine features in high-spatial-resolution input (Mao et al. 2016). Deep networks suffer from gradient explosion and vanishing issues (Mao et al. 2016; Tong et al. 2017). During weight optimization and backward propagation, according to the chain rule, error gradients are multiplied as they pass along the network. However, in the long chain of multiplication (following the U-shape path), error gradients can be numerically unstable. Skip connections, however, provide an additional path for error terms to pass through the network, by concatenating features from the encoder stage. As a result, model training becomes more stable. (Table A1 in appendix A shows a detailed list of model parameters and the configuration of each layer in the proposed Unet model.)
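The following is a minimal PyTorch sketch of the encoder-bottleneck-decoder pattern with a concatenation skip connection. The channel widths and depth are made up for brevity; the actual layer configuration is given in Table A1 of the paper.

```python
import torch
import torch.nn as nn

class MiniUnet(nn.Module):
    """Illustrative Unet with one down/up level and a skip connection:
    13 input channels (Table 1 predictors) -> 3 output maps per grid
    point (the CSGD parameters)."""
    def __init__(self, in_ch=13, out_ch=3, width=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(width, width * 2, 3, stride=2, padding=1)
        self.bottleneck = nn.Sequential(
            nn.Conv2d(width * 2, width * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
        # After concatenating the skip connection, channels double.
        self.dec = nn.Sequential(nn.Conv2d(width * 2, width, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(width, out_ch, 1)

    def forward(self, x):
        e = self.enc(x)                         # full-resolution features
        b = self.bottleneck(self.down(e))       # compressed bottleneck
        u = self.up(b)                          # back to input resolution
        d = self.dec(torch.cat([u, e], dim=1))  # skip connection: concat
        return self.head(d)                     # (batch, 3, H, W)

y = MiniUnet()(torch.randn(2, 13, 64, 64))      # -> torch.Size([2, 3, 64, 64])
```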


2) LOSS FUNCTION AND MODEL TRAINING

As shown in Fig. 2, the Unet produces distributional parameters at each grid point. To optimize the learned distribution, we use the CRPS (Hersbach 2000) as the loss function during model training. The CRPS is a probabilistic scoring metric that integrates the squared difference between the CDFs of forecasts and observations. Minimizing the CRPS encourages a sharper and bias-corrected forecast distribution. Empirically, the CRPS can be calculated by aggregating Brier scores over all possible thresholds. For deterministic predictions, the CRPS collapses to the mean absolute error (MAE).

However, the numerical approximation of the CRPS can be computationally expensive. A closed form of the CRPS for a paired CSGD predictive distribution and verifying observation has been studied by Scheuerer and Hamill (2015) and Ghazvinian et al. (2021). Similarly, we calculate the following:

$$\begin{aligned}
\mathrm{CRPS}(F_{k_i,\theta_i,\delta_i}, y_i) = {} & (y_i - \delta_i)\left[2F_{k_i,\theta_i}(y_i - \delta_i) - 1\right] \\
& - \frac{\theta_i k_i}{\pi}\,B\!\left(\frac{1}{2},\, k_i + \frac{1}{2}\right)\left[1 - F_{2k_i,\theta_i}(-2\delta_i)\right] \\
& + \theta_i k_i\left[1 + 2F_{k_i,\theta_i}(-\delta_i)\,F_{k_i+1,\theta_i}(-\delta_i) - F_{k_i,\theta_i}(-\delta_i)^2 - 2F_{k_i+1,\theta_i}(y_i - \delta_i)\right] \\
& + \delta_i\,F_{k_i,\theta_i}(-\delta_i)^2,
\end{aligned} \qquad (4)$$

where $(k_i, \theta_i, \delta_i)$ are the three parameters of the $i$th predictive CSGD, namely, the shape, scale, and shift; $F$ is the CDF for the CSGD; $y_i$ is the verifying observation; and $B$ is the beta function. With this analytical form, the calculation of the CRPS can be more computationally efficient. Furthermore, the shape $k$ and scale $\theta$ can be related to the mean $\mu$ and standard deviation $\sigma$ by $k = \mu^2/\sigma^2$ and $\theta = \sigma^2/\mu$.

To avoid numerical instability in the computation of the CRPS during training, we propose the following regularization on the forecasted standard deviation ($\sigma_i$):

$$\mathcal{L}(\mu_i, \sigma_i, \delta_i, y_i) = \mathrm{CRPS}(\mu_i, \bar{\sigma}_i, \delta_i, y_i) = \mathrm{CRPS}(\mu_i,\, \sigma_i + a \times \mu_i,\, \delta_i,\, y_i), \qquad (5)$$

where $\mu_i$, $\sigma_i$, and $\delta_i$ are learned parameters from the Unet and $a$ is a scaling factor. To be consistent with the notation in Eq. (4), $\mu_i = k_i\theta_i$ and $\sigma_i = \sqrt{k_i}\,\theta_i$. The standard deviation used to calculate the CRPS ($\bar{\sigma}_i$) depends on the two model outputs $\mu_i$ and $\sigma_i$; $a$ determines the portion of forecast uncertainty related to the intensity of the precipitation event. As a result, a certain amount of uncertainty is always present when generating forecasts. If the lower bound for uncertainty is not present, we found that the training becomes numerically unstable when calculating the CRPS with small values of the standard deviation. Another possible solution is to engage early stopping, but in practice, it resulted in too few training epochs (fewer than 5 in our case). The Unet failed to converge and was underperforming. We also found (not shown) that the Unet ensemble tends to be underdispersive if no constraint is imposed on forecast uncertainty.
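A sketch of Eqs. (4) and (5) using SciPy's gamma CDF and beta function follows; a differentiable training implementation would express the same algebra in the DL framework's tensor operations. The numeric inputs are made up for illustration.

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import beta as beta_fn

def crps_csgd(k, theta, delta, y):
    """Closed-form CRPS of Eq. (4) for a CSGD with shape k, scale theta,
    and (negative) shift delta, against a verifying observation y."""
    F = lambda shape, z: gamma.cdf(z, a=shape, scale=theta)
    ytil = y - delta
    crps = ytil * (2.0 * F(k, ytil) - 1.0)
    crps -= (theta * k / np.pi) * beta_fn(0.5, k + 0.5) * (1.0 - F(2 * k, -2 * delta))
    crps += theta * k * (1.0 + 2.0 * F(k, -delta) * F(k + 1, -delta)
                         - F(k, -delta) ** 2 - 2.0 * F(k + 1, ytil))
    crps += delta * F(k, -delta) ** 2
    return crps

def loss(mu, sigma, delta, y, a=0.35):
    """Regularized loss of Eq. (5): inflate sigma before scoring."""
    sigma_bar = sigma + a * mu
    k = mu ** 2 / sigma_bar ** 2     # shape from mean and std
    theta = sigma_bar ** 2 / mu      # scale from mean and std
    return crps_csgd(k, theta, delta, y)

val = loss(mu=6.0, sigma=2.0, delta=-0.5, y=4.2)
```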
In this study, one Unet is trained for the entire domain and separate Unet models are trained for each lead day. Each Unet learns (outputs) the three distributional parameters simultaneously associated with a CSGD. Four water years are selected as tests: 1997, being one of the strongest El Niño events during the past 60 years of records; 2011, being identified as a La Niña year; and two El Niño–Southern Oscillation (ENSO) neutral years, namely, 2016 and 2013, being wet and dry, respectively. El Niño forcing and background internal variability have been shown to influence precipitation predictability (Chapman et al. 2021). Therefore, these four years are selected to test the robustness of the proposed algorithm.

Models are trained for each water year independently with the previous year as validation and all other years as training. Hyperparameter tuning is carried out using the training and validation data, excluding the four years used in testing. Stochastic gradient descent with a minibatch (Li et al. 2014) of size 8 and a momentum of 0.9 is used during weight optimization. A cyclical learning rate (Smith 2017) is used with a maximum of $10^{-2}$, a minimum of $10^{-5}$, a step size of 5, and a shrink factor of 2. The step size indicates the number of training iterations it takes to go from the minimum to the maximum learning rate. After each cyclic walk (maximum to minimum and then back to maximum), learning rates are divided by the shrinking factor. The maximum training iteration is set to 200 but early stopping is engaged if no improvement has been observed in the validation loss for 20 consecutive iterations.

To determine the best scaling factor $a$, a sequence of values from 0 to 1 with an increment of 0.025 has been tested. Although there are no theoretical upper and lower bounds for $a$, negative and larger-than-one values lead to unrealistic uncertainty estimation. Therefore, the parameter search is confined between 0 and 1. An $a$ value of 0.35 is chosen for training the final Unet because it achieved the best CRPS and spread–skill correlation (not shown) on the validation data.
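The optimization recipe above can be sketched with PyTorch's built-in cyclical learning-rate scheduler; here the "triangular2" mode, which halves the cycle amplitude after each full cycle, stands in for the shrink factor of 2, and the loss computation is elided with a placeholder.

```python
import torch

model = torch.nn.Conv2d(13, 3, 1)   # stand-in for the Unet
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# Cyclical learning rate between 1e-5 and 1e-2 with a step size of 5.
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-5, max_lr=1e-2, step_size_up=5, mode="triangular2")

best, patience, bad = float("inf"), 20, 0
for it in range(200):               # maximum of 200 training iterations
    # ... forward pass on minibatches of size 8, loss = mean of Eq. (5),
    # then loss.backward() ...
    opt.step()
    sched.step()
    val_loss = 1.0 / (it + 1)       # placeholder for the validation loss
    if val_loss < best:
        best, bad = val_loss, 0
    else:
        bad += 1
        if bad >= patience:         # early stopping after 20 iterations
            break                   # without validation improvement
```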
4. Results

In this section, we present precipitation forecast comparisons and their verification over the 4-yr test period. Unet forecasts are compared to the baseline Weather Research and Forecasting (WRF) model forecasts and three different postprocessing schemes, AnEn, MMGD, and CSGD. Since WRF is a deterministic NWP model and all other methods generate probabilistic forecasts, the distribution mean is used to compare with WRF forecasts when a deterministic form is desired. Additional descriptions of the verification metrics used are provided in appendix B.

a. Deterministic predictions and forecast uncertainty

The ability of a model to accurately predict the accumulated precipitation throughout the rainy season is critical for water management. Figure 3 shows the mean areal precipitation over the Yuba–Feather River watershed that stems from the west slope of the Sierra Nevada and sits to the northeast of Sacramento. This watershed includes several hydropower projects and accurate precipitation forecasts are critical to its operation. The time series shows precipitation aggregated over days in the rainy season starting in December 2016, which marked a particularly wet year.


FIG. 3. Mean areal precipitation over the Yuba–Feather River watershed with the topographic map at the lower right. The time series shows precipitation aggregated over days in the water year starting in December 2016. For Unet, AnEn, CSGD, and MMGD, the distributional mean is used to calculate the time series.

As shown in Fig. 3, WRF remained close to PRISM until 18 January 2017 when WRF started to overpredict. There were several major rain events on 9 January 2017, 19 January 2017, 27 February 2017, and 21 February 2017, and WRF showed significant overprediction during the latter three events, enlarging the difference to PRISM. On the other hand, MMGD, CSGD, and AnEn all showed underprediction early on during the water year around 13 December 2016, with MMGD consistently producing the most underprediction among the others. Unet showed a mixed performance throughout the water year, underpredicting the rain events during 5 January 2017 and 13 January 2017 but overpredicting the rain events during 1 February 2017 and 10 February 2017. However, overall, Unet closely follows PRISM throughout the year and its prediction for the year-round total precipitation is the most accurate compared to the other baseline forecasts.

During the 2016/17 rainy season, the largest amount of daily accumulated rain was received on 9 January 2017 and Fig. 4 compares forecasts and observations for this event. PRISM is shown in Fig. 4a and WRF is shown in Fig. 4b. The distributional means from MMGD, CSGD, AnEn, and Unet are shown in Figs. 4c–f, with their standard deviations, σ, plotted below. Forecasts for the first lead day are shown.

On 9 January 2017, a large amount of precipitation was received over the Sierra Nevada with the northern rain region extending farther to northwestern Nevada. An average precipitation of 90 mm was received for the day along the coast to the north and over the Russian River watershed. On the other hand, WRF overpredicted the precipitation over the Sierra Nevada and at Sacramento, California. WRF produced an overprediction to the west of the Mojave Desert and it failed to predict the light rain in southern California.

FIG. 4. Forecasts and observations of the precipitation event on 9 Jan 2017: (a) the observations from PRISM; (b) the deterministic WRF reforecast; (c)–(f) the distributional mean from MMGD, CSGD, AnEn, and Unet, respectively; and (g)–(j) the standard deviation σ. For better visualization, the precipitation difference is shown in supplemental Fig. 2.


FIG. 5. (a) RMSE skill score and (b) correlation as a function of forecast lead times. Skill scores are calculated using the climatological CSGD as the reference forecast. Vertical dashes indicate a 95% confidence interval from bootstrapping. Note the ranges of the vertical axes vary across panels.

The four postprocessing methods showed mixed results. MMGD underpredicted both the intensity and the span of precipitation, for example, in the Central Valley and the Sierra Nevada. CSGD and AnEn both exhibited a better performance at estimating the rain intensity over the Sierra Nevada. However, these forecasts still suffered from an underprediction over the Russian River watershed in northern California. Finally, Unet best resembles the ground truth, PRISM, for example, in the Sierra Nevada and the Russian River watershed. Additionally, Unet corrected the overprediction from WRF to the west of the Mojave Desert and improved its predictions for light rain regions in southern California. On average, the root-mean-square errors (RMSEs) for this day are 20.01 mm (WRF), 21.44 mm (MMGD), 16.43 mm (CSGD), 16.16 mm (AnEn), and 12.55 mm (Unet).

The forecast uncertainty, quantified using the distributional standard deviation, is plotted below the precipitation maps. In general, forecast uncertainty is correlated with precipitation intensity. A low uncertainty produces a sharp forecast, but the uncertainty estimation should also be reflective of the expected predictive skill of the forecast. Although Unet has higher uncertainty than other methods over the Sierra Nevada (Fig. 4j), it provides a timely warning that the coming event is hard to predict, and this information matches up with the rarity and intensity of the rain event. On the other hand, Unet produced a correct forecast with lower uncertainty at the southern Coastal Range and to the west of the Mojave Desert while all other methods failed. This difference suggests that the additional spatial information in Unet helps to correct for small-scale precipitation mismatches.

It is worthwhile to point out the visually smoothed precipitation field generated by Unet, in comparison to the other benchmark methods. The main reason is that only one Unet is trained across the domain and the convolution-based architecture tends to generate smoother output. But MMGD, CSGD, and AnEn are all applied grid by grid, and therefore, their results seem to better discern local variability. However, having local variability does not necessarily indicate predictive skill. It is a balance between resolving fine-level features and improving forecast skill.

To provide a systematic evaluation, we calculate two deterministic metrics, RMSE and Pearson correlation, from the 4-yr testing period and across the entire study domain (48,108 grid points), shown in Fig. 5. RMSE skill scores are calculated using the climatological CSGD as the reference forecast.

RMSE assigns disproportionate weights to samples with different errors, and it penalizes larger errors. In Fig. 5a, all postprocessing methods show improvements over WRF but their performance is close to each other. However, Unet remains at the top of the diagram, having the best predictive skill. In terms of correlation (Fig. 5b), results are similar to RMSE in that Unet consistently outperforms all benchmarks across all forecast lead times, having the highest correlation. Although Unet and CSGD have the same prescribed distribution, Unet has the additional DL architecture that can encode spatial features and improve its prediction accuracy.

b. CRPS

The CRPS is used to evaluate the quality of probabilistic forecasts. Since West-WRF is a deterministic system, the MAE is calculated as opposed to the CRPS. The CRPS skill score is calculated using the climatological CSGD as the reference forecast.

FIG. 6. CRPS skill score calculated using climatological CSGD as the reference forecast. Vertical dashes indicate a 95% confidence interval from bootstrapping.

In Fig. 6, all postprocessing methods are shown to yield significant improvement over WRF. CSGD and MMGD have similar performance, but AnEn and Unet are shown to outperform both. AnEn and Unet are the two data-driven methods. Results suggest they can better exploit the huge dataset size than CSGD and MMGD. Unet is the best-performing method, having the highest CRPS skill scores. This is probably because AnEn can be limited by its similarity metric when locating weather analogs. It relies on a set of weather analogs to derive the forecast distribution. Unet, on the other hand, directly learns the distributional parameters, without the need to form an ensemble.
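The skill scores reported in Figs. 5–7 follow the usual convention of one minus the ratio of the score to that of the reference forecast (here, the climatological CSGD), as in this two-line sketch:

```python
def skill_score(score, reference_score):
    """1 is a perfect forecast, 0 matches the reference (the
    climatological CSGD), and negative values are worse than it."""
    return 1.0 - score / reference_score

crpss = skill_score(score=1.8, reference_score=2.4)  # e.g., CRPS values -> 0.25
```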
Another way to evaluate the performance is by visualizing the spatial variation of the CRPS. Figure 7 shows geographic maps of CRPS skill scores for WRF and the four postprocessing methods in the first five panels. Green indicates the evaluated method outperforms and red indicates the climatological


CSGD outperforms. As a benchmark, WRF outperforms the climatological CSGD for most regions in the study domain except for distributed patches to the north of California and around the Northern Basin and Range in Nevada. MMGD and CSGD predictions are shown to be most skillful around the Sierra Nevada and along coastal regions. These two methods, however, are found to yield less improvement to the northeast of the domain. In terms of AnEn and Unet, both methods are shown to perform well across the domain, having higher CRPS skill scores compared to the previous methods. These results suggest that AnEn and Unet predictions are skillful at locations with disparate climatology.

FIG. 7. CRPS skill score maps averaged from all forecast lead times for (a) WRF, (b) MMGD, (c) CSGD, (d) AnEn, and (e) Unet. The skill scores are calculated using climatological CSGD as the reference forecast; (f) the CRPS skill score of Unet against AnEn.

The last panel (Fig. 7f) further compares AnEn and Unet by calculating the CRPS skill score of Unet against AnEn. Blue indicates Unet predictions have better skill and red indicates AnEn predictions have better skill. Results show that Unet has either similar or better skill than AnEn for most parts of the domain. Unet outperforms AnEn in Northern California and the Sierra Nevada, where most of the rain is typically received, but variation exists in the southern Central Valley and Death Valley. These areas are among the driest places in the western United States and AnEn is shown to have better skill there. One difference between Unet and AnEn is that AnEn searches for weather analogs independently at each grid point while only one Unet is trained to predict the entire domain. Aside from the computational advantage of Unet, it is also highly effective at extracting spatial information and learning a skillful relationship between forecasts and observations within a spatial domain.

c. Spread–skill relationship, reliability, and resolution

Having high accuracy is critical but only partial to building a robust probabilistic postprocessing workflow. The overall quality of probabilistic forecasts also depends on ensemble consistency and forecast reliability.

Figure 8 evaluates and compares the binned spread–skill correlation (Van den Dool 1989; Wang and Bishop 2003; Leutbecher and Palmer 2008) for the various postprocessing


FIG. 8. Binned spread–skill correlation of (a) CSGD, (b) MMGD, (c) AnEn, and (d) Unet. Spread is estimated using standard deviation and skill is estimated using RMSE. The diagonal line shows the perfect correlation. Vertical dashes show the 95% confidence interval.

methods, aggregated across lead times and the study domain. Forecast spread is estimated using standard deviations and the standard error is calculated as the RMSE of the distributional mean. The diagonal lines on each panel depict the perfect correlation between the spread and the standard error. A high correlation indicates that the forecast spread is consistent with the expected skill, and it can be a reliable first-order estimate of the flow-dependent error. The spread–skill metric is applicable for evaluating heteroscedastic predictor–predictand relationships with time- or ensemble-varying ensemble spread. The method provides a way to show that errors are larger in relationship to larger spread/uncertainty. However, the method is not a diagnostic for the source of spread, but only a metric to show that spread is a reliable indicator of forecast uncertainty (Hopson 2014).
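A sketch of the binned spread–skill computation: forecasts are grouped into spread bins, and the mean spread per bin is compared with the RMSE per bin. The quantile binning and the toy data are assumptions; points on the diagonal indicate spread consistent with the expected skill.

```python
import numpy as np

def binned_spread_skill(spread, error, n_bins=10):
    """Bin forecasts by ensemble spread and compare, per bin, the mean
    spread with the RMSE of the distributional mean."""
    edges = np.quantile(spread, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(spread, edges[1:-1])      # bin index 0..n_bins-1
    mean_spread = np.array([spread[idx == b].mean() for b in range(n_bins)])
    rmse = np.array([np.sqrt((error[idx == b] ** 2).mean())
                     for b in range(n_bins)])
    return mean_spread, rmse

rng = np.random.default_rng(1)
sd = rng.gamma(2.0, 2.0, 5000)        # synthetic forecast spread
err = rng.normal(0.0, sd)             # errors scaled by the spread
x, y = binned_spread_skill(sd, err)   # x close to y for reliable spread
```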

FIG. 9. (top) Brier score and its decomposition into (middle) reliability and (bottom) resolution for three thresholds: (left) 1 mm, (center) 95%, and (right) 99% of the location-specific climatological distribution. Percentile values are calculated for each grid point using the associated climatology. Vertical dashes indicate a 95% confidence interval. Note the ranges of the vertical axes vary across panels.


FIG. 10. Brier skill scores of Unet averaged from all forecast lead times with three thresholds: (top) 1 mm, (middle) 95%, and (bottom) 99% of the location-specific climatological distribution, against (a),(d),(g) MMGD; (b),(e),(h) CSGD; and (c),(f),(i) AnEn. Panels (j)–(l) are generated with PRISM: a map of PoP using 1 mm as the threshold is shown in (j), whereas (k) and (l) show the precipitation maps for the 95th and 99th percentiles, respectively.

To begin with, all four methods show a high level of consistency between forecast skill and spread. This suggests that forecast uncertainty can be effectively estimated with statistical methods. MMGD, overall, has the best spread–skill correlation for forecasts with smaller spread (<10 mm), most likely due to having two separate distributions for dry and wet conditions, respectively. However, it becomes overdispersive for forecasts with a large spread (>10 mm). CSGD, AnEn, and Unet are all shown to have underdispersive forecast ensembles overall. The difference lies in forecasts with a large spread (>10 mm). AnEn and Unet have a high spread–skill correlation, most likely due to their ability to incorporate more predictor variables than CSGD and MMGD.

Unet was previously found to produce underdispersive ensembles when applied to IVT forecasts [Fig. 5 in Chapman et al. (2022)] and also in this work when not using the regularization term on forecast uncertainty (not shown). When adding the regularization as in Eq. (5), the final uncertainty estimation depends on both output parameters from the Unet, σ and μ. This improvement in ensemble consistency suggests that the regularization of the learned uncertainty is necessary and effective for training a reliable model.

Figure 9 shows the Brier score (first row) and its decomposition into reliability (second row) and resolution (third row). The Brier score is an error metric measuring the average gaps between forecasted probabilities and actual outcomes. A lower Brier score is better. The Brier score can be further decomposed into reliability, resolution, and uncertainty (Murphy 1973). The uncertainty measures the inherent variability in the outcomes of the event, and it is not conditioned on forecasts. Resolution quantifies the ability of the method to discriminate event probabilities that are different from climatology. Therefore, a high resolution score is preferred. The reliability score is used to measure the calibration mismatch between forecasted probabilities and observed frequencies, and therefore a lower reliability score is better. When calculating Brier scores for a continuous variable, a predefined threshold is needed to binarize ground truths and to convert the CDF to a probability. Therefore, three thresholds are used: 1 mm for evaluating how well the different methods predict the PoP, and two quantiles (95% and 99%) for evaluating skill for extreme events. In terms of the Brier score (Figs. 9a–c), Unet is shown to outperform all other methods with all three thresholds. This result suggests that Unet generates the most skillful predictions for a wide range of rain events.
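A sketch of the Murphy (1973) decomposition used in Fig. 9, with probabilities grouped into equally spaced bins (a binning choice assumed here, not stated in the paper):

```python
import numpy as np

def brier_decomposition(p, o, n_bins=10):
    """Murphy (1973): BS = reliability - resolution + uncertainty.

    p : forecast probabilities in [0, 1]
    o : binary outcomes (e.g., precipitation above the chosen threshold)
    """
    obar = o.mean()
    uncertainty = obar * (1.0 - obar)
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            w = m.mean()
            reliability += w * (p[m].mean() - o[m].mean()) ** 2
            resolution += w * (o[m].mean() - obar) ** 2
    return reliability, resolution, uncertainty

rng = np.random.default_rng(2)
p = rng.random(10000)
o = (rng.random(10000) < p).astype(float)    # perfectly calibrated toy case
rel, res, unc = brier_decomposition(p, o)
bs = rel - res + unc    # matches np.mean((p - o) ** 2) up to binning error
```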
In terms of reliability (Figs. 9d–f; lower is better), MMGD has the best calibration for forecasting the PoP (Fig. 9d). This is because MMGD separates dry and wet conditions and estimates parameters for each condition. However, it becomes


less reliable for more extreme events (Figs. 9e,f). Both CSGD and Unet assume a censored, shifted gamma distribution and attempt to model no-rain events by applying shifts to distributions. The distributional assumption helps to predict extreme events, as shown in Figs. 9e and 9f. On the other hand, AnEn and Unet are comparable in forecasting extreme events but Unet slightly outperforms AnEn during light rain events. It is worthwhile to mention the oscillating behavior of the Unet as a function of lead time in Figs. 9e and 9f. Although other methods show similar trends, the Unet is the most exaggerated. This might be related to the observational dataset used in this study but we are currently unsure of the exact cause.

In terms of resolution (Figs. 9g–i; higher is better), Unet again outperforms the other methods with all three thresholds, meaning that Unet typically has a better predictive skill for events that are different from climatology. The improvement in resolution outweighs the underperformance in reliability. As a result, the overall Brier score is improved (lower value) by Unet. The outperformance of Unet suggests that the CRPS, being a proper scoring rule, encourages optimization for both accuracy and reliability. With a large parameterization (over 1 million model parameters), Unet has the flexibility and capability to distinguish a wide array of precipitation events and produce accurate and reliable probabilistic forecasts.

Figure 10 visualizes the spatial variation of Brier skill scores of Unet for the same three thresholds (1 mm, 95%, and 99%) against MMGD, CSGD, and AnEn. Panels in the last column are generated using PRISM. Figure 10j shows the map of PoP using 1 mm as the threshold. Figures 10k and 10l show the precipitation maps for the 95th and 99th percentiles. In terms of PoP (Figs. 10a–c), Unet outperforms the benchmark methods across the domain, which suggests that Unet predictions have better skill at detecting rain events. The largest outperformance is achieved when compared to CSGD (Fig. 10b), which suggests that the Unet architecture is an effective component for ingesting more predictor variables and detecting rain-related spatial patterns with convolutions.

In terms of extreme events (Figs. 10d–i), Unet largely outperforms the benchmarks when using a 95th percentile threshold. Unet predictions have better skill in the Sierra Nevada and the Central Basin and Range in southern Nevada. However, the performance of Unet tends to vary with regions when using a 99th percentile threshold. MMGD, CSGD, and AnEn are found to perform better in the Mojave Desert and the Nevada Basin and Range while Unet performs better in the rest of the domain. Since only one model is trained with Unet while all other benchmark methods are applied grid by grid, this could be a potential limit of Unet when applied to a large spatial domain and when evaluated with a high threshold. However, on the other hand, Unet typically outperforms other methods in the high-impact regions where precipitation is copious, and its higher biases are found in relatively drier regions.

d. Length of training

Unet is a data-intensive method due to its large parameterization. This section aims to present results on the sensitivity of model performance (predictive skill) to the length of training data and to answer the question of how much data are needed to train a skillful model.

FIG. 11. CRPS for Unet and AnEn, and MAE for WRF as a function of the number of years in the training data. Unet and AnEn are tested for 2016, with training from previous years. Verification results are aggregated using values for all lead times and locations.

Figure 11 shows the CRPS of AnEn and Unet as functions of the number of years in the training data. We only compare these two methods because they are among the best-performing methods in this work. The selected water year for testing is 2016 because it is one of the wettest years throughout the dataset. It is also the most recent year in the test data, so the historical record is the longest. Training years are included retrospectively; for example, two training years means the two years prior to 2016 are used during training (searching in the case of AnEn). The MAE is shown for WRF. Both methods show improvement over WRF with only two years of training data. However, with the limited set of training data, Unet is likely to overfit, and hence, it only produces a slight improvement, while AnEn is a much more preferred method. This finding is consistent with prior findings (Delle Monache et al. 2013; Eckel and Delle Monache 2016; Hu and Cervone 2019) where using a few years of search can already yield satisfactory results.

Since Unet possesses a large number of model parameters, it underperforms AnEn when only a few years of training data are used. But the performance of Unet improves as the number of years increases. After 12 years, the performance of AnEn converges and reaches a plateau. This suggests that AnEn can no longer benefit from having an even larger historical repository. This could be traced to the limitation of the similarity metric (Hu et al. 2023): even though more similar forecasts can be found, they might not be better weather analogs that benefit the final prediction accuracy. In contrast, Unet maintains its momentum at increasing its predictive skill and the CRPS keeps decreasing when more training data are used. This demonstrates the capability of Unet to ingest a large amount of data and still be able to identify patterns that lead to more accurate forecasts. Experiments stopped at 30 years, when the maximum number of years is reached.


FIG. 12. Brier scores for AnEn and Unet as a function of the number of years in the training data. The thresholds are (a) 95% and (b) 99%. Unet and AnEn are tested for 2016, with training from previous years. Verification results are aggregated using values for all lead times and locations.

Similarly, Fig. 12 shows the Brier scores of Unet and AnEn when an increasing number of training years are used. The two thresholds are consistent with the previous analysis, 95% and 99%. In this comparison, AnEn starts as the prevailing method again when two years of training data are used. Its performance also stagnates after reaching 12 training years. Unet, on the other hand, overtakes AnEn at an earlier time point when four training years are used, as opposed to eight years in the previous evaluation. This suggests that Unet is optimized for predicting high-impact precipitation events and, with proper training, it is capable of capturing patterns that are more pertinent to high-intensity precipitation, even when a limited training dataset is present. Finally, considering the large model, Unet can benefit from having more training data and it is preferable to observe that its performance keeps improving when AnEn has reached its potential.

e. Model sensitivity to predictors

Unet is a nonlinear model with weights learned in a data-driven fashion. Although it is difficult to pinpoint which mechanism has been identified and learned to help postprocess precipitation forecasts, we are still able to visualize the relative sensitivity of model forecasts with respect to small changes in model input, via the integrated gradient (IG).

The IG (Sundararajan et al. 2017; Mudrakarta et al. 2018; Sayres et al. 2019; McCloskey et al. 2019; Sundararajan et al. 2019) is a gradient-based attribution method that quantifies the contribution of each input predictor by calculating the product of gradients and input values, similar to a linear system, where the contribution of each predictor is determined as the multiplication of the predictor value and its coefficient. Given a model input and a reference input, IG integrates the gradient and quantifies how a change in the input, relative to the reference, would affect the model output. A reference input can be a Gaussian-blurred version of the model input (background signal) which provides little to no detailed information for the Unet to make a useful prediction. IG then calculates gradients to measure the relationship between changes to an input feature and changes in the model output. It is a nonintrusive method, meaning that it can be directly applied to any trained DL model without modifying the architecture.
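A minimal Riemann-sum sketch of integrated gradients for a trained network follows. The tiny stand-in model, the zero baseline, and the scalar target (the summed output map) are simplifying assumptions; the paper attributes μ and σ separately and uses a Gaussian-blurred baseline.

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    """Approximate IG: average the output gradient along the straight
    path from the baseline to the input, then scale by (input - baseline)."""
    total = torch.zeros_like(x)
    for s in range(1, steps + 1):
        xi = (baseline + s / steps * (x - baseline)).requires_grad_(True)
        model(xi).sum().backward()   # scalar target: summed output map
        total += xi.grad
    return (x - baseline) * total / steps

model = torch.nn.Conv2d(13, 3, 3, padding=1)    # stand-in for the trained Unet
x = torch.randn(1, 13, 64, 64)
ig = integrated_gradients(model, x, torch.zeros_like(x))
per_channel = ig.abs().sum(dim=(0, 2, 3))       # aggregate per predictor
sensitivity = per_channel / per_channel.sum()   # normalize to sum to one
```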
FIG. 13. Forecast sensitivity of (a) μ and (b) σ with respect to changes in model predictors. Feature sensitivity is estimated using an integrated gradient and then normalized to ensure a sum of one for all features.

Figure 13 shows the forecast sensitivity of μ and σ with respect to small changes in predictors. The feature sensitivity is calculated using IG (Mudrakarta et al. 2018; Adebayo et al. 2020) and then normalized to ensure a sum of one for all features. It is worthwhile to point out that feature sensitivity is not a direct measure of feature importance, but it is a good indicator of how each input feature changes the prediction output. For both precipitation intensity and forecast uncertainty, precipitation appears as the most sensitive input feature, as expected due to its high autocorrelation with the predictand. IVT and IWV are identified to be the next two most sensitive features after precipitation. This relationship is consistent with our knowledge that they are among the most important variables to explain variations in precipitation. For example, ARs are long and narrow corridors of enhanced IWV and IVT, primarily driven by a pre-cold-frontal low-level jet stream of an extratropical cyclone (American Meteorological Society 2022). On the West Coast, over the study domain, ARs account for 30%–50% of the annual precipitation (Oakley et al. 2018). Unet successfully identified this nonlinear relation between the IWV/IVT and precipitation as a result of the data-driven model training.

Comparing Figs. 13a and 13b, the sensitivity of forecast uncertainty is more likely to be related to multiple features, as opposed to the predominant effect from precipitation when predicting rain intensity. This suggests that, when estimating forecast uncertainty, Unet tends to identify patterns by examining the interaction between multiple features.

The benefit of having multiple features can be further observed in the supplemental material, Fig. 3. It is worthwhile to note that CSGD and MMGD use precipitation as the predictor, but data-driven methods like AnEn and


FIG. 13. Forecast sensitivity of (a) μ and (b) σ with respect to changes in model predictors. Feature sensitivity is estimated using integrated gradients and then normalized to ensure a sum of one across all features.

The benefit of having multiple features can be further observed in Fig. 3 of the supplemental material. It is worthwhile to note that CSGD and MMGD use precipitation as the predictor, but data-driven methods like AnEn and
Unet are more flexible at ingesting multiple predictors. Having multiple predictors improves the performance of both AnEn and Unet, although the deep architecture itself appears to play the bigger role, as suggested by the smaller difference in performance (shown by the solid red and dashed red lines).
5. Summary and conclusions

In this work, Unet has been applied to postprocessing NWP forecasts and generating high-resolution 0–4-day probabilistic forecasts for precipitation. The Unet learns the distributional parameters associated with a censored, shifted gamma distribution. The objective evaluation shows that the Unet outperforms the benchmark methods at all lead times, with the best performance for extreme events, i.e., the 95th and 99th percentiles of the distribution.

Compared to traditional parametric and nonparametric postprocessing methods, Unet benefits from its DL architecture, which can easily incorporate more predictor variables and extract spatial information. Since these complex spatial patterns are stored as bottleneck features (Fig. 2), Unet benefits from having a large parameterization so that more patterns can be encoded. These patterns are not location dependent, meaning that patterns can be detected across different parts of the domain. In terms of forecast errors and spread, all four methods show high spread–skill correlations, which indicates that the spread generated from postprocessed probabilistic forecasts can be a good first-order estimate of the predictive skill, although AnEn and Unet show better correlations for forecasts with large spread (>10 mm).
Learning a distribution and forecast uncertainty from deterministic models and observations is a challenging task because we only have one realization of the model and one reality that we can observe. One difficulty when training the Unet using CRPS is that the network parameters are not trained with physical constraints. As a result, model training can lead to unreasonable estimates and cause numerical instability in the computation of CRPS. Ideally, it would be useful to have a "true" forecast probability distribution (Anderson 1996) obtained with perfect initialization and a perfect model. In reality, observational errors are inevitable, and the verifying observations are merely samples from the "true" forecast probability distribution. In this study, we used an additional regularization term to constrain the optimization of the standard deviation. This regularization has been found effective for avoiding underdispersive distributions and improving model stability.
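The exact form of this regularization term is not reproduced here; as an illustration only, the sketch below combines a closed-form CRPS (written for a Gaussian for brevity, whereas this work uses a censored, shifted gamma distribution) with a hinge penalty that activates when the predicted standard deviation becomes too small:

```python
import math
import tensorflow as tf

def crps_gaussian(mu, sigma, y):
    # Closed-form CRPS of a Gaussian forecast (Gneiting et al. 2005);
    # a stand-in here for the CSGD-based score used in this work.
    z = (y - mu) / sigma
    pdf = tf.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + tf.math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def regularized_crps_loss(mu, sigma, y, sigma_min=0.1, lam=0.1):
    # Hinge penalty: active only when sigma drops below sigma_min,
    # discouraging degenerate, underdispersive distributions.
    penalty = tf.nn.relu(sigma_min - sigma)
    return tf.reduce_mean(crps_gaussian(mu, sigma, y) + lam * penalty)
```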
In this work, we also tested the performance sensitivity to the volume of training data. Results show that traditional statistical postprocessing methods usually perform better with limited data. However, as more data become available (roughly more than eight years of training data in our case), ML starts to outperform the traditional methods. This result demonstrates the ability of ML to learn the highly nonlinear relationship between forecasts and observations when presented with enough training data.

The proposed framework can provide robust and reliable input for downstream applications such as hydrological forecasting and water resource management. However, one limitation is that spatially and temporally consistent members are usually needed for hydrological simulations, whereas Unet only generates the distributional parameters. Future research could focus on producing spatially and temporally consistent ensemble members from the prescribed distributions. This work also builds on top of a deterministic forecast model, West-WRF. Future work can investigate how Unet can be applied to the calibration of ensemble model output. Alternatively, future research could study how other encoder–decoder architectures, e.g., the generative adversarial network (Goodfellow et al. 2014), could be used to preserve detailed spatial information and generate realistic precipitation patterns. On a different note, this project primarily focuses on precipitation forecasting, but since the Unet is capable of learning spatial dependencies, it would be interesting to test its multivariate performance. For example, future research could apply copula-based techniques such as ensemble copula coupling (Schefzik et al. 2013; Schefzik 2017) to CSGD and compare the result with the Unet; a brief sketch of the reordering idea follows.
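As an illustration of the reordering idea behind ensemble copula coupling, the sketch below draws equidistant quantiles from a calibrated distribution (a gamma distribution stands in for the CSGD) and rearranges them according to the rank order of a raw ensemble template:

```python
import numpy as np
from scipy import stats

def ecc_reorder(dist, template):
    # Sample n equidistant quantiles from the calibrated distribution, then
    # impose the rank structure of the raw ensemble template (ECC-Q idea).
    n = template.size
    quantiles = dist.ppf((np.arange(1, n + 1) - 0.5) / n)
    ranks = stats.rankdata(template, method="ordinal") - 1
    return np.sort(quantiles)[ranks]

template = np.array([2.1, 0.3, 5.6, 1.2])   # raw ensemble at one grid point
calibrated = ecc_reorder(stats.gamma(a=2.0, scale=1.5), template)
print(calibrated)  # calibrated members inherit the template's rank order
```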
Acknowledgments. This work is supported by the California Department of Water Resources Ph3 AR research program (Award 4600014294) and the Forecast Informed Reservoir Operations Award (USACE W912HZ1920023).


TABLE A1. Unet parameter details and configurations. Padding and cropping layer parameters include [[top_pad, bottom_pad],
[left_pad, right_pad]]. Convolutional layer parameters include [height, width, features], stride, padding. MaxPooling layer parameters
include size, stride, padding. The X represents the sample size. Output shapes include [samples, height, width, features].

Layer Parameters Norm Activation Output shape


Input - - - [X, 228, 211, 13]
Pad_0 [[0, 0], [0, 1]] - - [X, 228, 212, 13]
Downsample_0_Conv_0 [3, 3, 32], 1, 1 BatchNorm LeakyReLU [X, 228, 212, 32]
Downsample_0_Conv_1 [3, 3, 32], 1, 1 BatchNorm LeakyReLU [X, 228, 212, 32]
MaxPooling_0 2, 2, 1 - - [X, 114, 106, 32]
Downsample_1_Conv_0 [3, 3, 64], 1, 1 BatchNorm LeakyReLU [X, 114, 106, 64]
Downsample_1_Conv_1 [3, 3, 64], 1, 1 BatchNorm LeakyReLU [X, 114, 106, 64]
MaxPooling_1 2, 2, 1 - - [X, 57, 53, 64]
Pad_1 [[0, 1], [0, 1]] - - [X, 58, 54, 64]
Downsample_2_Conv_0 [3, 3, 128], 1, 1 BatchNorm LeakyReLU [X, 58, 54, 128]
Downsample_2_Conv_1 [3, 3, 128], 1, 1 BatchNorm LeakyReLU [X, 58, 54, 128]
MaxPooling_2 2, 2, 1 - - [X, 29, 27, 128]
Bottleneck_Conv_0 [3, 3, 256], 1, 1 BatchNorm LeakyReLU [X, 29, 27, 256]
Bottleneck_Conv_1 [3, 3, 256], 1, 1 BatchNorm LeakyReLU [X, 29, 27, 256]
Bottleneck_Conv_2 [3, 3, 256], 1, 1 BatchNorm LeakyReLU [X, 29, 27, 256]
Bottleneck_Conv_3 [3, 3, 256], 1, 1 BatchNorm LeakyReLU [X, 29, 27, 256]
DeConv_0 [2, 2, 128], 2, 0 - Linear [X, 58, 54, 128]
Concatenate_0 DeConv_0, Downsample_2_Conv_1 - - [X, 58, 54, 256]
Upsample_2_Conv_0 [3, 3, 128], 1, 1 BatchNorm LeakyReLU [X, 58, 54, 128]
Upsample_2_Conv_1 [3, 3, 128], 1, 1 BatchNorm LeakyReLU [X, 58, 54, 128]
Crop_0 [[0, 1], [0, 1]] - - [X, 57, 53, 128]
DeConv_1 [2, 2, 64], 2, 0 - Linear [X, 114, 106, 64]
Concatenate_1 DeConv_1, Downsample_1_Conv_1 - - [X, 114, 106, 128]
Upsample_1_Conv_0 [3, 3, 64], 1, 1 BatchNorm LeakyReLU [X, 114, 106, 64]
Upsample_1_Conv_1 [3, 3, 64], 1, 1 BatchNorm LeakyReLU [X, 114, 106, 64]
DeConv_2 [2, 2, 32], 2, 0 - Linear [X, 228, 212, 32]
Concatenate_2 DeConv_2, Downsample_0_Conv_1 - - [X, 228, 212, 64]
Upsample_0_Conv_0 [3, 3, 32], 1, 1 BatchNorm LeakyReLU [X, 228, 212, 32]
Upsample_0_Conv_1 [3, 3, 32], 1, 1 BatchNorm LeakyReLU [X, 228, 212, 32]
Crop_1 [[0, 0], [0, 1]] - - [X, 228, 211, 32]
Upsample_0_Conv_2 [3, 3, 3], 1, 1 - Linear [X, 228, 211, 3]

Data availability statement. The West-WRF data are available from the forecast research landing page of the Center for Western Weather and Water Extremes at https://cw3e.ucsd.edu/west-wrf/. The PRISM data are available from Oregon State University at https://www.prism.oregonstate.edu/.

APPENDIX A

Unet Model Parameters and Configurations

Table A1 lists the model parameters and configuration for each layer in Unet. The layers are connected in sequence from the top of the table to the bottom. For example, the input layer is connected to Pad_0. Then the output of Pad_0 is fed into Downsample_0_Conv_0, and so forth.

Skip connections are represented by the layers Concatenate_0, Concatenate_1, and Concatenate_2. These layers aggregate the output of previous layers without running additional convolutional operations. They carry out concatenation of the features. As a result, skip connections are not trainable and they do not introduce additional parameters into the model. A minimal sketch of one such stage is given below.
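The following sketch shows one downsampling stage and its skip connection, mirroring the layer types in Table A1 (Conv 3x3 + BatchNorm + LeakyReLU, MaxPooling, a transposed convolution, and Concatenate); it assumes Keras and is an illustration, not the authors' training code:

```python
from tensorflow.keras import layers

def conv_block(x, features):
    # Two Conv 3x3 + BatchNorm + LeakyReLU layers, as in Table A1.
    for _ in range(2):
        x = layers.Conv2D(features, 3, strides=1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
    return x

inputs = layers.Input(shape=(228, 212, 13))        # shape after Pad_0
down0 = conv_block(inputs, 32)                     # Downsample_0_Conv_0/1
pooled = layers.MaxPooling2D(pool_size=2)(down0)   # MaxPooling_0
# ... bottleneck and intermediate upsampling stages omitted ...
up = layers.Conv2DTranspose(32, 2, strides=2)(pooled)  # DeConv-style layer
skip = layers.Concatenate()([up, down0])           # no trainable parameters
```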
The layers DeConv_0, DeConv_1, and DeConv_2 are deconvolutional layers, also referred to as transposed convolutions or fractionally strided convolutions. In essence, kernels define convolutions, and convolutions can be represented by matrix multiplication. Convolutions compute the forward pass using the kernels; deconvolutions instead transpose the matrix and carry out the backward computation of the kernels. Details of deconvolutions can be found in sections 4.1 and 4.2 of Dumoulin and Visin (2016).
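The matrix view can be demonstrated in a few lines of NumPy; the 1D sizes below are arbitrary and serve only to show that the transposed matrix maps the convolution output back to the input shape:

```python
import numpy as np

kernel = np.array([1.0, 2.0, 1.0])   # 1D kernel of length 3
n_in, n_out = 5, 3                   # valid convolution: 5 -> 3

# Build the convolution matrix C; each row holds a shifted copy of the kernel.
C = np.zeros((n_out, n_in))
for i in range(n_out):
    C[i, i:i + 3] = kernel

x = np.arange(5.0)
y = C @ x                  # forward pass: convolution, shape (3,)
x_up = C.T @ y             # transposed convolution: back to shape (5,)
print(y.shape, x_up.shape)  # (3,) (5,)
```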

APPENDIX B

Probabilistic Metrics

a. CRPS

The continuous ranked probability score (CRPS; Matheson and Winkler 1976) calculates the area between the forecasted CDF and the indicator function defined by the ground truth. Since it is an error metric, a lower CRPS indicates higher accuracy. It is a proper scoring rule (Matheson and Winkler 1976; Wilks 2019). Specifically, CRPS is computed as

$$\mathrm{CRPS}(F, o) = \int_{-\infty}^{\infty} \left[ F(x) - I\{x \ge o\} \right]^{2} dx,$$

where F(x) denotes the forecasted CDF, I denotes the indicator function, and o is the ground truth. Here, I = 0 for

x < o and I = 1 for x ≥ o. For deterministic forecasts, the forecasted CDF can also be represented as an indicator function, and therefore, CRPS collapses to the MAE. When no analytical calculation of CRPS is available, it can be estimated empirically using numerical integration.

We chose CRPS as a proper score because the probabilistic verification of the forecasted distribution is of interest in this work. To ensure a fair comparison of different types of forecasts, we computed the integral of Brier scores for all models in verification. For CSGD, MMGD, and Unet, the Brier score is computed from the CDF. For AnEn (an ensemble forecast), since the forecast is not associated with a prescribed distribution, the CRPS is computed via the empirical distribution.
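As an illustration of the empirical estimate, the following NumPy sketch integrates the squared difference between an ensemble's empirical CDF and the observation's step function; the grid bounds and resolution are arbitrary choices:

```python
import numpy as np

def crps_empirical(members, obs, num=2000):
    # Numerically integrate [F(x) - 1{x >= obs}]^2 over a finite grid,
    # with F the empirical CDF of the ensemble members.
    members = np.sort(np.asarray(members, dtype=float))
    lo = min(members.min(), obs) - 1.0
    hi = max(members.max(), obs) + 1.0
    x = np.linspace(lo, hi, num)
    cdf = np.searchsorted(members, x, side="right") / members.size
    heaviside = (x >= obs).astype(float)
    return np.trapz((cdf - heaviside) ** 2, x)

print(crps_empirical([1.2, 3.4, 2.8, 0.5], obs=2.0))
```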
b. Brier score

The Brier score (Brier 1950) is also a proper scoring rule that measures the accuracy of probabilistic forecasts. It is an error metric, meaning that the lower the value, the higher the accuracy. Unlike CRPS, the Brier score requires a predefined threshold when applied to continuous variables. Observations are first binarized, and the forecasted probability needs to be calculated from the forecasted distributions. Then, the Brier score is calculated as

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (f_i - I_i)^2,$$

where f and I are the forecasted and observed probabilities and N is the number of samples.

The Brier score can be decomposed into three components (Murphy 1973): reliability, resolution, and uncertainty. Specifically,

$$\mathrm{BS} = \frac{1}{N} \sum_{k=1}^{K} n_k (f_k - \bar{o}_k)^2 - \frac{1}{N} \sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2 + \bar{o}(1 - \bar{o}),$$

where K is the number of forecast probability categories (bins), $n_k$ is the number of forecasts within each category, $\bar{o}_k$ is the observed frequency given forecasts $f_k$, and $\bar{o} = \sum_{i=1}^{N} o_i / N$.

The first term quantifies reliability: it shows how close the forecasted probability is to the observed frequency, so the lower the value, the better the reliability. The second term quantifies resolution: it shows how much the conditional probability given the different forecasts differs from climatology. In other words, the resolution is the predictive skill of a forecast system for predicting events that are different from climatology, so the higher the value, the better the resolution. The third term quantifies the inherent uncertainty in the outcomes of the event.
the event.
analogues for reordering postprocessed precipitation ensem-
c. Binned spread–skill correlation bles in hydrological forecasting. Water Resour. Res., 53,
10 085–10 107, https://doi.org/10.1002/2017WR021245.
Binned spread–skill correlation (Murphy 1988) measures the Bremnes, J. B., 2004: Probabilistic forecasts of precipitation in
consistency between the spread of an ensemble system and its terms of quantiles using NWP model output. Mon. Wea. Rev.,
predictive skill. The forecast spread is usually calculated as the 132, 338–347, https://doi.org/10.1175/1520-0493(2004)132,0338:
standard deviation and the skill is usually calculated as RMSE of PFOPIT.2.0.CO;2.


Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3, https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
Brown, J. D., L. Wu, M. He, S. Regonda, H. Lee, and D.-J. Seo, 2014: Verification of temperature, precipitation, and streamflow forecasts from the NOAA/NWS Hydrologic Ensemble Forecast Service (HEFS): 1. Experimental design and forcing verification. J. Hydrol., 519, 2869–2889, https://doi.org/10.1016/j.jhydrol.2014.05.028.
Buban, M. S., T. R. Lee, and C. B. Baker, 2020: A comparison of the U.S. climate reference network precipitation data to the Parameter-Elevation Regressions on Independent Slopes Model (PRISM). J. Hydrometeor., 21, 2391–2400, https://doi.org/10.1175/JHM-D-19-0232.1.
Chapman, W. E., A. C. Subramanian, L. Delle Monache, S. P. Xie, and F. M. Ralph, 2019: Improving atmospheric river forecasts with machine learning. Geophys. Res. Lett., 46, 10627–10635, https://doi.org/10.1029/2019GL083662.
——, ——, S.-P. Xie, M. D. Sierks, F. M. Ralph, and Y. Kamae, 2021: Monthly modulations of ENSO teleconnections: Implications for potential predictability in North America. J. Climate, 34, 5899–5921, https://doi.org/10.1175/JCLI-D-20-0391.1.
——, L. Delle Monache, S. Alessandrini, A. C. Subramanian, F. M. Ralph, S.-P. Xie, S. Lerch, and N. Hayatbini, 2022: Probabilistic predictions from deterministic atmospheric river forecasts with deep learning. Mon. Wea. Rev., 150, 215–234, https://doi.org/10.1175/MWR-D-21-0106.1.
Corringham, T. W., F. M. Ralph, A. Gershunov, D. R. Cayan, and C. A. Talbot, 2019: Atmospheric rivers drive flood damages in the western United States. Sci. Adv., 5, eaax4631, https://doi.org/10.1126/sciadv.aax4631.
Daly, C., W. P. Gibson, G. H. Taylor, G. L. Johnson, and P. Pasteris, 2002: A knowledge-based approach to the statistical mapping of climate. Climate Res., 22, 99–113, https://doi.org/10.3354/cr022099.
Delle Monache, L., F. A. Eckel, D. L. Rife, B. Nagarajan, and K. Searight, 2013: Probabilistic weather prediction with an analog ensemble. Mon. Wea. Rev., 141, 3498–3516, https://doi.org/10.1175/MWR-D-12-00281.1.
Demargne, J., and Coauthors, 2014: The science of NOAA's operational hydrologic ensemble forecast service. Bull. Amer. Meteor. Soc., 95, 79–98, https://doi.org/10.1175/BAMS-D-12-00081.1.
Dettinger, M. D., and D. R. Cayan, 2014: Drought and the California Delta–A matter of extremes. San Francisco Estuary Watershed Sci., 12, 7, https://doi.org/10.15447/sfews.2014v12iss2art4.
——, F. M. Ralph, T. Das, P. J. Neiman, and D. R. Cayan, 2011: Atmospheric rivers, floods and the water resources of California. Water, 3, 445–478, https://doi.org/10.3390/w3020445.
Drozdzal, M., E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, 2016: The importance of skip connections in biomedical image segmentation. Deep Learning and Data Labeling for Medical Applications, G. Carneiro et al., Eds., Lecture Notes in Computer Science, Springer International Publishing, 179–187, https://doi.org/10.1007/978-3-319-46976-8_19.
Dumoulin, V., and F. Visin, 2016: A guide to convolution arithmetic for deep learning. arXiv, 1603.07285v2, https://doi.org/10.48550/arXiv.1603.07285.
Eckel, F. A., and L. Delle Monache, 2016: A hybrid NWP–analog ensemble. Mon. Wea. Rev., 144, 897–911, https://doi.org/10.1175/MWR-D-15-0096.1.
Ghazvinian, M., 2021: Improving probabilistic quantitative precipitation forecasting using machine learning and statistical postprocessing methods. Ph.D. thesis, Dept. of Civil Engineering, The University of Texas at Arlington, 194 pp., https://rc.library.uta.edu/uta-ir/handle/10106/30923.
——, D.-J. Seo, and Y. Zhang, 2019: Improving medium-range probabilistic quantitative precipitation forecast for heavy-to-extreme events through the conditional bias-penalized regression. 2019 Fall Meeting, San Francisco, CA, Amer. Geophys. Union, Abstract H33P-2245, https://agu.confex.com/agu/fm19/meetingapp.cgi/Paper/517742.
——, Y. Zhang, and D.-J. Seo, 2020: A nonhomogeneous regression-based statistical postprocessing scheme for generating probabilistic quantitative precipitation forecast. J. Hydrometeor., 21, 2275–2291, https://doi.org/10.1175/JHM-D-20-0019.1.
——, ——, ——, M. He, and N. Fernando, 2021: A novel hybrid artificial neural network–parametric scheme for postprocessing medium-range precipitation forecasts. Adv. Water Resour., 151, 103907, https://doi.org/10.1016/j.advwatres.2021.103907.
——, ——, T. M. Hamill, D.-J. Seo, and N. Fernando, 2022: Improving probabilistic quantitative precipitation forecasts using short training data through artificial neural networks. J. Hydrometeor., 23, 1365–1382, https://doi.org/10.1175/JHM-D-22-0021.1.
Gleick, J., 2008: Chaos: Making a New Science. Penguin, 384 pp.
Gneiting, T., A. E. Raftery, A. H. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Wea. Rev., 133, 1098–1118, https://doi.org/10.1175/MWR2904.1.
Goodfellow, I., Y. Bengio, and A. Courville, 2016: Deep Learning. MIT Press, 800 pp.
——, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, 2014: Generative adversarial nets. Proc. 27th Int. Conf. on Neural Information Processing Systems, Z. Ghahramani et al., Eds., MIT Press, Vol. 2, 2672–2680.
Hamill, T. M., and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229, https://doi.org/10.1175/MWR3237.1.
——, and M. Scheuerer, 2018: Probabilistic precipitation forecast postprocessing using quantile mapping and rank-weighted best-member dressing. Mon. Wea. Rev., 146, 4079–4098, https://doi.org/10.1175/MWR-D-18-0147.1.
——, ——, and G. T. Bates, 2015: Analog probabilistic precipitation forecasts using GEFS reforecasts and climatology-calibrated precipitation analyses. Mon. Wea. Rev., 143, 3300–3309, https://doi.org/10.1175/MWR-D-15-0004.1.
Han, L., H. Liang, H. Chen, W. Zhang, and Y. Ge, 2022: Convective precipitation nowcasting using U-net model. IEEE Trans. Geosci. Remote Sens., 60, 1–8, https://doi.org/10.1109/TGRS.2021.3100847.
Herr, H. D., and R. Krzysztofowicz, 2005: Generic probability distribution of rainfall in space: The bivariate model. J. Hydrol., 306, 234–263, https://doi.org/10.1016/j.jhydrol.2004.09.011.
Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570, https://doi.org/10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.
Hopson, T., 2014: Assessing the ensemble spread–error relationship. Mon. Wea. Rev., 142, 1125–1142, https://doi.org/10.1175/MWR-D-12-00111.1.


Hu, W., and G. Cervone, 2019: Dynamically optimized unstructured grid (DOUG) for analog ensemble of numerical weather predictions using evolutionary algorithms. Comput. Geosci., 133, 104299, https://doi.org/10.1016/j.cageo.2019.07.003.
——, D. Del Vento, and S. Su, Eds., 2020: Proceedings of the 2020 improving scientific software conference. UCAR/NCAR Tech. Rep. NCAR/TN-564+PROC, https://doi.org/10.5065/P2JJ-9878.
——, G. Cervone, G. Young, and L. Delle Monache, 2023: Machine learning weather analogs for near-surface variables. Bound.-Layer Meteor., 186, 711–735, https://doi.org/10.1007/s10546-022-00779-6.
Iacono, M. J., J. S. Delamere, E. J. Mlawer, M. W. Shephard, S. A. Clough, and W. D. Collins, 2008: Radiative forcing by long-lived greenhouse gases: Calculations with the AER radiative transfer models. J. Geophys. Res., 113, D13103, https://doi.org/10.1029/2008JD009944.
Ishida, K., M. L. Kavvas, and S. Jang, 2015: Comparison of performance on watershed-scale precipitation between WRF and MM5. World Environmental and Water Resources Congress 2015, Austin, TX, American Society of Civil Engineers, 989–993, https://doi.org/10.1061/9780784479162.095.
Jasperse, J., and Coauthors, 2020: Lake Mendocino forecast informed reservoir operations final viability assessment. Scripps Institution of Oceanography, 142 pp., https://escholarship.org/uc/item/3b63q04n.
Kelly, K. S., and R. Krzysztofowicz, 1997: A bivariate meta-Gaussian density for use in hydrology. Stochastic Hydrol. Hydraul., 11, 17–31, https://doi.org/10.1007/BF02428423.
Kim, S., and Coauthors, 2018: Assessing the skill of medium-range ensemble precipitation and streamflow forecasts from the Hydrologic Ensemble Forecast Service (HEFS) for the upper Trinity River basin in North Texas. J. Hydrometeor., 19, 1467–1483, https://doi.org/10.1175/JHM-D-18-0027.1.
Lei, H., H. Zhao, and T. Ao, 2022: A two-step merging strategy for incorporating multi-source precipitation products and gauge observations using machine learning classification and regression over China. Hydrol. Earth Syst. Sci., 26, 2969–2995, https://doi.org/10.5194/hess-26-2969-2022.
Leutbecher, M., and T. N. Palmer, 2008: Ensemble forecasting. J. Comput. Phys., 227, 3515–3539, https://doi.org/10.1016/j.jcp.2007.02.014.
Lewis, W. R., W. J. Steenburgh, T. I. Alcott, and J. J. Rutz, 2017: GEFS precipitation forecasts and the implications of statistical downscaling over the western United States. Wea. Forecasting, 32, 1007–1028, https://doi.org/10.1175/WAF-D-16-0179.1.
Li, M., T. Zhang, Y. Chen, and A. J. Smola, 2014: Efficient mini-batch training for stochastic optimization. Proc. 20th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, Association for Computing Machinery, 661–670, https://doi.org/10.1145/2623330.2623612.
Li, W., B. Pan, J. Xia, and Q. Duan, 2022: Convolutional neural network-based statistical post-processing of ensemble precipitation forecasts. J. Hydrol., 605, 127301, https://doi.org/10.1016/j.jhydrol.2021.127301.
Long, J., E. Shelhamer, and T. Darrell, 2015: Fully convolutional networks for semantic segmentation. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, CVF, 3431–3440, https://openaccess.thecvf.com/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html.
Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141, https://doi.org/10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
Mao, X., C. Shen, and Y.-B. Yang, 2016: Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. Advances in Neural Information Processing Systems 29 (NIPS 2016), D. Lee et al., Eds., Vol. 29, Curran Associates, Inc., https://proceedings.neurips.cc/paper/2016/hash/0ed9422357395a0d4879191c66f4faa2-Abstract.html.
Martin, A., F. M. Ralph, R. Demirdjian, L. DeHaan, R. Weihs, J. Helly, D. Reynolds, and S. Iacobellis, 2018: Evaluation of atmospheric river predictions by the WRF Model using aircraft and regional Mesonet observations of orographic precipitation and its forcing. J. Hydrometeor., 19, 1097–1113, https://doi.org/10.1175/JHM-D-17-0098.1.
Matheson, J. E., and R. L. Winkler, 1976: Scoring rules for continuous probability distributions. Manage. Sci., 22, 1087–1096, https://doi.org/10.1287/mnsc.22.10.1087.
McCloskey, K., A. Taly, F. Monti, M. P. Brenner, and L. J. Colwell, 2019: Using attribution to decode binding mechanism in neural network models for chemistry. Proc. Natl. Acad. Sci. USA, 116, 11624–11629, https://doi.org/10.1073/pnas.1820657116.
Mudrakarta, P. K., A. Taly, M. Sundararajan, and K. Dhamdhere, 2018: Did the model understand the question? Proc. 56th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), Melbourne, Australia, Association for Computational Linguistics, 1896–1906, https://aclanthology.org/P18-1176.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600, https://doi.org/10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
Murphy, J. M., 1988: The impact of ensemble forecasts on predictability. Quart. J. Roy. Meteor. Soc., 114, 463–493, https://doi.org/10.1002/qj.49711448010.
Oakley, N. S., F. Cannon, E. Boldt, J. Dumas, and F. M. Ralph, 2018: Origins and variability of extreme precipitation in the Santa Ynez River Basin of Southern California. J. Hydrol. Reg. Stud., 19, 164–176, https://doi.org/10.1016/j.ejrh.2018.09.001.
Pan, B., K. Hsu, A. AghaKouchak, and S. Sorooshian, 2019: Improving precipitation estimation using convolutional neural network. Water Resour. Res., 55, 2301–2321, https://doi.org/10.1029/2018WR024090.
Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174, https://doi.org/10.1175/MWR2906.1.
Rasp, S., and S. Lerch, 2018: Neural networks for postprocessing ensemble weather forecasts. Mon. Wea. Rev., 146, 3885–3900, https://doi.org/10.1175/MWR-D-18-0187.1.
Ronneberger, O., P. Fischer, and T. Brox, 2015: U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, N. Navab et al., Eds., Lecture Notes in Computer Science, Vol. 9351, Springer International Publishing, 234–241, https://doi.org/10.1007/978-3-319-24574-4_28.
Sayres, R., and Coauthors, 2019: Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology, 126, 552–564, https://doi.org/10.1016/j.ophtha.2018.11.016.
Schefzik, R., 2017: Ensemble calibration with preserved correlations: Unifying and comparing ensemble copula coupling and member-by-member postprocessing. Quart. J. Roy. Meteor. Soc., 143, 999–1008, https://doi.org/10.1002/qj.2984.
——, T. L. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty quantification in complex simulation models using ensemble copula coupling. Stat. Sci., 28, 616–640, https://doi.org/10.1214/13-STS443.


Scheuerer, M., and T. M. Hamill, 2015: Statistical postprocessing of ensemble precipitation forecasts by fitting censored, shifted gamma distributions. Mon. Wea. Rev., 143, 4578–4596, https://doi.org/10.1175/MWR-D-15-0061.1.
——, ——, B. Whitin, M. He, and A. Henkel, 2017: A method for preferential selection of dates in the Schaake shuffle approach to constructing spatiotemporal forecast fields of temperature and precipitation. Water Resour. Res., 53, 3029–3046, https://doi.org/10.1002/2016WR020133.
——, M. B. Switanek, R. P. Worsnop, and T. M. Hamill, 2020: Using artificial neural networks for generating probabilistic subseasonal precipitation forecasts over California. Mon. Wea. Rev., 148, 3489–3506, https://doi.org/10.1175/MWR-D-20-0096.1.
Sha, Y., D. J. Gagne II, G. West, and R. Stull, 2020a: Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part I: Daily maximum and minimum 2-m temperature. J. Appl. Meteor. Climatol., 59, 2057–2073, https://doi.org/10.1175/JAMC-D-20-0057.1.
——, ——, ——, and ——, 2020b: Deep-learning-based gridded downscaling of surface meteorological variables in complex terrain. Part II: Daily precipitation. J. Appl. Meteor. Climatol., 59, 2075–2092, https://doi.org/10.1175/JAMC-D-20-0058.1.
Smith, L. N., 2017: Cyclical learning rates for training neural networks. 2017 IEEE Winter Conf. on Applications of Computer Vision (WACV), Santa Rosa, CA, Institute of Electrical and Electronics Engineers, 464–472, https://doi.org/10.1109/WACV.2017.58.
Strachan, S., and C. Daly, 2017: Testing the daily PRISM air temperature model on semiarid mountain slopes. J. Geophys. Res. Atmos., 122, 5697–5715, https://doi.org/10.1002/2016JD025920.
Sundararajan, M., A. Taly, and Q. Yan, 2017: Axiomatic attribution for deep networks. Proc. 34th Int. Conf. on Machine Learning, Sydney, Australia, PMLR, Vol. 70, 3319–3328, https://proceedings.mlr.press/v70/sundararajan17a.html.
——, S. Xu, A. Taly, R. Sayres, and A. Najmi, 2019: Exploring principled visualizations for deep network attributions. Joint Proc. ACM IUI 2019 Workshops, Vol. 4, Los Angeles, CA, ACM IUI, 11 pp., https://ceur-ws.org/Vol-2327/IUI19WS-ExSS2019-16.pdf.
Taillardat, M., A.-L. Fougères, P. Naveau, and O. Mestre, 2019: Forest-based and semiparametric methods for the postprocessing of rainfall ensemble forecasting. Wea. Forecasting, 34, 617–634, https://doi.org/10.1175/WAF-D-18-0149.1.
Theis, S. E., A. Hense, and U. Damrath, 2005: Probabilistic precipitation forecasts from a deterministic model: A pragmatic approach. Meteor. Appl., 12, 257–268, https://doi.org/10.1017/S1350482705001763.
Tong, T., G. Li, X. Liu, and Q. Gao, 2017: Image super-resolution using dense skip connections. Proc. IEEE Int. Conf. on Computer Vision (ICCV), Venice, Italy, CVF, 4799–4807, https://openaccess.thecvf.com/content_iccv_2017/html/Tong_Image_Super-Resolution_Using_ICCV_2017_paper.html.
Valdez, E. S., F. Anctil, and M.-H. Ramos, 2022: Choosing between post-processing precipitation forecasts or chaining several uncertainty quantification tools in hydrological forecasting systems. Hydrol. Earth Syst. Sci., 26, 197–220, https://doi.org/10.5194/hess-26-197-2022.
Van den Dool, H., 1989: A new look at weather forecasting through analogues. Mon. Wea. Rev., 117, 2230–2247, https://doi.org/10.1175/1520-0493(1989)117<2230:ANLAWF>2.0.CO;2.
Vannitsem, S., and Coauthors, 2021: Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a Big Data world. Bull. Amer. Meteor. Soc., 102, E681–E699, https://doi.org/10.1175/BAMS-D-19-0308.1.
Wang, Q. J., D. E. Robertson, and F. H. S. Chiew, 2009: A Bayesian joint probability modeling approach for seasonal forecasting of streamflows at multiple sites. Water Resour. Res., 45, W05407, https://doi.org/10.1029/2008WR007355.
Wang, X., and C. H. Bishop, 2003: A comparison of breeding and ensemble transform Kalman filter ensemble forecast schemes. J. Atmos. Sci., 60, 1140–1158, https://doi.org/10.1175/1520-0469(2003)060<1140:ACOBAE>2.0.CO;2.
Wilks, D. S., 2009: Extending logistic regression to provide full-probability-distribution MOS forecasts. Meteor. Appl., 16, 361–368, https://doi.org/10.1002/met.134.
——, 2019: Statistical forecasting. Statistical Methods in the Atmospheric Sciences, D. S. Wilks, Ed., Elsevier, 298–299.
Wu, L., D.-J. Seo, J. Demargne, J. D. Brown, S. Cong, and J. Schaake, 2011: Generation of ensemble precipitation forecast from single-valued quantitative precipitation forecast for hydrologic ensemble prediction. J. Hydrol., 399, 281–298, https://doi.org/10.1016/j.jhydrol.2011.01.013.
——, Y. Zhang, T. Adams, H. Lee, Y. Liu, and J. Schaake, 2018: Comparative evaluation of three Schaake shuffle schemes in postprocessing GEFS precipitation ensemble forecasts. J. Hydrometeor., 19, 575–598, https://doi.org/10.1175/JHM-D-17-0054.1.
Zhang, Y., L. Wu, M. Scheuerer, J. Schaake, and C. Kongoli, 2017: Comparison of probabilistic quantitative precipitation forecasts from two postprocessing mechanisms. J. Hydrometeor., 18, 2873–2891, https://doi.org/10.1175/JHM-D-16-0293.1.
