Verification Against In-Situ Observations For Data-Driven Weather Prediction
Abstract
Data-driven weather prediction models (DDWPs) have made rapid strides in re-
cent years, demonstrating an ability to approximate Numerical Weather Prediction
(NWP) models to a high degree of accuracy. The fast, accurate, and low-cost
forecasts promised by DDWPs make their use in operational forecasting an attrac-
tive proposition; however, there remains work to be done in rigorously evaluating
DDWPs in a true operational setting. Typically trained and evaluated using ERA5
reanalysis data, DDWPs have been tested only in simulation, which cannot repre-
sent the real world with complete accuracy even if it is of very high quality. The
safe use of DDWPs in operational forecasting requires more thorough "real-world"
verification, as well as a careful examination of how DDWPs are currently trained
and evaluated. It is worth asking, for instance: how well do the reanalysis datasets
used for training simulate the real world? With an eye towards climate justice
and the uneven availability of weather data: is the simulation equally good for all
regions of the world, and would DDWPs exacerbate biases present in the training
data? Does good performance in simulation correspond to good performance in
operational settings? In addition to approximating the physics of NWP models,
how can ML be uniquely deployed to provide more accurate weather forecasts?
As a first step towards answering such questions, we present a robust dataset of
in-situ observations derived from the NOAA Meteorological Assimilation Data
Ingest System (MADIS) program to serve as a benchmark to validate DDWPs in
an operational setting. By providing a large corpus of quality-controlled, in-situ
observations, this dataset provides a meaningful real-world task that all NWPs
and DDWPs can be tested against. We hope that this data can be used not only to
rigorously and fairly compare operational weather models but also to spur future
research in new directions.
1 Introduction
Data-driven weather prediction (DDWP) models have made rapid strides in recent years, steadily
improving forecast skill over longer time horizons at a fraction of the cost of Numerical Weather
Prediction (NWP) models. Given the demonstrated improvements in forecast skill, low inference
cost, and the ability to target specific variables of interest, using DDWPs in operational weather
forecasting is an attractive proposition. Weather forecasts play a foundational role in the day-to-day
operations of key sectors like agriculture, energy, and transportation, and in reducing loss of life due
to extreme weather, giving DDWPs the potential for tangible social benefit. This potential impact also
makes it imperative that ML practitioners take a holistic view of DDWPs in the operational
forecasting setting, testing DDWPs under the same constraints faced by NWPs that currently provide
the bulwark of these operational forecasts. Similar examination of ML models is common in the
2
Prior work has examined ERA5 data against observations for specific variables and regions, e.g., Jiao
et al. (2021) evaluate ERA5 for precipitation in China, or to assess its suitability for specific tasks,
e.g., choosing suitable locations for wind farms (Olauson, 2018). To our knowledge, this is the first
publicly available comparison of ERA5 against in-situ observations at such scale, as well as the first
publicly available dataset for verifying NWPs and DDWPs against observations. The dataset is made
available at https://huggingface.co/datasets/excarta/madis2020.
[3] https://madis.ncep.noaa.gov/index.shtml
[4] Not all sources ingested by MADIS are free for public use; we choose Mesonet and METAR sources as they are publicly usable and have large data volumes.
[5] https://madis.ncep.noaa.gov/madis_sfc_qc_notes.shtml
For temperature, dewpoint, and wind speed, we include only observations that have QC flags S or
V, i.e., passing at least level 2 checks. Some observations pass level 2 checks, but are in regions
too data-sparse to support level 3 checks, and may still introduce noisy observations. To remove
such noisy observations, we ignore temperature and dewpoint observations that are more than 20 °C
away from the ERA5 prediction. Precipitation observations are sparser, and we additionally
include observations with QC flag C. We partition the observations into different geographical regions,
extending from the list of regions used in ECMWF scorecards [6], shown graphically in Figure 1:
• global: Longitude: -180° to 180°, Latitude: -90° to 90°
• namer (North America): Longitude: -120° to -75°, Latitude: 25° to 60°
• europe (Europe): Longitude: -12.5° to 42.5°, Latitude: 35° to 75°
• wasia (West Asia): Longitude: 42.5° to 102.5°, Latitude: 0° to 60°
• easia (East Asia): Longitude: 102.5° to 150°, Latitude: 25° to 60°
• seasia (South-East Asia): Longitude: 95° to 130°, Latitude: -12.5° to 25°
• ausnz (Aus. & New Zealand): Longitude: 110° to 180°, Latitude: -48° to -12.5°
• samer (South America): Longitude: -80° to -31°, Latitude: -54° to 10°
• nafr (Northern Africa): Longitude: -15° to 34°, Latitude: 4° to 36°
• safr (Southern Africa): Longitude: -15° to 50°, Latitude: -36° to 4°
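The bounding boxes above can be applied as simple masks. The sketch below (with region names and box edges taken from the list, but the function itself our own illustration) assigns an observation to every region containing it:

```python
# Assign observations to (possibly overlapping) geographic regions using the
# bounding boxes listed above: (lon_min, lon_max, lat_min, lat_max) in degrees.
REGIONS = {
    "namer": (-120, -75, 25, 60),
    "europe": (-12.5, 42.5, 35, 75),
    "wasia": (42.5, 102.5, 0, 60),
    "easia": (102.5, 150, 25, 60),
    "seasia": (95, 130, -12.5, 25),
    "ausnz": (110, 180, -48, -12.5),
    "samer": (-80, -31, -54, 10),
    "nafr": (-15, 34, 4, 36),
    "safr": (-15, 50, -36, 4),
}

def regions_for(lon: float, lat: float) -> list[str]:
    """Return every region whose box contains the point; 'global' always matches."""
    hits = ["global"]
    for name, (lo0, lo1, la0, la1) in REGIONS.items():
        if lo0 <= lon <= lo1 and la0 <= lat <= la1:
            hits.append(name)
    return hits

print(regions_for(10.0, 50.0))  # ['global', 'europe']
```

Because the boxes overlap (e.g., seasia and ausnz share longitudes), an observation can legitimately appear in more than one regional aggregate, as noted below.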
The above regions are not mutually exclusive, and a small number of observations in the "global"
region do not fall under any of the defined regions. Figure 2 shows the distribution of observations
globally and for each region. All regions have at least 1,000,000 observations for temperature, wind
speeds, and dewpoint, but precipitation data is available only for North America. Unsurprisingly,
these data are not uniformly distributed across the world, with North America and Europe having the
greatest density of observations, and the global South having a much lower density of observation
data.
Using criteria [7] set by the National Weather Service in the US (NWS), we can select a subset of
observations for temperature and wind speeds that represent extreme weather:
• Temperature: exceeding 35 °C. While heat advisories are issued for a combination of
temperature and dewpoint, a temperature of 35 °C nearly guarantees a heat advisory, and
hence is a useful threshold for extreme heat events.
• Wind speed: exceeding 13.85 m/s (31 mph), the threshold for wind advisories not associated
with a specific thunderstorm.
Figure 3 shows the observation counts for wind and temperature events that meet the above criteria.
As expected, extreme events are rarer, though a sufficiently large number of observations is still
available for extreme temperatures in all regions, and for extreme wind events in North America and
Europe.

[6] https://sites.ecmwf.int/ifs/scorecards/scorecards-48r1ENS.html
[7] https://www.weather.gov/ctp/wwaCriteria

Figure 2: Total number of observations for different variables in 2020, in log (base 10) scale.

Figure 3: Total number of observations for extreme weather conditions in 2020, in log (base 10)
scale.
with the same analysis used to initialize IFS, and while such data is available from the TIGGE archive,
very long download times made this impracticable. To approximate the lower quality of an operational
start state vs. a reanalysis start state, we construct start states using the ensemble mean data from
ERA5 instead of the HRES data from ERA5. The ensemble mean data in ERA5 is produced at a
lower resolution (63 km) than the HRES data (31 km) [8], thus making it a coarser reanalysis compared
to HRES. In our experiments, we saw a noticeable drop in the quality of predictions made using
ensemble mean start states compared to HRES start states, mimicking the effect of having lower
quality start states in operational settings.
For all sources, we generate 48-hour forecasts at 0.25° resolution starting at 00UTC for each day in
2020, at an hourly frequency for ERA5 and FCN, and 6-hourly frequency for IFS. Since real-world
observations are for specific points, we use bilinear interpolation to extract point forecasts from
the gridded output using the functionality provided by the Xarray software package by Hoyer and
Hamman (2017). The forecast depth for ERA5 is limited to 24 hours as it contains reanalysis data, and
a longer depth would essentially repeat the first 24 hours. The model outputs for each variable are then
compared against both MADIS2020 and the "Extreme weather conditions" subset of MADIS2020,
using RMSE as the error metric.
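The bilinear step can be sketched directly. The grid and point below are synthetic; in practice this is what Xarray's `interp` with `method="linear"` does on the real 0.25° forecast output:

```python
import numpy as np

# Bilinear interpolation of a regular gridded field to a station location,
# mirroring the point-extraction step described above. Synthetic 0.25° grid.
def bilinear(lats, lons, grid, lat, lon):
    """Interpolate grid (ascending lat/lon coords) at one interior point."""
    i = np.searchsorted(lats, lat) - 1   # grid row just below the point
    j = np.searchsorted(lons, lon) - 1   # grid column just left of the point
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    return ((1 - ty) * (1 - tx) * grid[i, j] + (1 - ty) * tx * grid[i, j + 1]
            + ty * (1 - tx) * grid[i + 1, j] + ty * tx * grid[i + 1, j + 1])

lats = np.arange(40.0, 41.01, 0.25)
lons = np.arange(-105.0, -103.99, 0.25)
grid = np.add.outer(lats, lons)  # value = lat + lon, exact under bilinear interp
print(bilinear(lats, lons, grid, 40.1, -104.6))  # ≈ -64.5 (= 40.1 - 104.6)
```

Because the synthetic field is linear in latitude and longitude, bilinear interpolation recovers it exactly, which makes the sketch easy to check.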
Figure 4 shows how different model sources compare against all observations from MADIS2020,
and the nearly identical performance for all models for most variables and regions indicates that
the improvements seen in verification-against-analysis may not proportionately hold when verifying
against observations. For all sources, we can also note both daily and regional variations. The
regional variations are pronounced for all variables, while the daily variation is more pronounced for
temperature and dewpoint. The FCN model’s performance largely matches that of the ERA5 data,
indicating that it has indeed learned a good approximation of the underlying NWP, and that at least
for a 48-hour forecast its real-world performance is close to that of ERA5. Regional variation in
ERA5 is expected: per the ERA5 documentation [9], the uncertainty in ERA5 reanalysis is expected
to be higher the farther back in time we go, and in data-sparse regions like the global South. The
reanalysis is also not equally accurate for all times of day, because ERA5 assimilates observations in
fixed windows of 0900 UTC to 2100 UTC and 2100 UTC to 0900 UTC (+1 day).
The variation in ERA5 quality across regions should be kept in mind when developing DDWPs,
to avoid reinforcing or exacerbating biases present in the training data. For instance, heatwaves induced
by climate change are expected to disproportionately affect the global South, which are also the
regions (West Asia, East Asia, South America, Southern Africa) where ERA5 has the highest errors
for 2-meter temperature and dewpoint. Lam et al. (2022) reported the performance of GraphCast
separately for each region when verifying against reanalysis; the same procedure should be followed
when verifying against observations.
Forecasting precipitation accurately is of great value, but evaluating precipitation forecasts is
difficult given the sparsity of its occurrence. Precipitation is a long-tailed event, as it does not rain
in most of the world, most of the time, with non-trivial amounts of rain being even rarer. While RMSE
and ACC were used as metrics for evaluating the precipitation forecasting models by Pathak et al.
(2022), the comparison in Figure 4 illustrates how RMSE may not be appropriate for precipitation.
The FCN model shows a much lower error than ERA5 and IFS for 6-hourly precipitation, implying
the model forecasts precipitation accurately – even better than the data it is trained on! A naive "no
rain" model that always predicts 0 mm precipitation has even lower error, as we can see in Figure 5.
It is more likely that the FCN model underpredicts precipitation and arrives at a misleadingly low
RMSE. For precipitation, metrics like the Critical Success Index (CSI) or Continuous Ranked Probability
Score (CRPS) may be more appropriate, as used by Espeholt et al. (2022) and Ravuri et al. (2021).
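A threshold-based CSI makes the failure of the "no rain" model visible, in a way RMSE does not. The sketch below uses a common CSI definition and synthetic data (the 0.1 mm threshold is an illustrative choice, not from the paper):

```python
import numpy as np

# Critical Success Index for precipitation exceeding a threshold:
# CSI = hits / (hits + misses + false alarms).
def csi(forecast_mm: np.ndarray, observed_mm: np.ndarray,
        thresh_mm: float = 0.1) -> float:
    f = forecast_mm >= thresh_mm
    o = observed_mm >= thresh_mm
    hits = np.sum(f & o)
    misses = np.sum(~f & o)
    false_alarms = np.sum(f & ~o)
    denom = hits + misses + false_alarms
    return float(hits / denom) if denom else float("nan")

obs = np.array([0.0, 0.5, 2.0, 0.0])   # synthetic 6-hourly accumulations (mm)
no_rain = np.zeros(4)
print(csi(no_rain, obs))  # 0.0: the naive model never scores a hit
```

Unlike RMSE, CSI gives a "no rain" forecast the worst possible score whenever any rain is observed, which is the behavior one wants when underprediction is the failure mode.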
[8] https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#heading-Spatialgrid
[9] https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#heading-Accuracyanduncertainty
Figure 4: RMSE of ERA5, FourCastNet (FCN), and IFS against observations for wind speeds,
temperature, dewpoint, and 6-hourly precipitation accumulation. Each column represents a particular
data source (ERA5, FCN, IFS), and each row represents a variable. The x-axis represents the time of
day in UTC, and the y-axis represents the RMSE in the units appropriate to each variable. Note the
different forecast depths for ERA5 vs. FCN and IFS.
Figure 6 shows how different model sources compare against extreme observations for wind and
temperature, with higher absolute errors for all models, reflecting the extreme values of the variables
being predicted. While daily variation is still present, the local time of day is a stronger influence, as
expected for extreme heat events that peak in the day. ERA5 and FCN generally perform no better
than IFS, confirming the trend seen in "all-weather" verification that DDWPs do not necessarily
perform better than NWPs when verifying against observations, even if they do when verifying
against analysis. In some cases (e.g., 2-meter temperature for South America), ERA5 itself seems to
have higher error than the IFS predictions.

Figure 5: RMSE of 6-hourly precipitation for FourCastNet (FCN) and a naive "no rain" model which
always predicts 0 mm rainfall.

Figure 6: RMSE of ERA5, FourCastNet (FCN), and IFS against extreme observations for wind
speeds and temperature. Each column represents a particular data source (ERA5, FCN, IFS), and each
row represents a variable. The x-axis represents the time of day in UTC, and the y-axis represents the
RMSE in the units appropriate to each variable. Note the different forecast depths for ERA5 vs. FCN
and IFS.
4 Discussion
DDWPs have made significant improvements in a short time, and have so far been evaluated using
reanalysis data as ground truth. As DDWPs move from research to operation, they must be evaluated
more thoroughly under settings that replicate their operational use. While by no means exhaustive,
the use of in-situ observations from MADIS2020 provides one such way to evaluate DDWPs on a
meaningful task, with the benefit that they can be used to fairly compare all DDWPs and NWPs.
Verifying against observations shows the IFS baseline to be extremely competitive against DDWPs
even though DDWPs performed better when verifying against reanalysis, further highlighting the
need for thorough evaluation. The analysis presented earlier also points to more areas where ML can
be applied to improve operational weather forecasting.
The difference between reanalysis start states, which DDWPs are trained on, and operational start
states has a detrimental impact on forecast quality, especially in the short term. This may be addressed
by fine-tuning DDWPs on operational start states once they are sufficiently trained on ERA5 data.
The uncertainty of ERA5 reanalysis varies by time and location and tends to be higher in the global
South, as mentioned in the ERA5 documentation and evidenced through the comparison of ERA5
against observations. The ensemble "spread" available in ERA5 conveys a specific type of uncertainty
estimate, and may be used to improve the training of DDWPs by adjusting how much to weight data
from different locations and times. The uncertainty is expected to also be higher the farther back in
time we go (due to sparser observations), and accounting for uncertainty may bias DDWPs towards
recent data and help reduce any training-inference skew caused by weather patterns changing due to
climate change.
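One speculative form such uncertainty weighting could take is an inverse-variance weighting of the training loss by the ERA5 ensemble spread. Everything in this sketch (the function, its inputs, and the normalization) is an assumption for illustration, not a method from the paper:

```python
import numpy as np

# Hypothetical sketch: down-weight training targets where the ERA5 ensemble
# spread (a per-gridpoint uncertainty proxy) is large.
def spread_weighted_mse(pred: np.ndarray, target: np.ndarray,
                        spread: np.ndarray, eps: float = 1e-6) -> float:
    weights = 1.0 / (spread ** 2 + eps)  # inverse-variance weighting
    weights = weights / weights.mean()   # keep the loss scale comparable to MSE
    return float(np.mean(weights * (pred - target) ** 2))

# With uniform spread the weights are uniform, recovering plain MSE.
print(spread_weighted_mse(np.array([1.0, 2.0]), np.zeros(2), np.ones(2)))  # 2.5
```

With non-uniform spread, errors in well-observed (low-spread) regions and times dominate the loss, which is one concrete way to encode "trust the reanalysis less where it is less certain."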
ML could also be used to contribute to climate equity by tackling regional biases in weather data.
High-resolution models like HRRR [10] are typically available only at a regional scale for select regions
of the world. Techniques like transfer learning and super-resolution could be used to improve the
quality of weather forecasts in underserved regions by exploiting the higher quality and volume of
data in other regions. Active learning techniques could also be used to identify which locations and
variables, if monitored, would most improve forecast skill, thus providing a data-driven approach for
optimally collecting weather data.
As weather forecasts are often consumed at specific locations of high interest like airports, inter-
polation from a grid to a specific latitude/longitude is a necessary step. Bilinear interpolation is
a common technique, but ML could be applied to perform data-driven interpolation, taking into
account sub-grid features like topography, terrain, land use, etc., for which in-situ observations are
particularly useful as training and evaluation data. Improved interpolation techniques could once
again exploit observations from data-rich regions for the benefit of data-poor regions. Such techniques
can be compared against techniques like Model Output Statistics (MOS), used to provide site-specific
guidance derived from the gridded GFS forecast issued by NOAA (Glahn and Lowry (1972)). Where
a long history of observations is available, data-driven debiasing can be used to remove systematic
biases present in weather models.
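In its simplest form, such debiasing subtracts a station's mean historical forecast error from new forecasts. The sketch below is a minimal stand-in for richer MOS-style regression, with entirely synthetic numbers:

```python
import numpy as np

# Minimal per-station debiasing: estimate the mean historical forecast error
# and subtract it from future forecasts. A toy stand-in for MOS-style methods.
def station_bias(history_fcst: np.ndarray, history_obs: np.ndarray) -> float:
    """Mean (forecast - observation) error over a station's history."""
    return float(np.mean(history_fcst - history_obs))

hist_f = np.array([21.0, 19.5, 22.0])  # synthetic past forecasts (°C)
hist_o = np.array([20.0, 18.5, 21.0])  # matching in-situ observations (°C)
bias = station_bias(hist_f, hist_o)    # forecasts run ~1 °C warm here
corrected = 23.0 - bias                # debiased new forecast
print(corrected)  # 22.0
```

Real MOS fits regressions on many predictors rather than a single mean offset, but even this constant correction illustrates how a long in-situ record turns directly into forecast improvement.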
The definition and use of appropriate metrics is necessary to build trust amongst stakeholders and
properly evaluate the quality of DDWPs. While deterministic metrics like RMSE are convenient to
report, they are not appropriate for all variables (e.g. precipitation) and for longer lead times. For
longer lead times, it may be more appropriate to use probabilistic metrics as well as NWP ensembles
as a more relevant baseline.
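For ensembles, CRPS is the standard such probabilistic metric, and it has a simple sample-based estimator. The sketch below uses the common energy-form estimator with synthetic members:

```python
import numpy as np

# Sample-based CRPS for an ensemble forecast against a scalar observation:
# CRPS = E|X - y| - 0.5 * E|X - X'|, estimated over the ensemble members.
def crps_ensemble(members: np.ndarray, obs: float) -> float:
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return float(term1 - term2)

# A perfectly sharp, perfectly accurate ensemble scores 0;
# a single-member "ensemble" reduces CRPS to absolute error.
print(crps_ensemble(np.array([2.0, 2.0, 2.0]), 2.0))  # 0.0
print(crps_ensemble(np.array([1.0]), 3.0))            # 2.0
```

CRPS rewards both calibration and sharpness, which is why it suits longer lead times where deterministic point metrics like RMSE stop being informative.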
The recent activity and progress in researching ML-based forecasting leaves little doubt that data-
driven weather prediction models will have a meaningful role to play in operational forecasting.
While much of the recent work has been focused on learning an NWP approximation in an ERA5
simulation, imagining a DDWP in an operational setting raises many more interesting questions (and
avenues for research) around their actual use. Our work should be seen as the start of a conversation
around answering such questions to realize the full potential of ML-based weather forecasting.
References
Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., and Tian, Q. (2022). Pangu-Weather: A 3D high-
resolution model for fast and accurate global weather forecast. arXiv preprint arXiv:2211.02556.
Bowler, N., Cullen, M., and Piccolo, C. (2015). Verification against perturbed analyses and observa-
tions. Nonlinear Processes in Geophysics, 22(4):403–411.
Bowler, N. E. (2008). Accounting for the effect of observation errors on verification of MOGREPS.
Meteorological Applications: A journal of forecasting, practical applications, training techniques
and modelling, 15(1):199–205.
[10] High-Resolution Rapid Refresh, a 3 km near-term model provided by NOAA that covers the continental US: https://rapidrefresh.noaa.gov/hrrr/.
Espeholt, L., Agrawal, S., Sønderby, C., Kumar, M., Heek, J., Bromberg, C., Gazen, C., Carver, R.,
Andrychowicz, M., Hickey, J., et al. (2022). Deep learning for twelve hour precipitation forecasts.
Nature communications, 13(1):5145.
Glahn, H. R. and Lowry, D. A. (1972). The use of model output statistics (MOS) in objective weather
forecasting. Journal of Applied Meteorology and Climatology, 11(8):1203–1211.
Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz-Sabater, J., Nicolas, J.,
Peubey, C., Radu, R., Schepers, D., et al. (2020). The ERA5 global reanalysis. Quarterly Journal of
the Royal Meteorological Society, 146(730):1999–2049.
Hoyer, S. and Hamman, J. (2017). xarray: N-D labeled arrays and datasets in Python. Journal of Open
Research Software, 5(1).
Jiao, D., Xu, N., Yang, F., and Xu, K. (2021). Evaluation of spatial-temporal variation performance
of ERA5 precipitation data in China. Scientific Reports, 11(1):1–13.
Kadian, A., Truong, J., Gokaslan, A., Clegg, A., Wijmans, E., Lee, S., Savva, M., Chernova, S.,
and Batra, D. (2020). Sim2real predictivity: Does evaluation in simulation predict real-world
performance? IEEE Robotics and Automation Letters, 5(4):6670–6677.
Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Pritzel, A., Ravuri, S.,
Ewalds, T., Alet, F., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland, G., Stott, J., Vinyals,
O., Mohamed, S., and Battaglia, P. (2022). Graphcast: Learning skillful medium-range global
weather forecasting.
Olauson, J. (2018). ERA5: The new champion of wind power modelling? Renewable Energy,
126:322–331.
Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., Kurth, T.,
Hall, D., Li, Z., Azizzadenesheli, K., Hassanzadeh, P., Kashinath, K., and Anandkumar, A. (2022).
FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural
operators.
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving,
G. (2022). Red teaming language models with language models.
Ravuri, S., Lenc, K., Willson, M., Kangin, D., Lam, R., Mirowski, P., Fitzsimons, M., Athanassiadou,
M., Kashem, S., Madge, S., et al. (2021). Skilful precipitation nowcasting using deep generative
models of radar. Nature, 597(7878):672–677.
Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. (2021). Understanding the capabilities,
limitations, and societal impact of large language models.