
Verification against in-situ observations for Data-Driven Weather Prediction

Vivek Ramavajjala
Excarta Inc.
vivek@excarta.io

Peetak P. Mitra
Excarta Inc.
peetak@excarta.io

Abstract

Data-driven weather prediction models (DDWPs) have made rapid strides in re-
cent years, demonstrating an ability to approximate Numerical Weather Prediction
(NWP) models to a high degree of accuracy. The fast, accurate, and low-cost
forecasts promised by DDWPs make their use in operational forecasting an attractive
proposition; however, there remains work to be done in rigorously evaluating
DDWPs in a true operational setting. Typically trained and evaluated using ERA5
reanalysis data, DDWPs have so far been tested only in simulation, and even a very
high-quality simulation cannot represent the real world with complete accuracy. The
safe use of DDWPs in operational forecasting requires more thorough "real-world"
verification, as well as a careful examination of how DDWPs are currently trained
and evaluated. It is worth asking, for instance: how well do the reanalysis datasets
used for training simulate the real world? With an eye towards climate justice
and the uneven availability of weather data: is the simulation equally good for all
regions of the world, and would DDWPs exacerbate biases present in the training
data? Does good performance in simulation correspond to good performance in
operational settings? In addition to approximating the physics of NWP models,
how can ML be uniquely deployed to provide more accurate weather forecasts?
As a first step towards answering such questions, we present a robust dataset of
in-situ observations derived from the NOAA Meteorological Assimilation Data
Ingest System (MADIS) program to serve as a benchmark to validate DDWPs in
an operational setting. By providing a large corpus of quality-controlled, in-situ
observations, this dataset provides a meaningful real-world task that all NWPs
and DDWPs can be tested against. We hope that this data can be used not only to
rigorously and fairly compare operational weather models but also to spur future
research in new directions.

Keywords: Weather forecasting, ECMWF, ERA5, MADIS, verification, evaluation, observations

1 Introduction
Data-driven weather prediction (DDWP) models have made rapid strides in recent years, steadily
improving forecast skill over longer time horizons at a fraction of the cost of Numerical Weather
Prediction (NWP) models. Given the demonstrated improvements in forecast skill, low inference
cost, and the ability to target specific variables of interest, using DDWPs in operational weather
forecasting is an attractive proposition. Weather forecasts play a foundational role in the day-to-day
operations of key sectors like agriculture, energy, and transportation, and in reducing loss of life due to
extreme weather, so DDWPs could deliver a tangible social benefit. This potential impact also
makes it imperative that ML practitioners take a holistic view of DDWPs in the operational
forecasting setting, testing DDWPs under the same constraints faced by the NWPs that currently form
the bulwark of these operational forecasts. Similar examination of ML models is common in the
development of large language models (LLMs), where a "red-teaming" approach is used to identify
how a robustly trained LLM may yet produce undesirable outcomes in a real-world scenario (e.g.,
Perez et al. (2022), Tamkin et al. (2021)). For DDWPs, a good starting point is defining evaluation
tasks that more accurately reflect their use in operational forecasts.
The most promising recently developed DDWPs have been trained on the ERA5 reanalysis dataset
(Hersbach et al. (2020)), typically using data from 1979-2017 for training and development, and
using data from 2018 as a held-out test set. Produced by combining historical observations with an
NWP, the ERA5 data provides "maps without gaps" of essential climate variables (ECVs). The ML
models use a fixed subset of ECVs from ERA5 to represent the atmosphere at time t, and predict
the same subset of ECVs at t+δ, where δ is most commonly 6 hours, but varies from 1 hour to 24
hours. All ECVs are typically normalized to zero mean and unit variance, and a mean-squared error
(MSE) loss is used to train the models. For evaluation, DDWPs are typically initialized with data
from ERA5 at 00UTC on a particular day, and iteratively rolled out to generate a forecast for the
next 14 days. The forecasts are evaluated by comparing the predictions against ERA5 data using two
metrics: the root mean squared error (RMSE) and the anomaly correlation coefficient (ACC), both
computed over the entire latitude-longitude grid, with grid points nearer to the equator receiving
higher weights. To establish an NWP baseline, the RMSE and ACC metrics are similarly calculated for
ECMWF’s Integrated Forecasting System (IFS). For details, we refer to FCN (Pathak et al. (2022)),
Pangu-Weather (Bi et al. (2022)), or GraphCast (Lam et al. (2022)), which share essentially the same
procedure for training and evaluating different DDWPs.
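To make this evaluation procedure concrete, below is a minimal sketch of latitude-weighted RMSE and ACC for a single forecast field on a regular latitude-longitude grid. The cosine-of-latitude weighting and the precomputed climatology are stated assumptions about the setup; this illustrates the metrics only and is not the code used by any of the cited models.

```python
import numpy as np

def latitude_weights(lats_deg: np.ndarray) -> np.ndarray:
    """Cosine-of-latitude weights, normalized so their mean over the grid is 1."""
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def weighted_rmse(forecast: np.ndarray, truth: np.ndarray, lats_deg: np.ndarray) -> float:
    """Latitude-weighted RMSE over a (lat, lon) field."""
    w = latitude_weights(lats_deg)[:, None]          # broadcast weights over longitude
    return float(np.sqrt(np.mean(w * (forecast - truth) ** 2)))

def weighted_acc(forecast: np.ndarray, truth: np.ndarray,
                 climatology: np.ndarray, lats_deg: np.ndarray) -> float:
    """Latitude-weighted anomaly correlation coefficient over a (lat, lon) field."""
    w = latitude_weights(lats_deg)[:, None]
    fa = forecast - climatology                      # forecast anomaly
    ta = truth - climatology                         # reference (e.g. ERA5) anomaly
    num = np.sum(w * fa * ta)
    den = np.sqrt(np.sum(w * fa ** 2) * np.sum(w * ta ** 2))
    return float(num / den)

# Toy example on a 0.25-degree grid (721 x 1440), the resolution used by ERA5.
lats = np.linspace(90.0, -90.0, 721)
rng = np.random.default_rng(0)
truth = rng.normal(size=(721, 1440))
forecast = truth + 0.1 * rng.normal(size=(721, 1440))
climatology = np.zeros_like(truth)
print(weighted_rmse(forecast, truth, lats), weighted_acc(forecast, truth, climatology, lats))
```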
Trained thus, DDWPs learn a high quality approximation of the underlying NWP used to construct
the ERA5 dataset. Verifying DDWP forecasts against ERA5 reanalysis is similar to the "verification
against analysis" procedure used to evaluate operational NWPs, where the forecast from an NWP
is compared to a sequence of analyses produced from the same NWP1 . While convenient, since
the sequence of analyses depends on the NWP output from the previous cycle, such evaluations
can underestimate the forecast error, especially at short lead times (Bowler et al. (2015), Bowler
(2008)). Besides fundamental limitations of verification against analysis, comparing against ERA5 is
not sufficient to estimate how well the DDWP performs in an operational setting, both objectively
and relative to the IFS baseline. First, the analysis available for initializing operational DDWPs
assimilates observations from +3 hours ahead, and may not be as robust as the ERA5 reanalysis, which
assimilates observations from +9 hours at 00UTC. Second, while verification against analysis is
indeed valuable, operational NWPs are additionally verified against observations to provide more
information2. Lastly, evaluating IFS on the ERA5 dataset is not necessarily meaningful since it
has not been optimized to predict ERA5 data, whereas DDWPs have. Indeed, simply because a
DDWP outperforms the IFS model in the evaluation setting does not guarantee the same performance
improvement holds in an operational setting where both start states and ground truth are different. In
domains such as robotics or RL-based control, the "sim2real" challenge of transferring and evaluating
a model trained via simulation in a real-world setting is well known (e.g., Kadian et al. (2020)).
Viewing the ERA5 dataset as a high-quality simulation of weather, there is a similar need for a
"sim2real" step that evaluates DDWPs against observations.
Our contribution: We introduce a dataset of in-situ observations derived from the NOAA Meteo-
rological Assimilation Data Ingest System (MADIS) program that can be used to more thoroughly
evaluate the performance of DDWPs in an operational setting. Additionally, since in-situ observations
are not modeled in any way but instead reflect the actual weather on the ground, they can be used
to fairly compare DDWPs and NWPs on a task meaningful to consumers of forecasts. We evaluate
three sources of weather data against real-world observations:
• ERA5: We examine how well ERA5 data corresponds to real-world observations. Since
identifying biases in a dataset is crucial to addressing those biases in ML models trained
on that dataset, we further examine regional variations in ERA5 quality.
• FourCastNet: We train a variant of the FourCastNet model, first developed by Pathak et al.
(2022), to examine how a DDWP performs in an operational setting.
• IFS: We evaluate the IFS model against real-world observations to serve as a baseline for an
operational NWP.
1 For instance, see "grid-to-grid" verification against analysis published by NOAA: https://www.emc.ncep.noaa.gov/users/verification/global/gfs/ops/grid2grid_all_models/rmse/.
2 For example, ECMWF scorecards report performance against both analysis and observations: https://sites.ecmwf.int/ifs/scorecards/scorecards-48r1ENS.html

Prior work has examined ERA5 data against observations for specific variables and regions, e.g., Jiao
et al. (2021) evaluate ERA5 for precipitation in China, or to assess its suitability for specific tasks,
e.g., choosing suitable locations for wind farms (Olauson (2018)). To our knowledge, this is the first
publicly available comparison of ERA5 against in-situ observations at such a scale, as well as the first
publicly available dataset for verifying NWPs and DDWPs against observations. The dataset is made
available at https://huggingface.co/datasets/excarta/madis2020.

2 The MADIS dataset

MADIS3 (Meteorological Assimilation Data Ingest System) is a database provided by NOAA
(National Oceanic and Atmospheric Administration) that contains meteorological observations
covering the entire globe. MADIS ingests data from NOAA and non-NOAA sources, including
observation networks ("mesonets") from US federal, state, and transportation agencies, universities,
volunteer networks, and data from private sectors like airlines, as well as public-private partnerships
like CWOP (Citizen Weather Observer Program). Whereas reanalysis data is influenced by the
underlying physics of the NWP used to construct the reanalysis, these observations directly capture
the weather on the ground and can be used to objectively evaluate any data-driven or numerical
weather model, without being biased towards either approach. We also note that both industry
and individual users primarily consume weather forecasts for specific locations (vs. large gridded
forecasts), making such "verification against observations" a meaningful test of operational weather
models.
We created a dataset "MADIS2020" consisting of hourly observations from the Mesonet and METAR
networks ingested by MADIS4 for the entirety of 2020. Mesonet observations represent data from
different weather observation networks, while METAR data primarily indicates data collected at
airports. We limit ourselves to a few key climate variables that have an outsized impact on society:
• Temperature at 2m height (t2m)
• Dewpoint at 2m height (d2m)
• Wind speeds at 10m height (wind)
• Precipitation accumulated over 6 hours (tp6h)
Temperature, dewpoint, and wind speeds are instantaneous observations, while precipitation is an
accumulation variable. The MADIS database also includes quality control flags that specify the level
to which an observation has been checked5 (a minimal range-check sketch follows the lists below):
• Level 1: Observation is checked to lie within a feasible range:
  - Temperature between -60F and 130F
  - Dewpoint between -90F and 90F
  - Wind speed between 0 kts and 250 kts
• Level 2: Internal consistency, temporal consistency, and statistical spatial consistency
are checked. Internal consistency checks enforce meteorological relationships between
different variables at the same station, e.g., dewpoint must not exceed temperature. The
temporal consistency check flags observations that change too rapidly, and the statistical
spatial consistency check flags observations that have failed quality checks 75% of the time
in the last 7 days.
• Level 3: Spatial consistency checks that verify one observation against nearby observations.
Based on the QC levels above, different quality flags are defined and applied for each observation:
• C: Coarse pass, passed level 1
• S: Screened, passed levels 1 and 2
• V: Verified, passed levels 1, 2, and 3
• X: Rejected, failed level 1
• Q: Erroneous, passed level 1 but failed level 2 or 3
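As a rough illustration of the Level 1 check described above, the sketch below flags observations that fall outside the feasible ranges; the pandas column names are assumed for illustration and do not reflect the actual MADIS schema.

```python
import pandas as pd

# Feasible ranges from the Level 1 check (temperature and dewpoint in degrees
# Fahrenheit, wind speed in knots); the column names are assumptions.
LEVEL1_RANGES = {
    "temperature_f": (-60.0, 130.0),
    "dewpoint_f": (-90.0, 90.0),
    "wind_speed_kts": (0.0, 250.0),
}

def passes_level1(obs: pd.DataFrame) -> pd.Series:
    """True where every reported variable lies within its feasible range."""
    ok = pd.Series(True, index=obs.index)
    for col, (lo, hi) in LEVEL1_RANGES.items():
        if col in obs:
            # Missing values are left for later QC levels rather than rejected here.
            ok &= obs[col].between(lo, hi) | obs[col].isna()
    return ok

obs = pd.DataFrame({"temperature_f": [72.0, 140.0], "wind_speed_kts": [12.0, 30.0]})
print(passes_level1(obs).tolist())  # [True, False]
```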

3 https://madis.ncep.noaa.gov/index.shtml
4 Not all sources ingested by MADIS are free for public use; we choose Mesonet and METAR sources as they are publicly usable and have large data volumes.
5 https://madis.ncep.noaa.gov/madis_sfc_qc_notes.shtml

For temperature, dewpoint, and wind speed, we include only observations that have QC flags S or
V, i.e., those passing at least the level 2 checks. Some observations pass level 2 checks but are in regions
too data-sparse to support level 3 checks, and may still be noisy. To remove such noisy observations,
we ignore temperature and dewpoint observations that were more than 20C away from the corresponding
ERA5 value. Precipitation observations are sparser, and for precipitation we additionally include
observations with QC flag C. We partition the observations into different geographical regions,
extending the list of regions used in ECMWF scorecards6 and shown graphically in figure 1 (a minimal
filtering and partitioning sketch follows the list below):
• global: Longitude: -180° to 180°, Latitude: -90° to 90°
• namer (North America): Longitude: -120° to -75°, Latitude: 25° to 60°
• europe (Europe): Longitude: -12.5° to 42.5°, Latitude: 35° to 75°
• wasia (West Asia): Longitude: 42.5° to 102.5°, Latitude: 0° to 60°
• easia (East Asia): Longitude: 102.5° to 150°, Latitude: 25° to 60°
• seasia (South-East Asia): Longitude 95° to 130°, Latitude: -12.5° to 25°
• ausnz (Aus. & New Zealand): Longitude: 110° to 180°, Latitude: -48° to -12.5°
• samer (South America): Longitude: -80° to -31°, Latitude: -54° to 10°
• nafr (Northern Africa): Longitude: -15° to 34°, Latitude: 4° to 36°
• safr (Southern Africa): Longitude: -15° to 50°, Latitude: -36° to 4°
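The filtering and regional partitioning described above can be written compactly. The sketch below is a minimal version that assumes the observations are held in a pandas DataFrame with hypothetical qc_flag, latitude, and longitude columns; the bounding boxes are those listed above, and the short name tp6h follows the variable naming used earlier.

```python
import pandas as pd

# Bounding boxes (lon_min, lon_max, lat_min, lat_max) for the regions listed above.
REGIONS = {
    "namer":  (-120.0, -75.0, 25.0, 60.0),
    "europe": (-12.5, 42.5, 35.0, 75.0),
    "wasia":  (42.5, 102.5, 0.0, 60.0),
    "easia":  (102.5, 150.0, 25.0, 60.0),
    "seasia": (95.0, 130.0, -12.5, 25.0),
    "ausnz":  (110.0, 180.0, -48.0, -12.5),
    "samer":  (-80.0, -31.0, -54.0, 10.0),
    "nafr":   (-15.0, 34.0, 4.0, 36.0),
    "safr":   (-15.0, 50.0, -36.0, 4.0),
}

def filter_qc(obs: pd.DataFrame, variable: str) -> pd.DataFrame:
    """Keep S/V observations, plus C for the sparser precipitation variable."""
    allowed = {"S", "V", "C"} if variable == "tp6h" else {"S", "V"}
    return obs[obs["qc_flag"].isin(allowed)]

def partition_by_region(obs: pd.DataFrame) -> dict:
    """Return the subset of observations inside each region's bounding box.

    Regions are not mutually exclusive, so an observation may appear in more
    than one subset; the 'global' entry is simply the full DataFrame.
    """
    subsets = {"global": obs}
    for name, (lon0, lon1, lat0, lat1) in REGIONS.items():
        in_box = obs["longitude"].between(lon0, lon1) & obs["latitude"].between(lat0, lat1)
        subsets[name] = obs[in_box]
    return subsets
```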

Figure 1: Geographical extent of various regions used in evaluations.

The above regions are not mutually exclusive, and a small number of observations in the "global"
region do not fall under any of the defined regions. Figure 2 shows the distribution of observations
globally and for each region. All regions have at least 1,000,000 observations for temperature, wind
speeds, and dewpoint, but precipitation data is available only for North America. Unsurprisingly,
these data are not uniformly distributed across the world, with North America and Europe having the
greatest density of observations, and the global South having a much lower density of observation
data.

2.1 Extreme weather conditions

Using criteria7 set by the US National Weather Service (NWS), we can select subsets of temperature
and wind speed observations that represent extreme weather (a minimal selection sketch follows this list):
• Temperature: exceeding 35°C. While heat advisories are issued for a combination of temperature
and dewpoint, a temperature of 35°C nearly guarantees a heat advisory, and hence is a useful
threshold for extreme heat events.
• Wind speed: exceeding 13.85 m/s (31 mph), the threshold for wind advisories not associated with a
specific thunderstorm.
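A minimal sketch of selecting these extreme-weather subsets, under the same illustrative column-name assumptions as before (temperature in degrees Celsius, wind speed in meters per second):

```python
import pandas as pd

HEAT_THRESHOLD_C = 35.0    # extreme-heat threshold described above
WIND_THRESHOLD_MS = 13.85  # wind-advisory threshold (roughly 31 mph)

def extreme_subsets(obs: pd.DataFrame) -> dict:
    """Extreme-heat and extreme-wind subsets, evaluated independently."""
    return {
        "extreme_heat": obs[obs["t2m_c"] > HEAT_THRESHOLD_C],
        "extreme_wind": obs[obs["wind_ms"] > WIND_THRESHOLD_MS],
    }
```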
Figure 3 shows the observation counts for wind and temperature events that meet the above criteria.
As expected, extreme events are rarer, though a sufficiently large number of observations is still
available for extreme temperatures in all regions, and for extreme wind events in North America and
Europe.

Figure 2: Total number of observations for different variables in 2020, in log (base 10) scale.

Figure 3: Total number of observations for extreme weather conditions in 2020, in log (base 10) scale.

6 https://sites.ecmwf.int/ifs/scorecards/scorecards-48r1ENS.html
7 https://www.weather.gov/ctp/wwaCriteria

3 Verification against observations


We verify the ERA5 data, the IFS forecast, and a variant of the FourCastNet (FCN) model against
observations to compare the quality of the reanalysis data, an operational NWP model, and a DDWP
trained on that reanalysis. In addition to the original FourCastNet model that has a 6-hour timestep,
we train a variant with a 1-hour timestep, interleaving the two models to obtain hourly forecasts at
0.25° resolution (similar to the process followed by Bi et al. (2022)). We also incorporate
dewpoint at 2m height as an additional input and output variable, and train a precipitation prediction
model as described in the FourCastNet implementation. To ensure our FCN implementation is of
a high quality, we compute the RMSE and ACC metrics using ERA5 as ground truth and compare
them against the metrics published by Pathak et al. (2022).
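A sketch of the interleaving rollout: the 6-hour model advances the main trajectory, and the 1-hour model fills in the intermediate hours from each 6-hourly state so that 1-hour error does not compound over the full horizon. The step_6h and step_1h callables stand in for the two trained FCN variants; this is an assumption about the interface, not the actual implementation.

```python
import numpy as np

def rollout_hourly(x0: np.ndarray, step_6h, step_1h, horizon_hours: int = 48) -> dict:
    """Interleave a 6-hour model and a 1-hour model to produce hourly forecasts.

    x0:      initial atmospheric state (e.g. normalized ECVs on a 0.25-degree grid)
    step_6h: callable advancing a state by 6 hours
    step_1h: callable advancing a state by 1 hour
    Returns {lead_time_in_hours: predicted state}.
    """
    forecasts = {}
    state_6h = x0
    for base in range(0, horizon_hours, 6):
        # Hours base+1 .. base+5 come from the 1-hour model, restarted from the
        # current 6-hourly state rather than chained across the whole horizon.
        state_1h = state_6h
        for hour in range(base + 1, base + 6):
            state_1h = step_1h(state_1h)
            forecasts[hour] = state_1h
        # The main trajectory advances with the 6-hour model.
        state_6h = step_6h(state_6h)
        forecasts[base + 6] = state_6h
    return forecasts

# Usage with trivial stand-in models (identity steps), just to show the structure:
dummy = lambda s: s
hourly = rollout_hourly(np.zeros((2, 721, 1440)), dummy, dummy, horizon_hours=48)
print(sorted(hourly)[:8])  # [1, 2, 3, 4, 5, 6, 7, 8]
```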
Initializing FCN with a start state representative of an operational setting is challenging, since ERA5
contains reanalysis data and not operational analysis data. The ideal test would be to initialize FCN

with the same analysis used to initialize IFS, and while such data is available from the TIGGE archive,
very long download times made this impracticable. To approximate the lower quality of an operational
start state vs. a reanalysis start state, we construct start states using the ensemble mean data from
ERA5 instead of the HRES data from ERA5. The ensemble mean data in ERA5 is produced at a
lower resolution (63 km) than the HRES data (31 km)8, thus making it a coarser reanalysis compared
to HRES. In our experiments, we saw a noticeable drop in the quality of predictions made using
ensemble mean start states compared to HRES start states, mimicking the effect of having lower
quality start states in operational settings.
For all sources, we generate 48-hour forecasts at 0.25° resolution starting at 00UTC for each day in
2020, at an hourly frequency for ERA5 and FCN, and 6-hourly frequency for IFS. Since real-world
observations are for specific points, we use bilinear interpolation to extract point forecasts from
the gridded output using the functionality provided by the Xarray software package by Hoyer and
Hamman (2017). The forecast depth for ERA5 is limited to 24 hours as it contains reanalysis data, and
a longer depth would essentially repeat the first 24 hours. The model outputs for each variable are then
compared against both MADIS2020 and the "Extreme weather conditions" subset of MADIS2020,
using RMSE as the error metric.
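The point-extraction step can be done directly with xarray's interpolation. The sketch below assumes the gridded forecast is an xarray DataArray with latitude and longitude coordinates and that station coordinates are available as plain arrays; the names are illustrative.

```python
import numpy as np
import xarray as xr

def extract_point_forecasts(field: xr.DataArray,
                            station_lats: np.ndarray,
                            station_lons: np.ndarray) -> xr.DataArray:
    """Bilinearly interpolate a gridded field to station locations.

    The stations become a new 'station' dimension in the result; method="linear"
    on a regular latitude/longitude grid amounts to bilinear interpolation.
    """
    lat = xr.DataArray(station_lats, dims="station")
    lon = xr.DataArray(station_lons, dims="station")
    return field.interp(latitude=lat, longitude=lon, method="linear")

# Toy example: a 0.25-degree field interpolated to two station locations.
lats = np.arange(-90.0, 90.25, 0.25)
lons = np.arange(0.0, 360.0, 0.25)
rng = np.random.default_rng(0)
field = xr.DataArray(
    rng.random((lats.size, lons.size)),
    coords={"latitude": lats, "longitude": lons},
    dims=("latitude", "longitude"),
    name="t2m",
)
points = extract_point_forecasts(field, np.array([40.7, 51.5]), np.array([286.0, 358.5]))
print(points.values)
```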

3.1 All-weather verification

Figure 4 shows how the different model sources compare against all observations from MADIS2020;
the nearly identical performance of all models for most variables and regions indicates that
the improvements seen in verification against analysis may not proportionately hold when verifying
against observations. For all sources, we can also note both daily and regional variations. The
regional variations are pronounced for all variables, while the daily variation is more pronounced for
temperature and dewpoint. The FCN model’s performance largely matches that of the ERA5 data,
indicating that it has indeed learned a good approximation of the underlying NWP, and that at least
for a 48-hour forecast its real-world performance is close to that of ERA5. Regional variation in
ERA5 is expected: per the ERA5 documentation9, the uncertainty of the ERA5 reanalysis is expected
to be higher the farther back in time we go, and in data-sparse regions like the global South. The
reanalysis is also not equally accurate at all times of day, because ERA5 uses fixed observation
windows of 0900UTC-2100UTC and 2100UTC-0900UTC (+1 day) for assimilation.
The variations in ERA5 quality across regions should be kept in mind when developing DDWPs
to be conscious of biases that may be reinforced or exacerbated. For instance, heatwaves induced
by climate change are expected to disproportionately affect the global South, which are also the
regions (West Asia, East Asia, South America, Southern Africa) where ERA5 has the highest errors
for 2-meter temperature and dewpoint. Lam et al. (2022) reported the performance of GraphCast
separately for each region when verifying against reanalysis; the same procedure should be followed
when verifying against observations.

3.1.1 Evaluating precipitation

Forecasting precipitation accurately is of great value, but evaluating precipitation forecasts
is difficult given the sparsity of its occurrence. Precipitation is a long-tailed event, as it does not rain
in most of the world, most of the time, with non-trivial amounts of rain being even rarer. While RMSE
and ACC were used as metrics for evaluating the precipitation forecasting models by Pathak et al.
(2022), the comparison in Figure 4 illustrates how RMSE may not be appropriate for precipitation.
The FCN model shows a much lower error than ERA5 and IFS for 6-hourly precipitation, implying
the model forecasts precipitation accurately – even better than the data it is trained on! A naive "no
rain" model that always predicts 0 mm precipitation has even lower error, as we can see in figure 5.
It is more likely that the FCN model underpredicts precipitation and arrives at a misleadingly low
RMSE. For precipitation, metrics like the Critical Success Index (CSI) or the Continuous Ranked Probability
Score (CRPS) may be more appropriate, as used by Espeholt et al. (2022) and Ravuri et al. (2021).
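For reference, a minimal sketch of the Critical Success Index at a fixed precipitation threshold (the 1 mm threshold here is an illustrative choice): a "no rain" forecast never exceeds the threshold, so it scores a CSI of 0 despite its low RMSE, which is exactly the failure mode that RMSE hides.

```python
import numpy as np

def critical_success_index(forecast_mm: np.ndarray, observed_mm: np.ndarray,
                           threshold_mm: float = 1.0) -> float:
    """CSI = hits / (hits + misses + false alarms) at a precipitation threshold."""
    f = forecast_mm >= threshold_mm
    o = observed_mm >= threshold_mm
    hits = np.sum(f & o)
    misses = np.sum(~f & o)
    false_alarms = np.sum(f & ~o)
    denom = hits + misses + false_alarms
    return float(hits / denom) if denom > 0 else float("nan")

obs = np.array([0.0, 2.5, 0.2, 6.0])
no_rain = np.zeros_like(obs)
print(critical_success_index(no_rain, obs))  # 0.0: "no rain" never scores a hit
```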

8 https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#heading-Spatialgrid
9 https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#heading-Accuracyanduncertainty

Figure 4: RMSE of ERA5, FourCastNet (FCN), and IFS against observations for wind speeds,
temperature, dewpoint, and 6-hourly precipitation accumulation. Each column represents a particular
data source (ERA5, FCN, IFS), and each row represents a variable. The x-axis represents the time of
day in UTC, and the y-axis represents the RMSE in the units appropriate to each variable. Note the
different forecast depths for ERA5 vs. FCN and IFS.

3.2 Extreme weather verification

Figure 6 shows how different model sources compare against extreme observations for wind and
temperature, with higher absolute errors for all models, reflecting the extreme values of the variables
being predicted. While daily variation is still present, the local time of day is a stronger influence, as
expected for extreme heat events that peak in the day. ERA5 and FCN generally perform no better
than IFS, confirming the trend seen in "all-weather" verification that DDWPs do not necessarily
perform better than NWPs when verifying against observations, even if they do when verifying
against analysis. In some cases (e.g., 2-meter temperature for South America), ERA5 itself seems to
have higher error than the IFS predictions.

Figure 5: RMSE of 6-hourly precipitation for FourCastNet (FCN) and a naive "no rain" model which
always predicts 0mm rainfall.

Figure 6: RMSE of ERA5, FourCastNet (FCN), and IFS against extreme observations for wind
speeds and temperature. Each column represents a particular data source (ERA5, FCN, IFS), and each
row represents a variable. The x-axis represents the time of day in UTC, and the y-axis represents the
RMSE in the units appropriate to each variable. Note the different forecast depths for ERA5 vs. FCN
and IFS.

4 Discussion
DDWPs have made significant improvements in a short time, and have so far been evaluated using
reanalysis data as ground truth. As DDWPs move from research to operation, they must be evaluated
more thoroughly under settings that replicate their operational use. While by no means exhaustive,
the use of in-situ observations from MADIS2020 provides one such way to evaluate DDWPs on a
meaningful task, with the benefit that they can be used to fairly compare all DDWPs and NWPs.
Verifying against observations shows the IFS baseline to be extremely competitive with DDWPs
even though DDWPs performed better when verifying against reanalysis, further highlighting the
need for thorough evaluation. The analysis presented earlier also points to more areas where ML can
be applied to improve operational weather forecasting.
The difference between reanalysis start states, which DDWPs are trained on, and operational start
states has a detrimental impact on forecast quality, especially in the short term. This may be addressed
by fine-tuning DDWPs on operational start states once they are sufficiently trained on ERA5 data.
The uncertainty of ERA5 reanalysis varies by time and location and tends to be higher in the global
South, as mentioned in the ERA5 documentation and evidenced through the comparison of ERA5
against observations. The ensemble "spread" available in ERA5 conveys a specific type of uncertainty
estimate, and may be used to improve the training of DDWPs by adjusting how much to weight data
from different locations and times. The uncertainty is expected to also be higher the farther back in
time we go (due to sparser observations), and accounting for uncertainty may bias DDWPs towards
recent data and help reduce any training-inference skew caused by weather patterns changing due to
climate change.
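As one hedged illustration of this idea, the sketch below down-weights the training loss where the ERA5 ensemble spread is large, so that grid points and times with higher reanalysis uncertainty contribute less to the gradient. This sketches the weighting only; it is not a method used by any of the models discussed here, and the inverse-variance form is an assumption.

```python
import torch

def spread_weighted_mse(pred: torch.Tensor, target: torch.Tensor,
                        spread: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """MSE in which each grid point is weighted inversely to the ERA5 ensemble spread.

    pred, target: (batch, channels, lat, lon) model output and ERA5 target
    spread:       (batch, channels, lat, lon) ensemble spread for the same fields
    """
    weights = 1.0 / (spread ** 2 + eps)    # higher spread -> lower weight
    weights = weights / weights.mean()     # keep the loss scale comparable to plain MSE
    return torch.mean(weights * (pred - target) ** 2)
```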
ML could also be used to contribute to climate equity by tackling regional biases in weather data.
High-resolution models like HRRR10 are typically available only at a regional scale for select regions
of the world. Techniques like transfer learning and super-resolution could be used to improve the
quality of weather forecasts in underserved regions by exploiting the higher quality and volume of
data in other regions. Active learning techniques could also be used to identify which locations and
variables, if monitored, would most improve forecast skill, thus providing a data-driven approach for
optimally collecting weather data.
As weather forecasts are often consumed at specific locations of high interest like airports,
interpolation from a grid to a specific latitude/longitude is a necessary step. Bilinear interpolation is
a common technique, but ML could be applied to perform data-driven interpolation, taking into
account sub-grid features like topography, terrain, land use, etc., for which in-situ observations are
particularly useful as training and evaluation data. Improved interpolation techniques could once
again exploit observations from data-rich regions for the benefit of data-poor regions. Such techniques
can be compared against techniques like Model Output Statistics (MOS), used to provide site-specific
guidance derived from the gridded GFS forecast issued by NOAA (Glahn and Lowry (1972)). Where
a long history of observations is available, data-driven debiasing can be used to remove systematic
biases present in weather models.
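As a simple illustration of the debiasing idea, the sketch below learns a per-station, per-lead-time mean bias from a history of forecast/observation pairs and subtracts it from new forecasts. The column names (station_id, lead_time_h, forecast, observation) are illustrative assumptions, not a schema from this dataset.

```python
import pandas as pd

def fit_station_bias(history: pd.DataFrame) -> pd.Series:
    """Mean (forecast - observation) per station and lead time, from past pairs."""
    err = history["forecast"] - history["observation"]
    return err.groupby([history["station_id"], history["lead_time_h"]]).mean().rename("bias")

def apply_debias(forecasts: pd.DataFrame, bias: pd.Series) -> pd.Series:
    """Subtract the learned bias; stations or lead times never seen get no correction."""
    merged = forecasts.merge(bias, how="left",
                             left_on=["station_id", "lead_time_h"], right_index=True)
    return merged["forecast"] - merged["bias"].fillna(0.0)
```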
The definition and use of appropriate metrics are necessary to build trust amongst stakeholders and to
properly evaluate the quality of DDWPs. While deterministic metrics like RMSE are convenient to
report, they are not appropriate for all variables (e.g., precipitation) or for longer lead times. For
longer lead times, it may be more appropriate to use probabilistic metrics, with NWP ensembles
as a more relevant baseline.
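For completeness, a minimal sketch of the empirical CRPS for an ensemble forecast at a single location, using the identity CRPS = E|X - y| - 0.5 E|X - X'| with X, X' drawn from the ensemble; the example values are illustrative.

```python
import numpy as np

def crps_ensemble(members: np.ndarray, observation: float) -> float:
    """Empirical CRPS for one scalar observation and a set of ensemble member values."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - observation))                      # E|X - y|
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))  # 0.5 * E|X - X'|
    return float(term1 - term2)

print(crps_ensemble(np.array([12.0, 14.5, 13.2, 15.1]), 13.0))
```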
The recent activity and progress in researching ML-based forecasting leaves little doubt that data-
driven weather prediction models will have a meaningful role to play in operational forecasting.
While much of the recent work has been focused on learning an NWP approximation in an ERA5
simulation, imagining a DDWP in an operational setting raises many more interesting questions (and
avenues for research) around their actual use. Our work should be seen as the start of a conversation
around answering such questions to realize the full potential of ML-based weather forecasting.

References
Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., and Tian, Q. (2022). Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast. arXiv preprint arXiv:2211.02556.

Bowler, N., Cullen, M., and Piccolo, C. (2015). Verification against perturbed analyses and observa-
tions. Nonlinear Processes in Geophysics, 22(4):403–411.

Bowler, N. E. (2008). Accounting for the effect of observation errors on verification of MOGREPS. Meteorological Applications, 15(1):199–205.
10 High-Resolution Rapid Refresh, a 3km near-term model provided by NOAA that covers the continental US: https://rapidrefresh.noaa.gov/hrrr/.

Espeholt, L., Agrawal, S., Sønderby, C., Kumar, M., Heek, J., Bromberg, C., Gazen, C., Carver, R.,
Andrychowicz, M., Hickey, J., et al. (2022). Deep learning for twelve hour precipitation forecasts.
Nature Communications, 13(1):5145.
Glahn, H. R. and Lowry, D. A. (1972). The use of Model Output Statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology and Climatology, 11(8):1203–1211.
Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz-Sabater, J., Nicolas, J.,
Peubey, C., Radu, R., Schepers, D., et al. (2020). The ERA5 global reanalysis. Quarterly Journal of
the Royal Meteorological Society, 146(730):1999–2049.
Hoyer, S. and Hamman, J. (2017). xarray: N-D labeled arrays and datasets in Python. Journal of Open
Research Software, 5(1).
Jiao, D., Xu, N., Yang, F., and Xu, K. (2021). Evaluation of spatial-temporal variation performance
of ERA5 precipitation data in China. Scientific Reports, 11(1):1–13.
Kadian, A., Truong, J., Gokaslan, A., Clegg, A., Wijmans, E., Lee, S., Savva, M., Chernova, S.,
and Batra, D. (2020). Sim2real predictivity: Does evaluation in simulation predict real-world
performance? IEEE Robotics and Automation Letters, 5(4):6670–6677.
Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Pritzel, A., Ravuri, S.,
Ewalds, T., Alet, F., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland, G., Stott, J., Vinyals,
O., Mohamed, S., and Battaglia, P. (2022). GraphCast: Learning skillful medium-range global
weather forecasting.
Olauson, J. (2018). ERA5: The new champion of wind power modelling? Renewable Energy,
126:322–331.
Pathak, J., Subramanian, S., Harrington, P., Raja, S., Chattopadhyay, A., Mardani, M., Kurth, T.,
Hall, D., Li, Z., Azizzadenesheli, K., Hassanzadeh, P., Kashinath, K., and Anandkumar, A. (2022).
FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural
operators.
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving,
G. (2022). Red teaming language models with language models.
Ravuri, S., Lenc, K., Willson, M., Kangin, D., Lam, R., Mirowski, P., Fitzsimons, M., Athanassiadou,
M., Kashem, S., Madge, S., et al. (2021). Skilful precipitation nowcasting using deep generative
models of radar. Nature, 597(7878):672–677.
Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. (2021). Understanding the capabilities,
limitations, and societal impact of large language models.

