Ebook How To Choose The Right Dataset Solargis

How to choose the right dataset
for evaluation of solar projects: the MASTER approach
Version 2020.11
Table of Contents
01 Why choosing the right solar dataset is important?
    Not all solar datasets are equal
    What are the most common problems?
02 Current approaches for selecting datasets
    Weighted average vs median approach
    Real hourly vs synthetic hourly data
    Why is Typical Meteorological Year (TMY) not sufficient?
03 MASTER: 6 key properties to consider when choosing a reliable solar dataset
    Made for Solar
    Available
    Spatial resolution
    Time resolution
    Extensive validation
    Representative in time
04 Applying the selection criteria, and a bonus tip!
    Comparison of 7 most popular datasets
    Who are the winners?
© 2020 Solargis 2
01
Why choosing the right solar dataset is important?
Reliable estimation of long-term energy yield is one of the most
important steps in the development of solar energy projects.
To support this, developers need to select a reliable solar radiation
dataset that effectively meets their needs.
However, this is not as straightforward as it might seem.

Developers nowadays are faced with a ‘paradox of choice’. Not all
solar datasets are created equal, and with multiple free and paid
options available, the risks of making the wrong choice can be
substantial.
PVsyst, the most popular PV simulation software, offers developers
an almost overwhelming choice, with the option to import solar
data from more than 15 sources.
Some data providers even offer data from multiple model versions
at the same time (for example v1.1, v1.2, v2, etc.), further adding
to the confusion.

Such an extensive choice of datasets creates several problems:
1. Lower profits: A wide choice of datasets creates confusion
among project developers and EPCs. Choosing a less
accurate dataset can result in a suboptimal system design.
A suboptimal design of a large-scale PV power plant leading
to 1% lower yield would result in lost profitability in the range
of millions of dollars over the asset lifetime.
2. Reduced transparency: Lack of standardization in selection
of solar data also creates room for ‘data shopping’. This
means that developers and EPCs compare multiple data
sources and choose the one that best serves their needs,
irrespective of whether the dataset is reliable or not. The
dataset showing the highest solar radiation is often selected to
increase chances of closing a deal or to get more favorable
financing conditions.
3. Lower productivity: Developers, technical advisors, and EPCs
exchange endless emails and phone calls to agree on the
choice of dataset. The process of downloading data from
multiple websites and comparing them in spreadsheets takes
hours for project engineers, wasting valuable time.
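
To get a feel for the scale of the 1% figure mentioned under "Lower profits", the short sketch below works through the arithmetic. All input values (plant size, specific yield, energy price, lifetime) are hypothetical and chosen only for illustration; the calculation is undiscounted and ignores degradation.

```python
def lost_revenue_usd(capacity_mwp, specific_yield_kwh_per_kwp,
                     price_usd_per_mwh, lifetime_years, yield_loss_fraction):
    """Undiscounted revenue lost over the asset lifetime to a yield reduction."""
    # kWh/kWp is numerically the same as MWh/MWp
    annual_energy_mwh = capacity_mwp * specific_yield_kwh_per_kwp
    lost_mwh_per_year = annual_energy_mwh * yield_loss_fraction
    return lost_mwh_per_year * price_usd_per_mwh * lifetime_years


# Hypothetical plant: 100 MWp, 1800 kWh/kWp per year, 40 USD/MWh, 25 years
loss = lost_revenue_usd(100, 1800, 40, 25, 0.01)
print(f"Revenue lost to a 1% lower yield: {loss:,.0f} USD")
```

With these assumed inputs the loss is on the order of 1.8 million USD; larger plants or higher energy prices scale the figure proportionally.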
02
Current approaches for selecting datasets
There is not yet a standardised approach for identifying the best
solar and meteorological dataset. Two approaches commonly used
by technical advisors to justify estimates of solar resource are:
1. The weighted average approach, which relies on taking
a weighted mean of monthly averages of solar resource
and air temperature data from multiple sources. Weights
are assigned to the different datasets based on parameters
such as spatial resolution and temporal coverage.
The main argument for use of this method is that the
weighted mean should give a resource estimate that has
lower uncertainty than that of a single dataset.
2. The median approach, where long-term monthly and
yearly values of multiple data sources are compared.
If any data source shows inconsistency relative to other
datasets, it is identified as an outlier and is removed
from further analysis. From the remaining datasets, the
dataset with the median GHI value is selected as input
for yield estimation.

Whilst, at first glance, these approaches might seem more reliable
than relying on a single data source, both have limitations and their
use is strongly discouraged.

The weighted mean approach requires generation of synthetic hourly
data from monthly averages. Using a synthetically generated typical
year dataset creates multiple limitations for evaluation of a solar
project, as summarised in the comparison below. This approach is subjective
and results in a dataset that can always be disputed by stakeholders
in the project evaluation process. Use of this approach reduces both
transparency and efficiency in the evaluation of solar projects.

Real hourly data vs synthetic hourly data, criteria compared:
• Minimum uncertainty
• Interannual variability of solar resource and expected generation
• Optimizing of PV system design
• Accurate revenue planning and self-consumption analysis
• Validate the models and adapt them to the local conditions
using high quality ground measurements
• Calculate consistent metrics along entire project lifetime
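
For readers who have not seen the two approaches described above in practice, the sketch below shows roughly what each one does with a set of long-term GHI estimates. The dataset names, GHI values and weights are hypothetical, and the outlier rule (discard anything more than 5% away from the median) is an assumption made for illustration; no specific rule is prescribed here.

```python
from statistics import median

# Hypothetical long-term annual GHI estimates for one site (kWh/m2)
ghi = {"A": 1850.0, "B": 1900.0, "C": 1875.0, "D": 2100.0, "E": 1860.0}
# Hypothetical weights, e.g. reflecting spatial resolution and temporal coverage
weights = {"A": 1.0, "B": 2.0, "C": 1.0, "D": 1.0, "E": 1.0}


def weighted_average(values, weights):
    """Weighted mean of the estimates from all datasets."""
    total_weight = sum(weights[name] for name in values)
    return sum(values[name] * weights[name] for name in values) / total_weight


def median_selection(values, max_rel_dev=0.05):
    """Drop outliers (assumed rule), then pick the dataset with the median value."""
    centre = median(values.values())
    kept = {name: v for name, v in values.items()
            if abs(v - centre) / centre <= max_rel_dev}
    ranked = sorted(kept, key=kept.get)
    # With an even number of datasets, the lower of the two middle ones is taken
    return ranked[(len(ranked) - 1) // 2], kept


print(round(weighted_average(ghi, weights), 1))
chosen, kept = median_selection(ghi)
print(chosen, sorted(kept))
```

Note that the weighted average produces a number that belongs to no dataset at all, while the median rule discards dataset "D" purely because it disagrees with the others, regardless of whether it is the more accurate one.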
The second approach, which relies on selection of the dataset with
the median GHI value after discarding outliers, can only be considered
more effective if the datasets being compared have similar qualities,
i.e. they can be validated, are available as high resolution time series,
and have long temporal coverage, including the recent period.

If the dataset with the median value is available only as a Typical
Meteorological Year (TMY) dataset, this data cannot be validated
against on-site measurements. For this reason alone, use of datasets
available only as a typical year dataset is not justified. Moreover, it might
be difficult to select a dataset that has the median value for multiple
parameters influencing the performance of a PV system, including
GHI, DNI, and TEMP.
It is also worth noting that the best models may not be selected
via this median approach, as they are able to capture what other
models cannot. As an example, an analysis of GHI estimates from
multiple models for the Salt Lakes in South Australia shows that
Solargis is the outlier. However, Solargis is the only data source that
is able to distinguish between clouds and high reflectivity surfaces.
By employing the median approach, a suboptimal dataset might be
selected in this case.
Image credit: J.K. Copper and A.G. Bruce, UNSW (2019)
03
MASTER: 6 key properties to consider
when choosing a reliable solar dataset
To help developers eliminate poor data choices
from their decision-making process, below we have
identified the key criteria that need to be considered.
Together, they form the ‘MASTER’ approach:
M for Made for solar applications
(solar radiation plus air temperature, wind parameters, etc.)
A for Availability of recent periods
S for Spatial resolution: high
T for Time resolution: sub-hourly
E for Extensive validation
R for Representative enough: data coverage
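
One way to make the six criteria operational is a simple screening checklist, sketched below. The `DatasetProfile` fields, the function name and the example datasets are all hypothetical. The 20-year and hourly thresholds follow the guidance in this eBook, while the 5 km spatial threshold is an arbitrary placeholder that each team would need to set for itself.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    name: str
    made_for_solar: bool          # M
    covers_recent_period: bool    # A
    spatial_resolution_km: float  # S
    time_step_minutes: int        # T
    validated: bool               # E
    years_covered: int            # R


def master_gaps(d, max_resolution_km=5.0, max_time_step_minutes=60, min_years=20):
    """Return the MASTER letters that a dataset fails (empty list = passes all)."""
    checks = {
        "M": d.made_for_solar,
        "A": d.covers_recent_period,
        "S": d.spatial_resolution_km <= max_resolution_km,
        "T": d.time_step_minutes <= max_time_step_minutes,
        "E": d.validated,
        "R": d.years_covered >= min_years,
    }
    return [letter for letter, passed in checks.items() if not passed]


# Hypothetical candidates, not real products
candidates = [
    DatasetProfile("time-series-X", True, True, 0.25, 15, True, 25),
    DatasetProfile("typical-year-Y", False, False, 25.0, 60, False, 15),
]
for c in candidates:
    print(c.name, master_gaps(c) or "passes all criteria")
```

A pass/fail list like this is only a first filter; it narrows the long list before the more detailed comparison of validation statistics.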
M Made for solar
The dataset must be specifically designed for simulation of
solar power. Older datasets, such as NREL TMY2 or
TMY3, or Meteonorm, lack internal integrity and geographical
representativeness. NREL TMY2 and TMY3 datasets are created
by selecting 12 representative months from a multi-year time
series. However, when selecting representative months, equal
weight is given to solar radiation and meteo parameters (air
temperature and wind), even though solar radiation parameters
have a bigger influence on the performance of a solar power plant.
Such an approach is optimal for building energy simulations but not
for solar energy applications.

A Available
Availability of data covering the recent period is a feature often
overlooked, but it is vital. If the satellite-derived data used in the
development stage is updated on a regular basis, it is possible to
compare on-site measurements to the modeled solar radiation
estimates during the operation phase of the project.

This feedback is useful for:
• Re-evaluation of the long-term yield
• Achieving the correct asset valuation in case of sale
or refinancing of the project
• Improving accuracy of long-term yield estimates
for future projects to be developed.
S Spatial resolution
High spatial resolution is now a necessity, as projects are
increasingly developed in mountainous terrain or at the
intersection of the land with water (lakes and seas).

Some data providers offer access to a low-resolution version
of their datasets, e.g. 10 km × 10 km, and a higher resolution
version, representing the native resolution of satellite imagery.
It should be noted that when using the lower resolution dataset,
the uncertainty of estimates is higher than the advertised
uncertainty of the high-resolution dataset.

Solar asset developers, owners and operators should also
pay close attention to the spatial resolution of meteorological
parameters. Often, meteorological parameters such as air
temperature are derived from numerical weather models that
have a coarse spatial resolution, such as 25 km × 25 km. This
can result in higher uncertainty for yield estimates in regions
with variable terrain or in proximity to lakes and seas. Datasets
which include post-processed meteo inputs, for example
elevation-corrected and de-biased air temperature data, should
be given preference when selecting a dataset for energy yield
simulations.
T Time resolution
The original time intervals of the best datasets will be sub-hourly,
or at least hourly. Modern satellites offer a standard
time resolution of 10 and 15 minutes. Together with quality and
accuracy, it is important to look at the original time granularity
of the solar and weather data input. There are 2 categories of
datasets used by the industry:
• Artificial hourly data profiles, generated from the long-term
monthly averages by a synthetic data generator
• Real sub-hourly or hourly time series, as an output
of the solar and meteorological models.

Even when the monthly sums of both real and synthetically
generated hourly datasets are similar, the hour-by-hour analysis
will always show critical differences. In a synthetically generated
hourly time series, the typical and extreme values are not fully
captured and systematic deviations are often present.
This has an adverse effect on selection of an optimum design.
Ultimately, this negatively impacts revenue planning and
self-consumption analysis. We discuss this topic in more detail
in our blog post.

Figure: Example of relative differences in the 24x12 PV production profile calculated using
i) original hourly data as input, and ii) synthetically generated hourly data, both
having the same monthly sum.

Hourly data resolution is not sufficient to meet the needs of
modern energy simulation. Evaluating specific scenarios
of system performance requires 15-minute time resolution.
For specific uses, especially for designing solar hybrid systems
(with storage or diesel genset), even more granular 1-minute
time series are needed.
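
The kind of 24x12 comparison described in this section can be reproduced with a few lines of code. The sketch below is a minimal illustration using only the Python standard library: it averages an hourly series into a month-by-hour profile and computes the relative difference between two such profiles. The function names and the two-day sample series are hypothetical; real use would feed in a full year or more of hourly values.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def profile_24x12(start, hourly_values):
    """Average an hourly series into a {(month, hour): mean value} profile."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, value in enumerate(hourly_values):
        t = start + timedelta(hours=i)
        sums[(t.month, t.hour)] += value
        counts[(t.month, t.hour)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def relative_difference(real, synthetic):
    """Per-cell (synthetic - real) / real, skipping cells where real is zero."""
    return {key: (synthetic[key] - real[key]) / real[key]
            for key in real if real[key] != 0}


# Hypothetical two-day series: identical totals, different hourly shape
start = datetime(2020, 1, 1)
real = ([0.0] * 8 + [1.0, 3.0, 5.0, 6.0, 5.0, 3.0, 1.0] + [0.0] * 9) * 2
synthetic = ([0.0] * 8 + [2.0, 3.0, 4.0, 6.0, 4.0, 3.0, 2.0] + [0.0] * 9) * 2

diff = relative_difference(profile_24x12(start, real),
                           profile_24x12(start, synthetic))
for (month, hour), d in sorted(diff.items()):
    print(f"month {month:2d}, hour {hour:2d}: {d:+.0%}")
```

In this toy case both series sum to the same total, yet individual hours differ by as much as a factor of two, which is the effect that monthly sums hide.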
E Extensive validation
The most accurate datasets undergo extensive validation across
all geographical zones and climate regions, demonstrating
low uncertainty. This is the single most important criterion for
selection of the right dataset.

Many popular data sources only include typical year files with
artificially generated hourly values. These cannot be compared
one-to-one with ground measurements, and hence cannot be
validated. When choosing between a validated dataset where
the margin of error is quantified, and a dataset that cannot
be validated, it’s a clear choice. Applying this criterion alone
narrows down the long list of data sources significantly.

Besides showing the validation statistics for Global Horizontal
Irradiation (GHI), it is also essential to document the validation
of Direct Normal Irradiation (DNI) and relevant meteorological
parameters, particularly air temperature.

Demonstrating consistent results in validation of both GHI and
DNI is critical, as it proves the integrity of the solar models used.
Only rigorous validation of air temperature qualifies the dataset
(a bias of 1 degree Celsius in air temperature results in a systematic
error of approximately 0.5% in energy yield).

Besides bias and RMSE, consistency (representativeness) of
modelled and measured values at sub-hourly time resolution,
e.g. the Kolmogorov-Smirnov index, must be taken into account.
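
As a rough illustration of what validation statistics involve, the sketch below computes bias, RMSE and a distribution-similarity measure for paired modelled and measured values. The sample values are hypothetical. The similarity measure shown is the plain two-sample Kolmogorov-Smirnov distance (largest gap between the two empirical cumulative distributions), used here as a simplified stand-in; the exact index definition used in any given validation report may differ.

```python
from bisect import bisect_right
from math import sqrt


def bias(modelled, measured):
    """Mean of (modelled - measured); positive means the model overestimates."""
    return sum(m - o for m, o in zip(modelled, measured)) / len(measured)


def rmse(modelled, measured):
    """Root mean square error of the paired values."""
    return sqrt(sum((m - o) ** 2 for m, o in zip(modelled, measured)) / len(measured))


def ks_distance(modelled, measured):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(modelled), sorted(measured)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


# Hypothetical paired irradiance values (W/m2)
measured = [100.0, 200.0, 300.0, 400.0]
modelled = [110.0, 190.0, 310.0, 420.0]

print(bias(modelled, measured))
print(round(rmse(modelled, measured), 2))
print(ks_distance(modelled, measured))
```

These statistics only make sense when modelled and measured values share the same timestamps, which is why datasets available only as typical year files cannot be validated this way.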
R Representative in time
The dataset must cover a long period, ideally 20+ years, and
data must be computed in real time for availability up to the
present time.

Studies based on long periods of measurements from meteorological
stations show that the long-term average can be more
accurately estimated when the dataset represents a 20 to
25-year period instead of 10–15 years.

For example, Remund and Müller (2010) have shown that annual
variability of GHI decreases from 2–4% to 2% as we move from
a 10-year measurement period to a 25-year measurement period.
Kimball, Chaudhary, et al. (2018) have also demonstrated that
using fewer years of data, i.e. 10–15 years instead of 20+, can
result in underestimating the inter-annual resource variability.

Length of GHI measurements                  1 year     10 years     20 years
Typical interannual variability (STDEV)     5–7 %      2–4 %        2 %
(Remund and Müller 2010)
Uncertainty (80 % occurrence)               6.4–9 %    2.6–5.1 %    2.6 %

Using shorter periods of data not only increases uncertainty
around the long-term revenues of a project, but can also result
in an inaccurate estimate of P90 or other Pxx yield estimates,
which are crucial for servicing project debt.
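
The link between interannual variability and Pxx values can be made concrete with a short calculation. The sketch below assumes that annual values are normally distributed, which is a common simplification and an assumption of this example rather than a statement from the cited studies. Under that assumption, the uncertainty figures at 80% occurrence quoted in this section appear consistent with multiplying the standard deviation by roughly 1.28. The P50 value used is hypothetical.

```python
from statistics import NormalDist

# Factor for 90% exceedance probability (edge of an 80% central interval),
# assuming a normal distribution
Z_90 = NormalDist().inv_cdf(0.90)  # approximately 1.2816


def uncertainty_at_80_pct(stdev_pct):
    """Half-width of the 80% interval, in percent, for a given STDEV in percent."""
    return Z_90 * stdev_pct


def p90(p50, stdev_pct):
    """Value exceeded with 90% probability, given P50 and STDEV in percent."""
    return p50 * (1 - Z_90 * stdev_pct / 100)


for stdev in (5, 7, 2, 4):
    print(f"STDEV {stdev} % -> uncertainty {uncertainty_at_80_pct(stdev):.1f} %")

# Hypothetical P50 specific yield of 1800 kWh/kWp
print(round(p90(1800, 5), 1), round(p90(1800, 2), 1))
```

The last line shows why the length of the record matters for financing: with a 5% standard deviation the P90 sits more than 6% below P50, whereas with 2% it sits less than 3% below.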
04
Applying the selection criteria, and a bonus tip!
To demonstrate how to apply the above data selection criteria, we have compared 7 datasets used widely by the solar industry in North America:
SolarAnywhere, Solargis, Meteonorm, NREL TMY2 / TMY3, Solcast, NREL PSM, and Vaisala.

Criteria compared:
• Validation, accuracy
• Long period
• Recent data
• Spatial resolution
• Temporal originality of hourly dataset
• High accuracy validated meteorological data

As shown by the above comparison, the choice of solar datasets
can be easily narrowed down to two options: Solargis and
SolarAnywhere.

But how should solar asset owners choose between these two
options if both data sources for a specific region are validated,
show good accuracy, and meet the relevant criteria?
Bonus tip
Choose the solution that delivers additional commercial insights and improves the productivity
of your team. This may mean a better API or inputs for more accurate simulation of new
technologies — such as bifacial or floating solar PV systems. Factors such as higher spatial
resolution of the data, high integrity of the meteo parameters, global coverage, integration with
a wide range of software tools, and professional support are also increasingly important for solar
developers.
It’s surprising that lower quality datasets developed in the 1990s are still being used by the solar
industry. These practices stifle innovation and competitiveness of the solar industry — ultimately
hindering growth. This eBook shares our best practice method — the MASTER approach — for
selecting the most reliable data for solar energy simulation. By mastering this process, we can
collectively improve profitability, transparency and productivity within the industry.
To learn how Solargis can help your business
improve your solar energy assessment,
contact us to schedule a demo.
