
A Report On

Statistical Analysis and Forecasting of Solar Energy


(Inter-States)

Prepared for
MATH F432 : Applied Statistical Methods

Submitted to
Dr. Sumanta Pasari (Dept of Mathematics)
On
April 30th, 2021

By
Group Tukey(11)
_________________________________________________________
Aryan Verma -2019B4A30616P Parikshit Sharma -2019B4TS1267P
Chirag Kakkar -2019B4A40671P Akash Saini -2019B4TS1274P
Shubh Varshney -2019B4A30771P Neel Malani -2019B4A40717P
Kaushik Pattanayak -2019B4A80653P Nakul Kumar Singh -2019B4A30740P
CONTENTS
1. Introduction

2. Mean Absolute Percentage Error (MAPE)

3. Definitions

4. Dataset

5. Correlation Matrices

6. Visual Normality Tests


6.1 Histogram Plot
6.2 Quantile – Quantile plot
7. Statistical Normality Tests

8. Chi – Square test for Best Fit Distribution

9. Stationary Tests
9.1 Augmented Dickey-Fuller (ADF) Test
9.2 Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test

10. Time Series Analysis


10.1 Removing Hourly Variation
10.2 Decomposition
10.3 Auto – Correlation Function
10.4 Partial Auto – Correlation Function
10.5 Predictions
10.5.1 AR model
10.5.2 MA model
10.5.3 ARMA model
10.5.4 ARIMA model
10.5.5 SARIMA model
10.5.6 ARIMA for monthly prediction vs SARIMA
11. Machine Learning: A forecasting tool?

12. Conclusion
1. Introduction
India is one of the countries with extensive production of energy from renewable sources. As of 27
November 2020, 38% of India's installed electricity generation capacity was from renewable sources (136
GW out of 373 GW) [1]. By 2030, this share is expected to reach 60%. However, at present most of India's
electricity comes from burning coal, petroleum, and biomass, and these supplies will be unable to keep
pace with the growing demand. The resulting prices and market volatility would cause severe problems
for India unless it shifts its energy demand to renewable sources.

This study focuses on India’s solar energy capacity. Solar energy presently accounts for only 4% of India’s
total power generation. The National Solar Mission (NSM) of India has set a target for the development
and deployment of 100 GW Solar Power by 2022. India needs to spend around $1.4 trillion over the next
20 years to make its energy supply sustainable, about 70% more than provided in current government
policy planning[2].

Solar energy has many benefits: it is pollution-free, inexpensive, and requires little maintenance, and
large parts of India receive abundant sunlight, so solar energy can be used very effectively. However, it
also has limitations: installation requires considerable space and capital, and the availability of sunlight
depends on climate, season, and weather. This study focuses on these conditions in four states of India,
namely Rajasthan, Madhya Pradesh, Andhra Pradesh, and Tamil Nadu. Hourly data from 2000 to 2014 for
one solar park in each of these states is used. Although the four parks lie in different parts of India, they
cannot capture the whole climatic variation of the country. This report deals with forecasting using time
series analysis, so the collected data form a time series of the availability of solar energy. The seasonality
component of the time series represents the seasonal availability of solar energy throughout the year.

The complete code for the analysis is on GitHub; the link is in the Appendix.

2. Mean Absolute Percentage Error (MAPE)


The mean absolute percentage error (MAPE) is a statistical measure of the accuracy of a forecasting
system, expressed as a percentage. It is one of the most common measures of forecast error. Since AR,
MA, ARMA, ARIMA, and SARIMA models are used for forecasting, MAPE is used throughout the report
to measure the forecasting accuracy of all these models.
MAPE = (1/n) Σₜ₌₁ⁿ |(Ft − Xt) / Xt|

Where:

𝑛 is the number of observations.

𝐹𝑡 is the forecast at time t.

𝑋𝑡 is the actual observation at time t.
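As a concrete illustration, the formula can be implemented in a few lines of Python (a minimal sketch; the data values are made up for the example):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, expressed as a percentage."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((forecast - actual) / actual))

# Forecasts off by 10% and 20% average to a MAPE of 15%
print(round(mape([100, 50], [110, 40]), 2))  # 15.0
```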

3. Definitions [3]
Direct Normal Irradiance (DNI): The amount of sunlight received per unit area by a surface held
perpendicular to the sun's rays. DNI corresponds to the rays that arrive in a straight line from the sun's
current position in the sky.

Diffuse Horizontal Irradiance (DHI): The amount of sunlight received per unit area by a surface that does
not lie on a direct path from the sun. In DHI, the sunlight is scattered by molecules and particles present
in the atmosphere, and it comes equally from all directions.

Global Horizontal Irradiance (GHI): The light received by the surface which is horizontal (parallel) to the
earth’s surface. It is given by:

GHI = DHI + DNI × cos(θ), where θ is the solar zenith angle (also denoted by z).

From the above relation, GHI considers both DNI and DHI, and is therefore chosen to be the prime
attribute for forecasting. Hence, various models will be used to forecast GHI for the four selected states.
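The relation can be sketched in Python; the irradiance and zenith-angle values below are hypothetical, chosen only to illustrate the formula:

```python
import numpy as np

def ghi(dhi, dni, zenith_deg):
    """GHI = DHI + DNI * cos(theta), with the zenith angle given in degrees."""
    return dhi + dni * np.cos(np.radians(zenith_deg))

# Hypothetical values: DHI = 100 W/m², DNI = 800 W/m², zenith angle = 60°
print(round(ghi(100.0, 800.0, 60.0), 2))  # 500.0, since cos(60°) = 0.5
```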

4. Dataset
The given dataset contains hourly data for four solar parks located in Rajasthan, Madhya Pradesh, Andhra
Pradesh, and Tamil Nadu for the years 2000 to 2014. In every entry, it has the following attributes:

1. Timestamp of measurement (min, hour, day, month, year)

2. DHI and Clear Sky DHI (W/m², watts per square meter)

3. DNI and Clear Sky DNI (W/m²)

4. GHI and Clear Sky GHI (W/m²)

5. Dew Point (℃)

6. Temperature (℃)

7. Pressure (mbar)

8. Relative Humidity (%)

9. Solar Zenith Angle (Degree)

10. Snow Depth (m)

11. Wind Speed (m/s)

5. Correlation Matrices
From the understanding and definitions of all variables, the candidate variables for forecasting can
be easily figured out. Here are correlation matrices of all four states using such variables obtained in
Python.
(Correlation matrices: Andhra Pradesh, Madhya Pradesh, Rajasthan, Tamil Nadu)

Observation: As the dataset contains many variables, forecasting all of them would be complicated and
time-consuming. Generally, if not pre-decided, the variable with the highest total correlation (or
covariance) with the remaining variables is chosen. Since the variables have different units (listed in the
Dataset section), correlation is used rather than covariance. For simplicity, the correlation matrix of
Rajasthan is examined below. Note that negative entries also contribute: a more negative value means a
stronger correlation, only in the inverse sense, so absolute values are used when summing.

(CS = Clear Sky, Temp = Temperature, RH = Relative Humidity; entries are absolute correlations)

Rajasthan   DHI    DNI    GHI    CS DHI  CS DNI  CS GHI  Temp   RH     SUM
DHI         1      0.098  0.62   0.86    0.15    0.74    0.68   0.14   4.288
DNI         0.098  1      0.68   0.079   0.69    0.43    0.13   0.37   3.477
GHI         0.62   0.68   1      0.71    0.63    0.92    0.63   0.13   5.32
CS DHI      0.86   0.079  0.71   1       0.098   0.79    0.8    0.065  4.402
CS DNI      0.15   0.69   0.63   0.098   1       0.65    0.083  0.27   3.571
CS GHI      0.74   0.43   0.92   0.79    0.65    1       0.66   0.024  5.214
Temp        0.68   0.13   0.63   0.8     0.083   0.66    1      0.097  4.08
RH          0.14   0.37   0.13   0.065   0.27    0.024   0.097  1      2.096

MAX = 5.32 (GHI)

Therefore, GHI is chosen as the forecasting variable; this agrees with the earlier section on definitions,
where GHI was identified as the natural candidate. Now, let us find the distribution followed by GHI.
The most common and useful distribution in statistics is the normal distribution, so let us start with it.

6. Visual Normality Tests[4]


6.1. Histogram Plot: Madhya Pradesh
The histogram is a straightforward and common way to make a quick check. To build it, the data range
is divided into a pre-specified number of groups known as bins; the data are sorted into the bins, and
the number of observations in each bin is counted. The X-axis represents the bins, and the Y-axis
represents the count in each bin.

Here, the histogram does not show the familiar bell shape of the normal distribution, so it can be
concluded that GHI does not follow a normal distribution.

6.2. Quantile-Quantile Plot:
Another popular plot is the quantile-quantile plot, or QQ plot for short. A QQ plot is a scatterplot
created by plotting two sets of quantiles against one another. If both sets of quantiles come from the
same distribution, the points form a roughly straight line. In Python, a reference sample from the
normal distribution is generated, both samples are divided into groups called quantiles, and each data
point of the sample is paired with the corresponding member of the normal sample. The X-axis
represents the theoretical quantiles (of the normal distribution) and the Y-axis represents the given
sample quantiles.

As there are many deviations from the line, especially at the bottom and top of the plot, it can be
concluded that GHI does not follow a normal distribution.

7. Statistical Normality Tests[5]


Visual tests are quick ways to judge a distribution, but they do not provide formal proof and are not
strong tests. Hence, the Kolmogorov-Smirnov (KS) test, a statistical normality test, is used. The KS test
compares the CDFs of two samples: if both samples follow the same distribution, their CDF plots will
overlap. To carry out the test, consider the following hypotheses:
H₀: GHI follows a normal distribution.
Hₐ: GHI does not follow a normal distribution.
When the GHI data of all four states is tested in Python, a p-value ≈ 0.00 is obtained for every state.
This is less than any reasonable choice of α, so the null hypothesis is rejected and it can be concluded
that GHI does not follow the normal distribution. Moreover, the CDF plots of GHI and of normal data
have a significant gap between them.
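A minimal sketch of such a KS test in Python with SciPy is shown below; the skewed gamma sample stands in for GHI data, since the actual dataset is not reproduced here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for daytime GHI values: a right-skewed sample
sample = rng.gamma(2.0, 150.0, size=5000)

# KS test of the sample against a normal distribution fitted to it
mu, sigma = sample.mean(), sample.std()
stat, p_value = stats.kstest(sample, "norm", args=(mu, sigma))
print(p_value < 0.05)  # True: normality is rejected for this skewed sample
```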

8. Chi-Square Test for Best Fit Distribution


To find the actual distribution of GHI, several well-known
distributions were fitted in Python. The candidate
distributions considered were Beta, Triangular, Gamma,
Inverse Gaussian, Uniform, Exponential, Weibull Max,
Weibull Min, Lognormal, Chi-squared, etc. Of all of
them, the Beta distribution was the best fit. The figure
shows the five best-fitting distributions for Madhya
Pradesh; the best fit is a Beta distribution with α = 1.685 and β = 1.369.
Madhya Pradesh, Best Fit Distribution of GHI
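The procedure can be sketched as follows: each candidate distribution is fitted, expected bin counts are computed from its CDF, and the chi-square statistic is compared across candidates. The synthetic beta sample and the small candidate set are stand-ins for the real data and the full candidate list:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.beta(1.685, 1.369, size=3000)  # hypothetical stand-in for scaled GHI

candidates = {"beta": stats.beta, "norm": stats.norm, "expon": stats.expon}
observed, edges = np.histogram(data, bins=20)

results = {}
for name, dist in candidates.items():
    params = dist.fit(data)                      # maximum-likelihood fit
    expected = len(data) * np.diff(dist.cdf(edges, *params))
    expected = np.clip(expected, 1e-9, None)     # guard against empty bins
    results[name] = np.sum((observed - expected) ** 2 / expected)

# The candidate with the smallest chi-square statistic fits best
print(min(results, key=results.get))
```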

9. Stationary Tests[6]
Time series forecasting gives the best results when the data is stationary. A stationary series is one
whose statistical properties, such as mean and variance, remain constant over time. Data becomes
non-stationary when it contains seasonality and trend components. Trend represents a long-term
increase or decrease in the data; seasonality represents a recurring pattern at a fixed and known
frequency. When the GHI of all four states is plotted, cyclic behavior and seasonality are visible, but
the trend component is missing. A series that is cyclic but lacks a trend can still be stationary. To be
more precise, let us carry out two statistical tests.

9.1. Augmented Dickey-Fuller (ADF) Test


The ADF test checks the stationarity of the data based on the unit root, and hence it is a unit root test.
To understand a unit root, consider plotting the data along with its mean line and moving along the
series: there will be deviations from the mean line. If the series keeps returning to the mean line, the
data does not have a unit root; if it does not return, the data has a unit root. In short, if a series goes
through a shock and tends to regain its original path, it does not have a unit root. Unit roots are
responsible for non-stationarity.
To carry out this test let us consider the hypothesis,
𝐻0 : The series has a unit root.
𝐻𝑎 : The series does not have a unit root.
After obtaining the p-values for all four states in Python, the null hypothesis can be rejected, and with
more than 99% confidence it can be concluded that the series is stationary.

9.2. Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test


The KPSS test is similar to the ADF test. It is a test for the stationarity of a given series around a
deterministic trend. From the GHI plots, it can be observed that there is no trend, neither seasonal nor
deterministic, so the series also appears stationary with respect to deterministic trends. However, to
be more precise, let us look at the results obtained in Python. The hypotheses are:
H₀: The series is stationary.
Hₐ: The series is not stationary.
As the p-value comes out to be 0.1, the null hypothesis cannot be rejected at the 95% confidence
level. Hence, the series is stationary.

10. Time Series Analysis
10.1 Removing Hourly Variation:
By summing up all the values over a day, the hourly time series is converted into daily time series.
This removes the variation of GHI values over the span of a day, which are not relevant to this analysis
and forecasting. It also reduces the size of our time series, making computation relatively easier.
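With the data in a pandas Series indexed by timestamp, this daily aggregation is one line (a sketch with hypothetical hourly values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx = pd.date_range("2000-01-01", periods=24 * 14, freq="h")  # two weeks, hourly
hourly_ghi = pd.Series(rng.uniform(0, 900, size=idx.size), index=idx, name="GHI")

# Summing over each day turns the hourly series into a daily series
daily_ghi = hourly_ghi.resample("D").sum()
print(len(hourly_ghi), len(daily_ghi))  # 336 14
```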

10.2 Decomposition:
Let us choose Madhya Pradesh to begin with the time series analysis. On plotting the daily and
monthly GHI data of MP, the seasonality component can be clearly seen. Hence, Python is used to
show the decomposition of daily GHI data in trend, seasonality, and residual components.

Madhya Pradesh - Daily

Madhya Pradesh – Daily Decomposition

10.3 Autocorrelation Function (ACF):


The ACF considers all past observations irrespective of their effect on the future or present time. It
calculates the correlation between the observation at the current time and the observations at
previous times (lags). Here is the ACF of Madhya Pradesh.

10.4 Partial Autocorrelation Function (PACF):
The PACF considers the partial correlation between the observation at the current time and the
observations at previous times. It does not take into account all the time lags between the current and
past times; it considers only the lags having a direct impact on the future period, neglecting the
insignificant intermediate lags. Here is the PACF of Madhya
Pradesh.

10.5 Prediction:

• AR Model - In a multiple regression model, we forecast a variable using a linear combination of
predictors. An autoregressive (AR) model is a kind of multiple regression model in which the variable
is forecast using a linear combination of past values of the variable of interest itself, i.e.,

Yt = f(Yt−1, Yt−2, … , ϵt)

where ϵt is the "white noise" error term.


For an AR model of order p, where p denotes the number of past values considered, written as
AR(p) [7], we can write:

Yt = β0 + β1Yt−1 + ⋯ + βpYt−p + ϵt
The term “autoregression” indicates regression of the variable against itself [7].

The PACF plot removes the indirect correlation between the current step and each lag, and hence is
an excellent tool for identifying the order p of the AR model. In the PACF plot above, the lags with
PACF above the significance threshold are useful for the AR model, so a possible value of p is 19
when looking at the first 20 lags.
Applying a grid search for the best fit of the AR model, the value p = 14 was obtained; running an AR
model with p = 14 gave a MAPE of 14.06%.
We got the residual ACF plot after running the above model, as:

ACF Plot – AR(14) residual

Generally, for an ACF plot of the residuals, all the values should lie inside the red region, which
implies that the correlations are statistically zero, i.e., the residuals behave like white noise. Clearly,
there are a few significant points in our plot, which means that the residuals still contain structure
beyond white noise, and the AR model might not give the best prediction for our time series.
To forecast GHI data on a daily and weekly basis, we use the rolling forecast method. A rolling forecast
is an add/drop process for predicting the future over a set period [9].

We select p = 8, i.e., the AR(8) model, since for larger values of p (10 to 14) the rolling computation
becomes very expensive. Daily and weekly rolling forecasts gave MAPE values of 14.96% and
18.74%, respectively. As the forecast span increases, the MAPE value also increases.
Weekly Predictions – AR(8)

• MA model - Rather than a regression on past values of the variable as in AR, the moving average
(MA) method uses past forecast errors, which follow a white noise process, in a regression-like
equation [7].

Consider an MA model with parameter q, written as MA(q) [8]:

Yt = f(ϵt, ϵt−1, ϵt−2, … , ϵt−q)

Yt = ϕ0 + ϵt + ϕ1ϵt−1 + ϕ2ϵt−2 + ⋯ + ϕqϵt−q

Here, each value of Yt can be regarded as a weighted average of past errors, where the ϵᵢ are white
noise error terms with zero mean and constant variance.

Just as the PACF plot guides the AR order, the ACF plot is a good estimator of the number of lags,
i.e., the order q, for the MA model.

From the ACF plot above, the suggested value of q is 53, but due to computational limitations we
take q = 30, found by grid searching for the best fit of the MA model. Running an MA model with
q = 30 gave a MAPE of 14.25%.
We got the residual ACF plot after running the above model, as:

ACF Plot – MA(30) Residual

Similar to the AR case, there are a few significant points in the plot (outside the red region). Hence,
the MA model might also not be the best fit for our forecasting. However, from the MAPE values of
the MA and AR models, the AR model clearly works better than the MA model for this series.

Although q = 30 gives a well-fitting model, with such a high value of q it is impractical to train a
rolling model on such a large dataset. Instead, q = 7 is used for the daily and weekly forecasting,
which gives a decent fit. Rolling forecasts for MA(7) gave daily and weekly MAPE values of 16.008%
and 22.26%, respectively.

Weekly Predictions – MA(7)

• ARMA model - The autoregressive moving average (ARMA) model combines the AR and MA models.
It uses both the previous lags of the series and the past residuals to forecast future values of the
time series [8].

The model has two parameters, p and q, for the AR and MA parts respectively. ARMA(p, q) can be
written as:

Yt = f(Yt−1, Yt−2, … , ϵt) + g(ϵt−1, ϵt−2, … , ϵt−q)

Yt = β0 + β1Yt−1 + ⋯ + βpYt−p + ϵt + ϕ1ϵt−1 + ϕ2ϵt−2 + ⋯ + ϕqϵt−q

We did a grid search over the parameters p and q, within computational constraints, and obtained
best values of 14 and 20, respectively. The model turns out to be a decent fit for the time series, as
can be seen from the residual ACF plot, where fewer values are significant than in the previous two
methods; the MAPE came out as 13.93%.

ACF Plot – ARMA(14,20) Residual

Due to computational costs, the p and q values for the rolling forecasts were chosen as 4 and 1,
respectively. The rolling forecast gave daily and weekly MAPE values of 14.91% and 18.37%,
respectively.

Weekly Predictions – ARMA(4,1)

• ARIMA model - If we combine differencing with autoregression and a moving average model, we
obtain a "non-seasonal" ARIMA model. ARIMA is an acronym for Auto-Regressive Integrated Moving
Average. The ARIMA(p, d, q) model, where p is the order of the autoregressive part, d is the degree
of differencing involved, and q is the order of the moving average part, can be written as [7]:

Y′t = c + ϕ1Y′t−1 + ⋯ + ϕpY′t−p + θ1ϵt−1 + ⋯ + θqϵt−q + ϵt

where Y′t is the differenced series (it may have been differenced more than once). The "predictors"
on the right-hand side include both lagged values of Yt and lagged errors.

We performed a grid search over parameters with d ≥ 1. An order of (12, 2, 10) fit our model best
(MAPE 13.76%) and left no significant correlation in the residuals.

ACF Plot – ARIMA(12,2,10) Residual

However, this model is too computationally intensive to use on a rolling forecast. Instead,
we used an order of (4,1,5) for our rolling forecast, giving us a daily MAPE of 14.83% and weekly
MAPE of 18.18%, the best of all the forecasting methods employed so far.

Weekly predictions – ARIMA (4,1,5)

• SARIMA model - So far, only non-seasonal models have been considered. The ARIMA framework is
also capable of modeling a wide range of seasonal data; this extension is known as the Seasonal
ARIMA, or SARIMA, model. It involves differencing at lags that are multiples of the seasonal
period [8]. A simple SARIMA model adds a seasonal order to the ARIMA(p, d, q) model and can be
represented as [7]:

(p, d, q) (P, D, Q)s

where (p, d, q) is the non-seasonal part, (P, D, Q) is the seasonal part, and s represents the
seasonality, i.e., the number of observations per seasonal cycle (e.g., per year).
Time Complexity with SARIMA - Often, higher-frequency data such as weekly (s = 52) or monthly
(s = 12) are used with the SARIMA model. Since daily and weekly forecasts were needed, modeling
was first tried with a seasonality of 365; this ran for a very long time without producing results. Next,
the data was aggregated into weekly totals and a seasonality of 52 was used, with similar results. It
was concluded that seasonalities of 365 and 52 are extremely computationally expensive.

Hence, the data was finally aggregated into monthly totals, and a seasonality of 12 was used for the
monthly forecasting of our time series.

Through grid searching for the best SARIMA fit, the optimal parameters were found to be (3, 0, 3) for
the non-seasonal part and (2, 0, 1) for the seasonal part, with a seasonality of 12. The MAPE value
obtained with this model was 5.94%.

ACF Plot – SARIMA (3,0,3)(2,0,1)12 Residual

From the residual plot, the correlations in the residuals are insignificant. Hence, this may be the best
method for monthly forecasting among all the methods employed so far.
Monthly predictions of our SARIMA model can be seen in the plot below:

Monthly predictions – SARIMA (3, 0, 3)(2, 0, 1)12

• ARIMA for monthly prediction vs SARIMA

A monthly analysis with ARIMA was also done for comparison. For monthly rolling forecasts with the
ARIMA model, the optimal parameters were (5, 1, 3), and the corresponding monthly forecast MAPE
was 11.433%.

Model (monthly analysis)       ARIMA        SARIMA
Parameters                     (5, 1, 3)    (3, 0, 3)(2, 0, 1)12
Monthly forecast MAPE          11.433%      5.94%

From this it can be concluded that the SARIMA model is better than the ARIMA model for monthly
forecasts. However, due to its high computational cost, SARIMA cannot be used for the daily and
weekly analysis, for which ARIMA is therefore preferred. The monthly prediction plots also lead to
similar conclusions (see Appendix).

11. Machine Learning: A Forecasting Tool?


11.1 ML in time series forecasting

Time Series Analysis and Forecasting have always been crucial research areas in many domains
because many different types of data are stored as time series. Given the growing availability of data
and computing power in recent years, Machine Learning has become a fundamental part of the new
generation of Time Series Forecasting models, obtaining excellent results.

In classical models such as AR and MA, feature engineering is performed manually and parameters
require manual optimization. Machine learning models, in contrast, learn features and dynamics
directly from the data. Thanks to this, they speed up data preparation and can learn more complex
data patterns in a more complete and faster way.

As different time series problems are studied in many fields, many new architectures have been
developed in recent years. This has also been made easier by the growing availability of open-source
frameworks, which make building custom network components more straightforward and faster.

11.2 Well Known Methods in ML

• Recurrent Neural Networks (RNNs): They are the most classical and used architecture for Time
Series Forecasting problems.

• Long Short-Term Memory (LSTM): They are an evolution of RNNs developed to overcome the
vanishing gradient problem.

• Gated Recurrent Unit (GRU): They are another evolution of RNNs, like LSTM.

• Encoder-Decoder Model: This is a model for RNNs introduced to address the problems where
input sequences differ in length from output sequences.

• Attention Mechanism: This is an evolution of the Encoder-Decoder Model, developed to avoid
forgetting the earlier parts of the sequence.

11.3 Complexities under univariate forecasting

This assignment deals with univariate forecasting, and studies have suggested that classical time
series models are more accurate for simple univariate forecasting. Classical methods are also easily
extendable, whereas machine learning models are more target-specific. Machine learning may
provide a benefit on datasets with complex irregular data, missing observations, heavy noise, or
complex interrelationships between multiple variates; but for our simple univariate forecasting,
classical methods are suggested because of their higher accuracy and lower computational cost.

12. Conclusion

In this study we performed various time series forecasting methods, namely AR, MA, ARMA,
ARIMA, and SARIMA, on the given data. The data consist of four solar parks situated in
Rajasthan, Madhya Pradesh, Andhra Pradesh, and Tamil Nadu. We first identified the
forecasting variable from the definitions of all the given variables and from the correlation
matrices. We selected GHI for forecasting and checked whether it follows a normal
distribution; we then found its actual distribution in Python. Since data must be stationary
for forecasting, we performed stationarity tests.

Selecting the model for forecasting - We analyzed the dataset and tested multiple models.
According to the MAPE values observed, ARIMA was the best fit for our daily and weekly
time series forecasting. For monthly analysis, SARIMA was a better choice than ARIMA;
however, due to its very high computational cost, SARIMA cannot be used for the daily and
weekly forecasting that is our main concern in this analysis.

We also considered machine learning models for time series forecasting. However, studies
have suggested that classical time series models are more accurate for simple univariate
forecasting. Moreover, classical methods are easily extendable, whereas machine learning
models are more target-specific. Although machine learning models can be used for time
series forecasting, they are not recommended in our case of "simple univariate"
forecasting.

Appendix

• The complete code for the analysis is available here.

• FORECASTS: Daily Predictions

Daily Predictions – AR(8)

Daily Predictions – MA(7)

Daily Predictions – ARMA(4,1)

Daily Predictions – ARIMA(4,1,5)

• ARIMA monthly forecast vs SARIMA

Monthly Predictions – ARIMA(5, 1, 3)

Monthly Predictions – SARIMA(3,0,3)(2,0,1)12

References

[1] Koundal, A. (2020). ETEnergyWorld


https://energy.economictimes.indiatimes.com/news/renewable/indias-renewable-power-capacity-is-the-fourth-largest-in-the-world-says-pm-modi/79430910

[2] Annual Report, Ministry of New and Renewable Energy, Government of India (2020-21)
https://mnre.gov.in/img/documents/uploads/file_f-1618564141288.pdf

[3] Singh, S., Solar Irradiance Concepts: DNI, DHI, GHI & GTI
https://www.yellowhaze.in/solar-irradiance/

[4] Brownlee, J. (2019). A Gentle Introduction to Normality Tests in Python, Statistics


https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/

[5] Kolmogorov-Smirnov Goodness-of-Fit Test, Engineering Statistics Handbook


https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm

[6] Perktold, J., Seabold, S. & Taylor, J. (2021). Stationarity and detrending (ADF/KPSS)
https://www.statsmodels.org/stable/examples/notebooks/generated/stationarity_detrending_adf_kpss.html

[7] Hyndman, R.J. & Athanasopoulos, G. (Monash University, Australia). Forecasting:
Principles and Practice (2nd ed.)
https://otexts.com/fpp2/seasonal-arima.html

[8] Shetty, C. (2020). Time Series Models AR, MA, ARMA, ARIMA
https://towardsdatascience.com/time-series-models-d9266f8ac7b0

[9] CFI Education Inc., Rolling Forecast: A financial model that moves forward one month at
a time, what is rolling forecast?
https://corporatefinanceinstitute.com/resources/knowledge/accounting/rolling-forecast/
[10] Yiu, T. (2020). Understanding SARIMA (More Time Series Modeling)
https://towardsdatascience.com/understanding-sarima-955fe217bc77

