College of Science
Department of Mathematics
- STA 564 -
By Kenneth Guzman
December 7, 2018
Contents
1 INTRODUCTION
7 Conclusion
References
TIME SERIES ANALYSIS FORECASTING
NOTES
At its core, the behavior of air pollution in the atmosphere is strongly governed by
meteorology. However, the "univariate" models we will consider assume that the final
concentration of air pollutants in the atmosphere is the net result of all the complex
interactions of meteorology, chemistry, transport, diffusion, etc. For this reason, the combined
information about their effect on air pollutant concentration is contained in the corresponding
time series in a stochastic way. Under this approach, calculations are simplified and performed
using only the time series of the pollutant, without explicit inclusion of meteorological or
other measurements.
Four professors from Plovdiv University in Bulgaria produced a research paper on time series
analysis concerning air pollution. The methods used were explicitly stated in their article as:
(i) Identify correlation type dependencies and grouping of observed air pollutants using the
method of factor analysis to explain mutual effects of pollution.
(ii) Conduct time series analysis by determining relevant seasonal ARIMA parametric models of
the pollutants (based on hourly data).
(iii) Analysis and diagnostics of the constructed models.
(iv) Application of the models for short-term forecasting.
(v) Interpretation of the results and definition of the conditions contributing to the exceeding
of national and European concentration norms for the considered air pollutants.
Their study was carried out using IBM SPSS 19 and EViews 7 [3].
1 INTRODUCTION
Even though there are established regulations for monitoring and controlling effects on
air quality in certain territories, air quality may remain unsatisfactory. Let's consider the
particular case of the town of Blagoevgrad, Bulgaria. Blagoevgrad is a typical representative
of a small urban region, with a population of approximately 70,000. The study spanned a
one-year period from September 1st, 2011 to August 31st, 2012, during which six air pollutants
were observed through hourly measurements. Factor analysis and Box-Jenkins methodology were
applied to inspect concentrations of the primary air pollutants of interest. The pollutants
were grouped into three factors and the degree of contribution of each factor to the overall
pollution was determined; this contribution was interpreted as the presence of common sources
of pollution. The classical techniques of principal component analysis (PCA) and factor
analysis are important statistical instruments frequently used in the environmental sciences.
The focus of the study was time series analysis and the development of univariate stochastic
seasonal autoregressive integrated moving average (SARIMA) models, with seasonality based on
the hourly recording of the data. The study incorporates the Yeo-Johnson power transformation
to stabilize the variance of the data, and model selection using the Bayesian Information
Criterion (BIC). The SARIMA models obtained in the Bulgaria study demonstrated a good fit to
the observed air pollutants and good short-term predictions for 72 hours ahead, specifically
in the case of ozone and particulate matter PM10. The methods presented allowed the building
of less complex models that are effective for short-term air pollution forecasting and useful
for advance warning purposes in urban areas [3].
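As a rough illustration of selecting a model by the Bayesian Information Criterion, the sketch below fits a few candidate seasonal ARIMA models in base R and keeps the one with the lowest BIC. The built-in AirPassengers series and the four candidate orders are stand-ins chosen only for illustration; they are not the Bulgarian data or the models reported in [3].

```r
# Stand-in data: monthly AirPassengers (NOT the hourly pollutant series from the study).
y <- log(AirPassengers)

# Hypothetical candidate (p,d,q)(P,D,Q) specifications with period 12.
candidates <- list(
  "(0,1,1)(0,1,1)" = list(order = c(0, 1, 1), seasonal = c(0, 1, 1)),
  "(1,1,0)(0,1,1)" = list(order = c(1, 1, 0), seasonal = c(0, 1, 1)),
  "(1,1,1)(0,1,1)" = list(order = c(1, 1, 1), seasonal = c(0, 1, 1)),
  "(0,1,1)(1,1,0)" = list(order = c(0, 1, 1), seasonal = c(1, 1, 0))
)

# BIC = -2 * logLik + log(n) * (number of estimated parameters).
# AIC() with penalty k = log(n) returns this quantity.
bic_values <- sapply(candidates, function(spec) {
  fit <- arima(y, order = spec$order,
               seasonal = list(order = spec$seasonal, period = 12))
  AIC(fit, k = log(fit$nobs))
})

best_model <- names(which.min(bic_values))
print(round(bic_values, 2))
print(best_model)
```

All four candidates use the same differencing, so their BIC values are computed on the same number of observations and can be compared directly.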
Continuous and careful monitoring and forecasting of atmospheric air pollutants is important
when evaluating regulatory control measures related to air quality. In Bulgaria, 12 types
of pollutants are systematically monitored by more than 36 automated stations run by the
Executive Environment Agency (EEA), which manages and coordinates activities related to
the control and environmental protection of the country. Atmospheric air quality reports
for the various regions of the country are regularly published, and a large amount of data
accumulates as a result. This accumulated data is what allows us to carry out statistical
analysis, which leads to the discovery of general patterns and dependencies for different time
periods and of relationships between observed air pollutants. The air pollutants observed in
the Blagoevgrad study are concentrations of particulate matter PM10, nitrogen oxide NO,
nitrogen dioxide NO2, nitrogen oxides NOx, sulfur dioxide SO2, and ground-level ozone O3.
The measurements are expressed in units of mass concentration, µg/m3; only NOx is in
ppb (parts per billion), as it captures pollution from all kinds of nitrogen oxides. The data
consisted of 8,744 hourly observations. The goal of their study was to demonstrate the
capabilities of the mentioned methods, which can be applied to other recorded data sets,
including ones covering shorter or longer periods of time.
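\[
g(x;\lambda) =
\mathbf{1}_{(\lambda \neq 0,\, x \geq 0)} \frac{(x+1)^{\lambda} - 1}{\lambda}
+ \mathbf{1}_{(\lambda = 0,\, x \geq 0)} \log(x+1)
+ \mathbf{1}_{(\lambda \neq 2,\, x < 0)} \frac{(1-x)^{2-\lambda} - 1}{\lambda - 2}
+ \mathbf{1}_{(\lambda = 2,\, x < 0)} \bigl(-\log(1-x)\bigr)
\]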
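The piecewise Yeo-Johnson definition translates almost line for line into R. The following is a minimal base-R sketch, assuming a known λ and no missing values. The function name yeo_johnson and the tolerance tol are my own choices for illustration and are not taken from the bestNormalize package, which additionally estimates λ from the data.

```r
# Yeo-Johnson transformation g(x; lambda) for a fixed, user-supplied lambda.
# `yeo_johnson` and `tol` are illustrative names, not part of any package.
yeo_johnson <- function(x, lambda, tol = 1e-8) {
  out <- numeric(length(x))
  nonneg <- x >= 0

  # Branches for x >= 0
  if (abs(lambda) > tol) {
    out[nonneg] <- ((x[nonneg] + 1)^lambda - 1) / lambda
  } else {
    out[nonneg] <- log(x[nonneg] + 1)
  }

  # Branches for x < 0
  if (abs(lambda - 2) > tol) {
    out[!nonneg] <- ((1 - x[!nonneg])^(2 - lambda) - 1) / (lambda - 2)
  } else {
    out[!nonneg] <- -log(1 - x[!nonneg])
  }
  out
}

x <- c(-2, -0.5, 0, 1, 4)
print(yeo_johnson(x, lambda = 1))     # lambda = 1 leaves the data unchanged
print(yeo_johnson(x, lambda = 0.05))  # a small positive lambda compresses large values
```

Setting λ = 1 is a convenient sanity check, because both the x ≥ 0 branch and the x < 0 branch then reduce to the identity.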
where p is the number of autoregressive terms, d is the number of nonseasonal differences
needed to reach stationarity, and q is the number of lagged forecast errors in the prediction
equation. Similarly, the SARIMA models take the general form ARIMA(p,d,q)(P,D,Q)_s, where P
is the number of seasonal autoregressive terms, D is the order of seasonal differencing, and
Q is the number of seasonal moving average terms. In the seasonal part of the model, the three
parameters P, D, Q operate across multiples of lag s, where s is the number of time periods
until a pattern repeats itself.
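To make the notation concrete, the sketch below shows how the orders (p,d,q)(P,D,Q) and the period s map onto the arguments of base R's arima() function. The AirPassengers series (s = 12) and the chosen orders are assumptions made purely for illustration and have nothing to do with the pollutant data.

```r
# Stand-in data with a yearly pattern in monthly observations, so s = 12.
y <- log(AirPassengers)

# ARIMA(0,1,1)(0,1,1)_12:
#   order    = c(p, d, q)  -> nonseasonal part
#   seasonal = c(P, D, Q)  -> seasonal part, acting at lags s, 2s, ...
fit <- arima(y,
             order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# arima() stores the specification as c(p, q, P, Q, s, d, D).
print(fit$arma)

# Point predictions for the next 12 periods (one full seasonal cycle).
pred <- predict(fit, n.ahead = 12)
print(pred$pred)
```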
Before I proceed, I would like to point out that while the research paper concerning Bulgaria
highlighted a factor analysis and principal component analysis approach, the correlation
matrix calculated in R for the Richmond pollutant data sets displayed no signs of positive or
negative correlation between the pollutants; therefore, I did not carry out any factor
analysis or PCA. Also, the 2013 pollutant data sets were used strictly to compare our forecast
models to the actual data recorded by the EPA in 2013.
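A correlation check of this kind can be sketched in a few lines of base R. The data frame below is filled with independently simulated values, so it is only a placeholder for the Richmond measurements; the column names follow the seven pollutants discussed in this report, and the object names are hypothetical.

```r
set.seed(564)
n <- 174  # same length as the 3-year series, used here only for illustration

# Placeholder data: seven independent random columns, NOT the EPA observations.
pollutants <- data.frame(
  PM2.5 = rnorm(n), PM10 = rnorm(n), Pb = rnorm(n), CO = rnorm(n),
  O3 = rnorm(n), SO2 = rnorm(n), NO2 = rnorm(n)
)

# Pearson correlation matrix; pairwise deletion tolerates missing readings.
corr <- cor(pollutants, use = "pairwise.complete.obs")
print(round(corr, 2))

# Largest absolute correlation between two different pollutants.
max_abs <- max(abs(corr[upper.tri(corr)]))
print(max_abs)
```

If the largest off-diagonal entry is close to zero, there is little shared variation for factor analysis or PCA to summarize, which is the reasoning behind skipping that step.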
Directly below is the correlation matrix for all 7 pollutants concerning data over the time
span of the years 2010, 2011, and 2012.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for PM-2.5 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original PM-2.5 2013 observations was
λ = 0.05030683.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe
the auto.arima function is most appropriate for predicting the 2013 values of the pollutant
PM-2.5.
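The comparison between each model and the 2013 data is made visually in this report. One hedged way to put a number on such a comparison is to compute an error measure between a model's 59 predicted values and the 59 observed 2013 values, both on the transformed scale. In the sketch below every vector is a simulated placeholder rather than the Richmond data, and rmse and mae are small helper functions defined here, not functions from the forecast package.

```r
# Root mean squared error and mean absolute error between two equal-length vectors.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

set.seed(2013)
# Placeholder for the 59 transformed 2013 observations.
actual_2013 <- rnorm(59, mean = 2, sd = 0.5)

# Placeholders for two competing sets of 59 predictions.
pred_model_a <- actual_2013 + rnorm(59, sd = 0.2)
pred_model_b <- actual_2013 + rnorm(59, sd = 0.6)

scores <- data.frame(
  model = c("model_a", "model_b"),
  rmse  = c(rmse(actual_2013, pred_model_a), rmse(actual_2013, pred_model_b)),
  mae   = c(mae(actual_2013, pred_model_a),  mae(actual_2013, pred_model_b))
)
print(scores)
```

Lower values indicate predictions that sit closer to the observed series, which gives a numerical counterpart to judging the overlaid plots by eye.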
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
Now that we have our ARIMA model, the next step was to access our 2013 pollutant concentration
data for PM-2.5 to see how accurately auto.arima() predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original PM-2.5 2013 observations was
λ = 0.05030683.
Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the
auto.arima function is somewhat appropriate for predicting the trend of the pollutant PM-2.5
for the year 2013.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Below is the time series plot using only the forecast function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for PM10 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original PM10 2013 observations was
λ = 0.7845362.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe
neither the forecast nor the auto.arima function is appropriate for predicting the 2013 values
of the pollutant PM10.
Below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for Pb and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original Pb 2013 observations was
λ = −4.99994.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe
the auto.arima function is most appropriate for predicting the 2013 values of the pollutant Pb.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for CO and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original CO 2013 observations was
λ = −2.432302.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe
the auto.arima function is most appropriate for predicting the 2013 values of the pollutant CO.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
Now that we have our ARIMA model, the next step was to access our 2013 pollutant concentration
data for CO and see how accurately the ARIMA model predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original CO 2013 observations was
λ = −2.432302.
Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the
auto.arima function is most appropriate for predicting the trend of the pollutant CO for
2013.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for O3 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original O3 2013 observations was
λ = 4.99994.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe
the forecast function is most appropriate for predicting the 2013 values of the pollutant O3.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for SO2 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original SO2 2013 observations was
λ = 0.2616144.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe
the forecast function is most appropriate for predicting the 2013 values of the pollutant SO2.
Below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for NO2 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original NO2 2013 observations was
λ = 1.003092.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe
the auto.arima function is most appropriate for predicting the 2013 values of the pollutant
NO2.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
Now that we have our ARIMA model, the next step was to access our 2013 pollutant concentration
data for NO2 and see how accurately the model predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original NO2 2013 observations was
λ = 1.003092.
Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the
auto.arima function is most appropriate for predicting the trend of the pollutant NO2 for
2013.
• From MASS, the function truehist was used to plot histograms of the pollutant data
before and after the Yeo-Johnson transformation was applied, to visually show the
change from a non-normal to an approximately normal distribution of the data.
• From bestNormalize, the function yeojohnson was used to transform the pollutant
data from non-normal toward normally distributed, in order to better carry out our
statistical analysis.
• From forecast, the functions forecast and auto.arima were used, each playing the most
important role in analyzing prior pollutant observations and forecasting future
values for each pollutant as accurately as these functions allow.
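The before-and-after histogram comparison described in the first two bullets can be sketched as follows. This assumes the MASS and bestNormalize packages are installed; the right-skewed sample is simulated and merely stands in for a pollutant series, and the object names are hypothetical.

```r
library(MASS)           # truehist()
library(bestNormalize)  # yeojohnson()

set.seed(1)
# Simulated right-skewed placeholder for 174 pollutant readings (NOT the EPA data).
x <- rlnorm(174, meanlog = 2, sdlog = 0.6)

# Estimate lambda from the data and transform, without the extra standardization step.
yj <- yeojohnson(x, standardize = FALSE)
print(yj$lambda)  # the estimated lambda

# Side-by-side histograms of the raw and transformed values.
old_par <- par(mfrow = c(1, 2))
truehist(x, xlab = "original values")
title(main = "Before")
truehist(yj$x.t, xlab = "transformed values")
title(main = "After")
par(old_par)
```

The transformed values are stored in the x.t component of the fitted object, so the second histogram shows how far the transformation has moved the data toward a symmetric shape.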
The main functions that I will highlight in this section are the forecast() and auto.arima()
functions in R, but I will also briefly explain my usage of the ts() and yeojohnson() functions.
It was very important to my study that level=F was set within the forecast function: while
having confidence intervals in our graphs could be useful, they were not needed for my study,
since I was mostly interested in the specific values that the forecast function gave in its
output. Also, with the forecast package it was very important that we forecast exactly 59
future values, simply because there are exactly 59 values in our EPA 2013 data for each
pollutant. In the auto.arima() function, no restrictions needed to be set within the call, but
it was most important that we accessed our forecast values by auto.arima()$f (R resolves $f by
partial matching to the fitted component of the model, which holds its in-sample fitted
values), and just for reference we are also able to access the original values that were put
into the function by using auto.arima()$x.
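A minimal sketch of this workflow is given below, assuming the forecast package is installed. The series is simulated and only mimics the shape of the data (174 values, 58 per year); seasonal = FALSE is my own simplification to keep the sketch quick and is not a setting taken from the study. Reading the mean component of the forecast object returns the point forecasts directly, so no interval setting is needed for that purpose.

```r
library(forecast)

set.seed(10)
# Simulated stand-in for 174 transformed observations (3 years x 58 per year).
sim <- as.numeric(arima.sim(list(ar = 0.6), n = 174)) + 10
y <- ts(sim, start = c(2010, 1), frequency = 58)

# Automatic order selection. seasonal = FALSE is an assumption made for speed;
# drop it to let auto.arima() consider seasonal terms as well.
fit <- auto.arima(y, seasonal = FALSE)

# Forecast exactly 59 steps ahead, one per 2013 observation.
fc <- forecast(fit, h = 59)

point_forecasts <- fc$mean     # 59 out-of-sample point forecasts
fitted_values   <- fit$fitted  # in-sample fitted values (what $f partially matches)
original_values <- fit$x       # the series that was passed to auto.arima()

print(head(point_forecasts))
```

Keeping the out-of-sample forecasts and the in-sample fitted values in separate objects makes it clear which of the two is being plotted against the 2013 data.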
One last note: when I was plotting the time series for the 3-year data, you should notice that
within each ts() call frequency=58, which I interpret as there being an average of 58
observations per year. I obtained 58 by dividing the total number of observations in our
3-year data by 3, so 174/3 = 58. Within the yeojohnson() function you will notice that
standardize=FALSE. This is because, if it is not declared, by default R will additionally
standardize the transformed values. I did not find the further standardization useful when
dealing with the Richmond data, mainly because the Yeo-Johnson transformation itself was of
interest in the Bulgaria study, so I wanted to follow that transformation as it is, without
further standardization.
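The two settings discussed here, the frequency of the time series and the standardize argument, can be illustrated with a short sketch. It assumes the bestNormalize package is installed and uses simulated values in place of the Richmond data; the object names are hypothetical.

```r
library(bestNormalize)

set.seed(42)
# Simulated placeholder for the 174 observations of the 3-year data.
x <- rgamma(174, shape = 2, scale = 5)

# 174 observations over 3 years gives 58 observations per year.
obs_per_year <- length(x) / 3
print(obs_per_year)

# Transformation only (as used in this report) versus the package default,
# which additionally centers and scales the transformed values.
yj_raw <- yeojohnson(x, standardize = FALSE)
yj_std <- yeojohnson(x, standardize = TRUE)

print(c(mean = mean(yj_raw$x.t), sd = sd(yj_raw$x.t)))  # on the transformed scale
print(c(mean = mean(yj_std$x.t), sd = sd(yj_std$x.t)))  # approximately 0 and 1

# Time series of the transformed values with 58 observations per yearly cycle.
y <- ts(yj_raw$x.t, start = c(2010, 1), frequency = obs_per_year)
print(end(y))  # last observation falls in cycle 58 of 2012
```

With standardize = FALSE the values stay on the scale produced by the transformation itself, which is the behavior the report relies on when it quotes a single λ per pollutant.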
7 Conclusion
In the Bulgaria study, the researchers' main goal was to use the ARIMA models to forecast
72 hours ahead, since they used hourly data. Similarly, I feel it necessary to highlight the
important role the auto.arima() function played in helping forecast the year 2013. While it
was not helpful in forecasting every pollutant, it was definitely more helpful than the
forecast() function in identifying the trend or behavior of each pollutant throughout the
year(s). The most important finding I came across was that the 2012 data alone was not enough
in most cases when attempting to forecast a future year, but the 3-year (2010, 2011, 2012)
data combination allowed both the forecast() and auto.arima() functions to display their
usefulness when forecasting. I certainly enjoyed preparing this study and learning about time
series, and I hope that I am given the opportunity to further explore this discipline in the
future.
References
[1] Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons.
Republished by Dover Publications in 1968; reprinted in 1978: ISBN 0-8446-5625-9.
[2] Yeo, I. K., and Johnson, R. A. (2000). A new family of power transformations to improve
normality or symmetry. Biometrika.
[4] Alcosser, Howard. "Diamond Bar High School" Internal Assessment: Mathematical
Exploration. Web. 27 May 2015.
[5] Jolliffe, I. (1986). Principal Component Analysis and Factor Analysis.
doi:10.1007/978-1-4757-1904-8_7.