
Marshall University

College of Science
Department of Mathematics
- STA 564 -

Time Series Analysis and Forecasting
Focused on Air Pollution in an Urban Area

By Kenneth Guzman
December 7, 2018
Contents
1 INTRODUCTION
2 Yeo-Johnson transformation, Kolmogorov-Smirnov Test for normality
3 Factor Analysis and PCA
4 Box Jenkins Methodology
5 Personal Study Carried out in R
6 R Code Explanation and Software Packages Used
7 Conclusion
References
TIME SERIES ANALYSIS FORECASTING

NOTES
At its core, the behavior of air pollution in the atmosphere is strongly governed by
meteorology. However, the "univariate" models we will consider assume that the final
concentration of air pollutants in the atmosphere is the net result of all the complex
interactions of meteorology, chemistry, transport, diffusion, etc. For this reason, the combined
information about their effect on air pollutant concentration is contained in the corresponding
time series in a stochastic way. Under this approach, calculations are simplified and performed
using only the time series of the pollutant, without explicit inclusion of meteorological or
other measurements.

Four professors from Plovdiv University in Bulgaria produced a research paper on time series
analysis of air pollution. The methods used were explicitly stated in their article as:
(i) Identify correlation-type dependencies and groupings of the observed air pollutants using
the method of factor analysis to explain mutual effects of pollution.
(ii) Conduct time series analysis by determining relevant seasonal ARIMA parametric models of
the pollutants (based on hourly data).
(iii) Analyze and diagnose the constructed models.
(iv) Apply the models for short-term forecasting.
(v) Interpret the results and define the conditions contributing to the exceeding of national
and European concentration norms for the considered air pollutants.

Their study was carried out using IBM SPSS 19 and EViews 7.[3]


1 INTRODUCTION
Even though there are established regulations for monitoring and controlling effects on
air quality in certain territories, air quality may remain unsatisfactory. Let us consider the
particular case of the town of Blagoevgrad, Bulgaria. Blagoevgrad is a typical representative
of a small urban region, with a population of approximately 70,000. The study spanned a
one-year period, from September 1st, 2011 to August 31st, 2012, during which six air pollutants
were observed through hourly measurements. Factor analysis and Box-Jenkins methodology were
applied to inspect concentrations of the primary air pollutants of interest. The pollutants
were grouped into three factors and the degree of contribution of each factor to the overall
pollution was determined; this contribution was interpreted as the presence of common sources
of pollution. The classical techniques of principal component analysis (PCA) and factor
analysis are important statistical instruments frequently used in the environmental sciences.

The focus of the study was the performance of time series analysis and the development of
univariate stochastic seasonal autoregressive integrated moving average (SARIMA) models, with
the seasonality based on the hourly recording of the data. The study incorporates the
Yeo-Johnson power transformation to stabilize the variance of the data, and model selection
using the Bayesian Information Criterion. The SARIMA models obtained in the study in Bulgaria
demonstrated a good fit to the observed air pollutants and good short-term predictions up to 72
hours ahead, specifically in the case of ozone and particulate matter PM10. The methods
presented allowed the building of less complex models that are effective for short-term air
pollution forecasting and useful for advance warning purposes in urban areas.[3]

Continuous and careful monitoring and forecasting of atmospheric air pollutants is important
when evaluating regulatory control measures related to air quality. In Bulgaria, 12 types
of pollutants are systematically monitored by more than 36 automated stations run by the
Executive Environment Agency (EEA), which manages and coordinates activities related to
the control and environmental protection of the country. Atmospheric air quality reports
for the various regions of the country are regularly published, and a large amount of data is
accumulated as a result. This accumulated data is what allows us to carry out statistical
analysis, which leads to the discovery of general patterns and dependencies for different time
periods and of relationships between observed air pollutants. The observed air pollutants in
the study carried out in Blagoevgrad, Bulgaria are concentrations of particulate matter PM10,
nitrogen oxide NO, nitrogen dioxide NO2, nitrogen oxides NOx, sulfur dioxide SO2, and
ground level ozone O3. The measurements are expressed in units of mass concentration
(µg/m3); only NOx is in units of ppb (parts per billion), as it captures pollution from all
kinds of nitrogen oxides. The data consisted of 8,744 observations (hourly data).
The goal of their study was to demonstrate the capabilities of the mentioned methods, which
can be applied to other recorded data sets, including those covering shorter or longer periods
of time.


2 Yeo-Johnson transformation, Kolmogorov-Smirnov Test for normality
Time series data often requires preparation before forecasting methods are applied. A normal
or near-normal distribution of the univariate data is important because it reduces issues when
we forecast future values. The obtained K-S statistic indicated non-normality of the data
collected in Bulgaria, which led to the transformation of the data prior to constructing the
forecasting models. In that particular case the Yeo-Johnson transformation was carried out,
after which the data satisfied the Kolmogorov-Smirnov test for normality at the 0.05 level of
significance and could be assumed to be normally distributed. The Yeo-Johnson transformation
finds the optimal value of lambda that minimizes the Kullback-Leibler¹ distance between the
normal distribution and the transformed distribution.[1][2]
The Yeo-Johnson transformation is defined as follows:

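\[
g(x;\lambda) =
\begin{cases}
\dfrac{(x+1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x \geq 0 \\[2ex]
\log(x+1), & \lambda = 0,\ x \geq 0 \\[2ex]
\dfrac{(1-x)^{2-\lambda} - 1}{\lambda - 2}, & \lambda \neq 2,\ x < 0 \\[2ex]
-\log(1-x), & \lambda = 2,\ x < 0
\end{cases}
\]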
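The four cases of the transformation can be sketched in a few lines of code. The following is
a minimal illustration in Python rather than the R code used for this study; the function name
yeo_johnson and the sample values are assumptions made for the example, and lambda is supplied
by hand rather than estimated.

```python
import math


def yeo_johnson(x, lam):
    """Apply the Yeo-Johnson transformation g(x; lambda) to a single value."""
    if x >= 0:
        if lam != 0:
            return ((x + 1.0) ** lam - 1.0) / lam
        return math.log(x + 1.0)
    # Negative inputs use the mirrored power 2 - lambda.
    if lam != 2:
        return ((1.0 - x) ** (2.0 - lam) - 1.0) / (lam - 2.0)
    return -math.log(1.0 - x)


# Hypothetical concentrations; 0.227158 is the PM2.5 lambda quoted in Section 5.
sample = [0.0, 3.5, 12.1, 27.8]
transformed = [yeo_johnson(v, 0.227158) for v in sample]
```

With lambda = 1 the function returns its input unchanged for both signs of x, which is a quick
way to check that the branches are wired correctly.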

3 Factor Analysis and PCA


The statistical techniques of factor analysis and principal component analysis help identify
patterns in the correlations between variables. The patterns identified are used to create
factors, which was the case in Bulgaria and allowed the grouping of correlated pollutants.
The steps followed in Bulgaria were: (a) calculation of the correlation matrix, (b) testing
the adequacy of factor analysis, (c) factor extraction, (d) factor rotation, and (e) score
calculation of the factor variables. The particular advantages of these methods are
that they reveal strong correlation relationships between observed variables and allow their
grouping into new variables (factors) in order to reduce the dimensions of the complex data
structure. The factors can thereafter be used to build regression or other types of models.[5]
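Step (a), the correlation matrix, is the piece that is easiest to illustrate. The sketch below
is a plain Python illustration under stated assumptions: the helper names pearson and
correlation_matrix and the toy readings are hypothetical, and it does not reproduce the SPSS
or R routines used in either study.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(series):
    """Return a nested dict holding the correlation of every pair of series."""
    names = list(series)
    return {r: {c: pearson(series[r], series[c]) for c in names} for r in names}


# Hypothetical toy readings, not measured data.
toy = {
    "NO": [1.0, 2.0, 3.0, 4.0],
    "NO2": [2.0, 4.0, 6.0, 8.0],
    "O3": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(toy)
```

Entries close to +1 or -1 suggest that two pollutants could be grouped under a common factor,
while entries near 0 suggest that factor analysis has little to work with.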

4 Box Jenkins Methodology


Other methods frequently used in time series analysis and forecasting are the autoregressive
integrated moving average (ARIMA) and seasonal ARIMA (SARIMA) models, also known as
Box-Jenkins stochastic models. Box-Jenkins methodology is widely applied in air quality
research, among other disciplines, and is a systematic strategy for identifying, fitting, and
forecasting univariate time series data. ARIMA models generally take the form ARIMA(p,d,q),
where p is the number of autoregressive terms, d is the number of nonseasonal differences
needed to reach stationarity, and q is the number of lagged forecast errors in the prediction
equation. Similarly, the SARIMA models take the general form ARIMA(p,d,q)(P,D,Q)s, where P is
the number of seasonal autoregressive terms, D is the order of seasonal differencing, and Q is
the number of seasonal moving average terms. In the seasonal part of the model, the three
parameters P, D, Q operate across multiples of lag s, where s is the number of time periods
until a pattern repeats itself.

¹ In mathematical statistics, the Kullback-Leibler divergence (also called relative entropy) is
a measure of how one probability distribution is different from a second, reference
probability distribution.
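For reference, the ARIMA(p,d,q)(P,D,Q)s form is commonly written with the backshift operator.
The notation below (B, y_t, ε_t and the four polynomials) is introduced here for illustration
and follows one common sign convention; it is not taken from the Bulgaria paper.

```latex
\[
\phi_p(B)\,\Phi_P(B^{s})\,(1 - B)^{d}\,(1 - B^{s})^{D}\, y_t
  = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t ,
\qquad B^{k} y_t = y_{t-k},
\]
\[
\phi_p(B) = 1 - \phi_1 B - \dots - \phi_p B^{p},
\qquad
\theta_q(B) = 1 + \theta_1 B + \dots + \theta_q B^{q},
\]
\[
\Phi_P(B^{s}) = 1 - \Phi_1 B^{s} - \dots - \Phi_P B^{Ps},
\qquad
\Theta_Q(B^{s}) = 1 + \Theta_1 B^{s} + \dots + \Theta_Q B^{Qs}.
\]
```

Here y_t is the (transformed) series and ε_t is white noise; setting P = D = Q = 0 recovers the
nonseasonal ARIMA(p,d,q) model.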

Main advantages of the Box-Jenkins approach:

(i) Applicability for modeling and forecasting practically any time series that is stationary
or can be reduced to stationary by a differencing procedure.
(ii) Ability to extract the trends and serial correlations in the data, leaving only a minimal
sequence of white noise (shocks), by including them in one general model equation that captures
how the historical data developed.
(iii) The method has been incorporated into many standard software packages, including R and
SPSS, which speeds up and assists the modeling process considerably.
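The differencing procedure mentioned in point (i) can be illustrated with a short Python
sketch. The helper name difference and the synthetic series are assumptions made for this
example; in R the same operation is available through the built-in diff() function.

```python
def difference(series, lag=1):
    """Return the lag-differenced series y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: a linear trend plus a pattern that repeats every 4 steps.
pattern = [0.0, 2.0, -1.0, 1.0]
raw = [0.5 * t + pattern[t % 4] for t in range(16)]

# Seasonal differencing (D = 1, s = 4) removes the repeating pattern and
# turns the linear trend into a constant level.
seasonal = difference(raw, lag=4)

# Ordinary differencing (d = 1) then removes that constant level.
stationary = difference(seasonal)
```

Each round of differencing shortens the series by the lag used, which is why d and D are kept
as small as possible in practice.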

5 Personal Study Carried out in R


Using the presented methods, I carried out my own study using the statistical software R.
Using data provided by the Environmental Protection Agency (EPA) here in the United States
(https://www.epa.gov/outdoor-air-quality-data), I accessed pollutant concentration data for
the city of Richmond, Virginia, which has a population of approximately 220,000. The observed
data spans a total of 4 years, from January 2010 to December 2013, and is based on weekly
measurements of the following air pollutants: particulate matter PM2.5, particulate matter
PM10, and lead Pb, expressed in units of mass concentration (µg/m3); carbon monoxide CO and
ground level ozone O3, in units of ppm (parts per million); and sulfur dioxide SO2 and
nitrogen dioxide NO2, in units of ppb (parts per billion). The goal of my personal research is
to apply the time series analysis and forecasting methods from the research paper produced in
Bulgaria to a local city here in the US. As was the case in Bulgaria, once these methods are
applied to the Richmond pollutant data I hope to visually show an appropriate forecast for each
pollutant for the year 2013.

Before proceeding, I would like to point out that while the research paper concerning
Bulgaria highlighted a factor analysis and principal component analysis approach, the
correlation matrix calculated in R for the Richmond pollutant data sets showed no notable
positive or negative correlation between the pollutants. I therefore did not carry out any
factor analysis or PCA. Also, the 2013 pollutant data sets were used strictly to compare our
forecast models to the actual data recorded by the EPA in 2013.


[Figure: correlation matrix for all 7 pollutants, computed from the data for the years 2010,
2011, and 2012.]

Analyzing PM2.5 using 3-year data

The first pollutant we will analyze is particulate matter PM2.5. The lambda value used to
transform the original PM2.5 observations was λ = 0.227158.

[Figure: time series plot of the 3-year PM2.5 data after the Yeo-Johnson transformation.]

[Figure: time series plot using only the forecast function in R.]

[Figure: time series plot using the auto.arima function in R.]

With the forecast and ARIMA models in hand, the next step was to access the 2013 pollutant
concentration data for PM2.5 and compare each model to see how accurately it predicted the
2013 values. Before creating the time series plot for 2013, I performed a Yeo-Johnson
transformation on the 2013 data. The lambda value used to transform the original 2013 PM2.5
observations was λ = 0.05030683.

Finally, after plotting the forecast and ARIMA models against the 2013 time series, I believe
the auto.arima function is the most appropriate for predicting the 2013 values of the
pollutant PM2.5.


Using only 2012 data to predict 2013 values

The lambda value used to transform the original PM2.5 observations for the year 2012 was
λ = 0.7078218.

[Figure: time series plot of the 2012 PM2.5 data after the Yeo-Johnson transformation.]

Using only the forecast function did not yield an appropriate time series plot in R.

[Figure: time series plot using the auto.arima function in R.]

With the ARIMA model in hand, the next step was to access the 2013 pollutant concentration
data for PM2.5 to see how accurately auto.arima() predicted the 2013 values. Before creating
the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The
lambda value used to transform the original 2013 PM2.5 observations was λ = 0.05030683.

Finally, after plotting the ARIMA model against the 2013 time series, I believe the auto.arima
function is somewhat appropriate for predicting the trend of the pollutant PM2.5 for the
year 2013.


Analyzing PM10 using 3-year data

The second pollutant we will analyze is particulate matter PM10. The lambda value used to
transform the original PM10 observations was λ = 0.2409915.

[Figure: time series plot of the 3-year PM10 data after the Yeo-Johnson transformation.]

[Figure: time series plot using only the forecast function in R.]

[Figure: time series plot using the auto.arima function in R.]

With the forecast and ARIMA models in hand, the next step was to access the 2013 pollutant
concentration data for PM10 and compare each model to see how accurately it predicted the
2013 values. Before creating the time series plot for 2013, I performed a Yeo-Johnson
transformation on the 2013 data. The lambda value used to transform the original 2013 PM10
observations was λ = 0.7845362.

Finally, after plotting the forecast and ARIMA models against the 2013 time series, I believe
neither the forecast nor the auto.arima function is appropriate for predicting the 2013 values
of the pollutant PM10.


Using only 2012 data to predict 2013 values

The lambda value used to transform the original PM10 observations for the year 2012 was
λ = −0.04297711.

[Figure: time series plot of the 2012 PM10 data after the Yeo-Johnson transformation.]

Using only the forecast function did not yield an appropriate time series plot in R.

Using the auto.arima function did not yield an appropriate time series plot in R either.

Analyzing Pb (Lead) using 3-year data

The third pollutant we will analyze is lead Pb. The lambda value used to transform the
original Pb observations was λ = −4.99994.

[Figure: time series plot of the 3-year Pb data after the Yeo-Johnson transformation.]

[Figure: time series plot using only the forecast function in R.]

[Figure: time series plot using the auto.arima function in R.]

With the forecast and ARIMA models in hand, the next step was to access the 2013 pollutant
concentration data for Pb and compare each model to see how accurately it predicted the
2013 values. Before creating the time series plot for 2013, I performed a Yeo-Johnson
transformation on the 2013 data. The lambda value used to transform the original 2013 Pb
observations was λ = −4.99994.

Finally, after plotting the forecast and ARIMA models against the 2013 time series, I believe
the auto.arima function is the most appropriate for predicting the 2013 values of the
pollutant Pb.

Using only 2012 data to predict 2013 values

The lambda value used to transform the original Pb (Lead) observations for the year 2012 was
λ = −4.99994.

[Figure: time series plot of the 2012 Pb data after the Yeo-Johnson transformation.]

Using only the forecast function did not yield an appropriate time series plot in R.

Using the auto.arima function did not yield an appropriate time series plot in R either.

Analyzing CO using 3-year data

The fourth pollutant we will analyze is carbon monoxide CO. The lambda value used to
transform the original CO observations was λ = −3.577325.

[Figure: time series plot of the 3-year CO data after the Yeo-Johnson transformation.]

[Figure: time series plot using only the forecast function in R.]

[Figure: time series plot using the auto.arima function in R.]

With the forecast and ARIMA models in hand, the next step was to access the 2013 pollutant
concentration data for CO and compare each model to see how accurately it predicted the
2013 values. Before creating the time series plot for 2013, I performed a Yeo-Johnson
transformation on the 2013 data. The lambda value used to transform the original 2013 CO
observations was λ = −2.432302.

Finally, after plotting the forecast and ARIMA models against the 2013 time series, I believe
the auto.arima function is the most appropriate for predicting the 2013 values of the
pollutant CO.

Using only 2012 data to predict 2013 values

The lambda value used to transform the original CO observations for the year 2012 was
λ = −3.641187.

[Figure: time series plot of the 2012 CO data after the Yeo-Johnson transformation.]

Using only the forecast function did not yield an appropriate time series plot in R.

[Figure: time series plot using the auto.arima function in R.]

With the ARIMA model in hand, the next step was to access the 2013 pollutant concentration
data for CO and see how accurately the ARIMA model predicted the 2013 values. Before creating
the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The
lambda value used to transform the original 2013 CO observations was λ = −2.432302.

Finally, after plotting the ARIMA model against the 2013 time series, I believe the
auto.arima function is the most appropriate for predicting the trend of the pollutant CO for
2013.


Analyzing O3 using 3-year data

The fifth pollutant we will analyze is ground level ozone O3. The lambda value used to
transform the original O3 observations was λ = 3.615548.

[Figure: time series plot of the 3-year O3 data after the Yeo-Johnson transformation.]

[Figure: time series plot using only the forecast function in R.]

[Figure: time series plot using the auto.arima function in R.]

With the forecast and ARIMA models in hand, the next step was to access the 2013 pollutant
concentration data for O3 and compare each model to see how accurately it predicted the
2013 values. Before creating the time series plot for 2013, I performed a Yeo-Johnson
transformation on the 2013 data. The lambda value used to transform the original 2013 O3
observations was λ = 4.99994.

Finally, after plotting the forecast and ARIMA models against the 2013 time series, I believe
the forecast function is the most appropriate for predicting the 2013 values of the
pollutant O3.


Using only 2012 data to predict 2013 values

The lambda value used to transform the original O3 observations for the year 2012 was
λ = 4.99994.

[Figure: time series plot of the 2012 O3 data after the Yeo-Johnson transformation.]

Using only the forecast function did not yield an appropriate time series plot in R.

Using the auto.arima function did not yield an appropriate time series plot in R either.

Analyzing SO2 using 3-year data

The sixth pollutant we will analyze is sulfur dioxide SO2. The lambda value used to
transform the original SO2 observations was λ = −0.227093.

[Figure: time series plot of the 3-year SO2 data after the Yeo-Johnson transformation.]

[Figure: time series plot using only the forecast function in R.]

[Figure: time series plot using the auto.arima function in R.]

With the forecast and ARIMA models in hand, the next step was to access the 2013 pollutant
concentration data for SO2 and compare each model to see how accurately it predicted the
2013 values. Before creating the time series plot for 2013, I performed a Yeo-Johnson
transformation on the 2013 data. The lambda value used to transform the original 2013 SO2
observations was λ = 0.2616144.

Finally, after plotting the forecast and ARIMA models against the 2013 time series, I believe
the forecast function is the most appropriate for predicting the 2013 values of the
pollutant SO2.

Using only 2012 data to predict 2013 values

The lambda value used to transform the original SO2 observations for the year 2012 was
λ = −0.1123281.

[Figure: time series plot of the 2012 SO2 data after the Yeo-Johnson transformation.]

Using only the forecast function did not yield an appropriate time series plot in R.

Using the auto.arima function did not yield an appropriate time series plot in R either.

Analyzing NO2 using 3-year data

The seventh and final pollutant we will analyze is nitrogen dioxide NO2. The lambda value
used to transform the original NO2 observations was λ = 0.9783584.

[Figure: time series plot of the 3-year NO2 data after the Yeo-Johnson transformation.]

[Figure: time series plot using only the forecast function in R.]

[Figure: time series plot using the auto.arima function in R.]

With the forecast and ARIMA models in hand, the next step was to access the 2013 pollutant
concentration data for NO2 and compare each model to see how accurately it predicted the
2013 values. Before creating the time series plot for 2013, I performed a Yeo-Johnson
transformation on the 2013 data. The lambda value used to transform the original 2013 NO2
observations was λ = 1.003092.

Finally, after plotting the forecast and ARIMA models against the 2013 time series, I believe
the auto.arima function is the most appropriate for predicting the 2013 values of the
pollutant NO2.

Using only 2012 data to predict 2013 values

The lambda value used to transform the original NO2 observations for the year 2012 was
λ = 1.229131.

[Figure: time series plot of the 2012 NO2 data after the Yeo-Johnson transformation.]

Using only the forecast function did not yield an appropriate time series plot in R.

[Figure: time series plot using the auto.arima function in R.]

With the ARIMA model in hand, the next step was to access the 2013 pollutant concentration
data for NO2 and see how accurately the model predicted the 2013 values. Before creating the
time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The
lambda value used to transform the original 2013 NO2 observations was λ = 1.003092.

Finally, after plotting the ARIMA model against the 2013 time series, I believe the
auto.arima function is the most appropriate for predicting the trend of the pollutant NO2 for
2013.


6 R Code Explanation and Software Packages Used


The following R packages were used: MASS, bestNormalize, and forecast.

• From MASS, the function truehist was used to plot the histograms of the pollutant
data before and after the Yeo-Johnson transformation was applied, to visually show the
change from a non-normal to a normal distribution of the data.

• From bestNormalize, the function yeojohnson was used to transform the pollutant
data from non-normal to normally distributed, in order to better carry out our statistical
analysis.

• From forecast, the functions forecast and auto.arima were used, each playing the most
important role in analyzing prior pollutant observations and forecasting our future
values as accurately as R allows for each pollutant.

The main functions that I will highlight in this section are the forecast() and auto.arima()
functions in R, but I will also briefly explain my usage of the ts() and yeojohnson() functions.
It was very important to my study that level=F be set within the forecast() function: while
having prediction intervals in our graphs could be useful, they were not needed for my study,
since I was mostly interested in the specific values that the forecast() function gave in its
output. Also, with the forecast package it was very important that we forecast exactly 59
future values, simply because there are exactly 59 values in our EPA 2013 data for each
pollutant. In the auto.arima() function, no restrictions needed to be specified, but it was
most important that we accessed our forecast values through auto.arima()$f (through R's partial
matching of component names, this appears to refer to the fitted component of the model
object). For reference, we are also able to access the original values that were put into the
function by using auto.arima()$x.
One last note: when I was plotting the time series for the 3-year data, you should notice that
within each ts() call frequency=58, which I interpret as there being an average of 58
observations per year. I obtained 58 by dividing the total number of observations in our
3-year data by 3, so 174/3 = 58. Within the yeojohnson() function you will notice that
standardize=FALSE. This is because, if it is not declared within the function, R will by
default further standardize the values put into the function. I did not find the further
standardization useful when dealing with the Richmond data, mainly because the Yeo-Johnson
transformation itself was what was of interest in the Bulgaria study, so I wanted to follow
that transformation as it is, without further standardization.
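The bookkeeping described above (174 training observations, a frequency of 58, and a forecast
horizon of 59) can be illustrated with a small hold-out check. The sketch below is written in
Python and is only an illustration under stated assumptions: it uses a seasonal naive forecast
as a simple stand-in for auto.arima(), the data is synthetic, and the mean absolute error is an
added numeric summary, whereas the comparisons in this study were made visually.

```python
import math


def seasonal_naive_forecast(history, frequency, horizon):
    """Forecast each future step with the value observed one season earlier."""
    extended = list(history)
    predictions = []
    for _ in range(horizon):
        value = extended[-frequency]
        predictions.append(value)
        extended.append(value)
    return predictions


def mean_absolute_error(actual, predicted):
    """Average absolute gap between held-out values and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


FREQUENCY = 174 // 3  # 58 observations per year, as in the ts() calls
HORIZON = 59          # number of 2013 observations per pollutant


def synthetic(t):
    """Hypothetical yearly cycle standing in for a transformed pollutant series."""
    return 10.0 + 3.0 * math.sin(2.0 * math.pi * t / FREQUENCY)


train = [synthetic(t) for t in range(174)]                 # stands in for 2010 to 2012
test = [synthetic(t) for t in range(174, 174 + HORIZON)]   # stands in for 2013

forecast = seasonal_naive_forecast(train, FREQUENCY, HORIZON)
mae = mean_absolute_error(test, forecast)
```

Because the horizon (59) is one step longer than the season (58), the last forecast reuses the
first forecast value rather than an observed one.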

7 Conclusion
In the Bulgaria study, the researchers' main goal was to use the ARIMA models to forecast
72 hours ahead, because they used hourly data. Similarly, I feel it necessary to highlight the
importance of the auto.arima() function in helping forecast the year 2013. While it was not
helpful for forecasting every pollutant, it was definitely more helpful than the forecast()
function in identifying the trend or behavior of each pollutant throughout the year(s). The
most important finding I came across was that the 2012 data alone was not enough in most cases
when attempting to forecast a future year, but the combined 3-year data (2010, 2011, 2012)
allowed both the forecast() and auto.arima() functions to display their usefulness when
forecasting. I certainly enjoyed preparing this study and learning about time series, and I
hope that I am given the opportunity to further explore this discipline in the future.


References
[1] Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons.
Republished by Dover Publications in 1968; reprinted in 1978: ISBN 0-8446-5625-9.

[2] Yeo, I. K., and Johnson, R. A. (2000). A new family of power transformations to improve
normality or symmetry. Biometrika.

[3] Gocheva-Ilieva, Snezhana; Ivanov, A; Voynikova, Desislava; Boyadzhiev, Doychin. (2013).
Time series analysis and forecasting for air pollution in small urban area: An SARIMA
and factor analysis approach. Stochastic Environmental Research and Risk Assessment.
28. 1045-1060. 10.1007/s00477-013-0800-4.

[4] Alcosser, Howard. "Diamond Bar High School" Internal Assessment: Mathematical
Exploration. Web. 27 May 2015.

[5] Jolliffe, Ian. (1986). Principal Component Analysis and Factor Analysis.
10.1007/978-1-4757-1904-8_7.
