You are on page 1of 7

DISCOVERING CLIMATE CHANGE PATTERNS

USING HIERARCHICAL CLUSTERING AND


ASSOCIATION RULES
Mahmoud Sammour 1st Author Zulaiha Othman per 2nd Author
Center for Artificial Intelligence Technology Center for Artificial Intelligence Technology
Faculty of Information Science and Technology Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia Universiti Kebangsaan Malaysia
43600 Bangi, Selangor, Malaysia 43600 Bangi, Selangor, Malaysia

Abstract— Ozone analysis is the process of identifying would lead to inferior indications regarding the ozone. This is
meaningful patterns that would facilitate the prediction of future due to multiple reasons. First, results of clustering is
trends. One of the common techniques that have been used for significantly changing based on the clustering technique and the
ozone analysis is the clustering technique. Clustering is one of the distance measure used. There are several clustering techniques
popular methods which contributes significant knowledge for time such as partitioning (e.g. k-means, k-medoids, etc.) and
series data mining by aggregating similar data in specific groups. hierarchical (e.g. agglomerative and divisive). As well as, there
However, identifying significant patterns regarding the ground- are different distance measures that could be utilized with the
level ozone is quite challenging task especially after applying the clustering technique such as Euclidean, Minkowski and
clustering task. This paper aims to propose a pattern discovery
Dynamic Time Warping (DTW). Such plenty of choices leads
method for ground-level ozone. The proposed method consists of
an Agglomerative Hierarchical Clustering with Dynamic Time
to different results of clustering. On the other hand, evaluating
Warping (DTW) as a distance measure. AHC has been applied on the results of clustering is a challenging task in which different
a Malaysian Ozone change dataset collected from Putrajaya for evaluation methods have been proposed for such purpose. All
year 2006. Consequentially, the patterns have been extracted using these mentioned reasons makes the process of identifying
the Apriori Association Rules (AAR) algorithm. The experiment significant patterns from ozone clustering results a difficult task.
shows that the extracted patterns are related to CO which is an This paper aims to propose the Apriori Association Rules
interesting relation in accordance to the literature. algorithm in order to extract patterns from the clustering results.
Keywords— Hierarchical Clustering, Dynamic Time Warping,
This paper is considered to be an extension of our study
Ground-Level Ozone, Apriori Association Rules (Sammour and Othman, 2016).

I. INTRODUCTION II. RELATED WORK


Scientifically called trioxygen is an inorganic molecule with The ground-level ozone. According to Vingarzan (2004)
the chemical formula O3, It is a pale blue gas with a distinctively who accommodated a review for the trends of ground-level
pungent smell (Control and Prevention, 1997). Reports suggest ozone using a data of the last century in Canada and USA, have
that ground level ozone (O3) can be rather harmful to the human concluded that the ground-level ozone has dramatically
respiratory system. In addition, there are also research reports increased in the last three decades. As a response, the research
that this can also result in several other detrimental diseases such community has attempted to propose statistical models that has
as severe exposure to ozone can negatively affect and upset the ability to predict the increasing of ozone for instance, Ray &
functioning of lungs, and can potentially increase inflammation Ensor (2009) have proposed an agglomerative hierarchical
(Bell et al., 2007). For instance, studies find that mortality rate clustering for identifying the most polluted area in Houston,
in urban areas is related to the effects of ozone, and there is Texas in terms of ground level ozone. In their study, the authors
correlation between them (Bell et al., 2004). Other studies such have declared multiple factors that have a significant impact on
the one by (Wilson et al., 2014) conclude that the effects of the ozone increment such as wind speed, wind direction, and
ozone can be non-linear, and specifically, extreme exposure to solar radiation.
ground level ozone can be dangerous for health. Therefore, in In addition, Adame et al. (2012) have proposed a k-means
view of the dangers the ozone layer entails, it could be critical to clustering approach with Euclidean distance measure in order to
research and determine the factors which cause the ozone layer identifying peaks of ozone rates in an industrial area in Central-
to spread the most. Southern Spain. The authors have successfully identified several
Several studies have been proposed for the task of predicting polluted plots. Another approach was proposed by (Ahmad and
ozone levels (Monteiro et al., 2012; Saithanu and Mekparyup, Aziz, 2013) in which a statistical method of passive sampling
2012; Wang et al., 2013). Such methods have utilized the was being used to investigate the air pollution in Pakistan.
clustering techniques in order to group the spots that have high Furthermore, Monteiro et al. (2012) have proposed a
level of ozone. However, the results of clustering sometimes combination of Statistical means of quantile regression and
agglomerative hierarchical clustering in order to measure the III. MATERIALS AND METHOD
pollution of air in terms of ground-level ozone. The proposed method consists of Agglomerative
Other researchers have attempted to identify characteristics Hierarchical Clustering with Dynamic Time Warping (DTW) as
of ground-level ozone such as Smith et al. (2013) who proposed a distance measure. The reason behind selecting such clustering
a Hybrid Single Particle Lagrangian Integrated Trajectory technique and distance measure lies on their superior
(HySPLIT) Model in order to characterize the ground ozone performance according to the state of the art of ozone clustering.
concentration in the gulf of Texas. In their study they have In addition, the Apriori Association Rules will be applied upon
figured out that the lowest ozone concentrations are associated the clustering results in order to discover knowledge. The
with trajectories that remained over the central Gulf for at least following sub-sections will tackle the proposed method
48 hours. As well as, higher concentrations are associated with components.
trajectories that pass close to the northern and western Gulf A. Agglomerative Hierarchical Clustering
Coast. Wang et al. (2013) have addressed the problem of
detecting ground level ozone from Spatio-temporal aspect. The Let 𝑥𝑖 and 𝑣𝑗 be a P-dimensional vector, then the Euclidean
authors have proposed a nearest neighbor clustering approach in distance can be measured as (Liao, 2005):
order to identify Spatio-temporal patterns of the air pollution. This phase aims to apply hierarchical clustering technique.
Another observational study conducted by Li et al. (2013) In general, hierarchical clustering algorithms work by
which concentrated on the pollution in Tangshan, north China. aggregating the objects into a tree of clusters (Aghagolzadeh et
Such study has mainly relied on statistical analysis. The study al., 2007). Hierarchical clustering can be categorized into two
has implied the dramatic expansion rates of O3 and NOX from types; agglomerative and divisive. Such categorization is
2008 to 2011. The study has referred the reason behind the inspired from the mechanism of grouping the objects whether
increment rates due to the extent of industries that are located in bottom-up or top-down approach. AHC is considered as a
such city. In addition, Hájek & Olej (2012) have accommodate bottom-up hierarchical approach where each object set in a
a comparative study of three regression approaches including separated cluster then AHC will merge such clusters into larger
Neural Network (NN), Support Vector Machine (SVM) and clusters (Rani and Sikka, 2012). Such process is continuing until
Fuzzy Logic (FL) in terms of predicting ground-level ozone. a specific termination has been reached. Complete linkage
Based on the Root Mean Square Error (RMSE), SVM has shown algorithm aims to identify the similarity between two clusters by
superior performance regarding to predict the ozone levels. measure two nearest data points that are located in different
Similarly, Kandil et al. (2014) have examined two NN models clusters. Hence, the merge will be done between the clusters that
including Feed-Forward NN and Back-propagation NN in terms have minimum distance -most similar- between each other. In
of ozone prediction. Basically, multiple features have been this paper, AHC has been applied as a maximum linkage.
encoded and fed into the network including temperature,
B. Dynamic Time Warping (DTW)
humidity, wind speed, incoming solar radiation, sulfur dioxide
and nitrogen dioxide. Feed forward NN has outperformed the DTW has been widely used to compare between discrete
other model. sequences and sequences of continuous values (Liao, 2005). Let
𝑆 = {𝑠1 , 𝑠2 , … , 𝑠𝑖 , … , 𝑠𝑛 } and 𝑇 = {𝑡1 , 𝑡2 , … , 𝑡𝑗 , … , 𝑡𝑛 } be a two
In addition, Sun et al. (2015) have accommodated a time series sequences. DTW will minimize the differences
comparison among two linear regression methods including among these series by representing a matrix of 𝑛 × 𝑚. In such
SVM and multi-layer perceptron NN for identifying ozone matrix, the distance/similarity between 𝑠𝑖 and 𝑡𝑗 will be
levels in Houston–Galveston–Brazoria area, Texas. Results
calculated using Euclidean distance.
shown superior performance for SVM. Tamas et al. (2016) have
used three clustering approaches in order to detect pollution in However, a warping path 𝑃 = {𝑝1 , 𝑝2 , … , 𝑝𝑘 , … , 𝑝𝐾 } where
Air including Artificial Neural Network (ANN), Self-Organized max(𝑚, 𝑛) ≤ 𝐾 ≤ 𝑚 + 𝑛 − 1 will be elements from the matrix
Mapping (SOM) and K-means clustering. Using hourly data, the that meet three constraints including boundary condition,
results shown two main sources of pollutions including ozone continuity and monotonicity. The boundary condition constraint
(O3) and nitrogen dioxide (NO2). requires the warping path to start and finish in diagonally
opposite corner cells of the matrix. That is 𝑝1 = (1,1) and 𝑝𝐾 =
On the other hand, Akimoto et al. (2015) have a long-term
(𝑚, 𝑛). The continuity constraint restricts the allowable steps to
statistical study for ground-level ozone in Japan from 1990 to
2010. The study focused on identifying correlation for the adjacent cells. The monotonicity constraint forces the points in
increment rates of ozone. The authors have identified three main the warping path to be monotonically spaced in time. The
causes stated as; (i) the decrease of NO titration effect, (ii) the warping path that has the minimum distance/similarity between
increase of transboundary transport, and (iii) the decrease of the two series is of interest. Hence, the DTW can be computed
situated photochemical production. Similarly, on an as follows:
observational study of ozone level’s causes by (Christensen et ∑𝐾𝑘=1 𝑝𝐾
al., 2015), the authors have indicated that Asia continent is one 𝑑𝐷𝑇𝑊 = 𝑚𝑖𝑛 (1)
𝐾
of the main sources that affecting the ground-level ozone in
Western United States.
C. Apriori Association Rules (AAR)
Apriori is an algorithm for frequent item set mining and
association rule learning over transactional databases (Borgelt,
2004). It proceeds by identifying the frequent individual items in object in other cluster. Similarly, RMSE-SD can be computed
the database and extending them to larger and larger item sets as for the external clusters as:
long as those item sets appear sufficiently often in the database.
The frequent item sets determined by Apriori can be used to Note that, the smaller value of RMSE-SD between the objects
determine association rules which highlight general trends in the within a single cluster leads to better performance in which the
database. In this manner, applying the Apriori algorithm on our objects are very similar. In contrast, the bigger value of RMSE-
dataset would reveals the interesting patterns that are being SD between the objects within a single cluster leads to lower
occurred. In order to distinguish these interesting patterns or performance in which the homogenous among the objects is
rules, it is necessary to consider the value of confidence which being maximized. Therefore, best results associated with smaller
is being illustrated as follows: value of RMSE-SD among intra-cluster, and with greater value
of RMSE-SD among inter-clusters. Based on the latter
Confidence: The confidence of a rule is defined Conf (X mentioned explanation, the results of applying AHC with DTW
implies Y) = supp(X ∪Y)/supp(X) in which supp(X∪Y) means can be depicted as in Table 1.
"support for occurrences of transactions where X and Y both
appear". Confidence ranges from 0 to 1, where the closeness to TABLE I. RESULTS OF INTRA AND INTER CLUSTER OF AHC
1 indicates an interesting relation. Confidence is an estimate of # Clusters Intra-Cluster Inter-Cluster
Pr(Y | X), the probability of observing Y given X. The support 15 0.0042 0.3869
supp(X) of an itemset X is defined as the proportion of
14 0.0041 0.3825
transactions in the data set which contain the itemset
13 0.0041 0.3814
IV. EXPERIMENTAL RESULTS 12 0.0042 0.3813
11 0.0045 0.3901
First the data has been collected from LESTARI (2016) which
is the Institution for Environment and Development in Malaysia 10 0.0045 0.4031
and the Asia Pacific. Such institution has been established since 9 0.0039 0.4077
1994 with the structure of Universiti Kebangsaan Malaysia 8 0.0039 0.3401
(UKM) in order to deal with environment and development 7 0.0041 0.3252
issues. The data contains ozone levels for one year (i.e. 2006) 6 0.0066 0.3221
particularly for Putrajaya city. The data has been represented 5 0.0068 0.3153
hourly as time intervals, which contained 8544 instances. Hence, 4 0.0073 0.3149
the proposed AHC with DTW has been applied on such dataset. 3 0.0054 0.3608
Two main approaches have been used for validating clustering As shown in Table 1, the best results of intra and inter cluster
process; external and internal validation of clusters (Liu et al., has been achieved at number of cluster 9.
2010). External validation aims to validate the clusters based on
the distribution in which the common information retrieval The US Office of Air and Radiation (AirNow, 2009) have
metrics such as precision, recall and f-measure. However, such discussed the factors that lead to air pollution. In their
mechanism of validation relies on a labeled data. Since, the real- investigation, ozone was one of the main factors that could harm
life data is usually unlabeled thus, applying external validation the human health. For this manner, AirNow (2009) has provided
tend to be insufficient. 5 categories of air pollution which are shown in Table 2.
On other hand, internal validation aims to measure the TABLE II. CATEGORIES OF AIR POLLUTION
correctness among objects within a cluster (i.e. intra-cluster) and
the correctness among objects within multiple clusters (i.e. inter- # index Unhealthy Level
1 Very Unhealthy
cluster). Basically, the main aim of clustering task is to make
sure that the objects within a single cluster are mostly similar, as 2 Unhealthy
well as, the objects within multiple clusters are mostly 3 Unhealthy for Sensitive Groups
dissimilar. Hence, computing the Root Mean Square Error 4 Moderate
Standard Deviation (RMSE-SD) would measure the 5 Good
homogenous of the objects within a single cluster and within In order to provide more critical analysis of the acquired
multiple clusters which can be computed as: clusters, the best number of cluster based on the RMSE-SD
𝑛 which is 9 will be considered. In addition, the AirNow (2009)
1
𝑖𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑅𝑀𝑆𝐸 − 𝑆𝐷 = ∑(𝑥𝑖 − 𝑥𝑖+1 )2 (2) categorization will be considered. Therefore, two number of
𝑛−1
𝑖=1 clusters will be considered in the analysis which are 5 and 9, next
where n is the number of objects inside a cluster and 𝑥𝑖 − 𝑥𝑖+1 sections will tackle this analysis.
is the distance between an two objects inside the same cluster.
Similarly, RMSE-SD can be computed for the external clusters A. AAnalysis When K=5
as: This section aims to provide a critical analysis of clustering
1
𝑛 when k=5 by identifying new patterns. This can be conducted by
𝑒𝑥𝑡𝑒𝑟𝑛𝑎𝑙 𝑅𝑀𝑆𝐸 − 𝑆𝐷 = ∑(𝑥𝑖 − 𝑥̅ )2 (3) detecting anonymous or abnormal trends for the ground-level
𝑛−1
𝑖=1 ozone rates. In this manner, each cluster included within the five
where n is the number of objects of two inter clusters and 𝑥𝑖 − clusters will be discussed separately. The analysis is tackling the
𝑥̅ is the distance between an object in one cluster and the other
days included in this cluster and will be conducted based three
8-hour intervals regarding to Christensen et al. (2015). Figure 1
depicts the results of this experiment. Note that, the values of
ozone have been measured using the Particle per million
recorded from the stations.
For cluster 1, the first 8-hour interval has begun with 0.004
ppb and ended up with 0.005 ppb. Whereas, second interval has
shown a rise of the ozone values reaching to the peak of 0.061 Pattern: Standard pattern
ppb at 2pm and ended with 0.050 ppb at 5pm. In the third
interval, the ozone values have been gradually decreased
reaching 0.008. This pattern is considered to be standard in
accordance to the literature (Toh et al., 2013).
For cluster 2, the first interval has begun with 0.014 ppb and
ended up with 0.005 ppb. Second interval has shown a rise of
ozone values reaching the peak of 0.113 ppb at 2pm and ended
with 0.089 ppb. In the third interval, the values have been Pattern: sharp increase and sharp decrease
decreased to reach 0.014 ppb. A remarkable pattern could be
noticed from this cluster, this pattern represented by the sharp
increase and sharp decrease of the ozone values.
For cluster 3, the first 8-hour interval has begun with 0.017
ppb and ended up with 0.007 ppb. Whereas, second interval has
shown an increase of values reaching the peak of 0.058 ppb at 2
pm, this peak has not changed until 4 pm. In the third interval,
the values have been gradually decreased reaching 0.012 ppb. A Pattern: Starting with high value
pattern can be shown as starting with high value.
For cluster 4, the first 8-hour interval has begun with 0.005
ppb and ended up with 0.006 ppb. Whereas, second interval has
shown an increase of values reaching the peak of 0.037 ppb at 2
pm, this peak has not changed until 3 pm. In the third interval,
the values have been gradually decreased reaching 0.005 ppb. A Pattern: Very low
remarkable pattern could be noticed from this cluster, this
pattern represented by the lowest values of ozone for the whole
day.
For cluster 5, the first 8-hour interval has begun with 0.033
ppb and ended up with 0.016 ppb. Whereas, second interval has
shown an increase of values reaching the peak of 0.059 ppb at 3
pm and ended with 0.051 ppb. In the third interval, the values
have been sharply decreased reaching 0.019 ppb at 8 pm and
ended with 0.017. A remarkable pattern could be noticed from Pattern: Starting with high value and unusual low
this cluster, this pattern represented by a high value of starting
and unusual decline of the ozone values.
B. Analysis When K=9 For cluster 1, the first 8-hour interval has begun with 0.008
ppb and ended up with 0.005 ppb. Whereas, second interval has
This section aims to provide a critical analysis of clustering shown a rise of ozone values reaching the peak of 0.076 ppb at
when k=9 based by identifying new patterns. This can be 2pm, then ended up with 0.059 ppb at 5pm. In the third interval,
conducted by detecting anonymous or abnormal trends for the the values have been gradually decreased reaching 0.007. A
ground-level ozone rates. In this manner, each cluster included remarkable pattern could be noticed from this cluster, this
within the nine clusters will be discussed separately. The pattern represented by the sharp decrease and increase of ozone
analysis is tackling the days included in this cluster and will be values.
conducted based three 8-hour intervals regarding to Christensen Fig. 2. Results of clustering when k = 1-5 Fig.2. Results of clustering
et al. (2015). Figure 2 depicts the results of this experiment.
Note that, the values of ozone have been measured using the
Particle per million recorded from the stations.

Fig. 1. Results of clustering when k = 5


Pattern: sharp decrease and increase of values.

Pattern: standard pattern


Pattern: sharp increase and decrease of values.

Pattern: starting with high value

Pattern: standard pattern

Pattern: starting with high value


Pattern: sharp decrease of the values with multiple peaks

Pattern: starting with high value

Pattern: starting with high value. For cluster 5, the first interval has begun with 0.004 ppb and
ended up with 0.005 ppb. Second interval shown three peaks
For cluster 2, the first interval has begun with 0.004 ppb and stated as; 0.032 ppb at 12 pm, 0.036 at 2 pm and 0.035 ppb at
ended up with 0.005 ppb. Second interval has begun with 0.008 4pm, and ended with 0.029. Whereas, the third interval shown
ppb and then sharply increased into the maximum peak of 0.055 an unstable decreased until reaching 0.005 ppb. This cluster has
ppb at 2pm and then decreased into 0.046 ppb at 5pm. The third a standard pattern.
interval has decreased reaching 0.009 ppb. This cluster has a For cluster 6, the first interval has begun with 0.015 ppb then
standard pattern. shown a stable decrease until 8 am reaching 0.008 ppb. Second
For cluster 3, the first interval has begun with 0.014 ppb and interval has shown two peaks of 0.034 at 12 pm and 0.039 at 3
ended up with 0.005 ppb. Second interval has shown an increase pm. The third interval shown a gradual decrease of the values
of values reaching the peak of 0.113 ppb at 2 pm and ended up reaching 0.005 ppb. The pattern embedded in such cluster as a
with 0.089 ppb. In the third interval, the values have been high value of the starting point.
decreased into 0.014 ppb. A remarkable pattern could be noticed For cluster 7, the first interval has begun with 0.019 ppb and
from this cluster, this pattern represented by the sharp increase ended up with 0.011 ppb. Second interval has shown two peaks
followed by sharp decrease of values. of 0.078 ppb at 4 pm and 0.074 ppb at 2 pm. Third interval has
For cluster 4, the first interval has begun with 0.015 ppb and shown gradual decrease reaching 0.019 ppb. A remarkable
ended up with 0.006 ppb. Second interval shown multiple peak pattern could be noticed from this cluster, this pattern
in which the values at 4 pm have reached 0.053 and 0.049 ppb at represented by the sharp decrease of the values with multiple
5pm. The third interval has shown a decrease of values 007 ppb. peaks.
For cluster 8, the first interval has begun with 0.033 ppb and
ended up with 0.016 ppb. Second interval has shown a maximum
peak of 0.059 ppb at 3 pm and then ended up with 0.051 ppb.
Third interval has shown a sharp decrease of values reaching
0.017 ppb. The pattern embedded in such cluster as a high value
of the starting point.
For cluster 9, the first interval has begun with 0.025 ppb and ozone rates. In addition, the following two rules (i.e. 6 and 7) are
ended up with 0.007 ppb. Second interval has shown a sharp associated with NOx and NO2 where the decreasing of NO2
increase of values reaching a maximum of 0.062 ppb at 2 pm and with an NOx = 0.003 would lead to increment in the ozone rates.
ended with 0.049 ppb. Third interval has shown a gradual The fifth following rules (i.e. 8-12) are associated with NOx and
decrease of the values reaching 0.014 ppb. The pattern CO where the decreasing of CO with an NOx = 0.003 (especially
embedded in such cluster as a high value of the starting point. for rule 10 and 11) would lead to increment in the ozone rates.
C. Comparison between K=5 and K=9 TABLE III. VALUES USING CLUSTER K = 5
This section aims to accommodate a comparison between the Days K=5 Morning Afternoon Evening Standard
two number of cluster 9 and 5 which have been analyzed in the Category
previous sections. The comparison will be based on multiple Class Start end Max Men end
Max
variables including starting values of ozone, maximum peak, 60 4 0.005 0.006 0.09 0.037 0.005 Good
maximum peak of median and ending values. Table 3 shows the 8 5 0.033 0.016 0.093 0.059 0.017 Moderate
values of 5 number of clusters. Unhealthy
104 3 0.017 0.007 0.105 0.058 0.012 for Sensitive
As shown in Table 3, the number of days included in the Groups
‘unhealthy’ category is nearly representing the half of the year 173 1 0.004 0.005 0.115 0.061 0.008 Unhealthy
Very
which seems to be overestimated categorization. This means that 19 2 0.014 0.005 0.148 0.113 0.014
Unhealthy
this category should be divided into more categories. Whereas,
the ‘moderate’ category contains only eight days which also TABLE IV. VALUES USING CLUSTER K = 9
seems to be underestimated categorization. Generally, this
Days K=9 Morning Afternoon Evenin Proposed
category is supposed to contain more days. However, Table 4 g Category
shows the values of 9 number of cluster. Class Start end Max Men end
Max
As shown in Table 4, unlike the standard 5 categorization, the 38 5 0.004 0.005 0.06 0.036 0.005 Very Good
9-categorizaiton has the ability to provide better description of 22 6 0.015 0.008 0.09 0.039 0.005 Good
the year’s days. This can be represented by giving more 63 4 0.015 0.006 0.093 0.053 0.007
High
Moderate
categories. For instance, the ‘unhealthy’ category has been 121 2 0.004 0.005 0.096 0.055 0.009 Moderate
splitted into two categories as ‘unhealthy’ and ‘very unhealthy 8 8 0.033 0.016 0.093 0.059 0.017
low
for sensitive group’. These categories have shown reasonable Moderate
Unhealthy
contained number of days. In addition, the category ‘moderate’ for
20 9 0.025 0.007 0.077 0.062 0.014
has been splitted into three categories as ‘high moderate’, Sensitive
‘moderate’ and ‘low moderate’. Similarly, these categories have Groups
Very
contained reasonable number of days. Finally, the category Unhealthy
‘good’ has been also divided into two categories as ‘very good’ 52 1 0.008 0.005 0.115 0.076 0.007 for
and ‘good’. Sensitive
Groups
D. Extracting Patterns Using AAR 21 7 0.019 0.011 0.105 0.078 0.019 Unhealthy
Very
19 3 0.014 0.005 0.148 0.113 0.014
Basically, determining factors that affect the ozone is a Unhealthy
difficult task. Akimoto et al. (Akimoto, et al., 2015) have
conducted a study to analyze the causes of ground-level ozone On the other hand, the remaining 8 rules (i.e. 13-20) are
in Japan using 20 years data. As a conclusion in their study, they associated with a single factor which is CO. In fact, these rules
have surprisingly found out that even though with the decrease are related to the peak or highest rates of ozone. Although there
of NOx and NMHC (i.e. considered as the main causes of is no a direct relation between the CO values and the ozone rates
increasing the ozone), there is still an ongoing increment of however, as a general view, CO is related to the transportation.
ground-level ozone rates. Based on their judgment, they have This can evidence the finding of Akimoto et al. (2015) study
referred the reason to the transportation. which implies that the transportation is one of the main reasons
behind the growth of ground-level ozone rates.
Hence, it is a challenging task to identify the factors that
would affect the ground-level ozone. However, this study TABLE V. SIGNIFICANT PATTERNS EXTRACTED USING AAR
attempts to present an analysis for specific cases of extreme
# Factor 1 Factor 2 => Ozone Confidence
growth of ozone rates. Therefore, association rules approach 1 NOx=0.003 Temp=33.7 => 0.089 1.00
has been used in order to clarify the factors that would increase 2 NOx=0.003 Temp=32.1 => 0.088 1.00
3 NOx=0.003 Temp=30.4 => 0.070 1.00
rates of ozone in Putrajaya, Malaysia. Table 5 depicts the results 4 NOx=0.003 Temp=29.9 => 0.061 1.00
of applying association rules by showing the most significant 5 NOx=0.003 Temp=29.2 => 0.059 1.00
patterns with highest confidences. 6 NOx=0.003 NO2=0.013 => 0.059 1.00
7 NOx=0.003 NO2=0.002 => 0.070 1.00
8 NOx=0.003 CO=1.7 => 0.070 1.00
As we can see in Table 5, the first 12 rules are associated with 9 NOx=0.003 CO=1.61 => 0.061 1.00
two factors whereas the other rules are associated with a single 10 NOx=0.003 CO=0.31 => 0.088 1.00
factor. In particular, the first fifth rules (i.e. 1-5) are associated 11 NOx=0.003 CO=0.27 => 0.086 1.00
with NOx and the Temperature where the increasing of 12 NOx=0.003 CO=0.16 => 0.078 1.00
13 CO=0.75 - => 0.148 1.00
temperature with an NOx = 0.003 leads to increasing in the
14 CO=0.91 - => 0.147 1.00 [10] C.f.D. Control and Prevention (1997) NIOSH pocket guide to chemical
15 CO=0.7 - => 0.143 1.00 hazards. Access 1997).
16 CO=0.44 - => 0.140 1.00
[11] P. Hájek and V. Olej (2012) 'Ozone prediction on the basis of neural
17 CO=0.63 - => 0.140 1.00
networks, support vector regression and methods with uncertainty',
18 CO=0.60 - => 0.139 1.00
19 CO=0.59 - 0.139 1.00
Ecological Informatics, Vol. 12, pp. 31-42 (Access 2012
=>
20 CO=0.78 - => 0.131 1.00 [12] M. Kandil, et al. (2014) 'Prediction of Maximum Ground Ozone Levels
using Neural Network', International Journal of Computing and Digital
V. CONCLUSION Systems, Vol. 3 No. 2, pp. 133-140 (Access 2014
[13] LESTARI 'The Institute for Environment and Development in Malaysia
This study has proposed a pattern extraction method for the and the Asia Pacific' [online] http://www.ukm.my/lestari.
ground-level ozone using the Apriori Association Rules method. [14] P. Li, et al. (2013) 'Observational studies and a statistical early warning
The data used in the experiment has been collected from of surface ozone pollution in Tangshan, the largest heavy industry city of
LESTARI (2016) which is the Institution for Environment and North China', International journal of environmental research and public
Development in Malaysia and the Asia Pacific. In the beginning, health, Vol. 10 No. 3, pp. 1048-1061 (Access 2013
AHC with DTW have been applied upon the dataset in order to [15] T.W. Liao (2005) 'Clustering of time series data—a survey', Pattern
clustering the days based on the ozone levels. Consequentially, recognition, Vol. 38 No. 11, pp. 1857-1874 (Access 2005
the proposed AAR has been applied to extract significant [16] Y. Liu, et al. (2010) 'Understanding of internal clustering validation
measures', Data Mining (ICDM), 2010 IEEE 10th International
patterns. The experiment shows that the extracted patterns are Conference on, IEEE, pp. 911-916.
related to CO which is an interesting relation in accordance to [17] A. Monteiro, et al. (2012) 'Trends in ozone concentrations in the Iberian
the literature. In fact, this study has utilized a ground-level ozone Peninsula by quantile regression and clustering', Atmospheric
data with a single year. Hence, using a dataset with multiple environment, Vol. 56, pp. 184-193 (Access 2012
years in future researches has the ability to identify frequent [18] S. Rani and G. Sikka (2012) 'Recent techniques of clustering of time series
patterns which may facilitate the determination of important data: A Survey', Int. J. Comput. Appl, Vol. 52 No. 15, pp. 1-9 (Access
factors of ozone. 2012
[19] S.J.T.B.K. Ray and K.B. Ensor (2009) 'A Model-based Approach for
ACKNOWLEDGEMENT Clustering Air Quality Monitoring Networks in Houston, Texas', (Access
2009
This study is supported by the University Kebangsaan [20] K. Saithanu and J. Mekparyup (2012) 'Clustering of Air Quality and
Malaysia (UKM) and funded by research grant Meteorological Variables Associated with High Ground Ozone
(FRGS/1/2012/SG05/UKM/02/6). Concentration in the Industrial Areas, at the East of Thailand',
International Journal of Pure and Applied Mathematics, Vol. 81 No. 3, pp.
REFERENCES 505-515 (Access 2012
[1] J. Adame, et al. (2012) 'Application of cluster analysis to surface ozone, [21] M. Sammour and Z. Othman (2016) 'An Agglomerative Hierarchical
NO 2 and SO 2 daily patterns in an industrial area in Central-Southern Clustering with Various Distance Measurements for Ground Level Ozone
Spain measured with a DOAS system', Science of the Total Environment, Clustering in Putrajaya, Malaysia', International Journal on Advanced
Vol. 429, pp. 281-291 (Access 2012 Science, Engineering and Information Technology, Vol. 6 No. 6, pp.
1127-1133 (Access 2016
[2] M. Aghagolzadeh, et al. (2007) 'A Hierarchical Clustering Based on
Mutual Information Maximization', Image Processing, 2007. ICIP 2007. [22] J. Smith, et al. (2013) 'Characterization of Gulf of Mexico background
IEEE International Conference on, pp. I - 277-I - 280. ozone concentrations', 12th Annual CMAS Conference, Chapel Hill, NC.
[3] S.S. Ahmad and N. Aziz (2013) 'Spatial and temporal analysis of ground [23] W. Sun, et al. (2015) 'Prediction of surface ozone episodes using clusters
level ozone and nitrogen dioxide concentration across the twin cities of based generalized linear mixed effects models in Houston–Galveston–
Pakistan', Environmental monitoring and assessment, Vol. 185 No. 4, pp. Brazoria area, Texas', Atmospheric Pollution Research, Vol. 6 No. 2, pp.
3133-3147 (Access 2013 245-253 (Access 2015
[4] AirNow 'Ozone and Your Health' [online] [24] W. Tamas, et al. (2016) 'Hybridization of Air Quality Forecasting Models
https://cfpub.epa.gov/airnow/index.cfm?action=ozone_health.index#req Using Machine Learning and Clustering: An Original Approach to Detect
uest.PDFPath#ozone-c.pdf (Accessed 2016). Pollutant Peaks', Aerosol and Air Quality Research, Vol. 16 No. 2, pp.
405-416 (Access 2016
[5] H. Akimoto, et al. (2015) 'Analysis of monitoring data of ground-level
ozone in Japan for long-term trend during 1990–2010: causes of temporal [25] Y.Y. Toh, et al. (2013) 'The influence of meteorological factors and
and spatial variation', Atmospheric environment, Vol. 102, pp. 302-310 biomass burning on surface ozone concentrations at Tanah Rata,
(Access 2015 Malaysia', Atmospheric environment, Vol. 70, pp. 435-446 (Access 2013
[6] M.L. Bell, et al. (2007) 'Climate change, ambient ozone, and health in 50 [26] R. Vingarzan (2004) 'A review of surface ozone background levels and
US cities', Climatic Change, Vol. 82 No. 1-2, pp. 61-76 (Access 2007 trends', Atmospheric environment, Vol. 38 No. 21, pp. 3431-3442
(Access 2004
[7] M.L. Bell, et al. (2004) 'Ozone and short-term mortality in 95 US urban
communities, 1987-2000', Jama, Vol. 292 No. 19, pp. 2372-2378 (Access [27] S. Wang, et al. (2013) 'New Spatiotemporal Clustering Algorithms and
2004 their Applications to Ozone Pollution', 2013 IEEE 13th International
Conference on Data Mining Workshops, IEEE, pp. 1061-1068.
[8] C. Borgelt (2004) 'Recursion Pruning for the Apriori Algorithm', FIMI.
[28] A. Wilson, et al. (2014) 'Modeling the effect of temperature on ozone-
[9] J.N. Christensen, et al. (2015) 'Unraveling the sources of ground level
related mortality', The Annals of Applied Statistics, Vol. 8 No. 3, pp.
ozone in the Intermountain Western United States using Pb isotopes',
1728-1749 (Access 2014
Science of the Total Environment, Vol. 530, pp. 519-525 (Access 2015

You might also like