
Foundation University Rawalpindi Campus

Department of Software Engineering

Master’s Thesis Proposal

Automated Context based Positive & Negative Spatio-temporal Association Rule Mining

Submitted by:

Mohammad Salman Hafeez

Roll no: F’14AMSCS-005

Supervisor:

Dr. Mohammad Shaheen Khan


Date: January 27, 2016

Academic year 2015/2016

Table of Contents

Abstract

Chapter 1
1.1. Introduction
1.2. Background/Knowledge Gap
1.3. Problem Identification
1.4. Problem Statement
1.5. Objective of Study
1.6. Significance of the Study
1.7. Limitations of the Study

2. Chapter 2
2.1. Literature Review

3. Chapter 3
3.1. Research Methodology and Model
3.2. Research Model
3.3. Hypothesis Statement
3.4. Sample
3.5. Performance Measures
3.6. Procedures and Experimental Scenarios

4. Timeline and Budget

5. References

Abstract

This proposal presents a novel approach to mining both positive and negative association rules in
spatio temporal databases with automated extraction of the context variable. Many researchers
currently apply the Apriori algorithm to spatial databases, but this algorithm does not exploit the
strengths of both positive and negative association rules and does not provide time series
analysis; hence it fails to spot interesting and useful associations present in the data. In
compact spatial databases, negative association rules far outnumber positive rules and therefore
need to be managed. Discovering both positive and negative association rules and then pruning out
the uninteresting rules consumes resources without much improvement in the overall accuracy of the
knowledge discovery process. The associations among different objects and patterns depend strongly
on the context, where context is the state of the entity, environment or action. We propose a new
approach for spatial association rule mining from datasets projected at a temporal bar, in which
the automatically extracted contextual situation is considered while generating positive and
negative frequent itemsets. The algorithm for positive and negative association rule mining is
based on the Apriori algorithm, which is extended to automate the context variable and to simulate
temporal series of spatial inputs.

Chapter 1
1.1 Introduction
Data mining, the extraction of hidden predictive information from large databases, is a powerful
technology with great potential to help companies focus on the most important information in their
data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make
proactive, knowledge-driven decisions, and can answer business questions that traditionally were
too time consuming to resolve. They scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations [2]. Although data
mining is a relatively new term, the technology is not. Companies have for years used powerful
computers to sift through volumes of supermarket scanner data and analyze market research reports.
However, continuous innovations in computer processing power, disk storage, and statistical
software dramatically increase the accuracy of analysis while driving down the cost.

Data mining relies on the analysis of historical data, and the quality of the knowledge derived
generally improves with the amount of data available. As the amount of data increases, the
reliability of the patterns extracted from it increases. A pattern extracted from a hundred
customer records may not reflect stable customer behavior, while a pattern extracted from thousands
of records is considered a more reliable gauge of customer behavior.

Association rule mining is one of the most important and well researched techniques of data
mining. It aims to extract interesting correlations, frequent patterns, associations or causal
structures among sets of items in transaction databases or other data repositories. Association
rules are widely used in areas such as telecommunication networks, market and risk management, and
inventory control [3]. Association rule mining research typically focuses on positive association
rules (PARs), generated from frequently occurring itemsets. Researchers now also focus on finding
alternative patterns such as unexpected patterns, exceptional patterns, and strong negative
associations.

A strong negative association refers to a negative relation between two itemsets. This negative
relation implies a negative rule between the two itemsets. Decision making in many applications,
such as product placement and investment analysis, often involves a number of factors, some of
which play beneficial roles while others play harmful roles. We need to minimize the harmful
impacts as well as maximize the possible benefits. Association rules are popular because they apply
to many types of data, such as numeric, ordinal, spatial and multimedia data.
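
To make the distinction between positive and negative rules concrete, the following minimal sketch computes support and confidence for a positive rule (A => B) and for a negative rule (A => not B). The function names and the toy baskets are illustrative assumptions and are not part of the proposed algorithm.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence_positive(transactions, antecedent, consequent) -> float:
    """conf(A => B) = supp(A and B together) / supp(A)."""
    supp_a = support(transactions, antecedent)
    if supp_a == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / supp_a


def confidence_negative(transactions, antecedent, consequent) -> float:
    """conf(A => not B) = 1 - conf(A => B), defined only when supp(A) > 0."""
    if support(transactions, antecedent) == 0:
        return 0.0
    return 1.0 - confidence_positive(transactions, antecedent, consequent)


# Hypothetical toy baskets, not taken from any dataset in this proposal.
BASKETS = [frozenset(b) for b in (
    {"tea", "sugar"},
    {"tea", "sugar", "milk"},
    {"coffee", "sugar"},
    {"coffee", "milk"},
    {"tea", "milk"},
)]

print(confidence_positive(BASKETS, {"tea"}, {"sugar"}))   # positive rule: tea => sugar
print(confidence_negative(BASKETS, {"tea"}, {"coffee"}))  # negative rule: tea => not coffee
```

In this toy data tea and coffee never occur together, so the negative rule holds in every basket that contains tea, while no positive rule between the two items could ever be found.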

1.2 Background Knowledge


A spatial database, also known as a geodatabase, is a database that is optimized to store and query
data that represents objects defined in a geometric space. Most spatial databases allow the
representation of simple geometric objects such as points, lines and polygons. A spatial
association rule describes the implication of a feature or a set of features by another set of
features in spatial databases [5].

A temporal database is a database with built-in support for handling data involving time. It is
related to the slowly changing dimension concept: by attaching a time period to the data, it is
possible to store different database states. A spatiotemporal database is a database that manages
both space and time information. Real world applications such as location based services and
geographic information systems need to store real world data that shows spatial as well as temporal
characteristics [6].

Many data objects in the real world have attributes related to both space and time, and managing
them using an existing RDBMS is complex and inefficient, as objects that show spatio-temporal
behavior are multi-dimensional in nature. For example, an object that changes its geometry over its
course exhibits both spatial and temporal qualities, as both its shape and its location can change
at different points in time. There is a need to store these objects and to view them as they were
at any particular point in time [7].

Spatial association rule mining can be defined as the extraction of implicit associations, spatial
relations or other special patterns not explicitly stored in spatial databases. If these associations
are extracted by time series analysis of spatial data over a certain period of time, the process will
be called spatio temporal association rule mining. [9]

Spatial association rule mining is more complex than conventional association rule mining;
therefore, the methods for spatial association rule mining are distinctive. The majority of
researchers use popular data mining methods such as the Apriori algorithm for association rule
mining on spatial data. The Apriori algorithm has its own limitations, such as repeated scans of
the data when finding frequent itemsets. Another prominent fact is that researchers generally focus
on mining positive association rules, neglecting that negative association rules can be valuable in
general and for spatial data in particular. Multiple scans of a database can cause a large number
of iterations, particularly when the dataset is as dense as it is in spatial databases. To
eliminate multiple scans of a database, several techniques are used, such as P-Tree [1], T-Tree [1]
and FP-Trees [10].

Positive rules represent positive associations among data items while negative associations are
ignored, but negative associations can lead to potentially interesting results which might be
helpful in decision making. The extraction of both positive and negative association rules from
datasets produces a very large number of rules. To overcome this, researchers incorporate pruning
strategies in frequent-itemset algorithms; e.g. Sharma et al. [11] used an ‘interestingness’ measure.

Spatio temporal databases inherit the complexity of spatial data with the additional complexity of
its evaluation over different periods of time. Evaluating spatial data over time is computationally
expensive, particularly for real time processing [12]. Spatio temporal datasets are generally
evaluated using the Apriori technique or modified Apriori techniques, together with the underlying
complexities of spatial association rule mining.

Context is the variable that corresponds to the state of the entity, environment and action. For
example, while studying satellite images, an association is observed in which the color of
vegetation pales whenever there are hydrocarbons in the vicinity of that vegetation. The
association can be false if the pale color of the vegetation is caused by abnormal temperatures. In
this case, temperature is the context variable which, if ignored, would present false associations
[13].
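
The vegetation example can be illustrated with a small sketch. It compares the confidence of the rule "pale vegetation => hydrocarbons nearby" over all cells with the confidence over only those cells whose temperature lies inside a context range. The `Cell` record, the sample values and the 15 to 30 degree range are hypothetical and chosen only for illustration; they do not come from [13].

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    """One image cell (hypothetical record layout)."""
    pale_vegetation: bool
    hydrocarbons: bool
    temperature_c: float  # the context variable


def confidence_pale_implies_hydrocarbons(cells: List[Cell]) -> float:
    """conf(pale vegetation => hydrocarbons) over the given cells."""
    pale = [c for c in cells if c.pale_vegetation]
    if not pale:
        return 0.0
    return sum(1 for c in pale if c.hydrocarbons) / len(pale)


def within_context(cells: List[Cell], low: float, high: float) -> List[Cell]:
    """Keep only cells whose context value lies in [low, high]."""
    return [c for c in cells if low <= c.temperature_c <= high]


# Made-up sample: three cells are pale only because of extreme heat.
CELLS = [
    Cell(True, True, 22.0),
    Cell(True, True, 24.0),
    Cell(True, True, 21.0),
    Cell(False, False, 23.0),
    Cell(True, False, 41.0),
    Cell(True, False, 43.0),
    Cell(True, False, 44.0),
    Cell(False, False, 25.0),
]

print(confidence_pale_implies_hydrocarbons(CELLS))
print(confidence_pale_implies_hydrocarbons(within_context(CELLS, 15.0, 30.0)))
```

Ignoring temperature dilutes the rule with cells that are pale for an unrelated reason; restricting the data to the context range recovers the association for normal conditions.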

1.3 Problem Identification


After studying several research papers from different journals and conferences, we have concluded
that mining both positive and negative association rules is important for decision making. Spatio
temporal databases inherit the complexity of spatial data with the additional complexity of its
evaluation over different periods of time. Context is the variable that represents the state of the
entity, its environment and action. There exist certain sets of contextual situations that have an
effect on the state of the system from which association rules are mined. In such situations,
association rule mining does not truthfully represent the associations among different entities.
Thus context must be an important consideration in spatio temporal datasets.

1.4 Problem Statement
The context variable represents the state of the entity, its environment and action. Certain
situations can affect the state of the system from which we are mining association rules, which
ultimately results in incorrect associations among different entities. We must consider context in
spatio temporal datasets so that the mined rules give a true representation of the state of the
system.

Currently, the context variables are selected manually by user input, and their initial and final
values are also user driven. We therefore need to develop an algorithm that treats the context
variable prominently in spatio temporal datasets, selects the context variables automatically, and
also defines their initial and final values automatically.

1.5 Objective of Study


We will develop an algorithm which will automatically identify the context variable in spatio
temporal positive and negative association rule mining. The identification of context variable
will be purely based on the algorithm with no user input. The definition of initial and final value
of the context variable will be purely algorithmic.

1.6 Significance of Study


The most valued outcome of association rule mining is a smaller set of precise rules. The number of
positive and negative rules becomes huge as the data grows, and not all of the rules are of
interest. By emphasizing the context variable we can identify precise and true associations among
different entities, which helps in better decision making. To minimize user effort, the context
variable and its initial and final values will be identified automatically.

Chapter 2
2.1 Literature Review
Wu et al. [14] first describe why negative rules are important and then address two key problems of
negative association rule mining: how to search effectively for interesting itemsets, and how to
identify negative association rules of interest. A pruning strategy is used to search efficiently
for interesting itemsets, and an interestingness function is devised based on a threshold value: if
interest(X, Y) >= mi, the rule X -> Y is of potential interest, and X U Y is referred to as a
potentially interesting itemset. For mining negative association rules, a heuristic based approach
is used that does not examine all possible negative itemsets. Suppose A, B and C are frequent
itemsets and D is an infrequent itemset. [ABC] is of interest even if it is an infrequent 3-itemset,
whereas D is not considered for either a positive or a negative association of interest by their
heuristic. A rule of the form A => !B, !A => B or !A => !B is a negative rule of interest only when
A and B are frequent itemsets. The results of the proposed approach for mining both positive and
negative association rules of interest are promising; the positive association rules mined by the
proposed model are identical to those mined by the support-confidence framework proposed in [Agrawal
et al. 1993b].
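
The interest-based pruning described above can be sketched as follows. The sketch assumes that the interest function takes the commonly cited form interest(X, Y) = |supp(X U Y) - supp(X) * supp(Y)|; the exact definition should be checked against [14]. The supports of the negated itemsets follow from set identities. All function names are illustrative.

```python
from typing import Dict


def interest(supp_xy: float, supp_x: float, supp_y: float) -> float:
    """Assumed form: |supp(X U Y) - supp(X) * supp(Y)| (to be verified against [14])."""
    return abs(supp_xy - supp_x * supp_y)


def is_potentially_interesting(supp_xy: float, supp_x: float,
                               supp_y: float, mi: float) -> bool:
    """The rule X -> Y is kept for further consideration when interest >= mi."""
    return interest(supp_xy, supp_x, supp_y) >= mi


def negative_supports(supp_x: float, supp_y: float, supp_xy: float) -> Dict[str, float]:
    """Supports of the three negative forms, derived from the positive supports."""
    return {
        "X => !Y": supp_x - supp_xy,                  # X present, Y absent
        "!X => Y": supp_y - supp_xy,                  # Y present, X absent
        "!X => !Y": 1.0 - supp_x - supp_y + supp_xy,  # both absent
    }


# Hypothetical supports: X and Y are each frequent but rarely occur together.
print(interest(0.125, 0.5, 0.5))
print(is_potentially_interesting(0.125, 0.5, 0.5, mi=0.1))
print(negative_supports(0.5, 0.5, 0.125))
```

With these example values the pair occurs together less often than independence would predict, so it passes the interest threshold and is a candidate for a negative rule rather than a positive one.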

Agrawal et al. [15] developed the Apriori algorithm. The algorithm is based upon support and
confidence measures and is considered to be part of the support-confidence framework for
association rule mining. In the Apriori algorithm, the itemsets are scanned in multiple passes and
thresholded by their support value. The rules that remain at the end of the passes satisfy the
specified support and confidence values.
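
A minimal sketch of this level-wise scheme is given below: each pass scans the transactions once, keeps the candidates that meet the minimum support, and builds the next level of candidates from them. It is a compact textbook-style illustration with made-up data, not the optimized procedure of [15], and the function names are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def apriori(transactions: List[Itemset], min_support: float) -> Dict[Itemset, float]:
    """Return every frequent itemset with its support, one scan of the data per pass."""
    n = len(transactions)
    if n == 0:
        return {}
    frequent: Dict[Itemset, float] = {}
    candidates = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    k = 1
    while candidates:
        counts = {c: 0 for c in candidates}
        for t in transactions:                      # one pass over the data
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        prev = list(level)
        joined = {a | b for a in prev for b in prev if len(a | b) == k}   # join step
        candidates = [c for c in joined                                    # prune step
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def rules(frequent: Dict[Itemset, float],
          min_confidence: float) -> List[Tuple[Itemset, Itemset, float, float]]:
    """Derive rules (antecedent, consequent, support, confidence) from frequent itemsets."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                a = frozenset(ante)
                conf = supp / frequent[a]   # subsets of a frequent itemset are frequent
                if conf >= min_confidence:
                    out.append((a, itemset - a, supp, conf))
    return out


# Hypothetical transactions for illustration only.
DATA = [frozenset(t) for t in ("abc", "ab", "ac", "bc", "abc")]
FREQUENT = apriori(DATA, min_support=0.6)
for antecedent, consequent, supp, conf in rules(FREQUENT, min_confidence=0.7):
    print(sorted(antecedent), "=>", sorted(consequent), round(supp, 2), round(conf, 2))
```

The repeated scan inside the `while` loop is the cost that tree-based structures aim to avoid.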

Tamir and Singer [16] classified interestingness measures into subjective and objective measures.
The objective measures are based upon statistical methods and include generality, peculiarity,
diversity and surprisingness. The subjective measures are based upon the understanding of domain
experts and include novelty, utility, applicability, etc. To prune association rules, reducing the
number of output rules and improving their utility, an interestingness measure (confidence gain) is
proposed. Other interestingness measures for Apriori-based rules are summarized by Shaharanee et
al. [17] and Geng and Hamilton [18].

Sharma et al. [11] observed that researchers have mostly focused on positive rule mining for
spatial databases, although negative rule mining can play a vital role in decision making and
therefore needs to be considered for spatial databases as well. They proposed a novel approach for
mining spatial positive and negative association rules. The approach applies multiple level spatial
mining methods to extract interesting patterns in spatial and/or non-spatial predicates. Data and
spatial predicates/relationships are organized as set hierarchies so that they can be mined level
by level, as required for multilevel spatial positive and negative association rules. A pruning
strategy is used in their approach to reduce the search space efficiently, and further efficiency
is gained through an interestingness measure. Spatial association rules are defined for the
identification of interesting itemsets, and an algorithm is introduced for mining spatial
association rules in large spatial databases. The input consists of a spatial database, a mining
query and a set of thresholds: a spatial database SDB and a set of concept hierarchies; a query
consisting of a reference class, a set of task relevant classes for spatial objects and a set of
task relevant spatial relations; and three threshold values, namely minimum support, minimum
confidence, and minimum interestingness. The output of the algorithm is the set of strong spatial
positive and negative association rules for the relevant sets of objects and relations. The
execution time of the algorithm increases with the number of objects in the database, but the
increase for a large number of positive and negative association rules is not enormous in view of
the fact that the number of negative associations is reasonably large. The proposed algorithm is
efficient for mining multiple level, potentially interesting spatial positive and negative
association rules in spatial databases. It explores techniques at multiple approximation and
abstraction levels. The interestingness measure greatly reduces the number of associations that
need to be considered.

A spatio-temporal association rule is one in which there is a spatio-temporal relationship between
the antecedent and the consequent of the rule. Mennis and Wei Liu [19] applied spatio temporal
association rule mining to urban growth data of Denver, USA. The study focuses on four spatio
temporal variables of the extracted polygons (change in land cover, change in percent minority,
change in poverty, and density of developed land over a certain period of time). The results are
then presented in a multi-level hierarchy.

Shu et al. [20] devised a unique rule pruning property for high traffic regions. The resulting
algorithm is applied to sparse and compact datasets, and good precision is observed in the results.
Geospatial and temporal data have also been mined for the sake of environmental change monitoring.
In order to assess regional vegetation and climate change, a framework based upon kriging
interpolation, wavelet multi-resolution analysis, fuzzy c-means clustering and Apriori rules is
developed. In this study, weather observation data, precipitation data and air temperature data are
collected at multiple time intervals from Chinese sites. The data is interpolated and clustered in
the conceptual development phase. The association rules are then extracted and pruned on the basis
of user constraints and specifications.

Compieta et al. [21] note that efforts to develop spatio temporal association rule mining methods
are based on conventional knowledge discovery techniques. The concepts of ‘localizer’ and ‘miner’
are proposed, wherein the localizer deals with the spatial and temporal dimensions of the data and
the miner processes the data on the basis of these spatio temporal relationships. The study also
focuses on better visualization of spatio temporal association rules and provides two interactive
interfaces.

Most of the extant studies relate to the application of association rule mining to sparse and
dense spatial data. There has been some research on extracting both positive and negative
association rules from spatial data and then pruning these rules at multiple levels. However,
context, a very important aspect affecting the results of spatio temporal association rules, has
received little attention. In a study by Tang et al. [22], context is considered for spatio
temporal market basket analysis. A set of contexts is derived from the time and space hierarchy,
and those association rules are extracted which meet the support and confidence thresholds in all
contexts. Decision makers can then analyze the market baskets at different hierarchy levels.
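
The idea of keeping only rules that satisfy the thresholds in every context can be sketched as below. This is an illustration of the criterion as summarized above, not the algorithm of [22]; the (region, period) context key, the function names and the sample baskets are assumptions.

```python
from typing import Dict, FrozenSet, List, Tuple

Context = Tuple[str, str]  # hypothetical key: (region, period)
Itemset = FrozenSet[str]


def _support(transactions: List[Itemset], items: Itemset) -> float:
    """Fraction of transactions that contain all of `items`."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if items <= t) / len(transactions)


def holds_in_all_contexts(by_context: Dict[Context, List[Itemset]],
                          antecedent: Itemset, consequent: Itemset,
                          min_support: float, min_confidence: float) -> bool:
    """True only if the rule meets both thresholds in every context."""
    if not by_context:
        return False
    for transactions in by_context.values():
        supp_a = _support(transactions, antecedent)
        supp_ab = _support(transactions, antecedent | consequent)
        if supp_a == 0 or supp_ab < min_support or supp_ab / supp_a < min_confidence:
            return False
    return True


# Made-up baskets grouped by context.
BY_CONTEXT = {
    ("north", "2015"): [frozenset("xy"), frozenset("xy"), frozenset("x"), frozenset("z")],
    ("south", "2015"): [frozenset("xy"), frozenset("z"), frozenset("z"), frozenset("xy")],
}

X, Y = frozenset("x"), frozenset("y")
print(holds_in_all_contexts(BY_CONTEXT, X, Y, min_support=0.4, min_confidence=0.6))
print(holds_in_all_contexts(BY_CONTEXT, X, Y, min_support=0.4, min_confidence=0.9))
```

A rule that is strong in one context but weak in another is rejected, which is what distinguishes this criterion from mining the pooled data once.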

Shaheen et al [13] introduced a new approach to mine context based positive and negative spatial
association rules. Many researchers are currently using an Apriori algorithm on spatial databases
but this algorithm does not utilize the strengths of positive and negative association rules and of
time series analysis, hence it misses the discovery of very interesting and useful associations
present in the data. They proposed an approach for spatial association rule mining from datasets
projected at a temporal bar in which the contextual situation is considered while generating
positive and negative frequent itemsets. An extended algorithm based on the Apriori approach is
developed and compared with existing spatial association rule algorithms. The algorithm for
positive and negative association rule mining is based on Apriori algorithm which is further
extended to include context variable and simulate temporal series spatial inputs. Context
variables are used as an influencing variable.

Chapter 3
3.1 Research Methodology and Model
An empirical research methodology will be followed in this research. First, experiments will be
performed. Second, we will study the results of our algorithm with respect to the number of rules,
the average confidence of rules, and the total time taken to extract association rules (execution
time). These results will be compared with the results produced by Apriori and by the
positive/negative association rule mining algorithm. The automated extraction of association rules
on the basis of context is a novel approach and has its own impediments, which is why the
comparison may not fully manifest the significance of the approach; the comparison is made in order
to compare the use of an existing technique on spatio temporal data with the proposed technique. In
accordance with our research problem we will develop a hypothesis, which we will accept or reject
on the basis of the results accumulated throughout the experimentation phase of the thesis.

An energy dataset from the U.S. Energy Information Administration will be used, and we will apply
our proposed algorithm to it. The proposed technique is an extended form of the algorithm proposed
by Shaheen et al. [13]. The automated extraction of the context variable will be purely
algorithmic.

3.2 Research Model


Groot et al [ ] proposed empirical research and we will use this empirical research model in our
research. Figure 1 shows the empirical research model.

Observation: Facts are gathered through observation.

Induction: Hypothesis is formulated.

Deduction: Abstracting consequences of hypothesis as testable predictions.

Testing: Hypothesis is tested.

Evaluation: The results are evaluated.

Figure 1: Empirical research model [ ]

3.3 Hypothesis Statement


Can we develop an algorithm which will mine both positive and negative association rules from
spatio temporal database with automated context variable extraction?

3.4 Sample Dataset


The proposed algorithm is specifically designed for spatio temporal data, for which satellite
images collected over a certain period of time may better reflect the execution performance of the
algorithm. The dataset is collected from the U.S. Energy Information Administration (EIA) and is
available at http://www.eia.gov/opendata/excel/. The EIA provides free and open data through an
Application Programming Interface (API) and open data tools. The data in the API is also available
in bulk file, in Excel via the add-in, and via widgets that embed interactive data visualizations
of EIA data on any website. By making EIA data available in machine-readable formats, the
creativity in the private, the non-profit, and the public sectors can be harnessed to find new ways
to innovate and create value-added services powered by public data.
Currently, EIA's API contains the following main data sets:
• 11,989 natural gas series and associated categories.
• 115,052 petroleum series and associated categories.
• 30,000 State Energy Data System series organized into 600 categories.
• 34,790 U.S. crude imports series and associated categories.
• 92,836 International energy series (released May 6, 2015) [23]

This map was created by the National Renewable Energy Laboratory for the Department of
Energy (April 1, 2011).

U.S. Energy Information Administration / Annual Energy Review 2011 [23].

3.5 Performance Measures


The performance of the proposed algorithm is compared with that of the Apriori algorithm as well as
with the algorithm to mine positive and negative association rules. The comparison is made with
respect to the following:

1. No. of spatio temporal association rules.

2. Execution time.

3. Confidence of rules.

4. Relevance of rules for domain experts.

5. Automated extraction of context variable.

6. Contextual outliers value setting i.e. initial and final value.

7. Comparison of the proposed technique with the algorithm of Shaheen et al. [13].

We will study the results of our algorithm with respect to the number of rules, the average
confidence of rules, the total time taken to extract association rules (execution time) and the
opinion of domain experts. The extraction of association rules on the basis of automated extraction
of the context variable is a novel approach and has its own impediments. That is why the comparison
may not fully manifest the significance of the approach.
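
The three quantitative measures (number of rules, average confidence and execution time) could be collected with a small harness such as the sketch below. The `Rule` tuple layout and the `measure` function are hypothetical; any of the compared miners would be wrapped in a zero-argument callable. Relevance for domain experts is a qualitative measure and is not covered here.

```python
import time
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical rule layout: (antecedent, consequent, confidence).
Rule = Tuple[str, str, float]


def measure(miner: Callable[[], Sequence[Rule]]) -> Dict[str, float]:
    """Run one mining algorithm and record the quantitative performance measures."""
    start = time.perf_counter()
    mined = list(miner())
    elapsed = time.perf_counter() - start
    return {
        "rule_count": len(mined),
        "avg_confidence": mean(r[2] for r in mined) if mined else 0.0,
        "execution_seconds": elapsed,
    }


def fake_miner() -> Sequence[Rule]:
    """Stand-in for a real algorithm; returns fixed, made-up rules."""
    return [("A", "B", 0.8), ("A", "!C", 0.6)]


print(measure(fake_miner))
```

Running the same harness over Apriori, the positive/negative miner and the proposed algorithm on identical input would give directly comparable figures.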

3.6 Procedure / Experimental Scenario.


The algorithm will be applied to multiple thematic maps from different times, which will be
collected from the energy datasets at http://www.eia.gov/. The thematic maps will be mapped onto a
geographic information system (GIS). Rectification using cartographic mapping will be performed in
ESRI’s ArcGIS 9.2, whereas the spatial database will be managed in Microsoft SQL Server.

The procedure discovers all frequent and infrequent itemsets from the spatial database on the basis
of a context variable. The automatically extracted context variable will change according to a
mechanism that will be identified during the data mining process, i.e. on the basis of the most
frequent negative/positive itemsets or as specified by the domain experts.

In the first step of the proposed algorithm, positive frequent itemsets and initial negative
frequent itemsets are initialized.
In the second step, the range for the context variable, which spans from the initial value to the
final value, is set automatically by the algorithm.
In the third step, predicates are extracted from the spatial database and stored in order of time
intervals.
In the fourth step, candidate sets and subsets are obtained as per the Apriori algorithm.
In the fifth step, the support value calculation function is called to produce frequent itemsets.
In the sixth step, infrequent itemsets are produced.
In the seventh and eighth steps, uninteresting itemsets are eliminated from both the frequent and
the infrequent itemsets.
In the ninth and final step, the output is produced.
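
The steps above can be sketched end to end as follows. This is only an illustrative outline under strong assumptions and not the proposed algorithm: the automatic context range is a placeholder (mean plus or minus one standard deviation), candidates are limited to pairs of predicates, infrequent pairs are kept only when both members are individually frequent, and interest is assumed to be |supp(AB) - supp(A) * supp(B)|. All names and sample values are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from statistics import mean, pstdev
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Observation:
    interval: int                # index of the time interval
    predicates: FrozenSet[str]   # spatial predicates that hold for this record
    context: float               # value of the context variable


def auto_context_range(observations: List[Observation]) -> Tuple[float, float]:
    """PLACEHOLDER rule (assumption): mean +/- one population standard deviation."""
    values = [o.context for o in observations]
    m, s = mean(values), pstdev(values)
    return m - s, m + s


def mine(observations: List[Observation], min_support: float, min_interest: float):
    positive: Dict[FrozenSet[str], float] = {}     # step 1: initialize both result sets
    negative: Dict[FrozenSet[str], float] = {}
    if not observations:
        return positive, negative
    low, high = auto_context_range(observations)   # step 2: initial and final value
    ordered = sorted(observations, key=lambda o: o.interval)   # step 3: time order
    rows = [o.predicates for o in ordered if low <= o.context <= high]
    if not rows:
        return positive, negative

    def supp(items: FrozenSet[str]) -> float:
        return sum(1 for r in rows if items <= r) / len(rows)

    singles = sorted({p for r in rows for p in r})
    for a, b in combinations(singles, 2):          # step 4: candidates (pairs only)
        pair = frozenset((a, b))
        s_a, s_b, s_ab = supp(frozenset([a])), supp(frozenset([b])), supp(pair)
        if abs(s_ab - s_a * s_b) < min_interest:   # steps 7 and 8: drop uninteresting
            continue
        if s_ab >= min_support:                    # step 5: frequent itemsets
            positive[pair] = s_ab
        elif s_a >= min_support and s_b >= min_support:   # step 6: infrequent itemsets
            negative[pair] = s_ab
    return positive, negative                      # step 9: output


# Made-up observations; the last one has an outlying context value.
OBS = [
    Observation(0, frozenset({"near_road", "urban"}), 20.0),
    Observation(0, frozenset({"near_road", "urban"}), 21.0),
    Observation(1, frozenset({"near_road", "urban"}), 22.0),
    Observation(1, frozenset({"forest"}), 23.0),
    Observation(2, frozenset({"forest"}), 24.0),
    Observation(2, frozenset({"forest", "near_road"}), 22.0),
    Observation(3, frozenset({"urban", "forest"}), 45.0),
]

POSITIVE, NEGATIVE = mine(OBS, min_support=0.4, min_interest=0.1)
print(POSITIVE)
print(NEGATIVE)
```

In this toy run the record with the outlying context value is excluded before any itemset is counted, which is the role the context range plays in the procedure. The actual mechanism for choosing the range is the subject of the proposed research and is not implied by this placeholder.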

Flowchart for mining context based positive and negative spatio temporal association rules. [13]

4. Timeline & Budget
[Figure: Gantt chart of activities 1 to 5 against a time axis of September 2015, November 2015,
January 2016, February 2016, March 2016 and June 2016]

1. Identification of research area.
2. Literature Review.
3. Problem identification.
4. Experimentation and explanation of results.
5. Final Thesis Documentation.

5. References
[1] Data Mining and Analysis: Fundamental Concepts and Algorithms.

[2] Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management.

[3] T. Abraham, J.F. Roddick, Survey of spatio-temporal databases, GeoInformatica 3 (1) (1999) 61–99.

[4] Efficient Mining of Both Positive and Negative Association Rules XINDONG WU University of Vermont
CHENGQI ZHANG

[5] F. Coenen, P. Leng, S. Ahmed, Data structure for association rule mining, IEEE Transactions on Knowledge and
Data Engineering. 16 (6) (2004) 774–779.

[6] F. Verhein, S. Chawla, Mining spatio-temporal association rules, sources, sinks, stationary regions and
thoroughfares in object mobility databases, LNCS 3882 (2006) 187–201

[7] G. Marakas, Decision Support Systems, second ed., Prentice Hall, NJ, 2003

[8] H. Shu, X. Zhu, S. Dai, Mining association rules in geographical spatio-temporal data, The International
Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 37 (B2) (2008) 225–228.

[9] J. Chen, F. Huang, R. Wang, Y. Jin, A research about spatial association rule mining based on concept lattice, in:
International Conference on Wireless Communication, Networking and Mobile Computing, China, 2007, pp. 5979–5982.

[10] Z. He, S. Deng, X. Xu, An FP tree based approach for mining all strongly correlated item pairs, Lecture Notes
in Computer Science (2006) 735–740.

[11] L.K. Sharma, O.P. Vyas, U.S. Tiwary, R. Vyas, A Novel Approach of Multilevel Positive and Negative
Association Rule Mining for Spatial Databases, Springer-Verlag, Berlin, 2005. pp. 620–629.

[12] S.U. Calargan, A. Yazici, Fuzzy association rule mining from spatio temporal data, LNCS 5072 (2008) 631–646.

[13] M. Shaheen et al., Context based positive and negative association rule mining, Knowledge-Based
Systems 37 (2013) 261–273.

[14] X. Wu, C. Zhang, S. Zhang, Efficient mining of both positive and negative association rules, ACM
Transactions on Information Systems 22 (3) (2004) 381–405.

[15] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th VLDB
Conference, Chile, 1994, pp. 487–499.

[16] R. Tamir, Y. Singer, On a confidence gain measure for association rule discovery and scoring, The VLDB
Journal 15 (1) (2006) 40–52.

[17] I.N.M. Shaharanee, F. Hadzic, T.S. Dillon, Interestingness measures for association rules based on statistical
validity, Knowledge-Based Systems 24 (3) (2010) 386–392.

[18] L. Geng, H.J. Hamilton, Interestingness measures for data mining: a survey, ACM Computing Surveys 38 (3)
(2006) 1–32.

[19] J. Mennis, J. Wei Liu, Mining association rules in spatio-temporal data: an analysis of urban socioeconomic and
land cover change, Transactions in GIS 9 (1) (2005) 5–17.

[20] H. Shu, X. Zhu, S. Dai, Mining association rules in geographical spatio-temporal data, The International
Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 37 (B2) (2008) 225–228.

[21] P. Compieta, S.D. Martino, M. Bertolotto, F. Ferrucci, T. Kechadi, Exploratory spatio-temporal data mining
and visualization, Journal of Visual Languages and Computing 18 (2007) 255–279.

[22] K. Tang, Y.L. Chen, H.W. Hu, Context-based market basket analysis in multiple store environment, Science
Direct; Decision Support Systems 45 (2008) 150– 163.

[23] http://www.eia.gov/ energy datasets.

[24] Association Rules Mining: A Recent Overview by Sotiris Kotsiantis GESTS International Transactions on
Computer Science and Engineering, Vol.32 (1), 2006, pp. 71-16

[25] T Ramakrishnudu Mining Positive and Negative Association Rules Using FII-Tree (IJACSA) International
Journal of Advanced Computer Science and Applications, Vol. 4, No. 9, 2013

[26] Spatio-temporal Databases in Urban Transportation Ouri Wolfson and Bo Xu University of Illinois, Chicago,
IL 60607

[27] F. Coenen, Data Mining: Past, Present and Future, The Knowledge Engineering Review, Vol. 00:0, 1–24,
© 2004, Cambridge University Press.

[28] An Introduction to Spatial Database Systems Ralf Hartmut Güting Praktische Informatik IV, FernUniversität
Hagen D-58084 Hagen, Germany
