
Mining Association Rules Using Genetic Algorithm: The Role of Estimation Parameters

Abstract – Genetic Algorithms (GAs) have emerged as practical, robust optimization and search
methods for generating accurate and reliable association rules. The main motivation for using GAs
in the discovery of high-level prediction rules is that they perform a global search and cope better
with attribute interaction than the greedy rule induction algorithms often used in data mining.
The performance of a GA depends greatly on its parameters, namely the population size,
crossover rate, mutation rate, fitness function, and selection method. The objective of this
paper is to compare the performance of the genetic algorithm for association rule mining while
varying these parameters. Tests on three datasets indicate that the efficiency depends not only
on the GA parameters but also on the size of the dataset and the relationships between its
attributes.

Keywords : Association rules, Genetic Algorithm, GA parameters.

1. Introduction

Data mining, also referred to as knowledge discovery in databases, is the nontrivial
extraction of implicit, previously unknown, and potentially useful information (such as
knowledge rules, constraints, and regularities) from data in databases. Data mining combines
theory and technology from several domains, including artificial intelligence, machine learning,
statistics, and neural networks [1]. Association rule mining is a major area in data mining that
discovers relations between different attributes by analyzing and processing the data in a database.

Many algorithms for generating association rules have been developed over time. Some well-known
algorithms are Apriori, Eclat, and FP-Growth. This paper analyses the mining of
association rules by applying genetic algorithms. There have been several attempts to mine
association rules using genetic algorithms. Robert Cattral et al. [2] describe the evolution of a
hierarchy of rules using a genetic algorithm with chromosomes of varying length and macro-mutations.
Manish Saggar et al. [3] propose an algorithm with binary encoding whose fitness
function is based on the confusion matrix.

A genetic algorithm based on the concept of the strength of implication of rules was presented by
Zhou et al. [10]. Genxiang et al. [9] introduced dynamic immune evolution and the biometric
mechanisms of engineering immune computing, namely immune recognition, immune memory,
and immune regulation, into the GA for mining association rules.
Gonzales, E. et al. [11] introduced the Genetic Relation Algorithm (GRA), based on evaluating
the distances between rules; the distance is calculated using two matching criteria, complete
match and partial match. Genetic algorithms easily suffer premature convergence, or take too
much time to converge, during the evolution process. Hong Guo et al. [12] propose a GA
with adaptive mutation and individual-based selection to avoid premature convergence.
A brief introduction to association rule mining and GAs is given in Section 2, followed
by the methodology in Section 3, which describes the basic implementation details of association
rule mining with a GA. Section 4 presents the parameters that determine the efficiency of the
algorithm. Section 5 presents the experimental results, followed by the conclusion in the last section.

2. Association Rules and Genetic Algorithms

Association Rules
Association rule mining is a popular and well-researched method for discovering interesting
relations between variables in large databases. It studies the frequency of items occurring
together in transactional databases and, based on a threshold called support, identifies the
frequent itemsets. Another threshold, confidence, which is the conditional probability that an
item appears in a transaction when another item appears, is used to pinpoint association rules.

The discovered association rules are of the form: P→Q [s,c], where P and Q are conjunctions
of attribute value-pairs, and s (for support) is the probability that P and Q appear together in a
transaction and c (for confidence) is the conditional probability that Q appears in a transaction
when P is present.
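As an illustration, support and confidence can be computed directly from a toy transaction database (the transactions and item names below are invented for the example):

```python
# Toy transaction database; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """Conditional probability that `consequent` appears given `antecedent`."""
    return support(antecedent | consequent, db) / support(antecedent, db)

# Rule {bread} -> {milk}: s = P(bread and milk), c = P(milk | bread)
s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of 3 bread transactions
```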

Genetic Algorithm
A Genetic Algorithm (GA) is a procedure used to find approximate solutions to search
problems through the application of the principles of evolutionary biology. Genetic algorithms
use biologically inspired techniques such as genetic inheritance, natural selection, mutation, and
sexual reproduction (recombination, or crossover).

Genetic algorithms are typically implemented using computer simulations in which an
optimization problem is specified. For this problem, members of a space of candidate solutions,
called individuals, are represented using abstract representations called chromosomes. The GA
consists of an iterative process that evolves a working set of individuals, called a population,
toward an objective function, or fitness function. Traditionally, solutions are represented using
fixed-length strings, especially binary strings, but alternative encodings have been developed.

3. Methodology

The evolutionary process of a GA is a highly simplified and stylized simulation of its
biological counterpart. It starts from a population of individuals randomly generated according to
some probability distribution, usually uniform, and updates this population in steps called
generations. In each generation, multiple individuals are selected from the current population
based on their fitness, recombined through crossover, and modified through mutation to form a
new population.

A. [Start] Generate a random population of n chromosomes (suitable solutions for the
problem)
B. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population
C. [New population] Create a new population by repeating the following steps until the new
population is complete
i. [Selection] Select two parent chromosomes from the population according to their
fitness (the better the fitness, the higher the chance of being selected)
ii. [Crossover] With a crossover probability, cross over the parents to form new offspring
(children); if no crossover is performed, the offspring are exact copies of the parents
iii. [Mutation] With a mutation probability, mutate the new offspring at each locus (position
in the chromosome)
iv. [Accepting] Place the new offspring in the new population

D. [Replace] Use the newly generated population for a further run of the algorithm
E. [Test] If the end condition is satisfied, stop and return the best solution in the current
population
F. [Loop] Go to step B
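The steps above can be sketched as follows. This is a minimal illustrative Python sketch, not the paper's MATLAB implementation; the choice of tournament selection, single-point crossover, and a OneMax-style bit-sum fitness are assumptions made for the example.

```python
import random

def genetic_algorithm(fitness, n=20, length=8, pc=0.5, pm=0.01, generations=50):
    # A. Start: random population of n binary chromosomes
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(generations):                      # F. Loop / E. Test (fixed budget)
        new_pop = []
        while len(new_pop) < n:                       # C. New population
            # i. Selection: tournament of two, the fitter individual wins
            p1 = max(random.sample(pop, 2), key=fitness)
            p2 = max(random.sample(pop, 2), key=fitness)
            # ii. Crossover with probability pc (single point)
            if random.random() < pc:
                cut = random.randrange(1, length)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # iii. Mutation: flip each locus with probability pm
            for child in (p1, p2):
                mutated = [bit ^ (random.random() < pm) for bit in child]
                new_pop.append(mutated)               # iv. Accepting
        pop = new_pop[:n]                             # D. Replace
    return max(pop, key=fitness)                      # best of final population

# Example: maximize the number of 1-bits (OneMax)
best = genetic_algorithm(sum)
```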

4. Parameters in Genetic Algorithm

4.1 Selection of Individuals

During each successive generation, a proportion of the existing population is selected to
breed a new generation. Individual solutions are selected through a fitness-based process, where
fitter solutions (as measured by a fitness function) are typically more likely to be selected.
Tournament selection and roulette wheel selection are the most commonly adopted selection
methods.
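Both methods can be sketched as follows; the toy population and the use of the bit-sum as a fitness function are assumptions for illustration:

```python
import random

def tournament_select(pop, fitness, k=2):
    """Pick k individuals at random; the fittest of the sample wins."""
    return max(random.sample(pop, k), key=fitness)

def roulette_select(pop, fitness):
    """Select with probability proportional to (non-negative) fitness."""
    weights = [fitness(ind) for ind in pop]
    return random.choices(pop, weights=weights, k=1)[0]

pop = [[1, 1, 1], [1, 0, 0], [0, 0, 0]]
winner = tournament_select(pop, sum, k=3)  # k == len(pop): always the fittest
picked = roulette_select(pop, sum)         # [0, 0, 0] has zero selection weight
```

Tournament selection only needs relative fitness comparisons, while roulette wheel selection requires non-negative fitness values whose magnitudes are meaningful.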

4.2 Fitness Function

A fitness function is a particular type of objective function that quantifies the optimality
of a chromosome in a genetic algorithm, so that a particular chromosome may be ranked
against all other chromosomes [7, 8]. An ideal fitness function correlates closely with the
algorithm's goal yet can be computed quickly. Speed of execution is very important, as a
typical genetic algorithm must be iterated many times in order to produce a usable result for a
non-trivial problem.

This paper adopts minimum support and minimum confidence thresholds for filtering rules. The
correlative degree is then determined for the rules that satisfy the minimum support and minimum
confidence. Taking support and confidence into account together, the fit-degree (fitness)
function is defined as follows:

F(x) = Rs × (Supp(x) / Suppmin) + Rc × (Conf(x) / Confmin)        (1)

In this formula, Rs + Rc = 1 (Rs ≥ 0, Rc ≥ 0), and Suppmin and Confmin are the values of
minimum support and minimum confidence, respectively. Clearly, the larger the support and
confidence of a rule, the larger the value of the fitness function.
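A direct reading of this description gives the following sketch. The exact weighted form, and the default values of Rs, Rc, and the thresholds, are assumptions consistent with the stated constraints (Rs + Rc = 1, fitness increasing in both support and confidence), not the paper's verbatim formula:

```python
def fitness(supp, conf, supp_min=0.2, conf_min=0.8, rs=0.5, rc=0.5):
    """Fit-degree of a rule with support `supp` and confidence `conf`.
    Rules below either threshold are filtered out; otherwise the fitness
    grows with both support and confidence, weighted so that rs + rc = 1."""
    assert abs(rs + rc - 1.0) < 1e-9 and rs >= 0 and rc >= 0
    if supp < supp_min or conf < conf_min:
        return 0.0                       # rejected by the thresholds
    return rs * (supp / supp_min) + rc * (conf / conf_min)
```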

4.3 Crossover Operator

Crossover entails choosing two individuals and swapping segments of their code, producing
artificial "offspring" that are combinations of their parents. This process is intended to simulate
the analogous process of recombination that occurs between chromosomes during sexual
reproduction. Common forms of crossover include single-point crossover and uniform crossover.
In single-point crossover, a point of exchange is set at a random location in the two individuals'
genomes; one individual contributes all its code before the point of crossover and the second
individual contributes all its code after it to produce an offspring. In uniform crossover, the value
at any given location in the offspring's genome is the value of either parent's genome at that
location, chosen with 50/50 probability [9].
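The two crossover forms described above can be sketched as follows (the four-bit parents are invented for the example):

```python
import random

def single_point_crossover(p1, p2):
    """Set a random exchange point; each child takes one parent's head
    and the other parent's tail."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform_crossover(p1, p2):
    """Each locus of the child comes from either parent with 50/50 probability."""
    return [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]

c1, c2 = single_point_crossover([0, 0, 0, 0], [1, 1, 1, 1])
child = uniform_crossover([0, 0, 0, 0], [1, 1, 1, 1])
```

Note that single-point crossover conserves the multiset of genes across the two children, while uniform crossover produces a single child per call here and mixes loci independently.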

4.4 Mutation Operator

Mutation adjusts partial gene values of individuals [6]. This part of the genetic algorithm
requires great care, as two probabilities are involved. The first, usually called Pm, is used to
judge whether mutation should be performed at all; when a candidate satisfies this criterion,
a second, locus probability decides at which points of the candidate the mutation is done.
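This two-stage scheme can be sketched as follows; the parameter defaults are illustrative assumptions:

```python
import random

def mutate(chromosome, pm=0.1, p_locus=0.5):
    """Two-stage mutation: Pm decides whether the individual mutates at all;
    p_locus then decides, per position, whether that gene is flipped."""
    if random.random() >= pm:           # individual not chosen for mutation
        return chromosome[:]            # return an unmodified copy
    return [g ^ (random.random() < p_locus) for g in chromosome]

unchanged = mutate([0, 1, 0, 1], pm=0.0)              # never mutates
flipped = mutate([0, 1, 0, 1], pm=1.0, p_locus=1.0)   # every locus flips
```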

4.5 Population Size

Population size refers to the number of chromosomes taken up for optimization. If there are
too few chromosomes, the GA has few opportunities to perform crossover and only a small part of
the search space is explored. On the other hand, if there are too many chromosomes, the GA slows
down. Very large populations are not useful, because they do not solve the problem faster than
moderately sized populations.

The population size is calculated by

n = l × 2^k        (2)

where l is the number of chromosomes in the data and k is the average size of the schema of
interest. If uniform crossover is used, populations roughly half this size are usually sufficient.

4.6 Number of Generations

The generational process of mining association rules by the genetic algorithm is repeated until a
termination condition is reached. Common terminating conditions are:
• A solution is found that satisfies minimum criteria
• Fixed number of generations reached
• Allocated budget (computation time/money) reached
• The highest ranking solution's fitness is reaching or has reached a plateau such that
successive iterations no longer produce better results
• Manual inspection
• Combinations of the above
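These terminating conditions can be combined into a single stopping test, as in this sketch (the parameter names and defaults are assumptions for illustration):

```python
def should_stop(generation, best_history, max_generations=200,
                target=None, plateau=20, tol=1e-6):
    """Combine common stopping rules: minimum-criteria target reached,
    generation budget exhausted, or best fitness flat for `plateau` steps."""
    best = best_history[-1]
    if target is not None and best >= target:
        return True                      # a satisfying solution was found
    if generation >= max_generations:    # fixed number of generations
        return True
    if len(best_history) > plateau and best - best_history[-plateau - 1] < tol:
        return True                      # fitness has reached a plateau
    return False
```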

5. Experimental Studies

The objective of this study is to compare the accuracy achieved on the datasets while varying the
GA parameters. The chromosome encoding is binary with fixed length. As crossover is performed
at the attribute level, the mutation rate is set to zero so as to retain the original attribute
values. The selection method used is tournament selection, and the fitness function adopted is as
given in equation (1).

Three datasets, namely the Lenses, Haberman's Survival, and Iris datasets from the UCI Machine
Learning Repository, have been taken up for experimentation. The Lenses dataset has 4 attributes
and 24 instances, Haberman's Survival has 3 attributes and 306 instances, and Iris has 5
attributes and 150 instances. The algorithm is implemented in MATLAB R2008a.

The baseline GA parameter settings are:

Parameter             Value
Population Size       Instances × 1.5
Crossover Rate        0.5
Mutation Rate         0.0
Selection Method      Tournament Selection
Minimum Support       0.2
Minimum Confidence    0.8

The results obtained by varying the GA parameters are recorded as follows.

            Pop = Instances       Pop = Instances × 1.25   Pop = Instances × 1.5
            Accuracy  Generations Accuracy  Generations    Accuracy  Generations
Lenses         75         7          82        12             95        17
Haberman       71       114          68        88             64        70
Iris           77        88          87        53             82        45

Table 1: Population Size

It can be seen from the table above that when the dataset is small, the optimum accuracy is
achieved when the population size is one and a half times the dataset size. As the dataset size
increases, the optimum accuracy is achieved with the population size set to the dataset size itself.

Minimum Support & Minimum Confidence

            Sup=.4, Con=.4     Sup=.9, Con=.9     Sup=.9, Con=.2     Sup=.2, Con=.9
            Accuracy  Gens     Accuracy  Gens     Accuracy  Gens     Accuracy  Gens
Lenses         22      20         49      11         70      21         95      18
Haberman       45      68         58      83         71      90         62      75
Iris           40      28         59      37         78      48         87      55

Table 2: Minimum Support and Confidence

The values of minimum support and minimum confidence should be set based on the support and
confidence of the relationships between attributes in the dataset, rather than on the size of the
dataset.

Crossover

            Pc = .25           Pc = .5            Pc = .75
            Accuracy  Gens     Accuracy  Gens     Accuracy  Gens
Lenses         95       8         95      16         95      13
Haberman       69      77         71      83         70      80
Iris           84      45         86      51         87      55

Table 3: Crossover Probability

The accuracy achieved is the same whatever crossover probability is adopted; only the speed of
convergence depends on the data size and the population size.

The results, when plotted, are shown in the figures below.

Figure 1: Population Size vs. Accuracy

Figure 2: Minimum Support and Confidence vs. Accuracy

The following observations can be made from the experimental analysis.

• The optimum population size for better accuracy depends upon the number of instances
in the dataset. For larger datasets, the population size is best set equal to the number of
instances.
• The settings of minimum support and confidence depend on the dataset and the
relationships between its attributes.
• The fitness function variables likewise depend on the dataset and the relationships
between its attributes.
• The crossover rate only alters the number of generations needed to converge, rather
than the accuracy.
• The accuracy of the algorithm and the GA parameters cannot be generalized, as the
optimum values of these parameters vary from dataset to dataset.

6. Conclusion
Genetic algorithms have been used to solve difficult optimization problems in a number of
fields and have proved to produce optimum results in mining association rules. When a genetic
algorithm is used for mining association rules, the GA parameters decide the efficiency of the
system. The optimum values of the GA parameters vary from dataset to dataset, and the fitness
function plays a major role in optimizing the results. The size of the dataset and the
relationships between attributes in the data contribute to the setting of the parameters.

The efficiency of the algorithm could be further explored by altering the fitness function and
adopting different selection methods.

References
[1]. Hongjun Lu, Ling Feng, Jiawei Han, "Beyond intratransaction association analysis:
mining multidimensional intertransaction association rules", ACM Transactions on
Information Systems, pp. 423-454, 2000.

[2]. Cattral, R., Oppacher, F., Deugo, D., "Rule Acquisition with a Genetic Algorithm",
Proceedings of the 1999 Congress on Evolutionary Computation (CEC 99), 1999.

[3]. Saggar, M., Agrawal, A.K., Lad, A., "Optimization of Association Rule Mining",
IEEE International Conference on Systems, Man and Cybernetics, Vol. 4, pp. 3725-3729,
2004.

[4]. R. Agrawal, T. Imielinski, A. Swami, "Mining association rules between sets of items
in large databases", Proc. of the ACM SIGMOD Int'l Conf. on Management of Data
(ACM SIGMOD '93), Washington, USA, May 1993.

[5]. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, A.I. Verkamo, "Finding
interesting rules from large sets of discovered association rules", Proc. of the 3rd Int'l
Conf. on Information and Knowledge Management, 1994.

[6]. Bäck, T., "Optimal Mutation Rates in Genetic Search", 5th International Conference on
Genetic Algorithms, pp. 1-9, San Mateo, CA: Morgan Kaufmann, 1993.

[7]. Hua Tang, Jun Lu, "A Hybrid Algorithm Combined Genetic Algorithm with Information
Entropy for Data Mining", 2nd IEEE Conference on Industrial Electronics and
Applications, pp. 753-757, 2007.

[8]. Wenxiang Dou, Jinglu Hu, Hirasawa, K., Gengfeng Wu, "Quick Response Data Mining
Model using Genetic Algorithm", SICE Annual Conference, pp. 1214-1219, 2008.

[9]. Genxiang Zhang, Haishan Chen, "Immune Optimization Based Genetic Algorithm for
Incremental Association Rules Mining", International Conference on Artificial
Intelligence and Computational Intelligence (AICI '09), Vol. 4, pp. 341-345, 2009.

[10]. Zhou Jun, Li Shu-you, Mei Hong-yan, Liu Hai-xia, "A Method for Finding Implicating
Rules Based on the Genetic Algorithm", Third International Conference on Natural
Computation, Vol. 3, pp. 400-405, 2007.

[11]. Gonzales, E., Mabu, S., Taboada, K., Shimada, K., Hirasawa, K., "Mining Multi-class
Datasets using Genetic Relation Algorithm for Rule Reduction", IEEE Congress on
Evolutionary Computation (CEC '09), pp. 3249-3255, 2009.

[12]. Hong Guo, Ya Zhou, "An Algorithm for Mining Association Rules Based on
Improved Genetic Algorithm and its Application", 3rd International Conference on
Genetic and Evolutionary Computing (WGEC '09), pp. 117-120, 2009.
