
Multi-Objective Metaheuristic Algorithms for Finding Interesting Rules in Large Complex Databases

This research was funded by the EPSRC, grant number GR/T04298/01.

Introduction

The research was concerned with developing efficient algorithms for finding classification rules from large databases. The data mining task of classification is concerned with finding patterns in classification data, that is, data which has a class label for each instance or record in the database. When the classification task is restricted to a pre-defined class label, the data mining task is known as partial classification or nugget discovery. The patterns sought for this task are class descriptions, often of the general form antecedent → class-label = u for some value of u. The antecedent is a predicate used to define subsets of records from the database, and it is often defined by a conjunction of attribute-value tests. The simplest rule format may include only individual equality tests on nominal attributes (hence numeric attributes would require pre-discretisation). More complex rule formats include tests on a set of values (a disjunction of values) for a nominal attribute or a range of values for an ordinal attribute. A more complex format may yield a better class description, but the search space of possible rules is much larger.

Current methods for classification rule induction work by placing constraints on the search space to keep the problem tractable. Alternatively, the search space may be constrained with the use of minimum support and confidence constraints. These approaches may lead to loss of information and weaker class descriptions. There are also problems of scalability for large databases. Additionally, the number of rules found is frequently large, and many of them may describe the same population. We used multi-objective evolutionary algorithms to perform a search for non-overlapping strong rules containing numerical or nominal attributes. Experimental results are very promising.
We extended the techniques to find non-overlapping strong rules, and this presents a unique rule induction mechanism; coupled with an emphasis on scalability and efficiency, the project has delivered new state-of-the-art data mining algorithms. Adaptation to cope with missing or uncertain data, and with complex data, will provide an invaluable tool for medical data mining, where data often presents those characteristics.

Key Advances and Supporting Methodology

In this section, the achievements of the project are reviewed according to the original aims and objectives. The project aim was to develop scalable techniques for nugget discovery using Multi-Objective Metaheuristic algorithms (MOMH) in complex data, and to undertake a number of experiments using benchmark databases from the UCI repository to prove the effectiveness of the new techniques. In order to test the extensions of the algorithm for large and complex data, the techniques were to be applied to some real medical case studies. The work would build upon some of the preliminary work already undertaken but would seek to enhance and extend the techniques in a number of ways.
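The rule form and the two quality measures used throughout this report (confidence and coverage) can be illustrated with a minimal sketch. It assumes the commonly used definitions: confidence is the fraction of records matching the antecedent that belong to the target class, and coverage is the fraction of target-class records that match the antecedent. The names `Rule` and `confidence_and_coverage`, and the toy records, are hypothetical and are not taken from the project code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass
class Rule:
    """antecedent -> class-label = target, with a conjunctive antecedent."""
    tests: List[Callable[[Record], bool]]  # attribute-value tests, ANDed together
    target: object                         # the pre-defined class label u

    def matches(self, record: Record) -> bool:
        return all(test(record) for test in self.tests)


def confidence_and_coverage(rule: Rule, records: List[Record],
                            class_attr: str = "class") -> Tuple[float, float]:
    """Return (confidence, coverage) under the assumed definitions above."""
    matched = [r for r in records if rule.matches(r)]
    correct = sum(1 for r in matched if r[class_attr] == rule.target)
    in_class = sum(1 for r in records if r[class_attr] == rule.target)
    confidence = correct / len(matched) if matched else 0.0
    coverage = correct / in_class if in_class else 0.0
    return confidence, coverage


# Toy data, invented purely for illustration.
records = [
    {"age": 25, "smoker": "yes", "class": "ill"},
    {"age": 40, "smoker": "yes", "class": "ill"},
    {"age": 35, "smoker": "no", "class": "well"},
    {"age": 60, "smoker": "yes", "class": "well"},
    {"age": 50, "smoker": "no", "class": "ill"},
]

# An equality test on a nominal attribute and a range test on a numeric one.
rule = Rule(
    tests=[lambda r: r["smoker"] == "yes", lambda r: 20 <= r["age"] <= 45],
    target="ill",
)

conf, cov = confidence_and_coverage(rule, records)
print(f"confidence={conf:.2f} coverage={cov:.2f}")
```

In this toy example the rule matches two records, both of class "ill", out of three "ill" records in total, so the rule is fully confident but covers only part of the class.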

The first objective was to develop efficient algorithms for finding strong classification rules (e.g. those of high confidence and coverage) in large databases by using multi-objective optimisation techniques. A preliminary MOMH nugget discovery algorithm was created and its performance was compared on a number of benchmark databases against another algorithm for nugget discovery, the all-rules algorithm (ARA). The results [1] showed the potential of MOMH for nugget discovery: for every benchmark database tested, the set of rules obtained was close to or coincided with the Pareto optimal set of rules (the upper confidence/coverage border). Our algorithm represents one of the first attempts to employ MOMH in data mining, and particularly for classification, so the work is very innovative and is already being cited by other authors working in the field.

The second part of the first objective was to evaluate a number of different approaches to multi-objective optimisation and their application to this particular data mining task. Our preliminary algorithm used NSGA-II, a well-known MOMH algorithm, as the engine for nugget discovery. While the results obtained by NSGA-II were encouraging, it was felt that a range of multi-objective metaheuristics should be applied to this problem in order to determine which approach would be most effective. As part of the project, we therefore implemented a second multi-objective genetic algorithm, SPEA2; a multi-objective local search, PAES; and two forms of multi-objective genetic local search for the task of nugget discovery. In addition, we developed our own multi-objective algorithm based on GRASP (Greedy Randomized Adaptive Search Procedure).
This was an important development in the context of multi-objective optimization: GRASP has been applied to a number of problems in combinatorial optimization, but it has very seldom been used in a multi-objective setting, and generally only through repeated optimization of single-objective problems, using either linear combinations of the objectives or additional constraints. The approach we developed takes advantage of some specific characteristics of the data mining problem being solved, allowing for the very effective construction of a set of solutions that form the starting point for the local search phase of the GRASP. The resulting algorithm is guided solely by the concepts of dominance and Pareto optimality. Our algorithm has been accepted for journal publication [2]. The experimental results for our partial classification GRASP and other multi-objective metaheuristics have shown that MOMH are generally very well suited to nugget discovery and, furthermore, that the GRASP algorithm brings additional efficiency to the search for partial classification rules. This makes GRASP suitable for the exploration of larger databases, in fulfilment of another of our objectives: to enhance the scalability of the algorithm produced. We therefore succeeded in producing an efficient and scalable MOMH algorithm for partial classification, in fulfilment of our first and sixth objectives. However, the set of rules returned was large, and some rules in this set were similar to one another.

Our second objective was to explore concepts related to the interest of a rule, such as novelty, surprise and simplicity, and to extend the algorithm to find sets of interesting rules (e.g. simple, strong, novel rules that are interesting within the set as well as individually). In order to present more interesting rules to the user and to aid the rule evaluation task, the work was developed in two directions.
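As a rough illustration of the dominance and Pareto optimality concepts that guide the algorithms discussed above, the following sketch filters a set of rules, each summarised as a (confidence, coverage) pair, down to the non-dominated ones. It is a brute-force filter for exposition only, not the search procedure of NSGA-II, SPEA2, PAES or the GRASP; the function names and the sample values are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (confidence, coverage), both to be maximised


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is at least as good in both objectives and better in one."""
    at_least_as_good = a[0] >= b[0] and a[1] >= b[1]
    strictly_better = a[0] > b[0] or a[1] > b[1]
    return at_least_as_good and strictly_better


def non_dominated(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates (quadratic time)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented objective values for five candidate rules.
candidates = [(0.9, 0.2), (0.8, 0.5), (0.7, 0.4), (0.6, 0.8), (0.6, 0.7)]
front = non_dominated(candidates)
print(front)
```

Here (0.7, 0.4) is dominated by (0.8, 0.5) and (0.6, 0.7) is dominated by (0.6, 0.8), so three rules remain on the front. Rules that sit close together on such a front often describe similar sets of records, which is one reason the returned rule set can contain many similar rules.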
We first explored the concept of clustering similar rules to allow for summarisation of rule sets into clusters with representative rules. As the clustering of rules is a new area of research, we evaluated a number of clustering algorithms on this task, including partition-based clustering algorithms such as k-medoids, PAM and CLARANS. Furthermore, with our objective of scalability and efficiency in mind, a modification to CLARANS was made that allows it to perform as well as PAM in long runs and to significantly outperform all the partition-based algorithms in short runs. In addition to partition-based algorithms, a variant of a hierarchical clustering algorithm known as AGNES was also explored and resulted in different clustering solutions. All clustering algorithms were applied to the clustering of rules obtained from both the all-rules algorithm and MOMH for partial classification. The results were reported in a journal publication [3] and should influence the way in which sets of rules from a number of algorithms, including association rule algorithms, are presented to the user. Using our method, clusters of rules can be presented to the user by, for example, medoid rules for each cluster. If a cluster of rules appears to contain interesting rules, the cluster can then be expanded to allow the user to examine individual rules. This hierarchical presentation of rules could be combined with visualization techniques (e.g. parallel coordinate techniques) in a user interface for easy evaluation of large sets of rules.

The clustering of rules extracted with the MOMH partial classification algorithms showed that rules that are close in objective space are also similar in terms of their support sets (they are supported by overlapping sets of records); hence rules within a particular cluster present little diversity in terms of support sets. Next, we attempted to improve the quality of our rule sets by modifying the original algorithm to improve the diversity of rules within the set. To this end, we studied the novelty of the rules in relation to other rules within the set. We created a modification of NSGA-II by introducing the concept of rule dissimilarity into the crowding measure. This allowed us to increase the diversity of rules, in terms of support sets, in some areas of the Pareto front. The modified algorithm and results were presented at the third international conference on Evolutionary Multi-Criterion Optimisation (EMO 2005) [4]. The results were very well received because the presentation of results by clustering was applicable to many other problems in Multi-Criterion Optimisation and represented a very interesting development in this area.

The creation of diverse rules was the subject of further research through the use of modified dominance relations in the MOMH algorithm. The modified dominance relations were designed to encourage diversity and novelty. In previous research, novelty was measured in terms of the rules' support sets. In this research we adapted the methods to also consider apparent (syntactic) novelty. We modified the dominance relation as follows.
When rule q would normally be dominated by rule r, the difference between the two rules is considered. If rule q provides enough novelty with respect to rule r, it is permitted to remain non-dominated. The amount of novelty required depends upon the relative quality of the two rules according to the two prime objectives: coverage and confidence. Experiments showed that the new modified dominance relation encourages the production of rules that were previously undiscovered and that provide additional information to the user. The results [5], presented at the 2006 IEEE International Joint Conference on Neural Networks, received a prize as best paper of the session. The modified dominance relation contributes not only to our algorithm for partial classification but also to the field of multi-objective optimisation, where diverse solutions are often sought and a similar approach can be taken.

As a different direction in our research, we decided to investigate the creation of a predictive classifier. Although the individual rules produced with our MOMH were simple, easy to understand and provided useful partial descriptions of the class of interest, it was difficult to see how they could be combined to provide a description for the whole database. Additionally, the simple rule representation of partial classification was insufficient to produce individual rules with both high confidence and high coverage. Other authors have attempted to produce predictive classifiers from simple association rules using a multi-objective algorithm to optimize rule set complexity and error rate. As this approach has a number of disadvantages, we chose instead to use a more expressive rule representation, specifically Expression Trees (ETs). ETs are different from Decision Trees. Expression Trees contain Attribute Tests (ATs) as leaves, while internal nodes contain a Boolean operator.
On the other hand, Decision Trees contain internal nodes that define partitions of the data and leaf nodes that indicate class membership. The combination of ATs in the ET represents a rule in Disjunctive Normal Form which forms a complete description of the class of interest. The ET can be used for binary classification problems, as any record described by the ET belongs to the positive class or class of interest, whereas any record not described by the ET belongs to the negative class. ETs were evaluated according to their error rate or misclassification cost and to their complexity, thereby creating a multi-objective problem. A MOMH algorithm, NSGA-II, was implemented to search for the best ET given the optimization objectives. The resulting algorithm was tested on a number of benchmark databases. The results [6] illustrated that the approach can produce a reasonable classification on two-class problems, competitive with other classification algorithms in terms of misclassification cost.

However, the algorithm provides additional benefits. The most obvious is that the user can be provided with a range of models with different trade-offs between rule complexity and misclassification cost. This allows the client to select a set of rules that is accurate enough while also being comprehensible. The overall approach is also flexible. Rules may be presented to the user in a number of ways, and the measure of rule complexity can easily be adapted to match the method of rule presentation and the client's concept of rule comprehensibility. Different measures of misclassification cost can easily be used. There is no restriction on the data types of the fields of the dataset and no need to discretise numeric fields, as required by other algorithms.

This new algorithm was a deliverable additional to the original objectives of the project. It is, however, a very important and innovative development, as it allows us to produce sets of rules using MOMH algorithms. The set of rules produced represents a complete description of the class of interest, and the rules are still relatively simple and understandable. Quite often they form a more concise rule set than the rule set obtained by a Decision Tree.

The preliminary work on ETs, although successful, identified a problem with the algorithm: a major loss of population diversity was observed early in the search process. This loss of diversity refers to the range of genetic material in the population that may be usefully combined to create new solutions, rather than to the diversity of the solutions on the Pareto front. We used both linear combinations of objectives and modified dominance relations to control population diversity, producing higher quality results in shorter run times [7]. This work can have a major impact in the multi-objective metaheuristic community, where loss of diversity is often a problem. Further experiments are underway with a view to submitting some of the work for journal publication.
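The Expression Tree representation and its two objectives described above can be sketched as follows. This is a minimal illustration under stated assumptions: complexity is taken to be the node count and the cost is the plain error rate, whereas the report notes that both measures can be adapted. The class and function names (`AttributeTest`, `BoolNode`, `objectives`) and the toy records are hypothetical and are not taken from the project code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Record = Dict[str, object]


@dataclass
class AttributeTest:
    """Leaf of the expression tree: a single test on one attribute."""
    description: str
    test: Callable[[Record], bool]

    def evaluate(self, record: Record) -> bool:
        return self.test(record)

    def size(self) -> int:
        return 1


@dataclass
class BoolNode:
    """Internal node: a Boolean operator ("and" or "or") over its children."""
    op: str
    children: List[Union[AttributeTest, "BoolNode"]]

    def evaluate(self, record: Record) -> bool:
        results = (child.evaluate(record) for child in self.children)
        return all(results) if self.op == "and" else any(results)

    def size(self) -> int:
        return 1 + sum(child.size() for child in self.children)


def objectives(tree, records: List[Record], positive: object,
               class_attr: str = "class") -> Tuple[float, int]:
    """Return (error rate, complexity); both are to be minimised."""
    errors = sum(
        1 for r in records
        if tree.evaluate(r) != (r[class_attr] == positive)
    )
    return errors / len(records), tree.size()


# Toy data, invented purely for illustration.
records = [
    {"age": 25, "smoker": "yes", "class": "ill"},
    {"age": 40, "smoker": "yes", "class": "ill"},
    {"age": 35, "smoker": "no", "class": "well"},
    {"age": 60, "smoker": "yes", "class": "well"},
    {"age": 50, "smoker": "no", "class": "ill"},
    {"age": 30, "smoker": "yes", "class": "well"},
]

# (smoker = yes AND age <= 45) OR (smoker = no AND age >= 50), i.e. a rule in DNF.
# Numeric attributes are tested directly, with no prior discretisation.
tree = BoolNode("or", [
    BoolNode("and", [
        AttributeTest("smoker = yes", lambda r: r["smoker"] == "yes"),
        AttributeTest("age <= 45", lambda r: r["age"] <= 45),
    ]),
    BoolNode("and", [
        AttributeTest("smoker = no", lambda r: r["smoker"] == "no"),
        AttributeTest("age >= 50", lambda r: r["age"] >= 50),
    ]),
])

error_rate, complexity = objectives(tree, records, positive="ill")
print(f"error rate={error_rate:.3f} complexity={complexity}")
```

A multi-objective search would keep several such trees, each offering a different trade-off between the two values returned by `objectives`, so that the user can choose a model that is accurate enough while remaining comprehensible.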
The project has been very successful in launching multi-objective algorithms as a platform for data mining algorithm development. The PI was an invited speaker at the third UK KDD Symposium [8], talking about the application of multi-objective algorithms in data mining.

[1] B. de la Iglesia, G. Richards, M. S. Philpott and V. J. Rayward-Smith. The application and effectiveness of a multi-objective metaheuristic algorithm for partial classification. EJORS, 169:3, pp 898-917, 2006.
[2] A. Reynolds and B. de la Iglesia. A Multi-Objective GRASP for Partial Classification. Soft Computing Journal (to appear, 2008).
[3] A. P. Reynolds, G. Richards, B. de la Iglesia and V. J. Rayward-Smith. Clustering Rules: A Comparison of Partitioning and Hierarchical Clustering Algorithms. Journal of Mathematical Modelling and Algorithms, 5:4, pp 475-504, 2006.
[4] B. de la Iglesia, A. Reynolds and V. J. Rayward-Smith. Developments on a Multi-objective Metaheuristic (MOMH) Algorithm for Finding Interesting Sets of Classification Rules. Lecture Notes in Computer Science, Volume 3410, pp 826-840, 2005.
[5] A. P. Reynolds and B. de la Iglesia. Rule Induction Using Multi-Objective Metaheuristics: Encouraging Rule Diversity (Winner of Best Session). IJCNN 2006, pp 6375-6382, 2006.
[6] A. P. Reynolds and B. de la Iglesia. Rule Induction for Classification Using Multi-Objective Genetic Programming. Proceedings of the 4th International Evolutionary Multi-Criterion Optimization Conference (EMO 2007), LNCS 4403, pp 516-530, Matsushima, Japan, 2007.
[7] A. P. Reynolds and B. de la Iglesia. Managing Population Diversity Through the Use of Weighted Objectives. Proceedings of the 2007 IEEE Symposium on Computational, pp 99-106, 2007.
[8] B. de la Iglesia. Application of Multi-objective Metaheuristic Algorithms in Data Mining. Proceedings of the Third UK Knowledge Discovery and Data Mining Symposium (Invited Talk), Expert Update, Autumn 2007, Vol. 9, No. 3, ISSN: 1465-4091, 43-48.
