DATA MINING

Pieter Adriaans
Dolf Zantinge
Syllogic

Pearson Education

The right of Pieter Adriaans and Dolf Zantinge to be identified as authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

The programs in this book have been included for their instructional value. They have been tested with care but are not guaranteed for any particular purpose. The publisher and authors do not offer any warranties or representations, nor do they accept any liabilities with respect to the programs.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Pearson Education has made every attempt to supply trademark information about manufacturers and their products mentioned in this book. A list of the trademark designations and their owners appears below.

Trademark notice: DB/2 is a trademark of International Business Machines Corporation. ORACLE is a trademark of Oracle Corporation UK Limited. SYBASE is a registered trademark of Sybase Incorporated.

Copyright © 1996 by Pearson Education Ltd. This edition is published by arrangement with Pearson Education, Ltd.

This book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired out, or otherwise circulated in any form of binding or cover other than that in which it is published, without a similar condition including this condition being imposed on the subsequent purchaser. Without limiting the rights under copyright reserved above, no part of this publication may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording or otherwise), without the prior written permission of both the copyright owner and the above-mentioned publisher of this book.

ISBN 978-81-317-0717-3

First Impression, 2007
Second Impression, 2008
Third Impression, 2009
Fourth Impression, 2009

This edition is manufactured in India and is authorized for sale only in India, Bangladesh, Bhutan, Pakistan, Nepal, Sri Lanka and the Maldives. Circulation of this edition outside of these territories is UNAUTHORIZED.

Published by Dorling Kindersley (India) Pvt. Ltd., licensees of Pearson Education in South Asia. Head Office: 482 F.I.E., Patparganj, Delhi 110 092, India. Registered Office: 14 Local Shopping Centre, Panchsheel Park, New Delhi 110 017, India.

Printed in India by Chaman Enterprises.

CONTENTS

Preface
Overview of the book
Acknowledgements

1 Introduction
  An expanding universe of data
  Information as a production factor
  Computer systems that can learn
  Data mining
  Data mining versus query tools
  Data mining in marketing
  Practical applications of data mining
  Conclusion

2 What is learning?
  Introduction
  What is learning?
  Self-learning computer systems
  Machine learning and the methodology of science
  Concept learning
  A kangaroo in mist
  Conclusion

3 Data mining and the data warehouse
  Introduction
  What is a data warehouse and why do we need it?
  Designing decision support systems
  Integration with data mining
  Client/server and data warehousing
  Multi-processing machines
  Cost justification
  Conclusion

4 The knowledge discovery process
  Introduction
  The knowledge discovery process in detail
  Data selection
  Cleaning
  Enrichment
  Coding
  Data mining
  Preliminary analysis of the data set using traditional query tools
  Visualization techniques
  Likelihood and distances
  OLAP tools
  k-nearest neighbour
  Decision trees
  Association rules
  Neural networks
  Genetic algorithms
  Reporting
  Conclusion

5 Setting up a KDD environment
  Introduction
  Different forms of knowledge
  Getting started
  Data selection
  Cleaning
  Enrichment
  Coding
  Data mining
  Reporting
  The KDD environment
  Ten golden rules
  Conclusion

6 Some real-life applications
  Introduction
  Customer profiling
  Predicting bid behaviour of pilots
  Discovering foreign key relationships
  Results
  Conclusion

7 Some formal aspects of learning algorithms
  Introduction
  Learning as compression of data sets
  The information content of a message
  Noise and redundancy
  The significance of noise
  Fuzzy databases
  The traditional theory of the relational database
  From relations to tables
  From keys to statistical dependencies
  Denormalization
  Data mining primitives
  Conclusion

Summary
Glossary
Index

[…]

…their leaves towards the sun; this is an elementary form of adaptation to the environment. At the other end, humans have learned to exploit an extremely complicated and delicate structure — language — as a tool to explore the laws of the Universe. This capacity to learn seems to be an essential characteristic of life itself. The theory of evolution teaches us that the species that will survive are those that have adapted themselves optimally to their environments. Since learning is a form of adaptation, it is a central aspect of the emergence of life on our planet. In this light, there is a deep philosophical motivation to the quest for computer systems that can learn. Few specialists doubt that an intelligent computer program will, at the very least, also be a program with fundamental learning capacities. There can be no artificial intelligence (AI) without artificial learning (or machine learning, ML, as the subject is somewhat erroneously called by its advocates); computer programs that can learn have therefore been at the focus of AI research from the earliest days of computer technology, back in the 1950s. One thing we have learned so far is that it will be extremely difficult to create a computer that will have an intelligence that comes in any way close to that of human beings. Specialists believe that it will take at least another hundred years before we can create a computer program that can chat about the contents of the daily newspaper. This is an area of research that has suffered its fair share of set-backs. In their initial enthusiasm, pioneering scientists somewhat overstated their case and, when they inevitably failed to fulfil their extravagant claims, research budgets were cut at the end of the 1960s. This situation was further exacerbated by the publication of some very negative theoretical results on the possibilities of certain techniques.
Minsky and Papert, for instance, showed that the so-called 'perceptrons', simple forerunners of the now well-known neural networks, could learn only very simple rules, which would never lead to serious learning systems. Similar things happened to other research programs and, as a result, for a long time machine learning led a hidden life in universities and research centers.

The beginning of the 1980s heralded a change. A new generation of researchers began to look with fresh eyes at the problems of machine learning and made new discoveries: simple algorithms to create decision trees for the classification of arbitrary classes of objects; new architectures for neural networks that were more powerful than the perceptrons criticized by Minsky and Papert; and other approaches like genetic algorithms modeled on the theory of evolution. At the same time they had much more powerful computers at their disposal, which enabled them to test their new algorithms on real problems. They did not make over-zealous claims about intelligent computers but instead turned their attention to simple, practical problems.

[…]

…have a large file containing millions of records that describe your customers' purchases over the last ten years. There is a wealth of potentially useful knowledge in such a file, most of which can be found by firing normal queries at the database, such as 'who bought which product on what date?', 'what is the average turnover in a certain sales region in July?' and so on. There is, however, knowledge hidden in your database that is much harder to find using SQL. Examples would be the answers to questions such as 'what is an optimal segmentation of my clients?' (that is, 'how do I find the most important different customer profiles?'), or 'what are the most important trends in customer behavior?'. Of course, these questions could be answered using SQL. You could try to guess for yourself some defining criteria for customer profiles and query the database to see whether they work or not. In a process of trial and error, one could gradually develop intuitions about what the important distinguishing attributes are. Proceeding in such a way, it could take days or months to find an optimal segmentation for a large database, while a machine-learning algorithm like a neural network or a genetic algorithm could find the answer automatically in a much shorter time, sometimes even in minutes or a couple of hours. Once the data mining tool has found a segmentation, you can use your query environment again to query and analyze the profiles found. One could say that if you know exactly what you are looking for, use SQL; but if you know only vaguely what you are looking for, turn to data mining. Generally there are far more occasions when your initial approach is vague than times when you know precisely what you are looking for, and it is this that has motivated the recent surge of interest in data mining.

It is clear that KDD is not an activity that stands on its own: a good foundation in terms of a data warehouse is a necessary condition of its effective implementation. Noisy and incomplete data, and legal and privacy issues, constitute important problems. One must pay attention to the process of data cleaning — remove duplicate records, correct typographical errors in strings, add missing information, and so on. In KDD too, the old 'garbage in, garbage out' rule still holds.
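To make the cleaning step concrete, here is a minimal de-duplication sketch in Python. It is illustrative only: the record fields, the sample values, and the similarity threshold are all invented for this example, and a real cleaning step would add postal normalization, phonetic matching, and manual review.

    from difflib import SequenceMatcher

    # Toy customer records; the field names are made up for illustration.
    records = [
        {"id": 23003, "name": "Johnson", "address": "1 Downing Street"},
        {"id": 23019, "name": "Jonson",  "address": "1 Downing Street"},
        {"id": 23009, "name": "Clinton", "address": "2 Boulevard"},
    ]

    def similar(a, b, threshold=0.8):
        # Crude string similarity based on difflib's matching-block ratio.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    def deduplicate(recs):
        # Greedily cluster records whose name and address both look alike,
        # then keep the first record of each cluster as the canonical one.
        clusters = []
        for r in recs:
            for c in clusters:
                if similar(r["name"], c[0]["name"]) and similar(r["address"], c[0]["address"]):
                    c.append(r)
                    break
            else:
                clusters.append([r])
        return [c[0] for c in clusters]

    for rec in deduplicate(records):
        print(rec)   # 'Jonson' is absorbed into the 'Johnson' cluster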
To implement KDD in an organization is to start a process of permanent refinement and detailing of data. The real aim should ultimately be to create a self-learning organization.

Data mining in marketing

The standard success stories of KDD come primarily from marketing. Suppose you own a mail-order firm and you have a database in which, for fifteen years, you have kept a file on which clients reacted to what mailings…

[…]

None of these issues is insoluble in a flexible, healthy organization. Experience tells us that when they are addressed in the proper way, an organization can benefit tremendously from the introduction of KDD.

Conclusion

In this chapter we have given you a first glimpse of the potential impact of KDD and data mining. In subsequent chapters we will broaden our understanding of these matters. Not only will we discuss the background of learning algorithms, but we will also consider the implications of data warehousing for KDD, present a very detailed case as an illustration, and provide an extensive overview of the points to be kept in mind when setting up a KDD environment. These aspects will enable you to coordinate such a process in your own organization.

CHAPTER 2  What is learning?

Introduction

In this chapter we will answer some questions about the methodological aspects of learning: what is implied when a computer system 'learns'; what learning is in general; and what the relationship is between machine learning and the methodology of science. We also introduce some important concepts, such as the complexity of the search space, and redundancy and noise. Although a broad understanding of these concepts is not necessary to carry out a KDD project, a general appreciation of these ideas might help you to avoid the obvious pitfalls. The main aim of this chapter is to give the reader an understanding of the methodological issues that are involved in using machine-learning algorithms. For those who would like to gain a deeper understanding of these matters, Chapter 7 considers the formal aspects of learning in more detail.

What is learning?

It seems that learning is one of the basic necessities of life: living creatures must have the ability to adapt themselves to their environment. There are undeniably many different ways of learning but instead of considering all kinds of different definitions, we will concentrate on an operational definition of the concept of learning; we will not state what learning is but rather specify how to determine when someone has learned something. In order to define learning operationally, we need two other concepts: a certain 'task' to be carried out either well or badly, and a learning 'subject' that is to carry out the task. Our simple definition of learning is then:

An individual learns how to carry out a certain task by making a transition from a situation in which the task cannot be carried out to a situation in which the same task can be carried out under the same circumstances.

It is not always easy to establish whether a change of conduct has really taken place. However, this is a general problem in the world of science, and is not specific to learning theory.
In real life it is impossible to make pure observations that prove a theory beyond question and eliminate the possibility of other, unknown influences; in most cases the systems we study are very complex and we need many observations before we can make a statement that is statistically relevant. For the time being it is sufficient to remember that we can define learning operationally but that verification of a real change of conduct is not always easy.

Self-learning computer systems

Turning now to the computer: is it also capable of learning? According to our definition it is. If a computer, when instructed, cannot at first carry out a particular task and later, under the same circumstances, it can, we can say that it has learned something. Yet this conclusion is rather pointless because, after all, we can program the computer. If we want to make a computer solve differential equations, we simply write a program that enables it to do so. Thus, the computer has learned something, yet we cannot truthfully speak of a self-learning computer. It appears that something is still missing.

The same consideration applies to playing the piano. Would we say that someone who can play only Für Elise can play the piano? In one sense we would, but we do have the feeling that mastering just one piece of music is insufficient to make such a claim. In order to say that someone can play the piano, it is necessary for that person to have mastered a variety of pieces, and, even more importantly, to be able to study new ones. This means that the individual must not only master a specific task under specific circumstances but must also have a general learning capacity so that he or she can perform a whole range of new tasks. For that purpose, a training method needs to be developed that enables new, unknown tasks to be performed.

There are various schools of thought that place great importance on providing a definition of learning that emphasizes the fundamental difference in general learning capacities between human and animal (and sometimes also between human and machine). According to these schools of thought, a human being can normally undertake new tasks, which shows an infinite capacity for learning new things. This 'openness' could also define the difference in creativity between human and machine (or animal). An animal or a machine would always be confined to 'pre-programmed' tasks. In our view, it is too early to express an opinion about this. For the moment, it is certain that with regard to learning capacity, the computer is overshadowed by humans (and by the higher animal species — just try to teach a computer how to swing through trees). Things are bound to remain like this for the next few centuries. All kinds of tasks can easily be learned by humans, such as walking, speaking, catching balls, and playing games, although all of these still constitute insurmountable problems for the computer.

Yet we mustn't throw the baby out with the bath water. There is still a wide range of things that a computer can learn. In a way, it is easier to study the learning abilities of computers than those of human beings, because we know exactly what is happening in the former case. With computers, we could say that they can learn how to solve differential equations if they can distil a method for the solution of these types of equations from a number of examples of correctly and incorrectly solved equations.
In practice, to enable it to carry out the task correctly, a computer must be able to write a program based on examples. This leads to a new definition:

A self-learning computer can generate programs itself, enabling it to carry out new tasks.

The methods the computer uses for this are of no importance now but we will return to the subject later. If we draw a comparison between the learning capacity of the computer and that of humans, we find an unexpected discrepancy. The computer can do both more and less than a human; to the computer, comparing millions of pieces of data in a couple of minutes is simple. No human can do that. On the other hand it finds things that humans consider easy, such as recognizing a face or baking a cake, extremely difficult. If we are going to use the learning capacity of computers for the solution of problems, we will have to restrict the problems to those areas in which computers specialize. At the moment, the computer's greatest power lies in its speed and accuracy; its greatest limitation is lack of creativity. A computer can solve simple puzzles, but no real problems. Finding patterns in a marketing database with millions of records is relatively easy, but carrying out a task that really requires creativity and general knowledge of the world, such as solving a murder or drawing up a marketing plan, still presents the computer with insurmountable problems. If we want to feed a certain problem to the computer, we first have to dissect it into manageable segments ourselves, and only learning problems that can be expressed in a limited number of strings of figures and symbols are suitable for this purpose. Yet, with the growing availability of large databases and the increasing need to interpret the information contained in these databases automatically, even the limited learning capabilities of current computers can prove of great value to an organization.

Machine learning and the methodology of science

The task of the modern scientist is to explain and to predict. Ideally the process of scientific research takes the form of a so-called empirical cycle (see Figure 2.1):

• Observation: we start with a number of observations.
• Analysis: we try to find patterns in these observations.
• Theory: if we have found some regularities, we formulate a theory (hypothesis) explaining the data.
• Prediction: our theory will predict new phenomena that can be verified by new observations.

Figure 2.1 Empirical cycle of scientific research (observation, analysis, theory, prediction).

In the last stage of the cycle there are two possibilities. Either our predictions are correct, in which case our theory is corroborated, or our predictions are wrong. In this latter case we have to analyze the new observations and try to come up with a new theory. So the whole process starts again, which is why we speak of an empirical cycle. The process goes on and on for ever, and we can refine our theories indefinitely. The same holds, apart from changes of detail, for a manager who tries to analyze a market to develop new products or to optimize production. In any learning situation there is always some sort of empirical cycle. In this century, the philosophy of science tends to defend the view that we can formulate hypotheses to explain empirical observations but that we can never prove that they are true. Everything that science discovers has only temporary value. Consider the following example.
Suppose that we want to formulate a hypothesis concerning the color of swans. We observe a number of swans that are all white, so our hypothesis is 'all swans are white' (see Figure 2.2). Now one could ask how many observations we need in order to validate this theory. Clearly this number is infinite, since we are speaking of all swans. No matter how many swans we have seen, we can only be sure of the definitive truth of our hypothesis if we have seen them all. On the other hand, we need only one observation in order to falsify the theory. Once we find a black swan, the statement 'all swans are white' will be untrue. This leads to the following:

Figure 2.2 Theory formation: a limited number of observations leads to a theory ('all swans are white') about a reality containing an infinite number of swans.

[…]

…significance of the results of our learning programs. Statistics are very important in machine learning and we will come back to this subject later.

Information content

Information content is closely related to statistical significance and transparency. In the previous example concerning statistics, we presented a theory that always classifies an animal as a non-kangaroo. Obviously a program that 'learns' this theory is not very useful. The theory does not contain any information, and we say that the information content is low. We prefer machine-learning algorithms that learn hypotheses that are as rich as possible.

Complexity of the search space

We have described concept learning as the process in which one is given positive or negative examples of a concept, and a learning algorithm has to develop a hypothesis, described in a certain language, that explains these examples. When looked at in such a way, machine learning can be perceived as a search problem: we have to find the correct hypothesis. An important element in search problems is the establishment of the complexity of the search space, that is, how many hypotheses there are and how they are related. In fact, a machine-learning algorithm can be described as a search algorithm for a hypothesis space. When we want to judge the performance of a learning algorithm a priori, it is important to realize what the complexity of this search space is. When there are only ten different hypotheses possible, it is easy to find the correct hypothesis simply by enumeration; we only have to do ten tests in order to find the answer. In the case of several millions or even an infinite number of hypotheses, enumeration would not be such a good strategy. In most cases, unfortunately, we are confronted with search problems in which the number of potential hypotheses is infinite. The only way to cope with the search problem in such a case is to develop some kind of refinement theory for hypotheses; in other words, we have to develop a measure for the quality of the hypotheses. We can use that measure to select potentially good hypotheses and, based on this selection, try to improve the theories. This process is called 'hill-climbing search' because it resembles climbing a hill; the higher you climb, the better the theories explaining the data become.

This aspect of the complexity of the search space determines the difference in learning capabilities between human beings and machines. Human beings are good at solving problems in very complex, badly structured search spaces.
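The hill-climbing idea can be made concrete in a few lines. The sketch below searches a one-dimensional hypothesis space of age thresholds; the data and the target rule (customers under 35 buy the magazine) are invented purely to show the mechanics of moving to better neighbouring hypotheses.

    # Labelled examples following a toy rule: customers under 35 buy.
    examples = [(age, age < 35) for age in range(18, 70)]

    def quality(threshold):
        # Score the hypothesis 'buys iff age < threshold' by its accuracy.
        return sum((age < threshold) == bought for age, bought in examples) / len(examples)

    def hill_climb(start):
        # Move to the better neighbouring hypothesis while quality improves.
        current = start
        while True:
            best = max((current - 1, current + 1), key=quality)
            if quality(best) <= quality(current):
                return current        # reached a peak
            current = best

    best = hill_climb(start=50)
    print(best, quality(best))        # -> 35 1.0

In a space this simple the peak is also the global optimum; in realistic hypothesis spaces hill-climbing can get stuck on local peaks, which is why the quality measure and the refinement steps matter so much.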
If we take solving a crime as an example, there is an almost infinite number of clues that could lead to the identification of a murderer, but a computer would not know even where to begin looking.

[…]

Conclusion

This is all the technical background on learning and pattern recognition that you need to read the rest of the book. We have given you some global information on issues like theory formation, statistical relevance, and the complexity of search spaces. This will help you to understand the potential as well as the limits of KDD and data mining applications in your own organization. At the end of the book, Chapter 7 is more formal, and gives a somewhat more extensive treatment of this area.

[…]

• What data type or format it is in
• How this data is related to other data in other databases
• Where the data is from and to whom the data belongs

For these reasons, another database containing the so-called meta-data is needed, which describes the structure of the contents of a database. In a complex database environment, adequate meta-data is indispensable, since it determines the structure of both the operational data and the data warehouse. Meta-data is used by end-users for querying purposes, as well as by the data manager for structuring the management of a database site.

Designing decision support systems

The design of a decision support system differs considerably from that of an online transaction processing system. The main difference is that decision support systems are used only for queries, so their structure should be optimized for this use. Some of the design aspects of a data warehouse are discussed in Chapter 7 on formal issues of learning algorithms. When designing a decision support system, particular importance should be placed on the requirements of the end-user and the hardware and software products that will be required.

The requirements of the end-user

During discussions with your end-users, you will discover that there are many people who need to use decision support and that between them they will produce a huge variety of queries. Some end-users need specific query tools so that they can build their queries themselves; others are interested only in a particular part of the information. You can build a specific type of application around this latter type of end-user, in which case you can optimize the application completely in order to speed up the query process. In addition to these two decision support systems, you will also find trend analysis tools and tools that can show the outcome of queries in a graphical way — these are often used for statistical analysis. Because you have historical data in your decision support systems, you can carry out some trend analyses using statistical techniques. All these types of application will support management in its decision support function.

The hardware and software products of a decision support system

Within a data warehouse you need to satisfy specific hardware and software requirements in order to enable decision support to be successfully…

[…]

…compare all the information, and you will often need consultants with database and hardware expertise at this stage.

Multi-processing machines

A data mining environment has specific hardware requirements.
Certain machine-learning tasks will involve the comparison of millions of records, thus placing a tremendous burden on the system. If an end-user wants to compare large numbers of records within a very short period of time, the computer may need all its internal memory and all its processing power for this one task alone. When working with genetic algorithms in particular, it is important to understand the demands that are made on the computer: it has to take each record, compare it with all the other millions of records within a database, and on finding a certain pattern within the database, recalculate this pattern while constantly comparing all the records. In certain cases, the end-user needs an answer in a very short space of time. In order to get a satisfactory answer at least two options are available: define the question by focusing on a limited number of records and attributes within a database, or move towards a multi-processing computer system. In very large database sites, multi-processing machines are needed for data mining projects: the end-user defines the records and attributes to be worked on and an extract from the original database is copied to a multi-processing machine. All the records are stored on its hard disk so that this machine can be used only for data mining. There are several types of multi-processing machines and we will describe the two most important ones:

• symmetric multi-processing
• massively parallel

With the symmetric multi-processing machine all the processors work on one computer, all are equal, and they communicate via shared storage. Symmetric multi-processing machines share the same hard disk and the internal memory. Although processors share their internal coordination, this type of multi-processing is limited to a certain number of processors because the synchronization of the processors places a huge burden on the computer system; at the present time, approximately twelve processors is the maximum. Because everything is shared, this type of symmetric multi-processing can be used for data mining and, in many cases, this type of machine is sufficient for an organization's data mining activity.

The massively parallel machine is a computer where each processor has its own operating system, its own memory, and its own hard disk. Although each processor is independent, communication between the systems is possible. In this type of environment one can work with thou…

[…]

Figure 3.3 Targeting a mailing using data mining: the number of responses against the number of mailshots dispatched, for a mailing targeted using data mining and for an untargeted mailing.

Using such techniques, together with the benefits of data warehousing and business process re-engineering, it should be possible to justify the cost of most forms of data mining.

Conclusion

In this chapter we have presented some issues that greatly contribute to the success of a KDD environment: data warehousing, datamarts, and multi-processing machines. Although it is not necessary to develop a data warehouse before one starts a data mining project, since it is possible to do pilots on ad hoc data sets that are created on the basis of operational data, data mining in most cases repays the cost of the original investment when it is performed on a regular basis. A data warehouse facilitates such a process.
The cost justification of data mining is closely related to the benefits of data warehousing and decision support systems. It is clear, however, that when a decision support task is defined as the recognition of a pattern in a large database, data mining starts to be cost efficient. In the next chapter we will describe this process in greater detail.

[…]

Client number  Name     Address           Date purchase made  Magazine purchased
23003          Johnson  1 Downing Street  04-15-94            car
23003          Johnson  1 Downing Street  06-21-93            music
23003          Johnson  1 Downing Street  05-30-92            comic
23009          Clinton  2 Boulevard       01-01-01            comic
23013          King     3 High Road       02-30-95            sports
23019          Jonson   1 Downing Street  01-01-01            house

Figure 4.2 Original data.

A very important element in a cleaning operation is the de-duplication of records (see Figure 4.3). In a normal client database some clients will be represented by several records, although in many cases this will be the result of negligence, such as people making typing errors, or of clients moving from one place to another without notifying change of address. There are also cases in which people deliberately spell their names incorrectly or give incorrect information about themselves, especially in situations where individuals have been refused some type of insurance. By slightly mis-spelling their name or by giving a false address, they try to avoid a negative decision. Of course it is important for any company to be aware of such abnormalities in the database. Although data mining and data cleaning are two different disciplines, they have a lot in common and pattern recognition algorithms can be applied in cleaning data.

Client number  Name     Address           Date purchase made  Magazine purchased
23003          Johnson  1 Downing Street  04-15-94            car
23003          Johnson  1 Downing Street  06-21-93            music
23003          Johnson  1 Downing Street  05-30-92            comic
23009          Clinton  2 Boulevard       01-01-01            comic
23013          King     3 High Road       02-30-95            sports
23003          Johnson  1 Downing Street  01-01-01            house

Figure 4.3 De-duplication.

[…]

…connection between the lack of information and certain purchasing behavior by Mr King. For the moment we will suppose that we can omit this data without consequences for our final results. Next we carry out a projection of the records. In this example we are not interested in the clients' names, since we just want to identify certain types of client, so their names are removed from the sample database. Up to this point, the coding phase has consisted of nothing more than simple SQL operations, but now we are entering the stage where we will be able to perform more creative transformations on the data. By this time, the information in our database is much too detailed to be used as input for pattern recognition algorithms. Take for example the notion of a date of birth: an algorithm that puts people with the same date of birth into a certain customer class is obviously much too detailed for our purposes, whereas a similar algorithm that operates on age classes with an interval of, for instance, 10 years would be very applicable. The same holds true for addresses. Address information is much too detailed for pattern recognition and, in this case, we need to recode addresses into regional codes. The way in which we code the information will, to a great extent, determine the type of patterns we find.
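A minimal sketch of such recodings, with all names and values invented: ages are folded into ten-year brackets, purchase dates into month numbers counted from 1990, and addresses into coarse regional codes via a stand-in lookup table.

    from datetime import date

    def age_class(birth, on, width=10):
        # Recode a date of birth into a ten-year age bracket.
        age = on.year - birth.year - ((on.month, on.day) < (birth.month, birth.day))
        low = (age // width) * width
        return f"{low}-{low + width - 1}"

    def month_number(d, epoch_year=1990):
        # Recode a purchase date into a month number counted from 1990.
        return (d.year - epoch_year) * 12 + d.month

    # A real system would map postcodes to regions; this is a stand-in.
    region_of = {"1 Downing Street": 4, "2 Boulevard": 7}

    print(age_class(date(1968, 5, 2), on=date(1996, 1, 1)))   # '20-29'
    print(month_number(date(1994, 4, 15)))                    # 52
    print(region_of["1 Downing Street"])                      # 4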
Coding, therefore, is a creative activity that has to be performed repeatedly in order to get the best results. Take, for example, the subscription date: again, this is much too detailed to be of any value as such, but there are various ways to recode such dates in a way that yields valuable patterns. One solution might be to transform purchase dates into month numbers, starting from 1990. In this way, we might be able to find patterns in the time series of our customers' transactions. We could find dependencies similar to the following rule:

A customer with credit >13,000 and aged between 22 and 31 who has subscribed to a comic magazine at time T will very likely subscribe to a car magazine five years later.

Or we might identify trends such as:

The number of house magazines sold to customers with credit between 12,000 and 31,000 living in region 4 is increasing.

We may also identify migration of client types, such as:

A customer with credit between 5,000 and 10,000 who reads a comic magazine will very likely become a customer with credit between 12,000 and 31,000 who reads a sports and a house magazine after 12 years.

Sometimes, however, we are not interested in time series but in information such as seasonal influence on customer behavior. In such cases we can recode the subscription dates to seasonal codes and try to find patterns in this data. Coding is a creative process — there can be an infinite number of different codes that are related to any number of different potential patterns we would like to find.

[…]

Average
Age              46.9
Income           20.8
Credit           34.9
Car owner         0.59
House owner       0.59
car magazine      0.329
house magazine    0.702
sports magazine   0.447
music magazine    0.146
comic magazine    0.081

Figure 4.10 Averages.

…data set. With SQL we can uncover only shallow data, which is information that is easily accessible from the data set; yet although we cannot find hidden data, for the most part 80% of the interesting information can be abstracted from a database using SQL. The remaining 20% of hidden information requires more advanced techniques and, for large marketing-driven organizations, this 20% can prove of vital importance. A good way to start is to extract some simple, statistical information from the data set, and averages are an important example in this respect. In our data set (see Figure 4.10) we see that the average age is 46 years old, the average income 20, the average credit 34, and so on. It is interesting to look at the averages of the output fields: we see that 329 clients out of every 1000 subscribe to a car magazine, whereas only 81 out of 1000 subscribe to a comic. These numbers are very important, because they give us a norm by which to judge the performance of pattern recognition and learning algorithms. Suppose that you want to predict how many clients will buy a car magazine. Now an algorithm that always predicts 'no car magazine' would be correct in 671 out of 1000 cases, which is about 70%. Any learning algorithm that claims to give some insight into the data set and do some real predicting has to improve on this. A trivial result that is obtained by an extremely simple method is called a naive prediction, and an algorithm that claims to learn anything must always do better than the naive prediction (Figure 4.11). Here we can also see that it is more difficult to make predictions for the small group in our sample set.
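To pin the naive-prediction baseline down, the subscription rates from Figure 4.10 can be turned directly into majority-class accuracies; any learning algorithm must beat these numbers before its predictions mean anything.

    # Subscription rates per 1000 clients, taken from Figure 4.10.
    subscribers = {"car": 329, "house": 702, "sports": 447, "music": 146, "comic": 81}
    total = 1000

    for magazine, n in subscribers.items():
        # The naive prediction always answers with the majority class.
        naive_accuracy = max(n, total - n) / total
        print(f"{magazine:6s} naive baseline: {naive_accuracy:.1%}")

The car magazine baseline comes out at 67.1% (the 'about 70%' mentioned above), while the comic magazine baseline is 91.9%, which is exactly why the small groups are the hard ones to predict.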
Since only 81 out of 1000 clients subscribe to a comic magazine, a learning algorithm that claims to predict which clients will subscribe to comics has to give a…

[…]

Figure 4.16 Age distribution of readers of the sports magazine (number of purchasers per age bracket).

Visualization techniques

Visualization techniques are a very useful method of discovering patterns in data sets, and may be used at the beginning of a data mining process to get a rough feeling of the quality of the data set and where patterns are to be found. Interesting possibilities are offered by object-oriented three-dimensional tool kits, such as Inventor, which enable the user to explore three-dimensional structures interactively. In the section below on decision trees we will give an example of the use of these tools for the exploration of tree structures. Such techniques are developing rapidly: advanced graphical techniques in virtual reality enable people to wander through artificial data spaces, while the historic development of data sets can be displayed as a kind of animated movie. For most users, however, these advanced features are not accessible, and they have to rely on the simple graphical display techniques that are contained in the query tool or data mining tools they are using. These simple methods can provide us with a wealth of information. An elementary technique that can be of great value is the so-called scatter diagram; in this technique, information on two attributes is displayed in a Cartesian space. Scatter diagrams can be used to identify interesting sub-sets of the data sets so that we can focus the rest of the data mining process. There is a whole field of research dedicated to the search for interesting projections of data sets — this is called projection pursuit. In our example (Figure 4.17) we have made a projection along two dimensions: income and age. We see that on average young people with a low income tend to read the music magazine.

[…]

OLAP tools

This idea of dimensionality can be expanded: a table with n independent attributes can be seen as an n-dimensional space. Managers generally ask questions that pre-suppose a multi-dimensional analysis — they don't want to know how much is sold (a zero-dimensional question) but they do want to know what type of magazines are sold in a designated area per month and to what age group (a four-dimensional question: product, area, purchase date, and age). Information of this nature is called multi-dimensional, and such relationships cannot easily be analyzed when the table has the standard two-dimensional representation. We need to explore the relationship between several dimensions, and standard relational databases are not very good at this. They identify records using keys, and multi-dimensional relationships pre-suppose multiple keys, but there is a limit to the number of keys we can define effectively for a given table. There is, however, almost no end to the type of questions managers can formulate: one minute a manager might want sales data ordered by area, age, and income; the next minute, the same data ordered by credit and age — and this preferably online, using large data sets. OLAP tools were developed to solve these problems.
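On a small scale, such a multi-dimensional question can be emulated with an ordinary aggregation over tuples of dimension values; a toy sketch (all data invented) that counts sales along four dimensions and then slices the resulting cube:

    from collections import Counter

    # Toy transactions: (magazine, region, month, age class).
    sales = [
        ("car",   4, "1994-04", "20-29"),
        ("car",   4, "1994-04", "30-39"),
        ("music", 7, "1994-04", "20-29"),
        ("car",   4, "1994-05", "20-29"),
    ]

    cube = Counter(sales)   # counts along all four dimensions at once

    # Slice: car magazines sold in region 4 in April 1994, per age group.
    slice_ = {k[3]: v for k, v in cube.items()
              if k[0] == "car" and k[1] == 4 and k[2] == "1994-04"}
    print(slice_)           # {'20-29': 1, '30-39': 1}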
OLAP tools store their data in a special multi-dimensional format, often in memory, and a manager can ask any question at all, although the data cannot be updated. OLAP can be an important stage in a data mining process. There is, however, an important difference between data mining and OLAP tools: OLAP tools do not learn, they create no new knowledge, and they cannot search for new solutions. There is thus a fundamental difference between multi-dimensional knowledge and the type of knowledge one can extract from a database via data mining. Data mining is more powerful than OLAP. Another advantage is that data mining algorithms do not need a special form of storage, since they can work directly on data stored in a relational database.

k-nearest neighbor

When we interpret records as points in a data space, we can define the concept of neighborhood:

Records that are close to each other live in each other's neighborhood.

Suppose we want to predict the behavior of a set of customers, and we have a database with records describing these customers. The basic hypothesis required in order to make such a prediction will be that customers of the same type will show the same behavior. In terms of the metaphor of our multi-dimensional data space, a type is nothing more than a region in this data space. In other words, records of the same type will be close to each other in the data space; they will be living in each other's neighborhood. Based on this insight, we can develop a very simple but powerful learning algorithm — the k-nearest neighbor. The basic phi…

[…]

Figure 4.23 A four-level decision tree for the car magazine.

…there is still an 8% readership of the car magazine, while above that age there is no interest in this magazine, so this branch is not explored further by the algorithm. For the group under 44.5 years, income seems to be the next important attribute. People with a somewhat higher income (above 34.5) do not read the magazine, but below this income limit, age suddenly becomes decisive again: all the people in this group under the age of 31.5 subscribe to the magazine. One of the conclusions we can draw from this tree is that people with an income under 34.5 and an age under 31.5 are very likely to be interested in the car magazine, while those with an income above 34.5 and an age under 44.5 will probably not be. It turns out that a tree of depth four is optimal, since further expansion of the tree does not yield much more information. Figure 4.24 gives an overview of a maximally expanded tree as an interactive three-dimensional picture. The columns at the nodes of the tree represent the number of records that will be in the sub-nodes. We can see that all but one of the branches stop after three levels. When we zoom in on this branch (see Figure 4.25, but remember it is an interactive environment), we see that after one node (age < 31.5) there is very little development. The columns on all the consecutive nodes have about equal…
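Returning to the k-nearest neighbor method introduced above, the whole idea fits in a few lines: classify a new record by letting its k nearest records vote. The data here is invented, and a real application would first scale the attributes to comparable ranges so that no single attribute dominates the distance.

    import math
    from collections import Counter

    # Toy records: (age, income) -> reads the car magazine? (invented)
    train = [((25, 20), True), ((30, 25), True),
             ((55, 40), False), ((60, 45), False)]

    def predict(point, k=3):
        # Majority vote among the k training records nearest to 'point'.
        neighbours = sorted(train, key=lambda rec: math.dist(point, rec[0]))[:k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

    print(predict((28, 22)))   # True: the young, lower-income cluster wins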
[…]

Figure 4.27 Binary associations between the magazines.

…rules are defined on binary attributes. Now which associations are interesting? In the first place we look for associations that have a lot of examples in the database; we term this the support of an association rule. In our case the support of the rule is the percentage of records for which MUSIC_MAG, HOUSE_MAG and CAR_MAG all hold, that is, all the people that read all three magazines. Support in itself is not enough, however. It may be the case that we have a considerable group of people who read all three magazines but there is a much larger group that reads both MUSIC_MAG and HOUSE_MAG, although not CAR_MAG. In this case the association is very weak, although the support might be relatively high. We thus need an additional measure — confidence — and in the present case such confidence is the percentage of records for which CAR_MAG holds, within the group of records for which MUSIC_MAG and HOUSE_MAG hold. At present, association rules are only useful in data mining if we already have a rough idea of what it is we are looking for. This illustrates the fact that there is no algorithm that will automatically give us everything that is of interest in the database. An algorithm that finds a lot of rules will probably also find a lot of useless rules, while an algorithm that finds only a limited number of associations, without fine tuning, will probably also miss a lot of interesting information. In the example above, we have illustrated association rules using multiple attributes. In our marketing example concerning magazines, we will first investigate single-attribute or unary association rules. Figure 4.27 illustrates the associations between the different groups of magazine readers.

[…]

Neural networks

It is interesting to see that many machine-learning techniques are derived from paradigms related to totally different areas of research. Genetic algorithms derive their inspiration from biology, while neural networks are modeled on the human brain. Learning is such an important aspect of nature that it inevitably crops up in various domains of the study of living beings and provides us with suitable models for the study of learning behavior. In Freud's theory of psychodynamics, the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10¹¹, connected to each other via a huge number of so-called synapses. A single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex learning machines, and has led to the creation of so-called artificial neural networks. Such networks can be built using special hardware but most are just software programs that can operate on normal computers. Typically, a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals, and a potentially unlimited number of intermediate layers contain the intermediate nodes. There are various different architectures for neural networks, and they each utilize different wiring and learning strategies to perform tasks.
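Before turning to these networks in detail, the support and confidence measures defined earlier can be pinned down in a short sketch. The toy table below is invented; the two functions follow the definitions given above for a rule such as MUSIC_MAG and HOUSE_MAG -> CAR_MAG.

    # Toy binary attribute table; each row is one client (invented values).
    rows = [
        {"music": 1, "house": 1, "car": 1},
        {"music": 1, "house": 1, "car": 0},
        {"music": 0, "house": 1, "car": 1},
        {"music": 1, "house": 1, "car": 1},
        {"music": 0, "house": 0, "car": 0},
    ]

    def support_confidence(rows, antecedent, consequent):
        # Support: fraction of all records where antecedent and consequent hold.
        # Confidence: fraction of antecedent records where the consequent holds.
        n = len(rows)
        ante = sum(all(r[a] for a in antecedent) for r in rows)
        both = sum(all(r[a] for a in antecedent) and r[consequent] for r in rows)
        return both / n, (both / ante if ante else 0.0)

    sup, conf = support_confidence(rows, ("music", "house"), "car")
    print(f"support {sup:.0%}, confidence {conf:.0%}")   # support 40%, confidence 67%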
When using neural networks we have to distinguish between two stages: the encoding stage, in which the neural network is trained to perform a certain task, and the decoding stage, in which the network is used to classify examples, make predictions, or execute whatever learning task is involved. There are several different forms of neural network but we shall discuss only three of them here:

• Perceptrons
• Back propagation networks
• Kohonen self-organizing map

In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called perceptron, one of the first implementations of what would later be known as a neural network. A perceptron consists of a simple three-layered network with input units called photo-receptors, intermediate units called associators, and output units called responders. The perceptron could learn simple categories and thus could be used to perform simple classification tasks.

[…]

Figure 6.1 One of the client profiles found by the clustering algorithm.

The upper part of the figure describes the profile of a client, the lower part the potential behavior of this client. On the vertical axis of the figure we have indicated for each attribute in which interval a certain client must fall in order to belong to the cluster. Horizontally, 21 attributes are represented — those attributes form the profile. In order to be able to compare…

[…]

Naturally these findings reflect only the current situation in the database. It is not certain that the profiles we find, when projected onto the single buyers' segment of the database, will create genuine marketing opportunities. It might be the case that people who are single buyers in the database buy most of the other available services from competitors. Nonetheless, the information we extract from the database can be extremely valuable when setting up marketing campaigns. This indicates how a data mining project operates in practice — it can sometimes be complicated, but in the end may be extremely valuable.

Predicting bid behavior of pilots

In the next example we consider an embedded form of data mining. In the planning department of KLM, data mining has been used to reduce costs substantially, and it is a good example of the benefits produced by the element of repetition. The application can learn rules that predict the behavior of pilots on the basis of historic information in an Oracle database, although this behavior is subject to alteration as a result of changes in circumstances.
At any point in time the planner can decide to learn new rules on the basis of the current content of the database. This application is also a good example of a situation where data mining can provide a solution where other techniques fail. CAPTAINS is a complex application that enables a planner to maintain strategic, tactical, and operational models of pilot populations. A major problem in building the short-term planning algorithms for CAPTAINS was the prediction of pilot bid behavior (or career intentions). Twice per year pilots can express their preference for new seats (functions), and KLM is obliged to give a new seat to the most senior officer who has applied. (A 'seat' is a certain function for a certain aircraft type, such as second officer on a Boeing 747-400 — this corresponds to the physical seat in the cockpit.) However, in some cases, when planning future seasons, the bids are not yet known. Furthermore, a change in a pilot's bid can influence the planning substantially, so the correct prediction of pilot bid behavior is of vital importance for good planning. Using genetic algorithms, we were able to produce rules that predict pilot bid behavior within an acceptable level of accuracy. To appreciate better the need for machine learning in the domain of career planning, we must describe the planning problem in some detail. Certain elements of the description are particular to the KLM situation, but the example is applicable in the main to any airline. Similar problems also exist for railway companies and indeed for any organization that aims to carry out some form of career planning.

[…]

CHAPTER 7  Some formal aspects of learning algorithms

Introduction

This chapter is intended for the reader with more than a casual interest in the background of data mining and machine learning, and we discuss some of the more formal aspects. We begin with an introduction to some of the mathematical aspects of learning algorithms, since it can be very useful to have a general understanding of these ideas when one is involved in a KDD project. A second issue addressed in the chapter is the relationship between data mining and the theory of the relational database. The formal model of the relational database is not completely suited to the development of data warehouses and KDD environments, and there are important new developments in this area. We conclude with some remarks on data warehouses and data mining primitives.

Learning as compression of data sets

In most cases, learning can be described from a mathematical point of view as the compression of data sets. If we have an algorithm that creates a description of a data set that is effectively shorter than the original data set, then we can say that we have learned something. There is a relationship between the complexity of data sets and learnability. In general,…

[…]
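The compression view of learning stated above can be illustrated with a small, self-contained experiment: a data set with a strong regularity compresses to a fraction of its size, while a structureless one does not. Here zlib merely stands in for 'finding a shorter description'; it is an illustration, not a data mining algorithm.

    import random
    import zlib

    random.seed(1)
    patterned = bytes(i % 10 for i in range(10_000))              # highly regular
    noisy = bytes(random.randrange(256) for _ in range(10_000))   # no structure

    for name, data in [("patterned", patterned), ("noisy", noisy)]:
        ratio = len(zlib.compress(data)) / len(data)
        print(f"{name}: compressed to {ratio:.1%} of original size")

The patterned set shrinks to a few percent of its original size (a short description exists, so there is something to learn), while the noisy set stays at roughly its full size.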
