This page intentionally left blank

.

newagepublishers. without the written permission of the publisher.Copyright © 2007 New Age International (P) Ltd.110002 Visit us at www. All inquiries should be emailed to rights@newagepublishers.. Publishers Published by New Age International (P) Ltd. electronic or mechanical. microfilm.com . by photostat.com ISBN : 978-81-224-2432-4 PUBLISHING FOR ONE WORLD NEW AGE INTERNATIONAL (P) LIMITED. New Delhi . PUBLISHERS 4835/24. xerography. Ansari Road.. or incorporated into any information retrieval system. Publishers All rights reserved. Daryaganj. or any other means. No part of this ebook may be reproduced in any form.

.Dedicated to our beloved Grandparents...

This page intentionally left blank .

Chapter 7 covers how to design the data warehouse using fact tables. etc. We have attempted to cover the major topics in data mining and warehousing in depth. A brief synopsis of the chapters and comments on their appropriateness in a basic course is therefore appropriate. Chapter 9 covers the data mart and meta data. Chapter 4 deals with various types of data mining techniques and algorithms available. This book also contains sufficient material to make up an advanced course on data mining and warehousing. Chapter 8 deals with partitioning strategy in software and hardware. Chapter 1 introduces the basic concepts to data mining and warehousing and its essentials. tuning of data warehouse. Chapter 11 deals with performance. dimension tables and schemas. . Model question paper is also given. We conclude that. commercial organizations. has been put into separate chapters so that courses in a variety of levels can be taught from this book. Chapter 2 covers the types of knowledge and learning concepts. Chapter 5 covers various applications related to data mining techniques and real time usage of these techniques in various fields like business. which are essential to data mining techniques. Each and every chapter has exercise. this book is the only book. benefits and new architecture of the data warehouse. which deals both data mining techniques and data warehousing concepts suited for the modern time. however. Chapter 3 covers how to discover the knowledge hidden in the databases using various techniques. Chapter 10 covers backup and recovery process. Advanced material. challenges. Chapter 6 deals with the evaluation of data warehouse and need of the new design process for databases.PREFACE This book is intended as a text in data mining and warehousing for engineering and post-graduate level students.

This page intentionally left blank .

Many experts made their contributions to the quality of writing by providing oversight and critiques in their special areas of interest. they are sincerely offered and well deserved. New Age International (P) Ltd.A. whose efforts maintained the progress of our work and ultimately put us in the right path. And finally. and for the faith in our joint effort. Such excellence as the book may have is owed in part to them. Bangalore. this book would indeed be poorer. for sparing his precious time in editing this book. for their tolerance of our many requests and especially for their commitment to this book and its authors. Once again we acknowledge our family members and wish them to know that our thanks are not merely “Proforma”. Scientist. Without such help. We particularly thank my graduate students for contributing a great deal to some of the original wording and editing. N..D. S.ACKNOWLEDGEMENT Thanks and gratitude is owed to many kind individuals for their help in various ways in the completion of this book. Ph. Karaikal. We are also indebted to our publishers Messrs. S.R. We first of all record our gratitude to the authorities of Bharathiyar College of Engineering and Technology. for the willingness to accept the advice of the other. We owe thanks to our colleagues and friends who have added to this book by providing us with their valuable suggestions.L. any errors are ours. Rajagopalan. each of us wishes to “doff his hat” to the other for the unstinting efforts to make this book the best possible. Retd. We began writing this book an year ago as good friends and we completed it as even better friends. Venkatesan . Prabhu N. We particularly thank Dr.

This page intentionally left blank .

1 Introduction 4.2 Classification vii ix 1-7 1 2 3 3 4 6 7 8-13 8 8 9 12 13 14-22 14 15 15 20 20 22 23-53 23 23 Chapter 2 Chapter 3 Chapter 4 .6 Mining and Warehousing Concepts Introduction Data Mining Definitions Data Mining Tools Applications of Data Mining Data Warehousing and Characteristics Data Warehouse Architecture Exercise Learning and Types of Knowledge 2.2 1.2 What is Learning? 2.CONTENTS Preface Acknowledgement Chapter 1 Data 1.4 Different Types of Knowledge Exercise Knowledge Discovery Process 3.5 1.3 1.2 Evaluation of Data Mining 3.4 Data Mining Operations 3.3 Stages of the Data Mining Process 3.3 Anatomy of Data Mining 2.1 Introduction 3.1 1.5 Architecture of Data Mining Exercise Data Mining Techniques 4.1 Introduction 2.4 1.

Information and Knowledge 6.7 Data Warehousing Schemas Exercise Partitioning in Data Warehouse 8.4 Goals of Data Warehouse Architecture 7. OLTP.2 Future Scope 5. OLAP and Data Mining Exercise Data Warehouse Design 7.5 4.3 RAID Levels 8.9 4.4 OLAP and OLAP Server 6.2 Hardware Partitioning 8.3 Fundamental of Database 6.4 4.3.2 The Central Data Warehouse 7.5 Data Warehouse Users 7.2 Data.3 Data Mining Products Exercise Data Warehouse Evaluation 6.4 Software Partitioning Methods Exercise 23 24 32 34 45 46 51 52 53 54-62 54 59 61 62 63-71 63 65 65 68 69 71 72-85 72 72 74 77 78 79 81 85 86-94 86 86 88 92 94 .1 Introduction 8. Data Warehousing Objects 7.5 Data Warehouses.1 Applications of Data Mining 5.3 4.(xii) 4.1 Introduction 7.6 4.7 4.10 Chapter 5 Chapter 6 Chapter 7 Chapter 8 Neural Networks Decision Trees Genetic Algorithm Clustering Online Analytic Processing (OLAP) Association Rules Emerging Trends in Data Mining Data Mining Research Projects Exercise Real Time Applications and Future Scope 5.1 The Calculations for Memory Capacity 6.8 4.6 Design the Relational Database and OLAP Cubes 7.

1 9.2 Prioritized Tuning Steps 11.5 Future of the Data Warehouse 11.4 Chapter 10 Chapter 11 Appendix Appendix Appendix Appendix A B C D Mart and Meta Data Introduction Data Mart Meta Data Legacy Systems Exercise Backup and Recovery of the Data Warehouse 10.4 Benefits of Data Warehousing 11.6 New Architecture of Data Warehouse Exercise Glossary Multiple-Choice Questions Frequently Asked Questions & Answers Model Question Papers Bibliography Index 95-100 95 95 96 100 100 101-105 101 101 102 104 105 106-109 106 106 107 108 108 109 109 110-114 115-116 117-119 120-124 125-126 127-129 .1 Introduction 11.2 Types of Backup 10.3 Challenges of the Data Warehouse 11.3 Backup the Data Warehouse 10.(xiii) Chapter 9 Data 9.4 Data Warehouse Recovery Models Exercise Performance Tuning and Future of Data Warehouse 11.3 9.1 Introduction 10.2 9.

This page intentionally left blank .

medical history data in health care.CHAPTER 1 DATA MINING AND WAREHOUSING CONCEPTS 1. The data storage bits/bytes are calculated as follows: ² 1 byte = 8 bits ² 1 kilobyte (K/KB) = 2 ^ 10 bytes = 1. made data cheap. policy and claim data in insurance. There are many examples that can be cited. are some instances of the types of data that is being collected. Point of sale data in retail.1 The growing base of data Data storage became easier as the availability of large amounts of computing power at low cost i. Volume of data 1970 1980 1990 2000 Year Fig.024 bytes . The new methods tend to be computationally intensive hence a demand for more processing power. financial data in banking and securities.e. in addition to traditional statistical analysis of data. 1..1 INTRODUCTION The past couple of decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months and the sizes as well as number of databases are increasing even faster. the cost of processing power and storage is falling. There was also the introduction of new machine learning methods for knowledge representation based on logic programming etc.

Some of the numerous definitions of Data Mining.2 | Data Mining and Warehousing ² 1 megabyte (M/MB) = 2 ^ 20 bytes = 1. if the database is a faithful mirror.627. and detecting anomalies’. such as clustering. Traditional on-line transaction processing systems. 1. It is the computer which is responsible for finding .842.899.073. such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and. Database Management Systems gave access to the data stored but this was only a small part of what could be gained from the data.824 bytes ² 1 terabyte (T/TB) = 2 ^ 40 bytes = 1. analyzing changes. This is where Data Mining or Knowledge Discovery in Databases (KDD) has obvious benefits for any enterprise.906. but as it stands of low value as no direct use can be made of it. Data mining refers to “using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data. previously unknown. Frawley. is the process of extracting hidden knowledge from large volumes of raw data and using it to make crucial business decisions. safely and efficiently but are not good at delivering meaningful analysis in return. OLTPs. data summarization.921. The data is often voluminous. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored to derive knowledge about the business. Matheus ‘Data Mining.846.624 bytes ² 1 exabyte (E/EB) = 2 ^ 60 bytes = 1.2 DATA MINING DEFINITIONS The term data mining has been stretched beyond its limits to apply to any form of data analysis.576 bytes ² 1 gigabyte (G/GB) = 2 ^ 30 bytes = 1.504.099. also called as data archeology. prediction. or Knowledge Discovery in Databases (KDD) as it is also known. According to William J. or Knowledge Discovery in Databases are: Extraction of interesting information or patterns from data in large databases is known as data mining. Data Mining.511. forecasting and estimation. of the real world registered by the database”. and potentially useful information from data.776 bytes ² 1 petabyte (P/PB) = 2 ^ 50 bytes = 1. and extracting these in such a way that they can be put to use in the areas such as decision support.125. finding dependency networks.606. data harvesting. According to Marcel Holshemier and Arno Siebes “Data mining is the search for relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amount of data.976 bytes ² 1 zettabyte (Z/ZB) =1 000 000 000 000 000 000 000 bytes ² 1 yottabyte (Y/YB) =1 000 000 000 000 000 000 000 000 bytes It was recognized that information is at the heart of business operations and that decisionmakers could make use of the data stored to gain valuable insight into the business. is the nontrivial extraction of implicit. data dredging. Gregory Piatetsky-Shapiro and Christopher J. This encompasses a number of different technical approaches.741. learning classification rules.048.152. it is the hidden information in the data that is useful” Data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. are good at putting data into databases quickly.

Once knowledge has been acquired this can be extended to larger sets of data working on the assumption that the larger data set has a structure similar to the sample data.4 APPLICATIONS OF DATA MINING Data mining has many and varied fields of application some of which are listed below.1 Oracle Data Mining (ODM) Oracle enables data mining inside the database for performance and scalability. 1. ORACLE9i. This enables e-businesses to incorporate predictions and classifications in all processes and decision points throughout the business cycle. ODM is designed to meet the challenges of vast amounts of data. The analysis process starts with a set of data. Data mining analysis tends to work from the data up and the best techniques are those developed with an orientation towards large volumes of data. By delivering complete programmatic control of the database in data mining. Again this is analogous to a mining operation where large amounts of low-grade materials are sifted through in order to find something of value. Oracle Data Mining (ODM) delivers powerful. making use of as much of the collected data as possible to arrive at reliable conclusions and decisions.Data Mining and Warehousing Concepts | 3 the patterns by identifying the underlying rules and features in the data.3 DATA MINING TOOLS The best of the best commercial database packages are now available for data mining and warehousing including IBM DB2. scalable modeling and real-time scoring. Intelligent Miner. INFORMIX-On Line XPS. Some of the capabilities are: ² An API that provides programmatic control and application integration ² Analytical capabilities with OLAP and statistical functions in the database ² Multiple Algorithms: Naïve Bayes and Association Rules ² Real-time and Batch Scoring modes ² Multiple Prediction types ² Association insights ORACLE9i Data Mining provides a Java API to exploit the data mining functionality that is embedded within the ORACLE9i database. This integrated intelligence enables the automation and decision speed that e-businesses require in order to compete today. . delivering accurate insights completely integrated into e-business applications.3. 1. 4 Thought and SYBASE System 10. The idea is that it is possible to strike gold in unexpected places as the data mining software extracts patterns not previously discernable or so obvious that no-one has noticed them before. 1. Clementine. uses a methodology to develop an optimal representation of the structure of the data during which time knowledge is acquired.

As well as integrating data throughout an enterprise.4. aimed at enabling the knowledge worker (executive.1 Sales/Marketing ² ² ² ² Identify buying patterns from customers Find associations among customer demographic characteristics Predict response to mailing campaigns Market basket analysis 1. It can be loosely defined as any centralized data repository which can be queried for business benefit but this will be more clearly defined later. .5 DATA WAREHOUSING AND CHARACTERISTICS Data warehousing is a collection of decision support technologies.4 | Data Mining and Warehousing 1.e. or communication requirements it is possible to incorporate additional or expert information.4.5 Medicine ² Characterize patient behaviour to predict office visits ² Identify successful medical therapies for different illnesses 1..4.4. which medical procedures are claimed together Predict which customers will buy new policies Identify behaviour patterns of risky customers Identify fraudulent behaviour 1. Data warehousing is a new powerful technique making it possible to extract archived operational data and overcome inconsistencies between different legacy data formats. Data mining potential can be enhanced if the appropriate data has been collected and stored in a data warehouse. manager. regardless of location.2 Banking ² ² ² ² ² ² Credit card fraudulent detection Identify ‘loyal’ customers Predict customers likely to change their credit card affiliation Determine credit card spending by customer groups Find hidden correlation’s between different financial indicators Identify stock trading rules from historical market data 1. analyst) to make better and faster decisions.4 Transportation ² Determine the distribution schedules among outlets ² Analyze loading patterns 1. format.4. A data warehouse is a relational database management system (RDBMS) designed specifically to meet the needs of transaction processing systems.3 Insurance and Health Care ² ² ² ² Claims analysis i.

When they achieve this. These difficulties are eliminated by ETL Tools since they are very powerful and they offer many advantages in all stages of ETL process starting from extraction. Data warehouses must put data from disparate sources into a consistent format. the above mentioned ETL process was done manually by using SQL code created by programmers. they are said to be integrated. debugging and loading into data warehouse when compared to the old method. you can answer questions like “Who was our best customer for this item last year?” This ability to define a data warehouse by subject matter. Before the evolution of ETL Tools. transformation. to learn more about your company’s sales data. Using this warehouse. Integration is closely related to subject orientation. an online analytical processing (OLAP) engine. transformation. On top of it. Following are some of those: Tool name Informatica DT/Studio Data Stage Ab Initio Data Junction Oracle Warehouse Builder Microsoft SQL Server Integration Transform On Demand Solonde Transformation Manager Company name Informatica Corporation Embarcadero Technologies IBM Ab Initio Software Corporation Pervasive Software Oracle Corporation Microsoft ETL Solutions A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon. author of Building the Data Warehouse and the guru who is widely considered to be the originator of the data warehousing concept. This task was tedious in many cases since it involved many resources. There are a number of ETL tools available in the market to do ETL process the data according to business/technical requirements. transportation. For example. you can build a warehouse that concentrates on sales. is as follows: ² Subject Oriented ² Integrated ² Nonvolatile ² Time Variant Data warehouses are designed to help you analyze data. ETL Tools are meant to extract. They must resolve such problems as naming conflicts and inconsistencies among units of measure. client analysis tools. data profiling. complex coding and more work hours. a data warehouse environment includes an extraction. transform and load the data into Data Warehouse for decision making. and other applications that manage the process of gathering data and delivering it to business users. makes the data warehouse subject oriented. maintaining the code placed a great challenge among the programmers. sales in this case. data cleansing. .Data Mining and Warehousing Concepts | 5 In addition to a relational database. and loading (ETL) solution.

When data are moved from the operational environment into the data warehouse. such as sales. A data warehouse that is designed for a particular line of business. or older. there may be several departmental data marts. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred. 1. gender might be coded as “m” and “f” in another by 0 and 1. where performance requirements demand that historical data be moved to an archive.6 | Data Mining and Warehousing For instance. In an independent data mart. These data are not updated. and forecasting. perhaps onto slower archival storage. data can be collected directly from sources. In order to discover trends in business. Nonvolatile means that. In addition to the main warehouse. once entered into the warehouse. they assume a consistent coding convention e. for cleaning.6 DATA WAREHOUSE ARCHITECTURE Data warehouse architecture includes tools for extracting data from multiple operational databases and external sources. This is very much in contrast to online transaction processing (OLTP) systems. gender data is transformed to “m” and “f”. to be used for comparisons. In a dependent data mart. A data warehouse’s focus on change over time is what is meant by the term time variant. or finance.g.2 Data warehousing architecture Tools . transforming and integrating this data. analysts need large amounts of data. data should not change. and for periodically refreshing the warehouse to reflect updates at the sources and to purge data from the warehouse. 1. trends. Data in the warehouse and data marts is stored and managed by Monitoring & Administration Metadata Repository OLAP Servers Data Warehouse External Sources Extract Transform Load Refresh Query/Reporting Serve Data Mining Analysis Operational dbs Data Sources Data Marts Fig. the data can be derived from an enterprise-wide data warehouse. The data warehouse contains a place for storing data that are 10 to 20 years old. marketing. in one application. for loading data into the data warehouse.

Data Mining and Warehousing Concepts | 7 one or more warehouse servers. and tools for monitoring and administering the warehousing system. Define data mining. What is data warehouse? List out the characteristics of data warehouse. and data mining tools. report writers. . which present multidimensional views of data to a variety of front end tools: query tools. 3. 4. Explain the uses of data mining and describe any three data mining softwares. Explain the data warehouse architecture. 2. analysis tools. Finally. there is a repository for storing and managing meta data. EXERCISE 1. 5.

1 INTRODUCTION It seems that learning is one of the basic necessities of life.. database is analyzed with a view to finding patterns. In order to define learning operationally.2 WHAT IS LEARNING? An individual learns how to carry out a certain task by making a transition from a situation in which the task cannot be carried out to a situation in which the same task can be carried out under the same circumstances. This process of classification identifies classes such that each class has a unique pattern of values which forms the class description. we need two other concepts a certain ‘task’ to be carried out either well or badly and a learning ‘subject’ that is to carry out the task. Inductive learning where the system infers knowledge itself from observing its environment has two main strategies: ² Supervised learning—This is learning from examples where a teacher helps the system construct a model by defining classes and supplying examples of each class. should be able to learn. 2.2.e. living creatures must have the ability to adapt themselves to their environment. Once . we will not state what learning is but rather specify how to determine when someone has learned something. the common properties in the examples. 2..CHAPTER 2 LEARNING AND TYPES OF KNOWLEDGE 2.e. Generally it is only possible to use a small number of properties to characterize objects so we make abstractions in that objects which satisfy the same subset of properties are mapped to the same internal representation. Similar objects are grouped in classes and rules formulated whereby it is possible to predict the class of unseen objects.1 Inductive learning Induction is the inference of information from data and inductive learning is the model building process where the environment i. The system has to find a description of each class i. we will concentrate on an operational definition of the concept of learning. There are undeniably many different ways of learning but instead of considering all kinds of different definitions..e. The nature of the environment is dynamic hence the model must be adaptive i.

The data are typically a collection of records. class description) by itself. disease). are some instances of the types of data that is being collected. There are many examples that can be cited. Very often. i. policy and claim data in insurance.e.Learning and Types of Knowledge | 9 the description has been formulated the description and the class form a classification rule which can be used to predict the class of previously unseen objects. financial data in banking and securities..e. age) and some symbolic (discrete valued. 2. and it is not always possible to verify a model by checking it for all possible situations. Database Systems Statistics Machine Learning Data Mining Visualization Algorithm Other Disciplines Fig. medical history data in health care. ² Unsupervised learning—This is learning from observation and discovery. This is similar to discriminate analysis as in statistics. where each individual record may correspond to a transaction or a customer.1 Multi-disciplinary approach in data mining . Again this similar to cluster analysis as in statistics.g. This system results in a set of class descriptions.3 ANATOMY OF DATA MINING The use of computer technology in decision support is now widespread and pervasive across a wide range of business and industry.. Point of sale data in retail. The problem is that most environments have different states. 2. The data mine system is supplied with objects but no classes are defined so it has to observe the examples and recognize patterns (i. e. changes within.. these fields are of mixed type. The quality of the model produced by inductive learning methods is such that the model could be used to predict the outcome of future situations in other words not only for states encountered but rather for unseen states that could occur. e. This has resulted in the capture and availability of data in immense volume and proportion.. with some being numerical (continuous valued. one for each class discovered in the environment. The simpler models are more likely to be correct if we adhere to Ockham’s razor. and the fields in the record correspond to attributes. which states that if there are multiple explanations for a particular phenomenon it makes sense to choose the simplest because it is more likely to capture the nature of the phenomenon. Given a set of examples the system can construct multiple models some of which will be simpler than others. Induction is therefore the extraction of patterns.g.

while ML typically (but not always) looks at smaller data sets. Data mining however allows the expert’s knowledge of the data and the advanced analysis techniques of the computer to work together. learning with teacher. ² ML is a broader field which includes not only learning from examples. Statistics have a role to play and data mining will not replace such analyses but rather they can act upon more directed analyses based on the results of data mining. When integrating machine-learning techniques into database systems to implement KDD some of the databases require: . For example statistical induction is something like the average rate of failure of machines. Analysts to detect unusual patterns and explain patterns using statistical models such as linear models have used statistical analysis systems such as SAS and SPSS. Machine learning examines previous examples and their outcomes and learns how to reproduce these and make generalizations about new cases. but also reinforcement learning.10 | Data Mining and Warehousing 2. The main differences are: ² KDD is concerned with finding understandable knowledge. Generally a machine learning system does not use single observations of its environment but an entire finite set called the training set at once. there are efforts to extract knowledge from neural networks which are very relevant for KDD.2. This set contains examples i. ² KDD is concerned with very large. learning with teacher. observations coded in some machine readable form.. 2. A learning algorithm takes the data set and its accompanying information as input and returns a statement e.1 Differences between Data Mining and Machine Learning Knowledge Discovery in Databases (KDD) or Data Mining.3.3.3. etc. while ML is concerned with improving performance of an agent. The training set is finite hence not all concepts can be learned exactly.1 Statistics Statistics has a solid theoretical foundation but the results from statistics can be overwhelming and difficult to interpret as they require user guidance as to where and how to analyze the data. real-world databases. So training a neural network to balance a pole is part of ML.e. 2.2 Machine Learning Machine learning is the automation of a learning process and learning is tantamount to the construction of rules based on observations of environmental states and transitions. but also reinforcement learning. etc. KDD is that part of ML which is concerned with finding understandable knowledge in large sets of real-world examples. This is a broad field which includes not only learning from examples. a concept representing the results of learning as output. So efficiency questions are much more important for KDD.. However. and the part of Machine Learning (ML) dealing with learning from examples overlap in the algorithms used and the problems addressed.g. but not of KDD.

2. and ² Interpreting the knowledge produced to solve users’ problems and/or reduce data spaces. as near perfect as possible. which represent instances of a problem domain.3. Databases are usually contaminated by errors so the data mining algorithm has to cope with noise whereas ML has laboratory type examples i.Learning and Types of Knowledge | 11 ² More efficient learning algorithms because realistic databases are normally very large and noisy.5 Visualization Data visualization makes it possible for the analyst to gain a deeper.4 Algorithms 2. Visualization indicates the wide range of tool for presenting the reports to the end user . more intuitive understanding of the data and as such can work well along side data mining. 2..2 Statistical Algorithms Statistics is the science of colleting.. and knowledge.3 Database Systems A database is a collection of information related to a particular subject or purpose.. It is usual that the database is often designed for purposes different from data mining and so properties or attributes that would simplify the learning task are not present nor can they be requested from the real world. Statistical analysis systems such as SAS and SPSS have been used by analysts to detect unusual patterns and explain patterns using statistical models such as linear models. Data mining allows the analyst to focus on certain patterns and trends and explore in-depth using visualization.1 Genetic Algorithms Optimization techniques that use processes such as genetic combination. .4. e. and the semantic information contained in the relational schemata. and natural selection in a design based on the concepts of natural evolution.3.4. //Data spaces being the number of examples.3.The presentation ranges from simple table to complex graph. rules in a rule-based system. mutation. ² More expressive representations for both data. ² Using machine learning techniques to produce knowledge bases from databases.g. tuples in relational databases. 2. On its own data visualization can be overwhelmed by the volume of data in a database but in conjunction with data mining can help with exploration. 2. and applying numerical facts. e.3. using various 2D and 3D rendering techniques to distinguish information presented. which can be used to solve users’ problems in the domain. organizing.e.3. such as tracking customer orders or maintaining a music collection.g. Practical KDD systems are expected to include three interconnected phases ² Translation of standard database information into a form suitable for use by learning facilities.

for the present at any rate. Here information that can be obtained through data mining techniques. The advantage of OLAP tools is that they are optimized for the kind of search and analysis operation. It is almost impossible to decipher a message that is encrypted if you do not have key. In data mining we distinguish four different types of knowledge. A pattern recognition algorithm could find regularities in a database in minutes or at most a couple of hours.12 | Data Mining and Warehousing 2. An example of this is encrypted information stored in a database. which indicates that. With and different orderings of the data but it is important to realize that most of the things you can do with an OLAP tool can also be done using SQL. with no indication of any elevations in the neighborhood. 2.1 Different Types of Knowledge and Techniques Type of knowledge Shallow knowledge Multi-dimensional knowledge Hidden knowledge Deep knowledge Technique SQL OLAP Data mining Clues . OLAP is not as powerful as data mining. Hidden knowledge is the result of a search space over a gentle. TABLE 2. Again. without achieving any significant result.1 Shallow Knowledge This is information that can be easily retrieved from database using a query tool such as Structured Query Language (SQL).4 Deep Knowledge This is information that is stored in the database but can only be located if we have a clue that tells us where to look. 2. a search algorithm can easily find a reasonably optimal solution.4.4 DIFFERENT TYPES OF KNOWLEDGE Knowledge is a collection of interesting and useful pattern in a database.4. However. The key issue in Knowledge Discovery in Database is to realize that there is more information hidden in your data than you are table to distinguish at first sight. one could use SQL to find these patterns but this would probably prove extremely time-consuming. there is a limit to what one can learn.2 Multi-Dimensional Knowledge OLAP tools you have the ability to rapidly explore all sorts of clusterings this is information that can be analyzed using online analytical processing tools. A search algorithm could roam around this landscape for ever. 2. whereas you would have to spend months using SQL to achieve the same result.4.3 Hidden Knowledge This is data that can be found relative easily by using pattern recognition or machine learning algorithms.4. 2. Deep knowledge is typically the result of a search space over only a tiny local optimum. it cannot search for optimal solutions. hilly landscape.

2. What are the differences between data mining and machine learning? . 4. What are the different kinds of knowledge? Compare data mining with query tools. 6. 5. What is learning? Explain different types of learning with example. 7.Learning and Types of Knowledge | 13 EXERCISE 1. With a neat sketch explain the architecture of data mining. 3. Write in detail about data mining and query tools. Explain in detail about self-learning computer systems and concept learning.

and potentially useful information from data. Data mining. . cleaning. used to provide insight into the trends inherent in historical information. finding dependency networks. Quite a bit of work. or Knowledge Discovery in Databases (KDD) as it is also known. previously unknown. and transformation of data. improves the data cleaning and transformation steps that are so crucial to good data warehousing practices. Data mining does not guarantee the behavior of future data through the analysis of historical data. Similarly. including research. misconceptions about the role that data mining plays in decision support solutions can lead to confusion about and misuse of these tools and techniques. selection. Although Analysis Services removes much of the mystery and complexity of the data mining process by providing data mining tools for creating and examining data mining models. such as clustering. data mining is a guidance tool. Well-designed data warehouses have handled the data selection. analyzing changes. garbage out) law applies more to data mining than to any other area in Analysis Services. in turn. Data mining and data warehouses complement each other. the process of data warehousing improves as. enrichment. Instead. This encompasses a number of different technical approaches. through data mining. learning classification rules. enrichment. must be performed first if data mining is to truly supply meaningful information. Data mining is not a “black box” process in which the data miner simply builds a data mining model and watches as meaningful information appears. and transformation steps that are also typically associated with data mining. cleaning. However. it becomes apparent which data elements are considered more meaningful than others in terms of decision support and. these tools work best on well-prepared data to answer well-researched business scenarios—the GIGO (garbage in. and detecting anomalies. data summarization.1 INTRODUCTION Data mining is an effective set of analysis tools and techniques used in the decision support process.CHAPTER 3 KNOWLEDGE DISCOVERY PROCESS 3. is the nontrivial extraction of implicit.

Knowledge Discovery Process | 15 3. proactive information delivery. Prospective. massive databases Retrospective. Data Prep 1.2 EVALUATION OF DATA MINING TABLE 3. Data Selection & Cleaning Fig. dynamic data delivery at multiple levels. Interpretation/ Knowledge 4. Relational databases Retrospective. (RDBMS). tapes.” “What’s likely to happen to Chennai unit sales next month? Why?” Enabling technologies Computers. Data Warehousing & Decision Support (1990s) Data Mining (Emerging Today) 3. Data Mining/ Pattern Discovery 3.3 STAGES OF THE DATA MINING PROCESS 5. ODBC On-line analytic processing (OLAP). 3. data warehouses Advanced algorithms. Structured dynamic data delivery Query Language (SQL). disks Characteristics Retrospective. multiprocessor computers.1 An overview of the steps that compose the KDD process . static data delivery.1 Data Mining Evaluation Evolutionary step Data Collection (1960s) Data Access (1980s) Business question “What was my total revenue in the last four years?” “What were unit sales in New Delhi last May?” “What were unit sales in New Delhi last March? Drill down to Chennai. Transformation 2. multidimensional databases. at record level.

or of clients moving from one place to another without notifying change of address. A domain expert is someone who is intimately familiar with the business purposes and aspects. Data enrichment.3. identifying data.16 | Data Mining and Warehousing 3. some of which can be executed in advance while others are invoked only after pollution is detected at the coding or the discovery stage. and logical inconsistencies.2 Original Data P. The first part.5 ——— ——— 10. period of diseases. although in many cases this will be the result of negligence. while data transformation changes the form or structure of attributes in existing data to meet specific data mining requirements. for data mining purposes. In the original table we have X and X1. It is a selection operational data from the Public Health Center patients of a small village contains information about the name.2 Data Cleaning Data cleaning is the process of ensuring that. TABLE 3.3. In a normal database some clients will be represented by several records. There are several types of cleaning process. such as orphan records or required fields set to null. or domain. age. which is a strong indication that they have the same person but that one of the spelling is incorrect. In order to facilitate the KDD process. the data is uniform in terms of key and attribute usage.5 3. which requires significant input by a domain expert for the data. by contrast. of the data to be examined. . locating data. In our example we start with a database containing records of patient for hypertension diseases with duration. tends to be more mechanical in nature than the second part. Although data mining and data cleaning are two different disciplines. address. A important element in a cleaning operation is the de-duplication of records. 101 102 103 104 105 104 Name U V W X Y XI Address UU1 VV1 WW1 XX1 YY1 XXI Sex Male Male Male Female Male Female Age 49 45 61 49 43 49 Hype_dur ——— 10.5 –10. they have a lot in common and pattern recognition algorithm can be applied in cleaning data. adds new attributes to existing data. such as people making typing errors. They have same Number and Address but different name. such as accounts with closing dates earlier than starting dates. disease particulars. a copy of this operational data is drawn and stored in a separate database. The process of inspecting data for physical inconsistencies.No. Data cleaning is separate from data enrichment and data transformation because data cleaning attempts to correct misused or incorrect attributes in existing data.1 Data Selection There are two parts to selecting data for data mining. designation.

5 10. but data enrichment involves adding information to existing data. in particular the combination of external data sources with data to be mined. kidney. This can include combining internal data with external data.No. obtained from either different departments or companies or vendors that sell standardized industry-relevant data. This is more realistic than it may initially . Data transformation involves the manipulation of data. In our example.4 Domain Consistency P.5 NULL 3.5 -10. data warehouses frequently provide pre aggregation across business lines that share common attributes for cross-selling analysis purposes.5 NULL 10. to existing data.No.Knowledge Discovery Process | 17 TABLE 3. such as calculated fields or data from external sources. The duration value is indicated in negative value for Patient number 103. 101 102 103 104 105 Name U V W X Y Address UU1 VV1 WW1 XX1 YY1 Sex Male Male Male Female Male Age 49 45 61 49 43 Hype_dur — 10. or provide additional derived attributes for a better understanding of indirect relationships. The value must be positive so the incorrect value to be replaced with NULL. For example. This type of pollution is particularly damaging. TABLE 3. In our example we replace the unknown data to NULL. 101 102 103 104 105 Name U V W X Y Address UU1 VV1 WW1 XX1 YY1 Sex Male Male Male Female Male Age 49 45 61 49 43 Hype_dur NULL 10. we will suppose that we have purchased extra information about our patients consisting of age. this step is best handled in a temporary storage area. heart and stroke related disease. Most references on data mining tend to combine this step with data transformation.3 Data Enrichment Data enrichment is the process of adding new attributes. You can add information to such data from standardized external industry sources to make the data mining process more successful and reliable. but in greatly influence to type of patterns when we apply data mining to this table.5 — Second type of pollution that frequently occurs in lack of domain consistency and disambiguation. Data enrichment. As with data cleaning and data transformation. can require a number of updates to both data and meta data. Data enrichment is an important step if you are attempting to mine marginally acceptable data. because it is hard to trace.3. and such updates are generally not acceptable in an established data warehouse.3 De-duplication P.

Data transformation is separate from data cleansing and data enrichment for data mining purposes because it does not correct existing attribute data or add new attributes.5 Coding That data in our example can undergo a number of transformations.3.5 Enrichment Name U V W X Y Address UU1 VV1 WW1 XX1 YY1 Sex Male Male Male Female Male Age 49 45 61 49 43 Kidney No No No No No Heart Yes Yes Yes Yes No Stroke No No No No No 3. it is not particularly important how the information was gathered and be easily be joined to the existing records. In our example we convert disease yes-no into 1 and 0. In the next stage. 3. but instead grooms existing attributes for data mining purposes. especially fraud detection.5 NULL NULL 13 14 15 NULL 0 0 0 0 0 . this is a situation that occurs frequently in practice. however.4 Data Transformation Data transformation. TABLE 3. a lot of desirable data is missing and most is impossible to retrieve. TABLE 3.5 NULL 10. First the extra information that was purchased to enrich the database is added to the records describing the individuals. we select only those records that have enough information to be of value. in terms of data mining. Although it is difficult to give detailed rules for this kind of operation. after a thorough analysis of the possible consequences.18 | Data Mining and Warehousing seen. since it is quite possible to buy demographic data on average age for a certain neighborhood and hypertension and diabetes patients can be traced fairly easily. is the process of changing the form or structure of existing attributes. For this.6 Coded Database HyperTension 1 0 0 1 1 Occupation HypeDur Diabetes 0 1 1 0 0 Address Kidney Diadur Stroke 0 0 0 0 0 Name Heart 1 1 1 1 0 Age 49 45 61 49 43 Sex Male Male Male Female Male U V W X Y Labor Farmer Farmer Farmer Farmer UU1 VV1 WW1 XX1 YY1 NULL 10. lack of information can be a valuable indication of interesting patterns. A general rule states that any detection of data must be a conscious decision.3. In some cases. In most tables that are collected from operational data.

Interesting possibilities are offered by object oriented three dimensional tool kits. and may be used at the beginning of a data mining process to get a rough feeling of the quality of the data set and where patterns are to be found.Knowledge Discovery Process | 19 TABLE 3. We shall see that some learning algorithms do well on one part of the set where others fail. From this point of view.7 Final Table after Removing Rows HyperTension 0 1 Occupation HypeDur (Years) Diabetes 1 0 Address Kidney Diadur (Years) V X Farmer Farmer VV1 XX1 Male Female 45 49 10. Here we shall discuss some of the most important machine-learning and pattern recognition algorithms. which enable to user to explore three dimensional structures interactively.3. data mining is really an ‘anything goes’ affair. Any technique that helps extract more out of our data in useful. and this clearly indicates the need for hybrid learning.6 Data Mining The discovery stage of the KDD process is fascinating.5 10.3. We shall also show that there is a relationship is detected during the data mining stage. those that are of interest in present context are: ² Query tools ² Statistical techniques ² Visualization ² Online analytical processing ² Case based learning (k-nearest neighbor) ² Decision trees ² Association rules ² Neural networks ² Genetic algorithms 3. Advanced graphical techniques in virtual reality enable people to wander through artificial data spaces. and in this way get an idea of the opportunities that are available as well as some of the problems that occur during the discovery stage. such as Inventor. An elementary technique that can be of great value is the so called scatter diagram. These simple methods can provide us with a wealth of information. white historic development of data sets can be displayed as a kind of animated movie. Data mining is not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface. so data mining techniques from quite a heterogeneous group.7 Visualization/Interpretation/Evaluation Visualization techniques are a very useful method of discovering patterns in datasets. Scatter diagrams can be used to identify Stroke 0 0 Name Heart Age Sex . Although various different techniques are used for different purposes.5 13 15 0 0 1 1 3.

4 DATA MINING OPERATIONS Four operations are associated with discovery-driven data mining. data warehouses. the goal of Association is to establish relations between the records in a data base. Database. 3.1 Creation of Prediction and Classification Models This is the most commonly used operation primarily because of the proliferation of automatic model development techniques. 3. The goal of this operation is to use the contents of the database.20 | Data Mining and Warehousing interesting subsets of the data sets so that we can focus on the rest of the data mining process. rules. It is usually the source of true discovery since outliers express deviation from some previously known expectation and norm.4. 3. . spreadsheets or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.5 ARCHITECTURE OF DATA MINING Based on databases. i.2 Association Whereas the goal of the modeling operation is to create a generalized description that characterizes the contents of a database... 3. or before performing a data mining operation such as model creation.. The value added by data mining techniques in this operation is in their ability to generate models that are comprehensible.e. 3. since many data mining modeling techniques express models as sets of if. to automatically generate a model that can predict a future behavior.4. then. 3. and explain whether they are due to noise or other impurities being present in the data. its goal is to identify outlying points in a particular data set.4. which reflect historical data. data warehouse or other information repository This is one or a set of databases. data about the past. It is usually applied in conjunction with database segmentation. or due to causal reasons. Model creation has been traditionally pursued using statistical techniques.. data warehouse of a typical data mining system may have the following components.4. There is a whole field of research dedicated to the search for interesting projections for data setsthat is called projection pursuit.. In particular.3 Database Segmentation As databases grow and are populated with diverse types of data it is often necessary to partition them into collections of related records either as a means of obtaining a summary of each database. and explainable.4 Deviation Detection This operation is the exact opposite of database segmentation.

. Graphical Interface Pattern Evaluation Data Mining Engine Knowledge Base Database or Data Warehouse Server Data Cleaning Data Integration Database Filtering Data Warehouse Fig. cluster analysis and evolution and deviation analysis. Such knowledge can include concept hierarchies. based on the user’s data mining request. may also be included. Data mining engine This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization.2 Architecture of data mining Knowledge base This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. used to organize attributes or attribute values into different levels of abstraction. 3. which can be used to assess a pattern’s interestingness based on its unexpectedness. classification. association. Knowledge such as user beliefs.Knowledge Discovery Process | 21 Database or data warehouse server The database or data warehouse server is responsible for fetching the relevant data.

this component allows the user to browse database and data warehouse schemes or data structures. based on the intermediate data mining results. . (ii) Data cleaning. providing information to help focus the search and performing exploratory data mining. Explain four operations that are associated with discovery-driven data mining. Write a short note on evaluation of data mining. 3. 5. For efficient data mining. it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns. With a neat sketch. allowing the user interact with the system by specifying a data mining query or task. (iii) Data enrichment. 2. explain the architecture of a data mining system. Explain the various steps involved in Knowledge Discovery Process.22 | Data Mining and Warehousing Pattern evaluation module This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. Write short notes on (i) Data selection. Graphical user interface This module communicates between user and the data mining system. evaluate mined patterns and visualize the patterns in different forms. 4. EXERCISE 1. In addition.

CHAPTER 4 DATA MINING TECHNIQUES 4. 4.1 INTRODUCTION There are several different methods used to perform data mining tasks. In this chapter. 4. we briefly examined some of the data mining techniques. Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest. but also imply certain types of algorithmic approaches. Examples of classification methods used as part of knowledge discovery applications include classifying trends in financial markets and automated identification of objects of interest in large image databases.3 NEURAL NETWORKS Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new NEURONS INPUT LAYER 1 LAYER 2 OUTPUT Fig.1 Neural networks . These techniques not only require specific types of data structures. Description focuses on finding human interpretable patterns describing the data. 4.2 CLASSIFICATION Classification is learning a function that maps a data item into one of several predefined classes.

4. 3. find a “fit” to) the sample data on which the “training” is performed.1 Constructing Decision Trees Decision tree programs construct a decision tree T from a set of training cases. Neural Networks techniques can also be used as a component of analyses designed to build explanatory models because Neural Networks can help explore data sets in search for relevant variables or groups of variables. neurons apply an iterative process to the number of inputs (variables) to adjust the weights of the network in order to optimally predict (in traditional terms one could say. Disadvantages The final solution depends on the initial conditions of the network. which variables matter. or even to some extent. theoretically. indicating a class of instances. 4. If all examples in T are positive. analytic terms. such as those used to build theories that explain phenomena. 2. they are capable of approximating any continuous function.24 | Data Mining and Warehousing observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data. The algorithm consists of five steps. It is virtually impossible to “interpret” the solution in traditional. the neural network is ready and it can then be used to generate predictions. with one branch and sub-tree for each possible outcome of the test. Decision trees represent rules. Decision tree is a classifier in the form of a tree structure where each node is either: ² a leaf node. the results of such explorations can then facilitate the process of model building. 4. The original idea of construction of decision trees goes back to the work of Hoveland and Hunt on Concept Learning Systems (CLS) in the late 1950s. The neural network is then subjected to the process of “training. create a ‘P’ node with T as its parent and stop. and thus the researcher does not need to have any hypotheses about the underlying model. which provides the classification of the instance. 1.” In that phase. . create a ‘N’ node with T as its parent and stop. After the phase of learning from an existing data set. Create a T node. Advantages Neural Networks is that.4 DECISION TREES Decision trees are powerful and popular tools for classification and prediction. A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node. If all examples in T are negative. T ← the whole training set. or ² a decision node that specifies some test to be carried out on a single attribute value.

2 ID3 Decision Tree Algorithm J.. vN and partition T into subsets T1. “Logical” classification model: Classifier that can be only expressed as decision trees or set of production rules.5 B > 4. Example for decision tree . v2.Data Mining Techniques | 25 4. 1.1 . Select an attribute TN according their and X = vi as the 5. no. vol. For each Ti do: T X with values v1. 4..4. Machine Learning. Ross Quinlan originally developed ID3 at the University of Sydney. Sufficient data: Usually hundreds or even thousands of training cases..2 An example of a simple decision tree Decision tree induction is a typical inductive approach to learn knowledge on classification.. …. values on X. N) with T as their parent label of the branch from T to Ti. T2. ← Ti and go to step 2. …. Discrete classes: A case does or does not belong to a particular class. He first presented ID3 in 1975 in a book.decision node A A = red B B > 8.1 B < 4. The key requirements to do mining with decision trees are: Predefined classes: The categories to which cases are to be assigned must have been established beforehand (supervised data). 1.leaf node Fig. and there must be for more cases than classes.5 K=x K=y K=x C = true K=x C = false K=y C A = blue B B < 8. . 4. ID3 is based on the Concept Learning System (CLS) algorithm. Create N nodes Ti (i = 1.

For example. For example.970 Notice that the entropy is 0 if all members of S belong to the same class. and Entropy (S) = –1′ log2(1) – 0′ log20 = –1′ 0 – 0′ log20 = 0. 4.5.1 Basic Ideas Behind ID3 In the decision tree each node corresponds to a non-goal attribute and each arc to a possible value of that attribute. called entropy. To illustrate. A good quantitative measure of the worth of an attribute is a statistical property called information gain that measures how well a given attribute separates the training examples according to their target classification.2. the receiver knows the drawn example will be positive. as p+ varies between 0 and 1. binary classification is defined as: Entropy (S) = – pp log2 pp – pn log2 pn where pp is the proportion of positive examples in S and pn is the proportion of negative examples in S. that characterizes the (im)purity of an arbitrary collection of examples. and the entropy is 0. Entropy is used to measure how informative is a node. containing only positive and negative examples of some target concept (a 2 class problem). suppose S is a collection of 25 examples. Then the entropy of S relative to this classification is Entropy (S) = – (15/25) log2 (15/25) – (10/25) log2 (10/25) = 0. Given a set S.26 | Data Mining and Warehousing 4. The goal is to select the attribute that is most useful for classifying examples. . if pp is 1. one bit is required to indicate whether the drawn example is positive or negative. If the collection contains unequal numbers of positive and negative examples.3 shows the form of the entropy function relative to a binary classification. if all members are positive (pp= 1). Entropy—a measure of homogeneity of the set of examples In order to define information gain precisely. if pp is 0.] In the decision tree at each node should be associated the non-goal attribute which is most informative among the attributes not yet considered in the path from the root. One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (i. A leaf of the tree specifies the expected value of the goal attribute for the records described by the path from the root to that leaf. a member of S drawn at random with uniform probability). Note the entropy is 1 (at its maximum!) when the collection contains an equal number of positive and negative examples. we need to define a measure commonly used in information theory. Which attribute is the best classifier? The estimation criterion in the decision tree algorithm is the selection of an attribute to test at each decision node in the tree. 10–].e. [This defines what a decision tree is..4. In all calculations involving entropy we define 0log0 to be 0. the entropy is between 0 and 1. This measure is used to select among the candidate attributes at each step while growing the tree. then pn is 0. so no message need be sent. Fig. including 15 positive and 10 negative examples [15+. On the other hand. the entropy of set S relative to this simple.

8.Data Mining Techniques | 27 1 0. More precisely.8 Entropy 0. relative to a collection of examples S. weighted by the fraction of examples |Sv|/|S| .4 0. as the proportion of positive examples pp varies between 0 and 1 0. Note the first term in the equation for Gain is just the entropy of the original collection S and the second term is the expected value of the entropy after S is partitioned using attribute A. Thus far we have discussed entropy in the special case where the target classification is binary.4 Pp Fig. then a collection of messages can be encoded using on average less than 1 bit per message by assigning shorter codes to collections of positive examples and longer codes to less likely negative examples. Note also that if the target attribute can take on c possible values. is defined as Gain ( S. If the target attribute takes on c different values. A) of an attribute A.6 0.8 1 If pp is 0. Note the logarithm is still base 2 because entropy is a measure of the expected encoding length measured in bits. called information gain. is simply the expected reduction in entropy caused by partitioning the examples according to this attribute..2 0 0 0. The measure we will use. A ) = Entropy ( S ) − v ∈ Values ( A) ∑ Sv S Entropy ( Sv ) where Values(A) is the set of all possible values for attribute A. Information gain measures the expected reduction in entropy Given entropy as a measure of the impurity in a collection of training examples. the maximum possible entropy is log2c. Sv = {s ∈ S | A(s) = v}). The expected entropy described by this second term is simply the sum of the entropies of each subset Sv.2 0. 4.e. then the entropy of S relative to this cwise classification is defined as Entropy ( S ) = ∑ − pi log 2 pi i =1 c where pi is the proportion of S belonging to class i. Gain (S. the information gain.6 0.3 The entropy function relative to a binary classification. and Sv is the subset of S for which attribute A has value v (i. we can now define a measure of the effectiveness of an attribute in classifying the training data.

ID3(R-{A}.A) is the number of bits saved when encoding the target value of an arbitrary member of S. C. S1). examples that will be improperly classified]. am going respectively to the trees (ID3(R-{A}... Attributes that have been incorporated higher in the tree are excluded. ... . m} until they are empty end. given the value of some other attribute A. The process of selecting a new attribute and partitioning the training examples is now repeated for each non-terminal descendant node.. S: a training set) returns a decision tree. C.A) is the information provided about the target attribute value. or 2..2.. Every attribute has already been included along this path through the tree. by knowing the value of attribute A. If S consists of records all with the same value for the target attribute... this time using only the training examples associated with that node.2. a2. return a single leaf node with that value. . This process continues for each new leaf node until either of two conditions is met: 1. then return a single node with the value of the most frequent of the values of the target attribute that are found in records of S. C: the target attribute. Sm). . Let A be the attribute with largest Gain(A. begin If S is empty. the training examples associated with this leaf node all have the same target attribute value (i.A) is therefore the expected reduction in entropy caused by knowing the value of attribute A. [in that case there may be errors. Algorithm function ID3 Input: (R: a set of non-target attributes. C. Let { A j | j=1. The value of Gain(S.e.... Recursively apply ID3 to subsets { S j | j=1. Return a tree with root labeled A and arcs labeled a1.ID3(R-{A}. If R is empty. m} be the subsets of S consisting respectively of records with value aj for A. Let { S j | j=1. so that any given attribute can appear at most once along any path through the tree. their entropy is zero).. . Gain (S.S) among attributes in R. return a single node with value failure. S2). .2. Gain (S.28 | Data Mining and Warehousing that belong to Sv.. m} be the values of attribute A. Put another way.

temperature. overcast. cool} humidity = {high. rain} temperature = {hot. Entropy(S) = –(9/14) Log2 (9/14) – (5/14) Log2 (5/14) = 0. and wind speed. The weather attributes are outlook. A) = Entropy(S) – Σ ((|Sv|/|S|) * Entropy(Sv)) Where: Σ = value v of all possible values of attribute A S v = subset of S for which attribute A has value v |S v| = number of elements in Sv |S| = number of elements in S. Gain(S. The range of entropy is 0 (“perfectly classified”) to 1 (“totally random”). mild. normal} wind = {weak. They can have the following values: outlook = {sunny. strong} Examples of set S are: TABLE 4.2 Weather Report Day D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 Outlook Sunny Sunny Overcast Rain Rain Rain Overcast Sunny Sunny Rain Sunny Overcast Overcast Rain Temperature Hot Hot Hot Mild Cool Cool Cool Mild Cool Mild Mild Mild Hot Mild Humidity High High High High Normal Normal Normal High Normal Normal Normal High Normal High Wind Weak Strong Weak Weak Weak Strong Strong Weak Weak Weak Strong Strong Weak Strong Play ball No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No If S is a collection of 14 examples with 9 YES and 5 NO examples then.940 Notice entropy is 0 if all members of S belong to the same class (the data is perfectly classified). data is collected to help ID3 build a decision tree (see Table 4. A) is information gain of example set S on attribute A is defined as Gain(S.2). The target classification is “should we play cricket” which can be yes or no.Data Mining Techniques | 29 Example of ID3 Suppose we want ID3 to decide whether the weather is amenable to playing cricket. Over the course of 2 weeks. humidity. .

therefore it is used as the decision attribute in the root node. Since Outlook has three possible values. the gain is calculated and the highest gain is used in the decision node.30 | Data Mining and Warehousing Suppose S is a set of 14 examples in which one of the attributes is wind speed.048 Outlook attribute has the highest gain. overcast. 3 are YES and 3 are NO. Wind) = 0.4 A decision tree for the concept Playball . D8. 6 of the examples are YES and 2 are NO. D11} = 5 examples from Table 1 with outlook = sunny Gain(Ssunny. Gain(S. For Wind = Weak.811 Entropy(Sstrong) = – (3/6)*log2(3/6) – (3/6)*log2(3/6) = 1. suppose there are 8 occurrences of Wind = Weak and 6 occurrences of Wind = Strong. we only decide on the remaining three attributes: Humidity. Therefore. The values of Wind can be Weak or Strong. Temperature) = 0. The next question is “what attribute should be tested at the Sunny branch node?” Since we have used Outlook at the root. For Wind = Strong. The gain is calculated for all four attributes: Gain(S. The classification of these 14 examples are 9 YES and 5 NO. We need to find which attribute will be the root node in our decision tree.151 Gain(S. Temperature.246 Gain(S. 4. D9. Humidity) = 0.048 Entropy(Sweak) = – (6/8)*log2(6/8) – (2/8)*log2(2/8) = 0. or Wind.00 For each attribute.Wind) = Entropy(S)–(8/14)*Entropy(Sweak)–(6/14)*Entropy(Sstrong) = 0.811 – (6/14)*1. D2.029 Gain(S. For attribute Wind. Outlook) = 0. Humidity) = 0.970 Outlook Sunny Humidity Overcast Yes Rain Wind High Normal Strong Weak No Yes No Yes Fig. rain).940 – (8/14)*0.00 = 0. the root node has three branches (sunny. Ssunny = {D1.

Some specific applications include medical diagnosis. classification of soybean diseases. hot). et al. which test the value of a mathematical or logical expression of the attributes. hot.g. and web search classification.g. The easiest situation for decision tree learning occurs when each attribute takes on a small number of disjoint possible values (e. 4.g. Agrawal.4 A Detailed Survey on Decision Tree Technique A decision tree contains zero or more internal nodes and one or more leaf nodes..Data Mining Techniques | 31 Gain(Ssunny. credit risk assessment of loan applications. temperature) and their values (e. et al. ID3 has been incorporated in a number of commercial rule-induction packages. Instances are described by a fixed set of attributes (e..570 Gain(Ssunny. 4.4. All internal nodes have two or more child nodes.6 Decision Tree Algorithms Name of the Algorithm Developer Year CHAID CART ID3 C4. The task of constructing a tree from the training set is called tree induction.019 Humidity has the highest gain. equipment malfunctions by their cause. 1980 1984 1986 1993 1996 1996 . Quinlan Quinlan Agrawal. Each leaf node has a class label associated with it. Wind) = 0. cold).4. 4. The final decision = tree The decision tree can also be expressed in rule format: IF outlook = sunny AND humidity = high THEN cricket = no IF outlook = rain AND humidity = high THEN cricket = no IF outlook = rain AND wind = strong THEN cricket = yes IF outlook = overcast THEN cricket = yes IF outlook = rain AND wind = weak THEN cricket = yes. All non-terminal nodes contain splits.. it is used as the decision node.5 Appropriate Problems for Decision Tree Learning Instances are represented by attribute-value pairs.4. This process goes on until all data is classified perfectly or we run out of attributes.5 SLIQ SPRINT Kass Breimarn. therefore. Temperature) = 0. mild. et al.

or interest rate. . 2. Trouble with Non-Rectangular Regions. 4. The weaknesses of decision tree methods Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous variable such as income.1 Crossover Using simple single point crossover reproduction mechanism. 1 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 Fig. 1. 4.7 Strengths and Weaknesses of Decision Tree Methods strengths of decision tree methods Decision trees are able to generate understandable rules. Computationally Expensive to Train. Two programs called chromosomes combine to produce a third program called child. 4.4. blood pressure. (a) Single Point Crossover: In this type. Decision trees are also problematic for time-series data unless a lot of effort is put into presenting the data in such a way that trends and sequential patterns are made visible.5 GENETIC ALGORITHM Genetic Algorithms/Evolutionary algorithms are based on Darwin’s theory of survival of the fittest. one crossover point is selected and the string from the beginning of chromosome to the crossover point is copied from one parent and the rest is copied from the second parent resulting in a child. Decision trees provide a clear indication of which fields are most important for prediction or classification.5. The 1.5 Single point crossover Here we observe that the regions after the crossover point are interchanged in the children. 3. Decision trees perform classification without requiring much computation. a point along the chromosome length is randomly chosen at which the crossover is done as illustrated. 4. For instance. 2. Decision trees are able to handle both continuous and categorical variables. Here the best program or logic survives from a pool of solutions. 3.32 | Data Mining and Warehousing 4. Error-Prone with Too Many Classes. consider the following chromosomes and crossover point at position 4. the reproduction process goes through Crossover and Mutation operations.

5. 4. Mutation produces incremental random changes in the offspring generated through crossover.Data Mining Techniques | 33 (b) Two Point Crossover: In this type. This type of crossover is mainly employed in permutation encoding and value encoding where a single point crossover would result in inconsistencies in the child chromosomes. mutation is equivalent to a random search consisting of incremental random modification of the existing solution and acceptance if there is improvement. One crossover point is chosen at random and parents are divided in that point and parts below crossover point are exchanged to produce new offspring. When used by itself without any crossover. when used in the GA.7 Tree crossover 4. its behavior changes radically. 4. (c) Tree Crossover: The tree crossover method is most suitable when tree encoding is employed. consider the following chromosomes and crossover points at positions 2 and 5. For instance. In the GA. mutation may be done by flipping a bit. mutation serves the crucial role for replacing the gene values lost from the population during the selection process so that they can . In a binary-coded Genetic Algorithm. two crossover points are selected and the string from beginning of one chromosome to the first crossover point is copied from one parent.2 Mutation As new individuals are generated. mutation involves randomly generating a new character in a specified position. However. Parent A Parent B Offspring x + + x = x y 3 y 2 y 2 Fig.6 Two point crossover Here we observe that the crossover regions between the crossover points are interchanged in the children. 3 2 1 5 6 7 4 2 7 6 5 1 4 3 3 2 6 5 1 7 4 2 7 1 5 6 4 3 Fig. the part from the first to the second crossover point is copied from the second parent and the rest is copied from the first parent. while in a non-binary-coded GA. each character is mutated with a given probability.

consider mutation at allele 4.11 5.1 What is Clustering? Clustering can be considered the most important unsupervised learning problem. Hence. The various types are as follows: (a) Bit Inversion: This mutation type is employed for a binary encoded problem.68 2. Here. or of providing the gene values that were not present in the initial population.. The type of mutation used as in crossover is dependent on the type of encoding employed.6.6 CLUSTERING 4.73 4. so.22 5. For instance.29 5. * x + x + y 3 y 3 Fig. For instance.55) => (1. 4. .8 Bit inversion (b) Order Changing: This type of mutation is specifically used in permutation-encoded problems.55) (d) Operator Manipulation: This method involves changing the operators randomly in an operator tree and hence is used with tree-encoded problems.68 2. Here. a bit is randomly selected and inverted i.29 5. two random points in the chromosome are chosen and interchanged. as every other problem of this kind.10 Operator manipulation Here we observe that the divide operator in the parent is randomly changed to the multiplication operator. 4. deals with finding a structure in a collection of unlabeled data. For instance.86 4.34 | Data Mining and Warehousing be tried in a new context. 1 0 0 0 0 1 1 0 0 1 0 1 Fig. (1.e. 4. a bit is changed from 0 to 1 and vice-versa. 1 2 3 4 5 6 1 6 3 4 5 2 Fig. 4. this method is specifically useful for real value encoded problems.9 Order changing (c) Value Manipulation: Value manipulation refers to selecting random point or points in the chromosome and adding or subtracting a small number from it.

6. 4.. 4. the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). + xip − x jp . not according to simple similarity measures. we easily identify the 4 clusters into which the data can be divided.11 Clustering process-example In the example. the following common distance functions can be defined: Euclidean Distance Function d ( i..2 Distance Functions Given two p-dimensional data objects i = (xi1.. j ) = xi 1 − x j 1 + xi 2 − x j 2 + . Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if this one defines a concept common to all that objects. objects are grouped according to their fit to descriptive concepts.... + xip − x jp 2 2 2 Manhattan Distance Function d ( i.. xj2. This is called distance-based clustering. xip ) and j = (xj1. . Graphical Example Fig. A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters. xi2.. j ) = xi 1 − x j 1 + xi 2 − x j 2 + .Data Mining Techniques | 35 Definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. xjp ). In other words.. ..

In nonhierarchical clustering. For instance. If some of an object’s attributes are measured along different scales. the relationship between clusters is undetermined. . ² Insurance: identifying groups of motor insurance policy holders with a high average claim cost.6. so that clusters can be formed from objects with a high similarity to each other. we could be interested in finding representatives for homogeneous groups (data reduction).3 Goals of Clustering The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. so when using the Euclidean distance function.5 Types of Clustering Algorithms There are two types of clustering algorithms: 1. the attribute values are often normalized to lie between 0 and 1.4 Applications Clustering algorithms can be applied in many fields. 4. it is not necessary to calculate the square root because distances are always positive numbers and as such. ² City-planning: identifying groups of houses according to their house type. ² Biology: classification of plants and animals given their features. ² WWW: document classification.6. value and geographical location. such as the k-means algorithm. But how to decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. in such a way that the result of the clustering will suit their needs. To prevent this problem. Nonhierarchical and 2. 4.6. Hierarchical clustering repeatedly links pairs of clusters until every data object is included in the hierarchy.36 | Data Mining and Warehousing When using the Euclidean distance function to compare distances. in finding useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier detection). identifying frauds. attributes with larger scales of measurement may overwhelm attributes measured on a smaller scale. in finding “natural clusters” and describe their unknown properties (“natural” data types). 4. Hierarchical. ² Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones. d1 and d2. for instance: ² Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records. With both of these approaches. √d1 > √d2 ⇔ d1 > d2. an important issue is how to determine the similarity between two objects. clustering weblog data to discover groups of similar access patterns. for two distances. ² Libraries: book ordering. Consequently. it is the user which must supply this criterion.

or clusters. k-means Algorithm: 1.1 Nonhierarchical Algorithms K-means Algorithm The k-means algorithm is one of a group of algorithms called partitioning methods. typically the squared-error criterion.5 Pi =  5 5  6  6 5. 2.13 Sample data—k-means clustering process . The algorithm is shown in Fig.5  0  1   1   0  0. The value of k may or may not be specified and a clustering criterion. 4. 4. Do loop (a) Partition by assigning or reassigning all data objects to their closest cluster center. must be adopted. Recall that a partition divides a set into disjoint parts that together include all members of the set. These means are used as the new cluster points and each object is reassigned to the cluster that it is most similar to.5  5  6   6   5  5. This continues until there is no longer a change when the clusters are recalculated. Select a clustering criterion. such that the objects in a cluster are more similar to each other than to objects in different clusters. The k-means algorithm initializes k clusters by arbitrarily selecting one object to represent each cluster. The solution to this problem is straightforward. Initialize cluster centers with those k clusters.6. 3. The problem of partitioned clustering can be formally stated as follows: Given n objects in a d-dimensional metric space. Each of the remaining objects is assigned to a cluster and the clustering criterion is used to calculate the cluster mean. then for each data object select the cluster that optimizes the criterion. 4.Data Mining Techniques | 37 4. Select k clusters arbitrarily.12.12 k-means algorithm Example 0 0  1  1 0.5   * * * * * * * * * * Fig.5. (b) Compute new cluster centers as mean value of the objects in each cluster until no change in cluster center calculation. determine a partition of the objects into k groups. Fig.

(5.6).5. Step 3b″: Compute new cluster centers.(1.(0.0).(1.5)/7 = 4.(6.5) New center for C2 = (5.4.(5.1).5.5)}. assigning the object to the closest cluster. Step 2: Since there is only one point in each cluster. so the loop is repeated. the clusters contain the following objects: C1 = {(0.(0. Step 3b: Compute new cluster center for each cluster. Result is same as in 3b´.0).5.5.(5. so algorithm is finished.2) (0+1+5+5+6+6+5.0).5)/3 = 0.1)}.(6. so this object is assigned to C2.5.(6.16) (0+1+0.6).0.1 (1+1+5+5+6+6+5. New center for C1 = (0.6). so we arbitrarily assign it to C1.4.(1.(5.5).2).5). resulting in: C1 = {(0.3) = |1–0| + |1–0| = 2 and dis(2.5 (0+0+0.0.0.1.5. that point is the cluster center.16 New center for C2 = (4.5) differ from old centers C1 = (0.5)/3 = 0.5)}. for the third object: dis(1.0.13 as input and we choose k = 2 and the Manhattan distance function dis = |x2–x1|+|y2–y1|. An initial partition can be formed by first specifying a set of k seed points. .5. Step 3a″: New centers C1 = (0.5)/7 = 4.0.0)} and C2 = {(0.5)} C2 = {(5.16) and C2 = (4.38 | Data Mining and Warehousing Suppose we are given the data in Fig. Reassign the ten objects to the closest cluster center.(5.16) and C2 = (4.0)} and C2={(0.3) = |1–0| + |1–1| = 1.0.5.2.1.4.1)}.(0. Step 3a′: New centers C1 = (0.5.1.5.5. After calculating the distance for all points.(6.5).0).(1.5. The fifth object is equidistant from both clusters. Algorithm is done: Centers are the same as in Step 3b´.1).0. 4. so the loop is repeated. The detailed computation is as follows: Step 1: Initialize k partitions.1). We choose the first two objects as seed points and initialize the clusters as C1 = {(0.5.5) and C2 = (5.5).2) differ from old centers C1={(0.5)} and C2 = {(0.5.1).5).5. The seed points could be the first k objects or k objects chosen randomly from the input objects.6). Step 3a: Calculate the distance between each object and each cluster center. Step 3b′: Compute new cluster center for each cluster New center for C1 = (0. For example.

. For n objects. Hierarchical algorithms are rigid in that once a merge has been done. Each object in X is initially used to create a cluster containing a single object. f l c e h a k b i d m g j Fig.14 Sample dendogram without clustering process In the context of hierarchical clustering. The distance function in this algorithm can determine similarity of clusters through many methods. the hierarchy graph is called a dendogram. the original clusters are removed from C. including single link and group-average. which are added to the set of clusters. C. Unlike with the k-means algorithm.5. As such. After the hierarchy is built. Fig. These clusters are successively merged into new clusters. the user can specify the number of clusters required. it cannot be undone.15 shows a simple hierarchical algorithm. 4.14 shows a sample dendogram that could be produced from a hierarchical clustering algorithm. 4. 4. from 1 to n.6. When a pair of clusters is merged. Hierarchical algorithms can be either agglomerative or divisive. . Group-average first finds the average values for all objects in the group (i. that is top-down or bottomup. Here we describe a simple agglomerative clustering algorithm. Single link calculates the distance between two clusters as the shortest distance between any two objects contained in those clusters. containing all the objects from X. The hierarchy of clusters is implicitly represented in the nested sets of C.e. cluster) and calculates the distance between clusters as the distance between the average values. the number of clusters (k) is not specified in hierarchical clustering. merge points need to be chosen carefully. Although there are smaller computational costs with this. Thus. we simply need to traverse down the hierarchy. To examine more clusters. These groups are successively combined based on similarity until there is only one group remaining or a specified termination condition is satisfied. This hierarchical tree shows levels of clustering with each level having a larger number of smaller clusters. All agglomerative hierarchical clustering algorithms begin with each object as a separate group. or k = 1. The top level of the hierarchy represents one cluster. it can also cause problems if an erroneous merge is done. n-1 mergings are done.Data Mining Techniques | 39 4. the number of clusters in C decreases until there is only one cluster remaining. Fig.2 Hierarchical Clustering Hierarchical clustering creates hierarchy of clusters on the data set.

The set X contains n = 10 elements..cb } 3. each element xi of X is placed in a cluster ci ..0).size > 1 do (a) (cmin1. while C.{x5}.{x10 }} Step 2: Set l = 11. 4.{x6}.16 represented in matrix and graph form. C = {c1. C = {{x1}.{x9}. where x1 = (0. where ci is a member of the set of clusters C.40 | Data Mining and Warehousing Given: A set X of objects {x1..xn} A distance function dis(c1. {x 8}. l = n+1 4. shown in Fig. for i = 1 to n ci = {xi } end for 2. x1 to x10 . We will use the Manhattan distance function and the single link method for calculating distance between clusters.{x7}.. 0 1  3  2 6 X= 6 5  3  0 2  0 1  1  4 3  6 2  5  2 1  * * * * * * * * * * Fig.cmin2} to C (d) l = l + 1 end while Fig..cj in C (b) remove cmin1 and cmin2 from C (c) add {cmin1.{x3 }. .15 Agglomerative hierarchical algorithm Example: Suppose the input to the simple agglomerative algorithm described above is the set X.cmin2) = minimum dis(ci . 4. 4..{x2 }.c2) 1..16 Sample data Step 1: Initially. {x4}..cj) for all ci .

{x9}. {x 3}}. {x4}. This occurs between.c16) . {{x4}. {x3}}} ² Set l = 13.{x9}.Data Mining Techniques | 41 Step 3: (First iteration of while loop) C.{x9}.{x9}.{x 3}.{{{{x2}. ² Add c11 to C. {x10}}.{x10}} ² Remove c2 and c10 from C.{x1}}. Step 3: (Second iteration) C.{x7}}}} Step 3: (Eighth iteration) C.{x6}.{x1}}. Arbitrarily we choose the first.{x7}.size = 9 ² The minimum single link distance between two clusters is 1. {x 10}}.{x7}. (cmin1.{x 1}}.{x6}.cmin2) = (c1.size = 8 ² (cmin1. {x 8}. {x3}}. between c3 and c11 because the distance between x3 and x10 is 1.{{{{x2}. between c2 and c10 and between c3 and c10.{x9}}} Step 3: (Seventh iteration) C.{{x6}. c11 = c2 U c10 = {{x2}. {{x 5}.{x 8}}} Step 3: (Fifth iteration) C. (cmin1.{{{{x2}. {x10}}.{{{x2}.{{{{{x2}.size = 6 ² (cmin1.c15) ² C = {{{x4}. {x9}.{{x4}.size = 5 ² (cmin1.c13) ² C = {{x6}.size = 7 ² (cmin1. Step 3: (Third iteration) C.{x 8}}. {x 3}}.cmin2) = (c 2.c12) ² C = {{x4}. {x10}}. {{{{{x 2}.{x 5}.c11) ² c12 = c3 U c11 = {{{x2}.{x8}}.{x7}.cmin2) = (c9. {x 10}}} ² Set l = l + 1 = 12.{x6}. {x8}. {{x5}.size = 10 ² The minimum single link distance between two clusters is 1.{x7}. {x3}}. C = {{x1}.c min2) = (c14. {x10}}.cmin2) = (c6. This occurs in two places.{x1}}.{x 7}}. {x10}}.{x5}.c min2) = (c4. {{x5}. {x4}. ² Add c12 to C.{{x4}.size = 3 ² (cmin1.{x8}}.c10) ² Since l = 10.c min2) = (c5.size = 4 ² (cmin1.c7) ² C = {{x6}.{x1}}} Step 3: (Fourth iteration) C. {x8}. C = {{x1}.{x10}}.{x9}}.cmin2) = (c3.{x7}}} Step 3: (Sixth iteration) C.{x6}. Depending on how our minimum function works we can choose either pair of clusters.{x3}} ² Remove c3 and c11 from C.c8) ² C = {{x 5}. {x3}}.{x5}.{{x2}. where x10 is in c11.

{{{{{x2}.18 Sample dendogram after clustering . {{{{{x2}. Algorithm done. 4. * * * * * * * * * * Fig.c min2) = (c17. {{x5}.c18) ² C = {{{{x4}.42 | Data Mining and Warehousing ² C = { {{x6}. 4.{x9}}}.size = 1.{x7}}}. {{{x4}.{x7}}}} Step 3: (Tenth iteration) C. The corresponding dendogram formed from the hierarchy in C is shown in Fig.17 Graph with clusters X6 X5 X7 X1 X2 X10 X3 X9 X4 X8 Fig. The cluster created from this algorithm can be seen in Fig. {{x6}. 4.{x8}}.size = 2 ² (cmin1.18. {{x5}. 4. {x3}}. {x3}}.{x8}}.{x 9}}}} Step 3: (Ninth iteration) C. {x 10}}.{x1}}. The points which appeared most closely together on the graph of input data in Fig. {x10}}.16 are grouped together more closely in the hierarchy. 4.{x1}}.17.

² Time: Usually O(N**2) though can range from O(N log N) to O(N**5).5. ² If a document remains that is not in the MST fragment.6. ellipsoidal. Single Link Method ² Similarity: Join the most similar pair of objects that are not yet in the same cluster.Data Mining Techniques | 43 4. so no need arises to recalculate the similarity matrix. chains. Complete Link Method ² Similarity: Join least similar pair between each of two clusters ² Type of clusters: All entries in a cluster are linked to one another within some minimum similarity. No cluster centroid or representative required. tightly bound clusters. ² Advantages: Theoretical properties. and add it to the current MST fragment. SLINK Algorithms ² ² ² ² ² Array record cluster pointer and distance information Process input documents one by one. efficient implementations. Distance between 2 clusters is the distance between the closest pair of points. each of which is in one of the two clusters. For each: Compute and store a row of the distance matrix Find nearest other point. ² Time: Voorhees alg. as needed MST Algorithms ² MST has all info needed to generate single link hierarchy in O(N**2) operations. ² Type of clusters: Long straggly clusters. . ² Space: O(N). worst case is O(N**3) but sparse matrices require much less ² Space: Voorhees alg. ² Prim-Dijkstra algorithm operates as follows: ² Place an arbitrary document in the MST and connect its NN (Nearest Neighbor) to it. to create an initial MST fragment. repeat the previous step.3 Other Hierarchical Clustering Methods 1. so have small. ² Disadvantages: Difficult to apply to large data sets since most efficient algorithm is general Hierarchical agglomerative clustering method (HACM) using stored data or stored matrix approach. ² Find the document not in the MST that is closest to any point in the MST. ² Disadvantages: Unsuitable for isolating spherical or poorly separated clusters. or the single link hierarchy can be built at the same time as MST. worst case is O(N**2) but sparse matrices require much less ² Advantages: Good results from (Voorhees) comparative studies. using the matrix Relabel clusters. widely used. 2.

based upon all objects in the cluster. since for any point or cluster there exists a chain of nearest neighbors (NNs) that ends with a pair that are each others’ NN. Group Average Link Method ² Similarity: Use the average value of the pairwise links within a cluster.Single Cluster: 1. can use centroids only to compute cluster-cluster sim. Save space and time if use inner product of 2 (appropriately weighted) vectors — see Voorhees alg.44 | Data Mining and Warehousing Voorhees Algorithm ² Variation on the sorted matrix HACM approach ² Based on fact: if sims between all pairs of docs are processed in descending order. ² Requires sorted list of doc-doc similarities ² Requires way to count no. Follow the NN chain from it till an RNN pair is found. yields unique and exact hierarchy ² Disdvantages: Sensitive to outliers. two clusters of size mi and mj can be merged as soon as the mi × mj th similarity of documents in those clusters is reached. ² Hence. based on Euclidean distance between centroids ² Type of clusters: Homogeneous clusters in a symmetric hierarchy. ² Type of clusters: Intermediate in tightness between single link and complete link ² Time: O(N**2) ² Space: O(N) ² Advantages: Ranked well in evaluation studies ² Disadvantages: Expensive for large collections Algorithm ² Sim between cluster centroid and any doc = mean sim between the doc and all docs in the cluster. Algorithm . Select an arbitrary doc 2. . in O(N) space Ward’s Method (minimum variance method) ² Similarity: Join the cluster pair whose merger minimizes the increase in the total withingroup error sum of squares. cluster centroid nicely characterized by its center of gravity ² Time: O(N**2) ² Space: O(N) — carries out agglomerations in restricted spatial regions ² Advantages: Good at recovering cluster structure. of sims seen between any 2 active clusters 3. below. poor at recovering elongated clusters RNN: We can apply a reciprocal nearest neighbor (RNN) algorithm.

4. Implementations of the general algorithm ² Stored matrix approach: Use matrix. the algorithm above can be applied.7 ONLINE ANALYTIC PROCESSING (OLAP) The term On-Line Analytic Processing . 4. Otherwise. If there is only one point. ² Type of clusters: Cluster is represented by the coordinates of the group centroid or median. but will be O(N**3) if matrix is scanned linearly.FASMI) refers to technology that allows users of multidimensional databases to . return to step 2 else to step 1. Merge the 2 points in that pair. Centroid and Median methods ² Similarity: At each stage join the pair of clusters with the most similar centroids. General algorithm for HACM All hierarchical agglomerative clustering methods (HACMs) can be described by the following general algorithm. How to apply general algorithm? Lance and Williams proposed the Lance-Williams dissimilarity update formula. and the matrix can be processed linearly. to calculate dissimilarities between a new cluster and existing points. which reduces disk accesses. Storage is therefore O(N**2) and time is at least O(N**2). replacing them with a single new point. O(N**2 log N**2) to sort it. Algorithm 1. O(N**2) to construct hierarchy. if there is a point in the NN chain preceding the merged points. This formula has 3 parameters. and then apply Lance-Williams to recalculate dissimilarities between cluster centers. and each HACM can be characterized by its own set of Lance-Williams parameters. Identify and combine the next 2 closest points (treating existing clusters as points too) 3. but one need not store the data set. using the appropriate Lance-Williams dissimilarities. return to step 1. ² Stored data approach: O(N) space for data but recompute pairwise dissimilarities so need O(N**3) time ² Sorted matrix approach: O(N**2) to calculate dissimilarity matrix.Data Mining Techniques | 45 3. stop. Then. Identify the 2 closest points and combine them into a cluster 2.OLAP (or Fast Analysis of Shared Multidimensional Information . based on the dissimilarities prior to forming the new cluster. 4. meaning that updates may cause large changes throughout the cluster hierarchy. ² Disadvantages: Newly formed clusters may differ overly much from constituent points. If more than one cluster remains.

Association rules show attributes value conditions that occur frequently together in a given dataset. contain dynamically updated information) through efficient “multidimensional” queries that reference various types of data. Typically the relationship will be in the form of a rule: IF {bread} THEN {butter}. This above condition extracts the hidden information i. they may involve seasonal adjustments. he will also buy butter as side dish. There are two types of Association rule levels.. 4. and market basket analysis seeks to find relationships between purchases. More information about OLAP is discussed in Chapter 6.1 Rules for Support Level The minimum percentage of instances in the database that contain all items listed in a given association rule.g.g.. The final result of OLAP techniques can be very simple (e. The set of items a customer buys is referred to as an item set.46 | Data Mining and Warehousing generate on-line descriptive or comparative summaries (“views”) of data and other analytic queries. if a customer used to buy bread. obviously.e. OLAP facilities can be integrated into corporate (enterprise-wide) database systems and they allow analysts and managers to monitor the performance of the business (e. removal of outliers.g.8 ASSOCIATION RULES Association rule mining finds interesting associations and/or correlation relationships among large set of data items. Note that despite its name. descriptive statistics. the term applies to analyses of multidimensional databases (that may. Market Basket Analysis is a modeling technique based on the theory that if you buy a certain group of items then you are more (or less) likely to buy another group of items.. let T be the set of all “baskets” or “carts” of products bought by the customers from a supermarket – say on a given day.8. Support of an item set Let T be the set of all transactions under consideration. Support Level Confidence Level 4.. A typical and widely-used example of association rule mining is Market Basket Analysis. analyses referred to as OLAP do not need to be performed truly “online” (or in real-time). simple cross-tabulations) or more complex (e. Discovery of association rules are showing attribute-value conditions that occur frequently together in a given set of data. frequency tables. such as various aspects of the manufacturing process or numbers and types of completed transactions at different locations) or the market. In the . e. The minimum percentage of instances in the database that contain all items listed in a given association rule. The support of an item set S is the percentage of those transactions in T which contain S.. and other forms of cleaning the data).g.

Confidence of an association rule To evaluate association rules.2 Rules for Confidence Level “If A then B”. let R = “butter and bread –> milk”. The relative support is the absolute support divided by the number of transactions (i.8. the algorithm was improved and renamed Apriori by Agrawal et al. 4.e. market-basket analysis). butter. and hence S is in U.. A transaction is a set of items (t. If there are 318 customers and 242 of them buy such a set U or a similar one that contains S. if a customer buys the set X = {milk. then support (S) = (242/318) = 76. milk}. then the rule is applicable and it says that he/she can be expected to buy milk. the confidence of a rule is the number of cases in which the rule is correct relative to the number of cases in which it is applicable. I). referring to the primary application domain (i.…. B})) *100% More intuitively. A set of items is often called the item set by the data mining community. onions.e.tn). If U is the set of all transactions that contain all items in S. Confidence (R) = (support ({A. Transactions are often called baskets. potatoes} then S is obviously a subset of X. apples. The (absolute) support or the occurrence of X (denoted by Supp(X) is the number of transactions that are supersets of X (i. An association rule is like an implication: X –> Y means that if item set X occurs in a transaction.. cheese. Shortly after that.. For example. for example S = {bread. than the rule is not applicable and thus (obviously) does not say anything about this customer. banana..e. rule confidence is the conditional probability that B is true when A is known to be true.1%. respectively. Association rule mining problem This program is also capable of mining association rules.Data Mining Techniques | 47 supermarket example this is the number of “baskets” that contain a given set S of products. n). If a customer buys butter and bread. i. then Support(S) = (|U|/|T|) *100% where |U| and |T| are the number of elements in U and T. ∈..3 APRIORI Algorithm The first algorithm to generate all frequent sets and confident association rules was the AIS algorithm by Agrawal et al. sausages. C})/support ({A.e. If he/she does not buy sugar or does not buy bread or buys neither.. bread. 4. then item set Y also occurs . For example. that contain X). B. An item set is frequent if its support is greater or equal to a threshold value. the confidence of a rule R = “A and B –> C” is the support of the set of all items that appear in the rule divided by the support of the antecedent of the rule. which was given together with the introduction of this mining problem.8. by exploiting the monotonic property of the frequency of item sets and the confidence of association rules Frequent item set mining problem A transactional database consists of sequence of transaction: T = (t1.

Algorithm Input I D S Output L // Large Itemsets Apriori Algorithm k = 0 // k is used as the scan number L=0 Ci = I // Initial candidates are set to be the items Ck Repeat k = k + 1 Lk = 0 For each Ii ∈ Ck do ci = 0 // initial counts for each itset are 0 for each tj ∈ D do for each Ii ∈ Ck do if Ii ∈ tj then ci = ci +1 for each Ii ∈ Ck do if ci >= (s x |D|) do Lk = Lk U Ii L = L U Lk // Itemsets // Database of transactions // Support . This probability is given by the confidence of the rule. An association rule is valid if its confidence and support are greater than or equal to corresponding threshold values. Apriori steps are as follows: Counts item occurrences to determine the frequent item sets. Candidates are generated. Use the frequent item sets to generate the desired rules. thus conf(X–>Y) = Supp(XUY)/Supp(X).48 | Data Mining and Warehousing with high probability. It is like an approximation of p(Y|X). it is the number of transactions that contain both X and Y divided by the number of transaction that contain X. In the frequent itemset mining problem a transaction database and a relative support threshold (traditionally denoted by min_supp) is given and we have to find all frequent itemsets. Count the support of item sets pruning process ensures candidate sizes are already known to be frequent item sets.

19 Apriori example . 2 2 3 2 itemset {1} {2} {3} {4} {5} sup. 4.Data Mining Techniques | 49 Ck+1 = Apriori_Gen(Lk) Until Ck+1 =0 Input Li – 1 Output Ci // Candidate of size i Apriori-gen algorithm Ci = 0 For each I ∈ Li-1 do For each J ≠ I ∈ Li-1 do If i-2 of the elements in I and J are equal then Ck = Ck U (I U J) // Large Itemsets of size i–1 AR Gen algorithm Input D // Database of transactions I // Items L // Large Itemsets S // Support œ // Confidence Output R // Association rules satisfying s and œ AR Gen Algorithm R=0 For each I ∈ L do For each x C I such that x ≠0 do If support(I)/support(x) >= œ then R=RU(x = > (1 – X)) TID 100 200 300 400 Items 13 4 235 1235 25 itemset {1 3} {2 3} {2 5} {3 5} sup. 2 3 3 3 itemset {1 2} {1 3} {1 5} {2 3} {2 5} {3 5} itemset {2 3 5} itemset {2 3 5} Fig. 2 itemset {1} {2} {3} {5} sup. 2 3 3 1 3 itemset {1 2} {1 3} {1 5} {2 3} {2 5} {3 5} sup. 1 2 1 2 3 2 sup.

Advantages (1) Facilitates efficient counting of item sets in large databases. In an item set is large in a sample. Then any one algorithm is used to find the large item sets for the sample.smalls) C = PL U BD~(PL) L=0 For each Ii ∈ C do ci = 0 // Initial counts for each itemset are 0 For each tj ∈ D do // First scan count For each Ii ∈ C do If Ii ∈ tj then ci = ci + 1 For each Ii ∈ C do If ci >= (sx|D|) do L = L U Ii ML = (x| x BD~(PL) x L).4 Sampling Algorithm Here a sample is drawn from the database such that it is memory resident. . Confidence 100%) 4. it is viewed to be potentially large in the entire database. (2) Original sampling algorithm reduces the number of database scans to one or two. Input: I // Item set D // Database of transactions S // Support Output: L // Large item set Sampling algorithm: Ds = Sample drawn from Di PL = Apriori(I.8.50 | Data Mining and Warehousing Rules from example Support(2 3) = 2/4*100 = 50 Support(2 3 5) = 2/4*100 = 50 Confidence(2 3→5) = Support(2 3 5)/support(2 3) = 50/50*100=100% 2 3 → 5 (Support 50%.Ds. // Missing Large itemsets If ML ≠ 0 then C = Li // set candidates to be the large itemsets Repeat C = C U BD~© // Expand candidate sets using negative border Until no new itemsets are added to C.

(3) It is easier to perform incremental generation of association rules by heating the current style of database as one partition and treating the new entries as a second partitions. (2) These algorithms adapt better to binted main memory.….Dp) // Database of transactions divided into partition S // Support Output L // Large Itemset Partitioning Algorithm C=0 For I = 1 to p do // Find large itemsets in each partition i Li = Apriori(I.5 Partitioning Algorithm Here the database D is divided into ‘P’ partitions D1.Data Mining Techniques | 51 For each Ii ∈ C do ci = 0 // Initial counts for each itemset are 0 For each tj ∈ D do // Second scan count For each Ii ∈ C do If Ii ∈ tj then ci = ci + 1 If ci … (sx|D| then L = L U Ii 4.s) C = CULi L=0 For each Ii ∈ C do ci =0 // Initial counts for each item set are 0. Dn. Input I // Item set 1 2 D = (D . D . Advantages (1) Algorithm is more efficient as it has the advantages of large itemset property wherein a large itemsets must be large in at least one of the partitions. For each tj ∈ D do // count candidates during second scan For each Ii ∈ C do Ii ∈ tj then ci = ci +1 For each Ii ∈ C do If ci … (sx|D|) then L = L U Ii .…. D2.D .8.

Content of actual web pages Intra page structure includes the HTML or XML code for the page Inter page structure is the actual linkage structure between web Usage data that describe how Web pages are accessed by visitors User profiles include demographic and registration information obtained about users. This may be the data actually present in Web pages or data related to Web activity. 4. Spatial Mining or Spatial data mining or knowledge discovery in spatial databases.1 Web Mining Web mining is mining of data related to the World Wide Web. resource management. Web mining taxonomy The following are the classification of web mining: Web content mining Web page content mining Search result mining Web structure mining Web usage mining General access pattern tracking Customized usage tracking. This includes applications to whether.3 Temporal Mining The data that are stored reflect data at a single point in time. disaster management.9. Each tuple contains the information that is current from the date stored with that tuple to the date stored with the next tuple in temporal order. not just one time point is called temporal database. is data mining as applied to spatial databases or spatial data. called a snapshot database.52 | Data Mining and Warehousing 4. medicine and robotics. Geographic information systems are used to store information related to Geographic locations on the surface of the Earth. .9. Biomedical application including medical imaging and illness diagnosis also require spatial system. Data mining activities include prediction of environmental catastrophes. agriculture.2 Spatial Mining Spatial data are data that a spatial or location component. Spatial data can be viewed as data about objects that themselves are located in physical space. and hazardous waste. Web data can be classified into the following classes. Spatial data are required for many current information technology systems.9 EMERGING TRENDS IN DATA MINING 4.9. community infrastructure needs. Data are maintained for multiple time points. environment science. 4. geology. Applications for spatial data mining are GIS system.

15. What is market-basket analysis? Explain with an example. 8. 6. 5. What is OLAP? Explain the various tools available for OLAP. 16. 3. 6. Write an essay on emerging trends in data mining. Write short notes on (i) Apriori algorithm. 18. 4. 20. 7. 17. . 2. 5. 19. 2. What is genetic algorithm? Explain. 11.Data Mining Techniques | 53 4. 12. 9. 3. 21. Write in detail about the use of artificial neural network and genetic algorithm for data mining. 8. Data Mining in the Insurance Industry Increasing Wagering Turnover Through Data Mining The Transition from Data Warehousing to Data Mining Rule Extraction from Trained Neural Networks Neural Networks for Market Basket Analysis Data Mining for Customer Relationship Management Web Usage Mining using Neural Networks Data Visualization Techniques and their Role in Data Mining Evolutionary Rule Generation for Credit Scoring Improved Decision Trees for Data Mining Data Mining in the Pharmaceutical Industry Data Mining the Malaria Genome Neural Network Modeling of Water-Related Gastroenteritis Outbreaks Data Mining with Self Generating Neural Networks Development of Intelligent OLAP Tools Mining Stock Market Data Data Mining by Structure Adapting Neural Networks Rule Discovery with Neural Networks in Different Concept Levels Cash Flow Forecasting using Neural Networks Neural Networks for Market Response Modeling Advertising Effectiveness Modeling Temporal Effects in Direct Marketing Response EXERCISE 1.10 DATA MINING RESEARCH PROJECTS 1. 22. 4. 7. 14. 13. What is clustering? Explain. (ii) ID3 algorithm. Describe association rule mining and decision trees. 10.

They have become the “Hobson’s choice” in a number of industrial applications including. .CHAPTER 5 REAL TIME APPLICATIONS AND FUTURE SCOPE 5. Starting from marketing down to the medical field. data mining tools and techniques have found a niche. there is not a single field where data mining is not applied.1 APPLICATIONS OF DATA MINING As on this day. ² Ad revenue forecasting ² Churn (turnover) management ² Claims processing ² Credit risk analysis ² Cross-marketing ² Customer profiling ² Customer retention ² Electronic commerce ² Exception reports ² Food-service menu analysis ² Fraud detection ² Government policy setting ² Hiring profiles ² Market basket analysis ² Medical management ² Member enrollment ² New product development ² Pharmaceutical research ² Process control ² Quality control ² Shelf management/store management ² Student recruitment and retention ² Targeted marketing ² Warranty analysis and many more.

Although banks have employed statistical analysis tools with some success for several years.Real Time Applications and Future Scope | 55 The 2003-IBM grading of industries which use data mining are as follows in Table 5. previously unseen patterns of customer behavior are now coming into clear focus with the aid of new data mining tools.1. TABLE 5. but more advanced data mining tools provide insight to the answer. These statistical tools and even the OLAP find out the answers. Some of the applications of data mining in this industry are. banking sector is ahead of many other industries in using mining techniques for their vast customer database. (i) Predict customer reaction to the change of interest rates (ii) Identify customers who will be most receptive to new product offers (iii) Identify “loyal” customers (iv) Pin point which clients are at the highest risk for defaulting on a loan (v) Find out persons or groups who will opt for each type of loan in the following year (vi) Detect fraudulent activities in credit card transactions .1.1 Industrial Applications of Data Mining Banking Bioinformatics/Biotechnology Direct Marketing/Fundraising eCommerce/Web Entertainment/News Fraud Detection Insurance Investment/Stocks Manufacturing Medical/Pharmacy Retail Scientific Data Security Supply Chain Analysis Telecommunications Travel Others None 1% 2% 4% 2% 1% 8% 13% 10% 10% 5% 1% 9% 8% 3% 2% 6% 6% 9% Industries/fields where data mining is currently applied are as follows: 5.1 Data Mining in the Banking Sector Worldwide.

) by. Communication tools such as the Mass E-Mail Manager make it easy to stay in touch with the constituents. criminologists and many others. and access them from one centralized database. voluntary organizations. (iv) Providing a flexible communication and campaign management structure from a single appeal through multiple campaigns.1. (i) Differential market basket analysis helps to decide the location and promotion of goods inside a store and to compare results between different stores.2 Data Mining in Bio-Informatics and Biotechnology Bio-informatics is a rapidly developing research area that has its roots both in biology and information technology.1. these data are managed and used to build and improve long-term relationships with them.1. different seasons of the year. (ii) Setting up multiple funds and default designations based on various criteria such as campaign. etc. volunteers. etc. assigning requirements and allowing the system to refer the data to automatically generate awards as a recognition of exceptional donor efforts. 5.56 | Data Mining and Warehousing (vii) Predict clients who are likely to change their credit card affiliation in the next quarter (viii) Determine customer preference of the different modes of transaction namely through teller or through credit cards. between different days of the week. election fund raising. between customers in different demographic groups.. The many ways in which mining techniques help in direct marketing are.4 Data Mining in Fraud Detection Data dredging has found wide and useful application in various fraud detection processes like (1) Credit card fraud detection using a combined parallel approach . 5.g. Using powerful search options. (iii) Defining award classes.3 Data Mining in Direct Marketing and Fund Raising Marketing is the first field to use data mining tools for its profit. agencies and members of different charitable trusts. prospects. (i) Capturing unlimited data on various donors. Techniques followed in data harvesting also help in fund raising (e. (ii) A predictive market basket analysis can be used to identify sets of item purchases (or events) that generally occur in sequence — something of interest to direct marketers. Various applications of data mining techniques in this field are: (i) prediction of structures of various proteins (ii) determining the intricate structures of several drugs (iii) mapping of DNA structure with base to base accuracy 5. etc. region or membership type to honor the donors’ wishes. relationship management and data mining tools. to increase company sales.

5 Data Mining in Managing Scientific Data Data Mining is a concept that is taking off in the commercial sector as a means of finding useful information out of gigabytes of scientific data.1.1. (1) Analysis of telecom service purchases (2) Prediction of telephone calling patterns . at the same time. the moisture content and the type of the soil.1. it identified the factors affecting the crop yield. (2) Studies on geophysical databases at the University of Helsinki: It has published a scientific analysis of data on farming and the environment. 5. As a result it has optimized crop yields. Few examples of data mining in the scientific environment are: (1) Studies on global climatic changes: This is a hot research area and is primarily a verification driven mining exercise. To minimize the resources. increase profits.6 Data Mining in the Insurance Sector Insurance companies can benefit from modern data mining methodologies. Gone are those days when scientists have to search through reams of printouts and rooms full of tapes to find the gems that make up scientific discovery. and develop new products. (2) Formulating Statistical Modeling of Insurance Risks (3) Using the Joint Poisson/Log-Normal Model of mining to optimize insurance policies (4) And finally finding the actuarial Credibility of the risk groups among insurers 5. (4) Rule and analog based detection of false medical claims and so on. retain current customers. every activity in telecommunication has used data mining technique. a number of different predictive models have been proposed for future climatic conditions. which help companies to reduce costs. This is an example of what known as the spatial mining. while minimizing the resources supplied.Real Time Applications and Future Scope | 57 (2) Fraud detection in the voters list using neural networks in combination with symbolic and analog data mining.7 Data Mining in Telecommunication As on this date. through such activities as analysis of ice core samples from the Antarctic and. Climate data has been collected for many centuries and is being extended into the more distant past. like chemical fertilizers and additives (phosphate). This can be done through: (1) evaluating the risk of the assets being insured taking into account the characteristics of the asset as well as the owner of the asset. This is an example of Temporal mining as it uses a temporal database maintained for multiple time points. acquire new customers. 5. (3) Fraud detection in passport applications by designing a specific online learning diagnostic system.

1. various forms of data-mining development are being undertaken by companies looking to make sense of the raw data that has been mounting relentlessly in the recent years.9 Data Mining in the Retail Industries Data mining techniques have gone glove in hand in CRM (Customer Relationship Marketing) by developing models for ² predicting the propensity of customers to buy ² assessing risk in each transaction ² knowing the geographical location and distribution of customers ² analyzing customer loyalty in credit based transactions ² Assessing the competitive threat in a locality. . (3) By developing common market place services using a shared external ICT infrastructure to manage suppliers and logistics and implement electronic match making mechanisms. ² Prediction of occurrence of disease and/or complications ² Therapy of choice for individual cancers ² Choice of antibiotics for individual infections ² Choice of a particular technique (of anastamosis. Few of the ways in which data mining tools find its use in e-commerce are: (1) By formulating market tactics in business operations.10 Data Mining in E-Commerce and the World Wide Web Sophisticated or not. 5. the article says. This was possible. (2) By automating business interactions with customers. with the use of data mining techniques. E-commerce is the newest and hottest to use data mining techniques. Here we present some of the medical and pharmaceutical uses of Data Mining techniques for the analysis of medical databases.1. 5. This is widely used in today’s web world. A recent article in the Engineering News-Record noted that e-commerce has empowered companies to collect vast amounts of data on customers—everything from the number of Web surfers in a home to the value of the cars in their garage.58 | Data Mining and Warehousing (3) Management of resources and network traffic (4) Automation of network management and maintenance using artificial intelligence to diagnose and repair network transmission problems. so that customers can transact with all the players in supply chain.) in a given surgical procedures ² Stocking the most frequently prescribed drugs And many more… 5. suture materials etc.8 Data Mining in Medicine and Pharmacy Widespread use of medical information systems and explosive growth of medical databases require traditional manual data analysis to be coupled with methods for efficient computer-assisted analysis. sutures. etc.1.

2 FUTURE SCOPE The era of DBMS began using relational data model and SQL. not only will there be more efficient algorithms with better interface techniques. As the data mining applications are of diverse types. (i) Helping the stock market researchers to predict future stock price movement. web mining can be done in above ways. Coming to the purchaser side in supply chain analysis. (ii) Analyzing fulfillment data to manage volume purchase contracts.1. the hawala scandal. Manual definition of request and result interpretation used on the current data mining tools may decrease with increase in automation. A major development will be in creation of a sophisticated “query language” that includes everything form SQL functions to the data mining applications. While these are many tools to and this present. Over the next few years. (ii) It mines payment data to advantageously update pricing policies (iii) Demand analysis and forecasting helps the supplier to determine the optimum levels of stocks and spare parts. it probably will include similar item: algorithms.Real Time Applications and Future Scope | 59 Determining the size of the world wide web is extremely difficult. Considering the world wide web to be the largest database collection. data mining techniques help them by: (i) Knowing vendor rating to choose the beneficial supplier.1. It is of use to the supplier in the following ways: (i) It analyses the process data to manage buyer rating. In 1999 it was estimated to contain are 350 billion pages with a growth rate of 1 million page/day. 5. 5. 5. but also steps will be taken. Data archeology churns the ocean of stock market to obtain the cream of information in ways like.11 Data Mining in Stock Market and Investment The rapid evolution of computer technology in the last few decades has provided investment professionals (and amateurs) with the capability to access and analyze tremendous amounts of financial data. While it may not look like the relational model. development of a “complete” data mining model is desirable. . Data mining techniques have found wide application in supply chain analysis. data model and metrics for goodness. At present data mining is little there than a set of tools to uncover hidden information from a database. (ii) Data mining of the past prices and related variables help to discover stock market anomalies like.12 Data Mining in Supply Chain Analysis Supply chain analysis is nothing but the analysis of the various data regarding the transactions between the supplier and the purchaser and the use of this analysis to the advantage of either of them or even both of them. To develop an all-encompassing model for data mining. there is no all encompassing model or approach.

² A generalized relation: obtained by generating data from input data ² A characteristic rule: condition that is satisfied by almost all records in a target class. This is where the true data mining request is made. (2) CRISP-DM (Cross Industry Standard Process for Data Mining): This is a new KDD process model applicable to many different applications. ensure its consistency and provide concurrency and recovery features. Difference between these two are: TABLE 5. Two latest models that gain access via ad hoc data mining queries are: (1) KDDMS (Knowledge and Data Discover Management System): It is expected to be the future generation of data mining systems that include not only data mining tools but also techniques to manage the underlying data. The data mining request can be one of the following.60 | Data Mining and Warehousing A data mining query language (DMQL) based on SQL has been proposed. ² A discriminate rule: condition that is satisfied by a target class but is different from conditions satisfied in other classes ² Classification rule: set of rules used to classify data. The model addresses all steps in .2 Difference between SQL and DMQL SQL Allows access only to relational database Here the retrieved data is necessarily an aggregate of data from relations There is no such threshold level DMQL Allows access to background information such as concept hierarchies It indicates the type of knowledge to be mined and the retrieved data need not be a subset of data form relations It fixes a threshold that any mined information should obey A BNF statement of DMQL can be given as modified from zaigg <DMQL>: USE DATABASE (database name) {USE HIERARCHY <hierarchy name> FOR <attribute>} <rule_spec> RELATED TO <attr_or_agg_list> FROM <relation(s)> [WHERE <condition>] [ORDER BY <order_list>] {with [<kinds_of>]THRESHOLD = <threshold_value>[FOR <attribute(s)>]} The heart of a DMQL statement is the rule specification portion.

rule induction (C5. BIRCH. clustering. Decision trees (C5. Windows MARS Salford systems Unix (Linux.Real Time Applications and Future Scope | 61 the KDD process. Life cycle of CRISP-DM model can be described as follows: Assess (business understanding) | Access (data understanding) | Analyze (data preparation) | Act (modeling) | Automate (evaluation deployment) 5. k-means. Windows HP/UX. classification. AIX. decision trees. classification. neural networks. clustering. classification. OS/390. Salford Systems Functions Prediction Classification Techniques Neural Network (MLP) Dicision tree (CART) Platforms Windows CMS . C&RT a variation of CART). RBF) Regression Windows. OS/400 JDA Intellect IDA Software Group Solaris. neural networks (MLP. k-means clustering.) . neural networks (back-probagation . MYS. including the maintenance of the results of the data mining step. clustering Association rules. sequence discovery Association rules. Sun Solaris. classification.. k-means . prediction Forecasting Apriori.0. clustering factor analysis. Unix (linux). forecasting. AIX. k-nearest neighbour. back propagation) regression (Linear) Naïve bayes. prediction. IBM Corporation Windows Decision trees.. Solaris. Windows NT Clementine SPSS Inc.0. time series Association rules. CARMA. k-means DB Miner Analytical System Intelligent Miner DB Miner Technologies Inc. (Contd. logarithm) Decision trees. GRI) regression (linear. IBM.3 Products of Data Mining Product 4 Thought CART Vendor Congnos Inc. Association rules. prediction sequential patterns.3 DATA MINING PRODUCTS TABLE 5.

2. SOM). 7.62 | Data Mining and Warehousing Product MARS Vendor Salford Systems Functions Forecasting Techniques Regression Platforms Unix(Linux. prediction. prediction Classification. classification. regression Decision trees. RBF. 4. neural networks( back-propagation. hypothesis testing. time series analysis Naïve bayes ARIMA. decision trees. clustering ODBC. 3. rules Data Miner Statsoft Inc. MLP. hierarchical clustering(agglomerative. exponential smoothing. logistic. prediction. regression (linear. Windows All platforms on which oracle9i runs Unix. 6. nonlinear. clustering. What are the uses of data mining tools in e-commerce? What is the real life application of data mining in the insurance industry? Analyse the future scopes of data harvesting in the marketing industry. Monte karlo) ARIMA. 5. Windows Oracle 9i database s-plus Oracle Corporation Insightful Corporation Association rules. Windows Xpert Rule Miner Attar Software Ltd Association rules. divisive). decision trees (CART. polynomial) statistical techniques (jackknife. CHAID). . clustering. k-means. How does data mining help the banking industry? List some of the applications of data dredging in managing scientific data. Classification. classification. List a few industries which apply data mining for the management of its vast database. Windows EXERCISE 1. Solaris). correlation(Pearson). List some of the products of data mining.

Point of sale data in retail. Megabyte (1 000 000 bytes) ² 1 Megabyte: A small novel OR A 3. we are listed some interesting examples of data storage. Kilobyte (1000 bytes) ² 1 Kilobyte: A very short story ² 2 Kilobytes: A Typewritten page ² 10 Kilobytes: An encyclopaedic page OR A deck of punched cards ² 50 Kilobytes: A compressed document image page ² 100 Kilobytes: A low-resolution photograph ² 200 Kilobytes: A box of punched cards ² 500 Kilobytes: A very heavy box of punched cards 3. Bytes (8 bits) ² 0. 1.1 THE CALCULATIONS FOR MEMORY CAPACITY The past couple of decades have seen a dramatic increase in the amount of information or data being stored in electronic format.1 bytes: A binary decision ² 1 byte: A single character ² 10 bytes: A single word ² 100 bytes: A telegram OR A punched card 2. are some instances of the types of data that is being collected.CHAPTER 6 DATA WAREHOUSE EVALUATION 6. medical history data in health care.5 inch floppy disk ² 2 Megabytes: A high resolution photograph . Here. There are many examples that can be cited. policy and claim data in insurance. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months and the sizes as well as number of databases are increasing even faster. financial data in banking and securities.

Petabyte (1 000 000 000 000 000 bytes) ² ² ² ² 1 Petabyte: 3 years of EOS data (2001) 2 Petabytes: All US academic research libraries 20 Petabytes: Production of hard-disk drives in 1995 200 Petabytes: All printed material OR Production of digital magnetic tape in 1995 7. Terabyte (1 000 000 000 000 bytes) ² 1 Terabyte: An automated tape robot OR All the X-ray films in a large technological hospital OR 50000 trees made into paper and printed OR Daily rate of EOS data (1998) ² 2 Terabytes: An academic research library OR A cabinet full of Exabyte tapes ² 10 Terabytes: The printed collection of the US Library of Congress ² 50 Terabytes: The contents of a large Mass Storage System 6. Exabyte (1 000 000 000 000 000 000 bytes) ² 5 Exabytes: All words ever spoken by human beings. Yottabyte (1 000 000 000 000 000 000 000 000 bytes) . 8. Zettabyte (1 000 000 000 000 000 000 000 bytes) 9.64 | Data Mining and Warehousing ² 5 Megabytes: The complete works of Shakespeare OR 30 seconds of TV-quality video ² 10 Megabytes: A minute of high-fidelity sound OR A digital chest X-ray ² 20 Megabytes: A box of floppy disks ² 50 Megabytes: A digital mammogram ² 100 Megabytes: 1 meter of shelved books OR A two-volume encyclopaedic book ² 200 Megabytes: A reel of 9-track tape OR An IBM 3480 cartridge tape ² 500 Megabytes: A CD-ROM 4. Gigabyte (1 000 000 000 bytes) ² 1 Gigabyte: A pickup truck filled with paper OR A symphony in high-fidelity sound OR A movie at TV quality ² 2 Gigabytes: 20 meters of shelved books OR A stack of 9-track tapes ² 5 Gigabytes: An 8mm Exabyte tape ² 10 Gigabytes: ² 20 Gigabytes: A good collection of the works of Beethoven ² 50 Gigabytes: A floor of books OR Hundreds of 9-track tapes ² 100 Gigabytes: A floor of academic journals OR A large ID-1 digital tape ² 200 Gigabytes: 50 Exabyte tapes 5.

or signs knowledge. with related data. 6. only knowledge enables actions and decisions. One column (data element) contains data of one and the same kind. and pragmatics. “There is. A row (= tuple. we call the patterns. INFORMATION AND KNOWLEDGE In this section we will present three terms widely—but often unreflectingly—used in several ITrelated (and other) communities. who pursue goals which require performing actions. ‘data’. (3) The third dimension is added by relating signs and their ‘users’.. defines an important characteristic of knowledge. i. for example the column Emp_Number. that data becomes information.3. i. . Data means any symbolic reperesentation of facts or ideas from which information can potentially be extracted.e. Table 6.Data Warehouse Evaluation | 65 6. A table in a database looks like a simple spreadsheet. semantics.e.e. It is only through this relation between sign and meaning.1 Example for Database Emp. and ‘knowledge’.e. the syntax of ‘sentences’ does not relate signs to anything in the real world and thus. which is called pragmatics. signs only viewed under this dimension equal data. If this third dimension including users. no known way to distinguish knowledge from information or data on a purely representational basis.1 Definitions A database is a collection of tables. is introduced. In our eyes. their semantics adds a second dimension to signs. ‘information’. can be called one-dimensional.” (1) The relation among signs... i..3 FUNDAMENTAL OF DATABASE 6. number 101 102 103 Name John Jill Jack A table is a matrix with data. i. Information is a meaningful data and Knowledge is a interesting and useful patterns in a data base. for example the data of one employee detail. (2) The relation between signs and their meaning. The example above has 3 rows and 2 columns in employee table.1 Provides Example of Employee table with 2 columns and 3 rows.. i. The definitions will be oriented according to the three dimensions of semiotics (the theory of signs). TABLE 6. syntax. in general. Normalization in a relational database is the process of removing redundancy in data by separating the data into multiple tables.e. performed by actors. This relationship.2 DATA. entry) is a group of related data.

constraints. each column must be assigned a data type.66 | Data Mining and Warehousing The process of removing redundancy in data by separating the data into multiple tables. But no commercial DBMS product supports relational domains. For example a character data type could be specified as VARCHAR(10). The first step in transforming a logical data model into a physical model is to perform a simple translation from logical terms to physical objects. and other features that augment the functionality of database objects. To successfully create a physical database design you will need to have a good working knowledge of the features of the DBMS including: ² In-depth knowledge of the database objects supported by the DBMS and the physical structures and files required to support those objects. this simple transformation will not result in a complete and correct physical database design – it is simply the first step. and decimal (which require a length and scale) types. you can create an effective and efficient database from a logical data model. Armed with the correct information. indicating that up to 10 characters can be stored for the column. The transformation consists of the following: ² Transforming entities into tables ² Transforming attributes into columns ² Transforming domains into data types and constraints To support the mapping of attributes to table columns you will need to map each logical domain of the attribute to a physical data type and perhaps additional constraints. what data type and length will be used for monetary values if no built-in currency data type exists? Many of the major DBMS . Of course. ² Data definition language (DDL) skills to translate the physical design into actual database objects. ² Details regarding the manner in which the DBMS supports indexing. referential integrity. 6. data types. Certain data types require a maximum length to be specified.2 Data Base Design Database design is the process of transforming a logical data model into an actual physical database. ² Knowledge of the DBMS configuration parameters that are in place. You may need to adjust the data type based on the DBMS you use. For example. though. what must be done to implement a physical database? The first step is to create an initial physical data model by transforming the logical data model into a physical implementation based on an understanding of the DBMS to be used for deployment. Therefore the domain assigned in the logical data model must be mapped to a data type supported by the DBMS. ² Detailed knowledge of new and obsolete features for particular versions or releases of the DBMS to be used. floating point.3. You may need to apply a length to other data types as well. such as graphic. A logical data model is required before you can even begin to design a physical database. In a physical database. Assuming that the logical data model is complete.

² Building indexes on columns to improve query performance. For example. multiple candidate keys often are uncovered during the data modeling process. ATM cards) very frequently and sometimes data warehouse may also use relational databases. ² Choosing the type of index to create: b-tree. buffer pool specification. ² Other physical aspects such as column ordering. Failure to do so will make processing the data in that table more difficult. negative numbers. You may decide to choose a primary key other than the one selected during logical design – either one of the candidate keys or another surrogate key for physical implementation. and so on. you also may need to apply a constraint to the column.g. de normalization. Simply assigning the physical column to an integer data type is insufficient to match the domain. Without a constraint. ² Deciding on the clustering sequence for the data. each of the following must be addressed: ² The nullability (NULL) of each column in each table. partitioning. reverse key. But even if the DBMS does not mandate a primary key for each table it is a good practice to identify a primary key for each physical table you create. Using check constraints you can place limits on the data values that can be stored in a column or set of columns. A primary key should be assigned for every entity in the logical data model. However. In addition to a data type and length. 1 through 10.Data Warehouse Evaluation | 67 products support user-defined data types. ² For character columns. 6. A constraint must be added to restrict the values that can be stored for the column to the specified range.. Specification of a primary key is an integral part of the physical design of entities and attributes. there are many other decisions that must be made during the transition from logical to physical. and values greater than ten could be stored. Of course.3. should fixed length or variable length be used? ² Should the DBMS be used to assign values to sequences or identity columns? ² Implementing logical relationships by assigning referential constraints. Consider a domain of integers between 1 and 10 inclusive. etc. zero. if no built-in data type is acceptable. data files. As a first course of action you should try to use the primary key as selected in the logical data model. Popular RDBMS Databases are listed below. hash. bit map. TABLE 6.3 Selection for the Database/Hardware Platform RDBMS are used in OLTP applications (e. so you might want to consider creating a data type to support the logical domain.2 Packages—Example RDBMS name Oracle IBM DB2 UDB IBM Informix Microsoft SQL Server Sybase Terradata Company name Oracle Corporation IBM Corporation IBM Corporation Microsoft Sybase Corporation NCR .

there are going to be certain parts of the code that is hardware platform-dependent. OLAP helps the user synthesize enterprise information through comparative. Massively parallel computers in 1993. managers and executives to gain insight into data through fast. bugs and bug fixes are often hardware dependent. all at the same time. As a result. This is achieved through use of an OLAP server.4. as well as through analysis of historical and projected data in various “what-if” data model scenarios. regardless of database size and complexity.2 OLAP Server An OLAP server is a high-capacity. personalized viewing. interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. 6.4 OLAP AND OLAP SERVER 6.1 OLAP Definition On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts. A multi-dimensional structure is arranged so that every data item is located and accessed based on the intersection of the dimension members . and nowadays the most powerful computers all use multiple CPUs. multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures. OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end user analytical and navigational activities including: ² calculations and modeling applied across dimensions. RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform.so that it would be the best way for any large computations to be done within 5 years. consistent. where each processor can perform a part of the task. one needs to determine the approximate amount of data that is to be kept in the data warehouse system once it’s mature. Parallel Processing Support: The days of multi-million dollar supercomputers with one single CPU are gone.68 | Data Mining and Warehousing In making selection for the database/hardware platform. and base any testing numbers from there. through hierarchies and/or across members ² trend analysis over sequential time periods ² slicing subsets for on-screen viewing ² drill-down to deeper levels of consolidation ² reach-through to underlying detail data ² rotation to new dimensional comparisons in the viewing area OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries. 6.4. there are several items that need to be carefully considered: Scalability: How can the system grow as your data storage needs grow? Which RDBMS and hardware platform can handle large sets of data most efficiently? To get an idea of this.

valid data. as well as for fast. If data is simply archived for preservation. Without a data warehouse to hold historical information.5 DATA WAREHOUSES.1 A Data Warehouse Supports OLTP A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data as it accumulates.5. uses validation data tables Supports thousands of concurrent users 6. the OLTP database continues to grow in size and requires more indexes to service analytical and report queries. usually adding or retrieving a single row at a time per table Optimized for validation of incoming data during transactions.3 Comparison of Data Warehouse Database and OLTP Data warehouse database Designed for analysis of business measures by categories and attributes Optimized for bulk loads and large. The design of the server and the structure of the data are optimized for rapid ad-hoc information retrieval in any orientation. These queries access and process large portions of the continually growing historical data and add a substantial load to the database. The large indexes needed to support these queries . requires no real time validation Supports few concurrent users relative to OLTP OLTP database Designed for real-time business operations Optimized for a common set of transactions. complex. TABLE 6. staging the multi-dimensional data in the OLAP server is often the preferred method. unpredictable queries that access many rows per table Loaded with consistent. or offer a choice of both. 6. or it may populate its data structures in real-time from relational or other databases. OLTP.Data Warehouse Evaluation | 69 which define that item. Given the current state of technology and the end user requirement for consistent and rapid response times. flexible calculation and transformation of raw data based on formulaic relationships. OLAP AND DATA MINING A relational database is designed for a specific purpose. The OLAP server may either physically stage the processed multi-dimensional information to deliver consistent and rapid response times to end users. An OLTP system is designed to handle a large volume of small-predefined transactions. data is archived to static media such as magnetic tape. the design characteristics of a relational database that supports a data warehouse differ from the design characteristics of an OLTP database. If data is allowed to accumulate in the OLTP so it can be used for analysis. Because the purpose of a data warehouse differs from that of an On Line Transaction Processing (OLTP). or allowed to accumulate in the OLTP database. and by providing services that would complicate and degrade OLTP operations if they were performed in the OLTP database. it is not available or organized for use by analysts and decision makers.

which does not need additional indexes for their support. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses. a query that requests the total sales income and quantity sold for a range of products in a specific geographical region for a specific time period can typically be answered in a few seconds or less regardless of how many hundreds of millions of rows of data are stored in the data warehouse database. 6.3 Data Mining is a Data Warehouse Tool Data mining is a technology that applies sophisticated and complex algorithms to analyze data and expose interesting information for analysis by decision makers.70 | Data Mining and Warehousing also tax the OLTP transactions with additional index maintenance. data mining can be used to analyze sales data against customer attributes and create a new cube dimension to assist the analyst in the discovery of the information embedded in the cube data.5. High volume analytical and reporting queries are handled by the data warehouse and do not load the OLTP. Whereas OLAP organizes data in a model suited for exploration by analysts. OLAP is not designed to store large volumes of text or binary data. it is also reorganized and consolidated so that analytical queries are simpler and more efficient. OLAP supports model-driven analysis and data mining supports data-driven analysis. For example.5. OLAP organizes data warehouse data into multidimensional cubes based on this dimensional model. Data mining has traditionally operated only on raw data in the data warehouse database or. A data warehouse provides a multidimensional view of data in an intuitive model designed to match the types of queries posed by analysts and decision makers. more commonly. . allowing the OLTP to operate at peak transaction efficiency. analysis services provides data mining technology that can analyze data in OLAP cubes. as well as data in the relational data warehouse database. These queries can also be complicated to develop due to the typically complex OLTP database schema. A data warehouse offloads the historical data from the OLTP. In addition. and then preprocesses these cubes to provide maximum performance for queries that summarize data in various ways. The inherent stability and consistency of historical data in a data warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for analytical queries. Analysis Services provides tools for developing OLAP applications and a server specifically designed to service OLAP queries. business objects and data stage. text files of data extracted from the data warehouse database. As data is moved to the data warehouse. Other tolls are informatica.2 OLAP is a Data Warehouse Tool Online analytical processing (OLAP) is a technology designed to provide superior performance for ad hoc business intelligence queries. data mining performs analysis on data and provides the results to decision makers. data mining results can be incorporated into OLAP cubes to further enhance model-driven analysis by providing an additional dimensional viewpoint into the OLAP model. nor is it designed to support high volume update transactions. In SQL server 2000. In SQL server 2000. For example. Thus. 6.

6. IT professional day to day operations application-oriented current. 4. Distinguish between data base warehouse and OLTP data base. What is the difference between data and information? What are OLAP and OLAP Server? What is OLTP? What are the items to be considered to create a data warehouse? Compare OLAP and OLTP. key short.4 Difference between OLTP and OLAP TABLE 6. multidimensional integrated consolidated ad-hoc lots of scans complex query millions hundreds 100GB–TB query throughout. 5. . flat relational isolated repetitive read/write index/hash on prim. 2. summarized. up-to-date detailed. response users function DB design data clerk. 3.4 Difference between OLTP and OLAP OLTP OLAP knowledge worker decision support subject-oriented historical. simple transaction tens thousands 100MB–GB transaction throughout usage access unit of work # records accessed # users DB size metric EXERCISE 1.5.Data Warehouse Evaluation | 71 6.

complete and integrated information of the enterprise that can satisfy decision-making requests. it is the result of transformations. it is supposed to support complex queries (summarization. in a DW environment end users make queries directly against the DW through user-friendly query tools. A paradox occurs: data exists but information cannot be obtained. while its maintenance does not suppose transactional load. ² The “legacy systems” where an enterprise’s data are currently kept. in order to minimize redundancy and so that each may be found easily. it includes indicators that are derived from operational data and give it additional value. and the CDs are in a third.1 INTRODUCTION Data Warehouse (DW) is a database that stores information oriented to satisfy decision-making requests. the T-shirt is in another.2 THE CENTRAL DATA WAREHOUSE The Central Data Warehouse is just that—a warehouse. A data warehouse has three main components: ² A “Central Data Warehouse” or “Operational Data Store(ODS)”. This is accomplished by organizing it according to the enterprise’s corporate data model. We found in the literature. “normalized”. Think of it as a giant grocery store warehouse where the chocolates are kept in one section. A DW is a database with particular features. All the enterprise’s data are stored in there. which is a data base organized according to the corporate data model. ² One or more “data marts”—extracts from the central data warehouse that are organized according to the particular retrieval requirements of individual users.CHAPTER 7 DATA WAREHOUSE DESIGN 7. globally two different approaches for Relational DW design: one that applies dimensional modeling techniques. In addition. quality improvement and integration of data that comes from operational bases. A very frequent problem in enterprises is the impossibility for accessing to corporate. In general. 7. and another that bases mainly in the concept of materialized views. a DW is constructed with the goal of storing and providing all the relevant information that is generated along the different databases of an enterprise. instead of accessing information through reports generated by specialists. Besides. Building and maintaining a DW need to solve problems of many different aspects. Concerning the data it contains. crossing of data). aggregates. Concerning its utilization. In this chapter we concentrate in DW design. .

consisting of measures and context data. Normalized databases have some characteristics that are appropriate for OLTP systems. It typically represents business items or business transactions. OLAP cube design requirements will be a natural outcome of the dimensional model if the data warehouse is designed to support the way users wants to query data. 0-D (apex) cuboid Product Date try t. country Fig. Data redundancy is not a problem in DWs because data is not updated on-line. This maximizes efficiency of updates. A measure is a numeric attribute of a fact. country 2-D cuboids Product. they are the parameters over which we want to perform OLAP. making more compatible logical data representation with OLAP data management. (ii) To maximize the efficiency of queries. 7. representing the performance or behavior of the business relative to the dimensions. It achieves these objectives by minimizing the number of tables and relationships between them. Dimensions determine the contextual background for the facts. dimensions and measures.Data Warehouse Design | 73 Dimensional models represent data with a “cube” structure. coun Produc Country 1-D cuboids Date. The basic concepts of dimensional modeling are: facts. but not for DWs: (i) Its structure is not easy for end-users to understand and use. but tends to penalize retrievals. In OLTP systems this is not a problem because. usually end-users interact with the database through a layer of software. A fact is a collection of related data items. (ii) Data redundancy is minimized. According to the objectives of dimensional modeling are: (i) To produce database structures that are easy for end-users to understand and write queries against. date.1 Cubes . date 3-D (base) cuboid Product. A dimension is a collection of data that describe one business dimension.

also known as lookup or reference tables. Dimension tables are usually textual and descriptive and you can use them as the row headers of the result set. contain the relatively static data in the warehouse. From a modeling standpoint. they can also be semi-additive or non-additive. cost. Fact tables that contain aggregated facts are often called SUMMARY TABLES.3. A fact table contains either detail-level facts or facts that have been aggregated. the primary key of the fact table is usually a composite key that is made up of all of its foreign keys. Fact tables typically contain facts and foreign keys to the dimension tables. where you cannot tell what a level means simply by looking at it. the fact table can be reduced to columns for dimension foreign keys and numeric fact values.1 Fact Tables A fact table typically has two types of columns: those that contain numeric facts (often called measurements). Examples include sales.74 | Data Mining and Warehousing 7.3 DATA WAREHOUSING OBJECTS The following types of objects are commonly used in dimensional data warehouse schemas: Fact tables are the large tables in your warehouse schema that store business measurements. and denormalized data are typically not stored in the fact table. An example of this is averages. that can be analyzed and examined. containing hundreds of millions of rows and consuming hundreds of gigabytes or multiple terabytes of storage. An example of this is inventory levels.2 A sales analysis star schema Dimension tables. The definitions of this ‘sales’ fact table follow: . Dimension tables store the information you normally use to contain queries. Suppliers or products. and profit. A common example of this is sales. Additive facts can be aggregated by simple arithmetical addition. Examples are customers. 7. BLOBs. usually numeric and additive. Fact tables contain business event details for summarization. Text. Fact tables represent data. and those that are foreign keys to dimension tables. Time Customer Sales Facts Seller Product Manufacturing Location Fig. Because dimension tables contain records that describe facts. Though most facts are additive. Fact tables are often very large. Semi-additive facts can be aggregated along some of the dimensions and not along others. Creating a new fact table: You must define a fact table for each star schema. Time. Location. 7. Non-additive facts cannot be added at all. A fact table usually contains facts with the same level of aggregation.

If fact tables are partitioned.2 Dimension Tables A dimension is a structure. They are normally descriptive. cust_id NUMBER CONSTRAINT sales_customer_nn NOT NULL. that categorizes data. inventory. enable you to answer business questions. The partition divisions are almost always along a single dimension.2) CONSTRAINT sales_amount_nn NOT NULL. as discussed earlier in “Dimension Tables. For example. Partitioned fact tables can be viewed as one table with an SQL UNION query as long as the number of tables involved does not exceed the limit for a single query. a product dimension may be used with a sales fact table and an inventory fact table in the data warehouse. Such business-specific schemas may be part of the central data warehouse or implemented as data marts. such as sales. Any dimensions that are common across the business functions must represent the dimension information in the same way. and finance.3. combined with facts.quantity_sold NUMBER(4) CONSTRAINT sales_quantity_nn NOT NULL. and also in one or more departmental data marts. 7. and the time dimension is the most common one to use because of the historical nature of most data warehouse data. These natural rollups or aggregations within a dimension table are called hierarchies. Summarization data and reports will not correspond if different schemas use different versions of a dimension table. amount NUMBER(10. Very large fact tables may be physically partitioned for implementation and maintenance design considerations. Dimensional attributes help to describe the dimensional value. ad_id NUMBER(7). Using conforming dimensions is critical to successful data warehouse design.” Each business function will typically have its own schema that contains a fact table. Several distinct dimensions. products. and time. or product that is used in multiple schemas is called a conforming dimension if all copies of the dimension are the same. often composed of one or more hierarchies.2) CONSTRAINT sales_cost_nn NOT NULL ) Multiple Fact Tables: Multiple fact tables are used in data warehouses that address multiple business functions. Dimension data is typically collected at the lowest level of detail and then aggregated into higher-level totals that are more useful for analysis. The definitions of this ‘customer’ fact table follow: . textual values. Each business function should have its own fact table and will probably have some unique dimension tables. OLAP cubes are usually partitioned to match the partitioned fact table segments for ease of maintenance. several conforming dimension tables. time. A dimension table may be used in multiple places if the data warehouse contains multiple fact tables or contributes data to data marts. and some dimension tables unique to the specific business function.Data Warehouse Design | 75 CREATE TABLE sales ( prod_id NUMBER(7) CONSTRAINT sales_product_nn NOT NULL. time_id DATE CONSTRAINT sales_time_nn NOT NULL. cost NUMBER(10. A dimension such as customer. Commonly used dimensions are customers.

there may be a number of sales to a single customer.prod_subcat_desc ATTRIBUTE category DETERMINES products. these attributes are rich and user-oriented textual details. cust_last_name VARCHAR2(40) CONSTRAINT customer_lname_nn NOT NULL. cust_marital_status VARCHAR2(20). Geography is seldom a dimension of its own. The code may also be carried in the dimension table if needed for maintenance.prod_name ATTRIBUTE product DETERMINES products. For example.prod_id) LEVEL subcategory IS (products. it is usually a hierarchy that . Attributes serve as report labels and query constraints. product category may exist as a simple integer in the OLTP database. This denormalization simplifies and improves the efficiency of queries and simplifies user query tools. cust_year_of_birth NUMBER(4). A dimension may contain multiple hierarchies – a time dimension often contains both calendar and fiscal year hierarchies. but the dimension table should contain the actual text for the category. Attributes that are coded in an OLTP database should be decoded into descriptions. if a dimension attribute changes frequently. Year. The dimension table contains attributes associated with the dimension entry. cust_state_district VARCHAR2(40). For example. Month. cust_postal_code VARCHAR2(10) CONSTRAINT customer_pcode_nn NOT NULL. cust_credit_limit NUMBER. Day or Week. cust_email VARCHAR2(30) ) CREATE DIMENSION products_dim LEVEL product IS (products. Quarter. Hierarchies: The data in a dimension is usually hierarchical in nature.prod_cat_desc.country_id CHAR(2) CONSTRAINT customer_country_id_nn NOT NULL. For example.cust_city VARCHAR2(30) CONSTRAINT customer_city_nn NOT NULL. maintenance may be easier if the attribute is assigned to its own table to create a snowflake dimension. cust_income_level VARCHAR2(30). or a number of sales of a single product.76 | Data Mining and Warehousing CREATE TABLE customers ( cust_id NUMBER. cust_first_name VARCHAR2(20) CONSTRAINT customer_fname_nn NOT NULL.prod_category) HIERARCHY prod_rollup ( product CHILD OF subcategory CHILD OF category ) ATTRIBUTE product DETERMINES products. The records in a dimension table establish one-to-many relationships with the fact table.cust_sex CHAR(1).prod_desc ATTRIBUTE subcategory DETERMINES products. a time dimension often contains the hierarchy elements: (all time). Hierarchies are determined by the business need to group and summarize data into usable information. such as product name or customer name and address.cust_street_address VARCHAR2(40) CONSTRAINT customer_st_addr_nn NOT NULL. However. cust_phone_number VARCHAR2(25).prod_subcategory) LEVEL category IS (products.

If the number of discrete values creates a very large table of all possible value combinations. This can greatly reduce the size of the fact table by reducing the number of foreign keys in fact table records.Data Warehouse Design | 77 imposes a structure on sales points. the database can aggregate existing sales revenue on a quarterly base to a yearly aggregation when the dimensional dependencies between quarter and year are known. region. A data warehouse must be designed to satisfy the following requirements: ² Deliver a great user experience—user acceptance is the measure of success ² Function without interfering with OLTP systems ² Provide a central repository of consistent data ² Answer complex queries quickly ² Provide a variety of powerful analytical tools such as OLAP and data mining Most successful data warehouses that meet these requirements have these common characteristics: ² Based on a dimensional model ² Contain historical data ² Include both detailed and summarized data ² Consolidate disparate data from multiple sources while retaining consistency ² Focus on a single subject such as sales. store. They define the parent-child relationship between the levels in a hierarchy. or finance Data warehouses are often quite large. Multi-use dimensions: Sometimes data warehouse design can be simplified by combining a number of small. size is not an architectural goal—it is a characteristic driven by the amount of data needed to serve the users. unrelated dimensions into a single physical dimension. An example geography hierarchy for sales points is: (all). For example. collecting these comments in a single dimension removes a sparse text field from the fact table and replaces it with a compact foreign key. Hierarchies are also essential components in enabling more complex rewrites. often called a junk dimension. 7. customers.4 GOALS OF DATA WAREHOUSE ARCHITECTURE A data warehouse exists to serve its users—analysts and decision makers. country. Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information. A common example of a multi-use dimension is a dimension that contains customer demographics selected for reporting standardization. the table can be populated with value combinations as they are encountered during the load or update process. or other geographically distributed dimensions. However. Often the combined dimension will be prepopulated with the cartesian product of all dimension values. . city. Another multiuse dimension might contain useful textual comments that occur infrequently in the source data records. inventory. state or district.

Each type makes up a portion of the user population as illustrated in this diagram. Knowledge Workers are often deeply engaged with the data warehouse design and place the greatest demands on the ongoing data warehouse operations team for training and support. Their work can contribute to closed loop systems that deeply influence the operations and profitability of the company. These are the users who get the Designer or Analyst versions of user access tools. They will figure out how to quantify a subject area. historical data might as well be archived to magnetic tape and stored in the basement.5 DATA WAREHOUSE USERS The success of a data warehouse is measured solely by its acceptance by users.3 Data warehouse users Statisticians: There are typically only a handful of statisticians and operations research types in any organization. Knowledge Workers: A relatively small number of analysts perform the bulk of new queries and analyses against the data warehouse. They use static or simple interactive reports . Data warehouse users can be divided into four categories: Statisticians. Successful data warehouse design starts with understanding the users and their needs. and executives.78 | Data Mining and Warehousing 7. After a few iterations. information consumers. Statisticians (2%) Knowledge Workers (15%) Executives Information Consumers (82%) Fig. Information Consumers: Most users of the data warehouse are Information Consumers. knowledge workers. 7. they will probably never compose a true ad hoc query. their queries and reports typically get published for the benefit of the Information Consumers. Without users.

which are often more efficient than direct queries and impose less load on the relational database. analysts will write MDX queries against OLAP cubes and use data mining. statisticians will often perform data mining.6.1 How Users Query the Data Warehouse? Information for users can be extracted from the data warehouse relational database or from the output of analytical services such as OLAP or data mining. and published reports are highly visible.. The index on the primary key is often created as a clustered index. Views. 7. In many scenarios a clustered primary key provides excellent . They usually interact with the data warehouse only through the work product of others. Statisticians frequently extract data for use by special analytical tools. The most common types of constraints include: ² UNIQUE constraints—To ensure that a given column is unique ² NOT NULL constraints—To ensure that no null values are allowed ² FOREIGN KEY constraints—To ensure that two keys share a primary key to foreign key relationship Constraints can be used for these purposes in a data warehouse is Data cleanliness and Query optimization. indexes. the star or snowflake schema is created in the relational database. and fact table partitions are also defined. Executives use standard reports or ask others to create specialized reports for them.Data Warehouse Design | 79 that others have developed. The index alone is almost as large as the fact table. Information consumers do not interact directly with the relational database but may receive e-mail reports or access web pages that expose data from the relational database. 7. Primary/ foreign key relationships should be established in the database schema.6 DESIGN THE RELATIONAL DATABASE AND OLAP CUBES In this phase. and gather feedback from these users to improve the information sites over time.1 Keys and Relationships Tables are implemented in the relational database after surrogate keys for dimension tables have been defined and primary and foreign keys and their relationships have been identified. surrogate keys are defined and primary and foreign key relationships are established. The composite primary key in the fact table is an expensive key to maintain. and Information Consumers will use interactive reports designed by others. Direct queries to the data warehouse relational database should be limited to those that cannot be accomplished through existing tools. Set up a great communication infrastructure for distributing information widely. Reporting tools and custom applications often access the database directly. Executives: Executives are a special case of the Information Consumers group. OLAP cubes are designed that support the needs of the users. This group includes a large number of people. Analysts may write complex queries to extract and compile specific information not readily accessible through existing tools.5. 7. Integrity constraints provide a mechanism for ensuring that data conforms to guidelines specified by the database administrator. When using the analysis services tools in SQL server 2000.

and then employed periodically during ongoing operation to maintain indexes in tune with the user query workload. A recommendation from the wizard consists of SQL statements that can be executed to create new. should the database administrator determine this structure would provide better performance. which has sophisticated index techniques and an easy to use Index Tuning Wizard tool to tune indexes to the query workload. There are scenarios where the primary key index should be clustered and other scenarios where it should not. and recommends an index configuration to improve the performance of the database. or the internals of SQL Server. the workload. 7. the system will require significant additional storage space. it can then be implemented immediately. ² It allows you to customize its recommendations by specifying advanced options.2 Indexes Dimension tables must be indexed on their primary keys. Elaborate initial design and development of index plans for end-user queries is not necessary with SQL server 2000. and query performance may degrade.6. The empirical tuning approach provided by the index tuning wizard can be used frequently when the data warehouse is first implemented to develop the initial index set. However. The fact table must have a unique index on the primary key. ² It can recommend ways to tune the database for a small set of problem queries. many star schemas are defined with an integer surrogate primary key.80 | Data Mining and Warehousing query performance. the less beneficial it is to cluster the primary key index. and performance of queries in the workload. The SQL Server 2000 Index Tuning Wizard allows you to select and create an optimal set of indexes and statistics for a database without requiring an expert understanding of the structure of the database. Also create an IDENTITY column in the fact table that could be used as a unique clustered index. The wizard analyzes a query workload captured in a SQL Profiler trace or provided by an SQL script. All indexes on the fact table will be large. such as disk space constraints. scheduled as a SQL Server job. if wanted. ² It analyzes the effects of the proposed changes. The larger the number of dimensions in the schema. or executed manually at a later time. . With a large number of dimensions. including index usage. As a result. all other indexes on the fact table use the large clustered index key. more effective indexes and. which are the surrogate keys created for the data warehouse tables. drop existing indexes that are ineffective. After the Index Tuning Wizard has suggested a recommendation. We recommend that the fact table be defined using the composite primary key. distribution of queries among tables. it is usually more effective to create a unique clustered index on a meaningless IDENTITY column. or no primary key at all. The index tuning wizard provides the following features and functionality: ² It can use the query optimizer to analyze the queries in the provided workload and recommend the best combination of indexes to support the query mix in the workload. Indexed views are recommended on platforms that support their use.

Users can be granted access to views without having access to the underlying data. a complex star can have more than one fact table. type of network. The following diagram shows a classic star schema. number of users.7.1 Star Schemas A schema is called a star schema if all dimension tables can be joined directly to the fact table. Indexed views can be used to improve performance of user queries that access data through views. ² Identify measures or facts (sales dollar). Steps in Designing Star Schema ² Identify a business process for analysis (like sales). the schema for a dimensional model contains a central fact table and multiple dimension tables.Data Warehouse Design | 81 7. branch name. Most data warehouses use a dimensional model. subregion name).1.3 Views Views should be created for users who need direct access to data in the data warehouse relational database. ² Determine the lowest level of summary in a fact table (sales dollar). and synonyms. You can sometimes get the source model from your company’s enterprise data model and reverse-engineer the logical data model for the data warehouse from this. views. a single object (the fact table) sits in the middle and is radically connected to other surrounding objects (dimension lookup tables) like a star.1 Dimensional Model Schemas The principal characteristic of a dimensional model is a set of detailed business facts surrounded by multiple dimensions that describe those facts. including tables. and software. . When realized in a database. ² Identify dimensions for facts (product dimension. 7.7. location dimension. View definitions should create column and table names that will make sense to business users. indexes. A star schema can be simple or complex. A dimensional model may produce a star schema or a snowflake schema. In the star schema design. organization dimension).6.7 DATA WAREHOUSING SCHEMAS A schema is a collection of database objects. A simple star consists of one fact table. The model of your source data and the requirements of your users help you design the data warehouse schema. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways. 7. time dimension. ² List the columns that describe each dimension (region name. 7. The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters—size of machine. storage capacity.

A hierarchy can also be used to define a navigational drill path. a hierarchy might be used to aggregate data from the month level to the quarter level. in a time dimension.5 Example of star schema Hierarchy A logical structure that uses ordered levels as a means of organizing data.82 | Data Mining and Warehousing Products Dimension Table Sales (amount sold. regardless of whether the levels in the hierarchy represent aggregated totals or not. . 7.4 Star schema Times Dimension Table Channels Dimension Table Example – Star Schema with Time Dimension PRODUCT DIMENSION Product Dimension Identifier (PK) Product Category Name Product Sub-Category Name Product Name Product Feature Description Date Time Stamp ORGANISATION DIMENSION Organisation Dimension Identifier (PK) Corporate Office Name Region Name Branch Name Employees Name Date Time Stamp SALES FACT Time Dimension Identifier (FK) Location Dimension Identifier (FK) Product Dimension Identifier (FK) Organisation Dimension Identifier (FK) LOCATION DIMENSION Location Dimension Identifier (PK) Country Name State Name City Name Date Time Stamp Sales Dollar Date Time Stamp TIME DIMENSION Time Dimension Identifier (PK) Year Number Day of Year Quarter Number Month Number Month Name Month Day Number Week Number Day of Week Calender Date Date Time Stamp Fig. A hierarchy can be used to define data aggregation. for example. from the quarter level to the year level. 7. quantity sold) Fact Table Customers Dimension Table Fig.

A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. 7. time and organization. Suppliers Times Sales (amount sold. quantity sold) Customers Channels Products Countries Fig. A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). a time dimension might have a hierarchy that represents data at the month. “Sales Dollar” in sales fact table can be calculated across all dimensions independently or in a combined manner which is explained below. It shows that data can be sliced across all dimensions and again it is possible for the data to be aggregated across multiple dimensions. Fact Table A table in a star schema that contains facts and connected to dimensions.6. 7. In the example Fig. a dimension that describes products may be separated into three tables (snowflaked). For example.7. ² Sales dollar value for a particular product ² Sales dollar value for a product in a location ² Sales dollar value for a product in a year within a location ² Sales dollar value for a product in a year within a location sold or serviced by an employee 7. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.2 Snowflake Schemas A schema is called a snowflake schema if one or more dimension tables do not join directly to the fact table but must join through other dimension tables. sales fact table is connected to dimensions location. For example. quarter. A fact table usually contains facts with the same level of aggregation.6 A snowflake schema .1. and year levels. product.Data Warehouse Design | 83 Level A position in a hierarchy.

The reason is that hierarchies(category. and month) are being broken out of the dimension tables (PRODUCT. Example—Snowflake Schema In Snowflake schema. state. 4 lookup tables and 1 fact table. branch.84 | Data Mining and Warehousing The snowflake schema is an extension of the star schema where each point of the star explodes into more points. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables. ORGANIZATION. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. 7. the example diagram shown below has 4 dimension tables.7 Example of snowflake schema . and TIME) PRODUCT CATEGORY LOOKUP Product Category Code (PK) Product Category Name Date Time Stamp BRANCH LOOKUP Branch Code (PK) Branch Name Date Time Stamp PRODUCT DIMENSION Product Dimension Identifier (PK) Product Category Code (FK) Product Sub-Category Name Product Name Product Feature Description Date Time Stamp ORGANISATION DIMENSION Organisation Dimension Identifier (PK) Corporate Office Name Region Name Branch Name Employees Name Date Time Stamp SALES FACT Time Dimension Identifier (FK) Location Dimension Identifier (FK) Product Dimension Identifier (FK) Organisation Dimension Identifier (FK) LOCATION DIMENSION Location Dimension Identifier (PK) Country Name State Code (FK) City Name Date Time Stamp Sales Dollar Date Time Stamp TIME DIMENSION Time Dimension Identifier (PK) Year Number Day of Year Quarter Number Month Number Month Name Month Day Number Week Number Day of Week Calender Date Date Time Stamp STATE LOOKUP State Code (PK) State Name Date Time Stamp MONTH LOOKUP Month Number (PK) Month Name Date Time Stamp Fig. LOCATION.

3 Important Aspects of Star Schema & Snowflake Schema ² In a star schema every dimension will have a primary key. 2. In OLAP. Since dimension tables hold less space. EXERCISE 1. a dimension table will have one or more parent tables. a dimension table will not have any parent table. Write an essay on schemas.1. How will you design OLAP cubes? . 3. 6. they try to normalize the dimension tables to save space. 4. Discuss about use of data warehouse. ² Whereas in a snowflake schema. Define fact and dimension tables.Data Warehouse Design | 85 respectively and shown separately. Explain data warehouse architecture. Snowflake schema approach may be avoided. Who are data warehouse users? Explain. this Snowflake schema approach increases the number of joins and poor performance in retrieval of data. ² Whereas hierarchies are broken into separate tables in snowflake schema. 7. 5. In few organizations. ² Hierarchies for the dimensions are stored in the dimensional table itself in star schema. ² In a star schema. These hierarchies help to drill down the data from topmost hierarchies to the lowermost hierarchies.7.

The objective of partitioning is manageability and performance related requirements.2 HARDWARE PARTITIONING When a system is constrained by I/O capabilities.1 INTRODUCTION Data warehouses often contain large tables and require techniques both for managing these large tables and for providing good query performance across these large tables. RAID can be implemented in several levels. This increases disk I/O performance because the hard disks participating in the RAID array.0 and Windows 2000 and applications (like SQL Server) into slices (usually 16–128 KB) that are then spread across all disks participating in the RAID array. instead of some disks becoming a bottleneck due to uneven distribution of the I/O requests. RAID can provide both performance and fault tolerance benefits. it is I/O bottleneck. it is CPU bottleneck. Performance Hardware RAID controllers divide read/writes of all data from Windows NT 4. A variety of RAID controllers and disk configurations offer tradeoffs among cost. Splitting data across physical drives like this has the effect of distributing the read/write I/O workload evenly across all physical hard drives participating in the RAID array.CHAPTER 8 PARTITIONING IN DATA WAREHOUSE 8. . When a system is constrained by having limited CPU resources. ranging from 0 to 7. and fault tolerance. as a whole are kept equally busy. These topics are discussed: ² Hardware partitioning ² Software partitioning 8. Database architects frequently use RAID (Redundant Arrays of Inexpensive Disks) systems to overcome I/O bottlenecks and to provide higher availability. performance. Fault Tolerance RAID also provides protection from hard disk failure and accompanying data loss by using two methods: mirroring and parity. RAID is a storage technology often used for databases larger than a few gigabytes.

Disadvantages The minus of parity are performance and fault tolerance. the data for the lost drive can be rebuilt by replacing the failed drive and rebuilding the mirror set. Both RAID 1 and its hybrid. Advantage The main plus of parity is cost. Parity It is implemented by calculating recovery information about data written to disk and writing this parity information on the other drives that form the RAID array. Another advantage is that mirroring provides more fault tolerance than parity RAID implementations. for a data warehouse. Advantage It offers the best performance among RAID options if fault tolerance is required. Disadvantage The disk cost of mirroring is one extra drive for each drive worth of data.0 and Windows 2000 and RDBMS are online. To protect any number of drives with RAID 5. RAID 5 and its hybrids are implemented through parity.Partitioning in Data Warehouse | 87 Mirroring It is implemented by writing information onto a second (mirrored) set of drives. This essentially doubles your storage cost. once to each side of the mirrorset. is often one of the most expensive components needed. only one additional drive is required. a new drive is inserted into the RAID array and the data on that failed drive is recovered by taking the recovery information (parity) written on the other drives and using this information to regenerate the data from the failed drive. Be ready to add disks and redistribute data across RAID arrays and/or small computer system interface (SCSI) channels as necessary to balance disk I/O and maximize performance. Such RAID systems are commonly referred to as “Hot Plug” capable drives. Mirroring can enable the system to survive at least one failed drive and may be able to support the system through failure of up to half of the drives in the mirrorset without forcing the system administrator to shut down the server and recover from the file backup. Read I/O operation costs are the same for mirroring and parity. compared to two disk I/O operations for mirroring. . RAID 0+1 (sometimes referred to as RAID 10 or 0/1) are implemented through mirroring. Read operations. If a drive should fail. System monitor will indicate if there is a disk I/O bottleneck on a particular RAID array. Parity information is evenly distributed among all drives participating in the RAID 5 array. Due to the additional costs associated with calculating and writing parity. General Rule of Thumb: Be sure to stripe across as many disks as necessary to achieve solid disk I/O performance. are usually one failed drive before the array must be taken offline and recovery from backup media must be performed to restore data. If there is a drive loss with mirroring in place. however. which. Bear in mind that each RDBMS write to the mirrorset results in two disk I/O operations. Most RAID controllers provide the ability to do this failed drive replacement and re-mirroring while Windows NT 4. RAID 5 requires four disk I/O operations for each write.

This backup power can help preserve the data written in cache for some period of time (perhaps days) in case of a power outage.1 RAID level 0 . The principle of these controller-based caching mechanisms is to gather smaller and potentially nonsequential I/O requests coming in from the host server and try to batch them together with other I/O requests for a few milliseconds so that the batched I/Os can form larger (32–128 KB) and maybe sequential I/O requests to send to the hard drives. If the database server is also supported by an uninterruptible power supply (UPS).3 RAID LEVELS Level 0 This level is also known as disk striping because of its use of a disk file system called a stripe set. In keeping with the principle that sequential and larger I/O is good for performance. The following illustration shows RAID 0. This available caching with SQL Server can significantly enhance the effective I/O handling capacity of the disk subsystem. It is not that the RAID controller caching magically allows the hard disks to process more I/Os per second. Disk 1 Disk 2 Disk 3 Disk 4 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 Fig.88 | Data Mining and Warehousing Effect of On-Board Cache of Hardware RAID Controllers Many hardware RAID controllers have some form of read and/or write caching. RAID 0 is similar to RAID 5. so that operations can be performed independently and simultaneously. the RAID controller has more time and opportunity to flush data to disk in the event of power disruption. Although a UPS for the server does not directly affect performance. this helps produce more disk I/O throughput given the fixed number of I/Os that hard disks are able to provide to the RAID controller. RAID 0 improves read/write performance by spreading operations across multiple disks. the RAID controller cache is using some organization to arrange incoming I/O requests to make best possible use of the underlying hard disks’ fixed amount of I/O processing ability. it does provide protection for the performance improvement supplied by RAID controller caching. 8. except RAID 5 also provides fault tolerance. Rather. These RAID controllers usually protect their caching mechanism with some form of backup power. 8. Data is divided into blocks and spread in a fixed order among all disks in an array.

Disk 1 Disk 2 Disk 3 Disk 4 Disk 5 A0 A1 A2 A3 4 parity B0 B1 B2 3 parity B4 C0 C1 2 parity C3 C4 Fig. The following illustration shows RAID 1. Disk mirroring provides a redundant. It also employs a disk-striping strategy that breaks a file into bytes and spreads it across multiple disks.Partitioning in Data Warehouse | 89 Level 1 This level is also known as disk mirroring because it uses a disk file system called a mirror set.3 RAID level 2 D0 1 party D2 D3 D4 0 party E1 E2 E3 E4 .2 RAID level 1 Level 2 This level adds redundancy by using an error correction method that spreads parity across all disks. identical copy of a selected disk. RAID 1 provides fault tolerance and generally improves read performance (but may degrade write performance). RAID 2 is not as efficient as other RAID levels and is not generally used. This strategy offers only a marginal improvement in disk utilization and read/write performance over mirroring (RAID 1). All data written to the primary disk is written to the mirror disk. Disk 1 Disk 2 A B C D A B C D Mirror Fig. 8. 8.

This level uses a striped array of disks. read performance degrades (for example. It keeps user data separate from error-correction data. RAID 3 also is rarely used. 8. . Disk 1 Disk 2 Disk 3 Disk 4 Disk 1 Disk 2 Disk 3 Disk 4 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 Fig. which are then mirrored to another identical set of striped disks. For example.4 RAID level 5 Level 10 (1 + 0) This level is also known as mirroring with striping. It differs in how it writes the parity across all the disks. The data and parity information are arranged on the disk array so the two are always on different disks. The following illustration shows RAID 10. Striping with parity offers better performance than disk mirroring (RAID 1). RAID 3 provides some read/write performance improvement. Like RAID 3. RAID 4 is not as efficient as other RAID levels and is not generally used. RAID 10 provides the performance benefits of disk striping with the disk redundancy of mirroring. when a disk fails). the error correction method requires only one disk for parity data. RAID 10 provides the highest read/write performance of any of the RAID levels at the expense of using twice as many disks. The striped array of disks is then mirrored using another set of four striped disks. The following illustration shows RAID 5. It is similar to RAID 4 because it stripes the data in large blocks across the disks in an array. a striped array can be created using four disks. but the error correction method requires only one disk for parity data. RAID 5 is one of the most commonly used RAID configurations.90 | Data Mining and Warehousing Level 3 This level uses the same striping method as RAID 2. this level is the most popular strategy for new designs. However. when a stripe member is missing. Level 4 This level employs striped data in much larger blocks or segments than RAID 2 or RAID 3. Use of disk space varies with the number of data disks. Data redundancy is provided by the parity information. Level 5 Also known as striping with parity.

Partitioning in Data Warehouse | 91 As mentioned above. but cost more in terms of disks required. RAID 5 costs less than RAID 1 or RAID 0+1 but provides less fault tolerance and less write performance. The write performance of RAID 5 is only about half that of RAID 1 or RAID 0+1 because of the additional I/O needed to read and write parity information. RAID 0 RAID 1 RAID 5 RAID 0 + 1 Data Fault tolerance information Fig. RAID 1 or RAID 0+1 are the best choices in terms of both performance and fault tolerance.5 Comparison of different RAID levels . RAID 0 is typically used only for benchmarking or testing. and it is not recommended for development environments. Because RAID 0 provides no fault tolerance protection. 8. The best disk I/O performance is achieved with RAID 0 (disk striping with no fault tolerance protection). it should never be used in a production environment. When cost of hard disks is not a limiting factor. RAID 1 and RAID 0+1 offer the best data protection and best performance among RAID levels.

92 | Data Mining and Warehousing

Many RAID array controllers provide the option of RAID 0+1 (also referred to as RAID 1/ 0 and RAID 10) over physical hard drives. RAID 0+1 is a hybrid RAID solution. On the lower level, it mirrors all data just like normal RAID 1. On the upper level, the controller stripes data across all of the drives (like RAID 0). Thus, RAID 0+1 provides maximum protection (mirroring) with high performance (striping). These striping and mirroring operations are transparent to Windows and RDBMS because they are managed by the RAID controller. The difference between RAID 1 and RAID 0+1 is on the hardware controller level. RAID 1 and RAID 0+1 require the same number of drives for a given amount of storage. For more information on RAID 0+1 implementation of specific RAID controllers, contact the hardware vendor that produced the controller. The Fig. 8.5 shows differences between RAID 0, RAID 1, RAID 5, and RAID 0+1. Note: In the illustration above, in order to hold four disks worth of data, RAID 1 (and RAID 0+1) need eight disks, whereas Raid 5 only requires five disks. Be sure to involve your storage vendor to learn more about their specific RAID implementation.

8.4 SOFTWARE PARTITIONING METHODS
² ² ² ² ² Range Partitioning Hash Partitioning Index Partitioning Composite Partitioning List Partitioning

Range Partitioning
Range partitioning maps data to partitions based on ranges of partition key values that you establish for each partition. It is the most common type of partitioning and is often used with dates. For example, you might want to partition sales data into weekly, monthly or yearly partitions. Range partitioning maps rows to partitions based on ranges of column values. Range partitioning is defined by the partitioning specification for a table or index. PARTITION BY RANGE (column_list) and by the partitioning specifications for each individual partition: VALUES LESS THAN (value_list) where: column_list is an ordered list of columns that determines the partition to which a row or an index entry belongs. These columns are called the partitioning columns. The values in the partitioning columns of a particular row constitute that row’s partitioning key. value_ list is an ordered list of values for the columns in the column list. Only the VALUES LESS THAN clause is allowed. This clause specifies a non-inclusive upper bound for the partitions. All partitions, except the first, have an implicit low value specified by the VALUES LESS THAN literal on the previous partition. Any binary values of the partition key equal to or higher than this literal are added to the next higher partition. Highest partition being where MAXVALUE literal is defined. Keyword, MAXVALUE, represents a virtual infinite value that sorts higher than any other value for the data type, including the null value. The statement below creates a table sales_range_part that is range partitioned on the sales_date field using Oracle.

Partitioning in Data Warehouse | 93

CREATE TABLE sales_range_part (sales_man_id NUMBER(6), emp_name VARCHAR2(25), sales_amount NUMBER(10), sales_date DATE) PARTITION BY RANGE(sales_date) ( PARTITION sales_jan2006 VALUES LESS THAN(TO_DATE(’02/01/2006',’MM/DD/YYYY’)), PARTITION sales_feb2006 VALUES LESS THAN(TO_DATE(’03/01/2006',’MM/DD/YYYY’)), PARTITION sales_mar2006 VALUES LESS THAN(TO_DATE(’04/01/2006',’MM/DD/YYYY’)), PARTITION sales_apr2006 VALUES LESS THAN(TO_DATE(’05/01/2006',’MM/DD/YYYY’)),);

Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies to a partitioning key that you identify. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size. Hash partitioning is the ideal method for distributing data evenly across devices. Hash partitioning is a good and easy-to-use alternative to range partitioning when data is not historical and there is no obvious column or column list where logical range partition pruning can be advantageous. Oracle uses a linear hashing algorithm and to prevent data from clustering within specific partitions, you should define the number of partitions by a power of two (for example, 2, 4, 8). The statement below creates a table sales_hash, which is hash partitioned on the salesman_id field. data1, data2, data3, and data4 are tablespace names. CREATE TABLE sales_hash (salesman_id NUMBER(5), emp_name VARCHAR2(25), sales_amount NUMBER(10), week_no NUMBER(2)) PARTITION BY HASH(salesman_id) PARTITIONS 4 STORE IN (data1, data2, data3, data4);

List Partitioning
List partitioning enables you to explicitly control how rows map to partitions. You do this by specifying a list of discrete values for the partitioning column in the description for each partition. This is different from range partitioning, where a range of values is associated with a partition and with hash partitioning, where you have no control of the row-to-partition mapping. The advantage of list partitioning is that you can group and organize unordered and unrelated sets of data in a natural way. CREATE TABLE sales_list (salesman_id NUMBER(5), emp_name VARCHAR2(25), sales_area VARCHAR2(20), sales_amount NUMBER(10), sales_date DATE) PARTITION BY LIST(sales_area) ( PARTITION sales_south VALUES IN(‘Delhi’, ‘Mumbai’,’Kolkatta’), PARTITION sales_north VALUES IN (‘Bangalore’, ‘Chennai’, ‘Hyderabad’));

94 | Data Mining and Warehousing

Composite Partitioning
Composite partitioning combines range and hash partitioning. Oracle first distributes data into partitions according to boundaries established by the partition ranges. Then Oracle uses a hashing algorithm to further divide the data into subpartitions within each range partition. CREATE TABLE sales( s_productid NUMBER, s_saledate DATE, s_custid NUMBER, s_totalprice NUMBER) PARTITION BY RANGE (s_saledate) SUBPARTITION BY HASH (s_productid) SUBPARTITIONS 16 (PARTITION sal98q1 VALUES LESS THAN (TO_DATE(’01-APR-1998', ‘DD-MON-YYYY’)), PARTITION sal98q2 VALUES LESS THAN (TO_DATE(’01-JUL-1998', ‘DD-MON-YYYY’)), PARTITION sal98q3 VALUES LESS THAN (TO_DATE(’01-OCT-1998', ‘DD-MON-YYYY’)), PARTITION sal98q4 VALUES LESS THAN (TO_DATE(’01-JAN-1999', ‘DD-MON-YYYY’)))

Index Partitioning
You can choose whether or not to inherit the partitioning strategy of the underlying tables. You can create both local and global indexes on a table partitioned by range, hash, or composite methods. Local indexes inherit the partitioning attributes of their related tables. For example, if you create a local index on a composite table, Oracle automatically partitions the local index using the composite method.

EXERCISE
1. Why we need partition? Explain. 2. Explain different types of partition with suitable examples. 3. Write a short note on RAID technology.

It contains summarized data that the user group can easily understand.2 DATA MART A data warehouse is a cohesive data model that defines the central data repository for an organization. data is refreshed in near real time and used for routine business activity. your company’s data warehousing environment may differ slightly from what we are about to introduce. business unit or business function. Data Mart .An enterprise data warehouse provides a central database for decision support throughout the enterprise.Datamart is a subset of data warehouse and it supports a particular region. This is most useful for users to access data since a database can be visualized as a cube of several dimensions. client analysis tools. it requires a data warehouse. and other applications that manage the process of gathering data and delivering it to business users. Retrieving a collection of different kinds of data from a “normalized” . and apply. A data mart cannot stand alone.This has a broad enterprise wide scope. Data warehouses and data marts are built on dimensional data modeling where fact tables are connected with dimension tables. Each data mart is a collection of tables organized according to the particular requirements of a user or group of users.1 INTRODUCTION A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing. 3. process. but unlike the real entertprise data warehouse. a data warehouse environment often consists of an ETL solution. Enterprise Data Warehouse . A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions. Because each data warehousing effort is unique. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources. an OLAP engine. There are three types of data warehouses: 1. In addition to a relational/multidimensional database. 9. A data warehouse usually contains historical data that is derived from transaction data. A data mart is a data repository for a specific user group. 2. ODS (Operational Data Store) .CHAPTER 9 DATA MART AND META DATA 9.

and loans. and video tapes all next to each other. has millions of customers and the lines of business of the enterprise are savings. Hence the need to rearrange the data so they can be retrieved more easily. the requirements are almost certain to change once the user has seen the implications of the request. encoding structures. over the years. And however the marts are organized initially.96 | Data Mining and Warehousing warehouse can be complex and time-consuming. The following example explains how the data is integrated from source systems to target systems. variable measurements. So source systems will be different in naming conventions. it may not even make sense. ² Tools to convert models into data marts quickly. Consider a bank that has got several branches in several countries. 9. Data Warehouse Data Mart1 Data Mart2 Data Mart N Fig. This organization does not have to follow any particular inherent rules or structures. ² Sufficient facility with database modeling and design to produce new tables quickly. This means that the creation of data marts requires: ² Understanding of the business involved. . and physical attributes of data. The notion of a “mart” suggests that it is organized for the ultimate consumers — with the potato chips. ² Responsiveness to the user’s stated objectives.3 META DATA For Example: In order to store data. many application designers in each branch have made their individual decisions as to how an application and database should be built. Indeed.1 Data warehouse and data marts 9.

and datatypes are consistent throughout the target system. column names. where it stored. this process will be used initially to help define the data warehouse requirements and it will then be used iteratively during the life of the data warehouse to update and integrate new dimensions of the warehouse.e. attribute name. Example of Target Data (Data Warehouse) Target system Record #1 Record #2 Record #3 Attribute name Customer Application Date Customer Application Date Customer Application Date Column name CUSTOMER_ APPLICATION_DATE CUSTOMER_ APPLICATION_DATE CUSTOMER_ APPLICATION_DATE Datatype DATE DATE DATE Values 01112005 01112005 01112005 In the above example of target data. We intend to create a design process for creating a data warehouse through the development of a metadata system. the job of the analyst is much more difficult and the work required for analysis is compounded significantly. attribute names. This inconsistency in data can be avoided by integrating the data into a data warehouse with good standards. and how it is related to the business. 0) 11012005 DATE DATE 11012005 01NOV2005 In the aforementioned example. where it comes from. column name. A data warehouse stores current and historical data from disparate operational systems (i. how it relates to other information.Data Mart and Meta Data | 97 Example of Source Data System name Source System 1 Source System 2 Source System 3 Attribute name Customer Application Date Customer Application Date Application Date Column name CUSTOMER_ APPLICATION_DATE CUST_APPLICATION_ DATE APPLICATION_DATE Datatype Values NUMERIC (8. transactional databases) into one consolidated system where data is cleansed and restructured to support data analysis.. The topic of standardizing meta data across various products and applying a systems engineering approach to this process in order to facilitate data warehouse design is what this project intend to address. . datatype and values are entirely different from one source system to another. It describes the kind of information in the warehouse. Using a systems engineering approach. the user of the data warehouse must have access to meta data that is accurate and up to date. The meta data system we intend to create will provide a framework to organize and design the data warehouse. This is how data from various source systems is integrated and accurately stored into the data warehouse. Meta data is literally “data about data”. Without a good source of meta data to operate from. In order to be effective.

98 | Data Mining and Warehousing Understanding the role of the enterprise data model relative to the data warehouse is critical to a successful design and implementation of a data warehouse and its meta data management. The destination field needs to be described in a similar way to the source. Name. Destination . Destination Unique identifier. Type is the storage type of data. For the most part. Some specific challenges which involve the physical design tradeoffs of a data warehouse include: ² Granularity of data . A unique identifier is needed to help avoid any confusion between commonly used field names.refers to the breakup of data into separate physical units that can be handled independently. when moving it from source to data warehouse. Base data is used to distinguish between the original data load destination and any copies.refers to the level of detail held in the unit of data ² Partitioning of data .3. object). name. It will also depend on the number of source systems and the type of each system. type.1 Data Transformation and Load Meta data may be used during transformation and load to describe the source data and any changes that need to be made. It is designed to support analytical processing for the entire business organization. table name. Meta data itself is arguably the most critical element in effective data management. location (system. location describe where the data comes from. type. The data warehouse presents an alternative in data processing that represents the integrated information requirements of the enterprise. These trends and patterns are absolutely vital to the vision of management in directing the organization. the data that has accumulated has served only the immediate set of requirements for which is was initially intended. type. Location is both the system comes from and the object contains it. Effective tools must be used to properly store and use the meta data generated by the various systems. Name is its local field name. name. data has aggregated in the corporate world as a by-product of transaction processing and other uses of the computer. 9. Analytical processing looks across either various pieces of or the entire enterprise and identifies trends and patterns otherwise not apparent. ² Performance issues ² Data structures inside the warehouse ² Migration For years. Name is the designation that the field will have in the base data within the data warehouse. A unique identifier is required to avoid any confusion occurring between two fields of the same name from different sources. For each source data field the following information is required: Source field Unique identifier. Whether or not you need to store any meta data for this process will depend on the complexity of the transformations that are to be applied to the data.

This is needed by the warehouse manager to allow it to track and control all data movements. max. type) Constraints Name. aggregation). reference identifier) Aggregations are a special case of tables. description. type. syntax). Table name is the name of the table field will be part of. table (columns) To this end we will need to store meta data like the following for each field. language (module name. Field Unique identifier. average. type) Views Columns (name. Name is a unique identifier that differentiates this from any other similar transformations. type) Indexes Columns (name. The other attributes are module name and syntax. number. Aggregation Aggregation name. The functions listed below are examples of aggregation functions: Min. Meta data is needed for all of the following: Table Columns (name. field name. columns (column_name.3. Language attribute contains the name of the language that the transformation written in. Transformation(s) Name. Every object in the database needs to be described. 9. columns (column_name. sum . For each table the following information needs to be stored. such as clear.Data Mart and Meta Data | 99 type is the database data type. reference identifier. Transformation needs to be applied to turn the source data into the destination data. varchar.2 Data Management Meta data is required to describe the data as it resides in the data warehouse. Table Table name. and they require the following meta data to be stored.

the particular characteristics of the project vary. 3. 4. and sites of those systems. Inc. What is legacy system? Explain data management in meta data. EXERCISE 1. depending on the nature. Combining this with the mapping of each column in a data mart to an attribute of the data model provides direct mapping of original data to the view of them seen by users. Define meta data. has extended Oracle’s Designer/2000 product to allow for reverse engineering of legacy systems—specifically providing the facility for mapping each column of a legacy system to an attribute of the corporate data model. Moreover. Essential Strategies.100 | Data Mining and Warehousing Partitions are subsets of a table. partition key(reference identifier)) 9. . What is data mart? Give example. 2. table name (range allowed. technology.4 LEGACY SYSTEMS The task of documenting legacy systems and mapping their columns to those of the central data warehouse is the worst part of any data warehouse project. range contained. Partition Partition name. but they also need the information on the partition key and the data range that resides in the partition.

Hot Backup—Any backup that is not cold is considered to be hot. then treat it accordingly. Consider how these important issues affect backup plan: ² Realize that the restore is almost more important than the backup ² Create a hot backup ² Move your backup offline ² Perfect the connection between your production and backup servers ² Consider a warm backup ² Always reassess the situation . 10. These will vary from RDBMS to RDBMS. There are special requirements that need to be observed if the hot backup is to be valid. Process. system failure and the other unfrozen disasters can make your work vanish into digital nothingness in no time. In a multiinstance environment. Complete Backup—The entire database is backed up at the same time. Software. There are utilities that can backup entire partitions along with the operating system and everything else. If your data is precious to you. Partial Backup—Any backup that is not complete. all instances of that database must be shut down.CHAPTER 10 BACKUP AND RECOVERY OF THE DATA WAREHOUSE 10. It make sense backup your data on a regular basis . Network. Cold Backup—Backup that is taken while the database is completely shut down. The terminology comes from the fact that the database engine is up and running hence hot. These definitions apply to the back up of the database. Human errors.it is required to recover the data base and return to normal operation as quick as possible.1 INTRODUCTION One of the most important part is backup and recovery in a data warehouse. Hardware. This includes all database data files and the journal files.If such failure affects the operation of a database system .2 TYPES OF BACKUP The definitions below should help to clarify the issue.

and provides tight interfaces to industry-leading media management subsystems. and cancel-based recovery. full database recovery. There are also supplemental backup methods. you have not made a successful backup of the database. and recovery operations are maintained by the server in a recovery catalog and automatically used as part of these operations. This reduces administrative burden and minimizes the possibility of human errors. Backup and recovery problems are most likely to occur at this level. All computers will share the common pool of storage space. Furthermore. For other backup products designed for local backup. ² The backup and recovery technology is highly scalable. We allow you to back up any number of computers to your account. The major types of instance recovery are cold restore.3 BACKUP THE DATA WAREHOUSE Oracle RDBMS is basically a collection of physical database files. If you can’t afford this higher cost. when placed in their own tablespaces. Each type of backup has its advantages and disadvantages. time-based recovery. If you omit any of these files. . 10. This provides for efficient operations that can scale up to handle very large volumes of data. Even though most of the data comes from operational systems originally.102 | Data Mining and Warehousing 10. data warehouses often contain external data that.1 Create a Hot Backup The best solution for critical systems is always to restore the backup once it has been made onto another machine. it is removed from the operational databases. may have to be purchased. You only pay a small one time fee for each additional computer. restore. Individual partitions. and online redo log files. Hot backups take backups while the database is functioning. Keeping a local backup allows you to save data way in excess of the storage that you purchase for online backup. safer operations at terabyte scale. you should at least do periodic test restores. Open platforms for more hardware options & enterprise-level platforms. but it may still exist in the data warehouse. enabling operations to be completed efficiently within times proportional to the amount of changes. rather than the overall size of the database.3. ² Oracle includes support for incremental backup and recovery. can be backed up and restored independently of the other partitions of a table. Cold backups shut down the database. you cannot always rebuild data warehouses in the event of a media failure (or a disaster). and recovery tasks that enables simpler. such as exports. Some of the highlights are: ² Details related to backup. ² Backup and recovery operations are fully integrated with partitioning. Oracle provides a server-managed infrastructure for backup. This will also provide you with a “hot” backup that is ready to be used. if lost. Putting in place a backup and recovery plan for data warehouses is imperative. control files. As operational data ages. restore. should your production database ever crash. Three types of files must be backed up: database files.

3 Tape Backup Storage ² ² ² ² ² Tape backup of customers’ data on SAN Tape backup of Veritas disk storage Tape backup of customers’ data direct from network Tape archiving Off-site tape storage 10. If you are working on a spreadsheet over a period of several days. fairly recent but not as recent as your hot backup. Consider what happens if someone accidentally deletes the wrong data. it’s best to keep a snapshot of every day’s work. To make this task easier. another hard drive.3. you should consider creating a copy of the database that is kept “warm”—i. can you relax? Sort of. or another computer’s hard drive on the network. this actually happened at a client of mine. Most users . you might set the hot backup within a minute of the primary server. (I’m not making this up.3. For example. The minute data is changed on the primary server. Simple data backup can be done by saving your data to multiple locations. you propagate that change to your backup server. Hot backups are great because they are extremely current.3. If you use your computer for business or personal use. If your house burns down and your backups are sitting on the shelf next to your computer. they too will be lost. If every second of downtime is expensive. to enable users to make backup copies of their data.2 Consider a Warm (Not Hot) Backup After you install hot backups and build a solid network. But that very same close binding also comes with its own dangers. Ideally you should keep a copy of this data at some location other than your home or place of business. This is easily done with an online backup service. hardware or human error. you only have to restore the last several transaction logs rather than start from scratch.) The inadvertent deletion will immediately get propagated to the backup.e. but deliberately keep your warm backup 30 minutes behind your primary. You need to make backups of your important data to prevent a total loss in the event of any kind of system failure or human error. copying the data from its original location to removable media. 10. a leader in the backup business. That way.Backup and Recovery of the Data Warehouse | 103 10. too. You also need to have multiple versions of your information backed up.. you need to always have a backup of your data files in the event of software. say an administrator accidentally drops the accounts table in the database.4 What is SureWest Broadband’s Online Backup Service? It is Internet-based services that allows computer users to routinely backup and recover their data using a secure connection and safely store it in the SureWest Broadband Data Storage Warehouse until it is needed. SureWest Broadband has licensed Backup Software from NovaStor. For example. except there’s one thing you still need to worry about.

it is truncated and reused. and simple. ² Simple recovery provides the highest performance and lowest log space consumption. data is recoverable only to the last (most recent) full database or differential backup. in this model. The patching process involves the comparison of two different versions of the same file and extracting the differences between the files. you are deciding among the following business requirements: ² Performance of large-scale operations (for example. It is imperative that the risks be thoroughly understood before choosing a recovery model.5 What is a FastBIT Backup? The FastBIT patching process is the core technology behind NovaStor’s speedy backup program. so a new type of backup system was needed. It’s known as online backup. There are three recovery models: full. Online backup software is designed to routinely copy your important files to a private. When using the simple recovery model. and protection against data loss. ² Bulk-logged recovery provides higher performance and lower log space consumption for certain large-scale operations (for example.4 DATA WAREHOUSE RECOVERY MODELS The model chosen can have a dramatic impact on performance. the loss of committed transactions) . The recovery model of a new database is inherited from the model database when the new database is created. After the log space is no longer needed for recovery from server failure (active transactions). the transactions are truncated from the log upon checkpoint. If you have a working Internet connection on your computer. create index or bulk copy). bulk-logged. you can use SureWest Broadband’s Online Backup Service to keep your important files safe from disaster on a daily basis. but it does so with significant exposure to data loss in the event of a system failure. 10. Trade-offs are made depending on the model you chose. the amount of exposure to data loss varies with the model chosen. Each recovery model addresses a different need. they are saved into a new file and compressed into what is known as a patch. space utilization (disk or tape). The model for a database can be changed after the database has been created. Transaction log backups are not usable for recovering transactions because. secure location on the Internet by transmitting your data over your existing Internet connection.104 | Data Mining and Warehousing do not want to be bothered with managing the tasks necessary for maintaining their own backups. especially during data loads. It does this at the expense of some flexibility of point-in-time recovery. All of your data is encrypted before it is sent to our storage network to protect your privacy. The patch file is often 85% to 1010.10% smaller than the file which the patch was extracted from originally. When you choose a recovery model. This creates the potential for data loss. The trade-offs that occur pertain to performance. 10. Knowledgeable administrators can use this recovery model feature to significantly speed up data loads and bulk operations. When the differences are extracted from the two files. index creation or bulk loads) ² Data loss exposure (for example.3. However. ² Full recovery provides the most flexibility for recovering databases to an earlier point in time.

Then changes must be redone. Work loss exposure Changes since the most recent database or differential backup must be redone Normally none. Can recover to an arbitrary point in time (for example. 2. Reclaims log space to keep space requirements small. Can recover to any point in time. or bulk operations occurred since the most recent log backup. Recover to point in time? Can recover to the end of any backup. Recovery model Simple Benefits Permits high-performance bulk copy operations. Then changes must be redone. Permits high-performance bulk copy operations. What are the types of backup? 3. If the log is damaged. Minimal log space is used by bulk operations. changes since the most recent log backup must be redone. prior to application or user error). EXERCISE 1. no work is lost. No work is lost due to a lost or damaged data file. How will you backup the data warehouse? Explain.Backup and Recovery of the Data Warehouse | 105 ² Transaction log space consumption ² Simplicity of backup and recovery procedures Depending on what operations you are performing. Full Bulk-Logged If the log is damaged. . Otherwise. one model may be more appropriate than another. Can recover to the end of any backup. consider the impact it will have. Explain data warehouse recovery model. The following table provides helpful information. Before choosing a recovery model. changes since that last backup must be redone.

Fig. These steps are prioritized in order of diminishing returns: steps with the greatest effect on performance appear first. Tuning is an iterative process. Different tuning strategies vary in their effectiveness. Furthermore. so additional passes through the tuning process may be useful. Performance gains made in later steps may pave the way for further improvements in earlier steps. For optimal results. systems with different purposes. resolve tuning issues in the order listed: from the design and development phases through instance tuning. reassess your database performance and decide whether further tuning is necessary. 11.1 illustrates the tuning method: . likely require different tuning methods. Step Step Step Step Step Step Step Step Step Step 1: Tune the Business Rules 2: Tune the Data Design 3: Tune the Application Design 4: Tune the Logical Structure of the Database 5: Tune Database Operations 6: Tune the Access Paths 7: Tune Memory Allocation 8: Tune I/O and Physical Structure 9: Tune Resource Contention 10: Tune the Underlying Platform(s) After completing these steps. such as online transaction processing systems and decision support systems.1 INTRODUCTION A well planned methodology is the key to success in performance tuning. 11.CHAPTER 11 PERFORMANCE TUNING AND FUTURE OF DATA WAREHOUSE 11. therefore.2 PRIORITIZED TUNING STEPS The following steps provide a recommended method for tuning an Oracle database.

The tuning method . 11.1.Performance Tuning and Future of Data Warehouse | 107 1 Tune the business rules 2 Tune the data design 3 Tune the application design 4 Tune the logical structure of the database 5 Tune database operations 6 Tune the access paths 7 Tune memory allocation 8 Tune the I/O and physical structure 9 Tune the resource contention 10 Tune the underlying platform(s) Fig.

in step 5 you may rewrite some of your SQL statements. 7. Technical challenges 42% Data management 40% Hardware. which is tuned in step 7.108 | Data Mining and Warehousing Decisions you make in one step may influence subsequent steps. Although the figure shows a loop back to step 1. For example. 2. 4.3 CHALLENGES OF THE DATA WAREHOUSE 1. 5. 3. 6. depends on the size of the buffer cache. 6. Also.5 FUTURE OF THE DATA WAREHOUSE ² Peta byte system (1 PB = 1024 TB) ² Size of the database grows to a Very Large Database (VLDB) to ELDB (Extremely Very Large Data Base) ² Integration and manipulation of non-textual (multimedia) and textual data ² Building and running ever-larger data warehouse system ² Web enabled application grows ² Handle vast quantities of multiformat data ² Distributed databases will be used ² Cross database integrity ² Use of middleware and multiple tiers .4 BENEFITS OF DATA WAREHOUSING 1. 4. 2. which is tuned in step 8. 5. 11. 3. These SQL statements may have significant bearing on parsing and caching issues addressed in step 7. 7. Better data quality 63% Better competitive data 61% Direct end-user access 32% Cost reduction/higher productivity 32% More timely data access 24% Better availability of systems 16% Supports organization change 8% 11. disk I/O. you may need to return from any step to any previous step. software staffing 32% Selling to management 26% Training users 16% Managing expectations 11% Managing change 8% 11.

Explain tuning the data warehouse.I) l Reports [Organization . What is the future scope of data warehouse? . Business and Internal] Fig. 11. 3.2 Data warehouse architecture diagram EXERCISE 1.Performance Tuning and Future of Data Warehouse | 109 11.6 NEW ARCHITECTURE OF DATA WAREHOUSE Business Modeling l Business Process l Data Models Data Modeling Source data Model l Data Warehouse Models l Source Design * RDBMS * Source Systems * Files * ERP [SAP] ETL Procedure Designing Extraction. 2. Transformation and Load Process Data Warehouse Data Mart1. Data Mart 2… Data Mart n Business Intelligence (B. Draw the architecture diagram of data warehouse.

Cleansing: The process of resolving inconsistencies and fixing the anomalies in source data. X2. For example. Classification: A subdivision of a set of examples into a number of classes.Xn =>y. This means that the attributes X1. an attribute is a column in a dimension that characterizes elements of a single level. Aggregation: The process of consolidating data values into a single value. Cold Backup: A database backup taken while the database is shut down. Clickstream Data: Data regarding web browsing Cluster: A tightly coupled group of SMP machines. Basically. For example. The data can then be referred to as aggregate data. and produces some value or set of values as output. 0 or 1. X2. quarter and yearly sales.…. Byte: A standard unit of storage. Business Intelligence Tools: Software that enables business users to see and use large amounts of complex data. A byte consists of 8 bits. See also ETL. with the database is shut down. a byte is a sufficient to store a single character. Association Rules: rules that state a statistical correlation between the occurrence of certain attributes in a database table. The general form of an association rule is X1. A unit of computer storage that can store 2 values.Xn predict Y. and aggregate data is synonymous with summary data. Bit(s): acronym for binary digit. . Coding: Operations on a database to transform or simplify data in order to prepare it for a machine-learning algorithm. unit sales of a particular product could be aggregated by day. In RDBMS. Attribute: A descriptive characteristic of one or more levels. and so on. Attributes represent logical groupings that enable end users to select data based on like characteristics.…. typically as part of the ETL process. sales data could be collected on a daily basis and then be aggregated to the week level. Algorithm: Any well-defined computational procedure that takes some value or set of values as input. the week data could be aggregated to the month level.Appendix A GLOSSARY Aggregate: Summarized data. month. Aggregation is synonymous with summarization.

Glossary | 111

Common Warehouse Meta Data (CWM): A repository standard used by Oracle data warehousing, decision support, and OLAP tools including Oracle Warehouse Builder. The CWM repository schema is a standalone product that other products can share each product owns only the objects within the CWM repository that it creates. Confidence: Given the association rule X1, X2…Xn=>Y, the confidence is the percentage of records for which Y holds, within the group of records for which X1, X2,…Xn holds. Data: Any symbolic representation of facts or ideas from which information can potentially be extracted. Data Selection: The stage of selecting the right data for a KDD process. Data Source: A database, application, repository, or file that contributes data to a warehouse. Data Mart: A data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, the data can be derived from an enterprise-wide data warehouse. In an independent data mart, data can be collected directly from sources. See also data warehouse. Data Mining: The actual discovery phase of a knowledge discovery process. Data Warehouse: A relational database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources. In addition to a relational database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. See also ETL, OLAP. Denormalize: The process of allowing redundancy in a table so that it can remain flat. Contrast with normalize. Derived Fact (or Measure): A fact (or measure) that is generated from existing data using a mathematical operation or a data transformation. Examples include averages, totals, percentages, and differences. Decision Trees: A decision tree consists of nodes and branches, starting from a single root node. Each node represents a test or decision. Depending on the outcome of the decision, one chooses a certain branch and when a terminal node(or leaf) is reached a decision on a class assignment is made. Deep Knowledge: Knowledge that is hidden within a database and can only be recovered if one is given certain clues. Dimension: A structure, often composed of one or more hierarchies, that categorizes data. Several distinct dimensions, combined with measures, enable end users to answer business questions. Commonly used dimensions are customer, product, and time. Dimension Data: Data stored externally to the fact data, which contains expanded information on specific dimensions of the fact data. Dimension Table: Part of star schema. A table which holds dimension data.

112 | Data Mining and Warehousing

Drill: To navigate from one item to a set of related items. Drilling typically involves navigating up and down through the levels in a hierarchy. When selecting data, you can expand or collapse a hierarchy by drilling down or up in it, respectively. See also drill down, drill up. Drill Down: To expand the view to include child values that are associated with parent values in the hierarchy. (See also drill, drill up) Drill Up: To collapse the list of descendant values that are associated with a parent value in the hierarchy. Enrichment: A stage of the KDD process in which new data is added to the existing selection. Element: An object or process. For example, a dimension is an object, a mapping is a process, and both are elements. ETL: Extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into a data warehouse. The order in which these processes are performed varies. Note that ETT (extraction, transformation, transportation) and E T M (extraction, transformation, move) are sometimes used instead of ETL. (See also data warehouse, extraction, transportation) Extraction: The process of taking data out of a source as part of an initial phase of ETL Fact Data: The fact data is the basic core data that will be stored in the data warehouse. It is called fact data because it will correspond to business-related facts such as call record data for a telco, or account transaction data for a bank. Fact Table: Part of a star schema. The central table that contains the fact data. File System: An area of space set aside to contain files. The file system will contain a root directory in which other directories or files can be created. File-to-table Mapping: Maps data from flat files to tables in the warehouse. Genetic Algorithm: A class of machine-learning algorithm that is based on the theory of evaluation. Granularity: The level of detail of the facts stored in a data warehouse. Hidden Knowledge: Knowledge that is hidden in a database and that cannot be recovered by a simple SQL query. More advanced machine-learning techniques involving many different SQL queries are necessary to find hidden knowledge. Hierarchy: A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a Time dimension, a hierarchy might be used to aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals. Inductive Learning: Learning by generalizing form examples. Information: Meaningful data. KDD: Knowledge Discovery in Databases: The non-trivial extraction of implicit, previously unknown and potentially useful information from data. Knowledge: A collection of interesting and useful patterns in a database.

Glossary | 113

Learning: An individual learns how to carry out a certain task by making a transition from a situation in which the task cannot be carried out to a situation in which the same task can be carried out under the same circumstances. Machine Learning: A sub-discipline of computer science that deals with the design and implementation of learning algorithms. Materialized View: A pre-computed table comprising aggregated and/or joined data from fact and possibly dimension tables. Also known as a summary or aggregate table. Meta Data: Data about data. Meta data is data that describes the source, location and meaning of another piece of data. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains meta data. Examples include: for data, the definition of a source to target transformation that is used to generate and populate the data warehouse; for information, definitions of tables, columns and associations that are stored inside a relational modeling tool; for business rules, discount by 10 percent after selling 1,000 items. Multi-dimensional Knowledge: A table with n independent attributes can be seen as an ndimensional space. Some regularities represented in such a table cannot easily be detected when the table has a standard two-dimensional representation. WE have to explore the relationship Neural Networks: A class of learning algorithm consisting of multiple nodes that communicate through their connecting synapses. Neural networks initiate the structure of biological nervous systems. Normalize: In a relational database, the process of removing redundancy in data by separating the data into multiple tables. Contrast with denormalize. Operational Data Store (ODS): The cleaned, transformed data from a particular source database. OLAP: Online analytical Processing OLTP: Online transaction processing. An OLTP system is designed to handle a large volume of small predefined transactions. Online Backup: A database backup taken while the database is open and potentially in use. The RDBMS will need to have special facilities to ensure that data in the backup is consistent. Overnight Processing: Any processing that needs to be done to accomplish the daily loading, cleaning, aggregation, maintenance and backup of data, in order to make that new data available to the users for the following day. Patterns: Structures in a database that are statistically relevant. Prediction: The result of the application of a theory or a rule in a specific case. Query Tools: Tools designed to query a database. Shallow Knowledge: The information stored in a database that can be retrieved with a single query. Snowflake Schema: A variant of the star schema where each dimension can have its own dimensions.

Visualization Techniques: A class of graphic techniques used to visualize the contents of a database. Supervised Algorithms: Algorithms that need the control of a human operator during their execution. Versioning :The ability to create new versions of a data warehouse project for new requirements and changes. Some dimensions may be represented in both forms to cater different query requirements. There will be a foreign key in the fact table for each dimension table.114 | Data Mining and Warehousing Star Schema: A logical structure that has a fact table in the centre with dimension table radiating off of this central table. Transportation: The process of moving copied or transformed data from a source to a data warehouse. Unsupervised Algorithms: The opposite of supervised algorithms. Star Flake Schema: This is a hybrid structure that contains a mix of star and snowflake schemas. X2. Validation: The process of verifying metadata definitions and configuration parameters. Statistical Techniques: Techniques used in statistics.…Xn => Y. .Xn and Y both hold. the support is the percentage of records for which X1…. Support: Given the association rule X1. The fact table will generally be extremely large in comparison to its dimension tables. Verification: The validation of a theory on the basis of a finite number of examples.

(a) Integrated and subject oriented (b) Time variant (c) Non volatile (d) All the above 2.Appendix B MULTIPLE-CHOICE QUESTIONS 1. (b) Removes duplication (c) Reconciles differences between various styles of data collection (d) All the above 5. Data cleansing (a) The process of resolving inconsistencies and fixing the anomalies in source data. Data which is not updated or changed in any way once they enter the data warehouse are called: (a) Non-volatile (b) Integrated (c) Cluster (d) Coding 4. A data warehouse is a……….collection of data in support of management’s decisionmaking process. SMP means (a) Systematic multi-processing (b) Stage multi-processing (c) Symmetric multi-processing (d) None of the above . Who is the originator of the data warehousing concept? (a) Johan McIntyre (b) Bill Inmon (c) Bill Gates (d) Chris Erickson 3.

.... 8 (b). 4 (a)...... . 10 (a).. which are small in size. (a) Range & hash (b) Hash & index (c) Index and range (d) None of the above Answer 1 (d).. 6 (d)... OLTP means (a) Off line transaction processing (b) Off line transaction product (c) On line transaction product (d) On line transaction processing 7. and ... 9 (d). (a) Business houses (b) Workgroups or departmental warehouses (c) Data structures (d) Company warehouses 9. Composite partitioning combines .116 | Data Mining and Warehousing 6. 5 (c)....... partitioning. 3 (a). 2 (b).. OLAP means (a) Off line adaptive processing (b) Off line adaptive product (c) On line adaptive product (d) On line analytical processing 10.. 7 (b). Data marts are .. Data we use to run our business is (a) Historical data (b) Operational data (c) Statistical data (d) Transactional data 8..

Insert and update information in one step). consistency and integrity (the “ACID” tests). It is a subject oriented. External Tables allows users to run SELECT statements on external data files (with pipelining support). inquiry only collection of operational data. Many new features in the Oracle9i database will also make ETL processing easier. What is the difference between a data warehouse and a data mart? There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. subject-oriented. but . What is the difference between a W/H and an OLTP application? Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). ROLAP stands for Relational OLAP. What is a data warehouse? A data warehouse is the “corporate memory”. while Data Marts is used on a business division/department level. What is ETL/How does Oracle support the ETL process? ETL is the Data Warehouse acquisition processes of extracting. As a result. 3. ROLAP. In general a Data Warehouse is used on an enterprise level. Users see their data organized in cubes with dimensions. data warehouses are designed differently than traditional relational databases.Appendix C FREQUENTLY ASKED QUESTIONS AND ANSWERS 1. As a result. Warehouses are time referenced. For example: New MERGE command (also called UPSERT. MOLAP and HOLAP are specialized OLAP (Online Analytical Processing) applications. point-in-time. Oracle supports the ETL process with their “Oracle Warehouse Builder” product. 2. transforming (or transporting) and loading (ETL) data from source systems into the data warehouse. What is the difference between OLAP. Since a data warehouse is not updated. MOLAP and HOLAP? ROLAP. A data mart only contains the required subject specific data for local analysis. data warehouses are designed differently than traditional relational databases. OLTP databases are designed to maintain atomicity. 4. these constraints are relaxed. non-volatile (read only) and integrated. Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP).

9. allowing them to slice and dice the data to answer business questions. Snow flake schema is similar to the star schema. Normal relational databases store data in two-dimensional tables and analytical queries against them are normally very slow. What is a star schema? Why does one design this way? Why? A single “fact table” containing a compound primary key. 7. An ODS may contain 30 to 90 days of information. It can be used to represent hierarchies of information. 6. When designed correctly. 8. It allows for the highest level of flexibility of metadata Low maintenance as the data warehouse matures Best possible performance. How can Oracle materialized views be used to speed up data warehouse queries? With “Query rewrite” (QUERY_REWRITE_ENABLED=TRUE in INIT. Oracle designer.” and additional columns of additive. 5. Seagate Software’s Holos is an example HOLAP environment. 10. HOLAP stands for Hybrid OLAP. Express objects. Its sources include legacy systems and it contains current or near term data.118 | Data Mining and Warehousing the data is really stored in a Relational Database (RDBMS) like Oracle. an OLAP database will provide must faster response times for analytical queries. numeric facts. When should one use an MD-database (multi-dimensional database) and not a relational one? Data in a multi-dimensional database is stored as business people views it. Users see their data organized in cubes with dimensions. What is the difference between an ODS and a W/H? An ODS (operational data store) is an integrated database of operational data. In a MOLAP system lot of queries have a finite answer and performance is usually critical and fast. Data warehouses group data by subject ather than by activity (subject-oriented). It normalizes dimension table to save data storage space. The RDBMS will store data at a fine grain level. Other properties are: Non-volatile (read only) and Integrated. Oracle express. What Oracle tools can be used to design and build a W/H? Data Warehouse Builder (or Oracle data mart builder). Discoverer is an ad-hoc end-user query tool. response times are usually slow. MOLAP stands for Multidimensional OLAP. When should you use a STAR and when a SNOW-FLAKE schema? The star schema is the simplest data warehouse schema.ORA) Oracle can direct . it is a combination of both worlds. 11. What is the difference between Oracle express and Oracle discoverer? Express is an MD database and development environment. but the data is store in a Multi-dimensional database (MDBMS) like Oracle Express Server. with one segment for each “dimension. In a HOLAP system one will find queries on aggregated data as well as on detailed data. A warehouse typically contains years of data (time referenced).

Frequently Asked Questions & Answers | 119 queries to use pre-aggregated tables instead of scanning large tables to answer complex queries. Using this feature one can easily “detach” a tablespace for archiving purposes. ETL processing using Data stage 7. OLAP processing using cognos. because they store summarized data. ETL processing using Oracle DataPump. What tools can be used to design and build a W/H? ETL processing using informatica . . we mean the lowest level of information that will be stored in the fact table.5. What is drill down and drill up? Drill down—To expand the view to include child values that are associated with parent values in the hierarchy drill up—To collapse the list of descendant values that are associated with a parent value in the hierarchy. The query optimizer can direct queries to summary/ roll-up tables instead of the detail data tables (query rewrite). designing data W/H using ERWIN. Data is automatically directed to the correct partition based on data ranges or hash values. By granularity. One can also use this feature to quickly move data from an OLTP database to a warehouse database. Materialized views in a W/H environments is typically referred to as summaries. 15. business objects. 13. This will dramatically speed-up warehouse queries and saves valuable machine resources. Oracle materialized views can be used to preaggregate data. 12. What is granularity? The first step in designing a fact table is to determine the granularity of the fact table. 14. This constitutes two steps: Determine which dimensions will be included and determine where along the hierarchy of each dimension the information will be kept. What Oracle features can be used to optimize the warehouse system? From Oracle8i One can transport tablespaces between Oracle databases. The determining factors usually go back to the requirements. exp/imp and SQL*LOADER. Oracle parallel query can be used to speed up data retrieval by using multiple processes (and CPUs) to process a single task. Data partitioning allows one to split big tables into smaller more manageable sub-tables (partitions).

Appendix D

MODEL QUESTION PAPERS
MCA DEGREE EXAMINATIONS
Data mining Time: 3 hrs 5 × 15 = 75

1. a. What do you mean by expanding universe of data? b. Explain the term “data mining” (or) 2. a. Explain, what are the practical applications of data mining. b. Distinguish between data mining and query tools. 3. With a suitable diagram explain the architecture of data warehouse. (or) 4. What are the challenges in creating and maintaining a data warehouse? Explain in detail with suitable example. 5. What is OLAP? Explain the various tools available for OLAP. (or) 6. Write short notes on (i) K–nearest neighbour, (ii) Decision trees. 7. Explain the various steps involved in knowledge discovery process. (or) 8. Write short notes on (i) Data selection, (ii) Data cleaning, (iii) Data enrichment. 9. Explain about data compression. (or) 10. Explain the term denormalization.

Model Question Papers | 121

MCA DEGREE EXAMINATIONS
Data mining Time: 3 hrs 5 × 15 = 75

1. a. What is data mining and data warehousing? b. How computer system that can learn? c. Discuss data mining in marketing. (or) 2. a. Explain the practical applications of data mining. b. Write an account on machine learning and methodology of science. c. How query language suits to data mining operation? 3. a. Write in detail about manufacturing machine. b. Write short notes on cost justification. (or) 4. a. Give an account of the need for decision support-system for business and scientific applications. 5. Write in detail about the use of artificial neural network and genetic algorithm for data mining. (or) 6. What are OLAP tools? Explain various tools available for OLAP. 7. What are the different kinds of knowledge? How to develop a knowledge discovery database project? (or) 8. Explain the data selection, cleaning, enrichment, coding and analysis of knowledge discovery process. 9. What are the primitives of data mining? Explain the data mining primitives in SQL. (or) 10. Explain the following terminologies: Learning as compression of data set, Information content of a message, Noise and redundancy, Fuzzy database, Relational databases.

122 | Data Mining and Warehousing

MCA DEGREE EXAMINATIONS
Data mining Time: 3 hrs 5 × 15 = 75

1. a. What is meant by data mining? b. Compare data mining with query tools. (or) 2. a. Explain the concept of learning. b. Explain the complexity of search space with an example of a kangaroo in mist. 3. What is data warehouse? Why do we need it? Explain briefly. (or) 4. Explain the difficulties of a client/server and data warehousing. 5. Explain the knowledge discovery process in detail. (or) 6. Write a note on neural networks. 7. a. Explain the goal of KDD. b. What are the ten golden rules for reliable data mining? (or) 8. Write short notes on (i) Data selection, (ii) Data cleaning, (iii) Data enrichment. 9. Explain briefly the reconstruction foreign key relationship with suitable example. (or) 10. Explain the following: (i) Denormalization, (ii) Data mining primitives.

B. Tech DEGREE EXAMINATIONS
Data mining and warehousing Time: 3 hrs 5 × 15 = 75

1. What is meant by data mining? Compare data mining and query tools. Explain in detail about data mining in marketing. (or) 2. Explain in detail the knowledge discovery process. 3. Explain in detail about self learning computer systems and concept learning. (or)

12. Discuss in detail data mining and data warehouse. How do you test the data warehouse? Explain. Write in detail about knowledge discover in database environment. 14. B. Discuss in detail visualization techniques. Discuss about system and data warehouse process managers. (or) 10. cleaning. (or) 6. Explain how you can true and test the data warehouse. Write in detail about data mining and query tools. 7. . Explain in detail self learning computer systems. enrichment and coding. 11. 9. 15. 9. Explain about service level agreement and operating the data warehouse. Define data mining? Write in detail about information and production factor. 5. 6. Explain in detail hardware architecture. Discuss in detail about database design schema and partitioning strategy. Discuss about system process architecture design and database schema. Write in detail about meta data and system and data warehouse process managers. Explain in detail data warehouse features. Tech DEGREE EXAMINATIONS Data mining and warehousing Time: 3 hrs 5 × 15 = 75 1. 10. 7. Explain about visualization techniques and OLAP tools. Write in detail about data selection. Explain about capacity planning and tuning the data warehouse. 5. 4.Model Question Papers | 123 4. 3. backup and recovery. 13. Explain in detail hardware physical layout security. (or) 8. 8. Explain backup and recovery. Discuss how one can operate the data warehouse. 2. Explain in detail the physical layout security.

with related case studies. List the steps involved in operating the data warehouse. Explain. 8. 7. With relevant examples discuss the need for backup and recovery with respect to a data warehouse. List the basic hardware architecture for a data warehouse. Data mining plays a role in self learning computer systems. What is meta data? Discuss with an example. Discuss with simple case studies the role of neural networks in data mining. data mining as step in the process of knowledge discovery. Justify the above statement. Give the necessary description also. the procedure and the need for testing the data warehouse. With suitable examples and relevant description justify the above statement. 2. Explain with relevant diagrams the steps involved in designing a database schema. What is data warehouse tuning? Discuss the need for data warehouse tuning. b. 4. Differentiate between data mining and data warehousing. a. Explain with neat sketch. With a neat sketch explain the architecture of a data warehouse “A data warehouse contains subject oriented data”. What is market-basket analysis? Explain with an example. b. 5. With relevant examples explain how association rules can be mined from large data sets. explain the architecture of a data mining system. a. in detail. 10. 3. What are the different security issues in data warehousing? 9. Tech DEGREE EXAMINATIONS Data mining and warehousing Time: 3 hrs 1. 5 × 15 = 75 . With a neat sketch.124 | Data Mining and Warehousing B. 6. What are decision trees? Discuss their applications.

1994. MIT Press. Data Warehousing in the Real World. Han J. Pieter Adriaans Dolf Zantinge. Hamilton J.. (Eds. NY: Oxford University Press. Neural Networks for Pattern Recognition. Smyth P. Piatetsky-shapiro G. Barquin & Herbert A. Churchill R. Ramon C. NJ: Prentice Hall PTR. Addison Wesley. Fundamental of Database Systems. 1995. Data Mining. Pearson Education. Brin S. 1971. Introduction to Time Series and Forecasting. Statistical Analysis of Time Series. Elmasri R. Tsur S.B. Doebbeling N. 1994 .1996.. Morgan Kaufman. and Uthurusamy K. Anderson. Release 1 (9. NY: Springer– Verlag .W. Principles and Practice of Public Health Surveillance. Dunhan.. Navathe S. Md: Williams and Wilkins.. 1999.C.. Ullman J. NY: John Wiley & Sons. Addison Wesley Pub. The Data Warehouse Toolkit..E. Using and Managing the Data Warehouse. Sam Anahory Dennis Murray. Kamber M. Data Warehouse Design. Data Mining: Concepts and Techniques. Bishop. Paul Lane. Pearson Education. (Eds. 1994.. Margaret H. MIT Press. Building.) (1996) Advance in Knowledge Discovery and Data Mining. 2000. 1993. 1996.0. In: Wenzel RP (Ed)... S. Prevention and Control of Nosocomial Infections. (1988) Improving the Convergence of Back Propagation Learning with Second Order Methods. M. Englewood Cliffs. Data Mining for Association Rules and Sequential Patterns. Mozer. 1988. MIT Press. John Wiley & Sons. Davis. In: ACM Press.D. NY: Springer-Verlag. Cristopher M. Data Warehousing Guide. et al. 2000.. Motwani R. 2004. 1996. Baltimore. Jean-mare Adamo. 2000. T. Part I and Part II.M.1) Oracle Corporation. Princeton University Press. Teutsch S. A. Brockwell and Richard A. 2000. Eds. Inman.. et al.M. Weigend and N. Kimball R. . Gershenfeld). Epidemics: Identification and Management. Becker. (1993) Neural Net Architecture for Temporal Sequence Processing.. Data Mining Introductory and Advanced Topics. Dynamic Itemset Counting and Implications Rules for Market Basket Data. (1994) Time Series Analysis. 1993. Predicting Future and Understanding the Past (Eds. 1999. Oracle9i. 1997.).D. NY: Oxford University Press. Edelstein.BIBLIOGRAPHY BOOKS Fayyad U. Peter J.

http://www. 1994. Aug. (1991).htm http://www. WEBSITES http://cs.htm.arborsoft.ic. 11. Artificial Neural Networks Soulie.. IBM System Journal. R. Noboru et al. Datamation.gov. and Simoudis. 39. Bustamente.felk. Vol 41(5) March 15.nih.. E... 1996. Murata. & Sorenson. G. Francois (1991).cz/~xobitko/ga/ http://www.com/larryg/articles. 1995. G.uk/complications/ESRD http://www. 605–615. Bill. No.ac. Decision Support at Lands’ End – An Evolution. Artificial Neural Networks.html http://www-east. Brachman.com t t t .126 | Data Mining and Warehousing JOURNALS AND PUBLICATIONS Radding. T.J. Communications of the ACM. Support Decision Makers with a Data Warehouse. The Right Server for Your Data Warehouse. Commandeering Mainframe Database for Data Warehouse Use.shtm http://www.com/ida/browse/96-1/ida96-1. Vol. 1994.doc. Vol. Elaine.ncbi. Phyllis and Inmon.lancet2000.cvut.nlm.generation5. November. K.. Datamation Vol 41(5) March 15. W. Vol 33(2).elsevier. Alan.com/OLAP. Koslow. http://pwp. Vol 1(8).org http://www. G. Kloesgen. Piatetsky-Shapiro.html. Application Development Trends. 1995.uk/~nd/surprise_96/journal/vol4/tcw2/report. Appleton.starnetinc.org/aisolutions/ga_tsp. (1996) Mining Business Databases. 1. Khabaza.olapcouncil.

Index A Aggregation–99 Algorithms–11 Applications–2. 10. 3. 20. 74 Foreign key–79 Frequent itemsets–47 D Data–66 Data cleaning–16 Data mining–2. 78 Association rules–46 Apriori algorithm–47 Data mart–95 Database–11. 54 Apriori–48 Architecture–6. 79 Destination–99 Dimension table–75. 32 Gain–27 Giga byte–64 . 61 Cluster–37 Clustering–34 Coding–18 Cold backup–101 Constraints–79. 36. 15. 102 C CART–62 Confidence level–47 Classification–23 Cleaning–16 Clementine–3. 19 Data selection–16 Data warehouse–5. 76 Dimensions–74 Distance–35 Domain consistency–17 DMQC–60 B Backup–101. 65 DB miner–61 Decision tree–24 De-duplication–17 Deep knowledge–12 Dendogram–39 Design–72. 99 Crossover–32 E ELDB–108 Encoding–27 Enrichment–17 Entropy–26 Exabyte–64 F Fact table–83. 6 G Genetic algorithm–11.

102 O P Parity–87 Partitioning–86. 68 OLAP server–68 OLAP tools–5 Oracle–62.128 | Data Mining and Warehousing H Hash partitioning–94 Hardware–87 Hierarchical clustering–39 Hierarchies–77 Hot backup–102. 82 SQC–88 SQL server 2000–89 Statistical analysis–11 Statistical techniques–11 Storage–68 N Neural networks–23 Neurons–24 Nonvolatile–6 Normalization–65 . 88 Recovery–104 Relational databases–4 Report–109 Row–65 L Learning–8 Legacy system–10 List partitioning–93 S M Machine learning–10 Marketing–4. 36 Medicine–4 Meta data–96 Mutation–33 Mirroring–87 Multi-dimensional knowledge–12 Sampling algorithm–49 Scalability–69 Scatter diagrams–19 Schema–81 Sesi–87 Shallow knowledge–12 Snowflake schema–83 Source–97. 45. 92 Partitioning algorithm–51 Performance of–87 Petabyte–64 Prediction–23 Primary key–67 I Indexes–99 Inductive learning–8 Information–65 Interpretation–19 ID3 algorithm–28 Q K KDD–14 Keys–79 K-means algorithm–37 Knowledge–12. 65 Query–79 R Range partitioning–92 Raid–87. 98 Spatial mining–52 Star schema–81. 103 Hidden knowledge–12 OLAP–12.

Index | 129 Sybase–3 System–3 Support level–46 Supervized learning–8 W Web mining–51. 19 Visualization techniques–19 View–99. 81 . 52 World wide web (www)–58 T Table–101 Tape–104 Temporal mining–52 Tuning–106 Tera byte–64 X XML–52 Xpert rule miner–62 Y Yotta byte–64 U Unsupervised learning–9 Unique–79 Z Zetta byte–64 V Visualization–11.

Sign up to vote on this title
UsefulNot useful