
DCS 008 Data Mining and Data Warehousing Unit I

Structure of the Unit
1.1 Introduction
1.2 Learning Objectives
1.3 Data mining – concepts
1.3.1 An overview
1.3.2 Data mining Tasks
1.3.3 Data mining Process
1.4 Information and production factor
1.5 Data mining vs Query tools
1.6 Data Mining in Marketing
1.7 Self learning Computer System
1.8 Concept Learning
1.9 Data Learning
1.10 Data mining and Data Warehousing
1.11 Summary
1.12 Exercises




1.1 Introduction

As a student who knows the basics of computers and data, you will be aware that the modern world is surrounded by various types of data (numbers, images, video, sound). Put simply, the whole world is data driven. Over the years the volume of this data has grown very large. Old data has accumulated in enormous volumes and is considered waste by most of its owners. This has happened in all areas: supermarket transaction data, credit card processing details, details of telephone calls dialed and received, ration card details, election and voter details, and so on. In the spirit of "Waste to Wealth", this data can be analyzed and arranged to obtain vital information, to answer important decision-making questions, and to suggest beneficial courses of action. Statistical and other concepts are used to extract such information, answers, and suggestions from large volumes of data. One of the major disciplines used for this today is known as "DATA MINING". Just as you mine the land for treasure, you have to mine large data to find the precious information that lies within it (such as relationships and patterns).

1.2 Learning Objectives

– To understand the necessity of analyzing and processing complex, large, information-rich data sets
– To introduce students to the initial concepts related to data mining

1.3 Data mining – concepts

1.3.1 An overview

Data is growing at a phenomenal rate, and users expect more sophisticated information from it. How do you get that? You have to uncover the hidden information in the large data, and data mining is used to do that. You may be familiar with common queries that explore the information in a database. But how far are the queries in data mining different from these? The following examples show the difference.

Examples of a database query:
– Find all credit applicants with the last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.


Examples of a data mining query:
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)

So, in short, data mining can be defined as follows: "Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, then to summarize the data in novel ways that are understandable and useful (the hidden information), and to validate the findings by applying the detected patterns to new subsets of data."

The concept of Data Mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining (e.g., Classification Trees).

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
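The difference between the database queries and the data mining queries listed in this overview can be sketched in a few lines of Python. This is only an illustration: the baskets, the function names, and the 0.5 confidence threshold are assumptions made for the example and are not part of the course text. The first function answers "find all customers who have purchased milk" by a direct lookup; the second answers "find all items which are frequently purchased with milk" by counting co-occurrences, which is the idea behind association rules.

```python
from collections import Counter

# Hypothetical transactions: each set is the content of one shopping basket.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "cereal"},
]


def baskets_with(item, baskets):
    """Database-style query: return every basket that contains `item`."""
    return [b for b in baskets if item in b]


def frequently_bought_with(item, baskets, min_confidence=0.5):
    """Mining-style question: which items co-occur with `item` often enough?

    confidence(item -> other) = baskets with both / baskets with item
    """
    matching = baskets_with(item, baskets)
    counts = Counter(other for b in matching for other in b if other != item)
    return {
        other: n / len(matching)
        for other, n in counts.items()
        if n / len(matching) >= min_confidence
    }


milk_baskets = baskets_with("milk", transactions)
rules = frequently_bought_with("milk", transactions)
print(len(milk_baskets), rules)
```

The query returns rows that were already stored; the mining function returns a relationship (bread and cereal tend to accompany milk) that no single row states.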

1.3.2 Data mining Tasks

The basic data mining tasks can be defined as follows:


Classification maps data into predefined groups or classes.
– Supervised learning
– Pattern recognition
– Prediction

Regression is used to map a data item to a real-valued prediction variable.

Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning

Summarization maps data into subsets with associated simple descriptions.
– Characterization
– Generalization

Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.

Example: Time Series Analysis (for instance, the stock market)
– Predict future values
– Determine similar patterns over time
– Classify behavior

1.3.3 Data mining Process

The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning data, data transformations, selecting subsets of records and, in the case of data sets with large numbers of variables ("fields"), performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.

Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal, many of which are based on so-called "competitive evaluation of models", that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques, which are often considered the core of predictive data mining, include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.

1.4 Information and production factor

Information / knowledge can behave as a factor of production. According to elementary economics texts, the raw material for any productive activity can be put in one of three categories: land (raw materials, in general), labor, and capital. Some economists mention entrepreneurship as a fourth factor – but none talk about knowledge. This is strange, since know-how is the key determinant for the most important part of output: increased production. Still, it's not that strange, since knowledge has unusual properties: there is no metric for it, and one can't calculate a monetary rate for it (cf. $/acre for land).

1.4.1 An example from agriculture

Imagine that you are a crop farmer. Your inputs are land and other raw materials like fertilizer and seed, your labor in planting, cultivating and harvesting the crop, and money you've borrowed from the bank to pay for your tractor. You can increase output by increasing any of these factors: cultivating more land, working more hours, or borrowing money to buy a better tractor or better seed.

However, you can also increase output through know-how. For example, you might discover that your land is better suited to one kind of corn rather than another. You could make a more substantial improvement in output if you changed your practices, for example by implementing crop rotation. Farmers in Europe had practiced a three-year rotation since the Middle Ages: rye or winter wheat, followed by spring oats or barley, then letting the soil rest (fallow) during the third stage. Four-field rotation (wheat, barley, turnips, and clover; no fallow) was a key development in the British Agricultural Revolution in the 18th Century. This system removed the need for a fallow period and allowed livestock to be bred year-round. (I suspect that if a four-crop rotation had been invented now, it would be eligible for a business process patent.)

Most of the increases in our material well-being have come about through innovation, that is, the application of knowledge. How is it, then, that knowledge as a factor of production gets such a cursory treatment in traditional economics?

1.4.2 Measuring Knowledge

A key difficulty is that knowledge is easy to describe but very hard to measure. One can talk about uses of knowledge, but I have so far found no simple metric.

Knowledge does have some value when it's sold, e.g. when a patent is licensed or when a list of customer names is valued on a balance sheet. However, there's no rate, no "$/something", for the knowledge purchased. One cannot calculate a $/something rate for knowledge in the way one can for the other three. Let's say land is measured in acres, labor in hours, and money in dollars. You'll pay me so much per acre of land, so much per hour of labor, and so many cents of interest per dollar I loan you. Land in different locations, labor of different kinds, and loans of different risks will earn different payment rates.

One can always, of course, argue that money is the ultimate metric: the knowledge value of something is what someone will pay for it. However, this is true for anything, including all the other factors of production. The difference is that land, labor and capital all have an underlying "objective" measure.

It's even hard to measure information content. There are many different perspectives, such as: library science (e.g. a user-centered measure of information), information theory (measuring data channel capacity), and algorithmic complexity (e.g. Kolmogorov complexity). All give different results. That suggests that the underlying concept is indefinite. It is perhaps so indefinite that we are fooling ourselves by even imagining that it exists.

1.5 Data mining vs Query tools

There are various tools commercially available for data mining. Users can apply them to mine data and obtain the required results and models. Some of them are described below for your reference.

1.5.1 Clementine

SPSS' Clementine, the premier data mining workbench, allows experts in business processes, data, and modeling to collaborate in exploring data and building models. It also supports the proven, industry-standard CRISP-DM methodology, which enables predictive insights to be developed consistently and repeatedly. No wonder that organizations from FORTUNE 500 companies to government agencies and academic institutions point to Clementine as a critical factor in their success.

1.5.2 CART

CART is a robust data mining tool that automatically searches for important patterns and relationships in large data sets and quickly uncovers hidden structures even in highly complex data sets. It works on the Windows, Mac and Unix platforms.

1.5.3 Web Information Extractor

Web Information Extractor is a powerful tool for web data mining, content extraction and content update monitoring. It can extract structured or unstructured data (including text, pictures and other files) from a web page, reformat it into a local file, save it to a database, or post it to a web server. There is no need to define complex template rules: just browse to the web page you are interested in, click what you want in order to define the extraction task, and run it as you want.

1.5.4 The Query Tool

The Query Tool is a powerful data mining application. It allows you to perform data analysis on any SQL database. It was developed predominantly for the non-technical user, and no knowledge of SQL is required. New features:
– Query Builder: quickly and simply build powerful queries.
– Summary: summarise any two columns against an aggregate function (MIN, AVG etc.) of any numerical column.
– Query Editor: now you can create your own scripts.

1.6 Data Mining in Marketing

1.6.1 Marketing Optimization

If you are the owner of a business, you should already be aware that there are multiple techniques you can use to market to your customers. There is the internet, direct mail, and telemarketing. While using these techniques can help your business succeed, there is even more you can do to tip the odds in your favor. You will want to become familiar with a technique that is called marketing optimization. This is a technique that is intricately connected to data mining. With market optimization, you will take a group of offers and customers, and after reviewing the limits of the campaign, you will use data mining to decide which offers should be made to specific customers.

Market optimization is a powerful tool that will take your marketing to the next level. Instead of mass marketing a product to a broad group of people that may not respond to it, you can take a group of marketing strategies and market them to different people based on patterns and relationships. The first step in marketing optimization is to create a group of marketing offers. Each offer will be created separately from the others, and each one of them will have its own financial attributes. An example of this would be the cost required to run each campaign. Each offer will have a model connected to it that will make a prediction based on the customer information that is presented to it. The prediction could come in the form of a score. The score could be defined by the probability of a customer purchasing a product. These models can be added to your marketing strategy. After you have set up your offers, you will next want to look at the purchasing habits of the customers you already have. Your goal is to analyze each offer you're making and optimize it in a way that will allow you to bring in the largest profits.

1.6.2 Illustration with an example

To illustrate market optimization with data mining, let me use an example. Suppose you were the marketing director for a financial institution such as a bank. You have a number of products which you offer to your customers, and these are CDs, credit cards, gold credit cards, and savings accounts. Despite the fact that your company offers these four products, it is your job to market checking accounts and savings accounts. It is your goal to figure out which customers will be interested in savings accounts compared to checking accounts. After thinking about how you can successfully market your products to your customers, you have come up with two possible strategies that you will present to your manager.

The first possible strategy is to market to customers who would like to save money for their children so they can attend college when they turn 18 years old. The second strategy is to market to students who are already attending college. Now that you have two offers you're interested in marketing, you will next want to study the data you have obtained. In this example, you work for a large company that has a data warehouse. You look at the customer data over the last few years to make a marketing decision. Your company uses a data mining tool that will predict the chances of people signing up for your products.

You will want to create certain mathematical models that will allow you to predict the possible responses. In this example, you are targeting young parents who may be looking to save money for their children, and you are targeting young people that are already in college. The models will be created by data mining tools.
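The idea that each offer carries financial attributes and a model that scores each customer can be sketched as follows. This is a minimal illustration under stated assumptions: the offer names, the costs and profits, the scores, and the rule "pick the offer with the highest positive expected profit" are all invented for the example; the course text does not prescribe a particular optimization rule.

```python
# Hypothetical offers: cost to contact one customer, and profit if the
# customer accepts. Both figures are invented for illustration.
offers = {
    "savings_for_parents": {"cost": 2.0, "profit": 60.0},
    "checking_for_students": {"cost": 1.0, "profit": 25.0},
}

# Hypothetical model output: probability that a customer accepts an offer.
scores = {
    "customer_a": {"savings_for_parents": 0.10, "checking_for_students": 0.02},
    "customer_b": {"savings_for_parents": 0.01, "checking_for_students": 0.30},
    "customer_c": {"savings_for_parents": 0.02, "checking_for_students": 0.03},
}


def expected_profit(probability, offer):
    """Expected profit of making one offer to one customer."""
    return probability * offer["profit"] - offer["cost"]


def assign_offers(scores, offers):
    """Give each customer the offer with the highest positive expected
    profit, or None when no offer is expected to pay for itself."""
    assignments = {}
    for customer, probabilities in scores.items():
        best_offer, best_value = None, 0.0
        for name, probability in probabilities.items():
            value = expected_profit(probability, offers[name])
            if value > best_value:
                best_offer, best_value = name, value
        assignments[customer] = best_offer
    return assignments


assignments = assign_offers(scores, offers)
print(assignments)
```

Here customer_c receives no offer at all, which is the point of targeting: a mass mailing would have spent money on a contact that is not expected to pay back.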

In this example, the best way to find out if young parents and college students will be interested in your offer is by looking at the historical response rate. If the historical response rate is only 10%, it is likely that it will remain the same for your new marketing strategy. However, historical response rates are simple, and to be more precise, you will want to use complex data mining strategies. Computer algorithms will be able to look at the history of customer transactions to determine the chances of success for your marketing campaign. By this time you will have realized how far data mining concepts are used in marketing and in optimizing it.

1.7 Self learning Computer System

A Self learning Computer System, also known as a knowledge based system or an expert system, is a computer program that contains the knowledge and analytical skills of one or more human experts, related to a specific subject. This class of program was first developed by researchers in artificial intelligence during the 1960s and 1970s and applied commercially throughout the 1980s.

The most common form of Self learning Computer System is a computer program, with a set of rules, that analyzes information (usually supplied by the user of the system) about a specific class of problems, and recommends one or more courses of user action. The expert system may also provide mathematical analysis of the problem(s). The expert system utilizes what appears to be reasoning capabilities to reach conclusions. In other words, a Self learning Computer System is a software system that incorporates concepts derived from experts in a field and uses their knowledge to provide problem analysis to users of the software.

A related term is wizard. A wizard is an interactive computer program that helps a user solve a problem. Originally the term wizard was used for programs that construct a database search query based on criteria supplied by the user. However, some rule-based expert systems are also called wizards. Other "wizards" are a sequence of online forms that guide users through a series of choices, such as the ones which manage the installation of new software on computers, and these are not expert systems.

A Self learning Computer System or an expert system is a computer program that simulates the judgement and behavior of a human or an organization that has expert knowledge and experience in a particular field. Typically, such a system contains a knowledge base containing accumulated experience and a set of rules for applying the knowledge base to each particular situation that is described to the program.
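The combination of a knowledge base and a set of rules applied to a situation described by the user can be sketched as a tiny rule engine. This is a toy illustration, not how any particular expert system is built: the rules, the fact names, and the simple forward-chaining loop are assumptions chosen for the example.

```python
# Hypothetical knowledge base: each rule pairs a set of conditions with a
# conclusion that holds when all of the conditions are known facts.
RULES = [
    ({"fever", "cough"}, "possible flu"),
    ({"possible flu", "short of breath"}, "see a doctor"),
    ({"sneezing", "itchy eyes"}, "possible allergy"),
]


def infer(facts, rules):
    """Forward chaining: keep applying rules until no new conclusion appears.

    Returns only the conclusions derived by the rules, not the input facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known - set(facts)


conclusions = infer({"fever", "cough", "short of breath"}, RULES)
print(sorted(conclusions))
```

Note that the second rule fires only because the first rule has already added "possible flu" to the known facts; chaining conclusions in this way is what gives such a program the appearance of reasoning.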

Sophisticated expert systems can be enhanced with additions to the knowledge base or to the set of rules. Among the best-known expert systems have been those that play chess and that assist in medical diagnosis.

1.8 Concept Learning

1.8.1 Analyzing Concepts

Concepts are categories of stimuli that have certain features in common.

[Figure: rectangles of different colors, sizes, and orientations]

The shapes above are all members of a conceptual category: rectangle. Their common features are (1) 4 lines; (2) opposite lines parallel; (3) lines connected at ends; (4) lines form 4 right angles. The fact that they are different colors and sizes and have different orientations is irrelevant. Color, size, and orientation are not defining features of the concept.

If a stimulus is a member of a specified conceptual category, it is referred to as a "positive instance". If it is not a member, it is referred to as a "negative instance". These are all negative instances of the rectangle concept:

[Figure: negative instances of the rectangle concept]

As rectangles are defined, a stimulus is a negative instance if it lacks any one of the specified features.

Every concept has two components:

Attributes: These are features of a stimulus that one must look for to decide if that stimulus is a positive instance of the concept.

A rule: This is a statement that specifies which attributes must be present or absent for a stimulus to qualify as a positive instance of the concept.
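The two components of a concept, attributes and a rule, can be sketched in Python using the rectangle concept. This is an illustrative sketch only: the attribute names and the feature sets of the two example stimuli are assumptions made for the example.

```python
# The four defining attributes of the rectangle concept, as named features.
RECTANGLE_ATTRIBUTES = {
    "four_lines",
    "opposite_lines_parallel",
    "lines_connected_at_ends",
    "four_right_angles",
}


def is_positive_instance(stimulus_features, required=RECTANGLE_ATTRIBUTES):
    """Rule: a stimulus is a positive instance only if it has every required
    attribute. Extra features (color, size, orientation) are ignored."""
    return required <= set(stimulus_features)


# Hypothetical stimuli described by their features.
red_tilted_rectangle = RECTANGLE_ATTRIBUTES | {"red", "large", "tilted"}
parallelogram = {"four_lines", "opposite_lines_parallel", "lines_connected_at_ends"}

print(is_positive_instance(red_tilted_rectangle))
print(is_positive_instance(parallelogram))
```

The parallelogram lacks only the right angles, yet that single missing attribute is enough to make it a negative instance.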

For rectangles, the attributes would be the four features discussed earlier, and the rule would be that all the attributes must be present.

The simplest rules refer to the presence or absence of a single attribute. For example, a "vertebrate" animal is defined as an animal with a backbone. Which of these stimuli are positive instances?

[Figure: four stimuli, marked + + _ +]

This rule is called affirmation. It says that a stimulus must possess a single specified attribute to qualify as a positive instance of a concept.

The opposite or "complement" of affirmation is negation. To qualify as a positive instance, a stimulus must lack a single specified attribute. An invertebrate animal is one that lacks a backbone. These are the positive and negative instances when the negation rule is applied.

[Figure: the same four stimuli, marked _ _ _ +]

More complex conceptual rules involve two or more specified attributes. For example, the conjunction rule states that a stimulus must possess two or more specified attributes to qualify as a positive instance of the concept. This was the rule used earlier to define the concept of a rectangle.

1.8.2 Behavioral Processes

In behavioral terms, when a concept is learned, two processes control how we respond to a stimulus:

Generalization: We generalize a certain response (like the name of an object) to all members of the conceptual class based on their common attributes.

Discrimination: We discriminate between stimuli which belong to the conceptual class and those that don't because they lack one or more of the defining attributes.

For example, we generalize the word "rectangle" to those stimuli that possess the defining attributes...

[Figure: three rectangles, each labeled "Rectangle"]

...and discriminate between these stimuli and others that are outside the conceptual class, in which case we respond with a different word: ?

1.9 Data learning

Learning from given data can be done in many ways. The data can be arranged in a particular format in order to learn from it. The following are some examples:

(i) A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs involve mechanisms for the definition of database structures; for data storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and security of the information stored, despite system crashes or attempts at unauthorized access.

(ii) A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.

(iii) A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.

(iv) A data warehouse is a repository of information collected from multiple sources, stored under a unified schema.

(v) A data mart is a subset of a data warehouse. It focuses on selected subjects.

1.10 Data mining and Data Warehousing

A data warehouse is an integrated and consolidated collection of data. It can be defined as a repository of purposely selected and adapted operational data which can successfully answer any ad hoc, complex, statistical, or analytical queries. Data warehousing can be defined as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes.

1.10.1 Functional requirements of a Data Warehouse

A data warehouse provides the needed support for all the informational applications of a company. It must support various types of applications, all of which have their own requirements in terms of data and the way data are modeled and used. A data warehouse must support:
1. Decision support processing
2. Informational applications
3. Model building
4. Consolidation

The data in the warehouse is processed to support the decisions that have to be taken at crucial times for the business. Certain information present in the data warehouse is derived as needed. Modeling can also be done by exploring the data in the data warehouse. Consolidation of the data / information can be done through various tools in a data warehouse. Time-dependent data will be present in a data warehouse. Data in a data warehouse must therefore be organized such that it can be analyzed or explored along different contextual dimensions.

In a warehouse the data can be structured or unstructured (like large text objects, pictures, audio, video etc.). There can be many sources from which a data warehouse gets data (corporate, external, offline etc.). The people who use the data in the warehouse can be executives, administrative officials, operational end users, external users, data and business analysts etc. Applications such as decision support processing and extended data warehouse applications can be run on the data in a warehouse. Fig 1.1 shows the many sources and different types of users.

[Fig 1.1 Data sources, users, and informational applications for a data warehouse. Sources: corporate data, offline data, external data (structured and unstructured). Users: CEOs, executives etc., external users.]

1.10.2 Data warehousing

Data warehousing is essentially what you need to do in order to create a data warehouse, and what you do with it. It is the process of creating, populating, and then querying a data warehouse, and can involve a number of discrete technologies.

In a Dimensional Model, the context of the measurements is represented in dimension tables. You can also think of the context of a measurement as its characteristics, such as the who, what, where, when, and how of a measurement (subject). In your business process Sales, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), and Product Sold (What).
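A measurement and its context can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions: the table names, column names, cities, products, and amounts are all invented, and time is kept as a plain month column in the measurement table rather than as a separate dimension table, purely to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold the context (where, what) of each measurement.
CREATE TABLE location (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT);

-- The measurement itself: one monthly sales number per location and product.
CREATE TABLE monthly_sales (
    location_id INTEGER REFERENCES location(location_id),
    product_id  INTEGER REFERENCES product(product_id),
    month       TEXT,
    amount      REAL
);

INSERT INTO location VALUES (1, 'Chennai'), (2, 'Madurai');
INSERT INTO product  VALUES (1, 'Milk'), (2, 'Bread');
INSERT INTO monthly_sales VALUES
    (1, 1, '2024-01', 100.0),
    (1, 2, '2024-01', 50.0),
    (2, 1, '2024-01', 70.0),
    (1, 1, '2024-02', 120.0);
""")

# Explore the measurement along two dimensions: location and time.
rows = conn.execute("""
    SELECT l.city, s.month, SUM(s.amount)
    FROM monthly_sales AS s
    JOIN location AS l ON l.location_id = s.location_id
    GROUP BY l.city, s.month
    ORDER BY l.city, s.month
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Grouping by a different dimension column (for example the product name) reuses the same measurement table, which is what makes this layout convenient for analysis.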

The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, and Zip code. Generally the Dimension Attributes are used in report labels and in query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships.

Before designing your data warehouse, you need to decide what this data warehouse contains. Say you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products. Then your dimensions are:
– Location
– Time
– Product

Each dimension table contains data for one dimension. In the above example you get all your store location information and put that into one single table called Location. Your store location data may be spread across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize all that data into one single table.

Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouse. The dimensional model is the underlying data model used by many of the commercial OLAP products available today in the market. In this model, all data is contained in two types of tables called Fact Table and Dimension Table.

1.11 Summary

In this Unit you have learnt about the basic concepts involved in Data Mining. You will now have the idea that data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. The role of information in production has also been explained to you. You also had an overview of the various tools used in mining, like CART, Clementine etc. Marketing can be done in a powerful way by using data mining results. Marketing people are very excited to use these facilities, as you will have understood from the example given in the unit. The learning concepts and details about self learning or expert systems have also been explained. Learning from data has been explained to you in a brief manner. Lastly, the necessity of the data warehouse and its usage in various aspects has been explained.

1.12 Exercises

1. What is data mining? In your answer, address the following: (a) Is it another hype? (b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?
2. Explain how information behaves as a production factor. Illustrate with an example of your own (not given in the book).
3. Give brief notes on the various mining tools which are known to you.
4. What do you mean by data mining in marketing? Explain with a suitable example.
5. What is a concept? How can one learn a concept? Explain with examples the factors of a concept.
6. What are the ways in which one can learn from data?
7. Data warehouse – explain the concepts.

Unit II

Structure of the Unit
2.1 Introduction
2.2 Learning Objectives
2.3 Knowledge discovery process
2.3.1 Data Selection
2.3.2 Data Cleaning
2.3.3 Data Enrichment
2.4 Preliminary Analysis of Data using traditional query tools
2.5 Visualization techniques
2.6 OLAP Tools
2.7 Decision trees
2.8 Association Rules
2.9 Neural Networks
2.10 Genetic Algorithms
2.11 KDD in Data bases
2.12 Summary
2.13 Exercises

2.1 Introduction

There are various processes and tools involved in data mining. One of the processes used to get knowledge from large databases is KDD (the knowledge discovery process). There are also processes like data cleaning, data selection, and data enrichment, which prepare the data for mining so that results can be obtained from it. The data in a large database can be analyzed through various traditional queries to get suitable information and knowledge. To visualize the results and data there are techniques called visualization techniques, through which one can view various effects on a situation and easily understand the results. There are methods in data mining which can be used in various businesses and fields, such as decision trees, association rules, neural networks, and genetic algorithms. These methods can give useful and suitable solutions to various problems.

2.2 Learning Objectives

– To know the concepts in the knowledge discovery process in mining large databases, and to understand the processes of data cleaning, data selection, and data enrichment under KDD
– To know about the visualization techniques used in data mining

2.3 Knowledge discovery process

2.3.1 An Overview

Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes to an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handle the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications. In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

KDD (Knowledge Discovery in Databases, or Knowledge Discovery and Data Mining) is a recent term related to data mining and involves sorting through huge quantities of data to pick out useful and relevant information. Knowledge discovery is the non-trivial extraction of implicit, previously unknown, and potentially useful information from databases. Both the number and the size of databases are growing rapidly because of the large amount of data obtained from satellite images, X-ray crystallography or other scientific equipment. This growth by far exceeds human capacities to analyze the databases in order to find implicit regularities, rules or clusters hidden in the data. Therefore, knowledge discovery becomes more and more important in databases. Typical tasks for knowledge discovery are the identification of classes (clustering), the prediction of new, unknown objects (classification), and the discovery of associations or deviations in spatial databases. Since these are challenging tasks, knowledge discovery algorithms should be incremental, i.e. when the database is updated the algorithm does not have to be applied to the whole database again. The term 'visual Data Mining' refers to the emphasis on integrating the user in the knowledge discovery process.

The basic steps in the knowledge discovery process are:

• Data selection
• Data cleaning/cleansing
• Data enrichment
• Data mining
• Pattern evaluation
• Knowledge presentation

2.3.1. Knowledge discovery Process – An overview

[Figure: "Data Mining: A KDD Process". Data mining is the core of the knowledge discovery process. Stages shown: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation. Source: Han, Introduction to KDD.]

Data Selection: The selection of data for a KDD process has to be done as a first step. This means selecting the data that is relevant to the field of interest so that meaningful knowledge can be reached.

Identification of relevant data: In a large and vast data bank one has to select the relevant and necessary data / information that is found to be important for the project / process that has to be carried out to get the targeted knowledge. For example, in a super market, if one wants to get knowledge about the sales of milk products, then the transaction data relevant to the sales of milk products has to be gathered and processed; the other sales details are not necessary. But if the shopkeeper wants to know the overall performance, then every transaction becomes necessary for the process.

Representation of data: After choosing the relevant data, the data has to be represented in a suitable structure. The format of that structure (database, text, etc.) can also be decided, and the data can be represented in that format.

2.3.2 Data cleaning: Data cleaning is the act of detecting and correcting (or removing) corrupt or inaccurate attributes or records. It is a necessary early step in the knowledge discovery process because data in the real world is dirty, which means it is:

• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results. Quality decisions must be based on quality data, and a data warehouse needs consistent integration of quality data. So organizations are forced to think about a unified logical view of the wide variety of data and databases they possess; they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.

In order to get successful results, data cleaning has to be done before proceeding to the further steps in the knowledge discovery process. It involves filling in missing values, smoothing noisy data, identifying or removing outliers, resolving inconsistencies, etc. Some of the data cleaning tasks are:

• Data acquisition and metadata
• Fill in missing values
• Unified date format
• Converting nominal to numeric
• Identify outliers and smooth out noisy data
• Correct inconsistent data

How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious and often infeasible.
• Use a global constant to fill in the missing value: e.g., "unknown" (which may end up acting as a new class).
• Imputation: use the attribute mean to fill in the missing value, or, smarter, use the attribute mean for all samples belonging to the same class.
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree.
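The missing-data strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names "cls" and "income", and the helper function names are all hypothetical and do not come from any particular tool.

```python
from statistics import mean

# Hypothetical records; None marks a missing income value.
records = [
    {"cls": "good", "income": 50.0},
    {"cls": "good", "income": None},
    {"cls": "good", "income": 70.0},
    {"cls": "bad", "income": 20.0},
    {"cls": "bad", "income": None},
]

def fill_constant(rows, attr, constant="unknown"):
    """Replace every missing value of attr with one global constant."""
    return [dict(r, **{attr: constant}) if r[attr] is None else dict(r) for r in rows]

def fill_attribute_mean(rows, attr):
    """Replace missing values with the mean of all known values of attr."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: m}) if r[attr] is None else dict(r) for r in rows]

def fill_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean of attr within the same class."""
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(v) for c, v in known.items()}
    return [
        dict(r, **{attr: class_means[r[class_attr]]}) if r[attr] is None else dict(r)
        for r in rows
    ]

by_constant = fill_constant(records, "income")
by_mean = fill_attribute_mean(records, "income")
by_class = fill_class_mean(records, "income", "cls")
```

In this toy data the overall mean pulls the missing "bad" income toward the "good" incomes, while the class-wise mean keeps the two groups apart, which is why the class-wise variant is described as the smarter one.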

Noisy Data: There can be random error or variance in a measured variable. Incorrect attribute values may be due to:

• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention

Other data problems which require data cleaning:

• duplicate records
• incomplete data
• inconsistent data

How to handle Noisy data?

• Binning method: first sort the data and partition it into (equi-depth) bins; then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Clustering: detect and remove outliers.
• Combined computer and human inspection: detect suspicious values automatically and have a human check them.
• Regression: smooth by fitting the data to regression functions.

2.3.3 Data Enrichment: The represented data has to be enriched with various additional details apart from the base details that have been gathered. The requirements for this enrichment can be:

• Behavioral: purchases from related businesses (Air Miles), e.g., number of vehicles, travel frequency
• Demographic: e.g., age, gender, marital status, children, income level
• Psychographic: e.g., "risk taker", "conservative", "cultured", "hi-tech adverse", "credit worthy", "trustworthy"
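Returning to the binning method described under noisy data, the following minimal sketch shows equi-depth binning followed by smoothing by bin means, bin medians and bin boundaries. The price list and the function names are illustrative assumptions; ties in boundary smoothing are resolved toward the lower boundary, which is one possible convention.

```python
from statistics import mean, median

def equi_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by(bins, func):
    """Replace every value in a bin by func(bin), e.g. the bin mean or median."""
    return [[func(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Hypothetical noisy price data.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, depth=4)
by_means = smooth_by(bins, mean)
by_medians = smooth_by(bins, median)
by_boundaries = smooth_by_boundaries(bins)
```

Each smoothing variant keeps the number of values unchanged and only replaces values inside a bin, so local noise is reduced while the overall ordering of the bins is preserved.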

2.4 Preliminary Analysis of the Data set

The gathered data set can be analysed for various purposes before proceeding to the KDD process. One of the most needed analyses is statistical analysis.

Statistical Analysis

Mean and Confidence Interval. Probably the most often used descriptive statistic is the mean. The mean is a particularly informative measure of the "central tendency" of the variable if it is reported along with its confidence intervals. Usually we are interested in statistics (such as the mean) from our sample data set only to the extent to which they can infer information about the population. The confidence intervals for the mean give us a range of values around the mean where we expect the "true" (population) mean to be located. For example, if the mean in your sample is 23, and the lower and upper limits of the p=.05 confidence interval are 19 and 27 respectively, then you can conclude with 95% confidence that the population mean is greater than 19 and lower than 27. If you set the p-level to a smaller value, then the interval would become wider, thereby increasing the "certainty" of the estimate, and vice versa; as we all know from the weather forecast, the more "vague" the prediction (i.e., the wider the confidence interval), the more likely it will materialize. Note that the width of the confidence interval depends on the sample size and on the variation of data values. The larger the sample size, the more reliable its mean. The larger the variation, the less reliable the mean. The calculation of confidence intervals is based on the assumption that the variable is normally distributed in the population. The estimate may not be valid if this assumption is not met, unless the sample size is large, say n=100 or more.

Shape of the Distribution, Normality. An important aspect of the "description" of a variable is the shape of its distribution, which tells you the frequency of values from different ranges of the variable. Typically, a researcher is interested in how well the distribution can be approximated by the normal distribution. Simple descriptive statistics can provide some information relevant to this issue. For example, if the skewness (which measures the deviation of the distribution from symmetry) is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical. If the kurtosis (which measures "peakedness" of the distribution) is clearly different from 0, then the distribution is either flatter or more peaked than normal; the kurtosis of the normal distribution is 0. More precise information can be obtained by performing one of the tests of normality to determine the probability that the sample came from a normally distributed population of observations (e.g., the so-called Kolmogorov-Smirnov test, or the Shapiro-Wilk W test).
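The descriptive statistics discussed above can be computed with the Python standard library alone. The sketch below makes two assumptions that are not in the text: the confidence interval uses the normal (z) approximation, which is reasonable only for fairly large samples (for small samples a t quantile would give a wider interval), and skewness and kurtosis use the simple moment-based formulas, with kurtosis reported as excess kurtosis so that a normal distribution scores 0. The function names and the sample are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error

def _central_moment(sample, order):
    m = mean(sample)
    return sum((x - m) ** order for x in sample) / len(sample)

def skewness(sample):
    """Moment-based skewness; 0 for a perfectly symmetrical sample."""
    return _central_moment(sample, 3) / _central_moment(sample, 2) ** 1.5

def excess_kurtosis(sample):
    """Moment-based kurtosis minus 3, so the normal distribution gives 0."""
    return _central_moment(sample, 4) / _central_moment(sample, 2) ** 2 - 3

# Hypothetical sample.
sample = [1, 2, 3, 4, 5]
low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
```

Asking for 99% instead of 95% confidence widens the interval, which is the trade-off between "certainty" and "vagueness" described in the text.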

However, none of these tests can entirely substitute for a visual examination of the data using a histogram (i.e., a graph that shows the frequency distribution of a variable).

The graph allows you to evaluate the normality of the empirical distribution because it also shows the normal curve superimposed over the histogram. It also allows you to examine various aspects of the distribution qualitatively. For example, the distribution could be bimodal (have 2 peaks). This might suggest that the sample is not homogeneous but possibly its elements came from two different populations, each more or less normally distributed. In such cases, in order to understand the nature of the variable in question, you should look for a way to quantitatively identify the two sub-samples.

Correlations

Purpose (What is Correlation?) Correlation is a measure of the relation between two or more variables. The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.

The most widely used type of correlation coefficient is Pearson r, also called linear or product-moment correlation.

Simple Linear Correlation (Pearson r). Pearson correlation (hereafter called correlation) assumes that the two variables are measured on at least interval scales (see Elementary Concepts), and it determines the extent to which values of the two variables are "proportional" to each other. The value of correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds, or centimeters and kilograms, are used as measurement units. Proportional means linearly related; that is, the correlation is high if it can be "summarized" by a straight line (sloped upwards or downwards).


This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Note that the concept of squared distances has important functional consequences for how the value of the correlation coefficient reacts to various specific arrangements of data (as we will see later).

How to Interpret the Values of Correlations. As mentioned before, the correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, then the resulting value (r^2, the coefficient of determination) will represent the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as the significance of the correlation.

Significance of Correlations. The significance level calculated for each correlation is a primary source of information about the reliability of the correlation. As explained before (see Elementary Concepts), the significance of a correlation coefficient of a particular magnitude will change depending on the size of the sample from which it was computed. The test of significance is based on the assumption that the distribution of the residual values (i.e., the deviations from the regression line) for the dependent variable y follows the normal distribution, and that the variability of the residual values is the same for all values of the independent variable x. However, Monte Carlo studies suggest that meeting those assumptions closely is not absolutely crucial if your sample size is not very small and when the departure from normality is not very large. It is impossible to formulate precise recommendations based on those Monte Carlo results, but many researchers follow a rule of thumb that if your sample size is 50 or more then serious biases are unlikely, and if your sample size is over 100 then you should not be concerned at all with the normality assumptions. There are, however, much more common and serious threats to the validity of information that a correlation coefficient can provide; they are briefly discussed in the following paragraphs.

Outliers. Outliers are atypical (by definition), infrequent observations. Because of the way in which the regression line is determined (especially the fact that it is based on minimizing not the sum of simple distances but the sum of squares of distances of data points from the line), outliers have a profound influence on the slope of the regression line and consequently on the value of the correlation coefficient. A single outlier is capable of considerably changing the slope of the regression line and, consequently, the value of the correlation, as demonstrated in the following example. Note that, as shown in that illustration, just one outlier can be entirely responsible for a high value of the correlation that otherwise (without the outlier) would be close to zero. Needless to say, one should never base important conclusions on the value of the correlation coefficient alone (i.e., examining the respective scatterplot is always recommended).
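The effect of a single outlier on Pearson r can be checked numerically. The sketch below computes r directly from its definition; the data points are made up for illustration (they are not the data behind the original figure), and the function name is an assumption.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation computed from its definition."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with no linear relationship at all.
x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 1, 3]
r_without_outlier = pearson_r(x, y)

# The same data plus one extreme point far away from the rest.
r_with_outlier = pearson_r(x + [20], y + [20])

# Coefficient of determination: proportion of common variation.
r_squared_with_outlier = r_with_outlier ** 2
```

Without the extra point the correlation is zero; with it, r rises above 0.9, so almost all of the apparent "common variation" comes from one observation. This is why inspecting the scatterplot is recommended alongside the coefficient.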


Note that if the sample size is relatively small, then including or excluding specific data points that are not as clearly "outliers" as the one shown in the previous example may have a profound influence on the regression line (and the correlation coefficient). This is illustrated in the following example where we call the points being excluded "outliers;" one may argue, however, that they are not outliers but rather extreme values.

Typically, we believe that outliers represent a random error that we would like to be able to control. Unfortunately, there is no widely accepted method to remove outliers automatically (however, see the next paragraph), thus what we are left with is to identify any outliers by examining a scatter plot of each important correlation. Needless to say, outliers may not only artificially increase the value of a correlation coefficient, but they can also decrease the value of a "legitimate" correlation.

t-test for independent samples

Purpose, Assumptions. The t-test is the most commonly used method to evaluate the differences in means between two groups. For example, the t-test can be used to test for a difference in test scores between a group of patients who were given a drug and a control group who received a placebo. Theoretically, the t-test can be used even if the sample sizes are very small (e.g., as small as 10; some researchers claim that even smaller n's are possible), as long as the variables are normally distributed within each group and the variation of scores in the two groups is not reliably different (see also Elementary Concepts). As mentioned before, the normality assumption can be evaluated by looking at the distribution of the data (via histograms) or by performing a normality test. The equality of variances assumption can be verified with the F test, or you can use the more robust Levene's test. If these conditions are not met, then you can evaluate the differences in means between two groups using one of the nonparametric alternatives to the t-test (see Nonparametric and Distribution Fitting).

The p-level reported with a t-test represents the probability of error involved in accepting our research hypothesis about the existence of a difference. Technically speaking, this is the probability of error associated with rejecting the hypothesis of no difference between the two categories of observations (corresponding to the groups) in the population when, in fact, the hypothesis is true. Some researchers suggest that if the difference is in the predicted direction, you can consider only one half (one "tail") of the probability distribution and thus divide the standard p-level reported with a t-test (a "two-tailed" probability) by two. Others, however, suggest that you should always report the standard, two-tailed t-test probability.

Arrangement of Data. In order to perform the t-test for independent samples, one independent (grouping) variable (e.g., Gender: male/female) and at least one dependent variable (e.g., a test score) are required. The means of the dependent variable will be compared between selected groups based on the specified values (e.g., male and female) of the independent variable. The following data set can be analyzed with a t-test comparing the average WCC score in males and females.

          GENDER   WCC
case 1    male     111
case 2    male     110
case 3    male     109
case 4    female   102
case 5    female   104

mean WCC in males = 110
mean WCC in females = 103

t-test graphs. In the t-test analysis, comparisons of means and measures of variation in the two groups can be visualized in box and whisker plots (for an example, see the graph below).
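As a worked example, the t statistic for the five-case WCC data set above can be computed by hand with the standard pooled-variance formula for independent samples. The sketch assumes equal variances in the two groups, as the text requires; the function name is illustrative, and no p-value is computed because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df

# WCC scores from the table above.
males = [111, 110, 109]
females = [102, 104]

t_stat, df = independent_t(males, females)
# The two-tailed 5% critical value of t for df = 3 is about 3.182,
# so a t statistic above that would be judged significant at p < .05.
```

With only five cases this is purely an arithmetic illustration; the normality and equal-variance assumptions discussed above cannot be checked meaningfully on so few observations.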

These graphs help you to quickly evaluate and "intuitively visualize" the strength of the relation between the grouping and the dependent variable.

Breakdown: Descriptive Statistics by Groups

Purpose. The breakdowns analysis calculates descriptive statistics and correlations for dependent variables in each of a number of groups defined by one or more grouping (independent) variables.

Arrangement of Data. In the following example data set (spreadsheet), the dependent variable WCC (White Cell Count) can be broken down by 2 independent variables: Gender (values: males and females), and Height (values: tall and short).

          GENDER   HEIGHT   WCC
case 1    male     short    101
case 2    male     tall     110
case 3    male     tall      92
case 4    female   tall     112
case 5    female   short     95
...       ...      ...      ...

The resulting breakdowns might look as follows (we are assuming that Gender was specified as the first independent variable, and Height as the second).

Entire sample: Mean=100 SD=13 N=120
  Males: Mean=99 SD=13 N=60
    Tall/males: Mean=98 SD=13 N=30
    Short/males: Mean=100 SD=13 N=30
  Females: Mean=101 SD=13 N=60
    Tall/females: Mean=101 SD=13 N=30
    Short/females: Mean=101 SD=13 N=30

The composition of the "intermediate" level cells of the "breakdown tree" depends on the order in which independent variables are arranged. For example, in the above example, you see the means for "all males" and "all females" but you do not see the means for "all tall subjects" and "all short subjects", which would have been produced had you specified independent variable Height as the first grouping variable rather than the second.

Statistical Tests in Breakdowns. Breakdowns are typically used as an exploratory data analysis technique; the typical question that this technique can help answer is very simple: Are the groups created by the independent variables different regarding the dependent variable? If you are interested in differences concerning the means, then the appropriate test is the breakdowns one-way ANOVA (F test). If you are interested in variation differences, then you should test for homogeneity of variances.

Other Related Data Analysis Techniques. Although for exploratory data analysis breakdowns can use more than one independent variable, the statistical procedures in breakdowns assume the existence of a single grouping factor (even if, in fact, the breakdown results from a combination of a number of grouping variables). Thus, those statistics do not reveal or even take into account any possible interactions between grouping variables in the design. For example, there could be differences between the influence of one independent variable on the dependent variable at different levels of another independent variable (e.g., tall people could have lower WCC than short ones, but only if they are males; see the "tree" data above). You can explore such effects by examining breakdowns "visually," using different orders of independent variables, but the magnitude or significance of such effects cannot be estimated by the breakdown statistics.

Frequency tables

Purpose. Frequency or one-way tables represent the simplest method for analyzing categorical (nominal) data (refer to Elementary Concepts). They are often used as one of the exploratory procedures to review how different categories of values are distributed in the sample. For example, in a survey of spectator interest in different sports, we could summarize the respondents' interest in watching football in a frequency table as follows:

STATISTICA BASIC STATS
FOOTBALL: "Watching football"

Category                          Count   Cumulative Count   Percent    Cumulative Percent
ALWAYS  : Always interested        39        39              39.00000    39.0000
USUALLY : Usually interested       16        55              16.00000    55.0000
SOMETIMS: Sometimes interested     26        81              26.00000    81.0000
NEVER   : Never interested         19       100              19.00000   100.0000
Missing                             0       100               0.00000   100.0000

The table above shows the number, proportion, and cumulative proportion of respondents who characterized their interest in watching football as either (1) Always interested, (2) Usually interested, (3) Sometimes interested, or (4) Never interested.

Applications. In practically every research project, a first "look" at the data usually includes frequency tables. For example, in survey research, frequency tables can show the number of males and females who participated in the survey, the number of respondents from particular ethnic and racial backgrounds, and so on. Responses on some labeled attitude measurement scales (e.g., interest in watching football) can also be nicely summarized via the frequency table. In medical research, one may tabulate the number of patients displaying specific symptoms; in industrial research one may tabulate the frequency of different causes leading to catastrophic failure of products during stress tests (e.g., which parts are actually responsible for the complete malfunction of television sets under extreme temperatures?). Customarily, if a data set includes any categorical data, then one of the first steps in the data analysis is to compute a frequency table for those categorical variables.

Tools for this analysis: To do this statistical analysis there are various tools like SPSS, Microsoft Excel, etc. One can use these tools to carry out a preliminary analysis of the data selected for KDD.

Some tools:

Microsoft Excel: Microsoft Excel is spreadsheet software that is used to store information in columns and rows, which can then be organized and/or processed. The Analysis ToolPak is a tool in Microsoft Excel to perform basic statistical procedures. In addition to the basic spreadsheet functions, the Analysis ToolPak in Excel contains procedures such as ANOVA, correlations, descriptive statistics, histograms, percentiles, regression, and t-tests, so it can be used to get basic descriptive statistics and to perform an ANOVA, a t-test, and a linear regression. The primary reason to use Excel for statistical data analysis is that it is so widely available. The Analysis ToolPak is an add-on that can be installed for free if you have the installation disk for Microsoft Office. It is also publicly available.
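For readers who want to reproduce the football frequency table shown earlier without a statistics package, the following sketch builds the same columns (count, cumulative count, percent, cumulative percent) with the Python standard library. The response list is reconstructed from the counts in that table; the function name and row layout are illustrative assumptions.

```python
from collections import Counter

def frequency_table(values, categories):
    """Return (category, count, cumulative count, percent, cumulative percent) rows."""
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for category in categories:
        count = counts.get(category, 0)
        running += count
        rows.append((category, count, running, 100 * count / total, 100 * running / total))
    return rows

# Responses reconstructed from the counts in the football example.
responses = ["ALWAYS"] * 39 + ["USUALLY"] * 16 + ["SOMETIMS"] * 26 + ["NEVER"] * 19
table = frequency_table(responses, ["ALWAYS", "USUALLY", "SOMETIMS", "NEVER"])

for category, count, cum_count, percent, cum_percent in table:
    print(f"{category:<10}{count:>6}{cum_count:>6}{percent:>8.1f}{cum_percent:>8.1f}")
```

Passing the category order explicitly keeps the cumulative columns meaningful for ordered scales such as this one; a plain count would otherwise list categories in order of first appearance.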

SPSS: SPSS is among the most widely used programs for statistical analysis in social science. It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations and others. In addition to statistical analysis, data management (case selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is stored with the data) are features of the base software. Statistics included in the base software:

• Descriptive statistics: Cross tabulation, Frequencies, Descriptives, Explore, Descriptive Ratio Statistics
• Bivariate statistics: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), Nonparametric tests
• Prediction for numerical outcomes: Linear regression
• Prediction for identifying groups: Factor analysis, cluster analysis (two-step, K-means, hierarchical), Discriminant

2.5 Visualization techniques

The human mind has boundless potential, and humans have been exploring many ways to use the mind for thousands of years. The technique of visualization can help you acquire new knowledge and skills more quickly than with conventional techniques.

The amount of data stored on electronic media is growing exponentially fast. Making sense of such data is becoming harder and more challenging. Online retailing in the Internet age, for example, is very different than retailing a decade ago because the three most important factors of the past (location, location, and location) are irrelevant for online stores. One of the greatest challenges we face today is making sense of all this data. Data mining, or knowledge discovery, is the process of identifying new patterns and insights in data, whether it is for understanding the Human Genome to develop new drugs, for discovering new patterns in recent Census data to warn about hidden trends, or for understanding your customers better at an electronic webstore in order to provide a personalized one-to-one experience.

Data mining, sometimes referred to as knowledge discovery, is at the intersection of multiple research areas, including Machine Learning, Statistics, Pattern Recognition, Databases and Visualization. Good marketing and business-oriented data mining books are also available. With the maturity of databases and constant improvements in computational speed, data mining algorithms that were too expensive to execute are now within reach.

Data mining serves two goals:

1. Insight: identify patterns and trends that are comprehensible, so that action can be taken based on the insight. For example, characterize the heavy spenders on a web site, or people that buy product X. By understanding the underlying patterns, the web site can be personalized and improved. The insight may also lead to decisions that affect other channels, such as brick-and-mortar stores' placement of products, marketing efforts, and cross-sells.

2. Prediction: a model is built that predicts (or scores) based on input data. For example, a model can be built to predict the propensity of customers to buy product X based on their demographic data and browsing patterns on a web site. Customers with high scores can be used in a direct marketing campaign. If the prediction is for a discrete variable with a few values (e.g., buy product X or not), the task is called classification; if the prediction is for a continuous variable (e.g., customer spending in the next year), the task is called regression.

The majority of research in data mining has concentrated on building the best models for prediction. Part of the reason, no doubt, is that a prediction task is well defined and can be objectively measured on an independent test-set. Given a dataset that is labeled with the correct predictions, it is split into a training set and a test-set. A learning algorithm is given the training set and produces a model that can map new unseen data into the prediction. The model can then be evaluated for its accuracy in making predictions on the unseen test-set. Descriptive data mining, which yields human insight, is harder to evaluate, yet necessary in many domains because the users may not trust predictions coming out of a black box or because legally one must explain the predictions. For example, even if a Perceptron algorithm [20] outperforms a loan officer in predicting who will default on a loan, the person requesting a loan cannot be rejected simply because he is on the wrong side of a 37-dimensional hyperplane; legally, the loan officer must explain the reason for the rejection.

The choice of a predictive model can have a profound influence on the resulting accuracy and on the ability of humans to gain insight from it. Some models are naturally easier to understand than others. For example, a model consisting of if-then rules is easy to understand, unless the number of rules is too large. Decision trees are also relatively easy to understand. Linear models get a little harder, especially if discrete inputs are used. Nearest-neighbor algorithms in high dimensions are almost impossible for users to understand, and non-linear models in high dimensions, such as Neural Networks, are the most opaque.

One way to aid users in understanding the models is to visualize them. MineSet, for example, is a data mining tool that integrates data mining and visualization very tightly. Models built can be viewed and interacted with. Figure 1 shows a visualization of the Naïve-Bayes classifier. Given a target value, which in this case was who earns over $50,000 in the US working population, the visualization shows a small set of "important" attributes (measured using mutual information or cross-entropy). For each attribute, a bar chart shows how much "evidence" each value (or range of values) of that attribute provides for the target label. For example, higher education levels (right bars in the education row) imply higher salaries because the bars are higher. Similarly, salary increases with age up to a point and then decreases, and salary increases with the number of hours worked per week. The combination of a back-end algorithm that bins the data, computes the importance of hundreds of attributes, and then a visualization that shows the important attributes visually, makes this a very useful tool that helps identify patterns. Users can interact with the model by clicking on attribute values and seeing the predictions that the model makes.

Figure 1: A visualization of the Naive-Bayes classifier

Examples of Visualization tools

Miner3D: Create engaging data visualizations and live data-driven graphics! Miner3D delivers new data insights by allowing you to visually spot trends, clusters, patterns or outliers.

A. Unsupervised Visual Data Clustering: Kohonen's Self-Organizing Maps

Miner3D now includes a visual implementation of Self-Organizing Maps. Users looking for an unattended data clustering tool will find this module surprisingly powerful.

Users looking for an unattended and unsupervised data clustering tool, capable of generating convincing results, will recognize the strong data analysis potential of Kohonen's Self-Organizing Maps (SOMs). Self-Organizing Maps have been the data clustering method sought by many people from different areas of business and science. This enhancement of the already powerful set of Miner3D data analysis tools further broadens its application portfolio.

Also known as self-organizing maps (SOM), Kohonen maps are a tool for arranging the data points into a manageable 2D or 3D space in a way that preserves closeness. Kohonen maps are inspired biologically. The SOM computational mechanism reflects how many scientists think the human brain organizes many-faceted concepts into its 3D structure. The SOM algorithm lays a 2D grid of "neuronal units" and assigns each data point to the unit that will "recognize" it. The assignment is made in such a way that neighboring units recognize similar data.

The result of applying a Kohonen map to a data set is a 2D plot, but Miner3D can also support 3D Kohonen maps. In this plot, data points (rows) that are similar in the chosen set of attributes will be grouped close together, while dissimilar rows will be separated by a greater distance in the plot space. This allows you, the user, to tease out salient data patterns.

B. K-Means clustering

A powerful K-Means clustering method can be used to visually cluster data sets and for data set reduction.

Cluster analysis is a set of mathematical techniques for partitioning a series of data objects into a smaller number of groups, or clusters, so that the data objects within one cluster are more similar to each other than to those in other clusters. Miner3D provides the popular K-means method of clustering. K-means can be used either for clustering data sets visually in 3D or for row reduction and compression of large data sets. Miner3D's implementation of K-Means uses a high-performance proprietary scheme based on filtering algorithms and multidimensional binary search trees. K-Means Clustering and K-Means Data Reduction give you more power and more options to process large data sets.
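To make the idea of K-means concrete, here is a minimal sketch of the textbook procedure (alternating an assignment step and an update step). It is not Miner3D's implementation, which the text describes as proprietary; the function name `kmeans`, the sample `rows` and the fixed iteration count are all assumptions chosen for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # start from k rows picked at random
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each row joins the cluster of its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of the rows assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Hypothetical data: two visibly separate groups of rows.
rows = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
centroids, assignment = kmeans(rows, k=2)
print(assignment)
```

Rows that end up with the same label are "more similar to each other than to those in other clusters", which is the property described above. Replacing each cluster by its centroid is one simple way to think about the row-reduction use of K-means.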

K-means clustering is only available in the Miner3D Enterprise and Miner3D Developer packages.

2.6 OLAP (or Online Analytical Processing)

OLAP (or Online Analytical Processing) has been growing in popularity due to the increase in data volumes and the recognition of the business value of analytics. Until the mid-nineties, performing OLAP analysis was an extremely costly process mainly restricted to larger organizations. The major OLAP vendors are Hyperion, Cognos, Business Objects and MicroStrategy. The cost per seat was in the range of $1500 to $5000 per annum. Setting up the environment to perform OLAP analysis would also require substantial investments in time and monetary resources.

This has changed as the major database vendors have started to incorporate OLAP modules within their database offerings: Microsoft SQL Server 2000 with Analysis Services, Oracle with Express and Darwin, and IBM with DB2.

What is OLAP?

OLAP allows business users to slice and dice data at will. Normally data in an organization is distributed over multiple data sources that are incompatible with each other. A retail example: point-of-sales data and sales made via call-center or the Web are stored in different locations and formats. It would be a time-consuming process for an executive to obtain OLAP reports such as: What are the most popular products purchased by customers between the ages 15 to 30?

Part of the OLAP implementation process involves extracting data from the various data repositories and making them compatible. Making data compatible involves ensuring that the meaning of the data in one repository matches all other repositories. An example of incompatible data: customer ages can be stored as birth date for purchases made over the web and stored as age categories (i.e. between 15 and 30) for in-store sales.

It is not always necessary to create a data warehouse for OLAP analysis. Data stored by operational systems, such as point-of-sales, are in types of databases called OLTPs. OLTP (Online Transaction Processing) databases do not differ structurally from any other databases. The main, and only, difference is the way in which data is stored.

Examples of OLTPs can include ERP, CRM, SCM, Point-of-Sale applications and Call Center.

OLTPs are designed for optimal transaction speed. When a consumer makes a purchase online, they expect the transactions to occur instantaneously. With a database design (called data modeling) optimized for transactions, the record 'Consumer name, Address, Telephone, Order Number, Order Name, Price, Payment Method' is created quickly on the database and the results can be recalled by managers equally quickly if needed.

Data Model for OLTP

Data are not typically stored for an extended period on OLTPs for storage cost and transaction speed reasons.

OLAPs have a different mandate from OLTPs. OLAPs are designed to give an overview analysis of what happened. Hence the data storage (i.e. data modeling) has to be set up differently. The most common method is called the star design.

Star Data Model for OLAP

The central table in an OLAP star data model is called the fact table. The surrounding tables are called the dimensions. Using the above data model, it is possible to build reports that answer questions such as:

- The supervisor that gave the most discounts.
- The quantity shipped on a particular date, month, year or quarter.
- In which zip code did product A sell the most.

To obtain answers, such as the ones above, from a data model OLAP cubes are created. OLAP cubes are not strictly cuboids; "cube" is the name given to the process of linking data from the different dimensions. The cubes can be developed along business units such as sales or marketing. Or a giant cube can be formed with all the dimensions.
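The relationship between a fact table, its dimensions and a cube can be sketched in a few lines. The tables, column names and figures below are hypothetical, and the `roll_up` helper is only an illustration of the idea, not the way any particular OLAP product builds its cubes.

```python
from collections import defaultdict

# Hypothetical dimension tables: key -> descriptive attributes.
products = {1: {"name": "A"}, 2: {"name": "B"}}
customers = {10: {"zip": "600001"}, 11: {"zip": "600002"}}

# Hypothetical fact table: one row per sale, holding dimension keys and a measure.
facts = [
    {"product_id": 1, "customer_id": 10, "month": "2024-01", "quantity": 5},
    {"product_id": 1, "customer_id": 11, "month": "2024-01", "quantity": 9},
    {"product_id": 2, "customer_id": 10, "month": "2024-02", "quantity": 4},
    {"product_id": 1, "customer_id": 11, "month": "2024-02", "quantity": 3},
]

def roll_up(facts, *dims):
    """Sum the quantity for every combination of the requested dimension values."""
    cube = defaultdict(int)
    for row in facts:
        labels = {
            "product": products[row["product_id"]]["name"],
            "zip": customers[row["customer_id"]]["zip"],
            "month": row["month"],
        }
        cube[tuple(labels[d] for d in dims)] += row["quantity"]
    return dict(cube)

# Slice the data by product and zip code, then ask where product A sold the most.
by_product_zip = roll_up(facts, "product", "zip")
a_sales = {z: q for (p, z), q in by_product_zip.items() if p == "A"}
best_zip_for_a = max(a_sales, key=a_sales.get)
print(best_zip_for_a)
```

Calling `roll_up(facts, "month")` instead gives the quantity shipped per month, which shows how the same fact table answers different questions depending on the dimensions chosen.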

OLAP Cube with Time, Customer and Product Dimensions

OLAP can be a valuable and rewarding business tool. Aside from producing reports, OLAP analysis can help an organization evaluate balanced scorecard targets.

Steps in the OLAP Creation Process

OLAP – Online Analytical Processing – Tools

OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically – the way managers think of their enterprises – but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information.

Examples for OLAP tools

1. WebFOCUS

WebFOCUS OLAP combines all the functionality of query tools, reporting tools, and OLAP into a single powerful solution with one common interface so business analysts can slice and dice the data and see business processes in a new way. WebFOCUS makes data part of an organization's natural culture by giving developers the premier design environments for automated ad hoc and parameter-driven reporting and giving everyone else the ability to receive and retrieve data in any format, performing analysis using whatever device or application is part of the daily working life.

WebFOCUS ad hoc reporting and OLAP features allow users to slice and dice data in an almost unlimited number of ways. Satisfying the broadest range of analytical needs, business intelligence application developers can easily enhance reports with extensive data-analysis functionality so that end users can dynamically interact with the information. WebFOCUS also supports the real-time creation of Excel spreadsheets and Excel PivotTables with full styling, drill-downs, and formula capabilities so that Excel power users can analyze their corporate data in a tool with which they are already familiar.

2. OlapCube

OlapCube is a simple, yet powerful tool to analyze data. OlapCube will let you create local cubes (files with .cub extension) from data stored in any relational database (including MySQL, PostgreSQL, Microsoft Access, SQL Server, SQL Server Express, Oracle, Oracle Express). You can explore the resulting cube with our OlapCube Reader. Or you can use Microsoft Excel to create rich and customized reports.

3. PivotCubeX

PivotCubeX is a visual ActiveX control for OLAP analysis and reporting. You can use it to load data from huge relational databases, look for information or details and create summaries and reports that help the end user in making accurate decisions. It provides highly dynamic interface for interactive data analysis.

2.7 Decision Trees

What is a Decision Tree?

A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically each branch of the tree is a classification question and the leaves of the tree are partitions of the dataset with their classification. For instance if we were going to classify customers who churn (don't renew their phone contracts) in the Cellular Telephone Industry a decision tree might look something like that found in Figure 2.1.

Figure 2.1 A decision tree is a predictive model that makes a prediction on the basis of a series of decisions, much like the game of 20 questions.

You may notice some interesting things about the tree:

- It divides up the data on each branch point without losing any of the data (the number of total records in a given parent node is equal to the sum of the records contained in its two children).
- The number of churners and non-churners is conserved as you move up or down the tree.
- It is pretty easy to understand how the model is being built (in contrast to the models from neural networks or from standard statistics).
- It would also be pretty easy to use this model if you actually had to target those customers that are likely to churn with a targeted marketing offer.

You may also build some intuitions about your customer base, e.g. "customers who have been with you for a couple of years and have up to date cellular phones are pretty loyal".

Viewing decision trees as segmentation with a purpose

From a business perspective decision trees can be viewed as creating a segmentation of the original dataset (each segment would be one of the leaves of the tree). Segmentation
of customers, products, and sales regions is something that marketing managers have been doing for many years. In the past this segmentation has been performed in order to get a high level view of a large amount of data, with no particular reason for creating the segmentation except that the records within each segment were somewhat similar to each other.

In this case the segmentation is done for a particular reason, namely for the prediction of some important piece of information. The records that fall within each segment fall there because they have similarity with respect to the information being predicted, not just because they are similar without similarity being well defined. These predictive segments that are derived from the decision tree also come with a description of the characteristics that define the predictive segment. Thus, although the decision trees and the algorithms that create them may be complex, the results can be presented in an easy to understand way that can be quite useful to the business user.

Applying decision trees to Business

Because of their tree structure and ability to easily generate rules, decision trees are the favored technique for building understandable models. Because of this clarity they also allow for more complex profit and ROI models to be added easily on top of the predictive model. For instance once a customer population is found with a high predicted likelihood to attrite, a variety of cost models can be used to see if an expensive marketing intervention should be used because the customers are highly valuable or a less expensive intervention should be used because the revenue from this sub-population of customers is marginal.

Because of their high level of automation and the ease of translating decision tree models into SQL for deployment in relational databases, the technology has also proven to be easy to integrate with existing IT processes, requiring little preprocessing and cleansing of the data, or extraction of a special purpose file specifically for data mining.

Where can decision trees be used?

Decision trees are a data mining technology that has been around in a form very similar to the technology of today for almost twenty years now, and early versions of the algorithms date back to the 1960s. Often these techniques were originally developed for statisticians to automate the process of determining which fields in their database were actually useful or correlated with the particular problem that they were trying to understand. Partially because of this history, decision tree algorithms tend to automate the entire process of hypothesis generation and then validation much more completely and in a much more integrated way than any other data mining techniques. They are also particularly adept at handling raw data with little or no pre-processing. Perhaps also because they were originally developed to mimic the way an analyst interactively performs data mining, they provide a simple to understand predictive model based on rules (such as "90% of the time credit card customers of less than 3 months who max out their credit limit are going to default on their credit card loan").

Because decision trees score so highly on so many of the critical features of data mining they can be used in a wide variety of business problems for both exploration and prediction. They have been used for problems ranging from credit card attrition prediction to time series prediction of the exchange rate of different international
currencies. There are also some problems where decision trees will not do as well. Some very simple problems where the prediction is just a simple multiple of the predictor can be solved much more quickly and easily by linear regression. Usually the models to be built and the interactions to be detected are much more complex in real world problems and this is where decision trees excel.

Using decision trees for Exploration

The decision tree technology can be used for exploration of the dataset and business problem. This is often done by looking at the predictors and values that are chosen for each split of the tree. Often these predictors provide usable insights or propose questions that need to be answered. For instance if you ran across the following in your database for cellular phone churn you might seriously wonder about the way your telesales operators were making their calls and maybe change the way that they are compensated: "IF customer lifetime < 1.1 years AND sales channel = telesales THEN chance of churn is 65%."

Using decision trees for Data Preprocessing

Another way that the decision tree technology has been used is for preprocessing data for other prediction algorithms. Because the algorithm is fairly robust with respect to a variety of predictor types (e.g. number, categorical etc.) and because it can be run relatively quickly, decision trees can be used on the first pass of a data mining run to create a subset of possibly useful predictors that can then be fed into neural networks, nearest neighbor and normal statistical routines, which can take a considerable amount of time to run if there are large numbers of possible predictors to be used in the model.

Decision trees for Prediction

Although some forms of decision trees were initially developed as exploratory tools to refine and preprocess data for more standard statistical techniques like logistic regression, they have also been used, and are increasingly being used, for prediction. This is interesting because many statisticians will still use decision trees for exploratory analysis, effectively building a predictive model as a by-product, but then ignore the predictive model in favor of techniques that they are most comfortable with. Sometimes veteran analysts will do this even when the predictive model they exclude is superior to that produced by other techniques. With a host of new products and skilled users now appearing, this tendency to use decision trees only for exploration now seems to be changing.

The first step is Growing the Tree

The first step in the process is that of growing the tree. Specifically the algorithm seeks to create a tree that works as perfectly as possible on all the data that is available. Most of the time it is not possible to have the algorithm work perfectly. There is always noise in the database to some degree (there are variables that are not being collected that have an impact on the target you are trying to predict).

The name of the game in growing the tree is in finding the best possible question to ask at each branch point of the tree. At the bottom of the tree you will come up with nodes that you would like to be all of one type or the other. Thus the question: "Are you over 40?"
probably does not sufficiently distinguish between those who are churners and those who are not; let's say it is 40%/60%. On the other hand there may be a series of questions that do quite a nice job in distinguishing those cellular phone customers who will churn and those who won't. Maybe the series of questions would be something like: "Have you been a customer for less than a year, do you have a telephone that is more than two years old and were you originally landed as a customer via telesales rather than direct sales?" This series of questions defines a segment of the customer population in which 90% churn. These are then relevant questions to be asking in relation to predicting churn.

The difference between a good question and a bad question

The difference between a good question and a bad question has to do with how much the question can organize the data or, in this case, change the likelihood of a churner appearing in the customer segment. If we started off with our population being half churners and half non-churners, then a question that didn't organize the data to some degree into one segment that was more likely to churn than the other wouldn't be a very useful question to ask. On the other hand if we asked a question that was very good at distinguishing between churners and non-churners, say one that split 100 customers into one segment of 50 churners and another segment of 50 non-churners, then this would be considered to be a good question. In fact it would have decreased the "disorder" of the original segment as much as was possible.

The process in decision tree algorithms is very similar when they build trees. These algorithms look at all possible distinguishing questions that could possibly break up the original training dataset into segments that are nearly homogeneous with respect to the different classes being predicted. Some decision tree algorithms may use heuristics in order to pick the questions or even pick them at random. CART picks the questions in a very unsophisticated way: it tries them all. After it has tried them all CART picks the best one, uses it to split the data into two more organized segments and then again asks all possible questions on each of those new segments individually.

When does the tree stop growing?

If the decision tree algorithm just continued growing the tree like this it could conceivably create more and more questions and branches in the tree so that eventually there was only one record in the segment. To let the tree grow to this size is both computationally expensive and unnecessary. Most decision tree algorithms stop growing the tree when one of three criteria is met:

- The segment contains only one record. (There is no further question that you could ask which could further refine a segment of just one.)
- All the records in the segment have identical characteristics. (There is no reason to continue asking further questions since all the remaining records are the same.)
- The improvement is not substantial enough to warrant making the split.
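The "try them all" search for the best question can be sketched as follows. This is only an illustration under stated assumptions: the records and field names are made up, questions are limited to "does this field equal this value?", and disorder is scored with Gini impurity, one of several measures a real CART implementation might use.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the segment is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_question(rows, target):
    """Try every (field, value) question and keep the one whose two segments
    have the lowest size-weighted impurity."""
    best = None
    for field in rows[0]:
        if field == target:
            continue
        for value in {r[field] for r in rows}:
            yes = [r[target] for r in rows if r[field] == value]
            no = [r[target] for r in rows if r[field] != value]
            if not yes or not no:
                continue  # this question does not actually split the segment
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
            if best is None or score < best[0]:
                best = (score, field, value)
    return best

# Hypothetical training segment: half churners, half non-churners.
customers = [
    {"channel": "telesales", "old_phone": True, "churned": True},
    {"channel": "telesales", "old_phone": True, "churned": True},
    {"channel": "telesales", "old_phone": False, "churned": True},
    {"channel": "direct", "old_phone": True, "churned": False},
    {"channel": "direct", "old_phone": False, "churned": False},
    {"channel": "direct", "old_phone": False, "churned": False},
]
score, field, value = best_question(customers, "churned")
print(field, value, score)
```

In this toy segment the sales channel separates churners from non-churners completely, so it is chosen over the phone-age question, which leaves both resulting segments mixed. A full tree builder would call `best_question` again on each resulting segment until one of the stopping criteria above is met.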

Why would a decision tree algorithm stop growing the tree if there wasn't enough data?

Consider the following example shown in Table 2.1 of a segment that we might want to split further which has just two examples. It has been created out of a much larger customer database by selecting only those customers aged 27 with blue eyes and salaries between $80,000 and $81,000. In this case all of the possible questions that could be asked about the two customers turn out to have the same value (age, eyes, salary) except for name.

Name    Age   Eyes   Salary    Churned?
Steve   27    Blue   $80,000   Yes
Alex    27    Blue   $80,000   No

Table 2.1 Decision tree algorithm segment. This segment cannot be split further except by using the predictor "name".

It would then be possible to ask a question like: "Is the customer's name Steve?" and create segments which would be very good at breaking apart those who churned from those who did not.

The problem is that we all have an intuition that the name of the customer is not going to be a very good indicator of whether that customer churns or not. It might work well for this particular 2 record segment but it is unlikely that it will work for other customer databases or even the same customer database at a different time. This particular example has to do with overfitting the model, in this case fitting the model too closely to the idiosyncrasies of the training data. This can be fixed later on but clearly stopping the building of the tree short of either one record segments or very small segments in general is a good idea.

Decision trees aren't necessarily finished after the tree is grown.

After the tree has been grown to a certain size (depending on the particular stopping criteria used in the algorithm) the CART algorithm has still more work to do. The algorithm then checks to see if the model has been overfit to the data. It does this in several ways, using a cross validation approach or a test set validation approach, basically using the same mind-numbingly simple approach it used to find the best questions in the first place: trying many different simpler versions of the tree on a held aside test set. The tree that does the best on the held aside data is selected by the algorithm as the best model. The nice thing about CART is that this testing and selection is all an integral part of the algorithm as opposed to the after the fact approach that other techniques use.

ID3 and an enhancement - C4.5

In the late 1970s J. Ross Quinlan introduced a decision tree algorithm named ID3. It was one of the first decision tree algorithms yet at the same time built solidly on work that had been done on inference systems and concept learning systems from that decade as well as the preceding decade. Initially ID3 was used for tasks such as learning good game playing strategies for chess end games. Since then ID3 has been applied to a wide
variety of problems in both academia and industry and has been modified, improved and borrowed from many times over.

ID3 picks predictors and their splitting values based on the gain in information that the split or splits provide. Gain represents the difference between the amount of information that is needed to correctly make a prediction before a split is made and after the split has been made. If the amount of information required is much lower after the split is made then that split has decreased the disorder of the original single segment. Gain is defined as the difference between the entropy of the original segment and the accumulated entropies of the resulting split segments.

ID3 was later enhanced in the version called C4.5. C4.5 improves on ID3 in several important areas:

- predictors with missing values can still be used
- predictors with continuous values can be used
- pruning is introduced
- rule derivation

Many of these techniques appear in the CART algorithm plus some others so we will go through this introduction in the CART algorithm.

CART - Growing a forest and picking the best tree

CART stands for Classification and Regression Trees and is a data exploration and prediction algorithm developed by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone and is nicely detailed in their 1984 book "Classification and Regression Trees" [Breiman, Friedman, Olshen and Stone 1984]. These researchers from Stanford University and the University of California at Berkeley showed how this new algorithm could be used on a variety of different problems, including the detection of chlorine from the data contained in a mass spectrum.

Predictors are picked as they decrease the disorder of the data.

In building the CART tree each predictor is picked based on how well it teases apart the records with different predictions. For instance one measure that is used to determine whether a given split point for a given predictor is better than another is the entropy metric. The measure originated from the work done by Claude Shannon and Warren Weaver on information theory in 1949. They were concerned with how information could be efficiently communicated over telephone lines. Interestingly, their results also prove useful in creating decision trees.

CART Automatically Validates the Tree

One of the great advantages of CART is that the algorithm has the validation of the model and the discovery of the optimally general model built deeply into the algorithm. CART accomplishes this by building a very complex tree and then pruning it back to the optimally general tree based on the results of cross validation or test set validation. The tree is pruned back based on the performance of the various pruned versions of the tree on the test set data. The most complex tree rarely fares the best on the held aside data as it has been overfitted to the training data. By using cross validation the tree that is most likely to do well on new, unseen data can be chosen.
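The gain calculation described for ID3 can be illustrated with a short sketch. It assumes the usual reading of "accumulated entropies", namely the entropies of the split segments weighted by their share of the records; the function names and the example segments are hypothetical.

```python
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels)
    )

def information_gain(parent, segments):
    """Entropy of the original segment minus the size-weighted entropy
    of the segments produced by the split."""
    n = len(parent)
    after = sum(len(s) / n * entropy(s) for s in segments)
    return entropy(parent) - after

# A segment of 100 customers, half churners and half non-churners.
parent = ["churn"] * 50 + ["stay"] * 50

# A question that separates the two groups completely.
perfect = [["churn"] * 50, ["stay"] * 50]

# A question that only shifts the mix to 40%/60% in each segment.
weak = [["churn"] * 20 + ["stay"] * 30, ["churn"] * 30 + ["stay"] * 20]

print(information_gain(parent, perfect))
print(information_gain(parent, weak))
```

The perfect split removes all of the disorder, so its gain equals the full entropy of the original segment, while the 40%/60% split yields only a small gain. This is the sense in which one question is "better" than another.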

CART Surrogates handle missing data

The CART algorithm is relatively robust with respect to missing data. If the value is missing for a particular predictor in a particular record, that record will not be used in making the determination of the optimal split when the tree is being built. In effect CART will utilize as much information as it has on hand in order to make the decision for picking the best possible split.

When CART is being used to predict on new data, missing values can be handled via surrogates. Surrogates are split values and predictors that mimic the actual split in the tree and can be used when the data for the preferred predictor is missing. For instance though shoe size is not a perfect predictor of height it could be used as a surrogate to try to mimic a split based on height when that information was missing from the particular record being predicted with the CART model.

CHAID

Another decision tree technology, equally popular as CART, is CHAID or Chi-Square Automatic Interaction Detector. CHAID is similar to CART in that it builds a decision tree but it differs in the way that it chooses its splits. Instead of the entropy or Gini metrics for choosing optimal splits the technique relies on the chi square test used in contingency tables to determine which categorical predictor is furthest from independence with the prediction values.

Because CHAID relies on the contingency tables to form its test of significance for each predictor, all predictors must either be categorical or be coerced into a categorical form via binning (e.g. break up possible people ages into 10 bins from 0-9, 10-19, 20-29 etc.). Though this binning can have deleterious consequences, the actual accuracy performances of CART and CHAID have been shown to be comparable in real world direct marketing response models.

2.8 Association rules (Rule Induction)

Association rules or Rule induction is one of the major forms of data mining and is perhaps the most common form of knowledge discovery in unsupervised learning systems. It is also perhaps the form of data mining that most closely resembles the process that most people think about when they think about data mining, namely "mining" for gold through a vast database. The gold in this case would be a rule that is interesting, one that tells you something about your database that you didn't already know and probably weren't able to explicitly articulate (aside from saying "show me things that are interesting").

Rule induction on a database can be a massive undertaking where all possible patterns are systematically pulled out of the data and then an accuracy and significance are added to them that tell the user how strong the pattern is and how likely it is to occur again. In general these rules are relatively simple. For a market basket database of items scanned in a consumer market basket you might find interesting correlations in your database such as:

- If bagels are purchased then cream cheese is purchased 90% of the time and this pattern occurs in 3% of all shopping baskets.
- If live plants are purchased from a hardware store then plant fertilizer is purchased 60% of the time and these two items are bought together in 6% of the shopping baskets.

The rules that are pulled from the database are extracted and ordered to be presented to the user based on the percentage of times that they are correct and how often they apply.

The bane of rule induction systems is also their strength: they retrieve all possible interesting patterns in the database. This is a strength in the sense that it leaves no stone unturned but it can also be viewed as a weakness because the user can easily become overwhelmed with such a large number of rules that it is difficult to look through all of them. You almost need a second pass of data mining to go through the list of interesting rules that have been generated by the rule induction system in the first place in order to find the most valuable gold nugget amongst them all. This overabundance of patterns can also be problematic for the simple task of prediction: because all possible patterns are culled from the database there may be conflicting predictions made by equally interesting rules. Automating the process of culling the most interesting rules and of combining the recommendations of a variety of rules is well handled by many of the commercially available rule induction systems on the market today and is also an area of active research.

Applying Rule induction to Business

Rule induction systems are highly automated and are probably the best of data mining techniques for exposing all possible predictive patterns in a database. They can be modified for use in prediction problems but the algorithms for combining evidence from a variety of rules come more from rules of thumb and practical experience.

In comparing data mining techniques along an axis of explanation, neural networks would be at one extreme of the data mining algorithms and rule induction systems at the other end. Neural networks are extremely proficient at saying exactly what must be done in a prediction task (e.g. who do I give credit to / who do I deny credit to) with little explanation. Rule induction systems when used for prediction on the other hand are like having a committee of trusted advisors each with a slightly different opinion as to what to do but relatively well grounded reasoning and a good explanation for why it should be done.

The business value of rule induction techniques reflects the highly automated way in which the rules are created, which makes it easy to use the system, but this approach can also suffer from an overabundance of interesting patterns, which can make it complicated to make a prediction that is directly tied to return on investment (ROI).

What is a rule?

In rule induction systems the rule itself is of a simple form of "if this and this and this then this". For example a rule that a supermarket might find in their data collected from scanners would be: "if pickles are purchased then ketchup is purchased".
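The two percentages attached to each rule, how often it is correct and how often it applies, can be computed directly from basket data. The sketch below is a hypothetical illustration: the baskets are invented, and "how often it applies" is taken to be the share of baskets that contain the left-hand side of the rule (some tools instead count baskets containing both sides).

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (accuracy, coverage) for the rule 'if antecedent then consequent'."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets in which the 'if' part of the rule holds.
    applies = [b for b in baskets if antecedent <= b]
    # Of those, the baskets in which the 'then' part also holds.
    correct = [b for b in applies if consequent <= b]
    coverage = len(applies) / len(baskets)
    accuracy = len(correct) / len(applies) if applies else 0.0
    return accuracy, coverage

# Hypothetical shopping baskets.
baskets = [
    {"bagels", "cream cheese", "milk"},
    {"bagels", "cream cheese"},
    {"bagels", "juice"},
    {"milk", "bread"},
    {"bread", "cream cheese"},
]
accuracy, coverage = rule_stats(baskets, {"bagels"}, {"cream cheese"})
print(f"accuracy={accuracy:.0%} coverage={coverage:.0%}")
```

Here the rule "if bagels then cream cheese" applies to three of the five baskets and is correct in two of those three. Sorting a list of rules by these two numbers is one simple way to order them for presentation to the user.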

- If paper plates then plastic forks
- If dip then potato chips
- If salsa then tortilla chips

In order for the rules to be useful, two pieces of information must be supplied along with the actual rule:
- Accuracy - How often is the rule correct?
- Coverage - How often does the rule apply?

Just because a pattern in the database is expressed as a rule does not mean that it is true all the time. Thus, just as in other data mining algorithms, it is important to recognize and make explicit the uncertainty in the rule. This is what the accuracy of the rule means. The coverage of the rule has to do with how much of the database the rule "covers" or applies to. Examples of these two measures for a variety of rules are shown in Table 2.2.

In some cases accuracy is called the confidence of the rule and coverage is called the support. Accuracy and coverage appear to be the preferred ways of naming these two measurements.

Rule                                                        Accuracy   Coverage
If breakfast cereal purchased then milk purchased.          85%        20%
If bread purchased then swiss cheese purchased.             15%        6%
If 42 years old and purchased pretzels and purchased
dry roasted peanuts then beer will be purchased.            95%        0.01%
Table 2.2 Examples of Rule Accuracy and Coverage

The rules themselves consist of two halves. The left hand side is called the antecedent and the right hand side is called the consequent. The antecedent can consist of just one condition or of multiple conditions which must all be true in order for the consequent to be true at the given accuracy. Generally the consequent is just a single condition (the prediction of purchasing just one grocery store item) rather than multiple conditions. Thus rules such as "if x and y then a and b and c" are not the norm.

What to do with a rule
When the rules are mined out of the database they can be used either for better understanding the business problems that the data reflects or for performing actual predictions against some predefined prediction target. Since a rule has both a left side and a right side (antecedent and consequent), rules can be used in several ways for your business.

Target the antecedent. In this case all rules that have a certain value for the antecedent are gathered and displayed to the user. For instance, a grocery store may request all rules that have nails, bolts or screws as the antecedent in order to understand whether discontinuing the sale of these low margin items will have any effect on other, higher margin items. For instance, maybe people who buy nails also buy expensive hammers but wouldn’t do so at the store if the nails were not available.

Target the consequent. In this case all rules that have a certain value for the consequent can be used to understand what is associated with the consequent and perhaps what affects the consequent. For instance, it might be useful to know all of the interesting rules that have "coffee" in their consequent. These may well be the rules that affect the purchases of coffee, and they identify items that a store owner may want to put close to the coffee in order to increase the sale of both. Or it might be the rule that the coffee manufacturer uses to determine in which magazine to place its next coupons.

Target based on accuracy. Sometimes the most important thing for a user is the accuracy of the rules that are being generated. Highly accurate rules of 80% or 90% imply strong relationships that can be exploited even if they have low coverage of the database and only occur a limited number of times. For instance, a rule that has only 0.1% coverage but 95% accuracy can only be applied one time out of one thousand, but it will very likely be correct. If this one time is highly profitable then it can be worthwhile. This, for instance, is how some of the most successful data mining applications work in the financial markets: looking for that limited amount of time where a very confident prediction can be made.

Target based on coverage. Sometimes users want to know what the most ubiquitous rules are, or which rules are most readily applicable. By looking at rules ranked by coverage they can quickly get a high level view of what is happening within their database most of the time.

Target based on "interestingness". Rules are interesting when they have high coverage and high accuracy and deviate from the norm.
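The targeting strategies described above amount to filtering and sorting a list of rules. The following is a minimal sketch under stated assumptions: the `Rule` type and the helper names are hypothetical and introduced only for illustration, and the three rules are the ones listed in Table 2.2.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical container for one induced rule (illustration only)."""
    antecedent: frozenset  # all conditions on the "if" side
    consequent: str        # single condition on the "then" side
    accuracy: float        # fraction of the time the rule is correct
    coverage: float        # fraction of the database the rule applies to


# The three rules of Table 2.2.
RULES = [
    Rule(frozenset({"breakfast cereal"}), "milk", 0.85, 0.20),
    Rule(frozenset({"bread"}), "swiss cheese", 0.15, 0.06),
    Rule(frozenset({"42 years old", "pretzels", "dry roasted peanuts"}),
         "beer", 0.95, 0.0001),
]


def with_antecedent(rules, item):
    """Target the antecedent: rules whose 'if' side mentions the item."""
    return [r for r in rules if item in r.antecedent]


def with_consequent(rules, item):
    """Target the consequent: rules that predict the item."""
    return [r for r in rules if r.consequent == item]


def ranked(rules, measure):
    """Target by 'accuracy' or 'coverage': best rule first."""
    return sorted(rules, key=lambda r: getattr(r, measure), reverse=True)


most_accurate = ranked(RULES, "accuracy")[0]   # the beer rule
most_applicable = ranked(RULES, "coverage")[0]  # the cereal-and-milk rule
```

Ranking by accuracy puts the narrow beer rule first, while ranking by coverage puts the cereal-and-milk rule first, which mirrors the trade-off the text describes.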
Rules have been ranked by many different measures of interestingness so that the trade off between coverage and accuracy can be made. Since rule induction systems are so often used for pattern discovery and unsupervised learning, it is less easy to compare them. For example, it is very easy for just about any rule induction system to generate all possible rules. It is, however, much more difficult to devise a way to present those rules (which could easily number in the hundreds of thousands) in a way that is most useful to the end user. When interesting rules are found, they have usually been created to find relationships between many different predictor values in the database, not just one well defined target of the prediction. For this reason it is often much more difficult to assign a measure of value to a rule aside from its interestingness. For instance, it would be difficult to determine the monetary value of knowing that if people buy breakfast sausage they also buy eggs 60% of the time. For data mining systems that are more focused on prediction, for things like customer attrition, targeted marketing response or risk, it is much easier to measure the value of the system and compare it to other systems and other methods for solving the problem.

Caveat: Rules do not imply causality
It is important to recognize that even though the patterns produced by rule induction systems are delivered as if-then rules, they do not necessarily mean that the left hand side of the rule (the "if" part) causes the right hand side of the rule (the "then" part) to happen. Purchasing cheese does not cause the purchase of wine, even though the rule "if cheese then wine" may be very strong.

This is particularly important to remember for rule induction systems because the results are presented in the form "if this then that", which is the same form in which many causal relationships are presented.

Types of databases used for rule induction
Typically rule induction is used on databases with either fields of high cardinality (many different values) or many columns of binary fields. The classical case is supermarket basket data from store scanners, which contains individual product names and quantities and may contain tens of thousands of different items with different packaging that create hundreds of thousands of SKU (Stock Keeping Unit) identifiers. Sometimes in these databases the concept of a record is not easily defined. Consider the typical star schema for many data warehouses, which stores the supermarket transactions as separate entries in the fact table. The columns in the fact table are some unique identifier of the shopping basket (so that all items can be noted as being in the same shopping basket), the quantity, the time of purchase, and whether the item was purchased with a special promotion (sale or coupon). Thus each item in the shopping basket has a different row in the fact table. This layout is not typically the best for most data mining algorithms, which would prefer to have the data structured as one row per shopping basket with each column representing the presence or absence of a given item. This can be an expensive way to store the data, however, since the typical grocery store carries 60,000 SKUs, or different items that could come across the checkout counter. This structure of the records can also create a very high dimensional space (60,000 binary dimensions), which would be unwieldy for many classical data mining algorithms like neural networks and decision trees. As we’ll see, several tricks are played to make this computationally feasible for the data mining algorithm while not requiring a massive reorganization of the database.
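The difference between the two layouts can be shown with a small sketch. Everything here is an assumption made for illustration: the sample rows, the three-field row layout (basket identifier, item, quantity) and the function names are hypothetical, and a real fact table would also carry the time of purchase and promotion columns.

```python
from collections import defaultdict

# Hypothetical fact-table rows: one row per item per basket.
fact_rows = [
    (1, "bread", 1),
    (1, "milk", 2),
    (2, "bread", 1),
    (2, "butter", 1),
    (2, "milk", 1),
    (3, "eggs", 12),
]


def baskets_from_fact_rows(rows):
    """Group fact-table rows into one set of items per basket."""
    baskets = defaultdict(set)
    for basket_id, item, _quantity in rows:
        baskets[basket_id].add(item)
    return dict(baskets)


def to_binary_matrix(baskets):
    """One row per basket, one 0/1 column per distinct item."""
    items = sorted({item for contents in baskets.values() for item in contents})
    matrix = {
        basket_id: [1 if item in contents else 0 for item in items]
        for basket_id, contents in baskets.items()
    }
    return items, matrix


baskets = baskets_from_fact_rows(fact_rows)
items, matrix = to_binary_matrix(baskets)
# items is ["bread", "butter", "eggs", "milk"]; basket 1 becomes [1, 0, 0, 1]
```

With 60,000 SKUs each basket row would hold 60,000 mostly zero columns, which is why the set-per-basket form in `baskets` is usually the cheaper representation to keep in memory.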

The claim to fame of these rule induction systems lies much more in knowledge discovery in unsupervised learning systems than in prediction. These systems provide both a very detailed view of the data, where significant patterns occur only a small portion of the time and can only be found by looking at the detail data, and a broad overview of the data, where some systems seek to deliver to the user an overall view of the patterns contained in the database. These systems thus display a nice combination of both micro and macro views:

Macro Level - Patterns that cover many situations are provided to the user. They can be used very often and with great confidence, and they can also be used to summarize the database.

Micro Level - Strong rules that cover only a very few situations can still be retrieved by the system and proposed to the end user. These may be valuable if the situations that are covered are highly valuable (maybe they only apply to the most profitable customers) or represent a small but growing subpopulation which may indicate a market shift or the emergence of a new competitor (e.g. customers are only being lost in one particular area of the country where a new competitor is emerging).


After the rules are created and their interestingness is measured there is also a call for performing prediction with the rules. Each rule by itself can perform prediction: the consequent is the target and the accuracy of the rule is the accuracy of the prediction. But because rule induction systems produce many rules for a given antecedent or consequent, there can be conflicting predictions with different accuracies. This is an opportunity for improving the overall performance of the systems by combining the rules. This can be done in a variety of ways, for example by summing the accuracies as if they were weights or just by taking the prediction of the rule with the maximum accuracy.

Table 2.3 shows how a given consequent or antecedent can be part of many rules with different accuracies and coverages. From this example consider the problem of trying to predict whether milk was purchased based solely on the other items that were in the shopping basket. If the shopping basket contained only bread then from the table we would guess that there was a 35% chance that milk was also purchased. If, however, bread and butter and eggs and cheese were purchased, what would be the prediction for milk then? A 65% chance of milk, because the relationship between butter and milk is the greatest at 65%? Or would all of the other items in the basket increase the chance of milk being purchased even further, to well beyond 65%? Determining how to combine evidence from multiple rules is a key part of the algorithms for using rules for prediction.

Antecedent   Consequent     Accuracy   Coverage
bagels       cream cheese   80%        5%
bagels       orange juice   40%        3%
bagels       coffee         40%        2%
bagels       eggs           25%        2%
bread        milk           35%        30%
butter       milk           65%        20%
eggs         milk           35%        15%
cheese       milk           40%        8%
Table 2.3 Accuracy and Coverage in Rule Antecedents and Consequents
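The two combination strategies mentioned above, taking the most accurate matching rule or summing the accuracies as weights, can be sketched as follows. This is only an illustration: the function names are hypothetical, the milk rules are taken from Table 2.3, and how a summed weight should be turned back into a probability is left open, just as it is in the text.

```python
# Rules from Table 2.3 with "milk" as the consequent: antecedent -> accuracy.
MILK_RULES = {"bread": 0.35, "butter": 0.65, "eggs": 0.35, "cheese": 0.40}


def matching_accuracies(basket, rules):
    """Accuracies of every rule whose antecedent item is in the basket."""
    return [accuracy for item, accuracy in rules.items() if item in basket]


def predict_by_max(basket, rules):
    """Use only the single most accurate matching rule (None if no rule fires)."""
    accuracies = matching_accuracies(basket, rules)
    return max(accuracies) if accuracies else None


def summed_weight(basket, rules):
    """Sum the accuracies as if they were weights (a score, not a probability)."""
    return sum(matching_accuracies(basket, rules))


only_bread = predict_by_max({"bread"}, MILK_RULES)
full_basket = {"bread", "butter", "eggs", "cheese"}
best_rule = predict_by_max(full_basket, MILK_RULES)
evidence = summed_weight(full_basket, MILK_RULES)
```

For the bread-only basket the prediction is 0.35. For the full basket the maximum rule gives 0.65 while the summed weight is about 1.75, which exceeds 1 and therefore has to be compared against the evidence for competing consequents rather than read as a chance of milk.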

The General Idea
The general idea of a rule classification system is that rules are created that show the relationship between events captured in your database. These rules can be simple, with just one element in the antecedent, or they might be more complicated, with many column-value pairs in the antecedent all joined together by a conjunction (item1 and item2 and item3 … must all occur for the antecedent to be true). The rules are used to find interesting patterns in the database, but they are also used at times for prediction. There are two main things that are important to understanding a rule:

Accuracy - Accuracy refers to the probability that if the antecedent is true then the consequent will be true. High accuracy means that this is a rule that is highly dependable.

Coverage - Coverage refers to the number of records in the database that the rule applies to. High coverage means that the rule can be used very often and also that it is less likely to be a spurious artifact of the sampling technique or idiosyncrasies of the database.


The business importance of accuracy and coverage
From a business perspective accurate rules are important because they imply that there is useful predictive information in the database that can be exploited, namely that there is something far from independent between the antecedent and the consequent. The lower the accuracy, the closer the rule comes to just random guessing. If the accuracy is significantly below what would be expected from random guessing then the negation of the antecedent may well in fact be useful (for instance, people who buy denture adhesive are much less likely to buy fresh corn on the cob than normal).

From a business perspective coverage implies how often you can use a useful rule. For instance, you may have a rule that is 100% accurate but is only applicable in 1 out of every 100,000 shopping baskets. You can rearrange your shelf space to take advantage of this fact, but it will not make you much money since the event is not very likely to happen. Table 2.4 displays the trade off between coverage and accuracy.

                 Accuracy Low                   Accuracy High
Coverage High    Rule is rarely correct but     Rule is often correct and
                 can be used often.             can be used often.
Coverage Low     Rule is rarely correct and     Rule is often correct but
                 can be only rarely used.       can be only rarely used.
Table 2.4 Rule coverage versus accuracy.

Trading off accuracy and coverage is like betting at the track
An analogy between coverage and accuracy and making money comes from betting on horses. Having a high accuracy rule with low coverage would be like owning a race horse that always won when he raced but could only race once a year. In betting, you could probably still make a lot of money on such a horse. In rule induction for retail stores it is unlikely that finding that one rule between mayonnaise, ice cream and sardines that seems to always be true will have much of an impact on your bottom line.

How to evaluate the rule
One way to look at accuracy and coverage is to see how they relate to some simple statistics and how they can be represented graphically. From statistics, accuracy is just the probability of the consequent conditional on the antecedent, and coverage as calculated here is the a priori probability of the antecedent. (Some authors instead define coverage, or support, as the probability of the antecedent and the consequent occurring at the same time.) So, for instance, if we were looking at a database of supermarket basket scanner data we would need the following information in order to calculate the accuracy and coverage for a simple rule (let’s say milk purchase implies eggs purchased).

T = 100 = Total number of shopping baskets in the database.
E = 30 = Number of baskets with eggs in them.
M = 40 = Number of baskets with milk in them.
B = 20 = Number of baskets with both eggs and milk in them.

Accuracy is then just the number of baskets with eggs and milk in them divided by the number of baskets with milk in them. In this case that would be 20/40 = 50%. The coverage would be the number of baskets with milk in them divided by the total number of baskets. This would be 40/100 = 40%. This can be seen graphically in Figure 2.5.
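The arithmetic for the rule "If Milk then Eggs" can be written out as a short sketch. The function name `rule_stats` is hypothetical, and coverage is computed from the antecedent alone, which is the convention this worked example uses.

```python
def rule_stats(total, antecedent_count, both_count):
    """Return (accuracy, coverage) for a rule 'if antecedent then consequent'.

    accuracy = baskets with antecedent and consequent / baskets with antecedent
    coverage = baskets with antecedent / all baskets
    """
    accuracy = both_count / antecedent_count
    coverage = antecedent_count / total
    return accuracy, coverage


T = 100  # total number of shopping baskets
M = 40   # baskets with milk (the antecedent)
B = 20   # baskets with both eggs and milk

accuracy, coverage = rule_stats(T, M, B)
# accuracy is 20/40 = 0.5 and coverage is 40/100 = 0.4
```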

Figure 2.5 Graphically the total number of shopping baskets can be represented in a space and the number of baskets containing eggs or milk can be represented by the area of a circle. The coverage of the rule "If Milk then Eggs" is just the relative size of the circle corresponding to milk. The accuracy is the size of the overlap between the two circles relative to the circle representing milk purchased.

Notice that we haven’t used E, the number of baskets with eggs, in these calculations. One way that eggs could be used would be to calculate the expected number of baskets with eggs and milk in them based on the independence of the events. This would give us some sense of how unlikely and how special the event is that 20% of the baskets have both eggs and milk in them. Remember from the statistics section that if two events are independent (have no effect on one another) then the product of their individual probabilities of occurrence should equal the probability of both occurring together.

If the purchases of eggs and milk were independent of each other one would expect that 0.3 x 0.4 = 0.12, or 12% of the time, we would see shopping baskets with both eggs and milk in them. The fact that this combination of products occurs 20% of the time would be out of the ordinary if these events were independent. That is to say, there is a good chance that the purchase of one affects the other, and the degree to which this is the case could be calculated through statistical tests and hypothesis testing.

Defining "interestingness"
One of the biggest problems with rule induction systems is the sometimes overwhelming number of rules that are produced, most of which have no practical value or interest. Some of the rules are so inaccurate that they cannot be used, some have so little coverage that though they are interesting they have little applicability, and finally many of the rules capture patterns and information that the user is already familiar with. To combat this problem researchers have sought to measure the usefulness or interestingness of rules.

Certainly any measure of interestingness would have something to do with accuracy and coverage. We might also expect it to have at least the following four basic behaviors:

- Interestingness = 0 if the accuracy of the rule is equal to the background accuracy (the a priori probability of the consequent). Table 2.5 shows an example of this, where a rule for attrition is no better than just guessing the overall rate of attrition.
- Interestingness increases as accuracy increases (or decreases with decreasing accuracy) if the coverage is fixed.
- Interestingness increases or decreases with coverage if accuracy stays fixed.
- Interestingness decreases with coverage for a fixed number of correct responses (remember that accuracy equals the number of correct responses divided by the coverage).

Antecedent                                          Consequent                   Accuracy   Coverage
<no constraints>                                    then customer will attrite   10%        100%
If customer balance > $3,000                        then customer will attrite   10%        60%
If customer eyes = blue                             then customer will attrite   10%        30%
If customer social security number = 144 30 8217    then customer will attrite   100%       0.000001%
Table 2.5 Uninteresting rules

There are a variety of measures of interestingness in use that have these general characteristics. They are used for pruning back the total possible number of rules that might be generated and then presented to the user.

Other measures of usefulness
Another important measure is the simplicity of the rule. This is important solely for the end user, since complex rules, as powerful and as interesting as they might be, may be difficult to understand or to confirm via intuition. Thus the user has a desire to see simpler rules, and this desire can be reflected directly in the rules that are chosen and supplied automatically to the user.

Finally, a measure of novelty is also required during the creation of the rules, so that rules that are redundant but strong are less favored in the search than rules that may not be as strong but cover important examples that are not covered by other strong rules. For instance, there may be few historical records to provide rules on a little sold grocery item (e.g. mint jelly) and they may have low accuracy, but since there are so few possible rules, even though they are not interesting they will be "novel" and should be retained and presented to the user for that reason alone.
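The independence check and the four expected behaviors can be tried out with one simple candidate measure. The formula below, coverage multiplied by the gap between the rule's accuracy and the background accuracy, is an assumption chosen for illustration and is not the measure used by any particular system. The rule labels are shortened versions of the Table 2.5 antecedents.

```python
def interestingness(accuracy, coverage, background):
    """Candidate measure: coverage * (accuracy - background accuracy)."""
    return coverage * (accuracy - background)


# Eggs and milk example: T baskets, E with eggs, M with milk, B with both.
T, E, M, B = 100, 30, 40, 20
expected_both = (E / T) * (M / T)  # 0.3 x 0.4 if purchases were independent
observed_both = B / T              # what the database actually shows
milk_then_eggs = interestingness(B / M, M / T, E / T)

# Table 2.5: the overall attrition rate (background accuracy) is 10%.
BACKGROUND = 0.10
TABLE_2_5 = [
    ("<no constraints>", 0.10, 1.00),
    ("balance > $3,000", 0.10, 0.60),
    ("eyes = blue", 0.10, 0.30),
    ("social security number = 144 30 8217", 1.00, 0.00000001),
]
scores = {
    label: interestingness(accuracy, coverage, BACKGROUND)
    for label, accuracy, coverage in TABLE_2_5
}
```

Under this measure the first three rules of Table 2.5 score exactly zero, and the social security number rule scores only a tiny positive value because its coverage is so small. For the milk and eggs rule the score equals the gap between the observed 20% and the 12% expected under independence.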

2.9 Neural Networks

What is a Neural Network?
When data mining algorithms are talked about these days, most of the time people are talking about either decision trees or neural networks. Of the two, neural networks have probably been of greater interest through the formative stages of data mining technology. As we will see, neural networks do have disadvantages that can be limiting in their ease of use and ease of deployment, but they also have some significant advantages. Foremost among these advantages is their highly accurate predictive models, which can be applied across a large number of different types of problems.

To be more precise with the term "neural network" one might better speak of an "artificial neural network". True neural networks are biological systems (a.k.a. brains) that detect patterns, make predictions and learn. The artificial ones are computer programs implementing sophisticated pattern detection and machine learning algorithms to build predictive models from large historical databases. Artificial neural networks derive their name from their historical development, which started off with the premise that machines could be made to "think" if scientists found ways to mimic the structure and functioning of the human brain on the computer. Thus historically neural networks grew out of the community of Artificial Intelligence rather than from the discipline of statistics. Despite the fact that scientists are still far from understanding the human brain, let alone mimicking it, neural networks that run on computers can do some of the things that people can do.

It is difficult to say exactly when the first "neural network" on a computer was built. During World War II a seminal paper was published by McCulloch and Pitts which first outlined the idea that simple processing units (like the individual neurons in the human brain) could be connected together in large networks to create a system that could solve difficult problems and display behavior that was much more complex than the simple pieces that made it up. Since that time much progress has been made in finding ways to apply artificial neural networks to real world prediction problems and in improving the performance of the algorithm in general. In many respects the greatest breakthroughs in neural networks in recent years have been in their application to more mundane real world problems like customer response prediction or fraud detection, rather than the loftier goals that were originally set out for the techniques, such as overall human learning and computer speech and image understanding.

Don’t Neural Networks Learn to make better predictions?
Because of the origins of the techniques and because of some of their early successes, the techniques have enjoyed a great deal of interest. To understand how neural networks can detect patterns in a database, an analogy is often made that they "learn" to detect these patterns and make better predictions in a similar way to the way that human beings do. This view is encouraged by the way the historical training data is often supplied to the network: one record (example) at a time. Neural networks do "learn" in a very real sense, but under the hood the algorithms and techniques that are being deployed are not truly different from the techniques found in statistics or other data mining algorithms. It is, for instance, unfair to assume that neural networks could outperform other techniques because they "learn" and improve over time while the other techniques are static. The other techniques in fact "learn" from historical examples in exactly the same way, but oftentimes the examples (historical records) to learn from are processed all at once in a more efficient manner than in neural networks, which often modify their model one record at a time.

Are Neural Networks easy to use?
A common claim for neural networks is that they are automated to a degree where the user does not need to know much about how they work, or about predictive modeling or even the database, in order to use them. The implicit claim is also that most neural networks can be unleashed on your data straight out of the box without having to rearrange or modify the data very much to begin with.

Just the opposite is often true. There are many important design decisions that need to be made in order to effectively use a neural network, such as:
- How should the nodes in the network be connected?
- How many neuron-like processing units should be used?
- When should "training" be stopped in order to avoid overfitting?

There are also many important steps required for preprocessing the data that goes into a neural network. Most often there is a requirement to normalize numeric data between 0.0 and 1.0, and categorical predictors may need to be broken up into virtual predictors that are 0 or 1 for each value of the original categorical predictor. And, as always, understanding what the data in your database means and having a clear definition of the business problem to be solved are essential to ensuring eventual success. The bottom line is that neural networks provide no short cuts.

Applying Neural Networks to Business
Neural networks are very powerful predictive modeling techniques, but some of the power comes at the expense of ease of use and ease of deployment. As we will see in this section, neural networks create very complex models that are almost always impossible to fully understand, even by experts. The model itself is represented by numeric values in a complex calculation that requires all of the predictor values to be in the form of a number. The output of the neural network is also numeric and needs to be translated if the actual prediction value is categorical (e.g. predicting the demand for blue, white or black jeans for a clothing manufacturer requires that the predictor values blue, black and white for the predictor color be converted to numbers).

Because of the complexity of these techniques much effort has been expended in trying to increase the clarity with which the model can be understood by the end user. These efforts are still in their infancy but are of tremendous importance, since most data mining techniques including neural networks are being deployed against real business problems where significant investments are made based on the predictions from the models (for instance, consider trusting the predictive model from a neural network that dictates which one million customers will receive a $1 mailing).
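The two preprocessing steps named above, scaling numeric predictors into the range 0.0 to 1.0 and splitting a categorical predictor into 0/1 virtual predictors, can be sketched as follows. The function names and the sample values are assumptions made for illustration, and min-max scaling is only one common way to normalize.

```python
def min_max_normalize(values):
    """Scale numeric values linearly so the smallest is 0.0 and the largest 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: no spread to scale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def one_hot(values):
    """Turn a categorical predictor into one 0/1 virtual predictor per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


scaled = min_max_normalize([0, 50, 100])
categories, encoded = one_hot(["blue", "white", "black", "blue"])
# categories is ["black", "blue", "white"]; "blue" becomes [0, 1, 0]
```

Each record then carries one numeric input per virtual predictor, which is the all-numbers form that the text says the model's calculation requires.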

There are two ways that these shortcomings in understanding the meaning of the neural network model have been successfully addressed:
- The neural network is packaged up into a complete solution such as fraud prediction. This allows the neural network to be carefully crafted for one particular application, and once it has been proven successful it can be used over and over again without requiring a deep understanding of how it works.
- The neural network is packaged up with expert consulting services. Here the neural network is deployed by trusted experts who have a track record of success. Either the experts are able to explain the models or they are trusted that the models do work.

The first tactic has seemed to work quite well, because when the technique is used for a well defined problem many of the difficulties in preprocessing the data can be automated (because the data structures have been seen before), and interpretation of the model is less of an issue since entire industries begin to use the technology successfully and a level of trust is created. There are several vendors who have deployed this strategy (e.g. HNC’s Falcon system for credit card fraud prediction and Advanced Software Applications’ ModelMAX package for direct marketing).

Packaging up neural networks with expert consultants is also a viable strategy that avoids many of the pitfalls of using neural networks, but it can be quite expensive because it is human intensive. One of the great promises of data mining is, after all, the automation of the predictive modeling process. These neural network consulting teams are little different from the analytical departments many companies already have in house. Since there is not a great difference in the overall predictive accuracy of neural networks over standard statistical techniques, the main difference becomes the replacement of the statistical expert with the neural network expert. With either statistics or neural network experts, the value of putting easy to use tools into the hands of the business end user is still not achieved.

Where to Use Neural Networks
Neural networks are used in a wide variety of applications. They have been used in all facets of business, from detecting the fraudulent use of credit cards and credit risk prediction to increasing the hit rate of targeted mailings. They also have a long history of application in other areas, ranging from the military, for the automated driving of an unmanned vehicle at 30 miles per hour on paved roads, to biological simulations such as learning the correct pronunciation of English words from written text.

Neural Networks for clustering
Neural networks of various kinds can be used for clustering and prototype creation. The Kohonen network described in this section is probably the most common network used for clustering and segmentation of the database. Typically the networks are used in an unsupervised learning mode to create the clusters. The clusters are created by forcing the system to compress the data by creating prototypes, or by algorithms that steer the system toward creating clusters that compete against each other for the records that they contain, thus ensuring that the clusters overlap as little as possible.

Neural Networks for Outlier Analysis
Sometimes clustering is performed not so much to keep records together as to make it easier to see when one record sticks out from the rest. For instance:
- Most wine distributors selling inexpensive wine in Missouri that ship a certain volume of product produce a certain level of profit. There is a cluster of stores that can be formed with these characteristics. One store stands out, however, as producing significantly lower profit. On closer examination it turns out that the distributor was delivering product to but not collecting payment from one of their customers.
- A sale on men’s suits is being held in all branches of a department store for southern California. All stores with these characteristics have seen at least a 100% jump in revenue since the start of the sale except one. It turns out that this store had, unlike the others, advertised via radio rather than television.

Neural Networks for feature extraction
One of the important problems in all of data mining is that of determining which predictors are the most relevant and the most important in building models that are most accurate at prediction. These predictors may be used by themselves or they may be used in conjunction with other predictors to form "features". A simple example of a feature in problems that neural networks are working on is a vertical line in a computer image. The predictors, or raw input data, are just the colored pixels that make up the picture. Recognizing that the predictors (pixels) can be organized in such a way as to create lines, and then using the line as the input predictor, can dramatically improve the accuracy of the model and decrease the time needed to create it.

Some features, like lines in computer images, are things that humans are already pretty good at detecting. In other problem domains it is more difficult to recognize the features. One novel way that neural networks have been used to detect features rests on the idea that features are a sort of compression of the training database. For instance, you could describe an image to a friend by rattling off the color and intensity of each pixel at every point in the picture, or you could describe it at a higher level in terms of lines and circles, or maybe even at a higher level of features such as trees, mountains etc. In either case your friend eventually gets all the information needed to know what the picture looks like, but describing it in terms of high level features certainly requires much less communication of information than the "paint by numbers" approach of describing the color on each square millimeter of the image.

If we think of features in this way, as an efficient way to communicate our data, then neural networks can be used to automatically extract them. The neural network shown in Figure 2.2 is used to extract features by requiring the network to learn to recreate the input data at the output nodes by using just 5 hidden nodes. Consider that if you were allowed 100 hidden nodes, recreating the data would be rather trivial for the network: simply pass the input node value directly through the corresponding hidden node and on to the output node. But as there are fewer and fewer hidden nodes, that information has to be passed through the hidden layer in a more and more efficient manner, since there are fewer hidden nodes to help pass along the information.

In order to accomplish this the neural network tries to have the hidden nodes extract features from the input nodes that efficiently describe the record represented at the input layer. This forced "squeezing" of the data through the narrow hidden layer forces the neural network to extract only those predictors and combinations of predictors that are best at recreating the input record. The link weights used to create the inputs to the hidden nodes are effectively creating features that are combinations of the input node values.

Figure 2.2 Neural networks can be used for data compression and feature extraction.

What does a neural net look like?

A neural network is loosely based on how some people believe that the human brain is organized and how it learns. There are two main structures of consequence in the neural network:

The node - which loosely corresponds to the neuron in the human brain.
The link - which loosely corresponds to the connections between neurons (axons, dendrites and synapses) in the human brain.

In Figure 2.3 there is a drawing of a simple neural network. The round circles represent the nodes and the connecting lines represent the links. The neural network functions by accepting predictor values at the left and performing calculations on those values to produce new values in the node at the far right. The value at this node represents the prediction from the neural network model. In this case the network takes in values for predictors for age and income and predicts whether the person will default on a bank loan.

Figure 2.3 A simplified view of a neural network for prediction of loan default.

How does a neural net make a prediction?

In order to make a prediction the neural network accepts the values for the predictors on what are called the input nodes. These become the values for those nodes. Those values are then multiplied by values that are stored in the links (sometimes called weights, and in some ways similar to the weights that were applied to predictors in the nearest neighbor method). These values are then added together at the node at the far right (the output node), a special thresholding function is applied, and the resulting number is the prediction. In this case if the resulting number is 0 the record is considered to be a good credit risk (no default); if the number is 1 the record is considered to be a bad credit risk (likely default).

A simplified version of the calculations made in Figure 2.3 might look like what is shown in Figure 2.4. Here the age value of 47 is normalized to fall between 0.0 and 1.0 and has the value 0.47, and the income is normalized to the value 0.65. This simplified neural network makes the prediction of no default for a 47 year old making $65,000. The links are weighted at 0.7 and 0.1 and the resulting value after multiplying the node values by the link weights is 0.39. The network has been trained to learn that an output value of 1.0 indicates default and that 0.0 indicates non-default. The output value calculated here (0.39) is closer to 0.0 than to 1.0 so the record is assigned a non-default prediction.
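The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation described above: the function name is made up for illustration, and the cut-off of 0.5 is an assumption standing in for "closer to 0.0 than to 1.0", since the text does not specify the thresholding function.

```python
def predict_default(age_norm, income_norm, w_age=0.7, w_income=0.1, cutoff=0.5):
    """Multiply each normalized input by its link weight, add the results
    at the output node, then apply a simple threshold (assumed cut-off)."""
    output = age_norm * w_age + income_norm * w_income
    label = 1 if output >= cutoff else 0  # 1 = default, 0 = no default
    return output, label


# The 47 year old making $65,000 from the example.
output, label = predict_default(0.47, 0.65)
print(round(output, 2), label)
```

The weighted sum is 0.47 x 0.7 + 0.65 x 0.1 = 0.329 + 0.065 = 0.394, which the text rounds to 0.39.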

Figure 2.4 The normalized input values are multiplied by the link weights and added together at the output.

How is the neural net model created?

The neural network model is created by presenting it with many examples of the predictor values from records in the training set (in this example age and income are used) and the prediction value from those same records. By comparing the correct answer obtained from the training record and the predicted answer from the neural network it is possible to slowly change the behavior of the neural network by changing the values of the link weights. In some ways this is like having a grade school teacher ask questions of her student (a.k.a. the neural network) and, if the answer is wrong, verbally correct the student. The greater the error the harsher the verbal correction, so that large errors are given greater attention at correction than are small errors.

For the actual neural network it is the weights of the links that actually control the prediction value for a given record. Thus the particular model that is being found by the neural network is in fact fully determined by the weights and the architectural structure of the network. For this reason it is the link weights that are modified each time an error is made.

How complex can the neural network model become?

The models shown in the figures above have been designed to be as simple as possible in order to make them understandable. In practice no networks are as simple as these. Networks with many more links and many more nodes are possible. This was the case in the architecture of a neural network system called NETtalk that learned how to pronounce written English words. Each node in this network was connected to every node in the level above it and below it, resulting in 18,629 link weights that needed to be learned in the network.
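Returning to the question of how the model is created, the idea that link weights are nudged in proportion to the error can be sketched as follows. This is an illustrative simplification rather than the procedure of any particular product: it uses a single output node with no thresholding, the training records are invented for the example, and the learning rate and epoch count are arbitrary assumptions.

```python
def weighted_sum(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))


def sum_squared_error(records, weights):
    return sum((target - weighted_sum(inputs, weights)) ** 2
               for inputs, target in records)


def train_link_weights(records, weights, learning_rate=0.1, epochs=200):
    """Adjust each link weight after every record; a larger error
    produces a larger correction (the 'harsher verbal correction')."""
    weights = list(weights)
    for _ in range(epochs):
        for inputs, target in records:
            error = target - weighted_sum(inputs, weights)
            for i, x in enumerate(inputs):
                weights[i] += learning_rate * error * x
    return weights


# Hypothetical records: (normalized age, normalized income) -> 1.0 default, 0.0 no default.
records = [((0.47, 0.65), 0.0), ((0.30, 0.20), 1.0),
           ((0.60, 0.90), 0.0), ((0.25, 0.30), 1.0)]
start = [0.7, 0.1]
trained = train_link_weights(records, start)
print(sum_squared_error(records, start), sum_squared_error(records, trained))
```

The two printed numbers are the total squared error before and after training; the second should be the smaller one, because each pass moves the weights in the direction that reduces the error on the record just seen.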

In the NETtalk network there was a row of nodes in between the input nodes and the output nodes. These are called hidden nodes or the hidden layer because the values of these nodes are not visible to the end user the way that the output nodes (which contain the prediction) and the input nodes (which just contain the predictor values) are. There are even more complex neural network architectures that have more than one hidden layer. In practice one hidden layer seems to suffice, however.

Hidden nodes are like trusted advisors to the output nodes

The meaning of the input nodes and the output nodes is usually pretty well understood, and is usually defined by the end user based on the particular problem to be solved and the nature and structure of the database. The hidden nodes, however, do not have a predefined meaning and are determined by the neural network as it trains. This poses two problems:

It is difficult to trust the prediction of the neural network if the meaning of these nodes is not well understood.
Since the prediction is made at the output layer and the difference between the prediction and the actual value is calculated there, how is this error correction fed back through the hidden layers to modify the link weights that connect them?

The meaning of these hidden nodes is not necessarily well understood, but sometimes after the fact they can be looked at to see when they are active and when they are not, and some meaning can be derived from them.

The learning that goes on in the hidden nodes

The learning procedure for the neural network has been defined to work for the weights in the links connecting the hidden layer. A good metaphor for how this works is to think of a military operation in some war where there are many layers of command, with a general ultimately responsible for making the decisions on where to advance and where to retreat. The general probably has several lieutenant generals advising him and each lieutenant general probably has several major generals advising him. This hierarchy continues downward through colonels, with privates at the bottom of the hierarchy.

This is not too far from the structure of a neural network with several hidden layers and one output node. You can think of the inputs coming from the hidden nodes as advice. The link weight corresponds to the trust that the general has in his advisors. Some trusted advisors have very high weights and some advisors may not be trusted and in fact have negative weights. The other part of the advice from the advisors has to do with how competent the particular advisor is for a given situation. The general may have a trusted advisor, but if that advisor has no expertise in aerial invasion and the question at hand has to do with a situation involving the air force, this advisor may be very well trusted but may not have any strong opinion one way or another.

In this analogy the link weight of a neural network to an output unit is like the trust or confidence that a commander has in his advisors, and the actual node value represents how strong an opinion this particular advisor has about this particular situation. To make a decision the general considers how trustworthy and valuable the advice is and how knowledgeable and confident each advisor is in making their suggestion, and then taking all of this into account the general makes the decision to advance or retreat.

In the same way the output node will make a decision (a prediction) by taking into account all of the input from its advisors (the nodes connected to it). In the case of the neural network this decision is reached by multiplying the link weight by the output value of the node and summing these values across all nodes. If the prediction is incorrect, the nodes that had the most influence on making the decision have their weights modified so that the wrong prediction is less likely to be made the next time.

This learning in the neural network is very similar to what happens when the wrong decision is made by the general. The confidence that the general has in all of those advisors that gave the wrong recommendation is decreased, and all the more so for those advisors who were very confident and vocal in their recommendation. On the other hand any advisors who were making the correct recommendation but whose input was not taken as seriously would be taken more seriously the next time. Likewise any advisor that was reprimanded for giving the wrong advice to the general would then go back to his advisors and determine which of them he had trusted more than he should have in making his recommendation and who he should have listened more closely to.

Sharing the blame and the glory throughout the organization

This feedback can continue in this way down throughout the organization, at each level giving increased emphasis to those advisors who had advised correctly and decreased emphasis to those who had advised incorrectly. In this way the entire organization becomes better and better at supporting the general in making the correct decision more of the time.

A very similar method of training takes place in the neural network. It is called "back propagation" and refers to the propagation of the error backwards from the output nodes (where the error is easy to determine as the difference between the actual prediction value from the training database and the prediction from the neural network) through the hidden layers and to the input layers. At each level the link weights between the layers are updated so as to decrease the chance of making the same mistake again.

Different types of neural networks

There are literally hundreds of variations on the back propagation feedforward neural networks that have been briefly described here. Most have to do with changing the architecture of the neural network to include recurrent connections, where the output from the output layer is connected back as input into the hidden layer. These recurrent nets are sometimes used for sequence prediction, where the previous outputs from the network need to be stored someplace and then fed back into the network to provide context for the current prediction. Recurrent networks have also been used for decreasing the amount of time that it takes to train the neural network.

Another twist on the neural net theme is to change the way that the network learns. Back propagation is effectively utilizing a search technique called gradient descent to search for the best possible improvement in the link weights to reduce the error. There are, however, many other ways of doing search in a high dimensional space, including Newton's methods and conjugate gradient, as well as simulating the physics of cooling metals in a process called simulated annealing, or simulating the search process that goes on in biological evolution and using genetic algorithms to optimize the weights of the neural networks. It has even been suggested that creating a large number of neural networks with randomly weighted links and picking the one with the lowest error rate would be the best learning procedure.

Despite all of these choices, the back propagation learning procedure is the most commonly used. It is well understood, relatively simple, and seems to work in a large number of problem domains. There are, however, two other neural network architectures that are used relatively often. Kohonen feature maps are often used for unsupervised learning and clustering, and Radial Basis Function networks are used for supervised learning and in some ways represent a hybrid between nearest neighbor and neural network classification.

Kohonen Feature Maps

Kohonen feature maps were developed in the 1970's and as such were created to simulate certain brain function. Today they are used mostly to perform unsupervised learning and clustering.

Kohonen networks are feedforward neural networks generally with no hidden layer. The networks generally contain only an input layer and an output layer, but the nodes in the output layer compete amongst themselves to display the strongest activation to a given record. This is sometimes called "winner take all".

The networks originally came about when some of the puzzling yet simple behaviors of real neurons were taken into account, namely that the physical locality of the neurons seems to play an important role in the behavior and learning of neurons.

When these networks were run in order to simulate the real world visual system, it became clear that the organization that was automatically being constructed on the data was also very useful for segmenting and clustering the training database. Each output node represented a cluster, and nearby clusters were nearby in the two dimensional output layer. Each record in the database would fall into one and only one cluster (the most active output node), but the other clusters in which it might also fit would be shown and were likely to be next to the best matching cluster.

How much like a human brain is the neural network?

Since the inception of the idea of neural networks the ultimate goal for these techniques has been to have them recreate human thought and learning. This has once again proved to be a difficult task, despite the power of these new techniques and the similarities of their architecture to that of the human brain. Many of the things that people take for granted are difficult for neural networks, like avoiding overfitting and working with real world data without a lot of preprocessing required. There have also been some exciting successes.

Combatting overfitting - getting a model you can use somewhere else

As with all predictive modeling techniques some care must be taken to avoid overfitting with a neural network. Neural networks can be quite good at overfitting training data with a predictive model that does not work well on new data. This is particularly problematic for neural networks because it is difficult to understand how the model is working. In the early days of neural networks the predictive accuracy that was often mentioned first was the accuracy on the training set, and the accuracy on the vaulted or validation set database was reported as a footnote.

This is in part due to the fact that unlike decision trees or nearest neighbor techniques, which can quickly achieve 100% predictive accuracy on the training database, neural networks can be trained forever and still not be 100% accurate on the training set. While this is an interesting fact it is not terribly relevant, since the accuracy on the training set is of little interest and can have little bearing on the validation database accuracy.

Perhaps because overfitting was more obvious for decision trees and nearest neighbor approaches, more effort was placed earlier on to add pruning and editing to these techniques. For neural networks generalization of the predictive model is accomplished via rules of thumb and sometimes in a more methodical way by using cross validation, as is done with decision trees.

One way to control overfitting in neural networks is to limit the number of links. Since the number of links represents the complexity of the model that can be produced, and since more complex models have more ability to overfit than less complex ones, overfitting can be controlled by simply limiting the number of links in the neural network. Unfortunately there are no good theoretical grounds for picking a certain number of links.

Test set validation can be used to avoid overfitting by building the neural network on one portion of the training database and using the other portion of the training database to detect what the predictive accuracy is on vaulted data. This accuracy will peak at some point in the training, and then as training proceeds it will decrease while the accuracy on the training database will continue to increase. The link weights for the network can be saved when the accuracy on the held aside data peaks. The NeuralWare product, and others, provide an automated function that saves out the network when it is best performing on the test set and even continues to search after the minimum is reached.

Explaining the network

One of the indictments against neural networks is that it is difficult to understand the model that they have built and also how the raw data affects the output predictive answer. With nearest neighbor techniques prototypical records are provided to "explain" why the prediction is made, and decision trees provide rules that can be translated into English to explain why a particular prediction was made for a particular record. The complex models of the neural network are captured solely by the link weights in the network, which represent a very complex mathematical equation.

There have been several attempts to alleviate these basic problems of the neural network. The simplest approach is to actually look at the neural network and try to create plausible explanations for the meanings of the hidden nodes. Sometimes this can be done quite successfully. In the example given at the beginning of this section the hidden nodes of the neural network seemed to have extracted important distinguishing features in predicting the relationship between people by extracting information like country of origin, features that it would seem a person would also extract and use for the prediction. But there were also many other hidden nodes, even in this particular example, that were hard to explain and didn't seem to have any particular purpose, except that they aided the neural network in making the correct prediction.
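The test set validation procedure from the overfitting discussion above (save the link weights when accuracy on the held aside data peaks, and keep searching afterwards) can be sketched as below. This is not the NeuralWare implementation; the function and parameter names are hypothetical, and the "training" and "accuracy" callables in the demonstration are stand-ins that merely imitate a validation curve that rises and then falls.

```python
import copy


def train_with_saved_best(weights, train_one_epoch, held_out_accuracy, max_epochs=100):
    """Train for max_epochs, remembering the weights from the epoch where
    accuracy on the held aside data was highest."""
    best_weights = copy.deepcopy(weights)
    best_accuracy = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        weights = train_one_epoch(weights)
        accuracy = held_out_accuracy(weights)
        if accuracy > best_accuracy:  # new peak: save out the network
            best_weights = copy.deepcopy(weights)
            best_accuracy = accuracy
            best_epoch = epoch
        # training continues past the peak in case a later epoch does better
    return best_weights, best_accuracy, best_epoch


# Stand-in callables (assumptions): each epoch adds 1 to the single weight,
# and held aside accuracy peaks when that weight reaches 10.
fake_epoch = lambda w: [w[0] + 1]
fake_accuracy = lambda w: 0.9 - 0.001 * (w[0] - 10) ** 2

best_weights, best_accuracy, best_epoch = train_with_saved_best(
    [0], fake_epoch, fake_accuracy, max_epochs=30)
print(best_weights, best_epoch)
```

In a real setting `train_one_epoch` would run one pass of back propagation over the training portion and `held_out_accuracy` would score the network on the held aside portion.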

2.10 Genetic Algorithm

A genetic algorithm (GA) is a search technique used in computing to find exact or approximate solutions to optimization and search problems. Genetic algorithms are categorized as global search heuristics. Genetic algorithms are a particular class of evolutionary algorithms (also known as evolutionary computation) that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover (also called recombination).

First, a Biology Lesson

Every organism has a set of rules, a blueprint so to speak, describing how that organism is built up from the tiny building blocks of life. These rules are encoded in the genes of an organism, which in turn are connected together into long strings called chromosomes. Each gene represents a specific trait of the organism, like eye colour or hair colour, and has several different settings. For example, the settings for a hair colour gene may be blonde, black or auburn. These genes and their settings are usually referred to as an organism's genotype. The physical expression of the genotype - the organism itself - is called the phenotype.

When two organisms mate they share their genes. The resultant offspring may end up having half the genes from one parent and half from the other. This process is called recombination. Very occasionally a gene may be mutated. Normally this mutated gene will not affect the development of the phenotype, but very occasionally it will be expressed in the organism as a completely new trait.

Life on earth has evolved to be as it is through the processes of natural selection, recombination and mutation. As you can see, the processes of natural selection - survival of the fittest - and gene mutation have very powerful roles to play in the evolution of an organism. But how does recombination fit into the scheme of things?

Genetic Algorithms are a way of solving problems by mimicking the same processes mother nature uses. They use the same combination of selection, recombination and mutation to evolve a solution to a problem.

The Genetic Algorithm - a brief overview

Before you can use a genetic algorithm to solve a problem, a way must be found of encoding any potential solution to the problem. This could be as a string of real numbers or, as is more typically the case, a binary bit string. I will refer to this bit string from now on as the chromosome. A typical chromosome may look like this:

10010101110101001010011101101110111111101

(Don't worry if none of this is making sense to you at the moment, it will all start to become clear shortly. For now, just relax and go with the flow.)

At the beginning of a run of a genetic algorithm a large population of random chromosomes is created. Each one, when decoded, will represent a different solution to the problem at hand. Let's say there are N chromosomes in the initial population. Then, the following steps are repeated until a solution is found:

1. Test each chromosome to see how good it is at solving the problem at hand and assign a fitness score accordingly. The fitness score is a measure of how good that chromosome is at solving the problem at hand.
2. Select two members from the current population. The chance of being selected is proportional to the chromosome's fitness. Roulette wheel selection is a commonly used method.
3. Dependent on the crossover rate, cross over the bits from each chosen chromosome at a randomly chosen point.
4. Step through the chosen chromosomes' bits and flip dependent on the mutation rate.
5. Repeat steps 2, 3, 4 until a new population of N members has been created.

Tell me about Roulette Wheel selection

This is a way of choosing members from the population of chromosomes in a way that is proportional to their fitness. It does not guarantee that the fittest member goes through to the next generation, merely that it has a very good chance of doing so. It works like this:

Imagine that the population's total fitness score is represented by a pie chart, or roulette wheel. Now you assign a slice of the wheel to each member of the population. The size of the slice is proportional to that chromosome's fitness score, i.e. the fitter a member is the bigger the slice of pie it gets. Now, to choose a chromosome all you have to do is spin the ball and grab the chromosome at the point it stops.

What's the Crossover Rate?

This is simply the chance that two chromosomes will swap their bits. A good value for this is around 0.7. Crossover is performed by selecting a random gene along the length of the chromosomes and swapping all the genes after that point.

e.g. Given two chromosomes

10001001110010010
01010001001000011

Choose a random bit along the length, say at position 9, and swap all the bits after that point so the above become:

10001001101000011
01010001010010010

What's the Mutation Rate?

This is the chance that a bit within a chromosome will be flipped (0 becomes 1, 1 becomes 0). This is usually a very low value for binary encoded genes, say 0.001.

So whenever chromosomes are chosen from the population the algorithm first checks to see if crossover should be applied, and then the algorithm iterates down the length of each chromosome mutating the bits if applicable.

From Theory to Practice

To hammer home the theory you've just learnt let's look at a simple problem:

Given the digits 0 through 9 and the operators +, -, * and /, find a sequence that will represent a given target number. The operators will be applied sequentially from left to right as you read.

So, given the target number 23, the sequence 6+5*4/2+1 would be one possible solution.

If 75.5 is the chosen number then 5/2+9*7-5 would be a possible solution.

Please make sure you understand the problem before moving on. I know it's a little contrived but I've used it because it's very simple.

Stage 1: Encoding

First we need to encode a possible solution as a string of bits… a chromosome. So how do we do this? Well, first we need to represent all the different characters available to the solution, that is 0 through 9 and +, -, * and /. Each of these will represent a gene. Each chromosome will be made up of several genes.

Four bits are required to represent the range of characters used:

0:  0000
1:  0001
2:  0010
3:  0011
4:  0100
5:  0101
6:  0110
7:  0111
8:  1000
9:  1001
+:  1010
-:  1011
*:  1100
/:  1101

The above shows all the different genes required to encode the problem as described. The possible genes 1110 & 1111 will remain unused and will be ignored by the algorithm if encountered.

So now you can see that the solution mentioned above for 23, '6+5*4/2+1', would be represented by nine genes like so:

0110 1010 0101 1100 0100 1101 0010 1010 0001
6    +    5    *    4    /    2    +    1

These genes are all strung together to form the chromosome:

011010100101110001001101001010100001

A Quick Word about Decoding

Because the algorithm deals with random arrangements of bits it is often going to come across a string of bits like this:

0010001010101110101101110010

Decoded, these bits represent:

0010 0010 1010 1110 1011 0111 0010
2    2    +    n/a  -    7    2

Which is meaningless in the context of this problem! Therefore, when decoding, the algorithm will just ignore any genes which don't conform to the expected pattern of: number -> operator -> number -> operator …and so on. With this in mind the above 'nonsense' chromosome is read (and tested) as:

2 + 7

Stage 2: Deciding on a Fitness Function

This can be the most difficult part of the algorithm to figure out. It really depends on what problem you are trying to solve, but the general idea is to give a higher fitness score the closer a chromosome comes to solving the problem. With regards to the simple project I'm describing here, a fitness score can be assigned that's inversely proportional to the difference between the solution and the value a decoded chromosome represents.

If we assume the target number for the remainder of the tutorial is 42, the chromosome mentioned above

011010100101110001001101001010100001

has a fitness score of 1/(42-23) or 1/19.

As it stands, if a solution is found, a divide by zero error would occur as the fitness would be 1/(42-42). This is not a problem however, as we have found what we were looking for... a solution. Therefore a test can be made for this occurrence and the algorithm halted accordingly.

Stage 3: Getting down to business

First, please read this tutorial again.

If you now feel you understand enough to solve this problem I would recommend trying to code the genetic algorithm yourself. There is no better way of learning. If, however, you are still confused, I have already prepared some simple code which you can find here. Please tinker around with the mutation rate, crossover rate, size of chromosome etc. to get a feel for how each parameter affects the algorithm. Hopefully the code should be documented well enough for you to follow what is going on! If not please email me and I'll try to improve the commenting.

Note: The code given will parse a chromosome bit string into the values we have discussed and it will attempt to find a solution which uses all the valid symbols it has found. Therefore if the target is 42, + 6 * 7 / 2 would not give a positive result even though the first four symbols ("+ 6 * 7") do give a valid solution.

Stuff to Try

If you have succeeded in coding a genetic algorithm to solve the problem given in the tutorial, try having a go at the following more difficult problem:

Given an area that has a number of non overlapping disks scattered about its surface as shown in Screenshot 1.

Screenshot 1

Use a genetic algorithm to find the disk of largest radius which may be placed amongst these disks without overlapping any of them. See Screenshot 2.
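Going back to the arithmetic problem from the tutorial, the pieces described there (gene table, decoding, fitness, roulette wheel selection, crossover and mutation) can be put together roughly as follows. This is a sketch written for these notes, not the code the tutorial refers to: the function names, the population size, the nine-gene chromosome length and the choice to skip a division by zero are all assumptions made for illustration.

```python
import random

# Gene table from the tutorial: 4 bits per symbol; 1110 and 1111 are unused.
GENES = {format(i, "04b"): str(i) for i in range(10)}
GENES.update({"1010": "+", "1011": "-", "1100": "*", "1101": "/"})


def decode(chromosome):
    """Keep only genes that fit the pattern number -> operator -> number ..."""
    symbols, want_number = [], True
    for i in range(0, len(chromosome) - 3, 4):
        symbol = GENES.get(chromosome[i:i + 4])
        if symbol is not None and symbol.isdigit() == want_number:
            symbols.append(symbol)
            want_number = not want_number
    if symbols and not symbols[-1].isdigit():
        symbols.pop()  # a trailing operator has nothing to act on
    return symbols


def evaluate(symbols):
    """Apply the operators strictly left to right."""
    if not symbols:
        return 0.0
    total = float(symbols[0])
    for op, number in zip(symbols[1::2], symbols[2::2]):
        value = float(number)
        if op == "+":
            total += value
        elif op == "-":
            total -= value
        elif op == "*":
            total *= value
        elif value != 0:  # assumption: a division by zero is skipped
            total /= value
    return total


def fitness(chromosome, target):
    difference = abs(target - evaluate(decode(chromosome)))
    return float("inf") if difference == 0 else 1.0 / difference


def roulette(population, scores):
    """Pick a member with probability proportional to its fitness score."""
    pick, running = random.uniform(0, sum(scores)), 0.0
    for chromosome, score in zip(population, scores):
        running += score
        if running >= pick:
            return chromosome
    return population[-1]


def swap_after(a, b, point):
    return a[:point] + b[point:], b[:point] + a[point:]


def crossover(a, b, rate=0.7):
    if random.random() < rate:
        return swap_after(a, b, random.randrange(1, len(a)))
    return a, b


def mutate(chromosome, rate=0.001):
    return "".join(("1" if bit == "0" else "0") if random.random() < rate else bit
                   for bit in chromosome)


def run(target, pop_size=100, n_genes=9, max_generations=400):
    population = ["".join(random.choice("01") for _ in range(4 * n_genes))
                  for _ in range(pop_size)]
    for generation in range(max_generations):
        scores = [fitness(c, target) for c in population]
        if float("inf") in scores:  # exact solution found: halt
            return population[scores.index(float("inf"))], generation
        next_population = []
        while len(next_population) < pop_size:
            a, b = crossover(roulette(population, scores), roulette(population, scores))
            next_population += [mutate(a), mutate(b)]
        population = next_population[:pop_size]
    return None, max_generations


print(" ".join(decode("0010001010101110101101110010")))  # the 'nonsense' chromosome
print(fitness("011010100101110001001101001010100001", 42))
```

Calling `run(42)` evolves a population towards the target; because the search is random it may or may not find an exact solution within the generation limit, in which case it returns `None` for the chromosome.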

In business endeavours.11 KDD (Knowledge Discover in Data Bases): Looming atop a wide variety of human activities are the menacing profiles of evergrowing mountains of data. and the race is on for who can explain the observations best. a medical imaging device. then one’s competition may put them to good use. Unfortunately. we have not witnessed corresponding advances in computational techniques to help us analyze the accumulated data. In 75 . or a supermarket’s checkout system. collect. perhaps to one’s detriment. and customers. we risk missing most of what the data have to offer. competitors. With major advances in database technology came the creation of huge efficient data stores. a credit-card transaction verification system. data captures information about the markets. In scientific endeavours. Should one choose to ignore valuable information buried within the data. Without such developments. data represents observations carefully collected about some phenomena under study. the human at the other end of the data gathering and storage machinery is faced with the same problem: What to do with all this data? Ignoring whatever we cannot analyze would be wasteful and unwise. Be it a satellite orbiting our planet. Advances in computer networking have enabled the data glut to reach anyone who cares to tap in. These mountains grew as a result of great engineering successes that enabled us to build devices to generate.Screenshot 2 2. and store digital data.

artificial intelligence. and solutions have to be developed to enable analysis of large databases. data captures performance and optimization opportunities. Why Data Mining and Knowledge Discovery? Knowledge Discovery in Databases (KDD) is concerned with extracting useful information from databases. One or more analysts get intimately familiar with the data and with the help of statistical techniques provide summaries and generate reports. For example. Traditionally. K-means clustering in your favorite Fortran library) assumes data can be "loaded" into memory and then manipulated. and keys to improving processes and troubleshooting problems. is very broad. pattern recognition. Faced with massive data sets. Hence data mining is but a step in this iterative and interactive process. The value of raw data is typically predicated on the ability to extract higher level information: information useful for decision support. and can describe a multitude of fields of study. evaluation and interpretation. cleaning. KDD’s goal. traditional approaches in statistics and pattern recognition collapse. What happens when the data set will not fit in main memory? What happens if the database is on a remote server and will never permit a naïve scan of the data? How do I sample effectively if I am not permitted to query for a stratified sample because the relevant fields are not indexed? What if the data set is in a multitude of tables (relations) and can only be accessed via some hierarchically structured 76 . The various steps of the process which include data warehousing. humans have done the task of analysis. data is a data mining algorithm. a statistical analysis package (e. Statistics has been preoccupied with this goal for over a century. The term data mining has historically been used in the database community and in statistics (often in the latter with negative connotations to indicate improper data analysis). 
We chose to include it in the name of the journal because it represents a majority of the published research work. So have many other fields including database systems. Such an approach rapidly breaks down as the volume and dimensionality of the data increase.manufacturing. or fits models to. and because we wanted to build bridges between the various communities that work on topics related to data mining. and for better understanding of the phenomena generating the data. We take the view that any algorithm that enumerates patterns from. data mining. Hence tools to aid in at least the partial automation of analysis tasks are becoming a necessity. techniques. and finally consolidation and use of the extracted "knowledge". as stated above. In effect. the data grow and change at rates that would quickly overwhelm manual analysis (even if it were possible). model selection (or combination). analysts determine the right queries to ask and sometimes even act as sophisticated query processors. target data selection. Who could be expected to "understand" millions of cases each having hundreds of fields? To further complicate the situation. So why has a separate community emerged under the name "KDD"? The answer: new approaches. and a host of activities related to data analysis.g. for exploration. We further view data mining to be a single step in a larger process that we call the KDD process. data visualization. transformation and reduction. preprocessing.

set of fields? What if the relations are sparse (not all fields are defined or even applicable to any fixed subset of the data)? How do I fit a statistical model with a large number of variables?

The open problems are not restricted to scalability issues of storage, access, and scale. For example, a problem that is not addressed by the database field is one I like to call the "query formulation problem": what to do if one does not know how to specify the desired query to begin with? For example, it would be desirable for a bank to issue a query at a high level: "give me all transactions whose likelihood of being fraudulent exceeds 0.75." It is not clear that one can write a SQL query (or even a program) to retrieve the target. Most interesting queries that arise with end-users of the data are of this class. KDD provides an alternative solution to this problem. Assuming that certain cases in the database can be identified as "fraudulent" and others as "known to be legitimate", then one can construct a training sample for a data mining algorithm, let the algorithm build a predictive model, and then retrieve records that the model triggers on. This is an example of a much needed and much more natural interface between humans and databases.

Issues of inference under uncertainty, search for patterns and parameters in large spaces, and so on are also fundamental to KDD. While these issues are studied in many related fields, approaches to solving them in the context of large databases are unique to KDD. I outline several other issues and challenges for KDD later in this editorial, and I am sure future pages of this journal will unveil many problems we have not thought of yet.

Related Fields

Many research communities are strongly related to KDD. For example, by our definition, all work in classification and clustering in statistics, pattern recognition, neural networks, machine learning, and databases would fit under the data mining step. In addition to exploratory data analysis (EDA), statistics overlaps with KDD in many other steps including data selection and sampling, preprocessing, transformation, and evaluation of extracted knowledge.

The Database field is of fundamental importance to KDD. The efficient and reliable storage and retrieval of the data, as well as issues of flexible querying and query optimization, are important enabling techniques. In addition, contributions from the database research literature in the area of data mining are beginning to appear. On-line Analytical Processing (OLAP) is an evolving field with very strong ties to databases, data warehousing, and KDD. While the emphasis in OLAP is still primarily on data visualization and query-driven exploration, automated techniques for data mining can play a major role in making OLAP more useful and easier to apply.

Other related fields include optimization (in search), high-performance and parallel computing, knowledge modeling, the management of uncertainty, and data visualization. Data visualization can contribute to effective EDA and visualization of extracted knowledge. Data mining can enable the visualization of patterns hidden deep within the data and embedded in much higher dimensional spaces. For example, a clustering method

can segment the data into homogeneous subsets that are easier to describe and visualize. These in turn can be displayed to the user instead of attempting to display the entire data (or a global random sample of it) which usually results in missing the embedded patterns.

In an ideal world, KDD should have evolved as a proper subset of statistics. However, statisticians have not focused on considering issues related to large databases. In addition, historically, the majority of the work has been primarily focused on hypothesis verification as the primary mode of data analysis (which is certainly no longer true now). The de-coupling of database issues (storage and retrieval) from analysis issues is also a culprit. Furthermore, compared with techniques that data mining draws on from pattern recognition, machine learning, and neural networks, the traditional approaches in statistics perform little search over models and parameters (again with notable recent exceptions). KDD is concerned with formalizing and encoding aspects of the "art" of statistical analysis and making analysis methods easier to use by those who own the data, regardless of whether they have the pre-requisite knowledge of the techniques being used. A marketing person interested in segmenting a database may not have the necessary advanced degree in statistics to understand and use the literature or the library of available routines. We do not dismiss the dangers of blind mining and that it can easily deteriorate to data dredging. However, the strong need for analysis aids in the data-overloaded society needs to be addressed.

Future Prospects and Challenges

Successful KDD applications continue to appear, driven mainly by a glut in databases that have clearly grown to surpass raw human processing abilities. Driving the healthy growth of this field are strong forces (both economic and social) that are a product of the data overload phenomenon. I view the need to deliver workable solutions to pressing problems as a very healthy pressure on the KDD field. Not only will it ensure our healthy growth as a new engineering discipline, but it will provide our efforts with a healthy dose of reality checks, ensuring that any theory or model that emerges will find its immediate real-world test environment.

The fundamental problems are still as difficult as they always were, and we need to guard against building unrealistic expectations in the public’s mind. The challenges ahead of us are formidable. Some of these challenges include:

1. Develop mining algorithms for classification, clustering, dependency analysis, and change and deviation detection that scale to large databases. There is a tradeoff between performance and accuracy as one surrenders to the fact that data resides primarily on disk or on a server and cannot fit in main memory.
2. Develop schemes for encoding "metadata" (information about the content and meaning of data) over data tables so that mining algorithms can operate meaningfully on a database and so that the KDD system can effectively ask for more information from the user.

3. While operating in a very large sample size environment is a blessing against overfitting problems, data mining systems need to guard against fitting models to data by chance. This problem becomes significant as a program explores a huge search space over many models for a given data set.
4. Develop effective means for data sampling, data reduction, and dimensionality reduction that operate on a mixture of categorical and numeric data fields. While large sample sizes allow us to handle higher dimensions, our understanding of high dimensional spaces and estimation within them is still fairly primitive. The curse of dimensionality is still with us.
5. Develop schemes capable of mining over nonhomogenous data sets (including mixtures of multimedia, video, and text modalities) and deal with sparse relations that are only defined over parts of the data.
6. Develop new mining and search algorithms capable of extracting more complex relationships between fields and able to account for structure over the fields (e.g. hierarchies, sparse relations), i.e. go beyond the flat file or single table assumption.
7. Develop data mining methods that account for prior knowledge of data and exploit such knowledge in reducing search, that can account for costs and benefits, and that are robust against uncertainty and missing data problems. Bayesian methods and decision analysis provide the basic foundational framework.
8. Enhance database management systems to support new primitives for the efficient extraction of necessary sufficient statistics as well as more efficient sampling schemes. This includes providing SQL support for new primitives that may be needed (c.f. the paper by Gray et al in this issue).
9. Scale methods to parallel databases with hundreds of tables, thousands of fields, and terabytes of data. Issues of query optimization in these settings are fundamental.
10. Account for and model comprehensibility of extracted models; allow proper tradeoffs between complexity and understandability of models for purposes of visualization and reporting; enable interactive exploration where the analyst can easily provide hints to help the mining algorithm with its search.
11. Develop theory and techniques to model growth and change in data. Large databases, because they grow over a long time, do not typically grow as if sampled from a static joint probability density. The question of how does the data grow? needs to be better understood (see articles by P. Huber, by Fayyad & Smyth, and by others in [6]) and tools for coping with it need to be developed.

KDD holds the promise of an enabling technology that could unlock the knowledge lying dormant in huge databases, thereby improving humanity’s collective intellect: a sort of amplifier of basic human analysis capabilities. Perhaps the most exciting aspect of the launch of this new journal is the possibility of the birth of a new research area properly mixing statistics, databases, automated data analysis and reduction, and other related areas. While KDD will draw on the substantial body of knowledge built up in its constituent fields, it is my hope that a new science will inevitably emerge. A science of how to exploit massive data sets, how to store and access them for analysis purposes, and

to cope with growth and change in data. I sincerely hope that future issues of this journal will address some of the challenges and chronicle the development of theory and applications of the new science for supporting analysis and decision making with massive data sets.

2.12 Summary

In this unit you have learnt about the Knowledge Discovery Process from data or information. Data selection, data cleaning and data enrichment are some of the procedures and concepts that have been introduced in this chapter. You have also learnt about visualization techniques and the various data mining methods such as Classification, Decision Trees and Association Rules.

2.13 Exercises

1. Explain the KDD process.
2. Write about data selection, data cleaning and data enrichment with examples.
3. What preliminary analyses can be done on a data set? Give some examples of the tools.
4. Visualization techniques: explain with examples.
5. How are OLAP tools used?
6. Explain decision trees.
7. Association rules: write about this with examples.
8. What are neural networks? Explain the concept with examples.
9. Genetic algorithms: explain.
10. How can KDD be used in databases?

Unit III

Structure of the Unit 3.1 Introduction 3.2 Learning Objectives 3.3 Data Warehouse Architecture 3.4 System Process 3.5 Process Architecture 3.6 Design 3.7 Data Base Schema 3.8 Partitioning Strategy 3.9 Aggregations 3.10 Data Marting 3.11 Meta Data 3.12 System and Warehouse process Managers 3.13 Summary 3.14 Exercises

3.1 Introduction

Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration, and data transformation and can be viewed as an important preprocessing step for data mining. Moreover, data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and on-line analytical processing and will provide an effective platform for data mining. Therefore, data warehousing and OLAP form an essential step in the knowledge discovery process. This chapter presents an overview of data warehouse and OLAP technology. Such an overview is essential for understanding the overall data mining and knowledge discovery process.

In this chapter, we study a well-accepted definition of the data warehouse and see why more and more organizations are building data warehouses for the analysis of their data. In particular, we study the data cube, a multidimensional data model for data warehouses and OLAP, as well as OLAP operations such as roll-up, drill-down, slicing, and dicing. We also look at data warehouse architecture, including steps on data warehouse design and construction. An overview of data warehouse implementation examines general strategies for efficient data cube computation, OLAP data indexing, and OLAP query processing. Finally, we look at on-line-analytical mining, a powerful paradigm that integrates data warehouse and OLAP technology with that of data mining.

3.2 Learning Objectives

- To learn about the need for and the architecture of a Data Warehouse
- To know about the processes and operations done on a Warehouse

3.3 Data Warehouse Architecture

What is a Data Warehouse (DW)?

An organized system of enterprise data derived from multiple data sources, designed primarily for decision making in the organization, can be called a Data Warehouse. StatSoft defines data warehousing as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes. The most efficient data warehousing architecture will be capable of incorporating or at least referencing all data available in the relevant enterprise-wide information management

using designated technology suitable for corporate data base management (e.g. Oracle, Sybase, MS SQL Server).

Another definition says that a data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis. This classic definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the dictionary data are also considered essential components of a data warehousing system.

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

- Subject Oriented
- Integrated
- Nonvolatile
- Time Variant

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
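The integration step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems, their field names (`cust_id`, `customerNo`, `weight_kg`, `weight_lb`) and the sample rows are hypothetical and do not come from any particular product. The sketch resolves one naming conflict and one unit-of-measure inconsistency before the rows reach the warehouse.

```python
# Hypothetical rows from two source systems describing the same kind of shipment.
# System A calls the key "cust_id" and records weight in kilograms.
# System B calls the key "customerNo" and records weight in pounds.
SOURCE_A = [{"cust_id": 1, "weight_kg": 2.0}]
SOURCE_B = [{"customerNo": 2, "weight_lb": 4.0}]

LB_TO_KG = 0.45359237  # one international pound expressed in kilograms


def from_system_a(row):
    """System A already matches the warehouse format; copy the fields."""
    return {"cust_id": row["cust_id"], "weight_kg": row["weight_kg"]}


def from_system_b(row):
    """Rename the key and convert pounds to kilograms (rounded to grams)."""
    return {
        "cust_id": row["customerNo"],
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 3),
    }


def integrate(rows_a, rows_b):
    """Return all rows in one consistent warehouse format."""
    return [from_system_a(r) for r in rows_a] + [from_system_b(r) for r in rows_b]


warehouse_rows = integrate(SOURCE_A, SOURCE_B)
for row in warehouse_rows:
    print(row)
```

After this step every row uses the same column names and the same unit, which is what lets later queries compare records that originated in different systems.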

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

Data Warehousing Objects

Fact tables and dimension tables are the two types of objects commonly used in dimensional data warehouse schemas. Fact tables are the large tables in your warehouse schema that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit.

Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. Dimension tables are usually textual and descriptive and you can use them as the row headers of the result set. Examples are customers or products.

Fact Tables

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.

Creating a New Fact Table

You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all of its foreign keys.
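The difference between additive, semi-additive and non-additive facts is easier to see with numbers. The following is a minimal sketch in Python; the fact rows, the column layout and the values are hypothetical and chosen only for illustration. The pair (month, store) plays the role of the composite key made up of the foreign keys.

```python
from collections import defaultdict

# Hypothetical fact rows: (month, store, units_sold, inventory_level, unit_price).
FACTS = [
    ("Jan", "A", 10, 100, 2.0),
    ("Jan", "B", 30, 50, 4.0),
    ("Feb", "A", 20, 90, 2.0),
    ("Feb", "B", 40, 30, 4.0),
]

# Additive fact: units sold can be summed along every dimension.
total_units = sum(units for _, _, units, _, _ in FACTS)

# Semi-additive fact: inventory levels can be summed across stores within
# one month, but adding January stock to February stock has no meaning.
inventory_by_month = defaultdict(int)
for month, _, _, inventory, _ in FACTS:
    inventory_by_month[month] += inventory

# Non-additive fact: an average cannot be obtained by adding averages or
# prices. It has to be recomputed from additive components.
total_revenue = sum(units * price for _, _, units, _, price in FACTS)
average_price_per_unit = total_revenue / total_units
naive_mean_of_prices = sum(price for *_, price in FACTS) / len(FACTS)

print(total_units)                # 100
print(dict(inventory_by_month))   # {'Jan': 150, 'Feb': 120}
print(average_price_per_unit)     # 3.4
print(naive_mean_of_prices)       # 3.0
```

The last two numbers differ because the naive mean ignores how many units were sold at each price, which is why averages are normally stored as a sum and a count rather than as a ready-made average.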

Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time.

Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.

Hierarchies

Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.

Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies: one for product categories and one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.

When designing hierarchies, you must consider the relationships in business structures, for example, a divisional multilevel sales organization.

Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.

Levels

A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.

Level Relationships

Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information. They define the parent-child relationship between the levels in a hierarchy.

Hierarchies are also essential components in enabling more complex rewrites. For example, the database can aggregate an existing sales revenue on a quarterly base to a yearly aggregation when the dimensional dependencies between quarter and year are known.

Typical Dimension Hierarchy

Figure 2-2 illustrates a dimension hierarchy based on customers.

Figure 2-2 Typical Levels in a Dimension Hierarchy

This illustrates a typical dimension hierarchy. In it:

- region: is at the top of the dimension hierarchy
- subregion: is below region
- country_name: is below subregion
- customer: is at the bottom of the dimension hierarchy

Unique Identifiers

Unique identifiers are specified for one distinct record in a dimension table. Artificial unique identifiers are often used to avoid the potential problem of unique identifiers changing. Unique identifiers are represented with the # character. For example, #customer_id.

Relationships

Relationships guarantee business integrity. An example is that if a business sells something, there is obviously a customer and a product. Designing a relationship between

the sales information in the fact table and the dimension tables products and customers enforces the business rules in databases.

Example of Data Warehousing Objects and Their Relationships

Figure 2-3 illustrates a common example of a sales fact table and dimension tables customers, products, promotions, times, and channels.

Figure 2-3 Typical Data Warehousing Objects

This illustrates a typical star schema with some columns and relationships detailed. In it, the dimension tables are:

- times
- channels
- products, which contains prod_id
- customers, which contains cust_id, cust_last_name, cust_city, and cust_state_province

The fact table is sales, which contains cust_id and prod_id.

Data Warehouse Architecture (DWA) is a way of representing the overall structure of data, communication, processing and presentation that exists for end user computing within the enterprise. Conceptualization of a data warehouse architecture consists of the following interconnected layers:

Operational database layer
The source data for the data warehouse

Informational access layer
The data accessed for reporting and analyzing and the tools for reporting and analyzing data

Data access layer
The interface between the operational and informational access layer

Metadata layer
The data directory (which is often much more detailed than an operational system data directory)

Data Warehouse Architectures

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

- Data Warehouse Architecture (Basic)
- Data Warehouse Architecture (with a Staging Area)
- Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)

Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.

Figure 1-2 Architecture of a Data Warehouse

This illustrates three things:

- Data Sources (operational systems and flat files)
- Warehouse (metadata, summary data, and raw data)
- Users (analysis, reporting, and mining)

In Figure 1-2, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.

Data Warehouse Architecture (with a Staging Area)

In Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture.

Figure 1-3 Architecture of a Data Warehouse with a Staging Area

This illustrates four things:

- Data Sources (operational systems and flat files)
- Staging Area (where data sources go before the warehouse)
- Warehouse (metadata, summary data, and raw data)
- Users (analysis, reporting, and mining)
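The idea of summary data mentioned above (pre-computing a long operation such as "August sales") can be sketched with Python's built-in sqlite3 module. This is an assumption-laden illustration, not Oracle's mechanism: SQLite has no materialized views, so an ordinary table built with CREATE TABLE ... AS SELECT stands in for one, and the table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw (detail-level) data: one row per sale. Hypothetical sample rows.
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-07-30", 50.0),
        ("2023-08-01", 100.0),
        ("2023-08-15", 25.0),
        ("2023-09-02", 70.0),
    ],
)

# Summary data: monthly totals computed once, ahead of time.
# A plain table is used here as a stand-in for a materialized view.
conn.execute(
    """
    CREATE TABLE monthly_sales_summary AS
    SELECT strftime('%Y-%m', sale_date) AS month,
           SUM(amount)                  AS total_amount
    FROM sales
    GROUP BY month
    """
)

# The "August sales" query now reads one pre-computed row
# instead of scanning every detail row.
august_total = conn.execute(
    "SELECT total_amount FROM monthly_sales_summary WHERE month = ?",
    ("2023-08",),
).fetchone()[0]

print(august_total)  # 125.0
```

A real warehouse would also have to refresh the summary whenever new detail rows are loaded; that refresh step is left out of this sketch.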

Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts

This illustrates five things:

- Data Sources (operational systems and flat files)
- Staging Area (where data sources go before the warehouse)
- Warehouse (metadata, summary data, and raw data)
- Data Marts (purchasing, sales, and inventory)
- Users (analysis, reporting, and mining)

Architecture Review and Design

The Architecture is the logical and physical foundation on which the Data Warehouse will be built. The Architecture Review and Design stage, as the name implies, is both a requirements analysis and a gap analysis activity. It is important to assess what pieces of the architecture already exist in the organization (and in what form) and to assess what pieces are missing which are needed to build the complete Data Warehouse architecture.

During the Architecture Review and Design stage, the logical Data Warehouse architecture is developed. The logical architecture is a configuration map of the necessary data stores that make up the Warehouse; it includes a central Enterprise Data Store, an optional Operational Data Store, one or more (optional) individual business area Data Marts, and one or more Metadata stores. In the metadata store(s) are two different kinds of metadata that catalog reference information about the primary data.

Once the logical configuration is defined, the Data, Application, Technical and Support Architectures are designed to physically implement it. Requirements of these four architectures are carefully analyzed so that the Data Warehouse can be optimized to serve the users. Gap analysis is conducted to determine which components of each architecture already exist in the organization and can be reused, and which components must be developed (or purchased) and configured for the Data Warehouse.

The Data Architecture organizes the sources and stores of business information and defines the quality and management standards for data and metadata.

The Application Architecture is the software framework that guides the overall implementation of business functionality within the Warehouse environment; it controls the movement of data from source to user, including the functions of data extraction, data cleansing, data transformation, data loading, data refresh, and data access (reporting, querying).

The Technical Architecture provides the underlying computing infrastructure that enables the data and application architectures. It includes platform/server, network, communications and connectivity hardware/software/middleware, DBMS, client/server 2-tier vs. 3-tier approach, and end-user workstation hardware/software. Technical architecture design must address the requirements of scalability, capacity and volume handling (including sizing and partitioning of tables), performance, availability, stability, chargeback, and security.

The Support Architecture includes the software components (e.g., tools and structures for backup/recovery, disaster recovery, performance monitoring, reliability/stability compliance reporting, data archiving, and version control/configuration management) and organizational functions necessary to effectively manage the technology investment.

Architecture Review and Design applies to the long-term strategy for development and refinement of the overall Data Warehouse, and is not conducted merely for a single iteration. This stage develops the blueprint of an

encompassing data and technical structure, software application configuration, and organizational support structure for the Warehouse. It forms a foundation that drives the iterative Detail Design activities. Where Design tells you what to do, Architecture Review and Design tells you what pieces you need in order to do it.

The Architecture Review and Design stage can be conducted as a separate project that runs mostly in parallel with the Business Question Assessment (BQA) stage, because the technical, data, application and support infrastructure that enables and supports the storage and access of information is generally independent from the business requirements of which data is needed to drive the Warehouse. However, the data architecture is dependent on receiving input from certain BQA activities (data source system identification and data modeling), so the BQA stage must conclude before the Architecture stage can conclude.

The Architecture will be developed based on the organization's long-term Data Warehouse strategy, so that future iterations of the Warehouse will have been provided for and will fit within the overall architecture.

3.4 System Process

A Data Warehouse is not an individual repository product. Rather, it is an overall strategy, or process, for building decision support systems and a knowledge-based applications architecture and environment that supports both everyday tactical decision making and long-term business strategizing. The Data Warehouse environment positions a business to utilize an enterprise-wide data store to link information from diverse sources and make the information accessible for a variety of user purposes, most notably, strategic analysis. Business analysts must be able to use the Warehouse for such strategic purposes as trend identification, forecasting, competitive analysis, and targeted market research.

Data Warehouses and Data Warehouse applications are designed primarily to support executives, senior managers, and business analysts in making complex business decisions. Data Warehouse applications provide the business community with access to accurate, consolidated information from various internal and external sources.

The primary objective of Data Warehousing is to bring together information from disparate sources and put the information into a format that is conducive to making business decisions. This objective necessitates a set of activities that are far more complex than just collecting data and reporting against it. Data Warehousing requires both business and technical expertise and involves the following activities:

- Accurately identifying the business information that must be contained in the Warehouse
- Identifying and prioritizing subject areas to be included in the Data Warehouse
- Managing the scope of each subject area which will be implemented into the Warehouse on an iterative basis

- Developing a scalable architecture to serve as the Warehouse's technical and application foundation, and identifying and selecting the hardware/software/middleware components to implement it
- Extracting, cleansing, aggregating, transforming and validating the data to ensure accuracy and consistency
- Defining the correct level of summarization to support business decision making
- Establishing a refresh program that is consistent with business needs, timing and cycles
- Providing user-friendly, powerful tools at the desktop to access the data in the Warehouse
- Educating the business community about the realm of possibilities that are available to them through Data Warehousing
- Establishing a Data Warehouse Help Desk and training users to effectively utilize the desktop tools
- Establishing processes for maintaining, enhancing, and ensuring the ongoing success and applicability of the Warehouse

Until the advent of Data Warehouses, enterprise databases were expected to serve multiple purposes, including online transaction processing, batch processing, reporting, and analytical processing. In most cases, the primary focus of computing resources was on satisfying operational needs and requirements. Information reporting and analysis needs were secondary considerations. As the use of PCs, relational databases, 4GL technology and end-user computing grew and changed the complexion of information processing, more and more business users demanded that their needs for information be addressed. Data Warehousing has evolved to meet those needs without disrupting operational processing.

In the Data Warehouse model, operational databases are not accessed directly to perform information processing. Rather, they act as the source of data for the Data Warehouse, which is the information repository and point of access for information processing. There are sound reasons for separating operational and informational databases, as described below.
- The users of informational and operational data are different. Users of informational data are generally managers and analysts; users of operational data tend to be clerical, operational and administrative staff.
- Operational data differs from informational data in context and currency. Informational data contains an historical perspective that is not generally used by operational systems.
- The technology used for operational processing frequently differs from the technology required to support informational needs.
- The processing characteristics for the operational environment and the informational environment are fundamentally different.

The Data Warehouse functions as a Decision Support System (DSS) and an Executive Information System (EIS), meaning that it supports informational and analytical needs by providing integrated and transformed enterprise-wide historical data from which to do management analysis. A variety of sophisticated tools are readily available in the marketplace to provide user-friendly access to the information stored in the Data Warehouse.

Data Warehouses can be defined as subject-oriented, integrated, time-variant, nonvolatile collections of data used to support analytical decision making. The data in the Warehouse comes from the operational environment and external sources. Data Warehouses are physically separated from operational systems, even though the operational systems feed the Warehouse with source data.

Subject Orientation
Data Warehouses are designed around the major subject areas of the enterprise; the operational environment is designed around applications and functions. This difference in orientation (data vs. process) is evident in the content of the database. Data Warehouses do not contain information that will not be used for informational or analytical processing; operational databases contain detailed data that is needed to satisfy processing requirements but which has no relevance to management or analysis.

Integration and Transformation
The data within the Data Warehouse is integrated. This means that there is consistency among naming conventions, measurements of variables, encoding structures, physical attributes, and other salient data characteristics. An example of this integration is the treatment of codes such as gender codes. Within a single corporation, various applications may represent gender codes in different ways: male vs. female, m vs. f, and 1 vs. 0, etc. In the Data Warehouse, gender is always represented in a consistent way, regardless of the many ways by which it may be encoded and stored in the source data. As the data is moved to the Warehouse, it is transformed into a consistent representation as required.

Time Variance
All data in the Data Warehouse is accurate as of some moment in time, providing an historical perspective. This differs from the operational environment, in which data is intended to be accurate as of the moment of access. The data in the Data Warehouse is, in effect, a series of snapshots. Once the data is loaded into the enterprise data store and data

marts, it cannot be updated. It is refreshed on a periodic basis, as determined by the business need. The operational data store, if included in the Warehouse architecture, may be updated.

Non-Volatility
Data in the Warehouse is static, not dynamic. The only operations that occur in Data Warehouse applications are the initial loading of data, access of data, and refresh of data. For these reasons, the physical design of a Data Warehouse optimizes the access of data, rather than focusing on the requirements of data update and delete processing.

Data Warehouse Configurations
A Data Warehouse configuration, also known as the logical architecture, includes the following components:
- Enterprise Data Store (EDS): a central repository which supplies atomic (detail level) integrated information to the whole organization.
- (optional) one Operational Data Store: a "snapshot" of a moment in time's enterprise-wide data.
- (optional) one or more individual Data Mart(s): a summarized subset of the enterprise's data specific to a functional area or department, geographical region, or time period.
- one or more Metadata Store(s) or Repository(ies): catalog(s) of reference information about the primary data. Metadata is divided into two categories: information for technical use, and information for business end-users.

The EDS is the cornerstone of the Data Warehouse. It can be accessed for both immediate informational needs and for analytical processing in support of strategic decision making, and can be used for drill-down support for the Data Marts, which contain only summarized data. It is fed by the existing subject area operational systems and may also contain data from external sources. The EDS in turn feeds individual Data Marts that are accessed by end-user query tools at the user's desktop. It is used to consolidate related data from multiple sources into a single source, while the Data Marts are used to physically distribute the consolidated data into logical categories of data, such as business functional departments or geographical regions. The EDS is a collection of daily "snapshots" of enterprise-wide data taken over an extended time period, and thus retains and makes available for tracking purposes the history of changes to a given data element over time. This creates an optimum environment for strategic analysis. However, access to the EDS can be slow, due to the volume of data it contains, which is a good reason for using Data Marts to filter, condense and summarize information for specific business areas. In the absence of the Data Mart layer, users can access the EDS directly.

Metadata is "data about data," a catalog of information about the primary data that defines access to the Warehouse. It is the key to providing users and developers with a road map to the information in the Warehouse. Metadata comes in two different forms: end-user and transformational. End-user metadata serves a business purpose; it translates a cryptic name code that represents a data element into a meaningful description of the

data element so that end-users can recognize and use the data. For example, metadata would clarify that the data element "ACCT_CD" represents "Account Code for Small Business." Transformational metadata serves a technical purpose for development and maintenance of the Warehouse. It maps the data element from its source system to the Data Warehouse, identifying it by source field name, destination field code, transformation routine, business rules for usage and derivation, format, key, size, index and other relevant transformational and structural information. Each type of metadata is kept in one or more repositories that service the Enterprise Data Store.

While an Enterprise Data Store and Metadata Store(s) are always included in a sound Data Warehouse design, the specific number of Data Marts (if any) and the need for an Operational Data Store are judgment calls. Potential Data Warehouse configurations should be evaluated and a logical architecture determined according to business requirements.

The Data Warehouse Process
The james martin + co Data Warehouse Process does not encompass the analysis and identification of organizational value streams, strategic initiatives, and related business goals, but it is a prescription for achieving such goals through a specific architecture. The Process is conducted in an iterative fashion after the initial business requirements and architectural foundations have been developed, with the emphasis on populating the Data Warehouse with "chunks" of functional subject-area information each iteration.

The Process guides the development team through identifying the business requirements, developing the business plan and Warehouse solution to business requirements, and implementing the configuration, technical, and application architecture for the overall Data Warehouse. It then specifies the iterative activities for the cyclical planning, design, construction, and deployment of each population project.

The following is a description of each stage in the Data Warehouse Process. (Note: The Data Warehouse Process also includes conventional project management, startup, and wrap-up activities which are detailed in the Plan, Activate, Control and End stages, not described here.)

3.5 Process Architecture

The Process of Data Warehouse
A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both. The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood. The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments. In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid

implementation and opportunistic application of the bottom-up approach.

From the software engineering point of view, the design and construction of a data warehouse may consist of the following steps: planning, requirements study, problem analysis, warehouse design, data integration and testing, and finally deployment of the data warehouse. Large software systems can be developed using two methodologies: the waterfall method or the spiral method. The waterfall method performs a structured and systematic analysis at each step before proceeding to the next, which is like a waterfall, falling from one step to the next. The spiral method involves the rapid generation of increasingly functional systems, with short intervals between successive releases. This is considered a good choice for data warehouse development, especially for data marts, because the turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner.

In general, the warehouse design process consists of the following steps:
1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.

Because data warehouse construction is a difficult and long-term task, its implementation scope should be clearly defined. The goals of an initial data warehouse implementation should be specific, achievable, and measurable. This involves determining the time and budget allocations, the subset of the organization that is to be modeled, the number of data sources selected, and the number and types of departments to be served.

Once a data warehouse is designed and constructed, the initial deployment of the warehouse includes initial installation, roll-out planning, training, and orientation. Platform upgrades and maintenance must also be considered. Data warehouse administration includes data refreshment, data source synchronization, planning for disaster recovery, managing access control and security, managing data growth, managing database performance, and data warehouse enhancement and extension. Scope management includes controlling the number and range of queries, dimensions, and reports; limiting the size of the data warehouse; or limiting the schedule, budget, or resources.

Various kinds of data warehouse design tools are available. Data warehouse development tools provide functions to define and edit metadata repository contents (such as schemas, scripts, or rules), answer queries, output reports, and ship metadata to and from relational database system catalogues. Planning and analysis tools study the impact of schema changes and of refresh performance when changing refresh rates or time windows.
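The four design choices listed above (business process, grain, dimensions, measures) can be made concrete with a small sketch. It assumes a retail sales process with a grain of one row per item, per store, per day; the class name `SalesFact`, the sample rows, and the helper `total_by` are hypothetical and only illustrate why additive measures are convenient.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    # Grain (assumed for illustration): one row per item, per store, per day.
    day: str             # key into a time dimension
    item: str            # key into an item dimension
    store: str           # key into a store dimension
    dollars_sold: float  # additive measure
    units_sold: int      # additive measure


# Hypothetical sample rows, not taken from the text.
FACTS = [
    SalesFact("2024-01-01", "milk", "store_a", 12.0, 4),
    SalesFact("2024-01-01", "bread", "store_a", 6.0, 3),
    SalesFact("2024-01-02", "milk", "store_b", 9.0, 3),
]


def total_by(facts, dimension):
    """Sum both additive measures, grouped by one dimension attribute."""
    totals = defaultdict(lambda: [0.0, 0])
    for fact in facts:
        key = getattr(fact, dimension)
        totals[key][0] += fact.dollars_sold
        totals[key][1] += fact.units_sold
    return {key: (dollars, units) for key, (dollars, units) in totals.items()}


print(total_by(FACTS, "item"))
print(total_by(FACTS, "store"))
```

Because the measures are additive, the same fact rows can be summed along any chosen dimension without changing the grain.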

3.6 Design

Logical Design in Data Warehouses
This chapter tells you how to design a data warehousing environment and includes the following topics:
- Logical Versus Physical Design in Data Warehouses
- Creating a Logical Design
- Data Warehousing Schemas
- Data Warehousing Objects

Logical Versus Physical Design in Data Warehouses
Your organization has decided to build a data warehouse. You have defined the business requirements and agreed upon the scope of your application, and created a conceptual design. Now you need to translate your requirements into a system deliverable. To do so, you create the logical and physical design for the data warehouse. You then define:
- The specific data content
- Relationships within and between groups of data
- The system environment supporting your data warehouse
- The data transformations required
- The frequency with which data is refreshed

The logical design is more conceptual and abstract than the physical design. In the logical design, you look at the logical relationships among the objects. In the physical design, you look at the most effective way of storing and retrieving the objects as well as handling them from a transportation and backup/recovery perspective.

Orient your design toward the needs of the end users. End users typically want to perform analysis and look at aggregated data, rather than at individual transactions. However, end users might not know what they need until they see it. In addition, a well-planned design allows for growth and changes as the needs of users change and evolve.

By beginning with the logical design, you focus on the information requirements and save the implementation details for later.

Creating a Logical Design
A logical design is conceptual and abstract. You do not deal with the physical implementation details yet. You deal only with defining the types of information that you need. One technique you can use to model your organization's logical information requirements is entity-relationship modeling. Entity-relationship modeling involves identifying the things of importance (entities), the properties of these things (attributes), and how they are related to one another (relationships).

The process of logical design involves arranging data into a series of logical relationships called entities and attributes. An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity that helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.

To be sure that your data is consistent, you need to use unique identifiers. A unique identifier is something you add to tables so that you can differentiate between the same item when it appears in different places. In a physical design, this is usually a primary key.

While entity-relationship diagramming has traditionally been associated with highly normalized models such as OLTP applications, the technique is still useful for data warehouse design in the form of dimensional modeling. In dimensional modeling, instead of seeking to discover atomic units of information (such as entities and attributes) and all of the relationships between them, you identify which information belongs to a central fact table and which information belongs to its associated dimension tables. You identify business subjects or fields of data, define relationships between business subjects, and name the attributes for each subject.

Your logical design should result in (1) a set of entities and attributes corresponding to fact tables and dimension tables and (2) a model of operational data from your source into subject-oriented information in your target data warehouse schema.

You can create the logical design using a pen and paper, or you can use a design tool such as Oracle Warehouse Builder (specifically designed to support modeling the ETL process) or Oracle Designer (a general purpose modeling tool).

Physical Design in Data Warehouses
This chapter describes the physical design of a data warehousing environment, and includes the following topics:

- Moving from Logical to Physical Design
- Physical Design

Moving from Logical to Physical Design
Logical design is what you draw with a pen and paper or design with Oracle Warehouse Builder or Designer before building your warehouse. Physical design is the creation of the database with SQL statements.

During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure. Physical design decisions are mainly driven by query performance and database maintenance aspects. For example, choosing a partitioning strategy that meets common query requirements enables Oracle to take advantage of partition pruning, a way of narrowing a search before performing it.

Physical Design
During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The unique identifier (UID) distinguishes between one instance of an entity and another.

Figure 3-1 offers you a graphical way of looking at the different ways of thinking about logical and physical designs.
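As a rough illustration of turning the logical model above into SQL statements, the sketch below uses Python's built-in `sqlite3` module rather than Oracle, purely so that it runs anywhere. The two entities (`customer`, `sale`), their attributes, and the helper `insert_orphan_sale` are hypothetical. Each entity becomes a table, each attribute a column, each unique identifier a primary key, and the relationship a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customer (                -- entity -> table
    customer_id INTEGER PRIMARY KEY,   -- unique identifier -> primary key
    name        TEXT NOT NULL          -- attribute -> column
);
CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer(customer_id),  -- relationship -> foreign key
    amount      REAL NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Smith')")
conn.execute("INSERT INTO sale VALUES (100, 1, 25.0)")


def insert_orphan_sale(connection):
    """Try to store a sale for a customer that does not exist."""
    try:
        connection.execute("INSERT INTO sale VALUES (101, 999, 5.0)")
    except sqlite3.IntegrityError:
        return False  # the foreign key constraint rejected the row
    return True


print(insert_orphan_sale(conn))
```

The rejected insert shows the relationship from the logical model being enforced by the physical structure.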

Figure 3-1 Logical Design Compared with Physical Design

The logical entities are:
- entities
- relationships
- attributes
- unique identifiers

The logical model is mapped to the following database structures:
- tables
- indexes
- columns
- dimensions
- materialized views
- integrity constraints

During the physical design process, you translate the expected schemas into actual database structures. At this time, you have to map:
- Entities to tables
- Relationships to foreign key constraints
- Attributes to columns
- Primary unique identifiers to primary key constraints

- Unique identifiers to unique key constraints

Physical Design Structures
Once you have converted your logical design to a physical one, you will need to create some or all of the following structures:
- Tablespaces
- Tables and Partitioned Tables
- Views
- Integrity Constraints
- Dimensions

Some of these structures require disk space. Others exist only in the data dictionary. Additionally, the following structures may be created for performance improvement:
- Indexes and Partitioned Indexes
- Materialized Views

Tablespaces
A tablespace consists of one or more datafiles, which are physical structures within the operating system you are using. A datafile is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures.

Tablespaces need to be separated by differences. For example, tables should be separated from their indexes and small tables should be separated from large tables. Tablespaces should also represent logical business units if possible. Because a tablespace is the coarsest granularity for backup and recovery or the transportable tablespaces mechanism, the logical business design affects availability and maintenance operations.

Tables and Partitioned Tables
Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse.

Using partitioned tables instead of nonpartitioned ones addresses the key problem of supporting very large data volumes by allowing you to decompose them into smaller and more manageable pieces. The main design criterion for partitioning is manageability, though you will also see performance benefits in most cases because of partition pruning or intelligent parallel processing. For example, you might choose a partitioning strategy based on a sales transaction date and a monthly granularity. If you have four years' worth of data, you can delete a month's data as it becomes older than four years with a single, quick DDL statement and load new data while only affecting 1/48th of the complete

table. Business questions regarding the last quarter will only affect three months, which is equivalent to three partitions, or 3/48ths of the total volume.

Partitioning large tables improves performance because each partitioned piece is more manageable. Typically, you partition based on transaction dates in a data warehouse. For example, each month, one month's worth of data can be assigned its own partition.

Data Segment Compression
You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should consider for data segment compression is partitioned tables.

To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Data segment compression can also speed up query execution. There is, however, a cost in CPU overhead.

Data segment compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.

Views
A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.

Integrity Constraints
Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed. In data warehousing environments, constraints are only used for query rewrite. NOT NULL constraints are particularly common in data warehouses. Under some specific circumstances, constraints need space in the database. These constraints are in the form of the underlying unique index.

Indexes and Partitioned Indexes
Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments. Bitmap indexes are optimized index structures for set-oriented operations.

Additionally, they are necessary for some optimized data access methods such as star transformations.

Indexes are just like tables in that you can partition them, although the partitioning strategy is not dependent upon the table structure. Partitioning indexes makes it easier to manage the warehouse during refresh and improves query performance.

Materialized Views
Materialized views are query results that have been stored in advance so long-running calculations are not necessary when you actually execute your SQL statements. From a physical design point of view, materialized views resemble tables or partitioned tables and behave like indexes.

Dimensions
A dimension is a schema object that defines hierarchical relationships between columns or column sets. A hierarchical relationship is a functional dependency from one level of a hierarchy to the next one. A dimension is a container of logical relationships and does not require any space in the database. A typical dimension is city, state (or province), region, and country.

3.7 Database Schema

Data Warehousing Schemas
A schema is a collection of database objects, including tables, views, indexes, and synonyms. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways. Most data warehouses use a dimensional model.

The model of your source data and the requirements of your users help you design the data warehouse schema. You can sometimes get the source model from your company's enterprise data model and reverse-engineer the logical data model for the data warehouse from this. The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters: size of machine, number of users, storage capacity, type of network, and software.

Star Schemas
The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables, as shown in Figure 2-1.
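A very small star can be sketched with Python's built-in `sqlite3` module, assuming one fact table and a single geography dimension that carries the city, state, country hierarchy mentioned above. The table names `sales_fact` and `geography_dim`, the sample rows, and the helper `rollup` are hypothetical; a real warehouse would have several dimensions around the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography_dim (          -- one point of the star
    geo_id  INTEGER PRIMARY KEY,
    city    TEXT,
    state   TEXT,
    country TEXT
);
CREATE TABLE sales_fact (             -- centre of the star
    geo_id      INTEGER REFERENCES geography_dim(geo_id),
    amount_sold REAL
);
INSERT INTO geography_dim VALUES
    (1, 'Chennai', 'Tamil Nadu',  'India'),
    (2, 'Madurai', 'Tamil Nadu',  'India'),
    (3, 'Mumbai',  'Maharashtra', 'India');
INSERT INTO sales_fact VALUES (1, 100.0), (2, 50.0), (3, 70.0), (1, 30.0);
""")

HIERARCHY_LEVELS = {"city", "state", "country"}


def rollup(level):
    """Total sales at one level of the geography hierarchy (one join)."""
    if level not in HIERARCHY_LEVELS:
        raise ValueError(f"unknown hierarchy level: {level}")
    query = (
        f"SELECT g.{level}, SUM(f.amount_sold) "
        "FROM sales_fact AS f "
        "JOIN geography_dim AS g ON g.geo_id = f.geo_id "
        f"GROUP BY g.{level} ORDER BY g.{level}"
    )
    return conn.execute(query).fetchall()


print(rollup("state"))
print(rollup("country"))
```

Each roll-up needs only a single join between the fact table and the dimension table, because all levels of the hierarchy are stored in the same dimension row.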

Figure 2-1 Star Schema

This illustrates a typical star schema. In it, the dimension tables are:
- times
- channels
- products
- customers

The fact table is sales. sales shows columns amount_sold and quantity_sold.

The most natural way to model a data warehouse is as a star schema: only one join establishes the relationship between the fact table and any one of the dimension tables. A star schema optimizes performance by keeping queries simple and providing fast response time. All the information about each level is stored in one row.

Other Schemas
Some schemas in data warehousing environments use third normal form rather than star schemas. Another schema that is sometimes useful is the snowflake schema, which is a star schema with normalized dimensions in a tree structure.

3.8 Partitioning Strategy

HEAPS/CLUSTERED/NONCLUSTERED

Data Partitioning
Data Partitioning is the formal process of determining which data subjects, data occurrence groups, and data characteristics are needed at each data site. It is an orderly process for allocating data to data sites that is done within the same common data architecture. Data Partitioning is also the process of logically and/or physically partitioning data into segments that are more easily maintained or accessed. Current RDBMS systems provide this kind of distribution functionality. Partitioning of data helps in performance and utility processing.

Data Partitioning in Data Warehouses
Data warehouses often contain very large tables and require techniques both for managing these large tables and for providing good query performance across them. An important tool for achieving this, as well as enhancing data access and improving overall application performance, is partitioning.

Partitioning offers support for very large tables and indexes by letting you decompose them into smaller and more manageable pieces called partitions. This support is especially important for applications that access tables and indexes with millions of rows and many gigabytes of data. Partitioned tables and indexes facilitate administrative operations by enabling these operations to work on subsets of data. For example, you can add a new partition, organize an existing partition, or drop a partition with minimal to zero interruption to a read-only application.

Partitioning can help you tune SQL statements to avoid unnecessary index and table scans (using partition pruning). It also enables you to improve the performance of massive join operations when large amounts of data (for example, several million rows) are joined together by using partition-wise joins. Finally, partitioning data greatly improves manageability of very large databases and dramatically reduces the time required for administrative tasks such as backup and restore.

Granularity in a partitioning scheme can be easily changed by splitting or merging partitions. Thus, if a table's data is skewed to fill some partitions more than others, the ones that contain more data can be split to achieve a more even distribution. Partitioning also enables you to swap partitions with a table. By being able to easily add, remove, or swap a large amount of data quickly, swapping can be used to keep a large amount of data that is being loaded inaccessible until loading is completed, or can be used as a way to stage data between different phases of use. Some examples are current day's transactions or online archives.

A good starting point for considering partitioning strategies is to use the partitioning advice within the SQL Access Advisor, part of the Tuning Pack. The SQL Access Advisor offers both graphical and command-line interfaces.
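The idea behind partition pruning can be sketched in plain Python, assuming a table partitioned by month of the transaction date. The class `MonthlyPartitionedTable` and its methods are hypothetical and do not mirror any database's API; they only show how a date-range query can skip whole partitions before scanning any rows.

```python
from collections import defaultdict
from datetime import date


class MonthlyPartitionedTable:
    """Toy table whose rows are stored in one partition per (year, month)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # (year, month) -> list of rows

    def insert(self, sale_date, amount):
        key = (sale_date.year, sale_date.month)
        self.partitions[key].append((sale_date, amount))

    def total_between(self, start, end):
        """Sum amounts with start <= sale_date <= end.

        Returns (total, partitions_scanned) so the pruning effect is visible.
        """
        low = (start.year, start.month)
        high = (end.year, end.month)
        total = 0.0
        scanned = 0
        for key, rows in self.partitions.items():
            if not (low <= key <= high):
                continue  # pruned: this partition cannot contain matching rows
            scanned += 1
            total += sum(amount for day, amount in rows if start <= day <= end)
        return total, scanned


table = MonthlyPartitionedTable()
for month in range(1, 13):
    table.insert(date(2023, month, 15), 10.0)

# A question about the last quarter touches 3 of the 12 partitions.
print(table.total_between(date(2023, 10, 1), date(2023, 12, 31)))
```

Only the partitions whose month falls inside the requested range are scanned; the remaining nine are skipped without reading a single row.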

Data partitioning can be of great help in facilitating the efficient and effective management of a highly available relational data warehouse. But data partitioning can be a complex process, with several factors that affect partitioning strategies and the design, implementation, and management considerations in a data warehousing environment.

A data warehouse which is powered by a relational database management system can provide a comprehensive source of data and an infrastructure for building Business Intelligence (BI) solutions. Typically, an implementation of a relational data warehouse involves the creation and management of dimension tables and fact tables. A dimension table is usually smaller in size than a fact table, but both provide details about the attributes used to describe or explain business facts. Some examples of a dimension include item, store, and time. A fact table, on the other hand, represents a business recording, such as item sales information for all the stores. All fact tables need to be periodically updated using the data most recently collected from the various data sources.

Since data warehouses need to manage and handle high volumes of data updated regularly, careful long-term planning is beneficial. Some of the factors to be considered for long-term planning of a data warehouse include data volume, data loading window, index maintenance window, workload characteristics, data aging strategy, archive and backup strategy, and hardware characteristics.

There are two approaches to implementing a relational data warehouse: the monolithic approach and the partitioned approach. The monolithic approach may contain huge fact tables which can be difficult to manage.

There are many benefits to implementing a relational data warehouse using the data partitioning approach. The single biggest benefit of a data partitioning approach is easy yet efficient maintenance. As an organization grows, so will the data in the database. The need for high availability of critical data while accommodating the need for a small database maintenance window becomes indispensable. Data partitioning can answer the need for a small database maintenance window in a very large business organization. With data partitioning, big issues pertaining to supporting large tables can be answered by having the database decompose large chunks of data into smaller partitions, thereby resulting in better management. Data partitioning also results in faster data loading, easy monitoring of aging data, and an efficient data retrieval system.

Data partitioning in a relational data warehouse can be implemented by partitioning objects such as base tables, clustered and non-clustered indexes, and index views. Range partitions refer to table partitions which are defined by a customizable range of data. The end user or database administrator can define the partition function with boundary values, a partition scheme having file group mappings, and tables which are mapped to the partition scheme.

There are many ways in which data partitioning can be implemented. Implementation methods vary depending on the database software vendor or developer. Management of the partitioned data can vary as well. But the important thing to note is that, regardless of the software implementing data partitioning, separating data into partitions will continue to bring benefits to data warehouses, which have now become standard requirements for large companies in order to operate efficiently.

3.9 Aggregations

Introduction

In a competitive business environment, the areas that are given more focus to gain a competitive edge over other companies include the need for timely financial reporting, real time disclosure so that the company can meet compliance regulations, and accurate sales and marketing data so the company can grow a larger customer base and thus increase profitability. Data aggregation helps company data warehouses piece together different kinds of data within the data warehouse so that they have meaning that is useful as a statistical basis for company reporting and analysis.

Here is some practical guidance on how to implement a sensible aggregation strategy for a data warehouse. The goal is to help answer the questions "How do I choose which aggregates to create," "How do I create and store aggregates," and "How do I monitor and maintain aggregates in a database?" The information in this article has been gathered from several years of consulting in the relational decision support market. This article assumes some familiarity with dimensional or "star" schema design, as this forms the base from which data is aggregated.

Approaches to Aggregation

Before trying to answer the questions mentioned above, there are some basic tradeoffs to keep in mind. Creating an aggregate is really summarizing and storing data which is available in the fact table in order to improve the performance of end-user queries. There are direct costs associated with this approach: the cost of storage on the system, the cost of processing to create the aggregates, and the cost of monitoring aggregate usage. We are trading these costs against the need for query performance.

There are three approaches to aggregation: no aggregation, selective aggregation, or exhaustive aggregation. In some cases, the volume of data in the fact table will be small enough that performance is acceptable without aggregates. In a typical database the data volumes will be large enough that this will not be the case. The opposite extreme is exhaustive aggregation. This approach will produce optimal query results because a query can read the minimum number of rows required to return an answer. However, this approach is not normally practical due to the processing required to produce all possible aggregates and the storage required to store them.

To determine the number of aggregates, simply multiply the number of levels in each of the dimension hierarchies. This will show the total number of aggregates possible. Each dimension has several levels of summarization possible. In a simple sales example where the dimensions are product, sales geography, customer, and time, the number of possible aggregates is the number of levels in each hierarchy of each dimension multiplied together. Figure 1 depicts sample hierarchies in each dimension and the total number of aggregates possible.

Figure 1: Number of possible aggregates.

Aggregates are created after new fact data has been verified and loaded. Creating a large number of aggregates will take a lot of processing time, even on a large system. Given the loading time and the time to perform database backups, there is a small window left in the batch cycle to create aggregates. This time window is a restriction on how many aggregates may be created. Given the above constraints and the huge number of rows to store for every possible aggregate, it is apparent that an exhaustive approach is not generally feasible. This leaves selective aggregation as the middle ground. The difficult question becomes "Which aggregates should I create?"
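The counting rule described above (multiply the number of levels in each dimension hierarchy) can be sketched in a few lines of Python. The hierarchy levels below are hypothetical placeholders chosen for illustration; they are not the values shown in Figure 1.

```python
from math import prod

# Hypothetical hierarchy levels for each dimension (illustrative only).
hierarchies = {
    "product": ["item", "brand", "category", "all products"],
    "sales geography": ["store", "district", "region", "all stores"],
    "customer": ["customer", "segment", "all customers"],
    "time": ["day", "week", "month", "quarter", "year"],
}


def possible_aggregates(levels_by_dimension):
    """Multiply the number of levels in each dimension hierarchy."""
    return prod(len(levels) for levels in levels_by_dimension.values())


total = possible_aggregates(hierarchies)
print(total)  # 4 * 4 * 3 * 5 level combinations
```

Note that this count includes the combination of all base levels, which is the fact table itself. Even with these small made-up hierarchies the total runs into the hundreds, which illustrates why exhaustive aggregation is rarely practical.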

Choosing Aggregates to Create

Usage and Analysis Patterns

There are two basic pieces of information which are required to select the appropriate aggregates. Probably the most important item is the expected usage patterns of the data. This information is usually known after the requirements phase of a decision support project. One of the areas which this requirements analysis normally focuses on is the decision making processes of individual users. Based on this information it is possible to determine that they often look for anomalies in their data by focusing at a certain level, and then looking for information at lower or higher levels based on what they find. The most frequently examined levels will be good candidates for aggregation. As an example, someone looking at the profitability of car insurance policies may start by examining the profitability of all policies broken out by geographic regions. From there they may note that a certain region has a higher profitability and start looking for the contributing factors by drilling down to a district level, or looking at the policies by policy or coverage types. If this pattern of analysis is common, then aggregates by region and policy type will be most useful.

Base Table Row Reduction

The second piece of information to consider is the data volumes and distributions in the fact table. This information is often not available until the initial loading of data into the database is complete and it will likely change over time. After loading the data, it is a good idea to run some queries to get an idea of the number of rows at various levels in the dimension hierarchies. This will tell you where there are significant decreases in the volume of data along a given hierarchy. Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.

The decrease of rows in a dimension hierarchy is not a hard rule due to the distribution of data along multiple dimensions. When you combine the fact rows to create an aggregate at a higher level, the reduction in the number of rows may not be as much as was expected. This is due to the sparsity of the fact data: as you look at the data values for a given dimension you will notice that certain values do not exist at the detail level, but the combination of all the dimensions will have a row at a higher level.

A simple example of this is in high volume retail sales. A single store may carry 70,000 unique products, but in a given day a typical store will sell only ten percent of those products. If we calculate the number of rows in the fact table for a chain with 100 stores where every store sells 7,000 products a day, 365 days a year, we will have 255,500,000 rows. If we create an aggregate of product sales by store by week we would intuitively expect that the number of rows in the aggregate table would be reduced by seven since we have summarized seven daily records into a single weekly record for a given product. This will not be the case due to the sparsity of data. In a single week the store may sell 15,000 unique products. Since each store will move 15,000 products in a week, the number of rows in the aggregate will not be 36,500,000 rows; the number of rows will be 78,000,000, or double what we were expecting!
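The retail sparsity example can be checked with simple arithmetic. The sketch below uses the store and product counts from the text; the figure of 52 weeks per year is an assumption introduced here to make the weekly count concrete.

```python
# Figures from the retail example in the text.
stores = 100
products_sold_per_store_per_day = 7_000
products_sold_per_store_per_week = 15_000
days_per_year = 365

# Assumption for illustration: 52 weekly periods in the year.
weeks_per_year = 52

# Base fact table: one row per product sold, per store, per day.
daily_fact_rows = stores * products_sold_per_store_per_day * days_per_year

# Intuitive (but wrong) estimate: seven daily rows collapse into one weekly row.
naive_weekly_rows = daily_fact_rows // 7

# Actual weekly aggregate: one row per distinct product sold, per store, per week.
actual_weekly_rows = stores * products_sold_per_store_per_week * weeks_per_year

print(f"daily fact rows:      {daily_fact_rows:,}")
print(f"naive weekly rows:    {naive_weekly_rows:,}")
print(f"actual weekly rows:   {actual_weekly_rows:,}")
print(f"actual / naive ratio: {actual_weekly_rows / naive_weekly_rows:.2f}")
```

The weekly aggregate shrinks the table by a factor of roughly 3.3 rather than 7, because different products sell on different days and each distinct product sold during the week still needs its own row.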

Since we are trying to reduce the number of rows a query must process, one of the key procedures is finding aggregates where the intersection of dimensions has a significant decrease in the number of rows. Figure 2 shows the row counts for all possible aggregates of product by store by day using one year of data for a 200 store retail grocer. The base level of detail is product by store by day shown in the upper left corner of the chart. The highest level summary is total sales corporate-wide in the lower right, containing a single row for each day.

Figure 2: Row counts for possible aggregates.

Looking at this chart, it is apparent that creating aggregates at some of the highest levels will provide minimal performance improvement. Depending on the frequency of usage, there are several likely candidates. Any of the subcategory level aggregates provide a significant reduction in volume and would be good starting points for exploration. The brand by district aggregate provides a very significant drop over the detail data, and will probably be small enough that all higher level product and geography queries may be satisfied by this aggregate. One thing to keep in mind is that it is appealing to decide based on what you can see in the chart, but there are still tens of millions of rows in some of the lower level aggregates. Knowing how fast your database and hardware combination can move data for a query is still important and will help you determine where it is practical to stop aggregating.

Aggregate Storage

Once you have made an initial decision about which aggregates to create you have to answer the next question: how to create and store those aggregates. There are two parts to this question. The first is how to store the aggregated data. The second is how to create and update the aggregated data.

Storing aggregates can be complicated by the columns available in the base fact table. Some data may be invalid at a higher aggregate level, or it may not be possible to summarize the data in a column. For example, it is not possible to aggregate automobile insurance claims information by vehicle type and preserve information from the claim such as the gender of the policyholder. This will be true of most semi-additive fact data and all non-additive fact data. There are some ways to preserve a portion of the information. In the example above, two "count" columns could be added with the number of male and female claimants stored in each. This is a common issue for businesses such as insurance, catalog sales, subscription services, and health care. Since semi-additive and non-additive data is only valid at the detail level, we will very likely have fewer columns in an aggregate table than we have in a fact table.

Another item which appears regularly in aggregate table design is the required precision of columns storing counts or monetary values. For data at a low level of detail, the values stored in a column may never exceed five digits. If the data is summarized for an entire week the column may require seven or more digits. This must be taken into account when creating the physical table to store an aggregate.

Storing Aggregate Rows

There are three basic options for storing the aggregated data which are diagrammed in figure 3. Aggregates may be stored in the same table with the base level fact rows, they may be stored in a separate aggregate-only table, or they may be stored in individual aggregate tables.

Figure 3: Three possibilities for storing aggregates.

You can create a combined fact and aggregate table which will contain both the base level fact rows and the aggregate rows. You can create a single aggregate table which holds all aggregate data for a single fact table. Lastly, you can create a separate table for each aggregate created. I normally recommend creating a separate table for each aggregate. The combined fact and aggregate table approach is appealing, but it usually results in a very large and unmanageable table. The single aggregate table is almost as unmanageable. Both approaches suffer from contention problems during query and update, issues with data storage for columns which are not valid at

higher levels of aggregation, and the possibility of incorrectly summarizing data in a query. The contention problem with a single table for detail and aggregates is straightforward: the same table is read from and written to in order to create or update the aggregate rows. Given the large batch nature of aggregate creation and update, contention during the batch cycle may be considerable. Query contention due to all end-user queries hitting the same table will be an issue, as will indexing in such a way that aggregate rows will be efficiently retrieved. The same drawbacks apply to a single separate table for all aggregate rows.

Using a separate table for each aggregate avoids these problems and has the advantages of allowing independent creation and removal of aggregates, simplified keying of the aggregates, and easier management of performance issues (for example, spreading I/O load by rearranging tables on disks, or allowing multiple aggregates to be updated concurrently).

The most difficult issue to resolve is the complication of end-user query access. The complication results from the introduction of a number of possible tables from which the data may be queried. This design approach introduces a new factor into the selection of end user query tools, particularly ad-hoc query tools: they should be "aggregate aware". Products which are not "aggregate aware" will present users with all the fact and aggregate tables and it will be up to the user to select the appropriate table for their query. This is not practical with more than a few aggregates.

The problem is worse for the custom application designer because queries are embedded in the program. If an aggregate is added or removed, the program must be manually changed to query from the appropriate table. With programmatic interfaces, these issues can be managed by designing the applications to dynamically generate queries against the appropriate table. For packaged query tools the issue is somewhat more complex.

There are products available which will act as intelligent middleware. They operate by examining the query and re-writing it so that it uses the appropriate aggregate table rather than the base fact table. This provides a single logical view of the schema and hides the aggregates from the user or developer. Examples of companies providing this type of software are MicroStrategy, Information Advantage, and Stanford Technologies (now owned by Informix). The logical place for this type of query optimization is in the database itself, but no commercial RDBMS vendor has provided extensions to their products to handle this issue.

In spite of this limitation, I prefer this design for aggregate storage due to the advantages over the other methods. For the remainder of this article I will assume that each aggregate is stored in an individual table.

Storing Aggregate Dimension Rows

A big issue encountered when storing aggregates is how the dimensions will be managed. Normally the dimensions contain one row for each discrete item. For example, a product dimension has a single row for each product manufactured by the company. The question arises, "how do you store information about hierarchies so the fact and aggregate tables are appropriately keyed and queried?" No matter how the dimensions and aggregates are handled, the aggregate rows will require

generated keys. This is because the levels in a dimension hierarchy are not actually elements of the dimension. They are constructs above the detail level within the dimension. This is easily seen if we look at the company geography dimension described in the example for figure 2. The granularity of the fact table is product by store by day. This means the base level in the geography dimension is the store level. All fact rows will have as part of their key the store key from a row in this dimension. The hierarchy in the dimension is store → district → region → all stores. There is no row available in the dimension table describing a district or region. We must create these rows and derive keys for them. The keys can't duplicate any of the base level keys already present in the dimension table. This can be done in several ways. The preferred method is to store all of the aggregate dimension records together in a single table. This makes it simple to view dimension records when looking for particular information prior to querying from the fact or aggregate tables. There is one issue with the column values if all the rows are stored in a single table like this. When adding an aggregate dimension record there will be some columns for which no values apply. For example, a district level row in the geography dimension will not have a store number. This is shown in figure 4.
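As a rough illustration of storing base and aggregate dimension rows in one table, the following sketch uses Python's built-in sqlite3 module. The table name, column names, sample values, and the key ranges (district keys from 10000, region keys from 20000) are all assumptions made for this example, not details taken from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical geography dimension holding both store rows and aggregate rows.
conn.execute("""
    CREATE TABLE geography_dim (
        geo_key      INTEGER PRIMARY KEY,
        store_number TEXT,   -- NULL on district and region rows
        district     TEXT,   -- NULL on region rows
        region       TEXT
    )
""")

rows = [
    # Base-level store rows (keys assumed to be below 10000).
    (1, "S001", "North Metro", "East"),
    (2, "S002", "North Metro", "East"),
    (3, "S003", "Lakeside", "West"),
    # District-level aggregate rows: generated keys, no store number.
    (10001, None, "North Metro", "East"),
    (10002, None, "Lakeside", "West"),
    # Region-level aggregate rows: generated keys, no store or district.
    (20001, None, None, "East"),
    (20002, None, None, "West"),
]
conn.executemany("INSERT INTO geography_dim VALUES (?, ?, ?, ?)", rows)

# Pick list of districts for a query front end.
districts = [r[0] for r in conn.execute(
    "SELECT DISTINCT district FROM geography_dim "
    "WHERE district IS NOT NULL ORDER BY district"
)]

# Key that a district-level aggregate table would use for 'Lakeside'.
lakeside_key = conn.execute(
    "SELECT geo_key FROM geography_dim "
    "WHERE district = ? AND store_number IS NULL",
    ("Lakeside",),
).fetchone()[0]

print(districts, lakeside_key)
```

The PRIMARY KEY constraint enforces the requirement that generated keys never collide with base level keys, and the NULL columns mark values that do not apply at a given level.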

Figure 4: Storing aggregate dimension rows. Each level above the base store level has keys in a distinct range to avoid conflicts, and all column values are empty for those columns which do not apply at the given level. When you wish to create a pick list of values for a level in the hierarchy you can issue a SELECT DISTINCT on the column for that level. An alternative to this method is to include a level column which contains a single value for each level in the hierarchy. Then queries for a set of values for a particular level need only select where the level column is the level required. Other methods for storing the aggregate dimension rows include using a separate table for each level in the dimension, normalizing the dimension, or using one table for the base dimension rows and a separate table for the hierarchy information. The disadvantage of all of these methods is that the dimension is stored in multiple tables, which further complicates the query process. The first method is conceptually clean because each fact table has a set of dimension tables which is associated only with that table, so all data is available at the same grain. The problem comes when the user is viewing dimension data at one level, and then wants to drill up or down along a hierarchy. Browsing through values in the dimension is extremely complicated. In addition, there are now many more tables and table relationships to maintain. This runs counter to the goal of the


dimensional model, which is to simplify access to the data. Normalizing the dimension is another way to store the hierarchy information. Rather than store values in dimension columns for the different levels of a hierarchy and issue a SELECT DISTINCT on the appropriate column, a key to another table is stored. This table contains just the values for that column. This is not much different from storing the values in the dimension table, and it complicates queries by adding more tables. Again, this runs counter to the goal of simplifying access for both the user and the query optimizer in the database. Using a single table for the base level dimension rows and a separate table for all aggregate dimension rows has the disadvantage of adding another table. It has an advantage which may make this approach better than using a single dimension table for all rows. If creating non-duplicate key values for the base level dimension rows and the aggregate rows is difficult, storing the aggregate rows in a separate table will make this problem simpler to resolve. The aggregate dimension rows can use a simpler key structure since they are no longer under the column constraints imposed by the base level dimension. Another topic worth mentioning in the storing of aggregate dimensions is multiple hierarchies in a single dimension. This shows up frequently when initially designing the dimensions, and it has an impact on the aggregates. When a dimension has multiple hierarchies, this implies that the number of possible aggregates will be multiplied by the number of levels in the extra hierarchies. When looking at the number of aggregates you must remember to take into account each hierarchy which exists in a dimension. Multiple hierarchies may create further problems at higher summary levels because values at a low level may be double counted at a higher level. 
Places where you will frequently find multiple hierarchies are in customer dimensions, product dimensions, and the time dimension. Products may have several hierarchies depending on whether you are viewing them from a manufacturing, warehousing, or sales perspective. Customer dimensions will sometimes have hierarchies for physical geography, demographic geography, and organizational geography. You might see two hierarchies in the time dimension: one for the calendar year and one for the fiscal year. Once the method for storing aggregates and their dimension values is chosen, the next step is to create the aggregates. The optimal approach depends on the volume of data, number of aggregates, and parallel capabilities of your database and hardware. Since there is no single best method, I will offer some guidelines on approaches. Aggregate Creation There are a number of factors which will help to define the approach. The first is the size of the fact table. It is not uncommon for a fact table to contain hundreds of millions to more than a billion rows of detail data, and to exceed 75 gigabytes in size. This volume of data will limit approaches which require frequent recalculation of the aggregates, or which require multiple scans through the fact table. The number of aggregates which must be created is a constraint. A typical fact table may have more than fifty aggregate tables in a production system. The number will depend on the fact table size, number of dimensions, and query performance. As the number of aggregates grows, the processing


window required for the batch update cycle will increase, possibly spilling over into the online usage period. Another constraint is the parallel capabilities of the database and computing platform. With the typical volumes of data it is unlikely that a simplistic approach using a single threaded program and no database parallelism will complete in a reasonable amount of time. For very large databases and high numbers of concurrent users, high end symmetric multiprocessing (SMP) or massively parallel (MPP) platforms will be required. Parallel database performance improvements are impressive, but the usefulness of the technology may be limited. Depending on the database, only certain SQL operations may be parallelized. Most commercial databases have limitations of this type. This can be a serious issue when building an aggregate table. If you are creating a very large table and the database can't parallelize the INSERT statement then you might be faced with a bottleneck that prevents you from using a simple SQL statement to create the table. In addition, there may be constraints on the query portion of the statement such that only certain types of queries will execute in parallel. This can effectively cripple the statement by turning it into a single-threaded access to millions of rows of data. If the critical path of nightly batch processing can't fully utilize the hardware, parallelism may help alleviate single-stream bottlenecks by allowing certain processes to use more resources and complete sooner. If the aggregate processing has already been parallelized by partitioning the work into multiple application processes then there may be less benefit. Due to the brute-force nature of many parallel implementations, databases have the ability to use all available resources on a server for a single SQL statement. This resource utilization often constrains use of parallel operations to a limited scope. 
If you try to run more than a handful of operations without constraining them in some way, they will introduce serious contention issues. Recreating Versus Updating Aggregates One of the major design choices for the aggregation programs is whether to drop and recreate the aggregate tables during the batch cycle, or update the tables in place. The time to completely regenerate an entire aggregate table is a prime consideration. Some aggregates may be too large or require summarizing of too much data for the regeneration approach to work effectively. Alternatively, regeneration may be more appropriate if there is a lot of program logic to determine what data must be updated in the aggregate table. Time period-to-date aggregates create their own special set of problems when making the recreate versus update decision. When updating the aggregate, new data which is not yet present in the aggregate table will likely require insertion. This implies a two pass approach in the aggregate program design, where the first pass scans aggregate data to see what should be inserted and what is already present. The second pass updates the existing data, but not the newly inserted data from the first pass. Updates to the rows can cause database update and query performance issues if the tables are not tuned properly. Updating column values may create problems with internal space allocation if


numeric values are stored with a variable length encoding scheme. If a dollar value increases from one to eight digits during several updates to rows in a month-to-date sales table, the database must reallocate for those rows (Oracle refers to this as row-chaining). This will lead to slower update and query performance over time, eventually requiring a table reorganization.

Creating an aggregate table may be as simple as a single SELECT statement. Updating the same table may require several passes through the new data and the existing aggregate table. The tradeoff with this approach is the programming complexity. Given the data volumes in a typical decision support database, it will probably be most efficient for aggregation programs to update the aggregate tables with the newly loaded data, rather than dropping and recreating all aggregates.

Single Threaded Versus Multi-threaded Creation

The program to create aggregates can be written to build or update the tables in single threaded fashion (one at a time). If the number of aggregates to generate is limited, data volumes are not very large, there is little concurrent activity on the system, or the processing window is large then generating aggregates in single stream fashion may be practical. This approach requires less development effort because there is no coordination among processes and there will be few dependencies due to the serial nature of processing. Also, if the queries to create the aggregate can take advantage of database parallelism then all resources on the server may be dedicated to a single process. This will allow the operation to complete in a fraction of the time.

One limitation with this approach is the queries which do not take advantage of parallelism. If the processing window is not sufficiently large, these processes will become bottlenecks due to the serial nature of the design. Another limitation is the inflexibility of a single batch stream. If dependencies are created or new long-running processes are added, the design may require changes.

If there are many aggregates, multiple period-to-date aggregates, or large volumes of data then a design which allows multiple processes to execute simultaneously will probably be required. This approach requires more effort to design, and will require more development effort. This is due to the addition of dependencies and constraints on processes and the added monitoring and scheduling required. A multi-threaded design must take into account the impact of simultaneous access to data stored in the same table, and the impact of writing aggregated data into the aggregate tables. Running too many concurrent programs could create a bottleneck in CPU, memory, or I/O resources on the platform, or cause contention issues in the logging mechanism of the database. This will require more detailed knowledge of the data to determine when to schedule the programs. There will also be fewer opportunities to use the brute-force approach that database parallelism allows with single large operations.

Using a "Cascading" Model to Create Aggregates

A final note on aggregate creation is for aggregates which are one or more levels removed from the base level data. It may be possible to use aggregates stored at lower levels to generate the higher level aggregates, as shown in Figure 5. By using the lower level tables, the program will perform less work to generate successively higher level aggregates. This approach may be taken with either the single-stream or multi-threaded designs.
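A minimal sketch of the cascading idea, using Python's built-in sqlite3 module: the first aggregate is built from the base fact table, and the second is built from the first rather than from the base data. All table names, column names, the salesperson-to-district lookup, and the sample rows are hypothetical, and the measure is assumed to be fully additive so that sums of sums remain correct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base grain: product by salesperson by customer by day.
    CREATE TABLE sales_fact (
        product TEXT, salesperson TEXT, customer TEXT, day TEXT, amount REAL
    );
    -- Hypothetical lookup used to roll salespeople up to districts.
    CREATE TABLE salesperson_dim (salesperson TEXT PRIMARY KEY, district TEXT);
""")
conn.executemany("INSERT INTO salesperson_dim VALUES (?, ?)",
                 [("ann", "North"), ("bob", "North"), ("cy", "South")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    ("widget", "ann", "c1", "2024-01-05", 10.0),
    ("widget", "ann", "c2", "2024-01-05", 5.0),
    ("widget", "bob", "c3", "2024-01-20", 7.0),
    ("widget", "cy",  "c4", "2024-02-01", 3.0),
])

# Level 1: product by salesperson by day, summarized from the base table.
conn.execute("""
    CREATE TABLE agg_product_salesperson_day AS
    SELECT product, salesperson, day, SUM(amount) AS amount
    FROM sales_fact
    GROUP BY product, salesperson, day
""")

# Level 2: product by district by month, built from the level 1 aggregate.
conn.execute("""
    CREATE TABLE agg_product_district_month AS
    SELECT a.product, d.district, substr(a.day, 1, 7) AS month,
           SUM(a.amount) AS amount
    FROM agg_product_salesperson_day AS a
    JOIN salesperson_dim AS d ON d.salesperson = a.salesperson
    GROUP BY a.product, d.district, substr(a.day, 1, 7)
""")

level1_rows = conn.execute(
    "SELECT COUNT(*) FROM agg_product_salesperson_day").fetchone()[0]
result = conn.execute(
    "SELECT product, district, month, amount "
    "FROM agg_product_district_month ORDER BY district, month").fetchall()
print(level1_rows, result)
```

The second statement reads only the level 1 rows, which is where the saving comes from. The same structure also shows the dependency: if the level 1 table is wrong or missing, the level 2 table will be too.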

Figure 5: Tables in a cascading design. The base table is at the grain of product by salesperson by customer by day. The aggregation program creates the first aggregate, in this case summarizing the data to the level of product by salesperson by day. The next aggregate is product by district by month, and is built from the previous aggregate table.

When designing a cascading model like this there are several issues which should be taken into consideration. The addition of dependencies on intermediate processing may result in missing high-level aggregates due to a failure during creation of a lower level table. If there are numerous high-level aggregates then a lower level failure will result in many missing aggregates. If the problem can't be fixed before the users log on to the system, they will suffer from seriously degraded performance for high-level queries.

Error propagation can be an issue with cascading creation. If an error is encountered during aggregation at a lower level, the error will be propagated throughout all higher level aggregates. Correction of the error will require the recalculation of all data at all levels above the level where the error first occurred.

When choosing whether to use a cascading approach, a consideration should be what the system management and software maintenance impact will be versus the available processing window. If there is sufficient time available to process aggregates from the base data, the cascading approach should not be taken because it will be more complex. Strong change management practices and coordinated development are required to avoid spending excessive time solving operational problems when something changes or goes wrong.

Aggregate Maintenance

Once the system is in production a new set of problems will introduce themselves. After the initial

this is useful in determining the selectivity of various possible indexes and will influence your indexing strategy. Beyond this basic table level information. this is useful only if a fixed query front-end application is used to access the data. If an aggregate is no longer useful and is removed from the system. It will also be the case that certain aggregates are rarely used by any query. Some users will notify the DBA when there is a performance issue with a query. and new aggregates will be required to meet the current performance expectations. The total count gives an idea of how much the system is being used. Aggregate maintenance is a mostly manual process and requires monitoring the usage of aggregates by the users. query resource utilization. In many cases. Without statistics or their feedback there is no way to know if the system is performing adequately. this is a simple measure to determine whether there are any excessively long queries which should be examined in further detail to see if they are the result of excessive I/O or poorly written SQL. histograms for the combination of values at each level in the fact table. and therefore slow.rollout of the system some users will experience very long running queries or reports. this is an indication that another aggregate may be in order. this must be updated. This type of data collection is very useful in determining when to add or remove aggregates. This implies continued maintenance of the aggregation programs. If you have specific metadata about aggregates. There is 120 . the addition of an aggregate may provide a more efficient base for existing higher level aggregates and suggest changes in how they are generated. If you are using a cascading model for creating aggregates. the following items are very useful if they can be captured:      column value histograms on the constraint columns in the fact and aggregate tables. queries may be highly complex. 
These two data points will indicate which tables are or are not being used. Previously useful aggregates will be used less. Users will become more educated about the available information and the questions they ask will evolve. or that this is normal. The most basic information which should be collected in order to monitor aggregate usage is the number of queries against the fact table and the number of queries against each of the aggregate tables. If there are frequent queries against low level aggregates or the fact table. you must build a monitoring component as part of your decision support system. This will tell exactly how many times a given set of information was queried. query parse counts. this information will help estimate the row counts in any aggregate you might wish to create. query duration. This may result in the need for some new aggregates to resolve the performance issue. Many will assume there is a problem with their PC. the database. An interesting pattern that I have observed with decision support databases is that over a period of time the usage of the data will change. These are good candidates for removal since they don't do anything other than take up space in the database. all associated programs must be updated. There are some end user access products on the market which include a query monitor component which will collect statistics on the usage of aggregates and on the queries which are executed.

an aggregate is not going to reduce the number of rows returned. This article only touches the surface of the issues around performance in a large decision support database. Business Objects. knowing the levels of data requested along the dimension hierarchies will tell if queries are accessing the correct aggregate level. It may be that they are querying a lower level aggregate due to the absence of an aggregate at the level required. level of data requested. it will be up to the implementor to determine the optimal approach for their set of constraints. For the most part. number of rows retrieved.Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. this is very important because some queries simply return lots of data. and others. when deciding how to create and maintain aggregates in the database. It is always good to plan the business architecture so that data will be in sync between real activities and the data model simulating the real scenario.many require that if you use their aggregate middleware. and this is the reason for the slow response. If a query returns 10. 3. and might benefit from an aggregate. which includes vendors like MicroStrategy.10 Data Marting A data mart is a subset of an organizational data store. I advise people starting complex projects like these to seek professional consulting help from companies with experience in the end-to-end implementation of similar projects.000 rows. and with a proven record of successful references. have been adopted by database and query tool vendors. usually oriented to a specific purpose or major data subject. that may be distributed to support business needs. Conclusion Some of the techniques mentioned above mainly in the monitoring and analysis space. 
IT decision makers need to make careful choice in software applications as there are hundreds of choices that can be bought from software vendors and developers around the world. Aggregate aware tools have the ability to process queries issued against a base-level dimensional schema and select the appropriate aggregate to satisfy the query. One drawback to many of these products is that they are not "open" . Some of the products include a component which can monitor queries and indicate potential candidates for aggregation. If a query has a very large amount of logical I/O but few rows returned then it is reading a large amount of data. Data aggregation can really grow to be a complex process through time. you use their aggregation tool. Oracle has also added aggregate awareness to its database engine. One rapidly growing product area is the "aggregate aware" query tool arena.  nothing to do with this type of query except try to rewrite it. Data marts are often derived from subsets of 121 .

Marc Demerest. a data mart tends to be tactical and aimed at meeting an immediate need. a data mart is a data repository that may or may not derive from a data warehouse and that emphasizes ease of access and usability for a particular designed purpose. then they will be related. many products and companies offering data warehouse services also tend to offer data mart capabilities or services.[2] This enables each department to use. manipulate and develop their data any way they see fit. In general. product. If the data marts are designed using conformed facts and dimensions. Design schemas   Star schema or dimensional model is a fairly popular design choice. each department or business unit is considered the owner of its data mart including all the hardware. etc. though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data in a data warehouse. each one relevant to one or more business units for which it was designed. Snowflake schema 122 . In some deployments. There can be multiple data marts inside a single corporation. the terms data mart and data warehouse each tend to imply the presence of the other in some form. a data warehouse tends to be a strategic but somewhat unfinished concept. However. A data warehouse is a central aggregation of data (which can be distributed physically). most writers using the term seem to agree that the design of a data mart tends to start from an analysis of user needs and that a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. as it enables a relational database to emulate the analytical functionality of a multidimensional database. DMs may or may not be dependent or related to other data marts in a single corporation. In practice. In other deployments where conformed dimensions are used. suggests combining the ideas into a Universal Data Architecture (UDA). 
Terminology In practice. One writer. without altering information inside other data marts or the data warehouse. software and data. this business unit ownership will not hold true for shared dimensions like customer.
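To make the star schema design choice listed above more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names f_sales, d_date, d_store and d_product follow the sales example used in this section. The country column on d_store, the exact column lists, and all sample rows are assumptions made purely for illustration; a production warehouse would look quite different.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table surrounded by three dimension tables."""
    conn.executescript("""
        CREATE TABLE d_date    (id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE d_store   (id INTEGER PRIMARY KEY, country TEXT);  -- 'country' is an assumed column
        CREATE TABLE d_product (id INTEGER PRIMARY KEY, category TEXT, brand TEXT);
        CREATE TABLE f_sales (
            date_id    INTEGER REFERENCES d_date(id),
            store_id   INTEGER REFERENCES d_store(id),
            product_id INTEGER REFERENCES d_product(id),
            units_sold INTEGER,
            PRIMARY KEY (date_id, store_id, product_id)  -- compound key built from the dimension keys
        );
    """)


def load_sample_rows(conn):
    """Insert a handful of made-up rows (illustrative data only)."""
    conn.executemany("INSERT INTO d_date VALUES (?, ?)", [(1, 1997), (2, 1998)])
    conn.executemany("INSERT INTO d_store VALUES (?, ?)", [(1, "India"), (2, "Japan")])
    conn.executemany(
        "INSERT INTO d_product VALUES (?, ?, ?)",
        [(1, "tv", "BrandA"), (2, "tv", "BrandB"), (3, "radio", "BrandA")],
    )
    conn.executemany(
        "INSERT INTO f_sales VALUES (?, ?, ?, ?)",
        [
            (1, 1, 1, 10),
            (1, 2, 1, 5),
            (1, 1, 2, 7),
            (1, 1, 3, 99),  # a radio: filtered out by the category condition
            (2, 1, 1, 50),  # sold in 1998: filtered out by the year condition
        ],
    )


def tv_units_by_brand_and_country(conn, year):
    """Join the fact table to each dimension once, then group the result."""
    return conn.execute(
        """
        SELECT P.brand, S.country, SUM(FS.units_sold)
        FROM f_sales FS
        INNER JOIN d_date D    ON D.id = FS.date_id
        INNER JOIN d_store S   ON S.id = FS.store_id
        INNER JOIN d_product P ON P.id = FS.product_id
        WHERE D.year = ? AND P.category = 'tv'
        GROUP BY P.brand, S.country
        ORDER BY P.brand, S.country
        """,
        (year,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
load_sample_rows(conn)
rows = tv_units_by_brand_and_country(conn, 1997)
for row in rows:
    print(row)
```

Every join in the query goes from the fact table to exactly one dimension table, which is the property that keeps star schema queries simple.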

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema, consisting of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The "facts" that the data warehouse helps analyze are classified along different "dimensions": the fact tables hold the main data, while the usually smaller dimension tables describe each value of a dimension and can be joined to fact tables as needed.

Dimension tables have a simple primary key, while fact tables have a compound primary key consisting of the aggregate of relevant dimension keys.

It is common for dimension tables to consolidate redundant data and be in second normal form, while fact tables are usually in third normal form because all data depend on either one dimension or all of them, not on combinations of a few dimensions.

The star schema is a way to implement multi-dimensional database (MDDB) functionality using a mainstream relational database: given the typical commitment to relational databases of most organizations, a specialized multidimensional DBMS is likely to be both expensive and inconvenient.

Another reason for using a star schema is its simplicity from the users' point of view: queries are never complex because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies to other tables that are possible in a better normalized snowflake schema.

Example

Consider a database of sales, perhaps from a store chain, classified by date, store and product. f_sales is the fact table and there are three dimension tables d_date, d_store and d_product. Each dimension table has a primary key called id, corresponding to a three-column primary key (date_id, store_id, product_id) in f_sales. Data columns include f_sales.units_sold (and sale price, discounts etc.); d_date.year (and other date components); d_store.country (and other store address components); d_product.category and d_product.brand (and product name etc.).

The following query extracts how many TV sets have been sold, for each brand and country, in 1997.

    SELECT P.brand, S.country, sum(FS.units_sold)
    FROM f_sales FS
    INNER JOIN d_date D ON D.id = FS.date_id
    INNER JOIN d_store S ON S.id = FS.store_id
    INNER JOIN d_product P ON P.id = FS.product_id
    WHERE D.year = 1997 AND P.category = 'tv'
    GROUP BY P.brand, S.country

A snowflake schema is a way of arranging tables in a relational database such that the entity relationship diagram resembles a snowflake in shape. At the center of the schema are fact tables which are connected to multiple dimensions. When the dimensions consist of only single tables, you have the simpler star schema. When the dimensions are more elaborate, having multiple levels of tables, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake starts to take shape. Generally, whether a snowflake or a star schema is used only affects the dimensional tables. The fact table is unchanged.

The star and snowflake schema are most commonly found in data warehouses where speed of data retrieval is more important than speed of insertion. As such, these schema are not normalized much, and are frequently left at third normal form or second normal form.

The decision on whether to employ a star schema or a snowflake schema should consider the relative strengths of the database platform in question and the query tool to be employed. Star schema should be favored with query tools that largely expose users to the underlying table structures, and in environments where most queries are simpler in nature. Snowflake schema are often better with more sophisticated query tools that isolate users from the raw table structures and for environments having numerous queries with complex criteria.

Reasons for creating a data mart
• Easy access to frequently needed data
• Creates collective view by a group of users
• Improves end-user response time
• Ease of creation
• Lower cost than implementing a full data warehouse
• Potential users are more clearly defined than in a full data warehouse

Dependent data mart

According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:
• A need for a special data model or schema: e.g., to restructure for OLAP
• Performance: to offload the data mart to a separate computer for greater efficiency or to obviate the need to manage that workload on the centralized data warehouse.
• Security: to separate an authorized data subset selectively
• Expediency: to bypass the data governance and authorizations required to incorporate a new application on the Enterprise Data Warehouse
• Proving Ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the Enterprise Data Warehouse
• Politics: a coping strategy for IT (Information Technology) in situations where a user group has more influence than funding or is not a good citizen on the centralized data warehouse.
• Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse.

According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited scalability, duplication of data, data inconsistency with other silos of information, and inability to leverage enterprise sources of data.

3.11 Meta Data

The primary rationale for data warehousing is to provide businesses with analytics results from data mining, OLAP, Scorecarding and reporting. The cost of obtaining front-end analytics is lowered if there is consistent data quality all along the pipeline from data source to analytical reporting.

Figure 1. Overview of Data Warehousing Infrastructure

Metadata is about controlling the quality of data entering the data stream. Batch processes can be run to address data degradation or changes to data policy. Metadata policies are enhanced by using metadata repositories.

One of the projects we recently worked on was with a major insurance company in North America. The company had amalgamated over the years with acquisitions and also had developed external back-end data integrations to banks and reinsurance partners.

Figure 2. Disparate Data Definition Policies in an Insurance Company

The client approached DWreview as they felt that they were not obtaining sufficient return on investment from their data warehouse. Prediction analysis, profit-loss ratio and OLAP reports were labor and time intensive to produce. The publicly listed insurance company was also in the process of implementing a financial Scorecarding application to monitor compliance with the Sarbanes-Oxley act.

In consultation with the company's IT managers, we analyzed the potential quid-pro-quos of different design changes.

The first step in the process of realigning the data warehousing policies was the examination of the metadata policies and deriving a unified view that can work for all stakeholders. Departments had created their own data marts for generating quick access to reports as they had felt the central data warehouse was not responsive to their needs. This also created a bottleneck as data was not always replicated between the repositories.

As the company was embarking on a new Scorecarding initiative it became feasible to bring the departments together and propose a new enterprise-wide metadata policy. We felt that this was a potentially disruptive move but if the challenges were met positively, the rewards would be just. The metaphor we used for the project was the quote from Julius Caesar by Shakespeare given at the start of the article.

With the IT manager's approval and buy-in of departmental managers, a gradual phase-in of a company-wide metadata initiative was introduced. Big bang approaches rarely work - and the consequences are extremely high for competitive industries such as insurance.

Figure 3. Company-wide Metadata Policy

The implementation of the Sarbanes-Oxley Scorecarding initiative was on time and relatively painless. Many of the challenges that would have been faced without a metadata policy were avoided.

With a unified data source and definition, the company is embarking further on the analysis journey. OLAP reporting is moving across stream with greater access to all employees. Data mining models are now more accurate as the model sets can be scored and trained on larger data sets. Text mining is being used to evaluate claims examiners' comments regarding insurance claims made by customers. The text mining tool was custom developed by DWreview for the client's unique requirements. Without metadata policies in place it would be next to impossible to perform coherent text mining. The metadata terminologies used in claims examination were developed in conjunction with insurance partners and brokers. Using the text mining application, the client can now monitor consistency in claims examination, identify trends for potential fraud analysis and provide feedback for insurance policy development.

Figure 4. Partial Schematic Overview of Data Flow after Company-wide Metadata Implementation

In the months since the implementation, the project has been moving along smoothly. There were training seminars given to keep staff abreast of the development and the responses were overwhelmingly positive.

Industry metadata standards exist in industry verticals such as insurance, banks, manufacturing. OMG's Common Warehouse Metadata Initiative (CWMI) is a vendor backed proposal to enable easy interchange of metadata between data warehousing tools and metadata repositories in distributed heterogeneous environments.

Developing metadata policies for organizations falls into three project management spheres - generating project support, developing suitable guidelines and setting technical goals. For a successful metadata implementation strong executive backing and support must be obtained. A tested method for gathering executive sponsorship is first setting departmental metadata standards and evaluating the difference in efficiency. As metadata is abstract in concept a visceral approach can be helpful. It will also help in gaining trust from departments that may be reluctant to hand over metadata policies.

3.12 System and data warehouse process managers

Data Warehouse Usage

Data warehouses and data marts are used in a wide range of applications. Business executives use the data in data warehouses and data marts to perform data analysis and make strategic decisions. In many firms, data warehouses are used as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing, such as demand-based production.

Typically, the longer a data warehouse has been in use, the more it will have evolved. This evolution takes place throughout a number of phases. Initially, the data warehouse is mainly used for generating reports and answering predefined queries. Progressively, it is used to analyze summarized and detailed data, where the results are presented in the form of reports and charts. Later, the data warehouse is used for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations. Finally, the data warehouse may be employed for knowledge discovery and strategic decision making using data mining tools. In this context, the tools for data warehousing can be categorized into access and retrieval tools, database reporting tools, data analysis tools, and data mining tools.

Business users need to have the means to know what exists in the data warehouse (through metadata), how to access the contents of the data warehouse, how to examine the contents using analysis tools, and how to present the results of such analysis.

There are three kinds of data warehouse applications: information processing, analytical processing, and data mining:

Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-cost Web-based accessing tools that are then integrated with Web browsers.

Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historical data in both summarized and detailed forms. The major strength of on-line analytical processing over information processing is the multidimensional data analysis of data warehouse data.

Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

"How does data mining relate to information processing and on-line analytical processing?" Information processing, based on queries, can find useful information. However, answers to such queries reflect the information directly stored in databases or computable by aggregate functions. They do not reflect sophisticated patterns or regularities buried in the database. Therefore, information processing is not data mining.

On-line analytical processing comes a step closer to data mining because it can derive information summarized at multiple granularities from user-specified subsets of a data warehouse. Such descriptions are equivalent to the class/concept descriptions discussed in Chapter 1. Because data mining systems can also mine generalized class/concept descriptions, this raises some interesting questions: "Do OLAP systems perform data mining? Are OLAP systems actually data mining systems?"

The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data. OLAP tools are targeted toward simplifying and supporting interactive data analysis, whereas the goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process. In this sense, data mining goes one step beyond traditional on-line analytical processing.

An alternative and broader view of data mining may be adopted in which data mining covers both data description and data modeling. Because OLAP systems can present general descriptions of data from data warehouses, OLAP functions are essentially for user-directed data summary and comparison (by drilling, pivoting, slicing, dicing, and other operations). These are, though limited, data mining functionalities. Yet according to this view, data mining covers a much broader spectrum than simple OLAP operations because it performs not only data summary and comparison but also association, classification, prediction, clustering, time-series analysis, and other data analysis tasks.

Data mining is not confined to the analysis of data stored in data warehouses. It may analyze data existing at more detailed granularities than the summarized data provided in a data warehouse. It may also analyze transactional, spatial, textual, and multimedia data that are difficult to model with current multidimensional database technology. In this context, data mining covers a broader spectrum than OLAP with respect to data mining functionality and the complexity of the data handled.

Because data mining involves more automated and deeper analysis than OLAP, data mining is expected to have broader applications. Data mining can help business managers find and reach more suitable customers, as well as gain critical business insights that may help drive market share and raise profits. In addition, data mining can help managers understand customer group characteristics and develop optimal pricing strategies accordingly, correct item bundling based not on intuition but on actual item groups derived from customer purchase patterns, reduce promotional spending, and at the same time increase the overall net effectiveness of promotions.

3.13 Summary

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data organized in support of management decision making. Several factors distinguish data warehouses from operational databases. Because the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain data warehouses separately from operational databases.

A multidimensional data model is typically used for the design of corporate data warehouses and departmental data marts. Such a model can adopt a star schema, snowflake schema, or fact constellation schema. The core of the multidimensional model is the data cube, which consists of a large set of facts (or measures) and a number of dimensions. Dimensions are the entities or perspectives with respect to which an organization wants to keep records and are hierarchical in nature.

A data cube consists of a lattice of cuboids, each corresponding to a different degree of summarization of the given multidimensional data.

Concept hierarchies organize the values of attributes or dimensions into gradual levels of abstraction. They are useful in mining at multiple levels of abstraction.

On-line analytical processing (OLAP) can be performed in data warehouses/marts using the multidimensional data model. Typical OLAP operations include rollup, drill-(down, across, through), slice-and-dice, pivot (rotate), as well as statistical operations such as ranking and computing moving averages and growth rates. OLAP operations can be implemented efficiently using the data cube structure.

Data warehouses often adopt a three-tier architecture. The bottom tier is a warehouse database server, which is typically a relational database system. The middle tier is an OLAP server, and the top tier is a client, containing query and reporting tools.

A data warehouse contains back-end tools and utilities for populating and refreshing the warehouse. These cover data extraction, data cleaning, data transformation, loading, refreshing, and warehouse management.

Data warehouse metadata are data defining the warehouse objects. A metadata repository provides details regarding the warehouse structure, data history, the algorithms used for summarization, mappings from the source data to warehouse form, system performance, and business terms and issues.

OLAP servers may use relational OLAP (ROLAP), or multidimensional OLAP (MOLAP), or hybrid OLAP (HOLAP). A ROLAP server uses an extended relational DBMS that maps OLAP operations on multidimensional data to standard relational operations. A MOLAP server maps multidimensional data views directly to array structures. A HOLAP server combines ROLAP and MOLAP. For example, it may use ROLAP for historical data while maintaining frequently accessed data in a separate MOLAP store.

Full materialization refers to the computation of all of the cuboids in the lattice defining a data cube. It typically requires an excessive amount of storage space, particularly as the number of dimensions and size of associated concept hierarchies grow. This problem is known as the curse of dimensionality. Alternatively, partial materialization is the selective computation of a subset of the cuboids or subcubes in the lattice. For example, an iceberg cube is a data cube that stores only those cube cells whose aggregate value (e.g., count) is above some minimum support threshold.

OLAP query processing can be made more efficient with the use of indexing techniques. In bitmap indexing, each attribute has its own bitmap index table. Bitmap indexing reduces join, aggregation, and comparison operations to bit arithmetic. Join indexing registers the joinable rows of two or more relations from a relational database, reducing the overall cost of OLAP join operations. Bitmapped join indexing, which combines the bitmap and join index methods, can be used to further speed up OLAP query processing.

Data warehouses are used for information processing (querying and reporting), analytical processing, and data mining (which supports knowledge discovery). OLAP-based data mining is referred to as OLAP mining, or on-line analytical mining (OLAM), which emphasizes the interactive and exploratory nature of OLAP mining.

3.14 Exercises

1. Suppose that a data warehouse for Big University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg grade measure stores the actual course grade of the student. At higher conceptual levels, avg grade stores the average grade for the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student.
2. Explain the architecture of a data warehouse.
3. Describe the database schemas used in a data warehouse.
4. What is metadata? Explain.
5. What are aggregations? How are they done in a data warehouse?
6. Why are partitioning strategies needed in data warehouse maintenance?
7. How is a data warehouse different from a database? How are they similar?
8. Briefly compare the following concepts. You may use an example to explain your point(s).
(a) Snowflake schema, fact constellation, star net query model
(b) Data cleaning, data transformation, refresh
(c) Enterprise warehouse, data mart, virtual warehouse
9. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee (day, month, year, doctor, hospital, patient, count, charge).

10. Suppose that a data warehouse consists of the four dimensions, date, spectator, location, and game, and the two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004?
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure.
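Exercise 10(c) and the summary both refer to bitmap indexing, in which each distinct value of an attribute gets its own bit vector and selections reduce to bit arithmetic. The following is a minimal sketch of that idea in Python, using plain integers as bit vectors. The row layout, the sample rows, and the helper names build_bitmap_index and rows_matching are hypothetical, chosen only to resemble the spectator cube of exercise 10. It does not describe how any particular database product implements bitmap indexes.

```python
from collections import defaultdict

# Hypothetical fact rows: (spectator category, location, charge)
ROWS = [
    ("student", "GM Place", 10),
    ("adult", "GM Place", 25),
    ("student", "Other Arena", 8),
    ("senior", "GM Place", 12),
    ("student", "GM Place", 10),
]


def build_bitmap_index(rows, column):
    """Return {value: bitmap}; bit i of a bitmap is set when row i holds that value."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[column]] |= 1 << i
    return dict(index)


def rows_matching(bitmap):
    """Translate a bitmap back into the list of row positions it selects."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


category_index = build_bitmap_index(ROWS, 0)
location_index = build_bitmap_index(ROWS, 1)

# The condition "student AND GM Place" becomes one bitwise AND of two bitmaps.
selected = category_index["student"] & location_index["GM Place"]
positions = rows_matching(selected)
total_charge = sum(ROWS[i][2] for i in positions)
print(positions, total_charge)
```

The sketch keeps one bitmap per distinct value, so it suits attributes with few distinct values (such as spectator category). An attribute with many distinct values would need many bitmaps, which is one of the trade-offs the exercise asks about.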

Unit IV

Structure of the Unit
4.1 Introduction
4.2 Learning Objectives
4.3 Data Warehouse hardware Architecture
4.4 Physical Layout
4.5 Security
4.6 Back up and recovery
4.7 Service level agreement
4.8 Operating the data warehouse
4.9 Summary
4.10 Exercises

4.1 Introduction

Once the Planning and Design stages are complete, the project to implement the current Data Warehouse iteration can proceed quickly. Necessary hardware, software and middleware components are purchased and installed, the development and test environment is established, and the configuration management processes are implemented. Programs are developed to extract, cleanse, transform and load the source data and to periodically refresh the existing data in the Warehouse, and the programs are individually unit tested against a test database with sample source data. Metrics are captured for the load process. The metadata repository is loaded with transformational and business user metadata. Canned production reports are developed and sample ad-hoc queries are run against the test database, and the validity of the output is measured. User access to the data in the Warehouse is established. Once the programs have been developed and unit tested and the components are in place, system functionality and user acceptance testing is conducted for the complete integrated Data Warehouse system. System support processes of database security, system backup and recovery, system disaster recovery, and data archiving are implemented and tested as the system is prepared for deployment. The final step is to conduct the Production Readiness Review prior to transitioning the Data Warehouse system into production. During this review, the system is evaluated for acceptance by the customer organization.

4.2 Learning Objectives
• To give the student knowledge of the hardware requirements for a data warehouse.
• To know about the security needed for the information stored and accessed.
• To be aware of backup and recovery of the data and of operating the data warehouse.

4.3 Data Warehouse hardware Architecture

Needed hardware

Multiprocessors within the same machine sharing the same disk and memory: This is a good approach for a small to small-medium size data warehouse. The problem with the DW (which does not arise in OLTP) is that the kind of load and queries are not certain. Therefore, sometimes the allocation of processes across the processors itself runs out of breath. Even if you have a kind of cluster (with load-balancing and automatic failover), it will become complex once you go beyond a certain size.

Parallel Processing Servers

Here the processing is done across multiple servers, each having its own memory and disk space. This way they get their own playing field, instead of fighting for common resources (as in the multiprocessor architecture). This way one can add hundreds of servers to share the load through messaging or another mode of EAI (enterprise application integration). As you design this processing architecture, you need to ensure that there are not too many cross connections. For example, if you want to join two star-schemas (refer multi-cube in OLAP server), you have to ensure that the relevant data for the two cubes is in the same server or in a few servers.

Combining the above two

Depending upon the kind of business you want to run through your data warehouse, the best is to get the combination of the above two.

TIP- Ask your vendor if the data warehouse can support all the above three processing architecture styles. Also ask the following questions:

• How many multiprocessor machines can it support?
• Can it support cluster architecture?
• Does it have load-balancing and fail-over capability?
• What has been the field experience of the number of parallel processing servers that this platform has achieved?

For the top quadrant products like Oracle, IBM DB2, Teradata and SQL Server (2008 is preferable), the answer to all of these is going to be positive.

TIP- If you have an enterprise strength database from Oracle, IBM DB2.., you should build on it unless there are strong reasons for you to go for some other database for your data warehouse.

4.4 Hardware architecture of a data ware house – Physical Layout

Data Warehouse Components

In most cases the data warehouse will have been created by merging related data from many different sources into a single database, a copy-managed data warehouse as in the above figure. More sophisticated systems also copy related files that may be better kept outside the database for such things as graphs, drawings, word processing documents, images, sound, and so on. Further, other files or data sources may be accessed by links back to the original source, across to special Intranet sites or out to Internet or partner sites. There is often a mixture of current, instantaneous and unfiltered data alongside more structured data. The latter is often summarized and coherent information as it might relate to a quarterly period or a snapshot of the business as of close of day. In each case the goal is better information made available quickly to the decision makers to enable them to get to market faster, drive revenue and service levels, and manage business change.

A data warehouse typically has three parts: a load management, a warehouse, and a query management component.

LOAD MANAGEMENT relates to the collection of information from disparate internal or external sources. In most cases the loading process includes summarizing, manipulating and changing the data structures into a format that lends itself to analytical processing. Actual raw data should be kept alongside, or within, the data warehouse itself, thereby enabling the construction of new and different representations. A worst-case scenario, if the raw data is not stored, would be to reassemble the data from the various disparate sources around the organization simply to facilitate a different analysis.

WAREHOUSE MANAGEMENT relates to the day-to-day management of the data warehouse. The tasks associated with the warehouse include ensuring its availability, the effective backup of its contents, and its security.

QUERY MANAGEMENT relates to the provision of access to the contents of the warehouse and may include the partitioning of information into different areas with different privileges to different users. Access may be provided through custom-built applications, or ad hoc query tools.

4.5 Security

Imagine your organization has just built its data warehouse. The new data warehouse environment enables you to access corporate data in the form you want, when you want it, and where you want it to solve dynamic organizational problems, or make important decisions. You no longer feel frustrated with the inability of the Information Systems (IS) function to respond quickly to your diverse needs for information. The new environment empowers you to have the information processing world by the tail, and you are exceedingly thrilled by it all!

Suddenly, a paranoid thought crept into your head, and you asked the classic question: What is your organization doing to identify, classify, and protect its valuable information assets? You posed this question to the data warehouse architects and administrators. They told you that there was nothing to worry about because the in-built security measures of your data warehouse environment could put the DoD systems to shame. Somewhere along the line, you sensed that they were neither objective nor convincing. So, you put on your hacking hat and went about the process of finding the answer to your question.

As a general user, you easily managed to access some powerful user tools that were presumably restricted to unlimited access users. The tools enabled you to issue complex queries which accessed numerous data, consumed enormous resources, and slowed system response time considerably. Your trusted friend, a reformed hacker, was also able to access sensitive corporate data through the Internet without much ado. He was able to disclose your exact salary, social security number, birth date, and the date of your last performance evaluation among other things.

Your findings led you to the classic answer: Your organization, like most, is doing little or nothing to protect its strategic information assets! Your data warehouse administrators could not pinpoint the causes of recent system problems and security breaches until you showed them the shocking results of what you and your friend had done. It was then that they admitted that security was not a priority during the development of the data warehouse. Driven by the need to complete the data warehouse project on time and within budget, and get impatient users off their backs, they did not give security requirements any thought.

Thus. having an internal control mechanism to assure the confidentiality. Implicit in the DW design is the concept of progress through sharing. minimizing operating costs and maximizing revenue. 4) identifying data security vulnerabilities. Marrying DW architecture to artificial intelligence or neural applications also facilitates highly unstructured decision-making by the auditors. tables. Phase One .an important component of the DW -. Achieving proactive security requirements of DW is a seven-phase process: 1) identifying data. rows of data. This is an often ignored. 6) selecting cost-effective security measures. These phases are part of an enterprise-wide vulnerability assessment and management program. improved quality of audit services. and. and 7) evaluating the effectiveness of security measures. but critical phase of meeting the security requirements of the DW environment since it forms the foundation for subsequent phases. enabling employees to effectively and efficiently solve dynamic organizational problems. subjects. 3) quantifying the value of data. and expert insights on a variety of control topics. lower operating costs. and does not ordinarily involve data updating. such as: fostering a culture of information sharing. It empowers end-users to perform data access and analysis. attracting and maintaining market shares. most data warehouses are built with little or no consideration given to security during the development phase. This results in timely completion of audit projects. minimizing the impact of employee turnovers. The security requirements of the DW environment are not unlike those of other distributed computing systems.can provide an accurate information about all databases.Your euphoric excitement about the new data warehouse vanished into the thick air of security concerns over your valuable corporate data. columns. 
It also gives an organization certain competitive advantages.Identifying the Data The first security task is to identify all digitally stored corporate data placed in the DW. It contains both highly detailed and summarized historical data relating to various categories. Auditors can access and analyze the DW data to efficiently make well reasoned decisions (e. best audit practices. 2) classifying data. 5) identifying data protection measures and their costs. and profiles of 139 . It entails taking a complete inventory of all the data that is available to the DW end-users. and minimal impact from staff turnover. or areas. the internal audit functions of a multi-campus institution like the University of California builds a DW to facilitate the sharing of strategic data. Unfortunately. As a diligent corporate steward. you realized that it is high time for your organization to take a reality check! As you know that a Data warehouse (DW) is a collection of integrated databases designed to support managerial decision-making and problem-solving functions. This eliminates the need for the IS function to perform informational processing from the legacy systems for the end-users. The installed data monitoring software -. For instance. integrity and availability of data in a distributed environment is of paramount importance.. All units of data are relevant to appropriate time horizons. To have the security. recommend cost-effective solutions to various internal control problems).g. DW is an integral part of enterprise-wide decision support system.

Only highlevel DW users (e. admission information. Classifying data into different categories is not as easy as it seems. the universal goal of data classification is to rank data categories by increasing degrees of sensitivity so that different protective measures can be used for different categories.). custodians. CONFIDENTIAL (Moderately Sensitive Data): For data that is more sensitive than public data. The sensitivity of corporate data can be classified as:  PUBLIC (Least Sensitive Data): For data that is less sensitive than confidential corporate data. documented and retained for the next phase.). investments.. integrity and availability in a prudent manner. new product lines.. etc.g. The principle of least privilege applies to this data classification category. unlimited access) with proper security clearance can access this data (e. recruitment strategy. etc. TOP SECRET (Most Sensitive Data): For data that is more sensitive than confidential data. Whether the required information is gathered through an automated or a manual method. Users can only access this data if it is needed to perform their work successfully (e. and the end-users. etc. A manual procedure would require preparing a checklist of the same information described above. In some cases. the collected information needs to be organized.g. Data is generally classified on the basis of criticality or sensitivity to disclosure. and destruction. common business practices. Data in this category is usually unclassified and subject to public disclosure by laws. or company policies.. personnel/payroll information. R&D. data classification is a legally mandated requirement.g.g. The principle of least privilege also applies to this category -. 140 .. time. location.Classifying the Data for Security Classifying all the data in the DW environment is needed to satisfy security requirements for data confidentiality. modification. Data in this category is not subject to public disclosure. 
medical residing in the DW environment as well as who is using the data and how often they use the data. Phase Two . Users can access only the data needed to accomplish their critical job duties.. All levels of the DW end-users can access this data (e.with access requirements much more stringent than those of the confidential data. Determining how to classify this kind of data is both challenging and interesting.).   Regardless of which categories are used to classify data on the basis of sensitivity. and access to the data is limited to a need-to-know basis. Data in this category is highly sensitive and mission-critical. and laws in effect).g. phone directories. Certain data represents a mixture of two or more categories depending on the context used (e. audited financial statements. but less sensitive than top secret data. Performing this task requires the involvement of the data owners. trade secrets.
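The three sensitivity categories and the need-to-know rule can be illustrated with a small access check. This is only a sketch under stated assumptions: the names `Sensitivity`, `UserLevel` and `can_access` are hypothetical, and the mapping of general, limited and unlimited access users onto the three categories is one plausible reading of the text, not a prescribed scheme.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Data categories ranked by increasing sensitivity."""
    PUBLIC = 1
    CONFIDENTIAL = 2
    TOP_SECRET = 3


class UserLevel(IntEnum):
    """Hypothetical user classes; a higher value means a higher clearance."""
    GENERAL = 1
    LIMITED = 2
    UNLIMITED = 3


def can_access(user_level, data_class, need_to_know=False):
    """Return True if a user may read data of the given category.

    Assumption: public data is open to everyone; anything more sensitive
    requires both a sufficient clearance and a documented need to know
    (principle of least privilege).
    """
    if data_class == Sensitivity.PUBLIC:
        return True
    return need_to_know and int(user_level) >= int(data_class)


if __name__ == "__main__":
    print(can_access(UserLevel.GENERAL, Sensitivity.PUBLIC))
    print(can_access(UserLevel.LIMITED, Sensitivity.TOP_SECRET, need_to_know=True))
```

Ranking the categories as ordered integers keeps the rule to a single comparison, which mirrors the stated goal of ranking data by increasing degrees of sensitivity.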

Phase Three - Quantifying the Value of Data

In most organizations, senior management demands to see the smoking gun (e.g., cost-vs-benefit figures, or hard evidence of committed frauds) before committing corporate funds to support security initiatives. Cynical managers will be quick to point out that they deal with hard reality, not soft variables concocted hypothetically. Quantifying the value of sensitive data warranting protective measures is as close to the smoking gun as one can get to trigger senior management's support and commitment to security initiatives in the DW environment.

The quantification process is primarily concerned with assigning "street value" to data grouped under different sensitivity categories. By itself, data has no intrinsic value. However, the definite value of data is often measurable by the cost to (a) reconstruct lost data, (b) restore the integrity of corrupted, fabricated, or intercepted data, (c) not make timely decisions due to denial of service, or (d) pay financial liability for public disclosure of confidential data. The data value may also include lost revenue from leakage of trade secrets to competitors, and advance use of secret financial data by rogue employees in the stock market prior to public release.

Measuring the value of sensitive data is often a Herculean task. Some organizations use simple procedures for measuring the value of data. They build a spreadsheet application utilizing both qualitative and quantitative factors to reliably estimate the annualized loss expectancy (ALE) of data at risk. For instance, if it costs $10,000 annually (based on labor hours) to reconstruct data classified as top secret with an assigned risk factor of 4, then the company should expect to lose at least $40,000 a year if this top secret data is not adequately protected. Similarly, if an employee is expected to successfully sue the company and recover $250,000 in punitive damages for public disclosure of privacy-protected personal information, then the liability cost plus legal fees paid to the lawyers can be used to calculate the value of the data.

The risk factor (e.g., probability of occurrence) can be determined arbitrarily or quantitatively. The higher the likelihood of attacking a particular unit of data, the greater the risk factor assigned to that data set. Measuring the value of strategic information assets based on accepted classification categories can be used to show what an organization can save (e.g., Return on Investment) if the assets are properly protected, or lose (annual dollar loss) if it does not act to protect the valuable assets.

Phase Four - Identifying Data Vulnerabilities

This phase requires the identification and documentation of vulnerabilities associated with the DW environment. Some common vulnerabilities of DW include the following:

• In-built DBMS Security: Most data warehouses rely heavily on in-built security that is primarily VIEW-based. The VIEW-based security is inadequate for the DW because it can be easily bypassed by a direct dump of data. It also does not protect data during the transmission from servers to clients, exposing the data to unauthorized access. The security feature is equally ineffective for the DW environment where the activities of the end-users are largely unpredictable.
• DBMS Limitations: Not all database systems housing the DW data have the capability to concurrently handle data of different sensitivity levels. Most organizations, for instance, use one DW server to process top secret and confidential data at the same time. However, the programs handling top secret data may not prevent leaking the data to the programs handling the confidential data, and limited DW users authorized to access only the confidential data may not be prevented from accessing the top secret data.
• Dual Security Engines: Some data warehouses combine the in-built DBMS security features with the operating system access control package to satisfy their security requirements. Using dual security engines tends to present opportunity for security lapses and exacerbate the complexity of security administration in the DW environment.
• Inference Attacks: Different access privileges are granted to different DW users. All users can access public data, but only a select few would presumably access confidential or top secret data. Unfortunately, general users can access protected data by inference without having direct access to the protected data. Sensitive data is typically inferred from seemingly non-sensitive data. Carrying out direct and indirect inference attacks is a common vulnerability in the DW environment.
• Availability Factor: Availability is a critical requirement upon which the shared access philosophy of the DW architecture is built. However, the availability requirement can conflict with or compromise the confidentiality and integrity of the DW data if not carefully considered.
• Human Factors: Accidental and intentional acts such as errors, omissions, modifications, destruction, misuse, disclosure, sabotage, frauds, and negligence account for most of the costly losses incurred by organizations. These acts adversely affect the integrity, confidentiality, and availability of the DW data.
• Insider Threats: The DW users (employees) represent the greatest threat to valuable data. Disgruntled employees with legitimate access could leak secret data to competitors and publicly disclose certain confidential human resources data. Rogue employees can also profit from using strategic corporate data in the stock market before such information is released to the public. These activities cause (a) strained relationships with business partners or government entities, (b) loss of money from financial liabilities, (c) loss of public confidence in the organization, and (d) loss of competitive edge.

• Outsider Threats: Competitors and other outside parties pose a similar threat to the DW environment as unethical insiders. These outsiders engage in electronic espionage and other hacking techniques to steal, buy, or gather strategic corporate data in the DW environment. Risks from these activities include (a) negative publicity which decimates the ability of a company to attract and retain customers or market shares, and (b) loss of continuity of DW resources which negates user productivity. The resultant losses tend to be higher than those of insider threats.
• Natural Factors: Fire, water, and air damages can render both the DW servers and clients unusable. Risks and losses vary from organization to organization, depending mostly on location and contingency factors.
• Utility Factors: Interruption of electricity and communications service causes costly disruption to the DW environment. These factors have a lower probability of occurrence, but tend to result in excessive losses.

A comprehensive inventory of vulnerabilities inherent in the DW environment needs to be documented and organized (e.g., as major or minor) for the next phase.

Phase Five - Identifying Protective Measures and Their Costs

Vulnerabilities identified in the previous phase should be considered in order to determine cost-effective protection for the DW data at different sensitivity levels. Some protective measures for the DW data include:

• The Human Wall: Employees represent the front-line of defense against security vulnerabilities in any decentralized computing environment, including DW. Addressing employee hiring, training (security awareness), periodic background checks, transfers, and termination as part of the security requirements is helpful in creating a security-conscious DW environment. This approach effectively treats the root causes, rather than the symptoms, of security problems. Human resources management costs are easily measurable.
• Access Users Classification: Classify data warehouse users as 1) General Access Users, 2) Limited Access Users, and 3) Unlimited Access Users for access control decisions.
• Access Controls: Use an access control policy based on the principles of least privilege and adequate data protection. Enforce effective and efficient access control restrictions so that the end-users can access only the data or programs for which they have legitimate privileges. Corporate data must be protected to the degree consistent with its value. Users need to obtain a granular security clearance before they are granted access to sensitive data. Also, access to the sensitive data should rely on more than one authentication mechanism. These access controls minimize damage from accidental and malicious attacks.

• Integrity Controls: Use a control mechanism to a) prevent all users from updating and deleting historical data in the DW, b) restrict data merge access to authorized activities only, c) immunize the DW data from power failures, system crashes and corruption, d) enable rapid recovery of data and operations in the event of disasters, and e) ensure the availability of consistent, reliable and timely data to the users. These are achieved through the OS integrity controls and well tested disaster recovery procedures.
• Data Encryption: Encrypting sensitive data in the DW ensures that the data is accessed on an authorized basis only. This nullifies the potential value of data interception, fabrication and modification. It also inhibits unauthorized dumping and interpretation of data, and enables secure authentication of users. In short, encryption ensures the confidentiality, integrity, and availability of data in the DW environment.
• Partitioning: Use a mechanism to partition sensitive data into separate tables so that only authorized users can access these tables based on legitimate needs. The partitioning scheme relies on a simple in-built DBMS security feature to prevent unauthorized access to sensitive data in the DW environment. However, use of this method presents some data redundancy problems.
• Development Controls: Use quality control standards to guide the development, testing and maintenance of the DW architecture. This approach ensures that security requirements are sufficiently addressed during and after the development phase. It also ensures that the system is highly elastic (e.g., adaptable or responsive to changing security needs).

The estimated costs of each security measure should be determined and documented for the next phase. Measuring the costs usually involves determining the development, implementation, and maintenance costs of each security measure. Commercial packages (e.g., CORA, BDSS, RANK-IT, BUDDY SYSTEM, BIA Professional, etc.) and in-house developed applications can help in identifying appropriate protective measures for known vulnerabilities, and quantifying their associated costs or fiscal impact.

Phase Six - Selecting Cost-Effective Security Measures

All security measures involve expenses, and security expenses require justification. This phase relies on the results of previous phases to assess the fiscal impact of corporate data at risk, and select cost-effective security measures to safeguard the data against known vulnerabilities. Selecting cost-effective security measures is congruent with a prudent business practice which ensures that the cost of protecting the data at risk does not exceed the maximum dollar loss of the data. Senior management would, for instance, deem it imprudent to commit $500,000 annually in safeguarding the data with an annualized loss expectancy of only $250,000.
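The arithmetic behind the annualized loss expectancy in Phase Three and the cost comparison in Phase Six can be written out as a short sketch. The function names are hypothetical, and the ALE is taken here as simply the annual cost multiplied by the assigned risk factor, which is what the $10,000 and risk factor 4 example implies; real spreadsheet models may weigh additional qualitative factors.

```python
def annualized_loss_expectancy(annual_cost, risk_factor):
    """Estimate the yearly loss for data at risk.

    Assumption: ALE = annual cost of the loss event x assigned risk factor.
    """
    return annual_cost * risk_factor


def is_cost_effective(annual_protection_cost, ale):
    """A measure is worth funding only if it costs no more than the expected loss."""
    return annual_protection_cost <= ale


if __name__ == "__main__":
    # Top secret data: $10,000 a year to reconstruct, risk factor 4.
    ale = annualized_loss_expectancy(10_000, 4)
    print(f"Expected annual loss: ${ale:,}")

    # Spending $500,000 a year to protect data with an ALE of $250,000.
    print("Worth funding:", is_cost_effective(500_000, 250_000))
```

Cost is only the first filter; a measure that passes this check still has to be judged on compatibility, adaptability and its impact on DW performance.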

However, the cost factor should not be the only criterion for selecting appropriate security measures in the DW environment. Compatibility, adaptability and potential impact on the DW performance should also be taken into consideration. Additionally, there are two important factors. First, the economy of mechanism principle dictates that a simple, well tested protective measure can be relied upon to control multiple vulnerabilities in the DW environment. Second, data, unlike hardware and software, is an element in the IS security arena that has the shortest life span. Thus, the principle of adequate data protection dictates that the DW data can be protected with security measures that are effective and efficient enough for the short life span of the data.

Phase Seven - Evaluating the Effectiveness of Security Measures

A winning basketball formula from the John Wooden school of thought teaches that a good team should be prepared to rebound every shot that goes up, even if it is made by the greatest player on the court. Similarly, a winning security strategy is to assume that all security measures are breakable, or not permanently effective. Every time we identify and select cost-effective security measures to secure our strategic information assets against certain attacks, the attackers tend to double their efforts in identifying methods to defeat our implemented security measures. The best we can do is to prevent this from happening, make the attacks difficult to carry out, or be prepared to rebound quickly if our assets are attacked. We will not be well positioned to do any of these if we do not evaluate the effectiveness of security measures on an ongoing basis.

Evaluating the effectiveness of security measures should be conducted continuously to determine whether the measures are: 1) small, simple and straightforward, 2) carefully analyzed, tested and verified, 3) used properly and selectively so that they do not exclude legitimate accesses, 4) elastic so that they can respond effectively to changing security requirements, and 5) reasonably efficient in terms of time, memory space, and user-centric activities so that they do not adversely affect the protected computing resources. It is equally important to ensure that the DW end-users understand and embrace the propriety of security measures through an effective security awareness program. The data warehouse administrator (DWA) with the delegated authority from senior management is responsible for ensuring the effectiveness of security measures.

Encryption Requirements

Encrypting sensitive data in the DW environment can be done at the table, column, or row level. Encrypting columns of a table containing sensitive data is the most common and straightforward approach used. A few examples of columns that are usually encrypted include social security numbers, salaries, birth dates, performance evaluation ratings, confidential bank information, and credit card numbers. Locating individual records in a table through a standard search command will be exceedingly difficult if any of the encrypted columns serve as keys to the table. Organizations that use social security numbers as keys to database tables should seriously consider using alternative pseudonym codes (e.g., randomly generated numbers) as keys before encrypting the SSN column.
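The idea of replacing a sensitive key with a pseudonym code can be sketched as follows. This is a minimal illustration, not a production scheme: `assign_pseudonym_keys` and the record layout are hypothetical, the random codes come from Python's standard `secrets` module, and the encryption of the SSN column itself is deliberately left out.

```python
import secrets


def assign_pseudonym_keys(records):
    """Re-key records by a random pseudonym instead of a sensitive field.

    Each record is copied unchanged and stored under a freshly generated
    16-character hexadecimal code, so the table can still be searched by
    key after the SSN column has been encrypted by some other mechanism.
    """
    rekeyed = {}
    for record in records:
        pseudonym = secrets.token_hex(8)
        while pseudonym in rekeyed:  # regenerate on the (very unlikely) collision
            pseudonym = secrets.token_hex(8)
        rekeyed[pseudonym] = dict(record)
    return rekeyed


if __name__ == "__main__":
    # Hypothetical sample rows; the SSN values are made up.
    employees = [
        {"ssn": "123-45-6789", "name": "A. Smith"},
        {"ssn": "987-65-4321", "name": "B. Jones"},
    ]
    for key, row in assign_pseudonym_keys(employees).items():
        print(key, row["name"])
```

Because the pseudonym carries no information about the person, it can stay in clear text and serve as the search key, while the SSN is kept only as an encrypted attribute.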

Encrypting only selected rows of data is not commonly used, but can be useful in some unique cases. For instance, a single encryption algorithm can be used to encrypt the ages of some employees who insist on non-disclosure of their ages for privacy reasons. Multiple encryption algorithms can also be used to encrypt rows of data reflecting sensitive transactions for different campuses so that geographically distributed users of the same DW can only view/search transactions (rows) related to their respective campuses. If not carefully planned, mixing separate rows of encrypted and unencrypted data and managing multiple encryption algorithms in the same DW environment can introduce chaos, including flawed data search results.

Encrypting a table (all columns/rows) is very rarely used because it essentially renders the data useless in the DW environment. The procedures required to decrypt the encrypted keys before accessing the records in a useful format are very cumbersome and cost-prohibitive.

The encryption algorithm selected for the DW environment should be able to preserve field type and field length characteristics. It should also work cooperatively with the access and analysis software package in the DW environment. Specifically, the data decryption sequence must be executed before the data reaches the software package handling the standard query. Otherwise, the package could prevent decryption of the encrypted data, rendering the data useless.

Encryption Constraints

Performing data encryption and decryption on the DW server consumes significant CPU processing cycles. This results in excessive overhead costs and degraded system performance. Also, performing decryption on the DW server before transmitting the decrypted data to the client (end-user's workstation) exposes the data to unauthorized access during the transmission. These problems can be minimized if the encryption and decryption functions are effectively deployed to the workstation level with greater CPU cycles available for processing.

Encrypted data in the DW must be decrypted before the standard query operations can be performed. This increases the time to process a query, which can irritate the end-users and force them to be belligerent toward the encryption mechanism. In addition, it is still illegal to use certain encryption algorithms outside the U.S. borders. Finally, improperly used encryption (e.g., a weak encryption algorithm) can give users a false sense of security.

Data Warehouse Administration

The size of historical data in the DW environment grows significantly every year, while the use of the data tends to decrease dramatically. This increases storage, processing and operating costs of the DW annually. It necessitates the periodic phasing out of least used or unused data, usually after a detailed analysis of the least and most accessed data over a long time horizon. A prudent decision has to be made as to how long historical data should be kept in the DW environment before it is phased out en masse. The DWA may not effectively meet these challenges without the necessary tools (activity and data monitors), resources (funds and staffing support) and management philosophy (strategic planning and management). For these reasons, the DWA should be a good strategist, an effective communicator, an astute politician, and a competent technician.

Control Reviews

The internal control review approach of the DW environment should be primarily forward-looking (emphasizing up-front prevention) as opposed to backward-looking (emphasizing after-the-fact verification). This approach calls for the use of pre-control and concurrent control assessment techniques to look at such issues as (a) data quality control, (b) effectiveness of security management, (c) economy and efficiency of DW operations, (d) accomplishment of operational goals or quality standards, and (e) overall DW administration. Effective collaboration with the internal customers (the DWA and end-users) and use of automated control tools are essential for conducting these control reviews competently.

Conclusions

The seven phases of the systematic vulnerability assessment and management program described in this section are helpful in averting underprotection and overprotection (two undesirable security extremes) of the DW data. This is achieved through the eventual selection of cost-effective security measures which ensure that different categories of corporate data are protected to the degree necessary. The program also shifts the management focus from taking corrective security actions in a crisis mode to prevention of security crises in the DW environment.

It is generally recognized that the goal of DW is to provide decision-makers access to consistent, reliable, and timely data for analytical, planning, and assessment purposes in a format that allows for easy retrieval, exploration and analysis. The need for accurate information in the most efficient and effective manner is congruent with the security requirements for data integrity and availability. Thus, it is a winning corporate strategy to ensure a happy marriage between the idealism of DW based on empowered informational processing, and the pragmatism of a proactive security philosophy based on prudent security practices in the empowered computing environment. The myth that security defeats the goal of DW, or cannot coexist in the DW environment, should be debunked. Anything less would be imprudent.

4.6 Backup and Recovery

Planning is essential. It can take six months or more to create a data warehouse, but only a few minutes to lose it! Accidents happen. A well planned operation has fewer ‘accidents’, and when they occur recovery is far more controlled and timely.

Backup and Restore

The fundamental level of safety that must be put in place is a backup system that automates the process and guarantees that the database can be restored with full data integrity in a timely manner. The first step is to ensure that all of the data sources from which the data warehouse is created are themselves backed up. Even a small file that is used to help integrate larger data sources may play a critical part. Where a data source is external it may be expedient to ‘cache’ the data to disk, to be able to back it up as well. Then there is the requirement to produce say a weekly backup of the entire warehouse itself which can be restored as a coherent whole with full data integrity. Amazingly many companies do not attempt this. They rely on a mirrored system not failing, or recreating the warehouse from scratch. Guess what? They do not even practise the recreation process, so when (not if) the system breaks the business impact will be enormous. Backing up the data warehouse itself is fundamental.

What must we back up? First the database itself, but also any other files or links that are a key part of its operation. How do we back it up? The simplest answer is to quiesce the entire data warehouse and do a ‘cold’ backup of the database and related files. This is often not an option as they may need to be operational on a nonstop basis, and even if they can be stopped there may not be a large enough window to do the backup.

The preferred solution is to do a ‘hot’ database backup – that is, back up the database and the related files while they are being updated. This requires a high-end backup product that is synchronized with the database system’s own recovery system and has a ‘hot-file’ backup capability to be able to back up the conventional file system.

Veritas supports a range of alternative ways of backing up and recovering a data warehouse – for this paper we will consider this to be a very large Oracle 7 or 8 database with a huge number of related files.

The first mechanism is simply to take cold backups of the whole environment, exploiting multiplexing and other techniques to minimize the backup window (or restore time) by exploiting to the full the speed and capacity of the many types and instances of tape and robotics devices that may need to be configured.

The second method is to use the standard interfaces provided by Oracle (Sybase, Informix, SQL BackTrack, SQL Server, etc.) to synchronize a backup of the database with the RDBMS recovery mechanism to provide a simple level of ‘hot’ backup of the database, concurrently with any related files. Note that a ‘hot file system’ or checkpointing facility is also used to assure the conventional files backed up correspond to the database.

The third mechanism is to exploit the RDBMS special hot backup mechanisms provided by Oracle and others. The responsibility for the database part of the data warehouse is taken by, say, Oracle, who provide a set of data streams to the backup system – and later

request parts back for restore purposes. With Oracle 7 and 8 this can be used for very fast full backups, and the Veritas NetBackup facility can again ensure that other non-database files are backed up to cover the whole data warehouse. Oracle 8 can also be used with the Veritas NetBackup product to take incremental backups of the database. Each backup, however, requires Oracle to take a scan of the entire data warehouse to determine which blocks have changed, prior to providing the data stream to the backup process. This could be a severe overhead on a 5 terabyte data warehouse. These mechanisms can be fine tuned by partition etc. Also optimizations can be done – for example, to back up read-only partitions once only (or occasionally) and to optionally not back up indexes that can easily be recreated.

Veritas now uniquely also supports block-level incremental backup of any database or file system without requiring pre-scanning. This exploits file-system-level storage checkpoints. The facility will also be available with the notion of ‘synthetic full’ backups where the last full backup can be merged with a set of incremental backups (or the last cumulative incremental backup) to create a new full backup off line. These two facilities reduce by several orders of magnitude the time and resources taken to back up and restore a large data warehouse. Veritas also supports the notion of storage checkpoint recovery by which means the data warehouse can be instantly reset to a date/time when a checkpoint was taken, for example, 7.00 a.m. on each working day.

The technology can be integrated with the Veritas Volume Management technology to automate the taking of full backups of a data warehouse by means of third-mirror breakoff and backup of that mirror. This is particularly relevant for smaller data warehouses where having a second and third copy of it can be exploited for resilience and backup. Replication technology, at either the volume or file-system level, can also be used to keep an up-to-date copy of the data warehouse on a local or remote site – another form of instantaneous, always available, full backup of the data warehouse.

And finally, the Veritas software can be used to exploit network-attached intelligent disk and tape arrays to take backups of the data warehouse directly from disk to tape – without going through the server technology. Alternatives here are disk to tape on a single, intelligent, network-attached device, or disk to tape across a fiber channel from one network-attached device to another.

Each of these backup and restore methods addresses different service-level needs for the data warehouse and also the particular type and number of computers, offline devices and network configurations that may be available. In many corporations a hybrid or combination may be employed to balance cost and service levels.

Online Versus Offline Storage

With the growth in data usage – often more than doubling each year – it is important to assure that the data warehouse can utilize the correct balance of offline as well as online storage. A well balanced system can help control the growth and avoid ‘disk full’

problems, which cause more than 20% of stoppages on big complex systems. Candidates for offline storage include old raw data, old reports, and rarely used multimedia and documents.

Hierarchical Storage Management (HSM) is the ability to ‘off line’ files automatically to secondary storage, yet leaving them accessible to the user. The user sees the file and can access it, but is actually looking at a small ‘stub’ since the bulk of the file has been moved elsewhere. When accessed, the file is returned to the online storage and manipulated by the user with only a small delay. The significance of this to a data warehousing environment in the first instance relates to user activities around the data warehouse. Generally speaking, users will access the data warehouse and run reports of varying sophistication. The output from these will either be viewed dynamically on screen or held on a file server. Old reports are useful for comparative purposes, but are infrequently accessed and can consume huge quantities of disk space. The Veritas HSM system provides an effective way to manage these disk-space problems by migrating files, of any particular type, to secondary or tertiary storage. Perhaps the largest benefit is to migrate off line the truly enormous amounts of old raw data sources, leaving them ‘apparently’ on line in case they are needed again for some new critical analysis. Veritas HSM and NetBackup are tightly integrated, which provides another immediate benefit – reduced backup times – since the backup is now simply of ‘stubs’ instead of complete files.

Disaster Recovery

From a business perspective the next most important thing may well be to have a disaster recovery site set up to which copies of all the key systems are sent regularly. Several techniques can be used, ranging from manual copies to full automation. The simplest mechanism is to use a backup product that can automatically produce copies for a remote site. For more complex environments, particularly where there is a hybrid of database management systems and conventional files, an HSM system can be used to copy data to a disaster recovery site automatically. Policies can be set so files or backup files can be migrated automatically from media type to media type, and from site to site, for example, disk to optical, to tape, to an off-site vault.

Where companies can afford the redundant hardware and a very-high-bandwidth (fiber channel) wide-area communications network, volume replication can be used to retain a secondary remote site identical to the primary data warehouse site.

Reliability and High Availability

A reliable data warehouse needs to depend upon restore and recovery a lot less. After choosing reliable hardware and software the most obvious thing is to use redundant disk technology, dramatically improving both reliability and performance, which are often key in measuring end-user availability. Most data warehouses have kept the database on raw partitions rather than on top of a file system – purely to gain performance. The Veritas file system has been given a special mechanism to run the database at exactly the same speed as the raw partitions. The Veritas file system is a journaling file system and recovers in seconds, should there be a crash. This then enables the database

administrator to get all the usability benefits of a file system with no loss of performance. When used with the Veritas Volume Manager we also get the benefit of ‘software RAID’, to provide redundant disk-based reliability and performance.

After disks, the next most sensible way of increasing reliability is to use a redundant computer, along with event management and high availability software, such as FirstWatch. The event management software should be used to monitor all aspects of the data warehouse – operating system, files, database, and applications. Should a failure occur the software should deal with the event automatically or escalate instantly to an administrator. In the event of a major problem that cannot be fixed, the High Availability software should fail the system over to the secondary computer within a few seconds. Advanced HA solutions enable the secondary machine to be used for other purposes in the meantime, utilizing this otherwise expensive asset.

4.7 Service level agreement

Service Level Agreement (SLA)

A Service Level Agreement is a binding contract which formally specifies end-user expectations about the solution and tolerances. It is a collection of service level requirements that have been negotiated and mutually agreed upon by the information providers and the information consumers. The SLA has three attributes: STRUCTURE, PRECISION, AND FEASIBILITY. This agreement establishes expectations and impacts the design of the components of the data warehouse solution.

Data warehouse projects are popular within the business world today. Competitive advantages are maintained or gained by the strategic use of business information that has been analyzed to produce ways to attract new customers and sell more products to existing customers. The benefits of this analysis have caused business executives to push for data warehouse technology – and expectations for these projects are high. This document examines the performance characteristics of a data warehouse and looks at how expectations for these projects can be set and managed. This paper focuses on the performance aspects of a warehouse running on an IBM® mainframe and using UDB for OS/390 as the database.

Performance Characteristics of a Data Warehouse Environment

The art of performance tuning has always been about matching workloads for execution with resources for that execution. Therefore, the beginning of a performance tuning strategy for a data warehouse must include the characterization of the data warehouse workloads. To perform that characterization, there must be some common metrics to differentiate one workload from another. Once the workloads have been characterized, some analysis should be performed to determine the impact of executing multiple workloads at the same time. It is possible that

some workloads will not work well together and thus, when executed at the same time, will degrade each other's performance. In such cases, it is best to keep these workloads from running at the same time.

One issue to address in a warehouse environment is whether there will be uniform workloads or whether all work will be unique. Some companies may find queries that are executed on a regular basis and thus can be characterized as a workload. Others may find the dynamic nature of the environment very difficult to characterize. The performance analyst can get help with pattern matching from end users while trying to determine the workload characteristics of the company's warehouse. The workload that consists of the maintenance programs for the warehouse should be tracked closely because of its impact on availability. This will be addressed in more detail later.

Workload arrival rates and how those arrival rates can be influenced must be combined with the workload characterization. Queries against a warehouse are not driven by customers transacting business with the company, but rather by users who are searching for information to make the business run smoother and be more responsive to its customers. Therefore, the timing of these queries can be under some control. Typically, this control will work best when it is integrated into the warehouse service-level agreements. This will make control a part of the agreement between IS and the users of the system. For example, this control might be structured by setting different response time goals for different workloads or groups of users based on day of the week or hour of the day. This would not guarantee that the work would arrive at certain times, but it would encourage submission of workloads at different times.

While not specifically related to the issue, data marts have their place in a warehouse strategy. One factor that should be evaluated in deciding to establish marts is the ability to segregate workloads across multiple data marts and thus mitigate the instances of competing workloads that degrade performance.

Sometimes availability is overlooked in the evaluation of performance. If a system is unavailable, then it is not performing. Therefore, the ability of the platform to deliver the required availability is critical. Some might question the need for high availability of a warehouse compared to the availability requirements of an operational system. Many have argued the value of a warehouse to project future buying patterns and needs to ensure that the business remains competitive in a changing marketplace. These decisions affect whether there will be future orders to record in an operational system. In addition, the more a company uses a warehouse to make strategic business decisions, the more the warehouse becomes just as critical as the operational systems that process the current orders. A warehouse may, in fact, need 24x7 availability. Consider that queries against a warehouse will have to process large volumes of data, which may take hours or perhaps days. Longer outages might be tolerated by an operational system if they are planned around user queries, but unplanned outages at the middle or end of a long running query may be unacceptable for users.
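
The idea of setting different response time goals for different workloads by day of the week or hour of the day can be sketched as a small lookup table. This is only an illustration: the workload names, time windows and goal values below are hypothetical and do not come from the text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GoalWindow:
    """One response-time goal, valid for a workload in a weekly time window."""
    workload: str
    days: frozenset       # weekday numbers, Monday = 0 ... Sunday = 6
    start_hour: int       # inclusive
    end_hour: int         # exclusive
    goal_seconds: float


WEEKDAYS = frozenset(range(0, 5))
ALL_DAYS = frozenset(range(0, 7))

# Hypothetical goals. Heavy reports get a looser goal during business hours,
# which encourages users to submit them off-peak.
GOALS = [
    GoalWindow("ad_hoc_query", WEEKDAYS, 8, 18, 60.0),
    GoalWindow("ad_hoc_query", ALL_DAYS, 0, 24, 600.0),
    GoalWindow("monthly_report", WEEKDAYS, 8, 18, 3600.0),
    GoalWindow("monthly_report", ALL_DAYS, 0, 24, 900.0),
]


def response_time_goal(workload: str, when: datetime) -> float:
    """Return the goal of the first window that matches the workload and time."""
    for window in GOALS:
        if (window.workload == workload
                and when.weekday() in window.days
                and window.start_hour <= when.hour < window.end_hour):
            return window.goal_seconds
    raise KeyError(f"no response-time goal defined for workload {workload!r}")


# Monday morning versus Monday night for the same report.
peak = response_time_goal("monthly_report", datetime(2024, 1, 8, 10, 0))
off_peak = response_time_goal("monthly_report", datetime(2024, 1, 8, 22, 0))
```

Because the first matching window wins, the specific business-hours entries are listed before the catch-all entries for the same workload.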

Realistic service-level agreements

Realistic service-level agreements can be achieved within the management of a data warehouse. There are many choices of platforms on which to run the data warehouse. Performance is sometimes a consideration when making this choice, but not always. If performance of the warehouse is not a major consideration, then service-level agreements for the warehouse will have to be adjusted based on the ability of the platform to deliver service.

Testing the warehouse before production implementation is the basis of realistic service-level agreements. This will probably occur near the end of the testing phase and should include tests against the same data that will actually exist in the production warehouse at its inception. This will have to be coordinated with the users to ensure that the tests represent realistic queries. If a tool generates the queries, this will be an opportunity to determine how much control you have within the tool to generate efficient SQL. This may take some time and effort to work with the vendor of the tool and to work with some level of SQL EXPLAIN information to achieve the best SQL for your system. If all of the benefits gained from that design in the SQL queries are not used, then the long hours spent designing the schema of the warehouse to meet the users' requirements will have been wasted.

The goal of the data warehouse is to deliver value to the business. Value is delivered in terms of more business, growth in new business areas or reduced costs for the business. When this is successful, the company executives will want more. This will generate an increase in warehouse activities that will strain the initial resources. Service-level agreements along with capacity planning information can pave the way for warehouse growth as it proves its value to the company and expands its mission.

Service-level agreements for a data warehouse may need to be more fluid than those established for an operational system. An operational system may have a requirement that 90% of all transactions complete in less than one second. However, in a data warehouse system, there may be a requirement to produce an answer within one second for every 10,000 pages processed by the query. While the user may require some education on the metric, this allows the system's administrator to have more control over the delivery of service and the resources needed to deliver on service-level agreements. This is easy to track, report and understand. Metrics that can be obtained without traces are important because of the overhead of DB2® traces.

Service-level agreements need to be adjustable after implementation rather than rigid. In addition, time-sensitive information may occasionally be required for business opportunities that exist for only a short time. The ability to adjust resources to meet specific workloads can have an impact on the attainment of other service-level agreements. Because of these adjustments and the fluid nature of the SLA system, reporting of attainment should have some level of prioritization contained within the reporting structure. This allows for the recognition that some workloads for brief periods were considered less important, within the SLA reporting structure. This reduces the impact of missing the SLA for these workloads. Furthermore, the emphasis should always

be on percentages – of attainment, of resource utilization, and so forth – because the actual resources and response times could and probably will change over time.

4.8 Operating the Data Warehouse

After the data warehouse becomes operational, the data management processes, including extraction, transformation, staging and load scheduling, are automated wherever possible. The loading is based on one of two common procedures: bulk download (which refreshes the entire database periodically) and change-based replication (which copies only the changes made to data residing on different servers).

Once the data warehouse is up and running it will continue to require attention in different ways. A successful data warehouse sees its users increase considerably, which in turn affects its performance if it has not been properly sized. Also support and enhancement requests will roll in. Some other common issues that need to be dealt with in operating the data warehouse are:

• Loading the new data on a regular basis, which can range from real-time to weekly.
• Updating the data to reflect organizational changes, mergers and acquisitions, etc.
• Ensuring uptime and reliability.
• Managing the front end tools.
• Managing the back end components.

Management Tools

Managing such potential complexity and making decisions about which options to use can become a nightmare. In fact many organizations have had to deploy many more administrators than they had originally planned just to keep the fires down. These people cost a lot of money. Being reactive, rather than proactive, means that the resources supporting the data warehouse are not properly deployed – typically resulting in poor performance, excessive numbers of problems, slow reaction to them, and over buying of hardware and software. This is the ‘when in doubt throw more kit at the problem’ syndrome. Management tools must therefore address enabling administrators to switch from reactive to proactive management by automating normal and expected exception conditions against policy. The tools must encompass all of the storage being managed – which means the database, files, file systems, volumes, disk and tape arrays, intelligent controllers and embedded or other tools that each can manage part of the scene.
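
The two loading procedures named under section 4.8 above can be contrasted with a short sketch. It is a simplified model, not a description of any particular product: tables are plain dictionaries keyed by primary key, and the change-log format (operation, key, row) is an assumption made for illustration.

```python
def bulk_refresh(source: dict) -> dict:
    """Bulk download: rebuild the warehouse copy from the entire source table."""
    return dict(source)


def apply_changes(warehouse: dict, change_log: list) -> dict:
    """Change-based replication: apply only the logged changes, in order.

    Each log entry is a hypothetical (operation, key, row) tuple.
    """
    for operation, key, row in change_log:
        if operation == "delete":
            warehouse.pop(key, None)
        else:  # "insert" and "update" both store the new row
            warehouse[key] = row
    return warehouse


# Initial load: copy everything once.
source = {1: "Smith", 2: "Jones"}
warehouse = bulk_refresh(source)

# Later the source changes; only the three changes are shipped, not the table.
change_log = [
    ("update", 2, "Jones-Brown"),
    ("insert", 3, "Patel"),
    ("delete", 1, None),
]
source = {2: "Jones-Brown", 3: "Patel"}
warehouse = apply_changes(warehouse, change_log)
```

The trade-off the sketch makes visible is that a bulk refresh costs time proportional to the whole table on every load, while change-based replication costs time proportional to the number of changes but depends on a reliable change log.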

To assist the proactive management, the tools must collect data about the data warehouse and how it is (and will be) used. Such data could be high level, such as number of users, size, growth, online/offline mix, access trends from different parts of the world. Or it could be very detailed – such as space capacity on a specific disk or array of disks, access patterns on each of several thousand disks, time to retrieve a multimedia file from offline storage, peak utilization of a server in a cluster, and so on – in other words, raw or aggregate data over time that could be used to help optimize the existing data warehouse configuration. The tools should, ideally, then suggest or recommend better ways of doing things. A simple example would be to analyze disk utilization automatically and recommend moving the backup job to an hour later (or use a different technique) and to stripe the data on a particular partition to improve the performance of the system (without adversely impacting other things – a nicety often forgotten). This style of tool is invaluable when change is required. The tools can then be used to execute the change envisaged and any subsequent fine tuning that may be required. Once again it is worth noting that the management tools may need to exploit lower-level tools with which they are loosely or tightly integrated. Finally, the data about the data warehouse could be used to produce more accurate capacity models for proactive planning.

Veritas is developing a set of management tools that address these issues. They are:

STORAGE MANAGER: manages all storage objects such as the database, file systems, tape and disk arrays, network-attached intelligent devices, etc. The product also automates many administrative processes, manages exceptions, collects data about data, enables online performance monitoring and lets you see the health of the data warehouse at a glance. It can automatically manage and advise on the essential growth of the data warehouse, and pre-emptively advise on problems that will otherwise soon occur using threshold management and trend analysis. Storage Manager enables other Veritas and third-party products to be exploited in context, to cover the full spectrum of management required – snap-in tools.

STORAGE ANALYST: collects and aggregates further data, and enables analysis of the data over time.

STORAGE OPTIMIZER: recommends sensible actions to remove hot spots and otherwise improve the performance or reliability of the online (and later offline) storage based on historical usage patterns.

STORAGE PLANNER: will enable capacity planning of online/offline storage, focusing on very large global databases and data warehouses.

[Note: versions of Storage Manager and Optimizer are available now, with the others being phased for later in 1998 and 1999.]
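
The threshold management and trend analysis mentioned above can be illustrated with a minimal sketch. It is not how any Veritas product works; it simply fits a straight line to daily disk-usage samples and raises a warning when either a fixed utilization threshold or a projected ‘disk full’ date is too close. The function names and the default thresholds are assumptions chosen for the example.

```python
def days_until_full(used_gb, capacity_gb):
    """Fit a least-squares line to daily usage samples and extrapolate.

    Returns the estimated number of days until the volume is full,
    or None when usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb))
    slope = sxy / sxx  # growth in GB per day
    if slope <= 0:
        return None
    return (capacity_gb - used_gb[-1]) / slope


def check_volume(used_gb, capacity_gb, pct_threshold=0.85, days_threshold=14):
    """Return the list of alerts raised for one volume."""
    alerts = []
    # Threshold management: react to the current utilization level.
    if used_gb[-1] / capacity_gb >= pct_threshold:
        alerts.append("threshold")
    # Trend analysis: warn before the threshold is reached.
    remaining = days_until_full(used_gb, capacity_gb)
    if remaining is not None and remaining <= days_threshold:
        alerts.append("trend")
    return alerts


# Five daily samples on an 800 GB volume growing by 10 GB per day.
alerts = check_volume([700, 710, 720, 730, 740], capacity_gb=800)
```

A straight-line fit is the simplest possible trend model; real capacity planning would normally account for seasonality and for step changes such as a new data source being loaded.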

The use of these tools and tools from other vendors should ideally be started during the ‘Design and Predict’ phase of development of a data warehouse. It is, however, an unfortunate truth that in most cases they will have to be used retrospectively to manage situations that have become difficult to control – to regain the initiative with these key corporate assets.

Data warehouses, datamarts and other large database systems are now critical to most global organizations. Management of them starts with a good life-cycle process that concentrates on the operational aspects of the system. Their success is dependent on the availability, accessibility and performance of the system. The operational management of a data warehouse should ideally focus on these success factors.

Putting the structured part of the database on a leading database, such as Oracle, provides the assurance of the RDBMS vendor and the vendor’s own management tools. Running the database on top of the Veritas File System, along with other data warehouse files, provides maximum ease of management with optimal performance and availability. It also enables the most efficient incremental backup method available when used with the Veritas NetBackup facilities. By exploiting the Veritas Volume Manager the disk arrays can be laid out for the best balance of performance and data resilience. The Optimizer product can identify hot spots and eliminate them on the fly. Replication services at the volume or file-system level can be used to provide enhanced disaster recovery, remote backup and multiple remote sites by which the decision support needs of the data warehouse can be ‘localized’ to the user community – thereby adding further resilience and providing optimal ‘read’ access. High availability and advanced clustering facilities complete the picture of constantly available, growing, high performance data warehouses.

High-level storage management tools can provide a simple view of these sophisticated options, and enable management by policy and exception. They can also add value by analysis of trends, optimization of existing configurations, through to predictive capacity planning of the data warehouse's future needs. In summary, the key to the operational success of a global data warehouse is ‘online everything’, where all changes to the software, online and offline storage and the hardware can be done on line on a 24x365 basis. Veritas is the storage company to provide end-to-end storage management, performance and availability for the most ambitious data warehouses of today.

4.9 Summary

(i) The hardware needed for a warehouse has been explained with the architecture. The problem with the DW (which is not in OLTP) is that the kind of load and queries are not certain. Therefore, sometimes the allocation of processes across the processors itself runs out of breath. Even if you are having a kind of cluster (with load-balancing and automatic fail-over), it will become complex once you go beyond a certain size. Here the processing is done across multiple servers with each having its own memory and disk space. This way they get their own playing field, instead of fighting for common resources (as in the multiprocessor architecture).

(ii) The various aspects of security for a data warehouse have been discussed in detail. The security requirements of the DW environment are not unlike those of other distributed computing systems. Thus, having an internal control mechanism to assure the confidentiality, integrity and availability of data in a distributed environment is of paramount importance. Unfortunately, most data warehouses are built with little or no consideration given to security during the development phase. Achieving proactive security requirements of DW is a seven-phase process: 1) identifying data, 2) classifying data, 3) quantifying the value of data, 4) identifying data security vulnerabilities, 5) identifying data protection measures and their costs, 6) selecting cost-effective security measures, and 7) evaluating the effectiveness of security measures. These phases are part of an enterprise-wide vulnerability assessment and management program.

(iii) Regarding backup and recovery, the first step is to ensure that all of the data sources from which the data warehouse is created are themselves backed up. Even a small file that is used to help integrate larger data sources may play a critical part. Where a data source is external it may be expedient to ‘cache’ the data to disk, to be able to back it up as well. Then there is the requirement to produce say a weekly backup of the entire warehouse itself which can be restored as a coherent whole with full data integrity. The first mechanism is simply to take cold backups of the whole environment, exploiting multiplexing and other techniques to minimize the backup window (or restore time) by exploiting to the full the speed and capacity of the many types and instances of tape and robotics devices that may need to be configured. The second method is to use the standard interfaces provided by Oracle (Sybase, Informix, SQL BackTrack, SQL Server, etc.) to synchronize a backup of the database with the RDBMS recovery mechanism to provide a simple level of ‘hot’ backup of the database, concurrently with any related files. Note that a ‘hot file system’ or checkpointing facility is also used to assure the conventional files backed up correspond to the database. The third mechanism is to exploit the RDBMS special hot backup mechanisms provided by Oracle and others. The responsibility for the database part of the data warehouse is taken by, say, Oracle, who provide a set of data streams to the backup system – and later request parts back for restore purposes.

(iv) A Service Level Agreement (SLA) is a binding contract which formally specifies end-user expectations about the solution and tolerances. It is a collection of service level requirements that have been negotiated and mutually agreed upon by the information providers and the information consumers. The SLA has three attributes:

STRUCTURE, PRECISION, AND FEASIBILITY. This agreement establishes expectations and impacts the design of the components of the data warehouse solution.

(v) After the data warehouse becomes operational, the data management processes, including extraction, transformation, staging and load scheduling, are automated wherever possible. The loading is based on one of two common procedures: bulk download (which refreshes the entire database periodically) and change-based replication (which copies only the changes made to data residing on different servers). Also, management tools are used for the operations on a data warehouse.

4.10 Exercises

1. Explain the hardware requirements for a data warehouse.
2. Discuss the hardware architecture for a DW.
3. Why is security needed in a data warehouse environment?
4. Describe in detail the various security measures that can be provided for a data warehouse.
5. What types of backup and recovery are needed for a data warehouse?
6. What is an SLA? Explain in detail.
7. Give short notes on operating the data warehouse.

Unit V

Structure of the Unit
5.1 Introduction
5.2 Learning Objectives
5.3 Tuning the data ware house
5.4 Testing the data ware house
5.5 Data ware house features
5.6 Summary
5.7 Exercises

5.1 Introduction

Data warehouses are often at the heart of the strategic reporting systems used to help manage and control the business. The function of the data warehouse is to consolidate and reconcile information from across disparate business units and IT systems to provide a subject-orientated, time-variant, non-volatile, integrated store for reporting on and analysing data. Hence, after a data warehouse has been developed, testing and fine tuning have to be done in order to get correct and accurate results. If these are done correctly, the data warehouse will have the needed features. Since the nature of the data warehouse is dynamic, fine tuning and testing are recurrent processes in a data warehouse environment.

5.2 Learning Objectives

• To know about the testing done in a warehouse
• To have knowledge of the fine tuning done in a data warehouse
• Finally, to have knowledge of the features that a data warehouse has to possess

5.3 Tuning the data ware house

Overview

The speed and efficiency with which a RDBMS can respond to a query strongly affects the response time experienced by end users. All data warehouses can benefit from the creation of the best possible indexes and materialized views. The wrong type of indexes or materialized views generated by the wrong type of SQL commands may degrade performance rather than improve it. Extremely large data warehouses should be striped and partitioned. Before taking this step, however, you should confirm that the data at the lowest levels of aggregation are

truly needed for the types of analysis being performed. Eliminating unnecessarily low-level data from your data warehouse is much easier than striping.

If your data warehouse is configured in a snowflake schema, you should look at how often queries must perform joins on dimension tables. If those joins are frequent, denormalizing the dimension tables can improve performance significantly.
Data types of key columns

In a dimension table, the key column should have a NUMBER data type for the best performance. A primary index is always created on the key column to ensure that each row has a unique value. A NUMBER data type reduces the amount of disk space needed to store the index values for the key, since the index values are also stored as numbers instead of text strings. The smaller the index, the faster the database can search it.

The larger the number of values in the dimension, the greater the improvement in performance of NUMBER keys over CHAR (or other text) keys. Since time dimensions are typically rather small, a NUMBER key will improve performance only slightly. Thus, time dimensions can have either NUMBER or CHAR keys with little difference in performance between them. However, dimensions for products and geographical areas often have thousands of members, and the performance benefits of NUMBER keys can be significant.
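The effect of key width on index size can be illustrated with a rough calculation. The sketch below is only an approximation under stated assumptions: it treats a numeric key as a fixed 4-byte integer and a text key as its UTF-8 bytes, and it ignores the per-entry overhead a real RDBMS adds (Oracle's NUMBER type, for instance, is stored with variable length). The key format `PROD-0000001` and the function name `index_key_bytes` are hypothetical.

```python
import struct


def index_key_bytes(keys, as_text):
    """Return the raw bytes needed to store the given keys.

    Assumption: numeric keys take a fixed 4 bytes each, text keys take
    their UTF-8 length. Real index entries carry extra overhead.
    """
    if as_text:
        return sum(len(str(k).encode("utf-8")) for k in keys)
    return len(keys) * struct.calcsize("<i")


# A hypothetical product dimension with 100,000 members.
product_ids = range(1, 100_001)
text_keys = [f"PROD-{i:07d}" for i in product_ids]  # 12 characters each

numeric_bytes = index_key_bytes(product_ids, as_text=False)
text_bytes = index_key_bytes(text_keys, as_text=True)

print(f"numeric keys: {numeric_bytes} bytes")
print(f"text keys:    {text_bytes} bytes")
print(f"text / numeric ratio: {text_bytes / numeric_bytes:.1f}")
```

Under these assumptions the text keys need three times the space of the numeric keys, and the gap grows with the number of dimension members, which is consistent with the point that large product or geography dimensions benefit most.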

Indexing

Indexing is a vital component of any data warehouse. It allows Oracle to select rows quickly that satisfy the conditions of a query, without having to scan every row in the table. B-tree indexes are the most common; however, bitmap indexes are often the most effective in a data warehouse.

A column identifying gender will have in each cell one of two possible values to indicate male or female. Because the number of distinct values is small, the column has low cardinality. In dimension tables, the parent level columns also have low cardinality because the parent dimension values are repeated for each of their children.

A column containing actual sales figures might have unique values in most cells. Because the number of distinct values is large, columns of this type have high cardinality. Most of the columns in fact tables have high cardinality. Dimension key columns have extremely high cardinality because each value must be unique.

In fine-tuning your data warehouse, you may discover factors other than cardinality that influence your choice of an indexing method. With that caveat understood, here are the basic guidelines:



• Create bitmap indexes for columns with low to high cardinality.
• Create B-tree indexes for columns with very high cardinality (that is, all or nearly all unique values).
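The two guidelines can be sketched as a small helper that measures the cardinality of a column and suggests an index type. This is an illustration only: the function name `suggest_index` and the 0.95 cut-off used for "nearly all unique values" are assumptions made here, not values taken from any Oracle documentation.

```python
def suggest_index(values, unique_threshold=0.95):
    """Suggest an index type from the share of distinct values in a column.

    Assumption: a column counts as "very high cardinality" when at least
    `unique_threshold` of its values are distinct.
    """
    if not values:
        raise ValueError("column has no values")
    ratio = len(set(values)) / len(values)
    kind = "b-tree" if ratio >= unique_threshold else "bitmap"
    return kind, ratio


# Low cardinality: a gender column with two distinct values.
gender_column = ["M", "F"] * 500
# Very high cardinality: a dimension key column where every value is unique.
key_column = list(range(1000))

print(suggest_index(gender_column))
print(suggest_index(key_column))
```

As the text notes, cardinality is not the only factor, so a result from a helper like this is a starting point for tuning rather than a final decision.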

Striping and partitioning

Striping and partitioning are techniques used on large databases that contain millions of rows and multiple gigabytes of data. Striping is a method of distributing the data over your computer resources (such as multiple processors or computers) to avoid congestion when fetching large amounts of data. Partitioning is a method of dividing a large database into manageable subsets. Using partitions, you can reduce administration downtime, eliminate unnecessary scans of tables and indexes, and optimize joins. If other methods of optimizing your database have not been successful in bringing performance up to acceptable standards, then you should investigate these techniques.
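How partitioning eliminates unnecessary scans can be shown with a toy example. The sketch below is not how any particular DBMS implements partitions; it simply splits rows by month and answers a monthly query by reading one partition only. The row layout (`sale_date`, `amount`) and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(rows):
    """Split rows into partitions keyed by (year, month) of the sale date."""
    partitions = defaultdict(list)
    for row in rows:
        d = row["sale_date"]
        partitions[(d.year, d.month)].append(row)
    return partitions


def total_for_month(partitions, year, month):
    """Sum the amounts for one month by reading only that month's partition."""
    return sum(row["amount"] for row in partitions.get((year, month), []))


sales = [
    {"sale_date": date(2023, 1, 5), "amount": 100.0},
    {"sale_date": date(2023, 1, 20), "amount": 50.0},
    {"sale_date": date(2023, 2, 3), "amount": 75.0},
    {"sale_date": date(2023, 3, 14), "amount": 20.0},
]

partitions = partition_by_month(sales)
print(len(partitions), "partitions")
print("January total:", total_for_month(partitions, 2023, 1))
```

The January query touches two of the four rows; on a table with millions of rows spread over many months, the same idea is what lets a partitioned database skip most of the data.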

Materialized views

Oracle will rewrite queries written against tables and views to use materialized views whenever possible. For the optimizer to rewrite a query, it must pass several tests to verify that it is a suitable candidate. If the query fails any of the tests, then the query is not rewritten, and the materialized views are not used. And when the aggregate data must be recalculated at runtime, performance degrades.

All materialized views for use by the OLAP API must be created from within the OLAP management tool of OEM. Materialized views created elsewhere in Oracle Enterprise Manager or directly in SQL are unlikely to match the SQL statements generated by the OLAP API, and thus will not be used by the optimizer.

Application tuning in Data ware house

Application tuning helps companies save money through query optimization.

CIOs and data warehouse directors are under pressure and overwhelmed with user requests for more resources, applications and power—often without an accompanying increase in budget. Fortunately, it's possible to relieve some of the pressure and "find money" by effectively tuning applications on the data warehouse.
Sometimes queries that perform unnecessary full-table scans or other operations that consume too many system resources are submitted to the data warehouse. Application tuning is a process to identify and tune target applications for performance improvements and proactively prevent application performance problems.
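A full-table scan of the kind described above can be spotted by asking the database for its query plan. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a stand-in, since it runs anywhere; it is not Teradata or Oracle tooling, and the table `sales`, the index name and the helper `plan_details` are hypothetical.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the human-readable steps of SQLite's plan for a query."""
    # The last column of each EXPLAIN QUERY PLAN row holds the description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

query = "SELECT amount FROM sales WHERE customer_id = 42"

# Without an index the only option is to scan every row.
before = plan_details(conn, query)

# After adding an index on the filter column the plan becomes an index search.
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
after = plan_details(conn, query)

print("before:", before)
print("after: ", after)
```

The exact wording of the plan text differs between SQLite versions, but the first plan reports a scan of `sales` and the second reports a search using `idx_sales_customer`. Other database systems expose the same kind of information through their own plan or logging facilities.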


Application tuning focuses on returning capacity to a system by concentrating on query optimization. Through application tuning, database administrators (DBAs) look for queries wreaking havoc on the system and then target and optimize those queries to improve system performance and prevent application performance problems. The results can be dramatic, often providing a gain of several nodes' worth of processing power.

Savings in the works

A holistic view of Teradata performance, gained through the timely collection of data, is a good precursor to application tuning. Many customers have engaged Teradata Professional Services to install the performance data collection and reporting (PDCR) database. This historical performance database and report toolkit provides diagnostic reports and graphs to help tune applications, monitor performance, manage capacity and operate the Teradata system at peak efficiency. If the PDCR database is not installed for performance tuning, it is imperative to enable Database Query Log (DBQL) detail, SQL and objects data logging for a timeframe that best represents the system workload to identify the optimum queries for tuning.

To optimize performance and extract more value from your Teradata system, follow these application tuning steps:

STEP 1: Identify performance-tuning opportunities
The DBQL logs historical data about queries including query duration, CPU consumption and other performance metrics. It also offers information to calculate suspect query indicators such as large-table scans, skewing (when the Teradata system is not using all the AMPs in parallel) and large-table-to-large-table product joins (a highly consumptive join).

STEP 2: Find and record "like queries" with similar problems
While the DBQL is used to find specific incidents of problem queries, it can also be used to examine the frequency of a problem query. In this scenario, a DBA might notice that a marketing manager runs a problem query every Monday morning, and the same problem query is run several times a day by various users. Identifying and documenting the frequency of problem queries offers a more comprehensive view of the queries affecting data warehouse performance and helps prioritize tuning efforts.

STEP 3: Determine a tuning solution
To improve query performance, particularly for queries with large-scan indicators, additional indexes or index changes should be considered. Teradata's various indexing options enable efficient resource use, saving I/O and CPU time and thereby making more resources available for other work. Options such as partitioned primary index (PPI), secondary indexes and join indexes can help reduce resource consumption and make queries more efficient.

STEP 4: Determine the best solution

To determine the best tuning options, it is important to baseline existing performance conditions (using DBQL data), pilot potential solutions through experimentation and analyze the results. If multiple optimization strategies are found, DBAs should test one strategy at a time by temporarily creating the new scenario, changing the queries to use the new objects, running the queries, and measuring, documenting and analyzing the results.

DBAs must run tests on the same production system and take the following steps to determine the solution with the best cost/benefit and viability of the final performance fixes:
• Test the system.
• Use each of the optimization strategies and gather the new DBQL data.
• Compare the new DBQL measurements with the original baseline.

STEP 5: Regression testing
Regression testing is an important quality control process to ensure any optimization changes or fixes will not adversely affect data warehouse performance. In regression testing, a regression test suite is created to gauge the effectiveness of the solution before production. First, the DBA must determine a representative list of queries that apply to the selected performance fix. The goal is to ensure queries that are not part of the tuning process are not unduly affected by the optimization changes. From there, the new environment is re-created on the same production system, using a user ID with a low workload priority, and the effects of the change are measured and documented.

STEP 6: Quantify and translate performance gains into business value
CIOs are routinely pressed to show how their IT dollars affect operations and enable cost reduction and business growth. Quantifying the business value of query optimization, or any IT improvement, is an important step to showcasing the value of the data warehouse. Determining business value can be broken into calculations and sub-calculations.

To answer the question "How many CPU seconds equals a node?" use the following calculations:
• Determine per node CPU seconds in a day: (number of CPUs per node X 86,400) - 20%, where 86,400 equals the number of seconds in a day, and 15% to 20% is subtracted from the equation to account for system-level work not recorded in DBQL.
• Multiply per node CPU seconds in a day by 30 to get CPU seconds per node per month. On a four-CPU node, the equation would look something like this: ((4 X 86,400) - ((4 X 86,400)/5)) X 30 = 8,294,400 CPU seconds.
• Check the impact of making a tuning change: Monthly CPU saved = Total old CPU for a month X the average improvement percent.

STEP 7: Document and implement
Presenting application tuning recommendations to IT management and business users typically requires more than a spreadsheet of data, although a spreadsheet can be used for backup material or a deeper dive into performance data and options. The presentation should be tailored to a specific audience and should capture the value of application tuning. The presentation might include:
• Query optimization process
• Options found and tested
• Best option
• Options discarded, and why
• Lists of what still needs testing
• Observations and recommendations
• Anticipated savings

Customers looking to add new applications, improve application performance or quantify the need for hardware expansion can benefit from application tuning. Following the application tuning methodology will help you optimize performance.

5.4 Testing the Data warehouse

Testing a data warehouse is a wondrous and mysterious process. However, it's really not that different from any other testing project. The basic system analysis and testing process still applies. Let's review a few of these steps and how they fit within a data warehouse context:

Analyze source documentation
As with many other projects, when testing a data warehouse implementation, there is typically a requirements document of some sort. These documents can be useful for basic test strategy development, but often lack the details to support test development and execution. Many times there are other documents, known as source-to-target mappings, which provide much of the detailed technical specifications. These source-to-target documents specify where the data is coming from, what should be done to the data, and where it should get loaded. If you have it available, additional system-design documentation can also serve to guide the test strategy.

Develop strategy and test plans
As you analyze the various pieces of source documentation, you'll want to start to develop your test strategy. I've found that from a lifecycle and quality perspective it's often best to seek an incremental testing approach when testing a data warehouse. This essentially means that the development teams will deliver small pieces of functionality to the test team earlier in the process. The primary benefit of this approach is that it avoids an overwhelming "big bang" type of delivery and enables early defect detection and simplified debugging. In addition, this approach serves to set up the detailed processes involved in development and testing cycles. Specific to data warehouse testing this means testing of acquisition staging tables, then incremental tables, then base historical tables, BI views and so forth.
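Testing staging tables against the tables loaded from them lends itself to checks written as SQL queries. The sketch below is a minimal illustration using SQLite through Python's `sqlite3` module as a stand-in for the warehouse database. The tables (`stg_orders`, `dim_customer`, `fact_orders`), the deliberately faulty sample data and the check names are all hypothetical. Each check returns 0 when the load is clean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER);

    INSERT INTO stg_orders VALUES (1, 10), (2, 11), (3, 12);
    INSERT INTO dim_customer VALUES (10), (11);
    -- The loaded data has one duplicated order and one unknown customer.
    INSERT INTO fact_orders VALUES (1, 10), (2, 11), (2, 11), (3, 12);
    """
)

# Each query returns a single number; 0 means the check passed.
CHECKS = {
    "record_count_mismatch": (
        "SELECT (SELECT COUNT(*) FROM fact_orders)"
        " - (SELECT COUNT(*) FROM stg_orders)"
    ),
    "duplicate_orders": (
        "SELECT COUNT(*) FROM ("
        " SELECT order_id FROM fact_orders"
        " GROUP BY order_id HAVING COUNT(*) > 1)"
    ),
    "orphan_customers": (
        "SELECT COUNT(*) FROM fact_orders f"
        " LEFT JOIN dim_customer d ON f.customer_id = d.customer_id"
        " WHERE d.customer_id IS NULL"
    ),
}


def run_checks(connection):
    """Run every check and return a mapping of check name to its result."""
    return {name: connection.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


results = run_checks(conn)
for name, value in results.items():
    print(f"{name}: {'PASS' if value == 0 else 'FAIL'} ({value})")
```

With the sample data all three checks fail, each reporting one problem. Because the expected results are held in queries rather than in an analyst's head, the same checks can be rerun after every incremental delivery.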

Another key data warehouse test strategy decision is the analysis-based test approach versus the query-based test approach. The pure analysis-based approach would put test analysts in the position of mentally calculating the expected result by analyzing the target data and related specifications. The query-based approach involves the same basic analysis but goes further to codify the expected result in the form of a SQL query. This offers the benefit of setting up a future regression process with minimal effort. If the testing effort is a one time effort, then it may be sufficient to take the analysis-based path since that is typically faster. Conversely, if the organization will have an ongoing need for regression testing, then a query-based approach may be appropriate.

Test development and execution
Depending on the stability of the upstream requirements and analysis process it may or may not make sense to do test development in advance of the test execution process. If the situation is highly dynamic, then any early tests developed may largely become obsolete. In this situation, an integrated test development and test execution process that occurs in real time can usually yield better results. In any case, it is helpful to frame the test development and execution process with guiding test categories. For example, a few data warehouse test categories might be:
• record counts (expected vs. actual)
• duplicate checks
• reference data validity
• referential integrity
• error and exception logic
• incremental and historical process control
• column values and default values

In addition to these categories, defect taxonomies like Larry Greenfield's may also be helpful.

An often forgotten aspect of execution is an accurate status reporting process. Making sure the rest of the team understands your approach, the test categories and testing progress will ensure that the rest of the team is clear on the testing status. With some careful planning, follow-through and communication, a data warehouse testing process can be set up to guide the project team to a successful release.

The goals of successful data warehouse testing are, namely:
• Data completeness. Ensures that all expected data is loaded.
• Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
• Data quality. Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data.

• Performance and scalability. Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.
• Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
• User-acceptance testing. Ensures the solution meets users' current expectations and anticipates their future expectations.
• Regression testing. Ensures existing functionality remains intact each time a new release of code is completed.

5.5 Data Warehouse features

Data warehouse technologies need to have the following features:
• capacity: what are the outer bounds of the DBMS technology's physical storage capability, and what in practice (meaning in real-world, referenceable commercial applications) are the boundaries of the technology's capacity?
• loading and indexing performance: how fast does the DBMS load and index the raw data sets from the production systems or from reengineering tools?
• operational integrity, reliability and manageability: how well does the technology perform from an operational perspective? how stable is it in near-24x7 environments? how difficult is it to archive the data warehouse data set, in whole or in part, while the data warehouse is online? what kinds of tools and procedures exist for examining warehouse use and tuning the DBMS and its platform accordingly?
• client/server connectivity: how much support, and what kinds of support, does the open market provide for the DBMS vendor's proprietary interfaces, middleware and SQL dialects?
• query processing performance: how well does the DBMS' query planner handle ad hoc SQL queries? How well does the DBMS perform on table scans? on more or less constrained queries? on complex multi-way join operations?

None of these areas is exclusively the province of the DBMS technology; all depend on the elusive combination of the right design, the right DBMS technology, the right hardware platform, the right operational procedures, the right network architecture and implementation and the hundred other variables that make up a complex client/server environment. Nevertheless, it should be possible to get qualitative if not quantitative information from a prospective data warehouse DBMS vendor in each of these areas.

Capacity

Capacity is a funny kind of issue in large-scale decision support environments. Until just a few years ago, very large database (VLDB) boundaries hovered around the 10 gigabyte (GB) line, yet data warehouses are often spoken of in terms of multiple terabytes (TB). DSS (Decision Support System) is generally an area where, prior to the first DSS project, data machismo reigns: the firm with the biggest warehouse wins, and sometimes the design principle at work is "let's put everything we have into the warehouse, since we can't tell quite what people want." The reality is that:
• well-designed warehouses are typically greater than 250 GB for a mid-sized to large firm. The primary determinants of size are granularity or detail, and the number of years of history kept online.
• the initial sizing estimates of the warehouse are always grossly inaccurate.
• the difference between the raw data set size (the total amount of data extracted from production sources for loading into the warehouse) and the loaded data set size varies widely, with loaded sets typically taking 2.5 to 3 times the space of raw data sets.
• the gating factor is never, as old guard technologists seem to think, the available physical storage media, but instead the ability of the DBMS technology to manage the loaded set comfortably.
• any DBMS technology's capacity is only as good as its leading-edge customers say (and demonstrate) that it is; a theoretical limit of multiple terabytes means little in the face of an installed base with no sites larger than 50 GB.

Loading And Indexing Speed

Data engineering -- the extraction, loading and indexing of DSS data -- is a black art, and there are as many data engineering strategies as there are data warehouses. A firm has customers using state-of-the-art reengineering and warehouse management tools like those from Prism and ETI, customers using sophisticated homegrown message-based near-real-time replication mechanisms, and customers who cut 3480 tapes using mainframe report writers. The bottom line is that the specifics of a DBMS technology's load and indexing performance are conditioned by the data engineering procedures in use, and it's therefore necessary to have a clear idea of likely data engineering scenarios before it is possible to evaluate fully a DBMS' suitability for data warehousing applications.

This area of evaluation is made more complex by the fact that some proprietary MDDBMS environments lack the ACID characteristics required to recover from a failure during loading, or do not support incremental updates at all, making full drop-and-reload operations a necessity. Nevertheless, it is definitely the case that significant numbers of first-time DSS projects fail not because the DBMS is incapable of processing queries in a timely fashion, but because the database cannot be loaded in the allotted time window. Loads requiring days are not unheard of when this area of evaluation is neglected, and, when the warehouse is refreshed daily, this kind of impedance mismatch spells death for the DSS project.

Operational Integrity, Reliability and Manageability

A naive view of DSS would suggest that, since the data warehouse is a copy of operational data, traditional operational concerns about overall system reliability, availability and maintainability do not apply to data warehousing environments. Nothing could be farther from the truth. First of all, the data warehouse is a unique, and quite possibly the most clean and complete, data source in the enterprise. The consolidation and integration that occur during the data engineering process create unique data elements, and scrub and rationalize data elements found elsewhere in the enterprise's data stores. Second of all, the better the enterprise DSS design, the more demand placed on the warehouse and its marts. Effective DSS environments quickly create high levels of dependency, within end-user communities, on the warehouse and its marts; organizational processes are built around the DSS infrastructure; other applications depend on the warehouse or one of its marts for source data. The loss of warehouse or mart service can quite literally bring parts of the firm to a grinding, angry halt. Bottom line: all the operational evaluation criteria we would apply without thinking to an online transaction processing (OLTP) system apply equally to the data warehouse.

Client/Server Connectivity

The warehouse serves, as a rule, data marts and not end-user communities. There are exceptions to this rule: some kinds of user communities, particularly those who bathe regularly in seas of quantitative data, will source their analytic data directly from the warehouse. For that reason, and because the open middleware marketplace is now producing open data movement technology that promises to link heterogeneous DBMSs with high-speed data transfer facilities, it is important to understand what kind and what quantity of support exist in the open marketplace for the DBMS vendor's proprietary client/server interfaces and SQL dialects. A DBMS engine that processes queries well, but which has such small market share that it is ignored by independent software vendors, or supported only through an open specification like Microsoft's Open Database Connectivity (ODBC) specification, is a dangerous architectural choice.

Query Processing Performance

Query processing performance, like capacity, is an area of the DSS marketplace in which marketing claims abound and little in the way of common models or metrics is to be found. Part of the practical difficulty in establishing conventions in this area has to do with the usage model for the warehouse. If, for example, the warehouse is primarily concerned with populating marts, its query performance is a secondary issue, since it is unlikely that a large mart would request its load set using dynamic SQL, and far more likely that some kind of bulk data transfer mechanism, fed by a batch extract from the warehouse, would be used. If, on the other hand, the warehouse is connected directly to significant numbers of intelligent desktops equipped with ad hoc query tools, the warehouse will have to contend with a wide range of unpredictable constrained and unconstrained queries, many of which are likely to impose a multi-way join discipline on the DBMS. If the warehouse serves intensive analytic applications like statistical analysis tools or neural network-based analytic engines, or if the warehouse is the target for (typically batch-oriented) operational reporting processes, the warehouse is likely to have to contend with significant volumes of inbound queries imposing table scanning disciplines on the database: few joins, significant post-processing, and very large result sets. All of these usage models suggest different performance requirements, and different (and perhaps mutually exclusive) database indexing strategies. Thus -- as is the case with load and indexing performance -- it is critical to have a clear idea of the warehouse usage model before structuring performance requirements in this area.

5.6 Summary

Tuning in a warehouse has to be done to improve performance. This can be done by deciding the proper data types for the important columns, and through indexing, partitioning and materialized views. Partitioning is a method of dividing a large database into manageable subsets. Through partitioning one can achieve easy and consistent query performance, simpler maintenance of data, and scalability of the data warehouse. Data compression can also be made a part of tuning the warehouse.

Testing in a data warehouse is done to test data completeness, data transformation, quality, performance, scalability, user acceptance etc. First the test data is prepared. Then the test plans and strategy are prepared and the tests on the warehouse are carried out.

General features of a data warehouse can be listed as follows:
• capacity: what are the outer bounds of the DBMS technology's physical storage capability
• loading and indexing performance: how fast does the DBMS load and index the raw data sets from the production systems or from reengineering tools?
• operational integrity, reliability and manageability: how well does the technology perform from an operational perspective? how stable is it in near-24x7 environments? how difficult is it to archive the data warehouse data set, in whole or in part, while the data warehouse is online?
• client/server connectivity: how much support, and what kinds of support, does the open market provide for the DBMS vendor's proprietary interfaces, middleware and SQL dialects?
• query processing performance: how well does the DBMS' query planner handle ad hoc SQL queries? How well does the DBMS perform on table scans? on more or less constrained queries?

5.7 Exercise

1. What do you mean by data warehouse tuning? Discuss in detail with examples.
2. How does tuning help to improve performance in a warehouse?
3. How can testing be done on a warehouse? List out the steps and explain.
4. Data warehouse features – write in detail.