
DATA MINING AND DATA WAREHOUSING

A PAPER PRESENTATION AT
TECHNO CARNIVAL- 2006

FROM

SUBMITTED BY:
PRADEEP BHANAWAT T.E. (C.S.E.)
deep1_wit@yahoo.co.in

HARSHIL GANDHI T.E. (C.S.E.)


harshil12@gmail.com Mb.no.-9890469828

GUIDED BY: Prof.Mrs.R.K.Dixit


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
WALCHAND INSTITUTE OF TECHNOLOGY SOLAPUR (MAHARASHTRA).

ABSTRACT:


Fast, accurate and scalable data analysis techniques are needed to extract useful information from huge piles of data. A data warehouse is a single, integrated source of decision support information formed by collecting data from multiple sources, internal to the organization as well as external, and transforming and summarizing this information to enable improved decision making. A data warehouse is designed for easy access by users to large amounts of information, and data access is typically supported by specialized analytical tools and applications. Typical applications include decision support systems and executive information systems. Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. It is an information extraction activity whose goal is to discover hidden facts contained in databases: the process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions. Data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. A data mining model is a description of a specific aspect of a dataset; it produces output values for an assigned set of input values. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

Introduction:


Organizations are increasingly analyzing current and historical data to identify useful patterns and support business strategies. A large amount of the right information is the key to survival in today's competitive environment, and this kind of information can be made available only if there is totally integrated enterprise data.

DATA MINING

What is data mining?


Data mining refers to the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of tools used for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Data mining is about techniques for finding and describing structural patterns in data.

Definition:
Data mining is the process of finding correlations or patterns among fields in large relational databases. It has also been defined as the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).

Different Types of Data Mining:


- Business Data Mining
- Scientific Data Mining
- Internet Data Mining

Five major elements of Data Mining:


1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data with application software.
5. Present the data in a useful format such as a graph or table.

Requirements of Data Mining:

- Handling of different types of data
- Efficiency and scalability of algorithms
- Usefulness, certainty and expressiveness of results
- Expression of various kinds of mining results
- Interactive mining of knowledge at multiple levels
- Mining information from different sources of data
- Protection of privacy and data security

DATA MINING: A KDD Process

Steps of KDD Process:


1. Learning the application domain: relevant prior knowledge and goals of the application.
2. Creating a target data set: data selection.
3. Data cleaning and preprocessing.
4. Data reduction and transformation: find useful features, dimensionality/variable reduction, and invariant representation.
5. Choosing functions of data mining: summarization, classification, regression, association, clustering.
6. Choosing the mining algorithm(s).
7. Data mining: search for patterns of interest.
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge.
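The KDD steps can be illustrated with a very small pipeline. The following is only a sketch under stated assumptions: the records, the field names, the age bands and the `min_count` threshold are all hypothetical, and simple frequency counting stands in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, age, item); None marks a missing value.
raw_records = [
    (1, 34, "bread"),
    (2, None, "milk"),
    (3, 51, "bread"),
    (4, 29, None),
    (5, 45, "milk"),
    (6, 38, "bread"),
]

def select(records):
    """Data selection: keep only the fields relevant to the goal (age, item)."""
    return [(age, item) for _, age, item in records]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if None not in r]

def transform(records):
    """Data reduction: replace the exact age with a coarse age band."""
    return [("under40" if age < 40 else "40plus", item) for age, item in records]

def mine(records):
    """Data mining: count how often each (age band, item) pattern occurs."""
    return Counter(records)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

patterns = evaluate(mine(transform(clean(select(raw_records)))))
print(patterns)  # {('under40', 'bread'): 2}
```

Each function corresponds to one stage of the process, so a stage can be swapped for a more capable technique without touching the others.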

Various kinds of data on which Data Mining is applied :


- Relational database
- Data warehouse
- Transactional database
- Multimedia database
- Spatial and temporal data
- Object-relational database

Problems in Data Mining:


Data mining systems rely on databases to supply the raw data for input, which raises the following problems:
- Limited information
- Noise and missing values
- Uncertainty

Data Mining applications:

Retail/Marketing:
- Performing basket analysis: identifying which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
- Sales forecasting: examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
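Basket analysis can be sketched by counting how often items occur together. The example below is a minimal illustration, not a production association-rule miner: the baskets are invented, and the helper names `pair_support` and `confidence` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
    {"milk", "bread"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` is bought when `antecedent` is bought."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
print(support[("bread", "butter")])            # 0.6
print(confidence(baskets, "butter", "bread"))  # 1.0
print(confidence(baskets, "bread", "butter"))  # 0.75
```

Here every basket with butter also contains bread, which is the kind of relationship a retailer could use when planning shelf layout or promotions.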

Telecommunication:
- Call detail record analysis: telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
- Customer loyalty: some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives offered by competing companies. The companies can use data mining to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.
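Grouping customers with similar use patterns is a clustering task. The sketch below uses plain k-means as one possible technique; the paper does not prescribe an algorithm. The usage figures, the choice of two segments and the starting centroids are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical monthly usage per customer: (daytime minutes, night minutes).
usage = {
    "c1": (300.0, 20.0),
    "c2": (280.0, 35.0),
    "c3": (40.0, 250.0),
    "c4": (55.0, 230.0),
    "c5": (310.0, 10.0),
    "c6": (30.0, 270.0),
}

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

points = list(usage.values())
centroids = kmeans(points, centroids=[points[0], points[2]])
segments = {customer: nearest(p, centroids) for customer, p in usage.items()}
print(segments)  # {'c1': 0, 'c2': 0, 'c3': 1, 'c4': 1, 'c5': 0, 'c6': 1}
```

Segment 0 gathers the mostly daytime callers and segment 1 the mostly night-time callers, so each group could be offered a different tariff.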

What is data warehousing?


A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decisions.

NEED FOR A DATA WAREHOUSE :


- IT or business staff spend a lot of time developing special reports for decision-makers.
- Many PC-based or small server systems obtain extracts of data but are incapable of presenting a holistic view of the entire gamut of information.
- The same data is present on different systems in different departments, and users may be unaware of this fact.
- It is difficult to get meaningful information in a timely manner.
- Multiple systems give different answers to the same business questions.
- Decision makers and policy planners perform less analysis because sophisticated tools and easily decipherable, timely and comprehensive information are not available.

PURPOSE OF A DATA WAREHOUSE :


- Better business intelligence for end users
- Reduction in time to locate, access and analyze information
- Consolidation of disparate information sources
- Replacement of older, less-responsive decision support systems
- Faster time to market for products and services
- Strategic advantage over competitors

Data Warehouse Characteristics:

1. Subject-oriented: the warehouse is organized around the major subjects of the enterprise rather than the major application areas. This is reflected in the need to store decision-support data rather than application-oriented data.
2. Integrated: the source data come together from different enterprise-wide application systems. The source data is often inconsistent, so the integrated data source must be made consistent to present a unified view of the data to the users.
3. Time-variant: the data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time for which the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
4. Non-volatile: data is not updated in real time but is refreshed from the operational systems on a regular basis. New data is always added as a supplement to the database, rather than as a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
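The time-variant and non-volatile characteristics can be illustrated with an append-only store in which every refresh adds a dated snapshot instead of overwriting earlier values. This is a minimal sketch; the class name `SnapshotStore`, its methods and the account data are hypothetical.

```python
from datetime import date

class SnapshotStore:
    """Append-only store: each refresh adds dated rows and never overwrites old ones."""

    def __init__(self):
        self._rows = []  # list of (snapshot_date, key, value)

    def refresh(self, snapshot_date, operational_data):
        """Append a snapshot of the operational data taken on snapshot_date."""
        for key, value in operational_data.items():
            self._rows.append((snapshot_date, key, value))

    def as_of(self, key, when):
        """Return the latest value recorded for key on or before `when`, else None."""
        candidates = [(d, v) for d, k, v in self._rows if k == key and d <= when]
        return max(candidates, key=lambda t: t[0])[1] if candidates else None

    def history(self, key):
        """Return every (date, value) snapshot for key in date order."""
        return sorted(((d, v) for d, k, v in self._rows if k == key), key=lambda t: t[0])

store = SnapshotStore()
store.refresh(date(2006, 1, 31), {"acct-1": 100})
store.refresh(date(2006, 2, 28), {"acct-1": 80})

print(store.as_of("acct-1", date(2006, 2, 15)))  # 100
print(store.as_of("acct-1", date(2006, 3, 1)))   # 80
print(len(store.history("acct-1")))              # 2
```

Because the January value is still present after the February refresh, questions about any past date can still be answered, which an operational system that updates in place cannot do.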

DATA WAREHOUSE LIFE CYCLE:


Data warehousing is a concept, not a product that can be purchased off the shelf. It is a set of hardware and software components integrated together, which can be used to analyze efficiently the massive amount of data stored. It is also a process through which one can build a successful data warehouse. The five steps towards building a successful data warehouse are:
1. Justification
2. Requirement analysis
3. Design
4. Development and implementation
5. Deployment

Data Warehouse Architecture:

[Figure: Typical architecture of a data warehouse. Components shown: operational data sources 1 to n, operational data store (ODS), load manager, warehouse manager, query manager, DBMS holding meta-data, detailed data, lightly summarized data, highly summarized data and archive/backup data, and end-user access tools: reporting, query, application development and EIS (executive information system) tools, OLAP (online analytical processing) tools, and data mining tools.]

Main Components:

- Operational data sources: data for the data warehouse is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
- Operational data store (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
- Load manager: also called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse.
- Warehouse manager: performs all the operations associated with the management of the data in the warehouse. These operations include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
- Query manager: also called the backend component, it performs all the operations associated with the management of user queries. These operations include directing queries to the appropriate tables and scheduling the execution of queries.
- Detailed, lightly summarized and highly summarized data
- Archive/backup data
- Meta-data
- End-user access tools: these can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.

Data Warehouse Components:


- Detailed data
- Summary data
- Metadata

A data warehouse:
- Ranges from detailed to summarized data
- Contains metadata
- Offers many views of the data
- Is subject-oriented
- Is time-variant

Data Flows
- Inflow: the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
- Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
- Downflow: the processes associated with archiving and backing up the data in the warehouse.
- Outflow: the processes associated with making the data available to the end-users.
- Meta-flow: the processes associated with the management of the meta-data.

Tools and Technologies:


The critical steps in the construction of a data warehouse are:
a. Extraction
b. Cleansing
c. Transformation

After these critical steps, loading the results into the target system can be carried out either by separate products or by a single product. Such tools fall into the following categories:
- Code generators
- Database data replication tools
- Dynamic transformation engines
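The extraction, cleansing, transformation and loading steps can be sketched as four small functions chained together. This is an illustration under stated assumptions rather than a description of any particular tool: the CSV extract, its column names and the in-memory `warehouse_table` are hypothetical.

```python
import csv
import io

# Hypothetical extract from an operational system, with typical defects:
# stray whitespace, inconsistent capitalization and a missing amount.
SOURCE_CSV = """order_id,region,amount
1001, north ,250.00
1002,SOUTH,
1003,South,99.50
"""

def extract(text):
    """Extraction: read the source rows as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Cleansing: trim whitespace and drop rows with no amount."""
    cleaned = []
    for row in rows:
        row = {k: v.strip() for k, v in row.items()}
        if row["amount"]:
            cleaned.append(row)
    return cleaned

def transform(rows):
    """Transformation: convert types and use one spelling for each region."""
    return [
        {
            "order_id": int(r["order_id"]),
            "region": r["region"].upper(),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]

def load(rows, target):
    """Loading: append the prepared rows to the target table."""
    target.extend(rows)
    return len(rows)

warehouse_table = []
loaded = load(transform(cleanse(extract(SOURCE_CSV))), warehouse_table)
print(loaded)              # 2
print(warehouse_table[0])  # {'order_id': 1001, 'region': 'NORTH', 'amount': 250.0}
```

Keeping the steps separate mirrors the option of using separate products for each step; a single integrated tool would simply hide the same stages behind one interface.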

Data Warehousing Issues


- Semantic integration: when getting data from multiple sources, mismatches must be eliminated, e.g., different currencies and database schemas.
- Heterogeneous sources: data must be accessed from a variety of source formats and repositories. Replication capabilities can be exploited here.
- Load, refresh, purge: data must be loaded, periodically refreshed, and purged when it is too old.
- Metadata management: the source, loading time, and other information must be tracked for all data in the warehouse.

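Three of these issues (semantic integration, purging and metadata management) can be illustrated together. The following is a rough sketch only: the exchange rates are invented fixed values, and the record layout, source names and retention period are assumptions.

```python
from datetime import date, timedelta

# Hypothetical fixed exchange rates to a common reporting currency (USD).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "INR": 0.02}

def integrate(record, source, loaded_on):
    """Convert the amount to USD and attach metadata about source and load time."""
    return {
        "amount_usd": record["amount"] * RATES_TO_USD[record["currency"]],
        "sale_date": record["sale_date"],
        "meta": {"source": source, "loaded_on": loaded_on},
    }

def purge(rows, today, keep_days):
    """Drop rows whose sale date is more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    return [r for r in rows if r["sale_date"] >= cutoff]

today = date(2006, 6, 30)
rows = [
    integrate({"amount": 100.0, "currency": "EUR", "sale_date": date(2006, 6, 1)},
              source="eu_orders", loaded_on=today),
    integrate({"amount": 5000.0, "currency": "INR", "sale_date": date(2004, 1, 15)},
              source="in_orders", loaded_on=today),
]

kept = purge(rows, today, keep_days=365)
print(len(kept))              # 1
print(kept[0]["amount_usd"])  # 125.0
print(kept[0]["meta"])
```

After integration every amount is in the same currency, each row records where and when it was loaded, and the purge step removes the sale that is older than the retention period.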
[Figure: BookSales fact table and its dimension tables]

Fact table:
- BookSales fact table
  Foreign keys: Date key, Bookstore key, Book key, Clerk code key
  Summary data: Units, Sales, Discounts

Dimension tables:
- Book Dimension: Book key, Author, Title, Description, Discount
- Date Dimension: Date key, Transaction data, Day of month, Month of year, Year
- Bookstore Dimension: Bookstore key, City, Region, Inventory
- Clerk Dimension: Clerk code key, Name, Manager, Store
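A fact table surrounded by dimension tables, an arrangement commonly called a star schema, can be tried out with Python's built-in `sqlite3` module. The sketch below is a simplified assumption-based version of the BookSales example: the table and column names are adapted, the clerk dimension and several columns are omitted for brevity, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book_dim (book_key INTEGER PRIMARY KEY, author TEXT, title TEXT);
CREATE TABLE bookstore_dim (bookstore_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, day_of_month INTEGER,
                       month_of_year INTEGER, year INTEGER);
-- The fact table holds foreign keys to each dimension plus the summary measures.
CREATE TABLE book_sales_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    bookstore_key INTEGER REFERENCES bookstore_dim(bookstore_key),
    book_key INTEGER REFERENCES book_dim(book_key),
    units INTEGER,
    sales REAL
);
""")

conn.executemany("INSERT INTO book_dim VALUES (?, ?, ?)",
                 [(1, "A. Author", "Title One"), (2, "B. Author", "Title Two")])
conn.executemany("INSERT INTO bookstore_dim VALUES (?, ?, ?)",
                 [(1, "Solapur", "West"), (2, "Pune", "West"), (3, "Nagpur", "East")])
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)",
                 [(20060115, 15, 1, 2006), (20060220, 20, 2, 2006)])
conn.executemany("INSERT INTO book_sales_fact VALUES (?, ?, ?, ?, ?)", [
    (20060115, 1, 1, 3, 300.0),
    (20060115, 2, 2, 1, 150.0),
    (20060220, 3, 1, 2, 200.0),
    (20060220, 1, 2, 4, 600.0),
])

# Join the fact table to one dimension and summarize the measures by region.
sales_by_region = conn.execute("""
    SELECT s.region, SUM(f.units), SUM(f.sales)
    FROM book_sales_fact AS f
    JOIN bookstore_dim AS s ON s.bookstore_key = f.bookstore_key
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
print(sales_by_region)  # [('East', 2, 200.0), ('West', 8, 1050.0)]
```

The same fact rows can be summarized by author, month or year simply by joining to a different dimension table, which is the main attraction of this layout for decision support queries.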

The benefits of data warehousing:

The potential benefits of data warehousing are:
- High returns on investment
- Substantial competitive advantage
- Increased productivity of corporate decision-makers
- More cost-effective decision making
- Better enterprise intelligence
- Enhanced customer service
- Better asset/liability management
- Business process reengineering

Applications:

Online Transaction Processing (OLTP) :


OLTP systems are a major kind of enterprise application. Examples include order entry systems, inventory control systems, reservation systems, point-of-sale systems, and tracking systems.

Executive information system (EIS) :


Executive information systems present information at the highest level of summarization using corporate business measures. They are designed for extreme ease of use and, in many cases, only a mouse is required. Graphics are usually generously incorporated to provide at-a-glance indications of performance.

Decision Support Systems (DSS) :

Decision support systems ideally present information in graphical and tabular form, providing the user with the ability to drill down on selected information. Compared with an EIS, they present increased detail and more data manipulation options.
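Drilling down means moving from a summary figure to the more detailed figures behind it. The sketch below shows the idea on invented sales rows; the function names `summarize` and `drill_down` and the region and city values are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows: (region, city, amount).
sales = [
    ("West", "Solapur", 300.0),
    ("West", "Pune", 150.0),
    ("East", "Nagpur", 200.0),
    ("West", "Solapur", 600.0),
]

def summarize(rows, level):
    """Total sales grouped by the first `level` columns (1 = region, 2 = region and city)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[:level]] += row[-1]
    return dict(totals)

def drill_down(rows, region):
    """Show the city-level detail behind one region's summary figure."""
    return {key[1]: total for key, total in summarize(rows, 2).items() if key[0] == region}

print(summarize(sales, 1))        # {('West',): 1050.0, ('East',): 200.0}
print(drill_down(sales, "West"))  # {'Solapur': 900.0, 'Pune': 150.0}
```

An EIS would typically stop at the first, region-level view, while a DSS lets the user continue to the second.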

Conclusion:
Data warehousing provides the means to change raw data into information for making effective business decisions; the emphasis is on information, not data. The data warehouse is the hub for decision support data. Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks. It can benefit business, medicine, and science. More efficient algorithms are still needed to speed up the data mining process.