Professional Documents
Culture Documents
Aristotle Onassis
Automatic discovery of patterns Prediction of likely outcomes Creation of actionable information Focus on large data sets and databases
Data mining can discover patterns and relationships in the data in order to help make better business decisions. Data mining can help spot sales trends, develop smarter marketing campaigns, and accurately predict customer loyalty.
Market segmentation - Identify the common characteristics of customers who buy the same products from your company. Customer churn - Predict which customers are likely to leave your company and go to a competitor. Fraud detection - Identify which transactions are most likely to be fraudulent. Direct marketing - Identify which prospects should be included in a mailing list to obtain the highest response rate. Interactive marketing - Predict what each individual accessing a Web site is most likely interested in seeing. Market basket analysis - Understand what products or services are commonly purchased together; e.g., beer and diapers. Trend analysis - Reveal the difference between a typical customer this month and last.
Data gathering Data cleaning Feature extraction Pattern extraction and discovery Visualization of data Evaluation of result
Predictive
Descriptive
Classification
Regression
Clustering
Association
Sequential analysis
Data Mining
Task-relevant Data
Data Selection Data Preprocessing
Data Warehouse
Data Cleaning Data Integration
Databases
Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Making Decisions
Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Analysis, Querying and Reporting
End User
Data Warehouses / Data Marts OLAP, MDA Data Sources Paper, Files, Information Providers, Database Systems, OLTP
DBA
Artificial neural networks Genetic algorithms Decision trees Nearest neighbor method Rule induction Data visualization
Inmonss definition
A data warehouse is -subject-oriented, -integrated, -time-variant, -nonvolatile collection of data in support of managements decision making process.
Subject-oriented
Data warehouse is organized around subjects such as sales,product,customer. It focuses on modeling and analysis of data for decision makers. Excludes data not useful in decision support process.
Integration
Data Warehouse is constructed by integrating multiple heterogeneous sources. Data Preprocessing are applied to ensure consistency.
RDBMS
Legacy System
Data Warehouse
Flat File
Integration
In terms of data.
encoding structures.
Measurement of attributes. physical attribute. of data
remarks
naming conventions.
Data type format
Time-variant
Provides information from historical perspective e.g. past 5-10 years Every key structure contains either implicitly or explicitly an element of time
Nonvolatile
Data once recorded cannot be updated. Data warehouse requires two operations in data accessing Initial loading of data Access of data
load
acces s
Star Schema
A single,large and central fact table and one table for each dimension. Every fact points to one tuple in each of the dimensions and has additional attributes. Does not capture hierarchies directly.
Store Key
Product Key Period Key Units
Period Key
Year Quarter Month
Price
Benefits: Easy to understand, easy to define hierarchies, reduces no. of physical joins.
SnowFlake Schema
Variant of star schema model. A single,large and central fact table and one or more tables for each dimension. Dimension tables are normalized i.e. split dimension table data into additional tables
Store Key
Product Key Period Key Units
Period Key
Year Quarter Month
Price
Region
Fact Constellation
Multiple fact tables share dimension tables. This schema is viewed as collection of stars hence called galaxy schema or fact constellation. Sophisticated application requires such schema.
Flat File
Client
Units 3 4
Pune
Pune Mumbai
3
4 16
7.95
7.32 42.40
Product_Category_ID 1 1 2
Time
OLAP Cube
City All Mumbai Mumbai Mumbai Mumbai Mumbai Product All All White Bread Time All All All Units 113 64 38 13 3 3 Dollars 251.26 146.07 98.49 32.24 7.44 7.44
OLAP Operations
Drill Down Product Category e.g Electrical Appliance
Time
OLAP Operations
Drill Up Product Category e.g Electrical Appliance
Time
OLAP Operations
Slice and Dice Product Product=Toaster
Time
Time
OLAP Operations
Pivot Product Product
Time
Region
OLAP Server
An OLAP Server is a high capacity,multi user data manipulation engine specifically designed to support and operate on multi-dimensional data structure. OLAP server available are
MOLAP server ROLAP server HOLAP server
Presentation
Product
Reporting Tool
Report Time
References
Building Data Warehouse by Inmon Data Mining:Concepts and Techniques by Han,Kamber. www.dwinfocenter.org www.datawarehousingonline.com www.billinmon.com
Thank You