You are on page 1of 19

Data warehousing and mining

Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay

Introduction
Organizations getting larger and amassing ever increasing amounts of data Historic data encodes useful information about working of an organization. However, data scattered across multiple sources, in multiple formats. Data warehousing: process of consolidating data in a centralized location Data mining: process of analyzing data to find useful patterns and relationships
Dr. Sunita Sarawagi Data Warehousing & Mining 2

Typical data analysis tasks


Report the per-capita deposits broken down by region and profession. Are deposits from rural coastal areas increasing over last five years? What percent of small business loans were cleared? Why is it less than last years? How did similar businesses that did not take loans perform? What should be the new rules for loan eligibility?
Dr. Sunita Sarawagi Data Warehousing & Mining 3

Decision support tools


Direct Query Reporting tools
Crystal reports

OLAP
Essbase

Mining tools
Intelligent Miner

Merge Clean Summarize Detailed transactional data

Data warehouse

Relational DBMS+ e.g. Redbrick GIS data

Operational data
Data Warehousing & Mining

Bombay branch Delhi branch Oracle


Dr. Sunita Sarawagi

Calcutta branch IMS

Census data SAS


4

Data warehouse construction


Heterogeneous data integration
merge from various sources, fuzzy matches remove inconsistencies

Data cleaning:
missing data, outliers, clean fields e.g. names/addresses Data mining techniques

Data loading: summarize, create indices Products: Prism warehouse manager, Platinum info
refiner, info pump, QDB, Vality
Dr. Sunita Sarawagi Data Warehousing & Mining 5

Warehouse maintenance
Data refresh
when to refresh, what form to send updates?

Materialized view maintenance with batch updates. Query evaluation using materialized views Monitoring and reporting tools
HP intelligent warehouse advisor
Dr. Sunita Sarawagi Data Warehousing & Mining 6

Decision support tools


Direct Query Reporting tools
Crystal reports

OLAP
Essbase

Mining tools
Intelligent Miner

Merge Clean Summarize Detailed transactional data

Data warehouse

Relational DBMS+ e.g. Redbrick GIS data

Operational data
Data Warehousing & Mining

Bombay branch Delhi branch Oracle


Dr. Sunita Sarawagi

Calcutta branch IMS

Census data SAS


7

OLAP
Fast, interactive answers to large aggregate queries. Multidimensional model: dimensions with hierarchies Dim 1: Bank location:
branch-->city-->state

Dim 2: Customer:
sub profession --> profession

Dim 3: Time:
month --> quarter --> year

Measures: loan amount, #transactions, balance


Dr. Sunita Sarawagi Data Warehousing & Mining 8

OLAP
Navigational operators: Pivot, drill-down, roll-up, select. Hypothesis driven search: E.g. factors affecting defaulters
view defaulting rate on age aggregated over other dimensions for particular age segment detail along profession

Need interactive response to aggregate queries..


Dr. Sunita Sarawagi Data Warehousing & Mining 9

OLAP products
About 30 OLAP vendors Dominant ones:
Oracle Express: largest market share: 20% Arbor Essbase: technology leader Microsoft Plato: introduced late last year, rapidly taking over...

Dr. Sunita Sarawagi

Data Warehousing & Mining

10

Microsoft OLAP strategy


Plato: OLAP server: powerful, integrating various operational sources OLE-DB for OLAP: emerging industry standard based on MDX --> extension of SQL for OLAP Pivot-table services: integrate with Office 2000
Every desktop will have OLAP capability.

Client side caching and calculations Partitioned and virtual cube Hybrid relational and multidimensional storage
Dr. Sunita Sarawagi Data Warehousing & Mining 11

Data mining
Process of semi-automatically analyzing large databases to find interesting and useful patterns Overlaps with machine learning, statistics, artificial intelligence and databases but
more scalable in number of features and instances more automated to handle heterogeneous data
Dr. Sunita Sarawagi Data Warehousing & Mining 12

Some basic operations


Predictive:
Regression Classification

Descriptive:
Clustering / similarity matching Association rules and variants Deviation detection
Dr. Sunita Sarawagi Data Warehousing & Mining 13

Classification
Given old data about customers and payments, predict new applicants loan eligibility.
Previous customers Age Salary Profession Location Customer type
Dr. Sunita Sarawagi

Classifier

Decision rules
Salary > 5 L
Prof. = Exec

Good/ bad

New applicants data


Data Warehousing & Mining 14

Classification methods
Nearest neighbor Regression: (linear or any polynomial)
a*salary + b*age + c = eligibility score.

Decision tree classifier Probabilistic/generative models Neural networks


Dr. Sunita Sarawagi Data Warehousing & Mining 15

Clustering
Unsupervised learning when old data with class labels not available e.g. when introducing a new product. Group/cluster existing customers based on time series of payment history such that similar customers in same cluster. Key requirement: Need a good measure of similarity between instances. Identify micro-markets and develop policies for each
Dr. Sunita Sarawagi Data Warehousing & Mining 16

Association rules
Given set T of groups of items Example: set of item sets purchased Goal: find all rules on itemsets of the form a-->b such that
support of a and b > user threshold s conditional probability (confidence) of b given a > user threshold c

T Milk, cereal Tea, milk Tea, rice, bread

Example: Milk --> bread Purchase of product A --> service B


Dr. Sunita Sarawagi Data Warehousing & Mining

cereal

17

Mining market
Around 20 to 30 mining tool vendors Major players:
Clementine, IBMs Intelligent Miner, SGIs MineSet, SASs Enterprise Miner.

All pretty much the same set of tools Many embedded products: fraud detection, electronic
commerce applications
Dr. Sunita Sarawagi Data Warehousing & Mining 18

Conclusions
The value of warehousing and mining in effective decision making based on concrete evidence from old data Challenges of heterogeneity and scale in warehouse construction and maintenance Grades of data analysis tools: straight querying, reporting tools, multidimensional analysis and mining.
Dr. Sunita Sarawagi Data Warehousing & Mining 19

You might also like