Session VII (Part 1), 15:45 - 16:10
Sunita Sarawagi
School of IT, IIT Bombay
Introduction
Organizations are getting larger and amassing ever-increasing amounts of data
Historic data encodes useful information about the working of an organization
However, data is scattered across multiple sources, in multiple formats
Data warehousing: the process of consolidating data in a centralized location
Data mining: the process of analyzing data to find useful patterns and relationships
Dr. Sunita Sarawagi Data Warehousing & Mining 2
[Figure: Data Warehousing & Mining overview. Labels: Operational data; Data warehouse; OLAP (Essbase); Mining tools (Intelligent Miner)]
Data cleaning:
  missing data, outliers, clean fields e.g. names/addresses
  Data mining techniques
Data loading: summarize, create indices
Products: Prism warehouse manager, Platinum info refiner, info pump, QDB, Vality
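The cleaning step above can be sketched in a few lines. This is a minimal illustration, not any listed product's method: the record fields, the median imputation for missing values, and the cutoff of 10 median absolute deviations for outliers are all assumptions made for the example.

```python
from statistics import median

# Hypothetical customer records; None marks a missing salary.
records = [
    {"name": "  anita  DESAI ", "salary": 40000},
    {"name": "Ravi Kumar", "salary": None},
    {"name": "ravi  kumar", "salary": 42000},
    {"name": "S. Mehta", "salary": 9000000},  # suspiciously large value
    {"name": "P. Rao", "salary": 38000},
]

def clean_name(name):
    """Normalize whitespace and capitalization in a name field."""
    return " ".join(name.split()).title()

known = [r["salary"] for r in records if r["salary"] is not None]
med = median(known)
# Median absolute deviation: a robust spread estimate for outlier flagging.
mad = median(abs(s - med) for s in known)

cleaned = []
for r in records:
    # Assumed policy: replace a missing salary with the median of known ones.
    salary = r["salary"] if r["salary"] is not None else med
    # Assumed cutoff: more than 10 MADs away from the median is an outlier.
    is_outlier = mad > 0 and abs(salary - med) > 10 * mad
    cleaned.append(
        {"name": clean_name(r["name"]), "salary": salary, "outlier": is_outlier}
    )
```

After normalization the two spellings of "Ravi Kumar" become identical, which is what makes later duplicate detection on names/addresses possible.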
Warehouse maintenance
Data refresh
  when to refresh, in what form to send updates?
Materialized view maintenance with batch updates.
Query evaluation using materialized views
Monitoring and reporting tools
  HP intelligent warehouse advisor
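As a rough illustration of materialized view maintenance with batch updates, the sketch below keeps a per-branch SUM view and refreshes it from a batch of inserted and deleted rows instead of recomputing it from the base table. The schema, branch names, and function names are hypothetical.

```python
from collections import defaultdict

# Materialized view: total deposit amount per branch (hypothetical schema).
view = defaultdict(float)

def build_view(base_rows):
    """Full recomputation from the base table, done once at load time."""
    view.clear()
    for branch, amount in base_rows:
        view[branch] += amount

def apply_batch(inserted, deleted):
    """Incrementally refresh the view from a batch of changes (deltas)."""
    for branch, amount in inserted:
        view[branch] += amount
    for branch, amount in deleted:
        view[branch] -= amount

base = [("Powai", 100.0), ("Powai", 50.0), ("Andheri", 70.0)]
build_view(base)
apply_batch(
    inserted=[("Andheri", 30.0), ("Dadar", 20.0)],
    deleted=[("Powai", 50.0)],
)
```

SUM can be maintained from the deltas alone. An aggregate such as MIN or MAX cannot in general, because deleting the current extreme value forces a look back at the base data.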
OLAP
Fast, interactive answers to large aggregate queries.
Multidimensional model: dimensions with hierarchies
  Dim 1: Bank location: branch --> city --> state
  Dim 2: Customer: sub profession --> profession
  Dim 3: Time: month --> quarter --> year
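To make the dimension hierarchies concrete, here is a small sketch of a roll-up from (branch, month) to (city, quarter). The fact rows, the branch-to-city table, and the function names are assumptions for illustration; an OLAP server would precompute or index such aggregates rather than scan the rows each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (branch, month, loan amount).
facts = [
    ("Powai", "1999-01", 10),
    ("Powai", "1999-02", 20),
    ("Andheri", "1999-04", 5),
    ("Pune Camp", "1999-05", 7),
]

# Assumed hierarchy table for the location dimension (branch --> city).
branch_to_city = {"Powai": "Mumbai", "Andheri": "Mumbai", "Pune Camp": "Pune"}

def month_to_quarter(month):
    """Map 'YYYY-MM' to 'YYYY-Qn' (time hierarchy: month --> quarter)."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up(rows, loc_level, time_level):
    """Aggregate the measure after mapping each dimension to a coarser level."""
    totals = defaultdict(int)
    for branch, month, amount in rows:
        totals[(loc_level(branch), time_level(month))] += amount
    return dict(totals)

by_city_quarter = roll_up(facts, branch_to_city.get, month_to_quarter)
```

Drill-down is the inverse navigation: the same function called with finer-grained mappings (for example the identity on branch and month).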
OLAP
Navigational operators: pivot, drill-down, roll-up, select.
Hypothesis-driven search, e.g. factors affecting defaulters:
  view defaulting rate on age, aggregated over other dimensions
  for a particular age segment, detail along profession
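The two-step search on defaulters can be sketched as follows: first the defaulting rate by age segment with all other dimensions aggregated away, then a select on one segment followed by a drill-down along profession. The loan records and segment labels are made up for the example.

```python
from collections import defaultdict

# Hypothetical loan records: (age segment, profession, defaulted?).
loans = [
    ("20-30", "Exec", False), ("20-30", "Clerk", True),
    ("20-30", "Clerk", True), ("20-30", "Exec", True),
    ("30-40", "Exec", False), ("30-40", "Clerk", False),
    ("30-40", "Exec", True), ("30-40", "Clerk", False),
]

def default_rate(rows, key):
    """Defaulting rate per group; dimensions not in the key are aggregated away."""
    counts = defaultdict(lambda: [0, 0])  # group -> [defaults, total]
    for row in rows:
        group = key(row)
        counts[group][0] += row[2]  # True counts as 1, False as 0
        counts[group][1] += 1
    return {g: d / n for g, (d, n) in counts.items()}

# Step 1: rate by age, aggregated over all other dimensions.
by_age = default_rate(loans, key=lambda r: r[0])
# Step 2: select the worst age segment and drill down along profession.
worst = max(by_age, key=by_age.get)
detail = default_rate([r for r in loans if r[0] == worst], key=lambda r: r[1])
```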
OLAP products
About 30 OLAP vendors
Dominant ones:
  Oracle Express: largest market share: 20%
  Arbor Essbase: technology leader
  Microsoft Plato: introduced late last year, rapidly taking over...
Client side caching and calculations
Partitioned and virtual cube
Hybrid relational and multidimensional storage
Data mining
Process of semi-automatically analyzing large databases to find interesting and useful patterns
Overlaps with machine learning, statistics, artificial intelligence and databases, but
  more scalable in number of features and instances
  more automated to handle heterogeneous data
Descriptive:
  Clustering / similarity matching
  Association rules and variants
  Deviation detection
Classification
Given old data about customers and payments, predict new applicants' loan eligibility.
[Figure: Previous customers (Age, Salary, Profession, Location, Customer type) --> Classifier --> Decision rules (Salary > 5 L, Prof. = Exec) --> Good/bad]
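A toy version of decision rules applied to new applicants is sketched below. The two tests come from the slide, but how they combine (either one is enough for "good") and the reading of "L" as lakh are assumptions; a real classifier would learn the rules and their order from the previous customers' data.

```python
LAKH = 100_000  # assumption: "L" on the slide means lakh (1 L = 100,000)

def classify(applicant):
    """Label an applicant with two hand-written rules (combination assumed)."""
    if applicant["salary"] > 5 * LAKH:
        return "good"
    if applicant["profession"] == "Exec":
        return "good"
    return "bad"

# Hypothetical new applicants.
applicants = [
    {"salary": 6 * LAKH, "profession": "Clerk"},
    {"salary": 3 * LAKH, "profession": "Exec"},
    {"salary": 3 * LAKH, "profession": "Clerk"},
]
labels = [classify(a) for a in applicants]
```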
Classification methods
Nearest neighbor
Regression: (linear or any polynomial)
  a*salary + b*age + c = eligibility score.
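Both methods fit in a short sketch. The labelled history, the Euclidean distance, and the coefficients a, b, c are assumptions chosen for illustration; in practice the coefficients would be fitted to old data and the features scaled before distances are computed.

```python
import math

# Hypothetical labelled history: (salary in lakhs, age) -> customer type.
history = [
    ((8.0, 45), "good"),
    ((6.5, 38), "good"),
    ((2.0, 23), "bad"),
    ((1.5, 30), "bad"),
]

def nearest_neighbor(applicant):
    """Return the label of the closest previous customer (Euclidean distance)."""
    _, label = min(history, key=lambda item: math.dist(item[0], applicant))
    return label

def eligibility_score(salary, age, a=1.0, b=0.1, c=-5.0):
    """Linear form from the slide: a*salary + b*age + c (coefficients assumed)."""
    return a * salary + b * age + c

label = nearest_neighbor((7.0, 40))
score = eligibility_score(7.0, 40)
```

Nearest neighbor needs no training step but must keep the whole history at hand; the regression form compresses the history into three numbers.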
Clustering
Unsupervised learning: used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.
Key requirement: need a good measure of similarity between instances.
Identify micro-markets and develop policies for each
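The role of the similarity measure can be seen in a small sketch. It uses Euclidean distance between payment-history series and a single-pass "leader" clustering, a simple technique swapped in for illustration; the customer ids, series, and threshold are hypothetical.

```python
import math

# Hypothetical payment histories: days late on each of the last four installments.
payments = {
    "c1": [0, 0, 1, 0],
    "c2": [0, 1, 0, 0],
    "c3": [30, 28, 35, 40],
    "c4": [29, 31, 33, 38],
    "c5": [10, 0, 12, 0],
}

def distance(x, y):
    """Euclidean distance between equal-length series (smaller = more similar)."""
    return math.dist(x, y)

def leader_clustering(series, threshold):
    """Single pass: join the first cluster whose leader is close enough, else start one."""
    clusters = []  # list of (leader series, member ids)
    for cid, s in series.items():
        for leader, members in clusters:
            if distance(leader, s) <= threshold:
                members.append(cid)
                break
        else:
            # No existing leader was within the threshold: start a new cluster.
            clusters.append((s, [cid]))
    return [members for _, members in clusters]

groups = leader_clustering(payments, threshold=5.0)
```

Changing the distance function or the threshold changes the groups, which is why the choice of similarity measure is the key requirement.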
Association rules
Given a set T of groups of items
  Example: set of item sets purchased
Goal: find all rules on itemsets of the form a --> b such that
  support of a and b > user threshold s
  conditional probability (confidence) of b given a > user threshold c
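The two thresholds can be checked directly on a toy set T. The sketch below enumerates only rules with a single item on each side by brute force, and the transactions are hypothetical; practical miners such as Apriori prune the search and handle larger itemsets.

```python
from itertools import permutations

# Hypothetical transactions: each is the set of items bought together.
T = [
    {"milk", "cereal"},
    {"milk", "cereal", "bread"},
    {"milk", "bread"},
    {"bread"},
]

def support(items):
    """Fraction of groups in T that contain every item in `items`."""
    return sum(items <= t for t in T) / len(T)

def rules(s, c):
    """All single-item rules a --> b with support > s and confidence > c."""
    universe = sorted(set().union(*T))
    found = []
    for a, b in permutations(universe, 2):
        sup = support({a, b})
        conf = sup / support({a})  # estimate of P(b | a)
        if sup > s and conf > c:
            found.append((a, b, sup, conf))
    return found

result = rules(s=0.4, c=0.7)
```

Here cereal --> milk passes both thresholds, while milk --> cereal has the same support but a confidence of only 2/3, so the direction of a rule matters.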
Mining market
Around 20 to 30 mining tool vendors
Major players:
  Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner.
All pretty much the same set of tools
Many embedded products: fraud detection, electronic commerce applications
Conclusions
The value of warehousing and mining in effective decision making based on concrete evidence from old data
Challenges of heterogeneity and scale in warehouse construction and maintenance
Grades of data analysis tools: straight querying, reporting tools, multidimensional analysis and mining.