Part 1 Data Warehousing and Data Mining
10BI23
Instructor: Dr S.Natarajan
Professor and Key Resource Person
Department of Information Science and Engineering
PES Institute of technology
Bangalore
DATA WAREHOUSING AND DATA MINING
10BI23
Text Books:
1. Fundamentals of Data Warehouses – M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis – Springer Verlag – 2003
2. The Data Warehouse Toolkit – Ralph Kimball – Wiley – 2002
3. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations – I. Witten and E. Frank – Morgan Kaufmann – 1999
4. Data Mining: Concepts and Techniques – J. Han and M. Kamber – Morgan Kaufmann – 2000
M Jarke
Ralph Kimball
Bill Inmon
What Happened in the Cloud?
• Stage 1: Laziness
– Operators grew tired of hanging tapes
• In response to requests for historical financial data
– They stored data on-line, in “unauthorized” mainframe databases
The Real Truth
• Data warehousing is a symptom of a problem
– Technological inability to deploy single-platform information systems that:
• Capture data once and reuse it throughout an enterprise
• Support high transaction rates (single-record CREATE, SELECT, UPDATE, DELETE) and analytic queries on the same computing platform, with the same data, at the same time
The “Ideal Library” Analogy
• Stores all of the books and other reference material you need to conduct your research
– The Enterprise data warehouse
• A single place to visit
– One database environment
• Contents are kept current and refreshed
– Timely, well-choreographed data loads
• Staffed with friendly, knowledgeable people who can help you find your way around
– Your Data Warehouse team
• Organized for easy navigation and use
– Metadata
– Data models
– “User friendly” naming conventions
• Solid architectural infrastructure
– Hardware, software, standards, metrics
Example of DW in Healthcare
[Figure: heterogeneous information sources, including the World Wide Web, scientific databases and digital libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Problem: Data Management in Large Enterprises
• Vertical fragmentation of informational systems (vertical stove pipes)
• Result of application (user)-driven development of operational systems
[Figure: separate stove-pipe systems, labelled Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory, ...]
[Figure: an integration system placed over heterogeneous sources: World Wide Web, digital libraries, scientific databases, personal databases]
Data warehouses
• Interfaces
– provide intuitive access to the data
– possibly change data format to meet user expectations
• Warehouse
– stores a consistent view of data in a local repository
• Mediator
– transforms data from source format to warehouse format
• Wrappers
– read data from source into internal representation
• Sources (from the figure): PDB, Swiss-Prot, SCoP, dbEST
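The layering above (wrapper, mediator, warehouse, interface) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` type, the function names, the "key,value" line format and the sample entries are all hypothetical and are not taken from any real PDB or Swiss-Prot interface.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical internal representation shared by all wrappers."""
    source: str
    key: str
    value: str


def wrap_line(source: str, line: str) -> Record:
    """Wrapper: read one 'key,value' line from a source into the internal form."""
    key, value = line.strip().split(",", 1)
    return Record(source, key, value)


def mediate(record: Record) -> dict:
    """Mediator: transform the internal form into the warehouse format."""
    return {
        "id": record.key.upper(),
        "description": record.value.strip(),
        "origin": record.source,
    }


class Warehouse:
    """Warehouse: keeps a consistent local copy, one row per id."""

    def __init__(self):
        self.rows = {}

    def load(self, row: dict) -> None:
        # The first row seen for an id wins, so duplicates do not pile up.
        self.rows.setdefault(row["id"], row)

    def lookup(self, entry_id: str):
        """Interface: simple access by id, whatever case the user types."""
        return self.rows.get(entry_id.upper())


# Made-up sample lines standing in for two sources.
incoming = [
    ("PDB", "1abc, toy structure entry"),
    ("Swiss-Prot", "p00001,toy sequence entry"),
    ("PDB", "1ABC,duplicate of the first entry"),
]

warehouse = Warehouse()
for source_name, raw_line in incoming:
    warehouse.load(mediate(wrap_line(source_name, raw_line)))

print(warehouse.lookup("1abc"))
```

Each layer only knows about the format directly below it, which is why a new source would need a new wrapper but no change to the warehouse or the interface.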
Metadata
Metadata is defined as data providing information about one or more aspects of the data, such as:
• Means of creation of the data
• Purpose of the data
• Time and date of creation
• Creator or author of data
• Placement on a computer network where the data was created
• Standards used
Examples: Libraries
• Metadata has been used in various forms as a means of cataloguing archived information
• The Dewey Decimal System employed by libraries for the classification of library materials is an early example of metadata usage. Library catalogues used 3x5 inch cards to display a book's title, author, subject matter, and a brief plot synopsis along with an abbreviated alpha-numeric identification system
Metadata (continued)
Photographs
• Metadata may be written into a digital photo file to identify who owns it, copyright and contact information, what camera created the file, exposure information, and descriptive information such as keywords about the photo, making the file searchable on the computer and/or the Internet
Video
• Metadata is particularly useful in video, where information about its contents (such as transcripts of conversations and text descriptions of its scenes) is not directly understandable by a computer, but where efficient search is desirable
Web pages
• Web pages often include metadata in the form of meta tags
• Description and keywords meta tags are commonly used to describe the Web page's content
• Most search engines use this data when adding pages to their search index
Metadata (continued)
• Data Warehouse Metadata
– All of the information in the data warehouse environment that is not the
actual data itself
• Operational Database Metadata Elements
– Tables (names, descriptions, definitions)
– Attributes (names, descriptions, definitions)
– Relationships
– Formulae
• Data Warehouse Metadata Elements
– Tables (names, descriptions, definitions)
– Attributes (names, descriptions, definitions)
– Relationships
– Formulae
– Transformations
– Synonyms
– Alias
– Source/target info
– Versions
– etc.
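As a rough illustration of the data warehouse metadata elements listed above, the sketch below stores one catalogue entry per warehouse attribute and resolves a business term through its synonyms. The class name, field names and the sample entry are assumptions made for this example only; real metadata repositories are considerably richer.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeMetadata:
    """One hypothetical catalogue entry describing a warehouse attribute."""
    table: str
    attribute: str
    description: str
    source: str           # source/target info: where the value comes from
    transformation: str   # what happens to the value on the way in
    synonyms: list = field(default_factory=list)
    version: int = 1


# Made-up catalogue content.
catalog = [
    AttributeMetadata(
        table="fact_sales",
        attribute="sales_amount",
        description="Invoiced amount per order line",
        source="orders.amt",
        transformation="cast from text to decimal",
        synonyms=["revenue", "turnover"],
    ),
]


def find_attribute(entries, term):
    """Return the entries whose attribute name or synonyms match the term."""
    term = term.lower()
    return [
        entry for entry in entries
        if entry.attribute.lower() == term
        or term in (synonym.lower() for synonym in entry.synonyms)
    ]


matches = find_attribute(catalog, "Revenue")
print(matches[0].table, matches[0].attribute, "<-", matches[0].source)
```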
The Traditional Research Approach
Query-driven (lazy, on-demand)
[Figure: clients at the top, a row of wrappers below them, and one source under each wrapper]
Disadvantages of Query-Driven Approach
• Delay in query processing
• Slow or unavailable information sources
• Complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry
Advantages of Warehousing Approach
• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, auditing
• Has caught on in industry
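The trade-off between the query-driven and the warehousing approach can be made concrete with a toy model. In the sketch below, every name (`Source`, `query_driven_total`, `WarehouseCopy`) and every number is made up for illustration: the lazy approach hits each source on every query, while the warehouse hits the sources only on refresh and may therefore return stale answers.

```python
class Source:
    """Toy information source that counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def query(self):
        self.hits += 1
        return list(self.rows)


def query_driven_total(sources):
    """Lazy, on-demand: every client query goes to every source."""
    return sum(amount for source in sources for amount in source.query())


class WarehouseCopy:
    """Eager: copy the data once, then answer queries from the local copy."""

    def __init__(self, sources):
        self.sources = sources
        self.rows = []
        self.refresh()

    def refresh(self):
        self.rows = [a for source in self.sources for a in source.query()]

    def total(self):
        return sum(self.rows)


# Query-driven: three client queries mean three hits per source.
lazy_sources = [Source([10, 20]), Source([5])]
lazy_answers = [query_driven_total(lazy_sources) for _ in range(3)]

# Warehousing: one load, then three queries that never touch the sources.
wh_sources = [Source([10, 20]), Source([5])]
warehouse = WarehouseCopy(wh_sources)
eager_answers = [warehouse.total() for _ in range(3)]

# A new row at a source is invisible to the warehouse until the next refresh.
wh_sources[0].rows.append(100)
stale_total = warehouse.total()
warehouse.refresh()
fresh_total = warehouse.total()

print(lazy_sources[0].hits, wh_sources[0].hits, stale_total, fresh_total)
```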
Central DW
[Figure: central architecture. Clients access a central data warehouse, which is loaded from the sources.]
Federated DW
[Figure: federated architecture. Local data marts (Financial, Marketing, Distribution) sit over the sources and together form a logical data warehouse.]
• Data stored in separate data marts, aimed at special departments
• The DW is only logical, i.e., “virtual”
• The data marts contain detail data
• Like Kimball’s DW Bus concept
• Pros
– Performance due to distribution
• Cons
– More complex
Tiered Architecture
[Figure: tiered architecture built around a central data warehouse]
• Central DW is materialized
• Data is distributed to data marts in one or more tiers
• Only aggregated data in cube tiers
• Data is aggregated/reduced as it moves through tiers
• Pros
– Best performance due to redundancy + distribution
• Cons
– Most complex
– Hard to manage
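To illustrate how data is aggregated and reduced as it moves through the tiers, here is a minimal sketch with made-up detail rows. The tier names follow the slide; the helper `aggregate` and the row layout (region, month, product, amount) are assumptions for this example.

```python
from collections import defaultdict

# Central DW: detail rows of (region, month, product, amount). Values are made up.
central_dw = [
    ("north", "2005-01", "A", 10.0),
    ("north", "2005-01", "B", 5.0),
    ("north", "2005-02", "A", 7.0),
    ("south", "2005-01", "A", 3.0),
]


def aggregate(rows, key_len):
    """Sum the last field of each row, grouped by its first key_len fields."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[:key_len]] += row[-1]
    return [key + (amount,) for key, amount in sorted(totals.items())]


# Tier 1 data mart: product detail is dropped, leaving (region, month, amount).
data_mart = aggregate(central_dw, 2)

# Cube tier: only aggregated data remains, here (region, amount).
cube_tier = aggregate(data_mart, 1)

print(len(central_dw), len(data_mart), len(cube_tier))
print(cube_tier)
```

Each tier holds fewer rows than the one before it, which is where the performance gain comes from, and each tier is a redundant copy that must be kept in step with the central DW, which is where the management cost comes from.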
Traditional data warehouse architecture
[Figure: operational DBs and other sources are extracted, transformed, loaded and refreshed (monitor and integrator, with metadata) into the data warehouse and data marts; an OLAP server serves analysis, query/reports and data mining.]
[Figure: dependent data marts. A single ETL process loads one company-wide enterprise data warehouse (EDW); dependent data marts are loaded from the EDW, which gives simpler data access.]
Data warehouse
49 Hend MADHOUR
EPFL 2005
Logical data mart and @ctive data warehouse
• ODS and data warehouse are one and the same
• Near real-time ETL for the @ctive data warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse
– Easier to create new data marts
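A logical data mart can be sketched as a SQL view over the warehouse table. The example below uses Python's built-in `sqlite3` module with an in-memory database; the table name, view name and rows are made up for illustration. Because the view stores no data of its own, a new warehouse row is visible through the mart immediately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (dept TEXT, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('marketing', 'north', 10),
        ('finance',   'north', 20),
        ('marketing', 'south', 5);

    -- The "data mart" is only a view: no separate database, no copy of the data.
    CREATE VIEW marketing_mart AS
        SELECT region, amount FROM sales WHERE dept = 'marketing';
    """
)

before = conn.execute("SELECT SUM(amount) FROM marketing_mart").fetchone()[0]

# A row loaded into the warehouse shows up in the logical mart at once.
conn.execute("INSERT INTO sales VALUES ('marketing', 'north', 7)")
after = conn.execute("SELECT SUM(amount) FROM marketing_mart").fetchone()[0]

print(before, after)
```

Creating another mart is one more `CREATE VIEW` statement, which is the sense in which new data marts are easier to create.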
[Figure labels: data sources, data extraction, data cleaning, data integration, data archival (logs, deltas and histories), history data, customization, data marts]
The ETL Process
• Capture
• Scrub or data cleansing
• Transform
• Load and Index
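The four steps above can be sketched end to end as follows. This is a minimal illustration, not a production ETL tool: the source rows, the cleansing rules (drop rows with a non-numeric amount, drop repeated ids) and the target table `fact_sales` are all assumptions made for the example.

```python
import sqlite3

# Made-up operational extract with one bad value and one duplicate.
source_rows = [
    {"id": "1", "customer": " alice ", "amount": "10.50"},
    {"id": "2", "customer": "BOB", "amount": "n/a"},
    {"id": "1", "customer": " alice ", "amount": "10.50"},
    {"id": "3", "customer": "carol", "amount": "7"},
]


def capture(rows):
    """Capture: take a snapshot of the source data."""
    return list(rows)


def scrub(rows):
    """Scrub: drop rows with unparseable amounts and repeated ids."""
    seen, clean = set(), []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        clean.append(row)
    return clean


def transform(rows):
    """Transform: convert to the warehouse types and naming style."""
    return [
        (int(row["id"]), row["customer"].strip().title(), float(row["amount"]))
        for row in rows
    ]


def load_and_index(conn, rows):
    """Load and index: insert the rows, then build an index for queries."""
    conn.execute(
        "CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_fact_sales_customer ON fact_sales (customer)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load_and_index(conn, transform(scrub(capture(source_rows))))
loaded = conn.execute(
    "SELECT id, customer, amount FROM fact_sales ORDER BY id"
).fetchall()
print(loaded)
```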
Steps in data reconciliation (src: J. Hoffer, M. Prescott, F. McFadden)
Record-level:
• Selection: data partitioning
• Joining: data combining
• Aggregation: data summarization
Field-level:
• Single-field: from one field to one field
• Multi-field: from many fields to one, or one field to many
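The record-level and field-level functions listed above can each be shown in a line or two of Python. The order and customer records, the region code table and all field names below are made up for illustration.

```python
# Made-up source records.
orders = [
    {"order_id": 1, "cust_id": "c1", "amount": 10.0, "region": "N"},
    {"order_id": 2, "cust_id": "c2", "amount": 4.0, "region": "S"},
    {"order_id": 3, "cust_id": "c1", "amount": 6.0, "region": "N"},
]
customers = {
    "c1": {"first": "Ann", "last": "Lee"},
    "c2": {"first": "Bo", "last": "Kim"},
}

# Record-level functions
# Selection (data partitioning): keep only the northern orders.
selected = [order for order in orders if order["region"] == "N"]

# Joining (data combining): attach customer fields to each selected order.
joined = [{**order, **customers[order["cust_id"]]} for order in selected]

# Aggregation (data summarization): total amount per customer.
total_by_customer = {}
for row in joined:
    total_by_customer[row["cust_id"]] = (
        total_by_customer.get(row["cust_id"], 0.0) + row["amount"]
    )

# Field-level functions
# Single-field: one source field becomes one target field (code to name).
REGION_NAMES = {"N": "North", "S": "South"}
region_names = [REGION_NAMES[order["region"]] for order in orders]

# Multi-field, many to one: first and last name become one full name.
full_names = [f"{row['first']} {row['last']}" for row in joined]

# Multi-field, one to many: a full name is split back into two fields.
split_names = [dict(zip(("first", "last"), name.split(" ", 1))) for name in full_names]

print(total_by_customer, region_names, full_names[0], split_names[0])
```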
Steps in data reconciliation (continued)
Architecture is the proper arrangement of the components.
[Figure: data warehouse components: source data (production, internal, archived, external), management & control, metadata, information delivery, data mining]
[Figure: loading the data warehouse: base data load, then daily, monthly, quarterly and yearly refreshes]
Data Warehouse: A Multi-Tiered Architecture
[Figure: operational DBs and other sources are extracted, transformed, loaded and refreshed (monitor and integrator, with metadata) into the data warehouse and data marts; an OLAP server serves analysis, query/reports and data mining.]
Characteristics of a Data Warehouse
At the moment the data are extracted for the data warehouse, they are
accurate, but moments later they will no longer reflect the true state of
the business.
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
Why OLAP & OLTP don’t mix (1)
Different performance requirements
• Transaction processing (OLTP):
– Fast response time important (< 1 second)
– Data must be up-to-date, consistent at all times
• Data analysis (OLAP):
– Queries can consume lots of resources
– Can saturate CPUs and disk bandwidth
– Operating on static “snapshot” of data usually OK
• OLAP can “crowd out” OLTP transactions
– Transactions are slow → unhappy users
• Example:
– Analysis query asks for sum of all sales
– Acquires lock on sales table for consistency
– New sales transaction is blocked
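The locking example above can be imitated with a plain Python lock standing in for a table-level lock. This is a deliberately simplified sketch: a real DBMS would usually make the transaction wait rather than fail, and many systems avoid the conflict with finer-grained or multi-version concurrency control. The names `sales_table_lock` and `try_insert_sale` and the sample amounts are assumptions for the example.

```python
import threading

sales_table_lock = threading.Lock()  # stands in for a lock on the sales table
sales = [10.0, 20.0, 5.0]            # made-up sales amounts


def try_insert_sale(amount):
    """OLTP transaction: insert one sale, giving up at once if the table is locked."""
    if not sales_table_lock.acquire(blocking=False):
        return False
    try:
        sales.append(amount)
        return True
    finally:
        sales_table_lock.release()


# The analysis query holds the lock for the whole scan so the sum is consistent.
with sales_table_lock:
    total = sum(sales)
    blocked_insert = try_insert_sale(7.0)  # arrives while the scan is running

# Once the analysis query has finished, the same transaction goes through.
later_insert = try_insert_sale(7.0)

print(total, blocked_insert, later_insert)
```

The longer the analysis query runs, the longer short transactions are held up, which is the "crowding out" effect described above.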
Why OLAP & OLTP don’t mix (2)
Different data modeling requirements
Source: adapted from Strange (1997).