Data Warehousing & Data Mining: Code: ITM 402 Dr. Dinesh Gomber
What is a Data
Warehouse?
Inmon's definition
A data warehouse is a
-subject-oriented (high level),
-integrated,
-time-variant,
-nonvolatile
collection of data in support of management's
decision-making process.
Subject-oriented
Data warehouse is organized around
subjects such as
sales,product,customer.
It focuses on modeling and analysis of
data for decision makers.
Excludes (deletes) data that is not useful
in the decision support process.
Integration
Data Warehouse is constructed by
integrating multiple heterogeneous
sources.
Data preprocessing is applied to
ensure consistency in:
naming conventions,
measurement of attributes,
physical attributes of data,
remarks.
[Figure: data from an RDBMS and a legacy system is loaded into the data warehouse, which users then access.]
How does it differ from a
Database?
OLTP vs. OLAP
OLTP: On Line Transaction Processing
Describes processing at operational sites.
Warehouse is a Specialized DB
Standard DB (OLTP)         Warehouse (OLAP)
Mostly updates             Mostly reads
Many small transactions    Queries are long and complex
MB - GB of data            GB - TB of data
Current snapshot           History
Index/hash on p.k.         Lots of scans
Raw data                   Summarized, reconciled data
Thousands of users         Hundreds of users (e.g.,
(e.g., clerical users)     decision-makers, analysts)
Operational v/s Information
System
Features                    Operational                         Information
Characteristics             Operational processing              Informational processing
Orientation                 Transaction                         Analysis
User                        Clerk, DBA, database professional   Knowledge workers
Function                    Day-to-day operation                Decision support
Data                        Current                             Historical
View                        Detailed, flat relational           Summarized, multidimensional
DB design                   Application oriented                Subject oriented
Unit of work                Short, simple transaction           Complex query
Access                      Read/write                          Mostly read
Focus                       Data in                             Information out
Number of records accessed  Tens                                Millions
Schema: plan
Describes structure of database
Names and sizes of fields
Identifies primary keys
Data dictionary: repository of
information about data
The Schema and Metadata
(continued)
Metadata: data about data
Source of data
Tables related to data
Field information
Usage of data
Population rules
Data Warehouse
Fundamentals
Extraction, transformation, and loading
(ETL) a process that extracts information
from internal and external databases,
transforms the information using a common
set of enterprise definitions, and loads the
information into a data warehouse
[Figure: branch databases in Delhi, Chennai, and Bangalore; the sales manager asks for sales per item type per branch for the first quarter.]
ETL (Extract, Transform,
Load)
Improve the quality of data before
loading it into the warehouse.
Perform data cleaning and
transformation before loading the
data.
Then load it into the data warehouse.
Use query analysis tools to support
ad hoc queries.
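The extract, transform, and load steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the branch rows, field names (`item`, `ITEM_TYPE`, `amount`, `AMT`), and the in-memory `warehouse` list are all hypothetical stand-ins for real source systems and a real warehouse.

```python
# Hypothetical extracts: each branch names and formats its fields differently.
delhi_rows = [{"item": "TV ", "amount": "1200.50"}]
chennai_rows = [{"ITEM_TYPE": "tv", "AMT": 900}]


def transform(row, branch):
    """Map one source row onto a common set of field definitions."""
    item = row.get("item", row.get("ITEM_TYPE"))
    amount = row.get("amount", row.get("AMT"))
    return {
        "branch": branch,
        "item_type": item.strip().lower(),  # cleaning: trim and normalize case
        "amount": float(amount),            # cleaning: one numeric type
    }


def etl(sources):
    """Extract rows per branch, transform each one, and load it into a list."""
    warehouse = []
    for branch, rows in sources.items():               # extract
        for row in rows:
            warehouse.append(transform(row, branch))   # transform, then load
    return warehouse


warehouse = etl({"Delhi": delhi_rows, "Chennai": chennai_rows})

# An ad hoc query on the loaded data: sales per item type per branch.
sales_by_branch_item = {}
for rec in warehouse:
    key = (rec["branch"], rec["item_type"])
    sales_by_branch_item[key] = sales_by_branch_item.get(key, 0.0) + rec["amount"]
```

Because both branches end up with the same `item_type` spelling and a numeric `amount`, the final grouping works without any branch-specific logic.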
Solution 1: ABC Pvt Ltd.
[Figure: data from Delhi, Chennai, and Bangalore is loaded into a data warehouse; query and analysis tools produce a report for the sales manager.]
Data Warehousing
Architecture
[Figure: architecture diagram with monitoring and administration, a metadata repository, OLAP servers, data marts, and data mining tools.]
Data Warehouse Architecture
Data Warehouse server
almost always a relational
DBMS, rarely flat files
OLAP servers
to support and operate on
multidimensional data structures
Clients
Query and reporting tools
Analysis tools
Data mining tools
OLTP
OLTP (On-line Transaction Processing)
Operational data
To control and run fundamental business
tasks
Transactions: INSERT, UPDATE, DELETE.
Detailed and current data
Relatively standardized and simple queries
Highly normalized (3NF) with many tables
OLAP
OLAP (On-line Analytical Processing)
Low volume of transactions.
To help with planning, problem solving,
and decision support
Often complex queries involving
aggregations
Typically de-normalized with fewer tables;
use of star and/or snowflake schemas.
Historical data, stored in multidimensional
schemas (usually star
schema).
OLTP (On-line Transaction
Processing) vs. OLAP (On-line
Analytical Processing)
We can divide IT systems into
transactional (OLTP) and analytical
(OLAP). In general we can assume
that OLTP systems provide source
data to data warehouses, whereas
OLAP systems help to analyze it.
Data Mart
Introduction
OLTP vs. OLAP
OLTP is highly normalized with many
tables (RDBMS).
OLAP is typically de-normalized with fewer
tables (use of star and/or snowflake
schemas).
Difference between OLTP and
OLAP
OLTP (On-line Transaction Processing) is characterized by
a large number of short on-line transactions (INSERT, UPDATE,
DELETE). The main emphasis for OLTP systems is on very
fast query processing, maintaining data integrity in multi-access
environments, and effectiveness measured by the
number of transactions per second. An OLTP database holds
detailed and current data, and the schema used to store
transactional data is the entity model (usually
3NF).
Multi-Dimensional OLAP
Servers
Roll-up: aggregation of data, from simple
roll-ups to complex expressions involving
interrelated data; for example, monthly data to
quarterly data.
Slicing
Multi-Dimensional OLAP
servers
Can store data in a compressed form by
dynamically selecting physical storage
organizations and compression techniques
that maximize space utilization.
ON-LINE ANALYTICAL
PROCESSING
Demand for OLAP
To develop DM, three approaches
In all approaches, Data Marts
rest on Dimensional Model
Data Marts are sufficient for
basic data analysis
Users need to go beyond such
basic analysis
Demand for OLAP
Traditional tools of report writers,
query products, spreadsheets, &
language interfaces do not match
the user expectations as far as
performing multidimensional
analysis with complex calculations
is concerned.
Tools used with OLTP and basic DW
environments do not match up to
the task
OLAP is the Answer!
OLAP is a category of software technology
that enables analysts, managers, and
executives to gain insight into the data
through fast, consistent, interactive, access in
a wide variety of possible views of
information that has been transformed from
raw data to reflect the real dimensionality of
the enterprise as understood by the user.
Why is OLAP useful?
Facilitates multidimensional data
analysis by pre-computing aggregates
across many sets of dimensions
Provides for:
Greater speed and responsiveness
Improved user interactivity
Data Warehouses
A data warehouse is based on a
multidimensional data model which
views data in the form of a data cube
A data cube allows data to be modeled
and viewed in multiple dimensions
CUBE
Fact table view:
sale    prodId  storeId  date  amt
        p1      c1       1     12
        p2      c1       1     11
        p1      c3       1     50
        p2      c2       1      8
        p1      c1       2     44
        p1      c2       2      4
Multi-dimensional cube:
day 1   c1   c2   c3
p1      12        50
p2      11    8
day 2   c1   c2   c3
p1      44    4
p2
dimensions = 3
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
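The two queries above can be run against the six fact rows from the CUBE slide. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the column types and the `ORDER BY` clause are additions made here so the result order is deterministic.

```python
import sqlite3

# The six fact rows from the CUBE slide: (prodId, storeId, date, amt).
rows = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", rows)

# Add up amounts for day 1.
total_day1 = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day (ORDER BY added here for a stable result order).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(total_day1)  # 81
print(by_day)      # [(1, 81), (2, 48)]
```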
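Cube Aggregation
Amounts summed over both days, with row, column, and grand totals:
        c1   c2   c3   sum
p1      56    4   50   110
p2      11    8         19
sum     67   12   50   129
rollup: move from the detailed cells toward the sums
drill-down: move from the sums back to the detailed cells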
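The totals on this slide can be reproduced from the six fact rows of the CUBE slide. The helper below is a minimal sketch; `roll_up` and its `keep` parameter are names chosen here for illustration, not part of any OLAP product.

```python
from collections import defaultdict

# Fact rows from the CUBE slide: (prodId, storeId, date, amt).
sales = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def roll_up(rows, keep):
    """Sum amt, keeping only the dimensions whose column positions are in keep."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in keep)
        totals[key] += row[3]
    return dict(totals)


by_prod_store = roll_up(sales, keep=(0, 1))  # date rolled up
by_store = roll_up(sales, keep=(1,))         # date and product rolled up
by_prod = roll_up(sales, keep=(0,))          # date and store rolled up
grand_total = roll_up(sales, keep=())        # everything rolled up

print(by_store)     # {('c1',): 67, ('c3',): 50, ('c2',): 12}
print(by_prod)      # {('p1',): 110, ('p2',): 19}
print(grand_total)  # {(): 129}
```

Drill-down is the reverse direction: going from `by_prod` back to `by_prod_store` requires the more detailed data, which is why it cannot be computed from the sums alone.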
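Aggregation Using
Hierarchies
Hierarchy: customer -> region -> country
Cube at the customer level:
day 1   c1   c2   c3
p1      12        50
p2      11    8
day 2   c1   c2   c3
p1      44    4
p2
Aggregated to the region level (summed over both days):
        region A   region B
p1      56         54
p2      11          8
(customer c1 in Region A;
customers c2, c3 in Region B)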
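Rolling up along the customer hierarchy amounts to replacing each customer with its region before summing. The sketch below assumes the mapping stated on the slide (c1 in Region A; c2 and c3 in Region B); the dictionary name `region_of` is chosen here for illustration.

```python
# Fact rows from the CUBE slide: (prodId, customer/store, date, amt).
sales = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# One level of the hierarchy: customer -> region.
region_of = {"c1": "A", "c2": "B", "c3": "B"}

by_prod_region = {}
for prod, cust, _day, amt in sales:
    key = (prod, region_of[cust])
    by_prod_region[key] = by_prod_region.get(key, 0) + amt

print(by_prod_region)
# {('p1', 'A'): 56, ('p2', 'A'): 11, ('p1', 'B'): 54, ('p2', 'B'): 8}
```

A further roll-up to the country level would apply a second mapping (region to country) in the same way.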
Pivoting
Fact table view:
sale    prodId  storeId  date  amt
        p1      c1       1     12
        p2      c1       1     11
        p1      c3       1     50
        p2      c2       1      8
        p1      c1       2     44
        p1      c2       2      4
Multi-dimensional cube:
day 1   c1   c2   c3
p1      12        50
p2      11    8
day 2   c1   c2   c3
p1      44    4
p2
Resulting 2-D view (amounts summed over date):
        c1   c2   c3
p1      56    4   50
p2      11    8
OLAP Operations
Roll-Up
Drill-Down
Slice & Dice
Pivot
[Figures: slicing, dicing (sub-cube), roll-up, and drill-down, each illustrated on the cube.]
Other OLAP
Operations
o Moving Averages
o Growth Rates
o Depreciation
o Currency Conversion
o Statistical Functions
o Top N or Bottom N queries
79
Conceptual vs. Actual
The cube is a logical way of
visualizing the data in an OLAP setting
Not how the data is actually
represented on disk
Two ways of storing data:
ROLAP: Relational OLAP
MOLAP: Multidimensional OLAP
80
OLAP & CUBE
Construction of the data cube
is key to the operation of OLAP
The computation process
creates a set of aggregates on
the various dimensions of the
data
The CUBE operator
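What a CUBE-style computation produces can be sketched by hand: one aggregate for every subset of the dimensions. The code below is an illustrative Python version over the six fact rows from the CUBE slide, not the implementation of any database's CUBE operator; the `"ALL"` marker for a rolled-up dimension is a convention chosen here.

```python
from itertools import combinations

# Fact rows from the CUBE slide: (prodId, storeId, date, amt).
sales = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]
N_DIMS = 3  # prodId, storeId, date


def cube(rows):
    """Aggregate amt over every subset of the dimensions."""
    result = {}
    for k in range(N_DIMS + 1):
        for kept in combinations(range(N_DIMS), k):
            for row in rows:
                # Dimensions not in `kept` are rolled up and marked "ALL".
                key = tuple(row[i] if i in kept else "ALL" for i in range(N_DIMS))
                result[key] = result.get(key, 0) + row[3]
    return result


cube_result = cube(sales)
print(cube_result[("ALL", "ALL", "ALL")])  # 129
print(cube_result[("p1", "ALL", "ALL")])   # 110
print(cube_result[("ALL", "c1", 1)])       # 23
```

With three dimensions there are 2^3 = 8 subsets, including the empty one that yields the grand total.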
Approaches to OLAP
Servers
Three possibilities for OLAP servers
(1) Relational OLAP (ROLAP)
Relational and specialized relational DBMS to
store and manage warehouse data
OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
Array-based storage structures
Direct access to array data structures
(3) Hybrid OLAP (HOLAP)
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools
ROLAP
Special schema design: star, snowflake
Products
IBM DB2, Oracle, Sybase IQ,
RedBrick, Informix
ROLAP
Defines complex, multi-dimensional data with
simple model
Reduces the number of joins a query has to
process
Allows the data warehouse to evolve with
relatively low maintenance
Can contain both detailed and summarized data.
ROLAP is based on familiar, proven, and already
selected technologies.
BUT!!!
SQL is awkward for multi-dimensional manipulation of
calculations.
MOLAP
OLAP Needs
User Needs
Multidimensional view
Excellent Performance
Analytical Flexibility
Real-Time Data Access
High Data Capacity
OLAP Needs: User Needs
Excellent Performance
For example, suppose you have a Sales indicator
with six dimensions: Representatives, Products,
Customers, Regions, Months, and Years.
MOLAP tools will store a given aggregate, such as
the November 1997 government sales of product
A504 by representative 1040 in New York, in 1
cell of the MDDB.
In contrast, ROLAP tools consume 600% more
space, because they require a record of seven
values (six foreign keys and the actual
aggregate) in a relational summary table.
OLAP Needs: User Needs
Excellent Performance
RDBMSs must use several summary tables to store the aggregates
that a MOLAP could store in just one cube. For example, consider a Sales
indicator with three dimensions: Months, Regions, and Products. The
indicator cube will contain seven sets of aggregates:
Sales by month
Sales by product
Sales by region
Sales by month and product
Sales by month and region
Sales by product and region
Sales by product, month, and region
To store these aggregates in an RDBMS, you'd have to create seven
summary tables, one for each aggregate set.
HOW MANY SUMMARY TABLES FOR 6 DIMENSIONS?
(Separate fact table and shrunken dimension table approach for storing
aggregates)
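One way to approach the question above is to count the non-empty combinations of dimensions, which is how the seven aggregate sets for three dimensions arise (2^3 - 1 = 7). The sketch below assumes one summary table per combination and ignores hierarchy levels inside a dimension, which would raise the count further; the function name `aggregate_sets` is chosen here for illustration.

```python
from itertools import combinations


def aggregate_sets(dimensions):
    """Every non-empty combination of dimensions, one per summary table."""
    return [
        combo
        for k in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, k)
    ]


three = aggregate_sets(["Months", "Regions", "Products"])
six = aggregate_sets(
    ["Representatives", "Products", "Customers", "Regions", "Months", "Years"]
)

print(len(three))  # 7
print(len(six))    # 63
```

Under that assumption, six dimensions need 2^6 - 1 = 63 summary tables, which shows how quickly the number of tables grows as dimensions are added.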
OLAP Needs: User Needs
Analytical Flexibility
OLAP Needs: User Needs
Real-Time Data Access
MOLAP tools load data into the multidimensional cubes.
Consequently, the data being accessed is only as recent
as the last load.
Some applications require real-time data access
Process of continually refreshing the data attaches higher
costs to operating a MOLAP system
Some MOLAP tools offer reach-through functionality to
access volatile data stored outside the MDDB
Unfortunately, users must be aware of the underlying
database structure
Relational data access is too complex for the typical user
OLAP Needs: User Needs
Real-Time Data Access
ROLAP tools maintain a constant link to the
operational RDBMS, which provides users with
up-to-the-minute, accurate data
(Real-Time Data Warehousing)
Industries & organizations with highly volatile
data particularly benefit from this access to
live, operational data.
OLAP Needs: User Needs
High Capacity Data
MOLAP products are limited by the size of the
cube defined by the multidimensional view.
When dimension elements are predefined, the
scope of available data is limited at the onset.
ROLAP tools circumvent this barrier. Dynamic
dimensions are not stored in the predefined
multidimensional model, but fetched at run
time from the RDBMS.
OLAP Needs: Needs
Easy Development
MOLAP development is straightforward: it requires no
fine tuning and creates its own aggregates.
ROLAP tools, on the other hand, require a specific
schema for the relational database.
Skilled DBAs must provide the appropriate schema
(star or snowflake schema), tune the database, and
create the appropriate summary tables.
However, many ROLAP tools are metadata-driven,
which means the multidimensional view is generated
and maintained more easily.
Hybrid OLAP - HOLAP
o Best of both worlds
HOLAP
[Figure: HOLAP architecture with three tiers. RDBMS server: relational data, SQL read access. MDBMS server: derived multidimensional data and metadata, multidimensional access, SQL reach-through. Client: user with a multidimensional viewer and a relational viewer.]
ROLAP, MOLAP, or HOLAP
IF
A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is 60-90 days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You're developing a general-purpose application for inventory movement or assets management
THEN
Consider an MDD/MOLAP solution for your data mart
IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN
Consider an RDBMS/ROLAP solution for your data mart.
IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN
Consider an HOLAP for your data mart
Conclusions
ROLAP: RDBMS -> star/snowflake schema
MOLAP: MDDB -> Cube structures
ROLAP or MOLAP: Data models used play major role in
performance differences
MOLAP: for summarized and relatively smaller volumes
of data (100GB)
ROLAP: for detailed and larger volumes of data
Both storage methods have strengths and weaknesses
The choice is requirement specific, though currently
data warehouses are predominantly built using
RDBMSs/ROLAP.
HOLAP is emerging as the OLAP server of choice
Different forms of
OLAP
Three ways of storing data:
Relational Database Model
[Figure: relational model with a SALES table (Time, Product) and a FINANCE table (Time, GL_Line); a user interface queries the warehouse through a ROLAP server. Captions: analysis using preaggregated summaries and precalculated measures; query performance; openness.]
Modeling
Warehouses differ from operational
structures:
Analytical requirements
Subject orientation
Data must map to subject oriented
information:
Identify business subjects
Define relationships between subjects
Name the attributes of each subject
Modeling is iterative
Modeling tools are available
Creating the Dimensional
Model
Identify fact tables
Translate business measures into
fact tables
Analyze source system information
for additional measures
Identify base and derived measures
Document additivity of measures
Identify dimension tables
Link fact tables to the dimension
tables
Create views for users
Dimension Tables
Dimension tables have the following
characteristics:
Contain textual information that
represents the attributes of the
business
Contain relatively static data
Are joined to a fact table through a
foreign key reference
[Figure: dimension tables Product, Channel, Customer, Time]
Fact Tables
Fact tables have the following
characteristics:
Contain numeric measures (metrics) of
the business
May contain summarized (aggregated)
data
May contain date-stamped data
Have key value that is typically a
concatenated key composed of the
primary keys of the dimensions
Joined to dimension tables through
foreign keys that reference primary keys
in the dimension tables
Dimensional Model (Star
Schema)
[Figure: a central fact table (facts: units, price) surrounded by the dimension tables Product, Channel, Customer, and Time.]
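A star schema of this kind can be sketched with Python's built-in `sqlite3` module. The table names, the two dimensions used (product and store), and all row values below are hypothetical and chosen only for illustration; the fact table carries the measures (units, price) and a concatenated key made of foreign keys to the dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive, relatively static attributes.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_desc TEXT);
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, district_id INTEGER);

    -- Fact table: numeric measures plus foreign keys to each dimension.
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product(product_id),
        store_id INTEGER REFERENCES store(store_id),
        units INTEGER,
        price REAL,
        PRIMARY KEY (product_id, store_id)
    );

    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO store VALUES (10, 100), (11, 100), (12, 200);
    INSERT INTO sales_fact VALUES (1, 10, 5, 2.0), (1, 12, 3, 2.0), (2, 11, 4, 5.0);
    """
)

# A typical star join: the fact table joined to its dimensions, then grouped.
units_by_district = conn.execute(
    """
    SELECT s.district_id, p.product_desc, SUM(f.units)
    FROM sales_fact AS f
    JOIN product AS p ON p.product_id = f.product_id
    JOIN store AS s ON s.store_id = f.store_id
    GROUP BY s.district_id, p.product_desc
    ORDER BY s.district_id, p.product_desc
    """
).fetchall()

print(units_by_district)
# [(100, 'Gadget', 4), (100, 'Widget', 5), (200, 'Widget', 3)]
```

Every query follows the same shape, one join per dimension from the central fact table, which is what keeps the number of joins low compared with a fully normalized schema.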
Star Schema Model
Product Table      Store Table
Product_id         Store_id
Product_desc       District_id
...
Publisher:
Morgan Kaufmann Publishers