Data Mining and Warehousing: Kapil Sharma

DATA MINING AND WAREHOUSING
Kapil Sharma
Asst. Prof. at ATC Indore

V. Prof. at Symbiosis University of Applied Science(SUAS)
Sr. Research Associate- Indian Institute of Management (IIM) Indore
Working on the World Bank's BRTS project and on knowledge management
in State Bank of India.
V. Prof.-International Institute of Professional Studies, Davv Indore
V. Prof.- School of Future Studies and Planning , Davv Indore
17 articles published+ one patent (under process)
Data Warehousing
A data warehouse is constructed by integrating data
from multiple heterogeneous sources that support
analytical reporting, structured and/or ad hoc queries,
and decision making.
Data warehousing is the process of constructing and
using a data warehouse. Data warehousing mainly
involves data cleaning, data integration, and data
consolidation.
What is Data Warehouse?

• Defined in many different ways, but not rigorously.
• A decision support database that is maintained
separately from the organization’s operational database
• Support information processing by providing a solid
platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support of
management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
• The process of constructing and using data warehouses
Functions of Data Warehouse Tools and Utilities
• The following are the functions of data warehouse tools and
utilities:
• Data Extraction - Involves gathering data from multiple
heterogeneous sources.
• Data Cleaning - Involves finding and correcting the errors in
data.
• Data Transformation - Involves converting the data from
legacy format to warehouse format.
• Data Loading - Involves sorting, summarizing, consolidating,
checking integrity, and building indices and partitions.
• Refreshing - Involves updating from data sources to
warehouse.
• Note: Data cleaning and data transformation are important
steps in improving the quality of data and data mining results.
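The functions listed above can be sketched as a tiny pipeline. This is a minimal illustration only: the two source layouts, the field names, and the rule of rejecting rows with a non-numeric amount are all assumptions made for the example, not part of any particular warehouse tool.

```python
from collections import defaultdict

# Hypothetical source records in two different "legacy" layouts.
CRM_ROWS = [
    {"cust": " Alice ", "amount": "120.50", "date": "2023-01-15"},
    {"cust": "bob", "amount": "n/a", "date": "2023-01-16"},  # dirty amount
]
BILLING_ROWS = [
    ("ALICE", 80.0, "16/01/2023"),  # day/month/year layout
    ("Carol", 40.0, "17/01/2023"),
]


def extract():
    """Data extraction: gather raw rows from both sources, tagged by origin."""
    for row in CRM_ROWS:
        yield "crm", row
    for row in BILLING_ROWS:
        yield "billing", row


def clean_and_transform(source, row):
    """Return a row in the common warehouse format, or None if unusable."""
    if source == "crm":
        name, amount, day_iso = row["cust"], row["amount"], row["date"]
    else:
        name, amount, raw_date = row
        day, month, year = raw_date.split("/")
        day_iso = f"{year}-{month}-{day}"  # transformation: convert to ISO dates
    try:
        amount = float(amount)
    except ValueError:
        return None  # cleaning: reject rows whose amount is not a number
    return {"customer": name.strip().title(), "amount": amount, "date": day_iso}


def load(rows):
    """Data loading: sort by date, then summarize amounts per customer."""
    totals = defaultdict(float)
    for row in sorted(rows, key=lambda r: r["date"]):
        totals[row["customer"]] += row["amount"]
    return dict(totals)


staged = [clean_and_transform(src, row) for src, row in extract()]
warehouse = load([row for row in staged if row is not None])
print(warehouse)
```

Refreshing would amount to re-running the same steps on the rows that arrived since the last load.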
Using Data Warehouse Information
• There are decision support technologies that help utilize the
data available in a data warehouse. These technologies help
executives to use the warehouse quickly and effectively. They
can gather data, analyze it, and take decisions based on the
information present in the warehouse. The information
gathered in a warehouse can be used in any of the following
domains:
• Tuning Production Strategies - The product strategies can be
well tuned by repositioning the products and managing the
product portfolios by comparing the sales quarterly or yearly.
• Customer Analysis - Customer analysis is done by analyzing
the customer's buying preferences, buying time, budget cycles,
etc.
• Operations Analysis - Data warehousing also helps in
customer relationship management, and making environmental
corrections. The information also allows us to analyze business
operations.
• Query-Driven Approach
• This is the traditional approach to integrate
heterogeneous databases. This approach was used to
build wrappers and integrators on top of multiple
heterogeneous databases. These integrators are also
known as mediators.
• Process of Query-Driven Approach
• When a query is issued to a client site, a metadata
dictionary translates the query into an appropriate form for
the individual heterogeneous sites involved.
• Now these queries are mapped and sent to the local
query processor.
• The results from heterogeneous sites are integrated into a
global answer set.
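The process above can be sketched as a small mediator. The two sites, their field names, and the metadata dictionary are hypothetical; the point is only that every query is translated and sent to the sources at query time.

```python
# Two hypothetical sites storing the same kind of data under different names.
SITE_A = [{"prod": "TV", "qty": 3}, {"prod": "PC", "qty": 1}]
SITE_B = [{"item_name": "TV", "units": 2}, {"item_name": "VCR", "units": 5}]
SITES = {"site_a": SITE_A, "site_b": SITE_B}

# Metadata dictionary: global field name -> local field name, per site.
METADATA = {
    "site_a": {"product": "prod", "units_sold": "qty"},
    "site_b": {"product": "item_name", "units_sold": "units"},
}


def local_query(rows, field, value):
    """Stand-in for a site's local query processor (an equality filter)."""
    return [row for row in rows if row[field] == value]


def mediator_query(product):
    """Translate the global query for each site, run it, merge the results."""
    answer = []
    for site, mapping in METADATA.items():
        for row in local_query(SITES[site], mapping["product"], product):
            # Map the local field names back to the global ones.
            answer.append({
                "site": site,
                "product": row[mapping["product"]],
                "units_sold": row[mapping["units_sold"]],
            })
    return answer


result = mediator_query("TV")
print(result)
```

Every call to mediator_query touches every site again, which is why this approach becomes costly for frequent or aggregate queries.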
Disadvantages
• Query-driven approach needs complex integration and
filtering processes.
• This approach is very inefficient.
• It is very expensive for frequent queries.
• This approach is also very expensive for queries that
require aggregations.
Update-Driven Approach
• This is an alternative to the traditional approach. Today's
data warehouse systems follow the update-driven approach
rather than the traditional approach discussed earlier.
• In the update-driven approach, information from multiple
heterogeneous sources is integrated in advance and
stored in a warehouse. This information is available for
direct querying and analysis.
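A minimal sketch of the update-driven idea, assuming two hypothetical sources with different field names: integration happens once, ahead of time, and a query afterwards is just a lookup in the integrated store.

```python
from collections import Counter

# Hypothetical heterogeneous sources (field names differ per source).
SITE_A = [{"prod": "TV", "qty": 3}, {"prod": "PC", "qty": 1}]
SITE_B = [{"item_name": "TV", "units": 2}, {"item_name": "VCR", "units": 5}]


def refresh_warehouse():
    """Integrate and summarize all sources in advance."""
    warehouse = Counter()
    for row in SITE_A:
        warehouse[row["prod"]] += row["qty"]
    for row in SITE_B:
        warehouse[row["item_name"]] += row["units"]
    return warehouse


WAREHOUSE = refresh_warehouse()  # done periodically, not per query


def units_sold(product):
    """Answer a query directly from the warehouse; no source is contacted."""
    return WAREHOUSE.get(product, 0)


print(units_sold("TV"))
```

The trade-off is freshness: results reflect the sources as of the last refresh.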
Advantages
• This approach has the following advantages:
• This approach provides high performance.
• The data is copied, processed, integrated, annotated,
summarized, and restructured in a semantic data store in
advance.
• Query processing does not require an interface to process
data at local sources.
Understanding a Data Warehouse
• A data warehouse is a database, which is kept separate
from the organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the
organization to analyze its business.
• A data warehouse helps executives to organize,
understand, and use their data to take strategic decisions.
• Data warehouse systems help in the integration of diverse
application systems.
• A data warehouse system helps in consolidated historical
data analysis.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases for
the following reasons:
• An operational database is constructed for well-known tasks and
workloads such as searching particular records, indexing, etc. In
contrast, data warehouse queries are often complex and they present
a general form of data.
• Operational databases support concurrent processing of multiple
transactions. Concurrency control and recovery mechanisms are
required for operational databases to ensure robustness and
consistency of the database.
• An operational database query allows read and modify operations,
while an OLAP query needs only read-only access to stored data.
• An operational database maintains current data. On the other hand,
a data warehouse maintains historical data.
Data Warehouse Features

• The key features of a data warehouse are discussed below:
• Subject Oriented - A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing
operations. These subjects can be product, customers, suppliers, sales,
revenue, etc. A data warehouse does not focus on the ongoing operations,
rather it focuses on modelling and analysis of data for decision making.
• Integrated - A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This
integration enhances the effective analysis of data.
• Time Variant - The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information from
the historical point of view.
• Non-volatile - Non-volatile means the previous data is not erased when new
data is added to it. A data warehouse is kept separate from the operational
database and therefore frequent changes in the operational database are not
reflected in the data warehouse.
• Note: A data warehouse does not require transaction processing, recovery,
and concurrency controls, because it is physically stored separately from
the operational database.
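The time-variant and non-volatile features can be illustrated with a small sketch. The item, prices, and snapshot dates are made up for the example: the operational store overwrites its current value, while the warehouse only appends time-stamped records.

```python
# Operational store: holds only the current price of each item.
operational = {"TV": 499.0}

# Warehouse: append-only list of (snapshot_date, item, price) records.
warehouse = []


def snapshot(snapshot_date):
    """Load the current operational values, stamped with a time element."""
    for item, price in operational.items():
        warehouse.append((snapshot_date, item, price))


snapshot("2023-01-01")
operational["TV"] = 449.0  # an operational update overwrites the old value
snapshot("2023-02-01")     # the warehouse keeps the old record and adds a new one


def price_history(item):
    """Historical view that the operational store can no longer provide."""
    return [(day, price) for day, name, price in warehouse if name == item]


print(price_history("TV"))
```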
OLAP
• OLAP is an acronym for Online Analytical Processing.
OLAP performs multidimensional analysis of business
data and provides the capability for complex calculations,
trend analysis, and sophisticated data modeling
• OLAP (Online Analytical Processing) is the technology
behind many Business Intelligence (BI) applications.
OLAP is a powerful technology for data discovery,
including capabilities for limitless report viewing, complex
analytical calculations, and predictive “what if” scenario
(budget, forecast) planning.
• It is the foundation for many kinds of business applications for:
• Business Performance Management,
• Planning,
• Budgeting,
• Forecasting,
• Financial Reporting,
• Analysis,
• Simulation Models,
• Knowledge Discovery, and
• Data Warehouse Reporting.
• OLAP enables end-users to perform ad hoc analysis of data in
multiple dimensions, thereby providing the insight and
understanding they need for better decision making.
Han: Dataware Houses and OLAP 15
OLTP (online transaction processing)
• OLTP (online transaction processing) is a class of
software programs capable of supporting
transaction-oriented applications on the Internet.
• Typically, OLTP systems are used for order entry,
financial transactions, customer relationship management
(CRM) and retail sales. Such systems have a large
number of users who conduct short transactions.
Database queries are usually simple, require sub-second
response times and return relatively few records.
• An important attribute of an OLTP system is its ability to
maintain concurrency. To avoid single points of failure,
OLTP systems are often decentralized.
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides summarized and multidimensional view of data. | It provides detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | It is highly flexible. | It provides high performance.
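The contrast in the table can be made concrete with two queries against the same data. This sketch uses Python's built-in sqlite3 module; the orders table and its rows are hypothetical, and a real deployment would keep the two workloads in separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "North", 2022, 100.0),
        (2, "North", 2023, 150.0),
        (3, "South", 2022, 200.0),
        (4, "South", 2023, 50.0),
    ],
)

# OLTP style: a short read/write transaction that touches one record by key.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (175.0, 2))
one_order = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP style: a read-only scan that summarizes many records.
summary = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

print(one_order)
print(summary)
```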
Data Warehouse Applications

• As discussed before, a data warehouse helps business
executives to organize, analyze, and use their data for
decision making. A data warehouse serves as an essential part
of a plan-execute-assess "closed-loop" feedback system
for enterprise management. Data warehouses are
widely used in the following fields:
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
Types of Data Warehouse Applications
• Information processing, analytical processing, and data mining are
the three types of data warehouse applications that are discussed
below:
• Information Processing - A data warehouse allows the data stored in
it to be processed. The data can be processed by means of querying,
basic statistical analysis, and reporting using crosstabs, tables, charts,
or graphs.
• Analytical Processing - A data warehouse supports analytical
processing of the information stored in it. The data can be analyzed
by means of basic OLAP operations, including slice-and-dice, drill
down, drill up, and pivoting.
• Data Mining - Data mining supports knowledge discovery by finding
hidden patterns and associations, constructing analytical models,
performing classification and prediction. These mining results can
be presented using the visualization tools.
Data Warehousing and OLAP/Multi-dimensional Data Model
1. What is a data warehouse?
2. Specific Software for Data Warehousing: OLAP (Online
Analytical Processing) / The Multi-Dimensional Data
Model / Data Cubes
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer,
product, sales.
• Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
• Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process.
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources
• relational databases, flat files, on-line transaction
records
• Data cleaning and data integration techniques are
applied.
• Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant

• The time horizon for the data warehouse is significantly
longer than that of operational systems.
• Operational database: current value data.
• Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain
“time element”.
Data Warehouse—Non-Volatile
• A physically separate store of data transformed from the
operational environment.
• Operational update of data does not occur in the data
warehouse environment.
• Does not require transaction processing, recovery, and
concurrency control mechanisms
• Requires only two operations in data accessing:
• initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration:

• Build wrappers/mediators on top of heterogeneous databases
• Query driven approach
• When a query is posed to a client site, a meta-dictionary is used to
translate the query into queries appropriate for individual
heterogeneous sites involved, and the results are integrated into a
global answer set
• Complex information filtering, compete for resources
• Data warehouse: update-driven, high performance

• Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis
OLTP vs. OLAP (Online Analytical Processing)
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage | repetitive | ad-hoc
access | read/write; index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response

Why Separate Data Warehouse?

• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
• Different functions and different data:
• missing data: Decision support requires historical data
which operational DBs do not typically maintain
• data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
• data quality: different sources typically use inconsistent
data representations, codes and formats which have to
be reconciled
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
• Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
• Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids forms
a data cube.
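The lattice of cuboids can be enumerated directly. A minimal sketch, assuming three dimensions with no concept hierarchies: each cuboid is a subset of the dimensions, so n dimensions give 2^n cuboids, from the n-D base cuboid down to the 0-D apex cuboid.

```python
from itertools import combinations


def cuboids(dimensions):
    """Yield every cuboid (group-by subset), base cuboid first, apex last."""
    for k in range(len(dimensions), -1, -1):
        for combo in combinations(dimensions, k):
            yield combo


lattice = list(cuboids(("time", "item", "location")))
for cuboid in lattice:
    label = ", ".join(cuboid) if cuboid else "all (apex)"
    print(f"{len(cuboid)}-D cuboid: {label}")
```

With concept hierarchies on the dimensions the number of cuboids grows further, since each dimension can also be grouped at any of its levels.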
OLAP Terminology
• A data cube supports viewing/modelling of a variable (a
set of variables) of interest. Measures are used to report
the values of the particular variable with respect to a given
set of dimensions.
• A fact table stores measures as well as keys
representing relationships to various dimensions.
• Dimensions are perspectives with respect to which an
organization wants to keep record.
• A star schema defines a fact table and its associated
dimensions.
Data Warehouse Schema Architecture

• Data Warehouse environment usually transforms the relational data
model into some special architectures. There are many schema
models designed for data warehousing but the most commonly used
are:
- Star schema
- Snowflake schema
- Fact constellation schema
The choice of schema model for a data warehouse should be based on
an analysis of project requirements, accessible tools, and project
team preferences.
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
• Star schema: A fact table in the middle connected to a set
of dimension tables
• Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
• Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
What is star schema?

• The star schema architecture is the simplest data
warehouse schema. It is called a star schema because
the diagram resembles a star, with points radiating from a
center. The center of the star consists of the fact table and
the points of the star are the dimension tables. Usually the
fact tables in a star schema are in third normal form (3NF)
whereas dimension tables are de-normalized. Although
the star schema is the simplest architecture,
it is the most commonly used nowadays and is recommended
by Oracle.
Fact Tables
• A fact table typically has two types of columns: foreign
keys to dimension tables, and measures, which contain
numeric facts. A fact table can contain fact data at a detailed
or aggregated level.
Dimension Tables
• A dimension is a structure usually composed of one or
more hierarchies that categorize data. If a dimension
has no hierarchies and levels, it is called a flat
dimension or list. The primary keys of each of the
dimension tables are part of the composite primary key of
the fact table. Dimensional attributes help to describe the
dimensional value. They are normally descriptive, textual
values. Dimension tables are generally smaller than the
fact table.
• Typical fact tables store data about sales, while dimension
tables store data about geographic regions (markets, cities),
clients, products, times, and channels.
The main characteristics of star schema:
• Simple structure -> easy-to-understand schema
• Great query effectiveness -> small number of tables to join
• Relatively long time to load data into dimension tables
-> de-normalization and redundant data can make the
tables large
• The most commonly used schema in data warehouse
implementations -> widely supported by a large number of
business intelligence tools
Example of Star Schema

Sales Fact Table (center of the star)
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
Dimension tables (points of the star)
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_street, country
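The star schema example can be sketched with Python's built-in sqlite3 module. This is a trimmed version under stated assumptions: the branch dimension and several columns are left out for brevity, the tables are named time_dim, item_dim, location_dim, and sales_fact for the example, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
    quarter TEXT, year INTEGER);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: foreign keys to each dimension plus numeric measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    dollars_sold REAL);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 15, 1, "Q1", 2023), (2, 10, 4, "Q2", 2023)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?)",
                 [(1, "TV-100", "Acme", "TV"), (2, "PC-200", "Bolt", "PC")])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                 [(1, "Toronto", "Canada"), (2, "Vancouver", "Canada")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 2, 1000.0), (1, 2, 2, 1, 800.0), (2, 1, 2, 3, 1500.0)])

# A typical star join: the fact table joined to the dimensions it needs.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(rows)
```

Each dimension is one join away from the fact table, which is what keeps star queries simple.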
A Concept Hierarchy: Dimension (location)
all all
region Europe ... North_America
country Germany ... Spain Canada ... Mexico
city Frankfurt ... Vancouver ... Toronto
office L. Chan ... M. Wind

Snowflake schema
The snowflake schema architecture is a more complex variation of the star schema
used in a data warehouse, because the tables which describe the dimensions are
normalized.
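A minimal sketch of what normalizing a dimension looks like, again with sqlite3 and invented data: the location dimension is split into a location table and a separate city table, so a query by country needs one more join than it would in a star schema. Table and column names here are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The city level of the location hierarchy gets its own table.
CREATE TABLE city_dim (
    city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES city_dim(city_key));
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL);
""")
conn.executemany("INSERT INTO city_dim VALUES (?, ?, ?)",
                 [(1, "Toronto", "Canada"), (2, "Frankfurt", "Germany")])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                 [(1, "1 King St", 1), (2, "2 Queen St", 1), (3, "3 Main St", 2)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [(1, 100.0), (2, 50.0), (3, 70.0)])

# Two joins are needed to reach the country attribute.
by_country = conn.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    JOIN city_dim AS c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(by_country)
```

The city and country values are stored once per city instead of once per location, which reduces redundancy at the cost of the extra join.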
What is fact constellation schema?

For each star schema it is possible to construct a fact constellation
schema (for example, by splitting the original star schema into several
star schemas, each of which describes facts at a different level of the
dimension hierarchies). The fact constellation architecture contains
multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more
complicated design, because many variants for particular kinds of
aggregation must be considered and selected. Moreover, dimension
tables are still large.
View of Warehouses and Hierarchies
Specification of hierarchies
• Schema hierarchy
day < {month < quarter; week} < year
• Set_grouping hierarchy
{1..10} < inexpensive
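The two kinds of hierarchy can be sketched as follows. The schema hierarchy maps a day to its month, quarter, week, and year; the set-grouping hierarchy maps value ranges to labels. Only the {1..10} < inexpensive group comes from the slide; the "moderate" range and the "ungrouped" fallback are hypothetical additions for illustration.

```python
from datetime import date


def schema_levels(d):
    """Map a day to its ancestors: day < {month < quarter; week} < year."""
    _iso_year, iso_week, _weekday = d.isocalendar()
    return {
        "day": d.isoformat(),
        "week": iso_week,  # ISO week number; week is a separate path to year
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }


# Set-grouping hierarchy: inclusive value ranges mapped to a label.
PRICE_GROUPS = [
    (1, 10, "inexpensive"),  # from the slide
    (11, 100, "moderate"),   # hypothetical extra group
]


def price_group(price):
    """Return the group label for a price, or 'ungrouped' if none matches."""
    for low, high, label in PRICE_GROUPS:
        if low <= price <= high:
            return label
    return "ungrouped"


levels = schema_levels(date(2023, 5, 17))
print(levels)
print(price_group(7))
```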
Multidimensional Data
• Sales volume as a function of product, month, and
region
Dimensions: Product, Location, Time
Hierarchical summarization paths
• Product: Industry > Category > Product
• Location: Region > Country > City > Office
• Time: Year > Quarter > Month or Week > Day
A Sample Data Cube

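[Figure: a three-dimensional data cube of sales. Axes: Date (1Qtr, 2Qtr,
3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), Country (U.S.A., Canada,
Mexico, sum). One highlighted cell holds the total annual sales of TV
in U.S.A.]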
Browsing a Data Cube
• Visualization
• OLAP capabilities
• Interactive manipulation
Typical OLAP Operations
• Roll up (drill-up): summarize data
• by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
• from higher level summary to lower level summary or detailed data,
or introducing new dimensions
• Slice and dice:
• project and select
• Pivot (rotate):
• reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
• drill across: involving (across) more than one fact table
• drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
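The operations above can be sketched on a tiny in-memory fact list. The data and the aggregate helper are hypothetical; a real OLAP engine would work on precomputed cuboids rather than rescanning facts.

```python
from collections import defaultdict

# Each fact: (quarter, city, country, product, units_sold). Invented data.
FACTS = [
    ("Q1", "Toronto", "Canada", "TV", 10),
    ("Q1", "Vancouver", "Canada", "PC", 4),
    ("Q2", "Toronto", "Canada", "TV", 7),
    ("Q1", "Chicago", "U.S.A.", "TV", 12),
    ("Q2", "Chicago", "U.S.A.", "PC", 5),
]
FIELDS = ("quarter", "city", "country", "product")


def aggregate(facts, dims):
    """Sum units_sold grouped by the given dimension names."""
    positions = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[i] for i in positions)] += fact[-1]
    return dict(totals)


# Drill-down level: quarter by city.
by_city = aggregate(FACTS, ("quarter", "city"))
# Roll-up by climbing the location hierarchy: city -> country.
by_country = aggregate(FACTS, ("quarter", "country"))
# Roll-up by dimension reduction: drop location entirely.
by_quarter = aggregate(FACTS, ("quarter",))

# Slice: select on one dimension. Dice: select on two or more.
q1_slice = [f for f in FACTS if f[0] == "Q1"]
q1_tv_dice = [f for f in FACTS if f[0] == "Q1" and f[3] == "TV"]

# Pivot: swap the two axes of the quarter-by-country view.
pivoted = {(country, quarter): units
           for (quarter, country), units in by_country.items()}

print(by_country)
print(by_quarter)
```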
A Star-Net Query Model

[Figure: star-net query model. Radial lines from a central point
represent the dimensions: Customer Orders (CONTRACTS, ORDER), Shipping
Method (AIR-EXPRESS, TRUCK), Time (ANNUALLY, QTRLY, DAILY), Product
(PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE), Location (CITY, COUNTRY,
REGION), Organization (SALES PERSON, DISTRICT, DIVISION), Promotion,
and Customer. Each circle is called a footprint.]
Views and Decision Support
• OLAP queries are typically aggregate queries.
• Precomputation is essential for interactive response
times.
• The CUBE is in fact a collection of aggregate queries,
and precomputation is especially important: lots of
work on what is best to precompute given a limited
amount of space to store precomputed results.
• Warehouses can be thought of as a collection of
asynchronously replicated tables and periodically
maintained views.
• Has renewed interest in view maintenance!
Issues in View Materialization
• What views should we materialize, and what indexes
should we build on the precomputed results?
• Given a query and a set of materialized views, can we
use the materialized views to answer the query?
• How frequently should we refresh materialized views
to make them consistent with the underlying tables?
(And how can we do this incrementally?)
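Two of these questions can be illustrated with a small sketch, assuming a hypothetical sales table and a view materialized at (product, month) granularity. Because SUM is distributive, a coarser per-product query can be answered from the view alone, and newly arrived rows can be folded into the view without recomputing it.

```python
from collections import defaultdict

# Hypothetical base table rows: (product, month, amount).
base_sales = [
    ("TV", "2023-01", 100.0),
    ("TV", "2023-02", 50.0),
    ("PC", "2023-01", 80.0),
]


def materialize(rows):
    """Precompute SUM(amount) grouped by (product, month)."""
    view = defaultdict(float)
    for product, month, amount in rows:
        view[(product, month)] += amount
    return view


def total_by_product(view):
    """Answer a coarser query from the view, without touching base rows."""
    totals = defaultdict(float)
    for (product, _month), amount in view.items():
        totals[product] += amount
    return dict(totals)


def refresh_incrementally(view, new_rows):
    """Fold newly arrived rows into the view instead of rebuilding it."""
    for product, month, amount in new_rows:
        view[(product, month)] += amount


mv = materialize(base_sales)
refresh_incrementally(mv, [("TV", "2023-02", 25.0)])
print(total_by_product(mv))
```

A query that needs day-level detail could not be answered from this view, which is the kind of trade-off the first question is about.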
Discovery-Driven Exploration of Data Cubes
• Hypothesis-driven: exploration by user, huge search space
• Discovery-driven (Sarawagi et al.’98)
• pre-compute measures indicating exceptions, guide user in the
data analysis, at all levels of aggregation
• Exception: significantly different from the value anticipated, based
on a statistical model
• Visual cues such as background color are used to reflect the
degree of exception of each cell
• Computation of exception indicator (model fitting and computing
SelfExp, InExp, and PathExp values) can be overlapped with cube
construction
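The idea of an exception as a value that differs significantly from what a statistical model anticipates can be illustrated with a deliberately simplified sketch. This is not the SelfExp/InExp/PathExp computation of Sarawagi et al.; it assumes a plain additive model (row mean + column mean - grand mean) on one 2-D table with invented data, and an arbitrary threshold of 1.5 standard deviations of the residuals.

```python
from statistics import mean, pstdev

# Hypothetical 2-D cuboid: units sold per product and month.
SALES = {
    "TV": {"Jan": 10, "Feb": 12, "Mar": 11},
    "PC": {"Jan": 20, "Feb": 22, "Mar": 21},
    "VCR": {"Jan": 5, "Feb": 7, "Mar": 30},
}


def find_exceptions(table, factor=1.5):
    """Flag cells whose residual from an additive model is unusually large."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    grand = mean(table[r][c] for r in rows for c in cols)
    row_mean = {r: mean(table[r].values()) for r in rows}
    col_mean = {c: mean(table[r][c] for r in rows) for c in cols}
    # Residual = actual value minus the value the additive model anticipates.
    residual = {
        (r, c): table[r][c] - (row_mean[r] + col_mean[c] - grand)
        for r in rows
        for c in cols
    }
    spread = pstdev(list(residual.values()))
    return [cell for cell, res in residual.items() if abs(res) > factor * spread]


exceptions = find_exceptions(SALES)
print(exceptions)
```

In a cube browser, the flagged cells would be the ones given a stronger background color.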
Examples: Discovery-Driven Data Cubes

Data Warehouse Usage

• Three kinds of data warehouse applications
• Information processing
• supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
• Analytical processing and Interactive Analysis
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
• Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools.
• Differences among the three tasks
Software to Work with Data Cubes

• http://www.olapreport.com/
• http://www.olapreport.com/Market.htm
Summary
• Data warehouse
• A subject-oriented, integrated, time-variant, and nonvolatile collection
of data in support of management’s decision-making process
• A multi-dimensional model of a data warehouse
• Star schema, snowflake schema, fact constellations
• A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting

• Efficient computation of data cubes
• Partial vs. full vs. no materialization
• Special index structures (not discussed)
• Further development of data cube technology
• Discovery-driven and multi-feature cubes
• From OLAP to OLAM (on-line analytical mining)
References (I)
• S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large
Data Bases, 506-521, Bombay, India, Sept. 1996.
• D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. In
Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, 417-427, Tucson, Arizona, May 1997.
• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of
Data, 94-105, Seattle, Washington, June 1998.
• R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc. 1997 Int. Conf.
Data Engineering, 232-243, Birmingham, England, April 1997.
• K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. In Proc. 1999
ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), 359-370, Philadelphia, PA, June 1999.
• S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD
Record, 26:65-74, 1997.
• OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.
• J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh.
Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data
Mining and Knowledge Discovery, 1:29-54, 1997.
References (II)
• V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996
ACM-SIGMOD Int. Conf. Management of Data, pages 205-216, Montreal, Canada, June 1996.
• Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998.
• K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf. Very
Large Data Bases, 116-125, Athens, Greece, Aug. 1997.
• K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. In
Proc. Int. Conf. of Extending Database Technology (EDBT'98), 263-277, Valencia, Spain, March
1998.
• S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. In Proc.
Int. Conf. of Extending Database Technology (EDBT'98), pages 168-182, Valencia, Spain, March
1998.
• E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons,
1997.
• Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous
multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, 159-170,
Tucson, Arizona, May 1997.
