
Data warehouse structures

Rajeev Tiwari Lecture 3

References

[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
[2] http://www.daneil-lemire.com
[3] http://www.kalmstrom.nu

What is a Data Warehouse?

o Defined in many different ways.
    A decision support database that is maintained separately from the organization's operational database.
    Supports information processing by providing a solid platform of consolidated, historical data for analysis.
o "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
o Data warehousing:
    The process of constructing and using data warehouses.

Data Warehouse - Subject Oriented

o Organized around major subjects, such as customer, product, sales.
o Focused on the modeling and analysis of data for decision makers, not on daily operations.
o Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse - Integrated


o Constructed by integrating multiple, heterogeneous data sources
    Relational databases, flat files, on-line transaction records.
o Data cleaning and data integration techniques are applied.
    Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
    When data is moved to the warehouse, it is converted.
o E.g., sales data may be in a relational database (RDB), customer information in flat files.
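
The conversion step described above can be illustrated with a small sketch. It assumes two hypothetical sources: a relational `sales` table that stores amounts in cents, and a customer flat file that uses its own column names and gender codes. All table names, column names and sample values are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: sales rows in a relational database (amounts in cents).
rdb = sqlite3.connect(":memory:")
rdb.execute("CREATE TABLE sales (cust_id INTEGER, amount_cents INTEGER)")
rdb.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 1250), (2, 900), (1, 300)])

# Hypothetical source 2: customer records in a flat file with different
# naming ("CustomerNo") and a different encoding for gender ("M"/"F").
flat_file = io.StringIO("CustomerNo,Name,Sex\n1,Asha,F\n2,Ravi,M\n")

GENDER_CODES = {"M": "male", "F": "female"}  # one warehouse-wide encoding


def load_customers(handle):
    """Convert flat-file rows to the warehouse naming and encoding."""
    return {
        int(row["CustomerNo"]): {"name": row["Name"],
                                 "gender": GENDER_CODES[row["Sex"]]}
        for row in csv.DictReader(handle)
    }


def integrate(rdb_conn, customers):
    """Combine both sources into warehouse rows with amounts in dollars."""
    rows = []
    query = "SELECT cust_id, amount_cents FROM sales ORDER BY rowid"
    for cust_id, cents in rdb_conn.execute(query):
        rows.append({"customer_key": cust_id,
                     "customer_name": customers[cust_id]["name"],
                     "gender": customers[cust_id]["gender"],
                     "dollars_sold": cents / 100})
    return rows


warehouse_rows = integrate(rdb, load_customers(flat_file))
```

Each row in `warehouse_rows` now uses one naming convention, one encoding and one unit of measure, regardless of which source it came from.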

Data Warehouse - Time Variant


o The time horizon for the data warehouse is significantly longer than that of operational database systems.
    Operational database: current value data.
    Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).
o Every key structure in the data warehouse
    Contains an element of time, explicitly or implicitly.
    But the key of operational data may or may not contain a time element.

Data Warehouse - Nonvolatile


o A physically separate store of data, transformed from the operational environment.
o Operational update of data does not occur in the data warehouse environment.
    Does not require transaction processing, recovery, and concurrency control mechanisms.
    Requires only two operations in data accessing: initial loading of data and access of data.
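
As a rough illustration of the two operations named above, the following toy class exposes only loading and read access. The class name, method names and sample rows are hypothetical and stand in for a real warehouse engine.

```python
class NonvolatileStore:
    """Toy store exposing only the two operations: load and access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Loading: rows are appended as immutable tuples, never changed."""
        self._rows.extend(tuple(r) for r in rows)

    def access(self, predicate=lambda row: True):
        """Read-only access: returns a new list of the matching rows."""
        return [row for row in self._rows if predicate(row)]


store = NonvolatileStore()
store.load([("2023-Q1", "TV", 120), ("2023-Q2", "TV", 95)])
store.load([("2023-Q3", "TV", 143)])
tv_over_100 = store.access(lambda r: r[2] > 100)
```

There is deliberately no update or delete method, so nothing in this sketch needs locking or rollback.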

Heterogeneous Databases
o Consist of a set of interconnected, autonomous databases.
o Objects in one database may differ from objects in other databases.
o Information exchange across such databases is difficult.

Data Warehouse vs. Heterogeneous DBMS


o Heterogeneous DBMS: a query-driven approach
    Build wrappers/mediators on top of heterogeneous databases.
    A meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites.
    The results are integrated into a global answer set.
    This approach involves complex information filtering.
    Inefficient and potentially expensive.
o Data warehouse: update-driven, high performance
    Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.
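
The difference between the two approaches can be sketched in a few lines. This is a simplified illustration, assuming two hypothetical in-memory sources and a hypothetical meta-dictionary `META` that maps global attribute names to each site's local names. Real mediators and warehouses are far more involved.

```python
# Two hypothetical autonomous sources with different attribute names.
SOURCE_A = [{"prod": "TV", "amt": 100}, {"prod": "PC", "amt": 250}]
SOURCE_B = [{"item_name": "TV", "dollars": 80}]
SOURCES = {"A": SOURCE_A, "B": SOURCE_B}

# Hypothetical meta-dictionary: global attribute -> local attribute per site.
META = {
    "A": {"item": "prod", "dollars_sold": "amt"},
    "B": {"item": "item_name", "dollars_sold": "dollars"},
}


def query_driven_total(item):
    """Mediator: translate the query for each site at query time, then merge."""
    total = 0
    for name, rows in SOURCES.items():
        mapping = META[name]
        total += sum(r[mapping["dollars_sold"]] for r in rows
                     if r[mapping["item"]] == item)
    return total


def build_warehouse():
    """Update-driven: integrate all sources in advance into one schema."""
    warehouse = []
    for name, rows in SOURCES.items():
        mapping = META[name]
        for r in rows:
            warehouse.append({glob: r[local] for glob, local in mapping.items()})
    return warehouse


WAREHOUSE = build_warehouse()  # done once, before any query arrives


def warehouse_total(item):
    """Direct query against the pre-integrated store."""
    return sum(r["dollars_sold"] for r in WAREHOUSE if r["item"] == item)
```

Both functions give the same answer, but `query_driven_total` pays the translation cost on every query, while `warehouse_total` pays it once at load time.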

Operational DBMS
o They consist of tables, each with a set of attributes, and store a large set of tuples.
o They use the Entity-Relationship (ER) data model.
o They are used to store transactional data.
o They contain the most current information.
o Thus they are known as Online Transaction Processing (OLTP) systems.


Data Warehouse vs. Operational DBMS

o User and system orientation
    customer vs. market
o Data contents
    current, detailed vs. historical, consolidated
o Database design
    ER + application vs. star + subject
o View
    current, local vs. evolutionary, integrated
o Access patterns
    update vs. read-only but complex queries


OLTP vs. OLAP


                      OLTP (online transaction processing)         OLAP (online analytical processing)
users                 clerk, IT professional                       knowledge worker
function              day to day operations                        decision support
DB design             application-oriented                         subject-oriented
data                  current, up-to-date; detailed,               historical, summarized, multidimensional;
                      flat relational; isolated                    integrated, consolidated
usage                 repetitive                                   ad-hoc
access                read/write; index/hash on prim. key          lots of scans
unit of work          short, simple transaction                    complex query
# records accessed    tens                                         millions
# users               thousands                                    hundreds
DB size               100MB-GB                                     100GB-TB
metric                transaction throughput                       query throughput, response
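
To make the "unit of work" contrast concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical, and a single SQLite database stands in for both kinds of system purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
             "country TEXT, year INTEGER, dollars REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "U.S.A", 2022, 100.0), (2, "Canada", 2022, 40.0),
     (3, "U.S.A", 2023, 60.0), (4, "Mexico", 2023, 25.0)],
)

# OLTP-style unit of work: a short read/write transaction on one record,
# located through the primary key.
with conn:
    conn.execute("UPDATE orders SET dollars = dollars + 5 WHERE order_id = ?",
                 (2,))

# OLAP-style unit of work: a read-only query that scans many records and
# returns a summary per country.
olap_result = conn.execute(
    "SELECT country, SUM(dollars) FROM orders GROUP BY country ORDER BY country"
).fetchall()
```

The update touches one row through its key, while the aggregate has to read every row, which is why the two workloads are measured by different metrics.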

Why Separate Data Warehouse?


o High performance for both systems
    DBMS: tuned for Online Transaction Processing systems.
    Warehouse: tuned for Online Analytical Processing systems involving complex OLAP queries.
    Processing OLAP queries would degrade DBMS performance of operational tasks.
o Decision support requires historical data, which operational databases do not typically maintain.
o Decision support requires consolidation of data from heterogeneous sources.
o Solution
    To maintain separate database systems which support special primitives and structures suitable to store, access and process

Multidimensional Data Model


o A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
o A data cube models n-D data, defined by dimensions and facts.
    Dimensions: entities with respect to which an organization wants to keep records, such as item (item_name).
    Facts: the subject of decision-oriented analysis, such as dollars_sold or units_sold. Facts are numerical measures: quantities by which we want to analyze relationships between dimensions.
o A multidimensional data model is typically organized around a central theme, like sales, and is represented by a fact table. The fact table contains the keys to each of the related dimension tables.
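
One way to picture dimensions and facts is to treat each cube cell as a tuple of dimension values mapped to an aggregated measure. The sketch below assumes a hypothetical list of fact rows and a helper named `cuboid`. It is an illustration of the idea, not how a warehouse engine stores cubes.

```python
from collections import defaultdict

# Hypothetical fact rows: (item, location, quarter, dollars_sold).
FACTS = [
    ("TV", "U.S.A", "Q1", 100), ("TV", "U.S.A", "Q2", 150),
    ("PC", "U.S.A", "Q1", 200), ("TV", "Canada", "Q1", 50),
]
DIMENSIONS = ("item", "location", "quarter")


def cuboid(facts, group_by):
    """Sum the measure over every dimension that is not listed in group_by."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    cells = defaultdict(int)
    for row in facts:
        cells[tuple(row[i] for i in idx)] += row[-1]
    return dict(cells)


by_item_location = cuboid(FACTS, ("item", "location"))
apex = cuboid(FACTS, ())  # no dimensions kept: one grand total
```

Here `by_item_location` holds dollars_sold per (item, location) pair with the quarter dimension summed out, and `apex` holds the single overall total.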

Sales volume as a function of product, Date, Country


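o Dimensions: Product, Location, Time
o Hierarchical summarization paths
    Product: Industry > Category > Product
    Location: Region > Country > City > Office
    Time: Year > Quarter > Month or Week > Day

[Figure: sales data cube with axes Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum) and Country (U.S.A, Canada, Mexico, sum); one aggregate cell is labelled "Total annual sales of TV in U.S.A."]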
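
Summarizing along one of these paths (commonly called a roll-up) can be sketched as follows. The city-level figures and the city-to-country mapping are hypothetical and cover only one step of the location hierarchy.

```python
from collections import defaultdict

# Hypothetical city-level sales: (city, dollars_sold).
CITY_SALES = [("Chicago", 120), ("New York", 200),
              ("Toronto", 90), ("Vancouver", 60)]

# One level of the location hierarchy (city -> country), for illustration.
CITY_TO_COUNTRY = {"Chicago": "U.S.A", "New York": "U.S.A",
                   "Toronto": "Canada", "Vancouver": "Canada"}


def roll_up(rows, parent_of):
    """Summarize a measure one level up a concept hierarchy."""
    totals = defaultdict(int)
    for member, value in rows:
        totals[parent_of[member]] += value
    return dict(totals)


country_sales = roll_up(CITY_SALES, CITY_TO_COUNTRY)
```

Applying the same function again with a country-to-region mapping would move one more level up the path.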

Cube: A Lattice of Cuboids


0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
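
The lattice for the four dimensions on this slide can be enumerated directly, since each cuboid corresponds to one subset of the dimensions. This short sketch uses `itertools.combinations`; the function name `lattice` is just a label chosen here.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def lattice(dimensions):
    """Return {k: [k-D cuboids]} for every subset of the dimensions."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


cuboids = lattice(DIMENSIONS)
total = sum(len(level) for level in cuboids.values())
```

With n dimensions (and no hierarchies within them) there are 2^n cuboids, so four dimensions give 16: one apex, four 1-D, six 2-D, four 3-D and one base cuboid.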

Schemas for Multidimensional Databases


The multidimensional model exists in the form of:

1. Star Schema: A fact table in the middle connected to a set of dimension tables.

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
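
The star schema above can be written down as table definitions. The following sketch uses Python's built-in `sqlite3` module with the column names from the slide. The column types, the sample rows and the omission of the derived `avg_sales` measure are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER,
                       day_of_the_week TEXT, month INTEGER, quarter TEXT,
                       year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                       branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city TEXT, state_or_province TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")

# Hypothetical sample rows, one per dimension and two facts.
conn.execute("INSERT INTO time VALUES (1, 15, 'Mon', 1, 'Q1', 2023)")
conn.execute("INSERT INTO item VALUES (1, 'TV', 'Acme', 'electronics', 'wholesale')")
conn.execute("INSERT INTO branch VALUES (1, 'Main', 'retail')")
conn.execute("INSERT INTO location VALUES (1, '1 Main St', 'Chicago', 'IL', 'U.S.A')")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 2, 800.0), (1, 1, 1, 1, 1, 400.0)])

# A typical star join: the fact table joined to the dimensions it needs.
star_join = conn.execute("""
    SELECT i.item_name, l.country, t.year, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN location AS l ON l.location_key = s.location_key
    JOIN time AS t ON t.time_key = s.time_key
    GROUP BY i.item_name, l.country, t.year
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape.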

The Star Schema

Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, Sales Amount, Unit Sales, ...
Employee_Dim: EmployeeKey, EmployeeID, ...
Time_Dim: TimeKey, TheDate, ...
Product_Dim: ProductKey, ProductID, ...
Shipper_Dim: ShipperKey, ShipperID, ...
Customer_Dim: CustomerKey, CustomerID, ...

2. Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
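
The effect of normalizing the location dimension can be seen in a small `sqlite3` sketch. Only the `location` and `city` tables from the slide are modelled, the fact table is reduced to one key and one measure, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city     (city_key INTEGER PRIMARY KEY, city TEXT,
                       state_or_province TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales    (location_key INTEGER REFERENCES location(location_key),
                       dollars_sold REAL);
INSERT INTO city VALUES (1, 'Chicago', 'IL', 'U.S.A'), (2, 'Toronto', 'ON', 'Canada');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '9 Lake Rd', 1),
                            (12, '5 King St', 2);
INSERT INTO sales VALUES (10, 100.0), (11, 50.0), (12, 70.0);
""")

# Reaching "country" now takes two joins: fact -> location -> city.
snowflake_totals = conn.execute("""
    SELECT c.country, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON l.location_key = s.location_key
    JOIN city AS c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
```

City attributes are stored once per city instead of once per street address, at the cost of the extra join in every query that needs them.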

Snowflakes are conglomerations of frozen ice crystals which fall through the Earth's atmosphere. They begin as two snow crystals which develop when microscopic supercooled cloud droplets freeze.

3. Fact Constellation: Multiple fact tables share dimension tables. The schema can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation.

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
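
A reduced sketch of a fact constellation follows, again using `sqlite3`. It keeps only the shared `item` dimension and a few measures from the sales and shipping fact tables. The rows are hypothetical and the other dimensions are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales    (item_key INTEGER REFERENCES item(item_key),
                       units_sold INTEGER, dollars_sold REAL);
CREATE TABLE shipping (item_key INTEGER REFERENCES item(item_key),
                       units_shipped INTEGER, dollars_cost REAL);
INSERT INTO item VALUES (1, 'TV'), (2, 'PC');
INSERT INTO sales VALUES (1, 3, 900.0), (2, 1, 700.0), (1, 1, 300.0);
INSERT INTO shipping VALUES (1, 4, 40.0), (2, 1, 15.0);
""")

# Aggregate each fact table separately, then line the results up through
# the dimension table that both fact tables share.
constellation = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(units_sold) FROM sales WHERE item_key = i.item_key),
           (SELECT SUM(units_shipped) FROM shipping WHERE item_key = i.item_key)
    FROM item AS i
    ORDER BY i.item_name
""").fetchall()
```

Because both fact tables reference the same `item` rows, sold and shipped quantities can be compared per item without duplicating the dimension.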

THANKS
