
Data Warehousing and OLAP Technology for Data Mining


What is a data warehouse?
A multi-dimensional data model
From data warehousing to data mining

Based on the slides developed by J Han and M Kamber

What is a Data Warehouse?

Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization's operational database.
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing: the process of constructing and using data warehouses.

Data Warehouse: Subject-Oriented

Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated

Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. E.g., hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant

The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).
Every important element in the data warehouse contains time, explicitly or implicitly.
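The conversion step mentioned for the integrated property (hotel prices that differ in currency, tax, and breakfast coverage) can be sketched roughly as follows. This is only an illustration: the `SourcePrice` record, the exchange rates, and the flat breakfast charge are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD, made up for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "CAD": 0.75}
# Hypothetical flat breakfast charge in USD.
BREAKFAST_USD = 10.0


@dataclass
class SourcePrice:
    """A hotel price as one (hypothetical) source system records it."""
    amount: float             # price as stored in the source
    currency: str             # currency code used by the source
    tax_included: bool        # whether `amount` already contains tax
    tax_rate: float           # e.g. 0.10 for 10 percent
    breakfast_included: bool  # whether breakfast is covered by `amount`


def to_warehouse_price(p: SourcePrice) -> float:
    """Convert to a single assumed convention: USD, tax and breakfast included."""
    amount = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        amount *= 1.0 + p.tax_rate
    if not p.breakfast_included:
        amount += BREAKFAST_USD
    return round(amount, 2)


# Two sources that encode the same kind of fact differently.
print(to_warehouse_price(SourcePrice(100.0, "EUR", False, 0.10, False)))
print(to_warehouse_price(SourcePrice(120.0, "USD", True, 0.10, True)))
```

The point of the sketch is that the warehouse stores one agreed encoding, so the conversion happens once at load time rather than in every analysis query.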

Data Warehouse: Non-Volatile

A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms.
Requires only two operations in data accessing: initial loading of data and access of data.

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
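One way to see the lattice is to note that each cuboid corresponds to a subset of the dimensions, so n dimensions give 2^n cuboids (ignoring concept hierarchies within a dimension). The sketch below enumerates them level by level; the helper name `cuboids` and the four dimension names are illustrative choices, not part of any library.

```python
from itertools import combinations


def cuboids(dimensions):
    """Group every subset of `dimensions` by its size (the cuboid's dimensionality)."""
    n = len(dimensions)
    return {k: list(combinations(dimensions, k)) for k in range(n + 1)}


lattice = cuboids(["time", "item", "location", "supplier"])
for k, group in lattice.items():
    # Level 0 is the apex cuboid; level n is the base cuboid.
    print(f"{k}-D cuboids ({len(group)}): {group}")
```

The empty tuple at level 0 stands for the apex cuboid ("all"), and the single tuple at the highest level is the base cuboid.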

Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a set of dimension tables.
Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation.

Example of Star Schema

Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)

Example of Snowflake Schema

Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_key)
supplier (supplier_key, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city_key)
city (city_key, city, province_or_state, country)
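As a rough illustration of how a star schema is queried, the sketch below builds a cut-down version of the sales star in an in-memory SQLite database and joins the fact table to two of its dimension tables. Only some columns are kept for brevity, the time dimension is named `time_dim` here as an arbitrary choice, and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),  -- key to a dimension table
    item_key INTEGER REFERENCES item(item_key),      -- key to a dimension table
    units_sold INTEGER,                              -- measure
    dollars_sold REAL                                -- measure
);
""")

# Hypothetical sample rows, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2000), (2, "Q2", 2000)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?)",
                 [(1, "TV", "BrandA"), (2, "PC", "BrandB")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 800.0), (1, 2, 1, 1200.0), (2, 1, 3, 1200.0)])

# Join the fact table to its dimensions and aggregate a measure.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
for row in rows:
    print(row)
```

In a snowflake variant of the same sketch, an attribute such as the supplier type would move to its own table, and the query would need one more join to reach it.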

Example of Fact Constellation

Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped)
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)
shipper (shipper_key, shipper_name, location_key, shipper_type)

Multidimensional Data

Sales volume as a function of product, month, and region.
Dimensions: Product, Location, Time
Hierarchical summarization paths (most general to most specific):
Product: Industry, Category, Product
Location: Region, Country, City, Office
Time: Year, Quarter, Month or Week, Day

A Sample Data Cube

[Figure: a 3-D cube with axes Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum). One highlighted cell holds the total annual sales of TV in U.S.A.]
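The "sum" cells of the sample cube, such as the total annual sales of TV in U.S.A., come from aggregating the base cells over one or more dimensions. A minimal sketch of that aggregation is given below; the `roll_up` helper is hypothetical and the sales figures are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical base cells: (product, quarter, country) -> sales volume.
base = {
    ("TV", "1Qtr", "U.S.A."): 100,
    ("TV", "2Qtr", "U.S.A."): 120,
    ("TV", "1Qtr", "Canada"): 40,
    ("PC", "1Qtr", "U.S.A."): 200,
}


def roll_up(cells, keep):
    """Aggregate `cells`, keeping the dimension positions listed in `keep`.

    Every other dimension is collapsed and labelled 'sum'.
    """
    out = defaultdict(int)
    for key, value in cells.items():
        new_key = tuple(v if i in keep else "sum" for i, v in enumerate(key))
        out[new_key] += value
    return dict(out)


# Collapse the date dimension: one cell per (product, country).
by_product_country = roll_up(base, keep={0, 2})
print(by_product_country[("TV", "sum", "U.S.A.")])

# Collapse everything: the single cell of the apex cuboid.
print(roll_up(base, keep=set()))
```

Each choice of `keep` yields one cuboid of the lattice, with the empty set giving the apex cuboid and the full set giving back the base cuboid.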

Cuboids Corresponding to the Cube

0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country)

Browsing a Data Cube

Visualization
OLAP capabilities
Interactive manipulation
