
SCCS 453

Data Warehousing and Data Mining


Lecture 3
Data Warehouse Modeling and OLAP Operations

Songsri Tangsripairoj, Ph.D.


ccsts@mahidol.ac.th

Department of Computer Science


Faculty of Science, Mahidol University

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 1


Semester 2, Year 2006
Topics
- From Tables and Spreadsheets to Data Cubes
- Multidimensional Data Model
- Conceptual Modeling of Data Warehouses
- Measures of Data Cube
- Typical OLAP Operations
- OLAP Server Architectures

From Tables and Spreadsheets to Data Cubes
- A data warehouse is based on a multidimensional data model, which views data in the form of a data cube
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  - Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  - The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
Fragments of relations from a
relational database for AllElectronics

Database Queries for AllElectronics
- Show me a list of all items that were sold in the last quarter
- Show me the total sales of the last month, grouped by branch
- How many sales transactions occurred in the month of December?
- Which salesperson had the highest amount of sales?

Typical framework of a data
warehouse for AllElectronics
Provide an analysis of the company’s sales
per item type per branch for the third quarter.

A data cube for AllElectronics

A 2-D view of sales data for AllElectronics according to the
dimensions time and item, where the sales are from Vancouver.
The measure displayed is dollars_sold (in thousands).

A 3-D view of sales data for AllElectronics, according to the
dimensions time, item, and location. The measure displayed is
dollars_sold (in thousands).

A 3-D data cube representation of the data, according to the
dimensions time, item, and location. The measure displayed is
dollars_sold (in thousands).

A 4-D data cube representation of the sales data, according to the
dimensions time, item, location, and supplier. The measure
displayed is dollars_sold (in thousands).

Any n-D data may be displayed as a series of (n-1)-D cubes.


Multidimensional Data Model
- A set of numeric measures (e.g., sales, budget, revenue, inventory) are the objects of analysis
- Each measure depends on a set of dimensions, which provide the context for the measure, e.g., the dimensions of sales are product, location, and time
- Each dimension is described by a set of attributes, e.g., the product dimension consists of category and brand
- Attributes may be related via a hierarchy of relationships, e.g., Day ⊂ Week ⊂ Year
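One way to picture a measure that depends on a set of dimensions is a mapping from dimension-value tuples to numbers. The sketch below is only an illustration: the records, the `build_cube` helper, and the choice of sum as the aggregate are assumptions, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical sales records: (product, location, time, sales).
records = [
    ("TV", "Vancouver", "Q1", 100.0),
    ("TV", "Toronto", "Q1", 80.0),
    ("PC", "Vancouver", "Q1", 250.0),
    ("TV", "Vancouver", "Q1", 20.0),
]


def build_cube(rows):
    """Sum the sales measure for each (product, location, time) cell."""
    cube = defaultdict(float)
    for product, location, time, sales in rows:
        cube[(product, location, time)] += sales
    return dict(cube)


cube = build_cube(records)
# The two TV / Vancouver / Q1 records fall into the same cell.
tv_vancouver_q1 = cube[("TV", "Vancouver", "Q1")]
```

Each key plays the role of one cell of the cube, and the stored number is the measure evaluated in that cell.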

Multidimensional Data Model
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > {Month, Week} > Day
[Figure: sales cube; the surviving axis labels are Product and Month]
A Sample Data Cube
[Figure: 3-D data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). One cell is annotated "Total annual sales of TV in U.S.A."]

Concept Hierarchies
- A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
- The location dimension: “street < city < province_or_state < country”
- The time dimension: “day < {month < quarter; week} < year”
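A concept hierarchy can be sketched as a chain of lookup tables, one per step between adjacent levels. The example below assumes a simplified location hierarchy (city, province_or_state, country); the table contents and the `generalize` helper are illustrative assumptions.

```python
# Hypothetical mappings for part of the location hierarchy.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
}

LEVELS = ["city", "province_or_state", "country"]
# MAPPINGS[i] maps a value at LEVELS[i] to its parent at LEVELS[i + 1].
MAPPINGS = [CITY_TO_PROVINCE, PROVINCE_TO_COUNTRY]


def generalize(value, from_level, to_level):
    """Map a low-level concept to a higher-level, more general concept."""
    start = LEVELS.index(from_level)
    end = LEVELS.index(to_level)
    if end < start:
        raise ValueError("can only generalize upward in the hierarchy")
    for mapping in MAPPINGS[start:end]:
        value = mapping[value]
    return value


country = generalize("Vancouver", "city", "country")
state = generalize("Chicago", "city", "province_or_state")
```

Generalizing in this way is what allows sales recorded per city to be grouped per province or per country.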
A Concept Hierarchy for
the dimension location

A Concept Hierarchy (set-grouping
hierarchy) for the attribute price

Dimensional Modeling Vocabulary
- Fact tables
- Dimension tables
- A fact represents a business measure
- A dimension describes the scope of a fact

Fact Tables
- Each row in a fact table consists of
  - A list of dimension keys
  - A set of measures

Sales – Fact Table
  Dimension keys: Date Key {FK}, Product Key {FK}, Store Key {FK}
  Measures: Quantity Sold, Dollar Sales Amount

Fact Tables
- The most useful measures are numeric and additive, such as dollar sales amount
- Theoretically, measures can be textual
- Practically, textual measures are rare in a DW due to unpredictable content and mostly non-unique values, e.g., a free-text comment
  - To avoid a textual measure, put it into dimension attributes

Fact Tables
- The combination of dimensions defines the grain of the fact table
  - All measurements in a fact table must be at the same grain
- Fact tables tend to be
  - deep in terms of number of rows
  - narrow in terms of number of columns
- Fact tables express M:N relationships between dimensions
Dimension Tables
- Dimension tables contain textual descriptors of the business
- Dimension tables are entry points into the fact table
- Each row in a dimension table consists of
  - A single primary key
  - A set of attributes

Product – Dimension Table
  Primary key: Product Key {PK}
  Attributes: Product Description, SKU Number, Brand Description, Category Description, Department Description, Package Type Description, Package Size, etc.
Dimension Tables
- Dimension tables tend to be
  - shallow in terms of number of rows
  - wide in terms of number of columns (attributes)
- The power of the data warehouse is directly proportional to the quality and depth of the dimension attributes
- The best attributes are textual and discrete, e.g.,
  - ProductDescription should be real words rather than cryptic abbreviations
  - ProductSize (which is numeric) is a discrete and constant descriptor of a specific product

Conceptual Modeling of DW
- In traditional database design, we use a relational schema to represent a conceptual model (e.g., an ER model)
- In data warehouse design, we use a star schema to represent the multidimensional data model
- Alternative schemas: snowflake schema, fact constellation schema

Conceptual Modeling of DW
- Star schema: a fact table in the middle connected to a set of dimension tables
- Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
- Fact constellation schema: multiple fact tables share dimension tables; because it can be viewed as a collection of stars, it is also called a galaxy schema
Example of Star Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country
Star Schema
- Consists of a single fact table and a single table, called a dimension table, for each dimension
- Each dimension table has a non-composite primary key and other attributes
- In the fact table, the PK is made up of two or more FKs that point to the PKs of the dimension tables
- May introduce some redundancy
  - e.g., the location dimension table contains (.., Vancouver, British Columbia, Canada) and (.., Victoria, British Columbia, Canada)
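The star schema rules above can be tried out with an in-memory SQLite database. This is a minimal sketch that assumes only two of the dimensions (item and location) and made-up rows; the column subset, the sample data, and the helper names `build_star` and `dollars_by_country` are illustrative assumptions.

```python
import sqlite3


def build_star(conn):
    """Create a two-dimension star schema and load a few hypothetical rows."""
    conn.executescript(
        """
        CREATE TABLE item (
            item_key  INTEGER PRIMARY KEY,   -- non-composite PK
            item_name TEXT,
            type      TEXT
        );
        CREATE TABLE location (
            location_key INTEGER PRIMARY KEY,
            city         TEXT,
            country      TEXT
        );
        CREATE TABLE sales (                 -- fact table
            item_key     INTEGER REFERENCES item(item_key),
            location_key INTEGER REFERENCES location(location_key),
            units_sold   INTEGER,
            dollars_sold REAL,
            PRIMARY KEY (item_key, location_key)  -- PK built from the FKs
        );
        """
    )
    conn.executemany(
        "INSERT INTO item VALUES (?, ?, ?)",
        [(1, "Laptop", "computer"), (2, "TV", "home entertainment")],
    )
    conn.executemany(
        "INSERT INTO location VALUES (?, ?, ?)",
        [(1, "Vancouver", "Canada"), (2, "Chicago", "USA")],
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [(1, 1, 3, 3000.0), (2, 1, 2, 1600.0), (1, 2, 1, 1000.0)],
    )


def dollars_by_country(conn):
    """Join the fact table to one dimension and aggregate a measure."""
    return conn.execute(
        """
        SELECT l.country, SUM(s.dollars_sold)
        FROM sales AS s
        JOIN location AS l ON s.location_key = l.location_key
        GROUP BY l.country
        ORDER BY l.country
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_star(conn)
result = dollars_by_country(conn)
```

Because country is stored directly in the location table, the query needs a single join. The price is the redundancy noted above: the country value is repeated for every city in it.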

Example of Snowflake Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city: city_key, city, state_or_province, country
Snowflake Schema
- The dimension tables of the snowflake model may be kept in normalized form to reduce redundancies.
- The snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query.
- Although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design.
Example of Fact Constellation
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
  Keys: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type
Measures of Data Cube
- A multidimensional point in the data cube space can be defined by a set of dimension-value pairs.
  - <time=“Q1”, location=“Vancouver”, item=“computer”>
- A data cube measure is a numerical function that can be evaluated at each point in the data cube space
- A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point
Measures of Data Cube: Three Categories
- Distributive: if the result derived by applying the function to n aggregate values (one per partition of the data) is the same as that derived by applying the function to all the data without partitioning
  - E.g., count(), sum(), min(), max()
- Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  - E.g., avg(), min_N(), standard_deviation()
- Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
  - E.g., median(), mode(), rank()
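The three categories can be compared on a small partitioned data set. The sketch below assumes six made-up values split into three partitions; it shows sum() as distributive, avg() as algebraic (built from the distributive sum and count), and median() as holistic.

```python
import statistics

# Hypothetical data, split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: summing the per-partition sums gives the overall sum.
total = sum(sum(part) for part in partitions)

# Algebraic: avg() is computed from M = 2 distributive aggregates
# (sum and count) kept per partition.
partials = [(sum(part), len(part)) for part in partitions]
average = sum(s for s, _ in partials) / sum(c for _, c in partials)

# Holistic: the median of the partition medians is, in general, not the
# median of all the data, so no fixed-size subaggregate is enough.
true_median = statistics.median(all_values)
median_of_medians = statistics.median(
    [statistics.median(part) for part in partitions]
)
```

Here the per-partition (sum, count) pairs are enough to recover the exact average, whereas combining the per-partition medians gives a different value from the true median.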
OLAP Queries
- Find the total sales
- Find total sales by month
- Find total sales by quarter
- Find total sales by month for each city
- Find the top five products ranked by total sales
- Find the percentage change in the total monthly sales for each product
- Etc.

Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate):
  - reorient the cube for visualization, e.g., a 3D cube to a series of 2D planes
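Roll_up & Drill_down
[Figure: levels of the time hierarchy, from most summarized to most detailed]
- Year: total sales by year
- Quarter: total sales by quarter
- Month: total sales by month
- Date: total sales by date
Drill_down moves down this list (from year toward date); Roll_up moves up it (from date toward year).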
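Rolling up along the time hierarchy amounts to re-aggregating the detailed cells under a shorter key. The sketch below assumes made-up daily sales keyed by (year, quarter, month, day); the data and the `roll_up` helper are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical detailed data: (year, quarter, month, day) -> dollars sold.
daily_sales = {
    (2006, "Q1", "Jan", 5): 100.0,
    (2006, "Q1", "Jan", 20): 50.0,
    (2006, "Q1", "Feb", 3): 75.0,
    (2006, "Q2", "Apr", 9): 200.0,
}

TIME_LEVELS = ("year", "quarter", "month", "date")


def roll_up(cells, level):
    """Summarize the cells at the given level of the time hierarchy."""
    depth = TIME_LEVELS.index(level) + 1
    totals = defaultdict(float)
    for key, dollars in cells.items():
        totals[key[:depth]] += dollars  # drop the levels below `level`
    return dict(totals)


by_month = roll_up(daily_sales, "month")
by_quarter = roll_up(daily_sales, "quarter")
by_year = roll_up(daily_sales, "year")
```

Drilling down from quarters back to months is not possible from `by_quarter` alone; in this sketch it means calling `roll_up(daily_sales, "month")` on the detailed data again, which is why the lower-level data has to be kept.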
Pivoting
- Rotate the data axes in view in order to provide an alternative presentation of the data.

Before pivoting (rows: Location, columns: Time)
             Q1   Q2   Q3   Q4
Chicago      63   38   75   58
New York     81  107   35   67
Toronto     144  145  110  120
Vancouver   144  145  110  120

After pivoting (rows: Time, columns: Location)
      Chicago  New York  Toronto  Vancouver
Q1         63        81      144        144
Q2         38       107      145        145
Q3         75        35      110        110
Q4         58        67      120        120
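For a 2-D view, pivoting is a transpose of the table. The sketch below uses the values shown on this slide; the variable names and the `pivot` helper are assumptions made for illustration.

```python
LOCATIONS = ["Chicago", "New York", "Toronto", "Vancouver"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]

# Rows are locations, columns are quarters (values from the slide).
by_location = [
    [63, 38, 75, 58],      # Chicago
    [81, 107, 35, 67],     # New York
    [144, 145, 110, 120],  # Toronto
    [144, 145, 110, 120],  # Vancouver
]


def pivot(table):
    """Swap the row and column axes of a 2-D view."""
    return [list(column) for column in zip(*table)]


# Rows are now quarters, columns are locations.
by_time = pivot(by_location)
q1_row = dict(zip(LOCATIONS, by_time[0]))
```

No value changes during a pivot; only the orientation of the axes does, so pivoting twice returns the original view.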

Slice and Dice
- Take a projection of the data on a subset of dimensions for selected values of the other dimensions
- Slice: equality selection on one or more dimensions, possibly also with some dimensions projected out
- Dice: range selection
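Slice and dice can be sketched as two filters over a cube stored as a dictionary. The example assumes cells keyed by (time, location, item) with made-up values; `slice_cube` and `dice` are hypothetical helper names, and the dice here selects a set of allowed values per dimension as a stand-in for a range.

```python
# Hypothetical cube: (time, location, item) -> dollars sold (illustrative).
cube = {
    ("Q1", "Vancouver", "computer"): 825,
    ("Q1", "Vancouver", "phone"): 14,
    ("Q1", "Toronto", "computer"): 700,
    ("Q2", "Vancouver", "computer"): 900,
    ("Q3", "Chicago", "security"): 300,
}


def slice_cube(cells, axis, value):
    """Equality selection on one dimension, which is then projected out."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cells.items()
        if key[axis] == value
    }


def dice(cells, allowed):
    """Keep cells whose value on each listed axis is in the allowed set."""
    return {
        key: measure
        for key, measure in cells.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }


q1_slice = slice_cube(cube, 0, "Q1")
sub_cube = dice(
    cube,
    {
        0: {"Q1", "Q2"},
        1: {"Toronto", "Vancouver"},
        2: {"home entertainment", "computer"},
    },
)
```

The slice returns a cube with one dimension fewer (time is fixed at Q1), while the dice returns a smaller cube that still has all three dimensions.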

Typical OLAP Operations
[Figure: a central sales cube with dimensions location (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1 to Q4), and item (types: home entertainment, computer, phone, security), together with the cubes that result from applying each of the following operations to it]
- roll-up on location (from cities to countries)
- drill-down on time (from quarters to months)
- slice for time = “Q1”
- dice for (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”)
- pivot
Other OLAP Operations
- Drill across: involving (across) more than one fact table
- Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

OLAP Server Architectures
- Relational OLAP (ROLAP)
  - Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware
  - Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
  - Greater scalability
- Multidimensional OLAP (MOLAP)
  - Sparse array-based multidimensional storage engine
  - Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  - Flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers (e.g., Red Brick)
  - Specialized support for SQL queries over star/snowflake schemas