You are on page 1of 36

Lecture 05

Data Warehouse Architecture (cont’d.)


Logical Model
Summary – last week
• Last week:
– Distributed DWH
– Data Modeling – Conceptual Modeling
• Multidimensional Entity Relationship (ME/R) Model

• This week:
– Logical Model

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 2
Logical Model
• Goa l of the L og ica l M odel
– Confirm the subject areas e.g. sales or inventory
– Create ‘real’ facts and dim ensions from the subjects
that we have identified. Facts examples are how much I
sold? How much profit I got? While dimensions examples
are when I did sell this item? or where I did sell this?
– Establish the needed g ranularity for our dimensions e.g.
time dimension can go for yearsà monthsà weeks à
days etc.

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 3
Logical Model (cont’d.)
• L og ica l structure of the
multidimensional model
– Cubes:Sales,Purchase,Price,Inventory
– Dimensions:Product,Time,Geography,Client
Cube

Dimensions
Sales Fact
Classification levels
w hom
To Turnover
Client e
her
w
Country Region District City Store Purchase
Amount

at
wh
Prod. Categ Prod. Family Prod. Group Article
Price
Unit Price

en
Week

wh
Year Quarter Month Day
Inventory
Stock

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 4
Dimensions
• Dim ensions a re…
analysis purpose chosen entities,
within the data model
– One dimension can be used to define
m ore than one cube
– They are hierarchically organized Sales
Turnover

Purchase
Prod. Categ Prod. Family Prod. Group Article
Amount

Price
Unit Price

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 5
Dimensions (cont’d.)
• Dimension hierarchies are organized in
cla ssifica tion levels (e.g.,Day,Month,…)
– The dependencies between the classification levels are
described by the classification schem a through
functional dependencies
• An attribute B is functionally dependent on an attribute A,
denoted A ⟶ B, if for all a Î dom(A) there exists exactly
one b Î dom(B) corresponding to it

Week

Year Quarter Month Day

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 6
Dimensions (cont’d.)
• Cla ssifica tion schem a s
– The classification schema of a dimension D is a semi-
ordered set of classification levels
({D.K0,…,D.Kk}, ⟶ )
– With a sm allest elem ent D.K0,
i.e.there is no classification level
with smaller granularity
Semi-order

Week

Year Quarter Month Day

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 7
Dimensions (cont’d.)
• A fully-ordered set of classification levels is
called a Pa th
– If we consider the classification schem a of the
time dimension,then we have the following paths
• T.Day T.Week
• T.Day T.Month T.Quarter T.Year
path

Year Quarter Month Day

Week

– HereT.Day is the sm allest elem ent


Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 8
Dimensions (cont’d.)
• Cla ssifica tion hiera rchies
– Let D.K0 ⟶ …⟶ D.Kk be a path in the classification
schema of dimension D
– A classification hierarchy concerning these path is
a balanced tree which
• Has as nodes dom(D.K0) … dom(D.Kk) {ALL}
• And its edges respect the functional dependencies

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 9
Dimensions (cont’d.)
• E xa m ple: classification hierarchy from the
path product dimension
Prod. Categ Prod. Family Prod. Group Article

ALL

Category Electronics Clothes



Video Audio TV Prod. Family
… …
Video
Camcorder Prod. Group
recorder

TR-34 TS-56 Article

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 10
Cubes
• Cube represent the basic unit of the
multidimensional paradigm
• They store one or more measures (e.g. the
turnover for sale) in raw or pre-aggregated form
• More formally a cube C is a set of cube cells
C Í dom(G) x dom(M), where
G= (D1.K1, …, Dn.Kn) is a set of granularities,
M= (M1, …, Mm) the set of m ea sures
• E.g. Sales((Store, Article, Quarter), (Turnover))

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 11
Cubes (cont’d.)
• The coordina tes of a cell are the classification
nodes from dom(G) corresponding to the cell
– Sales ((Store, Article, Quarter),(Turnover))
–…

Classification levels Dimensions

Country Region District City Store whe


re

what Sales
Prod. Categ Prod. Family Prod. Group Article
Turnover
Week when

Year Quarter Month Day

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 12
Cubes (cont’d.)

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 13
Cubes (cont’d.)
• 4 dimensions (supplier,city,quarter,product)

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 14
Cubes (cont’d.)
• We can now imagine n-dim ensiona l cubes
– n-D cube is called a base cuboid
– The top most cuboid, the 0-D, which holds the highest
level of summarization is called apex cuboid
– The full data cube is
formed by the
lattice of cuboids
city

items

city

year
items

city
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 15
Cubes (cont’d.)
• But things can get complicated pretty fast
all
0-D(apex) cuboid

time item location supplier


1-D cuboids

time,item time,location item,location location,supplier


2-D cuboids
time,supplier item,supplier

time,location,supplier
time,item,location
3-D cuboids
time,item,supplier item,location,supplier

time, item, location, supplier 4-D(base) cuboid

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 16
Basic Operations
• B a sic opera tions of the multidimensional model
on the logical level
– Selection
– Projection
– Cube join
– Sum
– Aggregation

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 17
Basic Operations (cont’d.)
• Multidimensional Selection
– The selection on a cube C((D1.K1,…, Dg.Kg),
(M1, …, Mm)) through a predicate P,is defined as
σP(C) = {z Є C:P(z)}, if all variables in P are either:
• Classification levels K ,which functionally depend on a
classification level in the granularity of K,i.e.Di.Ki ⟶ K
• Measures from (M1, …, Mm)
–E.g.σP.Prod_group= “Video”(Sales)

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 18
Basic Operations (cont’d.)
• Multidimensional projection
– The projection of a function of a measure
F(M) of cube C is defined as
• rrF(M)(C) = { (g,F(m)) Î dom(G) x dom(F(M)): (g,m)
Î C}
– E.g. projection rrturnover, sold_items(Sales)

Sales
Turnover
Sold_items

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 19
Basic Operations (cont’d.)
• Join opera tions between cubes is usual
– E.g. if turnover would not be provided, it could be
calculated with the help of the unit price from the
price cube
• 2 cubes C1(G1, M1) and C2(G2, M2) can only be
joined,if they have the same granularity
(G1= G2 = G)
– C1⋈C2= C(G, M1u M2)
Sales Price

Units_Sold Unit Price

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 20
Basic Operations (cont’d.)
• When the granularities are different,but we still
need to join the cubes,ag g reg a tion has to
be performed
– E.g.,Sales ⋈ Inventory: aggregate Sales((Day,Article,
Store,Client)) to Sales((Month,Article,Store,Client))

Client
Sales
Turnover
Country Region District City Store

Prod. Categ Prod. Family Prod. Group Article


Inventory
Stock
Week

Year Quarter Month Day

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 21
Basic Operations (cont’d.)
• A g g reg a tion: most important operation
for OLAP operations
• Aggregation functions
– Build a single values from set of value,e.g.in SQL:
SUM,AVG,Count,Min,Max
–Example:SUM(P.Product_group, G.City, T.Month)(Sales)

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 22
Change Support
• Classification schema,cube schema,classification
hierarchy are all designed in the building phase
and considered a s fix
– Practice has proven otherwise
DW grow old,too
– Changes are strongly connected to the time factor
– This lead to the time validity of these concepts
• Reasons for schem a m odifica tion
– New requirements
– Modification of the data source

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 23
Classification Hierarchy
• E.g.S a turn sells a lot of electronics
– Lets consider mobile phones
• They built their DW on 01.03.2003
• A classification hierarchy of their
data until 01.07.2008 could look
like this:
Mobile Phone

GSM 3G

BlackBerry
Nokia 3600 O2 XDA
Bold

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 24
Classification Hierarchy (cont’d.)
• After 01.07.2008 3G becomes hip and affordable
and many phone makers start migrating towards
3G capa ble phones
– Lets say O2 makes its XDA 3G capable

Mobile Phone

GSM 3G

BlackBerry
Nokia 3600 O2 XDA
Bold

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 25
Classification Hierarchy (cont’d.)
• After 01.04.2010 phone makers already develop
4G capa ble phones

Mobile Phone

GSM 3G 4G

BlackBerry Best phone


Nokia 3600 O2 XDA
Bold ever

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 26
Classification Hierarchy (cont’d.)
• It is important to tra ce the evolution of
the data
– It can explain which data was available at which
moment in time
– Such a versioning system of the
classification hierarchy can be performed by
constructing a validity m atrix
• When is something,valid?
• Use timestamps to mark it!

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 27
Classification Hierarchy (cont’d.)
• A nnota ted Cha ng e da ta

Mobile Phone
[01.03.2003, ∞) [01.04.2010, ∞)
[01.04.2005, ∞)

GSM 3G 4G

[01.04.2005, ∞) [01.07.2008, ∞)
[01.03.2006, ∞) [01.04.2010, ∞)
[01.03.2003, 01.07.2008)

BlackBerry Best phone


Nokia 3600 O2 XDA
Bold ever

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 28
Classification Hierarchy (cont’d.)
• The tree can be stored as dimension metadata
– The storage form is a validity m atrix
• Rows are parent nodes
• Columns are child nodes
GSM 3G 4G Nokia 3600 O2 XDA Berry Bold Best phone

Mobile [01.03.2003, ∞) [01.04.2005, ∞) [01.04.2010, ∞)


phone
GSM [01.04.2005, ∞) [01.03.2003,
01.07.2008)
3G [01.07.2008, ∞) [01.03.2006, ∞)
4G [01.04.2010, ∞)

Nokia 3600

O2 XDA
Berry Bold
Best phone

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 29
Schema Modification
• Improper modification of a schema (deleting a
dimension) can lead to
– Data loss
– Inconsistencies
• Data is incorrectly aggregated or adapted
• Proper schema modification is complex but
– It brings flexibility for the end user
• The possibility to ask“As Is vs. AsWas” queries and so on
• Alternatives
– Schema evolution
– Schema versioning

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 30
Schema Modification (cont’d.)
• Schema evolution
– Modifications can be performed without data loss
– It involves schema modification and data adaptation
to the new schema
– This data adaptation process is called Instance
adaptation

Sales
Prod. Categ Prod. Family Prod. Group Article
Turnover

Purchase
Amount

Price
Unit Price

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 31
Schema Modification (cont’d.)
• Schema evolution
– Advantage
• Faster to execute queries in DW with many schema
modifications
– Disadvantages
• It limits the end user flexibility to query based on the past
schemas
• Only actual schema based queries are supported

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 32
Schema Modification (cont’d.)
• S chem a versioning
– Also no data loss
– All the data corresponding to all the schemas are
always available
– After a schema modification the data is held in their
belonging schema
• Old data - old schema
Sales
Prod. Categ Prod. Family Prod. Group Article
Turnover

• New data - new schema


Sales Purchase
Prod. Categ Prod. Group Article Amount
Turnover

Purchase Price
Amount Unit Price

….
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 33
Schema Modification (cont’d.)
• Schema versioning
– Advantages
• Allows higher flexibility,e.g.,“As Is vs.AsWas”,etc.queries
– Disadvantages
• Adaptation of the data to the queried schema is done on
the spot
• This results in longer
query run time

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 34
Summary

• Logical Model
– Cubes,Dimensions,Hierarchies,
Classification Levels

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 35
Next Lecture
• Physical Model
– Relational Implementation through:
• Star schema:improves query performance for often-
used data
• Snowflake schema:reduce the size of the dimension
tables

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 36

You might also like