
Topics to cover

Understanding key concepts in dimensional modeling
Importance of dimensional modeling
Dimensional modeling vs ER modeling
Types of dimensional models
A DWH as a Multidimensional Model

Chapters 10 and 11

02/17/24 2
Data Warehouse Design
• Designing the data warehouse is a key issue in the
DWH process.

• Although a DWH can be designed by entity
relationship modeling, many DWH experts, including
Kimball et al., prefer dimensional modeling.

Dimensional Model vs ER model
ER models are not appropriate for Data Warehouses.
ER modeling does not really model a business; rather,
it models the micro relationships among data
elements.

ER models are wildly variable in structure. As such, it
is extremely difficult to optimize query performance.

What is a Dimensional Model?
• A dimensional model is a star schema that
contains two types of tables: fact tables and
dimension tables.
1. Fact table (quantitative): the primary table in a
dimensional model, where the numerical performance
measurements of the business are stored, i.e. its
attributes are numeric and additive.
Example: quantity sold, dollar sales amount.

2. Dimension table (descriptive): a table that
contains the textual descriptors of the business.
Example: product and brand descriptions.
• A fact table is a table that contains the measures of interest. For example,
sales amount would be such a measure.
• This measure is stored in the fact table at the appropriate granularity.
For example, it can be sales amount by store by day. In this case, the fact table
would contain three columns: a date column, a store column, and a sales
amount column.
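
As a minimal sketch of the store-by-day grain described above, the following Python snippet uses the standard-library sqlite3 module. The table name sales_fact, its column names, and the sample rows are illustrative assumptions, not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per store per day: that is the grain of this fact table.
conn.execute(
    """
    CREATE TABLE sales_fact (
        sale_date    TEXT NOT NULL,   -- date column
        store        TEXT NOT NULL,   -- store column
        sales_amount REAL NOT NULL,   -- the measure
        PRIMARY KEY (sale_date, store)
    )
    """
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("2024-02-15", "Store A", 120.0),
        ("2024-02-15", "Store B", 80.0),
        ("2024-02-16", "Store A", 100.0),
    ],
)


def total_sales_by_store(connection):
    """Add the measure up across days, leaving one total per store."""
    rows = connection.execute(
        "SELECT store, SUM(sales_amount) FROM sales_fact "
        "GROUP BY store ORDER BY store"
    )
    return dict(rows.fetchall())


totals = total_sales_by_store(conn)
```

Because the measure is additive, summing it over the date column gives a meaningful total per store.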
SO WHAT IS A FACT?
• Numeric
• Additive across dimensions. Hundreds or thousands of records are
fetched from the database, and the most useful thing to do with so many records is to
add them up.
• Primary keys of dimensions (surrogate keys) become foreign keys in the
fact table.
• Consider rolling summaries
Rolling summary in a time dimension

Days:     Day 1, Day 2, Day 3, Day 4, Day 5, Day 6, Day 7
Weeks:    Week 1, Week 2, Week 3, Week 4
Months:   Month 1, Month 2, Month 3, ..., Month 12
Quarters: Qtr 1, Qtr 2, Qtr 3, Qtr 4
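
A rolling summary can be sketched as repeated addition of daily amounts up the time dimension. The snippet below is only an illustration: the daily_sales values and the roll_up helper are hypothetical, and only the month and quarter levels are shown.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales amounts (the finest level of the time dimension).
daily_sales = {
    date(2024, 1, 30): 10.0,
    date(2024, 1, 31): 20.0,
    date(2024, 2, 1): 5.0,
    date(2024, 4, 2): 7.0,
}


def roll_up(daily, level):
    """Sum daily amounts up to 'month' or 'quarter' within each year."""
    totals = defaultdict(float)
    for day, amount in daily.items():
        if level == "month":
            key = (day.year, day.month)
        elif level == "quarter":
            key = (day.year, (day.month - 1) // 3 + 1)
        else:
            raise ValueError(f"unsupported level: {level}")
        totals[key] += amount
    return dict(totals)


monthly = roll_up(daily_sales, "month")
quarterly = roll_up(daily_sales, "quarter")
```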

• A dimension table provides detailed information about
the attributes. For example, the dimension table for the
Quarter attribute would include a list of all of the quarters
available in the data warehouse.

• Each row (each quarter) may have several fields: one for the
unique ID that identifies the quarter, and one or more
additional fields that specify how that particular quarter is
represented on a report (for example, the first quarter of 2001 may
be represented as "Q1 2001" or "2001 Q1").
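
The Quarter dimension table described above could be generated along these lines. This is a sketch under assumptions: the function name quarter_dimension and the field names quarter_id, label_q_first, and label_year_first are invented for illustration.

```python
def quarter_dimension(first_year, last_year):
    """Build one row per quarter: a unique ID plus two report labels."""
    rows = []
    quarter_id = 1
    for year in range(first_year, last_year + 1):
        for quarter in range(1, 5):
            rows.append(
                {
                    "quarter_id": quarter_id,                     # unique ID
                    "year": year,
                    "quarter": quarter,
                    "label_q_first": f"Q{quarter} {year}",        # e.g. "Q1 2001"
                    "label_year_first": f"{year} Q{quarter}",     # e.g. "2001 Q1"
                }
            )
            quarter_id += 1
    return rows


quarters = quarter_dimension(2001, 2002)
```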

Dimensional modeling is the process and outcome of designing
logical database schemas created to support OLAP and Data
Warehousing solutions.

It is especially useful for summarizing and rearranging the data
and presenting views of the data to support data analysis.

Dimension: A category of information, for example, the time dimension.

Attribute: A unique level within a dimension, for example, Month is an
attribute in the Time Dimension.

Hierarchy: The specification of levels that represents the relationship
between different attributes within a dimension. For example, one
possible hierarchy in the Time dimension is Year → Quarter → Month →
Day.
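
One way to picture the Year → Quarter → Month → Day hierarchy is as an ordered list of attribute names, from the most general level to the most detailed. The names TIME_HIERARCHY, time_attributes, and path_to_level below are hypothetical and serve only as an illustration.

```python
from datetime import date

# Levels of the Time dimension, from the top of the hierarchy to the bottom.
TIME_HIERARCHY = ["year", "quarter", "month", "day"]


def time_attributes(d):
    """Derive each attribute of the Time dimension from a calendar date."""
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
    }


def path_to_level(d, level):
    """Attribute values from the top of the hierarchy down to `level`."""
    attrs = time_attributes(d)
    depth = TIME_HIERARCHY.index(level) + 1
    return tuple(attrs[name] for name in TIME_HIERARCHY[:depth])


month_path = path_to_level(date(2001, 5, 17), "month")
day_path = path_to_level(date(2001, 5, 17), "day")
```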

Issues to note:
1. Dimensions and hierarchies are represented by
dimension tables.
2. Attributes are the non-key columns in the
dimension tables.
3. Fact tables connect to one or more dimension
tables, but fact tables do not have direct
relationships to one another.

Dimensional Modeling

Dimensions, with the attributes in their hierarchy:
Time:      Year → Quarter → Month
Locations: Country → District → Village

Measured facts: annual sales amounts per village

Representation of an information package


Some examples:
– The analysis of product sales to a customer during
the last six months has three dimensions: customer, product,
and time.

– The analysis of product sales to a customer in a
district during the last six months has four dimensions:
customer, product, region, and time.

– The measured facts are the sales amounts.

• As a rule, time is always one of the dimensions.
Dimensional Modeling
Data Granularity

The grain defines the level of detail of a single
record in the fact table.

The more detail there is in the fact table, the
higher its granularity and vice versa.

Dimensional Modeling
Data Granularity example
• Suppose a fact table contains three metrics (Unit Price, Units Sold
and Total Sale Amount).
• The Time dimension consists of four hierarchical elements (Year,
Quarter, Month and Day).
• The Organization dimension consists of three hierarchical
elements (Region, District and Store).
• The Product dimension consists of two hierarchical elements
(Product Family and SKU (Stock Keeping Unit)).

• The highest granularity at which we can store Sales metrics is by
Day/Store/SKU (i.e., the lowest level in each dimensional hierarchy).
• Conversely, the lowest granularity to which we can aggregate Sales
metrics in this data mart is by Year/Region/Product Family (i.e.,
the highest level in each dimensional hierarchy).
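
The two ends of this granularity range can be sketched in a few lines of Python. Everything concrete here is assumed for illustration: the store-to-region and SKU-to-family lookups, the sample fact rows, and the helper name aggregate_to_top. Unit Price is left out of the aggregation because a price is not additive.

```python
from collections import defaultdict

# Hypothetical dimension lookups: lowest level -> highest level.
store_to_region = {"S1": "North", "S2": "North", "S3": "South"}
sku_to_family = {"SKU-1": "Beverages", "SKU-2": "Beverages", "SKU-3": "Snacks"}

# Fact rows at the finest grain (Day/Store/SKU):
# (day, store, sku, units_sold, total_sale_amount)
facts = [
    ("2024-01-05", "S1", "SKU-1", 2, 4.0),
    ("2024-01-05", "S2", "SKU-2", 1, 3.0),
    ("2024-06-10", "S3", "SKU-3", 5, 10.0),
    ("2023-12-31", "S1", "SKU-1", 1, 2.0),
]


def aggregate_to_top(rows):
    """Roll the additive metrics up to Year/Region/Product Family."""
    totals = defaultdict(lambda: [0, 0.0])
    for day, store, sku, units, amount in rows:
        key = (day[:4], store_to_region[store], sku_to_family[sku])
        totals[key][0] += units
        totals[key][1] += amount
    return {key: tuple(value) for key, value in totals.items()}


summary = aggregate_to_top(facts)
```

Going the other way is not possible: once only the Year/Region/Product Family totals are stored, the Day/Store/SKU detail cannot be recovered.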
Dimensional Model vs ER model
The key to understanding the relationship between
DM and ER is that a single ER diagram breaks down
into multiple DM diagrams, or ‘stars’.

Think of a large ER diagram as representing every
possible business process within an application. The
ER diagram may have Sales Calls, Order Entries,
Shipment Invoices, Customer Payments, and Product
Returns, all on the same diagram.

Dimensional Model vs ER model
To create the individual ‘stars’ that exist within an
application:
• Look for many-to-many relationships in the ER model
containing numeric and additive facts and designate
them as fact tables.
• Alternatively, look for ‘events’ or ‘transactions’; these
may also be facts.
• De-normalize all of the remaining tables into flat
tables with single-part keys that connect directly to the
fact tables. These tables become the dimension tables.
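
The de-normalization step can be sketched with sqlite3. The normalized product, product_type, and product_group tables, their columns, and the sample rows are assumptions made for this illustration. For brevity the source product_id is reused as the single-part key, where a real warehouse would usually assign a surrogate key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized ER-style tables (hypothetical names and rows).
    CREATE TABLE product_group (group_id INTEGER PRIMARY KEY, group_name TEXT);
    CREATE TABLE product_type (
        type_id INTEGER PRIMARY KEY, type_name TEXT,
        group_id INTEGER REFERENCES product_group(group_id));
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY, product_name TEXT,
        type_id INTEGER REFERENCES product_type(type_id));

    INSERT INTO product_group VALUES (1, 'Electronics');
    INSERT INTO product_type VALUES (10, 'Laptop', 1), (11, 'Phone', 1);
    INSERT INTO product VALUES (100, 'Model X', 10), (101, 'Model Y', 11);

    -- De-normalize the chain into one flat table with a single-part key.
    CREATE TABLE product_dimension AS
    SELECT p.product_id   AS product_key,
           p.product_name AS product_name,
           t.type_name    AS product_type,
           g.group_name   AS product_group
    FROM product p
    JOIN product_type t ON t.type_id = p.type_id
    JOIN product_group g ON g.group_id = t.group_id;
    """
)

dimension_rows = conn.execute(
    "SELECT product_key, product_name, product_type, product_group "
    "FROM product_dimension ORDER BY product_key"
).fetchall()
```

Each row of product_dimension now carries its type and group directly, so a fact table needs only one join to reach any product attribute.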

[Figure: a single ER diagram broken down into separate stars: Shipments, Returns, Orders, Sales Contact, Payments]

ERD versus DM

ERD entities: Product-group, Product-type, Product, Customer, Region, Order, Order-line
DM tables:    Product_dimension, Customer_dimension, Time_dimension, Order_fact
1. Logical model is easy to understand
• Provides a predictable and standard framework for end user apps. Report
writers, query tools, and user interfaces can all make strong assumptions about the
dimensional model to make the user interfaces more understandable.
• The model can be designed (mostly) independently of expected queries, since it
withstands unexpected changes in user behavior.
• Handles changes easily, such as adding new dimensional attributes, since
there is no need to reload data and no need to reprogram query tools.
2. Optimized for performance
• High performance “browsing” across the attributes
• Strategy for handling aggregates, i.e. summary records that are logically
redundant with base data already in the data warehouse but are used to
enhance query performance.
• OLAP engines can make processing more efficient
3. Historical tracking of information
– Strategies for handling changing dimensions
– Fact design allows high volume snapshots and transaction tracking

ER Modeling vs Dimensional modeling
(Relational DM compared with Dimensional DM)

1. Relational:  Data is stored in an RDBMS.
   Dimensional: Data is stored in an RDBMS or in multidimensional databases.
2. Relational:  Tables are units of storage.
   Dimensional: Cubes are units of storage.
3. Relational:  Data is normalized and used for OLTP. Optimized for OLTP processing.
   Dimensional: Data is denormalized and used in data warehouses and data marts. Optimized for OLAP.
4. Relational:  Several tables and chains of relationships among them.
   Dimensional: Few tables; fact tables are connected to dimension tables.
5. Relational:  Volatile (several updates).
   Dimensional: Non-volatile.
6. Relational:  User is usually constrained by an application that understands the data design. Users are typically operations staff.
   Dimensional: The simpler data design makes it easier for users to analyze data in any way they choose. Users are typically analysts, company strategists, or even executives.
ER Modeling vs Dimensional modeling (continued)
(Relational DM compared with Dimensional DM)

7. Relational:  SQL is used to manipulate data.
   Dimensional: MDX is used to manipulate data.
8. Relational:  Detailed level of transactional data.
   Dimensional: Summary of bulky transactional data (aggregates and measures) used in business decisions.
9. Relational:  Normal reports.
   Dimensional: User friendly, interactive, drag and drop multidimensional OLAP reports.
10. Relational:  Typical data design used for business transaction systems.
    Dimensional: Data design used for analysis systems.
11. Relational:  Goal: reduce every piece of information to its simplest form, e.g. a debit transaction, a customer record, an address.
    Dimensional: Goal: break up information into ‘Facts’ (things a company measures) and ‘Dimensions’ (how we measure them: by time, region, or customer).
12. Relational:  Suited for concurrent handling of many small transactions by many users. Only a limited amount of data history is normally kept.
    Dimensional: Suited for reading or analyzing large amounts of data by a modest number of users. Many years of data history may be kept.
There are three basic types of dimensional models:

1. Star model
2. Snowflake model
3. Fact constellation model

• In a star schema, a single object (the fact table) sits in the middle and is
radially connected to other surrounding objects
(dimension tables) like a star.

• Each dimension is represented as a single table. The primary
key in each dimension table is related to a foreign key in the
fact table.
• A simple star consists of one fact table; a complex star can
have more than one fact table.
• Note that different dimensions are not related to one
another.

Example of Star Schema

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
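
A rough sqlite3 sketch of this star follows. The _dim suffix on table names, the column types, and all sample rows are assumptions; the avg_sales measure is omitted here on the assumption that it can be derived from the other two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
        day_of_the_week TEXT, month INTEGER, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
        brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY,
        branch_name TEXT, branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
        city TEXT, province_or_state TEXT, country TEXT);
    -- The fact table holds one foreign key per dimension plus the measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        branch_key INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        units_sold INTEGER, dollars_sold REAL);
    """
)
# Hypothetical sample rows.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 15, "Thursday", 2, "Q1", 2024),
    (2, 16, "Friday", 2, "Q1", 2024),
])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "Laptop A", "BrandX", "computer", "wholesale"),
    (2, "Phone B", "BrandY", "phone", "retail"),
])
conn.execute("INSERT INTO branch_dim VALUES (1, 'Main', 'flagship')")
conn.execute(
    "INSERT INTO location_dim VALUES (1, '1 High St', 'Kampala', 'Central', 'Uganda')"
)
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 3, 3000.0),
    (2, 1, 1, 1, 2, 2000.0),
    (2, 2, 1, 1, 4, 1200.0),
])

# One join from the fact table to a dimension is enough to group by brand.
sales_by_brand = conn.execute(
    """
    SELECT i.brand, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item_dim i ON i.item_key = f.item_key
    GROUP BY i.brand
    ORDER BY i.brand
    """
).fetchall()
```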
Star Schema with Sample Data
[Figure: star schema populated with sample rows; key values such as 001, 002, 003 link fact rows to dimension rows]

Relationship of a Star Schema model to a Report
A star schema answers questions of what, when, by whom, and to
whom.
Results are obtained by joining one or more
dimension tables with the fact table.
Example
The Marketing Dept wants to know the quantity and
order amount of PCs sold to customers who are
married, by salespersons in the Makerere region,
in the month of March.

Relationship of a Star Schema model to a Report

Order Facts Table
  Keys:     Product Key (FK), Time Key (FK), Customer Key (FK), Sales Person Key (FK)
  Measures: Order Shillings, Cost Shillings, Margin Shillings, Quantity

Product Dimension Table
  PK Product Key; Product Name, Product Code, Product Line, Brand
  Filter: Product Name = PCs

Customer Dimension Table
  PK Customer Key; Customer Name, Customer Code, Marital Status, Address, Town
  Filter: Marital Status = Married

Time Dimension Table
  PK Time Key; Date, Month, Quarter, Year
  Filter: Month = March

Sales Person Dimension Table
  PK Sales Person Key; Sales Person Name, Territory Name, Region Name
  Filter: Region Name = Makerere
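
The Marketing Dept report can be sketched as one query that joins the fact table to all four dimensions and filters on each. The snake_case table and column names, the trimmed column lists, and the sample rows below are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY,
        customer_name TEXT, marital_status TEXT);
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE sales_person_dim (sales_person_key INTEGER PRIMARY KEY,
        sales_person_name TEXT, region_name TEXT);
    CREATE TABLE order_facts (product_key INTEGER, time_key INTEGER,
        customer_key INTEGER, sales_person_key INTEGER,
        order_shillings REAL, quantity INTEGER);

    INSERT INTO product_dim VALUES (1, 'PCs'), (2, 'Printers');
    INSERT INTO customer_dim VALUES (1, 'Customer A', 'Married'),
                                    (2, 'Customer B', 'Single');
    INSERT INTO time_dim VALUES (1, 'March', 2024), (2, 'April', 2024);
    INSERT INTO sales_person_dim VALUES (1, 'Rep A', 'Makerere'),
                                        (2, 'Rep B', 'Other');
    INSERT INTO order_facts VALUES
        (1, 1, 1, 1, 5000.0, 2),   -- matches every filter
        (1, 1, 1, 1, 2500.0, 1),   -- matches every filter
        (1, 1, 2, 1, 9000.0, 3),   -- customer is not married
        (2, 1, 1, 1, 700.0, 1),    -- product is not PCs
        (1, 2, 1, 1, 4000.0, 2),   -- month is not March
        (1, 1, 1, 2, 6000.0, 2);   -- region is not Makerere
    """
)

REPORT_SQL = """
    SELECT SUM(f.quantity), SUM(f.order_shillings)
    FROM order_facts f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN customer_dim c ON c.customer_key = f.customer_key
    JOIN time_dim t ON t.time_key = f.time_key
    JOIN sales_person_dim s ON s.sales_person_key = f.sales_person_key
    WHERE p.product_name = 'PCs'
      AND c.marital_status = 'Married'
      AND t.month = 'March'
      AND s.region_name = 'Makerere'
"""

quantity, order_amount = conn.execute(REPORT_SQL).fetchone()
```

Each WHERE condition corresponds to one of the filters on the dimension tables, and the two SUM expressions are the quantity and order amount the report asks for.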

1. Easy to understand
2. Easy to define hierarchies
3. Reduces number of physical joins
4. Low maintenance
5. Very simple metadata

A snowflake schema is a refinement of the star schema in which some dimensional
hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to a
snowflake.

Example of Snowflake Schema

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city:     city_key, city, province_or_state, country
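
The effect of snowflaking the location dimension can be sketched as follows: grouping sales by city now takes two joins (fact to location, location to city) instead of one. The _dim table suffixes and the sample rows are assumptions, and only the location branch of the schema is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT,
        province_or_state TEXT, country TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
        city_key INTEGER REFERENCES city_dim(city_key));
    CREATE TABLE sales_fact (
        location_key INTEGER REFERENCES location_dim(location_key),
        units_sold INTEGER, dollars_sold REAL);

    INSERT INTO city_dim VALUES (1, 'Kampala', 'Central', 'Uganda'),
                                (2, 'Gulu', 'Northern', 'Uganda');
    INSERT INTO location_dim VALUES (10, '1 High St', 1), (11, '2 Low Rd', 1),
                                    (12, '3 Main St', 2);
    INSERT INTO sales_fact VALUES (10, 2, 20.0), (11, 3, 30.0), (12, 1, 10.0);
    """
)

sales_by_city = conn.execute(
    """
    SELECT c.city, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN location_dim l ON l.location_key = f.location_key  -- first join
    JOIN city_dim c ON c.city_key = l.city_key               -- extra join
    GROUP BY c.city
    ORDER BY c.city
    """
).fetchall()
```

The city attributes are stored once per city rather than once per street address, which is the storage saving, and the extra join is the query cost.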

Example of a snowflake schema: a student attendance DWH

Benefits of Snowflaking
• It is possible to save on storage space
• Normalized structures are easier to update
and maintain than un-normalized ones
• It is appropriate for use where a dimension
table occupies a significant proportion of
the database as a result of dimensions with
many attributes

Disadvantages of Snowflaking
• The schema is less intuitive, and end-users are put
off by the complexity
• Browsing through the contents is more
difficult
• Degraded query performance because of
additional joins

A fact constellation model is a dimensional model that
consists of multiple fact tables, joined together through
dimensions.

Multiple fact tables share dimension tables. The schema can be viewed as a
collection of stars and is therefore called a galaxy schema or fact
constellation.

When a dimension table connects to more than one fact
table, we refer to that dimension table as
"conformed" between the two dimensional models.

Example of Fact Constellation

Sales Fact Table
  Keys:     time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  Keys:     time_key, item_key, shipper_key, from_location, location_key
  Measures: dollars_cost, units_shipped

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type
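
Because the sales and shipping fact tables share the item dimension, their measures can be compared side by side. The sketch below summarizes each fact table separately and then joins the two summaries through the shared dimension, so that rows are not multiplied by joining the fact tables directly. Table names, columns, and rows are trimmed-down assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE sales_fact (item_key INTEGER, units_sold INTEGER);
    CREATE TABLE shipping_fact (item_key INTEGER, units_shipped INTEGER);

    INSERT INTO item_dim VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO sales_fact VALUES (1, 3), (1, 2), (2, 4);
    INSERT INTO shipping_fact VALUES (1, 4), (2, 1), (2, 2);
    """
)

sold_vs_shipped = conn.execute(
    """
    SELECT i.item_name, s.units_sold, h.units_shipped
    FROM item_dim i
    JOIN (SELECT item_key, SUM(units_sold) AS units_sold
          FROM sales_fact GROUP BY item_key) s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS units_shipped
          FROM shipping_fact GROUP BY item_key) h ON h.item_key = i.item_key
    ORDER BY i.item_name
    """
).fetchall()
```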
Common uses
These schemas are most commonly found in dimensional
DWHs and data marts where speed of data retrieval is more
important than the efficiency of data manipulations.

The decision whether to employ a star schema, a snowflake
schema, or a fact constellation schema should consider the
relative strengths of the database platform in question and the
query tool to be employed.
• Star schemas should be favored with query tools that largely
expose users to the underlying table structures, and in
environments where most queries are simpler in nature.
• Snowflake schemas are often better with more sophisticated query
tools that isolate users from the raw table structures and for
environments having numerous queries with complex criteria.

Schema Keys
Dimension Business Key
• Column or columns that identify a unique instance of the
business record (not necessarily a primary key in the dimension
table)
Dimension Record Surrogate Keys
• Define the dimension’s primary key
• Relate to the fact table foreign key field
• Numeric data type, typically integer
Foreign Keys
• Each dimension table has a one-to-many relationship with
the central fact table
• The PK of each dimension table must be a foreign key in the
fact table

Why use surrogate Keys
Data tables in various source systems may use different
presentations of keys for the same entity. Legacy systems
that provide historical data might have used a different
numbering system than a current online transaction processing
system. A surrogate key uniquely identifies each entity in the
dimension table regardless of its source key. A separate field
can be used to contain the key used in the source system.

Systems developed independently in company divisions
may not use the same keys, or they may use keys that
conflict with data in the systems of other divisions. This
situation may not cause problems when each division
independently reports summary data, but it cannot be
permitted in the data warehouse where data is consolidated.
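
A minimal sketch of surrogate key assignment when two divisions use conflicting source keys is shown below. The CustomerDimension class, its field names, and the sample records are hypothetical. The sketch only handles key collisions between different entities; recognizing that two different source keys refer to the same entity would need a separate matching step.

```python
from itertools import count


class CustomerDimension:
    """Assigns integer surrogate keys; the source key is kept in its own field."""

    def __init__(self):
        self._next_key = count(1)
        self._key_by_source = {}   # (source_system, source_key) -> surrogate key
        self.rows = []

    def load(self, source_system, source_key, name):
        """Return the surrogate key for a source record, adding a row if needed."""
        business_key = (source_system, source_key)
        if business_key not in self._key_by_source:
            surrogate = next(self._next_key)
            self._key_by_source[business_key] = surrogate
            self.rows.append(
                {
                    "customer_key": surrogate,        # surrogate primary key
                    "source_system": source_system,
                    "source_key": source_key,         # key used in the source
                    "name": name,
                }
            )
        return self._key_by_source[business_key]


dim = CustomerDimension()
key_a = dim.load("division_a", "100", "Acme Ltd")
key_b = dim.load("division_b", "100", "Bright Co")   # same source key, other entity
key_a_again = dim.load("division_a", "100", "Acme Ltd")
```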

Keys may change or be reused in the source
data systems. This situation is usually less likely
than others, but some systems have been known
to reuse keys belonging to obsolete data.
However, the key may still be in use in historical
data in the data warehouse, and the same key
cannot be used to identify different entities.

Changes in organizational structures may move keys in
the hierarchy. This can be a common situation.
For example, if a salesperson is transferred from one region
to another, the company may prefer to track two things:
sales data for the salesperson in the person's original
region before the transfer date, and sales data for
the salesperson in the person's new region after the
transfer date. To represent this organization of data, the
salesperson's record must exist in two places in the sales
force dimension table, which is not possible if the
salesperson's company employee identification number is
used as the primary key for the dimension table. A
surrogate key allows the same salesperson to participate in
different locations in the dimension hierarchy.
In this case, the salesperson will be represented twice in
the dimension table with two different surrogate keys.
These surrogate keys are used to join the salesperson's
records to the sets of facts appropriate to the various
locations in the hierarchy occupied by the salesperson.
The employee's identification number should be carried
in a separate column in the table so information about
the employee can be reviewed or summarized
regardless of the number of times the employee's
record appears in the dimension table.
Dimensions that exhibit this type of change are
called slowly changing dimensions.
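
The salesperson transfer can be sketched as below, in the style often called a Type 2 slowly changing dimension. The valid_from and valid_to columns, the add_version helper, and all sample values are assumptions added for illustration; the text above only requires two rows with different surrogate keys and the employee identification number in its own column.

```python
from itertools import count

_next_key = count(1)
sales_person_dim = []   # rows of the sales force dimension table


def add_version(employee_id, name, region, valid_from):
    """Close the employee's current row (if any) and add a row with a new key."""
    for row in sales_person_dim:
        if row["employee_id"] == employee_id and row["valid_to"] is None:
            row["valid_to"] = valid_from
    new_row = {
        "sales_person_key": next(_next_key),   # surrogate key
        "employee_id": employee_id,            # company ID, kept separately
        "name": name,
        "region": region,
        "valid_from": valid_from,
        "valid_to": None,
    }
    sales_person_dim.append(new_row)
    return new_row["sales_person_key"]


key_before = add_version("E42", "Sales Rep", "Region A", "2023-01-01")
key_after = add_version("E42", "Sales Rep", "Region B", "2024-02-01")

# Each fact row uses the surrogate key that was current when the sale happened.
sales_fact = [(key_before, 100.0), (key_before, 50.0), (key_after, 70.0)]


def sales_by_region():
    """Total sales per region, following each fact's surrogate key."""
    region_of = {r["sales_person_key"]: r["region"] for r in sales_person_dim}
    totals = {}
    for key, amount in sales_fact:
        totals[region_of[key]] = totals.get(region_of[key], 0.0) + amount
    return totals


def sales_by_employee(employee_id):
    """Total sales for one employee across all of their dimension rows."""
    keys = {
        r["sales_person_key"]
        for r in sales_person_dim
        if r["employee_id"] == employee_id
    }
    return sum(amount for key, amount in sales_fact if key in keys)
```

Here sales_by_region() splits the salesperson's sales between the old and new regions, while sales_by_employee("E42") uses the employee identification number to summarize across both rows.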
