
Unit 2

NOTE: THIS PRESENTATION SHOULD BE CONSIDERED AS SUPPORTING
MATERIAL ONLY. FOR DETAILED STUDY, STUDENTS MUST REFER TO THE TEXT
BOOKS AND REFERENCE BOOKS MENTIONED IN THE SYLLABUS.

1 PRASHASTI KANIKAR 8/6/2020


Dimensional Analysis

INFORMATION PACKAGES
 Define the common subject areas

 Design key business metrics

 Decide how data must be presented

 Determine how users will aggregate or roll up

 Decide the data quantity for user analysis or query

 Decide how data will be accessed

Product dimensions
 Product: Model name, model year, package styling, product line, product category, exterior
color, interior color, first model year

 Dealer: Dealer name, city, state, single brand flag, date first operation

 Customer demographics: Age, gender, income range, marital status, household size,
vehicles owned, home value, own or rent

 Payment method: Finance type, term in months, interest rate, agent

 Time: Date, month, quarter, year, day of week, day of month, season, holiday flag

Dimensional Data Modeling



Entity Relationship Modeling: Review
 Entity Relationship modeling is a technique used to ‘abstract’ user’s data requirements
into a model that can be analyzed and ultimately implemented.

 The focus of ER modeling:


 achieve processing and data storage efficiency by reducing data redundancy (storing data
elements once)
 provide flexibility and ease of maintenance
 protect the integrity of data by storing it once

 ER modeling and normalization are well suited to transaction processing because they
keep transactions as simple as possible (each data element is stored in only one place).



ER Model Example
 ER models lead to complex databases.

 A ‘spiderweb of joins’ is required for many queries.

 They are typically unusable for non-technical users who wish to perform queries.



ER model issues
 End users cannot understand or remember an ER model.

 End users cannot navigate an ER model.

 There is no graphical user interface (GUI) that takes a general ER model and makes it
usable by end users.

 Use of the ER modeling technique defeats the basic goal of data warehousing, namely
intuitive and high-performance retrieval of data.

 Solution: the Dimensional Data Model



What is Dimensional Modeling (DM)?

 DM is a logical design technique that seeks to present the data in a standard,
intuitive framework that allows for high-performance access.

 Can be implemented using a relational or a multidimensional DBMS

 Every dimensional model is composed of one table with a multipart key, called the
fact table, and a set of smaller tables called dimension tables.

 Each dimension table has a single-part primary key that corresponds exactly to one
of the components of the multipart key in the fact table.

 This characteristic "star-like" structure is often called a star join or star schema.
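
The structure described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (product_dim, store_dim, date_dim, sales_fact) are assumptions, not taken from the slides. It shows one fact table whose multipart key is made of the single-part keys of the dimension tables.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);

-- The fact table's multipart key is built from the dimension keys.
CREATE TABLE sales_fact (
    product_key  INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key    INTEGER NOT NULL REFERENCES store_dim(store_key),
    date_key     INTEGER NOT NULL REFERENCES date_dim(date_key),
    dollars_sold REAL,
    units_sold   INTEGER,
    dollars_cost REAL,
    PRIMARY KEY (product_key, store_key, date_key)
);
""")


def primary_key_columns(table):
    """Return the primary-key columns of a table, in key order."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Column 5 of table_info is the 1-based position in the primary key (0 = not in key).
    return [r[1] for r in sorted(rows, key=lambda r: r[5]) if r[5] > 0]


fact_key = primary_key_columns("sales_fact")
dimension_keys = [primary_key_columns(t)[0] for t in ("product_dim", "store_dim", "date_dim")]
print(fact_key, dimension_keys)
```

Each component of the fact table key matches exactly one dimension table's single-part key, which is what gives the diagram its star shape.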



Dimensional Model Example



Dimensional Model: Fact Tables
 A fact table contains information about things that an organization wants
to measure.
 A fact table’s key is made up from the keys of two or more dimension
tables.
 The most useful fact tables also contain one or more numerical
measures, or facts, that occur for the combination of keys that define
each record.
 Example: the facts are Dollars Sold, Units Sold, and Dollars Cost.
 The most useful facts in a fact table are numeric and additive.
 Additivity is crucial because data warehouse applications almost never
retrieve a single fact table record; rather, they fetch back hundreds,
thousands, or even millions of these records at a time, and often the
most useful thing to do with so many records is to add them up.
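
To see why additivity matters, here is a minimal sketch with made-up fact rows (the key values and amounts are invented for illustration). The same additive facts are summed along two different dimensions, and both roll-ups reach the same grand total.

```python
from collections import defaultdict

# Hypothetical fact rows:
# (product_key, store_key, date_key, dollars_sold, units_sold, dollars_cost)
fact_rows = [
    (1, 10, 20200801, 120.0, 4, 80.0),
    (1, 11, 20200801, 60.0, 2, 40.0),
    (2, 10, 20200801, 45.0, 3, 30.0),
    (2, 10, 20200802, 15.0, 1, 10.0),
]


def total_by(rows, key_index):
    """Add up the three additive facts for each value of one key column."""
    totals = defaultdict(lambda: [0.0, 0, 0.0])
    for row in rows:
        bucket = totals[row[key_index]]
        bucket[0] += row[3]  # dollars_sold
        bucket[1] += row[4]  # units_sold
        bucket[2] += row[5]  # dollars_cost
    return {key: tuple(values) for key, values in totals.items()}


by_product = total_by(fact_rows, 0)
by_store = total_by(fact_rows, 1)
print(by_product)
print(by_store)
```

A non-additive value such as a unit price could not be summed this way, which is why the slide stresses numeric, additive facts.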



Dimensional Model: Dimension Tables
 Dimension tables contain information about how an organization
wants to analyze facts:
 “Show me sales revenue (fact) for last week (time) for blue cups (product) in
the western region (geography).”

 Dimension tables most often contain descriptive textual information,
for example ‘Blue cups’ or ‘Western Region’.
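
The request quoted above might translate into a query like the following sketch. The schema, the sample rows, and the 'last week' label are assumptions made for illustration; the point is that the descriptive text in the dimension tables does the filtering while the fact table supplies the number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative schema and data; none of these names come from the slides.
conn.executescript("""
CREATE TABLE product_dim   (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE geography_dim (geography_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE time_dim      (time_key INTEGER PRIMARY KEY, week TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER, geography_key INTEGER, time_key INTEGER, sales_revenue REAL
);
INSERT INTO product_dim   VALUES (1, 'Blue cups'), (2, 'Red plates');
INSERT INTO geography_dim VALUES (1, 'Western Region'), (2, 'Eastern Region');
INSERT INTO time_dim      VALUES (1, 'last week'), (2, 'two weeks ago');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 100.0),
    (1, 1, 1, 50.0),
    (1, 2, 1, 70.0),
    (2, 1, 1, 30.0),
    (1, 1, 2, 20.0);
""")

# Each dimension is joined to the fact table; the WHERE clause uses only
# descriptive dimension attributes.
revenue = conn.execute("""
    SELECT SUM(f.sales_revenue)
    FROM sales_fact f
    JOIN product_dim   p ON p.product_key   = f.product_key
    JOIN geography_dim g ON g.geography_key = f.geography_key
    JOIN time_dim      t ON t.time_key      = f.time_key
    WHERE p.description = 'Blue cups'
      AND g.region = 'Western Region'
      AND t.week = 'last week'
""").fetchone()[0]
print(revenue)
```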



Dimensional Model vs ER model
 The key to understanding the relationship between DM
and ER is that a single ER diagram breaks down into
multiple DM diagrams, or ‘stars’.
 Think of a large ER diagram as representing every
possible business process within an application. The ER
diagram may have Sales Calls, Order Entries, Shipment
Invoices, Customer Payments, and Product Returns, all
on the same diagram.



Dimensional Model vs ER model
Figure: a single ER diagram broken down into separate stars: Shipments, Returns, Orders,
Sales Contact, Payments.



ER vs DM – Final Points
 ER models are not appropriate for Data Warehouses. ER
modeling does not really model a business; rather, it
models the micro relationships among data elements.
 ER models are wildly variable in structure. As such, it is
extremely difficult to optimize query performance.



Why is ER not suitable for Data Warehouses?

 End users cannot understand or remember an ER model.
 End users cannot navigate an ER model. There is no
graphical user interface (GUI) that takes a general ER
diagram and makes it usable by end users.
 ER models are not optimized for complex, ad hoc queries.
They are optimized for repetitive, narrow queries.
 ER modeling defeats the basic goal of data warehousing,
namely intuitive and high-performance retrieval of data,
because it leads to highly normalized relational tables.



Advanced Concepts



Snowflake Schema
 Snowflaking is removing low cardinality textual attributes from dimension
tables and placing them in secondary dimension tables.

 Snowflaking a dimension means normalizing it and making it more manageable
by reducing its size.

 But this may have an adverse effect on performance, as extra joins need to be
performed.
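
As a rough sketch of snowflaking, the code below moves two low-cardinality attributes out of a product dimension into a secondary table. The rows and the column names (category_name, department, category_key) are assumptions for illustration only.

```python
# Hypothetical denormalized product dimension (star form).
product_dim = [
    {"product_key": 1, "product_name": "Blue cup", "category_name": "Cups", "department": "Kitchenware"},
    {"product_key": 2, "product_name": "Red cup", "category_name": "Cups", "department": "Kitchenware"},
    {"product_key": 3, "product_name": "Desk lamp", "category_name": "Lamps", "department": "Lighting"},
]


def snowflake(rows, attrs, key_name):
    """Move the given attributes into a secondary table linked by a generated key."""
    assigned = {}  # tuple of attribute values -> generated key
    primary = []
    for row in rows:
        value = tuple(row[a] for a in attrs)
        key = assigned.setdefault(value, len(assigned) + 1)
        slim = {k: v for k, v in row.items() if k not in attrs}
        slim[key_name] = key
        primary.append(slim)
    secondary = [dict(zip(attrs, value), **{key_name: key}) for value, key in assigned.items()]
    return primary, secondary


slim_product_dim, category_dim = snowflake(
    product_dim, ["category_name", "department"], "category_key"
)

# Reading a category name now costs one extra lookup (the additional join).
category_by_key = {c["category_key"]: c for c in category_dim}
names = [category_by_key[p["category_key"]]["category_name"] for p in slim_product_dim]
print(slim_product_dim, category_dim, names)
```

The product table gets narrower and the repeated text is stored once, at the price of the extra lookup that the slide warns about.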



Starflake Schema
 A starflake schema is a hybrid structure that
contains a mixture of star (denormalized) and
snowflake (normalized) schemas.

 It allows dimensions to be present in both forms to
cater for different query requirements.



Fact Constellation

 A fact constellation is a set of fact tables that share some
dimension tables.



Differentiate between Star Schema and Snowflake Schema

Star Schema | Snowflake Schema
Contains dimension tables mapped around one or more fact tables. | Contains deeper chains of joins because the dimension tables are split into many pieces.
It is a denormalized model. | It is the normalized form of the star schema.
No need to use complicated joins. | Complicated joins are needed, since there are more tables.
Queries return results quickly. | There is some delay in processing a query.
All the primary keys of the dimension tables are in the fact table. | There are more dimension tables, linked to one another by primary-foreign key relationships.



How does a query execute for a star schema?
 A star query is a join between a fact table and a number of dimension tables. Each dimension
table is joined to the fact table using a primary key to foreign key join, but the dimension tables
are not joined to each other. The optimizer recognizes star queries and generates efficient
execution plans for them.
 A typical fact table contains keys and measures. The dimension tables might be, for example,
customers, times, products, channels, and promotions. The products dimension table, for
example, contains information about each product number that appears in the fact table.



Examples on Star Schema and Snowflake Schema
 An electronics company has a sales department. Sales
considers four dimensions, namely time, item, branch and
location. The schema contains a central fact table, sales,
with two measures: dollars_sold and units_sold.

 Design a star schema and a snowflake schema for the same.



Star Schema



Example 2
 Mumbai University wants you to help design a star schema to
record grades for courses completed by students. There are four
dimension tables, namely course section, professor, student, and
period, with attributes as follows:

 Course_section attributes: Course_Id, Section_number, Course_name, Units,
Room_id, Roomcapacity. During a given semester the college offers an average of 500
course sections.
 Professor attributes: Prof_id, Prof_Name, Title, Department_id, department_name
 Student attributes: Student_id, Student_name, Major. Each course section has an
average of 60 students.
 Period attributes: Semester_id, Year. The database will contain data for 30 months.
 The only fact to be recorded in the fact table is the course grade.



Answer the following Questions

 (a) Design the star schema for this problem


 (b) Estimate the number of rows in the fact table, using the
assumptions stated above and also estimate the total size of the
fact table (in bytes) assuming that each field has an
average of 5 bytes
 (c) Can you convert this star schema to a snowflake schema?
Justify your answer and design a snowflake schema if it is
possible



Star Schema



 Course sections offered per semester = 500
 Average students per course section = 60
 The university stores data for 30 months
 Grades recorded per semester = 500 * 60 = 30000
 Time dimension = 30 months = 5 semesters (assuming 1 semester = 6 months)
 Number of rows in the fact table = 30000 * 5 = 150000 (one grade per student
per course section, for each of the 5 semesters)
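
The estimate can be checked with a few lines of Python. The size calculation for part (b) is not worked out on the slide, so it rests on an assumption: each fact row holds 5 fields (the four dimension keys plus the grade) at the stated average of 5 bytes per field.

```python
# Figures taken from the problem statement.
sections_per_semester = 500
students_per_section = 60
months_of_data = 30
bytes_per_field = 5

# Assumption from the slide: one semester lasts 6 months.
months_per_semester = 6
semesters = months_of_data // months_per_semester

# One fact row per student per course section per semester.
fact_rows = sections_per_semester * students_per_section * semesters

# Assumption (not stated on the slide): 4 dimension keys + 1 grade = 5 fields per row.
fields_per_row = 4 + 1
fact_table_bytes = fact_rows * fields_per_row * bytes_per_field

print(semesters, fact_rows, fact_table_bytes)
```

Under these assumptions the fact table holds 150000 rows and occupies about 3.75 million bytes; a different field count would change the size estimate proportionally.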



Snowflake Schema
 Assumptions:

 The course dimension can be further normalized into a rooms dimension.

 The professor dimension can be further normalized into a department dimension.

 The student dimension can be further normalized into a major dimension.



Snowflake Schema



Example of Star Schema
 Sales Fact Table: time_key, item_key, branch_key, location_key
 Measures: units_sold, dollars_sold, avg_sales
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city, province_or_state, country

Example of Snowflake Schema
 Sales Fact Table: time_key, item_key, branch_key, location_key
 Measures: units_sold, dollars_sold, avg_sales
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_key
 supplier: supplier_key, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city_key
 city: city_key, city, province_or_state, country

Example of Fact Constellation
 Sales Fact Table: time_key, item_key, branch_key, location_key
 Measures: units_sold, dollars_sold, avg_sales
 Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
 Measures: dollars_cost, units_shipped
 time: time_key, day, day_of_the_week, month, quarter, year
 item: item_key, item_name, brand, type, supplier_type
 branch: branch_key, branch_name, branch_type
 location: location_key, street, city, province_or_state, country
 shipper: shipper_key, shipper_name, location_key, shipper_type

Factless fact table
 Let us say we are building a fact table to track the attendance of students.
 For analyzing student attendance, the possible dimensions are student, course, date,
room, and professor. The attendance may be affected by any of these dimensions.
 When you want to mark the attendance relating to a particular course, date, room,
and professor, what measurement do you come up with for recording the event?
 In the fact table row, the attendance will be indicated with the number one.
 Every fact table row will contain the number one as attendance.
 If so, why bother to record the number one in every fact table row? There is no need to
do this.
 The very presence of a corresponding fact table row could indicate the attendance.
 This type of situation arises when the fact table represents events. Such fact tables
really do not need to contain facts. They are “factless” fact tables.
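
A factless fact table can be sketched as rows that hold only dimension keys. The key values below are invented for illustration; attendance is counted simply by counting rows, and a missing row means the student did not attend.

```python
from collections import Counter

# Hypothetical factless fact rows: only dimension keys, no measure column.
# (student_key, course_key, date_key, room_key, professor_key)
attendance_fact = [
    (1, 101, 20200803, 5, 9),
    (2, 101, 20200803, 5, 9),
    (1, 101, 20200804, 5, 9),
    (3, 102, 20200803, 6, 8),
]

# The presence of a row is the event, so counting rows replaces summing ones.
attendance_by_course = Counter(row[1] for row in attendance_fact)
attendance_by_date = Counter(row[2] for row in attendance_fact)

# Did student 2 attend course 101 on 2020-08-04? Only if a matching row exists.
student_2_attended = (2, 101, 20200804, 5, 9) in set(attendance_fact)

print(attendance_by_course, attendance_by_date, student_2_attended)
```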



Factless fact table example



UPDATES TO THE DIMENSION TABLES
 Over time, what happens to the fact table?
 Every day, as more and more sales take place, more and more rows get added to the
fact table. The fact table continues to grow in the number of rows over time.
 Very rarely are the rows in a fact table updated with changes. Even when there are
adjustments to the prior numbers, these are also processed as additional adjustment
rows and added to the fact table.
 Now consider the dimension tables. Compared to the fact table, the dimension
tables are more stable and less volatile. A dimension table does not change just
through the increase in the number of rows, but also through changes to the
attributes themselves.
 Consider the product dimension table. Every year, rows are added as new models
become available. But what about the attributes within the product dimension table?
If a particular product is moved to a different product category, then the
corresponding values must be changed in the product dimension table.
 Let us examine the types of changes that affect dimension tables and discuss the
ways for dealing with these types.
Slowly Changing Dimensions
 Consider the customer demographics dimension table. What happens when a
customer’s status changes from rental home to own home? The corresponding
row in that dimension table must be changed.
 From the consideration of the changes to the dimension tables, we can derive
the following principles:
 Most dimensions are generally constant over time
 Many dimensions, though not constant over time, change slowly
 The product key of the source record does not change
 The description and other attributes change slowly over time
 In the source OLTP systems, the new values overwrite the old ones
 Overwriting of dimension table attributes is not always the appropriate
option in a data warehouse
 The ways changes are made to the dimension tables depend on the types of
changes and what information must be preserved in the data warehouse



The usual changes to dimension tables
Type 1 Changes: Correction of Errors
 For example, suppose a spelling error in the customer name is corrected
to read as Michael Romano from the erroneous entry of Michel
Romano.
 There is no need to preserve the old values. In the case of Michael
Romano, the old name is erroneous and needs to be discarded. When
the users need to find all the orders from Michael Romano, the users
will use the correct name.
 Here are the general principles for Type 1 changes:
 Usually, the changes relate to correction of errors in source systems
 Sometimes the change in the source system has no significance
 The old value in the source system needs to be discarded
 The change in the source system need not be preserved in the data
warehouse



Applying Type 1 Changes to the Data Warehouse.
 The method for applying Type 1 changes is:
 Overwrite the attribute value in the dimension table
row with the new value
 The old value of the attribute is not preserved
 No other changes are made in the dimension table
row
 The key of this dimension table or any other key
values are not affected
 This type is easiest to implement
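
A Type 1 change amounts to an in-place overwrite, as in this minimal sketch. The row layout and the customer_id value are hypothetical; only the name correction comes from the example on the earlier slide.

```python
# Hypothetical customer dimension with one misspelled name.
customer_dim = [
    {"customer_key": 1, "customer_id": "C-100", "customer_name": "Michel Romano"},
]


def apply_type1(dim_rows, customer_id, attribute, new_value):
    """Overwrite the attribute in place; keys and other attributes stay untouched."""
    for row in dim_rows:
        if row["customer_id"] == customer_id:
            row[attribute] = new_value


apply_type1(customer_dim, "C-100", "customer_name", "Michael Romano")
print(customer_dim)
```

No row is added and no key changes, so existing fact rows keep pointing at the same dimension row and now show the corrected name.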

Type 2 Changes: Preservation of History
 Assume that in your data warehouse one of the essential
requirements is to track orders by marital status in
addition to tracking by other attributes.
 Suppose a customer, Kristin Samuelson, changed her marital
status from single to married on October 1, 2000. All orders
from Kristin Samuelson before that date must be included under
marital status: single, and all orders on or after October 1,
2000 should be included under marital status: married.



 Here are the general principles for this type of change:
 They usually relate to true changes in source systems
 There is a need to preserve history in the data
warehouse
 This type of change partitions the history in the data
warehouse
 Every change for the same attribute must be
preserved



 The method for applying Type 2 changes is:
 Add a new dimension table row with the new value of
the changed attribute
 An effective date field may be included in the
dimension table
 There are no changes to the original row in the
dimension table
 The key of the original row is not affected
 The new row is inserted with a new surrogate key
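
The Type 2 method listed above might look like the following sketch. The row layout, the customer_id value, and the order amounts are assumptions; the customer name and the October 1, 2000 date follow the marital status example from the earlier slide.

```python
from datetime import date

# Hypothetical customer dimension; customer_key is the surrogate key,
# customer_id is the operational key carried as an ordinary attribute.
customer_dim = [
    {"customer_key": 1, "customer_id": "C-001", "name": "Kristin Samuelson",
     "marital_status": "Single", "effective_date": None},
]


def apply_type2(dim_rows, customer_id, attribute, new_value, effective_date):
    """Add a new row with a new surrogate key; leave the original row unchanged."""
    versions = [r for r in dim_rows if r["customer_id"] == customer_id]
    latest = max(versions, key=lambda r: r["customer_key"])
    new_row = dict(latest)
    new_row[attribute] = new_value
    new_row["effective_date"] = effective_date
    new_row["customer_key"] = max(r["customer_key"] for r in dim_rows) + 1
    dim_rows.append(new_row)
    return new_row["customer_key"]


# An order placed before the change points at the original row.
orders_fact = [{"customer_key": 1, "order_date": date(2000, 8, 15), "order_dollars": 100.0}]

new_key = apply_type2(customer_dim, "C-001", "marital_status", "Married", date(2000, 10, 1))

# Orders on or after the effective date use the new surrogate key.
orders_fact.append({"customer_key": new_key, "order_date": date(2000, 11, 5), "order_dollars": 250.0})

status_by_key = {r["customer_key"]: r["marital_status"] for r in customer_dim}
dollars_by_status = {}
for order in orders_fact:
    status = status_by_key[order["customer_key"]]
    dollars_by_status[status] = dollars_by_status.get(status, 0.0) + order["order_dollars"]
print(dollars_by_status)
```

Because each fact row keeps the surrogate key that was current when the order was loaded, history is partitioned automatically between the two versions of the customer.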

 Type 3 Changes: Tentative Soft Revisions
 Assume your marketing department is contemplating a
realignment of the territorial assignments for
salespersons.
 Before making a permanent realignment, they want to
count the orders in two ways: according to the current
territorial alignment and also according to the proposed
realignment.
 This type of provisional or tentative change is a Type 3
change.



 Here are the general principles for Type 3 changes:
 They usually relate to “soft” or tentative changes in the
source systems
 There is a need to keep track of history with old and
new values of the changed attribute
 They are used to compare performances across the
transition
 They provide the ability to track forward and
backward



 Applying Type 3 Changes to the Data Warehouse.
 The methods for applying Type 3 changes are:
 Add an “old” field in the dimension table for the affected
attribute
 Push down the existing value of the attribute from the
“current” field to the “old” field
 Keep the new value of the attribute in the “current” field
 Also, you may add a “current” effective date field for the
attribute
 The key of the row is not affected
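
The steps above can be sketched as follows. The field names (current_territory, old_territory), the territory values, the effective date, and the order counts are all hypothetical. The sketch shows how keeping both values lets orders be counted under either alignment.

```python
from datetime import date

# Hypothetical salesperson dimension keyed by surrogate key.
salesperson_dim = {
    1: {"salesperson_key": 1, "current_territory": "North"},
    2: {"salesperson_key": 2, "current_territory": "South"},
}


def apply_type3(row, attribute, new_value, effective_date=None):
    """Push the current value down to the 'old' field and store the new value as current."""
    row["old_" + attribute] = row["current_" + attribute]
    row["current_" + attribute] = new_value
    if effective_date is not None:
        row["current_" + attribute + "_effective_date"] = effective_date


apply_type3(salesperson_dim[1], "territory", "East", date(2020, 8, 6))

# Hypothetical fact rows: (salesperson_key, order_count)
orders_fact = [(1, 3), (1, 2), (2, 4)]


def orders_by_territory(field):
    """Count orders by the chosen territory field; rows without an old value use the current one."""
    totals = {}
    for key, count in orders_fact:
        row = salesperson_dim[key]
        territory = row.get(field, row["current_territory"])
        totals[territory] = totals.get(territory, 0) + count
    return totals


by_current = orders_by_territory("current_territory")
by_old = orders_by_territory("old_territory")
print(by_current, by_old)
```

Unlike Type 2, no row is added and the key is unchanged, but only one prior value per attribute is retained.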

Keys in Data warehouse schema
Primary Keys
 Each row in a dimension table is identified by a unique
value of an attribute designated as the primary key of the
dimension.
 Examples-
 In a product dimension table, the primary key
identifies each product uniquely.
 In the customer dimension table, the customer
number identifies each customer uniquely.



Primary keys



Surrogate keys
 The surrogate keys are simply system-generated sequence
numbers.
 They do not have any built-in meanings.
 Of course, the surrogate keys will be mapped to the
production system keys.
 The general practice is to keep the operational system keys as
additional attributes in the dimension tables.
 Please refer to previous figure. The STORE KEY is the
surrogate primary key for the store dimension table. The
operational system primary key for the store reference table
may be kept as just another nonkey attribute in the store
dimension table.
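
One possible way to generate surrogate keys is a simple sequence with a lookup from operational keys, as in this sketch. The class name, the operational store IDs, and the store names are made up for illustration.

```python
import itertools


class SurrogateKeyGenerator:
    """Hands out meaningless sequence numbers and remembers the operational key mapping."""

    def __init__(self):
        self._sequence = itertools.count(1)
        self._assigned = {}

    def key_for(self, operational_key):
        # Reuse the surrogate key if this operational key was seen before.
        if operational_key not in self._assigned:
            self._assigned[operational_key] = next(self._sequence)
        return self._assigned[operational_key]


store_keys = SurrogateKeyGenerator()

# Hypothetical rows from the operational store reference table.
source_stores = [("NY-0042", "Downtown"), ("CA-0007", "Harbor"), ("NY-0042", "Downtown")]

store_dim = {}
for operational_key, store_name in source_stores:
    store_key = store_keys.key_for(operational_key)
    store_dim[store_key] = {
        "store_key": store_key,                   # surrogate primary key
        "operational_store_id": operational_key,  # kept as a nonkey attribute
        "store_name": store_name,
    }
print(store_dim)
```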



Foreign Keys

 Each dimension table is in a one-to-many relationship
with the central fact table.
 So the primary key of each dimension table must be a
foreign key in the fact table.
 If there are four dimension tables of product, date,
customer, and sales representative, then the primary key
of each of these four tables must be present in the orders
fact table as foreign keys.



Aggregate Tables
 Some queries require selections and summations from the fact table
rows.
 For these types of summations, you need detailed data based on
one or more dimensions, but only summary totals based on the
other dimensions.
 For example, you may need detailed daily data based on the time
dimension, but summary totals by product categories. If you had
summary totals or pre-calculated aggregates readily available, the
queries would run faster.
 Aggregates have fewer rows than the base tables. Therefore, when
most of the queries are run against the aggregate fact tables instead
of the base fact table, you notice a tremendous boost to
performance in the data warehouse.
 Formation of aggregate fact tables is certainly a very effective
method to improve query performance.



Aggregating Fact Tables

 In the base fact table, the rows reflect the numbers at the lowest levels of the dimension
hierarchies.
 For example, each row in the base fact table shows the sales units and sales dollars relating
to one date, one store, and one product.
 By moving up one or more levels along the hierarchy in each dimension, you can create a
variety of aggregate fact tables.
Multi-Way Aggregate Fact Tables
One-Way Aggregates:
 When you rise to higher levels in the hierarchy of one
dimension and keep the level at the lowest in the other
dimensions, you create one-way aggregate tables.
 examples:
 Product category by store by date
 Product department by store by date
 All products by store by date



Two-Way Aggregates:
 When you rise to higher levels in the hierarchies of two
dimensions and keep the level at the lowest in the other
dimension, you create two-way aggregate tables.
 examples:
 Product category by territory by date
 Product category by region by date
 Product category by all stores by date



Three-Way Aggregates:
 When you rise to higher levels in the hierarchies of all
the three dimensions, you create three-way aggregate
tables.
 examples:
 Product category by territory by month
 Product department by territory by month
 All products by territory by month
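
The one-way, two-way, and three-way aggregates described above can be illustrated with a small sketch. The base fact rows and the hierarchy mappings (product to category, store to territory, date to month) are invented for the example.

```python
from collections import defaultdict

# Hypothetical dimension hierarchies.
product_category = {"P1": "Cups", "P2": "Cups", "P3": "Plates"}
store_territory = {"S1": "West", "S2": "West", "S3": "East"}


def month_of(day):
    """Roll a 'YYYY-MM-DD' date up to its month, 'YYYY-MM'."""
    return day[:7]


# Base fact table at the lowest level: (product, store, date, units_sold)
base_fact = [
    ("P1", "S1", "2020-08-01", 10),
    ("P2", "S1", "2020-08-01", 5),
    ("P1", "S2", "2020-08-01", 3),
    ("P3", "S2", "2020-08-01", 7),
    ("P1", "S1", "2020-08-02", 4),
    ("P2", "S3", "2020-09-01", 6),
]


def aggregate(rows, roll_product=None, roll_store=None, roll_date=None):
    """Sum units after rolling up the chosen dimensions; None keeps the lowest level."""
    keep = lambda value: value
    rp, rs, rd = roll_product or keep, roll_store or keep, roll_date or keep
    totals = defaultdict(int)
    for product, store, day, units in rows:
        totals[(rp(product), rs(store), rd(day))] += units
    return dict(totals)


# One-way: product category by store by date.
one_way = aggregate(base_fact, roll_product=product_category.get)
# Two-way: product category by territory by date.
two_way = aggregate(base_fact, product_category.get, store_territory.get)
# Three-way: product category by territory by month.
three_way = aggregate(base_fact, product_category.get, store_territory.get, month_of)

print(len(base_fact), len(one_way), len(two_way), len(three_way))
```

Each additional roll-up leaves fewer rows to scan while the grand total of units stays the same, which is where the query speed-up comes from.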



Forming aggregate fact tables



Goals for Aggregation
 Do not create too many aggregates.
 Try to cater to a wide range of user groups.
 Go for aggregates that do not increase the overall usage
of storage.
 Keep the aggregates hidden from the end-users.
 The query tool must be the one to be aware of the
aggregates to direct the queries for proper access.
 Attempt to keep the impact on the data staging process
as light as possible.
