Most relational models have a similar-looking design: hundreds of tables
connected by an even larger number of join paths. The result is overwhelming and,
from a business user's perspective, unusable. No human being or software tool
can analyze it and run queries on it in a reasonable amount of time.
3
The workshops provide another opportunity to flesh out the requirements with the
business.
Dimensional models should not be designed in isolation by folks who don't fully
understand the business and its needs.
4
The answers to these questions are determined by considering the needs of the
business along with the realities of the underlying source data during the
collaborative modeling sessions.
Following the business process, grain, dimension, and fact declarations, the
design team determines the table and column names, sample domain values,
and business rules. Business data governance representatives must participate
in this detailed design activity to ensure business buy-in.
5
6
The grain is the level of detail of the fact table, i.e., what a single row represents.
7
Facts are very specific, well-defined numeric attributes
8
9
10
11
12
13
14
Example of a dimensional model. The measurements are the numeric attributes of a
sales transaction, while the dimensions are the product in the transaction, the time
of the transaction, the location of the store, and the customer who made the transaction.
Model measurements as fact tables with multiple foreign keys referring to the
contextual entities.
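A minimal sketch of such a model, using Python's built-in sqlite3 module. All table names, column names, and rows here are illustrative assumptions, not taken from the slide; the point is only that the fact table holds the numeric measurements plus one foreign key per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE date_dim     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE store_dim    (store_key    INTEGER PRIMARY KEY, city          TEXT);
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);

-- Fact table: numeric measurements plus one foreign key per dimension.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim(product_key),
    date_key     INTEGER REFERENCES date_dim(date_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    quantity     INTEGER,
    sales_amount REAL
);

-- Hypothetical sample rows.
INSERT INTO product_dim  VALUES (1, 'Notebook'), (2, 'Pen');
INSERT INTO date_dim     VALUES (1, '2024-01-05'), (2, '2024-01-06');
INSERT INTO store_dim    VALUES (1, 'Lahore');
INSERT INTO customer_dim VALUES (1, 'Customer A');
INSERT INTO sales_fact   VALUES (1, 1, 1, 1, 2, 300.0),
                                (2, 1, 1, 1, 5, 100.0),
                                (1, 2, 1, 1, 1, 150.0);
""")

# A typical query: join the fact table to one dimension and add up a fact.
sales_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(sales_by_product)
```

The same pattern repeats for any other dimension: join on its key, group by one of its attributes, and sum the fact.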
15
16
Additivity is crucial because data warehouse applications almost never retrieve a
single fact table record; rather, they fetch back hundreds, thousands, or even
millions of these records at a time, and almost the only useful thing to do with so
many fact records is to add them up.
17
Because most fact tables are huge, with millions or even billions of rows, you almost
never fetch a single record into your queried answer set.
Rather, you fetch a very large number of records, which you compress into digestible
form by adding, counting, averaging, or taking the min or max. But for practical
purposes, the most common choice, by far, is adding.
So it is important to store facts that are additive wherever possible.
18
Example of a tuition payment fact. It stores the tuition fees paid by students along with
the keys of the student and date dimensions.
19
20
It makes no sense to add up the GPAs of individual students.
The average student GPA is a useful value, and we can compute that instead.
Note, however, that additivity relates solely to the ability to add, so GPA is a non-additive fact.
21
22
Store the underlying components in fact tables. E.g., for GPA, store credit hours and grade
points instead of storing the ratio/average itself. GPA can then be calculated at report
time.
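A small sketch of why the components are stored rather than the ratio. The rows below are hypothetical, and the sketch assumes that grade points for a course equal credit hours times the grade value earned. Credit hours and grade points are both additive, so GPA at any level can be derived from their sums.

```python
# Hypothetical course-level fact rows: (student_key, credit_hours, grade_points),
# assuming grade_points = credit_hours * grade value earned in the course.
grade_facts = [
    (1, 3, 12.0),  # 3 credit hours at grade value 4.0
    (1, 4, 12.0),  # 4 credit hours at grade value 3.0
    (2, 3, 6.0),   # 3 credit hours at grade value 2.0
]


def gpa(rows):
    """Derive GPA at report time from the two additive components."""
    total_hours = sum(hours for _, hours, _ in rows)
    total_points = sum(points for _, _, points in rows)
    return total_points / total_hours


overall_gpa = gpa(grade_facts)
student_1_gpa = gpa([r for r in grade_facts if r[0] == 1])
student_2_gpa = gpa([r for r in grade_facts if r[0] == 2])

# Averaging stored per-student ratios gives a different (unweighted) number,
# which is one reason the ratio itself is a poor thing to store.
naive_average = (student_1_gpa + student_2_gpa) / 2

print(overall_gpa, student_1_gpa, student_2_gpa, naive_average)
```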
23
Balance fact: a fact table storing the account balances of customers at points in time.
Reading the table: customer 618 had a balance of 1500, then withdrew some amount and the
balance dropped to 1400. Similarly for other customers …
24
Balances can be added along the customer dimension.
25
Balances cannot be added along the time dimension: we cannot say that customer 618's
account balance is 1500 + 1400. This makes no sense.
26
When aggregating such semi-additive facts across time, the recommended approach is to
divide the sum by the number of time slots, i.e., to report the average.
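The semi-additive behaviour can be sketched as follows. The two balances for customer 618 come from the example above; customer 619 and the day numbers are hypothetical additions for illustration.

```python
from collections import defaultdict

# (customer_key, day, balance); customer 619 and the day numbers are hypothetical.
balance_facts = [
    (618, 1, 1500.0),
    (618, 2, 1400.0),
    (619, 1, 200.0),
    (619, 2, 300.0),
]


def total_by_day(rows):
    """Adding along the customer dimension is meaningful: total held per day."""
    totals = defaultdict(float)
    for _, day, balance in rows:
        totals[day] += balance
    return dict(totals)


def average_over_time(rows, customer_key):
    """Along time, divide the sum by the number of time slots."""
    balances = [bal for cust, _, bal in rows if cust == customer_key]
    return sum(balances) / len(balances)


print(total_by_day(balance_facts))
print(average_over_time(balance_facts, 618))
```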
27
https://www.sisense.com/blog/when-and-how-to-use-surrogate-keys/
https://www.kimballgroup.com/2009/05/the-10-essential-rules-of-dimensional-modeling/
28
Surrogate keys are used because natural keys do not stand the test of time. Codes
that once carried business meaning can become meaningless, or take on a
different meaning in the future. Surrogate keys are also typically smaller in size,
which allows faster indexing.
In addition, when data from other systems is merged in the future, natural keys
from different systems may conflict.
29
Consider a dimension table.
Note: it is good practice to include _DIM in the table name to indicate that it is a
dimension table.
30
Faculty_ID is an identification number associated with each faculty member. This
attribute is a part of the business process.
31
However, according to the guidelines, we must use a surrogate key as the primary
key. Therefore, a new attribute (typically integer) will be generated during the data
modeling process and used as the primary key.
32
The surrogate key is used as the primary key
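One possible way to generate such keys, as a sketch only. The class name, the source-system labels, and the Faculty_ID values are all hypothetical; real warehouses often rely on database sequences or identity columns instead.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out small integer surrogate keys for natural keys (illustrative)."""

    def __init__(self, start=1):
        self._next_key = count(start)
        self._assigned = {}

    def key_for(self, source_system, natural_key):
        # The natural key is qualified by its source system, so identical
        # natural keys coming from two systems do not collide.
        identity = (source_system, natural_key)
        if identity not in self._assigned:
            self._assigned[identity] = next(self._next_key)
        return self._assigned[identity]


faculty_keys = SurrogateKeyGenerator()
key_a = faculty_keys.key_for("HR", "F-1001")        # first sighting: new key
key_b = faculty_keys.key_for("HR", "F-1001")        # same member: same key
key_c = faculty_keys.key_for("Payroll", "F-1001")   # same code, other system
print(key_a, key_b, key_c)
```

The natural key (Faculty_ID) stays in the dimension table as an ordinary attribute; only the primary key role moves to the generated integer.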
33
34
Consider the above schema representing the tuition fees of a student.
The tuition and activities bill amounts are the facts, and the student and date are
the dimensions.
35
36
DATA SCIENCE LAB
Dimensional Modeling
CS 537 - Big Data Analytics
2
Dimension
Product
3
Dimension
Product
We will use this to explain the difference between star and snowflake schemas
4
Implementing Different Schemas
• Star Schema
• Snowflake Schema
These schemas differ in how they represent the hierarchies within
dimension tables
5
Star Schema
The star schema consists of one fact table referencing any number
of dimension tables.
6
Why "star" schema?
• Gets its name from the physical model resembling a star shape
7
Star Schema
One dimension table for this whole set of objects:
Product Category
Product Family
8
Star Schema
Faculty_DIM
Faculty_Key
Faculty_ID
Faculty_Last_Name
Faculty_First_Name
Year_Joined
Faculty_Rank
Dept_ID
Dept_Name
Dept_Year_Founded
College_ID
College_Name
College_Year_Founded
Dean
9
Star Schema
Faculty_DIM
Faculty_Key
Faculty_ID (natural key)
Faculty_Last_Name
Faculty_First_Name
Year_Joined
Faculty_Rank
Dept_ID (natural key)
Dept_Name
Dept_Year_Founded
College_ID (natural key)
College_Name
College_Year_Founded
Dean
The natural keys are highlighted. It is not recommended to choose them as the primary key.
10
Star Schema
As previously discussed, create a surrogate key and use that as the primary key
11
DR. FAISAL KAMIRAN INFORMATION TECHNOLOGY UNIVERSITY
12
Benefits
• Simplifies queries
• Fast aggregations
Drawbacks
• Data Integrity
• Decreased query flexibility
13
Snowflake Schema
14
Why "snowflake" schema?
"A complex snowflake shape emerges when the dimensions of a
snowflake schema are elaborated, having multiple levels of
relationships, child tables having multiple parents."
15
Snowflake schema
16
Snowflake schema
Separate table for each dimension in the hierarchy
17
18
19
Foreign keys link the tables together. Here, the department key is used as a foreign key
in faculty_dim, and college_key is used as a foreign key in department_dim.
Terminal tables (College_DIM in this example) have no foreign keys.
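A sketch of this chain of foreign keys using sqlite3. The table names follow the slides, while the key column names (Dept_Key, College_Key) and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Terminal table: no foreign keys.
CREATE TABLE College_DIM (
    College_Key  INTEGER PRIMARY KEY,
    College_Name TEXT
);
CREATE TABLE Department_DIM (
    Dept_Key    INTEGER PRIMARY KEY,
    Dept_Name   TEXT,
    College_Key INTEGER REFERENCES College_DIM(College_Key)
);
CREATE TABLE Faculty_DIM (
    Faculty_Key       INTEGER PRIMARY KEY,
    Faculty_Last_Name TEXT,
    Dept_Key          INTEGER REFERENCES Department_DIM(Dept_Key)
);

-- Hypothetical sample rows.
INSERT INTO College_DIM    VALUES (1, 'Engineering');
INSERT INTO Department_DIM VALUES (1, 'Computer Science', 1),
                                  (2, 'Electrical Engineering', 1);
INSERT INTO Faculty_DIM    VALUES (1, 'Khan', 1), (2, 'Ali', 2);
""")

# Reaching the college of a faculty member takes two joins in the snowflake;
# in the star schema the college columns sit in Faculty_DIM itself.
faculty_with_college = conn.execute("""
    SELECT f.Faculty_Last_Name, d.Dept_Name, c.College_Name
    FROM Faculty_DIM f
    JOIN Department_DIM d ON d.Dept_Key = f.Dept_Key
    JOIN College_DIM c ON c.College_Key = d.College_Key
    ORDER BY f.Faculty_Key
""").fetchall()
print(faculty_with_college)
```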
20
Snowflake Schema PK-FK Rules
21
Snowflake Schema PK-FK Rules
22
College_Dim
Academic_Yr_Dim
23
Snowflake hierarchy with branching
[Figure: (Some Fact Table) linked to a time hierarchy that branches into Academic Year and Fiscal Year]
Here, we have a fact table which also has some form of a time dimension.
Months can fall in an academic calendar and also in a fiscal calendar, so the
hierarchy splits here.
24
Snowflake hierarchy with branching
[Figure: snowflaked time hierarchy. Recoverable names: Semester_DIM (Semester_Key, SemesterName, SemesterNum), Day_DIM (IsHoliday, Month_Key (FK)), Academic_Year_Key, Fiscal_Year_Key]
Represent this branching in the data with two foreign keys in the month dimension, one for each branch.
25
Snowflake vs Star
Star Schema
• Only one level away from the fact table along each hierarchy
• With one fact table, usually resembles a star
Snowflake Schema
• One or more levels away from the fact table along each hierarchy
• With one fact table, usually resembles a snowflake
26
Snowflake vs Star
Star Schema
• Overall fewer database joins for drilling up/down
Snowflake Schema
• Overall more database joins for drilling up/down
28
Types of Fact Tables
• The transaction fact table is one in which we store facts or measurements.
• The periodic snapshot fact table is one in which we track specific things that
we want to measure at regular intervals, kind of like periodic readings.
• The factless fact table (oddly named) is used just to record the occurrence of a
thing which has no facts/measurements.
29
Transaction Fact Table
Heart of dimensional models: Most of the dimensional models are of this type
30
Transaction Fact Table
Example
"Students are required to pay tuition fees for the university"
We want to track:
• Which students paid the fees
• The amount paid by each student
• Date of payment
31
Transaction Fact Table
Tuition_Payment_Fact
Tuition Payment Student Key Date Key
$7,000.00 732017235 88085255
$6,500.00 481011832 88085255
$7,000.00 881838281 82324174
$7,000.00 298191999 13216661
32
Student_DIM
Student Key   Student ID   Student LName   Student FName
732017235     SJACK32      Jackson         Sally
481011832     RTHOM29      Thompson        Richard
881838281     GWILL03      Williams        Greta
298191999     MBROD21      Brody           Michele
...
Date_DIM
Date Key      (Time/Date Columns)
82324174
13216661
65656565
56565656
88085255
[Figure: Student_DIM and Date_DIM linked to Tuition_Payment_Fact]
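The tuition example can be sketched in plain Python. The fact rows and student rows are the ones shown on the slides; the variable names and the two reports computed at the end are illustrative assumptions.

```python
from collections import defaultdict

# Tuition_Payment_Fact rows from the slide: (tuition_payment, student_key, date_key).
tuition_payment_fact = [
    (7000.00, 732017235, 88085255),
    (6500.00, 481011832, 88085255),
    (7000.00, 881838281, 82324174),
    (7000.00, 298191999, 13216661),
]

# Student_DIM rows from the slide: student_key -> (student_id, last name, first name).
student_dim = {
    732017235: ("SJACK32", "Jackson", "Sally"),
    481011832: ("RTHOM29", "Thompson", "Richard"),
    881838281: ("GWILL03", "Williams", "Greta"),
    298191999: ("MBROD21", "Brody", "Michele"),
}

# One fact row per payment event; the payment amount is fully additive.
total_by_date = defaultdict(float)
for amount, _student_key, date_key in tuition_payment_fact:
    total_by_date[date_key] += amount

# "Which students paid, and how much": look each fact row up in the dimension.
payments_by_student = [
    (student_dim[student_key][1], amount)
    for amount, student_key, _date_key in tuition_payment_fact
]

total_paid = sum(amount for amount, _, _ in tuition_payment_fact)
print(dict(total_by_date), payments_by_student, total_paid)
```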
33
Periodic Snapshot Fact Table
34
Periodic Snapshot Fact Table
Example
"Each student is assigned a meal card by their universities, which can
be used to purchase meals from the university cafeteria."
35
This dimensional model stores the meal transactions made by students with their
university-assigned meal cards.
Above is a transaction-grained fact table representing this. Each time a student
makes a payment with their meal card, it records who the student was, on what
date the transaction occurred, and at which campus it occurred.
E.g., if a student spent Rs. 500, a new row is created in the fact table.
36
Periodic Snapshot Fact Table
Objective
Track and analyze end-of-week meal account balances throughout the
semester
This can be done with the regular transaction grained fact table, but
the analytic queries will be complex and compute intensive
37
Recording Student End of Week Meal Card Payments
This fact table will store the meal card balance of a given student at
the end of a given week.
The campus_food dimension is not needed because, for this analysis, we only
need to analyze the balance per student, regardless of campus location.
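A sketch of how such a weekly snapshot could be derived from the transaction-grained data. The opening balances, the transactions, and the function name are hypothetical. Note that a snapshot row is produced for every student and every week, even for weeks without any transaction, which is what keeps the later analytic queries simple.

```python
from collections import defaultdict


def build_weekly_snapshot(opening_balances, transactions, weeks):
    """Return rows (student_key, week, end_of_week_balance).

    opening_balances: {student_key: balance at the start of the semester}
    transactions:     iterable of (student_key, week, amount_spent)
    weeks:            ordered list of week numbers to snapshot
    """
    spent = defaultdict(float)
    for student_key, week, amount in transactions:
        spent[(student_key, week)] += amount

    snapshot = []
    for student_key, balance in sorted(opening_balances.items()):
        for week in weeks:
            balance -= spent[(student_key, week)]  # 0.0 if nothing was spent
            snapshot.append((student_key, week, balance))
    return snapshot


# Hypothetical data: two students, three weeks.
opening = {1: 5000.0, 2: 3000.0}
meal_transactions = [(1, 1, 500.0), (1, 1, 250.0), (1, 3, 300.0), (2, 2, 1000.0)]
weekly_snapshot = build_weekly_snapshot(opening, meal_transactions, [1, 2, 3])
print(weekly_snapshot)
```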
38
Periodic Snapshot Fact Table
39
Accumulating Snapshot Fact Table
single dimension table
Include both the completed and the in-progress phases, so that a given row shows
which phases have already been completed.
40
Accumulating Snapshot Fact Table
Example
"Student financial aid application process"
41
Example: Student Financial Aid Application Process
42
Example: Student Financial Aid Application Process
Dimensions for this process
• Student
• Time (Specific day a process begins)
43
Example: Fact and dimension tables
FinAid_Accumul_Snapshot
Student_DIM
Student_Key
Student_ID
Student_Lname
Student_Fname
Date_DIM
Date_Key
The dimensions cover the student who applied for the aid and the date.
The accumulating snapshot fact table records the progress of each phase.
44
Fact columns: Days spent in screening, Days spent in ..., Days spent in ..., Days from approval ...
Student_DIM
Student_Key
Student_ID
Student_Lname
Student_Fname
Date_DIM
Date_Key
The fact table will contain the actual measurements, which are the days spent in
each phase.
45
Multiple links to date dimension
FinAid_Accumul_Snapshot
Student_Key
Day_Submitted_Key
Day_Screening_Key
Day_Prelim_Key
Day_Final_Key
Day_Payment_Key
Days_Screening
Days_Prelim_Decision
Days_Final_Decision
Days_Decision_to_Pmt
Student_DIM
Student_Key
Student_ID
Student_Lname
Student_Fname
Date_DIM
Date_Key
There are multiple links to the date dimension, one for each phase. Notice that the date
foreign keys in the fact table are named differently.
46
Accumulating Snapshot Fact Table
FinAid_Accumul_Snapshot
Student_Key (PK, FK)
Day_Submitted_Key (PK, FK)
Day_Screening_Key (PK, FK)
Day_Prelim_Key (PK, FK)
Day_Final_Key (PK, FK)
Day_Payment_Key (PK, FK)
Days_Screening
Days_Prelim_Decision
Days_Final_Decision
Days_Decision_to_Pmt
Student_DIM
Student_Key (PK)
Student_ID
Student_Lname
Student_Fname
Date_DIM
Date_Key (PK)
As usual, the primary key of the fact table will be a composite primary key of all
the dimension foreign keys.
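A sketch of how one row of such a table evolves. It rests on two assumptions that the slides do not confirm: each Day_* column holds the date on which that milestone was reached, and each Days_* fact is the elapsed time between two consecutive milestones. For brevity the row stores dates directly rather than keys into Date_DIM, and the function names are hypothetical.

```python
from datetime import date

MILESTONES = ["Day_Submitted", "Day_Screening", "Day_Prelim", "Day_Final", "Day_Payment"]
# FACTS[i] is assumed to measure the days between MILESTONES[i] and MILESTONES[i + 1].
FACTS = ["Days_Screening", "Days_Prelim_Decision", "Days_Final_Decision", "Days_Decision_to_Pmt"]


def new_application(student_key, submitted_on):
    """Create the single row for an application; later phases start empty."""
    row = {"Student_Key": student_key}
    row.update({name: None for name in MILESTONES + FACTS})
    row["Day_Submitted"] = submitted_on
    return row


def record_milestone(row, milestone, reached_on):
    """Update the existing row in place when the next phase completes."""
    i = MILESTONES.index(milestone)
    if i == 0:
        raise ValueError("Day_Submitted is set when the row is created")
    previous = row[MILESTONES[i - 1]]
    if previous is None:
        raise ValueError("previous milestone has not been reached yet")
    row[milestone] = reached_on
    row[FACTS[i - 1]] = (reached_on - previous).days
    return row


application = new_application(732017235, date(2024, 1, 1))
record_milestone(application, "Day_Screening", date(2024, 1, 8))
record_milestone(application, "Day_Prelim", date(2024, 1, 20))
print(application)
```

Unlike a transaction fact table, the row is revisited and updated as the process moves forward; columns for phases not yet reached simply stay empty.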
47
Factless Fact Table
48
Factless Fact Table
Example
"Students can register for online webinar throughout the semester"
We want to track:
• Which students register
• Which webinar they register for
• Date of registration
• Scheduled date of webinar
49
Factless Fact Table
There are no facts in the fact table. With this dimensional model, we can still answer
questions like: how many students registered for webinars in semester 2?
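A sketch of answering that question from a factless fact table. All keys, the semester attribute on the date dimension, and the rows are hypothetical. The fact rows carry nothing but dimension keys, so the analysis is done by counting rows or distinct keys.

```python
# Hypothetical date dimension: date_key -> attributes.
date_dim = {
    101: {"semester": 1},
    201: {"semester": 2},
    202: {"semester": 2},
}

# Factless fact rows: only foreign keys, no measurements.
# (student_key, webinar_key, registration_date_key, webinar_date_key)
webinar_registration_fact = [
    (1, 10, 101, 201),
    (2, 10, 201, 202),
    (1, 11, 202, 202),
    (3, 11, 201, 202),
    (2, 11, 202, 202),
]

# Registrations whose registration date falls in semester 2.
semester_2_rows = [
    row for row in webinar_registration_fact
    if date_dim[row[2]]["semester"] == 2
]

registrations_in_semester_2 = len(semester_2_rows)
students_in_semester_2 = len({student_key for student_key, _, _, _ in semester_2_rows})
print(registrations_in_semester_2, students_in_semester_2)
```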
50
Essential Rules of Dimensional Modeling
(Recommended by Kimball)
https://www.kimballgroup.com/2009/05/the-10-essential-rules-of-dimensional-modeling/
51
52
OLAP Cubes
53
OLAP Cubes
• An OLAP cube is an aggregation of a fact metric on a number of dimensions
[Figure: cube slices for MAR and APR, each with Branch columns NY, Paris, SF]
54
OLAP Cubes Operations: Roll-up & Drill Down
Roll-up: Sum up the sales of each city by country (in the figure, the APR slice rolls up to US and FR)
55
OLAP Cubes Operations: Slice
56
OLAP Cubes Operations: Dice
[Figure: dice yields a smaller sub-cube, e.g. Branch in ['NY', 'Paris'] ..]
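The three operations can be sketched on a tiny in-memory cube. The months and branches follow the figures; the movie names, the sales numbers, and the city-to-country mapping are hypothetical. Slice is taken here in its usual sense of fixing one dimension to a single value, and dice as restricting one or more dimensions to a set of values.

```python
from collections import defaultdict

# Hypothetical fact rows with three dimensions and one metric.
sales = [
    {"month": "MAR", "branch": "NY",    "movie": "Movie A", "sales": 100},
    {"month": "MAR", "branch": "Paris", "movie": "Movie A", "sales": 80},
    {"month": "MAR", "branch": "SF",    "movie": "Movie B", "sales": 60},
    {"month": "APR", "branch": "NY",    "movie": "Movie B", "sales": 120},
    {"month": "APR", "branch": "Paris", "movie": "Movie A", "sales": 90},
    {"month": "APR", "branch": "SF",    "movie": "Movie A", "sales": 70},
]

# Assumed hierarchy used for the roll-up: city -> country.
COUNTRY = {"NY": "US", "SF": "US", "Paris": "FR"}


def roll_up(rows):
    """Roll-up: aggregate city-level sales to country level, per month."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["month"], COUNTRY[row["branch"]])] += row["sales"]
    return dict(totals)


def slice_month(rows, month):
    """Slice: fix one dimension to a single value (N dimensions become N-1)."""
    return [row for row in rows if row["month"] == month]


def dice(rows, branches, months):
    """Dice: restrict dimensions to sets of values, giving a smaller sub-cube."""
    return [row for row in rows if row["branch"] in branches and row["month"] in months]


rolled = roll_up(sales)
march = slice_month(sales, "MAR")
sub_cube = dice(sales, branches=["NY", "Paris"], months=["MAR", "APR"])
print(rolled, len(march), len(sub_cube))
```

Drill-down is the reverse of roll-up: it needs the finer-grained rows (here, the city-level rows) to still be available.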
57
OLAP Cubes query optimization
• Business users will typically want to slice, dice, roll up and drill down
all the time
• Each such combination will potentially go through the entire fact table
(suboptimal)
58
OLAP Cubes query optimization
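• "GROUP BY CUBE (movie, branch, month)" will make one pass through
the fact table and will aggregate all possible combinations of groupings, of
lengths 0, 1, 2 and 3, e.g. the grand total (length 0), by movie (length 1), by movie and branch (length 2), and by movie, branch and month (length 3)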
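The single-pass idea can be emulated in plain Python to make the set of groupings concrete. This is only an illustration of the semantics, not how a database implements CUBE; the function name, the rows, and the result layout are assumptions. A database would typically return one row per combination, with NULL in the rolled-up columns.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("movie", "branch", "month")


def group_by_cube(rows, dims=DIMS, measure="sales"):
    """Aggregate `measure` for every subset of `dims` in one pass over rows.

    Returns {(grouping, values): total}, e.g. (("branch",), ("NY",)) -> total.
    """
    # All groupings of length 0..len(dims): 2 ** len(dims) of them.
    groupings = [g for n in range(len(dims) + 1) for g in combinations(dims, n)]
    totals = defaultdict(float)
    for row in rows:  # the fact rows are read exactly once
        for grouping in groupings:
            values = tuple(row[d] for d in grouping)
            totals[(grouping, values)] += row[measure]
    return dict(totals)


# Hypothetical fact rows.
fact_rows = [
    {"movie": "Movie A", "branch": "NY",    "month": "MAR", "sales": 100.0},
    {"movie": "Movie A", "branch": "Paris", "month": "MAR", "sales": 80.0},
    {"movie": "Movie B", "branch": "NY",    "month": "APR", "sales": 120.0},
]

cube = group_by_cube(fact_rows)
print(cube[((), ())])                                    # grand total
print(cube[(("branch",), ("NY",))])                      # by branch
print(cube[(("movie", "branch"), ("Movie A", "NY"))])    # by movie and branch
```

Once these aggregates are materialized, slice, dice, roll-up and drill-down requests can be answered from them instead of rescanning the fact table each time.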
59
The Last Mile: Delivering the analytics to users
Data is available...
• In an understandable & performant dimensional model
• With Conformed Dimensions or separate Data Marts
• For users to report and visualize
• By interacting directly with the model
• Or in most cases, through a BI application
60
The Last Mile: Delivering the analytics to users
61
OLAP Cubes technology
62
References & Further Reading
63