Chapter2 Isdp II
Engineering
• The field of Machine Learning (ML) has been consistently evolving since Data Science started gaining traction in 2012. However, I believe 2018 was a critical inflection point in the ML industry. After helping Insight Fellows build dozens of ML products to get roles on applied ML teams, and reading through both corporate and academic published research, I’ve seen more need for engineering skills than ever before.
• A common warning shared with aspiring Data Scientists is that 90% of the work is about gathering and cleaning data, or validating, deploying, and monitoring models. If that is the case, why are 90% of the frameworks and GitHub repositories (see this list for example) focused on model building?
• A part of the job that demands so much of a practitioner’s time should have proper tooling support.
https://www.kdnuggets.com/2019/03/most-impactful-ai-trends-2018-rise-ml-engineering.html?fbclid=IwAR3Ocrq9m2ci4ofDzrXzp4xaJ0GKQJt_iXLNZY-zd7wrS2uUF4P5avCQNCE#.XHrPCIlrgYQ.facebook

Dimensional Modelling
• Contains the same information as the normalized model
• Has far fewer tables
• Grouped in coherent business categories
• Pre-joins hierarchies and lookup tables resulting in fewer join paths and fewer intermediate tables
• Normalized fact table with denormalized dimension tables.

• A denormalized relational model
• Made up of tables with attributes
• Relationships defined by keys and foreign keys
• Organized for understandability and ease of reporting rather than update
• Queried and maintained by SQL or special purpose management tools.
Entity-Relationship vs. Dimensional Models
Entity-Relationship model
• One table per entity
• Minimize data redundancy
• Optimize update
• The Transaction Processing Model
Dimensional model
• One fact table for data organization
• Maximize understandability
• Optimized for retrieval
• The data warehousing model

Fact Tables
• Contains two or more foreign keys
• Tend to have huge numbers of records
• Useful facts tend to be numeric and additive

Measurements associated with a specific business process
• Grain: level of detail of the table
• Process events produce fact records
• Facts (attributes) are usually
  • Numeric
  • Additive
• Derived facts included
• Foreign (surrogate) keys refer to dimension tables (entities)
• Classification values help define subsets

Dimension Tables
• Contain text and descriptive information
• 1 in a 1-M relationship
• Generally the source of interesting constraints
• Typically contain the attributes for the SQL answer set.
Strengths of the Dimensional Model (according to Kimball)
• Predictable, standard framework
• Respond well to changes in user reporting needs
• Relatively easy to add data without reloading tables
• Standard design approaches have been developed
• There exist a number of products supporting the dimensional model

Dimension Tables
Entities describing the objects of the process
• Conformed dimensions cross processes
• Attributes are descriptive
  • Text
  • Numeric
• Surrogate keys
• Less volatile than facts (1:m with the fact table)
• Null entries
• Date dimensions
• Produce “by” questions
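To make the split between fact and dimension tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (dim_product, dim_date, fact_sales), the columns and every row are hypothetical and are not taken from the slides. The query at the end is one way a "by" question (sales by category by month) might be answered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, one surrogate key each (hypothetical)
    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,
        product_desc  TEXT,
        category_desc TEXT
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        cal_month TEXT,
        cal_year  INTEGER
    );
    -- Fact table: foreign keys to the dimensions plus a numeric, additive fact
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product(product_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        sales_amount INTEGER
    );
    INSERT INTO dim_product VALUES
        (1, 'Running shoe', 'Shoes'), (2, 'Boot', 'Shoes'), (3, 'T-shirt', 'Clothing');
    INSERT INTO dim_date VALUES (1, 'January', 2019), (2, 'February', 2019);
    INSERT INTO fact_sales VALUES (1, 1, 400), (2, 1, 350), (3, 1, 120), (1, 2, 700);
""")

# A "by" question: sales BY category BY month.
# The dimensions supply the grouping labels, the fact table supplies the numbers.
sales_by_category_month = conn.execute("""
    SELECT p.category_desc, d.cal_month, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key = f.date_key
    GROUP BY p.category_desc, d.cal_month
    ORDER BY p.category_desc, d.cal_month
""").fetchall()

for row in sales_by_category_month:
    print(row)
```

Each dimension is one join away from the fact table, and the additive fact can be summed along any combination of dimension attributes.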
What is a Model?
• Definitions of ‘Model’ abound
  • “the act of representing something (usually on a smaller scale)”
• Properties
  • They aren’t real
  • Their function is to aid communication, between users, technologists, machines

Models and Model Types
[Figure: Conceptual Model, Logical Model and Physical Model, ranging from Business Users to Technicians]

The Basics
structures.
Total 10,010 6,020 1,310 7,330 2,680
Data mart SQL tends to consist of complex queries affecting a large number of tables and columns and returning large result sets. A simple structure can improve query performance.
• The data set may be said to be arranged to have two dimensions: a row-arranged month dimension and a column-arranged measures.
Design Issues
Relational and Multidimensional Models
• Denormalized and indexed relational models more flexible
• Multidimensional models simpler to use and more efficient

Pivot Table - Three Dimensions
• Now, let’s add a THIRD Contextual Dimension to the same spreadsheet – Products. The spreadsheet now highlights that the Date and Measures data presented relates specifically to the Product Category – Shoes.

Context: Product: shoes
Columns: Measures: all
Rows: Time: Months

Month       Sales   Direct Costs   Indirect Costs   Total Costs   Margin
January       750            420              100           520      230
February      700            500              110           610       90
March         810            530               90           620      190
April         820            450              130           580      240
May           900            410               80           490      410
June          930            630              130           760      170
July          890            540              100           640      250
August        740            550              110           660       80
September     840            470              120           590      250
October       900            520              150           670      230
November      830            430              100           530      300
December      900            570               90           660      240
Total      10,010          6,020            1,310         7,330    2,680
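The Shoes spreadsheet can be read as one slice of a three-dimensional data set. The sketch below stores the three input measures keyed by (product, month), fixes the product as the context, and derives Total Costs and Margin. The figures are copied from the table. The dictionary layout and the function name slice_for_product are illustrative assumptions.

```python
# Cells keyed by (product, month) -> (sales, direct costs, indirect costs).
# Only the "Shoes" slice from the table is loaded; other products would add more keys.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
SHOES = [(750, 420, 100), (700, 500, 110), (810, 530, 90), (820, 450, 130),
         (900, 410, 80), (930, 630, 130), (890, 540, 100), (740, 550, 110),
         (840, 470, 120), (900, 520, 150), (830, 430, 100), (900, 570, 90)]
cube = {("Shoes", month): cell for month, cell in zip(MONTHS, SHOES)}


def slice_for_product(cube, product):
    """Fix the product (context) dimension; return month rows with all measures."""
    view = {}
    for (prod, month), (sales, direct, indirect) in cube.items():
        if prod != product:
            continue
        total_costs = direct + indirect          # derived measure
        view[month] = {
            "Sales": sales,
            "Direct Costs": direct,
            "Indirect Costs": indirect,
            "Total Costs": total_costs,
            "Margin": sales - total_costs,       # derived measure
        }
    return view


shoes_view = slice_for_product(cube, "Shoes")
measure_names = ["Sales", "Direct Costs", "Indirect Costs", "Total Costs", "Margin"]
totals = {name: sum(row[name] for row in shoes_view.values()) for name in measure_names}
print(totals)
```

The computed totals agree with the Total row of the table (10,010 / 6,020 / 1,310 / 7,330 / 2,680), which is a quick check that Total Costs and Margin are derived from the other three measures.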
Identify the data structure, attributes and constraints for the client’s data warehousing environment.
• Stable
• Optimized for update
• Flexible

As always in life, there are some disadvantages to 3NF:
• Performance can be truly awful. Most of the work that is performed on denormalizing a data model is an attempt to reach performance objectives.
• The structure can be overwhelmingly complex. We may wind up creating many small relations which the user might think of as a single relation or group of data.
The 4 Step Design Process
• Choose the Data Mart
• Declare the Grain
• Choose the Dimensions
• Choose the Facts

Building a Data Warehouse from a Normalized Database
The steps
• Develop a normalized entity-relationship business model of the data warehouse.
• Translate this into a dimensional model. This step reflects the information and analytical characteristics of the data warehouse.
• Translate this into the physical model. This reflects the changes necessary to reach the stated performance objectives.
• The first step is the development of the structural dimensions. This step corresponds very closely to what we normally do in a relational database.
• The star architecture that we will develop here depends upon taking the central intersection entities as the fact tables and building the foreign key => primary key relations as dimensions.

• Select an associative entity for a fact table
• Determine granularity
• Replace operational keys with surrogate keys
• Promote the keys from all hierarchies to the fact table
• Add date dimension
• Split all compound attributes
• Add necessary categorical dimensions
• Fact (varies with time) / Attribute (constant)
Converting an E-R Diagram
• Determine the purpose of the mart
• Identify an association table as the central fact table
• Determine facts to be included
• Replace all keys with surrogate keys
• Promote foreign keys in related tables to the fact table
• Add time dimension
• Refine the dimension tables

Choosing the Mart
• A set of related fact and dimension tables
• Single source or multiple source
• Conformed dimensions
• Typically have a fact table for each process
Represent a process or reporting environment that is of value to the organization
• It is important to determine the identity of the fact table and specify exactly what it represents.
• Typically correspond to an associative entity in the E-R model

The grain determines what each fact record represents: the level of detail.
• For example
  • Individual transactions
  • Snapshots (points in time)
  • Line items on a document
• Generally better to focus on the smallest grain
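One reason the smallest grain is usually preferred is that fine-grained facts can always be rolled up, while a coarser table cannot be broken back down. A minimal sketch with hypothetical line-item records (all names and values are invented for illustration):

```python
from collections import defaultdict

# Fact records at the finest grain: one row per line item on an order (hypothetical data).
line_items = [
    {"order_id": 1, "line": 1, "product": "Shoes", "amount": 150},
    {"order_id": 1, "line": 2, "product": "Socks", "amount": 80},
    {"order_id": 2, "line": 1, "product": "Shoes", "amount": 90},
]

# Rolling up to a coarser grain (one row per order) is a simple aggregation.
amount_per_order = defaultdict(int)
for item in line_items:
    amount_per_order[item["order_id"]] += item["amount"]

# The same fine-grained rows also answer a question the order grain cannot: sales by product.
amount_per_product = defaultdict(int)
for item in line_items:
    amount_per_product[item["product"]] += item["amount"]

print(dict(amount_per_order))
print(dict(amount_per_product))
```

Had the fact table been declared at order grain, the per-product breakdown would not be recoverable from it.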
Facts
Measurements associated with fact table records at fact table granularity
• Normally numeric and additive
• Non-key attributes in the fact table
• Attributes in dimension tables are constants. Facts vary with the granularity of the fact table

Dimensions
A table (or hierarchy of tables) connected with the fact table with keys and foreign keys
• Preferably single valued for each fact record (1:m)
• Connected with surrogate (generated) keys, not operational keys
• Dimension tables contain text or numeric attributes
maintenance efforts because of the more lookup tables.

Star Schema
• High level of data redundancy
• A single dimension table contains aggregated data.
• Cube processing is faster.
• Offers higher performing queries using Star Join Query Optimization. Tables may be connected with multiple dimensions.

Snowflake Schema
• Very low level of data redundancy
• Data is split into different dimension tables.
• Cube processing might be slow because of the complex join.
• The snowflake schema is represented by a centralized fact table that is unlikely to be connected with multiple dimensions.
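The redundancy and join trade-off in this comparison can be sketched with sqlite3. The tables below (star_product, snow_product, snow_category, fact_sales) and their rows are hypothetical. The same question is answered against a star-style dimension and a snowflaked one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product_key INTEGER, amount INTEGER);
    INSERT INTO fact_sales VALUES (1, 100), (2, 250), (3, 40);

    -- Star: one denormalized dimension; the category text repeats on every product row
    CREATE TABLE star_product (
        product_key INTEGER PRIMARY KEY, product_desc TEXT, category_desc TEXT);
    INSERT INTO star_product VALUES
        (1, 'Running shoe', 'Shoes'), (2, 'Boot', 'Shoes'), (3, 'T-shirt', 'Clothing');

    -- Snowflake: the category is moved out to its own lookup table
    CREATE TABLE snow_category (category_key INTEGER PRIMARY KEY, category_desc TEXT);
    CREATE TABLE snow_product (
        product_key INTEGER PRIMARY KEY, product_desc TEXT,
        category_key INTEGER REFERENCES snow_category(category_key));
    INSERT INTO snow_category VALUES (10, 'Shoes'), (20, 'Clothing');
    INSERT INTO snow_product VALUES
        (1, 'Running shoe', 10), (2, 'Boot', 10), (3, 'T-shirt', 20);
""")

# Star: a single join from the fact table reaches the category attribute.
star_rows = conn.execute("""
    SELECT p.category_desc, SUM(f.amount)
    FROM fact_sales f
    JOIN star_product p ON p.product_key = f.product_key
    GROUP BY p.category_desc ORDER BY p.category_desc
""").fetchall()

# Snowflake: the same answer needs one extra join through the lookup table.
snow_rows = conn.execute("""
    SELECT c.category_desc, SUM(f.amount)
    FROM fact_sales f
    JOIN snow_product p  ON p.product_key = f.product_key
    JOIN snow_category c ON c.category_key = p.category_key
    GROUP BY c.category_desc ORDER BY c.category_desc
""").fetchall()

print(star_rows)
print(snow_rows)
```

Both queries return the same totals. The star version stores 'Shoes' twice, while the snowflake version stores it once at the cost of a second join.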
As you can see in the above figure, there are two fact tables:
1. Revenue
2. Product
In a Galaxy schema, shared dimensions are called Conformed Dimensions.
Characteristics of Galaxy Schema:

What is Star Cluster Schema?
• Snowflake schema contains fully expanded hierarchies. However, this can add complexity to the schema and requires extra joins. On the other hand, star schema contains fully collapsed hierarchies, which may lead to redundancy. So, the best solution may be a balance between these two schemas, which is the star cluster schema design.
• Overlapping dimensions can be found as forks in hierarchies. A fork happens when an entity acts as a parent in two different dimensional hierarchies. Fork entities are then identified as classifications with one-to-many relationships.

Summary:
• Multidimensional schema is especially designed to model data warehouse systems
• The star schema is the simplest type of Data Warehouse schema. It is known as star schema as its structure resembles a star.
• A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. It is called snowflake because its diagram resembles a snowflake.
• In a star schema, only a single join creates the relationship between the fact table and any dimension table.
• Star schema contains a fact table surrounded by dimension tables.
• In a snowflake schema, the fact table is surrounded by dimension tables which are in turn surrounded by dimension tables.
• A snowflake schema requires many joins to fetch the data.
• A Galaxy Schema contains two fact tables that share dimension tables. It is also called Fact Constellation Schema.
• Star cluster schema contains attributes of Star schema and Snowflake schema.

Dimensional Modelling
• Based around ‘Measures’ (Fact Tables) that are constrained by ‘Dimensions’ (Dimension Tables).
• Very common in Data Warehouse applications.
• Can directly feed other tools such as MOLAP databases.
• Tend to have a very specific focus that is easy for users to understand
• Users will get confused with more than about six dimensions.
• The model is easy - the ‘Fact’ is unique by ALL of the ‘Dimensions’.
• Is concerned primarily with retrieval needs.
• Is almost always summarised, e.g. using SUM or MAX and so on.
Star Schema
• Basic form includes a central table with a number of descriptive tables joined directly
• Central table known as the Fact Table
• A simpler design that [...] for data retrieval
• All dimension tables completely denormalised
Always start with this simple form
[Figure labels: Fact Table, Dimension Tables, Relationships]

Star Schemas in a RDBMS
In most companies doing ROLAP, the DBAs have created [...] indexes and summary tables proliferate in order to [...] aggregations that the users perform, the build times and disk space needed to create them has grown [...]

[Figure: star schema centred on the Claim Transaction fact table]
Claim Transaction (fact): Customer ID (FK), Product ID (FK), Coverage ID (FK), Claim ID (FK), Claim Trans Type ID (FK), Catastrophe ID (FK), Entered Date, Claim Count, Amount
Customer: Customer ID
Claim: Claim ID
Coverage: Coverage ID, Policy Number, Coverage Effective Date, Coverage Expiration Date
Claim Transaction Type: Claim Trans Type ID, Claim Trans Type Code, Claim Trans Type Desc
Catastrophe: Catastrophe ID, Catastrophe Desc, Catastrophe Start Date, Catastrophe End Date, Catastrophe Location
Product: Product ID, Product Type Desc, Product Category Code, Product Category Desc, Product Desc
Calendar: Calendar ID, Cal Date, Cal Year, Cal Month, Cal YearMonth, Fin Year, Fin Month, Fin YearMonth

Fact Table
• Row population is sparse
  • A row exists only where there are non-zero measures
• Examples:
  • Sales
  • Counts
  • Percentage
• A surrogate is an artificial, numeric key generated from a pool of numbers inside the warehouse.
• Use these as Primary Keys for Dimensions.
• Will facilitate efficient Fact to Dimension joins
• Support Slowly changing dimension (next slide)
• If using surrogates then bring in the source system natural key along with another field to say which source system this value came from – i.e. put the context back.

• Hierarchies are quite disorganised in the vast majority of organisations.
• Multiple independent hierarchies often needed
• Hierarchies within the dimensions are very important
• Within the proper tool they enable “drill up/drill down”
  • e.g. day, week, month, quarter, year
  • e.g. Product, Product Category, Total Products
• Details usually need to be explicitly stored.
  • E.g. Decode all codes.
[Figure: Product category, Product (Bolt)]
Hierarchies – Simple, Static Hierarchies
• Simple, static hierarchies are best designed directly into the dimensions.
  • This is what was meant when we said we ‘de-normalised’ for a Star Schema
  • Easiest to use
  • Most efficient to query
  • e.g. Product, Product Category, Total Products
  • Example: Geography (Store, Branch, Region, [State, Country])
[Figure: Total Products, Product category, Product (Bolt)]

Hierarchies – Complex And/Or Dynamic Hierarchies
• If hierarchies are complex, if there are multiple hierarchies on a dimension or if the hierarchy changes often, it could be messy to design the hierarchy into the dimension.
  • Snowflake the dimension, creating one or more outboard hierarchy tables.
  • Changes to hierarchies do not affect the base dimension.
  • Multiple hierarchies can be represented with multiple tables or with a hierarchy ID column which must be filtered on in any query.
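The warning about filtering on the hierarchy ID can be illustrated with a small sqlite3 sketch. The outboard table store_hierarchy, the hierarchy IDs 'GEO' and 'MGMT', and all rows are hypothetical. Each store appears once per hierarchy, so a query that forgets the filter counts every fact row twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT);
    CREATE TABLE fact_sales (store_key INTEGER, amount INTEGER);
    -- Outboard hierarchy table: one row per store per hierarchy (hypothetical)
    CREATE TABLE store_hierarchy (hierarchy_id TEXT, store_key INTEGER, parent_name TEXT);

    INSERT INTO dim_store VALUES (1, 'Store 1'), (2, 'Store 2'), (3, 'Store 3');
    INSERT INTO fact_sales VALUES (1, 100), (2, 200), (3, 50);
    INSERT INTO store_hierarchy VALUES
        ('GEO', 1, 'North'),   ('GEO', 2, 'North'),   ('GEO', 3, 'South'),
        ('MGMT', 1, 'Team A'), ('MGMT', 2, 'Team B'), ('MGMT', 3, 'Team B');
""")

# Correct: pick exactly one hierarchy before rolling up.
geo_rollup = conn.execute("""
    SELECT h.parent_name, SUM(f.amount)
    FROM fact_sales f
    JOIN store_hierarchy h ON h.store_key = f.store_key
    WHERE h.hierarchy_id = 'GEO'
    GROUP BY h.parent_name ORDER BY h.parent_name
""").fetchall()

# Incorrect: without the filter every fact row joins to two hierarchy rows.
unfiltered_total = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN store_hierarchy h ON h.store_key = f.store_key
""").fetchone()[0]

true_total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(geo_rollup, unfiltered_total, true_total)
```

Here the true total is 350 but the unfiltered join reports 700. Changing a store's region only touches store_hierarchy, which leaves dim_store untouched.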
• An architecture that permits aggregating data across multiple marts
• Conformed dimensions and attributes
• Drill Down vs. Drill Across
• Bus matrix

A surrogate key is a unique identifier for data warehouse records that replaces source primary keys (business/natural keys)
• Protect against changes in source systems
• Allow integration from multiple sources
• Enable rows that do not exist in source data
• Track changes over time (e.g. new customer instances when addresses change)
• Replace text keys with integers for efficiency
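As a rough sketch of how surrogate keys might be handed out, the class below keeps a pool of integers and remembers which (source system, natural key) pair received which number. The class name SurrogateKeyGenerator, the source system names and the key values are hypothetical. A real warehouse would typically persist this mapping in a table rather than in memory.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Assign integer surrogate keys, remembering the source system and natural key."""

    def __init__(self):
        self._pool = count(1)      # the "pool of numbers inside the warehouse"
        self._assigned = {}        # (source_system, natural_key) -> surrogate key

    def key_for(self, source_system, natural_key):
        lookup = (source_system, natural_key)
        if lookup not in self._assigned:
            self._assigned[lookup] = next(self._pool)
        return self._assigned[lookup]


keys = SurrogateKeyGenerator()
crm_key = keys.key_for("CRM", "C-1001")          # first sighting: new surrogate
billing_key = keys.key_for("BILLING", "C-1001")  # same text key, different source system
crm_again = keys.key_for("CRM", "C-1001")        # repeat lookup: same surrogate as before
print(crm_key, billing_key, crm_again)
```

Keeping the source system next to the natural key lets two systems reuse the same identifier without colliding, and the fact table only ever stores the compact integer.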
Slowly Changing Dimensions
• A SCD is a dimension that stores and manages both current and historical data over time in a data warehouse.
• It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

Attributes in a dimension that change more slowly than the fact granularity (Addresses, Managers, etc.)
• Type 1: Current only. Store only the current value
• Type 2: All history. Create a dimension record for each value (with or without date stamps)
• Type 3: Most recent few (rare). Create an attribute in the dimension record for previous value
Note: rapidly changing dimensions usually indicate the presence of a business process that should be tracked as a separate dimension or as a fact table
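The three types can be contrasted on a single address change. This is a minimal in-memory sketch, not an ETL implementation. The row layout (customer_key, customer_id, address, previous_address, valid_from, valid_to), the function names and the sample values are all assumptions made for illustration.

```python
import copy
from datetime import date

# One customer dimension row before the change (hypothetical values).
base = [{"customer_key": 1, "customer_id": "C-1001", "address": "1 Old Road",
         "previous_address": None, "valid_from": date(2018, 1, 1), "valid_to": None}]


def scd_type1(rows, customer_id, new_address):
    """Type 1: overwrite in place, so only the current value survives."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["address"] = new_address
    return rows


def scd_type2(rows, customer_id, new_address, changed_on):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    current = next(r for r in rows
                   if r["customer_id"] == customer_id and r["valid_to"] is None)
    current["valid_to"] = changed_on
    new_key = max(r["customer_key"] for r in rows) + 1
    rows.append({**current, "customer_key": new_key, "address": new_address,
                 "valid_from": changed_on, "valid_to": None})
    return rows


def scd_type3(rows, customer_id, new_address):
    """Type 3: keep the previous value in an extra attribute on the same row."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_address"] = row["address"]
            row["address"] = new_address
    return rows


type1 = scd_type1(copy.deepcopy(base), "C-1001", "2 New Street")
type2 = scd_type2(copy.deepcopy(base), "C-1001", "2 New Street", date(2019, 6, 1))
type3 = scd_type3(copy.deepcopy(base), "C-1001", "2 New Street")
print(len(type1), len(type2), len(type3))
```

Only Type 2 adds a row, which is why it depends on surrogate keys: both versions share the natural key C-1001 but need distinct primary keys so that old facts keep pointing at the old address.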
• Dimensions without attributes. (Such as a transaction number or order number.)
• Put the attribute value into the fact table even though it is not an additive fact.

• Connects entities to dimension tables rather than the fact table
• Complicates coding and requires additional processing for retrievals
• Makes type 2 slowly changing dimensions harder to maintain
• Useful for seldom used lookups
• Fact to Dimension
• Dimension to Dimension
• Try to avoid these. Solutions can be very misleading.
[Figure: bridge table between the ORDERS fact table and the SALESREP dimension]
SALESREP: SalesRepKey, Name, Address
ORDERS (FACT): SalesRepKey, ProductKey, SalesRepGrpKey, CustomerKey, OrderQty
SALESREP-ORDER-BRIDGE: SalesRepKey, SalesrepGroupKey, Weight = (1/NumReps)
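The Weight = 1/NumReps column in the bridge is what keeps totals honest when several sales reps share one order. A hedged sketch with sqlite3 follows. The table and column names are adapted from the figure, while the rows, the rep names and the exact join layout are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesrep (sales_rep_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER, sales_rep_grp_key INTEGER, order_qty INTEGER);
    -- Bridge: one row per rep in each group, weight = 1 / number of reps in the group
    CREATE TABLE salesrep_order_bridge (
        sales_rep_grp_key INTEGER, sales_rep_key INTEGER, weight REAL);

    INSERT INTO salesrep VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 100, 10), (2, 200, 6);
    INSERT INTO salesrep_order_bridge VALUES
        (100, 1, 0.5), (100, 2, 0.5),   -- group 100: two reps share order 1
        (200, 1, 1.0);                  -- group 200: Ana alone
""")

JOIN = """
    FROM orders o
    JOIN salesrep_order_bridge b ON b.sales_rep_grp_key = o.sales_rep_grp_key
    JOIN salesrep s              ON s.sales_rep_key = b.sales_rep_key
    GROUP BY s.name ORDER BY s.name
"""

# Weighted: each rep is credited with a share, so shares add up to the true total.
weighted = conn.execute("SELECT s.name, SUM(o.order_qty * b.weight)" + JOIN).fetchall()

# Unweighted: every rep is credited with the full order, so the sum is inflated.
unweighted = conn.execute("SELECT s.name, SUM(o.order_qty)" + JOIN).fetchall()

true_total = conn.execute("SELECT SUM(order_qty) FROM orders").fetchone()[0]
print(weighted, unweighted, true_total)
```

With these rows the weighted shares add up to the 16 units actually ordered, while the unweighted figures add up to 26. This is the kind of misleading result the slide warns about.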
Hierarchies
Group data within dimensions:
• Region
• State
• County
• Neighborhood
Problem structures
• Variable depth
• Frequently changing
[Figure label: SalesRep]

Heterogeneous Products
• Several different kinds of entry with different attributes for each
• (The sub-class problem)

• Dimensions that represent data at different levels of granularity
  • Remove a dimension
  • Roll up the hierarchy (provide a new shrunken dimension with new surr-key that represents rolled up data)

• Miscellaneous attributes that don’t belong to another entity, usually representing processing levels
  • Flags
  • Categories
  • Types
Fact Tables Aggregates