
Most impactful AI trends of 2018: The rise of ML Engineering

• The field of Machine Learning (ML) has been consistently evolving since
Data Science started gaining traction in 2012. However, I believe 2018
was a critical inflection point in the ML industry. After helping Insight
Fellows build dozens of ML products to get roles on applied ML teams, and
reading through both corporate and academic published research, I've seen
more need for engineering skills than ever before.
• A common warning shared with aspiring Data Scientists is that 90% of
the work is about gathering and cleaning data, or validating, deploying,
and monitoring models. If that is the case, why are 90% of the
frameworks and Github repositories (see this list for example) focused
on model building?
• A part of the job that demands so much of a practitioner's time should
have proper tooling support.
https://www.kdnuggets.com/2019/03/most-impactful-ai-trends-2018-rise-ml-engineering.html?fbclid=IwAR3Ocrq9m2ci4ofDzrXzp4xaJ0GKQJt_iXLNZY-zd7wrS2uUF4P5avCQNCE#.XHrPCIlrgYQ.facebook

• More and more people realize that nobody needs yet another library
or tutorial to build a 3-layer neural network on MNIST. Consequently,
many startups have entered the space of data and model infrastructure,
management and deployment, and educational resources have started
to focus on these aspects more. This is why I fundamentally believe
that 2019 will be the year of ML Engineering.
• When it comes to recruiting, Hiring Managers of teams all over the
valley most often complain that while there is no shortage of people
able to train models on a dataset, they need engineers that can build
data-driven products.
• At the same time, most aspiring Data Scientists and ML Engineers are
most excited about training models on provided datasets. This
excitement is usually inspired by blogs and courses that have focused on
that part of the work, instead of data gathering/labeling/cleaning and
model deployment.

Looking forward: 2019 resolutions

• The march towards ML Engineering has already sped up in 2018, and at
Insight, we've been hard at work to help more people transition to the
field successfully (learn more here!). In addition, throughout this year,
I will focus more of my writing on the 90% of ML work that has been
ignored so far, but that is so crucial to success in industry.

Bio: Emmanuel Ameisen (@EmmanuelAmeisen) is Head of AI at Insight Data Science.
The Dimensional Data Model

An alternative to the normalized data model


• Present information as simply as possible (easier to
understand)
• Return queries as quickly as possible (efficient for
queries)
• Track the underlying business processes (process
focused)

The Dimensional Data Model

• Contains the same information as the normalized model
• Has far fewer tables
• Grouped in coherent business categories
• Pre-joins hierarchies and lookup tables, resulting in fewer join paths
and fewer intermediate tables
• Normalized fact table with denormalized dimension tables

Dimensional Models

• A denormalized relational model
• Made up of tables with attributes
• Relationships defined by keys and foreign keys
• Organized for understandability and ease of reporting rather than update
• Queried and maintained by SQL or special-purpose management tools
Entity-Relationship vs. Dimensional Models

• Entity-Relationship: one table per entity; minimize data redundancy;
optimize for update; the transaction processing model
• Dimensional: one fact table for data organization; maximize
understandability; optimized for retrieval; the data warehousing model

Fact Tables

• Contain two or more foreign keys
• Tend to have huge numbers of records
• Useful facts tend to be numeric and additive

Fact Tables

Measurements associated with a specific business process
• Grain: level of detail of the table
• Process events produce fact records
• Facts (attributes) are usually
  • Numeric
  • Additive
• Derived facts included
• Foreign (surrogate) keys refer to dimension tables (entities)

Dimension Tables

• Contain text and descriptive information
• 1 in a 1-M relationship
• Generally the source of interesting constraints
• Typically contain the attributes for the SQL answer set
• Classification values help define subsets
Dimension Tables

Entities describing the objects of the process
• Conformed dimensions cross processes
• Attributes are descriptive
  • Text
  • Numeric
  • Surrogate keys
• Less volatile than facts (1:m with the fact table)
• Null entries
• Date dimensions
• Produce "by" questions

Strengths of the Dimensional Model (according to Kimball)

• Predictable, standard framework
• Responds well to changes in user reporting needs
• Relatively easy to add data without reloading tables
• Standard design approaches have been developed
• A number of products exist supporting the dimensional model

Models and Model Types

What is a Model?
• Definitions of 'Model' abound
  • "the act of representing something (usually on a smaller scale)"
• Properties
  • They aren't real
  • Their function is to aid communication between users, technologists,
and machines

"All models are wrong, some models are useful."

Are there different types of 'Data Models'?
• Several levels of 'Data Models' are usually used, and each has, as a
focus, a different audience
• These were covered in the Data Modelling course. All three types are
applicable to Dimensional Modelling also; however, Logical and Physical
are more prominent.

[Diagram: Conceptual, Logical, and Physical models, with audiences
ranging from Business Users to Technicians]

The Basics

Dimensional Modelling vs Normalisation

• Normalisation is good for the middle layer of a 3-tier DW design
  • Minimal redundancy improves maintainability – data is updated in
one place.
  • Normalised form can unify a diversity of enterprise data sources in
a flexible manner.
• Denormalisation is good for Business Intelligence
  • Minimal redundancy is not necessary because data is derived from
other sources, not directly maintained in dimensional form.
  • Redundancy improves comprehension and usability of data structures.
  • Data mart SQL tends to consist of complex queries affecting a large
number of tables and columns and returning large result sets. A simple
structure can improve query performance.

Spreadsheet – Two Dimensions

• Let us start with a typical example of two-dimensional data. Anything
that you track, whether it is hours per employee, costs per department,
balance per customer, or complaints per store, can be arranged in a
two-dimensional format.

Month      Sales  Direct Costs  Indirect Costs  Total Costs  Margin
January      750           420             100          520     230
February     700           500             110          610      90
March        810           530              90          620     190
April        820           450             130          580     240
May          900           410              80          490     410
June         930           630             130          760     170
July         890           540             100          640     250
August       740           550             110          660      80
September    840           470             120          590     250
October      900           520             150          670     230
November     830           430             100          530     300
December     900           570              90          660     240
Total     10,010         6,020           1,310        7,330   2,680

• The data set may be said to be arranged in two dimensions: a
row-arranged Month dimension and a column-arranged Measures dimension.
Pivot Table – Three Dimensions

• Now, let's add a THIRD contextual dimension to the same spreadsheet –
Products. The spreadsheet now highlights that the Date and Measures data
presented relates specifically to the Product Category – Shoes.

context: Product: shoes
columns: Measures: all
rows: Time: Months

[Same monthly table as above, now presented in the context of a single
Product value – shoes]

Design Issues

Relational and Multidimensional Models
• Denormalized and indexed relational models are more flexible
• Multidimensional models are simpler to use and more efficient

The Business Model

Identify the data structure, attributes and constraints for the client's
data warehousing environment.
• Stable
• Optimized for update
• Flexible

As always in life, there are some disadvantages to 3NF:
• Performance can be truly awful. Most of the work that is performed on
denormalizing a data model is an attempt to reach performance objectives.
• The structure can be overwhelmingly complex. We may wind up creating
many small relations which the user might think of as a single relation
or group of data.
The 4 Step Design Process

• Choose the Data Mart
• Declare the Grain
• Choose the Dimensions
• Choose the Facts

Building a Data Warehouse from a Normalized Database

The steps:
• Develop a normalized entity-relationship business model of the data
warehouse.
• Translate this into a dimensional model. This step reflects the
information and analytical characteristics of the data warehouse.
• Translate this into the physical model. This reflects the changes
necessary to reach the stated performance objectives.

Structural Dimensions

• The first step is the development of the structural dimensions. This
step corresponds very closely to what we normally do in a relational
database.
• The star architecture that we will develop here depends upon taking
the central intersection entities as the fact tables and building the
foreign key => primary key relations as dimensions.

Steps in dimensional modeling

• Select an associative entity for a fact table
• Determine granularity
• Replace operational keys with surrogate keys
• Promote the keys from all hierarchies to the fact table
• Add date dimension
• Split all compound attributes
• Add necessary categorical dimensions
• Fact (varies with time) / Attribute (constant)
Choosing the Mart

• A set of related fact and dimension tables
• Single source or multiple source
• Conformed dimensions
• Typically have a fact table for each process

Converting an E-R Diagram

• Determine the purpose of the mart
• Identify an association table as the central fact table
• Determine facts to be included
• Replace all keys with surrogate keys
• Promote foreign keys in related tables to the fact table
• Add time dimension
• Refine the dimension tables

Fact Tables

Represent a process or reporting environment that is of value to the
organization
• It is important to determine the identity of the fact table and
specify exactly what it represents.
• Typically correspond to an associative entity in the E-R model

Grain (unit of analysis)

The grain determines what each fact record represents: the level of
detail.
• For example
  • Individual transactions
  • Snapshots (points in time)
  • Line items on a document
• Generally better to focus on the smallest grain
Facts

Measurements associated with fact table records at fact table granularity
• Normally numeric and additive
• Non-key attributes in the fact table
• Attributes in dimension tables are constants. Facts vary with the
granularity of the fact table

Dimensions

A table (or hierarchy of tables) connected with the fact table with keys
and foreign keys
• Preferably single-valued for each fact record (1:m)
• Connected with surrogate (generated) keys, not operational keys
• Dimension tables contain text or numeric attributes

Schema Types

What is a Star Schema?

• The star schema is the simplest type of Data Warehouse schema. It is
known as a star schema because its structure resembles a star. In a star
schema, the center of the star can have one fact table and a number of
associated dimension tables. It is also known as the Star Join Schema
and is optimized for querying large data sets.
• For example, in a typical diagram the fact table is at the center and
contains keys to every dimension table, like Deal_ID, Model ID, Date_ID,
Product_ID, Branch_ID, and other attributes like Units sold and revenue.

Characteristics of a Star Schema:
• Every dimension in a star schema is represented with only one
dimension table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized. For instance, Country_ID does
not have a Country lookup table as an OLTP design would have.
• The schema is widely supported by BI tools.
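The single-join characteristic can be sketched with a tiny, hypothetical star schema. All table and column names below are invented for illustration, and SQLite (via Python's sqlite3 module) stands in for a warehouse RDBMS:

```python
import sqlite3

# In-memory database with a minimal star schema: one fact table and
# two denormalised dimension tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_desc TEXT,
    category_desc TEXT          -- denormalised: no separate category table
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    cal_date TEXT,
    cal_month TEXT
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER,         -- additive measure
    revenue REAL                -- additive measure
);
""")
cur.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "Boots", "Shoes"), (2, "Sandals", "Shoes")])
cur.executemany("INSERT INTO dim_date VALUES (?,?,?)",
                [(1, "2019-01-15", "2019-01"), (2, "2019-02-10", "2019-02")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 1, 10, 750.0), (2, 1, 5, 200.0), (1, 2, 8, 600.0)])

# A single join per dimension is enough to give the fact its context.
rows = cur.execute("""
    SELECT d.cal_month, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.cal_month
    ORDER BY d.cal_month
""").fetchall()
print(rows)   # [('2019-01', 15), ('2019-02', 8)]
```

Note how the additive measure (units_sold) can be summed directly once the single join to the date dimension supplies the "by month" grouping.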

What is a Snowflake Schema?

• A Snowflake Schema is an extension of a Star Schema, and it adds
additional dimension tables. It is called a snowflake because its
diagram resembles a snowflake.
• The dimension tables are normalized, which splits data into additional
tables. In the following example, Country is further normalized into an
individual table.
Characteristics of a Snowflake Schema:

• The main benefit of the snowflake schema is that it uses smaller disk
space.
• It is easier to implement when a dimension is added to the schema.
• Due to multiple tables, query performance is reduced.
• The primary challenge that you will face while using the snowflake
schema is that you need to perform more maintenance effort because of
the additional lookup tables.

Star vs Snowflake Schema: Key Differences

• Hierarchies for the dimensions are stored in the dimension table
(star) vs. divided into separate tables (snowflake).
• Star: a fact table surrounded by dimension tables. Snowflake: one fact
table surrounded by dimension tables which are in turn surrounded by
dimension tables.
• In a star schema, a single join creates the relationship between the
fact table and any dimension table; a snowflake schema requires many
joins to fetch the data.
• Simple DB design (star) vs. very complex DB design (snowflake).
• Denormalized data structure, so queries also run faster (star) vs.
normalized data structure (snowflake).
• High level of data redundancy (star) vs. very low level of data
redundancy (snowflake).
• A single dimension table contains aggregated data (star) vs. data
split into different dimension tables (snowflake).
• Cube processing is faster (star) vs. cube processing that might be
slow because of the complex joins (snowflake).
• The star schema offers higher-performing queries using Star Join Query
Optimization, and tables may be connected with multiple dimensions; the
snowflake schema is represented by a centralized fact table which is
unlikely to be connected with multiple dimensions.

What is a Galaxy Schema?

• A Galaxy Schema contains two (or more) fact tables that share
dimension tables. It is also called a Fact Constellation Schema. The
schema is viewed as a collection of stars, hence the name Galaxy Schema.
• In the example figure, there are two fact tables:
1. Revenue
2. Product
• In a Galaxy schema, shared dimensions are called Conformed Dimensions.
Characteristics of a Galaxy Schema:

• The dimensions in this schema are separated into separate dimensions
based on the various levels of hierarchy.
• For example, if geography has four levels of hierarchy, like region,
country, state, and city, then the Galaxy schema should have four
dimensions.
• Moreover, it is possible to build this type of schema by splitting one
star schema into more star schemas.
• The dimensions are large in this schema, which is needed to build
based on the levels of hierarchy.
• This schema is helpful for aggregating fact tables for better
understanding.

What is a Star Cluster Schema?

• A snowflake schema contains fully expanded hierarchies. However, this
can add complexity to the schema and requires extra joins. On the other
hand, a star schema contains fully collapsed hierarchies, which may lead
to redundancy. So the best solution may be a balance between these two
schemas, which is the star cluster schema design.
• Overlapping dimensions can be found as forks in hierarchies. A fork
happens when an entity acts as a parent in two different dimensional
hierarchies. Fork entities are then identified as classifications with
one-to-many relationships.

Summary:

• A multidimensional schema is especially designed to model data
warehouse systems.
• The star schema is the simplest type of Data Warehouse schema. It is
known as a star schema because its structure resembles a star.
• A Snowflake Schema is an extension of a Star Schema, and it adds
additional dimension tables. It is called a snowflake because its
diagram resembles a snowflake.
• In a star schema, only a single join creates the relationship between
the fact table and any dimension table.
• A star schema contains a fact table surrounded by dimension tables.
• In a snowflake schema, the fact table is surrounded by dimension
tables which are in turn surrounded by dimension tables.
• A snowflake schema requires many joins to fetch the data.
• A Galaxy Schema contains two fact tables that share dimension tables.
It is also called a Fact Constellation Schema.
• A star cluster schema contains attributes of the star schema and the
snowflake schema.

Dimensional Modelling

• Based around 'Measures' (Fact Tables) that are constrained by
'Dimensions' (Dimension Tables).
  • Very common in Data Warehouse applications.
  • Can directly feed other tools such as MOLAP databases.
• Tend to have a very specific focus that is easy for users to
understand.
  • Users will get confused with more than about six dimensions.
• The model is easy – the 'Fact' is unique by ALL of the 'Dimensions'.
• Is concerned primarily with retrieval needs.
• Is almost always summarised, e.g. using SUM or MAX and so on.

Star Schemas in an RDBMS

In most companies doing ROLAP, the DBAs have created countless indexes
and summary tables in order to avoid I/O-intensive table scans against
large fact tables. As the indexes and summary tables proliferate in
order to optimize performance for the known queries and aggregations
that the users perform, the build times and disk space needed to create
them have grown enormously, often requiring more time than is allotted
and more space than the original data!

Star Schema

Always start with this simple form.
• Basic form includes a central table with a number of descriptive
tables joined directly
• Central table known as the Fact table
• Satellite tables known as Dimension tables
• A simpler design that can be easily optimized for data retrieval
• All dimension tables completely denormalised
• All dimension tables relate directly to the fact table
• The Grain is set by the dimensions
• Best for straightforward modelling requirements

[Diagram: an example insurance star schema – a Claim Transaction fact
table (Customer ID, Product ID, Coverage ID, Claim ID, Claim Trans Type
ID, Catastrophe ID, Claim Trans Date ID, Claim Count, Amount) joined
directly to Customer, Claim, Coverage, Product, Catastrophe, Claim
Transaction Type, and Calendar dimension tables]
Snowflake Schema

• Described as a variant option
• One or more dimension tables are not completely denormalised
• Some data is represented in a snowflake or outrigger table
• Benefits
  • Useful for complex modelling situations, such as dynamic
hierarchies, or shared dimensions
• Drawbacks
  • More navigation is needed and it can get complicated quickly
• Avoid unless requirements demand it

[Diagram: the same insurance schema, but with Product Category split out
of the Product dimension into its own lookup table – a snowflaked
dimension]

When to Snowflake

Snowflake Schema
• Advantages:
  • Small saving in storage space
  • Normalized structures are easier to update and maintain
• Disadvantages:
  • Schema is less intuitive and end-users are put off by the complexity
  • Ability to browse through the contents is difficult
  • Degrades query performance because of additional joins

Snowflaking & Hierarchies
• Efficiency vs Space
• Understandability
• M:N relationships

What is the Best Design?

• Performance benchmarking can be used to determine what is the best
design.
• Snowflake schema: easier to maintain dimension tables when dimension
tables are very large (reduces overall space). It is not generally
recommended in a data warehouse environment.
• Star schema: more effective for data cube browsing (fewer joins),
which can affect performance.

Components of a Dimensional Model
Components of a Dimensional Model

• Fact Tables
• Dimension Tables
• Relationships
• Grain

[Diagram: the insurance star schema again, annotated with its Fact Table
(Claim Transaction), Dimension Tables (Customer, Claim, Coverage,
Product, Catastrophe, Claim Transaction Type, Calendar), and the
Relationships between them]

Components of a Dimensional Model – FACT Tables

• Contain measures
  • Usually numeric, measures quantify the business
  • Most useful measures are additive
  • Additive measures can be meaningfully added across rows
• Row population is sparse
  • A row exists only where there are non-zero measures
• Fact tables are not denormalised
• Examples:
  • Sales
  • Counts
  • Percentage

Components of a Dimensional Model – DIMENSION Tables

• Purpose of a Dimension – to add context to the Fact
• Contain attributes
  • Usually textual, attributes describe the business
  • Attributes are used for filtering and grouping
• Dimension tables are typically denormalised
  • Increases comprehension
  • Facilitates browsing
• Have unique primary keys to identify every row
• Example Dimensions
  • Date
  • Product

Components of a Dimensional Model – RELATIONSHIPS

• Relationships are one-to-many
  • Dimensions are parents (one)
  • Facts are children (many)
  • Any logical many-to-many relationships must be decomposed
• Fact tables contain foreign keys
  • Point to primary keys in dimension tables
• Referential Integrity is critical
  • Every fact must have a parent row in each dimension table
  • Violations lead to incorrect and inconsistent query results
  • If you have to, put a 'Not Applicable' value in the dimension so
that the Fact has something to link to.
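The 'Not Applicable' idea can be sketched with an invented promotion dimension (names and data are hypothetical; sqlite3 stands in for the warehouse): every fact gets a parent row, so an inner join never silently drops facts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_promotion (
    promotion_id INTEGER PRIMARY KEY,
    promotion_desc TEXT
);
CREATE TABLE fact_sales (
    promotion_id INTEGER REFERENCES dim_promotion(promotion_id),
    units_sold INTEGER
);
""")
# Row 0 is the 'Not Applicable' member: sales made without any
# promotion still have a parent row to link to.
cur.executemany("INSERT INTO dim_promotion VALUES (?,?)",
                [(0, "Not Applicable"), (1, "Winter Sale")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?)",
                [(1, 10), (0, 25)])

# Because every fact has a parent, the inner join loses no rows and the
# grand total stays correct.
rows = cur.execute("""
    SELECT p.promotion_desc, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_promotion p ON f.promotion_id = p.promotion_id
    GROUP BY p.promotion_desc
    ORDER BY p.promotion_desc
""").fetchall()
print(rows)   # [('Not Applicable', 25), ('Winter Sale', 10)]
```

Had the no-promotion facts carried a NULL key instead, the inner join would have dropped them and the query would have under-reported total sales.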
Components of a Dimensional Model – GRAIN

• 'Grain' is the fundamental atomic level of data to be represented in
the fact table.
• Business analysis discovers the level at which the data needs to be
represented.
• Fact grain is determined
  • Transaction is the finest grain
  • Data is aggregated if transaction grain is not needed
• Dimension grain is matched to fact grain
• Fine grains have performance implications – hardware must be adequate
to handle the load
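Rolling transaction-grain records up to a coarser grain can be sketched in a few lines of plain Python (the sample transactions are invented): each output row represents one (day, product) combination rather than one transaction.

```python
from collections import defaultdict

# Transaction grain: one record per individual sale event.
transactions = [
    ("2019-01-15", "Boots", 2),
    ("2019-01-15", "Boots", 3),
    ("2019-01-15", "Sandals", 1),
    ("2019-01-16", "Boots", 4),
]

# Aggregate to a coarser grain: one record per day per product.
daily = defaultdict(int)
for day, product, qty in transactions:
    daily[(day, product)] += qty

print(sorted(daily.items()))
# [(('2019-01-15', 'Boots'), 5), (('2019-01-15', 'Sandals'), 1),
#  (('2019-01-16', 'Boots'), 4)]
```

Once the roll-up is done, the individual-transaction detail is gone: you can no longer ask which single sale was the largest, which is exactly the flexibility cost of choosing a coarse grain.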

Inputs to Dimensional Modelling

• Requirements documents
• Source data models / database schemas
• Metadata / data dictionary
• Existing reports / analysis models
• Consultation with business users
• Consultation with technical users

Outputs of Dimensional Modelling

• Logical data model
• Physical data model
• Source-target mapping
• Validated business rules / transformations
Dimensional Modelling Steps

1. Understanding the Business Problem
2. Choose the Dimensions
3. Choose the Grain of the Fact Table
4. Choose the Measured Facts
5. Choose the Dimension Attributes
6. Deriving the Physical Model from the Logical

Modelling Steps Example

• We are going to explore the fictional 'ACME Bolt company' and its Key
Performance Indicator: "Total Bolts Sold per Customer (TBSC)". This
measure has been handed down from ACME's foreign parent, and everybody's
bonus is related to proving this value has risen over the latest periods.

1. Understanding the Business Problem
• What is our problem? Getting our bonus, of course, but for this
exercise let's just satisfy the request.
• We will at least need to discover what is meant by 'Customer' and
'Bolts sold' (does this mean ordered? Invoiced? Delivered?)
• Interviews with the clients reveal that it's not across the board;
individual regions and branches will be judged also. We have to prove a
general trend across 12 months.
• The user would also like to know something of the Customer's category,
so as to be able to manage trends during the year. Similarly the
groupings of bolts.

Modelling Steps Example (cont.)

2. Choose the Dimensions
• These are readily apparent if using the Thomsen Diagram.
• If not, then these will need to be deduced from the problem definition
in step one.
• Draw a small data model with the measure / fact in the centre and the
other concepts around this – including an entity for each aggregation
level.
• A star schema would collapse (denormalise) all these levels.

3. Choose the Grain of the Fact Table
4. Choose the Measured Facts
• These steps go hand in hand. In our example just one fact has been
requested – number (count) of bolts sold. In the real world this may
also include 'profit', but anything included has to be constrained by
the exact same dimensions.
• The grain may be determined by the request – in this case, total bolts
per day per customer per product per store.
• Sometimes extra detail is included, e.g. choose 'day' even if only
month has been asked for.
  • Extra detail is more flexible for the future but costs more today to
load and summarise.
• Sometimes the detail requested cannot be stored – whilst we aim for
transaction level, some clients have hundreds of millions of low-level
transactions which it is just not economical to replicate and manage.
  • Losing detail sacrifices flexibility but can reduce cost.
Modelling Steps Example (cont.)

5. Choose the Dimension Attributes
• Dimension attributes describe the business. They are used to filter
and group in reports and queries.
• Choose to decode any codes, e.g. don't just take a 'region code' –
decode it as well and take 'Region Name' – these will become
user-selectable items.
• Character fields are almost always attributes.
• Err on the side of including too many attributes from the source data
– the performance penalty is negligible and it is simple to hide any
attributes which are later found to be useless.

Slightly Advanced Topics

• Date and Time
• Time Variant
• Surrogate keys
• Hierarchies
• Aggregate fact tables

Date and Time

• Don't confuse the two!
• You will meet many situations where it is stated that a dimension is
'time'. Almost universally this is actually date.
• Date and Time are 'static' reference dimensions and should be
populated in advance – usually as part of the initial build.
• The DAK Data standards document has a sample schema for Calendar which
includes extra data columns for 'is last day of month' etc. These can
make later queries much easier. Sample spreadsheets to load also exist.
• Do not be tempted to combine the two into one dimension
  • At the grain of Date there would be 3,650 rows to represent 10
years.
  • At the grain of Minute there are 1,440 minutes in a day – so 1,440
rows are needed.
  • Combined, this would need 5,256,000 rows to represent all the
minutes for 10 years.
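The row-count arithmetic above is worth checking explicitly: two small dimensions versus one combined dimension whose size is their product.

```python
# Row counts for separate Date and Time dimensions vs one combined
# Date+Time dimension, using the slide's 10-year window.
YEARS = 10
DAYS = 365 * YEARS            # date grain: one row per day (leap days ignored)
MINUTES_PER_DAY = 24 * 60     # time grain: one row per minute of the day

date_rows = DAYS                        # separate Date dimension
time_rows = MINUTES_PER_DAY             # separate Time dimension
combined_rows = DAYS * MINUTES_PER_DAY  # one row per minute of the decade

print(date_rows, time_rows, combined_rows)   # 3650 1440 5256000
```

Two tables totalling about 5,000 rows versus one table of over 5 million rows is the whole argument for keeping Date and Time separate.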
Time Variant

• Time Variance: "A characteristic of a data warehouse that defines the
moment in time that the data or variant of the data is valid. If Order
No. 123 has a value of $1,500.00 on Dec 1 and $1,700 on Dec 10, Dec 1
and Dec 10 show us the time variance of Order No. 123."
• Many operational source systems only record one item of information,
and if that changes, the new value is simply replaced. E.g. if you move
house, your doctor or movie rental company really doesn't care where you
used to live.
• For big organisations this is sometimes overcome by using a separate
data warehouse where each change is noted by boundary dates, i.e. a
start and end date.
• Adding these is not simple, as it can subtly change the overall
granularity – if dates (not times) are used, then only one fact value is
possible per day. You now have to be extremely clear about what value is
to be used. The last of the day? The maximum of the day?

Natural and Surrogate Keys

• Natural keys are the values usually referred to by people as the
identifiers of entities (customer number, claim number, etc.).
• They are often the primary keys in source systems.
• As a general rule they should not be used in a warehouse – but they
might be in a dimensional model directly off a single source system.
  • The problem is what happens when a second data source populates the
dimension? Now the jumble of numbers means nothing, and in fact the same
identifier could be used in different systems to refer to different
things.
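The boundary-date approach can be sketched in a few lines, reusing the slide's Order No. 123 figures (the history structure and lookup function are invented for the sketch): each change becomes a row with a start and end date, and an "as of" lookup picks the row whose interval covers the requested date.

```python
from datetime import date

# Each change to the order's value is kept as a row with boundary dates.
# end_date None means "current". Data echoes the slide's example.
order_history = [
    # (order_no, value, start_date, end_date)
    (123, 1500.00, date(2018, 12, 1), date(2018, 12, 9)),
    (123, 1700.00, date(2018, 12, 10), None),
]

def value_as_of(history, order_no, as_of):
    """Return the value that was valid for order_no on the given date."""
    for no, value, start, end in history:
        if no == order_no and start <= as_of and (end is None or as_of <= end):
            return value
    return None

print(value_as_of(order_history, 123, date(2018, 12, 5)))   # 1500.0
print(value_as_of(order_history, 123, date(2018, 12, 15)))  # 1700.0
```

Note the granularity caveat from the slide: because the boundaries here are dates rather than timestamps, at most one value per day can be represented, so a convention (last of the day, maximum of the day, etc.) must be fixed before loading.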

Natural and Surrogate Keys (cont.)

• A surrogate is something used 'instead of'. A surrogate is an
artificial, numeric key generated from a pool of numbers inside the
warehouse.
• Use these as primary keys for Dimensions.
  • Will facilitate efficient Fact-to-Dimension joins
  • Support slowly changing dimensions (next slide)
• If using surrogates, then bring in the source system natural key along
with another field to say which source system this value came from –
i.e. put the context back.

Hierarchies

• Hierarchies are pervasive in the vast majority of organisations.
• Hierarchies are quite disorganised in the vast majority of
organisations.
• Multiple independent hierarchies are often needed.
• Hierarchies within the dimensions are very important.
  • Within the proper tool they enable "drill up / drill down"
  • e.g. day, week, month, quarter, year
  • e.g. Product, Product Category, Total Products
• Details usually need to be explicitly stored.
  • E.g. decode all codes.

[Diagram: a simple product hierarchy – Total Products → Product Category
→ Product (Bolt)]
Hierarchies – Simple, Static Hierarchies

• Simple, static hierarchies are best designed directly into the dimensions.
• This is what was meant when we said we 'de-normalised' for a Star Schema.
• Easiest to use.
• Most efficient to query.
• e.g. Product, Product Category, Total Products
• Example: Geography (Store, Branch, Region, [State, Country])
[Figure: product hierarchy – Product (Bolt) → Product Category → Total Products]

Hierarchies – Complex And/Or Dynamic Hierarchies

• If hierarchies are complex, if there are multiple hierarchies on a dimension, or if the hierarchy changes often, it could be messy to design the hierarchy into the dimension.
• Snowflake the dimension, creating one or more outboard hierarchy tables.
• Changes to hierarchies then do not affect the base dimension.
• Multiple hierarchies can be represented with multiple tables, or with a hierarchy ID column which must be filtered on in any query.
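The outboard hierarchy idea can be shown with two tables. A minimal sketch using SQLite; all table and column names (`product_dim`, `product_category_hier`, etc.) are invented for illustration.

```python
import sqlite3

# Sketch: a base dimension pointing at an outboard (snowflaked) hierarchy
# table, so hierarchy changes never touch the base dimension's rows.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER   -- FK into the outboard hierarchy table
);
CREATE TABLE product_category_hier (
    category_key INTEGER PRIMARY KEY,
    category_name TEXT,
    total_name TEXT        -- top level, e.g. 'Total Products'
);
INSERT INTO product_category_hier VALUES (10, 'Fasteners', 'Total Products');
INSERT INTO product_dim VALUES (1, 'Bolt', 10);
""")

# Drill up from Product to Product Category to Total Products via a join.
row = con.execute("""
    SELECT d.product_name, h.category_name, h.total_name
    FROM product_dim d JOIN product_category_hier h USING (category_key)
""").fetchone()
print(row)  # ('Bolt', 'Fasteners', 'Total Products')
```

Renaming a category or re-parenting products now only updates the small hierarchy table, which is the point of snowflaking it out.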
Hierarchies – Example

• Below is a common situation, caused where different 'departments' view the summarised data differently.
• In this case it is critical that it is the SAME fact with the same granularity – in this case Store.
• It is just the summaries beyond 'Store' that differ.
• Watch for the same item being used in different contexts – e.g. 'Region' in this example is NOT the same thing in the two hierarchies.
[Figure: the Bolts Sold Fact (Date (FK), Product Identifier (FK), Geography Identifier (FK), Customer Identifier (FK), Bolts Sold Quantity) joins the Geography dimension (Geography Identifier, Store Code, Store Name, Distribution Identifier (FK), Geography Mgt Identifier (FK)). The Distribution Hierarchy (Distribution Identifier, Distribution Node Code/Name, Warehouse Code/Name, Distribution Region Code/Name) groups Store → Warehouse → Warehouse Region → Total All Warehouses, while the Geography Management Hierarchy (Geography Mgt Identifier, Branch Code/Name, Management Region Code/Name) groups Store → Distribution Branch → Region → Total All Areas.]

Aggregate Fact Tables

• Fact tables are very large.
• Aggregates (pre-stored summaries) are the most effective way of improving data warehouse performance.
• An aggregate is a fact table record representing a summarisation of base-level fact table records.
• Can be explicitly designed and managed, or many DBMSs now have inbuilt aggregations available.
• Aggregate awareness – DBMS implicit and OLAP tool explicit.
• Each grain of aggregate should occupy its own fact table, and be supported by appropriate category dimension tables.
• What will that do to the number of tables? It can be an exponential blow-out.
• Complexity from the end-user's point of view? They can be forced to remember what summaries exist and what they are called.
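An explicitly designed and managed aggregate, as described above, is essentially a GROUP BY materialised into its own fact table. A minimal sketch in SQLite; table names, the region attribute, and the data are invented for illustration.

```python
import sqlite3

# Sketch: rolling a store-day grain fact up to a region-day aggregate
# fact table, stored separately from the base fact.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bolts_sold_fact (date_key INTEGER, store_key INTEGER, qty INTEGER);
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, region TEXT);
INSERT INTO store_dim VALUES (1, 'North'), (2, 'North'), (3, 'South');
INSERT INTO bolts_sold_fact VALUES
    (20060926, 1, 5), (20060926, 2, 7), (20060926, 3, 2);
""")

# Each aggregate grain occupies its own fact table, summarising base rows.
con.execute("""
CREATE TABLE bolts_sold_region_agg AS
    SELECT f.date_key, s.region, SUM(f.qty) AS qty
    FROM bolts_sold_fact f JOIN store_dim s USING (store_key)
    GROUP BY f.date_key, s.region
""")
rows = con.execute(
    "SELECT region, qty FROM bolts_sold_region_agg ORDER BY region").fetchall()
print(rows)  # [('North', 12), ('South', 2)]
```

Queries at region grain now scan two rows instead of three; on a real warehouse the reduction is what delivers the performance gain, at the cost of one extra table per aggregate grain.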
Many to Many

• Use a Bridge Table.
• Add a weighting factor to correct fact addition.
[Figure: Fact (Acct Bal) – acct-key (PK) → bridge (acct-key, customer-key, weighting-factor) → Dimension (Customer) – customer-key (PK)]

Recursive

• Use a Bridge Table.
• Add a level count and a bottom flag.
[Figure: Fact (Employee) – employee-key (FK) → Bridge Navigation (Supervise) – supervises-key, number-levels-down, bottom-most-flag → Dimension (Employee) – employee-key (PK)]
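The recursive case above can be sketched by flattening an employee → supervisor map into a navigation bridge with a levels-down count and a bottom-most flag, following the slide's field names. The function name and input shape are invented for the example.

```python
# Sketch: building a navigation bridge table from a recursive
# employee -> supervisor relationship.
def build_navigation_bridge(supervisor_of):
    """supervisor_of maps employee_key -> supervisor_key (None at the top)."""
    children = {}
    for emp, sup in supervisor_of.items():
        if sup is not None:
            children.setdefault(sup, []).append(emp)
    bridge = []

    def descend(root, node, levels_down):
        bridge.append({"supervises_key": root, "employee_key": node,
                       "number_levels_down": levels_down,
                       "bottom_most_flag": node not in children})
        for child in children.get(node, []):
            descend(root, child, levels_down + 1)

    for emp in supervisor_of:
        descend(emp, emp, 0)  # include a self-row at level 0
    return bridge

# Employee 1 supervises 2, who supervises 3.
bridge = build_navigation_bridge({1: None, 2: 1, 3: 2})
ceo_rows = [r for r in bridge if r["supervises_key"] == 1]
```

Joining the fact through `supervises_key` then rolls facts up to any supervisor in one join, with no recursive SQL at query time.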
Bus Architecture

• An architecture that permits aggregating data across multiple marts.
• Conformed dimensions and attributes.
• Drill Down vs. Drill Across.
• Bus matrix.

Keys and Surrogate Keys

A surrogate key is a unique identifier for data warehouse records that replaces source primary keys (business/natural keys):
• Protects against changes in source systems.
• Allows integration from multiple sources.
• Enables rows that do not exist in source data.
• Tracks changes over time (e.g. new customer instances when addresses change).
• Replaces text keys with integers for efficiency.
Slowly Changing Dimensions

• An SCD is a dimension that stores and manages both current and historical data over time in a data warehouse.
• Handling SCDs is considered one of the most critical ETL tasks, as it tracks the history of dimension records.
• SCD attributes are attributes in a dimension that change more slowly than the fact granularity (addresses, managers, etc.).
• Type 1: Store only the current value (current only).
• Type 2: Create a dimension record for each value, with or without date stamps (all history).
• Type 3: Create an attribute in the dimension record for the previous value – most recent few (rare).
Note: rapidly changing dimensions usually indicate the presence of a business process that should be tracked as a separate dimension or as a fact table.
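The Type 2 approach above (a new dimension record per value, with date stamps) can be sketched in a few lines. A minimal illustration only; the field names and `apply_scd2` helper are invented, and the `12/31/9999` high date follows the common convention.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # conventional "still current" end date

# Sketch: Type 2 SCD handling - expire the current row, append a new
# version with a new surrogate key and effective-date stamps.
def apply_scd2(dimension, natural_key, new_attrs, change_date, next_key):
    for row in dimension:
        if row["natural_key"] == natural_key and row["end_date"] == HIGH_DATE:
            row["end_date"] = change_date          # close the old version
            new_row = dict(row, **new_attrs)       # carry attributes forward
            new_row.update(surr_key=next_key,
                           eff_date=change_date, end_date=HIGH_DATE)
            dimension.append(new_row)
            return new_row
    raise KeyError(natural_key)

dim = [{"surr_key": 1552, "natural_key": 31421, "comm_dist": 3,
        "eff_date": date(2004, 1, 7), "end_date": HIGH_DATE}]
apply_scd2(dim, 31421, {"comm_dist": 31}, date(2006, 1, 2), 2387)
# dim now holds two versions: key 1552 (expired) and key 2387 (current).
```

Facts loaded after the change reference the new surrogate key, so history joins correctly to the version that was current at the time.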
Dimension (before the change):

CustKey | BKCustID | CustName   | CommDist | Gender | HomOwn?
1552    | 31421    | Jane Rider | 3        | F      | N

Fact Table:

Date      | CustKey | ProdKey | Item Count | Amount
1/7/2004  | 1552    | 95      | 1          | 1,798.00
3/2/2004  | 1552    | 37      | 1          | 27.95
5/7/2005  | 1552    | 87      | 2          | 320.26
2/21/2006 | 2387    | 42      | 1          | 19.95

Dimension with a slowly changing attribute (Type 2):

CustKey | BKCustID | CustName   | CommDist | Gender | HomOwn? | Eff      | End
1552    | 31421    | Jane Rider | 3        | F      | N       | 1/7/2004 | 1/1/2006
2387    | 31421    | Jane Rider | 31       | F      | N       | 1/2/2006 | 12/31/9999

Date Dimensions

• One row for every day for which you expect to have data for the fact table (perhaps generated in a spreadsheet and imported).
• Usually use a meaningful integer surrogate key (such as yyyymmdd: 20060926 for Sep. 26, 2006). Note: this order sorts correctly.
• Include rows for missing or future dates to be added later.
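Generating the date dimension described above is easily scripted rather than done in a spreadsheet. A minimal sketch; the attribute set is illustrative and would normally include many more columns (day name, holiday flags, fiscal periods, etc.).

```python
from datetime import date, timedelta

# Sketch: one row per day with a yyyymmdd integer surrogate key,
# including future dates to be loaded against later.
def build_date_dimension(start, end):
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": d.year * 10000 + d.month * 100 + d.day,  # yyyymmdd
            "full_date": d,
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

dim = build_date_dimension(date(2006, 9, 25), date(2006, 9, 27))
print([r["date_key"] for r in dim])  # [20060925, 20060926, 20060927]
```

As the slide notes, the yyyymmdd integer keys sort in the same order as the dates themselves, which keeps range scans on the fact table natural.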
Degenerate Dimensions

• Dimensions without attributes (such as a transaction number or order number).
• Put the attribute value into the fact table even though it is not an additive fact.

Snowflaking (Outrigger Dimensions or Reference Dimensions)

• Connects entities to dimension tables rather than the fact table.
• Complicates coding and requires additional processing for retrievals.
• Makes Type 2 slowly changing dimensions harder to maintain.
• Useful for seldom-used lookups.
M:N Multivalued Dimensions

• Fact to Dimension.
• Dimension to Dimension.
• Try to avoid these. Solutions can be very misleading.

Multivalued Dimensions

[Figure: ORDERS (FACT) – SalesRepKey, ProductKey, CustomerKey, OrderQty – joins SALESREP-ORDER-BRIDGE – SalesRepKey, SalesRepGroupKey, Weight = (1/NumReps) – which joins SALESREP – SalesRepKey, Name, Address, SalesRepGrpKey]
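The `Weight = (1/NumReps)` column in the bridge diagram exists so that a fact allocated across every rep in a group still sums back to the original amount. A minimal sketch; the function name and data are invented for the example.

```python
# Sketch: a sales-rep group bridge where each rep carries weight
# 1/NumReps, so allocating an order across reps does not double count.
def build_bridge(groups):
    """groups maps a sales-rep-group key to the list of rep keys in it."""
    bridge = []
    for group_key, reps in groups.items():
        weight = 1.0 / len(reps)          # Weight = (1 / NumReps)
        for rep_key in reps:
            bridge.append({"group_key": group_key,
                           "rep_key": rep_key, "weight": weight})
    return bridge

bridge = build_bridge({"G1": ["R1", "R2"]})
order_amount = 100.0
allocated = sum(order_amount * b["weight"] for b in bridge)
print(allocated)  # 100.0 - the weights prevent double counting
```

Without the weight, a query joining the order through a two-rep bridge would report 200.0 of revenue for a 100.0 order, which is exactly the "very misleading" result the slide warns about.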
Hierarchies

Group data within dimensions, e.g. SalesRep:
• Region
• State
• County
• Neighborhood
Problem structures:
• Variable depth
• Frequently changing

Heterogeneous Products

• Several different kinds of entity, with different attributes for each.
• (The sub-class problem.)
Aggregate Dimensions

• Dimensions that represent data at different levels of granularity.
• Remove a dimension.
• Roll up the hierarchy (provide a new, shrunken dimension with a new surrogate key that represents the rolled-up data).

Junk Dimensions

• Miscellaneous attributes that don't belong to another entity, usually representing processing levels:
• Flags
• Categories
• Types
Fact Tables

• Transaction: track processes at discrete points in time, when they occur.
• Periodic snapshot: cumulative performance over specific time intervals.
• Accumulating snapshot: constantly updated over time; may include multiple dates representing stages.

Aggregates

• Precalculated summary tables.
• Improve performance.
• Record data at a coarser granularity.
Slowly Changing Dimension (SCD)

• Type 0: Retain Original.
• Type 1: Overwrite – easy to implement, but it does not maintain any history of prior attribute values.
• Type 2: Add New Row – the primary workhorse technique for accurately tracking slowly changing dimension attributes.
• Type 3: Add New Attribute – enables you to see new and historical fact data by either the new or the prior attribute values, sometimes called "alternate realities".

Rapidly Changing Dimension

A dimension is a fast-changing or rapidly changing dimension if one or more of its attributes changes very fast and in many rows. Handling rapidly changing dimensions in a data warehouse is very difficult because of the performance implications.
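The contrast between Type 1 (overwrite) and Type 3 (add a prior-value attribute) above fits in a few lines. A minimal sketch; the helper names and the `prior_` column prefix are invented for illustration.

```python
# Sketch: Type 1 overwrites in place (history lost); Type 3 keeps one
# prior value in an extra attribute (the "alternate reality").
def scd_type1(row, attr, new_value):
    row[attr] = new_value                 # history is lost
    return row

def scd_type3(row, attr, new_value):
    row["prior_" + attr] = row[attr]      # keep the single prior value
    row[attr] = new_value
    return row

r1 = scd_type1({"cust": 31421, "district": 3}, "district", 31)
r3 = scd_type3({"cust": 31421, "district": 3}, "district", 31)
print(r1)  # {'cust': 31421, 'district': 31}
print(r3)  # {'cust': 31421, 'district': 31, 'prior_district': 3}
```

Neither variant adds rows, so neither helps with a rapidly changing attribute; that is why Type 2, or restructuring the attribute into its own dimension or fact table, is needed when changes are frequent.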