USING THE RIGHT DATA MODEL IN A DATA MART

DAVID M WALKER
DATA MANAGEMENT & WAREHOUSING

INTRODUCTION
•  The concept of a Data Mart as the data access interface layer for Business Intelligence has been around for over 25 years
•  Kimball style Dimensional Modelling and Star Schemas have become the de facto data modelling technique for data marts
•  These have been and continue to be hugely successful with relational databases and reporting tools, but are they the right tool for today's technologies?

March 2012

© 2012 Data Management & Warehousing


WHY IS A STAR SCHEMA SO SUCCESSFUL?
•  There are three main reasons for creating star schemas and for their wide acceptance as a technique:
    •  Simpler for users to understand
    •  Highly performant user queries
    •  Optimal disk storage usage


WHAT IS A STAR SCHEMA?
•  A star schema consists of two parts
    •  Facts: Measurable numeric and/or time data about an event
    •  Dimensions: Descriptive attributes about the event that give the facts a context
•  Facts are stored at a uniform level of detail known as the grain of the data
•  A star schema consists of a fact table and a number of associated dimension tables

SALES FACTS
•  Date Surrogate Key
•  Store Surrogate Key
•  Customer Surrogate Key
•  Product Surrogate Key
•  Sale Time
•  Sale Quantity
•  Sale Unit Price

DATE DIMENSION
•  Date Surrogate Key
•  Date
•  Day
•  Month
•  Year
•  Public Holiday Flag

STORE DIMENSION
•  Store Surrogate Key
•  Store Name
•  Store Number
•  Store Postcode
•  Store Town
•  Store Region

CUSTOMER DIMENSION
•  Customer Surrogate Key
•  Customer Loyalty Number
•  Customer Gender
•  Customer Postcode
•  Customer Town
•  Customer Region

PRODUCT DIMENSION
•  Product Surrogate Key
•  Product SKU
•  Product Name
•  Product Category
•  Product Group
•  Temperature Group


STAR SCHEMAS: SIMPLER FOR USERS TO UNDERSTAND
•  Intuitive grouping of information
    •  e.g. All customer data in one dimension, all store data in another, etc.
•  Much easier queries than on a full relational schema
    •  Consequently harder to get the wrong answer because of the wrong join
•  All data is at the same level of granularity
    •  Consequently harder to get the wrong answer because of mismatched levels of data

Example query to get the number of sales in each product category for March 2012 by female customers in stores in the South West region:

    select P.PRODUCT_CATEGORY, sum(SALES_QUANTITY)
    from   SALES_FACTS F, DATE_DIMENSION D, STORE_DIMENSION S,
           CUSTOMER_DIMENSION C, PRODUCT_DIMENSION P
    where  MONTH = 'March'
    and    YEAR = '2012'
    and    CUSTOMER_GENDER = 'Female'
    and    STORE_REGION = 'South West'
    and    F.DATE_SKEY = D.DATE_SKEY
    and    F.STORE_SKEY = S.STORE_SKEY
    and    F.CUSTOMER_SKEY = C.CUSTOMER_SKEY
    and    F.PRODUCT_SKEY = P.PRODUCT_SKEY
    group by P.PRODUCT_CATEGORY
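The example query can be tried end to end on a toy star schema. The sketch below is not part of the original slides: it uses Python's standard `sqlite3` module, a handful of made-up sample rows, and only the columns the query touches. It assumes the region column is called STORE_REGION (following the Store Dimension attribute list) and adds a `group by` clause so that the aggregate is valid SQL.

```python
import sqlite3

# In-memory database holding a cut-down version of the slide's star schema.
# All sample rows are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table DATE_DIMENSION     (DATE_SKEY integer primary key, MONTH text, YEAR text);
    create table STORE_DIMENSION    (STORE_SKEY integer primary key, STORE_REGION text);
    create table CUSTOMER_DIMENSION (CUSTOMER_SKEY integer primary key, CUSTOMER_GENDER text);
    create table PRODUCT_DIMENSION  (PRODUCT_SKEY integer primary key, PRODUCT_CATEGORY text);
    create table SALES_FACTS (
        DATE_SKEY integer, STORE_SKEY integer, CUSTOMER_SKEY integer,
        PRODUCT_SKEY integer, SALES_QUANTITY integer
    );

    insert into DATE_DIMENSION     values (1, 'March', '2012'), (2, 'April', '2012');
    insert into STORE_DIMENSION    values (1, 'South West'), (2, 'North East');
    insert into CUSTOMER_DIMENSION values (1, 'Female'), (2, 'Male');
    insert into PRODUCT_DIMENSION  values (1, 'Dairy'), (2, 'Bakery');

    -- (date, store, customer, product, quantity)
    insert into SALES_FACTS values
        (1, 1, 1, 1, 3),   -- March, South West, Female, Dairy
        (1, 1, 1, 2, 2),   -- March, South West, Female, Bakery
        (1, 1, 1, 1, 4),   -- March, South West, Female, Dairy
        (1, 1, 2, 1, 5),   -- male customer: filtered out
        (2, 1, 1, 1, 6),   -- April: filtered out
        (1, 2, 1, 2, 7);   -- North East store: filtered out
""")

# The slide's query, with the group by needed for the aggregate and an
# order by so the output is deterministic.
STAR_QUERY = """
    select P.PRODUCT_CATEGORY, sum(SALES_QUANTITY)
    from   SALES_FACTS F, DATE_DIMENSION D, STORE_DIMENSION S,
           CUSTOMER_DIMENSION C, PRODUCT_DIMENSION P
    where  MONTH = 'March'
    and    YEAR = '2012'
    and    CUSTOMER_GENDER = 'Female'
    and    STORE_REGION = 'South West'
    and    F.DATE_SKEY = D.DATE_SKEY
    and    F.STORE_SKEY = S.STORE_SKEY
    and    F.CUSTOMER_SKEY = C.CUSTOMER_SKEY
    and    F.PRODUCT_SKEY = P.PRODUCT_SKEY
    group by P.PRODUCT_CATEGORY
    order by P.PRODUCT_CATEGORY
"""

rows = conn.execute(STAR_QUERY).fetchall()
for category, quantity in rows:
    print(category, quantity)
```

Every join is a simple key match from the fact table to one dimension, which is what makes this style of query hard to get wrong.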


STAR SCHEMAS: HIGHLY PERFORMANT USER QUERIES
•  Dimensional data has an enforced one-to-many relationship with the fact table
•  Filtering occurs on the (smaller) dimensions
    •  e.g. where YEAR = '2012'

•  Aggregation takes place only on the relevant subset of the facts
    •  e.g. sum(SALES_QUANTITY)


STAR SCHEMAS: OPTIMAL DISK STORAGE USAGE
•  If STORE_REGION had:
    •  10 discrete values
    •  was stored in the example SALES_FACTS table
    •  was on average 10 bytes long
•  This one field alone would require an additional 1 TB of storage
•  Not storing it in the fact table also improves query performance by reducing the disk I/O required to retrieve the information
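The storage figure can be checked with a little arithmetic. The slide does not state the number of fact rows, so the sketch below works backwards from the 1 TB figure (taken as 10^12 bytes) to the row count it implies. The store count used for the dimension-side comparison is a purely hypothetical number.

```python
# Figures taken from the slide.
AVG_REGION_BYTES = 10           # average length of a STORE_REGION value
EXTRA_STORAGE_BYTES = 10**12    # "1 TB", interpreted here as decimal terabytes

# The fact row count is not given in the slides; this is the count implied
# by the two figures above.
implied_fact_rows = EXTRA_STORAGE_BYTES // AVG_REGION_BYTES

# Hypothetical number of stores, used only to show the scale of the saving
# when the region is held once per store in the dimension instead.
ASSUMED_STORES = 1_000
dimension_bytes = ASSUMED_STORES * AVG_REGION_BYTES

print(f"Implied fact rows:             {implied_fact_rows:,}")
print(f"Region stored in fact table:   {EXTRA_STORAGE_BYTES:,} bytes")
print(f"Region stored in dimension:    {dimension_bytes:,} bytes")
```

Under these assumptions the fact table would need about 100 billion rows for the column to cost 1 TB, while the same information in the store dimension costs a few kilobytes.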

SCHEMAS: THE ALTERNATIVES
Each schema type is rated on Complexity, Speed and Space (the ratings are graphical and not reproduced here).

RELATIONAL
•  Usually used for data warehouses rather than data marts. Favoured solution on MPP technologies due to their power

SNOWFLAKE
•  Favours saving some space in exchange for added user query complexity; usually a techie compromise

STAR
•  De facto standard for data mart design based on traditional technologies. Also used as source for OLAP cubes

RESULT SET
•  Large single table with the entire result set; optimal in some circumstances


STAR SCHEMAS: TECHNOLOGY ASSUMPTIONS
•  There are two major and often unspoken assumptions about the technologies used to build this sort of environment:
    •  Firstly: the database used is a row store database and not a column store database
    •  Secondly: users will be running reporting tools and OLAP cubes to access the data
•  Neither of these assumptions is necessarily true. The last 10 years have seen massive innovation in Business Intelligence technologies that will have an impact on the chosen architectural solution. Using alternate technologies means that you should challenge existing designs and embrace appropriate new designs in order to exploit the technology

UNDERSTAND THE DESIGN IMPACT OF ALTERNATE TECHNOLOGIES
•  Column Store Databases:
    •  What is a column store database?
    •  Why are column store databases efficient?
    •  How does this affect data mart design?

•  The use of alternate reporting mechanisms:
    •  The user requirement gap
    •  How users have filled the gap


WHAT IS A COLUMN STORE DATABASE?
•  Traditionally databases are 'row-based', i.e. the fields of each record are stored next to each other:

    Forename   Surname   Gender
    David      Walker    Male
    Helen      Walker    Female
    Sheila     Jones     Female

•  Column store databases store the values in columns and then hold a mapping to form the record
•  This is transparent to the user, who queries a table with SQL in exactly the same way as they would a row-based database

COLUMN STORAGE EXAMPLE
    First Name Value   F Token
    David              PPP
    Helen              QQQ
    Sheila             RRR

    Surname Value      S Token
    Jones              XXX
    Walker             YYY

    Gender Value       G Token
    Female             AAA
    Male               BBB

Mapping to form the records:

    F Token   S Token   G Token
    PPP       YYY       BBB
    QQQ       YYY       AAA
    RRR       XXX       AAA

Note: To the user this appears as a conventional row-based table that can be queried by standard SQL; it is only the underlying storage that is different
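The token tables in this example amount to dictionary encoding of each column. The following is a minimal sketch of that idea, not a description of how any particular column store engine works: the helper names `encode_column` and `decode_row` are hypothetical, and integer positions stand in for the PPP/XXX/AAA tokens on the slide.

```python
def encode_column(values):
    """Store each distinct value once and replace every row's value with a token.

    Returns (dictionary, tokens), where tokens[i] is the position in
    `dictionary` of the value held by row i.
    """
    dictionary = sorted(set(values))
    token_of = {value: token for token, value in enumerate(dictionary)}
    return dictionary, [token_of[value] for value in values]


def decode_row(columns, row_index):
    """Rebuild one record by following the tokens back to the stored values."""
    return tuple(dictionary[tokens[row_index]] for dictionary, tokens in columns)


# The three records from the row-based example.
rows = [
    ("David", "Walker", "Male"),
    ("Helen", "Walker", "Female"),
    ("Sheila", "Jones", "Female"),
]

# Pivot the rows into columns and encode each column independently.
columns = [encode_column(list(column)) for column in zip(*rows)]
surname_dictionary, surname_tokens = columns[1]
gender_dictionary, gender_tokens = columns[2]

print(surname_dictionary, surname_tokens)
print(gender_dictionary, gender_tokens)
print(decode_row(columns, 0))
```

"Walker" and "Female" are each stored once however many rows contain them, which is where the space saving comes from.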

EFFICIENCIES OF COLUMN STORE DATABASES
•  Column store databases offer significant storage optimisation opportunities because long strings are not repeatedly stored
•  In addition it is possible to compress the data in column stores very efficiently
•  It is possible, in some column store implementations, that the column storage holds additional metadata that can be used to speed up specific queries (e.g. the number of records associated with each value in a column)
•  Reducing the data volume stored means reduced I/O when querying the database, which in turn improves query performance
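To illustrate the metadata point, the sketch below keeps a count of records per distinct value alongside a column, so that a simple "how many rows have this value" question can be answered without scanning the rows. This is only an assumed, simplified picture: real engines differ in what metadata they hold and how they use it, and the name `count_rows_with` is hypothetical.

```python
from collections import Counter

# The Gender column from the earlier three-record example.
gender_column = ["Male", "Female", "Female"]

# Per-value record counts, built once when the column is loaded.
value_counts = Counter(gender_column)


def count_rows_with(value):
    """Answer an equality count from the metadata alone, without a row scan."""
    return value_counts[value]


print(count_rows_with("Female"))
print(count_rows_with("Male"))
```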

COLUMN STORE DATABASES AND DATA MART SCHEMAS
•  A column store database effectively creates, internally, a star schema for every field in a result set table
•  This minimises the storage and maximises the query speed in this type of database
•  Creating a star schema at the table level effectively duplicates (in a less efficient manner) the underlying structure that is automatically created by the database engine
•  Consequently a single table result set is more efficient in a column store database than a star schema

SCHEMAS: THE ALTERNATIVES
Each combination is rated on Complexity, Speed and Space (the ratings are graphical and not reproduced here).

STAR SCHEMA
•  Row DB
•  Column DB: Column Store Databases improve space usage and increase speed compared to Row Based Databases

RESULT SET SCHEMA
•  Row DB
•  Column DB: Column Store Databases will significantly improve space usage and speed when compared to Row Based Databases

WHO ARE THE COLUMN STORE VENDORS?
•  Many of the major database vendors have bought into this concept, mostly by acquisition

    Vendor       Database           SQL Dialect
    Actian       Vectorwise         Ingres
    EMC          Greenplum          Postgres
    HP           Vertica            Postgres
    InfoBright   InfoBright         MySQL
    ParAccel     ParAccel           Postgres
    SAP          HANA (In Memory)
    SAP          Sybase IQ          Sybase/TSQL
    Teradata     AsterData          Postgres

•  There are multiple other players
•  For more information: Wikipedia & DBMS2

REPORTING TECHNOLOGIES
•  Historically:
    •  Reporting tools were initially designed to provide a 'simplified' user interface for reporting against relational schemas rather than writing SQL
    •  Schemas were simplified into star schemas and specialist tools evolved to query both star schemas and OLAP cubes built on top of the star schemas
    •  The focus of the tools was on the ability to report what had happened from the data


THE USER REQUIREMENT GAP
What users had: Historical Reporting
•  Insight into what has happened

What users want: Predictive Analytics
•  Understanding what is likely to happen

HOW USERS HAVE FILLED THE GAP
•  Spreadsheets
    •  Users love them even if IT hate the associated data integrity issues
    •  Users have adopted the idea of manipulating a worksheet of data equivalent to a result set table
    •  Spreadsheets can connect to database sources to get data, often using a 'join all' view over a star schema to access data
    •  Desktop based spreadsheets now support large data sets (e.g. Excel supports 1M rows, 16K columns)
    •  Emergence of equivalent web based technologies (e.g. Google Docs)
    •  Emergence of low cost, open source equivalents
    •  In-built graphing and charting capabilities

HOW USERS HAVE FILLED THE GAP
•  Statistical Analysis Tools
    •  Statistical analysis of data to identify future trends
    •  Extracting large result sets to the tools for analysis
    •  Connecting to result sets in the database for direct access
    •  Emergence of low cost, open source equivalents (R)
    •  Emergence of equivalent web based technologies (e.g. Google Prediction, R Studio)
    •  Predictive Model Standards (PMML)
    •  In-built graphing and charting capabilities

HOW USERS HAVE FILLED THE GAP
•  Data Visualisation/Dashboarding Tools
    •  Multiple maps, charts, graphs, gauges, sparklines, heat maps and traffic lights displaying process critical information
    •  Often sourced from a result set table which is drip fed the latest data, automatically generated by devices (machine generated data)
    •  Emergence of agile/rapid development style tools
    •  Tools depend on it being easy to load/update the data to give near real-time information


SCHEMA TYPE SELECTION BASED ON IMPLEMENTATION TECHNOLOGY
SPREADSHEETS, STATISTICAL TOOLS, DASHBOARDS
•  Row store database: Physical Star Schema with Single Table View
•  Column store database: Physical Single Table

TRADITIONAL REPORTING AND CUBING TOOLS
•  Row store database: Physical Star Schema
•  Column store database: Physical Single Table with Star Schema Views
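The "Physical Star Schema with Single Table View" option can be sketched as a view that joins the fact table to its dimensions, so that a spreadsheet or dashboard sees one flat result set. The example below is an illustrative assumption rather than the deck's own implementation: it uses Python's standard `sqlite3` module, a cut-down schema with two dimensions, invented sample rows, and a hypothetical view name, SALES_RESULT_SET.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical star schema (reduced to two dimensions for brevity).
    create table STORE_DIMENSION   (STORE_SKEY integer primary key, STORE_NAME text, STORE_REGION text);
    create table PRODUCT_DIMENSION (PRODUCT_SKEY integer primary key, PRODUCT_NAME text, PRODUCT_CATEGORY text);
    create table SALES_FACTS       (STORE_SKEY integer, PRODUCT_SKEY integer, SALES_QUANTITY integer);

    -- Invented sample data.
    insert into STORE_DIMENSION   values (1, 'Store A', 'South West'), (2, 'Store B', 'North East');
    insert into PRODUCT_DIMENSION values (1, 'Milk', 'Dairy'), (2, 'Bread', 'Bakery');
    insert into SALES_FACTS       values (1, 1, 3), (1, 2, 2), (2, 1, 5);

    -- 'Join all' view: presents the star schema as a single result set table.
    create view SALES_RESULT_SET as
    select S.STORE_NAME, S.STORE_REGION,
           P.PRODUCT_NAME, P.PRODUCT_CATEGORY,
           F.SALES_QUANTITY
    from   SALES_FACTS F
    join   STORE_DIMENSION S   on F.STORE_SKEY = S.STORE_SKEY
    join   PRODUCT_DIMENSION P on F.PRODUCT_SKEY = P.PRODUCT_SKEY;
""")

# A consuming tool queries the view as if it were one flat table, with no joins.
region_totals = conn.execute("""
    select STORE_REGION, sum(SALES_QUANTITY)
    from   SALES_RESULT_SET
    group by STORE_REGION
    order by STORE_REGION
""").fetchall()

print(region_totals)
```

The reverse arrangement for a column store, a physical single table with star schema views, would define views that select the distinct dimension attributes from the single table instead.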

IN CONCLUSION …
•  When designing your solution architecture it is important that you choose the equivalent alternate design best suited to the technology you are deploying
•  Star Schemas are still the best design pattern to use when you are using row based databases
•  Result Set Single Tables are more efficient when using column store databases
•  Consider the users and the tools that they will use when choosing the schema design type

CONTACT US
•  Data Management & Warehousing
    •  Website: http://www.datamgmt.com
    •  Telephone: +44 (0) 118 321 5930

•  David Walker
    •  E-Mail: davidw@datamgmt.com
    •  Telephone: +44 (0) 7990 594 372
    •  Skype: datamgmt
    •  White Papers: http://scribd.com/davidmwalker


ABOUT US
Data Management & Warehousing is a UK based consultancy that has been delivering successful business intelligence and data warehousing solutions since 1995. Our consultants have worked with major corporations around the world including the US, Europe, Africa and the Middle East. We have worked in many industry sectors such as telcos, manufacturing, retail, financial and transport. We provide governance and project management as well as expertise in the leading technologies.


THANK YOU
© 2012 - DATA MANAGEMENT & WAREHOUSING
HTTP://WWW.DATAMGMT.COM