
Data Vault POC Report

Data Strategy

Jens Schultheiss July 2023


DRAFT VERSION
HUSKY CONTROLLED. HUSKY INFORMATION IS ONLY INTENDED FOR THE RECIPIENT. August 7, 2023
ANY DISSEMINATION TO ANYONE OTHER THAN THE RECIPIENT IS UNAUTHORIZED AND STRICTLY PROHIBITED.
Agenda

• Classic Data Warehouse vs Modern Data Analytics Platform (MDAP)
• Data Vault 2.0 Reference Architecture
• Data Vault 2.0 Solution Architecture
• Data Warehouse vs Lakehouse
• Data Vault PoC:
• Data Models and DV Implementation (Model Driven vs Meta Data Driven)
• Data Vault Methodology (framework of pattern language, templates, roles, SDLC, principles, best practices)
• Conclusion and suggestions (in terms of MDAP requirements, costing, skills, effort)
• Next steps

Classic EDW Reference Architecture

Today’s EDW - Pain Points

• Time-intensive and very costly development lifecycle
• Object dependencies when loading the EDW (low parallelism)
• Scalability (SSIS is based on DCOM, a component technology from the early 1990s; on-prem SQL Server is not really a distributed computing system and is a poor fit for big data)
• Maintainability (deployment): doesn’t support agile CI/CD approaches incl. TDD
• No automation (no/low code): fully customized
• Platform dependent

Modern Data Analytics Platform - Requirements

• Support for different types of data
• New use cases: streaming, real-time, data science, operational reporting, self-service
• Agility: CI/CD, sprint-oriented development, incremental DWH development, support for TDD
• Resilient to changing source data
• Scalability (data volume, concurrency, distributed)
• Support for automation (prerequisite: the methodology is pattern based, 100% automation)
• Fully auditable
• Security and privacy requirements
• Different business perspectives
• Cloud and platform agnostic

=> Data Vault supports MDAP. It is not just a modelling approach but rather a methodology.
Data Vault 2.0 Reference Architecture

Data Vault 2.0 Solution Architectures

Data Warehouse
• Pure relational – Synapse SQL Dedicated
• Hybrid – landing/staging in a data lake, EDW relational

Lakehouse
• Landing/staging in a data lake, EDW on Apache Spark
• Databricks

Hybrid/Integrated
• Microsoft Fabric

Lakehouse

Lakehouse – Apache Spark

Lakehouse – Databricks

Hybrid/Integrated – Microsoft Fabric

Proof of Concept
Preparation
• Pick a simple use case based on a non-complex data domain (HR)
• Pick a platform and set it up
• Data Modeling
Desired Outcome
• Platform Comparison (i.e. Costing aspect)
• Design Patterns, Pattern Language
• Validate MDAP requirements
• Design a Framework and Methodology
• Tools (Automation)
• Cost, Effort, Skill Set, Roles and Responsibilities
• Road Map
Use Case (HR Data Domain)

Data in scope
1) Master data entity – Employee (Source 1 = PeopleSoft)
2) Reference data entity – Company (Company will have many sources for its complete set of attributes)
3) Transactional data entities – Sales Orders (Source 2 = AX) and Employee Hours (Source 3 = Timecard)
4) Employee role-playing – Sales Order to Employee (Sales Person) link (there will be many other roles for Employee: Sales Order Credit Approver, creator, modifier, etc.)

Consumption Basis
1) Employee List by Company
2) Total Hours by Employee and by Company
3) Total Sales Orders (counts/amounts) by Sales-Person

Object Model – Sales Order

Data Vault Model – Sales Order (Conceptual)

Data Vault Model – Detailed (Version 1)
[Diagram: detailed Data Vault model, version 1. Recoverable labels: Employee, EmployeeData, EmployeeCompany, Company, TEmployeeHours, TTimeCardData, Customer, CustomerData, SalesOrder, SalesOrderData, SalesOrderSalesPerson, TSalesOrderLines, TSalesOrderLinesData, Product, FSalesOrderData, Business Vault]

Data Vault Model – Detailed (Version 2)
[Diagram: detailed Data Vault model, version 2. Recoverable labels: Employee, EmployeeData, EmployeeCompany, Company, TEmployeeHours, TTimeCardData, Customer, CustomerData, TSalesOrderLines, Product, Sales Order Bridge]

Data Vault Model – Physical Implementation

DV Benefits

• Support for Agile working, CI/CD, Incremental Product Development, TDD

• Parallel Loads (No object dependencies => Scalability)

• Pattern Language

• Automation and code generation (support for different platforms)
- Model Driven
- Meta-Data Driven

• Decouples Implementation from Development Team
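
The parallel-load benefit above can be illustrated with a minimal Python sketch. It assumes hash keys are derived deterministically from business keys (MD5 over trimmed, upper-cased values is an assumption here, not something the PoC prescribes), so hub and link loaders never have to look up keys generated by another load. All table, column and function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(*business_keys: str) -> str:
    """Deterministic hash key; normalisation (trim + upper-case) is an assumption."""
    normalised = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


# Hypothetical staged sales orders (as they might arrive from AX).
staged_orders = [
    {"order_no": "SO-1001", "employee_id": "e042"},
    {"order_no": "SO-1002", "employee_id": "E042 "},
]


def load_hub_sales_order(rows):
    # Hub load: one key per distinct sales order business key.
    return {hash_key(r["order_no"]) for r in rows}


def load_hub_employee(rows):
    # Hub load: one key per distinct employee business key.
    return {hash_key(r["employee_id"]) for r in rows}


def load_link_order_salesperson(rows):
    # Link load: computes its own key from both business keys,
    # so it does not wait for either hub load to finish.
    return {hash_key(r["order_no"], r["employee_id"]) for r in rows}


loaders = [load_hub_sales_order, load_hub_employee, load_link_order_salesperson]

# No loader depends on another one's output, so all three run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    hub_orders, hub_employees, link_rows = pool.map(
        lambda loader: loader(staged_orders), loaders
    )

print(len(hub_orders), len(hub_employees), len(link_rows))
```

In a classic EDW with sequence-based surrogate keys, the link (or fact) load would have to wait for the hub (or dimension) loads in order to look up the generated keys.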

DV Challenges

• 5-10x as many tables as the classic EDW approach (affects e.g. documentation)

• Hardware demanding (compute & storage) => Cloud only

• Requires rigid development discipline

• Steep learning curve (loading the Info Mart is especially critical)

• Impact on SDLC

• Skill sets

DV Roles / Teams / Skill Sets
Scrum Team Roles
• Business Analyst
o Needs to know the source data and the business requirements

• DV Modeler / Architect
o Breaks the model down into DV artefacts, e.g. decides when to apply satellite (SAT) splits, soft rules, PIT (point-in-time) tables, bridges
o Needs to know the data

• DV Engineer (DV Producer)
o Creates DV loaders for Stage / Raw Vault / Business Vault

• Information Mart Engineer (DV Consumer)
o Creates Information Marts and their loaders, consumes the DV

• Tester / QA
o Creates unit tests (DevOps TDD integration; build and deployment pipelines)

Other Roles: Product Owner, Scrum Master, Release/QA Manager


Best Practices
• Stick to design principles (use of standards, data loads, refactoring, managing data at scale, automation)
• Use views to create versioned data interface contracts that decouple the vault layers
• The satellite (SAT) schema is CREATE-only; ALTER is not required
• DV is insert-only; UPDATE/DELETE is never used
• Use inner joins only, never left outer joins (zero key and ghost record patterns)
• Treat everything as a Type 2 dimension (satellites and Info Mart)
• Use a virtual LoadEnd-Date (LEAD window function)
• Use temporal surrogate sequence keys in PITs (for satellites) and bridges (for links) rather than hash key and LoadDate to load dimensions and facts
• Data consumers should not access raw vault and business vault entities directly (access through Info Marts for business users, Source Marts/flat wide tables for data scientists)
• Landing: Keep history for couple of days for redo scenarios (like in ADX)
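
Two of the practices above, insert-only satellites and the virtual load end date, can be sketched in plain Python. This is an illustration under assumptions, not the PoC implementation: an in-memory list stands in for a satellite table, the hash diff uses MD5, and every name (`insert_if_changed`, `with_virtual_load_end`, `HIGH_DATE`) is hypothetical. The second function mimics what `LEAD(load_date) OVER (PARTITION BY hash_key ORDER BY load_date)` would return in SQL.

```python
import hashlib
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)  # assumed open-ended marker for current rows


def hash_diff(payload: dict) -> str:
    """Hash over all descriptive attributes, used to detect changes."""
    text = "||".join(f"{k}={payload[k]}" for k in sorted(payload))
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def insert_if_changed(sat: list, hk: str, load_date: datetime, payload: dict) -> bool:
    """Insert-only load: append a new version, never update or delete."""
    rows = [r for r in sat if r["hk"] == hk]
    latest = max(rows, key=lambda r: r["load_date"], default=None)
    diff = hash_diff(payload)
    if latest is not None and latest["hash_diff"] == diff:
        return False  # nothing changed, nothing inserted
    sat.append({"hk": hk, "load_date": load_date, "hash_diff": diff, **payload})
    return True


def with_virtual_load_end(sat: list) -> list:
    """Derive load_end_date at query time instead of storing and updating it."""
    ordered = sorted(sat, key=lambda r: (r["hk"], r["load_date"]))
    view = []
    for i, row in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        same_key = nxt is not None and nxt["hk"] == row["hk"]
        end = nxt["load_date"] if same_key else HIGH_DATE
        view.append({**row, "load_end_date": end})
    return view


sat_employee = []
insert_if_changed(sat_employee, "hk1", datetime(2023, 7, 1), {"dept": "HR"})
insert_if_changed(sat_employee, "hk1", datetime(2023, 7, 2), {"dept": "HR"})  # unchanged, skipped
insert_if_changed(sat_employee, "hk1", datetime(2023, 7, 3), {"dept": "Sales"})

view = with_virtual_load_end(sat_employee)
for row in view:
    print(row["dept"], row["load_date"].date(), row["load_end_date"].date())
```

Because the end date is computed on read, closing a previous version never requires an UPDATE, which keeps the satellite strictly insert-only.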
Best Practices - SDLC & Sprints

Best Practices - SDLC & Sprints (2)

1. Developers work in a development environment isolated from production.
2. When they complete code and local tests, they check code into Git.
3. A continuous integration service detects a change in code and launches a regression test
– running a full set of tests against the new version of the code base. Failures are
identified quickly for a fix from the developer.
4. At the end of each sprint a deployment engineer takes the code base from Git into a
staging environment. The engineer conducts final tests of the deployment scripts and,
assuming all is well, launches production deployment.
5. The production environment downloads deployment scripts and executes the update to
bring new code and data structures online. Smoke tests are executed to confirm the
upgrade.
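
Step 3 depends on automated regression tests that the CI service can run on every check-in. A minimal sketch of such a test follows, assuming a hypothetical in-memory `load_hub` function in place of a real loader; it checks that re-running a hub load inserts nothing new and leaves existing rows untouched.

```python
import unittest


def load_hub(hub: dict, staged_keys: list, load_date: str) -> int:
    """Hypothetical hub loader: inserts unseen business keys only."""
    inserted = 0
    for business_key in staged_keys:
        if business_key not in hub:
            hub[business_key] = load_date
            inserted += 1
    return inserted


class HubLoadRegressionTest(unittest.TestCase):
    def test_rerun_inserts_nothing(self):
        hub = {}
        # First run loads both employees.
        self.assertEqual(load_hub(hub, ["E001", "E002"], "2023-07-01"), 2)
        # Re-running with the same staged data must be a no-op.
        self.assertEqual(load_hub(hub, ["E001", "E002"], "2023-07-02"), 0)
        # The original load date is preserved (insert only, no update).
        self.assertEqual(hub["E001"], "2023-07-01")


# Run the suite programmatically, as a CI job might do.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(HubLoadRegressionTest)
result = unittest.TextTestRunner().run(suite)
```

A CI pipeline would typically fail the build when `result.wasSuccessful()` is false.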

Best Practices - Automation
• Automation (same principles apply to Hubs, Links & Satellites)
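
As a rough illustration of the metadata-driven variant, the sketch below generates hub load statements from a small metadata list and one SQL template. The template, table names and column names are assumptions made for this example; they do not come from any of the tools listed in this deck. Links and satellites would follow the same pattern with their own templates.

```python
# One template per DV pattern; this one covers hub loads.
# NOT EXISTS keeps the load insert-only and avoids an outer join.
HUB_LOAD_TEMPLATE = """INSERT INTO {hub} ({hk}, {bk}, load_date, record_source)
SELECT DISTINCT s.{hk}, s.{bk}, s.load_date, s.record_source
FROM {stage} AS s
WHERE NOT EXISTS (SELECT 1 FROM {hub} AS h WHERE h.{hk} = s.{hk});"""

# Hypothetical metadata: one entry per hub to generate.
hub_metadata = [
    {"hub": "hub_employee", "hk": "employee_hk", "bk": "employee_id",
     "stage": "stg_peoplesoft_employee"},
    {"hub": "hub_sales_order", "hk": "sales_order_hk", "bk": "order_no",
     "stage": "stg_ax_sales_order"},
]


def generate_hub_loaders(metadata: list) -> dict:
    """Return one generated load statement per hub, keyed by hub name."""
    return {m["hub"]: HUB_LOAD_TEMPLATE.format(**m) for m in metadata}


scripts = generate_hub_loaders(hub_metadata)
print(scripts["hub_employee"])
```

Adding a new hub then means adding one metadata entry rather than writing a new loader by hand.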

Next Steps - Tool comparison / selection
• BimlFlex
o Supported target platforms: Snowflake, SQL, Synapse (via SSIS and ADF)
o 800$ per month per license
• WhereScape
o Various products for data warehouse automation
o 1600$ per month per license
• VaultSpeed
o Model Driven
o Integrates with Databricks, Synapse, Snowflake, ADF, dbt, Apache Spark, …
o 1600$ per month per license
• AutomateDV/dbt
o Meta Data Driven, can be complemented by HOM Landing FW
o Supported target platforms: Databricks, SQL, Snowflake
o 100$ per month per license
Next Steps
• Trainings
o Scalefree Data Vault Boot Camp
▪ CDVP2 Certification (Data Vault Alliance)
▪ https://www.scalefree.com/boot-camp-class/
▪ 3000$ per ticket
