A DW is designed to be user-friendly, utilizing business terminology.

Scope:
• Operational system: one database system
• Data warehouse: integrate data from multiple systems
Modernizing an Existing DW

Scenario: loan repayment alerts
• Streaming data from devices & sensors supports near-real-time monitoring
• A data lake + Hadoop store social media, phone records, e-mail records, and demographics data
• Batch ETL loads the operational data store, data warehouse, master data, and data marts; a semantic layer, OLAP, and in-memory analytics support organizational and self-service reporting
• Federated queries span the data warehouse, data lake, and third-party data
• Predictive analytics (machine learning models): a model to predict repayment ability
• Phone records: sentiment analysis
• E-mail records: text analytics
• Social media: personal comments
What Makes a Data Warehouse “Modern”

• Variety of data sources; multistructured data
• Larger data volumes
• Multi-platform architecture; MPP
• Coexists with a data lake
• Coexists with Hadoop
• Analytics sandbox with promotability
• Automation & APIs
• Data catalog; search ability
• Scalable architecture
• Bimodal environment
Ways to Approach Data Warehousing

The DW *is* Hadoop (SQL-on-Hadoop):
✓ Many open source projects (Hive, Impala, Spark, Presto, etc.) provide batch processing & interactive queries over distributed storage, feeding analytics & reporting tools
✓ Can utilize distributions (Hortonworks, Cloudera, MapR)
✓ Challenging implementation

OR

Growing a relational DW:
✓ Data modeling strategies
✓ Partitioning
✓ Clustered columnstore index
✓ In-memory structures
✓ MPP (massively parallel processing)
Larger Scale Data Warehouse: MPP

Massively Parallel Processing (MPP) operates on high volumes of data across distributed compute nodes, coordinated by a control node. Federated query access (data virtualization) can extend the MPP data warehouse to the data lake, Hadoop, in-memory models, NoSQL stores, and sources such as devices, sensors, and spatial/GPS data.
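The control-node/compute-node split can be illustrated in miniature: rows are hash-distributed across nodes by a distribution key, each node aggregates its own shard in parallel, and the control node merges the partial results. A minimal Python sketch (the node count and column names are illustrative assumptions, not from the slides):

```python
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

NODES = 4  # illustrative compute-node count

def distribute(rows, key):
    """Hash-distribute rows across compute nodes by a distribution key."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row[key]) % NODES].append(row)
    return shards

def node_aggregate(shard):
    """Each compute node sums amounts for its local shard only."""
    totals = defaultdict(float)
    for row in shard:
        totals[row["region"]] += row["amount"]
    return totals

def mpp_query(rows):
    shards = distribute(rows, "region")
    # Compute nodes work in parallel on their shards
    with ThreadPoolExecutor(max_workers=NODES) as pool:
        partials = pool.map(node_aggregate, shards.values())
    # Control node merges partial aggregates into the final result
    final = defaultdict(float)
    for partial in partials:
        for region, total in partial.items():
            final[region] += total
    return dict(final)

rows = [{"region": "east", "amount": 10.0},
        {"region": "west", "amount": 5.0},
        {"region": "east", "amount": 2.5}]
print(mpp_query(rows))  # region totals: east=12.5, west=5.0
```

Distributing on the grouping key means each group lives on exactly one node, so the control node's merge step stays cheap — the same reasoning MPP engines use when choosing a distribution column.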
Lambda Architecture

• Speed layer: low-latency processing of streaming data from devices & sensors, feeding dashboards
• Batch layer: data ingestion into the data lake (raw data → curated data); batch processing to support complex analysis, advanced analytics & machine learning; batch ETL, with master data, into the data warehouse
• Serving layer: responds to queries; delivers reports, dashboards, and analytics & reporting tools

Objectives:
• Support a large volume of high-velocity data
• Near-real-time analysis + persisted history
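The three layers can be shown with a toy event counter: the batch layer periodically recomputes totals from the immutable master data, the speed layer keeps a running count of events that arrived since the last batch run, and the serving layer answers queries by merging the two. A hedged Python sketch (class and field names are illustrative):

```python
from collections import Counter

class LambdaSketch:
    def __init__(self):
        self.master = []              # immutable, append-only raw events
        self.batch_view = Counter()   # precomputed by the batch layer
        self.speed_view = Counter()   # low-latency view of recent events

    def ingest(self, event):
        """New events go to both the master data set and the speed layer."""
        self.master.append(event)
        self.speed_view[event["sensor"]] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all master data,
        then reset the speed view (its events are now in the batch view)."""
        self.batch_view = Counter(e["sensor"] for e in self.master)
        self.speed_view.clear()

    def query(self, sensor):
        """Serving layer: merge the batch view with the real-time delta."""
        return self.batch_view[sensor] + self.speed_view[sensor]

lam = LambdaSketch()
lam.ingest({"sensor": "a"})
lam.ingest({"sensor": "a"})
lam.run_batch()              # batch view now holds 2 events for "a"
lam.ingest({"sensor": "a"})  # arrives after the batch run; speed layer covers it
print(lam.query("a"))        # 3
```

This mirrors the slide's objective: the batch layer gives persisted history, while the speed layer keeps query answers near real time between batch runs.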
Hybrid Architecture

On-premises sources (CRM, sales system, inventory system) feed an on-premises data mart; social media data lands in a cloud data lake / Hadoop for machine learning; an MPP data warehouse in the cloud serves analytics & reporting tools.

Objectives:
1. Scale up MPP compute nodes during peak ETL data loads or high query volumes
2. Utilize existing on-premises data structures
3. Take advantage of cloud services for advanced analytics
Data Lake: Objectives, Challenges & Implementation Options
Data Lake

A repository for analyzing large quantities of disparate sources of data in its native format.

One architectural platform to house all types of data:
o Machine-generated data (ex: IoT sensor readings, logs)
o Human-generated data (ex: tweets, e-mail, social media, images)
o Traditional operational data (ex: sales, inventory)
Objectives of a Data Lake

✓ Reduce up-front effort by ingesting data in any format without requiring a schema initially
✓ Store large volumes of multi-structured data in its native format
Objectives of a Data Lake (continued)

✓ Defer the work to ‘schematize’ until after value & requirements are known
✓ Achieve agility faster than a traditional data warehouse can
✓ Speed up decision-making ability
✓ Store additional types of data which were historically difficult to obtain
Data Lake as a Staging Area for DW

Strategy:
• Reduce storage needs in the data warehouse
• Practical use for data stored in the data lake

1. Utilize the data lake store (raw data: staging area) as the landing area for DW staging, instead of the relational database. Sources such as CRM, corporate data, social media data, and devices & sensors land in the data lake before being loaded into the data warehouse for analytics & reporting tools.
Data Lake for Active Archiving

Strategy: data archival, with query ability available when needed.

1. An archival process moves data from the data warehouse to the data lake store (active archive), based on the data retention policy.
2. Federated queries access current & historical data across the data warehouse and the active archive.
Iterative Data Lake Pattern

1. Acquire: ingest and store data indefinitely in its native format, with cost-effective storage.
2. Analyze: analyze data in place to determine its value, on a case-by-case basis, with scalable parallel processing ability.
3. Deliver: for data of value, integrate with the data warehouse (“schema on write”) or use data virtualization (“schema on read”), once requirements are fully known.
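The “schema on write” vs. “schema on read” distinction can be shown concretely: with schema on write, records are validated and shaped before storage; with schema on read, raw records are stored as-is and a schema is applied only at query time. A small Python sketch (the field names are illustrative, not from the slides):

```python
import json

raw_zone = []  # data lake: stores records as-is, no schema required up front

def ingest(raw_record):
    """Schema on read: land the native-format record without validation."""
    raw_zone.append(raw_record)

def read_with_schema(schema):
    """Apply a schema at query time, once requirements are known."""
    rows = []
    for raw in raw_zone:
        doc = json.loads(raw)
        # Project only the fields the schema asks for; absent fields become None
        rows.append({field: doc.get(field) for field in schema})
    return rows

ingest('{"id": 1, "tweet": "loan approved!", "lang": "en"}')
ingest('{"id": 2, "tweet": "rates?"}')  # no "lang" field -- still ingested fine

print(read_with_schema(["id", "lang"]))
# [{'id': 1, 'lang': 'en'}, {'id': 2, 'lang': None}]
```

Note how the second record, which would fail a rigid schema-on-write load, is ingested without friction; the cost is deferred to the reader, which must decide how to handle the missing field.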
Data Lake Implementation

Zones within the data lake: transient/temp zone, raw data, curated data, and an analytics sandbox.

1. Source data (corporate data, social media data, devices & sensors, flat files) lands in the data lake.
2. The data scientist/analyst executes R scripts from a local workstation against the sandbox for exploratory data science & advanced analytics scenarios.
Sandbox Solutions: Operationalize

Objective:
1. The trained model is promoted to run in a production server environment (R Server alongside the data warehouse).
2. Sandbox use is discontinued once the solution is promoted.
3. R scripts now execute from the server for operationalized data science & advanced analytics scenarios.
Organizing the Data Lake

Plan the structure based on optimal data retrieval. The organization pattern should be self-documenting.

Business impact / criticality:
• High (HBI)
• Medium (MBI)
• Low (LBI)
• etc…

Confidential classification:
• Public information
• Internal use only
• Supplier/partner confidential
• Personally identifiable information (PII)
• Sensitive – financial
• Sensitive – intellectual property
• etc…

Owner / Steward / SME
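One common way to make the organization pattern self-documenting is to encode zone, subject area, classification, and ingestion date directly into the folder path, so retrieval and retention rules can key off the path alone. A hypothetical Python sketch of such a convention (the specific zone and classification names are illustrative, not prescribed by the slides):

```python
from datetime import date
from pathlib import PurePosixPath

def lake_path(zone, subject, classification, ingest_date, filename):
    """Build a self-documenting data lake path:
    <zone>/<subject>/<classification>/<yyyy>/<mm>/<dd>/<filename>"""
    return str(PurePosixPath(
        zone, subject, classification,
        f"{ingest_date:%Y}", f"{ingest_date:%m}", f"{ingest_date:%d}",
        filename))

p = lake_path("raw", "social-media", "mbi", date(2017, 6, 1), "tweets.json")
print(p)  # raw/social-media/mbi/2017/06/01/tweets.json
```

Date partitioning at the end of the path keeps time-range retrieval cheap, while placing the classification segment early lets security tooling apply folder-level permissions.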
Challenges of a Data Lake

Technology:
✓ Complex, multi-layered architecture
✓ Unknown storage & scalability
✓ Data retrieval
✓ Working with un-curated data
✓ Performance

Process:
✓ Right balance of deferred work vs. up-front work
✓ Ignoring established best practices for data management
✓ Data quality
✓ Governance
✓ Security
✓ Change management

People:
✓ Expectations
✓ Data stewardship
✓ Redundant effort
✓ Skills required to make analytical use of the data
Ways to Get Started with a Data Lake
1. Data lake as staging area for DW
2. Offload archived data from DW back to data lake
3. Ingest a new type of data to allow time for longer-term planning
Getting Real Value From the Data Lake
1. Selective integration with the data warehouse – physical or virtual
2. Data science experimentation with APIs
3. Analytical toolsets on top of the data lake
o Hive
o Pig
o Spark
o Impala
o Presto
o Drill
o Solr
o Kafka
o etc…
4. Query interfaces on top of the data lake using familiar technologies
o SQL-on-Hadoop
o OLAP-on-Hadoop
o Data virtualization
o Metadata
o Data cataloging
The Logical Data Warehouse & Data Virtualization
Conceptual Data Warehouse Layers

• User Access: analytics & reporting tools
• Metadata: views into the data storage
• Storage: data persistence
• Software: database management system (DBMS)
• Foundational Systems: servers, network

Source: Philip Russom, TDWI
Logical Data Warehouse

The same layers, with two changes:
• A data virtualization abstraction layer fronts user access (analytics & reporting tools) and metadata (views into the data storage)
• Storage becomes distributed storage & processing from databases, the data lake, Hadoop, NoSQL, etc., atop the software (DBMS) and foundational systems (servers, network)
Logical Data Warehouse
An LDW is a data warehouse which uses “repositories, virtualization,
and distributed processes in combination.”
7 major components of an LDW:
Data Repository Auditing &
Virtualization Management Performance
Services
Distributed Metadata
Processing Management
Service Level
Taxonomy/Ontology Agreement
We will focus on Resolution Management
these two aspects Source: Gartner
Data Virtualization

Objective: the ability to access various data platforms without doing full data integration.

1. A user issues a single query spanning the data warehouse, data lake, Hadoop, in-memory models, and/or NoSQL stores.
2. Data is returned from this federated query across more than one data source.
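The federated query above can be sketched as a virtualization layer that runs the same logical predicate against each underlying platform and unions the results, without first copying data into one store. A toy Python sketch (the two in-memory lists stand in for the data warehouse and the data lake):

```python
# Stand-ins for two physical platforms behind the virtualization layer
data_warehouse = [{"customer": "acme", "balance": 100}]
data_lake = [{"customer": "acme", "balance": 40},
             {"customer": "zenith", "balance": 7}]

def federated_query(sources, predicate):
    """Virtualization layer: evaluate the predicate against every source
    and merge the results, with no prior data integration."""
    results = []
    for source in sources:
        results.extend(row for row in source if predicate(row))
    return results

rows = federated_query([data_warehouse, data_lake],
                       lambda r: r["customer"] == "acme")
print(rows)
# [{'customer': 'acme', 'balance': 100}, {'customer': 'acme', 'balance': 40}]
```

Real virtualization engines go further — pushing the predicate down to each platform's native query engine rather than filtering centrally — but the contract is the same: one query, multiple sources, merged results.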
Recommended Whitepaper:
“How to Build An Enterprise Data Lake: Important Considerations Before Jumping In” by Mark Madsen, Third Nature Inc.

Thank You!
To download a copy of this presentation: SQLChick.com, “Presentations & Downloads” page
Challenges with Modern Data Warehousing

Balancing agility with complexity:
• Reducing time to value
• Minimizing chaos
• Evolving & maturing technology
Challenges with Modern Data Warehousing (continued)

People, process & technology concerns, with security and governance throughout:
• Self-service solutions which challenge the centralized DW
• Managing ‘production’ delivery from IT and user-created solutions
• Handling ownership changes (promotion) of valuable solutions
Challenges of a Data Lake: Technology

✓ Complex, multi-layered architecture
✓ Polyglot persistence strategy
✓ Architectures & toolsets are emerging and maturing