
Data Warehouse
Architecture Best Practices

Presenter: Bhimappa Desai
Agenda
Introductions
Business Intelligence Background
Architecture Best Practices
Questions & Answers
Business Intelligence Background
BI in a Nutshell
What is Business Intelligence?
A Data Warehouse is usually one component of an overall
business intelligence solution
IT people may be tempted to think in terms of products and
technologies
BUT...
BI Goal
The overarching goal of business
intelligence is to provide the information
necessary to MANAGE a business
This means providing information in support
of management decision making, which is
why BI is also called Decision Support
BI is about Data Abstraction
the data abstraction pyramid has four stages, from top to bottom:
wisdom
knowledge
information
data
the audience for a data warehouse typically consumes the higher
slices of the data abstraction pyramid
the lowest level of the pyramid is too detailed & unwieldy
It's Not Technology
Business Intelligence is about delivering business value
provide tangible benefit by answering important
questions that can help the business to achieve its
strategic focus
Improving profitability
Who are our five most profitable clients?
What are our least profitable products?
Reducing cost
Who are our lowest cost suppliers?
Which materials incur the highest spoilage costs?
Improving customer satisfaction
What factors may lead to lost customers?
Business of BI
Many leading companies use BI to achieve
competitive advantage
E.g. Walmart, Dell, Amazon.com, Kraft,
American Express, etc.
Data Warehouse Architecture
Architecture is about delivering an
elegant solution that meets the
requirements
This means really understanding
the problem
DW architecture is part art, part
science
Good Architecture
It's not easy to describe a good design, but we
know it when we see it
BI Architecture Requirements
Must recognize change as a constant
Take incremental development approach
Existing applications must continue to
work
Need to allow more data and new types of
data to be added
End User Acceptance
Understandability
understandability is in the eye of the
beholder
want to hide the complexity
try to make it:
intuitive, obvious
visible, memorable
End User Acceptance
Performance
don't want to interrupt the
thinking process
provide one click, instantaneous
access
warehouse must be available,
production system
Architecture Best Practices
High Level Architecture
remember the different worlds
on-line transaction processing (OLTP)
business intelligence systems (BIS)
users are different
data content is different
data structures are different
architecture & methodology must be different
Two Different Worlds
On-Line Transaction Processing
Entity-Relationship Data Model
developed in the 1970s to eliminate redundancy
and update anomalies in relational database
implementations
normalized to most efficiently get data in
divides the data into many discrete entities
many relationships between these entities
this approach was documented by C.J. Date in
An Introduction to Database Systems
Two Different Worlds
Business Intelligence Systems
Dimensional Data Model
also called star schema
designed to easily get information out
fewer relationships than an ER model; the only
table joined to many other tables is the central
fact table
developed in the 1960s by data service
providers, formalized by Ralph Kimball in
The Data Warehouse Toolkit
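The shape of a star schema can be sketched in miniature. The following Python sketch uses illustrative table and column names (not from this deck) to show one central fact table whose rows carry keys into the surrounding dimension tables:

```python
# Minimal star schema held in plain dicts: one central fact table,
# each row carrying foreign keys into the surrounding dimensions.
date_dim    = {1: {"date": "2024-01-15", "month": "2024-01"}}
product_dim = {10: {"name": "Widget", "category": "Hardware"}}
store_dim   = {100: {"city": "Austin", "region": "South"}}

sales_fact = [  # grain: one row per product per store per day
    {"date_key": 1, "product_key": 10, "store_key": 100,
     "units": 5, "revenue": 49.95},
]

# A query navigates deterministically: every join goes fact -> dimension.
for row in sales_fact:
    d = date_dim[row["date_key"]]
    p = product_dim[row["product_key"]]
    s = store_dim[row["store_key"]]
    print(d["month"], p["category"], s["region"], row["revenue"])
```

Note how every relationship radiates from the fact table, which is what makes the diagram easy for people to remember and for software to navigate.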
Entity-Relationship Disadvantages
all tables look the same
people can't visualize or remember the diagrams
software can't navigate the schema once it
becomes too complex
business processes mixed together
many artificial keys created
Dimensional Model Advantages
Simplicity
humans can navigate and remember
software can navigate deterministically
business process explicitly separated (Data Mart)
fewer keys (one key per attendant dimension table)
Best Practice #1
Use a data model that is optimized for information retrieval
Dimensional model
Denormalized
Hybrid approach
Data Acquisition Processes
Extract Transform Load (ETL)
the process of unloading or copying data from the source
systems, transforming it into the format and data model
required in the BI environment, and loading it to the DW
also, a software development tool for building ETL processes
(an ETL tool)
many production DWs use COBOL or other general-purpose
programming languages to implement ETL
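The extract-transform-load sequence described above can be sketched in a few lines of Python. The source format and field names here are invented for illustration; a real ETL process would read from the operational systems and write to the warehouse database:

```python
# A toy ETL pass: extract rows from a "source system" (here a list),
# transform them into the warehouse's format, and load into the target.
source_rows = [
    {"cust": "  Acme Corp ", "amt": "1200.50", "dt": "15/01/2024"},
]

def transform(row):
    """Reshape a source row into the format and data model required
    in the BI environment."""
    day, month, year = row["dt"].split("/")
    return {
        "customer_name": row["cust"].strip(),      # standardize text
        "amount": float(row["amt"]),               # string -> numeric fact
        "date_key": f"{year}-{month}-{day}",       # ISO key into date dim
    }

warehouse = [transform(r) for r in source_rows]    # the "load" step
```

Even this toy version shows why ETL dominates project effort: every source field needs an explicit rule mapping it into the warehouse's model.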
Data Quality Assurance
data cleansing
the process of validating and enriching the data as it is
published to the DW
also, a software development tool for building data
cleansing processes (a data cleansing tool)
many production DWs have only very rudimentary data
quality assurance processes
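A minimal sketch of the validate-and-enrich step, with invented field names and rules, might look like this; the point is that each record is checked against explicit expectations before it is published to the DW:

```python
def cleanse(row, valid_regions):
    """Validate and enrich one record before publishing it to the DW.
    Returns the (possibly enriched) row plus a list of quality errors."""
    errors = []
    if not row.get("customer_name"):
        errors.append("missing customer_name")
    if row.get("region") not in valid_regions:
        errors.append(f"unknown region: {row.get('region')}")
    if row.get("amount", 0) < 0:
        errors.append("negative amount")
    # Enrichment: derive a standardized attribute rather than
    # trusting each source system's formatting.
    row["name_upper"] = (row.get("customer_name") or "").upper()
    return row, errors
```

Records with a non-empty error list can be quarantined for review instead of silently loaded, which is exactly the rudimentary step many production DWs skip.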
Data Acquisition & Cleansing
getting data loaded efficiently and correctly is critical
to the success of DW
implementation of data acquisition & cleansing processes
represents from 50 to 80% of effort on typical DW projects
inaccurate data content can create problems with user
acceptance
Best Practice #2
Carefully design the data acquisition and cleansing
processes for your DW
Ensure the data is processed efficiently and
accurately
Consider acquiring ETL and Data Cleansing tools
Use them well!
Data Model
Already discussed the benefits of a dimensional
model
No matter whether dimensional modeling or any
other design approach is used, the data model
must be documented
Documenting the Data Model
The best practice is to use some kind of data modeling tool
CA ERwin
Sybase PowerDesigner
Oracle Designer
IBM Rational Rose
Etc.
Different tools support different modeling notations, but
they are more or less equivalent anyway
Most tools allow sharing of their metadata with an ETL
tool
Data Model Standards
data model standards appropriate for the
environment and tools chosen in your data
warehouse should be adopted
consideration should be given to data access tool(s)
and integration with overall enterprise standards
standards must be documented and enforced within
the DW team
someone must own the data model
to ensure a quality data model, all changes should be
reviewed through some formal process
Data Model Metadata
Business definitions should be recorded for every field
(except purely technical fields)
Domain of data should be recorded
Sample values should be included
As more metadata is populated into the modeling tool it
becomes increasingly important to be able to share this data
across ETL and Data Access tools
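The metadata described above (business definition, domain, sample values) can be pictured as one record per field. This Python sketch uses a hypothetical `order_status` field; the payoff of sharing such metadata is that ETL can validate incoming values against the recorded domain:

```python
# One metadata entry per field: business definition, domain, samples.
field_metadata = {
    "order_status": {
        "definition": "Current fulfilment state of the order.",
        "domain": ["NEW", "SHIPPED", "DELIVERED", "RETURNED"],
        "samples": ["NEW", "SHIPPED"],
    },
}

def validate_against_metadata(field, value):
    """Check a value against the field's recorded domain; fields with
    no metadata entry are passed through unchecked."""
    meta = field_metadata.get(field)
    return meta is None or value in meta["domain"]
```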
Metadata Architecture
The strategy for sharing data model and other metadata
should be formalized and documented
Metadata management tools should be considered & the
overall metadata architecture should be carefully planned
Best Practice #3
Design a metadata architecture that allows sharing of
metadata between components of your DW
consider metadata standards such as OMG's Common
Warehouse Metamodel (CWM)
Source : www.service-architecture.com/web-
services/articles/common_warehouse_meta-
model_cwm.html
Alternative Architecture Approaches
Bill Inmon: Corporate Information Factory
Hub and Spoke philosophy
JBOC just a bunch of cubes
let the environment evolve naturally
What We Want (Architectural Principle)
In most cases, business and IT agree that the data warehouse
should provide a single version of the truth
Any approach that can result in disparate data marts or cubes
is undesirable
This is known as creating data silos
Enterprise DW Architecture
how to design an enterprise data warehouse
and ensure a single version of the truth?
according to Kimball:
start with an overall data architecture phase
use Data Warehouse Bus design to integrate
multiple data marts
use incremental approach by building one data mart
at a time
Data Warehouse Bus Architecture
named for the bus in a computer
standard interface that allows you to plug in cdrom,
disk drive, etc.
these peripherals work together smoothly
provides framework for data marts to fit together
allows separate data marts to be implemented by
different groups, even at different times
Data Mart Definition
a data mart is a complete subset of the overall data
warehouse
a single business process OR
a group of related business processes
think of a data mart as a collection of related fact tables
sharing conformed dimensions, aka a fact constellation
Designing The DW Bus
determine which dimensions will be shared across multiple
data marts
conform the shared dimensions
produce a master suite of shared dimensions
determine which facts will be shared across data marts
conform the facts
standardize the definitions of facts
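The "produce a master suite of shared dimensions" step above amounts to taking the union of what each mart needs. A small sketch with invented mart and attribute names:

```python
# Each data mart states which customer attributes it needs; the
# conformed dimension carries the union, so any one mart can simply
# ignore the attributes it does not use.
mart_requirements = {
    "sales":   {"customer_id", "name", "region"},
    "support": {"customer_id", "name", "tier"},
}

# The master (conformed) customer dimension definition:
conformed_customer_attrs = set().union(*mart_requirements.values())
```

Conforming the facts works analogously: one agreed definition (name, units, grain) per shared measure, recorded once and reused by every mart.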
Dimension Granularity
conformed dimensions will usually be granular
makes it easy to integrate with various base level fact
tables
easy to extend fact table by adding new facts
no need to drop or reload fact tables, and no keys have to
be changed
Conforming Dimensions
by adhering to standards, the separate data marts can be
plugged together
e.g. customer, product, time
they can even share data usefully, for example in a
drill-across report
ensures reports or queries from different data marts share
the same context
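A drill-across report can be sketched as two separate queries, one per mart, merged on a shared conformed dimension. The product names and measures below are illustrative:

```python
# Drill-across: query each fact table separately, then merge the
# result sets on attributes of a shared, conformed dimension (product).
sales_by_product   = {"Widget": 500.0, "Gadget": 300.0}  # sales mart
returns_by_product = {"Widget": 40.0}                    # returns mart

report = {
    p: {"sales": sales_by_product.get(p, 0.0),
        "returns": returns_by_product.get(p, 0.0)}
    for p in sales_by_product.keys() | returns_by_product.keys()
}
# The merge only works because both marts spell "Widget" identically,
# i.e. because the product dimension is conformed.
```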
Conforming Dimensions (contd)
accomplish this by adding any dimension attribute(s) needed
in any data mart(s) to the standard dimension definition
attributes not needed everywhere can always be ignored
typically harder to determine how to load conformed
dimensions than to design them initially
need a single integrated ETL process
which source system is the System of Record (SOR) for each
attribute?
how do we deal with attributes for which there is more
than one possible SOR?
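One common way to handle attributes with more than one possible SOR is an explicit per-attribute precedence list, so the ETL resolves conflicts the same way on every load. The system names and precedence order here are assumptions for illustration:

```python
# Per-attribute source precedence: the first listed system that
# supplies a non-empty value wins.
SOR_PRECEDENCE = {
    "email":   ["crm", "billing"],      # CRM is the SOR for email
    "address": ["billing", "crm"],      # billing is the SOR for address
}

def resolve(attribute, values_by_system):
    """Pick the value from the highest-precedence system that has one."""
    for system in SOR_PRECEDENCE[attribute]:
        value = values_by_system.get(system)
        if value:
            return value
    return None   # no system supplied a value
```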
Data Consolidation
a current trend in BI/DW is data consolidation
from a software vendor perspective, it is tempting to simplify
this:
we can keep all the tables for all your disparate
applications in one physical database
Data Integration
To truly achieve a single version of the truth, you must do
more than simply consolidate application databases
Must integrate data models and establish common terms of
reference
Best Practice #4
Take an approach that consolidates data into a single version
of the truth
Data Warehouse Bus
conformed dimensions
OR?
Operational Data Store (ODS)
a single point of integration for disparate operational
systems
contains integrated data at the most detailed level
(transactional)
may be loaded in near real time or periodically
can be used for centralized operational reporting
Role of an ODS in DW Architecture
In the case where an ODS is a necessary component of the
overall DW, it should be carefully integrated into the overall
architecture
Can also be used for:
Staging area
Master/reference data management
Etc
ODS Data Model
Not clear if any design approach for an ODS data model has
emerged as a best practice
normalized
dimensional
denormalized/hybrid
any suggestions?
Best Practice #5
Consider implementing an ODS only when information
retrieval requirements are near the bottom of the data
abstraction pyramid and/or when there are multiple
operational sources that need to be accessed
Must ensure that the data model is integrated, not just
consolidated
May consider 3NF data model
Capacity Planning
DW workloads are typically very demanding, especially for
I/O capacity
Successful implementations tend to grow very quickly, both
in number of users and data volume
Rules of thumb do exist for sizing the hardware platform to
provide adequate initial performance
typically based on the estimated raw data size of the proposed
database, e.g. 100-150 GB per modern CPU
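The rule of thumb above turns into simple arithmetic once a growth assumption is added. The growth rate and planning horizon below are assumptions, not figures from this deck:

```python
import math

def cpus_needed(raw_gb, gb_per_cpu=125, annual_growth=0.5, years=2):
    """Back-of-envelope CPU count from the 100-150 GB/CPU rule of thumb,
    sized for the projected data volume rather than today's."""
    projected = raw_gb * (1 + annual_growth) ** years
    return math.ceil(projected / gb_per_cpu)

# e.g. 400 GB of raw data today, assumed 50% annual growth over 2 years:
# 400 * 1.5^2 = 900 GB, so 8 CPUs at 125 GB/CPU.
```

The point is less the exact numbers than the habit: size for where successful implementations quickly end up, not for day one.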
Best Practice #6
Create a capacity plan for your BI application & monitor it
carefully
Consider future additional performance demands
Establish standard performance benchmark queries and
regularly run them
Implement capacity monitoring tools
Build scalability into your architecture
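The "establish standard benchmark queries and regularly run them" bullet can be sketched as a tiny timing harness; the query here is a stand-in, and a real harness would run warehouse queries and persist the trend:

```python
import time

def run_benchmark(name, query_fn, history):
    """Time one standard benchmark query and append the elapsed seconds
    to a trend log; re-running the same suite regularly makes capacity
    degradation visible before users complain."""
    start = time.perf_counter()
    query_fn()
    elapsed = time.perf_counter() - start
    history.setdefault(name, []).append(elapsed)
    return elapsed

history = {}
run_benchmark("top_customers", lambda: sum(range(100_000)), history)
```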
Open Source Affordability
Another emerging trend in IT generally is to utilize Open
Source software running on commodity hardware
this is expected to offer lower total cost of ownership
certainly, GNU/Linux and other Open Source initiatives
do provide very good functionality and quality for
minimal cost
This trend also applies to BI & DW:
most traditional RDBMSs are now supported on Linux
however, open source RDBMSs lag behind on providing
good performance for DW queries
DW Appliances
DW appliances, consisting of packaged solutions providing
all required software and hardware, are beginning to offer
very promising price/performance
production experience is limited so far, so this is not yet a
best practice
Q & A
Thank You
For suggestions and queries, email
bhimappa_Desai@satyam.com
