BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus
Dr. Yashvardhan Sharma
CSIS Dept., BITS-Pilani
SS ZG515 - Data
Warehousing
Data Warehousing Architecture
[Figure: data warehousing architecture. Labels: External Sources; Operational dbs; Extract, Transform, Load, Refresh; Metadata Repository; Monitoring & Administration; Data Marts; Serve; OLAP servers; Analysis; Query/Reporting; Data Mining]
Data Warehouse Architecture
Data Warehouse COMPONENTS
Source Data Component
Production Data
Internal Data
Archived Data
External Data
Data Staging Component
Data Extraction
Data Transformation
Data Loading
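The three staging steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production pipeline: the source rows, the field names (`cust`, `amount`, `date`), and the `sales` target table are all hypothetical.

```python
import sqlite3

# Hypothetical rows as they might arrive from source systems.
SOURCE_ROWS = [
    {"cust": " alice ", "amount": "120.50", "date": "2009-01-15"},
    {"cust": "BOB", "amount": "80", "date": "2009-02-03"},
    {"cust": "", "amount": "15", "date": "2009-02-04"},  # missing customer
]

def extract():
    """Extraction: pull raw rows from the source (here, an in-memory list)."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transformation: standardize names, convert types, drop unusable rows."""
    clean = []
    for row in rows:
        name = row["cust"].strip().title()
        if not name:
            continue  # reject rows that have no customer name
        clean.append((name, float(row["amount"]), row["date"]))
    return clean

def load(conn, rows):
    """Loading: write the transformed rows into the (hypothetical) target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the way the staging component is described, and lets each step be tested or replaced on its own.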
Data Loading
Data Storage Component
Many of the data warehouses also employ
multidimensional database management
systems. Data extracted from the data
warehouse storage is aggregated in many
ways and the summary data is kept in the
multidimensional databases (MDDBs). Such
multidimensional database systems are
usually proprietary products.
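To make "aggregated in many ways" concrete, the sketch below computes summary totals for every combination of dimensions from a handful of detailed rows. It is only an illustration in plain Python, not how a proprietary MDDB product works internally; the rows and the dimension names (`product`, `state`, `year`) are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical detailed rows: (product, state, year, sales amount).
DETAIL = [
    ("P1", "Rajasthan", 2009, 100.0),
    ("P1", "Gujarat", 2009, 50.0),
    ("P2", "Rajasthan", 2009, 70.0),
    ("P1", "Rajasthan", 2010, 30.0),
]
DIMENSIONS = ("product", "state", "year")

def build_summaries(rows):
    """Aggregate the sales amount over every subset of the dimensions."""
    cube = {}
    for k in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in dims)
                totals[key] += row[3]
            names = tuple(DIMENSIONS[i] for i in dims)
            cube[names] = dict(totals)
    return cube

cube = build_summaries(DETAIL)
print(cube[()])                  # grand total
print(cube[("product",)])        # totals by product
print(cube[("state", "year")])   # totals by state and year
```

With three dimensions there are eight such summaries, which is why precomputed aggregates grow quickly as dimensions are added.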
Information Delivery Component
Metadata Component
Metadata in a data warehouse is similar to a
data dictionary, but it covers much more than
a data dictionary does.
Types of Metadata
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata
More Details in Chapter 9.
Why Meta Data: Special Significance
First, it acts as the glue that connects all parts
of the data warehouse.
Next, it provides information about the
contents and structures to the developers.
Finally, it opens the door to the end-users and
makes the contents recognizable in their own
terms.
The architecture
[Figure: Typical architecture of a data warehouse. Labels: operational data sources 1 to n; operational data store (ODS); load manager; warehouse manager; query manager; DBMS; meta-data; detailed data; lightly summarized data; highly summarized data; archive/backup data; end-user access tools (reporting, query, application development, and EIS (executive information system) tools; OLAP (online analytical processing) tools; data mining tools)]
The main components
Operational data sources: data for the DW is supplied from
mainframe operational data held in first-generation hierarchical and
network databases, departmental data held in proprietary file systems,
private data held on workstations and private servers, and external systems
such as the Internet, commercially available databases, or databases
associated with an organization's suppliers or customers.
Operational data store (ODS): a repository of current and
integrated operational data used for analysis. It is often structured and
supplied with data in the same way as the data warehouse, but may in fact
simply act as a staging area for data to be moved into the warehouse.
Load manager: also called the front-end component, it performs all
the operations associated with the extraction and loading of data into the
warehouse. These operations include simple transformations of the data
to prepare the data for entry into the warehouse.
Warehouse manager: performs all the operations associated with the
management of the data in the warehouse. The operations performed by
this component include analysis of data to ensure consistency,
transformation and merging of source data, creation of indexes and views,
generation of denormalizations and aggregations, and archiving and
backing up data.
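A few of the warehouse manager's tasks named above (creating indexes and views, generating aggregations) can be illustrated with Python's built-in `sqlite3` module. This is a sketch only; the table and column names (`sales_detail`, `sales_2009`, `sales_by_product_year`) are hypothetical, and a real warehouse manager would run such steps as scheduled, monitored jobs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_detail (product TEXT, state TEXT, year INTEGER, amount REAL);
INSERT INTO sales_detail VALUES
    ('P1', 'Rajasthan', 2009, 100.0),
    ('P1', 'Gujarat',   2009,  50.0),
    ('P2', 'Rajasthan', 2010,  70.0);

-- Index to speed up lookups by product.
CREATE INDEX idx_sales_product ON sales_detail (product);

-- View exposing one slice of the detailed data.
CREATE VIEW sales_2009 AS
    SELECT product, state, amount FROM sales_detail WHERE year = 2009;

-- Aggregation: a summary table derived from the detailed data.
CREATE TABLE sales_by_product_year AS
    SELECT product, year, SUM(amount) AS total
    FROM sales_detail GROUP BY product, year;
""")

summary = conn.execute(
    "SELECT product, year, total FROM sales_by_product_year ORDER BY product, year"
).fetchall()
view_rows = conn.execute("SELECT COUNT(*) FROM sales_2009").fetchone()[0]
print(summary, view_rows)
```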

Query manager: also called the back-end component, it performs all the
operations associated with the management of user queries. The
operations performed by this component include directing queries to the
appropriate tables and scheduling the execution of queries.
Detailed, lightly summarized, and highly summarized data; archive/backup
data
Meta-data
End-user access tools: categorized into five main groups, namely data
reporting and query tools, application development tools, executive
information system (EIS) tools, online analytical processing (OLAP) tools,
and data mining tools.


Data flows
Inflow- The processes associated with the extraction, cleansing,
and loading of the data from the source systems into the data
warehouse.
Upflow- The processes associated with adding value to the data in
the warehouse through summarizing, packaging, and
distribution of the data.
Downflow- The processes associated with archiving and
backing up of data in the warehouse.
Outflow- The processes associated with making the data available
to the end-users.
Meta-flow- The processes associated with the management of
the meta-data.

[Figure: Information flows of a data warehouse. The same components as the typical architecture (operational data sources 1 to n, operational data store (ODS), load manager, warehouse manager, query manager, DBMS, meta-data, detailed data, lightly and highly summarized data, archive/backup data, end-user access tools), annotated with the five flows: Inflow, Upflow, Downflow, Outflow, Meta-flow]
Tools and Technologies
The critical steps in the construction of a data
warehouse:
a. Extraction
b. Cleansing
c. Transformation
After the critical steps, the results are loaded into the target
system. This work can be carried out either by separate
products or by a single integrated product, in the following categories:
code generators
database data replication tools
dynamic transformation engines







Data Cleaning
Why?
Data warehouse contains data that is analyzed for business
decisions
More data and multiple sources could mean more errors in the data,
and such errors become harder to trace
Results in incorrect analysis
Detecting data anomalies and rectifying them
early has huge payoffs
Long Term Solution
Change business practices and data entry tools
Repository for meta-data
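Detecting anomalies early can start with simple rule checks at load time. The sketch below is one possible illustration, not a complete cleansing tool: the records, the age range, and the `VALID_STATES` reference list are all assumed for the example.

```python
# Hypothetical customer records merged from several sources.
RECORDS = [
    {"id": 1, "name": "Asha Rao", "age": 34, "state": "RJ"},
    {"id": 2, "name": "", "age": 41, "state": "GJ"},              # missing name
    {"id": 3, "name": "Vikram Sen", "age": 212, "state": "RJ"},   # implausible age
    {"id": 4, "name": "Meera Das", "age": 29, "state": "Raj"},    # non-standard code
    {"id": 1, "name": "Asha Rao", "age": 34, "state": "RJ"},      # duplicate id
]
VALID_STATES = {"RJ", "GJ"}  # assumed reference list of state codes

def find_anomalies(records):
    """Return (record id, problem) pairs for simple rule violations."""
    problems = []
    seen_ids = set()
    for rec in records:
        if rec["id"] in seen_ids:
            problems.append((rec["id"], "duplicate id"))
        seen_ids.add(rec["id"])
        if not rec["name"].strip():
            problems.append((rec["id"], "missing name"))
        if not 0 <= rec["age"] <= 120:
            problems.append((rec["id"], "age out of range"))
        if rec["state"] not in VALID_STATES:
            problems.append((rec["id"], "unknown state code"))
    return problems

anomalies = find_anomalies(RECORDS)
for rec_id, problem in anomalies:
    print(rec_id, problem)
```

Reporting the problems rather than silently fixing them keeps the errors traceable back to the source system, which supports the long-term solution of changing data entry practices.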



Soundex Algorithms
Misspelled terms
For example NAMES
Phonetic algorithms can find similar
sounding names
Based on the six phonetic classifications
of human speech sounds
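As an illustration, here is a short Python sketch of the commonly described American Soundex variant: keep the first letter, map the remaining consonants to six digit classes, collapse adjacent repeats, and pad to four characters. The slides do not say which variant is intended, so treat the exact rules (including the handling of H and W) as an assumption.

```python
# The six consonant classes used by American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels (and Y) have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]

for name in ("Robert", "Rupert", "Smith", "Smyth"):
    print(name, soundex(name))
```

Names that sound alike, such as Robert and Rupert or Smith and Smyth, receive the same code, so a misspelled name can still be matched against the correct entry.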
Prof. Navneet Goyal, BITS, Pilani
Data Warehouse Design
OLTP Systems are Data Capture Systems
DATA IN systems
DW are DATA OUT systems

Analyzing the DATA
Active Analysis: User Queries
User-guided data analysis
Show me how X varies with Y
OLAP
Automated Analysis: Data Mining
What's in there?
Set the computer FREE on your data
Supervised Learning (classification)
Unsupervised Learning (clustering)

OLAP Queries
How much of product P1 was sold in 2009,
state-wise?
Top 5 selling products in 2010
Total sales in Q1 of FY 2008-09?
Color-wise sales figures for cars from 2008 to
2010
Model-wise sales of cars for the month of
January, from 2006 to 2010
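Two of the queries above might be written in SQL roughly as follows, run here through Python's built-in `sqlite3` module. The `sales` table, its columns, and the sample rows are assumptions made for the example; a real warehouse would more likely use a fact table joined to dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, state TEXT, year INTEGER, units INTEGER);
INSERT INTO sales VALUES
    ('P1', 'Rajasthan', 2009, 10),
    ('P1', 'Gujarat',   2009,  4),
    ('P1', 'Rajasthan', 2010,  7),
    ('P2', 'Rajasthan', 2010, 12),
    ('P3', 'Gujarat',   2010,  3);
""")

# "How much of product P1 was sold in 2009, state-wise?"
p1_by_state = conn.execute("""
    SELECT state, SUM(units) FROM sales
    WHERE product = 'P1' AND year = 2009
    GROUP BY state ORDER BY state
""").fetchall()

# "Top 5 selling products in 2010"
top_2010 = conn.execute("""
    SELECT product, SUM(units) AS total FROM sales
    WHERE year = 2010
    GROUP BY product ORDER BY total DESC LIMIT 5
""").fetchall()

print(p1_by_state)
print(top_2010)
```

Both queries follow the same pattern: filter on some dimensions, group by others, and aggregate a measure.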


Data Mining Investigations
Which types of customers are most likely to spend the most
with us in the coming year?
What additional products are most likely to be sold to
customers who buy sportswear?
In which area should we open a new store in the next
year?
What are the characteristics of customers most likely to
default on their loans before the year is out?
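The sportswear question can be approached, in its simplest form, by counting which items co-occur with sportswear in past purchases. The sketch below ranks items by the share of sportswear baskets that also contain them (the confidence measure from association-rule mining). It is a toy illustration on made-up baskets, not a full data mining algorithm.

```python
from collections import Counter

# Hypothetical baskets, one set of product categories per customer.
BASKETS = [
    {"sportswear", "running shoes", "water bottle"},
    {"sportswear", "running shoes"},
    {"sportswear", "socks"},
    {"formal shirt", "socks"},
    {"running shoes", "water bottle"},
]

def likely_additions(baskets, anchor):
    """Rank other items by P(item in basket | anchor in basket)."""
    with_anchor = [b for b in baskets if anchor in b]
    counts = Counter(item for b in with_anchor for item in b if item != anchor)
    return sorted(
        ((item, n / len(with_anchor)) for item, n in counts.items()),
        key=lambda pair: (-pair[1], pair[0]),  # highest confidence first
    )

ranking = likely_additions(BASKETS, "sportswear")
for item, confidence in ranking:
    print(f"{item}: {confidence:.2f}")
```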


Continuum of Analysis

[Figure: a continuum of analysis running from OLTP (primitive and canned analysis) through OLAP (complex ad-hoc analysis) to Data Mining (automated analysis); the tools range from SQL to specialized algorithms]
Net Resources
Online Resources
The Data Warehousing Institute
www.tdwi.org
Data Warehousing on www
www.datawarehousing.org
www.datawarehousing.com

Online Magazines & Periodicals
www.intelligententerprise.com
www.dmreview.com
www.cio.com
www.daniel-lemire.com/OLAP/index.html


Data Marts
What is a data mart?
Advantages and disadvantages of data
marts
Issues with the development and
management of data marts
Data Marts
A subset of a data warehouse that supports the requirements
of a particular department or business process
A data mart is a subset of corporate-wide data that is of value
to a specific group of users. Its scope is confined to specific,
selected groups, such as a marketing data mart.
Characteristics include:
Unlike a data warehouse, does not always contain detailed
data
More easily understood and navigated
Can be dependent or independent
Data Mart: A scaled-down version of the data
warehouse
A data mart is a small warehouse designed for
the department level.
It is often a way to gain entry and provide an
opportunity to learn
Major problem: if they differ from department
to department, they can be difficult to
integrate enterprise-wide
Reasons for Creating Data Marts
Proof of Concept for the DW
Can be developed quickly and is less resource-
intensive than a DW
To give users access to data they need to
analyze most often
To improve query response time due to
reduction in the volume of data to be
accessed
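The last point, reduced data volume, can be seen in a small sketch of a dependent data mart: it is derived from the warehouse, keeps only one department's rows, and stores them summarized. The example uses Python's `sqlite3` module; the `warehouse_facts` and `marketing_mart` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical enterprise warehouse table covering several departments.
CREATE TABLE warehouse_facts (
    department TEXT, campaign TEXT, month TEXT, spend REAL
);
INSERT INTO warehouse_facts VALUES
    ('marketing', 'Diwali', '2013-10', 500.0),
    ('marketing', 'Diwali', '2013-11', 300.0),
    ('marketing', 'Summer', '2013-05', 200.0),
    ('sales',     NULL,     '2013-10', 900.0),
    ('inventory', NULL,     '2013-10', 150.0);

-- Dependent data mart: only marketing rows, summarized per campaign.
CREATE TABLE marketing_mart AS
    SELECT campaign, SUM(spend) AS total_spend
    FROM warehouse_facts
    WHERE department = 'marketing'
    GROUP BY campaign;
""")

mart_rows = conn.execute(
    "SELECT campaign, total_spend FROM marketing_mart ORDER BY campaign"
).fetchall()
warehouse_count = conn.execute("SELECT COUNT(*) FROM warehouse_facts").fetchone()[0]
print(mart_rows, warehouse_count)
```

Marketing users query two summary rows instead of scanning all five warehouse rows; at real volumes that difference is what improves response time.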
Kimball vs Inmon
Bill Inmon's paradigm: Data warehouse is one part
of the overall business intelligence system. An
enterprise has one data warehouse, and data marts
source their information from the data warehouse. In
the data warehouse, information is stored in 3rd
normal form.

Ralph Kimball's paradigm: Data warehouse is the
conglomerate of all data marts within the enterprise.
Information is always stored in the dimensional
model.
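The storage difference between the two paradigms can be sketched side by side. Below, the same hypothetical sales data is held once in a normalized layout (in the spirit of 3rd normal form) and once in a dimensional layout with a fact table and a denormalized product dimension. All table and column names are assumptions for illustration; neither author prescribes this exact schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized layout: category lives in its own table.
CREATE TABLE category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT,
                      category_id INTEGER REFERENCES category(category_id));
CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                   product_id INTEGER REFERENCES product(product_id),
                   amount REAL);
INSERT INTO category VALUES (1, 'Cars'), (2, 'Bikes');
INSERT INTO product VALUES (10, 'Hatchback', 1), (11, 'Sedan', 1), (20, 'Scooter', 2);
INSERT INTO sale VALUES (1, 10, 400.0), (2, 11, 600.0), (3, 20, 90.0);

-- Dimensional layout: one denormalized product dimension plus a fact table.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                          category_name TEXT);
CREATE TABLE fact_sales (product_key INTEGER REFERENCES dim_product(product_key),
                         amount REAL);
INSERT INTO dim_product
    SELECT p.product_id, p.product_name, c.category_name
    FROM product p JOIN category c ON c.category_id = p.category_id;
INSERT INTO fact_sales SELECT product_id, amount FROM sale;
""")

normalized = conn.execute("""
    SELECT c.category_name, SUM(s.amount)
    FROM sale s
    JOIN product p ON p.product_id = s.product_id
    JOIN category c ON c.category_id = p.category_id
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()

dimensional = conn.execute("""
    SELECT d.category_name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.category_name ORDER BY d.category_name
""").fetchall()

print(normalized == dimensional, dimensional)
```

Both layouts answer the question identically; the dimensional one needs fewer joins per query, while the normalized one stores each category name only once.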
Bill Inmon: Endorses a Top-Down design
Independent data marts cannot comprise an effective EDW.
Organizations must focus on building EDW
Ralph Kimball: Endorses a Bottom-Up design
The EDW effectively grows up around several
independent data marts, such as those for sales, inventory, or
marketing
Kimball vs Inmon: War of Words
"...The data warehouse is nothing more than the
union of all the data marts...,"
Ralph Kimball, December 29, 1997.

"You can catch all the minnows in the ocean and
stack them together and they still do not make
a whale,"
Bill Inmon, January 8, 1998.
Kimball vs. Inmon
There is no right or wrong between these two ideas,
as they represent different data warehousing
philosophies. In reality, the data warehouses in most
enterprises are closer to Ralph Kimball's idea. This is
because most data warehouses started out as a
departmental effort, and hence they originated as a
data mart. Only when more data marts are built later
do they evolve into a data warehouse.
Data Warehousing Process
Enterprise-wide warehouse, top down, the
Inmon methodology
Data mart, bottom up, the Kimball
methodology
When properly executed, both result in an
enterprise-wide data warehouse
Data warehouse versus data mart.
Building a Data Warehouse
Questions to be asked:
Top-down or bottom-up approach?
Enterprise-wide or departmental?
Which first: data warehouse or data mart?
Build pilot or go with a full-fledged
implementation?
Dependent or independent data marts?
Top-Down Versus Bottom-Up Approach
Data Warehouse or Data Mart First?
Top-Down vs. Bottom-Up Approach
Advantages of Top-Down
A truly corporate effort, an enterprise view of data
Inherently architected, not a union of disparate DMs
Single, central storage of data about the content
Central rules and control
May be developed fast using an iterative approach
Disadvantages of Top-Down
Takes longer to build, even with an iterative method
High exposure to the risk of failure
Needs high level of cross functional skills
High outlay without proof of concept
Difficult to sell this approach to senior management
and sponsors

Advantages of Bottom-Up Approach
Faster and easier implementation of manageable
pieces
Favorable ROI and proof of concept
Less risk of failure
Inherently incremental; can schedule important DMs
first
Allows project team to learn and grow

Disadvantages of Bottom-Up Approach
Each DM has its own narrow view of data
Redundant data permeates every DM
Difficult to integrate if the overall requirements are not
considered in the beginning
Kimball's approach is considered a Bottom-Up
approach, but he disagrees

Dependent Data Marts
Independent Data Marts
The Bottom-Up Misnomer
Kimball encourages you to broaden your
perspective both vertically and horizontally
when gathering business requirements for
developing data marts

Vertical
Don't just rely on the business data analyst to
determine requirements
Inputs from senior managers about their vision,
objectives, and challenges are critical
Ignoring this vertical span might cause failure in
understanding the organization's direction and likely
future trends
Horizontal
Look horizontally across the departments before designing the
DW
Critical in establishing the enterprise view
Challenging to do if one particular department is funding the
project
Ignoring horizontal span will create isolated, department-centric
databases that are inconsistent and can't be integrated
Complete coverage in a large organization is difficult
One rep. from each dept. interacting with the core development
team can be of immense help
Data Warehouse or Data Mart First?
New Practical approach by Kimball
1. Plan and define requirements at the overall corporate
level
2. Create a surrounding architecture for a complete
warehouse
3. Conform and standardize the data content
4. Implement the Data Warehouse as a series of
Supermarts, one at a time
A Word about SUPERMARTS
Totally monolithic approach vs. totally stovepipe approach
A step-by-step approach for building an EDW from granular data
A Supermart is a data mart that has been carefully built with a
disciplined architectural framework
A Supermart is naturally a complete subset of the DW
A Supermart is based on the most granular data that can possibly
be collected and stored
Conformed dimensions and standardized fact definitions
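Conformed dimensions are what allow separately built supermarts to be combined later. In the sketch below, two hypothetical supermarts (sales and inventory) share one product dimension with the same keys, so their facts can be lined up in a single report. The data and names are invented for illustration.

```python
# Conformed product dimension shared by two hypothetical supermarts.
DIM_PRODUCT = {1: "Hatchback", 2: "Sedan"}

# Each supermart stores its facts against the same product keys.
SALES_FACTS = [(1, 40), (2, 25), (1, 10)]   # (product_key, units_sold)
INVENTORY_FACTS = [(1, 100), (2, 60)]       # (product_key, units_on_hand)

def total_by_key(facts):
    """Sum a measure per dimension key."""
    totals = {}
    for key, value in facts:
        totals[key] = totals.get(key, 0) + value
    return totals

def drill_across():
    """Combine measures from both supermarts via the shared dimension."""
    sold = total_by_key(SALES_FACTS)
    stock = total_by_key(INVENTORY_FACTS)
    return {
        name: {"units_sold": sold.get(key, 0), "units_on_hand": stock.get(key, 0)}
        for key, name in DIM_PRODUCT.items()
    }

report = drill_across()
print(report)
```

If each mart had defined its own product keys and names, the two sets of facts could not be matched without a separate reconciliation step, which is the stovepipe problem named above.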
Pilot Projects: Risk vs. Reward
Start with a pilot implementation as the first
rollout for DW
Pilot projects have the advantage of being small
and manageable
Provide organization with a proof of concept
Functional scope of a pilot project should be
determined based on:
1. The degree of risk the enterprise is willing to
take
2. The potential for leveraging the pilot project
Avoid constructing a throwaway prototype
Pilot warehouse must have actual value to the
enterprise

Risk vs. reward quadrants (axes: RISK and REWARD):
High Risk, Low Reward | High Risk, High Reward
Low Risk, Low Reward | Low Risk, High Reward
A Practical Approach
Most people employ a hybrid approach with elements of Top-Down and Bottom-Up
Again, practitioners don't always concentrate on these issues or use this
terminology; they just focus on best practice
That would include:
Build incrementally according to a business function
Employ an enterprise perspective
Dimensionally model data
Utilise conformed dimensional models
Employ a Staging Area or Data Warehouse
Store atomic data