
Data Warehousing: Introduction

 Problems with Data Warehousing


 Underestimation of resources for data loading
 Hidden problems with source systems
 Required data not captured
 Increased end-user demands
 Data homogenization
 High demand for resources
 Data ownership
 High maintenance
 Long-duration projects
 Complexity of integration
Data Warehousing Architecture
[Figure: data warehousing architecture. Labels: External Sources; Operational dbs; Extract, Transform, Load, Refresh; Metadata Repository; Monitoring & Administration; Data Marts; OLAP servers; Serve; Analysis; Query/Reporting; Data Mining.]
Disadvantages of On-Demand Approach
 Poor response time due to delay in query processing
 Slow or unavailable data sources
 Time-consuming and complex filtering and integration
 Inefficient and potentially expensive for frequent queries
 Wrappers compete for resources with local applications at data sources
 There are only a few notable systems based on this approach, e.g.
   TAMBIS: Transparent Access to Multiple Bio-informatics Information Systems
   SRS: Sequence Retrieval System
   OPM (Object Protocol Model) based multi-database tools and query language (OPM-QL)
Two Approaches:
(2) Data Warehousing
• In advance/Eager data integration
• Integrated data is persistently stored in a database (the data warehouse) for direct querying and analysis
Advantages of Data Warehousing Approach
 High-performance query processing
   Though the information returned may not be the most up-to-date
 Does not interfere with local data processing at sources
   Analytical Querying/Statistical Analysis or On-Line Analytical Processing (OLAP) at the warehouse
   On-Line Transaction Processing (OLTP) at data sources
 Data persistently stored at the warehouse
   Data at the warehouse can be further restructured, aggregated, summarized and modified if necessary.
   A DW may store historical/archive data.
 The data warehousing approach has been widely used, e.g.
   The Maryland ADMS Project
   Supporting Data Integration and Warehousing Using H2O
   The Stanford Data Warehousing Project
   GIMS: Genome Information Management System
   Marks & Spencer Data Warehouse
Trade-off between Query-Driven and Data Warehousing Approaches
 Query-driven approach is still better for:
   Rapidly changing information/data sources;
   Accessing very large amounts of data from many sources;
   Clients with unpredictable and dynamic requirements
 Data Warehousing is more suitable when:
   Data sources on which a data warehouse is based are not frequently changing;
   Having fully up-to-date data is not critically important;
   Querying and analysis is complex;
   Data needs to be highly summarized and aggregated;
   Fast access to integrated and derived data is vital; and
   Keeping the data warehouse consistent with the underlying data sources is efficient and does not compromise expected performance.
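The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not taken from any of the systems named above: the class names `QueryDrivenMediator` and `Warehouse`, and the idea of modelling a source as a plain callable, are assumptions made only for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Source = Callable[[], List[Row]]  # hypothetical: a source returns its rows when called


class QueryDrivenMediator:
    """On-demand (lazy) integration: every query goes back to the sources."""

    def __init__(self, sources: List[Source]) -> None:
        self.sources = sources

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        rows = [row for source in self.sources for row in source()]
        return [row for row in rows if predicate(row)]


class Warehouse:
    """In-advance (eager) integration: data is copied once and stored."""

    def __init__(self, sources: List[Source]) -> None:
        self.sources = sources
        self.store: List[Row] = []
        self.refresh()

    def refresh(self) -> None:
        # Sources are contacted only here, not at query time.
        self.store = [row for source in self.sources for row in source()]

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # Answers come from the stored copy, which may be stale until refresh().
        return [row for row in self.store if predicate(row)]
```

In this sketch the mediator always sees current data but loads the sources on every query, while the warehouse answers from its own copy and only touches the sources during `refresh()`.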
Data Warehouse Architectures
An overview of the Oracle data warehousing
implementation.
Data warehouses and their architectures vary depending upon the specifics of an organisation's situation.
Three common architectures are:
Data Warehouse Architecture - Basic
Data Warehouse Architecture - with a Staging Area
Data Warehouse Architecture - with a Staging Area and Data Marts
Data Warehouse COMPONENTS
Source Data Component
 Production Data
 Internal Data
 Archived Data
 External Data
Data Staging Component
 Data Extraction
 Data Transformation
 Data Loading
Data Storage Component
Many data warehouses also employ multidimensional database management systems. Data extracted from the data warehouse storage is aggregated in many ways, and the summary data is kept in multidimensional databases (MDDBs). Such multidimensional database systems are usually proprietary products.
Information Delivery Component
Metadata Component
Metadata in a data warehouse is similar to a data dictionary, but it covers much more than a data dictionary does.
Types of Metadata
 Operational Metadata
 Extraction and Transformation Metadata
 End-User Metadata
 More details in Chapter 9.
Why Meta Data: Special Significance
 First, it acts as the glue that connects all parts of the data warehouse.
 Next, it provides information about the contents and structures to the developers.
 Finally, it opens the door to the end-users and makes the contents recognizable in their own terms.
[Figure: operational data sources 1 to n and an operational data store (ODS); load manager, warehouse manager and query manager; a DBMS holding meta-data, detailed data, lightly summarized data and highly summarized data; archive/backup data; end-user access tools: reporting, query, application development, and EIS (executive information system) tools, OLAP (online analytical processing) tools, and data mining tools. Flows shown: inflow, upflow, downflow, outflow, meta-flow.]
Information flows of a data warehouse


 Data flows
Inflow: The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
Upflow: The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
Downflow: The processes associated with archiving and backing up of data in the warehouse.
Outflow: The processes associated with making the data available to the end-users.
Meta-flow: The processes associated with the management of the meta-data.
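As a rough illustration of the upflow, the sketch below summarizes detailed rows into lightly and highly summarized data. The sample rows, the `summarize` helper, and the choice of month/product and year as summary levels are all assumptions for this example, not part of the slides.

```python
from collections import defaultdict

# Hypothetical detailed sales rows: (date "YYYY-MM-DD", product, amount)
detailed = [
    ("2009-01-05", "P1", 100.0),
    ("2009-01-20", "P1", 50.0),
    ("2009-02-11", "P2", 70.0),
    ("2010-03-02", "P1", 30.0),
]


def summarize(rows, key):
    """Sum the amount of every row that shares the same summary key."""
    totals = defaultdict(float)
    for date, product, amount in rows:
        totals[key(date, product)] += amount
    return dict(totals)


# Lightly summarized: one total per (month, product).
lightly_summarized = summarize(detailed, lambda date, product: (date[:7], product))

# Highly summarized: one total per year.
highly_summarized = summarize(detailed, lambda date, product: date[:4])
```

Each level trades detail for smaller, faster-to-query data, which is the value the upflow is meant to add.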
 Tools and Technologies
The critical steps in the construction of a data
warehouse:
a. Extraction
b. Cleansing
c. Transformation
After the critical steps, loading the results into the target system can be carried out either by separate products or by a single product, in one of these categories:
code generators
database data replication tools
dynamic transformation engines
Problems and Issues
 Warehouse Maintenance
 Data sources (DSs) on which a DW is based may change over time.
 Changes at DSs may require changes at a DW.
 How often to propagate changes to a DW?
 At night, weekly/fortnightly/monthly, immediately, etc.
 How to propagate changes to a DW?
 Completely re-build all affected tables at the DW (easy but inefficient)
 Apply changes to affected tables incrementally (efficient but difficult)
 Performance
 How to assess if a DW is performing well?
 How to improve performance?
 Miscellaneous Issues
 Data Quality Assurance (How good is data in a DW?)
 How to cope with data warehouse evolution?
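The two ways of propagating changes listed under Warehouse Maintenance can be compared with a small sketch. It assumes a hypothetical summary table that keeps a total and a row count per product; the function names and the data layout are illustrative only.

```python
def rebuild(source_rows):
    """Completely recompute the summary from all source rows (easy but inefficient)."""
    summary = {}
    for product, amount in source_rows:
        total, count = summary.get(product, (0, 0))
        summary[product] = (total + amount, count + 1)
    return summary


def apply_incrementally(summary, inserted=(), deleted=()):
    """Apply only the changed rows to the existing summary (efficient but difficult)."""
    updated = dict(summary)
    for product, amount in inserted:
        total, count = updated.get(product, (0, 0))
        updated[product] = (total + amount, count + 1)
    for product, amount in deleted:
        total, count = updated[product]
        if count == 1:
            # Last contributing row is gone, so the summary row must go too.
            del updated[product]
        else:
            updated[product] = (total - amount, count - 1)
    return updated
```

The row count is what makes deletions safe here: without it, the incremental version could not tell an empty group from a group whose amounts happen to sum to zero, which hints at why incremental maintenance is the harder option.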
Populating & Refreshing the Warehouse
 Data Extraction
 Data Cleaning
 Data Transformation
 Convert from legacy/host format to warehouse
format
 Load
 Sort, summarize, consolidate, compute views,
check integrity, build indexes, partition
 Refresh
 Bring new data from source systems
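The stages above can be sketched as a tiny pipeline. This is a minimal illustration under stated assumptions: the legacy export format (`cust`, `sale_date` as DD/MM/YYYY, `amount`), the sample rows, and the use of a plain list as the warehouse are all made up for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical legacy export with stray whitespace and a missing amount.
LEGACY_EXPORT = """cust,sale_date,amount
 alice ,11/02/2009,70.00
bob,20/01/2009,
alice,05/01/2009,100.50
"""


def extract(text):
    """Data extraction: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Data cleaning: trim whitespace and drop rows with missing values."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if all(row.values()):
            cleaned.append(row)
    return cleaned


def transform(rows):
    """Data transformation: convert legacy format to the warehouse format."""
    return [
        {
            "customer": row["cust"],
            "sale_date": datetime.strptime(row["sale_date"], "%d/%m/%Y").date().isoformat(),
            "amount": float(row["amount"]),
        }
        for row in rows
    ]


def load(rows, warehouse):
    """Load: sort the batch and append it to the warehouse store."""
    warehouse.extend(sorted(rows, key=lambda row: row["sale_date"]))
    return warehouse


warehouse = load(transform(clean(extract(LEGACY_EXPORT))), [])
```

A refresh would rerun the same stages on new source data and append the result to the existing store.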
ETL Process : Issues & Challenges
 Consumes 70-80% of project time
 Heterogeneous Source Systems
 Little or no control over source systems
 Source systems scattered
 Source systems operating in different time zones
 Different currencies
 Different measurement units
 Data not captured by OLTP systems
 Ensuring data quality
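Several of these challenges (time zones, currencies, measurement units) come down to conforming source records to one warehouse convention. The sketch below assumes UTC, USD and kilograms as that convention; the site names, exchange rate and fixed UTC offsets are invented for illustration, and fixed offsets deliberately ignore daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# All lookup values below are assumptions for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}
KG_PER_UNIT = {"kg": 1.0, "lb": 0.45359237}
UTC_OFFSET_HOURS = {"new_york": -5, "berlin": 1}  # fixed offsets, no DST handling


def conform(record):
    """Convert one source record to UTC timestamps, USD amounts and kilograms."""
    offset = timezone(timedelta(hours=UTC_OFFSET_HOURS[record["site"]]))
    local_time = datetime.fromisoformat(record["sold_at"]).replace(tzinfo=offset)
    return {
        "sold_at_utc": local_time.astimezone(timezone.utc).isoformat(),
        "amount_usd": round(record["amount"] * RATES_TO_USD[record["currency"]], 2),
        "weight_kg": round(record["weight"] * KG_PER_UNIT[record["unit"]], 3),
    }
```

For example, a Berlin record sold at 09:00 local time for 80 EUR would be stored as 08:00 UTC and 100.0 USD under these assumed values.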
Data Staging Area
 A storage area where extracted data is
 Cleaned
 Transformed
 Deduplicated
 Initial storage for data
 Need not be based on Relational model
 Spread over a number of machines
 Mainly sorting and Sequential processing
 COBOL or C code running against flat files
 Does not provide data access to users
 Analogy – kitchen of a restaurant
Presentation Servers
 A target physical machine on which DW data is
organized for
 Direct querying by end users using OLAP
 Report writers
 Data Visualization tools
 Data mining tools
 Data stored in Dimensional framework
 Analogy – Sitting area of a restaurant
Data Cleaning
 Why?
 Data warehouse contains data that is analyzed for
business decisions
 More data and multiple sources could mean more
errors in the data and harder to trace such errors
 Results in incorrect analysis
 Detecting data anomalies and rectifying them
early has huge payoffs
 Long Term Solution
 Change business practices and data entry tools
 Repository for meta-data
Soundex Algorithms
 Misspelled terms
 For example, names
 Phonetic algorithms can find similar-sounding names
 Based on the six phonetic classifications of human speech sounds
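A minimal sketch of the commonly described American Soundex variant is shown below. The slides do not say which variant is meant, so treat the details (four-character codes, H and W not separating equal codes) as an assumption.

```python
# The six phonetic classes, mapped to the digits 1 to 6.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" if the name has no letters."""
    letters = [char for char in name.upper() if char.isalpha()]
    if not letters:
        return ""
    result = letters[0]                   # the first letter is kept as-is
    previous = CODES.get(letters[0], "")  # its code still suppresses duplicates
    for letter in letters[1:]:
        if letter in "HW":
            continue                      # H and W do not separate equal codes
        code = CODES.get(letter, "")      # vowels and Y have no code
        if code and code != previous:
            result += code
        previous = code                   # a vowel resets the duplicate check
    return (result + "000")[:4]           # pad with zeros, keep four characters
```

With this sketch, `soundex("Smith")` and `soundex("Smyth")` produce the same code, so a misspelled name can still be matched to its intended record during cleaning.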
Data Warehouse Design

OLTP systems are data capture systems
“DATA IN” systems
DWs are “DATA OUT” systems
[Figure: OLTP and DW]
Analyzing the DATA
Active Analysis – User Queries
User-guided data analysis
Show me how X varies with Y
OLAP
Automated Analysis – Data Mining
What’s in there?
Set the computer FREE on your data
Supervised Learning (classification)
Unsupervised Learning (clustering)
OLAP Queries
How much of product P1 was sold in 2009 state wise?
Top 5 selling products in 2010
Total Sales in Q1 of FY 2008-09?
Color wise sales figure of cars from 2008 to 2010
Model wise sales of cars for the month of Jan from
2006 to 2010
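Two of the queries above can be expressed in SQL against a small in-memory table. The `sales` table, its columns and its rows are hypothetical; the sketch uses Python's standard `sqlite3` module only to make the queries runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, state TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("P1", "Kerala", 2009, 120.0),
        ("P1", "Kerala", 2009, 30.0),
        ("P1", "Punjab", 2009, 80.0),
        ("P2", "Punjab", 2009, 55.0),
        ("P1", "Kerala", 2010, 40.0),
        ("P2", "Punjab", 2010, 90.0),
    ],
)

# How much of product P1 was sold in 2009, state wise?
p1_by_state_2009 = conn.execute(
    "SELECT state, SUM(amount) FROM sales "
    "WHERE product = 'P1' AND year = 2009 "
    "GROUP BY state ORDER BY state"
).fetchall()

# Top 5 selling products in 2010
top_products_2010 = conn.execute(
    "SELECT product, SUM(amount) AS total FROM sales "
    "WHERE year = 2010 "
    "GROUP BY product ORDER BY total DESC LIMIT 5"
).fetchall()
```

Both queries follow the same pattern: filter on some dimensions, group by others, and aggregate a measure.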
Data Mining Investigations
Which type of customers are more likely to spend
most with us in the coming year?
What additional products are most likely to be sold
to customers who buy sportswear?
In which area should we open a new store in the
next year?
What are the characteristics of customers most
likely to default on their loans before the year is
out?
Continuum of Analysis

[Figure: a continuum from OLTP (primitive & canned analysis) through OLAP (complex ad-hoc analysis) to Data Mining (automated analysis), with tooling ranging from SQL to specialized algorithms.]
Data Warehouse Architectures: Basic
Data Warehouse Architectures: with a Staging Area
Data Warehouse Architectures: with a Staging Area and Data Marts
A General Architecture for
Data Warehousing
 The major components of data warehouse architecture are:
 Source systems are where the data comes from.
 Extraction, transformation, and load (ETL) move data between different data stores.
 The central repository is the main store for the data warehouse.
 The metadata repository describes what is available and where.
 Data marts provide fast, specialised access for end users and applications.
 Operational feedback integrates decision support back into the operational systems.
 End-users are the reason for developing the warehouse in the first place.
Data Marts
What is a data mart?
Advantages and disadvantages of data marts
Issues with the development and management of data
marts
Data Marts
A subset of a data warehouse that supports the requirements of a particular department or business process
A data mart is a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
Characteristics include:
Does not always contain detailed data, unlike data warehouses
More easily understood and navigated
Can be dependent or independent
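A dependent data mart can be pictured as a summarized, department-specific subset derived from the warehouse. The sketch below assumes a hypothetical warehouse of spending rows and a helper named `build_data_mart`; neither comes from the slides.

```python
# Hypothetical detailed warehouse rows covering several departments.
warehouse_rows = [
    {"dept": "marketing", "campaign": "spring", "spend": 100},
    {"dept": "marketing", "campaign": "spring", "spend": 40},
    {"dept": "marketing", "campaign": "autumn", "spend": 75},
    {"dept": "finance", "campaign": "audit", "spend": 300},
]


def build_data_mart(rows, dept, group_by, measure):
    """Keep only one department's rows and store them summarized, not detailed."""
    mart = {}
    for row in rows:
        if row["dept"] == dept:
            mart[row[group_by]] = mart.get(row[group_by], 0) + row[measure]
    return mart


marketing_mart = build_data_mart(warehouse_rows, "marketing", "campaign", "spend")
```

Because the mart is built from the warehouse rather than directly from source systems, it corresponds to the dependent case; an independent mart would run its own extraction instead.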
Data Marts
Data Mart: A scaled-down version of the data warehouse
 A data mart is a small warehouse designed for the department level.
 It is often a way to gain entry and provide an opportunity to learn
 Major problem: if they differ from department to department, they can be difficult to integrate enterprise-wide
Reasons for Creating Data Marts
Proof of concept for the DW
Can be developed quickly and is less resource-intensive than a DW
To give users access to the data they need to analyze most often
To improve query response time due to the reduction in the volume of data to be accessed
Kimball vs Inmon
Bill Inmon's paradigm: The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm: The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
Kimball vs Inmon
Bill Inmon: Endorses a Top-Down design
Independent data marts cannot comprise an effective EDW.
Organizations must focus on building the EDW
Ralph Kimball: Endorses a Bottom-Up design
The EDW effectively grows up around many of the several independent data marts, such as for sales, inventory, or marketing
Kimball vs Inmon: War of Words
"...The data warehouse is nothing more than the union of
all the data marts...,"
Ralph Kimball, December 29, 1997.

"You can catch all the minnows in the ocean and stack
them together and they still do not make a whale,"
Bill Inmon, January 8, 1998.
Kimball vs. Inmon
There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouses in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse.
Data Warehousing Process
 Enterprise-wide warehouse, top down, the Inmon methodology
 Data mart, bottom up, the Kimball methodology
 When properly executed, both result in an enterprise-wide data warehouse
Data warehouse versus data mart.
Building a Data Mart
Questions to be asked:
Top-down or bottom-up approach?
Enterprise-wide or departmental?
 Which first—data warehouse or data mart?
 Build pilot or go with a full-fledged implementation?
 Dependent or independent data marts?
Top-Down Versus Bottom-Up Approach
Data Warehouse or Data Mart First?
Top-Down vs. Bottom-Up Approach
Advantages of Top-Down
A truly corporate effort, an enterprise view of data
Inherently architected, not a union of disparate DMs
Single, central storage of data about the content
Central rules and control
May be developed fast using iterative approach
Data Warehouse or Data Mart First?
Disadvantages of Top-Down
Takes longer to build even with iterative method
High exposure to the risk of failure
Needs high level of cross functional skills
High outlay without proof of concept
Difficult to sell this approach to senior management and sponsors
Data Warehouse or Data Mart First?
Advantages of Bottom-Up Approach
Faster and easier implementation of manageable pieces
Favorable ROI and proof of concept
Less risk of failure
Inherently incremental; can schedule important DMs first
Allows project team to learn and grow
Data Warehouse or Data Mart First?
Disadvantages of Bottom-Up Approach
Each DM has its own narrow view of data
Redundant data permeates every DM
Difficult to integrate if the overall requirements are not considered in the beginning
Kimball’s approach is often considered a Bottom-Up approach, but he disagrees
Dependent Data Marts
Independent Data Marts
The Bottom-Up Misnomer
Kimball encourages you to broaden your perspective both “vertically” and “horizontally” when gathering business requirements for developing data marts
The Bottom-Up Misnomer
Vertical
Don’t just rely on the business data analyst to determine
requirements
Inputs from senior managers about their vision, objectives, and
challenges are critical
Ignoring this vertical span might cause failure in understanding
the organization’s direction and likely future trends
The Bottom-Up Misnomer
Horizontal
Look horizontally across the departments before designing the
DW
Critical in establishing the enterprise view
Challenging to do if one particular department is funding the project
Ignoring horizontal span will create isolated, department-centric
databases that are inconsistent and can’t be integrated
Complete coverage in a large organization is difficult
One rep. from each dept. interacting with the core development
team can be of immense help
Data Warehouse or Data Mart First?
New Practical approach by Kimball
1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete warehouse
3. Conform and standardize the data content
4. Implement the Data Warehouse as a series of Supermarts, one
at a time
A Word about SUPERMARTS
 Totally monolithic approach vs. totally stovepipe approach
 A step-by-step approach for building an EDW from granular data
 A Supermart is a data mart that has been carefully built with a disciplined architectural framework
 A Supermart is naturally a complete subset of the DW
 A Supermart is based on the most granular data that can possibly be collected and stored
 Conformed dimensions and standardized fact definitions
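The role of conformed dimensions can be shown with a small sketch in which two Supermarts share one product dimension. The dimension contents, fact tables and function names below are assumptions for illustration.

```python
from collections import defaultdict

# One conformed product dimension shared by both Supermarts (hypothetical data).
product_dim = {
    1: {"name": "Running shoe", "category": "Sportswear"},
    2: {"name": "Desk lamp", "category": "Home"},
}

# Two fact tables that use the same product keys.
sales_facts = [
    {"product_key": 1, "units_sold": 10},
    {"product_key": 2, "units_sold": 4},
    {"product_key": 1, "units_sold": 6},
]
inventory_facts = [
    {"product_key": 1, "units_on_hand": 50},
    {"product_key": 2, "units_on_hand": 20},
]


def total_by_category(facts, measure):
    """Roll a measure up to the category attribute of the shared dimension."""
    totals = defaultdict(int)
    for fact in facts:
        category = product_dim[fact["product_key"]]["category"]
        totals[category] += fact[measure]
    return dict(totals)


def drill_across():
    """Combine both marts on the shared category; possible only because they conform."""
    sold = total_by_category(sales_facts, "units_sold")
    on_hand = total_by_category(inventory_facts, "units_on_hand")
    return {
        category: {
            "units_sold": sold.get(category, 0),
            "units_on_hand": on_hand.get(category, 0),
        }
        for category in sorted(set(sold) | set(on_hand))
    }
```

If each mart kept its own product list with its own keys and category names, the final combination step would need a mapping between them, which is the integration problem the stovepipe approach runs into.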
Pilot Projects: Risk vs. Reward
Start with a pilot implementation as the first rollout for the DW
Pilot projects have the advantage of being small and manageable
Provide the organization with a “proof of concept”
Pilot Projects: Risk vs. Reward
Functional scope of a pilot project should be determined based on:
1. The degree of risk the enterprise is willing to take
2. The potential for leveraging the pilot project
 Avoid constructing a throwaway prototype
 Pilot warehouse must have actual value to the enterprise
Pilot Projects: Risk vs. Reward

[Figure: 2x2 matrix with RISK on the vertical axis and REWARD on the horizontal axis]
High Risk, Low Reward | High Risk, High Reward
Low Risk, Low Reward | Low Risk, High Reward
A Practical Approach
Most people employ a Hybrid approach with elements of Top-Down and Bottom-Up
 Again, practitioners don’t always concentrate on these issues or use this terminology; they just focus on best practice
That would include:
 Build incrementally according to a business function
 Employ an enterprise perspective
 Dimensionally model data
 Utilise conformed dimensional models
 Employ a Staging Area or Data Warehouse
 Store atomic data
