Lec-3 30-08-2011
[Figure: data warehouse architecture diagram. Labels: Operational dbs, External Sources; Extract, Transform, Load, Refresh; Metadata Repository; Data Marts; Serve; Analysis, Query/Reporting, Data Mining]
Disadvantages of On-Demand
Approach
Poor response time due to delay in query processing
Slow or unavailable data sources
Time consuming and complex filtering and integration
Inefficient and potentially expensive for frequent queries
Wrappers compete on resources with local applications at data
sources
There are only a few notable systems based on this approach,
e.g.
TAMBIS: Transparent Access to Multiple Bio-informatics
Information Systems
SRS: Sequence Retrieval System
OPM (Object Protocol Model) based multi-database tools and query
language (OPM-QL)
Two Approaches:
(2) Data Warehousing
• In-advance (eager) data integration
• Integrated data is persistently stored in a database (the data
warehouse) for direct querying and analysis
Advantages of Data Warehousing
Approach
High performance query processing
Though the information returned may not be the most up-to-date
Does not interfere with local data processing at sources
Analytical Querying/Statistical Analysis or On-Line Analytical
Processing (OLAP) at warehouse
On-Line Transaction Processing (OLTP) at data sources
Data Persistently Stored at Warehouse
Data at the warehouse can be further re-structured, aggregated,
summarized and modified if necessary.
A DW may store historical/archive data.
The data warehousing approach has been widely used, e.g.
The Maryland ADMS Project
Supporting Data Integration and Warehousing Using H2O
The Stanford Data Warehousing Project
GIMS: Genome Information Management System
Marks & Spencer Data Warehouse
Trade-off between Query-Driven and
Data Warehousing Approaches
Query-driven approach is still better for:
Rapidly changing information/data sources;
Accessing very large amounts of data from many sources;
Clients with unpredictable and dynamic requirements
Data Warehousing is more suitable when:
Data sources on which a data warehouse is based are not
frequently changing;
Data up-to-dateness is not crucially important;
Querying and Analysis is complex;
Data needs to be highly summarized and aggregated;
Fast access to integrated and derived data is vital; and
Keeping data warehouse consistent with the underlying data
sources is efficient and does not compromise on expected
performance.
Data Warehouse Architectures
An overview of the Oracle data warehousing
implementation.
Data warehouses and their architectures vary
depending upon the specifics of an organisation's
situation.
Three common architectures are:
Data Warehouse Architecture - Basic
Data Warehouse Architecture - with a Staging Area
Data Warehouse Architecture - with a Staging Area and
Data Marts
Data Warehouse COMPONENTS
Source Data Component
Production Data.
Internal Data.
Archived Data.
External Data.
Data Staging Component
Data Extraction
Data Transformation.
Data Loading.
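The three staging steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the CSV text, the column names, the `sales_fact` table and the validation rule are all hypothetical, and an in-memory SQLite database stands in for the warehouse storage.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational source (made-up values).
RAW_CSV = """order_id,customer,amount,order_date
1, alice ,120.50,2011-08-01
2,BOB,80,2011-08-02
3,alice,not_a_number,2011-08-03
"""

def extract(raw_text):
    """Data Extraction: read source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Data Transformation: clean names, convert types, reject invalid rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumed rule: drop rows whose amount cannot be parsed
        customer = row["customer"].strip().title()
        clean.append((int(row["order_id"]), customer, amount, row["order_date"]))
    return clean

def load(conn, rows):
    """Data Loading: write the cleaned rows into the (hypothetical) fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM sales_fact ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the staging component: each can be rerun or replaced without touching the source systems.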
Data Storage Component
Many data warehouses also employ multidimensional database
management systems. Data extracted from the data warehouse
storage is aggregated in many ways, and the summary data is
kept in multidimensional databases (MDDBs). Such
multidimensional database systems are usually proprietary
products.
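As a rough illustration of "aggregated in many ways", the sketch below pre-computes a total for every combination of dimensions from a handful of detailed rows. The rows, the dimension names and the `build_summaries` helper are hypothetical; real MDDB products use their own storage formats and are not modelled here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical detailed rows: (product, state, year, units)
DETAIL = [
    ("P1", "StateA", 2009, 10),
    ("P1", "StateB", 2009, 5),
    ("P2", "StateA", 2009, 7),
    ("P1", "StateA", 2010, 3),
]
DIMENSIONS = ("product", "state", "year")

def build_summaries(rows):
    """Return {dimension subset: {key tuple: total units}} for every subset."""
    summaries = {}
    for size in range(len(DIMENSIONS) + 1):
        for idx in combinations(range(len(DIMENSIONS)), size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)  # values of the chosen dimensions
                totals[key] += row[-1]            # last field is the measure
            summaries[tuple(DIMENSIONS[i] for i in idx)] = dict(totals)
    return summaries

summaries = build_summaries(DETAIL)
print(summaries[("product",)])       # totals per product
print(summaries[("state", "year")])  # totals per state and year
print(summaries[()])                 # grand total
```

With three dimensions there are eight summary tables, which is why summary storage grows quickly as dimensions are added.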
Information Delivery Component
Metadata Component
Metadata in a data warehouse is similar to a data
dictionary, but it is much more than a data dictionary.
Types of Metadata
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata
More Details in Chapter 9.
Why Meta Data: Special Significance
First, it acts as the glue that connects all parts of the
data warehouse.
Next, it provides information about the contents and
structures to the developers.
Finally, it opens the door to the end-users and makes
the contents recognizable in their own terms.
[Figure: warehouse architecture and information flows. Components: Operational data source 1 ... Operational data source n; Load Manager; Warehouse Manager; Query Manager; DBMS holding Meta-data, Highly summarized data, Lightly summarized data and Detailed data; Archive/backup data. Flows: Inflow, Upflow, Downflow, Outflow, Meta-flow. End-user access tools: reporting, query, application development, and EIS (executive information system) tools; OLAP (online analytical processing) tools]
Misspelled terms
For example NAMES
Phonetic algorithms – can find similar
sounding names
Based on the six phonetic classifications
of human speech sounds
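The slide does not name a specific algorithm. One well-known phonetic algorithm that groups consonants into six classes is Soundex, so the sketch below assumes that is the kind of technique meant. It follows the commonly described American Soundex rules; the function name `soundex` and the sample names are illustrative only.

```python
# Six consonant classes used by Soundex, numbered 1 to 6.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    """Return a 4-character Soundex code (first letter plus three digits)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of same-coded letters
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]  # pad with zeros, truncate to four characters

for a, b in [("Robert", "Rupert"), ("Smith", "Smyth")]:
    print(a, soundex(a), b, soundex(b))
```

Names that receive the same code are candidates for being the same entity, which the cleansing step can then confirm with other attributes.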
Data Warehouse Design
OLTP DW
Analyzing the DATA
Active Analysis – User Queries
User-guided data analysis
Show me how X varies with Y
OLAP
Automated Analysis – Data Mining
What’s in there?
Set the computer FREE on your data
Supervised Learning (classification)
Unsupervised Learning (clustering)
OLAP Queries
How much of product P1 was sold in 2009 state wise?
Top 5 selling products in 2010
Total Sales in Q1 of FY 2008-09?
Color wise sales figure of cars from 2008 to 2010
Model wise sales of cars for the month of Jan from
2006 to 2010
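Two of the questions above can be written as ordinary aggregate SQL. The sketch below runs them against an in-memory SQLite table; the `sales` table, its columns and the sample rows are hypothetical and only stand in for a real warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, state TEXT, year INTEGER, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("P1", "StateA", 2009, 10),
        ("P1", "StateB", 2009, 5),
        ("P1", "StateA", 2009, 2),
        ("P2", "StateA", 2010, 7),
        ("P1", "StateA", 2010, 3),
        ("P3", "StateB", 2010, 9),
    ],
)

# How much of product P1 was sold in 2009, state wise?
p1_by_state = conn.execute(
    "SELECT state, SUM(units) FROM sales "
    "WHERE product = 'P1' AND year = 2009 "
    "GROUP BY state ORDER BY state"
).fetchall()

# Top 5 selling products in 2010
top_2010 = conn.execute(
    "SELECT product, SUM(units) AS total FROM sales "
    "WHERE year = 2010 "
    "GROUP BY product ORDER BY total DESC LIMIT 5"
).fetchall()

print(p1_by_state)
print(top_2010)
```

Both queries follow the same pattern: filter on some dimensions, group by others, and aggregate a measure.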
Data Mining Investigations
Which types of customers are most likely to spend the
most with us in the coming year?
What additional products are most likely to be sold
to customers who buy sportswear?
In which area should we open a new store in the
next year?
What are the characteristics of customers most
likely to default on their loans before the year is
out?
Continuum of Analysis
Specialized
Algorithms
SQL
"You can catch all the minnows in the ocean and stack
them together and they still do not make a whale,"
Bill Inmon, January 8, 1998.
Kimball vs. Inmon
There is no right or wrong between these two
ideas, as they represent different data warehousing
philosophies. In reality, the data warehouses in most
enterprises are closer to Ralph Kimball's idea. This
is because most data warehouses started out as a
departmental effort, and hence they originated as a
data mart. Only when more data marts are built
later do they evolve into a data warehouse.
Data Warehousing Process
Enterprise-wide warehouse, top down, the Inmon
methodology
Data mart, bottom up, the Kimball methodology
When properly executed, both result in an enterprise-
wide data warehouse
Data warehouse versus data mart.
Building a Data Mart
Questions to be asked:
Top-down or bottom-up approach?
Enterprise-wide or departmental?
Which first—data warehouse or data mart?
Build pilot or go with a full-fledged implementation?
Dependent or independent data marts?
Top-Down Versus Bottom-Up Approach
Data Warehouse or Data Mart First?
Top-Down vs. Bottom-Up Approach
Advantages of Top-Down
A truly corporate effort, an enterprise view of data
Inherently architected, not a union of disparate DMs
Single, central storage of data about the content
Central rules and control
May be developed fast using iterative approach
Disadvantages of Top-Down
Takes longer to build even with iterative method
High exposure to the risk of failure
Needs high level of cross functional skills
High outlay without proof of concept
Difficult to sell this approach to senior management and sponsors
Advantages of Bottom-Up Approach
Faster and easier implementation of manageable pieces
Favorable ROI and proof of concept
Less risk of failure
Inherently incremental; can schedule important DMs first
Allows project team to learn and grow
Disadvantages of Bottom-Up Approach
Each DM has its own narrow view of data
Redundant data permeates every DM
Difficult to integrate if the overall requirements are not considered
in the beginning
Kimball’s approach is generally considered a Bottom-Up approach,
but he disagrees
Dependent Data Marts
Independent Data Marts
The Bottom-Up Misnomer
A Practical Approach
Most people employ a Hybrid approach with
elements of Top-Down and Bottom-Up
Again, practitioners don’t always concentrate on
these issues or use this terminology; they just
focus on best practice
That would include:
Build incrementally according to a business
function
Employ an enterprise perspective
Dimensionally model data
Utilise conformed dimensional models
Employ a Staging Area or Data Warehouse
Store atomic data
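One way to picture "conformed dimensional models" is two data marts whose fact tables share a single product dimension, so their measures can be compared side by side. The sketch below assumes hypothetical table names (`dim_product`, `fact_sales`, `fact_inventory`) and made-up rows; it is an illustration of the idea, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared (conformed) dimension used by both marts
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
-- Atomic facts for a sales mart and an inventory mart
CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
CREATE TABLE fact_inventory (product_key INTEGER, units_on_hand INTEGER);
INSERT INTO dim_product VALUES (1, 'Shirt');
INSERT INTO dim_product VALUES (2, 'Shoes');
INSERT INTO fact_sales VALUES (1, 4);
INSERT INTO fact_sales VALUES (1, 6);
INSERT INTO fact_sales VALUES (2, 3);
INSERT INTO fact_inventory VALUES (1, 20);
INSERT INTO fact_inventory VALUES (2, 8);
""")

# Aggregate each fact table separately, then join the results
# through the shared dimension key.
drill_across = conn.execute("""
SELECT d.name, s.sold, i.on_hand
FROM dim_product AS d
JOIN (SELECT product_key, SUM(units_sold) AS sold
      FROM fact_sales GROUP BY product_key) AS s
  ON s.product_key = d.product_key
JOIN (SELECT product_key, SUM(units_on_hand) AS on_hand
      FROM fact_inventory GROUP BY product_key) AS i
  ON i.product_key = d.product_key
ORDER BY d.name
""").fetchall()
print(drill_across)
```

Because both marts use the same `product_key` values, marts built incrementally can still be combined into an enterprise view later.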