
Overview of Data Warehousing

Concepts

Chapter 1

Lecture Learning objectives
• To understand the definition and applications
of a data warehouse
- Introduction to data warehousing concepts
- Operational and informational systems
- OLAP & OLTP Systems
- Applications of data warehouse
- Data warehouse architecture

Net Resources

• Online resources
- The data warehousing institute
- www.tdwi.org
- Data warehousing on www
• www.datawarehousing.org
• www.datawarehousing.com
- Online Magazines and Periodicals
• www.intelligententerprise.com
• www.dmgreview.com
• www.cio.com
• http://www.daniel-lemire.com/OLAP/index.html

Main Topics
• Evolution of Data Processing
• Motivation for data warehousing
• Architecture
• Data modeling
• Dimension modeling
• Query performance enhancing techniques
• DW Project Management
• Case Studies
• Research Issues

Background

• 1980’s to early 1990’s
- Focus on computerizing business processes
- To gain competitive advantage
• By early 1990’s
- All companies had operational systems.
- Computerization alone no longer offered any advantage.
• Competitive advantage?
- How do we sustain the competitive edge?

What is Data Warehousing?

• A data warehouse is constructed by integrating data from
multiple heterogeneous sources. It supports analytical
reporting, structured and/or ad hoc queries, and decision making.

• Data warehousing involves data cleaning, data integration,
and data consolidation.

Business Intelligence
The goal of decision-support systems is twofold:
• transformation of data to information
• derivation of knowledge from information

Query-Driven Approach
• This is the traditional approach to integrating heterogeneous
databases. It builds wrappers and integrators, also known as
mediators, on top of multiple heterogeneous databases.

• When a query is issued on the client side, a metadata dictionary
translates it into an appropriate form for each heterogeneous
site involved.

• The translated queries are then mapped and sent to the local
query processor at each site.

• The results from the heterogeneous sites are integrated into a
global answer set.
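
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory sources, their field names, and the `METADATA` dictionary are all hypothetical, and a real mediator would talk to remote database systems rather than to lists.

```python
# Hypothetical sources: each stores sales rows under its own field names.
SOURCE_A = [{"prod": "P1", "amt": 10.0}, {"prod": "P2", "amt": 5.0}]
SOURCE_B = [{"product_code": "P1", "total": 7.5}]
SOURCES = {"A": SOURCE_A, "B": SOURCE_B}

# Metadata dictionary: maps global field names to each site's local names.
METADATA = {
    "A": {"product": "prod", "amount": "amt"},
    "B": {"product": "product_code", "amount": "total"},
}


def local_query(site, product):
    """Translate the global query for one site and run it there."""
    names = METADATA[site]
    return [
        {"product": row[names["product"]], "amount": row[names["amount"]]}
        for row in SOURCES[site]
        if row[names["product"]] == product
    ]


def mediator_query(product):
    """Send the query to every site and integrate the partial results."""
    answer = []
    for site in SOURCES:
        answer.extend(local_query(site, product))
    return answer


global_answer = mediator_query("P1")
```

Note that every call to `mediator_query` repeats the translation and the integration, which is the cost the next slide lists as a disadvantage.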

Query-Driven Approach (Disadvantages)
• The query-driven approach needs complex integration and filtering processes.

• It is very inefficient, since translation and integration are repeated for every query.

• It is very expensive for frequent queries.

• It is also very expensive for queries that require aggregations.

Update-Driven Approach
• This is an alternative to the traditional approach. Today's data warehouse
systems follow the update-driven approach rather than the query-driven
approach discussed earlier.

• In the update-driven approach, information from multiple heterogeneous
sources is integrated in advance and stored in a warehouse. This
information is available for direct querying and analysis.

Advantages

• This approach has the following advantages:
- It provides high performance.
- The data is copied, processed, integrated, annotated, summarized,
and restructured in a semantic data store in advance.
- Query processing does not require an interface to process data at
local sources.
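
As a rough illustration of "integrated in advance", the sketch below summarizes two source extracts into a warehouse table once, and then answers a query from that table alone. The table name `sales_summary`, the sample rows, and the `refresh` helper are assumptions made for this example, with an in-memory SQLite database standing in for the warehouse.

```python
import sqlite3

# Hypothetical extracts from two operational sources: (product, amount).
source_a = [("P1", 10.0), ("P2", 5.0)]
source_b = [("P1", 7.5)]

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE sales_summary (product TEXT PRIMARY KEY, total REAL)")


def refresh(warehouse, *extracts):
    """Integrate and summarize the source rows ahead of any query."""
    totals = {}
    for extract in extracts:
        for product, amount in extract:
            totals[product] = totals.get(product, 0.0) + amount
    with warehouse:  # one transaction for the whole refresh
        warehouse.execute("DELETE FROM sales_summary")
        warehouse.executemany(
            "INSERT INTO sales_summary VALUES (?, ?)", sorted(totals.items())
        )


refresh(wh, source_a, source_b)

# Query time: only the warehouse is touched, never the sources.
p1_total = wh.execute(
    "SELECT total FROM sales_summary WHERE product = ?", ("P1",)
).fetchone()[0]
```

The trade-off is that `p1_total` reflects the sources as of the last `refresh`, not necessarily their current state.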
OLTP Vs OLAP Systems

OLTP: Online Transaction Processing
- Describes processing at operational sites

OLAP: Online Analytical Processing
- Describes processing at warehouse

OLTP Vs OLAP

Standard DB (OLTP)
• Mostly updates
• Many small transactions
• MB to GB of data
• Index/hash on primary key
• Raw data
• Thousands of users

Warehouse (OLAP)
• Mostly reads
• Queries are long and complex
• GB to TB of data
• History
• Lots of scans
• Summarized, reconciled data
• Hundreds of users

OLTP vs OLAP
• OLTP: Online transaction processing
- Many short transactions (queries + updates)
- Examples:
• Update account balance
• Enroll in course
• Add a book to shopping cart
- Queries touch small amounts of data (one record or a few records)
- Updates are frequent
• OLAP: Online analytical processing
- Long transactions, complex queries
- Queries touch large amounts of data
- Updates are infrequent
- Individual queries can require lots of resources.
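
The contrast can be made concrete with two statements against the same table. This is a minimal sketch using an in-memory SQLite database; the `account` table and its rows are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, branch TEXT, balance REAL)"
)
db.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [(1, "north", 100.0), (2, "north", 250.0), (3, "south", 75.0)],
)

# OLTP-style work: a short transaction that touches one record,
# located through the primary key.
with db:
    db.execute("UPDATE account SET balance = balance - 20 WHERE id = ?", (1,))

# OLAP-style work: a read-only query that scans every row and aggregates.
branch_totals = db.execute(
    "SELECT branch, SUM(balance) FROM account GROUP BY branch ORDER BY branch"
).fetchall()
```

With three rows the difference is invisible, but the update stays cheap as the table grows while the aggregate has to read all of it.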

Why OLAP & OLTP don’t mix

• Different performance requirements
• Transaction processing (OLTP)
- Fast response time important (<1 second)
- Data must be up-to-date and consistent at all times.
• Data Analysis (OLAP)
- Queries can consume lots of resources
- Can saturate CPUs and disk bandwidth
- Operating on static “snapshot” of data usually OK

Why OLAP & OLTP don’t mix
• Different data modeling requirements
- Transaction processing (OLTP):
• Normalized schema for consistency
• Complex data models, many tables
• Limited number of standardised queries and updates
- Data Analysis (OLAP)
• Simplicity of data model is important
• De-normalized schemas are common
• Fewer joins: improved query performance
• Fewer tables: schema is easier to understand
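
To see what "fewer joins" means in practice, the sketch below answers the same question twice: once against a small normalized schema, and once against a de-normalized copy of it. All table and column names (`customer`, `product`, `orders`, `sales_flat`) are assumptions for this example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Normalized schema, as an OLTP system might keep it.
CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE product  (id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders   (customer_id INTEGER, product_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Oslo'), (2, 'Rome');
INSERT INTO product  VALUES (10, 'books'), (11, 'toys');
INSERT INTO orders   VALUES (1, 10, 20.0), (2, 10, 15.0), (2, 11, 5.0);

-- De-normalized copy for analysis: the joins are paid once, at load time.
CREATE TABLE sales_flat AS
SELECT c.city, p.category, o.amount
FROM orders o
JOIN customer c ON c.id = o.customer_id
JOIN product  p ON p.id = o.product_id;
""")

# The analysis query on the normalized schema needs a join...
normalized = db.execute("""
    SELECT p.category, SUM(o.amount)
    FROM orders o JOIN product p ON p.id = o.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()

# ...while the flat table answers the same question with none.
denormalized = db.execute("""
    SELECT category, SUM(amount)
    FROM sales_flat GROUP BY category ORDER BY category
""").fetchall()
```

The price of the flat table is redundancy: each city and category value is repeated on every row, which is acceptable when the data is loaded in bulk and rarely updated.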

Why OLAP & OLTP don’t mix
• Analysis requires data from many sources
- An OLTP system targets one specific process
• For example: ordering from an online store
- OLAP integrates data from different processes
• Combine sales, inventory and purchasing data
• Analyze experiments conducted by different labs
- OLAP often makes use of historical data
• Identify long-term patterns
• Notice changes in behaviour over time.
- Terminology, schemas vary across data sources
• Integrating data from disparate sources is a major challenge

Data Warehouse

• Doing OLTP and OLAP in the same database system is
often impractical
- Different performance requirements
- Different data modeling requirements
- Analysis queries require data from many sources

• Solution: Build a “data warehouse”
- Copy data from various OLTP systems
- Optimize data organization, system tuning for OLAP
- Transactions aren’t slowed by big analysis queries
- Periodically refresh the data in the warehouse.

The Warehousing Approach
• Information integrated in advance
• Stored in WH for direct querying and analysis

[Figure: clients query the data warehouse; an integration system with metadata feeds the warehouse from extractor/monitor components, one per source.]
Advantages of Warehousing Approach
• High query performance
- But not necessarily most current information
• Doesn’t interfere with local processing at sources
- Complex queries at warehouse
- OLTP at information sources
• Information copied at warehouse
- Can modify, annotate, summarize, restructure, etc.
- Can store historical information
- Security, no auditing

Need for data warehousing

• Companies, over the years, gathered huge volumes of data
• Can this data be used in any way?
• Can we analyze this data to get any competitive advantage?

Benefits of data warehousing

• Allow “efficient” analysis of data
• Competitive advantage
• Analysis aids strategic decision making
• Increased productivity of decision makers

Decision support systems, DW & OLAP

• Information technology to help the knowledge worker
(executive, manager, analyst) make faster and better
decisions.
• A data warehouse is a Decision Support System (DSS)
• A decision support system is an architectural construct of an
information system that provides users with current and
historical decision support information
• On-Line Analytical Processing (OLAP) is an element of DSS

Data warehouse: characteristics

• Analysis driven
• Ad-hoc queries
• Complex queries
• Used by top managers
• Based on dimensional modeling
• Denormalized structures

Data Warehousing applications

• Retail
- Customer loyalty
- Market planning
• Manufacturing
- Cost reduction
- Logistics management
• Financial
- Risk management
- Fraud detection
• Utilities
- Asset management
- Resource management
• Airlines
- Route profitability
- Yield management
• Government
- Manpower planning
- Cost control

Data Warehousing components

Chapter 2

Learning Objectives
• To understand the architecture and processes of a data
warehouse
- Data Warehouse Definitions
- Operational vs. Informational Systems
- Desired features of DW
- Characteristics of Data Warehouse
- Data Warehouse vs. Data Marts
- Top Down Approach vs. Bottom Up Approach

What is a data warehouse?
• A single, complete and consistent store of data obtained from a
variety of different sources, made available to end users in a way
they can understand and use in a business context.
(Barry Devlin)

• R. Kimball’s definition of a DW
- A data warehouse is a copy of transactional data specifically
structured for querying and analysis
- According to this definition:
• The form of the stored data (RDBMS, flat file) has nothing
to do with whether something is a data warehouse.

Data warehouse
• A decision support database that is maintained separately
from the organization’s operational databases.
• Another definition, by W.H. Inmon: A data warehouse is a
- Subject-oriented,
- Integrated,
- Time-variant,
- Non-volatile
collection of data that is used primarily in organizational
decision making.

Data Warehouse Components

[Figure: the data warehouse surrounded by its four characteristics: Subject Oriented, Integrated, Non Volatile, Time Variant.]

Subject-Oriented

Data is categorized and stored by business subject
rather than by application.

[Figure: OLTP applications (Equity Plans, Shares, Insurance, Savings, Loans) map to a single data warehouse subject: Customer financial information.]

Integrated

• Heterogeneous source systems
• Need to integrate source data
• For example: product codes could be different
in different systems
• Arrive at common code in DW
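
A minimal sketch of the product-code example: a cross-reference table maps each (source system, local code) pair to one warehouse code. The source names, the codes, and the `PRD-` format are all hypothetical.

```python
# Hypothetical cross-reference: (source system, local code) -> warehouse code.
CODE_MAP = {
    ("billing", "0042"): "PRD-42",
    ("inventory", "WIDGET-42"): "PRD-42",
    ("billing", "0043"): "PRD-43",
}


def to_common_code(source, local_code):
    """Return the warehouse-wide code, failing loudly on unmapped input."""
    try:
        return CODE_MAP[(source, local_code)]
    except KeyError:
        raise ValueError(
            f"unmapped product code {local_code!r} from {source!r}"
        ) from None


# Rows arriving from different systems: (source, local code, quantity).
rows = [
    ("billing", "0042", 3),
    ("inventory", "WIDGET-42", 2),
    ("billing", "0043", 1),
]

integrated = {}
for source, code, qty in rows:
    common = to_common_code(source, code)
    integrated[common] = integrated.get(common, 0) + qty
```

Raising on an unmapped code, rather than passing it through, is a design choice made here so that gaps in the mapping surface during loading instead of in reports.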

Time variant

• Most business analysis has a time component
• Trend analysis (historical data is required)

Nonvolatile

Typically data in the data warehouse is not updated or deleted.

[Figure: operational databases support Insert, Read, Update, and Delete; the warehouse supports only Load and Read.]

Operational Vs. Informational system

                  OPERATIONAL                    INFORMATIONAL
Data content      Current values                 Archived, derived, summarized
Data structure    Optimized for transactions     Optimized for complex queries
Access frequency  High                           Medium to low
Access type       Read, update, delete, insert   Read
Usage             Predictable, repetitive        Random, heuristic
Response time     Sub-seconds                    Several seconds to minutes
Users             Large number                   Relatively small number


Characteristics of Strategic Information

INTEGRATED: Must have a single, enterprise-wide view

DATA INTEGRITY: Information must be accurate and must conform to
business rules

ACCESSIBLE: Easily accessible with intuitive access paths, and
responsive for analysis

CREDIBLE: Every business factor must have one and only one
value

TIMELY: Information must be available within the stipulated
time frame

What are informational systems?
• They provide an integrated and total view of the enterprise.
• They make the enterprise’s current and historical information easily
available for strategic decision making.
• They make decision-support transactions possible without hindering
operational systems.
• They render the organization’s information consistent.
• They present a flexible and interactive source of strategic information.

Data Warehouse: Major Players

• SAS Institute
• IBM (Cognos)
• Oracle (Hyperion)
• Sybase
• Microsoft
• HP
• …

Desired features of DW
• Data warehouse designed for analytical tasks
• Easy to use and conducive to long interactive sessions by
users
• Read-intensive data usage
• Direct interaction with the system by the users without IT
assistance
• Content updated periodically and stable
• Content to include current and historical data
• Ability for users to run queries and get results online
• Ability for users to initiate reports

General Overview of a DW

DW Milestones
• 1983: Teradata introduces a DBMS for decision support systems.
• 1988: Barry Devlin and Paul Murphy publish “An architecture for a
business and information system” in the IBM Systems Journal.
• 1990: Red Brick Systems introduces a data warehousing system.
• 1991: Bill Inmon, the “father of data warehousing”, publishes
Building the Data Warehouse.
• 1991: Prism Solutions introduces Prism Warehouse software for
developing data warehouses.
• 1995: The Data Warehousing Institute is founded (DW & BI).
• 1996: Ralph Kimball publishes The Data Warehouse Toolkit.
• 1997: Oracle 8, with support for star schema queries, is released.

Build a data warehouse
Approaches
- Top-down or bottom-up approach
- Enterprise-wide or departmental?
- Which first: data warehouse or data mart?
- Dependent or independent data mart?

Top-Down Approach (Bill Inmon)

• DW is a centralized repository of the enterprise.
- Data is stored at the lowest level of granularity based on
a normalized data model.
- Need to normalize existing tables
- The centralized data warehouse feeds the dependent data
marts, which are designed based on a dimensional data model

Top-down Approach

Advantages
• A truly corporate effort, an enterprise view of data
• Inherently architected, not a union of disparate data marts
• Single, central storage of data about the content
• Centralized rules and control
• May see quick results if implemented with iterations

Disadvantages
• Takes longer to build even with an iterative method
• High exposure to risk of failure
• Difficult to sell to the stakeholders/sponsors/higher management
as it requires experienced professionals

Bottom-Up Approach (Ralph Kimball)

• Kimball (1996) envisioned the corporate data warehouse as
a collection of conformed data marts.
- Conforming of the dimensions among the separate data marts.
- Data marts are created first to provide analytical and reporting
capabilities for specific business subjects based on the dimensional
model.
- Data marts contain data at the lowest level of granularity and also as
summaries depending on the needs for analysis
- No need to normalize every table
- These data marts are joined together
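
One way to picture "conformed dimensions" is two data marts whose fact tables share a single product dimension, so that their results can be combined in one report. The sketch below is an assumption-laden miniature: the tables `dim_product`, `fact_sales`, and `fact_inventory` and their contents are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- One conformed dimension, shared by both marts.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
-- Sales mart and inventory mart, each with its own fact table.
CREATE TABLE fact_sales     (product_key INTEGER, units_sold INTEGER);
CREATE TABLE fact_inventory (product_key INTEGER, units_on_hand INTEGER);
INSERT INTO dim_product    VALUES (1, 'pen'), (2, 'ink');
INSERT INTO fact_sales     VALUES (1, 4), (1, 6), (2, 3);
INSERT INTO fact_inventory VALUES (1, 50), (2, 20);
""")

# Each mart is aggregated separately, then the results are joined
# through the shared dimension key.
report = db.execute("""
    SELECT d.name, s.sold, i.on_hand
    FROM dim_product d
    JOIN (SELECT product_key, SUM(units_sold) AS sold
          FROM fact_sales GROUP BY product_key) s
      ON s.product_key = d.product_key
    JOIN (SELECT product_key, SUM(units_on_hand) AS on_hand
          FROM fact_inventory GROUP BY product_key) i
      ON i.product_key = d.product_key
    ORDER BY d.name
""").fetchall()
```

If each mart had kept its own product table with its own keys, this join would first need a mapping step like the one shown for the Integrated characteristic.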

Bottom-up Approach

Advantages
• Faster and easier implementation of manageable pieces.
• Favorable return on investment and proof of concept.
• Less risk of failure
• Allows project team to learn and grow.

Disadvantages
• Each data mart has its own narrow view of data
• Permeates redundant data in every data mart

Practical Approach

• Some questions to consider
- Is your organization looking for long-term
results or fast data marts for only a few subjects
for now?
- Do you want to look into some other practical
approach?
DO NOT LOSE SIGHT OF THE OVERALL
BIG PICTURE

Kimball’s vs. Inmon’s

Data Warehouse: architectural components

Chapter 3

Course Learning Objectives

• To understand the architectural components and processes of a
data warehouse
- Architectural components
- List and functions of each architectural component
- Architectural types and their differences

Architectural components in the three major areas

• Data Acquisition
• Data Storage
• Information Delivery

Management and Control Component
• This component has two major functions:
- first to constantly monitor all the ongoing operations,
- and next to step in and recover from problems when things go wrong.
• The management architectural component manages and
controls data acquisition functions, ensuring that extracts and
transformations are carried out correctly and in a timely
fashion.
• The management component manages backing up significant
parts of the data warehouse and recovering from failures.
Management services include monitoring the growth and
periodically archiving data from the data warehouse.
• This management component governs data security and
provides authorized access to the data warehouse.

Data Acquisition
• Data acquisition involves the entire process of extracting
data from the data sources, moving all the extracted data to the
staging area, and preparing the data for loading into the data
warehouse repository.
• The two major architectural components identified for data
acquisition are:
- source data and
- data staging.

Data Acquisition
• Source Data
- The internal and external data sources form the source data architectural
component.
- Source data governs the extraction of data for preparation and storage in
the data warehouse.
• Data Staging
- The data staging architectural component governs the transformation,
cleaning, and integration of data.
- An intermediate storage area used for data pre-processing.

Data Acquisition: technical architecture

Data Acquisition: Functions and Services
Data Extraction
1. Select data sources and determine the types of filters to be applied
to individual sources.
2. Generate automatic extract files from operational systems using
replication and other techniques.
3. Create intermediary files to store selected data to be merged later.
4. Transport extracted files from multiple platforms.
5. Provide automated job control services for creating extract files.
6. Reformat input from outside sources.
7. Reformat input from departmental data files, databases, and
spreadsheets.
8. Generate common application codes for data extraction.
9. Resolve inconsistencies for common data elements from multiple
sources.

Data Acquisition: Functions and Services

Data Transformation
1. Map input data to data for data warehouse repository.
2. Clean data, de-duplicate, and merge/purge.
3. De-normalize extracted data structures as required by the
dimensional model of the data warehouse.
4. Convert data types.
5. Calculate and derive attribute values.
6. Check for referential integrity.
7. Aggregate data as needed.
8. Resolve missing values.
9. Consolidate and integrate data.
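
Several of these transformation functions (de-duplication, type conversion, derived attributes, missing values) can be shown in one small pass over extracted rows. The sketch below is illustrative only: the field names, the sample rows, and the rule of defaulting a missing quantity to 1 are assumptions, not rules taken from the lecture.

```python
from datetime import date

# Hypothetical extracted rows: everything arrives as text.
raw_rows = [
    {"id": "1", "qty": "2", "unit_price": "3.50", "sold_on": "2024-01-05"},
    {"id": "1", "qty": "2", "unit_price": "3.50", "sold_on": "2024-01-05"},
    {"id": "2", "qty": "", "unit_price": "10.00", "sold_on": "2024-01-06"},
]


def transform(rows, default_qty=1):
    """Clean extracted rows so they are ready for the staging area."""
    seen = set()
    out = []
    for row in rows:
        key = row["id"]
        if key in seen:  # de-duplicate on the source identifier
            continue
        seen.add(key)
        # Resolve a missing value (assumed rule: fall back to default_qty).
        qty = int(row["qty"]) if row["qty"] else default_qty
        price = float(row["unit_price"])  # convert data types
        out.append({
            "id": int(key),
            "qty": qty,
            "unit_price": price,
            "revenue": qty * price,  # calculate a derived attribute
            "sold_on": date.fromisoformat(row["sold_on"]),
        })
    return out


clean_rows = transform(raw_rows)
```

Referential-integrity checks and aggregation are left out here; they need the dimension tables and the target grain, which this fragment does not model.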

Data Acquisition: Functions and Services

Data Staging
1. Provide backup and recovery for staging area repositories.
2. Sort and merge files.
3. Create files as input to make changes to dimension tables.
4. If data staging storage is a relational database, create and populate
database.
5. Resolve and create primary and foreign keys for load tables.
6. Consolidate datasets and create flat files for loading through
DBMS utilities.

Data Storage
• This is the process of loading the data from the staging area
into the data warehouse repository.

• All functions for transforming and integrating the data are
completed in the data staging area.

• The prepared data in the data warehouse is like the finished
product that is ready to be stacked in an industrial warehouse.

Data Storage: Functions and Services
1. Load data for full refreshes of data warehouse tables.
2. Perform incremental loads at regular prescribed intervals.
3. Support loading into multiple tables at the detailed and
summarized levels.
4. Provide automated job control services for loading the data
warehouse.
5. Provide backup and recovery for the data warehouse
database.
6. Periodically archive data from the database according to
preset conditions.
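
The difference between a full refresh (item 1) and an incremental load (item 2) can be sketched as follows. The `fact_sales` table, the use of an ever-increasing `sale_id` as the high-water mark, and both helper functions are assumptions made for this example; real loaders often rely on timestamps or change-data-capture instead.

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")


def full_refresh(conn, staged):
    """Replace the whole table with the staged rows."""
    with conn:
        conn.execute("DELETE FROM fact_sales")
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", staged)


def incremental_load(conn, staged):
    """Append only rows newer than the highest sale_id already loaded."""
    last = conn.execute(
        "SELECT COALESCE(MAX(sale_id), 0) FROM fact_sales"
    ).fetchone()[0]
    new_rows = [row for row in staged if row[0] > last]
    with conn:
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", new_rows)
    return len(new_rows)


full_refresh(wh, [(1, 10.0), (2, 20.0)])
added = incremental_load(wh, [(2, 20.0), (3, 30.0), (4, 40.0)])
row_count = wh.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Here the second batch overlaps the first on `sale_id` 2, and only the two newer rows are appended.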

Data Storage Architecture

Information Delivery
• Spans a broad spectrum of methods for making information
available to users.
• Primary data warehouse feeds data to proprietary
multidimensional databases (MDDBs) where summarized
data is kept as multidimensional cubes of information.
• The users perform complex multidimensional analysis
using the information cubes in the MDDBs.
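
A multidimensional cube can be thought of as the same measure pre-aggregated for every combination of dimensions. The sketch below builds such a structure as a plain dictionary; it is a toy model under assumed dimension names and data, not how a proprietary MDDB stores its cubes.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("region", "product", "year")

# Hypothetical detailed facts from the primary warehouse.
facts = [
    {"region": "east", "product": "pen", "year": 2023, "sales": 10},
    {"region": "east", "product": "ink", "year": 2023, "sales": 5},
    {"region": "west", "product": "pen", "year": 2024, "sales": 7},
]


def build_cube(rows, dims, measure):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = defaultdict(int)
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            for row in rows:
                cell = tuple((d, row[d]) for d in group)
                cube[cell] += row[measure]
    return dict(cube)


cube = build_cube(facts, DIMENSIONS, "sales")

total = cube[()]                                       # grand total
east = cube[(("region", "east"),)]                     # one dimension fixed
pen_2024 = cube[(("product", "pen"), ("year", 2024))]  # two dimensions fixed
```

Each lookup is a single dictionary access, which mirrors why analysis against summarized cubes is fast; the cost is the storage and load time spent on aggregates that may never be queried.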

Information Delivery

Information Delivery: Functions and Services

• Allow users to browse data warehouse content.
• Simplify access by hiding internal complexities of data storage
from users.
• Govern queries and control runaway queries.
• Provide self-service report generation for users, consisting of a
variety of flexible options to create, schedule, and run reports.
• Store result sets of queries and reports for future use.
• Make provision for the users to perform complex analysis
through online analytical processing (OLAP).
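
"Control runaway queries" can be illustrated with SQLite's progress handler, which Python exposes as `Connection.set_progress_handler`. The sketch below cancels any query that exceeds a budget of virtual-machine steps. The `run_governed` helper and its budget parameters are hypothetical, and production warehouses use their own workload-management features rather than this mechanism.

```python
import sqlite3


def run_governed(conn, sql, max_ticks=50, tick_size=1000):
    """Run sql; return None if it fails or exceeds the step budget."""
    ticks = 0

    def on_tick():
        nonlocal ticks
        ticks += 1
        # A non-zero return value tells SQLite to abort the running query.
        return 1 if ticks > max_ticks else 0

    # Call on_tick once per tick_size SQLite virtual-machine instructions.
    conn.set_progress_handler(on_tick, tick_size)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        return None  # cancelled by the governor, or failed for another reason
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler again


conn = sqlite3.connect(":memory:")
small = run_governed(conn, "SELECT 1 + 1")
runaway = run_governed(
    conn,
    "WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c "
    "WHERE x < 100000000) SELECT COUNT(*) FROM c",
)
```

Counting steps instead of measuring wall-clock time keeps the cut-off deterministic, at the price of being only loosely related to how long a query actually takes.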

Architectural Types

• For each of these architectural types, we shall learn
how data is stored in the data warehouse and the
relationships between the data warehouse and the
data marts.
- Centralized data warehouse
- Independent data marts
- Federated
- Hub and spoke
- Data-mart bus

Architectural Types

Centralized data warehouse

• It takes into account the enterprise-level
information requirements.
• Queries and applications access the normalized
data in the central data warehouse.
• No data marts, whether dependent or independent.
Therefore all information delivery is from the
centralized data warehouse.

Centralized data warehouse
• Data flows from the source systems to the staging area, then
to the normalized central data warehouse, and thereafter to
end-users as business intelligence.

Independent data marts
• In this architecture type, the data warehouse is really a
collection of unconnected, disparate data marts, each serving
a specific department or purpose. These data marts in such
organizations usually evolve over time without any overall
planning.
• Each data mart delivers information to its own group of users
- These separate data marts do not provide a “single version of the
truth”. Conflicting table schemas are likely.
- The data marts are independent of one another; no normalization takes
place
• More likely to have inconsistent data definitions and standards,
redundant data, and redundant processing.
• Not scalable for integration

Independent data marts
• Data flows from the source systems to the staging area, then to the
various independent data marts, and thereafter to individual groups of
end-users as business intelligence. In many cases, data staging functions
and movement to each data mart may be carried out separately.

Federated
• This architecture type appears to be similar to the type with
independent data marts with the exception that common data
elements in the various data marts and even data warehouses
that compose the federation are integrated physically or
logically.
- Link together existing Data marts
- Logical/physical integration of common data elements
- Linkage information is stored and held by the meta-data
• All information delivery is from the data storage
• By using meta-data, there is no need to re-architect existing
data structures (such as marts, warehouses, or transactional
systems).

Federated
• Logical or physical integration of common data elements
takes place between multiple data marts.

Hub and Spoke
• In this architecture type, a centralized enterprise data
warehouse is present but there are also data marts that
depend on the enterprise data warehouse for data feed.
• Information delivery can, therefore, be both from the
centralized data warehouse and the dependent data marts.
• Relational modeling is carried out at the centralized data
warehouse, whereas dimensional modeling is carried
out at the dependent data marts.
• The centralized data warehouse is based on OLAP, used for
query and reporting.
• The dependent data marts are used by data mining applications
(more on this later).

Hub and Spoke
• Centralized DW is based on Relational Model whereas
Dependent Data Marts are based on Dimensional Model

Data-Mart Bus
• In this architecture type, no distinct, single data warehouse exists.
• Build the first data mart (super mart) using business dimensions
and metrics.
• These business dimensions will be shared by future data marts.
New data marts can use the existing table schema.
• Each data mart serves a specific business process, but the use of
conformed dimensions and facts enables the incremental integration
of additional data marts to form an organization-wide view.

Data-Mart Bus
• Additional marts are developed using the conformed
dimensions of the first mart. Staging is needed to ensure
that source data is converted into the conformed data mart
structure.

