Concepts
Chapter 1
1
Lecture Learning objectives
• To understand the definition and applications
of a data warehouse
- Introduction to data warehousing concepts
- Operational and informational systems
- OLAP & OLTP Systems
- Applications of data warehouse
- Data warehouse architecture
2
Net Resources
• Online resources
- The data warehousing institute
- www.tdwi.org
- Data warehousing on www
• www.datawarehousing.org
• www.datawarehousing.com
- Online Magazines and Periodicals
• www.intelligententerprise.com
• www.dmgreview.com
• www.cio.com
• http://www.daniel-lemire.com/OLAP/index.html
3
Main Topics
• Evolution of Data Processing
• Motivation for data warehousing
• Architecture
• Data modeling
• Dimension modeling
• Query performance enhancing techniques
• DW Project Management
• Case Studies
• Research Issues
4
Background
5
What is Data Warehousing?
6
Business Intelligence
The goal of decision-support systems is twofold:
7
Query-Driven Approach
• This is the traditional approach to integrating heterogeneous
databases. It builds wrappers and integrators, also known as
mediators, on top of multiple heterogeneous databases.
• A query is mapped to queries for the individual databases, and
each is sent to its local query processor.
8
Query-Driven Approach
(Disadvantages)
• The query-driven approach needs complex integration and filtering processes.
• It is also very expensive for queries that require aggregations.
9
Update-Driven Approach
• This is an alternative to the traditional approach. Today's data warehouse
systems follow the update-driven approach rather than the query-driven
approach discussed earlier.
Advantages
11
OLTP Vs OLAP
Standard DB (OLTP)
• Mostly updates
• Many small transactions
• MB to GB of data
• Index/hash on primary key
• Raw data
• Thousands of users
Warehouse (OLAP)
• Mostly reads
• Queries are long and complex
• GB to TB of data
• History
• Lots of scans
• Summarized, reconciled data
• Hundreds of users
12
OLTP vs OLAP
• OLTP: Online transaction processing
- Many short transactions (queries + updates)
- Examples:
• Update account balance
• Enroll in course
• Add a book to shopping cart
- Queries touch small amounts of data (one record or a few records)
- Updates are frequent
• OLAP: Online analytical processing
- Long transactions, complex queries
- Queries touch large amounts of data
- Updates are infrequent
- Individual queries can require lots of resources
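The contrast between the two workloads can be sketched with Python's built-in `sqlite3` module. The `order_line` table and its rows are hypothetical, chosen only to show a short keyed update next to a scan-and-aggregate query.

```python
import sqlite3

# Hypothetical schema for illustration: one table of shopping-cart order lines.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_line (id INTEGER PRIMARY KEY, customer TEXT, "
    "product TEXT, qty INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO order_line (customer, product, qty, price) VALUES (?, ?, ?, ?)",
    [
        ("ann", "book", 1, 20.0),
        ("bob", "book", 2, 20.0),
        ("ann", "pen", 5, 2.0),
        ("cat", "lamp", 1, 35.0),
    ],
)

# OLTP-style work: a short transaction that touches one record by primary key.
with conn:
    conn.execute("UPDATE order_line SET qty = qty + 1 WHERE id = ?", (1,))

# OLAP-style work: a read-only query that scans every row and aggregates.
revenue_by_product = dict(
    conn.execute(
        "SELECT product, SUM(qty * price) FROM order_line "
        "GROUP BY product ORDER BY product"
    ).fetchall()
)
print(revenue_by_product)
```

On four rows both statements are instant; the difference only matters at scale, where the update still touches one record while the aggregate must read the whole table.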
13
Why OLAP & OLTP don’t mix
14
Why OLAP & OLTP don’t mix
• Different data modeling requirements
- Transaction processing (OLTP):
• Normalized schema for consistency
• Complex data models, many tables
• Limited number of standardised queries and updates
- Data Analysis (OLAP)
• Simplicity of data model is important
• De-normalized schemas are common
• Fewer joins: improved query performance
• Fewer tables: schema is easier to understand
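A minimal sketch of the two modeling styles, using `sqlite3`. All table and column names (`customer`, `product`, `sale`, `sale_flat`) are hypothetical. The same question is answered with two joins on the normalized schema and with none on the denormalized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (OLTP-style): each fact is stored once, across three tables.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sale     (sale_id INTEGER PRIMARY KEY, customer_id INTEGER,
                       product_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Oslo'), (2, 'Rome');
INSERT INTO product  VALUES (10, 'books'), (11, 'toys');
INSERT INTO sale VALUES (100, 1, 10, 30.0), (101, 2, 10, 20.0), (102, 1, 11, 15.0);

-- Denormalized (analysis-style): city and category are copied onto each row.
CREATE TABLE sale_flat AS
SELECT s.sale_id, c.city, p.category, s.amount
FROM sale s
JOIN customer c ON c.customer_id = s.customer_id
JOIN product  p ON p.product_id  = s.product_id;
""")

# Same question, two schemas: total sales per city and category.
normalized = conn.execute("""
    SELECT c.city, p.category, SUM(s.amount)
    FROM sale s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.product_id  = s.product_id
    GROUP BY c.city, p.category ORDER BY c.city, p.category
""").fetchall()

denormalized = conn.execute("""
    SELECT city, category, SUM(amount)
    FROM sale_flat
    GROUP BY city, category ORDER BY city, category
""").fetchall()
```

Both queries return the same rows. The denormalized table pays for its simpler query with repeated city and category values, which is acceptable when the data is loaded in bulk and rarely updated.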
15
Why OLAP & OLTP don’t mix
• Analysis requires data from many sources
- An OLTP system targets one specific process
• For example: ordering from an online store
- OLAP integrates data from different processes
• Combine sales, inventory and purchasing data
• Analyze experiments conducted by different labs
- OLAP often makes use of historical data
• Identify long-term patterns
• Notice changes in behaviour over time.
- Terminology, schemas vary across data sources
• Integrating data from disparate sources is a major challenge
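To make the integration challenge concrete, here is a small sketch in plain Python. The two source systems, their field names, and the shared vocabulary (`product`, `quantity`, `revenue`) are all assumptions made for illustration.

```python
# Two hypothetical source systems that describe the same kind of fact
# with different field names, units, and codes.
store_sales = [{"sku": "B-1", "qty": 2, "amount_usd": 40.0}]
web_orders = [{"item_code": "b-1", "units": 1, "total_cents": 2000}]

def from_store(row):
    """Map a store record onto the shared warehouse vocabulary."""
    return {"product": row["sku"].upper(), "quantity": row["qty"],
            "revenue": row["amount_usd"], "source": "store"}

def from_web(row):
    """Map a web record; cents are converted to dollars on the way in."""
    return {"product": row["item_code"].upper(), "quantity": row["units"],
            "revenue": row["total_cents"] / 100, "source": "web"}

integrated = [from_store(r) for r in store_sales] + [from_web(r) for r in web_orders]
total_revenue = sum(r["revenue"] for r in integrated)
```

Only after both sources are mapped to one set of names, codes, and units does a combined figure such as `total_revenue` become meaningful.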
16
Data Warehouse
17
The Warehousing Approach
• Information integrated in advance
• Stored in WH for direct querying and analysis
[Figure: Source, Source, ..., Source feed an Integration System (with Metadata), which loads the Data Warehouse used by Clients]
18
Advantages of Warehousing Approach
• High query performance
- But not necessarily most current information
• Doesn’t interfere with local processing at sources
- Complex queries at warehouse
- OLTP at information sources
• Information copied at warehouse
- Can modify, annotate, summarize, restructure, etc.
- Can store historical information
- Security, no auditing
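The trade-off listed above (fast queries at the warehouse, but not necessarily the most current information) can be sketched in a few lines of Python. The `sources` dictionary and the `refresh` and `net_sales` helpers are hypothetical names used only for this illustration.

```python
# Hypothetical operational sources; each keeps serving its own OLTP workload.
sources = {
    "sales": [{"day": "mon", "amount": 10}, {"day": "tue", "amount": 5}],
    "returns": [{"day": "tue", "amount": 2}],
}
warehouse = {}  # day -> {"sales": total, "returns": total}

def refresh(warehouse, sources):
    """Integrate in advance: copy and summarize source data into the warehouse."""
    warehouse.clear()
    for name, rows in sources.items():
        for row in rows:
            totals = warehouse.setdefault(row["day"], {"sales": 0, "returns": 0})
            totals[name] += row["amount"]

def net_sales(warehouse, day):
    """Answer a query from the warehouse copy only; the sources are not touched."""
    totals = warehouse.get(day, {"sales": 0, "returns": 0})
    return totals["sales"] - totals["returns"]

refresh(warehouse, sources)
before = net_sales(warehouse, "tue")
sources["sales"].append({"day": "tue", "amount": 7})  # new OLTP activity
stale = net_sales(warehouse, "tue")   # the warehouse has not seen it yet
refresh(warehouse, sources)
after = net_sales(warehouse, "tue")
```

Between refreshes the query keeps returning the older figure, which is the price paid for not interfering with processing at the sources.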
19
Need for data warehousing
20
Benefits of data warehousing
21
Decision support systems, DW & OLAP
22
Data warehouse: characteristics
• Analysis driven
• Ad-hoc queries
• Complex queries
• Used by top managers
• Based on dimensional modeling
• Denormalized structures
23
Data Warehousing applications
• Retail
- Customer loyalty
- Market planning
• Financial
- Risk management
- Fraud detection
• Airlines
- Route profitability
- Yield management
• Manufacturing
- Cost reduction
- Logistics management
• Utilities
- Asset management
- Resource management
• Government
- Manpower planning
- Cost control
24
Data Warehousing components
Chapter 2
1
Learning Objectives
• To understand the architecture and processes of a data
warehouse
- Data Warehouse Definitions
- Operational vs. Informational Systems
- Desired features of DW
- Characteristics of Data Warehouse
- Data Warehouse vs. Data Marts
- Top Down Approach vs. Bottom Up Approach
2
What is a data warehouse?
• A single, complete and consistent store of data obtained from a
variety of different sources made available to end users in a way
they can understand and use in a business context.
(Barry Devlin)
• R. Kimball’s definition of a DW
- A data warehouse is a copy of transactional data specifically
structured for querying and analysis
- According to this definition:
• The form of the stored data (RDBMS, flat file) has nothing
to do with whether something is a data warehouse.
3
Data warehouse
• A decision support database that is maintained separately
from the organization’s operational databases.
• Another definition, by W.H. Inmon: a data warehouse is a
- Subject-oriented,
- Integrated,
- Time-variant,
- Non-volatile
collection of data that is used primarily in organizational
decision making.
4
Data Warehouse Components
[Figure: Data Warehouse, labelled Subject Oriented and Integrated]
5
Subject-Oriented
[Figure: Customer financial information as a subject, spanning Equity Plans, Shares, Insurance, Savings, and Loans]
6
Integrated
7
Time variant
• Most business analysis has a time component
• Trend analysis (historical data is required)
8
Nonvolatile
[Figure: Operational, Load, Warehouse]
9
Nonvolatile
10
Operational Vs. Informational system
OPERATIONAL INFORMATIONAL
12
What are informational systems?
• Provide an integrated and total view of the enterprise
• Make the enterprise’s current and historical information easily available for
strategic decision making
13
Data Warehouse: Major Players
• SAS Institute
• IBM (Cognos)
• Oracle (Hyperion)
• Sybase
• Microsoft
• HP
• …
14
Desired features of DW
• Data warehouse designed for analytical tasks
• Easy to use and conducive to long interactive sessions by users
• Read-intensive data usage
• Direct interaction with the system by the users without IT assistance
• Content updated periodically and stable
• Content to include current and historical data
• Ability for users to run queries and get results online
• Ability for users to initiate reports
15
General Overview of a DW
16
DW Milestones
• 1983: Teradata introduces a DBMS for decision support systems.
• 1988: Barry Devlin and Paul Murphy publish "An architecture for a
business and information system" in the IBM Systems Journal.
• 1990: Red Brick Systems introduces a data warehousing system.
• 1991: Bill Inmon, the "father of data warehousing", publishes Building
the Data Warehouse.
• 1991: Prism Solutions introduces Prism warehouse software for developing
data warehouses.
• 1995: The Data Warehousing Institute (DW & BI) is founded.
• 1996: Ralph Kimball publishes The Data Warehouse Toolkit.
• 1997: Oracle 8, with support for star schema queries, is released.
17
Build a data warehouse
Approaches
- Top-down or bottom-up approach
- Enterprise-wide or departmental?
- Which first: data warehouse or data mart?
- Dependent or independent data mart?
18
19
Top-Down Approach (Bill Inmon)
20
Top-down Approach
Advantages
• A truly corporate effort, an enterprise view of data
• Inherently architected, not a union of disparate data marts
• Single, central storage of data about the content
• Centralized rules and control
• May see quick results if implemented with iterations
Disadvantages
• Takes longer to build even with an iterative method
• High exposure to risk of failure
• Difficult to sell to the stakeholders/sponsors/higher management
as it requires experienced professionals
21
Bottom-Up Approach (Ralph Kimball)
22
Bottom-up Approach
Advantages
• Faster and easier implementation of manageable pieces.
• Favorable return on investment and proof of concept.
• Less risk of failure
• Allows project team to learn and grow.
Disadvantages
• Each data mart has its own narrow view of data
• Spreads redundant data across every data mart
23
Practical Approach
24
Kimball’s vs. Inmon’s
25
Data Warehouse: architectural components
Chapter 3
1
Course Learning Objectives
2
Architectural components in the three major areas
• Data Acquisition
• Data Storage
• Information Delivery
3
Management and Control Component
• This component has two major functions:
- first to constantly monitor all the ongoing operations,
- and next to step in and recover from problems when things go wrong.
• The management architectural component manages and
controls data acquisition functions, ensuring that extracts and
transformations are carried out correctly and in a timely
fashion.
• The management component manages backing up significant
parts of the data warehouse and recovering from failures.
Management services include monitoring the growth and
periodically archiving data from the data warehouse.
• This management component governs data security and
provides authorized access to the data warehouse.
4
Data Acquisition
• Data acquisition involves the entire process of extracting
data from the data sources, moving all the extracted data to the
staging area, and preparing the data for loading into the data
warehouse repository.
• The two major architectural components identified for data
acquisition are:
- source data and
- data staging.
5
Data Acquisition
• Source Data
- The internal and external data sources form the source data architectural
component.
- Source data governs the extraction of data for preparation and storage in
the data warehouse.
• Data Staging
- The data staging architectural component governs the transformation,
cleaning, and integration of data.
- An intermediate storage area used for data pre-processing.
6
Data Acquisition: technical architecture
7
Data Acquisition: Functions and Services
Data Extraction
1. Select data sources and determine the types of filters to be applied
to individual sources.
2. Generate automatic extract files from operational systems using
replication and other techniques.
3. Create intermediary files to store selected data to be merged later.
4. Transport extracted files from multiple platforms.
5. Provide automated job control services for creating extract files.
6. Reformat input from outside sources.
7. Reformat input from departmental data files, databases, and
spreadsheets.
8. Generate common application codes for data extraction.
9. Resolve inconsistencies for common data elements from multiple
sources.
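A minimal sketch of steps 1 and 3 above (filter a source, write an intermediary extract file), using only the standard library. The `orders_source` rows, the `extract` helper, and the "shipped" filter are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical operational extract: in practice this would come from a source system.
orders_source = [
    {"order_id": "1", "status": "shipped", "amount": "25.00"},
    {"order_id": "2", "status": "cancelled", "amount": "10.00"},
    {"order_id": "3", "status": "shipped", "amount": "7.50"},
]

def extract(rows, keep, fields):
    """Apply a per-source filter and write the selected rows to an extract file."""
    buffer = io.StringIO()  # stands in for an intermediary file on disk
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if keep(row):
            writer.writerow({name: row[name] for name in fields})
    return buffer.getvalue()

extract_file = extract(
    orders_source,
    keep=lambda row: row["status"] == "shipped",
    fields=["order_id", "amount"],
)
print(extract_file)
```

Passing the filter and the field list as arguments lets one routine serve several sources, each with its own selection rule.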
8
Data Acquisition: Functions and Services
Data Transformation
1. Map input data to the data structures of the data warehouse repository.
2. Clean data, de-duplicate, and merge/purge.
3. De-normalize extracted data structures as required by the
dimensional model of the data warehouse.
4. Convert data types.
5. Calculate and derive attribute values.
6. Check for referential integrity.
7. Aggregate data as needed.
8. Resolve missing values.
9. Consolidate and integrate data.
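Several of these transformation steps (de-duplication, type conversion, derived attributes, missing values) can be sketched together in plain Python. The row layout, the `transform` helper, and the default quantity of 1 for missing values are assumptions made for the example, not rules from the slides.

```python
from datetime import date

# Hypothetical extracted rows: strings only, with a duplicate and a missing value.
extracted = [
    {"order_id": "1", "order_date": "2024-01-05", "qty": "2", "unit_price": "10.0"},
    {"order_id": "1", "order_date": "2024-01-05", "qty": "2", "unit_price": "10.0"},
    {"order_id": "2", "order_date": "2024-01-06", "qty": "", "unit_price": "4.0"},
]

def transform(rows, default_qty=1):
    """De-duplicate, convert types, resolve missing values, and derive an attribute."""
    seen, result = set(), []
    for row in rows:
        if row["order_id"] in seen:  # de-duplicate on the business key
            continue
        seen.add(row["order_id"])
        # Resolve a missing quantity with an assumed default.
        qty = int(row["qty"]) if row["qty"] else default_qty
        price = float(row["unit_price"])  # convert data type
        result.append({
            "order_id": int(row["order_id"]),
            "order_date": date.fromisoformat(row["order_date"]),
            "qty": qty,
            "amount": qty * price,  # derived attribute
        })
    return result

clean = transform(extracted)
```

How a missing value should be resolved is a business decision; a default is only one option, alongside rejecting the row or flagging it for review.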
9
Data Acquisition: Functions and Services
Data Staging
1. Provide backup and recovery for staging area repositories.
2. Sort and merge files.
3. Create files as input to make changes to dimension tables.
4. If data staging storage is a relational database, create and populate
database.
5. Resolve and create primary and foreign keys for load tables.
6. Consolidate datasets and create flat files for loading through
DBMS utilities.
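Step 5 (resolving and creating primary and foreign keys for load tables) might look like the sketch below. The staged rows, the `build_load_tables` helper, and the choice of sequential integers as generated keys are hypothetical.

```python
# Hypothetical staged rows keyed by natural (source) identifiers.
staged_sales = [
    {"customer_code": "C-ann", "amount": 20.0},
    {"customer_code": "C-bob", "amount": 5.0},
    {"customer_code": "C-ann", "amount": 7.5},
]

def build_load_tables(rows):
    """Assign a primary key per distinct customer and resolve it as a foreign key."""
    customer_keys = {}  # natural code -> generated primary key
    dimension, facts = [], []
    for row in rows:
        code = row["customer_code"]
        if code not in customer_keys:
            customer_keys[code] = len(customer_keys) + 1
            dimension.append({"customer_key": customer_keys[code], "customer_code": code})
        facts.append({"customer_key": customer_keys[code], "amount": row["amount"]})
    return dimension, facts

customer_dim, sales_fact = build_load_tables(staged_sales)
```

The result is one load table for the dimension and one for the facts, with every fact row already pointing at a valid dimension key.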
10
Data Storage
• This is the process of loading the data from the staging area
into the data warehouse repository.
11
Data Storage: Functions and Services
1. Load data for full refreshes of data warehouse tables.
2. Perform incremental loads at regular prescribed intervals.
3. Support loading into multiple tables at the detailed and
summarized levels.
4. Provide automated job control services for loading the data
warehouse.
5. Provide backup and recovery for the data warehouse
database.
6. Periodically archive data from the database according to
preset conditions.
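The difference between a full refresh (item 1) and an incremental load (item 2) can be sketched with `sqlite3`. The `sales` table is hypothetical, and using the highest loaded `sale_id` as a high-water mark is an assumption that only holds when source keys always increase.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, day TEXT, amount REAL)")

def full_refresh(conn, rows):
    """Replace the whole table with the current staged contents."""
    with conn:
        conn.execute("DELETE FROM sales")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def incremental_load(conn, rows):
    """Append only rows newer than the highest key already loaded."""
    (last_id,) = conn.execute("SELECT COALESCE(MAX(sale_id), 0) FROM sales").fetchone()
    new_rows = [r for r in rows if r[0] > last_id]
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_rows)
    return len(new_rows)

staged = [(1, "2024-01-01", 10.0), (2, "2024-01-01", 4.0)]
full_refresh(conn, staged)
staged.append((3, "2024-01-02", 6.0))
loaded = incremental_load(conn, staged)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

The incremental load inserts only the one new row, while a full refresh would have rewritten all three.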
12
Data Storage Architecture
13
Information Delivery
• Spans a broad spectrum of methods for making information
available to users.
• The primary data warehouse feeds data to proprietary
multidimensional databases (MDDBs), where summarized
data is kept as multidimensional cubes of information.
• The users perform complex multidimensional analysis
using the information cubes in the MDDBs.
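A cube of summarized data can be pictured as a table of pre-computed totals, one per combination of dimension values. The sketch below builds such a structure in plain Python. The facts, the three dimensions, and the `build_cube` helper are hypothetical, and a real MDDB would use far more compact storage.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical detailed facts with three dimensions and one measure.
facts = [
    {"product": "book", "region": "north", "month": "jan", "sales": 10},
    {"product": "book", "region": "south", "month": "jan", "sales": 4},
    {"product": "pen", "region": "north", "month": "feb", "sales": 3},
]
dimensions = ("product", "region", "month")

def build_cube(facts, dimensions, measure):
    """Pre-compute the measure's total for every combination of dimensions."""
    cube = defaultdict(int)
    for fact in facts:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                cell = tuple((d, fact[d]) for d in dims)
                cube[cell] += fact[measure]
    return dict(cube)

cube = build_cube(facts, dimensions, "sales")
grand_total = cube[()]
book_total = cube[(("product", "book"),)]
north_jan = cube[(("region", "north"), ("month", "jan"))]
```

Once the cube is built, each analytical question is a lookup rather than a scan of the detailed facts, which is why summarized cubes suit interactive analysis.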
14
Information Delivery
15
Information Delivery: Functions and Services
16
Architectural Types
17
Architectural Types
18
Centralized data warehouse
19
Centralized data warehouse
• Data flows from the source systems to the staging area, then
to the normalized central data warehouse, and thereafter to
end-users as business intelligence.
20
Independent data marts
• In this architecture type, the data warehouse is really a
collection of unconnected, disparate data marts, each serving
a specific department or purpose. These data marts in such
organizations usually evolve over time without any overall
planning.
• Each data mart delivers information to its own group of users
- These separate data marts do not provide a “single version of the
truth”. Conflicting table schemas are likely.
- The data marts are independent of one another; no normalization takes
place
• More likely to have inconsistent data definitions and standards,
redundant data, and redundant processing.
• Not scalable for integration
21
Independent data marts
• Data flows from the source systems to the staging area, then to the
various independent data marts, and thereafter to individual groups of
end-users as business intelligence. In many cases, data staging functions
and movement to each data mart may be carried out separately.
22
Federated
• This architecture type appears to be similar to the type with
independent data marts with the exception that common data
elements in the various data marts and even data warehouses
that compose the federation are integrated physically or
logically.
- Link together existing Data marts
- Logical/physical integration of common data elements
- Linkage information is stored and held by the meta-data
• All information delivery is from the data storage
• By using meta-data, there is no need to re-architect existing
data structures (such as marts, warehouses, or transactional
systems).
23
Federated
• Logical or physical integration of common data elements
takes place between multiple data marts.
24
Hub and Spoke
• In this architecture type, a centralized enterprise data
warehouse is present but there are also data marts that
depend on the enterprise data warehouse for data feed.
• Information delivery can, therefore, be both from the
centralized data warehouse and the dependent data marts.
• Relational modeling is carried out at the centralized data
warehouse, whereas dimensional modeling is carried
out at the dependent data marts.
• The centralized data warehouse is based on OLAP and is used for
queries and reports.
• The dependent data marts are used by data mining applications
(more on this later).
25
Hub and Spoke
• Centralized DW is based on Relational Model whereas
Dependent Data Marts are based on Dimensional Model
26
Data-Mart Bus
• In this architecture type, no distinct, single data warehouse exists.
• Build the first data mart (super mart) using business dimensions
and metrics.
• These business dimensions will be shared by future data marts.
New data marts can use the existing table schema.
• Each data mart is used for a specific business process, but the use
of conformed dimensions and facts enables the incremental
integration of additional data marts to form an organization-wide
view.
27
Data-Mart Bus
• Additional marts are developed using the conformed
dimensions of the first mart. Staging is needed to ensure
that source data is converted into the conformed data mart
structure.
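A minimal sketch of conformed dimensions, using `sqlite3`. The table names (`dim_product`, `fact_sales`, `fact_inventory`) and their contents are hypothetical. The point is that a later mart reuses the product dimension of the first mart, so results from both can be lined up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conformed dimension: defined once by the first mart, reused by later marts.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
INSERT INTO dim_product VALUES (1, 'book'), (2, 'pen');

-- First data mart (sales) and a later data mart (inventory) share dim_product.
CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
CREATE TABLE fact_inventory (product_key INTEGER, units_on_hand INTEGER);
INSERT INTO fact_sales VALUES (1, 3), (1, 2), (2, 4);
INSERT INTO fact_inventory VALUES (1, 10), (2, 1);
""")

# Because both marts use the same product keys, their results can be combined.
combined = conn.execute("""
    SELECT p.name, s.sold, i.on_hand
    FROM dim_product p
    JOIN (SELECT product_key, SUM(units_sold) AS sold
          FROM fact_sales GROUP BY product_key) s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(units_on_hand) AS on_hand
          FROM fact_inventory GROUP BY product_key) i
      ON i.product_key = p.product_key
    ORDER BY p.name
""").fetchall()
```

If each mart had defined its own product keys, this combined view would first need a mapping between them, which is the problem the independent data marts architecture runs into.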
28