
DATA WAREHOUSING

CIS 410
Definition
• Data Warehouse:
– A subject-oriented, integrated, time-variant, non-updatable
collection of data used in support of management decision-
making processes
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources
– Time-variant: Can study trends and changes
– Nonupdatable: Read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
Transaction Oriented (OLTP)
Database Characteristics
• Narrow focus (subject area DBs)
• Current or recent data only
• Fully detailed data
• Dirty data
• Volatile data
– continuous updates, inserts, deletes
Desired Properties for Data
Warehouse Databases
• Broad scope of data
• Long time horizon
• Summary data needed
• Data must be clean
• Data are only read (queried)

Star model – Snowflake and Constellation Variants


Need for Data Warehousing
• Integrated, company-wide view of high-quality
information (from disparate databases)
• Separation of operational and informational systems and
data (for improved performance)
Source: adapted from Strange (1997).
Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational Data
Store
• Logical Data Mart and @ctive Warehouse
• Three-Layer architecture
All involve some form of extraction, transformation and loading (ETL)
ETL
Figure 11-2: Generic two-level architecture

[Diagram: E-T-L flows from the source systems into one, company-wide warehouse.]

Periodic extraction → data is not completely current in warehouse
Figure 11-3: Independent data mart

Data marts: mini-warehouses, limited in scope

[Diagram: a separate E-T-L pipeline feeds each independent data mart.]

Separate ETL for each independent data mart
Data access complexity due to multiple data marts
Figure 11-4: Dependent data mart with operational data store

ODS provides option for obtaining current data
Single ETL for enterprise data warehouse (EDW) → simpler data access
Dependent data marts loaded from EDW

Figure 11-5: Logical data mart and @ctive data warehouse

ODS and data warehouse are one and the same
Near-real-time ETL for @ctive data warehouse
Data marts are NOT separate databases, but logical views of the data warehouse → easier to create new data marts
Figure 11-6: Three-layer data architecture
Data Characteristics
Status vs. Event Data

Event = a database action (create/update/delete) that results from a transaction
Status = the state of the data, captured before and after each event
Data Characteristics
Transient vs. Periodic Data
Transient data: changes to existing records are written over previous records, thus destroying the previous data content

Periodic data: data are never physically altered or deleted once they have been added to the store
Data Reconciliation
• Typical operational data is:
– Transient – not historical
– Not normalized (perhaps due to denormalization for performance)
– Restricted in scope – not comprehensive
– Sometimes poor quality – inconsistencies and errors
• After ETL, data should be:
– Detailed – not summarized yet
– Historical – periodic
– Normalized – 3rd normal form or higher
– Comprehensive – enterprise-wide perspective
– Timely – data should be current enough to assist decision-making
– Quality controlled – accurate with full integrity
The ETL Process
• Capture/Extract
• Scrub or data cleansing
• Transform
• Load and Index
ETL = Extract, transform, and load
Capture/Extract… obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
Scrub/Cleanse… uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Transform = convert data from format of operational system to format of data warehouse

Record-level:
– Selection – data partitioning
– Joining – data combining
– Aggregation – data summarization

Field-level:
– Single-field – from one field to one field
– Multi-field – from many fields to one, or one field to many
Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
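To make the four ETL steps concrete, here is a minimal Python/SQLite sketch of one incremental run. All table and column names (src_orders, sales_fact, updated_at) are hypothetical, and the scrub and transform rules stand in for the much richer cleansing described above.

```python
import sqlite3

def etl_run(source_db: str, warehouse_db: str, last_extract: str) -> None:
    """One incremental ETL cycle: capture -> scrub -> transform -> load."""
    src = sqlite3.connect(source_db)
    dw = sqlite3.connect(warehouse_db)

    # Capture/Extract (incremental): only rows changed since the last run.
    rows = src.execute(
        "SELECT order_id, cust_name, amount, order_date FROM src_orders"
        " WHERE updated_at > ?",
        (last_extract,),
    ).fetchall()

    # Scrub/Cleanse: drop rows with missing keys, normalize name casing.
    clean = [
        (oid, name.strip().upper(), amount, day)
        for (oid, name, amount, day) in rows
        if oid is not None and amount is not None
    ]

    # Transform (record-level aggregation): summarize sales by day.
    totals: dict[str, float] = {}
    for _oid, _name, amount, day in clean:
        totals[day] = totals.get(day, 0.0) + amount

    # Load/Index (update mode): fold the new amounts into existing totals.
    # Assumes sales_fact(sale_date TEXT PRIMARY KEY, total_amount REAL).
    dw.executemany(
        "INSERT INTO sales_fact (sale_date, total_amount) VALUES (?, ?)"
        " ON CONFLICT(sale_date) DO UPDATE SET"
        " total_amount = total_amount + excluded.total_amount",
        list(totals.items()),
    )
    dw.commit()
```

A refresh-mode load would instead truncate and bulk-rewrite sales_fact on each run.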
Figure 11-11: Single-field transformation

In general, some transformation function translates data from old form to new form
– Algorithmic transformation uses a formula or logical expression
– Table lookup is another approach; it uses a separate table keyed by source record code
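A brief sketch of the two single-field approaches; the unit conversion and the state-code table are illustrative stand-ins, not from the Airwest case.

```python
# Single-field transformation: one source field -> one target field.

# Algorithmic: a formula or logical expression converts old form to new,
# e.g. temperatures recoded from Fahrenheit to Celsius.
def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32.0) * 5.0 / 9.0

# Table lookup: a separate table keyed by the source record code,
# e.g. expanding state codes to full names (table contents hypothetical).
STATE_LOOKUP = {"AZ": "Arizona", "CA": "California", "NV": "Nevada"}

def decode_state(code: str) -> str:
    return STATE_LOOKUP.get(code, "UNKNOWN")
```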
Figure 11-12: Multifield transformation

M:1 – from many source fields to one target field
1:M – from one source field to many target fields
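The same idea for multifield transformations, again with hypothetical field names:

```python
# Multifield transformation: many fields -> one, or one field -> many.

# M:1 -- several source fields combine into one target field.
def full_address(street: str, city: str, state: str, zip_code: str) -> str:
    return f"{street}, {city}, {state} {zip_code}"

# 1:M -- one source field splits into several target fields.
def split_name(full_name: str) -> tuple[str, str]:
    """Split 'Last, First' into (first, last)."""
    last, first = full_name.split(",", 1)
    return first.strip(), last.strip()
```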
Derived Data
• Objectives
– Ease of use for decision support applications
– Fast response to predefined user queries
– Customized data for particular target audiences
– Ad-hoc query support
– Data mining capabilities
• Characteristics
– Detailed (mostly periodic) data
– Aggregate (for summary)
– Distributed (to departmental servers)

Most common data model = star schema (also called “dimensional model”)
DATA WAREHOUSE STRUCTURE

• Data warehouses typically have a ‘star-shaped’ structure with a central ‘fact’ table and a number of ‘dimension’ or category tables.
• The fact table contains one observation for each target ‘fact’ variable for each distinct combination of categories from the dimension tables.
• Dimension tables contain a dimension code, which is the key that is repeated in the fact table, and additional descriptive characteristics of the dimension. (For a zip code, the dimension table might include the name of the city and perhaps the population of the zip code area.)
A Data Warehouse Structure

[Diagram: a central fact table surrounded by dimension tables A through F.]
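A minimal sketch of this structure in SQLite, using illustrative period/product/store dimensions (the names anticipate the star schema example later in these notes; they are not prescribed by it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: a surrogate key plus descriptive attributes.
CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, product_line TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT, state TEXT);

-- Fact table: one row per distinct combination of dimension keys,
-- holding the quantitative 'fact' variables.
CREATE TABLE sales_fact (
    period_key   INTEGER REFERENCES period_dim,
    product_key  INTEGER REFERENCES product_dim,
    store_key    INTEGER REFERENCES store_dim,
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (period_key, product_key, store_key)
);
""")
```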
Data Warehouse Development Process

1. Planning
2. Requirements Gathering
3. Analysis and Design
4. Construction
5. Deployment

• Iterate through these cycles as needed
Gather User Requirements

• Identify main user groups and users


• Gather Requirements from each main user
– Facts to be analyzed
– Dimensions used for evaluation
– Use query footprint modeling
• Consolidate individual requirements to
determine system requirements
Query Footprint Tool

• Subject area modeled as the central circle
• Lines (arcs) from the circle represent dimensions for analysis
• Points along the dimension arcs represent differing levels of aggregation
– Levels closest to the circle are most detailed; outer levels become progressively more aggregated

[Example footprint: SALES as the central subject, with dimension arcs for TIME (Month, Quarter), LOCATION (City, State), and PRODUCT (Product, Product-Line).]

• Consolidated query footprint includes every dimension identified by any key user and every level of aggregation identified by any key user
• Differences in terms used to describe dimensions are also resolved here
Airwest Sales Related Tables

AIRPORT: Air_code, Air_location, Elevation, Air_phone
FREQUENT_FLYER: Ff_no, L_name, F_name, Str_address, City, State, Zip_code, Total_miles, Cur_year_miles, First_pur_date
RESERVATION: Confirm_no, Res_name, Res_phone, ...
PASSENGER: Itinerary_no, Pass_name, Confirm_no, F_no
FLIGHT: Flight_no, Orig, Dest, Orig_time, Dest_time, Meal, Fare
TICKET: Flight_no, Flight_Date, Ff_miles, Itinerary_no, Seat, Fare_billed*
Query Footprints for Airwest
1. The manager of the Frequent-Flyer program wants to be able to look at trends in tickets sold and fares charged for each route (orig and dest combination), associated with the Zip_code, City, and State of residence of the Frequent Flyer. She wants this data on a quarterly and annual basis.
2. The VP of Marketing wants to be able to track tickets sold by the Area_code and State of the person making the reservation. She needs to see this data grouped by the flight number, route, and origin of the flight on which the ticket is sold. She wants to be able to see trade-offs between this flight and customer location data on a monthly and quarterly basis. Note: the State of residence component will require us to supplement the data of the transactional database.
3. The Chef wants to be able to track tickets sold by Flight_number and Origin of the flight on a monthly and annual basis.
Analysis and Design
• Create star or snowflake schema model to meet requirements
• Identify internal and external sources of needed data
• Analyze data transformation requirements / design mechanisms for data transformation
– Simple transformation, cleansing and scrubbing, integration, aggregation and summarization
• Identify key performance issues and design structures to maximize performance in key areas
Figure 11-13: Components of a star schema

Fact tables contain factual or quantitative data
Dimension tables contain descriptions about the subjects of the business
1:N relationship between dimension tables and fact tables
Dimension tables are denormalized to maximize performance
Excellent for ad-hoc queries, but bad for online transaction processing
Figure 11-14: Star schema example

Fact table provides statistics for sales broken down by product, period, and store dimensions
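Continuing the sqlite3 sketch from the star-structure section above, such a breakdown is a standard join plus GROUP BY (the column names are the sketch's own, not the figure's):

```python
# Sales broken down by product line and quarter.
rows = conn.execute("""
    SELECT p.product_line, t.quarter, SUM(f.dollars_sold) AS total_sales
    FROM sales_fact f
    JOIN product_dim p ON f.product_key = p.product_key
    JOIN period_dim  t ON f.period_key  = t.period_key
    GROUP BY p.product_line, t.quarter
    ORDER BY p.product_line, t.quarter
""").fetchall()
```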
Issues Regarding Star Schema
• Dimension table keys must be surrogate (non-intelligent and non-business related; see the sketch after this list), because:
– Keys may change over time
– Length/format consistency
• Granularity of Fact Table – what level of detail do you want?
– Transactional grain – finest level
– Aggregated grain – more summarized
– Finer grain → better market basket analysis capability
– Finer grain → more dimension tables, more rows in fact table
• Duration of the database – how much history should be kept?
– Natural duration – 13 months or 5 quarters
– Financial institutions may need longer duration
– Older data is more difficult to source and cleanse
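One minimal way to assign surrogate keys at load time (a sketch; real ETL tools keep this mapping in a persistent key-map table):

```python
# Map business keys to meaningless surrogate integers, so a change in the
# business key never ripples into the fact table.
surrogate_map: dict[str, int] = {}

def surrogate_key(business_key: str) -> int:
    if business_key not in surrogate_map:
        surrogate_map[business_key] = len(surrogate_map) + 1
    return surrogate_map[business_key]

assert surrogate_key("PROD-0042") == 1  # hypothetical product code
assert surrogate_key("PROD-0042") == 1  # same business key -> same surrogate
```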
Figure 11-16: Modeling dates

Fact tables contain time-period data → date dimensions are important
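Date dimensions are usually generated once, with the roll-ups pre-computed per day. A minimal sketch (one possible layout, not the figure's exact columns):

```python
from datetime import date, timedelta

def build_date_dim(start: date, days: int) -> list[tuple[str, str, str, int]]:
    """One row per calendar day with month, quarter, and year roll-ups."""
    rows = []
    for i in range(days):
        d = start + timedelta(days=i)
        quarter = (d.month - 1) // 3 + 1
        rows.append((d.isoformat(), d.strftime("%Y-%m"), f"{d.year}-Q{quarter}", d.year))
    return rows

# build_date_dim(date(2024, 1, 1), 2)
# -> [('2024-01-01', '2024-01', '2024-Q1', 2024), ('2024-01-02', '2024-01', '2024-Q1', 2024)]
```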
The User Interface
Metadata (data catalog)
• Identify subjects of the data mart
• Identify dimensions and facts
• Indicate how data is derived from enterprise data
warehouses, including derivation rules
• Indicate how data is derived from operational data store,
including derivation rules
• Identify available reports and predefined queries
• Identify data analysis techniques (e.g. drill-down)
• Identify responsible people
On-Line Analytical Processing (OLAP)
• The use of a set of graphical tools that provides users with
multidimensional views of their data and allows them to
analyze the data using simple windowing techniques
• Relational OLAP (ROLAP)
– Traditional relational representation
• Multidimensional OLAP (MOLAP)
– Cube structure
• OLAP Operations
– Cube slicing – producing a 2-D view of the data
– Drill-down – going from summary to more detailed views (see the sketch below)
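Both operations can be illustrated with a pandas pivot (the sales data below is made up purely for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["Widget", "Widget", "Gadget", "Gadget"],
    "store":   ["Phoenix", "Tucson", "Phoenix", "Tucson"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "amount":  [100.0, 150.0, 200.0, 250.0],
})

# Cube slicing: fix one dimension (quarter = Q1) for a 2-D product x store view.
slice_q1 = sales[sales["quarter"] == "Q1"].pivot_table(
    index="product", columns="store", values="amount", aggfunc="sum")

# Drill-down: start from a summary by product, then add the store dimension.
summary = sales.groupby("product")["amount"].sum()
detail = sales.groupby(["product", "store"])["amount"].sum()
```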
Figure 11-22: Slicing a data cube (summary report)
Figure 11-23: Example of drill-down

Starting with summary data, users can obtain details for particular cells (drill-down shown with color added)
Alternative Data Warehouse Schema Structures
• Star Schema
– Single fact table
– Each dimension has only 1 logical hierarchy for
aggregation
• Snowflake Schema
– Single fact table
– 1 or more dimensions has multiple aggregation hierarchies
• Constellation Schema
– Multiple fact tables with overlapping but not identical
dimensions and levels of detail
Our Airwest DW has a Star Schema

SALES_FACT: Flight_No, Month, Area_code, Zip_code, Tickets_sold, Fares_collected

TIME_PER_DIM: Month, Quarter, Year
CUST_RES_DIM: Area_Code, State
FLIGHTS_DIM: Flight_no, Route, Orig
FF_RES_DIM: Zip_Code, City, State
What causes a Snowflake Design?
• Different groups aggregate a dimension in different
ways.
– For example:
• Marketing aggregates flights in a hierarchy of Flight_no, Route
• Food services aggregates flights in a hierarchy of Flight_no,
Meal by Orig, Orig
– Or
• One hierarchy of time goes from day to week to month to
quarter to year
• Another goes from month to high versus low season
This Creates a Snowflake

SALES_FACT: Flight_No, Day, Area_code, Tickets_sold, Fares_collected

FLIGHTS_DIM: Flight_no
ROUTE_DIM: Route
MEAL_DIM: Meal, Orig
TIME_DIM: Month
TIME_DIM*: Month, Quarter, Year
SEASON_DIM: Month, Season
CUST_RES: Area_Code, State
What Causes a Constellation Design
• Important facts relating to the subject area are
– Not all available at the same level of detail
– Not all available across every dimension
• For example: Airwest might be able to get estimates of total ticket sales (for all airlines) for each route each month. This fact could not be broken down across the customer location dimensions, or compared directly to individual Airwest flights.
This Leads to a Constellation Schema

IND_SALES_FACT: Route, Month, Ind_tickets_sold
SALES_FACT: Flight_No, Month, Area_code, Zip_code, Tickets_sold, Fares_collected

FLIGHTS_DIM: Flight_no, Route, Orig
TIME_PER_DIM: Month, Quarter, Year
CUST_RES_DIM: Area_Code, State
FF_RES_DIM: Zip_Code, City, State
ROLAP Versus MOLAP Storage Structures

ROLAP
• 1 row in dimension tables for each value of the most detailed level of the dimension
• Higher-level aggregations developed by standard SQL joins and group functions

MOLAP
• 1 row in dimension tables for each distinct value at each level of the dimension
• All possible levels of aggregation are pre-calculated in the fact table
• Pointers rather than joins are used to link fact and dimension tables
Flights Dimension for ROLAP

FLIGHT_NO   ROUTE        ORIG
101         FLG to PHX   FLG
602         PHX to SFO   PHX
15          PHX to LAX   PHX
17          PHX to LAX   PHX
33          PHX to LAX   PHX
40          PHX to LAX   PHX
600         PHX to SFO   PHX
204         LAX to SFO   LAX
207         LAX to SFO   LAX
209         LAX to SFO   LAX
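With this dimension, a ROLAP engine computes route-level totals on demand via a standard join and GROUP BY. A sketch (only a few of the flights loaded; the ticket counts are invented sample figures):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights_dim (flight_no INTEGER PRIMARY KEY, route TEXT, orig TEXT)")
conn.execute("CREATE TABLE sales_fact (flight_no INTEGER, tickets_sold INTEGER)")
conn.executemany("INSERT INTO flights_dim VALUES (?, ?, ?)", [
    (101, "FLG to PHX", "FLG"), (602, "PHX to SFO", "PHX"),
    (15, "PHX to LAX", "PHX"), (204, "LAX to SFO", "LAX"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [(101, 40), (602, 55), (15, 70), (204, 25)])

# Higher-level aggregation computed at query time -- the ROLAP approach.
for route, total in conn.execute("""
        SELECT d.route, SUM(f.tickets_sold)
        FROM sales_fact f JOIN flights_dim d USING (flight_no)
        GROUP BY d.route"""):
    print(route, total)
```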
Flights Dimension for MOLAP

Instead of 10 distinct values for the flight numbers, which can be aggregated to higher levels, the MOLAP structure assigns a unique dimension record to each value at each level of the hierarchy (10 flights + 4 routes + 3 origins + 1 overall total). Now there are 18 rows in this dimension table. Each row is identified by an artificial key, and the level attribute can be used to pick all rows at a particular level in the hierarchy – e.g., all level 2s to retrieve route-related data.

The fact table contains 1 row for each combination of values across all dimensions after they have been expanded in this way. Thus all possible summarizations are precalculated.

Flt_dim_no   Dim_Label    Level
A01          101          3
A02          FLG to PHX   2
A03          FLG          1
A04          204          3
A05          207          3
A06          209          3
A07          LAX to SFO   2
A08          LAX          1
A09          15           3
A10          17           3
A11          33           3
A12          40           3
A13          PHX to LAX   2
A14          600          3
A15          602          3
A16          PHX to SFO   2
A17          PHX          1
A18          TOTAL        0
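The expansion itself is mechanical. A sketch that derives the 18 MOLAP rows from the 10 ROLAP rows above (the surrogate ids like A01 are omitted):

```python
def expand_dimension(flights: list[tuple[str, str, str]]) -> list[tuple[str, int]]:
    """Expand (flight_no, route, orig) rows into one row per value per level:
    level 3 = flight, 2 = route, 1 = origin, 0 = overall total."""
    rows = [(f, 3) for f, _, _ in flights]
    rows += [(r, 2) for r in dict.fromkeys(r for _, r, _ in flights)]
    rows += [(o, 1) for o in dict.fromkeys(o for _, _, o in flights)]
    rows.append(("TOTAL", 0))
    return rows

flights = [("101", "FLG to PHX", "FLG"), ("602", "PHX to SFO", "PHX"),
           ("15", "PHX to LAX", "PHX"), ("17", "PHX to LAX", "PHX"),
           ("33", "PHX to LAX", "PHX"), ("40", "PHX to LAX", "PHX"),
           ("600", "PHX to SFO", "PHX"), ("204", "LAX to SFO", "LAX"),
           ("207", "LAX to SFO", "LAX"), ("209", "LAX to SFO", "LAX")]

assert len(expand_dimension(flights)) == 18  # 10 flights + 4 routes + 3 origins + 1 total
```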
Explosion of Summary Values – Airwest Example

DIMENSION   # of Distinct Values (Lowest Level)   # of Distinct Values (All Levels)
Flight      10                                    18
Cust_res    12                                    15
FF_Res      9                                     16
Time        12                                    21
Product     12,960                                90,720

The pre-calculation of all possible summarizations can explode the storage requirements of the MOLAP-based data warehouse. The table above shows this explosion effect for our Airwest example (the last row is the product of the counts, i.e., the number of fact-table cells). In real life, with several more dimensions and levels and many more distinct values per dimension, the storage required could explode to terabytes. For this reason, many OLAP-related software products combine ROLAP and MOLAP characteristics: they pre-calculate the highest (most frequently used) levels in each dimension and then use ROLAP methods to compute lower-level summaries as needed.
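The arithmetic behind the table, computed directly:

```python
# (lowest-level values, all-level values) per dimension, from the table above.
dims = {"Flight": (10, 18), "Cust_res": (12, 15), "FF_Res": (9, 16), "Time": (12, 21)}

rolap_cells, molap_cells = 1, 1
for lowest, all_levels in dims.values():
    rolap_cells *= lowest       # fact rows at the finest grain only
    molap_cells *= all_levels   # fact rows with every aggregation level pre-stored

print(rolap_cells, molap_cells)  # 12960 90720 -- a 7x explosion even in this tiny example
```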
Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and computer graphics
techniques
• Goals:
– Explain observed events or conditions
– Confirm hypotheses
– Explore data for new or unexpected relationships
• Techniques
– Statistical regression
– Decision tree induction
– Clustering and signal processing
– Affinity
– Sequence association
– Case-based reasoning
– Rule discovery
– Neural nets
– Fractals
• Data visualization – representing data in graphical/multimedia formats for analysis
