Professional Documents
Culture Documents
CIS 410
Definition
• Data Warehouse:
– A subject-oriented, integrated, time-variant, non-updatable
collection of data used in support of management decision-
making processes
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources
– Time-variant: Can study trends and changes
– Nonupdatable: Read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
Transaction Oriented (OLTP)
Database Characteristics
• Narrow focus (subject area DBs)
• Current or recent Data only
• Fully detailed data
• Dirty data
• Volatile data
– continuous updates, inserts, deletes
Desired Properties for Data
Warehouse Databases
• Broad scope of data
• Long time horizon
• Summary data needed
• Data must be clean
• Data are only read (queried)
L
One,
company-
wide
T warehouse
T
E
T
E Simpler data access
Single ETL for
enterprise data warehouse Dependent data marts
(EDW) loaded from EDW
ODS and data warehouse
are one and the same
T
E
Near real-time ETL for Data marts are NOT separate databases,
@active Data Warehouse but logical views of the data warehouse
Easier to create new data marts
Figure 11-6: Three-layer data architecture
Data Characteristics
Status vs. Event Data
Status
Status
Data Characteristics
Transient vs. Periodic Data
Changes to existing
records are written
over previous
records, thus
destroying the
previous data content
Record-level: Field-level:
Selection – data partitioning single-field – from one field to one field
Joining – data combining multi-field – from many fields to one, or
Aggregation – data summarization one field to many
Load/Index= place transformed data
into the warehouse and create indexes
Dimension
Dimension Fact table table E
table B
Dimension Dimension
table C table D
Data Warehouse Development
Process
Deployment Planning
1. Planning
2. Requirements
Construction
Gathering
Requirements
3. Analysis and Design
4. Construction
Analysis
5. Deployment & Design
of aggregation PRODUCT
– Levels closest to the circle are
most detailed and outer levels
become progressively more
aggregated
•Consolidated Query footprint includes every dimension identified by any key
user and every level of aggregation identified by any key user
•Differences in terms used to described dimensions also resolved here
Airwest Sales Related Tables
RESERVATION
FREQUENT_FLYER
AIRPORT Ff_no Confirm_no
Air_code L_name Res_name
Air_location F_name Res_phone
Elevation
Str_address ...
City
Air_phone State
Zip_code
Total_miles
PASSENGER
Cur_year_miles
Itinerary_no
FLIGHT First_pur_date Pass_name
Flight_no Confirm_no
Orig F_no
Dest
Orig_time
Dest_time TICKET
Meal
Flight_no
Fare Flight_Date
Ff_miles
Itinerary_no
Seat
Fare_billed*
Query Footprints for AirWest
1. The manager of the Frequent-Flyer program wants to be able to look
at trends in tickets sold and fares charged for each route (orig and
dest combination) associated with the Zip_code, City, and State of
residence of the Frequent Flyer she want this data on a Quarterly and
annual basis.
2. The VP of marketing wants to be able to track tickets sold by the
area_code and State of the person making the reservations She needs
to see this data grouped by the flight number, route, and origin of the
flight on which the ticket is sold. She wants to be able to see trade-
offs between this flight and customer location data on a monthly and
quarterly basis. Note: The state of Residence component will require
us to supplement the data of the transactional database.
3. The Chef wants to be able to track tickets sold by Flight_number and
Origin of the flight on a monthly and annual basis.
Analysis and Design
• Create Star or Snowflake Schema model to meet
requirements
Identify internal and external Sources of needed data
• Analyze data transformation requirements / design
mechanisms for data transformation
– Simple transformation, cleansing and scrubbing,
integration, Aggregation and summarization
• Identify key performance issues and design
structures to maximize performance in key areas
Figure 11-13: Components of a star schema
Fact tables contain
factual or quantitative
data
SALES_FACT
Flight_No
Month
Area_code
Zip_code
Tickets_sold
Fares_collected
FLIGHTS_DIM
Flight_no
Route FF_RES_DIM
Orig Zip_Code
City
State
What causes a Snowflake Design?
• Different groups aggregate a dimension in different
ways.
– For example:
• Marketing aggregates flights in a hierarchy of Flight_no, Route
• Food services aggregates flights in a hierarchy of Flight_no,
Meal by Orig, Orig
– Or
• One hierarchy of time goes from day to week to month to
quarter to year
• Another goes from month to high versus low season
ROUTE_DIM
This creates a Snowflake
MEAL_DIM SEASON_DIM TIME_DIM*
Route Meal Month Month
Orig Season Quarter
Year
FLIGHTS_DIM TIME_DIM
Flight_no Month
SALES_FACT
Flight_No
Day
Area_code
CUST_RES Tickets_sold
Area_Code Fares_collected
State
What Causes a Constellation
Design
• Important facts relating to the subject area are
– Not all available at the same level of detail
– Not all available across every dimension
– For example
• Air west might be able to get estimates of total
ticket_sales (for all airlines) for each route each month.
This fact could not be broken down across the customer
location dimensions, or compared directly to individual
Airwest Flights
This Leads to a Constellation
Schema
FLIGHTS_DIM TIME_PER_DIM
CUST_RES_DIM FF_RES_DIM
Flight_no Month
Area_Code Zip_Code
Route Quarter
State City
Orig Year
State
IND_SALES_FACT SALES_FACT
Route Flight_No
Month Month
Ind_tickets_sold Area_code
Zip_code
Tickets_sold
Fares_collected
ROLAP Versus MOLAP Storage
Structures
ROLAP MOLAP
• 1 Row in dimension • 1 Row in dimension tables
for each distinct value at
tables for each value each level of the
of the most detailed dimension
level of the dimension • All possible levels of
• Higher level aggregation are pre-
aggregations calculated in the fact table
developed by standard • Pointers rather than joins
SQL joins and group are used to link fact and
functions dimension tables
Flights Dimension for ROLAP
The pre-calculation of all possible summarizations can explode the storage requirements of the
MOLAP based data warehouse. The table above shows this explosion effect for our Airwest
Example. In real-life with several more dimensions and levels, and many more distinct values
Of dimensions the storage required could explode to tera-bytes. For this reasons many OLAP
Related software products combine ROLAP and MOLAP characteristics. They pre-calculate
the highest (most frequently used) levels in each dimension and then use ROLAP methods to
calculate lower level summarize as needed.
Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and computer graphics
techniques
• Goals:
– Explain observed events or conditions
– Confirm hypotheses
– Explore data for new or unexpected relationships
• Techniques
– Statistical regression
– Decision tree induction
– Clustering and signal processing
– Affinity
– Sequence association
– Case-based reasoning
– Rule discovery
– Neural nets
– Fractals
• Data visualization – representing data in graphical/multimedia formats for analysis