Data Warehouse

-Yoga Kathirvelu
OBJECTIVES
 What is a Data Warehouse?
 Data Warehouse Architecture
 Data Warehouse Design Considerations
 Data Warehouse Terminologies
 Extraction – Transformation – Loading
 Mining, Business Intelligence & Reporting


Data-Data Everywhere yet…..
 What is data?
 I can’t find the data that I need
   Data scattered all across the network
   Data stored in disparate formats
 I can’t understand the data that I see
   Hard to interpret; need someone to translate
 I can’t use the data that I get
   Different rules implemented across systems
   Missing or inconsistent data
 I don’t get the data when it matters
   Data comes in very late
   Data collection is very time consuming
What the users want
 Data should be integrated across the enterprise
 Data reporting should be uniform irrespective of
how it is stored
 Data should be available when we want it
 Summary data has real value to the organization
 Historical data holds the key to understanding
data over time
 Can we clean, merge and enrich the data???




Enter Data Warehouse…..
Data Warehouse
A single, complete and consistent store of data obtained
from a variety of different sources made available to end
users in a format that they can understand and use in a
business context.
Data Warehousing as a process
 A technique for assembling and managing data
from various sources for the purpose of
answering business questions, thus enabling
decisions that were previously not possible
 Creating a decision support database
maintained separately from the organization’s
operational database

Goals of a Data Warehouse
 It must make an organization’s information
more accessible
 It must make the organization’s information
consistent
 It must be adaptive and resilient to change
 It must be a defender of the organization’s data
 It must serve as a foundation for improved
decision making



OLTP Systems vs Data Warehouse

OLTP Systems                          | Data Warehouse
--------------------------------------|---------------------------------------------
Application oriented                  | Subject oriented
Used to run the business              | Used to analyze the business
Detailed data                         | Summarized & refined data
Current, up-to-date data              | Snapshot data
Repetitive access                     | Ad-hoc access
Clerical user                         | Knowledge user (manager)
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
OLTP Systems vs Data Warehouse (contd.)

OLTP Systems                                     | Data Warehouse
-------------------------------------------------|---------------------------------------------
No data redundancy                               | Redundancy present
Database size: 100 MB – 100 GB                   | Database size: 100 GB – few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users                               | Hundreds of users
Read/update access                               | Mostly read (updates through batch loads)
Performance sensitive                            | Performance relaxed
Data Warehousing Architecture
Source Systems
 OLTP Systems
 Range from Flat files to RDBMS
 Maintain little or no history
 Data Pull or Data Push
 Data Extraction Window
Extraction – Transformation – Loading
 Extraction
 Capture of data from Source Systems
 Important to decide the frequency of Extraction
 Merging
 Bringing data together from different operational
sources.
 Choosing information from each functional system to
populate the single occurrence of the data item in the
warehouse


Extraction – Transformation – Loading
 Conditioning
 The conversion of data types from the source to the
target data store (warehouse) -- typically a relational
database
 E.g. an OLTP date stored as text (DDMMYY); the DW
format is the Oracle DATE type.
 Scrubbing
 Ensuring all data meets the input validation rules
that should have been enforced when the data was
captured by the operational system.
 E.g. the customer’s country should have been
entered in the Country field but was entered in one
of the address fields.
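The conditioning and scrubbing steps above can be sketched in Python. The record layout, field names, and the small reference list of countries below are illustrative assumptions, not part of the slides:

```python
from datetime import datetime

# Hypothetical source record (field names are illustrative).
source_row = {
    "customer_id": "C001",
    "join_date": "150399",            # OLTP date stored as DDMMYY text
    "country": "",                    # should hold the country, but is blank
    "address_line_2": "Chennai, India",
}

KNOWN_COUNTRIES = {"India", "US", "UK"}  # assumed reference list

def condition_date(ddmmyy: str) -> str:
    """Conditioning: convert a DDMMYY text date into an ISO date string."""
    return datetime.strptime(ddmmyy, "%d%m%y").date().isoformat()

def scrub_country(row: dict) -> dict:
    """Scrubbing: if the country field is blank, recover a known country
    name that was mistakenly entered in an address field."""
    if not row["country"]:
        for token in row["address_line_2"].replace(",", " ").split():
            if token in KNOWN_COUNTRIES:
                row["country"] = token
    return row

row = scrub_country(dict(source_row))
row["join_date"] = condition_date(row["join_date"])
```

A real ETL tool would apply many such rules per field; this shows only the two examples from the slide.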

Extraction – Transformation – Loading
 Enrichment
 Bringing in data from external sources to
augment/enrich operational data.
 E.g. currency conversion rates brought in
from external sources.
 Validating
 The process of ensuring that the data captured is
accurate and the transformation process is correct
 E.g. the date of birth of a customer should not be
later than today’s date
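The date-of-birth rule can be expressed as a small validation check. This is a sketch covering only the one rule the slide mentions; a real validation layer would hold many such rules:

```python
from datetime import date

def validate_dob(dob: date, today: date) -> bool:
    """Validation rule from the slide: a customer's date of birth
    must not be later than today's date."""
    return dob <= today

# A past date of birth passes; a future one fails.
ok = validate_dob(date(1985, 6, 1), today=date(2024, 1, 1))
bad = validate_dob(date(2030, 6, 1), today=date(2024, 1, 1))
```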

Extraction – Transformation – Loading
 Loading
 Loading the Extracted and Transformed data into
the Staging Area or the Data Warehouse
 First time bulk load to get the historical data into the
Data Warehouse
 Periodic Incremental loads to bring in modified data
 The Loading window should be as small as possible
 Should be coupled with a strong Error Management
process to capture failures or rejections in the
Loading process
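A minimal sketch of a periodic incremental load with the error management the slide calls for. The row layout, the timestamp-based change detection, and the single validation rule are assumptions for illustration:

```python
warehouse = {}   # target table, keyed by business key
rejects = []     # rejected rows captured for error management

def incremental_load(extracted_rows, last_load_ts):
    """Load only rows modified since the previous load; capture failures
    instead of aborting the whole load."""
    for row in extracted_rows:
        if row["modified_ts"] <= last_load_ts:
            continue                       # unchanged since the last load
        if row["amount"] is None:          # sample validation rule
            rejects.append(row)            # rejection captured, load continues
            continue
        warehouse[row["id"]] = row         # insert or overwrite

incremental_load(
    [{"id": 1, "modified_ts": 5, "amount": 100},
     {"id": 2, "modified_ts": 2, "amount": 200},    # older than last load
     {"id": 3, "modified_ts": 7, "amount": None}],  # fails validation
    last_load_ts=3,
)
```

The first-time bulk load is the same routine run with `last_load_ts` set before all source history.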
ETL Process – Issues & Challenges
 Consumes 70-80% of project time
 Heterogeneous source systems
 Little or no control over source systems
 Scattered source systems working in different time
zones with different currencies
 Different measurement units
 Data not captured by OLTP systems
 Data Quality

Incremental Load vs Complete Refresh
 Complete refresh is required when the data is being
loaded into the DW for the first time
 Subsequent to that, DW should be refreshed with
incremental loads
 Complete refresh or Full Load is too disruptive and not
required if updates since last load can be identified
easily
 Some master data might require only a one-time load into
the DW

When to Refresh?
 Periodically (e.g., every night, every week) or after
significant events
 On every update; not warranted unless DW users
require current data (up to the minute stock quotes)
 Refresh policy set by administrator based on user
needs and traffic
 Different strategies might be required for different
sources
Staging Area
 An intermediate area between the Operational Source
Systems and the data presentation area
 Analogous to the kitchen of a restaurant
 Accessible only to the skilled personnel; no user access
 The structure is closer to the Operational Systems
rather than the DW
 Data arriving at different points in time is merged and
then loaded into the DW
 Usually does not maintain history; only a temporary
area
Data Warehouse Design
 Design of the DW must directly reflect the way
the managers look at the business
 Should capture the important measurements
along with the parameters by which these
measurements are viewed
 It must facilitate data analysis
 The methodology on which the DW is
designed is called Dimensional Modeling
(different from ER Modeling)
ER Modeling
 ER Model views the
components as Entities &
Relationships
 Entities: principal data
object about which
information is collected
 Relationship: association
between two or more entities
 Attributes: smaller pieces of
information within an entity

Dimensional Modeling
 Represents data in a standard framework
 Framework is easily understandable by the end-users
 Contains same information as the ER Model
 Facilitates data retrieval and analysis
 Entities are called Facts and Dimensions
 A generic representation of a dimension model in
which a fact table is joined to a number of dimensions
is called a Star Schema
Star Schema
Fact Table
 The Primary table in a dimensional model where the
numeric performance measurements of the business
are stored
 The most useful facts are numeric and additive
 Each measurement is taken at the intersection of all
the dimensions
 Tend to be deep in terms of number of rows but narrow
in terms of number of columns
 They have a composite primary key consisting of
the foreign keys of the referenced Dimensions

Dimension Table
 Contain textual descriptors of the business
 Fewer rows but more columns than Fact tables
 Joined to the Fact table through a Surrogate Key (the
dimension’s primary key, carried as a foreign key in the Fact)
 Dimension attributes serve as the primary source of query
constraints, groupings and report labels
 Minimize the use of codes by replacing them with verbose
text
 A concatenated piece of text serving as a code should be
broken into its constituent pieces of information
 Contain hierarchical information
 Data stored in a denormalized form
Dimension Table
 Client Dimension

CLIENT KEY | CLIENT ID | CLIENT NAME | CLIENT GROUP CODE | CLIENT GROUP NAME | CLIENT AREA
1          | 100       | ABC LTD.    | 1234              | XYZ LTD.          | A1
2          | 200       | DEF LTD.    | 6789              | RST LTD.          | A1
3          | 300       | GHI LTD.    | 1234              | XYZ LTD.          | A2

 Client Fact

CLIENT KEY | DEBTOR KEY | TIME KEY | CURRENCY KEY | AMOUNT INVESTED | AMOUNT EARNED
1          | 5          | 1        | 100          | 10,000          | 3,000
2          | 6          | 1        | 100          | 20,000          | 7,000
3          | 5          | 1        | 100          | 15,000          | 6,000

 Potential Queries
 Total Amount Earned by Client Group XYZ LTD.
 Total Amount Earned by Clients in Area A1
 Total Amount Invested by Client ABC LTD.
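The star can be rebuilt with the data from the example tables and one of the potential queries answered in SQL. The column names are taken from the slides; the in-memory SQLite database is just a convenient stand-in for a warehouse:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE client_dim (
  client_key INTEGER PRIMARY KEY,    -- surrogate key
  client_id INTEGER, client_name TEXT,
  client_group_code INTEGER, client_group_name TEXT, client_area TEXT);
CREATE TABLE client_fact (
  client_key INTEGER, debtor_key INTEGER, time_key INTEGER,
  currency_key INTEGER, amount_invested INTEGER, amount_earned INTEGER);
""")
con.executemany("INSERT INTO client_dim VALUES (?,?,?,?,?,?)",
    [(1, 100, "ABC LTD.", 1234, "XYZ LTD.", "A1"),
     (2, 200, "DEF LTD.", 6789, "RST LTD.", "A1"),
     (3, 300, "GHI LTD.", 1234, "XYZ LTD.", "A2")])
con.executemany("INSERT INTO client_fact VALUES (?,?,?,?,?,?)",
    [(1, 5, 1, 100, 10000, 3000),
     (2, 6, 1, 100, 20000, 7000),
     (3, 5, 1, 100, 15000, 6000)])

# Total Amount Earned by Client Group XYZ LTD.: constrain on a
# dimension attribute, sum the additive fact.
total = con.execute("""
    SELECT SUM(f.amount_earned)
    FROM client_fact f JOIN client_dim d USING (client_key)
    WHERE d.client_group_name = 'XYZ LTD.'""").fetchone()[0]
```

The other two potential queries differ only in the dimension attribute constrained and the measure summed.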




Surrogate Key
 Integers that are assigned sequentially as needed to
populate a dimension
 Serve to join the Dimension to the Fact table
 Better to use Surrogate Key instead of Natural Key
 They buffer the DW environment from operational
changes
 Operational Codes or Natural Keys might get
reassigned in the Operational Systems
Surrogate Key
 Granularity of the dimension might be different from
the Natural Key
 Natural Keys might not be unique across business
 Better for performance; Natural Keys might be bulky
alphanumeric character strings
 There might not be a Natural Key available in the
source system
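Sequential surrogate-key assignment can be sketched as a simple lookup built during the dimension load. The natural-key values here are illustrative:

```python
key_map = {}   # natural key (from the source system) -> surrogate key

def surrogate_key(natural_key) -> int:
    """Assign the next sequential integer the first time a natural key
    is seen; reuse the same surrogate on every later load."""
    if natural_key not in key_map:
        key_map[natural_key] = len(key_map) + 1
    return key_map[natural_key]

k1 = surrogate_key("CUST-100")
k2 = surrogate_key("CUST-200")
k3 = surrogate_key("CUST-100")   # same client seen again: same surrogate
```

Because the warehouse joins only on `key_map` values, reassigning or reformatting "CUST-100" in the source later would not disturb existing facts.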
Data Marts
 A Data Mart is a collection of subject areas organized for
decision support based on the needs of a given
department
 Finance will have their own Data Mart, Marketing their
own etc.
 Each set of users has its own interpretation of what
its Data Mart should look like
 The database design of a Data Mart is built around a star-
join structure that is optimal for the specific set of users
 A Data Mart generally contains aggregated or summarized
data whereas DW would contain more granular data
Types of Data Marts
 Dependent Data Mart
 A Data Mart whose source is the Data Warehouse
 All dependent Data Marts are loaded from the same
source – the Data Warehouse
 Independent Data Mart
 A Data Mart whose source is the legacy application
environment
 Each independent Data Mart is fed uniquely and
separately by the individual source systems
Dimensions revisited
 Till now we have assumed Dimensions to be independent
of time
 Although dimension attributes are relatively static, they
are not fixed forever
 Business Users might want to track the impact of each and
every attribute change
 We can preserve the independent dimensional structure
with only relatively minor adjustments
 These nearly constant dimensions are called Slowly
Changing Dimensions (SCDs)
 Three basic techniques for maintaining SCDs
SCD – Type 1
 The new information simply overwrites the original
information
 No history is maintained
Before change:
Client Master Key | Client Name | Client Country
1000              | Nunn Mozhi  | India

After change:
Client Master Key | Client Name | Client Country
1000              | Nunn Mozhi  | US
SCD – Type 1
 Advantages
 Easiest technique in terms of implementation
 Disadvantages
 All history will be lost
 Usage
 About 50% of the time
 When to use
 When it is not necessary for the DW to maintain history
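Applied to the slide's example row, a Type 1 change is a plain in-place overwrite (a sketch; a warehouse would do this with an UPDATE against the dimension table):

```python
client = {"client_master_key": 1000, "client_name": "Nunn Mozhi",
          "client_country": "India"}

def scd_type1(row: dict, attribute: str, new_value) -> dict:
    """Type 1: the new value overwrites the old; no history is kept."""
    row[attribute] = new_value   # the old value is simply lost
    return row

scd_type1(client, "client_country", "US")
```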
SCD – Type 2
 A new record is added to the dimension to
represent the new information
 The new record gets its own Primary Key
Before change:
Client Master Key | Client Name | Client Country | Latest Record
1000              | Nunn Mozhi  | India          | Y

After change:
Client Master Key | Client Name | Client Country | Latest Record
1000              | Nunn Mozhi  | India          | N
1001              | Nunn Mozhi  | US             | Y
SCD – Type 2
 Advantages
 Allows us to accurately store history
 Disadvantages
 This will cause the table size to grow fast
 Storage and Performance might become a concern
 Usage
 About 50% of the time
 When to use
 When it is necessary for the DW to maintain history
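A Type 2 change on the same example expires the current row and adds a new row with its own surrogate key, flagged as the latest record (a sketch using an in-memory list for the dimension):

```python
dimension = [{"key": 1000, "name": "Nunn Mozhi",
              "country": "India", "latest": "Y"}]

def scd_type2(dim: list, name: str, new_country: str) -> None:
    """Type 2: keep the old row for history, add a new current row."""
    for row in dim:
        if row["name"] == name and row["latest"] == "Y":
            row["latest"] = "N"                          # expire old version
            new_row = dict(row, country=new_country, latest="Y")
            new_row["key"] = max(r["key"] for r in dim) + 1
            dim.append(new_row)
            return

scd_type2(dimension, "Nunn Mozhi", "US")
```

Facts loaded before the change keep pointing at key 1000 (India); facts loaded after it point at key 1001 (US), which is how history stays accurate.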
SCD – Type 3
 There will be two columns for the particular attribute
of interest: one indicating the original value and one
indicating the current value

Before change:
Client Master Key | Client Name | Original Client Country | Current Client Country | Effective Date
1000              | Nunn Mozhi  | India                   | -                      | 12-Jan-2004

After change:
Client Master Key | Client Name | Original Client Country | Current Client Country | Effective Date
1000              | Nunn Mozhi  | India                   | US                     | 13-Apr-2004
SCD – Type 3
 Advantages
 Does not increase the table size drastically
 Allows us to keep some part of history
 Disadvantages
 Will not be able to keep all history when the value of the
attribute changes more than once
 Usage
 Very rarely used
 When to use
 When the number of attribute changes is finite
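A Type 3 change on the same example row only ever touches the "current" column, which is exactly why intermediate values are lost (a sketch with the dimension row as a dictionary):

```python
client = {"key": 1000, "name": "Nunn Mozhi",
          "original_country": "India", "current_country": "India",
          "effective_date": "12-Jan-2004"}

def scd_type3(row: dict, new_country: str, effective_date: str) -> dict:
    """Type 3: preserve the original value in its own column; only the
    latest change survives in the current column."""
    row["current_country"] = new_country
    row["effective_date"] = effective_date
    return row

scd_type3(client, "US", "13-Apr-2004")
```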
Types of Dimensions
 Conformed Dimension
 A single Dimension referring to more than one Fact
 Exact copy of the same Dimension used in more
than one Data Mart
 When one Dimension is created as a subset of
another existing Dimension
(Diagram: a single CLIENT DIMENSION shared by the TRANSACTION FACT and the DAILY SUMMARY FACT)
Types of Dimensions
 Junk Dimension
 Is a convenient grouping of typically low cardinality
flags and indicators
 Can be used to handle infrequently populated, open
ended comments field sometimes attached to a Fact
row

(Diagram: an OUTCOME DIMENSION with values N/A, ACCEPTED, DECLINED, joined to the MARKETING FACT)
Types of Dimensions
 Degenerate Dimension
 A Dimension Key, such as Transaction Number,
that has no attributes and hence does not join to
an actual dimension table
(Diagram: TRANSACTION FACT with columns CLIENT MASTER KEY, TIME KEY, CURRENCY KEY, TRANSACTION ID, AMOUNT, LAST EXTRACTION DATE; TRANSACTION ID is the degenerate dimension)
Types of Facts
 Factless Fact
 A Fact table that has no facts but captures certain
many-to-many relationship between the dimension
keys
(Diagram: a FACT TABLE holding CLIENT KEY, DEBTOR KEY and a COUNT column that is always = 1, joined to the CLIENT DIMENSION and the DEBTOR DIMENSION)
Dimension Normalization – Snowflaking
 Removing redundant information from the
Dimension and placing it in a separate Dimension
 The two Dimensions are joined by a key called the
Snowflake Key
 The aim is to reduce the total amount of storage needed
for a dimension
 When to Snowflake
 Very large dimensions
 Some attributes not common to all the records
Dimension Normalization – Snowflaking
 Advantages
 Reduces disk space usage
 Easy to maintain
 Disadvantages
 Presentation layer becomes complicated
 Data retrieval time increases
 Might not save much disk space, since Dimensions
take relatively little space compared to Facts
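Snowflaking can be sketched by splitting the earlier client example: the repeating group attributes move to their own table, and the group code acts as the snowflake key (data reused from the Client Dimension slide; the in-memory structures are illustrative):

```python
client_dim = [
    {"client_key": 1, "client_name": "ABC LTD.",
     "group_code": 1234, "group_name": "XYZ LTD."},
    {"client_key": 2, "client_name": "DEF LTD.",
     "group_code": 6789, "group_name": "RST LTD."},
    {"client_key": 3, "client_name": "GHI LTD.",
     "group_code": 1234, "group_name": "XYZ LTD."},
]

# Group attributes now stored once per group, not once per client.
group_dim = {r["group_code"]: r["group_name"] for r in client_dim}

# The snowflaked client dimension keeps only the snowflake key.
snowflaked = [{k: v for k, v in r.items() if k != "group_name"}
              for r in client_dim]
```

Queries against the snowflaked design need an extra join from client to group, which is the retrieval-time cost the disadvantages above refer to.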
Data Mining
 A relatively new data analysis technique
 It is very different from query and reporting
 You do not ask a particular question of the data but
use specific algorithms that analyze data and report
what they discover
 Not done by normal end users; done by specialists
 Used for:
 Statistical Analysis
 Knowledge Discovery
Business Intelligence
 BI is the leveraging of the Data Warehouse to help
make business decisions and recommendations
 Information and data rules engines are leveraged to
help make these decisions along with statistical
analysis tools and data mining tools
 Expensive and a very specialized set of activities
 Not performed by the end users; done by specialists
Tools and Technology
o ETL Tools
o Informatica
o DataStage
o Talend
o Pentaho Kettle
o Reporting Tools
o Business Objects
o Cognos
o Micro strategy

• Modeling Tools
• Erwin
• DB Designer

Queries?