Data Warehouse

-Yoga Kathirvelu
 What is a Data Warehouse?
 Data Warehouse Architecture
 Data Warehouse Design Considerations
 Data Warehouse Terminologies
 Extraction – Transformation – Loading
 Mining, Business Intelligence & Reporting

Data, Data Everywhere, Yet…
 What is Data?
 I can’t find the data that I need
 Data scattered all across the network
 Data stored in disparate formats
 I can’t understand the data that I see
 How to interpret
 Need someone to translate
 I can’t use the data that I get
 Different rules implemented across
 Missing or inconsistent data
 I don’t get the data when it matters
 Data comes in very late
 Data collection is very time consuming
What the users want
 Data should be integrated across the enterprise
 Data reporting should be uniform irrespective of
how it is stored
 Data should be available when we want it
 Summary data has real value to the organization
 Historical data holds the key to understanding
data over time
 Can we clean, merge and enrich the data???

Enter Data Warehouse…..
Data Warehouse
A single, complete and consistent store of data obtained
from a variety of different sources made available to end
users in a format that they can understand and use in a
business context.
Data Warehousing as a process
 A technique for assembling and managing data
from various sources for the purpose of
answering business questions, thus making
decisions that were previously not possible
 Creating a decision support database
maintained separately from the organization’s
operational database

Goals of a Data Warehouse
 It must make an organization’s information
more accessible
 It must make the organization’s information consistent
 It must be adaptive and resilient to change
 It must be a defender of the organization’s data
 It must serve as a foundation for improved
decision making

OLTP Systems vs Data Warehouse
OLTP Systems                                        Data Warehouse
Application oriented                                Subject oriented
Used to run the business                            Used to analyze the business
Detailed data                                       Summarized and refined data
Current, up-to-date data                            Snapshot data
Repetitive access                                   Ad hoc access
Clerical user                                       Knowledge user (manager)
Few records accessed at a time                      Large volumes accessed at a time
No data redundancy                                  Redundancy present
Database size (100 MB – 100 GB)                     Database size (100 GB and up)
Transaction throughput is the performance metric    Query throughput is the performance metric
Thousands of users                                  Hundreds of users
Read/update access                                  Mostly read (updates through batch loads)
Performance sensitive                               Performance relaxed
Data Warehousing Architecture
Source Systems
 OLTP Systems
 Range from Flat files to RDBMS
 Maintain little or no history
 Data Pull or Data Push
 Data Extraction Window
Extraction – Transformation – Loading
 Extraction
 Capture of data from Source Systems
 Important to decide the frequency of Extraction
 Merging
 Bringing data together from different operational sources
 Choosing information from each functional system to
populate the single occurrence of the data item in the
warehouse

Extraction – Transformation – Loading
 Conditioning
 The conversion of data types from the source to the
target data store (warehouse), which is always a relational
database
 E.g., an OLTP date stored as text (DDMMYY) is converted to
the Oracle DATE type used in the DW.
 Scrubbing
 Ensuring all data meets the input validation rules
which should have been in place when the data was
captured by the operational system.
 E.g., the Customer’s country should have been entered in
the Country field but was entered in one of the address
fields.
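
The conditioning and scrubbing steps above can be sketched in Python. This is only an illustration under stated assumptions: the function names `condition_date` and `scrub_country`, the record layout, and the short list of known countries are hypothetical and do not come from any particular ETL tool.

```python
from datetime import datetime

# Hypothetical reference list used to recognise a country name.
KNOWN_COUNTRIES = {"India", "US", "UK"}


def condition_date(text):
    """Conditioning: convert a DDMMYY text value into a real date."""
    return datetime.strptime(text, "%d%m%y").date()


def scrub_country(record):
    """Scrubbing: move a country found in an address field
    into the empty country field."""
    cleaned = dict(record)
    if not cleaned.get("country"):
        for field in ("address1", "address2"):
            if cleaned.get(field) in KNOWN_COUNTRIES:
                cleaned["country"] = cleaned[field]
                cleaned[field] = ""
                break
    return cleaned


# A made-up source row with the country typed into an address field.
row = {"name": "ABC LTD.", "address1": "12 Main Road", "address2": "India",
       "country": "", "opened": "120104"}
row = scrub_country(row)
row["opened"] = condition_date(row["opened"])
```

In practice the scrubbing rules would be driven by the validation rules of the operational system rather than a hard-coded set.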

Extraction – Transformation – Loading
 Enrichment
 Bring data from external sources to
augment/enrich operational data.
 E.g., currency conversion rates brought in
from external sources.
 Validating
 Process of ensuring that the data captured is
accurate and transformation process is correct
 E.g., a Customer’s Date of Birth should not be
later than today’s date
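
A small Python sketch of the enrichment and validation steps, assuming a hypothetical record layout. The conversion rates are made-up values standing in for an external feed, and `enrich_with_usd` and `validate_birth_date` are illustrative names.

```python
from datetime import date

# Hypothetical conversion rates that would come from an external source.
RATES_TO_USD = {"USD": 1.0, "INR": 0.0125}


def enrich_with_usd(record, rates):
    """Enrichment: add an amount_usd field using external rates."""
    enriched = dict(record)
    enriched["amount_usd"] = record["amount"] * rates[record["currency"]]
    return enriched


def validate_birth_date(record, today):
    """Validation: return a list of errors; empty means the record passed."""
    errors = []
    if record["date_of_birth"] > today:
        errors.append("date_of_birth is in the future")
    return errors


today = date(2024, 1, 1)
good = {"amount": 8000, "currency": "INR", "date_of_birth": date(1980, 5, 17)}
bad = {"amount": 100, "currency": "USD", "date_of_birth": date(2030, 1, 1)}

good_enriched = enrich_with_usd(good, RATES_TO_USD)
good_errors = validate_birth_date(good, today)
bad_errors = validate_birth_date(bad, today)
```

Passing `today` in as a parameter keeps the check repeatable when a load is re-run for an earlier date.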

Extraction – Transformation – Loading
 Loading
 Loading the Extracted and Transformed data into
the Staging Area or the Data Warehouse
 First time bulk load to get the historical data into the
Data Warehouse
 Periodic Incremental loads to bring in modified data
 The Loading window should be as small as possible
 Should be combined with a strong error-management
process to capture failures or rejections in the
loading process
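
One way to pair loading with error management is to route rejected rows to a separate log together with a reason. The sketch below assumes two hypothetical rejection rules (a missing key and a negative amount); real rules depend on the warehouse design.

```python
def load_rows(rows, target, rejects):
    """Append valid rows to target; record rejected rows with a reason."""
    for row in rows:
        if row.get("client_key") is None:
            rejects.append({"row": row, "reason": "missing client_key"})
        elif row.get("amount", 0) < 0:
            rejects.append({"row": row, "reason": "negative amount"})
        else:
            target.append(row)
    return len(target), len(rejects)


warehouse, reject_log = [], []
batch = [
    {"client_key": 1, "amount": 10000},
    {"client_key": None, "amount": 500},
    {"client_key": 2, "amount": -20},
]
loaded, rejected = load_rows(batch, warehouse, reject_log)
```

Keeping the rejected rows, rather than silently dropping them, lets the failures be corrected and reloaded in a later window.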
ETL Process – Issues & Challenges
 Consumes 70-80% of project time
 Heterogeneous source systems
 Little or no control over source systems
 Scattered source systems working in different time
zones and using different currencies
 Different measurement units
 Data not captured by OLTP systems
 Data Quality

Incremental Load vs Complete Refresh
 Complete refresh is required when the data is being
loaded into the DW for the first time
 Subsequent to that, DW should be refreshed with
incremental loads
 Complete refresh or Full Load is too disruptive and not
required if updates since last load can be identified
 Some master data might require only a one-time load into
the DW
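
The difference between the two load styles can be sketched as follows. This assumes each source row carries a `modified_at` timestamp, which is an assumption: many source systems do not record one, and then changes must be identified another way.

```python
from datetime import datetime


def incremental_extract(source_rows, last_load_time):
    """Return only the rows modified after the previous load."""
    return [r for r in source_rows if r["modified_at"] > last_load_time]


def apply_to_warehouse(warehouse, changed_rows):
    """Insert new rows or overwrite existing ones, keyed by id."""
    for r in changed_rows:
        warehouse[r["id"]] = r


source = [
    {"id": 1, "name": "ABC LTD.", "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "DEF LTD.", "modified_at": datetime(2024, 1, 5)},
]
warehouse = {}

# Complete refresh: the first load takes everything.
apply_to_warehouse(warehouse, source)
last_load = datetime(2024, 1, 5)

# Later, one source row changes and a new one appears.
source[0] = {"id": 1, "name": "ABC Limited", "modified_at": datetime(2024, 1, 9)}
source.append({"id": 3, "name": "GHI LTD.", "modified_at": datetime(2024, 1, 10)})

# Incremental load: only the two changed rows are moved.
changed = incremental_extract(source, last_load)
apply_to_warehouse(warehouse, changed)
```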

When to Refresh?
 Periodically (e.g., every night, every week) or after
significant events
 On every update; not warranted unless DW users
require current data (up to the minute stock quotes)
 Refresh policy set by administrator based on user
needs and traffic
 Different strategies might be required for different sources
Staging Area
 An intermediate area between the Operational Source
Systems and the data presentation area
 Analogous to the kitchen of a restaurant
 Accessible only to the skilled personnel; no user access
 The structure is closer to the Operational Systems
rather than the DW
 Data arriving at different points in time is merged and
then loaded into the DW
 Usually does not maintain history; only a temporary store
Data Warehouse Design
 Design of the DW must directly reflect the way
the managers look at the business
 Should capture the important measurements
along with the parameters by which these
measurements are viewed
 It must facilitate data analysis
 The methodology on which the DW is
designed is called Dimensional Modeling
(different from ER Modeling)
ER Modeling
 ER Model views the
components as Entities &
Relationships
 Entities: principal data
object about which
information is collected
 Relationship: association
between two or more entities
 Attributes: smaller pieces of
information within an entity

Dimensional Modeling
 Represents data in a standard framework
 Framework is easily understandable by the end-users
 Contains same information as the ER Model
 Facilitates data retrieval and analysis
 Entities are called Facts and Dimensions
 A generic representation of a dimension model in
which a fact table is joined to a number of dimensions
is called a Star Schema
Star Schema
Fact Table
 The Primary table in a dimensional model where the
numeric performance measurements of the business
are stored
 The most useful facts are numeric and additive
 Each measurement is taken at the intersection of all
the dimensions
 Tend to be deep in terms of the number of rows but narrow
in terms of the number of columns
 They have Composite Primary Keys, which consist of
all the Foreign Keys of the referenced Dimensions

Dimension Table
 Contain textual descriptors of the business
 Fewer rows but more columns than a Fact table
 Linked to the Fact table through a Surrogate Key, which the Fact table stores as a Foreign Key
 Dimension attributes serve as the primary source of query
constraints, groupings and report labels
 Minimize the use of codes by replacing them with verbose
text
 A concatenated piece of text serving as a code should be
broken into its constituent pieces of information
 Contain hierarchical information
 Data stored in a denormalized form
Dimension Table
 Client Dimension
1   100   ABC LTD.   1234   XYZ LTD.   A1
2   200   DEF LTD.   6789   RST LTD.   A1
3   300   GHI LTD.   1234   XYZ LTD.   A2

 Client Fact
1   5   1   100   10,000   3,000
2   6   1   100   20,000   7,000
3   5   1   100   15,000   6,000

 Potential Queries
 Total Amount Earned by Client Group XYZ Ltd.
 Total Amount Earned by Clients in Area A1
 Total Amount Invested by Client ABC Ltd.
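
Queries of this kind join the fact table to the dimension and constrain on a dimension attribute. The sketch below uses Python's built-in `sqlite3` module. The column names are assumptions, because the slide's column headers are not available, and the fact rows are simplified to one row per client, so the totals are illustrative rather than the slide's own figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_dim (
    client_key   INTEGER PRIMARY KEY,  -- surrogate key
    client_code  INTEGER,
    client_name  TEXT,
    group_code   INTEGER,
    group_name   TEXT,
    area         TEXT
);
CREATE TABLE client_fact (
    client_key      INTEGER REFERENCES client_dim(client_key),
    amount_invested INTEGER,
    amount_earned   INTEGER
);
""")
conn.executemany("INSERT INTO client_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 100, "ABC LTD.", 1234, "XYZ LTD.", "A1"),
    (2, 200, "DEF LTD.", 6789, "RST LTD.", "A1"),
    (3, 300, "GHI LTD.", 1234, "XYZ LTD.", "A2"),
])
conn.executemany("INSERT INTO client_fact VALUES (?, ?, ?)", [
    (1, 10000, 3000),
    (2, 20000, 7000),
    (3, 15000, 6000),
])


def total(measure, attribute, value):
    """Sum one fact measure, constrained by one dimension attribute."""
    # measure and attribute are fixed names chosen in this script,
    # not user input, so formatting them into the SQL is safe here.
    sql = (f"SELECT SUM(f.{measure}) FROM client_fact f "
           f"JOIN client_dim d ON d.client_key = f.client_key "
           f"WHERE d.{attribute} = ?")
    return conn.execute(sql, (value,)).fetchone()[0]


earned_by_group = total("amount_earned", "group_name", "XYZ LTD.")
earned_in_area = total("amount_earned", "area", "A1")
invested_by_client = total("amount_invested", "client_name", "ABC LTD.")
```

All three questions use the same join; only the dimension attribute in the WHERE clause changes, which is what makes a star schema easy to query.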

Surrogate Key
 Integers that are assigned sequentially as needed to
populate a dimension
 Serve to join the Dimension to the Fact table
 Better to use Surrogate Key instead of Natural Key
 They buffer the DW environment from operational changes
 Operational Codes or Natural Keys might get
reassigned in the Operational Systems
Surrogate Key
 Granularity of the dimension might be different from
the Natural Key
 Natural Keys might not be unique across business
 Better for performance; Natural Keys might be bulky
alphanumeric character strings
 There might not be a Natural Key available in the
source system
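
A minimal sketch of sequential surrogate key assignment, assuming an in-memory lookup. The class name `SurrogateKeyGenerator`, the source system names, and the natural key format are hypothetical; a real warehouse would usually persist this mapping in a key lookup table.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hand out sequential integer keys, one per distinct natural key."""

    def __init__(self, start=1):
        self._next = count(start)
        self._by_natural_key = {}

    def key_for(self, source_system, natural_key):
        # Natural keys may repeat across source systems, so the lookup
        # uses the (system, key) pair.
        lookup = (source_system, natural_key)
        if lookup not in self._by_natural_key:
            self._by_natural_key[lookup] = next(self._next)
        return self._by_natural_key[lookup]


keys = SurrogateKeyGenerator()
a = keys.key_for("billing", "CUST-0001-IN")
b = keys.key_for("crm", "CUST-0001-IN")      # same code, different system
c = keys.key_for("billing", "CUST-0001-IN")  # seen before, same key returned
```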
Data Marts
 A Data Mart is a collection of subject areas organized for
decision support based on the needs of a given
department
 Finance will have their own Data Mart, Marketing their
own etc.
 Each set of Users has its own interpretation of what
its Data Mart should look like
 The Database design of a Data Mart is built around a
star-join structure that is optimal for the specific set of users
 A Data Mart generally contains aggregated or summarized
data whereas DW would contain more granular data
Types of Data Marts
 Dependent Data Mart
 A Data Mart whose source is the Data Warehouse
 All dependent Data Marts are loaded from the same
source – the Data Warehouse
 Independent Data Mart
 A Data Mart whose source is the legacy application
 Each independent Data Mart is fed uniquely and
separately by the individual source systems
Dimensions revisited
 Till now we have assumed Dimensions to be independent
of time
 Although Dimension attributes are relatively static, they are not fixed
 Business Users might want to track the impact of each and
every attribute change
 We can preserve the independent dimensional structure
with only relatively minor adjustments
 These nearly constant dimensions are called Slowly
Changing Dimensions (SCDs)
 3 basic techniques for maintaining SCDs
SCD – Type 1
 The new information simply overwrites the original
 No history is maintained
Before Change:
Client Master Key   Client Name   Client Country
1000                Nunn Mozhi    India
After Change:
Client Master Key   Client Name   Client Country
1000                Nunn Mozhi    US
SCD – Type 1
 Advantages
 Easiest technique in terms of implementation
 Disadvantages
 All history will be lost
 Usage
 About 50% of the time
 When to use
 When it is not necessary for the DW to maintain history
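
A minimal Python sketch of the Type 1 behaviour, assuming the dimension is held as a dictionary keyed by Client Master Key. The function name `apply_type1` and the field names are illustrative.

```python
def apply_type1(dimension, key, changes):
    """Overwrite attributes in place; the old values are not kept."""
    dimension[key].update(changes)


client_dim = {1000: {"client_name": "Nunn Mozhi", "client_country": "India"}}
apply_type1(client_dim, 1000, {"client_country": "US"})
```

After the call there is still exactly one row for key 1000, and nothing records that the country was ever India.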
SCD – Type 2
 A new record is added to the dimension to
represent the new information
 The new record gets its own Primary Key
Before Change:
Client Master Key   Client Name   Client Country   Latest Record
1000                Nunn Mozhi    India            Y
After Change:
Client Master Key   Client Name   Client Country   Latest Record
1000                Nunn Mozhi    India            N
1001                Nunn Mozhi    US               Y
SCD – Type 2
 Advantages
 Allows us to accurately store history
 Disadvantages
 This will cause the table size to grow fast
 Storage and Performance might become a concern
 Usage
 About 50% of the time
 When to use
 When it is necessary for the DW to maintain history
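
A minimal Python sketch of the Type 2 behaviour. It assumes a hypothetical `client_id` column holding the natural key from the source system, which is needed to find the row being superseded; the slide's table does not show one.

```python
def apply_type2(dimension, natural_id, changes):
    """Expire the latest row for natural_id and append a new latest row."""
    current = next(r for r in dimension
                   if r["client_id"] == natural_id and r["latest"] == "Y")
    current["latest"] = "N"
    new_key = max(r["client_master_key"] for r in dimension) + 1
    # Copy the old row, apply the changes, then set the new key and flag.
    new_row = {**current, **changes,
               "client_master_key": new_key, "latest": "Y"}
    dimension.append(new_row)
    return new_row


client_dim = [
    {"client_master_key": 1000, "client_id": "C-17",
     "client_name": "Nunn Mozhi", "client_country": "India", "latest": "Y"},
]
added = apply_type2(client_dim, "C-17", {"client_country": "US"})
```

Fact rows loaded before the change keep pointing at key 1000, so history stays attached to the old country, while new fact rows use key 1001.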
SCD – Type 3
 Two columns hold the attribute of interest:
one for the original value and
one for the current value
Before Change:
Client Master Key   Client Name   Original Client Country   Current Client Country
1000                Nunn Mozhi    India                                              12-Jan-2004
After Change:
Client Master Key   Client Name   Original Client Country   Current Client Country
1000                Nunn Mozhi    India                     US                       13-Apr-2004
SCD – Type 3
 Advantages
 Does not increase the table size drastically
 Allows us to keep some part of history
 Disadvantages
 Will not be able to keep all history when the value of the
attribute changes more than once
 Usage
 Very rarely used
 When to use
 When the number of attribute changes is finite
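
A minimal Python sketch of the Type 3 behaviour, including its main limitation. The field names (`original_country`, `current_country`, `changed_on`) are assumptions, the current value is assumed to start out equal to the original, and the second change to "UK" is invented purely to show what is lost.

```python
def apply_type3(row, attribute, new_value, changed_on):
    """Keep the original value and overwrite only the current value."""
    row[f"current_{attribute}"] = new_value
    row["changed_on"] = changed_on


client = {"client_master_key": 1000, "client_name": "Nunn Mozhi",
          "original_country": "India", "current_country": "India",
          "changed_on": "12-Jan-2004"}

apply_type3(client, "country", "US", "13-Apr-2004")
after_first_change = dict(client)

# A second change overwrites "US"; only the original and the latest survive.
apply_type3(client, "country", "UK", "01-Jun-2005")
```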
Types of Dimensions
 Conformed Dimension
 A single Dimension referring to more than one Fact
 Exact copy of the same Dimension used in more
than one Data Mart
 When one Dimension is created as a subset of
another existing Dimension
Types of Dimensions
 Junk Dimension
 Is a convenient grouping of typically low cardinality
flags and indicators
 Can be used to handle infrequently populated, open
ended comments field sometimes attached to a Fact
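
One common way to build a junk dimension is to generate every combination of the low-cardinality flags once and give each combination a key. The sketch below assumes three hypothetical flags; the flag names and values are illustrative only.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise sit on the fact.
flags = {
    "payment_type": ["cash", "card"],
    "gift_wrapped": ["Y", "N"],
    "rush_order": ["Y", "N"],
}
names = list(flags)

# One junk dimension row per combination of flag values.
junk_dim = [
    dict(zip(names, combo), junk_key=i)
    for i, combo in enumerate(product(*flags.values()), start=1)
]

# The fact row then stores a single junk_key instead of three flag columns.
lookup = {tuple(row[n] for n in names): row["junk_key"] for row in junk_dim}
key_for_transaction = lookup[("card", "N", "Y")]
```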

Types of Dimensions
 Degenerate Dimension
 A Dimension Key, such as Transaction Number,
that has no attributes and hence does not join to
an actual dimension table
Types of Facts
 Factless Fact
 A Fact table that has no facts but captures certain
many-to-many relationships between the dimensions
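
A factless fact table can be sketched with a hypothetical attendance example, where each row records only that a student attended a course on a date. The table and key names are assumptions made for illustration.

```python
from collections import Counter

# Each row holds only dimension keys: (date_key, student_key, course_key).
attendance_fact = [
    (20240101, 1, 10),
    (20240101, 2, 10),
    (20240102, 1, 10),
    (20240102, 1, 11),
]

# With no measures to sum, analysis is done by counting rows.
attendance_per_course = Counter(course for _, _, course in attendance_fact)

# The rows also expose the many-to-many link between students and courses.
courses_per_student = {
    student: {c for _, s, c in attendance_fact if s == student}
    for _, student, _ in attendance_fact
}
```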
Dimension Normalization – Snowflaking
 Removing the redundant information from the
Dimension and placing it in a separate Dimension
 These two Dimensions are joined by a key called the
Snowflake Key
 Aim is to reduce the total amount of storage needed
for a dimension
 When to Snowflake
 Very large dimensions
 Some attributes not common to all the records
Dimension Normalization – Snowflaking
 Advantages
 Reduces disk space usage
 Easy to maintain
 Disadvantages
 Presentation layer becomes complicated
 Data retrieval time increases
 Might not save much disk space, since Dimensions
take up far less space than Facts
Data Mining
 A relatively new data analysis technique
 It is very different from query and reporting
 You do not ask a particular question of the data but
use specific algorithms that analyze data and report
what they discover
 Not done by normal end users; done by specialists
 Used for:
 Statistical Analysis
 Knowledge Discovery
Business Intelligence
 BI is the leveraging of the Data Warehouse to help
make business decisions and recommendations
 Information and data rules engines are leveraged to
help make these decisions along with statistical
analysis tools and data mining tools
 An expensive and very specialized set of activities
 Not performed by the end users; done by specialists
Tools and Technology
 ETL Tools
   Informatica
   DataStage
   Talend
   Pentaho Kettle
 Reporting Tools
   Business Objects
   Cognos
   MicroStrategy
 Modeling Tools
   ERwin
   DB Designer

Queries?