What is a Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

Subject Oriented
Relevant data about a subject is gathered and stored as a single set in a useful format. Thus a data warehouse is organized around major subjects (e.g. Customers, Products, and Sales) and excludes data that is not useful in the decision support process.

Integrated
Data is gathered into the data warehouse from a variety of sources (relational databases, flat files, legacy systems). The data warehouse provides a mechanism to store this data in a globally accepted fashion with consistent naming conventions, measurements, encoding structures, and physical attributes, even when the underlying operational systems store the data differently.
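As a minimal sketch of what "consistent encoding structures and measurements" can mean in practice, the snippet below merges rows from two hypothetical source systems into one warehouse format. The source layouts, the gender codes, the field names, and the choice of centimetres as the warehouse unit are all assumptions made for illustration; none of them come from the text above.

```python
# Hypothetical encodings used by different operational systems for one attribute.
GENDER_MAP = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}


def standardize_gender(raw):
    """Map a source-specific gender code to the single warehouse encoding."""
    return GENDER_MAP.get(str(raw).strip().lower(), "U")  # "U" means unknown


def inches_to_cm(inches):
    """Convert a measurement to the unit the warehouse has standardized on."""
    return round(inches * 2.54, 2)


# Two assumed sources that store the same facts differently.
source_a = [{"cust": "A1", "gender": "male", "height_in": 70}]
source_b = [{"cust": "B7", "gender": 0, "height_cm": 165.0}]

integrated = []
for row in source_a:
    integrated.append({
        "customer_id": row["cust"],
        "gender": standardize_gender(row["gender"]),
        "height_cm": inches_to_cm(row["height_in"]),
    })
for row in source_b:
    integrated.append({
        "customer_id": row["cust"],
        "gender": standardize_gender(row["gender"]),
        "height_cm": row["height_cm"],
    })
```

After the load, every row uses the same column names, the same gender codes, and the same unit, regardless of which system it came from.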

Time-variant
All data in the data warehouse is identified with a particular time period. Data is stored to provide information from a historical perspective.

Non-volatile
This means the data warehouse is read-only for its users. Data is stable in a data warehouse: more data is added, but existing data is never removed. This enables management to gain a consistent picture of the business.

Common terms in data warehousing


Dimensions - The dimensions, or dimension tables, are the tables in the data warehouse that the fact tables are loaded against. Dimensions are populated from multiple sources, so surrogate keys must be generated for them (these keys are independent of any changes made to the business logic). Once loaded, the dimensions act as the source for the fact tables.

Surrogate Keys
A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key. Surrogate keys are keys that are maintained within the data warehouse instead of the natural keys taken from source data systems. The surrogate keys basically serve to join the dimension tables to the fact table. Surrogate keys serve as an important means of identifying each instance or entity inside a dimension table.
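The idea can be sketched as a small key generator that hands out a warehouse-owned integer for each natural key it sees. This is only an illustration: the class name `SurrogateKeyGenerator`, the source system names, and the sample natural keys are hypothetical, and real warehouses usually rely on a database sequence or identity column instead.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Assigns one stable integer key per (source system, natural key) pair."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}  # (source_system, natural_key) -> surrogate key

    def key_for(self, source_system, natural_key):
        lookup = (source_system, natural_key)
        if lookup not in self._keys:
            self._keys[lookup] = next(self._next)
        return self._keys[lookup]


gen = SurrogateKeyGenerator()

# Assumed incoming rows: (source system, natural key, customer name).
incoming = [
    ("crm", "C-100", "Acme"),
    ("billing", "100", "Globex"),
    ("crm", "C-100", "Acme"),  # repeated natural key, must reuse its key
]

customer_dim = {}
for source_system, natural_key, name in incoming:
    sk = gen.key_for(source_system, natural_key)
    customer_dim[sk] = {"source": source_system, "natural_key": natural_key, "name": name}
```

Fact rows would then store only the integer `sk`, so a change in how a source system formats its own keys does not ripple into the fact table.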

Slowly Changing Dimensions


Dimensions whose attributes change only occasionally over time are called Slowly Changing Dimensions. An example is a change in the name of a client. Slowly Changing Dimensions are often categorized into three types, namely Type 1, Type 2, and Type 3.

Type1 SCD Dimension


Type 1: Overwriting the old values. The old values are overwritten with the new ones, so no history is kept.

Type2 SCD Dimension


Type 2: Creating an additional record. The old values are not replaced; instead, a new row containing the new values is added to the table.

Type3 SCD Dimension


Type 3: Creating new fields. A new column is added so that the latest value is stored alongside the previous one, and the most recent change can be seen.
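The three types can be compared side by side on a single customer row. The following is a rough sketch, not a production loader: the column names (`valid_from`, `valid_to`, `is_current`, `previous_name`), the sample customer, and the helper `new_dim` are assumptions chosen for illustration.

```python
from datetime import date


def scd_type1(rows, customer_id, new_name):
    """Type 1: overwrite the old value in place; history is lost."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["name"] = new_name


def scd_type2(rows, customer_id, new_name, change_date):
    """Type 2: close the current row and append a new row with the new value."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = change_date
    rows.append({
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "customer_id": customer_id,
        "name": new_name,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    })


def scd_type3(rows, customer_id, new_name):
    """Type 3: keep the prior value in an extra column next to the current one."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_name"] = row["name"]
            row["name"] = new_name


def new_dim():
    """Hypothetical starting state: one customer, one row."""
    return [{
        "surrogate_key": 1, "customer_id": "C-100", "name": "Acme Ltd",
        "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True,
    }]


t1 = new_dim()
scd_type1(t1, "C-100", "Acme Corp")

t2 = new_dim()
scd_type2(t2, "C-100", "Acme Corp", date(2024, 6, 1))

t3 = new_dim()
scd_type3(t3, "C-100", "Acme Corp")
```

Type 1 leaves one row with only the new name, Type 2 leaves two rows (full history), and Type 3 leaves one row that remembers exactly one previous value.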

Facts
Fact tables refer to the dimensions through primary key-foreign key relationships, and a single fact refers to multiple dimensions. A fact table contains only the foreign keys that refer to the dimension tables, plus amount fields and aggregated fields.

Factless Fact - This is a table that contains just the foreign keys from the dimensions and does not contain any amount or aggregated fields.

Dirty Dimensions - Dimension tables whose values change very frequently, so that the dimension table needs to be updated frequently.

Junk Dimensions - In these notes, dimensions that are already loaded but are not used by any of the fact tables are called junk dimensions. The term is more commonly used for a single dimension that groups miscellaneous low-cardinality flags and indicators.

Types of Facts
Additive Facts
Semi-additive Facts
Non-additive Facts

Additive Fact
Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table. Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer with a fact table that records sales by day, store, and product. Sales Amount can be summed up along all three dimensions, so it is an additive fact.

Semi-Additive
The second example assumes that we are a bank. The purpose of its fact table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all current balances for a given account for each day of the month does not give us any useful information).

Non Additive Facts


Profit_Margin is a non-additive fact: it does not make sense to add it up at either the account level or the day level.
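The difference between the fact types shows up in which roll-ups are allowed. Below is a small sketch over a made-up daily snapshot of two accounts; the row values and the function names are hypothetical. Current_Balance is summed across accounts, but across time the latest value is taken instead of a sum.

```python
from collections import defaultdict

# Hypothetical snapshot rows: (day, account, current_balance, profit_margin).
snapshot = [
    ("2024-01-01", "A", 100.0, 0.10),
    ("2024-01-01", "B", 200.0, 0.20),
    ("2024-01-02", "A", 150.0, 0.10),
    ("2024-01-02", "B", 250.0, 0.30),
]


def total_balance_by_day(rows):
    """Valid roll-up: a semi-additive fact may be summed across accounts."""
    totals = defaultdict(float)
    for day, _account, balance, _margin in rows:
        totals[day] += balance
    return dict(totals)


def closing_balance_by_account(rows):
    """Across time, take the latest balance per account instead of a sum."""
    latest = {}
    for _day, account, balance, _margin in sorted(rows):  # ISO dates sort by time
        latest[account] = balance
    return latest


# Profit_Margin is deliberately not rolled up: a ratio has to be recomputed
# from its underlying amounts, which this table does not hold.
by_day = total_balance_by_day(snapshot)
closing = closing_balance_by_account(snapshot)
```

Summing account A's balance over the two days would give 250.0, a number that corresponds to nothing in the bank, which is why the time roll-up uses the closing balance.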

Types of Fact Tables


Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.

Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Conformed Dimensions
Dimensions that are used by more than one fact table as their source are called conformed dimensions.

Degenerate Dimensions
This is dimension data stored within the fact table itself. This is done so that the data can be traced back to the source system, for example through Source_customer_number.
Data warehouse
DWH: Source_customer_number, Customer_number, Date, Sales Order Amount, Promise Date, Zip Code
Customer: Customer_Number, Customer Name, Country, State, City, Zip Code

Source
Source: Customer_Number, Customer_name, Country, State, City, Zip Code

Dimensional Modeling
Star Schema: A model in which the dimensions are not subdivided is called a star schema.

Snowflake Schema
A model in which the dimensions are subdivided into a hierarchy is called a snowflake schema.
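The two layouts can be compared with a small in-memory SQLite sketch. All table and column names here (`dim_product_star`, `dim_product_snow`, `dim_category`, `fact_sales`) and the sample rows are hypothetical. The star version keeps the category name inside the product dimension; the snowflake version moves it into its own table, at the cost of one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one flat product dimension, category stored inline.
CREATE TABLE dim_product_star (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: the category level is split into its own table.
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_key INTEGER PRIMARY KEY, product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key));

-- The same fact table works with either dimension layout.
CREATE TABLE fact_sales (product_key INTEGER, sales_amount REAL);

INSERT INTO dim_product_star VALUES
    (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Lamp', 'Home');
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Home');
INSERT INTO dim_product_snow VALUES (1, 'Pen', 1), (2, 'Notebook', 1), (3, 'Lamp', 2);
INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 50.0);
""")

# Star schema: a single join from fact to dimension.
star = conn.execute("""
    SELECT d.category_name, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product_star d ON d.product_key = f.product_key
    GROUP BY d.category_name ORDER BY d.category_name
""").fetchall()

# Snowflake schema: an extra join to walk up the hierarchy.
snow = conn.execute("""
    SELECT c.category_name, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()
```

Both queries return the same totals per category; the trade-off is between the simpler joins of the star layout and the reduced repetition of category names in the snowflake layout.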

Four Step Process to Design data warehouse


1. Identify Business Process
2. Decide Granularity
3. Identify Dimensions
4. Identify Facts