Example

Count attribute
1 Order ID
2 Order Date
3 Order Priority
4 Order Quantity
5 Order Discount
6 Shipping Mode
7 Unit Price
8 Unit Cost
9 Shipping Cost
10 Customer Name
11 Province
12 Customer Type
13 Product Category
14 Product Name
15 Product container
16 Shipping date

This data needs to be loaded into a data warehouse with the following schema:
Data staging.
This step saves the input files to local storage.
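A minimal sketch of the staging step, assuming the input arrives as ordinary files and that "local storage" is a directory on disk. The function name stage_file and the directory layout are hypothetical, chosen only for illustration.

```python
import shutil
from pathlib import Path


def stage_file(source: Path, staging_dir: Path) -> Path:
    """Copy one input file into the local staging directory and return the copy's path."""
    # Create the staging directory on first use.
    staging_dir.mkdir(parents=True, exist_ok=True)
    target = staging_dir / source.name
    # copy2 copies the content and also tries to preserve file metadata such as timestamps.
    shutil.copy2(source, target)
    return target
```

Keeping an untouched copy of each input file means the later stages can be rerun without fetching the source data again.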
Data extraction.
At this stage we remove the data that does not need to be stored in our data warehouse, that is, all
unnecessary attributes such as Order ID, Order Priority, Shipping Mode, Customer Name, Product
Category, and Product container.

Count attribute
1 Order Date
2 Order Quantity
3 Order Discount
4 Unit Price
5 Unit Cost
6 Shipping Cost
7 Province
8 Customer Type
9 Shipping date
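The attribute selection above can be sketched as follows. This is only an illustration: it assumes each input record is available as a dictionary keyed by the attribute names from the table, and the names KEPT_ATTRIBUTES and extract are hypothetical.

```python
# The nine attributes that survive extraction, in the order listed above.
KEPT_ATTRIBUTES = [
    "Order Date",
    "Order Quantity",
    "Order Discount",
    "Unit Price",
    "Unit Cost",
    "Shipping Cost",
    "Province",
    "Customer Type",
    "Shipping date",
]


def extract(record: dict) -> dict:
    """Return a copy of the record that contains only the attributes we keep."""
    return {name: record[name] for name in KEPT_ATTRIBUTES}
```

Listing the attributes to keep, rather than the ones to drop, means that an unexpected extra column in the input is discarded automatically.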

Question:
Why did we remove the product name? It seems highly necessary for the analysis.
I agree that the product name is necessary for the analysis. However, in the last lesson we decided
not to use it in this exercise, to save space.
Question:
Why did we not remove the shipping date? I could not find a shipping date field in the data warehouse
schema.
The shipping date will be used in the data transformation stage to calculate the shipping delay.

Data Transformation
At this stage we need to transform the data so that it has the same format as the schema of the data
warehouse.
Tasks should include:
• Handling missing data. Records with missing values can be removed, or the missing values can be
  replaced. The choice depends on the specifics of the business knowledge.
• Handling dimensional information that exists in the input data but does not exist in the data
  warehouse. For example, the input data may include a customer type with the value
  “international buyer”. This value does not exist in our data warehouse. The question is: should
  we add it or should we ignore it?
  The answer is difficult because it depends on the business requirements. If the dimension
  tables are shared by multiple schemas, then the values of these dimensions are very tightly
  controlled. For example, a customer type value might already exist under a slightly different
  name, such as “International customer”. In this case the two values need to be merged. Or
  maybe our analysis is not interested in international customers because they are very rare. In
  this case the data should be ignored.
• Generating fields that need to be calculated. For example, the field unit profit needs to be
  calculated from existing fields.
In this example, for simplicity, we will ignore all data that has new, unknown dimension values,
except for dates, and we will remove all data with missing values. The resulting data should look
like this; the Note column indicates which fields are replaced and how the new fields are calculated:

Count  attribute         Note
1      Order Date        This field is replaced by an id
1      Order Date ID
2      Order Quantity
3      Order Discount
4      Unit Price
5      Unit Cost
6      Shipping Cost
7      Province          This field is replaced by an id
7      Province id
8      Customer Type     This field is replaced by an id
8      Customer Type ID
9      Ship Date         This field is replaced by shipping delay
9      Shipping delay    Shipping date - Order Date
10     Order total       (Order Quantity * Unit Price + Shipping Cost) * (1 - Order Discount)
11     Unit profit       Unit Price * (1 - Order Discount) - Unit Cost
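
The transformation rules in the table can be sketched as one function per record. This is a rough illustration under several assumptions that the source does not state: dates are ISO strings such as 2010-01-04, the discount is a fraction such as 0.1, and the dimension ids are held in plain dictionaries. The function name transform and the three lookup parameters are hypothetical.

```python
from datetime import date


def transform(record, date_ids, province_ids, customer_type_ids):
    """Return the fact row for one extracted record, or None if the record is dropped."""
    # Remove records with missing values.
    if any(value is None or value == "" for value in record.values()):
        return None
    # Ignore records with an unknown province or customer type.
    if record["Province"] not in province_ids:
        return None
    if record["Customer Type"] not in customer_type_ids:
        return None

    order_date = date.fromisoformat(record["Order Date"])
    ship_date = date.fromisoformat(record["Shipping date"])
    # Dates are the exception: a date not seen before is assigned a new id.
    date_id = date_ids.setdefault(order_date, len(date_ids) + 1)

    quantity = int(record["Order Quantity"])
    discount = float(record["Order Discount"])
    unit_price = float(record["Unit Price"])
    unit_cost = float(record["Unit Cost"])
    shipping_cost = float(record["Shipping Cost"])

    return {
        "Order Date ID": date_id,
        "Order Quantity": quantity,
        "Order Discount": discount,
        "Unit Price": unit_price,
        "Unit Cost": unit_cost,
        "Shipping Cost": shipping_cost,
        "Province id": province_ids[record["Province"]],
        "Customer Type ID": customer_type_ids[record["Customer Type"]],
        # Shipping date - Order Date, expressed in days.
        "Shipping delay": (ship_date - order_date).days,
        "Order total": (quantity * unit_price + shipping_cost) * (1 - discount),
        "Unit profit": unit_price * (1 - discount) - unit_cost,
    }
```

A record with customer type “international buyer” would return None here, because that value is not in the customer type lookup. Merging it with an existing value instead would require an extra mapping step before the lookup.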

Data Loading
This data can now be loaded into the fact table of the data warehouse.
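
As an illustration of the loading step, the sketch below inserts transformed rows into a fact table using SQLite from the Python standard library. The warehouse schema is not reproduced in this text, so the table name sales_fact, the column names, and the use of REAL for every column are all assumptions made for the example.

```python
import sqlite3

# Hypothetical column names, one per field of the transformed data.
FACT_COLUMNS = [
    "order_date_id", "order_quantity", "order_discount", "unit_price",
    "unit_cost", "shipping_cost", "province_id", "customer_type_id",
    "shipping_delay", "order_total", "unit_profit",
]


def create_fact_table(connection):
    """Create the hypothetical fact table if it does not exist yet."""
    # Simplification: every column is declared REAL.
    columns = ", ".join(f"{name} REAL" for name in FACT_COLUMNS)
    connection.execute(f"CREATE TABLE IF NOT EXISTS sales_fact ({columns})")


def load_facts(connection, rows):
    """Insert rows (tuples in FACT_COLUMNS order) into the fact table."""
    placeholders = ", ".join("?" for _ in FACT_COLUMNS)
    sql = f"INSERT INTO sales_fact ({', '.join(FACT_COLUMNS)}) VALUES ({placeholders})"
    # Using the connection as a context manager commits on success
    # and rolls back if an insert fails.
    with connection:
        connection.executemany(sql, rows)
```

Loading all rows in one transaction means a failed load leaves the fact table unchanged instead of partially filled.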
