• Choosing the process: selecting the subjects from the information packages for the first set of logical
structures to be designed.
• Choosing the grain: determining the level of detail for the data in the data structures.
• Choosing the business dimensions (such as product, market, and time) to be included in the first set of
structures, and making sure that each data element in every business dimension is identified and conformed
across the structures.
• Choosing the facts: selecting the metrics or units of measurement (such as product sale units, dollar sales,
and dollar revenue) to be included in the first set of structures.
• Choosing the duration of the database: determining how far back in time you should go for historical data.
Dimensional Modeling Basics
• Dimensional modeling gets its name from the business dimensions we
need to incorporate into the logical data model.
• It is a logical design technique to structure the business dimensions and
the metrics that are analyzed along these dimensions.
• This modeling technique is intuitive for that purpose. The model has also
proved to provide high performance for queries and analysis.
Here are some of the criteria for combining the tables into a dimensional model.
• Dimensions: the logically related attributes along which the data is modeled.
[Star schema for sales, with the Sales Fact Table at the center:]
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_type
• branch dimension: branch_key, branch_name, branch_type
• location dimension: location_key, street, city, state_or_province, country
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• Each dimension in a star schema is represented with only one dimension
table.
• This dimension table contains a set of attributes.
• The following diagram shows the sales data of a company with respect to
the four dimensions, namely time, item, branch, and location.
• There is a fact table at the centre. It contains the keys to each of the four dimensions.
• The fact table also contains the measures, namely dollars sold and units sold.
• Each dimension has only one dimension table, and each table holds a set of
attributes. For example, the location dimension table contains the attribute set
{location_key, street, city, state_or_province, country}. This constraint may cause
data redundancy. For example, "Vancouver" and "Victoria" are both cities in
the Canadian province of British Columbia, so the entries for such cities cause
data redundancy along the attributes state_or_province and country.
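A small sketch of that redundancy, with made-up rows (the street values are hypothetical):

location_dim = [
    # (location_key, street, city, state_or_province, country)
    (1, "101 Main St",   "Vancouver", "British Columbia", "Canada"),
    (2, "22 Harbour Rd", "Victoria",  "British Columbia", "Canada"),
]
# "British Columbia" and "Canada" repeat on every row for a city in that
# province; a star schema accepts this repetition for the sake of simple joins.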
ADVANTAGES
SNOWFLAKE SCHEMA
• Some dimension tables in the Snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are
normalized. For example, the item dimension table of the star schema is
normalized and split into two dimension tables, namely item and supplier.
• Now the item dimension table contains the attributes item_key,
item_name, type, brand, and supplier_key.
• The supplier_key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.
• Note: due to the normalization in the snowflake schema, redundancy is
reduced; the tables therefore become easier to maintain and save storage
space.
Example of Snowflake Schema
[Snowflake schema for sales, with the Sales Fact Table at the center:]
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_key
• supplier dimension: supplier_key, supplier_type
• branch dimension: branch_key, branch_name, branch_type
• location dimension: location_key, street, city_key
• city dimension: city_key, city, state_or_province, country
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
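A minimal sketch of the normalization step, reusing the hypothetical rows from the star-schema example above: the repeated province/country values move into a city table referenced by city_key.

city_dim = {
    # city_key: (city, state_or_province, country)
    10: ("Vancouver", "British Columbia", "Canada"),
    11: ("Victoria",  "British Columbia", "Canada"),
}
location_dim = [
    # (location_key, street, city_key) -- province and country are no longer
    # repeated here; they are reached by joining through city_key
    (1, "101 Main St",   10),
    (2, "22 Harbour Rd", 11),
]
# Resolving a location now takes one extra lookup (the snowflake trade-off):
street, city_key = location_dim[0][1], location_dim[0][2]
print(street, *city_dim[city_key])  # 101 Main St Vancouver British Columbia Canada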
ADVANTAGES/DISADVANTAGES
FACT CONSTELLATION SCHEMA
• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars sold and units sold.
• It is also possible to share dimension tables between fact tables. For example, the time,
item, and location dimension tables are shared between the sales and shipping fact tables.
Example of Fact Constellation
[Fact constellation: the sales and shipping fact tables share the time, item, and location dimensions. The Sales Fact Table is as in the star schema above; the Shipping Fact Table carries time_key, item_key, shipper_key, from_location, and to_location.]
Handling Slowly Changing Dimensions: Type 3 Changes
• Add an "old" field in the dimension table for the affected attribute
• Push down the existing value of the attribute from the “current” field to
the “old” field
• Keep the new value of the attribute in the “current” field
• Also, you may add a “current” effective date field for the attribute
• The key of the row is not affected
• No new dimension row is needed
• The existing queries will seamlessly switch to the “current” value
• Any queries that need to use the “old” value must be revised accordingly
• The technique works best for one “soft” change at a time
• If there is a succession of changes, more sophisticated techniques must
be devised
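A minimal sketch of this technique using sqlite3 from the standard library; the customer_dim table, the territory attribute, and the dates are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key   INTEGER PRIMARY KEY,  -- surrogate key, unaffected by the change
        customer_name  TEXT,
        old_territory  TEXT,                 -- the "old" field added for the attribute
        territory      TEXT,                 -- the "current" field
        effective_date TEXT                  -- when the "current" value took effect
    )""")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Acme Inc.', NULL, 'South-Central', '1999-01-01')")

# Push the existing value down into the "old" field, keep the new value in the
# "current" field, and stamp the effective date. No new row; the key is unchanged.
conn.execute("""
    UPDATE customer_dim
       SET old_territory  = territory,
           territory      = 'South-West',
           effective_date = '2000-07-01'
     WHERE customer_key = 1""")

# Existing queries that read territory seamlessly see the current value; queries
# needing the pre-change value must be revised to read old_territory instead.
print(conn.execute("SELECT * FROM customer_dim").fetchone())
# (1, 'Acme Inc.', 'South-Central', 'South-West', '2000-07-01')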
Types of Facts (Measures/Metrics)
• Additive/fully additive: can be summed across all dimensions (e.g., units_sold)
• Semi-additive: can be summed across some dimensions but not others (e.g., an account balance, which cannot be summed across time)
• Non-additive: cannot be summed at all (e.g., ratios and averages such as avg_sales)
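A short sketch (with made-up numbers) of why the distinction matters, using the measures from the sales schema above: units_sold and dollars_sold are fully additive, while avg_sales is non-additive and must be recomputed from additive components rather than summed.

rows = [
    {"store": "S01", "units_sold": 10, "dollars_sold": 250.0},  # per-store avg 25.0
    {"store": "S02", "units_sold": 30, "dollars_sold": 450.0},  # per-store avg 15.0
]

total_units = sum(r["units_sold"] for r in rows)      # additive: 40
total_dollars = sum(r["dollars_sold"] for r in rows)  # additive: 700.0
avg_sales = total_dollars / total_units               # correctly recomputed: 17.5
# Summing the per-store averages (25.0 + 15.0 = 40.0) would be wrong.
print(total_units, total_dollars, avg_sales)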
Aggregate Fact Table
• Aggregates are precalculated summaries derived from the most granular fact table.
• These summaries form a set of separate aggregate fact tables.
• You may create each aggregate fact table as a specific summarization across any
number of dimensions.
• Let us begin by examining a sample STAR schema. Assume there are four
dimension tables surrounding this most granular fact table.
• When you run a query in an operational system, it produces a result set about a
single customer, a single order, a single invoice, a single product, and so on.
• But, as you know, the queries in a data warehouse environment produce
large result sets.
• These queries retrieve hundreds of thousands of table rows, manipulate
the metrics in the fact tables, and then produce the result sets.
• The manipulation of the fact table metrics may be a simple addition, an
addition with some adjustments, a calculation of averages, or even an
application of complex arithmetic algorithms.
• Let us review a few typical queries against the sample STAR schema shown
in Figure 11-11.
• Query 1: Total sales for customer number 12345678 during the first week of
December 2000 for product Widget-1.
• Query 2: Total sales for customer number 12345678 during the first three
months of 2000 for product Widget-1.
• Query 3: Total sales for all customers in the South-Central territory for the
first two quarters of 2000 for product category Bigtools.
Scrutinize these queries and determine how the totals will be
calculated in each case:
• Query 1: All fact table rows where the customer key relates to customer
number 12345678, the product key relates to product Widget-1, and the
time key relates to the seven days in the first week of December 2000.
Assuming that a customer may make at most one purchase of a single
product in a single day, only a maximum of 7 fact table rows participate in
the summation.
• Query 2: All fact table rows where the customer key relates to customer
number 12345678, the product key relates to product Widget-1, and the
time key relates to the roughly 90 days of the first quarter of 2000. Assuming
that a customer may make at most one purchase of a single product in a
single day, at most about 90 fact table rows participate in the
summation.
• Query 3: All fact table rows where the customer key relates to all
customers in the South-Central territory, the product key relates to all
products in the product category Bigtools, and the time key relates to
about 180 days in the first two quarters of 2000. In this case, clearly a
large number of fact table rows participate in the summation.
Fact Table Sizes
• Time dimension: 5 years × 365 days = 1825
• Store dimension: 300 stores reporting daily sales
• Product dimension: 40,000 products in each store (about 4000 sell in each store
daily)
• Promotion dimension: a sold item may be in only one promotion in a store on a
given day
• Maximum number of base fact table records: 1825 × 300 × 4000 × 1 ≈ 2.2 billion
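The estimate above can be checked directly; a quick sketch:

days = 5 * 365            # time dimension: 1825 days
stores = 300              # store dimension
selling_products = 4000   # about 4000 of the 40,000 products sell per store daily
promotions = 1            # at most one promotion per sold item per store per day

max_rows = days * stores * selling_products * promotions
print(f"{max_rows:,}")    # 2,190,000,000 -- roughly 2.2 billion base fact rows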
Need for Aggregates
• Consider the STAR schema for a grocery chain. In those 300 stores, assume there are 500 products
per brand. Of the 40,000 products, assume that there is at least one sale per product
per store per week. Let us estimate the number of fact table rows to be retrieved and
summarized for the following types of queries:
• Query involves 1 product, 1 store, 1 week—retrieve/summarize only 1 fact table row
• Query involves 1 product, all stores, 1 week—retrieve/summarize 300 fact table rows
Aggregate types
• One-Way Aggregates: when you rise to higher levels in the hierarchy of
one dimension and keep the level at the lowest in the other dimensions,
you create one-way aggregates. For example:
• Product category by store by date
• Product department by store by date
• All products by store by date
• Territory by product by date
• Region by product by date
• All stores by product by date
• Month by store by product
• Quarter by store by product
• Year by store by product
Two-Way Aggregates: when you rise to higher levels in the hierarchies of
two dimensions and keep the level at the lowest in the other dimension, you
create two-way aggregate tables (for example, product category by territory by date).
Three-Way Aggregates: when you rise to higher levels in the hierarchies of
all three dimensions, you create three-way aggregate tables (for example,
product category by territory by month).
• Effect of Sparsity on Aggregation. Consider the case of the grocery chain with
300 stores, 40,000 products in each store, but only 4000 selling in each store in a
day. As discussed earlier, assuming that you keep records for 5 years or 1825
days, the maximum number of base fact table rows is calculated as follows:
• Product = 40,000 (although only about 4000 sell in one store each day)
• Store = 300
• Time = 1825
• Maximum number of base fact table rows: 40,000 × 300 × 1825 ≈ 22 billion
Now let us see what happens when you form aggregates. Scrutinize a one-
way aggregate: brand totals by store by day. Calculate the maximum number
of rows in this one-way aggregate.
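A sketch of that calculation, using only the figures given in this section (40,000 products at 500 products per brand gives 80 brands):

products = 40_000
products_per_brand = 500                  # stated earlier for this grocery chain
brands = products // products_per_brand   # 80 brands

stores, days = 300, 1825
base_max = products * stores * days       # ~21.9 billion: the "22 billion" above
aggregate_max = brands * stores * days    # ~43.8 million rows in the aggregate

print(f"{base_max:,} base rows vs {aggregate_max:,} aggregate rows")
# The aggregate is also far less sparse: while only ~10% of products sell in a
# store on a given day, almost every brand sells in every store every day.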
ETL
• Each of the ETL functions fulfills a significant purpose.
• When you want to convert data from the source systems into information stored in the data warehouse, each of
these functions is essential.
• To change data into information, you first need to capture the data. After you capture the data, you cannot
simply dump that data into the data warehouse and call it strategic information.
• You have to subject the extracted data to all manner of transformations so that the data will be fit to be
converted into information.
• Once you have transformed the data, it is still not useful to the end-users until it is moved to the data
warehouse repository.
• Data loading is an essential function. You must perform all three functions of ETL for successfully
transforming data into information.
Take as an example an analysis your user wants to perform. The user wants
to compare and analyze sales by store, by product, and by month.
• The sales figures are available in several sales applications in your company.
• Also, you have a product master file. Further, each sales transaction refers to a
specific store.
• For doing the analysis, you have to provide information about the sales in the
data warehouse database.
• You have to provide the sales units and dollars in a fact table, the products in a
product dimension table, the stores in a store dimension table, and months in a
time dimension table.
How do you do this?
• Extract the data from each of the operational systems, reconcile the variations
in data representations among the source systems, and transform all the sales
of all the products. Then load the sales into the fact and dimension tables.
• Now, after completion of these three functions, the extracted data is sitting in
the data warehouse, transformed into information, ready for analysis.
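A minimal ETL sketch for the sales analysis above. The two source feeds, their field names, and the reconciliation rules are all hypothetical; real source systems would each need their own extraction logic.

# Extract: rows from two sales applications with differing representations.
source_a = [{"prod": "W1", "store": "S01", "month": "2000-01", "units": 10, "usd": 250.0}]
source_b = [{"product_code": "W1", "store_id": "1", "period": "012000", "qty": 5, "sales_cents": 12500}]

def reconcile_a(r):
    return (r["prod"], r["store"], r["month"], r["units"], r["usd"])

def reconcile_b(r):
    # Transform: normalize the store id, the month format, and the money units.
    month = f"{r['period'][2:]}-{r['period'][:2]}"
    return (r["product_code"], f"S{int(r['store_id']):02d}", month,
            r["qty"], r["sales_cents"] / 100)

# Load: accumulate into a sales fact table keyed by (product, store, month).
staged = [reconcile_a(r) for r in source_a] + [reconcile_b(r) for r in source_b]
fact = {}
for prod, store, month, units, dollars in staged:
    u, d = fact.get((prod, store, month), (0, 0.0))
    fact[(prod, store, month)] = (u + units, d + dollars)

print(fact)  # {('W1', 'S01', '2000-01'): (15, 375.0)}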
• ETL functions are challenging primarily because of the nature of the source
systems. Most of the challenges in ETL arise from the disparities among the
source operational systems.
Please review the following list of reasons for the types of
difficulties in ETL functions:
• Most source systems do not represent data in types or formats that are
meaningful to the users. Many representations are cryptic and ambiguous.
• When the project team designs the ETL functions, tests the various
processes, and deploys them, you will find that these consume a very high
percentage of the total project effort.
• It is not uncommon for a project team to spend as much as 50–70% of
the project effort on ETL functions. You have already noted several
factors that add to the complexity of the ETL functions.
• You have to reformat internal data structures, re-sequence data, apply various
forms of conversion techniques, supply default values wherever values are
missing, and you must design the whole set of aggregates that are needed for
performance improvement.
• In many cases, you need to convert from EBCDIC to ASCII formats.
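A small sketch of such a conversion; Python's standard codecs include 'cp500', one common EBCDIC code page (the actual code page depends on the source mainframe):

ebcdic_bytes = "WIDGET-1".encode("cp500")  # simulate a field from a mainframe extract
text = ebcdic_bytes.decode("cp500")        # decode the EBCDIC bytes back to text
print(ebcdic_bytes.hex(), "->", text)      # the raw bytes differ from any ASCII encoding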
• Now turn your attention to the data loading function.
• Creating and managing load images for such large numbers of rows are not easy
tasks.
• Even more difficult is the task of testing and applying the load images to actually
populate the physical files in the data warehouse. Sometimes, it may take two or more
weeks to complete the initial physical loading.
• With regard to extracting and applying the ongoing incremental changes, there are
several difficulties.
• Finding the proper extraction method for individual source datasets can be difficult.
• Once you settle on the extraction method, finding a time window to apply the
changes to the data warehouse can be tricky if your data warehouse cannot suffer long
downtimes.
What are the major steps in the ETL process?
Data Extraction: Key Issues
• Source identification: identify source applications and source structures.
• Method of extraction: for each data source, define whether the extraction process is
manual or tool-based.
• Extraction frequency: for each data source, establish how frequently the data extraction
must be done: daily, weekly, quarterly, and so on.
• Time window: for each data source, denote the time window for the extraction process.
• Job sequencing: determine whether the beginning of one job in an extraction job stream
has to wait until the previous job has finished successfully.
• Exception handling: determine how to handle input records that cannot be extracted.
Evaluation of the Techniques
To summarize, the following options are available for data extraction:
• Capture of static data
• Capture through transaction logs
• Capture through database triggers
• Capture in source applications
• Capture based on date and time stamp
• Capture by comparing files
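A sketch of capture based on date and time stamp: extract only the source rows updated since the previous successful run. The row layout and the checkpoint handling are hypothetical.

from datetime import datetime

source_rows = [
    {"order_id": 1, "updated_at": datetime(2000, 7, 1, 9, 0)},
    {"order_id": 2, "updated_at": datetime(2000, 7, 2, 14, 30)},
]

last_run = datetime(2000, 7, 1, 23, 59)  # checkpoint saved by the previous run

# Incremental capture: only rows stamped after the checkpoint are extracted.
changed = [r for r in source_rows if r["updated_at"] > last_run]
print(changed)  # only order 2 changed since the last run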
Data Transformation
• One major effort within data transformation is the improvement of data
quality.
• In a simple sense, this includes filling in the missing values for attributes
in the extracted data.
• Data quality is of paramount importance in the data warehouse because
the effect of strategic decisions based on incorrect information can be
devastating.
Data Transformation: Basic Tasks
• Selection: the task of selection usually forms part of the extraction function
itself.
• Splitting/joining: this task covers the data manipulation in which you split the
selected parts even further during data transformation. Joining of
parts selected from many source systems is more widespread in the data
warehouse environment.
• Conversion: one purpose is to standardize among the data extracted from disparate source
systems; the other is to make the fields usable and understandable to the users.
• Summarization: sometimes you will not store the data at the most granular level. Storing sales by
product, by store, and by day in the data warehouse may be quite adequate; in
this case, the data transformation function includes summarization of
daily sales by product and by store (see the sketch after this list).
• Enrichment: this task is the rearrangement and simplification of
individual fields to make them more useful for the data warehouse
environment.
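A minimal sketch of the summarization task named above, rolling hypothetical detailed transactions up to daily sales by product and by store:

from collections import defaultdict

transactions = [
    {"product": "W1", "store": "S01", "date": "2000-07-01", "dollars": 100.0},
    {"product": "W1", "store": "S01", "date": "2000-07-01", "dollars": 50.0},
    {"product": "W2", "store": "S01", "date": "2000-07-01", "dollars": 75.0},
]

# Summarize: the warehouse stores one row per product/store/day, not each sale.
daily_sales = defaultdict(float)
for t in transactions:
    daily_sales[(t["product"], t["store"], t["date"])] += t["dollars"]

print(dict(daily_sales))
# {('W1', 'S01', '2000-07-01'): 150.0, ('W2', 'S01', '2000-07-01'): 75.0}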
Major Transformation Types
• Format Revisions (changes to the data types and lengths of individual fields)
• Decoding of Fields (e.g., coding for gender, with one source system using 1 and 2 for male and female
and another system using M and F; see the sketch after this list)
• Calculated and Derived Values
• Splitting of Single Fields
• Merging of Information
• Character Set Conversion
• Conversion of Units of Measurement, Date/Time Conversion
• Key Restructuring, Deduplication
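A sketch of the decoding task for the gender example above; the target representation ("male"/"female") is an arbitrary choice for the warehouse standard:

GENDER_DECODE = {
    "1": "male", "2": "female",  # coding used by one source system
    "M": "male", "F": "female",  # coding used by another source system
}

def decode_gender(raw: str) -> str:
    # Normalize the raw code, then map any source coding to the one standard.
    return GENDER_DECODE.get(raw.strip().upper(), "unknown")

print(decode_gender("1"), decode_gender("f"))  # male female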
How to Implement Transformation
• Using Transformation Tools
• Using Manual Techniques
DATA LOADING
• The whole process of moving data into the data warehouse repository is
referred to in several ways:
Initial Load—populating all the data warehouse tables for the very first time.
Incremental Load—applying ongoing changes as necessary in a periodic
manner.
Full Refresh—completely erasing the contents of one or more tables and
reloading with fresh data (initial load is a refresh of all the tables)
TYPES OF LOAD TECHNIQUES
• Load: if the target table already exists, the existing data is erased and the incoming data is loaded; if the table is empty, the data is simply loaded.
• Append: the incoming data is added to the existing data in the target table.
• Destructive Merge: an incoming record that matches the key of an existing record overwrites it; non-matching records are added.
• Constructive Merge: an incoming record that matches the key of an existing record is added, and the old record is marked as superseded.
(A sketch of these four modes follows.)
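A sketch of the four techniques against a target table modeled as a dict keyed by primary key; this illustrates the semantics only, not any DBMS load utility:

def load(target, incoming):
    # Load: erase whatever exists and load the incoming data.
    target.clear()
    target.update(incoming)

def append(target, incoming):
    # Append: add incoming rows; existing keys are left untouched here.
    for key, row in incoming.items():
        target.setdefault(key, row)

def destructive_merge(target, incoming):
    # Destructive merge: an incoming row overwrites the row with the same key.
    target.update(incoming)

def constructive_merge(target, incoming):
    # Constructive merge: keep the old row but mark it superseded, add the new one.
    for key, row in incoming.items():
        if key in target:
            target[key]["superseded"] = True
            target[(key, "v2")] = row  # simplistic versioning, for illustration only
        else:
            target[key] = row

table = {1: {"amount": 100, "superseded": False}}
constructive_merge(table, {1: {"amount": 120, "superseded": False}})
print(table)  # old row flagged as superseded, new row stored under a versioned key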
Assignment
• Difference between data quality and accuracy.
• Characteristics and issues of data quality.