Data Warehousing Basics

Targeted at: Entry Level Trainees

Session 06-08: Data Modeling Techniques

© 2007, Cognizant Technology Solutions. All Rights Reserved. The information contained herein is subject to change without notice.

C3: Protected

About the Author
Created By: Dhana Lakshmi Thirunavukkarasu (154180)
Credential Information: 5+ years of experience in DW
Version and Date: DW/PPT/0509/1.0


Icons Used
Hands on Exercise

Questions

Tools

Coding Standards

Test Your Understanding

Reference

Try it Out

A Welcome Break

Contacts


Data Warehousing Basics Session 06-08: Overview
• Introduction: This session explains the data modeling techniques used in a data warehouse.

Data Warehousing Basics Session 06-08: Objective
• Objective: After completing this session, you will be able to:
  » Explain the concepts of ER Modeling and Dimensional Modeling

Data Modeling for DWH
• Data Modeling describes how to structure the data in your data warehouse.
• Data Modeling is a process that produces abstract data models for one or more database components of the data warehouse.

Data Modeling Techniques
Entity-Relationship Modeling (ER Model):
• Traditional modeling technique
• Technique of choice for OLTP
• Suited for corporate data warehouse
Dimensional Modeling:
• Analyzing business measures in the specific business context
• Helps visualize very abstract business questions
• End users can easily understand and navigate the data structure

ER Model
• The ER modeling technique is a discipline used to illuminate the microscopic relationships among data elements.
• The highest art form of ER modeling is to remove all redundancy in the data.
• Taken to that extreme, it has created databases that end users cannot query.

ER Modeling-Logical Design
Entity:
• Object that can be observed and classified by its properties and characteristics
• Business definition with a clear boundary
• Characterized by a noun
Example:
• Product
• Employee
• Sales
[Diagram: entity "Sales Organization" with attributes Sales Org ID and Distribution Channel]

ER Modeling-Logical Design (Contd.)
Entity Types:
Independent/Fundamental Entity (Strong):
• It does not rely on another entity for identification, for example: Employee, Customer, or Product.
Dependent/Attributive Entity (Weak):
• It relies on another entity for identification, for example: Employee Hobby depends on Employee.
Associative Entity:
• It is applied to associate two or more entities in order to reconcile a many-to-many relationship, for example: Assignment of Employee to Project.

ER Modeling-Logical Design (Contd.)
• Super type Entity: A general entity type that can be specialized into more specific ones, for example: Building.
• Sub type Entity: A subtype entity inherits all attributes and relationships of its super type, but provides more specific details about its own characteristics that are not properties of the super type, for example: Residential Building, Commercial Building.

ER Modeling-Logical Design (Contd.)
Attributes:
• Characteristics and properties of entities
• Example: Sales Org ID and Distribution Channel are attributes of the entity "Sales Organization"
• Attribute names should be unique and self-explanatory
• Primary Key, Foreign Key, and Constraints are defined on attributes
[Diagram: entity "Sales Organization" with its attributes Sales Org ID and Distribution Channel highlighted]

ER Modeling-Logical Design (Contd.)
Identifier:
• One or more attributes that uniquely identify an instance of an entity
• Example: Sales Org ID
[Diagram: entity "Sales Organization" with Sales Org ID marked as the identifier]

ER Modeling-Logical Design (Contd.)
Relationship:
• Relationship between entities: structural interaction and association
• Cardinality: 1-1, 1-M, M-M
• Example: Sales Detail and Sales Rep
[Diagram: Sales Detail (Sales Record ID) related to Sales Rep (Sales Rep ID)]

Logical Data Model
[Diagram: logical data model]

Moving from Logical to Physical Design
The expected schema is translated into actual database structures. Map the following:
• Entities to Tables
• Relationships to Foreign Keys
• Attributes to Columns
• Primary Unique Identifiers to the Primary Key
• Unique Identifiers to Unique Keys

ER Model - Physical Design
• The physical data model includes all required tables, columns, relationships, and database properties for the physical implementation of databases.
• Database performance, indexing strategy, physical storage, and denormalization are important parameters of a physical model.
• The logical data model is approved by the functional team, and development of the physical data model starts thereafter.
• The transformations from logical model to physical model include imposing database rules, implementation of referential integrity, super types and sub types, and so on.

Physical Design - Example
[Diagram: physical data model]

Logical vs. Physical
Logical                                      Physical
Represents business information and          Represents the physical implementation
defines business rules                       of the model in a database
Entity                                       Table
Attribute                                    Column
Relationship                                 Foreign Key
Primary Key                                  Primary Key Constraint
Rule                                         Check Constraint, Default Value
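
The logical-to-physical mapping in the comparison above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the entity names Sales Rep and Sales Detail come from the earlier relationship slide, while the column types, the rep_name and units columns, the CHECK rule, and the default value are assumptions made for the example.

```python
import sqlite3

# Hypothetical physical design; types, the CHECK rule, and the default are assumed.
DDL = """
CREATE TABLE sales_rep (
    sales_rep_id INTEGER PRIMARY KEY,            -- identifier   -> primary key constraint
    rep_name     TEXT NOT NULL                   -- attribute    -> column
);
CREATE TABLE sales_detail (
    sales_record_id INTEGER PRIMARY KEY,
    sales_rep_id    INTEGER NOT NULL
        REFERENCES sales_rep (sales_rep_id),     -- relationship -> foreign key
    units           INTEGER NOT NULL DEFAULT 0   -- rule         -> default value
        CHECK (units >= 0)                       -- rule         -> check constraint
);
"""

def build_physical_model() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(DDL)
    return conn

conn = build_physical_model()
conn.execute("INSERT INTO sales_rep VALUES (1, 'Shane')")
conn.execute("INSERT INTO sales_detail (sales_record_id, sales_rep_id) VALUES (100, 1)")

def violates(sql: str) -> bool:
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True
```

Calling `violates("INSERT INTO sales_detail VALUES (101, 99, 5)")` exercises the foreign key (there is no sales rep 99), and a negative `units` value exercises the check constraint.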

Why Not ER Model?
• End users cannot understand or remember an ER model.
• Software cannot usefully query a general ER model.
• No graphical user interface (GUI) takes a general ER model and makes it usable by end users.
• Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data.

Dimensional Model
• Represents the data in a standard, intuitive framework that allows for high-performance access.
• Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables.
• Schema designed to process large, complex, ad hoc, and data-intensive queries.
• No concern for concurrency, locking, and insert/update/delete performance.

Types of Schema
• Star schema: A fact table in the middle connected to a set of dimension tables.
• Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
• Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.

Star Schema
• Consists of a group of tables that describe the dimensions of the business.
• Arranged logically around a huge central table that contains all the accumulated facts and figures of the business.
• The smaller, outer tables are the points of the star. The larger table is the center from which the points radiate.
• A single, large, central fact table and one table for each dimension.
• Easy to understand, easy to define hierarchies, reduces number of physical joins.
• Does not capture hierarchies directly.

Star Schema (Contd.)
[Diagram: a central Fact Table (Key 1, Key 2, Key 3, Key 4, Data Column, ..., Data Column) joined to four dimension tables; Dimension table N holds Key N followed by its Attribute columns]

Example of Star Schema
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level
Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level

Example of Star Schema (Contd.)
Example:
    SELECT A.STORE_KEY, A.PERIOD_KEY, A.dollars
    FROM Fact_Table A
    WHERE A.STORE_KEY IN (SELECT B.STORE_KEY
                          FROM Store_Dimension B
                          WHERE B.region = 'North' AND B.Level = 2)
• Level is needed whenever aggregates are stored with detail facts.
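
To try the example query, here is a minimal, self-contained sketch using Python's sqlite3 module. The table and column names follow the query; the trimmed-down Store_Dimension, the sample rows, and the meaning attached to Level = 2 are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store_Dimension (STORE_KEY INTEGER PRIMARY KEY, region TEXT, Level INTEGER);
CREATE TABLE Fact_Table (STORE_KEY INTEGER, PRODUCT_KEY INTEGER, PERIOD_KEY INTEGER,
                         dollars REAL, units INTEGER, price REAL);
-- Sample rows are invented for illustration.
INSERT INTO Store_Dimension VALUES (1, 'North', 2), (2, 'South', 2), (3, 'North', 1);
INSERT INTO Fact_Table VALUES (1, 10, 200001, 150.0, 3, 50.0),
                              (2, 10, 200001,  80.0, 2, 40.0),
                              (3, 11, 200001, 999.0, 9, 111.0);
""")

# The dimension is filtered first; the fact table is then restricted to the matching keys.
QUERY = """
SELECT A.STORE_KEY, A.PERIOD_KEY, A.dollars
FROM Fact_Table A
WHERE A.STORE_KEY IN (SELECT B.STORE_KEY
                      FROM Store_Dimension B
                      WHERE B.region = 'North' AND B.Level = 2)
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

Only store 1 is both in the North region and at Level 2, so only its fact row is returned; the Level filter keeps rows of a different level out of the result.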

Star Schema: Dimension Table
Geography_dim (PK: Empl_Code)
Empl_Code  empl_name      city_code  city           state_code  state         region_code  region
2341       Mike King      101        Atlantic city  NJ          New Jersey    1            New Jersey
3424       Jim McCann     106        Chicago        IL          Illinois      2            Illinois
1232       Kitty Stokes   104        Austin         PA          Pennsylvania  1            New Jersey
3554       Clem Akins     102        Medford        NJ          New Jersey    1            New Jersey
3963       Duncan Moore   101        Atlantic city  NJ          New Jersey    1            New Jersey
2924       Dawn McGuire   103        Englewood      NJ          New Jersey    1            New Jersey
2673       Joe Becker     105        Alverton       PA          Pennsylvania  1            New Jersey
3253       Geoff Bergren  107        Springfield    IL          Illinois      2            Illinois
234        Garth Boyd     106        Chicago        IL          Illinois      2            Illinois
2342       Lin Cepele     104        Austin         PA          Pennsylvania  1            New Jersey
Attributes: the columns of the table. Elements: Region, State, City, Employee.
• De-normalized structure
• Easy navigation within the dimension: Region, State, City, Employee

Star Schema: Fact Table
Sales_fact
day_code  prod_code  cust_code  empl_code  units sold  revenue
1211      345        1231123    1232       23          7935
1211      22         1245223    3554       12          264
1211      112        1522342    3963       6           672
1212      233        1524665    2924       34          7922
1212      112        1366454    2673       76          8512
1212      22         1403453    3554       22          484
Dimension Keys: day_code, prod_code, cust_code, empl_code. Measures: units sold, revenue.
• Contains columns for measures and dimensions
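
As a sketch of how the two tables work together, the snippet below loads the Sales_fact rows and the matching Geography_dim rows from the slides into SQLite and rolls the measures up by state. The column name units_sold (for "units sold") and the reduced set of dimension columns are simplifications assumed for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Geography_dim (empl_code INTEGER PRIMARY KEY, empl_name TEXT,
                            city TEXT, state TEXT);
CREATE TABLE Sales_fact (day_code INTEGER, prod_code INTEGER, cust_code INTEGER,
                         empl_code INTEGER, units_sold INTEGER, revenue INTEGER);
INSERT INTO Geography_dim VALUES
    (1232, 'Kitty Stokes', 'Austin',        'Pennsylvania'),
    (3554, 'Clem Akins',   'Medford',       'New Jersey'),
    (3963, 'Duncan Moore', 'Atlantic city', 'New Jersey'),
    (2924, 'Dawn McGuire', 'Englewood',     'New Jersey'),
    (2673, 'Joe Becker',   'Alverton',      'Pennsylvania');
INSERT INTO Sales_fact VALUES
    (1211, 345, 1231123, 1232, 23, 7935),
    (1211,  22, 1245223, 3554, 12,  264),
    (1211, 112, 1522342, 3963,  6,  672),
    (1212, 233, 1524665, 2924, 34, 7922),
    (1212, 112, 1366454, 2673, 76, 8512),
    (1212,  22, 1403453, 3554, 22,  484);
""")

def revenue_by_state(connection):
    """Join the fact table to the dimension and aggregate the measures per state."""
    sql = """
    SELECT g.state, SUM(f.units_sold), SUM(f.revenue)
    FROM Sales_fact f
    JOIN Geography_dim g ON g.empl_code = f.empl_code
    GROUP BY g.state
    ORDER BY g.state
    """
    return connection.execute(sql).fetchall()

print(revenue_by_state(conn))
```

The fact table stores only keys and measures; every descriptive label used for grouping comes from the dimension table through a single join.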

What is Dimension?
• A Dimension is a structure that categorizes data in order to enable end users to answer business questions. Commonly used dimensions are Customer, Product, and Time.

What is Dimension? (Contd.)
What was sold? Whom was it sold to? When was it sold? Where was it sold?
• Dimensions put measures in perspective
• What, when, and where qualifiers to the measures
• Dimensions could be products, customers, time, geography, and so on.

Dimension Elements
• Components of a dimension
• Represents the natural elements in the business dimension
• Directly related to the dimension
• Facilitates analysis from different perspectives of a dimension
• Often referred to as levels of a dimension
[Diagram labels: Geography, Time, Product]

Dimension Hierarchy
• Represents the natural business hierarchy within dimension elements
• Clarifies the drill up, drill down directions
• Each element represents different levels of aggregation
• End users may need custom hierarchies
Example (Time Dimension): Year (1999), Month (April, May), Date (9/4/99, 28/4/99, 5/5/99, 17/5/99). Drill down moves from Year toward Date; drill up moves from Date toward Year.
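
Drilling up and down amounts to aggregating the same measure at different levels of the hierarchy. The sketch below illustrates this with the dates from the Time Dimension example; the `sales` table, its column names, and the amounts are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (year INTEGER, month TEXT, sale_date TEXT, amount INTEGER);
-- Dates follow the slide; the amounts are invented.
INSERT INTO sales VALUES (1999, 'April', '9/4/99', 100), (1999, 'April', '28/4/99', 50),
                         (1999, 'May',   '5/5/99',  70), (1999, 'May',   '17/5/99', 30);
""")

LEVELS = ["year", "month", "sale_date"]  # coarsest to finest

def aggregate(depth: int):
    """Total amount grouped by the first `depth` levels of the hierarchy."""
    cols = ", ".join(LEVELS[:depth])
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

by_year = aggregate(1)    # drilled all the way up
by_month = aggregate(2)   # one level of drill down
by_date = aggregate(3)    # finest grain
```

Each step down the hierarchy adds one grouping column, so the totals at a finer level always add back up to the level above.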

Examples of Dimensions
• The following dimensions are common in all Data Warehouses in various forms:
  » Product Dimension
  » Service Dimension
  » Geographic Dimension
  » Time Dimension

Surrogate Keys
All tables (facts and dimensions) should use Data Warehouse generated surrogate keys rather than production keys:
• Production keys sometimes get reused
• In case of mergers/acquisitions, surrogate keys protect you from different key formats
• Production systems may change their systems to generalize key definitions
• Using surrogate keys will be faster
• Can handle Slowly Changing Dimensions well
• These keys should be simple integers

Why Should Existing Keys Not Be Used?
• Keys may be reused after they have been purged even though they are used in the warehouse.
• A product description or a customer description could be changed without changing the key.
• Key formats may be generalized to handle some new situation.
• A mistake could be made and a key could be reused.

Types of Dimensions
• Conformed Dimension
• Degenerate Dimension
• Demographic Dimension
• Junk Dimension
• Causal Dimension
• Slowly Changing Dimension

Conformed Dimensions
• Conformed dimensions are those which are consistent across data marts.
• Essential for integrating the data marts into an Enterprise Data Warehouse (EDW).
[Diagram labels: PRODUCT, CUSTOMER, SALES, INVENTORY, DATE]

Causal Dimensions
• Causal dimensions can be used for explaining why a record exists in a fact table.
• Causal dimensions should not change the grain of the fact table.

Causal Dimension: Example
Example:
• Why did a customer buy a particular product?
• Why did a customer use a particular ATM?

What is a Slowly Changing Dimension?
• Although dimension tables are typically static lists, most dimension tables do change over time.
• Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.

Slowly Changing Dimension: Classification
Slowly Changing Dimensions (SCD) are classified into three different types:
• TYPE I
• TYPE II
• TYPE III

Slowly Changing Dimensions Type I
Example 1: the target row is overwritten with the new source value, so no history is kept.
Before the change:
Source:  Emp id  Name   Email
         1001    Shane  Shane@xyz.com
Target:  Emp id  Name   Email
         1001    Shane  Shane@xyz.com
After the change:
Source:  Emp id  Name   Email
         1001    Shane  Shane@abc.co.in
Target:  Emp id  Name   Email
         1001    Shane  Shane@abc.co.in
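
A minimal sketch of a Type I load, assuming a hypothetical `target_emp` table and an `apply_type1` helper (neither name comes from the slides): the incoming source row simply overwrites the matching target row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_emp (emp_id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO target_emp VALUES (1001, 'Shane', 'Shane@xyz.com')")

def apply_type1(emp_id: int, name: str, email: str) -> None:
    """Insert a new member, or overwrite the existing row in place (no history)."""
    updated = conn.execute(
        "UPDATE target_emp SET name = ?, email = ? WHERE emp_id = ?",
        (name, email, emp_id),
    ).rowcount
    if updated == 0:
        conn.execute("INSERT INTO target_emp VALUES (?, ?, ?)", (emp_id, name, email))

apply_type1(1001, "Shane", "Shane@abc.co.in")  # the changed source row
rows = conn.execute("SELECT * FROM target_emp").fetchall()
```

After the call the target holds a single row for employee 1001 with the new address; the old address is gone.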

Slowly Changing Dimensions Type I (Contd.)
Example 2:
Source:  Emp id  Name   Email
         10      Shane  Shane@xyz.com
Target:  PM_PRIMARYKEY  Emp id  Name   Email          PM_VERSION_NUMBER
         1000           10      Shane  Shane@xyz.com  0

Types of SCD Type 2
• Versioning
• Flag
• Date

Slowly Changing Dimensions II: Versioning
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@abc.co.in
Target:  PM_PRIMARYKEY  Emp id  Name   Email            PM_VERSION_NUMBER
         1000           10      Shane  Shane@xyz.com    0
         1001           10      Shane  Shane@abc.co.in  1

Slowly Changing Dimensions II: Versioning (Contd.)
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@abc.com
Target:  PM_PRIMARYKEY  Emp id  Name   Email            PM_VERSION_NUMBER
         1000           10      Shane  Shane@xyz.com    0
         1001           10      Shane  Shane@abc.co.in  1
         1003           10      Shane  Shane@abc.com    2
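
The versioning approach can be sketched as follows. The column names mirror the slide; the `target_emp` table name, the `apply_versioned` helper, and the use of SQLite's AUTOINCREMENT to generate the surrogate key are assumptions, so the generated key values (1, 2, 3) differ from the slide's 1000, 1001, 1003.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE target_emp (
    pm_primarykey     INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    emp_id            INTEGER,
    name              TEXT,
    email             TEXT,
    pm_version_number INTEGER)
""")

def apply_versioned(emp_id, name, email):
    """Insert a new version only when the latest stored version differs."""
    latest = conn.execute(
        "SELECT name, email, pm_version_number FROM target_emp "
        "WHERE emp_id = ? ORDER BY pm_version_number DESC LIMIT 1", (emp_id,)
    ).fetchone()
    if latest and latest[:2] == (name, email):
        return  # source row unchanged, nothing to do
    version = 0 if latest is None else latest[2] + 1
    conn.execute(
        "INSERT INTO target_emp (emp_id, name, email, pm_version_number) "
        "VALUES (?, ?, ?, ?)", (emp_id, name, email, version))

# Replay the source changes from the slides; the repeated value is ignored.
for email in ("Shane@xyz.com", "Shane@abc.co.in", "Shane@abc.co.in", "Shane@abc.com"):
    apply_versioned(10, "Shane", email)

versions = conn.execute(
    "SELECT email, pm_version_number FROM target_emp ORDER BY pm_version_number"
).fetchall()
```

Every change adds a row, so the full history stays queryable and the highest version number identifies the current row.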

Slowly Changing Dimensions Type II: Flag
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@xyz.com
Target:  PM_PRIMARYKEY  Emp id  Name   Email          PM_CURRENT_FLAG
         1000           10      Shane  Shane@xyz.com  1

Slowly Changing Dimensions Type II: Flag (Contd.)
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@abc.co.in
Target:  PM_PRIMARYKEY  Emp id  Name   Email            PM_CURRENT_FLAG
         1000           10      Shane  Shane@xyz.com    N
         1001           10      Shane  Shane@abc.co.in  Y
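
A sketch of the flag variant, under the same caveats: `target_emp` and `apply_flagged` are hypothetical names, and the Y/N flag values follow the second flag slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE target_emp (
    pm_primarykey   INTEGER PRIMARY KEY,   -- surrogate key
    emp_id          INTEGER,
    name            TEXT,
    email           TEXT,
    pm_current_flag TEXT)
""")
conn.execute("INSERT INTO target_emp VALUES (1000, 10, 'Shane', 'Shane@xyz.com', 'Y')")

def apply_flagged(emp_id, name, email):
    """Retire the current row and add a new current row when the source changed."""
    current = conn.execute(
        "SELECT name, email FROM target_emp WHERE emp_id = ? AND pm_current_flag = 'Y'",
        (emp_id,)).fetchone()
    if current == (name, email):
        return  # unchanged
    conn.execute(
        "UPDATE target_emp SET pm_current_flag = 'N' "
        "WHERE emp_id = ? AND pm_current_flag = 'Y'", (emp_id,))
    conn.execute(
        "INSERT INTO target_emp (emp_id, name, email, pm_current_flag) "
        "VALUES (?, ?, ?, 'Y')", (emp_id, name, email))

apply_flagged(10, "Shane", "Shane@abc.co.in")
rows = conn.execute("SELECT * FROM target_emp ORDER BY pm_primarykey").fetchall()
```

Exactly one row per employee carries the current flag, so "current state" queries need only a single filter on that column.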

Slowly Changing Dimensions Type II: Date
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@xyz.com
Target:  PM_PRIMARYKEY  Emp id  Name   Email          PM_BEGIN_DATE  PM_END_DATE
         1000           10      Shane  Shane@xyz.com  01/01/00       (empty)

Slowly Changing Dimensions II: Effective Date
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@abc.co.in
Target:  PM_PRIMARYKEY  Emp id  Name   Email            PM_BEGIN_DATE  PM_END_DATE
         1000           10      Shane  Shane@xyz.com    01/01/00       03/01/00
         1001           10      Shane  Shane@abc.co.in  03/01/00       (empty)

Slowly Changing Dimensions II: Effective Date (Contd.)
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@abc.com
Target:  PM_PRIMARYKEY  Emp id  Name   Email            PM_BEGIN_DATE  PM_END_DATE
         1000           10      Shane  Shane@xyz.com    01/01/00       03/01/00
         1001           10      Shane  Shane@abc.co.in  03/01/00       05/02/00
         1003           10      Shane  Shane@abc.com    05/02/00       (empty)
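
The effective-date variant can be sketched in the same way. The `target_emp` table and `apply_dated` helper are hypothetical, an empty PM_END_DATE is represented as NULL, and the dates are kept as the slide's text values rather than real date types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE target_emp (
    pm_primarykey INTEGER PRIMARY KEY,   -- surrogate key
    emp_id        INTEGER,
    name          TEXT,
    email         TEXT,
    pm_begin_date TEXT,
    pm_end_date   TEXT)                  -- NULL marks the open (current) row
""")

def apply_dated(emp_id, name, email, change_date):
    """Close the open row at change_date and open a new one when the source changed."""
    current = conn.execute(
        "SELECT pm_primarykey, name, email FROM target_emp "
        "WHERE emp_id = ? AND pm_end_date IS NULL", (emp_id,)).fetchone()
    if current and current[1:] == (name, email):
        return  # unchanged
    if current:
        conn.execute("UPDATE target_emp SET pm_end_date = ? WHERE pm_primarykey = ?",
                     (change_date, current[0]))
    conn.execute(
        "INSERT INTO target_emp (emp_id, name, email, pm_begin_date) VALUES (?, ?, ?, ?)",
        (emp_id, name, email, change_date))

apply_dated(10, "Shane", "Shane@xyz.com", "01/01/00")
apply_dated(10, "Shane", "Shane@abc.co.in", "03/01/00")
apply_dated(10, "Shane", "Shane@abc.com", "05/02/00")

history = conn.execute(
    "SELECT email, pm_begin_date, pm_end_date FROM target_emp ORDER BY pm_primarykey"
).fetchall()
```

Each row covers one validity period, and the row whose end date is still empty is the current one. In a real warehouse the dates would normally use a sortable date type so that "as of" queries can compare ranges.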

Slowly Changing Dimensions Type III
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@xyz.com
Target:  PM_PRIMARYKEY  Emp id  Name   Email          PM_Prev_ColumnName  PM_EFFECT_DATE
         1              10      Shane  Shane@xyz.com  (empty)             01/01/00

Slowly Changing Dimensions Type III (Contd.)
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@abc.co.in
Target:  PM_PRIMARYKEY  Emp id  Name   Email            PM_Prev_ColumnName  PM_EFFECT_DATE
         1              10      Shane  Shane@abc.co.in  Shane@xyz.com       01/02/00

Slowly Changing Dimensions Type III (Contd.)
Example:
Source:  Emp id  Name   Email
         10      Shane  Shane@abc.com
Target:  PM_PRIMARYKEY  Emp id  Name   Email          PM_Prev_ColumnName  PM_EFFECT_DATE
         1              10      Shane  Shane@abc.com  Shane@abc.co.in     01/03/00
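
A sketch of Type III, which keeps one row per member and only the immediately previous value. The column `pm_prev_email` stands in for the slide's PM_Prev_ColumnName; it, the `target_emp` table, and the `apply_type3` helper are assumed names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE target_emp (
    pm_primarykey  INTEGER PRIMARY KEY,
    emp_id         INTEGER,
    name           TEXT,
    email          TEXT,    -- current value
    pm_prev_email  TEXT,    -- previous value only
    pm_effect_date TEXT)
""")
conn.execute(
    "INSERT INTO target_emp VALUES (1, 10, 'Shane', 'Shane@xyz.com', NULL, '01/01/00')")

def apply_type3(emp_id, email, effect_date):
    """Shift the current email into the 'previous' column, then overwrite it."""
    # In an UPDATE, the right-hand sides see the row's old values,
    # so pm_prev_email receives the email as it was before this statement.
    conn.execute(
        "UPDATE target_emp SET pm_prev_email = email, email = ?, pm_effect_date = ? "
        "WHERE emp_id = ? AND email <> ?",
        (email, effect_date, emp_id, email))

apply_type3(10, "Shane@abc.co.in", "01/02/00")
apply_type3(10, "Shane@abc.com", "01/03/00")
row = conn.execute(
    "SELECT email, pm_prev_email, pm_effect_date FROM target_emp").fetchone()
```

After the second change the original address Shane@xyz.com is no longer stored anywhere, which is the trade-off Type III makes for keeping a single row.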

Facts and Measures
• Facts or measures are the key performance indicators of an enterprise
• Factual data about the subject area
• Numeric, summarized

Types of Facts
• Additive: Measures that can be added across all dimensions. Example: Sales Amount
• Semi Additive: Measures that can be added across some dimensions but not others. Example: Current Balance, Inventory
• Non Additive: Measures that cannot be added across any dimension. Example: Profit Margin, Temperature
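
The difference between additive and semi-additive measures shows up as soon as a balance is summed over time. The sketch below uses an invented `balance_snapshot` table with made-up daily balances for two accounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE balance_snapshot (day TEXT, account TEXT, current_balance INTEGER);
-- Invented daily balances for two accounts.
INSERT INTO balance_snapshot VALUES
    ('2000-01-01', 'A', 100), ('2000-01-01', 'B', 200),
    ('2000-01-02', 'A', 120), ('2000-01-02', 'B', 180);
""")

# Meaningful: add balances across accounts within one day.
per_day = conn.execute(
    "SELECT day, SUM(current_balance) FROM balance_snapshot GROUP BY day ORDER BY day"
).fetchall()

# Across time, take the latest snapshot (or an average) instead of a sum.
closing = conn.execute("""
    SELECT account, current_balance FROM balance_snapshot
    WHERE day = (SELECT MAX(day) FROM balance_snapshot)
    ORDER BY account
""").fetchall()

# Misleading: summing one account's balance over days counts the same money twice.
naive = conn.execute(
    "SELECT SUM(current_balance) FROM balance_snapshot WHERE account = 'A'"
).fetchone()[0]
```

Here `naive` comes out as 220 even though account A never held more than 120, which is why a balance is treated as semi-additive.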

Types of Fact Tables
Based on the facts classifications, there are two types of fact tables:
• Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts:
  » For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
• Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts:
  » For example, Current_Balance is a semi-additive fact, as it makes sense to add it up for all accounts but it does not make sense to add it up through time.
  » Profit_Margin is a non-additive fact; it does not make sense to add it up at the account level or the day level.

Factless Fact Tables
• Fact tables that contain no facts or measures are called factless fact tables.
• The two types of factless fact tables are:
  » Coverage tables
  » Event tracking tables

Factless Fact Tables: Coverage Tables
• Coverage tables are required when a primary fact table is sparse.
• Example: Tracking products in a store that did not sell

Factless Fact Tables: Event Tracking
• These tables are used for tracking an event.
• Example: Tracking student attendance
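
The coverage-table example (products in a store that did not sell) can be sketched as an anti-join between a factless coverage table and the sales fact table. All table names, column names, and rows here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- coverage_fact is factless: it holds keys only and records what was on offer.
CREATE TABLE coverage_fact (store_key INTEGER, product_key INTEGER);
CREATE TABLE sales_fact    (store_key INTEGER, product_key INTEGER, units INTEGER);
INSERT INTO coverage_fact VALUES (1, 10), (1, 11), (1, 12);
INSERT INTO sales_fact    VALUES (1, 10, 5), (1, 12, 2);
""")

def unsold_products(store_key: int):
    """Products covered in the store that have no matching sales row."""
    sql = """
    SELECT c.product_key
    FROM coverage_fact c
    LEFT JOIN sales_fact s
           ON s.store_key = c.store_key AND s.product_key = c.product_key
    WHERE c.store_key = ? AND s.product_key IS NULL
    ORDER BY c.product_key
    """
    return [row[0] for row in conn.execute(sql, (store_key,))]

print(unsold_products(1))
```

The sales fact table alone cannot answer this question, because a product that did not sell leaves no row in it; the coverage table supplies the missing combinations.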

Snowflake Schema
• Dimension tables are normalized by decomposing at the attribute level.
• Each dimension has one key for each level of the dimension's hierarchy.
• Good performance when queries involve aggregation.
• Complicated maintenance and metadata, explosion in number of tables.
• Makes user representation more complex and intricate.

Snowflake Schema
[Diagram: a central Fact Table connected to four Dim Tables]

Snowflake Schema
[Diagram: example snowflake schema. Column labels in the diagram, in extraction order: cust_code, cust_name, age_code, age, sex_code, sex, city_code, city, emp_code, emp_name, emp_code, city_code, cityname, city_code, state_code, statename, state_code, region_code, regionname, region_code, country_code, countryname, day_code, day_name, week_code, week_code, week_name, month_code, emp_code, cust_code, prod_code, day_code, units, revenue, prod_code, brand_code, prod_name, brand_code, brand_name, color_code, month_code, month_name, quarter_code, year, color_code, color_name]

Avoid Snowflakes
Avoid the natural desire to normalize the model:
• Complicates end-user query construction
• Adds an additional level of "JOIN" complexity
• Database optimizers do not handle it very well
• Saves some space at the cost of longer queries

Star vs Snowflake Schema
Star Schema            Snowflake Schema
Denormalized           Normalized
No complex joins       Uses complex joins
High performance       Low performance
Occupies more space    Occupies less space

Fact Constellation
• Multiple fact tables share dimension tables.
• This schema is viewed as a collection of stars, hence called galaxy schema or fact constellation.
• Sophisticated applications require such a schema.

Example of Fact Constellation
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Sequence
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

Fact Constellation (Contd.)
• Advantage: No need for the "Level" indicator in the dimension tables, since no aggregated data is stored with lower-level detail.
• Disadvantage: Dimension tables are still very large in some cases, which can slow performance; the front end must be able to detect the existence of aggregate facts, which requires more extensive metadata.

ER vs Dimensional
Entity Relationship                             Dimensional
Data remains normalized                         Uses denormalized dimension data
User access more complex                        Simplified data model for user access
Useful in Enterprise-wide DW implementations    Most often used in Data Marts
Timestamp usually in key structure              Can be integrated through dimension sharing

Helper Tables
• Helper tables are used when there are multi-valued dimensions, that is, when there is a many-to-many relationship between a fact table and a dimension table.
• A helper table can be placed between two dimension tables or between a dimension table and a fact table.

Helper Tables: Example
Example: A customer having more than one bank account
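
The bank-account example can be sketched with a helper table sitting between a customer dimension and an account-level fact table. Every table name, column name, and row below is hypothetical; the shared account 101 is included on purpose to make the relationship many-to-many.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- All names and rows are hypothetical.
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE customer_account_helper (customer_key INTEGER, account_key INTEGER);
CREATE TABLE balance_fact (account_key INTEGER, balance INTEGER);
INSERT INTO customer_dim VALUES (1, 'Shane'), (2, 'Mike King');
INSERT INTO customer_account_helper VALUES (1, 100), (1, 101), (2, 101), (2, 102);
INSERT INTO balance_fact VALUES (100, 500), (101, 300), (102, 200);
""")

def balance_by_customer():
    """Resolve customer -> accounts through the helper table, then aggregate."""
    sql = """
    SELECT c.customer_name, COUNT(*), SUM(f.balance)
    FROM customer_dim c
    JOIN customer_account_helper h ON h.customer_key = c.customer_key
    JOIN balance_fact f            ON f.account_key  = h.account_key
    GROUP BY c.customer_name
    ORDER BY c.customer_name
    """
    return conn.execute(sql).fetchall()

print(balance_by_customer())
```

Because account 101 belongs to both customers, its balance appears under each of them; adding the per-customer totals together would overstate the bank-wide total, so such queries need care (for example, a weighting column on the helper table).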

• Allow time for questions from participants

Test Your Understanding
1. List the types of modeling techniques and explain each one of them.
2. Name the various types of Schema.
3. Which schema is the best in performance?
4. What is Fact Constellation?
5. What are the types of facts and fact tables?
6. List the types of Dimensions.
7. Why is the surrogate key needed?

Data Warehousing Basics Session 06-08: Summary
• The ER modeling technique is a discipline used to illuminate the microscopic relationships among data elements.
• Star schema: A fact table in the middle connected to a set of dimension tables.
• Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
• Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.

Data Warehousing Basics Session 06-08: Source
• Ralph Kimball, Data Warehousing
Disclaimer: Parts of the content of this course are based on the materials available from the Web sites and books listed above. The materials that can be accessed from linked sites are not maintained by Cognizant Academy and we are not responsible for the contents thereof. All trademarks, service marks, and trade names in this course are the marks of the respective owner(s).

You have completed Session 06-08 of Data Warehousing Basics.
