Lab Topic 2 Designing and Implementing A Data Warehouse

Prepared by Kesavan Krishnan
Tasks:
Design dimension tables, design fact tables, implement a star schema, implement a snowflake schema, and implement a time dimension
Understanding Normalization
• What Is Normalization?
▪ The process of organizing the tables in a relational database
▪ Eliminates data redundancy
▪ Reduces record locking
▪ Improves concurrency
• Accomplished By Dividing Large Tables Into Smaller Tables
▪ Tables have relationships defined between them
• Before Normalization…
• After Normalization…
▪ No redundant data
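The before/after idea can be sketched in a few lines. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for SQL Server; the table and column names (orders_flat, customers, orders) are hypothetical and do not come from the lab scripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Before normalization: customer details repeat on every order row.
cur.execute(
    "CREATE TABLE orders_flat ("
    "order_id INTEGER, customer_name TEXT, customer_city TEXT, amount REAL)"
)
cur.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?, ?)",
    [(1, "Ann", "Atlanta", 10.0), (2, "Ann", "Atlanta", 25.0), (3, "Bob", "Boston", 7.5)],
)

# After normalization: each customer is stored once; orders reference it by key.
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id), "
    "amount REAL)"
)
cur.execute(
    "INSERT INTO customers (name, city) "
    "SELECT DISTINCT customer_name, customer_city FROM orders_flat"
)
cur.execute(
    "INSERT INTO orders (order_id, customer_id, amount) "
    "SELECT f.order_id, c.customer_id, f.amount "
    "FROM orders_flat AS f "
    "JOIN customers AS c ON c.name = f.customer_name AND c.city = f.customer_city"
)

customer_count = cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
order_count = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(customer_count, order_count)
```

Three flat rows become two customer rows and three order rows, so each customer's name and city is stored only once.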
Normalized Structure Challenges
• A Normalized Database Structure Has A Few Basic Characteristics
▪ It is designed to store detailed data
▪ It is designed to store the data as efficiently as possible
▪ It is designed to provide data integrity
▪ It results in many tables
▪ It is an excellent solution for recording and managing daily transactional activity
• A Normalized Database Structure Has A Few Basic Challenges
▪ It is usually very inefficient for data extraction (queries)
▪ Usually requires multiple table joins to reach all the data
▪ It doesn’t store data in the form needed for data analysis
▪ Data is stored in most detailed form
▪ No aggregated data
▪ No discretized data
▪ Data may be stored in multiple, normalized databases
▪ As data storage increases, query efficiency decreases
▪ Needed historical data may not exist in the database
▪ Required reporting aggregations can create serious inefficiencies in data extraction
▪ Line-of-business (LOB) functions of the database can be impacted
Star Schema Basics
• What Is A Star Schema?
▪ The simplest form of database structure used in a data warehouse (DW)
▪ Answers the basic questions:
▪ What happened, who did it, when did they do it, etc.
▪ Focuses on a single business area
• What Advantages Does A Star Schema Offer?
▪ Separates data into two main categories
▪ Facts (measurements)
▪ Dimensions (descriptive information)
• What Are Facts And Dimensions?
▪ Fact (what happened)
▪ Product sold
▪ Customer who bought it
▪ Etc.
▪ Dimension (attributes that describe what happened)
▪ When the product was sold
▪ Day, date, year, quarter, day of the week, etc.
▪ Where the product was sold
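A small star schema makes the split between facts and dimensions concrete. The sketch below uses Python's sqlite3 module in place of SQL Server; the table names (fact_sales, dim_product, dim_date), the yyyymmdd date key, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions: descriptive context.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, quarter INTEGER);
    -- Fact: numeric measurements plus one key per dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        order_quantity INTEGER,
        sales_amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Bike'), (2, 'Helmet');
    INSERT INTO dim_date VALUES (20111117, '2011-11-17', 4), (20110105, '2011-01-05', 1);
    INSERT INTO fact_sales VALUES
        (1, 20111117, 2, 1000.0),
        (2, 20111117, 1, 35.0),
        (1, 20110105, 1, 500.0);
""")

# "What was sold, and when?" needs one join per dimension.
sales_by_quarter = conn.execute("""
    SELECT d.quarter, p.product_name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.quarter, p.product_name
    ORDER BY d.quarter, p.product_name
""").fetchall()
print(sales_by_quarter)
```

The fact table sits in the middle and every dimension is exactly one join away, which is what gives the schema its star shape.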
Snowflake Schema Basics
• What Is A Snowflake Schema?
▪ A star schema with a little normalization added in
▪ Dimension tables are normalized somewhat
• Why Use A Snowflake Schema?
▪ To satisfy data gathering functionality of more advanced data warehousing/mining tools
▪ To logically separate large dimension tables
▪ To more naturally separate dimensional data
▪ Known customers vs. anonymous customers
• Two Main Rules Concerning Snowflake Schema
▪ Don’t use it
▪ Unless you want to or need to
▪ Test your design and make your own determination
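To see what "normalized somewhat" costs, here is a minimal sketch in which a hypothetical product dimension is snowflaked into a separate category table. It uses Python's sqlite3 module rather than SQL Server, and every table and column name is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked dimension: category moves out of the product table.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        sales_amount REAL
    );
    INSERT INTO dim_category VALUES (1, 'Bikes'), (2, 'Accessories');
    INSERT INTO dim_product VALUES (1, 'Road Bike', 1), (2, 'Helmet', 2), (3, 'Mountain Bike', 1);
    INSERT INTO fact_sales VALUES (1, 1000.0), (2, 35.0), (3, 800.0);
""")

# Reaching the category now takes one more join than in a pure star schema.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(sales_by_category)
```

The extra join is the trade-off to test for in your own design before choosing a snowflake over a star.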
Understanding Granularity
• What Is Meant By The Term Granularity In A DW?
▪ The level of detail available
• What Determines Granularity?
▪ The level of data loaded into the fact table
▪ Per order numbers
▪ Daily numbers
▪ Weekly numbers
▪ The number and detail level of dimensions
▪ Quarter, Year, etc.
• Granularity Should Be Determined During Database Design
▪ Changes can be challenging later on
▪ Changes involve several steps
▪ Changes to the structure of the fact table
▪ Possible changes to the dimension tables
▪ Changes to data loading
▪ Changes to data queries
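The effect of choosing a grain can be shown with plain Python. In this sketch the order rows and the roll_up_to_daily helper are hypothetical; the point is only that a coarser grain discards detail that cannot be recovered later.

```python
from collections import defaultdict

# Order-level grain: one row per order (date, order id, amount).
order_rows = [
    ("2011-11-17", "SO1", 100.0),
    ("2011-11-17", "SO2", 50.0),
    ("2011-11-18", "SO3", 75.0),
]


def roll_up_to_daily(rows):
    """Aggregate order-level rows to one total per day."""
    totals = defaultdict(float)
    for order_date, _order_id, amount in rows:
        totals[order_date] += amount
    return dict(totals)


daily_rows = roll_up_to_daily(order_rows)
print(daily_rows)
```

If only the daily totals were loaded into the fact table, a later request for per-order numbers would require changing the table structure and reloading from the source.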
Auditing and Lineage
• Data Warehouses Do Not Store Data As It Is Created
• Data Warehouses Are Populated From OLTP Data
▪ Based on various conditions
▪ At various times (weekly, monthly, etc.)
▪ From various sources
• Data Can Be Informative Based On Different Aspects
• These Characteristics Usually Change Over Time
• Auditing And Lineage Identify These Aspects
▪ Usually stored in tables
▪ Describe source, duration of load, who performed the load, etc.
• SQL Server Integration Services (SSIS)
▪ Provides SSIS logging
Simple Data Warehouse Example
▪ This Data Is Loaded Into The Fact Table
▪ This Data Is Loaded Into The Sales Dimension Table
▪ This Data Is Loaded Into The Sales Geography Dimension
Understanding Fact Tables
• A Fact Table Is A Collection Of Measurements
▪ Note the word ‘Measurements’
▪ About a specific business process
▪ A single, identifiable fact about a specific process
▪ Usually numeric
▪ Sales amount, order quantity
▪ Tax amount
▪ Discount amount
• Fact Tables May Contain Multiple Measurements
▪ If they are closely related
• A Data Warehouse Will Have Many Fact Tables
▪ Each one stores the measures for one specific business area
Understanding Dimensions
• Dimensions Give Context To Measures
▪ Measures are the ‘facts’ or measurable numbers in the fact table
▪ Dimensions give context, or specific meaning, to facts
▪ The term ‘Dimension’ usually refers to a table of related attributes
• Example:
▪ A Fact table contains numbers of products sold
▪ A DateDimension table contains the following ‘dimensions’
of dates pertaining to the number of products sold
▪ Date and time (11/17/2011 10:15:32)
▪ Quarter (4)
▪ DayOfYear (321)
▪ WeekDay (Thursday)
▪ Week (46, ISO week number)
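The date attributes in this example can be derived from a single timestamp. The sketch below uses only Python's datetime module; the function name, the yyyymmdd DateKey format, and the choice of ISO week numbering are assumptions (other week-numbering conventions give different week values for the same date).

```python
from datetime import datetime

WEEKDAY_NAMES = (
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
)


def date_attributes(moment):
    """Derive date-dimension attributes from a single datetime value."""
    return {
        "DateKey": moment.year * 10000 + moment.month * 100 + moment.day,
        "Quarter": (moment.month - 1) // 3 + 1,
        "DayOfYear": moment.timetuple().tm_yday,
        "WeekDay": WEEKDAY_NAMES[moment.weekday()],
        "IsoWeek": moment.isocalendar()[1],
    }


attrs = date_attributes(datetime(2011, 11, 17, 10, 15, 32))
print(attrs)
```

Calling the same function for every day in a range would produce the rows of a date dimension table.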
• Each Individual Column In A Dimension Table Is An Attribute
▪ Attributes usually compress or expand data detail
▪ Data can be ‘discretized’ into smaller, summarized groups
▪ Day (365 values)
▪ Weeks (52 values)
▪ Months (12 values)
▪ Quarters (4 values)
▪ Data can also be ‘drilled into’ for more detailed
information
▪ Hour of the day
▪ Minutes of the hour
▪ Seconds
▪ Milliseconds
▪ Etc.
Dimension Column Types
• A Dimension Table Usually Stores More Than Attributes
▪ It stores data that is not in the fact table
• A Dimension Table Can Have Up To Five Column Types
▪ Attributes
▪ Name
▪ Key
▪ Member Properties
▪ Lineage
• Attributes Column
▪ Give context to measures
▪ Used by tools to create pivot tables, drill downs, etc.
• Name Column
▪ Used to make the reported data easier to read
▪ Provides human-readable names to entities (Customers,
orders, products, etc.)
• Key Column
▪ Used to uniquely identify entities and establish relationships
• Member Property Column
▪ Data included for descriptive use on reports, etc.
▪ Addresses, phone numbers, descriptions, etc.
• Lineage Column
▪ Used to store auditing, source info
Understanding Slowly Changing Dimensions
• Dimensions Provide Description Or Meaning For Fact Table
Data
• Some Dimension Data May Change Over Time
▪ Customer Last Name
▪ Customer Address
▪ Could affect Region, Country, State, City, Zip, etc.
• What Happens When Dimensions Data Changes?
▪ Historical accuracy can be lost
• Example:
▪ OLTP Data
▪ Customer’s address is Atlanta, GA
▪ The customer orders 12,768 products over 12 months
▪ The customer moves to Pittsburgh, PA
▪ If Customer Dimension Data Is Changed From Atlanta, GA to Pittsburgh, PA
▪ Historical reports now show those 12,768 products as being purchased from Pittsburgh, PA
▪ Wait a minute…
• Two Main Solutions For SCDs
▪ Type 1 SCD
▪ Type 2 SCD
• Type 1 SCD
▪ OLTP updates are moved into the DW
▪ Any changes overwrite the current DW data
▪ Past actual data history is lost
• Type 2 SCD
▪ Data is not overwritten in the DW
▪ A new row for the customer must be inserted
▪ Usually creates primary key problems
▪ You must now add a Surrogate Key (Data Warehouse Key)
▪ Uniquely identifies every row in the dimension table
▪ You must also add another column or two
▪ To flag the current value
▪ To provide date/time perspective
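A Type 2 change can be sketched as "expire the current row, then insert a new one". The example below uses Python's sqlite3 module in place of SQL Server; the dim_customer layout, the column names (valid_from, valid_to, is_current), the dates, and the apply_type2_change helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate (data warehouse) key
        customer_id TEXT,                                -- business key from the OLTP source
        city TEXT,
        valid_from TEXT,
        valid_to TEXT,                                   -- NULL while the row is current
        is_current INTEGER
    );
    INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
    VALUES ('C1', 'Atlanta, GA', '2011-01-01', NULL, 1);
""")


def apply_type2_change(conn, customer_id, new_city, change_date):
    """Expire the customer's current row and insert a new current row."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )


apply_type2_change(conn, "C1", "Pittsburgh, PA", "2012-01-01")
history = conn.execute(
    "SELECT customer_key, city, is_current FROM dim_customer ORDER BY customer_key"
).fetchall()
print(history)
```

Fact rows loaded before the move keep pointing at surrogate key 1, so historical reports still show Atlanta. A Type 1 change would instead be a single UPDATE of the city, which overwrites that history.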
Creating Our Data Warehouse Database
• Right-click Databases and select New Database.

• On the General page, enter a database name, then click the Owner browse button.

• Click Browse and add the Administrator user.

• Change Initial Size (MB) from 3 to 100.

• Click Autogrowth / Maxsize and set In Megabytes to 10.

• On the Options page, change the Recovery model to Simple.

Identifying Our Dimensions
• Three Dimensions
▪ Customer Dimension
▪ Products Dimension
▪ Date Dimension
• We’ll Load Them Using SQL Server Data Tools (SSIS)
▪ Familiarizes you with various aspects of SSIS
Identifying Our Fact Table
• Our Fact Table Will Include:
▪ Data loaded directly from the source
▪ Data calculated during the data load
Understanding Indexing
• Indexing Affects How Data Is Stored And Managed In SQL
Server
• There Are Four Main Indexing Options In SQL Server
▪ Clustered Index
▪ Non-Clustered Index
▪ Filtered Non-Clustered Index
▪ Columnstore Index
• Clustered Index
▪ Determines the physical storage order of the data
▪ There can be only one clustered index on a table
• Non-Clustered Index
▪ Sorts data in a column or columns and stores pointers to the actual data row
▪ You can have up to 999 non-clustered indexes on a table
▪ Non-clustered indexes slow down data modification, because each index must be maintained
• Filtered Non-Clustered Index
▪ Creates a non-clustered index on a subset of values in a column
• Columnstore Index
▪ A non-clustered index that stores data by column rather than by row
▪ Each column is stored and searched separately from the data row
▪ In SQL Server 2012, adding a columnstore index to a table makes the table read-only
• SQL Server Stores Data In Tables In Two Forms
▪ Heap
▪ Data is stored in the order in which it is added to the table
▪ New rows are added to the bottom of the data list
▪ Balanced tree (B-tree)
▪ Data is ordered based on the clustered index key
Indexing The Data Warehouse
• Indexing In The Data Warehouse Can Be Tricky
▪ Too few indexes will allow data loads to be quick
▪ But query response times will be slow
▪ Too many indexes and data loads slow down and storage
requirements go up
▪ But query response is good
• General Rule Of Thumb
▪ Dimension tables
▪ Place clustered index on the surrogate key
▪ If the table has a lot of columns, create non-clustered indexes on
the most popular columns
▪ Popular=most often used in queries
▪ Fact tables
▪ Place a non-clustered index on the single-column foreign keys to
the dimension tables
▪ If the primary key is a composite of all the dimension foreign keys,
make it a non-unique clustered index
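The foreign-key indexing rule can be tried out on a small scale. This sketch uses Python's sqlite3 module, which has no clustered/non-clustered distinction like SQL Server's, so it only illustrates an ordinary index on a fact-table foreign key and a partial index (SQLite's counterpart to a filtered index). The table, index names, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product_key INTEGER, customer_key INTEGER, sales_amount REAL);
    -- Ordinary index on a single-column foreign key.
    CREATE INDEX ix_fact_sales_product ON fact_sales (product_key);
    -- Partial index: covers only the rows that match the WHERE condition.
    CREATE INDEX ix_fact_sales_large ON fact_sales (customer_key) WHERE sales_amount > 1000;
""")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(i % 10, i % 7, float(i)) for i in range(2000)],
)

# List the indexes on the table (column 1 of index_list is the index name).
index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list(fact_sales)"))

# Ask the query planner how it would filter on the indexed foreign key.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(sales_amount) FROM fact_sales WHERE product_key = 3"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan_rows)
print(index_names)
print(plan_text)
```

The plan text names the index the planner chose, which is a quick way to check whether an index is actually used by a typical query before paying its cost during data loads.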
Understanding Indexed Views
• What Is A View?
▪ A result set of a query that is a virtual table
▪ The virtual table is not stored permanently in the database
▪ The view can be referenced like a table in Transact-SQL
• Indexing A View
▪ You can create a unique clustered index on a view
▪ The view’s result set is now stored in the database, just like
a regular table with a clustered index
• Advantages Of Indexed Views
▪ Improve the performance of joins and aggregations that
process many rows
Understanding Data Compression
• SQL Server 2012 Supports Data Compression
▪ Data compression reduces the size of the database
▪ Packs more data onto fewer data pages
▪ Fewer data page reads required to satisfy queries
▪ Lower IO means faster response; lower processing load on
server
▪ Minor issue: extra CPU resources are required for data updates
▪ Rarely a problem in data warehousing, where data is mostly read
• SQL Server 2012 Supports Three Compression Types
▪ Page compression
▪ Focuses on duplicate values within the data page
▪ Stores one value; places a pointer at all other locations
▪ Row compression
▪ Removes any unused bytes in a fixed-length data type
▪ For example, a CHAR(25) column holding a shorter value
▪ Unicode compression
▪ Reduces storage space for Unicode data that doesn’t require the full space
• Which Compression Should You Use?
▪ Page compression
▪ Page compression automatically includes row compression
▪ Fact Tables Usually Benefit The Most From Compression
▪ Note!
▪ In SQL Server 2012, compression is only available in Enterprise Edition
▪ See SQL Books Online For Details And Implementation
Using Partitions
• Fact Tables Become Very Large Tables Over Time
• Very Large Database Tables Present Serious Challenges
▪ What if you need to delete a large portion of the data?
▪ The TRUNCATE TABLE command deletes rows with minimal logging…
▪ But it deletes the entire table data
▪ Large Data Inserts Can Become Time Consuming
▪ Index maintenance and storage can become problematic
▪ Table Partitions Deal With All These Issues
• What Is A Table Partition?
▪ A large table is stored in multiple files
▪ Divided horizontally (rows) based on a condition
▪ Usually date/time
▪ SQL Server 2012 allows up to 15,000 partitions on a single
table
▪ Partitions and data are managed in the background
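The "divided horizontally based on a condition" idea can be sketched without a database. The function below maps a date to a partition number given a sorted list of boundary dates, where each boundary value belongs to the partition on its right. The yearly boundaries and the function name are hypothetical, and this only mimics the routing logic, not SQL Server's storage.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical yearly boundaries; each boundary value starts a new partition.
boundaries = [date(2010, 1, 1), date(2011, 1, 1), date(2012, 1, 1)]


def partition_number(order_date):
    """Return the 1-based partition that a row with this date falls into."""
    # bisect_right counts how many boundaries are <= order_date.
    return bisect_right(boundaries, order_date) + 1


for sample in (date(2009, 6, 1), date(2010, 1, 1), date(2011, 11, 17), date(2012, 5, 1)):
    print(sample, "-> partition", partition_number(sample))
```

Three boundaries give four partitions. Removing a whole year of fact data then means emptying one partition rather than deleting rows one by one from a very large table.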
• Partitioning Offers Many Advantages
Data Lineage
• What Is Data Lineage?
▪ It depends on who you ask
▪ Best definition…
▪ Data origination and flow details
▪ Where it is from, where it is going, how it is transformed in the
process
▪ Same concept as comments in programming
▪ A note to self
• Why Do We Need Data Lineage?
▪ To provide meta-data context in the data warehouse
▪ Data can come from many locations at various times
▪ Future business rules may change, affecting some data
▪ Making it invalid
▪ Making it suspect
▪ Making it more important
▪ Data lineage allows us to identify this data
• Two Main Options For Adding Data Lineage
▪ SSIS system variables
▪ If you are using SSIS
▪ T-SQL system functions
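Whichever option is used, the result is a few extra columns filled in at load time. The sketch below shows the idea with Python's sqlite3 module; the lineage column names, the source and loader labels, and the load_products helper are hypothetical. In T-SQL the same values would typically come from system functions such as GETDATE() and SUSER_SNAME().

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        lineage_source TEXT,     -- where the row came from
        lineage_loaded_at TEXT,  -- when it was loaded
        lineage_loaded_by TEXT   -- who or what performed the load
    )
""")


def load_products(conn, names, source, loaded_by):
    """Insert product rows and stamp each one with the same lineage values."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO dim_product "
        "(product_name, lineage_source, lineage_loaded_at, lineage_loaded_by) "
        "VALUES (?, ?, ?, ?)",
        [(name, source, loaded_at, loaded_by) for name in names],
    )


load_products(conn, ["Bike", "Helmet"], source="oltp_sales", loaded_by="nightly_load")
lineage = conn.execute(
    "SELECT product_name, lineage_source, lineage_loaded_by "
    "FROM dim_product ORDER BY product_key"
).fetchall()
print(lineage)
```

If a business rule later makes one source suspect, the affected rows can be found with a simple filter on the lineage columns.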
• Right-click AdventureWorksDW2012, select New Query, and run the provided script.
Creating Our Dimensions
• In the VTCDW database, right-click Tables and create the new tables with a SQL query.
• Create the fact table and dimension tables under Tables by using the SQL scripts (Create Fact Table.txt; CreateDimensionTables.txt).
• Click Execute to run the scripts.

Creating Our Fact Table
• The second method is to execute the script from a file. Go to File, select Open, click File, and choose the script file from your folder. Click Execute once the script is loaded.
Creating Our Relationships
• Right-click Database Diagrams and click New Database Diagram.

• Click Yes to create the database diagram.

• Click Add to add all the tables.

• Click Close in the Add Table dialog.

• Create a relationship by selecting and dragging from the dimension table to the fact table as a foreign key (ProductKey to ProductKey).
• Make sure the primary key and foreign key values are correct. Click OK.
• Repeat for the date dimension (DateKey to OrderDate). Click OK.
• Repeat for the customer dimension (CustomerKey to CustomerKey). Click OK.
• Click Yes to make changes to the selected diagram.

• Enter a name for the diagram. Click OK.

• Click Yes to save all the diagrams.

• Under Tables, expand dbo.FactInternetSales and select Keys to review all the foreign keys.
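The same three relationships can be declared and checked in code. This sketch uses Python's sqlite3 module as a stand-in for SQL Server. The key pairs (ProductKey to ProductKey, CustomerKey to CustomerKey, DateKey to OrderDate) follow the steps above, while the dimension table names (DimProduct, DimCustomer, DimDate) and the stripped-down column lists are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY);
    CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY);
    CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY);
    CREATE TABLE FactInternetSales (
        ProductKey INTEGER REFERENCES DimProduct(ProductKey),
        CustomerKey INTEGER REFERENCES DimCustomer(CustomerKey),
        OrderDate INTEGER REFERENCES DimDate(DateKey)
    );
""")

# foreign_key_list rows: (id, seq, parent table, from column, to column, ...).
fk_rows = conn.execute("PRAGMA foreign_key_list(FactInternetSales)").fetchall()
foreign_keys = sorted((row[3], row[2], row[4]) for row in fk_rows)
print(foreign_keys)

# A fact row that points at missing dimension rows is rejected.
try:
    conn.execute("INSERT INTO FactInternetSales VALUES (999, 999, 99991231)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

Listing the keys this way plays the same role as opening the Keys folder: it confirms that each fact column points at the intended dimension column.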
Q&A
Best of luck !!!!
