
Microsoft Official Course

Module 4

Designing an ETL Solution


Module Overview

• ETL Overview
• Planning Data Extraction
• Planning Data Transformation
• Planning Data Loads
Lesson 1: ETL Overview

• ETL in a BI Project
• Common ETL Data Flow Architectures
• Documenting High-Level Data Flows
• Creating Source To Target Mappings
ETL in a BI Project

Business Requirements

Technical Architecture and Infrastructure Design
Data Warehouse and ETL Design
Reporting and Analysis Design

Monitoring and Optimizing

Operations and Maintenance


Common ETL Data Flow Architectures

• Single-stage ETL (Source → DW)
  • Data is transferred directly from source to data warehouse
  • Transformations and validations occur in-flight or on extraction
• Two-stage ETL (Source → Staging → DW)
  • Data is staged for a coordinated load
  • Transformations and validations occur in-flight, or on staged data
• Three-stage ETL (Source → Landing Zone → Staging → DW)
  • Data is extracted quickly to a landing zone, and then staged prior to loading
  • Transformations and validation can occur throughout the data flow
Documenting High-Level Data Flows

(Diagram: high-level data flow from the ProductDB source to the DimProduct dimension table)

Source: ProductDB (Product, Subcategory, and Category tables)
• Audit Start
• Filter on LastModified
• Concatenate Size (Size + ' ' + MeasureUnit)
• Lookup Subcategory
• Lookup Category
• Handle NULLs*
• Update SCD1 rows (ProductName)
• Update and insert SCD2 rows (Category, Subcategory, Size, Color) (generate surrogate key)
• Insert new rows (generate surrogate key)
• Audit End
Destination: DimProduct

*NULL Handling Rules
• Change NULL Subcategory and Category to "Uncategorized"
• Redirect rows with null ProductName
Creating Source To Target Mappings
Lesson 2: Planning Data Extraction

• Profiling Source Systems
• Identifying New and Modified Rows
• Planning Extraction Windows
Profiling Source Systems

• What data sources are there, and how will the ETL solution connect to them?
• What data types and formats are used in each source system?
• What data integrity and validation issues exist in the source data?
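
As an illustration of the kind of query that can answer the data integrity question, here is a minimal profiling sketch; the src.Product table and its columns are assumed names, and the SSIS Data Profiling Task can produce similar statistics.

-- Sketch: profile a hypothetical src.Product source table for row counts,
-- distinct values, NULLs, and maximum lengths. All names are illustrative.
SELECT COUNT(*)                                             AS TotalRows,
       COUNT(DISTINCT Color)                                AS DistinctColors,
       SUM(CASE WHEN Size IS NULL THEN 1 ELSE 0 END)        AS NullSizeRows,
       SUM(CASE WHEN ProductName IS NULL THEN 1 ELSE 0 END) AS NullNameRows,
       MAX(LEN(ProductName))                                AS MaxProductNameLength
FROM src.Product;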
Identifying New and Modified Rows

• Data modification time fields
• Database modification tracking functionality
• Custom change detection functionality
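
A minimal sketch of the first option, extracting rows by a data modification time field; the LastModified column and the etl.ExtractionLog control table are assumed names. SQL Server's Change Data Capture and Change Tracking features are examples of built-in database modification tracking.

-- Sketch: extract only rows changed since the last successful extraction.
-- The LastModified column and etl.ExtractionLog control table are illustrative.
DECLARE @LastExtract datetime =
    (SELECT MAX(LastExtractTime)
     FROM etl.ExtractionLog
     WHERE TableName = N'Product');

SELECT ProductID, ProductName, Size, Color, LastModified
FROM src.Product
WHERE LastModified > @LastExtract;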
Planning Extraction Windows

• How frequently is new data generated in the source systems, and for how long is it retained?
• What latency between changes in a source system and reporting is tolerable?
• How long does data extraction take?
• During what time periods are source systems least heavily used?
Lesson 3: Planning Data Transformation

• Where to Perform Transformations
• Transact-SQL vs. Data Flow Transformations
• Handling Invalid Rows and Errors
• Logging Audit Information
Where to Perform Transformations

• On extraction
  • From source
  • From landing zone
  • From staging
• In data flow
  • Source to landing zone
  • Landing zone to staging
  • Staging to data warehouse
• In-place
  • In landing zone
  • In staging

(Diagram: Source → Landing Zone → Staging → Data Warehouse)
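
To illustrate the in-place option, a minimal sketch of a transformation applied directly to staged data; the stg.Customer table and its columns are assumed names.

-- Sketch: an in-place transformation run against a staging table after the
-- extract completes and before the warehouse load. All names are illustrative.
UPDATE stg.Customer
SET MembershipLevel = ISNULL(MembershipLevel, N'Unknown'),
    CustomerName    = LTRIM(RTRIM(CustomerName));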
Transact-SQL vs. Data Flow Transformations

SELECT CAST(c.CustomerID AS nvarchar(5)) AS CustomerAltKey,
       CONVERT(nvarchar(50), c.FirstName + ' ' + c.LastName) AS CustomerName,
       ISNULL(m.MembershipLevelName, 'Unknown') AS MembershipLevel
FROM src.Customers AS c
LEFT OUTER JOIN src.MembershipLevels AS m
  ON c.MembershipLevel = m.MembershipLevelID;
Handling Invalid Rows and Errors
Logging Audit Information

• Use SSIS logging to record package execution events
• Consider using an audit dimension to track inserts and updates in data warehouse tables
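
A minimal sketch of an audit dimension and the kind of row an ETL run might record at the start of a load; the dbo.DimAudit table and its columns are assumed names, and a real design would be shaped by the auditing requirements.

-- Sketch: a simple audit dimension with one row per ETL execution.
-- All names are illustrative.
CREATE TABLE dbo.DimAudit
(
    AuditKey       int IDENTITY(1,1) PRIMARY KEY,
    PackageName    nvarchar(100) NOT NULL,
    ExecutionStart datetime NOT NULL,
    ExecutionEnd   datetime NULL,
    RowsInserted   int NULL,
    RowsUpdated    int NULL
);

-- At the start of a load, insert a row and capture its key so that rows loaded
-- by this run can reference it.
INSERT INTO dbo.DimAudit (PackageName, ExecutionStart)
VALUES (N'Load DimProduct', GETDATE());

SELECT SCOPE_IDENTITY() AS AuditKey;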
Lesson 4: Planning Data Loads

• Minimizing Logging
• Loading Indexed Tables
• Loading Partitioned Fact Tables
• Demonstration: Loading a Partitioned Fact Table
Minimizing Logging

• Set the data warehouse recovery model to simple or bulk-logged
• Consider enabling trace flag 610
• Use a bulk load operation to insert data:
  • An SSIS data flow destination with the Fast Load option
  • The bulk copy program (BCP)
  • The BULK INSERT statement
  • The INSERT … SELECT statement
  • The SELECT INTO statement
  • The MERGE statement
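
A hedged sketch of a minimally logged load, assuming a dw.FactSales target and a stg.Sales staging table; the database and object names are illustrative, and whether the operation is actually minimally logged depends on the target's indexes and the minimal-logging conditions.

-- Sketch: prepare the database for minimal logging, then bulk-load from staging.
-- Database, table, and column names are illustrative.
ALTER DATABASE AdventureWorksDW SET RECOVERY BULK_LOGGED;

-- INSERT ... SELECT with TABLOCK can be minimally logged when the target meets
-- the minimal-logging requirements (for example, an empty heap, or trace flag
-- 610 for indexed targets).
INSERT INTO dw.FactSales WITH (TABLOCK)
    (ProductKey, OrderDateKey, SalesAmount)
SELECT ProductKey, OrderDateKey, SalesAmount
FROM stg.Sales;

ALTER DATABASE AdventureWorksDW SET RECOVERY SIMPLE;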
Loading Indexed Tables

• Consider dropping and recreating indexes for large volumes of new data
• Sort data by the clustering key and specify the ORDER hint
• Columnstore indexes make the table read-only
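
A hedged sketch of the disable-and-rebuild pattern combined with the ORDER hint; the table, index, column, and file names are all assumed for illustration.

-- Sketch: disable a nonclustered index before a large load, bulk insert data
-- sorted by the clustering key using the ORDER hint, then rebuild the index.
-- All names and the file path are illustrative.
ALTER INDEX IX_FactSales_ProductKey ON dw.FactSales DISABLE;

BULK INSERT dw.FactSales
FROM 'D:\Staging\FactSales.dat'
WITH (TABLOCK, ORDER (OrderDateKey ASC));

ALTER INDEX IX_FactSales_ProductKey ON dw.FactSales REBUILD;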
Loading Partitioned Fact Tables

• Switch loaded tables into partitions
• Partition-align indexed views
Demonstration: Loading a Partitioned Fact Table

In this demonstration, you will see how to:
• Split a partition
• Create a load table
• Switch a partition
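
A hedged sketch of the split / load-table / switch pattern the demonstration covers, assuming a dw.FactResellerSales table partitioned on ShipDateKey by a pf_ShipDate function and ps_ShipDate scheme; every object name, column, filegroup, and boundary value here is illustrative.

-- Sketch of the split / create load table / switch pattern. All object names,
-- columns, filegroups, and boundary values are illustrative.

-- 1. Split the rightmost (empty) partition to create a boundary for the new data.
ALTER PARTITION SCHEME ps_ShipDate NEXT USED DataFG;
ALTER PARTITION FUNCTION pf_ShipDate() SPLIT RANGE (20140101);

-- 2. Create a load table on the same filegroup as the target partition and bulk
--    load the new rows into it. The load table must match the target table's
--    structure, indexes, and constraints before it can be switched.
CREATE TABLE dw.FactResellerSales_Load
(
    ProductKey  int NOT NULL,
    ShipDateKey int NOT NULL,
    SalesAmount money NOT NULL
) ON DataFG;

-- (bulk load dw.FactResellerSales_Load here)

-- A check constraint matching the partition boundaries is required for the switch.
ALTER TABLE dw.FactResellerSales_Load WITH CHECK
ADD CONSTRAINT CK_Load_ShipDateKey
CHECK (ShipDateKey >= 20140101 AND ShipDateKey IS NOT NULL);

-- 3. Switch the load table into the corresponding partition of the fact table.
ALTER TABLE dw.FactResellerSales_Load
SWITCH TO dw.FactResellerSales PARTITION $PARTITION.pf_ShipDate(20140101);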
Lab: Designing an ETL Solution

• Exercise 1: Preparing for ETL Design
• Exercise 2: Creating Source to Target Documentation
• Exercise 3: Using SSIS To Load a Partitioned Fact Table
Logon Information
Start 20467B-MIA-DC and 20467B-MIA-SQLBI, and then log onto
20467B-MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.

Estimated Time: 120 Minutes


Lab Scenario
You have designed a data warehouse for Adventure Works
Cycles and must now design the ETL processes that will be
used to load data from source systems into the data
warehouse.
You have decided to focus your design on the Reseller Sales
and Internet Sales dimensional models in the data
warehouse, so you can ignore the financial accounts and
marketing campaigns fact tables and their related
dimension tables.
The source data resides in a number of different sources, and you must
examine each one to determine the columns and data types
and discover any data validation or quality issues. Then you
must design the ETL data flows for the tables involved in the
Reseller Sales and Internet Sales dimensional models.
Finally, you must design SSIS packages to load data into the
partitioned fact tables.
Lab Review

• Compare the source-to-target documentation in the D:\Labfiles\Lab04\Solution folder with your own documentation. What significant differences are there in the suggested solutions compared to your own, and how would you justify your own solutions?
• How might your design of the SSIS package that loads the FactResellerSales table have differed if the table were partitioned on OrderDateKey instead of ShipDateKey?
Module Review and Takeaways

• In what scenarios would you consider using Transact-SQL for transformations, and in what scenarios are SSIS data flow transformations appropriate?
