Services
Prepared by:
ANJAR Fatima
EL MECHRY Mariem
The objective of this lab is to explore the capabilities of SQL Server Integration
Services (SSIS) for data quality by:
- Profiling data using SSIS to understand and analyze source data effectively.
- Cleansing data using SSIS and Data Quality Services (DQS) to improve data
quality.
SSIS and Data Quality:
SSIS provides tools for performing ETL processes on data. In this lab, we focus on
two key SSIS components: the Data Profiling task and the DQS Cleansing
transformation. They help us understand the source data and automate its cleansing.
After setting up profiles and saving them, we will use the Data Profile Viewer to
analyze profiles.
To profile data in the DQS_STAGING_DATA database, we performed the following steps
in SQL Server Data Tools (SSDT):
The DQS_STAGING_DATA database was selected as the target database for the
profiling task.
After selecting the database, we returned to the Single Table Quick Profile Form
dialog box.
The "CustomersDirty" view was chosen from the "Table Or View" drop-down list.
Different profiles were selected from the "Profile Type" list on the right.
For the "Column Value Distribution Profile Request," the column property was
changed from (*) to "Occupation" to profile this specific column.
The column property for the "Column Pattern Profile Request" was changed from (*)
to "EmailAddress."
This data profiling exercise aims to uncover anomalies, patterns, and potential data
quality issues within the dataset, offering valuable insights for subsequent data
cleansing and improvement efforts.
To do so, we performed the following task:
● Opened the saved profile file in the Data Profile Viewer.
In the following step, we examined the distribution of values within the "Occupation"
column, with a focus on identifying and investigating values that occur with very low
frequency, such as a rare variant of the "Professional" value. The drill-down
functionality showed the count of each value and its frequency as a percentage,
which we could compare with the other values. A value that appears this rarely next
to a frequent, near-identical one is most likely a typo.
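The idea behind a value distribution profile can be sketched in a few lines of Python. This is only an illustration of the principle, not how SSIS computes the profile: the function names, the 1% rarity threshold, and the sample occupation values (including the misspelled "Clercal") are all assumptions made up for the example.

```python
from collections import Counter


def value_distribution(values):
    """Return (value, count, percentage) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c, 100.0 * c / total) for v, c in counts.most_common()]


def rare_values(values, max_percent=1.0):
    """Return values whose share is at or below max_percent (hypothetical threshold)."""
    return [v for v, _, pct in value_distribution(values) if pct <= max_percent]


# Made-up sample column; "Clercal" stands in for a low-frequency typo.
sample = ["Manual"] * 120 + ["Clerical"] * 79 + ["Clercal"] * 1

for value, count, pct in value_distribution(sample):
    print(f"{value:10} {count:4d} {pct:6.2f}%")

print("Candidates for review:", rare_values(sample))
```

A value flagged this way is only a candidate: a rare value can also be legitimate, so it still has to be checked by hand, as we did with the drill-down in the Data Profile Viewer.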
We then examined the Column Pattern profile for EmailAddress. The regular
expression patterns displayed for this column are the same as the ones defined
when preparing the knowledge base in earlier labs.
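To illustrate what a pattern check on an e-mail column does, here is a minimal Python sketch. The regular expression is a simplified assumption written for this example, not the pattern reported by the Data Profile Viewer or the one stored in the knowledge base, and the addresses are invented.

```python
import re

# Simplified, hypothetical e-mail pattern: local part, "@", domain with a dot.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def pattern_coverage(values, pattern=EMAIL_PATTERN):
    """Return the percentage of values matching the pattern and the failing values."""
    failing = [v for v in values if not pattern.fullmatch(v)]
    matched = len(values) - len(failing)
    pct = 100.0 * matched / len(values) if values else 0.0
    return pct, failing


# Invented addresses: the last two are malformed on purpose.
emails = [
    "anna1@example.com",
    "omar22@example.com",
    "lina35example.com",   # missing "@"
    "sami5@example",       # missing top-level domain
]

coverage, bad = pattern_coverage(emails)
print(f"{coverage:.1f}% of values match the pattern")
print("Values to inspect:", bad)
```

The Column Pattern profile works in the opposite direction (it derives patterns from the data and reports how many values each one covers), but the resulting question is the same: which values fall outside the expected format.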
Finally, we reviewed the other profile results before completing the analysis step.
The data profiling exercise provided valuable insights into the distribution, patterns,
and potential anomalies within the dataset. These findings are crucial for
subsequent data cleansing and improvement efforts, ensuring a higher level of data
quality in the Dim Customers data.
Data Cleansing with SSIS:
The objective of the following part is to demonstrate the process of data cleansing
using SSIS DQS Cleansing transformation. This involves preparing data and building
an SSIS package to cleanse the dirty data.
Created the dirty data table (CustomersDirtyT) and populated it with every tenth
customer from the DimCustomer table.
Updated a specific row in the dirty data table (CustomersDirtyT) to create a scenario
for correction with the DQS Cleansing transformation.
Created two staging tables in the dbo schema (CustomersDirtyMatchT and
CustomersDirtyNoMatchT) to store the results of the cleansing process.
CustomersDirtyNoMatchT uses the same schema as CustomersDirtyMatchT.
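The preparation steps above can be sketched as follows. This is a rough illustration using Python's built-in sqlite3 module as a stand-in for SQL Server, so the SQL dialect differs from the T-SQL used in the lab. The column list, the 100 generated rows, the modulo rule for picking every tenth customer, and the row and value chosen for the update are all assumptions; only the table names come from the lab.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-in for DimCustomer with 100 hypothetical rows and a reduced column list.
cur.execute(
    "CREATE TABLE DimCustomer ("
    "CustomerKey INTEGER PRIMARY KEY, FullName TEXT, Occupation TEXT)"
)
cur.executemany(
    "INSERT INTO DimCustomer VALUES (?, ?, ?)",
    [(k, f"Customer {k}", "Clerical") for k in range(1, 101)],
)

# Populate the dirty table with every tenth customer (assumed rule: key % 10 = 0).
cur.execute(
    "CREATE TABLE CustomersDirtyT AS "
    "SELECT CustomerKey, FullName, Occupation "
    "FROM DimCustomer WHERE CustomerKey % 10 = 0"
)

# Corrupt one row so the cleansing step has something to correct (made-up value).
cur.execute("UPDATE CustomersDirtyT SET Occupation = 'Clercal' WHERE CustomerKey = 10")

# Two empty staging tables with the same columns as the dirty table.
for name in ("CustomersDirtyMatchT", "CustomersDirtyNoMatchT"):
    cur.execute(f"CREATE TABLE {name} AS SELECT * FROM CustomersDirtyT WHERE 0")
conn.commit()

row_count = cur.execute("SELECT COUNT(*) FROM CustomersDirtyT").fetchone()[0]
dirty_rows = cur.execute(
    "SELECT CustomerKey, Occupation FROM CustomersDirtyT WHERE Occupation = 'Clercal'"
).fetchall()
print(row_count, dirty_rows)
```

In the actual lab, the two staging tables are filled by the SSIS package after the DQS Cleansing transformation runs; the sketch only covers the data preparation.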
Summary
This lab demonstrated the practical use of SSIS for data profiling and cleansing. The
Data Profiling task helped identify potential data quality issues, while the DQS
Cleansing transformation improved the quality of data through corrections. The
exercises provided hands-on experience in configuring SSIS tasks, analyzing profiles,
and executing data cleansing processes.
The SSIS and DQS functionalities showcased in this lab offer valuable tools for
maintaining data quality and integrity in SQL Server environments.