
Introduction to Informatica Data Quality

Deloitte Consulting LLP


Agenda
Day 1: Data Quality, DQ Cycle, and Profiling (Basic & Advanced)
Day 2: Rule Builder & Scorecards
Day 3: Data Cleansing & Address Validator
Day 4: Data Governance & Exception Handling

Deloitte Consulting LLP 2013 2


Foundation – Data Quality
Data Quality is the ability of data to satisfy the stated business, system, and technical requirements
of an organization. This is typically measured in terms of completeness, timeliness, accuracy,
consistency, relevance, and integrity.

People
➢ Data Owner: The data owner is the one ultimately accountable for the data. The data owner provides the strategic vision and direction.
➢ Data Steward: The data steward is the “day-to-day” manager of the data. The steward works under the owner’s direction and handles all ongoing requests and maintenance.
➢ Data Quality Architect: The architect leads the design and development of all data quality implementation efforts.

Process
➢ Data Profiling: Identify data owners and stewards. Define business standards and rules. Perform profiling of data and generate data quality scorecards.
➢ Data Cleansing: Analyze profiling results and data standards. Cleanse data from the production environment: detect and correct erroneous data and data anomalies. Test the data and restore it to production.
➢ Data Monitoring: Create a baseline for all data and store it in a data quality repository. Establish monitoring intervals and evaluate the condition of data as an ongoing function.

Technology
➢ Data quality initiatives typically use a suite of tools to accomplish the following common activities:
▪ Data Profiling
▪ Production of Data Quality Scorecards
▪ Extract, Transform, and Load (ETL)
▪ Data Cleansing
▪ Data Monitoring

Value Proposition

➢ Quality of data is considered to be the number one issue facing companies today and is a prerequisite for all data-driven initiatives (e.g., data warehouse, system conversion).

Data Governance Structure
▪ Data Governance Council
▪ Data Owners
▪ Data Request Review Board
▪ Data Stewards
▪ Data Quality Architects
▪ Data Manager / Coordinator & Programmers



Process – Data Quality
The high-level data quality process can be decomposed into several more
detailed steps that are common among data quality initiatives. While there are
many ways to address data quality, the process below highlights some
best practices:

Profile
➢ Identify data owners and stewards.
➢ Define standards and business rules.
➢ Perform profiling of data and generate data quality scorecards.

Cleanse
➢ Analyze profiling results, data standards, and impact to downstream applications.
➢ Extract required data out of production into the cleansing environment.
➢ Cleanse the data, test the data, and restore the data to the production environment.

Monitor
➢ Create a baseline for all data and store it in a data quality repository.
➢ Establish monitoring intervals and infrastructure.
➢ Monitor the data as an ongoing function.

Platform Overview
Data Quality platform components (diagram):
▪ Data Profiling: Column Profiling, Cross Table Profiling, Rule Profiling, Scorecarding, Midstream Profiling, Mapping Generation
▪ Data Quality: Labeling, Standardization, Parsing, Address Validation, Fuzzy Matching, Consolidation
▪ Stewardship: Exception Management, Duplicate Resolution, Reference Data Management
▪ Integrations: Informatica PowerCenter, MDM Hub, Data Services, XMap, SAP / POE, Oracle
▪ Security: User Authorization, Group Authorization, User Authentication, Database Authentication, LDAP Integration


IDQ Solution - Process Flow



IDQ Overview
Informatica Data Quality is a data discovery product that you can use to find the content,
quality, and structure of data sources of an application, schema, or enterprise. The data source
content includes value frequencies and datatypes. The data source structure includes keys and
functional dependencies.
With Data Quality, you can use the Analyst tool and Developer tool to analyze the
source data and metadata. Analysts and developers can use these tools to collaborate, identify
data quality issues, and analyze data relationships.



Introduction to Data Profiling
Definition:
Data profiling is the systematic analysis of data to gather actionable and measurable
information about its quality. Information gathered from data profiling activities is used
to assess the overall health of the data and determine the direction of DQ initiatives.

Profiling is essential to MDM and involves anomaly analysis and assessment. Findings from this activity are a prelude to cleansing.

Data Quality Management is an enterprise-wide program to implement data quality and to enable its widespread adoption in business.

Dimensions of DQ provide a yardstick for measuring the quality of master data. Examples include accuracy, consistency, completeness, and timeliness.

Data Quality Management Framework (diagram):
▪ Components: Profiling, Cleansing, Dimensions, Controls, Enrichment, Service Levels
▪ View information as a strategic asset
▪ Driven by business value
▪ Linked to strategic initiatives
▪ Enabled with a comprehensive change management program

Data Profiling includes:
▪ Defining Dimensions
▪ Profiling Data

Data Profiling Dimensions
Some of the key data profiling dimensions are as follows. Any effective data quality
management program tracks the quality of data against these dimensions.

Attribute | Description | Example metric
Accuracy | Is the data free from error, with a high assessment corresponding to a small error? | Percent of values that are correct when compared to the actual value. E.g., M = Male when the subject is male.
Completeness | Are values present in all the attributes that require them? | Percent of data fields having values entered into them.
Consistency | Does the data set meet the required set of constraints? | Percent of matching values across tables/files/records. E.g., a customer ID with status as ‘Active’ and ‘Inactive’ across system(s).
Conformity | Are data values in standard format? | Percent of data available in standard format. E.g., customers with valid addresses.
Uniqueness | Is every record identified uniquely? Is the data free of (unplanned) duplicates? Does each object appear only once in a particular schema? | Percent of records having a unique primary key.
Validity | Does the data follow an adequate system of classification? Does the data meet ‘the rules’? | Percent of data having values that fall within their respective domain of allowable values. E.g., ‘Individual’ or ‘Corporate’ customer types.
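
The example metrics in the table can be computed directly from the data. The following is a minimal Python sketch for three of the dimensions (completeness, uniqueness, and validity). The sample records, field names, and function names are hypothetical and are not part of Informatica Data Quality.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
customers = [
    {"id": 1, "gender": "M", "customer_type": "Individual"},
    {"id": 2, "gender": None, "customer_type": "Corporate"},
    {"id": 2, "gender": "F", "customer_type": "Unknown"},
    {"id": 4, "gender": "M", "customer_type": "Individual"},
]


def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 for an empty data set."""
    return 100.0 * part / whole if whole else 0.0


def completeness(rows, field):
    """Percent of rows in which the field has a value entered."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return percent(filled, len(rows))


def uniqueness(rows, key):
    """Percent of rows whose key value occurs exactly once."""
    counts = Counter(r[key] for r in rows)
    unique = sum(1 for r in rows if counts[r[key]] == 1)
    return percent(unique, len(rows))


def validity(rows, field, allowed):
    """Percent of rows whose field value falls within the allowed domain."""
    valid = sum(1 for r in rows if r.get(field) in allowed)
    return percent(valid, len(rows))


print("Completeness (gender):", completeness(customers, "gender"))
print("Uniqueness (id):", uniqueness(customers, "id"))
print("Validity (customer_type):",
      validity(customers, "customer_type", {"Individual", "Corporate"}))
```

Accuracy and consistency are omitted because they need a trusted reference value or a second system to compare against.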
Benefits of Data Profiling
Data Cleansing Benefits
➢ Provides an initial assessment about the quality of data
➢ Helps discover various data anomalies within data elements, including but not
limited to unclear and multiple definitions of data, completeness of data, and
validity of data
➢ Helps discover level of uniqueness within a data set
➢ Helps discover data types and formats/patterns within data elements
➢ Helps discover relationships among data sets
➢ Data profiling results form a basis for defining the scope and approach of data
cleansing activities
Business Benefits
➢ Comprehensive data profiling supports the successful realization of business goals from
the implementation of MDM (master data management) and Data Quality (DQ)
projects
➢ Helps identify data errors at the source end and determine necessary corrective
action
➢ Reduces cost of managing data by removing inefficiency and redundancy within data
➢ Improves confidence in data among business users and customers
Data Quality Features in Informatica Developer Client
➢ Column Profile: A column profile determines the characteristics of columns in a
data source, such as value frequency, percentages, and patterns.
➢ Column profiling discovers the following facts about data:
▪ The number of unique and null values in each column, expressed as a number
and a percentage.
▪ The patterns of data in each column and the frequencies with which these
patterns occur.
▪ Statistics about the column values, such as the maximum and minimum
lengths of values and the first and last values in each column.
➢ Column Profile Options: Column profile options can be used to select the
columns on which the profile needs to be run, set data sampling options, and set
drilldown options when a profile is created.
➢ When a profile is created with the Column Profiling option, the profile wizard can
be used to define filter and sampling options. These options determine how the
profile reads rows from the data set.
➢ After completing the steps in the profile wizard, you can add a rule to the
profile.
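
The facts that column profiling discovers can be illustrated outside the tool. The sketch below computes null counts, distinct counts, length statistics, and first and last values for one column. It is only an illustration of the idea: the function name is hypothetical, "unique" is interpreted as distinct non-null values, and first and last values are taken in row order, which may differ from how the Developer tool reports them.

```python
def profile_column(values):
    """Compute simple column-profile statistics for a list of values."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    null_count = total - len(non_null)
    distinct_count = len(set(non_null))  # assumption: unique = distinct non-null
    return {
        "rows": total,
        "null_count": null_count,
        "null_percent": 100.0 * null_count / total if total else 0.0,
        "distinct_count": distinct_count,
        "distinct_percent": 100.0 * distinct_count / total if total else 0.0,
        "min_length": min(lengths, default=None),
        "max_length": max(lengths, default=None),
        # Assumption: first/last are taken in row order.
        "first_value": non_null[0] if non_null else None,
        "last_value": non_null[-1] if non_null else None,
    }


# Hypothetical column of state values.
stats = profile_column(["NY", "CA", None, "NY", "Texas"])
for name, value in stats.items():
    print(f"{name}: {value}")
```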



Data Quality Features in Informatica Developer Client
➢ Creating a single Data Object Profile:
➢ In the Object Explorer view, select the data object you want to profile.
➢ Click File > New > Profile to open the profile wizard.
➢ Select Profile and click Next.
➢ Enter a name for the profile and verify the project location. If required, browse
to a new location.
➢ Optionally, enter a text description of the profile.
➢ Verify that the name of the data object you selected appears within the Data
Objects section. Click Next.
➢ Configure the profile operations that you want to perform. You can configure
the following operations:
▪ Column profiling
▪ Primary key discovery
▪ Functional dependency discovery
▪ Data domain discovery
Data Quality Features in Informatica Developer Client
➢ Review the options for your profile:
▪ Review the column selection for all profile types.
▪ Review the filter and sampling options for column profiles.
▪ Review the inference options for primary key, functional dependency, and data
domain discovery.
▪ Review the data domain selection for data domain discovery.
➢ Review the drilldown options, and edit them if necessary.
➢ Click Finish.
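
To make the primary key discovery and functional dependency discovery operations concrete, the sketch below shows one simple way such checks can be expressed. It is an assumption-laden illustration: the function names and sample rows are hypothetical, and the checks are exact, whereas the profile wizard exposes inference options that are not modeled here.

```python
def is_candidate_key(rows, columns):
    """True if the column combination is non-null and unique in every row."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if None in key or key in seen:
            return False
        seen.add(key)
    return True


def holds_dependency(rows, determinant, dependent):
    """True if each determinant value maps to exactly one dependent value."""
    mapping = {}
    for row in rows:
        expected = mapping.setdefault(row[determinant], row[dependent])
        if expected != row[dependent]:
            return False
    return True


# Hypothetical address rows.
rows = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "New York"},
    {"id": 3, "zip": "94105", "city": "San Francisco"},
]

print("id is a candidate key:", is_candidate_key(rows, ["id"]))
print("zip is a candidate key:", is_candidate_key(rows, ["zip"]))
print("zip -> city holds:", holds_dependency(rows, "zip", "city"))
print("city -> id holds:", holds_dependency(rows, "city", "id"))
```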



Data Quality Features in Informatica Developer Client
➢ Column Value Properties: Column value properties show the values in the
profiled columns and the frequency with which each value appears in each
column. The frequencies are shown as a number, a percentage, and a bar chart.
➢ To view column value properties, select Values from the Show menu. Double-click
a column value to drill-down to the rows that contain the value.
➢ The following table describes the properties for column values:

Property | Description
Values | List of all values for the column in the profile.
Frequency | Number of times a value appears in a column.
Percent | Number of times a value appears in a column, expressed as a percentage of all values in the column.
Chart | Bar chart for the percentage.
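
The Values, Frequency, Percent, and Chart properties correspond to a simple frequency count. The sketch below reproduces them for a hypothetical column, with a text bar standing in for the chart. The function name and sample data are illustrative assumptions, and null values are counted like any other value here.

```python
from collections import Counter


def value_frequencies(values):
    """Return (value, frequency, percent) tuples, most frequent first."""
    total = len(values)
    counts = Counter(values)
    return [(value, freq, 100.0 * freq / total)
            for value, freq in counts.most_common()]


# Hypothetical status column.
statuses = ["Active", "Inactive", "Active", "Active"]

for value, freq, pct in value_frequencies(statuses):
    bar = "#" * round(pct / 10)  # one '#' per 10 percent, as a rough text chart
    print(f"{value!s:<10} {freq:>3} {pct:6.1f}% {bar}")
```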



Data Quality Features in Informatica Developer Client
➢ Column Pattern Properties: Column pattern properties show the patterns of
data in the profiled columns and the frequency with which the patterns appear in
each column. The pattern frequencies are shown as a number, a percentage, and a bar chart.
➢ To view pattern information, select Patterns from the Show menu. Double-click a
pattern to drill-down to the rows that contain the pattern.
➢ The following table describes the properties for column value patterns:

Property | Description
Patterns | Pattern for the selected column.
Frequency | Number of times a pattern appears in a column.
Percent | Number of times a pattern appears in a column, expressed as a percentage of all values in the column.
Chart | Bar chart for the percentage.
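
Pattern profiling can be sketched by mapping each character of a value to a character class and then counting the resulting patterns. The notation below (9 for a digit, X for a letter, other characters kept as-is) is an illustrative convention chosen for this example and is not necessarily the notation the Developer tool displays. Function names and sample values are hypothetical.

```python
from collections import Counter


def pattern_of(value):
    """Map digits to '9' and letters to 'X'; keep other characters unchanged."""
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("X")
        else:
            out.append(ch)
    return "".join(out)


def pattern_frequencies(values):
    """Return (pattern, frequency, percent) tuples, most frequent first."""
    total = len(values)
    counts = Counter(pattern_of(v) for v in values)
    return [(pattern, freq, 100.0 * freq / total)
            for pattern, freq in counts.most_common()]


# Hypothetical product codes; one value breaks the dominant pattern.
codes = ["AB-123", "CD-456", "12345", "EF-789"]

for pattern, freq, pct in pattern_frequencies(codes):
    print(f"{pattern:<10} {freq:>3} {pct:6.1f}%")
```

A low-frequency pattern such as the all-digit code above is a typical candidate for drill-down.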



Data Quality Features in Informatica Developer Client
➢ Exporting Profile Results:
▪ In the Object Explorer view, select and open a profile.
▪ Optionally, run the profile to update the profile results.
▪ Select the Results view.
▪ Select the column that contains the data for export.
▪ Under Details, select Values or select Patterns and click the Export button.
▪ The Export data to a file dialog box opens.
▪ Accept or change the file name.
▪ Select the type of data to export. You can select either Values for the selected
column or Patterns for the selected column.
▪ Under Save, choose Save on Client and click Browse to select a location and save
the file locally.
▪ If you do not need the field names as the first row, clear the Export field names as
first row check box.
▪ Click OK.



Advanced Profiling – Profile Model
Profile Model is an object that captures the metadata for an application or schema. It contains data
objects to access data used for profiling and verification and structural objects, such as relationships,
keys, and functional dependencies.
➢ Join Analysis: The degree of overlap between two data columns, displayed as a Venn diagram
and as a percentage value. Use join analysis profiles to identify possible problems with column
join conditions.
➢ Foreign Key Analysis: A column is a foreign key if its data values match the primary key column
values in another data object. You can perform foreign key discovery on multiple data objects in
the Developer tool. Create a profile model to select data objects and define the profile. Before
you perform foreign key discovery, you must identify the parent and child data objects in the
profile model. The profile uses one or more keys in the parent object, including its primary key, to
discover foreign keys in the child object.
➢ Overlap Discovery: Overlap discovery provides information about overlapping data in pairs of
columns within a data source or multiple data sources. You can find overlapping data from a
profile model. You can validate the profile results and view the results in a Venn diagram.
Overlap discovery identifies overlapping data based on either the default settings or the settings
you specify. You can override the default settings and specify inference options, including the
maximum number of top pairs the overlap discovery returns based on the percentage of overlap.
You can also specify a confidence level that determines the eligibility for overlap discovery.
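
The foreign key definition above (child column values that match primary key values in a parent object) can be sketched as a containment check. This is a simplified illustration with hypothetical names and data. The `min_percent` threshold is an assumption standing in for a confidence setting and does not reproduce how the Developer tool computes it.

```python
def foreign_key_match_percent(child_values, parent_keys):
    """Percent of non-null child values that exist among the parent keys."""
    parent = set(parent_keys)
    non_null = [v for v in child_values if v is not None]
    if not non_null:
        return 0.0
    matched = sum(1 for v in non_null if v in parent)
    return 100.0 * matched / len(non_null)


def is_foreign_key_candidate(child_values, parent_keys, min_percent=100.0):
    """Treat the child column as a foreign key candidate above a threshold."""
    return foreign_key_match_percent(child_values, parent_keys) >= min_percent


# Hypothetical parent primary keys and child references (9 is an orphan).
customer_ids = [1, 2, 3, 4]
order_customer_ids = [1, 1, 2, None, 9]

print("Match percent:",
      foreign_key_match_percent(order_customer_ids, customer_ids))
print("Strict candidate:",
      is_foreign_key_candidate(order_customer_ids, customer_ids))
print("Candidate at 70%:",
      is_foreign_key_candidate(order_customer_ids, customer_ids, 70.0))
```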




Join Analysis
➢ Perform a Join Analysis on Two Data Sources
▪ In this task, you perform a join analysis on the two data sources to view the join conditions.
▪ In the developer view, browse to the data objects in your tutorial project.
▪ Select the two required data sources.
▪ Tip: Hold down the Shift key to select multiple data objects.
▪ Right-click the selected data objects and select Profile. The New Profile wizard opens. Select Profile Model, and click Next.
▪ In the Name field, enter “Join_Demo”. Click Next.
▪ Verify that both tables appear in the Data Object column. Click Finish.
▪ The Join_Demo profile model appears in the developer view.
▪ Use your mouse to select both tables in the modeling canvas. Right-click a data object name and select Join Profile. The
New Join Profile wizard opens.
▪ In the Name field, enter Join Analysis. Verify that both tables appear as data objects. Click Next.
▪ Select the join key column in both data sources. Scroll down the wizard pane to view the columns in both data sets. Click
Next. Click Add to add join conditions. The Join Condition window opens.
▪ In the Columns section, click the New button.
▪ Double-click the first row in the left column and select Key1.
▪ Double-click the first row in the right column and select Key2.
▪ Click OK, and click Finish.
▪ The Join Analysis profile opens in the editor and the profile runs.
▪ Note: Do not close the profile. You view the profile results in the next task.



Join Analysis Results
➢ Perform a Join Analysis on Two Data Sources
▪ Click the Join Analysis tab in the modeling canvas. In the Join Profile section, click the first row.
▪ The Details section displays a Venn diagram and a key that details the results of the join analysis.
▪ Verify that the Join Rows column reports zero as the number of rows that contain a join.
▪ This indicates that none of the Key fields are duplicates, suggesting you can successfully merge the two data sources.
▪ To view the overlap key values for any data object/tables, double-click the circle labeled data object in the Venn diagram.
▪ Tip: Double-click the circles in the Venn diagram to view the data rows described by these items. In cases where circles
intersect in the Venn diagram, double-click the intersection to view data values common to both data sets.
▪ The Data Viewer displays the key values contained in one of the data objects.
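
The three regions of the Venn diagram and the join row count can be approximated with set operations. The sketch below is illustrative only: the function name and key values are hypothetical, and the join row count is computed as the number of matching row pairs, which is an assumption about how the Join Rows column is derived.

```python
from collections import Counter


def join_analysis(left_keys, right_keys):
    """Split key values into Venn regions and count joined row pairs."""
    left_counts, right_counts = Counter(left_keys), Counter(right_keys)
    left, right = set(left_counts), set(right_counts)
    both = left & right
    return {
        "left_only": sorted(left - right),
        "both": sorted(both),
        "right_only": sorted(right - left),
        # Assumption: each matching pair of rows counts as one join row.
        "join_rows": sum(left_counts[k] * right_counts[k] for k in both),
    }


# No shared key values: zero join rows, as in the walkthrough above.
print(join_analysis([1, 2, 3], [4, 5]))

# Overlapping keys: key 2 appears twice on the left and once on the right.
print(join_analysis([1, 2, 2], [2, 3]))
```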



Foreign Key & Overlap Analysis
Foreign key analysis and overlap discovery can be performed in the same way as join analysis.
Sample results for each are shown below:
Overlap Discovery:

Foreign Key Analysis:



Q&A
