
Understanding Data De-duplication

Data Quality Automation - providing a cornerstone
to your Master Data Management (MDM) strategy

Contents
Introduction
Example of de-duplication
Successful de-duplication
    Normalization
    Grouping
    Matching
    Merging
    How Transoft DBIntegrate can help
Transoft: the systems transformation company


Introduction
Data de-duplication is the process of identifying matching records across a variety of data sets
and merging them so that a single best-fit record remains, or a record built from the best-fit
fields of each duplicate: the "golden record". It forms the cornerstone of ensuring data quality,
especially when you are reviewing your MDM requirements.
The process allows a user to match records without a unique common identifier, such as a
customer ID number, and instead base a match on key information fields such as surnames,
company names or addresses.
The typical reasons duplicated data is created are:

- Lack of processes, such as not checking historical or archive records to see if they can be
  re-opened
- Inconsistent standards for formatting and abbreviations, such as using nicknames or
  substituting words for shorthand text, e.g. "Dr" for "doctor"
- Poor data validation, particularly when it comes to addresses
- Staff taking shortcuts, where it is quicker to set up a new record than to find the original
- System integration requirements to avoid overwriting old data, where whole new records
  are created, such as when a worker switches from a PAYE to an LTD pay scheme
- Poor training, where a user can't properly search databases.

Example of de-duplication
The table below shows example data that requires de-duplication, where long and short versions
of a person's name have been used and inconsistencies are apparent within the address data.

ID      Title  Forename      Surname  Addr1              Addr2    Town     Postcode  Pay status
210276  Mrs    Gillian       Rhodes   10 Rogers Lane     Langley  Slough   SL1 4GH   PAYE
102356  Ms     Gill          Rhodes   10 Rogers Lane              Slough   SL1 4HH   LTD
103556  Miss   Gillian Mary  Rhodes   1 Rogers Lane               Slough             LTD
103450  Mr     Matt          Turner   79 Stapleton Rd             Reading  RG1 MX7   PAYE
204576  Mr     Matthew       Tuner    79 Stapleton Road           Reading  RG1       PAYE

A common use for data de-duplication is the identification of replicated postal and email
addresses to remove redundancy from mail shots. For example, a marketing team has two lists
of email addresses from two data sources, one of which they have collated themselves and the
other of which has been bought from another organization; this leads to a very high potential for
duplicates. As company perceptions and customer satisfaction can be damaged by small issues
such as a customer receiving two emails, it is important that the data is correct and valid.
From a physical mail perspective, sending two letters or even two catalogues to the same
address can have significant cost implications.
From a process perspective, a marketing team would ideally be able to import their lists and run
a de-duplication project so that they are left with one de-duplicated list of addresses. This
requires a technical user to set up the project and validate that it is finding matches and not
erroneously pairing data. Once this has been done, the project can be rerun multiple times with
different data sets without any further input from the marketing team.
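As an illustration only (not tied to any particular tool), a minimal Python sketch of this email
de-duplication might look like the following; the file names and the lowercase-and-trim
normalization are assumptions for the example:

    # Minimal sketch: merge two email lists and remove duplicates.
    def load_addresses(path):
        """Read one email address per line, normalizing case and whitespace."""
        with open(path, encoding="utf-8") as f:
            return [line.strip().lower() for line in f if line.strip()]

    in_house = load_addresses("in_house_list.txt")    # collated in-house
    purchased = load_addresses("purchased_list.txt")  # bought-in list

    # dict.fromkeys keeps one entry per address, preserving first-seen order.
    merged = list(dict.fromkeys(in_house + purchased))
    print(len(in_house) + len(purchased), "input rows ->",
          len(merged), "unique addresses")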

Successful de-duplication
Some tips to ensure successful de-duplication:
1. Identify the business rules surrounding your data and look at how you want your data
   formatted, e.g. do you want all customer surnames to be in uppercase?
2. Analyse data prior to performing de-duplication to understand the key relationships to be
   used to match records, e.g. customer activity tables or bank account details.
3. Understand which data should be ranked when matching a set of records, and ask yourself
   questions when identifying data importance: does it matter, for example, if one record
   contains a customer's middle name whilst another does not?
4. Access reference data such as the Royal Mail PAF database for validating UK addresses.
   Other reference examples are deceased, gone-away or do-not-contact lists.
5. Carry out preliminary analysis on a representative data sample to help determine the
   business rules.
There is no silver bullet for successful de-duplication; it is a case of identifying the situation
and rules relevant to each set of data to help find a new master record. However, using the
correct software tool and process for defining de-duplication projects is the easiest way to break
the work into the following manageable steps:

Normalization
Normalization helps reduce small data-entry issues and puts data into a consistent format,
which greatly enhances the chance of successful and useful de-duplication. This stage may be
made up of a series of mapping jobs that bring data of a variety of types together; for example,
Microsoft Excel spreadsheets can be brought in and de-duplicated alongside a SQL Server
RDBMS and a legacy CRM application.
Data should also be formatted so that like fields share the same data type; for example,
converting the string "01/01/2011" into a date timestamp is beneficial to de-duplication. Other
common normalization practices include:

- Setting title case
- Removing unessential punctuation
- Expanding acronyms or abbreviations, e.g. replacing "st" with "street" or "rd" with "road"
If data contains a wide range of issues that need to be assessed and resolved, such as address
validation or removing invalid data, it is appropriate to cleanse the data before performing
de-duplication. Without cleansed data, the efficiency of matching will be reduced and erroneous
records may be paired together in the later stages of de-duplication. Normalization in general
should only be used to ensure that data within separate records can be matched, not to fix
large-scale dirty-data issues.
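As a minimal sketch of the normalization practices above, assuming plain Python with no
external libraries (the abbreviation table and the dd/mm/yyyy date format are illustrative
assumptions):

    import re
    from datetime import datetime

    # Illustrative abbreviation table; a real project would maintain a much
    # larger, domain-specific list.
    ABBREVIATIONS = {"st": "street", "rd": "road"}

    def normalize_field(value):
        """Strip punctuation, expand abbreviations, then set title case."""
        words = re.sub(r"[^\w\s]", " ", value.lower()).split()
        words = [ABBREVIATIONS.get(w, w) for w in words]
        return " ".join(words).title()

    def normalize_date(value):
        """Convert a dd/mm/yyyy string into a real date for comparison."""
        return datetime.strptime(value, "%d/%m/%Y").date()

    print(normalize_field("79 Stapleton Rd"))  # -> 79 Stapleton Road
    print(normalize_date("01/01/2011"))        # -> 2011-01-01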
Grouping
Running de-duplication on large data sets can take a considerable length of time; grouping is
therefore essential to divide the rows to be matched and merged into manageable sets. In
general, the smaller the groups, the more efficient the matching and merging process will be.
Grouping data also helps avoid spurious matches between sets; for example, grouping by gender
avoids matches between male and female records where first names are unisex, such as Alex,
Robin and Sam. Grouping is commonly an optional step in de-duplication, and users can simply
bypass it, but skipping grouping is only recommended for small data sets (<10,000 rows) to
avoid slowing down the runtime of projects.
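A minimal sketch of grouping follows, assuming Python; the grouping key (gender plus the
outward part of the postcode) and the simplified records are illustrative assumptions:

    from collections import defaultdict

    records = [
        {"id": 210276, "gender": "F", "postcode": "SL1 4GH"},
        {"id": 102356, "gender": "F", "postcode": "SL1 4HH"},
        {"id": 103450, "gender": "M", "postcode": "RG1 MX7"},
    ]

    # Bucket records by a cheap key so the expensive pairwise matching
    # stage only compares records within the same bucket.
    groups = defaultdict(list)
    for rec in records:
        key = (rec["gender"], rec["postcode"].split()[0])  # e.g. ("F", "SL1")
        groups[key].append(rec)

    for key, members in groups.items():
        print(key, [m["id"] for m in members])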

Matching
In the matching stage, rules are defined to build a match score for any pair of records,
determining whether or not they should be considered duplicates of each other. It is common for
fuzzy matching logic to be applied to identify duplicate data sets; these functions help overlook
poor data entry, such as spelling mistakes.
Looking at our email address example again, a common mistake is for the @ symbol to be
mistakenly substituted with an apostrophe, a tilde or another key surrounding @ on a traditional
QWERTY keyboard (' ~ #). If this has not been cleaned up in the normalization phase, matching
functions can overlook these issues.
Some common and easily transferable matching methodologies are detailed below:
Function      Description
Ignore Order  Matches can be made by ignoring the order of words within a string,
              e.g. "John Smith" would match "Smith John".
Soundex       Matches can be made on how alike records sound to each other, e.g.
              "New York" would match "New Yerk".
Hamming       Identifies the number of steps required to change one string into
              another; a user can define the maximum distance apart the strings
              can be before they fail as a match. Given a maximum hamming
              distance of 3, "Matt" would match "Matthew".
Substring     Matches on elements within a string, e.g. on the first few
              characters of a postcode, so "SL2 3TY" could match "SL2 4PU" if
              you wanted to target mail shots to a specific area.
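Below are from-scratch Python sketches of these four methodologies, for illustration only;
production tools ship tuned, configurable versions. Note that the "Matt"/"Matthew" example is
strictly an edit (Levenshtein) distance, since a true Hamming distance is only defined for
strings of equal length:

    def ignore_order_match(a, b):
        """'John Smith' matches 'Smith John'."""
        return sorted(a.lower().split()) == sorted(b.lower().split())

    def soundex(word):
        """Four-character phonetic code: soundex('York') == soundex('Yerk')."""
        groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
                  "l": "4", "mn": "5", "r": "6"}
        def code(ch):
            return next((d for ks, d in groups.items() if ch in ks), "")
        word = word.lower()
        out, prev = word[0].upper(), code(word[0])
        for ch in word[1:]:
            d = code(ch)
            if d and d != prev:
                out += d
            if ch not in "hw":  # h and w do not separate duplicate codes
                prev = d
        return (out + "000")[:4]

    def edit_distance(a, b):
        """Single-character edits needed to turn a into b (Levenshtein)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def substring_match(a, b, length=3):
        """Match on leading characters, e.g. the 'SL2' postcode area."""
        return a[:length].upper() == b[:length].upper()

    print(ignore_order_match("John Smith", "Smith John"))  # True
    print(soundex("York") == soundex("Yerk"))              # True
    print(edit_distance("Matt", "Matthew") <= 3)           # True
    print(substring_match("SL2 3TY", "SL2 4PU"))           # True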

Any pair of records can differ in a wide number of ways that need to be accounted for, so a
variety of fields must be included in matching. For example, in the scenario where a father and
son are named after each other, age or date-of-birth data would be required to avoid this pair
matching. Given this, user discretion is required when defining match rules, and periodic
reviews of the matches are suggested to prevent false positives from appearing among the data
matches.
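As a sketch of such a multi-field rule, with illustrative weights and invented records (Python's
standard difflib module provides a simple similarity ratio):

    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_score(rec_a, rec_b):
        """Weighted score over several fields; weights are illustrative."""
        score = (0.4 * similarity(rec_a["surname"], rec_b["surname"])
                 + 0.3 * similarity(rec_a["forename"], rec_b["forename"])
                 + 0.3 * similarity(rec_a["addr1"], rec_b["addr1"]))
        # A differing date of birth vetoes the match outright, separating
        # a father and son who share a name and an address.
        if rec_a["dob"] and rec_b["dob"] and rec_a["dob"] != rec_b["dob"]:
            return 0.0
        return score

    father = {"forename": "John", "surname": "Smith",
              "addr1": "10 Rogers Lane", "dob": "1955-04-02"}
    son = {"forename": "John", "surname": "Smith",
           "addr1": "10 Rogers Lane", "dob": "1984-11-23"}
    print(match_score(father, son))  # 0.0 despite identical names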

Merging
The final stage of de-duplication is to merge records together, either by choosing the best-fit
fields from the matched source records or by selecting the best-fit record from among them.
This allows a golden record to be created in the target data source.
A common approach is to give certain fields or values a higher preference in the merge score by
looking for better data in the matching records, for example (see the sketch after this list):
- No null or empty fields
- More recently accessed records, looking at timestamps
- Fields with more complete information, such as a longer address line
- Fields with higher numeric values, for example where gift aid donations amount to more in
  one record than in another
- Data with more links to other fields in the database, for example where one record contains
  an external reference number whilst another does not
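A minimal sketch of best-fit merging under such rules follows; the preference order (non-empty
values first, longest value wins, most recently accessed record breaks ties) and the sample
records are illustrative assumptions:

    def golden_record(matched):
        """Build one best-fit record from a list of matched duplicates."""
        # Most recently accessed record first, so it wins ties below.
        ordered = sorted(matched, key=lambda r: r["last_accessed"],
                         reverse=True)
        merged = {}
        for field in matched[0]:
            values = [r[field] for r in ordered if r[field]]
            # Prefer the longest non-empty value (the fullest address line);
            # max() keeps the earliest (most recent) record on ties.
            merged[field] = max(values, key=len) if values else None
        return merged

    matched = [
        {"forename": "Matt", "addr1": "79 Stapleton Rd",
         "postcode": "RG1 MX7", "last_accessed": "2011-06-01"},
        {"forename": "Matthew", "addr1": "79 Stapleton Road",
         "postcode": "", "last_accessed": "2011-08-15"},
    ]
    print(golden_record(matched))
    # {'forename': 'Matthew', 'addr1': '79 Stapleton Road',
    #  'postcode': 'RG1 MX7', 'last_accessed': '2011-08-15'}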



How Transoft DBIntegrate can help
Transoft has an experienced consultancy team available to carry out professional data cleansing
and de-duplication services. Our team has run hundreds of projects across many different
industries, so we can guide, advise or run your data quality projects depending on your
requirements.
We use Transoft DBIntegrate to make de-duplication a seamless and repeatable project, and
offer a wide range of support to our customers. DBIntegrate has a unique user interface that
makes each stage of the de-duplication process as clear as possible, using drop-down boxes,
icons and tick boxes so that minimal training is needed.
Transoft DBIntegrate also provides real-time access to multiple data sources, enabling flexible,
repeatable and automated data migration and cleansing. It offers data integration with optimized
real-time read/write access across all data sources, and fast, configurable data warehousing.
For more information please visit: www.transoft.com.

Transoft: the systems transformation company
Transoft is a leading provider of innovative and pioneering transformation solutions, with
hundreds of thousands of organizations worldwide using our products and services. Our aim is to
enable our customers to increase business value and maintain competitive advantage by
maximizing the potential of existing data and applications.

This provides rapid return on investment, reduced costs, improved productivity and efficiency,
and the ability to manage operational risk. With 25 years' experience and expert staff dedicated
to servicing the needs of organizations with legacy systems, we pride ourselves on a tailored
approach to customer service.

Major organizations such as The Gap, L'Oréal, Boeing, Christie's and Balfour Beatty have
enjoyed the business benefits of a Transoft application transformation strategy. We work with a
large network of VARs, system integrators, ISVs and technical partners to offer unparalleled
solutions.


www.transoft.com
newsolutions@transoft.com

Phone: +1 (770) 933-1965 (Americas)
Phone: +44 (0) 1753 778000 (Rest of World)









Transoft is a trading name of Transoft Group Limited and Transoft Inc, which are part of the CSH Group of companies. Transoft is a
trade mark. © Transoft Group Limited and Transoft Inc 2012. All rights reserved.
