
Problem Understanding

• Problem Statement: We need an automated system that links two similar items within the same database or across different databases. The features of the items may be exactly the same or only partially the same.
• Challenges: no unique ID, spelling variations, different abbreviations of the same word, different word order, etc.
Executive Summary
• Data deduplication (identifying similar data) is a fundamental activity in the data integration and data cleansing pipeline.
• It identifies and removes the disguised duplicates in a dataset.
• The duplicate records can either be “exact duplicates” or, more commonly, “near duplicates”.
• The solution to this problem is Record Linkage.
• Record linkage is the process of identifying records that refer to the same entity across two or more datasets.
• Record linkage has a wide range of applications pertaining to business, government
agencies, health sector, digital libraries and so on.
• The absence of a unique identifier, data heterogeneity, data noise and data size make record linkage a genuinely challenging task.
Steps of Record Linkage (Our Approach)

1. Data pre-processing (character normalization, word normalization, stemming, abbreviation handling, etc.)
2. Blocking (used to index the records)
3. Record pair comparisons (using unsupervised / supervised machine learning)
4. Classification (grouping into the relevant category)
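As a minimal sketch of the blocking step above: a blocking key restricts comparisons to records that share the key, so we compare pairs only within each block instead of all pairs. The toy records and the first-character key below are illustrative assumptions; real keys would use phonetic codes or token-based keys.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; a "blocking key" (here: first character of the text,
# a placeholder choice) restricts comparisons to within each block.
records = {
    1: "apple juice 1l",
    2: "aple juice 1 ltr",
    3: "banana shake 500ml",
    4: "banana shake 0.5l",
}

def blocking_key(text: str) -> str:
    # First character only; real systems use phonetic or token keys.
    return text[0]

# Build the index: block key -> record ids.
blocks = defaultdict(list)
for rid, text in records.items():
    blocks[blocking_key(text)].append(rid)

# Candidate pairs come only from within each block.
candidate_pairs = [p for ids in blocks.values()
                   for p in combinations(sorted(ids), 2)]
print(candidate_pairs)  # [(1, 2), (3, 4)] instead of all 6 possible pairs
```

With four records, exhaustive comparison needs 6 pairs; blocking reduces this to 2, which is the whole point of the indexing step.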
Record Linkage Process

Dataset Pre-Processing → Record Pairs Reduction using Blocking Filter → Record Pairs Comparisons using Linkage Key → Record Pairs Classification → Non-Matches / Possible Matches / Matches

Data Pre-Processing
• Stop-word removal (a, on, the)
• Character-level normalization (Naive → Naïve)
• Word-level normalization (Colour → Color)
• Lemmatization / stemming (Descriptions → Description)
• Synonym handling (Low fat → Reduced fat)
• Abbreviation handling (ltr, lt → l), etc.
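As a minimal stdlib-only sketch, the pre-processing steps above might look like the following. The stop-word, synonym and abbreviation maps are placeholder examples (a real project would build domain-specific lists), and here accents are stripped to an ASCII canonical form; mapping to the accented form instead is an equally valid convention:

```python
import re
import unicodedata

# Placeholder normalization maps, not the proposal's actual lists.
STOPWORDS = {"a", "on", "the"}
ABBREVIATIONS = {"ltr": "l", "lt": "l"}          # e.g. litre variants -> "l"
SYNONYMS = {"low fat": "reduced fat", "colour": "color"}

def normalize(text: str) -> str:
    """Apply the pre-processing steps listed above to one record field."""
    # Character-level normalization: decompose and drop accents.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.lower()
    # Phrase-level synonym handling before tokenization.
    for phrase, canonical in SYNONYMS.items():
        text = text.replace(phrase, canonical)
    tokens = re.findall(r"[a-z0-9]+", text)
    out = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        tok = ABBREVIATIONS.get(tok, tok)
        # Naive suffix stemming (Descriptions -> Description); a real
        # pipeline would use a proper stemmer or lemmatizer.
        if tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]
        out.append(tok)
    return " ".join(out)

print(normalize("The Naïve Descriptions, 2 Ltr"))  # -> "naive description 2 l"
```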
CRISP-DM
• For such projects, we follow the standard CRISP-DM approach.
• Business understanding is already accomplished.
• We will start with the Data Understanding phase.
• For deployment, the objective is to provide a light-weight model or parameters for the target device.
Data Science Approach

Phase-1 (Rule-based + Unsupervised), for selected items (10-12 weeks):
• Understanding the data
• Preparation of a domain-specific stop-word list
• Applying NLP techniques to standardize the dataset
• Indexing strategy
• Using an unsupervised machine learning algorithm for identification of similar records

Phase-2 (Machine-learning-based approach) (8-12 weeks):
• Preparation of data for supervised machine learning
• Feature engineering
• Development of a supervised machine learning algorithm for similarity matching
• Evaluation and improvement of algorithms

Roll out (8-12 weeks):
• Model training for all items
• Testing of the existing model
• Deployment
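As an illustration of the Phase-1 unsupervised similarity matching, a cosine similarity over token counts (a simplified stand-in for TF-IDF or whatever measure the project actually selects) links records even when their word order differs:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between token-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy items: the first two differ only in word order.
items = ["reduced fat milk 1 l", "milk reduced fat 1 l", "orange juice 1 l"]
print(round(cosine(items[0], items[1]), 2))  # 1.0 despite reordered words
print(round(cosine(items[0], items[2]), 2))  # 0.45, a much weaker link
```

Because the representation is a bag of tokens, reordering words leaves the vector unchanged, which directly addresses the "word order is different" challenge named at the start of the proposal.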


Scope of this proposal
