• Problem Statement: We need an automated system that links two similar items within the same database or across different databases. The features of the items may match exactly or only partially.
• Challenges: no unique ID, spelling variations, different abbreviations of the same word, different word order, etc.

Executive Summary
• Data deduplication (similar-data detection) is a fundamental activity in the data integration and data cleansing pipeline.
• It identifies and removes disguised duplicates in a dataset.
• Duplicate records can be “exact duplicates” or, more commonly, “near duplicates”.
• The solution to this problem is record linkage.
• Record linkage is the process of identifying records that refer to the same entity across two or more datasets.
• Record linkage has a wide range of applications in business, government agencies, the health sector, digital libraries, and so on.
• The absence of a unique identifier, data heterogeneity, data noise, and data size make record linkage a challenging task.

Steps of Record Linkage (Our Approach)
• Data pre-processing (character normalization, word normalization, stemming, abbreviation handling, etc.)
• Blocking (used to index the records)
• Record pair comparisons (using unsupervised or supervised machine learning)
• Classification (grouping into the relevant category)

Record Linkage Process
Dataset -> Pre-Processing -> Record Pairs Reduction using Blocking Filter -> Record Pairs Comparisons using Linkage Key -> Record Pairs Classification -> Non Matches / Possible Matches / Matches
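
The process above (blocking, pair comparison, three-way classification) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not part of the original approach: the blocking key (first three letters of the alphabetically first word), the similarity measure (`difflib.SequenceMatcher` on word-sorted strings), and the thresholds 0.9 and 0.6.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking filter: first 3 letters of the alphabetically
    first purely alphabetic token, so word order does not change the block."""
    words = sorted(t for t in record.lower().split() if t.isalpha())
    return words[0][:3] if words else ""


def similarity(a, b):
    """Order-insensitive string similarity in [0, 1] (stand-in comparison)."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def classify(score, upper=0.9, lower=0.6):
    """Map a score to one of the three outcomes; thresholds are assumed."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible match"
    return "non-match"


def link_records(records):
    """Compare only the record pairs that share a block and label each pair."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    results = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            results.append((a, b, classify(similarity(a, b))))
    return results


items = ["milk low fat 1l", "low fat milk 1l", "milk low fat 1 ltr", "orange juice 1l"]
for a, b, label in link_records(items):
    print(f"{a!r} vs {b!r}: {label}")
```

Blocking keeps "orange juice 1l" from ever being compared with the milk items, which is the point of the pair-reduction step: the number of comparisons grows with block size instead of with the full dataset size. In practice the thresholds would be tuned on labelled pairs.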
Data Preprocessing
• Stopword list (a, on, the)
• Character-level normalization (Naive / Naïve)
• Word-level normalization (Colour / Color)
• Lemmatization / stemming (Descriptions / Description)
• Synonym handling (Low fat / Reduced fat)
• Abbreviation handling (ltr, lt, l), etc.

CRISP-DM
• For such projects, we follow the standard CRISP-DM approach.
• Business understanding is already accomplished.
• We will start with the Data Understanding phase.
• For deployment, the objective is to provide a lightweight model or parameters for the target device.

Data Science Approach
Phase-1 (Rule based + Unsupervised), for selected items
• Understanding the data
• Preparation of a domain-specific stop-word list
• Applying NLP techniques to standardize the dataset
• Indexing strategy for similarity matching
• Using an unsupervised machine learning algorithm to identify similar records
Phase-2 (Machine Learning based approach)
• Preparation of data for supervised machine learning
• Feature engineering
• Development of a supervised machine learning algorithm
• Evaluation and improvement of algorithms
Roll out
• Model training for all items
• Testing of existing model
• Deployment
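
The standardization work planned for Phase-1 (stop-word removal, character-level and word-level normalization, stemming, synonym and abbreviation handling) could look roughly like the sketch below. The lookup tables contain only the examples from the slides, the mapping of "ltr", "lt" and "l" to "litre" is an assumed canonical form, and `crude_stem` is a deliberately simple stand-in for a real stemmer or lemmatizer; a production version would use domain-specific lists.

```python
import re
import unicodedata

# Tiny illustrative tables; real ones would be domain specific.
STOPWORDS = {"a", "on", "the"}
WORD_MAP = {"colour": "color"}
ABBREVIATIONS = {"ltr": "litre", "lt": "litre", "l": "litre"}  # assumed target
SYNONYMS = {"low fat": "reduced fat"}


def strip_accents(text):
    """Character-level normalization: drop diacritics (e.g. ï becomes i)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def crude_stem(token):
    """Strip a plural 's'; a stand-in for a real stemmer or lemmatizer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Return a standardized, space-separated version of an item description."""
    text = strip_accents(text).lower()
    # Separate a number glued to a unit, e.g. "1l" becomes "1 l".
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [WORD_MAP.get(t, t) for t in tokens]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [crude_stem(t) for t in tokens]
    result = " ".join(tokens)
    # Synonyms can span several words, so they are replaced at phrase level.
    for phrase, replacement in SYNONYMS.items():
        result = re.sub(rf"\b{re.escape(phrase)}\b", replacement, result)
    return result


print(preprocess("Low Fat Milk 1L"))          # reduced fat milk 1 litre
print(preprocess("The Colour Descriptions"))  # color description
```

After this step, "Low Fat Milk 1L" and "low fat milk 1 ltr" reduce to the same string, so later blocking and comparison stages see fewer spurious differences.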