
A DEEP DIVE INTO PE DATA
Presented by: Daniya Boges [Data Scientist]
Disconnected Systems of Heterogeneous Nature
- MINERET tech
- SAGE
- MICROSYSTEMS
- CDK Global Autoline Drive DMS
HOW TO REACH OUR DATA
As MANY and As ONE [1]

1. Multiple Databases (DB) on One Server Instance
2. Multiple DB on Multiple Servers from the Same Vendor
3. Multiple DB on Multiple Servers from Multiple Vendors
Multi DB on Multi Servers from Multi Vendors (The Three Multies)

Approach                            Advantages                              Drawbacks
Table Linking and Federation [2]    Easy, low cost                          Performance degradation
Custom Code                         Straightforward, easy, very low cost    Very time consuming
Data Warehousing/ETL [3]            Best performance, easy to implement     Time consuming, very high cost
Mediation Software                  Easy, low cost                          Performance degradation
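To make the "Custom Code" row concrete, here is a minimal sketch of application-level integration: each source is queried separately and the rows are combined in code. The two in-memory SQLite databases, their table names, column names, and sample rows are all hypothetical stand-ins for two vendor systems, not the actual schemas.

```python
import sqlite3

def fetch_customers(conn, query):
    """Run one source-specific query and return its rows as (mobile, name) tuples."""
    return conn.execute(query).fetchall()

# Two independent in-memory databases stand in for two vendor systems (hypothetical).
source_a = sqlite3.connect(":memory:")
source_a.execute("CREATE TABLE customers (mobile TEXT, name TEXT)")
source_a.execute("INSERT INTO customers VALUES ('0500000001', 'Customer A')")

source_b = sqlite3.connect(":memory:")
source_b.execute("CREATE TABLE clients (phone TEXT, full_name TEXT)")
source_b.execute("INSERT INTO clients VALUES ('0500000002', 'Customer B')")

# Each source keeps its own schema; the application code maps both to one shape.
combined = (
    fetch_customers(source_a, "SELECT mobile, name FROM customers")
    + fetch_customers(source_b, "SELECT phone, full_name FROM clients")
)
print(combined)
```

Every new source needs its own query and mapping, which is one way to read the "very time consuming" drawback in the table.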

SNIPPETS FROM WITHIN
Demonstrating redundancy in customer data:

?
SAMEER(S) FOUND!
FIND THIS CUSTOMER: SAMEER M NAWAR
Provided his phone number is: +966505673025

After 3 levels of data filtering, we found our matching record(s):


MINOR MISTAKES

SYNTAX INCONSISTENCIES: CusstomerID
ADDED SPACES: | <space>CustomerID<space> |
SEMANTICS: ..
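A minimal sketch of how header mistakes like these could be repaired automatically, assuming a known list of canonical field names. The `CANONICAL` list and the 0.8 similarity cutoff are illustrative assumptions, not values from the actual pipeline.

```python
import difflib

# Hypothetical list of canonical field names; the real schema is not shown here.
CANONICAL = ["CustomerID", "VIN", "InvoiceNo"]

def normalize_header(raw, canonical=CANONICAL, cutoff=0.8):
    """Strip padding spaces, then snap near-miss spellings to a canonical name."""
    cleaned = raw.strip()
    matches = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    # Fall back to the stripped header when nothing is similar enough.
    return matches[0] if matches else cleaned

print(normalize_header("CusstomerID"))
print(normalize_header(" CustomerID "))
```

Fuzzy matching only covers syntax-level slips; semantic mismatches between fields still need a manual mapping.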
OPERATION CLEANSING

CUSTOMERS
VEHICLES
INVOICES

MINERET tech
SAGE
MICROSYSTEMS
Customer | Vehicle | Invoice
Date of export in raw form : 2021-01-14

INVOICES

MINERET (Minerets + SAGE)
2020-07-11 to 2021-01-14: 6 months, 3 days

SAGE
2013-01-22 to 2020-07-11: 7 years, 5 months, 19 days

CUSTOMERS
VEHICLES
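The two coverage periods can be re-derived from their start and end dates. The helper below is a small sketch using only the standard library; `calendar_span` is a name introduced here for illustration.

```python
import calendar
from datetime import date

def calendar_span(start, end):
    """Return the gap between two dates as (years, months, days)."""
    years = end.year - start.year
    months = end.month - start.month
    days = end.day - start.day
    if days < 0:
        # Borrow the length of the month that precedes the end date.
        months -= 1
        prev_month = end.month - 1 or 12
        prev_year = end.year if end.month > 1 else end.year - 1
        days += calendar.monthrange(prev_year, prev_month)[1]
    if months < 0:
        years -= 1
        months += 12
    return years, months, days

print(calendar_span(date(2013, 1, 22), date(2020, 7, 11)))  # (7, 5, 19)
print(calendar_span(date(2020, 7, 11), date(2021, 1, 14)))  # (0, 6, 3)
```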
BIRD’S-EYE VIEW: 170 fields

Invoices mac (25 Columns)
Customers mac (42 Columns)
Invoices_SAGE (61 Columns)
Vehicles (42 Columns)

BIRD’S-EYE VIEW: 40 fields (shrunk)
OPERATION CLEANSING
Cross-platform keys:
1. Mobile numbers
2. Vehicle Identification Number (VIN)
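Because these two keys join records across platforms, they have to be brought to one canonical form first. The sketch below assumes Saudi mobile numbers (nine national digits starting with 5, written as +9665XXXXXXXX) and modern 17-character VINs, which exclude the letters I, O and Q. Both rules and both function names are assumptions for illustration; the actual cleansing rules are not shown on the slides.

```python
import re

def normalize_saudi_mobile(raw):
    """Return the number as +9665XXXXXXXX, or None if it does not fit that shape."""
    digits = re.sub(r"\D", "", raw)       # drop '+', spaces, dashes, brackets
    if digits.startswith("00966"):
        digits = digits[5:]
    elif digits.startswith("966"):
        digits = digits[3:]
    elif digits.startswith("0"):
        digits = digits[1:]
    if re.fullmatch(r"5\d{8}", digits):
        return "+966" + digits
    return None

def normalize_vin(raw):
    """Upper-case the VIN and strip spaces and hyphens; None if it is not a valid shape."""
    vin = re.sub(r"[\s-]", "", raw).upper()
    if re.fullmatch(r"[A-HJ-NPR-Z0-9]{17}", vin):
        return vin
    return None

print(normalize_saudi_mobile("0500000001"))
print(normalize_vin("1hgcm82633a004352"))
```

Values that come back as `None` are the ones that would fail qualification and need manual review or rejection.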

ORPHANED CUSTOMERS
Fail to link with a valid vehicle VIN# and an invoice#.

ASSOCIATED (non-orphaned) CUSTOMERS
Successfully link to a valid VIN# and/or an invoice#.
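The orphaned/associated split can be sketched as a simple membership test. The record layout (`customer_id`, `vin`, `invoice_id`) and the sample values are hypothetical; only the rule itself (a link to a valid VIN and/or an invoice makes a customer associated) comes from the definitions above.

```python
def classify_customers(customers, vehicle_vins, invoice_ids):
    """Split customer IDs into (associated, orphaned) lists."""
    associated, orphaned = [], []
    for customer in customers:
        has_vehicle = customer.get("vin") in vehicle_vins
        has_invoice = customer.get("invoice_id") in invoice_ids
        # One successful link is enough to count as associated.
        target = associated if (has_vehicle or has_invoice) else orphaned
        target.append(customer["customer_id"])
    return associated, orphaned

# Hypothetical sample records.
customers = [
    {"customer_id": 1, "vin": "VIN-A", "invoice_id": None},
    {"customer_id": 2, "vin": None, "invoice_id": "INV-9"},
    {"customer_id": 3, "vin": "VIN-UNKNOWN", "invoice_id": None},
]
associated, orphaned = classify_customers(customers, {"VIN-A"}, {"INV-9"})
print(associated, orphaned)
```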
OPERATION CLEANSING

                               Mobile Numbers    VIN
Qualification pre-clean-up:    41.92%            87.95%
After clean-up:                88.43%            91.96%
OPERATION CLEANSING #1

RAW count (B2C): 4,796,227
4,795,428
4,165,299
3,824,844 | 341,107

Steps:
* Mobile# & VIN# fix
* Link with vehicles
* Link with invoices in minerets+sage
* De-duplication

Results:
CUSTOMERS: 276,465 {Associated}; 1,442 {ORPHANED}
VEHICLES: 2,598,601 {Associated}
INVOICES: 137,841 {Associated}

#3,012,907
PS: The final results have been stripped of duplicated customers; hence they are unique.
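One way to read the final figure: the three {Associated} counts on this slide add up exactly to the reported total, with the 1,442 orphaned customers left out. The short check below only confirms that arithmetic; it does not claim to show how the pipeline itself produced the number.

```python
# {Associated} counts as reported on the slide.
associated = {
    "customers": 276_465,
    "vehicles": 2_598_601,
    "invoices": 137_841,
}

total = sum(associated.values())
print(f"{total:,}")  # 3,012,907
```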
EVALUATION PER ATTRIBUTE
FIELD
WRAPPING UP
IN CONCLUSION

- We have a stable custom work pipeline; implement it on the rest.

- Unique customers count as per this snapshot of the data is 3,012,907.

- Scale up our hardware resources.

- Recycling / recovery plans to be set.


REFERENCES
- [1] R. Lawrence, "How To Query Multiple Databases", Director of Distributed Database Laboratory, University of British Columbia.
- [2] L. M. Haas, E. T. Lin and M. A. Roth, "Data integration through database federation," in IBM Systems Journal, vol. 41, no. 4, pp. 578-596, 2002, doi: 10.1147/sj.414.0578.
- [3] Biswas, N., Sarkar, A. & Mondal, K.C. Efficient incremental loading in ETL processing for real-time data integration. Innovations Syst Softw Eng 16, 53–61 (2020). https://doi.org/10.1007/s11334-019-00344-4
- Maintenance and Mediation in Database Federations. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.6769&rep=rep1&type=pdf
- Distributed data integration by object-oriented mediator servers, Concurrency and Computation: Practice and Experience (Concurrency Computat.: Pract. Exper.) 2001; 14:1–21 (DOI: 10.1002/cpe.607)
Thank You
