You are on page 1of 29

E-Migrate: Semi Automated Mapping of Database Structure using String Similarity

and Distance Algorithms for FAITH Colleges Multiple Systems

A Capstone Project by:

JOSEPH PLATA MALALUAN

Master in Information Technology

Submitted to the School of Graduate Studies

First Asia Institute of Technology and Humanities

Tanauan City, Batangas, Philippines

In Partial Fulfillment

of the Requirements for the Degree

Master in Information Technology


SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

TABLES OF CONTENTS

Chapter 1: Introduction

Background of the Study……………………………………………………….1

Objectives of the Study…………………………………………………………2

General Objectives……………………………………………………...2

Specific Objectives……………………………………………………...2

Significance of the Study……………………………………………………….4

Scope and Limitations of the Study…………………………………………...5

Chapter 2: Review of Related Literature

Conceptual Literature…………………………………………………………...7

Related Study and Literature…………………………………………………..8

Existing System…………………………………………………………….......9

Synthesis………………………………………………………………............11

Conceptual Framework……………………………………………………….13

Operational Framework……………………………………………………….14

Definition of Terms…………………………………………………………….17

Chapter 3: Research Methodology

Research Design……………………………………………………………....17

System Evaluation………………..……………………………………………21

Subject of the Study…………………...………………………………………24

System Requirements………………………………………………………...25

1
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

Chapter I

INTRODUCTION

1.1 BACKGROUND OF THE STUDY

Moving data is a critical process in migrating from one system to another. Legacy

systems that use outdated technology need to be migrated to a new system to be

more efficient and maintainable; however changes on structures and schema are

unavoidable in any migration process(1). Semantic problems in every column

should be resolved and apply more suitable column names in order to follow

naming convention standards(2).

The initial challenge that needs to be addressed before deciding to migrate into

another system is how compatible the new system is to the existing data. At

FAITH Colleges, multiple systems are in need of upgrading and improvements.

Campus++ an admission and enrollment system - cannot be easily replaced

without migrating data related to grades and back accounts. While another

system ISIS has different corporate data that needs to be intact from the

beginning.

Manual mapping of each table schema and columns is a potential solution,

however it may take a long time to be completed having more than 50 tables with

some containing at least 20 columns. While, fully automated data mapping is not

an option due to the old and poor design strategy designing the existing

database. Therefore, a combination of manual and automated database mapping

is ideal to be used. Automated- for faster execution and manual adjusting for

accuracy.

2
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

Automated database mapping requires a strategy that can ensure accurate

mapping and relevant scoring. The initial stage of schematic mapping is linguistic

matching for element name comparison(3). In this stage, Smith-Waterman

Algorithm will be used to compare and rank similarity scoring for each candidate

element.

Smith-Waterman Algorithm aligns the comparing table names and column names

into a matrix formulation. This algorithm is a good fit in analyzing the similarity of

the element since it provides a scoring pattern.

1.2 OBJECTIVES OF THE STUDY

1.2.1 General Objective

The researcher aims to develop a system that will help migrate data

from a legacy database structure to a new structure faster and more

accurately.

1.2.2 Specific Objectives

1.2.1.1 Analyze the structure of the existing or legacy systems of FAITH

Colleges.

● To create a system that can suggest database table and column

mapping of old database structure to a new database structure

design in SQL Server or MYSQL server.

3
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

● To allow users to visually compare database similarity of two

different databases designed in the same database management

system.

● To give users options on how to map database structure, with

manually mapped or follow the automated mapping suggestion.

1.2.1.2 Comparing different algorithms that can be used in database

entities mapping of databases design in SQL Server or MYSQL Server.

● Identify the algorithm that is accurate and efficient.

● Visually displays the score generated by the compared algorithms.

1.2.1.3 To utilize Smith-Waterman algorithm as a library tool for entity

mapping.

● Compare the Smith-Waterman algorithm with the Hamming

Distance and Levenshtein Distance algorithm.

● Use the most suitable algorithm for string comparison.

1.2.1.4 Evaluate the result of the study.

● Validate the efficiency of the system in comparison with the manual

process.

● Validate the suitability of Smith-Waterman algorithm in database

entity mapping.

4
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

1.3 SIGNIFICANCE OF THE STUDY

The goal of this project is to provide aid migrating data from a legacy

database to a new database structure. With the help of the proposed system,

accuracy and efficiency of data migration can be achieved. Missing pertinent

data can be avoided and new structures can be utilized in warehousing old data

without interrupting new processes. Efficiency and reliability of migrated data will

benefit the following stakeholders:

1.3.1 To the current system developers. Mapping of database elements

can be a tedious task for developers. The proposed system can help

current system developers to lessen their job on manually analyzing

database schemas.

1.3.2 To the system users. Having an accurate data migration from

legacy system to new system can significantly improve reliability of the

system. System users no longer encounter missing data from the

replacement system. Implementation of new processes can also be

applied on old transactions that will be very important in performing

different tasks of the users.

1.3.3 To the researcher. Data migration research is the best way to come

up with the latest technique to improve knowledge and skills. Ideas and

learning can be acquired in the process of this research.

1.3.4 To the future researchers. Ideas discussed in this research can be

a big help to future researchers. Improved research for this topic can open

up a new discovery for data migration techniques.

5
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

1.4 SCOPE AND LIMITATION OF THE STUDY

This project focuses on developing a system that maps and compares database

schema of a legacy database and a new database schema for migration.

The following features are included in the proposed system:

1. Dynamic database object mapping

The proposed system has the flexibility to be integrated in different

systems being used in any industry. Connection string attributes should be

supplied within the system interface. Table Name, Column name and

Instructor/Student object attribute are captured by the system for proper

mapping.

2. Visual presentation of database object

A web interface is available for viewing and validating of structure

mapping executed by the system. The user can update manually if

mapping adjustment is needed.

3. Linguistic Similarity Scoring

The proposed system displays the similarity score of two elements

being mapped. The score is computed based on Smith-Waterman

Algorithm.

While the system can visualize the actual database structure,

migration of data still needs the user’s intervention. While it is important to

consider database relationships and foreign keys, this research did not

tackle and implement in-depth study of relational databases. Database

6
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

components such as storedprocedure, views and triggers are not included

in this research and implementation.

7
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

CHAPTER II

REVIEW OF RELATED LITERATURE

This chapter presents the project’s features, related reading materials

from other related literature, and a review of related systems used by the

developers to complete the project.

2.1 CONCEPTUAL LITERATURE

This study literature review deals with the Database Management System,

Data Migration, Linguistic Module, and Smith-Waterman Algorithm. The

developers will use these concepts to create and deploy the proposed project

into actual development.

2.1.1 Database Management System. According to Chapman and Hall in

the book “Handbook of Data Structures and Applications” a database

management system is a software application that purposely designs to

store data. It is an interface between user and data that allows interaction

to execute different queries such as add, update, delete and retrieval of

data. In the journal written by Tim Kraska about Learned Databases(2019)

, DBMS is designed to handle different purposes, multiple schemas and

aims to provide efficient access to stored data.

2.1.2 Data Migration. In the article written by Henry Engeldow, Data

migration is always accompanied by challenges and obstacles. Different

structures and field definitions added complexity to the migration process.

8
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

Datasource numerous fields increase the problems that need to be

resolved.

2.1.3 Linguistic Module. A journal written by Vitor Marini Blaselbauer

defines the linguistic module as the measures of similarity between

strings. In his article, JSON elements are being used to compare schemas

to create pairs for comparison. Pairs and nodes are in result after lexical

function under semantic module of comparison.

2.1.4 Smith-Waterman Algorithm. According to Fabio F. de Oliveira in a

journal posted on June 30,2022, SW algorithm initially intended to perform

sequence between nucleotides and proteins under the biology field.

Forward and backstage are performed in this algorithm. While in an article

written by Han Su Yin Nyunt, a parallel implementation was done in

sequence of strings with relevant scoring.

2.2 RELATED STUDIES AND LITERATURE

The Information gathered from the previous research helped a lot in the

project. In addition, the related literature from other research and articles

improved the developers' understanding of the project. Thus it gave the

developers the knowledge to design and create a system that can be used for

data mapping and migration.

The examination of old data is the first stage of a multi-phase process

called data migration, which concludes with the loading and reconciliation of data

into new systems. Organizations are always in need of data migration due to the

9
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

exponential expansion of data. Data migration may be a difficult procedure, and

testing is necessary to guarantee the accuracy of the data. If best practices are

not followed and hidden expenses are not discovered early enough, migration

can be highly expensive ,Sarmah, S. S. (2018).

There are existing tools that can be used but it is hard to find the best one

mostly for the specific needs of the users. According to a study of Elamparithi,

M., & Anuratha, V. (2015) due to the varied structures and many RDB data types,

Relational Database Migration (RDBM) has generally been a complicated, time-

consuming, and expensive procedure. There are several Database Migration

Tools (DBMTs) available in the software industry, but choosing the best one is

still a difficult task. Therefore, some of the DBMTs are examined, and certain

fundamental standards for assessing migration toolkits are given.

Even though NoSQL is a newer database system, there aren't as many

tools and methods for designing schema in it as there are in RDBMS. NoSQL

necessitates a different strategy for creating effective schema, such as deciding

which information belongs in a document database as an embedded document

or a referenced document. For the creation of schemas in NoSQL databases,

there are certain general guidelines. Creating a decent and effective database

schema is a particularly challenging process in reengineering projects, especially

when switching from an old RDBMS to a new NoSQL system. In this research,

we offer a model for a schema design advisor that leverages the SQL query load

10
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

of the current program as an input and an algorithm for schema design

recommendation, Namdeo, B., Suman, U. (2021).

2.3 EXISTING SYSTEMS

2.3.1 Azure Migrate. This migration tool is used under the Azure

environment. It allows the user to assess the actual data environment

using different tools such as Turbonic and Unify Cloud. This migration tool

focuses on migrating the entire database (schema and data) all in one in

an azure environment. It supports the migration of systems running under

microsoft technology.

2.3.2 Fivetran. With the help of the cloud-based ETL data transfer

platform Fivetran, data engineers can focus less on laborious engineering

activities and more on producing new insights. Fivetran streamlines data

migration by automatically handling all data integration duties using

connectors that deploy in a matter of minutes, adapt to source changes

automatically, and require zero maintenance.

2.3.3 Microsoft Data Migration Assistant You can update your database

schema and data from a source SQL Server environment to a target SQL

Server environment using the Microsoft tool known as Data Migration

Assistant, or DMA. You can transfer schema, data, and uncontained

11
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

objects from your source server to your target server using this migration

tool.

The best thing about utilizing this tool is that it can help you find

compatibility problems that might affect how well your new SQL Server

database version functions. DMA maintains a minimum amount of

downtime while automating database migration. Additionally, it offers

assistance in moving databases like Microsoft SQL Server, MySQL,

PostgreSQL, and MongoDB from on-premises and other cloud

environments to Azure.

2.3.4 Integrate.io (Formerly Xplenty). A data migration solution called

Integrate.io offers customers a single interface for managing, converting,

and moving data between several applications. This tool aids businesses

in the integration, processing, and preparation of data for cloud analytics.

The solution has a user-friendly interface, a highly automated workflow,

and is both scalable and economical.

Users may concentrate on their businesses without having to worry about

the difficulties of data conversion thanks to Integrate.io's simplicity of use.

The availability of integrations via REST API or direct FTP uploads

enables even non-technical staff to use this application in a smooth

manner with their technology stack.

12
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

2.3.5 IRI NextForm. A data migration tool developed on Eclipse is called

IRI NextForm. From one system or format to another, this utility can

migrate, reformat, convert, replicate, federate, and report data.

Additionally, it may replace outdated database engines and other

components that might not have native export capabilities as well as

transport data from legacy systems into new contexts.

2.4 SYNTHESIS

Concepts, literature and related systems were reviewed and analyzed in

comparison with E-Migrate. The researcher used those studies to further

understand the value of the migration concept. The present study is improved

with the use of available ideas from previous studies.

13
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

Table 2.1 Summary of Related System Comparisons

Related System Developed

System

Software Azure Fivetr Microsoft Integrat NextForm E-Migrate

Features Migrate an e.io


Assist

Similarity No No No No No Yes

Analysis

Graphical No Yes No No No Yes

Analysis

Table No No No Yes Yes Yes

Mapping

Field No No No Yes Yes Yes

Mapping

Constraint Yes Yes Yes Yes Yes Yes

Check

Foreign Key Yes Yes Yes Yes Yes Yes

Check

14
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

For database concepts and applicable data migration, Chapman and Hall

insights Handbook of Data Structures and Applications (2009) and Henry

Engeldow in Data Migration (2011) identified key features in evaluating the

database structure and migration concepts.

Linguistic analysis from Blaselbauer was used to evaluate different algorithms

such as Smith-Waterman Algorithm, Hamming Distance and Levenshtein

Distance.

While related systems were analyzed and compared with the proposed system.

No features were duplicated or copied from the existing system with the

proposed system. Algorithm and structures are completely different and unique

with the proposed system.

15
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

2.5 CONCEPTUAL FRAMEWORK

Figure 2.1 Conceptual Framework

Figure 2.1 Illustrates the conceptual framework of the e-migrate project. The

image is utilized to visually represent the overall processes involved in the

system. It shows that a db admin is needed to provide connection strings value in

an e-migrate system. The system will process the inputs and provide database

elements to be processed in the server. The application then uses the process

data to gather data for migration and save it to the new database destination.

16
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

2.6 OPERATIONAL FRAMEWORK

2.6.1 Procedure for System Development

Deciding which System development life cycle to be used in developing the

system were analyzed based on different factors. Below are the factors

considered in deciding to Iterative Incremental Development Models Model.

1. Number of human resources that will develop and analyze the system

2. Availability of the system specification upon start of the development

3. Data Gathering for machine learning training

4. Time constraint.

5. Development tools and familiarity.

Figure 2.2 Iterative Incremental Development Models using Agile method

17
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

2.6.1.1 Planning

This is the initial step of the project development to identify the value of

the project and setting up the goals and timeline. A project scope also

developed within this phase.

2.6.1.2 Requirements and Roadmap Creation

Roadmap is very important having the process of breaking down of

deliverables to build the final product is identified. Each individual feature

is built by the whole team.

2.6.1.3 Release Planning

Before the project kicks off, a road map and project plan are presented.

Feature releases are and deliverables are finalized.

2.6.1.4 Sprint Planning

This phase is responsible for identifying tasks to be accomplished by each

member of the team during the specific sprint. It also includes discussions

on how it is possible to achieve specific accomplishments. Assessment of

tasks loads is part of this step.

2.6.1.5 Stand up Meeting

This is a short meeting that is being held to help each member of the team

accomplish their tasks. At some point changes need to be made during

the sprint, discussion of the changes is being done during this meeting.

2.6.1.6 Sprint Review and Retrospective

End of each sprint a two or combined meeting are being held to review the

sprint and to present the finished product. Next is the sprint retrospective

18
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

meeting wherein discussion on what are the difficulties and bottlenecks

during the sprint and how to improve the process on the next.

19
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

2.7 DEFINITION OF TERMS

20
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

CHAPTER III

RESEARCH METHODOLOGY

This study utilizes comparative analysis research and quantitative analysis to

design and implement the system. Research on algorithms to be used is also

discussed in this chapter.

3.1 RESEARCH DESIGN

Choosing which algorithm to use in developing a mapping tool for data migration

could be a tough one without comparing the most suitable algorithm available.

For this study, comparative descriptive design is utilized to make comparisons

between entities while conducting experimental research (Cantrell, 2011). Score

comparisons between the three selected algorithms were done and analyzed by

eight professionals in software engineering and database administrations.

Prototyping Method is utilized for the execution of the experimental research.

The researcher compared three widely used algorithms in technical linguistics.

21
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

Table 3.1 Algorithm and Computation Basis

Algorithms Equation Computation Basis

Hamming Distance String Distance

Levenshtein Distance String Distance

Smith-Waterman Algorithm Similarity

Table 3.1 defined the algorithm being compared in this research. These

algorithms have their own basis in computing differences. These bases are also

defined in table 1.

22
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

Table 3.2 String Differences Scenarios

Scenarios Percent Shares

strings in same length 33.33%

strings in different length 33.33%

identical strings 33.33%

Table 3.2 shows different scenarios that are most likely encountered by

comparing different database entities.

Table 3.3 Category and Percent Share

Categories Percent Shares

Accuracy 70%

Efficiency 30%

Table 3.3 contains the test consideration category in selecting the algorithm.

Accuracy has higher consideration, while speed is less.

23
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

A prototype application containing the implementation of the three algorithms

were created and evaluated using the below matrix.

3.2 SYSTEM EVALUATION

E-Migrate will utilize ISO 25010 Quality Model Characteristics in evaluating the

system. Feedback is captured using Google Form and distributed to technical

experts and different target users.

Table 3.5 System Attribute for Evaluation

Characteristics Sub Characteristics Evaluation Questions

Functionality Functional Is the system properly connected


Specifications and visualize the connection
structure?

Fields and Tables mapping options


is available

Easy to map fields and table


objects

Reliability Fault tolerance Suggested mapped objects are


accurate
Recoverability
Score provided in each mapping
are accurate

Migration options are usable and


produce correct output

No data is loss during the process


and execution

24
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

Usability Learnability Link and Pages are properly


placed where the user can easily
Operability find it.

Error Protection Errors and Warning are properly


shown.

The system can be very useful and


replace current process

Efficiency Time Behavior The system respond time is


acceptable

The system execution time is


acceptable

Resource Utilization Resources consumes are


acceptable

Maintainability Modifiability The system can easily identify


errors
Capacity
Fixes or work around can be easily
apply

The system test cases is easy to


apply

Portability Adaptability Different environment(databases)


can be integrated in the system

Viewing from different browser


(includes mobile)

25
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

Table 3.5 Likert Scale and Score equivalent

Score Rating Interpretation

Equivalent

80-100 4.51-5.00 Highly Acceptable

60-79 3.51-4.50 Acceptable

40-59 2.51-3.50 Moderately Acceptable

20-39 1.51-2.50 Slightly Acceptable

0-19 1.00-1.50 Not Acceptable

Table 3.4 gives the rating equivalent to the score resulting from the prototype

execution. Evaluators may give their own score upon manually evaluated and

compared given strings.

3.3 SUBJECT OF THE STUDY

Researchers conducted surveys and evaluation from experts in programming,

mathematics and database administration. There were twenty-five invited

respondents to evaluate the e-migrate project and twenty three of them

26
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

responded. Among the respondents are five software developers, eight data

engineers and ten database administrators.

Table 3.5 Profile of respondents

Participants Description Frequency Percentage

Software Engineers Locally Employed 5 21.75

Data Engineers Internationally Employed 8 34.78

Database Admin Locally Employed 10 43.47

27
SCHOOL OF GRADUATE STUDIES
#2 Pres. Laurel Highway, Tanauan City, Batangas
778-0656 loc. 352\

SYSTEM REQUIREMENTS

To properly implement the system, minimum requirements must be provided.

Table 3.6 Software Requirements

Category Minimum Requirements Recommended


Requirements

Operating Windows 7 or higher and its Windows 10


System equivalent on other platforms

Browser Firefox, Google Chrome, Safari Firefox,Google


Chrome

Programming .net Framework 6.0 .NET Framework 6.0


Libraries

Hosting IIS 6.0 IIS 6.0

Database MS SQL Server 2018 MS SQL Server 2022

Table 3.6 displays the rundown of development and programing libraries needed

to complete the software development. The system is web based and expected

to be hosted in a Microsoft Server environment. .NET Framework is also a

requirement to handle. Core library requirements.

Category Minimum Requirements Recommended


Requirements

28

You might also like