
Shift Left - How I carried out ETL & data migration testing activities (10 mins read)


Recently I got an opportunity to work on an ETL & data migration project. In this article, I would like to share the way I carried out data migration testing activities, along with a checklist for verifying data migration.

First, let's understand what data migration is and why we need it.

What is data migration?


Simply put, it is transferring data from one database to another.

Why?
A data migration requirement can arise for several reasons. Below are two of them.

- Decommissioning legacy applications (re-building the application using a newer architecture)
- Changing the database vendor (e.g. from Oracle to MySQL), etc.
In a microservice architecture, each microservice should have its own database, so that changes to one service's database do not impact any other service. In our case, we are moving from a monolithic architecture to a microservice architecture. Therefore, we need to migrate service-related data from a large legacy database to a new micro database.

To explain the topic in a practical way, let's take an example.

Assume our source database is DB2 and our destination database is PostgreSQL.

There is a table in the source database (DB2) that contains customer-related data, which the customer microservice needs to consume to perform its dedicated functions.
Table description in source database (DB2)
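The original post presents the source table description as an image. As a rough sketch only, assuming the column names and sizes referenced later in this article, the DB2 source table could look something like this:

-- Hypothetical DB2 source table, reconstructed for illustration only
CREATE TABLE customer (
    id              VARCHAR(10) NOT NULL,
    name            VARCHAR(30),
    credit_limit    INT NOT NULL,
    registered_date DATE,
    registered_time TIME,
    email           VARCHAR(30),
    PRIMARY KEY (id)
);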

STEP ONE – We have the source database (DB2) table; we should finalize the destination database model

For that, the team came up with the database table model below.

The team decided to concatenate "registered_date" and "registered_time" into "registered_date_time".

They also added an extra column, "is_deleted", to support soft deletion.
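Reconstructed from these decisions and the SQL that appears in step two, the proposed model looks roughly like this (the original document presented it as a table):

Column                  Type / Size      Constraints
id                      VARCHAR(10)      PRIMARY KEY
name                    VARCHAR(30)
credit_limit            INT              NOT NULL
registered_date_time    BIGINT (epoch)
email                   VARCHAR(40)
is_deleted              BOOLEAN          DEFAULT FALSE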

Developers are going to create this table in the new micro database by referring to this document. At this point, as testers, we must review this document and make sure the proposed architecture (database model) is correct.

I would highly recommend that testers get involved from the design phase itself, so that you can provide your inputs at an early stage, which will also reduce the review time.

To review the document with the proposed database model, I created a review checklist.

Database model document review - Checklist

Number of columns - Verify whether the proposed database model includes all the source table (customer) columns in DB2 that we have planned to migrate to the micro database.

- In the above example - We discussed and agreed to concatenate the date and time columns and to add a new column (is_deleted). (One way to pull the source column list from DB2 is sketched after this checklist.)
- Result - Pass
Column names - Verify whether the column names mentioned in the proposed database model describe the columns in DB2 correctly.

- In the above example - All the column names in the proposed architecture describe the columns in DB2.
- Result - Pass
Note: It's not necessary to name the customer's name column "customer_name" in the customer table. If you define the column as "name", it already means "customer_name", since the column is in the customer table anyway.

Column data types – If you are NOT applying any transformation logic, the data types should match between the source and the destination.

- In the above example - We have decided to transform the date-time format to an epoch timestamp. Other than that, the column data types match the DB2 columns' data types.
- Result - Pass

Column sizes – The column sizes mentioned in the proposed architecture should be greater than or equal to the DB2 column sizes.

- I gave an input on this point. Since we have defined "name" as varchar(30), it's better to increase the size of the "email" column from varchar(30) to varchar(40), because there is a considerable likelihood that an email address contains both the first name and the last name. Since an email also includes a domain name (e.g. @gmail.com), the size of the "email" field should be greater than that of the "name" field.
- Result – Pass

Primary Key / Foreign Key – The primary key and foreign keys (if available) should be mentioned in the proposed architecture document.

- Result - Pass

Not Null – Not-null columns defined in the source (DB2) should be mentioned in the proposed architecture document.

- Result - Pass

Database best practices –

- Table names and column names should be lower case, with words separated by underscores (do not use spaces in table names).
- Use singular words for table names (customer instead of customers).
- Avoid unnecessary prefixes or suffixes (customer instead of tbl_customer), etc.
- Result – Pass

Consistency - Columns referring to the same data should be identical everywhere in the database. (e.g. the data type and size of the customer id in the customer table should be the same as the data type and size of the customer id (foreign key) in the order table)

- Result – N/A

Default values – Discussed and agreed default values should be mentioned in the proposed database model document.

- Result – Pass
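One way to support several of the column-level checks above (number of columns, names, data types, sizes, nullability) is to pull the source column metadata straight out of DB2 and compare it with the proposed model document. A minimal sketch, assuming DB2 LUW (where the SYSCAT catalog views are available) and a source table named CUSTOMER:

-- List the source table's columns, data types, sizes and nullability (DB2 LUW)
SELECT colname, typename, length, scale, nulls
FROM syscat.columns
WHERE tabname = 'CUSTOMER'
ORDER BY colno;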

Since I spent some quality time reviewing the proposed database model document, I identified some mistakes early which could otherwise have caused issues in the future.

Okay, now we have finalized the database model.

STEP TWO – Create the finalized table in the PostgreSQL database

Developers create this table in the new micro database in accordance with the proposed database model document. Once they commit the SQL queries, they add testers as reviewers on their pull requests to the remote repository. As testers, we should be familiar with SQL queries and need to review the pull request against the proposed database model. (A defect prevention activity.)

According to the above example, the SQL query should be:

CREATE TABLE customer (
    id VARCHAR(10),
    name VARCHAR(30),
    credit_limit INT NOT NULL,
    registered_date_time BIGINT,
    email VARCHAR(40),
    is_deleted BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (id)
);

The column names, data types, sizes, and data validations applied in this query can be verified against the document, and the pull request can then be approved from the tester's end.

In this case, you can approve the pull request containing the above query, since it will create the exact table described in the proposed database model.

Okay, step two is also done. Once everything is fine with the DB scripts, they can be executed in the QA-dedicated PostgreSQL DB instance.
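As a quick sanity check after the scripts run, the created table can also be compared against the document directly on the QA PostgreSQL instance. A minimal sketch using the standard information_schema views, assuming the customer table from the example above:

-- Verify the created table's columns, types, sizes, nullability and defaults (PostgreSQL)
SELECT column_name, data_type, character_maximum_length, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'customer'
ORDER BY ordinal_position;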

STEP THREE – Review ETL transformation logic


Before moving into step 3, let’s explain ETL briefly.

ETL stands for Extract, Transform, and Load.

Extract – Extract data from the source database (in our case, DB2).

Transform – Apply transformation logic to change the source data if needed (in our case, we have to write logic to concatenate "registered_date" and "registered_time" into "registered_date_time").

Load – Load the data into the destination database (in our case, PostgreSQL).

We can create ETL jobs using tools (e.g. Talend) and migrate data from source to destination by running those jobs.

As testers, we should review the transformation logic in the ETL rather than waiting to verify the migrated data, so that we can prevent data mismatches or incorrect data in the destination database.
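For instance, the date/time concatenation in our example has to turn a DB2 DATE plus TIME pair into a single epoch value. A minimal sketch of what that transformation could look like if it were expressed as SQL on the PostgreSQL side (the actual job was built in an ETL tool, and staging_customer here is just a hypothetical table holding the extracted DB2 rows):

-- Illustration only: combine a date and a time into an epoch timestamp
SELECT id,
       EXTRACT(EPOCH FROM (registered_date + registered_time))::BIGINT AS registered_date_time
FROM staging_customer;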

For example, we can have a discussion with the developers and verify that all the transformation logic identified in the design phase is implemented in the ETL.

Okay, we have reviewed the transformation logic as well.

STEP FOUR (final step) – Run the ETL job, migrate data from source to destination, and verify data accuracy to make sure the data migration is successful

Once the ETL job is completed, we need to make sure all the data in the source database (DB2) is in the destination database (PostgreSQL) as well. We call this data accuracy testing, and it is carried out to make sure the data migration is successful.

For that, I have a checklist to verify destination data against source data.

Data accuracy testing - Checklist

- Verify record counts
- Verify record counts for a specific duration (e.g. total customers registered in 2017)
- Latest date in the table
- Oldest date in the table
- Sum of an integer column
- Max of an integer column
- Min of an integer column
- Number of customers who are using Gmail (%@gmail.com), etc.
- more...
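Most of these checks boil down to running the same aggregate on both sides and comparing the results. A few sketches of the destination-side (PostgreSQL) queries, assuming the customer table from step two; the equivalent queries are run against DB2 and the numbers are compared:

-- Record count
SELECT COUNT(*) FROM customer;

-- Record count for a specific duration (customers registered in 2017, epoch column)
SELECT COUNT(*)
FROM customer
WHERE registered_date_time >= EXTRACT(EPOCH FROM TIMESTAMP '2017-01-01')::BIGINT
  AND registered_date_time <  EXTRACT(EPOCH FROM TIMESTAMP '2018-01-01')::BIGINT;

-- Latest / oldest dates and sum / max / min of an integer column
SELECT MAX(registered_date_time) AS latest,
       MIN(registered_date_time) AS oldest,
       SUM(credit_limit)         AS total_credit,
       MAX(credit_limit)         AS max_credit,
       MIN(credit_limit)         AS min_credit
FROM customer;

-- Number of customers who are using Gmail
SELECT COUNT(*) FROM customer WHERE email LIKE '%@gmail.com';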
There is more to discuss: testing the performance of the ETL, fault tolerance testing, challenges faced in data migration, etc. Let's discuss those soon.
