
357 BROUGHT TO YOU IN PARTNERSHIP WITH

NoSQL Migration Essentials
Moving Out of a Relational Database

CONTENTS
• About SQL and NoSQL
−   SQL Incentives (What Were They Thinking?)
−   NoSQL Incentives (How Did We Get Here?)
• Key Concepts in NoSQL Migration
−   NoSQL Data Modeling
−   Common Migration Paths
• Conclusion
−   Additional Resources
UPDATED BY MATTHEW D. GROVES, PRODUCT MARKETING MANAGER

ORIGINAL BY FRANK EAVES, SENIOR SOFTWARE ENGINEER

The need to transition from an SQL (relational) to a NoSQL (non-relational) data solution usually arises in a positive business environment. Your customers' requirements are causing you to reevaluate your data solution in order to help them achieve their business objectives. So congratulations, you have come to the right place. When you're finished reading this Refcard, you'll have a solid understanding of the foundations of this transition and serious paths to consider when planning and implementing your migration from an SQL to a NoSQL database.

ABOUT SQL AND NOSQL
The primary engineering incentive driving the design principles for SQL is minimizing memory and disk usage, while that of NoSQL is improving scalability. Understanding the reasons behind each of these motivations will help set a foundation on which to base a migration from SQL to NoSQL.

SQL INCENTIVES (WHAT WERE THEY THINKING?)
Relational Software Inc., now Oracle, announced the first implementation of SQL in 1979. At the time, one megabyte of storage cost $47,000 (inflation adjusted). Given that storage was so expensive, engineering design was focused on the efficient use of this scarce resource. Not repeating data was the main challenge to limiting the amount of storage used. References to data in other tables and constraints to assist with data integrity are the guiding principles within the SQL database community.

This understanding and logic spread throughout the engineering community as well as into academia. Students of computer science are taught these principles and take that knowledge base with them into the professional workplace.

NOSQL INCENTIVES (HOW DID WE GET HERE?)
From 1979–2021 (42 years), the number of people using networked computers (the Internet) grew exponentially. During this time, the cost of storage fell to $0.023 per gigabyte. The motivations of 1979's data engineers have changed drastically. Internet hosts went from single digits in the late 60s to just over one billion in 2021. The modern incentive of data engineering is scalability, while cost of storage is now basically an afterthought. The foremost guiding principle within the NoSQL database community is using keys and indexes on independent data to create efficient data access and, as a result, achieve speed and scalability.

KEY CONCEPTS IN NOSQL MIGRATION
Why is database migration such a critical topic of conversation? It is essential to first understand how the issue ties directly to the success of today's modern business.

REFCARD | NOVEMBER 2022 1


REFCARD | NOSQL MIGRATION ESSENTIALS

Since the introduction of SQL, the importance of data has grown exponentially, and in tandem, so has the need to learn and apply the concepts of SQL databases. Academia and industry both found new and compelling uses for data. This naturally led colleges to teach relational principles. As graduates entered the workforce, relational databases were used in systems development, which resulted in many businesses running legacy software systems that are now struggling to scale and meet the needs of the modern Internet customer.

Relying on legacy systems can result in lagging behind the smaller start-ups using modern solutions and cloud-based services, including NoSQL. These start-ups can scale their offerings far beyond what the legacy systems can offer. The loss of market share for these companies has opened a new market for industry solutions that help businesses with legacy systems migrate their data infrastructure to NoSQL and educate engineering teams about efficient implementation of it with the hopes of maintaining and regaining lost market share.

Making this leap isn't as hard as it might seem. Next, we'll cover the basics (and a comparison) of SQL and NoSQL data modeling.

NOSQL DATA MODELING
The fundamental principles around NoSQL data modeling are the same as those used for SQL. The difference comes down to denormalization vs. normalization of data and the minor shift in the shared approaches that this causes. Below is a Venn diagram showing a high-level view of data modeling for SQL and NoSQL.

Figure 1

The process of creating an entity relationship diagram, understanding your application's access patterns, and then optimizing data throughput by using indexes is the same for both SQL and NoSQL. So let's focus on understanding the differences between normalization and denormalization.

DENORMALIZING DATA
What Is Normalization? …Just Be Like Everyone Else
The objectives of normalization are to reduce data redundancy and the amount of data storage used, as well as to improve data integrity.

In his paper, "Further Normalization of the Data Base Relational Model," Edgar Codd defined four forms of data normalization:

Table 1

FORM | PURPOSE
1NF (First Normal Form) | Remove unnecessary insertions, updates, and deletion dependencies from the data model; all values must be atomic
2NF | Optimize your data model so when new data types are required, restructuring the schema is unneeded; non-key values must have a clear connection to the primary key
3NF | Remove transitive dependencies within the data model
4NF | Identify statistics that could change in the future; make changes to the data model relationships so querying for these statistics is neutral

Following this process results in a set of tables with well-defined constraints that meet the objectives of reduced data redundancy and keeping data integrity intact.

What Is Denormalization? Only Exceptional Data Need Apply!
The objective of denormalization is to design all access patterns in the most efficient way possible. Data access patterns are a main focus, and careful analysis is needed to attain that efficiency. This is done by combining or nesting data into one structure, which makes the reads and writes faster and avoids the overhead of joins.

Some people think that NoSQL is schema-less; however, this isn't true. All applications have an inherent logical data structure, which still exists in a NoSQL database; it is just stored implicitly. By utilizing this capability, joins can be reduced, allowing the application to retrieve all necessary information using a single lookup.

Before jumping into the example, let's first review the possible variations of NoSQL data models.

NOSQL MODELS
For SQL data solutions, a table is the core data model. However, there are multiple models for NoSQL data solutions to choose from. This mainly depends on the NoSQL database you are using. The four most popular types are:

1. Key-value pair
2. Column-oriented
3. Graph-based
4. Document-oriented

Many NoSQL databases support two or more of these models. These are known as "multi-model" databases.
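The trade-off between normalization and denormalization described in this section can be sketched in a few lines of Python. This is only an illustration that uses plain dictionaries in place of a real database; the names `bands`, `songs`, `song_docs`, and the two lookup functions are hypothetical, and the sample values are borrowed from the music-site example used later in this Refcard.

```python
# Normalized layout: each fact is stored once, and songs reference bands by id.
bands = {
    1: {"name": "Arctic Monkeys", "origin": "Sheffield, England"},
    2: {"name": "Coldplay", "origin": "London, England"},
}
songs = {
    10: {"title": "When the Sun Goes Down", "band_id": 1, "year": 2018},
    11: {"title": "Yellow", "band_id": 2, "year": 2000},
}


def song_view_normalized(song_id):
    """Build the view with two lookups, the in-memory analogue of a join."""
    song = songs[song_id]
    band = bands[song["band_id"]]
    return {"title": song["title"], "band": band["name"], "year": song["year"]}


# Denormalized layout: the band name is copied into each song document,
# so one key lookup returns everything the screen needs.
song_docs = {
    "Arctic Monkeys - When the Sun Goes Down": {
        "title": "When the Sun Goes Down",
        "band": "Arctic Monkeys",
        "year": 2018,
    },
    "Coldplay - Yellow": {"title": "Yellow", "band": "Coldplay", "year": 2000},
}


def song_view_denormalized(key):
    """Return the view with a single lookup; no join is needed."""
    return song_docs[key]


if __name__ == "__main__":
    print(song_view_normalized(11))
    print(song_view_denormalized("Coldplay - Yellow"))
```

Both functions return the same view, but the denormalized layout stores the band name twice. If a band were renamed, the application would have to update every song document that carries the copy.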


Quick note: In a later section, a simple music site is used to demonstrate denormalization. The data used in the models below (band and song names) are taken from that example.

Key-Value Pair
For key-value pair NoSQL databases, the model consists of a key and value:

Table 2

PRIMARY KEY | VALUE
Band Name | Coldplay

But for more complex data structures, the value can be a JSON object, which is greater in depth and complexity. Typically, text is used as the visualization model for JSON structures.

Primary Key: Band Name
Value:

{
  "Band Name": "Coldplay",
  "Year": 1996,
  "Origin": "London, England"
}

Column-Oriented
For column-oriented NoSQL databases, tables (or "column families", since they are fundamentally different from relational tables) can be used to show the column data structures. Tables can have multiple header rows, one for each data structure:

Table 3

PRIMARY KEY | ATTRIBUTES
BAND NAME | YEAR | ORIGIN | BAND MEMBERS
Arctic Monkeys | 2002 | Sheffield, England | Alex Turner, Jamie Cook, Nick O'Malley, Matt Helders
SONG NAME | YEAR | PEAK CHART POS
When the Sun Goes Down | 2018 | 1

This model shows two data structures, Band Name and Song Name.

Graph-Based
For graph-based NoSQL databases, the model consists of nodes and edges. Data structures are put into nodes, and relationships are represented by the edges that join one node to another. For example, visualizing a band, the songs that the band performs, and the members of the band might be modeled this way:

Figure 2

Document-Oriented
Document-oriented NoSQL databases have a primary key and a JSON value.

Primary Key: Arctic Monkeys
Value:

{
  "Band Name": "Arctic Monkeys",
  "Year": 2002,
  "Origin": "Sheffield, England",
  "Band Members": [
    "Alex Turner",
    "Jamie Cook",
    "Nick O'Malley",
    "Matt Helders"
  ],
  "Genres": [
    "Indie rock",
    "garage rock",
    "post-punk revival",
    "psychedelic rock",
    "alternative rock"
  ]
}

Because all SQL models use tables, describing the migration process from SQL to NoSQL with tables may be a more applicable depiction that resonates with those coming from an SQL background. Therefore, in the following sections, a table model will be used.

With the base knowledge about NoSQL databases, let's dive into how denormalization works.

MOVING DATA INTO A NOSQL DATABASE
To demonstrate the core concept of denormalization, we'll walk through the design of a data solution from the beginning. However, most likely you will have gone through the initial phase(s) in building your current SQL data solution.


When creating any data solution, whether SQL or NoSQL, you should consider following these steps:

1. Understand your application – Use design tools and processes that are best suited to illustrate your problem domain and its design. For this example, I use wire diagrams.

2. Build an entity relationship diagram – Capture the data structures and the relationships between them.

3. Define your access patterns – The data that should be grouped together for optimal data delivery based on the application's needs.

4. Design your primary keys and indexes – Analyze the best way to design the primary keys and indexes for optimal data delivery.

For this example, our problem domain is a simple music site where users are presented with a list of songs and associated data, as well as links to more information about each song and band.

1: Understand Your Application
The functionality this application will provide is allowing users to choose a song or band they want to read more details about. Below is the Home screen of our simple music site.

Functionality: Present a list of songs with links to song and band details.

Figure 3

Users can select the song name to navigate to song details or the band name to navigate to band details.

The data model for the Home screen:

Table 4

PRIMARY KEY | ATTRIBUTES
HOME SCREEN | SONG NAME | BAND NAME | YEAR | PEAK CHART POS
Yes | When the Sun Goes Down | Arctic Monkeys | 2018 | 1
Yes | Yellow | Coldplay | 2000 | 2

Note that accessing this data structure will be done using the Home Screen column, and therefore, indexing this column may be necessary for improved performance.

Functionality: Allow users to view song details.

Figure 4

The model for the Song Details screen:

Table 5

PRIMARY KEY | ATTRIBUTES
BAND NAME - SONG NAME | ALBUM | YEAR | PEAK CHART POS | LYRICS
Arctic Monkeys - When the Sun Goes Down | Whatever People Say I Am, That's What I'm Not | 2018 | 1 | So who's that girl there? I wonder what went wrong So that she had to roam the streets
Coldplay - Yellow | Parachutes | 2000 | 2 | Look at the stars Look how they shine for you And everything you do

Note that if your primary key must be a number, then hashing the string value into a number will work as well.

Functionality: Allow users to view band details.

Figure 5

The model for the Band Details screen:

Table 6

PRIMARY KEY | ATTRIBUTES
BAND NAME - DETAILS | YEAR | ORIGIN | BAND MEMBERS | GENRES
Arctic Monkeys - Details | 2002 | Sheffield, England | Alex Turner, Jamie Cook, Nick O'Malley, Matt Helders | Indie rock, garage rock, post-punk revival, psychedelic rock, alternative rock
Coldplay - Details | 1996 | London, England | Chris Martin, Jonny Buckland, Guy Berryman, Will Champion, Phil Harvey | Alternative rock, pop rock, post-Britpop, pop

Note that the Band Members and Genres columns are both arrays, representing a one-to-many embedded relationship.

2: Build an Entity Relationship Diagram
The entity relationship diagram looks like this:

Figure 6

There are four entities: Band Name, Song Name, Genre, and Band Member. And there are three relationships, Perform, BelongsTo, and PartOf, all of which are one-to-many.

3: Define the Access Patterns
Access patterns are the data patterns/groupings needed for the application.

In our music site, we have Song Name and Band Name, and the access patterns for each of the wire diagrams are:

• Home screen – Displays the list of songs

• Song Details screen – Displays song details

• Band Details screen – Displays band details

To better understand these access patterns, let's look at a composite data model for all wire diagrams (data sourced from Tables 4-6).


Table 7

PRIMARY KEY | ATTRIBUTES
HOME SCREEN | SONG NAME | BAND NAME | YEAR | PEAK CHART POS
Yes | When the Sun Goes Down | Arctic Monkeys | 2018 | 1
Yes | Yellow | Coldplay | 2000 | 2
BAND NAME - SONG NAME | ALBUM | PEAK CHART POS | YEAR | LYRICS
Arctic Monkeys - When the Sun Goes Down | Whatever People Say I Am, That's What I'm Not | 1 | 2018 | So who's that girl there? I wonder what went wrong So that she had to roam the streets
Coldplay - Yellow | Parachutes | 2 | 2000 | Look at the stars Look how they shine for you And everything you do
BAND NAME - DETAILS | BAND MEMBERS | ORIGIN | YEAR | GENRES
Arctic Monkeys - Details | Alex Turner, Jamie Cook, Nick O'Malley, Matt Helders | Sheffield, England | 2002 | Indie rock, garage rock, post-punk revival, psychedelic rock, alternative rock
Coldplay - Details | Chris Martin, Jonny Buckland, Guy Berryman, Will Champion, Phil Harvey | London, England | 1996 | Alternative rock, pop rock, post-Britpop, pop

The beneficial part about denormalizing data this way is that when queried using a primary key, the application retrieves the data needed to satisfy the functionality without resorting to joins. The downside of denormalizing your data is that some data may be duplicated (e.g., Band Name, Year, Peak Chart Pos). In a fully normalized SQL database, data isn't duplicated, and data integrity is maintained by the database through table constraints.

To meet scalability needs, a NoSQL database has duplicated data and moves the responsibility of data integrity to the application code. When data needs to be updated in multiple places, it is the application's responsibility to keep all data structures up to date.

While not possible in every NoSQL database, many document databases have recently introduced JOIN capability. With JOIN functionality, you can design your model to take advantage of whichever approach is best. You can denormalize to maximize performance, and you can normalize to maximize data integrity.

4: Design Your Primary Keys and Indexes
In the composite table above, there are separate primary keys for each data structure used. By querying this single table, you can extract different schemas or groups of data. The Home screen's primary key is the Home Screen column. When the Home Screen column is queried, it will look for data that has a Home Screen attribute set to yes. Therefore, an index should be created on the Home Screen attribute, so that lookup will be as fast as possible. The Song Details primary key uses a string that consists of the Band Name and Song Name concatenated together. The Band Details primary key uses a string consisting of the Band Name followed by - Details. These primary keys are designed this way to give the application the data it needs in the quickest manner possible.

COMMON MIGRATION PATHS
In most cases, legacy solutions are unable to meet organizations' scalability objectives. And the engineering teams supporting those legacy applications may not be aware of NoSQL database principles; however, they have a tremendous amount of application and domain knowledge. In short, companies in these situations have two primary routes for migrating to a NoSQL database.

DIRECT PORT
Every business' situation is unique, and the migration approach will inevitably be unique as well. A direct port will be easiest if your current data solution has a clear decoupling between the application logic and data. If the application data layer is designed in a manner that allows pulling out one database architecture and replacing it with another, then it's just a matter of implementing the data layer interface using NoSQL.

However, if your engineering team needs to create a clear decoupling of the application logic and data, that work will need to be completed first. Then they can implement the data layer interface using NoSQL. These tasks may be done in parallel, with one team coming up to speed on NoSQL (and following the four steps for creating a new data solution discussed above) while the other team does the architectural work of decoupling the application logic and data. Then when both teams are ready, the application can be switched from an SQL database to a NoSQL database.

PIECEMEAL
In cases when a legacy application must remain running to keep the business operational, switching the entire data layer over at once isn't reasonable. Depending on the size of the data layer and the number of engineering teams involved, migrating piecemeal may be an option. Oftentimes, the areas where the application struggles under the current SQL data layer are known. Focus your engineering efforts there first, porting the critical code over in a piecemeal fashion.


SQL and NoSQL Working Together
If your application's data layer involves using multiple databases, you can choose to port over one database to a NoSQL solution while leaving the other alone. This approach works well in a cloud-native environment where multiple microservices are running. Teams should start with the service that is under the most stress and then move to the next service. When a monolith is involved, a careful process needs to be followed where you would wall off the part that has the most pressing issues with scaling. After decoupling this part, it starts to look like a microservice, and you can then focus on the data layer porting process. Once complete, the next piece of the monolith can be addressed, repeating the process.

Convert to NoSQL While Still Using SQL Queries
The market has recognized the need for businesses to migrate to a NoSQL solution. And those currently using an SQL solution also need a bridge that enables them to move incrementally into a full NoSQL database. This method offers three things:

1. Businesses can address their scalability objectives.
2. Engineering teams have time to come up to speed on NoSQL principles.
3. Businesses have more options for addressing and evolving their data solution going forward.

There are solutions available that allow you to convert your SQL schema over to a NoSQL database while continuing to use SQL queries. There are also NoSQL databases that are implementing the "SQL++" standard: adding denormalized capabilities to the familiar query language. This approach may help teams ramp up to NoSQL faster, building on their existing experience and code base. (With the advent of SQL++, the term "NoSQL" can be defined as "Not Only SQL": SQL is still an option for interacting with data, but it's not the only one.)

Afterward, the business can operate as normal and only address the areas where scalability has become a critical issue, leaving the rest of their data solution as is. As the business and engineering teams grow more familiar with NoSQL databases and key principles, leaders can make more informed decisions about how to design the business' overall data architecture.

CONCLUSION
There are areas where a NoSQL database shouldn't be used, and another type of database should be considered. For example, an engineering team may use custom queries to discover and explore relationships within their data, which provides the metrics needed for business intelligence efforts. Put simply, probing such data and performing statistical analyses that rely on complex database queries will execute poorly in a NoSQL database. NoSQL has been designed specifically to address database scalability at the expense of the memory used, the database's ability to support custom queries, and data integrity, just as the SQL database was designed to minimize memory use through the process of normalization. As SQL database technology evolved, so did its capabilities to handle efficient custom queries and support data integrity.

The basic principles surrounding NoSQL databases aren't that different from SQL. The key difference is in normalizing your data versus denormalizing your data. By walking through an application design step by step, you saw how data structures can be organized and then put into a denormalized structure. Finally, I highly encourage you to visit https://couchbase.live/ and interact with the JSON data to truly experience a NoSQL database (see Additional Resources). The lessons learned from the evolution of SQL are being applied to NoSQL in an effort to improve and strengthen its weaker areas, as well as to continue innovating simpler ways for organizations to migrate from SQL to NoSQL.

ADDITIONAL RESOURCES
•   "Number of Internet hosts worldwide: 1969–present" – https://en.wikipedia.org/wiki/History_of_the_Internet#/media/File:Internet_Hosts_Count_log.svg

•   "Further Normalization of the Data Base Relational Model" – https://forum.thethirdmanifesto.com/wp-content/uploads/asgarosforum/987737/00-efc-further-normalization.pdf

•   Couchbase Playground – https://couchbase.live/ (start a sandbox session and paste this code in the "Query Workbench" to create the structures and data)

    CREATE COLLECTION tutorial._default.bands;
    CREATE COLLECTION tutorial._default.songs;
    CREATE COLLECTION tutorial._default.band_details;

    INSERT INTO tutorial._default.bands (KEY, VALUE)
    VALUES
    ("Arctic Monkeys", {
        "Home Screen": "Yes",
        "Song Name": "When the Sun Goes Down",
        "Year": 2018,
        "Peak Chart Pos": 1
    }),
    ("Coldplay", {
        "Home Screen": "Yes",
        "Song Name": "Yellow",
        "Year": 2000,
        "Peak Chart Pos": 2
    });

    INSERT INTO tutorial._default.songs (KEY, VALUE)
    VALUES
    ("Arctic Monkeys - When the Sun Goes Down", {
        "Album": "Whatever People Say I Am, That's What I'm Not",
        "Peak Chart Pos": 1,
        "Year": 2018,
        "Lyrics": "So who's that girl there? I wonder what went wrong So that she had to roam the streets"
    }),
    ("Coldplay - Yellow", {
        "Album": "Parachutes",
        "Peak Chart Pos": 2,
        "Year": 2000,
        "Lyrics": "Look at the stars Look how they shine for you And everything you do"
    });

    INSERT INTO tutorial._default.band_details (KEY, VALUE)
    VALUES
    ("Arctic Monkeys", {
        "Band Members": [
            "Alex Turner",
            "Jamie Cook",
            "Nick O'Malley",
            "Matt Helders"
        ],
        "Origin": "Sheffield, England",
        "Year": 2002,
        "Genres": [
            "indie rock",
            "garage rock",
            "post-punk revival",
            "psychedelic rock",
            "alternative rock"
        ]
    }),
    ("Coldplay", {
        "Band Members": [
            "Chris Martin",
            "Jonny Buckland",
            "Guy Berryman",
            "Will Champion",
            "Phil Harvey"
        ],
        "Origin": "London, England",
        "Year": 1996,
        "Genres": [
            "Alternative rock",
            "pop rock",
            "post-Britpop",
            "pop"
        ]
    });

Note: This is the same data in the composite table (Table 7) expressed as a JSON structure. To query this data, you will need minimum indexes:

•   CREATE PRIMARY INDEX ON tutorial._default.bands;

•   CREATE PRIMARY INDEX ON tutorial._default.songs;

•   CREATE PRIMARY INDEX ON tutorial._default.band_details;

(More detailed indexes will be necessary later to improve performance.)

And to query, you can try the following:

•   For Home Screen data: SELECT META(b).id AS BandName, b.* FROM tutorial._default.bands b WHERE b.`Home Screen` = 'Yes';

•   For Song Details data: SELECT META(b).id AS SongName, b.* FROM tutorial._default.songs b WHERE META(b).id = 'Coldplay - Yellow';

•   For Band Details data: SELECT META(b).id AS BandName, b.* FROM tutorial._default.band_details b WHERE META(b).id = 'Coldplay';

As an alternative to embedding and/or duplicating data, consider a SQL++ query using a join in NoSQL:

•   SELECT META(b).id AS BandName, d.Origin FROM tutorial._default.bands b INNER JOIN tutorial._default.band_details d ON META(b).id = META(d).id;

UPDATED BY MATTHEW GROVES, PRODUCT MARKETING MANAGER
Matthew D. Groves is a guy who loves to code. It doesn't matter if it's C#, jQuery, or PHP: he'll submit pull requests for anything. He has been coding professionally ever since he wrote a QuickBASIC point-of-sale app for his parent's pizza shop back in the 90s. He currently works as a Product Marketing Manager for Couchbase. He is the published author of AOP in .NET and Pro Microservices in .NET 6, and is also a Microsoft MVP.

WRITTEN BY FRANK EAVES, SENIOR SOFTWARE ENGINEER
Computer programmer with 20+ years' experience in application development, project and team management, and enterprise integration. Currently focusing on full-stack enterprise architecture/applications utilizing object-oriented programming languages and frameworks (Spring, Angular, Serverless, React, etc.). Additional experience includes building embedded systems with C and C++, building trading systems with C# (.NET Framework), and machine learning with Python.

600 Park Offices Drive, Suite 300
Research Triangle Park, NC 27709
888.678.0399 | 919.678.0300

At DZone, we foster a collaborative environment that empowers developers and tech professionals to share knowledge, build skills, and solve problems through content, code, and community. We thoughtfully, and with intention, challenge the status quo and value diverse perspectives so that, as one, we can inspire positive change through technology.

Copyright © 2022 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means of electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
