
THE DZONE GUIDE TO

DATABASE AND
PERSISTENCE
MANAGEMENT

2015 EDITION

BROUGHT TO YOU IN PARTNERSHIP WITH


1 dzones 2015 guide to database and persistence management


Dear Reader,

For thousands of years, the closest things to databases that humans have had were libraries. Even in those days, the ancients knew how out of hand things could get if sorting standards changed or if texts weren't returned to the proper place. Today, we have databases that can access thousands of records in mere milliseconds. Data can be infinitely duplicated and preserved, unlike the ill-fated Library of Alexandria. And yet, errors in sorting, translation, and revision still happen even in our most advanced databases today. Nothing is perfect, but modern standards for data persistence, like SQL and relational schemas, have endured and kept a majority of the world's databases well-organized. Other styles of persistence have existed for many years, but only in the last few years have so many broken into mainstream use, causing a rapid, unprecedented expansion of database types available.

More choices can be a blessing and a curse. Without the right understanding of new technologies and how they relate to your situation, the multitude of persistence options can be paralyzing. Our second guide of 2015 is here to help you navigate this sea of technology and understand where certain options might fit in your organization. You'll also learn about the trends and opinions surrounding modern database technology to get an idea of most companies' recipes for success.

This new guide topic is the result of a split in coverage from last year's Big Data guide. While DZone's 2014 Guide to Big Data contained content and product information for both databases and analytics platforms, this guide focuses solely on databases and persistence utilities. Later this year, we will cover analytics platforms and data processing tools in a separate guide.

We're proud to present DZone's first pure guide to databases. I hope it gives you some new ideas and helps you build better software!

Mitch Pronschinske
Senior Research Analyst
research@dzone.com

Table of Contents

3 Summary & Key Takeaways
4 Key Research Findings
6 A Review of Persistence Strategies by Andrew C. Oliver
10 Firebase, Meteor & Cognito: Just Enough Database for a Mobile World by Ben Teese
14 The State of the Storage Engine by Baron Schwartz
16 The Mixology of Databases Infographic
18 3 Reasons Why It's Okay to Stick with SQL by Lukas Eder
22 How to Manage Deterministic Data-Driven Services by Peter Lawrey
25 Diving Deeper Into Database & Persistence Management
26 Data Migration Checklist
27 Solutions Directory
36 Glossary

Credits

DZone Research: John Esposito (Editor-in-Chief), Jayashree Gopal (Director of Research), Mitch Pronschinske (Sr. Research Analyst), Benjamin Ball (Research Analyst), Matt Werner (Market Researcher), John Walter (Content Curator), Ryan Spain (Content Curator)

Marketing & Corporate Sales: Alex Crafts (VP of Sales), Matt O'Brian (Director of Business Development), Chelsea Bosworth (Marketing Associate), Brandon Rosser (Customer Success Advocate), Jillian Poore (Sales Associate)

Management: Rick Ross (CEO), Matt Schmidt (President & CTO), Kellet Atkinson (General Manager), Ashley Slate (Director of Design), Chris Smith (Production Advisor)

Special thanks to our topic experts Ayobami Adewole, Eric Redmond, Oren Eini, Jim R. Wilson, Peter Zaitsev, and our trusted DZone Most Valuable Bloggers for all their help and feedback in making this report a great success.



Summary & Key Takeaways

Almost all non-trivial software products require a database. Performant, scalable persistence is a constant concern for developers, architects, operations, and business-level managers. Companies of all sizes are now facing unprecedented data management and analysis challenges because of the sheer amount of data being generated, and the speed at which we want to digest it. Some estimates expect the amount of digital data in the world to double every two years [1]. The predictions for data growth are staggering no matter where you look. DZone's Guide to Database and Persistence Management is a valuable handbook for understanding and conquering the challenges posed by modern database usage.

The resources in this guide include:

- Side-by-side feature comparison of the best databases and database hosting services.
- Comprehensive data sourced from 800+ IT professionals on database usage, data storage, and experiences.
- A step-by-step database migration checklist.
- An introduction to several innovative new data persistence technologies.
- Forecasts and opinions about the future of database usage.

Key Takeaways

Database Performance is the Biggest Challenge for Operations and Engineering

Developers and operations professionals both agree that achieving and maintaining high database performance is their biggest challenge related to persistence. 71% of respondents in DZone's 2015 Databases survey said performance was one of their biggest challenges related to databases, which was significantly higher than other challenges such as scalability (44%) and data migration (36%). This highlights the importance of choosing the right database and getting familiar with its query language, so that queries can be optimized and application access can be tuned.

[Chart: biggest database challenges: performance 71%, scalability 44%, data migration 36%]

DevOps Practices Struggle to Reach Database Administration

Although DZone's 2015 Continuous Delivery survey showed modest growth (+9%) for respondents that extended Continuous Delivery practices to their database, the overall number (30%) is about half the amount of respondents using CD practices in application build management (61%). One key reason why CD isn't more common in database administration might be the lack of version control used in databases. In this year's databases survey, respondents were asked if they used any form of version control for their database. Nearly half, 48%, said no.

RDBMS and the Big 3 Databases Still Dominate

The 2014 Big Data survey and this year's database survey both confirm that the clear Big 3 databases in IT are Oracle, MySQL, and Microsoft SQL Server. All three resided in the 50%-57% range for organizational use this year, while the other databases were further behind. PostgreSQL is used in 31% of respondents' organizations and MongoDB (the most-used NoSQL database) is used in 20% of the organizations. Even though NoSQL databases were touted as the next evolution in persistence by developers who were fed up with certain aspects of RDBMS, relational data stores are still a dominant part of the IT industry (98% use a relational database), and the strengths of SQL databases have been reiterated by many key players in the space. Today, experts instead suggest a multi-database solution, or polyglot persistence approach. 30% of respondents' organizations currently use a polyglot persistence approach where at least one relational database and one non-relational database are used.

[Chart: polyglot persistence: 98% use an RDBMS, 32% use NoSQL, 30% use both]

[1] http://www.emc.com/about/news/press/2014/20140409-01.htm


Key Research Findings

More than 800 IT professionals responded to DZone's 2015 Databases Survey. Here are the demographics for this survey:

- Developers (45%) and development team leads (28%) were the most common roles.
- 63% of respondents come from large organizations (100 or more employees) and 37% come from small organizations (under 100 employees).
- The majority of respondents are headquartered in Europe (40%) or the US (32%).
- Over half of the respondents (69%) have over 10 years of experience as IT professionals.
- A large majority of respondents' organizations use Java (85%). C# is the next highest (40%).

Developers Focus on Queries, Operations Focuses on Monitoring and Admin

There were significant differences between the database-related activities of developers and operations professionals. Developers spend most of their time writing queries (80%), optimizing queries (62%), and tuning data access in application code (49%). Operations professionals focus on database monitoring (79%), database administration (74%), and managing data security and user access (65%). Operations professionals also write queries frequently (65%), but they are involved in diagnosing production database issues (62%), tuning database configuration (50%), and their other three primary jobs mentioned above much more often than developers.

[Chart: developers' top database activities: writing queries 80%, optimizing queries 62%, tuning data access in application code 49%; operations' top activities: database monitoring 79%, database administration 74%, managing data security and user access 65%]

Object-Relational Mappers are Used in Moderation

For years, developers have constantly argued about whether Object-Relational Mapping tools (ORMs) are a useful solution or a detrimental anti-pattern. According to the survey results, a majority of developers use ORMs, but not all of the time. Slightly more than a quarter of respondents don't use ORMs (28%). While there are still vocal anti-ORM developers out there, the silent majority seems to understand the risks, limitations, and benefits of ORMs enough to take a balanced approach, using them where they are appropriate.

[Chart: ORM usage: don't use ORMs 28%, use them 100% of the time 14%, more than 50% of the time 30%, less than 50% of the time 28%]

Few Database Administration Tasks are Automated

A majority of respondents automate database backup (65%) and database monitoring (56%), but most


respondents do other database administration tasks manually. About one fourth of respondents automated database change management (25%), documentation generation (24%), and installation/patching (26%). Tasks such as tuning database configuration, optimizing queries, and diagnosing production database issues were mainly manual.

[Charts: number of database products used and database versions maintained per product (most common: two products, 33%; one version, 36%; two versions, 32%); do you anticipate introducing new databases in your company in the next year? yes: 37%, no: 63%]

One Type of Database is Usually Not Enough

81% of respondents used two or more database products in their organization. The largest percentage of respondents use two products (33%). Another interesting statistic is the number of different versions that respondents' organizations maintain for a given database. Most maintain one (36%), but a fair number also maintain two (32%), revealing that tech companies aren't afraid to use multiple products, but they do dislike the difficulty of maintaining several versions.

Databases Moving Toward Cloud Infrastructure and Distributed Architecture

56% of respondents have distributed architectures with many database instances. 45% of those respondents make their own internal distributed infrastructure while the other 11% use cloud infrastructure services. Regarding the environments where respondents run their databases, 59% run them virtualized in a private datacenter, 53% run them on bare metal in a private datacenter, and 22% use a private cloud. Hybrid cloud, PaaS, IaaS, and DBaaS are all used by less than 10%. However, in the next year, fewer respondents expect that their databases will be virtualized in a private datacenter (58%) or hosted on bare metal in a private datacenter (46%), and more expect to be hosting them in the private cloud (33%). There are also 2%-3% increases in the number of respondents who expect to use PaaS, IaaS, and DBaaS. One other interesting statistic about distributed database environments is the usage of sharding, which seems to have stabilized after more experts began explaining when it is advantageous to shard and when it is harmful.

[Chart: do you practice sharding on your databases? yes: 24%, no: 76%]

Many Organizations are Considering New Databases

While the majority of organizations (63%) are satisfied with their current array of databases, a significant number of respondents expect to add a new database product to their stacks in the next year (37%). With the ever-growing number of new database options and persistence strategies, it's prudent for developers to anticipate some changes in technology.
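At its core, the sharding practice surveyed above comes down to routing each key to one node. A minimal Python sketch of hash-based routing (the node names are invented for illustration; real systems layer consistent hashing and rebalancing on top of this idea):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical shard servers

def shard_for(key, nodes=NODES):
    """Route a key to a node by hashing it, so each key has one stable home."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# The same key always routes to the same node, so reads find what writes stored.
assert shard_for("user:42") == shard_for("user:42")

# Different keys spread across the cluster.
print({key: shard_for(key) for key in ("user:1", "user:2", "user:3")})
```

One reason sharding can be harmful when applied too early: naive modulo placement reassigns most keys whenever a node is added or removed, which is why production systems prefer consistent hashing and careful shard-key selection.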


A Review of Persistence Strategies

by Andrew C. Oliver

The Big Data and NoSQL trends are, at best, misnamed. The same technologies used to process large amounts of data are sometimes needed for moderate amounts of data that then must be analyzed in ways that, until now, were impossible. Moreover, there are NewSQL databases, or ways to run SQL against NoSQL databases, that can be used in place of traditional solutions.

Many of these types of databases are not all that new, but interest in them wasn't high when they were first invented, or the hardware or economics were wrong. For the last 30 years or so, the tech industry assumed the relational database was the right tool for every structured persistence job. A better idea is to lump everything under "use the right tool for the right job."

Your RDBMS isn't going away. Some tasks are definitely tabular and lend themselves well to relational algebra. However, it is possible that some of the more expensive add-ons for your RDBMS will go away. Scaling a traditional RDBMS is difficult at best. Partitioning schemes, multi-master configurations, and redundancy systems offered by Oracle, SQL Server, and DB2 are expensive and problematic at best. They often fail to meet the needs of high-scale applications. That said, try joining two or three small tables and qualifying by one or two fields on Hadoop's Hive and you will be rather underwhelmed with its performance. Try putting truly tabular data into a MongoDB document and achieving transactional consistency and it will frustrate you.

Organizations need to identify the patterns and characteristics of their systems, data, and applications and use the right data technologies for the job. Part of that is understanding the types of data systems available and their key characteristics.

Key-Value Stores

To some degree all NoSQL databases derive from key-value stores. These can be thought of as hash maps. Their capabilities are fairly limited: they just look up values by their key. Most have alternative index capabilities that allow you to look up values by other characteristics of the values, but this is relatively slow and a good indication that maybe a K-V store is not the best choice. Sharding or distributing the data across multiple nodes is very simple. The market is crowded with K-V stores such as Aerospike, Redis, and Riak, all of which are simple to implement.

Column-Family Databases

Many K-V stores also operate as column-family databases. Here the value evolves into a multidimensional array. These are again most efficient while looking up values. However, the secondary indexes are often implemented with pointer tables, meaning a hidden table with that field as the key points to the key of the original table. A lookup is essentially two O(1) probes instead of one: still constant time, but twice the work of a single direct lookup, and still decent. The two most popular, Cassandra and HBase (the latter built on top of Hadoop), have different write semantics. HBase offers strong write integrity; Cassandra offers eventual consistency. If you have a fair number of reads and writes, and a dirty read won't end your world, Cassandra will handle them concurrently and well. A good example of this is Netflix: if they are updating the catalog and you see a new TV show, but the seasons haven't been added, and you click on the title and there is a momentary glitch where you only see season one until the next time you click, would you even notice? However, this is unlikely to happen, because by the time you click it was likely added. Obviously this is no way to do accounting, but it is perfectly fine for a lot of datasets. HBase offers more paranoid locking semantics for when you need to be sure the reads are clean. This obviously comes at a cost in terms of concurrency, but is still about as good as row-locking in an RDBMS. Time series data is classically given as an optimal use case for column-family databases, but there are lots of other strong use cases such as catalogs, messaging, and fraud detection.

Column-Oriented RDBMS

These are the new hot thing, with SAP's HANA being so well marketed. The only reason you can't have a truly distributed RDBMS is because they didn't evolve that way. Granted, distributed joins can be heavy, but if optimized correctly they can be done more efficiently than, say, MapReduce against Hadoop's Hive (which is notoriously slow for many types of queries). Column-oriented relational databases may also perform better than traditional RDBMS for many types of queries. Conceptually, think of turning a table sideways: make your rows columns and columns into rows. Many values



in an RDBMS are repetitive and can be compressed with either pointers or compression algorithms that work well on repetitive data. This also makes lookups or table scans much faster, and data partitioning becomes easier. In addition to HANA, there is also Splice Machine, which is built on top of HBase and Hadoop. The promises of these databases include high-scale distribution with column-oriented performance and good compression, and they still offer transactional integrity equivalent to a traditional RDBMS.

Document Databases

To some, these are the holy grail of NoSQL databases. They map well to modern object-oriented languages and map even better to modern web applications sending JSON to the back-end (i.e. Ajax, jQuery, Backbone, etc.). Most of these natively speak a JSON dialect. Generally the transactional semantics are different, but comparable to an RDBMS. That is, if you have a person in an RDBMS, their data (phone numbers, email addresses, physical addresses) will be distributed across multiple tables and require a transaction for a discrete write. With a document database, all of their characteristics can be stored in one document, which also can be written in a discrete write. These are not so great if you have people, money, and other things all being operated on at once, involving operations that complete all-or-none, because they do not offer sufficient transactional integrity for that. However, document databases scale quite well and are great for web-based operational systems that operate on a single big entity, or systems that don't require transactional integrity across entities. There are also those who use document databases as data warehouses for complex entities with moderate amounts of data, where joining lots of tables caused problems. MongoDB and Couchbase are typically the leaders in this sector.

Graph Databases

Complex relationships may call for databases where the relationships are treated as first-order members of the database. Graph databases offer this feature along with better transactional integrity than relational databases for the types of datasets used with graphs: the relationships can be added and removed in a way similar to adding a row to a table in an RDBMS, and that too is given all-or-none semantics. Social networks and recommendation systems are classic use cases for graph databases, but you should note that there are a few different types of graph databases. Some are aimed more at operational purposes (Neo4j) while others are aimed more at analytics (Apache Giraph).

RDBMS

Some datasets are very tabular, having use cases that lend themselves well to relational algebra. They require lots of different things to happen in all-or-none transactions, but don't expand to tens of terabytes. This is the relational database's sweet spot, and it is a big one. Even things that do not fit nicely but have been made to work after a decade of hammering may not be worth the migration cost. This is not about paring down your toolkit; it is about adding tools so you always have the right one for the job.

What About Hadoop?

Hadoop is not a database. HDFS (Hadoop Distributed File System) distributes blocks of data redundantly across nodes, similar to Red Hat's Gluster, EMC's IFS, or Ceph. You can think of it as RAID over several servers on the network. You store files on it much as you would on any file system; you can even mount it if you like. It uses MapReduce, which is a framework for processing data across a network. This was popularized by Hadoop but is quickly giving way to DAG (directed acyclic graph) frameworks such as Spark.

While the original use cases for Hadoop were pure analytics or data warehousing models where data warehousing tools were not sufficient or economically sensible, the ecosystem has since expanded to include a number of distributed computing solutions. Storm allows you to stream data, Kafka allows you to do event processing/messaging, and other add-ons like Mahout enable machine learning algorithms and predictive analytics. The use cases are as diverse as the ecosystem. Chances are if you need to analyze a large amount of data, then Hadoop is a fine solution.

Putting It Together

When you understand the characteristics of the different types of storage systems offered, the characteristics of your data, and the scenarios that your data can be used for, choosing the right set of technologies is not a daunting task. The era of one big hammer (the RDBMS) to suit every use case is over.

Andrew C. Oliver is the president and founder of Mammoth Data, a Durham-based Big Data/NoSQL consulting company specializing in Hadoop, MongoDB, and Cassandra. Oliver has over 15 years of executive consulting experience and is the Strategic Developer blogger for InfoWorld.com.

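The pointer-table lookup described in the Column-Family Databases section above is easy to sketch. This toy Python class (an illustration of the pattern only, not of Cassandra's or HBase's internals; all names are invented) keeps a hidden map from an indexed field back to primary keys, so a secondary lookup costs two hash probes instead of one:

```python
class ToyKVStore:
    """A hash-map K-V store with one secondary index kept as a pointer table."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.primary = {}        # primary key -> value (a dict of fields)
        self.pointer_table = {}  # indexed field value -> set of primary keys

    def put(self, key, value):
        self.primary[key] = value
        field = value.get(self.indexed_field)
        if field is not None:
            self.pointer_table.setdefault(field, set()).add(key)

    def get(self, key):
        # Primary lookup: a single hash probe.
        return self.primary[key]

    def get_by_index(self, field_value):
        # Secondary lookup: one probe to find the primary keys,
        # then one probe per key -- two hops instead of one.
        keys = self.pointer_table.get(field_value, set())
        return [self.primary[k] for k in keys]

store = ToyKVStore(indexed_field="city")
store.put("u1", {"name": "Ada", "city": "Durham"})
store.put("u2", {"name": "Sam", "city": "Durham"})

print(store.get("u1")["name"])  # Ada
print(sorted(v["name"] for v in store.get_by_index("Durham")))  # ['Ada', 'Sam']
```

Updates and deletes are what make real secondary indexes costly: every change to the indexed field has to touch both tables, which is part of why heavy secondary-index use is a sign a K-V store may not be the best fit.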


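The person example in the Document Databases section above can be made concrete. In this hypothetical Python sketch (invented schema and data), the relational layout spreads one person across three tables, so an all-or-none write needs a transaction spanning all of them, while the document layout captures the same entity as one value written in a single discrete write:

```python
# Relational shape: one person spread across three "tables".
persons = {1: {"name": "Ada Lovelace"}}
phones = [{"person_id": 1, "number": "555-0100"}]
emails = [{"person_id": 1, "address": "ada@example.com"}]

# Document shape: the whole entity is a single value.
person_doc = {
    "_id": 1,
    "name": "Ada Lovelace",
    "phones": ["555-0100"],
    "emails": ["ada@example.com"],
}

def assemble_person(pid):
    """Rebuild the document shape from the relational tables (the 'join')."""
    return {
        "_id": pid,
        "name": persons[pid]["name"],
        "phones": [p["number"] for p in phones if p["person_id"] == pid],
        "emails": [e["address"] for e in emails if e["person_id"] == pid],
    }

print(assemble_person(1) == person_doc)  # True
```

The trade-off follows directly: a single-document write is atomic for that entity, but an operation spanning this document and, say, an account document gets no such guarantee, which is the "people, money, and other things all being operated on at once" case the article flags.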
sponsored opinion

Building a Faster Real-time Big Data Solution

There was a time when big data meant Hadoop. Big Data was really just offline analytics. That's no longer the case. In today's world, Hadoop can no longer meet big data challenges by itself. The explosion of user-generated data is being followed by an explosion of machine-generated data. Hadoop is great for storing and processing large volumes of data, but it's not equipped to ingest it at this velocity. This has led to a new generation of systems engineered specifically for real-time analytics, including Storm and Couchbase Server.

THERE ARE THREE BIG DATA CHALLENGES:

1. Data volume: The amount of data being generated.
2. Data velocity: The rate at which data is being generated.
3. Information velocity: The rate at which information must be generated from raw data.

Hadoop addresses data volume. It can store and process a lot of data, later. It scales out to store and process more data. Hadoop doesn't address data velocity. However, it meets offline analytical requirements.

Couchbase Server addresses data velocity. It is a high-performance NoSQL database that can store a lot of data, in real time. It scales out to store a lot of data, faster. It meets operational requirements.

Storm addresses information velocity. It scales out to process streams of data, faster. It meets real-time analytical requirements.

Integrating Hadoop, Couchbase Server, and Storm creates a faster big data solution. Hadoop and Couchbase Server form the inner core. Hadoop enables big data analysis. Couchbase Server enables high-performance data access. Storm provides the outer core: a stream processing system.

The best thing about building a real-time big data solution is designing the architecture. It's like playing with Legos and putting the pieces together that best fit your use case. What will your architecture look like?

Written by Shane Johnson, Senior Product Marketing Manager, Couchbase

Couchbase Server by Couchbase

CLASSIFICATION: Document and Key-Value Store with Integrated Caching

A high-performance distributed database with integrated caching, cross data center replication, mobile data access, and Hadoop integration certified by Cloudera and Hortonworks.

STORAGE MODEL: Document, Key-Value, Distributed cache
ACID TRANSACTIONS SUPPORTED: Isolation (Locking, Compare and Swap (CAS))
LANGUAGE DRIVERS: Java, .NET, Node.js, Go, PHP, C, Python, Ruby
INDEXING CAPABILITIES: Incremental MapReduce
REPLICATION: Synchronous, Asynchronous
BUILT-IN TEXT SEARCH? Via ElasticSearch and Solr

CASE STUDY: PayPal integrated Couchbase Server, Storm, and Hadoop to build the foundation of its real-time analytics platform. Clickstream and interaction data from all channels is pushed to the platform for real-time analysis. The data is pulled into a stream processor, Storm, for filtering, enrichment, and aggregation. After the data is processed, it's written to Couchbase Server, where it's accessed by rich visualization tools. With these tools, PayPal can monitor all traffic in real time. In addition, PayPal leverages views in Couchbase Server to perform additional aggregation and filtering. Finally, the data is exported from Couchbase Server to Hadoop for offline analysis.

CUSTOMERS: AOL, eBay, LivePerson, PayPal, AT&T, LinkedIn, Neiman Marcus

BLOG: blog.couchbase.com  TWITTER: @couchbase  WEBSITE: couchbase.com

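The profile above lists Compare and Swap (CAS) as one of Couchbase Server's isolation mechanisms. The contract can be sketched generically in Python (a toy version-check store, not Couchbase's client API):

```python
class CasStore:
    """Each value carries a version; a write succeeds only when the caller's
    expected version still matches, which is the compare-and-swap contract."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def get(self, key):
        return self.data[key]  # (version, value)

    def set(self, key, value, expected_version=None):
        current_version = self.data.get(key, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            return False  # someone wrote in between; re-read and retry
        self.data[key] = (current_version + 1, value)
        return True

store = CasStore()
store.set("counter", 0)

version, value = store.get("counter")
assert store.set("counter", value + 1, expected_version=version)      # first writer wins
assert not store.set("counter", value + 9, expected_version=version)  # stale version loses
print(store.get("counter"))  # (2, 1)
```

Compared with pessimistic locking, this optimistic scheme never blocks readers; the cost is that a losing writer must retry, which works best when conflicts are rare.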


Firebase, Meteor & Cognito: Just Enough Database for a Mobile World

by Ben Teese

Everybody is building mobile apps these days, both natively and for the web. Part of the value of almost any non-trivial app relates to its ability to store data for a user. This can range from something as obvious as a to-do list app, whose entire value-proposition is its ability to remember things for a user, to less obvious cases like mobile games, which may want to store game state or high scores for the user.

Rather than just storing this data on a single device, users also now increasingly expect apps to make their data quickly and easily accessible across all of their devices. Furthermore, when they have multi-device access, they don't want to have to worry about whether internet connectivity is missing or intermittent. The app should be able to store it locally and then synchronize with the cloud later.

As soon as remote data storage becomes a requirement for an app, a whole bunch of new back-end infrastructure needs to be set up by the developer. Even with the rise of cloud infrastructure, this is still a lot of work:

- You need to pick app server and data store software, then run a bunch of wiring code to pipe data between the client and database.
- If you want offline data access to work, you need to build local storage into your client and get the local-remote syncing logic right.
- If you want real-time data syncing between clients, you need to introduce real-time web protocols, or simulate them using techniques like long-polling.
- Finally, to secure your users' data so that only they can see it, you need to either build your own infrastructure to securely store and verify user credentials, or integrate with services like Facebook or Twitter to allow people to use their existing credentials.

In this article, we'll look at three platforms that aim to give developers a quick and easy way to implement the common mobile app use case of storing stuff for users across devices. All of these frameworks can run in offline mode and offer mechanisms for automatic syncing. They all also have the ability to integrate with OAuth-based identity services.

It's important to understand that these platforms are not relational databases and thus do not offer features like JOIN queries or full ACID transactions. Instead, their goal is to provide just enough data storage capabilities for a mobile app to satisfy the storage needs of its users. However, they each have their own limitations and trade-offs.

Firebase

The first platform we'll cover (and probably the best known) is Firebase, an API for storing and syncing data. Recently acquired by Google, Firebase provides client-side libraries for Android, iOS, all major JavaScript browser frameworks, Node.js, and Java. These libraries give clients impressive real-time data-synchronization capabilities. The Firebase platform is essentially a fully-hosted service that stores JSON in a tree accessed via RESTful URLs. One thing to note is that when you load a node in the tree, you get all of the data under that node. Consequently, to avoid downloading too much stuff to the client, it's important to minimize the depth of the tree.

Firebase allows data to be shared between users. To restrict how data can be accessed by particular users, it provides a server-side rules language. However, it's important to understand that Firebase has no application layer by default: you essentially go directly from your client to the database. Because of this, the rules language also includes capabilities for doing data validation.

Firebase has simple querying capabilities. However, it's important to configure indexing rules on query criteria to keep performance manageable. Queries that you run with Firebase are live: if the data underlying the query changes after the query has been run, then the query results will automatically update.

Meteor

Meteor is a platform for building web and mobile apps. Like Firebase, it has very impressive data synchronization capabilities, including support for live queries. However, Meteor goes even further by allowing you to write code that can be run on both the client and the server side. This gives you a great deal of flexibility from a security and data validation perspective. The catch is that all development


has to be done with JavaScript, HTML, and CSS. If you want to deploy your app to an app store, the best you can do is package it all up inside a native wrapper. By default, Meteor uses MongoDB as a data store, which basically means that you get Mongo's query capabilities. There is also early support for Redis and plans for other databases in the future.

It is very easy to get up and running with Meteor, as it bundles both an application server container and a local Mongo instance out of the box. It also provides command-line tools for automatically packaging your client and server-side code and deploying it to the cloud with a hosted app server and datastore. This saves you from having to set up a complex build and deployment process yourself. The Meteor platform is comprised of a number of libraries and tools. While they can be used in isolation, they work best together. For example, Meteor's live-updating view library (Blaze) is designed to integrate seamlessly with its remote data syncing capabilities.

One downside of this integration is that, to get up and running quickly, you have to learn the framework in its entirety, putting aside any skills you might already have with JavaScript MVC frameworks like AngularJS, Ember.js, or React. Some work has been done integrating Meteor with AngularJS to get the best of both worlds, but it is only at an early stage.

Cognito

Cognito is a new offering from Amazon Web Services (AWS) that provides a quick and easy way to authenticate users and store data for them. It integrates tightly with AWS's existing identity management infrastructure, but can also be used with other OAuth providers. Like Firebase, clients go directly to the database rather than through an application layer. To make this secure, no data can be shared between users. Any additional data validation you do will have to be duplicated between different client types, as [...] and remote storage. However, in contrast to Meteor and Firebase, developers have to manually trigger when they want synchronization to happen, and handle any conflicts themselves.

Cognito's data storage capabilities are very basic. An app can have multiple datasets, each of which can be considered roughly equivalent to a table. Keyed records can be inserted into each dataset. However, you can only look up records by their key; any more sophisticated querying would require a full scan of the dataset. That said, if Cognito's data-storage capabilities are not enough, you can also use it to connect to other AWS services like S3 or DynamoDB. However, you lose its offline and synchronization capabilities when you do this.

Which One You Might Use

Which of these frameworks you use depends very much on your storage needs. If you have really simple requirements and you're looking specifically for user authentication as a service, Cognito should be enough, especially if you are already familiar with the Amazon ecosystem. However, its inability to query or share data between users will be a deal-breaker for many scenarios.

For the majority of use cases, Firebase is probably a better fit, as it gives you a far more powerful datastore. The biggest risk with Firebase is that your security and validation requirements won't be met by Firebase's rules mechanism. If that's the case, you could start using Firebase in your own app server, or move onto a full-stack framework like Meteor. The natural trade-off of the full-stack approach is that you get more locked in to the specifics of the platform. In the case of Meteor, this means you have to fully embrace JavaScript and the Meteor way of doing things.

Regardless of which you pick, all of these platforms herald a shift away from the traditional three-tier, request-response approach to storing and accessing user data.
there is no shared server to put it on. one degree or another, they take care of the minutiae of
data storage, ensuring our user data is instantly available
Interestingly, Cognito has gone with a native-first everywhere, all the time.
approach to its client-side libraries. Full support for
both the identity and data synchronization components Ben Teese is a senior consultant at Shine

of Cognito is available for iOS and Android, but the Technologies. He spent 10 years as a back-end
Java developer before moving into front-end
synchronization component for the Javascript libraries
development with Rails and, more recently,
is still only in developer preview mode. These libraries Javascript SPAs. He spends most of his time building
provide simple support for syncing data between local mobile web apps and, occasionally, iOS apps.

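Cognito's storage model as described in the article (datasets of keyed records, with cheap lookup by key and everything else a full scan) can be sketched in a few lines of Python. This is an illustrative toy, not the AWS SDK; all class and variable names here are hypothetical.

```python
# Sketch of a Cognito-style dataset: records are looked up by key in O(1),
# while any richer query degenerates into a full scan of the dataset.
# Illustrative only; the real Cognito client libraries differ.

class Dataset:
    """Roughly equivalent to a table: a bag of keyed records."""

    def __init__(self, name):
        self.name = name
        self._records = {}           # key -> record (a dict of values)

    def put(self, key, record):
        self._records[key] = record  # insert or overwrite a keyed record

    def get(self, key):
        return self._records.get(key)   # the only direct lookup: by key

    def scan(self, predicate):
        # "More sophisticated querying" requires touching every record.
        return [r for r in self._records.values() if predicate(r)]


profiles = Dataset("profiles")
profiles.put("user-1", {"name": "Ada", "plan": "free"})
profiles.put("user-2", {"name": "Ben", "plan": "pro"})

print(profiles.get("user-2")["name"])    # key lookup: prints Ben
print(profiles.scan(lambda r: r["plan"] == "pro"))   # full scan
```

The point of the sketch is the asymmetry: `get` stays fast no matter how large the dataset grows, while `scan` cost grows with every record, which is why the article suggests reaching for S3 or DynamoDB when querying needs outgrow this model.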


Stop tying your new application to 30+ year old technology.
It's time to try a database as innovative as you are.

NuoDB's scale-out SQL database offers:
Scale-Out Performance
Continuous Availability
Geo-Distribution

Download the NuoDB Community Edition for free today!

www.nuodb.com/download

© 2015 NuoDB, Inc., all rights reserved.


sponsored opinion

When to shard? Or, why shard at all?

Sharding has enabled companies to scale out a traditional SQL database. But not without consequence: the introduction of sharding has also delivered fragility, application complexity, and operational difficulty.

The ideal scenario starts with the database on one machine, and scales out as you need additional capacity, cache, parallelism, or support for more clients. But a traditional SQL database makes this impossible.

But what if you revisit the entire architecture of the relational database? By divorcing processing power requirements from data replication, you can provide a system that scales horizontally; that is always consistent; replicates data correctly and consistently; and in the face of failure, always fails correctly.

You still get the view of a single logical database, regardless of how many hosts you're running on, processes you have going, or datacenters you're running in. Since you're still programming to a single logical database, your developers are free to write their applications without having to think about operational concerns. And you can still maintain ACID compliance.

Plus, your operations people are free to scale the database however they need to, to take advantage of their resources. Then, when you need to bring things online on demand, or when you no longer need additional resources and you want to shed them, you can do that without affecting the application.

"The modern datacenter requires an elastically scalable DBMS that can quickly add new machines to a running database."

To view a video explaining how to build a scalable application without sharding, visit www.nuodb.com/no-shard.

Written by Michael Waclawiczek, VP of Marketing and Operations, NuoDB

CLASSIFICATION: NewSQL Database

NuoDB by NuoDB
Scale-out SQL database, on-demand scalability, geo-distributed deployment, peer-to-peer distributed architecture.

storage model: Relational, Key-Value, Graph
language drivers: C, C++, C#, Go, Java, Node.js
ACID isolation: Serializable/linearizable
transactions supported: Arbitrary multi-statement transactions on a single node
indexing capabilities: Primary Key
replication: Asynchronous
built-in text search? No

case study: UAE Exchange, headquartered in Abu Dhabi, offers money transfer, foreign currency exchange, remittance, and related financial services. The company is a major player in these segments with over 725 branches in 32 countries serving 7.9 million customers. UAE Exchange handles 6.1% of the world's total remittances. "One of the key elements in NuoDB, which was critical for us, was its ability to scale out/in on demand based on the application load. The ability to work across multiple datacenters, ease of provisioning interfaces and security out-of-the-box have all impressed us. When we were evaluating new database technologies, our primary aim was to ensure a low learning curve for our teams and a faster ramp up on leveraging the selected platform. NuoDB is ACID compliant and supports the ANSI SQL standard as well as stored procedures. These factors played a crucial role in the adoption of NuoDB." (Jayakumar V, Senior Manager, IT, UAE Exchange)

Customers: Dassault Systemes, Kodiak Networks, EnterpriseWeb, UAE Exchange, Palo Alto Networks, Auto Zone, iCims, Zombie

BLOG nuodb.com/techblog  twitter @nuodb  WEBSITE nuodb.com



The State of the Storage Engine
by Baron Schwartz

Readers of this guide already know the database world is undergoing rapid change. From relational-only, to NoSQL and Big Data, the technologies we use for data storage and retrieval today are much different from even five years ago.

Today's datasets are so large, and the workloads so demanding, that one-size-fits-all databases rarely make much sense. When a small inefficiency is multiplied by a huge dataset, the opportunity to use a specialized database to save money, improve performance, and optimize for developer productivity and happiness can be very large. And today's solid-state storage is vastly different from spinning disks, too. These factors are forcing fundamental changes for database internals: the underlying algorithms, file formats, and data structures. As a result, modern applications are often backed by as many as a dozen distinct types of databases (polyglot persistence). These trends signal significant, long-term change in how databases are built, chosen, and managed.

"Most companies can afford only one or two proper in-depth evaluations for a new database."

Textbook Architectures Lose Relevance

Many of today's mature relational databases, such as MySQL, Oracle, SQL Server, and PostgreSQL, base much of their architecture and design on decades-old research into transactional storage and relational models that stem from two classic textbooks in the field, known simply as Gray & Reuter and Weikum & Vossen. This textbook architecture can be described briefly as having:

Row-based storage with fixed schemas
B-Tree primary and secondary indexes
ACID transaction support
Row-based locking
MVCC (multi-version concurrency control) implemented by keeping old row versions

But this textbook architecture has been increasingly questioned, not only by newcomers but by leading database architects such as Michael Stonebraker. Some new databases depart significantly from the textbook architecture with concepts such as wide-row and columnar storage, no support for concurrency at all, and eventual consistency. It's worth noting that although NoSQL databases represent obvious changes in the data model and language (how developers access the database), not all NoSQL databases innovate architecturally. Coping with today's data storage challenges often requires breaking from tradition architecturally, especially in the storage engine.

Log-Structured Merge Trees

One of the more interesting trends in storage engines is the emergence of log-structured merge trees (LSM trees) as a replacement for the venerable B-Tree index. LSM trees are now about two decades old, and LevelDB is perhaps the most popular implementation. Databases such as Apache HBase, Hyperdex, Apache Cassandra, RocksDB, WiredTiger, and Riak use various types of LSM trees.

LSM trees work by recording data, and changes to the data, in immutable segments or runs. The segments are usually organized into levels or generations. There are several strategies, but the first level commonly contains the most recent and active data, and lower levels usually have progressively larger and/or older data, depending on the leveling strategy. As data is inserted or changed, the top level fills up and its data is copied into a segment in the second level. Background processes merge segments in each level together, pruning out obsolete data and building lower-level segments in batches. Some LSM tree implementations add other features such as automatic compression, too. There are several benefits to this approach as compared to the classic B-Tree approach:

Immutable storage segments are easily cached and backed up
Writes can be performed without reading first, greatly speeding them up
Some difficult problems such as fragmentation are avoided or replaced by simpler problems
Some workloads can experience fewer random-access I/O operations, which are slow



There may be less wear on solid-state storage, which can't update data in-place
It can be possible to eliminate the B-Tree write cliff, which happens when the working set no longer fits in memory and writes slow down drastically

Although many of the problems with B-Tree indexes can be avoided, mitigated, or transformed, LSM tree indexes aren't a panacea. There are always trade-offs and implementation details. The main set of trade-offs for LSM trees are usually explained in terms of amplification along several dimensions. The amplification is the average ratio of the database's physical behavior to the logical behavior of the user's request, over the long-term. It's usually a ratio of bytes to bytes, but can also be expressed in terms of operations, e.g. number of physical I/O operations performed per logical user request.

Write amplification is the multiple of bytes written by the database to bytes changed by the user. Since some LSM trees rewrite unchanging data over time, write amplification can be high in LSM trees.

Read amplification is how many bytes the database has to physically read to return values to the user, compared to the bytes returned. Since LSM trees may have to look in several places to find data, or to determine what the data's most recent value is, read amplification can be high.

Space amplification is how many bytes of data are stored on disk, relative to how many logical bytes the database contains. Since LSM trees don't update in place, values that are updated often can cause space amplification.

In addition to amplification, LSM trees can have other performance problems, such as read and write bursts and stalls. It's important to note that amplification and other issues are heavily dependent on workload, configuration of the engine, and the specific implementation. Unlike B-Tree indexes, which have essentially a single canonical implementation, LSM trees are a group of related algorithms and implementations that vary widely.

There are other interesting technologies to consider besides LSM trees. One is Howard Chu's LMDB (Lightning Memory-Mapped Database), which is a copy-on-write B-Tree. It is widely used and has inspired clones such as BoltDB, which is the storage engine behind the up-and-coming InfluxDB time-series database. Another LSM alternative is Tokutek's fractal trees, which form the basis of high-performance write- and space-optimized alternatives to MySQL and MongoDB.

Evaluating Databases With Log-Structured Merge Trees

No matter what underlying storage you use, there's always a trade-off. The iron triangle of storage engines is this: You can have sequential reads without amplification, sequential writes without amplification, or an immutable write-once design. Pick any two.

Today's emerging Big Data use cases, in which massive datasets are kept in raw form for a long time instead of being summarized and discarded, represent some of the classes of workloads that can potentially be addressed well with LSM tree storage (time-series data is a good example). However, knowledge of the specific LSM implementation must be combined with a deep understanding of the workload, hardware, and application.

Sometimes companies don't find a database that's optimized for their exact use case, so they build their own, often borrowing concepts from various databases and newer storage engines to achieve the efficiency and performance they need. An alternative is to adapt an efficient and trusted technology that's almost good enough. At VividCortex, we ignore the relational features of MySQL and use it as a thin wrapper around InnoDB to store our large-scale, high-velocity time-series data.

Whatever road you take, a good deal of creativity and experience is required from architects who are looking to overhaul their applications' capabilities. You can't just assume you'll plug in a database that will immediately fit your use case. You'll need to take a much deeper look at the storage engine and the paradigms it is based on.

Baron Schwartz is the founder and CEO of VividCortex, the best SaaS for seeing what your production database servers are doing. He is the author of High Performance MySQL and many open-source tools for MySQL administration. He's also an Oracle ACE and frequent participant in the PostgreSQL community.

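The leveled write path the article describes (hot data in a small top level that spills into larger, merged levels below) can be sketched as a toy in Python, together with a crude write-amplification ratio. This is a deliberately simplified illustration, not the design of any particular engine.

```python
# Toy LSM tree: writes go to an in-memory top level; when it fills, it is
# flushed as an immutable sorted segment, and segments are merged downward,
# pruning obsolete versions. Tracks logical vs. physical writes to show
# write amplification. A simplified model, not a real storage engine.

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # level 0: most recent, mutable data
        self.levels = []              # lower levels: immutable sorted segments
        self.memtable_limit = memtable_limit
        self.logical_writes = 0       # entries written by the user
        self.physical_writes = 0      # entries written to "disk", incl. rewrites

    def put(self, key, value):
        self.memtable[key] = value    # no read-before-write needed
        self.logical_writes += 1
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        segment = dict(sorted(self.memtable.items()))
        self.physical_writes += len(segment)
        self.levels.insert(0, segment)   # newest segment first
        self.memtable = {}
        self._compact()

    def _compact(self):
        # Merge all segments into one, keeping only the newest version of
        # each key; real engines do this incrementally, level by level.
        if len(self.levels) > 2:
            merged = {}
            for seg in reversed(self.levels):    # oldest first
                merged.update(seg)               # newer values win
            self.physical_writes += len(merged)  # unchanged data is rewritten
            self.levels = [dict(sorted(merged.items()))]

    def get(self, key):
        if key in self.memtable:                 # may need to look in several
            return self.memtable[key]            # places: read amplification
        for seg in self.levels:
            if key in seg:
                return seg[key]
        return None


lsm = ToyLSM()
for i in range(12):
    lsm.put(f"k{i % 6}", i)       # overwrite keys to force flushes and merges
print(lsm.get("k5"), lsm.physical_writes / lsm.logical_writes)   # 11 1.5
```

Even in this toy, the ratio of physical to logical writes exceeds 1.0 because compaction rewrites data that never changed, which is exactly the write amplification the article warns about.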


THE MIXOLOGY OF DATABASES

With so many different flavors of databases, it's hard to tell what the ingredients are for each one. Here we've mapped a portion of the database landscape onto a menu of mixology! Take your pick of Relational, NoSQL, NewSQL, or Data Grids. And remember, you can pick more than one! That's why they say polyglot persistence is so tasty. Cheers!

RELATIONAL (as represented by liquor glassware): Oracle, SAP HANA, PostgreSQL, IBM DB2, SQL Server, MySQL, SQLite

NOSQL (as represented by coffee and espresso glassware): Cassandra, DataStax, Couchbase, CouchDB, Aerospike, HBase, DynamoDB, MongoDB, Riak, Redis, RavenDB, Neo4j, Titan

NEWSQL (as represented by beer containers): NuoDB, GemFire XD, FoundationDB, VoltDB, Datomic, MemSQL

DATA GRIDS (as represented by wine & champagne glassware): GridGain, Hazelcast, GemFire

[The original infographic also tags each database with attributes such as storage model (key-value, document, wide column, graph, in-memory, data grid, Hadoop) and cloud service availability; those per-database tags do not survive text extraction.]


3 Reasons Why It's Okay to Stick with SQL
by lukas eder

[Figure: Historical Cost of Computer Memory & Storage, in USD per MB, 1985-2015, for small drives, flash drives, DIMMs, SIMMs, and ICs on boards [1]]

The past decade has been an extremely exciting one in all matters related to data. We have had:

An ever increasing amount of data produced by social media (once called Web 2.0)
An ever increasing amount of data produced by devices (a.k.a. the Internet of Things)
An ever increasing number of database vendors that explore new models

Marketers, publishers, and evangelists coined the terms "Big Data" and "NoSQL" to wrap up the fuzzy ideas around persistence technology that have emerged in the last decade. But how big is Big Data? Do you do Big Data? Do you need to do Big Data?

The truth is you don't. You can keep using the same architecture and relational databases you've always had, for these three reasons.

1. RAM prices are crumbling

For a while, economists and researchers have been predicting the end of Moore's Law. With single-CPU processing power reaching the limits of what's possible in non-quantum physics, we can no longer scale up processing power cheaply. Multi-core processors have emerged, and vendors have encouraged sharing data across several machines for a long time.

But do we really need to distribute data onto several machines and deal with the struggles introduced by the CAP theorem in distributed environments? The truth is, you may not have to pay this price just yet!

While CPUs aren't getting faster anymore, RAM continues to get cheaper. Today, a database server that can keep your whole database in memory constantly is fairly affordable. Several vendors of commercial RDBMS have implemented in-memory computing capabilities and column stores in their standard products: Oracle 12c, SQL Server 2012, SAP HANA, just to name a few. One of the most interesting examples of a classic application that runs on a single centralized RDBMS is the Stack Exchange platform, with around 400 GB of RAM and 343M queries per day [2].

This architecture has been the preferred way for any software system that profited from the free ride offered by Moore's Law in the past: Just throw more hardware at it, and it'll run again. The nightmare of having to distribute such a system is postponed yet again.

In other words: The primary source of contention, the disk, is no longer the bottleneck of your application, and you can scale out on a single machine.

2. SQL is the best language for querying

Winston Churchill may have been known for saying something like: "SQL is the worst form of database querying. Except for all the other forms."

QUEL may have been the better language to start with from a technical perspective, but due to a cunning move by Oracle and IBM in the early 80s, SQL and the ANSI/ISO SQL standard won the race. One of the most fundamental criticisms of SQL was that it is not really a relational language. This was recognized as early as 1983 by none other than C.J. Date in his paper "A Critique of the SQL Database Language," but it was too late [3]. SQL had already won the race. Why?

SQL has been designed to be used primarily by humans to create ad-hoc queries. It is remarkable that the SQL language is pretty much the only declarative language



that has survived and maintained a continuous level of popularity.

Declarative thinking is hard enough when most people prefer to instruct machines in an imperative style. We can see attempts at creating such languages not only in databases, but also in general-purpose languages, such as Java. Specifically, Java EE is full of annotations that are used to tag types and members for an interpreter to pick up those tags and inject behavior. Once you combine JPA, JAXB, JAX-RS, JAX-WS, EJB, perhaps even Spring, and many other tools in a single application, you will immediately see how hard it is to still understand what all those declarative items mean, and how and when they interact. Usually, the behavior is not specified or ill-specified.

SQL is a comparatively simple declarative language, with a lot of limitations. This is good, because the lack of language features allows for implementations to quickly meet the SQL standard's requirements. In fact, this generality of the SQL language has even led to a variety of NoSQL databases adopting SQL or imitating SQL closely:

SQL on Hadoop
Phoenix for HBase
Cassandra's CQL resembles SQL
JCR-SQL2 for content repositories

Not to mention, of course, the numerous actual SQL database implementations out there.

While popular ORMs help abstract away the tedious work of implementing CRUD with SQL, there has been no replacement thus far for SQL when it comes to querying and bulk data processing; it doesn't matter if your underlying storage is relational, or non-relational.

In other words: The non-relational-ness of SQL is one of the language's primary advantages, allowing technologies to incorporate other aspects of data processing, such as XML, OLAP, JSON, column stores, and many more.

3. Procedural languages are ideal for computing

A third advantage that relational databases have over non-relational stores is the tight integration of procedural languages with the SQL language. If scaling vertically is your strategy, you want to leverage as many cores as possible from your single database server; therefore, you want to run the computations as closely as possible to the data in your server's RAM.

For many years, enterprise software architects have attempted to move all business logic into a middle tier, preferably written in J2EE and later in Java EE, helping large software vendors sell extremely expensive middleware components as a complement to their already expensive databases.

This may have been reasonable in the past since databases weren't as powerful 20 years ago as they are today. Today, however, commercial SQL optimizers are extremely powerful. All you need to get your SQL up to speed is appropriate indexing. If you decide to use a platform that doesn't allow you to use SQL, it will greatly increase your total cost of ownership for data processing logic. It is very hard to implement optimal execution plans manually, in the form of algorithms in your general-purpose imperative language, such as Java. There are ways around this, however, with APIs like jOOQ, which is a product our company works on that seamlessly integrates SQL or procedure calls into Java.

For everything that is not feasible with SQL, you can use your relational database's procedural language, which allows you to implement more complex and more explicit algorithms directly in the database, keeping business logic very close to the data. Not only can such languages access the data very quickly, they can also interact with SQL for bulk data processing.

In other words: Moving some computation-heavy logic into the database may be your best choice for a variety of use cases. I predict that more tech firms will start building around in-memory databases and this will inevitably translate into a higher popularity for procedural SQL languages like Transact-SQL (T-SQL), PL/SQL and SQLScript.

RDBMS have won in the past & will win again

"When all you have is a hammer, every problem will look like a nail."

There couldn't be a better metaphor for RDBMS. They're hammers. They're Swiss-Army-Hammers with millions of tools, and most developers can go as far as they need to with just that hammer. So, keep scaling up. With the hammer: SQL.

[1] http://www.jcmit.com/mem2014.htm
[2] http://stackexchange.com/performance
[3] http://dl.acm.org/citation.cfm?id=984551

Lukas Eder is the founder and CEO of Data Geekery GmbH, located in Zurich, Switzerland. Data Geekery is a leading innovator in the field of Java/SQL interfacing. Their flagship product jOOQ is offering frictionless integration of the SQL language into Java.

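Eder's two central claims about SQL (you state what you want, not how to compute it, and "all you need to get your SQL up to speed is appropriate indexing") can be illustrated with Python's built-in sqlite3 module, standing in here for the commercial engines he names. The table and index names are invented for the example.

```python
import sqlite3

# Declarative querying: we state *what* we want; the optimizer picks the
# execution plan. Adding an index changes the plan, not the query text.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
con.executemany("INSERT INTO orders (customer, total) VALUES (?, ?)",
                [("acme", 120.0), ("acme", 80.0), ("globex", 200.0)])

# An ad-hoc aggregate query, written for humans, not for the machine:
rows = con.execute("""
    SELECT customer, SUM(total)
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()
print(rows)   # [('acme', 200.0), ('globex', 200.0)]

# Tune by indexing, without rewriting the query itself:
con.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?", ("acme",)
).fetchall()
print(plan)   # the plan now mentions idx_orders_customer
```

The same division of labor applies to the RDBMS products discussed in the article: the query text describes the result set, and the optimizer (guided by indexes and statistics) chooses the access path.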


Every company wants to put their existing data to work.

Actually doing it is a different story.

That's where Mammoth Data comes in. We're Big Data consultants, change managers and risk mitigators. We make data-driven decision making possible.

The Leader in Big Data Consulting

www.mammothdata.com | (919) 321-0119
sponsored opinion

Integrating Hadoop in a Business Intelligence Project

A business intelligence project is not about technology: it is about the people using the technology, how they communicate, and what they want to do.

Fueling an Executive Strategy
Executive strategy combined with continued involvement is absolutely indispensable. First, the organization needs to figure out exactly what kind of intelligence they want to gather. If operational excellence is the primary goal, then more time will be focused on measuring costs and efficiencies; but if you are trying to seek customer intimacy, the project will focus on customer intelligence, social networks, and recommendations.

Connecting the Dots
Team interviews conducted by an outside party are essential. This enables your team to explain the ins and outs of the company without assuming prior knowledge. These interviews usually begin rather open-ended and end with focused questions to understand how each individual works with data, how they make decisions, and how that fits into the overall executive strategy.

Taking the BS out of BI
Frequently these projects are implemented with BI tools like Pentaho or Tableau; however, some operational systems do not keep history, and data warehouses are often too focused to provide for different uses of the data. Sometimes the solution is creating a new way to make use of the existing systems. Other times it means turning to Hadoop or, more particularly, Hive to integrate the data and provide a new view of it in a data lake or similar structure.

Written by Elizabeth Edmiston, VP of Engineering, Mammoth Data

Mammoth Data
Mammoth Data is a data architecture consulting company specializing in new data technologies, distributed computing frameworks, like Hadoop, and NoSQL databases. By combining NoSQL and Big Data technologies with a high-level strategy, Mammoth Data is able to craft systems that capture and organize unstructured data into real business intelligence.

case study: Mammoth Data is assisting a major utility company in their smart grid analytics project. The project involves integrating data from disparate sources into a Hadoop Hive cluster. This involved getting and analyzing feeds, designing Pig-based load processes with Oozie, architecting a Data Lake and assisting the customer in integrating BI and Analytics tools.

services: Creation of an executive data strategy; Analysis, design and implementation of Data-Driven company strategies; Business Intelligence project analysis, design and development; Initial install of a new technology; Architect a data lake or data warehouse; Implementation of new technology or migration from a legacy system

technologies: Hadoop, Cassandra, Couchbase, MongoDB, Neo4j
verticals: Biotechnology, Education/Non-profits, Finance, Healthcare, Retail, Technology, Utilities
partners: Cloudera, Cloudbees, Couchbase, DataStax, Hortonworks, MongoDB, Neo Technology

email info@mammothdata.com  twitter @mammothdataco  WEBSITE mammothdata.com



input and confirm that those inputs produce the
expected messages/output. These tests are useful
but have limited application. If you want to examine

How to Manage
rare bugs or improve the latency distribution
of your state machine, you need a far longer

Deterministic
recordingideally every event for a week.

Although replaying the input data of a system can


reproduce realistic tests, you want to replay the

Data-Driven
output data of a system to reproduce the transient
decisions made in production, like which event was
processed next, or what the actual spacing of those
events was down to the microsecond. Timestamps

Services
can be a guide, but are not reliable. If you want to
upgrade a system in production, replaying from
the output ensures the new system will honor the
decisions already made, even if the new system
would have made different choices.

by Peter L awrey
Recording every event in and
out of a system can help build
data-driven tests, provide a
H
ow does the journaling of all data inputs and
outputs for key systems help make it easier simple recovery mechanism, and
to test, maintain, and improve performance?
What are some of the practical considerations of handle micro-bursts gracefully.
using this approach? One way to model a complex
event-driven system is as a series of functions or
state machines with the data inputs and a data The Penalty of Recording Everything in
output. Recording and replaying data for such Production
state machines can help you test these systems Making long recordings from a production system
for reliable behavior in terms of data output and can raise concerns about the use of disk space and
timings. It can also make restarting and upgrading also slowing production response time. However,
a running system easier. decreasing costs for disk space and the use of
memory-mapped files can make extensive, detailed
recording of every event practical. Output data can
include meta information such as the history of
INPUT 1 an event with timing of key stages in the process.
STATE INPUT
OUTPUT Systems that have introduced this recording
DOWN-
MACHINE JOURNAL technique successfully see lower latencies, even
STREAMS
INPUT 2 when recording every event in production. This
is one of the best ways to ensure you have a
deterministic latency profile for high-frequency
trading systems.
Using Recorded Inputs and Outputs to Perform
Extensive Tests Reproducing and Fixing Obscure Issues with
To create a small test for a state machine you can Confidence
record real examples of events/messages/generic One of the challenges of fixing production bugs is

22 dzones 2015 guide to database and persistence management


Reproducing and Fixing Obscure Issues with Confidence

One of the challenges of fixing production bugs is reproducing them in development. When you record all the inputs, you can just debug and find out why the millionth event of the day was processed the way it was. You can fix a bug with confidence by showing that the root cause has been fixed, instead of finding a symptom of the bug, fixing that, and hoping it was the root cause.

Testing from recorded data allows you to test systems in isolation on a test server or development PC. The test doesn't depend on a database that could be either inaccessible (as it is in production) or regularly changing (making exact reproduction very difficult or impossible). Non-reproducible inputs, such as user inputs, can be replayed in an automated fashion with realistic timings.

Journalling, Low Latency, and Resilience

A key requirement for many systems is the ability to restart on a failure. Being able to quickly replay or reload all the data driving a system is important to having acceptable restart times. By definition, deterministic software should fail the same way every time, so if there is a software bug, the solution is to reproduce this in testing, and release a fix. You have to have confidence that no decision already made will be changed by replaying the outputs already given by the old system. To test that replaying from the output works, you can chain the system to another instance of itself. State machine A produces an output that state machine A2 reads in replay mode. You can determine that the state machines have the same state at any point in a unit test.

Some services can be horizontally distributed easily. They can exploit natural parallelism in the problem they are trying to solve by using many CPUs or machines. Some back office services need operations that are serialized, or have a poor degree of parallelism. To achieve high performance in this environment, you need a low latency system, ideally low latency with a high degree of consistency. These systems often need some form of redundancy in the event of a hardware failure.

How do we combine low latency, high availability, and persistence? Often you need to write to another device on the network, because waiting for data to write to the disk can be too slow. In this case, you trade disk latency for network latency. One of the key bottlenecks is the network round-trip time: the time it takes to send a message to a second machine and get a response back. One solution to this is to have the second machine process the message. Server A receives data that must be processed and then must send a response. The critical time is from request to response. We want to avoid a loss of input or output as much as possible.

[Figure: Server A receives the input data/requests, processes them, and sends the output data/response; inputs and outputs are replicated to and confirmed by Server B, which can take over both streams if A fails.]

[Figure: A variant in which Server A replicates the input to Server B and Server B does the processing; the output is replicated back to A, whose processor takes over only if B fails.]

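The chained-replay check described in the Journalling section above (state machine A writes an output journal that a second instance A2 consumes in replay mode) can be sketched as a unit-style test. The class and method names below are illustrative, not from any particular framework.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: A runs live and journals its outputs; A2 replays them.
class Accumulator {
    private long total;
    // Live mode: process an input and emit the new total as the output event.
    long onInput(long amount) { total += amount; return total; }
    // Replay mode: adopt the recorded output directly instead of recomputing.
    void onReplay(long recordedTotal) { total = recordedTotal; }
    long state() { return total; }
}

public class ChainedReplayTest {
    public static void main(String[] args) {
        Accumulator a = new Accumulator();
        List<Long> outputJournal = new ArrayList<>();
        for (long amount : new long[]{100, -30, 55}) {
            outputJournal.add(a.onInput(amount)); // A writes its outputs to a journal
        }

        Accumulator a2 = new Accumulator();
        for (long out : outputJournal) {
            a2.onReplay(out);                     // A2 reads A's outputs in replay mode
        }

        // The unit-test assertion: replaying the outputs must not change any
        // decision already made, so both machines agree on the final state.
        System.out.println(a.state() == a2.state());
    }
}
```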


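The journal itself can be sketched with the JDK's memory-mapped files, the mechanism the article relies on for cheap, detailed recording. A production system would use a purpose-built library such as OpenHFT's Chronicle Queue; the length-prefixed record format below is an illustrative assumption, not a real wire format.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Append events to a memory-mapped journal: each write is a plain memory
// store, and the OS flushes the pages to disk asynchronously.
public class MmapJournal {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("journal", ".dat");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE beyond the current size extends the file.
            MappedByteBuffer journal = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

            for (String event : new String[]{"NewOrder:42", "Cancel:42"}) {
                byte[] bytes = event.getBytes(StandardCharsets.UTF_8);
                journal.putInt(bytes.length);   // length-prefixed record
                journal.put(bytes);
            }

            journal.flip();                     // replay: read the records back
            while (journal.remaining() >= 4) {
                int len = journal.getInt();
                if (len == 0) break;            // zeroed region = end of journal (defensive)
                byte[] bytes = new byte[len];
                journal.get(bytes);
                System.out.println(new String(bytes, StandardCharsets.UTF_8));
            }
        }
    }
}
```

The same file can be mapped by a second process (or replicated to a second machine) to act as the consumer/replayer described in the text.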
One solution is to have Server A send the input to Server B, wait for the response, process the message, send the output to Server B, and wait for confirmation before sending the message on. In this case, you pay the RTT (Round-Trip Time) twice. A faster solution is to have Server B do the processing.

Server A receives an input event, writes it to a memory-mapped file, and sends it to Server B, which also writes it to a memory-mapped file. Server B processes the messages and creates the output, which is then written to a memory-mapped file on B and is replicated back to A. Lastly, Server A responds to the input event. At this point, there are two copies of the inputs and outputs on both machines, and you only have one RTT of latency.

The widely accepted best practice in Java is a 25 microsecond latency, 99% of the time, from an input on the network to an output. In C/C++, latencies are closer to 10 microseconds.

Keeping Flow Control as Light as Possible

When you have bursting data activity, you need some form of flow control to prevent systems from being overloaded or messages from being lost. One of the benefits of recording every input event is that you reduce the need for push back on the data producer. There is still flow control in the network, and the memory-mapped file places limits on how much uncommitted data you can have, but you now have a system that consumes data as fast as it can to an open-ended queue or journal, minimizing the need to ever push back on the producer.

Having light-touch flow control means you can timestamp your data more accurately (you get the data ASAP, not when you would like to get it), and you can replay your data without the real environment's flow control complicating timing reproduction significantly.

When you have a series of such systems, delays are exposed for what they are. A producer is very rarely slowed by a slow consumer, and you see clearly when the consumer couldn't keep up. If you use flow control, the producer conspires with the consumer to not make it look bad by giving it less work to do. You could record the flow control messages as well, but it is hard to estimate what the impact on your latency will be.

Monitoring the Behavior of Your System

Monitoring the performance of your system in a natural way, without impacting that performance, can be a challenge. While you can monitor the outputs in real time by reading them (writing as much data as you might need), performance results need to be very lightweight to be credible.

Writing a detailed latency monitoring system, like replay systems for testing, can be an afterthought. If you build the system around consumers constantly replaying data from producers, you have implemented a replay system already. If you include a history of system timings in these messages, this history naturally falls out at the end of your final output; it is there along with the message, so you can also determine the significance (or insignificance) of the output event. Building a tool to read this history and present it in real time can be surprisingly trivial.

Persisting Every Message is Valuable and Practical for Many Systems

Recording every event in and out of a system can help build data-driven tests, provide a simple recovery mechanism, and handle micro-bursts gracefully. The cost of implementing such a solution is falling and can lead to significant cost savings in development time and downtime for your key back-end systems.

Peter Lawrey is the lead developer of OpenHFT's Chronicle Queue. He has the most answers on StackOverflow for Java and the JVM and is the founder of the Performance Java Users Group. His blog, Vanilla Java, has had over 3 million views.



DIVING DEEPER INTO
DATABASE & PERSISTENCE MANAGEMENT

TOP TEN #DATABASE TWITTER FEEDS


@markuswinand @spyced @markcallaghan @antirez @tmcallaghan

@PeterZaitsev @pbailis @xaprb @lukaseder @DatabaseFriends

DZONE DATABASE ZONES

NoSQL Zone  dzone.com/mz/nosql
NoSQL (Not Only SQL) is a movement devoted to creating alternatives to traditional relational databases and excelling where they are weak. The NoSQL Zone is a central source for news and trends of the non-relational database ecosystem, which includes solutions such as Neo4j, MongoDB, CouchDB, Cassandra, Riak, Redis, RavenDB, HBase, and others.

SQL Zone  dzone.com/mz/sql
The SQL Zone is DZone's portal for following the news and trends of the relational database ecosystem, which includes solutions such as MySQL, PostgreSQL, SQL Server, NuoDB, VoltDB, and many other players. In addition to these RDBMSs, you'll find optimization tricks, SQL syntax tips, and useful tools to help you write cleaner and more efficient queries.

Big Data Zone  dzone.com/mz/big-data
The Big Data/Analytics Zone is a prime resource and community for Big Data professionals of all types. We're on top of all the best tips and news for Hadoop, R, and data visualization technologies. Not only that, but we also give you advice from data science experts on how to understand and present that data.

DZONE DATABASE REFCARDZ

JDBC Best Practices  bit.ly/1EbzQjm
JDBC Best Practices has something for every developer. JDBC was created to be the standard for data access on the Java Platform. This DZone Refcard starts off with JDBC basics, including descriptions for each of the 5 driver types. This is followed up with some advanced JDBC topics: Codeless Configuration, Single Sign-on with Kerberos, Debugging, and Logging.

MongoDB: Flexible NoSQL for Humongous Data  bit.ly/1EbAldh
MongoDB is a document-oriented database that is easy to use from almost any language. As of August 2014, MongoDB is by far the most popular NoSQL database. This Refcard is designed to help you get the most out of MongoDB.

Apache Cassandra  bit.ly/1yTuwN6
Apache Cassandra is a high performance, extremely scalable, fault tolerant (i.e., no single point of failure), distributed non-relational database solution. Cassandra combines all the benefits of Google Bigtable and Amazon Dynamo to handle the types of database management needs that traditional RDBMS vendors cannot support.

Essential Couchbase APIs  bit.ly/1B0CCBH
Couchbase is an open source NoSQL database optimized for interactive web and mobile applications. The data model is JSON-based, so it's familiar, efficient, and extremely flexible. This Refcard offers a snippet-rich overview of connection, storage, retrieval, update, and query operations on Couchbase databases from Java, Ruby, and .NET applications.

TOP 3 DATABASE WEBSITES
High Scalability  highscalability.com
Use the Index, Luke  use-the-index-luke.com
ODBMS Industry Watch  odbms.org/blog

TOP 4 DATABASE NEWSLETTERS
NoSQL Weekly  nosqlweekly.com
DB Weekly  dbweekly.com
Postgres Weekly  postgresweekly.com
MySQL Weekly  mysqlnewsletter.com



Data Migration Checklist

Data migration is the process of transferring data from one system to another, which can mean changing the storage type, data formats, database, application, or computer system. Most IT shops have to deal with data migration at some point, and making mistakes in a migration project can lead to a loss of data. This checklist is designed to help ensure that you don't miss any important steps in a general data migration project.

01. Assessment
☐ Establish a migration management team with key stakeholders
☐ Identify the data to be migrated
☐ Analyze the current data environment in comparison to the new environment
☐ Find out if any data format conversions will be necessary
☐ Determine how much data cleaning and manipulation is required
☐ Review data governance policies
☐ Project how much storage capacity is required for the migration and if any data growth is expected
☐ Determine any other required tooling, costs, risks, and time frames associated with the migration

02. Planning
☐ Build a migration plan and timetable with the migration management team, and include some flexibility
☐ Acquire or build any tools necessary for migration
☐ Establish which permissions consultants or CRM vendors should have when working with your data (if applicable)
☐ Map the old database schema to the new database and prepare any data migration software or custom scripts
☐ Decide whether data will be cleaned before or after migration
☐ Have a data backup system in place
☐ Notify users of any downtime
☐ Determine your organization's metrics for success in this migration
☐ Run tests with sample data to rehearse the data migration process
☐ Test administrators' knowledge of the new system's features (e.g. reporting, export)

03. Migration
☐ Revise migration plan if necessary
☐ Configure migration environment
☐ Identify any inconsistencies, missing fields, or fields that need consolidation, conversion, or parsing
☐ Perform backup or replication of the data to be migrated

04. Validation
☐ Test the data to ensure its integrity and full importation
☐ Verify that the database schema is correct
☐ Perform a rollback if data corruption occurs
☐ Prepare a report on the statistics related to the migration
☐ Business owners need to perform acceptance testing
☐ Check to see if your organization's goals for success were met
☐ Document the process steps if no prior documentation exists

Migration Software

Below are a few existing tools for data migration. You'll have to choose based on OS environments, original and target hardware platforms, database vendors, and database versions. Key features you should look for include rollback and migration speed.

Storage Migration Tools
- Rsync: Open source and provides fast incremental file transfer
- IBM Softek Transparent Data Migration Facility: Migrates data at the volume or block level, supports multiple platforms
- SanXfer: Migrates server data volumes and is optimized for cloud environments
- Hitachi Data Systems Universal Replicator: Provides storage array-based replication

Database Migration Tools
- Microsoft SQL Server Migration Assistant (SSMA): For migrating MySQL or Oracle databases to SQL Server
- MySQL Workbench Wizard: For easy database migration to MySQL
- Oracle SQL Developer: For migrating non-Oracle databases to Oracle
- EDB Migration Toolkit: For database migration to PostgreSQL
- Other database migration tools include Flyway, SwisSQL, ESF Database Migration Toolkit, and Ispirer's SQLways

Application Migration Software Tools
Application vendors normally provide custom tools to migrate data from one application to the other.

*Authored with Ayobami Adewole

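The validation step above (testing the data for integrity and full importation) can be partly automated by comparing source and target row sets. This sketch models the two databases as simple key-to-row maps; in practice you would fill them from queries against the old and new systems, and the sample rows here are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Compare source and target tables and report anything that did not
// survive the migration intact.
public class MigrationValidator {
    static List<String> findProblems(Map<String, String> source, Map<String, String> target) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            String migrated = target.get(e.getKey());
            if (migrated == null) {
                problems.add("missing: " + e.getKey());  // row lost in migration
            } else if (!migrated.equals(e.getValue())) {
                problems.add("mismatch: " + e.getKey()); // row corrupted or badly converted
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> source = new TreeMap<>();
        source.put("1", "alice");
        source.put("2", "bob");
        source.put("3", "carol");

        Map<String, String> target = new TreeMap<>();
        target.put("1", "alice");
        target.put("2", "BOB");   // corrupted; row "3" never arrived

        System.out.println(findProblems(source, target));
    }
}
```

An empty problem list feeds directly into the "goals for success were met" check; a non-empty one is the trigger for the rollback step.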


Solutions Directory

Notice to readers: The data in this section is impartial and not influenced by any vendor.

This directory of database management systems and data grids provides comprehensive, factual data gathered from third-party sources and the tool creators' organizations. Solutions in the directory are selected based on several impartial criteria, including solution maturity, technical innovativeness, relevance, and data availability. The solution summaries underneath the product titles are based on the organization's opinion of its most distinguishing features. Understand that not having certain features is a positive thing in some cases. Fewer unnecessary features sometimes translates into more flexibility and better performance.

CLASSIFICATION: In-Memory, KV
Aerospike Server by Aerospike
Aerospike was designed to make specific use of SSDs without a filesystem, and features advanced peer-to-peer clustering with automated management.
Storage model: Key-Value - SSD or Disk
Language drivers: C, C#, Go, Java, Node.js, Ruby
Transactions supported: Compare-and-set
Non-SQL query languages: AQL
ACID isolation: Read committed
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Based on Open Source
Built-in text search: No
Twitter: @AerospikeDB
Website: aerospike.com

CLASSIFICATION: In-Memory, NewSQL
Altibase HDB by Altibase
Altibase is a hybrid in-memory database, providing the speed of memory plus the storage of disk in one single, unified engine.
Storage model: Relational
Language drivers: C, C++, C#, Java, PHP
Transactions supported: Arbitrary multi-statement transactions on a single node
Non-SQL query languages: None
ACID isolation: Serializable/linearizable
Replication: Synchronous, Asynchronous
Indexing capabilities: Complex Indexing
Open source: No
Built-in text search: No
Twitter: @altibase
Website: altibase.com

CLASSIFICATION: Graph, Document, KV
ArangoDB by ArangoDB
ArangoDB features a microservice framework, optional transactions, and integrates with cloud operating systems like Mesosphere and Docker Swarm.
Storage model: Document, Key-Value, Graph
Language drivers: C#, Go, Java, Node.js, Ruby, Python
Transactions supported: Arbitrary multi-statement transactions on a single node
Non-SQL query languages: AQL
ACID isolation: Serializable/linearizable
Replication: Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Yes
Built-in text search: Yes
Twitter: @arangodb
Website: arangodb.com



CLASSIFICATION: Specialist Analytic
Aster Database by Teradata
Teradata focuses on Hadoop capabilities and multistructured formats, particularly for the data warehouse market.
Storage model: Relational
Language drivers: C#, Java
Transactions supported: Arbitrary multi-statement transactions on a single node
Non-SQL query languages: None
ACID isolation: Snapshot
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: No
Built-in text search: Yes
Twitter: @teradata
Website: teradata.com/Teradata-Aster-Database/

CLASSIFICATION: NewSQL, KV Direct Access
c-treeACE by FairCom Corporation
c-treeACE features a small footprint, scalability, and data integrity coupled with portability, flexibility, and service.
Storage model: Relational, Key-Value, Wide Column/Big Table
Language drivers: C, C++, C#, Java, Python, PHP
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: CTDB, JDTB, ISAM
ACID isolation: Snapshot, Serializable/linearizable
Replication: Asynchronous
Indexing capabilities: Complex Indexing
Open source: No
Built-in text search: No
Twitter: @FairCom_Corp
Website: faircom.com

CLASSIFICATION: Object-Oriented
Caché by InterSystems
Caché is scalable in terms of users and data, both vertically and horizontally, while running more efficiently and using less hardware.
Storage model: Object Model
Language drivers: C, C++, C#, Java, Node.js, Python
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: MDX
ACID isolation: Serializable/linearizable
Replication: Synchronous, Asynchronous
Indexing capabilities: Complex Indexing
Open source: No
Built-in text search: Yes
Twitter: @InterSystems
Website: InterSystems.com

CLASSIFICATION: KV, Wide Column
Cassandra
Apache Cassandra was originally developed at Facebook and is currently used at major tech firms like Adobe and Netflix.
Storage model: Wide Column/Big Table
Language drivers: C#, Go, Java, Node.js, Ruby, Python
Transactions supported: None
Non-SQL query languages: CQL
ACID isolation: None
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Yes
Built-in text search: Via Solr and Datastax
Twitter: @cassandra
Website: cassandra.apache.org



CLASSIFICATION: NewSQL
ClustrixDB by Clustrix
ClustrixDB offers seamless scale-out and built-in high availability, perfectly suited for e-commerce applications.
Storage model: Relational
Language drivers: C, Java, Ruby, Python, PHP
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: None
ACID isolation: Serializable/linearizable
Replication: Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: No
Built-in text search: No
Twitter: @clustrix
Website: clustrix.com

CLASSIFICATION: KV, Document, Data Caching
Couchbase Server by Couchbase
A memory-centric architecture, featuring an integrated cache, support for distributed caching, key-value storage, and document handling.
Storage model: Document, Key-Value
Language drivers: C, C++, C#, Go, Java, Node.js
Transactions supported: Compare-and-set
Non-SQL query languages: N1QL
ACID isolation: MVCC
Replication: Synchronous, Asynchronous
Indexing capabilities: Complex Indexing
Open source: Based on Open Source
Built-in text search: Via ElasticSearch and Solr
Twitter: @couchbase
Website: couchbase.com

CLASSIFICATION: KV, DBaaS
DynamoDB by Amazon
Amazon DynamoDB is a fully managed NoSQL database that supports both document and key-value data models for consistency and low latency.
Storage model: Document, Key-Value - Ordered
Language drivers: Java, Node.js, Ruby, Python, Perl, PHP
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: DSL
ACID isolation: Snapshot
Replication: Synchronous
Indexing capabilities: Primary and Secondary Keys
Open source: No
Built-in text search: Yes
Twitter: @awscloud
Website: aws.amazon.com

CLASSIFICATION: KV, Graph, Document, NewSQL
FoundationDB by FoundationDB
FoundationDB is a multi-model database that combines scalability, fault-tolerance, and high performance with ACID transactions.
Storage model: Relational, Document, Ordered Key-Value, Graph
Language drivers: C, C++, C#, Go, Java, Node.js
Transactions supported: Multi-key ACID transactions
Non-SQL query languages: None
ACID isolation: Serializable/linearizable
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: No
Built-in text search: Via Lucene
Twitter: @FoundationDB
Website: foundationdb.com



CLASSIFICATION: In-Memory, Data Grid
Hazelcast by Hazelcast
Hazelcast features an in-memory data grid and distributed execution framework using Java standard APIs, with transparent distribution and full JCache support.
Storage model: Key-Value - RAM
Language drivers: C++, C#, Java, REST
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: None
ACID isolation: Repeatable reads
Replication: Synchronous, Asynchronous
Indexing capabilities: Complex Indexing
Open source: Yes
Built-in text search: No
Twitter: @hazelcast
Website: hazelcast.com

CLASSIFICATION: Wide Column
HBase
HBase excels at random, real-time read/write access to very large tables of data atop clusters of commodity hardware.
Storage model: Wide Column/Big Table
Language drivers: C, C++, C#, Java, Python, PHP
Transactions supported: None
Non-SQL query languages: DSL
ACID isolation: N/A
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary Key
Open source: Yes
Built-in text search: No
Twitter: @Hbase
Website: hbase.apache.org

CLASSIFICATION: RDBMS
Ingres by Actian
Ingres contains enhancements to easily internationalize customer applications with reduced configuration overhead, simple admin tools, and improved performance.
Storage model: Relational
Language drivers: Java, .NET, ODBC
Transactions supported: Compare-and-set, arbitrary multi-statement transactions spanning multiple nodes
Non-SQL query languages: QUEL
ACID isolation: Serializable/linearizable
Replication: Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Based on Open Source
Built-in text search: Yes
Twitter: @ActianCorp
Website: actian.com

CLASSIFICATION: In-Memory, Hadoop, Data Grid
In-Memory Data Fabric by GridGain
GridGain works on top of existing databases. It supports key-value stores, SQL access, M/R, streaming/CEP, and Hadoop acceleration.
Storage model: Key-Value - RAM
Language drivers: C++, C#, Java, Scala, HTTP REST, Memcached
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: Customer-definable
ACID isolation: Serializable/linearizable
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: No
Built-in text search: Via Lucene
Twitter: @gridgain
Website: gridgain.com/products/in-memory-data-fabric/



CLASSIFICATION: RDBMS
InterBase by Embarcadero
Multi-device, fast, secure, admin-free, small footprint scalable embeddable relational database for Windows, Linux, Mac OSX & iOS, and Android.
Storage model: Relational
Language drivers: C, C++, C#, Java, Delphi
Transactions supported: Arbitrary multi-statement transactions on a single node
Non-SQL query languages: GDML
ACID isolation: Snapshot
Replication: Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: No
Built-in text search: No
Twitter: @interbase
Website: embarcadero.com/products/interbase

CLASSIFICATION: RDBMS, MySQL Family
MariaDB by MariaDB
While remaining application compatible with MySQL, MariaDB adds many new capabilities to address challenging web and enterprise application use cases.
Storage model: Relational
Language drivers: C, C++, C#, Go, Java, Ruby, Python
Transactions supported: Arbitrary multi-statement transactions on a single node
Non-SQL query languages: None
ACID isolation: Snapshot, Serializable/linearizable
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Yes
Built-in text search: Yes
Twitter: @mariadb
Website: mariadb.com

CLASSIFICATION: In-Memory, NewSQL
MemSQL by MemSQL
MemSQL is a real-time database for transactional analytics. It provides instant access to real-time and historical data through a familiar SQL interface.
Storage model: Relational
Language drivers: C, C++, C#, Go, Java, Node.js
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: None
ACID isolation: Read committed
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: No
Built-in text search: No
Twitter: @memsql
Website: memsql.com

CLASSIFICATION: Document
MongoDB by MongoDB
MongoDB blends linear scalability and schema flexibility with the rich query and indexing functionality of the RDBMS.
Storage model: Document
Language drivers: C, C++, C#, Java, Node.js, Ruby
Transactions supported: Compare-and-set
Non-SQL query languages: None
ACID isolation: Read uncommitted
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Yes
Built-in text search: Yes
Twitter: @mongodb
Website: mongodb.org



CLASSIFICATION: RDBMS
MySQL Community Edition by Oracle
MySQL Community Edition is the free downloadable version of the popular open source database, supported by a large community of developers.
Storage model: Relational
Language drivers: C, C++, C#, Java, Ruby, Python
Transactions supported: None
Non-SQL query languages: None
ACID isolation: N/A
Replication: Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Yes
Built-in text search: Yes
Twitter: @mysql
Website: mysql.com

CLASSIFICATION: Graph
Neo4j by Neo Technology
Neo4j is a purpose-built graph database designed to unlock business value in real time from data and its relationships.
Storage model: Graph
Language drivers: Go, Java, Node.js, Ruby, Python, Perl
Transactions supported: Arbitrary multi-statement transactions on a single node
Non-SQL query languages: Cypher
ACID isolation: Serializable/linearizable
Replication: Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Yes
Built-in text search: Yes
Twitter: @neo4j
Website: neo4j.com

CLASSIFICATION: NewSQL
NuoDB by NuoDB
NuoDB is a scale-out SQL database with on-demand scalability, geo-distributed deployment and peer-to-peer distributed architecture.
Storage model: Relational, Key-Value, Graph
Language drivers: C, C++, C#, Go, Java, Node.js
Transactions supported: Arbitrary multi-statement transactions on a single node
Non-SQL query languages: RDF
ACID isolation: Serializable/linearizable
Replication: Asynchronous
Indexing capabilities: Primary Key
Open source: No
Built-in text search: No
Twitter: @nuodb
Website: nuodb.com

CLASSIFICATION: RDBMS, Graph, Document
Oracle Database by Oracle
Oracle Database features a multitenant architecture, automatic data optimization, defense-in-depth security, high availability, clustering, and failure protection.
Storage model: Relational, Document, Graph
Language drivers: C, C++, C#, Go, Java, Node.js
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: CQL, XQuery
ACID isolation: Snapshot, Serializable/linearizable
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: No
Built-in text search: Yes
Twitter: @oracledatabase
Website: oracle.com



CLASSIFICATION: RDBMS, Document, Graph
OrientDB by Orient Technologies
OrientDB is a multi-model NoSQL DBMS that combines graphs, documents, and key-value into one scalable, high performance product.
Storage model: Document, Ordered Key-Value, Graph
Language drivers: PHP, Python, C#, Java, Node.js, Ruby
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: Gremlin, SPARQL, Orient SQL
ACID isolation: Serializable/linearizable
Replication: Synchronous, Asynchronous
Indexing capabilities: Complex Indexing
Open source: Yes
Built-in text search: Via Lucene
Twitter: @OrientDB
Website: orientechnologies.com

CLASSIFICATION: RDBMS, MySQL Family
Percona Server by Percona
Percona Server is an open source drop-in replacement for MySQL with high performance for demanding workflows.
Storage model: Relational
Language drivers: C, C++, C#, Go, Java, Node.js
Transactions supported: Single node and master-to-slave
Non-SQL query languages: None
ACID isolation: Snapshot, Serializable/linearizable
Replication: Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Yes
Built-in text search: Yes
Twitter: @percona
Website: percona.com

CLASSIFICATION: RDBMS
Postgres Plus Advanced Server by EnterpriseDB
Postgres Plus Advanced Server integrates enterprise-grade performance, security and manageability enhancements into PostgreSQL.
Storage model: Relational
Language drivers: C++, C#, Java, Node.js, Ruby, Python
Transactions supported: Arbitrary multi-statement transactions on a single node
Non-SQL query languages: None
ACID isolation: Serializable/linearizable, Read committed
Replication: Synchronous, Asynchronous
Indexing capabilities: Complex Indexing
Open source: Based on Open Source
Built-in text search: Yes
Twitter: @enterprisedb
Website: enterprisedb.com

CLASSIFICATION: Document
RavenDB by Hibernating Rhinos
RavenDB features multi-master replication, dynamic queries, strong support for reporting, and is self-optimizing.
Storage model: Document, Key-Value - SSD or Disk
Language drivers: C#, Java, Node.js, Ruby, Python, PHP
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: Lucene
ACID isolation: Snapshot, Repeatable Read
Replication: Asynchronous
Indexing capabilities: Complex Indexing
Open source: Yes
Built-in text search: Yes
Twitter: @RavenDB
Website: ravendb.net



CLASSIFICATION: In-Memory, KV, Data Caching
Redis Labs Enterprise Cluster by Redis Labs
Redis Labs Enterprise Cluster allows users to create and manage scalable, highly available Redis databases on-premises.
Storage model: Key-Value - RAM
Language drivers: C, C++, C#, Go, Java, Node.js
Transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
Non-SQL query languages: None
ACID isolation: Serializable/linearizable
Replication: Asynchronous
Indexing capabilities: None
Open source: Based on Open Source
Built-in text search: No
Twitter: @RedisLabsInc
Website: redislabs.com

CLASSIFICATION: Document, KV
Riak by Basho
Riak is a highly available, highly scalable, fault-tolerant distributed NoSQL database.
Storage model: Key-Value - Eventually Consistent
Language drivers: C#, Erlang, Python, Ruby, Java, Node.js
Transactions supported: None
Non-SQL query languages: MapReduce
ACID isolation: None, BASE system
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Yes
Built-in text search: Yes
Twitter: @basho
Website: basho.com

CLASSIFICATION: In-Memory, Column-Oriented RDBMS

SAP HANA Platform by SAP

SAP HANA is an in-memory, column-oriented ACID database and application platform for spatial, predictive, text, graph, and streaming analytics.

storage model: Relational, Column-oriented, Graph
language drivers: C, C++, C#, Go, Java, Node.js
transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
non-SQL query languages: OData, XMLA
ACID isolation: Snapshot, Serializable/linearizable
replication: Synchronous, Asynchronous
indexing capabilities: Primary and Secondary Key, Complex Indexing
open source: No
built-in text search? Yes
twitter: @SAPinMemory
website: hana.sap.com

CLASSIFICATION: In-Memory, Hadoop, Data Grid

ScaleOut StateServer by ScaleOut Software

ScaleOut's architecture integrates a low-latency in-memory data grid and a data-parallel compute engine. Ease of use is built into all functions.

storage model: Key-Value - Ordered and RAM
language drivers: C, C++, C#, Java
transactions supported: Compare-and-set
non-SQL query languages: Microsoft LINQ
ACID isolation: Serializable/linearizable
replication: Synchronous, Asynchronous
indexing capabilities: Complex Indexing
open source: No
built-in text search? No
twitter: @ScaleOut_Inc
website: scaleoutsoftware.com
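The ScaleOut card lists "Compare-and-set" as its transaction model. The primitive itself is easy to sketch in plain Python (a hypothetical class, not the ScaleOut API): an update succeeds only if the stored value still matches what the caller last read, which is how optimistic concurrency prevents two racing writers from silently overwriting each other.

```python
import threading

class CasStore:
    """Minimal in-memory key-value store with a compare-and-set primitive."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # makes the compare + write atomic

    def get(self, key):
        return self._data.get(key)

    def compare_and_set(self, key, expected, new_value):
        """Write new_value only if the current value still equals expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False  # someone else changed it; caller must retry
            self._data[key] = new_value
            return True

store = CasStore()
store.compare_and_set("counter", None, 0)        # create the key
ok = store.compare_and_set("counter", 0, 1)      # succeeds: value was 0
stale = store.compare_and_set("counter", 0, 99)  # fails: value is now 1
print(ok, stale, store.get("counter"))  # True False 1
```

A caller whose compare-and-set fails re-reads the current value and retries, rather than holding a lock across the whole read-modify-write cycle.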



CLASSIFICATION: NewSQL, Hadoop

Splice Machine by Splice Machine

Splice Machine combines the scalability of Hadoop, the ANSI SQL and ACID transactions of an RDBMS, and the distributed computing of HBase.

storage model: Relational
language drivers: Java
transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
non-SQL query languages: None
ACID isolation: Snapshot
replication: Asynchronous
indexing capabilities: Primary and Secondary Keys
open source: No
built-in text search? Via Solr
twitter: @splicemachine
website: splicemachine.com

CLASSIFICATION: RDBMS

SQL Server by Microsoft

SQL Server is considered the de facto database for .NET development. Its ecosystem is connected to numerous .NET technologies.

storage model: Relational
language drivers: C#, Java, Ruby, PHP, Visual Basic
transactions supported: Arbitrary multi-statement transactions on a single node
non-SQL query languages: Transact-SQL, MDX
ACID isolation: Snapshot, Serializable/linearizable
replication: Synchronous, Asynchronous
indexing capabilities: Primary and Secondary Keys
open source: No
built-in text search? Yes
twitter: @microsoft
website: microsoft.com/en-us/server-cloud/products/sql-server/Buy.aspx

CLASSIFICATION: RDBMS

SQLite

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.

storage model: Relational
language drivers: C, C++, C#, Go, Java, Ruby, Python
transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
non-SQL query languages: None
ACID isolation: Serializable/linearizable
replication: No replication
indexing capabilities: Primary Key
open source: Yes
built-in text search? Yes
twitter: N/A
website: sqlite.org

CLASSIFICATION: In-Memory, NewSQL

VoltDB by VoltDB

VoltDB combines the capabilities of an operational database, real-time analytics, and stream processing in one platform.

storage model: Relational
language drivers: C++, C#, Go, Java, Node.js
transactions supported: Arbitrary multi-statement transactions spanning arbitrary nodes
non-SQL query languages: None
ACID isolation: Serializable/linearizable
replication: Synchronous, Asynchronous
indexing capabilities: Complex Indexing
open source: Yes
built-in text search? No
twitter: @voltdb
website: voltdb.com
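The "self-contained, serverless, zero-configuration" claim in the SQLite card is easy to see from Python's standard library: the sketch below opens an in-memory database, runs a multi-statement transaction, and rolls the whole thing back on error, with no server process involved. (The table and values are illustrative, not from any product documentation.)

```python
import sqlite3

# ":memory:" creates a throwaway database; a filename would persist it to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

# The connection context manager wraps a transaction: it commits on success
# and rolls back if the block raises.
try:
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 30 "
                     "WHERE name = 'alice'")
        conn.execute("INSERT INTO accounts VALUES ('alice', 0)")  # violates PK
except sqlite3.IntegrityError:
    pass  # the whole transaction rolled back, including the debit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}: untouched after rollback
```

Atomicity is the point here: the valid UPDATE and the failing INSERT ran in one transaction, so neither survives the rollback.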



glossary

a

Aggregate: A cluster of domain objects that can be treated as a single unit; an ideal unit for data storage on large distributed systems.

ACID (Atomicity, Consistency, Isolation, Durability): A term that refers to the model properties of database transactions, traditionally used for SQL databases.

b

BASE (Basic Availability, Soft State, Eventual Consistency): A term that refers to the model properties of database transactions, specifically for NoSQL databases needing to manage unstructured data.

B-Tree: A data structure in which all terminal nodes are the same distance from the base, and all nonterminal nodes have between n and 2n subtrees or pointers. It is optimized for systems that read and write large blocks of data.

c

Complex Event Processing: An organizational process for collecting data from multiple streams for the purpose of analysis and planning.

d

Database Clustering: Connecting two or more servers and instances to a single database, often for the advantages of fault tolerance, load balancing, and parallel processing.

Database Management System (DBMS): A suite of software and tools that manage data between the end user and the database.

Data Management: The complete lifecycle of how an organization handles storing, processing, and analyzing datasets.

Data Mining: The process of discovering patterns in large sets of data and transforming that information into an understandable format.

Data Warehouse: A collection of accumulated data from multiple streams within a business, aggregated for the purpose of business management.

Distributed System: A collection of individual computers that work together and appear to function as a single system. This requires access to a central database, multiple copies of a database on each computer, or database partitions on each machine.

Document Store: A type of database that aggregates data from documents rather than defined tables and is used to present document data in a searchable form.

e

Eventual Consistency: The idea that databases conforming to the BASE model will contain data that becomes consistent over time.

f

Fault-Tolerance: A system's ability to respond to hardware or software failure without disrupting other systems.

g

Graph Store: A type of database used for handling entities that have a large number of relationships, such as social graphs, tag systems, or any link-rich domain; it is also often used for routing and location services.

h

Hadoop: An Apache Software Foundation framework developed specifically for high-scalability, data-intensive, distributed computing. It is used primarily for batch processing large datasets very efficiently.

High Availability (HA): Refers to the continuous availability of resources in a computer system even after component failures occur. This can be achieved with redundant hardware, software solutions, and other specific strategies.

i

In-Memory: As a generalized industry term, it describes data management tools that load data into memory (RAM or flash memory) instead of hard disk drives.

j

Journaling: Refers to the simultaneous, real-time logging of all data updates in a database. The resulting log functions as an audit trail that can be used to rebuild the database if the original data is corrupted or deleted.

k

Key-Value Store: A type of database that stores data in simple key-value pairs. They are used for handling lots of small, continuous, and potentially volatile reads and writes.

l

Lightning Memory-Mapped Database (LMDB): A copy-on-write B-Tree database that is fully transactional, ACID compliant, small in size, and uses MVCC.

Log-Structured Merge (LSM) Tree: A data structure that writes and edits data using immutable segments or runs that are usually organized into levels. There are several strategies, but the first level commonly contains the most recent and active data.

m

MapReduce: A programming model created by Google for high scalability and distribution on multiple clusters for the purpose of data processing.

Multi-Version Concurrency Control (MVCC): A method for handling situations where machines simultaneously read and write to a database.

n

NewSQL: A shorthand descriptor for relational database systems that provide horizontal scalability and performance on par with NoSQL systems.

NoSQL: A class of database systems that incorporates other means of querying outside of traditional SQL and does not follow standard relational database rules.

o

Object-Relational Mapper (ORM): A tool that provides a database abstraction layer to convert data between incompatible type systems using object-oriented programming languages instead of the database's query language.

p

Persistence: Refers to information from a program that outlives the process that created it, meaning it won't be erased during a shutdown or clearing of RAM. Databases provide persistence.

Polyglot Persistence: Refers to an organization's use of several different data storage technologies for different types of data.

r

Relational Database: A database that structures interrelated datasets in tables, records, and columns.

Replication: A term for the sharing of data so as to ensure consistency between redundant resources.

s

Schema: A term for the unique data structure of an individual database.

Sharding: Also known as horizontal partitioning, sharding is where a database is split into several pieces, usually to improve the speed and reliability of an application.

Strong Consistency: A database concept that refers to the inability to commit transactions that violate a database's rules for data validity.

Structured Query Language (SQL): A programming language designed for managing and manipulating data; used primarily in relational databases.

w

Wide-Column Store: Also called BigTable stores because of their relation to Google's early BigTable database, these databases store data in records that can hold very large numbers of dynamic columns. The column names and the record keys are not fixed.
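Several glossary entries (MapReduce, Key-Value Store) are easier to grasp with a concrete sketch. The toy word count below follows the classic MapReduce shape in plain Python: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step folds each group into a result. Real frameworks such as Hadoop run these same phases in parallel across many machines; here they run in one process purely for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"], counts["fox"])  # 3 2
```

Because each (word, 1) pair is independent, the map step can run on any machine holding a slice of the input, which is what makes the model scale.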



DZONE RESEARCH GUIDES

"Better, more concrete, and closer to reality than reports from Gartner and Forrester..."
Robert Zakrzewski, DZone Member

DZONE GUIDE TO BIG DATA
Discover data science obstacles and how the influx of data is influencing new tools and technologies.

DZONE GUIDE TO CONTINUOUS DELIVERY
Understand the role of containerization, automation, testing, and other best practices that move Continuous Delivery beyond simply an exercise in tooling.

DZONE GUIDE TO ENTERPRISE INTEGRATION
Expand your knowledge on integration architecture, REST APIs, and SOA patterns.

UPCOMING GUIDES IN 2015:
Cloud Development | Mobile Development | Internet of Things | Performance & Monitoring | Software Quality | Team Collaboration

Get them all for free on DZone! dzone.com/research

