DATABASE AND PERSISTENCE MANAGEMENT
2015 EDITION
More choices can be a blessing and a curse. Without the right understanding of new technologies and how they relate to your situation, the multitude of persistence options can be paralyzing. Our second guide of 2015 is here to help you navigate this sea of technology and understand where certain options might fit in your organization. You'll also learn about the trends and opinions surrounding modern database technology to get an idea of most companies' recipes for success.

This new guide topic is the result of a split in coverage from last year's Big Data guide. While DZone's 2014 Guide to Big Data contained content and product information for both databases and analytics platforms, this guide focuses solely on databases and persistence utilities. Later this year, we will cover analytics platforms and data processing tools in a separate guide.

We're proud to present DZone's first pure guide to databases. I hope it gives you some new ideas and helps you build better software!

Mitch Pronschinske
Senior Research Analyst
research@dzone.com

Special thanks to our topic experts Ayobami Adewole, Eric Redmond, Oren Eini, Jim R. Wilson, Peter Zaitsev, and our trusted DZone Most Valuable Bloggers for all their help and feedback in making this report a great success.

Contents
Data Migration Checklist
Solutions Directory
Glossary

Credits: Rick Ross (CEO), Matt Schmidt (President & CTO), Kellet Atkinson (General Manager), John Esposito (Editor-in-Chief), Jayashree Gopal (Director of Research), Mitch Pronschinske (Sr. Research Analyst), Benjamin Ball (Research Analyst), Matt Werner (Market Researcher), Ashley Slate (Director of Design), John Walter (Content Curator), Ryan Spain (Content Curator), Chelsea Bosworth (Marketing Associate), Chris Smith (Production Advisor), Brandon Rosser (Customer Success Advocate), Alex Crafts (VP of Sales), Matt O'Brian (Director of Business Development), Jillian Poore (Sales Associate)
Key Takeaways

Database Performance Is the Biggest Challenge for Operations and Engineering
Developers and operations professionals both agree that achieving and maintaining high database performance is their biggest challenge related to persistence. 71% of respondents in DZone's 2015 Databases survey said performance was one of their biggest challenges related to databases, which was significantly higher than other challenges such as…

Relational Databases Still Dominate, but Polyglot Persistence Is Growing
Although NoSQL databases were touted as the next evolution in persistence by developers who were fed up with certain aspects of RDBMS, relational data stores are still a dominant part of the IT industry (98% use a relational database), and the strengths of SQL databases have been reiterated by many key players in the space. Today, experts instead suggest a multi-database solution, or polyglot persistence approach. 30% of respondents' organizations currently use a polyglot persistence approach where at least one relational database and one non-relational database are used.

[Chart: 98% of respondents use a relational database, 32% use a NoSQL database, 30% use both]

…persistence technologies. Forecasts and opinions about the future of database usage. …familiar with its query language, so that queries can be optimized and application access can be tuned.
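The polyglot persistence approach described above can be sketched in a few lines. This is a hedged illustration only, not from the survey: sqlite3 stands in for the relational system of record, and a plain dict stands in for a non-relational key-value cache such as Redis; the table and function names are invented.

```python
import sqlite3

# Relational store: durable, queryable system of record (sqlite3 stands in
# for the RDBMS that 98% of respondents report using).
rdbms = sqlite3.connect(":memory:")
rdbms.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
rdbms.execute("INSERT INTO orders (customer, total) VALUES (?, ?)", ("alice", 42.50))
rdbms.commit()

# Non-relational store: a plain dict standing in for a key-value cache
# (e.g. Redis) holding hot, ephemeral lookups.
kv_cache = {}

def get_customer_total(customer: str) -> float:
    """Read-through: serve from the K-V cache, fall back to the RDBMS."""
    if customer in kv_cache:
        return kv_cache[customer]
    row = rdbms.execute(
        "SELECT SUM(total) FROM orders WHERE customer = ?", (customer,)
    ).fetchone()
    total = row[0] or 0.0
    kv_cache[customer] = total  # populate the cache for the next lookup
    return total
```

The point of the pattern is that each store does what it is good at: the relational database answers ad-hoc queries and guarantees durability, while the key-value store absorbs repeated reads.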
Research Findings

More than 800 IT professionals responded to DZone's 2015 Databases Survey. Here are the demographics for this survey:

- Developers (45%) and development team leads (28%) were the most common roles.
- 63% of respondents come from large organizations (100 or more employees) and 37% come from small organizations (under 100 employees).

[Chart: most common database-related activities for developers vs. operations professionals, including database monitoring, writing queries, database administration, optimizing queries, managing data security, tuning data and user access, and data access in application code; values shown include 80%, 79%, 74%, 65%, 62%, and 49%]

Operations professionals also write queries frequently (65%), but they are involved in diagnosing production database issues (62%), tuning database configuration (50%), and their other three primary jobs mentioned above much more often than developers.

Object-Relational Mappers Are Used in Moderation
For years, developers have constantly argued about whether Object-Relational Mapping tools (ORMs) are a useful solution or a detrimental anti-pattern.
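To ground the ORM debate, here is a minimal sketch of the core job an ORM automates: turning rows into typed objects. This hand-rolled mapper over sqlite3 is illustrative only; the schema and names are invented, and a real ORM (Hibernate, SQLAlchemy, etc.) adds change tracking, relationships, and safe query building on top of this kernel.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])

def find_users(where: str = "1=1", params: tuple = ()) -> list[User]:
    """The kernel of what an ORM automates: run SQL, map each row to an object.
    (A real ORM builds the WHERE clause safely instead of interpolating it.)"""
    rows = db.execute(f"SELECT id, name FROM users WHERE {where}", params)
    return [User(id=r[0], name=r[1]) for r in rows]
```

Critics argue this mapping layer hides query cost; proponents argue it removes exactly this kind of repetitive glue code.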
…and data access in application code (49%). A majority of respondents automate database backup (65%) and database monitoring (56%), but most…

[Charts: number of database versions and database products maintained (values include 36%, 33%, 32%, 28%, 18%, and 12%), and whether respondents anticipate introducing new databases in their company in the next year (37% yes)]

…and Distributed Architecture
56% of respondents have distributed architectures with many database instances. 45% of those respondents make their own internal distributed infrastructure while some… Respondents also anticipate introducing new databases to their stacks in the next year (37%). With the ever-growing number of new database options and persistence strategies, it's prudent for developers to anticipate some changes in…
Persistence Strategies
by Andrew C. Oliver

The Big Data and NoSQL trends are, at best, misnamed. The same technologies used to process large amounts of data are sometimes needed for moderate amounts of data that then must be analyzed in ways that, until now, were impossible. Moreover, there are NewSQL databases, or ways to run SQL against NoSQL databases, that can be used in place of traditional solutions. Many of these types of databases are not all that new, but interest in them wasn't high when they were first invented, or the hardware or economics were wrong. For the last 30 years or so, the tech industry assumed the relational database was the right tool for every structured persistence job. A better idea is to lump everything under "use the right tool for the right job."

Your RDBMS isn't going away. Some tasks are definitely tabular and lend themselves well to relational algebra. However, it is possible that some of the more expensive add-ons for your RDBMS will go away. Scaling a traditional RDBMS is difficult at best. Partitioning schemes, multi-master configurations, and redundancy systems offered by Oracle, SQL Server, and DB2 are expensive and problematic at best. They often fail to meet the needs of high-scale applications. That said, try joining two or three small tables and qualifying by one or two fields on Hadoop's Hive and you will be rather underwhelmed with its performance. Try putting truly tabular data into a MongoDB document and achieving transactional consistency and it will frustrate you.

Organizations need to identify the patterns and characteristics of their systems, data, and applications and use the right data technologies for the job. Part of that is understanding the types of data systems available and their key characteristics.

…Their capabilities are fairly limited: they just look up values by their key. Most have alternative index capabilities that allow you to look up values by other characteristics of the values, but this is relatively slow and a good indication that maybe a K-V store is not the best choice. Sharding or distributing the data across multiple nodes is very simple. The market is crowded with K-V stores such as Aerospike, Redis, and Riak, all of which are simple to implement.

Column-Family Databases
Many K-V stores also operate as column family databases. Here the value evolves into a multidimensional array. These are again most efficient when looking up values. However, the secondary indexes are often implemented with pointer tables, meaning a hidden table with that field as the key points to the key of the original table. This is essentially O(1)+O(1) for a lookup time, which is not as good as O(1), but still decent. The two most popular, Cassandra and HBase, are both based on Hadoop but have different write semantics. HBase offers strong write integrity; Cassandra offers eventual consistency. If you have a fair number of reads and writes, and a dirty read won't end your world, Cassandra will concurrently handle them well. A good example of this is Netflix: if they are updating the catalog and you see a new TV show, but the seasons haven't been added, and you click on the title and there is a momentary glitch where you only see season one until the next time you click, would you even notice? However, this is unlikely to happen because by the time you click it was likely added. Obviously this is no way to do accounting, but it is perfectly fine for a lot of datasets. HBase offers more paranoid locking semantics for when you need to be sure the reads are clean. This obviously comes at a cost in terms of concurrency, but is still about as good as row-locking in an RDBMS. Time series data is classically given as an optimal use case for column-family databases, but there are lots of other strong use cases such as catalogs, messaging, and fraud detection.

Column-Oriented RDBMS
These are the new hot thing, with SAP's HANA being so well marketed. The only reason you can't have a truly distributed RDBMS is because they didn't evolve that way. Granted, distributed joins can be heavy, but if optimized correctly they can be done more efficiently than, say, MapReduce against Hadoop's Hive (which is notoriously slow for many types of queries). Column-oriented relational databases may also perform better than traditional RDBMS for many types of queries. Conceptually, think of turning a table sideways: make your rows columns and columns into rows. Many values…
Building a Real-Time Big Data Solution

There was a time when big data meant Hadoop. Big Data was… In today's world, Hadoop can no… The explosion of user-generated data is being followed by an explosion of machine-generated data. Hadoop is great for storing and processing large volumes of data, but it's not equipped to ingest it at this velocity. This has led to a new generation of systems engineered specifically for real-time analytics, including Storm and Couchbase Server.

There are three big data challenges:
1. Data volume: the amount of data being generated.
2. Data velocity: the rate at which data is being generated.
3. Information velocity: the rate at which information must be generated from raw data.

Hadoop addresses data volume. It can store and process a lot of data, later. It scales out to store and process more… Couchbase Server addresses data velocity. It is a high-performance NoSQL database that can store a lot of data, in real time. It scales out to store a lot of data, faster. It… …to process streams of data, faster. It meets real-time analytical requirements.

Integrating Hadoop, Couchbase Server, and Storm creates a faster big data solution. Hadoop and Couchbase Server form the inner core. Hadoop enables big data analysis. Couchbase Server enables high performance data access. Storm provides the outer core: a stream processing system.

The best thing about building a real-time big data solution is designing the architecture. It's like playing with Legos and putting the pieces together that best fit your use case. What will your architecture look like?

Written by Shane Johnson, Senior Product Marketing Manager, Couchbase
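The filter, enrich, and aggregate stages that a stream processor like Storm performs before writing to a fast store can be sketched as chained generator stages. This is a hedged illustration: the event fields and stage names are invented, and an in-memory Counter stands in for the store the results would be written to.

```python
from collections import Counter
from typing import Iterable, Iterator

def filter_stage(events: Iterable[dict]) -> Iterator[dict]:
    """Drop events we don't care about (e.g. health-check pings)."""
    return (e for e in events if e.get("type") == "click")

def enrich_stage(events: Iterable[dict]) -> Iterator[dict]:
    """Attach derived fields before aggregation."""
    for e in events:
        yield dict(e, channel=e.get("channel", "web"))

def aggregate(events: Iterable[dict]) -> Counter:
    """Count clicks per channel: the result a real pipeline would write
    to a low-latency store such as Couchbase Server."""
    counts = Counter()
    for e in events:
        counts[e["channel"]] += 1
    return counts

stream = [
    {"type": "click", "channel": "mobile"},
    {"type": "ping"},
    {"type": "click"},
]
clicks_per_channel = aggregate(enrich_stage(filter_stage(stream)))
```

In a real deployment each stage runs continuously over an unbounded stream rather than a finite list, but the shape of the computation is the same.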
Couchbase Server
A high-performance distributed database with integrated caching, cross data center replication, mobile data access, and Hadoop integration certified by Cloudera and Hortonworks.

Storage model: Document, Key-Value, Distributed cache
ACID isolation: Locking, Compare and Swap (CAS)
Language drivers: Java, .NET, Node.js, Go, PHP, C, Python, Ruby
Indexing capabilities: Incremental MapReduce
Replication: Synchronous, Asynchronous
Case study: PayPal integrated Couchbase Server, Storm, and Hadoop to build the foundation of its real-time analytics platform. Clickstream and interaction data from all channels is pushed to the platform for real-time analysis. The data is pulled into a stream processor, Storm, for filtering, enrichment, and aggregation. After the data is processed, it's written to Couchbase Server, where it's accessed by rich visualization tools. With these tools, PayPal can monitor all traffic in real time. In addition, PayPal leverages views in Couchbase Server to perform additional aggregation and filtering. Finally, the data is exported from Couchbase Server to Hadoop for offline analysis.
As soon as remote data storage becomes a requirement for an app, a whole bunch of new back-end infrastructure needs to be set up by the developer. Even with the rise of the cloud infrastructure, this is still a lot of work.

- You need to pick app server and data store software, then run a bunch of wiring code to pipe data between the client and database.
- If you want offline data access to work, you need to build local storage into your client and get the local-remote syncing logic right.
- If you want real-time data syncing between clients, you need to introduce real-time web protocols, or simulate them using techniques like long-polling.
- Finally, to secure your users' data so that only they can see it, you need to either build your own infrastructure to securely store and verify user credentials, or integrate with services like Facebook or Twitter to allow people to use their existing credentials.

In this article, we'll look at three platforms that aim to…

Firebase allows data to be shared between users. To restrict how data can be accessed by particular users, it provides a server-side rules language. However, it's important to understand that Firebase has no application layer by default: you essentially go directly from your client to the database. Because of this, the rules language also includes capabilities for doing data validation.

Firebase has simple querying capabilities. However, it's important to configure indexing rules on query criteria to keep performance manageable. Queries that you run with Firebase are live: if the data underlying the query changes after the query has been run, then the query results will automatically update.

Meteor
Meteor is a platform for building web and mobile apps. Like Firebase, it has very impressive data synchronization capabilities, including support for live queries. However, Meteor goes even further by allowing you to write code that can be run on both the client and the server-side. This gives you a great deal of flexibility from a security and data validation perspective. The catch is that all development…

…of Cognito is available for iOS and Android, but the synchronization component for the Javascript libraries is still only in developer preview mode. These libraries provide simple support for syncing data between local…

…Technologies. He spent 10 years as a back-end Java developer before moving into front-end development with Rails and, more recently, Javascript SPAs. He spends most of his time building mobile web apps and, occasionally, iOS apps.
NuoDB's scale-out SQL database offers: Scale-Out Performance, Continuous Availability, Geo-Distribution.

When to shard? Or, why shard?

Sharding has enabled companies to scale out a traditional SQL database. But not without consequence: the introduction of sharding has also delivered fragility, application complexity, and operational difficulty.

But what if you revisit the entire architecture of the relational database? By divorcing processing power requirements from data replication, you can provide a system that scales horizontally; that is always consistent; replicates data correctly and consistently; and in the face of failure, always fails correctly.

You still get the view of a single logical database, regardless of how many hosts you're running on, processes you have going, or datacenters you're running in. Since you're still programming to a single logical database, your developers are free to write their applications without having to think about operational concerns. And you can still maintain ACID compliance. Plus, your operations people are free to scale the database however they need to, to take advantage of their resources. Then, when you need to bring things online on demand, or when you no longer need…

To view a video explaining how to build a scalable application without sharding, visit www.nuodb.com/no-shard.

Written by Michael Waclawiczek, VP of Marketing and Operations, NuoDB

NuoDB by NuoDB
Scale-out SQL database, on-demand scalability, geo-distributed deployment, peer-to-peer distributed architecture.

Storage model: Relational, Key-Value, Graph
Language drivers: C, C++, C#, Go, Java, Node.js
ACID isolation: Serializable/linearizable
Transactions supported: Arbitrary multi-statement transactions on a single node
Indexing capabilities: Primary Key
Replication: Asynchronous
Built-in text search: No
Case study: UAE Exchange, headquartered in Abu Dhabi, offers money transfer, foreign currency exchange, remittance, and related financial services. The company is a major player in these segments with over 725 branches in 32 countries serving 7.9 million customers. UAE Exchange handles 6.1% of the world's total remittances. "One of the key elements in NuoDB, which was critical for us, was its ability to scale out/in on demand based on the application load. The ability to work across multiple datacenters, ease of provisioning interfaces, and security out-of-the-box have all impressed us. When we were evaluating new database technologies, our primary aim was to ensure a low learning curve for our teams and a faster ramp up on leveraging the selected platform. NuoDB is ACID compliant and supports the ANSI SQL standard as well as stored procedures. These factors played a crucial role in the adoption of NuoDB." Jayakumar V, Senior Manager, IT, UAE Exchange
Customers: Dassault Systemes, Kodiak Networks, EnterpriseWeb, UAE Exchange, Palo Alto Networks, AutoZone, iCims, Zombie
The State of the Storage Engine
by Baron Schwartz

…rapid change. From relational-only, to NoSQL and Big Data, the technologies we use for data storage and retrieval today are much different from even five years ago.

Today's datasets are so large, and the workloads so demanding, that one-size-fits-all databases rarely make much sense. When a small inefficiency is multiplied by a huge dataset, the opportunity to use a specialized database to save money, improve performance, and optimize for developer productivity and happiness can be very large. And today's solid-state storage is vastly different from spinning disks, too. These factors are forcing fundamental changes for database internals: the underlying algorithms, file formats, and data structures. As a result, modern applications are often backed by as many as a dozen distinct types of databases (polyglot persistence). These trends signal significant, long-term change in how databases are built, chosen, and managed.

"Most companies can afford only one or two proper in-depth evaluations for a new database."

Textbook Architectures Lose Relevance
Many of today's mature relational databases, such as MySQL, Oracle, SQL Server, and PostgreSQL, base much of their architecture and design on decades-old research into transactional storage and relational models that stem from two classic textbooks in the field, known simply as Gray & Reuter and Weikum & Vossen. This textbook architecture can be described briefly as having:

- Row-based storage with fixed schemas
- B-Tree primary and secondary indexes
- ACID transaction support
- Row-based locking
- MVCC (multi-version concurrency control) implemented by keeping old row versions

…textbook architecture with concepts such as wide-row and columnar storage, no support for concurrency at all, and eventual consistency. It's worth noting that although NoSQL databases represent obvious changes in the data model and language (how developers access the database), not all NoSQL databases innovate architecturally. Coping with today's data storage challenges often requires breaking from tradition architecturally, especially in the storage engine.

Log-Structured Merge Trees
One of the more interesting trends in storage engines is the emergence of log-structured merge trees (LSM trees) as a replacement for the venerable B-Tree index. LSM trees are now about two decades old, and LevelDB is perhaps the most popular implementation. Databases such as Apache HBase, Hyperdex, Apache Cassandra, RocksDB, WiredTiger, and Riak use various types of LSM trees.

LSM trees work by recording data, and changes to the data, in immutable segments or runs. The segments are usually organized into levels or generations. There are several strategies, but the first level commonly contains the most recent and active data, and lower levels usually have progressively larger and/or older data, depending on the leveling strategy. As data is inserted or changed, the top level fills up and its data is copied into a segment in the second level. Background processes merge segments in each level together, pruning out obsolete data and building lower-level segments in batches. Some LSM tree implementations add other features such as automatic compression, too. There are several benefits to this approach as compared to the classic B-Tree approach:

- Immutable storage segments are easily cached and backed up
- Writes can be performed without reading first, greatly speeding them up
- Some difficult problems such as fragmentation are avoided or replaced by simpler problems
- Some workloads can experience fewer random-access I/O operations, which are slow
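The flush-and-merge behavior described above can be sketched in a few dozen lines: a mutable in-memory top level that fills up and flushes into immutable segments, plus a merge step that prunes obsolete versions. This is a toy sketch of the general idea, not any specific engine, and the tiny flush threshold is purely for illustration.

```python
class TinyLSM:
    """Toy log-structured merge tree: a mutable memtable plus immutable segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # top level: most recent, active data
        self.segments = []          # older levels: immutable sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never read or update in place; they just land in the top level.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: copy the full top level into a new immutable segment.
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Search newest to oldest; the first hit is the current value.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        # Background merge: combine segments, pruning obsolete versions.
        merged = {}
        for segment in self.segments:        # oldest first...
            merged.update(dict(segment))     # ...so newer values win
        self.segments = [tuple(sorted(merged.items()))]
```

Real engines add what this sketch omits: sorted-run binary search, bloom filters to skip segments, leveled or tiered compaction policies, and crash-safe on-disk formats.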
Space amplification is how many bytes of data are stored on disk, relative to how many logical bytes the database contains. Since LSM trees don't update in place, values that are updated often can cause space amplification.

In addition to amplification, LSM trees can have other performance problems, such as read and write bursts and stalls. It's important to note that amplification and other issues are heavily dependent on workload, configuration of the engine, and the specific implementation. Unlike B-Tree indexes, which have essentially a single canonical implementation, LSM trees are a group of related algorithms and implementations that vary widely.

There are other interesting technologies to consider besides LSM trees. One is Howard Chu's LMDB (Lightning Memory-Mapped Database), which is a copy-on-write B-Tree. It is widely used and has inspired clones such as BoltDB, which is the storage engine behind the up-and-coming InfluxDB time-series database. Another LSM alternative is Tokutek's fractal trees, which form the basis of high-performance write- and space-optimized alternatives to MySQL and MongoDB.

Sometimes companies don't find a database that's optimized for their exact use case, so they build their own, often borrowing concepts from various databases and newer storage engines to achieve the efficiency and performance they need. An alternative is to adapt an efficient and trusted technology that's almost good enough. At VividCortex, we ignore the relational features of MySQL and use it as a thin wrapper around InnoDB to store our large-scale, high-velocity time-series data.

Whatever road you take, a good deal of creativity and experience is required from architects who are looking to overhaul their applications' capabilities. You can't just assume you'll plug in a database that will immediately fit your use case. You'll need to take a much deeper look at the storage engine and the paradigms it is based on.

Baron Schwartz is the founder and CEO of VividCortex, the best SaaS for seeing what your production database servers are doing. He is the author of High Performance MySQL and many open-source tools for MySQL administration. He's also an Oracle ACE and frequent participant in the PostgreSQL community.
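The space amplification described above can be demonstrated with simple counting: in an append-only store, every update of a key is a new stored version, while logically only the latest version counts. The workload below is invented for illustration and measures amplification in row counts rather than bytes.

```python
def space_amplification(updates):
    """Ratio of stored row versions to logically live rows in an
    append-only (no update-in-place) store, before compaction."""
    stored_rows = 0
    logical = set()
    for key, _value in updates:
        stored_rows += 1      # no update-in-place: each write is appended
        logical.add(key)      # logically, only the latest version counts
    return stored_rows / len(logical)

# 1,000 writes that keep rewriting the same 10 hot keys:
hot_workload = [(i % 10, i) for i in range(1000)]
amplification = space_amplification(hot_workload)
```

This is exactly the pattern the article warns about: frequently updated values accumulate obsolete versions until background compaction prunes them.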
DATABASES

With so many different flavors of databases, it's hard to tell what the ingredients are for each one. Here we've mapped a portion of the database landscape onto a menu of mixology! Take your pick of Relational, NoSQL, NewSQL, or Data Grids. And remember, you can pick more than one! That's why they say polyglot persistence is so tasty. Cheers!

[Infographic: four menus of databases.
Relational (represented by liquor glassware): general RDBMS, column-oriented RDBMS, in-memory RDBMS, and graph and document options, with appliance, direct access, and cloud services availability noted.
NoSQL (represented by coffee and espresso glassware): Cassandra, DataStax, Couchbase, CouchDB, Aerospike, HBase, DynamoDB, MongoDB, Riak, Redis, RavenDB, Neo4j, and Titan, labeled as wide column stores, document stores, key-value stores, graph databases, and data caching, some with Hadoop integration or cloud services available.
NewSQL (represented by beer containers): NuoDB, GemFire XD, FoundationDB, VoltDB, Datomic, and MemSQL, with labels such as in-memory, document store, key-value store, and Hadoop.
Data Grids (represented by wine & champagne glassware).]
Reasons … SQL
by Lukas Eder

[Chart: memory price in USD/MB over time, falling from roughly $1,000 to under $0.0001]

While CPUs aren't getting faster anymore, RAM continues to get cheaper. Today, a database server that can keep your whole database in memory constantly is fairly affordable. SQL has been designed to be used primarily by humans to create ad-hoc queries. It is remarkable that the SQL language is pretty much the only declarative language… On such a server, therefore, you want to run the computations as closely as possible to the data in your server's RAM.

…Data Geekery GmbH, located in Zurich, Switzerland. Data Geekery is a leading innovator in the field of Java/SQL interfacing. Their flagship product jOOQ offers frictionless integration of the SQL language into Java.
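Eder's point about running computations close to the data can be illustrated with an in-memory SQLite database: instead of shipping every row to application code, a declarative SQL aggregate does the work where the rows live. sqlite3 here merely stands in for a real in-memory database server, and the trading schema is invented for the example.

```python
import sqlite3

# An in-memory database: all pages live in RAM, like the affordable
# whole-database-in-memory servers the article describes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trades (symbol TEXT, price REAL, qty INTEGER)")
db.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("ACME", 10.0, 100), ("ACME", 12.0, 50), ("INIT", 99.0, 10)],
)

# Declarative SQL: describe the result you want and let the engine compute
# it next to the data, instead of looping over fetched rows in the client.
row = db.execute(
    """
    SELECT SUM(price * qty) / SUM(qty)   -- volume-weighted average price
    FROM trades
    WHERE symbol = ?
    """,
    ("ACME",),
).fetchone()
vwap = row[0]
```

The same query shape works unchanged whether the table holds three rows or three hundred million; only the engine's execution plan changes, which is exactly what a declarative language buys you.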
Sponsored Opinion

Hadoop in a Business Intelligence Project

Fueling an Executive Strategy
Executive strategy combined with continued involvement is absolutely indispensable. First, the organization needs to figure out exactly what kind of intelligence they want to gather. If operational excellence is the primary goal, then more time will be focused on measuring costs and efficiencies, but if you are trying to seek customer intimacy, the project will focus on customer intelligence, social networks, and recommendations.

Connecting the Dots
Team interviews conducted by an outside party are essential. This enables your team to explain the ins and outs of the company without assuming prior… Interviews should cover …the people using the technology, how they communicate, and what they want to do, and end with focused questions to understand how each individual works with data, how they make decisions, and how that fits into the overall executive strategy.

Taking the BS out of BI
Frequently these projects are implemented with BI tools like Pentaho or Tableau. Though some operational systems do not keep history, and often data warehouses are too focused to provide for different uses of the data. Sometimes the solution is creating a new way to make use of the existing systems. Other times it means turning to Hadoop or, more particularly, Hive to integrate and provide a new view of the data in a data lake or similar structure.

Written by Elizabeth Edmiston, VP of Engineering, Mammoth Data

Mammoth Data
Mammoth Data is a data architecture consulting company specializing in new data technologies, distributed computing frameworks like Hadoop, and NoSQL databases. By combining NoSQL and Big Data technologies with a high-level strategy, Mammoth Data is able to craft systems that capture and organize unstructured data into real business intelligence.
How to Manage Deterministic Data-Driven Services
by Peter Lawrey

Recording every event in and out of a system can help build data-driven tests, provide a simple recovery mechanism, and handle micro-bursts gracefully.

How does the journaling of all data inputs and outputs for key systems help make it easier to test, maintain, and improve performance? What are some of the practical considerations of using this approach? One way to model a complex event-driven system is as a series of functions or state machines with the data inputs and a data output. Recording and replaying data for such state machines can help you test these systems for reliable behavior in terms of data output and timings. It can also make restarting and upgrading a running system easier.

[Diagram: INPUT 1 and INPUT 2 feed a STATE MACHINE; its state and OUTPUT are written to a JOURNAL and delivered to DOWNSTREAMS]

Using Recorded Inputs and Outputs to Perform Extensive Tests
To create a small test for a state machine you can record real examples of events/messages/generic… …rare bugs or improve the latency distribution of your state machine, you need a far longer recording, ideally every event for a week.

…output data of a system to reproduce the transient decisions made in production, like which event was processed next, or what the actual spacing of those events was down to the microsecond. Timestamps can be a guide, but are not reliable. If you want to upgrade a system in production, replaying from the output ensures the new system will honor the decisions already made, even if the new system would have made different choices.

The Penalty of Recording Everything in Production
Making long recordings from a production system can raise concerns about the use of disk space and also slowing production response time. However, decreasing costs for disk space and the use of memory-mapped files can make extensive, detailed recording of every event practical. Output data can include meta information such as the history of an event with timing of key stages in the process. Systems that have introduced this recording technique successfully see lower latencies, even when recording every event in production. This is one of the best ways to ensure you have a deterministic latency profile for high-frequency trading systems.

Reproducing and Fixing Obscure Issues with Confidence
One of the challenges of fixing production bugs is…
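The record-and-replay approach Lawrey describes can be sketched as: wrap a deterministic state machine so every input event is appended to a journal, then rebuild identical state later purely by replaying that journal. This is a minimal sketch with invented names; a production system would journal to memory-mapped files rather than a Python list.

```python
class CounterStateMachine:
    """A deterministic state machine: the same inputs, in order, yield the same state."""

    def __init__(self):
        self.total = 0

    def on_event(self, event: dict):
        self.total += event["amount"]

def run_with_journal(events, journal):
    """Feed events to a fresh machine, recording each input in the journal."""
    machine = CounterStateMachine()
    for event in events:
        journal.append(event)      # record the input as it is processed
        machine.on_event(event)
    return machine

def replay(journal):
    """Recovery/upgrade path: rebuild state purely from the journal."""
    machine = CounterStateMachine()
    for event in journal:
        machine.on_event(event)
    return machine

journal = []
live = run_with_journal([{"amount": 5}, {"amount": -2}, {"amount": 10}], journal)
restored = replay(journal)   # e.g. after a restart, or on a new code version
```

Because replay reproduces the exact event order, the journal doubles as a regression test fixture: run the recorded week of events through a new build and compare its outputs to the recorded ones.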
NoSQL Zone (dzone.com/mz/nosql): NoSQL (Not Only SQL) is a movement devoted to creating alternatives to traditional relational databases and excelling where they are weak. The NoSQL Zone is a central source for news and trends of the non-relational database ecosystem, which includes solutions such as Neo4j, MongoDB, CouchDB, Cassandra, Riak, Redis, RavenDB, HBase, and others.

SQL Zone (dzone.com/mz/sql): The SQL Zone is DZone's portal for following the news and trends of the relational database ecosystem, which includes solutions such as MySQL, PostgreSQL, SQL Server, NuoDB, VoltDB, and many other players. In addition to these RDBMSs, you'll find optimization tricks, SQL syntax tips, and useful tools to help you write cleaner and more efficient queries.

Big Data Zone (dzone.com/mz/big-data): The Big Data/Analytics Zone is a prime resource and community for Big Data professionals of all types. We're on top of all the best tips and news for Hadoop, R, and data visualization technologies. Not only that, but we also give you advice from data science experts on how to understand and present that data.
Checklist
database, application or computer system. Most IT shops have to deal with
data migration at some point, and making mistakes in a migration project
can lead to a loss of data. This checklist is designed to help ensure that you
dont miss any important steps in a general data migration project.
☐ Determine how much data cleaning and manipulation is required
☐ Review data governance policies
☐ Decide whether data will be cleaned before or after migration
☐ Have a data backup system in place
☐ Project how much storage capacity is required for the migration and whether any data growth is expected
☐ Determine any other required tooling, costs, risks, and time frames associated with the migration
☐ Revise the migration plan if necessary
☐ Configure the migration environment
☐ Identify any inconsistencies, missing fields, or fields that need consolidation, conversion, or parsing
☐ Perform backup or replication of the data to be migrated
☐ Map the old database schema to the new database and prepare any data migration software or custom scripts
☐ Notify users of any downtime
☐ Determine your organization's metrics for success in this migration
☐ Run tests with sample data to rehearse the data migration process
☐ Test administrators' knowledge of the new system's features (e.g. reporting, export)
☐ Test the data to ensure its integrity and full importation
☐ Verify that the database schema is correct
☐ Perform a rollback if data corruption occurs
☐ Prepare a report on the statistics related to the migration
☐ Have business owners perform acceptance testing

File and Volume Migration Tools
Rsync: Open source and provides fast incremental file transfer
IBM Softek Transparent Data Migration Facility: Migrates data at the volume or block level, supports multiple platforms
SanXfer: Migrates server data volumes and is optimized for cloud environments
Hitachi Data Systems Universal Replicator: Provides storage array-based replication

Database Migration Tools
MySQL Workbench Wizard: For easy database migration to MySQL
Oracle SQL Developer: For migrating non-Oracle databases to Oracle
EDB Migration Toolkit: For database migration to PostgreSQL
Other database migration tools include Flyway, SwisSQL, ESF Database Migration Toolkit, and Ispirer's SQLways
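Checklist items such as "Run tests with sample data" and "Test the data to ensure its integrity and full importation" can be partially automated by fingerprinting each table on both sides of the migration. The sketch below is a generic illustration, not a tool from the checklist; it uses in-memory SQLite databases to stand in for the real source and target, and the table and column names are made up for the demo.

```python
import hashlib
import sqlite3

def table_fingerprint(conn, table):
    """Row count plus an order-independent checksum of every row."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    digest = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        digest.update(row.encode())
    return len(rows), digest.hexdigest()

def verify_migration(source, target, tables):
    """Compare fingerprints table by table; returns the mismatched tables."""
    mismatches = []
    for table in tables:
        if table_fingerprint(source, table) != table_fingerprint(target, table):
            mismatches.append(table)
    return mismatches

# Demo: two in-memory databases standing in for source and target.
src, dst = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
print(verify_migration(src, dst, ["users"]))  # [] means the data matches
```

For very large tables you would compare counts and sampled checksums rather than hashing every row, but the verification step itself belongs in the migration plan either way.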
Solutions Directory
Notice to readers: The data in this section is impartial and not influenced by any vendor.

This directory of database management systems and data grids provides comprehensive, factual data gathered from third-party sources and the tool creators' organizations. Solutions in the directory are selected based on several impartial criteria, including solution maturity, technical innovativeness, relevance, and data availability. The solution summaries underneath the product titles are based on the organizations' opinions of their most distinguishing features. Understand that not having certain features is a positive thing in some cases. Fewer unnecessary features sometimes translate into more flexibility and better performance.

Aerospike was designed to make specific use of SSDs without a filesystem, and features advanced peer-to-peer clustering with automated management.
Storage model: Key-Value - SSD or Disk
Language drivers: C, C#, Go, Java, Node.js, Ruby
Transactions supported: Compare-and-set
Non-SQL query languages: AQL
ACID isolation: Read committed
Replication: Synchronous, Asynchronous
Indexing capabilities: Primary and Secondary Keys
Open source: Based on Open Source
Built-in text search: No
Twitter: @AerospikeDB
Website: aerospike.com
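Aerospike's directory entry lists compare-and-set as its supported transaction model. The idea, independent of any particular product, is to commit a write only if the record's generation counter still matches the one observed at read time. The code below is a generic in-memory sketch of that semantics, not Aerospike's actual client API.

```python
import threading

class CasStore:
    """Generic compare-and-set key-value store (illustrative only)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (value, generation)

    def read(self, key):
        """Return the current value and its generation counter."""
        return self._data.get(key, (None, 0))

    def write_if_unchanged(self, key, value, expected_gen):
        """Commit only if nobody wrote since our read; returns success flag."""
        with self._lock:
            _, gen = self._data.get(key, (None, 0))
            if gen != expected_gen:
                return False  # lost the race: caller should re-read and retry
            self._data[key] = (value, gen + 1)
            return True

store = CasStore()
value, gen = store.read("balance")
assert store.write_if_unchanged("balance", 100, gen)      # first writer wins
assert not store.write_if_unchanged("balance", 200, gen)  # stale generation fails
```

Callers that lose the race re-read the record and retry, which gives atomic read-modify-write behavior without holding locks across client round trips.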
Altibase is a hybrid in-memory database, providing the speed of memory plus the storage of disk in one single, unified engine.
Open source: No | Built-in text search: No

ArangoDB features a microservice framework, optional transactions, and integrates with cloud operating systems like Mesosphere and Docker Swarm.
Open source: Yes | Built-in text search: Yes

Teradata focuses on Hadoop capabilities and multi-structured formats, particularly for the data warehouse market.
Open source: No | Built-in text search: Yes

c-treeACE features a small footprint, scalability, and data integrity coupled with portability, flexibility, and service.
Open source: No | Built-in text search: No

Amazon DynamoDB is a fully managed NoSQL database that supports both document and key-value data models for consistency and low latency.
Open source: No | Built-in text search: Yes

FoundationDB is a multi-model database that combines scalability, fault-tolerance, and high performance with ACID transactions.
Open source: No | Built-in text search: Via Lucene

Ingres contains enhancements to easily internationalize customer applications with reduced configuration overhead, simple admin tools, and improved performance.
Open source: Based on Open Source | Built-in text search: Yes

GridGain works on top of existing databases. It supports key-value stores, SQL access, M/R, streaming/CEP, and Hadoop acceleration.
Open source: No | Built-in text search: Via Lucene

Multi-device, fast, secure, admin free, small footprint scalable embeddable relational database for Windows, Linux, Mac OSX & iOS, Android.
Open source: No | Built-in text search: No

While remaining application compatible with MySQL, MariaDB adds many new capabilities to address challenging web and enterprise application use cases.
Open source: Yes | Built-in text search: Yes

NuoDB is a scale-out SQL database with on-demand scalability, geo-distributed deployment and peer-to-peer distributed architecture.
Open source: No | Built-in text search: No

Oracle Database features a multitenant architecture, automatic data optimization, defense-in-depth security, high availability, clustering, and failure protection.
Open source: No | Built-in text search: Yes

SAP HANA is an in-memory, column-oriented ACID database and application platform for spatial, predictive, text, graph, and streaming analytics.
Open source: No | Built-in text search: Yes

ScaleOut's architecture integrates a low-latency in-memory data grid and a data-parallel compute engine. Ease-of-use is built into all functions.
Open source: No | Built-in text search: No

Splice Machine combines the scalability of Hadoop, the ANSI SQL and ACID transactions of an RDBMS, and the distributed computing of HBase.
Open source: No | Built-in text search: Via Solr

SQL Server is considered the de facto database for .NET development. Its ecosystem is connected to numerous .NET technologies.
Open source: No | Built-in text search: Yes
Glossary

ACID (Atomicity, Consistency, Isolation, Durability): A term that refers to the model properties of database transactions, traditionally used for SQL databases.

B-Tree: A data structure in which all terminal nodes are the same distance from the base, and all nonterminal nodes have between n and 2n subtrees or pointers. It is optimized for systems that read and write large blocks of data.

Database Clustering: Connecting two or more servers and instances to a single database, often for the advantages of fault tolerance, load balancing, and parallel processing.

Database Management System (DBMS): A suite of software and tools that manage data between the end user and the database.

Data Management: The complete lifecycle of how an organization handles storing, processing, and analyzing datasets.

Data Mining: The process of discovering patterns in large sets of data and transforming that information into an understandable format.

Data Warehouse: A collection of accumulated data from multiple streams within a business, aggregated for the purpose of business management.

Distributed System: A collection of individual computers that work together and appear to function as a single system. This requires access to a central database, multiple copies of a database on each computer, or database partitions on each machine.

Fault-Tolerance: A system's ability to respond to hardware or software failure without disrupting other systems.

Graph Store: A type of database used for handling entities that have a large […] unstructured data.

Hadoop: An Apache Software Foundation framework developed specifically for high-scalability, data-intensive, distributed computing. It is used primarily for batch processing large datasets very efficiently.

High Availability (HA): Refers to the continuous availability of resources in a computer system even after component failures occur.

In-Memory: As a generalized industry term, it describes data management tools that load data into memory (RAM or flash memory) instead of hard disk drives.

Journaling: Refers to the simultaneous, real-time logging of all data updates in a database. The resulting log functions as an audit trail that can be used to rebuild the database if the original data is corrupted or deleted.

Key-Value Store: A type of database that stores data in simple key-value pairs. They are used for handling lots of small, continuous, and potentially volatile reads and writes.

Lightning Memory-Mapped Database (LMDB): A copy-on-write B-Tree database that is fully transactional, ACID compliant, small in size, and uses MVCC.

Log-Structured Merge (LSM) Tree: A data structure that writes and edits data using immutable segments or runs that are usually organized into levels. There are several strategies, but the first level commonly contains the most recent and active data.

MapReduce: A programming model created by Google for high scalability and distribution on multiple clusters for the purpose of data processing.

NoSQL: A class of database systems that incorporates other means of querying outside of traditional SQL and does not follow standard relational database rules.

Object-Relational Mapper (ORM): A tool that converts data between a relational database and objects in an object-oriented programming language.

Persistence: Refers to information from a program that outlives the process that created it, meaning it won't be erased during a shutdown or clearing of RAM. Databases provide persistence.

Polyglot Persistence: Refers to an application using multiple data storage technologies, each chosen according to the way its data is used.

Relational Database: A database that structures interrelated datasets in tables, records, and columns.

Replication: A term for the sharing of data so as to ensure consistency between redundant resources.

Schema: A term for the unique data structure of an individual database.

Sharding: Also known as horizontal partitioning, sharding is where a database is split into several pieces, usually to improve the speed and reliability of an application.

Strong Consistency: A database concept that refers to the inability to commit transactions that violate a database's rules for data validity.

Structured Query Language (SQL): A programming language designed for managing and manipulating data; used primarily in relational databases.

Wide-Column Store: Also called BigTable stores because of their relation to Google's early BigTable database, these databases store data in records that can hold very large numbers of dynamic columns. The column names and the record keys are not fixed.
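To make the sharding definition above concrete, the following minimal sketch routes record keys to partitions by hashing. The shard count and key names are illustrative assumptions, not drawn from this guide or any particular database.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a record key to one of num_shards partitions deterministically."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Every node computes the same shard for the same key, with no coordination.
shards = [[] for _ in range(4)]
for user_id in ("alice", "bob", "carol", "dave"):
    shards[shard_for(user_id, 4)].append(user_id)
```

Real systems usually add an indirection layer (e.g. consistent hashing or a shard map) so that partitions can be moved without rehashing every key, but the routing principle is the same.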