
A Survey of Post-relational Data Management and NOSQL movement

Aleksandar Milanović, aca.milanovic@gmail.com; Miroslav Mijajlović, mijajlovic.miroslav@gmail.com

Department of Computer Science, Faculty of Mathematics, University of Belgrade, Serbia

Abstract
Relational database management systems (RDBMSs) are today the predominant technology for storing structured data in web and business applications. In the past few years, the "one size fits all" thinking concerning datastores has been questioned by both science and web-affine companies, which has led to the emergence of a great variety of alternative databases. Here we examine a number of so-called NoSQL data stores designed to scale simple OLTP-style application loads over many servers. The idea was originally motivated by Web 2.0 applications: in contrast to traditional RDBMSs, these systems should scale to thousands or millions of users performing updates as well as reads. This alternative way of storing data, which trades some data-management features for simplicity and horizontal scalability, is also quite popular within cloud-based internet web applications.

1. Introduction
Interactive software (software with which a person iteratively interacts in real time) has changed in fundamental ways over the last 35 years. The online systems of the 1970s have, through a series of intermediate transformations, evolved into today's Web and mobile applications. These systems solve new problems for potentially vastly larger user populations, and they execute atop a computing infrastructure that has changed even more radically over the years. The architecture of these software systems has likewise transformed. A modern Web application can support millions of concurrent users by spreading load across a collection of application servers behind a load balancer. Changes in application behavior can be rolled out incrementally, without requiring application downtime, by gradually replacing the software on individual servers. Adjustments to application capacity are easily made by changing the number of application servers. But database technology has not kept pace. Relational database technology, invented in the 1970s and still in widespread use today, was optimized for the applications, users and infrastructure of that era. In some regards, it is the last domino to fall in the inevitable march toward a fully distributed software architecture. While a number of band-aids have extended the useful life of the technology (horizontal and vertical sharding, distributed caching and data denormalization), these tactics nullify key benefits of the relational model while increasing total system cost and complexity.

Figure 1 - State of interactive software, a comparison of tendencies in 1975 and today

2. Problem statement
Traditional relational databases (RDBMSs) typically ensure ACID (Atomicity, Consistency, Isolation and Durability) properties for every transaction, thereby providing consistency over business transactions, and these databases do well when run on a single machine. However, when required to scale horizontally, commonly recommended approaches such as 'sharding' (partitioning data to run on multiple machines) or 'optimistic locking' (allowing multiple transactions to complete without blocking each other) often break data normalization, making an RDBMS a poor fit for such deployments. Hence, deployments which do not require ACID transactions are great candidates for an alternative way of storing data, which is especially relevant in cloud environments, where scalability requirements are very high.
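To make 'sharding' concrete, here is a minimal sketch of hash-based partitioning; the function name and shard counts are illustrative, not taken from any particular product:

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Map a record key to one of n_shards partitions deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

# Every client routes the same key to the same shard, so single-key
# reads and writes stay local -- but cross-shard joins and transactions
# are exactly what becomes hard, as discussed above.
shards = [shard_for(user, 4) for user in ("alice", "bob", "carol")]
assert all(0 <= s < 4 for s in shards)
assert shard_for("alice", 4) == shard_for("alice", 4)  # deterministic
```

Note that with plain modulo hashing, changing `n_shards` remaps almost every key; this is one reason several of the systems surveyed below use consistent hashing instead.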


3. Existing solutions of the problem and their criticism


In response to the lack of commercially available alternatives, new solutions were developed. These solutions, or databases, are now called NoSQL databases. The term NoSQL was first used in 1998 to name an open-source relational database system that explicitly did not have a SQL interface. It was an experimental database system and was not widely used. The term then fell out of use until 2009, when it re-emerged at an event about non-relational approaches to databases. NoSQL was then used to describe a group of databases which took a different approach from SQL database systems. NoSQL covers a wide range of technologies, data architectures and priorities; it represents as much a movement or a school of thought as it does any particular technology. Even the name is confusing: for some it means literally any data storage that does not use SQL, but thus far the industry seems to have settled on "Not Only SQL". As time goes on, it is likely that the scope of the term will grow until it becomes meaningless by itself, and sub-divisions will be needed to clarify its meaning. The NoSQL systems described here generally do not provide ACID transactional properties: updates are eventually propagated, but there are limited guarantees on the consistency of reads. Some authors suggest a BASE acronym in contrast to the ACID acronym: BASE = Basically Available, Soft state, Eventually consistent; ACID = Atomicity, Consistency, Isolation, and Durability. The idea is that by giving up ACID constraints, one can achieve much higher performance and scalability.
NoSQL is a large and expanding field. For the purposes of this paper, the common features of NoSQL data stores are:
- easy to use in conventional load-balanced clusters;
- persistent data (not just caches);
- scale to available memory;
- no fixed schemas, allowing schema migration without downtime;
- individual query systems rather than a standard query language;
- ACID within a node of the cluster and eventually consistent across the cluster.
Not every product in this paper has every one of these properties, but the majority of the stores we are going to discuss support most of them.
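The "eventually consistent across the cluster" property can be illustrated with a toy model of asynchronous, last-writer-wins replication; this is a sketch of the general idea, not any particular product's protocol:

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def apply(self, key, ts, value):
        # Last-writer-wins: keep only the newest timestamped version.
        if key not in self.data or ts > self.data[key][0]:
            self.data[key] = (ts, value)

replicas = [Replica() for _ in range(3)]
# Writes land on different nodes first; others see them later.
replicas[0].apply("cart:42", 1, ["book"])
replicas[2].apply("cart:42", 2, ["book", "pen"])
# Before synchronization, reads may disagree (BASE's "soft state").
# An anti-entropy/gossip pass eventually exchanges all updates:
for r in replicas:
    for other in replicas:
        for k, (ts, v) in other.data.items():
            r.apply(k, ts, v)
# Afterwards every replica converges on the newest version.
assert all(r.data["cart:42"] == (2, ["book", "pen"]) for r in replicas)
```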


3.1 Classification criteria and the classification tree

Figure 2 - Classification tree, defining different classes of NoSQL storage systems

The technical names of the surveyed classes are: Wide Column Store, Document Store, Key Value Store, Eventually Consistent Key Value Store, Graph Databases, Object Databases, and XML Databases.

Table 1 - Classification Table, defining the procedure of surveying and the number of surveyed contributions

The first classification parameter tells us whether or not such systems originated out of Web 2.0 needs.


The criterion inside the Core NoSQL systems branch groups them according to their data model. Below, we explain each of these leaves and present the most widely used products from each class.

3.2. Presentation of existing solutions


This section is divided into several subsections, one per leaf of the classification defined above. For each leaf, several paragraphs are given, one per research effort overviewed. In the text that follows, instead of the term leaf, we use the term DataStore Group.

3.2.1. DataStore Group #1: Wide Column Store


Traditional databases are optimized for 'column retrieval' for a given row. For example, it is very inexpensive to retrieve all the columns of a given row in a traditional RDBMS. However, in the same traditional RDBMS architecture, it is very expensive to retrieve all the rows for a given column. This next generation of databases (also known as 'C-Store') inverts this dynamic: it is designed to optimize retrieval of multiple rows within a given column, and for read efficiency over write efficiency. These databases also support "shared-nothing" clustering to allow for scalability across multiple machines, as well as standard SQL, read consistency and transactions. Based on our survey, we found the following to be widely popular and well adopted by various companies.

3.2.1.1. Google's BigTable


BigTable development began in 2004, and it is now used by a number of Google applications, such as MapReduce (which is often used for generating and modifying data stored in BigTable), Google Reader, Google Maps, Google Book Search, "My Search History", Google Earth, Blogger.com, Google Code hosting, Orkut, YouTube, and Gmail. Google's reasons for developing its own database include scalability and better control over performance characteristics. A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes. A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.


Figure 3 - BigTable Data Model, demonstration of the BigTable data model (GFS - Google File System)

Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent real time in microseconds, or be explicitly assigned by the client.
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Example 1 - Writing to BigTable, example of C++ code for writing into BigTable

Bigtable datasets can be queried from services like AppEngine using a language called GQL ("gee-kwal"), which is based on a subset of SQL. Conspicuously missing from GQL is any sort of JOIN command. Because of the distributed nature of a Bigtable database, performing a join between two tables would be terribly inefficient.
from google.appengine.ext import db

class Person(db.Model):
    name = db.StringProperty()
    age = db.IntegerProperty()

amy = Person(key_name='Amy', age=48)
amy.put()
Person(key_name='Betty', age=42).put()
Person(key_name='Charlie', age=32).put()
Person(key_name='David', age=29).put()
Person(key_name='Edna', age=20).put()
Person(key_name='Fred', age=16, parent=amy).put()
Person(key_name='George').put()

Example 2 - Example of datastore input, simple Python code that stores data into the class Person
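Against the Person entities defined above, a GQL query might look as follows (the syntax follows the App Engine GQL of the time; treat the exact filter values as illustrative):

```sql
SELECT * FROM Person WHERE age >= 20 ORDER BY age DESC
```

There is deliberately no JOIN clause; relationships are instead modeled with reference properties (such as the `parent` key above) or by denormalizing the data.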

Users like the performance and high availability provided by the Bigtable implementation, and the fact that they can scale the capacity of their clusters by simply adding more machines as their resource demands change over time. Given the unusual interface to Bigtable, an interesting question is how difficult it has been for users to adapt to it. The importance of this Google project is enormous, since many other projects are built on the same basis as BigTable, which still today powers hundreds of Google applications and services. Its notable open-source implementations are HBase and Hypertable.

3.2.1.2. HyperTable
Hypertable is a high-performance distributed data storage system designed to support applications requiring maximum performance, scalability, and reliability. Modeled after Google's well-known Bigtable project, Hypertable is designed to manage the storage and processing of information on a large cluster of commodity servers, providing resilience to machine and component failures.

Figure 4 - An overview of a Hypertable deployment, illustration of a Hypertable deployment (information is stored on a large cluster of servers)

Hypertable seeks to set the open source standard for highly available, petabyte-scale database systems. Hypertable was developed as in-house software at Zvents Inc. In January 2009, Baidu, the leading Chinese-language search engine, became a project sponsor. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++.
Pros: The Hypertable Thrift interface provides seamless language support for Java, PHP, Ruby, Python, Perl, and more, so you can easily build applications in your favorite language. It supports indexing, caches updates in memory and periodically writes them to disk. It is designed to handle high-traffic web sites.
Cons: The disadvantages are the same as in most NoSQL systems: no transactions.

3.2.1.3. Hadoop Database (HBase)


HBase is an open-source, distributed, column-oriented store modeled after Google's Bigtable. HBase is an Apache project written in Java. A wide variety of companies and organizations use Hadoop for both research and production. HBase uses the Hadoop distributed file system in place of the Google file system. It puts updates into memory and periodically writes them out to files on disk. Updates go to the end of a data file, to avoid seeks. The files are periodically compacted. Updates also go to the end of a write-ahead log, to enable recovery if a server crashes. Row operations are atomic, with row-level locking and transactions. There is optional support for transactions with wider scope; these use optimistic concurrency control, aborting if there is a conflict with other updates. Partitioning and distribution are transparent; there is no client-side hashing or fixed keyspace as in some NoSQL systems. There is multiple-master support, to avoid a single point of failure. MapReduce support allows operations to be distributed efficiently.
Pros: Designed to efficiently store and manage large quantities of sparse data.
Cons: Not designed to store large amounts of binary data.

The support for transactions is attractive, and unusual for a NoSQL system, and it bears a large part of the responsibility for the continuous growth of HBase's popularity.
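The row-oriented put/get style of the HBase API can be sketched as follows. The shape of `put`/`row` is modeled on HBase's Thrift clients (such as happybase); `FakeTable` is a hypothetical in-memory stand-in for a live cluster connection, used here only to make the example self-contained:

```python
class FakeTable:
    """In-memory stand-in for an HBase table: row key -> {column: value}."""
    def __init__(self):
        self.rows = {}

    def put(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)

    def row(self, row_key):
        return self.rows.get(row_key, {})

def index_page(table, url, html, anchor_text):
    # Columns live inside column families ("family:qualifier"), and all
    # mutations to one row are atomic -- the row-level guarantee above.
    table.put(url, {b"contents:html": html, b"anchor:cnnsi.com": anchor_text})

table = FakeTable()
index_page(table, b"com.cnn.www", b"<html>...</html>", b"CNN")
assert table.row(b"com.cnn.www")[b"anchor:cnnsi.com"] == b"CNN"
```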

3.2.2. DataStore Group #2: Document Store


The document-oriented approach structures data similarly to documents, as its name indicates. Usually this approach does not use tables. The attributes of the items are dynamically inserted. Such documents share some common properties but may have different ones; in other words, the items stored in document-oriented databases may differ. As no table is declared, there is no rigid structure to store data in. The application is responsible for handling these dynamic properties. Here we will present two widely used systems: MongoDB and CouchDB.

3.2.2.1. MongoDB
MongoDB is emerging as one of the best among the new generation of alternatives to a typical relational database used as the back-end for a web application. It is an open-source document database designed to be easy to work with, fast, and very scalable. It is written in C++ and is a good fit for user profiles, sessions, product information, and all forms of Web content (blogs, wikis, comments, messages etc.). Its name comes from the word "humongous". MongoDB stores BSON, essentially a JSON document in an efficient binary representation (Binary JSON) with more data types. BSON documents readily persist many data structures, including maps, structs, associative arrays, and objects in any dynamic language, in binary format. MongoDB is also schema-free. In some ways MongoDB is closer to MySQL than to other so-called NoSQL databases. It has a query optimizer, ad-hoc queries, and a custom network layer. It also lets you organize documents into collections, similar to SQL tables, for speed, efficiency, and organization.

MySQL term      Mongo term
database        database
table           collection
index           index
row             BSON document
column          BSON field
join            embedding and linking
primary key     _id field

Table 2 - Comparison of Terms, inspecting similarities and/or differences between MySQL and MongoDB

Pros: Easy addition of fields whenever needed, without performing an expensive change on the database, makes it ideal for agile environments.
Cons: MongoDB has no version concurrency control and no transaction management. So if a client reads a document and writes a modified version back to the database, it may happen that another client writes a new version of the same document between the read and write operations of the first client.
MongoDB provides a lot of the features of a traditional RDBMS, such as secondary indexes, dynamic queries, sorting, rich updates, upserts (update if the document exists, insert if it doesn't), and easy aggregation. This gives you the breadth of functionality that you are used to from an RDBMS, with the flexibility and scaling capability that the non-relational model allows, which makes it one of the most powerful and widespread solutions today.
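To make the upsert semantics concrete, here is a toy in-memory model of "update if the document exists, insert if it doesn't". The real operation in MongoDB is issued through a driver (for example PyMongo's `update_one` with `upsert=True`); this sketch only mimics the behaviour over a plain list of dicts:

```python
def upsert(collection, filter_doc, new_fields):
    """Update the first matching document, or insert one if none matches."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in filter_doc.items()):
            doc.update(new_fields)
            return doc
    doc = dict(filter_doc, **new_fields)
    collection.append(doc)
    return doc

users = [{"name": "ada", "visits": 1}]
upsert(users, {"name": "ada"}, {"visits": 2})   # matches -> updates in place
upsert(users, {"name": "bob"}, {"visits": 1})   # no match -> inserts
assert users == [{"name": "ada", "visits": 2}, {"name": "bob", "visits": 1}]
```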

3.2.2.2. Apache CouchDB


Apache CouchDB is a schema-less, document-oriented database whose primary goal is to be a highly scalable, fault-tolerant database. CouchDB is written in Erlang. Documents can be either JSON objects or binary files, with versioning support. The primary API for storing data is a RESTful JSON API. In this model, functions written in JavaScript can be attached to the data to create custom views, in which documents are translated into table views so that indexing and querying can be done for business-intelligence purposes. A CouchDB document is an object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document would be a blog post:
"Subject": "I like Plankton" "Author": "Rusty" "PostedDate": "5/23/2006" "Tags": ["plankton", "baseball", "decisions"] "Body": "I decided today that I don't like baseball. I like plankton."

In the above example document, Subject is a field that contains the single string value "I like Plankton". Tags is a field containing the list of values "plankton", "baseball", and "decisions". A CouchDB database is a flat collection of these documents. Each document is identified by a unique ID.
Pros:
- Massively/horizontally scalable.
- Uses green threads (lightweight Erlang processes) rather than one OS thread per task.
- Designed to be fault tolerant.
- Available under the widely accepted Apache 2.0 open source license.
- Supports a MapReduce [3] system to generate custom views.
Cons:
- Users of this database might have to learn a new language, Erlang.
- It is not widely adopted.
This system is still emerging; it has been found attractive for research, and its variation Couchbase is used by some of the world's busiest Web applications, from social gaming companies like Zynga to more traditional companies like American Honda.
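CouchDB's RESTful API maps database operations onto plain HTTP verbs. The sketch below builds (but does not send) the request a client library would issue to store the blog post above; the host and port are the conventional defaults and the helper name is illustrative:

```python
import json

def put_document(db_name, doc_id, doc):
    """Build the HTTP request a CouchDB client would send to store a doc."""
    return {
        "method": "PUT",  # PUT /{db}/{docid} creates or updates a document
        "url": "http://localhost:5984/%s/%s" % (db_name, doc_id),
        "body": json.dumps(doc),
    }

post = put_document("blog", "post-1", {
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "Tags": ["plankton", "baseball", "decisions"],
})
assert post["method"] == "PUT"
assert json.loads(post["body"])["Author"] == "Rusty"
```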

3.2.3. DataStore Group #3: Key Value Store


The key/value store is the simplest NoSQL database approach. The data is stored in a hash: a unique identifier, the key, and the respective data, the value. The data is structured similarly to a dictionary. Insert, delete and update operations are applied to a given key. Compared to a relational model, the key/value approach is similar to a table with two columns, the difference being that the value column may store multivalue items. The examples presented here are SimpleDB provided by Amazon, Scalaris and Redis. These examples are, in our opinion, the most representative in this field.
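The dictionary analogy can be written down directly; a minimal key/value store supporting the three operations, with a multivalue item as the value (keys and fields are made up for illustration):

```python
store = {}

# insert: the value may itself be a structured, multivalue item
store["user:1001"] = {"name": "Ana", "tags": ["admin", "beta"]}
# update: replace or mutate the value under the same key
store["user:1001"]["tags"].append("vip")
# lookup: always by key -- there is no query over values
assert store["user:1001"]["name"] == "Ana"
# delete
del store["user:1001"]
assert "user:1001" not in store
```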

3.2.3.1. Amazon SimpleDB


Amazon SimpleDB is a web service to create and store multiple data sets, query the data easily, and return the results quickly. It is written in Erlang by Amazon.com and is a distributed database. The data gets automatically indexed, making it easy to quickly find the information that is needed. There is no need to pre-define a schema, or to change a schema if new data is added later. SimpleDB lets the client organize the data into domains, which can be compared with tables in relational databases, with the difference that a domain can contain a different set of attributes for each item. All attributes are byte arrays with a maximum size of 1024 bytes. Amazon SimpleDB automatically creates multiple geographically distributed copies of each stored data item, which provides high availability, durability and an efficient fail-over mechanism.
Pros: It is fast, flexible, offers on-demand scaling, is schemaless, simple to use, designed to work with other Amazon Web Services, and rather inexpensive. It supports a minimal string-based query syntax; however, this query syntax is proprietary.
Cons: No support for standard SQL. Currently there isn't any management console for SimpleDB. All tasks, including creating domains, entering data, data definition and all data manipulation, are done via a programming interface you create.
The authors' conclusion is that SimpleDB is a very useful solution and that such a management console should appear some day.

3.2.3.2. Scalaris
Scalaris is a scalable, transactional, distributed key-value store designed for Web 2.0 applications. It is implemented in the Erlang programming language. Its key feature is its ability to support distributed transactions. Scalaris uses an adapted version of the Chord service to expose a distributed hash table to clients. As it stores keys in lexicographic order, range queries on prefixes are possible. In contrast to other key-value stores, Scalaris has a strict consistency model, provides symmetric replication and allows for complex queries (via programming-language libraries). It guarantees ACID properties, also for concurrent transactions, by implementing an adapted version of the Paxos consensus protocol.
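Because Scalaris keeps keys in lexicographic order, prefix range queries fall out naturally. A sketch of the idea over a sorted key space (the keys are invented for illustration; Scalaris itself does this over a distributed ring, not a local list):

```python
import bisect

keys = sorted(["user:alice", "user:bob", "order:17", "order:18", "user:carol"])

def range_query(sorted_keys, prefix):
    """Return all keys sharing a prefix, exploiting the sorted order."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after any realistic key character, closing the range.
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

assert range_query(keys, "order:") == ["order:17", "order:18"]
assert range_query(keys, "user:") == ["user:alice", "user:bob", "user:carol"]
```

A hash-partitioned store cannot answer such a query without contacting every node, since hashing destroys key locality.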


The system comprises three layers, all of them implemented in Erlang:

Figure 5 - Layers in the Scalaris architecture, view of the three layers of the Scalaris system

Pros:
- Clients can connect to this system using JDBC, Erlang or plain HTTP.
Cons:
- Does not provide persistence (the ability to recover lost data when a system crashes).
- Still immature.
Given that this system provides advanced transaction handling, the authors' opinion is that it may some day reach wider usage and popularity.

3.2.3.3. Redis
The Redis key-value data store started as a one-person project, but now has multiple contributors and is BSD-licensed open source. It is written in C. A Redis server is accessed via a wire protocol implemented in various client libraries (which must be updated when the protocol changes). The client side does the distributed hashing over servers. The servers store data in RAM, but data can be copied to disk for backup or system shutdown. A system shutdown may be needed to add more nodes. Like the other key-value stores, Redis implements insert, delete and lookup operations.

Pros: It allows lists and sets to be associated with a key, not just a blob or string, and it includes list and set operations.
Cons: The amount of main memory limits the amount of data that can be stored; this cannot be expanded by the use of hard disks.

Redis is a relatively new datastore, so its developers can still enhance its capabilities and build on the advantage they gain from supporting lists and sets.
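The list and set operations that distinguish Redis from plain key/value blobs can be sketched in memory. The function names mirror Redis commands (RPUSH, LRANGE, SADD); this is only a semantic model, not the wire protocol or a real client:

```python
data = {}

def rpush(key, *values):
    """Append values to the list stored at key (like Redis RPUSH)."""
    data.setdefault(key, []).extend(values)

def lrange(key, start, stop):
    """Return list elements from start to stop, inclusive (like LRANGE)."""
    return data.get(key, [])[start:stop + 1 or None]  # stop=-1 -> to the end

def sadd(key, *members):
    """Add members to the set stored at key (like Redis SADD)."""
    data.setdefault(key, set()).update(members)

rpush("queue", "job1", "job2", "job3")
sadd("online", "ana", "bo")
assert lrange("queue", 0, 1) == ["job1", "job2"]
assert "ana" in data["online"]
```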

3.2.4. DataStore Group #4: Eventually Consistent Key Value Store


These systems can also be considered key-value stores, but they are eventually consistent. The main idea behind eventual consistency is that you sacrifice the ability of all nodes to see exactly the same thing at any given time (consistency), but in return you can tolerate network partitions and you retain availability. This relates directly to the CAP theorem, which states that you can only get two of: Consistency, Availability, and tolerance to network Partitions. So we throw out C in order to get rid of those nasty distributed locking algorithms, but in return we take on eventual consistency.

3.2.4.1. Amazon Dynamo

Amazon Dynamo is a distributed key-value storage system that is used internally at Amazon. The system is designed to work in a network of commodity-hardware nodes, where it is presumed that each node has the same responsibilities. The system also assumes that every network node and connection can fail at any time. In order to protect against a potential disaster that could destroy complete datacenters, every key-value pair is replicated, with geographical distribution, over several datacenters around the world. Dynamo uses optimistic replication with multiversion concurrency control (MVCC) to achieve a type of eventual consistency. To meet all these requirements, Dynamo utilizes a mix of existing technologies from distributed databases, peer-to-peer networks and distributed file systems, such as consistent hashing for replication and key partitioning.
Pros:
- Successful in handling server failures, datacenter failures and network partitions.
- Simple key/value interface.
- Efficient in its resource usage.
Cons:
- It is still not easy to scale out databases or use smart partitioning schemes for load balancing.
- Dynamo is used only by Amazon's internal services. Its operating environment is assumed to be non-hostile, and there are no security-related requirements such as authentication and authorization.
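Dynamo's use of consistent hashing for partitioning and replication can be sketched as a hash ring where a key is stored on the next N nodes clockwise from its position. This is a simplification of the scheme in the Dynamo paper (no virtual nodes, no hinted handoff); node names are invented:

```python
import hashlib

def h(s):
    """Position of a string on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def preference_list(key, nodes, n_replicas=3):
    """The first n_replicas nodes clockwise from the key's ring position."""
    ring = sorted(nodes, key=h)
    positions = [h(n) for n in ring]
    # Find the first node at or past the key; wrap to 0 if none.
    start = next((i for i, p in enumerate(positions) if p >= h(key)), 0)
    return [ring[(start + i) % len(ring)] for i in range(n_replicas)]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
replicas = preference_list("cart:42", nodes)
assert len(set(replicas)) == 3                        # three distinct nodes
assert preference_list("cart:42", nodes) == replicas  # deterministic routing
```

Unlike modulo hashing, adding or removing one node here only remaps the keys adjacent to it on the ring.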

3.2.4.2. Cassandra
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed-systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. Cassandra was initially developed at Facebook to support their Inbox Search feature. It was later released as an open source project on Google Code, and finally it became an Apache Incubator project.
Pros:
- Supports multiple client APIs in various languages like Python, Ruby, PHP etc.
- Highly configurable (latency vs. consistency).
Cons:
- It lacks transactional support.
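The "latency vs. consistency" dial comes down to choosing read and write quorum sizes: with N replicas, requiring R + W > N guarantees that every read quorum overlaps the most recent write quorum. This is a sketch of the rule itself, not of Cassandra's API:

```python
def is_strongly_consistent(n_replicas, r, w):
    """A read is guaranteed to see the latest write iff the read quorum (r)
    and write quorum (w) must overlap in at least one replica."""
    return r + w > n_replicas

# N=3, QUORUM reads + QUORUM writes: overlap guaranteed, higher latency.
assert is_strongly_consistent(3, r=2, w=2)
# N=3, ONE read + ONE write: quorums may miss each other -> fast, but
# only eventually consistent.
assert not is_strongly_consistent(3, r=1, w=1)
```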

3.2.4.3. Voldemort
Voldemort is a distributed key/value storage system written in Java that is used by LinkedIn for highly scalable storage. Key features include:
- Data is automatically replicated over multiple servers.
- Data is automatically partitioned, so each server contains only a subset of the total data.
- Server failure is handled transparently.
- Pluggable serialization is supported to allow rich keys and values, including lists and tuples with named fields, as well as integration with common serialization frameworks like Protocol Buffers, Thrift, Avro and Java Serialization.
- Data items are versioned to maximize data integrity in failure scenarios without compromising the availability of the system.
- Each node is independent of other nodes, with no central point of failure or coordination.
- Good single-node performance: you can expect 10-20k operations per second depending on the machines, the network, the disk system, and the data replication factor.
- Support for pluggable data-placement strategies, to enable things like distribution across data centers that are geographically far apart.
Pros:
- Automatic data partitioning and replication across multiple systems, delivering horizontal scalability and failover.
- Pluggable serialization support, allowing any type of value to be stored under a given key.
- Both read and write operations can scale horizontally.
- Can store structured data as well as BLOB and text data.
Cons:
- Does not support SQL or normalized data; accordingly, applications need to be customized to use this system.
- Lacks data-mining capabilities, which limits the scope of its deployments to internet web applications.
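The versioning used to detect conflicting updates in such systems is typically a vector clock. A minimal sketch of comparing two versions (the helper is hypothetical and illustrates the technique, not Voldemort's Java API):

```python
def compare(vc_a, vc_b):
    """'before', 'after', 'equal' or 'concurrent' for two vector clocks,
    represented as dicts mapping node id -> counter."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and not b_le_a:
        return "before"   # a happened before b; b safely supersedes it
    if b_le_a and not a_le_b:
        return "after"
    # Neither dominates: concurrent writes that the client must reconcile.
    return "equal" if (a_le_b and b_le_a) else "concurrent"

assert compare({"s1": 1}, {"s1": 2}) == "before"
assert compare({"s1": 2, "s2": 1}, {"s1": 1, "s2": 2}) == "concurrent"
```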

3.2.5. DataStore Group #5: Graph Databases


Graph databases are a type of datastore that uses a graph structure, with nodes and edges, to represent and store data. Compared with relational databases, graph databases are often faster for associative data sets, and they map more directly onto the structure of object-oriented applications. They can scale more naturally to large data sets, as they do not typically require expensive join operations.

3.2.5.1. Neo4J
Neo4j is an open source graph database implemented in Java. Neo4j offers a disk-based, native storage manager completely optimized for storing graph structures, for maximum performance and scalability. Neo4j can handle large graphs, measured at up to several billion nodes, stored on a single machine or sharded across multiple machines. Neo4j is really powerful when you need to solve problems that demand repeated probing throughout the network. You can bundle up a query in a traversal object that will scan through multiple connected nodes to find the answer: it repeatedly fetches one record, then uses that information to search again, and again. By contrast, a traditional database would require a separate query for each step through the search.
Pros:
- Documentation is good.
- Quite good in cases involving deep searching through networks; for such traversals it can be orders of magnitude faster than a relational database.
- Some nice subprojects, addons, and tools have appeared in its fertile open source community.
Cons:
- Searching for a particular node with a particular attribute is better handled by some other graph databases.
- Implementing a project requires some forethought, much like the design work that goes into planning a schema for a relational database.
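The "repeated probing" that a traversal replaces can be sketched as a breadth-first walk over an adjacency map. Neo4j exposes this through its traversal API; the plain-Python version below (with an invented friendship graph) only illustrates the idea that one traversal replaces a chain of per-step queries:

```python
from collections import deque

friends = {
    "ana": ["bo", "cy"],
    "bo":  ["dee"],
    "cy":  ["dee", "ed"],
    "dee": ["fay"],
    "ed":  [],
    "fay": [],
}

def reachable_within(graph, start, depth):
    """All nodes within `depth` hops: one traversal, no per-step queries."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if d == depth:
            continue  # do not expand beyond the requested depth
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return seen - {start}

assert reachable_within(friends, "ana", 1) == {"bo", "cy"}
assert reachable_within(friends, "ana", 2) == {"bo", "cy", "dee", "ed"}
```

In a relational schema, each additional hop would typically mean another self-join or another round-trip query.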

3.2.6. DataStore Group #6: Object Databases


Object databases are a type of database in which objects are used to represent and store the data. The term object-oriented database system goes back to around 1985, though the first research developments in this area started a bit earlier, in the mid-1970s. The first commercial object database management systems appeared in the early 1990s. They added a concept of persistence to object-oriented languages. A second growth wave was observed in the early 2000s, when object databases written completely in an object-oriented language appeared on the market.

3.2.6.1. db4O
db4o (database for objects) is an embeddable open source object database for Java and .NET developers. db4o is not built as a server system but as a library. This enables some important features of db4o, such as a very small memory footprint [db4oOSODB], which, incidentally, makes db4o very usable on mobile devices. The db4o architecture is layered into a server part and a client part. db4o has implemented its own caching algorithm; cache entries are based on a hash code, and the cache itself is organized as an efficient tree structure, as are the indexes, which are built as fast B-trees. db4o is only an object database and not an object database management system; it is not standalone, which means that everything goes through an application. Therefore the application can fully control authentication and authorization as needed. db4o supports the ACID model.
Pros:
- Small memory footprint.
- Fast inserts and updates.
Cons:
- Queries need indexes, or performance degrades very quickly.
- Debugging of object databases is harder than with an RDBMS, because the information is not chopped up.
- Without proper administration, performance can degrade extremely quickly.
- Logging could be better.

3.2.7. DataStore Group #7: XML Databases


XML databases are a type of database that stores data in XML format. This data can then be queried, exported and serialized into the desired format. The formal definition from the XML:DB initiative (which appears to have been inactive since 2003) states that a native XML database:
- Defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order. Examples of such models include the XPath data model, the XML Infoset, and the models implied by the DOM and the events in SAX 1.0.
- Has an XML document as its fundamental unit of (logical) storage, just as a relational database has a row in a table as its fundamental unit of (logical) storage.
- Need not have any particular underlying physical storage model. For example, NXDs can use relational, hierarchical, or object-oriented database structures, or use a proprietary storage format (such as indexed, compressed files).


3.2.7.1. eXist
eXist is an open source database management system built entirely on XML technology; it uses XQuery to manipulate its data. eXist-db provides a powerful environment for developing web applications based on XQuery and related standards: entire web applications can be written in XQuery, using XSLT, XHTML, CSS, and JavaScript (for AJAX functionality). XQuery server pages can be executed from the filesystem or stored in the database. Queries are accelerated by indexes that eXist creates transparently for its important internal components: the system builds B+-tree indexes for elements and words, and it creates indexes for collections (mapping collection paths to collection objects) and DOM objects (for rapid location of a document's nodes). eXist provides an array of mechanisms for manipulating the database. The REST (Representational State Transfer)-style interface allows you to access data through simple HTTP requests; more elaborate operations can be performed with POST requests, which expect an XUpdate document or a query request as an XML fragment in the body of the request.

Pros:

- Using eXist-based tools you can create entire web applications using just XQuery.
- Very good documentation.
- Very active and dedicated community.
- Easy to install and to run.

Cons:

- eXist does not boast the same high performance as other, more proven relational databases.
- eXist is a work in progress: some parts of the architecture are not complete and not reliable.
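The REST-style interface described above can be exercised with any HTTP client. The sketch below merely builds a query URL using eXist's `_query` request parameter; the host, port, and collection path are assumptions for illustration, so consult the eXist documentation for your installation's actual endpoint.

```python
from urllib.parse import urlencode

# Hypothetical eXist endpoint -- adjust host/port/collection for a real install.
base = "http://localhost:8080/exist/rest/db/library"

# eXist's REST interface accepts an XQuery expression in the _query
# parameter; _howmany limits the number of results returned.
params = urlencode({
    "_query": "//book[year > 2005]/title",
    "_howmany": 10,
})
url = f"{base}?{params}"
print(url)

# To actually run the query (requires a running eXist server):
#   import urllib.request
#   with urllib.request.urlopen(url) as resp:
#       print(resp.read().decode())
```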

3.2.7.2. MarkLogic Server


MarkLogic Server is a "content server": a combination of a native XML database with a fully featured implementation of the W3C XQuery language that allows XML content to be searched, queried, updated, and transformed in many different ways. MarkLogic uses documents, written in XML, as its core data model. It features:

- Schema-independent loading and querying of XML content, stored in native XML format.
- A full implementation of XQuery.
- Built-in HTTP and WebDAV servers; a Java connector; email processing support.
- A rich security model.
- Many options for full-text indexing and searching, including stemmed searching (in all supported languages) and user-created thesauri.

Pros:

- Quite scalable and performant.
- Supports clustering across multiple servers.
- Additional XQuery extensions supporting a multitude of additional functionality.
- A proven track record of deployments at customers with large-scale XML content.


Cons:

- MarkLogic is a much smaller company than Oracle, so its achievements are not as widely known.

4. Trends and optimal solutions for the future


The figure shown below illustrates that the demand for experts in NoSQL databases keeps growing.

Figure 6 - NoSQL job trends, graphic illustration of emerging need for NoSQL experts

NoSQL has challenged the supremacy of the RDBMS, and we now have a situation in which new ideas for RDBMSs are emerging while, at the same time, new NoSQL systems come onto the scene at a rate of roughly one a week. How did we arrive at this situation? From a business perspective, you will probably find use cases where storing your data in a relational database doesn't make much sense, and you'll start looking for ways to get it out of the database: think, for example, of storing log data, collecting historical data, or recording page impressions. A NoSQL database can be a viable solution for scenarios where you discover that your data doesn't really fit the relational model. Working with something like MongoDB or CouchDB will give you a good idea of what NoSQL is about, as MongoDB sits halfway between a relational and a NoSQL database, while CouchDB represents an entirely different way of thinking. If all you're looking for is scale, have a look at Cassandra, which follows a particularly interesting model of scaling out. You need to be aware that you'll meet a different data model, one which brings great power and flexibility. You'll find that most of the tools in the NoSQL landscape have removed any kind of transactional facilities for the benefit of simplicity, making it a lot easier to scale out. We may not realize that transactions are not always needed, which is not to say they're totally unnecessary;


it's merely that their lack is often not really a problem. As for querying, for the most part you're saying goodbye to ad-hoc queries: most NoSQL databases have removed the means to run arbitrary dynamic queries on your data, MongoDB being the noteworthy exception here. All of these NoSQL engines are new and, because of that, they are short on features; they are not yet polished and comfortable to use. One of the things that pushed this technology forward is the appearance this year (2011) of a new query language called UnQL (Unstructured Query Language). Developed by the same team that developed CouchDB and SQLite, it is an open query language for JSON, semi-structured, and document databases: basically an SQL-like language for NoSQL systems. Given the advantages MongoDB has, and knowing that it is the easiest path of transition from SQL and relational databases while remaining powerful, the authors' opinion is that it might be worth creating a Mongo-like system that also provides transactional operations, as some key-value stores do (e.g. Scalaris). A possible scenario, however, is that because of its popularity and stability SQL will always be the better solution, which is why many of these systems are made to resemble SQL. We therefore predict that, as they mature, NoSQL engines will change into "NearlySQL" engines.
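MongoDB's ad-hoc querying works by passing a filter document to `find()`. The sketch below emulates a small subset of that operator syntax in plain Python so it runs without a server; the sample log data and the supported operators (equality and `$gt` only) are illustrative assumptions. With pymongo, the same filter dictionary would be passed as `collection.find(query)`.

```python
# In-process emulation of a small subset of MongoDB's query-document
# syntax, to illustrate what an "ad-hoc query" looks like in a document store.
logs = [
    {"level": "error", "code": 500, "msg": "boom"},
    {"level": "info",  "code": 200, "msg": "ok"},
    {"level": "error", "code": 503, "msg": "down"},
]

def matches(doc, query):
    """True if doc satisfies the query document (supports equality and $gt)."""
    for field, cond in query.items():
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 500}
            if "$gt" in cond and not (field in doc and doc[field] > cond["$gt"]):
                return False
        elif doc.get(field) != cond:  # plain equality
            return False
    return True

def find(collection, query):
    return [d for d in collection if matches(d, query)]

# The equivalent pymongo call would be: db.logs.find(query)
result = find(logs, {"level": "error", "code": {"$gt": 500}})
print(result)  # [{'level': 'error', 'code': 503, 'msg': 'down'}]
```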

5. Conclusion
There is a misperception that anyone who advocates a non-relational database either doesn't understand SQL optimization or is simply a hater. This is not the case. It is reasonable to seek a new tool for a new problem, and database problems have changed with the rise of web-scale distributed systems. This does not mean that SQL as a general-purpose runtime and reporting tool is going away; however, at web scale it is more flexible to separate the concerns. NoSQL is not a substitute for traditional relational database management systems: each kind of database suits different needs, and that is why each solution must be evaluated for each application. That said, we believe NoSQL databases are very good for applications that deal with large amounts of data and have a few main entities associated with many secondary entities. NoSQL databases still have significant technical drawbacks. These include:

- Transactional support and referential integrity. Applications using cloud databases are largely responsible for maintaining the integrity of transactions and the relationships between tables.
- Complex data accesses. The ORM pattern, and cloud databases, excel at single-row transactions (get a row, save a row, and so on). However, most non-trivial applications do have to perform joins and other operations.
- Business intelligence. Application data has value not only in powering applications, but also as information that drives business intelligence. The dilemma of the pre-relational database, in which valuable business data was locked inside impenetrable application data stores, is not something to which business will willingly return.
- Data integrity. Some systems offer eventual consistency, but such systems offer neither ACID support nor the data reliability typically associated with relational data.

Cloud databases could displace the relational database for a significant segment of next-generation, cloud-enabled applications. However, business is unlikely to be enthusiastic about an architecture that prevents application data from being leveraged for BI and decision-support purposes. An architecture that delivered the scalability and other advantages of cloud databases without sacrificing information management would therefore be very appealing.

The final conclusion, and our advice on how to proceed, depends on what you really need to do. Frequently written, rarely read statistical data (for example, a web hit counter) should use an in-memory key/value store like Redis, or an update-in-place document store like MongoDB. Big Data (like weather statistics or business analytics) will work best in a free-form, distributed system like Hadoop. Binary assets (such as MP3s and PDFs) find a good home in a datastore that can serve content directly to the user's browser, like Amazon S3. If you need to replicate your data set to multiple locations (such as syncing a music database between a web app and a mobile device), you will want the replication features of CouchDB. High-availability applications, where minimizing downtime is critical, will find great utility in the automatically clustered, redundant setup of datastores like Cassandra.
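The hit-counter case above maps directly onto Redis's atomic INCR command. As a minimal sketch, the class below is a self-contained in-memory stand-in for that counter pattern so it runs without a server; the equivalent redis-py calls are shown in the docstring, and the key name is an illustrative assumption.

```python
from collections import defaultdict
from threading import Lock

class CounterStore:
    """Minimal in-memory stand-in for Redis-style INCR/GET counter semantics.

    With a real Redis server the equivalent redis-py calls would be:
        r = redis.Redis()
        r.incr("hits:/index.html")
        r.get("hits:/index.html")
    """
    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = Lock()  # Redis commands are atomic; emulate that here

    def incr(self, key):
        with self._lock:
            self._counts[key] += 1
            return self._counts[key]

    def get(self, key):
        return self._counts[key]

store = CounterStore()
for _ in range(3):
    store.incr("hits:/index.html")   # one call per page view
print(store.get("hits:/index.html"))  # 3
```

This write-heavy, read-light access pattern is exactly where a key/value store beats issuing an UPDATE against a relational table for every page view.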

6. References
[1: Web Paper] Natarajan, R., A Survey Report on databases designed for Cloud, http://blogs.oracle.com/natarajan/entry/a_survey_report_on_nosql, 24/11/2011
[2: Journal] Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E., Bigtable: A Distributed Storage System for Structured Data, ACM Transactions on Computer Systems (TOCS), Volume 26, Issue 2, June 2008
[3: M.Sc. Thesis] Orend, K., Analysis and Classification of NoSQL Databases and Evaluation of their Ability to Replace an Object-relational Persistence Layer, Fakultät für Informatik, Technische Universität München
[4: Web Paper] Cattell, R., Scalable SQL and NoSQL Data Stores, http://www.cattell.net/datastores/, 24/11/2011
[5: Book] Strauch, C., NoSQL Databases, Hochschule der Medien, Stuttgart, Germany, 2010
[6: Conference] Franco, M., Nogueira, M., Using NoSQL Database to Persist Complex Data Objects, Instituto de Informática, Universidade Federal de Goiás (UFG), VIII Seminário de Pós-Graduação da UFG - Mestrado, 2011
[7: Web Paper] Hypertable, About HyperTable, http://hypertable.org/about.html, 24/11/2011


[8: Web Paper] MongoDB, Agile and Scalable, http://www.mongodb.org/, 24/11/2011
[9: Web Paper] Apache CouchDB, Introduction, http://couchdb.apache.org/docs/intro.html, 24/11/2011
[10: Web Paper] DBPedias, http://dbpedias.com, 24/11/2011
[11: Web Paper] Amazon, Amazon SimpleDB (beta), http://aws.amazon.com/simpledb/, 23/11/2011
[12: Web Paper] Google, Scalaris, a distributed key-value store, http://code.google.com/p/scalaris/, 23/11/2011
[13: Web Paper] Rees, R., NoSQL, no problem: An introduction to NoSQL databases, http://www.thoughtworks.com/articles/nosql-comparison, 22/11/2011
[14: Web Paper] CouchBase, Inc., NoSQL Database Technology, http://www.couchbase.com/, 24/11/2011
[15: Web Paper] Watters, A., Cassandra: Predicting the Future of NoSQL, http://www.readwriteweb.com/cloud/2010/07/cassandra-predicting-the-futur.php, 22/11/2011
[16: Web Paper] Wayner, P., Neo4j review, http://review.techworld.com/applications/3213054/neo4jreview/, 18/2/2011
[17: Web Paper] Day, E., Eventually Consistent Relational Database, http://oddments.org/?p=176, 5/12/2011
[18: Web Paper] Hauser, P., Review of db4o from db4objects, http://wiki.hsr.ch/Datenbanken/files/25.db4oReview.pdf, 5/12/2011
[19: Web Paper] Grehan, R., XQuery takes center stage in eXist database, http://www.infoworld.com/d/data-management/xquery-takes-center-stage-in-exist-database-166, 5/12/2011
[20: Web Paper] Hunter, J., Inside MarkLogic Server, http://www.odbms.org/download/inside-marklogicserver.pdf, 5/12/2011
[21: Web Paper] Kanaracus, C., MarkLogic ties its database to Hadoop for big data support, http://www.infoworld.com/d/business-intelligence/marklogic-ties-its-database-hadoop-big-data-support-177660, 5/12/2011

