
NoSQL data storage

Fundamentals of big data hardware and software technologies


Álvaro Méndez Civieta

Contents
1 Introduction
2 What is a NoSQL database
2.1 Easy scale-out
2.2 Lack of structure
3 Types of NoSQL databases
3.1 Key-Value database
3.2 Document database
3.3 Graph database
3.4 Column-based database
4 NoSQL vs SQL
5 Conclusion
6 Bibliography

1 Introduction

Many companies and web services rely on databases every day. For the last
few decades, the most common type has been the relational database.

A relational database is a collection of data organized into a set of related
tables that are defined in advance. Each table holds one or more attributes
arranged in columns, and each row holds the information of a single element
of the data. The information in these databases is queried with SQL
(Structured Query Language), which is based on relational algebra.
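
As a minimal sketch of this idea, the following Python snippet uses the standard sqlite3 module with an in-memory database. The table name "clients", its columns and the sample rows are assumptions made up for this illustration; they do not come from any real system.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The structure (table and columns) must be defined before inserting data.
conn.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
)

# Each row holds the information of one element of the data.
conn.executemany(
    "INSERT INTO clients (id, name, balance) VALUES (?, ?, ?)",
    [(1, "Ana", 1200.0), (2, "Luis", 350.5), (3, "Marta", 80.0)],
)

# Queries are written in SQL.
rows = conn.execute(
    "SELECT name FROM clients WHERE balance > ? ORDER BY name", (100,)
).fetchall()
names = [name for (name,) in rows]
print(names)
```

Trying to insert into a table or column that was never declared fails, which is the "previously defined" structure the paragraph above refers to.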

This kind of database appeared several decades ago. If we think of a
company that needed a database at that time, for example a bank, it is easy
to see that it expected two basic requirements from the database:
They needed the data to be consistent, in the sense that all clients have the same view of the data.
They needed the data to be available, in the sense that all clients can always read and write.


To ensure consistency, all operations performed on the database
(insertions, modifications, etc.) must be controlled by a single machine.
When only a small number of operations run at the same time this is not a
problem, which is why this kind of database worked well for so many years.
Let's go back to the present and think about the quantity of data that is
being generated nowadays, for example:
Facebook has more than 900 million users.
Every minute, 50 hours of video are uploaded to YouTube.
Twitter generates 8 TB of tweets per day.

With the arrival of Web 2.0, not only companies but everyone can upload data
to the web, and the exponential growth of data has started to cause problems
for managing all the information stored in relational databases.
Because all operations on the database are controlled by a single machine to
ensure consistency, this machine becomes a bottleneck as the number of
simultaneous operations grows, slowing down the whole database. A first
approach to this problem is to scale up (or vertically) the machine by buying
better components, such as a faster processor, but this is expensive and does
not fully solve the problem.
Big web companies like Google, Facebook or Amazon realized that, for them,
performance and processing speed were most of the time more important than
consistency, and they started to develop new types of databases to address
the situation. Those databases are known as NoSQL.

2 What is a NoSQL database

NoSQL is an umbrella term that covers many different databases, as shown
later, but there are a few basic properties that a database must satisfy to
be considered NoSQL:
It does not use the SQL language, or not only SQL (for this reason these databases are also known as "Not only SQL").
It scales out (or horizontally) easily.
It lacks a predefined structure.

2.1 Easy scale-out

Consider the CAP theorem, which states that a distributed database can
guarantee at most 2 of the following 3 characteristics at the same time:
Consistency.
Availability.
Partition tolerance.

Relational databases are designed to ensure consistency and availability,
but not partition tolerance. The main problem of a relational database is
that, to ensure consistency, one machine controls all the changes made to
the database, and this machine becomes a bottleneck.
The solution to this problem is to build a database that can work in
parallel. This way you gain speed and there is no bottleneck. If you need
more processing power, it is just a matter of adding more machines to the
cluster.
The downside of this solution is that partition tolerance must be ensured.
Looking at the CAP theorem again, this implies that we have to give up
either availability or consistency.
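
One simple way to spread data over several machines can be sketched as follows. This is only an illustration under stated assumptions: the node names and the helpers node_for and distribute are hypothetical, and the placement rule (a hash of the key modulo the number of nodes) is just one of several possible schemes.

```python
import hashlib


def node_for(key: str, nodes: list[str]) -> str:
    """Pick the node responsible for a key from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


keys = [f"user:{i}" for i in range(1000)]
three_nodes = distribute(keys, ["node-a", "node-b", "node-c"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

# Adding a node lowers the number of keys each machine has to handle.
print({node: len(ks) for node, ks in three_nodes.items()})
print({node: len(ks) for node, ks in four_nodes.items()})
```

With this modulo rule, adding a node changes the placement of most keys, so real systems usually prefer schemes such as consistent hashing that move less data.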
Partition tolerance and consistency

In a system that is partition tolerant and consistent, if something fails,
part of the information will not be accessible, but the rest will be, and it
will be consistent.
To ensure consistency, the usual solution is to replicate the data. When you
add (or modify, etc.) information on one server, the same information is
added to other servers of the database, and the new information is not
considered consistent until it has been replicated this way.
If one server fails, all the data can be restored from the replicas. If
several servers fail, we might lose part of the data if all of its replicas
were stored on the failed servers.
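
The replication scheme described above can be sketched with a toy model. The classes Server and ReplicatedStore, the "copies" parameter and the placement rule are assumptions invented for this example; they do not reproduce how any particular database works.

```python
class Server:
    """A toy storage server holding a dictionary of key -> value."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Writes every value to `copies` servers and reads from any live one."""

    def __init__(self, servers, copies=2):
        self.servers = servers
        self.copies = copies

    def _replicas(self, key):
        # Deterministic placement: `copies` consecutive servers, starting
        # at a position derived from the bytes of the key.
        start = sum(key.encode("utf-8")) % len(self.servers)
        return [
            self.servers[(start + i) % len(self.servers)]
            for i in range(self.copies)
        ]

    def write(self, key, value):
        # The write is complete only once every replica has stored it.
        for server in self._replicas(key):
            server.data[key] = value

    def read(self, key):
        # Any live replica that holds the key can answer.
        for server in self._replicas(key):
            if server.alive and key in server.data:
                return server.data[key]
        raise KeyError(key)


store = ReplicatedStore([Server("s0"), Server("s1"), Server("s2")], copies=2)
store.write("account:42", 100)
first, second = store._replicas("account:42")
first.alive = False  # one server fails
print(store.read("account:42"))  # still answered by the second replica
```

If the second replica were also marked as failed, read would raise KeyError, which mirrors the case where every server holding a piece of data is lost.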
This kind of database must perform many writes for each modification (the
modification of the data itself, plus the modification of the replicas), so
to keep the system fast, it can be configured to store the modifications in
memory, which is much faster than persistent storage, and to flush them to
persistent storage after a certain period of time.
However, this is not strict consistency, because if an error occurs we can
lose the modifications that are in memory but have not been flushed yet.
Together with replicas that are updated with some delay, this relaxed
guarantee is known as eventual consistency.
To partially solve this, these databases usually implement write-ahead
logging, which means that all the modifications are written to a log before
they are applied, so if an error occurs we can restore the data using the log.
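
A write-ahead log can be sketched in a few lines. The class WalStore, the JSON-lines log format and the file name are assumptions chosen for this illustration; real databases use more compact formats and also truncate the log periodically.

```python
import json
import os
import tempfile


class WalStore:
    """Toy key-value store: changes go to a log file first, then to memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memory = {}
        self._replay()

    def _replay(self):
        # Rebuild the in-memory state from the log after a restart or crash.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.memory[entry["key"]] = entry["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the fast in-memory structure.
        self.memory[key] = value


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = WalStore(log_path)
store.put("balance", 100)
store.put("balance", 75)

# Simulate a crash: the in-memory state is gone, but the log survives.
recovered = WalStore(log_path)
print(recovered.memory)
```

Because the log entry is written before the in-memory update, replaying the log in order reproduces the last state that was acknowledged.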
Partition tolerance and availability

In a system that is available and partition tolerant, if one of the nodes
fails the information will still be available, but it may not be consistent.
This happens when the nodes of a partitioned system keep answering requests
without first synchronizing their copies of the data.
In general, the most famous NoSQL databases are consistent (or eventually
consistent) and partition tolerant.

2.2 Lack of structure

One characteristic of NoSQL databases is the lack of a predefined structure.
In relational databases we had a clear tabular structure that had to be
defined before adding any information to the database, but here there is no
such requirement.
This has advantages and disadvantages:
The advantage is clear. As the system is not responsible for the structure of the data, adding new information is much faster than in relational databases. It is said that NoSQL databases can ingest almost anything.
The disadvantage is that this lack of a clear predefined structure may cause a loss of data, in the sense that we are free to use different structures to add the same type of data, but when we try to retrieve that data we can "forget" some of the structures used and retrieve only part of the information.
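
The disadvantage can be made concrete with a small sketch. The records, field names and helper functions below are hypothetical; they only illustrate how a query written for one structure silently misses data stored with another.

```python
# Two records of the same kind, stored with different structures.
people = [
    {"name": "Ana", "phone": "555-0101"},
    {"name": "Luis", "contact": {"phone": "555-0102"}},
]


def phones_naive(records):
    """Only knows about the top-level "phone" field."""
    return [r["phone"] for r in records if "phone" in r]


def phones_all_shapes(records):
    """Knows both structures that were used when inserting."""
    found = []
    for r in records:
        if "phone" in r:
            found.append(r["phone"])
        elif "phone" in r.get("contact", {}):
            found.append(r["contact"]["phone"])
    return found


print(phones_naive(people))       # misses the nested phone number
print(phones_all_shapes(people))  # recovers both
```

Nothing is actually deleted in the first case; the information is still stored, but the query no longer reaches it.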

3 Types of NoSQL databases

Depending on the data structure a database uses, we can classify NoSQL
databases into 4 main groups:

3.1 Key-Value database

In this type of database each element is identified by a unique "key"
associated with it. This allows fast retrieval of the information when it is
needed, just by asking for the data associated with a given key. It is a very
efficient system for reads and writes.
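
The access pattern can be sketched with a toy in-memory store. The class KeyValueStore and the sample key are hypothetical; the sketch relies on a Python dictionary (a hash table) and leaves out distribution and persistence.

```python
class KeyValueStore:
    """Toy key-value store backed by a Python dictionary."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # Insert or overwrite the value stored under the key.
        self._items[key] = value

    def get(self, key, default=None):
        # Retrieval needs only the key, never the content of the value.
        return self._items.get(key, default)

    def delete(self, key):
        # Deleting a missing key is not an error in this sketch.
        self._items.pop(key, None)


store = KeyValueStore()
store.put("session:9f2c", {"user": "ana", "cart": ["book", "pen"]})
print(store.get("session:9f2c"))
```

The store treats the value as opaque: it can look an element up by its key, but it cannot answer questions about what is inside the value.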

Apache Cassandra

Apache Cassandra is one of the best-known examples of this type of database.
Initially developed by Facebook, Cassandra is an open source, distributed
NoSQL database, based on the key-value structure and written in Java. It
ensures the availability of the data and works with asynchronous masterless
replication based on a P2P configuration, ensuring low latency and good
speed. One of the companies that has used Cassandra is Twitter.

3.2 Document database

In this type of database, information is stored as a document, usually in
the JSON (JavaScript Object Notation) or BSON (Binary JSON) format, and a
unique key is used for each record. This structure allows key-value queries
over whole documents, but also more complex queries on the content of the
documents, which makes these the most versatile databases.
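
Both kinds of query can be sketched with plain Python dictionaries standing in for JSON documents. The collection, the field names and the helpers get_by_key and find are assumptions made for this illustration, not the API of any document database.

```python
import json

# A tiny "collection": unique key -> JSON-like document.
documents = {
    "p1": {"title": "Laptop", "price": 900, "tags": ["computers", "office"]},
    "p2": {"title": "Desk lamp", "price": 25, "tags": ["office"]},
    "p3": {"title": "Headphones", "price": 60, "specs": {"wireless": True}},
}


def get_by_key(key):
    """Key-value style access: fetch a whole document by its key."""
    return documents[key]


def find(predicate):
    """Content query: return the keys of the documents that match."""
    return [key for key, doc in documents.items() if predicate(doc)]


office_items = find(lambda doc: "office" in doc.get("tags", []))
cheap_items = find(lambda doc: doc["price"] < 100)

print(json.dumps(get_by_key("p3")))
print(office_items, cheap_items)
```

Note that the documents do not share the same fields ("p3" has no "tags"), so content queries have to tolerate missing fields.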

MongoDB

MongoDB is one of the best examples of a document-based NoSQL database,
and one of the most famous NoSQL databases we can find.
It is an open source NoSQL database that works with its own binary version
of JSON, called BSON. The main features of MongoDB are:
Indexing: any field of a document (or subdocument) in MongoDB can be indexed to speed up queries.
Replication: MongoDB ensures consistency of the data by replication.
It can perform complex operations such as map-reduce.
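
To show what a map-reduce operation does, here is a pure Python sketch. It does not use MongoDB or its API; the orders collection and the functions map_order, reduce_amounts and map_reduce are hypothetical names chosen for this example.

```python
from collections import defaultdict

orders = [
    {"customer": "ana", "amount": 30},
    {"customer": "luis", "amount": 12},
    {"customer": "ana", "amount": 8},
]


def map_order(order):
    """Map step: emit (key, value) pairs for one document."""
    yield order["customer"], order["amount"]


def reduce_amounts(key, values):
    """Reduce step: combine all the values emitted for one key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    # Group the emitted values by key, then reduce each group.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


totals = map_reduce(orders, map_order, reduce_amounts)
print(totals)
```

The map step looks at one document at a time, which is what allows this kind of operation to be spread over several machines.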

3.3 Graph database

In this type of database, the information is represented as the nodes of a
graph, and the relations between elements as the links between nodes. This
way, we can apply mathematical graph theory. This kind of database has the
best performance when working with highly related data.
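
As a small illustration of applying graph theory to related data, the sketch below stores a hypothetical "follows" network as an adjacency list and uses breadth-first search to find a shortest chain of relations. The names and the function shortest_path are assumptions for this example only.

```python
from collections import deque

# Nodes are people; each link means "follows".
follows = {
    "ana": ["luis", "marta"],
    "luis": ["pablo"],
    "marta": ["pablo"],
    "pablo": ["sara"],
    "sara": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


print(shortest_path(follows, "ana", "sara"))
```

In a relational database the same question would typically need one join per step of the chain, which is why highly related data fits the graph model better.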

Neo4j

Neo4j is the most famous NoSQL graph database. It is open source, written
in Java, and features native graph storage and processing. It ensures
consistency and high availability of the data.

3.4 Column-based database

This type of database is designed to run queries and insertions on huge
quantities of data. They work like relational databases, but they store and
add the data by columns instead of by individual rows.
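
The difference between the two layouts can be sketched as follows. The sample rows and the helper to_columns are hypothetical, and the sketch assumes every record has the same fields.

```python
# Row-oriented layout: one dictionary per record.
rows = [
    {"day": "mon", "city": "Madrid", "visits": 120},
    {"day": "tue", "city": "Madrid", "visits": 90},
    {"day": "tue", "city": "Sevilla", "visits": 40},
]


def to_columns(records):
    """Column-oriented layout: one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field only needs to read that column.
total_visits = sum(columns["visits"])
print(columns["visits"], total_visits)
```

Analytical queries usually touch a few fields of many records, so reading only the needed columns avoids scanning the rest of each row.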

Druid

Druid is an open source, column-oriented, distributed database written in
Java. It is built to ensure fast ingestion of huge amounts of data, which
are immediately available to query, and also to ensure fault tolerance by
using replication. For these reasons it is sometimes used for real-time
analytics.

4 NoSQL vs SQL

As we have seen throughout this technical report, NoSQL databases appeared
a few years ago to solve a problem: the older relational databases were not
able to cope with the amount of data being generated nowadays.
A NoSQL database must basically satisfy 2 conditions: scale out easily and
lack a predefined structure. With these characteristics, those databases
solve the problem they were created for, but this does not mean that
relational databases are obsolete or no longer useful.
We should therefore know when it is better to use a NoSQL database and when
it is better to use a relational database. It is better to use a NoSQL
database if:
The amount of data is huge.
Our data doesn't have a uniform schema.
We expect intensive use of the database.
There are a lot of relations between our data.

In the remaining situations, a relational database may be preferred over a
NoSQL one. For example, imagine we need a database for our company. The data
will be accessed only on the company intranet, so we expect neither a huge
amount of data nor intensive access to the database. It is also important to
ensure the availability and consistency of the data, so it is much better to
use a relational database than a NoSQL one.

5 Conclusion

NoSQL databases are a new generation of databases created to solve the
problem of handling huge amounts of data; for this reason they are
increasingly used in big data and real-time web applications.
They lack a predefined structure, so there are many NoSQL database types
(key-value, document-oriented, graph, etc.), each of them built for a
specific job. This lack of structure also makes it hard to move from one
NoSQL provider to another.
As with any new technology, NoSQL databases are not yet as widely used as
they could be, perhaps because of the distrust they still generate, but the
benefits of using this technology correctly are great: speed when working
with huge amounts of data, and a structure that is easy to parallelize and
scale out. For all these reasons, the use of NoSQL databases can be expected
to grow in the future.

6 Bibliography
Material from the subject "Back-end for Big Data analysis".
https://es.wikipedia.org/wiki/NoSQL
https://en.wikipedia.org/wiki/NoSQL
http://www.acens.com/wp-content/images/2014/02/bbdd-nosql-wp-acens.pdf
http://www.genbetadev.com/bases-de-datos/bases-de-datos-nosql-elige-la-opcion-que-mejor-se-adapte-a-tus-necesidades
http://www.genbetadev.com/bases-de-datos/el-concepto-nosql-o-como-almacenar-tus-datos-en-una-base-de-datos-no-relacional
http://www.maestrosdelweb.com/nosql-como-el-futuro-de-las-bases-de-datos/
http://www.campusmvp.es/recursos/post/Fundamentos-de-bases-de-datos-NoSQL-MongoDB.aspx
