
NoSQL Database Management Systems

Data Systems for Software Engineers


Essam Mansour

Essam Mansour (Concordia) SOEN 363 1 / 84


Outline and Learning Outcomes

- What is NoSQL ?

- The Evolution of NoSQL

- Scalability and CAP theorem

- BASE principle

- Advantages and Disadvantages

- Classification of NoSQL Databases

- Indexing Structure for NoSQL databases

- Cloud Spanner

Why NoSQL?

For decades, relational database management systems (MySQL, PostgreSQL, SQL Server, Oracle) were considered the one-size-fits-all solution for persisting and retrieving data.
The ever-increasing need for scalability and new application requirements have created new challenges for traditional RDBMSs.
This led to growing dissatisfaction with the one-size-fits-all approach when deploying the data storage tier of large-scale online web services.


What is NoSQL?

An umbrella term for all database management systems that do not follow the popular and well-established RDBMS principles and that often relate to large data sets accessed and manipulated on a Web scale [1].
NoSQL = Not ONLY SQL
NoSQL implies that more than one storage mechanism could be used, depending on the needs.

[1] Tiwari, S. (2011). Professional NoSQL. John Wiley & Sons.

What is NoSQL?
NoSQL is not a single product or even a single technology. It represents a class of products and a collection of diverse, and sometimes related, concepts about data storage and manipulation.

NoSQL database systems represent a new generation of low-cost, high-performance database software that is steadily gaining popularity.


How is it different?

- Does not use SQL as the query language.
- No fixed schema.
- No join operations. A join combines records from two or more tables; it is an expensive operation that requires a fixed schema and strong consistency.
- Allows a distributed, fault-tolerant architecture. NoSQL systems let you store your database on multiple nodes while maintaining high-speed performance.
- Allows linear scalability. When you add more nodes, performance increases roughly in proportion.


Brief History

- MultiValue databases at TRW in 1965
- DBM at AT&T in 1979
- Carlo Strozzi used the term "NoSQL" in 1998 to name a lightweight open-source relational database that did not expose the standard SQL interface
- Graph database: Neo4j in 2000
- Distributed key/value store: Google BigTable in 2004
- Document database: CouchDB started in 2005
- Cloud key/value data store: Amazon Dynamo
- Document database: MongoDB in 2007
- Facebook Cassandra project in 2008
- Project Voldemort by LinkedIn in 2008
- The term NoSQL was redefined and reintroduced in 2009


NoSQL Evolution
Web 2.0 companies started to face scalability issues as their data and infrastructure grew rapidly.
Many companies came up with their own solutions to these problems, using new technologies such as:
- Google: BigTable [2]
- Apache: Cassandra [3]
- Amazon: DynamoDB [4]

[2] https://cloud.google.com/bigtable/
[3] http://cassandra.apache.org/
[4] https://aws.amazon.com/dynamodb/

NoSQL Evolution

The number of NoSQL database management systems has increased every year, with the main goals being performance, reliability, consistency, and faster search and read operations [5].

The success of the first NoSQL DBMSs initiated the development of more similar open-source and proprietary database systems.

Currently, a huge variety of cloud database systems is available:

- Commercial: Amazon SimpleDB, Amazon DynamoDB, Microsoft Azure Table Storage.
- Open source: HBase, Cassandra, Voldemort, CouchDB, MongoDB.

[5] Sakr, S., Liu, A., Batista, D. M., Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials.

Where is NoSQL now?

[Figure: BigTable, Voldemort, Cassandra, HBase]




Limitations of distributed databases

Three properties are desirable in a distributed database, but they are hard to guarantee at the same time:

- Consistency: every node sees the same data value at any given instant.
- Availability: the system continues to operate even if nodes in a cluster crash or some hardware/software parts are down.
- Partition tolerance: the system continues to operate in the presence of network partitions.


CAP Theorem:

Any distributed database with shared data can have at most two of the three desirable properties: Consistency, Availability, and Partition Tolerance.


- Availability + Partition Tolerance = not consistent: every node keeps answering during a partition, so replicas may return different values.
- Consistency + Partition Tolerance = not available: a node that cannot reach the others makes the client wait until the partition heals.
- Availability + Consistency = not partition tolerant: both hold only as long as the network is not partitioned.
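The trade-off between the CP and AP choices can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the two-replica `Cluster` class, its `mode` flag, and the `partitioned` switch are hypothetical names invented for illustration, not the behavior of any real database.

```python
class Replica:
    """One node holding a single replicated value."""
    def __init__(self, value):
        self.value = value


class Cluster:
    """Two replicas; `mode` is a hypothetical switch between CP and AP behavior."""
    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.a = Replica(0)
        self.b = Replica(0)
        self.partitioned = False  # True: a and b cannot talk to each other

    def write(self, replica, value):
        """Write through one replica; returns True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False       # refuse: stay consistent, lose availability
            replica.value = value  # accept: stay available, replicas diverge
            return True
        replica.value = value
        other.value = value        # no partition: replicate immediately
        return True


cp = Cluster("CP")
ap = Cluster("AP")
cp.partitioned = ap.partitioned = True
cp_ok = cp.write(cp.a, 42)
ap_ok = ap.write(ap.a, 42)
print("CP accepted:", cp_ok, "values:", cp.a.value, cp.b.value)
print("AP accepted:", ap_ok, "values:", ap.a.value, ap.b.value)
```

During the partition, the CP cluster rejects the write and both replicas keep the old value, while the AP cluster accepts it and the two replicas temporarily disagree.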




Reliable Database Transactions Control

Transaction control is important with respect to performance and consistency in distributed computing environments.

Two transaction control models are commonly used:

- ACID: used in RDBMS
- BASE: used in many NoSQL systems

The difference between these models lies in the amount of effort required from application developers and in the location (tier) of the transactional controls.


Example
You have two bank accounts, "Checking" and "Savings".
You want to transfer $1000 from "Savings" to "Checking" using the transfer form on the bank website.


RDBMS Control Using ACID

- Atomicity: the transaction happens as all or nothing. Either both the withdrawal and the deposit take effect, or neither does.
- Consistency: the database moves from one valid state to another. You should never have a report showing the withdrawal from the Savings account without the addition to the Checking account (e.g. the database blocks reports on these accounts while the transfer is in progress).
- Isolation: concurrent transactions do not see each other's intermediate results; each one runs as if it were the only transaction in the system.
- Durability: once the transaction is committed, it is permanent. If the database crashes, all committed transactions must be recoverable (for example from a log or the last backup point).
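The bank transfer example can be sketched with Python's built-in sqlite3 module to show atomicity in practice. The table layout, the starting balances, and the `transfer` helper are assumptions made for illustration; the CHECK constraint stands in for the bank's rule that an account cannot go negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('Savings', 1500), ('Checking', 200)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; returns True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed: nothing was applied


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


first = transfer(conn, "Savings", "Checking", 1000)   # enough money
second = transfer(conn, "Savings", "Checking", 1000)  # would overdraw Savings
print(first, second, balances(conn))
```

In the second call the deposit to Checking is executed first and then undone by the rollback, so no report could ever show money added to Checking without the matching withdrawal from Savings.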


RDBMS Control Using ACID
The main focus of ACID systems is the integrity and consistency of data above any other consideration.
Temporarily giving up availability through blocking (locking) mechanisms is the price paid to ensure the reliability and accuracy of the information given by the system.
ACID systems are said to be pessimistic, as they must consider all possible failure modes in a computing environment.


Example

You are buying some items from a website that uses the "Shopping Cart" and "Checkout" constructs.

The issue here is that, under ACID transaction control, some customers may be blocked for a while when placing an order, so you may lose those customers.


NoSQL Control Using BASE

- Basic Availability: the system may be temporarily inconsistent so that transactions remain manageable and all information and services are "basically available".
- Soft State: some inaccuracies are allowed, and data may change during usage, which reduces the resources consumed for the sake of availability.
- Eventual Consistency: at some point, once all the service logic has been executed, the system reaches a consistent state [6].

[6] Vogels, Werner. Eventually consistent. Communications of the ACM (2009)

NoSQL Control Using BASE

The main focus of BASE systems is the availability of services above any other consideration. New data should be accepted at all times, even at the risk of being out of sync for a short period of time [7].

BASE systems tend to be simpler and faster than ACID systems, as they do not need to handle many events by locking and unlocking resources.

BASE systems are said to be optimistic, as they assume that all parts of the system will eventually catch up and become consistent.

[7] Pritchett, Dan. BASE: An ACID alternative. ACM Queue (2008)

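Eventual consistency can be sketched with two replicas that accept writes independently and reconcile later. This is a minimal illustration, assuming a last-write-wins rule based on timestamps; the `Replica` class, the `anti_entropy` function, and the cart data are hypothetical and do not describe the protocol of any particular NoSQL system.

```python
class Replica:
    """Stores key -> (timestamp, value); newer timestamps win."""
    def __init__(self):
        self.data = {}

    def put(self, key, value, ts):
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def get(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(x, y):
    """Exchange entries in both directions so both replicas converge."""
    for src, dst in ((x, y), (y, x)):
        for key, (ts, value) in list(src.data.items()):
            dst.put(key, value, ts)


r1, r2 = Replica(), Replica()
r1.put("cart:42", ["book"], ts=1)          # accepted locally, not yet replicated
r2.put("cart:42", ["book", "pen"], ts=2)   # a later write lands on another replica
before = (r1.get("cart:42"), r2.get("cart:42"))   # replicas disagree (soft state)
anti_entropy(r1, r2)
after = (r1.get("cart:42"), r2.get("cart:42"))    # replicas agree again
print(before, after)
```

Both writes are accepted immediately (basic availability), readers may briefly see different carts (soft state), and the background exchange brings the replicas back in sync (eventual consistency).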


Advantages of NoSQL Databases:
- Simple and flexible structure: schema-free.
- Based on key-value pairs.
- Store types include column store, document store, object store, XML store, graph store, etc.
- Serialized objects can be stored directly in the database.
- Open-source NoSQL databases do not need expensive licensing fees and can run on inexpensive hardware.
- Expansion is generally cheaper and easier than with relational databases. NoSQL relies on horizontal scaling, distributing the load over more nodes. In contrast, relational databases typically scale vertically, by replacing the main host with a more powerful one.
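Horizontal scaling depends on spreading keys over the available nodes. The sketch below uses simple hash-modulo partitioning as an assumed placement rule; the node names and the `node_for` helper are hypothetical, and real systems often use more elaborate schemes such as consistent hashing.

```python
import hashlib
from collections import Counter


def node_for(key, nodes):
    """Pick a node by hashing the key (simple modulo partitioning)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


keys = [f"user:{i}" for i in range(1000)]
n3 = ["node0", "node1", "node2"]
n4 = ["node0", "node1", "node2", "node3"]

three = Counter(node_for(k, n3) for k in keys)
four = Counter(node_for(k, n4) for k in keys)

# Keys whose owner changes when the fourth node is added.
moved = sum(node_for(k, n3) != node_for(k, n4) for k in keys)
print(three)
print(four)
print("keys that changed node:", moved)
```

Adding a node spreads the same keys over more machines, which is the essence of scaling out. With plain modulo placement a large share of keys changes owner when the node count changes, which is one reason production systems prefer schemes that move fewer keys.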


Disadvantages of NoSQL Databases:
- They generally do not offer the reliability features of relational databases and ACID transaction control (Atomicity, Consistency, Isolation, Durability).
- They trade consistency for availability, performance, and scalability.
- Each type of NoSQL database suits only a limited range of applications.
- They are incompatible with SQL queries and need a manual or proprietary query language, which adds complexity.
- Many do not support joins, GROUP BY, ORDER BY, or ACID transactions.


SQL Vs NoSQL



Classification of NoSQL Databases



Querying Data in NoSQL

- Syntax varies: there is no standard interface or query language.
- Queries are usually not expressed in a set-based language in terms of constraints.
- Queries are often written in procedural programming languages such as Java, C, etc.
- The application specifies the retrieval path.
- There is usually no query optimizer.


Graph Store

- A directed graph structure is used to represent the data.
- The graph consists of a set of nodes, each representing an object. Some pairs of objects are connected by links (edges) that represent relations between the objects.
- It is typically used in social networking applications.
- It allows developers to focus on the relations between objects rather than on the objects themselves.
- Graph stores are powerful for graph queries such as shortest path, connected components, etc.
- Example: Neo4j [8]

[8] https://neo4j.com/

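A shortest-path query over a directed graph can be sketched in a few lines. This is an illustration using breadth-first search on an in-memory adjacency list; the `follows` data and the `shortest_path` helper are hypothetical and are not how Neo4j implements the query internally.

```python
from collections import deque

# Directed edges: person -> people they follow (hypothetical data).
follows = {
    "Tarun": ["Bob", "Alice"],
    "Bob": ["Carol"],
    "Alice": ["Carol"],
    "Carol": ["Dave"],
    "Dave": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest list of nodes or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path(follows, "Tarun", "Dave")
print(path)
```

Because the edges are directed, a path exists from Tarun to Dave but not in the opposite direction, in which case the function returns None.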


Neo4j uses Cypher as a declarative graph query language.
It allows expressive and efficient querying and updating of a property graph.
Cypher is a relatively simple but still very powerful language in which very complicated database queries can be expressed easily.


CREATE

The CREATE command creates a new node or edge. The statements below are meant to run together as a single query, so that the variables stay bound.

CREATE (a:Student {name: 'Tarun'})   // node
CREATE (d:Student {name: 'Bob'})     // node
CREATE (b:City {name: 'Tartu', located_In: 'Estonia'})   // node
CREATE (c:Course {name: 'Data Management'})   // node
CREATE (b)<-[:Live_In]-(a)-[:Study {semester: 'spring'}]->(c)   // edges


READ

The MATCH command retrieves the nodes and edges that fit a given pattern. The query below returns every pair of nodes connected by a directed edge, together with that edge.

MATCH (a)-[b]->(x) RETURN a, b, x


UPDATE

The MERGE command ensures that a pattern exists in the graph: it matches the pattern if it is already there and creates it otherwise.
The SET command updates or inserts data on a node or edge after matching.

MATCH (n:City {name: 'Tartu'})
SET n.population = '100k'

MATCH (n:Student {name: 'Tarun'}), (m:Student {name: 'Bob'})
MERGE (n)-[:Friendship]->(m)
MERGE (m)-[:Friendship]->(n)
DELETE

The DELETE command deletes an edge or node after a MATCH. A node can be deleted only once it has no edges left (DETACH DELETE removes the node together with its edges).

MATCH (n:Student {name: 'Bob'})-[r]-(m:Student {name: 'Tarun'})
DELETE r

MATCH (n:Student {name: 'Bob'})
DELETE n


CREATE

The syntax is very similar to SQL.

The INSERT command inserts a row into a table.

INSERT INTO students (idNo, name, city) VALUES (87787, 'Tarun', 'Tartu');


READ

The SELECT command retrieves data from a given table.

These keywords help in forming SELECT queries:
- FROM: specifies the table to select from.
- WHERE: specifies the condition that the selected rows must satisfy.

Examples:
SELECT * FROM students
SELECT city, name FROM students
SELECT name FROM students WHERE city = 'Tartu'


UPDATE

The UPDATE command updates the values stored under a given key in the table.

These keywords help in forming UPDATE queries:
- SET: specifies the updated value.
- WHERE: specifies which rows are to be updated. It must include the columns composing the primary key.

Example:
UPDATE students SET city = 'Tallinn' WHERE idNo = 87787


DELETE

The DELETE command deletes entries from a given table.

These keywords help in forming DELETE queries:
- FROM: specifies the table to delete from.
- WHERE: specifies which rows are to be deleted.

Example:
DELETE FROM students WHERE city = 'Tartu'


Column Store

- A hybrid of RDBMS and key-value stores.
- Data is stored by column, as opposed to by row as in relational database management systems.
- It consists of one or more column families that group certain columns in the database; each key is used to identify and point to one family.


- Each column holds (name, value) tuples, separated by commas.
- Fast data aggregation.
- The values that belong to a single column are stored together as a single disk entry, which leads to faster access during read/write operations.
- Example: HBase [9]

[9] https://hbase.apache.org/

Example:

Country  Product  Sales
US       Alpha    3000
US       Beta     1250
ES       Alpha     700
UK       Alpha     450
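The difference between row-oriented and column-oriented storage can be sketched with the sales table above. The Python lists and dictionary here are only an in-memory stand-in for the on-disk layout, and the variable names are chosen for illustration.

```python
rows = [
    ("US", "Alpha", 3000),
    ("US", "Beta", 1250),
    ("ES", "Alpha", 700),
    ("UK", "Alpha", 450),
]

# Row-oriented layout: one record after another.
row_layout = [value for record in rows for value in record]

# Column-oriented layout: all values of one column stored together.
column_layout = {
    "Country": [r[0] for r in rows],
    "Product": [r[1] for r in rows],
    "Sales": [r[2] for r in rows],
}

# Aggregating one column touches only that column's values.
total_sales = sum(column_layout["Sales"])
print(row_layout[:6])
print(column_layout["Sales"], total_sales)
```

Summing the Sales column reads four contiguous values in the column layout, whereas the row layout interleaves them with country and product values. This is the intuition behind fast aggregation in column stores.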
CREATE

The "create" command creates a new table:

create '<table name>', '<column family name>', '<column family name>', ...

The "put" command inserts data into a table.

Example:
create 'students', 'personal_data', 'courses'
put 'students', '87787', 'personal_data:name', 'Tarun'
put 'students', '87787', 'personal_data:city', 'Tartu'
put 'students', '87787', 'courses:spring', 'Data Management'


READ

The "get" command retrieves data from a given table based on some parameters, such as the row key and, optionally, a column.

The "scan" command retrieves all data from a given table.

get 'students', '87787'
get 'students', '87787', {COLUMN => 'personal_data:city'}
scan 'students'


UPDATE

The "put" command inserts, overwrites, or updates entries in a table.

put 'students', '87787', 'personal_data:city', 'Tallinn'


DELETE

The "delete" command deletes a cell from a table.

delete 'students', '87787', 'courses:spring'


Document Store

- It is similar to a key-value store in that it is schema-less and has the same advantages and disadvantages.
- The value of each key stored in the database is a "document", with many possible encodings such as XML, JSON, and BSON (binary-encoded JSON).
- Documents themselves can contain many different key-value pairs, key-array pairs, or nested documents.
- Documents can be indexed, which makes lookups faster than in a traditional file system.
- Example: MongoDB [10]

[10] https://www.mongodb.com/

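The document model can be sketched with plain Python dictionaries, which map directly onto JSON. The `students` collection and the `find` and `update_set` helpers below are hypothetical stand-ins written for illustration; they only mimic the idea of criteria-based lookup and field updates and are not the MongoDB API.

```python
import json

# A collection is a list of documents; documents need not share a schema.
students = [
    {"idNo": "87787", "name": "Tarun",
     "address": {"city": "Tartu", "country": "Estonia"},   # nested document
     "courses": ["Data Management"]},                       # key-array pair
    {"idNo": "90001", "name": "Bob"},                       # fewer fields
]


def find(collection, criteria):
    """Return documents whose top-level fields equal every criterion."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]


def update_set(collection, criteria, new_fields):
    """Set fields on every matching document; returns how many matched."""
    matched = find(collection, criteria)
    for doc in matched:
        doc.update(new_fields)
    return len(matched)


changed = update_set(students, {"idNo": "90001"}, {"city": "Tallinn"})
print(changed, json.dumps(find(students, {"idNo": "90001"})))
```

The second document gains a `city` field without any schema change, which is the flexibility that a schema-less document store offers.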


CREATE

The command "db.collection.insert()" inserts a document into a collection. In recent MongoDB versions, insertOne() and insertMany() are the preferred forms.


READ

The find() command retrieves all the documents of the given collection.

db.collection_name.find()

To retrieve documents based on some criteria, pass them as parameters to the find() command.

db.students.find({"idNo": "87787"})


UPDATE

The update() command modifies documents of the given collection.

db.collection_name.update()

To update the documents that match some criteria, pass the criteria as parameters to the update() command, and set the new attributes using $set.

db.students.update({"idNo": "87787"}, {$set: {"name": "Tarun"}})


DELETE

The remove() command deletes documents from a given collection.

db.collection_name.remove()

db.students.remove({"idNo": "87787"})


NoSQL Databases [11]

[11] http://nosql-database.org/



Indexing Structures for NoSQL Databases

The process of associating a key with the location of the corresponding data record in a database management system is called indexing.

Three of the most common indexing methods used in NoSQL databases are:

- B-Tree indexing
- T-Tree indexing
- O2-Tree indexing
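The idea of an index, a mapping from a key to the location of its record, can be sketched as follows. Note the substitution: instead of a B-Tree, T-Tree, or O2-Tree, this sketch uses a sorted list searched with Python's bisect module, which shares the ordered-lookup idea but none of the tree structure. The `records` data and the `lookup` helper are hypothetical.

```python
import bisect

# Data records live at arbitrary positions (here: offsets in a list).
records = [
    {"idNo": 90001, "name": "Bob"},
    {"idNo": 87787, "name": "Tarun"},
    {"idNo": 91234, "name": "Alice"},
]

# The index maps each key to the location of its record, sorted by key.
index = sorted((rec["idNo"], pos) for pos, rec in enumerate(records))
keys = [k for k, _ in index]


def lookup(key):
    """Binary-search the index, then fetch the record at the stored location."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[index[i][1]]
    return None


found = lookup(87787)
missing = lookup(12345)
print(found, missing)
```

The lookup needs a logarithmic number of key comparisons instead of scanning every record, which is the same benefit the tree-based indexes provide while also supporting efficient inserts.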


One Size Does Not Fit All
There has been a big debate between the proponents of the NoSQL and RDBMS camps, centered around the right choice for implementing online transaction processing systems.
RDBMS proponents think that the NoSQL camp has not spent sufficient time understanding the foundations of the transaction model.
The NoSQL camp argues that the domain-specific optimization opportunities of NoSQL systems give more flexibility to developers.
However, they admit that making optimization decisions requires a lot of experience and can be very error-prone and dangerous if not done by experts.




Cloud Spanner [13]

Spanner is Google's globally distributed NewSQL database [12].

Cloud Spanner is presented by Google as an enterprise-grade, globally distributed, and strongly consistent database service, built for the cloud specifically to combine the benefits of relational database structure with non-relational horizontal scale.

This combination is designed to deliver high-performance transactions and strong consistency across rows, regions, and continents, with an availability SLA of up to 99.999%, no planned downtime, and enterprise-grade security.

According to Google, Cloud Spanner simplifies database administration and management and makes application development more efficient.

[12] Corbett, James C., et al. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems (TOCS), 2013
[13] https://cloud.google.com/spanner/

NewSQL

NewSQL is a combination of SQL and NoSQL [15].

It is a class of modern RDBMSs that seek to provide:

- The same scalable performance as NoSQL systems for online transaction processing (read-write) workloads.
- The ACID guarantees of a traditional database system.

NewSQL systems are relational databases designed to provide:

- ACID-compliant, real-time OLTP (Online Transaction Processing).
- Conventional SQL-based OLAP (Online Analytical Processing) in Big Data environments.

[15] Grolinger, Katarina, et al. Data management in cloud environments: NoSQL and NewSQL data stores. Journal of Cloud Computing: Advances, Systems and Applications, 2013

NewSQL

These systems break through conventional RDBMS performance limits by employing NoSQL-style features such as column-oriented data storage and distributed architectures, or by employing technologies like in-memory processing, symmetric multiprocessing (SMP), or massively parallel processing (MPP).


Three Eras of Databases
Current Market Status

- 1985-1995: RDBMS
- 1995-2010: RDBMS, Data Warehouse
- 2010-Now: RDBMS, Data Warehouse, NoSQL

RDBMS for transactions, Data Warehouse for analytics, and NoSQL for ...?
What’s next?



The End
