No SQL

BIG
DATA
ANALYTICS
G. Sudha Sadasivam & R. Thirumahal
© Oxford University Press 2020. All rights reserved

Chapter 4
NoSQL

Agenda
• SQL vs NoSQL
• Limitations and advantages of NoSQL
• Types of NoSQL Stores with example
✓ KV store
✓ Column family
✓ Document
✓ Graph
• Comparison of NoSQL stores
• Principles of NoSQL models
• CAP
• BASE
• Polyglot persistence in ecommerce application
Introduction
• The term NoSQL was coined by Carlo Strozzi in 1998
• Relational systems have
✓ ACID properties; they are transactional, which can degrade performance
✓ Centralised control
✓ A rigid schema, resulting in a lack of flexibility and scalability
• NoSQL – not only SQL
• NoSQL stores are schema-less and hence
✓ Have simple and fast data access
✓ Can store voluminous data
✓ Can store unstructured data from multiple sources
• Work with large volumes of distributed data
• Have high operational speed, great flexibility and horizontal scalability
• Follow BASE properties with eventual consistency
• Possess a shared-nothing architecture
• Support auto-sharding and replication
• Support parallelism and distributed querying
NoSQL systems are complementary to SQL systems

Limitations
• Not suited to transactional applications that have constraint and consistency requirements
• Being schema-less, they leave the enforcement of constraints to the application developer
• Using multiple data stores makes interoperability difficult
• Eventual consistency: changes in data reach all copies only after a time lag
• Vendor lock-in: each NoSQL data store exists as a silo, resulting in high coupling between the data store and the application
• Lack of expertise in the usage of NoSQL stores
• NoSQL databases suffer from security issues related to authentication, authorization and storage security
Types of NoSQL Stores
• Key-value (KV) stores
✓ Behave like associative arrays (dictionaries)
✓ Store key-value pairs with a unique key for every value; ordered KV stores also keep the keys sorted
✓ Offer good performance, so they are used for session management and caching
✓ Keep data in RAM, as in Memcached, or in secondary memory, as in MemcacheDB
• Document stores
✓ Organise data as a collection of documents with unique keys
✓ Information can be retrieved based on the contents of the document
✓ Collections are analogous to tables, and documents to records in a table
✓ Every document can have different fields
✓ Suitable for managing content and mobile data
✓ Examples: MongoDB and CouchDB
• Column family stores
✓ Data is stored in columns instead of rows
✓ Columns with different types of data can be aggregated as a column family for querying
✓ Examples: HBase and Bigtable
• Graph data stores
✓ Entities and the relationships that connect them, as in social networks, are represented as graphs
✓ Example: Neo4j

EXAMPLE
RELATIONAL

• KV Store:
✓ Each record is stored in a row and read using a RecordReader in HDFS
✓ Attributes are separated by commas and extracted using a comma separator
• Column Family Store
✓ The Customer table has 2 column families, Name and Address, along with orders with timestamps (TS)
✓ The Order table has Price and Item column families
• Document Store
✓ Two collections, namely Customer and Order
✓ Customer has 2 documents (rows) while Order has 3 documents
• Graph Store:
✓ Entities are CustID with Name and Address, and OrderID with Price and Items
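The four layouts described above can be sketched with plain Python structures. This is only an illustration: the original example tables are not reproduced here, so every customer, order and value below is hypothetical, chosen to match the stated shape of 2 customers and 3 orders.

```python
# Hypothetical sample data: 2 customers and 3 orders (all values are made up).

# KV store: each key maps to an opaque, comma-separated record.
kv_store = {
    "C1": "Ram,Chennai",
    "C2": "Sita,Mumbai",
    "O1": "C1,Book,250",
    "O2": "C1,Shoes,900",
    "O3": "C2,Book,300",
}

# Column family store: row key -> column family -> column -> value.
customer_table = {
    "C1": {"Name": {"first": "Ram"}, "Address": {"city": "Chennai"}},
    "C2": {"Name": {"first": "Sita"}, "Address": {"city": "Mumbai"}},
}

# Document store: collections of self-describing documents.
collections = {
    "Customer": [
        {"_id": "C1", "name": "Ram", "city": "Chennai"},
        {"_id": "C2", "name": "Sita", "city": "Mumbai"},
    ],
    "Order": [
        {"_id": "O1", "cust_id": "C1", "item": "Book", "price": 250},
        {"_id": "O2", "cust_id": "C1", "item": "Shoes", "price": 900},
        {"_id": "O3", "cust_id": "C2", "item": "Book", "price": 300},
    ],
}

# Graph store: entities are nodes; relationships are labelled edges.
edges = [("C1", "PLACED", "O1"), ("C1", "PLACED", "O2"), ("C2", "PLACED", "O3")]


def read_kv_record(key):
    """Split a comma-separated KV record into its attributes."""
    return kv_store[key].split(",")


def orders_placed_by(cust_id):
    """Follow PLACED edges from a customer node to its order nodes."""
    return [dst for src, label, dst in edges if src == cust_id and label == "PLACED"]
```

Note how the KV store must parse the value itself, while the document and column family layouts expose named fields directly.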
Logical organization in KV store
Physical organization in KV store

Column Family Store

Document Store
[Figure: two document collections and the documents they contain]

Graph Datastore

NoSQL Stores
• KV stores are simple and powerful but cannot process a range of keys.
• Ordered KV stores can process key ranges, but cannot model values.
• Column family stores model values as a map-of-maps-of-maps: column families are aggregated from columns, which in turn are aggregated from timestamped values.
• Document stores can model values not only as aggregates but also as schemas of arbitrary complexity. They also provide indexing based on field names/keys.
• Graph data stores extend ordered KV systems by linking various keys as a graph rather than as a hierarchical model.
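The difference between a plain and an ordered KV store can be illustrated with a minimal in-memory sketch. The class name `OrderedKVStore` and the `user:date` key layout are assumptions made for this example; they are not taken from any particular product.

```python
import bisect


class OrderedKVStore:
    """Toy ordered KV store: keys are kept sorted so ranges can be scanned."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get(self, key):
        return self._values[key]

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```

A plain hash-based KV store offers only `put` and `get`; the sorted key list is what makes `scan` possible. The values themselves remain opaque to the store.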

Comparison

Comparison

Principles governing NoSQL models
• Denormalisation: simplifies and optimises query processing by
✓ grouping all the data required to process a query in one node
✓ reducing the number of join operations
✓ It increases data volume; used in KV, document and column family stores
• Soft schema: aggregates objects with varying attributes. Attributes vary among the products sold by an e-commerce site, e.g. books and shoes, so a soft schema is used for aggregation
• Application-side joins incur a performance penalty. Denormalisation and aggregates avoid query-time join operations
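A small sketch of the trade-off, assuming hypothetical customer, order and product records: the normalised layout needs an application-side join at read time, while the denormalised aggregate answers the same query with one lookup at the cost of duplicated data.

```python
# Normalised layout: answering the query needs two lookups and a join in the app.
customers = {"C1": {"name": "Ram", "city": "Chennai"}}
orders = {"O1": {"cust_id": "C1", "item": "Book", "price": 250}}


def order_view_with_join(order_id):
    """Application-side join: fetch the order, then fetch its customer."""
    order = orders[order_id]
    customer = customers[order["cust_id"]]
    return {
        "item": order["item"],
        "price": order["price"],
        "customer": customer["name"],
        "city": customer["city"],
    }


# Denormalised layout: the customer fields are copied into the order aggregate,
# so one lookup is enough (but the data volume grows).
order_aggregates = {
    "O1": {"item": "Book", "price": 250, "customer": "Ram", "city": "Chennai"},
}


def order_view_denormalised(order_id):
    return order_aggregates[order_id]


# Soft schema: products in one collection carry different attributes.
products = [
    {"type": "book", "title": "Big Data Analytics", "pages": 400},
    {"type": "shoes", "brand": "Acme", "size": 9},
]
```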

• Atomic aggregates: aggregates in NoSQL systems model a business entity as one document that can be updated atomically.
• Enumerable keys: ordered KV stores facilitate easy search operations with keys.
✓ Web logs stored as ordered KV pairs can be searched based on user_id and time
• Dimensionality reduction: multi-dimensional GIS indexes, such as a Quadtree with dynamic updates, can be reduced to lower-dimensional structures like GeoHash.
✓ Geohashes use a 2D structure (with a Z-like scan) that can be traversed to obtain a list of entries in binary form
• Index table: column families are used as an index, e.g. to retrieve the customers belonging to a city
• Composite key index: to identify the products purchased by a particular customer (with CID) based in a city, city:CID is used as a composite key
• Composite key aggregation: product details are aggregated by city and customer, again with city:CID as a composite key
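The Z-like scan mentioned under dimensionality reduction can be sketched by interleaving the bits of two grid coordinates into one integer. This is a simplified Z-order (Morton) code on integer cell coordinates, not the real Geohash base-32 encoding of latitude and longitude; the function name and the `bits` parameter are assumptions for illustration.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single integer key.

    Bit i of x lands at position 2*i and bit i of y at position 2*i + 1,
    so cells that are close in 2D tend to get nearby 1D keys.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def cells_in_scan_order(width, height):
    """List the grid cells sorted by their Z-order key."""
    cells = [(x, y) for y in range(height) for x in range(width)]
    return sorted(cells, key=lambda cell: z_order(cell[0], cell[1]))
```

Because the result is a single sortable key, it can be stored in an ordered KV store and scanned by range.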

• Inverted index: identify the customers belonging to a city (A) and the customers purchasing a particular product category (B), then perform an AND operation on the two sets.
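A minimal sketch of the inverted-index lookup, assuming hypothetical customer records: one index maps each city to customer IDs, another maps each product category to customer IDs, and the AND operation is a set intersection.

```python
from collections import defaultdict

# Hypothetical customer records (all values are made up).
customers = [
    {"cid": "C1", "city": "Chennai", "categories": ["books", "shoes"]},
    {"cid": "C2", "city": "Chennai", "categories": ["shoes"]},
    {"cid": "C3", "city": "Mumbai", "categories": ["books"]},
]

# Inverted indexes: attribute value -> set of customer IDs.
city_index = defaultdict(set)
category_index = defaultdict(set)
for record in customers:
    city_index[record["city"]].add(record["cid"])
    for category in record["categories"]:
        category_index[category].add(record["cid"])


def customers_in_city_buying(city, category):
    """A AND B: customers in `city` who also bought from `category`."""
    in_city = city_index.get(city, set())            # set A
    bought = category_index.get(category, set())     # set B
    return in_city & bought
```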

Hierarchical Modeling
• Types: tree aggregation, adjacency lists, and nested sets
• Tree aggregation
✓ Denormalisation in NoSQL data stores helps in the development of tree aggregates.
✓ Example: a product tree aggregate

Adjacency lists
• Graph models inherently represent adjacency lists, where neighbours and
parent/child of a node can be identified based on the relationships. For example,
social networks can be represented using graph models to show the degree of
relatedness between the users.

Nested sets
• Nested sets store the leaves of the tree in an array. Each non-leaf node is mapped to a range of leaf nodes through its start and end indices. This makes it efficient to fetch all the leaf nodes of a given node without traversals, and it also occupies less memory. Insertions and updates are costly, as the addition of one leaf node requires the stored indices to be updated.
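The nested-set idea can be sketched as follows. The class `NestedSetTree`, the half-open `(start, end)` ranges and the product categories are assumptions for this example; real stores may lay the data out differently.

```python
class NestedSetTree:
    """Leaves live in one array; each non-leaf node owns a range of it."""

    def __init__(self, leaves, ranges, parent):
        self.leaves = list(leaves)
        self.ranges = dict(ranges)  # non-leaf node -> (start, end), end exclusive
        self.parent = dict(parent)  # non-leaf node -> parent node, or None for the root

    def leaves_under(self, node):
        """Fetch all leaves of a node with one slice, no tree traversal."""
        start, end = self.ranges[node]
        return self.leaves[start:end]

    def insert_leaf(self, node, leaf):
        """Append a leaf to `node`; every affected range must be rewritten."""
        pos = self.ranges[node][1]
        self.leaves.insert(pos, leaf)
        # Collect the node and all of its ancestors: their ranges grow by one.
        ancestors = set()
        current = node
        while current is not None:
            ancestors.add(current)
            current = self.parent[current]
        for name, (start, end) in list(self.ranges.items()):
            if name in ancestors:
                self.ranges[name] = (start, end + 1)
            elif start >= pos:
                # Ranges located after the insertion point shift right by one.
                self.ranges[name] = (start + 1, end + 1)
```

Reading is a single slice, while `insert_leaf` has to visit every stored range, which is the cost the slide refers to.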

Flattening Nested Documents
• The problem with nested documents is the complexity of retrieving information from them.
• Nested documents can be flattened based on levels or using proximity distance.
• Example: Ram has multiple skills, with a different level of expertise in each skill
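One possible way to flatten such a document is to turn every nested path into a dotted field name, with list positions as numbered path segments. This is a generic sketch and not necessarily the exact scheme intended on the slide; the skills and levels shown for Ram are hypothetical.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into one dict keyed by dotted paths."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become path segments
    else:
        return {prefix: value}    # scalar: emit the accumulated path
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


# Hypothetical nested document for the example in the text.
ram = {
    "name": "Ram",
    "skills": [
        {"skill": "Java", "level": "expert"},
        {"skill": "Python", "level": "beginner"},
    ],
}
```

After flattening, each skill and its level share the same numbered prefix (for instance `skills.0.skill` and `skills.0.level`), so a query can match a skill with its own level rather than with the level of another skill.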

CAP
• Eric Brewer proposed the consistency, availability, partition tolerance (CAP) theorem in 2000
• Consistency is the ability to obtain the same data from multiple replicas. Consistency compliance ensures that all the cluster nodes have access to the same data.
• Availability is the ability of a system to continue its operation even when some hardware/software components fail.
• Partition tolerance is the ability of the system to continue operation in a network partitioned due to network failures. It guarantees independence of various data partitions.
Replication facilitates the availability of data. Eventual consistency ensures that replicas do not remain stale. Partitioning ensures load distribution and scalability.
CAP
• Only 2 of the 3 properties can be satisfied at a time
✓ AP: follows BASE properties with eventual consistency, without strict consistency. E.g. Amazon’s DynamoDB
✓ CP: ACID properties with strict consistency. Pessimistic locking ensures consistency. E.g. MongoDB and Memcache
✓ CA: a CA system cannot operate under network partitions and hence it is neither ACID nor BASE. The two-phase commit protocol is used. E.g. relational databases and Bigtable

BASE
• Web 2.0 applications
• Basically available, soft state and eventually consistent
• Works basically all the time
• Due to eventual consistency, maintains soft state
ACID | BASE
Atomicity, Consistency, Isolation, Durability | Basically Available, Soft state, Eventually consistent
Strong consistency | Weak consistency
Consistency and Isolation first | Availability first
Nested Transactions | Approximate Answers
Conservative | Simple
Schema | Schema-less

Consistency
• Strict consistency ensures that all read operations return data from the latest completed write operation. For distributed transactions, the two-phase commit protocol can be used.
• Eventual consistency enables readers to see writes after a time lag. While updates are in progress, the state is inconsistent. Updates are made in one replica, which eventually copies the changes to the other replicas.
• Eventual consistency types:
(a) read your own writes (RYOW) consistency, where a particular client’s updates are visible to that client instantly, whereas the updates made by other clients are not visible to it immediately.
(b) session consistency, where updates can be viewed only by read requests that follow an update made by the client in the same session.
(c) causal consistency: if a client reads a version x and updates it as version y, then all other clients reading version x will also see version y.
(d) monotonic read consistency, where a client that has seen one version of the data sees only that version or newer versions in future requests.
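The time lag behind eventual consistency can be illustrated with a toy replicated store. This is a deliberately simplified sketch: the class `ReplicatedStore` and its explicit `propagate` step are assumptions, and there is no versioning or conflict resolution, which real systems need.

```python
class ReplicatedStore:
    """Toy store: a write lands on one replica and is copied to the rest later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates not yet copied to the other replicas

    def write(self, replica, key, value):
        # The write is applied to a single replica first.
        self.replicas[replica][key] = value
        self.pending.append((replica, key, value))

    def read(self, replica, key):
        # A read may return stale data (or None) until propagation happens.
        return self.replicas[replica].get(key)

    def propagate(self):
        """Copy all pending updates, in write order, to the other replicas."""
        for source, key, value in self.pending:
            for index, data in enumerate(self.replicas):
                if index != source:
                    data[key] = value
        self.pending.clear()
```

A client that keeps reading from the replica it wrote to sees its own update at once, which corresponds to read-your-own-writes; other replicas catch up only after `propagate` runs.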
Partitioning Approaches
• Memory caches are partitioned, transient, in-memory databases that keep frequently used data in main memory. Cached instances (objects) are launched into the processes in various nodes
• Clustering of database servers partitions the data and maintains it among multiple servers
• Separating reads from writes: here separate write servers designated as masters hold the latest version of data. These servers propagate updates to the slave/read servers
• Sharding: data is partitioned so that the data requested and updated together resides in the same node. Data shards are replicated for load balancing and reliability. Shards can be mapped to storage nodes statically or dynamically. It is very well suited for NoSQL
Simple hashing can be used to map the keys of database objects (o) to servers (n):
Partition = hash(o) mod n
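The formula above can be sketched directly. MD5 is used here only as a stable hash, since Python's built-in `hash()` for strings varies between runs; the function names and sample keys are assumptions for illustration.

```python
import hashlib


def partition_for(object_key, n):
    """Partition = hash(o) mod n, using a hash that is stable across runs."""
    digest = hashlib.md5(object_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


def assign(keys, n):
    """Group keys by the server (0 .. n-1) they are mapped to."""
    servers = {index: [] for index in range(n)}
    for key in keys:
        servers[partition_for(key, n)].append(key)
    return servers
```

With this simple scheme, changing the number of servers n changes the result of the modulo for most keys, so most data has to move when a server is added or removed.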
Partitioning Approaches
• Sharding can be done based on attributes like user_id, location, time etc
• Shards are replicated for availability

Case Study
• Polyglot persistence applies multiple data storage technologies to meet the needs of an application.
• Consider an e-commerce application with shopping cart, inventory, orders, catalogue and customer details.
✓ User sessions / activity logs require efficient read/write operations: KV stores
✓ Point of sale has a high ingestion rate with a high volume of write operations: KV stores (storage); column family stores (analytics)
✓ The shopping cart requires high availability and aggregates information: document store
✓ The product catalogue has frequent reads and infrequent writes, and must also support aggregation: document stores
✓ Product recommendations are made based on similar products or users: graph store
✓ Financial data is relational and requires transactional updates: RDBMS
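The mapping in the case study can be written down as a small routing table. The workload names and the `store_for` helper are hypothetical; the store types are the ones listed above.

```python
# Workload -> type of data store, as listed in the case study.
STORE_BY_WORKLOAD = {
    "user_sessions": "key-value store",
    "activity_logs": "key-value store",
    "point_of_sale_storage": "key-value store",
    "point_of_sale_analytics": "column family store",
    "shopping_cart": "document store",
    "product_catalogue": "document store",
    "product_recommendations": "graph store",
    "financial_data": "RDBMS",
}


def store_for(workload):
    """Return the store type chosen for a workload of the application."""
    try:
        return STORE_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"no store configured for workload: {workload}") from None
```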

Case Study

Conclusion
• SQL vs NoSQL
• Limitations and advantages of NoSQL
• Types of NoSQL Stores with example
✓ KV store
✓ Column family
✓ Document
✓ Graph
• Comparison of NoSQL stores
• Principles of NoSQL models
• CAP
• BASE
• Polyglot persistence in ecommerce application
Thanks!
