
BIG DATA ANALYTICS
G. Sudha Sadasivam & R. Thirumahal

© Oxford University Press 2020. All rights reserved


Chapter 4
NoSQL

Agenda
• SQL vs NoSQL
• Limitations and advantages of NoSQL
• Types of NoSQL Stores with example
  - KV store
  - Column family
  - Document
  - Graph
• Comparison of NoSQL stores
• Principles of NoSQL models
• CAP
• BASE
• Polyglot persistence in ecommerce application
Introduction
• The term NoSQL was coined by Carlo Strozzi in 1998
• Relational systems have
  - ACID properties and transactional semantics, which can degrade performance
  - Centralised control
  - A rigid schema, resulting in a lack of flexibility and scalability
• NoSQL – not only SQL
• NoSQL stores are schema-less and hence
  - Have simple and fast data access
  - Can store voluminous data
  - Can store unstructured data from multiple sources
• Work with large volumes of distributed data
• Have high operational speed, great flexibility and horizontal scalability
• Follow BASE properties with eventual consistency
• Possess a shared-nothing architecture
• Support auto-sharding and replication
• Support parallelism and distributed querying

NoSQL systems are complementary to SQL systems


Limitations
• Not suited to transactional applications that have strict constraint and consistency
requirements
• Being schemaless, they leave the enforcement of constraints to the application developer
• Multiple data stores make interoperability difficult
• Eventual consistency: changes in data are propagated to all copies with a time lag
• Vendor lock-in: each NoSQL data store exists as a silo, resulting in high coupling between
the data store and the application
• Lack of expertise in the usage of NoSQL stores
• NoSQL databases suffer from security issues related to authentication, authorization and
storage security
Types of NoSQL Stores
• Key-value (KV) stores
  - Associative arrays (dictionary)
  - Key-value pairs with unique ordered keys for every value
  - Good performance, so used for session management and caching
  - Data is held in RAM, as in Memcached, or in secondary memory, as in MemcacheDB
• Document stores
  - Organise data as a collection of documents with unique keys
  - Information can be retrieved based on the contents of the document
  - Collections are analogous to tables, and documents to records in a table
  - Every document can have different fields
  - Suitable for managing content and mobile data
  - Examples: MongoDB and CouchDB
Types of NoSQL Stores
• Column family stores
  - Data is stored in columns instead of rows
  - Columns with different types of data can be aggregated as a column family
    for querying
  - HBase and BigTable are column family data stores
• Graph data stores
  - Entities, as in social networks, are connected by relationships represented as
    graphs. Example: Neo4j

Types of NoSQL Stores

[Figure: relational example with Customer and Order tables]

Types of NoSQL Stores
• KV Store:
  - Each record is stored in a row and read using RecordReader in HDFS
  - Attributes are separated by commas and extracted by splitting on the comma
• Column Family Store:
  - Customer table has 2 column families, Name and Address, along with orders with timestamps (TS)
  - Order table has Price and Item column families
• Document Store:
  - Two collections, namely Customer and Order
  - Customer has 2 documents (rows) while Order has 3 documents
• Graph Store:
  - Entities are CustID with Name and Address, and OrderID with Price and Items
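As a rough illustration of the four layouts listed above, the sketch below models one small customer/order data set in plain Python structures. The names, IDs, prices and items are invented for illustration (the slide's relational figure is not reproduced here), and dicts and lists only approximate what a real store does.

```python
# Hypothetical data: 2 customers and 3 orders, matching the counts in the text.

# KV store: one opaque comma-separated value per key; the application parses it.
kv_store = {
    "C1": "Asha,Chennai",
    "C2": "Ravi,Coimbatore",
}
name, address = kv_store["C1"].split(",")

# Column family store: row key -> column family -> column -> {timestamp: value}.
column_family_store = {
    "C1": {
        "Name": {"first": {1: "Asha"}},
        "Address": {"city": {1: "Chennai"}},
        "Orders": {"O1": {1: "placed"}, "O2": {2: "placed"}},
    },
}

# Document store: collections of documents; documents may differ in their fields.
document_store = {
    "Customer": [
        {"_id": "C1", "name": "Asha", "address": "Chennai"},
        {"_id": "C2", "name": "Ravi", "address": "Coimbatore", "loyalty": True},
    ],
    "Order": [
        {"_id": "O1", "cust": "C1", "price": 250, "items": ["book"]},
        {"_id": "O2", "cust": "C1", "price": 900, "items": ["shoes"]},
        {"_id": "O3", "cust": "C2", "price": 120, "items": ["pen", "ink"]},
    ],
}

# Graph store: nodes with properties plus explicit relationships (edges).
nodes = {
    "C1": {"name": "Asha", "address": "Chennai"},
    "C2": {"name": "Ravi", "address": "Coimbatore"},
    "O1": {"price": 250, "items": ["book"]},
    "O2": {"price": 900, "items": ["shoes"]},
    "O3": {"price": 120, "items": ["pen", "ink"]},
}
edges = [("C1", "PLACED", "O1"), ("C1", "PLACED", "O2"), ("C2", "PLACED", "O3")]

# Following relationships replaces the join a relational query would need.
orders_of_c1 = [dst for src, rel, dst in edges if src == "C1" and rel == "PLACED"]
```

The same facts appear in all four structures; what changes is which questions are cheap to answer (lookup by key, column scans, content queries, or relationship traversal).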
Logical organization in KV store

Physical organization in KV store

Column Family Store

Document Store
[Figure: document store layout showing Document collection 1 and Document collection 2, each holding numbered documents]

Graph Datastore

NoSQL Stores
• KV stores are simple and powerful but cannot process a range of keys.
• Ordered KV stores support key ranges, but cannot model values.
• Column family stores model values as a map-of-maps-of-maps: column families are aggregated
from columns, which in turn are aggregated from timestamped values.
• Document stores can model values not only as aggregates but also as schemas of arbitrary
complexity. They also provide indexing based on field names/keys.
• Graph data stores extend ordered KV systems by linking various keys as a graph rather
than as a hierarchical model.
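The difference between a plain KV store, an ordered KV store and a column family can be sketched in a few lines. The `OrderedKV` class, the log keys and the row contents below are hypothetical and only illustrate the idea; real stores keep keys ordered on disk rather than in a Python list.

```python
import bisect


class OrderedKV:
    """Toy ordered key-value store: keys are kept sorted to allow range scans."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values[key]

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


# Web-log style keys of the form (user_id, time).
logs = OrderedKV()
logs.put(("u1", 10), "login")
logs.put(("u2", 5), "login")
logs.put(("u1", 20), "search")
logs.put(("u1", 30), "logout")

# Range scan: everything user u1 did before time 25.
u1_events = logs.range(("u1", 0), ("u1", 25))

# Column family value as a map of maps of maps:
# column family -> column -> timestamp -> value.
row = {"Name": {"first": {1: "Asha", 2: "Asha K"}}}
latest_ts = max(row["Name"]["first"])
latest_first_name = row["Name"]["first"][latest_ts]
```

A plain dictionary could answer `get` but not `range`; keeping the keys sorted is what makes the range scan possible.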

Comparison

Comparison

Principles governing NoSQL models
• Denormalisation: simplifies and optimizes query processing by
  - grouping all data required to process a query in one node
  - reducing the number of join operations
  - It increases data volume; used in KV, document and column family stores
• Soft schema: aggregates objects with varying attributes.
Attributes vary among the products sold by an e-commerce site
(e.g. books and shoes), so a soft schema is used for aggregation.
• Application-side joins incur a performance penalty. Denormalization
and aggregates avoid query-time join operations.
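A minimal sketch of the trade-off, assuming invented customer, order and product records: the normalised form needs an application-side join at query time, while the denormalised aggregate stores everything for one customer together, with product attributes that vary per item (soft schema).

```python
# Normalised data: three separate "tables" (all values are hypothetical).
customers = {"C1": {"name": "Asha", "city": "Chennai"}}
orders = [
    {"order_id": "O1", "cust": "C1", "product": "P1"},
    {"order_id": "O2", "cust": "C1", "product": "P2"},
]
products = {
    "P1": {"type": "book", "title": "Big Data Basics", "author": "A. Writer"},
    "P2": {"type": "shoes", "size": 8, "colour": "black"},
}


def join_orders(cust_id):
    """Application-side join: three lookups per order at query time."""
    return [
        {**o, "customer": customers[o["cust"]], "product": products[o["product"]]}
        for o in orders
        if o["cust"] == cust_id
    ]


def denormalise(cust_id):
    """Build one aggregate document holding the customer and all order details."""
    return {
        "_id": cust_id,
        **customers[cust_id],
        "orders": [
            {"order_id": o["order_id"], "product": products[o["product"]]}
            for o in orders
            if o["cust"] == cust_id
        ],
    }


# Built once at write time; reads then need no join.
customer_doc = denormalise("C1")

# Soft schema: the two embedded products carry different attribute sets.
product_fields = [sorted(o["product"]) for o in customer_doc["orders"]]
```

The aggregate duplicates product data into every customer document, which is the volume increase noted above.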

Principles governing NoSQL models
• Atomic aggregates: aggregates in NoSQL systems model a business entity as one document
that can be updated atomically.
• Enumerable keys: ordered KV stores facilitate easy search operations with keys.
  - Web logs stored as ordered KV pairs can be searched based on user_id and time
• Dimensionality reduction: multi-dimensional GIS indexing (e.g. Quadtree
indexing with dynamic updates) can be reduced to lower-dimensional
structures like GeoHash.
  - Geohashes use a 2D structure (with a Z-like scan) that
    can be traversed to obtain a list of entries in binary
    form
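The Z-like scan can be sketched as bit interleaving (a Morton code), which is the idea underlying GeoHash. This is a simplified illustration on small integer grid coordinates, not the GeoHash encoding itself; the function name and the 16-bit default are assumptions.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z-order (Morton) code: bits of x go to even positions, bits of y to odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sorting the cells of a 2x2 grid by their code walks them in a "Z" shape.
cells = sorted(
    ((x, y) for x in range(2) for y in range(2)),
    key=lambda cell: interleave_bits(*cell),
)

# A 2D point becomes a single 1D key usable in an ordered KV store.
code = interleave_bits(3, 5)
```

Because nearby cells tend to share key prefixes, a range scan over the one-dimensional keys approximates a two-dimensional region query.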
Principles governing NoSQL models
• Index table: column families are used as an index, for example to retrieve the customers
belonging to a city
• Composite key index: to identify the products purchased by a particular customer (with
CID) based in a city, city:CID is used as a composite key
• Composite key aggregation: product details are aggregated by city and
customer, again with city:CID as a composite key
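The three patterns above can be sketched with dictionaries. The purchase records are invented, and the `city:CID` string format follows the composite key described in the text.

```python
from collections import defaultdict

# Hypothetical purchase records: (city, customer ID, product).
purchases = [
    ("Chennai", "C1", "book"),
    ("Chennai", "C2", "shoes"),
    ("Madurai", "C3", "pen"),
    ("Chennai", "C1", "pen"),
]

city_index = defaultdict(set)         # index table: city -> customer IDs
by_city_customer = defaultdict(list)  # composite key "city:CID" -> products

for city, cid, product in purchases:
    city_index[city].add(cid)
    by_city_customer[f"{city}:{cid}"].append(product)

# Index table lookup: customers belonging to a city.
chennai_customers = sorted(city_index["Chennai"])

# Composite key lookup: products bought by customer C1 based in Chennai.
c1_products = by_city_customer["Chennai:C1"]

# Prefix scan over composite keys: all customer aggregates for one city.
chennai_keys = sorted(k for k in by_city_customer if k.startswith("Chennai:"))
```

Putting the city first in the key means that, in an ordered store, all of a city's customers sit next to each other and can be read with one prefix scan.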

Principles governing NoSQL models
• Inverted index: identify the set of customers belonging to a city (A) and the set of customers
purchasing a particular product category (B), and then perform an AND operation on the two sets.
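A minimal sketch of this lookup, assuming invented customer records: two inverted indexes map an attribute value to the set of matching customer IDs, and the AND operation is a set intersection.

```python
from collections import defaultdict

# Hypothetical customer records.
customers = [
    {"cid": "C1", "city": "Chennai", "categories": ["books", "shoes"]},
    {"cid": "C2", "city": "Chennai", "categories": ["shoes"]},
    {"cid": "C3", "city": "Madurai", "categories": ["books"]},
]

by_city = defaultdict(set)      # city -> customer IDs
by_category = defaultdict(set)  # product category -> customer IDs

for c in customers:
    by_city[c["city"]].add(c["cid"])
    for category in c["categories"]:
        by_category[category].add(c["cid"])

set_a = by_city["Chennai"]     # A: customers in the city
set_b = by_category["books"]   # B: customers buying the category
matches = set_a & set_b        # A AND B
```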

Principles governing NoSQL models
• Hierarchical modeling types: tree aggregation, adjacency lists, and nested sets
• Tree aggregation
  - Denormalization in NoSQL data stores helps in the
    development of tree aggregates
  - Product example tree aggregate

Adjacency lists
• Graph models inherently represent adjacency lists, where neighbours and
parent/child of a node can be identified based on the relationships. For example,
social networks can be represented using graph models to show the degree of
relatedness between the users.
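One way to read "degree of relatedness" is the number of hops between two users. The sketch below stores a small, invented friendship network as adjacency lists and measures that distance with breadth-first search; the user names and the function name are hypothetical.

```python
from collections import deque

# Adjacency lists: each user maps to the list of directly connected users.
friends = {
    "Asha": ["Ravi", "Meena"],
    "Ravi": ["Asha", "Kumar"],
    "Meena": ["Asha"],
    "Kumar": ["Ravi", "Latha"],
    "Latha": ["Kumar"],
}


def degree_of_separation(graph, start, goal):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


hops = degree_of_separation(friends, "Asha", "Latha")
```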

Nested sets
• Nested sets store the leaves of the tree in an array. Each non-leaf node is
mapped to a range of leaf nodes through start and end indices. This makes it
efficient to fetch all the leaf nodes of a given node without traversals, and it
occupies less memory. Insertions and updates are costly, as the addition of
one leaf node requires updating all indices.
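A minimal sketch of the nested-set layout, assuming an invented product tree: leaves live in one array, each non-leaf node stores an inclusive (start, end) index range, and an insertion has to touch the ranges of other nodes.

```python
# Leaves of a hypothetical product tree, stored in one array.
leaves = ["novel", "textbook", "sneakers", "sandals"]

# Each non-leaf node maps to an inclusive (start, end) range of leaf indices.
ranges = {"products": (0, 3), "books": (0, 1), "footwear": (2, 3)}


def leaves_of(node):
    """Fetch all leaves under a node with one slice, no tree traversal."""
    start, end = ranges[node]
    return leaves[start:end + 1]


def insert_leaf(leaf, index, ancestors):
    """Insert a leaf at `index`; every stored range may need updating."""
    leaves.insert(index, leaf)
    for node, (start, end) in ranges.items():
        if node in ancestors:
            ranges[node] = (start, end + 1)      # range grows by one leaf
        elif start >= index:
            ranges[node] = (start + 1, end + 1)  # range shifts to the right


before = leaves_of("books")
insert_leaf("magazine", 2, ancestors={"products", "books"})
after_books = leaves_of("books")
after_footwear = leaves_of("footwear")
```

Reads are a single slice, while the write loops over every node, which mirrors the cost trade-off described above.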

Flattening Nesting Documents
• The problem with nested documents is the complexity
of retrieving information from them.
• Nested documents can be flattened based
on levels or using proximity distance.
• Example: Ram has multiple skills, with a different
level of expertise in each skill
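Level-based flattening can be sketched as turning each nested path into one dotted field name. The document contents and the dotted-path convention below are assumptions for illustration; flattening by proximity distance is not shown.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into a single level using dotted path names."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical nested document for the example in the text.
ram = {
    "name": "Ram",
    "skills": {
        "java": {"level": "expert"},
        "python": {"level": "beginner"},
    },
}

flat_ram = flatten(ram)
```

After flattening, a field such as `skills.java.level` can be indexed and queried directly instead of walking the nested structure.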

CAP
• Eric Brewer proposed the consistency, availability, partition tolerance (CAP) theorem in
2000
• Consistency is the ability to obtain the same data from multiple replicas. Consistency
compliance ensures that all the cluster nodes have access to the same data.
• Availability is the ability of a system to continue its operation even when some
hardware/software components fail.
• Partition tolerance is the ability of the system to continue operation when the
network is partitioned due to network failures. It guarantees independence of various data
partitions.

Replication facilitates the availability of data. Eventual consistency ensures that replicas
do not remain stale. Partitioning ensures load distribution and scalability.
CAP
• Only 2 of the 3 properties can be satisfied at a time
  - AP: follows BASE properties with eventual consistency rather than strict
    consistency, e.g. Amazon's DynamoDB
  - CP: ACID properties with strict consistency. Pessimistic locking ensures consistency,
    e.g. MongoDB and Memcached
  - CA: cannot operate under network partitions
    and hence is neither ACID nor BASE. The two-phase
    commit protocol is used, e.g.
    relational databases and BigTable

BASE
• Web 2.0 applications
• Basically available, soft state and eventually consistent
• Works basically all the time
• Due to eventual consistency, maintains soft state
ACID                                          | BASE
Atomicity, Consistency, Isolation, Durability | Basically Available, Soft state, Eventually consistent
Strong consistency                            | Weak consistency
Consistency and Isolation first               | Availability first
Nested transactions                           | Approximate answers
Conservative                                  | Simple
Schema                                        | Schema-less

Consistency
• Strict consistency ensures that every read operation returns data from the latest completed write
operation. For distributed transactions, the two-phase commit protocol can be used.
• Eventual consistency lets readers see writes after a time lag. While updates are in progress,
the state is inconsistent. Updates are made on one replica, which eventually copies the changes
to the other replicas.
• Eventual consistency types:
(a) read your own writes (RYOW) consistency, where a client's own updates are visible to it
instantly, whereas updates made by other clients are not visible to it immediately.
(b) session consistency, where updates can be viewed only by read requests that follow
an update made by the client in the same session.
(c) causal consistency: if a client reads a version x and updates it as version y, then all other clients
reading version x will also see version y.
(d) monotonic read consistency, where, once a client has seen a version, its future
requests return that version or a newer one.
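Eventual consistency with read-your-own-writes can be simulated with a toy store. The `ReplicatedStore` class, its method names and the explicit `sync` step are hypothetical; real systems propagate updates asynchronously rather than on demand.

```python
class ReplicatedStore:
    """Toy store: one primary plus replicas that lag until sync() is called."""

    def __init__(self, replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replicas)]
        self.pending = []     # updates not yet copied to the replicas
        self.own_writes = {}  # client -> {key: value}, used for RYOW

    def write(self, client, key, value):
        self.primary[key] = value
        self.pending.append((key, value))
        self.own_writes.setdefault(client, {})[key] = value

    def read(self, client, key, replica=0):
        # RYOW: a client sees its own unsynchronised writes immediately.
        mine = self.own_writes.get(client, {})
        if key in mine:
            return mine[key]
        # Other clients read a replica, which may be stale.
        return self.replicas[replica].get(key)

    def sync(self):
        """Copy pending updates to every replica, in write order."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()
        self.own_writes.clear()  # replicas are now current for all clients


store = ReplicatedStore()
store.write("alice", "cart", ["book"])

alice_view = store.read("alice", "cart")     # own write, visible at once
bob_view_before = store.read("bob", "cart")  # replica has not caught up yet
store.sync()
bob_view_after = store.read("bob", "cart")   # visible after the time lag
```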
Partitioning Approaches
• Memory caches are partitioned, transient, in-memory databases that keep frequently used
data in main memory. Cached instances (objects) are launched into the processes in
various nodes.
• Clustering of database servers partitions data and maintains it among multiple servers.
• Separating reads from writes: separate write servers, designated as masters, hold the
latest version of the data. These servers propagate updates to slave/read servers.
• Sharding: data is partitioned so that data requested and updated together resides in
the same node. Data shards are replicated for load balancing and reliability. Shards can be
mapped to storage nodes statically or dynamically. Sharding is very well suited to NoSQL.
Simple hashing can be used to map
database objects (keys o) to n servers:
Partition = hash(o) mod n
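The formula above can be sketched directly. MD5 is used here only as an assumed stable hash (Python's built-in `hash()` for strings varies between processes); the key format and the choice of 4 servers are illustrative.

```python
import hashlib
from collections import Counter


def partition(key: str, n: int) -> int:
    """Partition = hash(o) mod n, using a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n


# Hypothetical keys spread over n = 4 servers.
keys = [f"user_{i}" for i in range(1000)]
placement = {key: partition(key, 4) for key in keys}
load = Counter(placement.values())

# Changing n remaps many keys, a known drawback of simple modulo hashing.
moved = sum(1 for key in keys if partition(key, 5) != placement[key])
```

The same key always lands on the same server for a fixed `n`, so no lookup table is needed to find a shard.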
Partitioning Approaches
• Sharding can be done based on attributes like user_id, location, time etc
• Shards are replicated for availability

Case Study
• Polyglot persistence applies multiple data storage technologies to meet the needs of an
application.
• Consider an e-commerce application with shopping cart, inventory, orders, catalogue and
customer details.
  - User sessions / activity logs require efficient read/write operations: KV stores
  - Point of sale has a high ingestion rate with a high volume of write operations: KV stores
    (storage); column family stores (analytics)
  - Shopping cart requires high availability and aggregates information: document store
  - Product catalogue has frequent reads and infrequent writes, and must also support
    aggregation: document stores
  - Product recommendations are made based on similar products or users: graph store
  - Financial data is relational and requires transactional updates: RDBMS

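The workload-to-store mapping of the case study can be written down as a small routing table. The workload labels and store-type strings are hypothetical names chosen for this sketch; a real application would map them to concrete client libraries.

```python
# Workload -> store types, following the case-study bullets.
STORE_FOR_WORKLOAD = {
    "user_sessions": ["kv"],
    "activity_logs": ["kv"],
    "point_of_sale": ["kv", "column_family"],  # storage, then analytics
    "shopping_cart": ["document"],
    "product_catalogue": ["document"],
    "recommendations": ["graph"],
    "financial_data": ["rdbms"],
}


def stores_for(workload):
    """Look up which store types serve a given workload."""
    return STORE_FOR_WORKLOAD[workload]


# Polyglot persistence: one application ends up using several store types.
distinct_stores = sorted(
    {store for stores in STORE_FOR_WORKLOAD.values() for store in stores}
)
```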


Case Study

Conclusion
• SQL vs NoSQL
• Limitations and advantages of NoSQL
• Types of NoSQL Stores with example
  - KV store
  - Column family
  - Document
  - Graph
• Comparison of NoSQL stores
• Principles of NoSQL models
• CAP
• BASE
• Polyglot persistence in ecommerce application
Thanks!
