
SRI KRISHNA COLLEGE OF TECHNOLOGY

[An Autonomous Institution | Affiliated to Anna University and Approved by AICTE | Accredited by NAAC with ‘A’ Grade]

KOVAIPUDUR, COIMBATORE – 641 042.

21ITE06/BIG DATA ANALYTICS


III YEAR /CSE/VI SEMESTER
MODULE-1
Module 1- Introduction

Session Topic
1.1 Types of Digital Data-Characteristics of Data – Evolution of Big Data -
Definition of Big Data – Challenges with Big Data.
1.2 3Vs of Big Data – Non-Definitional traits of Big Data – BI vs. Big Data - Data
warehouse and Hadoop environment – Coexistence.
1.3 Big Data Analytics: Classification of analytics – Data Science
1.4 Terminologies in Big Data – CAP Theorem – BASE Concept.
1.5 NoSQL: Types of Databases – Advantages – NewSQL - SQL vs. NOSQL vs
NewSQL.
1.6 Introduction to Hadoop: Features – Advantages - Versions – Overview of
Hadoop Eco systems
1.7 Hadoop distributions – Hadoop vs. SQL – RDBMS vs. Hadoop
1.8 Hadoop Components – Architecture-HDFS
1.9 Map Reduce: Mapper – Reducer - Combiner -Partitioner – Searching – Sorting
– Compression
1.10 Hadoop 2 (YARN): Architecture – Interacting with Hadoop Eco systems.
MODULE 1 Introduction to Big Data
1.5 NoSQL

Types of Databases – Advantages – NewSQL - SQL vs. NOSQL vs NewSQL.

Course Outcome:
Upon completion of the session, students shall have the ability to

CO4 Demonstrate NOSQL distributed database storage and processing [AP]


NoSQL
NoSQL stands for "Not Only SQL". NoSQL databases are
• non-relational
• open source
• distributed

Features of NoSQL
1. NoSQL databases are non-relational.
2. They are distributed.
3. No support for ACID properties; they adhere to the CAP theorem.
4. No fixed table schema.

NoSQL
Need for NoSQL
1. It has a scale-out architecture instead of the monolithic architecture of relational databases.
2. It can house large volumes of structured, semi-structured and unstructured data.
3. Dynamic Schema: It allows insertion of data without a predefined schema.
4. Auto Sharding: It automatically spreads data across an arbitrary number of servers or nodes in a cluster.
5. Replication: It offers good support for replication, which in turn provides high availability, fault tolerance and disaster recovery.
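
Auto sharding and replication can be illustrated with a minimal in-memory sketch. It assumes simple hash-modulo placement and stores each key on the primary node plus the next node(s) in order; real NoSQL systems use more elaborate schemes, and the `TinyCluster` class and its methods are hypothetical names used only for illustration.

```python
import hashlib


def shard_for(key, num_nodes):
    """Map a key to a node index with a stable hash (md5 here, an arbitrary choice)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def replica_nodes(key, num_nodes, replication_factor):
    """Primary node followed by the next (replication_factor - 1) nodes, wrapping around."""
    primary = shard_for(key, num_nodes)
    return [(primary + i) % num_nodes for i in range(replication_factor)]


class TinyCluster:
    """Hypothetical cluster: one dict per node, plus a set of nodes marked as down."""

    def __init__(self, num_nodes=4, replication_factor=2):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replication_factor = replication_factor
        self.down = set()

    def _replicas(self, key):
        return replica_nodes(key, len(self.nodes), self.replication_factor)

    def put(self, key, value):
        # Write the value to every replica of the key.
        for n in self._replicas(key):
            self.nodes[n][key] = value

    def get(self, key):
        # Read from the first replica that is still up.
        for n in self._replicas(key):
            if n not in self.down and key in self.nodes[n]:
                return self.nodes[n][key]
        raise KeyError(key)


cluster = TinyCluster()
cluster.put("user:1", "Praneeth")
cluster.down.add(shard_for("user:1", 4))   # simulate failure of the primary node
value_after_failure = cluster.get("user:1")  # still served by the replica
```

Because every key lives on two nodes in this sketch, losing one node does not lose the data, which is the availability benefit the list above attributes to replication.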

NoSQL: Types of Databases
They are broadly divided into
• Key-Value or big hash table
• Schema-less

1. Key-Value
It maintains a big hash table of keys and values.
Keys are unique.
It is fast, scalable and fault tolerant.
It can't model more complex data structures such as objects.
Eg. Dynamo, Redis, Riak etc.
Sample Key-Value pair database:
-------------------------------
Key      Value
Fname    Praneeth
Lname    Ch
-------------------------------

2. Document
It maintains data in collections constituted of documents.
Eg. MongoDB, Apache CouchDB, Couchbase, MarkLogic etc.
Sample Document in Document DB:
{
"Book Name": "Big Data and Analytics",
"Publisher": "Wiley India",
"Year": "2015"
}
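
The difference between the two samples can be sketched in plain Python, under the assumption that a dict stands in for a key-value store and a list of dicts stands in for a document collection. The `find` helper and the second book entry are hypothetical and exist only to show that documents in one collection need not share the same fields.

```python
# Key-value store: values are looked up only by their exact, unique key.
kv_store = {"Fname": "Praneeth", "Lname": "Ch"}

# Document collection: each document is a self-describing set of fields.
books = [
    {"Book Name": "Big Data and Analytics", "Publisher": "Wiley India", "Year": "2015"},
    # Hypothetical second document with a different set of fields (no fixed schema).
    {"Book Name": "Example Title", "Year": "2020", "Tags": ["illustration"]},
]


def find(collection, criteria):
    """Return documents whose fields match every field/value pair in criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


first_name = kv_store["Fname"]
wiley_books = find(books, {"Publisher": "Wiley India"})
```

The key-value store can only answer "what is stored under this key", while the document collection can be filtered on any field inside the documents.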
NoSQL: Types of Databases
3. Column
• Each storage block has data from only one column.
• It fetches only the column families of those columns that are required by a query.
Eg. Cassandra, HBase etc.
Sample column database:
UserProfile = {
Cassandra = { emailAddress:"casandra@apache.org" , age:"20"}
TerryCho = { emailAddress:"terry.cho@apache.org" , gender:"male"}
Cath = { emailAddress:"cath@apache.org" , age:"20", gender:"female", address:"Seoul"}
}

4. Graph
They are also called network databases.
A graph stores data in nodes.
Data model:
o (Property Graph) nodes and edges
   • Nodes may have properties
   • Edges may have labels or roles
o Key-value pairs on both
Eg. Neo4j, HyperGraphDB, InfiniteGraph etc.
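
The UserProfile sample can be regrouped by column to show why a column store reads only what a query needs. This is a rough sketch that mirrors the slide's description, not the on-disk format of Cassandra or HBase; `to_column_blocks` and `select` are hypothetical helper names.

```python
# The UserProfile sample from the slide, keyed by row.
user_profile = {
    "Cassandra": {"emailAddress": "casandra@apache.org", "age": "20"},
    "TerryCho": {"emailAddress": "terry.cho@apache.org", "gender": "male"},
    "Cath": {"emailAddress": "cath@apache.org", "age": "20",
             "gender": "female", "address": "Seoul"},
}


def to_column_blocks(rows):
    """Regroup row-keyed data so that each block holds the values of one column."""
    blocks = {}
    for row_key, columns in rows.items():
        for name, value in columns.items():
            blocks.setdefault(name, {})[row_key] = value
    return blocks


def select(blocks, wanted):
    """Touch only the blocks for the requested columns."""
    return {name: blocks.get(name, {}) for name in wanted}


blocks = to_column_blocks(user_profile)
ages = select(blocks, ["age"])
```

Rows that lack a column simply do not appear in that column's block, so sparse data such as TerryCho's missing age costs no storage in this layout.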
4. Graph
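
A property graph can be sketched as nodes with properties plus labelled edges that also carry key-value pairs. The `PropertyGraph` class and the sample people and labels are hypothetical; products such as Neo4j expose their own query languages and APIs, which are not shown here.

```python
class PropertyGraph:
    """Minimal property graph: nodes and edges both carry key-value properties."""

    def __init__(self):
        self.nodes = {}   # node_id -> dict of properties
        self.edges = []   # list of (source, label, target, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target, **props):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((source, label, target, props))

    def neighbors(self, node_id, label=None):
        """Targets of outgoing edges from node_id, optionally filtered by edge label."""
        return [
            target for source, edge_label, target, _ in self.edges
            if source == node_id and (label is None or edge_label == label)
        ]


g = PropertyGraph()
g.add_node("alice", name="Alice")
g.add_node("bob", name="Bob")
g.add_node("book1", title="Big Data and Analytics")
g.add_edge("alice", "FOLLOWS", "bob", since="2015")
g.add_edge("bob", "BOUGHT", "book1")

followed = g.neighbors("alice", "FOLLOWS")
# A simple recommendation: items bought by the people Alice follows.
suggestions = [item for person in followed for item in g.neighbors(person, "BOUGHT")]
```

Following edges from node to node in this way is the kind of traversal that makes graph stores a natural fit for network modelling and recommendations.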

NoSQL: Advantages
• Big Data Capability
• No Single Point of Failure
• Easy Replication
• It provides fast performance and horizontal scalability.
• Can handle structured, semi-structured, and unstructured data with equal effect
• NoSQL databases don't need a dedicated high-performance server
• It can serve as the primary data source for online applications.
• Excels at distributed database and multi-data centre operations
• Eliminates the need for a specific caching layer to store data
• Offers a flexible schema design which can easily be altered without downtime or service disruption
NoSQL: Disadvantages

• Limited query capabilities
• RDBMS databases and tools are comparatively more mature
• Most do not offer traditional database capabilities, such as consistency when multiple transactions are performed simultaneously.
• As the volume of data increases, it becomes difficult to maintain unique values as keys.
• Doesn't work as well with relational data
• Many options are open source, so they are less popular with enterprises.
• Little or no built-in support for join and group-by operations.
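
When a store offers no join or group-by, the application usually has to do that work itself. The sketch below assumes two separately fetched collections; the `users` and `orders` data and the `total_by_user` helper are hypothetical.

```python
from collections import defaultdict

# Two collections fetched separately from a store that cannot join them.
users = {"u1": {"name": "Asha"}, "u2": {"name": "Ravi"}}
orders = [
    {"order_id": 1, "user_id": "u1", "amount": 250},
    {"order_id": 2, "user_id": "u2", "amount": 100},
    {"order_id": 3, "user_id": "u1", "amount": 50},
]


def total_by_user(users, orders):
    """Group orders by user_id, then join each group to the user's name."""
    totals = defaultdict(int)
    for order in orders:                      # client-side GROUP BY user_id
        totals[order["user_id"]] += order["amount"]
    # Client-side JOIN: look up each user_id in the users collection.
    return {users[uid]["name"]: total for uid, total in totals.items()}


spend = total_by_user(users, orders)
```

In SQL the same result would come from a single `JOIN ... GROUP BY` statement, which is the gap in query capability that the list above points to.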

SQL                                                     NoSQL
------------------------------------------------------  ------------------------------------------------------
Relational database                                     Non-relational, distributed database
Relational model                                        Model-less approach
Pre-defined schema                                      Dynamic schema for unstructured data
Table based databases                                   Document-based, graph-based, wide column store or key-value pair databases
Vertically scalable (by increasing system resources)    Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL                                                No single standard language; UnQL (Unstructured Query Language) or database-specific APIs
Not preferred for large datasets                        Largely preferred for large datasets

SQL                                                       NoSQL
--------------------------------------------------------  --------------------------------------------------------
Not a best fit for hierarchical data                      Best fit for hierarchical storage, as it stores data as key-value pairs similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties                               Follows Brewer's CAP theorem
Excellent support from vendors                            Relies heavily on community support
Supports complex querying and data keeping needs          Does not have good support for complex querying
Can be configured for strong consistency                  A few support strong consistency (e.g., MongoDB); a few others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc.    Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
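
Two rows of the comparison, pre-defined versus dynamic schema and hierarchical data, can be illustrated with Python's standard `sqlite3` and `json` modules. This is only a sketch: SQLite stands in for a relational database, a JSON document stands in for a NoSQL record, and the table names and author names are placeholders.

```python
import json
import sqlite3

# Relational side: the schema is declared up front and hierarchy is split across tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, name TEXT NOT NULL, publisher TEXT NOT NULL);
    CREATE TABLE author (book_id INTEGER REFERENCES book(id), name TEXT NOT NULL);
""")
conn.execute("INSERT INTO book VALUES (1, 'Big Data and Analytics', 'Wiley India')")
conn.executemany("INSERT INTO author VALUES (1, ?)", [("Author A",), ("Author B",)])

# Reassembling the hierarchy needs a join.
rows = conn.execute(
    "SELECT b.name, a.name FROM book b JOIN author a ON a.book_id = b.id ORDER BY a.name"
).fetchall()

# A column that is not in the schema is rejected.
try:
    conn.execute("INSERT INTO book (id, name, publisher, year) VALUES (2, 'X', 'Y', '2015')")
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True

# Document side: the hierarchy is stored in one nested record, and fields can vary.
doc = json.loads("""
{"Book Name": "Big Data and Analytics",
 "Publisher": "Wiley India",
 "Year": "2015",
 "Authors": ["Author A", "Author B"]}
""")
conn.close()
```

The relational version needs two tables and a join to recover what the document holds in a single nested record, while the document version gives up the schema check that rejected the unknown `year` column.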

Where to use NoSQL
Key-Value
• Shopping carts
• Web user data analysis
• Amazon, LinkedIn
Document based
• Real-time analysis
• Logging
• Document archive management
Column-oriented
• Analyze huge web user actions
• Sensor feeds
• Facebook, Twitter, eBay, Netflix
Graph-based
• Network modeling
• Recommendation
• Walmart: upsell, cross-sell
Real-time applications of NoSQL in Big Data Analytics

• HBase, the NoSQL database for Hadoop, is used by Facebook for its messaging infrastructure.
• HBase is used by Twitter for generating, storing, logging, and monitoring data around people search.
• HBase is used by the discovery engine StumbleUpon for data analytics and storage.
• MongoDB is another NoSQL database, used by CERN, the European Organization for Nuclear Research, for collecting data from its huge particle collider, the Large Hadron Collider.
• LinkedIn, Orbitz, and Concur use the Couchbase NoSQL database for various data processing and monitoring tasks.

NewSQL

NewSQL supports the relational data model and uses SQL as its primary interface.

NewSQL Characteristics:
• SQL interface for application interaction
• ACID support for transactions
• An architecture that provides higher per-node performance vis-à-vis traditional RDBMS solutions
• Scale-out, shared-nothing architecture
• Non-locking concurrency control mechanism so that real-time reads will not conflict with writes.
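
The slides do not name the non-locking mechanism. One widely used approach is multi-version concurrency control (MVCC), which is assumed in the toy sketch below: writers append new versions and readers see the newest version at or before their snapshot. The `VersionedStore` class is hypothetical and single-threaded, and only shows the idea.

```python
class VersionedStore:
    """Toy multi-version store: writes add versions, reads never wait for writers."""

    def __init__(self):
        self._versions = {}   # key -> list of (commit_ts, value), oldest first
        self._clock = 0       # logical commit counter

    def write(self, key, value):
        """Commit a new version instead of overwriting the old one."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Timestamp that a reader uses for a consistent view."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Newest version committed at or before the reader's snapshot."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = VersionedStore()
store.write("balance", 100)
reader_snapshot = store.snapshot()      # a long-running read starts here
store.write("balance", 80)              # a concurrent write commits afterwards

seen_by_reader = store.read("balance", reader_snapshot)
seen_by_new_reader = store.read("balance", store.snapshot())
```

The earlier reader keeps seeing the value from its own snapshot while the write proceeds, so neither side has to take a lock on the other.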
SQL vs. NOSQL vs NewSQL.
                              SQL                             NoSQL                           NewSQL
----------------------------  ------------------------------  ------------------------------  ----------------
Adherence to ACID properties  Yes                             No                              Yes
OLTP/OLAP                     Yes                             No                              Yes
Schema rigidity               Yes                             No                              Maybe
Adherence to data model       Adherence to relational model   -                               -
Data Format Flexibility       No                              Yes                             Maybe
Scalability                   Scale up (vertical scaling)     Scale out (horizontal scaling)  Scale out
Distributed Computing         Yes                             Yes                             Yes
Community Support             Huge                            Growing                         Slowly growing
Test your Knowledge
Next Session…

1.6 Introduction to Hadoop: Features – Advantages - Versions – Overview of Hadoop Eco systems
