
No SQL?

• NoSQL stands for:


• Non Relational
• No RDBMS
• Not Only SQL
• NoSQL is an umbrella term for all databases and data stores that don’t
follow the RDBMS principles
• A class of products
• A collection of several (related) concepts about data storage and
manipulation
• Often related to large data sets
NoSQL
• NoSQL can be defined as a database used for managing massive collections of unstructured data, where the data is not stored in the tabular relations used by relational databases.
• The term NoSQL comes from "non SQL" or "non-relational".
• NoSQL Database is a non-relational Data Management System, that does not require a fixed schema.
• It avoids joins, and is easy to scale.
• The major purpose of using a NoSQL database is for distributed data stores .
• NoSQL is used for Big data and real-time web apps.
• For example, companies like Twitter, Facebook and Google collect terabytes of user data every single day.

• Existing relational databases have been unsuccessful in solving several complex modern problems, such as:
• A dynamic change in the nature of data – nowadays data is structured, semi-structured, unstructured and polymorphic in type.
• The variety of applications, and the types of data fed into them for analysis, have become more diverse and distributed and are approaching cloud-oriented deployment.
• Also, modern applications and services serve tens of thousands of users in diverse geo-locations and time zones, so data integrity needs to be maintained at all times.
Structured v/s. Unstructured data
• Structured data is in a well-defined format, usually text that can be represented in tabular form; such data can also be smoothly represented in chart-like form, and data mining tools can process it efficiently.
• Unstructured data, by contrast, is in disorganized formats (such as document files, image files, video files, icons, etc.). Structured data can be mined out of unstructured data, but this process usually takes a lot of time.
• Modern-day data generated from different applications, services, or sources is a combination of both structured and unstructured data.
• So, you will need something to store such data to make your application work properly. NoSQL-based languages and scripts can help in this regard.
RDBMS Characteristics
• Data stored in columns and tables
• Relationships represented by data
• Data Manipulation Language
• Data Definition Language
• Transactions
• Abstraction from physical layer
• Applications specify what, not how
• Physical layer can change without modifying applications
• Create indexes to support queries
• In Memory databases
Transactions – ACID Properties
• Atomic – All of the work in a transaction completes (commit) or none of it completes
• a transaction to transfer funds from one account to another involves making a withdrawal operation from the first
account and a deposit operation on the second. If the deposit operation failed, you don’t want the withdrawal operation
to happen either.
• Consistent – A transaction transforms the database from one consistent state to another consistent state.
Consistency is defined in terms of constraints.
• a database tracking a checking account may only allow unique check numbers to exist for each transaction
• Isolated – The results of any changes made during a transaction are not visible until the transaction has
committed.
• a teller looking up a balance must be isolated from a concurrent transaction involving a withdrawal from the same
account. Only when the withdrawal transaction commits successfully and the teller looks at the balance again will the
new balance be reported.
• Durable – The results of a committed transaction survive failures
• A system crash or any other failure must not be allowed to lose the results of a transaction or the contents of the
database. Durability is often achieved through separate transaction logs that can "re-create" all transactions from some
picked point in time (like a backup).
NoSQL, No ACID
• RDBMSs are based on ACID (Atomicity, Consistency, Isolation, and
Durability) properties
• NoSQL
• Does not give importance to ACID properties
• In some cases completely ignores them
• In distributed parallel systems it is difficult/impossible to ensure ACID
properties
• Long-running transactions don't work because keeping resources blocked for a long time is
not practical
Features of NoSQL
• Non-relational
• NoSQL databases never follow the relational model
• Never provide tables with flat fixed-column records
• Work with self-contained aggregates or BLOBs
• Doesn’t require object-relational mapping and data normalization
• No complex features like query languages, query planners, referential integrity joins, ACID
• Schema-free
• NoSQL databases are either schema-free or have relaxed schemas
• Do not require any sort of definition of the schema of the data
• Offers heterogeneous structures of data in the same domain
• Simple API
• Offers easy to use interfaces for storage and querying data provided
• APIs allow low-level data manipulation & selection methods
• Text-based protocols mostly used with HTTP REST with JSON
• Mostly no standards-based NoSQL query language is used
• Web-enabled databases running as internet-facing services
• Distributed
• Multiple NoSQL databases can be executed in a distributed fashion
• Offers auto-scaling and fail-over capabilities
• Often ACID concept can be sacrificed for scalability and throughput
• Mostly no synchronous replication between distributed nodes; instead asynchronous multi-master replication, peer-to-peer, or HDFS replication
• Only providing eventual consistency
• Shared Nothing Architecture. This enables less coordination and higher distribution.
Advantages of NoSQL
• Can be used as Primary or Analytic Data Source
• Big Data Capability
• No Single Point of Failure
• Easy Replication
• It provides fast performance and horizontal scalability.
• Can handle structured, semi-structured, and unstructured data with equal effect
• Object-oriented programming interfaces that are easy to use and flexible
• NoSQL databases don’t need a dedicated high-performance server
• Support Key Developer Languages and Platforms
• Simpler to implement than an RDBMS
• It can serve as the primary data source for online applications.
• Handles big data which manages data velocity, variety, volume, and complexity
• Excels at distributed database and multi-data center operations
• Offers a flexible schema design which can easily be altered without downtime or service disruption
Disadvantages of NoSQL
1.Narrow focus –
NoSQL databases have a very narrow focus: they are mainly designed for storage and provide very little functionality beyond it.
Relational databases are a better choice in the field of transaction management than NoSQL.
2.Open-source –
NoSQL databases are open source, and there is no reliable standard for NoSQL yet. In other words, two database systems
are likely to be incompatible.
3.Management challenge –
The purpose of big data tools is to make management of a large amount of data as simple as possible. But it is not so
easy. Data management in NoSQL is much more complex than a relational database. NoSQL, in particular, has a
reputation for being challenging to install and even more hectic to manage on a daily basis.
4.GUI is not available –
GUI tools to access and manage NoSQL databases are not widely available in the market.
5.Backup –
Backup is a great weak point for some NoSQL databases like MongoDB. MongoDB has no approach for the backup
of data in a consistent manner.
6.Large document size –
Some database systems like MongoDB and CouchDB store data in JSON format. Which means that documents are
quite large (BigData, network bandwidth, speed), and having descriptive key names actually hurts, since they
increase the document size.
Why NoSQL?
• Schemaless data representation: Almost all NoSQL implementations offer schemaless data
representation. This means that you don’t have to think too far ahead to define a structure and you
can continue to evolve over time—including adding new fields or even nesting the data, for example,
in case of JSON representation.
• Development time: reduced development time because one doesn't have to deal with complex SQL queries (such as JOIN queries to collate data across multiple tables).
• Speed: Even with the small amount of data that you have, if you can deliver in milliseconds rather
than hundreds of milliseconds—especially over mobile and other intermittently connected devices—
you have much higher probability of winning users over.
• Plan ahead for scalability: your application can be quite elastic—it can handle sudden spikes of load.
SchemaLess data representation
Types of data models in NoSQL Databases
• NoSQL Databases are mainly categorized into four types: Key-value
pair, Column-oriented, Graph-based and Document-oriented.
• Every category has its unique attributes and limitations.
• None of the above databases is better at solving all problems.
• Users should select the database based on their product needs.
• Types of NoSQL Databases:
• Key-value Pair Based
• Column-oriented
• Graphs based
• Document-oriented
Types of NoSQL Databases
Key Value Pair Based
• Data is stored in key/value pairs. It is designed in such a way to handle lots of
data and heavy load.
• Key-value pair storage databases store data as a hash table where each key is
unique, and the value can be a JSON, BLOB(Binary Large Objects), string, etc.
• For example, a key-value pair may contain a key like “sname” associated with
a value like “Priya”.
• This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc. Key-value stores help the developer to store schema-less data. They work best for shopping cart contents.
• Redis, Dynamo and Riak are some NoSQL examples of key-value store databases; Dynamo and Riak are based on Amazon's Dynamo paper.
Column-based
• Column-oriented databases work on columns and
are based on BigTable paper by Google.
• Every column is treated separately.
• Values of a single column are stored contiguously.
• They deliver high performance on aggregation
queries like SUM, COUNT, AVG, MIN etc. as the data is
readily available in a column.
• Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM, library card catalogs, etc.
• HBase, Cassandra and Hypertable are examples of column-based databases.
Document-Oriented:
• Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value part is
stored as a document.
• The document is stored in JSON or XML formats. The value is understood by the DB and can be
queried.
• The document type is mostly used for CMS systems, blogging platforms, real-time analytics & e-
commerce applications.
• It should not be used for complex transactions which require multiple operations or queries against varying aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular document-oriented DBMS systems.
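• As an illustration, a hedged sketch of what a single self-contained document in such a store might look like (the field names and values are invented for the example):

  // a hypothetical blog post stored as one document, including its comments
  {
     "_id": "post-101",
     "title": "MongoDB Overview",
     "author": { "name": "Priya", "email": "priya@example.com" },
     "tags": ["nosql", "document-store"],
     "comments": [
        { "user": "aakash",  "text": "Nice introduction", "likes": 3 },
        { "user": "radhika", "text": "Very helpful",      "likes": 5 }
     ]
  }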
Graph-Based
• A graph type database stores entities as well as the relations amongst those entities.
• The entity is stored as a node with the relationship as edges.
• An edge gives a relationship between nodes.
• Every node and edge has a unique identifier.
• Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature.
• Traversing relationships is fast as they are already captured in the DB, and there is no need to calculate them.
• Graph-based databases are mostly used for social networks, logistics and spatial data.
• Neo4J, Infinite Graph, OrientDB, FlockDB are some popular
graph-based databases.
List of NoSQL Databases
Choosing Right NoSQL Database
• The Data Model
• It involves the type of data that you need to store. NoSQL databases differ mainly in the data model they use, and a mismatch between the NoSQL solution's data model and the target application can make or break the success of the project which you are building.

• Data-Scaling expectations
• The next question is how large an application is expected to grow and how much data-scale support will be needed. Some NoSQL databases are memory-based and do not scale across multiple machines, whereas others, like Cassandra, scale linearly across many machines.

• Choosing a Data Model


• To choose a data model, it is first required that we check what type of data is required to be stored. Depending on that, we
choose the data model.
• If the data is to be represented in the form of a graph, then the graph database is used.
• If we have to store the data in the form of key-value pairs, then a key-value database or a document database is chosen.
• Different data models are used to solve different problems. For example, rather than solving a graph problem with a relational database, it is better to solve it with a graph database.
BASE in NOSQL
• BASE , invented by Eric Brewer:
• Basic availability: Each request is guaranteed a response—successful or failed
execution.
• Soft state: The state of the system may change over time, at times without
any input (for eventual consistency).
• Eventual consistency: The database may be momentarily inconsistent but will
be consistent eventually.
• Eric Brewer also noted that it is impossible for a distributed computer
system to provide consistency, availability and partition tolerance
simultaneously. This is more commonly referred to as the CAP
theorem
CAP Theorem
• In a distributed system, managing consistency (C), availability (A) and partition tolerance (P) is important.
• Eric Brewer put forth the CAP theorem which
states that in any distributed system we can choose
only two of consistency, availability or partition
tolerance.
• The concept of consistency(C), availability(A)
and partition tolerance(P) across distributed
systems gives rise to the need for CAP
theorem.
• But CAP theorem demonstrates that any
distributed system cannot guarantee C, A, and
P simultaneously.
CAP Theorem
• Suppose three properties of a distributed system (sharing data)
• Consistency:
• all copies have same value
• Availability:
• reads and writes always succeed
• Partition-tolerance:
• system properties (consistency and/or availability) hold even when network failures
prevent some machines from communicating with others

CAP Theorem
• Brewer’s CAP Theorem:
• For any system sharing data, it is “impossible” to guarantee simultaneously all
of these three properties
• You can have at most two of these three properties for any shared-data
system
• Very large systems will “partition” at some point:
• That leaves either C or A to choose from (traditional DBMS prefers C over A
and P )
• In almost all cases, you would choose A over C (except in specific applications
such as order processing)

CAP Theorem
• Consistency: all clients always have the same view of the data
CAP Theorem
• Consistency
• 2 types of consistency:
1. Strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
2. Weak consistency – BASE (Basically Available Soft-state Eventual
consistency)

CAP Theorem
• ACID
• A DBMS is expected to support “ACID transactions,” processes that are:
• Atomicity: either the whole process is done or none is
• Consistency: only valid data are written
• Isolation: one operation at a time
• Durability: once committed, it stays that way

• CAP
• Consistency: all nodes in the cluster see the same copies of the data
• Availability: cluster always accepts reads and writes
• Partition tolerance: guaranteed properties are maintained even when network failures
prevent some machines from communicating with others

CAP Theorem
• A consistency model determines rules for visibility and apparent order of
updates
• Example:
• Row X is replicated on nodes M and N
• Client A writes row X to node N
• Some period of time t elapses
• Client B reads row X from node M
• Does client B see the write from client A?
• Consistency is a continuum with tradeoffs
• For NOSQL, the answer would be: “maybe”
• CAP theorem states: “strong consistency can't be achieved at the same time as
availability and partition-tolerance”

CAP Theorem
• Eventual consistency
• When no updates occur for a long period of time, eventually all updates will propagate
through the system and all the nodes will be consistent
• Cloud computing
• ACID is hard to achieve, moreover, it is not always required, e.g. for blogs,
status updates, product listings, etc.

CAP Theorem
• Availability: each client can always read and write
CAP Theorem
• Partition tolerance: the system can continue to operate in the presence of network partitions
Consistency in CAP Theorem
• When data is stored on multiple nodes, all the nodes should see the same
data, meaning, that when the data is updated at one node then the same
update should be made at the other nodes storing the same data also.
• For example, if we perform a read operation, it will return the value of the
most recent write operation causing all nodes to return the same data.
• A system is said to be in a consistent state, if the transaction starts with
the system in a consistent state, and ends with a system in a consistent
state.
• In this model, a system can shift into an inconsistent state during a
transaction but, in this case, the entire transaction gets rolled back if there
is an error at any stage in the process.
Availability in CAP Theorem
• To achieve a high order of availability, the system should remain operational 100% of the time, so that we can get a response at any time.
• According to this, whenever a user makes a request, the user should be able to get a response regardless of the state of the system.
Partition Tolerance in CAP Theorem
• According to this, a system should work despite message loss or partition failure.
• A system that is partition-tolerant can sustain any amount of network failure
• It does not result in a failure of the entire network.
• Storage systems that fall under CP (consistency with partition tolerance) include MongoDB, Redis, AppFabric Caching and MemcacheDB.
• Databases that come under partition tolerance are those which store their data on multiple nodes.
• Relational data models are required to follow the ACID (Atomicity, Consistency, Isolation, Durability) properties, but with NoSQL databases it is not possible for a data storage structure to guarantee all of C, A and P.
• NoSQL data storage models therefore fall under one of the following combinations, since it is not possible to follow all three –
• CA(Consistency and Availability)
• AP(Availability with partition Tolerance)
• CP(consistency with partition Tolerance)
MongoDB
• MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability.
• MongoDB works on the concept of collections and documents.
• Database
• Database is a physical container for collections. Each database gets its own set of files on the file system. A single MongoDB server
typically has multiple databases.
• Collection
• Collection is a group of MongoDB documents.
• It is the equivalent of an RDBMS table.
• A collection exists within a single database. Collections do not enforce a schema.
• Documents within a collection can have different fields.
• Typically, all documents in a collection are of similar or related purpose.
• Document
• A document is a set of key-value pairs. Documents have dynamic schema.
• Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common
fields in a collection's documents may hold different types of data.
Relationship of RDBMS terminology with
MongoDB
RDBMS              MongoDB
Database           Database
Table              Collection
Tuple/Row          Document
Column             Field
Table Join         Embedded Documents
Primary Key        Primary Key (default key _id provided by MongoDB itself)

Database Server and Client
RDBMS              MongoDB
mysqld / oracle    mongod
mysql / sqlplus    mongo
Working with mongoDB
• Step 1: Create a folder named "data" where the MongoDB data files can be stored, for example: c:\> md data
• Step 2: Start the server from a cmd prompt:
  C:\Program Files\MongoDB\Server\5.0\bin> mongod.exe --dbpath "c:\data"
• Step 3: Start another cmd prompt and launch the shell client:
  C:\Program Files\MongoDB\Server\5.0\bin> mongo.exe
Data Modelling
• Data Model Design
• MongoDB provides two types of data
models: —
• Embedded data model and
• Normalized data model.
• Based on the requirement, you can use
either of the models while preparing your
document.
• Embedded Data Model
• In this model, you can have (embed) all the related data in a single document; it is also known as the de-normalized data model.
• For example, assume we are storing the details of employees in three different documents, namely Personal_details, Contact and Address; you can embed all three in a single document, as sketched below.
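• The slide's example document is not reproduced here; a minimal sketch of such an embedded employee document, with field names assumed for illustration, might look like this:

  // one employee document with Personal_details, Contact and Address embedded
  {
     _id: ObjectId(),                    // generated by MongoDB
     Emp_ID: "10025AE336",
     Personal_details: {
        First_Name: "Radhika",
        Last_Name: "Sharma",
        Date_Of_Birth: "1995-09-26"
     },
     Contact: {
        e_mail: "radhika_sharma.123@gmail.com",
        phone: "9848022338"
     },
     Address: {
        city: "Hyderabad",
        Area: "Madapur",
        State: "Telangana"
     }
  }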
Data Modelling
• Normalized Data Model
• In this model, you refer to the sub-documents from the original document using references, as sketched below.
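• A minimal sketch of the same employee split into referenced documents (the _id values and collection layout are assumed for illustration):

  // employee document holds only its own fields
  { _id: "emp1001", Emp_ID: "10025AE336" }

  // detail documents live in separate collections and point back via empDocID
  { _id: "pd1", empDocID: "emp1001", First_Name: "Radhika", Last_Name: "Sharma", Date_Of_Birth: "1995-09-26" }
  { _id: "ct1", empDocID: "emp1001", e_mail: "radhika_sharma.123@gmail.com", phone: "9848022338" }
  { _id: "ad1", empDocID: "emp1001", city: "Hyderabad", Area: "Madapur", State: "Telangana" }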
Database commands
• The use Command
• MongoDB use DATABASE_NAME is used to create database.
• The command will create a new database if it doesn't exist, otherwise it will return the existing database.
• In MongoDB default database is test.
• If you didn't create any database, then collections will be stored in test database.
• Syntax-use DATABASE_NAME
• Example-use studentdatabase

• The dropDatabase() Method


• MongoDB's db.dropDatabase() command is used to drop an existing database.
• Syntax-db.dropDatabase()
• This will delete the selected database. If you have not selected any database, then it will delete the default 'test' database.

• Show command-
• To display the list of all databases
• Syntax-show dbs
• To display the currently selected database
• Syntax-db
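• A short shell session tying these commands together (the database name is just an example):

  > use studentdatabase      // creates the database if it does not exist, and switches to it
  > db                       // prints the currently selected database: studentdatabase
  > show dbs                 // lists all databases that already contain data
  > db.dropDatabase()        // drops the currently selected database

• Note that a newly created database appears in show dbs only after it holds at least one document.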


Create Collection
• The createCollection() Method
• MongoDB db.createCollection(name, options) is used to create a collection.
• Syntax-db.createCollection(name, options)
• Example-use studentdb
          db.createCollection("mycollection")
• In the command, name is the name of the collection to be created.
• Options is a document and is used to specify the configuration of the collection.
• While inserting a document, MongoDB first checks the size field of a capped collection, then it checks the max field.

Parameters of createCollection(name, options):
• name (String) − Name of the collection to be created.
• options (Document) − (Optional) Specifies options about memory size and indexing.

Fields of the options document:
• capped (Boolean) − (Optional) If true, enables a capped collection. A capped collection is a fixed-size collection that automatically overwrites its oldest entries when it reaches its maximum size. If you specify true, you need to specify the size parameter also.
• autoIndexId (Boolean) − (Optional) If true, automatically creates an index on the _id field. Its default value is false.
• size (Number) − (Optional) Specifies a maximum size in bytes for a capped collection. If capped is true, then you need to specify this field also.
• max (Number) − (Optional) Specifies the maximum number of documents allowed in the capped collection.
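• A sketch of creating a plain collection and a capped collection with these options (collection names and sizes are just examples):

  > use studentdb
  > db.createCollection("mycollection")              // plain collection, default options
  > db.createCollection("logcollection", {
        capped: true,                                // fixed-size collection
        size: 1048576,                               // at most 1 MB of data
        max: 5000                                    // at most 5000 documents
    })
  > show collections                                 // both collections should now be listed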
Show collection and Drop collection
• Show collection-
• You can check the created collection by using the command show
collections.
• Syntax-Show collections
• Drop Collection
• MongoDB's db.collection.drop() is used to drop a collection from the
database.
• Syntax-db.collectionname.drop()
• Example-db.mycollection.drop()
MongoDB - Datatypes
• String − This is the most commonly used datatype to store the data. String in MongoDB must be UTF-8 valid.
• Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit depending upon your
server.
• Boolean − This type is used to store a boolean (true/ false) value.
• Double − This type is used to store floating point values.
• Min/ Max keys − This type is used to compare a value against the lowest and highest BSON(binary javascript
object notation) elements.
• Arrays − This type is used to store arrays or list or multiple values into one key.
• Timestamp − This can be handy for recording when a document has been modified or added.
• Object − This datatype is used for embedded documents.
• Null − This type is used to store a Null value.
• Symbol − This datatype is used identically to a string; however, it's generally reserved for languages that use
a specific symbol type.
• Date − This datatype is used to store the current date or time in UNIX time format. You can specify your own
date time by creating object of Date and passing day, month, year into it.
• Object ID − This datatype is used to store the document’s ID.
• Binary data − This datatype is used to store binary data.
• Code − This datatype is used to store JavaScript code into the document.
• Regular expression − This datatype is used to store regular expression.
The insert() Method
• To insert data into a MongoDB collection, you need to use MongoDB's insert() or save() method.
• Syntax-db.collectionname.insert(document)
• Example-db.createCollection("users")
  db.users.insert(
    {
      rollno: 100,
      title: "MongoDB Overview",
      description: "MongoDB is no sql database",
      likes: 100
    }
  )
• Output-WriteResult({ "nInserted" : 1 })
The insertOne() method
• If you need to insert only one document into a collection you can use this method.
• Syntax-db.collectionname.insertOne(document)
• Example-
  db.empDetails.insertOne(
    {
      First_Name: "Radhika",
      Last_Name: "Sharma",
      Date_Of_Birth: "1995-09-26",
      e_mail: "radhika_sharma.123@gmail.com",
      phone: "9848022338"
    }
  )
• Output-
  {
    "acknowledged" : true,
    "insertedId" : ObjectId("5dd62b4070fb13eec3963bea")
  }
The insertMany() method
• You can insert multiple documents using the insertMany() method.
• To this method you need to pass an array of documents, as in the sketch below.
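• A minimal sketch, assuming the empDetails collection used in the earlier examples:

  > db.empDetails.insertMany(
      [
        { First_Name: "Radhika", Last_Name: "Sharma",      Age: "26" },
        { First_Name: "Rachel",  Last_Name: "Christopher", Age: "27" },
        { First_Name: "Fathima", Last_Name: "Sheik",       Age: "24" }
      ]
    )

• The shell acknowledges the write and reports the generated _id of each inserted document.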
MongoDB searching methods
• find() method
• To query data from MongoDB collection, you need to use
MongoDB's find() method.
• find() method will display all the documents in a non-structured way.
• Syntax-db.collectionname.find()
• The findOne() method
• Apart from the find() method, there is findOne() method, that returns only one
document.
• Syntax-db.collectionname.findOne()
• The pretty() Method
• To display the results in a formatted way, you can use pretty() method.
• Syntax-db.collectionname.find().pretty()
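• A short sketch combining these methods on the student collection used elsewhere in these notes:

  > db.student.find()                              // all documents, unformatted
  > db.student.find({ "rollno": 101 }).pretty()    // only matching documents, formatted output
  > db.student.findOne({ "sname": "Priya" })       // the first matching document only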
RDBMS Where Clause Equivalents in
MongoDB
• Equality − Syntax: {<key>:<value>} − Example: db.student.find({"rollno":101}).pretty() − RDBMS: where rollno = 101
• Less Than − Syntax: {<key>:{$lt:<value>}} − Example: db.mycol.find({"likes":{$lt:50}}).pretty() − RDBMS: where likes < 50
• Less Than Equals − Syntax: {<key>:{$lte:<value>}} − Example: db.mycol.find({"likes":{$lte:50}}).pretty() − RDBMS: where likes <= 50
• Greater Than − Syntax: {<key>:{$gt:<value>}} − Example: db.mycol.find({"likes":{$gt:50}}).pretty() − RDBMS: where likes > 50
• Greater Than Equals − Syntax: {<key>:{$gte:<value>}} − Example: db.mycol.find({"likes":{$gte:50}}).pretty() − RDBMS: where likes >= 50
• Not Equals − Syntax: {<key>:{$ne:<value>}} − Example: db.mycol.find({"likes":{$ne:50}}).pretty() − RDBMS: where likes != 50
• Values in an array − Syntax: {<key>:{$in:[<value1>,<value2>,…,<valueN>]}} − Example: db.mycol.find({"name":{$in:["Raj","Ram","Raghu"]}}).pretty() − RDBMS: where name matches any of the values in ["Raj", "Ram", "Raghu"]
• Values not in an array − Syntax: {<key>:{$nin:[<value1>,…,<valueN>]}} − Example: db.mycol.find({"name":{$nin:["Ramu","Raghav"]}}).pretty() − RDBMS: where name is not in the array ["Ramu", "Raghav"], or doesn't exist at all
AND in MongoDB

• To query documents based on the AND condition, you need to use the $and keyword.
• Syntax-db.mycol.find({ $and: [ {<key1>:<value1>}, {<key2>:<value2>} ] })
• Example-
  db.student.find(
    {
      $and:
        [ {"rollno":104}, {"sname":"aakash"} ]
    }
  )
OR in MongoDB
• To query documents based on the OR condition, you need to use $or
keyword.
• Syntax-db.mycol.find({ $or: [ {<key1>:<value1>}, { <key2>:<value2>} ] })

• Example-
  db.student.find(
    {
      $or: [
        {"rollno":104}, {"sname":"aakash"}
      ]
    }
  )
NOT in MongoDB
• To query documents based on the NOT condition, you need to use the $not keyword.
• Example-
  db.empDetails.find(
    {
      "Age": { $not: { $gt: "25" } }
    }
  )
• Example-db.student.find({"age":{$not:{$gt:23}}})
Update Command
• MongoDB's update() and save() methods are used to update document into a collection.
• The update() method updates the values in the existing document while the save() method
replaces the existing document with the document passed in save() method.
• MongoDB Update() Method
• The update() method updates the values in the existing document.
• Syntax->db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)
• >db.student.update({'sname':'priya'},{$set:{'sname':'Anjali'}})
• By default, MongoDB will update only a single document.
• To update multiple documents, you need to set the parameter 'multi' to true.
• >db.student.update({'sname':'priya'},{$set:{'sname':'Anjali'}},{multi:true})
• db.emp.update({'age':23},{$set:{'cname':"avinash"}})
Update Command
• Save() Method
• The save() method replaces the existing document with the new document passed in
the save() method.
• >db.COLLECTION_NAME.save({_id:ObjectId(),NEW_DATA})
• Example-
• >db.mycol.save( { "_id" : ObjectId("507f191e810c19729de860ea"), "title" : "MONGODB", "by" : "MONGODB with NOSQL" } )
• To check the update, use the following command
• >db.mycol.find()
Update Command
• findOneAndUpdate() method
• The findOneAndUpdate() method updates the values in the existing
document.
• Syntax- db.COLLECTION_NAME.findOneAndUpdate(SELECTION_CRITERIA, UPDATED_DATA)
• updateOne() method
• This methods updates a single document which matches the given filter.
• Syntax->db.COLLECTION_NAME.updateOne(<filter>, <update>)
• > db.empDetails.updateOne( {First_Name: 'Radhika'},
{ $set: { Age: '30',e_mail:
'radhika_newemail@gmail.com'}} )
• updateMany() method
• The updateMany() method updates all the documents that
matches the given filter.
• Syntax->db.COLLECTION_NAME.updateMany(<filter>, <update>)
• Example-
  > db.empDetails.updateMany(
      { Age: { $gt: "25" } },
      { $set: { Age: '00' } }
    )
Delete Document
• The remove() Method
• MongoDB's remove() method is used to remove a document from the collection.
• remove() method accepts two parameters. One is deletion criteria and the second is the justOne flag.
• deletion criteria − (Optional) deletion criteria according to which documents will be removed.
• justOne − (Optional) if set to true or 1, then remove only one document.
• Syntax->db.COLLECTION_NAME.remove(DELETION_CRITERIA)
• >db.student.remove({'sname':'ajay'})

• Remove Only One
• If there are multiple records and you want to delete only the first record, then set the justOne parameter in the remove() method.
• Syntax->db.COLLECTION_NAME.remove(DELETION_CRITERIA,1)

• Remove All Documents
• If you don't specify deletion criteria, then MongoDB will delete all documents from the collection.
• > db.mycol.remove({})
Limit Records
• Limit() Method
• To limit the records in MongoDB, you need to use limit() method.
• The method accepts one number type argument, which is the number of
documents that you want to be displayed.
• If you don't specify the number argument in limit() method then it will display
all documents from the collection.
• Syntax->db.COLLECTION_NAME.find().limit(NUMBER)
• Skip() Method
• Apart from limit() method, there is one more method skip() which also
accepts number type argument and is used to skip the number of documents.
• >db.COLLECTION_NAME.find().limit(NUMBER).skip(NUMBER)
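• A sketch of simple pagination combining skip() and limit() (a page size of 5 is assumed):

  > db.student.find().limit(5)              // page 1: first five documents
  > db.student.find().skip(5).limit(5)      // page 2: documents 6-10
  > db.student.find().skip(10).limit(5)     // page 3: documents 11-15

• The server applies skip before limit regardless of the order in which the two methods are chained.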
Sort Records
• The sort() Method
• To sort documents in MongoDB, use sort() method.
• The method accepts a document containing a list of fields along
with their sorting order.
• To specify sorting order 1 and -1 are used. 1 is used for
ascending order while -1 is used for descending order.
• Syntax->db.COLLECTION_NAME.find().sort({KEY:1})
• >db.mycol.find({}, {"title":1, _id:0}).sort({"title":-1})
Indexing
• Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a
collection to select those documents that match the query statement.
• This scan is highly inefficient and requires MongoDB to process a large volume of data.
• Indexes are special data structures, that store a small portion of the data set in an easy-to-traverse form.
• The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the
index.
• createIndex() Method
• To create an index, use createIndex() method of MongoDB.
• Syntax->db.COLLECTION_NAME.createIndex({KEY:1})
• Here key is the name of the field on which we want to create index and 1 is for ascending order. To create index
in descending order, use -1.
• Example->db.mycol.createIndex({"title":1})
• In createIndex() method you can pass multiple fields, to create index on multiple fields.
• >db.mycol.createIndex({"title":1,"description":-1})
Create Index Method parameters
• background (Boolean) − Builds the index in the background so that building an index does not block other database activities. Specify true to build in the background. The default value is false.
• unique (Boolean) − Creates a unique index so that the collection will not accept insertion of documents where the index key or keys match an existing value in the index. Specify true to create a unique index. The default value is false.
• name (String) − The name of the index. If unspecified, MongoDB generates an index name by concatenating the names of the indexed fields and the sort order.
• sparse (Boolean) − If true, the index only references documents with the specified field. These indexes use less space but behave differently in some situations (particularly sorts). The default value is false.
• expireAfterSeconds (Integer) − Specifies a value, in seconds, to control how long MongoDB retains documents in this collection.
• weights (Document) − The weight is a number ranging from 1 to 99,999 and denotes the significance of the field relative to the other indexed fields in terms of the score.
• default_language (String) − For a text index, the language that determines the list of stop words and the rules for the stemmer and tokenizer. The default value is English.
• language_override (String) − For a text index, specifies the name of the field in the document that contains the language to override the default language. The default value is language.
Index methods
• dropIndex() method
• You can drop a particular index using the dropIndex() method of MongoDB.
• Syntax->db.COLLECTION_NAME.dropIndex({KEY:1})
• Here key is the name of the field on which the index was created; 1 is for ascending order and -1 for descending order.
• Example->db.mycol.dropIndex({"title":1})
• dropIndexes() method
• This method deletes multiple (specified) indexes on a collection.
• Syntax->db.COLLECTION_NAME.dropIndexes()
• First create two indexes on collection
• > db.mycol.createIndex({"title":1,"description":-1})
• Drop the indexes
• >db.mycol.dropIndexes({"title":1,"description":-1})
• The getIndexes() method
• This method returns the description of all the indexes in the collection.
• Syntax->db.COLLECTION_NAME.getIndexes()
• Example->db.mycol.createIndex({"title":1,"description":-1})
• Example->db.mycol.getIndexes()
Aggregation
• Aggregation operations process data records and return computed results.
• Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
• In SQL, count(*) with group by is the equivalent of MongoDB aggregation.
• aggregate() Method
• For the aggregation in MongoDB, use aggregate() method.
• Syntax->db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
• Now, if you want to display a list stating how many tutorials are written
by each user, then use the  aggregate() method as−
• >db.mycol.aggregate([{$group : {_id : "$by_user",
num_tutorial : {$sum : 1}}}])
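• To make the example concrete, a hedged sketch with a few sample documents (the titles and values are invented) and the expected shape of the result:

  > db.mycol.insertMany([
      { title: "MongoDB Overview", by_user: "tutorials point", likes: 100 },
      { title: "NoSQL Databases",  by_user: "tutorials point", likes: 10  },
      { title: "Neo4j Overview",   by_user: "Neo4j",           likes: 750 }
    ])
  > db.mycol.aggregate([{ $group: { _id: "$by_user", num_tutorial: { $sum: 1 } } }])
  // one result document per distinct by_user value, for example:
  // { "_id" : "tutorials point", "num_tutorial" : 2 }
  // { "_id" : "Neo4j", "num_tutorial" : 1 }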
List of aggregate function
• $sum − Sums up the defined value from all documents in the collection.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])
• $avg − Calculates the average of all given values from all documents in the collection.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])
• $min − Gets the minimum of the corresponding values from all documents in the collection.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])
• $max − Gets the maximum of the corresponding values from all documents in the collection.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])
• $push − Inserts the value to an array in the resulting document.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])
• $addToSet − Inserts the value to an array in the resulting document but does not create duplicates.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])
• $first − Gets the first document from the source documents according to the grouping. Typically this makes sense only together with a previously applied $sort stage.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])
• $last − Gets the last document from the source documents according to the grouping. Typically this makes sense only together with a previously applied $sort stage.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])
Replication
• The process of synchronizing data across multiple servers.
• Provides redundancy and increases data availability with multiple copies of data on different
database servers.
• Protects a database from the loss of a single server.
• Also allows to recover from hardware failure and service interruptions.
• With additional copies of the data, you can dedicate one to disaster recovery, reporting, or
backup.
• Why Replication?
• To keep your data safe
• High (24*7) availability of data
• Disaster recovery
• No downtime for maintenance (like backups, index rebuilds, compaction)
• Read scaling (extra copies to read from)
• Replica set is transparent to the application
Working of Replication in MongoDB
• MongoDB achieves replication by the use of replica set.
• A replica set is a group of mongod instances that host the same
data set.
• In a replica, one node is primary node that receives all write
operations.
• All other instances, the secondaries, apply operations from the primary so that they have the same data set.
• Replica set can have only one primary node.
• Replica set is a group of two or more nodes (generally minimum 3
nodes are required).
• In a replica set, one node is primary node and remaining nodes
are secondary.
• All data replicates from primary to secondary node.
• At the time of automatic failover or maintenance, an election is held and a new primary node is elected.
• After the failed node recovers, it joins the replica set again and works as a secondary node.
Replica Set Features
• A cluster of N nodes
• Any one node can be primary
• All write operations go to primary
• Automatic failover
• Automatic recovery
• Consensus election of primary
Set Up a Replica Set
• To convert to replica set, following are the steps −
• Shutdown already running MongoDB server.
• Start the MongoDB server by specifying the --replSet option. Following is the basic syntax of --replSet −
• Syntax-mongod --port "PORT" --dbpath "YOUR_DB_DATA_PATH" --replSet
"REPLICA_SET_INSTANCE_NAME"
• Example-mongod --port 27017 --dbpath "D:\set up\mongodb\data" --
replSet rs0
• It will start a mongod instance with the name rs0, on port 27017.
• Now start the command prompt and connect to this mongod instance.
• In Mongo client, issue the command rs.initiate() to initiate a new replica set.
• To check the replica set configuration, issue the command rs.conf().
• To check the status of replica set issue the command rs.status().
Add Members to Replica Set
• To add members to replica set, start mongod instances on multiple
machines.
• Now start a mongo client and issue a command rs.add().
• Syntax->rs.add("HOST_NAME:PORT")
• Example->rs.add("mongod1.net:27017")
• You can add mongod instance to replica set only when you are connected
to primary node.
• To check whether you are connected to primary or not, issue the
command db.isMaster() in mongo client.
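• A sketch of the whole flow on a single host, with the ports, data paths and host names assumed purely for illustration:

  Start three mongod instances (each in its own terminal and data path):
    mongod --port 27017 --dbpath "D:\set up\mongodb\data1" --replSet rs0
    mongod --port 27018 --dbpath "D:\set up\mongodb\data2" --replSet rs0
    mongod --port 27019 --dbpath "D:\set up\mongodb\data3" --replSet rs0

  Then, from a mongo client connected to the instance on port 27017:
    > rs.initiate()                 // create the replica set with this node as primary
    > rs.add("localhost:27018")     // add the remaining members
    > rs.add("localhost:27019")
    > rs.status()                   // verify one PRIMARY and two SECONDARY members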
Sharding
• Sharding is the process of storing data records across multiple machines and it is
MongoDB's approach to meeting the demands of data growth.
• As the size of the data increases, a single machine may not be sufficient to store
the data nor provide an acceptable read and write throughput.
• Sharding solves the problem with horizontal scaling. With sharding, you add more
machines to support data growth and the demands of read and write operations.
• Why Sharding?
• In replication, all writes go to master node
• Latency sensitive queries still go to master
• Single replica set has limitation of 12 nodes
• Memory can't be large enough when active dataset is big
• Local disk is not big enough
• Vertical scaling is too expensive
Sharding in MongoDB
• Shards − Shards are used to store data. They provide high availability
and data consistency. In production environment, each shard is a separate
replica set.
• Config Servers − Config servers store the cluster's metadata.
• This data contains a mapping of the cluster's data set to the shards. The
query router uses this metadata to target operations to specific shards.
• In production environment, sharded clusters have exactly 3 config servers.
• Query Routers − Query routers are basically mongo instances, interface
with client applications and direct operations to the appropriate shard.
• The query router processes and targets the operations to shards and then
returns results to the clients.
• A sharded cluster can contain more than one query router to divide the
client request load.
• A client sends requests to one query router. Generally, a sharded cluster has many query routers.
Create Backup
• Dump MongoDB Data
• To create backup of database in MongoDB, you should
use mongodump command.
• This command will dump the entire data of server into the dump
directory.
• There are many options available by which we can limit the
amount of data or create backup of remote server.
• Example->mongodump
Options with mongodump
• Syntax: mongodump --host HOST_NAME --port PORT_NUMBER
  Description: This command will backup all databases of the specified mongod instance.
  Example: mongodump --host google.com --port 27017
• Syntax: mongodump --dbpath DB_PATH --out BACKUP_DIRECTORY
  Description: This command will backup only the specified database at the specified path.
  Example: mongodump --dbpath /data/db/ --out /data/backup/
• Syntax: mongodump --collection COLLECTION --db DB_NAME
  Description: This command will backup only the specified collection of the specified database.
  Example: mongodump --collection mycol --db test
Restore data
• To restore backup data MongoDB's mongorestore command is
used. This command restores all of the data from the backup
directory.
• Syntax->mongorestore
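• A sketch of a backup/restore round trip against a local server (host, port and directory are just examples):
• Example- Back up everything from the local server into the ./dump directory:
  mongodump --host localhost --port 27017
• Example- Restore that dump back into a server (possibly a different one):
  mongorestore --host localhost --port 27017 dump/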
BIG DATA
• Data which are very large in size is called Big Data.
• Normally we work on data of size MB(WordDoc ,Excel) or maximum GB(Movies, Codes) but data in Peta bytes i.e. 10^15 byte
size is called Big Data.
• It is stated that almost 90% of today's data has been generated in the past 3 years.
• Sources of Big Data
• These data come from many sources like
• Social networking sites: Facebook, Google, LinkedIn – all these sites generate huge amounts of data on a day-to-day basis as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
• Weather stations: All the weather stations and satellites give very large volumes of data, which are stored and manipulated to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and accordingly publish their plans, and for this they store the data of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
• 3V's of Big Data
1. Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will double every 2 years.
2. Variety: Nowadays data is not only stored in rows and columns; it is structured as well as unstructured. Log files and CCTV footage are unstructured data, while data that can be saved in tables, like the transaction data of a bank, is structured data.
3. Volume: The amount of data which we deal with is of very large size, in petabytes.
• Issues
• Huge amount of unstructured data which needs to be stored, processed and analyzed.
Issues and solutions related to BIG DATA
• Issues
• Huge amount of unstructured data which needs to be stored, processed and
analyzed.
• Solution
• Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the write once, read many times principle.
• Processing: Map Reduce paradigm is applied to data distributed over network
to find the required output.
• Analyze: Pig, Hive can be used to analyze the data.
• Cost: Hadoop is open source so the cost is no more an issue.
Hadoop
• Hadoop is an open source framework from Apache and is used to store, process and analyze data which is very huge in volume.
• Hadoop is written in Java and is not OLAP (online analytical processing).
• It is used for batch/offline processing.
• It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover it can
be scaled up just by adding nodes in the cluster.
• Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that HDFS was
developed. It states that the files will be broken into blocks and stored in nodes over the distributed architecture.
2. Yarn: Yet another Resource Negotiator is used for job scheduling and manage the cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed in key-value pairs. The output of the Map task is consumed by the Reduce task, and then the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.
Hadoop Architecture
• The Hadoop architecture is a package of
the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System).
• The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
• A Hadoop cluster consists of a single
master and multiple slave nodes.
• The master node includes Job Tracker,
Task Tracker, NameNode, and DataNode
whereas the slave node includes DataNode
and TaskTracker.
Hadoop Distributed File System layer
• The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop.
• It contains a master/slave architecture.
• This architecture consists of a single NameNode, which performs the role of master, and multiple DataNodes, which perform the role of slaves.
• Both NameNode and DataNode are capable enough to run on commodity machines. The Java language is used to develop HDFS.
So any machine that supports Java language can easily run the NameNode and DataNode software.
• NameNode
• It is a single master server exist in the HDFS cluster.
• As it is a single node, it may become a single point of failure.
• It manages the file system namespace by executing an operation like the opening, renaming and closing the files.
• It simplifies the architecture of the system.
• DataNode
• The HDFS cluster contains multiple DataNodes.
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
• It performs block creation, deletion, and replication upon instruction from the NameNode.
• Job Tracker
• The role of Job Tracker is to accept the MapReduce jobs from client and process the data by using NameNode.
• In response, NameNode provides metadata to Job Tracker.
• Task Tracker
• It works as a slave node for Job Tracker.
• It receives task and code from Job Tracker and applies that code on the file. This process can also be called as a Mapper.
MapReduce Layer
• The MapReduce comes into existence when the client application
submits the MapReduce job to Job Tracker.
• In response, the Job Tracker sends the request to the appropriate Task
Trackers.
• Sometimes, the TaskTracker fails or time out. In such a case, that part
of the job is rescheduled.
Advantages of Hadoop
• Fast: 
• In HDFS the data distributed over the cluster and are mapped which helps in faster retrieval.
• Even the tools to process the data are often on the same servers, thus reducing the processing time.
• It is able to process terabytes of data in minutes and Peta bytes in hours.
• Scalable: 
• Hadoop cluster can be extended by just adding nodes in the cluster.
• Cost Effective: 
• Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
• Resilient to failure: 
• HDFS has the property with which it can replicate data over the network, so if one node is down or
some other network failure happens, then Hadoop takes the other copy of data and use it.
• Normally, data are replicated thrice but the replication factor is configurable.
HDFS
• Hadoop comes with a distributed file system called HDFS. In HDFS data is distributed over several
machines and replicated to ensure their durability to failure and high availability to parallel application.
• It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and name node.
• Where to use HDFS
• Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.
• Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on the write-once, read-many-times pattern.
• Commodity Hardware: It works on low-cost hardware.
• Where not to use HDFS
• Low-Latency Data Access: Applications that require very low latency access to the first record should not use HDFS, as it gives importance to the whole data set rather than the time to fetch the first record.
• Lots of Small Files: The name node holds the metadata of files in memory, and if the files are small in size this uses up a lot of the name node's memory, which is not feasible.
• Multiple Writes: It should not be used when we have to write multiple times.
Basic concepts of HDFS
1.Blocks: 
• A Block is the minimum amount of data that it can read or write.
• HDFS blocks are 128 MB by default and this is configurable.
• Files in HDFS are broken into block-sized chunks,which are stored as independent units.
• Unlike a disk file system, if a file in HDFS is smaller than the block size, it does not occupy the full block size; i.e. a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space.
• The HDFS block size is large just to minimize the cost of seek.
2.Name Node: 
• HDFS works in master-worker pattern where the name node acts as master.
• Name Node is controller and manager of HDFS as it knows the status and the metadata of all the files in HDFS; the metadata
information being file permission, names and location of each block.
• The metadata is small, so it is stored in the memory of the name node, allowing faster access to data.
• Moreover the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine.
• The file system operations like opening, closing, renaming etc. are executed by it.
3.Data Node: 
• They store and retrieve blocks when they are told to; by client or name node.
• They report back to name node periodically, with list of blocks that they are storing.
• The data node being a commodity hardware also does the work of block creation, deletion and replication as stated by the name
node.
Features and goals of HDFS
• Features of HDFS
• Highly Scalable - HDFS is highly scalable as it can scale hundreds of nodes in a single cluster.
• Replication - Due to some unfavorable conditions, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
• Fault tolerance - In HDFS, the fault tolerance signifies the robustness of the system in the event of failure. The HDFS is highly
fault-tolerant that if any machine fails, the other machine containing the copy of that data automatically become active.
• Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is
divided into multiple blocks and stored into nodes.
• Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.

• Goals of HDFS
• Handling the hardware failure - The HDFS contains multiple server machines. Anyhow, if any machine fails, the HDFS goal is to
recover it quickly.
• Streaming data access - The HDFS applications usually run on the general-purpose file system. This application requires streaming
access to their data sets.
• Coherence Model - The applications that run on HDFS are required to follow the write-once-read-many approach, so a file once created need not be changed; however, it can be appended and truncated.
MAPREDUCE
• A MapReduce is a data processing tool which is used to process the data parallelly in a distributed form.
• It was developed in 2004, on the basis of paper titled as "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
• The MapReduce is a paradigm which has two phases, the mapper phase, and the reducer phase.
• In the Mapper, the input is given in the form of a key-value pair.
• The output of the Mapper is fed to the reducer as input.
• The reducer runs only after the Mapper is over. The reducer too takes input in key-value format, and the output of reducer is the final output.
• Steps in Map Reduce
• The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.
• Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This sort and shuffle acts on these list of <key,
value> pairs and sends out unique keys and a list of values associated with this unique key <key, list(values)>.
• An output of sort and shuffle sent to the reducer phase. The reducer performs a defined function on a list of values for unique keys,
and Final output <key, value> will be stored/displayed.
• Sort and Shuffle
• The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper task is complete, the results are sorted
by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each Mapper <k2,v2>, we collect all
the values for each unique key k2. This output from the shuffle phase in the form of <k2, list(v2)> is sent as input to reducer phase.
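• A minimal word-count sketch of the map / shuffle-and-sort / reduce phases, written in plain JavaScript purely to illustrate the data flow (this is not Hadoop API code; the input lines are invented):

  // map: one input line -> list of <word, 1> pairs (keys are not unique yet)
  function map(line) {
    return line.split(/\s+/).filter(w => w.length > 0).map(w => [w, 1]);
  }

  // shuffle & sort: group all values by key -> <word, [1, 1, ...]>
  function shuffle(pairs) {
    const groups = {};
    for (const [key, value] of pairs) {
      (groups[key] = groups[key] || []).push(value);
    }
    return groups;
  }

  // reduce: <word, list(1)> -> <word, count>
  function reduce(key, values) {
    return [key, values.reduce((a, b) => a + b, 0)];
  }

  const input = ["deer bear river", "car car river", "deer car bear"];
  const mapped = input.flatMap(map);                          // map phase
  const grouped = shuffle(mapped);                            // shuffle & sort phase
  const counts = Object.keys(grouped).sort()                  // reduce phase, keys in sorted order
    .map(key => reduce(key, grouped[key]));
  console.log(counts);  // [ [ 'bear', 2 ], [ 'car', 3 ], [ 'deer', 2 ], [ 'river', 2 ] ]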
Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and form the core of the job.
• Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any processing takes place.
• MasterNode − Node where JobTracker runs and which accepts job requests from clients.
• SlaveNode − Node where Map and Reduce program runs.
• JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
• Task Tracker − Tracks the task and reports status to JobTracker.
• Job − A program is an execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
Usage of MapReduce

• It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
• It can be used for distributed pattern-based searching.
• We can also use MapReduce in machine learning.
• It was used by Google to regenerate Google's index of the World
Wide Web.
• It can be used in multiple computing environments such as multi-
cluster, multi-core, and mobile environment.
Data Flow In MapReduce
• MapReduce is used to compute huge amounts of data. To handle the incoming data in a parallel and distributed form, the data has to flow through various phases.
Phases of MapReduce data flow
• Input reader
• The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB).
• Each data block is associated with a Map function.
• Once the input reader has read the data, it generates the corresponding key-value pairs.
• The input files reside in HDFS.
• Map function
• The Map function processes the incoming key-value pairs and generates the corresponding output key-value pairs.
• The map input and output types may differ from each other.
• Partition function
• The partition function assigns the output of each Map function to the appropriate reducer.
• It is given the key and the value and returns the index of the reducer that should receive the pair.
• Shuffling and Sorting
• The data are shuffled between and within nodes so that they move out of the map phase and are ready to be processed by the Reduce function.
• Sometimes, shuffling the data can take considerable computation time.
• A sorting operation is performed on the input data for the Reduce function.
• Here, the data are compared using a comparison function and arranged in sorted order.
• Reduce function
• The Reduce function is applied to each unique key. The keys are already arranged in sorted order.
• The Reduce function iterates over the values associated with each key and generates the corresponding output.
• Output writer
• Once the data have flowed through all the above phases, the output writer executes.
• The role of the output writer is to write the Reduce output to stable storage.
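A hedged sketch of how these phases map onto the Hadoop MapReduce Java API for a word count. The class names WordCountMapper and WordCountReducer are our own; the input reader and output writer roles are played by Hadoop's default TextInputFormat and TextOutputFormat, and the default HashPartitioner acts as the partition function, so only the map and reduce functions need to be written:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map function: consumes the <offset, line> pairs produced by the input reader
// and emits intermediate <word, 1> pairs (input and output types differ).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce function: after shuffling and sorting, receives <word, list(counts)>
// for each unique key and writes the final <word, total> pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```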
MapReduce Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
• A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
• Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage − This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
• The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with the data on local disks, which reduces network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
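A hedged sketch of the driver that ties these stages together, reusing the WordCountMapper and WordCountReducer classes from the previous sketch (assumed to be in the same package); the input and output paths are taken from the command line, and the final output is written back to HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountJob.class);

        // Map stage and reduce stage; Hadoop ships these tasks to the nodes holding the data.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation to cut network traffic
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input is read from HDFS; the final <key, value> output is written back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

On a real cluster this class would typically be packaged into a jar and submitted with the hadoop jar command, passing the HDFS input and output directories as the two arguments.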
Example Scenario
• Consider data on the electrical consumption of an organization.
• It contains the monthly electrical consumption and the annual average for various years.
• Given this data as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on.
• This is straightforward for programmers when the number of records is small: they simply write the logic to produce the required output and pass the data to the application.
• But think of the data representing the electrical consumption of all the large-scale industries of a particular state since its formation.
• When we write conventional applications to process such bulk data,
• they will take a lot of time to execute;
• there will be heavy network traffic when we move the data from the source to the processing server, and so on.
• To solve these problems, we have the MapReduce framework.
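A hedged sketch of how this scenario could be expressed in the Hadoop MapReduce API, assuming each input line holds a year followed by twelve monthly readings and a trailing annual average, all whitespace-separated integers (the exact file layout is not shown in these notes, so the parsing logic and class names are illustrative). The mapper emits <year, maximum monthly consumption> and the reducer keeps the per-year peak, from which the year of maximum usage can be read off:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed line layout: <year> <12 monthly readings> <annual average>, whitespace-separated.
public class MaxConsumptionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().trim().split("\\s+");
        String year = fields[0];
        int max = Integer.MIN_VALUE;
        for (int i = 1; i < fields.length - 1; i++) {   // skip the trailing annual average
            max = Math.max(max, Integer.parseInt(fields[i]));
        }
        context.write(new Text(year), new IntWritable(max));
    }
}

class MaxConsumptionReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> maxima, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : maxima) {
            max = Math.max(max, value.get());
        }
        context.write(year, new IntWritable(max));      // per-year peak monthly consumption
    }
}
```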
HIVE
• Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook.
• Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage.
• It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted to MapReduce jobs.
• Using Hive, we can avoid the traditional approach of writing complex MapReduce programs by hand.
• Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).
Features and limitations of Hive
• Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or
Spark jobs.
• It is capable of analyzing large datasets stored in HDFS.
• It allows different storage types such as plain text, RCFile, and HBase.
• It uses indexing to accelerate queries.
• It can operate on compressed data stored in the Hadoop ecosystem.
• It supports user-defined functions (UDFs), through which users can extend its functionality.
• Limitations of Hive
• Hive is not capable of handling real-time data.
• It is not designed for online transaction processing.
• Hive queries have high latency.
Hive Architecture
• The architecture explains how a query flows through Hive when it is submitted.
• Hive Client
• Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:
• Thrift Server - A cross-language service provider platform that serves requests from all programming languages that support Thrift.
• JDBC Driver - It is used to establish a connection between
hive and Java applications. The JDBC Driver is present in
the class org.apache.hadoop.hive.jdbc.HiveDriver.
• ODBC Driver - It allows the applications that support the
ODBC protocol to connect to Hive.
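A minimal sketch of the JDBC route, assuming a HiveServer2 instance listening on localhost:10000, default credentials, and the Hive JDBC driver on the classpath; the table name employees is hypothetical. (With HiveServer2 the driver class is org.apache.hive.jdbc.HiveDriver; the org.apache.hadoop.hive.jdbc.HiveDriver class mentioned above belongs to the older HiveServer1 interface.)

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 driver class and JDBC URL (host, port, and database are assumptions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // The HQL below is translated by Hive into MapReduce (or Tez/Spark) jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT department, COUNT(*) FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```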
Hive Services
• Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and
commands.
• Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
• Hive MetaStore - A central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
• Hive Server - Also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
• Hive Driver - Receives queries from different sources such as the web UI, CLI, Thrift, and the JDBC/ODBC drivers, and transfers the queries to the compiler.
• Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
• Hive Execution Engine - The optimizer generates the execution plan in the form of a DAG of map-reduce tasks and HDFS tasks; the execution engine then executes these tasks in the order of their dependencies.
HIVE Data Types
• Hive data types are categorized into numeric types, string types, date/time types, misc types, and complex types.
• Integer Types
• TINYINT − 1-byte signed integer; range -128 to 127
• SMALLINT − 2-byte signed integer; range -32,768 to 32,767
• INT − 4-byte signed integer; range -2,147,483,648 to 2,147,483,647
• BIGINT − 8-byte signed integer; range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
• Decimal Types
• FLOAT − 4-byte single-precision floating point number
• DOUBLE − 8-byte double-precision floating point number
• Complex Types
• STRUCT − Similar to a C struct or an object; fields are accessed using the "dot" notation, e.g. struct('James','Roy')
• MAP − Contains key-value tuples; fields are accessed using array notation, e.g. map('first','James','last','Roy')
• ARRAY − A collection of values of a similar type, indexable using zero-based integers, e.g. array('James','Roy')
HIVE data types
• Date/Time Types
• TIMESTAMP
• It supports traditional UNIX timestamp with optional nanosecond precision.
• As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
• As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with decimal precision.
• As string, it follows java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)
• DATES
• The Date value is used to specify a particular year, month, and day, in the form YYYY-MM-DD. However, it does not provide the time of day.
• The range of the Date type lies between 0000-01-01 and 9999-12-31.
• String Types
• STRING
• The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
• Varchar
• The varchar is a variable-length type whose length lies between 1 and 65535, which specifies the maximum number of characters allowed in the character string.
• CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
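To tie the type overview together, a hedged example of a table definition exercising several of these types, executed through the same JDBC connection style as the earlier Hive sketch; the table and column names are illustrative, and a reasonably recent Hive version is assumed so that DATE, CHAR, and the complex types are available:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTypesExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // One column for each family discussed above: numeric, string,
            // date/time, and the three complex types.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS employee_profiles ("
              + "  id          BIGINT,"
              + "  name        STRING,"
              + "  grade       CHAR(2),"
              + "  salary      DOUBLE,"
              + "  hired_on    DATE,"
              + "  updated_at  TIMESTAMP,"
              + "  skills      ARRAY<STRING>,"
              + "  phones      MAP<STRING, STRING>,"
              + "  address     STRUCT<street: STRING, city: STRING, zip: STRING>"
              + ")");
        }
    }
}
```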