• A wide variety of existing relational databases have been unsuccessful in solving several complex modern problems, such as:
• The dynamic nature of data – nowadays data is structured, semi-structured, unstructured, and polymorphic in type.
• Applications, and the types of data fed into them for analysis, have become more diverse and distributed, and are increasingly cloud-oriented.
• Modern applications and services also serve tens of thousands of users in diverse geo-locations and time zones, so data integrity needs to be maintained at all times.
Structured v/s. Unstructured data
• Structured data is in a proper format – usually text files, or data that can be represented in tabular form. Such data can also be smoothly represented in chart form, and data mining tools can process it efficiently.
• Unstructured data, by contrast, is in disorganized formats (document files, image files, video files, icons, etc.). Structured data can be pulled out, or mined, from unstructured data, but this process usually takes a lot of time.
• Modern-day data generated by different applications, services, or sources is a combination of both structured and unstructured data.
• So you need somewhere to store such data to make your application work properly. NoSQL-based databases and scripts can help in this regard.
RDBMS Characteristics
• Data stored in columns and tables
• Relationships represented by data
• Data Manipulation Language
• Data Definition Language
• Transactions
• Abstraction from physical layer
• Applications specify what, not how
• Physical layer can change without modifying applications
• Create indexes to support queries
• In Memory databases
Transactions – ACID Properties
• Atomic – All of the work in a transaction completes (commit) or none of it completes
• A transaction to transfer funds from one account to another involves a withdrawal operation from the first account and a deposit operation on the second. If the deposit operation fails, you don’t want the withdrawal to happen either.
• Consistent – A transaction transforms the database from one consistent state to another consistent state.
Consistency is defined in terms of constraints.
• a database tracking a checking account may only allow unique check numbers to exist for each transaction
• Isolated – The results of any changes made during a transaction are not visible until the transaction has
committed.
• a teller looking up a balance must be isolated from a concurrent transaction involving a withdrawal from the same
account. Only when the withdrawal transaction commits successfully and the teller looks at the balance again will the
new balance be reported.
• Durable – The results of a committed transaction survive failures
• A system crash or any other failure must not be allowed to lose the results of a transaction or the contents of the database. Durability is often achieved through separate transaction logs that can "re-create" all transactions from a given point in time (such as a backup).
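The atomicity example above can be sketched in Python. This is a hypothetical in-memory model, not a real database engine: either both the withdrawal and the deposit apply, or the state is rolled back.

```python
# Toy sketch of an atomic transfer: either all the work in the
# "transaction" completes, or none of it does.
def transfer(accounts, src, dst, amount):
    snapshot = dict(accounts)          # state to restore on failure
    try:
        if accounts[src] < amount:
            raise ValueError("insufficient funds")
        accounts[src] -= amount        # withdrawal
        accounts[dst] += amount        # deposit
    except Exception:
        accounts.clear()
        accounts.update(snapshot)      # roll back: none of the work applies
        raise

accounts = {"A": 100, "B": 50}
transfer(accounts, "A", "B", 30)       # commits: A=70, B=80
try:
    transfer(accounts, "A", "B", 500)  # fails: balances stay unchanged
except ValueError:
    pass
```

The snapshot-and-restore step is a stand-in for what a real engine does with undo logs.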
NoSQL, No ACID
• RDBMSs are based on ACID (Atomicity, Consistency, Isolation, and
Durability) properties
• NoSQL
• Does not give importance to ACID properties
• In some cases completely ignores them
• In distributed parallel systems it is difficult/impossible to ensure ACID
properties
• Long-running transactions don't work because keeping resources blocked for a long time is
not practical
Features of NoSQL
• Non-relational
• NoSQL databases never follow the relational model
• Never provide tables with flat fixed-column records
• Work with self-contained aggregates or BLOBs
• Don’t require object-relational mapping or data normalization
• No complex features like query languages, query planners, referential-integrity joins, or ACID
• Schema-free
• NoSQL databases are either schema-free or have relaxed schemas
• Do not require any sort of definition of the schema of the data
• Offers heterogeneous structures of data in the same domain
• Simple API
• Offer easy-to-use interfaces for storing and querying data
• APIs allow low-level data manipulation & selection methods
• Text-based protocols, mostly HTTP REST with JSON
• Mostly no standards-based NoSQL query language
• Web-enabled databases running as internet-facing services
• Distributed
• Multiple NoSQL databases can be executed in a distributed fashion
• Offers auto-scaling and fail-over capabilities
• Often ACID concept can be sacrificed for scalability and throughput
• Mostly no synchronous replication between distributed nodes; instead asynchronous multi-master replication, peer-to-peer, or HDFS replication
• Often providing only eventual consistency
• Shared-nothing architecture, which enables less coordination and higher distribution
Advantages of NoSQL
• Can be used as Primary or Analytic Data Source
• Big Data Capability
• No Single Point of Failure
• Easy Replication
• It provides fast performance and horizontal scalability.
• Can handle structured, semi-structured, and unstructured data with equal effect
• An object-oriented style of programming that is easy to use and flexible
• NoSQL databases don’t need a dedicated high-performance server
• Support Key Developer Languages and Platforms
• Simpler to implement than an RDBMS
• It can serve as the primary data source for online applications.
• Handles big data which manages data velocity, variety, volume, and complexity
• Excels at distributed database and multi-data center operations
• Offers a flexible schema design which can easily be altered without downtime or service disruption
Disadvantages of NoSQL
1. Narrow focus –
NoSQL databases have a very narrow focus: they are designed mainly for storage and provide relatively little functionality beyond it. Relational databases remain the better choice for transaction management.
2. Open-source –
NoSQL databases are mostly open source, and there is no reliable standard for NoSQL yet. In other words, two database systems are likely to be incompatible.
3. Management challenge –
Big data tools aim to make management of large amounts of data as simple as possible, but it is not so easy. Data management in NoSQL is much more complex than in a relational database, and NoSQL in particular has a reputation for being challenging to install and even harder to manage on a daily basis.
4. GUI is not available –
Flexible GUI tools for accessing NoSQL databases are not widely available in the market.
5. Backup –
Backup is a great weak point for some NoSQL databases like MongoDB, which has no built-in approach for backing up data in a consistent manner.
6. Large document size –
Some database systems like MongoDB and CouchDB store data in JSON format, which means documents are quite large (big data, network bandwidth, speed); descriptive key names actually hurt, since they increase the document size.
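The key-name overhead is easy to see directly. A small Python sketch with made-up field names:

```python
import json

# Two equivalent documents; only the key names differ in length.
verbose = {"customer_full_name": "Priya", "customer_account_balance": 100}
terse = {"name": "Priya", "bal": 100}

# The descriptive keys make every stored document larger, and the
# overhead is paid once per key, per document.
size_verbose = len(json.dumps(verbose))
size_terse = len(json.dumps(terse))
```

Because the keys are repeated in every document, the waste grows linearly with collection size.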
Why NoSQL?
• Schemaless data representation: Almost all NoSQL implementations offer schemaless data
representation. This means that you don’t have to think too far ahead to define a structure and you
can continue to evolve over time—including adding new fields or even nesting the data, for example,
in case of JSON representation.
• Development time: reduced development time, because one doesn’t have to deal with complex SQL
queries (such as JOIN queries that collate data across multiple tables).
• Speed: Even with the small amount of data that you have, if you can deliver in milliseconds rather
than hundreds of milliseconds—especially over mobile and other intermittently connected devices—
you have much higher probability of winning users over.
• Plan ahead for scalability: your application can be quite elastic—it can handle sudden spikes of load.
SchemaLess data representation
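A minimal sketch of what schemaless representation means in practice, using a hypothetical "students" collection held as plain Python dicts:

```python
# Schemaless: documents in the same collection may have different
# fields, including new fields and nested data, with no migration.
students = [
    {"sname": "Priya", "rollno": 104},
    {"sname": "Aakash", "rollno": 105, "email": "aakash@example.com"},
    {"sname": "Anjali", "address": {"city": "Pune", "pin": "411001"}},
]

# Each document carries its own structure.
field_sets = [sorted(doc) for doc in students]
```

Adding "email" or the nested "address" required no schema definition up front, which is the evolution-over-time property described above.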
Types of data models in NoSQL Databases
• NoSQL Databases are mainly categorized into four types: Key-value
pair, Column-oriented, Graph-based and Document-oriented.
• Every category has its unique attributes and limitations.
• No single one of these database types is best for solving all problems.
• Users should select the database based on their product needs.
• Types of NoSQL Databases:
• Key-value Pair Based
• Column-oriented
• Graphs based
• Document-oriented
Types of NoSQL Databases
Key Value Pair Based
• Data is stored in key/value pairs. It is designed in such a way to handle lots of
data and heavy load.
• Key-value pair storage databases store data as a hash table where each key is
unique, and the value can be a JSON, BLOB(Binary Large Objects), string, etc.
• For example, a key-value pair may contain a key like “sname” associated with
a value like “Priya”.
• This kind of NoSQL database is used as a collection, dictionaries, associative
arrays, etc. Key value stores help the developer to store schema-less data.
They work best for shopping cart contents.
• Redis, Dynamo, and Riak are some examples of key-value store databases; Dynamo and Riak are
based on Amazon’s Dynamo paper.
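The hash-table model described above can be sketched as a toy Python class. This is an illustrative model, not Redis itself; the class and key names are made up:

```python
# Minimal key-value store: a hash table mapping a unique key to an
# opaque value (string, JSON-like dict, blob, ...).
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # keys are unique; last write wins

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

# Shopping-cart style usage: the whole cart is one value under one key.
store = KVStore()
store.put("cart:user42", {"items": ["pen", "book"], "total": 150})
cart = store.get("cart:user42")
```

The store never inspects the value, which is why lookups are fast but there is no query language over the value's contents.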
Column-based
• Column-oriented databases work on columns and
are based on BigTable paper by Google.
• Every column is treated separately.
• Values of single column databases are stored
contiguously.
• They deliver high performance on aggregation
queries like SUM, COUNT, AVG, MIN etc. as the data is
readily available in a column.
• Column-based NoSQL databases are widely used to
manage data warehouses, business intelligence,
CRM, library card catalogs, etc.
• HBase, Cassandra, and Hypertable are examples of
column-based databases.
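The columnar layout's advantage on aggregates can be sketched in Python. Toy data only; real column stores add compression and on-disk layout on top of this idea:

```python
# Row-oriented: each record is stored together.
rows = [{"name": "A", "sales": 10},
        {"name": "B", "sales": 20},
        {"name": "C", "sales": 30}]

# Column-oriented: each column's values are stored contiguously...
columns = {"name": ["A", "B", "C"], "sales": [10, 20, 30]}

# ...so SUM/AVG/MIN/MAX scan one contiguous list instead of
# touching every field of every row.
total = sum(columns["sales"])
avg = total / len(columns["sales"])
```

This is why the slide says the data for an aggregation query is "readily available in a column".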
Document-Oriented:
• A document-oriented NoSQL DB stores and retrieves data as key-value pairs, but the value part is
stored as a document.
• The document is stored in JSON or XML format. The value is understood by the DB and can be
queried.
• The document type is mostly used for CMS systems, blogging platforms, real-time analytics, and e-
commerce applications.
• It should not be used for complex transactions requiring multiple operations, or for queries against
varying aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-
oriented DBMS systems.
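The "value is understood by the DB and can be queried" property can be sketched with plain Python dicts. The collection and helper function here are hypothetical:

```python
# Document store sketch: each value is a whole JSON-like document
# whose fields the database can match against.
articles = [
    {"_id": 1, "title": "Intro", "by": "priya", "tags": ["nosql"]},
    {"_id": 2, "title": "CAP", "by": "aakash", "tags": ["nosql", "theory"]},
]

def find(collection, criteria):
    """Return documents whose fields match every key in criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

hits = find(articles, {"by": "aakash"})
```

Contrast this with a pure key-value store, where the value is opaque and such field-level queries are impossible.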
Graph-Based
• A graph type database stores entities as well the relations
amongst those entities.
• The entity is stored as a node with the relationship as edges.
• An edge gives a relationship between nodes.
• Every node and edge has a unique identifier.
• Compared to a relational database, where tables are loosely
connected, a graph database is multi-relational in nature.
• Traversing relationships is fast, as they are already captured in
the DB and there is no need to calculate them.
• Graph databases are mostly used for social networks, logistics,
and spatial data.
• Neo4J, Infinite Graph, OrientDB, FlockDB are some popular
graph-based databases.
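Traversal over stored relationships can be sketched with an adjacency list in Python. Toy data; real graph databases persist the nodes and edges with identifiers, as described above:

```python
from collections import deque

# Nodes are entities; edges are first-class stored relationships.
edges = {                      # adjacency list: node -> connected nodes
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": [],
}

def reachable(start):
    """Breadth-first traversal that simply follows stored edges."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

network = reachable("alice") - {"alice"}
```

Because the edges are stored, no join computation is needed at query time, which is the performance point the slide makes.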
List of NoSQL Databases
Choosing Right NoSQL Database
• The Data Model
• It involves the type of data that you need to store. NoSQL databases differ mainly in the data model they use. A mismatch
between a NoSQL solution's data model and the target application can make or break the success of the project
which you are building.
• Data-scaling expectations
• The next question is how large the application is expected to grow and how much data-scale support will be needed. Some
NoSQL databases are memory-based and do not scale across multiple machines, whereas others, like Cassandra, scale linearly
across many machines.
CAP Theorem
• Brewer’s CAP Theorem:
• For any system sharing data, it is "impossible" to guarantee all
three of these properties simultaneously
• You can have at most two of these three properties for any shared-data
system
• Very large systems will "partition" at some point:
• That leaves either C or A to choose from (a traditional DBMS prefers C over A
and P)
• In almost all cases, you would choose A over C (except in specific applications
such as order processing)
CAP Theorem
• Consistency: all clients always have the same view of the data
(CAP triangle diagram: Consistency, Availability, Partition tolerance)
CAP Theorem
• Consistency
• 2 types of consistency:
1. Strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
2. Weak consistency – BASE (Basically Available, Soft state, Eventual
consistency)
CAP Theorem
• ACID
• A DBMS is expected to support “ACID transactions,” processes that are:
• Atomicity: either the whole process is done or none is
• Consistency: only valid data are written
• Isolation: one operation at a time
• Durability: once committed, it stays that way
• CAP
• Consistency: every node in the cluster sees the same copies of the data
• Availability: cluster always accepts reads and writes
• Partition tolerance: guaranteed properties are maintained even when network failures
prevent some machines from communicating with others
CAP Theorem
• A consistency model determines rules for visibility and apparent order of
updates
• Example:
• Row X is replicated on nodes M and N
• Client A writes row X to node N
• Some period of time t elapses
• Client B reads row X from node M
• Does client B see the write from client A?
• Consistency is a continuum with tradeoffs
• For NoSQL, the answer would be: "maybe"
• The CAP theorem states: "strong consistency can't be achieved at the same time as
availability and partition tolerance"
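The scenario above can be simulated in a few lines of Python. This is a toy model of asynchronous replication; the node and key names mirror the example:

```python
# Row X is replicated on nodes M and N; replication is asynchronous.
class Node:
    def __init__(self):
        self.data = {}

M, N = Node(), Node()
pending = []                         # async replication queue

def write(node, key, value):
    node.data[key] = value
    pending.append((key, value))     # propagated to peers... eventually

def replicate():
    while pending:
        key, value = pending.pop(0)
        for peer in (M, N):
            peer.data[key] = value   # all nodes converge

write(N, "X", "v1")                  # client A writes row X to node N
stale = M.data.get("X")              # client B reads from M before time t
replicate()                          # time t elapses; updates propagate
fresh = M.data.get("X")              # now client B sees the write
```

Before replication runs, client B's answer is "maybe": here it reads nothing; afterwards the nodes agree, which is exactly eventual consistency.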
CAP Theorem
• Eventual consistency
• When no updates occur for a long period of time, eventually all updates will propagate
through the system and all the nodes will be consistent
• Cloud computing
• ACID is hard to achieve, moreover, it is not always required, e.g. for blogs,
status updates, product listings, etc.
CAP Theorem
• Availability: each client can always read and write
(CAP triangle diagram: Consistency, Availability, Partition tolerance)
CAP Theorem
• Partition tolerance: the system can continue to operate in the presence of network partitions
(CAP triangle diagram: Consistency, Availability, Partition tolerance)
Consistency in CAP Theorem
• When data is stored on multiple nodes, all nodes should see the same
data: when the data is updated at one node, the same update should be
made at the other nodes storing that data as well.
• For example, a read operation will return the value of the most recent
write operation, so all nodes return the same data.
• A system is said to be consistent if a transaction starts with the
system in a consistent state and ends with the system in a consistent
state.
• In this model, a system can shift into an inconsistent state during a
transaction, but in that case the entire transaction is rolled back if there
is an error at any stage in the process.
Availability in CAP Theorem
• To achieve a high order of availability, the system should remain
operational 100% of the time, so we can get a response at any time.
• Accordingly, whenever a user makes a request, the user should be
able to get a response regardless of the state of the system.
Partition Tolerance in CAP Theorem
• According to this property, a system should keep working despite message loss or partition failure.
• A system that is partition-tolerant can sustain network failures without the entire system failing.
• Storage systems that fall under CP (consistency with partition tolerance) include MongoDB, Redis,
AppFabric Caching, and MemcacheDB.
• Databases that offer partition tolerance are those which store their data on multiple
nodes.
• Relational data models are required to follow the ACID (Atomicity, Consistency,
Isolation, Durability) properties, but NoSQL data storage structures cannot satisfy
all of C, A, and P at once.
• NoSQL data storage models therefore fall under one of the following combinations,
since all three cannot be satisfied together:
• CA (Consistency and Availability)
• AP (Availability with Partition tolerance)
• CP (Consistency with Partition tolerance)
MongoDB
• MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and
easy scalability.
• MongoDB works on the concepts of collections and documents.
• Database
• Database is a physical container for collections. Each database gets its own set of files on the file system. A single MongoDB server
typically has multiple databases.
• Collection
• Collection is a group of MongoDB documents.
• It is the equivalent of an RDBMS table.
• A collection exists within a single database. Collections do not enforce a schema.
• Documents within a collection can have different fields.
• Typically, all documents in a collection are of similar or related purpose.
• Document
• A document is a set of key-value pairs. Documents have dynamic schema.
• Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common
fields in a collection's documents may hold different types of data.
Relationship of RDBMS terminology with MongoDB

RDBMS          MongoDB
Database       Database
Table          Collection
Tuple/Row      Document
Column         Field
Table Join     Embedded Documents
Primary Key    Primary Key (default key _id provided by MongoDB itself)
AND in MongoDB
• To query documents matching all of several conditions, you need to use the $and keyword.
• Example-
db.student.find(
{
$and:
[{"rollno":104},{"sname":"aakash"}]
}
)
OR in MongoDB
• To query documents based on the OR condition, you need to use $or
keyword.
• Syntax-db.mycol.find({ $or: [ {<key1>:<value1>}, { <key2>:<value2>} ] })
• Example-
db.student.find(
{
$or:[
{"rollno":104},{"sname":"aakash"}
]
}
)
NOT in MongoDB
• To query documents based on the NOT condition, you need to
use $not keyword.
• Example-db.empDetails.find(
{
"Age": { $not: { $gt: 25 } }
}
)
• Example-db.student.find({"age":{$not:{$gt:23}}})
Update Command
• MongoDB's update() and save() methods are used to update document into a collection.
• The update() method updates the values in the existing document while the save() method
replaces the existing document with the document passed in save() method.
• MongoDB Update() Method
• The update() method updates the values in the existing document.
• Syntax->db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)
• >db.student.update({'sname':'priya'},{$set:{'sname':'Anjali'}})
• By default, MongoDB will update only a single document.
• To update multiple documents, you need to set the parameter 'multi' to true.
• >db.student.update({'sname':'priya'},{$set:{'sname':'Anjali'}},
{multi:true})
• >db.emp.update({'age':23},{$set:{'cname':"avinash"}})
Update Command
• Save() Method
• The save() method replaces the existing document with the new document passed in
the save() method.
• >db.COLLECTION_NAME.save({_id:ObjectId(),NEW_DATA})
• Example-
• >db.mycol.save( { "_id" : ObjectId("507f191e810c19729de860ea"),
"title":"MONGODB", "by":"MONGODB with NOSQL" } )
• To check the update, use the find() command:
• >db.mycol.find()
Update Command
• findOneAndUpdate() method
• The findOneAndUpdate() method updates the values in the existing
document.
• Syntax- db.COLLECTION_NAME.findOneAndUpdate
(SELECTION_CRITERIA, UPDATED_DATA)
• updateOne() method
• This method updates a single document which matches the given filter.
• Syntax->db.COLLECTION_NAME.updateOne(<filter>, <update>)
• > db.empDetails.updateOne( {First_Name: 'Radhika'},
{ $set: { Age: '30',e_mail:
'radhika_newemail@gmail.com'}} )
• updateMany() method
• The updateMany() method updates all the documents that
match the given filter.
• Syntax->db.COLLECTION_NAME.updateMany(<filter>,
<update>)
• Example-
> db.empDetails.updateMany(
{Age: { $gt: 25 }},
{ $set: { Age: '00' } }
)
createIndex() options

• unique (Boolean): Creates a unique index so that the collection will not accept insertion of documents where the index key or keys match an existing value in the index. Specify true to create a unique index. The default value is false.
• name (string): The name of the index. If unspecified, MongoDB generates an index name by concatenating the names of the indexed fields and the sort order.
• sparse (Boolean): If true, the index only references documents with the specified field. These indexes use less space but behave differently in some situations (particularly sorts). The default value is false.
• expireAfterSeconds (integer): Specifies a value, in seconds, to control how long MongoDB retains documents in this collection.
• weights (document): The weight is a number ranging from 1 to 99,999 and denotes the significance of the field relative to the other indexed fields in terms of the score.
• default_language (string): For a text index, the language that determines the list of stop words and the rules for the stemmer and tokenizer. The default value is english.
• language_override (string): For a text index, the name of the field in the document that contains the language to override the default language. The default value is language.
Index methods
• dropIndex() method
• You can drop a particular index using the dropIndex() method of MongoDB.
• Syntax->db.COLLECTION_NAME.dropIndex({KEY:1})
• Here KEY is the name of the field on which the index was created; 1 is for ascending order and -1 for descending
order.
• Example->db.mycol.dropIndex({"title":1})
• dropIndexes() method
• This method deletes multiple (specified) indexes on a collection.
• Syntax->db.COLLECTION_NAME.dropIndexes()
• First create two indexes on collection
• > db.mycol.createIndex({"title":1,"description":-1})
• Drop the indexes
• >db.mycol.dropIndexes({"title":1,"description":-1})
• The getIndexes() method
• This method returns the description of all the indexes in the collection.
• Syntax->db.COLLECTION_NAME.getIndexes()
• Example->db.mycol.createIndex({"title":1,"description":-1})
• Example->db.mycol.getIndexes()
Aggregation
• Aggregation operations process data records and return computed results.
• Aggregation operations group values from multiple documents together, and can
perform a variety of operations on the grouped data to return a single result.
• SQL's count(*) with GROUP BY is the equivalent of MongoDB aggregation.
• aggregate() Method
• For the aggregation in MongoDB, use aggregate() method.
• Syntax->db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
• Now, if you want to display a list stating how many tutorials are written
by each user, then use the aggregate() method as−
• >db.mycol.aggregate([{$group : {_id : "$by_user",
num_tutorial : {$sum : 1}}}])
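What that $group pipeline computes can be mirrored in plain Python. Toy documents only; this is a conceptual sketch of the grouping, not how MongoDB executes it:

```python
from collections import defaultdict

# Group documents by "by_user" and count them, like the
# {$group: {_id: "$by_user", num_tutorial: {$sum: 1}}} stage.
docs = [
    {"title": "t1", "by_user": "priya"},
    {"title": "t2", "by_user": "priya"},
    {"title": "t3", "by_user": "aakash"},
]

num_tutorial = defaultdict(int)
for doc in docs:
    num_tutorial[doc["by_user"]] += 1   # {$sum : 1} per group

result = dict(num_tutorial)
```

Swapping the `+= 1` for `+= doc["likes"]` would mirror `{$sum: "$likes"}` from the table below, and similar substitutions give $avg, $min, and $max.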
List of aggregate functions

• $sum: Sums up the defined value from all documents in the collection.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])
• $avg: Calculates the average of all given values from all documents in the collection.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])
• $min: Gets the minimum of the corresponding values from all documents in the collection.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])
• $max: Gets the maximum of the corresponding values from all documents in the collection.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])
• $push: Inserts the value into an array in the resulting document.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push : "$url"}}}])
• $addToSet: Inserts the value into an array in the resulting document but does not create duplicates.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])
• $first: Gets the first document from the source documents according to the grouping. Typically this only makes sense together with a previously applied $sort stage.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])
• $last: Gets the last document from the source documents according to the grouping. Typically this only makes sense together with a previously applied $sort stage.
  Example: db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])
Replication
• The process of synchronizing data across multiple servers.
• Provides redundancy and increases data availability with multiple copies of data on different
database servers.
• Protects a database from the loss of a single server.
• Also allows to recover from hardware failure and service interruptions.
• With additional copies of the data, you can dedicate one to disaster recovery, reporting, or
backup.
• Why Replication?
• To keep your data safe
• High (24*7) availability of data
• Disaster recovery
• No downtime for maintenance (like backups, index rebuilds, compaction)
• Read scaling (extra copies to read from)
• Replica set is transparent to the application
Working of Replication in MongoDB
• MongoDB achieves replication by the use of replica set.
• A replica set is a group of mongod instances that host the same
data set.
• In a replica set, one node is the primary node, which receives all write
operations.
• All other instances (the secondaries) apply operations from
the primary so that they have the same data set.
• Replica set can have only one primary node.
• Replica set is a group of two or more nodes (generally minimum 3
nodes are required).
• In a replica set, one node is primary node and remaining nodes
are secondary.
• All data replicates from primary to secondary node.
• At the time of automatic failover or maintenance, an election
is held and a new primary node is elected.
• After a failed node recovers, it rejoins the replica set and
works as a secondary node.
Replica Set Features
• A cluster of N nodes
• Any one node can be primary
• All write operations go to primary
• Automatic failover
• Automatic recovery
• Consensus election of primary
Set Up a Replica Set
• To convert to replica set, following are the steps −
• Shutdown already running MongoDB server.
• Start the MongoDB server by specifying the --replSet option. Following is the basic syntax of
--replSet −
• Syntax-mongod --port "PORT" --dbpath "YOUR_DB_DATA_PATH" --replSet
"REPLICA_SET_INSTANCE_NAME"
• Example-mongod --port 27017 --dbpath "D:\set up\mongodb\data" --
replSet rs0
• It will start a mongod instance with the name rs0, on port 27017.
• Now start the command prompt and connect to this mongod instance.
• In Mongo client, issue the command rs.initiate() to initiate a new replica set.
• To check the replica set configuration, issue the command rs.conf().
• To check the status of replica set issue the command rs.status().
Add Members to Replica Set
• To add members to replica set, start mongod instances on multiple
machines.
• Now start a mongo client and issue a command rs.add().
• Syntax->rs.add(HOST_NAME:PORT)
• Example->rs.add("mongod1.net:27017")
• You can add mongod instance to replica set only when you are connected
to primary node.
• To check whether you are connected to primary or not, issue the
command db.isMaster() in mongo client.
Sharding
• Sharding is the process of storing data records across multiple machines and it is
MongoDB's approach to meeting the demands of data growth.
• As the size of the data increases, a single machine may not be sufficient to store
the data nor provide an acceptable read and write throughput.
• Sharding solves the problem with horizontal scaling. With sharding, you add more
machines to support data growth and the demands of read and write operations.
• Why Sharding?
• In replication, all writes go to master node
• Latency sensitive queries still go to master
• Single replica set has limitation of 12 nodes
• Memory can't be large enough when active dataset is big
• Local disk is not big enough
• Vertical scaling is too expensive
Sharding in MongoDB
• Shards − Shards are used to store data. They provide high availability
and data consistency. In production environment, each shard is a separate
replica set.
• Config Servers − Config servers store the cluster's metadata.
• This data contains a mapping of the cluster's data set to the shards. The
query router uses this metadata to target operations to specific shards.
• In production environment, sharded clusters have exactly 3 config servers.
• Query Routers − Query routers are basically mongos instances that interface
with client applications and direct operations to the appropriate shard.
• The query router processes and targets the operations to shards and then
returns results to the clients.
• A sharded cluster can contain more than one query router to divide the
client request load.
• A client sends requests to one query router. Generally, a sharded cluster
has many query routers.
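The query router's job of targeting one shard can be sketched with hashed placement in Python. A toy model: real MongoDB routes via config-server metadata and chunk ranges rather than a fixed modulus:

```python
import hashlib

# Hashed sharding sketch: the shard key is hashed to pick which
# shard holds a document; the same hash finds it again on reads.
NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]   # each shard: its own store

def shard_for(key):
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS        # deterministic placement

def insert(key, doc):
    shards[shard_for(key)][key] = doc          # router targets one shard

def find(key):
    return shards[shard_for(key)].get(key)     # same hash, same shard

insert("user42", {"name": "Priya"})
doc = find("user42")
```

Because placement is deterministic, the router never broadcasts a keyed lookup to every shard, which is what makes horizontal scaling of reads and writes possible.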
Create Backup
• Dump MongoDB Data
• To create backup of database in MongoDB, you should
use mongodump command.
• This command will dump the entire data of server into the dump
directory.
• There are many options available by which we can limit the
amount of data or create backup of remote server.
• Example->mongodump
Options with mongodump
• mongodump --host HOST_NAME --port PORT_NUMBER
  Backs up all databases of the specified mongod instance.
  Example: mongodump --host google.com --port 27017
• mongodump --dbpath DB_PATH --out BACKUP_DIRECTORY
  Backs up only the specified database at the specified path.
  Example: mongodump --dbpath /data/db/ --out /data/backup/
• mongodump --collection COLLECTION --db DB_NAME
  Backs up only the specified collection of the specified database.
  Example: mongodump --collection mycol --db test
Restore data
• To restore backup data MongoDB's mongorestore command is
used. This command restores all of the data from the backup
directory.
• Syntax->mongorestore
BIG DATA
• Data which is very large in size is called Big Data.
• Normally we work on data of size MB (Word docs, Excel sheets) or at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes
in size, is called Big Data.
• It is stated that almost 90% of today's data has been generated in the past 3 years.
• Sources of Big Data
• This data comes from many sources:
• Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of
users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs, from which users' buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data
of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
• 3V's of Big Data
1. Velocity: Data is being generated at a very fast rate. It is estimated that the volume of data will double every 2 years.
2. Variety: Nowadays data is not stored only in rows and columns; it is structured as well as unstructured. Log files and CCTV footage are unstructured
data; data which can be saved in tables is structured data, like the transaction data of a bank.
3. Volume: The amount of data we deal with is of very large size, on the order of petabytes.
Issues and solutions related to BIG DATA
• Issues
• Huge amount of unstructured data which needs to be stored, processed and
analyzed.
• Solution
• Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed
File System), which uses commodity hardware to form clusters and stores data
in a distributed fashion. It works on the "write once, read many times" principle.
• Processing: The MapReduce paradigm is applied to the data distributed over the network
to compute the required output.
• Analysis: Pig and Hive can be used to analyze the data.
• Cost: Hadoop is open source, so cost is no longer an issue.
Hadoop
• Hadoop is an open source framework from Apache and is used to store process and analyze data
which are very huge in volume.
• Hadoop is written in Java and is not OLAP (online analytical processing).
• It is used for batch/offline processing.
• It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover it can
be scaled up just by adding nodes in the cluster.
• Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on that basis. Files are broken into blocks and stored on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator, used for job scheduling and managing the cluster.
3. MapReduce: A framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed on as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: Java libraries used to start Hadoop, which are also used by the other Hadoop modules.
Hadoop Architecture
• The Hadoop architecture is a package of the file system, the MapReduce engine, and HDFS (the Hadoop Distributed File System).
• The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
• A Hadoop cluster consists of a single master and multiple slave nodes.
• The master node includes the JobTracker, TaskTracker, NameNode, and DataNode, whereas a slave node includes a DataNode and TaskTracker.
Hadoop Distributed File System layer
• The Hadoop Distributed File System (HDFS) is the distributed file system used by Hadoop.
• It has a master/slave architecture.
• This architecture consists of a single NameNode performing the role of master and multiple DataNodes performing the role of slaves.
• Both the NameNode and DataNodes are capable of running on commodity machines. HDFS is developed in Java, so any machine that supports Java can easily run the NameNode and DataNode software.
• NameNode
• It is the single master server in the HDFS cluster.
• As it is a single node, it may become a single point of failure.
• It manages the file system namespace by executing operations such as opening, renaming, and closing files.
• Its single-master design simplifies the architecture of the system.
• DataNode
• The HDFS cluster contains multiple DataNodes.
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
• It performs block creation, deletion, and replication upon instruction from the NameNode.
• Job Tracker
• The role of the JobTracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
• In response, the NameNode provides metadata to the JobTracker.
• Task Tracker
• It works as a slave node to the JobTracker.
• It receives the task and code from the JobTracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
• MapReduce comes into play when a client application submits a MapReduce job to the JobTracker.
• In response, the JobTracker sends the request to the appropriate TaskTrackers.
• Sometimes a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
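This reschedule-on-failure behavior can be sketched as a toy scheduler (the names and logic here are a simplified illustration, not the actual JobTracker implementation):

```python
# Toy sketch (hypothetical names): the JobTracker hands each task to a
# TaskTracker; if that tracker has failed or timed out, the task is
# rescheduled on another live tracker.
def schedule(tasks, trackers, failed):
    live = [t for t in trackers if t not in failed]
    if not live:
        raise RuntimeError("no live TaskTrackers")
    assignment = {}
    for i, task in enumerate(tasks):
        first_choice = trackers[i % len(trackers)]
        if first_choice in failed:               # tracker failed or timed out...
            first_choice = live[i % len(live)]   # ...reschedule that part of the job
        assignment[task] = first_choice
    return assignment

print(schedule(["map-0", "map-1"], ["tt1", "tt2"], failed={"tt2"}))
# {'map-0': 'tt1', 'map-1': 'tt1'}
```

The job as a whole still completes; only the failed part is re-run elsewhere.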
Advantages of Hadoop
• Fast:
• In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval.
• Even the tools to process the data are often on the same servers, thus reducing processing time.
• Hadoop is able to process terabytes of data in minutes and petabytes in hours.
• Scalable:
• A Hadoop cluster can be extended just by adding nodes to the cluster.
• Cost-effective:
• Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
• Resilient to failure:
• HDFS replicates data over the network, so if one node is down or some other network failure happens, Hadoop uses another copy of the data.
• Normally, data is replicated three times, but the replication factor is configurable.
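The idea behind this resilience can be sketched with a small replica-placement simulation (a round-robin toy model, not HDFS's actual rack-aware placement policy):

```python
from itertools import cycle

def place_replicas(blocks, nodes, replication=3):
    """Place each block on `replication` distinct nodes (round-robin sketch)."""
    placement = {}
    ring = cycle(range(len(nodes)))
    for block in blocks:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement

def available(placement, live_nodes):
    """A block survives as long as at least one replica sits on a live node."""
    return all(any(n in live_nodes for n in replicas)
               for replicas in placement.values())

p = place_replicas(["b0", "b1", "b2"], ["n1", "n2", "n3", "n4"], replication=3)
# Kill node n1: every block still has a live replica, so the data survives.
print(available(p, live_nodes={"n2", "n3", "n4"}))   # True
```

With the default replication factor of 3, any single node failure leaves at least two copies of every block intact.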
HDFS
• Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability to parallel applications.
• It is cost-effective as it uses commodity hardware. It involves the concepts of blocks, DataNodes, and the NameNode.
• Where to use HDFS
• Very large files: Files should be hundreds of megabytes, gigabytes, or more.
• Streaming data access: The time to read the whole data set matters more than the latency in reading the first record. HDFS is built on the write-once, read-many-times pattern.
• Commodity hardware: It works on low-cost hardware.
• Where not to use HDFS
• Low-latency data access: Applications that need very fast access to the first record should not use HDFS, as it gives importance to reading the whole data set rather than the time to fetch the first record.
• Lots of small files: The NameNode holds the metadata of all files in memory, and if there are many small files this consumes a lot of the NameNode's memory, which is not feasible.
• Multiple writes: It should not be used when we have to write to files multiple times.
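The small-files problem can be made concrete with a back-of-envelope estimate. A commonly cited rule of thumb is roughly 150 bytes of NameNode heap per file or block object; the exact figure varies, so treat the constant below as an illustrative assumption:

```python
import math

BYTES_PER_OBJECT = 150  # rough rule-of-thumb heap cost per file/block object (assumption)

def namenode_heap(total_bytes, file_size, block_size=128 * 2**20):
    """Estimate NameNode heap needed to track `total_bytes` stored as files of `file_size`."""
    num_files = total_bytes // file_size
    blocks_per_file = math.ceil(file_size / block_size)
    objects = num_files * (1 + blocks_per_file)  # one object per file, one per block
    return objects * BYTES_PER_OBJECT

TB = 2**40
small = namenode_heap(TB, file_size=2**20)   # 1 TB stored as 1 MB files
large = namenode_heap(TB, file_size=2**30)   # the same 1 TB as 1 GB files
print(f"{small / 2**20:.0f} MiB vs {large / 2**20:.1f} MiB")
```

The same terabyte of data costs the NameNode orders of magnitude more memory when stored as many small files, which is why HDFS favors large files.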
Basic concepts of HDFS
1. Blocks:
• A block is the minimum amount of data that HDFS can read or write.
• HDFS blocks are 128 MB by default, and this is configurable.
• Files in HDFS are broken into block-sized chunks, which are stored as independent units.
• Unlike an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's worth of space; i.e., a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space.
• The HDFS block size is large in order to minimize the cost of seeks.
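The block-splitting rule above can be sketched in a few lines (a simplified model of the arithmetic, not actual HDFS code):

```python
BLOCK_SIZE = 128 * 2**20  # HDFS default block size: 128 MB (configurable)

def split_into_blocks(file_size):
    """Return the sizes of the chunks a file of `file_size` bytes occupies."""
    if file_size == 0:
        return []
    full, rem = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rem] if rem else [])

# A 300 MB file -> two full 128 MB blocks plus one 44 MB block.
print([b // 2**20 for b in split_into_blocks(300 * 2**20)])   # [128, 128, 44]
# A 5 MB file occupies a single 5 MB chunk, not a full 128 MB block.
print(split_into_blocks(5 * 2**20) == [5 * 2**20])            # True
```

Note the last chunk is only as large as the remaining data, which is why a small file does not waste a full block of space.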
2. Name Node:
• HDFS works in a master-worker pattern where the NameNode acts as the master.
• The NameNode is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names, and the location of each block.
• The metadata is small, so it is stored in the memory of the NameNode, allowing faster access to data.
• Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine.
• File system operations such as opening, closing, and renaming are executed by it.
3. Data Node:
• DataNodes store and retrieve blocks when they are told to by the client or the NameNode.
• They report back to the NameNode periodically with the list of blocks that they are storing.
• A DataNode, being commodity hardware, also does the work of block creation, deletion, and replication as directed by the NameNode.
Features and goals of HDFS
• Features of HDFS
• Highly scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.
• Replication - Due to unfavorable conditions, a node containing data may be lost. To overcome such problems, HDFS always maintains copies of the data on different machines.
• Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is so highly fault-tolerant that if any machine fails, another machine containing a copy of that data automatically becomes active.
• Distributed data storage - This is one of the most important features of HDFS and makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
• Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
• Goals of HDFS
• Handling hardware failure - HDFS runs on many server machines. If any machine fails, the goal of HDFS is to recover from the failure quickly.
• Streaming data access - HDFS applications require streaming access to their data sets, unlike typical applications that run on general-purpose file systems.
• Coherence model - Applications that run on HDFS are required to follow the write-once, read-many approach. A file, once created, need not be changed; however, it can be appended to and truncated.
MAPREDUCE
• MapReduce is a data processing tool used to process data in parallel in a distributed fashion.
• It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
• MapReduce is a paradigm with two phases: the mapper phase and the reducer phase.
• In the mapper, the input is given in the form of key-value pairs.
• The output of the mapper is fed to the reducer as input.
• The reducer runs only after the mapper is finished. The reducer also takes input in key-value format, and the output of the reducer is the final output.
• Steps in MapReduce
• The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys are not unique at this stage.
• Using the output of the map, sort and shuffle are applied by the Hadoop framework. Sort and shuffle act on this list of <key, value> pairs and send out each unique key together with the list of values associated with it, as <key, list(values)>.
• The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final <key, value> output is stored or displayed.
• Sort and Shuffle
• Sort and shuffle occur on the output of the mapper and before the reducer. When the mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input <k2, v2> from each mapper, we collect all the values for each unique key k2. This output from the shuffle phase, in the form of <k2, list(v2)>, is sent as input to the reducer phase.
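The map, sort-and-shuffle, and reduce phases described above can be simulated with the classic word-count example (a single-process sketch of the paradigm, not a distributed Hadoop job):

```python
from collections import defaultdict

# Map phase: emit a <word, 1> pair for every word (keys are not unique yet).
def mapper(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: sort by key, then group all values under their unique key,
# producing <key, list(values)> pairs.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped

# Reduce phase: apply a function (here, sum) to each key's list of values.
def reducer(grouped):
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big cluster", "data node"]
pairs = [pair for line in lines for pair in mapper(line)]
counts = reducer(shuffle(pairs))
print(counts)   # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In real Hadoop, each phase runs on many nodes in parallel, but the data flow — non-unique <key, value> pairs in, grouped <key, list(values)> through the shuffle, one <key, value> per unique key out — is exactly the one outlined above.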
Terminology
• PayLoad − Applications implement the Map and Reduce functions and form the core of the job.
• Mapper − The mapper maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode − The node that manages the Hadoop Distributed File System (HDFS).
• DataNode − A node where the data is present in advance, before any processing takes place.
• MasterNode − The node where the JobTracker runs and which accepts job requests from clients.
• SlaveNode − A node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the jobs assigned to the TaskTrackers.
• TaskTracker − Tracks the task and reports status to the JobTracker.
• Job − An execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
Usage of MapReduce
• String Types
• STRING
• A string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
• VARCHAR
• A varchar is a variable-length type whose declared length lies between 1 and 65535; this length specifies the maximum number of characters allowed in the string.
• CHAR
• A char is a fixed-length type whose maximum length is fixed at 255.