BY
G PRASAD, B.Tech, M.Tech & (PhD)
Assistant Professor
&
Asst. Controller of Exams
SNIST – IT Dept
UNIT-1
Introduction to Big Data:
Big Data Analytics,
Characteristics of Big Data – The Four Vs,
Importance of Big Data, Different Use Cases,
Data – Structured, Semi-Structured, Unstructured,
Introduction to Hadoop and its use in solving big data problems,
Comparison of Hadoop with RDBMS,
Brief History of Hadoop,
Apache Hadoop Ecosystem,
Components of Hadoop,
The Hadoop Distributed File System (HDFS): Architecture and Design of HDFS in detail,
Working with HDFS (Commands)
Data: Data is nothing but facts and statistics, stored or flowing freely over a network; generally it is raw and unprocessed.
When data is processed, organized, structured, or presented in a given context so as to make it useful, it is called Information.
Big Data:
Data that is beyond the storage capacity and beyond the processing power of conventional systems can be called Big Data. Or:
“Big Data” is a term for datasets that are so large or complex that traditional data processing software is inadequate to deal with them.
• Extracting value and hidden knowledge from such data requires new architectures, techniques, algorithms, and analytics to manage it.
Characteristics of Big Data:
Big Data is marked by massive volume and complex data formats. Its defining characteristics are the Vs:
• Volume – the sheer scale of the data
• Velocity – the speed at which data is generated and moved
• Variety – the range of data formats and sources
(The fourth V is usually given as Veracity – the trustworthiness of the data.)
Units of data volume:
1 Kilobyte (KB) = 1,000 bytes
1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
Velocity: data is generated and transferred at ever-increasing speed by many sources:
• Mobile devices (tracking all objects all the time)
• Scientific instruments (collecting all sorts of data)
• Social media and networks (all of us are generating data)
• Sensor technology and networks (measuring all kinds of data)
Big Data Analytics:
Big data analytics is technology-enabled analytics: IT’s collaboration with business users and data scientists, supporting time-sensitive decisions made in near real time by processing a steady stream of real-time data.
Big data analytics sets the stage for better and faster decision making. It is about leveraging technology to help with analytics, and it spells a tight handshake between the communities of business users, IT, and data scientists. When properly utilized, it leads to richer, deeper, and wider insights into the business, the customers, and the partners.
What Big Data Analytics Isn’t:
Big data analytics is not here to replace the RDBMS or the data warehouse. It is much more than technology: it is about dealing not just with the massive onslaught of huge volumes of data, but also with the great variety and velocity of data. It works on the philosophy of “move code to data”.
Data Types – Structured, Semi-Structured, Unstructured:
Structured Data: data that conforms to a fixed, formally defined schema.
• Sources of structured data: databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc., and OLTP (transaction-processing) systems.
• Ease with structured data: indexing/searching, security, and transaction processing.
Semi-Structured Data: data with some organizational structure but no rigid schema, e.g. XML (eXtensible Markup Language).
Unstructured Data: chats, social media data, Word documents, etc.
Issues with terminology – “unstructured” data: structure can be implied despite not being formally defined.
Approaches to the analysis of Big Data:
• In-Memory Analytics
• In-Database Processing
• Massively Parallel Processing
• Parallel Systems
• Distributed Systems
• Shared Nothing Architecture
In-Memory Analytics: all the relevant data is stored in Random Access Memory (RAM), or primary storage, thus eliminating the need to access the data from hard disk.
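As a toy sketch of the in-memory idea in Java (Java 16+; no particular analytics product is implied, and the dataset and query are made up for illustration):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class InMemoryAnalyticsDemo {
    record Sale(String region, double amount) {}

    public static void main(String[] args) {
        // All relevant data is held in RAM up front...
        List<Sale> sales = List.of(
                new Sale("North", 120.0), new Sale("South", 80.0),
                new Sale("North", 200.0), new Sale("South", 50.0));

        // ...so queries aggregate directly from memory,
        // with no disk access per query.
        Map<String, Double> totalByRegion = sales.stream()
                .collect(Collectors.groupingBy(Sale::region,
                        Collectors.summingDouble(Sale::amount)));
        System.out.println(totalByRegion); // e.g. {North=320.0, South=130.0}
    }
}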
Shared Nothing Architecture: in SNA, neither memory nor disk is shared between the nodes.
Big Data Use Cases:
• Log analytics
• Fraud detection
• Social media and sentiment analysis
• Risk modeling and management
• Energy sector
Introduction to Hadoop:
Traditional Approach of Storing Data: In this approach, an enterprise has a computer to store and process big data. For storage, programmers take the help of their choice of database vendor, such as Oracle, IBM, etc. The user interacts with the application, which in turn handles the data storage and analysis.
Limitation:
This approach works fine for applications that process less voluminous data, data that can be accommodated by standard database servers or handled within the limit of the processor that is processing it.
But when it comes to dealing with huge amounts of scalable data, it is a hectic task to process such data through a single database server.
To overcome this limitation, Hadoop comes in as a solution for storing and processing large, voluminous amounts of data.
• Apache Hadoop is an open-source framework that is used for storing and
processing large amounts of data in a distributed computing environment.
• Its framework is based on the Java programming language.
• It was developed by Doug Cutting and his team, administered by the
Apache Software Foundation.
• It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
• Unlike traditional, structured platforms, Hadoop is able to store any kind
of data in its native format and to perform a wide variety of analyses and
transformations on that data.
• Hadoop solves the problem of Big Data by storing the data in distributed form across different machines. There is plenty of data, but that data has to be stored in a cost-effective way and processed efficiently.
• Hadoop stores terabytes, and even petabytes, of data inexpensively.
• It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and
many more.
• It is robust and reliable and handles hardware and system failures automatically, without losing data or interrupting data analyses.
ADVANTAGES OF HADOOP
Scalability: Hadoop can easily scale to handle large amounts of data by
adding more nodes to the cluster.
Cost-effective: Hadoop is designed to work with commodity hardware,
which makes it a cost-effective option for storing and processing large
amounts of data.
Fault-tolerance: Hadoop’s distributed architecture provides built-in fault tolerance, which means that if one node in the cluster goes down (fails), the data can still be processed by the other nodes.
Flexibility: Hadoop can process structured, semi-structured, and
unstructured data, which makes it a versatile option for a wide range of
big data scenarios.
Disadvantages:
Not very effective for small data.
Limited support for structured data: Hadoop is designed to work with unstructured and semi-structured data; it is not well-suited for structured data processing.
Data loss: in the event of a hardware failure, the data stored on a single node may be lost permanently.
COMPARISON OF HADOOP WITH RDBMS:
• Data types: an RDBMS works best with structured data; Hadoop handles structured, semi-structured, and unstructured data.
• Scaling: an RDBMS typically scales vertically (a bigger server); Hadoop scales horizontally by adding commodity nodes to the cluster.
• Workload: an RDBMS suits OLTP/transactional workloads; Hadoop suits batch processing of very large datasets.
• Data movement: an RDBMS moves the data to the code; Hadoop works on the philosophy of “move code to data”.
Replication in HDFS: Replication ensures the availability of the data. Replication is making a copy of something, and the number of times you make a copy of that particular thing is called its Replication Factor.
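As a minimal sketch of how the replication factor can be inspected and changed from client code, using the standard Hadoop FileSystem API (the file path and factor values are illustrative, and a reachable cluster with an existing file is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // illustrative path
        // Read the current replication factor of an existing file.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);

        // Ask the NameNode to re-replicate this file with a new factor.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}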
Rack Awareness:
In a large Hadoop cluster there are multiple racks, and each rack consists of DataNodes. Communication between DataNodes on the same rack is more efficient than communication between DataNodes residing on different racks.
To reduce network traffic during file reads and writes, the NameNode chooses the closest DataNode for serving the client’s read/write request. The NameNode maintains the rack ID of each DataNode to achieve this.
This concept of choosing the closest DataNode based on rack information is known as Rack Awareness.
HDFS Read and Write Mechanism:
To read or write a file in HDFS, a client must interact with the NameNode. The NameNode checks the privileges of the client and gives permission to read or write the data blocks; the block transfers themselves then happen directly between the client and the DataNodes, and can proceed in parallel across blocks.
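A minimal sketch of the client side of this mechanism, using the standard Hadoop FileSystem API (the NameNode address and file path are illustrative assumptions):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // illustrative address
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");

        // Write: the client asks the NameNode where to put the blocks,
        // then streams the data to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("Hello, HDFS!\n");
        }

        // Read: the client asks the NameNode for the block locations,
        // then reads the blocks directly from the DataNodes.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}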
Hadoop MapReduce:
MapReduce is a data processing tool used to process data in parallel, in a distributed form.
MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map() generates key-value pairs as its result, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
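As a concrete illustration of these two functions, here is the classic word-count job sketched against the standard Hadoop MapReduce Java API (input and output paths are supplied on the command line; this is a teaching sketch, not production code):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emits a (word, 1) key-value pair for every word in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): summarizes the mapped pairs by summing the counts per word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The grouping happens between the two phases: the framework sorts and shuffles the (word, 1) pairs so that all counts for the same word reach a single Reduce() call.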