
Big Data Analytics

BY
G PRASAD, B.Tech, M.Tech & (Ph.D.)
Assistant Professor
&
Asst. Controller of Exams
SNIST – IT Dept
UNIT-1
Introduction to Big Data:
Big Data Analytics,
Characteristics of Big Data
The Four Vs, Importance of Big Data, Different Use Cases
Data: Structured, Semi-Structured, Unstructured
Introduction to Hadoop and its use in solving big data problems
Comparison of Hadoop with RDBMS
Brief history of Hadoop
Apache Hadoop Ecosystem
Components of Hadoop
The Hadoop Distributed File System (HDFS): Architecture and design of HDFS in detail
Working with HDFS (Commands)
Data: Data is nothing but facts and statistics stored or free-flowing over a network; it is generally raw and unprocessed.
When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
Big Data:
Data that is beyond the storage capacity and beyond the processing power of traditional systems can be called Big Data. Or:
"Big Data" is a term for datasets that are so large or complex that traditional data processing software is inadequate to deal with them.
• To extract value and hidden knowledge, it requires new architecture, techniques, algorithms, and analytics to manage.
• Massive volume
• Complex data formats
• Not possible to process using traditional techniques
Examples: Huge amounts of data are used in
1. Social networks: Facebook, Twitter, Yahoo and Google
2. National climate data centers: temperature readings (from various places, giving climate details) and rainfall
3. Hospital data: databases for medical records
4. Airlines: a flight can generate 10 TB of data every 30 minutes
5. Mobile devices: Facebook, WhatsApp
Advantages:
Making our cities smarter
• Oslo, Norway
– Reduced street light energy consumption by 62%.
• Memphis Police Department, USA
– Reduced serious crime by 30% since 2006.
• Applications:
• Healthcare: With the help of predictive analytics, medical professionals are
now able to provide personalized healthcare services to individual patients.
• Academia: Big Data is also helping enhance education today. There are numerous online educational courses to learn from. Academic institutions are investing in digital courses powered by Big Data technologies to aid the all-round development of budding learners.
• Banking: The banking sector relies on Big Data for fraud detection. Big Data
tools can efficiently detect fraudulent acts in real time, such as the misuse of credit/debit cards.
• IT: One of the largest users of Big Data, IT companies around the world are using Big Data to optimize their functioning, enhance employee productivity, and minimize risks in business operations. By combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to find solutions even for the most complex of problems.
Facts and Figures:
• Wal-Mart handles 1 million customer transactions per hour. Facebook handles 40 billion photos from its user base and inserts 500 terabytes of new data every day. Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
• A flight generates 240 terabytes of flight data in 6-8 hours of flight.
• More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
• The largest AT&T database boasts titles including the largest
volume of data in one unique database (312 terabytes) and the
second largest number of rows in a unique database (1.9 trillion),
which comprises AT&T’s extensive calling records.
Where is the problem?
• Traditional RDBMS queries aren't sufficient to get useful information out of the huge volume of data.
• To search it with traditional tools to find out if a particular topic
was trending would take so long that the result would be
meaningless by the time it was computed.
• Big Data technologies offer a solution: store this data in novel ways in order to make it more accessible, and come up with methods of performing analysis on it.
Big Data Challenges:
• The major challenges associated with big data are as follows −
• Capturing data
• Storage
• Security
• Analysis
• To address these challenges, organizations normally take the help of enterprise servers.
Traits of Big Data
Data Challenges
• Volume : The challenge is how to deal with the size of Big
Data.
• Variety, Combining Multiple Data Sets : The challenge is
how to handle multiplicity of types, sources, and formats.
• Velocity : One of the key challenges is how to react to the
flood of information in the time required by the application.
• Veracity, Data Quality, Data Availability
– How can we cope with uncertainty, imprecision, missing values,
misstatements or untruths?
– How good is the data? How broad is the coverage?
– How fine is the sampling resolution? How timely are the readings?
– How well understood are the sampling biases?
– Is there data available, at all?
• Data Discovery
– This is a huge challenge: how to find high-quality data from
the vast collections of data that are out there on the Web.
• Quality and Relevance
• The challenge is determining the quality of data sets and
relevance to particular issues
• Data Comprehensiveness
– Are there areas without coverage? What are the implications?
• Personally Identifiable Information
– Can we extract enough information to help people without
extracting so much as to compromise their privacy?
• Data Dogmatism
– Domain experts—and common sense—must continue to play
a role.
• Scalability
– Techniques like social graph analysis, for instance leveraging the influencers in a social network to create a better user experience, are hard problems to solve at scale.
All of these problems combined create a perfect storm of challenges and opportunities to create faster, cheaper and better solutions for Big Data analytics than traditional approaches can provide.
Process Challenges
• Capturing data
• Aligning data from different sources (e.g., resolving
when two objects are the same)
• Transforming the data into a form suitable for analysis
• Modeling it, whether mathematically, or through some
form of simulation
• Understanding the output, visualizing and sharing the results; think for a second about how to display complex analytics on an iPhone or another mobile device.
Challenges of Conventional Systems:
• Traditional analytics works on known data terrain, and only on data that is well understood. It cannot work on unstructured data efficiently.
• Traditional Analytics is built on top of the relational data
model, relationships between the subjects of interests have
been created inside the system and the analysis is done based
on them. This approach is not adequate for big data
analytics.
• Traditional analytics is batch oriented and we need to wait for
nightly ETL (extract, transform and load) and transformation
jobs to complete before the required insight is obtained.
• Parallelism in a traditional analytics system is achieved
through costly hardware like MPP (Massively Parallel
Processing) systems
• Inadequate support for aggregated summaries of data
• Designed to handle well-structured data
• Traditional storage vendor solutions are very expensive
• Shared block-level storage is too slow
• Data is read in 8 KB or 16 KB block sizes
• Schema-on-write requires data to be validated before it can be written to disk
• Software licenses are too expensive
• Data must be read from disk and loaded into memory by the application before it can be processed
Definition of Big Data: the three Vs
• Volume
• Velocity
• Variety

Volume - A Mountain of Data:
1 Kilobyte (KB) = 1,000 bytes
1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes

Velocity: Batch → Periodic → Near real time → Real-time processing

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
• Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
• Convert 350 billion annual meter readings to better predict power consumption

Example: The EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, the plume of magma underneath Yellowstone, and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Variety:
• Structured data. Example: traditional transaction processing systems and RDBMS, etc.
• Semi-structured data. Example: HyperText Markup Language (HTML), eXtensible Markup Language (XML).
• Unstructured data. Example: unstructured text documents, audio, video, email, photos, PDFs, social media, etc.

Challenges with Big Data: capture, storage, curation, search, analysis, transfer, visualization, privacy violations.

Other Characteristics of Data which are not Definitional Traits of Big Data:
• Veracity and Validity
• Volatility
• Variability
• Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
• Scrutinize 5 million trade events created each day to identify potential fraud
• Analyze 500 million daily call detail records in real time to predict customer churn faster
Examples:
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missing opportunities
Examples:
• E-Promotions: Based on your current location, your purchase history and what you like → send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurement requires immediate reaction
Real-time/Fast Data sources:
• Mobile devices (tracking all objects all the time)
• Scientific instruments (collecting all sorts of data)
• Social media and networks (all of us are generating data)
• Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

(Figure: the 5 Vs of Big Data.)
BIG DATA ANALYTICS
The process of analyzing large volumes of diverse data sets using advanced analytical techniques is referred to as Big Data Analytics.

There are mainly 4 types of analytics:

Descriptive Analytics: Descriptive analytics is focused solely on historical data, describing without judgment. It tells what happened in the past and uses statistical functions to return answers ("What happened?").
Ex: Google Analytics and Netflix.
Predictive Analytics: States what might happen in the future, i.e., that some specified event may occur, so as to reduce losses for the users ("What might happen in the future?"). Predictive analytics is a statistical method that utilizes algorithms and machine learning to identify trends in data and predict future behaviors.
Prescriptive Analytics: Imposition of a rule or method, i.e., what we should do or what kind of action should be performed ("What should we do next?"). Example: self-driving cars, which analyze the environment and move on the road.
Diagnostic Analytics: It finds the root cause of something that has happened, for example in medical diagnosis ("Why did this happen?").
What is Big Data Analytics?
• Working with data sets whose volume and variety are beyond the storage and processing capability of typical database software
• Moving code to data for greater speed and efficiency
• Better, faster decisions in real time
• Richer, deeper insights into customers, partners and the business
• Competitive advantage
• Technology-enabled analytics
• IT's collaboration with business users and data scientists
• Time-sensitive decisions made in near real time by processing a steady stream of real-time data
Big data analytics sets the stage for better and faster decision making. It is about leveraging technology to help with analytics. It spells a tight handshake between the communities of business users, IT and data scientists. When properly utilized, it leads to richer, deeper and wider insights into the business, the customers and the partners.
What Big Data Analytics isn’t

• Only about volume
• A "one-size-fits-all" traditional RDBMS built on shared disk and memory
• Just about technology
• Only used by huge online companies like Google or Amazon
• Meant to replace RDBMS
• Meant to replace the Data Warehouse
Big data analytics is not here to replace RDBMS or data warehouse. It is much more
than technology. It is about dealing with not just the massive onslaught of huge
volumes of data but also dealing with great variety and velocity of data. It works on the
philosophy of “move code to data”.
Characteristics of Big Data: the 3 Vs of Big Data (Volume, Velocity, Variety).
(Figures: Characteristics of Big Data; Sources of Big Data.)
Importance of Big Data: Big Data's importance doesn't revolve around the amount of data a company has. Its importance lies in how the company utilizes the gathered data.
• Every company uses its collected data in its own way. The more effectively a company uses its data, the more rapidly it grows.
• Companies in the present market need to collect and analyze data because:
1. Cost Savings
• Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving
benefits to businesses when they have to store large amounts of data.
These tools help organizations in identifying more effective ways of
doing business.
2. Time-Saving
• Real-time in-memory analytics helps companies to collect data from
various sources. Tools like Hadoop help them to analyze data
immediately thus helping in making quick decisions based on the
learnings.
3. Understand the market conditions
• Big Data analysis helps businesses to get a better understanding of
market situations.
• For example, analysis of customer purchasing behavior helps
companies to identify the products sold most and thus produces those
products accordingly. This helps companies to get ahead of their
competitors.
4. Social Media Listening
• Companies can perform sentiment analysis using Big Data tools.
These enable them to get feedback about their company, that is, who
is saying what about the company.
• Companies can use Big data tools to improve their online presence.
5. Boost Customer Acquisition and Retention
• Customers are a vital asset on which any business depends. No single business can achieve success without building a robust customer base. But even with a solid customer base, companies can't ignore the competition in the market.
• If a company doesn't know what its customers want, its success will suffer. The result is a loss of clientele, which creates an adverse effect on business growth.
• Big data analytics helps businesses to identify customer related
trends and patterns. Customer behavior analysis leads to a
profitable business.
6. Solve Advertisers Problem and Offer Marketing Insights
• Big data analytics shapes all business operations. It enables
companies to fulfill customer expectations. Big data analytics helps
in changing the company’s product line. It ensures powerful
marketing campaigns.
7. The driver of Innovations and Product Development
• Big data makes companies capable of innovating and redeveloping their products.
BIG DATA - Types of Digital Data:
• Structured: sources of structured data, ease with structured data
• Semi-structured: sources of semi-structured data
• Unstructured: sources of unstructured data, issues with terminology, dealing with unstructured data

Structured Data
• This is data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• Data stored in databases is an example of structured data.

Sources of Structured Data: databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.; spreadsheets; OLTP (Online Transaction Processing) systems.

Ease with Structured Data: input/update/delete, security, indexing/searching, scalability, transaction processing.
Semi-structured Data
• This is data which does not conform to a data model but has some structure. However, it is not in a form which can be used easily by a computer program.
• Examples: emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.

Sources of Semi-structured Data: XML (eXtensible Markup Language), other markup languages, JSON (JavaScript Object Notation).

Characteristics of Semi-structured Data:
• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes not known beforehand
Unstructured Data
• This is data which does not conform to a data model or is not in a form which can be used easily by a computer program.
• About 80-90% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.

Sources of Unstructured Data: web pages, images, free-form text, audio, video, the body of emails, text messages, chats, social media data, Word documents.
Issues with terminology - Unstructured Data
• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure doesn't help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.

Dealing with Unstructured Data: data mining, Natural Language Processing (NLP), text analytics, noisy text analytics.


Terminologies Used in Big Data Environments
• In-Memory Analytics
• In-Database Processing
• Massively Parallel Processing
• Parallel System
• Distributed System
• Shared Nothing Architecture

In-Memory Analytics: Here all the relevant data is stored in Random Access Memory (RAM), or primary storage, thus eliminating the need to access the data from hard disk.

In-Database Processing: With in-database processing, the database program itself can run the computations, eliminating the need for export and thereby saving time.

MPP: Massively Parallel Processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel. Each processor has its own operating system and dedicated memory, and the processors work on different parts of the same program.

Parallel System: A parallel database system is a tightly coupled system. The processors cooperate for query processing. The user is unaware of the parallelism since he/she has no access to a specific processor of the system.

Distributed System: The data is usually distributed across several machines, thereby necessitating that quite a number of machines be accessed to answer a user query.

Shared Nothing Architecture: In SNA, neither the memory nor the disk is shared.
Big Data Use Cases?
• Log analytics
• Fraud detection
• Social media and sentiment analysis
• Risk modeling and management
• Energy sector
Introduction to Hadoop:
Traditional Approach to Storing Data: In this approach, an enterprise has a computer to store and process big data. For storage purposes, the programmers take the help of their choice of database vendors such as Oracle, IBM, etc. In this approach, the user interacts with the application, which in turn handles the part of data storage and analysis.

Limitation
 This approach works fine with those applications that process less
voluminous data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data.
But when it comes to dealing with huge amounts of scalable data, it is a tedious task to process such data through a single database.
To overcome this limitation, Hadoop comes in as a solution for storing and processing large volumes of data.
INTRODUCTION TO HADOOP:
• Apache Hadoop is an open-source framework that is used for storing and
processing large amounts of data in a distributed computing environment.
• Its framework is based on Java programming
• It was developed by Doug Cutting and his team, administered by the
Apache Software Foundation.
• It is designed to handle big data and is based on the Map Reduce
programming model, which allows for the parallel processing of large
datasets.
• Unlike traditional, structured platforms, Hadoop is able to store any kind
of data in its native format and to perform a wide variety of analyses and
transformations on that data.
• Hadoop solves the problem of Big Data by storing the data in distributed form on different machines. There is plenty of data, but that data has to be stored in a cost-effective way and processed efficiently.
• Hadoop stores terabytes, and even petabytes, of data inexpensively.
• It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and
many more.
• It is robust and reliable and handles hardware and system failures
automatically, without losing data or interrupting data analyses .
ADVANTAGES OF HADOOP
Scalability: Hadoop can easily scale to handle large amounts of data by
adding more nodes to the cluster.
Cost-effective: Hadoop is designed to work with commodity hardware,
which makes it a cost-effective option for storing and processing large
amounts of data.
Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-
tolerance, which means that if one node in the cluster goes down(fails), the
data can still be processed by the other nodes.
Flexibility: Hadoop can process structured, semi-structured, and
unstructured data, which makes it a versatile option for a wide range of
big data scenarios.
Disadvantages:
• Not very effective for small data.
• Limited support for structured data: Hadoop is designed to work with unstructured and semi-structured data; it is not well suited for structured data processing.
• Data loss: in the event of a hardware failure, the data stored on a single node may be lost permanently.
COMPARISON OF HADOOP WITH RDBMS:
1. RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval. Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.
2. RDBMS: Mostly processes structured data. Hadoop: Processes both structured and unstructured data.
3. RDBMS: Best suited for OLTP (Online Transaction Processing) environments. Hadoop: Best suited for Big Data.
4. RDBMS: Less scalable than Hadoop. Hadoop: Highly scalable.
5. RDBMS: Data normalization is required. Hadoop: Data normalization is not required.
6. RDBMS: Stores transformed and aggregated data. Hadoop: Stores huge volumes of data.
7. RDBMS: The data schema is static. Hadoop: The data schema is dynamic.
8. RDBMS: High data integrity. Hadoop: Lower data integrity than RDBMS.


BRIEF HISTORY OF HADOOP:
1. In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source software project.
2. While working on Apache Nutch, they were dealing with big data. Storing that data would have required a lot of cost, which became a problem for the project. This problem became one of the important reasons for the emergence of Hadoop.
3. In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.
4. In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
5. In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This work also included an implementation of MapReduce.
6. In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
7. Doug Cutting named his project Hadoop after his son's toy elephant.
8. In 2007, Yahoo ran two clusters of 1000 machines.
9. In 2008, Hadoop became a top-level project at Apache and the fastest system to sort 1 terabyte of data, on a 900-node cluster, in 209 seconds.
10. In 2011, Hadoop 1.0 was released.
11. In 2012, Hadoop 2.0 was released.
12. In 2017, Hadoop 3.0 was released.
13. In 2020, Hadoop 3.1.3 was released.
APACHE HADOOP ECOSYSTEM: Introduction: Hadoop is a
framework that manages big data storage by means of parallel and
distributed processing. Hadoop is comprised of various tools and
frameworks that are dedicated to different sections of data management,
like storing, processing, and analyzing. The Hadoop Ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions.
DATA STORAGE:
HDFS: HDFS(Hadoop Distributed File System) is the primary or major
component of Hadoop ecosystem and is responsible for storing large data
sets of structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files.
 HDFS is a specially designed file system for storing huge datasets in
commodity hardware, storing information in different formats on various
machines.
 HDFS consists of two core components i.e.
1. Name Node (master node)
2. Data Node (slave node)
Name Node:
• The Name Node is the master node; there is only one active Name Node.
• It manages the Data Nodes and stores all the metadata.
• The Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
• HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.
Data Node:
• The Data Node is the slave daemon; there can be multiple Data Nodes.
• It stores the actual data.
 HDFS splits the data into multiple blocks, defaulting to a
maximum of 128 MB. The default block size can be changed
depending on the processing speed and the data distribution. Let’s
look at the example below:

For example, suppose we have 300 MB of data. This is


broken down into 128 MB, 128 MB, and 44 MB. The final block
handles the remaining needed storage space, so it doesn’t have to be
sized at 128 MB. This is how data gets stored in a distributed manner
in HDFS.
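To make the block arithmetic above concrete, here is a minimal, hypothetical Java sketch (not Hadoop source code) that divides a file length into HDFS-style blocks, assuming the default block size of 128 MB:

import java.util.ArrayList;
import java.util.List;

public class BlockSplitDemo {
    // Assumed default HDFS block size of 128 MB, expressed in bytes.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // Returns the size of each block a file of the given length would occupy.
    static List<Long> splitIntoBlocks(long fileLength) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            long block = Math.min(remaining, BLOCK_SIZE);
            blocks.add(block);
            remaining -= block;
        }
        return blocks;
    }

    public static void main(String[] args) {
        long fileLength = 300L * 1024 * 1024; // the 300 MB example from the text
        // Prints [134217728, 134217728, 46137344], i.e. 128 MB, 128 MB and 44 MB.
        System.out.println(splitIntoBlocks(fileLength));
    }
}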
• HBASE: HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). This means that data is stored in individual columns and indexed by a unique row key.
• HBASE provides a fault-tolerant way of storing sparse data
sets, which are common in many big data use cases.
• It is well suited for real-time data processing or random read/write
access to large volumes of data.
• Unlike relational database systems, HBase does not support a
structured query language like SQL; in fact, HBase isn’t a
relational data store at all.
• HBase applications are written in Java much like a typical
MapReduce application.
• HBase does support writing applications in Avro.
• HBase works well with Hive, a query engine for batch processing
of big data to enable fault-tolerant big data applications.
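To make the row-key and column-family idea concrete, the following is a small sketch using the standard HBase Java client API. The table name "users", the column family "info" and the row key "user1" are hypothetical, and the table is assumed to already exist on a running cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Write one row, keyed by a unique row key, into column family "info".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Hyderabad"));
            table.put(put);

            // Random read: fetch the same row back by its row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}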
DATA PROCESSING:
MAP REDUCE:
MapReduce is a data processing tool which is used to process data in parallel in a distributed form.
MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it in the form of groups. Map generates a key-value pair based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
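As a concrete illustration of the Map() and Reduce() pair described above, here is the classic word-count job written against the Hadoop MapReduce Java API. It is only a sketch; the HDFS input and output paths are supplied on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emits a (word, 1) key-value pair for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): sums all counts received for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted to the cluster, for example with: hadoop jar wordcount.jar WordCount /input /output.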
YARN:
Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system.
Node Managers work on the allocation of resources such as CPU, memory and bandwidth per machine, and later acknowledge the Resource Manager.
The Application Manager works as an interface between the Resource Manager and Node Managers and performs negotiations as per the requirements of the two.
DATA ACCESS:
HIVE: Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy. It provides an SQL-type language for querying called HiveQL or HQL (see the short client sketch after this list).
Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is familiar, fast, scalable, and extensible.
• Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• Hive is not:
– A relational database
– A language for real-time queries
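As a small illustration of issuing HiveQL from a client program, here is a sketch that uses Hive's JDBC driver. It assumes HiveServer2 is running at localhost:10000 and that a table named employees already exists; both are assumptions for this example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, database and user are assumptions for this sketch.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; the query is compiled into batch jobs that run on the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) FROM employees GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}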
PIG:
Pig was basically developed by Yahoo. It works on the Pig Latin language, which is a query-based language similar to SQL.
Apache Pig is a scripting language.
It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop Ecosystem.
MAHOUT:
Apache Mahout is an open source project that is primarily used for
creating scalable machine learning algorithms.
It implements popular machine learning techniques such as:
Filtering
Classification
Clustering
AVRO:
Avro is an open source project that provides data serialization and
data exchange services for Apache Hadoop.
These services can be used together or independently.
Avro facilitates the exchange of big data between programs written in
any language. With the serialization service, programs can efficiently
serialize data into files or into messages.
SQOOP:
Sqoop is a tool designed to transfer data between Hadoop and
relational database servers.
It is used to import data from relational databases such as MySQL,
Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases.
It is provided by the Apache Software Foundation.
DATA MANAGEMENT: OOZIE:
Oozie simply performs the task of a scheduler, thus scheduling jobs and binding
them together as a single unit.
There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs.
Oozie workflow jobs are jobs that need to be executed in a sequentially ordered manner.
Oozie coordinator jobs are those that are triggered when some data or external stimulus is given to them.
FLUME:
Apache Flume is a tool for data ingestion in HDFS.
It collects, aggregates and transports large amount of streaming data such as log
files, events from various sources like network traffic, social media, email
messages etc. to HDFS.
Flume is highly reliable and distributed.
The main idea behind Flume's design is to capture streaming data from various web servers to HDFS.
It has a simple and flexible architecture based on streaming data flows.
It is fault-tolerant and provides reliability mechanisms for fault tolerance and failure recovery.
Chukwa: Apache Chukwa is an open source large-scale log
collection system for monitoring large distributed systems. It is one of
the common big data terms related to Hadoop. It is built on the top of
Hadoop Distributed File System (HDFS) and Map/Reduce framework.
It inherits Hadoop's robustness and scalability.
ZOOKEEPER:
Zookeeper is an open source Apache project that provides a
centralized service for providing configuration information, naming,
synchronization and group services over large clusters in distributed
systems.
The goal is to make these systems easier to manage with improved,
more reliable propagation of changes.
There was a huge issue with the management of coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.
COMPONENTS OF HADOOP:
• Hadoop is a framework that uses distributed storage and parallel processing to store and manage big data. It is the software most used by data analysts to handle big data, and its market size continues to grow. There are three components of Hadoop:
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit.
3. Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource
management unit.
The Hadoop Distributed File System (HDFS):
 Hadoop Distributed File System (HDFS) is the storage unit.
 Data is stored in a distributed manner in HDFS. There are two
components of HDFS - name node and data node.
 While there is only one name node, there can be multiple data
nodes. HDFS is specially designed for storing huge datasets in
commodity hardware.
 Features of HDFS
 Provides distributed storage

 Can be implemented on commodity hardware


 Provides data security
Master and Slave Nodes
 Master and slave nodes form the HDFS cluster. The name node is
called the master, and the data nodes are called the slaves.
The name node is responsible for the workings of the data nodes. It
also stores the metadata.
The data nodes read, write, process, and replicate the data.
They also send signals, known as heartbeats, to the name node.
These heartbeats show the status of the data node.
 Consider that 30TB of data is loaded into the name node. The name node
distributes it across the data nodes, and this data is replicated among the data nodes. In the accompanying figure, the blue, grey, and red blocks are each replicated among the three data nodes.
 Replication of the data is performed three times by default. It is done this way, so
if a commodity machine fails, you can replace it with a new machine that has the
same data.

.
Replication in HDFS: Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies made of that particular thing is expressed as its replication factor. For example, with the default replication factor of 3, a 1 GB file occupies 3 GB of raw storage across the cluster.
Rack Awareness:
 In a large Hadoop cluster, there are multiple racks.
 Each rack consists of Data Nodes. Communication between the
Data Nodes on the same rack is more efficient as compared to the
communication between DataNodes residing on different racks.
 To reduce the network traffic during file read/write Name Node
chooses the closest Data Node for serving the client read/write
request.
 NameNode maintains rack ids of each DataNode to achieve this
rack information.
 This concept of choosing the closest DataNode based on
the rack information is known as Rack Awareness.
HDFS Read and Write Mechanism
HDFS Read and Write mechanisms are parallel activities. To read or
write a file in HDFS, a client must interact with the name node. The
name node checks the privileges of the client and gives permission
to read or write on the data blocks.
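To make this client interaction concrete, here is a minimal sketch using the Hadoop FileSystem Java API. The path /user/demo/sample.txt is hypothetical, and the cluster address is assumed to come from the default configuration files (core-site.xml / hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.nio.charset.StandardCharsets;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // client handle; talks to the name node
        Path path = new Path("/user/demo/sample.txt");   // hypothetical HDFS path

        // Write: the name node chooses the data nodes; the client streams the blocks to them.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the name node returns block locations; the client reads from the data nodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}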
Hadoop MAP REDUCE:
 A MapReduce is a data processing tool which is used to process the
data parallelly in a distributed form.
 Map Reduce makes the use of two functions i.e. Map() and
Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it in the form of groups. Map generates a key-value pair based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
