1. Introduction to Big Data and Hadoop: Introduction to Big Data, Big Data characteristics, Types of Big Data, Traditional vs. Big Data, Big Data Applications. Hadoop architecture: HDFS, YARN 2, YARN Daemons. Hadoop Ecosystem. Self-Learning Topics: Yet Another Resource Negotiator (YARN) 1.X
2. HDFS and MapReduce. HDFS: HDFS architecture, Features of HDFS, Rack Awareness, HDFS Federation.
MapReduce: The Map Task, The Reduce Task, Grouping by Key, Partitioners and Combiners, Details of MapReduce Execution. Algorithms Using MapReduce: Matrix and Vector Multiplication by MapReduce, Computing Selection and Projection by MapReduce, Computing Grouping and Aggregation by MapReduce. Self-Learning Topics: Concept of Sorting and Natural Joins
3. NoSQL: Introduction to NoSQL, NoSQL business drivers. NoSQL data architecture patterns: key-value stores, column family stores, graph stores, document stores. NoSQL to manage big data: analyzing big data with shared-nothing architecture, choosing distribution: master-slave vs. peer-to-peer. HBase overview, HBase data model, read/write architecture. Self-Learning Topics: Cassandra Case Study
4. Hadoop Ecosystem: HIVE and PIG. HIVE: background, architecture, warehouse directory and metastore, Hive query language, loading data into tables, Hive built-in functions, joins in Hive, partitioning. HiveQL: querying data, sorting and aggregation. PIG: background, architecture, Pig Latin basics, Pig execution modes, Pig processing – loading and transforming data, Pig built-in functions, filtering, grouping, sorting data, installation of Pig and Pig Latin commands. Self-Learning Topics: Cloudera IMPALA
5. Apache Kafka: Kafka Fundamentals, Kafka architecture, Case Study: Streaming real-time data (Read Twitter Feeds and Extract the Hashtags). Apache Spark: Spark Basics, Working with RDDs in Spark, Spark Framework, Aggregating Data with Pair RDDs, Writing and Deploying Spark Applications, Spark SQL and DataFrames. Self-Learning Topics: KMeans and PageRank in Apache Spark
6. Data Visualization: Explanation of data visualization, Challenges of big data visualization,
Approaches to big data visualization, D3 and big data, Getting started with D3, Another twist on bar
chart visualizations, Tableau as a Visualization tool, Dashboards for Big Data - Tableau. Self-
Learning Topics: Splunk via web Interface.
1. Introduction to Big Data
Big Data is a collection of data that is huge in volume and keeps growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently.
Big data refers to massive and complex data sets and everything around them: huge quantities of data, data management capabilities, social media analytics, and real-time data.
Big data analytics is the process of examining these large amounts of heterogeneous digital data.
Big data concerns data volumes and large data sets measured in terms of terabytes or petabytes. This phenomenon is called Big Data.
Following are some Big Data examples:

Stock Exchange: The New York Stock Exchange generates about one terabyte of new trade data per day.

Social Media: Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.

https://www.youtube.com/watch?v=bAyrObl7TYE
2. Big Data characteristics
 Big data refers to the massive data sets that are collected from a variety of data sources for business needs, to reveal new insights for optimized decision making.
 "Big data" is a field that deals with ways to analyze and systematically extract information from data sets that are too large or complex for traditional techniques.
 Big data generates value from the storage and processing of digital data that cannot be analyzed by traditional computing techniques.
 It is the result of various trends such as the cloud, increased computing resources, and data generated by mobile computing, social networks, sensors, web applications, etc.
5 V’s of BIG DATA

Volume

The name Big Data itself relates to enormous size. Big Data refers to the vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.

Facebook, for example, generates approximately a billion messages per day, the "Like" button is recorded around 4.5 billion times, and more than 350 million new posts are uploaded each day. Big data technologies are designed to handle such large amounts of data.

Variety

Big Data can be structured, unstructured, or semi-structured, collected from many different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:

a. Structured data: follows a fixed schema with all the required columns and is stored in tabular form, typically in a relational database management system.
b. Semi-structured data: the schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email. It carries some organizational structure (tags or key-value markers) but does not fit neatly into relational tables.

c. Unstructured data: unstructured files such as log files, audio files, and image files. Some organizations have a lot of such data available but do not know how to derive value from it, since the data is raw.

Veracity

Veracity means how reliable the data is. Because data arrives from many sources in inconsistent, incomplete, or ambiguous forms, it has to be filtered, translated, and validated; veracity is about being able to handle and manage such data efficiently, which is essential for business development.

For example, Facebook posts with hashtags may be noisy, abbreviated, or ambiguous.


Value

Value is an essential characteristic of big data: what matters is not the raw data that we process or store, but the valuable and reliable information we can extract from the data we store, process, and analyze.

Velocity

Velocity plays an important role compared to the other characteristics. Velocity is the speed at which data is created, often in real time. It covers the rate at which incoming data sets arrive and are linked, the rate of change, and bursts of activity. A primary requirement of Big Data systems is to make this rapidly arriving data available quickly.

Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

3. Types of Big Data


Big Data can be found in three forms:
Structured: Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
Ex: Data in the form of tables, Excel sheets, etc.
Unstructured: Any data with unknown form or structure is classified as unstructured data. In addition to being huge in size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
Example: The output returned by 'Google Search'.
Semi-structured: Semi-structured data can contain both forms of data. It may appear structured in form, but it is not actually defined by, e.g., a table definition as in a relational DBMS.
Example: Data represented in an XML file.
4. Traditional vs. Big Data
1. Traditional data: Traditional data is the structured data that is maintained by all types of businesses, from very small to big organizations. In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format or fields in a file. Structured Query Language (SQL) is used for managing and accessing the data.
2. Big data: We can consider big data an upper version of traditional data. Big data deals with data sets that are too large or complex to manage with traditional data-processing application software. It deals with large volumes of structured, semi-structured, and unstructured data. Volume, Velocity, Variety, Veracity, and Value are the 5 V characteristics of big data. Big data does not only refer to a large amount of data; it refers to extracting meaningful information by analyzing huge amounts of complex data sets.
Traditional Data | Big Data
Traditional data is generated at the enterprise level. | Big data is generated outside the enterprise level.
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes.
A traditional database system deals with structured data. | A big data system deals with structured, semi-structured, and unstructured data.
Traditional data is generated per hour or per day or more. | Big data is generated far more frequently, mainly per second.
The data source is centralized and it is managed in centralized form. | The data source is distributed and it is managed in distributed form.
Data integration is very easy. | Data integration is very difficult.
A normal system configuration is capable of processing traditional data. | A high-end system configuration is required to process big data.
The size of the data is very small. | The size is much larger than traditional data.
Traditional database tools are sufficient to perform any database operation. | Special kinds of database tools are required to perform any database/schema-based operation.
Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
Its data model is strict schema based and it is static. | Its data model is a flat schema and it is dynamic.
Traditional data is stable, with known inter-relationships. | Big data is not stable, with unknown relationships.
Traditional data is in a manageable volume. | Big data is in huge volume, which becomes unmanageable.
It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.

5. Big Data Applications


1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like Amazon, Walmart, Big Bazaar, etc.) the management team keeps data on customers' spending habits (on which products customers spend, which brands they prefer, how frequently they spend), shopping behavior, and customers' most-liked products (so that those products can be kept in the store). Based on which products are searched for or sold most, the production/procurement rate of those products is adjusted.
The banking sector uses customers' spending-behavior data to offer a particular customer a discount or cashback on a product they like when it is bought with the bank's credit or debit card. In this way, banks can send the right offer to the right person at the right time.
2. Recommendation: By tracking customer spending habits and shopping behavior, big retail stores provide recommendations to the customer. E-commerce sites like Amazon, Walmart, and Flipkart do product recommendation: they track what products a customer searches for, and based on that data they recommend similar products to that customer.
YouTube also recommends videos based on the types of videos a user has previously liked or watched. Based on the content of the video the user is watching, relevant advertisements are shown while the video plays. For example, if someone is watching a tutorial video on big data, an advertisement for another big data course may be shown during that video.

3. Smart Traffic System: Data about traffic conditions on different roads is collected from cameras placed beside the roads and at the entry and exit points of the city, and from GPS devices placed in vehicles (Ola and Uber cabs, etc.). All this data is analyzed, and jam-free or less congested, faster routes are recommended. In this way a smart traffic system can be built in a city through big data analysis. A further benefit is that fuel consumption can be reduced.

4. Secure Air Traffic System: Sensors are present at various parts of an aircraft (such as the propellers). These sensors capture data like flight speed, moisture, temperature, and other environmental conditions. Based on the analysis of this data, environmental parameters within the aircraft are set up and adjusted.
By analyzing the machine-generated data from flights, it can also be estimated how long the machine can operate flawlessly and when it needs to be replaced or repaired.
5. Auto-Driving Car: Big data analysis helps drive a car without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding vehicles, obstacles, and the distance to them. This data is analyzed, and various calculations (how much to turn, what the speed should be, when to stop, etc.) are carried out. These calculations allow actions to be taken automatically.
6. Virtual Personal Assistant Tool: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, Google Assistant on Android) answer the various questions asked by users. The tool tracks the user's location, their local time, the season, and other data related to the question asked, and analyzes all of this data to provide an answer.
For example, if a user asks "Do I need to take an umbrella?", the tool collects data such as the user's location and the season and weather conditions at that location, analyzes it to determine whether there is a chance of rain, and then provides the answer.
7. Education Sector: Organizations that run online educational courses use big data to find candidates interested in a course. If someone searches YouTube for tutorial videos on a subject, online or offline course providers for that subject send that person online advertisements about their course.
8. Energy Sector: Smart electric meters read the power consumed every 15 minutes and send the readings to a server, where the data is analyzed to estimate the times of day when the power load is lowest across the city. Manufacturing units and householders are then advised to run their heavy machines at night, when the power load is lower, so they enjoy a lower electricity bill.
9. Media and Entertainment Sector: Media and entertainment service providers like Netflix, Amazon Prime, and Spotify analyze data collected from their users. Data such as what types of videos and music users watch or listen to most and how long users spend on the site is collected and analyzed to set the next business strategy.
6. Hadoop architecture: HDFS, YARN 2, YARN Daemons

Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that
HDFS was developed. It states that the files will be broken into blocks and stored in nodes over the
distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and to manage the cluster.
3. MapReduce: This is a framework which helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (a minimal word-count sketch is shown after this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
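
As a hedged illustration of the key-value model described above, here is a minimal word-count sketch using the Hadoop MapReduce Java API. It follows the standard WordCount pattern; the class name and the input/output paths passed on the command line are illustrative, and it assumes the hadoop-mapreduce client libraries are on the classpath.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: converts each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);          // emit (word, 1)
      }
    }
  }

  // Reduce task: sums the counts grouped by key (the word).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);          // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner applies the reducer logic locally on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}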

Hadoop Architecture:

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job
Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode and
TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a master/slave architecture: a single NameNode performs the role of master, and multiple DataNodes perform the role of slaves.

Both the NameNode and the DataNodes are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.
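
A small, hedged sketch of how a Java client application can talk to HDFS through the FileSystem API. The file path and replication value are hypothetical, and it assumes a Hadoop client configuration (core-site.xml / hdfs-site.xml) and the hadoop-client library are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/notes.txt");   // hypothetical HDFS path

    // Write a small file: the NameNode records the metadata,
    // while the DataNodes store the actual blocks.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read it back through the same API.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    // The replication factor is configurable per file (the default is 3).
    fs.setReplication(file, (short) 2);

    fs.close();
  }
}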

NameNode

o It is a single master server existing in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming, and closing files.
o It simplifies the architecture of the system.

DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker

o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by consulting the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker

o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.

MapReduce Layer
MapReduce processing comes into play when a client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.

Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost-effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node goes down or some other network failure happens, Hadoop takes another copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.

Hadoop 2.X Architecture


HDFS

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications.

It is cost-effective as it uses commodity hardware. It involves the concepts of blocks, DataNodes, and the NameNode.

Where to use HDFS


o Very Large Files: Files should be of hundreds of megabytes, gigabytes, or more.
o Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.

Architecture

 The Hadoop Distributed File System (HDFS) allows applications to run across multiple servers. HDFS
is highly fault tolerant, runs on low-cost hardware, and provides high-throughput access to data.
 Java-based scalable system that stores data across multiple machines without prior organization.
 Data in a Hadoop cluster is broken into smaller pieces called blocks, and then distributed
throughout the cluster.
 Blocks, and copies of blocks, are stored on other servers in the Hadoop cluster.
 That is, an individual file is stored as smaller blocks that are replicated across multiple servers in the
cluster.
 An HDFS cluster has two types of nodes: the NameNode (master) and DataNodes (workers).
Data Node
Each HDFS cluster has a number of DataNodes. DataNodes manage the storage that is attached to the nodes on which they run.
When a file is split into blocks, the blocks are stored in a set of DataNodes that are spread throughout the cluster.

DataNodes are responsible for serving read and write requests from the clients of the file system, and also handle block creation, deletion, and replication.
DataNodes store and retrieve the blocks when told to by clients or the NameNode, and they report back to the NameNode periodically with the list of blocks they are storing.
NameNode

 An HDFS cluster supports two NameNodes, an active NameNode and a standby NameNode, which is a common setup for high availability.
 The NameNode manages the file system namespace, maintaining the file system tree and the metadata for all files and directories in the tree.
 The NameNode regulates access to files by clients and tracks all data files in HDFS.
 The NameNode knows the DataNodes on which all blocks for a given file are located.
 The NameNode determines the mapping of blocks to DataNodes, and handles operations such as opening, closing, and renaming files and directories.
What are the differences between the NameNode and the Secondary NameNode?
 The NameNode is the master daemon which maintains and manages the DataNodes.
 It regularly receives a heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
 The NameNode stores the state of the HDFS file system in a file called the FsImage.
 In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
 It stores the metadata of all the files stored in HDFS, e.g. the location of the stored blocks, the size of the files, permissions, hierarchy, etc.
 It maintains two files:
 FsImage: contains the complete state of the file system namespace since the start of the NameNode.
 EditLogs: contains all the recent modifications made to the file system with respect to the most recent FsImage.
 Any changes that you make in HDFS are never logged directly into the FsImage; instead, they are logged into the EditLogs.
 On startup, the NameNode reads the FsImage file, then replays the EditLogs, and so rebuilds its in-memory state.
 The Secondary NameNode is a separate helper daemon (not a file) that periodically merges the FsImage and the EditLogs and keeps the EditLogs size within a limit.
 It is usually run on a different machine than the primary NameNode, since its memory requirements are of the same order as the primary NameNode's.
 The Secondary NameNode exists to speed up the NameNode's recovery and memory-update process, since replaying a very long list of small changes directly on the NameNode consumes a lot of time and is not efficient.
 The Secondary NameNode periodically reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk (the file system) as a checkpoint.
 It is responsible for combining the EditLogs with the FsImage from the NameNode.
7. Hadoop Ecosystem
The Hadoop Ecosystem consists of different services deployed by various enterprises to work with a variety of data.

Each of the Hadoop Ecosystem components is developed to deliver an explicit function, and each has its own developer community and individual release cycle.

 Hadoop Distributed File System (HDFS)

 MapReduce
 Yarn
 Hive
 Pig
 Mahout
 Hbase
 Oozie
 Sqoop
 Flume
 Ambari
 Apache Drill
 Zookeeper

Hive:

 Hive is a data warehouse project built on top of Apache Hadoop which provides data query and analysis.
 It has a language of its own called HQL, or Hive Query Language.
 Hive automatically translates HQL queries into the corresponding MapReduce jobs.
 The main parts of Hive are:
 MetaStore – stores the metadata.
 Driver – manages the lifecycle of an HQL statement.
 Query compiler – compiles HQL into a DAG, i.e. Directed Acyclic Graph.
 Hive server – provides a JDBC/ODBC interface for clients.

 Facebook designed Hive for people who are comfortable with SQL.
 It has two basic components – the Hive command line and JDBC/ODBC drivers.
 The Hive command line is an interface for executing HQL commands.
 The JDBC and ODBC drivers establish the connection with the data storage (a minimal JDBC sketch is shown after this list).
 Hive is highly scalable.
 It can handle both types of workloads, i.e. batch processing and interactive processing.
 It supports the native data types of SQL.
 Hive provides many pre-defined functions for analysis.
 But you can also define your own custom functions, called UDFs or user-defined functions.
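
A minimal, hedged sketch of querying Hive over JDBC. It assumes a HiveServer2 instance reachable at localhost:10000 and the hive-jdbc driver on the classpath; the user name, table, and query are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Connect to the default database; host, port and credentials are assumptions.
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hiveuser", "");
         Statement stmt = con.createStatement()) {

      // Hive translates the HQL into MapReduce (or Tez/Spark) jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT category, COUNT(*) FROM products GROUP BY category");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}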
Pig

 Pig is a high-level platform with an SQL-like scripting language for querying and analyzing data stored in HDFS.
 Yahoo was the original creator of Pig.
 It uses the Pig Latin language.
 It loads the data, applies a filter to it, and dumps or stores the data in the required format (see the Java sketch at the end of this Pig section).
 Pig also includes a runtime environment, called Pig Runtime, in which Pig Latin scripts are executed (comparable to the JVM for Java programs).

The features of Pig are as follows:

 Extensibility – for carrying out special-purpose processing, users can create their own custom functions.
 Optimization opportunities – Pig automatically optimizes the query, allowing users to focus on semantics rather than efficiency.
 Handles all kinds of data – Pig analyzes both structured as well as unstructured data.
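
A hedged sketch of the load-filter-store flow mentioned above, running Pig Latin from Java through the PigServer class. The file names and field names are hypothetical, and local execution mode is assumed (MAPREDUCE mode would run the same script on the cluster).

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    // Local mode runs against the local file system; ExecType.MAPREDUCE targets the cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Load the data, apply a filter, and store the result: the basic Pig processing pattern.
    pig.registerQuery("people = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);");
    pig.registerQuery("adults = FILTER people BY age >= 18;");
    pig.store("adults", "adults_out");
  }
}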

Hbase

HBase is a NoSQL ("non-SQL" or "non-relational") database built on top of HDFS.

Features of HBase:

 It is an open-source, non-relational, distributed, versioned, column-oriented, multidimensional storage system designed for high performance and high availability.
 Like an RDBMS, data in HBase is organized in tables, but it supports a very loose schema and does not provide joins, a query language, or SQL.
 It imitates Google's Bigtable and is written in Java.
 It provides real-time read, write, update, and delete operations on large datasets (a minimal client sketch is shown after this list).
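
A minimal, hedged sketch of these read/write operations using the HBase Java client API. The table name, column family, and values are hypothetical, and it assumes an HBase cluster configuration (hbase-site.xml) and the hbase-client library are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Write: put one cell (column family "info", column "name") for row key "row1".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read: fetch the same row back by its row key.
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}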

Its various components are as follows:

 HBase Master
 Region Server
HBase Master

 The HBase Master (HMaster) performs the following functions:
 Maintains and monitors the region servers in the cluster.
 Performs administration of the database.
 Controls the failover.
 HMaster handles DDL operations (create and delete).

Region Server

 The Region Server is a process which handles read, write, update, and delete requests from clients.
 It runs on every node in the Hadoop cluster, that is, on each HDFS DataNode.
 HBase is a column-oriented database management system.
 It runs on top of HDFS.
 It suits sparse data sets, which are common in Big Data use cases.
 HBase supports writing applications in Apache Avro, REST, and Thrift.
 Apache HBase has low-latency storage. (Latency indicates how long it takes for packets to reach their destination.)
 Enterprises use this for real-time analysis.
 HBase is designed to contain many tables, and each of these tables must have a primary key (the row key).

MemStore: HBase's implementation of an in-memory data cache; it helps increase performance by serving as much data as possible directly from memory.

WAL: The write-ahead log records all changes to the data, which is useful for recovering everything after a server crash. If writing the record to the WAL fails, the whole operation must be considered a failure.

HFile: A specialized HDFS file format for HBase. The HFile implementation in a region server is responsible for reading and writing HFiles to and from HDFS.
Zookeeper: A distributed HBase instance depends on a running ZooKeeper cluster.

All participating nodes and clients must be able to access the running ZooKeeper instances.

By default, HBase manages a ZooKeeper cluster, starting and stopping the ZooKeeper processes as part of the HBase start and stop process.

8. Self-Learning Topics: Yet Another Resource Negotiator (YARN) 1.X


https://www.analyticsvidhya.com/blog/2022/01/yarn-yet-another-resource-negotiator/
