
BIG DATA ANALYTICS

UNIT-1
What is Big data?

● Wikipedia defines "Big Data" as a collection of data sets so large and complex that it becomes difficult
to process using on-hand database management tools or traditional data processing applications.
● "Big Data" consists of very large volumes of heterogeneous data that is being generated, often, at high
speeds.
What is Big Data?

“Big data is the data characterized by three attributes: volume, velocity, and variety.” — IBM

“Big data is the data characterized by four attributes: volume, velocity, variety, and value.” — Oracle

“Big Data is the frontier of a firm’s ability to store, process, and access all the data it needs to
operate effectively, make decisions, reduce risks, and serve customers.” --- Forrester

“Big Data in general is defined as high volume, velocity and variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight and
decision making.” -- Gartner

“Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To
gain value from this data, you must choose an alternative way to process it.” -- O’Reilly
Data storage capacity and evolution
Data storage examples
The zettabyte era

● In 2016, Cisco Systems stated that the Zettabyte Era was now
reality when global IP traffic reached an estimated 1.2 zettabytes.
● Cisco also provided future predictions of global IP traffic in their
report The Zettabyte Era: Trends and Analysis.
● This report uses current and past global IP traffic statistics to
forecast future trends.
● The report predicts trends between 2016 and 2021.
The zettabyte era

● Some of the predictions for 2021 found in the report:


● Global IP traffic will triple and is estimated to reach 3.3 ZB on a yearly basis
● In 2016 video traffic (e.g. Netflix and YouTube) accounted for 73% of total
traffic. In 2021 this will increase to 82%
● The number of devices connected to IP networks will be more than three
times the global population
● The amount of time it would take for one person to watch the entirety of
video that will traverse global IP networks in one month is 5 million years
● PC traffic will be exceeded by smartphone traffic. PC traffic will account for
25% of total IP traffic while smartphone traffic will be 33%
● There will be a twofold increase in broadband speeds
Introduction
● Big Data is a field dedicated to the analysis, processing, and storage of large collections
of data that frequently originate from different sources.
● Big Data solutions are typically required when traditional data analysis, processing and
storage technologies and techniques are insufficient.
● Big Data addresses distinct requirements, such as the combining of multiple unrelated
datasets, processing of large amounts of unstructured data and harvesting of hidden
information in a time-sensitive manner.
● The boundaries of what constitutes a Big Data problem are also changing due to the
ever-shifting and advancing landscape of software and hardware technology.
● Thirty years ago, one gigabyte of data could amount to a Big Data problem and require
special purpose computing resources. Now, gigabytes of data are commonplace and can
be easily transmitted, processed and stored on consumer-oriented devices
The sources of Big Data
1. Social data:
● Likes, Tweets & Retweets, Comments, Video Uploads, and general media that are uploaded and
shared via the world’s favorite social media platforms.
● This kind of data provides invaluable insights into consumer behavior and sentiment and can be
enormously influential in marketing analytics.
2. Machine data:
● Sensors in medical devices, smart meters, road cameras, satellites, games, and the rapidly growing Internet of Things deliver data of high velocity, volume, variety, and value.
3. Transactional data
● Daily transactions that take place both online and offline.
● Invoices, payment orders, storage records, delivery receipts.
Concepts and Terminology

Datasets

● Collections or groups of related data are generally referred to as datasets.


● Some examples of datasets are:

• tweets stored in a flat file

• a collection of image files in a directory

• an extract of rows from a database table stored in a CSV formatted file

• historical weather observations that are stored as XML files


Data Analysis
● Process of examining data to find facts, relationships, patterns, insights
and/or trends.
● Goal of data analysis is to support better decision making.
● Example: Analysis of ice cream sales data in order to determine how the
number of ice cream cones sold is related to the daily temperature.
● The results of such an analysis would support decisions related to how
much ice cream a store should order in relation to weather forecast
information.
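As a worked illustration of this example (a minimal sketch with hypothetical numbers, not figures from the source), the relationship between daily temperature and cones sold can be summarized with a Pearson correlation coefficient in Java:

// Minimal sketch (hypothetical data): how strongly ice cream sales
// relate to daily temperature, expressed as a Pearson correlation.
public class IceCreamAnalysis {
    public static void main(String[] args) {
        double[] temperature = {21, 24, 27, 30, 33};      // daily temperature in °C
        double[] conesSold   = {110, 135, 160, 210, 240}; // cones sold on those days
        System.out.printf("correlation = %.3f%n", pearson(temperature, conesSold));
    }

    // Pearson correlation: covariance of x and y divided by the product
    // of their standard deviations, computed from running sums.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt(n * sumX2 - sumX * sumX)
                           * Math.sqrt(n * sumY2 - sumY * sumY);
        return numerator / denominator;
    }
}

A correlation close to +1 would support ordering more ice cream on days forecast to be hot.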
Data Analytics

● Data analytics is a broader term that encompasses data analysis.


● Data analytics is a discipline that includes the management of the complete data lifecycle, which
encompasses collecting, cleansing, organizing, storing, analyzing and governing data.
● It includes the development of analysis methods, scientific techniques and automated tools.
● Different kinds of organizations use data analytics tools and techniques in different ways.

• In business-oriented environments, data analytics results can lower operational costs and
facilitate strategic decision-making.

• In the scientific domain, data analytics can help identify the cause of a phenomenon to improve
the accuracy of predictions.

• In service-based environments like public sector organizations, data analytics can help
strengthen the focus on delivering high-quality services by driving down costs.
Data Analytics

There are four general categories of analytics that are distinguished by the results they
produce:

• descriptive analytics

• diagnostic analytics

• predictive analytics

• prescriptive analytics
Descriptive Analytics

Descriptive analytics are carried out to answer questions about events that have already occurred.

Sample questions can include:

• What was the sales volume over the past 12 months?

• What is the number of support calls received as categorized by severity and geographic location?

• What is the monthly commission earned by each sales agent?

It is estimated that 80% of generated analytics results are descriptive in nature.

In terms of value, descriptive analytics provide the least worth and require a relatively basic skillset.
Diagnostic Analytics

● It aims to determine the cause of a phenomenon that occurred in the past using
questions that focus on the reason behind the event.
● The goal of this type of analytics is to determine what information is related to the
phenomenon in order to enable answering questions that seek to determine why
something has occurred.
● Such questions include:
● Why were Q2 sales less than Q1 sales?
● Why have there been more support calls originating from the Eastern region than from
the Western region?
● Why was there an increase in patient re-admission rates over the past three months?
Diagnostic Analytics
● Diagnostic analytics provide more value than descriptive analytics but require
a more advanced skillset.
● Diagnostic analytics usually require collecting data from multiple sources and
storing it in a structure that lends itself to performing drill-down and roll-up
analysis.
● Diagnostic analytics results are viewed via interactive visualization tools that
enable users to identify trends and patterns.
● The executed queries are more complex compared to those of descriptive
analytics and are performed on multidimensional data held in analytic
processing systems.
Drill-down refers to the process of viewing data at a level
of increased detail, while roll-up refers to the process of
viewing data with decreasing detail.
Predictive Analytics
● Predictive analytics are carried out in an attempt to determine the outcome of
an event that might occur in the future.
● Generate future predictions based upon past events.
● Questions are usually formulated using a what-if rationale, such as the
following:
● What are the chances that a customer will default on a loan if they have
missed a monthly payment?
● What will be the patient survival rate if Drug B is administered instead of Drug
A?
● If a customer has purchased Products A and B, what are the chances that
they will also purchase Product C?
Predictive Analytics

● Predictive analytics try to predict the outcomes of events, and


predictions are made based on patterns, trends and exceptions found
in historical and current data.
● This can lead to the identification of both risks and opportunities.
● It provides greater value and requires a more advanced skillset than
both descriptive and diagnostic analytics.
Prescriptive Analytics
● Prescriptive analytics build upon the results of predictive analytics by
prescribing actions that should be taken.
● The focus is not only on which prescribed option is best to follow, but why.
● Sample questions may include:
● Among three drugs, which one provides the best results?
● When is the best time to trade a particular stock?
● Prescriptive analytics provide more value than any other type of analytics and
correspondingly require the most advanced skillset, as well as specialized
software and tools.
● Various outcomes are calculated, and the best course of action for each
outcome is suggested.
Prescriptive Analytics

● This sort of analytics incorporates internal data with external data.


● Internal data might include current and historical sales data, customer
information, product data and business rules.
● External data may include social media data, weather forecasts and
government produced demographic data.
● Prescriptive analytics involve the use of business rules and large amounts of
internal and external data to simulate outcomes and prescribe the best course
of action.
Examples

In health care, all four types can be used. For example:

● Descriptive analytics can be used to determine how contagious a virus is by


examining the rate of positive tests in a specific population over time.
● Diagnostic analytics can be used to diagnose a patient with a particular illness or
injury based on the symptoms they’re experiencing.
● Predictive analytics can be used to forecast the spread of a seasonal disease by
examining case data from previous years.
● Prescriptive analytics can be used to assess a patient’s pre-existing conditions,
determine their risk for developing future conditions, and implement specific
preventative treatment plans with that risk in mind.
Class 3-4: BIG DATA ANALYTICS

12/08/2021
Data Analysis
Data analysis is a process of inspecting, cleansing, transforming,
and modelling (past) data with the goal of discovering useful information, informing
conclusions, and supporting decision-making.

Statistician John Tukey defined data analysis in 1961 as:


"Procedures for analyzing data, techniques for interpreting the results of such
procedures, ways of planning the gathering of data to make its analysis easier,
more precise or more accurate, and all the machinery and results of
(mathematical) statistics which apply to analyzing data.”
● Analytics is commonly used to uncover notable patterns, such as customer preferences, correlations between variables, and trend forecasts. The most common real-life findings produced through analytics are market trend forecasts, customer preferences, and support for effective business decisions.

● Analysis makes it possible to explore valuable insights from the available data by performing various types of data analysis, such as exploratory data analysis, predictive analysis, and inferential analysis. These play a major role in providing a deeper understanding of the data.
Data Analysis vs Analytics
While analytics and analysis are more similar than different, their contrast is in the emphasis of each.
They both refer to an examination of information—but while analysis is the broader and more general
concept, analytics is a more specific reference to the systematic examination of data.

Analysis
● Analysis is the broader and more general concept.
● We do analysis to explain how and/or why something happened.
● Data analysis helps in understanding the data and provides the required insights from the past to understand what has happened so far.
● The most common tools employed in data analysis are Tableau, Excel, SPARK, Google Fusion Tables, NodeXL, etc.

Analytics
● Analytics is a more specific reference to the systematic examination of data.
● We use analytics to explore potential future events.
● Data analytics is the process of exploring the data from the past to make appropriate decisions in the future by using valuable insights.
● The most common tools employed in data analytics are R, Python, SAS, SPARK, Google Analytics, Excel, etc.
Big Data Characteristics
● Most of these data characteristics were initially identified by Doug Laney in early 2001 when he
published an article describing the impact of the volume, velocity and variety of e-commerce data
on enterprise data warehouses.
● There are five Big Data characteristics that can be used to help differentiate data categorized as “Big” from other forms of data: volume, velocity, variety, veracity, and value.
Volume
● The anticipated volume of data that is processed by Big Data solutions is
substantial and ever-growing.
● High data volumes impose distinct data storage and processing demands, as well
as additional data preparation, curation and management processes.
● Typical data sources that are responsible for generating high data volumes can
include:
● Online transactions, such as point-of-sale and banking
● Scientific and research experiments
● Sensors, such as GPS sensors, RFIDs, smart meters and telematics
● Social media, such as Facebook and Twitter
Velocity

● In Big Data environments, data can arrive at fast speeds, and enormous datasets
can accumulate within very short periods of time.
● From an enterprise’s point of view, the velocity of data translates into the amount of
time it takes for the data to be processed once it enters the enterprise’s perimeter.
● Coping with the fast inflow of data requires the enterprise to design highly elastic and
available data processing solutions and corresponding data storage capabilities.
● Depending on the data source, velocity may not always be high.
● For example, MRI scan images are not generated as frequently as log entries from a
high-traffic webserver.
Variety
● Data variety refers to the multiple formats and types of data that need to be
supported by Big Data solutions.
● Data variety brings challenges for enterprises in terms of data integration,
transformation, processing, and storage.
Veracity
● Veracity refers to the quality of data.
● Data that enters Big Data environments needs to be assessed for quality, which can lead
to data processing activities to resolve invalid data and remove noise.
● Data can be part of the signal or noise of a dataset.
● Noise is data that cannot be converted into information and thus has no value, whereas
signals have value and lead to meaningful information.
● Data with a high signal-to-noise ratio has more veracity than data with a lower ratio.
● Data that is acquired in a controlled manner, for example via online customer
registrations, usually contains less noise than data acquired via uncontrolled sources,
such as blog posting.
● The signal-to-noise ratio of data is dependent upon the source of the data and its type.
Value

● Value is defined as the usefulness of data for an enterprise.


● The value characteristic is intuitively related to the veracity characteristic in that the
higher the data quality, the more value it holds for the business.
● Value is also dependent on how long data processing takes because analytics results
have a shelf-life.
● A 20 minute delayed stock quote has little to no value for making a trade compared to a
quote that is 20 milliseconds old.
● Value and time are inversely related.
● The longer it takes for data to be turned into meaningful information, the less value it
has for a business.
Different Types of Data

● The data processed by Big Data solutions can be human-generated or machine-generated, although it is the responsibility of machines to generate the analytic results.

● Human-generated data is the result


of human interaction with systems,
such as online services and digital
devices.
Different Types of Data

● Machine-generated data is produced by software


programs and hardware devices in response to
real-world events.
● For example, a log file captures an
authorization decision made by a security
service, and a point-of-sale system generates
a transaction against inventory to reflect items
purchased by a customer.
● From a hardware perspective, an example of
machine-generated data would be information
conveyed from the numerous sensors in a
cellphone that may be reporting information,
including position and cell tower signal
strength.
Different Types of Data

• Structured data

• Unstructured data

• Semi-structured data
Structured Data

● Structured data conforms to a data model or schema and is often stored in


tabular form.
● It captures relationships between different entities and is therefore most often
stored in a relational database.
● Structured data is frequently generated by enterprise applications and
information systems like ERP and CRM systems.
● Examples of this type of data include banking transactions, invoices, and
customer records.
Unstructured Data
● Data that does not conform to a data model or data
schema is known as unstructured data.
● It is estimated that unstructured data makes up 80% of
the data within any given enterprise.
● Unstructured data has a faster growth rate than
structured data.
● This form of data is either textual or binary and often
conveyed via files that are self-contained and non-
relational.
● A text file may contain the contents of various tweets or
blog postings. Binary files are often media files that
contain image, audio or video data.
● Not-only SQL (NoSQL) database is a non-relational
database that can be used to store unstructured data.
Semi-structured Data

● Semi-structured data has a defined


level of structure and consistency, but
is not relational in nature.
● Semi-structured data is hierarchical or
graph-based.
● This kind of data is commonly stored in
files that contain text.
● Due to the textual nature of this data and its conformance to some level of structure, it is more easily processed than unstructured data.
Metadata
● Metadata provides information about a dataset’s characteristics and structure.
● This type of data is mostly machine-generated and can be appended to data.
● The tracking of metadata is crucial to Big Data processing, storage and
analysis.
● Examples of metadata include:
● XML tags providing the author and creation date of a document.
● Attributes providing the file size and resolution of a digital photograph.
● Big Data solutions rely on metadata, particularly when processing semi-
structured and unstructured data.
Thank you
Hadoop Ecosystem

Unit-I
Big Data Characteristics
● Big Data Characteristics
○ Volume – How much data?
○ Velocity – How fast the data is generated/processed?
○ Variety - The various types of data.
○ Veracity – Quality of data
○ Value -- The usefulness of the data

● Example: Mobile phone data, Sensor Data, Credit card


data, Weather data, Data generated in social media sites (e.g., Facebook, Twitter), Video surveillance data, medical data,
data used for scientific and research experiments, online
transactions, etc.
More V’s in Big Data
● A 5 Vs’ Big Data definition was also proposed by Yuri Demchenko [35] in 2013. He added the
value dimension along with the IBM 4Vs’ definition (see Fig. 3). Since Douglas Laney published
3Vs in 2001, there have been additional “Vs,” even as many as 11 [36].
All these definitions, whether 3Vs, 4Vs, 5Vs, or even 11 Vs, primarily try to articulate aspects of the data itself. Most of them are data-oriented definitions, but they fail to relate Big Data clearly to the essence of Big Data Analytics (BDA). In order to understand the essential meaning, we have to clarify what data is.

Data is everything within the universe; in practice, data is bounded by existing technological capacity. If the technological capacity allows, there is no boundary or limitation for data.
What is Hadoop?
● Hadoop is an open source framework for writing and running distributed applications that
process large amounts of data.
● Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it
is
● Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing
services such as Amazon’s Elastic Compute Cloud (EC2).
● Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the
assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
● Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
● Simple—Hadoop allows users to quickly write efficient parallel code.
Hadoop’s accessibility and simplicity give it an edge over writing and running large
distributed programs. On the other hand, its robustness and scalability make it suitable for even
the most demanding jobs at Yahoo and Facebook. These features make Hadoop popular in both
academia and industry.
What is Hadoop?
A Hadoop cluster is a set of commodity machines networked together in one location.

● A Hadoop cluster has many parallel


machines that store and process large
data sets.
● Client computers send jobs into this
computer cloud and obtain results.
● Data storage and processing all occur
within this “cloud” of machines.
● Different users can submit
computing “jobs” to Hadoop from
individual clients, which can be their
own desktop machines in remote
locations from the Hadoop cluster.
Hadoop | History or Evolution

● Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they both
started to work on Apache Nutch project.
● The Apache Nutch project aimed to build a search engine system that could index one billion pages.
● After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, plus a monthly running cost of approximately $30,000, which was very expensive.
● They realized that their project architecture would not be capable of handling billions of pages on the web.
● They were looking for a feasible solution which can reduce the implementation cost as well
as the problem of storing and processing of large datasets.
Hadoop | History or Evolution
● In 2003, they came across a paper that described the architecture of Google’s distributed file system, called
GFS (Google File System) which was published by Google, for storing the large data sets.
● This paper was just the half solution to their problem.
● In 2004, Google published one more paper on the technique MapReduce, which was the solution of
processing those large datasets.
● This paper was another half solution for Doug Cutting and Mike Cafarella for their Nutch project.
● Both techniques (GFS & MapReduce) existed only as papers from Google; Google did not release implementations of them.
● So, together with Mike Cafarella, he started implementing Google’s techniques (GFS & MapReduce) as
open-source in the Apache Nutch project.
● In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon realized two problems:

(a) Nutch wouldn’t achieve its potential until it ran reliably on the larger clusters

(b) And that was looking impossible with just two people (Doug Cutting & Mike Cafarella).
Hadoop | History or Evolution

● In 2006, Doug Cutting joined Yahoo along with the Nutch project.


● He wanted to provide the world with an open-source, reliable, scalable computing framework, with the help
of Yahoo.
● So, at Yahoo, he first separated the distributed computing parts from Nutch and formed a new project, Hadoop.
● He wanted to make Hadoop in such a way that it can work well on thousands of nodes. So with GFS and
MapReduce, he started to work on Hadoop.
● In 2007, Yahoo successfully tested Hadoop on a 1000 node cluster and started using it.
● In January of 2008, Yahoo released Hadoop as an open source project to the ASF (Apache Software Foundation).
● In July of 2008, Apache Software Foundation successfully tested a 4000 node cluster with Hadoop.
Hadoop | History or Evolution

● In 2009, Hadoop was successfully tested to sort a PB (PetaByte) of data in less than 17
hours for handling billions of searches and indexing millions of web pages.
● Doug Cutting left Yahoo and joined Cloudera to fulfill the challenge of spreading Hadoop
to other industries.
● In December of 2011, Apache Software Foundation released Apache Hadoop version 1.0.
● In Aug 2013, Version 2.0.6 was available.
● And currently, we have Apache Hadoop version 3.0, which was released in December 2017.
Comparing Hadoop with SQL
1. Scale-out instead of scale-up
● Scaling commercial relational databases is expensive.
● Their design is more friendly to scaling up.
● To run a bigger database you need to buy a bigger
machine.
● At some point there won’t be a big enough machine
available for the larger data sets.
● The high-end machines are not cost effective. A
machine with four times the power of a standard PC
costs a lot more than putting four such PCs in a cluster.
Comparing Hadoop with SQL
Hadoop
● For data-intensive workloads, a large number of commodity low-end servers (i.e., the
“scaling out” approach) is preferred over a small number of high-end servers (i.e., the
“scaling up” approach).
● Hadoop is designed to be a scale-out architecture operating on a cluster of commodity
PC machines.
● Adding more resources means adding more machines to the Hadoop cluster.
● Hadoop clusters with tens to hundreds of machines are standard.
SQL (structured query language) is by design targeted at structured data. Many of
Hadoop’s initial applications deal with unstructured data such as text.
From this perspective, Hadoop provides a more general paradigm than SQL.
Comparing Hadoop with SQL
2. Key/value pairs instead of relational tables
SQL
● Data resides in tables having relational structure defined by a schema.
● Many modern applications deal with data types that don’t fit well into this model.
● Text documents, images, and XML files are popular examples. Also, large data sets are
often unstructured or semistructured.
Hadoop
● Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with
the less-structured data types.
● In Hadoop, data can originate in any form, but it eventually transforms into (key/value)
pairs for the processing functions to work on.
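As a minimal sketch of this idea (the log lines and the pairing rule are hypothetical, and plain Java collections stand in for Hadoop's Writable types), raw text records can be mapped into key/value pairs before any processing function runs:

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch: turning unstructured text lines into key/value pairs,
// the basic data unit that Hadoop processing functions operate on.
public class KeyValueSketch {
    public static void main(String[] args) {
        List<String> logLines = List.of(
                "2021-08-12 ERROR disk failure on node07",
                "2021-08-12 INFO  block replicated to node03");

        // Key = log level, value = the raw line (a hypothetical pairing rule).
        List<Map.Entry<String, String>> pairs = logLines.stream()
                .map(line -> new SimpleEntry<String, String>(line.split("\\s+")[1], line))
                .collect(Collectors.toList());

        pairs.forEach(p -> System.out.println(p.getKey() + " -> " + p.getValue()));
    }
}

In a real Hadoop job the same transformation happens inside a Mapper, with the framework distributing the resulting pairs across the cluster.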
Comparing Hadoop with SQL
3. Functional programming (MapReduce) instead of declarative queries (SQL)

SQL

● SQL is fundamentally a high-level declarative language.


● You query data by stating the result you want and let the database engine figure out how to
derive it.
● Under SQL you have query statements

Hadoop

● Under MapReduce you specify the actual steps in processing the data.
● Under MapReduce you have scripts and codes.

MapReduce allows you to process data in a more general fashion than SQL queries. For example,
you can build complex statistical models from your data or reformat your image data. SQL is not
well designed for such tasks.
Comparing Hadoop with SQL
4. Offline batch processing instead of online transactions

● Hadoop is designed for offline processing and analysis of large-scale data.


● It is not suited to random reading and writing of a few records.
● Hadoop is best used as a write-once, read-many-times type of data store.
Comparing Hadoop with SQL
5. NoSQL versus SQL
● SQL databases use structured query language and have a pre-defined schema for
defining and manipulating data.
● NoSQL databases have dynamic schemas for unstructured data, and the data is stored
in many ways. You can use column-oriented, document-oriented, graph-based, or
key-value stores.
Hadoop Ecosystem
● Hadoop Ecosystem is a platform or a suite which provides various services to solve big data problems.
● It includes Apache projects and various commercial tools and solutions.
● There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common.
● Most of the tools or solutions are used to supplement or support these major elements.
● Following are the components that collectively form a Hadoop ecosystem:

● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming-based data processing
● Spark: In-memory data processing
● PIG, HIVE: Query-based processing of data services
● HBase: NoSQL database
● Mahout, Spark MLlib: Machine learning algorithm libraries
● Solr, Lucene: Searching and indexing
● Zookeeper: Managing the cluster
● Oozie: Job scheduling
HDFS
● HDFS is the primary or major component of Hadoop ecosystem
and is responsible for storing large data sets of structured or
unstructured data across various nodes. The backbone of
Hadoop Ecosystem.
● Hadoop employs a master/slave architecture for both distributed
storage and distributed computation. The distributed storage
system is called the Hadoop File System, or HDFS.
HDFS consists of two core components: 1. Name node 2. Data node
● The Name Node is the prime node; it contains metadata (data about data) and requires comparatively fewer resources than the data nodes that store the actual data.
● These data nodes are commodity hardware in the distributed
environment.
Name node
● Hadoop employs a master/slave architecture for both distributed storage and
distributed computation. The distributed storage system is called the Hadoop File
System, or HDFS.
● The NameNode is the master of HDFS that directs the slave DataNode daemons to
perform the low-level I/O tasks.
● The NameNode is the bookkeeper of HDFS; it keeps track of how your files are
broken down into file blocks and which nodes store those blocks.
● NameNode does not store the actual data or the dataset. The data itself is actually
stored in the DataNodes.
● NameNode is so critical to HDFS and when the NameNode is down, HDFS/Hadoop
cluster is inaccessible and considered down.
● NameNode is a single point of failure in Hadoop cluster.
Data node
● The DataNode is responsible for storing the actual data in HDFS. It is also known as a slave.
● The NameNode and DataNodes are in constant communication.
● When a DataNode is down, it does not affect the availability of data or the cluster. NameNode will
arrange for replication for the blocks managed by the DataNode that is not available.
● A DataNode is usually configured with a lot of hard disk space, because the actual data is stored in the DataNode.
● DataNode daemon performs reading and writing HDFS blocks to actual files on the local filesystem.
● When you want to read or write a HDFS file, the file is broken into blocks and the NameNode will tell
your client which DataNode each block resides in.
● Your client communicates directly with the DataNode daemons to process the local files
corresponding to the blocks.
● A DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.
Hadoop FS-Image Editlogs
● FsImage is a file stored on the OS filesystem that contains the complete directory
structure (namespace) of the HDFS with details about the location of the data on
the Data Blocks and which blocks are stored on which node. This file is used by the
NameNode when it is started.
● EditLogs is a transaction log that records the changes in the HDFS file system or any
action performed on the HDFS cluster such as addition of a new block, replication,
deletion etc. In short, it records the changes since the last FsImage was created.

● Checkpointing: the process of merging the content of the most recent fsimage with all edits applied after that fsimage was created, to produce a new fsimage. Checkpointing is triggered automatically by configuration policies or manually by HDFS administration commands.
YARN
YARN (Yet Another Resource Negotiator)
● YARN is the one who helps to manage the resources across the clusters. In short, it performs scheduling and
resource allocation for the Hadoop System.
● Consists of three major components, i.e., Resource Manager, Node Manager, and Application Manager
● Resource manager has the privilege of allocating resources for the applications in a system.
● Resource manager runs on a master daemon and manages the resource allocation in the cluster.
● Node managers work on the allocation of resources such as CPU, memory, bandwidth per machine and later on
acknowledges the resource manager. They run on the slave daemons and are responsible for the execution of
a task on every single Data Node.
● Application manager works as an interface between the resource manager and node manager
● YARN allocates resources in the cluster and manages the applications over Hadoop. It allows data stored in HDFS to be processed by various data processing engines, such as batch processing, stream processing, interactive processing, graph processing, and many more. This increases efficiency.
MapReduce
● By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic to the data and helps to write applications that transform big data sets into manageable ones.
● MapReduce makes use of two functions, i.e., Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it into groups. Map generates key-value pair results which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
Five steps MapReduce programming model

Step 1: Splitting, Step 2: Mapping (distribution), Step 3: Shuffling and sorting, Step 4: Reducing (parallelizing), and Step 5: Aggregating.
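As a concrete sketch of the Map() and Reduce() roles described above, the classic word-count program written against the org.apache.hadoop.mapreduce API is shown below (the job driver and input/output paths are omitted; this is an illustrative skeleton, not code from the source):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Classic word-count sketch: Map() emits (word, 1) pairs,
// Reduce() aggregates the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);        // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                  // aggregate counts per word
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Between the two phases the framework shuffles and sorts the emitted pairs so that all counts for the same word arrive at a single reducer, which corresponds to Steps 3 to 5 above.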
Unit I : Big Data and Hadoop
• Topics covered
Theory
– Introduction to BIG DATA
– Distributed computing
– Distributed File System
– Hadoop Eco System
– Hadoop Distributed file system (DFS)
Practical
– Hadoop installation
– HDFS commands
– Using HDFS commands in JAVA programs
Big Data Analytics

Mark Distribution
• Mid 1 10 marks
• Mid 2 10 marks
• CLA1 5 marks
• CLA2 5 marks
• Lab experiments: 20 marks
• Project: 20 marks
• Final exam 30 marks
Big Data Characteristics
• Big Data Characteristics
– Volume – How much data?
– Velocity – How fast the data is generated/processed?
– Variety - The various types of data.
– Veracity – Quality of data
– Value -- The usefulness of the data

• Example: Mobile phone data, Sensor Data, Credit card data,


Weather data, Data generated in social media sites (e.g., Facebook, Twitter), Video surveillance data, medical data, data
used for scientific and research experiments, online
transactions, etc.
More V’s in Big Data
• A 5 Vs’ Big Data definition was also proposed by Yuri Demchenko [35] in 2013. He added the value dimension along with the IBM 4Vs’ definition (see Fig. 3). Since Douglas Laney published 3Vs in 2001, there have been additional “Vs,” even as many as 11 [36].
• All these definitions, whether 3Vs, 4Vs, 5Vs, or even 11 Vs, primarily try to articulate aspects of the data itself. Most of them are data-oriented definitions, but they fail to relate Big Data clearly to the essence of Big Data Analytics (BDA). In order to understand the essential meaning, we have to clarify what data is.
• Data is everything within the universe; in practice, data is bounded by existing technological capacity. If the technological capacity allows, there is no boundary or limitation for data.
• The advantage of adopting the Hadoop [60] platform is that “Hadoop is a free and open source distributed storage and computational platform. It was created to allow storing and processing large amounts of data using clusters of commodity hardware.”
Big Data Characteristics
• Volume: High data volumes impose
– Distinct storage and processing demands
– Additional data preparation, curation and management
processes,
• Velocity: Data can arrive at fast speeds, and large amounts of data can accumulate within a short period of time
– Highly elastic storage and processing capabilities are needed.
– e.g., per minute: 350,000 tweets, 300 hours of YouTube video, 171 million emails, 330 GB of sensor data from a jet engine
• Variety: Structured data (DBMS), Unstructured data
(Text, Image, Audio and Video) , Semi structured data
(XML)
Big Data Characteristics

• Veracity: Quality of data:


– Invalid data and noise has to be removed (noise is the data
that cannot be converted into information and thus has no
value)
– Data collected from controlled sources (e.g., online transactions) has less noise
– Data collected from uncontrolled sources will have more noise
Big Data Characteristics
• Value: Usefulness of the data for the enterprise.
– Quality impacts value
– Time taken for processing data eg. Stock market data
– Value is affected by
• whether useful attributes are removed during the cleansing process
• whether the right types of questions are asked during data analysis
Data Analysis Vs Data Analytics
• Data Analysis is the process of examining data to find
facts, relationships, patterns, insights and trends
among the data.
• Data Analytics includes development of analysis
methods, scientific methods and automated tools used
to manage the complete data life cycle (collection,
cleansing, organizing, storing, analysing and
governing data)
Data Analysis Vs Data Analytics

• Goal of Data Analysis: To conduct the analysis of the data in such a way that high-quality results are delivered in a timely manner, providing optimal value to the enterprise.
– The overall goal of data analysis is to support better decision making.
Big Data
• Overall goal of data analysis is to support better decision
making
• Uses of large data
– Knowledge (meaningful pattern) obtained by analyzing
data, can improve the efficiency of the business
– Analyzing super market data
• What kinds of items are sold?
– Analyzing weather data
• Precautionary measures can be taken against floods,
Tsunami , etc.
– Analyzing Medical data
• Reasons for epidemic/pandemic diseases (e.g., COVID-19) and cures can be found, cancer cells can be identified, etc.
Different Types of Data Analytics

• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Different Types of Data Analytics

• Descriptive Analytics
– are carried out to answer questions about events that
already have occurred
– Example questions
• Sales volume over past 12 months
• Monthly commission earned by sales agent
• Number of calls received based on severity and
geographic location.
Different Types of Data Analytics

• Diagnostic Analytics
– aims to determine the cause of a phenomenon that
occurred in the past using questions that focus on the
reason behind the event
– Example questions
– Item 2 sales less than item 3 – why?
– More service request calls from western region- why?
– Patient re-admission rates are increasing why?
Different Types of Data Analytics
• Predictive Analytics
– try to predict the outcome of the events and predictions are
made based on patterns, trends and exceptions found in
historical and current data.
– Example questions.
• Chances for a customer default a loan if one month
payment is missed
• Patient survival rate if Drug B is administered instead
of Drug A
• Customer purchase A, B products- chances of
purchasing C.
Different Types of Data Analytics
• Prescriptive Analytics
– build upon the results of predictive analytics by prescribing actions that should be taken
– incorporates internal data (current and historical data,
customer information, product data and business rules) and
external data (social media data, weather forecasts and
government-produced demographic data)
– Example questions:
• Among three drugs, which one provides the best results, and why?
• When is the best time to trade a particular stock?
Big Data
• How to store and process large data?
– Hadoop provides a storage and computing framework for solving Big Data problems.
– Researchers are working to find more efficient solutions.
• Limitations of Centralized systems
– Storage is limited to terabytes
– Uni-processor systems: Multi programming, Time sharing,
Threads - improves throughput
– Multi processor systems: Parallel processing using multiple
CPUs - improves throughput.
• Disadvantage: Scalability – resources cannot be added to handle an increase in load.
Big Data
• Centralized systems
– Data has to be transferred to the system where application
program is getting executed
– Do not have
• Enough storage
• Required computing power to process large data
Big Data

• Multi computer systems


– Scalable: Storage, Computing power can be increased
according to the requirement
– Data-intensive computing: a type of parallel computing in which tasks are executed by transferring them to the systems where the data is available and executing these tasks in parallel.
Fastest Super Computer
• Top two IBM-built supercomputers, Summit and
Sierra
– Installed at the Department of Energy’s Oak Ridge National
Laboratory (ORNL) in Tennessee and Lawrence Livermore
National Laboratory in California
• Both derive their computational power from Power 9
CPUs and NVIDIA V100 GPUs.
• The Summit system delivers a record 148.6 petaflops,
while the number two system, Sierra, delivers 94.6 petaflops.
Fastest Super Computer
• Sunway TaihuLight, a system developed by China’s
National Research Center of Parallel Computer
Engineering & Technology (NRCPC) and installed at the
National Supercomputing Center in Wuxi, previously held the top position.
• With a Linpack performance of 93 petaflops, TaihuLight was for several years the most powerful number-cruncher on the planet.
• The Sunway TaihuLight uses a total of 40,960 Chinese-
designed SW26010 many core 64-bit RISC processors.
– Each processor chip contains 260 processing cores for a total of
10,649,600 CPU cores across the entire system
Distributed System (DS)
• A distributed system is a collection of independent
computers that appears to its users as a single
coherent system (Single System Image) - Loosely
coupled systems
• Computing power and storage are shared among the
users.
Advantages of DS
• Scalability
• Reliability
• Availability
• Communication
Cluster Systems
• Collection of similar computers (nodes) connected by
high speed LAN
– Master node, Computational nodes
• Nodes run similar OS
• Master node controls storage allocation and job
scheduling
• Provides single system image
• Computing power and storage are shared among the
users.
Cluster Computing .. Contd
Cloud Computing
• Cloud is a pool of virtualized computer resources and
provides various services (XaaS) in a scalable manner
on a subscription basis (pay-as-you-go model)
• Services
– IaaS (Infrastructure as a service) – Amazon S3, EC2
– PaaS (Platform as a service) – Google App Engine, MS
Azure
– SaaS (Software as a service) - Google Docs, Facebook
Essential Cloud Characteristics
Different Types of Cloud
• Many types
– Public cloud – available via Internet to anyone willing to
pay
– Private cloud – run by a company for the company’s own
use
– Hybrid cloud – includes both public and private cloud
components
Mainframe and Cloud Comparison

• Mainframe: Multiprocessor System


• Cloud: Multicomputer System
• Both for Rental
• Mainframe: IaaS
• Cloud: XaaS
• Mainframe: Resources (Physical) are fixed
• Cloud: Resources (Virtual) can be added
dynamically
Virtual Machine
• It is a software implementation of a computing
environment in which an operating system or
program can be installed and run.
– VM mimics the behavior of the hardware
• It is widely used to
– to build elastically scalable systems
– to deliver customizable computing environment on
demand
Virtual Machine
• Advantages
– Efficient resource utilization
– Loading different operating systems on single physical
machine
Virtualization
• It is the creation of virtual version of something such
as a processor or a storage device or network
resources, etc.
Hosted VM

e.g., VMware for Windows


Bare-metal VM
VM cloning
• A clone is a copy of an existing VM
– The existing VM is called the parent of the clone.
– Changes made to a clone do not affect the parent VM.
Changes made to the parent VM do not appear in a clone.
• Clones are useful when you must deploy many
identical VMs to a group
VM cloning
• Installing a guest operating system and applications
can be time consuming. With clones, we can make
many copies of a VM from a single installation and
configuration process.
File System
• A file is a named collection of related information that is recorded on secondary storage.
• A file consists of data or a program.
• The file system is one of the important components of the operating system; it provides the mechanism for storing and accessing file contents.
File System
• File system has two parts
– (i) Collection of files (for storing data and programs)
– (ii) Directory Structure - provides information about all
files in the system.
• File systems are responsible for the organization,
storage, retrieval, naming, sharing and protection of
files.
Centralized File System

• File System which stores (maintains) files and


directories in a single computer system is known as
Centralized File System.
• Disadvantage
– Storage is limited
Distributed File System (DFS)
• File system that manages the storage across a network
of machines is called distributed file system (DFS).
– Keeps files and directories in multiple computer systems
• A DFS is a client/server-based application that allows
clients to access and process data stored on the
server(s) as if it were on their own computer.
• A DFS has to provide single system image to the
clients even though data is stored in multiple
computer systems.
Distributed File System
• Advantages of DFS
– Network Transparency: Client does not know about
location of files . Clients can access the files from any
computer connected in the network
– Reliability: By keeping multiple copies for files
(Replication) reliability can be achieved.
– Performance : If blocks of the file are distributed and
replication is applied, then parallel access of data is
possible and so the performance can be improved.
– Scalable: As the storage requirement increases, it is
possible to provide required storage by adding additional
computer systems in the network.
Distributed File System

• Name Server – Global directory - Meta data of files


are stored
• Data Servers
– Data files (objects) are stored
– User programs are executed
Hadoop
• Google published papers on GFS (2003) and MapReduce (2004).
• In 2005, NDFS (the Nutch Distributed File System) and a MapReduce implementation were created (based on GFS and MapReduce) by Doug Cutting.
• Later, in 2006, it was renamed Hadoop at Yahoo!
• In 2008, it became an open source technology and
came under Apache.
Hadoop Eco System
• Apache HBase, a table-oriented database built on top
of Hadoop.
• Apache Hive, a data warehouse built on top of
Hadoop that makes data accessible through an SQL-
like language.
• Apache Sqoop, a tool for transferring data between
Hadoop and other data stores.
• Apache Pig, a platform for creating programs that run
on Hadoop in parallel. Pig Latin is the language used
here.
• ZooKeeper, a tool for configuring and synchronizing
Hadoop clusters.
Hadoop Distributed File System
(HDFS)
• HDFS is an open-source DFS developed under Apache
• Designed to store very large datasets reliably.
• Master-Slave Architecture
– Single Name node acts as the master
– Multiple Data nodes act as workers (slaves)
• 64 MB block size (can be increased)
– Reduces seek time
– Data is transferred as per disk transfer rate (100 MB/s)
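As a small illustrative sketch (not from the source; the 128 MB value is just an example, and the property names are those used in recent Hadoop releases), a client can override the block size and replication factor through the Hadoop Configuration object before creating files:

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: overriding HDFS block size and replication on the client side.
// "dfs.blocksize" and "dfs.replication" are standard HDFS configuration keys.
public class BlockSizeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks instead of the 64 MB default
        conf.setInt("dfs.replication", 3);                 // replication factor
        System.out.println("block size = " + conf.getLong("dfs.blocksize", 0));
    }
}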
HDFS - Architecture
HDFS Architecture - Continued

• Name node
– The name node manages the file system’s metadata (location of the file, size of the file, etc.) and the name space
– Name space is a hierarchy of files and directories (name
space tree) and it is kept in the main memory.
– The mapping of blocks to data nodes is determined by the
name node
– Runs Job Tracker Program
HDFS Architecture - Continued

• File attributes like permissions, modification and access times, etc., are recorded in the inode structures.
• The inode data and the list of blocks belonging to each file (metadata) is called the image.
• The persistent record of the image stored in the local
host’s native file system is called a checkpoint.
• The name node also stores the modification log of the
image called the journal in the local host’s native file
system.
HDFS Architecture - Continued

• For safety purpose, redundant copies of the


checkpoint and journal can be stored in other server
systems connected in the network.
• During restarts, the name node restores the
namespace image from the persistent checkpoint and
then replays the changes from the journal until the
image is up-to-date.
• The locations of block replicas may change over time
and are not part of the persistent checkpoint.
HDFS Architecture - Continued
• Data node
– Runs HDFS client program
– Manages the storage attached to the node; responsible for
storing and retrieving file blocks from the local file system
and from the remote data node
– Executes the user application programs. These programs
can access the HDFS using the HDFS client
– Executes Map & Reduce tasks
– Runs Task Tracker program
Rack organization in Hadoop
HDFS - Contd

• Characteristics of HDFS
– The block replication factor is set by the user and is 3 by default
– Stores one replica on the same node where the write operation is requested and one on a different node in the same rack (Why?)
• And one replica on a node in a different rack
• Heartbeat Message (HM)
– Data node sends periodic HM to the name node
– Receipt of a HM indicates that the data node is functioning
properly
HDFS - Contd

• Heartbeat Message (HM)


– An HM is generated every 3 seconds
– If the name node does not receive an HM from a DN within 10 minutes, the DN is considered to be not alive.
– HM carries information about total storage capacity,
fraction of storage in use, number of data transfers in
progress (used for NameNode’s space allocation and load
balancing decisions)
HDFS - Contd
• The name node does not directly call data nodes. It
uses replies to heart beats to send instructions to the
Data nodes. (The name node can send the following
commands)
– Replicate blocks to other data nodes
– Send an immediate block report
– Remove local block replicas
• The name node is a multi-threaded system and it can
process thousands of heartbeats per second, without
affecting other name node operations.
HDFS - Contd
• Block Report Message (BRM)
– Data node sends periodic BRM to the Name node
– Each BRM contains a list of all blocks (list contains the
block ID, generation stamp, and length of each block) on
the data node
– The first BR is sent immediately after data node registration.
– Subsequent BRs are sent every hour, which provides an up-to-date view of block replica locations in the cluster.
– The name node decides whether or not to create a replica for a data block based on the BRM (How?)
HDFS Continued
• User applications access the file system using HDFS
client.
• HDFS supports operations to read, write and delete
files, and operations to create and delete directories.
Reading from a File
• HDFS client sends the read request to Name Node
• Name Node returns the addresses of a set of Data
Nodes containing replicas for the requested file
• The first block of the file can be read by calling the read function and giving the address of the closest Data Node.
• The same process is repeated for all blocks of the
requested file
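The same read path can be driven from a Java program through the HDFS client API. The sketch below is illustrative only: the namenode URI and file path are placeholder values, and the block-location lookups against the Name Node happen inside the client library rather than in user code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: reading an HDFS file with the Java HDFS client.
// The client asks the Name Node for block locations and then streams
// the blocks from the closest Data Nodes; that detail is hidden here.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}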
Writing to a File
• Default replication factor is 3 for HDFS. Applications
can reset the replication factor.
• Write function is used for writing into a file. Assume
that ‘n’ blocks have to be written.
Writing to a File

• (First) Block is written into the local data node.


• Name node is contacted to get a list of suitable
data nodes (G) to store replicas of the (first)
block. Then the first block is copied into one
data node from G and then that block is
forwarded to another data node specified in G.
• This process is repeated for the remaining set
of blocks (n-1) of the file.
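A corresponding write from a Java client is sketched below (again with a placeholder namenode URI and path; the replication pipeline through the data nodes in G is set up by the client library, not by user code).

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: writing an HDFS file with the Java HDFS client.
// The client writes each block to the first Data Node, which forwards it
// along a pipeline of Data Nodes until the replication factor is met.
public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder address
        conf.set("dfs.replication", "3");                    // default replication factor

        Path out = new Path("/data/output.txt");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream stream = fs.create(out, true);   // overwrite if present
             Writer writer = new OutputStreamWriter(stream, StandardCharsets.UTF_8)) {
            writer.write("hello hdfs\n");
        }
    }
}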
Concurrent File Access
• HDFS supports the write-once, read-many concept
• Multiple reads are supported
• Concurrent writes are not allowed in HDFS
• Only the append operation is allowed in HDFS; modifying existing information is not allowed.
THANK YOU
