
Big Data and Hadoop Developer

Introduction to Big Data and Hadoop

Copyright 2015, Revert Technology Pvt. Ltd. All rights reserved.

Objectives


By the end of this lesson, you will be able to:

Identify the need for Big Data

Explain the concept of Big Data

Describe the basics of Hadoop

Explain the benefits of Hadoop


Data Explosion


Over 2.5 exabytes (2.5 billion gigabytes) of data is generated every day.

Following are some of the sources of the huge volume of data:

A typical, large stock exchange captures more than 1 TB of data every day.

There are around 5 billion mobile phones (including 1.75 billion smartphones) in the world.

YouTube users upload more than 48 hours of video every minute.

Large social networks such as Twitter and Facebook capture more than 10 TB of data daily.

There are more than 30 million networked sensors in the world.


Types of Data


The following three types of data can be identified; a small illustrative example follows the list:

Structured data:
Data represented in a fixed, tabular format
E.g.: Databases
Semi-structured data:
Data that does not conform to a formal, tabular data model but still carries tags or markers that separate its elements
E.g.: XML files
Unstructured data:
Data with no pre-defined data model at all
E.g.: Text files
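To make the distinction concrete, the short Java sketch below shows the same record in each of the three forms. The customer record, field names, and values are invented purely for illustration.

// Illustrative only: one hypothetical customer record in three representations.
public class DataTypeExamples {

    // Structured: a row in a relational table with a fixed schema.
    static final String STRUCTURED =
        "INSERT INTO customers (id, name, city) VALUES (42, 'Asha Rao', 'Pune');";

    // Semi-structured: tagged and self-describing, but not bound to a rigid tabular schema.
    static final String SEMI_STRUCTURED =
        "<customer id=\"42\"><name>Asha Rao</name><city>Pune</city></customer>";

    // Unstructured: free text with no pre-defined data model; meaning must be extracted by analysis.
    static final String UNSTRUCTURED =
        "Asha Rao from Pune called support on Tuesday about a delayed delivery.";

    public static void main(String[] args) {
        System.out.println(STRUCTURED);
        System.out.println(SEMI_STRUCTURED);
        System.out.println(UNSTRUCTURED);
    }
}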


Need for Big Data


Following are the reasons why Big Data is needed:

90% of the data in the world today has been created in the last two years alone.

80% of the data is unstructured or exists in widely varying structures, which are difficult to analyze.

Structured formats have some limitations with respect to handling large quantities of data.

It is difficult to integrate information distributed across multiple systems.

Most business users do not know what should be analyzed.

Potentially valuable data is dormant or discarded.

It is too expensive to justify the integration of large volumes of unstructured data.

A lot of information has a short, useful lifespan.

Context adds meaning to the existing information.



Data: The Most Valuable Resource


"Data is the new oil." (Clive Humby, CNBC)

"In its raw form, oil has little value. Once processed and refined, it helps power the world." (Ann Winblad)


Big Data and Its Sources


Big Data is an all-encompassing term for any collection of data sets so large and complex that they become difficult to process using on-hand data management tools or traditional data processing applications.
The sources of Big Data are:
web logs;
sensor networks;
social media;
internet text and documents;
internet pages;
search index data;
atmospheric science, astronomy, biochemical and medical records;
scientific research;
military surveillance; and
photography archives.

Three Characteristics of Big Data


Big Data has three characteristics: variety, velocity, and volume.

Variety encompasses managing the complexity of data held in many different structures, ranging from relational data to logs and raw text. Velocity covers responding to data that arrives and changes at ever-increasing speed, such as streaming data. Volume refers to the sheer scale of data that must be stored and processed.

Characteristics of Big Data Technology


Following are the characteristics of Big Data technology:


(Chart: data volume projected to grow roughly 50x between 2015 and 2024.)

Cost-efficiently processes the growing volume of data

Responds to the increasing velocity of data

Collectively analyzes the widening variety of data

Examples:
Turned 12 terabytes of Tweets created each day into improved product sentiment analysis
Converted 350 billion annual meter readings to better predict power consumption


Characteristics of Big Data Technology


Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.
Source: Gartner


Appeal of Big Data Technology


Big Data technology is appealing for the following reasons:

It helps to manage and process huge amounts of data cost-efficiently.

It analyzes data in its native form, which may be unstructured, structured, or streaming.

It captures data from fast-moving events in real time.

It can handle the failure of isolated nodes and of the tasks assigned to such nodes.

It can turn data into actionable insights.

(Diagram: data flowing in from social media, the web, billing and ERP systems, machine data, and network elements.)


Leveraging Multiple Sources of Data


Big Data technology enables IT to leverage multiple sources of data. Following are some of these sources and their characteristics:

Application data: high volume; structured; high throughput

Machine data: high velocity; semi-structured; ingested at a high speed

Enterprise data: variety; highly unstructured; veracity

Social data: variety; highly unstructured; high volume

Traditional IT Analytics Approach


The traditional IT analytics approach imposes the following requirements, each of which is challenged in practice:

Requirements:
The business team needs to define questions before IT development.
They need to define the data sources and structures.

Challenging factors:
The requirements are iterative and volatile.
The data sources keep changing.


Traditional IT Analytics Approach


In a typical scenario of traditional IT systems development, the requirements are defined, followed by
solution design and build. Once the solution is implemented, queries are executed. If there are new
requirements or queries, the system is redesigned and rebuilt.
(Cycle: define requirements → design solution → execute queries → redesign and rebuild for new requirements.)


Big Data Technology: Platform for Discovery and Exploration


Following are the requirements for using Big Data technology as a platform for discovery and exploration, and the challenges it overcomes:

Requirements:
The business team needs to define the data sources.
They need to establish the hypothesis.

Challenges overcome by Big Data:
It enables explorative analysis.
It integrates data systems and sources as required.


Big Data Technology: Platform for Discovery and Exploration


The following cycle illustrates how IT systems are built with the help of Big Data technology:

(Cycle: identify data sources → create a platform for creative exploration of the available data and content → determine the questions to ask and test the hypothesis → new questions lead to the addition of data sources and further integration.)


Big Data Technology: Capabilities


Following are the capabilities of Big Data technology:

Understand and navigate Big Data sources

Tolerate faults and exceptions

Analyze unstructured data

Manage and store a huge volume of a variety of data

Process data in a reasonable time

Ingest data at a high speed


Big Data: Use Cases


The use cases of Big Data and Hadoop are given below.


Handling Limitations of Big Data


Following are the challenges that need to be addressed by Big Data technology:

How to combine data accumulated from all systems

How to handle system uptime and downtime

Using commodity hardware for data storage and analysis

Analyzing data across different machines

Merging of data

Maintaining a copy of the same data across clusters (a sketch of how HDFS replication handles this follows the list)
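As a sketch of how the last two challenges are typically handled, HDFS keeps redundant copies (replicas) of every block on commodity machines. The snippet below assumes a reachable Hadoop cluster whose settings are available on the classpath; the file path /data/events.log is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        conf.set("dfs.replication", "3");           // default copy count for files created by this client

        try (FileSystem fs = FileSystem.get(conf)) {
            // Raise the replication factor of an existing (hypothetical) file to three copies,
            // so the data survives the loss of individual commodity nodes.
            boolean scheduled = fs.setReplication(new Path("/data/events.log"), (short) 3);
            System.out.println("Replication change scheduled: " + scheduled);
        }
    }
}

Because the framework re-replicates blocks automatically when a node fails, a copy of the same data is maintained across the cluster without manual intervention.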


Introduction to Hadoop


Following are the facts related to Hadoop and why it is required:

What is Hadoop?
A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment
Based on the Google File System (GFS)

Why Hadoop?
Runs a number of applications on distributed systems with thousands of nodes involving petabytes of data
Has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes (a minimal read/write example follows)
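As a minimal sketch of working with HDFS from Java, the program below writes a small file into the distributed file system and reads it back. The NameNode address hdfs://localhost:9000 and the path /tmp/hello.txt are placeholder values for a single-node setup; substitute your own cluster's values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the (placeholder) NameNode; block placement and replication are handled for us.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write a small file into HDFS, overwriting it if it already exists.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Read the same file back from the cluster.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}

Because HDFS hides block placement and replication behind this API, the same code runs unchanged whether the cluster has one node or thousands.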


History and Milestones of Hadoop


Hadoop originated from Nutch, an open-source search-engine project, and works over distributed network nodes.

(Timeline: Hadoop milestones.)


Organizations Using Hadoop


The following shows how various organizations use Hadoop (organization, cluster specifications, and uses):

A9.com (Amazon)
Cluster specifications: Clusters vary from 1 to 100 nodes.
Uses: Amazon's product search indices are built using this program; it also processes millions of sessions daily for analytics.

Yahoo
Cluster specifications: More than 100,000 CPUs in approximately 20,000 computers running Hadoop; the biggest cluster has 2,000 nodes (2*4 CPU boxes with 4 TB of disk each).
Uses: To support research for ad systems and web search.

AOL
Cluster specifications: 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB RAM and an 800 GB hard disk, resulting in a total of 37 TB of HDFS capacity.
Uses: A variety of functions ranging from generating data to running advanced algorithms for behavioral analysis and targeting.

Facebook
Cluster specifications: A 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage.
Uses: Storing copies of internal log and dimension data sources; used as a source for reporting, analytics, and machine learning.
