
Big Data and Hadoop Developer

Introduction to Big Data and Hadoop

Copyright 2015, Revert Technology Pvt. Ltd. All rights reserved.

Objectives


By the end of this lesson, you will be able to:

Identify the need for Big Data

Explain the concept of Big Data

Describe the basics of Hadoop

Explain the benefits of Hadoop


Data Explosion


Over 2.5 exabytes (2.5 billion gigabytes) of data is generated every day.

Following are some of the sources of the huge volume of data:

A typical, large stock exchange captures more than 1 TB of data every day.

There are around 5 billion mobile phones (including 1.75 billion smartphones) in the world.

YouTube users upload more than 48 hours of video every minute.

Large social networks such as Twitter and Facebook capture more than 10 TB of data daily.

There are more than 30 million networked sensors in the world.


Types of Data


The following three types of data can be identified; a small illustrative example follows the list:

Structured data:
Data represented in a fixed, tabular format
E.g.: Databases
Semi-structured data:
Data that does not conform to a formal, tabular data model but still carries tags or markers that separate its elements
E.g.: XML files
Unstructured data:
Data with no pre-defined data model at all
E.g.: Text files
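To make the distinction concrete, the short Java sketch below shows the same record in each of the three forms. The customer record, field names, and values are invented purely for illustration.

// Illustrative only: one hypothetical customer record in three representations.
public class DataTypeExamples {

    // Structured: a row in a relational table with a fixed schema.
    static final String STRUCTURED =
        "INSERT INTO customers (id, name, city) VALUES (42, 'Asha Rao', 'Pune');";

    // Semi-structured: tagged and self-describing, but not bound to a rigid tabular schema.
    static final String SEMI_STRUCTURED =
        "<customer id=\"42\"><name>Asha Rao</name><city>Pune</city></customer>";

    // Unstructured: free text with no pre-defined data model; meaning must be extracted by analysis.
    static final String UNSTRUCTURED =
        "Asha Rao from Pune called support on Tuesday about a delayed delivery.";

    public static void main(String[] args) {
        System.out.println(STRUCTURED);
        System.out.println(SEMI_STRUCTURED);
        System.out.println(UNSTRUCTURED);
    }
}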


Need for Big Data


Following are the reasons why Big Data is needed:

90% of the data in the world today has been created in the last two years alone.

80% of the data is unstructured or exists in widely varying structures, which are difficult to analyze.

Structured formats have some limitations with respect to handling large quantities of data.

It is difficult to integrate information distributed across multiple systems.

Most business users do not know what should be analyzed.

Potentially valuable data is dormant or discarded.

It is too expensive to justify the integration of large volumes of unstructured data.

A lot of information has a short, useful lifespan.

Context adds meaning to the existing information.



Data: The Most Valuable Resource


"Data is the new oil." (Clive Humby, CNBC)

"In its raw form, oil has little value. Once processed and refined, it helps power the world." (Ann Winblad)


Big Data and Its Sources


Big Data is an all-encompassing term for any collection of data sets so large and complex that they become difficult to process using on-hand data management tools or traditional data processing applications.
The sources of Big Data are:
web logs;
sensor networks;
social media;
internet text and documents;
internet pages;
search index data;
atmospheric science, astronomy, biochemical and medical records;
scientific research;
military surveillance; and
photography archives.

Three Characteristics of Big Data


Big Data has three characteristics: variety, velocity, and volume.

Variety encompasses managing the complexity of data held in many different structures, ranging from relational data to logs and raw text. Velocity covers responding to data that arrives and changes at ever-increasing speed, such as streaming data. Volume refers to the sheer scale of data that must be stored and processed.

Characteristics of Big Data Technology


Following are the characteristics of Big Data technology:


(Chart: data volume projected to grow roughly 50x between 2015 and 2024.)

Cost-efficiently processes the growing volume of data

Responds to the increasing velocity of data

Collectively analyzes the widening variety of data

Examples:
Turned 12 terabytes of Tweets created each day into improved product sentiment analysis
Converted 350 billion annual meter readings to better predict power consumption


Characteristics of Big Data Technology


Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.
Source: Gartner


Appeal of Big Data Technology


Big Data technology is appealing for the following reasons:

It helps to manage and process huge amounts of data cost-efficiently.

It analyzes data in its native form, which may be unstructured, structured, or streaming.

It captures data from fast-moving events in real time.

It can handle the failure of isolated nodes and of the tasks assigned to such nodes.

It can turn data into actionable insights.

(Diagram: data flowing in from social media, the web, billing and ERP systems, machine data, and network elements.)


Leveraging Multiple Sources of Data


Big Data technology enables IT to leverage multiple sources of data. Following are some of these sources and their characteristics:

Application data: high volume; structured; high throughput

Machine data: high velocity; semi-structured; ingested at a high speed

Enterprise data: variety; highly unstructured; veracity

Social data: variety; highly unstructured; high volume

Traditional IT Analytics Approach


The traditional IT analytics approach imposes the following requirements, each of which is challenged in practice:

Requirements:
The business team needs to define questions before IT development.
They need to define the data sources and structures.

Challenging factors:
The requirements are iterative and volatile.
The data sources keep changing.


Traditional IT Analytics Approach


In a typical scenario of traditional IT systems development, the requirements are defined, followed by
solution design and build. Once the solution is implemented, queries are executed. If there are new
requirements or queries, the system is redesigned and rebuilt.
(Cycle: define requirements → design solution → execute queries → redesign and rebuild for new requirements.)


Big Data Technology: Platform for Discovery and Exploration


Following are the requirements for using Big Data technology as a platform for discovery and exploration, and the challenges it overcomes:

Requirements:
The business team needs to define the data sources.
They need to establish the hypothesis.

Challenges overcome by Big Data:
It enables explorative analysis.
It integrates data systems and sources as required.


Big Data Technology: Platform for Discovery and Exploration


The following cycle illustrates how IT systems are built with the help of Big Data technology:

(Cycle: identify data sources → create a platform for creative exploration of the available data and content → determine the questions to ask and test the hypothesis → new questions lead to the addition of data sources and further integration.)


Big Data Technology: Capabilities


Following are the capabilities of Big Data technology:

Understand and navigate Big Data sources

Tolerate faults and exceptions

Analyze unstructured data

Manage and store a huge volume of a variety of data

Process data in a reasonable time

Ingest data at a high speed


Big Data: Use Cases


The use cases of Big Data and Hadoop are given below.


Handling Limitations of Big Data


Following are the challenges that need to be addressed by Big Data technology:

How to combine data accumulated from all systems

How to handle system uptime and downtime

Using commodity hardware for data storage and analysis

Analyzing data across different machines

Merging of data

Maintaining a copy of the same data across clusters (a sketch of how HDFS replication handles this follows the list)
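As a sketch of how the last two challenges are typically handled, HDFS keeps redundant copies (replicas) of every block on commodity machines. The snippet below assumes a reachable Hadoop cluster whose settings are available on the classpath; the file path /data/events.log is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        conf.set("dfs.replication", "3");           // default copy count for files created by this client

        try (FileSystem fs = FileSystem.get(conf)) {
            // Raise the replication factor of an existing (hypothetical) file to three copies,
            // so the data survives the loss of individual commodity nodes.
            boolean scheduled = fs.setReplication(new Path("/data/events.log"), (short) 3);
            System.out.println("Replication change scheduled: " + scheduled);
        }
    }
}

Because the framework re-replicates blocks automatically when a node fails, a copy of the same data is maintained across the cluster without manual intervention.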


Introduction to Hadoop


Following are the facts related to Hadoop and why it is required:

What is Hadoop?
A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment
Based on the Google File System (GFS)

Why Hadoop?
Runs a number of applications on distributed systems with thousands of nodes involving petabytes of data
Has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes (a minimal read/write example follows)
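As a minimal sketch of working with HDFS from Java, the program below writes a small file into the distributed file system and reads it back. The NameNode address hdfs://localhost:9000 and the path /tmp/hello.txt are placeholder values for a single-node setup; substitute your own cluster's values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the (placeholder) NameNode; block placement and replication are handled for us.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write a small file into HDFS, overwriting it if it already exists.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Read the same file back from the cluster.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}

Because HDFS hides block placement and replication behind this API, the same code runs unchanged whether the cluster has one node or thousands.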


History and Milestones of Hadoop


Hadoop originated from Nutch, an open-source search-engine project, and works over distributed network nodes.

(Timeline: Hadoop milestones.)


Organizations Using Hadoop


The following shows how various organizations use Hadoop (organization, cluster specifications, and uses):

A9.com (Amazon)
Cluster specifications: Clusters vary from 1 to 100 nodes.
Uses: Amazon's product search indices are built using this program; it also processes millions of sessions daily for analytics.

Yahoo
Cluster specifications: More than 100,000 CPUs in approximately 20,000 computers running Hadoop; the biggest cluster has 2,000 nodes (2*4 CPU boxes with 4 TB of disk each).
Uses: To support research for ad systems and web search.

AOL
Cluster specifications: 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB RAM and an 800 GB hard disk, resulting in a total of 37 TB of HDFS capacity.
Uses: A variety of functions ranging from generating data to running advanced algorithms for behavioral analysis and targeting.

Facebook
Cluster specifications: A 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage.
Uses: Storing copies of internal log and dimension data sources; used as a source for reporting, analytics, and machine learning.
