
Cloud computing using Hadoop

Rahul Poddar 11500110119 Santosh Kumar 11500110006 Shubham Raj 11500110054 Vinayak Raj 11500110019 6th semester CSE-B BPPIMT

Brief introduction of Cloud Computing

What is Hadoop and its properties

MapReduce

An example application on Hadoop

Requirements for this project

What led to development of Hadoop?

HDFS

Outline

Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet).

The cloud aims to cut costs and help users focus on their core business instead of being impeded by IT obstacles.

The main enabling technologies for Cloud Computing are virtualization and autonomic computing.

What is cloud computing

With cloud computing, other companies host your computing resources.

Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)

These three service models encapsulate the basic components of cloud computing.

Cloud Computing Architecture

Java Requirements: Hadoop is a Java-based system. Recent versions of Hadoop require Sun Java 1.6.

Installing Hadoop: Hadoop 1.0.3 or above, installed in either single-node or multi-node mode.

Operating System: Linux (e.g. Ubuntu 12.04 LTS) or Mac OS X. Hadoop can also run on Windows, but Windows requires Cygwin to be installed.

Software requirements for Hadoop project

Hadoop and HBase require two types of machines:

1) Masters (the HDFS NameNode, the MapReduce JobTracker, and the HBase Master)
2) Slaves (the HDFS DataNodes, the MapReduce TaskTrackers, and the HBase RegionServers)

Recommended specs per node: two quad-core CPUs, 12 GB to 24 GB of memory, and 1 Gb Ethernet.

Hardware requirements for Hadoop(Small cluster 5-50 nodes)

Hadoop is a scalable, fault-tolerant grid operating system for data storage and processing.
Its scalability comes from the combination of:
HDFS: self-healing, high-bandwidth clustered storage
MapReduce: fault-tolerant distributed processing

Operates on structured and unstructured data

Here comes ‘Hadoop’

A large and active ecosystem (many developers and additions such as HBase, Pig, and Hive)

Open source under the Apache License

http://wiki.apache.org/hadoop/

Here comes ‘Hadoop’

Commodity HW

Use replication across servers to deal with unreliable storage/servers

Support for moving computation close to data

Add inexpensive servers

Servers have 2 purposes: data storage and computation

Characteristics of Hadoop

We live in the age of very large and complex data sets, known as Big Data.

IDC estimates the total size of the digital universe at 1.8 zettabytes (1 ZB = 10^21 bytes). That is equivalent to every person in the world having their own hard disk drive.

Need for Hadoop:Big data

Every day, 2.5 quintillion (2.5 × 10^18) bytes of data are generated.

90% of the world's data has been generated in the last two years alone. Such a large and ever-increasing amount of data is becoming difficult for traditional RDBMS and grid computing systems to manage.

Need for Hadoop:Big data

The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately 10 billion photos, taking up 1 petabyte of storage.

The Large Hadron Collider at CERN, Geneva, produces about 15 petabytes of data per year.
The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month.

Sources of Big data

The high cost of high-end servers and other proprietary hardware and software for processing and storing large amounts of data, together with their maintenance costs, is unbearable for many industrial organisations. Upgrading and maintaining these servers to scale up their capacity also requires huge expense.

Inefficiency and high expenses

The traditional single-server architecture is not robust, because one large computer takes care of all the computing. If it fails or shuts down, the whole system breaks down and the enterprise incurs huge losses. During repairs or upgrades the computer has to be switched off, and in the meantime no useful tasks are executed, so computations lag behind.

Not Robust

MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers.
MapReduce gives ordinary programmers the ability to write parallel, distributed programs much more easily. It consists of two simple functions:

map()
reduce()

MapReduce algorithm

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes.

A worker node may do this again in turn, leading to a multi-level tree structure.

The worker node processes the smaller problem, and passes the answer back to its master node.

MapReduce algorithm

"Reduce" step: The master node collects the answers to all the sub-problems from the slave nodes.

Then the master combines the answers in some way to form the output – the answer to the problem it was originally trying to solve.

MapReduce algorithm
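The two steps above can be sketched in plain Python. This is an illustration of the programming model only (the function names are made up for the example); real Hadoop distributes the map and reduce work across machines.

```python
from collections import defaultdict

def map_fn(document):
    """Map step: emit a (key, value) pair per word occurrence."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce step: combine all values collected for one key."""
    return (word, sum(counts))

def run_job(documents):
    groups = defaultdict(list)      # the "shuffle": group values by key
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_job(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

The classic "word count" example: map emits one count per word, the shuffle groups counts by word, and reduce sums each group.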

[Figure: a client computer submits a MapReduce job to the JobTracker on the master node; the JobTracker assigns work to the TaskTracker on each slave node, and each TaskTracker runs task instances.]

MapReduce: High Level

Job – A “full program” - an execution of a Mapper and Reducer across a data set

Task – An execution of a Mapper or a Reducer on a slice of data
• a.k.a. Task-In-Progress (TIP)

Task Attempt – A particular instance of an attempt to execute a task on a machine

Some MapReduce Terminology

Running “Word Count” across 20 files is one job

20 files to be mapped imply 20 map tasks + some number of reduce tasks

At least 20 map task attempts will be performed… more if a machine crashes, etc.

Terminology Example

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS is part of the Apache Hadoop project, which originated as a subproject of Apache Lucene.

HDFS(Hadoop Distributed File System)

Master-Slave architecture

DFS Master “Namenode”

• Manages the filesystem namespace
• Maintains the mapping from file names to block lists and block locations
• Manages block allocation and replication
• Checkpoints the namespace and journals namespace changes for reliability
• Controls access to the namespace

DFS Slaves “Datanodes” handle block storage

• Store blocks using the underlying OS's files
• Clients access blocks directly from the datanodes
• Periodically send block reports to the Namenode
• Periodically check block integrity

HDFS Architecture
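The split-into-blocks-and-replicate idea can be shown with a toy sketch. This is not HDFS's actual rack-aware placement policy, just a simple round-robin illustration of how a file becomes replicated blocks spread over datanodes:

```python
BLOCK_SIZE = 64 * 1024 * 1024   # classic HDFS default block size (64 MB)

def place_blocks(file_size, datanodes, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct datanodes, rotating the starting node."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(200 * 1024 * 1024, nodes)   # 200 MB -> 4 blocks
print(plan[0])  # block 0 -> ['dn1', 'dn2', 'dn3']
```

With three replicas per block, any single datanode can fail without losing data; the Namenode would then re-replicate the affected blocks elsewhere.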

Weather sensors all across the globe collect climatic data.

The data can be obtained from the National Climatic Data Center (http://www.ncdc.noaa.gov/).
We will focus only on temperature, for simplicity.
The input will be NCDC records, supplied as key-value pairs to map().
The output of reduce() will be the maximum temperature for each year.

An Example:Weather Data Mining

Mapper.py:

#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        print "%s\t%s" % (year, temp)

Weather Data Mining

Reduce.py:

#!/usr/bin/env python
import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print "%s\t%s" % (last_key, max_val)

Weather Data Mining

To run a test:

% cat input/ncdc/sample.txt | \
    src/main/ch02/python/max_temperature_map.py | \
    sort | src/main/ch02/python/max_temperature_reduce.py

Output:

1949	111
1950	22
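The same max-temperature logic can also be checked without a Hadoop cluster. The sketch below (Python 3, operating on made-up (year, temp, quality) tuples rather than raw NCDC records) mimics the map → sort → reduce pipeline in memory:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: keep readings with a valid temperature and quality code,
    emitting (year, temp) pairs."""
    for year, temp, q in records:
        if temp != 9999 and q in (0, 1, 4, 5, 9):
            yield (year, temp)

def reduce_phase(pairs):
    """Reduce: group sorted pairs by year and keep the maximum temp
    (sorting plays the role of Hadoop's shuffle/sort)."""
    for year, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (year, max(t for _, t in group))

records = [(1949, 111, 1), (1949, 78, 1), (1950, 0, 1), (1950, 22, 1), (1950, -11, 1)]
print(list(reduce_phase(map_phase(records))))  # [(1949, 111), (1950, 22)]
```

This mirrors the shell pipeline above: map_phase plays the mapper, sorted() plays sort, and reduce_phase plays the reducer.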

Running the program

• Hadoop Wiki: http://wiki.apache.org/hadoop/GettingStartedWithHadoop
• Hadoop Core: http://hadoop.apache.org/core/
• Hadoop MapReduce: http://wiki.apache.org/hadoop/HadoopMapReduce
• HDFS design: http://hadoop.apache.org/core/docs/current/hdfs_design.html

References