
HADOOP

PRESENTATION

NAME: SWARNALI HALDER
COURSE: B.SC COMPUTER SCIENCE HONS.
STD: 2nd YEAR
COLLEGE: CHANDERNAGORE GOVT. COLLEGE

CONTENTS
INTRODUCTION
THE PROGRAMMERS
APPLICATIONS
PROS AND CONS

INTRODUCTION
Hadoop is an open-source, Java-based programming framework that supports the
processing and storage of extremely large data sets in a distributed computing
environment. It is part of the Apache project sponsored by the Apache Software
Foundation. Hadoop was created by computer scientists Doug Cutting and Mike
Cafarella in 2006 to support distribution for the Nutch search engine. It was
inspired by Google's MapReduce, a software framework in which an application is
broken down into numerous small parts. Any of these parts, also called fragments
or blocks, can be run on any node in the cluster. After years of development
within the open-source community, Hadoop 1.0 became publicly available in
November 2012.

As a software framework, Hadoop is composed of numerous functional modules. At a
minimum, Hadoop uses Hadoop Common as a kernel to provide the framework's
essential libraries. The core of Apache Hadoop consists of a storage part, known as
the Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
To process data, Hadoop transfers packaged code to the nodes, which then process in
parallel the portions of the data they hold. This approach takes advantage of data
locality (nodes manipulating the data they already have local access to), allowing
the dataset to be processed faster and more efficiently than in a more conventional
supercomputer architecture that relies on a parallel file system where computation
and data are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:
1) Hadoop Common: contains the libraries and utilities needed by the other Hadoop
modules;
2) Hadoop Distributed File System (HDFS): a distributed file system that stores data
on commodity machines, providing very high aggregate bandwidth across the cluster;
3) Hadoop YARN: a resource-management platform responsible for managing computing
resources in clusters and using them to schedule users' applications; and
4) Hadoop MapReduce: an implementation of the MapReduce programming model for
large-scale data processing (a minimal example is sketched after this list).
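The sketch below is not part of the original slide content; it is a minimal,
illustrative word-count job written against the public org.apache.hadoop.mapreduce
API to show how the map and reduce phases fit together. The class names and the
input/output paths are assumptions chosen for the example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count sketch: the map phase emits (word, 1) pairs for each
// input split, and the reduce phase sums the counts for each word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit one count per word occurrence
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));   // total count for this word
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output HDFS paths are assumptions passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this would typically be packaged into a jar and submitted with something
like "hadoop jar wordcount.jar WordCount <input path> <output path>", where both
paths point into HDFS.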

THE PROGRAMMERS

Computer scientist: Doug Cutting

Doug Cutting is an advocate and creator of open-source search technology. He
originated Lucene and, with Mike Cafarella, Nutch, both open-source search
technology projects which are now managed through the Apache Software Foundation.
He is also the creator of Hadoop (Yahoo!, Cloudera). He holds a bachelor's degree
from Stanford University.

Computer scientist: Mike Cafarella

Mike Cafarella is a computer scientist specializing in database management systems.
He is currently an associate professor of computer science at the University of
Michigan. Along with Doug Cutting, he is one of the original co-founders of the
open-source Hadoop project. After completing his bachelor's degree at Brown
University, he earned a Ph.D. in database management systems at the University of
Washington.

APPLICATIONS

Analyzing financial services in retail marketing.

Drug analysis in medical and health-care systems.

Data-movement analysis in the life sciences.

Wholesale marketing.

Social-network behavior analysis.

PROS AND CONS


PROS:
Hadoop makes it possible to run applications on systems with thousands of
commodity hardware nodes and to handle thousands of terabytes of data.
Its distributed file system facilitates rapid data transfer rates among nodes
and allows the system to continue operating in case of a node failure. This
approach lowers the risk of catastrophic system failure and unexpected data
loss, even if a significant number of nodes become inoperative.
It features a high-availability file-system option and support for Microsoft
Windows, along with other components that expand the framework's versatility for
data processing and analytics.
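As a hedged illustration of how an application interacts with the distributed file
system mentioned above, the sketch below writes a file to HDFS with the standard
org.apache.hadoop.fs.FileSystem API and raises its replication factor. The NameNode
address and the file path are assumptions used only for the example.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; real clusters read this from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");   // illustrative path

    // Write a small file; HDFS splits large files into blocks and replicates
    // each block across several DataNodes for fault tolerance.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Ask for three replicas of each block, so the data survives node failures.
    fs.setReplication(file, (short) 3);

    fs.close();
  }
}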

CONS:

Hadoop has significant setup and processing overhead. A Hadoop job might take
several minutes, even if the amount of data involved is modest.

Hadoop isn't particularly good at supporting data structures that have a
multi-dimensional context. For example, a record that defines the value of
variables for a given geography and then links vertically to define those values
over time requires a more complex representation of data relationships than the
simple key-value one used by Hadoop (a small illustration follows below).
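The snippet below is an assumption-laden sketch, not anything prescribed by Hadoop
itself: it shows the usual workaround of flattening the geography and time
dimensions into a single composite key so the record fits the key-value model.

// Flattening a (geography, year) pair into one Text key; querying across either
// dimension later requires extra parsing or additional MapReduce passes.
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class CompositeKeyExample {
  public static void main(String[] args) {
    String geography = "kolkata";   // illustrative dimension values
    int year = 2015;

    Text key = new Text(geography + "|" + year);        // composite key
    DoubleWritable value = new DoubleWritable(42.5);     // observed variable

    System.out.println(key + " -> " + value);
  }
}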

Hadoop is less helpful for problems that have to be solved iteratively, as
several sequential, dependent steps.
