


Training done at: HCL CDC

Big data refers to our ability to collect and analyze the vast
amounts of data we now generate in the world. For some
organizations, facing hundreds of gigabytes of data for the first time may
trigger a need to reconsider their data management options. For others, it may
take tens or hundreds of terabytes before data size becomes a significant concern.
In the past, traditional database and analytics tools could not
deal with extremely large, messy, unstructured and fast-moving data. We
now have software such as Hadoop, which enables us to analyze
large, messy and fast-moving volumes of structured and unstructured
data. It does this by breaking a task up among many different
computers, and as a consequence companies can now bring together
different, previously inaccessible data sources to generate
impressive results.
In this project, I have showcased the Hadoop framework and utilized its
various components to successfully process, analyze and store data.


The objective of this project is to study the Apache Hadoop framework and
use it to process large amounts of data. As Hadoop
requires Linux to operate, we have used Cloudera's Hadoop Distribution 3
(CDH3) for this purpose.

First, we have written a program that mines weather data. Weather sensors
collecting data every hour at many locations across the globe gather a large
volume of log data, which is a good candidate for analysis with MapReduce,
since it is semi-structured and record-oriented. The output of the program is
then stored on HDFS. From there, we transfer the data onto our local
computer so that it can be used to produce graphical results using Spotfire.

This project serves as an example of how the Hadoop software can be
used with large amounts of data to produce meaningful results.


System type: 64-bit operating system
Processor: Intel(R) Core(TM) i3-4030U
CPU speed: 1.90 GHz recommended or higher
RAM: 4 GB minimum
Peripherals: a USB or PS/2 keyboard, a USB or PS/2 mouse, and a monitor for display
System OS: Windows XP and later versions

Platform used: Cloudera Hadoop Distribution 3
Operating system (platform): Linux (Ubuntu)
Apache Hadoop: any version
Eclipse IDE: any version


Apache Hadoop is an open-source software framework, written in Java, for
distributed storage and distributed processing of very large data sets on
computer clusters built from commodity hardware. All the modules in
Hadoop are designed with the fundamental assumption that hardware
failures (of individual machines, or racks of machines) are commonplace
and should therefore be handled automatically in software by the framework.
Hadoop provides reliable shared storage (HDFS) and an analysis system (MapReduce).
Hadoop is highly scalable and, unlike relational databases, it scales
linearly. Because of this linear scaling, a Hadoop cluster can contain tens, hundreds,
or even thousands of servers.
Hadoop is very cost-effective, as it can work with commodity hardware and
does not require expensive high-end hardware.

Hadoop MapReduce is a software framework for easily writing applications
that process large amounts of data in parallel on large clusters (thousands
of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that
Hadoop programs perform:
The Map Task: This is the first task, which takes input data and converts it
into a set of data where individual elements are broken down into tuples
(key/value pairs).
The Reduce Task: This task takes the output from a map task as input and
combines those data tuples into a smaller set of tuples. The reduce task is
always performed after the map task.
Typically both the input and the output are stored in a file system. The
framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
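The two phases above can be sketched in plain Java, without the Hadoop API. The record format here (a comma-separated "year,temperature" line) is a simplified stand-in for the weather logs used in the project, not the actual NOAA format:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A plain-Java sketch of the map and reduce tasks. In real Hadoop, these
// would be Mapper and Reducer classes run across many nodes in parallel.
public class MapReduceSketch {

    // Map task: convert each raw input line into a (year, temperature) pair.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");           // e.g. "1950,22"
            pairs.add(Map.entry(fields[0], Integer.parseInt(fields[1])));
        }
        return pairs;
    }

    // Reduce task: combine the pairs into a smaller set, here the
    // maximum temperature recorded for each year.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> maxByYear = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            maxByYear.merge(p.getKey(), p.getValue(), Math::max);
        }
        return maxByYear;
    }

    public static void main(String[] args) {
        List<String> input = List.of("1950,0", "1950,22", "1951,-11", "1951,12");
        System.out.println(reduce(map(input))); // one max temperature per year
    }
}
```

Hadoop runs many map tasks in parallel over different splits of the input, then shuffles all pairs with the same key to the same reduce task; the sequential code above only illustrates the logic of the two phases.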


Hadoop can work directly with any mountable distributed file system, such as
the local FS, HFTP FS, S3 FS and others, but the file system most commonly
used with Hadoop is the Hadoop Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system designed to run on large
clusters (thousands of computers) of small machines in a
reliable, fault-tolerant manner.
HDFS uses a master/slave architecture, where the master consists of a
single NameNode that manages the file system metadata, and one or more
slave DataNodes store the actual data.
A file in the HDFS namespace is split into several blocks, and those blocks are
stored on a set of DataNodes. The NameNode determines the mapping of
blocks to DataNodes. The DataNodes handle read and write
operations with the file system; they also take care of block creation,
deletion and replication based on instructions given by the NameNode.
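The block accounting the NameNode performs can be illustrated with a small calculation. The 64 MB block size and replication factor of 3 assumed below are the classic defaults of the Hadoop 1.x generation that CDH3 is based on:

```java
// Sketch of how HDFS splits a file into fixed-size blocks and replicates
// each block across DataNodes. Block size and replication factor are the
// assumed Hadoop 1.x / CDH3 defaults, both configurable in practice.
public class HdfsBlockSketch {

    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB per block
    static final int REPLICATION = 3;                  // copies of each block

    // Number of blocks the NameNode records for a file of the given size.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Total block copies stored across the DataNodes after replication.
    static long replicatedBlocks(long fileSizeBytes) {
        return blockCount(fileSizeBytes) * REPLICATION;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;              // a 200 MB file
        System.out.println(blockCount(fileSize));        // 4 blocks
        System.out.println(replicatedBlocks(fileSize));  // 12 stored copies
    }
}
```

Note that the last block of a file may be smaller than the block size; only full 64 MB chunks and one remainder are stored, which is why a 200 MB file occupies 4 blocks rather than exactly 3.125.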

Sqoop ("SQL-to-Hadoop") is a tool, provided by the Apache Software
Foundation, designed to transfer data between Hadoop and relational
database servers. It is used to import data from relational databases
such as MySQL and Oracle into Hadoop HDFS, and to export data from
the Hadoop file system back to relational databases. Through Sqoop we
can import a full table, part of a table, or selected values into Hadoop.
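The general shape of a Sqoop import invocation can be sketched as below; the command line is assembled as a Java string purely for illustration, and the JDBC URL, username, table name and target directory are hypothetical placeholders, not values from this project:

```java
import java.util.List;

// Builds an illustrative Sqoop import command line. All connection
// details here (URL, user, table, target directory) are placeholders.
public class SqoopCommandSketch {

    static String importCommand(String table, String targetDir) {
        return String.join(" ", List.of(
            "sqoop", "import",
            "--connect", "jdbc:mysql://localhost/weatherdb", // JDBC URL of the source database
            "--username", "dbuser",                           // database login
            "--table", table,                                 // full-table import
            "--target-dir", targetDir                         // destination directory on HDFS
        ));
    }

    public static void main(String[] args) {
        System.out.println(importCommand("readings", "/user/hadoop/readings"));
    }
}
```

The reverse direction used later in this project, moving results from HDFS into MySQL, follows the same pattern with `sqoop export` and an `--export-dir` pointing at the HDFS output.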

Eclipse IDE
Eclipse is a Java-based open-source platform that allows a software
developer to create a customized development environment (IDE)
from plug-in components built by Eclipse members. Eclipse is
managed and directed by the Eclipse Consortium.
Although a number of versions have come out since its release, in this
project we use the Kepler version of Eclipse.

MySQL is an open-source RDBMS that relies on SQL to process the data in
the database. MySQL provides APIs for the languages C, C++, Eiffel, Java,
Perl, PHP and Python. In addition, OLE DB and ODBC providers exist for
MySQL data connections in the Microsoft environment. A MySQL .NET native
provider is also available, which allows native MySQL-to-.NET access
without the need for OLE DB. MySQL is most commonly used for web
applications and for embedded applications, and has become a popular
alternative to proprietary database systems because of its speed and
reliability. MySQL can run on UNIX, Windows and Mac OS. MySQL is
developed, supported and marketed by MySQL AB. The database is
available for free under the terms of the GNU General Public License (GPL),
or for a fee to those who do not wish to be bound by the terms of the GPL.

In any business intelligence (BI) solution, reporting is the key to improving the analysis
of data across organizations. Information, no matter how advanced, is of
no use if users cannot easily manipulate it or quickly find the answers to their
questions. Spotfire improves BI reporting with advanced
analytic capabilities that help users easily evaluate and explore a data set in
real time with the help of a range of interactive visualizations. Spotfire makes
all this possible with ease and power. It uses in-memory processing and good user-interface
design to develop highly interactive displays of data. Business intelligence
reporting is intended to improve business and financial analytics. Spotfire's
interactive and highly visual analytical environment helps achieve this with a self-configuring visual data-analysis environment that lets users query, visualize and
explore data in real time.

Users can choose and manipulate a wide variety of visual representations, including
map-based data displays to review business intelligence reporting results
geographically, multidimensional scatter plots to view data statistically, and bar
charts, pie charts, line graphs, profile charts and more.


Run the MapReduce program on CDH3, using the Eclipse IDE, taking
the weather data downloaded from the NCDC (NOAA) website as the input.
Transfer the output of the program to HDFS.
On the local machine, create schemas in MySQL,
based on the output data.
Using Sqoop, transfer the data to the MySQL database.
Use the created database in TIBCO Spotfire to analyse the weather data.