
BIG DATA ANALYTICS
(Weather Analysis and Prediction)
End Semester MINI PROJECT report submitted in the proposal of the
requirements for the completion of the seventh semester of the
UNDER GRADUATE PROGRAM in Electronics and
Communication Technology (B.Tech in ECE).

Submitted By:
Gaurav Satish (IEC2012021)
Akash Kumar (IEC2012033)
Salil Kumar (IEC2012049)
Vatsal Mishra (IEC2012068)
Varsheindra Gautam (IEC2012071)

Under the Supervision of
Dr. Satish Kumar Singh and
Dr. Rajat Kumar Singh

Indian Institute of Information Technology, Allahabad
November, 2015

CANDIDATES' DECLARATION
We hereby declare that the work presented in this project report entitled
BIG DATA ANALYTICS (Weather Analysis and Prediction), submitted
in the proposal of the requirements for the completion of the 7th semester
of the UNDER GRADUATE PROGRAM (B.Tech in ECE), is an authentic
record of our original work carried out from July 2015 to November 2015
under the guidance of Dr. Satish Kumar Singh & Dr. Rajat Kumar
Singh. Due acknowledgements have been made in the text to all other
material used. The project was done in full compliance with the requirements
and constraints of the prescribed curriculum.

Place: Allahabad
Date: 18/11/2015

Gaurav Satish (IEC2012021)
Akash Kumar (IEC2012033)
Salil Kumar (IEC2012049)
Vatsal Mishra (IEC2012068)
Varsheindra Gautam (IEC2012071)

CERTIFICATE FROM THE SUPERVISOR

I do hereby recommend that the mini project report prepared under my
supervision by Gaurav Satish, Akash Kumar, Salil Kumar, Vatsal Mishra &
Varsheindra Gautam titled BIG DATA ANALYTICS (Weather Analysis and
Prediction) be accepted in the fulfillment of the requirements for the
proposal of the seventh semester of the UNDER GRADUATE PROGRAM in
Electronics and Communication (B.Tech in ECE), for Examination.

Supervisor :
Dr. Satish Kumar Singh
Dr. Rajat Kumar Singh

ACKNOWLEDGEMENT
We owe a special debt of gratitude to Dr. Satish Kumar Singh & Dr. Rajat
Kumar Singh for their constant support and guidance throughout the course
of our work. Their sincerity, thoroughness and perseverance have been a
constant source of inspiration for us. It is only through their cognizant efforts
that our endeavours have seen the light of day.

TABLE OF CONTENTS
1. Introduction
2. Motivation
3. Problem definition and scope
4. Literature Survey and analysis of recent similar work
5. Approach and Proposed methodology
6. Hardware and Software Requirements
7. References

INTRODUCTION
We live in the data age. It is not easy to measure the total volume of data
stored electronically, but an IDC estimate put the size of the digital
universe at 0.18 zettabytes in 2006 and forecast a tenfold growth to
1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one
thousand exabytes, one million petabytes, or one billion terabytes. That is
roughly the same order of magnitude as one disk drive for every person in
the world. Online searches, store purchases, Facebook posts, Tweets or
Foursquare check-ins, cell phone usage, etc. are creating a flood of data
that, when organized, categorized and analyzed, reveals trends and
habits about ourselves and society at large.
This flood of data is coming from many sources. Consider the
following:

- The New York Stock Exchange generates about 1 terabyte of new
  trade data per day.
- Facebook hosts approximately 10 billion photos, taking up 1
  petabyte of storage.
- Ancestry.com, the genealogy site, stores around 2.5 petabytes
  of data.
- The Internet Archive stores around 2 petabytes of data, and is
  growing at a rate of 20 terabytes per month.
- The Large Hadron Collider near Geneva, Switzerland, will produce
  about 15 petabytes of data per year.

Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications. The challenges include capture,
curation, storage, search, sharing, transfer, analysis and visualization. Big
Data refers to the explosion in the quantity (and sometimes, quality) of
available and potentially relevant data, largely the result of recent and
unprecedented advancements in data recording and storage technology.
To define big data in competitive terms, we must think about what it takes
to compete in the business world. Big data is traditionally characterized as a
rushing river: large amounts of data flowing at a rapid pace. To be
competitive with customers, big data creates products which are valuable
and unique. To be competitive with suppliers, big data is freely available
with no obligations or constraints. To be competitive with new entrants, big
data is difficult for newcomers to try. To be competitive with substitutes, big
data creates products which preclude other products from satisfying the
same need.

MOTIVATION
The use of big data will become a key basis of competition and growth for
individual firms. From the standpoint of competitiveness and the potential
capture of value, all companies need to take big data seriously. In most
industries, established competitors and new entrants alike will leverage
data-driven strategies to innovate, compete, and capture value from deep
and up-to-real-time information. Indeed, we found early examples of such
use of data in every sector we examined.
The use of big data will underpin new waves of productivity growth and
consumer surplus. For example, we estimate that a retailer using big data to
the full has the potential to increase its operating margin by more than 60
percent. Big data offers considerable benefits to consumers as well as to
companies and organizations. For instance, services enabled by
personal-location data can allow consumers to capture $600 billion in economic
surplus.
While the use of big data will matter across sectors, some sectors are set for
greater gains. We compared the historical productivity of sectors in the
United States with the potential of these sectors to capture value from big
data (using an index that combines several quantitative metrics), and found
that the opportunities and challenges vary from sector to sector. The
computer and electronic products and information sectors, as well as finance
and insurance, and government are poised to gain substantially from the use
of big data.
There will be a shortage of talent necessary for organizations to take
advantage of big data. By 2018, the United States alone could face a
shortage of 140,000 to 190,000 people with deep analytical skills as well as
1.5 million managers and analysts with the know-how to use the analysis of
big data to make effective decisions.
Several issues will have to be addressed to capture the full potential of big
data. Policies related to privacy, security, intellectual property, and even
liability will need to be addressed in a big data world. Organizations need
not only to put the right talent and technology in place but also structure
workflows and incentives to optimize the use of big data. Access to data is
critical: companies will increasingly need to integrate information from
multiple data sources, often from third parties, and the incentives have to
be in place to enable this.

PROBLEM DEFINITION AND SCOPE

PROBLEM DEFINITION:
What matters when dealing with Big Data?
- Smart sampling of data
  Reducing the original data while not losing the statistical properties
  of the data
- Finding similar terms
  Efficient multi-dimensional indexing
- Incremental updating of models
  Crucial for streaming data
- Distributed linear algebra
  Dealing with large sparse matrices
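
To make the "incremental updating of models" item concrete, here is a minimal sketch of a statistic that is updated one observation at a time, which is what streaming data requires. It uses Welford's method for a running mean and variance. The class name RunningStats and the readings are made up for illustration and are not taken from the project.

```python
class RunningStats:
    """Incrementally updated mean and variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        if self.count < 2:
            return 0.0
        return self._m2 / (self.count - 1)


stats = RunningStats()
for reading in [31.0, 33.5, 29.0, 35.5]:  # made-up temperature readings
    stats.update(reading)
print(stats.count, stats.mean, stats.variance)
```

Each update costs constant time and memory, so the full data set never has to be held or rescanned when a new value arrives.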

In this project we deal with weather prediction, and we follow the approach
described below:

Fig. 1: Map/Reduce logical data flow

Hadoop Block Diagram:

Fig. 2: Hadoop block diagram

Map-reduce Block Diagram:

Fig. 3: Map-reduce diagram

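Curve fitting:
Curve fitting captures the trend in the data by assigning a single function
across the entire range. The example below uses a straight-line function.

A straight line is described generically by f(x) = ax + b. The goal is to identify
the coefficients a and b such that f(x) fits the data well.

Polynomial Curve Fitting

Generalizing from a straight line (i.e., first degree polynomial) to a kth
degree polynomial

$$y = a_0 + a_1 x + \cdots + a_k x^k,$$

the residual is given by

$$R^2 = \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \cdots + a_k x_i^k \right) \right]^2.$$

The partial derivatives with respect to the coefficients, set to zero, are

$$\frac{\partial (R^2)}{\partial a_0} = -2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \cdots + a_k x_i^k \right) \right] = 0$$
$$\frac{\partial (R^2)}{\partial a_1} = -2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \cdots + a_k x_i^k \right) \right] x_i = 0$$
$$\vdots$$
$$\frac{\partial (R^2)}{\partial a_k} = -2 \sum_{i=1}^{n} \left[ y_i - \left( a_0 + a_1 x_i + \cdots + a_k x_i^k \right) \right] x_i^k = 0$$

These lead to the equations

$$a_0 n + a_1 \sum_{i=1}^{n} x_i + \cdots + a_k \sum_{i=1}^{n} x_i^k = \sum_{i=1}^{n} y_i$$
$$a_0 \sum_{i=1}^{n} x_i + a_1 \sum_{i=1}^{n} x_i^2 + \cdots + a_k \sum_{i=1}^{n} x_i^{k+1} = \sum_{i=1}^{n} x_i y_i$$
$$\vdots$$
$$a_0 \sum_{i=1}^{n} x_i^k + a_1 \sum_{i=1}^{n} x_i^{k+1} + \cdots + a_k \sum_{i=1}^{n} x_i^{2k} = \sum_{i=1}^{n} x_i^k y_i$$

or, in matrix form

$$\begin{bmatrix} n & \sum x_i & \cdots & \sum x_i^k \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{k+1} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_i^k & \sum x_i^{k+1} & \cdots & \sum x_i^{2k} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^k y_i \end{bmatrix}$$

This is a Vandermonde matrix. We can also obtain the matrix for a least
squares fit by writing

$$\begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

Premultiplying both sides by the transpose of the first matrix then gives

$$\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \\ \vdots & \vdots & \ddots & \vdots \\ x_1^k & x_2^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \\ \vdots & \vdots & \ddots & \vdots \\ x_1^k & x_2^k & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

So

$$\begin{bmatrix} n & \sum x_i & \cdots & \sum x_i^k \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{k+1} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_i^k & \sum x_i^{k+1} & \cdots & \sum x_i^{2k} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^k y_i \end{bmatrix}$$

As before, given n points (x_i, y_i) and fitting with polynomial coefficients
a_0, ..., a_k gives

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix}$$

In matrix notation, the equation for a polynomial fit is given by

$$\mathbf{y} = \mathbf{X} \mathbf{a}.$$

This can be solved by premultiplying by the transpose $\mathbf{X}^T$:

$$\mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X} \mathbf{a}.$$

This matrix equation can be solved numerically, or can be inverted directly if it
is well formed, to yield the solution vector

$$\mathbf{a} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}.$$

Setting k = 1 in the above equations reproduces the linear solution.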
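
The least squares fit described above can be sketched in a few lines of code. The following is a minimal, standard-library-only illustration, not the project's actual implementation: it builds the normal equations (the sums of powers of x on the left, the sums of x^r * y on the right) and solves them by Gaussian elimination. The function names fit_polynomial and evaluate, and the synthetic sample points, are assumptions made for this example.

```python
def fit_polynomial(xs, ys, degree):
    """Return [a_0, ..., a_k] from the normal equations (X^T X) a = X^T y."""
    size = degree + 1
    # Left-hand matrix: entry (r, c) is the sum over i of x_i^(r + c).
    lhs = [[sum(x ** (r + c) for x in xs) for c in range(size)]
           for r in range(size)]
    # Right-hand vector: entry r is the sum over i of x_i^r * y_i.
    rhs = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(size)]

    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(lhs[r][col]))
        if abs(lhs[pivot][col]) < 1e-12:
            raise ValueError("singular system; need more distinct x values")
        lhs[col], lhs[pivot] = lhs[pivot], lhs[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, size):
            factor = lhs[row][col] / lhs[col][col]
            for c in range(col, size):
                lhs[row][c] -= factor * lhs[col][c]
            rhs[row] -= factor * rhs[col]

    # Back substitution on the upper-triangular system.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(lhs[row][c] * coeffs[c] for c in range(row + 1, size))
        coeffs[row] = (rhs[row] - tail) / lhs[row][row]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate a_0 + a_1 x + ... + a_k x^k using Horner's rule."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result


# Synthetic points lying exactly on y = 1 + 2x + 3x^2 (not project data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
print([round(a, 6) for a in coeffs])
print(round(evaluate(coeffs, 5.0), 6))
```

When x holds calendar years, the powers of x become very large and the normal equations become poorly conditioned, so it is usually safer to fit against an offset such as (year - first year) rather than the raw year.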

SCOPE
Analyzing big data allows analysts, researchers, and business users to make
better and faster decisions using data that was previously inaccessible or
unusable. The 5 Key Big Data Use Cases:

- Big Data Exploration: Find, visualize and understand all big data to
  improve decision making.
- Enhanced 360° View of the Customer: Extend existing customer
  views (MDM, CRM, etc.) by incorporating additional internal and
  external information sources.
- Security/Intelligence Extension: Lower risk, detect fraud and
  monitor cyber security in real time.
- Operations Analysis: Analyze a variety of machine data for
  improved business results.
- Data Warehouse Augmentation: Integrate big data and data
  warehouse capabilities to increase operational efficiency.

The use of big data will become a key basis of competition and growth for
individual firms. In most industries, established competitors and new entrants
alike will leverage data-driven strategies to innovate, compete, and capture
value from deep and up-to-real-time information. Indeed, we found early
examples of such use of data in every sector we examined. There is much
future research that could come out of this project. The potential research
can be broken up into two main areas: experimentation and exploration.
In the area of experimentation a more comprehensive performance study
could be done. There are many parameters in Hadoop that can be
customized that could potentially increase the performance of the Map
Reduce process. Experiments could also be conducted with larger clusters
and more demanding Map Reduce tasks that require much larger data sets.
In the area of exploration a more in-depth study of Map Reduce could be
conducted. This could involve writing programs that make use of the Map
Reduce process that Hadoop provides. An attempt to build an efficient
solution to an NP-complete problem would be an interesting application.

LITERATURE SURVEY AND ANALYSIS OF RECENT SIMILAR WORK
Processing large volumes of data has been around for decades (such as in
weather, astronomy, and energy applications). It required specialized and
expensive hardware (supercomputers), software, and developers with distinct
programming and analytical skills. In the 1980s, the database management
systems of IBM, Oracle, Cullinet, and Sybase could have been viewed as
Big Data tools of that era. But they were not designed to handle the
unforeseen explosion of data brought on by the Internet, mobile
communications, and sensor networks.
The Internet companies were the first to be hit by the data tsunami. Their
needs were so pressing that Google, Facebook, Yahoo!, eBay, and Twitter
developed their own database infrastructures and technologies. The popular
Hadoop Big Data application, now maintained as an Open Source project by
the Apache Software Foundation, is the most prominent example. Yahoo! has
been the largest contributor to the project and has launched Hortonworks
to commercialize its Hadoop implementation. Facebook is also a prominent
Hadoop user.
If the first decade of the 21st century belonged to the Internet revolution,
social media and cloud computing, then the second decade is surely going
to be the decade of Big Data analytics. While everyone is busy talking about
how Big Data can revolutionize the way businesses compete, there is another
interesting angle from which to look at this revolution: the evolution of data
analytics over the past decades.
Collection of data by states started centuries ago, and that is how the name
statistics was derived from the Latin word 'status', the Italian word 'statista' or
the German word 'Statistik', each of which means a 'political state'. The states
collected data to calculate the manpower available and to decide taxes
(based on property and wealth owned by citizens). This is the earliest origin
of what is known today as data analysis. Data analytics has evolved a lot
over the years, and Big Data analytics is the latest stage in this evolution.
Today Big Data has become the most talked-about technological
phenomenon after cloud.
The first data systems were designed with the goal of accurately capturing
transactions without losing any data. The relational database systems used
for operational data storage could be categorized as the first wave of data
management and analysis. Let's call this Data Stack 1.0. The data was
queried using SQL, and the architecture of these systems consisted of
application logic written on top of the relational databases, with a presentation
layer for generating static reports which would then be analyzed by the
analysts. The focus of these systems remained on capturing the
transactional data accurately and storing it efficiently for all mission-critical
systems. However, the analytical capabilities of these systems were limited.

APPROACH AND PROPOSED METHODOLOGY
After a careful study of the Map Reduce techniques, we will use the
following approach to implement the temperature problem.
1. Setting up virtual machines with Ubuntu OS.
2. Setting the network so that they are connected to each other and
can ping each other.
3. Installing and configuring the Hadoop Cluster on these machines.
4. Finding the appropriate datasets.
5. Running an openly available example job on this cluster with the data set to
check the configuration and get an estimate of run time.
6. Implementing the max/min techniques one by one in Java using the
Hadoop Map Reduce API.

Fig. 4: Map/Reduce logical data flow

The types of tools typically used in a Big Data scenario:

- Where is the processing hosted? Distributed servers/cloud
- Where is the data stored? Distributed storage (e.g. Amazon S3)
- What is the programming model? Distributed processing (Map Reduce)
- How is the data stored and indexed? High-performance schema-free
  database
- What operations are performed on the data? Analytic/semantic
  processing (e.g. RDF/OWL)

Programming Model:
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
  - Processes an input key/value pair
  - Produces a set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
  - Combines all intermediate values for a particular key
  - Produces a set of merged output values (usually just one)
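
As an illustration of this programming model, the sketch below simulates the two functions in a single process for the maximum-temperature case. It is not Hadoop code and not the project's Java implementation. The "year,temperature" record format, the function names and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def map_record(_offset, line):
    """map(in_key, in_value) -> list of (out_key, intermediate_value).

    Assumes a hypothetical 'year,temperature' line format.
    """
    year, temperature = line.split(",")
    return [(year.strip(), float(temperature))]


def reduce_max(_year, temperatures):
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    return [max(temperatures)]


def run_map_reduce(lines, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            groups[key].append(value)  # the shuffle step groups values by key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Made-up records for illustration only.
records = ["2001,38.5", "2001,41.2", "2002,39.9", "2002,42.1", "2002,40.0"]
result = run_map_reduce(records, map_record, reduce_max)
print(result)
```

Swapping max for min in the reduce function gives the minimum temperature per year. On a real cluster the grouping step is done by the framework across machines, rather than by the in-memory dictionary used here.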

Results:
Berkeley

(Fig) Maximum temperature curve for Berkeley

Year    Predicted Maximum Temperature
2014    42.778537
2015    42.533997
2016    42.258980
2017    41.951812
2018    41.610778
2019    41.234116
2020    40.820017
2021    40.366625
2022    39.872041
2023    39.334313

Delhi

(Fig) Maximum temperature curve for Delhi

Year    Predicted Maximum Temperature
2001    41.830555
2002    42.268227
2003    42.863457
2004    43.644936
2005    44.644318
2006    45.895973
2007    47.437233
2008    49.308640
2009    51.553875
2010    54.219872

HARDWARE AND SOFTWARE REQUIREMENTS


Hardware:
Processor - Core i5
Speed - 2.43 GHz
RAM - 4 GB
Hard Disk - 80 GB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse

Software:
Ubuntu 14.04
SSH Server
Java 6 or greater
Hadoop 1.x

REFERENCES
[1] Hadoop. http://www.cloudera.com/what-is-hadoop/.
[2] Hadoop Overview. http://wiki.apache.org/hadoop/ProjectDescription/
[3] MapReduce. http://hadoop.apache.org/common/docs/mapred_tutorial.html
[4] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., Gravenstein
Highway North, Sebastopol, first edition, June 2009.
[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. Commun. ACM, 51(1):107-113, 2008.

Figures :
Fig. 1 : http://resources.appistry.com/pressappistry-cloudiqstorage-now-generally-available/
Fig. 2 : http://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm
Fig. 3 : http://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
Fig. 4 : http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/