
A Review Paper on Big Data

Authors
Sanket Kamble
Rishikesh Yadav
Jalaj Khandelwal
Guided by
Kirti Darade

Abstract

Big data can be defined as the handling, storage, and management of data on a very large scale. The set of techniques computer systems use to store and manage the data created by users at this scale is called the Hadoop ecosystem. The Hadoop ecosystem helps to store, analyse, and arrange data in a well-organized way using MapReduce, and also keeps a record of the location (path) of the stored data. Data is generated from many different sources and can arrive in the system at a high rate, which can overload it. Parallelism is the process used to handle these large amounts of data in an inexpensive and effective way. After processing, this data can be used in a number of ways, such as surveys and research. The Hadoop ecosystem is cost friendly and uses techniques that also guard against loss or corruption of data.
MapReduce is the framework that helps the Hadoop ecosystem arrange data in a well-organized way, while HDFS is the part of the Hadoop ecosystem responsible for storing the data.

Keywords
Hadoop, MapReduce, parallelism, HDFS, Hadoop ecosystem

Introduction

Big data
As the term suggests, big data means data in huge amounts created by people. When we connect our devices to the internet they start creating data: the applications we use, such as Facebook and YouTube, exchange updates and feeds, and this exchange of information creates data in databases as well as on our own devices. Now suppose one device creates 1 GB of data in a day; in India alone we would be dealing with more than 30,00,00,000 (300 million) devices, which would create an amount of data that is hard to even conceive of, and this is only a hypothetical figure.
Dealing with such a huge amount of data is not possible with traditional techniques, so we use the Hadoop ecosystem, which solves the problem of storing such huge amounts of data and also arranges it in a systematic way so that it can be put to use in a number of ways.
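The rough scale mentioned above can be checked with simple arithmetic. The figures used here (1 GB per device per day, 300 million devices) are the paper's own hypothetical numbers, not measurements:

```python
# Hypothetical figures from the text: 1 GB per device per day,
# and 30,00,00,000 (300 million) devices in India.
gb_per_device_per_day = 1
devices = 300_000_000

total_gb_per_day = gb_per_device_per_day * devices
total_pb_per_day = total_gb_per_day / 1_000_000  # 1 PB = 1,000,000 GB

print(f"~{total_pb_per_day:,.0f} PB of data per day")
```

Even under these toy assumptions the total comes to hundreds of petabytes per day, which is why traditional single-machine storage cannot cope.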
The 5 V's of Big Data

A. Data Volume
As internet use and the number of devices increase, the exchange of data also increases, and so does the amount of data created; this amount is known as the volume. The data created is growing from megabytes and gigabytes to petabytes, and is expected to grow to zettabytes.

B. Data Variety
Data variety refers to the variability of data: it can come in different formats such as pictures, videos, documents, and files. Data is classified as structured, semi-structured, or unstructured.
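The three categories can be illustrated with a small sketch; the records below are made-up examples, not data from any real system:

```python
import json

# Illustrative (made-up) records, one per category of data variety.
structured = {"user_id": 101, "name": "Asha", "age": 24}  # fixed schema, e.g. an RDBMS row
semi_structured = json.loads('{"user": "Asha", "tags": ["big data", "hadoop"]}')  # self-describing JSON
unstructured = "Watched a great video about the Hadoop ecosystem today!"  # free text, no schema

for label, sample in [("structured", structured),
                      ("semi-structured", semi_structured),
                      ("unstructured", unstructured)]:
    print(f"{label:16} -> {type(sample).__name__}")
```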

C. Data Velocity
Velocity in big data refers to data coming from various sources, and to the flow of huge amounts of data at a high rate. Consider the data created by social media: every 60 seconds there are 1,000,000+ tweets, 659,000+ Facebook status updates, and so on.

D. Data Value
The data created in big data should be valuable, or useful in some way. Through data mining and data analysis, the valuable data can be pinpointed and extracted from this huge amount of data.

E. Big Data Veracity
Big data veracity concerns the trustworthiness of data: lost or misplaced data can lead to incorrect data, which in turn leads to confusion.

The Hadoop ecosystem helps us overcome all five V's, as Hadoop provides a platform to store this big data in a well-organized, systematic manner so that the data collected can be put to useful work. Hadoop is further divided into two parts: HDFS (storage) and MapReduce (processing).
1] HDFS (Hadoop Distributed File System)
HDFS lets you store data in a distributed fashion, for which it uses a master-slave structure consisting of a name node (master) and data nodes (slaves).

a] Name node
The name node contains the metadata about the data nodes, i.e. which data is saved on which data node.

b] Data node
A data node contains the actual data. Data nodes run on commodity hardware, and the data saved on them is replicated; the default replication factor is 3. Because this data is stored on commodity hardware, whose failure rate is high, the replication ensures that even if a machine fails we will not lose the data.
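The placement of replicas described above can be sketched in a few lines. The node names, block IDs, and round-robin placement here are illustrative simplifications, not the real HDFS API or its actual placement policy:

```python
REPLICATION_FACTOR = 3  # the HDFS default mentioned above

data_nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical commodity machines
name_node_metadata = {}                    # block id -> data nodes holding a replica

def place_block(block_id, start):
    """Place REPLICATION_FACTOR replicas on distinct data nodes (round-robin for illustration)."""
    name_node_metadata[block_id] = [
        data_nodes[(start + i) % len(data_nodes)] for i in range(REPLICATION_FACTOR)
    ]

for i, block in enumerate(["blk_0001", "blk_0002"]):
    place_block(block, i)

# Even if one machine fails, every block still has surviving replicas elsewhere.
failed = "dn1"
for block, replicas in name_node_metadata.items():
    survivors = [dn for dn in replicas if dn != failed]
    print(block, "->", survivors)
```

This is why a replication factor of 3 tolerates individual hardware failures: at least two copies of every block survive the loss of any single node.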

c] Secondary name node
The secondary name node is a helper of the name node: it updates the fsimage. First it takes the previous fsimage and the edit logs from the name node, then it combines them to form a new fsimage.

d] fsimage
The fsimage is simply a snapshot of the metadata: the metadata that has been collected is captured as an image, and this image is the fsimage.
e] Edit logs
The edit logs contain the information about the edits or changes made to the metadata since the last fsimage.
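The checkpoint step described above, where the secondary name node merges the fsimage with the edit logs, can be sketched like this. The dictionary-based metadata and the operation names are simplifications for illustration, not the real on-disk formats:

```python
# Simplified sketch of a secondary-name-node checkpoint.
# fsimage: snapshot of the metadata; edit_log: changes made since that snapshot.
fsimage = {"/data/a.txt": "blk_0001", "/data/b.txt": "blk_0002"}
edit_log = [
    ("add", "/data/c.txt", "blk_0003"),
    ("delete", "/data/a.txt", None),
]

def checkpoint(image, log):
    """Apply the edit log to the old fsimage, producing a new fsimage."""
    new_image = dict(image)
    for op, path, block in log:
        if op == "add":
            new_image[path] = block
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)  # the edit log can now start afresh from this new snapshot
```

Merging keeps the edit log short, so the name node can rebuild its metadata quickly after a restart.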

2] YARN
YARN (Yet Another Resource Negotiator) is the brain of Hadoop. It is responsible for resource management and task scheduling, and it consists of the resource manager and the node managers.


a] Resource manager
The resource manager receives processing requests from the client and then forwards them to the respective node manager.

b] Node manager
The node manager contains two components, the app master and the container. It is responsible for executing tasks on every single data node, and it runs on the same machine as that data node.

The node manager and the resource manager work together in the YARN architecture in the master-slave form mentioned above. It works like this: first the client sends a processing request to the resource manager; the resource manager forwards it to the respective node manager; then the container and the app master work together to get the task done. If they require any further resources, the app master sends a request to the resource manager, which provides them. The resource manager also monitors the processing by taking constant status updates.
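The request flow described above can be sketched as a toy simulation. The class and method names here are illustrative stand-ins, not the actual YARN API, and real scheduling is far more involved than picking the first node:

```python
class NodeManager:
    """Runs a container to execute a task on one data node (simplified sketch)."""
    def __init__(self, name):
        self.name = name

    def launch_container(self, task):
        # The container and app master cooperating are collapsed into one call here.
        return f"{self.name} ran {task}"

class ResourceManager:
    """Receives client requests, forwards them, and collects status updates."""
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.status_log = []

    def submit(self, task):
        nm = self.node_managers[0]        # trivial stand-in for real scheduling
        result = nm.launch_container(task)
        self.status_log.append(result)    # the constant status updates described above
        return result

rm = ResourceManager([NodeManager("node1"), NodeManager("node2")])
print(rm.submit("word-count job"))
```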

3] MapReduce
MapReduce is a software framework that helps in processing large amounts of data, and it also enables parallel processing. It consists of two parts:
i) Map part
ii) Reduce part
The stored data first gets processed in the map part, where a key-value pair is generated for each piece of data. This data is then sent to the reduce part, which groups similar data together, arranging it in a well-organized way.
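The map and reduce phases described above can be illustrated with the classic word-count example. This is a local, single-machine sketch of the idea, not Hadoop's actual Java API, and the shuffle step between the two phases is made explicit here:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a key-value pair (word, 1) for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, so the reduce phase sees similar data together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each group into a single result (here, a word count)."""
    return {word: sum(counts) for word, counts in groups.items()}

data = ["big data big hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle(map_phase(data)))
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 2, 'stores': 1}
```

In Hadoop the map and reduce functions run in parallel across many data nodes, but the key-value/group-and-combine structure is exactly this.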

Conclusion
Nowadays data is created in huge amounts. This data has to be stored in an organized way, and the data created can also be used in a number of profitable ways, as it provides valuable information. By using the Hadoop ecosystem, the data created can be stored in a well-organized way and can also be processed very easily.

