You are on page 1of 2

Structured Data:

Data that resides in a fixed field within a record or file is called structured
data\

Semi structured Data:


it a form of structured data but not conform with the formal structure of data m
odels associates with the relational database or other form of data tables,
but nonetheless contains tags or other markers to separate semantic elements an
d enforce hierarchies of records and fields within the data
unstructured Data:
refers to information that either does not have a pre-defined data model or is n
ot organized in a pre-defined manner. Unstructured information is
typically text-heavy, but may contain data such as dates, numbers, and facts as
well.

BigData:
Big data is a broad term for data sets so large or complex that traditional data
processing applications are inadequate.
Big data describes a massive volume of structured and unstructured data that is
so large
that it's difficult to process using traditional database techniques.

hadoop:
hadoop is a framework that allows for distributed processing of large data sets
across cluster of commodity hardware using simple programming model.

HDFS:
HDFS in file sysytem desgined for storing a large data with streaming data acce
ss pattern, running on commodity hardware.

The term "secondary name-node" is somewhat misleading. It is not a name-node in


the sense that data-nodes cannot connect to the secondary name-node,
and in no event it can replace the primary name-node in case of its failure.
The only purpose of the secondary name-node is to perform periodic checkpoints.
The secondary name-node periodically downloads current name-node
image and edits log files, joins them into new image and uploads the new image
back to the (primary and the only) name-node. See User Guide.
So if the name-node fails and you can restart it on the same physical node then
there is no need to shutdown data-nodes,
just the name-node need to be restarted. If you cannot use the old node anymore
you will need to copy the latest image somewhere else.
The latest image can be found either on the node that used to be the primary bef
ore failure if available; or on the secondary name-node.
The latter will be the latest checkpoint without subsequent edits logs, that is
the most recent name space modifications may be missing there.
You will also need to restart the whole cluster in this case.
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of node
s> * mapreduce.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transfering map ou
tputs as the maps finish. With 1.75 the faster nodes will finish their first rou
nd of reduces and launch a second wave of reduces doing a much better job of loa
d balancing.
Increasing the number of reduces increases the framework overhead, but increases
load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few
reduce slots in the framework for speculative-tasks and failed tasks.

You might also like