2. What is Big Data?
Ans:
Data that is very large in size is called Big Data.
Big Data is a collection of large datasets that cannot be processed using
traditional computing techniques. It is not a single technique or a tool; rather, it
has become a complete subject, which involves various tools, techniques and
frameworks.
3. What is Hadoop?
Ans:
Hadoop is an Apache open source framework written in Java that
allows distributed processing of large datasets across clusters of
computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers.
Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers, namely:
1. Processing/Computation layer (MapReduce), and
2. Storage layer (Hadoop Distributed File System, or HDFS).
There are two types of failover, that is, Graceful Failover and Automatic
Failover.
1. Graceful Failover
Graceful Failover is initiated manually by the Administrator. In Graceful
Failover, even when the active NameNode fails, the system will not
automatically trigger the failover from the active NameNode to the standby
NameNode. The Administrator initiates Graceful Failover, for example, in
the case of routine maintenance.
2. Automatic Failover
In Automatic Failover, the system automatically triggers the
failover from active NameNode to the Standby NameNode.
Fencing:
The HA implementation goes to great lengths to ensure
that the previously active NameNode is prevented from
doing any damage and causing corruption, a method
known as fencing.
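As a sketch of how fencing is configured in practice, the snippet below shows the relevant properties in hdfs-site.xml; the key file path is an example value, not a required location:

```xml
<!-- hdfs-site.xml: illustrative fencing configuration (path is an example) -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```

The sshfence method connects over SSH to the previously active NameNode's host and kills the process, ensuring only one NameNode can write to the shared edit log.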
6. Write a command to move a file from the local disk to HDFS.
Ans:
$ hdfs dfs -put /local-file-path /hdfs-file-path
Part b
The answer to these questions comes from another trend in disk drives: seek
time is improving more slowly than transfer rate. Seeking is the process of
moving the disk’s head to a particular place on the disk to read or write data. It
characterizes the latency of a disk operation, whereas the transfer rate
corresponds to a disk’s bandwidth.
If the data access pattern is dominated by seeks, reading or writing large
portions of the dataset will take longer than streaming through it, which
operates at the transfer rate. On the other hand, for updating a small
proportion of the records in a database, a traditional B-Tree (the data
structure used in relational databases, which is limited by the rate at which
it can perform seeks) works well. For updating the majority of a database,
a B-Tree is less efficient than MapReduce, which uses Sort/Merge to
rebuild the database.
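The seek-versus-stream trade-off can be made concrete with a back-of-the-envelope calculation. The disk figures below (10 ms seek time, 100 MB/s transfer rate, 1 TB dataset of 100-byte records) are assumed round numbers for illustration, not measurements:

```python
# Back-of-the-envelope comparison: seeking to individual records
# vs. streaming the whole dataset (all figures are assumptions).
seek_time = 0.010        # 10 ms per seek
transfer_rate = 100e6    # 100 MB/s sustained transfer
dataset_size = 1e12      # 1 TB dataset
record_size = 100        # 100-byte records

num_records = dataset_size / record_size        # 10 billion records
updated = 0.01 * num_records                    # update just 1% of them

seek_total = updated * seek_time                # one seek per record
stream_total = dataset_size / transfer_rate     # read everything sequentially

print(f"Seeking to update 1% of records: {seek_total:,.0f} s")
print(f"Streaming the entire dataset:   {stream_total:,.0f} s")
```

With these assumed figures, updating even 1% of the records via individual seeks takes about 1,000,000 seconds, while streaming through the entire dataset takes only about 10,000 seconds, which is why Sort/Merge-style rewriting wins once updates touch a large fraction of the data.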
Ans: