
1. What is big data?

Ans:
Data which is very large in size is called Big Data.
Big data is a collection of large datasets that cannot be processed using
traditional computing techniques. It is not a single technique or a tool; rather, it
has become a complete subject, which involves various tools, techniques and
frameworks.

These data come from many sources such as:

o Social networking sites: Facebook, Google and LinkedIn all generate a huge
amount of data on a day-to-day basis, as they have billions of users
worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge
amounts of logs from which users' buying trends can be traced.
o Weather stations: All the weather stations and satellites give very large
amounts of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends
and accordingly publish their plans, and for this they store the data of their
millions of users.
o Share market: Stock exchanges across the world generate huge amounts of
data through their daily transactions.

3V's of Big Data


1. Velocity: The data is increasing at a very fast rate. It is estimated that the volume of
data will double every 2 years.
2. Variety: Nowadays data is not stored only in rows and columns. Data is structured as
well as unstructured. Log files and CCTV footage are unstructured data. Data which can be
saved in tables is structured data, like the transaction data of a bank.
3. Volume: The amount of data which we deal with is of very large size, of the order of petabytes.

2. What is high-performance computing (HPC)?


Ans:
It is the use of parallel processing for running advanced application
programs efficiently, reliably, and quickly.
The term high-performance computing is occasionally used as a
synonym for supercomputing, although technically a supercomputer
is a system that performs at or near the currently highest operational
rate for computers.

3. What is Hadoop?
Ans:
Hadoop is an Apache open-source framework written in Java that
allows distributed processing of large datasets across clusters of
computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers.
Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.

Hadoop Architecture
At its core, Hadoop has two major layers, namely:

 Processing/Computation layer (MapReduce), and
 Storage layer (Hadoop Distributed File System, HDFS).
4. What is a block in Hadoop?
Ans:
Hadoop HDFS splits large files into small chunks known as blocks. A block is the
physical representation of data. It contains the minimum amount of data that can be
read or written. HDFS stores each file as blocks.
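
As an illustration, here is a minimal Java sketch using the standard
Hadoop FileSystem API that prints the default block size and the block
layout of a file (the HDFS path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points to the HDFS NameNode
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/hdfs-file-path");   // placeholder HDFS path
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));

        // Each BlockLocation describes one block and the DataNodes holding its replicas
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}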

5. What are failover and fencing in big data?


Ans:
Failover refers to the procedure of transferring control to a redundant or
standby system upon the failure of the previously active system.

There are two types of failover, that is, Graceful Failover and Automatic
Failover.

1. Graceful Failover
Graceful Failover is initiated by the Administrator manually. In Graceful
Failover, even when the active NameNode fails, the system will not
automatically trigger the failover from active NameNode to standby
NameNode. The Administrator initiates Graceful Failover, for example, in
the case of routine maintenance.

2. Automatic Failover
In Automatic Failover, the system automatically triggers the
failover from active NameNode to the Standby NameNode.

Fencing:
The HA implementation goes to great lengths to ensure
that the previously active NameNode is prevented from
doing any damage and causing corruption, a method
known as fencing.
6. Write a program to move a file from the local disk to HDFS.
Ans:
$ hdfs dfs -put /local-file-path /hdfs-file-path
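
The same copy can also be done programmatically. Below is a minimal
Java sketch using the Hadoop FileSystem API; the paths are the same
placeholders as in the command above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points to the target HDFS cluster
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -put /local-file-path /hdfs-file-path
        fs.copyFromLocalFile(new Path("/local-file-path"),
                             new Path("/hdfs-file-path"));
        fs.close();
    }
}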

7. Explain any four file formats that are supported by HDFS.


Ans:
The HDFS file formats supported are JSON, Avro, Parquet and Delimited. The format is
specified by setting the storage format value, which can be found on the storage tab
of the Data Store. For all HDFS files, the storage type (JSON, Avro, Parquet) is
defined in the data store. The JSON, Avro and Parquet formats can contain complex
data types, like array or object.

File Format   Reverse Engineer        Complex Type Support   Load into Hive
Avro          Yes (Schema required)   Yes                    Yes (Schema required)
Delimited     No                      No                     Yes
JSON          Yes (Schema required)   Yes                    Yes
Parquet       Yes (Schema required)   Yes                    Yes

Part B

8. What are the challenges in storing big data?


Ans:
A. Sharing and Accessing Data
B. Privacy and Security
C. Analytical Challenges
D. Technical challenges
 Quality of data
 Fault tolerance
 Scalability
9. Why can't we use databases with lots of disks to do large-scale
analysis? Why is Hadoop needed?
Ans:

The answer to these questions comes from another trend in disk drives: seek
time is improving more slowly than transfer rate. Seeking is the process of
moving the disk’s head to a particular place on the disk to read or write data. It
characterizes the latency of a disk operation, whereas the transfer rate
corresponds to a disk’s bandwidth.

If the data access pattern is dominated by seeks, it will take longer to read or
write large portions of the dataset than streaming through it, which operates at
the transfer rate. On the other hand, for updating a small proportion of records
in a database, a traditional B-Tree (the data structure used in relational
databases, which is limited by the rate it can perform seeks) works well. For
updating the majority of a database, a B-Tree is less efficient than MapReduce,
which uses Sort/Merge to rebuild the database.
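
To make the seek-versus-transfer trade-off concrete, the rough
calculation below uses assumed, illustrative figures (10 ms per seek,
100 MB/s transfer rate, 100-byte records, a 1 TB dataset), not
measurements of any particular disk:

public class SeekVsStream {
    public static void main(String[] args) {
        // Assumed, illustrative figures
        double seekTimeSec  = 0.010;   // 10 ms per seek
        double transferBps  = 100e6;   // 100 MB/s sustained transfer rate
        double datasetBytes = 1e12;    // 1 TB dataset
        double recordBytes  = 100;     // 100-byte records

        // Streaming: read the whole dataset sequentially at the transfer rate
        double streamSec = datasetBytes / transferBps;

        // Seeking: one seek per record to update, say, 1% of the records
        double records = datasetBytes / recordBytes;
        double seekSec = 0.01 * records * seekTimeSec;

        System.out.printf("Stream the whole dataset : %.0f s%n", streamSec);
        System.out.printf("Seek to 1%% of the records: %.0f s%n", seekSec);
        // Touching even 1% of the records via seeks (about 1,000,000 s) is
        // far slower than streaming through everything (about 10,000 s),
        // which is why MapReduce favours full scans at the transfer rate.
    }
}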

10. How does a client read data from hdfs?

Ans:

Reading from HDFS seems simple, but it is not. Whenever a client
sends a request to HDFS to read something, access to the data or to
the DataNode where the actual data is stored is not directly granted
to the client, because the client does not have the information about
the data, i.e. on which DataNode the data is stored or where the
replicas of the data are kept on the DataNodes. Without knowing
information about the DataNodes, the client can never access or read
data from HDFS.
So, that’s why the client first sends the request to the NameNode,
since the NameNode contains all the metadata or information we require
to perform a read operation on HDFS. Once the request is received by
the NameNode, it responds and sends all the information, like the
number of DataNodes, the locations where the replicas are kept, the
number of data blocks and their locations, etc., to the client. Now
the client can read the data with all this information provided by
the NameNode. The client reads the data in parallel, since replicas of
the same data are available on the cluster. Once the whole data is
read, it combines all the blocks into the original file.
Components that we have to know before learning the HDFS read
operation:
NameNode: The primary purpose of the NameNode is to manage all the
metadata. As we know, the data is stored in the form of blocks in a
Hadoop cluster. So, on which DataNode or at which location a block
of a file is stored is mentioned in the metadata. The log of the
transactions happening in a Hadoop cluster, when or by whom the data
was read or written, is also stored in the metadata.
DataNode: A DataNode is a program run on a slave system that serves
the read/write requests from the client and is used to store data in
the form of blocks.
HDFS Client: The HDFS client is an intermediate component between
HDFS and the user. It communicates with the DataNode or NameNode
and fetches the essential output that the user requests.
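
A minimal Java sketch of the read path from the client's side, using
the standard FileSystem / FSDataInputStream API (the HDFS path is a
placeholder). The NameNode lookup happens inside open(), and the
returned stream then reads the blocks from the DataNodes:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // HDFS client; talks to the NameNode

        // open() asks the NameNode for the block locations of the file;
        // the returned stream reads the blocks from the DataNodes.
        InputStream in = null;
        try {
            in = fs.open(new Path("/hdfs-file-path"));      // placeholder path
            IOUtils.copyBytes(in, System.out, 4096, false); // stream contents to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}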

11. Explain MapReduce with a suitable example.


Ans:
MapReduce is a framework using which we can write applications to
process huge amounts of data, in parallel, on large clusters of commodity
hardware in a reliable manner.
MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two
important tasks, namely Map and Reduce. Map takes a set of data and
converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Secondly, reduce task, which takes the
output from a map as an input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies,
the reduce task is always performed after the map job.
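
As a concrete example, below is a sketch of the classic word-count job
written against the standard org.apache.hadoop.mapreduce API: the map
step emits a (word, 1) pair for every word in the input, and the
reduce step sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each word in a line, emit (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a job can be run with a command of the form
hadoop jar wordcount.jar WordCount /input /output, where /input and
/output are HDFS directories (the names here are placeholders). For an
input containing "deer bear river car car river", the reducers would
emit pairs such as (car, 2) and (river, 2).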
