Volume: businesses can collect enormous amounts of data, so the sheer volume of
data becomes a critical factor in Big Data analytics.
Velocity: new data is generated at a high rate because of our dependence on the
internet, sensors, and machine-to-machine communication, so Big Data must be
parsed in a timely manner.
Variety: the data that is generated is heterogeneous. It can arrive in formats
such as video, text, database records, numeric data, sensor data, and so on, so
understanding the type of Big Data is key to unlocking its value.
Veracity: knowing whether the available data comes from a credible source is of
utmost importance before interpreting and using Big Data for business needs.
Block
Generally, user data is stored in HDFS files. A file is divided into one or more
segments, which are stored on individual data nodes. These file segments are
called blocks. In other words, a block is the minimum amount of data that HDFS can
read or write. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x
and later), and it can be changed in the HDFS configuration as needed.
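The block arithmetic can be illustrated with a minimal Python sketch. It is not HDFS code: the function name split_into_blocks and the 200 MB example file are assumptions made for illustration, and the 64 MB default simply mirrors the figure quoted above. It shows that a file is cut into full-size blocks plus one smaller final block holding the remainder.

```python
MB = 1024 * 1024  # bytes in one megabyte

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into.

    Hypothetical helper for illustration only; it is not part of any Hadoop API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the bytes that are left over.
        sizes.append(remainder)
    return sizes

if __name__ == "__main__":
    sizes = split_into_blocks(200 * MB)
    print(len(sizes), [s // MB for s in sizes])  # 4 [64, 64, 64, 8]
```

With a 128 MB block size the same 200 MB file would need only two blocks, which is why a larger block size reduces the number of blocks to track for very large files.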
Goals of HDFS
Fault detection and recovery − Since HDFS runs on a large number of commodity
hardware components, component failure is frequent. HDFS should therefore have
mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should support hundreds of nodes per cluster to serve
applications that work with huge datasets.
Hardware at data − A requested task can be done efficiently when the computation
takes place near the data. Especially where huge datasets are involved, this
reduces network traffic and increases throughput.
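The idea of running computation near the data can be sketched as a simple placement rule. This is an illustrative toy, not Hadoop's actual scheduler: the function pick_node, the block IDs, and the node names are all hypothetical. It prefers a free node that already stores the requested block and falls back to any free node only when no local copy is available.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Choose a node to process a block, preferring one that stores it.

    block_locations maps a block ID to the list of nodes holding a copy.
    Returns (node, is_local). Hypothetical helper, not a Hadoop API.
    """
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        # Data-local: no block transfer over the network is needed.
        return local[0], True
    if not free_nodes:
        raise RuntimeError("no free nodes available")
    # Fallback: the block must be copied to the chosen node first.
    return sorted(free_nodes)[0], False

if __name__ == "__main__":
    locations = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
    free = {"node-b", "node-d"}
    print(pick_node("blk_1", locations, free))  # ('node-b', True)
    print(pick_node("blk_2", locations, free))  # ('node-b', False)
```

In the second call no free node holds blk_2, so the block would have to travel across the network, which is exactly the traffic that data-local placement avoids.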
Hadoop Distributed File System:
1. Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is designed to store very large data sets
reliably, and to stream those data sets at high bandwidth to user applications.
2. HBase
HBase is a column-oriented database management system that runs on top of HDFS.
3. Hive
The Apache Hive data warehouse software facilitates querying and managing large
datasets residing in distributed storage.