
Meet Hadoop

Data Sizes
1024 bytes = 1 KB    KB    Kilobyte
1024 KB    = 1 MB    MB    Megabyte
1024 MB    = 1 GB    GB    Gigabyte
1024 GB    = 1 TB    TB    Terabyte
1024 TB    = 1 PB    PB    Petabyte
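As a quick check on these units, one petabyte expands to 1024^5 = 2^50 bytes, or 1,125,899,906,842,624 bytes, which is just over a million gigabytes.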

Why Hadoop?
The volume of data being generated has grown tremendously, so storing and analysing it within a reasonable time has become a problem. Compared with earlier years, data sizes have outpaced read and write speeds, which is why a framework that handles data in parallel is so useful.
It took about twenty years for drive transfer speeds to rise from roughly 4.4 MB/s to around 100 MB/s, which is modest next to the growth in drive capacity from about 1,370 MB to today's multi-terabyte disks.
So we need to read and write data in parallel across many disks to attain reasonable speeds. Any framework that does this must also guard against data loss, since hardware failures become routine at that scale. Hadoop satisfies both requirements: it spreads reads and writes across a cluster, and HDFS replicates every block (three copies by default), so data loss is very unlikely.
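As a rough illustration of why parallelism matters: reading 1 TB from a single drive at 100 MB/s takes about 10,000 seconds, close to three hours, while the same terabyte spread over 100 drives read in parallel takes under two minutes. The snippet below is a minimal sketch of how a client could write a file to HDFS and control its replication factor through the standard org.apache.hadoop.fs API; it assumes a reachable HDFS cluster, and the file path is purely illustrative.

// Minimal sketch, assuming a reachable HDFS cluster and default configuration;
// the path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.replication already defaults to 3; setting it here is only for clarity.
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/example/sample.txt");   // hypothetical path

    // Write a small file; HDFS pipelines each block to three datanodes.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs");
    }

    // The replication factor of an existing file can also be changed per file.
    fs.setReplication(file, (short) 3);

    System.out.println("Replication factor: "
        + fs.getFileStatus(file).getReplication());
  }
}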

Comparison with RDBMS


             Traditional RDBMS            MapReduce
Data Size    GB                           PB
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear


Hadoop and its Ecosystem


The different Hadoop projects are:
1) Common: Components and interfaces for distributed file systems and general I/O.
2) Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
3) MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines (see the word-count sketch after this list).
4) HDFS: A distributed file system that runs on large clusters of commodity machines.
5) Pig: A dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
6) Hive: A distributed data warehouse that provides an SQL-like query language for data stored in HDFS.
7) HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries.
8) ZooKeeper: A distributed, highly available coordination service.
9) Sqoop: A tool for efficiently moving data between relational databases and HDFS.
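Item 3 above describes MapReduce only in the abstract. As a concrete illustration, here is a minimal sketch of the classic word-count job written against Hadoop's org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs and the reducer sums them. The input and output paths come from the command line; the class names are illustrative, not part of any particular distribution.

// Classic word-count sketch using the org.apache.hadoop.mapreduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts received for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The reducer is reused as a combiner here, which is safe because summing counts is associative and commutative.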

Goutam Tadi goutamhadoop@gmail.com
