
Hadoop Capacity Planning and Dimensioning
Preliminary Considerations
1. What is the volume of data for which the cluster is being set up? (For example, 100 TB.) Also clarify whether the volume quoted arrives on a daily, weekly, or monthly basis.
2. What is the frequency or periodicity with which data needs to be ingested, or in layman's terms, the frequency of data arrival?
3. What is the retention policy of the data? (For example, 2 years.) This applies to the operational data store and warehouse, not the archive.
4. What is the data archival policy? For archiving, will the data be stored in compressed or raw format?
5. The kinds of workloads you have:
a. CPU intensive, e.g. queries;
b. I/O intensive, e.g. ingestion;
c. memory intensive, e.g. Spark processing.
d. (For example, 30% of jobs memory and CPU intensive, 70% I/O and medium CPU intensive.)
6. The storage mechanism for the data:
a. plain text/Avro/Parquet/JSON/ORC/etc., or compressed (GZIP, Snappy).
b. (For example, 30% container storage, 70% compressed.)
7. While dimensioning, always assume that up to 25-30% of the total available storage is dedicated to MapReduce processing. This is basically the intermediate output from the mappers. This overhead must be accounted for and subtracted when comparing the available physical storage against the raw storage required for the input data.
8. Consider the space required for OS-related needs. This should also be set at around 25%.
9. Always consider the historical data that needs to be stored in the lake or warehouse at day 0. This is data that does not arrive at the given periodicity (daily, etc.) after the system is set up, but that has been present with the company a priori.
10. Consider separate DataNodes for batch processing and for in-memory processing.
11. Consider CPU cores and tasks per node (see the sketch after this list).
a. A rule of thumb is to use one core per task. If tasks are not that heavy, we can allocate 0.75 core per task.
b. For batch processing, a 2*6-core processor (hyper-threaded) is a minimum.
c. For in-memory processing, a 2*8-core processor is a minimum.
d. For batch processing nodes:
i. Number of cores required for a CPU-heavy job = 1.
ii. Number of cores required for a CPU-medium job = 0.7.
iii. Assume 30% heavy processing jobs and 70% medium processing jobs.
iv. Tasks a batch processing node can handle = (no. of cores * % heavy processing jobs / cores required per heavy job) + (no. of cores * % medium processing jobs / cores required per medium job).
v. 12*0.30/1 + 12*0.70/0.7 = 3.6 + 12 = 15.6 ~ 15 tasks per node. This means we can run 15 tasks per node, which we can split into, say, 8 mappers and 7 reducers per node.
e. As hyper-threading is enabled, if each task uses two threads, we can assume 15*2 ~ 30 tasks per node.
f. For in-memory processing nodes, we assume spark.task.cpus=2 and spark.cores.max=8*2=16. With this assumption, we can concurrently execute 16/2 = 8 Spark tasks per node.
g. Again, as hyper-threading is enabled, the number of concurrent tasks can be calculated as total concurrent tasks = threads per core * 8.
12. RAM requirement for a DataNode depends on the following parameters (see the sketch after this list):
a. DataNode process memory
b. DataNode TaskTracker memory
c. OS memory
d. Number of CPU cores * memory per CPU core
e. Allocate a minimum of 4 GB of memory for each of these components.
13. Consider capacity planning for the NameNode and YARN.
14. Network configuration: since data transfer plays a key role in Hadoop throughput, nodes should be connected at a speed of at least 10 Gb/s.
15. Load profile type: figure out whether the load being considered for processing is CPU intensive or I/O intensive.
16. The Standby NameNode should not be co-located on the NameNode machine and should run on hardware identical to that of the NameNode.
a. This machine should be of enterprise quality, with redundant power and enterprise-grade disks in RAID 1 or 10 configurations.
17. Number of data blocks in the Hadoop file system: the NameNode's memory requirement is directly proportional to the number of blocks (see the NameNode Configuration section below).
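To make the arithmetic in points 11 and 12 concrete, here is a minimal Python sketch. The core counts, job mix, spark.task.cpus value, and the 4 GB-per-component figure are the assumptions stated above; substitute your own numbers.

```python
# Minimal sketch of the tasks-per-node and DataNode RAM arithmetic in points 11-12.
# All inputs are the assumptions stated above; adjust them to your workload mix.

def batch_tasks_per_node(cores, heavy_share=0.30, medium_share=0.70,
                         cores_per_heavy=1.0, cores_per_medium=0.7):
    """Tasks a batch processing node can handle concurrently."""
    return (cores * heavy_share / cores_per_heavy
            + cores * medium_share / cores_per_medium)

def datanode_ram_gb(cores, gb_per_core=4, process_gb=4, tasktracker_gb=4, os_gb=4):
    """RAM = DataNode process + TaskTracker + OS + cores * memory per core."""
    return process_gb + tasktracker_gb + os_gb + cores * gb_per_core

batch_cores = 2 * 6      # 2 x 6-core CPUs on batch nodes
inmem_cores = 2 * 8      # 2 x 8-core CPUs on in-memory nodes

tasks = batch_tasks_per_node(batch_cores)   # 12*0.30/1 + 12*0.70/0.7 = 15.6
print(f"Batch tasks per node: {tasks:.1f} (~15, or ~30 with hyper-threading)")

spark_task_cpus = 2                          # assumed spark.task.cpus
spark_cores_max = inmem_cores                # assumed spark.cores.max = 16
print(f"Concurrent Spark tasks per node: {spark_cores_max // spark_task_cpus}")  # 8

print(f"Batch DataNode RAM: {datanode_ram_gb(batch_cores)} GB")       # 60 GB
print(f"In-memory DataNode RAM: {datanode_ram_gb(inmem_cores)} GB")   # 76 GB
```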

Key Points
• A key consideration for dimensioning Hadoop is always the data storage requirement.
• By default, Hadoop stores 3 copies of the data.
o So if the raw data storage requirement is 100 TB, then the actual physical storage requirement will be 3 * 100 TB = 300 TB.
• JBOD is the file system of choice for DataNodes.
• We do not need to set up the whole cluster on day one. We can scale the cluster up as data grows, starting with 25% of the total nodes and growing to 100%.
• Assume 70% batch processing and 30% streaming or in-memory processing, if not specified by the end user.
• For batch processing nodes, assume 70% medium processing jobs and 30% heavy processing jobs.
• Assume the following storage/compression formats:
o plain text/Avro/Parquet/JSON/ORC/etc., or compressed (GZIP, Snappy).
o (For example, 30% container storage, 70% compressed.)

Assume 30% of the data is in container storage and 70% of the data is in Snappy-compressed Parquet format. From various studies, Parquet with Snappy compression reduces data size by 70-80%.

Storage requirement calculation accounting for compression:

total storage required for data = total storage * % in container storage + total storage * % in compressed format * (1 - expected compression ratio)

Based on Scenario 1 below and using the above formula:

• 30% of the data in container storage
• 70% of the data in Snappy-compressed Parquet format
• expected compression provided by the Parquet/Snappy compression scheme is 70%, as noted above.

600*0.30 + 600*0.70*(1-0.70) = 180 + 420*0.30 = 180 + 126 = 306 TB
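The same calculation, expressed as a small Python sketch using the Scenario 1 assumptions (600 TB after replication and retention, a 30/70 split between container storage and compressed data, and a 70% compression ratio):

```python
# Sketch of the compression-adjusted storage formula above.
# The 600 TB figure and the 30/70 split are the Scenario 1 assumptions.

def storage_with_compression(total_tb, container_share, compressed_share, compression_ratio):
    """total * %container + total * %compressed * (1 - compression ratio)"""
    return (total_tb * container_share
            + total_tb * compressed_share * (1 - compression_ratio))

total_tb = 600   # 100 TB raw * replication factor 3 * 2-year retention
required = storage_with_compression(total_tb, container_share=0.30,
                                    compressed_share=0.70, compression_ratio=0.70)
print(f"Storage required after compression: {required:.0f} TB")   # 180 + 126 = 306 TB
```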

NameNode Configuration

The NameNode role is responsible for coordinating data storage on the cluster, and the
JobTracker for coordinating data processing.

(The Standby NameNode should not be co-located on the NameNode machine for clusters
and will run on hardware identical to that of the NameNode.) Cloudera recommends that
customers purchase enterprise-class machines for running the NameNode and JobTracker,
with redundant power and enterprise-grade disks in RAID 1 or 10 configurations.

The NameNode will also require RAM directly proportional to the number of data blocks in the
cluster. A good rule of thumb is to assume 1GB of NameNode memory for every 1 million blocks
stored in the distributed file system. With 100 DataNodes in a cluster, 64GB of RAM on the
NameNode provides plenty of room to grow the cluster. We also recommend having HA
configured on both the NameNode and JobTracker, features that have been available in the
CDH4 line for some time.
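As a rough illustration of the 1 GB-per-million-blocks rule of thumb, the sketch below estimates NameNode heap from the amount of data stored. The 128 MB block size is an assumption, and the calculation ignores the small-files effect, which can inflate the block count considerably.

```python
# Rough sketch of the "1 GB of NameNode heap per 1 million blocks" rule of thumb.
# A 128 MB block size is assumed; many small files will inflate the real block
# count (and therefore the heap requirement) well beyond this estimate.

def namenode_heap_gb(stored_tb, block_size_mb=128, gb_per_million_blocks=1.0):
    blocks = stored_tb * 1024 * 1024 / block_size_mb   # approximate logical block count
    return blocks / 1_000_000 * gb_per_million_blocks

print(f"~{namenode_heap_gb(478):.1f} GB of heap for 478 TB of data")  # ~3.9 GB, so 64 GB leaves ample headroom
```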

Here are the recommended specifications for NameNode/JobTracker/Standby NameNode nodes. The drive count will fluctuate depending on the amount of redundancy:

• 4-6 1 TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for the JournalNode)
• 2 quad-/hex-/octo-core CPUs, running at least 2-2.5 GHz
• 64-128 GB of RAM
• Bonded Gigabit Ethernet or 10 Gigabit Ethernet

Possible nodes and components to consider when dimensioning:

• NiFi, Kafka, Spark, and MapReduce
• NameNode
• JobTracker
• TaskTracker
• YARN
Scenarios
Scenario 1: Processing 100 TB in 1 Year

Raw input data (A): 100 TB
Replication factor (R): 3
Required storage capacity (S): 3 * 100 TB = 300 TB
Retention policy in years (Rp): 2 years
Storage requirement (Sr): A * R * Rp = 100 * 3 * 2 = 600 TB

If no compression is used, the storage requirement is Sr = 600 TB.

If compression is used (30% of the data in container storage, 70% in Snappy-compressed Parquet, expected compression 70%), the storage requirement is:
600*0.30 + 600*0.70*(1-0.70) = 180 + 126 = 306 TB

Reserved space for intermediate MapReduce processing and OS-related needs (30%):
306 + 306*0.30 = 397.8 TB
(Note that some literature recommends keeping 30% extra space for OS-related activities on top of the MapReduce intermediate output.)

JBOD is used for the DataNodes; JBOD file-system overhead = 20% of overall storage:
397.8 + 397.8*0.20 ~ 478 TB = DS
So the total storage requirement is 478 TB.

Number of DataNodes for the above 478 TB of storage:
In general, the number of DataNodes required is DS / (no. of disks in the JBOD * disk space per disk). Suppose each DataNode has a JBOD of 12 disks, each of 4 TB, giving a capacity of 48 TB per node. The number of required DataNodes is then 478/48 ~ 10.
Another method is to start from a required or enforced node count (the client may say that no more than x nodes are allowed). For example, with a required capacity of 216 TB and a maximum of 12 nodes, the capacity per node would be 216/12 = 18 TB/node in a cluster of 12 nodes. To get 18 TB/node, we can create a JBOD array of 4 disks with 5 TB/HDD, equalling 20 TB/node. So we would have 12 nodes, each with 20 TB of HDD.

DataNode count for batch processing (Hive, MapReduce, Pig, etc.): number of DataNodes * 0.7
Assuming that 70% of the data needs to be processed in batch mode with Hive, MapReduce, etc., 10*0.7 = 7 nodes are dedicated to batch processing. This leaves 3 nodes for in-memory processing.

DataNode count for in-memory processing: 3 nodes, as explained above.

CPU cores and tasks per node:
A rule of thumb is one core per task; if tasks are not that heavy, we can allocate 0.75 core per task. We chose a 2*6-core processor for batch processing and a 2*8-core processor for in-memory processing.
a. For batch processing nodes:
i. Number of cores required for a CPU-heavy job = 1.
ii. Number of cores required for a CPU-medium job = 0.7.
iii. Assume 30% heavy processing jobs and 70% medium processing jobs.
iv. Tasks a batch processing node can handle = (no. of cores * % heavy processing jobs / cores required per heavy job) + (no. of cores * % medium processing jobs / cores required per medium job).
v. 12*0.30/1 + 12*0.70/0.7 = 3.6 + 12 = 15.6 ~ 15 tasks per node.
b. As hyper-threading is enabled, if each task uses two threads, we can assume 15*2 ~ 30 tasks per node.
c. For in-memory processing nodes, we assume spark.task.cpus=2 and spark.cores.max=8*2=16. With this assumption, we can concurrently execute 16/2 = 8 Spark tasks per node.
d. Again, as hyper-threading is enabled, the number of concurrent tasks can be calculated as total concurrent tasks = threads per core * 8.

RAM requirement for a DataNode:
RAM required = DataNode process memory + DataNode TaskTracker memory + OS memory + number of CPU cores * memory per CPU core.
Allocate a bare minimum of 4 GB for each component. Therefore, the RAM required will be:
RAM = 4+4+4+12*4 = 60 GB for batch DataNodes, and RAM = 4+4+4+16*4 = 76 GB for in-memory processing DataNodes.

Network configuration:
As data transfer plays a key role in Hadoop throughput, nodes should be connected at a speed of at least 10 Gb/s.
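To tie the scenario together, here is a minimal Python sketch of the overhead and node-count arithmetic above. The 30% MapReduce/OS reserve, 20% JBOD overhead, 12 x 4 TB disk layout, and 70/30 batch split are the Scenario 1 assumptions, not fixed values.

```python
import math

# Sketch of the Scenario 1 overhead and DataNode-count arithmetic above.
# The 30% MapReduce/OS reserve, 20% JBOD overhead, disk layout, and 70/30
# batch split are the scenario's assumptions; substitute your own figures.

compressed_tb = 306                   # storage after compression (from above)
with_mr_os = compressed_tb * 1.30     # +30% for MapReduce intermediates and OS
total_ds = with_mr_os * 1.20          # +20% JBOD file-system overhead
print(f"Total storage requirement DS: {total_ds:.1f} TB")   # ~477.4 TB (~478 TB)

disks_per_node, tb_per_disk = 12, 4   # JBOD of 12 x 4 TB disks per DataNode
node_capacity_tb = disks_per_node * tb_per_disk              # 48 TB per node
datanodes = math.ceil(total_ds / node_capacity_tb)
print(f"DataNodes required: {datanodes}")                    # 478/48 ~ 10

batch_nodes = round(datanodes * 0.70)   # 70% of nodes for batch processing
inmem_nodes = datanodes - batch_nodes   # remainder for in-memory processing
print(f"Batch nodes: {batch_nodes}, in-memory nodes: {inmem_nodes}")   # 7 and 3
```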
Glossary
JBOD: Just a Bunch of Disks configuration
