Dimensioning
Preliminary Considerations
1. What is the volume of data for which the cluster is being set up? (For example, 100 TB.) It should also be clarified whether the stated volume arrives on a daily, weekly, or monthly basis.
2. What is the frequency or periodicity with which data needs to be ingested, i.e., in layman's terms, the frequency of data arrival?
3. The retention policy of the data. (For example, 2 years.) This applies to the operational data store and the warehouse, not to the archive.
4. What is the data archival policy? For archiving, would the data be stored in compressed or raw format?
5. The kinds of workloads you have:
a. CPU intensive, e.g. queries;
b. I/O intensive, e.g. ingestion;
c. memory intensive, e.g. Spark processing.
d. (For example, 30% of jobs memory and CPU intensive, 70% I/O and medium-CPU intensive.)
6. The storage mechanism for the data:
a. plain text/Avro/Parquet/JSON/ORC/etc., or compressed with GZIP, Snappy, etc.
b. (For example, 30% container storage, 70% compressed.)
7. While dimensioning, always assume that 25-30% of the total available storage is dedicated to MapReduce processing; this is the intermediate output from the mappers. It must be subtracted from the total available physical storage when comparing against the raw storage required for the input data (see the storage sketch after this list).
8. Consider the space required for OS-related needs. This should also be set at around 25%.
9. Always consider the historical data that needs to be stored in the lake or warehouse at day 0. This is data that does not arrive at the given periodicity (daily, etc.) after the system is set up, but has accumulated within the company beforehand.
10. Consider separate DataNodes for batch processing and for in-memory processing.
11. Consider CPU cores and tasks per node (see the tasks-per-node sketch after this list):
a. A rule of thumb is to use one core per task. If tasks are not that heavy, we can allocate around 0.7-0.75 core per task.
b. Number of cores required for a CPU-heavy job = 1; for a CPU-medium job = 0.7.
c. Assume 30% heavy processing jobs and 70% medium processing jobs.
d. Tasks a batch-processing node can handle = (no. of cores * %heavy jobs / cores required per heavy task) + (no. of cores * %medium jobs / cores required per medium task).
e. For example, with 12 cores: 12*0.30/1 + 12*0.70/0.7 = 3.6 + 12 = 15.6, i.e. roughly 15 tasks per node.
f. For the identified number of tasks per node (say 15), divide them between mappers and reducers.
g. As hyperthreading is enabled, if each task uses two threads, we can assume 15*2 ~ 30 tasks per node.
h. For in-memory processing nodes, we assume spark.task.cpus=2 and spark.cores.max=8*2=16. With this assumption, each node can execute 16/2 = 8 Spark tasks concurrently.
12. Network configuration: data transfer plays a key role in Hadoop throughput, so nodes should be connected at around 10 Gbit/s or faster.
13. Load profile type: figure out whether the load being considered for processing is CPU intensive or I/O intensive.
14. The Standby NameNode should not be co-located on the NameNode machine and should run on hardware identical to that of the NameNode.
a. This machine should be of enterprise quality, with redundant power and enterprise-grade disks in a RAID 1 or 10 configuration.
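Below is a minimal Python sketch of the storage reservations from points 7 and 8. The function name and the default percentages (25% for MapReduce intermediate output, 25% for the OS) mirror the rules of thumb above but are illustrative assumptions, not recommendations.

# Hypothetical helper for points 7-8: how much raw DataNode disk capacity
# is left for HDFS once MapReduce scratch space and OS space are reserved.
def usable_hdfs_storage(total_physical_tb: float,
                        mapreduce_scratch: float = 0.25,   # 25-30% per point 7
                        os_reserve: float = 0.25) -> float:  # ~25% per point 8
    return total_physical_tb * (1 - mapreduce_scratch - os_reserve)

print(usable_hdfs_storage(1000))  # 1000 TB of physical disk -> 500.0 TB for HDFS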
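And a sketch of the tasks-per-node arithmetic from point 11, using the same assumed numbers as the worked example there (1 core per heavy task, 0.7 core per medium task, a 30/70 heavy/medium mix, and spark.task.cpus=2 with spark.cores.max=16).

# Tasks a batch-processing node can handle (point 11d/e).
def tasks_per_node(cores: int,
                   heavy_share: float = 0.30, cores_per_heavy: float = 1.0,
                   medium_share: float = 0.70, cores_per_medium: float = 0.7) -> float:
    return (cores * heavy_share / cores_per_heavy
            + cores * medium_share / cores_per_medium)

# Concurrent Spark tasks per in-memory node (point 11h).
def concurrent_spark_tasks(spark_cores_max: int = 16, spark_task_cpus: int = 2) -> int:
    return spark_cores_max // spark_task_cpus

batch_tasks = int(tasks_per_node(12))  # 12 cores -> 15.6, roughly 15 tasks
print(batch_tasks)                     # 15
print(batch_tasks * 2)                 # ~30 with hyperthreading (2 threads per task)
print(concurrent_spark_tasks())        # 8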
Key Points
The key consideration for dimensioning Hadoop is always the data storage requirement.
By default, Hadoop stores 3 copies of any data written to it.
o So if the raw data storage requirement is 100 TB, then the actual physical storage requirement will be 3 * 100 TB = 300 TB.
JBOD (just a bunch of disks) is the disk configuration of choice for DataNodes.
We do not need to set up the whole cluster on day one. We can scale the cluster up as data grows from small to big, starting with 25% of the total nodes and growing to 100%.
Assume 70% batch processing and 30% streaming (in-memory) processing, if not specified by the end user.
For batch processing nodes, assume 70% medium processing jobs and 30% heavy processing jobs.
Assume 30% of the data is in container storage and 70% is in a Snappy-compressed Parquet format.
Various studies have found that Parquet with Snappy compresses data by 70-80%.
total storage required for data = total storage * % in container storage + total storage * % in compressed format * (1 - expected compression)
For example, with 600 TB: 600*0.30 + 600*0.70*(1 - 0.70) = 180 + 126 = 306 TB.
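A quick sketch of this arithmetic. The 600 TB input is the figure from the example above (assumed here to already include replication), and the 70% saving is the Parquet/Snappy compression figure just quoted.

# Physical storage needed when part of the data stays uncompressed
# (container storage) and the rest is compressed.
def storage_required_tb(total_tb: float,
                        container_share: float = 0.30,
                        compressed_share: float = 0.70,
                        compression_saving: float = 0.70) -> float:
    return (total_tb * container_share
            + total_tb * compressed_share * (1 - compression_saving))

print(storage_required_tb(600))  # 180 + 126 = ~306 TB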
NameNode Configuration
The NameNode role is responsible for coordinating data storage on the cluster, and the
JobTracker for coordinating data processing.
(The Standby NameNode should not be co-located on the NameNode machine and will run on hardware identical to that of the NameNode.) Cloudera recommends that customers purchase enterprise-class machines for running the NameNode and JobTracker, with redundant power and enterprise-grade disks in RAID 1 or 10 configurations.
The NameNode will also require RAM directly proportional to the number of data blocks in the
cluster. A good rule of thumb is to assume 1GB of NameNode memory for every 1 million blocks
stored in the distributed file system. With 100 DataNodes in a cluster, 64GB of RAM on the
NameNode provides plenty of room to grow the cluster. We also recommend having HA
configured on both the NameNode and JobTracker, features that have been available in the
CDH4 line for some time.
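A rough sketch of the 1 GB-per-million-blocks rule. Approximating the block count as data size divided by block size is itself an assumption; it only holds for large files, since many small files inflate the real block count.

# Estimate NameNode heap from total data size and HDFS block size.
def namenode_heap_gb(data_tb: float, block_size_mb: int = 128) -> float:
    blocks = data_tb * 1024 * 1024 / block_size_mb  # TB -> MB, then per block
    return blocks / 1_000_000                       # 1 GB per million blocks

print(namenode_heap_gb(300))  # ~2.46 GB of heap for 300 TB in 128 MB blocks,
                              # so 64 GB leaves ample headroom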
4–6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1
for Apache ZooKeeper, and 1 for Journal node)
2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
64-128GB of RAM
Bonded Gigabit Ethernet or 10Gigabit Ethernet
RAM requirement for a DataNode:
RAM required = DataNode process memory + DataNode TaskTracker memory + OS memory + (CPU core count * memory per CPU core)
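A sketch of this formula; the default numbers below are purely illustrative assumptions, not sizing recommendations.

# DataNode RAM = process memory + TaskTracker memory + OS memory
#                + cores * memory per core
def datanode_ram_gb(process_gb: float = 4, tasktracker_gb: float = 4,
                    os_gb: float = 8, cores: int = 12,
                    gb_per_core: float = 4) -> float:
    return process_gb + tasktracker_gb + os_gb + cores * gb_per_core

print(datanode_ram_gb())  # 4 + 4 + 8 + 12*4 = 64 GB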
JBOD: Just a Bunch of Disks configuration.