a. Use Tez
set hive.execution.engine=tez;
ORC supports compressed storage (with ZLIB or, as shown above, SNAPPY) as well as
uncompressed storage.
c. Use Vectorization
Vectorized query execution improves the performance of operations such as scans, aggregations,
filters and joins by performing them in batches of 1024 rows at a time instead of one row at a time.
Introduced in Hive 0.13, this feature significantly reduces query execution time, and is easily
enabled with two parameter settings:
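The two parameter settings usually meant here are the map-side switch plus the reduce-side variant added in a later release:

```sql
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
```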
d. Parallel Execution
Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive
automatically make use of this parallelism. However, a single complex Hive query is commonly
translated into a number of MapReduce jobs that are, by default, executed sequentially. Often,
though, some of a query's MapReduce stages are not interdependent and could be executed in
parallel. They can then take advantage of spare capacity on the cluster and improve cluster
utilization while at the same time reducing overall query execution time. The configuration in
Hive to change this behaviour is merely a single flag: SET hive.exec.parallel=true;
// Programmatically set the maximum split size, in bytes:
conf.set("mapred.max.split.size", "1024");
a. Mappers:
We can control the number of mappers by changing the input split size: the smaller the split size,
the more mappers are launched, which can improve query performance.
set mapreduce.input.fileinputformat.split.maxsize=100000;
set mapreduce.input.fileinputformat.split.minsize=100000;
b. Reducers:
There are two options to set the number of reducers. First, we can set the number of reduce
tasks directly, or restrict the output of each reducer to a smaller size. Second, change the
parameter in mapred-site.xml:
mapred.tasktracker.reduce.tasks.maximum (mapred-site.xml)
Before version 0.8 of Pig, the number of reducers was set by your cluster configuration. From
version 0.8 onwards you can control the number of reducers by using the PARALLEL keyword.
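For the first option, `set mapred.reduce.tasks=16;` fixes the count directly. The Pig PARALLEL keyword looks like this (the relation and field names are hypothetical, for illustration):

```pig
-- Pig 0.8+: set the reducer count per operation with PARALLEL
grouped = GROUP logs BY userid PARALLEL 10;
```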
Keeping these steps in mind can save you a lot of headaches and avoid Java heap space
errors.
Check that the JVMs have enough memory for the TaskTracker tasks.
Check that the JVM settings are suitable for your tasks.
Active NameNode: The centerpiece of HDFS, which stores file system metadata and is
responsible for all client operations
Standby NameNode: A secondary NameNode that synchronizes its state with the active
NameNode in order to provide fast failover if the active NameNode goes down
ResourceManager: The global resource scheduler, which directs the slave NodeManager
daemons to perform the low-level I/O tasks
Data Nodes: Nodes that store the data in the HDFS file system, also known as slaves; these
nodes typically also run the NodeManager process, which communicates with the ResourceManager
History Server: Provides REST APIs in order to allow the user to get the status of finished
applications and provides information about finished jobs
http://www.oracle.com/technetwork/articles/servers-storage-admin/hadoop-cluster-solaris-2203962.html
Hadoop 2 Configuration?
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
dfs.ha.namenodes.[nameservice ID]
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/path/to/journal/node/local/data</value>
</property>
hdfs haadmin
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
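For example, to check which NameNode is currently active, using the nn1/nn2 service IDs configured above:

```shell
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```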
Failure detection - each of the NameNode machines in the cluster maintains a persistent session
in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other
NameNode that a failover should be triggered.
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
Is it important that I start the ZKFC and NameNode daemons in any particular order?
No. On any given node you may start the ZKFC before or after its corresponding NameNode.
You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains
running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should
be restarted to ensure that the system is ready for automatic failover.
Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes,
then automatic failover will not function.
If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue
to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.
No. Currently, this is not supported. Whichever NameNode is started first will become active. You may
choose to start the cluster in a specific order such that your preferred node starts first.
Even if automatic failover is configured, you may initiate a manual failover using the same hdfs haadmin
command. It will perform a coordinated failover.
If nn1 is not the active NameNode, use the hdfs haadmin -failover command to initiate a failover from
nn2 to nn1:
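With the service IDs used above, the command takes the current active first and the desired active second:

```shell
hdfs haadmin -failover nn2 nn1
```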
HDFS Configuration:
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/alex/Programs/hadoop-2.2.0/hdfs/datanode</value>
<description>Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/alex/Programs/hadoop-2.2.0/hdfs/namenode</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
</property>
</configuration>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
<description>NameNode URI</description>
</property>
</configuration>
YARN Configuration?
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager.alexjf.net</value>
<description>The hostname of the RM.</description>
</property>
</configuration>
<configuration>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
<description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
<description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and will be set to this value.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
<description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
<description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
<description>Number of CPU cores that can be allocated for containers.</description>
</property>
</configuration>
S3 Integration:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_KEY_ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRET_KEY</value>
</property>
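Once these keys are set, s3n:// URIs work like any other Hadoop filesystem path (the bucket name below is hypothetical):

```shell
hadoop fs -ls s3n://my-bucket/logs/
hadoop distcp /data/logs s3n://my-bucket/backup/logs
```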
COMMAND_OPTION Description
path Start checking from this path.
-delete Delete corrupted files.
-files Print out files being checked.
-files -blocks Print out the block report.
-files -blocks -locations Print out locations for every block.
-files -blocks -racks Print out network topology for data-node locations.
-includeSnapshots Include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it.
-list-corruptfileblocks Print out list of missing blocks and files they belong to.
-move Move corrupted files to /lost+found.
-openforwrite Print out files opened for write.
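For example, a full health check of the root directory that prints every file, its blocks and their locations:

```shell
hdfs fsck / -files -blocks -locations
```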
hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-idleiterations <idleiterations>]
COMMAND_OPTION Description
-policy <policy> datanode (default): Cluster is balanced if each datanode is balanced. blockpool: Cluster is balanced if each block pool in each datanode is balanced.
-threshold <threshold> Percentage of disk capacity. This overwrites the default threshold.
-exclude -f <hosts-file> | <comma-separated list of hosts> Excludes the specified datanodes from being balanced by the balancer.
-include -f <hosts-file> | <comma-separated list of hosts> Includes only the specified datanodes to be balanced by the balancer.
-idleiterations <iterations> Maximum number of idle iterations before exit. This overwrites the default idleiterations (5).
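For example, to consider the cluster balanced once every datanode is within 5% of average utilization, while skipping two nodes (the hostnames below are hypothetical):

```shell
hdfs balancer -threshold 5 -exclude dn7.example.com,dn8.example.com
```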
Users & Group Sync:
4) Copy the key to all the slaves (replace the username appropriately with the user starting the
Hadoop daemons). You will be prompted for the password.
5) If the master also acts as a slave, `ssh localhost` should work without a password.
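Assuming the key was generated with ssh-keygen, steps 4 and 5 are typically done with ssh-copy-id (the username and hostnames below are hypothetical):

```shell
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
for host in slave1 slave2 slave3; do
  ssh-copy-id hadoop@$host      # prompts for the password once per host
done
ssh-copy-id hadoop@localhost    # if the master also acts as a slave
```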
So basically, if you have two lines of 100Mb each in the same file and, to simplify, a split size of
64Mb, then when the input splits are calculated we get four splits, each recording the path and
the hosts of its block: split 1 (start 0, length 64Mb), split 2 (start 64Mb, length 64Mb), split 3
(start 128Mb, length 64Mb) and split 4 (start 192Mb, length 8Mb).
Mapper A will process split 1: its start is 0, so it does not skip a first line; it reads a full line,
which goes beyond the 64Mb limit and therefore needs a remote read.
Mapper B will process split 2: its start is != 0, so it skips up to the first line end after 64Mb-1 byte,
which is the end of line 1 at 100Mb, still inside split 2; 28Mb of line 2 lie in split 2, so it remote-reads
the remaining 72Mb.
Mapper C will process split 3: its start is != 0, so it skips up to the first line end after 128Mb-1 byte,
which is the end of line 2 at 200Mb, i.e. the end of file, so it does nothing.
Mapper D behaves like mapper C, except that it looks for a newline after 192Mb-1 byte.
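The arithmetic above can be sketched in Python, as a simplified model of how FileInputFormat computes splits and of the LineRecordReader rule that each line ends up emitted by the mapper whose split contains the line's first byte (boundary-exact edge cases are ignored here):

```python
def input_splits(file_size, split_size):
    """Compute (start, length) input splits, as FileInputFormat would."""
    splits, start = [], 0
    while start < file_size:
        length = min(split_size, file_size - start)
        splits.append((start, length))
        start += length
    return splits

def lines_emitted(split, line_ends):
    """Indices of the lines a split's mapper emits.

    Simplified LineRecordReader rule: a mapper whose split start is != 0
    skips the partial first line, so each line is emitted by the split
    containing the line's first byte (the mapper reads past the split end,
    possibly remotely, to finish that line).
    """
    start, length = split
    end = start + length
    emitted, line_start = [], 0
    for i, line_end in enumerate(line_ends):
        if start <= line_start < end:
            emitted.append(i)
        line_start = line_end
    return emitted

# Two 100Mb lines, 64Mb splits (all sizes in Mb)
splits = input_splits(200, 64)   # [(0, 64), (64, 64), (128, 64), (192, 8)]
```

Running `lines_emitted` over the four splits reproduces the text: mapper A (split 1) emits line 1, mapper B (split 2) emits line 2, and mappers C and D emit nothing.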