
Hive Query Optimization:

a. Use Tez

set hive.execution.engine=tez;
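
As a quick check that the engine switch works, the setting can also be applied per session from the command line. A minimal sketch (shell), assuming the hive CLI is on the PATH and a table named A (used later in this section) already exists:

hive -e "set hive.execution.engine=tez; SELECT COUNT(*) FROM A;"   # this query should now run as a Tez DAG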

b. Store files as ORCFile

CREATE TABLE A_ORC (
    customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

INSERT INTO TABLE A_ORC SELECT * FROM A;

CREATE TABLE B_ORC (
    customerID int, role string, salary float, department string
) STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

INSERT INTO TABLE B_ORC SELECT * FROM B;

SELECT A_ORC.customerID, A_ORC.name, A_ORC.age, A_ORC.address,
       B_ORC.role, B_ORC.department, B_ORC.salary
FROM A_ORC JOIN B_ORC
ON A_ORC.customerID = B_ORC.customerID;

ORC supports compressed storage (with ZLIB, or with SNAPPY as shown above) as well as uncompressed storage.
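
To verify the compression and layout of the files that were written, the ORC file dump utility can be used. A minimal sketch (shell); the warehouse path shown is an assumption and should be replaced with the actual location of A_ORC:

# Print ORC metadata: stripes, compression kind (SNAPPY), and column statistics
hive --orcfiledump /user/hive/warehouse/a_orc/000000_0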

c. Use Vectorization

Vectorized query execution improves the performance of operations like scans, aggregations, filters
and joins by performing them in batches of 1024 rows at a time instead of a single row at a time.

Introduced in Hive 0.13, this feature significantly improves query execution time and is easily
enabled with two parameter settings:

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;
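
To confirm that a query actually runs vectorized, inspect its explain plan, which should mark the vectorized operators. A minimal sketch (shell), reusing the ORC-backed A_ORC table from above (vectorization requires a columnar input format such as ORC):

# The plan output should contain "Execution mode: vectorized" for the map-side operators
hive -e "set hive.vectorized.execution.enabled=true; set hive.vectorized.execution.reduce.enabled=true; EXPLAIN SELECT age, COUNT(*) FROM A_ORC GROUP BY age;"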

d. Parallel Execution

Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically
make use of this parallelism. However, a single complex Hive query is commonly translated into a
number of MapReduce jobs that are executed sequentially by default. Often, though, some of a
query's MapReduce stages are not interdependent and could be executed in parallel. They can then
take advantage of spare capacity on a cluster and improve cluster utilization while at the same
time reducing the overall query execution time. The configuration in Hive to change this behaviour
is merely a single flag:

set hive.exec.parallel=true;
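
The degree of parallelism can also be capped. A minimal sketch (shell) using hive.exec.parallel.thread.number (default 8) and a query whose two COUNT branches are independent stages, so they can run concurrently; the tables are the ORC tables created above:

hive -e "set hive.exec.parallel=true; set hive.exec.parallel.thread.number=8; SELECT * FROM (SELECT COUNT(*) AS c FROM A_ORC UNION ALL SELECT COUNT(*) AS c FROM B_ORC) t;"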

e. Decrease the split size to increase the mappers

conf.set("mapred.max.split.size", "1024");

Job job = new Job(conf, "My job name");
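
The same idea can be applied at job submission time instead of hard-coding the value in the driver. A minimal sketch (shell), assuming the driver implements Tool/ToolRunner so that -D options are honoured; the jar name, class name and paths are placeholders:

# Cap each split at 128 MB (134217728 bytes) so that large inputs are spread over more mappers
hadoop jar my-job.jar com.example.MyDriver \
  -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
  /input /output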

Set Mappers and Reducers in Hive?

a. Mappers:
We can control the number of mappers by changing the input split size: the smaller the split size,
the more mappers are launched, which can improve query performance.
set mapreduce.input.fileinputformat.split.maxsize=100000;
set mapreduce.input.fileinputformat.split.minsize=100000;

b. Reducers:

There are two options to set the number of reducers. First, we can set the number of reduce tasks
directly, or restrict how much data each reducer handles. Second, we can change the cluster-wide
parameter in mapred-site.xml.

set mapred.reduce.tasks=128 (in Hive)

set hive.exec.reducers.bytes.per.reducer=1000000 (in Hive; by default each reducer handles 1 GB)

mapred.tasktracker.reduce.tasks.maximum (in mapred-site.xml)
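
A minimal sketch (shell) that fixes the reducer count for a single aggregation, reusing the B_ORC table from the ORC example above:

hive -e "set mapred.reduce.tasks=32; SELECT department, AVG(salary) FROM B_ORC GROUP BY department;"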

Set Mappers and Reducers in PIG?

Before Pig 0.8, the number of reducers was determined by your cluster configuration. From
version 0.8 onwards you can control the number of reducers with the PARALLEL keyword. Ex:

sorted = order average by avg desc parallel 50;

OR set default_parallel 50;
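
Putting both options together, here is a minimal end-to-end sketch (shell); the pig binary, the input path /data/averages.csv and its two-column schema are assumptions:

# Write a small Pig script that sets a default parallelism and also overrides it on the ORDER operator
cat > set_parallel.pig <<'EOF'
SET default_parallel 50;
average = LOAD '/data/averages.csv' USING PigStorage(',') AS (name:chararray, avg:double);
sorted  = ORDER average BY avg DESC PARALLEL 50;
STORE sorted INTO '/data/averages_sorted';
EOF
pig set_parallel.pig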

Reason for JAVA Heap Space Error?

Keeping these five steps in mind can save you a lot of headaches and avoid Java heap space
errors (a configuration sketch follows the list):

1. Calculate the memory needed.

2. Check that the JVMs have enough memory for the TaskTracker tasks.

3. Check that the JVM settings are suitable for your tasks.

4. Limit your nodes' use of swap space and paged memory.

5. Set the task attempt slots to a number that is lower than the number calculated by the
JobTracker web GUI.
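
A minimal sketch (shell) of the kind of setting steps 2 and 3 refer to, using the classic MRv1 property that pairs with the TaskTracker; the 1 GB heap and the jar/class names are placeholder assumptions to be derived from your own memory calculation:

# Give every map/reduce child JVM a 1 GB heap for this job (the driver must use ToolRunner for -D to apply)
hadoop jar my-job.jar com.example.MyDriver \
  -D mapred.child.java.opts=-Xmx1024m \
  /input /output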

The Hadoop cluster building blocks are as follows:

Active NameNode: The centerpiece of HDFS, which stores file system metadata and is
responsible for all client operations

Standby NameNode: A secondary NameNode that synchronizes its state with the active
NameNode in order to provide fast failover if the active NameNode goes down

ResourceManager: The global resource scheduler, which directs the slave NodeManager
daemons to perform the low-level I/O tasks

Data Nodes: Nodes that store the data in the HDFS file system and are also known as slaves;
these nodes run the NodeManager process that communicates with the ResourceManager

History Server: Provides REST APIs in order to allow the user to get the status of finished
applications and provides information about finished jobs

http://www.oracle.com/technetwork/articles/servers-storage-admin/hadoop-cluster-solaris-2203962.html

Hadoop 2 Configuration?

dfs.nameservices - the logical name for this new nameservice

<property>

<name>dfs.nameservices</name>

<value>mycluster</value>

</property>

dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice

<property>

<name>dfs.ha.namenodes.mycluster</name>

<value>nn1,nn2</value>

</property>

<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>

<value>machine1.example.com:8020</value>

</property>

<property>

<name>dfs.namenode.rpc-address.mycluster.nn2</name>

<value>machine2.example.com:8020</value>

</property>

<property>

<name>dfs.namenode.http-address.mycluster.nn1</name>

<value>machine1.example.com:50070</value>

</property>

<property>

<name>dfs.namenode.http-address.mycluster.nn2</name>

<value>machine2.example.com:50070</value>

</property>

<property>

<name>dfs.namenode.shared.edits.dir</name>

<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>

</property>

<property>

<name>dfs.client.failover.proxy.provider.mycluster</name>

<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

</property>

<property>

<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>

</property>

<property>

<name>dfs.journalnode.edits.dir</name>

<value>/path/to/journal/node/local/data</value>

</property>
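
With these properties in place, the HA pair is typically brought up in roughly the following order. A rough sketch (shell), assuming the packaged (service-managed) daemons used elsewhere in this document; each command runs on the appropriate host:

# Start the JournalNode daemons on every machine listed in dfs.namenode.shared.edits.dir
sudo service hadoop-hdfs-journalnode start
# Format and start the first NameNode (formatting only on a brand-new cluster)
hdfs namenode -format
sudo service hadoop-hdfs-namenode start
# On the second NameNode, copy the metadata from the first one, then start it
hdfs namenode -bootstrapStandby
sudo service hadoop-hdfs-namenode start
# When converting an existing non-HA NameNode, initialize the shared edits directory instead of formatting
hdfs namenode -initializeSharedEdits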

DFSHAAdmin [-ns <nameserviceId>]

[-transitionToActive <serviceId>]

[-transitionToStandby <serviceId>]

[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]

[-getServiceState <serviceId>]

[-checkHealth <serviceId>]

[-help <command>]

Failure detection - each of the NameNode machines in the cluster maintains a persistent session
in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other
NameNode that a failover should be triggered.

Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as
active. If the current active NameNode crashes, another node may take a special exclusive lock in
ZooKeeper indicating that it should become the next active.

Zookeeper for Namenode High Availability?

<property>

<name>dfs.ha.automatic-failover.enabled</name>

<value>true</value>

</property>

This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:

<property>

<name>ha.zookeeper.quorum</name>

<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>

</property>
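
After adding these settings, the automatic-failover state must be initialized in ZooKeeper once, and a ZKFC daemon started next to each NameNode. A minimal sketch (shell); the init-script name matches the packaged daemons used later in this document:

# Create the znode that stores the failover state (run once from one NameNode host, with the NameNodes stopped)
hdfs zkfc -formatZK
# Then start a ZKFC alongside each NameNode
sudo service hadoop-hdfs-zkfc start
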
Is it important that I start the ZKFC and NameNode daemons in any particular order?

No. On any given node you may start the ZKFC before or after its corresponding NameNode.

What additional monitoring should I put in place?

You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains
running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should
be restarted to ensure that the system is ready for automatic failover.

Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes,
then automatic failover will not function.

What happens if ZooKeeper goes down?

If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue
to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.

Can I designate one of my NameNodes as primary/preferred?

No. Currently, this is not supported. Whichever NameNode is started first will become active. You may
choose to start the cluster in a specific order such that your preferred node starts first.

How can I initiate a manual failover when automatic failover is configured?

Even if automatic failover is configured, you may initiate a manual failover using the same hdfs haadmin
command. It will perform a coordinated failover.

Initiate Failover between 2 Namenodes:

To initiate a failover between two NameNodes, run the command

hdfs haadmin -failover

If nn1 is not the active NameNode, use the hdfs haadmin -failover command to initiate a failover from
nn2 to nn1:

hdfs haadmin -failover nn2 nn1

Start services on nn2?

Start the JournalNode daemon:

$ sudo service hadoop-hdfs-journalnode start

Start the NameNode daemon:


$ sudo service hadoop-hdfs-namenode start

Start the ZKFC daemon:

$ sudo service hadoop-hdfs-zkfc start

Set these services to restart on boot; for example on a RHEL-compatible system:

$ sudo chkconfig hadoop-hdfs-namenode on

$ sudo chkconfig hadoop-hdfs-zkfc on

$ sudo chkconfig hadoop-hdfs-journalnode on

Different types of Namenode high availability?

Enabling High Availability using NFS Shared Edits Directory

Enabling High Availability with Quorum-based Storage

Other hdfs haadmin Commands?

getServiceState - determine whether the given NameNode is Active or Standby

checkHealth - check the health of the given NameNode
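
A minimal sketch (shell) of these commands against the nn1/nn2 NameNode IDs configured earlier:

hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
hdfs haadmin -checkHealth nn1       # returns a non-zero exit code if the NameNode is unhealthy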

HDFS Configuration:

<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/alex/Programs/hadoop-2.2.0/hdfs/datanode</value>
<description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/alex/Programs/hadoop-2.2.0/hdfs/namenode</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
</property>
</configuration>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
<description>NameNode URI</description>
</property>
</configuration>
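
Once hdfs-site.xml and core-site.xml contain the settings above, the filesystem can be formatted and started. A minimal sketch (shell), assuming a single-node setup like the one shown and the standard Hadoop sbin scripts on the PATH:

hdfs namenode -format    # one-time format of the NameNode directory configured above
start-dfs.sh             # start the NameNode and DataNode daemons
hdfs dfsadmin -report    # verify that the DataNode registered and reports capacity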

YARN Configuration?

<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager.alexjf.net</value>
<description>The hostname of the RM.</description>
</property>
</configuration>

<configuration>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
<description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
<description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
<description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
<description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
<description>Number of CPU cores that can be allocated for containers.</description>
</property>
</configuration>
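
A minimal sketch (shell) to start YARN with this configuration and confirm the per-node resources (4096 MB, 4 vcores) that the NodeManager advertises; the node ID is a placeholder taken from the -list output:

start-yarn.sh                # start the ResourceManager and NodeManager daemons
yarn node -list -all         # list registered NodeManagers and their state
yarn node -status <nodeId>   # show memory and vcore capacity for one node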

S3 Integration:

<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_KEY_ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRET_KEY</value>
</property>
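
With the keys in core-site.xml, S3 paths can be used like any other Hadoop filesystem. A minimal sketch (shell); the bucket name is an assumption, and note that s3n is the older connector (recent releases use s3a with fs.s3a.* credential keys):

hadoop fs -ls s3n://my-bucket/input/                      # browse the bucket
hadoop distcp hdfs:///data/logs s3n://my-bucket/backup/   # copy a directory between HDFS and S3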

hdfs fsck <path>
          [-list-corruptfileblocks |
          [-move | -delete | -openforwrite]
          [-files [-blocks [-locations | -racks]]]]
          [-includeSnapshots]

COMMAND_OPTION Description
path Start checking from this path.
-delete Delete corrupted files.
-files Print out files being checked.
-files -blocks Print out the block report
-files -blocks -locations Print out locations for every block.
-files -blocks -racks Print out network topology for data-node locations.
-includeSnapshots Include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it.
-list-corruptfileblocks Print out list of missing blocks and files they belong to.
-move Move corrupted files to /lost+found.
-openforwrite Print out files opened for write.
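
A minimal sketch (shell) of a typical filesystem health check using these options:

hdfs fsck / -files -blocks -locations   # full report: files, their blocks, and the DataNodes holding them
hdfs fsck / -list-corruptfileblocks     # quickly list missing/corrupt blocks, if any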

hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-idleiterations <idleiterations>]

COMMAND_OPTION Description
-policy <policy> datanode (default): Cluster is balanced if each datanode is balanced. blockpool: Cluster is balanced if each block pool in each datanode is balanced.
-threshold <threshold> Percentage of disk capacity. This overwrites the default threshold.
-exclude [-f <hosts-file> | <comma-separated list of hosts>] Excludes the specified datanodes from being balanced by the balancer.
-include [-f <hosts-file> | <comma-separated list of hosts>] Includes only the specified datanodes to be balanced by the balancer.
-idleiterations <iterations> Maximum number of idle iterations before exit. This overwrites the default idleiterations (5).
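
A minimal sketch (shell); a 5% threshold is a common tightening of the 10% default:

hdfs balancer -threshold 5   # move blocks until every DataNode is within 5% of average cluster utilization
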
Users & Group Sync:

1. Create a password-less SSH key on the server:

ssh-keygen -b 4096

2. Copy .ssh/id_rsa.pub to .ssh/authorized_keys2 on the client:

scp ~/.ssh/id_rsa.pub client:.ssh/authorized_keys2

3. Add something like this to your /etc/crontab (or edit with crontab -e):

0 0 * * * scp /etc/{passwd,shadow,group} root@backupbox:/var/mybackupdir

Password less authentication between Hadoop nodes?

1) Install openssh-client on the master

sudo apt-get install openssh-client

2) Install openssh-server on all the slaves

sudo apt-get install openssh-server

3) Generate the ssh key

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

4) Copy the key to all the slaves (replace the username appropriately with the user starting the
Hadoop daemons). You will be prompted for the password.

ssh-copy-id -i $HOME/.ssh/id_rsa.pub username@slave-hostname

5) If the master also acts as a slave (`ssh localhost` should work without a password):

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

How are records processed if they are split between 2 blocks?

So basically, if you have a file with 2 lines of 100 Mb each and, to simplify, the split size is 64 Mb,
then when the input splits are calculated we get the following scenario:

Split 1, containing the path and the hosts for this block, starts at 0 Mb with length 64 Mb.

Split 2 starts at 64 Mb with length 64 Mb.

Split 3 starts at 128 Mb with length 64 Mb.

Split 4 starts at 192 Mb with length 8 Mb.

Mapper A will process split 1: its start is 0, so it does not skip the first line, and it reads a full line
which goes beyond the 64 Mb limit, so it needs a remote read.

Mapper B will process split 2: its start is != 0, so it skips everything up to the first newline after
64 Mb - 1 byte, which is the end of line 1 at 100 Mb and still inside split 2. Only 28 Mb of line 2 lie
in split 2, so the remaining 72 Mb are read remotely.

Mapper C will process split 3: its start is != 0, so it skips everything up to the first newline after
128 Mb - 1 byte, which is the end of line 2 at 200 Mb, i.e. the end of the file, so it does nothing.

Mapper D behaves the same as mapper C, except that it looks for a newline after 192 Mb - 1 byte.
