Handbook - Installing Hadoop Multi-Node Cluster

This extends the single-node installation steps provided by I&D - Saurabh Bajaj

1. Start the terminal

2. Disable IPv6 on all machines
     sudo pico /etc/sysctl.conf
   Add these lines to the end of the file:
     net.ipv6.conf.all.disable_ipv6 = 1
     net.ipv6.conf.default.disable_ipv6 = 1
     net.ipv6.conf.lo.disable_ipv6 = 1
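The sysctl edit can also be scripted so that it is safe to repeat on every machine. The following is a minimal sketch and not part of the original steps; the function name disable_ipv6 is made up for illustration, and the file to edit is passed as an argument.

```shell
# Append the IPv6-disabling settings to a sysctl file, skipping any key
# that already has an entry (whatever its current value).
# disable_ipv6 is a hypothetical helper name.
disable_ipv6() {
    conf_file=$1
    for key in net.ipv6.conf.all.disable_ipv6 \
               net.ipv6.conf.default.disable_ipv6 \
               net.ipv6.conf.lo.disable_ipv6; do
        # Look for the key at the start of a line, followed by "="
        if ! grep -q "^$key[[:space:]]*=" "$conf_file" 2>/dev/null; then
            echo "$key = 1" >> "$conf_file"
        fi
    done
}

# From a root shell:
#   disable_ipv6 /etc/sysctl.conf
```

After the reboot, cat /proc/sys/net/ipv6/conf/all/disable_ipv6 should print 1 if the setting took effect.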

3. Reboot the system
     sudo reboot

4. Install Java
     sudo apt-get install openjdk-6-jdk openjdk-6-jre

5. Check if ssh is installed; if not, install it:
     sudo apt-get install openssh-server openssh-client

6. Create a group and a user called hadoop
     sudo addgroup hadoop
     sudo adduser --ingroup hadoop hadoop

7. Give the hadoop user full sudo permissions
     sudo visudo
   Add the following line to the file:
     hadoop ALL=(ALL) ALL

8. Set up passwordless ssh for the hadoop user
     su hadoop
     ssh-keygen -t rsa -P ""
   Press Enter when asked for the file name.
     cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
     ssh localhost
   Copy the server's RSA public key from the server to all nodes and append
   it to the authorized_keys file on each node, as in the cat command above.
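Appending the key with cat and >> adds a duplicate entry every time the command is repeated. A small sketch of a repeatable version follows; add_authorized_key is a hypothetical helper name, and both file paths are supplied by the caller.

```shell
# Append a public key to an authorized_keys file only if that exact line
# is not already present. add_authorized_key is a hypothetical helper.
add_authorized_key() {
    pubkey_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -F fixed string, -x whole-line match, -q quiet, -f patterns from file
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    # With the default StrictModes setting, sshd rejects key files that
    # other users can write to
    chmod 600 "$auth_file"
}
```

On each node this could be called as add_authorized_key master_id_rsa.pub "$HOME/.ssh/authorized_keys", where master_id_rsa.pub is an assumed name for the key file copied over from the server.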

9. Make the hadoop installation directory:
     sudo mkdir /usr/local/hadoop

10. Download hadoop:
     cd /usr/local/hadoop
     sudo wget -c http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u2.tar.gz

11. Unpack the tar file
     sudo tar -zxvf /usr/local/hadoop/hadoop-0.20.2-cdh3u2.tar.gz

12. Change ownership and permissions on the hadoop folder, granting all to hadoop
     sudo chown -R hadoop:hadoop /usr/local/hadoop
     sudo chmod 750 -R /usr/local/hadoop

13. Create the HDFS directory (inside the /usr/local/hadoop folder)
     sudo mkdir hadoop-datastore
     sudo mkdir hadoop-datastore/hadoop-hadoop
   These directories are created by root, so repeat the chown from step 12
   afterwards to let the hadoop user write to them.

14. Add the binaries path and the hadoop home to the environment file
     sudo pico /etc/environment
   Set the bin path as well as the hadoop home path, then reload the file:
     source /etc/environment

15. Configure the hadoop-env.sh file
     cd /usr/local/hadoop/hadoop-0.20.2-cdh3u2/
     sudo pico conf/hadoop-env.sh
   Add the following lines:
     export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
     export JAVA_HOME="/usr/lib/jvm/java-6-openjdk"
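Step 14 does not spell out the entries for /etc/environment. As a sketch, assuming the release unpacks to /usr/local/hadoop/hadoop-0.20.2-cdh3u2 (the name in the download URL of step 10), the hypothetical helper below prints the two lines for a given install directory and existing PATH value. /etc/environment holds plain KEY="value" pairs, so the paths are written out in full instead of referring to $HADOOP_HOME.

```shell
# Print the HADOOP_HOME and PATH entries for /etc/environment.
# hadoop_env_lines is a hypothetical helper: $1 is the Hadoop install
# directory, $2 is the PATH value already present in the file.
hadoop_env_lines() {
    hadoop_home=$1
    current_path=$2
    printf 'HADOOP_HOME="%s"\n' "$hadoop_home"
    printf 'PATH="%s:%s/bin"\n' "$current_path" "$hadoop_home"
}

# Example with an assumed existing PATH value
hadoop_env_lines /usr/local/hadoop/hadoop-0.20.2-cdh3u2 \
    /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
```

The printed PATH line is meant to replace the existing PATH line in the file, not to be added as a second one.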

16. Configuring the core-site.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<IP of namenode>:54310</value>
    <description>Location of the Namenode</description>
  </property>
</configuration>
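Since the same core-site.xml has to exist on every node, it can help to generate it from the namenode address instead of editing it by hand. The following is only a sketch: write_core_site and the __NAMENODE__ marker are made-up names, while the property values are the ones shown in step 16.

```shell
# Write a core-site.xml for the given namenode address.
# write_core_site is a hypothetical helper: $1 is the namenode IP or
# hostname, $2 is the output file.
write_core_site() {
    namenode=$1
    out_file=$2
    # The quoted 'EOF' keeps the shell from touching ${user.name}, which
    # Hadoop expands itself; sed fills in the __NAMENODE__ marker.
    sed "s|__NAMENODE__|$namenode|" > "$out_file" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://__NAMENODE__:54310</value>
    <description>Location of the Namenode</description>
  </property>
</configuration>
EOF
}
```

For example, write_core_site 192.168.1.10 conf/core-site.xml, where 192.168.1.10 stands in for the real namenode address.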

17. Configuring the hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.</description>
  </property>
</configuration>

18. Configuring the mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value><IP of job tracker>:54311</value>
    <description>Host and port of the jobtracker.</description>
  </property>
</configuration>

19. Add all the IP addresses in the conf/slaves file
     sudo pico /usr/local/hadoop/hadoop-0.20.2-cdh3u2/conf/slaves
   Add the IP addresses of the machines that will host data nodes to this
   file, one per line.
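The dfs.replication value in hdfs-site.xml and the list in conf/slaves depend on each other: with fewer data nodes than the replication factor, blocks stay under-replicated. A minimal sketch of a consistency check follows; check_replication is a hypothetical helper, and it assumes the slaves file has one address per line with optional # comment lines.

```shell
# Check that dfs.replication does not exceed the number of data nodes
# listed in the slaves file. check_replication is a hypothetical helper.
check_replication() {
    hdfs_site=$1
    slaves=$2
    # Join the file into one line so the match works however the XML is wrapped
    replication=$(tr -d '\n' < "$hdfs_site" |
        sed -n 's|.*<name>dfs\.replication</name>[[:space:]]*<value>\([0-9][0-9]*\)</value>.*|\1|p')
    # Count lines that are neither comments nor blank
    nodes=$(sed -e '/^[[:space:]]*#/d' -e '/^[[:space:]]*$/d' "$slaves" | wc -l | tr -d ' ')
    if [ -z "$replication" ]; then
        echo "dfs.replication not set in $hdfs_site" >&2
        return 2
    fi
    if [ "$replication" -gt "$nodes" ]; then
        echo "dfs.replication=$replication but only $nodes datanode(s) listed" >&2
        return 1
    fi
    echo "ok: replication $replication, $nodes datanode(s)"
}
```

It could be run from the conf directory as check_replication hdfs-site.xml slaves before starting the cluster.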

Note: Commands to start/stop the cluster should be run from the master node.

Known issue in namenode formatting and its resolution:
When you run hadoop namenode -format multiple times, you can run into namespace ID conflicts between the namenode and the datanodes.
Steps to overcome the problem:
  1. Go to the temp dir for hadoop, e.g. /usr/local/hadoop/hadoop-datastore/hadoop-hadoop
  2. It contains two folders (under dfs with the default settings): name for the namenode and data for the datanode.
  3. Inside each folder there is a VERSION file (in the current subfolder), which can be opened in any text editor.
  4. Inside the VERSION file there is a namespaceID entry.
  5. Copy the namespaceID from the VERSION file in the name folder on the machine where the namenode resides.
  6. Open the VERSION file inside the data folder on every machine.
  7. Replace its namespaceID with the one copied from the namenode.
  8. Restart the hadoop cluster.
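The namespace ID fix can be sketched as two small shell functions. The names get_namespace_id and set_namespace_id are made up for illustration, and the sketch assumes the version file is a plain key=value file containing a namespaceID=... line.

```shell
# Print the namespaceID stored in a version file.
# get_namespace_id is a hypothetical helper.
get_namespace_id() {
    sed -n 's/^namespaceID=//p' "$1"
}

# Rewrite the namespaceID line of a version file, keeping the original
# as <file>.bak. set_namespace_id is a hypothetical helper.
set_namespace_id() {
    ns_id=$1
    version_file=$2
    cp "$version_file" "$version_file.bak"
    sed "s/^namespaceID=.*/namespaceID=$ns_id/" "$version_file.bak" > "$version_file"
}
```

On the namenode, get_namespace_id prints the ID to copy; on each datanode, set_namespace_id with that ID and the path of the data folder's version file applies it. Stop the cluster before editing and start it again afterwards.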

Hadoop Commands:
  start-all.sh / stop-all.sh
  start-dfs.sh / stop-dfs.sh
  start-mapred.sh / stop-mapred.sh
  hadoop dfs -ls /<virtual dfs path>
  hadoop dfs -copyFromLocal <local path> <dfs path>
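After start-all.sh, one way to confirm that the daemons came up is to inspect the output of jps, the JVM process lister shipped with the JDK. This check is not part of the original handbook; missing_daemons is a hypothetical helper that reads jps output on standard input and prints every expected daemon that is absent.

```shell
# Read jps output on stdin and print each expected daemon that is not
# running. missing_daemons is a hypothetical helper; the daemon names to
# look for are passed as arguments.
missing_daemons() {
    jps_output=$(cat)
    for daemon in "$@"; do
        # jps prints "<pid> <ClassName>" per JVM; match the class name
        # exactly, directly after the pid
        if ! printf '%s\n' "$jps_output" | grep -q "[0-9] $daemon\$"; then
            echo "$daemon"
        fi
    done
}

# On the master:  jps | missing_daemons NameNode JobTracker
# On a slave:     jps | missing_daemons DataNode TaskTracker
```

No output means all the named daemons were found.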