
NAME : DEEPTI AGRAWAL
REGISTRATION NO : 201081010
FINAL YEAR BTECH
BRANCH : IT
SUBJECT : BDA LAB

‭Big Data Analytics Lab Assignment No. 1‬

Aim : Compare different versions of Hadoop (Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x).
Also set up a Hadoop single-node cluster.

Theory :

‭1.‬ ‭Hadoop Distributed File System (HDFS):‬

Overview: HDFS is a distributed file system that provides a reliable and scalable storage infrastructure for Hadoop. It is designed to store and manage very large files across multiple nodes in a Hadoop cluster.

Key Features:

Data Replication: HDFS replicates data across multiple nodes to ensure fault tolerance. The default replication factor is three, meaning each piece of data is stored on three different nodes.

Scalability: HDFS can scale horizontally by adding more nodes to the cluster, accommodating the storage needs of big data applications.

Accessibility: HDFS provides high-throughput access to application data and is suitable for applications with large datasets.
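
For example, once a cluster is running, a file can be copied into HDFS and its block layout inspected from the shell (the file sample.txt and the /data directory here are illustrative):

hdfs dfs -mkdir -p /data

hdfs dfs -put sample.txt /data/

hdfs dfs -ls /data              # the second column of the listing is the replication factor

hdfs fsck /data/sample.txt -files -blocks   # shows how the file is split into replicated blocks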

‭2.‬ ‭Yet Another Resource Negotiator (YARN):‬

Overview: YARN is the resource management layer of Hadoop, responsible for managing and allocating resources in a Hadoop cluster. It allows multiple applications to share resources efficiently.

Key Components:

ResourceManager: Manages and allocates resources to various applications in the cluster.

NodeManager: Runs on individual nodes and is responsible for managing resources on that node.

Benefits:

Efficient Resource Utilization: YARN allows dynamic allocation of resources, ensuring that the available resources are utilized optimally.

Multi-Tenancy: Multiple applications can coexist on the same Hadoop cluster without interfering with each other.
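
As an illustration, both components can be inspected from the shell once YARN is running:

yarn node -list         # NodeManagers registered with the ResourceManager

yarn application -list  # applications currently sharing the cluster's resources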

‭3.‬ ‭MapReduce:‬

Overview: MapReduce is a programming model and processing engine for distributed computing in Hadoop. It allows the processing of large datasets in parallel across a Hadoop cluster.

Key Components:

Mapper: Processes input data and generates key-value pairs.

Reducer: Aggregates and processes the intermediate key-value pairs produced by the mappers.

Workflow:

Map Phase: Input data is divided into smaller chunks and processed by individual mappers.

Shuffle and Sort Phase: Intermediate key-value pairs are shuffled and sorted based on keys.

Reduce Phase: Reduce tasks process the sorted data and produce the final output.

Scalability: MapReduce enables horizontal scalability by distributing tasks across multiple nodes, making it suitable for processing vast amounts of data.
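
For example, the word-count job bundled with Hadoop exercises all three phases; the /input and /output HDFS paths here are illustrative, and the jar path assumes the install location used in the setup below:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /input /output

hdfs dfs -cat /output/part-r-00000   # the final output produced by the reducer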

‭4.‬ ‭Hadoop Common:‬


Overview: Hadoop Common provides the foundational utilities, libraries, and APIs that support the other Hadoop modules. It ensures compatibility and interoperability among the various components of the Hadoop ecosystem.

Key Components:

Java libraries: Core libraries for file systems, networking, and utilities.

Hadoop Distributed Shell: A framework for running distributed applications on Hadoop clusters.

Role: Hadoop Common acts as the glue that binds the different components together, providing a common set of tools and rules for seamless integration and communication within the Hadoop ecosystem.
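
For instance, the common utilities are exposed through the hadoop command itself:

hadoop version    # print version and build information

hadoop classpath  # print the classpath through which the other modules load the common libraries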

‭Comparison‬ ‭of‬ ‭different‬ ‭versions‬ ‭of‬ ‭Hadoop‬ ‭(Hadoop‬ ‭1.x,‬ ‭Hadoop‬ ‭2.x,‬
‭and Hadoop 3.x)‬
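
Hadoop 1.x: MapReduce handles both data processing and cluster resource management through a single JobTracker and per-node TaskTrackers. Only MapReduce applications can run, a single NameNode is a single point of failure, and clusters scale to roughly 4,000 nodes.

Hadoop 2.x: YARN separates resource management from data processing, so non-MapReduce frameworks can share the cluster. It adds NameNode High Availability and HDFS Federation, replaces the fixed map/reduce slots of 1.x with containers, and scales to roughly 10,000 nodes.

Hadoop 3.x: Requires Java 8 or later. It adds HDFS erasure coding, which reduces the storage overhead of fault tolerance from the 200% of 3x replication to about 50%, supports more than two NameNodes, moves default service ports out of the ephemeral range (the NameNode web UI moves from 50070 to 9870), and adds the intra-DataNode balancer and YARN Timeline Service v2.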
‭Setup of Hadoop Single Node Cluster :‬

1. Install the JRE and JDK (OpenJDK 11)

‭sudo apt install default-jdk default-jre -y‬

‭2. Install openssh-client and openssh-server‬

‭sudo apt-get install openssh-client‬

‭sudo apt-get install openssh-server‬


3. Download the Hadoop binary release and extract the archive

wget https://downloads.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz

tar -xzf hadoop-3.3.6.tar.gz

4. Move the extracted folder to /usr/local/hadoop (the install path referenced in the configuration below)

sudo mv hadoop-3.3.6 /usr/local/hadoop


5. In order to use Java and Hadoop from any folder, we need to add their paths to the ~/.bashrc file, whose contents are executed every time a new shell session starts:

export JAVA_HOME=/usr/lib/jvm/default-java

‭export PATH=$PATH:$JAVA_HOME/bin‬

‭export HADOOP_INSTALL=/usr/local/hadoop‬

‭export PATH=$PATH:$HADOOP_INSTALL/bin‬

‭export PATH=$PATH:$HADOOP_INSTALL/sbin‬

‭export HADOOP_MAPRED_HOME=$HADOOP_INSTALL‬

‭export HADOOP_COMMON_HOME=$HADOOP_INSTALL‬

‭export HADOOP_HDFS_HOME=$HADOOP_INSTALL‬

‭export YARN_HOME=$HADOOP_INSTALL‬

‭export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native‬

‭export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"‬
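
After saving the file, apply the changes to the current shell session:

source ~/.bashrc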

6. To change the port on which SSH runs (in this case port 2222), edit /etc/ssh/sshd_config
7. For passwordless SSH authentication, do the following:

‭a. ssh-keygen -t rsa -P ""‬

‭b. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys‬

‭c. ssh localhost‬


8. Next, Hadoop needs to be configured by editing the following files located in /usr/local/hadoop/etc/hadoop

a. hadoop-env.sh : export JAVA_HOME=/usr/lib/jvm/default-java

b. core-site.xml :

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000/</value>
  </property>
</configuration>

c. yarn-site.xml :

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

d. mapred-site.xml :

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

e. hdfs-site.xml :

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/store/hdfs/datanode</value>
  </property>
</configuration>

9. Create the storage directories for the NameNode and DataNode

sudo mkdir -p /usr/local/store/hdfs/namenode

sudo mkdir -p /usr/local/store/hdfs/datanode

sudo chown root:root -R /usr/local/store

10. Format the NameNode to create its initial folder structure

hdfs namenode -format


‭11. Starting HDFS daemons :‬

‭/usr/local/hadoop/sbin/start-dfs.sh‬

12. Start the YARN daemons (ResourceManager and NodeManager): /usr/local/hadoop/sbin/start-yarn.sh
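
To verify that the daemons started, the jps command (shipped with the JDK) should list a process for each of them:

jps   # expect NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (plus Jps itself)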

13. In a browser, go to http://localhost:8088 to view the ResourceManager web interface and http://localhost:9870 for the HDFS NameNode web interface (http://localhost:8042 serves the NodeManager UI).
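
As a quick check before stopping the cluster, a directory can be created and listed in HDFS (the directory name /user/hadoop is illustrative):

hdfs dfs -mkdir -p /user/hadoop

hdfs dfs -ls /user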
14. To stop all Hadoop daemons at once, execute the stop-all.sh script in sbin

Conclusion :

Thus, in this assignment we successfully compared different versions of Hadoop (Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x) and set up a Hadoop single-node cluster.
