
NAME : DEEPTI AGRAWAL
REGISTRATION NO : 201081010
FINAL YEAR BTECH
BRANCH : IT
SUBJECT : BDA LAB

‭Big Data Analytics Lab Assignment No. 1‬

Aim : Compare different versions of Hadoop (Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x).
Also set up a Hadoop single-node cluster.

Theory :

‭1.‬ ‭Hadoop Distributed File System (HDFS):‬

Overview: HDFS is a distributed file system that provides a reliable and scalable storage infrastructure for Hadoop. It is designed to store and manage very large files across multiple nodes in a Hadoop cluster.

Key Features:

Data Replication: HDFS replicates data across multiple nodes to ensure fault tolerance. The default replication factor is three, meaning each piece of data is stored on three different nodes.

Scalability: HDFS can scale horizontally by adding more nodes to the cluster, accommodating the storage needs of big data applications.

Accessibility: HDFS provides high-throughput access to application data and is suitable for applications with large datasets.
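
For example, once a cluster is running, a file can be copied into HDFS and its block layout inspected from the shell (the file sample.txt and the /data directory here are illustrative):

hdfs dfs -mkdir -p /data

hdfs dfs -put sample.txt /data/

hdfs dfs -ls /data              # the second column of the listing is the replication factor

hdfs fsck /data/sample.txt -files -blocks   # shows how the file is split into replicated blocks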

‭2.‬ ‭Yet Another Resource Negotiator (YARN):‬

Overview: YARN is the resource management layer of Hadoop, responsible for managing and allocating resources in a Hadoop cluster. It allows multiple applications to share resources efficiently.

Key Components:

ResourceManager: Manages and allocates resources to various applications in the cluster.

NodeManager: Runs on individual nodes and is responsible for managing resources on that node.

Benefits:

Efficient Resource Utilization: YARN allows dynamic allocation of resources, ensuring that the available resources are utilized optimally.

Multi-Tenancy: Multiple applications can coexist on the same Hadoop cluster without interfering with each other.
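
As an illustration, both components can be inspected from the shell once YARN is running:

yarn node -list         # NodeManagers registered with the ResourceManager

yarn application -list  # applications currently sharing the cluster's resources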

‭3.‬ ‭MapReduce:‬

Overview: MapReduce is a programming model and processing engine for distributed computing in Hadoop. It allows the processing of large datasets in parallel across a Hadoop cluster.

Key Components:

Mapper: Processes input data and generates key-value pairs.

Reducer: Aggregates and processes the intermediate key-value pairs produced by the mappers.

Workflow:

Map Phase: Input data is divided into smaller chunks and processed by individual mappers.

Shuffle and Sort Phase: Intermediate key-value pairs are shuffled and sorted based on keys.

Reduce Phase: Reduce tasks process the sorted data and produce the final output.

Scalability: MapReduce enables horizontal scalability by distributing tasks across multiple nodes, making it suitable for processing vast amounts of data.
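
For example, the word-count job bundled with Hadoop exercises all three phases; the /input and /output HDFS paths here are illustrative, and the jar path assumes the install location used in the setup below:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /input /output

hdfs dfs -cat /output/part-r-00000   # the final output produced by the reducer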

‭4.‬ ‭Hadoop Common:‬


Overview: Hadoop Common provides the foundational utilities, libraries, and APIs that support the other Hadoop modules. It ensures compatibility and interoperability among the various components of the Hadoop ecosystem.

Key Components:

Java libraries: Core libraries for file systems, networking, and utilities.

Hadoop Distributed Shell: A framework for running distributed applications on Hadoop clusters.

Role: Hadoop Common acts as the glue that binds the different components together, providing a common set of tools and rules for seamless integration and communication within the Hadoop ecosystem.
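
For instance, the common utilities are exposed through the hadoop command itself:

hadoop version    # print version and build information

hadoop classpath  # print the classpath through which the other modules load the common libraries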

‭Comparison‬ ‭of‬ ‭different‬ ‭versions‬ ‭of‬ ‭Hadoop‬ ‭(Hadoop‬ ‭1.x,‬ ‭Hadoop‬ ‭2.x,‬
‭and Hadoop 3.x)‬
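
Hadoop 1.x: MapReduce handles both data processing and cluster resource management through a single JobTracker and per-node TaskTrackers. Only MapReduce applications can run, a single NameNode is a single point of failure, and clusters scale to roughly 4,000 nodes.

Hadoop 2.x: YARN separates resource management from data processing, so non-MapReduce frameworks can share the cluster. It adds NameNode High Availability and HDFS Federation, replaces the fixed map/reduce slots of 1.x with containers, and scales to roughly 10,000 nodes.

Hadoop 3.x: Requires Java 8 or later. It adds HDFS erasure coding, which reduces the storage overhead of fault tolerance from the 200% of 3x replication to about 50%, supports more than two NameNodes, moves default service ports out of the ephemeral range (the NameNode web UI moves from 50070 to 9870), and adds the intra-DataNode balancer and YARN Timeline Service v2.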
‭Setup of Hadoop Single Node Cluster :‬

1. Install the JRE and JDK (OpenJDK 11)

‭sudo apt install default-jdk default-jre -y‬

‭2. Install openssh-client and openssh-server‬

‭sudo apt-get install openssh-client‬

‭sudo apt-get install openssh-server‬


3. Download the Hadoop binary release and extract the archive

wget https://downloads.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz

tar -xzf hadoop-3.3.6.tar.gz

4. Move the extracted folder to /usr/local/hadoop (the install path referenced in the configuration below)

sudo mv hadoop-3.3.6 /usr/local/hadoop


5. In order to use Java and Hadoop from any folder, we need to add their paths to the ~/.bashrc file, whose contents are executed every time a new shell session starts:

export JAVA_HOME=/usr/lib/jvm/default-java

‭export PATH=$PATH:$JAVA_HOME/bin‬

‭export HADOOP_INSTALL=/usr/local/hadoop‬

‭export PATH=$PATH:$HADOOP_INSTALL/bin‬

‭export PATH=$PATH:$HADOOP_INSTALL/sbin‬

‭export HADOOP_MAPRED_HOME=$HADOOP_INSTALL‬

‭export HADOOP_COMMON_HOME=$HADOOP_INSTALL‬

‭export HADOOP_HDFS_HOME=$HADOOP_INSTALL‬

‭export YARN_HOME=$HADOOP_INSTALL‬

‭export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native‬

‭export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"‬
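
After saving the file, apply the changes to the current shell session:

source ~/.bashrc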

6. To change the port on which SSH runs (in this case port 2222), edit /etc/ssh/sshd_config
7. For passwordless SSH authentication, do the following:

‭a. ssh-keygen -t rsa -P ""‬

‭b. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys‬

‭c. ssh localhost‬


8. Next, Hadoop needs to be configured by editing the following files located in /usr/local/hadoop/etc/hadoop

a. hadoop-env.sh : export JAVA_HOME=/usr/lib/jvm/default-java

b. core-site.xml :

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000/</value>
  </property>
</configuration>

c. yarn-site.xml :

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

d. mapred-site.xml :

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

e. hdfs-site.xml :

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/store/hdfs/datanode</value>
  </property>
</configuration>

9. Create the storage directories for the NameNode and DataNode

sudo mkdir -p /usr/local/store/hdfs/namenode

sudo mkdir -p /usr/local/store/hdfs/datanode

sudo chown root:root -R /usr/local/store

10. Format the NameNode to create its initial folder structure

hdfs namenode -format


‭11. Starting HDFS daemons :‬

‭/usr/local/hadoop/sbin/start-dfs.sh‬

12. Start the YARN daemons (ResourceManager and NodeManager): /usr/local/hadoop/sbin/start-yarn.sh
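
To verify that the daemons started, the jps command (shipped with the JDK) should list a process for each of them:

jps   # expect NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (plus Jps itself)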

13. In a browser, go to http://localhost:8088 to view the ResourceManager web interface and http://localhost:9870 for the HDFS NameNode web interface (http://localhost:8042 serves the NodeManager UI).
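
As a quick check before stopping the cluster, a directory can be created and listed in HDFS (the directory name /user/hadoop is illustrative):

hdfs dfs -mkdir -p /user/hadoop

hdfs dfs -ls /user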
14. To stop all Hadoop daemons at once, execute the stop-all.sh script in sbin

Conclusion :

Thus, in this assignment we successfully compared different versions of Hadoop (Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x) and set up a Hadoop single-node cluster.
