
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

LAB REPORT & ASSIGNMENTS


(ACADEMIC YEAR 2023-24)

COURSE NAME: BIG DATA TECHNOLOGY WITH HADOOP LAB


COURSE CODE: CSE2717
DEPARTMENT: Computer Science and Engineering
FACULTY NAME: Dr Swetta Kukreja

SUBMITTED BY

STUDENT NAME: Aman Vinay Sharma

ENROLMENT NUMBER: A70405220153


CLASS: B.Tech CSE - DS
SEMESTER: 7
DATE OF SUBMISSION:
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

CERTIFICATE OF SUBMISSION

Student Name: Aman Vinay Sharma


Class: B.Tech CSE - DS
Semester: 7
Enrolment Number: A70405220153

This is certified to be the bonafide work of the student in BIG DATA

TECHNOLOGY WITH HADOOP Laboratory during the academic

year 2023-24.

Faculty In-charge
{Department of CSE}
ASET, AUM

Department Coordinator HoI


{Department of CSE} ASET AUM
ASET, AUM

Date: 05/12/2023 Stamp


AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

(Academic Year 2023-24)

INDEX
Sr. No. Description Date Grade
1 Case Study of Big Data
2 Installation of Hadoop
3 HDFS Commands
4 Implement Word Count using MapReduce
5 Implementation of Bloom Filter
6 Implement Page Ranking Algorithm
7 Implement Apriori Algorithm

Faculty In-charge
{Department of CSE}
ASET, AUM
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

(Academic Year 2023-24)

LAB 1

Case Study of Big Data

Student Name: Aman Vinay Sharma


Class: B.Tech CSE - DS
Semester: 7

Enrolment Number: A70405220153

Faculty In-charge
{Department of CSE}
ASET, AUM
1. AIM
To carry out a case study related to Big Data.

2. INTRODUCTION
Big Data refers to exceptionally large and complex datasets that exceed the capabilities of
traditional data processing tools and methodologies. At the forefront of contemporary
information management, Big Data emerges as a paradigmatic challenge and an unparalleled
opportunity within the realm of extensive datasets. The original three Vs of Big Data (Volume, Velocity, and Variety) have since been extended into a more comprehensive set known as the 6 V's, capturing the complexity and challenges associated with massive datasets. The six Vs are:
1. Volume: Refers to the sheer scale and size of the data generated, processed, and stored.
It emphasizes the vastness of information that surpasses the capacities of traditional data
management systems.
2. Velocity: Encompasses the speed at which data is generated, processed, and analysed. In
the context of real-time analytics, the velocity of data is crucial for making timely decisions
and extracting actionable insights.
3. Variety: Represents the diverse types and formats of data, including structured and
unstructured data such as text, images, videos, and sensor-generated data. Managing this
diversity requires advanced analytical techniques and tools.
4. Veracity: Refers to the reliability and accuracy of the data. As Big Data often involves data
from various sources, ensuring the trustworthiness of the information becomes a critical
aspect of analysis and decision-making.
5. Value: Stresses the importance of deriving meaningful insights and value from the vast
datasets. The ultimate goal of Big Data is to extract actionable information that can
contribute to business growth, innovation, and informed decision-making.
6. Variability: Addresses the inconsistency or fluctuation in data flow. Big Data sources may
exhibit variability in terms of data arrival times, making it essential for systems to adapt
to these fluctuations in real-time processing.

3. BIG DATA USE CASES


Netflix: Personalized Content Recommendations
Netflix's adept use of Big Data is a cornerstone of its revolutionary influence on the
entertainment industry. Central to this is the implementation of personalized content
recommendations, which rely on comprehensive data collection involving user habits, ratings,
searches, and viewing duration. Utilizing advanced machine learning algorithms, Netflix
analyzes this extensive dataset to curate customized user experiences. The outcome is the
creation of personalized home screens that offer content recommendations finely tuned to
individual preferences, showcasing how meticulous data utilization can significantly enhance
and shape the way users engage with entertainment content.
Netflix's personalized strategy is complemented by a data-informed approach to content
creation, where analytics play a crucial role in refining recommendations and guiding strategic
content investments. By discerning broader viewership trends, Netflix ensures that its original
content aligns seamlessly with the evolving interests of its diverse audience. This data-driven
approach not only refines content recommendations but also informs strategic investments,
ultimately enhancing the platform's overall appeal.
Netflix's commitment to a dynamic user interface, which adapts continuously based on real-
time interactions, underscores its data-centric approach. This personalized and dynamic
strategy has a direct positive impact on viewer engagement and retention, creating a
reinforcing loop of user satisfaction. Additionally, predictive analytics play a pivotal role in
Netflix's operations, anticipating viewer behavior and strategically timing content releases to
capture maximum audience attention. In essence, Netflix's proficiency in leveraging Big Data
is evident across every aspect of its operations, from personalized recommendations and
content creation to maintaining a dynamic interface, boosting viewer engagement, and
strategically planning content releases.

Spotify: Personalized Music Recommendations


Spotify, a pioneer in the music streaming industry, relies on the strategic application of big
data to provide personalized music recommendations, thereby enhancing user engagement
and satisfaction.

Spotify's personalized music recommendations are driven by sophisticated algorithms that


meticulously analyze user behavior, listening history, genre preferences, and even social
interactions within the platform. This system continuously learns and adapts, progressively
refining its recommendations over time. A notable illustration of Spotify's data-driven
approach is the Discover Weekly playlist, which presents users with a curated selection of
tracks aligned with their individual music tastes.

In addition to tailoring individual song recommendations, Spotify's algorithms contribute to


the creation of personalized playlists. The "Daily Mix" feature, for example, combines user
favorites and introduces new tracks based on listening habits. Through the strategic
application of big data, Spotify enhances discoverability, exposing users to a diverse range of
artists and genres while maintaining a strong connection to their specific musical preferences.
Challenges and Issues:
1. Data Privacy Concerns: Spotify faces challenges related to data privacy as it collects
extensive user data to power its recommendation algorithms. Striking a balance
between personalization and user privacy is crucial, requiring transparent policies and
robust security measures to protect sensitive information.
2. Algorithmic Bias: The algorithms used for personalized recommendations may
inadvertently introduce bias based on user behaviour patterns. Addressing algorithmic
bias is an ongoing challenge, necessitating continuous refinement to ensure
recommendations are diverse, inclusive, and reflective of varied user interests.
3. Music Discovery vs. Echo Chamber: While personalized recommendations aim to
enhance music discovery, there is a risk of users being confined to a musical "echo
chamber," only exposed to content similar to their existing preferences. Balancing
personalized recommendations with opportunities for serendipitous discovery poses
an ongoing challenge for Spotify.
4. Dynamic User Preferences: User preferences in music are dynamic and can change
over time. Spotify faces the challenge of keeping its algorithms adaptable to evolving
user tastes and ensuring that recommendations remain relevant and appealing.

Amazon: Precision in E-Commerce Operations


Amazon, the global e-commerce giant, serves as a prime example of operational precision
achieved through the strategic application of Big Data analytics. At the core of its success is
the company's adept use of extensive datasets for demand forecasting. Through the analysis
of historical purchasing patterns, customer behaviors, and market trends, Amazon employs
predictive analytics to foresee product demand with remarkable accuracy. This foresight
enables optimized inventory management, minimizing the occurrence of overstocking or
stockouts.

Furthermore, Amazon's dynamic pricing strategy, informed by real-time data analysis,


ensures competitive pricing while maximizing profit margins. This precision in pricing not only
contributes to a seamless customer experience but also bolsters Amazon's competitiveness
in the dynamic e-commerce landscape. The company's ability to harness Big Data for demand
forecasting exemplifies how data-driven insights can be instrumental in refining operational
efficiency and maintaining a strategic edge in the market.

Apart from its proficiency in demand forecasting, Amazon leverages Big Data to craft
personalized customer experiences. The platform scrutinizes user behavior, preferences, and
historical purchases to deliver tailored product recommendations. This personalized strategy
not only elevates customer satisfaction but also boosts the chances of cross-selling and
upselling. Through the utilization of Big Data analytics, Amazon consistently establishes itself
as a benchmark for precision in e-commerce operations, showcasing the transformative
influence of data-driven decision-making in optimizing both customer experience and
operational efficiency.

Starbucks: Customer Experience Enhancement


Starbucks possesses a rich source of data for informing marketing and sales strategies, with
90 million coffee transactions and 25,000 global storefronts integrated with apps and store
technology. The cornerstone of this data collection is Starbucks' proprietary app, linking
customers to the reward program and meticulously documenting each purchase and store
visit. This wealth of information allows the company to discern valuable insights into customer
preferences and behaviors. For instance, understanding the frequency with which customers
place to-go orders in advance through the app, as opposed to spontaneous drive-through
purchases, provides crucial insights that guide decision-making in marketing and sales efforts.

Starbucks leverages its innovative Digital Flywheel strategy to gain valuable insights by
seamlessly integrating digital and in-person customer experiences, enhancing
personalization, ordering processes, rewards, and transactions. Through the collection of
customer data via the app, a combination of AI and big data enables Starbucks to offer special
deals tailored to individual preferences, increasing the likelihood of customers making repeat
purchases. By identifying trends through the analysis of millions of customer behaviors,
Starbucks can make informed decisions about the success of menu items.

To sustain growth and adapt to changing consumer preferences, Starbucks aims to keep
customers engaged by encouraging them to explore new products. Similar to the data-driven
approach of Netflix with viewer preferences, the Starbucks app recommends related drinks
based on individual tastes. These suggestions consider location-specific availability and even
adapt to weather conditions, suggesting hot drinks on colder days. The more diverse the
menu engagement, the greater the customer loyalty.

In a broader context, Starbucks utilizes the gathered personal data to shape social media
campaigns and align the brand's objectives with initiatives that contribute to sustained
profitability. Starbucks' Chief Strategy Officer, Matt Ryan, highlights the ongoing
modernization of their technology stack, emphasizing the replacement of legacy rewards and
ordering functionalities with a scalable cloud-based platform. This platform enhances
customer data organization and integrates more seamlessly with store-based operating
systems, encompassing inventory and production management.
Industry: Food and beverage
Big data product: Starbucks rewards-connected app
Outcomes:
Personalized customer rewards offers

Thorough transaction cataloguing

Streamlined audience research for campaigns and menus

D Mart: Supply Chain Efficiency


DMart, a leading discount retail chain in India, strategically incorporates big data to improve
its operations and facilitate business expansion. The company employs data-driven
applications across various areas such as customer segmentation, inventory optimization,
pricing, and fraud detection. Through the analysis of demographics, shopping behaviors, and
preferences, DMart efficiently categorizes its customer base, enabling targeted and
personalized marketing initiatives. Moreover, precise monitoring of sales data and customer
demand aids in optimizing inventory, preventing shortages, and reducing waste, thereby
enhancing operational efficiency. The company also adjusts pricing dynamically based on
competitor rates and market trends to maintain competitiveness and ensure profitability.
Additionally, big data plays a crucial role in fraud detection, identifying and mitigating
suspicious activities to protect both customers and the overall business.

In spite of its achievements, DMart encounters several obstacles in leveraging big data. One
key challenge involves managing concerns surrounding data privacy, requiring the company
to responsibly handle the extensive customer data it collects to safeguard user privacy.
Additionally, ensuring the accuracy of data is essential for the efficacy of DMart's big data
applications, demanding ongoing efforts to maintain precise and current information. With
the rapid expansion of DMart's customer base, the challenge of data scalability emerges,
necessitating an adaptable and scalable data infrastructure capable of effectively managing
the growing volume of data. In summary, although big data has been instrumental in DMart's
success, addressing these challenges is imperative for sustaining and optimizing its data-
driven initiatives.

4. CONCLUSION
In conclusion, this case study emphasizes the profound impact of big data, providing
enhanced insights and more informed decision-making across various sectors. Nevertheless,
it emphasizes the imperative of tackling challenges related to security, infrastructure, and
data quality to ensure the seamless integration of big data solutions. The responsible
adoption and management of big data emerge as critical factors for organizations aiming to
excel in this era driven by data-driven practices.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

(Academic Year 2023-24)

LAB 2

Installation of Hadoop

Student Name: Aman Vinay Sharma


Class: B.Tech CSE - DS
Semester: 7

Enrolment Number: A70405220153

Faculty In-charge
{Department of CSE}
ASET, AUM
1. AIM
To install Hadoop on the system.

2. THEORY
Hadoop is a robust and open-source framework designed to handle the challenges posed by
massive volumes of data in the era of big data. Developed by the Apache Software
Foundation, Hadoop provides a scalable and distributed computing environment,
revolutionizing the way organizations store, process, and analyse vast datasets.
At its core, Hadoop comprises two fundamental components: the Hadoop Distributed File
System (HDFS) and the MapReduce programming model. HDFS allows for the distributed
storage of data across multiple nodes, ensuring fault tolerance and high availability. The
MapReduce paradigm facilitates the parallel processing of data by breaking tasks into smaller,
manageable sub-tasks distributed across a cluster of machines.
Hadoop's strength lies in its ability to scale horizontally, accommodating data growth by
adding more commodity hardware to the cluster. This scalability makes it an ideal solution for
businesses dealing with diverse and expansive datasets. Moreover, Hadoop embraces a fault-
tolerant design, ensuring uninterrupted operations even in the face of hardware failures.
The ecosystem around Hadoop has expanded significantly, with additional tools and
frameworks such as Apache Hive, Apache Pig, and Apache Spark complementing its
capabilities. These tools provide higher-level abstractions, making it more accessible for
developers and data analysts to work with big data.

3. INSTALLATION
Prerequisites
● Java 8 runtime environment (JRE)
● Apache Hadoop 3.3.0
Step 1 - Download Hadoop binary package
The first step is to download Hadoop binaries from the official website. The binary package
size is about 478 MB.
https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Step 2 - Unpack the package
After finishing the file download, unpack the package using 7zip or command line.

cd Downloads
Because I am installing Hadoop in the folder C:\Hadoop of my C drive, we create the directory:
mkdir C:\Hadoop

Then run the following command to unzip:


tar -xvzf hadoop-3.3.0.tar.gz -C C:\Hadoop\
The command will take quite a few minutes as there are numerous files included. After the
unzip command is completed, we move on to installing Java.
Step 3 - Java installation
Hadoop is built on Java, so you must have Java installed on your PC. You can get the most
recent version of Java from the official website. Here, the Java SE Runtime Environment for Windows x64 is used.
After finishing the file download, open a new command prompt and unpack the package:
cd Downloads

Because I am installing Java in the folder C:\Java of my C drive, we create the directory:
mkdir C:\Java

Then run the following command to unzip:


tar -xvzf jre-8u361-windows-x64.tar.gz -C C:\Java\
Step 4 - Install Hadoop native IO binary

Hadoop on Linux includes optional Native IO support. However, Native IO is mandatory on Windows, and without it you will not be able to get your installation working. The Windows native IO libraries are not included as part of the Apache Hadoop release, so they have to be built and installed separately. The following repository already provides pre-built Hadoop Windows native libraries:
https://github.com/ruslanmv/How-to-install-Hadoop-on-Windows/tree/master/winutils/hadoop-3.3.0-YARN-8246/bin

These libraries are not signed and there is no guarantee that they are 100% safe. We use them purely for testing and learning purposes.

Download all the files from the following location and save them to the bin folder under the Hadoop folder. You can use Git by typing in your terminal:

git clone https://github.com/ruslanmv/How-to-install-Hadoop-on-Windows.git

and then copy the binaries:

cd How-to-install-Hadoop-on-Windows\winutils\hadoop-3.3.0-YARN-8246\bin
copy *.* C:\Hadoop\hadoop-3.3.0\bin
Step 5 - Configure environment variables
Now that we have downloaded and unpacked all the artefacts, we need to configure two important environment variables.
First, click the Windows button and type "environment" to open the environment variable settings.

Configure Environment variables

We configure the JAVA_HOME environment variable by adding a new environment variable:


Variable name : JAVA_HOME Variable value: C:\Java\jre1.8.0_361

and do the same for the HADOOP_HOME environment variable:

Variable name : HADOOP_HOME Variable value: C:\Hadoop\hadoop-3.3.0

Configure PATH environment variable

Once we finish setting up the above two environment variables, we need to add the bin folders to the PATH environment variable. Select the Path variable and click Edit.
If the PATH environment variable already exists on your system, you can also manually add the following two paths to it:
%JAVA_HOME%/bin
%HADOOP_HOME%/bin
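Once the variables are saved, an optional sanity check (using the example paths assumed above) is to open a new Command Prompt and confirm that they resolve correctly:

echo %JAVA_HOME%
echo %HADOOP_HOME%
where winutils.exe

echo should print C:\Java\jre1.8.0_361 and C:\Hadoop\hadoop-3.3.0 respectively, and where should locate winutils.exe in the Hadoop bin folder populated in Step 4.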

Verification of Installation
Once you complete the installation, close your terminal window, open a new one, and run the following command to verify:
java -version
You should see output similar to:
java version "1.8.0_361"

Java(TM) SE Runtime Environment (build 1.8.0_361-b09)


Java HotSpot(TM) 64-Bit Server VM (build 25.361-b09, mixed mode)
You should also be able to run the following command:
hadoop -version

java version "1.8.0_361"


Java(TM) SE Runtime Environment (build 1.8.0_361-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.361-b09, mixed mode)
Please verify that the Java version reported by hadoop matches the Java version installed above.

Finally, run winutils directly to verify that the above steps completed successfully:
winutils.exe
Step 6 - Configure Hadoop
Now we are ready to configure the most important part: the Hadoop configuration files, which cover the Core, YARN, MapReduce, and HDFS configurations.
Configure core site
Edit file core-site.xml in %HADOOP_HOME%\etc\hadoop folder.
For my environment, the actual path is C:\Hadoop\hadoop-3.3.0\etc\hadoop

Replace configuration element with the following:


<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>

Configure HDFS
Edit file hdfs-site.xml in %HADOOP_HOME%\etc\hadoop folder.
Before editing, please create two folders on your system: one for the namenode directory and another for the datanode directory. For my system, I created the following two sub-folders:
mkdir C:\hadoop\hadoop-3.3.0\data\datanode
mkdir C:\hadoop\hadoop-3.3.0\data\namenode
Replace the configuration element with the following (remember to replace the paths below so that they match your own installation):
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop/hadoop-3.3.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop/hadoop-3.3.0/data/datanode</value>
</property>
</configuration>
Configure MapReduce and YARN site
Edit file mapred-site.xml in %HADOOP_HOME%\etc\hadoop folder.
Replace configuration element with the following:

<configuration>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>

<value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
</property>
</configuration>

Edit file yarn-site.xml in %HADOOP_HOME%\etc\hadoop folder.


<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>

</property>
</configuration>

Step 7 - Initialise HDFS


Run the following command in Command Prompt
hdfs namenode -format
The following is an example when it is formatted successfully:
Step 8 - Start HDFS daemons
Run the following command to start HDFS daemons in Command Prompt:
%HADOOP_HOME%\sbin\start-dfs.cmd
Please click Allow access when Windows Firewall prompts for Java.

Two Command Prompt windows will open: one for datanode and another for namenode as
the following screenshot shows:

Verify HDFS web portal UI through this link:


http://localhost:9870/dfshealth.html#tab-overview.
You can also navigate to a data node UI:
Step 9 - Start YARN daemons
Warning: you may encounter permission issues if you start the YARN daemons as a normal user. To avoid this, open a Command Prompt window using Run as administrator.
Run the following command in an elevated Command Prompt window (Run as administrator)
to start YARN daemons:
%HADOOP_HOME%\sbin\start-yarn.cmd
Similarly, two Command Prompt windows will open: one for resource manager and another
for node manager as the following screenshot shows:
You can verify the YARN resource manager UI once all services have started successfully:
http://localhost:8088
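As an optional smoke test (not part of the original guide), the cluster can be exercised from a new Command Prompt once both sets of daemons are running; the directory name below is only an example:

hdfs dfs -mkdir /test
hdfs dfs -ls /

If the daemons are healthy, the new /test directory appears in the listing and can also be seen in the NameNode web UI under Utilities > Browse the file system.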

4. CONCLUSION
Effectively setting up Hadoop on Windows 11 showcases our capability to tailor this robust
big data framework to various environments, thereby expanding our expertise for upcoming
data processing challenges. This hands-on experience provides valuable insights into
overseeing distributed computing solutions across a range of operating systems.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

(Academic Year 2023-24)

LAB 3
HDFS Commands

Student Name: Aman Vinay Sharma


Class: B.Tech CSE - DS
Semester: 7

Enrolment Number: A70405220153

Faculty In-charge
{Department of CSE}
ASET, AUM
1. AIM
To study basic Hadoop commands.

2. THEORY
Hadoop commands form an integral part of harnessing the power of the Hadoop ecosystem,
a distributed computing framework designed to process large-scale data sets. These
commands allow users to interact with the Hadoop Distributed File System (HDFS) and
execute various tasks within the Hadoop environment. From managing files and directories
to initiating MapReduce jobs, the command-line interface provides a versatile toolkit for users
to navigate and manipulate data stored across a cluster of machines. Examples of essential
Hadoop commands include those for uploading and downloading files to and from HDFS,
monitoring job progress, and configuring cluster settings. Mastery of Hadoop commands is
essential for data engineers, analysts, and administrators to efficiently handle and process
massive datasets in a distributed computing environment.
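Before going through the individual commands below, it is worth noting that the shell is self-documenting (this aside is not part of the original list): -help prints the usage of every command, or of a single command when its name is given.

bin/hdfs dfs -help
bin/hdfs dfs -help ls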

3. Hadoop Commands
Here is the list of several Hadoop commands:

ls
The "ls" command in Hadoop is employed to display a list of files, while the "lsr" option can
be used for a recursive approach, providing a hierarchical view of a folder.

Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables, so bin/hdfs means we want the hdfs executable, and dfs selects the Distributed File System commands in particular.

mkdir
Used to create a directory. In HDFS there is no home directory by default, so let's first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be created relative to the
home directory.

touchz
The "touchz" command in Hadoop is used to create an empty file at the specified path in
Hadoop Distributed File System (HDFS).
Syntax:
bin/hdfs dfs -touchz <file_path>

Example:
bin/hdfs dfs -touchz /geeks/myfile.txt

copyFromLocal (or) put


To copy files/folders from local file system to hdfs store. This is the most important
command. Local filesystem means the files present on the OS.

Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to folder
geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks

(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks

cat
To print file contents.
Syntax:
bin/hdfs dfs -cat <path>

Example:
// print the content of AI.txt present inside the geeks folder
bin/hdfs dfs -cat /geeks/AI.txt

copyToLocal (or) get


To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>

Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero

(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from the geeks folder will be copied to the folder hero present on the Desktop.

moveFromLocal
This command moves a file from the local file system to HDFS.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks

cp
This command is used to copy files within hdfs. Let’s copy folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied

mv
This command is used to move files within HDFS. Let's cut-paste the file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>

Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied

rmr
This command deletes files from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>

Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then the directory itself.

du
It will give the size of each file in the directory.

Syntax:

bin/hdfs dfs -du <dirName>

Example:
bin/hdfs dfs -du /geeks

4. CONCLUSION
In summary, the comprehensive exploration of various Hadoop commands in the laboratory
experiment has significantly augmented our practical knowledge in navigating the Hadoop
Distributed File System (HDFS) and managing diverse data processing tasks in a distributed
computing environment. This hands-on experience equips us with a versatile skill set,
essential for addressing the complexities of big data analytics across diverse operating
scenarios.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

(Academic Year 2023-24)

LAB 4

Implement Word Count using MapReduce.

Student Name: Aman Vinay Sharma


Class: B.Tech CSE - DS
Semester: 7

Enrolment Number: A70405220153

Faculty In-charge
{Department of CSE}
ASET, AUM
1. AIM
To implement Word Count using MapReduce.

2. THEORY
Wordcount using MapReduce is a classic example and a fundamental demonstration of the
power of distributed computing in big data processing. In this paradigm, the MapReduce
programming model is utilized to efficiently count the occurrences of words in a vast dataset.
The process involves two main stages: the Map stage and the Reduce stage. During the Map
stage, the input data is divided into key-value pairs, where each word is assigned a count of
one. The Mapper tasks then distribute these pairs across different nodes in a cluster,
processing the data in parallel. In the Reduce stage, the system aggregates and consolidates
the intermediate key-value pairs, summing up the counts for each unique word. The Reducer
tasks merge these results to produce the final word count output.
Wordcount using MapReduce serves as a foundational example in the field of distributed
computing, illustrating the efficiency gained through parallel processing and distributed
storage. This approach is scalable, allowing for the processing of massive datasets by
leveraging the capabilities of a distributed computing framework like Apache Hadoop.
Through the distributed and parallelized nature of MapReduce, organizations can efficiently
analyze vast amounts of textual data, making it a crucial technique in various fields such as
natural language processing, information retrieval, and data analytics.
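To make the two stages concrete before the Hadoop-based steps below, here is a minimal single-machine sketch of the same pipeline in plain Python (an illustration only, not part of the Hadoop implementation): the map step emits (word, 1) pairs, the shuffle step groups the pairs by key, and the reduce step sums the counts for each word.

from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all counts that share the same key (word)
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "hadoop processes big data"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}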

3. SOURCE CODE (Steps)


Step 1: Login to GitHub

Step 2: Choose the Jupyter Notebook codespace

Go to your profile section and select your codespace.

Select Jupyter Notebook and click on the "Use this template" button.

Step 3: When the terminal comes up, run the command:

ls

Step 4: Go to the notebooks directory using the command:

cd notebooks
Step 5: Type the command nano word_count_data.txt; this will create the file and open it in the nano editor.

nano word_count_data.txt
Step 6: Type a few sentences or a paragraph into the text file.

Step 7: Press Ctrl + X to exit, then press Y to save the changes. Press Enter to confirm the file name and return to the terminal.

Step 8: Now you can check the contents of the file using the cat word_count_data.txt
command.

cat word_count_data.txt

Step 9: Type nano mapper.py to create the mapper file, then paste the mapper.py code into the editor window.

nano mapper.py

Code:
#!/usr/bin/env python

# import sys because we need to read and write data to STDIN and STDOUT
import sys

# read every line from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # loop over the words array and print each word with a count of 1
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print('%s\t%s' % (word, 1))

Step 10: Press Ctrl + X to exit, then press Y to save the changes. Press Enter to confirm the file name and return to the terminal.

Step 11: Type nano reducer.py to create the reducer file, then paste the reducer.py code into the editor window.

nano reducer.py

Code:
#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# read every line from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the data on the tab character we added in mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts the map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Step 12: Press Ctrl + X to exit, then press Y to save the changes. Press Enter to confirm the file name and return to the terminal.

Step 13: Command cat word_count_data.txt | python mapper.py – to see the mapper
command output

cat word_count_data.txt | python mapper.py

Step 14: Command cat word_count_data.txt | python mapper.py | sort -k1,1 – to see the
sorted output.
cat word_count_data.txt | python mapper.py | sort -k1,1
Step 15: Command cat word_count_data.txt | python mapper.py | sort -k1,1 | python
reducer.py – to see the final word count output.

cat word_count_data.txt | python mapper.py | sort -k1,1 | python reducer.py
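The pipeline above runs entirely in the local shell, which is a convenient way to test the scripts. On an actual Hadoop cluster, the same mapper.py and reducer.py would typically be submitted through the Hadoop Streaming utility; the command below is only a sketch, assuming the Hadoop 3.3.0 layout from Lab 2, that the input file has already been copied into HDFS, and that /word_count_in and /word_count_out are example paths:

hadoop jar %HADOOP_HOME%\share\hadoop\tools\lib\hadoop-streaming-3.3.0.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /word_count_in/word_count_data.txt -output /word_count_out

The -files option ships the two scripts to the worker nodes, and the resulting counts are written as part files under the output directory.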

4. OUTPUT

5. CONCLUSION
In conclusion, Wordcount using MapReduce exemplifies the transformative potential of
distributed computing in handling extensive datasets. By employing the MapReduce
programming model, this approach efficiently processes and analyzes large volumes of textual
data, showcasing the benefits of parallelized computation and distributed storage. As a
fundamental example, it underscores the scalability and applicability of such techniques in
diverse fields, marking a pivotal contribution to the realm of big data analytics and distributed
computing.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

(Academic Year 2023-24)

LAB 5

Implementation of Bloom Filter

Student Name: Aman Vinay Sharma


Class: B.Tech CSE - DS
Semester: 7

Enrolment Number: A70405220153

Faculty In-charge
{Department of CSE}
ASET, AUM
1. AIM
To implement Bloom Filter.

2. THEORY
A Bloom filter is a space-efficient probabilistic data structure designed for quickly testing
whether an element is a member of a set. The key characteristic of a Bloom filter is its ability
to provide fast membership queries while using a relatively small amount of memory
compared to other data structures.
The Bloom filter works by employing a bit array of a fixed size and a set of hash functions.
Initially, all bits in the array are set to zero. When an element is added to the set, it undergoes
multiple hash functions, and the corresponding bits in the array are set to one. To check for
membership, the same hash functions are applied to the query element, and if all the
corresponding bits are set, the element is deemed a probable member of the set. However,
false positives are possible, as multiple elements might hash to the same set of bits.
Common applications include spell checkers, network routers, and distributed systems,
where they are employed to reduce the need for costly disk or network lookups.
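The mechanism described above can also be illustrated with a tiny from-scratch sketch, independent of the pybloom_live library used in the source code below. It derives k bit positions by double hashing over MD5 and SHA-1 digests, which is one common way of simulating k independent hash functions:

import hashlib

class TinyBloomFilter:
    def __init__(self, m=64, k=3):
        self.m = m              # number of bits in the bit array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # bit array, all zeros initially

    def _positions(self, item):
        # Double hashing: position_i = (h1 + i * h2) mod m
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set every bit selected by the k hash functions
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Probably present only if all k bits are set; definitely absent otherwise
        return all(self.bits[pos] for pos in self._positions(item))

bf = TinyBloomFilter()
bf.add("mumbai")
print("mumbai" in bf)    # True (probably present)
print("kolkata" in bf)   # False, unless a false positive occurs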

3. SOURCE CODE
from pybloom_live import BloomFilter
import math

def create_bloom_filter(expected_elements, false_positive_rate):
    # Calculate the optimal number of bits and hash functions
    m = int(-expected_elements * math.log(false_positive_rate) / (math.log(2) ** 2))
    k = int((m / expected_elements) * math.log(2))

    # Create a Bloom filter (pybloom_live sizes its bit array internally
    # from the capacity and error rate passed here)
    bloom_filter = BloomFilter(capacity=m, error_rate=false_positive_rate)
    return bloom_filter, k

def main():
    # Define the parameters
    expected_elements = 10000    # The expected number of elements to be inserted
    false_positive_rate = 0.01   # The desired false positive rate (1% in this case)

    # Create a Bloom filter
    bloom_filter, num_hash_functions = create_bloom_filter(expected_elements, false_positive_rate)

    # Add elements to the Bloom filter
    elements_to_add = ["mumbai", "surat", "navi_mumbai", "delhi", "nagpur", "goa", "chennai"]
    for element in elements_to_add:
        bloom_filter.add(element)

    # Check if elements are probably present or definitely not present
    elements_to_check = ["surat", "ahmedabad", "delhi"]
    for element in elements_to_check:
        if element in bloom_filter:
            print(f"'{element}' is probably present (may have false positives)")
        else:
            print(f"'{element}' is definitely not present")

    # Print the number of hash functions used and the configured false positive rate
    print(f"Number of hash functions used: {num_hash_functions}")
    print(f"Actual false positive rate: {bloom_filter.error_rate}")

if __name__ == "__main__":
    main()

4. OUTPUT

5. CONCLUSION
In summary, the Bloom filter experiment highlighted the effectiveness of this space-efficient
data structure for approximate set membership queries. By optimizing parameters based on
expected elements and desired false positive rates, the experiment showcased the Bloom
filter's practical application in scenarios prioritizing memory efficiency. Despite the potential
for false positives, the Bloom filter proved valuable for quickly and space-efficiently
determining probable set membership, making it a versatile tool in various domains.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

(Academic Year 2023-24)

LAB 6

Implement Page Ranking Algorithm

Student Name: Aman Vinay Sharma


Class: B.Tech CSE - DS
Semester: 7

Enrolment Number: A70405220153

Faculty In-charge
{Department of CSE}
ASET, AUM
1. AIM
To implement Page Ranking Algorithm.

2. THEORY
The PageRank algorithm is a pivotal method in web page ranking and information retrieval.
PageRank assigns numerical weights to web pages based on both the quantity and quality of
incoming links. The algorithm iteratively calculates these weights, considering the importance
of the pages linking to a given page. A damping factor introduces the probability of a user
navigating through the graph, ensuring robustness by preventing pages with no outgoing links
from having zero scores. This approach not only provides an effective solution for ranking web
pages but also addresses challenges such as dead ends and spider traps, making PageRank a
foundational concept in the domain of information retrieval and web search.
The iterative nature and matrix-based representation of link structures make PageRank
scalable and adaptable to the dynamic nature of the web. As a result, it has played a crucial
role in shaping search engine algorithms, contributing significantly to the efficiency and
relevance of web search results. PageRank's impact extends beyond its original application,
influencing various fields that leverage network analysis and link-based ranking systems.
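As an optional cross-check on the implementations below (assuming the networkx library is available; it is not used elsewhere in this lab), the same six-node example graph can be scored with a library implementation of PageRank, and the results should broadly agree with the NumPy versions that follow:

import networkx as nx

# Same directed graph as the example below: node i links to each node in its adjacency list
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2), (3, 4), (4, 5), (5, 3)]
G = nx.DiGraph(edges)

# alpha is the damping factor; the result is a dict mapping node -> PageRank score
scores = nx.pagerank(G, alpha=0.85)
print(scores)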

3. SOURCE CODE
1. First Approach with the graph.

import numpy as np

def pagerank(graph, damping_factor=0.85, max_iterations=100, convergence_threshold=1e-4):
    num_nodes = len(graph)
    adjacency_matrix = np.zeros((num_nodes, num_nodes))

    # Construct the column-stochastic adjacency matrix:
    # entry [j, i] is the probability of moving from page i to page j
    for i in range(num_nodes):
        for j in graph[i]:
            adjacency_matrix[j, i] = 1 / len(graph[i])

    # Construct the transition matrix with the damping factor
    transition_matrix = (1 - damping_factor) / num_nodes + damping_factor * adjacency_matrix

    # Initialize the PageRank vector uniformly
    pagerank_vector = np.ones(num_nodes) / num_nodes

    # Iterative PageRank computation
    for _ in range(max_iterations):
        new_pagerank_vector = np.dot(transition_matrix, pagerank_vector)

        # Check for convergence
        if np.linalg.norm(new_pagerank_vector - pagerank_vector, 2) < convergence_threshold:
            pagerank_vector = new_pagerank_vector
            break

        pagerank_vector = new_pagerank_vector

    return pagerank_vector

# Example graph representation: node i links to every node in its list
graph = {
    0: [1, 2],
    1: [2],
    2: [0],
    3: [2, 4],
    4: [5],
    5: [3]
}

# Calculate PageRank
result = pagerank(graph)

# Print the result
print("PageRank:", result)


2. Second Approach with matrix as input.

import numpy as np

def pagerank_matrix(adjacency_matrix, damping_factor=0.85, max_iterations=100, convergence_threshold=1e-4):
    num_nodes = adjacency_matrix.shape[0]

    # Normalize the adjacency matrix so that each column sums to 1
    # (assumes no column sums to zero)
    transition_matrix = adjacency_matrix / np.sum(adjacency_matrix, axis=0, keepdims=True)

    # Initialize the PageRank vector uniformly
    pagerank_vector = np.ones(num_nodes) / num_nodes

    # Iterative PageRank computation
    for _ in range(max_iterations):
        new_pagerank_vector = (1 - damping_factor) / num_nodes + damping_factor * np.dot(transition_matrix, pagerank_vector)

        # Check for convergence
        if np.linalg.norm(new_pagerank_vector - pagerank_vector, 2) < convergence_threshold:
            pagerank_vector = new_pagerank_vector
            break

        pagerank_vector = new_pagerank_vector

    return pagerank_vector

# Example adjacency matrix
adjacency_matrix = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0]
], dtype=float)

# Calculate PageRank for the matrix
result_matrix = pagerank_matrix(adjacency_matrix)

# Print the result
print("PageRank (Matrix):", result_matrix)

4. OUTPUT

5. CONCLUSION
In conclusion, the implementation of the PageRank algorithm using an adjacency matrix
provides a matrix-based perspective on the importance and connectivity of nodes within a
graph. This approach, leveraging linear algebra operations, offers a clear and scalable method
for computing PageRank scores. By converting the original graph representation into an
adjacency matrix, the algorithm effectively captures the relationships between nodes and
their respective weights, yielding insights into the relative significance of each node in the
network. This matrix-based PageRank calculation serves as a versatile and efficient tool for
ranking nodes in various applications, demonstrating the adaptability of the algorithm across
different graph structures.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY

(Academic Year 2023-24)

LAB 7

Implement Apriori Algorithm

Student Name: Aman Vinay Sharma


Class: B.Tech CSE - DS
Semester: 7

Enrolment Number: A70405220153

Faculty In-charge
{Department of CSE}
ASET, AUM
1. AIM
To implement Apriori algorithm.

2. THEORY
The Apriori algorithm is a classic association rule mining algorithm designed for discovering
interesting relationships or patterns within large datasets. The primary objective of the Apriori
algorithm is to identify frequent itemsets in a transactional database, where an itemset is a
collection of items that frequently appear together.
The algorithm operates based on the "apriori" property, which states that if an itemset is
frequent, then all of its subsets must also be frequent. Apriori uses a level-wise search strategy
to discover frequent itemsets of increasing sizes. It begins by identifying frequent individual
items (singletons) and iteratively extends to larger itemsets until no more frequent itemsets
can be found. This approach significantly reduces the search space, as only candidate itemsets
that meet minimum support thresholds are considered.
To achieve this, the Apriori algorithm employs a two-step process: (1) candidate generation
and (2) support counting. In the candidate generation step, potential itemsets are generated,
and in the support counting step, the algorithm scans the dataset to identify the frequency of
each candidate. The process continues iteratively, progressively increasing the size of
itemsets until no more frequent itemsets are found.
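To make the level-wise search concrete before the library-based code below, here is a small hand-rolled sketch (an illustration only, with an example minimum support of 0.4 rather than the 0.2 used later) that counts support for single items and then builds candidate pairs only from the frequent singletons, as the apriori property allows:

from itertools import combinations

transactions = [
    {'bread', 'milk', 'eggs'},
    {'bread', 'butter'},
    {'milk', 'butter'},
    {'bread', 'milk', 'butter'},
    {'bread', 'milk'},
]
min_support = 0.4  # an itemset must appear in at least 40% of transactions

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items
items = {item for t in transactions for item in t}
L1 = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

# Level 2: candidate pairs are generated only from frequent single items (apriori property)
candidates = {a | b for a, b in combinations(L1, 2)}
L2 = {c for c in candidates if support(c) >= min_support}

print("Frequent 1-itemsets:", [set(s) for s in L1])
print("Frequent 2-itemsets:", [set(s) for s in L2])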

3. SOURCE CODE
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Example transaction dataset
transactions = [
    ['bread', 'milk', 'eggs'],
    ['bread', 'butter'],
    ['milk', 'butter'],
    ['bread', 'milk', 'butter'],
    ['bread', 'milk'],
]

# Transform the dataset into a one-hot encoded format
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Apply the Apriori algorithm to find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

# Extract association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

# Display frequent itemsets and association rules
print("Frequent Itemsets:")
print(frequent_itemsets)

print("\nAssociation Rules:")
print(rules)

4. OUTPUT
5. CONCLUSION
In conclusion, the Apriori algorithm experiment demonstrated its effectiveness in discovering
frequent itemsets and association rules within a transactional dataset. By systematically
identifying patterns of co-occurring items, Apriori provides valuable insights into relationships
that are potentially significant for various applications, including market basket analysis and
recommendation systems. The algorithm's scalability and simplicity make it a foundational
tool in association rule mining, contributing to the extraction of meaningful associations from
large datasets.
