Big Data Lab
SUBMITTED BY
CERTIFICATE OF SUBMISSION
year 2023-24.
Faculty In-charge
{Department of CSE}
ASET, AUM
INDEX
Sr. No. Description Date Grade
1 Case Study of Big Data
2 Installation of Hadoop
3 HDFS Commands
4 Implement Word Count using MapReduce
5 Implementation of Bloom Filter
6 Implement Page Ranking Algorithm
7 Implement Apriori Algorithm
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY
LAB 1
Case Study of Big Data
1. AIM
To carry out a case study related to Big Data.
2. INTRODUCTION
Big Data refers to exceptionally large and complex datasets that exceed the capabilities of
traditional data processing tools and methodologies. At the forefront of contemporary
information management, Big Data emerges as a paradigmatic challenge and an unparalleled
opportunity within the realm of extensive datasets. In the realm of Big Data, the original three
Vs—Volume, Velocity, and Variety—have evolved into a more comprehensive set known as
the 6 Vs, capturing the complexity and challenges associated with massive datasets. The six Vs are:
1. Volume: Refers to the sheer scale and size of the data generated, processed, and stored.
It emphasizes the vastness of information that surpasses the capacities of traditional data
management systems.
2. Velocity: Encompasses the speed at which data is generated, processed, and analysed. In
the context of real-time analytics, the velocity of data is crucial for making timely decisions
and extracting actionable insights.
3. Variety: Represents the diverse types and formats of data, including structured and
unstructured data such as text, images, videos, and sensor-generated data. Managing this
diversity requires advanced analytical techniques and tools.
4. Veracity: Refers to the reliability and accuracy of the data. As Big Data often involves data
from various sources, ensuring the trustworthiness of the information becomes a critical
aspect of analysis and decision-making.
5. Value: Stresses the importance of deriving meaningful insights and value from the vast
datasets. The ultimate goal of Big Data is to extract actionable information that can
contribute to business growth, innovation, and informed decision-making.
6. Variability: Addresses the inconsistency or fluctuation in data flow. Big Data sources may
exhibit variability in terms of data arrival times, making it essential for systems to adapt
to these fluctuations in real-time processing.
3. CASE STUDIES
Apart from its proficiency in demand forecasting, Amazon leverages Big Data to craft
personalized customer experiences. The platform scrutinizes user behavior, preferences, and
historical purchases to deliver tailored product recommendations. This personalized strategy
not only elevates customer satisfaction but also boosts the chances of cross-selling and
upselling. Through the utilization of Big Data analytics, Amazon consistently establishes itself
as a benchmark for precision in e-commerce operations, showcasing the transformative
influence of data-driven decision-making in optimizing both customer experience and
operational efficiency.
Starbucks leverages its innovative Digital Flywheel strategy to gain valuable insights by
seamlessly integrating digital and in-person customer experiences, enhancing
personalization, ordering processes, rewards, and transactions. Through the collection of
customer data via the app, a combination of AI and big data enables Starbucks to offer special
deals tailored to individual preferences, increasing the likelihood of customers making repeat
purchases. By identifying trends through the analysis of millions of customer behaviors,
Starbucks can make informed decisions about the success of menu items.
To sustain growth and adapt to changing consumer preferences, Starbucks aims to keep
customers engaged by encouraging them to explore new products. Similar to the data-driven
approach of Netflix with viewer preferences, the Starbucks app recommends related drinks
based on individual tastes. These suggestions consider location-specific availability and even
adapt to weather conditions, suggesting hot drinks on colder days. The more diverse the
menu engagement, the greater the customer loyalty.
In a broader context, Starbucks utilizes the gathered personal data to shape social media
campaigns and align the brand's objectives with initiatives that contribute to sustained
profitability. Starbucks' Chief Strategy Officer, Matt Ryan, highlights the ongoing
modernization of their technology stack, emphasizing the replacement of legacy rewards and
ordering functionalities with a scalable cloud-based platform. This platform enhances
customer data organization and integrates more seamlessly with store-based operating
systems, encompassing inventory and production management.
Industry: Food and beverage
Big data product: Starbucks rewards-connected app
Outcomes:
Personalized customer rewards offers
In spite of its achievements, DMart encounters several obstacles in leveraging big data. One
key challenge involves managing concerns surrounding data privacy, requiring the company
to responsibly handle the extensive customer data it collects to safeguard user privacy.
Additionally, ensuring the accuracy of data is essential for the efficacy of DMart's big data
applications, demanding ongoing efforts to maintain precise and current information. With
the rapid expansion of DMart's customer base, the challenge of data scalability emerges,
necessitating an adaptable and scalable data infrastructure capable of effectively managing
the growing volume of data. In summary, although big data has been instrumental in DMart's
success, addressing these challenges is imperative for sustaining and optimizing its data-
driven initiatives.
4. CONCLUSION
In conclusion, this case study emphasizes the profound impact of big data, which provides
enhanced insights and supports more informed decision-making across various sectors.
Nevertheless, it underscores the imperative of tackling challenges related to security,
infrastructure, and data quality to ensure the seamless integration of big data solutions. The
responsible adoption and management of big data emerge as critical factors for organizations
aiming to excel in this data-driven era.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY
LAB 2
Installation of Hadoop
1. AIM
To install Hadoop on the system.
2. THEORY
Hadoop is a robust and open-source framework designed to handle the challenges posed by
massive volumes of data in the era of big data. Developed by the Apache Software
Foundation, Hadoop provides a scalable and distributed computing environment,
revolutionizing the way organizations store, process, and analyse vast datasets.
At its core, Hadoop comprises two fundamental components: the Hadoop Distributed File
System (HDFS) and the MapReduce programming model. HDFS allows for the distributed
storage of data across multiple nodes, ensuring fault tolerance and high availability. The
MapReduce paradigm facilitates the parallel processing of data by breaking tasks into smaller,
manageable sub-tasks distributed across a cluster of machines.
Hadoop's strength lies in its ability to scale horizontally, accommodating data growth by
adding more commodity hardware to the cluster. This scalability makes it an ideal solution for
businesses dealing with diverse and expansive datasets. Moreover, Hadoop embraces a fault-
tolerant design, ensuring uninterrupted operations even in the face of hardware failures.
The ecosystem around Hadoop has expanded significantly, with additional tools and
frameworks such as Apache Hive, Apache Pig, and Apache Spark complementing its
capabilities. These tools provide higher-level abstractions, making it more accessible for
developers and data analysts to work with big data.
3. INSTALLATION
Prerequisites
● Java 8 runtime environment (JRE)
● Apache Hadoop 3.3.0
Step 1 - Download Hadoop binary package
The first step is to download the Hadoop binaries from the official website. The binary package
size is about 478 MB.
https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Step 2 - Unpack the package
After finishing the file download, unpack the package using 7zip or command line.
cd Downloads
Because Hadoop will be installed in the Hadoop folder of the C drive (C:\Hadoop), we create the
directory:
mkdir C:\Hadoop
Because Java will be installed in the Java folder of the C drive (C:\Java), we create the directory:
mkdir C:\Java
These libraries are not signed, and there is no guarantee that they are 100% safe; we use them
purely for learning and testing purposes.
Download all the files from the following location and save them to the bin folder under the
Hadoop folder. If you have cloned the winutils repository with Git, you can copy them from your terminal:
cd How-to-install-Hadoop-on-Windows\winutils\hadoop-3.3.0-YARN-8246\bin
copy *.* C:\Hadoop\hadoop-3.3.0\bin
Step 5 - Configure environment variables
Now that we've downloaded and unpacked all the artefacts, we need to configure two important
environment variables: JAVA_HOME and HADOOP_HOME.
First, click the Windows button and type "environment" to open the Environment Variables dialog,
then create the two variables pointing to the Java and Hadoop installation folders.
Once we finish setting up the above two environment variables, we need to add the bin
folders to the PATH environment variable by clicking Edit on the Path entry.
If PATH environment exists in your system, you can also manually add the following two paths
to it:
%JAVA_HOME%/bin
%HADOOP_HOME%/bin
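Alternatively, the two variables can be created from a Command Prompt with setx (a sketch; the installation paths below are assumptions and should match where you placed Java and Hadoop):
setx JAVA_HOME "C:\Java\jdk1.8.0_361"
setx HADOOP_HOME "C:\Hadoop\hadoop-3.3.0"
The bin folders still need to be appended to the Path variable through the dialog as described above.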
Verification of Installation
Once you complete the installation, close your terminal window, open a new one, and run the
following command to verify:
java -version
You should see output similar to: java version "1.8.0_361"
Finally, run winutils.exe directly to verify that the above steps completed successfully:
winutils.exe
Step 6 - Configure Hadoop
Now we are ready to configure the most important part - Hadoop configurations which
involves Core, YARN, MapReduce, HDFS configurations.
Configure core site
Edit file core-site.xml in %HADOOP_HOME%\etc\hadoop folder.
For my environment, the actual path is C:\Hadoop\hadoop-3.3.0\etc\hadoop
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>
Note: fs.default.name is the legacy property name; recent Hadoop releases prefer fs.defaultFS, although the old name is still accepted.
Configure HDFS
Edit file hdfs-site.xml in %HADOOP_HOME%\etc\hadoop folder.
Before editing, please create two folders in your system: one for the namenode directory and
another for the datanode directory. For my system, I created the following two sub folders:
mkdir C:\hadoop\hadoop-3.3.0\data\datanode
mkdir C:\hadoop\hadoop-3.3.0\data\namenode
Replace the configuration element with the following (remember to replace the paths with your own):
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop/hadoop-3.3.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop/hadoop-3.3.0/data/datanode</value>
</property>
</configuration>
Configure MapReduce and YARN site
Edit file mapred-site.xml in %HADOOP_HOME%\etc\hadoop folder.
Replace configuration element with the following:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
</property>
</configuration>
Edit file yarn-site.xml in %HADOOP_HOME%\etc\hadoop folder and replace the configuration element with the following (the property names shown are the standard YARN settings that this environment-variable whitelist belongs to):
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
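With the configuration files in place, format the namenode once and then start the HDFS and YARN daemons (a typical sequence, assuming %HADOOP_HOME% is set as described above):
hdfs namenode -format
%HADOOP_HOME%\sbin\start-dfs.cmd
%HADOOP_HOME%\sbin\start-yarn.cmd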
After running start-dfs.cmd, two Command Prompt windows will open: one for the datanode and
another for the namenode.
4. CONCLUSION
Effectively setting up Hadoop on Windows 11 showcases our capability to tailor this robust
big data framework to various environments, thereby expanding our expertise for upcoming
data processing challenges. This hands-on experience provides valuable insights into
overseeing distributed computing solutions across a range of operating systems.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY
LAB 3
HDFS Commands
1. AIM
To study the basic Hadoop (HDFS) commands.
2. THEORY
Hadoop commands form an integral part of harnessing the power of the Hadoop ecosystem,
a distributed computing framework designed to process large-scale data sets. These
commands allow users to interact with the Hadoop Distributed File System (HDFS) and
execute various tasks within the Hadoop environment. From managing files and directories
to initiating MapReduce jobs, the command-line interface provides a versatile toolkit for users
to navigate and manipulate data stored across a cluster of machines. Examples of essential
Hadoop commands include those for uploading and downloading files to and from HDFS,
monitoring job progress, and configuring cluster settings. Mastery of Hadoop commands is
essential for data engineers, analysts, and administrators to efficiently handle and process
massive datasets in a distributed computing environment.
3. HADOOP COMMANDS
Here is a list of several commonly used Hadoop commands:
ls
The "ls" command in Hadoop is employed to display a list of files, while the "lsr" option can
be used for a recursive approach, providing a hierarchical view of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables, so
bin/hdfs means we are invoking the hdfs executable, and dfs selects the Distributed File System command set.
mkdir
To create a directory. In Hadoop dfs there is no home directory by default. So let’s first create
it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be created relative to the
home directory.
touchz
The "touchz" command in Hadoop is used to create an empty file at the specified path in
Hadoop Distributed File System (HDFS).
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
copyFromLocal (or put)
To copy files/folders from the local file system to HDFS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to folder
geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks
cat
To print file contents.
Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present in the geeks folder
bin/hdfs dfs -cat /geeks/AI.txt
copyToLocal (or get)
To copy files/folders from HDFS to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <src(on hdfs)> <local dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero -> copies myfile.txt from the geeks folder
cp
This command is used to copy files within hdfs. Let’s copy folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
mv
This command is used to move files within HDFS. Let's cut and paste the file myfile.txt from the
geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
rmr
This command deletes a file or directory from HDFS recursively. It is a very useful command when
you want to delete a non-empty directory. (In recent Hadoop versions, -rmr is deprecated in favour of -rm -r.)
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then
the directory itself.
du
It will give the size of each file in the directory.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
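Putting several of these commands together, a typical session might look like the following (the paths are illustrative and reuse the examples above):
bin/hdfs dfs -mkdir /geeks
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
bin/hdfs dfs -cat /geeks/AI.txt
bin/hdfs dfs -cp /geeks /geeks_copied
bin/hdfs dfs -du /geeks
bin/hdfs dfs -rmr /geeks_copied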
4. CONCLUSION
In summary, the comprehensive exploration of various Hadoop commands in the laboratory
experiment has significantly augmented our practical knowledge in navigating the Hadoop
Distributed File System (HDFS) and managing diverse data processing tasks in a distributed
computing environment. This hands-on experience equips us with a versatile skill set,
essential for addressing the complexities of big data analytics across diverse operating
scenarios.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY
LAB 4
Implement Word Count using MapReduce
1. AIM
To implement Word Count using MapReduce.
2. THEORY
Wordcount using MapReduce is a classic example and a fundamental demonstration of the
power of distributed computing in big data processing. In this paradigm, the MapReduce
programming model is utilized to efficiently count the occurrences of words in a vast dataset.
The process involves two main stages: the Map stage and the Reduce stage. During the Map
stage, the input data is divided into key-value pairs, where each word is assigned a count of
one. The Mapper tasks then distribute these pairs across different nodes in a cluster,
processing the data in parallel. In the Reduce stage, the system aggregates and consolidates
the intermediate key-value pairs, summing up the counts for each unique word. The Reducer
tasks merge these results to produce the final word count output.
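As a concrete illustration, consider the single input line "hello world hello". The mapper emits the
pairs (hello, 1), (world, 1), (hello, 1); the shuffle and sort phase groups the pairs for each word
together; and the reducer sums the grouped counts to produce the final output (hello, 2) and (world, 1).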
Wordcount using MapReduce serves as a foundational example in the field of distributed
computing, illustrating the efficiency gained through parallel processing and distributed
storage. This approach is scalable, allowing for the processing of massive datasets by
leveraging the capabilities of a distributed computing framework like Apache Hadoop.
Through the distributed and parallelized nature of MapReduce, organizations can efficiently
analyze vast amounts of textual data, making it a crucial technique in various fields such as
natural language processing, information retrieval, and data analytics.
3. PROCEDURE
ls
cd notebooks
Step 5: Type the command nano word_count_data.txt – this will create the file and open it in the
nano text editor.
nano word_count_data.txt
Step 6: Type a few sentences or a paragraph of text into the file.
Step 7: Press ^X (Ctrl + X) to exit, then press Y to save the changes. When prompted for the file
name, press Enter to confirm and return to the terminal.
Step 8: Now you can check the contents of the file using the cat word_count_data.txt
command.
cat word_count_data.txt
Step 9: Type nano mapper.py to create the mapper file; in the editor window, paste the
mapper.py code below.
nano mapper.py
Code:
#!/usr/bin/env python
# import sys because we need to read and write data to STDIN and STDOUT
import sys

# read each input line from STDIN (standard input)
for line in sys.stdin:
    # remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # write each word with a count of 1 to STDOUT (standard output);
    # this output becomes the input for the reducer
    for word in words:
        print('%s\t%s' % (word, 1))
Step 10: Press ^X (Ctrl + X) to exit, then press Y to save. Press Enter when prompted for the file
name to return to the terminal.
Step 11: Type nano reducer.py to create the reducer file; in the editor window, paste the
reducer.py code below.
nano reducer.py
Code:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
# read the sorted mapper output ("word<TAB>count") from STDIN
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        # a new word has started, so emit the previous word and its total
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = count
# emit the last word after the loop ends
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))
Step 12: Press ^X (Ctrl + X) to exit, then press Y to save. Press Enter when prompted for the file
name to return to the terminal.
Step 13: Run the command cat word_count_data.txt | python mapper.py to see the mapper
output.
Step 14: Command cat word_count_data.txt | python mapper.py | sort -k1,1 – to see the
sorted output.
cat word_count_data.txt | python mapper.py | sort -k1,1
Step 15: Command cat word_count_data.txt | python mapper.py | sort -k1,1 | python
reducer.py – to see the final word count output.
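To run the same job on the Hadoop cluster rather than through a local pipe, the scripts and input file can be submitted via the Hadoop Streaming jar. The commands below are a sketch: the jar path and the HDFS input/output paths are assumptions and depend on your installation.
hdfs dfs -put word_count_data.txt /word_count_data.txt
hadoop jar %HADOOP_HOME%\share\hadoop\tools\lib\hadoop-streaming-3.3.0.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /word_count_data.txt -output /word_count_output
hdfs dfs -cat /word_count_output/part-00000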
4. OUTPUT
5. CONCLUSION
In conclusion, Wordcount using MapReduce exemplifies the transformative potential of
distributed computing in handling extensive datasets. By employing the MapReduce
programming model, this approach efficiently processes and analyzes large volumes of textual
data, showcasing the benefits of parallelized computation and distributed storage. As a
fundamental example, it underscores the scalability and applicability of such techniques in
diverse fields, marking a pivotal contribution to the realm of big data analytics and distributed
computing.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY
LAB 5
Implementation of Bloom Filter
1. AIM
To implement Bloom Filter.
2. THEORY
A Bloom filter is a space-efficient probabilistic data structure designed for quickly testing
whether an element is a member of a set. The key characteristic of a Bloom filter is its ability
to provide fast membership queries while using a relatively small amount of memory
compared to other data structures.
The Bloom filter works by employing a bit array of a fixed size and a set of hash functions.
Initially, all bits in the array are set to zero. When an element is added to the set, it undergoes
multiple hash functions, and the corresponding bits in the array are set to one. To check for
membership, the same hash functions are applied to the query element, and if all the
corresponding bits are set, the element is deemed a probable member of the set. However,
false positives are possible, as multiple elements might hash to the same set of bits.
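As a sizing guideline, for a filter of m bits holding n elements the optimal number of hash functions
is about k = (m/n) * ln 2, and achieving a target false-positive rate p requires roughly
m = -(n * ln p) / (ln 2)^2 bits. These standard formulas underlie how Bloom filter libraries,
including the pybloom_live package used below, size the bit array and choose the hash count from
an expected element count and a desired error rate.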
Common applications include spell checkers, network routers, and distributed systems,
where they are employed to reduce the need for costly disk or network lookups.
3. SOURCE CODE
from pybloom_live import BloomFilter
import math

def main():
    # Define the parameters
    expected_elements = 10000    # The expected number of elements to be inserted
    false_positive_rate = 0.01   # The desired false positive rate (1% in this case)
    # Create the Bloom filter with the chosen capacity and error rate
    bloom_filter = BloomFilter(capacity=expected_elements, error_rate=false_positive_rate)

    # Add some elements to the filter (the inserted set here is illustrative)
    for city in ["surat", "ahmedabad", "mumbai"]:
        bloom_filter.add(city)

    # Check if elements are probably present or definitely not present
    elements_to_check = ["surat", "ahmedabad", "delhi"]
    for element in elements_to_check:
        if element in bloom_filter:
            print(f"'{element}' is probably present (may have false positives)")
        else:
            print(f"'{element}' is definitely not present")

    # Estimate the number of hash functions from the error rate: k = log2(1/p)
    num_hash_functions = math.ceil(math.log2(1 / false_positive_rate))
    # Print the number of hash functions used and the actual false positive rate
    print(f"Number of hash functions used: {num_hash_functions}")
    print(f"Actual false positive rate: {bloom_filter.error_rate}")

if __name__ == "__main__":
    main()
4. OUTPUT
5. CONCLUSION
In summary, the Bloom filter experiment highlighted the effectiveness of this space-efficient
data structure for approximate set membership queries. By optimizing parameters based on
expected elements and desired false positive rates, the experiment showcased the Bloom
filter's practical application in scenarios prioritizing memory efficiency. Despite the potential
for false positives, the Bloom filter proved valuable for quickly and space-efficiently
determining probable set membership, making it a versatile tool in various domains.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY
LAB 6
Implement Page Ranking Algorithm
1. AIM
To implement Page Ranking Algorithm.
2. THEORY
The PageRank algorithm is a pivotal method in web page ranking and information retrieval.
PageRank assigns numerical weights to web pages based on both the quantity and quality of
incoming links. The algorithm iteratively calculates these weights, considering the importance
of the pages linking to a given page. A damping factor introduces the probability of a user
navigating through the graph, ensuring robustness by preventing pages with no outgoing links
from having zero scores. This approach not only provides an effective solution for ranking web
pages but also addresses challenges such as dead ends and spider traps, making PageRank a
foundational concept in the domain of information retrieval and web search.
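Formally, the rank of a page p can be written as PR(p) = (1 - d)/N + d * Σ_q PR(q)/L(q), where the
sum runs over all pages q that link to p, L(q) is the number of outgoing links of q, N is the total
number of pages, and d is the damping factor (commonly set to 0.85). The scores are recomputed
from this formula on every iteration until they converge.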
The iterative nature and matrix-based representation of link structures make PageRank
scalable and adaptable to the dynamic nature of the web. As a result, it has played a crucial
role in shaping search engine algorithms, contributing significantly to the efficiency and
relevance of web search results. PageRank's impact extends beyond its original application,
influencing various fields that leverage network analysis and link-based ranking systems.
3. SOURCE CODE
1. First Approach with the graph.
import numpy as np

# the damping factor of 0.85 and the iteration/tolerance limits below are typical default values
def pagerank(graph, damping_factor=0.85, max_iterations=100, convergence_threshold=1e-6):
    # Build a column-stochastic transition matrix from the adjacency list
    num_nodes = len(graph)
    adjacency_matrix = np.zeros((num_nodes, num_nodes))
    for i in range(num_nodes):
        for j in graph[i]:
            adjacency_matrix[j, i] = 1 / len(graph[i])
    # Start from a uniform rank vector and iterate until convergence
    pagerank_vector = np.ones(num_nodes) / num_nodes
    for _ in range(max_iterations):
        new_pagerank_vector = (1 - damping_factor) / num_nodes \
            + damping_factor * adjacency_matrix @ pagerank_vector
        if np.linalg.norm(new_pagerank_vector - pagerank_vector, 2) < convergence_threshold:
            break
        pagerank_vector = new_pagerank_vector
    return pagerank_vector

graph = {
    0: [1, 2],
    1: [2],
    2: [0],
    3: [2, 4],
    4: [5],
    5: [3]
}
print(pagerank(graph))

2. Second Approach with the adjacency matrix.

def pagerank_matrix(adjacency_matrix, damping_factor=0.85, max_iterations=100, convergence_threshold=1e-6):
    # Normalise each row by its out-degree, then transpose to get a column-stochastic matrix
    num_nodes = adjacency_matrix.shape[0]
    out_degree = adjacency_matrix.sum(axis=1)
    transition_matrix = (adjacency_matrix / out_degree[:, None]).T
    pagerank_vector = np.ones(num_nodes) / num_nodes
    for _ in range(max_iterations):
        new_pagerank_vector = (1 - damping_factor) / num_nodes \
            + damping_factor * transition_matrix @ pagerank_vector
        if np.linalg.norm(new_pagerank_vector - pagerank_vector, 2) < convergence_threshold:
            break
        pagerank_vector = new_pagerank_vector
    return pagerank_vector

adjacency_matrix = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0]
], dtype=float)
result_matrix = pagerank_matrix(adjacency_matrix)
print(result_matrix)
4. OUTPUT
5. CONCLUSION
In conclusion, the implementation of the PageRank algorithm using an adjacency matrix
provides a matrix-based perspective on the importance and connectivity of nodes within a
graph. This approach, leveraging linear algebra operations, offers a clear and scalable method
for computing PageRank scores. By converting the original graph representation into an
adjacency matrix, the algorithm effectively captures the relationships between nodes and
their respective weights, yielding insights into the relative significance of each node in the
network. This matrix-based PageRank calculation serves as a versatile and efficient tool for
ranking nodes in various applications, demonstrating the adaptability of the algorithm across
different graph structures.
AMITY SCHOOL OF ENGINEERING & TECHNOLOGY
LAB 7
Implement Apriori Algorithm
1. AIM
To implement the Apriori algorithm.
2. THEORY
The Apriori algorithm is a classic association rule mining algorithm designed for discovering
interesting relationships or patterns within large datasets. The primary objective of the Apriori
algorithm is to identify frequent itemsets in a transactional database, where an itemset is a
collection of items that frequently appear together.
The algorithm operates based on the "apriori" property, which states that if an itemset is
frequent, then all of its subsets must also be frequent. Apriori uses a level-wise search strategy
to discover frequent itemsets of increasing sizes. It begins by identifying frequent individual
items (singletons) and iteratively extends to larger itemsets until no more frequent itemsets
can be found. This approach significantly reduces the search space, as only candidate itemsets
that meet minimum support thresholds are considered.
To achieve this, the Apriori algorithm employs a two-step process: (1) candidate generation
and (2) support counting. In the candidate generation step, potential itemsets are generated,
and in the support counting step, the algorithm scans the dataset to identify the frequency of
each candidate. The process continues iteratively, progressively increasing the size of
itemsets until no more frequent itemsets are found.
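Two measures drive this process: the support of an itemset X is the fraction of transactions
containing X, support(X) = count(X)/N, and the confidence of a rule X -> Y is
confidence(X -> Y) = support(X ∪ Y)/support(X). Candidate itemsets are retained only when their
support meets the chosen minimum threshold, and rules are reported only when their confidence
meets a minimum confidence threshold.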
3. SOURCE CODE
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ['bread', 'butter'],
    ['milk', 'butter'],
    ['bread', 'milk'],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Mine frequent itemsets with a minimum support of 0.2
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
print("Frequent Itemsets:")
print(frequent_itemsets)

# Derive association rules (the 0.5 confidence threshold is an assumed value)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print("\nAssociation Rules:")
print(rules)
4. OUTPUT
5. CONCLUSION
In conclusion, the Apriori algorithm experiment demonstrated its effectiveness in discovering
frequent itemsets and association rules within a transactional dataset. By systematically
identifying patterns of co-occurring items, Apriori provides valuable insights into relationships
that are potentially significant for various applications, including market basket analysis and
recommendation systems. The algorithm's scalability and simplicity make it a foundational
tool in association rule mining, contributing to the extraction of meaningful associations from
large datasets.